Efficient Throughput Cores for Asymmetric Manycore Processors

1
Efficient Throughput Cores for Asymmetric
Manycore Processors
  • David Tarjan

2
Outline
  • Motivation
  • Early Work on Lightweight OOO and Federation (presented
    in Proposal, 1 slide each recap, DAC 2008)
  • Diverge on Miss (submitted to SC 2009)
  • Sharing Tracker (submitted to MICRO 2009)

3
What's going on here?
From "The Landscape of Parallel Computing
Research: A View from Berkeley"
4
We ran into trouble!
From Avi Mendelson's lecture slides:
http://www.cs.technion.ac.il/mendlson/Lecture1.ppt
5
Architecture Reaction
AMD Phenom, from amd.com
From Jon Stokes, "Clearing up the confusion over
Intel's Larrabee, part II," at arstechnica.com
11
Small incremental benefit for few threads
12
Want faster small cores for few threads case
14
SIMD: Efficient and Fast
From Jon Stokes, "Clearing up the confusion over
Intel's Larrabee, part II," at arstechnica.com
15
Challenges
  • Preserve single-thread performance even under
    strong area and power constraints
  • Need performance to be scalable from 1 to N
    threads
  • Widest possible benefit of SIMD units
  • Relatively small caches need to provide good
    hit rates

16
Outline
  • Motivation
  • Early Work on Lightweight OOO and Federation (presented
    in Proposal, 1 slide each recap)
  • Diverge on Miss
  • Sharing Tracker

17
Lightweight OOO
  • Most OOO structures are overdesigned
  • Redesign the Issue Queue to optimize for the common
    case, which is approximately one consumer
  • Remove the Load-Store Queue: since actual forwarding
    is rare, replace it with a structure (the Memory Alias
    Table) that only detects memory order violations

18
Federation
  • Different workloads have different numbers of
    active threads
  • Want CMP to adapt hardware at runtime
  • Combine two cores dynamically
  • Register files of multithreaded cores can be
    re-used for active list/ROB
  • Can be used both for 1-way and 2-way cores

19
Outline
  • Motivation
  • Early Work on Lightweight OOO and Federation (presented
    in Proposal, 1 slide each recap)
  • Diverge on Miss
  • Sharing Tracker

20
Related Work
  • Diverge on Miss: Dynamic Warp Formation by Fung
    et al.

21
Divergent Memory Access
Trees, hash tables, spatial data structures, N-body
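A rough way to see why such data structures hurt SIMD cores is to count how many distinct cache lines one SIMD load touches. The sketch below is only illustrative: the 64-byte line size, the function name `lines_touched`, and the example addresses are assumptions, not values from the slides.

```python
def lines_touched(addresses, line_bytes=64):
    """Return the number of distinct cache lines covered by per-lane byte addresses."""
    return len({addr // line_bytes for addr in addresses})

# Regular access: 32 lanes read consecutive 4-byte words.
coalesced = [4 * lane for lane in range(32)]
# Divergent access: each lane follows its own pointer (hypothetical addresses).
scattered = [4096 * lane + 8 for lane in range(32)]

print(lines_touched(coalesced))  # 2
print(lines_touched(scattered))  # 32
```

With pointer-chasing structures, a single load can therefore turn into as many cache accesses as there are lanes, and each of them may miss independently.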
22
Memory Coalescing Buffer
23
Diverge on Miss
24
Diverge on Miss: Miss Patterns
  • Misses are semi-random, meaning that all threads
    miss a number of times, but in different
    iterations
  • Trailing threads catch up when leading threads
    miss cache

25
Diverge on Miss
if (foo(threadID))
    do A
more code here
26
Diverge on Miss
while (work)
    d = data[calc_index(i)]
    acc += f(d)
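The effect of letting threads slip apart on a miss can be illustrated with a toy cycle model of a loop like the one on this slide. This is a sketch under strong assumptions, not the simulator used in the talk: each iteration takes one issue cycle, a miss adds a fixed latency, and the diverging core always issues the ready threads that are furthest behind. The function name `simulate` and the miss pattern are hypothetical.

```python
def simulate(miss, latency, diverge):
    """Return the cycle at which a SIMD group finishes its loop.

    miss[t][i] is True if thread t misses the cache in iteration i.
    """
    threads, iters = len(miss), len(miss[0])
    it = [0] * threads      # next iteration each thread will execute
    ready = [0] * threads   # first cycle at which each thread may issue again
    cycle = 0
    while min(it) < iters:
        live = [t for t in range(threads) if it[t] < iters]
        can_issue = [t for t in live if ready[t] <= cycle]
        if diverge:
            # Issue the ready threads that are furthest behind, so trailing
            # threads catch up while leading threads wait on their misses.
            oldest = min((it[t] for t in can_issue), default=None)
            group = [t for t in can_issue if it[t] == oldest]
        else:
            # Lockstep: the whole group waits until every thread is ready.
            group = live if len(can_issue) == len(live) else []
        for t in group:
            ready[t] = cycle + 1 + (latency if miss[t][it[t]] else 0)
            it[t] += 1
        cycle += 1
    return max(ready)

# Each thread misses once, but in a different iteration.
pattern = [[True, False], [False, True]]
print(simulate(pattern, latency=10, diverge=False))  # 22
print(simulate(pattern, latency=10, diverge=True))   # 12
```

In lockstep the two misses are serialized, while with divergence the second thread runs ahead and overlaps its miss with the first thread's miss.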
27
Simulation Assumptions
  • 32-wide SIMD cores
  • 32 cores
  • 256 GB/sec off-chip bandwidth
  • 32KB L1 caches per core

28
Workload
  • Ray Tracing
  • Molecular Dynamics (2 kernels)
  • DNA Sequence Alignment
  • K-means
  • Gaussian Filter

30
Speedup with Diverge on Miss
  • 30% higher performance with Diverge on Miss
  • Max performance at ¼ the register file size

31
With 256KB L2
32
Speedup with L2
  • 20% higher performance with Diverge on Miss
  • Need fewer warps for good performance

33
Outline
  • Motivation
  • Early Work on Lightweight OOO and Federation (presented
    in Proposal, 1 slide each recap)
  • Diverge on Miss
  • Sharing Tracker

34
Sharing Tracker
  • Cache coherence is not supported on most manycore
    processors
  • Coherence is useful for sharing cache lines
  • But it has too much overhead

35
Related Work
  • Sharing Tracker: Coherence Engine, Distributed
    Coherence Engine

36
How to share cache lines?
Sharing Tracker

37
Sharing Tracker
1. Cache miss -> query ST
2. ST hit -> forward to cache
2a. Request from ST -> check cache
3. Cache hit -> forward cache line
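The lookup flow on this slide can be sketched as a small software model. It is a simplification with several assumptions that the slides do not state: the tracker keeps one possible holder per line, it is updated to the most recent requester, evictions are not modeled, and all names (`SharingTrackerModel`, `load`, `memory_reads`) are hypothetical.

```python
class SharingTrackerModel:
    def __init__(self, num_cores):
        self.caches = [dict() for _ in range(num_cores)]  # line address -> data
        self.tracker = {}        # line address -> core that may hold the line
        self.memory_reads = 0    # off-chip accesses

    def load(self, core, line, memory):
        cache = self.caches[core]
        if line in cache:
            return cache[line], "local hit"
        # 1. Cache miss -> query the sharing tracker (ST).
        holder = self.tracker.get(line)
        # 2 / 2a. ST hit -> forward the request to that cache, which checks its tags.
        if holder is not None and line in self.caches[holder]:
            # 3. Cache hit -> the remote cache forwards the line.
            data, source = self.caches[holder][line], "forwarded"
        else:
            data, source = memory[line], "memory"
            self.memory_reads += 1
        cache[line] = data
        self.tracker[line] = core  # assumption: remember the most recent holder
        return data, source

memory = {0x40: "A"}
model = SharingTrackerModel(num_cores=2)
print(model.load(0, 0x40, memory))  # ('A', 'memory')
print(model.load(1, 0x40, memory))  # ('A', 'forwarded')
print(model.memory_reads)           # 1
```

The second core's miss is served on chip, which is the off-chip traffic saving the Sharing Tracker is aiming for.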
38
Sharing Tracker
Sharing Tracker requires 3x fewer bits
39
Core Areas
  • Per-core L2 makes core 40-60% bigger
  • Can Sharing Tracker eliminate need for L2s?

40
Sharing Tracker Performance
41
Sharing Tracker With L2: Performance
42
Sharing Tracker With L2: Bandwidth
43
Sharing Tracker Without L2: Performance
44
Sharing Tracker Without L2: Bandwidth
45
Sharing Tracker Perf/Area
  • 28% higher Perf/Area

46
Conclusion
  • Low-power OOO is feasible
  • Throughput oriented cores can give good
    single-thread performance
  • Memory divergent code can make good use of SIMD
  • Small, non-coherent caches for SIMD cores can be
    augmented to provide good performance

47
List of Publications (1)
  • Dissertation Work
      • "Federation: Repurposing Scalar Cores for Out-of-Order
        Instruction Issue," D. Tarjan, M. Boyer, and K. Skadron
        (DAC 2008)
      • "Federation: Very Low Overhead Out-of-Order Execution,"
        D. Tarjan, M. Boyer, and K. Skadron (accepted to TACO
        pending major revisions)
      • "Increasing Memory Miss Tolerance for SIMD Cores,"
        D. Tarjan and K. Skadron (submitted to SC 2009)
      • "Adapting Partial Cache Coherence Hardware to Reduce
        Off-Chip Memory Traffic with Non-Coherent Caches in GPUs,"
        D. Tarjan and K. Skadron (submitted to MICRO 2009)
  • CUDA
      • "Accelerating Leukocyte Tracking using CUDA: A Case Study
        in Leveraging Manycore Coprocessors," M. Boyer, D. Tarjan,
        S. Acton, and K. Skadron (IPDPS 2009)
      • "A Performance Study of General-Purpose Applications on
        Graphics Processors using CUDA," S. Che, M. Boyer, J. Meng,
        D. Tarjan, J. W. Sheaffer, and K. Skadron (JPDC 2008)
  • Parameter Variation
      • "The Impact of Systematic Process Variations on Symmetrical
        Performance in Chip Multi-processors," E. Humenay,
        D. Tarjan, and K. Skadron (DATE 2007)
      • "Impact of Parameter Variations on Multi-Core Chips,"
        E. Humenay, D. Tarjan, and K. Skadron (WCED 2006)

48
List of Publications (2)
  • Branch Prediction
      • "Merging path and gshare indexing in perceptron branch
        prediction," D. Tarjan and K. Skadron (TACO 2005)
      • "An Ahead Pipelined Alloyed Perceptron with Single Cycle
        Access Time," D. Tarjan and K. Skadron (WCED 2004)
  • HotSpot
      • "Temperature-Aware Computer Systems: Modeling and
        Implementation," K. Skadron, M.R. Stan, W. Huang,
        S. Velusamy, K. Sankaranarayanan, and D. Tarjan (TACO 2004)
      • "Temperature-Aware Computer Systems: Opportunities and
        Challenges," K. Skadron, M.R. Stan, W. Huang, S. Velusamy,
        K. Sankaranarayanan, and D. Tarjan (IEEE MICRO 2003)
      • "Temperature-Aware Microarchitecture," K. Skadron,
        M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and
        D. Tarjan (ISCA 2003, Best Student Paper)

49
Q&A
50
Diverge on Miss Slip Control
  • Limit how far ahead/behind threads can be
  • Increase slip if latency limited
  • Decrease slip if bandwidth or ALU limited
  • Simple linear increase or decrease
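The slip controller described above could look roughly like the following sketch. The step size, the slip bounds, the string labels for the limiting resource, and both function names are assumptions chosen for illustration; the slides only state that the adjustment is linear.

```python
def adjust_slip(slip, limiter, step=1, min_slip=0, max_slip=16):
    """Linearly adjust the allowed slip; limiter is 'latency', 'bandwidth', or 'alu'."""
    if limiter == "latency":
        slip += step   # more slip hides more miss latency
    elif limiter in ("bandwidth", "alu"):
        slip -= step   # less slip keeps threads together
    return max(min_slip, min(max_slip, slip))

def within_slip(lead_iteration, trail_iteration, slip):
    """True if the leading thread may keep running ahead of the trailing one."""
    return lead_iteration - trail_iteration <= slip

print(adjust_slip(4, "latency"))    # 5
print(adjust_slip(4, "bandwidth"))  # 3
print(within_slip(12, 8, slip=3))   # False
```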

51
Warp Splitting
  • Started to work on control-flow divergence
  • Memory divergence turned out to be a bigger
    problem

52
FilterCache
  • Benefits turned out to be small
  • Small caches and large working sets lead to large
    churn and little reuse

53
Warp Splitting
  • Warp = SIMD group
  • A branch that depends on a per-lane data value can
    lead to divergence
  • Split Warps on divergent branches
  • Execute split warp during free time slots
  • Cuts serialization latency of divergence
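Splitting a warp at a divergent branch amounts to partitioning its lanes by the branch outcome. The sketch below shows only that partitioning step; scheduling the two halves into free time slots is not modeled, and the function name `split_warp` and the lane values are hypothetical.

```python
def split_warp(lane_values, predicate):
    """Partition lane indices by a per-lane branch condition."""
    taken = [lane for lane, v in enumerate(lane_values) if predicate(v)]
    not_taken = [lane for lane, v in enumerate(lane_values) if not predicate(v)]
    return taken, not_taken

# A "branch if less than 10" evaluated on four lanes.
taken, not_taken = split_warp([0, 38, 8, 12], lambda v: v < 10)
print(taken)      # [0, 2]
print(not_taken)  # [1, 3]
```

Each resulting lane group can then be issued as its own, narrower warp instead of executing both paths back to back with half the lanes masked off.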

54
Diverge on Miss
while (work)
    if ()
        do A
    else
        do B
55
FilterCache
[Figure: FilterCache diagram with labels SIMD, scalar, Data, Tag]
56
FilterCache
  • Detect SIMD stride and reuse patterns
  • Probabilistic LRU update based on number of lanes
    accessing cache line
  • Eviction cost based on SIMD miss parallelism
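One possible reading of the probabilistic LRU update is sketched below. The slides do not give the probability function, so promoting a line with probability lanes / SIMD width is an assumption, as are the function name `maybe_promote` and the list-based LRU order.

```python
import random

def maybe_promote(lru_order, line, lanes, simd_width=32, rng=random):
    """Move `line` to the most-recently-used end with probability lanes / simd_width.

    lru_order is a list ordered from least to most recently used.
    """
    if rng.random() < lanes / simd_width:
        lru_order.remove(line)
        lru_order.append(line)
        return True
    return False

order = ["a", "b", "c"]
maybe_promote(order, "a", lanes=32)  # all lanes touch the line: always promoted
print(order)                         # ['b', 'c', 'a']
maybe_promote(order, "b", lanes=0)   # no lanes: never promoted
print(order)                         # ['b', 'c', 'a']
```

Lines touched by only a few lanes are thus less likely to displace lines that many lanes share.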

57
Area Overhead
58
Simulation Methodology
  • Simulator
      • SimpleScalar 3.0
      • Wattch
  • Workloads
      • SPEC2000 benchmarks
      • SimPoint used to select 100 million instructions

59
Adaptive Pipeline
In-order
Out-of-order
60
Performance Impact
61
Performance Results
62
Energy Efficiency Results
63
Energy-Area Efficiency Results
64
Lightweight Out-of-Order Core
  • CMPs primarily limited by power
  • Secondary limit is area
  • High-performance cores spend most of their power
    in OOO structures
  • Data from 130nm AMD Opteron

65
LSQ
From Mesa-Martinez et al., "Power Model
Validation Through Thermal Measurements," ISCA
2007
66
Subscription-Based Issue Queue
  • Consumers subscribe to their producers at rename
  • Producers set ready bits of their consumers at
    execute
  • Fixed number of subscriber slots can cause stalls
  • Select uses static priority encoder based on IQ
    position rather than oldest-first
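The subscription scheme above can be sketched in software as follows. This is an illustrative model, not the hardware design: entries are identified by their queue position, the per-consumer ready bits are represented as a set of producers still outstanding, and the class and method names are hypothetical.

```python
class IssueQueue:
    def __init__(self, size, slots=1):
        self.slots = slots                        # subscriber slots per producer
        self.waiting = [None] * size              # None = empty entry, else set of
                                                  # producers that have not executed
        self.subscribers = [[] for _ in range(size)]

    def rename(self, entry, producers):
        """Insert an instruction; return False (stall) if a producer has no free slot."""
        if any(len(self.subscribers[p]) >= self.slots for p in producers):
            return False
        for p in producers:
            self.subscribers[p].append(entry)     # consumer subscribes at rename
        self.waiting[entry] = set(producers)
        return True

    def select(self):
        """Static priority by queue position: the lowest-numbered ready entry wins."""
        for entry, waiting in enumerate(self.waiting):
            if waiting is not None and not waiting:
                return entry
        return None

    def execute(self, entry):
        """The producer marks its operand ready in every subscribed consumer."""
        for consumer in self.subscribers[entry]:
            self.waiting[consumer].discard(entry)
        self.subscribers[entry] = []
        self.waiting[entry] = None

iq = IssueQueue(size=4, slots=1)
iq.rename(0, [])          # no source operands outstanding
iq.rename(1, [0])         # subscribes to entry 0
print(iq.rename(2, [0]))  # False: entry 0 has no free subscriber slot
print(iq.select())        # 0
iq.execute(0)
print(iq.select())        # 1
```

With one slot per producer, the common single-consumer case needs no broadcast, and a second consumer simply stalls at rename.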

67
Issue Queue Example

[Figure: issue queue example with entries 1 to 3, subscriber slots (IQ2, IQ3), and ready bits]
68
Simplified Load-Store Queue
  • Memory Alias Table (MAT)
  • Address-based hash table (Counting Bloom Filter)
  • No store forwarding
  • No conservative waiting on stores
  • Only detect memory order violations after they
    have occurred and flush the pipeline when the
    offending instruction commits

69
MAT Example
st 0x13, r5
ld r1, 0x13
70
MAT Example
st 0x13, r5
ld r1, 0x13
EXE
ld executes and increments counter
71
MAT Example
st 0x13, r5
COM
ld r1, 0x13
st commits and sets flag
72
MAT Example
ld r1, 0x13
COM
Flush
ld commits, sees flag, and flushes pipeline
73
MAT Example
ld r1, 0x13
MAT is reset and execution resumes
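The MAT example on the preceding slides can be traced with a small model. It is a sketch under assumptions: a single hash function (address modulo table size), a counter decrement when a load commits without a violation, one global flag, and a full reset on flush. The class and method names are hypothetical.

```python
class MemoryAliasTable:
    def __init__(self, entries=64):
        self.counters = [0] * entries  # counting Bloom filter, one hash function
        self.flag = False              # set when a store detects a possible violation

    def _index(self, address):
        return address % len(self.counters)

    def load_executes(self, address):
        self.counters[self._index(address)] += 1

    def store_commits(self, address):
        # A nonzero counter means a load to an aliasing address already executed.
        if self.counters[self._index(address)] > 0:
            self.flag = True

    def load_commits(self, address):
        """Return True if the pipeline must be flushed."""
        if self.flag:
            self.counters = [0] * len(self.counters)  # MAT is reset
            self.flag = False
            return True
        self.counters[self._index(address)] -= 1
        return False

mat = MemoryAliasTable()
mat.load_executes(0x13)        # ld executes early and increments the counter
mat.store_commits(0x13)        # older st commits, sees the counter, sets the flag
print(mat.load_commits(0x13))  # True: ld commits, sees the flag, pipeline flushes
print(sum(mat.counters))       # 0
```

Because the table is hashed, two different addresses can alias and cause an unnecessary flush, but a real ordering violation is never missed in this model.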
74
Federation
  • Best architecture depends on workload

Lots of parallelism
Limited parallelism
How do we choose at design time without knowing
the workload characteristics?
75
Basic Idea
  • Allow the architecture to adapt at runtime

76
Key Insights
  • Large, multi-threaded register files can be
    repurposed to support out-of-order execution
  • If cores are small, single-cycle communication
    between neighbors is feasible
  • Leverage work on Lightweight core

77
Adaptive Pipeline (2)
[Figure: two in-order cores (In-Order Core 1 and In-Order Core 2), each drawn with blocks labeled I, D, Fetch, DEC, RF, and EXE]
79
Technology Trends
  • Device Density still improving
  • Active and Idle Power are not
  • Vth not scaling -> Vdd not scaling
  • Frequency not scaling due to multiple factors

80
But not everybody will profit!
  • Irregular parallelism
  • Irregular control flow
  • Few threads

From Drake et al., "MPEG-2 Decoding in a Stream
Programming Language," IPDPS 2006
81
Divergent Memory Access
82
[Figure: SIMD example. Labels: SIMD, R2; per-lane values 0, 38, 8, 12, 10, 20, 2, 1, 4, 5, 6, 9, 47, 26, 2, 5; instructions: add r2, r5, r7; ld r1, r2, 4; blt r2, 10]
83
SIMD: Efficient and Fast