Title: Efficient Throughput Cores for Asymmetric Manycore Processors
1 Efficient Throughput Cores for Asymmetric Manycore Processors
2 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap, DAC 2008)
- Diverge on Miss (submitted to SC 2009)
- Sharing Tracker (submitted to MICRO 2009)
3 What's going on here?
From "The Landscape of Parallel Computing Research: A View from Berkeley"
4 We ran into trouble!
From Avi Mendlson's lecture slides: http://www.cs.technion.ac.il/mendlson/Lecture1.ppt
5 Architecture Reaction
AMD Phenom, from amd.com
From Jon Stokes, "Clearing up the confusion over Intel's Larrabee, part II" at arstechnica.com
11 Small incremental benefit for few threads
12 Want faster small cores for the few-threads case
14 SIMD: Efficient and Fast
From Jon Stokes, "Clearing up the confusion over Intel's Larrabee, part II" at arstechnica.com
15 Challenges
- Preserve single-thread performance even under strong area and power constraints
- Need performance to be scalable from 1 to N threads
- Widest possible benefit of SIMD units
- Relatively small caches need to provide good hit rates
16 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
17 Lightweight OOO
- Most OOO structures are overdesigned
- Redesign the Issue Queue to optimize for the common case, which is approximately one consumer
- Remove the Load-Store Queue; since actual forwarding is rare, use a structure (the Memory Alias Table) that only detects memory ordering violations
18 Federation
- Different workloads have different numbers of active threads
- Want the CMP to adapt its hardware at runtime
- Combine two cores dynamically
- Register files of multithreaded cores can be re-used for the active list/ROB
- Can be used for both 1-way and 2-way cores
19 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
20 Related Work
- Diverge on Miss: Dynamic Warp Formation by Fung et al.
21 Divergent Memory Access
Trees, hash tables, spatial data structures, N-body
22 Memory Coalescing Buffer
23 Diverge on Miss
24 Diverge on Miss: Miss Patterns
- Misses are semi-random: all threads miss a number of times, but in different iterations
- Trailing threads catch up when leading threads miss in the cache
25 Diverge on Miss
if (foo(threadID))
    do A
... more code here
26 Diverge on Miss
while (work) {
    d = data[calc_index(i)];
    acc += f(d);
}
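The benefit of letting hit lanes run ahead of missing lanes can be sketched in a few lines of Python. This is a simplified cost model, not the simulated hardware: the 4-cycle miss latency and the cycle accounting are assumptions for illustration only.

```python
MISS_LATENCY = 4  # assumed cycles until a missing lane's data returns

def run_warp(iterations, miss_pattern):
    """miss_pattern[lane][i] is True if that lane misses on iteration i.
    Returns (cycles with lockstep stalling, cycles with diverge-on-miss)."""
    lanes = range(len(miss_pattern))
    # Lockstep: the whole warp stalls whenever any lane misses.
    lockstep = sum(
        1 + (MISS_LATENCY if any(miss_pattern[l][i] for l in lanes) else 0)
        for i in range(iterations))
    # Diverge on miss: each lane waits only for its own misses;
    # the warp finishes when the slowest (trailing) lane finishes.
    per_lane = [
        sum(1 + (MISS_LATENCY if miss_pattern[l][i] else 0)
            for i in range(iterations))
        for l in lanes]
    return lockstep, max(per_lane)
```

With the semi-random pattern from the previous slide (every lane misses once, but in a different iteration), lockstep pays the miss latency every iteration while diverge-on-miss pays it only once per lane, so trailing lanes catch up.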
27 Simulation Assumptions
- 32-wide SIMD cores
- 32 cores
- 256 GB/sec off-chip bandwidth
- 32KB L1 caches per core
28 Workloads
- Ray Tracing
- Molecular Dynamics (2 kernels)
- DNA Sequence Alignment
- K-means
- Gaussian Filter
30 Speedup with Diverge on Miss
- 30% higher performance with Diverge on Miss
- Max performance at ¼ the register file size
31 With 256KB L2
32 Speedup with L2
- 20% higher performance with Diverge on Miss
- Need fewer warps for good performance
33 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
34 Sharing Tracker
- Cache coherence is not supported on most manycore processors
- Coherence is useful for sharing cache lines
- But it has too much overhead
35 Related Work
- Sharing Tracker: Coherence Engine, Distributed Coherence Engine
36 How to share cache lines?
Sharing Tracker
37 Sharing Tracker
1. Cache miss -> query ST
2. ST hit -> forward to cache
2a. Request from ST -> check cache
3. Cache hit -> forward cache line
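The query flow above can be sketched as a tiny Python model. This is a hypothetical simplification (the entry format, eviction handling, and method names are assumptions): the tracker keeps only tags and an owner hint, never data, so a stale entry simply falls through to off-chip memory instead of needing invalidations.

```python
# Sketch of the sharing-tracker lookup: a tag-only directory that
# redirects L1 misses to a peer cache that may still hold the line.
class SharingTracker:
    def __init__(self):
        self.entries = {}            # line address -> owning core id (no data)

    def update(self, addr, core):
        """Record that `core` filled this line (called on a cache fill)."""
        self.entries[addr] = core

    def lookup(self, addr, caches):
        """caches maps core id -> set of line addresses currently held."""
        owner = self.entries.get(addr)
        if owner is not None and addr in caches[owner]:
            return ('forwarded', owner)   # step 3: peer cache hit, forward line
        return ('off-chip', None)         # ST miss or stale hint: go to memory
```

Because entries are only hints, a wrong (stale) hint costs an extra on-chip check but never violates correctness.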
38 Sharing Tracker
The Sharing Tracker requires 3x fewer bits
39 Core Areas
- A per-core L2 makes the core 40-60% bigger
- Can the Sharing Tracker eliminate the need for L2s?
40 Sharing Tracker: Performance
41 Sharing Tracker with L2: Performance
42 Sharing Tracker with L2: Bandwidth
43 Sharing Tracker without L2: Performance
44 Sharing Tracker without L2: Bandwidth
45 Sharing Tracker: Performance/Area
46 Conclusion
- Low-power OOO is feasible
- Throughput-oriented cores can give good single-thread performance
- Memory-divergent code can make good use of SIMD
- Small, non-coherent caches for SIMD cores can be augmented to provide good performance
47 List of Publications (1)
- Dissertation Work
  - "Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue," D. Tarjan, M. Boyer and K. Skadron (DAC 2008)
  - "Federation: Very Low Overhead Out-of-Order Execution," D. Tarjan, M. Boyer and K. Skadron (accepted to TACO pending major revisions)
  - "Increasing Memory Miss Tolerance for SIMD Cores," D. Tarjan and K. Skadron (submitted to SC 2009)
  - "Adapting Partial Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches in GPUs," D. Tarjan and K. Skadron (submitted to MICRO 2009)
- CUDA
  - "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors," M. Boyer, D. Tarjan, S. Acton and K. Skadron (IPDPS 2009)
  - "A Performance Study of General-Purpose Applications on Graphics Processors using CUDA," S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron (JPDC 2008)
- Parameter Variation
  - "The Impact of Systematic Process Variations on Symmetrical Performance in Chip Multi-processors," E. Humenay, D. Tarjan and K. Skadron (DATE 2007)
  - "Impact of Parameter Variations on Multi-Core Chips," E. Humenay, D. Tarjan and K. Skadron (WCED 2006)
48 List of Publications (2)
- Branch Prediction
  - "Merging Path and Gshare Indexing in Perceptron Branch Prediction," D. Tarjan and K. Skadron (TACO 2005)
  - "An Ahead Pipelined Alloyed Perceptron with Single Cycle Access Time," D. Tarjan and K. Skadron (WCED 2004)
- HotSpot
  - "Temperature-Aware Computer Systems: Modeling and Implementation," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (TACO 2004)
  - "Temperature-Aware Computer Systems: Opportunities and Challenges," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (IEEE MICRO 2003)
  - "Temperature-Aware Microarchitecture," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (ISCA 2003, Best Student Paper)
49 Q&A
50 Diverge on Miss: Slip Control
- Limit how far ahead/behind threads can be
- Increase slip if latency-limited
- Decrease slip if bandwidth- or ALU-limited
- Simple linear increase or decrease
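The linear slip controller above can be sketched in a few lines. The step size, bounds, and function name here are assumptions, not values from the evaluation; the point is only the additive-increase/additive-decrease shape of the policy.

```python
# Sketch of a linear slip controller: slip bounds how far leading
# threads in a warp may run ahead of trailing ones.
def adjust_slip(slip, latency_limited, bw_or_alu_limited,
                step=1, min_slip=0, max_slip=64):
    if bw_or_alu_limited:
        # Too much pressure on bandwidth or ALUs: tighten the bound.
        slip = max(slip - step, min_slip)
    elif latency_limited:
        # Stalled on memory latency: let leaders run further ahead.
        slip = min(slip + step, max_slip)
    return slip
```

A bandwidth- or ALU-limited signal wins over a latency signal here, since running further ahead cannot help when the limiting resource is already saturated.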
51 Warp Splitting
- Started to work on control-flow divergence
- Memory divergence turned out to be a bigger problem
52 FilterCache
- Benefits turned out to be small
- Small caches and large working sets lead to high churn and little reuse
53 Warp Splitting
- Warp: a SIMD group
- A branch dependent on per-lane data values can lead to divergence
- Split warps on divergent branches
- Execute the split warp during free time slots
- Cuts the serialization latency of divergence
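The splitting step itself is just a partition of the warp's active mask. A minimal sketch (lane lists instead of hardware bit masks, which is an assumption made for readability):

```python
# Sketch: split a warp's active lanes on a divergent branch into a
# taken group and a fall-through group, each scheduled independently.
def split_warp(active_lanes, taken):
    """active_lanes: lane ids active in the warp.
    taken: mapping from lane id -> branch outcome for that lane."""
    taken_group = [l for l in active_lanes if taken[l]]
    fallthrough = [l for l in active_lanes if not taken[l]]
    return taken_group, fallthrough
```

If all lanes agree, one group is empty and no split happens; otherwise the two groups can fill otherwise-idle issue slots instead of serializing both paths under one mask.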
54 Diverge on Miss
while (work)
    if (...)
        do A
    else
        do B
55 FilterCache
[Figure: FilterCache organization, showing SIMD and scalar access paths to the Tag and Data arrays]
56 FilterCache
- Detect SIMD stride and reuse patterns
- Probabilistic LRU update based on the number of lanes accessing a cache line
- Eviction cost based on SIMD miss parallelism
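The probabilistic LRU update can be sketched as follows. The exact probability function is an assumption (lanes touching the line divided by SIMD width); the slide states only that the update probability depends on the number of accessing lanes.

```python
import random

# Sketch: a line touched by more SIMD lanes is promoted toward MRU with
# higher probability, so widely shared lines outlive single-lane
# streaming data in a small cache.
def maybe_promote(lru_stack, line, lanes_accessing, simd_width=32, rng=random):
    """lru_stack is ordered LRU-first; the last element is MRU."""
    if rng.random() < lanes_accessing / simd_width:
        lru_stack.remove(line)
        lru_stack.append(line)   # move to MRU position
    return lru_stack
```

A line accessed by all 32 lanes is always promoted; a line touched by a single lane is promoted only 1 time in 32, approximating true per-access LRU at far lower update cost.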
57 Area Overhead
58 Simulation Methodology
- Simulator
  - SimpleScalar 3.0
  - Wattch
- Workloads
  - SPEC2000 benchmarks
  - SimPoint-selected 100 million instructions
59 Adaptive Pipeline
[Figure: pipeline switching between in-order and out-of-order operation]
60 Performance Impact
61 Performance Results
62 Energy Efficiency Results
63 Energy-Area Efficiency Results
64 Lightweight Out-of-Order Core
- CMPs are primarily limited by power
- The secondary limit is area
- High-performance cores spend most of their power in OOO structures
- Data from a 130nm AMD Opteron
65 LSQ
From Mesa-Martinez et al., "Power Model Validation Through Thermal Measurements," ISCA 2007
66 Subscription-Based Issue Queue
- Consumers subscribe to their producers at rename
- Producers set the ready bits of their consumers at execute
- A fixed number of subscriber slots can cause stalls
- Select uses a static priority encoder based on IQ position rather than oldest-first
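The subscribe/wake handshake can be sketched as a small model. The single subscriber slot matches the "approximately one consumer" common case from the Lightweight OOO slide; the class and method names are assumptions for illustration.

```python
NUM_SLOTS = 1   # subscriber slots per producer; sized for ~one consumer

# Sketch: instead of a CAM broadcast, each producer remembers which
# consumers subscribed at rename and sets only their ready bits at
# execute. A full subscriber list stalls rename.
class IssueQueue:
    def __init__(self):
        self.subscribers = {}   # producer id -> list of consumer ids
        self.ready = set()      # consumers whose operand is now ready

    def rename(self, consumer, producer):
        slots = self.subscribers.setdefault(producer, [])
        if len(slots) >= NUM_SLOTS:
            return False        # no free subscriber slot: rename stalls
        slots.append(consumer)
        return True

    def execute(self, producer):
        for consumer in self.subscribers.pop(producer, []):
            self.ready.add(consumer)   # wake only the subscribed consumers
```

The stall on a full subscriber list is the price of removing the broadcast wakeup; it is rare when most values have a single consumer.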
67 Issue Queue Example
[Figure: issue queue entries with subscriber slots and ready bits]
68 Simplified Load-Store Queue
- Memory Alias Table (MAT)
- Address-based hash table (counting Bloom filter)
- No store forwarding
- No conservative waiting on stores
- Only detect memory ordering violations after they have occurred, and flush the pipeline when the offending instruction commits
69 MAT Example
st 0x13, r5
ld r1, 0x13
70 MAT Example
st 0x13, r5
ld r1, 0x13 (EXE)
The ld executes and increments its counter
71 MAT Example
st 0x13, r5 (COM)
ld r1, 0x13
The st commits and sets the flag
72 MAT Example
ld r1, 0x13 (COM) -> Flush
The ld commits, sees the flag, and flushes the pipeline
73 MAT Example
ld r1, 0x13
The MAT is reset and execution resumes
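The walkthrough above can be condensed into a small Python model of the MAT. The table size and method names are assumptions; only the counter/flag mechanics follow the slides: loads count themselves in at execute and out at commit, a committing store that finds a nonzero counter sets the flag, and a flagged load flushes at commit.

```python
MAT_SIZE = 64   # number of counter/flag entries (assumed)

# Sketch of the Memory Alias Table: a counting-Bloom-filter-style hash
# table that detects load/store ordering violations after the fact,
# instead of a forwarding Load-Store Queue.
class MemoryAliasTable:
    def __init__(self):
        self.counters = [0] * MAT_SIZE
        self.flags = [False] * MAT_SIZE

    def index(self, addr):
        return hash(addr) % MAT_SIZE

    def load_execute(self, addr):
        self.counters[self.index(addr)] += 1   # an in-flight load passed here

    def store_commit(self, addr):
        if self.counters[self.index(addr)] > 0:
            # A younger load to this (hashed) address already executed.
            self.flags[self.index(addr)] = True

    def load_commit(self, addr):
        i = self.index(addr)
        self.counters[i] -= 1
        if self.flags[i]:
            self.reset()
            return 'flush'      # ordering violation: squash and re-execute
        return 'ok'

    def reset(self):
        self.counters = [0] * MAT_SIZE
        self.flags = [False] * MAT_SIZE
```

Hash aliasing can cause false flushes but never missed violations, which is the safe direction for a filter that replaces the LSQ's ordering checks.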
74 Federation
- The best architecture depends on the workload
  - Lots of parallelism
  - Limited parallelism
- How do we choose at design time without knowing the workload characteristics?
75 Basic Idea
- Allow the architecture to adapt at runtime
76 Key Insights
- Large, multi-threaded register files can be repurposed to support out-of-order execution
- If cores are small, single-cycle communication between neighbors is feasible
- Leverage the work on the Lightweight core
77 Adaptive Pipeline (2)
[Figure: two in-order cores, each with Fetch, Decode, Register File, Execute, and I/D cache stages, combined into one pipeline]
79 Technology Trends
- Device density is still improving
- Active and idle power are not
- Vth not scaling -> Vdd not scaling
- Frequency not scaling due to multiple factors
80 But not everybody will profit!
- Irregular parallelism
- Irregular control flow
- Few threads
From Drake et al., "MPEG-2 Decoding in a Stream Programming Language," IPDPS 2006
81 Divergent Memory Access
82 SIMD
add r2, r5, r7
ld r1, r2, 4
blt r2, 10
[Figure: per-lane values of R2 across the SIMD lanes]
83 SIMD: Efficient and Fast