Title: Efficient Throughput Cores for Asymmetric Manycore Processors
1 Efficient Throughput Cores for Asymmetric Manycore Processors
2 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap, DAC 2008)
- Diverge on Miss (submitted to SC 2009)
- Sharing Tracker (submitted to MICRO 2009)
3 What's going on here?
From "The Landscape of Parallel Computing Research: A View from Berkeley"
4 We ran into trouble!
From Avi Mendlson's lecture slides: http://www.cs.technion.ac.il/mendlson/Lecture1.ppt
5 Architecture Reaction
AMD Phenom, from amd.com
From Jon Stokes, "Clearing up the confusion over Intel's Larrabee, part II" at arstechnica.com
11 Small incremental benefit for few threads
12 Want faster small cores for the few-threads case
14 SIMD: Efficient and Fast
From Jon Stokes, "Clearing up the confusion over Intel's Larrabee, part II" at arstechnica.com
15 Challenges
- Preserve single-thread performance even under strong area and power constraints
- Need performance to be scalable from 1 to N threads
- Widest possible benefit of SIMD units
- Relatively small caches need to provide good hit rates
16 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
17 Lightweight OOO
- Most OOO structures are overdesigned
- Redesign the Issue Queue to optimize for the common case, which is approximately one consumer
- Remove the Load-Store Queue; since actual forwarding is rare, use a structure (the Memory Alias Table) that only detects memory ordering violations
18 Federation
- Different workloads have different numbers of active threads
- Want the CMP to adapt its hardware at runtime
- Combine two cores dynamically
- Register files of multithreaded cores can be re-used for the active list/ROB
- Can be used for both 1-way and 2-way cores
19 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
20 Related Work
- Diverge on Miss: Dynamic Warp Formation by Fung et al.
21 Divergent Memory Access
Trees, hash tables, spatial data structures, N-body
22 Memory Coalescing Buffer
23 Diverge on Miss
24 Diverge on Miss: Miss Patterns
- Misses are semi-random: all threads miss a number of times, but in different iterations
- Trailing threads catch up when leading threads miss in the cache
25 Diverge on Miss
if (foo(threadID))
    do A
... more code here
26 Diverge on Miss
while (work) {
    d = data[calc_index(i)];
    acc += f(d);
}
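The benefit of letting hit lanes run ahead of missing lanes can be sketched in a few lines of Python. This is a simplified cost model, not the simulated hardware: the 4-cycle miss latency and the cycle accounting are assumptions for illustration only.

```python
MISS_LATENCY = 4  # assumed cycles until a missing lane's data returns

def run_warp(iterations, miss_pattern):
    """miss_pattern[lane][i] is True if that lane misses on iteration i.
    Returns (cycles with lockstep stalling, cycles with diverge-on-miss)."""
    lanes = range(len(miss_pattern))
    # Lockstep: the whole warp stalls whenever any lane misses.
    lockstep = sum(
        1 + (MISS_LATENCY if any(miss_pattern[l][i] for l in lanes) else 0)
        for i in range(iterations))
    # Diverge on miss: each lane waits only for its own misses;
    # the warp finishes when the slowest (trailing) lane finishes.
    per_lane = [
        sum(1 + (MISS_LATENCY if miss_pattern[l][i] else 0)
            for i in range(iterations))
        for l in lanes]
    return lockstep, max(per_lane)
```

With the semi-random pattern from the previous slide (every lane misses once, but in a different iteration), lockstep pays the miss latency every iteration while diverge-on-miss pays it only once per lane, so trailing lanes catch up.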
27 Simulation Assumptions
- 32-wide SIMD cores
- 32 cores
- 256 GB/sec off-chip bandwidth
- 32KB L1 caches per core
28 Workloads
- Ray Tracing
- Molecular Dynamics (2 kernels)
- DNA Sequence Alignment
- K-means
- Gaussian Filter
30 Speedup with Diverge on Miss
- 30% higher performance with Diverge on Miss
- Max performance at ¼ the register file size
31 With 256KB L2
32 Speedup with L2
- 20% higher performance with Diverge on Miss
- Need fewer warps for good performance
33 Outline
- Motivation
- Early Work on Lightweight Federation (presented in Proposal, 1 slide each recap)
- Diverge on Miss
- Sharing Tracker
34 Sharing Tracker
- Cache coherence is not supported on most manycore processors
- Coherence is useful for sharing cache lines
- But it has too much overhead
35 Related Work
- Sharing Tracker: Coherence Engine, Distributed Coherence Engine
36 How to share cache lines?
Sharing Tracker
37 Sharing Tracker
1. Cache miss -> query ST
2. ST hit -> forward to cache
2a. Request from ST -> check cache
3. Cache hit -> forward cache line
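The query flow above can be sketched as a tiny Python model. This is a hypothetical simplification (the entry format, eviction handling, and method names are assumptions): the tracker keeps only tags and an owner hint, never data, so a stale entry simply falls through to off-chip memory instead of needing invalidations.

```python
# Sketch of the sharing-tracker lookup: a tag-only directory that
# redirects L1 misses to a peer cache that may still hold the line.
class SharingTracker:
    def __init__(self):
        self.entries = {}            # line address -> owning core id (no data)

    def update(self, addr, core):
        """Record that `core` filled this line (called on a cache fill)."""
        self.entries[addr] = core

    def lookup(self, addr, caches):
        """caches maps core id -> set of line addresses currently held."""
        owner = self.entries.get(addr)
        if owner is not None and addr in caches[owner]:
            return ('forwarded', owner)   # step 3: peer cache hit, forward line
        return ('off-chip', None)         # ST miss or stale hint: go to memory
```

Because entries are only hints, a wrong (stale) hint costs an extra on-chip check but never violates correctness.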
38 Sharing Tracker
The Sharing Tracker requires 3x fewer bits
39 Core Areas
- A per-core L2 makes the core 40-60% bigger
- Can the Sharing Tracker eliminate the need for L2s?
40 Sharing Tracker: Performance
41 Sharing Tracker with L2: Performance
42 Sharing Tracker with L2: Bandwidth
43 Sharing Tracker without L2: Performance
44 Sharing Tracker without L2: Bandwidth
45 Sharing Tracker: Performance/Area
46 Conclusion
- Low-power OOO is feasible
- Throughput-oriented cores can give good single-thread performance
- Memory-divergent code can make good use of SIMD
- Small, non-coherent caches for SIMD cores can be augmented to provide good performance
47 List of Publications (1)
- Dissertation Work
  - "Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue," D. Tarjan, M. Boyer and K. Skadron (DAC 2008)
  - "Federation: Very Low Overhead Out-of-Order Execution," D. Tarjan, M. Boyer and K. Skadron (accepted to TACO pending major revisions)
  - "Increasing Memory Miss Tolerance for SIMD Cores," D. Tarjan and K. Skadron (submitted to SC 2009)
  - "Adapting Partial Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches in GPUs," D. Tarjan and K. Skadron (submitted to MICRO 2009)
- CUDA
  - "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors," M. Boyer, D. Tarjan, S. Acton and K. Skadron (IPDPS 2009)
  - "A Performance Study of General-Purpose Applications on Graphics Processors using CUDA," S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron (JPDC 2008)
- Parameter Variation
  - "The Impact of Systematic Process Variations on Symmetrical Performance in Chip Multi-processors," E. Humenay, D. Tarjan and K. Skadron (DATE 2007)
  - "Impact of Parameter Variations on Multi-Core Chips," E. Humenay, D. Tarjan and K. Skadron (WCED 2006)
48 List of Publications (2)
- Branch Prediction
  - "Merging Path and Gshare Indexing in Perceptron Branch Prediction," D. Tarjan and K. Skadron (TACO 2005)
  - "An Ahead Pipelined Alloyed Perceptron with Single Cycle Access Time," D. Tarjan and K. Skadron (WCED 2004)
- HotSpot
  - "Temperature-Aware Computer Systems: Modeling and Implementation," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (TACO 2004)
  - "Temperature-Aware Computer Systems: Opportunities and Challenges," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (IEEE MICRO 2003)
  - "Temperature-Aware Microarchitecture," K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan (ISCA 2003, Best Student Paper)
49 Q&A
50 Diverge on Miss: Slip Control
- Limit how far ahead/behind threads can be
- Increase slip if latency-limited
- Decrease slip if bandwidth- or ALU-limited
- Simple linear increase or decrease
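The linear slip controller above can be sketched in a few lines. The step size, bounds, and function name here are assumptions, not values from the evaluation; the point is only the additive-increase/additive-decrease shape of the policy.

```python
# Sketch of a linear slip controller: slip bounds how far leading
# threads in a warp may run ahead of trailing ones.
def adjust_slip(slip, latency_limited, bw_or_alu_limited,
                step=1, min_slip=0, max_slip=64):
    if bw_or_alu_limited:
        # Too much pressure on bandwidth or ALUs: tighten the bound.
        slip = max(slip - step, min_slip)
    elif latency_limited:
        # Stalled on memory latency: let leaders run further ahead.
        slip = min(slip + step, max_slip)
    return slip
```

A bandwidth- or ALU-limited signal wins over a latency signal here, since running further ahead cannot help when the limiting resource is already saturated.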
51 Warp Splitting
- Started to work on control-flow divergence
- Memory divergence turned out to be a bigger problem
52 FilterCache
- Benefits turned out to be small
- Small caches and large working sets lead to high churn and little reuse
53 Warp Splitting
- Warp: a SIMD group
- A branch dependent on per-lane data values can lead to divergence
- Split warps on divergent branches
- Execute the split warp during free time slots
- Cuts the serialization latency of divergence
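The splitting step itself is just a partition of the warp's active mask. A minimal sketch (lane lists instead of hardware bit masks, which is an assumption made for readability):

```python
# Sketch: split a warp's active lanes on a divergent branch into a
# taken group and a fall-through group, each scheduled independently.
def split_warp(active_lanes, taken):
    """active_lanes: lane ids active in the warp.
    taken: mapping from lane id -> branch outcome for that lane."""
    taken_group = [l for l in active_lanes if taken[l]]
    fallthrough = [l for l in active_lanes if not taken[l]]
    return taken_group, fallthrough
```

If all lanes agree, one group is empty and no split happens; otherwise the two groups can fill otherwise-idle issue slots instead of serializing both paths under one mask.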
54 Diverge on Miss
while (work)
    if (...)
        do A
    else
        do B
55 FilterCache
[Figure: FilterCache organization, showing SIMD and scalar access paths to the Tag and Data arrays]
56 FilterCache
- Detect SIMD stride and reuse patterns
- Probabilistic LRU update based on the number of lanes accessing a cache line
- Eviction cost based on SIMD miss parallelism
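The probabilistic LRU update can be sketched as follows. The exact probability function is an assumption (lanes touching the line divided by SIMD width); the slide states only that the update probability depends on the number of accessing lanes.

```python
import random

# Sketch: a line touched by more SIMD lanes is promoted toward MRU with
# higher probability, so widely shared lines outlive single-lane
# streaming data in a small cache.
def maybe_promote(lru_stack, line, lanes_accessing, simd_width=32, rng=random):
    """lru_stack is ordered LRU-first; the last element is MRU."""
    if rng.random() < lanes_accessing / simd_width:
        lru_stack.remove(line)
        lru_stack.append(line)   # move to MRU position
    return lru_stack
```

A line accessed by all 32 lanes is always promoted; a line touched by a single lane is promoted only 1 time in 32, approximating true per-access LRU at far lower update cost.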
57 Area Overhead
58 Simulation Methodology
- Simulator
  - SimpleScalar 3.0
  - Wattch
- Workloads
  - SPEC2000 benchmarks
  - SimPoint-selected 100 million instructions
59 Adaptive Pipeline
[Figure: pipeline switching between in-order and out-of-order operation]
60 Performance Impact
61 Performance Results
62 Energy Efficiency Results
63 Energy-Area Efficiency Results
64 Lightweight Out-of-Order Core
- CMPs are primarily limited by power
- The secondary limit is area
- High-performance cores spend most of their power in OOO structures
- Data from a 130nm AMD Opteron
65 LSQ
From Mesa-Martinez et al., "Power Model Validation Through Thermal Measurements," ISCA 2007
66 Subscription-Based Issue Queue
- Consumers subscribe to their producers at rename
- Producers set the ready bits of their consumers at execute
- A fixed number of subscriber slots can cause stalls
- Select uses a static priority encoder based on IQ position rather than oldest-first
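The subscribe/wake handshake can be sketched as a small model. The single subscriber slot matches the "approximately one consumer" common case from the Lightweight OOO slide; the class and method names are assumptions for illustration.

```python
NUM_SLOTS = 1   # subscriber slots per producer; sized for ~one consumer

# Sketch: instead of a CAM broadcast, each producer remembers which
# consumers subscribed at rename and sets only their ready bits at
# execute. A full subscriber list stalls rename.
class IssueQueue:
    def __init__(self):
        self.subscribers = {}   # producer id -> list of consumer ids
        self.ready = set()      # consumers whose operand is now ready

    def rename(self, consumer, producer):
        slots = self.subscribers.setdefault(producer, [])
        if len(slots) >= NUM_SLOTS:
            return False        # no free subscriber slot: rename stalls
        slots.append(consumer)
        return True

    def execute(self, producer):
        for consumer in self.subscribers.pop(producer, []):
            self.ready.add(consumer)   # wake only the subscribed consumers
```

The stall on a full subscriber list is the price of removing the broadcast wakeup; it is rare when most values have a single consumer.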
67 Issue Queue Example
[Figure: issue queue entries with subscriber slots and ready bits]
68 Simplified Load-Store Queue
- Memory Alias Table (MAT)
- Address-based hash table (counting Bloom filter)
- No store forwarding
- No conservative waiting on stores
- Only detect memory ordering violations after they have occurred, and flush the pipeline when the offending instruction commits
69 MAT Example
st 0x13, r5
ld r1, 0x13
70 MAT Example
st 0x13, r5
ld r1, 0x13 (EXE)
The ld executes and increments its counter
71 MAT Example
st 0x13, r5 (COM)
ld r1, 0x13
The st commits and sets the flag
72 MAT Example
ld r1, 0x13 (COM) -> Flush
The ld commits, sees the flag, and flushes the pipeline
73 MAT Example
ld r1, 0x13
The MAT is reset and execution resumes
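The walkthrough above can be condensed into a small Python model of the MAT. The table size and method names are assumptions; only the counter/flag mechanics follow the slides: loads count themselves in at execute and out at commit, a committing store that finds a nonzero counter sets the flag, and a flagged load flushes at commit.

```python
MAT_SIZE = 64   # number of counter/flag entries (assumed)

# Sketch of the Memory Alias Table: a counting-Bloom-filter-style hash
# table that detects load/store ordering violations after the fact,
# instead of a forwarding Load-Store Queue.
class MemoryAliasTable:
    def __init__(self):
        self.counters = [0] * MAT_SIZE
        self.flags = [False] * MAT_SIZE

    def index(self, addr):
        return hash(addr) % MAT_SIZE

    def load_execute(self, addr):
        self.counters[self.index(addr)] += 1   # an in-flight load passed here

    def store_commit(self, addr):
        if self.counters[self.index(addr)] > 0:
            # A younger load to this (hashed) address already executed.
            self.flags[self.index(addr)] = True

    def load_commit(self, addr):
        i = self.index(addr)
        self.counters[i] -= 1
        if self.flags[i]:
            self.reset()
            return 'flush'      # ordering violation: squash and re-execute
        return 'ok'

    def reset(self):
        self.counters = [0] * MAT_SIZE
        self.flags = [False] * MAT_SIZE
```

Hash aliasing can cause false flushes but never missed violations, which is the safe direction for a filter that replaces the LSQ's ordering checks.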
74 Federation
- The best architecture depends on the workload
  - Lots of parallelism
  - Limited parallelism
- How do we choose at design time without knowing the workload characteristics?
75 Basic Idea
- Allow the architecture to adapt at runtime
76 Key Insights
- Large, multi-threaded register files can be repurposed to support out-of-order execution
- If cores are small, single-cycle communication between neighbors is feasible
- Leverage the work on the Lightweight core
77 Adaptive Pipeline (2)
[Figure: two in-order cores, each with Fetch, Decode, Register File, Execute, and I/D cache stages, combined into one pipeline]
79 Technology Trends
- Device density is still improving
- Active and idle power are not
- Vth not scaling -> Vdd not scaling
- Frequency not scaling due to multiple factors
80 But not everybody will profit!
- Irregular parallelism
- Irregular control flow
- Few threads
From Drake et al., "MPEG-2 Decoding in a Stream Programming Language," IPDPS 2006
81 Divergent Memory Access
82 SIMD
add r2, r5, r7
ld r1, r2, 4
blt r2, 10
[Figure: per-lane values of R2 across the SIMD lanes]
83 SIMD: Efficient and Fast