Title: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
1. A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
- Ayose Falcón, Alex Ramirez, Mateo Valero
- HPCA-10
- February 18, 2004
2. Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
  - Instructions from different threads coexist in each processor stage
  - Resources are shared among the different threads
- But...
  - Sharing implies competition
  - In caches, queues, FUs, ...
- The fetch policy decides!
3. Motivation
- SMT performance is limited by fetch performance
  - A superscalar fetch is not enough to feed an aggressive SMT core
  - SMT fetch is a bottleneck [Tullsen96, Burns99]
- Straightforward solution: fetch from several threads each cycle
  - a) Multiple fetch units (1 per thread) → EXPENSIVE!
  - b) Shared fetch + fetch policy [Tullsen96]
    - Multiple PCs
    - Multiple branch predictions per cycle
    - Multiple I-cache accesses per cycle
- Does the performance of this fetch organization compensate for its complexity?
4. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
5. Fetching from a Single Thread (1.X)
[Diagram: Branch Predictor + Instruction Cache]
- Fine-grained, non-simultaneous sharing
- Simple → similar to a superscalar fetch unit
  - No additional HW needed
- A fetch policy is needed
  - Decides fetch priority among threads
  - Several proposals in the literature
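One such policy, referenced later in this deck, is ICOUNT [Tullsen96]: give fetch priority to the thread with the fewest instructions in the pre-issue pipeline stages. A minimal sketch, assuming a simple per-thread in-flight count (the data layout is illustrative, not SMTSIM's):

```python
# Minimal sketch of an ICOUNT-style fetch policy [Tullsen96]: each cycle,
# give fetch priority to the thread with the fewest instructions in the
# decode/rename/issue stages. Data layout is an illustrative assumption.

def icount_pick(inflight_counts):
    """Return the id of the thread with the fewest in-flight instructions."""
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])

# Thread 2 has only 3 in-flight instructions, so it fetches this cycle.
print(icount_pick([12, 7, 3, 9]))  # -> 2
```

The count naturally throttles threads that clog the queues: a thread stalled in the back-end accumulates in-flight instructions and loses fetch priority.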
6. Fetching from a Single Thread (1.X)
- But... a single thread is not enough to fill the fetch BW
  - A gshare / hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
- Fetch BW is heavily underused
  - Avg. 40% wasted with 1.8
  - Avg. 60% wasted with 1.16
- The fetch BW is fully used in only a fraction of cycles
  - 31% of fetch cycles with 1.8
  - 6% of fetch cycles with 1.16
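The underuse follows directly from block size versus fetch width. A back-of-the-envelope sketch, assuming a 7-instruction average block (within the slide's 6-8 range); the slide's 40%/60% figures are measured averages, so the toy numbers only illustrate the trend:

```python
# Back-of-the-envelope: fraction of fetch slots wasted when one basic
# block is fetched per cycle. The 7-instruction average is an assumption
# within the slide's 6-8 range; the slide's 40%/60% values are measured.
def wasted_fraction(avg_block_size, fetch_width):
    used = min(avg_block_size, fetch_width)
    return 1 - used / fetch_width

print(wasted_fraction(7, 8))   # -> 0.125   (1.8: ~1/8 of slots empty)
print(wasted_fraction(7, 16))  # -> 0.5625  (1.16: over half empty)
```

The wider the fetch, the worse a single basic block fills it, which is the gap the rest of the talk addresses.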
7. Fetching from Multiple Threads (2.X)
- Increases fetch throughput
  - More threads → more possibilities to fill the fetch BW
  - More fetch BW use than 1.X
- The fetch BW is fully used more often
  - 54% of cycles with 2.8
  - 16% of cycles with 2.16
8. Fetching from Multiple Threads (2.X)
- But... what is the additional HW cost of a 2.X fetch?
[Diagram: MERGE of the instructions fetched from the two threads]
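Functionally, a 2.X fetch needs two PCs, two branch predictions, and two I-cache accesses per cycle, then merges the results. A toy sketch of the merge step, primary thread first (names and bundle sizes are illustrative):

```python
# Toy sketch of the 2.X merge step: combine two per-thread fetch bundles,
# primary thread first, up to the total fetch bandwidth. Illustrative only;
# real hardware also needs duplicated predictor and I-cache ports.
def merge_fetch(primary, secondary, fetch_width):
    merged = primary[:fetch_width]
    merged += secondary[:fetch_width - len(merged)]
    return merged

a = ["a0", "a1", "a2", "a3", "a4"]        # 5 insts from thread A
b = ["b0", "b1", "b2", "b3", "b4", "b5"]  # 6 insts from thread B
print(merge_fetch(a, b, 8))
# -> ['a0', 'a1', 'a2', 'a3', 'a4', 'b0', 'b1', 'b2']
```

The merge itself is cheap; the expensive parts are the duplicated prediction and I-cache ports feeding it.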
9. Our Goal
- Can we take the best of both worlds?
  - The low complexity of a 1.X fetch architecture
  - The high performance of a 2.X fetch architecture
- That is... can a single thread provide sufficient instructions to fill the available fetch bandwidth?
10. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
11. High-Performance Fetch Engines (I)
- We look for high performance
  - Gshare / hybrid branch predictor + BTB
    - Low performance
    - Limits fetch BW to one basic block per cycle (6-8 instructions)
- We look for low complexity
  - Trace cache, Branch Target Address Cache, Collapsing Buffer, etc.
    - Fetch multiple basic blocks per cycle (12-16 instructions)
    - High complexity
12. High-Performance Fetch Engines (II)
- Our alternatives
  - Gskew [Michaud97] + FTB [Reinman99]
    - FTB fetch blocks are larger than basic blocks
    - 5% speedup over gshare+BTB in superscalars
  - Stream Predictor [Ramirez02]
    - Streams are larger than FTB fetch blocks
    - 11% speedup over gskew+FTB in superscalars
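A stream [Ramirez02] is the run of sequential instructions ending only at a taken branch, so it can span several basic blocks (not-taken branches do not end it). A toy sketch of why streams are longer than basic blocks (the trace encoding is an illustrative assumption):

```python
# Toy sketch: split an instruction trace into fetch streams, i.e. runs
# that end only at *taken* branches. Basic blocks end at every branch,
# so streams are at least as long -- which is why a stream predictor can
# feed a 16-wide fetch. Encoding ('i' = plain instruction, 'bN' =
# not-taken branch, 'bT' = taken branch) is an illustrative assumption.
def split_streams(trace):
    streams, cur = [], []
    for inst in trace:
        cur.append(inst)
        if inst == "bT":  # only a taken branch ends a stream
            streams.append(cur)
            cur = []
    if cur:
        streams.append(cur)
    return streams

trace = ["i", "i", "bN", "i", "i", "i", "bT", "i", "bN", "i", "bT"]
print([len(s) for s in split_streams(trace)])  # -> [7, 4]
```

Here the first stream spans two basic blocks (7 instructions) because the first branch is not taken.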
13. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
14. Simulation Setup
- Modified version of SMTSIM [Tullsen96]
  - Trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45 KB
- Decode / rename width limited to 8 instructions
- Fetch width: 8/16 inst.
- Fetch buffer: 32 inst.
15. Workloads
- SPECint2000
  - Code layout optimized
    - Spike [Cohn97] + profile data using the train input
  - Most representative 300M-instruction trace
    - Using the ref input
- Workloads including 2, 4, 6, and 8 threads
  - Classified according to the threads' characteristics
    - ILP → only ILP benchmarks
    - MEM → memory-bounded benchmarks
    - MIX → a mix of ILP and MEM benchmarks
16. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
  - ILP workloads
  - MEM & MIX workloads
- Summary & Conclusions

Only for 2 & 4 threads (see paper for the rest)
17. ILP Workloads - Fetch Throughput
[Chart: fetch throughput]
- For a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16
  - Stream predictor → better fetch performance than 2.8
  - Gshare+BTB / gskew+FTB → worse fetch performance than 2.8
18. ILP Workloads: 1.X (1.8) vs 2.X (2.8)
[Chart: commit throughput]
- ILP benchmarks have few memory problems and high parallelism
  - The fetch unit is the real limiting factor
  - The higher the fetch throughput, the higher the IPC
19. ILP Workloads
- So... 2.X is better than 1.X in ILP workloads
- But what about 1.2X instead of 2.X?
  - That is, 1.16 instead of 2.8
- Maintain single-thread fetch
  - Cache lines and buses are already 16 instructions wide
  - We only have to modify the HW to select 16 instead of 8 instructions
20. ILP Workloads: 2.X (2.8) vs 1.2X (1.16)
[Chart: commit throughput]
- Similar or better performance than 2.16!
- With 1.16, the stream predictor increases throughput (9% avg.)
  - Streams are long enough for a 16-wide fetch
- Fetching a single basic block per cycle is not enough
  - Gshare+BTB → 10% slowdown
  - Gskew+FTB → 4% slowdown
21. MEM & MIX Workloads - Fetch Throughput
[Chart: fetch throughput]
- Same trend as the ILP fetch throughput
- For a given fetch BW, fetching from two threads is better
- Stream > gskew+FTB > gshare+BTB
22. MEM & MIX Workloads: 1.X (1.8) vs 2.X (2.8)
[Chart: commit throughput]
- With memory-bounded benchmarks... overall performance actually decreases!
  - Memory-bounded threads monopolize resources for many cycles
- Previously identified → new fetch policies
  - Flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads
23. MEM & MIX Workloads
- Fetching from only one thread means fetching only from the first, highest-priority thread
  - Allows the highest-priority thread to proceed with more resources
  - Prevents low-quality (lower-priority) threads from monopolizing more and more resources on cache misses
    - Registers, IQ slots, etc.
- Only the highest-priority thread is fetched
  - When the cache miss is resolved, instructions from the second thread will be consumed
  - ICOUNT will give it more priority after the cache-miss resolution
- A powerful fetch unit can be harmful if not well used
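The monopolization argument can be illustrated with a toy model; all parameters here (32-entry shared IQ, 100-cycle miss, 4 instructions per thread per cycle) are illustrative assumptions, not the paper's configuration:

```python
# Toy model of the slide's argument: a shared IQ, one thread stalled on a
# long cache miss. Under 2.X fetch the stalled thread keeps inserting
# instructions it cannot drain until the miss resolves, gradually hoarding
# the queue; under 1.X it loses fetch priority and stops inserting.
# All parameters are illustrative assumptions, not the paper's setup.
def free_iq_entries(iq_size, miss_cycles, per_thread_fetch, two_x):
    stalled_held = 0  # IQ entries held by the thread stalled on the miss
    for _ in range(miss_cycles):
        if two_x:  # the stalled thread still gets its fetch share
            stalled_held = min(iq_size, stalled_held + per_thread_fetch)
    return iq_size - stalled_held  # entries left for the running thread

print(free_iq_entries(32, 100, 4, two_x=True))   # -> 0 (queue monopolized)
print(free_iq_entries(32, 100, 4, two_x=False))  # -> 32
```

In this caricature the running thread is starved of IQ entries after a few cycles of 2.X fetch, which is the mechanism behind the commit-throughput loss on MEM workloads.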
24. MEM & MIX Workloads: 1.X (1.8) vs 1.2X (1.16)
[Chart: commit throughput]
- Even 2.16 has worse commit performance than 1.8
  - More interference introduced by low-quality threads
- Overall, 1.16 is the best combination
  - Low complexity → fetching from one thread
  - High performance → wide fetch
25. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
26. Summary
- The fetch unit is the most significant obstacle to obtaining high SMT performance
- However, researchers usually don't care about SMT fetch performance
  - They care about how to combine threads to maintain the available fetch throughput
  - A simple gshare/hybrid + BTB is commonly used
  - Everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive
  - Sharing implies competing
  - Low-quality threads monopolize more and more resources
27. Conclusions
- 1.16 (1.2X) is the best fetch option
  - Using a high-width fetch architecture
    - It's not the prediction accuracy, it's the fetch width
  - Beneficial for both ILP and MEM workloads
    - 1.X is bad for ILP
    - 2.X is bad for MEM
  - Fetches only from the most promising thread (according to the fetch policy), and as much as possible
  - Offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies
28. Thanks
29. Backup Slides
30. SMT Workloads
31. Simulation Setup