1
A Low-Complexity, High-Performance Fetch Unit for
Simultaneous Multithreading Processors
  • Ayose Falcón, Alex Ramirez, Mateo Valero
  • HPCA-10
  • February 18, 2004

2
Simultaneous Multithreading
  • SMT [Tullsen95] / Multistreaming [Yamamoto95]
  • Instructions from different threads coexist in
    each processor stage
  • Resources are shared among different threads
  • But...
  • Sharing implies competition
  • In caches, queues, FUs, ...
  • Fetch policy decides!

3
Motivation
  • SMT performance is limited by fetch performance
  • A superscalar fetch is not enough to feed an
    aggressive SMT core
  • SMT fetch is a bottleneck [Tullsen96, Burns99]
  • Straightforward solution: fetch from several
    threads each cycle
  • a) Multiple fetch units (1 per thread) →
    EXPENSIVE!
  • b) Shared fetch + fetch policy [Tullsen96]
  • Multiple PCs
  • Multiple branch predictions per cycle
  • Multiple I-cache accesses per cycle
  • Does the performance of this fetch organization
    compensate for its complexity? (sketched below)
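A minimal sketch of what a 2.X shared fetch must do each cycle, under toy assumptions (Thread, predict_block, icache_read, and all constants are illustrative, not the paper's simulator):

```python
# Toy model of one 2.X fetch cycle (all names are illustrative
# assumptions, not from the paper). The fetch policy picks two
# threads, and EACH pick needs its own branch prediction and
# I-cache access -- that duplication is the extra hardware cost.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    pc: int
    in_flight: int = 0          # insts past decode (used by ICOUNT)
    pending_miss: bool = False  # outstanding long-latency cache miss

def predict_block(pc):
    """Stand-in branch predictor: pretend every block is 6 insts."""
    return pc + 6 * 4           # next fetch PC after the block

def icache_read(pc, n):
    """Stand-in I-cache: return n dummy instruction addresses."""
    return [pc + 4 * i for i in range(n)]

def fetch_cycle_2x(threads, policy, width=8):
    picked = policy(threads)[:2]   # top 2 threads this cycle
    fetched = []
    for t in picked:               # 2 predictor ports + 2 cache ports
        next_pc = predict_block(t.pc)
        fetched += icache_read(t.pc, width // 2)  # 2.8: up to 4 each
        t.pc = next_pc
    return fetched                 # merged into the shared fetch buffer
```

The fetch policy argument is filled in by the ICOUNT sketch two slides below.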

4
Talk Outline
  • Motivation
  • Fetch Architectures for SMT
  • High-Performance Fetch Engines
  • Simulation Setup
  • Results
  • Summary & Conclusions

5
Fetching from a Single Thread (1.X)
(Diagram: branch predictor and instruction cache shared among the thread PCs)
  • Fine-grained, non-simultaneous sharing
  • Simple → similar to a superscalar fetch unit
  • No additional HW needed
  • A fetch policy is needed
  • Decides fetch priority among threads
  • Several proposals in the literature (ICOUNT is
    sketched below)
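One widely used policy is ICOUNT [Tullsen96], which favors the thread with the fewest in-flight instructions. A minimal sketch, reusing the toy Thread type above (the real policy counts per-stage occupancy; this is a simplification):

```python
# Minimal sketch of the ICOUNT fetch policy [Tullsen96]: order
# threads by how few instructions they have in the pre-issue
# pipeline stages, so fast-moving threads get fetch priority.
def icount(threads):
    """Return threads from highest to lowest fetch priority."""
    return sorted(threads, key=lambda t: t.in_flight)

# 1.X: fetch only from icount(threads)[0] this cycle.
# 2.X: fetch from the top two, e.g. fetch_cycle_2x(threads, icount).
```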

6
Fetching from a Single Thread (1.X)
  • But... a single thread is not enough to fill the
    fetch BW
  • Gshare / hybrid branch predictor + BTB limits
    fetch width to one basic block per cycle (6-8
    instructions)
  • Fetch BW is heavily underused
  • Avg 40% wasted with 1.8
  • Avg 60% wasted with 1.16
  • Fully use the fetch BW in only
  • 31% of fetch cycles with 1.8
  • 6% of fetch cycles with 1.16

7
Fetching from Multiple Threads (2.X)
  • Increases fetch throughput
  • More threads → more possibilities to fill fetch BW
  • More fetch BW use than 1.X
  • Fully use the fetch BW in
  • 54% of cycles with 2.8
  • 16% of cycles with 2.16

8
Fetching from Multiple Threads (2.X)
  • But... what is the additional HW cost of a 2.X
    fetch?

(Diagram: duplicated predictor and I-cache paths that MERGE into one fetch buffer)
9
Our Goal
  • Can we take the best of both worlds?
  • Low complexity of a 1.X fetch architecture
  • High performance of a 2.X fetch architecture
  • That is... can a single thread provide sufficient
    instructions to fill the available fetch
    bandwidth?

10
Talk Outline
  • Motivation
  • Fetch Architectures for SMT
  • High-Performance Fetch Engines
  • Simulation Setup
  • Results
  • Summary & Conclusions

11
High Performance Fetch Engines (I)
  • We look for high performance
  • Gshare / hybrid branch predictor + BTB
  • Low performance
  • Limit fetch BW to one basic block per cycle
  • 6-8 instructions
  • We look for low complexity
  • Trace cache, Branch Target Address Cache,
    Collapsing Buffer, etc.
  • Fetch multiple basic blocks per cycle
  • 12-16 instructions
  • High complexity

12
High Performance Fetch Engines (II)
  • Our alternatives
  • Gskew [Michaud97] + FTB [Reinman99]
  • FTB fetch blocks are larger than basic blocks
  • 5% speedup over gshare+BTB in superscalars
  • Stream Predictor [Ramirez02] (sketched below)
  • Streams are larger than FTB fetch blocks
  • 11% speedup over gskew+FTB in superscalars
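A rough sketch of the stream-predictor idea (a simplified, assumed interface, not the actual [Ramirez02] design): a single table lookup yields a whole stream (start PC, a length spanning several basic blocks, and the next stream's start), so one prediction per cycle can feed a very wide fetch.

```python
# Rough sketch of a next-stream predictor lookup (simplified; the
# real [Ramirez02] design differs in indexing and table layout).
# A "stream" is a sequential run of instructions ending in a taken
# branch, so it can span several basic blocks.
stream_table = {
    # start_pc: (stream_length_in_insts, next_stream_start_pc)
    0x1000: (14, 0x2040),   # toy entries
    0x2040: (9,  0x1000),
}

def predict_stream(pc):
    """One lookup predicts the whole stream and the next fetch PC."""
    return stream_table[pc]

length, next_pc = predict_stream(0x1000)
# A 16-wide fetch could consume min(length, 16) = 14 insts from ONE
# thread this cycle, versus the 6-8 of a single basic block.
```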

13
Talk Outline
  • Motivation
  • Fetch Architectures for SMT
  • High-Performance Fetch Engines
  • Simulation Setup
  • Results
  • Summary & Conclusions

14
Simulation Setup
  • Modified version of SMTSIM [Tullsen96]
  • Trace-driven, allowing wrong-path execution
  • Decoupled fetch (1 additional pipeline stage)
  • Branch predictor sizes of approx. 45KB
  • Decode & rename width limited to 8 instructions
  • Fetch width 8/16 inst.
  • Fetch buffer 32 inst.

15
Workloads
  • SPECint2000
  • Code layout optimized
  • Spike [Cohn97], profile data using train input
  • Most representative 300M instruction trace
  • Using ref input
  • Workloads including 2, 4, 6, and 8 threads
  • Classified according to thread characteristics
  • ILP → only ILP benchmarks
  • MEM → memory-bounded benchmarks
  • MIX → mix of ILP and MEM benchmarks

16
Talk Outline
  • Motivation
  • Fetch Architectures for SMT
  • High-Performance Fetch Engines
  • Simulation Setup
  • Results
  • ILP workloads
  • MEM & MIX workloads
  • Summary & Conclusions

Only for 2 & 4 threads (see paper for the rest)
17
ILP Workloads - Fetch Throughput
(Chart: fetch throughput)
  • With a given fetch bandwidth, fetching from two
    threads always benefits fetch performance
  • Critical point is 1.16
  • Stream predictor → better fetch performance than
    2.8
  • Gshare+BTB / gskew+FTB → worse fetch performance
    than 2.8

18
ILP Workloads 1.X (1.8) vs 2.X (2.8)
(Chart: commit throughput)
  • ILP benchmarks have few memory problems and high
    parallelism
  • Fetch unit is the real limiting factor
  • The higher the fetch throughput, the higher the
    IPC

19
ILP Workloads
  • So... 2.X is better than 1.X in ILP workloads
  • But, what about 1.2X instead of 2.X?
  • That is, 1.16 instead of 2.8
  • Maintain single-thread fetch
  • Cache lines and buses already 16-instruction wide
  • We have to modify the HW to select 16 instead of
    8 instructions (sketched below)
20
ILP Workloads 2.X (2.8) vs 1.2X (1.16)
(Chart: commit throughput; similar/better performance than 2.16!)
  • With 1.16, the stream predictor increases
    throughput (9% avg)
  • Streams are long enough for a 16-wide fetch
  • Fetching a single block per cycle is not enough
  • Gshare+BTB → 10% slowdown
  • Gskew+FTB → 4% slowdown

21
MEM & MIX Workloads - Fetch Throughput
(Chart: fetch throughput)
  • Same trend as the ILP fetch throughput
  • For a given fetch BW, fetching from two threads
    is better
  • Stream > gskew+FTB > gshare+BTB

22
MEM & MIX Workloads 1.X (1.8) vs 2.X (2.8)
(Chart: commit throughput)
  • With memory-bounded benchmarks... overall
    performance actually decreases!!
  • Memory-bounded threads monopolize resources for
    many cycles
  • Previously identified → new fetch policies
  • Flush [Tullsen01] or stall [Luo01, El-Moursy03]
    problematic threads

23
MEM & MIX workloads
  • Fetching from only one thread means fetching only
    from the first, highest-priority thread
  • Allows the highest-priority thread to proceed
    with more resources
  • Prevents low-quality (lower-priority) threads from
    monopolizing more and more resources on cache
    misses
  • Registers, IQ slots, etc.
  • Only the highest-priority thread is fetched
  • When the cache miss is resolved, instructions from
    the second thread will be consumed
  • ICOUNT will give it higher priority after the
    cache miss resolution
  • A powerful fetch unit can be harmful if not well
    used (see the gating sketch below)
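A minimal sketch of the stall-style gating such policies apply (a simplification in the spirit of [Luo01], reusing the toy Thread type; pending_miss is an assumed field, and the real policies differ in when they gate or flush):

```python
# Sketch of stall-style gating (in the spirit of [Luo01]; simplified):
# threads with an outstanding long-latency miss are excluded from
# fetch, so they cannot keep grabbing registers and IQ slots while
# they wait. The remaining threads are ordered by ICOUNT as before.
def gated_icount(threads):
    ready = [t for t in threads if not t.pending_miss]  # stall gating
    return sorted(ready, key=lambda t: t.in_flight)     # ICOUNT order
```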

24
MEM & MIX workloads 1.X (1.8) vs 1.2X (1.16)
(Chart: commit throughput)
  • Even 2.16 has worse commit performance than 1.8
  • More interference introduced by low-quality
    threads
  • Overall, 1.16 is the best combination
  • Low complexity → fetching from one thread
  • High performance → wide fetch

25
Talk Outline
  • Motivation
  • Fetch Architectures for SMT
  • High-Performance Fetch Engines
  • Simulation Setup
  • Results
  • Summary & Conclusions

26
Summary
  • The fetch unit is the most significant obstacle to
    obtaining high SMT performance
  • However, researchers usually don't care about SMT
    fetch performance
  • They care about how to combine threads to maintain
    the available fetch throughput
  • A simple gshare/hybrid + BTB is commonly used
  • Everybody assumes that 2.8 (2.X) is the correct
    answer
  • Fetching from many threads can be
    counterproductive
  • Sharing implies competing
  • Low-quality threads monopolize more and more
    resources

27
Conclusions
  • 1.16 (1.2X) is the best fetch option
  • Using a high-width fetch architecture
  • It's not the prediction accuracy, it's the fetch
    width
  • Beneficial for both ILP and MEM workloads
  • 1.X is bad for ILP
  • 2.X is bad for MEM
  • Fetches only from the most promising thread
    (according to the fetch policy), and as much as
    possible
  • Offers the best performance/complexity tradeoff
  • Fetching from a single thread may require
    revisiting current SMT fetch policies

28
Thanks
  • Questions & Answers

29
Backup Slides
30
SMT Workloads
31
Simulation Setup