Title: A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
1. A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
- Ayose Falcón, Alex Ramirez, Mateo Valero
- HPCA-10
- February 18, 2004
2. Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
  - Instructions from different threads coexist in each processor stage
  - Resources are shared among the different threads
- But...
  - Sharing implies competition
  - In caches, queues, FUs, ...
- The fetch policy decides!
3. Motivation
- SMT performance is limited by fetch performance
  - A superscalar fetch is not enough to feed an aggressive SMT core
  - SMT fetch is a bottleneck [Tullsen96, Burns99]
- Straightforward solution: fetch from several threads each cycle
  - a) Multiple fetch units (1 per thread) → EXPENSIVE!
  - b) Shared fetch + fetch policy [Tullsen96]
    - Multiple PCs
    - Multiple branch predictions per cycle
    - Multiple I-cache accesses per cycle
- Does the performance of this fetch organization compensate for its complexity?
4. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
5. Fetching from a Single Thread (1.X)
[Diagram: Branch Predictor + Instruction Cache]
- Fine-grained, non-simultaneous sharing
- Simple → similar to a superscalar fetch unit
  - No additional HW needed
- A fetch policy is needed
  - Decides fetch priority among threads
  - Several proposals in the literature
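One such policy, referenced later in this deck, is ICOUNT [Tullsen96]: give fetch priority to the thread with the fewest instructions in the pre-issue pipeline stages. A minimal sketch, assuming a simple per-thread in-flight count (the data layout is illustrative, not SMTSIM's):

```python
# Minimal sketch of an ICOUNT-style fetch policy [Tullsen96]: each cycle,
# give fetch priority to the thread with the fewest instructions in the
# decode/rename/issue stages. Data layout is an illustrative assumption.

def icount_pick(inflight_counts):
    """Return the id of the thread with the fewest in-flight instructions."""
    return min(range(len(inflight_counts)), key=lambda t: inflight_counts[t])

# Thread 2 has only 3 in-flight instructions, so it fetches this cycle.
print(icount_pick([12, 7, 3, 9]))  # -> 2
```

The count naturally throttles threads that clog the queues: a thread stalled in the back-end accumulates in-flight instructions and loses fetch priority.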
6. Fetching from a Single Thread (1.X)
- But... a single thread is not enough to fill the fetch BW
  - A gshare / hybrid branch predictor + BTB limits fetch width to one basic block per cycle (6-8 instructions)
- Fetch BW is heavily underused
  - Avg. 40% wasted with 1.8
  - Avg. 60% wasted with 1.16
- The fetch BW is fully used in only a fraction of cycles
  - 31% of fetch cycles with 1.8
  - 6% of fetch cycles with 1.16
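The underuse follows directly from block size versus fetch width. A back-of-the-envelope sketch, assuming a 7-instruction average block (within the slide's 6-8 range); the slide's 40%/60% figures are measured averages, so the toy numbers only illustrate the trend:

```python
# Back-of-the-envelope: fraction of fetch slots wasted when one basic
# block is fetched per cycle. The 7-instruction average is an assumption
# within the slide's 6-8 range; the slide's 40%/60% values are measured.
def wasted_fraction(avg_block_size, fetch_width):
    used = min(avg_block_size, fetch_width)
    return 1 - used / fetch_width

print(wasted_fraction(7, 8))   # -> 0.125   (1.8: ~1/8 of slots empty)
print(wasted_fraction(7, 16))  # -> 0.5625  (1.16: over half empty)
```

The wider the fetch, the worse a single basic block fills it, which is the gap the rest of the talk addresses.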
7. Fetching from Multiple Threads (2.X)
- Increases fetch throughput
  - More threads → more possibilities to fill the fetch BW
  - More fetch BW use than 1.X
- The fetch BW is fully used more often
  - 54% of cycles with 2.8
  - 16% of cycles with 2.16
8. Fetching from Multiple Threads (2.X)
- But... what is the additional HW cost of a 2.X fetch?
[Diagram: MERGE of the instructions fetched from the two threads]
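Functionally, a 2.X fetch needs two PCs, two branch predictions, and two I-cache accesses per cycle, then merges the results. A toy sketch of the merge step, primary thread first (names and bundle sizes are illustrative):

```python
# Toy sketch of the 2.X merge step: combine two per-thread fetch bundles,
# primary thread first, up to the total fetch bandwidth. Illustrative only;
# real hardware also needs duplicated predictor and I-cache ports.
def merge_fetch(primary, secondary, fetch_width):
    merged = primary[:fetch_width]
    merged += secondary[:fetch_width - len(merged)]
    return merged

a = ["a0", "a1", "a2", "a3", "a4"]        # 5 insts from thread A
b = ["b0", "b1", "b2", "b3", "b4", "b5"]  # 6 insts from thread B
print(merge_fetch(a, b, 8))
# -> ['a0', 'a1', 'a2', 'a3', 'a4', 'b0', 'b1', 'b2']
```

The merge itself is cheap; the expensive parts are the duplicated prediction and I-cache ports feeding it.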
9. Our Goal
- Can we take the best of both worlds?
  - The low complexity of a 1.X fetch architecture
  - The high performance of a 2.X fetch architecture
- That is... can a single thread provide sufficient instructions to fill the available fetch bandwidth?
10. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
11. High-Performance Fetch Engines (I)
- We look for high performance
  - Gshare / hybrid branch predictor + BTB
    - Low performance
    - Limits fetch BW to one basic block per cycle (6-8 instructions)
- We look for low complexity
  - Trace cache, Branch Target Address Cache, Collapsing Buffer, etc.
    - Fetch multiple basic blocks per cycle (12-16 instructions)
    - High complexity
12. High-Performance Fetch Engines (II)
- Our alternatives
  - Gskew [Michaud97] + FTB [Reinman99]
    - FTB fetch blocks are larger than basic blocks
    - 5% speedup over gshare+BTB in superscalars
  - Stream Predictor [Ramirez02]
    - Streams are larger than FTB fetch blocks
    - 11% speedup over gskew+FTB in superscalars
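A stream [Ramirez02] is the run of sequential instructions ending only at a taken branch, so it can span several basic blocks (not-taken branches do not end it). A toy sketch of why streams are longer than basic blocks (the trace encoding is an illustrative assumption):

```python
# Toy sketch: split an instruction trace into fetch streams, i.e. runs
# that end only at *taken* branches. Basic blocks end at every branch,
# so streams are at least as long -- which is why a stream predictor can
# feed a 16-wide fetch. Encoding ('i' = plain instruction, 'bN' =
# not-taken branch, 'bT' = taken branch) is an illustrative assumption.
def split_streams(trace):
    streams, cur = [], []
    for inst in trace:
        cur.append(inst)
        if inst == "bT":  # only a taken branch ends a stream
            streams.append(cur)
            cur = []
    if cur:
        streams.append(cur)
    return streams

trace = ["i", "i", "bN", "i", "i", "i", "bT", "i", "bN", "i", "bT"]
print([len(s) for s in split_streams(trace)])  # -> [7, 4]
```

Here the first stream spans two basic blocks (7 instructions) because the first branch is not taken.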
13. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
14. Simulation Setup
- Modified version of SMTSIM [Tullsen96]
  - Trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45 KB
- Decode / rename width limited to 8 instructions
- Fetch width: 8/16 inst.
- Fetch buffer: 32 inst.
15. Workloads
- SPECint2000
  - Code layout optimized
    - Spike [Cohn97] + profile data using the train input
  - Most representative 300M-instruction trace
    - Using the ref input
- Workloads including 2, 4, 6, and 8 threads
  - Classified according to the threads' characteristics
    - ILP → only ILP benchmarks
    - MEM → memory-bounded benchmarks
    - MIX → a mix of ILP and MEM benchmarks
16. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
  - ILP workloads
  - MEM & MIX workloads
- Summary & Conclusions

Only for 2 & 4 threads (see paper for the rest)
17. ILP Workloads - Fetch Throughput
[Chart: fetch throughput]
- For a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16
  - Stream predictor → better fetch performance than 2.8
  - Gshare+BTB / gskew+FTB → worse fetch performance than 2.8
18. ILP Workloads: 1.X (1.8) vs 2.X (2.8)
[Chart: commit throughput]
- ILP benchmarks have few memory problems and high parallelism
  - The fetch unit is the real limiting factor
  - The higher the fetch throughput, the higher the IPC
19. ILP Workloads
- So... 2.X is better than 1.X in ILP workloads
- But what about 1.2X instead of 2.X?
  - That is, 1.16 instead of 2.8
- Maintain single-thread fetch
  - Cache lines and buses are already 16 instructions wide
  - We only have to modify the HW to select 16 instead of 8 instructions
20. ILP Workloads: 2.X (2.8) vs 1.2X (1.16)
[Chart: commit throughput]
- Similar or better performance than 2.16!
- With 1.16, the stream predictor increases throughput (9% avg.)
  - Streams are long enough for a 16-wide fetch
- Fetching a single basic block per cycle is not enough
  - Gshare+BTB → 10% slowdown
  - Gskew+FTB → 4% slowdown
21. MEM & MIX Workloads - Fetch Throughput
[Chart: fetch throughput]
- Same trend as the ILP fetch throughput
- For a given fetch BW, fetching from two threads is better
- Stream > gskew+FTB > gshare+BTB
22. MEM & MIX Workloads: 1.X (1.8) vs 2.X (2.8)
[Chart: commit throughput]
- With memory-bounded benchmarks... overall performance actually decreases!
  - Memory-bounded threads monopolize resources for many cycles
- Previously identified → new fetch policies
  - Flush [Tullsen01] or stall [Luo01, El-Moursy03] problematic threads
23. MEM & MIX Workloads
- Fetching from only one thread means fetching only from the first, highest-priority thread
  - Allows the highest-priority thread to proceed with more resources
  - Prevents low-quality (lower-priority) threads from monopolizing more and more resources on cache misses
    - Registers, IQ slots, etc.
- Only the highest-priority thread is fetched
  - When the cache miss is resolved, instructions from the second thread will be consumed
  - ICOUNT will give it more priority after the cache-miss resolution
- A powerful fetch unit can be harmful if not well used
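The monopolization argument can be illustrated with a toy model; all parameters here (32-entry shared IQ, 100-cycle miss, 4 instructions per thread per cycle) are illustrative assumptions, not the paper's configuration:

```python
# Toy model of the slide's argument: a shared IQ, one thread stalled on a
# long cache miss. Under 2.X fetch the stalled thread keeps inserting
# instructions it cannot drain until the miss resolves, gradually hoarding
# the queue; under 1.X it loses fetch priority and stops inserting.
# All parameters are illustrative assumptions, not the paper's setup.
def free_iq_entries(iq_size, miss_cycles, per_thread_fetch, two_x):
    stalled_held = 0  # IQ entries held by the thread stalled on the miss
    for _ in range(miss_cycles):
        if two_x:  # the stalled thread still gets its fetch share
            stalled_held = min(iq_size, stalled_held + per_thread_fetch)
    return iq_size - stalled_held  # entries left for the running thread

print(free_iq_entries(32, 100, 4, two_x=True))   # -> 0 (queue monopolized)
print(free_iq_entries(32, 100, 4, two_x=False))  # -> 32
```

In this caricature the running thread is starved of IQ entries after a few cycles of 2.X fetch, which is the mechanism behind the commit-throughput loss on MEM workloads.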
24. MEM & MIX Workloads: 1.X (1.8) vs 1.2X (1.16)
[Chart: commit throughput]
- Even 2.16 has worse commit performance than 1.8
  - More interference introduced by low-quality threads
- Overall, 1.16 is the best combination
  - Low complexity → fetching from one thread
  - High performance → wide fetch
25. Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
26. Summary
- The fetch unit is the most significant obstacle to obtaining high SMT performance
- However, researchers usually don't care about SMT fetch performance
  - They care about how to combine threads to maintain the available fetch throughput
  - A simple gshare/hybrid + BTB is commonly used
  - Everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive
  - Sharing implies competing
  - Low-quality threads monopolize more and more resources
27. Conclusions
- 1.16 (1.2X) is the best fetch option
  - Using a high-width fetch architecture
    - It's not the prediction accuracy, it's the fetch width
  - Beneficial for both ILP and MEM workloads
    - 1.X is bad for ILP
    - 2.X is bad for MEM
  - Fetches only from the most promising thread (according to the fetch policy), and as much as possible
  - Offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies
28. Thanks
29. Backup Slides
30. SMT Workloads
31. Simulation Setup