Title: A 256 Kbits L-TAGE branch predictor
1 A 256 Kbits L-TAGE branch predictor
- André Seznec
- IRISA/INRIA/HIPEAC
2 Directly derived from
- "A case for (partially) tagged branch predictors", A. Seznec and P. Michaud, JILP, Feb. 2006
- Tricks
- Loop predictor
- Kernel/user histories
3 TAGE: TAgged GEometric history length predictors
The genesis
4 Back around 2003
- 2bcgskew was state-of-the-art,
- but was lagging behind neural-inspired predictors on a few benchmarks
- Just wanted to get the best of both behaviors and maintain
- Reasonable implementation cost
- Use only global history
- Medium number of tables
- In-time response
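5 The basis: a multiple-length global history predictor
[Figure: predictor tables T0 to T4, each indexed with its own global history length L(0) to L(4); how their outputs are combined is left open ("?")]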
6 GEometric History Length predictor
The set of history lengths forms a geometric series
Capture correlation on very long histories
{0, 2, 4, 8, 16, 32, 64, 128}
Most of the storage is for short histories!!
What is important: L(i) - L(i-1) is drastically increasing
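A minimal sketch of how such a series can be generated. The function name, the parameters `l1` and `alpha`, and the rounding rule `int(alpha**(i-1) * l1 + 0.5)` are assumptions made for illustration; the slide only states that the lengths form a geometric series.

```python
def geometric_history_lengths(l1: int, alpha: float, n_tables: int) -> list[int]:
    """Return [L(0), L(1), ..., L(n_tables)] with L(0) = 0.

    Assumed rule: L(i) = int(alpha**(i-1) * l1 + 0.5) for i >= 1.
    """
    lengths = [0]  # L(0): the first table uses no history
    for i in range(1, n_tables + 1):
        lengths.append(int(alpha ** (i - 1) * l1 + 0.5))
    return lengths


lengths = geometric_history_lengths(l1=2, alpha=2.0, n_tables=7)
print(lengths)  # [0, 2, 4, 8, 16, 32, 64, 128]

# Gaps between consecutive lengths: most tables cover short histories,
# while the last few reach very far back.
gaps = [b - a for a, b in zip(lengths, lengths[1:])]
print(gaps)  # [2, 2, 4, 8, 16, 32, 64]
```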
7 Combining multiple predictions?
- Classical solution
- Use of a meta-predictor
- wasting storage !?!
- choosing among 5 or 10 predictions ??
- Neural-inspired predictors, Jimenez and Lin 2001
- Use an adder tree instead of a meta-predictor
- Partial matching
- Use tagged tables and the longest matching history
- Chen et al. 96, Michaud 2005
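8 CBP-1 (2004): OGEHL
Final computation through a sum
[Figure: tables indexed with history lengths starting at L(0); their outputs are summed and Prediction = Sign of the sum]
12 components: 3.670 misp/KI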
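The "sum, then sign" computation can be sketched as follows. The signed-counter encoding and the centering term (half the number of counters) are assumptions for illustration, not details given on the slide.

```python
def sum_predict(counters: list[int]) -> bool:
    """Predict taken when the (centered) sum of signed counters is >= 0.

    counters: one signed saturating counter read from each table.
    Assumption: the sum is centered by adding len(counters) / 2.
    """
    total = len(counters) / 2 + sum(counters)
    return total >= 0


print(sum_predict([3, -1, -2, 1]))   # True  (2 + 1 = 3)
print(sum_predict([-4, 1, -2, 0]))   # False (2 - 5 = -3)
```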
9 TAGE: geometric history length + PPM-like + optimized update policy
Tagless base predictor
11 Prediction computation
- General case
- Longest matching component provides the prediction
- Special case
- Many mispredictions on newly allocated entries (weak Ctr)
- On many applications, Altpred is more accurate than Pred
- Property dynamically monitored through a single 4-bit counter
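The selection rule above can be sketched like this. The signed counter encoding, the name `use_alt_on_weak` for the 4-bit monitoring counter, and its sign convention are assumptions for illustration; "newly allocated" is approximated here by a weak Ctr only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: int
    ctr: int  # assumed signed 3-bit counter in [-4, 3]; taken when ctr >= 0


def is_weak(ctr: int) -> bool:
    # The two counter values closest to the taken/not-taken boundary.
    return ctr in (-1, 0)


def tage_predict(base_taken: bool,
                 hits: list[Optional[Entry]],
                 use_alt_on_weak: int) -> tuple[bool, bool]:
    """Return (final prediction, altpred).

    hits[i] is the tag-matching entry of tagged table i (shortest history
    first) or None on a tag miss. use_alt_on_weak is the assumed 4-bit
    signed monitor: >= 0 means "trust altpred when the provider is weak".
    """
    matching = [e for e in hits if e is not None]
    if not matching:
        return base_taken, base_taken       # only the tagless base predictor
    provider = matching[-1]                  # longest matching history
    pred = provider.ctr >= 0
    altpred = matching[-2].ctr >= 0 if len(matching) > 1 else base_taken
    if is_weak(provider.ctr) and use_alt_on_weak >= 0:
        return altpred, altpred              # special case
    return pred, altpred                     # general case


hits = [Entry(tag=1, ctr=-3), None, Entry(tag=2, ctr=2)]
print(tage_predict(True, hits, use_alt_on_weak=0))  # (True, False)
```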
12 TAGE update policy
- General principle
- Minimize the footprint of the prediction.
- Just update the longest history matching component and allocate at most one entry on mispredictions
13 A tagged table entry
- Ctr: 3-bit prediction counter
- U: 2-bit useful counter
- Was the entry recently useful?
- Tag: partial tag
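As an illustration of the entry layout, here is a sketch that packs the three fields into one integer. The field order (Ctr in the low bits, then U, then Tag) and the helper names are assumptions; the slide gives only the field widths.

```python
CTR_BITS = 3  # prediction counter width, from the slide
U_BITS = 2    # useful counter width, from the slide


def entry_bits(tag_bits: int) -> int:
    """Total storage for one tagged entry with a tag of tag_bits bits."""
    return CTR_BITS + U_BITS + tag_bits


def pack_entry(ctr: int, u: int, tag: int, tag_bits: int) -> int:
    # Assumed layout: [ tag | u | ctr ], ctr stored as an unsigned field.
    assert 0 <= ctr < (1 << CTR_BITS)
    assert 0 <= u < (1 << U_BITS)
    assert 0 <= tag < (1 << tag_bits)
    return (tag << (CTR_BITS + U_BITS)) | (u << CTR_BITS) | ctr


def unpack_entry(word: int, tag_bits: int) -> tuple[int, int, int]:
    ctr = word & ((1 << CTR_BITS) - 1)
    u = (word >> CTR_BITS) & ((1 << U_BITS) - 1)
    tag = (word >> (CTR_BITS + U_BITS)) & ((1 << tag_bits) - 1)
    return ctr, u, tag


word = pack_entry(ctr=5, u=2, tag=0x55, tag_bits=7)
print(unpack_entry(word, tag_bits=7))  # (5, 2, 85)
print(entry_bits(7))                   # 12
```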
14 Updating the U counter
- If (Altpred ≠ Pred) then
- Pred = taken: U = U + 1
- Pred ≠ taken: U = U - 1
- Graceful aging
- Periodic shift of all U counters
- implemented through the reset of a single bit
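The update and aging rules can be sketched as below. Saturation at 0 and 3 follows from the 2-bit width; clearing one bit of every counter per aging period, with the caller choosing which bit, is an assumption about how "the reset of a single bit" is realized.

```python
U_MAX = 3  # 2-bit useful counter


def update_useful(u: int, pred: bool, altpred: bool, taken: bool) -> int:
    """Update U of the provider entry; only when Altpred differs from Pred."""
    if pred == altpred:
        return u                      # the entry made no difference
    if pred == taken:
        return min(u + 1, U_MAX)      # entry was useful
    return max(u - 1, 0)              # entry was harmful


def age_useful(u_counters: list[int], clear_msb: bool) -> list[int]:
    """Graceful aging: clear a single bit in every U counter.

    Assumption: the caller alternates clear_msb between aging periods.
    """
    keep_mask = 0b01 if clear_msb else 0b10
    return [u & keep_mask for u in u_counters]


print(update_useful(1, pred=True, altpred=False, taken=True))  # 2
print(age_useful([3, 2, 1, 0], clear_msb=True))                # [1, 0, 1, 0]
```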
15 Allocating a new entry on a misprediction
- Find a single useless entry with a longer history
- Privilege the smallest possible history
- To minimize footprint
- But not too much
- To avoid ping-pong phenomena
- Initialize Ctr as weak and U as zero
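One way to sketch this policy: among the tables with a longer history whose entry has U = 0, pick the shortest with probability 1/2, the next with 1/4, and so on. The probabilities, the function names, and the signed Ctr encoding are assumptions chosen to illustrate "privilege the smallest history, but not too much".

```python
import random
from typing import Optional


def choose_allocation(u_values: list[int], provider: int,
                      rng: random.Random) -> Optional[int]:
    """Return the index of the tagged table to allocate in, or None.

    u_values[i]: U counter of the candidate entry in tagged table i
    (index 0 = shortest history). provider: index of the table that
    provided the prediction, or -1 for the tagless base predictor.
    """
    candidates = [i for i in range(provider + 1, len(u_values))
                  if u_values[i] == 0]
    if not candidates:
        return None                   # no useless entry with longer history
    for i in candidates[:-1]:
        if rng.random() < 0.5:        # shorter histories are favored...
            return i
    return candidates[-1]             # ...but longer ones still get a chance


def new_entry(tag: int, taken: bool) -> dict:
    # Ctr initialized as weak in the direction of the outcome, U as zero.
    return {"tag": tag, "ctr": 0 if taken else -1, "u": 0}


rng = random.Random(0)
print(choose_allocation([1, 0, 1, 1], provider=0, rng=rng))  # 1
print(new_entry(tag=0x2A, taken=False))
```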
16 Improve the global history
- Address conditional branch history
- path confusion on short histories ?
- Address path
- Direct hashing leads to path confusion ?
- Represent all branches in branch history
- Use also path history (1 bit per branch, limited to 16 bits)
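A sketch of the two histories being maintained side by side. Which address bit feeds the path history (the lowest bit of `pc` here) and recording unconditional branches as taken are assumptions; the slide states only "1 bit per branch, limited to 16 bits" and that all branches are represented.

```python
PATH_BITS = 16  # path history length, from the slide


def update_histories(ghist: int, phist: int, pc: int, taken: bool,
                     ghist_bits: int) -> tuple[int, int]:
    """Push one branch into the global and path histories.

    Called for every branch, conditional or not (assumption: an
    unconditional branch is pushed with taken=True).
    """
    ghist = ((ghist << 1) | int(taken)) & ((1 << ghist_bits) - 1)
    # Assumption: the path bit is the lowest bit of the branch address.
    phist = ((phist << 1) | (pc & 1)) & ((1 << PATH_BITS) - 1)
    return ghist, phist


g, p = update_histories(0b1011, 0xFFFF, pc=0x400, taken=True, ghist_bits=4)
print(bin(g), hex(p))  # 0b111 0xfffe
```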
17 Design tradeoff for CBP2 (1)
- 13 components
- Bring the best accuracy on distributed traces
- 8 components not very far!
- History length
- Min = 4, Max = 640
- Could use any Min in [2, 6] and any Max in [300, 2000]
18 Design tradeoff for CBP2 (2)
- Tag width tradeoff
- (destructive) false match is better tolerated on shorter history
- 7 bits on T1 to 15 bits on T12
- Tuning the number of table entries
- Smaller number for very long histories
- Smaller number for very short histories
19 Adding a loop predictor
- The loop predictor captures the number of iterations of a loop
- When the same number of iterations has been encountered 4 times in succession, the loop predictor provides the prediction.
- Advantages
- Very reliable
- Small storage budget: 256 52-bit entries
- Complexity?
- Might be difficult to manage speculative iteration numbers on deep pipelines
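A minimal sketch of one loop predictor entry, assuming the loop branch is taken while iterating and not taken on exit. The field names and the exact confidence bookkeeping are assumptions; tags, table indexing, and speculative iteration counts are left out.

```python
from dataclasses import dataclass
from typing import Optional

CONF_THRESHOLD = 4  # same iteration count seen 4 times in succession


@dataclass
class LoopEntry:
    past_iter: int = 0      # iteration count of the last completed loop
    current_iter: int = 0   # iterations seen so far in the current loop
    confidence: int = 0     # successive loops with the same count

    def predict(self) -> Optional[bool]:
        """Return the predicted direction, or None if not yet confident."""
        if self.confidence < CONF_THRESHOLD:
            return None
        return self.current_iter < self.past_iter  # taken until the last one

    def update(self, taken: bool) -> None:
        if taken:
            self.current_iter += 1
            return
        # Loop exit: compare with the previously recorded iteration count.
        if self.current_iter == self.past_iter:
            self.confidence = min(self.confidence + 1, CONF_THRESHOLD)
        else:
            self.past_iter = self.current_iter
            self.confidence = 1
        self.current_iter = 0


entry = LoopEntry()
for _ in range(4):                       # train on a 3-iteration loop
    for outcome in (True, True, True, False):
        entry.update(outcome)

predictions = []
for outcome in (True, True, True, False):
    predictions.append(entry.predict())
    entry.update(outcome)
print(predictions)  # [True, True, True, False]
```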
20 Using a kernel history and a user history
- Traces mix user and kernel activities
- Kernel activity after exception
- Global history pollution
- Solution: use two separate global histories
- User history is updated only in user mode
- Kernel history is updated in both modes
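The two update rules can be sketched as follows. The class name, the default history width, and the choice of which history is read in each mode (user history in user mode, kernel history in kernel mode) are assumptions for illustration.

```python
class SplitHistory:
    """Two global histories, selected by the current execution mode."""

    def __init__(self, bits: int = 640) -> None:
        self.mask = (1 << bits) - 1
        self.user = 0
        self.kernel = 0

    def update(self, taken: bool, kernel_mode: bool) -> None:
        bit = int(taken)
        # Kernel history is updated in both modes.
        self.kernel = ((self.kernel << 1) | bit) & self.mask
        # User history is updated only in user mode.
        if not kernel_mode:
            self.user = ((self.user << 1) | bit) & self.mask

    def current(self, kernel_mode: bool) -> int:
        # Assumption: each mode predicts with its own history.
        return self.kernel if kernel_mode else self.user


h = SplitHistory(bits=8)
h.update(True, kernel_mode=False)   # user branch
h.update(False, kernel_mode=True)   # kernel branch after an exception
h.update(True, kernel_mode=True)    # kernel branch
print(bin(h.current(kernel_mode=False)))  # 0b1   (unpolluted by the kernel)
print(bin(h.current(kernel_mode=True)))   # 0b101
```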
21 L-TAGE submission accuracy (distributed traces)
3.314 misp/KI
22 Reducing L-TAGE complexity
- Included 241.5 Kbits TAGE predictor
- 3.368 misp/KI
- Loop predictor beneficial only on gzip
- Might not be worth the extra complexity
23 Using fewer tables
- 8-component 256 Kbits TAGE predictor
- 3.446 misp/KI
24 TAGE prediction computation time?
- 3 successive steps
- Index computation
- Table read
- Partial match multiplexer
- Does not fit in a single cycle
- But can be ahead pipelined!
25 Ahead pipelining a global history branch predictor (principle)
- Initiate branch prediction X+1 cycles in advance to provide the prediction in time
- Use information available
- X-block ahead instruction address
- X-block ahead history
- To ensure accuracy
- Use intermediate path information
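26 Practice
[Figure: instruction blocks A, B, C with intermediate path information bc; the prediction is initiated from Ha and A]
Ahead pipelined TAGE: 4 parallel prediction computations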
27 3-branch ahead pipelined 8-component 256 Kbits TAGE
3.552 misp/KI
28 A final case for the Geometric History Length predictors
- delivers state-of-the-art accuracy
- uses only global information
- Very long history 300 bits !!
- can be ahead pipelined
- many effective design points
- OGEHL or TAGE ?
- Nb of tables, history lengths
29 The End?