Title: Design tradeoffs for the Alpha EV8 Conditional Branch Predictor
1Design tradeoffs for the Alpha EV8 Conditional
Branch Predictor
- André Seznec, IRISA/INRIA
- Stephen Felix, Intel
- Venkata Krishnan, Stargen Inc
- Yiannakis Sazeides, University of Cyprus
2Alpha EV8 (cancelled June 2001)
- SMT 4 threads
- wide-issue superscalar processor
- 8-way issue
- 512 registers
Single process performance is the goal
Multiprocess performance is the bonus
5-10% overhead for SMT
3Challenges on the EV8 conditional branch predictor
- High accuracy is needed
- 14 cycles minimum miss penalty
- silicon budget is large, but don't waste it
- Up to 16 predictions per cycle
- from two non-contiguous fetch blocks!
- history vector(s) must be updated with 0 to 16 branch outcomes !
- Branch history information is 3 fetch blocks old
- Various implementation constraints
- master the number of physical memory arrays
- use of single-ported memory cells
- timing constraints on indexing functions
4Alpha EV8 front-end pipeline
- Fetches up to two 8-instruction blocks per cycle from the I-cache
- a block ends either on an aligned 8-instruction end or on a taken control flow
- up to 16 conditional branches fetched and predicted per cycle
- Next two block addresses must be predicted in a single cycle
- critical path: use of a line predictor backed with a complex PC address generator: conditional branch predictor, RAS, jump predictor, ...
5instruction fetch blocks on EV8
6PC address generation pipeline
C and D
A and B
Y and Z
PC address generation is completed
Line prediction is completed
Prediction table read is completed
7Minimum background in branch prediction (0)
- Only direction is an issue
- targets are recomputed on the fly
8Minimum background in branch prediction (1)
- Use the past to predict the current branch
- read and update tables
- 2-bit counters: prediction + hysteresis
- index
- address
- global branch history: what happened with the last few branches
- a single history vector
- or local branch history: what happened the last times this branch was executed
- must maintain a table of histories
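A minimal sketch of the ideas above: a table of 2-bit saturating counters, indexed with the branch address combined with a single global history vector. The XOR combination used here is the classic gshare indexing, chosen only for illustration; it is not the EV8 indexing, and the class name, table size, and initial counter value are assumptions.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters indexed by address XOR global history."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counter values 0..3; 1 = weakly not-taken (arbitrary initial state).
        self.counters = [1] * (1 << index_bits)
        self.ghist = 0  # single global history vector, newest outcome in bit 0

    def _index(self, pc):
        return (pc ^ self.ghist) & self.mask

    def predict(self, pc):
        # The most significant bit of the counter gives the direction.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into the global history.
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask
```

Training an always-taken branch for a few dozen iterations is enough for the history to stabilize and the selected counter to saturate toward taken.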
9Minimum background on branch prediction (2)
- Problems
- global or local schemes ?
- which precise scheme ?
- Interferences in tables may ruin accuracy
10Global vs local history
- 16 local history reads
- 2-ported history read
- 16-ported prediction table ?
- Speculative history
- up to 256 branches inflight
- 3 fetch blocks old history
- tight loops ?
- SMT sharing is disastrous
- Global history
- bank-interleaved prediction tables
- even no arbitration !
- Speculative history
- 3 fetch blocks old
- not a real issue !
- use of path !
- SMT sharing may be constructive
11EV8 predictor (derived from) (2Bc-gskew)
122Bc-gskew hybrid skewed predictor
- Leverages
- de-aliased predictor e-gskew
- bimodal for easy-to-predict branches
- State-of-the-art global history branch predictor
- at least (published) in 1999 :-)
- along with the YAGS predictor :-)
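The slide does not spell out how the components are combined. As published for 2Bc-gskew, the e-gskew prediction is a majority vote over the bimodal table and the two skewed global-history tables, and a meta predictor chooses between that vote and the bimodal prediction alone. A minimal sketch, assuming boolean component predictions and the convention that a true meta output selects e-gskew (function names are illustrative):

```python
def majority(a, b, c):
    """Majority vote over three boolean predictions."""
    return (a and b) or (a and c) or (b and c)


def predict_2bc_gskew(bim, g0, g1, meta):
    """Combine component predictions (True = taken).

    bim      -- bimodal table prediction
    g0, g1   -- the two global-history table predictions
    meta     -- assumed convention: True selects e-gskew, False selects bimodal
    """
    egskew = majority(bim, g0, g1)
    return bool(egskew if meta else bim)
```

For example, with `bim=True, g0=False, g1=False` the result is not-taken when the meta predictor selects e-gskew and taken when it selects the bimodal table.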
132Bc-gskew degrees of freedom: partial update policy
- on correct predictions, only updates correct components
- do not destroy other predictions
- two tricks to further minimize write pressure
- better accuracy !
- USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS !!
- On correct predictions
- prediction bit is only read
- hysteresis bit is only written
142Bc-gskew degrees of freedom (2): sharing hysteresis bits
- Using 2-bit counters
- strong states occur more often than weak states
- very small loss of accuracy when sharing
hysteresis bit between 2 or 4 counters
152Bc-gskew degrees of freedom (3)
- Different applications, different optimal history lengths
- use of different indexing functions allows different history lengths for the predictor tables
- smooths the difficulty
- Different prediction table sizes
- the bimodal table may be smaller than the other tables :-)
16EV8 predictor
Smaller bimodal table
Different history lengths
17Dealing with implementation constraints
18Issues on global history
Blocks A and B
Blocks Y and Z
Blocks C and D
Branch info from C, B and A is not valid to predict D!
On each cycle, from 0 to 16 branches are predicted: 0 to 16 bits to be inserted in the history vector !?
19Block compressed history: lghist
- Incorporate at most one bit in the history per fetch block
- 0, 1 or 2 bits to be incorporated in history vector per cycle
- Which bit ?
- Direction of the last conditional branch in the block
- previous ones are not taken
- XORed with position (1st half / 2nd half) in the block
- more uniform distribution of the history vectors
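A minimal sketch of the block compressed history described above. It assumes each fetch block is given as a list of `(slot, taken)` pairs for its conditional branches in fetch order, with `slot` in 0..7, and that a block without any conditional branch inserts nothing; the function names, the slot encoding, and the default history length are illustrative assumptions.

```python
def lghist_bit(block_branches):
    """Return the single history bit for one fetch block, or None.

    block_branches -- list of (slot, taken) for the conditional branches
                      of the block, in fetch order; slot is 0..7.
    """
    if not block_branches:
        return None  # assumption: no conditional branch, nothing inserted
    slot, taken = block_branches[-1]      # last conditional branch in the block
    second_half = 1 if slot >= 4 else 0   # position: 1st half / 2nd half
    return int(taken) ^ second_half


def update_lghist(lghist, block_branches, length=21):
    """Shift at most one bit per fetch block into the history vector."""
    bit = lghist_bit(block_branches)
    if bit is None:
        return lghist
    return ((lghist << 1) | bit) & ((1 << length) - 1)
```

Since a block ends on a taken control flow, every conditional branch before the last one in the block is known to be not taken, which is why only the last direction carries information.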
20instruction fetch blocks on EV8
1 is inserted
1 is inserted
21The EV8 branch predictor information vector
- History information is not available for the three last blocks A, B, and C
- but, addresses are available !!
Information vector to index the predictor
1. Instruction address
2. Lghist (3-blocks-old history path)
3. Path info on the last three blocks
22Using single-ported memory arrays
The challenge: 16 predictions to be performed per cycle from two non-contiguous blocks !
8 updates per cycle for two non-contiguous blocks !
But single-ported arrays are highly desirable :-)
23Different arrays for hysteresis and predictions
- Prediction uses the most significant bit of a 2-bit counter
- Partial update: on correct prediction, only strengthening the hysteresis bit (not even always)
- only a WRITE on the hysteresis
- Misprediction
- update: read of hysteresis, write of (prediction, hysteresis) for a single branch
Prediction and hysteresis arrays can be distinct
208 Kbits prediction and 144 Kbits hysteresis
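A minimal sketch of separate prediction and hysteresis arrays, with one hysteresis bit shared by several prediction bits. Each 2-bit counter is re-encoded as a direction bit plus a strength bit (1 = strong); which counters share a hysteresis bit, the class name, and the sizes are assumptions for illustration only.

```python
class SplitCounterArray:
    """2-bit counters stored as a prediction array plus a smaller hysteresis array."""

    def __init__(self, n_pred=16, share=2):
        self.share = share                    # counters per hysteresis bit
        self.pred = [0] * n_pred              # direction bits: 1 = taken
        self.hyst = [0] * (n_pred // share)   # strength bits: 1 = strong

    def predict(self, i):
        # Prediction time: only the prediction array is read.
        return self.pred[i]

    def update(self, i, taken):
        h = i // self.share  # assumed sharing: neighbouring entries share a bit
        if self.pred[i] == taken:
            # Correct prediction: a write to the hysteresis array only.
            self.hyst[h] = 1
        elif self.hyst[h]:
            # Misprediction in a strong state: read hysteresis, weaken it.
            self.hyst[h] = 0
        else:
            # Misprediction in a weak state: flip the direction bit.
            self.pred[i] = taken
```

On a correct prediction the prediction array is never written and the hysteresis array is never read, which is what lets both be built from single-ported cells.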
24Bank-interleaved or double-ported branch
predictor ?
- Reads of predictions for two 8-instruction blocks
- double-porting: memory cells twice as large
- losing half of the entries ?
- bank-interleaving: need for arbitration
- longer critical electrical path
- losing throughput
- short loops fitting in a single 8-instruction block !?
????????
25Conflict free interleaved bank predictor
Key idea: force the two prediction blocks read in // to lie in two distinct banks
Bank for A is determined by Y and Z
4-way interleaved
if (y6,y5) = Bz then Ba = (y6, y5 ⊕ 1) else Ba = (y6,y5)
26Conflict free bank-interleaved predictor (2)
- Conflicts are avoided by construction
- Bank number is computed one cycle ahead
- not on the critical path
Single ported bank-interleaved memory arrays !
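A minimal sketch of the bank selection rule, reading it as: take bits (y6, y5) of the address of block Y as the candidate bank for A, and flip the low bit if that candidate equals the bank Bz already used for block Z. The XOR with 1 is a reconstruction of the rule, and the function name and integer encoding of bank numbers are assumptions.

```python
def bank_for_next_block(y_addr, bz):
    """Bank number (0..3) for block A, from the address of Y and the bank of Z."""
    y6 = (y_addr >> 6) & 1
    y5 = (y_addr >> 5) & 1
    candidate = (y6 << 1) | y5
    if candidate == bz:
        # Conflict with the bank of the previous block: flip the low bit.
        return (y6 << 1) | (y5 ^ 1)
    return candidate


# Whatever the address of Y, the bank of A never equals the bank of Z.
for y_addr in range(0, 1024, 4):
    for bz in range(4):
        assert bank_for_next_block(y_addr, bz) != bz
```

Because the inputs are the address of an older block and an already-computed bank number, the computation can be done one cycle ahead, off the critical path.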
27 Logical view vs real implementation
- 4 tables x 4 banks x 2 (pred. + hyst.)
- 32 memory arrays
- Indexing functions are computed, then arrays are
accessed
- 4 banks x 2 (pred. + hyst.)
- 4 tables in a single array
- 8 memory arrays
- No time to lose
- start access and compute part of the index in //
28Reading the branch prediction tables
29Reading the branch prediction tables (2)
- Span over 5/2 cycles
- Cycle -1
- bank number computation
- bank selection
- Cycle 0
- phase 0 wordline selection
- phase 1 column selection
- Cycle 1
- phase 0 unshuffle permutation
30Constraints on the different parts in the indices
- Strong: Wordline bits
- immediate availability
- common to the four logical tables
- Medium: Column bits
- a single 2-entry XOR gate
- Weak: Unshuffle bits
- near complete freedom, a full tree of XOR gates if needed
31Designing the indexing functions (1): 6 wordline bits
- Must be available at the beginning of the cycle
- block address bits
- 3-block old lghist bits
- path bits
- Tradeoff
- address bits for emphasizing bimodal component behavior
- lghist bits are more uniformly distributed
4 lghist bits + 2 address bits
32Designing the indexing functions (2): Column selection and unshuffle
- Favor independence of the four indexing functions
- if two (address,history) pairs conflict on a table then try to avoid repeating the conflict on another table
- Guarantee that for a single address, two histories differing by only one or two bits will not map on the same entry
- Favor usage of the whole table
- lghist bits are more uniformly distributed than address bits
XORing 2 lghist bits for column bits, a XOR tree with up to 11 bits for unshuffle
33EV8 branch predictor configuration
- 208 Kbits for prediction and 144 Kbits for hysteresis
- BIM: 16K prediction, 16K hysteresis, 4 lghist bits (3-block path)
- G0: 64K prediction, 32K hysteresis, 13 lghist bits
- G1: 64K prediction, 64K hysteresis, 21 lghist bits
- Meta: 64K prediction, 32K hysteresis, 17 lghist bits
- 4 prediction banks and 4 hysteresis banks
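The storage totals can be checked from the per-table sizes listed above, reading the first size of each table as its prediction bits and the second as its hysteresis bits (an interpretation that the totals confirm). The dictionary layout and names are illustrative.

```python
# (prediction Kbits, hysteresis Kbits) per logical table
CONFIG_KBITS = {
    "BIM": (16, 16),
    "G0": (64, 32),
    "G1": (64, 64),
    "Meta": (64, 32),
}

pred_total = sum(p for p, _ in CONFIG_KBITS.values())
hyst_total = sum(h for _, h in CONFIG_KBITS.values())

print(pred_total, hyst_total, pred_total + hyst_total)  # 208 144 352
```

G0 and Meta carry half as many hysteresis bits as prediction bits, which corresponds to one hysteresis bit shared between two counters.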
34Performance evaluation
35Benchmarks characteristics
- Highly optimized SPECint 95
- many more not-taken than taken branches
- ratio lghist/ghist length
- from 1.12 to 1.59
- from 8.9 to 16.2 branches per 100 instructions
362Bc-gskew vs other global history predictors
37History should be longer than log2(size)
38Quality of information vector
39Quality of information vector (2)
- Lghist vs ghist: no significant loss of information
- Using path or no path does not appear (here) as an issue
- 3-block old lghist: slight loss of accuracy
- EV8 info vector
- path info on the last blocks recovers part of the loss associated with 3-block old lghist
EV8 info vector stands the comparison
40Reducing some table sizes: no significant impact
41Qualities of the indexing functions
42Qualities of the indexing functions (2)
- Using history bits in the wordline selection is a good tradeoff
- Embedding path information in lghist is beneficial
- EV8 indexing functions are as good as indexing functions designed with no constraints
43Pushing the limits !?
44Conclusion
- Design of a real branch predictor leads to challenges ignored in academic studies
- 3-block old history vector
- impossibility to maintain a complete history
- multiple // accesses to the predictor
- minimization of the number of memory arrays
- timing constraints on the indexing functions
We worked around these difficulties and adapted a state-of-the-art academic branch predictor to real-world constraints.
45Summary of the contributions
- Efficient information vector can be built by mixing path and compressed history
- don't focus on the info vector, use what is convenient!
- Use of different table sizes, history lengths in the predictor.
- Sharing of hysteresis bits
- Conflict free parallel access scheme for the predictor
- Engineering of indexing functions