Design tradeoffs for the Alpha EV8 Conditional Branch Predictor


1
Design tradeoffs for the Alpha EV8 Conditional
Branch Predictor
  • André Seznec, IRISA/INRIA
  • Stephen Felix, Intel
  • Venkata Krishnan, Stargen Inc
  • Yiannakis Sazeides, University of Cyprus

2
Alpha EV8 (cancelled June 2001)
  • SMT 4 threads
  • wide-issue superscalar processor
  • 8-way issue
  • 512 registers

Single-process performance is the goal.
Multiprocess performance is the bonus.
5-10% overhead for SMT
3
Challenges on the EV8 conditional branch predictor
  • High accuracy is needed
  • 14 cycles minimum miss penalty
  • silicon budget is large, but don't waste it
  • Up to 16 predictions per cycle
  • from two non-contiguous fetch blocks!
  • history vector(s) must be updated with 0 to 16
    branch outcomes !
  • Branch history information is 3 fetch blocks old
  • Various implementation constraints
  • master the number of physical memory arrays
  • use of single-ported memory cells
  • timing constraints on indexing functions

4
Alpha EV8 front-end pipeline
  • Fetches up to two 8-instruction blocks per cycle
    from the I-cache
  • a block ends either on an aligned 8-instruction
    end or on a taken control flow
  • up to 16 conditional branches fetched and
    predicted per cycle
  • Next two block addresses must be predicted in a
    single cycle
  • critical path: use of a line predictor backed
    with a complex PC address generator (conditional
    branch predictor, RAS, jump predictor, ...)

5
instruction fetch blocks on EV8
6
PC address generation pipeline
[Figure: PC address generation pipeline for the fetch block pairs C and D, A and B, Y and Z. Stage labels: PC address generation is completed; line prediction is completed; prediction table read is completed.]
7
Minimum background in branch prediction (0)
  • Only direction is an issue
  • targets are recomputed on the fly

8
Minimum background in branch prediction (1)
  • Use the past to predict the current branch
  • read and update tables
  • 2-bit counters: prediction + hysteresis
  • index
  • address
  • global branch history: what happened with the
    last few branches
  • a single history vector
  • or local branch history: what happened the last
    times this branch was executed
  • must maintain a table of histories
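
To make the vocabulary above concrete, here is a minimal sketch of a generic global-history predictor: one table of 2-bit counters, indexed by the branch address combined with a single history vector. The class name, the table sizes and the XOR hash are assumptions chosen for illustration; this is not the EV8 design.

```python
class GlobalHistoryPredictor:
    """Toy global-history predictor: one table of 2-bit saturating counters."""

    def __init__(self, index_bits=10, history_bits=8):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                      # the single global history vector
        self.table = [1] * (1 << index_bits)  # counters start weakly not-taken

    def _index(self, pc):
        # Hypothetical hash: XOR the branch address with the global history.
        return (pc ^ self.history) & self.index_mask

    def predict(self, pc):
        # The high bit of the 2-bit counter is the prediction;
        # the low bit acts as hysteresis.
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history vector.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

A branch that strictly alternates taken / not-taken defeats a counter indexed by address alone, but once the history vector is warm each of the two history patterns selects its own counter, so the sketch predicts it correctly.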

9
Minimum background on branch prediction (2)
  • Problems
  • global or local schemes ?
  • which precise scheme ?
  • Interferences in tables may ruin accuracy

10
Global vs local history
  • Local history
  • 16 local history reads
  • 2-ported history read
  • 16-ported prediction table ?
  • Speculative history
  • up to 256 branches in flight
  • 3 fetch blocks old history
  • tight loops ?
  • SMT sharing is disastrous
  • Global history
  • bank-interleaved prediction tables
  • even no arbitration !
  • Speculative history
  • 3 fetch blocks old
  • not a real issue !
  • use of path !
  • SMT sharing may be constructive

11
EV8 predictor (derived from 2Bc-gskew)
12
2Bc-gskew hybrid skewed predictor
  • Leverages
  • de-aliased predictor e-gskew
  • bimodal for easy-to-predict branches
  • State-of-the-art global history branch predictor
  • at least (published) in 1999 :-)
  • along with the YAGS predictor :-)
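
As a rough sketch of how the two ingredients combine: in 2Bc-gskew the bimodal table (BIM) doubles as one of the three e-gskew banks, e-gskew predicts by majority vote over BIM, G0 and G1, and a meta-predictor chooses between the bimodal prediction and the majority vote. The function names below are illustrative, and the four inputs are assumed to be the prediction bits already read from the four tables.

```python
def majority(a, b, c):
    """Majority vote over three boolean predictions."""
    return (int(a) + int(b) + int(c)) >= 2


def predict_2bc_gskew(bim, g0, g1, meta):
    """Combine the four table outputs into one direction prediction.

    bim, g0, g1: prediction bits read from the three e-gskew banks
                 (the bimodal table is one of them).
    meta:        True selects the e-gskew majority vote,
                 False selects the bimodal prediction alone.
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta else bool(bim)
```

For example, `predict_2bc_gskew(True, False, False, meta=True)` follows the majority (not taken), while the same table outputs with `meta=False` follow the bimodal table (taken).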

13
2Bc-gskew degrees of freedom: partial
update policy
  • on correct predictions, only updates correct
    components
  • do not destroy other predictions
  • two tricks to further minimize write pressure
  • better accuracy !
  • USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS
    !!
  • On correct predictions
  • prediction bit is only read
  • hysteresis bit is only written

14
2Bc-gskew degrees of freedom (2):
sharing hysteresis bits
  • Using 2-bit counters
  • strong states occur more often than weak states
  • very small loss of accuracy when sharing
    hysteresis bit between 2 or 4 counters
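
A minimal sketch of what sharing looks like in storage terms: one prediction bit per entry, and one hysteresis bit per group of 2 or 4 entries. The grouping of adjacent entries (`i // share`) and the class name are assumptions made for illustration; the slides do not say which entries share a bit.

```python
class SharedHysteresisTable:
    """Prediction bits with one hysteresis bit shared by `share` neighbours."""

    def __init__(self, entries, share=2):
        assert entries % share == 0
        self.share = share
        self.pred = [0] * entries             # one prediction bit per entry
        self.hyst = [0] * (entries // share)  # shared hysteresis bits

    def predict(self, i):
        return bool(self.pred[i])

    def update(self, i, taken):
        h = i // self.share  # assumed grouping: adjacent entries share a bit
        if self.pred[i] == int(taken):
            self.hyst[h] = 1           # correct: strengthen
        elif self.hyst[h]:
            self.hyst[h] = 0           # wrong but strong: only weaken
        else:
            self.pred[i] = int(taken)  # wrong and weak: flip the prediction

    def storage_bits(self):
        return len(self.pred) + len(self.hyst)
```

With `share=2`, an 8-entry table costs 12 bits instead of 16. The price is interference: a misprediction on one entry can weaken the hysteresis its neighbour relies on, which matters little if most counters sit in strong states most of the time.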

15
2Bc-gskew degrees of freedom (3)
  • Different applications, different optimal history
    lengths
  • using different indexing functions allows
    different history lengths for the predictor
    tables
  • smooths out the difficulty
  • Different prediction table sizes
  • the bimodal table may be smaller than the other
    tables :-)

16
EV8 predictor
Smaller bimodal table
Different history lengths
17
Dealing with implementation constraints
18
Issues on global history
Blocks A and B
Blocks Y and Z
Blocks C and D
Branch information from C, B and A is not valid to
predict D!
On each cycle, from 0 to 16 branches are
predicted: 0 to 16 bits to be inserted in the
history vector !?
19
Block compressed history: lghist
  • Incorporate at most one bit in the history per
    fetch block
  • 0, 1 or 2 bits to be incorporated in history
    vector per cycle
  • Which bit ?
  • Direction of the last conditional branch in the
    block
  • previous ones are not taken
  • XORed with position (1st half/ 2nd half) in the
    block
  • more uniform distribution of the history vectors
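
The rule above can be sketched as follows. Each fetch block is represented as a list of `(position, taken)` pairs for its conditional branches in fetch order, with `position` in 0..7. The data layout, the function names, the polarity of the position bit and the 21-bit default length are assumptions for illustration.

```python
def lghist_bit(block):
    """Return the bit a fetch block contributes to lghist, or None.

    block: list of (position, taken) for the conditional branches of one
    8-instruction fetch block, in fetch order.
    """
    if not block:
        return None  # no conditional branch: nothing is inserted
    position, taken = block[-1]  # last conditional branch in the block
    second_half = position >= 4  # assumed encoding of 1st half / 2nd half
    return int(taken) ^ int(second_half)


def update_lghist(lghist, blocks, length=21):
    """Shift in at most one bit per fetch block (0, 1 or 2 bits per cycle)."""
    mask = (1 << length) - 1
    for block in blocks:
        bit = lghist_bit(block)
        if bit is not None:
            lghist = ((lghist << 1) | bit) & mask
    return lghist
```

Only the last conditional branch needs to be recorded because a taken branch ends the block, so every earlier conditional branch in the block is known to be not taken.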

20
instruction fetch blocks on EV8
1 is inserted
1 is inserted
21
The EV8 branch predictor information vector
  • History information is not available for the
    three last blocks A, B, and C
  • but, addresses are available !!

Information vector to index the predictor:
  1. Instruction address
  2. Lghist (3-blocks-old history + path)
  3. Path info on the last three blocks
22
Using single-ported memory arrays
The challenge: 16 predictions to be performed per cycle
from two non-contiguous blocks! 8 updates per cycle
for two non-contiguous blocks!
But single-ported arrays are highly desirable :-)
23
Different arrays for hysteresis and predictions
  • Prediction uses the most significant bit of a
    2-bit counter
  • Partial update: on a correct prediction, only
    strengthen the hysteresis bit (and not even
    always)
  • only a WRITE on the hysteresis
  • Misprediction
  • update: read of hysteresis, write of
    (prediction + hysteresis) for a single branch

Prediction and hysteresis arrays can be distinct:
208 Kbits prediction + 144 Kbits hysteresis
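
The reason the split works can be checked exhaustively. In the sketch below a 2-bit saturating counter `c` (0..3) is viewed as a prediction bit (`c >> 1`) plus a hysteresis bit (1 in the two strong states). The function names and the encoding are illustrative assumptions; the third return value records which arrays the update has to write.

```python
def split_update(pred, hyst, taken):
    """Update a (prediction, hysteresis) pair; return (pred, hyst, written)."""
    if pred == taken:
        return pred, 1, {"hyst"}       # correct: hysteresis write only
    if hyst:
        return pred, 0, {"pred", "hyst"}   # strong and wrong: weaken
    return taken, 0, {"pred", "hyst"}      # weak and wrong: flip prediction


def saturating_update(c, taken):
    """Reference 2-bit saturating counter, states 0..3."""
    return min(3, c + 1) if taken else max(0, c - 1)


def encode(c):
    """Counter state -> (prediction bit, hysteresis bit)."""
    return c >> 1, 1 if c in (0, 3) else 0


def decode(pred, hyst):
    """(prediction bit, hysteresis bit) -> counter state."""
    return {(0, 1): 0, (0, 0): 1, (1, 0): 2, (1, 1): 3}[(pred, hyst)]


# Exhaustive check: the split form tracks the 2-bit counter exactly.
for c in range(4):
    for taken in (0, 1):
        pred, hyst, _ = split_update(*encode(c), taken)
        assert decode(pred, hyst) == saturating_update(c, taken)
```

On a correct prediction the update never needs to read the hysteresis array and never writes the prediction array, which is what lets the two arrays be built and accessed separately.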
24
Bank-interleaved or double-ported branch
predictor ?
  • Reads of predictions for two 8-instruction blocks
  • double-porting: memory cells twice as large
  • losing half of the entries ?
  • bank-interleaving: need for arbitration
  • longer critical electrical path
  • losing throughput
  • short loops fitting in a single 8-instruction
    block !?

????????
25
Conflict free interleaved bank predictor
Key idea: force the two prediction blocks read
in // to lie in two distinct banks
Bank for A is determined by Y and Z
4-way interleaved
if (y6,y5) == Bz then Ba = (y6, y5 XOR 1) else Ba =
(y6,y5)
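
A small sketch of the bank selection rule, assuming that y6 and y5 denote bits 6 and 5 of the address of block Y and that Bz is the 2-bit bank number already assigned to block Z. The helper `assign_banks`, which chains the rule over a sequence of fetch blocks, and its seed banks are hypothetical additions for illustration.

```python
def bank_number(addr_two_back, bank_prev):
    """Bank for a block, from the address two blocks back (Y for A)
    and the bank of the immediately preceding block (Z for A)."""
    y6 = (addr_two_back >> 6) & 1
    y5 = (addr_two_back >> 5) & 1
    if ((y6 << 1) | y5) == bank_prev:
        y5 ^= 1  # candidate collides with the previous bank: flip y5
    return (y6 << 1) | y5


def assign_banks(block_addrs, first_banks=(0, 1)):
    """Chain the rule over successive fetch blocks (hypothetical seeding)."""
    banks = list(first_banks)
    for i in range(2, len(block_addrs)):
        banks.append(bank_number(block_addrs[i - 2], banks[i - 1]))
    return banks
```

Because the result always differs from the previous block's bank, any two successive blocks, and therefore the two blocks read in the same cycle, fall in different banks without arbitration. The inputs are an older address and an older bank number, so the computation can run a cycle ahead.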
26
Conflict free bank-interleaved predictor (2)
  • Conflicts are avoided by construction
  • Bank number is computed one cycle ahead
  • not on the critical path

Single ported bank-interleaved memory arrays !
27
Logical view vs real implementation
  • Logical view: 4 tables × 4 banks × 2 (pred. + hyst.)
  • 32 memory arrays
  • Indexing functions are computed, then arrays are
    accessed
  • Real implementation: 4 banks × 2 (pred. + hyst.)
  • 4 tables in a single array
  • 8 memory arrays
  • No time to lose
  • start access and compute part of the index in //

28
Reading the branch prediction tables
29
Reading the branch prediction tables (2)
  • Spans 5/2 cycles
  • Cycle -1
  • bank number computation
  • bank selection
  • Cycle 0
  • phase 0: wordline selection
  • phase 1: column selection
  • Cycle 1
  • phase 0: unshuffle permutation

30
Constraints on the different parts in the indices
  • Strong: wordline bits
  • immediate availability
  • common to the four logical tables
  • Medium: column bits
  • a single 2-input XOR gate
  • Weak: unshuffle bits
  • near complete freedom, a full tree of XOR gates
    if needed

31
Designing the indexing functions (1): 6 wordline
bits
  • Must be available at the beginning of the cycle
  • block address bits
  • 3-block old lghist bits
  • path bits
  • Tradeoff
  • address bits for emphasizing bimodal component
    behavior
  • lghist bits are more uniformly distributed

4 lghist bits + 2 address bits
32
Designing the indexing functions (2): column
selection and unshuffle
  • Favor independence of the four indexing
    functions
  • if two (address,history) pairs conflict on a
    table then try to avoid repeating the conflict on
    another table
  • Guarantee that for a single address, two
    histories differing by only one or two bits will
    not map on the same entry
  • Favor usage of the whole table
  • lghist bits are more uniformly distributed than
    address bits

XORing 2 lghist bits for column bits; an XOR tree
with up to 11 bits for unshuffle
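
The "one or two bits" guarantee has a simple form when the index is built only from XOR gates. As a sketch under that assumption, describe the function by a list `taps`, where `taps[i]` is the mask of index bits that history bit i feeds. Two histories that differ in a set of bits collide exactly when the XOR of the corresponding masks is zero; the address contribution is the same on both sides and cancels. The names and the representation are illustrative, not the EV8 functions.

```python
from itertools import combinations


def index_of(history, taps):
    """XOR-based index: history bit i toggles the index bits in taps[i]."""
    idx = 0
    for bit, mask in enumerate(taps):
        if (history >> bit) & 1:
            idx ^= mask
    return idx


def separates_close_histories(taps):
    """True if histories differing in 1 or 2 bits never share an entry.

    One-bit differences need every mask to be non-zero; two-bit
    differences need all masks to be pairwise distinct.
    """
    return all(taps) and len(set(taps)) == len(taps)


# Brute-force confirmation on a small, made-up function
# (5 history bits hashed onto 3 index bits).
taps = [0b001, 0b010, 0b100, 0b011, 0b110]
assert separates_close_histories(taps)
for h in range(1 << len(taps)):
    for k in (1, 2):
        for bits in combinations(range(len(taps)), k):
            other = h
            for b in bits:
                other ^= 1 << b
            assert index_of(h, taps) != index_of(other, taps)
```

The check is cheap enough to run against every candidate function while hardware constraints restrict which bits may enter the wordline, column and unshuffle parts of the index.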
33
EV8 branch predictor configuration
  • 208 Kbits for prediction and 144 Kbits for
    hysteresis
  • BIM: 16K prediction + 16K hysteresis, 4 lghist bits (+ 3-block path)
  • G0: 64K prediction + 32K hysteresis, 13 lghist bits
  • G1: 64K prediction + 64K hysteresis, 21 lghist bits
  • Meta: 64K prediction + 32K hysteresis, 17 lghist bits
  • 4 prediction banks and 4 hysteresis banks
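
The per-table figures can be checked against the totals with a few lines of arithmetic. The sketch reads the first size of each table as prediction bits and the second as hysteresis bits, an interpretation that the two totals confirm; the dictionary layout is just for illustration.

```python
K = 1024

# (prediction bits, hysteresis bits, lghist bits) per logical table
TABLES = {
    "BIM":  (16 * K, 16 * K, 4),
    "G0":   (64 * K, 32 * K, 13),
    "G1":   (64 * K, 64 * K, 21),
    "Meta": (64 * K, 32 * K, 17),
}

pred_total = sum(pred for pred, _, _ in TABLES.values())
hyst_total = sum(hyst for _, hyst, _ in TABLES.values())

assert pred_total == 208 * K   # 208 Kbits for prediction
assert hyst_total == 144 * K   # 144 Kbits for hysteresis
print((pred_total + hyst_total) // K, "Kbits in total")
```

G0 and Meta keep half as many hysteresis bits as prediction bits, so each hysteresis bit there is shared by two prediction entries, while BIM and G1 keep one hysteresis bit per entry.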

34
Performance evaluation
  • Sorry,
  • SPEC 95 :-)

35
Benchmarks characteristics
  • Highly optimized SPECint 95
  • much more not-taken than taken
  • ratio lghist/ghist length
  • from 1.12 to 1.59
  • from 8.9 to 16.2 branches per 100 instructions

36
2Bc-gskew vs other global history predictors
37
History should be longer than log2(size)
38
Quality of information vector
39
Quality of information vector (2)
  • Lghist vs ghist: no significant loss of
    information
  • Using path or no path does not appear (here) as
    an issue
  • 3-block old lghist: slight loss of accuracy
  • EV8 info vector
  • path info on the last blocks recovers part of
    the loss associated with 3-block old lghist

EV8 info vector stands the comparison
40
Reducing some table sizes: no significant impact
41
Qualities of the indexing functions
42
Qualities of the indexing functions (2)
  • Using history bits in the wordline selection is a
    good tradeoff
  • Embedding path information in lghist is
    beneficial
  • EV8 indexing functions are as good as indexing
    functions designed with no constraints

43
Pushing the limits !?
44
Conclusion
  • Design of a real branch predictor leads to
    challenges ignored in academic studies
  • 3-block old history vector
  • impossibility to maintain a complete history
  • multiple // accesses to the predictor
  • minimization of the number of memory arrays
  • timing constraints on the indexing functions

We worked around these difficulties and adapted a
state-of-the-art academic branch predictor to
real-world constraints.
45
Summary of the contributions
  • An efficient information vector can be built by
    mixing path and compressed history
  • don't focus on the info vector, use what is
    convenient!
  • Use of different table sizes, history lengths in
    the predictor.
  • Sharing of hysteresis bits
  • Conflict free parallel access scheme for the
    predictor
  • Engineering of indexing functions