Design tradeoffs for the Alpha EV8 Conditional Branch Predictor

1
Design tradeoffs for the Alpha EV8 Conditional
Branch Predictor

André Seznec, IRISA/INRIA
Stephen Felix, Intel
Venkata Krishnan, Stargen Inc
Yiannakis Sazeides, University of Cyprus

2
Alpha EV8 (cancelled June 2001)

SMT 4 threads
wide-issue superscalar processor
8-way issue
512 registers

Single process performance is the goal
Multiprocess performance is the bonus
5-10% overhead for SMT
3
Challenges on the EV8 conditional branch predictor

High accuracy is needed
14 cycles minimum miss penalty
silicon budget is large, but don't waste it
Up to 16 predictions per cycle
from two non-contiguous fetch blocks!
history vector(s) must be updated with 0 to 16
branch outcomes !
Branch history information is 3 fetch blocks old
Various implementation constraints
keep the number of physical memory arrays under control
use of single-ported memory cells
timing constraints on indexing functions

4
Alpha EV8 front-end pipeline

Fetches up to two, 8-instruction blocks per cycle
from the I-cache
a block ends either on an aligned 8-instruction
end or on a taken control flow
up to 16 conditional branches fetched and
predicted per cycle
Next two block addresses must be predicted on a
single cycle
critical path: use of a line predictor backed by
a complex PC address generator (conditional
branch predictor, RAS, jump predictor, ...)
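
The block-termination rule on this slide can be sketched as follows. This is a minimal illustration, not the EV8 logic: the function name `fetch_block` and the representation of instructions as slots 0..7 inside an aligned 8-instruction group are assumptions made for the example.

```python
BLOCK_INSTRS = 8  # instructions per aligned fetch block (from the slide)


def fetch_block(start_slot, taken_slots):
    """Return the instruction slots covered by one fetch block.

    start_slot:  slot (0..7) of the first fetched instruction inside its
                 aligned 8-instruction group.
    taken_slots: set of slots holding a taken control-flow instruction
                 (hypothetical input; in hardware this comes from prediction).
    """
    slots = []
    for slot in range(start_slot, BLOCK_INSTRS):
        slots.append(slot)
        if slot in taken_slots:
            break  # the block ends on a taken control flow
    return slots   # otherwise it ends at the aligned 8-instruction boundary
```

For instance, `fetch_block(3, {5})` covers slots 3 to 5 only, while `fetch_block(6, set())` stops at the aligned boundary after slot 7.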

5
instruction fetch blocks on EV8
6
PC address generation pipeline
[Figure: three pairs of fetch blocks in flight (Y and Z, A and B, C and D), with markers showing when PC address generation, line prediction, and the prediction table read are completed]
7
Minimum background in branch prediction (0)

Only direction is an issue
targets are recomputed on the fly

8
Minimum background in branch prediction (1)

Use the past to predict the current branch
read and update tables
2-bit counters: prediction + hysteresis
index
address
global branch history: what happened with the
last few branches
a single history vector
or local branch history: what happened the last
times this branch was executed
must maintain a table of histories
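
A 2-bit counter viewed as a prediction bit plus a hysteresis bit can be sketched as below. The tuple encoding `(prediction, hysteresis)` with hysteresis 1 meaning "strong" is an assumption chosen for readability; it behaves like the usual 2-bit saturating counter.

```python
def predict(counter):
    """The prediction is simply the prediction bit."""
    pred, _hyst = counter
    return pred


def update(counter, taken):
    """Return the new (prediction, hysteresis) pair after one outcome."""
    pred, hyst = counter
    if pred == taken:
        return (pred, 1)   # correct: move to (or stay in) the strong state
    if hyst == 1:
        return (pred, 0)   # wrong but strong: only weaken
    return (taken, 0)      # wrong and weak: flip the prediction, stay weak
```

Starting from `(0, 0)` (weakly not-taken), two taken outcomes lead to `(1, 1)`, and it then takes two not-taken outcomes to flip the prediction back.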

9
Minimum background on branch prediction (2)

Problems
global or local schemes ?
which precise scheme ?
Interferences in tables may ruin accuracy

10
Global vs local history

Local history
16 local history reads
2-ported history read
16-ported prediction table ?
Speculative history
up to 256 branches inflight
3 fetch blocks old history
tight loops ?
SMT sharing is disastrous

Global history
bank-interleaved prediction tables
even no arbitration !
Speculative history
3 fetch blocks old
not a real issue !
use of path !
SMT sharing may be constructive

11
EV8 predictor: derived from 2Bc-gskew
12
2Bc-gskew hybrid skewed predictor

Leverages
de-aliased predictor e-gskew
bimodal for easy-to-predict branches
State-of-the-art global history branch predictor
at least (published) in 1999 :-)
along with the YAGS predictor :-)
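
How the two leveraged components combine into one prediction can be sketched as below. The slide text does not spell out the selection logic, so this follows the published description of 2Bc-gskew as I understand it: e-gskew is a majority vote over the bimodal, G0 and G1 predictions, and a meta predictor chooses between e-gskew and the bimodal prediction. The function names and the polarity of `meta` (1 selects e-gskew) are assumptions.

```python
def majority(a, b, c):
    """Majority vote over three 0/1 predictions."""
    return 1 if a + b + c >= 2 else 0


def predict_2bc_gskew(bim, g0, g1, meta):
    """Combine four 0/1 table outputs into one taken/not-taken prediction.

    bim, g0, g1: prediction bits read from the three direction tables.
    meta:        1 selects the e-gskew majority vote, 0 selects the
                 bimodal prediction (polarity is an assumption).
    """
    egskew = majority(bim, g0, g1)
    return egskew if meta else bim
```

The bimodal table serves twice here, as one voter of e-gskew and as the fallback for easy-to-predict branches.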

13
2Bc-gskew degrees of freedom: partial
update policy

on correct predictions, only updates correct
components
do not destroy other predictions
two tricks to further minimize write pressure
better accuracy !
USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS !!
On correct predictions
prediction bit is only read
hysteresis bit is only written

14
2Bc-gskew degrees of freedom (2):
sharing hysteresis bits

Using 2-bit counters
strong states occur more often than weak states
very small loss of accuracy when sharing
hysteresis bit between 2 or 4 counters
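
Sharing one hysteresis bit between several prediction bits can be sketched with two separate arrays. This is an illustration under assumptions: the class name `SplitCounterTable` is made up, and grouping adjacent entries (`idx // share`) onto the same hysteresis bit is an arbitrary choice, since the slide does not say which entries share a bit.

```python
class SplitCounterTable:
    """Prediction bits and hysteresis bits kept in distinct arrays."""

    def __init__(self, n_pred, share=2):
        self.share = share                    # counters per hysteresis bit
        self.pred = [0] * n_pred              # one prediction bit per entry
        self.hyst = [0] * (n_pred // share)   # fewer hysteresis bits

    def predict(self, idx):
        return self.pred[idx]                 # reads the prediction array only

    def update(self, idx, taken):
        h = idx // self.share                 # assumed sharing pattern
        if self.pred[idx] == taken:
            self.hyst[h] = 1                  # correct: hysteresis write only
        elif self.hyst[h]:
            self.hyst[h] = 0                  # wrong but strong: weaken
        else:
            self.pred[idx] = taken            # wrong and weak: flip prediction
```

With `share=2`, entries 0 and 1 use the same hysteresis bit, so a misprediction on entry 1 can weaken a state that entry 0 had strengthened. The slide's claim is that this costs very little accuracy because strong states dominate.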

15
2Bc-gskew degrees of freedom (3)

Different applications, different optimal history
lengths
using different indexing functions allows
different history lengths for the predictor
tables
smooths out the difficulty
Different prediction table sizes
the bimodal table may be smaller than the other
tables :-)

16
EV8 predictor
Smaller bimodal table
Different history lengths
17
Dealing with implementation constraints
18
Issues on global history
Blocks A and B
Blocks Y and Z
Blocks C and D
Branch information from C, B and A is not valid to
predict D!
On each cycle, from 0 to 16 branches are
predicted: 0 to 16 bits to be inserted in the
history vector !?
19
Block compressed history: lghist

Incorporate at most one bit in the history per
fetch block
0, 1 or 2 bits to be incorporated in history
vector per cycle
Which bit ?
Direction of the last conditional branch in the
block
previous ones are not taken
XORed with position (1st half/ 2nd half) in the
block
more uniform distribution of the history vectors
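
The lghist insertion rule above can be sketched as follows. The input format (a list of `(slot, taken)` pairs per fetch block, slots 0..7) and the function names are assumptions for illustration; the default length of 21 bits is only illustrative.

```python
def lghist_bit(branches):
    """Return the bit a fetch block contributes to lghist, or None.

    branches: (slot, taken) pairs for the conditional branches of one
              fetch block, in program order; slot is 0..7 in the block.
    """
    if not branches:
        return None                        # no conditional branch: no bit
    slot, taken = branches[-1]             # only the last branch matters
    second_half = 1 if slot >= 4 else 0    # position: 1st half / 2nd half
    return int(taken) ^ second_half


def push_lghist(lghist, branches, length=21):
    """Shift at most one bit per fetch block into the history register."""
    bit = lghist_bit(branches)
    if bit is None:
        return lghist
    return ((lghist << 1) | bit) & ((1 << length) - 1)
```

Only the last conditional branch needs to be recorded because a block ends on a taken control flow, so any earlier conditional branch in the block was necessarily not taken.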

20
instruction fetch blocks on EV8
[Figure: two example fetch blocks; in each example a 1 is inserted in lghist]
21
The EV8 branch predictor information vector

History information is not available for the
three last blocks A, B, and C
but, addresses are available !!

Information vector to index the predictor
1. Instruction address
2. Lghist (3-blocks-old history + path)
3. Path info on the last three blocks
22
Using single-ported memory arrays
The challenge:
16 predictions to be performed per cycle from two non-contiguous blocks !
8 updates per cycle for two non-contiguous blocks !
But single-ported arrays are highly desirable :-)
23
Different arrays for hysteresis and predictions

Prediction uses the most significant bit of a
2-bit counter
Partial update on correct prediction:
only strengthening the hysteresis bit (not even
always)
only a WRITE on the hysteresis
Misprediction update:
read of hysteresis, write of
(prediction + hysteresis) for a single branch

Prediction and hysteresis arrays can be distinct
208 Kbits prediction + 144 Kbits hysteresis
24
Bank-interleaved or double-ported branch
predictor ?

Reads of predictions for two 8-instruction blocks
double-porting: memory cells twice as large
losing half of the entries ?
bank-interleaving: need for arbitration
longer critical electrical path
losing throughput
short loops fitting in a single 8-instruction
block !?

????????
25
Conflict free interleaved bank predictor
Key idea: force the two prediction blocks read
in parallel to lie in two distinct banks
Bank for A is determined by Y and Z
4-way interleaved
if (y6,y5) == Bz then Ba = (y6, y5 XOR 1)
else Ba = (y6,y5)
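
The bank selection rule can be sketched as below. It assumes that y6 and y5 denote bits 6 and 5 of the address of block Y and that bank numbers are 2-bit values; the function name `bank_for_a` is made up for the example.

```python
def bank_for_a(y_addr, bz):
    """Pick the bank (0..3) for block A from block Y's address and Z's bank.

    y_addr: address of fetch block Y (bits 6 and 5 are used; assumption).
    bz:     bank number (0..3) already assigned to block Z.
    """
    y6 = (y_addr >> 6) & 1
    y5 = (y_addr >> 5) & 1
    candidate = (y6 << 1) | y5
    if candidate == bz:
        candidate ^= 1   # flip y5 so that the bank differs from Bz
    return candidate
```

Whatever the inputs, the result differs from `bz`. Applying the same rule to every block in turn means two consecutive blocks never fall in the same bank, and since only older blocks are involved the bank number can be computed ahead of the access.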
26
Conflict free bank-interleaved predictor (2)

Conflicts are avoided by construction
Bank number is computed one cycle ahead
not on the critical path

Single ported bank-interleaved memory arrays !
27
Logical view vs real implementation

Logical view:
4 tables x 4 banks x 2 (pred. + hyst.)
= 32 memory arrays
Indexing functions are computed, then arrays are
accessed

Real implementation:
4 banks x 2 (pred. + hyst.)
4 tables in a single array
= 8 memory arrays
No time to lose
start the access and compute part of the index in parallel

28
Reading the branch prediction tables
29
Reading the branch prediction tables (2)

Spans 2.5 cycles
Cycle -1
bank number computation
bank selection
Cycle 0
phase 0: wordline selection
phase 1: column selection
Cycle 1
phase 0: unshuffle permutation

30
Constraints on the different parts in the indices

Strong: wordline bits
immediate availability
common to the four logical tables
Medium: column bits
a single 2-input XOR gate
Weak: unshuffle bits
near complete freedom, a full tree of XOR gates
if needed

31
Designing the indexing functions (1): 6 wordline
bits

Must be available at the beginning of the cycle
block address bits
3-block old lghist bits
path bits
Tradeoff
address bits for emphasizing bimodal component
behavior
lghist bits are more uniformly distributed

4 lghist bits + 2 address bits
32
Designing the indexing functions (2): column
selection and unshuffle

Favor independence of the four indexing
functions
if two (address,history) pairs conflict on a
table then try to avoid repeating the conflict on
another table
Guarantee that for a single address, two
histories differing by only one or two bits will
not map on the same entry
Favor usage of the whole table
lghist bits are more uniformly distributed than
address bits

XORing 2 lghist bits for column bits; a XOR tree
with up to 11 bits for unshuffle
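
The guarantee that two histories differing by one or two bits never map to the same entry can be checked mechanically when every index bit is an XOR of history bits. The sketch below is not the EV8 indexing function: the masks are hypothetical and the function names are made up. Because XOR indexing is linear, it is enough that every history bit feeds at least one index bit and that no two history bits feed exactly the same set of index bits.

```python
def xor_fold_bit(history, mask):
    """One index bit: XOR (parity) of the history bits selected by mask."""
    return bin(history & mask).count("1") & 1


def index_bits(history, masks):
    """Index as a tuple of bits, one XOR tree per mask."""
    return tuple(xor_fold_bit(history, m) for m in masks)


def separates_small_differences(masks, hist_len):
    """True if histories differing in 1 or 2 bits always get distinct indices."""
    # Column j lists which index bits history bit j feeds into.
    columns = [tuple((m >> j) & 1 for m in masks) for j in range(hist_len)]
    nonzero = all(any(col) for col in columns)        # 1-bit differences
    distinct = len(set(columns)) == len(columns)      # 2-bit differences
    return nonzero and distinct
```

For example, with 4 history bits the hypothetical masks `[0b0011, 0b0101, 0b1001]` satisfy the property, whereas `[0b0011, 0b0011]` do not.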
33
EV8 branch predictor configuration

208 Kbits for prediction and 144 Kbits for
hysteresis
Table   Prediction   Hysteresis   History
BIM     16K          16K          4 lghist bits (+ 3-block path)
G0      64K          32K          13 lghist bits
G1      64K          64K          21 lghist bits
Meta    64K          32K          17 lghist bits
4 prediction banks and 4 hysteresis banks
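
The totals quoted on this slide can be cross-checked from the per-table figures. The sketch assumes that each pair of sizes reads as (prediction Kbits, hysteresis Kbits); the dictionary name is made up.

```python
# (prediction Kbits, hysteresis Kbits) per table, as read from the slide
TABLES_KBITS = {
    "BIM": (16, 16),
    "G0": (64, 32),
    "G1": (64, 64),
    "Meta": (64, 32),
}

prediction_kbits = sum(p for p, _ in TABLES_KBITS.values())
hysteresis_kbits = sum(h for _, h in TABLES_KBITS.values())
total_kbits = prediction_kbits + hysteresis_kbits

# Tables whose hysteresis array is half the prediction array share
# one hysteresis bit between two prediction bits.
shared = [name for name, (p, h) in TABLES_KBITS.items() if h * 2 == p]
```

Under this reading the sums match the 208 Kbits and 144 Kbits quoted above, and G0 and Meta are the tables with shared hysteresis bits.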

34
Performance evaluation

Sorry,
SPEC 95 :-)

35
Benchmarks characteristics

Highly optimized SPECint 95
much more not-taken than taken
ratio lghist/ghist length
from 1.12 to 1.59
from 8.9 to 16.2 branches per 100 instructions

36
2Bc-gskew vs other global history predictors
37
History should be longer than log2(size)
38
Quality of information vector
39
Quality of information vector (2)

lghist vs ghist: no significant loss of
information
Using path or no path does not appear (here) as
an issue
3-block old lghist: slight loss of accuracy
EV8 info vector:
path info on the last blocks recovers part of
the loss associated with 3-block old lghist

EV8 info vector stands up to the comparison
40
Reducing some table sizes: no significant impact
41
Qualities of the indexing functions
42
Qualities of the indexing functions (2)

Using history bits in the wordline selection is a
good tradeoff
Embedding path information in lghist is
beneficial
EV8 indexing functions are as good as indexing
functions designed with no constraints

43
Pushing the limits !?
44
Conclusion

Design of a real branch predictor leads to
challenges ignored in academic studies
3-block old history vector
impossibility of maintaining a complete history
multiple parallel accesses to the predictor
minimization of the number of memory arrays
timing constraints on the indexing functions

We overcame these difficulties and adapted a
state-of-the-art academic branch predictor to
real-world constraints.
45
Summary of the contributions

An efficient information vector can be built by
mixing path and compressed history
don't focus on the info vector, use what is
convenient!
Use of different table sizes, history lengths in
the predictor.
Sharing of hysteresis bits
Conflict free parallel access scheme for the
predictor
Engineering of indexing functions