Title:

CDA 5155

Slides: 46
Provided by: csF2

1
CDA 5155
  • Week 3
  • Branch Prediction
  • Superscalar Execution

2
[Figure: five-stage pipelined datapath with branch support. Labels: PC, 1, Inst mem, REG file, sign ext, Data memory, Control, three muxes, bpc, target, beq; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB]
3
Branch Target Buffer
  • Fetch PC
  • Send PC to BTB
  • Found? Yes: use target
  • Found? No: use PC+1
  • Result: predicted target PC
4
Branch prediction
  • Predict not taken: 50% accurate
  • No BTB needed; always use PC+1
  • Predict backward taken: 65% accurate
  • BTB holds targets for backward branches (loops)
  • Predict same as last time: 80% accurate
  • Update BTB for any taken branch
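
A minimal Python sketch of the BTB lookup and the "predict same as last time" policy described above. The class name, the direct-mapped organization, the table size, and the choice to drop an entry when its branch is not taken are illustrative assumptions, not details from the lecture. PCs are treated as word addresses, so the fall-through address is PC+1.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB keyed by word-addressed PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, target) or None

    def predict(self, pc):
        """Return the predicted next PC for the instruction at pc."""
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]  # found: use the stored target
        return pc + 1       # not found: use PC+1

    def update(self, pc, taken, target):
        """Train the BTB once the branch has resolved."""
        index = pc % self.entries
        if taken:
            self.table[index] = (pc, target)  # update BTB for any taken branch
        elif self.table[index] is not None and self.table[index][0] == pc:
            self.table[index] = None          # assumption: forget a not-taken branch


btb = BranchTargetBuffer()
first = btb.predict(8)               # no entry yet, so this is PC+1
btb.update(8, taken=True, target=2)
second = btb.predict(8)              # entry found, so this is the stored target
print(first, second)
```

With the "predict backward taken" policy, the same structure would only be filled for branches whose target is below the branch PC.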

5
What about indirect branches?
  • Could use same approach
  • PC+1 is an unlikely indirect target
  • Indirect jumps often have multiple targets (for the
    same instruction)
  • Switch statements
  • Virtual function calls
  • Shared library (DLL) calls

6
Indirect jump: Special Case
  • Return address stack
  • Function returns have deterministic behavior
    (usually)
  • Return to different locations (BTB doesn't work
    well)
  • Return location known ahead of time
  • In some register at the time of the call
  • Build a specialized structure for return addresses
  • Call instructions write return address to R31 AND
    RAS
  • Return instructions pop predicted target off
    stack
  • Issue: finite size (save or forget on
    overflow?)
  • Issue: long jumps (clear when wrong?)
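
A minimal Python sketch of a return address stack, assuming word-addressed PCs (return address = call PC + 1) and a "forget the oldest entry on overflow" policy. The class name, the depth, and the overflow policy are illustrative choices; the slide leaves the overflow question open.

```python
class ReturnAddressStack:
    """Hypothetical fixed-depth RAS that forgets the oldest entry on overflow."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        """A call pushes its return address (the ISA also writes it to R31)."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: drop the oldest return address
        self.stack.append(call_pc + 1)

    def on_return(self):
        """A return pops the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

    def clear(self):
        """One possible response to a detected wrong prediction (e.g. a long jump)."""
        self.stack = []


ras = ReturnAddressStack(depth=2)
for call_pc in (10, 20, 30):           # three nested calls overflow a depth-2 stack
    ras.on_call(call_pc)
print(ras.on_return(), ras.on_return(), ras.on_return())
```

In this sketch the innermost two returns are predicted, and the outermost return finds the stack empty and has to fall back to another predictor.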

7
Costs of branch prediction/speculation
  • Performance costs?
  • Minimal: there is no difference between waiting and
    squashing, and it is a huge gain when the prediction
    is correct!
  • Power?
  • Large: in very long/wide pipelines, many
    instructions can be squashed
  • Squashed = mispredictions × pipeline
    length/width before target resolved

8
Costs of branch prediction/speculation
  • Area?
  • Can be large: predictors can get very big, as we
    will see next time
  • Complexity?
  • Designs are more complex
  • Testing becomes more difficult, but

9
What else can be speculated?
  • Dependencies
  • I think this data is coming from that store
    instruction
  • Values
  • I think I will load a 0 value
  • Accuracy?
  • Branch prediction (direction) is Boolean (T,NT)
  • Branch targets are stable or predictable (RAS)
  • Dependencies are limited
  • Values cover a huge space (0 to 4B)

10
Parts of the branch predictor
  • Direction Predictor
  • For conditional branches
  • Predicts whether the branch will be taken
  • Examples
  • Always taken; backwards taken
  • Address Predictor
  • Predicts the target address (use if predicted
    taken)
  • Examples
  • BTB; Return Address Stack; Precomputed Branch
  • Recovery logic

Ref: The Precomputed Branch Architecture
11
Characteristics of branches
  • Individual branches differ
  • Loops tend not to exit
  • Unoptimized code: not-taken
  • Optimized code: taken
  • If-statements
  • Tend to be less predictable
  • Unconditional branches
  • Still need address prediction

12
Example gzip
  • gzip loop branch A @ 0x1200098d8
  • Executed 1359575 times
  • Taken 1359565 times
  • Not-taken 10 times
  • % time taken: 99 - 100%

Easy to predict (direction and address)
13
Example gzip
  • gzip if branch B @ 0x12000fa04
  • Executed 151409 times
  • Taken 71480 times
  • Not-taken 79929 times
  • % time taken: 47%

Easy to predict? (maybe not/ maybe dynamically)
14
Example gzip
[Figure: chart for branches A and B, axis 0 to 100]
Direction prediction: always taken. Accuracy: 73%
15
Branch Backwards
Most backward branches are heavily TAKEN. Forward
branches are slightly more likely to be NOT-TAKEN.
Ref: The Effects of Predicated Execution on
Branch Prediction
16
Using history
  • 1-bit history (direction predictor)
  • Remember the last direction for a branch

[Figure: branchPC indexes the Branch History Table]
How big is the BHT?
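
A minimal Python sketch of a 1-bit history predictor for a single branch, counting mispredictions on a loop-style outcome stream. The function name, the initial state, and the 9-taken-then-exit pattern are illustrative assumptions; a real BHT holds one such bit per indexed entry.

```python
def one_bit_mispredicts(outcomes, initial=True):
    """Count mispredictions when predicting 'same direction as last time'."""
    last = initial
    misses = 0
    for taken in outcomes:
        if taken != last:
            misses += 1
        last = taken        # remember only the most recent direction
    return misses


# A loop branch: taken 9 times, then not taken once at loop exit, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(one_bit_mispredicts(loop_branch))
```

The pattern to notice is that each loop exit costs a misprediction, and so does the first iteration of the next loop run, because the single bit was flipped by the exit.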
17
Example gzip
[Figure: chart for branches A and B, axis 0 to 100]
Direction prediction: always taken. Accuracy: 73%
How many times will branch A mispredict?
How many times will branch B mispredict?
18
Using history
  • 2-bit history (direction predictor)

[Figure: branchPC indexes the Branch History Table]
How big is the BHT?
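
A minimal Python sketch of a 2-bit saturating-counter predictor for a single branch. The state encoding (0 and 1 predict not-taken, 2 and 3 predict taken), the initial state, and the example outcome stream are assumptions made for illustration.

```python
def two_bit_mispredicts(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter."""
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        # Saturating update: move toward 3 on taken, toward 0 on not-taken.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Same loop-style stream: taken 9 times, then one not-taken exit, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(two_bit_mispredicts(loop_branch))
```

Because a single not-taken outcome only moves the counter from "strongly taken" to "weakly taken", the loop exit is mispredicted but the next loop entry is not.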
19
Example gzip
[Figure: chart for branches A and B, axis 0 to 100]
Direction prediction: always taken. Accuracy: 76%
How many times will branch A mispredict?
How many times will branch B mispredict?
20
Using History Patterns
  • 80 percent of branches are either heavily TAKEN
    or heavily NOT-TAKEN
  • For the other 20 percent, we need to look at patterns of
    reference to see if they are predictable using a
    more complex predictor
  • Example: gcc has a branch that flips each time

T(1) NT(0): 10101010101010101010101010101010101010
21
Local history
[Figure: branchPC indexes the Branch History Table; the selected history (10101010) indexes the Pattern History Table]
What is the prediction for this BHT 10101010?
When do I update the tables?
22
Local history
[Figure: branchPC indexes the Branch History Table; the selected history (01010101) indexes the Pattern History Table]
On the next execution of this branch instruction,
the branch history table entry is 01010101, pointing
to a different pattern.
What is the accuracy of a flip/flop branch
0101010101010?
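
A minimal Python sketch of a local-history (two-level) predictor for one branch, run on a flip/flop outcome stream. It assumes an 8-bit history, as in the figures, and a Pattern History Table of 2-bit counters initialised to "weakly not-taken"; the function name and the update-after-every-branch policy are illustrative assumptions.

```python
def local_history_accuracy(outcomes, history_bits=8):
    """Fraction of correct predictions for one branch with local history."""
    mask = (1 << history_bits) - 1
    history = 0                          # this branch's Branch History Table entry
    pht = [1] * (1 << history_bits)      # Pattern History Table of 2-bit counters
    correct = 0
    for taken in outcomes:
        counter = pht[history]
        if (counter >= 2) == taken:
            correct += 1
        # Update the counter selected by the old history, then shift in the outcome.
        pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)


flip_flop = [n % 2 == 0 for n in range(1000)]   # T, NT, T, NT, ...
print(local_history_accuracy(flip_flop))
```

After a short warm-up, the two alternating histories (10101010 and 01010101) select two different counters, one trained toward taken and one toward not-taken, so the flip/flop branch becomes almost perfectly predictable.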
23
Global history
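[Figure: the Branch History Register (01110101) indexes the Pattern History Table]
Example: nested loops
  for (i = 0; i < 100; i++)
    for (j = 0; j < 3; j++)
History seen by each loop branch and its outcome:
  j < 3, j = 1: 1101 → taken
  j < 3, j = 2: 1011 → taken
  j < 3, j = 3: 0111 → not taken
  i < 100:      1110 → usually taken
Example: correlated if-statements
  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb)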
How can branches interfere with each other?
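
A minimal Python sketch that replays the two loop branches of the nested-loop example through a 4-bit global history register. The generator and function names, the 2-bit counters, and their "weakly taken" initial state are assumptions made for illustration; the branch PC is deliberately ignored, as in a pure global-history scheme.

```python
def nested_loop_outcomes(outer=100, inner=3):
    """Yield (branch, taken) in program order for the two loop branches."""
    for i in range(outer):
        for j in range(1, inner + 1):
            yield ("j<3", j < inner)        # inner loop branch: T, T, NT
        yield ("i<100", i + 1 < outer)      # outer loop branch: taken until the end


def global_history_accuracy(outcomes, history_bits=4):
    """Predict every branch from the global history register alone."""
    mask = (1 << history_bits) - 1
    history = 0
    pht = [2] * (1 << history_bits)         # 2-bit counters, weakly taken
    correct = total = 0
    for _branch, taken in outcomes:
        counter = pht[history]
        correct += (counter >= 2) == taken
        total += 1
        pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        history = ((history << 1) | int(taken)) & mask
    return correct / total


print(global_history_accuracy(nested_loop_outcomes()))
```

Each of the four history patterns listed above leads to its own counter, so the inner-loop exit becomes predictable after warm-up. Interference arises when unrelated branches happen to produce the same history and share a counter.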
24
Gshare predictor
[Figure: branchPC XOR Branch History Register (01110101) indexes the Pattern History Table]
Ref: Combining Branch Predictors
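
A minimal Python sketch of the gshare indexing scheme: the branch PC is XORed with the global history register to pick a Pattern History Table entry. The class name, the 8-bit index width, and the 2-bit counters with a "weakly taken" initial state are illustrative assumptions.

```python
class Gshare:
    """Hypothetical gshare predictor with 2-bit counters."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.history = 0                       # Branch History Register
        self.pht = [2] * (1 << index_bits)     # Pattern History Table

    def index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


g = Gshare()
g.history = 0b01110101                         # history value shown in the figure
print(format(g.index(0b00001111), "08b"))      # PC bits XOR history bits
```

XORing in the PC spreads different branches that share the same history across different counters, which reduces (but does not remove) the interference seen with a pure global-history index.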
25
Bimod predictor
[Figure: bimod predictor. Labels: global history reg, branchPC, xor, choice predictor, PHT skewed taken, PHT skewed not-taken, mux]
26
Tournament predictors
[Figure: a local predictor (e.g. 2-bit) produces Prediction 1 and a global/gshare predictor (much more state) produces Prediction 2; a selection table (2-bit state machine) chooses the final Prediction]
How do you select which predictor to use? How do
you update the various predictors/selector?
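
One possible answer to the two questions above, as a minimal Python sketch. It assumes that both component predictors are always trained, and that the 2-bit selector is nudged toward whichever predictor was right only when the two disagree. The class and helper names, the table sizes, and the initial states are illustrative assumptions, not the lecture's design.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class Tournament:
    """Hypothetical tournament of a per-PC 2-bit predictor and a gshare predictor."""

    def __init__(self, bits=6):
        self.size = 1 << bits
        self.local = [2] * self.size     # Prediction 1: per-PC 2-bit counters
        self.gshare = [2] * self.size    # Prediction 2: indexed by PC xor history
        self.select = [1] * self.size    # selector: >= 2 means trust Prediction 2
        self.history = 0

    def _lookup(self, pc):
        i = pc % self.size
        g = (pc ^ self.history) % self.size
        return i, g, self.local[i] >= 2, self.gshare[g] >= 2

    def predict(self, pc):
        i, _g, p1, p2 = self._lookup(pc)
        return p2 if self.select[i] >= 2 else p1

    def update(self, pc, taken):
        i, g, p1, p2 = self._lookup(pc)
        if p1 != p2:                     # train the selector only on disagreement
            self.select[i] = bump(self.select[i], p2 == taken)
        self.local[i] = bump(self.local[i], taken)
        self.gshare[g] = bump(self.gshare[g], taken)
        self.history = ((self.history << 1) | int(taken)) % self.size


def count_misses(predictor, pc, outcomes):
    """Run one branch through the predictor and count mispredictions."""
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


print(count_misses(Tournament(), 5, [n % 2 == 0 for n in range(200)]))
```

On a flip/flop branch the per-PC counter keeps predicting taken, so the selector drifts toward the gshare side once the history-based counters have warmed up.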
27
Overriding Predictors
  • Big predictors are slow, but more accurate
  • Use a single cycle predictor in fetch
  • Start the multi-cycle predictor
  • When it completes, compare it to the fast
    prediction.
  • If same, do nothing
  • If different, assume the slow predictor is right
    and flush the pipeline.
  • Advantage: reduced branch penalty for those
    branches mispredicted by the fast predictor and
    correctly predicted by the slow predictor

28
Pipelined Gshare Predictor
  • How can we get a pipelined global prediction by
    stage 1?
  • Start in stage 2
  • Don't have the most recent branch history
  • Access multiple entries
  • E.g. if we are missing last three branches, get 8
    histories and pick between them during fetch
    stage.

Ref: Reconsidering Complex Branch Predictors
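
A minimal Python sketch of the "access multiple entries" idea: the table read starts early with a stale history that lacks the newest outcomes, fetches every entry those missing bits could select, and the final pick happens once the outcomes are known. The function names, the plain global-history index (no XOR with the PC), and the stand-in table contents are assumptions made for illustration.

```python
def read_candidates(pht, stale_history, missing_bits, index_bits):
    """Early stage: read every entry that the final history could select."""
    mask = (1 << index_bits) - 1
    base = (stale_history << missing_bits) & mask
    return [pht[base | recent] for recent in range(1 << missing_bits)]


def pick(candidates, recent_outcomes):
    """Fetch stage: the newest branch outcomes choose one candidate."""
    return candidates[recent_outcomes]


index_bits = 8
pht = list(range(1 << index_bits))     # stand-in entries so each one is identifiable
full_history = 0b10110101              # history including the last three branches
stale_history = full_history >> 3      # what was known when the lookup started
recent = full_history & 0b111          # the three outcomes that arrived later

candidates = read_candidates(pht, stale_history, 3, index_bits)
print(len(candidates), pick(candidates, recent) == pht[full_history])
```

Missing three branches means 2^3 = 8 candidate entries, matching the example in the slide.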

29
Exceptions
  • Exceptions are events that are difficult or
    impossible to manage in hardware alone.
  • Exceptions are usually handled by jumping into a
    service (software) routine.
  • Examples: I/O device request, page fault, divide
    by zero, memory protection violation (seg fault),
    hardware failure, etc.

30
Taking an Exception
  • Once an exception occurs, how does the processor
    proceed?
  • Non-pipelined: don't fetch from PC; save state;
    fetch from interrupt vector table
  • Pipelined: depends on the exception
  • Precise interrupt: must stop all instructions
    after the exception (squash)
  • Divide by zero: flush fetch/decode
  • Page fault (fetch or mem stage?)
  • Save state after last instruction before
    exception completes (PC, regs)
  • Fetch from interrupt vector table

31
Optimizing CPU Performance
  • Golden Rule: $t_{CPU} = N_{inst} \times CPI \times t_{CLK}$
  • Given this, what are our options?
  • Reduce the number of instructions executed
  • Compiler job (COP 5621, COP 5622)
  • Reduce the clock period
  • Fabrication (some engineering classes)
  • Reduce the cycles to execute an instruction
  • Approach: Instruction Level Parallelism (ILP)
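
A small worked example of the golden rule in Python. The instruction count, CPI values, and clock period are made-up numbers chosen only to show how each factor scales the total time.

```python
def cpu_time(n_inst, cpi, t_clk):
    """Golden rule: execution time = instructions x cycles/instruction x seconds/cycle."""
    return n_inst * cpi * t_clk


baseline = cpu_time(n_inst=1_000_000, cpi=2.0, t_clk=1e-9)   # hypothetical scalar pipeline
wider = cpu_time(n_inst=1_000_000, cpi=1.0, t_clk=1e-9)      # hypothetical ideal 2-wide pipeline
print(baseline, wider, baseline / wider)
```

Halving any one factor halves the execution time. ILP targets the CPI term while leaving the instruction count and the clock period alone.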

32
Adding width to basic pipelining
  • 5 stage RISC load-store architecture
  • About as simple as things get
  • Instruction fetch
  • get 2 instructions from memory/cache
  • Instruction decode
  • translate opcodes into control signals and read
    regs
  • Execute
  • perform ALU operations
  • Memory
  • Access memory operations if load/store
  • Writeback/retire
  • update register file

33
Stage 1 Fetch
  • Design a datapath that can fetch two instructions
    from memory every cycle.
  • Use PC to index memory to read instruction
  • Read 2 instructions
  • Increment the PC (by 2)
  • Write everything needed to complete execution to
    the pipeline register (IF/ID)
  • Instruction 1, instruction 2, PC+1, PC+2
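
A minimal Python sketch of the two-wide fetch stage described above, assuming a word-addressed instruction memory modelled as a list. The function name and the field names in the IF/ID register are hypothetical.

```python
def fetch_two(inst_mem, pc):
    """Read two instructions, fill the IF/ID register, and return the next PC."""
    if_id = {
        "inst1": inst_mem[pc],
        "inst2": inst_mem[pc + 1],
        "pc_plus_1": pc + 1,     # used later for the first instruction's branch target
        "pc_plus_2": pc + 2,     # used later for the second instruction's branch target
    }
    return if_id, pc + 2         # increment the PC by 2


inst_mem = ["add", "nand", "lw", "sw", "beq", "noop"]   # placeholder instructions
if_id, next_pc = fetch_two(inst_mem, 0)
print(if_id, next_pc)
```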

34
Rest of pipelined datapath
35
Stage 2 Decode
  • Design a datapath that reads the IF/ID pipeline
    register, decodes instructions and reads register
    file (specified by regA and regB of instruction
    bits for both instructions).
  • Write everything needed to complete execution to
    the pipeline register (ID/EX)
  • Pass on both instructions.
  • Including PC+1 and PC+2, even though decode didn't
    use them.

36
Rest of pipelined datapath
Stage 1 Fetch datapath
Changes? Hazard detection?
37
Stage 3 Execute
  • Design a datapath that performs the proper ALU
    operations for the instructions specified and the
    values present in the ID/EX pipeline register.
  • The inputs to ALU_top are the contents of regA_top
    and either the contents of regB_top or the
    offset_top field of the instruction.
  • The inputs to ALU_bottom are the contents of
    regA_bottom and either the contents of regB_bottom
    or the offset_bottom field of the instruction.
  • Also, calculate PC+1+offset_top in case this is a
    branch.
  • Also, calculate PC+2+offset_bottom in case this is
    a branch.

38
PC+1
Stage 2 Decode datapath
Control Signals
How many data forwarding paths?
39
Stage 4 Memory Operation
  • Design a datapath that performs the proper memory
    operation(s) for the instructions specified and
    the values present in the EX/Mem pipeline
    register.
  • ALU results contain addresses for ld and st
    instructions.
  • Opcode bits control memory R/W and enable
    signals.
  • Write everything needed to complete execution to
    the pipeline register (Mem/WB)
  • ALU results and MemData(x2)
  • Instruction bits for opcodes and destRegs
    specifiers

40
PC+1 + offset
Stage 3 Execute datapath
contents of regB
Control Signals
Should we process 2 memory operations in one
cycle?
41
Stage 5 Write back
  • Design a datapath that completes the execution of
    these instructions, writing to the register file
    if required.
  • Write MemData to destReg for ld instructions
  • Write ALU result to destReg for add or nand
    instructions.
  • Opcode bits also control register write enable
    signal.

42
What about ordering the register writes if both instructions have the
same destination specifier?
[Figure: Stage 4 Memory datapath. Labels: ALU result, memory read data, control signals, Mem/WB pipeline register]
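
One possible answer to the ordering question, as a minimal Python sketch: commit the two lanes in program order so that the younger instruction's value is the one left in the register. The function name and the lane field names are hypothetical; real hardware would more likely suppress the older write than perform both.

```python
def write_back(regs, top, bottom):
    """Commit both lanes in program order; top is the older instruction."""
    for lane in (top, bottom):
        if lane["write_enable"]:
            regs[lane["dest_reg"]] = lane["value"]
    return regs


regs = [0] * 8
older = {"write_enable": True, "dest_reg": 3, "value": 11}     # e.g. an ALU result
younger = {"write_enable": True, "dest_reg": 3, "value": 22}   # e.g. loaded MemData
write_back(regs, older, younger)
print(regs[3])
```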
43
How Much ILP is There?
44
ALU Operation GOOD, Branch BAD
Expected Number of Branches Between
Mispredicts: $E(X) = 1/(1-p)$. E.g., $p = 95\%$: $E(X) = 20$
branches, 100-ish instructions
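
A small worked example of the formula in Python. The figure of roughly one branch every five instructions is an assumption inferred from the slide's "20 branches, 100-ish instructions"; the function names are illustrative.

```python
def branches_between_mispredicts(p):
    """Expected branches per mispredict for prediction accuracy p (0 <= p < 1)."""
    return 1.0 / (1.0 - p)


def instructions_between_mispredicts(p, insts_per_branch=5):
    """Scale by an assumed average number of instructions per branch."""
    return branches_between_mispredicts(p) * insts_per_branch


for accuracy in (0.90, 0.95, 0.99):
    print(accuracy,
          round(branches_between_mispredicts(accuracy)),
          round(instructions_between_mispredicts(accuracy)))
```

The relationship is strongly nonlinear: going from 95% to 99% accuracy multiplies the distance between mispredicts by five.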
45
How Accurate are Branch Predictors?