CpE 442 Designing a Pipeline Processor (lect. II)

Slides: 42
Provided by: cseeWvuE5

Transcript and Presenter's Notes

Title: CpE 442 Designing a Pipeline Processor (lect. II)


1
CpE 442 Designing a Pipeline Processor (lect. II)
2
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards (15 minutes)
  • Forwarding (25 minutes)
  • 1 cycle Load Delay (5 minutes)
  • 1 cycle Branch Delay (15 minutes)
  • What makes pipelining hard
  • Summary (5 minutes)

3
Review: Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing diagrams. Single Cycle Implementation (Cycles 1-2): Load, Store, Waste. Multiple Cycle Implementation (Cycles 1-10): Load, Store, R-type. Pipeline Implementation: Load, Store, R-type.]
4
Review: A Pipelined Datapath
[Figure: pipelined datapath with stages Ifetch, Reg/Dec, Exec, Mem, Wr, separated by the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers. Blocks: IUnit, RFile, Exec Unit, Data Mem. Control signals: ExtOp, ALUOp, ALUSrc, RegDst, Branch, MemWr, MemtoReg, RegWr.]
5
Review: Pipeline Control (Data Stationary Control)
  • The Main Control generates the control signals
    during Reg/Dec
  • Control signals for Exec (ExtOp, ALUSrc, ...) are
    used 1 cycle later
  • Control signals for Mem (MemWr, Branch) are used 2
    cycles later
  • Control signals for Wr (MemtoReg, RegWr) are used
    3 cycles later

[Figure: Main Control in Reg/Dec feeding the ID/Ex, Ex/Mem, and Mem/Wr registers. ExtOp, ALUSrc, ALUOp, and RegDst are carried to Exec; MemWr and Branch are carried to Mem; MemtoReg and RegWr are carried to Wr.]
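The idea of data stationary control can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the control-word values in `CONTROL` and the function name `simulate` are hypothetical, and only three of the slide's signals are modeled. The control word is produced once in Reg/Dec and then travels through the ID/Ex, Ex/Mem, and Mem/Wr registers alongside its instruction.

```python
# Illustrative control words; signal names follow the slide, values are assumed.
CONTROL = {
    "lw":  {"ALUSrc": 1, "MemWr": 0, "RegWr": 1},
    "sw":  {"ALUSrc": 1, "MemWr": 1, "RegWr": 0},
    "add": {"ALUSrc": 0, "MemWr": 0, "RegWr": 1},
}

def simulate(ops):
    """Return (cycle, stage, op, signal, value) tuples showing when each signal is used."""
    id_ex = ex_mem = mem_wr = None  # control words held in the pipeline registers
    log = []
    # ops[0] is fetched in cycle 1, so it is decoded (Reg/Dec) in cycle 2.
    for cycle, op in enumerate(list(ops) + [None] * 3, start=2):
        # Each later stage reads only the word latched in the register feeding it.
        if id_ex:
            log.append((cycle, "Exec", id_ex["op"], "ALUSrc", id_ex["ALUSrc"]))
        if ex_mem:
            log.append((cycle, "Mem", ex_mem["op"], "MemWr", ex_mem["MemWr"]))
        if mem_wr:
            log.append((cycle, "Wr", mem_wr["op"], "RegWr", mem_wr["RegWr"]))
        # Clock edge: the words move one register down; Main Control decodes the new op.
        decoded = dict(CONTROL[op], op=op) if op else None
        id_ex, ex_mem, mem_wr = decoded, id_ex, ex_mem
    return log

for entry in simulate(["lw", "add"]):
    print(entry)
```

For a single `lw` decoded in cycle 2, the log shows ALUSrc used in cycle 3, MemWr in cycle 4, and RegWr in cycle 5, which matches the 1, 2, and 3 cycle delays listed above.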
6
Review: Pipeline Summary
  • Pipeline Processor
  • Natural enhancement of the multiple clock cycle
    processor
  • Each functional unit can only be used once per
    instruction
  • If an instruction is going to use a functional
    unit, it must use it at the same stage as all
    other instructions
  • Pipeline Control
  • Each stage's control signals depend ONLY on the
    instruction that is currently in that stage

7
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards
  • Forwarding (25 minutes)
  • 1 cycle Load Delay (5 minutes)
  • 1 cycle Branch Delay (15 minutes)
  • What makes pipelining hard
  • Summary (5 minutes)

8
Introduction to Hazards
  • Limits to pipelining: Hazards prevent next
    instruction from executing during its designated
    clock cycle
  • structural hazards: HW cannot support this
    combination of instructions
  • data hazards: instruction depends on result of
    prior instruction still in the pipeline
  • control hazards: pipelining of branches and other
    instructions
  • Common solution is to stall the pipeline until the
    hazard is resolved, inserting bubbles in the
    pipeline

9
A Single Memory is a Structural Hazard
[Figure: pipeline diagram, time (clock cycles) vs. instruction order, showing Mem and Reg accesses for Load, Instr 1, Instr 2, Instr 3, Instr 4.]
10
Option 1: Stall to Resolve Memory Structural Hazard
[Figure: pipeline diagram, time (clock cycles) vs. instruction order, for Load, Instr 1, Instr 2, Instr 3 (stall), Instr 4.]
11
Option 2: Duplicate to Resolve Structural Hazard
  • Separate Instruction Cache (Im) and Data Cache (Dm)

[Figure: pipeline diagram, time (clock cycles) vs. instruction order, for Load, Instr 1, Instr 2, Instr 3, Instr 4.]
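The two options for the memory structural hazard can be checked with a small sketch. It assumes the ideal five-stage timing (instruction i is fetched in cycle i+1 and reaches Mem three cycles later) and uses a hypothetical helper, `memory_conflicts`, that lists the cycles in which two instructions need the same memory port.

```python
def memory_conflicts(program, single_memory=True):
    """program: list of (name, uses_data_memory). Returns [(cycle, [(name, stage), ...])]."""
    uses = {}  # (cycle, port) -> list of (name, stage) that access the port
    for i, (name, uses_data_memory) in enumerate(program):
        fetch_cycle = i + 1
        # With one shared memory, fetches and data accesses compete for one port.
        if_port = "mem" if single_memory else "Im"
        mem_port = "mem" if single_memory else "Dm"
        uses.setdefault((fetch_cycle, if_port), []).append((name, "IF"))
        if uses_data_memory:
            uses.setdefault((fetch_cycle + 3, mem_port), []).append((name, "MEM"))
    return [(cycle, users) for (cycle, _), users in sorted(uses.items())
            if len(users) > 1]

program = [("Load", True), ("Instr 1", False), ("Instr 2", False),
           ("Instr 3", False), ("Instr 4", False)]
print(memory_conflicts(program, single_memory=True))
print(memory_conflicts(program, single_memory=False))
```

With a single memory the Load's Mem access collides with the fetch of Instr 3 in cycle 4, which is the instruction shown stalling in Option 1. With separate Im and Dm no cycle has two users of the same port.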
12
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
13
Data Hazard on r1 (Figure 6.30, page 397, PH)
  • Dependencies backwards in time are hazards

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.]
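Which of these instructions actually read a stale r1 can be worked out from the stage timing. The sketch below is an assumption-laden illustration, not the slide's own method: instruction i is taken to read registers in cycle i+2 (ID/RF) and write its result in cycle i+5 (WB), with no stalls or forwarding. The function name `stale_reads` and the flag `write_then_read_same_cycle` are hypothetical.

```python
def stale_reads(program, write_then_read_same_cycle=True):
    """program: list of (text, dest, sources). Returns texts that read a register too early."""
    hazards = []
    last_writer = {}  # register -> index of the most recent instruction writing it
    for j, (text, dest, sources) in enumerate(program):
        for reg in sources:
            if reg not in last_writer:
                continue
            wb_cycle = last_writer[reg] + 5   # producer writes the register file here
            read_cycle = j + 2                # consumer reads the register file here
            too_early = read_cycle < wb_cycle or (
                read_cycle == wb_cycle and not write_then_read_same_cycle)
            if too_early:
                hazards.append(text)
                break
        last_writer[dest] = j
    return hazards

program = [
    ("add r1,r2,r3", "r1", ("r2", "r3")),
    ("sub r4,r1,r3", "r4", ("r1", "r3")),
    ("and r6,r1,r7", "r6", ("r1", "r7")),
    ("or r8,r1,r9", "r8", ("r1", "r9")),
    ("xor r10,r1,r11", "r10", ("r1", "r11")),
]
print(stale_reads(program))
```

If a register written and read in the same cycle returns the new value, only `sub` and `and` are hazards. Without that property `or` is a hazard as well.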
14
Option 1: HW Stalls to Resolve Data Hazard
  • Dependencies backwards in time are hazards

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with the dependent instructions delayed.]
15
But recall use of Data Stationary Control
  • The Main Control generates the control signals
    during Reg/Dec
  • Control signals for Exec (ExtOp, ALUSrc, ...) are
    used 1 cycle later
  • Control signals for Mem (MemWr, Branch) are used 2
    cycles later
  • Control signals for Wr (MemtoReg, RegWr) are used
    3 cycles later

[Figure: Main Control in Reg/Dec feeding the ID/Ex, Ex/Mem, and Mem/Wr registers. ExtOp, ALUSrc, ALUOp, and RegDst are carried to Exec; MemWr and Branch are carried to Mem; MemtoReg and RegWr are carried to Wr.]
16
Option 1: How HW really stalls pipeline
  • HW doesn't change PC => keeps fetching same
    instruction, and sets control signals to
    benign values (0)

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; stall; stall; stall; sub r4,r1,r3; and r6,r1,r7.]
17
Option 2: SW inserts independent instructions
  • Worst case inserts NOP instructions

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; nop; nop; nop; sub r4,r1,r3; and r6,r1,r7.]
18
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards (15 minutes)
  • Forwarding
  • 1 cycle Load Delay (5 minutes)
  • 1 cycle Branch Delay (15 minutes)
  • What makes pipelining hard
  • Summary (5 minutes)

19
Option 3 Insight: Data is available! (Figure 6.35, page 415, PH)
  • Pipeline registers already contain needed data

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11.]
20
HW Change for Forwarding (Bypassing)
  • Increase multiplexors to add paths from pipeline
    registers
  • Assumes register read during write gets new
    value (otherwise more results to be forwarded)
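The extra multiplexor paths need a select signal. A minimal sketch of that selection follows, assuming the usual textbook conditions: forward from the Ex/Mem register when the instruction there writes the needed register, otherwise from Mem/Wr, otherwise use the register file value. The dictionary fields `RegWr` and `rd` and the function name `forward_select` are illustrative names, not taken from Figure 6.41.

```python
def forward_select(src_reg, ex_mem, mem_wr):
    """Pick the source for one ALU input: 'Ex/Mem', 'Mem/Wr', or 'RFile'."""
    # Register 0 always reads as zero in MIPS, so it is never forwarded.
    if ex_mem["RegWr"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "Ex/Mem"   # newest result, produced one cycle ago
    if mem_wr["RegWr"] and mem_wr["rd"] != 0 and mem_wr["rd"] == src_reg:
        return "Mem/Wr"   # older result, produced two cycles ago
    return "RFile"        # no pending write: the register file value is current

# sub r4,r1,r3 is in Exec while add r1,r2,r3 sits in the Ex/Mem register.
add_word = {"RegWr": True, "rd": 1}
empty = {"RegWr": False, "rd": 0}
print(forward_select(1, ex_mem=add_word, mem_wr=empty))  # input r1
print(forward_select(3, ex_mem=add_word, mem_wr=empty))  # input r3
```

Checking Ex/Mem before Mem/Wr matters when both hold a write to the same register, because the Ex/Mem copy is the more recent one.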

21
Complete Data Path with Hazard Detection and Forwarding (Figure 6.41 in the text)
22
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards (15 minutes)
  • Forwarding (25 minutes)
  • 1 cycle Load Delay
  • 1 cycle Branch Delay (15 minutes)
  • What makes pipelining hard
  • Summary (5 minutes)

23
From Last Lecture: The Delay Load Phenomenon
[Figure: clock timing diagram, Cycles 1-8, for I0 Load followed by Plus 1, Plus 2, Plus 3, Plus 4.]
  • Although Load is fetched during Cycle 1
  • The data is NOT written into the Reg File until
    the end of Cycle 5
  • We cannot read this value from the Reg File until
    Cycle 6
  • 3-instruction delay before the load takes effect

24
Forwarding reduces Data Hazard to 1 cycle (Figure 6.47, page 420, PH)
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for lw r1,0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9.]
25
Option 1: HW Stalls to Resolve Data Hazard
  • Interlock: checks for hazard and stalls

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for lw r1,0(r2); stall; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9.]
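The interlock check can be sketched as a single comparison. This assumes the common textbook condition: stall when the instruction in Exec is a load and its destination register is a source of the instruction in Reg/Dec. The field names `is_load`, `rt`, and `rs` and the function name `must_stall` are hypothetical.

```python
def must_stall(id_ex, if_id):
    """True when the instruction in Reg/Dec needs the result of a load still in Exec."""
    return bool(
        id_ex["is_load"]
        and id_ex["rt"] != 0                       # loads into register 0 have no effect
        and id_ex["rt"] in (if_id["rs"], if_id["rt"])
    )

lw_in_exec = {"is_load": True, "rt": 1}            # lw r1,0(r2)
sub_in_decode = {"rs": 1, "rt": 3}                 # sub r4,r1,r3
print(must_stall(lw_in_exec, sub_in_decode))
```

One stall is enough here because, once the load has reached Mem, forwarding can supply r1 to the dependent instruction.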
26
Option 2: SW inserts independent instructions
  • Worst case inserts NOP instructions
  • MIPS I solution: No HW checking

[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time (clock cycles) for lw r1,0(r2); nop; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9.]
27
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW Rb,b
  LW Rc,c
  ADD Ra,Rb,Rc
  SW a,Ra
  LW Re,e
  LW Rf,f
  SUB Rd,Re,Rf
  SW d,Rd
28
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW Rb,b
  LW Rc,c
  ADD Ra,Rb,Rc
  SW a,Ra
  LW Re,e
  LW Rf,f
  SUB Rd,Re,Rf
  SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
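The benefit of the reordering can be counted directly. The sketch below assumes full forwarding, so the only remaining penalty is one stall whenever an instruction reads a register loaded by the instruction immediately before it. The tuple encoding and the function name `load_use_stalls` are illustrative.

```python
def load_use_stalls(code):
    """code: list of (op, dest, sources). Counts 1-cycle load-use stalls."""
    stalls = 0
    previous_load_dest = None
    for op, dest, sources in code:
        if previous_load_dest in sources:
            stalls += 1                      # value arrives from memory one cycle too late
        previous_load_dest = dest if op == "LW" else None
    return stalls

slow = [("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
        ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]
fast = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
        ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))
```

Under these assumptions the slow code stalls twice (ADD right after LW Rc, SUB right after LW Rf), while the fast code has no load immediately followed by a use of its result.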

30
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards (15 minutes)
  • Forwarding (25 minutes)
  • 1 cycle Load Delay (5 minutes)
  • 1 cycle Branch Delay
  • What makes pipelining hard
  • Summary (5 minutes)

31
From Last Lecture: The Delay Branch Phenomenon
[Figure: clock timing diagram, Cycles 4-11, for 12: Beq (target is 1000); 16: R-type; 20: R-type; 24: R-type; 1000: Target of Br.]
  • Although Beq is fetched during Cycle 4
  • Target address is NOT written into the PC until
    the end of Cycle 7
  • Branch's target is NOT fetched until Cycle 8
  • 3-instruction delay before the branch takes effect
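The cycle numbers above follow from the stage in which the PC is updated. A short sketch, assuming one stage per cycle and that the new PC becomes visible to the fetch in the cycle after it is written; the function name `branch_timing` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def branch_timing(fetch_cycle, pc_written_in):
    """Return (cycle in which the target is fetched, instructions fetched in between)."""
    pc_write_cycle = fetch_cycle + STAGES.index(pc_written_in)
    target_fetch_cycle = pc_write_cycle + 1
    delay = target_fetch_cycle - fetch_cycle - 1
    return target_fetch_cycle, delay

print(branch_timing(4, "MEM"))  # PC written at the end of the Mem stage
print(branch_timing(4, "ID"))   # PC written at the end of decode
```

With the PC written in Mem, a Beq fetched in Cycle 4 has its target fetched in Cycle 8 after a 3-instruction delay. Writing the PC one stage after fetch would shrink the delay to a single instruction.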

32
Control Hazard on Branches: 3 stage stall
33
Branch Stall Impact
  • If CPI = 1, 30% branch, stall 3 cycles => new CPI =
    1.9!
  • 2-part solution:
  • Determine branch taken or not sooner, AND
  • Compute taken branch address earlier
  • Solution Option 1:
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch vs. 3
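The CPI figure above is a one-line calculation: every branch adds its stall cycles to the base CPI. A small sketch, with `cpi_with_branch_stalls` as an illustrative name:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch costs a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles

print(cpi_with_branch_stalls(1.0, 0.30, 3))  # 3-cycle branch stall
print(cpi_with_branch_stalls(1.0, 0.30, 1))  # 1-cycle penalty after moving the test to ID/RF
```

The first call reproduces 1 + 0.30 x 3 = 1.9. With the 1-cycle penalty the same arithmetic gives 1 + 0.30 x 1 = 1.3.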

34
Option 1: move HW forward to reduce branch delay (Data Path before change)
[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back.]
35
Branch Delay now 1 clock cycle (Data Path after change)
[Figure: datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back.]
36
Option 2: No Stalls. Define Branch as Delayed: insert an
instruction after the branch and allow it to always
execute
  • Worst case, SW inserts NOP into branch delay slot if
    no instruction can be found
  • Where to get instructions to fill branch delay
    slot?
  • Before branch instruction,
  • example: sw r1,0(r2); beqd r0,r2,T
  • change to: beqd r0,r2,T; sw r1,0(r2)
  • From the target address: only valuable when
    branch is taken
  • From fall through: only valuable when branch is
    not taken
  • Compiler effectiveness for single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch
    delay slots useful in computation
  • about 50% (60% x 80%) of slots usefully filled
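The last three bullets multiply out as follows. The second function goes one step beyond the slide and is only a rough estimate: it assumes each delay slot that is not usefully filled costs one wasted cycle, and it takes the branch frequency as a parameter. Both function names are illustrative.

```python
def usefully_filled(fill_rate, useful_rate):
    """Fraction of branch delay slots that hold useful work."""
    return fill_rate * useful_rate

def cpi_with_delay_slots(base_cpi, branch_fraction, fill_rate, useful_rate):
    """Rough CPI estimate: each slot that is not usefully filled wastes one cycle (assumption)."""
    wasted = 1.0 - usefully_filled(fill_rate, useful_rate)
    return base_cpi + branch_fraction * wasted

print(usefully_filled(0.60, 0.80))
print(cpi_with_delay_slots(1.0, 0.30, 0.60, 0.80))
```

The product 0.60 x 0.80 is 0.48, which the slide rounds to about 50%.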

37
Complete Data Path with Hazard Detection and Forwarding (Figure 6.41 in the text)
38
Example: Text Figure 6.52
39
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to Hazards (15 minutes)
  • Forwarding (25 minutes)
  • 1 cycle Load Delay (5 minutes)
  • 1 cycle Branch Delay (15 minutes)
  • What makes pipelining hard
  • Summary (5 minutes)

40
When is pipelining hard?
  • Interrupts: 5 instructions executing in 5 stage
    pipeline
  • How to stop the pipeline?
  • Restart?
  • Who caused the interrupt?
  • Stage: Problem interrupts occurring
  • IF: Page fault on instruction fetch; misaligned
    memory access; memory-protection violation
  • ID: Undefined or illegal opcode
  • EX: Arithmetic interrupt
  • MEM: Page fault on data fetch; misaligned memory
    access; memory-protection violation
  • Load with data page fault, Add with instruction
    page fault?
  • Solution 1: interrupt vector per instruction.
    Solution 2: interrupt ASAP, restart everything
    incomplete
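The Load/Add question can be made concrete with a sketch of Solution 1. It assumes each instruction carries its own record of faults down the pipeline and that the records are examined in program order, so an older instruction's fault is handled first even when a younger instruction faulted earlier in time. The function names and the data layout are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def raised_first_in_time(program):
    """Name of the instruction whose fault occurs in the earliest clock cycle."""
    events = []
    for i, (name, faults) in enumerate(program):
        for stage in faults:
            cycle = i + 1 + STAGES.index(stage)  # instruction i is fetched in cycle i + 1
            events.append((cycle, i, name))
    return min(events)[2] if events else None

def handled_first_with_status_vector(program):
    """Per-instruction fault records are checked in program order."""
    for name, faults in program:
        if faults:
            stage = min(faults, key=STAGES.index)
            return name, faults[stage]
    return None

program = [("Load", {"MEM": "data page fault"}),
           ("Add", {"IF": "instruction page fault"})]
print(raised_first_in_time(program))
print(handled_first_with_status_vector(program))
```

The Add faults in cycle 2 and the Load in cycle 4, yet the per-instruction record makes the Load's data page fault the one that is handled first.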

41
Data path with Exception Handling (Text Figure 6.55): add a Cause register, an Exception PC, and the constant address of the Exception Handling routine
42
Review: Summary of Pipelining Basics
  • Speed Up ≤ Pipeline Depth (number of stages); if
    ideal CPI is 1, then
  • Hazards limit performance on computers:
  • structural: need more HW resources
  • data: need forwarding, compiler scheduling
  • control: early evaluation and PC, delayed branch,
    prediction
  • Increasing length of pipe increases impact of
    hazards since pipelining helps instruction
    bandwidth, not latency
  • Compilers key to reducing cost of data and
    control hazards
  • load delay slots
  • branch delay slots
  • Exceptions, Instruction Set, FP make pipelining
    harder
  • Longer pipelines => Branch prediction, more
    instruction parallelism?
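The expression that the first bullet leads into is not present in the transcript. A commonly used textbook form, assumed here rather than taken from the slide, is speedup = pipeline depth / (1 + stall cycles per instruction) x (unpipelined clock cycle / pipelined clock cycle). The sketch below evaluates it; the function and parameter names are illustrative.

```python
def pipeline_speedup(depth, stall_cycles_per_instr, clock_ratio=1.0):
    """Speedup over the unpipelined machine, assuming an ideal CPI of 1.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return depth / (1.0 + stall_cycles_per_instr) * clock_ratio

print(pipeline_speedup(5, 0.0))   # no stalls: speedup equals the pipeline depth
print(pipeline_speedup(5, 0.9))   # 30% branches stalling 3 cycles each
```

With no stalls and equal clock cycles the speedup reaches the pipeline depth. Any stall cycles pull it below that bound, which is why hazards limit performance.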