Lecture 10: Pipelining


Title: Lecture 10: Pipelining


1
Lecture 10: Pipelining
  • Computer Engineering 585
  • Fall 2001

2
DLX Stages RTL activities
3
Pipelining is Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: the hardware cannot support this
    combination of instructions (the same table used for
    reading the newspaper and eating breakfast)
  • Data hazards: an instruction depends on the result of a
    prior instruction still in the pipeline (sharing
    the towel: handing it over from shower to sink)
  • Control hazards: pipelining of branches and other
    instructions that change the PC
  • A common solution is to stall the pipeline until
    the hazard is resolved, inserting one or more
    bubbles in the pipeline

4
Structural Hazard: Memory Port
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 8. Rows: Load, Instruction 1, Instruction 2, Instruction 3, Instruction 4, each passing through the stages Mem, Reg, ALU, Mem, Reg.]
5
Bubbles due to One Memory Port
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 8. Rows: Load, Instruction 1, Instruction 2, Stall, Instruction 3.]
6
Bubbles Instruction-Time Space
7
Speed Up Equation for Pipelining
  • $\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$
  • $\text{Speedup} = \dfrac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$
  • With an ideal CPI of 1:
    $\text{Speedup} = \dfrac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$

8
Example: Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its
    pipelined implementation has a 1.05 times faster
    clock rate
  • Ideal CPI = 1 for both
  • Loads/stores are 40% of instructions executed
  • $\text{SpeedUp}_A = \dfrac{\text{Pipeline depth}}{1 + 0} \times \dfrac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}$
  • $\text{SpeedUp}_B = \dfrac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \dfrac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05}$
  • $= \dfrac{\text{Pipeline depth}}{1.4} \times 1.05$
  • $= 0.75 \times \text{Pipeline depth}$
  • $\text{SpeedUp}_A / \text{SpeedUp}_B = \dfrac{\text{Pipeline depth}}{0.75 \times \text{Pipeline depth}} = 1.33$
  • Machine A is 1.33 times faster
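The arithmetic in this example can be checked with a short script. This is only a sketch: the function name `pipeline_speedup` and the pipeline depth of 5 are assumptions made for illustration (the slide leaves the depth symbolic, and it cancels in the final ratio).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is Clock Cycle_unpipelined / Clock Cycle_pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


depth = 5  # assumed value; any positive depth gives the same A/B ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% of instructions are loads/stores and each costs one stall
# cycle, but the clock is 1.05 times faster.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b
print(f"A = {speedup_a:.2f}, B = {speedup_b:.2f}, A/B = {ratio:.2f}")
```

Changing `depth` changes both speedups but not their ratio, which is why the slide can leave the depth unspecified.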

9
Data Hazard
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 6. Vertical axis: program execution order (in instructions). Each instruction passes through IM, Reg, ALU, DM, Reg.]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
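The dependences on R1 in this sequence can be found mechanically. The sketch below is illustrative only: the helper names are hypothetical, and the rule that only consumers one or two instructions after the producer are hazardous is an assumption (a five-stage pipeline without forwarding whose register file is written in the first half of a cycle and read in the second half).

```python
def parse(instr):
    """Split 'ADD R1, R2, R3' into (opcode, destination, [sources])."""
    op, rest = instr.split(None, 1)
    regs = [r.strip() for r in rest.split(",")]
    return op, regs[0], regs[1:]


def raw_dependences(program):
    """Return (producer_index, consumer_index, register) for each source
    register whose most recent writer is an earlier instruction."""
    last_writer = {}
    deps = []
    for i, instr in enumerate(program):
        _, dest, sources = parse(instr)
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


program = [
    "ADD R1, R2, R3",
    "SUB R4, R1, R5",
    "AND R6, R1, R7",
    "OR R8, R1, R9",
    "XOR R10, R1, R11",
]

deps = raw_dependences(program)

# Assumption: without forwarding, only consumers at distance 1 or 2 read R1
# before ADD has written it back.
hazards = [(p, c, r) for (p, c, r) in deps if c - p <= 2]

for p, c, r in hazards:
    print(f"{program[c]!r} needs {r} from {program[p]!r} too early")
```

Under that assumption SUB and AND are the hazardous readers, while OR and XOR read R1 after it has been written.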
10
Data Forwarding
[Figure: pipeline timing diagram with data forwarding. Horizontal axis: time in clock cycles, CC 1 to CC 6. Vertical axis: program execution order (in instructions).]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
11
Hardware Support for Forwarding
[Figure: datapath showing the ID/EX, EX/MEM, and MEM/WB pipeline registers, the ALU with its input multiplexers, the Zero? test, and the data memory.]
FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.
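The selection made by each ALU input multiplexer can be sketched in software. This is a simplified illustration, not the control logic from the figure: the function name, the dictionary fields `writes_reg` and `rd`, and the reduction to two forwarding sources plus the normal ID/EX path are all assumptions.

```python
def select_alu_operand(src_reg, ex_mem, mem_wb):
    """Return which pipeline register should supply an ALU source operand.

    ex_mem and mem_wb are hypothetical descriptions of the instructions
    currently held in those pipeline registers:
    {"writes_reg": bool, "rd": destination register name or None}.
    """
    # The most recent producer wins: check EX/MEM before MEM/WB.
    if ex_mem["writes_reg"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["writes_reg"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    # No newer value in flight: use the value read from the register file.
    return "ID/EX"


# SUB R4, R1, R5 is in EX while ADD R1, R2, R3 sits in EX/MEM.
add_in_ex_mem = {"writes_reg": True, "rd": "R1"}
nothing_in_mem_wb = {"writes_reg": False, "rd": None}

print(select_alu_operand("R1", add_in_ex_mem, nothing_in_mem_wb))
print(select_alu_operand("R5", add_in_ex_mem, nothing_in_mem_wb))
```

Checking EX/MEM first matters when two in-flight instructions write the same register, since the later one holds the value the consumer should see.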
12
Data Hazard (Add to Store)
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 6. Vertical axis: program execution order (in instructions).]
ADD R1, R2, R3
LW R4, 0(R1)
SW 12(R1), R4
FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.
13
Not Forwardable Data Hazards
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 5. Vertical axis: program execution order (in instructions).]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
14
Load Stalls
[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles, CC 1 to CC 6. Vertical axis: program execution order (in instructions). A bubble appears in the row of each instruction after the load.]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.
15
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.
  • Slow code
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
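The benefit of the reordering can be checked by counting load-use stalls in both schedules. The sketch below rests on stated assumptions: full forwarding, a one-cycle stall only when an ALU instruction uses a register loaded by the immediately preceding LW, and no stall for a store whose data is forwarded to MEM. The helper names are hypothetical.

```python
def parse(instr):
    """Split 'ADD Ra,Rb,Rc' into (opcode, [operands])."""
    op, rest = instr.split(None, 1)
    return op, [x.strip() for x in rest.split(",")]


def registers_needed_in_ex(instr):
    """Registers whose values must be ready when the instruction enters EX."""
    op, operands = parse(instr)
    if op == "LW":
        return set()           # address is a symbolic label in this example
    if op == "SW":
        return set()           # assumption: store data is forwarded to MEM
    return set(operands[1:])   # ALU op: destination first, then sources


def count_load_stalls(program):
    """Count LW instructions immediately followed by a user of the loaded register."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_operands = parse(prev)
        if prev_op == "LW" and prev_operands[0] in registers_needed_in_ex(cur):
            stalls += 1
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

print("slow code stalls:", count_load_stalls(slow))
print("fast code stalls:", count_load_stalls(fast))
```

Under these assumptions the slow schedule stalls at ADD and at SUB, and the fast schedule removes both stalls by placing an independent instruction after each critical load.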

16
Effectiveness of Load Scheduling
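Int Avg 25%, FP Avg 13%, Overall 24%
[Figure: bar chart. Vertical axis: fraction of loads that cause a stall, 0% to 45%. Horizontal axis: benchmark (li, ear, gcc, doduc, mdljdp, su2cor, eqntott, hydro2d, espresso, compress). Bar labels as extracted, with the mapping to benchmarks not recoverable: 41, 24, 24, 23, 20, 20, 12, 10, 10, 4.]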
Slides: 17
Provided by: Rand233