CS 162 Computer Architecture Lecture 3: Pipelining Contd' - PowerPoint PPT Presentation

Slides: 22
Provided by: davep173
Learn more at: http://www.cs.ucr.edu
1
CS 162 Computer Architecture Lecture 3
Pipelining Contd.
  • Instructor: L.N. Bhuyan
  • www.cs.ucr.edu/bhuyan/cs162

2
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath. Components: PC, Imem, Regs, ALU, ALU control, Dmem, adders, shift left 2, muxes. Control signals: PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, MemToReg. Instruction fields: [25-21] ReadReg1, [20-16] ReadReg2, [15-11] WriteReg via the RegDst mux, [15-0] immediate, [31-0] instruction.]
3
Required Changes to Datapath
  • Introduce registers to separate the 5 stages: put IF/ID, ID/EX,
    EX/MEM, and MEM/WB registers in the datapath.
  • The next PC value is computed in the 3rd step, but we need to bring
    in the next instruction in the next cycle. So move the PCSrc mux to
    the 1st stage. The PC is incremented unless there is a new branch
    address.
  • The branch address is computed in the 3rd stage. With the pipeline,
    the PC value has changed by then! We must carry the PC value along
    with the instruction. Width of IF/ID register = (IR) + (PC) = 64
    bits.

4
Changes to Datapath Contd.
  • For the lw instruction, we need the write register address at
    stage 5. But the IR is now occupied by another instruction! So we
    must carry the IR destination field along as the instruction moves
    through the stages. See the connection in the figure.
  • Length of ID/EX register = (Reg1: 32) + (Reg2: 32) + (offset: 32)
    + (PC: 32) + (destination register: 5) = 133 bits
  • Assignment: What are the lengths of the EX/MEM and MEM/WB
    registers?
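The width bookkeeping above can be checked with a few lines of Python. This is only a sketch: the IF/ID and ID/EX field lists come from the slides, while the EX/MEM and MEM/WB field lists (and all field names) are assumptions, chosen so that the totals agree with the 102-bit and 69-bit labels in the pipelined datapath figure.

```python
# Field widths in bits for each pipeline register.
# IF/ID and ID/EX follow the slides; the EX/MEM and MEM/WB field lists
# are assumptions picked to match the widths labeled in the figure.
PIPELINE_REGS = {
    "IF/ID": {"IR": 32, "PC": 32},
    "ID/EX": {"Reg1": 32, "Reg2": 32, "offset": 32, "PC": 32, "dest_reg": 5},
    "EX/MEM": {"branch_target": 32, "zero": 1, "alu_result": 32,
               "Reg2": 32, "dest_reg": 5},
    "MEM/WB": {"mem_data": 32, "alu_result": 32, "dest_reg": 5},
}


def width(name):
    """Total width of one pipeline register, control bits excluded."""
    return sum(PIPELINE_REGS[name].values())


for name in PIPELINE_REGS:
    print(f"{name}: {width(name)} bits")
```

These totals exclude control bits, which are added to the pipeline registers once pipelined control is introduced.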

5
Pipelined Datapath (with Pipeline Regs)(6.2)
Stages: Fetch, Decode, Execute, Memory, Write Back
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the stages. Components: PC, Imem, Regs, sign extend, shift left 2, adders, ALU, Dmem, muxes. Register widths labeled in the figure: 64 bits, 133 bits, 102 bits, 69 bits.]
6
Pipelined Control (6.3)
  • Start with single-cycle controller
  • Group control lines by pipeline stage needed
  • Extend pipeline registers with control bits

[Figure: the Control block is driven by the instruction; its outputs are carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers in three groups: EX, Mem, and WB. Mem group: Branch, MemRead, MemWrite. WB group: MemToReg, RegWrite.]
7
Pipelined Processor Datapath Control
  • More work to correctly handle pipeline hazards

[Figure: pipelined datapath with control. The Control block feeds the EX, M, and WB control fields of the ID/EX, EX/MEM, and MEM/WB registers. Labeled signals: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemToReg. Labeled instruction fields: [15-0] to sign extend (16 to 32 bits), [20-16] and [15-11] to the RegDst mux. Components: PC, Imem, Regs, sign extend, shift left 2, adders, ALU, ALU control, Dmem, muxes.]
8
Recap
  • If we can keep all pipeline stages busy, we can retire (complete)
    up to one instruction per clock cycle (thereby achieving
    single-cycle throughput)
  • The pipeline paradox (for MIPS): any instruction still takes 5
    cycles to execute (even though we can retire one instruction per
    cycle)
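The recap above separates latency (5 cycles per instruction) from throughput (up to one instruction per cycle). A minimal sketch of that arithmetic follows, assuming an ideal 5-stage pipeline with no stalls unless stated. The function name and parameters are illustrative only.

```python
def pipeline_cycles(n_instructions, depth=5, stall_cycles=0):
    """Cycles needed to complete n instructions on a `depth`-stage pipeline.

    The first instruction needs `depth` cycles to fill the pipeline;
    every later instruction completes one cycle after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1) + stall_cycles


for n in (1, 10, 1000):
    cycles = pipeline_cycles(n)
    print(f"{n} instructions: {cycles} cycles, {cycles / n:.3f} cycles per instruction")
```

A single instruction still takes 5 cycles, while for long instruction streams the average approaches one cycle per instruction.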

9
Problems for Pipelining
  • Hazards prevent the next instruction from executing during its
    designated clock cycle, limiting speedup
  • Structural hazards: the HW cannot support this combination of
    instructions (e.g., a single memory for instructions and data)
  • Data hazards: an instruction depends on the result of a prior
    instruction still in the pipeline
  • Control hazards: conditional branches and other instructions may
    stall the pipeline, delaying later instructions

10
Single Memory is a Structural Hazard
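[Figure: pipeline timing diagram (time in clock cycles vs. instruction order) for Load, Instr 1, Instr 2, Instr 3, Instr 4, with each instruction's memory (M) and register (Reg) accesses marked.]
  • Can't read the same memory twice in the same clock cycle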

11
Ex.: MIPS multicycle datapath: Structural Hazard in Memory
[Figure: multicycle datapath with a single Memory holding both instructions and data. Labels: PC, Address, Memory (Instruction or Data), Instruction Register, Memory Data Register, Registers (ReadReg1, ReadReg2, WriteReg, Readdata 1, Readdata 2), A, B, ALU, ALUOut.]
12
Structural Hazards limit performance
  • Example: if there are 1.3 memory accesses per instruction (30% of
    instructions execute loads and stores) and only one memory access
    per cycle, then
  • Average CPI ≥ 1.3
  • Otherwise the datapath resource is more than 100% utilized

Structural Hazard Solution: Add more Hardware
13
Speed Up Equation for Pipelining
  • $\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instn}$
  • $\text{Speedup} = \dfrac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$
  • With Ideal CPI = 1: $\text{Speedup} = \dfrac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$
14
Example Dual-port vs. Single-port
  • Machine A: Dual ported memory
  • Machine B: Single ported memory, but its pipelined implementation
    has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
  •          = Pipeline Depth
  • SpeedUpB = Pipeline Depth/(1 + 0.4 x 1)
    x (clockunpipe/(clockunpipe / 1.05))
  •          = (Pipeline Depth/1.4) x 1.05
  •          = 0.75 x Pipeline Depth
  • SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth)
    = 1.33
  • Machine A is 1.33 times faster
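The comparison above can be reproduced numerically. The sketch below uses the speedup formula with Ideal CPI = 1. The function name, the `clock_ratio` parameter (unpipelined clock cycle divided by pipelined clock cycle, relative to Machine A), and the pipeline depth of 5 are assumptions made for illustration. The depth cancels out of the final ratio.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined cycle / pipelined cycle).

    Assumes an ideal CPI of 1, as in the example on the slide.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # assumed; any depth gives the same A/B ratio

# Machine A: dual-ported memory, so loads cause no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: single port, 40% loads each stalling 1 cycle, 1.05x faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(f"SpeedUpA = {speedup_a:.2f}")
print(f"SpeedUpB = {speedup_b:.2f}")
print(f"A is {speedup_a / speedup_b:.2f} times faster")
```

The faster clock of Machine B does not make up for the stall cycles caused by its single memory port.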

15
Data Hazard on Register 1 (6.4)
add $1, $2, $3
sub $4, $1, $3
and $6, $1, $7
or  $8, $1, $9
xor $10, $1, $11
16
Data Hazard Solution
  • Forward the result from one stage to another
  • "or" is OK if the register file is implemented properly (e.g.,
    written in the first half of the clock cycle and read in the
    second half)

[Figure: pipeline timing diagram (time in clock cycles vs. instruction order; stages IF, ID/RF, EX, MEM, WB) for the sequence add $1,$2,$3; sub $4,$1,$3; and $6,$1,$7; or $8,$1,$9; xor $10,$1,$11.]
17
Hazard Detection for Forwarding
  • A hazard must be detected just before execution so that, in case
    of a hazard, the data can be forwarded to the input of the ALU.
  • It can be detected when a source register (Rs or Rt or both) of
    the instruction in the EX stage is equal to the destination
    register (Rd) of an instruction further along in the pipeline (in
    either the MEM or WB stage).
  • Compare the values of the Rs and Rt registers in the ID/EX
    register with Rd in the EX/MEM and MEM/WB registers => need to
    carry the Rs, Rt, and Rd values to the ID/EX register from the
    IF/ID register (only Rd was carried before).
  • If they match, forward the data to the input of the ALU through
    the multiplexor.
  • See Fig. 6.43, p. 488 of the text.
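The comparison described above can be sketched as a small selection function for one ALU input. This is an illustration, not the textbook's circuit: the function and parameter names are hypothetical, and two conditions are assumptions added beyond what the slide states, namely that the earlier instruction actually writes a register (RegWrite) and that its destination is not register 0. The newer result in EX/MEM is given priority over the older one in MEM/WB.

```python
def forward_select(src, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose where one ALU operand comes from.

    src              register number (Rs or Rt) read by the instruction in EX
    ex_mem_rd        destination register held in the EX/MEM register
    mem_wb_rd        destination register held in the MEM/WB register
    *_regwrite       assumed control bit: that instruction writes a register
    """
    # Most recent result first: the instruction one ahead in the pipeline.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in the WB stage.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
        return "MEM/WB"
    # No match: use the value read from the register file.
    return "REG"


# add $1,$2,$3 is in MEM while sub $4,$1,$3 is in EX (Rs = 1, Rt = 3).
print(forward_select(1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
print(forward_select(3, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
```

The function would be evaluated once for Rs and once for Rt, each result driving the multiplexor in front of the corresponding ALU input.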

18
Forwarding: What about Loads?
  • Dependencies backward in time are hazards
  • Can't solve with forwarding alone
  • Must stall the instruction dependent on the load
  • Load-Use hazard

[Figure: pipeline timing diagram (stages IF, ID/RF, EX, MEM, WB) for lw $1,0($2) followed by sub $4,$1,$3.]
19
Data Hazard Even with Forwarding
  • Must stall pipeline 1 cycle (insert 1 bubble)
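Why a load needs one bubble while an ALU instruction needs none can be seen by comparing cycle numbers. The sketch below assumes the 5-stage pipeline with full forwarding: an ALU result exists at the end of EX, a loaded value only at the end of MEM, and the consumer needs its operand at the start of its own EX stage. The function name and its parameters are illustrative.

```python
def stalls_needed(producer_is_load, distance):
    """Stall cycles for a consumer issued `distance` instructions after the producer.

    Cycle numbering for the producer: IF=1, ID=2, EX=3, MEM=4, WB=5.
    """
    # The value exists at the end of MEM (loads) or the end of EX (ALU ops),
    # so the first cycle in which it can be forwarded is one later.
    result_ready_end_of = 4 if producer_is_load else 3
    first_usable_cycle = result_ready_end_of + 1

    # The consumer is fetched in cycle 1 + distance and reaches EX two cycles later.
    consumer_ex_cycle = 1 + distance + 2

    # A positive difference is a dependence "backward in time".
    return max(0, first_usable_cycle - consumer_ex_cycle)


print(stalls_needed(producer_is_load=True, distance=1))   # lw then dependent sub
print(stalls_needed(producer_is_load=False, distance=1))  # add then dependent sub
print(stalls_needed(producer_is_load=True, distance=2))   # one instruction in between
```

Under these assumptions, only the instruction immediately after a load can need a stall, which is the load-use hazard.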

[Figure: pipeline timing diagram (time in clock cycles; stages IF, ID/RF, EX, MEM, WB) for lw $1, 0($2); sub $4,$1,$6; and $6,$1,$7; or $8,$1,$9.]
20
Compiler Schemes to Improve Load Delay
  • The compiler detects the data dependency and inserts nop
    instructions until the data is available
  • sub $2, $1, $3
  • nop
  • and $12, $2, $5
  • or $13, $6, $2
  • add $14, $2, $2
  • sw $15, 100($2)
  • Alternatively, the compiler finds independent instructions to
    fill in the delay slots
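The nop-insertion scheme above can be sketched as a simple pass over the instruction list. Everything here is illustrative: instructions are modeled as `(text, dest, sources)` tuples, and `gap` is a hypothetical parameter giving how many instructions must separate a producer from a consumer. With `gap=1` the pass reproduces the single nop in the listing above. A pipeline without forwarding would need a larger value.

```python
NOP = ("nop", None, ())


def insert_nops(program, gap=1):
    """Insert nops so no instruction reads a register written within `gap` slots."""
    scheduled = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Look back over the last `gap` instructions already emitted.
        for back in range(1, gap + 1):
            if back > len(scheduled):
                break
            prev_dest = scheduled[-back][1]
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - back + 1)
        scheduled.extend([NOP] * needed)
        scheduled.append(instr)
    return scheduled


program = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]

for text, _, _ in insert_nops(program):
    print(text)
```

A scheduling compiler would try to move an independent instruction into the slot before falling back to a nop.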

21
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
  • Slow code
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
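The benefit of the reordering can be checked by counting load-use stalls in both sequences. The sketch assumes a pipeline with forwarding, where only an instruction that immediately follows a load and reads the loaded register stalls, for one cycle. The tuple encoding `(opcode, dest, sources)` and the function name are illustrative.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

print("slow code stalls:", load_use_stalls(slow))
print("fast code stalls:", load_use_stalls(fast))
```

Both sequences contain the same eight instructions. The fast version only changes their order so that no instruction uses a loaded value in the very next slot.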