CpE 242 Computer Architecture and Engineering Designing a Pipeline Processor

Description:

The pipelined datapath consists of combinational logic blocks separated by pipeline registers.

Slides: 46

Provided by: cseeWvuE5

Learn more at: https://community.wvu.edu

Transcript and Presenter's Notes

1
CpE 242 Computer Architecture and Engineering
Designing a Pipeline Processor
2
Outline of Today's Lecture

Recap and Introduction (5 minutes)
Introduction to the Concept of Pipelined Processor (15 minutes)
Pipelined Datapath and Pipelined Control (25 minutes)
How to Avoid Race Condition in a Pipeline Design? (5 minutes)
Pipeline Example: Instructions Interaction (15 minutes)
Summary (5 minutes)

3
A Single Cycle Processor
[Figure: single-cycle datapath. Main Control decodes the op field of Instruction<31:0> and drives RegDst, ALUSrc, MemtoReg, RegWr, MemWr, Branch, Jump, ExtOp, and ALUop. The Instruction Fetch Unit supplies Rs, Rt, Rd, and Imm16 to a file of 32 32-bit registers (Ra, Rb, Rw, busA, busB, busW), the Extender, the ALU (ALUctr, Zero), and the Data Memory (Adr, Data In, WrEn).]
4
Drawbacks of this Single Cycle Processor

Long cycle time:
Cycle time must be long enough for the load instruction:
PC's Clock-to-Q + Instruction Memory Access Time + Register File Access Time + ALU Delay (address calculation) + Data Memory Access Time + Register File Setup Time + Clock Skew
Cycle time is much longer than needed for all other instructions. Examples:
R-type instructions do not require data memory access
Jump requires neither an ALU operation nor a data memory access
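
A minimal sketch of the cycle-time sum for the load's critical path. The slide gives no delay values, so every number in `DELAYS_NS` is a made-up placeholder, and the names are illustrative only.

```python
# Hypothetical component delays in nanoseconds (assumptions, not from the slides).
DELAYS_NS = {
    "pc_clk_to_q": 1,
    "instr_mem": 10,
    "reg_read": 5,
    "alu": 8,
    "data_mem": 10,
    "reg_setup": 1,
    "clock_skew": 1,
}

# The load uses every component on the list above.
LOAD_PATH = ["pc_clk_to_q", "instr_mem", "reg_read", "alu",
             "data_mem", "reg_setup", "clock_skew"]
# An R-type instruction skips the data memory access.
RTYPE_PATH = [step for step in LOAD_PATH if step != "data_mem"]


def path_delay(path, delays=DELAYS_NS):
    """Sum the delays of the components an instruction passes through."""
    return sum(delays[step] for step in path)


if __name__ == "__main__":
    cycle_time = path_delay(LOAD_PATH)      # the clock must fit the slowest instruction
    idle = cycle_time - path_delay(RTYPE_PATH)
    print(f"cycle time: {cycle_time} ns, idle time per R-type: {idle} ns")
```

With these placeholder numbers the load sets the cycle time, and every R-type instruction leaves part of each cycle unused.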

5
Overview of a Multiple Cycle Implementation

The root of the single-cycle processor's problems:
The cycle time has to be long enough for the slowest instruction
Solution:
Break the instruction into smaller steps
Execute each step (instead of the entire instruction) in one cycle
Cycle time = time it takes to execute the longest step
Keep all the steps similar in length
This is the essence of the multiple cycle processor
The advantages of the multiple cycle processor:
Cycle time is much shorter
Different instructions take different numbers of cycles to complete
Load takes five cycles
Jump only takes three cycles
Allows a functional unit to be used more than once per instruction

6
Multiple Cycle Processor

MCP: a functional unit can be used more than once per instruction

[Figure: multiple-cycle datapath with PC, Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), and ALU. Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg.]
7
Timing Diagram of a Load Instruction
[Figure: timing diagram of a load across Instruction Fetch, Instr Decode / Reg. Fetch, Address, Data Memory, and Reg Wr. After the Clk edge each signal changes from Old Value to New Value in turn: PC (Clk-to-Q); Rs, Rt, Rd, Op, Func (Instruction Memory Access Time); ALUctr, ExtOp, ALUSrc, RegWr (Delay through Control Logic); busA (Register File Access Time); busB (Delay through Extender and Mux); Address (ALU Delay); busW (Data Memory Access Time); then Register File Write Time.]
9
The Five Stages of Load
[Figure: Load spans Cycle 1 through Cycle 5: Ifetch, Reg/Dec, Exec, Mem, Wr]

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file

10
Key Ideas Behind Pipelining

Grading the final exam for a class of 100 students:
5 problems, five people grading the exam
Each person ONLY grades one problem
Pass the exam to the next person as soon as one finishes his part
Assume each problem takes 12 min to grade
Each individual exam still takes 1 hour to grade
But with 5 people, all exams can be graded nearly five times quicker
The load instruction has 5 stages:
Five independent functional units to work on each stage
Each functional unit is used only once
The 2nd load can start as soon as the 1st finishes its Ifetch stage
Each load still takes five cycles to complete
The throughput, however, is much higher

11
Key Ideas Behind Pipelining
[Figure: Input Tasks pass through a buffer into a k-stage pipeline: Stage 1, Stage 2, ..., Stage k]

Let n be the number of tasks or exams (or instructions)
Let k be the number of stages for each task
Let T be the time per stage
Time per task = $T \cdot k$
Total time for n tasks, non-pipelined solution = $T \cdot k \cdot n$
Total time for n tasks, pipelined solution = $T \cdot k + T \cdot (n-1)$
Speedup = pipelined performance / non-pipelined performance
= total time non-pipelined / total time pipelined
= $\frac{k \cdot n}{k + n - 1} \approx k$ when $n \gg k$
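
As a rough numeric check of the speedup formula, the sketch below plugs in the exam-grading numbers from the previous slide (n = 100 exams, k = 5 graders, T = 12 minutes per problem). The function name `pipeline_times` is an illustrative choice, not something the lecture defines.

```python
def pipeline_times(n, k, t):
    """Return (non-pipelined time, pipelined time, speedup) for n tasks,
    k stages per task, and t time units per stage."""
    non_pipelined = t * k * n          # every task runs all k stages by itself
    pipelined = t * k + t * (n - 1)    # first task fills the pipe, then one task finishes per stage time
    return non_pipelined, pipelined, non_pipelined / pipelined


if __name__ == "__main__":
    total_np, total_p, speedup = pipeline_times(n=100, k=5, t=12)
    print(total_np, total_p, round(speedup, 2))  # prints: 6000 1248 4.81
```

The speedup is a little under 5 because the first exam still needs the full hour before the pipeline is full; it approaches k as n grows.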

12
Pipelining the Load Instruction
[Figure: pipeline diagram over Cycles 1 to 7 showing three back-to-back lw instructions (1st, 2nd, 3rd lw)]

The five independent functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File's Read ports (busA and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File's Write port (busW) for the Wr stage
One instruction enters the pipeline every cycle
One instruction comes out of the pipeline (completes) every cycle
The Effective Cycles per Instruction (CPI) is 1
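
The claim that the effective CPI is 1 holds once the pipeline is full. A small sketch, assuming back-to-back loads with no stalls (the function names are hypothetical), shows how the fill time is amortized:

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def completion_cycles(n):
    """Cycle in which each of n back-to-back loads finishes its Wr stage.
    Load i (0-based) enters Ifetch in cycle i + 1 and needs len(STAGES) cycles."""
    return [i + len(STAGES) for i in range(n)]


def effective_cpi(n):
    """Total cycles divided by the number of instructions completed."""
    return completion_cycles(n)[-1] / n


if __name__ == "__main__":
    print(completion_cycles(3))   # three loads finish in cycles 5, 6, and 7
    print(effective_cpi(1000))    # close to 1 for a long instruction stream
```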

13
The Four Stages of R-type
[Figure: R-type spans Cycle 1 through Cycle 4: Ifetch, Reg/Dec, Exec, Wr]

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: ALU operates on the two register operands
Wr: Write the ALU output back to the register file

14
Pipelining the R-type and Load Instruction
[Figure: pipeline diagram over Cycles 1 to 9 for the sequence R-type, R-type, Load, R-type, R-type. Oops! We have a problem!]

We have a problem:
Two instructions try to write to the register file at the same time!

15
Important Observation

Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
Load uses Register File's Write Port during its 5th stage
R-type uses Register File's Write Port during its 4th stage
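
The write-port collision can be found mechanically. The sketch below assumes one instruction is issued per cycle starting at cycle 1 and uses the write stages stated above; `write_port_conflicts` and its arguments are illustrative names.

```python
from collections import defaultdict

# 1-based stage in which each instruction type uses the register file write port.
WRITE_STAGE = {"load": 5, "r-type": 4}


def write_port_conflicts(program, write_stage=WRITE_STAGE):
    """program: instruction types issued one per cycle, starting at cycle 1.
    Returns {cycle: [instruction indices]} for cycles with more than one write."""
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        if kind in write_stage:
            issue_cycle = i + 1
            writers[issue_cycle + write_stage[kind] - 1].append(i)
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}


if __name__ == "__main__":
    program = ["r-type", "r-type", "load", "r-type", "r-type"]
    print(write_port_conflicts(program))  # prints: {7: [2, 3]}
```

Passing a table in which both instruction types write in the same stage, for example `{"load": 5, "r-type": 5}`, returns an empty dictionary for this sequence.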

16
Solution 1: Insert Bubble into the Pipeline

[Figure: pipeline diagram over Cycles 1 to 9 for a Load followed by R-type instructions, with a pipeline bubble inserted]

Insert a bubble into the pipeline to prevent 2 writes in the same cycle:
The control logic can be complex
No instruction is completed during Cycle 5
The Effective CPI for load is 2

17
Solution 2: Delay R-type's Write by One Cycle

Delay R-type's register write by one cycle:
Now R-type instructions also use Reg File's write port at Stage 5
Mem stage is a NOOP stage: nothing is being done

[Figure: R-type now has five numbered stages, with Mem as a NOOP stage; pipeline diagram over Cycles 1 to 9 for the sequence R-type, R-type, Load, R-type, R-type]
18
The Four Stages of Store
[Figure: Store in Cycles 1 to 4: Ifetch, Reg/Dec, Exec, Mem, followed by an empty Wr slot]

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory

19
The Four Stages of Beq
[Figure: Beq in Cycles 1 to 4: Ifetch, Reg/Dec, Exec, Mem, followed by an empty Wr slot]

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: ALU compares the two register operands; Adder calculates the branch target address
Mem: If the registers we compared in the Exec stage are the same, write the branch target address into the PC

21
A Pipelined Datapath
[Figure: pipelined datapath clocked by Clk, divided into the stages Ifetch, Reg/Dec, Exec, Mem, and Wr]
22
The Instruction Fetch Stage

Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]

[Figure: pipelined datapath during the Ifetch stage ("You are here!"). PC = 14; IF/ID: lw $1, 100($2). Pipeline registers: IF/ID, ID/Ex, Ex/Mem, Mem/Wr. Control signals shown: ExtOp, ALUOp, Branch, RegWr, ALUSrc, MemWr, MemtoReg, RegDst.]
23
A Detail View of the Instruction Unit

Location 10: lw $1, 0x100($2)

[Figure: Instruction Unit detail during the Ifetch stage ("You are here!"). Instruction Memory Address = 10; PC = 14 after adding 4; IF/ID: lw $1, 100($2).]
24
The Decode / Register Fetch Stage

Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]

[Figure: pipelined datapath during the Reg/Dec stage ("You are here!"). ID/Ex: Reg. 2 and 0x100.]
25
Load's Address Calculation Stage

Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]

[Figure: pipelined datapath during the Exec stage ("You are here!"). ALUOp = Add, ExtOp = 1, ALUSrc = 1, RegDst = 0; Ex/Mem: Load's Address.]
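
The address calculation in this stage can be sketched as follows. The sketch assumes that ExtOp = 1 selects sign extension of the 16-bit immediate and that addresses wrap at 32 bits; the slides spell out neither, and both function names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def load_address(base, imm16):
    """Effective address for lw: base register value plus the sign-extended
    offset, kept to 32 bits (the Add performed by the ALU in Exec)."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF


if __name__ == "__main__":
    # lw $1, 0x100($2) with a made-up value of 0x2000 in register $2
    print(hex(load_address(0x2000, 0x100)))  # prints: 0x2100
```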
26
A Detail View of the Execution Unit
[Figure: detail view of the Execution Unit between the Exec and Mem stages ("You are here!")]
27
Load's Memory Access Stage

Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]

[Figure: pipelined datapath during the Mem stage ("You are here!"). Branch = 0, MemWr = 0; Mem/Wr: Load's Data.]
28
Load's Write Back Stage

Location 10: lw $1, 0x100($2)    $1 <- Mem[($2) + 0x100]

[Figure: pipelined datapath during the Wr stage ("You are somewhere out there!"). RegWr = 1, MemtoReg = 1.]
29
How About Control Signals?

Key Observation: Control Signals at Stage N = Func(Instr. at Stage N)
N = Exec, Mem, or Wr
Example: Control Signals at Exec Stage = Func(Load's Exec)

[Figure: pipelined datapath with the load in Exec. ALUOp = Add, ExtOp = 1, ALUSrc = 1, RegDst = 0; Ex/Mem: Load's Address.]
30
Pipeline Control

The Main Control generates the control signals during Reg/Dec:
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
Control signals for Mem (MemWr, Branch) are used 2 cycles later
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later

[Figure: Main Control output in Reg/Dec carried through the ID/Ex, Ex/Mem, and Mem/Wr registers. ExtOp, ALUSrc, ALUOp, and RegDst are consumed in Exec; MemWr and Branch in Mem; MemtoReg and RegWr in Wr.]
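
A small sketch of this delayed use of control signals, using the values shown for lw in the earlier walkthrough slides. The grouping into a dictionary and the name `control_timeline` are assumptions made for illustration.

```python
# Control word produced by Main Control during Reg/Dec for lw, grouped by the
# stage that consumes each signal (values taken from the lw walkthrough slides).
LW_CONTROL = {
    "Exec": {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "Add", "RegDst": 0},
    "Mem": {"MemWr": 0, "Branch": 0},
    "Wr": {"MemtoReg": 1, "RegWr": 1},
}
STAGE_ORDER = ["Exec", "Mem", "Wr"]


def control_timeline(control, decode_cycle):
    """Return {cycle: signals applied in that cycle} for one instruction
    decoded in decode_cycle: Exec signals 1 cycle later, Mem 2, Wr 3."""
    return {decode_cycle + 1 + i: control[stage]
            for i, stage in enumerate(STAGE_ORDER)}


if __name__ == "__main__":
    # A load fetched in cycle 1 is decoded in cycle 2.
    for cycle, signals in control_timeline(LW_CONTROL, decode_cycle=2).items():
        print(cycle, signals)
```

In hardware the same effect comes from storing the not-yet-used signals in the ID/Ex, Ex/Mem, and Mem/Wr registers alongside the data.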
32
Beginning of the Wr's Stage: A Real World Problem

[Figure: Ex/Mem and Mem/Wr registers driving RegAdr and RegWr into the Reg File and WrAdr and MemWr into the Data Memory, with a timing diagram of each signal's Clk-to-Q]

At the beginning of the Wr stage, we have a problem if:
RegAdr's (Rd or Rt) Clk-to-Q > RegWr's Clk-to-Q
Similarly, at the beginning of the Mem stage, we have a problem if:
WrAdr's Clk-to-Q > MemWr's Clk-to-Q
We have a race condition between Address and Write Enable!

33
The Pipeline Problem

Multiple Cycle design prevents race condition between Addr and WrEn:
Make sure Address is stable by the end of Cycle N
Assert WrEn during Cycle N + 1
This approach can NOT be used in the pipeline design because:
Must be able to write the register file every cycle
Must be able to write the data memory every cycle

[Figure: pipeline diagram with back-to-back Store, Store, R-type, R-type]
34
Synchronize Register File & Synchronize Memory

Solution: AND the Write Enable signal with the Clock
This is the ONLY place where gating the clock is used
MUST consult circuit expert to ensure no timing violation
Example: Clock High Time > Write Access Delay

Synchronize Memory and Register File:
Address, Data, and WrEn must be stable at least 1 set-up time before the Clk edge
Write occurs at the cycle following the clock edge that captures the signals

[Figure: Reg File or Memory blocks with Address, Data, WrEn, and Clk inputs and the signals C_WrEn, I_Addr, I_Data, and I_WrEn]
36
A More Extensive Pipelining Example
[Figure: pipeline diagram over Cycles 1 to 8 for 0: Load, 4: R-type, 8: Store, 12: Beq (target is 1000), marking the ends of Cycles 4, 5, 6, and 7]

End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
End of Cycle 5: Load's Wr, R-type's Mem, Store's Exec, Beq's Reg
End of Cycle 6: R-type's Wr, Store's Mem, Beq's Exec
End of Cycle 7: Store's Wr, Beq's Mem
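
The stage occupancy listed above follows from a simple rule: instruction i is fetched in cycle i + 1 and moves one stage per cycle. A sketch under that assumption (no stalls; `snapshot` is an illustrative name):

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
# (address, instruction) pairs from the example, in program order.
PROGRAM = [(0, "Load"), (4, "R-type"), (8, "Store"), (12, "Beq")]


def snapshot(cycle, program=PROGRAM):
    """Return {instruction: stage} for the instructions in flight during `cycle`.
    Instruction i (0-based) is fetched in cycle i + 1 and advances one stage per cycle."""
    in_flight = {}
    for i, (_addr, name) in enumerate(program):
        stage_index = cycle - (i + 1)
        if 0 <= stage_index < len(STAGES):
            in_flight[name] = STAGES[stage_index]
    return in_flight


if __name__ == "__main__":
    for cycle in (4, 5, 6, 7):
        print(cycle, snapshot(cycle))
```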

37
Pipelining Example: End of Cycle 4

0: Load's Mem; 4: R-type's Exec; 8: Store's Reg; 12: Beq's Ifetch

[Figure: pipelined datapath at the end of Cycle 4. PC = 16. IF/ID: Beq Instruction; ID/Ex: Store's busA and B; Ex/Mem: R-type's Result; Mem/Wr: Load's Dout. ALUOp = R-type, ExtOp = x, ALUSrc = 0, RegDst = 1, Branch = 0, RegWr = 0, MemtoReg = x.]
38
Pipelining Example: End of Cycle 5

0: Lw's Wr; 4: R's Mem; 8: Store's Exec; 12: Beq's Reg; 16: R's Ifetch

[Figure: pipelined datapath at the end of Cycle 5. PC = 20. IF/ID: Instruction @ 16; ID/Ex: Beq's busA and B; Ex/Mem: Store's Address; Mem/Wr: R-type's Result. ALUOp = Add, ExtOp = 1, ALUSrc = 1, RegDst = x, Branch = 0, RegWr = 1, MemtoReg = 1.]
39
Pipelining Example: End of Cycle 6

4: R's Wr; 8: Store's Mem; 12: Beq's Exec; 16: R's Reg; 20: R's Ifetch

[Figure: pipelined datapath at the end of Cycle 6. PC = 24. IF/ID: Instruction @ 20; ID/Ex: R-type's busA and B; Ex/Mem: Beq's Results; Mem/Wr: Nothing for Store. ALUOp = Sub, ExtOp = 1, ALUSrc = 0, RegDst = x, Branch = 0, RegWr = 1, MemtoReg = 0.]
40
Pipelining Example: End of Cycle 7

8: Store's Wr; 12: Beq's Mem; 16: R's Exec; 20: R's Reg; 24: R's Ifetch

[Figure: pipelined datapath at the end of Cycle 7. PC = 1000. IF/ID: Instruction @ 24; ID/Ex: R-type's busA and B; Ex/Mem: R-type's Results; Mem/Wr: Nothing for Beq. ALUOp = R-type, ExtOp = x, ALUSrc = 0, RegDst = 1, Branch = 1, RegWr = 0, MemtoReg = x.]
41
The Delay Branch Phenomenon
[Figure: pipeline diagram over Cycles 4 to 11 for 12: Beq (target is 1000), 16: R-type, 20: R-type, 24: R-type, 1000: Target of Br]

Although Beq is fetched during Cycle 4:
Target address is NOT written into the PC until the end of Cycle 7
Branch's target is NOT fetched until Cycle 8
3-instruction delay before the branch takes effect
This is referred to as Branch Hazard:
Clever design techniques can reduce the delay to ONE instruction
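
The 3-instruction figure is plain cycle arithmetic. The sketch below assumes the PC is written at the end of the branch's Mem stage (stage 4), as in this design; `branch_timing` and `resolve_stage` are hypothetical names.

```python
def branch_timing(fetch_cycle, resolve_stage=4):
    """resolve_stage: 1-based stage at whose end the PC is written
    (4 = Mem in this design). Returns a tuple of
    (cycle the PC is updated, cycle the target is fetched,
     number of sequential instructions fetched before the target)."""
    pc_update_cycle = fetch_cycle + resolve_stage - 1
    target_fetch_cycle = pc_update_cycle + 1
    delay = target_fetch_cycle - fetch_cycle - 1
    return pc_update_cycle, target_fetch_cycle, delay


if __name__ == "__main__":
    print(branch_timing(fetch_cycle=4))  # prints: (7, 8, 3)
```

In this arithmetic the delay drops to one instruction if the PC could be written at the end of stage 2 (`resolve_stage=2`); the slides do not say which technique they have in mind, so that reading is an assumption.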

42
The Delay Load Phenomenon
[Figure: pipeline diagram over Cycles 1 to 8 for I0: Load followed by the instructions Plus 1, Plus 2, Plus 3, and Plus 4]

Although Load is fetched during Cycle 1:
The data is NOT written into the Reg File until the end of Cycle 5
We cannot read this value from the Reg File until Cycle 6
3-instruction delay before the load takes effect
This is referred to as Data Hazard:
Clever design techniques can reduce the delay to ONE instruction
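
The same kind of arithmetic gives the load delay. The sketch assumes the load is fetched in cycle 1, writes the register file at the end of its 5th stage, and that a following instruction reads its registers in its 2nd stage (Reg/Dec); `stale_readers` is an illustrative name.

```python
def stale_readers(write_stage=5, read_stage=2, lookahead=4):
    """Return the k values for which instruction 'Plus k' reads the register
    file before the load's result is available there.
    The load is fetched in cycle 1; 'Plus k' is fetched in cycle 1 + k."""
    first_readable_cycle = write_stage + 1       # written at the end of cycle write_stage
    stale = []
    for k in range(1, lookahead + 1):
        read_cycle = (1 + k) + read_stage - 1    # cycle in which Plus k is in Reg/Dec
        if read_cycle < first_readable_cycle:
            stale.append(k)
    return stale


if __name__ == "__main__":
    print(stale_readers())  # prints: [1, 2, 3]
```

Plus 4 is the first instruction that reads in Cycle 6, so it sees the loaded value.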

43
Summary

Disadvantages of the Single Cycle Processor:
Long cycle time
Cycle time is too long for all instructions except the Load
Multiple Clock Cycle Processor:
Divide the instructions into smaller steps
Execute each step (instead of the entire instruction) in one cycle
Pipeline Processor:
Natural enhancement of the multiple clock cycle processor
Each functional unit can only be used once per instruction
If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions
Pipeline Control:
Each stage's control signal depends ONLY on the instruction that is currently in that stage

44
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing comparison of the three implementations. Single Cycle Implementation: Load and Store take one long cycle each (Cycle 1, Cycle 2), with part of the Store cycle marked Waste. Multiple Cycle Implementation: Load, Store, and R-type shown over Cycles 1 to 10. Pipeline Implementation: Load, Store, and R-type overlapped.]
45
Where to get more information?

Everything You Need to Know about Pipeline Computers:
Peter Kogge, The Architecture of Pipelined Computers, McGraw-Hill Book Company, 1981.
Some Classic References on RISC Pipelines:
Manolis Katevenis, Reduced Instruction Set Computer Architectures for VLSI, PhD Thesis, UC Berkeley, 1984.
Other references:
David A. Patterson, Reduced Instruction Set Computers, Communications of the ACM, January 1985.
Shing Kong, Performance, Resources, and Complexity, PhD Thesis, UC Berkeley, 1989.