CpE 242 Computer Architecture and Engineering: Designing a Pipeline Processor

Description: The pipelined datapath consists of combinational logic blocks separated by pipeline registers.

Slides: 46
Provided by: cseeWvuE5

1
CpE 242 Computer Architecture and Engineering: Designing a Pipeline Processor
2
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to the Concept of a Pipelined Processor (15 minutes)
  • Pipelined Datapath and Pipelined Control (25 minutes)
  • How to Avoid Race Conditions in a Pipeline Design (5 minutes)
  • Pipeline Example: Instruction Interactions (15 minutes)
  • Summary (5 minutes)

3
A Single Cycle Processor
[Figure: single cycle datapath. The Instruction Fetch Unit supplies Instruction<31:0>; its fields op <31:26>, Rs <25:21>, Rt <20:16>, Rd <15:11>, func <5:0>, and Imm16 <15:0> feed the Main Control, the ALU control, the 32 32-bit Registers (Ra, Rb, Rw, busA, busB, busW), the Extender, the ALU, and the Data Memory (Adr, Data In, WrEn). Control signals: RegDst, ALUSrc, ExtOp, ALUop, ALUctr, MemWr, MemtoReg, RegWr, Branch, Jump.]
4
Drawbacks of this Single Cycle Processor
  • Long cycle time
    • Cycle time must be long enough for the load instruction:
      • PC's Clock-to-Q
      • Instruction Memory Access Time
      • Register File Access Time
      • ALU Delay (address calculation)
      • Data Memory Access Time
      • Register File Setup Time
      • Clock Skew
    • Cycle time is much longer than needed for all other instructions. Examples:
      • R-type instructions do not require data memory access
      • Jump requires neither an ALU operation nor a data memory access
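To make the "long cycle time" argument concrete, the sketch below sums the delay components listed above for a load and compares them with shorter instructions. The nanosecond values and the exact component lists per instruction are hypothetical assumptions chosen only for illustration; the lecture gives no numbers.

```python
# Hypothetical component delays in nanoseconds. These numbers are NOT from
# the lecture; they only illustrate how the single cycle clock is sized.
DELAYS_NS = {
    "pc_clk_to_q": 1,
    "imem_access": 10,
    "regfile_access": 5,
    "alu": 8,
    "dmem_access": 10,
    "regfile_setup": 1,
    "clock_skew": 1,
}

# Components on each instruction's path (a simplified, assumed model).
PATHS = {
    "load": ["pc_clk_to_q", "imem_access", "regfile_access", "alu",
             "dmem_access", "regfile_setup", "clock_skew"],
    "r_type": ["pc_clk_to_q", "imem_access", "regfile_access", "alu",
               "regfile_setup", "clock_skew"],                # no data memory access
    "jump": ["pc_clk_to_q", "imem_access", "clock_skew"],     # no ALU, no data memory
}


def path_delay(instr: str) -> int:
    """Sum of the delays of the components an instruction actually uses."""
    return sum(DELAYS_NS[c] for c in PATHS[instr])


# The single cycle clock must accommodate the slowest instruction (the load).
cycle_time = max(path_delay(i) for i in PATHS)

for instr in PATHS:
    wasted = cycle_time - path_delay(instr)
    print(f"{instr:6s}: needs {path_delay(instr):2d} ns, idles {wasted:2d} ns per cycle")
```

With these made-up values the clock is 36 ns, so every R-type idles for 10 ns and every jump for 24 ns of each cycle.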

5
Overview of a Multiple Cycle Implementation
  • The root of the single cycle processor's problems:
    • The cycle time has to be long enough for the slowest instruction
  • Solution:
    • Break the instruction into smaller steps
    • Execute each step (instead of the entire instruction) in one cycle
      • Cycle time = time it takes to execute the longest step
      • Keep all the steps similar in length
    • This is the essence of the multiple cycle processor
  • The advantages of the multiple cycle processor:
    • Cycle time is much shorter
    • Different instructions take different numbers of cycles to complete
      • Load takes five cycles
      • Jump only takes three cycles
    • Allows a functional unit to be used more than once per instruction
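A minimal sketch of the trade-off above, assuming hypothetical per-step delays (not from the lecture). The cycle counts (load 5, jump 3) follow the slide; the names of jump's three steps are an assumption.

```python
# Hypothetical per-step delays in ns (illustrative only, not from the lecture).
STEP_NS = {"ifetch": 10, "reg_dec": 7, "exec": 9, "mem": 10, "wr": 7}

# Steps each instruction needs. Load = 5 cycles and jump = 3 cycles follow the
# slide; which three steps a jump uses is an assumption for this sketch.
STEPS = {
    "load": ["ifetch", "reg_dec", "exec", "mem", "wr"],
    "jump": ["ifetch", "reg_dec", "exec"],
}

# Single cycle: one long cycle sized for the slowest instruction.
single_cycle_ns = max(sum(STEP_NS[s] for s in steps) for steps in STEPS.values())

# Multiple cycle: the clock only has to cover the longest step.
multi_cycle_clock_ns = max(STEP_NS.values())

for instr, steps in STEPS.items():
    multi_ns = len(steps) * multi_cycle_clock_ns
    print(f"{instr}: single cycle {single_cycle_ns} ns, "
          f"multiple cycle {len(steps)} x {multi_cycle_clock_ns} = {multi_ns} ns")
```

With these numbers a load actually takes longer in the multiple cycle design (50 ns vs 43 ns) because every step is padded to the 10 ns longest step, which is why the slide stresses keeping the steps similar in length; a jump finishes in 30 ns instead of 43 ns.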

6
Multiple Cycle Processor
  • MCP: a functional unit can be used more than once per instruction

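[Figure: multiple cycle datapath with one Ideal Memory (RAdr, WrAdr, Din, Dout), an Instruction Reg, a Reg File (Ra, Rb, Rw, busA, busB, busW), and a single ALU whose operands are selected by ALUSelA and ALUSelB (sources include busB, the constant 4, and the extended Imm). Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg.]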
7
Timing Diagram of a Load Instruction
[Figure: timing of a load within one clock cycle, spanning Instruction Fetch, Instr Decode / Reg. Fetch, Address, Data Memory, and Reg Wr. After the Clk edge each signal moves from its Old Value to its New Value in turn: PC (Clk-to-Q); Rs, Rt, Rd, Op, Func (Instruction Memory Access Time); ALUctr, ExtOp, ALUSrc, RegWr (Delay through Control Logic); busA (Register File Access Time); busB (Delay through Extender & Mux); Address (ALU Delay); busW (Data Memory Access Time); then the Register File Write Time.]
8
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to the Concept of a Pipelined Processor (15 minutes)
  • Pipelined Datapath and Pipelined Control (25 minutes)
  • How to Avoid Race Conditions in a Pipeline Design (5 minutes)
  • Pipeline Example: Instruction Interactions (15 minutes)
  • Summary (5 minutes)

9
The Five Stages of Load
[Figure: Load occupies Ifetch, Reg/Dec, Exec, Mem, and Wr in Cycles 1 to 5.]
  • Ifetch: Instruction Fetch
    • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec: Calculate the memory address
  • Mem: Read the data from the Data Memory
  • Wr: Write the data back to the register file

10
Key Ideas Behind Pipelining
  • Grading the final exam for a class of 100 students
    • 5 problems, five people grading the exam
    • Each person grades ONLY one problem
    • Each person passes the exam to the next person as soon as he or she finishes his or her part
    • Assume each problem takes 12 minutes to grade
    • Each individual exam still takes 1 hour to grade
    • But with 5 people, all the exams can be graded about five times faster
  • The load instruction has 5 stages
    • Five independent functional units work on the five stages
    • Each functional unit is used only once per instruction
    • The 2nd load can start as soon as the 1st finishes its Ifetch stage
    • Each load still takes five cycles to complete
    • The throughput, however, is much higher
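The grading analogy can be checked by scheduling it directly. In this sketch exam i (0-based) sits with grader p during 12-minute slot i + p. The baseline of a single person grading every problem of every exam is our assumption; the slide does not name one.

```python
MINUTES_PER_PROBLEM = 12   # from the slide
GRADERS = 5                # one grader per problem = number of pipeline stages
EXAMS = 100


def finish_times(exams: int, stages: int, slot: int) -> list[int]:
    """Minute at which each exam leaves the last grader.

    Exam i (0-based) is at grader p during time slot i + p, so it finishes at
    the end of slot i + stages - 1, i.e. at minute (i + stages) * slot.
    """
    return [(i + stages) * slot for i in range(exams)]


done = finish_times(EXAMS, GRADERS, MINUTES_PER_PROBLEM)
latency = done[0]                    # the first exam still takes a full hour
pipelined_total = done[-1]           # when the last exam leaves the last grader
one_person_total = EXAMS * GRADERS * MINUTES_PER_PROBLEM  # one grader does everything

print(latency, pipelined_total, one_person_total,
      round(one_person_total / pipelined_total, 2))   # 60 1248 6000 4.81
```

Latency stays at 60 minutes per exam, but after the first hour one graded exam emerges every 12 minutes, so the speedup is about 4.8 rather than exactly 5.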

11
Key Ideas Behind Pipelining
[Figure: input tasks wait in a buffer and enter a k-stage pipeline (Stage 1, Stage 2, ..., Stage k).]
  • Let \(n\) be the number of tasks (exams, or instructions)
  • Let \(k\) be the number of stages for each task
  • Let \(T\) be the time per stage
  • Time per task: \(T \cdot k\)
  • Total time for \(n\) tasks, non-pipelined: \(T \cdot k \cdot n\)
  • Total time for \(n\) tasks, pipelined: \(T \cdot k + T \cdot (n-1)\)
  • Speedup = pipelined performance / non-pipelined performance
    = total time non-pipelined / total time pipelined:
    \[ \text{Speedup} = \frac{T \cdot k \cdot n}{T \cdot k + T \cdot (n-1)} = \frac{k \cdot n}{k + n - 1} \approx k \quad \text{when } n \gg k \]
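The speedup formula above translates directly into code. This sketch just evaluates the two total-time expressions and shows the ratio approaching k as n grows; the function names are illustrative.

```python
def non_pipelined_time(n: int, k: int, t: float) -> float:
    """Each of n tasks runs all k stages before the next one starts: T*k*n."""
    return t * k * n


def pipelined_time(n: int, k: int, t: float) -> float:
    """First task takes k stage times; each later task adds one: T*k + T*(n-1)."""
    return t * k + t * (n - 1)


def speedup(n: int, k: int, t: float = 1.0) -> float:
    """Equals k*n / (k + n - 1); the stage time t cancels out."""
    return non_pipelined_time(n, k, t) / pipelined_time(n, k, t)


for n in (1, 5, 100, 10_000):
    print(f"n = {n:>6}: speedup = {speedup(n, k=5):.3f}")
```

A single task gains nothing (speedup 1), while for 10,000 tasks in a 5-stage pipeline the speedup is within 0.01 of 5.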

12
Pipelining the Load Instruction
[Figure: three loads (1st lw, 2nd lw, 3rd lw) enter the pipeline in consecutive clock cycles over Cycles 1 to 7.]
  • The five independent functional units in the pipeline datapath are:
    • Instruction Memory for the Ifetch stage
    • Register File's read ports (busA and busB) for the Reg/Dec stage
    • ALU for the Exec stage
    • Data Memory for the Mem stage
    • Register File's write port (busW) for the Wr stage
  • One instruction enters the pipeline every cycle
  • One instruction comes out of the pipeline (completes) every cycle
  • Once the pipeline is full, the effective Cycles per Instruction (CPI) is 1
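The pipelined loads above can be drawn as a stage-occupancy table. This sketch assumes one load is fetched per cycle with no stalls, and shows why the effective CPI, (k + n - 1) / n, tends to 1.

```python
from typing import Optional

STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def stage_of(i: int, cycle: int) -> Optional[str]:
    """Stage used in `cycle` (1-based) by the i-th load (0-based), assuming
    one load enters the pipeline every cycle with no stalls."""
    s = cycle - 1 - i
    return STAGES[s] if 0 <= s < len(STAGES) else None


def effective_cpi(n: int) -> float:
    """Cycles to finish n loads divided by n: (k + n - 1) / n."""
    return (len(STAGES) + n - 1) / n


loads = ["1st lw", "2nd lw", "3rd lw"]
cycles = range(1, 8)
print(" " * 8 + "".join(f"Cycle {c}".ljust(10) for c in cycles))
for i, name in enumerate(loads):
    print(name.ljust(8) + "".join((stage_of(i, c) or "").ljust(10) for c in cycles))

# Load i leaves the Wr stage at the end of cycle i + 5: one completion per cycle.
print([i + len(STAGES) for i in range(len(loads))])
```

The three loads complete in Cycles 5, 6, and 7; for 1000 back-to-back loads the effective CPI is 1.004.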

13
The Four Stages of R-type
[Figure: R-type occupies Ifetch, Reg/Dec, Exec, and Wr in Cycles 1 to 4.]
  • Ifetch: Instruction Fetch
    • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec: ALU operates on the two register operands
  • Wr: Write the ALU output back to the register file

14
Pipelining the R-type and Load Instruction
[Figure: R-type, R-type, Load, R-type, R-type issued in consecutive cycles over Cycles 1 to 9, annotated "Oops! We have a problem!" where the Load and the R-type behind it both reach the register file write port.]
  • We have a problem
  • Two instructions try to write to the register
    file at the same time!

15
Important Observation
  • Each functional unit can only be used once per instruction
  • Each functional unit must be used at the same stage for all instructions:
    • Load uses the Register File's write port during its 5th stage
    • R-type uses the Register File's write port during its 4th stage
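The conflict described above can be found mechanically. This minimal sketch assumes instruction i (0-based) is fetched in cycle i + 1 with no stalls, so it reaches stage s in cycle i + s; the write stages (R-type 4, Load 5) come from the slide.

```python
from collections import Counter

# Stage (1-based) in which each instruction class uses the register file's
# write port, as on the slide: load in stage 5, R-type in stage 4.
WRITE_STAGE = {"R-type": 4, "Load": 5}


def write_port_conflicts(program: list[str]) -> dict[int, int]:
    """Return {cycle: writers} for every cycle with more than one register
    file write, assuming instruction i (0-based) is fetched in cycle i + 1."""
    # Fetched in cycle i + 1, stage s is reached in cycle i + s.
    writes = Counter(i + WRITE_STAGE[op] for i, op in enumerate(program))
    return {cycle: n for cycle, n in sorted(writes.items()) if n > 1}


# The sequence from the "Pipelining the R-type and Load Instruction" slide.
print(write_port_conflicts(["R-type", "R-type", "Load", "R-type", "R-type"]))  # {7: 2}
```

The Load fetched in Cycle 3 and the R-type fetched in Cycle 4 both need the write port in Cycle 7. Any Load immediately followed by an R-type collides this way.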

16
Solution 1: Insert a Bubble into the Pipeline
[Figure: Load and R-type instructions over Cycles 1 to 9, with a pipeline bubble inserted so that two register file writes never fall in the same cycle.]
  • Insert a bubble into the pipeline to prevent two writes in the same cycle
    • The control logic can be complex
    • No instruction is completed during Cycle 5
    • The effective CPI for the load is 2

17
Solution 2: Delay R-type's Write by One Cycle
  • Delay R-type's register write by one cycle:
    • Now R-type instructions also use the Reg File's write port at Stage 5
    • The Mem stage is a NOOP stage: nothing is being done

[Figure: the R-type now takes five stages (Ifetch, Reg/Dec, Exec, Mem, Wr) with an idle Mem stage; the sequence R-type, R-type, Load, R-type, R-type runs over Cycles 1 to 9 without a write-port conflict.]
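A minimal sketch of why Solution 2 works, under the same no-stall assumption (instruction i fetched in cycle i + 1): once every writer uses the write port in stage 5, each write lands in fetch cycle + 4, and distinct fetch cycles give distinct write cycles. The check below verifies this exhaustively for all 5-instruction mixes.

```python
from collections import Counter
from itertools import product

# Solution 2: R-type idles through Mem, so every instruction that writes the
# register file does so in stage 5.
WRITE_STAGE = {"R-type": 5, "Load": 5}


def write_cycles(program: list[str]) -> list[int]:
    """Cycle of each register file write; instruction i is fetched in cycle i + 1."""
    return [i + WRITE_STAGE[op] for i, op in enumerate(program)]


def has_conflict(program) -> bool:
    return any(n > 1 for n in Counter(write_cycles(list(program))).values())


# Exhaustively check every 5-instruction mix of R-types and loads.
assert not any(has_conflict(p) for p in product(WRITE_STAGE, repeat=5))
print(write_cycles(["R-type", "R-type", "Load", "R-type", "R-type"]))  # [5, 6, 7, 8, 9]
```

The cost is latency, not throughput: each R-type now spends five cycles in the pipeline, but one instruction still completes every cycle.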
18
The Four Stages of Store
[Figure: Store occupies Ifetch, Reg/Dec, Exec, and Mem in Cycles 1 to 4; it does not use the Wr stage.]
  • Ifetch: Instruction Fetch
    • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec: Calculate the memory address
  • Mem: Write the data into the Data Memory

19
The Four Stages of Beq
[Figure: Beq occupies Ifetch, Reg/Dec, Exec, and Mem in Cycles 1 to 4; it does not use the Wr stage.]
  • Ifetch: Instruction Fetch
    • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec:
    • ALU compares the two register operands
    • Adder calculates the branch target address
  • Mem: If the registers compared in the Exec stage are equal, write the branch target address into the PC

20
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to the Concept of a Pipelined Processor (15 minutes)
  • Pipelined Datapath and Pipelined Control
  • How to Avoid Race Conditions in a Pipeline Design (5 minutes)
  • Pipeline Example: Instruction Interactions (15 minutes)
  • Summary (5 minutes)

21
A Pipelined Datapath
[Figure: the pipelined datapath, divided by clocked (Clk) pipeline registers into the Ifetch, Reg/Dec, Exec, Mem, and Wr stages.]
22
The Instruction Fetch Stage
  • Location 10: lw $1, 0x100($2)    ($1 <- Mem[$2 + 0x100])

[Figure: the pipelined datapath (IUnit, RFile, Exec Unit, Data Mem, and the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers) with the Ifetch stage marked "You are here!". The PC now holds 14 and the IF/ID register holds lw $1, 0x100($2).]
23
A Detail View of the Instruction Unit
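  • Location 10: lw $1, 0x100($2)

[Figure: detail of the Instruction Unit with the Ifetch stage marked "You are here!". The PC (10) addresses the Instruction Memory, an adder computes PC + 4 = 14, a mux selects the next PC, and the fetched instruction lw $1, 0x100($2) is written into the IF/ID register.]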
24
The Decode / Register Fetch Stage
  • Location 10: lw $1, 0x100($2)    ($1 <- Mem[$2 + 0x100])

[Figure: the pipelined datapath with the Reg/Dec stage marked "You are here!". The register file is read and the ID/Ex register captures the contents of $2 and the immediate 0x100.]
25
Load's Address Calculation Stage
  • Location 10: lw $1, 0x100($2)    ($1 <- Mem[$2 + 0x100])

[Figure: the pipelined datapath with the Exec stage marked "You are here!". With ALUOp = Add, ExtOp = 1, ALUSrc = 1, and RegDst = 0, the Exec Unit computes the load's address, which the Ex/Mem register captures.]
26
A Detail View of the Execution Unit
[Figure: detail of the Execution Unit, which sits between the Exec and Mem stage boundaries ("You are here!").]
27
Load's Memory Access Stage
  • Location 10: lw $1, 0x100($2)    ($1 <- Mem[$2 + 0x100])

[Figure: the pipelined datapath with the Mem stage marked "You are here!". With Branch = 0 and MemWr = 0, the Data Memory is read and the Mem/Wr register captures the load's data.]
28
Load's Write Back Stage
  • Location 10: lw $1, 0x100($2)    ($1 <- Mem[$2 + 0x100])

[Figure: the pipelined datapath during the Wr stage, marked "You are somewhere out there!". With RegWr = 1 and MemtoReg = 1, the load's data from the Mem/Wr register is written into the register file.]
29
How About Control Signals?
  • Key observation: Control Signals at Stage N = Func(Instr. at Stage N)
    • N = Exec, Mem, or Wr
  • Example: Control Signals at Exec Stage = Func(Load's Exec)

[Figure: the load in its Exec stage; the Exec control signals ALUOp = Add, ExtOp = 1, ALUSrc = 1, and RegDst = 0 depend only on the load.]
30
Pipeline Control
  • The Main Control generates the control signals during Reg/Dec
    • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
    • Control signals for Mem (MemWr, Branch) are used 2 cycles later
    • Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later

[Figure: the IF/ID register feeds the Main Control in Reg/Dec. The ID/Ex register carries ExtOp, ALUSrc, ALUOp, and RegDst (used in Exec) plus MemWr, Branch, MemtoReg, and RegWr; the Ex/Mem register passes on MemWr and Branch (used in Mem) plus MemtoReg and RegWr; the Mem/Wr register passes on MemtoReg and RegWr (used in Wr).]
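The control pipeline above can be modeled as three registers that shift on each clock edge and shed signals once their stage has used them. This is a sketch: the class name ControlPipeline and the dictionary layout are hypothetical; the signal groupings follow this slide and the values follow the load and R-type walkthrough slides.

```python
from typing import Dict, Optional

# Which stage consumes which control signal (grouping from the slide).
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")
MEM_SIGNALS = ("MemWr", "Branch")
WR_SIGNALS = ("MemtoReg", "RegWr")

# Main Control outputs for two instruction classes; values follow the load
# and R-type walkthrough slides ("x" = don't care).
MAIN_CONTROL: Dict[str, Dict[str, object]] = {
    "lw": dict(ExtOp=1, ALUSrc=1, ALUOp="Add", RegDst=0,
               MemWr=0, Branch=0, MemtoReg=1, RegWr=1),
    "R-type": dict(ExtOp="x", ALUSrc=0, ALUOp="R-type", RegDst=1,
                   MemWr=0, Branch=0, MemtoReg=0, RegWr=1),
}


class ControlPipeline:
    """Control signals decoded in Reg/Dec, then carried by ID/Ex, Ex/Mem, Mem/Wr."""

    def __init__(self) -> None:
        self.id_ex: Optional[dict] = None   # drives the Exec stage
        self.ex_mem: Optional[dict] = None  # drives the Mem stage
        self.mem_wr: Optional[dict] = None  # drives the Wr stage

    def clock(self, decoded_op: Optional[str]) -> None:
        """One clock edge. Downstream registers update first so each captures the
        OLD value of the register before it; each keeps only what later stages need."""
        self.mem_wr = {k: self.ex_mem[k] for k in WR_SIGNALS} if self.ex_mem else None
        self.ex_mem = ({k: self.id_ex[k] for k in MEM_SIGNALS + WR_SIGNALS}
                       if self.id_ex else None)
        self.id_ex = dict(MAIN_CONTROL[decoded_op]) if decoded_op else None

    def stage_controls(self) -> dict:
        """Signals actually applied in each stage during the current cycle."""
        return {
            "Exec": {k: self.id_ex[k] for k in EXEC_SIGNALS} if self.id_ex else None,
            "Mem": {k: self.ex_mem[k] for k in MEM_SIGNALS} if self.ex_mem else None,
            "Wr": dict(self.mem_wr) if self.mem_wr else None,
        }


pipe = ControlPipeline()
pipe.clock("lw")       # lw enters Exec: its ALU controls are now in ID/Ex
pipe.clock("R-type")   # lw moves to Mem, R-type enters Exec
print(pipe.stage_controls())
```

After the second edge the Exec stage is driven by the R-type's controls and the Mem stage by the load's, illustrating that each stage's controls depend only on the instruction currently in it.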
31
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to the Concept of a Pipelined Processor (15 minutes)
  • Pipelined Datapath and Pipelined Control (25 minutes)
  • How to Avoid Race Conditions in a Pipeline Design
  • Pipeline Example: Instruction Interactions (15 minutes)
  • Summary (5 minutes)

32
Beginning of the Wr Stage: A Real-World Problem
[Figure: at each Clk edge, the Ex/Mem register drives WrAdr, MemWr, and Data into the Data Memory, and the Mem/Wr register drives RegAdr, RegWr, and Data into the Reg File; each output appears after its own Clk-to-Q delay (RegWr's, MemWr's, RegAdr's, and WrAdr's Clk-to-Q).]
  • At the beginning of the Wr stage, we have a problem if:
    • RegAdr's (Rd or Rt) Clk-to-Q > RegWr's Clk-to-Q
  • Similarly, at the beginning of the Mem stage, we have a problem if:
    • WrAdr's Clk-to-Q > MemWr's Clk-to-Q
  • In either case the write enable becomes active while the address is still changing, so data can be written to the wrong location
  • We have a race condition between Address and Write Enable!

33
The Pipeline Problem
  • The multiple cycle design prevents the race condition between Addr and WrEn:
    • Make sure the Address is stable by the end of Cycle N
    • Assert WrEn during Cycle N + 1
  • This approach can NOT be used in the pipeline design because:
    • It must be able to write the register file every cycle
    • It must be able to write the data memory every cycle

[Figure: back-to-back Store, Store, R-type, R-type instructions: the data memory and the register file must each accept writes in consecutive cycles.]
34
Synchronize Register File & Synchronize Memory
  • Solution: AND the Write Enable signal with the Clock
    • This is the ONLY place where gating the clock is used
    • MUST consult a circuit expert to ensure no timing violation
    • Example: Clock High Time > Write Access Delay

[Figure: Synchronize Memory and Register File. I_Addr, I_Data, and I_WrEn are captured at the Clk edge and become Address, Data, and WrEn; the write enable is ANDed with Clk to produce C_WrEn for the Reg File or Memory. Address, Data, and WrEn must be stable at least one set-up time before the Clk edge, and the write occurs in the cycle following the clock edge that captures the signals.]
35
Outline of Today's Lecture
  • Recap and Introduction (5 minutes)
  • Introduction to the Concept of a Pipelined Processor (15 minutes)
  • Pipelined Datapath and Pipelined Control (25 minutes)
  • How to Avoid Race Conditions in a Pipeline Design (5 minutes)
  • Pipeline Example: Instruction Interactions
  • Summary (5 minutes)

36
A More Extensive Pipelining Example
[Figure: pipeline diagram over Cycles 1 to 8 for 0 Load, 4 R-type, 8 Store, and 12 Beq (target is 1000), marking the end of Cycles 4, 5, 6, and 7.]
  • End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
  • End of Cycle 5: Load's Wr, R-type's Mem, Store's Exec, Beq's Reg
  • End of Cycle 6: R-type's Wr, Store's Mem, Beq's Exec
  • End of Cycle 7: Store's Wr, Beq's Mem
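The four snapshots above follow from one rule: in a uniform 5-stage pipeline with no stalls, the instruction fetched in cycle i + 1 occupies stage (cycle - 1 - i). This sketch tracks only the four listed instructions (not those fetched at 16, 20, and 24), and assumes Store and Beq pass through all five stages, as in the snapshots.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
PROGRAM = [(0, "Load"), (4, "R-type"), (8, "Store"), (12, "Beq")]


def snapshot(cycle: int) -> dict[str, str]:
    """Stage each listed instruction occupies during `cycle` (1-based), assuming
    instruction i (0-based) is fetched in cycle i + 1 and nothing stalls."""
    out = {}
    for i, (addr, op) in enumerate(PROGRAM):
        s = cycle - 1 - i
        if 0 <= s < len(STAGES):
            out[f"{addr} {op}"] = STAGES[s]
    return out


for c in range(4, 8):
    print(f"End of Cycle {c}:", snapshot(c))
```

Running it reproduces the four bullets; the Load has left the pipeline by Cycle 6 and the R-type by Cycle 7.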

37
Pipelining Example: End of Cycle 4
  • 0: Load's Mem, 4: R-type's Exec, 8: Store's Reg, 12: Beq's Ifetch

[Figure: datapath at the end of Cycle 4. PC = 16; IF/ID holds the Beq instruction; ID/Ex holds the Store's busA and busB; Ex/Mem holds the R-type's result; Mem/Wr holds the Load's Dout. Control values shown: ALUOp = R-type, ExtOp = x, ALUSrc = 0, RegDst = 1, Branch = 0, RegWr = 0, MemtoReg = x.]
38
Pipelining Example: End of Cycle 5
  • 0: Load's Wr, 4: R-type's Mem, 8: Store's Exec, 12: Beq's Reg, 16: R-type's Ifetch

[Figure: datapath at the end of Cycle 5. PC = 20; IF/ID holds the instruction at 16; ID/Ex holds the Beq's busA and busB; Ex/Mem holds the Store's address; Mem/Wr holds the R-type's result. Control values shown: ALUOp = Add, ExtOp = 1, ALUSrc = 1, RegDst = x, Branch = 0, RegWr = 1, MemtoReg = 1.]
39
Pipelining Example: End of Cycle 6
  • 4: R-type's Wr, 8: Store's Mem, 12: Beq's Exec, 16: R-type's Reg, 20: R-type's Ifetch

[Figure: datapath at the end of Cycle 6. PC = 24; IF/ID holds the instruction at 20; ID/Ex holds the R-type's busA and busB; Ex/Mem holds the Beq's results; Mem/Wr holds nothing for the Store. Control values shown: ALUOp = Sub, ExtOp = 1, ALUSrc = 0, RegDst = x, Branch = 0, RegWr = 1, MemtoReg = 0.]
40
Pipelining Example: End of Cycle 7
  • 8: Store's Wr, 12: Beq's Mem, 16: R-type's Exec, 20: R-type's Reg, 24: R-type's Ifetch

[Figure: datapath at the end of Cycle 7. PC = 1000; IF/ID holds the instruction at 24; ID/Ex holds the R-type's busA and busB; Ex/Mem holds the R-type's results; Mem/Wr holds nothing for the Beq. Control values shown: ALUOp = R-type, ExtOp = x, ALUSrc = 0, RegDst = 1, Branch = 1, RegWr = 0, MemtoReg = x.]
41
The Delayed Branch Phenomenon
[Figure: Cycles 4 to 11: 12 Beq (target is 1000), then 16 R-type, 20 R-type, 24 R-type, and finally 1000, the target of the branch.]
  • Although Beq is fetched during Cycle 4:
    • The target address is NOT written into the PC until the end of Cycle 7
    • The branch's target is NOT fetched until Cycle 8
    • 3-instruction delay before the branch takes effect
  • This is referred to as a Branch Hazard
  • Clever design techniques can reduce the delay to ONE instruction
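The 3-instruction delay above can be derived from the stage in which the PC is updated. This sketch assumes one fetch per cycle, 4-byte instructions, and a taken branch whose target is written into the PC at the end of `resolve_stage`. The "Reg/Dec" call below is purely hypothetical: the slide does not say which techniques achieve a one-instruction delay.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def fetch_trace(branch_cycle: int, branch_pc: int, target: int,
                resolve_stage: str = "Mem") -> list[tuple[int, int]]:
    """(cycle, fetched address) pairs around a taken branch, assuming the target
    is written into the PC at the end of `resolve_stage`."""
    delay = STAGES.index(resolve_stage)   # cycles from the branch's Ifetch to that stage
    trace = [(branch_cycle, branch_pc)]
    for k in range(1, delay + 1):         # sequential fetches before the PC changes
        trace.append((branch_cycle + k, branch_pc + 4 * k))
    trace.append((branch_cycle + delay + 1, target))
    return trace


print(fetch_trace(4, 12, 1000))
# [(4, 12), (5, 16), (6, 20), (7, 24), (8, 1000)]
print(fetch_trace(4, 12, 1000, resolve_stage="Reg/Dec"))  # hypothetical earlier resolution
```

With the PC updated in Mem, the instructions at 16, 20, and 24 are fetched before the target, matching the slide; resolving one stage after Ifetch would leave a single such instruction.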

42
The Delayed Load Phenomenon
[Figure: Cycles 1 to 8: I0 Load, followed by Plus 1, Plus 2, Plus 3, and Plus 4.]
  • Although Load is fetched during Cycle 1:
    • The data is NOT written into the Reg File until the end of Cycle 5
    • We cannot read this value from the Reg File until Cycle 6
    • 3-instruction delay before the load takes effect
  • This is referred to as a Data Hazard
  • Clever design techniques can reduce the delay to ONE instruction
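The count of three affected instructions follows from when each follower reads its registers. This sketch assumes one fetch per cycle, register reads in Reg/Dec (stage 2), and the load's write landing at the END of its Wr stage (stage 5), so the value is readable only from the next cycle, as the slide states.

```python
def stale_readers(load_fetch_cycle: int = 1, followers: int = 4) -> list[int]:
    """Which of the next `followers` instructions (Plus 1, Plus 2, ...) read the
    register file before the load's result has been written."""
    write_cycle = load_fetch_cycle + 4          # end of the load's Wr stage
    stale = []
    for k in range(1, followers + 1):
        read_cycle = load_fetch_cycle + k + 1   # Plus k reaches Reg/Dec here
        if read_cycle <= write_cycle:           # value not yet in the Reg File
            stale.append(k)
    return stale


print(stale_readers())   # [1, 2, 3]
```

Plus 1, Plus 2, and Plus 3 read in Cycles 3, 4, and 5 and see the old value; Plus 4 reads in Cycle 6 and sees the loaded data.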

43
Summary
  • Disadvantages of the Single Cycle Processor:
    • Long cycle time
    • Cycle time is too long for all instructions except the Load
  • Multiple Clock Cycle Processor:
    • Divide the instructions into smaller steps
    • Execute each step (instead of the entire instruction) in one cycle
  • Pipeline Processor:
    • Natural enhancement of the multiple clock cycle processor
    • Each functional unit can only be used once per instruction
    • If an instruction is going to use a functional unit, it must use it at the same stage as all other instructions
  • Pipeline Control:
    • Each stage's control signals depend ONLY on the instruction that is currently in that stage

44
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: Load, Store, and R-type under three implementations. Single Cycle Implementation: one long cycle per instruction, with Waste in the Store's cycle. Multiple Cycle Implementation: Load, Store, and R-type execute one after another over Cycles 1 to 10. Pipeline Implementation: Load, Store, and R-type overlap, starting one cycle apart.]
45
Where to get more information?
  • Everything you need to know about pipelined computers:
    • Peter Kogge, The Architecture of Pipelined Computers, McGraw-Hill Book Company, 1981.
  • Some classic references on RISC pipelines:
    • Manolis Katevenis, Reduced Instruction Set Computer Architectures for VLSI, PhD Thesis, UC Berkeley, 1984.
  • Other references:
    • David A. Patterson, Reduced Instruction Set Computers, Communications of the ACM, January 1985.
    • Shing Kong, Performance, Resources, and Complexity, PhD Thesis, UC Berkeley, 1989.