Appendix A Pipelining: Basic and Intermediate Concepts

Course: CS 5513 Computer Architecture. Author: Ki Hwan Yum. Created: 8/26/2003. Presentation format: On-screen Show.

Slides: 50

Learn more at: https://people.engr.tamu.edu

1
Appendix A Pipelining: Basic and Intermediate Concepts
2
Pipelining

An implementation technique whereby multiple
instructions are overlapped in execution.
Each step in the pipeline (called a pipe stage)
completes a part of an instruction.
Because all stages proceed at the same time, the
length of a processor (clock) cycle is determined
by the time required for the slowest pipe stage.

3
Pipelining

Designer's goal: balancing the length of each
pipeline stage.
If the stages are perfectly balanced, the time
per instruction on the pipelined processor is

Time per instruction on unpipelined machine / Number of pipe stages

Under these ideal conditions:
Speedup from pipelining = number of pipe stages
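
As a rough illustration of the two points above (the clock cycle is set by the slowest stage, and the ideal speedup equals the number of stages), the following Python sketch compares balanced and unbalanced stage delays. The function name and the stage delays are hypothetical values chosen for illustration, and pipeline register overhead is ignored.

```python
def pipeline_speedup(stage_delays_ns):
    """Speedup of a pipeline over an unpipelined machine, ignoring overhead.

    Unpipelined time per instruction is the sum of all stage delays;
    the pipelined clock cycle is set by the slowest stage.
    """
    unpipelined_time = sum(stage_delays_ns)
    cycle_time = max(stage_delays_ns)
    return unpipelined_time / cycle_time


# Hypothetical stage delays in nanoseconds (not from the slides).
balanced = [2, 2, 2, 2, 2]
unbalanced = [1, 2, 4, 2, 1]

print(pipeline_speedup(balanced))    # 5.0, the number of stages
print(pipeline_speedup(unbalanced))  # 2.5, limited by the 4 ns stage
```

The unbalanced case shows why balancing matters: the total work is the same, but one slow stage halves the achievable speedup.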
4
RISC Instruction Set (MIPS64)

64-bit version of the MIPS instruction set.
32 registers
3 classes of instructions
ALU instructions: DADD, DSUB, ...
Load and store instructions: LD, SD, ...
Branches and jumps

5
Implementation of a RISC (Unpipelined, Multicycle)

Implementation of an integer subset of a RISC
architecture that takes at most 5 clock cycles.
Instruction Fetch (IF)
Instruction Decode/Register Fetch (ID)
Execution/Effective Address Calculation (EX)
Memory Access (MEM)
Write-Back (WB)

6
Instruction Format (32-bit Version)

All MIPS instructions are 32 bits long.

R-format (add, sub, ...)
I-format (lw, sw, ...)
J-format (j)
7
Instruction Fetch Cycle (IF)

Send the program counter (PC) to memory.
Fetch the current instruction from memory.
Update the PC to the next sequential PC by adding
4 to the PC.

8
Instruction Decode/Register Fetch Cycle (ID)

Decode the instruction and read the registers
from the register file.
Do the equality test on the registers for a
possible branch.
Sign-extend the offset field of the instruction
in case it is needed.
Compute the possible branch target address by
adding the sign-extended offset to the
incremented PC.
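
The sign extension and branch target computation described above can be sketched in Python as follows. The function names are hypothetical, and the left shift by 2 is an assumption taken from the MIPS convention that branch offsets count 4-byte instructions, a detail the slide does not spell out.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Branch target from the PC of the branch and its 16-bit offset field.

    Assumes MIPS conventions: the incremented PC is pc + 4 and the offset
    counts 4-byte instructions, so it is shifted left by 2 before adding.
    """
    incremented_pc = pc + 4
    return incremented_pc + (sign_extend_16(offset_field) << 2)


print(hex(branch_target(0x1000, 0x0003)))  # 0x1010, three instructions past PC + 4
print(hex(branch_target(0x1000, 0xFFFE)))  # 0xffc, two instructions before PC + 4
```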

9
Execution/Effective Address Calculation (EX)

The ALU operates on the operands prepared in the
prior cycle.
Memory reference instructions: The ALU adds the
base register and the offset to form the
effective address.
Register-Register: The ALU performs the operation
specified by the ALU opcode on the values from
the register file.
Register-Immediate: The ALU performs the
operation specified by the opcode on the first
value from the register file and the
sign-extended immediate.

10
Memory Access (MEM)

If the instruction is a load, memory does a read
using the effective address computed in the
previous cycle.
If it is a store, then the memory writes the data
from the second register read from the register
file using the effective address.

11
Write-Back cycle (WB)

12

In this implementation, branch instructions
require 2 cycles, store instructions require 4
cycles, and all other instructions require 5
cycles.
Assuming a branch frequency of 12% and a store
frequency of 10%, what is the overall CPI?
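
One way to work the question above is a weighted average of the cycle counts. The sketch below reads the stated frequencies as 12% and 10% and assumes every remaining instruction falls into the 5-cycle class; the variable names are illustrative only.

```python
# Cycle counts per instruction class, from the slide.
cycles = {"branch": 2, "store": 4, "other": 5}

# Frequencies from the slide, read as percentages; "other" is the remainder.
freq = {"branch": 0.12, "store": 0.10}
freq["other"] = 1.0 - sum(freq.values())

overall_cpi = sum(freq[kind] * cycles[kind] for kind in cycles)
print(f"{overall_cpi:.2f}")  # 4.54
```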

13
Classic 5 Stage Pipeline for a RISC Processor
14
Performance Issues in Pipelining

Pipelining increases the CPU instruction
throughput.
Throughput: the number of instructions completed
per unit of time.
Pipelining does not decrease the execution time
of an individual instruction.
In fact, it slightly increases the execution time
of each instruction due to overhead (clock skew
and pipeline register delay) in the control of
the pipeline.

15
Example (p. A-10)

Consider the unpipelined processor. Assume that
it has a 1ns clock cycle and that it uses 4
cycles for ALU operations and branches and 5
cycles for memory operations. Assume that the
relative frequencies of these operations are 40%,
20%, and 40%, respectively. Suppose that due to
clock skew and setup, pipelining the processor
adds 0.2ns of overhead to the clock. Ignoring any
latency impact, how much speedup in the
instruction execution rate will we gain from a
pipeline?
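
A sketch of the arithmetic for this example, reading the frequencies as 40%, 20%, and 40%. It assumes the pipelined machine completes one instruction per clock cycle (no stalls), which is what "ignoring any latency impact" suggests; the variable names are illustrative only.

```python
clock_ns = 1.0        # unpipelined clock cycle
overhead_ns = 0.2     # added per cycle by pipelining (skew and setup)

# (frequency, cycles) per class, reading 40/20/40 as percentages.
mix = {"alu": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

# Average instruction time is the clock cycle times the average CPI.
unpipelined_ns = clock_ns * sum(f * c for f, c in mix.values())
pipelined_ns = clock_ns + overhead_ns   # one instruction completes per cycle

speedup = unpipelined_ns / pipelined_ns
print(f"{unpipelined_ns:.1f} ns, {pipelined_ns:.1f} ns, speedup {speedup:.2f}")
# 4.4 ns, 1.2 ns, speedup 3.67
```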

16
Classic 5 Stage Pipeline for a RISC Processor
17
Classic 5-Stage Pipeline

What happens in the pipeline?
One resource cannot be used for two different
operations on the same clock cycle.
=> Separate instruction and data memories.
The register file is used in two stages: ID (two
reads) and WB (one write).
=> Register write in the first half of the
clock cycle and register read in the second half.

18
Pipeline Hazards
19
Pipeline Hazards

Situations that prevent the next instruction in
the instruction stream from executing during its
designated clock cycle.
Hazards reduce the performance from the ideal
speedup gained by pipelining.
Structural Hazards
Data Hazards
Control Hazards
Hazards can make it necessary to stall the
pipeline.

20
Pipeline Hazards

When an instruction is stalled, all instructions
issued later than the stalled instruction are
also stalled.
No new instructions are fetched during the stall.

21
Structural Hazards

Hardware cannot support the combination of
instructions that we want to execute in the same
clock cycle.
Suppose we have a single memory instead of two
memories.
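
To see where a single memory causes trouble, the sketch below lists the pairs of instructions that would need memory in the same clock cycle. It assumes the classic 5-stage ordering and an ideal pipeline with no stalls; the function name, the operation labels, and the sample program are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(ops):
    """Return (load_or_store_index, fetch_index) pairs that need memory
    in the same cycle, assuming an ideal pipeline with no stalls.

    Instruction i is in stage s during cycle i + s (both counted from 0).
    """
    conflicts = []
    mem_offset = STAGES.index("MEM")  # MEM is 3 stages after IF
    for i, op in enumerate(ops):
        fetch_index = i + mem_offset  # instruction in IF while i is in MEM
        if op in ("load", "store") and fetch_index < len(ops):
            conflicts.append((i, fetch_index))
    return conflicts


# Hypothetical instruction sequence.
program = ["load", "alu", "alu", "alu", "store", "alu"]
print(memory_conflicts(program))  # [(0, 3)]
```

Each reported pair is a data access colliding with the fetch of the instruction three positions later, which is the conflict that separate instruction and data memories avoid.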

22
Control Hazards

This arises from the need to make a decision
based on the results of one instruction while
others are executing.
branch instruction
Pipeline stall (or bubble)
How can we overcome this problem?

23
Branch Hazards

To minimize the branch penalty, put in enough
hardware so that we can test registers, calculate
the branch target address, and update the PC
during the second stage.

24
Example

Estimate the impact on the CPI of stalling on
branches. Assume all other instructions have a
CPI of 1.
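
The slide does not give a branch frequency or a stall length, so the sketch below supplies both as assumptions: 12% branches (the frequency used in the earlier CPI question) and a 1-cycle stall (a branch resolved in the second stage). The function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_freq, stall_cycles, base_cpi=1.0):
    """Average CPI when every branch adds stall_cycles to a base CPI."""
    return base_cpi + branch_freq * stall_cycles


# Assumed inputs: 12% branches and a 1-cycle stall per branch.
print(f"{cpi_with_branch_stalls(0.12, 1):.2f}")  # 1.12
```

With different assumptions the result scales linearly: a 3-cycle stall at the same frequency would give a CPI of 1.36.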

25
Branch Prediction

Computers do indeed use prediction to handle
branches.
Simplest: always predict that branches will not
be taken. If you're right, the pipeline proceeds
at full speed.
Dynamic hardware predictors make their guesses
depending on the behavior of each branch.
Popular: keeping a history for each branch as
taken or untaken, and then using the past to
predict the future. => about 90% accuracy
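
A minimal sketch of history-based prediction, assuming the simplest possible history: one bit per branch that remembers the last outcome. This is a deliberately simple stand-in, not the specific predictor behind the accuracy figure above, and the function name and branch history are hypothetical.

```python
def one_bit_accuracy(outcomes, initial_prediction=False):
    """Fraction of branches predicted correctly by a last-outcome predictor.

    outcomes is a list of booleans for a single branch (True = taken).
    The predictor guesses that the branch repeats its previous outcome.
    """
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent outcome
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then falls through, 10 times over.
history = ([True] * 9 + [False]) * 10
print(one_bit_accuracy(history))  # 0.8
```

The one-bit scheme mispredicts twice per loop execution (on entry and on exit), which is why predictors that keep more history per branch do better on loops.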

26
Branch Prediction
When the guess is wrong, the pipeline must make
sure that the instructions following the wrongly
guessed branch have no effect and must restart
the pipeline from the proper branch address.
27
Delayed Branch

Delayed decision
Used in MIPS
The delayed branch always executes the next
sequential instruction, with the branch taking
place after that one instruction delay.

29

MIPS software will place an instruction
immediately after the delayed branch instruction
that is not affected by the branch, and a taken
branch changes the address of the instruction
that follows this safe instruction.
Compilers typically fill about 50% of the branch
delay slots with useful instructions.

30
Data Hazards

An instruction depends on the results of a
previous instruction still in the pipeline.
e.g.
add $s0, $t0, $t1
sub $t2, $s0, $t3
The add instruction doesn't write the result
until the 5th stage. => 3 bubbles
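
The dependence in the add/sub pair above can be found mechanically. The sketch below checks only adjacent instructions and uses a hypothetical tuple encoding of (opcode, destination, sources); it detects the dependence but does not model how many bubbles result.

```python
def raw_dependences(program):
    """Yield (producer_index, consumer_index, register) for each case where
    an instruction reads a register written by the instruction just before it.

    Each instruction is (opcode, destination, [sources]).
    """
    for i in range(1, len(program)):
        _, dest, _ = program[i - 1]
        _, _, sources = program[i]
        if dest in sources:
            yield (i - 1, i, dest)


program = [
    ("add", "$s0", ["$t0", "$t1"]),
    ("sub", "$t2", ["$s0", "$t3"]),
]
print(list(raw_dependences(program)))  # [(0, 1, '$s0')]
```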

31
Solution

Forwarding (or bypassing): getting the missing
item early from the internal resources.
e.g. as soon as the ALU creates the sum for the
add, we can supply it as the input for the
subtract.

33
Load-Use Data Hazard
34

Even with forwarding, we still have to stall one
stage for a load-use data hazard.
Delayed loads: follow a load with an
instruction independent of that load.
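
The effect of following a load with an independent instruction can be sketched as a stall count. The code assumes a 5-stage pipeline with full forwarding, so the only stall modeled is the one-bubble load-use case; the tuple encoding, function name, and instruction sequences are hypothetical.

```python
def stalls_with_forwarding(program):
    """Count bubbles in a 5-stage pipeline that has full forwarding.

    Each instruction is (opcode, destination, [sources]). The only stall
    modeled is the load-use case: one bubble when an instruction reads the
    register loaded by the instruction immediately before it.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, sources = curr
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


# Hypothetical sequences: the second moves an independent add after the load.
load_use = [
    ("lw", "$t0", ["$s1"]),
    ("add", "$t2", ["$t0", "$t1"]),
    ("add", "$t4", ["$s2", "$s3"]),
]
rescheduled = [load_use[0], load_use[2], load_use[1]]

print(stalls_with_forwarding(load_use))     # 1
print(stalls_with_forwarding(rescheduled))  # 0
```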

36
Implementation of the MIPS Datapath
37
Events on Every Pipe Stage of the MIPS Pipeline

See Figure A.19 on page A-32.

38
Revised Datapath
39
Revised Pipeline Structure

See Figure A.25 on page A-39.

40
Extending the MIPS to Handle Multicycle Operations
41
Floating-Point Operations

The floating-point pipeline will allow for a
longer latency for operations.
The EX cycle may be repeated as many times as
needed to complete the operation.
The number of repetitions can vary for different
operations.
There may be multiple floating-point functional
units.

42
Assumptions

Main integer unit handles loads and stores,
integer ALU operations, and branches.
FP and integer multiplier.
FP adder handles FP add, subtract, and
conversion.
FP and integer divider.
The EX stages of these functional units are not
pipelined.

43
MIPS with 3 FP Functional Units
44

Because EX is not pipelined, no other instruction
using that functional unit may issue until the
previous instruction leaves EX.
Instruction issue (p. A-33): the process of
letting an instruction move from the ID stage
into the EX stage of the pipeline.
If an instruction cannot proceed to the EX stage,
the entire pipeline behind that instruction will
be stalled.

45

Latency: the number of intervening cycles between
an instruction that produces a result and an
instruction that uses the result.
Initiation interval: the number of cycles that
must elapse between issuing two operations of a
given type.

46
Example (Figure A.30)
Functional Unit                   Latency   Initiation Interval
Integer ALU                       0         1
Data memory (integer/FP loads)    1         1
FP add                            3         1
FP multiply (integer multiply)    6         1
FP divide (integer divide)        24        25
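
The two columns of the table above translate into two kinds of stall. The sketch below assumes the consumer needs its operand at the start of EX and that nothing else stalls in between; the unit names and both functions are hypothetical.

```python
# (latency, initiation interval) per functional unit, from the table above.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}


def data_stalls(producer_unit, distance=1):
    """Stall cycles for a consumer issued `distance` instructions after its
    producer (1 = immediately after), assuming no other stalls in between
    and a consumer that needs the value at the start of EX."""
    latency, _ = UNITS[producer_unit]
    return max(0, latency - (distance - 1))


def structural_stalls(unit, distance=1):
    """Stall cycles before a second operation can issue to the same unit."""
    _, interval = UNITS[unit]
    return max(0, interval - distance)


print(data_stalls("fp_mul"))              # 6
print(data_stalls("fp_add", distance=3))  # 1
print(structural_stalls("fp_div"))        # 24
print(structural_stalls("fp_add"))        # 0
```

Placing independent instructions between a producer and its consumer increases `distance` and hides part of the latency, while back-to-back divides stall regardless of data dependences.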
47

Since most operations consume their operands at
the beginning of the EX stage, the latency is
usually the number of stages after EX in which an
instruction produces a result.
0 for Integer ALU operations.
1 for loads.
Pipeline latency is essentially equal to 1 cycle
less than the depth of the execution pipeline,
which is the number of stages from the EX stage
to the stage that produces the result.