Title: Appendix A Pipelining: Basic and Intermediate Concepts
Slide 1: Appendix A: Pipelining: Basic and Intermediate Concepts
Slide 2: Pipelining
- An implementation technique whereby multiple
instructions are overlapped in execution.
- Each step in the pipeline (called a pipe stage)
completes a part of an instruction.
- Because all stages proceed at the same time, the
length of a processor (clock) cycle is determined
by the time required for the slowest pipe stage.
Slide 3: Pipelining
- Designer's goal: balancing the length of each
pipeline stage.
- If the stages are perfectly balanced, the time
per instruction on the pipelined processor is

$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$

- Under these ideal conditions, the speedup from
pipelining equals the number of pipe stages.
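The two points above (the slowest stage sets the clock, and the ideal speedup is reached only when stages are balanced) can be sketched in Python. The function name `pipeline_speedup` and the stage delays are illustrative assumptions, not values from the slides, and the sketch ignores pipeline overhead and hazards.

```python
def pipeline_speedup(stage_delays_ns):
    """Return (clock_cycle, speedup) for a pipeline with the given stage delays.

    Assumes no pipeline overhead and no hazards, so one instruction
    completes per clock cycle once the pipeline is full.
    """
    unpipelined_time = sum(stage_delays_ns)  # one instruction passes through every stage
    clock_cycle = max(stage_delays_ns)       # the slowest stage sets the clock
    return clock_cycle, unpipelined_time / clock_cycle


# Hypothetical stage delays in ns (illustrative only).
balanced = [2, 2, 2, 2, 2]
unbalanced = [2, 1, 3, 2, 2]

print(pipeline_speedup(balanced))    # (2, 5.0): speedup equals the number of stages
print(pipeline_speedup(unbalanced))  # clock is 3 ns, so the speedup drops below 5
```

With the same total work, the unbalanced pipeline is limited by its 3 ns stage, which is why designers try to balance stage lengths.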
Slide 4: RISC Instruction Set (MIPS64)
- 64-bit version of the MIPS instruction set.
- 32 registers
- 3 classes of instructions
  - ALU instructions: DADD, DSUB, ...
  - Load and store instructions: LD, SD, ...
  - Branches and jumps
Slide 5: Implementation of a RISC (Unpipelined, Multicycle)
- Implementation of an integer subset of a RISC
architecture in which every instruction takes at most 5 clock cycles.
  - Instruction Fetch (IF)
  - Instruction Decode/Register Fetch (ID)
  - Execution/Effective Address Calculation (EX)
  - Memory Access (MEM)
  - Write-Back (WB)
Slide 6: Instruction Format (32-bit Version)
- All MIPS instructions are 32 bits long.
- R-format (add, sub, ...)
- I-format (lw, sw, ...)
- J-format (j)
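As a rough illustration of the three formats, the Python sketch below splits a 32-bit instruction word into fields. The field boundaries follow the standard MIPS32 encoding, which the slide itself does not spell out, and the function names are hypothetical.

```python
def decode_r_format(word):
    """Split a 32-bit R-format word into opcode, rs, rt, rd, shamt, funct."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21, first source register
        "rt": (word >> 16) & 0x1F,      # bits 20..16, second source register
        "rd": (word >> 11) & 0x1F,      # bits 15..11, destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10..6, shift amount
        "funct": word & 0x3F,           # bits 5..0, ALU function
    }


def decode_i_format(word):
    """Split a 32-bit I-format word into opcode, rs, rt, and a 16-bit immediate."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,           # raw 16-bit field, not yet sign-extended
    }


def decode_j_format(word):
    """Split a 32-bit J-format word into opcode and a 26-bit target field."""
    return {"opcode": (word >> 26) & 0x3F, "target": word & 0x3FFFFFF}


# 0x01098020 is the standard encoding of: add $s0, $t0, $t1
print(decode_r_format(0x01098020))
# {'opcode': 0, 'rs': 8, 'rt': 9, 'rd': 16, 'shamt': 0, 'funct': 32}
```

Because every format keeps the opcode and register fields in fixed positions, the register file can be read in parallel with decoding.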
Slide 7: Instruction Fetch Cycle (IF)
- Send the program counter (PC) to memory.
- Fetch the current instruction from memory.
- Update the PC to the next sequential PC by adding
4 to the PC.
Slide 8: Instruction Decode/Register Fetch Cycle (ID)
- Decode the instruction and read the registers
from the register file.
- Do the equality test on the registers for a
possible branch.
- Sign-extend the offset field of the instruction
in case it is needed.
- Compute the possible branch target address by
adding the sign-extended offset to the
incremented PC.
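The sign extension and branch-target computation described above can be sketched as follows. This is a minimal Python illustration with hypothetical function names; it assumes the usual MIPS convention that the 16-bit offset counts instructions and is therefore shifted left by 2 before being added, a detail the slide does not state.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Possible branch target: incremented PC plus the sign-extended offset.

    Assumption: the offset is a word offset, so it is shifted left by 2
    to obtain a byte offset (standard MIPS behavior).
    """
    incremented_pc = pc + 4  # the PC was already advanced in IF
    return incremented_pc + (sign_extend16(offset_field) << 2)


print(hex(branch_target(0x1000, 0x0003)))  # 0x1010: three instructions past PC + 4
print(hex(branch_target(0x1000, 0xFFFF)))  # 0x1000: offset -1 branches back to itself
```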
Slide 9: Execution/Effective Address Calculation (EX)
- The ALU operates on the operands prepared in the
prior cycle.
  - Memory reference instructions: the ALU adds the
base register and the offset to form the
effective address.
  - Register-Register: the ALU performs the operation
specified by the ALU opcode on the values from
the register file.
  - Register-Immediate: the ALU performs the
operation specified by the opcode on the first
value from the register file and the
sign-extended immediate.
Slide 10: Memory Access (MEM)
- If the instruction is a load, memory does a read
using the effective address computed in the
previous cycle.
- If it is a store, then the memory writes the data
from the second register read from the register
file using the effective address.
Slide 11: Write-Back Cycle (WB)
- Register-Register ALU instruction or Load
instruction: write the result into the register
file.
Slide 12
- In this implementation, branch instructions
require 2 cycles, store instructions require 4
cycles, and all other instructions require 5
cycles.
- Assuming a branch frequency of 12% and a store
frequency of 10%, what is the overall CPI?
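One way to work this question is to weight each class's cycle count by its frequency. The Python sketch below does that; the function name `overall_cpi` is hypothetical, and the 78% share for "all other instructions" is simply what remains after branches and stores.

```python
def overall_cpi(mix):
    """Weighted-average CPI; mix maps class name -> (frequency, cycles)."""
    total_frequency = sum(freq for freq, _ in mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


mix = {
    "branch": (0.12, 2),
    "store": (0.10, 4),
    "other": (0.78, 5),  # remainder: 1 - 0.12 - 0.10
}

# 0.12*2 + 0.10*4 + 0.78*5 = 0.24 + 0.40 + 3.90
print(round(overall_cpi(mix), 2))  # 4.54
```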
Slide 13: Classic 5 Stage Pipeline for a RISC Processor
Slide 14: Performance Issues in Pipelining
- Pipelining increases the CPU instruction
throughput.
- Throughput: the number of instructions completed
per unit of time.
- Pipelining does not decrease the execution time
of an individual instruction.
- It actually increases the execution time of each
instruction slightly because of overhead
(clock skew and pipeline register delay) in the
control of the pipeline.
Slide 15: Example (p. A-10)
- Consider the unpipelined processor. Assume that
it has a 1ns clock cycle and that it uses 4
cycles for ALU operations and branches and 5
cycles for memory operations. Assume that the
relative frequencies of these operations are 40%,
20%, and 40%, respectively. Suppose that due to
clock skew and setup, pipelining the processor
adds 0.2ns of overhead to the clock. Ignoring any
latency impact, how much speedup in the
instruction execution rate will we gain from a
pipeline?
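A worked sketch of this example in Python follows. It assumes that the pipelined processor completes one instruction per clock cycle (ideal CPI of 1), so its average instruction time is just the clock cycle plus the overhead; the function name is hypothetical.

```python
def pipelining_speedup(clock_ns, mix, overhead_ns):
    """Speedup = average unpipelined instruction time / pipelined instruction time.

    mix maps class name -> (frequency, cycles on the unpipelined processor).
    Assumes the pipeline sustains one instruction per cycle.
    """
    average_cycles = sum(freq * cycles for freq, cycles in mix.values())
    unpipelined_time = clock_ns * average_cycles  # ns per instruction
    pipelined_time = clock_ns + overhead_ns       # ns per instruction
    return unpipelined_time / pipelined_time


mix = {"alu": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

# Unpipelined: 1 ns * (0.4*4 + 0.2*4 + 0.4*5) = 4.4 ns; pipelined: 1.2 ns
print(round(pipelining_speedup(1.0, mix, 0.2), 2))  # 3.67
```

The speedup stays well below 5 because the unpipelined average is only 4.4 cycles and the overhead stretches the pipelined clock.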
Slide 16: Classic 5 Stage Pipeline for a RISC Processor
Slide 17: Classic 5-Stage Pipeline
- What happens in the pipeline?
- One resource cannot be used for two different
operations on the same clock cycle.
  -> Separate instruction and data memories.
- The register file is used in two stages: ID (two
reads) and WB (one write).
  -> Register write in the first half of the
clock cycle and register read in the second half.
Slide 18: Pipeline Hazards
Slide 19: Pipeline Hazards
- Situations that prevent the next instruction in
the instruction stream from executing during its
designated clock cycle.
- Hazards reduce the performance from the ideal
speedup gained by pipelining.
  - Structural Hazards
  - Data Hazards
  - Control Hazards
- Hazards can make it necessary to stall the
pipeline.
Slide 20: Pipeline Hazards
- When an instruction is stalled, all instructions
issued later than the stalled instruction are
also stalled.
- No new instructions are fetched during the stall.
Slide 21: Structural Hazards
- The hardware cannot support the combination of
instructions that we want to execute in the same
clock cycle.
- Example: suppose we have a single memory instead
of two memories.
Slide 22: Control Hazards
- A control hazard arises from the need to make a
decision based on the results of one instruction
while others are executing.
  - Branch instruction
  - Pipeline stall (or bubble)
- How can we overcome this problem?
Slide 23: Branch Hazards
- To minimize the branch penalty, put in enough
hardware so that we can test registers, calculate
the branch target address, and update the PC
during the second stage.
Slide 24: Example
- Estimate the impact on the CPI of stalling on
branches. Assume all other instructions have a
CPI of 1.
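The slide does not give the branch frequency or the stall length, so the Python sketch below only shows the form of the estimate. The 12% branch frequency and the 1-cycle stall (a branch resolved in the second stage) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, base_cpi=1.0):
    """CPI when every branch stalls the pipeline for stall_cycles cycles.

    All other instructions are assumed to have a CPI of base_cpi.
    """
    return base_cpi + branch_frequency * stall_cycles


# Assumed inputs (not from the slide): 12% branches, 1 stall cycle per branch.
print(round(cpi_with_branch_stalls(0.12, 1), 2))  # 1.12
```

Under these assumed numbers, stalling on every branch costs about 12% in performance relative to the ideal CPI of 1.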
Slide 25: Branch Prediction
- Computers do indeed use prediction to handle
branches.
- Simplest: always predict that branches will not
be taken (will fail).
- If you're right, the pipeline proceeds at full
speed.
- Dynamic hardware predictors make their guesses
depending on the behavior of each branch.
- Popular: keeping a history for each branch as
taken or untaken, and then using the past to
predict the future.
  -> about 90% accuracy
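A minimal Python sketch of the history idea follows. It models the simplest possible variant, a 1-bit "last outcome" history per branch address; real predictors typically use more state, and the class name, the default prediction, and the trace are all illustrative assumptions.

```python
class LastOutcomePredictor:
    """Predict that each branch does what it did the last time (1-bit history)."""

    def __init__(self):
        self.history = {}  # branch address -> last outcome (True means taken)

    def predict(self, pc):
        # Branches never seen before are predicted not taken.
        return self.history.get(pc, False)

    def update(self, pc, taken):
        self.history[pc] = taken


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)  # learn the real outcome after predicting
    return correct / len(trace)


# Hypothetical trace: a loop branch taken 9 times, then not taken, run twice.
trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 2
print(accuracy(LastOutcomePredictor(), trace))  # 0.8
```

The mispredictions occur at loop entry and loop exit; longer loops push the accuracy of this scheme higher.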
Slide 26: Branch Prediction
- When the guess is wrong, the pipeline must make
sure that the instructions following the wrongly
guessed branch have no effect and must restart
the pipeline from the proper branch address.
Slide 27: Delayed Branch
- Delayed decision
- Used in MIPS
- The delayed branch always executes the next
sequential instruction, with the branch taking
place after that one instruction delay.
Slide 29
- MIPS software will place an instruction
immediately after the delayed branch instruction
that is not affected by the branch, and a taken
branch changes the address of the instruction
that follows this safe instruction.
- Compilers typically fill about 50% of the branch
delay slots with useful instructions.
Slide 30: Data Hazards
- An instruction depends on the results of a
previous instruction still in the pipeline.
- e.g.
  - add $s0, $t0, $t1
  - sub $t2, $s0, $t3
- The add instruction doesn't write the result
until the 5th stage.
  -> 3 bubbles
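The bubble count can be reproduced by comparing the cycle in which the producer writes the register with the cycle in which the consumer would normally read it. The Python sketch below is one way to do that; the function name is hypothetical. It assumes no forwarding, and it treats the register file's write-then-read behavior within one cycle (described on the earlier Classic 5-Stage Pipeline slide) as an optional parameter: the count of 3 corresponds to reading in the cycle after the write.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def bubbles_without_forwarding(distance=1, same_cycle_write_read=False):
    """Bubbles needed before a consumer `distance` instructions after a producer.

    The producer is fetched in cycle 1 and writes its result in WB;
    the consumer reads its source registers in ID.
    """
    producer_wb_cycle = STAGES.index("WB") + 1             # cycle 5
    consumer_id_cycle = distance + STAGES.index("ID") + 1  # cycle 3 when distance is 1
    if same_cycle_write_read:
        earliest_read_cycle = producer_wb_cycle      # write in first half, read in second
    else:
        earliest_read_cycle = producer_wb_cycle + 1  # read only after the write cycle
    return max(0, earliest_read_cycle - consumer_id_cycle)


print(bubbles_without_forwarding())                            # 3
print(bubbles_without_forwarding(same_cycle_write_read=True))  # 2
```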
Slide 31: Solution
- Forwarding (or bypassing): getting the missing
item early from the internal resources.
- e.g. as soon as the ALU creates the sum for the
add, we can supply it as the input for the
subtract.
Slide 33: Load-Use Data Hazard
Slide 34
- Even with forwarding, we still have to stall one
stage for a load-use data hazard.
- Delayed loads: follow a load with an
instruction independent of that load.
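Why forwarding removes the stall after an ALU instruction but not after a load can be checked with a small timing calculation. The Python sketch below is a simplified model with a hypothetical function name: it assumes full forwarding, a producer fetched in cycle 1, and a consumer that needs the value at the start of its EX stage.

```python
def stalls_with_forwarding(producer_is_load, distance=1):
    """Stall cycles for a dependent instruction `distance` slots after the producer.

    Cycle numbering: the producer is in IF in cycle 1, EX in cycle 3, MEM in cycle 4.
    """
    # Cycle at the end of which the value exists: MEM for loads, EX for ALU ops.
    value_ready_after = 4 if producer_is_load else 3
    # The consumer enters IF in cycle 1 + distance and EX two cycles later.
    consumer_ex_cycle = distance + 3
    return max(0, value_ready_after + 1 - consumer_ex_cycle)


print(stalls_with_forwarding(producer_is_load=False))             # 0: ALU result forwarded in time
print(stalls_with_forwarding(producer_is_load=True))              # 1: load-use hazard
print(stalls_with_forwarding(producer_is_load=True, distance=2))  # 0: one independent instruction in between
```

The last case is the delayed-load idea: placing one independent instruction after the load hides the stall.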
Slide 36: Implementation of the MIPS Datapath
Slide 37: Events on Every Pipe Stage of the MIPS Pipeline
- See Figure A.19 on page A-32.
Slide 38: Revised Datapath
Slide 39: Revised Pipeline Structure
- See Figure A.25 on page A-39.
Slide 40: Extending the MIPS to Handle Multicycle Operations
Slide 41: Floating-Point Operations
- The floating-point pipeline will allow for a
longer latency for operations.
- The EX cycle may be repeated as many times as
needed to complete the operation.
- The number of repetitions can vary for different
operations.
- There may be multiple floating-point functional
units.
Slide 42: Assumptions
- Main integer unit handles loads and stores,
integer ALU operations, and branches.
- FP and integer multiplier.
- FP adder handles FP add, subtract, and
conversion.
- FP and integer divider.
- The EX stages of these functional units are not
pipelined.
Slide 43: MIPS with 3 FP Functional Units
Slide 44
- Because EX is not pipelined, no other instruction
using that functional unit may issue until the
previous instruction leaves EX.
- Instruction issue (p. A-33): the process of
letting an instruction move from the ID stage
into the EX stage of the pipeline.
- If an instruction cannot proceed to the EX stage,
the entire pipeline behind that instruction will
be stalled.
Slide 45
- Latency: the number of intervening cycles between
an instruction that produces a result and an
instruction that uses the result.
- Initiation interval: the number of cycles that
must elapse between issuing two operations of a
given type.
Slide 46: Example (Figure A.30)
Functional Unit                    Latency   Initiation Interval
Integer ALU                        0         1
Data memory (integer/FP loads)     1         1
FP add                             3         1
FP multiply (integer multiply)     6         1
FP divide (integer divide)         24        25
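The table can be turned into stall counts using the definitions of latency and initiation interval. The Python sketch below is a simplified reading with hypothetical names: it assumes forwarding, a consumer that needs its operand at the start of EX, and no other hazards.

```python
# (latency, initiation interval) per functional unit, copied from the table above.
FUNCTIONAL_UNITS = {
    "integer_alu": (0, 1),
    "data_memory": (1, 1),
    "fp_add": (3, 1),
    "fp_multiply": (6, 1),
    "fp_divide": (24, 25),
}


def data_stalls(producer_unit, distance=1):
    """Stalls for a dependent instruction issued `distance` slots after the producer.

    Latency is the number of intervening cycles required, and
    distance - 1 of them are already filled by other instructions.
    """
    latency, _ = FUNCTIONAL_UNITS[producer_unit]
    return max(0, latency - (distance - 1))


def structural_stalls(unit, distance=1):
    """Stalls before a second operation on the same unit can issue."""
    _, initiation_interval = FUNCTIONAL_UNITS[unit]
    return max(0, initiation_interval - distance)


print(data_stalls("fp_add"))             # 3: dependent instruction right after an FP add
print(data_stalls("data_memory"))        # 1: the load-use stall
print(structural_stalls("fp_multiply"))  # 0: multiplies can issue back to back
print(structural_stalls("fp_divide"))    # 24: a second divide must wait for the first
```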
Slide 47
- Since most operations consume their operands at
the beginning of the EX stage, the latency is usually
the number of stages after EX in which an instruction
produces a result.
  - 0 for integer ALU operations.
  - 1 for loads.
- Pipeline latency is essentially equal to 1 cycle
less than the depth of the execution pipeline,
which is the number of stages from the EX stage
to the stage that produces the result.
Slide 48
- To achieve a higher clock rate, fewer logic
levels are put in each pipe stage.
  -> The number of pipe stages required for more
complex operations is larger.
- The penalty for the faster clock rate is longer
latency for operations.
Slide 49: Supporting Multiple FP Operations
unpipelined