Appendix A Pipelining: Basic and Intermediate Concepts

CS 5513 Computer Architecture, Ki Hwan Yum (created 8/26/2003). Transcript of the slides follows.

1
Appendix A Pipelining: Basic and Intermediate Concepts
2
Pipelining
  • An implementation technique whereby multiple
    instructions are overlapped in execution.
  • Each step in the pipeline (called a pipe stage)
    completes a part of an instruction.
  • Because all stages proceed at the same time, the
    length of a processor (clock) cycle is determined
    by the time required for the slowest pipe stage.

3
Pipelining
  • Designer's goal: balance the length of each
    pipeline stage.
  • If the stages are perfectly balanced, the time
    per instruction on the pipelined processor is

      Time per instruction on unpipelined machine
      / Number of pipe stages

  • Speedup from pipelining = number of pipe stages.
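A quick numeric sketch of this ideal relationship (the numbers are hypothetical, not from the slides):

```python
# Ideal pipelined time and speedup, assuming perfectly balanced
# stages and no pipelining overhead (hypothetical numbers).
unpipelined_ns = 5.0   # time per instruction on the unpipelined machine
stages = 5             # number of pipe stages

pipelined_ns = unpipelined_ns / stages
speedup = unpipelined_ns / pipelined_ns
print(pipelined_ns, speedup)  # 1.0 5.0
```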
4
RISC Instruction Set (MIPS64)
  • 64-bit version of the MIPS instruction set.
  • 32 registers
  • 3 classes of instructions:
  • ALU instructions: DADD, DSUB, ...
  • Load and store instructions: LD, SD, ...
  • Branches and jumps

5
Implementation of a RISC (Unpipelined, Multicycle)
  • Implementation of an integer subset of a RISC
    architecture that takes at most 5 clock cycles.
  • Instruction Fetch (IF)
  • Instruction Decode/Register Fetch (ID)
  • Execution/Effective Address Calculation (EX)
  • Memory Access (MEM)
  • Write-Back (WB)

6
Instruction Format (32-bit Version)
  • All MIPS instructions are 32 bits long.

R-format (add, sub, ...)
I-format (lw, sw, ...)
J-format (j)
7
Instruction Fetch Cycle (IF)
  • Send the program counter (PC) to memory.
  • Fetch the current instruction from memory.
  • Update the PC to the next sequential PC by adding
    4 to the PC.

8
Instruction Decode/Register Fetch Cycle (ID)
  • Decode the instruction and read the registers
    from the register file.
  • Do the equality test on the registers for a
    possible branch.
  • Sign-extend the offset field of the instruction
    in case it is needed.
  • Compute the possible branch target address by
    adding the sign-extended offset to the
    incremented PC.

9
Execution/Effective Address Calculation (EX)
  • The ALU operates on the operands prepared in the
    prior cycle.
  • Memory reference instructions: the ALU adds the
    base register and the offset to form the
    effective address.
  • Register-register: the ALU performs the operation
    specified by the ALU opcode on the values from
    the register file.
  • Register-immediate: the ALU performs the
    operation specified by the opcode on the first
    value from the register file and the
    sign-extended immediate.

10
Memory Access (MEM)
  • If the instruction is a load, memory does a read
    using the effective address computed in the
    previous cycle.
  • If it is a store, then the memory writes the data
    from the second register read from the register
    file using the effective address.

11
Write-Back cycle (WB)
  • Register-register ALU instruction or load
    instruction: write the result into the register
    file.

12
  • In this implementation, branch instructions
    require 2 cycles, store instructions require 4
    cycles, and all other instructions require 5
    cycles.
  • Assuming a branch frequency of 12% and a store
    frequency of 10%, what is the overall CPI?
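A worked answer to the question above, as a small sketch (the 78% "other" share is implied by the stated branch and store frequencies):

```python
# Overall CPI for the multicycle, unpipelined implementation,
# using the cycle counts and frequencies stated on the slide.
freq = {"branch": 0.12, "store": 0.10, "other": 0.78}
cycles = {"branch": 2, "store": 4, "other": 5}

cpi = sum(freq[k] * cycles[k] for k in freq)
print(round(cpi, 2))  # 4.54
```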

13
Classic 5-Stage Pipeline for a RISC Processor
14
Performance Issues in Pipelining
  • Pipelining increases the CPU instruction
    throughput.
  • Throughput: the number of instructions completed
    per unit of time.
  • Pipelining does not decrease the execution time
    of an individual instruction.
  • In fact, it slightly increases each instruction's
    execution time due to overhead (clock skew and
    pipeline register delay) in the control of the
    pipeline.

15
Example (p. A-10)
  • Consider the unpipelined processor. Assume that
    it has a 1 ns clock cycle and that it uses 4
    cycles for ALU operations and branches and 5
    cycles for memory operations. Assume that the
    relative frequencies of these operations are 40%,
    20%, and 40%, respectively. Suppose that due to
    clock skew and setup, pipelining the processor
    adds 0.2 ns of overhead to the clock. Ignoring any
    latency impact, how much speedup in the
    instruction execution rate will we gain from a
    pipeline?
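A worked solution sketch for this example, using only the numbers stated above:

```python
# Speedup from pipelining (example on p. A-10).
clock_ns = 1.0
# Average unpipelined instruction time: 40% ALU (4 cycles),
# 20% branches (4 cycles), 40% memory (5 cycles).
avg_unpipelined_ns = clock_ns * (0.4 * 4 + 0.2 * 4 + 0.4 * 5)
# Pipelined: one instruction per cycle, but 0.2 ns overhead per clock.
pipelined_clock_ns = clock_ns + 0.2

speedup = avg_unpipelined_ns / pipelined_clock_ns
print(round(avg_unpipelined_ns, 1), round(speedup, 2))  # 4.4 3.67
```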

16
Classic 5-Stage Pipeline for a RISC Processor
17
Classic 5-Stage Pipeline
  • What happens in the pipeline?
  • One resource cannot be used for two different
    operations on the same clock cycle.
  • => Separate instruction and data memories.
  • The register file is used in two stages: ID (two
    reads) and WB (one write).
  • => Register write in the first half of the
    clock cycle and register read in the second half.

18
Pipeline Hazards
19
Pipeline Hazards
  • Situations that prevent the next instruction in
    the instruction stream from executing during its
    designated clock cycle.
  • Hazards reduce the performance from the ideal
    speedup gained by pipelining.
  • Structural Hazards
  • Data Hazards
  • Control Hazards
  • Hazards can make it necessary to stall the
    pipeline.

20
Pipeline Hazards
  • When an instruction is stalled, all instructions
    issued later than the stalled instruction are
    also stalled.
  • No new instructions are fetched during the stall.

21
Structural Hazards
  • Hardware cannot support the combination of
    instructions that we want to execute in the same
    clock cycle.
  • Suppose we have a single memory instead of two
    memories.

22
Control Hazards
  • Control hazards arise from the need to make a
    decision based on the results of one instruction
    while others are executing.
  • e.g., a branch instruction
  • Pipeline stall (or bubble)
  • How can we overcome this problem?

23
Branch Hazards
  • To minimize the branch penalty, put in enough
    hardware so that we can test registers, calculate
    the branch target address, and update the PC
    during the second stage.

24
Example
  • Estimate the impact on the CPI of stalling on
    branches. Assume all other instructions have a
    CPI of 1.
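One way to sketch this estimate, assuming the 12% branch frequency stated earlier and a one-cycle penalty per branch (as with the slide-23 hardware that resolves branches in the second stage):

```python
# Estimated CPI when the pipeline stalls on every branch.
# Assumptions (not stated on this slide): 12% branch frequency,
# 1 stall cycle per branch, base CPI of 1 for everything else.
base_cpi = 1.0
branch_freq = 0.12
branch_penalty = 1   # stall cycles per branch

cpi = base_cpi + branch_freq * branch_penalty
print(cpi)  # 1.12
```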

25
Branch Prediction
  • Computers do indeed use prediction to handle
    branches.
  • Simplest: always predict that branches will fail
    (not be taken).
  • If you're right, the pipeline proceeds at full
    speed.
  • Dynamic hardware predictors make their guesses
    depending on the behavior of each branch.
  • Popular: keep a history for each branch as
    taken or untaken, and then use the past to
    predict the future. => about 90% accuracy

26
Branch Prediction
When the guess is wrong, the pipeline must make
sure that the instructions following the wrongly
guessed branch have no effect and must restart
the pipeline from the proper branch address.
27
Delayed Branch
  • Delayed decision
  • Used in MIPS
  • The delayed branch always executes the next
    sequential instruction, with the branch taking
    place after that one-instruction delay.

28
(No Transcript)
29
  • MIPS software will place an instruction
    immediately after the delayed branch instruction
    that is not affected by the branch; a taken
    branch changes the address of the instruction
    that follows this safe instruction.
  • Compilers typically fill about 50% of the branch
    delay slots with useful instructions.

30
Data Hazards
  • An instruction depends on the results of a
    previous instruction still in the pipeline.
  • e.g.,
  •   add $s0, $t0, $t1
  •   sub $t2, $s0, $t3
  • The add instruction doesn't write the result
    until the 5th stage. => 3 bubbles
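The 3-bubble count can be checked with simple cycle arithmetic (assuming the add issues in cycle 1 and its result is usable only in the cycle after its WB stage completes):

```python
# Cycle arithmetic for the add/sub dependence without forwarding.
add_wb = 5                 # add writes $s0 in cycle 5 (its WB stage)
sub_id = 3                 # sub would read $s0 in cycle 3 (its ID stage)
value_ready = add_wb + 1   # earliest cycle in which sub's ID can read $s0

bubbles = value_ready - sub_id
print(bubbles)  # 3
```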

31
Solution
  • Forwarding (or bypassing): get the missing
    item early from the internal resources.
  • e.g., as soon as the ALU creates the sum for the
    add, we can supply it as an input for the
    subtract.

32
(No Transcript)
33
Load-Use Data Hazard
34
  • Even with forwarding, we still have to stall one
    cycle for a load-use data hazard.
  • Delayed loads: follow a load with an
    instruction independent of that load.

35
(No Transcript)
36
Implementation of the MIPS Datapath
37
Events on Every Pipe Stage of the MIPS Pipeline
  • See Figure A.19 on page A-32.

38
Revised Datapath
39
Revised Pipeline Structure
  • See Figure A.25 on page A-39.

40
Extending the MIPS to Handle Multicycle Operations
41
Floating-Point Operations
  • The floating-point pipeline will allow for a
    longer latency for operations.
  • The EX cycle may be repeated as many times as
    needed to complete the operation.
  • The number of repetitions can vary for different
    operations.
  • There may be multiple floating-point functional
    units.

42
Assumptions
  • Main integer unit handles loads and stores,
    integer ALU operations, and branches.
  • FP and integer multiplier.
  • FP adder handles FP add, subtract, and
    conversion.
  • FP and integer divider.
  • The EX stages of these functional units are not
    pipelined.

43
MIPS with 3 FP Functional Units
44
  • Because EX is not pipelined, no other instruction
    using that functional unit may issue until the
    previous instruction leaves EX.
  • Instruction issue (p. A-33) the process of
    letting an instruction move from the ID stage
    into the EX stage of the pipeline.
  • If an instruction cannot proceed to the EX stage,
    the entire pipeline behind that instruction will
    be stalled.

45
  • Latency: the number of intervening cycles between
    an instruction that produces a result and an
    instruction that uses the result.
  • Initiation interval: the number of cycles that
    must elapse between issuing two operations of a
    given type.

46
Example (Figure A.30)
Functional Unit                    Latency   Initiation Interval
Integer ALU                           0               1
Data memory (integer/FP loads)        1               1
FP add                                3               1
FP multiply (integer multiply)        6               1
FP divide (integer divide)           24              25
47
  • Since most operations consume their operands at
    the beginning of the EX stage, the latency is
    usually the number of stages after EX in which an
    instruction produces its result:
  • 0 for integer ALU operations.
  • 1 for loads.
  • Pipeline latency is essentially 1 cycle less than
    the depth of the execution pipeline, i.e., the
    number of stages from the EX stage to the stage
    that produces the result.
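This rule can be checked against the Figure A.30 numbers (the per-unit execution depths below are inferred from those latencies, not stated on the slide; FP divide is excluded because it is not pipelined):

```python
# Latency = depth of the execution pipeline - 1.
# Execution-stage depths are an inferred assumption here.
exec_depth = {"integer ALU": 1, "load": 2, "FP add": 4, "FP multiply": 7}

latency = {unit: depth - 1 for unit, depth in exec_depth.items()}
print(latency)
# {'integer ALU': 0, 'load': 1, 'FP add': 3, 'FP multiply': 6}
```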

48
  • To achieve a higher clock rate, fewer logic
    levels are put in each pipe stage.
  • => The number of pipe stages required for more
    complex operations is larger.
  • The penalty for the faster clock rate is longer
    latency for operations.

49
Supporting Multiple FP Operations
unpipelined