1
Appendix A Pipelining: Basic and Intermediate Concepts
2
Outline
  • What is pipelining?
  • The basic pipeline for a RISC instruction set
  • The major hurdle of pipelining: pipeline hazards
  • Data hazards
  • Control hazards

3
What Is Pipelining?
4
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

5
Sequential Laundry
(Figure: sequential laundry timeline, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another.)
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

6
Pipelined Laundry: Start Work ASAP
(Figure: pipelined laundry timeline starting at 6 PM; washer, dryer, and folder overlap across the four loads.)
  • Pipelined laundry takes 3.5 hours for 4 loads
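A small timing sketch in Python (not part of the original slides) that reproduces these numbers; the 3.5-hour figure comes from starting a new load every 40 minutes, the length of the slowest stage.

# Laundry pipelining sketch: 4 loads through washer (30), dryer (40), folder (20).
stages = [30, 40, 20]          # minutes per stage
loads = 4

sequential = loads * sum(stages)                       # 4 * 90 = 360 min = 6 hours
pipelined  = sum(stages) + (loads - 1) * max(stages)   # 90 + 3 * 40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)                 # 6.0 3.5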

7
Pipelining Lessons
  • Pipelining does not help latency of single task,
    it helps throughput of entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Multiple tasks operating simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

8
What is Pipelining?
  • Pipelining is an implementation technique whereby
    multiple instructions are overlapped in
    execution.
  • Not visible to the programmer
  • Each step in the pipeline completes a part of an
    instruction.
  • Each step is completing different parts of
    different instructions in parallel
  • Each of these steps is called a pipe stage or a
    pipe segment.

9
What is Pipelining? (Cont.)
  • The time required between moving an instruction
    one step down the pipeline is a machine cycle.
  • All the stages must be ready to proceed at the
    same time
  • Slowest pipe stage dominates
  • Machine cycle is usually one clock cycle
    (sometimes two, rarely more)
  • The pipeline designer's goal is to balance the
    length of each pipeline stage.

10
What is Pipelining? (Cont.)
  • If the stages are perfectly balanced, then the
    time per instruction on the pipelined machine,
    assuming ideal conditions, is equal to the time
    per instruction on the unpipelined machine divided
    by the number of pipe stages
  • Simple model: a common clock drives the latches
    between stages

11
Major Pipeline Benefit Performance
  • Ideal Performance
  • time per instruction = unpipelined instruction
    time / number of stages (a numeric sketch follows
    this slide)
  • An asymptote, of course, but close to it (around
    10x) is commonly achieved
  • The difference is due to the difficulty of
    achieving a balanced stage design
  • 2 ways to view the performance mechanism
  • Reduced CPI
  • Assume a processor takes multiple clock cycles
    per instruction
  • Reduced cycle-time
  • Assume a processor takes 1 long clock cycle per
    instruction
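A numeric sketch of the ideal relation above; the 50 ns instruction time and the 5 stages are assumed example values, not slide data.

# Ideal pipelining: k stages, perfectly balanced, no overhead.
unpipelined_time_ns = 50.0      # assumed unpipelined time per instruction
k = 5                           # number of pipe stages

ideal_pipelined_time_ns = unpipelined_time_ns / k             # 10 ns per instruction
# Two equivalent views of the same speedup:
cpi_view   = k / 1                                            # CPI falls from k to 1 at a fixed clock
clock_view = unpipelined_time_ns / ideal_pipelined_time_ns    # cycle time falls k-fold at CPI = 1
print(ideal_pipelined_time_ns, cpi_view, clock_view)          # 10.0 5.0 5.0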

12
Other Pipeline Benefits
  • Completely HW mechanism
  • No programming model shift required to exploit
    this form of concurrency
  • BUT - the compiler will need to change and
    compile time will go up
  • All modern machines are pipelined
  • Key technique in advancing performance in the
    80s
  • In the 90s we just moved to multiple pipelines
  • Beware - no benefit is totally free/good

13
Start with Unpipelined RISC
Use DLX for example, which is similar to MIPS
  • Every instruction can be executed in 5 steps
  • Every instruction takes at most 5 clock cycles
  • Each step's outputs are just passed to the next
    step (no latches)

14
Steps 1 & 2
  • IF - instruction fetch step
  • IR ← Mem[PC]  (fetch the next instruction from
    memory)
  • NPC ← PC + 4  (compute the new PC)
  • ID - instruction decode and register fetch step
  • A ← Regs[IR6..10]
  • B ← Regs[IR11..15]
  • Imm ← ((IR16)^16 ## IR16..31)  (sign-extended
    immediate)
  • Done in parallel with instruction (opcode)
    decoding
  • Possible since register specifiers are encoded in
    fixed fields
  • May fetch register contents that we don't use
  • Also calculates the sign-extended immediate

15
Step 3
  • EX - execution / effective address step (4 options
    depending on opcode)
  • Memory reference (LW R1, 1000(R2); SW R3,
    500(R4))
  • ALUOutput ← A + Imm  (effective address)
  • Register-register ALU instruction (ADD R1, R2,
    R3)
  • ALUOutput ← A func B
  • Register-immediate ALU instruction (ADD R1, R2,
    immediate)
  • ALUOutput ← A op Imm
  • Branch (BEQZ R1, 2000)
  • ALUOutput ← NPC + Imm
  • Cond ← (A op 0)
  • In a load-store machine no instruction needs to
    simultaneously calculate a data address and
    perform an ALU operation on the data
  • Hence EX and effective-address calculation can be
    combined into a single cycle.

16
Steps 4 & 5
  • MEM - memory access / branch completion step
  • PC ← NPC
  • Memory reference
  • LMD ← Mem[ALUOutput]  (Load), or
  • Mem[ALUOutput] ← B  (Store)
  • Branch
  • if (cond) then PC ← ALUOutput
  • WB - write-back step
  • Register-register ALU
  • Regs[IR16..20] ← ALUOutput
  • Register-immediate ALU
  • Regs[IR11..15] ← ALUOutput
  • Load
  • Regs[IR11..15] ← LMD

17
Datapath
(Datapath figure: IF performs IR ← Mem[PC] and NPC ← PC + 4; ID reads A and B; EX computes ALUOutput ← A + Imm, A func B, A op Imm, or NPC + Imm and Cond ← (A op 0); MEM handles loads and stores (LMD holds a load result) and sets PC ← NPC, or PC ← ALUOutput on a taken branch; WB writes back ALUOutput for ALU ops or LMD for loads.)
18
Discussion
  • Assume separate instruction and data memories
  • Implement with separate instruction and data
    caches (Chapter 5)
  • Data memory references occur only at stage 4 (MEM)
  • Loads and stores
  • Register updates occur only at stage 5 (WB)
  • All ALU operations and loads
  • All register reads are early (in ID) and all
    writes are late (in WB)

19
Discussion (Cont.)
  • Branches and stores require 4 cycles; all others
    require 5
  • Branches are 12%, stores 5% → overall CPI = 4.83
    (= 5 × 0.83 + 4 × 0.17; a quick check follows
    this slide)
  • Model is correct but not optimized
  • ALUs: one would have sufficed, since in any given
    cycle only one is active
  • Instruction and data memories do not have to be
    separate
  • Branches can be completed at the end of ID stage
    (see later)
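A quick check of the CPI arithmetic above, using the percentages from this slide:

# Unpipelined DLX CPI: branches (12%) and stores (5%) take 4 cycles, all others take 5.
frac_4_cycles = 0.12 + 0.05
frac_5_cycles = 1.0 - frac_4_cycles
cpi = 5 * frac_5_cycles + 4 * frac_4_cycles
print(round(cpi, 2))            # 4.83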

20
The Basic Pipeline for DLX/MIPS
21
Simple DLX/MIPS Pipeline
  • Stages now get executed 1 per cycle
  • Ideal result is the CPI reduced from 5 to 1
  • Is it really this simple? Of course not, but it's
    a start
  • What if different operations use the same resource
    on the same cycle?
  • Structural hazard!!
  • Separate instruction and data memories (IM, DM)
  • Register files read in ID and write in WB
    (distinct use)
  • Write PC in IF and write either the incremented
    PC or the value of the branch target of an
    earlier branch (branch handling problem)
  • Registers are needed between two adjacent stages
    for storing intermediate results
  • Otherwise, they would be overwritten by the next
    instruction

22
Best Case Pipeline Scenario
(Figure: best-case pipeline timing showing the fill phase, a stable phase with 5 times the throughput, and the drain phase.)
23
Perform the register write in the first half and the
register read in the second half of a clock cycle
(Figure: overlapped datapaths with the register file
written early and read late in each cycle.)
A pipeline can be thought of as a series of datapaths
(resources) shifted in time
24
IF/ID, ID/EX, EX/MEM, and MEM/WB are pipeline
registers (latches)
25
Events on Every Pipe Stage (Figure A.19)
Extra pipeline registers between stages are used
to store intermediate results
26
Events on Every Pipe Stage (Cont.) (Figure A.19)
27
Important Pipeline Characteristics
  • Latency
  • Time it takes for an instruction to go through
    the pipe
  • Latency = number of stages × stage delay
  • Dominant feature if there are lots of exceptions
  • Throughput
  • Determined by the rate at which instructions can
    start/finish
  • Dominant feature if no exceptions

28
Basic Performance Issues
  • Pipelining improves CPU instruction throughput
  • It does not reduce the execution time of an
    individual instruction
  • It slightly increases the execution time of an
    individual instruction
  • Overhead in the control of the pipeline
  • Pipeline register delay + clock skew (Appendix
    A-10)
  • These overheads limit the practical depth of a
    pipeline
  • A program runs faster and has lower total
    execution time, even though no single instruction
    runs faster

29
Benefit Example
From the viewpoint of a reduced clock cycle (i.e.,
CPI = 1)
  • Unpipelined DLX
  • 5 steps - take 50, 50, 60, 50, 50 ns respectively
  • Hence total instruction time = 260 ns (one clock
    cycle)
  • Looks like a 5-stage pipeline
  • But there are parasites everywhere
  • Assume 5 ns added to slowest stage for extra
    latches
  • Primarily due to set-up and hold times
  • Hence (assuming no stage/step improvement)
  • Must run at slowest stage + parasites: 60 + 5 =
    65 ns/stage
  • In steady state (no exceptions) an instruction
    completes every 65 ns
  • Speedup = 260/65 ≈ 4x improvement (worked out
    below)
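The same arithmetic as this slide, written out as a small sketch (stage times and latch overhead taken from the bullets above):

# Unpipelined: one long cycle covering all five steps; pipelined: one stage per cycle.
stage_ns = [50, 50, 60, 50, 50]
latch_overhead_ns = 5

unpipelined_cycle = sum(stage_ns)                       # 260 ns per instruction
pipelined_cycle   = max(stage_ns) + latch_overhead_ns   # 65 ns per instruction in steady state
print(unpipelined_cycle / pipelined_cycle)              # 4.0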

30
Benefit Example (Cont.)
From the viewpoint of reduced CPI
  • Unpipelined DLX
  • 10-ns clock cycles
  • Clock cycles: ALU ops take 4 (40% of instructions),
    branches take 4 (20%), and memory ops take 5 (40%)
  • Average instruction execution time = clock cycle ×
    average CPI = 10 ns × ((40% + 20%) × 4 + 40% × 5)
    = 10 ns × 4.4 = 44 ns
  • Pipelined DLX
  • 1 ns of overhead added to the clock → 11-ns clock
    cycles
  • 11 ns is also the average instruction execution
    time
  • Speedup from pipelining = 44 ns / 11 ns = 4 times
    (see the sketch below)
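And the CPI-viewpoint numbers from this slide as a sketch:

# Unpipelined DLX: 10 ns clock; ALU (40%) and branch (20%) take 4 cycles, memory (40%) takes 5.
unpipelined_avg_time = 10 * ((0.40 + 0.20) * 4 + 0.40 * 5)   # 44 ns
pipelined_avg_time   = 10 + 1                                # 11 ns (1 ns clock overhead)
print(unpipelined_avg_time / pipelined_avg_time)             # 4.0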

31
Pipeline Hazards
  • Pipeline hazards prevent the next instruction in
    the instruction stream from executing during its
    designated clock cycle
  • Hazards reduce the pipeline performance from the
    ideal speedup

32
Pipeline Hazards
  • Structural hazards
  • Caused by resource conflict
  • Possible to avoid by adding resources but may
    be too costly
  • Data hazards
  • Instruction depends on the results of a previous
    instruction in a way that is exposed by the
    overlapping of instructions in the pipeline
  • Can be mitigated somewhat by a smart compiler
  • Control hazards
  • When the PC is not simply incremented
  • Branches and jumps - not too bad

33
Hazards Cause Stalls: Two Policy Choices
  • How about just stalling all stages?
  • OK, but the problem is usually an adjacent-stage
    conflict
  • Hence nothing moves and stall condition never
    clears
  • Cheap option but it does not work
  • Stall later instructions; let earlier ones progress
  • Instructions issued later than the stalled
    instructions are also stalled
  • Instructions issued earlier than the stalled
    instructions must continue
  • But we will see in Chapters 3 and 4 that we can
    reorder the instructions, or let the instructions
    after the stalled instruction go on, to reduce the
    impact of hazards

34
Structural Hazards
  • If some combination of instructions cannot be
    accommodated because of resource conflicts, the
    machine is said to have a structural hazard.
  • Some functional unit is not fully pipelined
  • Some resource has not been duplicated enough to
    allow all combinations of instructions in the
    pipeline to execute
  • Single-ported register file: conflicts when
    multiple stages need it in the same cycle
  • Memory fetch: may be needed in both the IF and MEM
    stages
  • The pipeline stalls instructions until the required
    unit is available
  • A stall is commonly called a pipeline bubble or
    just bubble

35
Structural Hazard Example
36
Remove Structural Hazard
There is no real hazard if inst1 is not a load or
store (only loads, stores, and branches use stage 4)
37
Pipeline Stalled for a Structural Hazard (Another
View)
38
Calculating Stall Effects
Ignore pipeline overhead and assume a balanced
pipeline
From the viewpoint of decreasing CPI:
  Speedup = unpipelined CPI / (1 + pipeline stall
  cycles per instruction)
Therefore,
  Speedup = pipeline depth / (1 + pipeline stall
  cycles per instruction)
since on the unpipelined machine all instructions take
the same number of cycles, equal to the pipeline depth
39
Calculating Stall Effects (Cont.)
From the viewpoint of decreasing clock cycle time
Therefore,
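A minimal sketch of the speedup relation reconstructed above; the function name and the example arguments are illustrative, not from the slides.

def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup over the unpipelined machine, ignoring pipeline
    overhead and assuming a balanced pipeline."""
    return depth / (1 + stall_cycles_per_instruction)

print(pipeline_speedup(5, 0.0))    # 5.0  -- ideal, no stalls
print(pipeline_speedup(5, 0.5))    # ~3.3 -- half a stall cycle per instruction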
40
Example Dual-port vs. Single-port Memory
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its pipelined
    implementation has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of the instructions executed
  • Which machine is faster? (Compared in the sketch
    below.)
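A sketch of the comparison, assuming (as is usual for this exercise) that machine B loses one clock cycle on every load because of the memory-port conflict; the absolute clock-cycle value is arbitrary, since only the ratio matters.

clock_a = 1.0                   # arbitrary time unit
clock_b = clock_a / 1.05        # B's clock is 1.05x faster

cpi_a = 1.0                     # no structural hazard on A
cpi_b = 1.0 + 0.4 * 1           # assumed: 1 stall cycle per load (40% of instructions)

avg_time_a = cpi_a * clock_a    # 1.00
avg_time_b = cpi_b * clock_b    # ~1.33
print(avg_time_b / avg_time_a)  # ~1.33 -> machine A is about 1.3x faster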

41
Why Would a Designer Allow Structural Hazard?
  • A machine without structural hazards will always
    have a lower CPI (if all other factors are equal)
  • Why would a Designer Allow Structural Hazard?
  • Reduce cost
  • Pipelining or duplicating all the functional units
    may be too costly

42
Why Would a Designer Allow Structural Hazard?
(Cont.)
  • Consider a DLX implementation with an FP multiply
    unit that is not pipelined
  • It can accept a new multiply every five clock
    cycles (the initiation interval)
  • How does this structural hazard impact mdljdp2?
  • mdljdp2 has 14% FP multiplications
  • The DLX implementation can handle up to 20% FP
    multiplications
  • If the FP multiplications in mdljdp2 are not
    clustered but distributed uniformly → the
    performance impact is very limited
  • If the FP multiplications in mdljdp2 are all
    clustered without intervening instructions and 14%
    of the instructions take 5 cycles each → CPI
    increases from 1 to 1.7
  • In practice, the impact of this structural hazard
    is < 0.03 CPI (data hazards have a more severe
    impact!!!)

43
Data Hazards
44
Introduction
  • Data hazards occur when the pipeline changes the
    order of read/write accesses to operands so that
    the order differs from the order seen
    sequentially executing instructions on an
    unpipelined machine.
  • Example: a later instruction uses a result that
    has not yet been produced by an earlier instruction
  • Example
  • ADD R1, R2, R3
  • SUB R4, R1, R5
  • AND R6, R1, R7
  • OR R8, R1, R9
  • XOR R10, R1, R11

R1 ← R2 + R3: R1 is produced by the first
instruction and used in every subsequent instruction
45
The use of the result of ADD in the next three
instructions causes a hazard, since the register
is not written until after those instructions
read it
(Figure: pipeline diagram showing ADD writing R1 in WB
while SUB, AND, and OR read it earlier.)
46
Forwarding -- also called bypassing, shorting,
short-circuiting
  • Key is to keep the ALU result around
  • Example
  • ADD R1,R2,R3
  • SUB R4, R1,R5
  • How do we handle this in general?
  • Forwarded value can be at ALU output or Mem stage
    output
  • ADD produces R1 value at ALU output
  • SUB needs it again at the ALU input

47
Forwarding (Cont.)
  • Consider again the code from slide 44
  • Forward the result from where ADD produces
    (EX/MEM register) to where SUB needs it (ALU
    input latch)
  • Forwarding works as follows
  • ALU result from EX/MEM register is fed back to
    ALU input latch
  • If the forwarding hardware detects the previous
    ALU operation has written the register
    corresponding to a source for the current ALU
    operation, control logic selects the forwarding
    result as the ALU input rather than the value
    read from the register file
  • Generalization of forwarding
  • Pass a result directly to the functional unit that
    requires it: a result is forwarded from the
    pipeline register corresponding to the output of
    one unit to the input of another (a small
    detection sketch follows below)
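A minimal sketch of the forwarding-detection idea described above; the field names and dictionary encoding are illustrative, not the textbook's hardware description.

# EX-stage forwarding detection for a 5-stage pipeline (a sketch, assumed encoding).
def forward_source(src_reg, ex_mem, mem_wb):
    """Pick where an ALU input for the instruction now in EX should come from.
    ex_mem / mem_wb describe the instructions in those pipeline registers:
    {'writes_reg': bool, 'rd': int, 'value': int}."""
    # The most recent producer wins: check EX/MEM before MEM/WB.
    if ex_mem['writes_reg'] and ex_mem['rd'] == src_reg and src_reg != 0:
        return ex_mem['value']          # forward the ALU result from EX/MEM
    if mem_wb['writes_reg'] and mem_wb['rd'] == src_reg and src_reg != 0:
        return mem_wb['value']          # forward the ALU result or load data from MEM/WB
    return None                         # no forwarding needed: read the register file

# Example: ADD R1,R2,R3 sits in EX/MEM while SUB R4,R1,R5 is in EX.
ex_mem = {'writes_reg': True, 'rd': 1, 'value': 42}
mem_wb = {'writes_reg': False, 'rd': 0, 'value': 0}
print(forward_source(1, ex_mem, mem_wb))   # 42, bypassing the register file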

48
Result With Forwarding
(Figure: pipeline diagram of ADD R1, R2, R3 followed by
SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; and
XOR R10, R1, R11, with forwarding paths supplying R1 to
each dependent instruction.)
49
Multiplexing Issues in Forwarding
50
Another Forwarding Example
  • Example
  • ADD R1, R2, R3
  • LW R4, 0(R1)
  • SW 12(R1), R4
  • Forwarding Result Next Page

51
(Figure: pipeline diagram of the three instructions,
with forwarding.)
ADD R1, R2, R3:  ID: A ← R2, B ← R3;  EX: ALUOutput ←
A + B (produces R1);  MEM: do nothing;  WB: R1 ←
ALUOutput
LW R4, 0(R1):  ID: A ← R1, B ← R4, Imm ← 0;  EX:
ALUOutput ← A + Imm (uses R1);  MEM: LMD ←
Mem[ALUOutput] (produces R4);  WB: R4 ← LMD
SW 12(R1), R4:  ID: A ← R1, B ← R4, Imm ← 12;  EX:
ALUOutput ← A + Imm (uses R1);  MEM: Mem[ALUOutput] ←
B (uses R4)
52
When Forwarding Fails
Load:  MEM: LMD ← Mem[ALUOutput];  WB: R1 ← LMD
SUB:   ID: A ← R1, B ← R5;  EX: ALUOutput ← A - B
AND:   ID: A ← R1, B ← R7;  EX: ALUOutput ← A AND B
OR:    ID: A ← R1, B ← R5;  EX: ALUOutput ← A OR B
The loaded R1 is not available until the end of MEM,
which is too late for the SUB's EX stage.
53
Stalls
  • Some latencies can't be absorbed -- the case in
    the previous slide
  • Stalls are the result
  • Need pipeline interlock circuits
  • Detects a hazard and introduces bubbles until the
    hazard clears
  • CPI for stalled instructions will bloat by the
    number of bubbles
  • Bubbles cause the forwarding paths to change
  • In MIPS/DLX, if the instruction after a load uses
    the load result, a one-clock-cycle stall will
    occur! (See the interlock sketch below.)
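A minimal sketch of the load-use interlock check described above; the instruction encoding is assumed for illustration only.

# Stall one cycle when the instruction in ID reads a register that the load in EX will produce.
def needs_load_use_stall(id_ex, if_id):
    """id_ex: instruction currently in EX, if_id: instruction currently in ID (decoded)."""
    return (id_ex['is_load'] and
            id_ex['rd'] != 0 and
            id_ex['rd'] in (if_id['rs1'], if_id['rs2']))

# LW R1, 0(R2) in EX and SUB R4, R1, R5 in ID -> one bubble must be inserted.
lw  = {'is_load': True, 'rd': 1}
sub = {'rs1': 1, 'rs2': 5}
print(needs_load_use_stall(lw, sub))   # True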

54
Bubbles and new Forwarding Paths
55
Handling Stalls
Hardware vs. Software
  • Hardware Pipeline Interlocks
  • Must detect when required data cannot be provided
  • Stall stages to create bubble
  • Software pipeline or instruction scheduling
  • Performed by a smart compiler

Pipeline scheduling example: a = b + c; d = e - f
(A through F are memory variables)

Unscheduled code:         Scheduled code:
  LW  RB, B                 LW  RB, B
  LW  RC, C                 LW  RC, C
  ADD RA, RB, RC            LW  RE, E
  SW  A, RA                 ADD RA, RB, RC
  LW  RE, E                 LW  RF, F
  LW  RF, F                 SW  A, RA
  SUB RD, RE, RF            SUB RD, RE, RF
  SW  D, RD                 SW  D, RD
56
Memory Reference May Also Cause Data Hazard
  • The previous hazards involve registers, but it is
    also possible for a pair of instructions to create
    a dependence by reading/writing the same memory
    location
  • But the latter is impossible for MIPS
  • Why???

Because data memory references all occur in the same
stage (MEM) and the pipeline keeps instructions in
order, memory accesses happen in program order
57
Data Hazard Forms
Assume instruction i occurs before instruction j in
program execution order
  • RAW - read after write
  • j reads before i writes - hence j gets wrong old
    value
  • Most common form of data hazard problem
  • As we have seen forwarding can overcome this one
  • WAW - write after write
  • instructions i then j
  • j writes before i writes - leaving incorrect
    value
  • can this happen in MIPS? Why?
  • WAW can happen only in pipelines that write in
    more than one pipe stage (or allow an instruction
    to proceed even when a previous instruction is
    stalled)

58
Data Hazard Forms (Cont.)
  • WAR - write after read
  • i then j is intended order
  • j writes before i reads - i ends up with
    incorrect new value
  • Is this a Problem in the MIPS? Why?
  • May happen only when some instructions write
    results early in pipe stages, and others read a
    source late in stages
  • RAR read after read
  • Not a hazard

59
MIPS Ordering
  • Some things are not a problem
  • MIPS has only a single memory write and a single
    register write stage
  • Hence this ordering requirement is preserved
  • However things can get a lot worse
  • And will when we look at varying operational
    latencies
  • For example floating point instructions in the
    MIPS
  • WAR MIPS Ordering
  • Writing happens late in the pipe
  • Reading happens early
  • Hence no WAR problems
  • However, other machines might exhibit this problem

60
Control Hazards
61
Introduction
  • Control hazards: how does a branch influence the
    pipeline?
  • The problem is more complex - we need 2 things
  • The branch target (taken means the new PC is not
    PC + 4; not taken means the condition fails)
    (known in MEM)
  • A valid condition code - in the DLX case the
    result of the zero-detect unit (known in EX)
  • Both happen late in the pipe
  • How to deal with branch?
  • Stall the pipeline as soon as we detect the
    branch (ID), and stall the pipeline until we
    reach the MEM stage
  • Three-cycle stall
  • The first IF is essentially a stall (when the
    branch is taken)
  • Consider a 30% branch frequency and an ideal CPI
    of 1 (the impact is worked out below)
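Working out the impact of that 30% branch frequency with the three-cycle stall (a sketch using the slide's numbers):

branch_freq  = 0.30
branch_stall = 3                # cycles lost per branch with the naive stall scheme
cpi = 1 + branch_freq * branch_stall
print(cpi)                      # 1.9 -- nearly half of the ideal throughput is lost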

62
DLX Pipeline Re-visited
63
A branch causes a 3-cycle stall in the DLX
pipeline
64
Branch Delay Reduction
Branch delay is the length of the control hazard
  • Hardware mechanisms
  • Find out if the branch is taken or not taken
    earlier in the pipeline
  • Compute the taken PC earlier
  • With another adder
  • BTA (branch taken address) can be computed during
    the ID stage
  • Then BTA and the normal PC mechanism contain both
    possibilities
  • Choice depends on the instruction and CC - also
    known after ID stage
  • Software mechanisms
  • Design your ISA properly
  • e.g. BNEZ, BEQZ on DLX permit CCs to be known
    during ID
  • Do instruction scheduling into the branch delay
    slots
  • Know likelihood of taken vs. not taken branch
    statistics
  • Improves chances to guess correctly
  • Varies with application and instruction placement

65
New Improved DLX Pipeline
(Figure: revised DLX datapath with the branch-target
adder (NPC + Imm) and the zero test moved into the ID
stage.)
66
Hardware Mechanism
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 3
  • Note an ALU instruction followed by a branch on
    the result of the instruction will incur a data
    hazard stall
  • Example
  • SUB  R1, R2, R3    IF - ID - EX - MEM - WB
  • BEQZ R1, 100            IF - ID* - EX - MEM - WB
    (* stalls in ID until R1 is available)

67
Branch Behavior in Programs
  • Integer benchmarks
  • Conditional branch frequencies of 14% to 16%
  • Much lower unconditional branch frequencies
  • FP benchmarks
  • Conditional branch frequencies of 3% to 12%
  • Forward branches dominate backward branches
    (about 3.7 to 1)
  • 67% of conditional branches are taken on average
  • 60% of forward branches are taken on average
  • 85% of backward branches are taken on average
    (usually loops)

68
Control Hazard Avoidance
  • Simplest Scheme
  • Freeze pipe until you know the CC and branch
    target
  • Cheap but slow
  • Too slow, since we'd negate half of the pipeline
    speedup with 2 or 3 bubbles (old vs. new DLX
    pipeline designs)
  • Predict not taken (47% of DLX branches are not
    taken on average)
  • Make sure any state change (the destructive phase)
    is deferred until you know whether you guessed
    right
  • If not, then back out or flush
  • Predict taken (53% of DLX branches are taken on
    average)
  • No use in DLX/MIPS (target address and branch
    outcome are known at the same stage)
  • Or let the compiler decide - same options

69
Predict-Not-Taken
(Figure: predict-not-taken pipeline diagrams; when the
branch turns out to be taken, the slot is a stall
indeed.)
70
Delayed Branch
  • Delayed branch → make the stall cycle useful
  • Add delay slots; the number of slots equals the
    length of the branch delay
  • 1 for DLX/MIPS
  • Instructions in the delay slot are executed
    whether or not the branch is taken
  • See if the compiler can schedule something useful
    in these slots
  • Hope that filled slots actually help advance the
    computation
  • A branch delay of length n:
      branch instruction
      sequential successor 1
      sequential successor 2
      ...
      sequential successor n
      branch target if taken
  • The n sequential successors always execute,
    whether or not the branch is taken
71
Delayed-Branch Behavior
72
Delayed Branch (Cont.)
  • Where to get instructions to fill branch delay
    slot?
  • Before branch instruction
  • From the target address: only valuable when the
    branch is taken
  • From the fall-through path: only valuable when the
    branch is not taken
  • When the slots cannot be scheduled, they are
    filled with no-op instructions (effectively
    stalls!!)
  • Canceling branches allow more slots to be filled

73
Scheduling the branch-delay slot
(Figure: filling the delay slot from before the branch,
from the branch target, or from the fall-through path;
the example moves an instruction such as SUB R4, R5, R6
into the slot.)
74
Delay-Branch Scheduling Schemes and Their
Requirements
75
Delayed Branch (Cont.)
  • Limitation on delayed-branch scheduling
  • Restrictions on the instructions that are
    scheduled into delay slots
  • Ability to predict at compile time if a branch is
    likely to be taken or not
  • Delayed branches are an architecturally visible
    feature
  • Advantage: uses compiler scheduling to reduce
    branch penalties
  • Disadvantage: exposes an aspect of the
    implementation that is likely to change
  • Delayed branches are less useful for longer branch
    delays
  • The longer delay cannot easily be hidden → move to
    hardware schemes

76
Canceling (Nullifying) Branch
  • Eliminate the requirements on the instruction
    placed in the delay slot, enabling the compiler to
    fill it from the target or fall-through path
    without meeting those requirements
  • Idea
  • Associate with each branch instruction a predicted
    direction
  • If the branch goes as predicted, then nothing
    changes
  • If it goes in the unpredicted direction, then
    nullify all or some of the delay-slot instructions
  • The result is more freedom for the compiler in
    delay-slot scheduling
  • A common approach in HP's PA processors

77
Delayed and canceling delayed branches allow control
hazards to be hidden 70% of the time
78
Performance of Delayed and Canceling Branches
(Another View)
On average, 30% of branch delay slots are wasted
79
Evaluating Branch Alternatives
See Appendix A-24 to A-26 for an example of the
branch evaluation
  • With ideal CPI = 1 and stalls = branch frequency ×
    branch penalty:
    pipeline speedup = pipeline depth / (1 + branch
    frequency × branch penalty)
    (a sketch follows below)
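A sketch of the evaluation using the formula above; the per-scheme branch penalties below are placeholders chosen to show the shape of the comparison, not the numbers from Appendix A-24 to A-26.

def speedup_with_branches(depth, branch_freq, branch_penalty):
    # Ideal CPI = 1; only branch stalls are modeled.
    return depth / (1 + branch_freq * branch_penalty)

branch_freq = 0.14   # e.g. an integer benchmark (slide 67)
for scheme, penalty in [('stall pipeline', 3),
                        ('predict not taken', 1),    # assumed average penalty
                        ('delayed branch', 0.5)]:    # assumed average penalty
    print(scheme, round(speedup_with_branches(5, branch_freq, penalty), 2))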

80
Static Branch Prediction Using Compiler
Technology
  • How to statically predict branches?
  • Examination of program behavior
  • Always predict taken (on average, 67% are taken)
  • Mis-prediction rate varies widely (9% to 59%)
  • Predict backward branches taken and forward
    branches not taken (mis-prediction rate > 60% to
    70%)
  • Profile-based predictor: uses profile information
    collected from earlier runs
  • Simplest is the basic one-bit-per-branch idea
  • Easily extends to use more bits
  • Definite win for some regular applications

81
Mis-prediction Rate for a Profile-based Predictor
FP is better than integer
82
Prediction-taken VS. Profile-based Predictor
On average, about 20 instructions are executed between
mispredictions for predict-taken versus about 110 for
the profile-based predictor
Standard deviations are large
83
Performance of DLX Integer Pipeline
(Figure: percentage of all cycles lost to control and
data hazard stalls, per benchmark.)
Stalls are 9% to 23% of instructions → CPI = 1.09 to
1.23; average CPI = 1.11; improvement = 5/1.11 ≈ 4.5x