CS 378 Programming for Performance, Single-Thread Performance: Review of Pipelining

1
CS 378: Programming for Performance
Single-Thread Performance: Review of Pipelining
  • Siddhartha Chatterjee
  • Spring 2008

2
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

3
Sequential Laundry
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

4
Pipelined Laundry: Start Work ASAP
  • Pipelined laundry takes 3.5 hours for 4 loads
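
The two laundry totals can be checked with a short calculation. This is a minimal sketch: the function names are illustrative, and it assumes a load may wait between machines and that each machine handles one load at a time.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load starts a stage as soon as both it and the machine are free."""
    stage_free = [0] * len(stage_minutes)  # time at which each machine is free
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])
            ready = start + minutes
            stage_free[s] = ready
    return stage_free[-1]


LAUNDRY = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(LAUNDRY, 4) / 60)  # 6.0
print(pipelined_minutes(LAUNDRY, 4) / 60)   # 3.5
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: once the first load reaches the dryer, the 40-minute dryer sets the pace.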

5
Pipelining Lessons
  • Pipelining doesn't help latency of a single task;
    it helps throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

6
The Five Stages of Load
  • Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
  • Reg/Dec: Registers Fetch and Instruction Decode
  • Exec: Calculate the memory address
  • Mem: Read the data from the Data Memory
  • WrB: Write the data back to the register file

7
Key Ideas Behind Instruction Pipelining
  • The load instruction has 5 stages
  • Five independent functional units to work on each
    stage
  • Each functional unit is used only once!
  • A second load can start doing Ifetch as soon as
    the first load finishes its Ifetch stage
  • Each load still takes five cycles to complete
  • The latency of a single load is still 5 cycles
  • The throughput is much higher
  • CPI approaches 1
  • Cycle time is 1/5th the cycle time of the
    single-cycle implementation
  • Instructions start executing before previous
    instructions complete execution

CPI ? Cycle time ?
8
Pipelining the Load Instruction
  • The five independent pipeline stages are:
  • Read next instruction: the Ifetch stage
  • Decode instruction and fetch register values:
    the Reg/Dec stage
  • Execute the operation: the Exec stage
  • Access data memory: the Mem stage
  • Write data to destination register: the WrB
    stage
  • One instruction enters the pipeline every cycle
  • One instruction comes out of the pipeline
    (completed) every cycle
  • The effective CPI is 7/3 (tends to 1), at 1/5 the
    cycle time
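
The 7/3 figure is consistent with three instructions flowing through five stages. The sketch below assumes exactly that reading (three loads, no stalls); `effective_cpi` is an illustrative name, not something from the slides.

```python
from fractions import Fraction


def effective_cpi(instructions, stages=5):
    """Cycles per instruction for a stall-free pipeline.

    The first instruction needs `stages` cycles; each later instruction
    completes one cycle after its predecessor.
    """
    total_cycles = stages + (instructions - 1)
    return Fraction(total_cycles, instructions)


print(effective_cpi(3))            # 7/3
print(float(effective_cpi(1000)))  # 1.004
```

As the instruction count grows, the fill time of four cycles is amortized and the CPI approaches 1.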

9
A More Extensive Pipelining Example
  • End of Cycle 4: Load's Mem, R-type's Exec,
    Store's Reg, Beq's Ifetch
  • End of Cycle 5: Load's WrB, R-type's Mem, Store's
    Exec, Beq's Reg
  • End of Cycle 6: R-type's WrB, Store's Mem, Beq's
    Exec
  • End of Cycle 7: Store's WrB, Beq's Mem
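
The cycle-by-cycle snapshots above follow from a simple rule: an instruction fetched in cycle i is in stage number (c - i) during cycle c. A minimal sketch, assuming the four instructions are fetched in cycles 1 to 4 and that none of them stalls:

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]
PROGRAM = ["Load", "R-type", "Store", "Beq"]  # fetched in cycles 1, 2, 3, 4


def stage_in_cycle(issue_cycle, cycle):
    """Stage occupied in `cycle` by an instruction fetched in `issue_cycle`."""
    index = cycle - issue_cycle
    return STAGES[index] if 0 <= index < len(STAGES) else None


def snapshot(cycle):
    """Map each instruction still in the pipeline to its stage in `cycle`."""
    result = {}
    for position, name in enumerate(PROGRAM):
        stage = stage_in_cycle(position + 1, cycle)
        if stage is not None:
            result[name] = stage
    return result


for cycle in range(4, 8):
    print(cycle, snapshot(cycle))
```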

10
Single Cycle vs. Multiple Cycle vs. Pipelined
11
Basics of Pipelining
  • Time
  • Discrete time steps
  • Represented as 1, 2, 3, ...
  • Space
  • Pipe stages or segments (things that do
    processing)
  • Represented as P, Q, R, S (or F, D, X, M, W for
    the DLX pipeline)
  • Operands
  • Instructions or data items
  • Things that flow through, and are processed by,
    the pipeline
  • Represented as a, b, c, ...
  • In drawing pipelines, we conceal the obvious fact
    that each operand undergoes some changes in each
    pipe stage

12
Notations for Describing Pipelines
  • Space-time diagram, or Gantt chart
  • Reservation table by stages
  • Rows represent pipeline stages
  • Unbounded one way
  • Notation of HP
  • Reservation table by instructions
  • Rows represent operands
  • Unbounded both ways

13
Basic Terms
  • Filling a pipeline
  • Flushing or draining a pipeline
  • Stage or segment delay
  • Each stage may have a different stage delay
  • Beat time (= max stage delay)
  • Number of stages
  • End-to-end latency
  • = number of stages × beat time
  • Stages are separated by latches (registers)
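
These terms translate directly into a few lines of arithmetic. The sketch below uses hypothetical stage delays (in ns) and assumes that an unpipelined implementation takes the sum of the stage delays per operand, with latch overhead ignored; the function names are illustrative.

```python
def beat_time(stage_delays):
    """The clock period is set by the slowest stage."""
    return max(stage_delays)


def end_to_end_latency(stage_delays):
    """Every operand spends one beat in each stage."""
    return len(stage_delays) * beat_time(stage_delays)


def pipelined_time(stage_delays, n):
    """Fill the pipeline once, then complete one operand per beat."""
    return (len(stage_delays) + n - 1) * beat_time(stage_delays)


def speedup(stage_delays, n):
    """Unpipelined time divided by pipelined time for n operands."""
    return n * sum(stage_delays) / pipelined_time(stage_delays, n)


DELAYS = [10, 8, 10, 10, 7]  # hypothetical stage delays in ns
print(beat_time(DELAYS))            # 10
print(end_to_end_latency(DELAYS))   # 50
print(pipelined_time(DELAYS, 100))  # 1040
print(round(speedup(DELAYS, 100), 2))
```

With these numbers the speedup stays below the stage count of 5, both because of the fill time and because the stages are unbalanced.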

14
Speedup and Throughput of a Linear Pipeline
15
Data Hazard Setup
D(u) = domain of instruction u: the set of
all memory locations, registers
(including implicit ones), flags, condition
codes, etc. that may be read by
instruction u
R(u) = range of instruction u: the set of
all memory locations, registers
(including implicit ones), flags, condition
codes, etc. that may be written by
instruction u
  • u < v is a relation that means that instruction
    u precedes instruction v in the original program
    order (i.e., on an unpipelined machine)
  • The relation < is irreflexive, anti-symmetric,
    and transitive

16
Data Hazard Definition
Given two instructions u and v, such that u < v,
there is a data hazard between them if any of the
following conditions holds
The existence of one of these conditions means
that a change in the order of reading/writing
operands by the instructions from the order seen
by sequentially executing instructions on
an unpipelined machine could violate the intended
semantics
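
In terms of the domain D and range R of each instruction, the three hazard kinds are conventionally written as set intersections: RAW when R(u) meets D(v), WAR when D(u) meets R(v), and WAW when R(u) meets R(v). The sketch below assumes that conventional formulation; the tuple encoding of an instruction is purely illustrative.

```python
def hazards(u, v):
    """Hazard kinds between u and v, assuming u precedes v in program order.

    Each instruction is a (domain, range) pair: the set of locations it may
    read and the set of locations it may write.
    """
    d_u, r_u = u
    d_v, r_v = v
    found = set()
    if r_u & d_v:
        found.add("RAW")  # v reads something u writes
    if d_u & r_v:
        found.add("WAR")  # v writes something u reads
    if r_u & r_v:
        found.add("WAW")  # both write the same location
    return found


add = ({"R2", "R3"}, {"R1"})  # ADD R1, R2, R3
sub = ({"R5", "R1"}, {"R4"})  # SUB R4, R5, R1
print(hazards(add, sub))  # {'RAW'}
```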
17
Why Data Hazards Occur
  • Pipelining changes relative timing of
    instructions
  • Reads and writes occur at fixed positions of the
    pipeline
  • So, if two instructions are too close (function
    of pipeline structure), order of reads and writes
    could change and produce incorrect values
  • This instruction sequence exchanges values in R1
    and R2
  • On unpipelined DLX, back-to-back execution of
    sequence produces correct results
  • On current pipelined DLX, initiation of sequence
    in consecutive cycles produces incorrect results
  • Reads are early, writes are late, so RAW hazards
    would be violated

XOR R2, R2, R1
XOR R1, R1, R2
XOR R2, R2, R1
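
The failure can be reproduced with a small timing model. This is a sketch under stated assumptions: a five-stage pipeline with no interlocks or forwarding, registers read in the second stage and written in the fifth, and a write landing before a read in the same cycle. The `run` function and its `issue_gap` parameter are illustrative.

```python
def run(program, regs, issue_gap):
    """Execute XOR instructions (dst, src_a, src_b) under a fixed timing model.

    Instruction i is fetched in cycle i * issue_gap, reads its operands one
    cycle later (D stage) and writes its result four cycles later (W stage).
    """
    regs = dict(regs)
    events = []
    for i in range(len(program)):
        issue = i * issue_gap
        events.append((issue + 1, 1, i))  # phase 1: read, second half of cycle
        events.append((issue + 4, 0, i))  # phase 0: write, first half of cycle
    operands = {}
    for _cycle, phase, i in sorted(events):
        dst, a, b = program[i]
        if phase == 1:
            operands[i] = (regs[a], regs[b])
        else:
            x, y = operands[i]
            regs[dst] = x ^ y
    return regs


SWAP = [("R2", "R2", "R1"), ("R1", "R1", "R2"), ("R2", "R2", "R1")]
START = {"R1": 5, "R2": 9}
print(run(SWAP, START, issue_gap=5))  # {'R1': 9, 'R2': 5}
print(run(SWAP, START, issue_gap=1))  # {'R1': 12, 'R2': 12}
```

With a gap of 5 each instruction finishes before the next one reads, so the values are exchanged. With a gap of 1 all three instructions read the original values, and both registers end up holding 5 XOR 9 = 12.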
18
Data Dependence and Hazards
  • True (value, flow) dependence between
    instructions u and v means u produces a result
    value that v uses
  • This is a producer-consumer relationship
  • This is a dependence based on values, not on the
    names of the containers of the values
  • Every true dependence is a RAW hazard
  • Not every RAW hazard is a true dependence
  • Any RAW hazard that cannot be removed by renaming
    is a true dependence

19
More on Hazards
  • RAW hazards corresponding to value dependences
    are most difficult to deal with, since they can
    never be eliminated
  • The second instruction is waiting for information
    produced by the first instruction
  • WAR and WAW hazards are name dependences
  • Two instructions happen to use the same register
    (name), although they don't have to
  • Can often be eliminated by renaming, either in
    software or hardware
  • Implies the use of additional resources, hence
    additional cost
  • Renaming is not always possible: implicit
    operands such as accumulator, PC, or condition
    codes cannot be renamed
  • These hazards don't cause problems for the DLX
    pipeline
  • Relative timing does not change even with
    pipelined execution, because reads occur early
    and writes occur late in pipeline

20
The Precedence Relation
  • Consider a straight line program listed in
    original program order
  • Define a relation D (the dependence relation)
    between pairs of instructions (u, v) as follows
  • D(u, v) if and only if (u < v) and there is a
    WAR, WAW, or RAW hazard between instructions u
    and v
  • D is irreflexive and anti-symmetric but not
    transitive
  • Define the precedence relation P as the
    transitive closure of the dependence relation D
  • P is irreflexive, anti-symmetric, and transitive
  • Represent P by graph of its transitive reduction
    (precedence graph)
  • If P(u,v), then u must precede v in execution,
    that is, the two instructions cannot be
    interchanged, and in a pipeline they must
    maintain a sufficient distance

ADD R4, R5, R6
ADD R3, R4, R5
ADD R2, R3, R7
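
The three ADD instructions show why D is not transitive while P is. The sketch below builds D from read and write sets and then takes its transitive closure; the encoding and function names are illustrative, and the hazard test assumes the conventional RAW/WAR/WAW set-intersection conditions.

```python
from itertools import product

# (domain, range) for each instruction, indexed in program order
PROGRAM = [
    ({"R5", "R6"}, {"R4"}),  # 0: ADD R4, R5, R6
    ({"R4", "R5"}, {"R3"}),  # 1: ADD R3, R4, R5
    ({"R3", "R7"}, {"R2"}),  # 2: ADD R2, R3, R7
]


def has_hazard(u, v):
    """True if there is a RAW, WAR, or WAW hazard between u and a later v."""
    (d_u, r_u), (d_v, r_v) = u, v
    return bool(r_u & d_v or d_u & r_v or r_u & r_v)


def dependence(program):
    """The relation D as a set of index pairs (i, j) with i before j."""
    n = len(program)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if has_hazard(program[i], program[j])}


def transitive_closure(pairs, n):
    """Warshall's algorithm; the intermediate node k varies slowest."""
    closure = set(pairs)
    for k, i, j in product(range(n), repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure


D = dependence(PROGRAM)
P = transitive_closure(D, len(PROGRAM))
print(sorted(D))  # [(0, 1), (1, 2)]
print(sorted(P))  # [(0, 1), (0, 2), (1, 2)]
```

The first and third instructions share no register, so the pair (0, 2) is absent from D, yet it is in P because the second instruction links them.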
21
Example of Precedence Relation
1: ADD R1, R7, R8
2: SW 2000(R9), R8
3: LW R3, 0(R1)
4: LW R4, 3000(R9)
5: ADD R5, R3, R4
6: MUL R6, R5, R5
Assume that registers R7, R8, R9 are already
initialized such that (R7) + (R8) = (R9) + 2000 holds
22
Data Hazard Effect on Compiler
23
Data Hazard Effect on Pipelining
If executed in the pipeline discussed so far,
this data hazard would lead to incorrect
execution for the SUB and AND instructions, as
they would access the old value of register R1.
1: ADD R1, R2, R3
2: SUB R4, R5, R1
3: AND R6, R1, R7
4: OR R8, R1, R9
5: XOR R10, R1, R11
24
Solution: Interlocks and Stalling
1: ADD R1, R2, R3
2: LW R4, 0(R1)
3: SW 12(R1), R4
  • Add interlocks (additional control logic) between
    pipeline stages to detect hazard condition and to
    stall instruction in current pipeline stage until
    preceding instructions move sufficiently forward
    in the pipeline to guarantee correct results
  • LW stalls in D stage waiting for ADD to complete
    its write to R1 in cycle 5
  • We are assuming a split-phase clock, so that the
    write happens in the first half of cycle 5 and
    the read in the second half of cycle 5, so that
    LW can move to X stage in cycle 6
  • This causes following instructions to stall as
    well (e.g., SW stalls in F stage because LW is
    stalled in D stage)
  • It would also be possible to achieve a similar
    effect by inserting NOPs between the instructions
    as spacers
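
The stall behavior described above can be reproduced with a small scheduler. This is a sketch under the slide's assumptions (five stages F, D, X, M, W, no forwarding, and a split-phase clock so a register written in a cycle can be read in that same cycle); the function and the tuple encoding are illustrative.

```python
def schedule_with_interlocks(program):
    """program: list of (name, dest_or_None, source_registers).

    Returns {name: {"F": (first, last), "D": (first, last),
                    "X": cycle, "M": cycle, "W": cycle}}.
    """
    write_cycle = {}  # register -> cycle of its pending write (W stage)
    schedule = {}
    prev_f_end, prev_d_end = 0, 0
    for name, dest, sources in program:
        f_start = prev_f_end + 1
        # Enter D once fetched and once the previous instruction has left D.
        d_start = max(f_start + 1, prev_d_end + 1)
        # Stay in D until every source register has been written.
        d_end = max([d_start] + [write_cycle.get(r, 0) for r in sources])
        schedule[name] = {
            "F": (f_start, d_start - 1),
            "D": (d_start, d_end),
            "X": d_end + 1,
            "M": d_end + 2,
            "W": d_end + 3,
        }
        if dest is not None:
            write_cycle[dest] = d_end + 3
        prev_f_end, prev_d_end = d_start - 1, d_end
    return schedule


PROGRAM = [
    ("ADD", "R1", ["R2", "R3"]),  # ADD R1, R2, R3
    ("LW", "R4", ["R1"]),         # LW R4, 0(R1)
    ("SW", None, ["R1", "R4"]),   # SW 12(R1), R4
]
for name, stages in schedule_with_interlocks(PROGRAM).items():
    print(name, stages)
```

In this model LW occupies D during cycles 3 to 5 and enters X in cycle 6, and SW is held in F during cycles 3 to 5 because D is occupied.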

25
Optimization: Value Forwarding
  • There is slack in how soon a value is actually
    available and how late it is actually required in
    the pipeline
  • Result of R-type available at end of X stage
  • Operand of dependent R-type not needed until
    beginning of X stage
  • Communication of values among instructions
    happens through register file
  • Globally known names of containers of values
  • Accessed at fixed stages of pipeline (read in D,
    written in W)
  • Forwarding/bypassing/short-circuiting corresponds
    to establishing a direct path between the
    producer of a value and its consumer, bypassing
    the container
  • Allows us to exploit slack
  • Requires additional resources (forwarding paths
    and controller)
  • Identify all forwarding paths needed on DLX
    (Figure 3.19 is incomplete)

26
Example of Forwarding
1: ADD R1, R2, R3
2: LW R4, 0(R1)
3: SW 12(R1), R4
27
Forwarding and Stalling
L1: LW R2, 40(R8)
L2: LW R3, 60(R8)
A: ADD R4, R2, R3
S: SW 60(R8), R4
  • Load has a latency of one cycle that cannot
    be hidden, as seen between L2 and A
28
Compile-Time Scheduling
A = B + C
D = E - F

Original order:
L1: LW Rb, B
L2: LW Rc, C
A: ADD Ra, Rb, Rc
S1: SW A, Ra
L3: LW Re, E
L4: LW Rf, F
S: SUB Rd, Re, Rf
S2: SW D, Rd

Scheduled order:
L1: LW Rb, B
L2: LW Rc, C
L3: LW Re, E
A: ADD Ra, Rb, Rc
L4: LW Rf, F
S1: SW A, Ra
S: SUB Rd, Re, Rf
S2: SW D, Rd
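
The benefit of the reordering can be counted with a simple rule: with forwarding, an instruction stalls one cycle only if it uses the result of the load immediately before it. The sketch below assumes that rule and nothing else (no other hazards, one-cycle load delay); the encoding is illustrative.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest_or_None, source_registers)."""
    stalls = 0
    prev_load_dest = None
    for opcode, dest, sources in program:
        if prev_load_dest in sources:
            stalls += 1  # value arrives one cycle too late; insert a bubble
        prev_load_dest = dest if opcode == "LW" else None
    return stalls


L1 = ("LW", "Rb", [])
L2 = ("LW", "Rc", [])
L3 = ("LW", "Re", [])
L4 = ("LW", "Rf", [])
A = ("ADD", "Ra", ["Rb", "Rc"])
S = ("SUB", "Rd", ["Re", "Rf"])
S1 = ("SW", None, ["Ra"])
S2 = ("SW", None, ["Rd"])

original = [L1, L2, A, S1, L3, L4, S, S2]
scheduled = [L1, L2, L3, A, L4, S1, S, S2]
print(load_use_stalls(original))   # 2
print(load_use_stalls(scheduled))  # 0
```

Moving an independent load or store between each load and its consumer hides both load delays.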
29
Code Generation Examples for Branches
if (x > 0) y += z; else y -= z;

blez r7, L18
addu r3, r3, r4
j L33
L18: subu r3, r3, r4
L33:

while (a < b) { a++; b--; x++; }

j L33
L34: addu r5, r5, 1
addu r6, r6, -1
addu r7, r7, 1
L33: slt r2, r5, r6
bne r2, r0, L34

switch (a) {
case 2: x++; break;
case 4: y++; break;
case 6: z++; break;
case -2: x--; break;
case -4: y--; break;
case -6: z--; break;
default: break;
}

Register r3 contains y
Register r4 contains z
Register r5 contains a
Register r6 contains b
Register r7 contains x
30
Control Hazard
  • A peculiar kind of RAW hazard involving the
    program counter
  • PC written by branch instruction
  • PC read by instruction fetch unit (not another
    instruction)
  • Possible misbehavior is that instructions fetched
    and executed after the branch instruction are not
    the ones specified by the branch instruction

31
More on Control Hazards
  • Branch delay = the length of the control hazard
  • What determines branch delay?
  • We need to know that we have a branch instruction
  • We need to have the BTA
  • We need to know the branch outcome
  • So, we have to wait until we know all of these
    quantities
  • DLX pipeline as currently designed
  • computes BTA in EX
  • computes branch outcome in EX
  • changes PC in MEM
  • To reduce branch delay, we need to move these to
    earlier pipeline stages
  • Can't move up beyond ID (need to know it's a
    branch instruction)

32
Delayed Branches on DLX
  • One branch delay slot on redesigned DLX
  • Always execute instruction in branch delay slot
    (irrespective of branch outcome)
  • Question: What instruction do we put in the
    branch delay slot?
  • Fill with NOP (always possible, penalty = 1)
  • Fill from before (not always possible, penalty
    = 0)
  • Fill from target (not always possible, penalty
    = 1-T)
  • BTA is dynamic
  • BTA is another branch
  • Fill from fall-through (not always possible,
    penalty = T)

33
Details of Various Branch Flavors
[Figure: instruction sequences A B C D, M N P Q, and E F G H around a conditional branch X (cond) with true and false outcomes]
34
Pipelining Multicycle Operations
  • Assume five-stage pipeline
  • Third stage (execution) has two functional units
    E1 and E2
  • Instruction goes through either E1 or E2, but not
    both
  • E1 and E2 are not pipelined
  • Stage delay of E1 = 2 cycles
  • Stage delay of E2 = 4 cycles
  • No buffering on inputs of E1 and E2
  • Stage delay of other stages = 1 cycle
  • Consider an instruction sequence of five
    instructions
  • Instructions 1, 3, 5 need E1
  • Instructions 2, 4 need E2

35
Space-Time Diagram: Multicycle Operations
  • Out-of-order completion
  • 3 finishes before 2, and 5 finishes before 4
  • Instructions may be delayed after entering the
    pipeline because of structural hazards
  • Instructions 2 and 4 both want to use E2 unit at
    same time
  • Instruction 4 stalls in ID unit
  • This causes instruction 5 to stall in IF unit
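
The completion order and the stalls listed above can be reproduced with a small scheduler. This is a sketch under the stated setup (E1 takes 2 cycles, E2 takes 4, neither is pipelined or buffered, so an instruction waits in ID until its unit is free). It does not check for conflicts in the last two stages, which do not arise for this sequence; the stage names IF, ID, EX, MEM, WB and the function are illustrative.

```python
UNIT_DELAY = {"E1": 2, "E2": 4}


def schedule_multicycle(units):
    """units: execution unit needed by each instruction, in program order.

    Returns one dict per instruction with (first, last) cycle spans for
    IF, ID, and EX and single cycles for MEM and WB.
    """
    unit_busy_until = {"E1": 0, "E2": 0}
    prev_if_end, prev_id_end = 0, 0
    result = []
    for unit in units:
        if_start = prev_if_end + 1
        id_start = max(if_start + 1, prev_id_end + 1)
        # No input buffering: wait in ID until the unit is free.
        ex_start = max(id_start + 1, unit_busy_until[unit] + 1)
        ex_end = ex_start + UNIT_DELAY[unit] - 1
        unit_busy_until[unit] = ex_end
        result.append({
            "IF": (if_start, id_start - 1),
            "ID": (id_start, ex_start - 1),
            "EX": (ex_start, ex_end),
            "MEM": ex_end + 1,
            "WB": ex_end + 2,
        })
        prev_if_end, prev_id_end = id_start - 1, ex_start - 1
    return result


timeline = schedule_multicycle(["E1", "E2", "E1", "E2", "E1"])
print([t["WB"] for t in timeline])  # [6, 9, 8, 13, 12]
```

Instruction 3 completes in cycle 8, ahead of instruction 2 in cycle 9, and instruction 5 completes in cycle 12, ahead of instruction 4 in cycle 13. Instruction 4 waits in ID during cycles 5 to 7, which holds instruction 5 in IF for the same cycles.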

36
Floating-Point Operations in DLX
Out-of-order completion has ramifications for exceptions
WAW hazards possible; WAR hazards not possible
Longer operation latency implies more frequent stalls for RAW hazards
Structural hazard: instructions have varying running times
Structural hazard: not fully pipelined