CS 4290/6290 Lecture 06: Out-of-order execution, Out-of-order completion (a.k.a. the cool stuff)

Provided by: michaelt8
1
CS 4290/6290 Lecture 06: Out-of-order
execution, Out-of-order completion (a.k.a. the
cool stuff)
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, Michael Niemier,
    and Milos Prvulovic)

2
The Big Picture
  • During the past several lectures we've discussed
    the benefits and hazards of pipelining.
  • Major benefit of pipelining
  • Allows instruction execution to overlap if they
    are independent of each other
  • Instructions are evaluated in pseudo-parallel
  • Now extend many pipeline fundamentals
  • Reduce impact of data and control hazards
  • Increase amount of parallelism that can be
    extracted statically or dynamically

3
The medium-sized picture
  • Performance of a pipeline is essentially judged by
    the number of clock cycles it takes to execute an
    instruction (CPI)
  • Ideally, if one instruction is issued every
    cycle, the average CPI for a perfect pipeline
    should be 1
  • But hazards and stalls slow this down:
  • Pipeline CPI =
    Ideal pipeline CPI + Structural stalls + RAW
    stalls + WAR stalls + WAW stalls + Control stalls
  • Now study techniques to help reduce the bad
    terms above
  • Loop unrolling, pipeline scheduling,
    scoreboarding, register renaming, branch
    prediction, etc.
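
As a worked illustration of the CPI sum above, here is a minimal sketch. The stall figures are hypothetical numbers chosen only for the example (they are not measurements from the lecture), and `pipeline_cpi` is an illustrative helper name.

```python
def pipeline_cpi(ideal_cpi, stalls_per_instruction):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction from each source."""
    return ideal_cpi + sum(stalls_per_instruction.values())


# Hypothetical average stall cycles per instruction, for illustration only.
example_stalls = {
    "structural": 0.05,
    "RAW": 0.20,
    "WAR": 0.0,
    "WAW": 0.0,
    "control": 0.15,
}

cpi = pipeline_cpi(1.0, example_stalls)
print(f"Pipeline CPI = {cpi:.2f}")
```

Each technique listed above attacks one of the terms in the dictionary; driving a term toward zero moves the total back toward the ideal CPI of 1.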

4
Instruction level parallelism (ILP)
  • In discussions of ways to increase parallelism
    we'll often try to do it within a basic block
  • Basic block: a set of instructions with no
    branches except for entry to and exit from it
  • Generally, it's pretty small
  • (recall branch instructions occur 15-20% of the
    time, so a basic block is usually about 6-7
    instructions)
  • To obtain better performance than what we've
    already seen, we must exploit ILP across multiple
    basic blocks

5
Loop-level parallelism
  • Lots of programs do the same thing over and over
    again on different sets of data (i.e. a loop)
  • Problems for parallelism arise with a loop like
    this:
  • for (i=1; i<=1000; i++)
  • x[i] = x[i] + y[i];
  • When translated to assembly code, we probably have
  • A load or two, an add instruction, and maybe a
    store
  • Then you'll have a branch to start the loop again
  • In other words, the basic block size is SMALL
  • Within such loops, there is very little
    opportunity for overlap

6
Basic pipeline scheduling
  • Before talking about ways to find more ILP, let's
    first review and set the stage
  • To keep the pipeline full, parallelism among
    instructions must be generated
  • Find unrelated instructions that can overlap in a
    pipeline
  • (i.e. let's avoid RAW hazards)
  • If an instruction depends on another instruction,
    it must be separated from it
  • by a distance in clock cycles that is equal to
    the pipeline latency of the source instruction

7
Assumed latencies
  • The following latencies will be assumed for a few
    examples
  • A standard integer pipeline is also assumed so
    branches have a delay of 1 clock cycle
  • Functional units are fully pipelined OR
    replicated so no structural hazards ensue
  • (So, we're assuming multiple functional units as
    before)

8
Loop unrolling (part 1)
  • We're going to walk through the unrolling of a
    simple for loop to show how we might gain ILP
  • Our candidate for loop is
  • for (i=1; i<=1000; i++)
  • x[i] = x[i] + y[i];
  • This loop is parallel b/c the body of each
    iteration is independent (more pathological cases
    later)
  • To unroll this loop, let's look at some
    assembly

Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (per data word)
      BNEZ R1, Loop      ; branch R1 != 0
9
Loop unrolling (part 2)
(Board?)
Without scheduling, our loop would execute
like this (clock cycle issued on the right):
Loop: LD   F0, 0(R1)      1
      stall               2
      ADDD F4, F0, F2     3
      stall               4
      stall               5
      SD   0(R1), F4      6
      SUBI R1, R1, 8      7
      stall               8
      BNEZ R1, Loop       9
      stall               10
Originally, there were 5 instructions. Now,
there are 5 more stalls and 10 total clock cycles
are required per iteration.
With scheduling, we can reduce the number of
stalls:
Loop: LD   F0, 0(R1)
      SUBI R1, R1, 8
      ADDD F4, F0, F2
      stall
      BNEZ R1, Loop      ; delayed branch
      SD   8(R1), F4     ; altered and interchanged with SUBI
Execution time has been reduced from 10 clock
cycles to 6
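
The cycle counts above can be reproduced with a small in-order issue model. This is a sketch under stated assumptions: the `LATENCY` table is inferred from the stalls shown on this slide (it is not the lecture's latency table), every other producer/consumer pair is assumed to need no stall, and `cycles_per_iteration` is a hypothetical helper name.

```python
# Assumed stall cycles required between a producer and a dependent consumer.
# Inferred from the stalls on the slide; pairs not listed need no stall.
LATENCY = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNEZ"): 1}


def cycles_per_iteration(loop):
    """loop: list of (opcode, destination or None, source registers)."""
    issue_cycle = 0
    last_writer = {}  # register -> (opcode, issue cycle) of its latest writer
    for op, dest, srcs in loop:
        earliest = issue_cycle + 1  # in-order: at best one instruction per cycle
        for reg in srcs:
            if reg in last_writer:
                prod_op, prod_cycle = last_writer[reg]
                gap = LATENCY.get((prod_op, op), 0)
                earliest = max(earliest, prod_cycle + 1 + gap)
        issue_cycle = earliest
        if dest is not None:
            last_writer[dest] = (op, issue_cycle)
    # A branch in last position leaves its delay slot empty: one more cycle.
    if loop[-1][0] == "BNEZ":
        issue_cycle += 1
    return issue_cycle


unscheduled = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]
scheduled = [
    ("LD", "F0", ["R1"]),
    ("SUBI", "R1", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("BNEZ", None, ["R1"]),
    ("SD", None, ["R1", "F4"]),  # fills the branch delay slot
]

print(cycles_per_iteration(unscheduled))  # 10
print(cycles_per_iteration(scheduled))    # 6
```

The model places the remaining stall of the scheduled loop directly before the SD rather than before the BNEZ as on the slide; the total per iteration is the same.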
10
Loop unrolling (part 3)
  • From the last slide, we see that
  • Running one iteration of the loop takes 6 clock
    cycles
  • But real work is only performed on 3 of those
    cycles
  • The LD, ADDD, and SD instructions
  • The other 3 clock cycles are devoted to
  • A stall and loop overhead instructions (the SUBI
    and BNEZ)
  • i.e. only half of the cycles perform useful
    work
  • Note: this is a bad thing
  • Loop unrolling increases the number of real work
    instructions relative to the loop overhead
    instructions
  • We replicate the loop body and adjust the
    termination code

11
Loop unrolling (part 4)
Loop unrolled 4 times
  • 3 branches and 3 decrements of R1 have been
    eliminated by copying instructions.
  • This loop has not been scheduled, so every
    operation is followed by a dependent
    instruction
  • This loop will take 28 clock cycles to run
  • Each LD has 1 stall
  • Each ADDD has 2 stalls
  • The SUBI has 1 stall
  • The BNEZ has 1 stall
  • And there are 14 instruction issue cycles
  • More registers are required!

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
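
The mechanical part of unrolling (copy the body, give each copy fresh registers, adjust offsets, keep one decrement and one branch) can be sketched as a small generator. This is an illustration only: `unroll` is a hypothetical helper, the register numbering is chosen to reproduce the listing above, and the sketch assumes the trip count is a multiple of the unroll factor.

```python
def unroll(factor, stride=8):
    """Return the loop unrolled `factor` times as a list of assembly lines."""
    lines = []
    for k in range(factor):
        # Fresh registers for each copy (numbering mirrors the slide:
        # F0/F4, F6/F8, F10/F12, F14/F16).
        load_reg = "F0" if k == 0 else f"F{4 * k + 2}"
        sum_reg = f"F{4 * k + 4}"
        offset = -stride * k  # each copy touches the next element down
        lines.append(f"LD {load_reg}, {offset}(R1)")
        lines.append(f"ADDD {sum_reg}, {load_reg}, F2")
        lines.append(f"SD {offset}(R1), {sum_reg}")
    # One pointer update and one branch serve all copies.
    lines.append(f"SUBI R1, R1, {stride * factor}")
    lines.append("BNEZ R1, Loop")
    return lines


for line in unroll(4):
    print(line)
```

With `factor=4` the generator emits 14 instructions: 12 for the four copies of the body plus the single SUBI and BNEZ.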
12
Loop unrolling (part 5)
Loop unrolled 4 times (and scheduled!)
  • Scheduling has eliminated all of the stalls!
  • The execution time of the unrolled loop has
    dropped to a total of 14 clock cycles
  • Putting this in perspective
  • Only 3.5 clock cycles per iteration
  • In the previous (unrolled, unscheduled) example,
    7 were needed
  • And before loop unrolling, 6 were needed

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12    ; 16 - 32 = -16
      BNEZ R1, Loop
      SD   8(R1), F16     ; 8 - 32 = -24
Loop unrolling is a very useful technique and is
not inherently tied to a specific processor
implementation
13
Dependencies
  • Remember
  • To exploit ILP, we, the compiler, whoever, must
    determine which instructions can execute in
    parallel
  • Instructions are parallel if they can execute
    at the same time in a pipeline without causing
    any stalls
  • 2 instructions that are dependent are not
    parallel and cannot be reordered
  • 3 kinds of dependencies to consider
  • Data dependencies
  • Name dependencies
  • Control dependencies

14
Data dependencies
  • An instruction j is dependent on instruction i if
    any of the following is true
  • Instruction i produces a result used by
    instruction j
  • Instruction j is data dependent on instruction k
    and instruction k is data dependent on
    instruction i
  • If 2 instructions are data dependent, they cannot
    execute at the same time or completely overlap
  • The result would be at least one RAW hazard
  • For code to execute correctly, the original data
    dependence must be preserved during execution

15
What data dependencies really are
  • Data dependencies are properties of assembly code;
    that's where they exist in the program!
  • A data dependency doesn't necessarily have to
    cause stalls or hazards
  • Organization of pipeline determines whether or
    not this actually happens
  • To summarize
  • A dependence indicates the possibility of a
    hazard
  • A dependence determines the order in which
    results must be calculated
  • A dependence sets an upper bound on how much
    parallelism can be exploited

16
Data dependence example
An example of a dependence chain
Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
  • When the data flow dependencies occur because of
    registers, detecting them is a relatively easy
    process.
  • Problems arise with memory locations
  • 100(R4) and 20(R6) may be identical
  • 20(R4) and 20(R4) may be different because R4 has
    changed
  • (Remember this; it'll come back real soon)
  • It may be possible to avoid hazards while
    maintaining dependencies by transforming code

17
Name dependencies
  • A name dependence occurs when 2 instructions
    use the same register or memory location (a name)
    but there is no flow of data b/t the 2
    instructions
  • There are 2 types
  • Antidependencies
  • Occur when an instruction j writes a register or
    memory location that instruction i reads and i
    is executed first
  • Corresponds to a WAR hazard
  • Output dependencies
  • Occur when instruction i and instruction j write
    the same register or memory location
  • Protected against by checking for WAW hazards

18
More on name dependencies
  • Note difference from data dependencies
  • No value is transmitted between instructions here
  • Resources get reused; we need to ensure that each
    instruction gets its value from the right resource
    before it's reused
  • VERY IMPORTANT
  • Instructions with name dependencies can be
    executed at the same time or reordered, provided
    the name is changed so that they do not conflict
  • Example

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
LD/LD dependence is an output dependence (WAW)
ADDD/LD dependence is an antidependence (WAR)
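
A register-only dependence check over this example can be sketched as below. It is a simplified, conservative illustration: `dependences` is a hypothetical helper, memory locations are ignored, and every pair of instructions is compared, so it also reports pairs whose value has already been overwritten by an intervening write.

```python
def dependences(program):
    """program: list of (text, dest or None, sources).

    Returns (i, j, kind, register) tuples where instruction i precedes j.
    """
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i is not None and dest_i in srcs_j:
                found.append((i, j, "RAW", dest_i))  # true data dependence
            if dest_j is not None and dest_j in srcs_i:
                found.append((i, j, "WAR", dest_j))  # antidependence
            if dest_j is not None and dest_j == dest_i:
                found.append((i, j, "WAW", dest_j))  # output dependence
    return found


program = [
    ("LD F0, 0(R1)", "F0", ["R1"]),
    ("ADDD F4, F0, F2", "F4", ["F0", "F2"]),
    ("SD 0(R1), F4", None, ["R1", "F4"]),
    ("LD F0, -8(R1)", "F0", ["R1"]),
    ("ADDD F4, F0, F2", "F4", ["F0", "F2"]),
]

for i, j, kind, reg in dependences(program):
    print(f"{program[i][0]:18} -> {program[j][0]:18} {kind} on {reg}")
```

Among the pairs reported are the two named on the slide: the LD/LD output dependence on F0 and the ADDD/LD antidependence on F0.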
19
Control dependencies
  • A control dependence determines the ordering of
    an instruction with respect to a branch instruction
    so that the non-branch instruction is executed only
    when it should be
  • Often arise due to conditional statements (i.e.
    if statements)
  • 2 constraints on control dependencies are
  • If an instruction is control dependent on a branch,
    it cannot be moved before the branch
  • B/c it no longer would be controlled by the
    branch!
  • You might see this next one coming
  • but, if an instruction is not controlled by a
    branch, it can't be moved after the branch so that
    it is
20
Control dependencies do what?
  • They make sure instructions execute in order
  • Well, this is kinda true. There's an exception
    to every rule, right?
  • Sometimes we could allow an instruction to
    execute even if it should not have been, as long
    as its execution doesn't hurt anything
  • Control dependence is not the critical property
    to preserve; exception behavior and data flow
    are
  • Control dependencies preserve data flow
  • Makes sure that instructions that produce results
    and consume them get the right data at the right
    time

21
Dynamic scheduling
  • Allows some data hazards to be overcome in HW
  • The compiler is SW that schedules instructions to
    avoid hazards, dependencies, etc. statically
  • In dynamic scheduling, HW rearranges instructions
    to avoid stalls
  • Some advantages of dynamic scheduling are
  • Dependencies not known at compile time may be
    resolved and stalls avoided
  • The compiler doesn't have to do as much work
  • Code compiled for one pipeline may run well on
    another
  • The catch: we'll need some more hardware

22
The idea behind dynamic scheduling
  • In all of the pipelines we've studied so far, all
    of the instructions have been issued in order
  • If an instruction is stalled, the pipeline is
    stalled
  • Example
  • DIVD F0, F2, F4
  • ADDD F10, F0, F8
  • SUBD F12, F8, F14
  • One way to eliminate this problem is to NOT
    require instructions to execute in order

The ADDD instruction is data dependent on DIVD.
But the SUBD instruction depends on neither, so it
could execute right away.
23
Out of order execution means
  • In previous discussions, both structural and data
    hazards checked in ID stage
  • But, to execute SUBD early (for example) have to
    break up issue stage into
  • Decoding the instruction and checking for
    structural hazards
  • Waiting for the absence of a data hazard and
    reading operands
  • Instructions can still be issued in-order and
    structural hazards can be checked for at this time
  • But, we want instructions to start executing as
    soon as their operands are ready
  • So, out-of-order execution means out-of-order
    completion

24
Scoreboarding
  • Scoreboarding allows instructions to execute out
    of order when sufficient resources are available
    and there are no data dependencies
  • Now, a case study: the CDC 6600 scoreboard
  • Goal of the scoreboard is to try to make sure one
    instruction is executed each clock cycle
  • So if 1 inst. is stalled, it tries to find
    another to start
  • But, WAR hazards are possible now

DIVD F0, F2, F4
ADDD F10, F0, F8    (reads F8)
SUBD F8, F8, F14    (writes F8)
If the pipeline lets SUBD write F8 before ADDD has
read it, the antidependence is violated (a WAR hazard)
25
The CDC 6600
  • The CDC 6600 has 16 separate functional units
  • 4 floating point units, 5 units for memory
    references, and 7 units for integer operations
  • We'll simplify a bit and assume 2 multipliers,
    one FP adder, one FP divider, and an integer unit
    for all memory references, branches, integer ops
  • Every instruction goes through scoreboard where a
    record of data dependencies is constructed
  • Scoreboard determines when operation can read its
    operands and begin execution
  • If it cannot issue immediately, scoreboard
    monitors operand availability and issues when
    ready

26
Scoreboard stage 1: Issue
  • If the functional unit for the instruction is free
    and no other active instruction has the same
    destination register, the instruction is issued to
    its functional unit
  • The instruction's info is recorded in the scoreboard
  • This avoids WAW hazards
  • Otherwise (structural or WAW hazard), issue stalls
    and no other instructions will issue until the
    hazard is cleared
  • There's a buffer between the instruction fetch
    and issue stages of the pipeline
  • It may be a single entry or a queue
  • (if it's a queue, instructions may continue to
    fetch, but not issue)
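
The issue test described above can be sketched as a pair of small functions. This is a simplified illustration, not the CDC 6600 logic: the scoreboard is reduced to a `unit_busy` dictionary and a `pending_writes` set, and all names here are hypothetical.

```python
def can_issue(unit, dest, unit_busy, pending_writes):
    """Issue only if the functional unit is free (no structural hazard)
    and no active instruction already targets `dest` (no WAW hazard)."""
    return not unit_busy.get(unit, False) and dest not in pending_writes


def issue(unit, dest, unit_busy, pending_writes):
    """Record the instruction in the simplified scoreboard, or report a stall."""
    if not can_issue(unit, dest, unit_busy, pending_writes):
        return False  # stall: later instructions must wait as well
    unit_busy[unit] = True
    pending_writes.add(dest)
    return True


unit_busy = {}
pending_writes = set()
first = issue("divide", "F0", unit_busy, pending_writes)   # free unit, new destination
second = issue("add", "F0", unit_busy, pending_writes)     # WAW hazard on F0
third = issue("divide", "F6", unit_busy, pending_writes)   # divider still busy
print(first, second, third)
```

In a fuller model the write-result stage would clear the `unit_busy` entry and remove the destination from `pending_writes`, which is what eventually releases a stalled instruction.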

27
Scoreboard stage 2: Read Operands
  • The scoreboard monitors availability of operands
    for a given instruction
  • An operand is said to be available if
  • No earlier issued instruction is going to write
    it, or
  • The register containing the operand is being
    written by a currently active functional unit
  • When operands are available, issued instruction
    is instructed by scoreboard to read its registers
    and begin execution
  • RAW hazards resolved here and in this way
    instructions may be issued/executed out of order

28
Scoreboard stage 3: Execution
  • The instruction's functional unit will begin
    execution when it receives its operands
  • Upon completion, the functional unit notifies the
    scoreboard that it has completed

29
Scoreboard stage 4: Write result
  • After execution, scoreboard checks for a WAR
    hazard.
  • If present, completing instruction stalled
  • Generally, completing instruction cannot be
    allowed to write its results if
  • There is an instruction that should have been
    issued/executed before the completing instruction
    that has not read its operands
  • AND, one of the operands is the same register as
    the result of the completing instruction
  • If there's no WAR hazard or after it is cleared,
    the scoreboard allows the destination register to
    be written
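
The WAR condition checked before writing a result can be sketched as one predicate. This is an illustration with a hypothetical data layout: each earlier instruction is a dictionary holding its source registers and a flag saying whether it has read them yet.

```python
def can_write_result(dest, earlier_instructions):
    """Return False while some earlier instruction still has to read `dest`.

    earlier_instructions: dicts with 'sources' (register names) and
    'has_read_operands' (bool), for instructions issued before the
    completing one.
    """
    return not any(
        dest in instr["sources"] and not instr["has_read_operands"]
        for instr in earlier_instructions
    )


# ADDD F10, F0, F8 is still waiting for F0 from the divide,
# while a later SUBD F8, F8, F14 has finished executing.
addd = {"sources": ["F0", "F8"], "has_read_operands": False}

blocked = can_write_result("F8", [addd])   # SUBD must hold its result
addd["has_read_operands"] = True
released = can_write_result("F8", [addd])  # ADDD has read F8: safe to write
print(blocked, released)
```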

30
The parts of a scoreboard
31
A scoreboard example
This example is from notes prepared by David
Patterson and Randy Katz
32
A scoreboard example Cycle 1
33
A scoreboard example Cycle 2
Issue 2nd Load?
34
A scoreboard example Cycle 3
Issue MULT?
35
A scoreboard example Cycle 4
36
A scoreboard example Cycle 5
37
A scoreboard example Cycle 6
38
A scoreboard example Cycle 7
Read ADDD/SUBD operands? Issue ADDD?
39
A scoreboard example Cycle 8a
40
A scoreboard example Cycle 8b
41
A scoreboard example Cycle 9
Read MULT/SUBD operands? Issue ADDD?
42
A scoreboard example Cycle 11
43
A scoreboard example Cycle 12
Read DIVD operands?
44
A scoreboard example Cycle 13
45
A scoreboard example Cycle 14
46
A scoreboard example Cycle 15
47
A scoreboard example Cycle 16
48
A scoreboard example Cycle 17
Write result of ADDD?
49
A scoreboard example Cycle 18
50
A scoreboard example Cycle 19
51
A scoreboard example Cycle 20
52
A scoreboard example Cycle 21
53
A scoreboard example Cycle 22
54
A scoreboard example Cycle 61
55
A scoreboard example Cycle 62
56
Scoreboarding conclusions
  • Scoreboard uses available ILP to minimize stalls
    that arise from a program's true data
    dependencies
  • But, scoreboard limited by several factors
  • Instructions in a basic block are only so
    parallel
  • if each instruction depends on its predecessor, no
    dynamic scheduling can reduce stalls
  • Scoreboard itself has finite size
  • can look only in some window for potentially
    executable instructions
  • Number and types of functional units
  • (which determines structural hazards)
  • Antidependencies and output dependencies
  • (WAR, WAW hazards)

57
Scheduling
  • Finds instructions to execute in each cycle
  • Static (in-order) scheduling looks only at the
    next instruction
  • Dynamic (out-of-order) scheduling looks at a
    window of instructions
  • How many instructions are we looking for?
  • 3-4 is typical today, 8 is in the works
  • A CPU that can ideally do N instrs per cycle is
    called N-way superscalar, N-issue
    superscalar, or simply N-way or N-issue.

58
Static Scheduling
  • Cycle 1
  • Start I1.
  • Can we also start I2? No.
  • Cycle 2
  • Start I2.
  • Can we also start I3? Yes.
  • Can we also start I4? No.
  • If the next instruction cannot start, the processor
    stops looking for things to do in this cycle!

Program code
I1 ADD R1, R2, R3
I2 SUB R4, R1, R5
I3 AND R6, R1, R7
I4 OR R8, R2, R6
I5 XOR R10, R2, R11
59
Dynamic Scheduling
  • Cycle 1
  • Operands ready? I1, I5.
  • Start I1, I5.
  • Cycle 2
  • Operands ready? I2, I3.
  • Start I2, I3.
  • Window size (W): how many instructions ahead
    we look.
  • Do not confuse with issue width (N).
  • E.g. a 4-issue out-of-order processor can have a
    128-entry window (it can look at the next 128
    instructions).

Program code
I1 ADD R1, R2, R3
I2 SUB R4, R1, R5
I3 AND R6, R1, R7
I4 OR R8, R2, R6
I5 XOR R10, R2, R11
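
The static and dynamic walkthroughs for this program can be reproduced with a toy scheduler. This is a sketch under stated assumptions: every instruction takes one cycle, the issue width is 4, the window covers the whole program, and name dependences are ignored (this program has none). `schedule` is a hypothetical helper.

```python
def schedule(program, in_order, width=4):
    """program: list of (name, dest, sources). Returns {name: start cycle}."""
    # For each instruction, find the earlier instructions that produce its sources.
    producers = {}
    last_writer = {}
    for name, dest, srcs in program:
        producers[name] = {last_writer[r] for r in srcs if r in last_writer}
        last_writer[dest] = name

    start = {}
    pending = [name for name, _, _ in program]
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for name in pending:
            if len(issued) == width:
                break
            # Ready once every producer started in an earlier cycle.
            ready = all(p in start and start[p] < cycle for p in producers[name])
            if ready:
                issued.append(name)
            elif in_order:
                break  # static scheduling stops at the first stalled instruction
        for name in issued:
            start[name] = cycle
            pending.remove(name)
    return start


program = [
    ("I1", "R1", ["R2", "R3"]),
    ("I2", "R4", ["R1", "R5"]),
    ("I3", "R6", ["R1", "R7"]),
    ("I4", "R8", ["R2", "R6"]),
    ("I5", "R10", ["R2", "R11"]),
]

print("static: ", schedule(program, in_order=True))
print("dynamic:", schedule(program, in_order=False))
```

Both versions need three cycles for this short program, but the dynamic scheduler starts I5 in cycle 1, work the static scheduler cannot reach until cycle 3.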
60
Dynamic Scheduling Pipeline
  • Fetch: gets the next few instructions (reads the
    instruction stream in-order)
  • Decode: decodes the instructions fetched in the
    previous cycle (in-order)
  • Then we can start looking at instructions and try
    to execute them out of order.
  • Important: we fetch and decode in-order even in
    an out-of-order processor.

61
Register Renaming
  • Name dependences
  • I3 cannot go before I2 because I3 will overwrite
    R5
  • I5 cannot go before I2 because I2, when it goes,
    will overwrite R2 with a stale value
  • Called name dependences because the dependence
    comes from the register name, not the flow of data.

Program code
I1 ADD R1, R2, R3
I2 SUB R2, R1, R5
I3 AND R5, R11, R7
I4 OR R8, R6, R2
I5 XOR R2, R4, R11
62
Register Renaming
  • Solution: give I3 some other name
    (e.g. S) for the value it produces.
  • Any later instruction that uses that value must
    also be changed to use S
  • In fact, all uses of R5 from I3 to the next
    instruction that writes to R5 again must now be
    changed to S!
  • We get rid of output dependences in the same way:
    change R2 in I5 (and subsequent instrs) to T.

Program code
I1 ADD R1, R2, R3
I2 SUB R2, R1, R5
I3 AND R5, R11, R7
I4 OR R8, R6, R2
I5 XOR R2, R4, R11
63
Register Renaming
  • Implementation
  • Space for T, S, etc.
  • How do we know when to rename a register?
  • Simple Solution
  • Do renaming in-order, just after decoding
  • Change the name of a register each time we decode
    an instruction that will write to it.
  • Remember what name we gave it

Program code
I1 ADD R1, R2, R3
I2 SUB R2, R1, R5
I3 AND S, R11, R7
I4 OR R8, R6, R2
I5 XOR T, R4, R11
64
Register Renaming Example
Renaming table (I1: destination R1; sources R2, R3)
Original   Renamed
R1         T1
R2         R2
R5         R5
R8         R8

Decoded              Renamed
I1 ADD R1, R2, R3    I1 ADD T1, R2, R3
65
Register Renaming Example
Renaming table (I2: sources R1, R5; destination R2)
Original   Renamed
R1         T1
R2         T2
R5         R5
R8         R8

Decoded              Renamed
I1 ADD R1, R2, R3    I1 ADD T1, R2, R3
I2 SUB R2, R1, R5    I2 SUB T2, T1, R5
66
Register Renaming Example
Renaming table (I3: destination R5; sources R11, R7)
Original   Renamed
R1         T1
R2         T2
R5         T3
R8         R8

Decoded              Renamed
I1 ADD R1, R2, R3    I1 ADD T1, R2, R3
I2 SUB R2, R1, R5    I2 SUB T2, T1, R5
I3 AND R5, R11, R7   I3 AND T3, R11, R7
67
Register Renaming Example
Renaming table
Original   Renamed
R1         T1
R2         T2
R5         T3
R8         T4

Decoded              Renamed
I1 ADD R1, R2, R3    I1 ADD T1, R2, R3
I2 SUB R2, R1, R5    I2 SUB T2, T1, R5
I3 AND R5, R11, R7   I3 AND T3, R11, R7
I4 OR R8, R6, R2     I4 OR T4, R6, T2
68
Register Renaming Example
Renaming table
Original   Renamed
R1         T1
R2         T5
R5         T3
R8         T4

Decoded              Renamed
I1 ADD R1, R2, R3    I1 ADD T1, R2, R3
I2 SUB R2, R1, R5    I2 SUB T2, T1, R5
I3 AND R5, R11, R7   I3 AND T3, R11, R7
I4 OR R8, R6, R2     I4 OR T4, R6, T2
I5 XOR R2, R4, R11   I5 XOR T5, R4, R11
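
The in-order renaming walked through on these slides can be sketched in a few lines. This is an illustration only: `rename` is a hypothetical helper, the supply of temporary names is assumed to be unlimited, and every instruction is assumed to write exactly one register.

```python
def rename(program):
    """program: list of (name, opcode, dest, src1, src2), in program order.

    Returns the renamed program and the final renaming table.
    """
    table = {}    # architectural register -> current temporary name
    renamed = []
    for name, opcode, dest, src1, src2 in program:
        # Sources use the most recent name; registers never written keep their own.
        new_src1 = table.get(src1, src1)
        new_src2 = table.get(src2, src2)
        # Every write gets a fresh temporary name (T1, T2, ...).
        new_dest = f"T{len(renamed) + 1}"
        table[dest] = new_dest
        renamed.append((name, opcode, new_dest, new_src1, new_src2))
    return renamed, table


program = [
    ("I1", "ADD", "R1", "R2", "R3"),
    ("I2", "SUB", "R2", "R1", "R5"),
    ("I3", "AND", "R5", "R11", "R7"),
    ("I4", "OR", "R8", "R6", "R2"),
    ("I5", "XOR", "R2", "R4", "R11"),
]

renamed, table = rename(program)
for name, opcode, dest, src1, src2 in renamed:
    print(f"{name} {opcode} {dest}, {src1}, {src2}")
print(table)
```

Sources are looked up before the destination is renamed, so an instruction that reads and writes the same register still reads the old name.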
69
Register Names
  • We keep using new names
  • Each name needs a place to keep its value
  • We can have only so many of those places
  • What happens when we run out of names?
  • There must be a way to recycle names
  • When can we recycle a name?
  • When we have given its value to all instructions
    that use it as a source operand!
  • This is not as easy as it sounds

70
Implementing Dynamic Scheduling
  • Tomasulo's Algorithm
  • Used in IBM 360/91 (in the 60s)
  • Tracks when operands are available to satisfy
    data dependences
  • Removes name dependences through register
    renaming
  • Very similar to what is used today

71
Tomasulo's Algorithm: The Picture
72
Tomasulo's Algorithm: Issue
  • Get next instruction from instruction queue.
  • Find a free reservation station for it (if none
    are free, stall until one is)
  • Read operands that are in the registers
  • If the operand is not in the register, find which
    reservation station will produce it
  • In effect, this step renames registers
    (reservation station IDs are temporary names)
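
The issue step can be sketched with dictionaries standing in for the hardware tables. This is a simplified illustration: one shared pool of reservation stations (a real design has separate pools per functional unit), and `issue`, `stations`, `reg_status`, and `reg_file` are hypothetical names.

```python
def issue(instr, stations, reg_status, reg_file):
    """instr: (opcode, dest, src1, src2).

    stations:   tag -> entry dict, or None when the station is free
    reg_status: register -> tag of the station that will produce it
    reg_file:   register -> current value
    Returns the tag used, or None when the instruction must stall.
    """
    free = [tag for tag, entry in stations.items() if entry is None]
    if not free:
        return None  # no free reservation station: stall
    tag = free[0]
    opcode, dest, src1, src2 = instr
    entry = {"op": opcode, "values": {}, "waiting_on": {}}
    for slot, reg in (("j", src1), ("k", src2)):
        if reg in reg_status:
            entry["waiting_on"][slot] = reg_status[reg]  # remember the producer's tag
        else:
            entry["values"][slot] = reg_file[reg]        # value is in the register
    stations[tag] = entry
    reg_status[dest] = tag  # the destination is now named by this station
    return tag


stations = {"RS1": None, "RS2": None}
reg_status = {}
reg_file = {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0, "F10": 0.0}

t1 = issue(("DIVD", "F0", "F2", "F4"), stations, reg_status, reg_file)
t2 = issue(("ADDD", "F10", "F0", "F8"), stations, reg_status, reg_file)
t3 = issue(("SUBD", "F12", "F8", "F14"), stations, reg_status, reg_file)
print(t1, t2, t3)
print(stations["RS2"])
```

The ADDD entry holds the value of F8 but only a tag for F0, which is the renaming effect: it waits for station RS1, not for a register name.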

73
Tomasulo's Algorithm: Execute
  • Monitor results as they are produced
  • Put a result into all reservation stations
    waiting for it (missing source operand)
  • When all operands are available for an
    instruction, it is ready (we can actually execute
    it)
  • Several ready instrs for one functional unit?
  • Pick one.
  • Except for load/store: loads and stores must be
    done in the proper order to avoid hazards through
    memory

74
Tomasulo's Algorithm: Write Result
  • When the result is computed, make it available on
    the common data bus (CDB), where waiting
    reservation stations can pick it up
  • Stores write to memory
  • Result stored in the register file
  • This step frees the reservation station
  • For our register renaming, this recycles the
    temporary name (future instructions can again find
    the value in the actual register, until it is
    renamed again)
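
The write-result broadcast can be sketched in the same dictionary style. This is an illustration with hypothetical names and data layout: each station entry keeps the operand values it already has and the tags it is still waiting on.

```python
def write_result(tag, value, stations, reg_status, reg_file):
    """Broadcast `value` produced by station `tag`, as on a common data bus."""
    # Every waiting reservation station picks the value up.
    for entry in stations.values():
        if entry is None:
            continue
        for slot, producer in list(entry["waiting_on"].items()):
            if producer == tag:
                entry["values"][slot] = value
                del entry["waiting_on"][slot]
    # Registers still renamed to this station receive the value;
    # the temporary name is then recycled.
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            reg_file[reg] = value
            del reg_status[reg]
    stations[tag] = None  # free the reservation station


stations = {
    "RS1": {"op": "DIVD", "values": {"j": 6.0, "k": 3.0}, "waiting_on": {}},
    "RS2": {"op": "ADDD", "values": {"k": 1.0}, "waiting_on": {"j": "RS1"}},
}
reg_status = {"F0": "RS1", "F10": "RS2"}
reg_file = {"F0": 0.0, "F10": 0.0}

write_result("RS1", 2.0, stations, reg_status, reg_file)
print(stations)
print(reg_status, reg_file)
```

If a later instruction had already renamed F0 to another station, `reg_status` would no longer point at RS1 and the register file would be left alone, which is how stale writes are suppressed.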

75
Tomasulo's Algorithm: Load/Store
  • The reservation stations take care of dependences
    through registers.
  • Dependences also possible through memory
  • Stores cannot be reordered with respect to other
    load/store operations to the same address
  • Example
  • Can I3 execute before I2?
  • Not if R3 is -100! (then 100(R1) and (R2) are the
    same address)

I1 ADD R1, R2, R3
I2 ST R4, 100(R1)
I3 LD R4, (R2)
76
Tomasulo's Algorithm: Load/Store
  • Load
  • Wait for all previous stores to compute address
  • If any store to the same address, wait for it to
    actually write to memory
  • Alternatively, just get the value of the last
    such store
  • Store
  • Wait for all previous loads and stores to compute
    addresses
  • If any load/store from/to the same address, wait
    for it to read/write
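
The ordering rules above reduce to one check per memory operation. This is a minimal sketch with a hypothetical data layout: each earlier access is a dictionary with its address (`None` until computed) and a `done` flag; the store-to-load forwarding alternative is not modeled.

```python
def may_execute(addr, earlier_accesses):
    """Decide whether a load or store to `addr` may go ahead.

    For a load, pass the earlier stores; for a store, pass the earlier
    loads and stores. Each access has 'addr' (None until computed)
    and 'done' (True once it has read or written memory).
    """
    for access in earlier_accesses:
        if access["addr"] is None:
            return False  # an earlier address is still unknown
        if access["addr"] == addr and not access["done"]:
            return False  # same address: wait for the earlier access
    return True


unknown_store = {"addr": None, "done": False}
pending_store = {"addr": 200, "done": False}
finished_store = {"addr": 200, "done": True}

print(may_execute(200, [unknown_store]))    # False: address not yet computed
print(may_execute(200, [pending_store]))    # False: same address, not yet written
print(may_execute(300, [pending_store]))    # True: different address
print(may_execute(200, [finished_store]))   # True: the store already reached memory
```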

77
Tomasulo's Algorithm: Example
  • We need to have
  • Instruction status
  • Not part of HW, but having it makes our life
    easier
  • Reservation stations
  • All fields for each reservation station
  • Register status
  • Which reservation station it is renamed to

Loop: L.D    F0, 0(R1)      ; Load 64-bit FP value
      MUL.D  F4, F0, F2     ; Multiply FP
      S.D    F4, 0(R1)      ; Store 64-bit FP value
      DADDUI R1, R1, -8     ; Add (int) immediate
      BNE    R1, R2, Loop   ; Branch if R1 != R2