Title: CS 4290/6290 Lecture 06: Out-of-order execution, Out-of-order completion (a.k.a. the cool stuff)

1. CS 4290/6290 Lecture 06: Out-of-order execution, Out-of-order completion (a.k.a. the cool stuff)
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, Michael Niemier, and Milos Prvulovic)
2. The Big Picture
- During the past several lectures we've discussed the benefits and hazards of pipelining
- Major benefit of pipelining:
  - Allows instruction execution to overlap if instructions are independent of each other
  - Instructions are evaluated in pseudo-parallel
- Now we extend many pipeline fundamentals:
  - Reduce the impact of data and control hazards
  - Increase the amount of parallelism that can be extracted statically or dynamically
3. The medium-sized picture
- Performance of a pipeline is essentially judged by the number of clock cycles it takes to execute an instruction (CPI)
- Ideally, if one instruction were issued every cycle, the average CPI for a perfect pipeline would be 1
- But hazards and stalls slow this down to:
  - Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
- We now study techniques to help reduce the bad terms above:
  - Loop unrolling, pipeline scheduling, scoreboarding, register renaming, branch prediction, etc.
4. Instruction-level parallelism (ILP)
- In discussions of ways to increase parallelism, we'll often try to do it within a basic block
  - Basic block: a set of instructions with no branches except for entry to, and exit from, it
- Generally, it's pretty small
  - (recall branch instructions occur 15-20% of the time, so a basic block is usually about 6-7 instructions)
- To obtain better performance than what we've already seen, we must exploit ILP across multiple basic blocks
5. Loop-level parallelism
- Lots of programs do the same thing over and over again on different sets of data (i.e. a loop)
- Problems for parallelism arise with a loop like this:
  - for (i = 1; i < 1000; i++)
  -     x[i] = x[i] + y[i];
- When translated to assembly code, we probably have:
  - A load or two, an add instruction, and maybe a store
  - Then you'll have a branch to start the loop again
- In other words, the basic block size is SMALL
- Within such loops, there is very little opportunity for overlap
6. Basic pipeline scheduling
- Before talking about ways to find more ILP, let's first review and set the stage
- To keep the pipeline full, parallelism among instructions must be found:
  - find unrelated instructions that can overlap in the pipeline
  - (i.e. let's avoid RAW hazards)
- If an instruction is dependent on another instruction, it must be separated from it
  - by a distance in clock cycles equal to the pipeline latency of the source instruction
7. Assumed latencies
- The following latencies will be assumed for a few examples
- A standard integer pipeline is also assumed, so branches have a delay of 1 clock cycle
- Functional units are fully pipelined OR replicated, so no structural hazards ensue
  - (So, we're assuming multiple functional units as before)
8. Loop unrolling (part 1)
- We're going to walk through the unrolling of a simple for loop to show how we might gain ILP
- Our candidate for loop is:
  - for (i = 1; i < 1000; i++)
  -     x[i] = x[i] + y[i];
- This loop is parallel because the body of each iteration is independent (more pathological cases later)
- To unroll this loop, let's look at some assembly:

Loop:  LD    F0, 0(R1)     ; F0 = array element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result
       SUBI  R1, R1, 8     ; decrement pointer by 8 bytes (per data word)
       BNEZ  R1, Loop      ; branch if R1 != 0
9. Loop unrolling (part 2)
(Board?)

With no scheduling, our loop would execute like this:

Loop:  LD    F0, 0(R1)     ; 1
       stall               ; 2
       ADDD  F4, F0, F2    ; 3
       stall               ; 4
       stall               ; 5
       SD    0(R1), F4     ; 6
       SUBI  R1, R1, 8     ; 7
       stall               ; 8
       BNEZ  R1, Loop      ; 9
       stall               ; 10

Originally, there were 5 instructions. Now there are 5 stalls as well, and 10 total clock cycles are required per iteration.

With scheduling, we can reduce the number of stalls:

Loop:  LD    F0, 0(R1)
       SUBI  R1, R1, 8
       ADDD  F4, F0, F2
       stall
       BNEZ  R1, Loop      ; delayed branch
       SD    8(R1), F4     ; store altered and interchanged with SUBI

Execution time has been reduced from 10 clock cycles to 6.
10. Loop unrolling (part 3)
- From the last slide, we see that:
  - Running one iteration of the loop takes 6 clock cycles
  - But real work is only performed on 3 of those cycles: the LD, ADDD, and SD instructions
  - The other 3 clock cycles are devoted to a stall and the loop overhead instructions (the SUBI and BNEZ)
  - i.e. only half of the instructions perform useful work
- Note: this is a bad thing
- Loop unrolling increases the number of real-work instructions relative to the loop-overhead instructions
  - We replicate the loop body and adjust the termination code
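The source-level transformation can be sketched in a high-level language. Below is a minimal sketch (Python stands in for the C loop on the slide; the second, "cleanup" loop is my own addition for array lengths that are not a multiple of the unroll factor):

```python
def saxpy_rolled(x, y):
    # one element per iteration: every element pays the full loop overhead
    for i in range(len(x)):
        x[i] = x[i] + y[i]

def saxpy_unrolled4(x, y):
    # unrolled by 4: one index update and one loop test per 4 elements
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i]     = x[i]     + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    while i < n:                 # cleanup when n is not a multiple of 4
        x[i] = x[i] + y[i]
        i += 1
```

Both functions compute the same result; the unrolled one simply amortizes the overhead, which is exactly what the assembly on the next slides does with SUBI and BNEZ.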
11. Loop unrolling (part 4)
Loop unrolled 4 times:

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       LD    F6, -8(R1)
       ADDD  F8, F6, F2
       SD    -8(R1), F8
       LD    F10, -16(R1)
       ADDD  F12, F10, F2
       SD    -16(R1), F12
       LD    F14, -24(R1)
       ADDD  F16, F14, F2
       SD    -24(R1), F16
       SUBI  R1, R1, 32
       BNEZ  R1, Loop

- 3 branches and 3 decrements of R1 have been eliminated by copying instructions
- This loop has not been scheduled, so every operation is followed by a dependent instruction
- This loop will take 28 clock cycles to run:
  - Each LD has 1 stall (4 stalls)
  - Each ADDD has 2 stalls (8 stalls)
  - The SUBI has 1 stall
  - The BNEZ has 1 stall
  - And there are 14 instruction cycles
- More registers are required!
12. Loop unrolling (part 5)
Loop unrolled 4 times (and scheduled!):

Loop:  LD    F0, 0(R1)
       LD    F6, -8(R1)
       LD    F10, -16(R1)
       LD    F14, -24(R1)
       ADDD  F4, F0, F2
       ADDD  F8, F6, F2
       ADDD  F12, F10, F2
       ADDD  F16, F14, F2
       SD    0(R1), F4
       SD    -8(R1), F8
       SUBI  R1, R1, 32
       SD    -16(R1), F12
       BNEZ  R1, Loop
       SD    8(R1), F16    ; 8 - 32 = -24

- Scheduling has eliminated all of the stalls!
- The execution time of the unrolled loop has dropped to a total of 14 clock cycles
- Putting this in perspective:
  - Only 3.5 clock cycles per iteration
  - In the previous example, 7 were needed
  - And before loop unrolling, 6 were needed
- Loop unrolling is a very useful technique and is not inherently tied to a specific processor implementation
13. Dependencies
- Remember:
  - To exploit ILP, we (the compiler, the hardware, whoever) must determine which instructions can execute in parallel
- Instructions are parallel if they can execute at the same time in a pipeline without causing any stalls
- 2 instructions that are dependent are not parallel and cannot be reordered
- 3 kinds of dependencies to consider:
  - Data dependencies
  - Name dependencies
  - Control dependencies
14. Data dependencies
- An instruction j is data dependent on instruction i if either of the following is true:
  - Instruction i produces a result used by instruction j
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
- If 2 instructions are data dependent, they cannot execute at the same time or completely overlap
  - The result would be at least one RAW hazard
- For code to execute correctly, the original data dependence must be preserved during execution
15. What data dependencies really are
- Data dependencies are properties of the assembly code; that's where they exist in the program!
- A data dependency doesn't necessarily have to cause stalls or hazards
  - The organization of the pipeline determines whether or not this actually happens
- To summarize:
  - A dependence indicates the possibility of a hazard
  - A dependence determines the order in which results must be calculated
  - A dependence sets an upper bound on how much parallelism can be exploited
16. Data dependence example
An example of a dependence chain:

Loop:  LD    F0, 0(R1)     ; F0 = array element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result

- When data flow dependencies occur through registers, detecting them is a relatively easy process
- Problems arise with memory locations:
  - 100(R4) and 20(R6) may be identical
  - 20(R4) and 20(R4) may be different because R4 has changed
  - (Remember this; it'll come back real soon)
- It may be possible to avoid hazards and maintain dependencies by transforming code
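The memory-aliasing point above can be made concrete with a little effective-address arithmetic (the register values below are made up for illustration):

```python
# Hypothetical register contents, chosen so the two operands collide.
R4 = 1000
R6 = 1080

addr_a = 100 + R4   # effective address of 100(R4)
addr_b = 20 + R6    # effective address of 20(R6)

# Different-looking operands, identical memory location: reordering a
# store to one with a load from the other would be incorrect, and the
# hardware cannot know this until both addresses are computed.
same_location = (addr_a == addr_b)
```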
17. Name dependencies
- A name dependence occurs when 2 instructions use the same register or memory location (a "name") but there is no flow of data between the 2 instructions
- There are 2 types:
  - Antidependencies
    - Occur when an instruction j writes a register or memory location that instruction i reads, and i is executed first
    - Corresponds to a WAR hazard
  - Output dependencies
    - Occur when instruction i and instruction j write the same register or memory location
    - Protected against by checking for WAW hazards
18. More on name dependencies
- Note the difference from data dependencies:
  - No value is transmitted between instructions here
- Resources get reused; we need to ensure that each instruction gets its value from the right resource before it is reused
- VERY IMPORTANT:
  - Instructions with name dependencies could be executed at the same time or reordered, if we avoid the conflicts
- Example:

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       LD    F0, -8(R1)
       ADDD  F4, F0, F2

The LD/LD dependence is an output dependence (WAW). The ADDD/LD dependence is an antidependence (WAR).
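The register cases can be checked mechanically. A minimal sketch (encoding each instruction as a (destination, set-of-sources) pair is my own convention for illustration):

```python
def classify(earlier, later):
    """Return the hazard types between two instructions, earlier first.

    Each instruction is a (dest, sources) pair of register names.
    """
    deps = set()
    if earlier[0] in later[1]:
        deps.add("RAW")   # true data dependence
    if later[0] in earlier[1]:
        deps.add("WAR")   # antidependence (a name dependence)
    if later[0] == earlier[0]:
        deps.add("WAW")   # output dependence (a name dependence)
    return deps
```

Applied to the example above: the two LDs share only a destination (WAW), while ADDD followed by the second LD reads F0 before it is overwritten (WAR).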
19. Control dependencies
- A control dependence determines the ordering of an instruction with respect to a branch instruction, so that a non-branch instruction is executed only when it should be
- Often arise due to conditional statements (i.e. ifs)
- 2 constraints on control dependencies:
  - If an instruction is control dependent on a branch, it cannot be moved before the branch
    - Because it would no longer be controlled by the branch!
  - You might see this next one coming:
    - If an instruction is not controlled by a branch, it can't be moved so that it is
20. Control dependencies do what?
- They make sure instructions execute in order
  - Well, this is kinda true. There's an exception to every rule, right?
- Sometimes we can allow an instruction to execute even when it should not have, as long as its execution doesn't hurt anything
- Control dependence is not the critical property to preserve; exception behavior and data flow are
- Preserving data flow:
  - Makes sure that the instructions that produce results and the instructions that consume them get the right data at the right time
21. Dynamic scheduling
- Allows some data hazards to be overcome in HW
  - The compiler is SW that statically schedules instructions to avoid hazards, dependencies, etc.
  - In dynamic scheduling, HW rearranges instructions to avoid stalls
- Some advantages of dynamic scheduling:
  - Dependencies not known at compile time may be resolved and stalls avoided
  - The compiler doesn't have to do as much work
  - Code compiled for one pipeline may run well on another
- The catch: we'll need some more hardware
22. The idea behind dynamic scheduling
- In all of the pipelines we've studied so far, instructions have been issued in order
  - If an instruction is stalled, the pipeline is stalled
- Example:
  - DIVD  F0, F2, F4
  - ADDD  F10, F0, F8
  - SUBD  F12, F8, F14
- The ADDD instruction is data dependent on DIVD, but the SUBD instruction could execute
- One way to eliminate this problem is to NOT require instructions to execute in order
23. Out-of-order execution means...
- In previous discussions, both structural and data hazards were checked in the ID stage
- But to execute SUBD early (for example), we have to break up the issue stage into:
  - Decoding the instruction and checking for structural hazards
  - Waiting for the absence of a data hazard and then reading operands
- Instructions can still be issued in order, and structural hazards can be checked for at that time
- But we want instructions to start executing as soon as their operands are ready
- So, out-of-order execution means out-of-order completion
24. Scoreboarding
- Scoreboarding allows instructions to execute out of order when sufficient resources are available and there are no data dependencies
- Now, a case study: the CDC 6600 scoreboard
- The goal of the scoreboard is to try to ensure that one instruction is executed each clock cycle
  - So if 1 instruction is stalled, it tries to find another to start
- But WAR hazards are possible now:

DIVD  F0, F2, F4
ADDD  F10, F0, F8   ; (reads F8)
SUBD  F8, F8, F14   ; (writes F8)

If the pipeline executes SUBD before ADDD, we have an antidependence.
25. The CDC 6600
- The CDC 6600 has 16 separate functional units
  - 4 floating-point units, 5 units for memory references, and 7 units for integer operations
- We'll simplify a bit and assume 2 multipliers, one FP adder, one FP divider, and an integer unit for all memory references, branches, and integer ops
- Every instruction goes through the scoreboard, where a record of data dependencies is constructed
- The scoreboard determines when an operation can read its operands and begin execution
  - If it cannot issue immediately, the scoreboard monitors operand availability and issues the instruction when it is ready
26. Scoreboard stage 1: Issue
- If a functional unit for the instruction is free and no other active instruction has the same destination register, the instruction is issued to its functional unit
  - The instruction's info is recorded in the scoreboard
  - WAW hazards are avoided this way
- Otherwise, the pipeline stalls and no other instructions will issue until the hazard is cleared
- There's a buffer between the instruction fetch and issue stages of the pipeline
  - It may be a single entry or a queue
  - (if it's a queue, instructions may continue to be fetched, but not issued)
27. Scoreboard stage 2: Read Operands
- The scoreboard monitors the availability of operands for a given instruction
- Operands are said to be available if:
  - No earlier issued instruction is going to write them, and
  - No register containing an operand is currently being written by an active functional unit
- When the operands are available, the issued instruction is told by the scoreboard to read its registers and begin execution
- RAW hazards are resolved here, and in this way instructions may be executed out of order
28. Scoreboard stage 3: Execution
- The instruction's functional unit begins execution when it receives its operands
- Upon completion, the functional unit notifies the scoreboard that it has finished
29. Scoreboard stage 4: Write result
- After execution, the scoreboard checks for a WAR hazard
  - If one is present, the completing instruction is stalled
- Generally, a completing instruction cannot be allowed to write its result if:
  - There is an instruction that was issued before the completing instruction but has not yet read its operands,
  - AND one of those operands is the same register as the result of the completing instruction
- If there is no WAR hazard, or after it is cleared, the scoreboard allows the destination register to be written
30. The parts of a scoreboard
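The scoreboard's bookkeeping is usually drawn as three tables. A minimal sketch of their shape (the field names Fi/Fj/Fk, Qj/Qk, and Rj/Rk follow the classic textbook presentation of the CDC 6600 scoreboard; the Python representation itself is my own assumption):

```python
def make_scoreboard(functional_units):
    # Instruction status: which of the 4 stages each instruction has finished.
    # Functional-unit status: busy flag, operation, destination register (Fi),
    #   source registers (Fj, Fk), units producing them (Qj, Qk), and
    #   ready flags (Rj, Rk).
    # Register result status: which unit, if any, will write each register.
    return {
        "instruction_status": [],
        "fu_status": {
            u: {"busy": False, "op": None,
                "Fi": None, "Fj": None, "Fk": None,
                "Qj": None, "Qk": None,
                "Rj": False, "Rk": False}
            for u in functional_units
        },
        "register_result": {},
    }
```

The four stages on the previous slides are implemented as reads and updates of these three tables.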
31. A scoreboard example
This example is from notes prepared by David Patterson and Randy Katz.
32. A scoreboard example: Cycle 1
33. A scoreboard example: Cycle 2
Issue 2nd Load?
34. A scoreboard example: Cycle 3
Issue MULT?
35. A scoreboard example: Cycle 4
36. A scoreboard example: Cycle 5
37. A scoreboard example: Cycle 6
38. A scoreboard example: Cycle 7
Read ADDD/SUBD operands? Issue ADDD?
39. A scoreboard example: Cycle 8a
40. A scoreboard example: Cycle 8b
41. A scoreboard example: Cycle 9
Read MULT/SUBD operands? Issue ADDD?
42. A scoreboard example: Cycle 11
43. A scoreboard example: Cycle 12
Read DIVD operands?
44. A scoreboard example: Cycle 13
45. A scoreboard example: Cycle 14
46. A scoreboard example: Cycle 15
47. A scoreboard example: Cycle 16
48. A scoreboard example: Cycle 17
Write result of ADDD?
49. A scoreboard example: Cycle 18
50. A scoreboard example: Cycle 19
51. A scoreboard example: Cycle 20
52. A scoreboard example: Cycle 21
53. A scoreboard example: Cycle 22
54. A scoreboard example: Cycle 61
55. A scoreboard example: Cycle 62
56. Scoreboarding conclusions
- The scoreboard uses available ILP to minimize stalls that arise from a program's true data dependencies
- But the scoreboard is limited by several factors:
  - Instructions in a basic block are only so parallel
    - If each instruction depends on its predecessor, no dynamic scheduling can reduce stalls
  - The scoreboard itself has a finite size
    - It can look only in some window for potentially executable instructions
  - The number and types of functional units
    - (which determine structural hazards)
  - Antidependencies and output dependencies
    - (WAR, WAW hazards)
57. Scheduling
- Finds instructions to execute in each cycle
  - Static (in-order) scheduling: looks only at the next instruction
  - Dynamic (out-of-order) scheduling: looks at a window of instructions
- How many instructions are we looking for?
  - 3-4 is typical today; 8 is in the works
  - A CPU that can ideally do N instructions per cycle is called "N-way superscalar", "N-issue superscalar", or simply "N-way" or "N-issue"
58. Static Scheduling
- Cycle 1:
  - Start I1.
  - Can we also start I2? No.
- Cycle 2:
  - Start I2.
  - Can we also start I3? Yes.
  - Can we also start I4? No.
- If the next instruction can not start, the scheduler stops looking for things to do in this cycle!

Program code:
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR  R8, R2, R6
I5: XOR R10, R2, R11
59. Dynamic Scheduling
- Cycle 1:
  - Operands ready? I1, I5.
  - Start I1, I5.
- Cycle 2:
  - Operands ready? I2, I3.
  - Start I2, I3.
- Window size (W): how many instructions ahead we look
  - Do not confuse with issue width (N)
  - E.g. a 4-issue out-of-order processor can have a 128-entry window (it can look at the next 128 instructions)

Program code:
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR  R8, R2, R6
I5: XOR R10, R2, R11
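The interplay of width and window can be simulated in a few lines. This is a sketch under strong assumptions: one-cycle latency, only RAW dependences block issue (WAR/WAW assumed removed by renaming), and the (dest, sources) instruction encoding is my own convention:

```python
def ooo_schedule(instrs, width, window):
    """Greedy issue of (dest, sources) instructions, 1-cycle latency.

    Only RAW dependences block issue; WAR/WAW are assumed renamed away.
    Returns a list of cycles; each cycle lists the indices issued.
    """
    remaining = list(range(len(instrs)))
    cycles = []
    while remaining:
        issued = []
        for idx in remaining[:window]:          # look only W instructions ahead
            if len(issued) == width:            # issue at most N per cycle
                break
            _, sources = instrs[idx]
            # ready if no earlier, not-yet-finished instruction writes a source
            if all(instrs[k][0] not in sources for k in remaining if k < idx):
                issued.append(idx)
        cycles.append(issued)
        remaining = [k for k in remaining if k not in issued]
    return cycles
```

With the slide's five instructions, a 2-issue machine with a large window starts I1 and I5 together in cycle 1, exactly as on the slide; shrinking the window to the issue width approximates static, in-order scheduling.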
60. Dynamic Scheduling Pipeline
- Fetch gets the next few instructions (reads the instruction stream in order)
- Decode decodes the instructions fetched in the previous cycle (in order)
- Then we can start looking at instructions and trying to execute them out of order
- Important: we fetch and decode in order, even in an out-of-order processor
61. Register Renaming
- Name dependences:
  - I3 can not go before I2, because I3 would overwrite R5
  - I5 can not go before I2, because I2, when it goes, would overwrite R2 with a stale value
- These are "name" dependences because the dependence is due to the register name, not the flow of data

Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND R5, R11, R7
I4: OR  R8, R6, R2
I5: XOR R2, R4, R11
62. Register Renaming
- Solution: give I3 some other name (e.g. S) for the value it produces
- But if a later instruction uses that value, we must also change its source to S
- In fact, all uses of R5 from I3 up to the next instruction that writes R5 must now be changed to S!
- We get rid of output dependences in the same way: change R2 in I5 (and subsequent instructions) to T

Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND R5, R11, R7
I4: OR  R8, R6, R2
I5: XOR R2, R4, R11
63. Register Renaming
- Implementation:
  - Space for T, S, etc.
  - How do we know when to rename a register?
- Simple solution:
  - Do renaming in order, just after decoding
  - Change the name of a register each time we decode an instruction that will write to it
  - Remember what name we gave it!

Program code (after renaming I3 and I5):
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND S, R11, R7
I4: OR  R8, R6, R2
I5: XOR T, R4, R11
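The rename-at-decode rule can be sketched directly. The three-register tuple encoding is an assumed convention for illustration; temporary names T1, T2, ... are handed out in decode order:

```python
def rename(instructions):
    # instructions: (dest, src1, src2) architectural register names, in order
    table = {}        # architectural register -> its current temporary name
    renamed = []
    for n, (dest, src1, src2) in enumerate(instructions, start=1):
        # sources first: read whatever name each register currently maps to
        s1 = table.get(src1, src1)
        s2 = table.get(src2, src2)
        # the destination always gets a fresh name, remembered in the table
        t = "T%d" % n
        table[dest] = t
        renamed.append((t, s1, s2))
    return renamed
```

Running this on the slide's five instructions reproduces the renamed code that the renaming-table walkthrough on the following slides builds up step by step.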
64. Register Renaming Example

Renaming table after decoding I1:
  Original | Renamed | Role
  R1       | T1      | Destination
  R2       | R2      | Source
  R5       | R5      |
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3

65. Register Renaming Example

Renaming table after decoding I2:
  Original | Renamed | Role
  R1       | T1      | Source
  R2       | T2      | Destination
  R5       | R5      | Source
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5

66. Register Renaming Example

Renaming table after decoding I3:
  Original | Renamed | Role
  R1       | T1      |
  R2       | T2      |
  R5       | T3      | Destination
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7

67. Register Renaming Example

Renaming table after decoding I4:
  Original | Renamed
  R1       | T1
  R2       | T2
  R5       | T3
  R8       | T4

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7
  I4: OR  R8, R6, R2    I4: OR  T4, R6, T2

68. Register Renaming Example

Renaming table after decoding I5:
  Original | Renamed
  R1       | T1
  R2       | T5
  R5       | T3
  R8       | T4

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7
  I4: OR  R8, R6, R2    I4: OR  T4, R6, T2
  I5: XOR R2, R4, R11   I5: XOR T5, R4, R11
69. Register Names
- We keep using new names
  - Each name needs a place to keep its value
  - We can have only so many of those places
- What happens when we run out of names?
  - There must be a way to recycle names
- When can we recycle a name?
  - When we have given its value to all instructions that use it as a source operand!
  - This is not as easy as it sounds
70. Implementing Dynamic Scheduling
- Tomasulo's Algorithm
  - Used in the IBM 360/91 (in the '60s)
  - Tracks when operands are available, to satisfy data dependences
  - Removes name dependences through register renaming
  - Very similar to what is used today
71. Tomasulo's Algorithm: The Picture
72. Tomasulo's Algorithm: Issue
- Get the next instruction from the instruction queue
- Find a free reservation station for it (if none are free, stall until one is)
- Read the operands that are already in the registers
  - If an operand is not in the register, find which reservation station will produce it
- In effect, this step renames registers (reservation station IDs are temporary names)
73. Tomasulo's Algorithm: Execute
- Monitor results as they are produced
  - Put a result into all reservation stations waiting for it (as a missing source operand)
- When all operands are available for an instruction, it is ready (we can actually execute it)
- Several ready instructions for one functional unit? Pick one.
  - Except for load/store: loads and stores must be done in the proper order to avoid hazards through memory
74. Tomasulo's Algorithm: Write Result
- When a result is computed, make it available on the common data bus (CDB), where waiting reservation stations can pick it up
- Stores write to memory
- The result is also stored in the register file
- This step frees the reservation station
  - For our register renaming, this recycles the temporary name (future instructions can again find the value in the actual register, until it is renamed again)
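The CDB broadcast step can be sketched as follows. The station and register-file layouts (Qj/Qk tags, Vj/Vk values, a register holding either a tag or a value) are assumptions for illustration; real hardware does all of these compares in parallel in one cycle:

```python
def cdb_broadcast(stations, registers, tag, value):
    # Every reservation station waiting on `tag` captures the value...
    for rs in stations.values():
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # ...and any register currently renamed to `tag` gets the value,
    # which recycles the temporary name.
    for reg, owner in list(registers.items()):
        if owner == tag:
            registers[reg] = value
```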
75. Tomasulo's Algorithm: Load/Store
- The reservation stations take care of dependences through registers
- Dependences are also possible through memory:
  - Stores can not be reordered with respect to other load/store operations to the same address
- Example:
  - Can I3 execute before I2?
  - Not if the addresses match (e.g. if R3 is -100, then 100(R1) and (R2) name the same location)!

I1: ADD R1, R2, R3
I2: ST  R4, 100(R1)
I3: LD  R4, (R2)
76. Tomasulo's Algorithm: Load/Store
- Load:
  - Wait for all previous stores to compute their addresses
  - If any store is to the same address, wait for it to actually write to memory
    - Alternatively, just forward the value of the last such store
- Store:
  - Wait for all previous loads and stores to compute their addresses
  - If any load/store is from/to the same address, wait for it to read/write
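The load rule above can be written as a predicate. Representing each earlier store by its address, with None meaning "address not yet computed", is an assumed encoding for illustration:

```python
def load_may_execute(load_addr, earlier_store_addrs):
    """May a load go ahead of the stores that precede it in program order?

    earlier_store_addrs holds each earlier store's address, or None if
    that store has not computed its address yet.
    """
    for addr in earlier_store_addrs:
        if addr is None:
            return False    # unknown address: it might alias, so wait
        if addr == load_addr:
            return False    # same address: wait for the store to write
    return True             # all addresses known and different: safe
```

The store rule is symmetric: a store must additionally wait for earlier loads from the same address to read their value.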
77. Tomasulo's Algorithm: Example
- We need to keep track of:
  - Instruction status
    - Not part of the HW, but having it makes our life easier
  - Reservation stations
    - All fields for each reservation station
  - Register status
    - Which reservation station each register is renamed to

Loop:  L.D    F0, 0(R1)     ; load 64-bit FP value
       MUL.D  F4, F0, F2    ; multiply FP
       S.D    F4, 0(R1)     ; store 64-bit FP value
       DADDUI R1, R1, -8    ; add (int) immediate
       BNE    R1, R2, Loop  ; branch if R1 != R2