Title: CS 4290/6290 Lecture 06: Out-of-order execution, Out-of-order completion (a.k.a. the cool stuff)

1. CS 4290/6290 Lecture 06: Out-of-order execution, Out-of-order completion (a.k.a. the cool stuff)
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, Michael Niemier, and Milos Prvulovic)
2. The Big Picture
- During the past several lectures we've discussed the benefits and hazards of pipelining
- Major benefit of pipelining:
  - Allows instruction execution to overlap if instructions are independent of each other
  - Instructions are evaluated in pseudo-parallel
- Now we extend many pipeline fundamentals:
  - Reduce the impact of data and control hazards
  - Increase the amount of parallelism that can be extracted statically or dynamically
3. The medium-sized picture
- Performance of a pipeline is essentially judged by the number of clock cycles it takes to execute an instruction (CPI)
- Ideally, if one instruction were issued every cycle, the average CPI for a perfect pipeline would be 1
- But hazards and stalls slow this down to:
  - Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
- We now study techniques to help reduce the bad terms above:
  - Loop unrolling, pipeline scheduling, scoreboarding, register renaming, branch prediction, etc.
4. Instruction-level parallelism (ILP)
- In discussions of ways to increase parallelism, we'll often try to do it within a basic block
  - Basic block: a set of instructions with no branches except for entry to, and exit from, it
- Generally, it's pretty small
  - (recall branch instructions occur 15-20% of the time, so a basic block is usually about 6-7 instructions)
- To obtain better performance than what we've already seen, we must exploit ILP across multiple basic blocks
5. Loop-level parallelism
- Lots of programs do the same thing over and over again on different sets of data (i.e. a loop)
- Problems for parallelism arise with a loop like this:
  - for (i = 1; i < 1000; i++)
  -     x[i] = x[i] + y[i];
- When translated to assembly code, we probably have:
  - A load or two, an add instruction, and maybe a store
  - Then you'll have a branch to start the loop again
- In other words, the basic block size is SMALL
- Within such loops, there is very little opportunity for overlap
6. Basic pipeline scheduling
- Before talking about ways to find more ILP, let's first review and set the stage
- To keep the pipeline full, parallelism among instructions must be found:
  - find unrelated instructions that can overlap in the pipeline
  - (i.e. let's avoid RAW hazards)
- If an instruction is dependent on another instruction, it must be separated from it
  - by a distance in clock cycles equal to the pipeline latency of the source instruction
7. Assumed latencies
- The following latencies will be assumed for a few examples
- A standard integer pipeline is also assumed, so branches have a delay of 1 clock cycle
- Functional units are fully pipelined OR replicated, so no structural hazards ensue
  - (So, we're assuming multiple functional units as before)
8. Loop unrolling (part 1)
- We're going to walk through the unrolling of a simple for loop to show how we might gain ILP
- Our candidate for loop is:
  - for (i = 1; i < 1000; i++)
  -     x[i] = x[i] + y[i];
- This loop is parallel because the body of each iteration is independent (more pathological cases later)
- To unroll this loop, let's look at some assembly:

Loop:  LD    F0, 0(R1)     ; F0 = array element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result
       SUBI  R1, R1, 8     ; decrement pointer by 8 bytes (per data word)
       BNEZ  R1, Loop      ; branch if R1 != 0
9. Loop unrolling (part 2)
(Board?)

With no scheduling, our loop would execute like this:

Loop:  LD    F0, 0(R1)     ; 1
       stall               ; 2
       ADDD  F4, F0, F2    ; 3
       stall               ; 4
       stall               ; 5
       SD    0(R1), F4     ; 6
       SUBI  R1, R1, 8     ; 7
       stall               ; 8
       BNEZ  R1, Loop      ; 9
       stall               ; 10

Originally, there were 5 instructions. Now there are 5 stalls as well, and 10 total clock cycles are required per iteration.

With scheduling, we can reduce the number of stalls:

Loop:  LD    F0, 0(R1)
       SUBI  R1, R1, 8
       ADDD  F4, F0, F2
       stall
       BNEZ  R1, Loop      ; delayed branch
       SD    8(R1), F4     ; store altered and interchanged with SUBI

Execution time has been reduced from 10 clock cycles to 6.
10. Loop unrolling (part 3)
- From the last slide, we see that:
  - Running one iteration of the loop takes 6 clock cycles
  - But real work is only performed on 3 of those cycles: the LD, ADDD, and SD instructions
  - The other 3 clock cycles are devoted to a stall and the loop overhead instructions (the SUBI and BNEZ)
  - i.e. only half of the instructions perform useful work
- Note: this is a bad thing
- Loop unrolling increases the number of real-work instructions relative to the loop-overhead instructions
  - We replicate the loop body and adjust the termination code
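The source-level transformation can be sketched in a high-level language. Below is a minimal sketch (Python stands in for the C loop on the slide; the second, "cleanup" loop is my own addition for array lengths that are not a multiple of the unroll factor):

```python
def saxpy_rolled(x, y):
    # one element per iteration: every element pays the full loop overhead
    for i in range(len(x)):
        x[i] = x[i] + y[i]

def saxpy_unrolled4(x, y):
    # unrolled by 4: one index update and one loop test per 4 elements
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i]     = x[i]     + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    while i < n:                 # cleanup when n is not a multiple of 4
        x[i] = x[i] + y[i]
        i += 1
```

Both functions compute the same result; the unrolled one simply amortizes the overhead, which is exactly what the assembly on the next slides does with SUBI and BNEZ.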
11. Loop unrolling (part 4)
Loop unrolled 4 times:

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       LD    F6, -8(R1)
       ADDD  F8, F6, F2
       SD    -8(R1), F8
       LD    F10, -16(R1)
       ADDD  F12, F10, F2
       SD    -16(R1), F12
       LD    F14, -24(R1)
       ADDD  F16, F14, F2
       SD    -24(R1), F16
       SUBI  R1, R1, 32
       BNEZ  R1, Loop

- 3 branches and 3 decrements of R1 have been eliminated by copying instructions
- This loop has not been scheduled, so every operation is followed by a dependent instruction
- This loop will take 28 clock cycles to run:
  - Each LD has 1 stall (4 stalls)
  - Each ADDD has 2 stalls (8 stalls)
  - The SUBI has 1 stall
  - The BNEZ has 1 stall
  - And there are 14 instruction cycles
- More registers are required!
12. Loop unrolling (part 5)
Loop unrolled 4 times (and scheduled!):

Loop:  LD    F0, 0(R1)
       LD    F6, -8(R1)
       LD    F10, -16(R1)
       LD    F14, -24(R1)
       ADDD  F4, F0, F2
       ADDD  F8, F6, F2
       ADDD  F12, F10, F2
       ADDD  F16, F14, F2
       SD    0(R1), F4
       SD    -8(R1), F8
       SUBI  R1, R1, 32
       SD    -16(R1), F12
       BNEZ  R1, Loop
       SD    8(R1), F16    ; 8 - 32 = -24

- Scheduling has eliminated all of the stalls!
- The execution time of the unrolled loop has dropped to a total of 14 clock cycles
- Putting this in perspective:
  - Only 3.5 clock cycles per iteration
  - In the previous example, 7 were needed
  - And before loop unrolling, 6 were needed
- Loop unrolling is a very useful technique and is not inherently tied to a specific processor implementation
13. Dependencies
- Remember:
  - To exploit ILP, we (the compiler, the hardware, whoever) must determine which instructions can execute in parallel
- Instructions are parallel if they can execute at the same time in a pipeline without causing any stalls
- 2 instructions that are dependent are not parallel and cannot be reordered
- 3 kinds of dependencies to consider:
  - Data dependencies
  - Name dependencies
  - Control dependencies
14. Data dependencies
- An instruction j is data dependent on instruction i if either of the following is true:
  - Instruction i produces a result used by instruction j
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
- If 2 instructions are data dependent, they cannot execute at the same time or completely overlap
  - The result would be at least one RAW hazard
- For code to execute correctly, the original data dependence must be preserved during execution
15. What data dependencies really are
- Data dependencies are properties of the assembly code; that's where they exist in the program!
- A data dependency doesn't necessarily have to cause stalls or hazards
  - The organization of the pipeline determines whether or not this actually happens
- To summarize:
  - A dependence indicates the possibility of a hazard
  - A dependence determines the order in which results must be calculated
  - A dependence sets an upper bound on how much parallelism can be exploited
16. Data dependence example
An example of a dependence chain:

Loop:  LD    F0, 0(R1)     ; F0 = array element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result

- When data flow dependencies occur through registers, detecting them is a relatively easy process
- Problems arise with memory locations:
  - 100(R4) and 20(R6) may be identical
  - 20(R4) and 20(R4) may be different because R4 has changed
  - (Remember this; it'll come back real soon)
- It may be possible to avoid hazards and maintain dependencies by transforming code
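The memory-aliasing point above can be made concrete with a little effective-address arithmetic (the register values below are made up for illustration):

```python
# Hypothetical register contents, chosen so the two operands collide.
R4 = 1000
R6 = 1080

addr_a = 100 + R4   # effective address of 100(R4)
addr_b = 20 + R6    # effective address of 20(R6)

# Different-looking operands, identical memory location: reordering a
# store to one with a load from the other would be incorrect, and the
# hardware cannot know this until both addresses are computed.
same_location = (addr_a == addr_b)
```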
17. Name dependencies
- A name dependence occurs when 2 instructions use the same register or memory location (a "name") but there is no flow of data between the 2 instructions
- There are 2 types:
  - Antidependencies
    - Occur when an instruction j writes a register or memory location that instruction i reads, and i is executed first
    - Corresponds to a WAR hazard
  - Output dependencies
    - Occur when instruction i and instruction j write the same register or memory location
    - Protected against by checking for WAW hazards
18. More on name dependencies
- Note the difference from data dependencies:
  - No value is transmitted between instructions here
- Resources get reused; we need to ensure that each instruction gets its value from the right resource before it is reused
- VERY IMPORTANT:
  - Instructions with name dependencies could be executed at the same time or reordered, if we avoid the conflicts
- Example:

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       LD    F0, -8(R1)
       ADDD  F4, F0, F2

The LD/LD dependence is an output dependence (WAW). The ADDD/LD dependence is an antidependence (WAR).
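The register cases can be checked mechanically. A minimal sketch (encoding each instruction as a (destination, set-of-sources) pair is my own convention for illustration):

```python
def classify(earlier, later):
    """Return the hazard types between two instructions, earlier first.

    Each instruction is a (dest, sources) pair of register names.
    """
    deps = set()
    if earlier[0] in later[1]:
        deps.add("RAW")   # true data dependence
    if later[0] in earlier[1]:
        deps.add("WAR")   # antidependence (a name dependence)
    if later[0] == earlier[0]:
        deps.add("WAW")   # output dependence (a name dependence)
    return deps
```

Applied to the example above: the two LDs share only a destination (WAW), while ADDD followed by the second LD reads F0 before it is overwritten (WAR).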
19. Control dependencies
- A control dependence determines the ordering of an instruction with respect to a branch instruction, so that a non-branch instruction is executed only when it should be
- Often arise due to conditional statements (i.e. ifs)
- 2 constraints on control dependencies:
  - If an instruction is control dependent on a branch, it cannot be moved before the branch
    - Because it would no longer be controlled by the branch!
  - You might see this next one coming:
    - If an instruction is not controlled by a branch, it can't be moved so that it is
20. Control dependencies do what?
- They make sure instructions execute in order
  - Well, this is kinda true. There's an exception to every rule, right?
- Sometimes we can allow an instruction to execute even when it should not have, as long as its execution doesn't hurt anything
- Control dependence is not the critical property to preserve; exception behavior and data flow are
- Preserving data flow:
  - Makes sure that the instructions that produce results and the instructions that consume them get the right data at the right time
21. Dynamic scheduling
- Allows some data hazards to be overcome in HW
  - The compiler is SW that statically schedules instructions to avoid hazards, dependencies, etc.
  - In dynamic scheduling, HW rearranges instructions to avoid stalls
- Some advantages of dynamic scheduling:
  - Dependencies not known at compile time may be resolved and stalls avoided
  - The compiler doesn't have to do as much work
  - Code compiled for one pipeline may run well on another
- The catch: we'll need some more hardware
22. The idea behind dynamic scheduling
- In all of the pipelines we've studied so far, instructions have been issued in order
  - If an instruction is stalled, the pipeline is stalled
- Example:
  - DIVD  F0, F2, F4
  - ADDD  F10, F0, F8
  - SUBD  F12, F8, F14
- The ADDD instruction is data dependent on DIVD, but the SUBD instruction could execute
- One way to eliminate this problem is to NOT require instructions to execute in order
23. Out-of-order execution means...
- In previous discussions, both structural and data hazards were checked in the ID stage
- But to execute SUBD early (for example), we have to break up the issue stage into:
  - Decoding the instruction and checking for structural hazards
  - Waiting for the absence of a data hazard and then reading operands
- Instructions can still be issued in order, and structural hazards can be checked for at that time
- But we want instructions to start executing as soon as their operands are ready
- So, out-of-order execution means out-of-order completion
24. Scoreboarding
- Scoreboarding allows instructions to execute out of order when sufficient resources are available and there are no data dependencies
- Now, a case study: the CDC 6600 scoreboard
- The goal of the scoreboard is to try to ensure that one instruction is executed each clock cycle
  - So if 1 instruction is stalled, it tries to find another to start
- But WAR hazards are possible now:

DIVD  F0, F2, F4
ADDD  F10, F0, F8   ; (reads F8)
SUBD  F8, F8, F14   ; (writes F8)

If the pipeline executes SUBD before ADDD, we have an antidependence.
25. The CDC 6600
- The CDC 6600 has 16 separate functional units
  - 4 floating-point units, 5 units for memory references, and 7 units for integer operations
- We'll simplify a bit and assume 2 multipliers, one FP adder, one FP divider, and an integer unit for all memory references, branches, and integer ops
- Every instruction goes through the scoreboard, where a record of data dependencies is constructed
- The scoreboard determines when an operation can read its operands and begin execution
  - If it cannot issue immediately, the scoreboard monitors operand availability and issues the instruction when it is ready
26. Scoreboard stage 1: Issue
- If a functional unit for the instruction is free and no other active instruction has the same destination register, the instruction is issued to its functional unit
  - The instruction's info is recorded in the scoreboard
  - WAW hazards are avoided this way
- Otherwise, the pipeline stalls and no other instructions will issue until the hazard is cleared
- There's a buffer between the instruction fetch and issue stages of the pipeline
  - It may be a single entry or a queue
  - (if it's a queue, instructions may continue to be fetched, but not issued)
27. Scoreboard stage 2: Read Operands
- The scoreboard monitors the availability of operands for a given instruction
- Operands are said to be available if:
  - No earlier issued instruction is going to write them, and
  - No register containing an operand is currently being written by an active functional unit
- When the operands are available, the issued instruction is told by the scoreboard to read its registers and begin execution
- RAW hazards are resolved here, and in this way instructions may be executed out of order
28. Scoreboard stage 3: Execution
- The instruction's functional unit begins execution when it receives its operands
- Upon completion, the functional unit notifies the scoreboard that it has finished
29. Scoreboard stage 4: Write result
- After execution, the scoreboard checks for a WAR hazard
  - If one is present, the completing instruction is stalled
- Generally, a completing instruction cannot be allowed to write its result if:
  - There is an instruction that was issued before the completing instruction but has not yet read its operands,
  - AND one of those operands is the same register as the result of the completing instruction
- If there is no WAR hazard, or after it is cleared, the scoreboard allows the destination register to be written
30. The parts of a scoreboard
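The scoreboard's bookkeeping is usually drawn as three tables. A minimal sketch of their shape (the field names Fi/Fj/Fk, Qj/Qk, and Rj/Rk follow the classic textbook presentation of the CDC 6600 scoreboard; the Python representation itself is my own assumption):

```python
def make_scoreboard(functional_units):
    # Instruction status: which of the 4 stages each instruction has finished.
    # Functional-unit status: busy flag, operation, destination register (Fi),
    #   source registers (Fj, Fk), units producing them (Qj, Qk), and
    #   ready flags (Rj, Rk).
    # Register result status: which unit, if any, will write each register.
    return {
        "instruction_status": [],
        "fu_status": {
            u: {"busy": False, "op": None,
                "Fi": None, "Fj": None, "Fk": None,
                "Qj": None, "Qk": None,
                "Rj": False, "Rk": False}
            for u in functional_units
        },
        "register_result": {},
    }
```

The four stages on the previous slides are implemented as reads and updates of these three tables.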
31. A scoreboard example
This example is from notes prepared by David Patterson and Randy Katz.
32. A scoreboard example: Cycle 1
33. A scoreboard example: Cycle 2
Issue 2nd Load?
34. A scoreboard example: Cycle 3
Issue MULT?
35. A scoreboard example: Cycle 4
36. A scoreboard example: Cycle 5
37. A scoreboard example: Cycle 6
38. A scoreboard example: Cycle 7
Read ADDD/SUBD operands? Issue ADDD?
39. A scoreboard example: Cycle 8a
40. A scoreboard example: Cycle 8b
41. A scoreboard example: Cycle 9
Read MULT/SUBD operands? Issue ADDD?
42. A scoreboard example: Cycle 11
43. A scoreboard example: Cycle 12
Read DIVD operands?
44. A scoreboard example: Cycle 13
45. A scoreboard example: Cycle 14
46. A scoreboard example: Cycle 15
47. A scoreboard example: Cycle 16
48. A scoreboard example: Cycle 17
Write result of ADDD?
49. A scoreboard example: Cycle 18
50. A scoreboard example: Cycle 19
51. A scoreboard example: Cycle 20
52. A scoreboard example: Cycle 21
53. A scoreboard example: Cycle 22
54. A scoreboard example: Cycle 61
55. A scoreboard example: Cycle 62
56. Scoreboarding conclusions
- The scoreboard uses available ILP to minimize stalls that arise from a program's true data dependencies
- But the scoreboard is limited by several factors:
  - Instructions in a basic block are only so parallel
    - If each instruction depends on its predecessor, no dynamic scheduling can reduce stalls
  - The scoreboard itself has a finite size
    - It can look only in some window for potentially executable instructions
  - The number and types of functional units
    - (which determine structural hazards)
  - Antidependencies and output dependencies
    - (WAR, WAW hazards)
57. Scheduling
- Finds instructions to execute in each cycle
  - Static (in-order) scheduling: looks only at the next instruction
  - Dynamic (out-of-order) scheduling: looks at a window of instructions
- How many instructions are we looking for?
  - 3-4 is typical today; 8 is in the works
  - A CPU that can ideally do N instructions per cycle is called "N-way superscalar", "N-issue superscalar", or simply "N-way" or "N-issue"
58. Static Scheduling
- Cycle 1:
  - Start I1.
  - Can we also start I2? No.
- Cycle 2:
  - Start I2.
  - Can we also start I3? Yes.
  - Can we also start I4? No.
- If the next instruction can not start, the scheduler stops looking for things to do in this cycle!

Program code:
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR  R8, R2, R6
I5: XOR R10, R2, R11
59. Dynamic Scheduling
- Cycle 1:
  - Operands ready? I1, I5.
  - Start I1, I5.
- Cycle 2:
  - Operands ready? I2, I3.
  - Start I2, I3.
- Window size (W): how many instructions ahead we look
  - Do not confuse with issue width (N)
  - E.g. a 4-issue out-of-order processor can have a 128-entry window (it can look at the next 128 instructions)

Program code:
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5
I3: AND R6, R1, R7
I4: OR  R8, R2, R6
I5: XOR R10, R2, R11
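The interplay of width and window can be simulated in a few lines. This is a sketch under strong assumptions: one-cycle latency, only RAW dependences block issue (WAR/WAW assumed removed by renaming), and the (dest, sources) instruction encoding is my own convention:

```python
def ooo_schedule(instrs, width, window):
    """Greedy issue of (dest, sources) instructions, 1-cycle latency.

    Only RAW dependences block issue; WAR/WAW are assumed renamed away.
    Returns a list of cycles; each cycle lists the indices issued.
    """
    remaining = list(range(len(instrs)))
    cycles = []
    while remaining:
        issued = []
        for idx in remaining[:window]:          # look only W instructions ahead
            if len(issued) == width:            # issue at most N per cycle
                break
            _, sources = instrs[idx]
            # ready if no earlier, not-yet-finished instruction writes a source
            if all(instrs[k][0] not in sources for k in remaining if k < idx):
                issued.append(idx)
        cycles.append(issued)
        remaining = [k for k in remaining if k not in issued]
    return cycles
```

With the slide's five instructions, a 2-issue machine with a large window starts I1 and I5 together in cycle 1, exactly as on the slide; shrinking the window to the issue width approximates static, in-order scheduling.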
60. Dynamic Scheduling Pipeline
- Fetch gets the next few instructions (reads the instruction stream in order)
- Decode decodes the instructions fetched in the previous cycle (in order)
- Then we can start looking at instructions and trying to execute them out of order
- Important: we fetch and decode in order, even in an out-of-order processor
61. Register Renaming
- Name dependences:
  - I3 can not go before I2, because I3 would overwrite R5
  - I5 can not go before I2, because I2, when it goes, would overwrite R2 with a stale value
- These are "name" dependences because the dependence is due to the register name, not the flow of data

Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND R5, R11, R7
I4: OR  R8, R6, R2
I5: XOR R2, R4, R11
62. Register Renaming
- Solution: give I3 some other name (e.g. S) for the value it produces
- But if a later instruction uses that value, we must also change its source to S
- In fact, all uses of R5 from I3 up to the next instruction that writes R5 must now be changed to S!
- We get rid of output dependences in the same way: change R2 in I5 (and subsequent instructions) to T

Program code:
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND R5, R11, R7
I4: OR  R8, R6, R2
I5: XOR R2, R4, R11
63. Register Renaming
- Implementation:
  - Space for T, S, etc.
  - How do we know when to rename a register?
- Simple solution:
  - Do renaming in order, just after decoding
  - Change the name of a register each time we decode an instruction that will write to it
  - Remember what name we gave it!

Program code (after renaming I3 and I5):
I1: ADD R1, R2, R3
I2: SUB R2, R1, R5
I3: AND S, R11, R7
I4: OR  R8, R6, R2
I5: XOR T, R4, R11
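The rename-at-decode rule can be sketched directly. The three-register tuple encoding is an assumed convention for illustration; temporary names T1, T2, ... are handed out in decode order:

```python
def rename(instructions):
    # instructions: (dest, src1, src2) architectural register names, in order
    table = {}        # architectural register -> its current temporary name
    renamed = []
    for n, (dest, src1, src2) in enumerate(instructions, start=1):
        # sources first: read whatever name each register currently maps to
        s1 = table.get(src1, src1)
        s2 = table.get(src2, src2)
        # the destination always gets a fresh name, remembered in the table
        t = "T%d" % n
        table[dest] = t
        renamed.append((t, s1, s2))
    return renamed
```

Running this on the slide's five instructions reproduces the renamed code that the renaming-table walkthrough on the following slides builds up step by step.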
64. Register Renaming Example

Renaming table after decoding I1:
  Original | Renamed | Role
  R1       | T1      | Destination
  R2       | R2      | Source
  R5       | R5      |
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3

65. Register Renaming Example

Renaming table after decoding I2:
  Original | Renamed | Role
  R1       | T1      | Source
  R2       | T2      | Destination
  R5       | R5      | Source
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5

66. Register Renaming Example

Renaming table after decoding I3:
  Original | Renamed | Role
  R1       | T1      |
  R2       | T2      |
  R5       | T3      | Destination
  R8       | R8      |

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7

67. Register Renaming Example

Renaming table after decoding I4:
  Original | Renamed
  R1       | T1
  R2       | T2
  R5       | T3
  R8       | T4

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7
  I4: OR  R8, R6, R2    I4: OR  T4, R6, T2

68. Register Renaming Example

Renaming table after decoding I5:
  Original | Renamed
  R1       | T1
  R2       | T5
  R5       | T3
  R8       | T4

  Decoded               Renamed
  I1: ADD R1, R2, R3    I1: ADD T1, R2, R3
  I2: SUB R2, R1, R5    I2: SUB T2, T1, R5
  I3: AND R5, R11, R7   I3: AND T3, R11, R7
  I4: OR  R8, R6, R2    I4: OR  T4, R6, T2
  I5: XOR R2, R4, R11   I5: XOR T5, R4, R11
69. Register Names
- We keep using new names
  - Each name needs a place to keep its value
  - We can have only so many of those places
- What happens when we run out of names?
  - There must be a way to recycle names
- When can we recycle a name?
  - When we have given its value to all instructions that use it as a source operand!
  - This is not as easy as it sounds
70. Implementing Dynamic Scheduling
- Tomasulo's Algorithm
  - Used in the IBM 360/91 (in the '60s)
  - Tracks when operands are available, to satisfy data dependences
  - Removes name dependences through register renaming
  - Very similar to what is used today
71. Tomasulo's Algorithm: The Picture
72. Tomasulo's Algorithm: Issue
- Get the next instruction from the instruction queue
- Find a free reservation station for it (if none are free, stall until one is)
- Read the operands that are already in the registers
  - If an operand is not in the register, find which reservation station will produce it
- In effect, this step renames registers (reservation station IDs are temporary names)
73. Tomasulo's Algorithm: Execute
- Monitor results as they are produced
  - Put a result into all reservation stations waiting for it (as a missing source operand)
- When all operands are available for an instruction, it is ready (we can actually execute it)
- Several ready instructions for one functional unit? Pick one.
  - Except for load/store: loads and stores must be done in the proper order to avoid hazards through memory
74. Tomasulo's Algorithm: Write Result
- When a result is computed, make it available on the common data bus (CDB), where waiting reservation stations can pick it up
- Stores write to memory
- The result is also stored in the register file
- This step frees the reservation station
  - For our register renaming, this recycles the temporary name (future instructions can again find the value in the actual register, until it is renamed again)
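The CDB broadcast step can be sketched as follows. The station and register-file layouts (Qj/Qk tags, Vj/Vk values, a register holding either a tag or a value) are assumptions for illustration; real hardware does all of these compares in parallel in one cycle:

```python
def cdb_broadcast(stations, registers, tag, value):
    # Every reservation station waiting on `tag` captures the value...
    for rs in stations.values():
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # ...and any register currently renamed to `tag` gets the value,
    # which recycles the temporary name.
    for reg, owner in list(registers.items()):
        if owner == tag:
            registers[reg] = value
```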
75. Tomasulo's Algorithm: Load/Store
- The reservation stations take care of dependences through registers
- Dependences are also possible through memory:
  - Stores can not be reordered with respect to other load/store operations to the same address
- Example:
  - Can I3 execute before I2?
  - Not if the addresses match (e.g. if R3 is -100, then 100(R1) and (R2) name the same location)!

I1: ADD R1, R2, R3
I2: ST  R4, 100(R1)
I3: LD  R4, (R2)
76. Tomasulo's Algorithm: Load/Store
- Load:
  - Wait for all previous stores to compute their addresses
  - If any store is to the same address, wait for it to actually write to memory
    - Alternatively, just forward the value of the last such store
- Store:
  - Wait for all previous loads and stores to compute their addresses
  - If any load/store is from/to the same address, wait for it to read/write
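The load rule above can be written as a predicate. Representing each earlier store by its address, with None meaning "address not yet computed", is an assumed encoding for illustration:

```python
def load_may_execute(load_addr, earlier_store_addrs):
    """May a load go ahead of the stores that precede it in program order?

    earlier_store_addrs holds each earlier store's address, or None if
    that store has not computed its address yet.
    """
    for addr in earlier_store_addrs:
        if addr is None:
            return False    # unknown address: it might alias, so wait
        if addr == load_addr:
            return False    # same address: wait for the store to write
    return True             # all addresses known and different: safe
```

The store rule is symmetric: a store must additionally wait for earlier loads from the same address to read their value.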
77. Tomasulo's Algorithm: Example
- We need to keep track of:
  - Instruction status
    - Not part of the HW, but having it makes our life easier
  - Reservation stations
    - All fields for each reservation station
  - Register status
    - Which reservation station each register is renamed to

Loop:  L.D    F0, 0(R1)     ; load 64-bit FP value
       MUL.D  F4, F0, F2    ; multiply FP
       S.D    F4, 0(R1)     ; store 64-bit FP value
       DADDUI R1, R1, -8    ; add (int) immediate
       BNE    R1, R2, Loop  ; branch if R1 != R2