Title: Taking advantage of more ILP with multiple issue (3.6)
1. Lecture 4
ILP and its dynamic exploitation (contd.) and
exploiting ILP with software approaches
- Taking advantage of more ILP with multiple issue (3.6)
- Static multiple issue: the VLIW approach (4.3)
- Compiler techniques for exposing ILP (4.2, 4.4-4.5)
- Hardware-based speculation (3.7)
2. Multiple instruction issue per clock
- Goal: maximize the number of completed instructions per cycle
- Superscalar
- Combines static and dynamic scheduling to issue multiple instructions per clock
- HW-centric and less sensitive to poorly scheduled code
- Used e.g. in PowerPC, SPARC, Alpha, HP-PA, and Pentium
- Very Long Instruction Word (VLIW)
- Static scheduling forms packages of independent instructions that can be issued together
- Relies on the compiler to find independent instructions
- Used e.g. in IA-64 Itanium (EPIC) and in multimedia/DSP processors (e.g. TriMedia)
3. Multiple-Issue Processors
4. A Superscalar MIPS
- Issues 2 instructions simultaneously: 1 FP + 1 integer
- Fetches two instr./clock cycle: one integer and one FP
- Can only issue the 2nd instruction if the 1st instruction issues
- Needs more ports to the register file
- Type   Pipe stages
  Int.   IF ID EX MEM WB
  FP     IF ID EX MEM WB
  Int.      IF ID EX MEM WB
  FP        IF ID EX MEM WB
  Int.         IF ID EX MEM WB
  FP           IF ID EX MEM WB
- The EX stage should be fully pipelined
- 1 load delay slot now corresponds to three instructions!
5. Statically Scheduled Superscalar MIPS
- Difficult to find a sufficient number of instructions to issue
- Can instead be scheduled dynamically with Tomasulo's algorithm
6. Limits to Superscalar Execution
- Difficult to schedule within the constraints of the number of functional units and the ILP in the code chunk
- Instruction decode complexity increases with the number of issued instructions
- Data and control dependences are in general more costly in a superscalar processor than in a single-issue processor
Techniques that enlarge the instruction window to extract more ILP are therefore important.
7. Very Long Instruction Word (VLIW)
- Independent functional units with no hazard detection
- The compiler is responsible for instruction scheduling
8. Some VLIW Characteristics
- Can be hard to exploit enough parallelism: n functional units and k pipeline stages require n x k independent instructions in flight
- Memory and register bandwidth: complexity increases with the number of functional units
- Code size
- No binary code compatibility
- Relies heavily on compiler technology
9. Detecting data dependencies
Finding dependences is fundamental in order to
- perform instruction scheduling,
- determine the degree of parallelism in loops, and
- eliminate name dependences
10. Loop-carried dependencies
A loop iteration is often dependent on results calculated in an earlier iteration.
- Example:
  for (i = 6; i < 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];
- This loop has a dependence distance of 5, so we can extract ILP from 5 consecutive iterations
What dependences can the compiler detect?
11. Conditions for detection of data dependences
- Assumptions
- Array indices are affine, i.e., can be written a x i + b
- There is a write to A[a x j + b] followed by a read of A[c x k + d] for some loop indices m < j, k < n
- There is a data dependence if and only if
- there exist j, k with j < k, such that a x j + b = c x k + d
The dependence test may fail in the general case
12. The GCD test
- A simple test to decide whether there is NO dependence between loop iterations
- A loop-carried dependence requires that GCD(c,a) divides (d - b)
- If GCD(c,a) does NOT divide (d - b), there is NO dependence
13. Software Pipelining 1(3): Symbolic loop unrolling
- The instructions in the software-pipelined loop are taken from different iterations of the original loop
14. Software pipelining 2(3)
- Example:
  loop: LD   F0,0(R1)
        ADDD F4,F0,F2
        SD   0(R1),F4
        SUBI R1,R1,8
        BNEZ R1,loop
Look at three iterations of the loop body:
  Iteration i:    LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
  Iteration i+1:  LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
  Iteration i+2:  LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
15. Software pipelining 3(3)
- Instructions from three consecutive iterations form the new loop body:
  loop: SD   0(R1),F4    ; from iteration i
        ADDD F4,F0,F2    ; from iteration i+1
        LD   F0,-16(R1)  ; from iteration i+2
        SUBI R1,R1,8
        BNEZ R1,loop
- No data dependences within a loop iteration
- The dependence distance is 2 iterations
- WAR hazard elimination is needed
- Requires startup and finish code
16. Trace scheduling 1(2)
Creates a sequence of instructions that are likely to be executed -- a trace.
- Two steps
- Trace selection: find a likely sequence of basic blocks (a trace) across statically predicted branches (e.g. if-then-else)
- Trace compaction: schedule the trace to be as efficient as possible while preserving correctness in case the prediction is wrong
- Yields more instruction-level parallelism
- Accurate static branch prediction is key to success
17. Trace scheduling 2(2)
- The leftmost sequence is chosen as the most likely trace
- The assignment to B is control dependent on the if statement
- Trace compaction has to respect data dependences
- The rightmost (less likely) trace has to be augmented with fix-up code
18. Hardware support for speculation
- Loop unrolling, software pipelining, and trace scheduling are limited by the compiler's ability to do branch prediction
Dynamic techniques can predict the future based on history information. HW support for speculation includes
- Branch prediction (has been discussed)
- Predicated instructions (used extensively in Intel's IA-64)
- Hardware support for compiler speculation
- Hardware-based speculation
19. Conditional or predicated instructions
- Executed only if a condition is satisfied
- Control dependences are converted into data dependences
- Example:
  Normal code:      Conditional:
  BNEZ R1,L         CMOVZ R2,R3,R1
  MOV  R2,R3
  L: ...
- Useful for short if-then statements
- More complex predicated instructions might slow down the cycle time
20. Compiler speculation
- The compiler moves instructions up before a branch so that they can be executed before the branch condition is known
- Advantage: creates longer schedulable code sequences => more ILP can be exploited
- Example: if (A == 0) A = B; else A = A + 4;
  Non-speculative code:        Speculative code:
      LW   R1,0(R3)                LW   R1,0(R3)
      BNEZ R1,L1                   LW   R14,0(R2)
      LW   R1,0(R2)                BEQZ R1,L3
      J    L2                      ADD  R14,R1,4
  L1: ADD  R1,R1,4             L3: SW   0(R3),R14
  L2: SW   0(R3),R1
- Must not affect the exception behavior
21. HW-supported speculation
- A combination of three main ideas
- Dynamic instruction scheduling takes advantage of ILP
- Dynamic branch prediction allows instruction scheduling across branches
- Speculative execution executes instructions before all control dependences are resolved
Hardware-based speculation uses a data-flow approach: instructions execute when their operands are available
22. HW vs. SW speculation
- Advantages of HW speculation
- Dynamic runtime disambiguation of memory addresses
- Dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
- HW speculation can maintain a precise exception model
- Can achieve higher performance on older code
- Main disadvantage
- Complex implementation and extensive need for hardware resources
23. HW Support for Speculation
- Need a reorder buffer for uncommitted instructions
- The reorder buffer (ROB) can be an operand source
- Once an operation commits, the register file is updated
- Use the reorder buffer number instead of the reservation station number to tag results
- Instructions commit in order
- Flush the reorder buffer when a branch is mispredicted
- Store buffers are integrated into the ROB
24. Four Steps of a Speculative Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue
- If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination
- 2. Execute: operate on the operands (EX)
- If both operands are ready, execute; if not, watch the CDB for results; when both operands are in the reservation station, execute
- 3. Write result: finish execution
- Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available
- 4. Commit: update the register with the reorder buffer result
- When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
25. A Model of an Ideal Processor
- Provides a base for ILP measurements
- No structural hazards
- Register renaming: infinite virtual registers, and all WAW and WAR hazards avoided
- Machine with perfect speculation
- Branch prediction: perfect, no mispredictions
- Jump prediction: all jumps perfectly predicted
- Memory-address alias analysis: addresses are known, so a store can be moved before a load provided the addresses are not equal
- Only true data dependences are left!
26. Upper Bound on ILP
27. More Realistic HW: Branch Impact
28. Renaming: Register Impact
29. Window Impact
30. Summary
- Software (compiler) tricks
- Loop unrolling
- Software pipelining
- Static instruction scheduling (with register renaming)
- Trace scheduling (implies static branch prediction)
- Speculative execution
- Hardware tricks
- Dynamic instruction scheduling
- Dynamic branch prediction
- Multiple issue
- Superscalar
- VLIW/EPIC
- Conditional instructions
- Speculative execution