Title: Taking advantage of more ILP with multiple issue (3.6)
1. Lecture 4
ILP and its dynamic exploitation (contd.) and
exploiting ILP with software approaches
- Taking advantage of more ILP with multiple issue (3.6)
- Static multiple issue: the VLIW approach (4.3)
- Compiler techniques for exposing ILP (4.2, 4.4-4.5)
- Hardware-based speculation (3.7)
2. Multiple instruction issue per clock
- Goal: maximize the number of completed instructions per cycle
- Superscalar
- Combines static and dynamic scheduling to issue multiple instructions per clock
- HW-centric and less sensitive to poorly scheduled code
- Used e.g. in PowerPC, SPARC, Alpha, HP-PA, and Pentium
- Very Long Instruction Word (VLIW)
- Static scheduling forms packages of independent instructions that can be issued together
- Relies on the compiler to find independent instructions
- Used e.g. in IA-64 Itanium (EPIC) and in multimedia/DSP processors (e.g. TriMedia)
3. Multiple-Issue Processors
4. A Superscalar MIPS
- Issues 2 instructions simultaneously: 1 FP + 1 integer
- Fetches two instr./clock cycle: one integer and one FP
- Can only issue the 2nd instruction if the 1st instruction issues
- Needs more ports to the register file
- Type   Pipe stages
  Int.   IF ID EX MEM WB
  FP     IF ID EX MEM WB
  Int.      IF ID EX MEM WB
  FP        IF ID EX MEM WB
  Int.         IF ID EX MEM WB
  FP           IF ID EX MEM WB
- The EX stage should be fully pipelined
- 1 load delay slot now corresponds to three instructions!
5. Statically Scheduled Superscalar MIPS
- Difficult to find a sufficient number of instructions to issue
- Can instead be scheduled dynamically with Tomasulo's algorithm
6. Limits to Superscalar Execution
- Difficult to schedule within the constraints of the number of functional units and the ILP in the code chunk
- Instruction decode complexity increases with the number of issued instructions
- Data and control dependences are in general more costly in a superscalar processor than in a single-issue processor
Techniques that enlarge the instruction window to extract more ILP are therefore important.
7. Very Long Instruction Word (VLIW)
- Independent functional units with no hazard detection
- The compiler is responsible for instruction scheduling
8. Some VLIW Characteristics
- Can be hard to exploit enough parallelism: n functional units and k pipeline stages require n x k independent instructions in flight
- Memory and register bandwidth: complexity increases with the number of functional units
- Code size
- No binary code compatibility
- Relies heavily on compiler technology
9. Detecting data dependencies
Finding dependences is fundamental in order to
- perform instruction scheduling,
- determine the degree of parallelism in loops, and
- eliminate name dependences
10. Loop-carried dependencies
A loop iteration is often dependent on results calculated in an earlier iteration.
- Example:
  for (i = 6; i < 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];
- This loop has a dependence distance of 5, so we can extract ILP from 5 consecutive iterations
What dependences can the compiler detect?
11. Conditions for detection of data dependences
- Assumptions
- Array indices are affine, i.e., can be written a x i + b
- There is a write to A[a x j + b] followed by a read of A[c x k + d] for some loop indices m < j, k < n
- There is a data dependence if and only if
- there exist j, k with j < k, such that a x j + b = c x k + d
The dependence test may fail in the general case
12. The GCD test
- A simple test to decide whether there is NO dependence between loop iterations
- A loop-carried dependence requires that GCD(c,a) divides (d - b)
- If GCD(c,a) does NOT divide (d - b), there is NO dependence
13. Software Pipelining 1(3): Symbolic loop unrolling
- The instructions in the software-pipelined loop are taken from different iterations of the original loop
14. Software pipelining 2(3)
- Example:
  loop: LD   F0,0(R1)
        ADDD F4,F0,F2
        SD   0(R1),F4
        SUBI R1,R1,8
        BNEZ R1,loop
Look at three iterations of the loop body:
  Iteration i:    LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
  Iteration i+1:  LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
  Iteration i+2:  LD F0,0(R1)   ADDD F4,F0,F2   SD 0(R1),F4
15. Software pipelining 3(3)
- Instructions from three consecutive iterations form the new loop body:
  loop: SD   0(R1),F4    ; from iteration i
        ADDD F4,F0,F2    ; from iteration i+1
        LD   F0,-16(R1)  ; from iteration i+2
        SUBI R1,R1,8
        BNEZ R1,loop
- No data dependences within a loop iteration
- The dependence distance is 2 iterations
- WAR hazard elimination is needed
- Requires startup and finish code
16. Trace scheduling 1(2)
Creates a sequence of instructions that are likely to be executed -- a trace.
- Two steps
- Trace selection: find a likely sequence of basic blocks (a trace) across statically predicted branches (e.g. if-then-else)
- Trace compaction: schedule the trace to be as efficient as possible while preserving correctness in case the prediction is wrong
- Yields more instruction-level parallelism
- Accurate static branch prediction is key to success
17. Trace scheduling 2(2)
- The leftmost sequence is chosen as the most likely trace
- The assignment to B is control dependent on the if statement
- Trace compaction has to respect data dependences
- The rightmost (less likely) trace has to be augmented with fix-up code
18. Hardware support for speculation
- Loop unrolling, software pipelining, and trace scheduling are limited by the compiler's ability to do branch prediction
Dynamic techniques can predict the future based on history information. HW support for speculation includes
- Branch prediction (has been discussed)
- Predicated instructions (used extensively in Intel's IA-64)
- Hardware support for compiler speculation
- Hardware-based speculation
19. Conditional or predicated instructions
- Executed only if a condition is satisfied
- Control dependences are converted into data dependences
- Example:
  Normal code:      Conditional:
  BNEZ R1,L         CMOVZ R2,R3,R1
  MOV  R2,R3
  L: ...
- Useful for short if-then statements
- More complex predicated instructions might slow down the cycle time
20. Compiler speculation
- The compiler moves instructions up before a branch so that they can be executed before the branch condition is known
- Advantage: creates longer schedulable code sequences => more ILP can be exploited
- Example: if (A == 0) A = B; else A = A + 4;
  Non-speculative code:        Speculative code:
      LW   R1,0(R3)                LW   R1,0(R3)
      BNEZ R1,L1                   LW   R14,0(R2)
      LW   R1,0(R2)                BEQZ R1,L3
      J    L2                      ADD  R14,R1,4
  L1: ADD  R1,R1,4             L3: SW   0(R3),R14
  L2: SW   0(R3),R1
- Must not affect the exception behavior
21. HW-supported speculation
- A combination of three main ideas
- Dynamic instruction scheduling takes advantage of ILP
- Dynamic branch prediction allows instruction scheduling across branches
- Speculative execution executes instructions before all control dependences are resolved
Hardware-based speculation uses a data-flow approach: instructions execute when their operands are available
22. HW vs. SW speculation
- Advantages of HW speculation
- Dynamic runtime disambiguation of memory addresses
- Dynamic branch prediction is often better than static prediction, which limits the performance of SW speculation
- HW speculation can maintain a precise exception model
- Can achieve higher performance on older code
- Main disadvantage
- Complex implementation and extensive need for hardware resources
23. HW Support for Speculation
- Need a reorder buffer for uncommitted instructions
- The reorder buffer (ROB) can be an operand source
- Once an operation commits, the register file is updated
- Use the reorder buffer number instead of the reservation station number to tag results
- Instructions commit in order
- Flush the reorder buffer when a branch is mispredicted
- Store buffers are integrated into the ROB
24. Four Steps of a Speculative Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue
- If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination
- 2. Execute: operate on the operands (EX)
- If both operands are ready, execute; if not, watch the CDB for results; when both operands are in the reservation station, execute
- 3. Write result: finish execution
- Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available
- 4. Commit: update the register with the reorder buffer result
- When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
25. A Model of an Ideal Processor
- Provides a base for ILP measurements
- No structural hazards
- Register renaming: infinite virtual registers, and all WAW and WAR hazards avoided
- Machine with perfect speculation
- Branch prediction: perfect, no mispredictions
- Jump prediction: all jumps perfectly predicted
- Memory-address alias analysis: addresses are known, so a store can be moved before a load provided the addresses are not equal
- Only true data dependences are left!
26. Upper Bound on ILP
27. More Realistic HW: Branch Impact
28. Renaming: Register Impact
29. Window Impact
30. Summary
- Software (compiler) tricks
- Loop unrolling
- Software pipelining
- Static instruction scheduling (with register renaming)
- Trace scheduling (implies static branch prediction)
- Speculative execution
- Hardware tricks
- Dynamic instruction scheduling
- Dynamic branch prediction
- Multiple issue
- Superscalar
- VLIW/EPIC
- Conditional instructions
- Speculative execution