Title: CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP
CS 6461 Computer Architecture: Basic Compiler Techniques for Exposing ILP
- Instructor Morris Lancaster
- Corresponding to Hennessy and Patterson
- Fifth Edition
- Section 3.2
Basic Compiler Techniques for Exposing ILP
- Crucial for processors that use static issue, and
important for processors that make dynamic issue
decisions but use static scheduling
Basic Pipeline Scheduling and Loop Unrolling
- Exploiting parallelism among instructions
- Finding sequences of unrelated instructions that
can be overlapped in the pipeline
- Separation of a dependent instruction from a
source instruction by a distance in clock cycles
equal to the pipeline latency of the source
instruction (to avoid the stall)
- The compiler works with knowledge of the amount
of available ILP in the program and the
latencies of the functional units within the
pipeline
- This couples the compiler, sometimes to the
specific chip version, or at least requires the
setting of appropriate compiler flags
Assumed Latencies
Instruction Producing Result | Instruction Using Result | Latency in Clock Cycles (needed to avoid stall)
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0 (result of the load can be bypassed without stalling the store)
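As a rough illustration of how a scheduler might use this table, the sketch below computes how many stall cycles remain when a consumer is placed some number of instruction slots after its producer. The names `LATENCY` and `stalls_needed` are hypothetical, and the sketch assumes the instructions placed in between do not stall themselves.

```python
# Latencies copied from the table above: cycles that must separate a
# producer from a dependent consumer to avoid a stall.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

def stalls_needed(producer, consumer, distance):
    """Stall cycles when `consumer` sits `distance` slots after `producer`.

    distance == 1 means the two instructions are back to back.
    Assumes the instructions in between issue without stalling.
    """
    independent_between = distance - 1
    return max(0, LATENCY[(producer, consumer)] - independent_between)

print(stalls_needed("load", "fp_alu", 1))    # load followed directly by an FP add
print(stalls_needed("fp_alu", "store", 1))   # FP add followed directly by a store
print(stalls_needed("fp_alu", "store", 3))   # two independent instructions in between
```

Placing independent work between the producer and the consumer is what drives the stall count toward zero.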
Basic Pipeline Scheduling and Loop Unrolling (cont)
- Assume standard 5 stage integer pipeline
- Branches have a delay of one clock cycle
- Functional units are fully pipelined or
replicated (as many times as the pipeline depth)
- An operation of any type can be issued on every
clock cycle and there are no structural hazards
Basic Pipeline Scheduling and Loop Unrolling (cont)
- Sample code
    for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
- MIPS code
    Loop: L.D    F0,0(R1)      ; F0 = array element
          ADD.D  F4,F0,F2      ; add scalar in F2
          S.D    F4,0(R1)      ; store back
          DADDUI R1,R1,-8      ; decrement index
          BNE    R1,R2,Loop    ; R2 is precomputed so that 8(R2) is
                               ; the last value to be computed
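To make the role of R1 and R2 concrete, here is a minimal Python sketch that runs the source loop and a byte-offset version side by side. It assumes the array starts at byte address 0, that each element is 8 bytes, and that index 0 is unused; the function names are hypothetical.

```python
def add_scalar_source(x, s):
    """Source-level loop: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s."""
    for i in range(len(x) - 1, 0, -1):
        x[i] = x[i] + s
    return x

def add_scalar_pointer_walk(x, s):
    """Same work expressed the way the MIPS loop does it, with byte offsets."""
    r1 = 8 * (len(x) - 1)   # R1: byte offset of the highest element
    r2 = 0                  # R2: chosen so that 8(R2) is the last element processed
    while True:
        i = r1 // 8         # element addressed by 0(R1)
        x[i] = x[i] + s     # L.D, ADD.D, S.D
        r1 -= 8             # DADDUI R1,R1,-8
        if r1 == r2:        # BNE R1,R2,Loop falls through when R1 == R2
            break
    return x

data = [0.0] + [float(i) for i in range(1, 1001)]   # index 0 is unused
same = add_scalar_source(list(data), 2.5) == add_scalar_pointer_walk(list(data), 2.5)
print(same)
```

Both versions touch elements 1000 down to 1 and leave index 0 alone, which is why the loop can terminate on a simple equality test against R2.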
Basic Pipeline Scheduling and Loop Unrolling (cont)
- MIPS code
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          stall                2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,0(R1)      6
          DADDUI R1,R1,-8      7
          stall                8
          BNE    R1,R2,Loop    9
Rescheduling Gives
- Sample code
    for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
- MIPS code
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          DADDUI R1,R1,-8      2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,8(R1)      6
          BNE    R1,R2,Loop    7
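The cycle counts for the original and rescheduled loop bodies can be reproduced with a small issue-time model. This is a sketch only: it models one iteration on an in-order, single-issue pipeline, ignores the branch delay, and assumes one stall cycle between an integer op and a dependent branch (inferred from the stall shown before BNE in the unscheduled code, not from the latency table). All names are hypothetical.

```python
# Stall cycles required between a producer class and a dependent consumer class.
# The first four entries come from the latency table; ("int", "branch") is an
# assumption inferred from the stall before BNE in the unscheduled loop.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,
}

def count_cycles(instrs):
    """Issue cycle of the last instruction; each entry is (kind, dest, sources)."""
    last_writer = {}                      # register -> (issue cycle, producer kind)
    cycle = 0
    for kind, dest, sources in instrs:
        issue = cycle + 1                 # in-order: at least one cycle after the previous
        for reg in sources:
            if reg in last_writer:
                when, producer = last_writer[reg]
                issue = max(issue, when + 1 + LATENCY.get((producer, kind), 0))
        if dest is not None:
            last_writer[dest] = (issue, kind)
        cycle = issue
    return cycle

unscheduled = [
    ("load",   "F0", ["R1"]),         # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),   # S.D    F4,0(R1)
    ("int",    "R1", ["R1"]),         # DADDUI R1,R1,-8
    ("branch", None, ["R1", "R2"]),   # BNE    R1,R2,Loop
]
scheduled = [
    ("load",   "F0", ["R1"]),         # L.D    F0,0(R1)
    ("int",    "R1", ["R1"]),         # DADDUI R1,R1,-8
    ("fp_alu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),   # S.D    F4,8(R1)
    ("branch", None, ["R1", "R2"]),   # BNE    R1,R2,Loop
]
print(count_cycles(unscheduled))   # 9
print(count_cycles(scheduled))     # 7
```

Moving DADDUI into the load delay removes two stalls; the two stalls between ADD.D and S.D remain because nothing independent is left to fill them.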
Unrolling Summary (continued)
- Simple Unroll
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop
[Figure legend: Name Dependences, Data Dependences]
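At the source level, the simple unroll by four corresponds roughly to the sketch below. It assumes the number of elements is a multiple of 4 (true for the 1000-element example) and that index 0 is unused; the function names are hypothetical.

```python
def add_scalar_rolled(x, s):
    """One element per trip through the loop."""
    i = len(x) - 1
    while i > 0:
        x[i] += s
        i -= 1
    return x

def add_scalar_unrolled4(x, s):
    """Four copies of the body per trip; one index update and one test."""
    i = len(x) - 1
    assert i % 4 == 0, "sketch assumes the element count is a multiple of 4"
    while i > 0:
        x[i] += s        # 0(R1)
        x[i - 1] += s    # -8(R1)
        x[i - 2] += s    # -16(R1)
        x[i - 3] += s    # -24(R1)
        i -= 4           # DADDUI R1,R1,-32
    return x

data = [0.0] + [float(i) for i in range(1, 1001)]
print(add_scalar_rolled(list(data), 1.0) == add_scalar_unrolled4(list(data), 1.0))
```

The unrolled version executes a quarter as many index updates and branch tests for the same element updates.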
Unrolling and Renaming Gives
- MIPS code
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2      ; we have a stall coming
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop
Unrolling and Removing Hazards Gives
- MIPS code (total of 14 clock cycles)
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,-32
          S.D    F12,16(R1)
          S.D    F16,8(R1)
          BNE    R1,R2,Loop
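One way to check that the adjusted store offsets (16(R1) and 8(R1) after the DADDUI) still hit the right elements is to run the scheduled body in a toy interpreter. This is a sketch under stated assumptions: memory is a dictionary keyed by byte address, the BNE is implied at the end of the body, and the tuple encoding and the name `run_loop` are hypothetical.

```python
def run_loop(body, mem, r1, r2, s):
    """Execute `body` repeatedly until R1 == R2 (the implied BNE R1,R2,Loop)."""
    regs = {"R1": r1, "R2": r2, "F2": s}
    while True:
        for op, a, b, c in body:
            if op == "L.D":          # L.D a, b(c)
                regs[a] = mem[regs[c] + b]
            elif op == "ADD.D":      # ADD.D a, b, c
                regs[a] = regs[b] + regs[c]
            elif op == "S.D":        # S.D a, b(c)
                mem[regs[c] + b] = regs[a]
            elif op == "DADDUI":     # DADDUI a, b, c
                regs[a] = regs[b] + c
        if regs["R1"] == regs["R2"]:
            return mem

scheduled = [
    ("L.D", "F0", 0, "R1"), ("L.D", "F6", -8, "R1"),
    ("L.D", "F10", -16, "R1"), ("L.D", "F14", -24, "R1"),
    ("ADD.D", "F4", "F0", "F2"), ("ADD.D", "F8", "F6", "F2"),
    ("ADD.D", "F12", "F10", "F2"), ("ADD.D", "F16", "F14", "F2"),
    ("S.D", "F4", 0, "R1"), ("S.D", "F8", -8, "R1"),
    ("DADDUI", "R1", "R1", -32),
    ("S.D", "F12", 16, "R1"),        # was -16(R1) before R1 dropped by 32
    ("S.D", "F16", 8, "R1"),         # was -24(R1) before R1 dropped by 32
]

mem = {8 * i: float(i) for i in range(1, 9)}    # eight doubles at byte addresses 8..64
result = run_loop(scheduled, dict(mem), r1=64, r2=0, s=1.0)
print(all(result[addr] == value + 1.0 for addr, value in mem.items()))
```

Each store that moves below the DADDUI has 32 added to its offset, so every element is still updated exactly once.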
Unrolling Summary for Above
- Determine that it was legal to move the S.D after
the DADDUI, and find the amount to adjust
the S.D offset
- Determine that unrolling the loop would be useful
by finding that the loop iterations were
independent, except for loop maintenance code
- Use different registers to avoid unnecessary
constraints that would be forced by using the
same registers
- Eliminate the extra test and branch instructions
and adjust the loop termination and iteration
code
- Determine that the loads and stores can be
interchanged by determining that the loads and
stores from different iterations are independent
- Schedule the code, preserving any dependences
Unrolling Summary (continued)
- Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop
[Figure legend: Name Dependences, Data Dependences]
Unrolling Summary (Renaming)
- Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop
[Figure legend: Name Dependences, Data Dependences]
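The renaming step can be sketched as a small generator that gives each unrolled copy its own pair of registers, so only the true data dependences within a copy remain. The function `unroll_and_rename` and the free-register list are hypothetical; the list is chosen to match the registers used on the slide.

```python
def unroll_and_rename(copies, step, free_regs):
    """Emit `copies` renamed copies of the L.D / ADD.D / S.D body as text."""
    regs = iter(free_regs)
    code = []
    for j in range(copies):
        loaded, summed = next(regs), next(regs)   # fresh registers for this copy
        off = -step * j                           # byte offset of this copy's element
        code.append(f"L.D    {loaded},{off}(R1)")
        code.append(f"ADD.D  {summed},{loaded},F2")
        code.append(f"S.D    {summed},{off}(R1)")
    code.append(f"DADDUI R1,R1,{-step * copies}")  # one index update for all copies
    code.append("BNE    R1,R2,Loop")
    return code

free = ["F0", "F4", "F6", "F8", "F10", "F12", "F14", "F16"]
for line in unroll_and_rename(copies=4, step=8, free_regs=free):
    print(line)
```

Because no register is written by more than one copy, the name dependences between copies disappear and the scheduler is free to interleave them.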
Unrolling Summary (continued)
- Limits to Impacts of Unrolling Loops
- As we unroll more, each additional unroll yields a
smaller improvement, because the loop overhead is
already spread over more iterations
- Growth in code size
- Shortfall in available registers (register
pressure)
- Scheduling the code to increase ILP causes the
number of live values to increase
- This could generate a shortage of registers and
negatively impact the optimization
- Loop unrolling remains useful in a variety of
processors today
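The diminishing return can be seen with simple arithmetic on the example loop: three body instructions per element plus two overhead instructions (DADDUI and BNE) per trip, and two FP registers per renamed copy. The sketch below tabulates these for several unroll factors; it counts instructions only and ignores stalls, and the function name is hypothetical.

```python
def instructions_per_element(k, body=3, overhead=2):
    """Instructions issued per element when the loop is unrolled k times."""
    return (k * body + overhead) / k

for k in (1, 2, 4, 8, 16):
    per_element = instructions_per_element(k)
    code_size = 3 * k + 2          # instructions in the unrolled loop body
    fp_temps = 2 * k               # renamed FP registers, two per copy
    print(k, per_element, code_size, fp_temps)
```

Going from 1 to 2 copies saves a full instruction per element, while going from 8 to 16 saves only an eighth of one, yet code size and register demand keep doubling.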