CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP

Learn more at: https://www2.seas.gwu.edu

1
CS 6461 Computer Architecture: Basic Compiler
Techniques for Exposing ILP

Instructor: Morris Lancaster
Corresponding to Hennessy and Patterson,
Fifth Edition, Section 3.2

2
Basic Compiler Techniques for Exposing ILP

Crucial for processors that use static issue, and
important for processors that make dynamic issue
decisions but use static scheduling

3
Basic Pipeline Scheduling and Loop Unrolling

Exploiting parallelism among instructions
Finding sequences of unrelated instructions that can be overlapped in the pipeline
To avoid a stall, separate a dependent instruction from its source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction
The compiler works from knowledge of the amount of ILP available in the program and of the latencies of the functional units within the pipeline
This ties the compiler to the pipeline, sometimes to a specific chip version, or at least requires setting the appropriate compiler flags

4
Assumed Latencies
Instruction producing result   Instruction using result   Latency in clock cycles (needed to avoid stall)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0 (the result of the load can be bypassed to the store without stalling)
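
As a rough illustration of how the table is read, the sketch below turns it into a lookup and computes how many stall cycles remain once some independent instructions have been placed between producer and consumer. The names `LATENCY`, `stalls_needed`, and `instructions_between` are hypothetical and chosen only for this example.

```python
# Latencies from the table above: cycles that must separate a producer
# from its consumer to avoid a stall.
LATENCY = {
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store double"): 2,
    ("Load double", "FP ALU op"): 1,
    ("Load double", "Store double"): 0,
}


def stalls_needed(producer, consumer, instructions_between=0):
    """Stall cycles left after placing independent instructions in between."""
    return max(0, LATENCY[(producer, consumer)] - instructions_between)


print(stalls_needed("Load double", "FP ALU op"))      # back to back
print(stalls_needed("FP ALU op", "Store double", 1))  # one filler instruction
print(stalls_needed("FP ALU op", "Store double", 2))  # two filler instructions
```

Every independent instruction the compiler can move into the gap removes one stall, which is the idea behind the scheduling on the following slides.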
5
Basic Pipeline Scheduling and Loop Unrolling
(cont)

Assume standard 5-stage integer pipeline
Branches have a delay of one clock cycle
Functional units are fully pipelined or replicated (as many times as the pipeline depth)
An operation of any type can be issued on every clock cycle, and there are no structural hazards

6
Basic Pipeline Scheduling and Loop Unrolling
(cont)

Sample code
for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
MIPS code
Loop: L.D    F0,0(R1)      ; F0 = array element
      ADD.D  F4,F0,F2      ; add scalar in F2
      S.D    F4,0(R1)      ; store back
      DADDUI R1,R1,-8      ; decrement index
      BNE    R1,R2,Loop    ; R2 is precomputed so that 8(R2)
                           ; is the last value to be computed

7
Basic Pipeline Scheduling and Loop Unrolling
(cont)

MIPS code                  clock cycle
Loop: L.D    F0,0(R1)      1
      stall                2
      ADD.D  F4,F0,F2      3
      stall                4
      stall                5
      S.D    F4,0(R1)      6
      DADDUI R1,R1,-8      7
      stall                8
      BNE    R1,R2,Loop    9

8
Rescheduling Gives

Sample code
for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
MIPS code                  clock cycle
Loop: L.D    F0,0(R1)      1
      DADDUI R1,R1,-8      2
      ADD.D  F4,F0,F2      3
      stall                4
      stall                5
      S.D    F4,8(R1)      6
      BNE    R1,R2,Loop    7
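
The store offset changes from 0(R1) to 8(R1) because the S.D now executes after the DADDUI has already subtracted 8 from R1. A minimal sketch of that adjustment follows; `store_offset_after_move` and the starting value of `r1` are hypothetical, picked only to make the arithmetic concrete.

```python
def store_offset_after_move(offset, pointer_delta):
    """Offset a store needs once it runs after `R1 = R1 + pointer_delta`."""
    return offset - pointer_delta


r1 = 800                                       # arbitrary example address
original_address = r1 + 0                      # S.D F4,0(R1) before the move
r1_after = r1 + (-8)                           # DADDUI R1,R1,-8 runs first
new_offset = store_offset_after_move(0, -8)    # offset becomes 8
print(new_offset, r1_after + new_offset == original_address)
```

The same rule applies to any pointer update: a store moved below a decrement of 32 would need its offset raised by 32.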

9
Unrolling Summary (continued)

Simple Unroll
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-8(R1)
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-16(R1)
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-24(R1)
      DADDUI R1,R1,-32
      BNE    R1,R2,Loop

Name Dependences
Data Dependences
10
Unrolling and Renaming Gives

MIPS code
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2      ; we have a stall coming
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,-32
      BNE    R1,R2,Loop
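
To make the effect of renaming concrete, the sketch below classifies the dependences in the twelve load/add/store instructions of the unrolled body, first with F0 and F4 reused in every copy and then with the renamed registers. It is an illustrative model only: `classify_dependences` and `unrolled_body` are hypothetical helpers, and the loop-maintenance instructions (DADDUI, BNE) are left out to keep the counts focused on the floating-point registers.

```python
def classify_dependences(instrs):
    """Return (data, name) dependence lists for a straight-line block.

    Each instruction is (text, dest, sources); dest is None for stores.
    """
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> indices that read it since that write
    data, name = [], []
    for j, (_, dest, sources) in enumerate(instrs):
        for reg in sources:
            if reg in last_writer:               # RAW: true data dependence
                data.append((last_writer[reg], j, reg))
            readers.setdefault(reg, []).append(j)
        if dest is not None:
            for r in readers.get(dest, []):
                if r != j:                       # WAR: antidependence
                    name.append((r, j, dest))
            if dest in last_writer:              # WAW: output dependence
                name.append((last_writer[dest], j, dest))
            last_writer[dest] = j
            readers[dest] = []
    return data, name


def unrolled_body(reg_pairs):
    """Build L.D / ADD.D / S.D triples, one per (load reg, result reg) pair."""
    body = []
    for k, (ld, res) in enumerate(reg_pairs):
        off = -8 * k
        body.append((f"L.D {ld},{off}(R1)", ld, ("R1",)))
        body.append((f"ADD.D {res},{ld},F2", res, (ld, "F2")))
        body.append((f"S.D {res},{off}(R1)", None, (res, "R1")))
    return body


simple = unrolled_body([("F0", "F4")] * 4)
renamed = unrolled_body([("F0", "F4"), ("F6", "F8"),
                         ("F10", "F12"), ("F14", "F16")])

for label, body in (("simple", simple), ("renamed", renamed)):
    data, name = classify_dependences(body)
    print(f"{label}: {len(data)} data dependences, {len(name)} name dependences")
```

Under this model the data dependences (load to add, add to store) are the same in both versions, while the name dependences disappear after renaming, which is what allows the copies to be interleaved.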

11
Unrolling and Removing Hazards Gives

MIPS code
Loop: L.D    F0,0(R1)      ; total of 14 clock cycles
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
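
The cycle counts quoted on these slides (9 unscheduled, 7 rescheduled, 14 for the scheduled unrolled loop) can be reproduced with a small in-order issue model. The sketch below is a simplification under stated assumptions, not the course's tool: it issues one instruction per cycle, applies the latencies from the Assumed Latencies slide, and additionally assumes one cycle between an integer op and a dependent branch, which is inferred from the stall before BNE in the unscheduled listing. All function and variable names are hypothetical. Memory offsets are omitted because they do not affect timing in this model.

```python
# Latencies from the "Assumed Latencies" slide.
SLIDE_LATENCY = {("fp", "fp"): 3, ("fp", "store"): 2,
                 ("load", "fp"): 1, ("load", "store"): 0}
# Assumption (not in the table): one cycle between an integer op and a
# dependent branch, inferred from the stall before BNE in the first listing.
ASSUMED_LATENCY = {("int", "branch"): 1}
KIND = {"L.D": "load", "ADD.D": "fp", "S.D": "store",
        "DADDUI": "int", "BNE": "branch"}


def count_cycles(listing):
    """Issue in order, one instruction per cycle, stalling on latencies."""
    latency = {**SLIDE_LATENCY, **ASSUMED_LATENCY}
    producer = {}   # register -> (kind of producer, cycle it issued)
    clock = 0
    for op, dest, sources in listing:
        kind = KIND[op]
        issue = clock + 1
        for reg in sources:
            if reg in producer:
                p_kind, p_issue = producer[reg]
                issue = max(issue, p_issue + latency.get((p_kind, kind), 0) + 1)
        if dest is not None:
            producer[dest] = (kind, issue)
        clock = issue
    return clock


def load(freg):
    return ("L.D", freg, ["R1"])


def add(dest, src):
    return ("ADD.D", dest, [src, "F2"])


def store(freg):
    return ("S.D", None, [freg, "R1"])


DADDUI = ("DADDUI", "R1", ["R1"])
BNE = ("BNE", None, ["R1", "R2"])

unscheduled = [load("F0"), add("F4", "F0"), store("F4"), DADDUI, BNE]
rescheduled = [load("F0"), DADDUI, add("F4", "F0"), store("F4"), BNE]
pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
unrolled = ([load(ld) for ld, _ in pairs]
            + [add(res, ld) for ld, res in pairs]
            + [store("F4"), store("F8"), DADDUI,
               store("F12"), store("F16"), BNE])

for name, listing, elements in (("unscheduled", unscheduled, 1),
                                ("rescheduled", rescheduled, 1),
                                ("unrolled x4", unrolled, 4)):
    total = count_cycles(listing)
    print(f"{name}: {total} cycles, {total / elements} per element")
```

Dividing by the number of array elements handled per iteration gives 9, 7, and 3.5 cycles per element, which is the comparison that motivates combining unrolling with scheduling.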

12
Unrolling Summary for Above

Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset
Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code
Use different registers to avoid unnecessary constraints that would be forced by using the same registers
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
Determine that the loads and stores can be interchanged by determining that the loads and stores from different iterations are independent
Schedule the code, preserving any dependences
13
Unrolling Summary (continued)

Example on Page 311 shows the steps
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-8(R1)
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-16(R1)
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-24(R1)
      DADDUI R1,R1,-32
      BNE    R1,R2,Loop

Name Dependences
Data Dependences
14
Unrolling Summary (Renaming)

Example on Page 311 shows the steps
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,-32
      BNE    R1,R2,Loop

Name Dependences
Data Dependences
15
Unrolling Summary (continued)

Limits to the Impact of Unrolling Loops
Each additional unroll amortizes less of the loop overhead, so the improvement shrinks as we unroll more
Growth in code size
Shortfall in available registers (register pressure)
Scheduling the code to increase ILP causes the number of live values to increase
This could generate a shortage of registers and negatively impact the optimization
Useful in a variety of processors today
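
The diminishing return and the growing register demand can be seen with simple arithmetic. The sketch below assumes the loop from these slides: three useful instructions per element (L.D, ADD.D, S.D), two overhead instructions per iteration (DADDUI, BNE), and two FP temporaries per unrolled copy as in the renamed listing. The function name and parameters are hypothetical.

```python
def overhead_fraction(k, body=3, overhead=2):
    """Share of instructions per iteration that are loop overhead."""
    return overhead / (body * k + overhead)


previous = None
for k in (1, 2, 4, 8, 16):
    frac = overhead_fraction(k)
    gain = "" if previous is None else f", gain {previous - frac:.3f}"
    # Code size grows with k; each copy needs its own two FP temporaries.
    print(f"unroll {k:2d}: {3 * k + 2:2d} instructions, "
          f"overhead {frac:.3f}{gain}, {2 * k} FP temporaries")
    previous = frac
```

Each doubling of the unroll factor removes less overhead than the one before, while code size and the number of live FP temporaries keep growing linearly.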