CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP


1
CS 6461: Computer Architecture
Basic Compiler Techniques for Exposing ILP
  • Instructor: Morris Lancaster
  • Corresponding to Hennessy and Patterson
  • Fifth Edition
  • Section 3.2

2
Basic Compiler Techniques for Exposing ILP
  • Crucial for processors that use static issue, and
    important for processors that make dynamic issue
    decisions but use static scheduling

3
Basic Pipeline Scheduling and Loop Unrolling
  • Exploiting parallelism among instructions
  • Finding sequences of unrelated instructions that
    can be overlapped in the pipeline
  • To avoid a stall, a dependent instruction must be
    separated from its source instruction by a
    distance in clock cycles equal to the pipeline
    latency of the source instruction
  • The compiler's ability to schedule depends on the
    amount of ILP available in the program and on
    the latencies of the functional units within the
    pipeline
  • This couples the compiler to the hardware,
    sometimes to a specific chip version, or at
    least requires setting appropriate compiler flags

4
Assumed Latencies
Instruction producing result   Instruction using result   Latency in clock cycles (needed to avoid stall)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0 (result of the load can be bypassed without stalling the store)
5
Basic Pipeline Scheduling and Loop Unrolling
(cont)
  • Assume a standard 5-stage integer pipeline
  • Branches have a delay of one clock cycle
  • Functional units are fully pipelined or
    replicated (as many times as the pipeline depth)
  • An operation of any type can be issued on every
    clock cycle and there are no structural hazards

6
Basic Pipeline Scheduling and Loop Unrolling
(cont)
  • Sample code
  • for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
  • MIPS code
    Loop: L.D    F0,0(R1)     ; F0 = array element
          ADD.D  F4,F0,F2     ; add scalar in F2
          S.D    F4,0(R1)     ; store back
          DADDUI R1,R1,-8     ; decrement index
          BNE    R1,R2,Loop   ; R2 is precomputed so that 8(R2)
                              ; is the last value to be computed

7
Basic Pipeline Scheduling and Loop Unrolling
(cont)
  • MIPS code (clock cycle issued shown at right)
    Loop: L.D    F0,0(R1)     1
          stall               2
          ADD.D  F4,F0,F2     3
          stall               4
          stall               5
          S.D    F4,0(R1)     6
          DADDUI R1,R1,-8     7
          stall               8
          BNE    R1,R2,Loop   9

8
Rescheduling Gives
  • Sample code
  • for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
  • MIPS code (clock cycle issued shown at right)
    Loop: L.D    F0,0(R1)     1
          DADDUI R1,R1,-8     2
          ADD.D  F4,F0,F2     3
          stall               4
          stall               5
          S.D    F4,8(R1)     6
          BNE    R1,R2,Loop   7
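
The change of the store offset from 0(R1) to 8(R1) can be checked with a short sketch. The Python below is illustrative only: the function names and the starting values chosen for R1 and R2 are assumptions, not part of the slides. It records the address each S.D writes under the original and the rescheduled instruction order.

```python
def store_addresses_original(r1, r2):
    """Addresses written when S.D F4,0(R1) precedes DADDUI R1,R1,-8."""
    addresses = []
    while True:
        addresses.append(r1 + 0)   # S.D F4,0(R1)
        r1 -= 8                    # DADDUI R1,R1,-8
        if r1 == r2:               # BNE falls through when R1 == R2
            return addresses


def store_addresses_rescheduled(r1, r2):
    """Addresses written when DADDUI is hoisted above S.D F4,8(R1)."""
    addresses = []
    while True:
        r1 -= 8                    # DADDUI R1,R1,-8 (moved up)
        addresses.append(r1 + 8)   # S.D F4,8(R1): +8 undoes the early decrement
        if r1 == r2:
            return addresses


# Hypothetical register values: R1 starts at byte offset 8000, R2 at 0,
# so that 8(R2) is the last element touched.
original = store_addresses_original(8000, 0)
rescheduled = store_addresses_rescheduled(8000, 0)
print(original == rescheduled, len(original))
```

Both orderings write the same 1000 addresses, which is why the S.D may legally follow the DADDUI once its offset is adjusted by the decrement amount.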

9
Unrolling Summary (continued)
  • Simple Unroll
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop

[Slide legend: name dependences, data dependences]
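
At the source level, the simple unroll corresponds roughly to the following sketch. It is written in Python purely for illustration, uses 0-based indexing rather than the 1-based loop on the slide, and assumes the trip count is a multiple of 4 (true for 1000 elements); the function names are hypothetical.

```python
def add_scalar_rolled(x, s):
    """One element per iteration, counting down as in the sample loop."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Four elements per iteration; assumes len(x) is a multiple of 4."""
    assert len(x) % 4 == 0, "sketch handles only trip counts divisible by 4"
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # offset   0
        x[i - 1] = x[i - 1] + s  # offset  -8
        x[i - 2] = x[i - 2] + s  # offset -16
        x[i - 3] = x[i - 3] + s  # offset -24
        i -= 4                   # single index update (DADDUI R1,R1,-32)
    return x


data = [float(v) for v in range(1000)]
same = add_scalar_rolled(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5)
print(same)
```

The unrolled version performs one index update and one loop test per four elements instead of per element, which is the overhead the three removed DADDUI/BNE pairs represent.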
10
Unrolling and Renaming Gives
  • MIPS code
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2     ; we have a stall coming
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop
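
The renaming pattern above is mechanical, and a small generator can make it explicit. The sketch below is not a real compiler pass: the function name, its parameters, and the choice to draw from even-numbered FP registers (to mirror the register numbers on the slide) are all assumptions made for illustration.

```python
def unroll_and_rename(factor, stride=8, scalar_reg=2, num_fp_regs=32):
    """Emit `factor` copies of the L.D / ADD.D / S.D body with fresh registers."""
    # Even-numbered FP registers, skipping the one that holds the scalar (F2).
    pool = [r for r in range(0, num_fp_regs, 2) if r != scalar_reg]
    if 2 * factor > len(pool):
        raise ValueError("not enough FP registers for this unroll factor")
    lines = []
    for k in range(factor):
        load_reg, sum_reg = pool[2 * k], pool[2 * k + 1]  # two new names per copy
        offset = -stride * k                              # 0, -8, -16, ...
        lines.append(f"L.D F{load_reg},{offset}(R1)")
        lines.append(f"ADD.D F{sum_reg},F{load_reg},F{scalar_reg}")
        lines.append(f"S.D F{sum_reg},{offset}(R1)")
    lines.append(f"DADDUI R1,R1,{-stride * factor}")      # one decrement for all copies
    lines.append("BNE R1,R2,Loop")
    return lines


code = unroll_and_rename(4)
print("\n".join(code))
```

With a factor of 4 this reproduces the instruction sequence on the slide. Under the even-register assumption the pool runs out past a factor of 7, a small illustration of the register pressure that limits unrolling.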

11
Unrolling and Removing Hazards Gives
  • MIPS code
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,-32
          S.D    F12,16(R1)
          S.D    F16,8(R1)
          BNE    R1,R2,Loop
  • Total of 14 clock cycles
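
The cycle counts quoted on these slides (9 for the original loop, 7 after rescheduling, 14 for the scheduled unrolled loop) can be reproduced with a small in-order issue model. This is a minimal sketch under stated assumptions: one instruction issues per cycle, only true data dependences cause stalls, the branch delay is ignored, and the integer-op-to-branch latency of 1 is an assumption chosen to match the stall shown before BNE (it is not in the latency table). All names in the code are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer.
# The first four entries come from the assumed-latencies table; the
# ("int", "branch") entry is an assumption (see the lead-in).
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,
}


def count_cycles(program):
    """Return the cycle in which the last instruction of `program` issues."""
    producer = {}  # register -> (kind, issue cycle) of its most recent writer
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1  # in order, at most one instruction per cycle
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        if dest is not None:
            producer[dest] = (kind, issue)
        cycle = issue
    return cycle


def ld(dest):
    return ("load", dest, ["R1"])


def add(dest, src):
    return ("fpalu", dest, [src, "F2"])


def sd(src):
    return ("store", None, [src, "R1"])


DADDUI = ("int", "R1", ["R1"])
BNE = ("branch", None, ["R1", "R2"])

unscheduled = [ld("F0"), add("F4", "F0"), sd("F4"), DADDUI, BNE]
rescheduled = [ld("F0"), DADDUI, add("F4", "F0"), sd("F4"), BNE]
unrolled = (
    [ld(f) for f in ("F0", "F6", "F10", "F14")]
    + [add(d, s) for d, s in (("F4", "F0"), ("F8", "F6"),
                              ("F12", "F10"), ("F16", "F14"))]
    + [sd("F4"), sd("F8"), DADDUI, sd("F12"), sd("F16"), BNE]
)

for name, prog, elements in (("unscheduled", unscheduled, 1),
                             ("rescheduled", rescheduled, 1),
                             ("unrolled + scheduled", unrolled, 4)):
    total = count_cycles(prog)
    print(f"{name}: {total} cycles, {total / elements} per element")
```

Memory offsets are omitted because the model tracks only register dependences. Under these assumptions the unrolled, scheduled loop costs 14 / 4 = 3.5 cycles per element, compared with 7 for the rescheduled single-iteration loop.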

12
Unrolling Summary for Above
  • Determine that it was legal to move the S.D after
    the DADDUI and BNE, and find the amount to adjust
    the S.D offset
  • Determine that unrolling the loop would be useful
    by finding that the loop iterations were
    independent, except for loop maintenance code
  • Use different registers to avoid unnecessary
    constraints that would be forced by using the
    same registers
  • Eliminate the extra test and branch instruction
    and adjust the loop termination and iteration
    code.
  • Determine that the loads and stores can be
    interchanged by determining that the loads and
    stores from different iterations are independent
  • Schedule the code, preserving any dependencies

13
Unrolling Summary (continued)
  • Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F0,-8(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-8(R1)
          L.D    F0,-16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-16(R1)
          L.D    F0,-24(R1)
          ADD.D  F4,F0,F2
          S.D    F4,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop

[Slide legend: name dependences, data dependences]
14
Unrolling Summary (Renaming)
  • Example on Page 311 shows the steps
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,-32
          BNE    R1,R2,Loop

[Slide legend: name dependences, data dependences]
15
Unrolling Summary (continued)
  • Limits on the gains from unrolling loops
  • Each additional unroll amortizes a smaller share
    of the loop overhead, so the improvement diminishes
  • Growth in code size
  • Shortfall in available registers (register
    pressure)
  • Scheduling the code to increase ILP causes the
    number of live values to increase
  • This can create a shortage of registers and
    offset some of the benefit of the optimization
  • Useful in a variety of processors today
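
The diminishing return from further unrolling follows from simple arithmetic. The sketch below counts only the two loop-maintenance instructions (DADDUI and BNE) and ignores stalls, code size, and register pressure; the function name and the chosen unroll factors are assumptions for illustration.

```python
def overhead_per_element(unroll_factor, overhead_instructions=2):
    """Loop-maintenance instructions (DADDUI + BNE) charged to each element."""
    return overhead_instructions / unroll_factor


savings = []
previous = overhead_per_element(1)
for k in (2, 4, 8, 16):
    current = overhead_per_element(k)
    savings.append(previous - current)  # overhead removed by this doubling
    print(f"unroll x{k}: {current:.3f} overhead instr/element, "
          f"saved {previous - current:.3f}")
    previous = current
```

Each doubling of the unroll factor removes half as much overhead as the previous one, while code size and the number of live registers keep growing.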