Advanced Computer Architecture 5MD00 5Z033 Exploiting ILP with SW approaches - PowerPoint PPT Presentation



Slides: 40
Provided by: henkcor2


Transcript and Presenter's Notes


Advanced Computer Architecture 5MD00 / 5Z033
Exploiting ILP with SW approaches
  • Henk Corporaal
  • TU Eindhoven
  • December 2009

  • Static branch prediction and speculation
  • Basic compiler techniques
  • Multiple issue architectures
  • Advanced compiler support techniques
  • Loop-level parallelism
  • Software pipelining
  • Hardware support for compile-time scheduling

We previously discussed dynamic branch prediction.
This does not help the compiler!
  • We need static branch prediction

Static Branch Prediction and Speculation
  • Static branch prediction is useful for code scheduling
  • Example:

        ld   r1,0(r2)
        sub  r1,r1,r3     ; hazard: r1 is used right after the load
        beqz r1,L
        or   r4,r5,r6
        addu r10,r4,r3
    L:  addu r7,r8,r9

  • If the branch is taken most of the time, and
    since r7 is not needed on the fall-through path,
    we could move addu r7,r8,r9 directly after the ld
  • If the branch is not taken most of the time, and
    assuming that r4 is not needed on the taken path,
    we could move or r4,r5,r6 after the ld

Static Branch Prediction Methods
  • Always predict taken
  • Average misprediction rate for SPEC: 34% (range 9% to 59%)
  • Backward branches predicted taken, forward
    branches predicted not taken
  • In SPEC, most forward branches are taken, so
    always predicting taken is better
  • Profiling
  • Run the program and profile all branches. If a
    branch is taken (not taken) most of the time, it
    is predicted taken (not taken)
  • The behavior of a branch is often biased toward taken or
    not taken
  • Average misprediction rate: SPECint 15%
    (11% to 22%), SPECfp 9% (5% to 15%)
  • Can we do better? Yes: use control flow
    restructuring to exploit correlation
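To make the three static schemes concrete, here is a minimal sketch that scores them on a small branch trace. The trace, the branch names and the helper functions are all made up for illustration (this is not SPEC data), and the profile is evaluated on the same run it was collected from, which is the most favourable case for profiling.

```python
from collections import defaultdict

# Hypothetical branch trace: (branch_id, is_backward, taken).
# Invented data for illustration only; it is not taken from SPEC.
TRACE = (
    [("loop_back", True, True)] * 9 + [("loop_back", True, False)]
    + [("err_check", False, False)] * 8 + [("err_check", False, True)] * 2
    + [("fast_path", False, True)] * 7 + [("fast_path", False, False)] * 3
)

def always_taken(trace):
    """Predict taken for every branch."""
    return {bid: True for bid, _, _ in trace}

def backward_taken_forward_not(trace):
    """Predict backward branches taken, forward branches not taken."""
    return {bid: backward for bid, backward, _ in trace}

def profile_based(trace):
    """Predict each branch in the direction it took most often in the profile run."""
    counts = defaultdict(lambda: [0, 0])  # branch_id -> [not taken, taken]
    for bid, _, taken in trace:
        counts[bid][int(taken)] += 1
    return {bid: c[1] >= c[0] for bid, c in counts.items()}

def misprediction_rate(prediction, trace):
    """Fraction of dynamic branches whose outcome differs from the static prediction."""
    wrong = sum(1 for bid, _, taken in trace if prediction[bid] != taken)
    return wrong / len(trace)

rates = {
    "always taken": misprediction_rate(always_taken(TRACE), TRACE),
    "backward taken / forward not taken":
        misprediction_rate(backward_taken_forward_not(TRACE), TRACE),
    "profile based": misprediction_rate(profile_based(TRACE), TRACE),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

On this invented trace the profile-based prediction is never worse than the other two, because it picks the majority direction per branch; how the schemes rank on real programs depends on the workload.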

Static exploitation of correlation
  • If there is correlation, the branch prediction in block d
    depends on the branch in block a
  • [Figure: control flow restructuring]

Basic compiler techniques
  • Dependencies limit ILP (Instruction-Level Parallelism)
  • We cannot always find sufficient independent
    operations to fill all the delay slots
  • May result in pipeline stalls
  • Scheduling to avoid stalls
  • Loop unrolling: create more exploitable parallelism
Dependencies Limit ILP: Example
C loop: for (i=1; i<=1000; i++) x[i] = x[i] + s;
  • MIPS assembly code:

          ; R1 = &x[1]
          ; R2 = &x[1000] + 8
          ; F2 = s
    Loop: L.D   F0,0(R1)      ; F0 = x[i]
          ADD.D F4,F0,F2      ; F4 = x[i] + s
          S.D   F4,0(R1)      ; x[i] = F4
          ADDI  R1,R1,8       ; R1 = &x[i+1]
          BNE   R1,R2,Loop    ; branch if R1 != &x[1000] + 8

Schedule this on a MIPS Pipeline
  • FP operations are mostly multicycle
  • The pipeline must be stalled if an instruction
    uses the result of a not yet finished multicycle
    operation
  • We'll assume the following latencies:

    Producing instruction   Consuming instruction   Latency (clock cycles)
    FP ALU op               FP ALU op               3
    FP ALU op               Store double            2
    Load double             FP ALU op               1
    Load double             Store double            0

Where to Insert Stalls?
  • How would this loop be executed on the MIPS FP
    pipeline?
  • Which true (flow) dependences are there?
  • [Figure label: inter-iteration dependence]

    Loop: L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDI  R1,R1,8
          BNE   R1,R2,Loop

Where to Insert Stalls
  • How would this loop be executed on the MIPS FP
    pipeline?
  • 10 cycles per iteration

    Loop: L.D   F0,0(R1)
          stall
          ADD.D F4,F0,F2
          stall
          stall
          S.D   F4,0(R1)
          ADDI  R1,R1,8
          stall
          BNE   R1,R2,Loop
          stall

Code Scheduling to Avoid Stalls
  • Can we reorder the instructions to avoid
    stalls?
  • Execution time reduced from 10 to 6 cycles per
    iteration
  • But only 3 instructions perform useful work; the rest
    is loop overhead. How to avoid this?

    Loop: L.D   F0,0(R1)
          ADDI  R1,R1,8
          ADD.D F4,F0,F2
          stall
          BNE   R1,R2,Loop
          S.D   F4,-8(R1)     ; watch out! the offset changes because ADDI now precedes the store
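The 10-cycle and 6-cycle figures can be reproduced with a small in-order, single-issue timing model. This is a sketch under stated assumptions, not the actual MIPS pipeline: the four FP latencies come from the table in the slides, while the one-cycle latency from an integer op to a branch and the single branch delay slot are assumptions inferred from where the stalls appear. The instruction tuples and the function name are hypothetical. Only true (flow) dependences are modelled.

```python
# Latency = stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,   # assumption, inferred from the stall listing
}

def cycles_per_iteration(code):
    """code: list of (kind, dest_register_or_None, [source_registers])."""
    producer = {}   # register -> (issue cycle, kind) of its most recent writer
    cycle = 0
    for kind, dest, sources in code:
        earliest = cycle + 1                      # in-order, one issue per cycle
        for reg in sources:
            if reg in producer:
                p_cycle, p_kind = producer[reg]
                earliest = max(earliest,
                               p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        cycle = earliest
        if dest is not None:
            producer[dest] = (cycle, kind)
    if code[-1][0] == "branch":                   # unfilled delay slot (assumed: one slot)
        cycle += 1
    return cycle

unscheduled = [
    ("load",    "F0", ["R1"]),          # L.D   F0,0(R1)
    ("fp_alu",  "F4", ["F0", "F2"]),    # ADD.D F4,F0,F2
    ("store",   None, ["F4", "R1"]),    # S.D   F4,0(R1)
    ("int_alu", "R1", ["R1"]),          # ADDI  R1,R1,8
    ("branch",  None, ["R1", "R2"]),    # BNE   R1,R2,Loop
]
scheduled = [
    ("load",    "F0", ["R1"]),          # L.D   F0,0(R1)
    ("int_alu", "R1", ["R1"]),          # ADDI  R1,R1,8
    ("fp_alu",  "F4", ["F0", "F2"]),    # ADD.D F4,F0,F2
    ("branch",  None, ["R1", "R2"]),    # BNE   R1,R2,Loop
    ("store",   None, ["F4", "R1"]),    # S.D   F4,-8(R1), in the delay slot
]

print(cycles_per_iteration(unscheduled))   # 10
print(cycles_per_iteration(scheduled))     # 6
```

The model lets the remaining stall of the scheduled loop fall just before the store rather than before the branch; the total of 6 cycles is the same either way.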
Loop Unrolling: increasing ILP
  • At source level:

        for (i=1; i<=1000; i++)
            x[i] = x[i] + s;

    becomes

        for (i=1; i<=1000; i=i+4) {
            x[i]   = x[i]   + s;
            x[i+1] = x[i+1] + s;
            x[i+2] = x[i+2] + s;
            x[i+3] = x[i+3] + s;
        }

  • Any drawbacks?
  • Loop unrolling increases code size
  • More registers are needed
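As a quick check that the unrolled form computes the same result, the sketch below writes both loops in Python with 0-based indexing. The function names are hypothetical. The cleanup loop is an addition for trip counts that are not a multiple of four; the slide's loop runs exactly 1000 iterations and does not need one.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Same loop unrolled four times, plus a cleanup loop for leftover elements."""
    n = len(x)
    main_end = n - (n % 4)              # largest multiple of 4 that is <= n
    for i in range(0, main_end, 4):     # one loop-overhead step per 4 elements
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    for i in range(main_end, n):        # cleanup: at most 3 iterations
        x[i] = x[i] + s
    return x

data = [float(v) for v in range(10)]
print(add_scalar(list(data), 1.5) == add_scalar_unrolled4(list(data), 1.5))
```

The four statements in the unrolled body are independent of each other, which is what gives the scheduler more operations to interleave; the price is the larger body and the extra registers noted above.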

MIPS code after scheduling:

    Loop: L.D   F0,0(R1)
          L.D   F6,8(R1)
          L.D   F10,16(R1)
          L.D   F14,24(R1)
          ADD.D F4,F0,F2
          ADD.D F8,F6,F2
          ADD.D F12,F10,F2
          ADD.D F16,F14,F2
          S.D   F4,0(R1)
          S.D   F8,8(R1)
          ADDI  R1,R1,32
          S.D   F12,-16(R1)
          BNE   R1,R2,Loop
          S.D   F16,-8(R1)

Multiple issue architectures
  • How to get CPI < 1?
  • Superscalar: multiple instructions issued per
    cycle
  • Statically scheduled
  • Dynamically scheduled (see previous lecture)
  • VLIW?
  • Single instruction issue, but multiple operations
    per instruction
  • SIMD / Vector?
  • Single instruction issue, single operation, but
    multiple data sets per operation
  • Multi-processor?

Instruction Parallel (ILP) Processors
  • The name ILP is used for:
  • Multiple-issue processors
  • Superscalar: varying number of instructions per cycle (0 to
    8), scheduled by HW (dynamic issue capability)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium
    III/4, etc.
  • VLIW (very long instruction word): fixed number of
    instructions (4-16) scheduled by the compiler
    (static issue capability)
  • Intel Architecture-64 (IA-64, Itanium), TriMedia,
    TI C6x
  • (Super-)pipelined processors
  • Anticipated success of multiple instructions led
    to the Instructions Per Cycle (IPC) metric instead
    of CPI

Vector processors
  • Vector processing: explicit coding of independent
    loops as operations on large vectors of numbers
  • Multimedia instructions are being added to many
    processors
  • Different implementations
  • real SIMD
  • (multiple) subword units
  • deeply pipelined units, with forwarding between

Simple In-order Superscalar
  • In-order superscalar 2-issue processor: 1 integer +
    1 FP
  • Used in the first Pentium processor (also in
    Larrabee, but canceled!)
  • Fetch 64 bits per clock cycle; Int on left, FP on
    right
  • Can only issue 2nd instruction if 1st
    instruction issues
  • More ports needed for FP register file to
    execute FP load & FP op in parallel

    Type              Pipe stages
    Int. instruction  IF  ID  EX  MEM WB
    FP instruction    IF  ID  EX  MEM WB
    Int. instruction      IF  ID  EX  MEM WB
    FP instruction        IF  ID  EX  MEM WB
    Int. instruction          IF  ID  EX  MEM WB
    FP instruction            IF  ID  EX  MEM WB

  • 1 cycle load delay impacts the next 3
    instructions!

Dynamic trace for unrolled code
  • for (i=1; i<=1000; i++)
        a[i] = a[i] + s;

    Integer instruction     FP instruction        Cycle
    L: LD   F0,0(R1)                                1
       LD   F6,8(R1)                                2
       LD   F10,16(R1)      ADDD F4,F0,F2           3
       LD   F14,24(R1)      ADDD F8,F6,F2           4
       LD   F18,32(R1)      ADDD F12,F10,F2         5
       SD   F4,0(R1)        ADDD F16,F14,F2         6
       SD   F8,8(R1)        ADDD F20,F18,F2         7
       SD   F12,16(R1)                              8
       ADDI R1,R1,40                                9
       SD   F16,-16(R1)                            10
       BNE  R1,R2,L                                11
       SD   F20,-8(R1)                             12

  • Load: 1 cycle latency; ALU op: 2 cycles latency
  • 2.4 cycles per element (12 cycles / 5 elements) vs. 3.5
    (14 cycles / 4 elements) for ordinary MIPS
  • Int and FP instructions are not perfectly balanced

Multiple-Issue Issues
  • While the Integer/FP split is simple for the HW, we get an
    IPC of 2 only for programs with:
  • Exactly 50% FP operations AND no hazards
  • More complex decode and issue
  • Even a 2-issue superscalar must examine 2 opcodes, 6
    register specifiers, and decide if 1 or 2
    instructions can issue (N-issue: O(N^2)
    comparisons)
  • Register file complexity: a 2-issue
    superscalar needs 2x reads and 1x writes per cycle
  • Rename logic must be able to rename the same
    register multiple times in one cycle! For
    instance, consider 4-way issue:

    add r1, r2, r3        add p11, p4, p7
    sub r4, r1, r2   =>   sub p22, p11, p4
    lw  r1, 4(r4)         lw  p23, 4(p22)
    add r5, r1, r2        add p12, p23, p4

  • Imagine doing this transformation in a single
    cycle!
  • Result buses: need to complete multiple
    instructions per cycle
  • Need multiple buses with associated matching
    logic at every reservation station.
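The renaming of the four-instruction group can be sketched as a sequential walk over a rename map and a free list. The function name, the tuple encoding, the initial mapping (r2 to p4, r3 to p7) and the free-list order are assumptions chosen so that the output matches the physical register numbers on the slide; real hardware has to produce the same result for all four instructions within one cycle, which is the hard part.

```python
def rename(instructions, rename_map, free_list):
    """Rename (op, dest, sources) tuples in program order.

    Sources are looked up before the destination is remapped, so a later
    instruction sees the newest mapping of a register written earlier in
    the same group.
    """
    rename_map = dict(rename_map)     # work on copies, leave inputs untouched
    free = list(free_list)
    renamed = []
    for op, dest, sources in instructions:
        new_sources = [rename_map[s] for s in sources]
        new_dest = free.pop(0)        # allocate the next free physical register
        rename_map[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

program = [
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r4", ["r1", "r2"]),
    ("lw",  "r1", ["r4"]),            # lw r1, 4(r4); the offset is not a register
    ("add", "r5", ["r1", "r2"]),
]
initial_map = {"r2": "p4", "r3": "p7"}        # assumed starting state
free_list = ["p11", "p22", "p23", "p12"]      # assumed allocation order

for op, dest, sources in rename(program, initial_map, free_list):
    print(op, dest, sources)
```

Note that r1 is renamed twice (to p11 and then to p23) and that the last add must read p23, not p11: each instruction depends on the renaming decisions of all earlier instructions in the group.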

VLIW Processors
  • Superscalar HW is too difficult to build => let the
    compiler find independent instructions and pack
    them in one Very Long Instruction Word (VLIW)
  • Example: VLIW processor with 2 ld/st units, two
    FP units, one integer/branch unit, no branch delay

Superscalar versus VLIW
  • VLIW advantages
  • Much simpler to build. Potentially faster
  • VLIW disadvantages and proposed solutions
  • Binary code incompatibility
  • Object code translation or emulation
  • Less strict approach (EPIC, IA-64, Itanium)
  • Increase in code size: unfilled slots are wasted
  • Use clever encodings, only one immediate field
  • Compress instructions in memory and decode them
    when they are fetched, or when put in L1 cache
  • Lockstep operation: if the operation in one
    instruction slot stalls, the entire processor is
    stalled
  • Less strict approach

Use compressed instructions
[Figure: L1 instruction cache and compressed instructions in memory,
with two candidate points labelled "compress here?" and "or compress here?"]
Q: What are the pros and cons?

Advanced compiler support techniques
  • Loop-level parallelism
  • Software pipelining
  • Global scheduling (across basic blocks)

Detecting Loop-Level Parallelism
  • Loop-carried dependence: a statement executed in
    a certain iteration is dependent on a statement
    executed in an earlier iteration
  • If there is no loop-carried dependence, then the loop's
    iterations can be executed in parallel

    for (i=1; i<=100; i++) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

A loop is parallel <=> the corresponding dependence
graph does not contain a cycle
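The cycle criterion can be checked mechanically once the dependences are written down as a graph. Below is a minimal sketch using a depth-first search; the dictionary encoding and the function name are hypothetical, and the dependence edges for the S1/S2 loop are entered by hand rather than derived automatically. The second graph is an invented counter-example with no recurrence.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph {node: [successors]}."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for succ in graph[node]:
            if color[succ] == GREY:       # back edge: we returned to the DFS stack
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

# Dependences of the S1/S2 loop, entered by hand:
#   S1 -> S1 : A[i+1] written in iteration i is read as A[i] in iteration i+1
#   S2 -> S2 : B[i+1] written in iteration i is read as B[i] in iteration i+1
#   S1 -> S2 : A[i+1] is written and read within the same iteration
example = {"S1": ["S1", "S2"], "S2": ["S2"]}

# Hypothetical loop without recurrences: S2 only uses what S1 just produced.
no_recurrence = {"S1": ["S2"], "S2": []}

print(has_cycle(example))         # True: the loop is not parallel as written
print(has_cycle(no_recurrence))   # False
```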
Finding Dependences
  • Is there a dependence in the following loop?

        for (i=1; i<=100; i++)
            A[2*i+3] = A[2*i] * 5.0;

  • Affine expression: an expression of the form a*i +
    b (a, b constants, i loop index variable)
  • Does the following equation have a solution?

        a*i + b = c*j + d

  • GCD test: if there is a solution, then GCD(a,c)
    must divide d-b
  • Note: because the GCD test does not take the loop
    bounds into account, there are cases where the
    GCD test says "yes, there is a solution" while in
    reality there isn't
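A minimal sketch of the GCD test, together with a brute-force check over the loop bounds to illustrate the caveat in the note. The function names are hypothetical, and the second example (a shift by 200 elements) is invented to show a case where the GCD test cannot rule out a dependence although none exists within the bounds.

```python
from math import gcd

def gcd_test(a, b, c, d):
    """Return True if a*i + b = c*j + d may have an integer solution.

    False proves there is no dependence; True only means a dependence
    cannot be ruled out, because loop bounds are ignored.
    """
    g = gcd(a, c)
    if g == 0:                 # both coefficients are zero: compare the constants
        return b == d
    return (d - b) % g == 0

def brute_force(a, b, c, d, lo, hi):
    """Exhaustively look for a solution with lo <= i, j <= hi."""
    return any(a * i + b == c * j + d
               for i in range(lo, hi + 1)
               for j in range(lo, hi + 1))

# Loop from the slide: A[2*i+3] = A[2*i] * 5.0, so a=2, b=3, c=2, d=0.
print(gcd_test(2, 3, 2, 0))               # False: GCD(2,2)=2 does not divide -3

# Invented example: A[i+200] = A[i] for i = 1..100.
print(gcd_test(1, 200, 1, 0))             # True: GCD(1,1)=1 divides -200
print(brute_force(1, 200, 1, 0, 1, 100))  # False: no solution within the bounds
```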

Software Pipelining
  • We have already seen loop unrolling
  • Software pipelining is a related technique
    that consumes less code space. It interleaves
    instructions from different iterations
  • instructions in one iteration are often dependent
    on each other

[Figure: iterations 0, 1 and 2 overlapped; one software-pipelined
iteration forms the steady-state kernel]

Simple Software Pipelining Example

    L: l.d   f0,0(r1)      ; load M[i]
       add.d f4,f0,f2      ; compute M[i]
       s.d   f4,0(r1)      ; store M[i]
       addi  r1,r1,-8      ; i = i-1
       bne   r1,r2,L

  • Software pipelined loop:

    L: s.d   f4,16(r1)     ; store M[i]
       add.d f4,f0,f2      ; compute M[i-1]
       l.d   f0,0(r1)      ; load M[i-2]
       addi  r1,r1,-8
       bne   r1,r2,L

  • Need hardware to avoid the WAR hazards
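The slide shows only the steady-state kernel. The sketch below models the same schedule in Python and adds the start-up and wind-down code that a complete version needs; the prologue, the epilogue, the function names and the requirement of at least two iterations are additions for illustration, not taken from the slides. The variables f0 and f4 play the role of the registers of the same name.

```python
def original_loop(M, s):
    """One iteration at a time: load, compute, store, counting i downwards."""
    for i in range(len(M) - 1, -1, -1):
        f0 = M[i]            # l.d
        f4 = f0 + s          # add.d
        M[i] = f4            # s.d
    return M

def software_pipelined(M, s):
    """Kernel overlaps store of M[i], compute of M[i-1] and load of M[i-2]."""
    n = len(M)
    assert n >= 2, "this sketch assumes at least two iterations"
    # Prologue: start the first two iterations.
    f0 = M[n - 1]            # load    M[n-1]
    f4 = f0 + s              # compute M[n-1]
    f0 = M[n - 2]            # load    M[n-2]
    # Kernel: each pass touches three different iterations.
    for i in range(n - 1, 1, -1):
        M[i] = f4            # store   M[i]
        f4 = f0 + s          # compute M[i-1]
        f0 = M[i - 2]        # load    M[i-2]
    # Epilogue: drain the two iterations still in flight.
    M[1] = f4                # store   M[1]
    f4 = f0 + s              # compute M[0]
    M[0] = f4                # store   M[0]
    return M

values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(original_loop(list(values), 10.0) == software_pipelined(list(values), 10.0))
```

Within one kernel pass the three operations belong to different iterations and no longer depend on each other through a flow dependence, but f4 and f0 are read and then immediately overwritten, which is where the WAR hazards mentioned above come from.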

Global code scheduling
  • Loop unrolling and software pipelining work well
    when there are no control statements (if
    statements) in the loop body -> the loop is a single
    basic block
  • Global code scheduling: scheduling/moving code
    across branches; larger scheduling scope
  • When can the assignments to B and C be moved
    before the test?

Which scheduling scope?
[Figure: candidate scheduling scopes; surviving label: Decision Tree]

Comparing scheduling scopes

Scheduling scope creation (1)
  • Partitioning a CFG into scheduling scopes

Trace Scheduling
  • Find the most likely sequence of basic blocks
    that will be executed consecutively (trace
    selection)
  • Optimize the trace as much as possible (trace
    compaction)
  • Move operations as early as possible in the trace
  • Pack the operations in as few VLIW instructions
    as possible
  • Additional bookkeeping code may be necessary on
    exit points of the trace

Scheduling scope creation (2)
  • Partitioning a CFG into scheduling scopes

Code movement (upwards) within regions
[Figure: code moved upwards from a source block to a destination block]

Hardware support for compile-time scheduling
  • Predication
  • (discussed already)
  • see also Itanium example
  • Deferred exceptions
  • Speculative loads

Predicated Instructions (discussed before)
  • Avoid branch prediction by turning branches into
    conditional or predicated instructions
  • If false, then neither store result nor cause
    exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have
    conditional move; PA-RISC can annul any following
    instruction
  • IA-64/Itanium: conditional execution of any
    instruction
  • Examples:

    if (R1==0) R2 = R3;     CMOVZ  R2,R3,R1

    if (R1 < R2)            SLT    R9,R1,R2
        R3 = R1;            CMOVNZ R3,R1,R9
    else                    CMOVZ  R3,R2,R9
        R3 = R2;
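The second example can be checked by modelling the three instructions as small functions and comparing the branch-free sequence with the original if/else. This only models the instruction semantics as described on the slide (move when the condition register is zero or non-zero); the helper names are hypothetical, and the Python conditional expressions inside the helpers stand in for what the hardware does without a branch.

```python
def slt(a, b):
    """SLT rd,ra,rb: rd = 1 if ra < rb, else 0."""
    return 1 if a < b else 0

def cmovz(dest, src, cond):
    """CMOVZ dest,src,cond: new value of dest (src if cond == 0, else unchanged)."""
    return src if cond == 0 else dest

def cmovnz(dest, src, cond):
    """CMOVNZ dest,src,cond: new value of dest (src if cond != 0, else unchanged)."""
    return src if cond != 0 else dest

def select_branching(r1, r2):
    """if (R1 < R2) R3 = R1; else R3 = R2;"""
    if r1 < r2:
        return r1
    return r2

def select_predicated(r1, r2, r3=0):
    """The same selection with no branch in the instruction sequence."""
    r9 = slt(r1, r2)            # SLT    R9,R1,R2
    r3 = cmovnz(r3, r1, r9)     # CMOVNZ R3,R1,R9
    r3 = cmovz(r3, r2, r9)      # CMOVZ  R3,R2,R9
    return r3

print(all(select_branching(a, b) == select_predicated(a, b)
          for a in range(-3, 4) for b in range(-3, 4)))
```

Exactly one of the two conditional moves takes effect for any value of R9, so R3 ends up with the same value as in the branching version.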

Deferred Exceptions
  • if (A==0) A = B; else A = A+4;

        ld   r1,0(r3)     ; load A
        bnez r1,L1        ; test A
        ld   r1,0(r2)     ; then part: load B
        j    L2
    L1: addi r1,r1,4      ; else part: inc A
    L2: st   r1,0(r3)     ; store A

  • How to optimize when the then-part is almost
    always executed?

        ld   r1,0(r3)     ; load A
        ld   r9,0(r2)     ; speculative load of B
        beqz r1,L3        ; test A
        addi r9,r1,4      ; else part
    L3: st   r9,0(r3)     ; store A

  • What if the load generates a page fault?
  • What if the load generates an index-out-of-bounds
    exception?

HW supporting Speculative Loads
  • Speculative load (sld): does not generate
    exceptions
  • Speculation check instruction (speck): checks for
    an exception. The exception occurs when this
    instruction is executed.

        ld    r1,0(r3)     ; load A
        sld   r9,0(r2)     ; speculative load of B
        bnez  r1,L1        ; test A
        speck 0(r2)        ; perform exception check
        j     L2
    L1: addi  r9,r1,4      ; else part
    L2: st    r9,0(r3)     ; store A
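The behaviour of sld and speck can be sketched with a toy memory model. Everything here is an assumption made for illustration: the dictionary standing in for memory, the addresses, the Fault class, and the choice to have speck re-check the address (the slides do not say how the hardware remembers the deferred exception).

```python
class Fault(Exception):
    """Stands in for a page fault or an out-of-bounds exception."""

MEMORY = {0x100: 0, 0x200: 42}     # hypothetical layout: A at 0x100, B at 0x200

def ld(addr):
    """Ordinary load: faults immediately on a bad address."""
    if addr not in MEMORY:
        raise Fault(hex(addr))
    return MEMORY[addr]

def sld(addr):
    """Speculative load: never raises; yields a dummy 0 for a bad address."""
    return MEMORY.get(addr, 0)

def speck(addr):
    """Speculation check: raises the exception that sld suppressed."""
    if addr not in MEMORY:
        raise Fault(hex(addr))

def update_a(addr_a, addr_b):
    """if (A == 0) A = B; else A = A + 4; with the load of B hoisted."""
    r1 = ld(addr_a)          # load A
    r9 = sld(addr_b)         # speculative load of B, above the test
    if r1 == 0:
        speck(addr_b)        # then part: the exception may surface only here
    else:
        r9 = r1 + 4          # else part: the speculative value is discarded
    MEMORY[addr_a] = r9      # store A

update_a(0x100, 0x200)
print(MEMORY[0x100])         # 42: A was 0, so A = B
```

If B's address is invalid, the fault is raised only when A is zero, that is, only when the original program would have executed the load; on the else path the bad speculative load has no visible effect.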
How further?
Burton Smith Microsoft 2005