Taking advantage of more ILP with multiple issue (3.6)

1
Lecture 4
ILP and its dynamic exploitation (cont'd), and
exploiting ILP with software approaches
  • Taking advantage of more ILP with multiple issue
    (3.6)
  • Static multiple issue: the VLIW approach (4.3)
  • Compiler techniques for exposing ILP (4.2, 4.4-4.5)
  • Hardware-based speculation (3.7)
  • Limitations of ILP (3.8)

2
Multiple instruction issue per clock
  • Goal: maximize the number of completed
    instructions per cycle
  • Superscalar
  • Combines static and dynamic scheduling to issue
    multiple instructions per clock
  • HW-centric and less sensitive to poorly
    scheduled code
  • Used e.g. in PowerPC, SPARC, Alpha, HP PA-RISC,
    and Pentium
  • Very Long Instruction Word (VLIW)
  • Static scheduling forms packages of independent
    instructions that can be issued together
  • Relies on the compiler to find independent
    instructions
  • Used e.g. in IA-64 Itanium (EPIC) and in
    multimedia/DSP processors (e.g. TriMedia)

3
Multiple-Issue Processors
4
A Superscalar MIPS
  • Issues 2 instructions simultaneously: 1 integer +
    1 FP
  • Fetches two instructions per clock cycle: one
    integer and one FP
  • Can only issue the 2nd instruction if the 1st
    instruction issues
  • Needs more ports to the register file

  Type   Pipe stages
  Int.   IF  ID  EX  MEM WB
  FP     IF  ID  EX  MEM WB
  Int.       IF  ID  EX  MEM WB
  FP         IF  ID  EX  MEM WB
  Int.           IF  ID  EX  MEM WB
  FP             IF  ID  EX  MEM WB

  • The EX stage should be fully pipelined
  • A 1-cycle load delay now corresponds to three
    instructions: the instruction paired with the
    load and both instructions in the next issue
    packet cannot use the load result!

5
Statically Scheduled Superscalar MIPS
  • Difficult to find a sufficient number of
    instructions to issue
  • Can be scheduled dynamically with Tomasulo's
    algorithm

6
Limits to Superscalar Execution
  • Difficulties in scheduling within the constraints
    on the number of functional units and the ILP in
    the code chunk
  • Instruction decode complexity increases with the
    number of issued instructions
  • Data and control dependences are in general more
    costly in a superscalar processor than in a
    single-issue processor

Techniques to enlarge the instruction window to
extract more ILP are important
7
Very Long Instruction Word (VLIW)
  • Independent functional units with no hazard
    detection
  • Compiler is responsible for instruction
    scheduling

8
Some VLIW Characteristics
  • Can be hard to exploit parallelism fully
  • n functional units with k pipeline stages require
    about n x k independent instructions in flight to
    keep the machine busy (e.g. 5 units with 4 stages
    need 20)
  • High memory and register bandwidth demands
  • Complexity increases with the number of
    functional units
  • Larger code size

No binary code compatibility between processor
generations
Relies heavily on compiler technology
9
Detecting data dependences
  • Finding dependences is fundamental in order to
  • perform instruction scheduling,
  • determine the degree of parallelism in loops, and
  • eliminate name dependences

10
Loop-carried dependences
A loop iteration is often dependent on results
calculated in an earlier iteration.
Example: for (i = 6; i < 100; i = i+1)
             Y[i] = Y[i-5] + Y[i];
  • This loop has a dependence distance of 5, so we
    can extract ILP across 5 consecutive iterations,
    as the sketch below makes explicit
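
A minimal C sketch (the function names are mine, not
from the slides): in the unroll-by-5 version, every
statement in the body reads only elements written at
least five iterations earlier, so the five statements
are mutually independent.

    void recurrence(double *Y, int n)           /* the loop as given */
    {
        for (int i = 6; i < n; i++)
            Y[i] = Y[i - 5] + Y[i];
    }

    void recurrence_unrolled(double *Y, int n)  /* exposes the ILP */
    {
        int i;
        for (i = 6; i + 4 < n; i += 5) {
            Y[i]     = Y[i - 5] + Y[i];         /* these 5 statements   */
            Y[i + 1] = Y[i - 4] + Y[i + 1];     /* only read elements   */
            Y[i + 2] = Y[i - 3] + Y[i + 2];     /* written 5 or more    */
            Y[i + 3] = Y[i - 2] + Y[i + 3];     /* iterations earlier,  */
            Y[i + 4] = Y[i - 1] + Y[i + 4];     /* so all 5 can issue   */
        }                                       /* in parallel          */
        for (; i < n; i++)                      /* leftover iterations  */
            Y[i] = Y[i - 5] + Y[i];
    }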

What dependences can the compiler detect?
11
Conditions for detection of data dependences
  • Assumptions
  • Array indices are affine, i.e., can be written as
    a x i + b
  • There is a write to A[a x j + b] followed by a
    read of A[c x k + d] for some loop indices
    m < j, k < n
  • There is a data dependence if and only if
  • there exist j and k with j < k such that
    a x j + b = c x k + d

The dependence test may fail in the general case
12
The GCD test
  • A simple test to decide whether there is NO
    dependence between loop iterations
  • A loop-carried dependence implies that GCD(c,a)
    divides (d - b)
  • If GCD(c,a) does NOT divide (d - b), there is NO
    dependence
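
A runnable C sketch of the test (the helper names gcd
and may_depend are mine, as is the example): for a
write to A[2*i + 3] and a read of A[2*i], GCD(2,2) = 2
does not divide 0 - 3 = -3, so the loop carries no
dependence.

    #include <stdbool.h>
    #include <stdio.h>

    static int gcd(int a, int b)                /* Euclid's algorithm */
    {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Write to A[a*j + b], read of A[c*k + d]: a loop-carried
     * dependence requires that GCD(c,a) divide (d - b).  If it
     * does not, the loop is certainly dependence-free; if it
     * does, a dependence is only possible, since the test
     * ignores the loop bounds. */
    bool may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void)
    {
        /* A[2*i + 3] = ...;  ... = A[2*i];  GCD(2,2)=2, d-b = -3 */
        printf("%s\n", may_depend(2, 3, 2, 0) ? "may depend"
                                              : "no dependence");
        return 0;
    }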

13
Software pipelining 1(3): symbolic loop unrolling
  • The instructions in a loop are taken from
    different iterations in the original loop

14
Software pipelining 2(3)
  • Example
  • loop: LD   F0,0(R1)
  •       ADDD F4,F0,F2
  •       SD   0(R1),F4
  •       SUBI R1,R1,8
  •       BNEZ R1,loop

Look at three iterations of the loop body:

  Iteration i:    LD   F0,0(R1)
                  ADDD F4,F0,F2
                  SD   0(R1),F4
  Iteration i+1:  LD   F0,0(R1)
                  ADDD F4,F0,F2
                  SD   0(R1),F4
  Iteration i+2:  LD   F0,0(R1)
                  ADDD F4,F0,F2
                  SD   0(R1),F4
15
Software pipelining 3(3)
  • Instructions from three consecutive iterations
    form the new loop body
  • loop: SD   0(R1),F4     from iteration i
  •       ADDD F4,F0,F2     from iteration i+1
  •       LD   F0,-16(R1)   from iteration i+2
  •       SUBI R1,R1,8
  •       BNEZ R1,loop
  • No data dependences within a loop iteration
  • The dependence distance is 2 iterations
  • WAR hazard elimination is needed
  • Requires start-up and finish code, as in the
    source-level sketch below
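
A source-level C sketch of the same transformation
(function and variable names are mine; the loop adds a
constant to each array element, as the code above
does): each kernel iteration stores for iteration i,
adds for i+1, and loads for i+2, with explicit
start-up (prologue) and finish (epilogue) code.

    void add_scalar(double *x, int n, double c)
    {
        if (n < 2) {                        /* too short to pipeline */
            for (int i = 0; i < n; i++) x[i] += c;
            return;
        }
        double loaded = x[0];               /* start-up code         */
        double sum    = loaded + c;
        loaded = x[1];
        for (int i = 0; i + 2 < n; i++) {   /* kernel: 3 iterations  */
            x[i]   = sum;                   /* store for i    (SD)   */
            sum    = loaded + c;            /* add for i+1    (ADDD) */
            loaded = x[i + 2];              /* load for i+2   (LD)   */
        }
        x[n - 2] = sum;                     /* finish code           */
        x[n - 1] = loaded + c;
    }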

16
Trace scheduling 1(2)
Creates a sequence of instructions that are
likely to be executed -- a trace.
  • Two steps
  • Trace selection: find a likely sequence of basic
    blocks (a trace) across statically predicted
    branches (e.g. if-then-else)
  • Trace compaction: schedule the trace to be as
    efficient as possible while preserving
    correctness in case the prediction is wrong
  • Yields more instruction-level parallelism
  • Accurate static branch prediction is key to
    success

17
Trace scheduling 2(2)
  • The leftmost sequence is chosen as the most
    likely trace
  • The assignment to B is control dependent on the
    if statement
  • Trace compaction has to respect data dependences
  • The rightmost (less likely) trace has to be
    augmented with fix-up code, as in the toy example
    below
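
Since the original figure is not reproduced here, a
toy C example (entirely my own construction) sketches
the idea: the compiler assumes the if-branch is almost
always taken, compacts the likely trace into
straight-line code, and repairs the rare off-trace
path with fix-up code.

    void original(int *a, int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + b[i];
            if (a[i] != 0)          /* predicted taken almost always */
                b[i] = c[i] * 2;
            c[i] = c[i] + 1;
        }
    }

    void trace_scheduled(int *a, int *b, int *c, int n)
    {
        for (int i = 0; i < n; i++) {
            int old_b = b[i];       /* saved for the fix-up path    */
            a[i] = a[i] + b[i];
            b[i] = c[i] * 2;        /* hoisted above the branch     */
            c[i] = c[i] + 1;
            if (a[i] == 0)          /* rare exit from the trace     */
                b[i] = old_b;       /* fix-up: undo the speculation */
        }
    }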

18
Hardware support for speculation
  • Loop unrolling, software pipelining, and trace
    scheduling are limited by the compiler's ability
    to do branch prediction

Dynamic techniques can predict the future based
on history information. HW support for
speculation includes
  • Branch prediction (has been discussed)
  • Predicated instructions (used extensively in
    Intel's IA-64)
  • Hardware support for compiler speculation
  • Hardware-based speculation

19
Conditional or predicated instructions
  • Executed only if a condition is satisfied
  • Control dependences are converted into data
    dependences
  • Example
  • Normal code:        Conditional:
  •   BNEZ R1,L           CMOVZ R2,R3,R1
  •   MOV  R2,R3
  • L:
  • Useful for short if-then statements
  • More complex conditional instructions might slow
    down the cycle time
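
A C analogue of the example above (function names are
mine): the first form corresponds to the BNEZ/MOV
sequence, while the second is the branch-free
computation a compiler can map to a single CMOVZ.

    int branchy(int r1, int r2, int r3)
    {
        if (r1 == 0)        /* BNEZ R1,L skips the move if R1 != 0 */
            r2 = r3;        /* MOV R2,R3 */
        return r2;          /* L: */
    }

    int predicated(int r1, int r2, int r3)
    {
        return r1 == 0 ? r3 : r2;   /* CMOVZ R2,R3,R1: control
                                       turned into data dependence */
    }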

20
Compiler speculation
  • The compiler moves instructions before a branch
    so that they can be executed before the branch
    condition is known
  • Advantage: creates longer schedulable code
    sequences => more ILP can be exploited
  • Example: if (A == 0) A = B; else A = A + 4;
  • Non-speculative code:     Speculative code:
  •   LW   R1,0(R3)             LW   R1,0(R3)
  •   BNEZ R1,L1                LW   R14,0(R2)
  •   LW   R1,0(R2)             BEQZ R1,L3
  •   J    L2                   ADD  R14,R1,4
  • L1: ADD  R1,R1,4          L3: SW   0(R3),R14
  • L2: SW   0(R3),R1
  • Must not affect the exception behavior
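
The same transformation at the C level (a sketch; the
pointer parameters are mine, standing in for the
addresses held in R3 and R2):

    void nonspeculative(int *A, const int *B)
    {
        if (*A == 0) *A = *B;   /* load of B only on the taken path */
        else         *A = *A + 4;
    }

    void speculative(int *A, const int *B)
    {
        int a = *A;             /* LW  R1,0(R3)                     */
        int t = *B;             /* LW  R14,0(R2): hoisted load      */
        if (a != 0)             /* BEQZ R1,L3 skips the add         */
            t = a + 4;          /* ADD R14,R1,4                     */
        *A = t;                 /* L3: SW 0(R3),R14 - single store  */
    }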

21
HW supported speculation
  • A combination of three main ideas
  • Dynamic instruction scheduling: takes advantage
    of ILP
  • Dynamic branch prediction: allows instruction
    scheduling across branches
  • Speculative execution: executes instructions
    before all control dependences are resolved

Hardware-based speculation uses a data-flow
approach: instructions execute when their
operands are available
22
HW vs. SW speculation
  • Advantages
  • Dynamic runtime disambiguation of memory addresses
  • Dynamic branch prediction is often better than
    static prediction, which limits the performance
    of SW speculation
  • HW speculation can maintain a precise exception
    model
  • Can achieve higher performance on older code
  • Main disadvantage
  • Complex implementation and extensive need for
    hardware resources

23
HW Support for Speculation
  • Needs a reorder buffer (ROB) for uncommitted
    instructions
  • The reorder buffer can be an operand source
  • Once an operation commits, the register file is
    updated
  • Results are tagged with the reorder buffer entry
    number instead of the reservation station number
  • Instructions commit in order
  • The reorder buffer is flushed when a branch is
    mispredicted
  • Store buffers are integrated into the ROB

24
Four Steps of a Speculative Tomasulo Algorithm
  • 1. Issue: get an instruction from the FP op queue
  • If a reservation station and a reorder buffer
    slot are free, issue the instruction; send the
    operands and the reorder buffer number for the
    destination
  • 2. Execute: operate on the operands (EX)
  • If both operands are ready, execute; if not,
    watch the CDB for results; when both operands are
    in the reservation station, execute
  • 3. Write result: finish execution
  • Write on the Common Data Bus to all awaiting FUs
    and the reorder buffer; mark the reservation
    station available
  • 4. Commit: update the register with the reorder
    buffer result
  • When the instruction is at the head of the
    reorder buffer and its result is present, update
    the register with the result (or store to memory)
    and remove the instruction from the reorder
    buffer
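
A toy C model of the reorder buffer's part in these
steps (a sketch under simplifying assumptions: one
register result per entry, no reservation-station or
store detail; all names are mine):

    #include <stdbool.h>

    #define ROB_SIZE 16

    typedef struct {
        bool busy;          /* entry allocated at issue           */
        bool ready;         /* result has been written on the CDB */
        int  dest_reg;      /* architectural register to update   */
        long value;         /* result waiting to be committed     */
    } RobEntry;

    typedef struct {
        RobEntry e[ROB_SIZE];
        int head, tail;     /* commit from head, issue to tail    */
    } Rob;

    /* Step 1 (issue): allocate the tail entry; its index, not a
     * reservation station, names the destination.  -1 if full.   */
    int rob_issue(Rob *r, int dest_reg)
    {
        int idx = r->tail;
        if (r->e[idx].busy) return -1;
        r->e[idx] = (RobEntry){ .busy = true, .ready = false,
                                .dest_reg = dest_reg };
        r->tail = (idx + 1) % ROB_SIZE;
        return idx;
    }

    /* Step 3 (write result): the CDB broadcast lands here.       */
    void rob_write_result(Rob *r, int idx, long value)
    {
        r->e[idx].value = value;
        r->e[idx].ready = true;
    }

    /* Step 4 (commit): retire the head entry only when its result
     * is present, updating the register file in program order.   */
    bool rob_commit(Rob *r, long regfile[])
    {
        RobEntry *h = &r->e[r->head];
        if (!h->busy || !h->ready) return false;
        regfile[h->dest_reg] = h->value;
        h->busy = false;
        r->head = (r->head + 1) % ROB_SIZE;
        return true;
    }

    /* Mispredicted branch: squash every entry issued after it.   */
    void rob_flush_after(Rob *r, int branch_idx)
    {
        for (int i = (branch_idx + 1) % ROB_SIZE; i != r->tail;
             i = (i + 1) % ROB_SIZE)
            r->e[i].busy = false;
        r->tail = (branch_idx + 1) % ROB_SIZE;
    }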

25
A Model of an Ideal Processor
  • Provides a base for ILP measurements
  • No structural hazards
  • Register renaming: infinite virtual registers, so
    all WAW and WAR hazards are avoided
  • Machine with perfect speculation
  • Branch prediction: perfect, no mispredictions
  • Jump prediction: all jumps perfectly predicted
  • Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided the addresses are not equal
  • Only true data dependences are left!

26
Upper Bound on ILP
27
More Realistic HW: Branch Impact
28
Register Renaming Impact
29
Window Impact
30
Summary
  • Software (compiler) tricks
  • Loop unrolling
  • Software pipelining
  • Static instruction scheduling (with register
    renaming)
  • Trace scheduling (implies static branch
    prediction)
  • Speculative execution
  • Hardware tricks
  • Dynamic instruction scheduling
  • Dynamic branch prediction
  • Multiple issue
  • Superscalar
  • VLIW/EPIC
  • Conditional instructions
  • Speculative execution