Title: Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches
1. Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches
2. Basic Compiler Techniques for Exposing ILP
- Basic pipeline scheduling and loop unrolling.
- To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
- A compiler's ability to perform such scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.
- To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
3. Scheduling and Loop Unrolling
- Basic assumptions:
- Latencies of the FP unit:

    Instruction producing result   Instruction using result   Latency
    FP ALU op                      Another FP ALU op            3
    FP ALU op                      Store double                 2
    Load double                    FP ALU op                    1
    Load double                    Store double                 0

- The branch delay of the pipeline implementation is 1 delay slot.
- The functional units are fully pipelined or replicated, so no structural hazards can occur.
4. Loop Unrolling by Compilers
- Example:

    for (j = 1; j <= 1000; j++)
        x[j] = x[j] + s;

- Assume R1 initially holds the address of the array element with the highest address, and 8(R2) is the address of the last element to operate on.

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)
           DADDUI  R1, R1, -8
           BNE     R1, R2, Loop

- The next four slides examine the performance of this loop: unscheduled and scheduled, without and with loop unrolling.
5. Performance of Unscheduled Code without Loop Unrolling

                                     Clock cycle issued
    Loop:  L.D     F0, 0(R1)           1
           stall                       2
           ADD.D   F4, F0, F2          3
           stall                       4
           stall                       5
           S.D     F4, 0(R1)           6
           DADDUI  R1, R1, -8          7
           stall                       8
           BNE     R1, R2, Loop        9
           stall                      10

- Needs 10 cycles per result.
6. Performance of Scheduled Code without Loop Unrolling

    Loop:  L.D     F0, 0(R1)
           DADDUI  R1, R1, -8
           ADD.D   F4, F0, F2
           stall
           BNE     R1, R2, Loop    ; delayed branch
           S.D     F4, 8(R1)       ; in the delay slot; offset changed to 8
                                   ; because DADDUI now executes first

- Needs 6 cycles per result.
7. Performance of Unscheduled Code with Loop Unrolling
- Unroll the loop four times:

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)
           L.D     F6, -8(R1)
           ADD.D   F8, F6, F2
           S.D     F8, -8(R1)
           L.D     F10, -16(R1)
           ADD.D   F12, F10, F2
           S.D     F12, -16(R1)
           L.D     F14, -24(R1)
           ADD.D   F16, F14, F2
           S.D     F16, -24(R1)
           DADDUI  R1, R1, -32
           BNE     R1, R2, Loop

- Needs 7 cycles per result (28 cycles for four elements).
8. Performance of Scheduled Code with Loop Unrolling

    Loop:  L.D     F0, 0(R1)
           L.D     F6, -8(R1)
           L.D     F10, -16(R1)
           L.D     F14, -24(R1)
           ADD.D   F4, F0, F2
           ADD.D   F8, F6, F2
           ADD.D   F12, F10, F2
           ADD.D   F16, F14, F2
           S.D     F4, 0(R1)
           S.D     F8, -8(R1)
           DADDUI  R1, R1, -32
           S.D     F12, 16(R1)    ; offsets adjusted: R1 already decremented
           BNE     R1, R2, Loop
           S.D     F16, 8(R1)     ; in the delay slot

- Needs 3.5 cycles per result (14 cycles for four elements).
9. Using Loop Unrolling and Pipeline Scheduling with Static Multiple Issue
10. Static Branch Prediction
- For a compiler to schedule code effectively (e.g., to fill branch delay slots), it must predict the behavior of branches statically.
- Static branch prediction as used by a compiler:

    LD     R1, 0(R2)
    DSUBU  R1, R1, R3
    BEQZ   R1, L
    OR     R4, R5, R6
    DADDU  R10, R4, R3
L:  DADDU  R7, R8, R9

- If the BEQZ is almost always taken and the value of R7 is not needed on the fall-through path, the DADDU at L can be moved to the position after the LD.
- If the branch is rarely taken and the value of R4 is not needed on the taken path, the OR can be moved to the position after the LD.
11. Branch Behavior in Programs
- Program behavior:
- Average frequency of taken branches: 67%.
- 60% of forward branches are taken.
- 85% of backward branches are taken.
- Methods for static branch prediction:
- By examination of program behavior:
- Predict taken (misprediction rate ranges from 9% to 59%).
- Predict forward branches untaken and backward branches taken.
- Combined, these two approaches still leave a misprediction rate of 30% to 40%.
- By use of profile information collected from earlier runs of the program.
12. Misprediction Rate for a Profile-Based Predictor
13. Comparison between Profile-Based and Predict-Taken
14. The Basic VLIW Approach
- VLIW uses multiple, independent functional units.
- Multiple independent instructions are issued by packaging them into one long instruction consisting of multiple operations.
- A VLIW instruction might include one integer/branch operation, two memory references, and two floating-point operations.
- If each operation requires a 16- to 24-bit field, such a five-operation VLIW instruction is 80 to 120 bits long (see the sketch below).
- Performance of VLIW.
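As a rough illustration only, a five-slot VLIW word at the 24-bit upper bound could be sketched in C as below. The slot names and field widths are assumptions for illustration, not a real encoding (and note that non-int bit-field types are a common compiler extension):

    #include <stdint.h>

    /* Hypothetical 5-slot VLIW instruction word at the 24-bit upper
       bound: 5 x 24 = 120 bits of encoding. */
    typedef struct {
        uint64_t int_branch : 24;  /* one integer/branch operation */
        uint64_t mem0       : 24;  /* first memory reference       */
        uint64_t mem1       : 24;  /* second memory reference      */
        uint64_t fp0        : 24;  /* first floating-point op      */
        uint64_t fp1        : 24;  /* second floating-point op     */
    } vliw_word;

    /* When the compiler cannot fill a slot, it must place a no-op
       there -- exactly the wasted encoding space noted on slide 16. */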
15. Scheduling of VLIW Instructions
16. Limitations to VLIW Implementation
- Technical problems:
- Generating enough straight-line code requires ambitious loop unrolling, which increases code size.
- Poor code density: whenever an instruction is not full, the unused functional-unit slots translate into wasted bits in the instruction encoding (instructions are typically only about 60% full).
- Logistical problem:
- Binary code compatibility, which depends on
  - the instruction set definition, and
  - the detailed pipeline structure, including both the functional units and their latencies.
- Advantages of a superscalar processor over a VLIW processor:
- Little impact on code density.
- Even unscheduled programs, or those compiled for older implementations, can be run.
17. Advanced Compiler Support for Exposing and Exploiting ILP
- Exploiting loop-level parallelism
- Converting loop-level parallelism into ILP
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
18. Loop-Level Parallelism
- Concepts and techniques:
- Loop-level parallelism is normally analyzed at the source level, whereas most ILP analysis is done after instructions have been generated by the compiler.
- The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are data dependent on values produced in earlier iterations.
- Example:

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

- Loop-carried data dependence: a dependence that exists between different iterations of the loop.
- A loop is parallel unless there is a cycle in its dependences; a loop-carried dependence that does not form a cycle can be eliminated by code transformation.
19. Loop-Carried Data Dependence (1)
- Example:

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
    }

- Dependence graph.
20. Loop-Carried Data Dependence (2)
- Example:

    for (i = 1; i <= 100; i = i + 1) {
        A[i]   = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }

- Code transformation:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];         /* S2 */
        A[i+1] = A[i+1] + B[i+1];     /* S1 */
    }
    B[101] = C[100] + D[100];

- This converts the loop-carried data dependence into an ordinary data dependence within a single iteration.
21. Loop-Carried Data Dependence (3)
- True loop-carried data dependences usually take the form of a recurrence:

    for (i = 2; i <= 100; i++)
        Y[i] = Y[i-1] + Y[i];

- Even a true loop-carried dependence can leave parallelism:

    for (i = 6; i <= 100; i++)
        Y[i] = Y[i-5] + Y[i];

- Here the dependence distance is 5, so any five consecutive iterations (e.g., the first through the fifth) are independent and can execute in parallel.
22. Detecting and Eliminating Dependences
- Finding the dependences in a program is an important part of three tasks:
- good scheduling of code,
- determining which loops might contain parallelism, and
- eliminating name dependences.
- Example:

    for (i = 1; i <= 100; i++) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

- There is no loop-carried dependence (the dependence from A[i] to D[i] stays within one iteration), which implies a large amount of parallelism: all iterations are independent.
23. Dependence Detection Problem
- In general, dependence detection is NP-complete.
- GCD test heuristic (sketched in C below):
- Suppose we have stored to an array element with index value a*j + b and loaded from the same array with index value c*k + d, where j and k are the for-loop index variables that run from m to n. A dependence exists if two conditions hold:
  - There are two iteration indices, j and k, both within the limits of the for loop.
  - The loop stores into an array element indexed by a*j + b and later fetches from that same array element when it is indexed by c*k + d; that is, a*j + b = c*k + d.
- Note that a, b, c, and d are generally unknown at compile time, making it impossible to tell exactly whether a dependence exists.
- A simple, sufficient test for the absence of a dependence: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b). That is, if GCD(c, a) does not divide (d - b), no dependence is possible (example on page 324).
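A minimal C sketch of this test follows; the function names are mine, and the test is conservative, proving only the absence of a dependence:

    #include <stdlib.h>

    static int gcd(int x, int y) {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    /* Store index a*j + b, load index c*k + d. Returns 0 only when
       the GCD test proves no loop-carried dependence is possible. */
    int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(abs(a), abs(c));
        if (g == 0)               /* a == c == 0: both indices constant */
            return b == d;
        return (d - b) % g == 0;  /* dependence possible iff g | (d-b) */
    }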
24. Situations where Dependence Analysis Fails
- When objects are referenced via pointers rather than array indices.
- When array indexing is indirect, through another array.
- When a dependence may exist for some values of the inputs but does not exist for the actual inputs.
- Others.
25. Eliminating Dependent Computations
- Copy propagation:

    DADDUI R1, R2, 4
    DADDUI R1, R1, 4

  becomes

    DADDUI R1, R2, 8

- Tree height reduction (see the C sketch below):

    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7

  becomes

    ADD R1, R2, R3
    ADD R4, R6, R7
    ADD R8, R1, R4
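In C terms, tree height reduction exploits associativity to shorten the dependence chain; a minimal sketch, with function and variable names of my choosing:

    /* Serial form: each add depends on the previous one (chain of 3). */
    long sum_serial(long r2, long r3, long r6, long r7) {
        long r1 = r2 + r3;
        long r4 = r1 + r6;   /* must wait for r1 */
        return r4 + r7;      /* must wait for r4 */
    }

    /* Balanced form: the first two adds are independent (chain of 2). */
    long sum_balanced(long r2, long r3, long r6, long r7) {
        long r1 = r2 + r3;
        long r4 = r6 + r7;   /* independent of r1; can issue in parallel */
        return r1 + r4;
    }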
26. Software Pipelining: Symbolic Loop Unrolling
- Software pipelining is a technique for reorganizing loops such that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop.
- A software-pipelined loop interleaves instructions from different loop iterations without unrolling the loop.
- A software-pipelined loop consists of a loop body, start-up code, and clean-up code.
27. Example
- Original loop:

    Loop:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     F4, 0(R1)
           DADDUI  R1, R1, -8
           BNE     R1, R2, Loop

- Reorganized (software-pipelined) loop:

    Loop:  S.D     F4, 16(R1)    ; store for iteration i
           ADD.D   F4, F0, F2    ; add for iteration i+1
           L.D     F0, 0(R1)     ; load for iteration i+2
           DADDUI  R1, R1, -8
           BNE     R1, R2, Loop

- The loop body takes one instruction from each of three consecutive iterations of the original loop:

    Iteration i:    L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)
    Iteration i+1:  L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)
    Iteration i+2:  L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)
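The slide shows only the steady-state body. As a sketch of where the start-up and clean-up code from slide 26 come from, here is the same x[i] = x[i] + s loop software-pipelined in C, assuming n >= 2, with variable names mirroring the registers:

    void sw_pipelined(double *x, double s, int n) {
        double f0, f4;
        int i;
        f0 = x[0];                 /* start-up: load for iteration 0 */
        f4 = f0 + s;               /* start-up: add for iteration 0  */
        f0 = x[1];                 /* start-up: load for iteration 1 */
        for (i = 0; i < n - 2; i++) {
            x[i] = f4;             /* store for iteration i          */
            f4 = f0 + s;           /* add for iteration i+1          */
            f0 = x[i + 2];         /* load for iteration i+2         */
        }
        x[n - 2] = f4;             /* clean-up: the last two stores  */
        x[n - 1] = f0 + s;
    }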
28. Comparison between Software Pipelining and Loop Unrolling
- Software pipelining consumes less code space.
- Loop unrolling reduces the overhead of the loop: the branch and counter-update code.
- Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.
29. Global Code Scheduling
30. Trace Scheduling: Focusing on the Critical Path
- Trace selection: find a likely sequence of basic blocks (the trace) to schedule as a unit.
- Trace compaction: schedule the trace into as few wide instructions as possible.
- Bookkeeping code: compensation code inserted where control enters or leaves the trace, so that off-trace paths remain correct.
31. Hardware Support for Exposing More Parallelism at Compile Time
- The difficulty of uncovering more ILP at compile time (due to unknown branch behavior) can be overcome by employing the following techniques:
- Conditional or predicated instructions.
- Speculation:
- static speculation, performed by the compiler with hardware support;
- dynamic speculation, performed by hardware, using branch prediction to guide the speculation process.
32. Conditional or Predicated Instructions
- Basic concept:
- The instruction refers to a condition that is evaluated as part of the instruction's execution. If the condition is true, the instruction is executed normally; otherwise, execution continues as if the instruction were a no-op.
- Conditional instructions allow us to convert the control dependence present in a branch-based code sequence into a data dependence.
- A conditional instruction can be used to speculatively move an instruction that is time critical.
- To use a conditional instruction successfully, as in the examples, we must ensure that the speculated instruction does not introduce an exception.
33. Conditional Move
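The slide's figure is not reproduced here. As a C-level sketch of the idea (following the textbook's classic "if (A == 0) S = T;" example), if-conversion replaces the branch with an always-executed move whose effect depends on the condition:

    /* Branch-based form: a control dependence on a. */
    void with_branch(int a, int *s, int t) {
        if (a == 0)
            *s = t;
    }

    /* What a conditional move (e.g., CMOVZ) expresses: the move always
       executes, and the condition selects whether it takes effect --
       the control dependence has become a data dependence. */
    void with_cmov(int a, int *s, int t) {
        *s = (a == 0) ? t : *s;
    }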
34. On the Time-Critical Path
- Example on pages 342 and 343.
35. Example (Cont.)
36. Limiting Factors
- The usefulness of conditional instructions is limited by several factors:
- Conditional instructions that are annulled still take execution time.
- Conditional instructions are most useful when the condition can be evaluated early.
- Their use is limited when the control flow involves more than a simple alternative sequence.
- Conditional instructions may have some speed penalty compared with unconditional instructions.
- Machines that use conditional instructions:
  - Alpha: conditional move.
  - HP PA: any register-register instruction.
  - SPARC: conditional move.
  - ARM: all instructions.
37. Compiler Speculation with Hardware Support
- In moving instructions across a branch, the compiler must ensure that exception behavior is not changed and that the dynamic data dependences remain the same.
- The simplest case is when the compiler is conservative about which instructions it speculatively moves, so that exception behavior is unaffected.
- Four methods:
- The hardware and OS cooperatively ignore exceptions for speculative instructions.
- Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
- Poison bits are attached to the result registers written by speculated instructions when those instructions cause exceptions.
- Instruction results are buffered until it is certain that the instruction is no longer speculative.
38. Types of Exceptions
- Two types of exceptions need to be distinguished:
- Exceptions that indicate a program error, meaning the program must be terminated (e.g., a memory protection violation).
- Exceptions that can normally be resumed (e.g., page faults).
- Basic principles employed by the mechanisms above:
- Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions.
- Exceptions that indicate a program error should not occur in correct programs.
39. Hardware-Software Cooperation for Speculation
- The hardware and OS simply:
- handle all resumable exceptions when an exception occurs, and
- return an undefined value for any exception that would cause termination.
- If a normal instruction generates:
- a terminating exception -> an undefined value is returned and the program proceeds -> the program produces incorrect results; or
- a resumable exception -> the exception is accepted and handled -> the program continues normally.
- If a speculative instruction generates:
- a terminating exception -> an undefined value is returned -> a correct program will never use it -> the result is still correct; or
- a resumable exception -> the exception is accepted and handled -> the program continues normally.
40. Example
41. Speculative Instructions that Never Raise Exceptions (Method 2)
42. Answer
43. Speculation with Poison Bits
- A poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative.
- Three rules (simulated in the C sketch below):
- The destination register's poison bit is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately.
- If a speculative instruction uses a register whose poison bit is on, the destination register of the instruction simply has its poison bit turned on.
- If a normal instruction attempts to use a register source whose poison bit is on, the instruction causes a fault.
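A minimal C simulation of these rules; the types and function names are mine and purely illustrative:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long value; bool poison; } reg;

    /* Rule 1: a terminating exception in a speculative instruction
       sets the destination's poison bit instead of faulting.
       NULL stands in for a bad address here. */
    reg spec_load(const long *addr) {
        reg r = { 0, false };
        if (addr == NULL)
            r.poison = true;
        else
            r.value = *addr;
        return r;
    }

    /* Rule 2: a speculative use of a poisoned source poisons the result. */
    reg spec_add(reg a, reg b) {
        reg r = { a.value + b.value, a.poison || b.poison };
        return r;
    }

    /* Rule 3: a normal use of a poisoned source raises the fault. */
    long normal_use(reg a) {
        if (a.poison) {
            fprintf(stderr, "terminating exception surfaces here\n");
            exit(1);
        }
        return a.value;
    }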
44. Example
45. Hardware Support for Memory Reference Speculation
- Moving loads across stores is usually done only when the compiler is certain the addresses do not conflict.
- To support speculative loads (sketched in C below):
- A special check instruction, placed at the original location of the load instruction, checks for an address conflict.
- When a speculated load executes, the hardware saves the address of the accessed memory location.
- If the value stored at that location is changed before the check instruction executes, speculation fails; otherwise, it succeeds.
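A C-level sketch of the load/check idea; the explicit address comparison below stands in for the hardware's saved-address check, and the names are mine:

    /* The load of *p has been moved speculatively above a store that
       may alias it; the check at the load's original position redoes
       the load if the store touched the same location. */
    long load_then_check(long *p, long *q, long x) {
        long v = *p;       /* speculative load (hardware records the address) */
        *q = x;            /* possibly conflicting store                      */
        if (q == p)        /* "check" at the load's original location         */
            v = *p;        /* conflict: speculation failed, reload            */
        return v;
    }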
46. Hardware- versus Software-Based Speculation
- Dynamic, runtime disambiguation of memory addresses makes extensive speculation possible; it allows loads to be moved past stores at runtime.
- Hardware-based speculation wins when hardware-based branch prediction is better than the software-based branch prediction done at compile time.
- Hardware-based speculation maintains a completely precise exception model.
- Hardware-based speculation does not require compensation or bookkeeping code.
- Hardware-based speculation with dynamic scheduling does not require different code sequences for different implementations of an architecture to achieve good performance.
- Compiler-based approaches, on the other hand, can see further ahead in the code sequence.
47. Concluding Remarks
- Hardware and software approaches to increasing ILP tend to converge.