EECS 583 Lecture 16: Sentinel Scheduling, Intro to Modulo Scheduling (PowerPoint Presentation Transcript)

1
EECS 583 Lecture 16: Sentinel Scheduling, Intro to Modulo Scheduling
  • University of Michigan
  • March 10, 2004

2
Class Problem From Last Time
1: r1 = r7 + 4
2: branch p1 Exit1
3: store (r1, -1)
4: branch p2 Exit2
5: r2 = load(r7)
6: r3 = r2 + 4
7: branch p3 Exit3
8: r4 = r3 / r8

(Figure: superblock with its dependence graph; live-out registers r1, r2, r4, r8 at the exits.)
1. Starting with the graph assuming restricted speculation, what edges can be removed if general speculation support is provided?
2. With more renaming, what dependences could be removed?
3
Sentinel Speculation Model
1: branch x == 0
  • Ignoring all speculative exceptions is painful
  • Debugging issue (is a program ever fully
    correct?)
  • Also, handling of all fixable exceptions for
    speculative ops can be slow
  • Extra page faults
  • Sentinel speculation
  • Mark speculative ops (opcode bit)
  • Exceptions for speculative ops are noted, but not
    handled immediately (return garbage value)
  • Check for exception conditions in the home
    block of speculative potentially excepting ops

2: y = *x
3: z = y + 4
4: w = z

After speculation (ops 2 and 3 moved above the branch):
2: y = *x (speculative)
3: z = y + 4 (speculative)
1: branch x == 0
   check exception
4: w = z
4
Delaying Speculative Exceptions
1: branch x == 0
  • 3 things needed
  • Record exceptions
  • Check for exceptions
  • Regenerate exception
  • Re-execute ops including dependent ops
  • Terminate execution or process exception
  • Recording them
  • Extend every register with an extra bit
  • Exception tag (or NAT bit)
  • Reg data is garbage when set
  • Bit is set when either
  • Speculative op causes exception
  • Speculative op has a NATd source operand
    (exception propagation)

2: y = *x
3: z = y + 4
4: w = z

After speculation:
2: y = *x (speculative)
3: z = y + 4 (speculative)
1: branch x == 0
   check exception
4: w = z
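
To make the recording rule concrete, here is a rough C model of the NAT-bit semantics described above; the Reg type and the spec_load/spec_add helpers are hypothetical names for illustration, not a real API or the actual hardware:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Each register carries an extra exception tag (NAT bit). */
    typedef struct {
        int64_t val;   /* garbage when nat is set                  */
        bool    nat;   /* set when a speculative op defers a fault */
    } Reg;

    /* Speculative load: on a fault, record the exception in the NAT
       bit instead of raising it, and return a garbage value.        */
    static Reg spec_load(const int64_t *addr) {
        Reg r = { 0, true };        /* assume the worst                */
        if (addr != NULL) {         /* stand-in for "no fault occurred" */
            r.val = *addr;
            r.nat = false;
        }
        return r;
    }

    /* Speculative add: propagate the NAT bit from the source operand. */
    static Reg spec_add(Reg a, int64_t imm) {
        Reg r = { a.val + imm, a.nat };
        return r;
    }
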
5
Delaying Speculative Exceptions (2)
1: branch x == 0
  • Check for exceptions
  • Test NAT bit of appropriate register (last
    register in dependence chain) in home block
  • Explicit checks
  • Insert new operation to check NAT
  • Implicit checks
  • Non-speculative use of register automatically
    serves as NAT check
  • Regenerate exception
  • Figure out the exact cause
  • Handle if possible
  • A check that finds the NAT condition set branches
    to recovery code
  • Compiler generates the recovery code specific to
    each check

2: y = *x
3: z = y + 4
4: w = z

After speculation:
2: y = *x (speculative)
3: z = y + 4 (speculative)
1: branch x == 0
   check NAT(z)
4: w = z
6
Delaying Speculative Exceptions (3)
In the recovery code, the exception condition will be regenerated as the excepting op is re-executed with the same inputs. If the exception can be handled, it is; all dependent ops are re-executed, and execution returns to the point after the check. If the exception is a program error, execution is terminated in the recovery code.
Recovery code consists of the chain of operations starting with a potentially excepting speculative op up to its corresponding check.
2: y = *x (speculative)
3: z = y + 4 (speculative)
1: branch x == 0
   branch NAT(z), fixup
done:
4: w = z

Recovery code:
fixup:
2: y = *x
3: z = y + 4
   jump done
7
Implicit vs Explicit Checks
  • Explicit
  • Essentially just a conditional branch
  • Nothing special needs to be added to the
    processor
  • Problems
  • Code size
  • Checks take valuable resources
  • Implicit
  • Use existing instructions as checks
  • Removes problems of explicit checks
  • However, how do you specify the address of the
    recovery block, and how is control transferred
    there?
  • Hardware table
  • Indexed by PC
  • Indicates where to go when NAT is set (see the
    sketch below)
  • Itanium uses explicit checks for loads only
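
A minimal sketch, assuming a simple PC-indexed lookup, of the hardware-table idea for implicit checks; the table layout and the names CheckEntry/recovery_for are made up for illustration and are not Itanium's actual mechanism:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical PC-indexed table: maps the checking use's PC to the
       address of the compiler-generated recovery code for that check.  */
    typedef struct {
        uintptr_t use_pc;
        uintptr_t recovery_pc;
    } CheckEntry;

    static const CheckEntry check_table[] = {
        { 0x4000, 0x7000 },   /* example entries, values made up */
        { 0x4010, 0x7040 },
    };

    /* On a non-speculative use whose source register has NAT set,
       control would transfer to the address returned here.         */
    static uintptr_t recovery_for(uintptr_t use_pc) {
        for (size_t i = 0; i < sizeof check_table / sizeof check_table[0]; i++)
            if (check_table[i].use_pc == use_pc)
                return check_table[i].recovery_pc;
        return 0;   /* no entry: treat as a non-deferred exception */
    }
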

8
Class Problem
1: r1 = r7 + 4
2: branch p1 Exit1
3: store (r1, -1)
4: branch p2 Exit2
5: r2 = load(r7)
6: r3 = r2 + 4
7: branch p3 Exit3
8: r4 = r3 / r8

(Figure: superblock with its dependence graph; live-out registers r1, r2, r4, r8 at the exits.)
  • Move ops 5, 6, 8 as far up in the SB as possible
    assuming sentinel speculation support
  • Insert the necessary checks and recovery code
    (assume ld, st, and div can cause exceptions)

9
Hidden Cost of Delaying Exceptions
2: y = *x (speculative)
3: z = y + 4 (speculative)
1: branch x == 0
   branch NAT(z), fixup
done:
4: w = z

Recovery code:
fixup:
2: y = *x
3: z = y + 4
   jump done
Register pressure: consider the lifetime of x. It used to terminate at op 2; now it must be extended into the recovery block. In general, all registers live-in to a speculative dependence chain must be preserved until the exception is checked for and regenerated.
10
Trace Scheduling
  • Trace scheduling is one of the more famous VLIW
    compiler techniques
  • Invented by Josh Fisher and John Ellis in the
    early 80s
  • Bulldog compiler
  • Multiflow compiler
  • Recall, traces are sequences of basic blocks that
    are likely execution paths
  • Traces have both side entrances and exits
  • Scheduler must worry about both

Trace:
1: r1 = r2 + r3
2: r4 = load(r1)
3: p1 = cmpp(r3 == 0)
4: branch p1 Exit1
5: store (r4, -1)
6: r2 = r2 + 4
7: r5 = load(r2)
8: p2 = cmpp(r5 > 9)
9: branch p2 Exit2

(Figure: live-out registers r4, r2, r5 at the side exits and trace end.)
11
Strategy of Trace Scheduling
  • Ignore side entrances and exits during scheduling
  • Insert bookkeeping code after scheduling is
    complete to fixup trace boundaries
  • Upward code motion: still must obey both R1 and
    R2
  • But, fixup required for side entrances. Side
    exits are no problem due to R1
  • Downward code motion: ignore R1 (use
    compensation code to fix up)
  • Fixup required for both side entrances and exits

1: r1 = r2 + r3
6: r2 = r2 + 4
7: r5 = load(r2)
3: p1 = cmpp(r3 == 0)
4: branch p1 Exit1
2: r4 = load(r1)
5: store (r4, -1)
8: p2 = cmpp(r5 > 9)
9: branch p2 Exit2
Move 6 and 7 up, 2 down. Assume 5 and 7 may not access the same location.
12
Insertion of Bookkeeping Code (1)
Account for 6 and 7 moving up
1: r1 = r2 + r3
6: r2 = r2 + 4
7: r5 = load(r2)
3: p1 = cmpp(r3 == 0)
4: branch p1 Exit1
2: r4 = load(r1)
5: store (r4, -1)
8: p2 = cmpp(r5 > 9)
9: branch p2 Exit2

Bookkeeping code on the side-entrance path (copies of the ops that moved above it):
6: r2 = r2 + 4
7: r5 = load(r2)
13
Insertion of Bookkeeping Code (2)
Account for 2 moving down
6: r2 = r2 + 4
7: r5 = load(r2)
3: p1 = cmpp(r3 == 0)
4: branch p1 Exit1

1: r1 = r2 + r3
6: r2 = r2 + 4
7: r5 = load(r2)
3: p1 = cmpp(r3 == 0)
4: branch p1 Exit1
2: r4 = load(r1)
5: store (r4, -1)
8: p2 = cmpp(r5 > 9)
9: branch p2 Exit2

2: r4 = load(r1)   (compensation copy for op 2 moving below branch 4)
14
Class Problem
1: r1 = r7 + 4
2: branch p1 Exit1
3: store (r1, -1)
4: r2 = load(r7)
5: r3 = r2 + 4
6: branch p3 Exit2
7: r4 = r3 / r8

(Figure: trace with its dependence graph; live-out registers r7, r2, r4, r8 at the exits.)
  • Move ops 4, 5, 7 as far up in the trace. Move ops
    1, 3 as far down as possible, assuming general
    speculation
  • Insert the necessary bookkeeping code to make
    these moves legal

15
Scalar Scheduling Wrap Up
  • SB scheduling has no bookkeeping, so it's simpler
  • Replicate code during SB formation thus
    eliminating need for bookkeeping
  • Trace scheduling
  • In general has less code expansion than SB
  • But, it can be quite messy
  • Elcor/Impact
  • Uses SB/HB scheduling
  • Restricted or general speculation models
    supported
  • General is default
  • Sentinel speculation not supported (though it
    should be)
  • Next time: modulo scheduling for loops

16
Change Focus to Scheduling Loops
Most of program execution time is spent in loops.
Problem: how do we achieve compact schedules for loops?

r1 = _a
r2 = _b
r9 = r1 + 400
Loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

for (j = 0; j < 100; j++)
  b[j] = a[j] * 26;
17
Basic Approach: List Schedule the Loop Body
(Figure: iterations 1, 2, 3, ..., n scheduled back to back over time.)

Schedule each iteration.
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

time | ops
  0  | 1, 4
  1  | 6
  2  | 2
  3  | -
  4  | -
  5  | 3, 5, 7

1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

Total time = 6n
18
Unroll Then Schedule Larger Body
(Figure: unrolled iteration pairs 1-2, 3-4, 5-6, ..., (n-1)-n scheduled back to back over time.)

Schedule each unrolled body.
Resources: 4 issue, 2 alu, 1 mem, 1 br
Latencies: add = 1, cmpp = 1, mpy = 3, ld = 2, st = 1, br = 1

time | ops            (primes mark the second unrolled copy)
  0  | 1, 4
  1  | 1', 6, 4'
  2  | 2, 6'
  3  | 2'
  4  | -
  5  | 3, 5, 7
  6  | 3', 5', 7'

1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

Total time = 7n/2
19
Problems With Unrolling
  • Code bloat
  • Typical unroll is 4-16x
  • Use profile statistics to only unroll important
    loops
  • But still, code grows fast
  • Barrier across unrolled bodies (see the sketch
    after this list)
  • I.e., for unroll 2, can only overlap iterations 1
    and 2, 3 and 4, ...
  • Does this mean unrolling is bad?
  • No, in some settings it's very useful
  • Low trip count
  • Lots of branches in the loop body
  • But, in other settings, there is room for
    improvement
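
As a concrete picture of unrolling, a hand-unrolled C version of the running loop b[j] = a[j] * 26 (the function name scale_unrolled is made up); the two body copies can be overlapped by the scheduler, but the loop-back branch still separates one unrolled trip from the next:

    /* Original loop: b[j] = a[j] * 26 for 100 iterations.              */
    /* Unrolled 2x by hand: two copies of the body per trip, so the     */
    /* scheduler can overlap iterations j and j+1, but iterations from  */
    /* different trips (j+1 and j+2) remain separated by the branch.    */
    void scale_unrolled(const int *a, int *b) {
        for (int j = 0; j < 100; j += 2) {
            b[j]     = a[j]     * 26;   /* copy 1 of the body */
            b[j + 1] = a[j + 1] * 26;   /* copy 2 of the body */
        }
    }
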

20
Overlap Iterations Using Pipelining
(Figure: iterations 1, 2, 3, ..., n overlapped in time, pipeline style.)
With hardware pipelining, while one instruction
is in fetch, another is in decode, another in
execute. Same thing here, multiple iterations
are processed simultaneously, with each
instruction in a separate stage. 1 iteration
still takes the same time, but time to complete n
iterations is reduced!
21
A Software Pipeline
Loop body with 4 ops: A, B, C, D

         Prologue - fill the pipe
time 0:  A
time 1:  B A
time 2:  C B A
         Kernel - steady state
time 3:  D C B A
time 4:  D C B A
   ...
         Epilogue - drain the pipe
         D C B
         D C
         D
Steady state: 4 iterations execute simultaneously, 1 operation from each iteration. Every cycle, an iteration starts and an iteration finishes once the pipe is full.
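
To connect this picture to the running example, here is a hand-pipelined C sketch of b[j] = a[j] * 26 using two stages (load in one stage, multiply-and-store in the next); the staging and the name scale_pipelined are my choices, not from the slides:

    void scale_pipelined(const int *a, int *b) {
        int t = a[0];                  /* prologue: fill the pipe   */
        for (int j = 0; j < 99; j++) { /* kernel: steady state      */
            int next = a[j + 1];       /* stage 1 of iteration j+1  */
            b[j] = t * 26;             /* stage 2 of iteration j    */
            t = next;
        }
        b[99] = t * 26;                /* epilogue: drain the pipe  */
    }
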
22
Creating Software Pipelines
  • Lots of software pipelining techniques out there
  • Modulo scheduling
  • Most widely adopted
  • Practical to implement, yields good results
  • Conceptual strategy
  • Unroll the loop completely
  • Then, schedule the code completely with 2
    constraints
  • All iteration bodies have identical schedules
  • Each iteration is scheduled to start some fixed
    number of cycles later than the previous
    iteration
  • Initiation Interval (II): fixed delay between
    the start of successive iterations
  • Given the 2 constraints, the unrolled schedule is
    repetitive (kernel) except the portion at the
    beginning (prologue) and end (epilogue)
  • Kernel can be re-rolled to yield a new loop

23
Creating Software Pipelines (2)
  • Create a schedule for 1 iteration of the loop
    such that when the same schedule is repeated at
    intervals of II cycles
  • No intra-iteration dependence is violated
  • No inter-iteration dependence is violated
  • No resource conflict arises between operations in
    same or distinct iterations
  • We will start out assuming Itanium-style hardware
    support, then remove it later
  • Rotating registers
  • Predicates
  • Brtop

24
Terminology
Initiation Interval (II): fixed delay between the start of successive iterations. Each iteration can be divided into stages consisting of II cycles each. The number of stages in 1 iteration is termed the stage count (SC). It takes SC - 1 stages ((SC - 1) x II cycles) to fill/drain the pipe.
(Figure: iterations 1, 2, 3 overlapped in time, each starting II cycles after the previous one.)
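
A quick consequence of these definitions: iteration k starts at cycle (k - 1) x II and occupies SC x II cycles, so n iterations finish after about (n + SC - 1) x II cycles, versus n x SC x II cycles with no overlap. For example, with II = 2 and SC = 3, that is 2n + 4 cycles instead of 6n.
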
25
Resource Usage Legality
  • Need to guarantee that
  • No resource is used at 2 points in time that are
    separated by an interval which is a multiple of
    II
  • I.e., within a single iteration, the same
    resource is never used more than once at the same
    time modulo II
  • Known as the modulo constraint, which is where the
    name modulo scheduling comes from
  • Modulo reservation table solves this problem
  • To schedule an op at time T needing resource R
  • The entry for R at T mod II must be free
  • Mark busy at T mod II if scheduled (see the sketch
    below)

Modulo reservation table (II = 3); rows are slots 0-2, columns are the resources:

 slot | br | alu1 | alu2 | mem | bus0 | bus1
   0  |    |      |      |     |      |
   1  |    |      |      |     |      |
   2  |    |      |      |     |      |
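
A minimal C sketch of the modulo reservation table check (the array layout, sizes, and function name are assumptions for illustration):

    #include <stdbool.h>

    enum { N_RES = 6, MAX_II = 64 };  /* e.g. br, alu1, alu2, mem, bus0, bus1 */

    static bool mrt[MAX_II][N_RES];   /* mrt[slot][resource] == busy?         */

    /* Try to reserve resource 'res' for an op scheduled at cycle 'time'.
       The modulo constraint: only the slot time % II matters.            */
    static bool mrt_reserve(int time, int res, int II) {
        int slot = time % II;
        if (mrt[slot][res])
            return false;             /* conflict with an op in some iteration */
        mrt[slot][res] = true;        /* mark busy at time mod II              */
        return true;
    }
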
26
Dependences in a Loop
  • Need to worry about 2 kinds
  • Intra-iteration
  • Inter-iteration
  • Delay
  • Minimum time interval between the start of
    operations
  • Operation read/write times
  • Distance
  • Number of iterations separating the 2 operations
    involved
  • Distance of 0 means intra-iteration
  • Recurrence manifests itself as a circuit in the
    dependence graph

(Figure: dependence graph with nodes 1-4; edge annotations include <1,1>, <1,2>, and <1,0>.)
Edges are annotated with the tuple <delay, distance> (see the sketch below).
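
A small C sketch of how a scheduler might represent such annotated edges (struct and field names are illustrative, not from any particular compiler):

    /* One dependence edge in the loop's dependence graph. */
    typedef struct {
        int src, dst;    /* operation numbers                                    */
        int delay;       /* minimum cycles between the starts of src and dst     */
        int distance;    /* iterations separating them; 0 means intra-iteration  */
    } DepEdge;

    /* Example: op 2 of the next iteration depends on op 1, <delay, distance> = <1,1>. */
    static const DepEdge example_edge = { 1, 2, 1, 1 };
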
27
Dynamic Single Assignment (DSA) Form
It is impossible to overlap iterations because each iteration writes to the same registers. So, we'll have to remove the anti and output dependences. Recall the notion of a rotating register (virtual for now): each register is an infinite push-down array (expanded virtual register, or EVR). Writes go to the top element, but any element can be referenced. The remap operation slides everything down: r[n] changes to r[n+1]. A program is in DSA form if the same virtual register (EVR element) is never assigned more than once on any dynamic execution path.
Original loop:
1: r3 = load(r1)
2: r4 = r3 * 26
3: store (r2, r4)
4: r1 = r1 + 4
5: r2 = r2 + 4
6: p1 = cmpp (r1 < r9)
7: brct p1 Loop

After DSA conversion:
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
6: p1[-1] = cmpp (r1[-1] < r9)
remap r1, r2, r3, r4, p1
7: brct p1[-1] Loop
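
A toy C model of an EVR and its remap operation, just to illustrate the indexing; the fixed size and circular-buffer realization are my simplifications (a conceptual EVR is unbounded):

    #define EVR_SIZE 8                 /* enough live elements for this example */

    typedef struct {
        int vals[EVR_SIZE];
        int base;                      /* physical index of element r[0]        */
    } EVR;

    static int phys(const EVR *r, int n) {
        return ((r->base + n) % EVR_SIZE + EVR_SIZE) % EVR_SIZE;
    }

    static int  evr_read (const EVR *r, int n)  { return r->vals[phys(r, n)]; }
    static void evr_write(EVR *r, int n, int v) { r->vals[phys(r, n)] = v;    }

    /* remap: slide everything down, so r[n] becomes r[n+1].
       A value written to r[-1] before remap is read as r[0] afterwards. */
    static void evr_remap(EVR *r) {
        r->base = (r->base - 1 + EVR_SIZE) % EVR_SIZE;
    }
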
28
Physical Realization of EVRs
  • An EVR may contain an unlimited number of values
  • But, only a finite contiguous set of elements of
    an EVR are ever live at any point in time
  • These must be given physical registers
  • Conventional register file
  • Remaps are essentially copies, so each EVR is
    realized by a set of physical registers and
    copies are inserted
  • Rotating registers
  • Direct support for EVRs
  • No copies needed
  • File rotated after each loop iteration is
    completed (see the sketch below)
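
A rough C sketch of the rotating-register idea: the architectural register number is offset by a rotating register base (RRB) that is decremented when the loop-back branch commits, so the same name maps to a new physical register each iteration. The sizes and names below are illustrative, not the exact Itanium scheme:

    enum { NUM_ROT = 32 };             /* size of the rotating portion (assumed) */

    static int rrb = 0;                /* rotating register base                 */

    /* Map an architectural rotating register number to a physical index. */
    static int phys_reg(int arch_reg) {
        return ((arch_reg + rrb) % NUM_ROT + NUM_ROT) % NUM_ROT;
    }

    /* Executed as part of the loop-back branch: rotate the file so that
       what iteration i wrote as r[x] is seen by iteration i+1 as r[x+1]. */
    static void rotate(void) {
        rrb = (rrb - 1 + NUM_ROT) % NUM_ROT;
    }
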

29
Loop Dependence Example
1: r3[-1] = load(r1[0])
2: r4[-1] = r3[-1] * 26
3: store (r2[0], r4[-1])
4: r1[-1] = r1[0] + 4
5: r2[-1] = r2[0] + 4
6: p1[-1] = cmpp (r1[-1] < r9)
remap r1, r2, r3, r4, p1
7: brct p1[-1] Loop

(Figure: dependence graph over ops 1-7, each edge annotated <delay, distance>; intra-iteration edges such as <2,0>, <3,0>, <1,0>, <0,0>, and inter-iteration edges such as <1,1>.)

In DSA form, there are no inter-iteration anti or output dependences!
30
Class Problem
Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1

1: r1[-1] = load(r2[0])
2: r3[-1] = r1[1] + r1[2]
3: store (r3[-1], r2[0])
4: r2[-1] = r2[0] + 4
5: p1[-1] = cmpp (r2[-1] < 100)
remap r1, r2, r3
6: brct p1[-1] Loop

Draw the dependence graph showing both intra- and inter-iteration dependences.