EECS 583 Class 18 Iterative Modulo Scheduling Part II - PowerPoint PPT Presentation

Provided by: scottm80
1
EECS 583 Class 18: Iterative Modulo Scheduling, Part II
  • University of Michigan
  • March 21, 2005
  • Note: The Msched example slides from the last
    lecture were wrong; use the ones from these notes

2
Schedule for the Rest of the Semester
  • 4 more lectures
      • 3/21 (today): Finish modulo scheduling
      • 3/23: Partitioning for multicluster processors
        (Kevin Fan)
      • 3/28: Compiler-controlled data prefetching
        (Manjunath Kudlur)
      • 3/30: Register allocation
  • Exam: Monday, April 4
  • Paper presentations by you guys
      • 4/6: Memory group
      • 4/11: Code generation group
      • 4/13: Opti group (part I)
      • 4/18: Opti group (part II)
  • Final project demos
      • 4/20 or 4/25-4/27
  • SIG meetings this week: same time as last week

3
Reading Material
  • Today's class
      • "Iterative Modulo Scheduling: An Algorithm for
        Software Pipelining Loops", B. Rau, MICRO-27,
        1994, pp. 63-74.
      • "Code Generation Schemas for Modulo Scheduled
        DO-Loops and WHILE-Loops", B. Rau, M. Schlansker,
        and P. Tirumalai, MICRO-25, Dec. 1992.
  • Material for the next lecture
      • "Region-based Hierarchical Operation Partitioning
        for Multicluster Processors", M. Chu et al.,
        PLDI-2003, June 2003.

4
Modulo Scheduling - Driver
  • compute MII
  • II = MII
  • budget = BUDGET_RATIO * number of ops
  • while (schedule is not found) do
      • iterative_schedule(II, budget)
      • II++
  • BUDGET_RATIO is a measure of the amount of
    backtracking that can be performed before giving
    up and trying a higher II
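
The driver above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: `modulo_schedule`, `toy_scheduler`, and the convention that the scheduler returns `None` on failure are assumptions made for this example.

```python
def modulo_schedule(num_ops, mii, budget_ratio, iterative_schedule):
    """Driver loop: start at II = MII and raise II until a schedule is found."""
    ii = mii
    budget = budget_ratio * num_ops      # backtracking allowance for each attempt
    while True:
        schedule = iterative_schedule(ii, budget)
        if schedule is not None:
            return ii, schedule
        ii += 1                          # give up on this II and try the next one


# Hypothetical stand-in scheduler: pretends that only II >= 3 is feasible.
def toy_scheduler(ii, budget):
    return {"op": 0} if ii >= 3 else None


found_ii, found_schedule = modulo_schedule(
    num_ops=6, mii=2, budget_ratio=3, iterative_schedule=toy_scheduler)
print(found_ii, found_schedule)
```

A larger `budget_ratio` lets each attempt backtrack longer before the driver moves on to a higher II.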

5
Modulo Scheduling Iterative Scheduler
  • iterative_schedule(II, budget)
      • compute op priorities
      • while (there are unscheduled ops and budget > 0)
        do
          • op = unscheduled op with the highest priority
          • min = early time for op (E(Y))
          • max = min + II - 1
          • t = find_slot(op, min, max)
          • schedule op at time t
          • /* Backtracking phase: undo previous scheduling
            decisions */
          • unschedule all previously scheduled ops that
            conflict with op
          • budget--
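
As a rough sketch of the loop above, the following Python runs the scheduler on ops 1-5 of the lecture's example loop at II = 2. Several things here are assumptions for illustration: the data layout, the dependence edges (a reading of the example's graph), taking the heights as precomputed inputs, leaving out the branch and the issue-width limit, and evicting every op that shares the contested resource row.

```python
def iterative_schedule(ops, deps, units, height, ii, budget):
    """ops: {op: resource}, deps: [(src, dst, latency, distance)],
    units: {resource: count}. Returns {op: time}, or None if the budget runs out."""
    time = {}   # current partial schedule
    prev = {}   # last time at which each op was ever placed

    def users(res, row):
        return [o for o, t in time.items() if ops[o] == res and t % ii == row]

    while len(time) < len(ops) and budget > 0:
        # Highest-priority unscheduled op (ties go to the first one listed).
        op = max((o for o in ops if o not in time), key=lambda o: height[o])
        # Early time: constraints from predecessors that are already scheduled.
        lo = max([0] + [time[s] + lat - ii * dist
                        for s, d, lat, dist in deps if d == op and s in time])
        hi = lo + ii - 1
        free = [t for t in range(lo, hi + 1)
                if len(users(ops[op], t % ii)) < units[ops[op]]]
        if free:
            t = free[0]
        elif op not in prev or lo > prev[op]:
            t = lo                          # forced placement
        else:
            t = min(prev[op] + 1, hi)       # forced placement, moved forward
        # Backtracking: unschedule ops that conflict with op at time t.
        evict = set()
        row_users = users(ops[op], t % ii)
        if len(row_users) >= units[ops[op]]:
            evict.update(row_users)         # simplification: clear the whole row
        for s, d, lat, dist in deps:
            if s == op and d != op and d in time and time[d] + ii * dist < t + lat:
                evict.add(d)                # successor would start too early
        for o in evict:
            del time[o]
        time[op] = prev[op] = t
        budget -= 1
    return time if len(time) == len(ops) else None


ops = {1: "mem", 2: "alu", 3: "mem", 4: "alu", 5: "alu"}
deps = [(1, 2, 2, 0), (2, 3, 3, 0), (4, 1, 1, 1),
        (4, 4, 1, 1), (5, 3, 1, 1), (5, 5, 1, 1)]   # assumed edges
height = {1: 5, 2: 3, 3: 0, 4: 4, 5: 0}
schedule = iterative_schedule(ops, deps, {"alu": 2, "mem": 1}, height, ii=2, budget=15)
print(schedule)
```

Under these assumptions no backtracking is triggered, and the ops land at the times given in the example's Steps 6 to 10.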

6
Modulo Scheduling Find_slot
  • find_slot(op, min, max)
      • /* Successively try each time in the range */
      • for (t = min to max) do
          • if (op has no resource conflicts in MRT at t)
              • return t
      • /* Op cannot be scheduled in its specified range */
      • /* So schedule this op and displace all
        conflicting ops */
      • if (op has never been scheduled or min > previous
        scheduled time of op)
          • return min
      • else
          • return MIN(1 + prev scheduled time of op, max)
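
The slot search and its fallback rule can be isolated in a few lines of Python. In this sketch the MRT lookup is abstracted into a caller-supplied predicate `has_conflict(t)`, and `prev_time` is `None` for an op that has never been scheduled; both are conventions chosen for this example.

```python
def find_slot(min_t, max_t, has_conflict, prev_time=None):
    # Successively try each time in the range.
    for t in range(min_t, max_t + 1):
        if not has_conflict(t):
            return t
    # No conflict-free slot: pick a time anyway; the caller then displaces
    # whichever ops conflict with this placement.
    if prev_time is None or min_t > prev_time:
        return min_t
    # Move past the previous placement so the same choice is not repeated.
    return min(prev_time + 1, max_t)


free_case = find_slot(0, 1, has_conflict=lambda t: t == 0)
forced_first = find_slot(0, 1, has_conflict=lambda t: True)
forced_again = find_slot(2, 3, has_conflict=lambda t: True, prev_time=2)
print(free_case, forced_first, forced_again)
```

The last branch is what keeps backtracking from cycling: an op that is displaced and rescheduled is pushed at least one cycle later than before, up to `max`.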

7
Modulo Scheduling Example
resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1
Step 1: Convert the loop into a form that uses LC
for (j = 0; j < 100; j++) b[j] = a[j] * 26
Original loop:
  Loop:
    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store(r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp(r1 < r9)
    7: brct p1 Loop
Loop using LC:
  LC = 99
  Loop:
    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store(r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    7: brlc Loop
8
Example Step 2
resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1
Step 2: DSA convert
Before:
  LC = 99
  Loop:
    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store(r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    7: brlc Loop
After DSA conversion:
  LC = 99
  Loop:
    1: r3[-1] = load(r1[0])
    2: r4[-1] = r3[-1] * 26
    3: store(r2[0], r4[-1])
    4: r1[-1] = r1[0] + 4
    5: r2[-1] = r2[0] + 4
    remap r1, r2, r3, r4
    7: brlc Loop
9
Example Step 3
Step 3: Draw dependence graph, calculate MII
resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1
[Figure: dependence graph over ops 1, 2, 3, 4, 5, 7. Edges carry pair labels: 2,0 from op 1 to op 2; 3,0 from op 2 to op 3; 1,1 and 0,0 on the remaining edges.]
RecMII = 1, ResMII = 2, MII = 2
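
The MII numbers on this slide can be checked with a short calculation. The sketch below assumes that every op occupies one issue slot plus one functional unit, that the multiply runs on an alu, and that the only recurrences are the two address-increment self-loops (latency 1, distance 1). These are readings of the example, not statements from the slide.

```python
from math import ceil


def res_mii(op_units, unit_counts):
    """Resource-constrained bound: busiest resource class, uses per available unit."""
    uses = {}
    for units in op_units.values():
        for u in units:
            uses[u] = uses.get(u, 0) + 1
    return max(ceil(n / unit_counts[u]) for u, n in uses.items())


def rec_mii(cycles):
    """Recurrence-constrained bound over (total latency, total distance) per cycle."""
    return max(ceil(latency / distance) for latency, distance in cycles)


unit_counts = {"issue": 4, "alu": 2, "mem": 1, "br": 1}
op_units = {
    1: ["issue", "mem"],   # load
    2: ["issue", "alu"],   # multiply (assumed to use an alu)
    3: ["issue", "mem"],   # store
    4: ["issue", "alu"],   # r1 increment
    5: ["issue", "alu"],   # r2 increment
    7: ["issue", "br"],    # brlc
}
cycles = [(1, 1), (1, 1)]  # assumed self-loops on ops 4 and 5

resmii = res_mii(op_units, unit_counts)
recmii = rec_mii(cycles)
mii = max(resmii, recmii)
print(recmii, resmii, mii)
```

Three of the four resource classes tie at 2 here (3 alu ops on 2 alus, 2 memory ops on 1 mem unit, 6 ops on 4 issue slots), so ResMII sets the bound.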
10
Example Step 4
Step 4: Calculate priorities (MAX height to pseudo stop node)
Op    Iter1    Iter2
1     H = 5    H = 5
2     H = 3    H = 3
3     H = 0    H = 0
4     H = 0    H = 4
5     H = 0    H = 0
7     H = 0    H = 0
[Figure: the dependence graph from Step 3 with additional 0,0 edges, used for the height calculation.]
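
The two iterations in the table can be reproduced with a simple relaxation, sketched below. The edge list is an assumed reading of the dependence graph, the 0,0 edges to the pseudo stop node are modeled as a floor of 0 on every height, and visiting ops from last to first is a choice made here so that the first sweep matches the Iter1 column.

```python
def heights(ops, deps, ii):
    """Height-based priority: H(x) = max(0, max over edges x->y of H(y) + latency - II * distance).
    Assumes II >= RecMII; otherwise the sweeps would never settle."""
    h = {op: 0 for op in ops}
    sweeps = []
    while True:
        changed = False
        for op in reversed(ops):            # assumed visiting order: last op first
            best = max([0] + [h[d] + lat - ii * dist
                              for s, d, lat, dist in deps if s == op])
            if best != h[op]:
                h[op], changed = best, True
        if not changed:
            return h, sweeps
        sweeps.append(dict(h))              # snapshot after each sweep that changed something


ops = [1, 2, 3, 4, 5, 7]
deps = [(1, 2, 2, 0), (2, 3, 3, 0), (4, 1, 1, 1),
        (4, 4, 1, 1), (5, 3, 1, 1), (5, 5, 1, 1)]   # assumed edges
final, sweeps = heights(ops, deps, ii=2)
for i, snapshot in enumerate(sweeps, start=1):
    print(f"Iter{i}:", snapshot)
```

Op 4 only gets its height of 4 in the second sweep because it depends on H(1) through a distance-1 edge: 5 + 1 - 2 * 1 = 4.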
11
Example Step 5
resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1
Schedule brlc at time II - 1
Unrolled schedule (times 0-6): empty
Rolled schedule: time 0: empty; time 1: op 7
MRT (resources alu0, alu1, mem, br): row 0: empty; row 1: X in br
12
Example Step 6
Step 6: Schedule the highest priority op
Op 1: E = 0, L = 1. Place at time 0 (0 % 2)
Unrolled schedule: time 0: op 1
Rolled schedule: time 0: op 1; time 1: op 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem; row 1: X in br
13
Example Step 7
Step 7: Schedule the highest priority op
Op 4: E = 0, L = 1. Place at time 0 (0 % 2)
Unrolled schedule: time 0: ops 1, 4
Rolled schedule: time 0: ops 1, 4; time 1: op 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem and one alu; row 1: X in br
14
Example Step 8
Step 8: Schedule the highest priority op
Op 2: E = 2, L = 3. Place at time 2 (2 % 2)
Unrolled schedule: time 0: ops 1, 4; time 2: op 2
Rolled schedule: time 0: ops 1, 2, 4; time 1: op 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem and both alus; row 1: X in br
15
Example Step 9
Step 9: Schedule the highest priority op
Op 3: E = 5, L = 6. Place at time 5 (5 % 2)
Unrolled schedule: time 0: ops 1, 4; time 2: op 2; time 5: op 3
Rolled schedule: time 0: ops 1, 2, 4; time 1: ops 3, 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem and both alus; row 1: X in mem and br
16
Example Step 10
Step 10: Schedule the highest priority op
Op 5: E = 0, L = 1. Place at time 1 (1 % 2)
Unrolled schedule: time 0: ops 1, 4; time 1: op 5; time 2: op 2; time 5: op 3
Rolled schedule: time 0: ops 1, 2, 4; time 1: ops 3, 5, 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem and both alus; row 1: X in mem, one alu, and br
17
Example Step 11
Step 11: Calculate ESC
SC = max unrolled sched length / II
unrolled sched time of branch = rolled sched time of br + (II * ESC)
SC = 6 / 2 = 3, ESC = SC - 1 = 2
time of br = 1 + 2 * 2 = 5
Unrolled schedule: time 0: ops 1, 4; time 1: op 5; time 2: op 2; time 5: ops 3, 7
Rolled schedule: time 0: ops 1, 2, 4; time 1: ops 3, 5, 7
MRT (resources alu0, alu1, mem, br): row 0: X in mem and both alus; row 1: X in mem, one alu, and br
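
The Step 11 arithmetic, written out as a small Python check. The function names are made up for this sketch, and rounding SC up when the unrolled length is not a multiple of II is an assumption; the slide's numbers divide evenly.

```python
from math import ceil


def stage_counts(unrolled_length, ii):
    """Return (SC, ESC): the stage count and the epilog stage count SC - 1."""
    sc = ceil(unrolled_length / ii)
    return sc, sc - 1


def unrolled_branch_time(rolled_branch_time, ii, esc):
    """The branch sits in the last stage: its rolled time plus ESC whole stages."""
    return rolled_branch_time + ii * esc


sc, esc = stage_counts(unrolled_length=6, ii=2)
branch_time = unrolled_branch_time(rolled_branch_time=1, ii=2, esc=esc)
print(sc, esc, branch_time)
```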
18
Example Step 12
Finishing touches: sort ops, initialize ESC, insert BRF and staging predicate, initialize staging predicate outside loop
Staging predicate: each successive stage increments the index of the staging predicate by 1; stage 1 gets px[0]
LC = 99, ESC = 2, p1[0] = 1
Loop:
  1: r3[-1] = load(r1[0]) if p1[0]
  2: r4[-1] = r3[-1] * 26 if p1[1]
  4: r1[-1] = r1[0] + 4 if p1[0]
  3: store(r2[0], r4[-1]) if p1[2]
  5: r2[-1] = r2[0] + 4 if p1[0]
  7: brlc Loop if p1[2]
Unrolled schedule by stage:
  Stage 1: time 0: ops 1, 4; time 1: op 5
  Stage 2: time 2: op 2; time 3: empty
  Stage 3: time 4: empty; time 5: ops 3, 7
19
Example Dynamic Execution of the Code
LC = 99, ESC = 2, p1[0] = 1
time: ops executed
0: 1, 4
1: 5
2: 1, 2, 4
3: 5
4: 1, 2, 4
5: 3, 5, 7
6: 1, 2, 4
7: 3, 5, 7
...
98: 1, 2, 4
99: 3, 5, 7
100: 2
101: 3, 7
102: -
103: 3, 7
20
Class Problem
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1
How many resources of each type are required to achieve an II = 1 schedule?
If the resources are non-pipelined, how many resources of each type are required to achieve II = 1?
Assuming pipelined resources, generate the II = 1 modulo schedule.
for (j = 0; j < 100; j++) b[j] = a[j] * 26
LC = 99
Loop:
  1: r3 = load(r1)
  2: r4 = r3 * 26
  3: store(r2, r4)
  4: r1 = r1 + 4
  5: r2 = r2 + 4
  7: brlc Loop
21
What if We Don't Have Hardware Support?
  • No predicates
      • Predicates enable kernel-only code by selectively
        enabling/disabling operations to create
        prolog/epilog
      • Now must create explicit prolog/epilog code
        segments
  • No rotating registers
      • Register names not automatically changed each
        iteration
      • Must unroll the body of the software pipeline,
        explicitly rename
      • Consider each register lifetime i in the loop
      • Kmin = min unroll factor = MAX_i(ceiling((End_i -
        Start_i) / II))
      • Create Kmin static names to handle maximum
        register lifetime
      • Apply modulo variable expansion
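
The Kmin formula amounts to a one-line maximum. In the sketch below the lifetimes are hypothetical (start, end) pairs chosen only to exercise the formula, and the `r4_k` naming of the static copies is likewise made up.

```python
from math import ceil


def kmin(lifetimes, ii):
    """Minimum unroll factor: the longest register lifetime, measured in IIs, rounded up."""
    return max(ceil((end - start) / ii) for start, end in lifetimes)


# Hypothetical register lifetimes as (Start, End) cycles within one iteration.
lifetimes = [(0, 2), (2, 5), (0, 1)]
k = kmin(lifetimes, ii=2)

# A value that lives across k initiation intervals needs k static names.
static_names = [f"r4_{copy}" for copy in range(k)]
print(k, static_names)
```

A lifetime longer than II means that a new iteration writes the register before the old value is dead, which is why the kernel must be unrolled and each copy given its own name.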

22
No Predicates
[Figure: pipeline stages A-E at II = 1. One panel shows kernel-only code as a single row, E D C B A. The other shows the same stages expanded into explicit prolog, kernel, and epilog code segments.]
Kernel-only code with rotating registers and predicates, II = 1
Without predicates, must create explicit prolog and epilogs, but no explicit renaming is needed as rotating registers take care of this
23
No Predicates and No Rotating Registers
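Assume Kmin = 4 for this example
prolog:
  A1
  B1 A2
  C1 B2 A3
  D1 C2 B3 A4
unrolled kernel:
  E1 D2 C3 B4 A1
  E2 D3 C4 B1 A2
  E3 D4 C1 B2 A3
  E4 D1 C2 B3 A4
epilog (one drain sequence per kernel row):
  E2 D3 C4 B1, E3 D4 C1, E4 D1, E1
  E3 D4 C1 B2, E4 D1 C2, E1 D2, E2
  E4 D1 C2 B3, E1 D2 C3, E2 D3, E3
  E1 D2 C3 B4, E2 D3 C4, E3 D4, E4
[Figure: the slide also draws shorter partial rows next to the prolog (B1; C1 B2; C1; D1 C2 B3; D1 C2; D1); their placement is not recoverable from the transcript.]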
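
The naming pattern in this figure can be generated mechanically. The sketch below assumes II = 1 with one stage per cycle, so at cycle t stage s works on iteration t - s, and that iteration uses static name copy (t - s) mod Kmin, numbered from 1. `mve_layout` and its return shape are inventions for this example.

```python
def mve_layout(stages, kmin):
    n = len(stages)

    def row(t, last_iter=None):
        cells = []
        for s in reversed(range(n)):           # latest stage first, as in the figure
            it = t - s                         # iteration handled by stage s at cycle t
            if it >= 0 and (last_iter is None or it <= last_iter):
                cells.append(f"{stages[s]}{it % kmin + 1}")
        return " ".join(cells)

    prolog = [row(t) for t in range(n - 1)]
    kernel = [row(t) for t in range(n - 1, n - 1 + kmin)]
    # One epilog per kernel row: drain the iterations started up to that row.
    epilogs = [[row(t, last_iter=exit_t) for t in range(exit_t + 1, exit_t + n)]
               for exit_t in range(n - 1, n - 1 + kmin)]
    return prolog, kernel, epilogs


prolog, kernel, epilogs = mve_layout("ABCDE", kmin=4)
print("prolog:", prolog)
print("kernel:", kernel)
print("epilog after last kernel row:", epilogs[-1])
```

Because the copy index repeats every Kmin cycles, the last kernel row can branch back to the first without any register moves, at the cost of one epilog per kernel row.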