Title: EECS 583 Lecture 24. Group 2: Dataflow Analysis and Optimization. Group 3: Scheduling, Register Allocation, Code Generation
1 EECS 583 Lecture 24
Group 2: Dataflow analysis and optimization
Group 3: Scheduling, register allocation, code generation
- University of Michigan
- April 10, 2002
2 Today
- Dataflow analysis and optimization
  - BDD-based predicate analysis
  - More intelligent predicate relation analysis using binary decision diagrams (Beth, Laura, Bill)
- Scheduling, register allocation, code generation
  - Retargeting Elcor to the TI C6x
    - Handling multiple clusters (Jeff, Dave)
  - Power-sensitive scheduling
    - Dealing with power in a modulo scheduler
    - Dynamic voltage scaling (Hai, Amit)
3 Next time (Monday, 4/15)
- G2 Dataflow analysis and optimization
  - Partial inlining (Chunhui, Dukhyun, Jeremy)
- G3 Scheduling, register allocation, code generation
  - While-loop software pipelining (Arnar, Tomas, Misha)
- G4 Memory optimization
  - Data layout (Tony, Marius)
- Exams returned on Monday if it kills me!
- On Wednesday (4/17, last class)
  - Last 2 memory optimization groups go, plus any spillover
4 Course evaluations
- The written portion of the evaluation is important because this is the first time the class was offered in this form
  - So, I am interested in what you think needs to be improved
- Put some thought into your answers!
  - Note: saying the test was too long is not that useful
- Question A
  - What did you NOT like about the class, or what do you think needs the most improvement? What would you have done differently?
- Question B
  - Thurs/Fri group meetings: Did you like these? Were they useful? How could they be more useful? Are they worth the time?
5 Group 2: Predicate Analysis Using Binary Decision Diagrams
- University of Michigan
- April 10, 2002
6 Background
- Our goal
  - Provide a system similar to Elcor's PQS which uses BDDs rather than partition graphs to answer questions about relationships between predicates
- From last time: predicates and BDDs
  - Represent predicated control flow as Boolean equations (with BDDs)
  - Supports general predicated code
  - Efficient and accurate analysis of condition relations
- Building BDDs
  - Start with a single node, 1
  - Add variables to the BDD; each new variable is a single ITE node with a then-arc and an invert-arc to 1
  - The BDD is built and queried using the ITE(f, g, h) function
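The ITE(f, g, h) operator described above can be sketched as a small Python BDD with hash-consed nodes and the classic terminal cases. This is a minimal illustration of the standard approach, not the project's actual implementation; all names are ours.

```python
# Minimal reduced, shared BDD built and queried entirely through ITE(f, g, h).
class BDD:
    def __init__(self):
        self.nodes = {}                       # (var, low, high) -> node id
        self.node_info = {0: None, 1: None}   # terminals 0 and 1
        self.next_id = 2

    def mk(self, var, low, high):
        """Return the shared node meaning ITE(var, high, low)."""
        if low == high:                       # redundant test: reduce it away
            return low
        key = (var, low, high)
        if key not in self.nodes:
            self.nodes[key] = self.next_id
            self.node_info[self.next_id] = key
            self.next_id += 1
        return self.nodes[key]

    def var(self, v):
        """A fresh variable node: else-arc to 0, then-arc to 1."""
        return self.mk(v, 0, 1)

    def top_var(self, *fs):
        return min(self.node_info[f][0] for f in fs if f not in (0, 1))

    def cofactor(self, f, v, val):
        if f in (0, 1) or self.node_info[f][0] != v:
            return f
        _, low, high = self.node_info[f]
        return high if val else low

    def ite(self, f, g, h):
        """ITE(f, g, h) = (f AND g) OR (NOT f AND h)."""
        if f == 1: return g
        if f == 0: return h
        if g == 1 and h == 0: return f
        if g == h: return g
        v = self.top_var(f, g, h)             # split on the topmost variable
        hi = self.ite(self.cofactor(f, v, 1),
                      self.cofactor(g, v, 1), self.cofactor(h, v, 1))
        lo = self.ite(self.cofactor(f, v, 0),
                      self.cofactor(g, v, 0), self.cofactor(h, v, 0))
        return self.mk(v, lo, hi)

bdd = BDD()
a, b = bdd.var(0), bdd.var(1)
a_and_b = bdd.ite(a, b, 0)      # AND expressed through ITE
a_or_b = bdd.ite(a, 1, b)       # OR expressed through ITE
```

Because nodes are shared, logically equal functions end up as the same node id, which is what makes the later equality-with-null queries cheap.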
7 Our Project
- Initialize the BDD
  - Parse through the hyperblock looking for cmpps
  - The tree is built from these operations
  - Examine the comparison in each cmpp
  - The comparisons used in the cmpps create intervals from the register-to-literal compares and conditions from the register-to-register compares
  - These are represented as Boolean functions in the BDD
  - Create Boolean functions to represent each predicate based on the functions representing the comparisons in each of its cmpps
- Use the ITE function to manipulate the BDD
  - Functions give answers to queries used by dataflow analysis, such as is_disjoint, is_subset, ...
  - Must be similar to the current queries of PQS
8 Example cmpps
- p1 = cmpp.un(r1 < 3)
- p2 = cmpp.un(r1 > r2)
- p1 = cmpp.on(r3 < 5)
- p3 = cmpp.un(r1 >= 4)
- p3 = cmpp.an(r4 > 2)
9 Step 1
- Look at register-to-constant comparisons
- For each register, create a number line
- Split the number line into segments based on the literals in the comparisons
- Create BDD nodes representing finite domains on the number line
10 Step 1: r1's number line
[number line for r1 over (-∞, ∞), split at 3 and 4]
- Conditions: r1 < 3, r1 >= 4
- Intervals
  - I01 = (-∞, 3), I11 = (3, 4), I21 = (4, 4), I31 = (4, ∞)
- Need 2 BDD nodes to represent 4 intervals (v0, v1)
  - I01 -> 00
  - I11 -> 01
  - I21 -> 10
  - I31 -> 11
- Insert the BDD nodes and intervals into the BDD, which currently consists of the single node 1
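A sketch of this interval encoding in Python; the half-open endpoint conventions are our reading of the slide's (3, 4) and (4, 4) notation, with (4, 4) taken as the single point 4.

```python
# Map a value of r1 to its interval index for the split points {3, 4}.
def r1_interval(x):
    if x < 3:
        return 0          # I0 = (-inf, 3)
    if x < 4:
        return 1          # I1 = [3, 4)
    if x == 4:
        return 2          # I2 = the point 4
    return 3              # I3 = (4, inf)

# Two BDD variables (v0, v1) encode the four intervals as 2-bit codes.
def encode(idx):
    return (idx >> 1 & 1, idx & 1)   # I0 -> 00, I1 -> 01, I2 -> 10, I3 -> 11

# Each condition then becomes a set of interval codes; per the slides,
# the compare against 4 covers both I2 and I3.
cond_lt3 = {0}
cond_ge4 = {2, 3}
```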
11 Step 1: BDD
[figure: the BDD after adding v0, a single variable node over terminal 1]
12 Step 1: BDD (cont.)
[figure: v0/v1 decision nodes mapping the four 2-bit codes to intervals I01, I11, I21, I31]
13 Step 1: Reduced BDD
[figure: the same BDD after reduction]
14 Step 1: r3's number line
[number line for r3 over (-∞, ∞), split at 5]
- Conditions: r3 < 5
- Intervals
  - I03 = (-∞, 5), I13 = (5, ∞)
- Need 1 BDD node to represent 2 intervals (v2)
  - I03 -> 1
  - I13 -> 0
- Insert the BDD nodes and intervals into the BDD
15 Step 1: BDD
[figure: v2 node mapping to intervals I03 and I13]
16 Step 1: r4's number line
[number line for r4 over (-∞, ∞), split at 2]
- Conditions: r4 > 2
- Intervals
  - I04 = (-∞, 2), I14 = (2, ∞)
- Need 1 BDD node to represent 2 intervals (v3)
  - I04 -> 1
  - I14 -> 0
- Insert the BDD nodes and intervals into the BDD
17 Step 1: BDD
[figure: BDD now containing the v2 (I03, I13) and v3 (I04, I14) families]
18 Step 2
- Look at register-to-register comparisons
- 5 basic types
  - >, >=, =, <, <=
- Disjoint outcomes
  - (1) R1 > R2
  - (2) R1 = R2
  - (3) R1 < R2
- Map the disjoint outcome space to 2 Boolean variables
  - (R1 < R2) -> (0,0), (R1 = R2) -> (-,1), (R1 > R2) -> (1,0)
19 Step 2: r1 > r2
- 2 variables to represent 3 disjoint outcomes (v4, v5)
20 Step 2: r1 > r2 BDD
[figure: BDD over v4 and v5 distinguishing the >, =, and < outcomes]
21 Final comparison BDD
[figure: combined BDD over v0-v5 containing the interval nodes I01, I11, I21, I31, I03, I13, I04, I14 and the r1 > r2 outcome nodes]
22 Step 3: Predicate Node Creation
- Traverse the code, creating new predicate nodes using the ITE function
- The structure of the BDD is determined by
  - Predicate condition
  - Condition type
  - Guard
23 Step 3: Predicate Node Creation
Px = cmpp.XX(C) if Pg
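The three cmpp types used in the example can be read as Boolean update rules. Below is a sketch with plain Python Booleans standing in for BDD nodes; the rules follow the ITE formulas worked through on the next slides, and the helper names are ours.

```python
# ite here is ordinary if-then-else, the Boolean meaning of ITE(f, g, h).
def ite(f, g, h):
    return g if f else h

def update_UN(cond, guard):
    """cmpp.UN: predicate becomes (cond AND guard); the old value is discarded."""
    return ite(cond, guard, False)

def update_ON(cond, guard, p_old):
    """cmpp.ON: OR-type; set to 1 when cond holds under the guard, else keep p_old."""
    return ite(cond, ite(guard, True, p_old), p_old)

def update_AN(cond, guard, p_old):
    """cmpp.AN: AND-type; clear the predicate when cond fails under the guard."""
    return ite(guard, ite(cond, p_old, False), p_old)

# The slide's running example: p1 = UN(r1 < 3) then p1 = ON(r3 < 5),
# which together compute (r1 < 3) OR (r3 < 5).
r1, r3 = 2, 9
p1 = update_UN(r1 < 3, True)
p1 = update_ON(r3 < 5, True, p1)
```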
24 Step 3: Predicate Layer
- P1 = cmpp.UN(r1 < 3)
- First, examine the condition, r1 < 3
- In the register-to-integer family, the intervals for r1 are
  - (-∞, 3), (3, 4), (4, 4), (4, ∞)
- r1 < 3 corresponds to interval I0, node I01 in the BDD
- Type UN predicate
- Guarded under true
- Predicate node for p1 created by
  - P1 = ITE(I01, 1, 0)
25 Step 3: p1 = cmpp.UN(r1 < 3)
26 Step 3: p2 = cmpp.UN(r1 > r2)
- The condition is a register-to-register type
- There is a family in the BDD corresponding to the comparisons of r1 and r2
- r1 > r2 is specifically the node needed
- Type UN predicate
- Guarded under true
- Predicate node for p2 created by
  - P2 = ITE(r1 > r2, 1, 0)
27 Step 3: p2 = cmpp.UN(r1 > r2)
28 Step 3: p1 = cmpp.ON(r3 < 5)
- The condition is a register-to-integer type
- The register-to-integer family has the following intervals for r3
  - I03 = (-∞, 5), I13 = (5, ∞)
- r3 < 5 corresponds to I03; this is the condition
- Type ON predicate
- Guarded under true
- Predicate node for p1 created by
  - P1 = ITE(I03, ITE(1, 1, p1), p1)
  - The p1 in the ITE function is the previous predicate node for p1
29 Step 3: p1 = cmpp.ON(r3 < 5)
- Condition corresponds to I03
- P1 = ITE(I03, ITE(1, 1, p1), p1) = ITE(I03, 1, p1) = I03 OR p1
30 Step 3: p3 = cmpp.UN(r1 >= 4)
- The condition is a register-to-integer type
- The register-to-integer family has the following intervals for r1
  - I01 = (-∞, 3), I11 = (3, 4), I21 = (4, 4), I31 = (4, ∞)
- r1 >= 4 corresponds to both I21 and I31
- Type UN predicate
- Guarded under true
- P3 = ITE(ITE(I21, 1, I31), 1, 0)
31 Step 3: p3 = cmpp.UN(r1 >= 4)
- P3 = ITE(ITE(I21, 1, I31), 1, 0)
[figure: BDD now containing predicate nodes p1, p2, p3 over the interval and comparison nodes]
32 Step 3: p3 = cmpp.AN(r4 > 2)
- Condition corresponds to I14
- P3 = ITE(1, ITE(I14, p3, 0), p3)
[figure: updated BDD with the new p3 node]
33 Step 3: Final Predicate BDD
[figure: final BDD containing predicate nodes p1, p2, p3 over interval nodes I01, I11, I21, I31, I03, I13, I04, I14, the r1 > r2 nodes, and variables v0-v5]
34 Queries to PQS-BDD
- Are p2 and p3 disjoint?
  - Tmp = ITE(ITE(p2, ITE(p3, 0, 1), ITE(p3, 1, 0)), 0, 1)
  - If Tmp = null, then p2 and p3 are disjoint
- Is p1 a subset of p3?
  - Tmp = ITE(p1, ITE(p3, 0, 1), 0)
  - If Tmp = null, then p1 is a subset of p3
- All queries currently answered using PQS can be answered with the PQS-BDD system using ITE functions
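A brute-force sketch of these queries using truth tables in place of BDDs: a predicate is the set of variable assignments on which it is true, and "Tmp = null" corresponds to the function being identically false. The disjointness test here uses the standard conjunction formulation, and the names are ours.

```python
from itertools import product

def is_subset(p1, p3, nvars):
    """p1 subset of p3  <=>  ITE(p1, NOT p3, 0) is the null (always-false) function."""
    return all(not (p1(v) and not p3(v))
               for v in product([0, 1], repeat=nvars))

def is_disjoint(p2, p3, nvars):
    """p2, p3 disjoint  <=>  their conjunction ITE(p2, p3, 0) is null."""
    return all(not (p2(v) and p3(v))
               for v in product([0, 1], repeat=nvars))

# Example over two variables (hypothetical predicates, not the slides' p1-p3):
pA = lambda v: v[0] and v[1]        # true when v0 AND v1
pB = lambda v: v[0]                 # true when v0
assert is_subset(pA, pB, 2)         # v0 AND v1 implies v0
assert not is_disjoint(pA, pB, 2)   # they overlap on v = (1, 1)
```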
35 Cluster Scheduling
- Jeff Ringenberg
- David Oehmke
36 Motivations
- Register file
  - Size increases linearly with the number of registers
  - Size increases quadratically with the number of ports
  - Access time increases logarithmically with the number of read ports and the number of registers
- Wide machines require large numbers of registers and ports
  - An 8-wide, ideal, fully orthogonal VLIW machine requires approximately 16 read ports and 8 write ports
37 Clustering
- Functional units and register files are broken into sets (generally uniform)
- Each functional unit in a cluster is fully connected to that cluster's local register file
- Limited connectivity between clusters
- The register files of an 8-wide, 2-cluster machine are approximately one quarter the area of the register file of a single-cluster machine
- Connectivity
  - Most papers assume an explicit move operation to move data between clusters
  - Some actual architectures allow operands to be read directly from other clusters via a limited-bandwidth cross path
38 Clustering in the TI C6000
39 Compiling for Clusters
- Compilation for a clustered machine is more complex than for a single cluster
  - Assign operations to a cluster's functional units
  - Assign data to a cluster's register file
  - Move data between clusters when necessary
- Complications
  - Spread operations and data over clusters to achieve parallelism (partitioning)
  - Hide or limit the inter-cluster communication penalty
  - NP-complete problem
40 Cluster Scheduling Algorithms
- BUG, bottom-up greedy
  - Original algorithm from the Bulldog compiler
  - Pre-scheduling cluster assignment
- Limited-connectivity VLIW
  - Schedule assuming a fully connected machine, then partition and insert the necessary copy operations
- Partial component clustering
  - Pre-scheduling DAG decomposition and cluster assignment, with an iterative improvement phase
- Effective cluster assignment for modulo scheduling
  - Pre-modulo-scheduling cluster assignment
41 Cluster Scheduling Algorithms (cont.)
- Instruction scheduling for clustered VLIW DSPs (targets the TI C6201 architecture)
  - Partitioning using simulated annealing, with a list scheduler as the cost function
- Unified Assign and Schedule (UAS)
  - Assign operations to clusters while scheduling
  - Simple modification to a list scheduler
- CARS
  - Single-phase cluster assignment, register allocation, and instruction scheduling
42 Bottom-up Greedy (BUG)
Assign(node, destinations)
  if (!node.parent || node.fu != unassigned)
    return
  for each operand of node
    fus, cycles = LikelyFUs(node, destinations)
    Assign(operand, fus)
  fus, cycles = LikelyFUs(node, destinations)
  node.fu = fus.front
  node.cycle = cycles.front
  available[node.fu][node.cycle] = false
  for each operand of node
    if (operand.type == DEF &&
        (operand.location == unassigned ||
         MustHaveSingleLocation(operand)))
      AssignLocation(operand, node)
43 LikelyFUs(node, destinations)
  min = MAX_INT
  for each fu in FeasibleLocations(node)
    t = CompletionCycle(node, fu, destinations)
    if (t < min)
      min = t
      fus = {fu}
      cycles = {StartCycle(node, fu)}
    else if (t == min)
      fus += fu
      cycles += StartCycle(node, fu)
  return fus, cycles
44 BUG helper functions
- FeasibleLocations(node)
  - Returns the list of functional units that can perform the operation
- StartCycle(node, fu)
  - Returns an estimate of the earliest cycle at which the functional unit can be used to compute the node's operation
  - Takes into account the availability of functional units and the operand locations (delay and distance), if available
- CompletionCycle(node, fu, destinations)
  - Returns StartCycle(node, fu) + Delay(node, fu) + Distance(fu, destinations)
- Delay(node, fu)
  - Returns the number of cycles to compute the operation on the functional unit
- Distance(fu, destinations)
  - Returns the minimum number of cycles to move the result of the functional unit to one of the destinations (0 if destinations is empty)
45 Notes on BUG
- The top-level routine calls Assign(root, NULL) for each root node
- Assign is called on the roots in decreasing depth order
- The loop through the operands in Assign is also done in decreasing depth order
- Assign for data nodes
  - DEF nodes do nothing
  - USE nodes pass any final locations to their parent nodes as the destinations list
- A separate phase assigns locations to DEF and USE nodes that are still unassigned
  - Successors and predecessors are taken into account for this assignment
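The pseudocode above can be made concrete. Below is a runnable toy version of BUG's recursive FU assignment on the two-cluster machine model of the later example slide (cluster 1: M1/A1, cluster 2: L2/A2, all delays 1, inter-cluster distance 1); the DAG, the helper names, and the tie-breaking details are ours.

```python
MAX_INT = float('inf')

class Node:
    def __init__(self, name, fus, operands=()):
        self.name, self.feasible, self.operands = name, list(fus), list(operands)
        self.fu = None
        self.cycle = None

def cluster(fu):
    return 1 if fu in ('M1', 'A1') else 2

def distance(fu, destinations):
    """0 within a cluster, 1 between clusters; 0 if there are no destinations."""
    if not destinations:
        return 0
    return min(0 if cluster(fu) == cluster(d) else 1 for d in destinations)

busy = set()                                   # (fu, cycle) slots already taken

def start_cycle(node, fu):
    """Earliest cycle fu is free once the assigned operands are ready."""
    ready = max((op.cycle + 1 for op in node.operands if op.cycle is not None),
                default=0)
    t = ready
    while (fu, t) in busy:
        t += 1
    return t

def likely_fus(node, destinations, delay=1):
    """BUG's LikelyFUs: keep the FUs with the best completion cycle."""
    best, fus, cycles = MAX_INT, [], []
    for fu in node.feasible:
        s = start_cycle(node, fu)
        t = s + delay + distance(fu, destinations)
        if t < best:
            best, fus, cycles = t, [fu], [s]
        elif t == best:
            fus.append(fu); cycles.append(s)
    return fus, cycles

def assign(node, destinations):
    if node.fu is not None:
        return
    for op in node.operands:                   # recurse into operands first
        fus, _ = likely_fus(node, destinations)
        assign(op, fus)
    fus, cycles = likely_fus(node, destinations)
    node.fu, node.cycle = fus[0], cycles[0]
    busy.add((node.fu, node.cycle))

# Toy DAG: two ALU ops feeding an ALU op that feeds a multiply bound to M1.
n1 = Node('1', ['A1', 'A2'])
n2 = Node('2', ['A1', 'A2'])
n4 = Node('4', ['A1', 'A2'], [n1, n2])
n6 = Node('6', ['M1'], [n4])
assign(n6, [])
```

The recursion pulls the whole chain onto cluster 1, since the final multiply can only run on M1 and cross-cluster results cost an extra cycle.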
46 Shortcomings of BUG
- Interconnect resource constraints cannot be checked
  - The assignment can oversaturate the available buses
- Assignment of values to registers occurs on the fly, after FUs are assigned to operations
  - Subsequent copies of non-local data are scheduled later
  - Prior knowledge of these copies would benefit the FU assignment and scheduling
- BUG is greedy
  - Future knowledge is not used in decisions
  - Decisions cannot be changed
47 BUG Example
- Machine model: Cluster 1 has Multiply (M1) and ALU (A1); Cluster 2 has Load (L2) and ALU (A2). All delays are 1; distance is 0 within a cluster and 1 between clusters.
[figure: worked trace of Assign on a small DAG. Assign(6, {}) recurses through Assign(4, {M1}), Assign(2, {A1}), and Assign(1, {M1}), placing node 2 on A1 at cycle 0, node 1 on A2 at cycle 0, node 4 on A1 at cycle 1, and node 6 on M1 at cycle 2; Assign(5, {}) and Assign(3, {A1, A2}) place node 3 on L2 at cycle 0 and node 5 on A1 at cycle 2. Problem: the moves A2-to-M1 and L2-to-A1 each cost 1 cycle.]
48 Unified Assign and Schedule
49 Cluster Priority Heuristics
- None
  - The cluster list is not ordered
- Random
  - Priority is a random number
- Magnitude-weighted predecessor (MWP)
  - Number of flow-dependent predecessors assigned to the cluster
- Completion-weighted predecessor (CWP)
  - Latest ready time of any flow-dependent predecessor assigned to the cluster
- Critical path in a single cluster using CWP (CPSC)
  - Priority calculated as in CWP, but all nodes on the critical path are assigned to a single cluster
50 Advantages of UAS
- Simple modification to a list scheduler
  - The most common instruction scheduling technique
- Cluster assignment is done with full knowledge of resource and interconnect availability
- Better cluster utilization than BUG
- Generates more compact schedules than BUG
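The UAS idea can be sketched as a modified list scheduler: the cluster choice happens per operation at schedule time, here using the MWP heuristic. The machine model and all names are ours, and copy handling is simplified to a same-cycle slot charge rather than separate copy operations.

```python
def uas_schedule(ops, deps, n_clusters=2, slots_per_cluster=2):
    """List-schedule ops onto clusters; deps maps op -> list of predecessors."""
    sched = {}                    # op -> (cycle, cluster)
    used = {}                     # (cycle, cluster) -> slots consumed
    def ready(op, t):
        return all(d in sched and sched[d][0] < t for d in deps.get(op, []))
    def priority(op, c):          # MWP: prefer the cluster holding predecessors
        return -sum(1 for d in deps.get(op, []) if sched[d][1] == c)
    t = 0
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if not ready(op, t):
                continue
            for c in sorted(range(n_clusters), key=lambda c: priority(op, c)):
                # One slot for the op plus one per cross-cluster operand copy.
                copies = sum(1 for d in deps.get(op, []) if sched[d][1] != c)
                need = 1 + copies
                if used.get((t, c), 0) + need <= slots_per_cluster:
                    used[(t, c)] = used.get((t, c), 0) + need
                    sched[op] = (t, c)
                    remaining.remove(op)
                    break
        t += 1
    return sched

# Two independent ops feed a third; MWP keeps the consumer with its producers.
sched = uas_schedule(['a', 'b', 'c'], {'c': ['a', 'b']})
```

Unlike BUG, the resource check (`used`) and the copy cost are consulted at the moment each operation is placed, which is the point of unifying assignment with scheduling.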
51 Speedup compared to optimal for the most frequently executed basic block
52 Code size increase due to copy operations for the most frequently executed basic block
53 Speedup compared to a 1-cluster, 8-issue machine (same number of resources) for the full benchmark
54 Instruction Scheduling for Clustered VLIW DSPs
Partition (simulated annealing)
  T = 10
  RandomPartition(P)
  mincost = ListSchedule(graph, P)
  while (T > 0.01)
    for i = 1 to 50
      r = Random(1, n)
      SwitchToOtherCluster(P[r])
      cost = ListSchedule(graph, P)
      delta = cost - mincost
      if (delta < 0 or Random(0,1) < exp(-delta / T))
        mincost = cost
      else
        SwitchToOtherCluster(P[r])
    T = T * 0.9
  return P
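A runnable miniature of the partitioner above. The list-scheduler cost is replaced by a stand-in that charges cross-cluster edges plus a load-imbalance penalty; this substitution, and all names, are ours.

```python
import math
import random

def anneal_partition(n, edges, seed=0):
    """Simulated-annealing 2-cluster partition, following the slide's loop."""
    rng = random.Random(seed)
    P = [rng.randint(0, 1) for _ in range(n)]              # RandomPartition(P)
    def cost(P):
        cross = sum(1 for a, b in edges if P[a] != P[b])   # communication
        imbalance = abs(sum(P) - (n - sum(P)))             # load balance
        return cross + imbalance
    mincost = cost(P)                                      # ListSchedule stand-in
    T = 10.0
    while T > 0.01:
        for _ in range(50):
            r = rng.randrange(n)
            P[r] ^= 1                                      # SwitchToOtherCluster
            c = cost(P)
            delta = c - mincost
            if delta < 0 or rng.random() < math.exp(-delta / T):
                mincost = c                                # accept the move
            else:
                P[r] ^= 1                                  # undo the move
        T *= 0.9
    return P, mincost

# Two triangles joined by one edge; the natural cut crosses only edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
P, c = anneal_partition(6, edges)
```

High temperatures accept cost-increasing flips so the search can escape local minima; as T decays by 0.9 per round, the loop degenerates into hill climbing.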
55 Getting Non-Local Operands
- Check for an already existing copy with either the destination or the source in the required cluster (CSE)
- Use the cross path if it is available this cycle and the operand supports it (taking commutativity into account)
- Insert a copy operation in a previous cycle
56 TI optimizing assembler versus optimal
57 TI compiler versus the algorithm
58 Effective Cluster Assignment for Modulo Scheduling
- Problem
  - Acyclic scheduling is concerned with minimizing schedule length
  - Cyclic scheduling is concerned with maximizing throughput
- Algorithm
  - Greedy cluster assignment
  - Insert any necessary copy operations
  - Schedule using any standard non-cluster-aware modulo scheduler
59 Cluster Assignment
- Give higher priority to nodes in recurrence cycles
  - The more critical the recurrence (the higher its recII), the higher the priority
- Speculatively reserve space for future copy operations to minimize resource contention
  - Aggressive cluster assignment could fill a cluster and prevent scheduling of a required copy
- Iterative approach
  - Correct early sub-optimal assignments
60 Standard Bottom-Up Greedy Approach
61 Modified Priority and Speculative Copy Approach
62 Cluster Selection
63 Power-Aware Modulo Scheduling
Amit Marathe, Hai Huang
EECS 583 Class Presentation II, April 10, 2002
64 Reference Paper and Motivation
- "Power-Aware Modulo Scheduling for High-Performance VLIW Processors", Yun and Kim, Seoul National University
  - Published in ISLPED 2001 (an ACM conference)
- Motivation
  - Reduce step power and peak power from the software perspective
  - Step/peak power is more important than average power as far as reliability is concerned (not necessarily optimum power consumption)
65 Step Power
- Step power is the difference in the average power consumed in two consecutive clock cycles
  - Reflected by a surge in current for charging/discharging
  - Due to aggressive/wider datapath designs, increasing clock frequencies, and growing transistor counts
  - Reduces reliability and causes timing and logic errors (the circuit switches at the wrong time or latches the wrong value)
- At the microarchitectural level
  - Represents inductive noise, L di/dt
  - A large surge in current means more noise, and more noise means more faults
- Aggressively turning off FUs to reduce average power consumption can conflict with the goal of reducing step power
66 Peak Power
- Peak power is the maximum power dissipation during the execution of a given program
- Chip reliability depends exponentially on peak power
  - High peak power leads to device degradation, reducing the chip's lifetime
  - Complex cooling systems are needed to avoid overheating and ensure system reliability
67 Power-Aware Modulo Scheduling Algorithm
- Aims at generating a balanced schedule that reduces both step power and peak power
- Ideology: compilers are smarter than hardware-assisted solutions
  - Because compilers can fully control the usage of the functional units
- Machine models tested
  - 8-issue VLIW
    - 1 IALU, 2 MEM, 1 IMPY, 2 FALU, 2 FMPY
  - 16-issue VLIW
    - 2 x (8-issue)
- Benchmarks tested
  - SPEC95 FP
68 Power-Aware Modulo Scheduling (cont'd)
- (Too) simple power estimation method
  - P(op, i) is the power consumed by operation op in pipeline stage i
  - The total power consumed in one clock cycle is the sum of the total power consumed in each pipeline stage during that clock cycle
  - The total power consumed in one pipeline stage in a given clock cycle is the total power consumed by all ops in that pipeline stage
- Problem: what about inter-stage or inter-operation effects on power consumption?
69 Base Algorithm (Iterative Modulo Scheduling)
70 Balanced IMS (power-aware algorithm)
- Cost function
  - Aim: minimize the cost function
  - This is NOT a complicated function; it just says to pick a schedule in which the function is minimized
  - P(Lsp, i) is the power consumed in time slot i of the software-pipelined loop
  - The ideal P(Lsp, i) is when all the ops are no-ops
  - Peak power is the maximum P(Lsp, i), and the step power is P(Lsp, i) - P(Lsp, i-1)
  - Somehow, minimizing the cost function minimizes both peak power and step power
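The definitions above can be computed directly. The kernel below is a made-up II = 4 loop, but the peak and step computations follow the slide's definitions, with the step wrapping around across kernel iterations.

```python
def slot_powers(kernel, op_power):
    """P(Lsp, i): sum the power of every op occupying each time slot."""
    return [sum(op_power.get(op, 0.0) for op in slot) for slot in kernel]

def peak_power(powers):
    """Peak power: the maximum P(Lsp, i)."""
    return max(powers)

def max_step_power(powers):
    """Largest |P(Lsp, i) - P(Lsp, i-1)|; the kernel wraps around (modulo)."""
    return max(abs(powers[i] - powers[i - 1]) for i in range(len(powers)))

# Unbalanced kernel: four ops piled into slot 0, the other slots empty.
unbalanced = [['add', 'mul', 'ld', 'st'], [], [], []]
balanced = [['add'], ['mul'], ['ld'], ['st']]      # same ops, spread out
power = {'add': 1.0, 'mul': 1.0, 'ld': 1.0, 'st': 1.0}

p_u = slot_powers(unbalanced, power)
p_b = slot_powers(balanced, power)
```

Spreading the same operations over the slots leaves total energy unchanged but flattens the profile, which is exactly what BIMS's cost function rewards.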
71 Summary
- IMS selects the earliest time slot within the computed slack (the slack is the range of time in which the op can be scheduled without violating dependence constraints) in which there is no resource conflict, and schedules the op there
- BIMS uses the cost function F(Lsp, i) to place an instruction in one of the time slots in the slack (basically, the time slot that incurs the least increase in the cost function)
72 An Example for a Better Picture
- IMS: if power(no-op) = 0 and power(other ops) = 1, then peak power = 4 and step power = 3
- Balanced IMS: if power(no-op) = 0 and power(other ops) = 1, then peak power = 2 and step power = 0
73 Results
74 Conclusions
- They don't make a strong case as to why average power is not important (they don't even analyze average power)
- The power model is too simplistic
- Seems to be a novel idea (so far, most papers have focused on reducing step/peak power in hardware)
- Promising results (almost 37.1% reduction in step-power consumption)
- Idea worth exploring for large systems
75 Dynamic Voltage Scaling: An Overview
76 Issue of Operating Voltage
- The predominant device technology is CMOS
- Energy is proportional to the square of the operating voltage
- Maximum gate delay is inversely related to voltage
- Can reduce per-computation energy by reducing frequency and voltage
77 Dynamic Voltage Scaling (DVS)
- [Weiser94]
  - Busy system -> increase frequency
  - Idle system -> reduce frequency
- Needs processors that support a software-adjustable PLL and voltage regulator
  - e.g., XScale, SpeedStep, PowerNow!, Crusoe
78 RT Systems vs. NRT Systems
- All systems can be classified as either
  - 1. Real-time systems
  - 2. Non-real-time systems (or soft real-time systems)
- NRTS works well with DVS: no deadlines
- There are some challenges in using DVS with RTS
79 Real-Time Systems
- A task is characterized as (P, D, C)
  - P: period
  - D: deadline
  - C: worst-case execution time (WCET)
- All that matters is the task meeting its deadline!
80 RTS with DVS
- Static DVS
  - Worst-case utilization, U
  - Task1 = (10, 3), Task2 = (5, 1), Task3 = (10, 4)
  - U = 3/10 + 1/5 + 4/10 = 0.9
[figure: before scaling, T1, T2, and T3 run at full speed (1.0) with idle gaps; after scaling, the tasks stretch to fill their periods at reduced speed (time axis 0 to 20)]
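The static calculation on this slide, done with exact fractions. The function name is ours, and the sketch assumes deadlines equal periods, the standard setting in which running everything at speed U still meets every deadline.

```python
from fractions import Fraction

def static_speed(tasks):
    """tasks: list of (period, wcet) pairs; returns (U, stretched wcets)."""
    U = sum(Fraction(c, p) for p, c in tasks)
    # At speed U, each task's worst-case execution time stretches by 1/U.
    return U, [Fraction(c, 1) / U for p, c in tasks]

tasks = [(10, 3), (5, 1), (10, 4)]     # the slide's task set
U, stretched = static_speed(tasks)
assert U == Fraction(9, 10)            # 3/10 + 1/5 + 4/10 = 0.9

# After scaling, utilization is exactly 1: the processor is never idle.
assert sum(c / p for (p, _), c in zip(tasks, stretched)) == 1
```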
81 RTS with DVS (cont.)
- Dynamic DVS
  - Observation: the WCET is much greater than the actual execution time (ACET)
  - Using the actual execution time instead of the WCET yields even higher energy savings
82 NRTS with DVS
- Use the past to predict the future
- Potentially has longer delays
  - No problem: there are no strict deadlines
- Opportunity for more aggressive DVS algorithms
- Needs to strike a balance between energy saving and performance
83 DVS in the Compiler?
- [Mosse00]
  - Power management points (PMPs) are inserted into the generated code
  - The application monitors its own progress and adjusts the clock speed if appropriate
  - Targeted at single-threaded embedded systems
84 Power Management Points
- A task is divided into n sections
  - Each section has a WCET
- PMPs are inserted at the section boundaries
  - Obtain the actual run time of the section
  - Compare the actual time to the WCET, and adjust the processor frequency accordingly
- Natural places are loop boundaries and procedure call sites
- Use profiling information to eliminate unnecessary PMPs and reduce overhead
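A toy simulation of this scheme: after each section, the PMP compares elapsed time against the worst case still ahead and rescales the frequency. The greedy rescaling rule anticipates the DPM-G variant on the next slides; all numbers and names here are ours.

```python
def run_with_pmps(wcets, actual_times, deadline):
    """Simulate PMPs: pick a frequency before each section, return the picks."""
    t = 0.0                                  # elapsed wall-clock time
    freqs = []
    for j, actual in enumerate(actual_times):
        remaining_wc = sum(wcets[j:])        # worst case still ahead of us
        budget = deadline - t                # wall-clock time left
        f = min(1.0, remaining_wc / budget)  # slow down if slack has built up
        freqs.append(f)
        t += actual / f                      # the section stretches at low speed
    assert t <= deadline                     # the deadline is still met
    return freqs

# Three sections; actual times are half the worst case, so each PMP finds
# freshly accumulated slack and lowers the frequency further.
freqs = run_with_pmps([4, 4, 4], [2, 2, 2], deadline=12)
```

Choosing f = remaining_wc / budget guarantees the deadline even if every remaining section takes its full WCET, since the remaining work then fits exactly in the remaining budget.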
85 Voltage Adjustment Schemes
- NPM: No Power Management
  - Every section runs at the highest speed
- SPM: Static Power Management
  - Same as the static DVS approach: use the worst-case utilization
- DPM-P: Dynamic Power Management, Proportional
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
86 Voltage Adjustment Schemes (cont.)
- DPM-G: Dynamic Power Management, Greedy
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
87 Voltage Adjustment Schemes (cont.)
- DPM-S: Dynamic Power Management, Statistical
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
88 Performance
89 Performance (cont.)
90 Conclusion
- DVS is a powerful way to save energy
- If deadlines are not an issue, there is an opportunity to be more aggressive in saving energy
- If meeting deadlines is important, be more conservative
- Applying DVS in the compiler is still an open research area: systems are mostly multitasking