Title: EECS 583 Lecture 24. Group 2: Dataflow Analysis and Optimization. Group 3: Scheduling, Register Allocation, Code Generation
1 EECS 583 Lecture 24
Group 2: Dataflow analysis and optimization
Group 3: Scheduling, register allocation, code generation
- University of Michigan
- April 10, 2002
2 Today
- Dataflow analysis and optimization
  - BDD-based predicate analysis
  - More intelligent predicate relation analysis using binary decision diagrams (Beth, Laura, Bill)
- Scheduling, register allocation, code generation
  - Retargeting Elcor to the TI C6x
    - Handling multiple clusters (Jeff, Dave)
  - Power-sensitive scheduling
    - Dealing with power in a modulo scheduler
    - Dynamic voltage scaling (Hai, Amit)
3 Next time (Monday, 4/15)
- G2 Dataflow analysis and optimization
  - Partial inlining (Chunhui, Dukhyun, Jeremy)
- G3 Scheduling, register allocation, code generation
  - While-loop software pipelining (Arnar, Tomas, Misha)
- G4 Memory optimization
  - Data layout (Tony, Marius)
- Exams returned on Monday if it kills me!
- On Wednesday (4/17, last class)
  - Last 2 memory optimization groups go, plus any spillover
4 Course evaluations
- The written portion of the evaluation is important because this is the first time the class was offered in this form
  - So, I am interested in what you think needs to be improved
- Put some thought into your answers!
  - Note: saying the test was too long is not that useful
- Question A
  - What did you NOT like about the class, or what do you think needs the most improvement? What would you have done differently?
- Question B
  - Thurs/Fri group meetings: Did you like these? Were they useful? How could they be more useful? Are they worth the time?
5 Group 2: Predicate Analysis Using Binary Decision Diagrams
- University of Michigan
- April 10, 2002
6 Background
- Our goal
  - Provide a system similar to Elcor's PQS which uses BDDs rather than partition graphs to answer questions about relationships between predicates
- From last time: predicates and BDDs
  - Represent predicated control flow as Boolean equations (with BDDs)
  - Supports general predicated code
  - Efficient and accurate analysis of condition relations
- Building BDDs
  - Start with a single node, 1
  - Add variables to the BDD; each new variable is a single ITE node with a then-arc and an invert-arc to 1
  - The BDD is built and queried using the ITE(f, g, h) function
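The ITE(f, g, h) operator described above can be sketched as a small Python BDD with hash-consed nodes and the classic terminal cases. This is a minimal illustration of the standard approach, not the project's actual implementation; all names are ours.

```python
# Minimal reduced, shared BDD built and queried entirely through ITE(f, g, h).
class BDD:
    def __init__(self):
        self.nodes = {}                       # (var, low, high) -> node id
        self.node_info = {0: None, 1: None}   # terminals 0 and 1
        self.next_id = 2

    def mk(self, var, low, high):
        """Return the shared node meaning ITE(var, high, low)."""
        if low == high:                       # redundant test: reduce it away
            return low
        key = (var, low, high)
        if key not in self.nodes:
            self.nodes[key] = self.next_id
            self.node_info[self.next_id] = key
            self.next_id += 1
        return self.nodes[key]

    def var(self, v):
        """A fresh variable node: else-arc to 0, then-arc to 1."""
        return self.mk(v, 0, 1)

    def top_var(self, *fs):
        return min(self.node_info[f][0] for f in fs if f not in (0, 1))

    def cofactor(self, f, v, val):
        if f in (0, 1) or self.node_info[f][0] != v:
            return f
        _, low, high = self.node_info[f]
        return high if val else low

    def ite(self, f, g, h):
        """ITE(f, g, h) = (f AND g) OR (NOT f AND h)."""
        if f == 1: return g
        if f == 0: return h
        if g == 1 and h == 0: return f
        if g == h: return g
        v = self.top_var(f, g, h)             # split on the topmost variable
        hi = self.ite(self.cofactor(f, v, 1),
                      self.cofactor(g, v, 1), self.cofactor(h, v, 1))
        lo = self.ite(self.cofactor(f, v, 0),
                      self.cofactor(g, v, 0), self.cofactor(h, v, 0))
        return self.mk(v, lo, hi)

bdd = BDD()
a, b = bdd.var(0), bdd.var(1)
a_and_b = bdd.ite(a, b, 0)      # AND expressed through ITE
a_or_b = bdd.ite(a, 1, b)       # OR expressed through ITE
```

Because nodes are shared, logically equal functions end up as the same node id, which is what makes the later equality-with-null queries cheap.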
7 Our Project
- Initialize the BDD
  - Parse through the hyperblock looking for cmpps
  - The tree is built from these operations
  - Examine the comparison in each cmpp
  - The comparisons used in the cmpps create intervals from the register-to-literal compares and conditions from the register-to-register compares
  - These are represented as Boolean functions in the BDD
  - Create Boolean functions to represent each predicate based on the functions representing the comparisons in each of its cmpps
- Use the ITE function to manipulate the BDD
  - Functions give answers to queries used by dataflow analysis, such as is_disjoint, is_subset, ...
  - Must be similar to the current queries of PQS
8 Example cmpps
- p1 = cmpp.un(r1 < 3)
- p2 = cmpp.un(r1 > r2)
- p1 = cmpp.on(r3 < 5)
- p3 = cmpp.un(r1 >= 4)
- p3 = cmpp.an(r4 > 2)
9 Step 1
- Look at register-to-constant comparisons
- For each register, create a number line
- Split the number line into segments based on the literals in the comparisons
- Create BDD nodes representing finite domains on the number line
10 Step 1: r1's number line
[number line for r1 over (-∞, ∞), split at 3 and 4]
- Conditions: r1 < 3, r1 >= 4
- Intervals
  - I01 = (-∞, 3), I11 = (3, 4), I21 = (4, 4), I31 = (4, ∞)
- Need 2 BDD nodes to represent 4 intervals (v0, v1)
  - I01 -> 00
  - I11 -> 01
  - I21 -> 10
  - I31 -> 11
- Insert the BDD nodes and intervals into the BDD, which currently consists of the single node 1
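A sketch of this interval encoding in Python; the half-open endpoint conventions are our reading of the slide's (3, 4) and (4, 4) notation, with (4, 4) taken as the single point 4.

```python
# Map a value of r1 to its interval index for the split points {3, 4}.
def r1_interval(x):
    if x < 3:
        return 0          # I0 = (-inf, 3)
    if x < 4:
        return 1          # I1 = [3, 4)
    if x == 4:
        return 2          # I2 = the point 4
    return 3              # I3 = (4, inf)

# Two BDD variables (v0, v1) encode the four intervals as 2-bit codes.
def encode(idx):
    return (idx >> 1 & 1, idx & 1)   # I0 -> 00, I1 -> 01, I2 -> 10, I3 -> 11

# Each condition then becomes a set of interval codes; per the slides,
# the compare against 4 covers both I2 and I3.
cond_lt3 = {0}
cond_ge4 = {2, 3}
```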
11 Step 1: BDD
[figure: the BDD after adding v0, a single variable node over terminal 1]
12 Step 1: BDD (cont.)
[figure: v0/v1 decision nodes mapping the four 2-bit codes to intervals I01, I11, I21, I31]
13 Step 1: Reduced BDD
[figure: the same BDD after reduction]
14 Step 1: r3's number line
[number line for r3 over (-∞, ∞), split at 5]
- Conditions: r3 < 5
- Intervals
  - I03 = (-∞, 5), I13 = (5, ∞)
- Need 1 BDD node to represent 2 intervals (v2)
  - I03 -> 1
  - I13 -> 0
- Insert the BDD nodes and intervals into the BDD
15 Step 1: BDD
[figure: v2 node mapping to intervals I03 and I13]
16 Step 1: r4's number line
[number line for r4 over (-∞, ∞), split at 2]
- Conditions: r4 > 2
- Intervals
  - I04 = (-∞, 2), I14 = (2, ∞)
- Need 1 BDD node to represent 2 intervals (v3)
  - I04 -> 1
  - I14 -> 0
- Insert the BDD nodes and intervals into the BDD
17 Step 1: BDD
[figure: BDD now containing the v2 (I03, I13) and v3 (I04, I14) families]
18 Step 2
- Look at register-to-register comparisons
- 5 basic types
  - >, >=, =, <, <=
- Disjoint outcomes
  - (1) R1 > R2
  - (2) R1 = R2
  - (3) R1 < R2
- Map the disjoint outcome space to 2 Boolean variables
  - (R1 < R2) -> (0,0), (R1 = R2) -> (-,1), (R1 > R2) -> (1,0)
19 Step 2: r1 > r2
- 2 variables to represent 3 disjoint outcomes (v4, v5)
20 Step 2: r1 > r2 BDD
[figure: BDD over v4 and v5 distinguishing the >, =, and < outcomes]
21 Final comparison BDD
[figure: combined BDD over v0-v5 containing the interval nodes I01, I11, I21, I31, I03, I13, I04, I14 and the r1 > r2 outcome nodes]
22 Step 3: Predicate Node Creation
- Traverse the code, creating new predicate nodes using the ITE function
- The structure of the BDD is determined by
  - Predicate condition
  - Condition type
  - Guard
23 Step 3: Predicate Node Creation
Px = cmpp.XX(C) if Pg
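The three cmpp types used in the example can be read as Boolean update rules. Below is a sketch with plain Python Booleans standing in for BDD nodes; the rules follow the ITE formulas worked through on the next slides, and the helper names are ours.

```python
# ite here is ordinary if-then-else, the Boolean meaning of ITE(f, g, h).
def ite(f, g, h):
    return g if f else h

def update_UN(cond, guard):
    """cmpp.UN: predicate becomes (cond AND guard); the old value is discarded."""
    return ite(cond, guard, False)

def update_ON(cond, guard, p_old):
    """cmpp.ON: OR-type; set to 1 when cond holds under the guard, else keep p_old."""
    return ite(cond, ite(guard, True, p_old), p_old)

def update_AN(cond, guard, p_old):
    """cmpp.AN: AND-type; clear the predicate when cond fails under the guard."""
    return ite(guard, ite(cond, p_old, False), p_old)

# The slide's running example: p1 = UN(r1 < 3) then p1 = ON(r3 < 5),
# which together compute (r1 < 3) OR (r3 < 5).
r1, r3 = 2, 9
p1 = update_UN(r1 < 3, True)
p1 = update_ON(r3 < 5, True, p1)
```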
24 Step 3: Predicate Layer
- P1 = cmpp.UN(r1 < 3)
- First, examine the condition, r1 < 3
- In the register-to-integer family, the intervals for r1 are
  - (-∞, 3), (3, 4), (4, 4), (4, ∞)
- r1 < 3 corresponds to interval I0, node I01 in the BDD
- Type UN predicate
- Guarded under true
- Predicate node for p1 created by
  - P1 = ITE(I01, 1, 0)
25 Step 3: p1 = cmpp.UN(r1 < 3)
26 Step 3: p2 = cmpp.UN(r1 > r2)
- The condition is a register-to-register type
- There is a family in the BDD corresponding to the comparisons of r1 and r2
- r1 > r2 is specifically the node needed
- Type UN predicate
- Guarded under true
- Predicate node for p2 created by
  - P2 = ITE(r1 > r2, 1, 0)
27 Step 3: p2 = cmpp.UN(r1 > r2)
28 Step 3: p1 = cmpp.ON(r3 < 5)
- The condition is a register-to-integer type
- The register-to-integer family has the following intervals for r3
  - I03 = (-∞, 5), I13 = (5, ∞)
- r3 < 5 corresponds to I03; this is the condition
- Type ON predicate
- Guarded under true
- Predicate node for p1 created by
  - P1 = ITE(I03, ITE(1, 1, p1), p1)
  - The p1 in the ITE function is the previous predicate node for p1
29 Step 3: p1 = cmpp.ON(r3 < 5)
- Condition corresponds to I03
- P1 = ITE(I03, ITE(1, 1, p1), p1) = ITE(I03, 1, p1) = I03 OR p1
30 Step 3: p3 = cmpp.UN(r1 >= 4)
- The condition is a register-to-integer type
- The register-to-integer family has the following intervals for r1
  - I01 = (-∞, 3), I11 = (3, 4), I21 = (4, 4), I31 = (4, ∞)
- r1 >= 4 corresponds to both I21 and I31
- Type UN predicate
- Guarded under true
- P3 = ITE(ITE(I21, 1, I31), 1, 0)
31 Step 3: p3 = cmpp.UN(r1 >= 4)
- P3 = ITE(ITE(I21, 1, I31), 1, 0)
[figure: BDD now containing predicate nodes p1, p2, p3 over the interval and comparison nodes]
32 Step 3: p3 = cmpp.AN(r4 > 2)
- Condition corresponds to I14
- P3 = ITE(1, ITE(I14, p3, 0), p3)
[figure: updated BDD with the new p3 node]
33 Step 3: Final Predicate BDD
[figure: final BDD containing predicate nodes p1, p2, p3 over interval nodes I01, I11, I21, I31, I03, I13, I04, I14, the r1 > r2 nodes, and variables v0-v5]
34 Queries to PQS-BDD
- Are p2 and p3 disjoint?
  - Tmp = ITE(ITE(p2, ITE(p3, 0, 1), ITE(p3, 1, 0)), 0, 1)
  - If Tmp = null, then p2 and p3 are disjoint
- Is p1 a subset of p3?
  - Tmp = ITE(p1, ITE(p3, 0, 1), 0)
  - If Tmp = null, then p1 is a subset of p3
- All queries currently answered using PQS can be answered with the PQS-BDD system using ITE functions
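A brute-force sketch of these queries using truth tables in place of BDDs: a predicate is the set of variable assignments on which it is true, and "Tmp = null" corresponds to the function being identically false. The disjointness test here uses the standard conjunction formulation, and the names are ours.

```python
from itertools import product

def is_subset(p1, p3, nvars):
    """p1 subset of p3  <=>  ITE(p1, NOT p3, 0) is the null (always-false) function."""
    return all(not (p1(v) and not p3(v))
               for v in product([0, 1], repeat=nvars))

def is_disjoint(p2, p3, nvars):
    """p2, p3 disjoint  <=>  their conjunction ITE(p2, p3, 0) is null."""
    return all(not (p2(v) and p3(v))
               for v in product([0, 1], repeat=nvars))

# Example over two variables (hypothetical predicates, not the slides' p1-p3):
pA = lambda v: v[0] and v[1]        # true when v0 AND v1
pB = lambda v: v[0]                 # true when v0
assert is_subset(pA, pB, 2)         # v0 AND v1 implies v0
assert not is_disjoint(pA, pB, 2)   # they overlap on v = (1, 1)
```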
35 Cluster Scheduling
- Jeff Ringenberg
- David Oehmke
36 Motivations
- Register file
  - Size increases linearly with the number of registers
  - Size increases quadratically with the number of ports
  - Access time increases logarithmically with the number of read ports and the number of registers
- Wide machines require large numbers of registers and ports
  - An 8-wide, ideal, fully orthogonal VLIW machine requires approximately 16 read ports and 8 write ports
37 Clustering
- Functional units and register files are broken into sets (generally uniform)
- Each functional unit in a cluster is fully connected to that cluster's local register file
- Limited connectivity between clusters
- The register files of an 8-wide, 2-cluster machine are approximately one quarter the area of the register file of a single-cluster machine
- Connectivity
  - Most papers assume an explicit move operation to move data between clusters
  - Some actual architectures allow operands to be read directly from other clusters via a limited-bandwidth cross path
38 Clustering in the TI C6000
39 Compiling for Clusters
- Compilation for a clustered machine is more complex than for a single cluster
  - Assign operations to a cluster's functional units
  - Assign data to a cluster's register file
  - Move data between clusters when necessary
- Complications
  - Spread operations and data over clusters to achieve parallelism (partitioning)
  - Hide or limit the inter-cluster communication penalty
  - NP-complete problem
40 Cluster Scheduling Algorithms
- BUG, bottom-up greedy
  - Original algorithm from the Bulldog compiler
  - Pre-scheduling cluster assignment
- Limited-connectivity VLIW
  - Schedule assuming a fully connected machine, then partition and insert the necessary copy operations
- Partial component clustering
  - Pre-scheduling DAG decomposition and cluster assignment, with an iterative improvement phase
- Effective cluster assignment for modulo scheduling
  - Pre-modulo-scheduling cluster assignment
41 Cluster Scheduling Algorithms (cont.)
- Instruction scheduling for clustered VLIW DSPs (targets the TI C6201 architecture)
  - Partitioning using simulated annealing, with a list scheduler as the cost function
- Unified Assign and Schedule (UAS)
  - Assign operations to clusters while scheduling
  - Simple modification to a list scheduler
- CARS
  - Single-phase cluster assignment, register allocation, and instruction scheduling
42 Bottom-up Greedy (BUG)
Assign(node, destinations)
  if (!node.parent || node.fu != unassigned)
    return
  for each operand of node
    fus, cycles = LikelyFUs(node, destinations)
    Assign(operand, fus)
  fus, cycles = LikelyFUs(node, destinations)
  node.fu = fus.front
  node.cycle = cycles.front
  available[node.fu][node.cycle] = false
  for each operand of node
    if (operand.type == DEF &&
        (operand.location == unassigned ||
         MustHaveSingleLocation(operand)))
      AssignLocation(operand, node)
43 LikelyFUs(node, destinations)
  min = MAX_INT
  for each fu in FeasibleLocations(node)
    t = CompletionCycle(node, fu, destinations)
    if (t < min)
      min = t
      fus = {fu}
      cycles = {StartCycle(node, fu)}
    else if (t == min)
      fus += fu
      cycles += StartCycle(node, fu)
  return fus, cycles
44 BUG helper functions
- FeasibleLocations(node)
  - Returns the list of functional units that can perform the operation
- StartCycle(node, fu)
  - Returns an estimate of the earliest cycle at which the functional unit can be used to compute the node's operation
  - Takes into account the availability of functional units and the operand locations (delay and distance), if available
- CompletionCycle(node, fu, destinations)
  - Returns StartCycle(node, fu) + Delay(node, fu) + Distance(fu, destinations)
- Delay(node, fu)
  - Returns the number of cycles to compute the operation on the functional unit
- Distance(fu, destinations)
  - Returns the minimum number of cycles to move the result of the functional unit to one of the destinations (0 if destinations is empty)
45 Notes on BUG
- The top-level routine calls Assign(root, NULL) for each root node
- Assign is called on the roots in decreasing depth order
- The loop through the operands in Assign is also done in decreasing depth order
- Assign for data nodes
  - DEF nodes do nothing
  - USE nodes pass any final locations to their parent nodes as the destinations list
- A separate phase assigns locations to DEF and USE nodes that are still unassigned
  - Successors and predecessors are taken into account for this assignment
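The pseudocode above can be made concrete. Below is a runnable toy version of BUG's recursive FU assignment on the two-cluster machine model of the later example slide (cluster 1: M1/A1, cluster 2: L2/A2, all delays 1, inter-cluster distance 1); the DAG, the helper names, and the tie-breaking details are ours.

```python
MAX_INT = float('inf')

class Node:
    def __init__(self, name, fus, operands=()):
        self.name, self.feasible, self.operands = name, list(fus), list(operands)
        self.fu = None
        self.cycle = None

def cluster(fu):
    return 1 if fu in ('M1', 'A1') else 2

def distance(fu, destinations):
    """0 within a cluster, 1 between clusters; 0 if there are no destinations."""
    if not destinations:
        return 0
    return min(0 if cluster(fu) == cluster(d) else 1 for d in destinations)

busy = set()                                   # (fu, cycle) slots already taken

def start_cycle(node, fu):
    """Earliest cycle fu is free once the assigned operands are ready."""
    ready = max((op.cycle + 1 for op in node.operands if op.cycle is not None),
                default=0)
    t = ready
    while (fu, t) in busy:
        t += 1
    return t

def likely_fus(node, destinations, delay=1):
    """BUG's LikelyFUs: keep the FUs with the best completion cycle."""
    best, fus, cycles = MAX_INT, [], []
    for fu in node.feasible:
        s = start_cycle(node, fu)
        t = s + delay + distance(fu, destinations)
        if t < best:
            best, fus, cycles = t, [fu], [s]
        elif t == best:
            fus.append(fu); cycles.append(s)
    return fus, cycles

def assign(node, destinations):
    if node.fu is not None:
        return
    for op in node.operands:                   # recurse into operands first
        fus, _ = likely_fus(node, destinations)
        assign(op, fus)
    fus, cycles = likely_fus(node, destinations)
    node.fu, node.cycle = fus[0], cycles[0]
    busy.add((node.fu, node.cycle))

# Toy DAG: two ALU ops feeding an ALU op that feeds a multiply bound to M1.
n1 = Node('1', ['A1', 'A2'])
n2 = Node('2', ['A1', 'A2'])
n4 = Node('4', ['A1', 'A2'], [n1, n2])
n6 = Node('6', ['M1'], [n4])
assign(n6, [])
```

The recursion pulls the whole chain onto cluster 1, since the final multiply can only run on M1 and cross-cluster results cost an extra cycle.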
46 Shortcomings of BUG
- Interconnect resource constraints cannot be checked
  - The assignment can oversaturate the available buses
- Assignment of values to registers occurs on the fly, after FUs are assigned to operations
  - Subsequent copies of non-local data are scheduled later
  - Prior knowledge of these copies would benefit the FU assignment and scheduling
- BUG is greedy
  - Future knowledge is not used in decisions
  - Decisions cannot be changed
47 BUG Example
- Machine model: Cluster 1 has Multiply (M1) and ALU (A1); Cluster 2 has Load (L2) and ALU (A2). All delays are 1; distance is 0 within a cluster and 1 between clusters.
[figure: worked trace of Assign on a small DAG. Assign(6, {}) recurses through Assign(4, {M1}), Assign(2, {A1}), and Assign(1, {M1}), placing node 2 on A1 at cycle 0, node 1 on A2 at cycle 0, node 4 on A1 at cycle 1, and node 6 on M1 at cycle 2; Assign(5, {}) and Assign(3, {A1, A2}) place node 3 on L2 at cycle 0 and node 5 on A1 at cycle 2. Problem: the moves A2-to-M1 and L2-to-A1 each cost 1 cycle.]
48 Unified Assign and Schedule
49 Cluster Priority Heuristics
- None
  - The cluster list is not ordered
- Random
  - Priority is a random number
- Magnitude-weighted predecessor (MWP)
  - Number of flow-dependent predecessors assigned to the cluster
- Completion-weighted predecessor (CWP)
  - Latest ready time of any flow-dependent predecessor assigned to the cluster
- Critical path in a single cluster using CWP (CPSC)
  - Priority calculated as in CWP, but all nodes on the critical path are assigned to a single cluster
50 Advantages of UAS
- Simple modification to a list scheduler
  - The most common instruction scheduling technique
- Cluster assignment is done with full knowledge of resource and interconnect availability
- Better cluster utilization than BUG
- Generates more compact schedules than BUG
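The UAS idea can be sketched as a modified list scheduler: the cluster choice happens per operation at schedule time, here using the MWP heuristic. The machine model and all names are ours, and copy handling is simplified to a same-cycle slot charge rather than separate copy operations.

```python
def uas_schedule(ops, deps, n_clusters=2, slots_per_cluster=2):
    """List-schedule ops onto clusters; deps maps op -> list of predecessors."""
    sched = {}                    # op -> (cycle, cluster)
    used = {}                     # (cycle, cluster) -> slots consumed
    def ready(op, t):
        return all(d in sched and sched[d][0] < t for d in deps.get(op, []))
    def priority(op, c):          # MWP: prefer the cluster holding predecessors
        return -sum(1 for d in deps.get(op, []) if sched[d][1] == c)
    t = 0
    remaining = list(ops)
    while remaining:
        for op in list(remaining):
            if not ready(op, t):
                continue
            for c in sorted(range(n_clusters), key=lambda c: priority(op, c)):
                # One slot for the op plus one per cross-cluster operand copy.
                copies = sum(1 for d in deps.get(op, []) if sched[d][1] != c)
                need = 1 + copies
                if used.get((t, c), 0) + need <= slots_per_cluster:
                    used[(t, c)] = used.get((t, c), 0) + need
                    sched[op] = (t, c)
                    remaining.remove(op)
                    break
        t += 1
    return sched

# Two independent ops feed a third; MWP keeps the consumer with its producers.
sched = uas_schedule(['a', 'b', 'c'], {'c': ['a', 'b']})
```

Unlike BUG, the resource check (`used`) and the copy cost are consulted at the moment each operation is placed, which is the point of unifying assignment with scheduling.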
51 Speedup compared to optimal for the most frequently executed basic block
52 Code size increase due to copy operations for the most frequently executed basic block
53 Speedup compared to a 1-cluster, 8-issue machine (same number of resources) for the full benchmark
54 Instruction Scheduling for Clustered VLIW DSPs
Partition (simulated annealing)
  T = 10
  RandomPartition(P)
  mincost = ListSchedule(graph, P)
  while (T > 0.01)
    for i = 1 to 50
      r = Random(1, n)
      SwitchToOtherCluster(P[r])
      cost = ListSchedule(graph, P)
      delta = cost - mincost
      if (delta < 0 or Random(0,1) < exp(-delta / T))
        mincost = cost
      else
        SwitchToOtherCluster(P[r])
    T = T * 0.9
  return P
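A runnable miniature of the partitioner above. The list-scheduler cost is replaced by a stand-in that charges cross-cluster edges plus a load-imbalance penalty; this substitution, and all names, are ours.

```python
import math
import random

def anneal_partition(n, edges, seed=0):
    """Simulated-annealing 2-cluster partition, following the slide's loop."""
    rng = random.Random(seed)
    P = [rng.randint(0, 1) for _ in range(n)]              # RandomPartition(P)
    def cost(P):
        cross = sum(1 for a, b in edges if P[a] != P[b])   # communication
        imbalance = abs(sum(P) - (n - sum(P)))             # load balance
        return cross + imbalance
    mincost = cost(P)                                      # ListSchedule stand-in
    T = 10.0
    while T > 0.01:
        for _ in range(50):
            r = rng.randrange(n)
            P[r] ^= 1                                      # SwitchToOtherCluster
            c = cost(P)
            delta = c - mincost
            if delta < 0 or rng.random() < math.exp(-delta / T):
                mincost = c                                # accept the move
            else:
                P[r] ^= 1                                  # undo the move
        T *= 0.9
    return P, mincost

# Two triangles joined by one edge; the natural cut crosses only edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
P, c = anneal_partition(6, edges)
```

High temperatures accept cost-increasing flips so the search can escape local minima; as T decays by 0.9 per round, the loop degenerates into hill climbing.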
55 Getting Non-Local Operands
- Check for an already existing copy with either the destination or the source in the required cluster (CSE)
- Use the cross path if it is available this cycle and the operand supports it (taking commutativity into account)
- Insert a copy operation in a previous cycle
56 TI optimizing assembler versus optimal
57 TI compiler versus the algorithm
58 Effective Cluster Assignment for Modulo Scheduling
- Problem
  - Acyclic scheduling is concerned with minimizing schedule length
  - Cyclic scheduling is concerned with maximizing throughput
- Algorithm
  - Greedy cluster assignment
  - Insert any necessary copy operations
  - Schedule using any standard non-cluster-aware modulo scheduler
59 Cluster Assignment
- Give higher priority to nodes in recurrence cycles
  - The more critical the recurrence (the higher its recII), the higher the priority
- Speculatively reserve space for future copy operations to minimize resource contention
  - Aggressive cluster assignment could fill a cluster and prevent scheduling of a required copy
- Iterative approach
  - Correct early sub-optimal assignments
60 Standard Bottom-Up Greedy Approach
61 Modified Priority and Speculative Copy Approach
62 Cluster Selection
63 Power-Aware Modulo Scheduling
Amit Marathe, Hai Huang
EECS 583 Class Presentation II, April 10, 2002
64 Reference Paper and Motivation
- "Power-Aware Modulo Scheduling for High-Performance VLIW Processors", Yun and Kim, Seoul National University
  - Published in ISLPED 2001 (an ACM conference)
- Motivation
  - Reduce step power and peak power from the software perspective
  - Step/peak power is more important than average power as far as reliability is concerned (not necessarily optimum power consumption)
65 Step Power
- Step power is the difference in the average power consumed in two consecutive clock cycles
  - Reflected by a surge in current for charging/discharging
  - Due to aggressive/wider datapath designs, increasing clock frequencies, and growing transistor counts
  - Reduces reliability and causes timing and logic errors (the circuit switches at the wrong time or latches the wrong value)
- At the microarchitectural level
  - Represents inductive noise, L di/dt
  - A large surge in current means more noise, and more noise means more faults
- Aggressively turning off FUs to reduce average power consumption can conflict with the goal of reducing step power
66 Peak Power
- Peak power is the maximum power dissipation during the execution of a given program
- Chip reliability depends exponentially on peak power
  - High peak power leads to device degradation, reducing the chip's lifetime
  - Complex cooling systems are needed to avoid overheating and ensure system reliability
67 Power-Aware Modulo Scheduling Algorithm
- Aims at generating a balanced schedule that reduces both step power and peak power
- Ideology: compilers are smarter than hardware-assisted solutions
  - Because compilers can fully control the usage of the functional units
- Machine models tested
  - 8-issue VLIW
    - 1 IALU, 2 MEM, 1 IMPY, 2 FALU, 2 FMPY
  - 16-issue VLIW
    - 2 x (8-issue)
- Benchmarks tested
  - SPEC95 FP
68 Power-Aware Modulo Scheduling (cont'd)
- (Too) simple power estimation method
  - P(op, i) is the power consumed by operation op in pipeline stage i
  - The total power consumed in one clock cycle is the sum of the total power consumed in each pipeline stage during that clock cycle
  - The total power consumed in one pipeline stage in a given clock cycle is the total power consumed by all ops in that pipeline stage
- Problem: what about inter-stage or inter-operation effects on power consumption?
69 Base Algorithm (Iterative Modulo Scheduling)
70 Balanced IMS (power-aware algorithm)
- Cost function
  - Aim: minimize the cost function
  - This is NOT a complicated function; it just says to pick a schedule in which the function is minimized
  - P(Lsp, i) is the power consumed in time slot i of the software-pipelined loop
  - The ideal P(Lsp, i) is when all the ops are no-ops
  - Peak power is the maximum P(Lsp, i), and the step power is P(Lsp, i) - P(Lsp, i-1)
  - Somehow, minimizing the cost function minimizes both peak power and step power
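The definitions above can be computed directly. The kernel below is a made-up II = 4 loop, but the peak and step computations follow the slide's definitions, with the step wrapping around across kernel iterations.

```python
def slot_powers(kernel, op_power):
    """P(Lsp, i): sum the power of every op occupying each time slot."""
    return [sum(op_power.get(op, 0.0) for op in slot) for slot in kernel]

def peak_power(powers):
    """Peak power: the maximum P(Lsp, i)."""
    return max(powers)

def max_step_power(powers):
    """Largest |P(Lsp, i) - P(Lsp, i-1)|; the kernel wraps around (modulo)."""
    return max(abs(powers[i] - powers[i - 1]) for i in range(len(powers)))

# Unbalanced kernel: four ops piled into slot 0, the other slots empty.
unbalanced = [['add', 'mul', 'ld', 'st'], [], [], []]
balanced = [['add'], ['mul'], ['ld'], ['st']]      # same ops, spread out
power = {'add': 1.0, 'mul': 1.0, 'ld': 1.0, 'st': 1.0}

p_u = slot_powers(unbalanced, power)
p_b = slot_powers(balanced, power)
```

Spreading the same operations over the slots leaves total energy unchanged but flattens the profile, which is exactly what BIMS's cost function rewards.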
71 Summary
- IMS selects the earliest time slot within the computed slack (the slack is the range of time in which the op can be scheduled without violating dependence constraints) in which there is no resource conflict, and schedules the op there
- BIMS uses the cost function F(Lsp, i) to place an instruction in one of the time slots in the slack (basically, the time slot that incurs the least increase in the cost function)
72 An Example for a Better Picture
- IMS: if power(no-op) = 0 and power(other ops) = 1, then peak power = 4 and step power = 3
- Balanced IMS: if power(no-op) = 0 and power(other ops) = 1, then peak power = 2 and step power = 0
73 Results
74 Conclusions
- They don't make a strong case as to why average power is not important (they don't even analyze average power)
- The power model is too simplistic
- Seems to be a novel idea (so far, most papers have focused on reducing step/peak power in hardware)
- Promising results (almost 37.1% reduction in step-power consumption)
- Idea worth exploring for large systems
75 Dynamic Voltage Scaling: An Overview
76 Issue of Operating Voltage
- The predominant device technology is CMOS
- Energy is proportional to the square of the operating voltage
- Maximum gate delay is inversely related to voltage
- Can reduce per-computation energy by reducing frequency and voltage
77 Dynamic Voltage Scaling (DVS)
- [Weiser94]
  - Busy system -> increase frequency
  - Idle system -> reduce frequency
- Needs processors that support a software-adjustable PLL and voltage regulator
  - e.g., XScale, SpeedStep, PowerNow!, Crusoe
78 RT Systems vs. NRT Systems
- All systems can be classified as either
  - 1. Real-time systems
  - 2. Non-real-time systems (or soft real-time systems)
- NRTS works well with DVS: no deadlines
- There are some challenges in using DVS with RTS
79 Real-Time Systems
- A task is characterized as (P, D, C)
  - P: period
  - D: deadline
  - C: worst-case execution time (WCET)
- All that matters is the task meeting its deadline!
80 RTS with DVS
- Static DVS
  - Worst-case utilization, U
  - Task1 = (10, 3), Task2 = (5, 1), Task3 = (10, 4)
  - U = 3/10 + 1/5 + 4/10 = 0.9
[figure: before scaling, T1, T2, and T3 run at full speed (1.0) with idle gaps; after scaling, the tasks stretch to fill their periods at reduced speed (time axis 0 to 20)]
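The static calculation on this slide, done with exact fractions. The function name is ours, and the sketch assumes deadlines equal periods, the standard setting in which running everything at speed U still meets every deadline.

```python
from fractions import Fraction

def static_speed(tasks):
    """tasks: list of (period, wcet) pairs; returns (U, stretched wcets)."""
    U = sum(Fraction(c, p) for p, c in tasks)
    # At speed U, each task's worst-case execution time stretches by 1/U.
    return U, [Fraction(c, 1) / U for p, c in tasks]

tasks = [(10, 3), (5, 1), (10, 4)]     # the slide's task set
U, stretched = static_speed(tasks)
assert U == Fraction(9, 10)            # 3/10 + 1/5 + 4/10 = 0.9

# After scaling, utilization is exactly 1: the processor is never idle.
assert sum(c / p for (p, _), c in zip(tasks, stretched)) == 1
```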
81 RTS with DVS (cont.)
- Dynamic DVS
  - Observation: the WCET is much greater than the actual execution time (ACET)
  - Using the actual execution time instead of the WCET yields even higher energy savings
82 NRTS with DVS
- Use the past to predict the future
- Potentially has longer delays
  - No problem: there are no strict deadlines
- Opportunity for more aggressive DVS algorithms
- Needs to strike a balance between energy saving and performance
83 DVS in the Compiler?
- [Mosse00]
  - Power management points (PMPs) are inserted into the generated code
  - The application monitors its own progress and adjusts the clock speed if appropriate
  - Targeted at single-threaded embedded systems
84 Power Management Points
- A task is divided into n sections
  - Each section has a WCET
- PMPs are inserted at the section boundaries
  - Obtain the actual run time of the section
  - Compare the actual time to the WCET, and adjust the processor frequency accordingly
- Natural places are loop boundaries and procedure call sites
- Use profiling information to eliminate unnecessary PMPs and reduce overhead
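A toy simulation of this scheme: after each section, the PMP compares elapsed time against the worst case still ahead and rescales the frequency. The greedy rescaling rule anticipates the DPM-G variant on the next slides; all numbers and names here are ours.

```python
def run_with_pmps(wcets, actual_times, deadline):
    """Simulate PMPs: pick a frequency before each section, return the picks."""
    t = 0.0                                  # elapsed wall-clock time
    freqs = []
    for j, actual in enumerate(actual_times):
        remaining_wc = sum(wcets[j:])        # worst case still ahead of us
        budget = deadline - t                # wall-clock time left
        f = min(1.0, remaining_wc / budget)  # slow down if slack has built up
        freqs.append(f)
        t += actual / f                      # the section stretches at low speed
    assert t <= deadline                     # the deadline is still met
    return freqs

# Three sections; actual times are half the worst case, so each PMP finds
# freshly accumulated slack and lowers the frequency further.
freqs = run_with_pmps([4, 4, 4], [2, 2, 2], deadline=12)
```

Choosing f = remaining_wc / budget guarantees the deadline even if every remaining section takes its full WCET, since the remaining work then fits exactly in the remaining budget.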
85 Voltage Adjustment Schemes
- NPM: No Power Management
  - Every section runs at the highest speed
- SPM: Static Power Management
  - Same as the static DVS approach: use the worst-case utilization
- DPM-P: Dynamic Power Management, Proportional
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
86 Voltage Adjustment Schemes (cont.)
- DPM-G: Dynamic Power Management, Greedy
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
87 Voltage Adjustment Schemes (cont.)
- DPM-S: Dynamic Power Management, Statistical
  - The task is divided into n sections, with task deadline d
  - j sections finished at [time given in the original figure]
  - Speed is set to [formula given in the original figure]
88 Performance
89 Performance (cont.)
90 Conclusion
- DVS is a powerful way to save energy
- If deadlines are not an issue, there is an opportunity to be more aggressive in saving energy
- If meeting deadlines is important, be more conservative
- Applying DVS in the compiler is still an open research area: systems are mostly multitasking