Title: EECS 583 Lecture 23 - Group 1: Control flow analysis optimization; Group 2: Dataflow analysis optimization
1EECS 583 Lecture 23 - Group 1: Control flow
analysis optimization; Group 2: Dataflow analysis optimization
- University of Michigan
- April 8, 2002
2Today
- Control flow analysis and optimization
- Identifying branch correlations with path
profiles
- Extending Trimaran to perform path profiling and
optimization using path profile information
- Nael, Pariwat, Nuwee
- Compiler switch spacewalking
- Identifying the best settings for compiler
switches
- Ibrahim, Pete
- Dataflow analysis and optimization
- BDD-based predicate analysis (record number of
slides)
- More intelligent predicate relation analysis
using binary decision diagrams
- Beth, Laura, Bill
3Next time
- Scheduling groups
- TI C6x Dave, Jeff
- While loop software pipelining Arnar, Tomas,
Misha
- Power-sensitive scheduling Hai, Amit
- Last class Wednesday, April 17
- Starts at 4:00 pm
- To be held on central campus
- We will head over to Ashley's afterwards
(attendance optional)
- First round is on me
- Don't have to drink if you do not want to
4EECS 583 Group 1 Advanced control flow
analysis and optimization Path Profiling
- University of Michigan
- April 8, 2002
5Profiling?
- A profile counts occurrences of events during a
program's execution
- Point/Basic Block Profiling
- Edge Profiling
- Path Profiling
[Figure: example CFG annotated with execution counts, contrasting
point profiling (per-block counts) with edge profiling (per-edge counts)]
6Path Profiling What is it? How to?
- Path: an execution trace
- Path profile: how often does each control-flow
path execute?
- How to profile paths?
- Approximate with block or edge profiles
- Inaccurate!!
- Trace program
- High cost!!
- Use?
- Compiler optimization
- Performance Tuning
- Program Testing
7Profiling? Edge Profiling Not Enough!
Example how edge profiles misidentify the most
frequently executed paths.
[Figure: CFG with edge counts showing how edge profiles can
misidentify the most frequently executed path]
8Dumb way to collect path profiles
- Set k = history depth
- Execute the program; on each new block, add it to a FIFO
queue of length k
- If the pattern is found in the hash table, increment its counter
- Else add the new pattern to the hash table
k = 4
...A B C D E F A C D E
String of blocks
Hash table (path / counter): A B C D: 10, B C D E: 5,
C D E F: 1, E F A C: 1, F A C D: 1,
A C D E: 2, ...
FIFO contents: F A C D, then
A C D E (the most recent path)
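The FIFO scheme above can be sketched in a few lines. The function name and the string-of-blocks trace format are illustrative, not from any real profiler:

```python
from collections import deque, Counter

def naive_path_profile(block_trace, k=4):
    """Sliding-window path profiler: count every length-k sequence of
    basic blocks seen during execution (the 'dumb' hash-table scheme)."""
    window = deque(maxlen=k)          # FIFO of the k most recent blocks
    counts = Counter()                # hash table: path pattern -> counter
    for block in block_trace:
        window.append(block)
        if len(window) == k:
            counts[tuple(window)] += 1
    return counts

# Trace from the slide: ... A B C D E F A C D E
counts = naive_path_profile("ABCDEFACDE", k=4)
```

Note the cost the slide alludes to: every executed block touches the hash table, which is why this approach is expensive compared to Ball-Larus numbering.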
9Efficient path profiling
- By Thomas Ball and James R. Larus, 1996
- From Bell Labs and U. of Wisconsin-Madison
10Efficient path profiling Algorithm Overview
- Convert CFG to DAG
- Assign integer values to edges such that no two
paths compute the same path sum
- Select edges to instrument and compute
appropriate increment for each edge
- Regenerate path
11Convert CFG to DAG
- Paths start at the procedure entry or a loop head
- Yes, it is intraprocedural path profiling
- Acyclic path
- Remove backedges
- Add source vertex (ENTRY)
- Add sink vertex (EXIT)
12Assign Edge Values
- Assign each edge e a value Val(e) such that
- The sum of Val(e) along each DAG path is a unique,
non-negative integer
- Sums lie in the range 0 .. NumPaths-1
- NumPaths(v) = number of paths from v to EXIT
13Assign Edge Values - Example
Vertex (v) / NumPaths(v) table for A, B, C, D, E, F
Val(v -> w_k) = Σ_{i=1..k-1} NumPaths(w_i)
14Assign Edge Values - Example
- DAG with values computed and Path Encoding
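The two formulas above can be sketched directly. This is a minimal rendering of the Ball-Larus numbering on a toy diamond DAG (the graph and names are illustrative, not the slides' A-F example):

```python
def assign_edge_values(succ, entry, exit_node):
    """Ball-Larus edge numbering on a DAG: NumPaths(EXIT) = 1,
    NumPaths(v) = sum of NumPaths(w) over successors w, and
    Val(v->w_k) = sum of NumPaths(w_i) over earlier siblings w_i,
    so every entry-to-exit path sum is unique in 0..NumPaths-1."""
    num_paths, val = {}, {}

    def visit(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in succ[v]:
            val[(v, w)] = total       # NumPaths of earlier siblings so far
            total += visit(w)
        num_paths[v] = total
        return total

    visit(entry)
    return num_paths, val

# Toy diamond DAG: A -> {B, C}, both -> D (EXIT)
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
num_paths, val = assign_edge_values(succ, "A", "D")
# Path A-B-D sums to 0; path A-C-D sums to 1
```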
15Many ways to compute sum
16Minimal Increments
- Given the edge values, find the minimal set of increments
that still computes the sum
- Efficient event counting [Ball]
- Weight edges by frequency
- Build a maximum spanning tree so the instrumented edges
(the chords) are the least-traveled ones
17Determine Minimal Increments
- Inc(B->D) = Val(A->B) + Val(B->D) + Val(D->F)
- = 2 + 2 + 0 = 4
18Instrumentation
- Basic
- Initialize r = 0 at ENTRY
- Increment r += Inc(c) along each chord c
- Record count[r]++ at EXIT
- Optimized
- Fold initialization into the first increment: r = Inc(c)
- Fold the last increment into the record: count[r + Inc(c)]++
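A minimal simulation of the basic scheme (register r, chord increments, count at EXIT). The single chord and its increment reuse a toy diamond CFG and are illustrative only:

```python
def run_instrumented(path_edges, inc, counts):
    """One instrumented execution: r = 0 at ENTRY, r += Inc(e) on each
    instrumented edge, count[r] += 1 at EXIT. Edges absent from `inc`
    carry no probe, like spanning-tree edges in the real scheme."""
    r = 0
    for e in path_edges:
        r += inc.get(e, 0)
    counts[r] = counts.get(r, 0) + 1

inc = {("A", "C"): 1}                 # single chord probe for a diamond CFG
counts = {}
run_instrumented([("A", "B"), ("B", "D")], inc, counts)  # path sum 0
run_instrumented([("A", "C"), ("C", "D")], inc, counts)  # path sum 1
run_instrumented([("A", "C"), ("C", "D")], inc, counts)
```

After these three runs, counts maps path sum 0 to 1 execution and path sum 1 to 2 executions.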
19Path Regeneration
- Given a sum P, which path produced it?
- We know the path encodings lie in 0 .. NumPaths-1
- Start at the ENTRY node and traverse the graph
- Repeat until reaching EXIT:
- At each block, take the outgoing edge with the largest
Val(e) that does not exceed the remaining sum
- Subtract Val(e) from the remaining sum
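The regeneration loop above can be sketched as follows; the toy diamond DAG and its edge values are illustrative, not the slides' example:

```python
def regenerate_path(path_sum, entry, exit_node, succ, val):
    """Recover the path for a given sum: starting at ENTRY, repeatedly
    take the outgoing edge with the largest Val(e) not exceeding the
    remaining sum, and subtract Val(e) from it."""
    path, v, remaining = [entry], entry, path_sum
    while v != exit_node:
        w = max((s for s in succ[v] if val[(v, s)] <= remaining),
                key=lambda s: val[(v, s)])
        remaining -= val[(v, w)]
        path.append(w)
        v = w
    return path

# Toy diamond DAG with Ball-Larus edge values
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
val = {("A", "B"): 0, ("A", "C"): 1, ("B", "D"): 0, ("C", "D"): 0}
```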
20Performance Measurement
- Low average run-time overhead: 31%
(edge profiling: 16%)
21Performance Measurement
- Captures longer paths than edge profiling
22Related work and Extensions
- Interprocedural Path Profiling, by David Melski and
Thomas Reps
- Paths may cross procedure boundaries
- Uses a numbering scheme as in Ball and Larus
- Complicated by cyclic paths arising from entering
through different call sites
- Whole Program Paths, by James R. Larus
- A complete, compact record of a program's entire
control flow
- Supports interprocedural paths
- Practically used for finding hot sub-paths
- Requires more storage and post-processing
23EECS 583 Project Static Correlated Branch
Prediction
- Nuwee Wiwatwattana, Pariwat Luangsuwimol, Nael Botros
- April 8, 2002
24Introduction
- Importance of path profiles
- They expose patterns followed by the program
- They allow optimizations to use path
frequencies rather than just point frequencies
- In some situations, while it is possible to
optimize a statement with respect to some paths
along which it lies, the same optimization
opportunity does not exist along other paths
through the statement. We refer to such
optimizations as path sensitive optimizations. - Rajiv Gupta, University of Arizona
25Path profile driven optimizations
- Examples of path profile driven optimizations
- Static correlated branch prediction.
- Path profile guided partial dead code
elimination.
- Partial redundancy elimination.
- Load redundancy removal.
- Elimination of array bound checks.
26Increased importance of branch prediction accuracy
- Increased importance of branch prediction
- Deeper pipelines.
- Delayed branch resolution.
- Dynamic versus Static branch prediction
- Use of actual pattern followed by branch.
- Costs of hardware dynamic branch prediction
- Cycle time cost (T_cpu = N_inst x Cycles/Inst
x Seconds/Cycle)
- Hardware cost.
27(No Transcript)
28Static Correlated Branch Prediction (SCBP)
- Static (compiler) simulation of a hardware
per-branch-history branch predictor and branch
target predictor
29SCBP
- Limitations of static branch prediction
- Limited communication with the processor (a single
bit stating whether the branch is likely Taken or
Not Taken is passed from compiler to processor)
- Branch prediction is static over time, unlike
hardware dynamic branch predictors.
- How SCBP gets around those limitations
- Code duplication
30SCBP
- Trade-off between code expansion and branch
predictability.
- Improves performance of multiple-issue, deeply
pipelined microprocessors.
"SCBP utilizes general path profiles, not limiting
the technique to forward path profiles. We appear
to have been the first researchers to have
collected path profile information." (Young and Smith)
31SCBP
- What does SCBP do
- Algorithm to use general path profiles to improve
static branch prediction accuracy.
- Trade off code expansion for improved accuracy.
- Provide an algorithm to automatically and
effectively tune SCBP space-time trade-off.
32SCBP
- Steps for static correlated branch prediction
- Profiling.
- Local minimization.
- Global reconciliation.
- Layout.
- Example
33SCBP
- First we generate the path profile for the given
code
34SCBP
- Local minimization
- Here the concept of a history tree is
introduced.
- History tree
- Nodes of history tree are edges of CFG.
- For each block that ends with a conditional
branch there is a history tree.
- The root node is called the predicted branch; the block
containing it is called the predicted block.
- Any path whose last edge starts at the predicted
block is called a predictive path.
- For each predictive path, the last edge is called the
counted edge, and the rest of the path is called the
observed path.
- Different nodes in the history tree may map to the same
CFG block (different paths, or history covering
multiple iterations of a loop)
35SCBP
36SCBP
- Now we need to find the minimum amount of history
necessary to exploit correlation if it exists.
The less history we need to preserve, the less
code expansion will result in the final program.
- Pruning history trees
37SCBP
Example of pruning a branch with no correlation
to its ancestors:
the tree collapses to its root node.
38SCBP
- Global Reconciliation
- Determining the minimum number of copies needed
for each basic block in order to preserve
correlation history.
39SCBP
- Global reconciliation steps
- First step: find all potential splitters. (Each
non-leaf node of a minimized (pruned) tree is a
splitter of the source block of the edge to which
it maps.)
- Second step: determine the number of pieces in
each partition of paths leading to each
splitter.
- Third step: take the intersection of all pieces of
all partitions of a certain basic block.
40SCBP
41SCBP
- Layout issues
- The CFG produced by reconciliation typically is
not an executable program, as new join points are
created.
- If SCBP is performed on an intermediate
representation, the new joins are no problem.
- If it is performed on an intermediate representation
that resembles machine code, we may need to add
some new branch instructions to ensure
correct program semantics.
42SCBP
- Trading off space and time
- So far SCBP attempted to capture the maximum
improvement in branch prediction accuracy with no
attempt to limit size expansion.
- The net effect will depend on both the improvement in
branch prediction and the penalties due to a worse
cache miss rate.
- Rather than making the maximum number of copies to
achieve the maximum improvement in branch
prediction, we will choose only profitable
branches and blocks for duplication.
- Solution: overpruning, sacrificing some branch
prediction accuracy in favor of code size.
43SCBP
- Experimental Results
- Benchmarks used
44SCBP
- Experimental Results
- Training sets used for benchmarks
45SCBP
- Experimental Results
- Additional information about benchmarks
46SCBP
- Experimental Results
- Number of paths profiled as a function of history
depth
47SCBP
- Experimental Results
- Times for profiling
48SCBP
- Experimental Results
- Sizes of original and transformed code
49SCBP
- Experimental Results
- Mis-predict rate and cache miss rate
50Reference Papers
- Static Correlated Branch Prediction
- Cliff Young
- Bell Laboratories
- and
- Michael D. Smith
- Harvard University
51How to implement optimizations based on path
profiling in Trimaran?
52General Idea
[Diagram: a text file of path profiles feeds the Trimaran
pipeline: Impact -> Elcor -> simulator]
53How to Create a Path Profile?
Insert instrumentation code inside the C program
[Diagram: C code and Rebel IR (cmpp, add, sub) pass through
Codegen to produce a.out, which writes PATH_FILE entries such as
"4 1 2 3 4 3 2 4 5 6 8"]
54Go Back To Elcor
- 2 steps
- Create a small function to read input from the text
file and generate an appropriate data structure to
store it.
- Create the optimization code.
55Compiler Switch Optimization
- Peter Schwartz
- Ibrahim Bashir
- April 8, 2002
56Importance of Switches
- The paper
- A case study on the importance of compiler and
other optimizations for improving super-scalar
processor performance
- by Duvall, Andersen, Leggoe, Graham, Cooke, and
Antonio
- 1999
- What they did
- Started with a FORTRAN program
- Optimized by changing compiler switches
- Tested results on 2 machines
57Program and Optimizations
- Program
- Written in FORTRAN
- Modeled spherical particle transport phenomena
- 2500 lines of code
- 25 subroutines
- Many nested loops
- Optimizations
58Experiment and Results
- Test machines
- IBM SP
- 160MHz POWER2 CPU
- 4 nodes
- DEC Alpha
- 667MHz 21164 CPU
- 1 node
- Results
59Conclusions
- Other switch settings gave similar results
- The authors only reported good results
- Good switch settings gave a speedup of 10%
- They used experts to set switches
- We want to automate the process
60Genetic Algorithm Parallelisation System (GAPS)
- The paper
- GAPS Iterative Feedback Directed Parallelisation
Using Genetic Algorithms
- by Andy Nisbet
- 1998
- What he did
- Started with FORTRAN program
- Used genetic algorithms to find a good sequence
of compiler optimizations
- Tested with different numbers of processors
61The GAPS Approach
62Population and Evaluation
- Population initialization
- Uses domain information
- Population size = 20
- Individuals represent sequences of
transformations
- Fitness evaluation
- Transformations encoded in an individual are applied
to the original code
- Transformed code is compiled and run on the benchmark
- Faster execution means a higher fitness score
- Illegal code gets the lowest fitness
63Selection and Reproduction
- Selection
- Probability: linear normalization of fitness
- Illegal sequences can still reproduce
- Elitism: only members of the weaker half are selected
for deletion
- Reproduction
- All transformations changed loop structures
- Crossover and mutation were specific to this
domain
- Details are irrelevant here
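The selection/reproduction loop above can be sketched for the simpler switch-selection case our project targets. Everything here is a toy stand-in: individuals are on/off switch vectors rather than GAPS's transformation sequences, and the fitness function is synthetic (a real system would compile and time the benchmark):

```python
import random

def ga_switch_search(num_switches, fitness, pop_size=20,
                     generations=30, seed=0):
    """Toy genetic search over boolean compiler-switch vectors,
    with elitism (weaker half deleted), one-point crossover, and
    single-bit mutation. Illustrative, not the GAPS implementation."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(num_switches)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]        # elitism
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, num_switches)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = rng.randrange(num_switches)
            child[i] = not child[i]            # single-bit mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Hypothetical fitness: switches 0 and 2 help, switch 1 hurts.
best = ga_switch_search(3, lambda s: s[0] + s[2] - s[1])
```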
64Experiment
- FORTRAN code
- Machine
- SGI Origin 2000
- 1, 2, 4, 8, and 16 processors
- Compilers
- GAPS
- PFA (Native SGI)
- Petit
65Results
- GAPS performed best
- 44% faster than PFA
- 37% faster than Petit
- Took about 24 hours to produce 20,000
individuals
- Only 5.5% of GAPS individuals were legal
66Conclusions
- Genetic algorithms can be applied to compiler
optimizations
- Can require many reproductions
- Can take a long time
- Our search space should be much smaller
67Adaptive Program Optimization
68Motivation
- Trying to find best combination of optimizations
to apply to a given program
- Might not be possible to determine the
applicability of certain transformations at
compile-time
- Optimization decisions are limited by lack of
information about the input data set
- Drawbacks of profiling
- Profiling makes use of information collected
during previous program runs
- Data collected through profiling is based on
program input that may not be representative of
current input
- Profiles may also become inaccurate if the
machine configuration changes
- Adaptive optimization techniques attempt to gain
more accurate info by making decisions at
run-time based on the current input and machine
configuration
69Adaptive Optimization
- Rather than applying a particular transformation
at compile-time, generate adaptive code which can
behave like transformed code (if needed) at
run-time - Adaptive programs can be thought of as having
multiple execution paths
- Selection of a particular execution path is based
on run-time information/values
- More practical than multiversioning
- Cost of run-time analysis is small compared to
benefits if transformation can be applied
70Dynamic Optimization
- Technique similar to adaptive optimization
perform optimizations dynamically as information
becomes available to apply and evaluate them
- Multiversioning
- Compiler generates multiple versions of a code
section, and the most appropriate variant is
selected at run-time based on current input data
and/or machine environment - Major limitation is that variants are generated
at compile-time and therefore no run-time info
can be exploited during code generation
- Creating enough code variants to cover all
possible scenarios can lead to significant code
growth, so typically only a few versions are
created for each code section
71Dynamic Optimization (2)
- Dynamic Feedback
- Technique that selects from compile-time code
variants, but uses run-time sampling to choose
- Same problems as multiversioning no run-time
info used in code generation and code explosion
- Sampling phase measure execution time for each
optimization generated at compile-time
- User-defined duration; doesn't monitor changes in
the environment
- Production phase use the variant with the
smallest execution time
- More problems
- No guarantee that input data/environment are
constant during sampling phase, so it may be
unreasonable to compare performance of variants
- Behavior during sampling phase may not model
behavior during production phase
72Dynamic Optimization (3)
- Dynamic Compilation
- Generates new code variants during program
execution
- Makes use of run-time information
- Overhead exceeds multiversioning and dynamic
feedback
- Program execution is paused as new code variants
are generated
- Generating code at run-time is expensive
- Due to high overhead, only applied to code
sections that will benefit
- Difficult to automate selection of such sections
73Why use adaptive optimization?
- Applicability
- Conservative compile-time analysis may exclude
some optimizations
- Run-time test can determine whether or not a
transformation applies
- Usefulness
- Whether or not an optimization will be useful may
depend on program characteristics not known at
compile-time
- Selection
- When there is more than one applicable
optimization, selecting the most suitable one may
require run-time info
- Adaptive code can behave either as untransformed
code or code that has been transformed using one
or more optimizations
74Adaptive optimization vs. Multiversioning
- Drawback of multiversion programs is the
resulting code growth
- Experience has shown that in many situations
multiple transformations must be applied to gain
the desired optimization benefit
- Number of versions grows exponentially
- Same problem as compiler switch selection
- Adaptive code avoids exponential code growth by
requiring execution of some additional
instructions at run-time
- Despite the run-time overhead of the additional
instructions, adaptive programs can realize a
large fraction of the speedup achievable by
corresponding multiversion programs
- Results: the adaptive version had 40-80% of the
speedup achieved by the multiversion program
75How adaptive optimization works
- Achieves effects of transformations by
- Adapting flow of control
- Modifying bounds of loop variables
- Adapting usage of loop variables in array
subscript expressions
- Choosing between serial and parallel execution of
loops
- Example: loop fusion
- Execution of loop iterations is interleaved
- Adaptive code contains back to back loops whose
execution can either be interleaved or carried
out in sequence
- Set Boolean flags based on run-time info, and use
value of flags to decide between
original/transformed execution
- Adaptive transformation reduces code growth
associated with multiversioning by introducing
additional predicates and branches
- Some problems
- No mention of how run-time info is used to select
a transformation
- Only applied to a few loop transformations
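The loop fusion example above can be sketched as one adaptive code body with a run-time flag choosing between interleaved (fused) and sequential execution. The computation inside the loops is an arbitrary illustration:

```python
def adaptive_fused_loops(a, b, can_fuse):
    """Adaptive form of two back-to-back loops over the same range:
    either interleave the two loop bodies (fused) or run them in
    sequence, selected by a Boolean flag computed from run-time
    information. A sketch, not the paper's actual transformation."""
    out = [0] * len(a)
    if can_fuse:
        for i in range(len(a)):       # interleaved (fused) execution
            out[i] = a[i] * 2
            out[i] += b[i]
    else:
        for i in range(len(a)):       # original loop 1
            out[i] = a[i] * 2
        for i in range(len(b)):       # original loop 2
            out[i] += b[i]
    return out
```

Both paths compute the same result here; the point is that the choice is a cheap branch at run time rather than a full duplicated code version.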
76Our project
- Start out with a large number of Elcor/Impact
optimization switches
- Phase 1 narrow down switches to some reasonable
number
- What's a reasonable number?
- Might be decided based on performance analysis
- How exactly do we narrow down the switches?
- Couple of different techniques use empirical
analysis to decide on one
- Phase 2 feed reasonable number of switches into
a genetic algorithm
- Let genetic algorithm run for a really long time
- Refine this manageable number of switches into an
optimal combination (near-optimal?)
77Our project (2)
- Phase 3 analysis
- What kind of performance improvements do we get?
- Looking for reduction in cycle count
- How close are we to the optimal combination of
switches?
- We probably won't know the exact optimal combination
due to time constraints
- Is it any better than what Trimaran does by
default?
- Evaluate how good Trimaran defaults are
- Is it any better than what a programmer with some
knowledge of the program would pick?
- Any unexpected optimization inter-dependencies?
78Global Predicate Analysis and its Application to
Register Allocation.
79Abstract
- VLIW machines can exploit ILP
- Predicated execution is a useful tool
- Unfortunately, predicated code can confuse the
traditional optimization techniques a compiler
uses
- The solution is to modify traditional compiler
optimizations to make them predicate-aware
80Global and local
- Impact already does this type of predicate analysis,
but it does it on a hyperblock basis
- The authors claim there are benefits to doing
predicate analysis with a global scope (they do
procedure scope, sort of)
- An example of a global view with partially
if-converted code
81- p, q, r, s = false
- if (...) then
-   p, q = cmpp.un.uc(...) if true
-   x = ... if p
- else
-   r, s = cmpp.un.uc(...) if true
-   x = ... if r
-   y = ... if s
- endif
- ... = x if p
- ... = y if s
- A register allocator that was only aware of
per-block information would detect an
interference between x and y. We need to know that
p and s can never be true at the same time;
otherwise we get a spurious interference edge in
the coloring algorithm.
82How do you analyze predicated code?
- Build up a partition graph.
- Query this partition graph as you perform
optimizations.
- Some old terms
83Building up the partition graph
- Execution trace: all instructions executed
from beginning to end in straight-line
code.
- Domain: every predicate has a domain; a trace
belongs to the domain of p if all instructions of
the trace are executed when p is true.
- Partition: divides a predicate's domain into
multiple disjoint subsets. The union of these
subsets equals the domain.
84Predicate Partition Graph
- Predicate partition graph G = (V, E)
- V has a node for each predicate
- E contains a directed edge p -> q if p has a
partition and q is a subset of that partition.
- If partition p = q U r, then E would contain p -> q
and p -> r
- Things can get messy
85Example
[Figure: example CFG with blocks S1-S6; the branch in S1
produces predicates p2/p3 on its taken/fall-through edges, and
the branch in S2 produces p4/p5]
86- I1. p2, p3 = cmpp.un.uc(s1 cond) if true
- I2. p4_1 = cmpp.uc(s1 cond) if true
- I3. S2 if p2
- I4. p5 = cmpp.uc(s2 cond) if p2
- I5. px = cmpp.on(s2 cond) if p2
- I6. p4_2 = p4_1 or px if true
- I7. S3 if p3
- I8. S5 if p5
- I9. S4 if p4
- I10. S6 if true
87Building the partition graph
- p0 is true
- p0 is split into p3 and p2 by I1.
- p4_1 has the same condition as p3.
[Graph: p0 partitioned into p3 and p2; p4_1 shares p3's domain]
88Building the partition graph (continued)
- I4 and I5 split partition p2 into px and p5
- I6 makes p4_2 a new domain which is the
complement of p5, so we add a new split to p0
[Graph: p0 now also partitioned into p4_2 and p5; p2 split
into px and p5]
89Still making the partition graph...
- p4_2 is the union of p4_1 and px, so we add that
partition.
- It's a little messy.
[Graph: full partition graph over p0, p2, p3, p4_1, p4_2,
px, and p5]
90What is the partition graph good for, anyway?
- You can ask it queries that are useful when doing
optimizations.
- A predicate-aware register allocator uses
- isDisjoint
- LeastUpperBoundSum
- LeastUpperBoundDiff
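The isDisjoint query can be sketched on a simplified model of the structure: each partition splits a parent predicate's domain into disjoint children, and two predicates are disjoint when their leaf domains do not overlap. This is a toy model for illustration, not the paper's implementation (for simplicity it expands only the first partition under each node):

```python
class PartitionGraph:
    """Toy predicate partition graph supporting an isDisjoint query."""

    def __init__(self):
        self.partitions = {}          # parent -> list of child lists

    def add_partition(self, parent, kids):
        self.partitions.setdefault(parent, []).append(list(kids))

    def domain(self, p):
        """Leaf pieces of p's domain, via the first partition under
        each node; an unpartitioned predicate is its own leaf."""
        parts = self.partitions.get(p)
        if not parts:
            return {p}
        leaves = set()
        for kid in parts[0]:
            leaves |= self.domain(kid)
        return leaves

    def is_disjoint(self, p, q):
        return not (self.domain(p) & self.domain(q))

# Partitions from the earlier if-then-else example:
g = PartitionGraph()
g.add_partition("p0", ["p_then", "p_else"])
g.add_partition("p_then", ["p", "q"])
g.add_partition("p_else", ["r", "s"])
```

With this graph, is_disjoint("p", "s") is true, which is exactly the fact the register allocator needs to avoid the spurious x/y interference edge.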
91extending to global
- The paper does predicate analysis on a procedure to
get better results. Input is a CFG.
- Modify our terms
- Trace: all executed instructions on an acyclic path
from the start node of the CFG to the end node.
- A trace belongs to the domain of a predicate p if
all instructions in the trace are executed when p
is true.
- Domain of a basic block: all traces in which the
basic block is executed. If BB4 can be reached
from BB2, then the domains of BB4 and BB2 are not
disjoint.
92How to deal with the domain of a basic block
- Assign a predicate to each basic block (not real,
just used to build the partition graph). Call this
predicate a control predicate.
- Real predicates that occur in the code are called
materialized predicates.
- Build a single partition graph that includes both
control predicates and materialized predicates.
93Building up the global partition graph: control
predicates first
- More terms...
- Critical edge: an edge whose source has more
than one successor and whose destination has more
than one predecessor (S2 -> S4 in the example).
- Since S4 does not post-dominate S2, S4's control
predicate cannot be a child of S2's control
predicate in the partition graph.
- To get around this, make a virtual node with a
virtual control predicate at each critical edge.
With this virtual predicate it is easier to
build the partition graph.
94Building up the global partition graph: control
predicates first (continued)
- Back edges in a CFG are a problem
- In a loop, p1 and p2 may be disjoint in the
context of one iteration but not over all
iterations.
- The paper does some waffling here: "ignore back
edges", "be conservative", "little impact",
"approximation", "less precise"
- They make more virtual nodes that act as hooks to
return conservative answers when the partition
graph is queried.
95Algorithm to build partition graph for control
predicates
- ConstructPartitionGraphForControlPredicates(CFG)
-   Create a virtual node on each critical edge or back edge
-   Find control-equivalent nodes
-   Assign a predicate to each set of control-equivalent nodes
-   for every node in the CFG
-     v = current node
-     p = control predicate assigned to v
-     if (number of successors > 1)
-       Create a partition with p as the parent predicate and
-       the predicates assigned to the successors as the child
-       predicates
-     if (number of predecessors > 1)
-       Create a partition with p as the parent predicate and
-       the predicates assigned to the predecessors as the child
-       predicates, if this partition has not been generated yet
-   if a non-start node u has no parent, create a partition with
-   the immediate dominator of u as the parent predicate, and
-   u and an implicit predicate as the child predicates
96Handling the Materialized predicates
- Straightforward: scan each basic block for cmpps.
Depending on the type of cmpp and whether it is
guarded or not, take some action to modify the
partition graph.
- There is an algorithm for this as well
97Materialized predicate algorithm..
- ConstructPartitionGraphForMaterializedPredicates(instruction stream, CFG)
-   for every compare instruction in the CFG
-     cinst = current compare instruction
-     qp = qualifying predicate of cinst
-     bp = control predicate assigned to the basic block
-          containing cinst
-     p1 = the first destination predicate
-     p1_old = the old definition of p1, if p1 is an update
-     p2 = the second destination predicate, if it exists
-     pp = parent predicate
-     if (qp == p0) pp = bp else pp = qp
-     switch (compare type)
-       case .un.uc:
-         Create a partition with pp as the parent predicate
-         and p1 and p2 as child predicates
-       case .cn.cc:
-         if (qp == p0), process in the same way as a
-         .un.uc case
98Example input for these algorithms
- p, q, r, s = false
- if (..) then
-   p, q = cmpp.un.uc(...) if true
-   x = .. if p
- else
-   r, s = cmpp.un.uc(...) if true
-   x = .. if r
-   y = .. if s
- endif
- .. = x if p
- .. = y if s
99Algorithms create the following partition graph
[Graph: p0 partitioned into p_then and p_else; p_then into
p and q; p_else into r and s]
100Using this global (procedure-level) partition
graph, they modified a register allocator.
- The register allocator used the coloring algorithm we
saw in class and operated on procedures
- They compiled several SPECint-92 benchmarks two
times:
- Once with their global predicate analysis and
once with local predicate analysis. They measured
the number of colors required by the register
allocator.
101Results
- Of 1009 procedures compiled, 248 procedures showed
improvement in the number of colors.
- Of those 248 procedures, the
average improvement was a 20% decrease in the
number of required colors.
102Why did most procedures not show any improvement?
- Blame the conservative if-converter.
- A more aggressive if-converter would generate more
cmpps, and the algorithm would have more
opportunities to improve.
- Even if code was if-converted, it may be so simple
that local predicate analysis is sufficient.