Title: Seminar Series
1. Seminar Series
- Static and Dynamic Compiler Optimizations (6/28)
- Speculative Compiler Optimizations (7/05)
- ADORE: An Adaptive Object Code ReOptimization System (7/19)
- Current Trends in CMP/CMT Processors (7/26)
- Static and Dynamic Helper Thread Prefetching (8/02)
- Dynamic Instrumentation/Translation (8/16)
- Virtual Machine Technologies and their Emerging Applications (8/23)
2. Professional Background
- BS and MS in Computer Engineering, NCTU
- Ph.D. in Computer Science, University of Wisconsin-Madison
- Cray Research, 1987-1993
  - Architect for Cray Y-MP, Cray C-90, FAST
  - Compiler optimization for Cray X-MP, Y-MP, Cray-2, Cray-3
- Hewlett-Packard, 1993-1999
  - Compiler technical lead for HP-7200, HP-8000, IA-64
  - Lab technical lead for adaptive systems
- University of Minnesota, 2000-now
  - ADORE/Itanium and ADORE/Sparc systems
- Sun Microsystems, 2004-2005
  - Visiting professor
3. Static and Dynamic Compiler Optimizations
4. Background
- Optimization
  - A process of making something as effective as possible
- Compiler
  - A computer program that translates programs written in high-level languages into machine instructions
- Compiler Optimization
  - The phases of compilation that generate good code, to make use of the target machine as efficiently as possible
5. Background (cont.)
- Static Optimization
  - Compile-time optimization: a one-time, fixed optimization that will not change after distribution
- Dynamic Optimization
  - Optimization performed at program execution time, adaptive to the execution environment
6. Some Examples
- Redundancy elimination
  - C = (A*B) + (A*B)  =>  t = A*B; C = t + t
- Register allocation
  - Keep frequently used data items in registers
- Instruction scheduling
  - To avoid pipeline bubbles
- Cache prefetching
  - To minimize cache miss penalties
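The redundancy-elimination bullet above can be sketched concretely. This is an illustrative Python sketch, not code from the slides; the function names and the choice of operands are assumptions.

```python
# Redundancy elimination: compute the repeated subexpression once
# and reuse the temporary (illustrative names, not from the slides).

def before(a, b):
    # (a * b) is computed twice
    return (a * b) + (a * b)

def after(a, b):
    t = a * b          # compute the common subexpression once
    return t + t       # reuse the temporary

# Both versions produce the same result
assert before(3, 4) == after(3, 4) == 24
```

The transformed form trades one multiply for a temporary, which is exactly the kind of local win a CSE pass looks for.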
7. How Important Is Compiler Optimization?
- In the last 15 years, computer performance has increased by 2000X
  - Clock rate increased by 100X
  - Micro-architecture contributed 5-10X
    - The number of transistors doubles every 18 months
  - Compiler optimization added 2-3X for single processors
8. Have you used compiler optimization lately?
9. Speedup from Compiler Optimization
(figure: speedup chart)
10. Speedup from Compiler Optimization
(figure: speedup chart)
12. Static Compilation System
(figure: C/C++/Fortran front ends feed a platform-neutral intermediate language (IL, IR); an IL-to-IL inter-procedural optimizer performs machine-independent optimizations; an optimizing backend performs machine-dependent optimizations, consuming profile-directed feedback gathered from sample input, and emits machine code)
13. Criteria for Optimizations
- Must preserve the meaning of programs
- Example:
    for (i = 0; i < N; i++) A[i] = b[i] + c/N;
  transformed (incorrectly) to:
    t1 = c/N;
    for (i = 0; i < N; i++) A[i] = b[i] + t1;
  What if N == 0? The original never evaluates c/N; the transformed code always does.
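The hazard in the hoisting example above can be demonstrated directly. This is a minimal Python sketch of the same transformation; the function names and the use of Python division as a stand-in for the faulting operation are assumptions.

```python
# Hoisting c/N out of the loop is unsafe when N == 0: the original
# program never evaluates c/N, but the "optimized" one always does.

def original(b, c, N):
    A = [0] * N
    for i in range(N):
        A[i] = b[i] + c / N   # c/N only evaluated when the loop runs
    return A

def hoisted(b, c, N):
    t1 = c / N                # unsafe: evaluated even when N == 0
    A = [0] * N
    for i in range(N):
        A[i] = b[i] + t1
    return A

original([], 1.0, 0)          # fine: the loop body never executes
try:
    hoisted([], 1.0, 0)
    raised = False
except ZeroDivisionError:     # the hoisted division faults
    raised = True
assert raised
```

This is why such code motion must be guarded (or proven safe) before it is applied: the transformation must be absolutely, not statistically, correct.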
14. Criteria for Optimizations (cont.)
- Example:
    if (C > 0) A = b[j] + d[j];
  transformed (incorrectly) to:
    t1 = b[j]; t2 = d[j];
    if (C > 0) A = t1 + t2;
  What if b[j] raises an exception when C < 0?
15. Basic Concepts
- Optimizations improve performance, but do not guarantee optimal performance
- Optimizations generally (or statistically) improve performance; they can also slow the code down
  - Examples: LICM, cache prefetching, procedure inlining
- Optimizations must be absolutely (not statistically!) correct (safe, or conservative)
- Some optimizations are more important in general-purpose compilers
  - Loop optimizations, register allocation, instruction scheduling
16. Optimization at Different Levels
- Local (within a basic block)
- Global (across basic blocks, but within a procedure)
- Inter-procedural
- Cross-module (link time)
- Post-link time (such as Spike/iSpike)
- Runtime (as in dynamic compilation)
17. Tradeoffs in Optimization
- Space vs. speed
  - Usually favors speed; however, on machines with small memory or I-cache, space is equally important
- Compile time vs. execution time
  - Usually favors execution time, but not necessarily true in recent years (e.g. JIT, large apps)
- Absolutely robust vs. statistically robust
  - Decrease the default optimization level in less important regions
- Complexity vs. efficiency
  - Select between complex but more efficient algorithms and simple but less efficient (easier to maintain) ones
18. Overview of Optimizations
- Early optimizations
  - Scalar replacement, constant folding
  - Local/global value numbering
  - Local/global copy propagation
- Redundancy elimination
  - Local/global CSE, PRE
  - LICM
  - Code hoisting
- Loop optimizations
  - Strength reduction
  - Induction variable removal
  - Unnecessary bounds-checking elimination
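Strength reduction, from the loop-optimization list above, can be sketched as follows. This is an illustrative Python sketch; the `i * 4` expression is an assumed stand-in for a typical array-address multiply, not an example from the slides.

```python
# Strength reduction: a multiply by the induction variable inside the
# loop is replaced by a cheaper additive update.

def before(n):
    out = []
    for i in range(n):
        out.append(i * 4)      # multiply on every iteration
    return out

def after(n):
    out = []
    t = 0                      # invariant: t == i * 4
    for i in range(n):
        out.append(t)
        t += 4                 # addition replaces the multiply
    return out

assert before(5) == after(5) == [0, 4, 8, 12, 16]
```

The compiler proves the invariant `t == i * 4` holds at the top of every iteration, which is what licenses the substitution.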
19. Overview of Optimizations (cont.)
- Procedure optimizations
  - Tail-recursion elimination, inline expansion, leaf-routine optimization, shrink wrapping, memoization
- Register allocation
  - Graph coloring
- Instruction scheduling
  - Local/global code scheduling
  - Software pipelining
  - Trace scheduling, superblock formation
20. Overview of Optimizations (cont.)
- Memory hierarchy optimizations
  - Loop blocking, loop interchange
  - Memory padding, cache prefetching, data re-layout
- Loop transformations
  - Reduction recognition, loop collapsing, loop reversal, strip mining, loop fusion, loop distribution
- Peephole optimizations
- Profile-guided optimizations
  - Code repositioning, I-cache prefetching, profile-guided inlining, register allocation, instruction scheduling, ...
21. Overview of Optimizations (cont.)
- More optimizations
  - SIMD transformation, VLIW transformation
  - Communication optimizations
  - (See David Bacon and Susan Graham's survey paper)
- Optimization evaluation
  - Is there a commonly accepted method?
    - User's choice
    - Benchmarks
      - Livermore loops (14 kernels from scientific code)
      - PERFECT club, SPLASH, NAS
      - SPEC
22. Importance of Individual Optimizations
- How much performance does an optimization contribute?
- Is the optimization commonplace?
  - Does it happen in one particular instance?
  - Does it happen in one particular program?
  - Does it happen for one particular type of application?
  - How much difference does it make?
  - Does it enable other optimizations?
    - e.g. procedure integration, unrolling
23. Ordering
- Ordering is important; dependences between optimizations exist
  - Procedure integration and loop unrolling usually enable other optimizations
  - Loop transformations should be done before address linearization
- There is no optimal ordering
  - Some optimizations should be applied multiple times (e.g. copy propagation, DCE)
  - Some recent research advocates exhaustive search with intelligent pruning
24. Example Organization
(figure: IR flows through Control Flow Analysis (builds the flow graph, identifies loops), then Data Flow Analysis (reaching definitions, def-use chains), then Transformations (global CSE, copy propagation, code motion))
25. Loops in Flow Graphs
- Dominators
  - A node d of a flow graph dominates node n, written d dom n, if every path from the initial node of the flow graph to n goes through d.
- Example (figure: a flow graph with nodes 1-7):
  - 1 dom all; 3 dom 4,5,6,7; 4 dom 5,6,7
26. Loops in Flow Graphs (cont.)
- Natural loops
  - A loop must have a single entry point, called the header. The header dominates all nodes in the loop.
  - There is at least one path back to the header.
- Backedge
  - An edge in the flow graph whose head dominates its tail; for example, edges 4 -> 3 and 7 -> 1.
27. Global Data Flow Analysis
- Provides global information about how a procedure manipulates its data
- Example (figure: a flow graph containing A = 3, B = A + 1, B = A, and C = A):
  - Can we propagate the constant 3 for A?
28. Data Flow Equations
- A typical data flow equation has the form
    out[S] = gen[S] ∪ (in[S] - kill[S])
  where
  - S is a statement
  - gen[S] is the set of definitions generated within S
  - kill[S] is the set of definitions killed as control flows through S
  - in[S] is the set of definitions live at the beginning of S
  - out[S] is the set of definitions available at the end of S
29. Reaching Definitions
- A definition d reaches a point p if there is a path from the point immediately following d to p, such that d is not killed along that path.
(figure: a flow graph of blocks B1-B6 with definitions d1: i = m-1, d2: j = n, d3: a = u1 in B1, and d4: i = i+1, d5: j = j-1, d6: a = u2 in later blocks; annotation: d1, d2, d5 reach B2; d5 kills d2, so d2 does not reach B3, B4, B5)
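The reaching-definitions facts above fall out of a standard worklist solution of the data-flow equations. This is an illustrative Python sketch; the exact block contents and edges are assumptions reconstructed to match the slide's highlighted facts.

```python
# Worklist solution of out[B] = gen[B] | (in[B] - kill[B]) on an
# assumed CFG: B1 -> B2 -> B3 -> B4 -> {B2, B5}, with d4 and d5 in B2.
defs = {'d1': 'i', 'd2': 'j', 'd3': 'a', 'd4': 'i', 'd5': 'j', 'd6': 'a'}
block_defs = {'B1': ['d1', 'd2', 'd3'], 'B2': ['d4', 'd5'],
              'B3': [], 'B4': [], 'B5': ['d6']}
succ = {'B1': ['B2'], 'B2': ['B3'], 'B3': ['B4'],
        'B4': ['B2', 'B5'], 'B5': []}
pred = {b: [p for p in succ if b in succ[p]] for b in succ}

gen = {b: set(ds) for b, ds in block_defs.items()}
# a block kills every other definition of the variables it defines
kill = {b: {d for d in defs if d not in ds
            and defs[d] in {defs[g] for g in ds}}
        for b, ds in block_defs.items()}

IN = {b: set() for b in succ}
OUT = {b: set() for b in succ}
changed = True
while changed:                        # iterate to a fixed point
    changed = False
    for b in succ:
        IN[b] = set().union(*(OUT[p] for p in pred[b]))
        new_out = gen[b] | (IN[b] - kill[b])
        if new_out != OUT[b]:
            OUT[b], changed = new_out, True

assert {'d1', 'd2', 'd5'} <= IN['B2']   # d1, d2, d5 reach B2
assert 'd2' not in IN['B3']             # d5 kills d2 before B3
```

Note that d5 reaches B2 only because of the backedge B4 -> B2, which is why the solver must iterate rather than make a single forward pass.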
30. Data Flow Equations for Reaching Definitions
- For a single definition S: d1: a = b + c
    gen[S] = {d1}
    kill[S] = all other definitions of a
    out[S] = gen[S] ∪ (in[S] - kill[S])
- For a branch S with arms S1 and S2:
    gen[S] = gen[S1] ∪ gen[S2]
    kill[S] = kill[S1] ∩ kill[S2]
    out[S] = out[S1] ∪ out[S2]
- For a loop S with body S1:
    gen[S] = gen[S1]
    kill[S] = kill[S1]
    in[S1] = in[S] ∪ gen[S1]
    out[S] = out[S1]
31. Transformation Example: LICM
- Loop Invariant Code Motion
  - A loop invariant is an instruction (a load or a computation) in a loop whose result is the same in every iteration.
  - Once we have identified loops and tracked the locations at which operand values are defined (i.e. reaching definitions), we can recognize an instruction as loop invariant if each of its operands
    1) is a constant,
    2) has reaching definitions that all lie outside the loop, or
    3) has a single reaching definition that is itself a loop invariant.
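The three rules above can be applied iteratively until no more instructions are marked invariant. This is an illustrative Python sketch; the toy three-address instructions and the precomputed reaching-definition sets are assumptions, not the slides' example.

```python
# Loop-invariant recognition following the slide's three rules.
# 'outside' marks a reaching definition from outside the loop.
loop = ['t1 = c * 2',      # c defined outside the loop (rule 2)
        't2 = t1 + 1',     # sole reaching def of t1 is invariant (rule 3)
        't3 = t2 + i']     # i is redefined inside the loop: not invariant

reaching = {('t1 = c * 2', 'c'): {'outside'},
            ('t2 = t1 + 1', 't1'): {'t1 = c * 2'},
            ('t3 = t2 + i', 't2'): {'t2 = t1 + 1'},
            ('t3 = t2 + i', 'i'): {'i = i + 1'}}   # def inside the loop

def operands(instr):
    # crude parse of the toy three-address form; drops constants (rule 1)
    rhs = instr.split('=', 1)[1]
    return [tok for tok in rhs.replace('+', ' ').replace('*', ' ').split()
            if not tok.isdigit()]

invariant = set()
changed = True
while changed:                 # iterate: rule 3 depends on earlier marks
    changed = False
    for ins in loop:
        if ins in invariant:
            continue
        ok = True
        for op in operands(ins):
            rds = reaching[(ins, op)]
            if rds == {'outside'}:                       # rule 2
                continue
            if len(rds) == 1 and next(iter(rds)) in invariant:  # rule 3
                continue
            ok = False
        if ok:
            invariant.add(ins)
            changed = True

assert invariant == {'t1 = c * 2', 't2 = t1 + 1'}
```

The iteration is what lets rule 3 chain: `t2` becomes invariant only after `t1` has been marked.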
32. Static Compilers
- The traditional compilation model for C, C++, Fortran, ...
- Extremely mature technology
  - The static design point allows extremely deep and accurate analyses supporting sophisticated program transformation for performance.
- The ABI enables a useful level of language interoperability
- But...
33. Static Compilation: the Downsides
- CPU designers are restricted by the requirement to deliver increasing performance to applications that will not be recompiled
  - Slows down the uptake of new ISA and micro-architectural features
  - Constrains the evolution of CPU design by discouraging radical changes
- The model for applying feedback information from application profiles to the optimization and code generation components is awkward and not widely adopted, diluting the performance achieved on the system
34. Static Compilation: the Downsides (cont.)
- Largely unable to satisfy our increasing desire to exploit dynamic traits of the application
  - Even link time is too early to catch some high-value opportunities for performance improvement
  - Whole classes of speculative optimizations are infeasible without heroic efforts
35. Tyranny of the Dusty Deck
- Binary compatibility is one of the crowning achievements of the early computer years... but
- It does (or at least should) make CPU architects think very carefully about adding anything new, because
  - You can almost never get rid of anything you add
  - It takes a long time to find out for sure whether anything you add is a good idea or not
36. Profile-Directed Feedback (PDF)
- A two-step optimization process
  - The first pass instruments the generated code to collect statistics about the program execution
    - The developer exercises this program with common inputs to collect representative data
    - The program may be executed multiple times to reflect a variety of common inputs
  - The second pass re-optimizes the program based on the profile data collected
- Also called Profile-Guided Optimization (PGO) or Profile-Based Optimization (PBO)
37. Data Collected by PDF
- Basic block execution counters
  - How many times each basic block in the program is reached
  - Used to derive branch and call frequencies
- Value profiling
  - Collects a histogram of values for a particular attribute of the program
  - Used for specialization
38. Other PDF Opportunities
- Path profile
- Alias profile
- Cache miss profile
  - I-cache misses
  - D-cache misses
  - Miss types
  - ITLB/DTLB misses
- Speculation failure profile
- Event correlation profile
39. Optimizations Affected by PDF
- Inlining
  - Uses call frequencies to prioritize inlining sites
- Function partitioning
  - Groups the program into cliques of routines with high call affinity
- Speculation
  - Control-speculative execution, data-speculative execution, and value-speculation-based optimizations
- Predication
- Code layout
- Superblock formation
40. Optimizations Triggered by PDF (in the IBM compiler)
- Specialization triggered by value profiling
  - Arithmetic ops, built-in function calls, pointer calls
- Extended basic block creation
  - Organizes code so that branches frequently fall through
- Specialized linkage conventions
  - Treats all registers as non-volatile for infrequent calls
- Branch hinting
  - Sets branch-prediction hints available in the ISA
- Dynamic memory reorganization
  - Groups frequently accessed heap storage
41. Impact of PDF on SPECint 2000
(figure: results on a POWER4 system running AIX using the latest IBM compilers, at the highest available optimization level, -O5)
42. Sounds great... what's the problem?
- Only the die-hard performance types use it (e.g. HPC, middleware)
- It's tricky to get right: you only want to train the system to recognize things that are characteristic of the application, and somehow ignore artifacts of the input set
- In the end it's still static, and runtime checks and multiple versions can only take you so far
- It undermines the usefulness of benchmark results as a predictor of application performance when upgrading hardware
- In summary: a usability issue for developers that shows no sign of going away anytime soon
43. Dynamic Compilation System
(figure: class files and jars feed the Java Virtual Machine, whose JIT compiler produces machine code)
44. JVM Evolution
- The first generation of JVMs was entirely interpreted. Pure interpretation is good for proof of concept, but too slow for executing real code.
- Second-generation JVMs used JIT (just-in-time) compilers to convert bytecodes into machine code before execution, in a lazy fashion.
- HotSpot is the third-generation technology. It combines interpretation, profiling, and dynamic compilation, and compiles only the frequently executed code. It also comes with two compilers: the server compiler (optimized for speed) and the client compiler (optimized for start-up time and memory footprint).
- Newer dynamic compilation techniques for JVMs include CPO (Continuous Program Optimization), or continuous recompilation, and OSR (On-Stack Replacement), which can switch code from interpretation to a compiled version while it runs.
45. Dynamic Compilation
- The traditional model for languages like Java
- Rapidly maturing technology
  - Exploits the current invocation's behaviour on the exact CPU model
  - Recompilation and other dynamic techniques enable aggressive speculation
  - Profile feedback to the optimizer is performed online (transparently to the user/application)
  - The compile-time budget is concentrated on the hottest code, with the most (perceived) opportunities
- But...
46. Dynamic Compilation: the Downsides
- Some important analyses are not affordable at runtime, even if applied only to the hottest code (array data flow, global scheduling, dependency analysis, loop transformations, ...)
- Non-determinism in the compilation system can be problematic
  - For some users, it severely challenges their notions of quality assurance
  - It requires new approaches to RAS and to getting reproducible defects for the compiler service team
- It introduces a very complicated code base into each and every application
- The compile-time budget is concentrated on the hottest code, and not on other code, which in aggregate may be as important a contributor to performance
  - What do you do when there's no hot code?
47. The Best of Both Worlds
(figure: C/C++/F90 front ends and Java/.NET class and jar files feed a portable high-level optimizer over bytecode, MIL, etc.; a common backend produces static machine code using profile-directed feedback (PDF), while a JIT with continuous program optimization (CPO) produces dynamic machine code, with binary translation bridging the two)
48. More Boxes, but Is It Better?
- If ubiquitous, this could enable a new era in CPU architectural innovation by reducing the load of the dusty-deck millstone
  - Deprecated ISA features supported via binary translation or recompilation from an IL-fattened binary
  - No latency in seeing the value of a new ISA feature
  - New-feature mistakes become relatively painless to undo
49. There's More...
- Transparently bring the benefits of dynamic optimization to traditionally static languages, while still leveraging the power of static analysis and language-specific semantic information
- All of the advantages of dynamic profile-directed feedback (PDF) optimizations, with none of the static PDF drawbacks
  - No extra build step
  - No input artifacts skewing specialization choices
  - Code specialized to each invocation, on the exact processor model
- More aggressive speculative optimizations
  - Recompilation as a recovery option
- Static analyses inform value-profiling choices
  - A new static-analysis goal: identifying the inhibitors to optimization for later dynamic testing and specialization
50. Summary
- A crossover point has been reached between dynamic and static compilation technologies.
- They need to be converged/combined to overcome their individual weaknesses.
- Hardware designers struggle under the mounting burden of maintaining high-performance backwards compatibility.