Seminar Series - PowerPoint PPT Presentation

About This Presentation
Title:

Seminar Series

Description:

Title: Compiler Enhancements For Productivity & Performance
Author: chng1me
Last modified by: hsu
Created: 7/19/2005 8:09:40 PM


Transcript and Presenter's Notes



1
Seminar Series
  • Static and Dynamic Compiler Optimizations
    (6/28).
  • Speculative Compiler Optimizations (7/05)
  • ADORE An Adaptive Object Code ReOptimization
    System (7/19)
  • Current Trends in CMP/CMT Processors (7/26)
  • Static and Dynamic Helper Thread Prefetching
    (8/02)
  • Dynamic Instrumentation/Translation (8/16)
  • Virtual Machine Technologies and their Emerging
    Applications (8/23)

2
Professional Background
  • CE BS and CE MS, NCTU
  • CS Ph.D. University of Wisconsin, Madison
  • Cray Research, 1987-1993
  • Architect for Cray Y-MP, Cray C-90, FAST
  • Compiler optimization for Cray X-MP, Y-MP, Cray-2,
    and Cray-3
  • Hewlett-Packard, 1993-1999
  • Compiler technical lead for HP-7200, HP-8000,
    IA-64
  • Lab technical lead for adaptive systems
  • University of Minnesota, 2000-now
  • ADORE/Itanium and ADORE/Sparc systems
  • Sun Microsystems, 2004-2005
  • Visiting professor

3
Static and Dynamic Compiler Optimizations
  • Wei-Chung Hsu
  • 6/28/2006

4
Background
  • Optimization
  • A process of making something as effective as
    possible
  • Compiler
  • A computer program that translates programs
    written in high-level languages into machine
    instructions
  • Compiler Optimization
  • The phases of compilation that generate good
    code, making as efficient use of the target
    machine as possible.

5
Background (cont.)
  • Static Optimization
  • compile-time optimization: a one-time, fixed
    optimization that will not change after
    distribution.
  • Dynamic Optimization
  • optimization performed at program execution time,
    adaptive to the execution environment.

6
Some Examples
  • Redundancy elimination
  • C = (A*B) + (A*B)  →  t = A*B; C = t + t
  • Register allocation
  • keep frequently used data items in registers
  • Instruction scheduling
  • to avoid pipeline bubbles
  • Cache prefetching
  • to minimize cache miss penalties
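The redundancy-elimination example above can be made concrete; a minimal Python sketch of the slide's C = (A*B) + (A*B) transformation (function names are illustrative):

```python
def redundant(a, b):
    # before: (A*B) + (A*B) computes the product twice
    return (a * b) + (a * b)

def optimized(a, b):
    # after: t = A*B; C = t + t computes it once
    t = a * b
    return t + t
```

Both versions produce identical results for all inputs; the optimized one simply does less work, which is what makes the transformation safe.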

7
How Important Is Compiler Optimization?
  • In the last 15 years, computer performance
    has increased by roughly 2000 times.
  • Clock rate increased by 100 X
  • Micro-architecture contributed 5-10X
  • the number of transistors doubles every 18
    months.
  • Compiler optimization added 2-3X for single
    processors

8
  • Have you used compiler optimization lately?

9
Speedup from Compiler Optimization
10
Speedup from Compiler Optimization
11
(No Transcript)
12
Static compilation system
  • C, C++, and Fortran front ends translate source into a
    platform-neutral Intermediate Language (IL, IR)
  • IL-to-IL inter-procedural optimizer
  • Machine-independent optimizations
  • Optimizing backend
  • Machine-dependent optimizations
  • Profile-Directed Feedback from a sample input feeds the
    optimizers
  • Output machine code
13
Criteria for optimizations
  • Must preserve the meaning of programs
  • Example

Transformed:  T1 = c/N;
              for (I = 0; I < N; I++) A[I] = b[I] + T1;
Original:     for (I = 0; I < N; I++) A[I] = b[I] + c/N;
✗  What if N == 0? The original loop never evaluates c/N,
   but the transformed code always does.
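The hazard of hoisting c/N out of a loop that may execute zero times can be reproduced in a few lines; a minimal Python sketch (function names are illustrative, with division standing in for the c/N computation):

```python
def original_loop(b, c, N):
    # c/N is evaluated only inside the loop body; with N == 0 the
    # body never runs and no division occurs
    return [b[i] + c / N for i in range(N)]

def hoisted_loop(b, c, N):
    # unsafe: hoisting c/N out of the loop evaluates it even when N == 0
    t1 = c / N
    return [b[i] + t1 for i in range(N)]
```

With N == 0 the original version returns an empty result, while the hoisted version traps on the division, so the transformation does not preserve the meaning of the program.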
14
  • Example

Original:     if (C > 0) A = b[j] + d[j];
Transformed:  T1 = b[j]; T2 = d[j];
              if (C > 0) A = T1 + T2;
✗  What if loading b[j] raises an exception whenever C < 0?
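The same unsafe hoisting can be sketched in Python, with an out-of-bounds index standing in for a faulting load (names hypothetical):

```python
def guarded(C, b, d, j):
    # the loads of b[j] and d[j] happen only when the guard holds
    if C > 0:
        return b[j] + d[j]
    return None

def speculated(C, b, d, j):
    # unsafe: both loads execute even when C <= 0, so a faulting
    # b[j] now faults unconditionally
    t1 = b[j]
    t2 = d[j]
    if C > 0:
        return t1 + t2
    return None
```

Speculative compiler optimizations of this kind need hardware or software support to defer or recover from the spurious fault.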
15
  • Basic Concepts
  • Optimizations improve performance, but do not
    give optimal performance
  • Optimizations generally (or statistically)
    improve performance. They could also slow down
    the code.
  • Examples: LICM, cache prefetching, procedure
    inlining
  • Must be absolutely (not statistically!) correct
    (safe or conservative)
  • Some optimizations are more important in general
    purpose compilers
  • Loop optimizations, register allocation,
    instruction scheduling

16
Optimization at different levels
  • Local (within a basic block)
  • Global (cross basic blocks but within a
    procedure)
  • Inter-procedural
  • Cross module (link time)
  • Post-link time (such as Spike/iSpike)
  • Runtime (as in dynamic compilation)

17
Tradeoff in Optimizations
  • Space vs. Speed
  • Usually favors speed. However, on machines with
    small memory or I-cache, space is equally
    important
  • Compile time vs. Execution Time
  • Usually favors execution time, but not necessary
    true in recent years. (e.g. JIT, large apps)
  • Absolutely robust vs. statistically robust
  • Decrease default optimization level at less
    important regions.
  • Complexity vs. Efficiency
  • Select between complex but more efficient and
    simple but less efficient (easier to maintain)
    algorithms.

18
Overview of Optimizations
  • Early Optimizations
  • scalar replacement, constant folding
  • local/global value numbering
  • local/global copy propagation
  • Redundancy Elimination
  • local/global CSE, PRE
  • LICM
  • code hoisting
  • Loop Optimizations
  • strength reduction
  • induction variable removal
  • unnecessary bound checking elimination

19
Overview of Optimizations
  • Procedure Optimizations
  • tail-recursion elimination, in-line expansion,
    leaf-routine optimization, shrink wrapping,
    memoization
  • Register Allocation
  • graph coloring
  • Instruction Scheduling
  • local/global code scheduling
  • software pipelining
  • trace scheduling, superblock formation

20
Overview of Optimizations
  • Memory Hierarchy Optimizations
  • loop blocking, loop interchange
  • memory padding, cache prefetching, data
    re-layout
  • Loop Transformations
  • reduction recognition, loop collapsing, loop
    reversal, strip mining, loop fusion, loop
    distribution
  • Peephole Optimizations
  • Profile Guided Optimizations
  • Code re-positioning, I-cache prefetching,
    profile-guided in-lining, register allocation,
    instruction scheduling, 

21
Overview of Optimizations
  • More Optimizations
  • SIMD Transformation, VLIW Transformation
  • Communication Optimizations
  • (See David Bacon and Susan Graham's survey
    paper)
  • Optimization Evaluation
  • Is there a commonly accepted method?
  • user's choice
  • benchmarks
  • Livermore loops (14 kernels from scientific code)
  • PERFECT club, SPLASH, NAS
  • SPEC

22
Importance of Individual Opt.
  • How much performance does an optimization
    contribute?
  • Is this optimization commonplace?
  • does it happen in one particular instance?
  • does it happen in one particular program?
  • does it happen for one particular type of app?
  • how much difference does it make?
  • does it enable other optimizations
  • procedure integration, unrolling

23
Ordering
  • Ordering is important, some dependences between
    optimizations exist
  • Procedure integration and loop unrolling usually
    enable other optimizations
  • Loop transformations should be done before
    address linearization.
  • No optimal ordering
  • Some optimizations should be applied multiple
    times (e.g. copy propagation, DCE)
  • Some recent research advocates exhaustive search
    with intelligent pruning

24
Example Organization
IR → Control Flow Analysis → Data Flow Analysis → Transformations
  • Control flow analysis builds the flow graph and identifies loops
  • Data flow analysis computes reaching definitions and def-use chains
  • Transformations: global CSE, copy propagation, code motion
25
Loops in Flow Graph
  • Dominators
  • A node d of a flow graph dominates node n, written
    d dom n, if every path from the initial node of the
    flow graph to n goes through d.
  • Example

(flow-graph figure with nodes 1-7)
1 dom all; 3 dom 4,5,6,7; 4 dom 5,6,7
26
Loops in Flow Graph (cont.)
  • Natural loops
  • A loop must have a single entry point, called the
    header. It dominates all nodes in the loop.
  • At least one path back to the header.
  • Backedge
  • An edge in the flow graph whose head dominates
    its tail. For example,
  • edges 4 → 3 and 7 → 1
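The dominator sets behind these backedges can be computed with the standard iterative algorithm; a minimal Python sketch over a hypothetical seven-node flow graph consistent with the slide's example (backedges 4→3 and 7→1):

```python
def dominators(succ, entry):
    """Iteratively compute dom(n) for every node of a flow graph.

    succ maps each node to its successor list; entry is the initial node.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates n"
    dom[entry] = {entry}
    changed = True
    while changed:                          # iterate to a fixpoint
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# hypothetical CFG with backedges 4 -> 3 and 7 -> 1
cfg = {1: [2, 3], 2: [3], 3: [4], 4: [3, 5, 6], 5: [7], 6: [7], 7: [1]}
```

On this graph the result matches the slide: 1 dominates every node, 3 dominates 4,5,6,7, and 4 dominates 5,6,7, which is what makes 4→3 and 7→1 backedges.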

27
Global Data Flow Analysis
  • To provide global information about how a
    procedure manipulates its data.
  • Example

A = 3
B = A + 1
B = A
C = A
Can we propagate the constant 3 for A?
28
Data Flow Equations
  • A typical data flow equation has the form
  • Out[S] = Gen[S] ∪ (In[S] − Kill[S])
  • S means a statement
  • Gen[S] means definitions generated within S
  • Kill[S] means definitions killed as control flows
    through S
  • In[S] means definitions live at the beginning of S
  • Out[S] means definitions available at the end of S

29
Reaching Definitions
  • A definition d reaches a point p, if there is a
    path from the point immediately following d to p,
    such that d is not killed along that path.

(flow-graph figure: B1 contains d1: I = m-1, d2: j = n, d3: a = u1;
 B3 contains d4: I = I+1; B4 contains d5: j = j-1; B5 contains
 d6: a = u2; B2 and B6 contain no definitions)
d1, d2, d5 reach B2; d5 kills d2, so d2 does not reach B3, B4, B5
30
Data Flow Equation forReaching Definition
genS d1 killS all def of a outS
genS U (inS killS)
S
d1 a bc
genS genS1 U genS2 killS killS1 I
killS2 outS outS1 U outS2
S
S1
S2
genS genS1 killS killS1 InS1
inS U genS1 outS outS1
S
S1
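These equations can be solved by iterating to a fixpoint over a flow graph; a minimal Python sketch on a small hypothetical CFG (block and definition names are made up for illustration):

```python
def reaching_definitions(blocks, succ):
    """Solve Out[B] = Gen[B] | (In[B] - Kill[B]) to a fixpoint.

    blocks: dict block -> ordered list of (def_id, variable) pairs.
    succ:   dict block -> list of successor blocks.
    """
    defs_of = {}                      # variable -> all def_ids writing it
    for ds in blocks.values():
        for d, v in ds:
            defs_of.setdefault(v, set()).add(d)
    gen, kill = {}, {}
    for b, ds in blocks.items():
        g, k = set(), set()
        for d, v in ds:
            g -= defs_of[v]           # a later def of v kills earlier ones
            g.add(d)
            k |= defs_of[v] - {d}
        gen[b], kill[b] = g, k
    pred = {b: [] for b in blocks}
    for b, ss in succ.items():
        for s in ss:
            pred[s].append(b)
    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new_in = set().union(*(OUT[p] for p in pred[b])) if pred[b] else set()
            new_out = gen[b] | (new_in - kill[b])
            if (new_in, new_out) != (IN[b], OUT[b]):
                IN[b], OUT[b], changed = new_in, new_out, True
    return IN, OUT

# hypothetical CFG: B1 -> B2 -> {B3, B4}, with a backedge B3 -> B2
blocks = {"B1": [("d1", "i"), ("d2", "j")],
          "B2": [("d3", "j")],
          "B3": [("d4", "i")],
          "B4": []}
succ = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
```

Because of the backedge, d4 flows back into B2 on the second iteration, which is exactly the behaviour the fixpoint loop exists to capture.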
31
Transformation example LICM
  • Loop Invariant Code Motion
  • A loop invariant is an instruction (a load or a
    calculation) in a loop whose result is always the
    same in every iteration.
  • Once we have identified loops and tracked the
    locations at which operand values are defined
    (i.e., reaching definitions), we can recognize a
    loop invariant if each of its operands
  • 1) is a constant,
  • 2) has reaching definitions that all lie outside
    the loop, or
  • 3) has a single reaching definition that itself
    is a loop invariant.
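The three rules can be turned into a small worklist-style check; a Python sketch under the simplifying (hypothetical) assumption that each destination is defined exactly once in the loop:

```python
def loop_invariants(loop_instrs, defined_outside):
    """Mark instructions loop-invariant per the three rules above.

    loop_instrs: list of (dest, operands); each operand is an int
    constant or a variable name. defined_outside: variables whose
    reaching definitions all lie outside the loop.
    """
    invariant = set()
    changed = True
    while changed:                    # iterate until no new invariants appear
        changed = False
        for dest, ops in loop_instrs:
            if dest in invariant:
                continue
            if all(isinstance(op, int)          # rule 1: constant
                   or op in defined_outside     # rule 2: defined outside the loop
                   or op in invariant           # rule 3: invariant single def
                   for op in ops):
                invariant.add(dest)
                changed = True
    return invariant

# hypothetical loop body: t1 = n + 10; t2 = t1 * c; x = x + 1
body = [("t1", ["n", 10]), ("t2", ["t1", "c"]), ("x", ["x", 1])]
```

Note how t2 is found invariant only because t1 was found invariant first, which is why the check must iterate.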

32
Static Compilers
  • Traditional compilation model for C, C++,
    Fortran, 
  • Extremely mature technology
  • Static design point allows for extremely deep and
    accurate analyses supporting sophisticated
    program transformation for performance.
  • ABI enables a useful level of language
    interoperability
  • But

33
Static compilationthe downsides
  • CPU designers restricted by requirement to
    deliver increasing performance to applications
    that will not be recompiled
  • Slows down the uptake of new ISA and
    micro-architectural features
  • Constrains the evolution of CPU design by
    discouraging radical changes
  • The model for applying feedback information from
    application profiles to optimization and code
    generation is awkward and not widely adopted,
    diluting the performance achieved on the system

34
Static compilationthe downsides
  • Largely unable to satisfy our increasing desire
    to exploit dynamic traits of the application
  • Even link-time is too early to be able to catch
    some high-value opportunities for performance
    improvement
  • Whole classes of speculative optimizations are
    infeasible without heroic efforts

35
Tyranny of the Dusty Deck
  • Binary compatibility is one of the crowning
    achievements of the early computer years, but
  • It does (or at least should) make CPU architects
    think very carefully about adding anything new
    because
  • you can almost never get rid of anything you add
  • it takes a long time to find out for sure whether
    anything you add is a good idea or not

36
Profile-Directed Feedback (PDF)
  • Two-step optimization process
  • First pass instruments the generated code to
    collect statistics about the program execution
  • Developer exercises this program with common
    inputs to collect representative data
  • Program may be executed multiple times to reflect
    variety of common inputs
  • Second pass re-optimizes the program based on the
    profile data collected
  • Also called Profile-Guided Optimization (PGO) or
    Profile-Based Optimization (PBO)

37
Data collected by PDF
  • Basic block execution counters
  • How many times each basic block in the program is
    reached
  • Used to derive branch and call frequencies
  • Value profiling
  • Collects a histogram of values for a particular
    attribute of the program
  • Used for specialization
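A toy illustration of the first PDF pass in Python: instrumentation bumps a counter per basic block, and branch frequencies are then derived from the counts (the routine and block names are made up):

```python
from collections import Counter

bb_counts = Counter()          # basic-block execution counters

def abs_sum(xs):
    """Instrumented routine: each block increments its own counter."""
    bb_counts["entry"] += 1
    total = 0
    for x in xs:
        bb_counts["loop"] += 1
        if x < 0:
            bb_counts["then"] += 1     # branch taken
            total -= x
        else:
            bb_counts["else"] += 1     # branch not taken
            total += x
    return total

def branch_frequency():
    # derive the taken-branch frequency from the block counters
    return bb_counts["then"] / bb_counts["loop"]
```

In a real PDF build the compiler inserts these counters automatically, the developer runs the program on representative inputs, and the second compilation pass reads the counts back.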

38
Other PDF Opportunities
  • Path Profile
  • Alias Profile
  • Cache Miss Profile
  • I-cache miss
  • D-cache miss
  • Miss types
  • ITLB/DTLB misses
  • Speculation Failure Profile
  • Event Correlation Profile

39
Optimizations affected by PDF
  • Inlining
  • Uses call frequencies to prioritize inlining
    sites
  • Function partitioning
  • Groups the program into cliques of routines with
    high call affinity
  • Speculation
  • Control speculative execution, data speculative
    execution and value speculation based
    optimizations.
  • Predication
  • Code Layout
  • Superblock formation

40
Optimizations triggered by PDF (in the IBM compiler)
  • Specialization triggered by value profiling
  • Arithmetic ops, built-in function calls, pointer
    calls
  • Extended basic block creation
  • Organizes code to frequently fall-through on
    branches
  • Specialized linkage conventions
  • Treats all registers as non-volatile for
    infrequent calls
  • Branch hinting
  • Sets branch-prediction hints available on the ISA
  • Dynamic memory reorganization
  • Groups frequently accessed heap storage

41
Impact of PDF on SpecInt 2000
On a POWER4 system running AIX using the latest IBM
compilers, at the highest available optimization
level (-O5)
42
Sounds great, so what's the problem?
  • Only the die-hard performance types use it (e.g.,
    HPC, middleware)
  • It's tricky to get right: you only want to train
    the system to recognize things that are
    characteristic of the application and somehow
    ignore artifacts of the input set
  • In the end, it's still static; runtime checks
    and multiple versions can only take you so far
  • Undermines the usefulness of benchmark results as
    a predictor of application performance when
    upgrading hardware
  • In summary, it's a usability issue for developers
    that shows no sign of going away anytime soon

43
Dynamic Compilation System
class / jar files → Java Virtual Machine → JIT Compiler → Machine Code
44
JVM Evolution
  • First-generation JVMs were entirely
    interpreted. Pure interpretation is good for
    proof-of-concept, but too slow for executing real
    code.
  • Second-generation JVMs used JIT (just-in-time)
    compilers to convert bytecodes into machine code
    before execution, in a lazy fashion.
  • HotSpot is the third-generation technology. It
    combines interpretation, profiling, and dynamic
    compilation, compiling only the frequently
    executed code. It also comes with two compilers:
    a server compiler (optimized for speed) and a
    client compiler (optimized for start-up time and
    memory footprint).
  • Newer dynamic compilation techniques for JVMs
    include CPO (Continuous Program Optimization), or
    continuous recompilation, and OSR
    (On-Stack Replacement), which can switch code
    from interpreted mode to a compiled version.

45
Dynamic Compilation
  • Traditional model for languages like Java
  • Rapidly maturing technology
  • Exploits the current invocation's behaviour on
    the exact CPU model
  • Recompilation and other dynamic techniques enable
    aggressive speculations
  • Profile feedback to optimizer is performed online
    (transparent to user/application)
  • Compile time budget is concentrated on hottest
    code with the most (perceived) opportunities
  • But

46
Dynamic compilationthe downsides
  • Some important analyses not affordable at runtime
    even if applied only to the hottest code (array
    data flow, global scheduling, dependency
    analysis, loop transformations, )
  • Non-determinism in the compilation system can be
    problematic
  • For some users, it severely challenges their
    notions of quality assurance
  • Requires new approaches to RAS and to getting
    reproducible defects for the compiler service
    team
  • Introduces a very complicated code base into each
    and every application
  • Compile time budget is concentrated on hottest
    code with the most (perceived) opportunities and
    not on other code, which in aggregate may be as
    important a contributor to performance
  • What do you do when there's no hot code?

47
The best of both worlds
  • C, C++, and F90 front ends, plus class/jar files
    (Java / .NET), feed a common representation
    (bytecode, MIL, etc.)
  • A portable high-level optimizer, driven by CPO
  • A common backend emits either
  • dynamic machine code via a JIT, with binary
    translation, or
  • static machine code, with Profile-Directed
    Feedback (PDF)
48
More boxes, but is it better?
  • If ubiquitous, could enable a new era in CPU
    architectural innovation by reducing the load of
    the dusty deck millstone
  • Deprecated ISA features supported via binary
    translation or recompilation from IL-fattened
    binary
  • No latency effect in seeing the value of a new
    ISA feature
  • New feature mistakes become relatively painless
    to undo

49
There's more
  • Transparently bring the benefits of dynamic
    optimization to traditionally static languages
    while still leveraging the power of static
    analysis and language-specific semantic
    information
  • All of the advantages of dynamic profile-directed
    feedback (PDF) optimizations with none of the
    static PDF drawbacks
  • No extra build step
  • No input artifacts skewing specialization choices
  • Code specialized to each invocation on exact
    processor model
  • More aggressive speculative optimizations
  • Recompilation as a recovery option
  • Static analyses inform value profiling choices
  • New static analysis goal of identifying the
    inhibitors to optimizations for later dynamic
    testing and specialization

50
Summary
  • A crossover point has been reached between
    dynamic and static compilation technologies.
  • They need to be converged/combined to overcome
    their individual weaknesses
  • Hardware designers struggle under the mounting
    burden of maintaining high-performance backwards
    compatibility