Title: CGO 2006: The Fourth International Symposium on Code Generation and Optimization, New York, March 26-29, 2006
1. CGO 2006: The Fourth International Symposium on Code Generation and Optimization, New York, March 26-29, 2006
- Conference Review
- Presented by Ivan Matosevic
2. Outline
- Conference overview
- Brief summaries of sessions
- Keynote speeches
- Best paper
3. Conference Overview
- Primary focus: back-end compilation techniques
  - Static analysis and optimization
  - Profiling
  - Run-time techniques
- 8 sessions, 29 papers
- Dominating topics: multicores, dynamic compilation
4. Overview of Sessions
- Dynamic Optimization
- Object-Oriented Code Generation and Optimization
- Phase Detection and Profiling
- Tiled and Multicore Compilation
- Static Code Generation and Optimization Issues
- SIMD Compilation
- Optimization Space Exploration
- Security and Reliability
5. Session 1: Dynamic Optimization
- Kim Hazelwood (University of Virginia), Robert Cohn (Intel), "A Cross-Architectural Interface for Code Cache Manipulation"
  - Pin dynamic instrumentation system with code cache
  - The paper describes an API for various operations with the code cache (callbacks, lookups, statistics, etc.)
- Derek Bruening, Vladimir Kiriansky, Tim Garnett, Sanjeev Banerji (Determina Corporation), "Thread-Shared Software Code Caches"
  - Problem: sharing a code cache across multiple threads
  - Authors propose a fine-grained locking scheme
  - Evaluation using DynamoRIO
6. Session 1: Dynamic Optimization
- Keith Cooper, Anshuman Dasgupta (Rice Univ.), "Tailoring Graph-coloring Register Allocation For Runtime Compilation"
  - Problem: register allocation in JIT compilers
  - Authors propose a novel lightweight graph-colouring technique
- Weifeng Zhang, Brad Calder, Dean Tullsen (UC San Diego), "A Self Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework"
  - Extension of the Trident event-driven dynamic optimization framework (previously proposed by the same authors)
  - Dynamic insertion of prefetching instructions based on run-time analysis
7. Session 2: Object-Oriented Code Generation and Optimization
- Suresh Srinivas, Yun Wang, Miaobo Chen, Qi Zhang, Eric Lin, Valery Ushakov, Yoav Zach, Shalom Goldenberg (Intel Corporation), "Java JNI Bridge: An MRTE Framework for Mixed Native ISA Execution"
  - Use a dynamic translator to execute native calls for one ISA on a different ISA's Java platform
- Kris Venstermans, Lieven Eeckhout, Koen De Bosschere (Ghent University), "Space-Efficient 64-bit Java Objects through Selective Typed Virtual Addressing"
  - Use address bits on a 64-bit architecture to encode object type in order to save memory
  - Objects of the same type are allocated in a contiguous (virtual) region
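
The typed virtual addressing idea can be illustrated with a few lines of bit arithmetic. This is only a sketch: the 16-bit type field, the constant names, and the helper functions are assumptions made for illustration, not the layout used in the paper.

```python
# Hypothetical layout: the top TYPE_BITS bits of a 64-bit virtual address
# hold a type ID, so each type owns one contiguous virtual region.
TYPE_BITS = 16
OFFSET_BITS = 64 - TYPE_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def region_base(type_id: int) -> int:
    """Start of the contiguous virtual region reserved for one type."""
    if not 0 <= type_id < (1 << TYPE_BITS):
        raise ValueError("type_id out of range")
    return type_id << OFFSET_BITS


def make_address(type_id: int, offset: int) -> int:
    """Address of an object placed at `offset` inside its type's region."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return region_base(type_id) | offset


def type_of(address: int) -> int:
    """Recover the type ID from the address alone, without a header field."""
    return address >> OFFSET_BITS


addr = make_address(type_id=42, offset=0x1000)
```

Because the type can be read back from the address, an object in such a scheme would not need to store a separate type word, which is where the memory saving would come from.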
8. Session 2: Object-Oriented Code Generation and Optimization
- Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan (IBM Canada), "Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler"
  - The IBM TestaRossa JIT compiler
  - This paper focuses on code patching and profiling in a multi-threaded environment with a lot of class loading/unloading
- Lixin Su, Mikko H. Lipasti (University of Wisconsin-Madison), "Dynamic Class Hierarchy Mutation"
  - Run-time reassignment of objects from one derived class to another, changing their virtual tables
  - Offers opportunities for optimizations based on specialization
9. Session 3: Phase Detection and Profiling
- Priya Nagpurkar (UCSB), Michael Hind (IBM), Chandra Krintz (UCSB), Peter Sweeney, V.T. Rajan (IBM), "Online Phase Detection Algorithms"
  - Detecting phase behaviour in virtual machines
  - Track dynamic program parameters (methods invoked, branch directions) over time and apply a similarity model
- Jeremy Lau, Erez Perelman, Brad Calder (UC San Diego), "Selecting Software Phase Markers with Code Structure Analysis"
  - Phase markers: portions of code whose execution correlates with phase changes
  - Procedure calls and returns, loop boundaries
  - Profile-based hierarchical loop-call graph
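
The online phase detection idea from the first paper on this slide (track program parameters over time, then apply a similarity model) can be sketched as follows. The window size, the threshold, the overlap-based similarity measure, and the function names are all assumptions chosen for illustration; the paper evaluates its own set of models.

```python
from collections import Counter


def similarity(a, b):
    """Overlap of two normalized frequency profiles, in the range [0, 1]."""
    ca, cb = Counter(a), Counter(b)
    shared = ca.keys() & cb.keys()
    return sum(min(ca[k] / len(a), cb[k] / len(b)) for k in shared)


def phase_changes(trace, window=4, threshold=0.5):
    """Return trace positions where a window differs sharply from the previous one."""
    starts = range(0, len(trace) - window + 1, window)
    windows = [trace[i:i + window] for i in starts]
    return [
        i * window
        for i in range(1, len(windows))
        if similarity(windows[i - 1], windows[i]) < threshold
    ]


# Hypothetical trace of invoked method names.
trace = ["f", "g", "f", "g", "f", "g", "f", "g", "h", "h", "k", "h"]
changes = phase_changes(trace)
```

With this trace, the first two windows have identical profiles and the third shares no methods with the second, so a single phase change is reported at the start of the third window.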
10. Session 3: Phase Detection and Profiling
- Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), "Profiling over Adaptive Ranges"
  - Voted best paper; details later
- Hyesoon Kim, Muhammad Aater Suleman, Onur Mutlu, Yale N. Patt (UT-Austin), "2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set"
  - Predicts whether the prediction accuracy of each branch will vary across input sets
  - Heuristic approach used to derive representative profiling results from a single input set
11. Session 4: Tiled and Multicore Compilation
- David Wentzlaff, Anant Agarwal (MIT), "Constructing Virtual Architectures on a Tiled Processor"
  - Map components of a superscalar architecture (Pentium III) onto a parallel tiled architecture (Raw) using dynamic translation
  - In a way, uses Raw as a coarse-grain FPGA
- Aaron Smith (UT-Austin), J. Burrill (UMass at Amherst), J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, K. S. McKinley (UT-Austin), "Compiling for EDGE Architectures"
  - TRIPS EDGE (Explicit Data Graph Execution) architecture
  - This paper focuses on compilation of standard C and FORTRAN benchmarks
12. Session 4: Tiled and Multicore Compilation
- Shih-wei Liao, Zhaohui Du, Gansha Wu, Guei-Yuan Lueh (Intel), "Data and Computation Transformations for Brook Streaming Applications on Multiprocessors"
  - Parallel compiler for the Brook streaming language
  - Brook is an extension of C that enables specifying data parallelism
- Michael L. Chu, Scott A. Mahlke (University of Michigan), "Compiler-directed Object Partitioning for Multicluster Processors"
  - Partitioning of data in clustered architectures such as Raw
  - I didn't really understand what programming model these authors have in mind.
13. Session 5: Static Code Generation and Optimization Issues
- Two papers about the HP-UX Itanium compiler
- Dhruva R. Chakrabarti, Shin-Ming Liu (Hewlett-Packard), "Inline Analysis: Beyond Selection Heuristics"
  - Cross-module techniques for selection of inlined call sites and the choice of specialized function versions
- Robert Hundt, Dhruva R. Chakrabarti, Sandya S. Mannarswamy (Hewlett-Packard), "Practical Structure Layout Optimization and Advice"
  - Data layout and placement on the heap to improve locality
  - Structure splitting, structure peeling, dead field removal, and field reordering
14. Session 5: Static Code Generation and Optimization Issues
- Chris Lupo, Kent Wilken (University of California, Davis), "Post Register Allocation Spill Code Optimization"
  - Authors propose a profile-based algorithm for placement of save/restore instructions handling spilled variables in function calls
  - Implemented as a part of GCC
- Seung Woo Son, Guangyu Chen, Mahmut Kandemir (Pennsylvania State University), "A Compiler-Guided Approach for Reducing Disk Power Consumption by Exploiting Disk Access Locality"
  - Goal: restructure code so that disk idle periods are lengthened
  - The approach targets array-based programs; the disk layout of array data is exposed to the compiler
15. Session 6: SIMD Compilation
- Jianhui Li, Qi Zhang, Shu Xu, Bo Huang (Intel China Software Center), "Optimizing Dynamic Binary Translation for SIMD Instructions"
  - Algorithms for dynamic binary translation of SIMD instructions in general-purpose architectures (such as MMX in x86)
  - Evaluation using IA-32 binaries on Itanium 2
- Dorit Nuzman (IBM), Richard Henderson (Red Hat), "Multi-Platform Auto-Vectorization"
  - Implementation of an automatic vectorizer for GCC 4.0
16. Session 7: Optimization-space Exploration
- Felix Agakov, Edwin Bonilla, John Cavazos, Bjoern Franke, Grigori Fursin, Michael O'Boyle, Marc Toussaint, John Thomson, Chris Williams (U. of Edinburgh), "Using Machine Learning to Focus Iterative Optimization"
  - Predictive modelling used to search the optimization space
  - Targets embedded platforms: AMD Au1500 and Texas Instruments TI C6713
- Prasad Kulkarni, David Whalley, Gary Tyson (Florida State University), Jack Davidson (University of Virginia), "Exhaustive Optimization Phase Order Space Exploration"
  - Exhaustive search of the phase order space (15 phases) using aggressive pruning; takes time on the order of minutes to hours
  - Targets StrongARM SA-100
17. Session 7: Optimization-space Exploration
- Zhelong Pan, Rudolf Eigenmann (Purdue University), "Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning"
  - Problem: find the optimal combination of 38 GCC O3 options, targeting Pentium IV and Sparc II
  - Proposed heuristic algorithm provides a quality solution in time on the order of several hours
18. Session 8: Security and Reliability
- Edson Borin (UNICAMP), Cheng Wang, Youfeng Wu (Intel), Guido Araujo (UNICAMP), "Software-Based Transparent and Comprehensive Control-Flow Error Detection"
  - Addresses the problem of soft (transient) errors that cause branches to incorrect instructions
  - Implemented in SW as a part of a dynamic binary translator
- Tao Zhang, Xiaotong Zhuang, Santosh Pande (Georgia Tech), "Compiler Optimizations to Reduce Security Overheads"
  - Optimizations that specifically target techniques that implement software protection with minimal HW support
19. Session 8: Security and Reliability
- Susanta Nanda, Wei Li, Tzi-cker Chiueh (State University of NY at Stony Brook), "BIRD: Binary Interpretation using Runtime Disassembly"
  - Goal: a framework for automatic detection of vulnerabilities such as buffer overflows when the source code is not available
  - Static and dynamic disassembly and instrumentation; targets Windows x86 applications
20. Keynote Speeches
- Wei Li, Principal Engineer, Intel: "Parallel Programming 2.0"
- Kevin Stoodley, Fellow and CTO of Compilation Technology, IBM: "Productivity and Performance: Future Directions in Compilers"
21. Wei Li: Parallel Programming 2.0
- Major technological change
  - Moore's Law continues to increase transistor counts
  - However, power, memory latency, and limits to ILP are setting an effective performance ceiling
- General trend towards thread-level on-chip parallelism
  - SMT
  - Chip multiprocessors
22. Wei Li: Parallel Programming 2.0
- "Parallel Programming 2.0" refers to the advent of multicores
- A very optimistic future vision
23. Wei Li: Parallel Programming 2.0
- Key issue: where will the parallelism come from?
- Parallel programming needs to become more mainstream
  - Consumer vs. HPC/server/database
  - Inclusion into education at a more elementary level
  - New tools for greater ease of programming
- Intel's parallel programming tools
  - http://www.intel.com/software
24. K. Stoodley: "Productivity and Performance: Future Directions in Compilers"
- Limits to traditional static compilation
- Overview of IBM compiler technology
  - Testarossa JIT compiler, Toronto Portable Optimizer, Tobey backend
- Challenges at present and near future
  - Software abstraction complexity forces the scope of compilation to higher levels
  - Maintaining high-performance backwards compatibility is increasingly difficult
25. K. Stoodley: "Productivity and Performance: Future Directions in Compilers"
- Future: convergence/combination of dynamic and static compilation technologies
26. Best Paper
- Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), "Profiling over Adaptive Ranges"
27. Profiling over Adaptive Ranges
- Problem: how to count specific events efficiently and accurately?
  - Code segments executed
  - Memory regions accessed
  - IP addresses of routed packets
- In all cases, it is impossible to maintain separate counters for the entire range of values
  - Each basic block, memory address, IP address
28. Trade-off: Precision vs. Efficiency
[Figure panels: Uniform ranges; Unlimited counters]
- Profiling with uniform ranges fails to distinguish hot code
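
A tiny example shows why uniform ranges lose this information. The trace contents, the range size of 256, and the function name are hypothetical; the point is only that two very different access patterns can produce identical profiles.

```python
from collections import Counter


def uniform_profile(events, range_size):
    """Count events per fixed-width range, using one counter per range."""
    counts = Counter()
    for value in events:
        counts[value // range_size] += 1
    return counts


# Hypothetical traces: one hot address vs. accesses spread over the same range.
hot_trace = [0x1040] * 1000
spread_trace = [0x1000 + (i % 256) for i in range(1000)]

hot = uniform_profile(hot_trace, range_size=256)
spread = uniform_profile(spread_trace, range_size=256)
```

Both traces land entirely in the same range, so the two profiles are equal and the single hot address cannot be told apart from evenly spread accesses.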
29. Higher Precision for Hot Regions
- Good trade-off with limited resources
  - High precision for hot regions
  - Low precision for colder ones, but this affects the accuracy less
- Challenge: how to determine what exactly to count with what precision?
30. Solution: Adaptive Profiling
- Start with one counter; split counters as they become hot
33. Counter Merging
- Problem: what if program behaviour changes after the initialization phase?
35. Counter Merging
- Solution: perform counter merging along with splitting
36. Counter Merging
- Counters of merged child nodes are added to the parent
38. Counter Merging
- Problem: how to identify nodes for merging?
  - They are by definition the ones that are not updated frequently
- Solution: periodic batched merge operations
  - Tree depth grows at a logarithmic rate, so merging can be done at exponentially increasing intervals
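
The splitting and merging steps described on the preceding slides can be sketched as a small range tree. This is a simplified illustration, not the paper's algorithm: the binary split, the `epsilon`, `min_split`, and `first_merge` parameters, and all class and method names are assumptions.

```python
class Node:
    """One counter covering the half-open range [lo, hi)."""

    def __init__(self, lo, hi):
        self.lo, self.hi, self.count, self.children = lo, hi, 0, []

    def subtree_total(self):
        return self.count + sum(c.subtree_total() for c in self.children)


class AdaptiveProfiler:
    def __init__(self, lo, hi, epsilon=0.05, min_split=8, first_merge=64):
        self.root = Node(lo, hi)      # start with a single counter
        self.epsilon = epsilon        # "hot" = more than this fraction of events
        self.min_split = min_split    # avoids splitting on the first few events
        self.next_merge = first_merge
        self.total = 0

    def narrowest(self, value):
        """Return the smallest existing range that contains value."""
        node = self.root
        while node.children:
            left, right = node.children
            node = left if value < right.lo else right
        return node

    def record(self, value):
        self.total += 1
        node = self.narrowest(value)
        node.count += 1
        hot = node.count > max(self.min_split, self.epsilon * self.total)
        if hot and node.hi - node.lo > 1:
            mid = (node.lo + node.hi) // 2
            # Split: the parent keeps its old count; the children start at zero.
            node.children = [Node(node.lo, mid), Node(mid, node.hi)]
        if self.total >= self.next_merge:
            self._merge(self.root)
            self.next_merge *= 2      # batched merges at growing intervals

    def _merge(self, node):
        for child in node.children:
            self._merge(child)
        if node.children and not any(c.children for c in node.children):
            cold = sum(c.count for c in node.children)
            if cold <= self.epsilon * self.total:
                node.count += cold    # child counters are added to the parent
                node.children = []


profiler = AdaptiveProfiler(0, 256)
for _ in range(1000):
    profiler.record(200)              # hypothetical trace: a single hot value
hot_leaf = profiler.narrowest(200)
```

In this run the hot value ends up with its own width-1 counter while the rest of the range stays coarse, and no events are lost because merging only moves counts up the tree.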
39. Additional Contributions
- Heuristics for splitting and merging
- Theoretical analysis of accuracy guarantees
- Proposal for hardware implementation
- Experimental evaluation
  - Memory requirements
  - Average and worst-case errors on benchmarks
  - Performance of HW implementation
  - Accuracies on the order of 98.0-99.8% with only 8-64K of memory
40. Conclusions
- Highly interesting program
- My short presentation certainly doesn't do justice to most of the mentioned works!
- Readings to perhaps consider for future CARG
  - D. Wentzlaff, A. Agarwal, "Constructing Virtual Architectures on a Tiled Processor"
  - A. Smith et al., "Compiling for EDGE Architectures"
  - F. Agakov et al., "Using Machine Learning to Focus Iterative Optimization"
  - (Highly subjective!)