CGO 2006: The Fourth International Symposium on Code Generation and Optimization, New York, March 26-29, 2006

1
CGO 2006: The Fourth International Symposium on Code Generation and Optimization
New York, March 26-29, 2006

Conference Review
Presented by Ivan Matosevic

2
Outline

Conference overview
Brief summaries of sessions
Keynote speeches
Best paper

3
Conference Overview

Primary focus: back-end compilation techniques
Static analysis and optimization
Profiling
Run-time techniques
8 sessions, 29 papers
Dominating topics: multicores, dynamic compilation

4
Overview of Sessions

Dynamic Optimization
Object-Oriented Code Generation and Optimization
Phase Detection and Profiling
Tiled and Multicore Compilation
Static Code Generation and Optimization Issues
SIMD Compilation
Optimization Space Exploration
Security and Reliability

5
Session 1: Dynamic Optimization

Kim Hazelwood (University of Virginia), Robert Cohn (Intel), A Cross-Architectural Interface for Code Cache Manipulation
Pin: dynamic instrumentation system with a code cache
The paper describes an API for various operations on the code cache (callbacks, lookups, statistics, etc.)
Derek Bruening, Vladimir Kiriansky, Tim Garnett, Sanjeev Banerji (Determina Corporation), Thread-Shared Software Code Caches
Problem: sharing a code cache across multiple threads
Authors propose a fine-grained locking scheme
Evaluation using DynamoRIO

6
Session 1: Dynamic Optimization

Keith Cooper, Anshuman Dasgupta (Rice Univ.), Tailoring Graph-coloring Register Allocation For Runtime Compilation
Problem: register allocation in JIT compilers
Authors propose a novel lightweight graph-colouring technique
Weifeng Zhang, Brad Calder, Dean Tullsen (UC San Diego), A Self Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework
Extension of the Trident event-driven dynamic optimization framework (previously proposed by the same authors)
Dynamic insertion of prefetching instructions based on run-time analysis

7
Session 2: Object-Oriented Code Generation and Optimization

Suresh Srinivas, Yun Wang, Miaobo Chen, Qi Zhang, Eric Lin, Valery Ushakov, Yoav Zach, Shalom Goldenberg (Intel Corporation), Java JNI Bridge: An MRTE Framework for Mixed Native ISA Execution
Use a dynamic translator for the execution of native calls to one ISA on a different ISA's Java platform
Kris Venstermans, Lieven Eeckhout, Koen De Bosschere (Ghent University), Space-Efficient 64-bit Java Objects through Selective Typed Virtual Addressing
Use address bits on a 64-bit architecture to encode object type in order to save memory
Objects of the same type are allocated in a contiguous (virtual) region
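
The typed-addressing idea above can be illustrated with a minimal Python sketch. The 16-bit type field, the bump allocator, and the names region_base, allocate, and type_of are assumptions made for illustration; the slide does not give the paper's actual bit layout.

```python
# Hypothetical layout: the top TYPE_BITS of a 64-bit virtual address hold a type ID.
ADDRESS_BITS = 64
TYPE_BITS = 16                      # assumed width, not taken from the paper
OFFSET_BITS = ADDRESS_BITS - TYPE_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def region_base(type_id: int) -> int:
    """Start of the contiguous virtual region reserved for one type."""
    if not 0 <= type_id < (1 << TYPE_BITS):
        raise ValueError("type id does not fit in the type field")
    return type_id << OFFSET_BITS


def allocate(type_id: int, next_free: dict, size: int) -> int:
    """Bump-allocate an object inside its type's region and return its address."""
    offset = next_free.get(type_id, 0)
    if offset + size > OFFSET_MASK + 1:
        raise MemoryError("region for this type is full")
    next_free[type_id] = offset + size
    return region_base(type_id) | offset


def type_of(address: int) -> int:
    """Recover the type from the address alone, with no per-object header field."""
    return address >> OFFSET_BITS


next_free = {}
a = allocate(3, next_free, 24)   # two objects of (hypothetical) type 3
b = allocate(3, next_free, 24)
c = allocate(7, next_free, 16)   # one object of (hypothetical) type 7
print(type_of(a), type_of(b), type_of(c))
```

Because the type is implied by the region an object lives in, the per-object type field can be dropped, which is where the memory saving would come from in this simplified picture.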

8
Session 2: Object-Oriented Code Generation and Optimization

Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan (IBM Canada), Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler
The IBM Testarossa JIT compiler
This paper focuses on code patching and profiling in a multi-threaded environment with a lot of class loading/unloading
Lixin Su, Mikko H. Lipasti (University of Wisconsin-Madison), Dynamic Class Hierarchy Mutation
Run-time reassignment of objects from one derived class to another, changing their virtual tables
Offers opportunity for optimizations based on specialization

9
Session 3: Phase Detection and Profiling

Priya Nagpurkar (UCSB), Michael Hind (IBM), Chandra Krintz (UCSB), Peter Sweeney, V.T. Rajan (IBM), Online Phase Detection Algorithms
Detecting phase behaviour in virtual machines
Track dynamic program parameters (methods invoked, branch directions) over time and apply a similarity model
Jeremy Lau, Erez Perelman, Brad Calder (UC San Diego), Selecting Software Phase Markers with Code Structure Analysis
Phase markers: portions of code whose execution correlates with phase changes
Procedure calls and returns, loop boundaries
Profile-based hierarchical loop-call graph

10
Session 3: Phase Detection and Profiling

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), Profiling over Adaptive Ranges
Voted best paper; details later
Hyesoon Kim, Muhammad Aater Suleman, Onur Mutlu, Yale N. Patt (UT-Austin), 2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set
Predicts whether the prediction accuracy of each branch will vary across input sets
Heuristic approach used to derive representative profiling results from a single input set

11
Session 4: Tiled and Multicore Compilation

David Wentzlaff, Anant Agarwal (MIT), Constructing Virtual Architectures on a Tiled Processor
Map components of a superscalar architecture (Pentium III) onto a parallel tiled architecture (Raw) using dynamic translation
In a way, uses Raw as a coarse-grain FPGA
Aaron Smith (UT-Austin), J. Burrill (UMass at Amherst), J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, K. S. McKinley (UT-Austin), Compiling for EDGE Architectures
TRIPS EDGE (Explicit Data Graph Execution) architecture
This paper focuses on compilation of standard C and FORTRAN benchmarks

12
Session 4: Tiled and Multicore Compilation

Shih-wei Liao, Zhaohui Du, Gansha Wu, Guei-Yuan Lueh (Intel), Data and Computation Transformations for Brook Streaming Applications on Multiprocessors
Parallel compiler for the Brook streaming language
An extension of C that enables specifying data parallelism
Michael L. Chu, Scott A. Mahlke (University of Michigan), Compiler-directed Object Partitioning for Multicluster Processors
Partitioning of data in clustered architectures such as Raw
I didn't really understand what programming model these authors have in mind

13
Session 5: Static Code Generation and Optimization Issues

Two papers about the HP-UX Itanium compiler:
Dhruva R. Chakrabarti, Shin-Ming Liu (Hewlett-Packard), Inline Analysis: Beyond Selection Heuristics
Cross-module techniques for selection of inlined call sites and the choice of specialized function versions
Robert Hundt, Dhruva R. Chakrabarti, Sandya S. Mannarswamy (Hewlett-Packard), Practical Structure Layout Optimization and Advice
Data layout and placement on the heap to improve locality
Structure splitting, structure peeling, dead field removal, and field reordering

14
Session 5: Static Code Generation and Optimization Issues

Chris Lupo, Kent Wilken (University of California, Davis), Post Register Allocation Spill Code Optimization
Authors propose a profile-based algorithm for placement of save/restore instructions handling spilled variables in function calls
Implemented as a part of GCC
Seung Woo Son, Guangyu Chen, Mahmut Kandemir (Pennsylvania State University), A Compiler-Guided Approach for Reducing Disk Power Consumption by Exploiting Disk Access Locality
Goal: restructure code so that disk idle periods are lengthened
The approach targets array-based programs; the disk layout of array data is exposed to the compiler

15
Session 6: SIMD Compilation

Jianhui Li, Qi Zhang, Shu Xu, Bo Huang (Intel China Software Center), Optimizing Dynamic Binary Translation for SIMD Instructions
Algorithms for dynamic binary translation of SIMD instructions in general-purpose architectures (such as MMX in x86)
Evaluation using IA-32 binaries on Itanium 2
Dorit Nuzman (IBM), Richard Henderson (Red Hat), Multi-Platform Auto-Vectorization
Implementation of an automatic vectorizer for GCC 4.0

16
Session 7: Optimization-space Exploration

Felix Agakov, Edwin Bonilla, John Cavazos, Bjoern Franke, Grigori Fursin, Michael O'Boyle, Marc Toussaint, John Thomson, Chris Williams (U. of Edinburgh), Using Machine Learning to Focus Iterative Optimization
Predictive modelling used to search the optimization space
Targets embedded platforms: AMD Au1500 and Texas Instruments TI C6713
Prasad Kulkarni, David Whalley, Gary Tyson (Florida State University), Jack Davidson (University of Virginia), Exhaustive Optimization Phase Order Space Exploration
Exhaustive search of the phase order space (15 phases) using aggressive pruning; takes time on the order of minutes to hours
Targets StrongARM SA-100
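
To give a feel for how pruning can make an exhaustive phase-order search tractable, here is a toy Python sketch. It assumes two pruning rules: skip a phase that leaves the code unchanged, and do not re-expand code that an earlier ordering already produced. The three "phases" and the instruction names are hypothetical stand-ins, not the compiler phases evaluated in the paper.

```python
from collections import deque


def remove_nops(code):
    """Toy phase: delete every "nop"."""
    return tuple(op for op in code if op != "nop")


def cancel_double_neg(code):
    """Toy phase: cancel adjacent "neg", "neg" pairs."""
    out = []
    for op in code:
        if op == "neg" and out and out[-1] == "neg":
            out.pop()
        else:
            out.append(op)
    return tuple(out)


def dedupe_loads(code):
    """Toy phase: drop a "load" that directly follows another "load"."""
    out = []
    for op in code:
        if op == "load" and out and out[-1] == "load":
            continue
        out.append(op)
    return tuple(out)


PHASES = {"n": remove_nops, "g": cancel_double_neg, "l": dedupe_loads}


def explore(code):
    """Breadth-first search over distinct code versions, not over raw orderings."""
    seen = {code: ""}            # code version -> one phase sequence reaching it
    queue = deque([code])
    attempts = 0
    while queue:
        current = queue.popleft()
        for name, phase in PHASES.items():
            attempts += 1
            result = phase(current)
            if result == current:    # phase had no effect here: prune
                continue
            if result in seen:       # same code already reached: prune
                continue
            seen[result] = seen[current] + name
            queue.append(result)
    return seen, attempts


start = ("load", "nop", "load", "neg", "nop", "neg", "add")
seen, attempts = explore(start)
best = min(seen, key=len)
print(len(seen), attempts, best, seen[best])
```

The search visits each distinct code version once, so its cost depends on how many different versions the phases can produce rather than on the number of possible orderings.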

17
Session 7: Optimization-space Exploration

Zhelong Pan, Rudolf Eigenmann (Purdue University), Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning
Problem: find the optimal combination of 38 GCC -O3 options, targeting Pentium IV and Sparc II
Proposed heuristic algorithm that provides a quality solution in time on the order of several hours

18
Session 8: Security and Reliability

Edson Borin (UNICAMP), Cheng Wang, Youfeng Wu (Intel), Guido Araujo (UNICAMP), Software-Based Transparent and Comprehensive Control-Flow Error Detection
Addresses the problem of soft (transient) errors that cause branches to incorrect instructions
Implemented in SW as a part of a dynamic binary translator
Tao Zhang, Xiaotong Zhuang, Santosh Pande (Georgia Tech), Compiler Optimizations to Reduce Security Overheads
Optimizations that specifically target techniques that implement software protection with minimal HW support

19
Session 8: Security and Reliability

Susanta Nanda, Wei Li, Tzi-cker Chiueh (State University of NY at Stony Brook), BIRD: Binary Interpretation using Runtime Disassembly
Goal: framework for automatic detection of vulnerabilities such as buffer overflows when the source code is not available
Static and dynamic disassembly and instrumentation; targets Windows x86 applications

20
Keynote Speeches

Wei Li, Principal Engineer, Intel: "Parallel Programming 2.0"
Kevin Stoodley, Fellow and CTO of Compilation Technology, IBM: "Productivity and Performance: Future Directions in Compilers"

21
Wei Li: "Parallel Programming 2.0"

Major technological change
Moore's Law continues to increase transistor counts
However, power, memory latency, and limits to ILP are setting an effective performance ceiling
General trend towards thread-level on-chip
parallelism
SMT
Chip multiprocessors

22
Wei Li: "Parallel Programming 2.0"

"Parallel Programming 2.0" refers to the advent of multicores
A very optimistic future vision

23
Wei Li: "Parallel Programming 2.0"

Key issue: where will the parallelism come from?
Parallel programming needs to become more mainstream
Consumer vs. HPC/server/database
Inclusion into education at a more elementary level
New tools for greater ease of programming
Intel's parallel programming tools: http://www.intel.com/software

24
K. Stoodley: "Productivity and Performance: Future Directions in Compilers"

Limits to traditional static compilation
Overview of IBM compiler technology
Testarossa JIT compiler, Toronto Portable Optimizer, Tobey backend
Challenges at present and near future
Software abstraction complexity forces the scope of compilation to higher levels
Maintaining high-performance backwards compatibility is increasingly difficult

25
K. Stoodley: "Productivity and Performance: Future Directions in Compilers"

Future: convergence/combination of dynamic and static compilation technologies

26
Best Paper

Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), Profiling over Adaptive Ranges

27
Profiling over Adaptive Ranges

Problem: how to count specific events efficiently and accurately?
Code segments executed
Memory regions accessed
IP addresses of routed packets
In all cases, it is impossible to maintain separate counters for the entire range of values
Each basic block, memory address, IP address

28
Trade-off: Precision vs. Efficiency
Uniform ranges
Unlimited counters

Profiling with uniform ranges fails to
distinguish hot code

29
Higher Precision for Hot Regions

Good trade-off with limited resources
High precision for hot regions
Low precision for colder ones, but this affects the accuracy less

Challenge: how to determine what exactly to count with what precision?

30
Solution: Adaptive Profiling

Start with one counter; split counters as they become hot
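
A minimal Python sketch of the splitting step, under simplifying assumptions: ranges are split in two, a counter splits after a fixed number of hits, and a split counter keeps its count while its children start from zero. The class name RangeNode and the split_threshold parameter are hypothetical; the paper's own heuristics are not given on these slides.

```python
class RangeNode:
    """One counter covering the half-open value range [lo, hi)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.count = 0
        self.children = []

    def record(self, value, split_threshold):
        """Count one event at the finest existing range; split it if it is hot."""
        if not self.lo <= value < self.hi:
            raise ValueError("value outside the profiled range")
        node = self
        while node.children:
            node = next(c for c in node.children if c.lo <= value < c.hi)
        node.count += 1
        # Assumed rule: a fixed hit count makes a range "hot" enough to split.
        if node.count >= split_threshold and node.hi - node.lo > 1:
            mid = (node.lo + node.hi) // 2
            node.children = [RangeNode(node.lo, mid), RangeNode(mid, node.hi)]

    def leaves(self):
        """Finest-grained counters currently in the tree."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


root = RangeNode(0, 1024)                     # start with a single counter
for address in [17] * 200 + [900] * 3:        # one hot address, one cold one
    root.record(address, split_threshold=8)

hottest = max(root.leaves(), key=lambda leaf: leaf.count)
print(hottest.lo, hottest.hi, hottest.count, len(root.leaves()))
```

In this run the tree refines itself only around the hot address, so the cold address stays in a single wide range while the hot one ends up with a counter of its own.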

33
Counter Merging

Problem: what if program behaviour changes after the initialization phase?

35
Counter Merging

Solution: perform counter merging along with splitting

36
Counter Merging

Counters of merged child nodes added to the parent

38
Counter Merging

Problem: how to identify nodes for merging?
They are by definition the ones that are not updated frequently
Solution: periodic batched merge operations
Tree depth grows at a logarithmic rate, so merging can be done at exponentially increasing intervals
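
The batched merge can be sketched in Python as follows. This is an illustration under assumptions: sibling groups are merged only as a whole, the merge_threshold value is arbitrary, and merge passes are scheduled at powers of two. The names Node, merge_cold, and merge_points are hypothetical.

```python
class Node:
    """Counter for the range [lo, hi), with optional finer-grained children."""

    def __init__(self, lo, hi, count=0, children=None):
        self.lo, self.hi = lo, hi
        self.count = count
        self.children = children or []


def subtree_total(node):
    """All events recorded anywhere inside this node's range."""
    return node.count + sum(subtree_total(c) for c in node.children)


def merge_cold(node, merge_threshold):
    """Bottom-up pass: fold a group of cold leaf children back into the parent."""
    for child in node.children:
        merge_cold(child, merge_threshold)
    if node.children and all(not c.children for c in node.children):
        child_sum = sum(c.count for c in node.children)
        if child_sum < merge_threshold:
            node.count += child_sum      # child counters are added to the parent
            node.children = []


def merge_points(limit):
    """Event counts at which a batched merge pass runs: 1, 2, 4, 8, ..."""
    n = 1
    while n <= limit:
        yield n
        n *= 2


left = Node(0, 4, 5, [Node(0, 2, 1), Node(2, 4, 2)])     # cold subtree
right = Node(4, 8, 50, [Node(4, 6, 40), Node(6, 8, 1)])  # hot subtree
root = Node(0, 8, 10, [left, right])

before = subtree_total(root)
merge_cold(root, merge_threshold=5)
print(left.count, len(left.children), len(right.children), subtree_total(root))
print(list(merge_points(20)))
```

Merging never discards events: the total over the tree is the same before and after the pass, only the resolution of the cold region drops.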

39
Additional Contributions

Heuristics for splitting and merging
Theoretical analysis of accuracy guarantees
Proposal for hardware implementation
Experimental evaluation
Memory requirements
Average and worst-case errors on benchmarks
Performance of HW implementation
Accuracies on the order of 98.0-99.8% with only 8-64K of memory

40
Conclusions

Highly interesting program
My short presentation certainly doesn't do justice to most of the mentioned works!
Readings to perhaps consider for future CARG:
D. Wentzlaff, A. Agarwal, Constructing Virtual Architectures on a Tiled Processor
A. Smith et al., Compiling for EDGE Architectures
F. Agakov et al., Using Machine Learning to Focus Iterative Optimization
(Highly subjective!)