Title: Buffered dynamic run-time profiling of arbitrary data for Virtual Machines which employ interpreter and Just-In-Time (JIT) compiler
1. Buffered dynamic run-time profiling of arbitrary data for Virtual Machines which employ interpreter and Just-In-Time (JIT) compiler
- Compiler workshop 08
- Nikola Grcevski, IBM Canada Lab
2. Agenda
- The motivation and the importance of profiling
- Design and implementation of the J9 VM interpreter profiler
- Performance results and start-up overhead
3. The static vs. dynamic compiler
- Static compilers can take their time to analyze the code and perform intra-procedural analysis
- Dynamic Just-In-Time compilers don't have this luxury: compilation happens during application runtime
- Can dynamic compilers ever produce quality optimized code comparable to static compilers?
4. Why profile?
- The whole category of speculative optimizations relies on some type of profiling information
- Opens up opportunities for new code and memory optimizations
- Critical for high performance dynamic compiler systems
5. What could we profile?
- Pretty much anything that we expect will provide repeatable information that we can use to optimize
- The profiling can be at the Java level or CPU level if the OS supports it.
6. What kinds of profilers does J9 have?
- JIT profiler
- Instruments methods with various profiling hooks
- Targeted only to methods that are very hot
- Temporal and slows down execution
- Interpreter profiler
- The topic of this presentation
7. What kinds of data do we collect with the interpreter profiler?
- Branch direction
- Virtual/Interface call targets
- Switch statement index
- Instanceof and checkcast runtime types
8. Interpreter profiler design
- Buffered approach to data collection on the application threads
[Diagram: Application Thread 1 ... Application Thread N; bytecode labels: div, vcall, if, icall, mul, add, vcall, if, if, if, switch, ...]
9. Interpreter profiler design
- Buffer full event triggers processing of the data by the JIT
[Diagram: Application Thread 1, "Buffer full event", JIT runtime; buffer entries: if, vcall, if, switch, if, ...]
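A minimal sketch of the buffered collection described on slides 8 and 9. The class name `ProfilingBuffer`, the `capacity` value and the `on_full` callback are hypothetical and stand in for whatever the J9 interpreter and JIT runtime actually use; the slides only state that each application thread buffers its records and that a buffer full event hands them to the JIT.

```python
class ProfilingBuffer:
    """Hypothetical per-thread buffer; names and capacity are illustrative."""

    def __init__(self, capacity, on_full):
        self.capacity = capacity
        self.on_full = on_full   # stand-in for the hook into the JIT runtime
        self.records = []

    def record(self, pc, data):
        # The interpreter only appends to its own thread's buffer,
        # so nothing shared is touched on this path.
        self.records.append((pc, data))
        if len(self.records) >= self.capacity:
            self.flush()

    def flush(self):
        # "Buffer full" event: hand over the whole batch and start a new one.
        batch, self.records = self.records, []
        self.on_full(batch)


processed = []   # batches received by the stand-in JIT hook
buf = ProfilingBuffer(capacity=4, on_full=processed.append)
for pc, data in [(0x10, 1), (0x24, "A"), (0x10, 0), (0x38, 7), (0x10, 1)]:
    buf.record(pc, data)
```

After the loop, the first four records have been delivered as one batch and the fifth is still waiting in the buffer. The point of the design is that the per-record cost on the application thread is a single append.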
10. Interpreter profiler design
- JIT parses the application thread profiling buffer and builds internal profiling data structure
[Diagram: JIT runtime takes (bytecode program counter, data) records from the profiling buffer and places them in the JIT profiling hashtable, using a hash function based on the bytecode PC]
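The parsing step can be sketched as a small bucketed hashtable keyed by bytecode PC. The bucket count, the shift-and-modulo hash and the entry layout are assumptions made for illustration; the slides only say that the hash function is based on the bytecode PC.

```python
NUM_BUCKETS = 8   # illustrative size, not a J9 value

def pc_hash(pc):
    # Assumed hash: drop the low bits, then take the remainder.
    return (pc >> 2) % NUM_BUCKETS

table = [[] for _ in range(NUM_BUCKETS)]

def find_or_create(pc):
    """Return the entry for this PC, creating it on first encounter."""
    bucket = table[pc_hash(pc)]
    for entry in bucket:
        if entry["pc"] == pc:
            return entry
    entry = {"pc": pc, "samples": []}
    bucket.append(entry)
    return entry

def process_buffer(batch):
    # Walk the (pc, data) records of one full buffer.
    for pc, data in batch:
        find_or_create(pc)["samples"].append(data)

process_buffer([(0x10, 1), (0x24, "A"), (0x10, 0)])
```

Both records for PC `0x10` end up in the same entry, so repeated encounters of a bytecode update one record instead of creating new ones.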
11. What's in the data we collect?
- Bytecode program counter
- Variable size data packet
- 1 byte for branch direction
- Word size for call targets and runtime types
- 4 bytes for switch index
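A sketch of how such variable-size records could be laid out and read back, using the payload sizes listed above. It assumes a 64-bit word and little-endian layout, and it assumes the reader can tell the record kind from the PC (here through a hypothetical `kind_of_pc` lookup); the slides do not say how the kind is determined.

```python
import struct

WORD = "Q"                               # assumption: 64-bit word
PC_SIZE = struct.calcsize("<" + WORD)

# Payload format per record kind, following the sizes on the slide.
PAYLOAD = {"branch": "B", "call": WORD, "type": WORD, "switch": "I"}

def pack_record(pc, kind, value):
    """Bytecode PC followed by a kind-specific payload, with no padding."""
    return struct.pack("<" + WORD + PAYLOAD[kind], pc, value)

def unpack_records(raw, kind_of_pc):
    """Walk a buffer of back-to-back records of differing sizes."""
    offset, out = 0, []
    while offset < len(raw):
        (pc,) = struct.unpack_from("<" + WORD, raw, offset)
        fmt = "<" + PAYLOAD[kind_of_pc[pc]]
        (value,) = struct.unpack_from(fmt, raw, offset + PC_SIZE)
        out.append((pc, value))
        offset += PC_SIZE + struct.calcsize(fmt)
    return out

kinds = {0x10: "branch", 0x24: "call", 0x38: "switch"}   # hypothetical lookup
raw = (pack_record(0x10, "branch", 1)
       + pack_record(0x24, "call", 0xCAFE)
       + pack_record(0x38, "switch", 3))
decoded = unpack_records(raw, kinds)
```

With these assumptions a branch record takes 9 bytes, a call record 16 and a switch record 12, so frequent branches cost the least buffer space.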
12. Processing the buffered branch information
- We create an object to hold the bytecode PC and branch counts. We are using 4 bytes to store the branch information.
[Diagram: record layout: pc | taken | not taken]
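A sketch of a branch record that keeps both counts in 4 bytes. The even split into two 16-bit halves and the saturating behaviour at the maximum are assumptions; the slide gives only the total width and the two fields.

```python
MAX16 = 0xFFFF

class BranchRecord:
    """Hypothetical record: pc plus one 32-bit word of branch counts."""

    def __init__(self, pc):
        self.pc = pc
        self.word = 0   # assumed layout: taken in the high half, not taken in the low half

    def taken(self):
        return self.word >> 16

    def not_taken(self):
        return self.word & MAX16

    def update(self, direction):
        t, n = self.taken(), self.not_taken()
        if direction:
            t = min(t + 1, MAX16)   # saturate instead of wrapping around
        else:
            n = min(n + 1, MAX16)
        self.word = (t << 16) | n


rec = BranchRecord(pc=0x10)
for direction in [1, 1, 0, 1]:
    rec.update(direction)
```

After these four updates the record holds three taken and one not taken, which is the kind of ratio a compiler can use for code ordering.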
13. What does the JIT do with the call information?
- We keep up to 3 call targets with their counts, as well as a residue count
[Diagram: record layout: pc | residue | Class A, count | Class B, count | Class C, count]
- We use the same approach for checkcast and instanceof
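A sketch of the call-target record: at most three receiver classes with counts, and a residue count for everything else. Keeping the first three classes seen is an assumption made to keep the example short; the slides do not describe a replacement policy. The `dominant` helper is hypothetical and only illustrates how a compiler might read the record.

```python
MAX_TARGETS = 3

class CallSiteRecord:
    """Hypothetical record for a virtual or interface call site."""

    def __init__(self, pc):
        self.pc = pc
        self.targets = {}   # receiver class -> count, at most MAX_TARGETS entries
        self.residue = 0    # calls whose target did not get a slot

    def update(self, klass):
        if klass in self.targets:
            self.targets[klass] += 1
        elif len(self.targets) < MAX_TARGETS:
            self.targets[klass] = 1
        else:
            self.residue += 1

    def dominant(self):
        """Most frequent recorded target and its share of all observed calls."""
        klass, count = max(self.targets.items(), key=lambda kv: kv[1])
        total = sum(self.targets.values()) + self.residue
        return klass, count / total


site = CallSiteRecord(pc=0x24)
for klass in ["A", "B", "A", "C", "D", "A"]:
    site.update(klass)
```

Here class `A` accounts for half of the six observed calls and class `D` falls into the residue. The residue matters because it tells the compiler how much of the behaviour the three recorded targets fail to explain.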
14. What does the JIT do with the switch information?
- We create a data structure to hold the bytecode PC and counts for switch index. The index data is 8 bytes wide, split into 4 records: the top 3 and the rest.
[Diagram: record layout: pc | record 1 | record 2 | record 3 | the rest]
- Each record is split into 2 portions: a 1-byte count and a 1-byte switch index
[Diagram: record layout: count | index]
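A sketch of the 8-byte switch record: three (count, index) pairs plus one pair for the rest. Several details are assumptions: slots go to the first three indices seen rather than a true top 3, counts saturate at 255, and indices that do not fit in one byte are counted under the rest.

```python
MAX8 = 0xFF

class SwitchRecord:
    """Hypothetical record: pc plus four 2-byte (count, index) pairs."""

    def __init__(self, pc):
        self.pc = pc
        self.slots = []   # up to 3 [count, index] pairs
        self.rest = 0     # count for every other index

    def update(self, index):
        for slot in self.slots:
            if slot[1] == index:
                slot[0] = min(slot[0] + 1, MAX8)   # 1-byte count, saturating
                return
        if len(self.slots) < 3 and index <= MAX8:
            self.slots.append([1, index])
        else:
            self.rest = min(self.rest + 1, MAX8)

    def pack(self):
        """Serialize to exactly 8 bytes: count, index for each of the 4 records."""
        pairs = self.slots + [[0, 0]] * (3 - len(self.slots)) + [[self.rest, 0]]
        return bytes(b for pair in pairs for b in pair)


sw = SwitchRecord(pc=0x38)
for index in [2, 2, 5, 2, 9, 7, 5]:
    sw.update(index)
packed = sw.pack()
```

Index 2 is seen three times, 5 twice and 9 once, while 7 arrives after the slots are full and lands in the rest.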
15. Storing the profiling data
- Each data record is stored in a global hashtable, using the PC for the hash function
- On subsequent encounters of the same PC with profiling data the records are updated.
- Branch and switch counts are incremented
- Call targets and runtime types are added and counts incremented.
16. Using the profiling information
- The profiler database only knows of bytecode PC
- At all points where the compiler is interested in profiling information it generates the bytecode PC from the method information and the bytecode index
- The compiler has to make sense out of the information in the hashtable
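A sketch of the lookup from the compiler's side. It assumes the bytecode PC is the address where the method's bytecodes start plus the bytecode index; the addresses, the dictionary standing in for the hashtable and the function names are all hypothetical.

```python
def bytecode_pc(method_bytecode_start, bytecode_index):
    # Assumption: PC = start address of the method's bytecodes + index.
    return method_bytecode_start + bytecode_index

# Stand-in for the profiling hashtable, keyed by bytecode PC only.
profile = {0x7F001010: {"taken": 90, "not_taken": 10}}

def branch_profile(method_bytecode_start, bytecode_index):
    """Return the record for this bytecode, or None if it was never profiled."""
    return profile.get(bytecode_pc(method_bytecode_start, bytecode_index))


hit = branch_profile(0x7F001000, 0x10)
miss = branch_profile(0x7F001000, 0x20)
```

The miss case is part of the design: since the database is keyed by PC alone, the compiler has to cope with bytecodes that have no record and fall back to its default heuristics.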
17. Interpreter profiler design
- JIT compiler consults the profiling hashtable in various stages of method compilation
[Diagram: Compilation Thread stages (inliner, order code, ..., codegen) consulting the JIT profiling hashtable]
18. Performance results
- Up to 30% improvement on various applications
- EJB and other middleware applications benefit mostly from code ordering and devirtualization for the purpose of inlining
- Benchmarks typically benefit from other optimizations enabled by the ability to devirtualize virtual and interface calls
- With various tweaks we managed to drive the start-up overhead to below 10%
19. How do we manage the profiling overhead?
- We turn the profiler off in -Xquickstart mode
- No locking on the hashtable
- We detect startup phase of the application and skip records to ease off the data collection overhead
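One possible reading of "skip records" is to process only a fraction of each batch while the application is starting up. The sketch below assumes exactly that; the function name, the `in_startup` flag and the 1-in-4 rate are hypothetical, and the slides do not say how records are chosen.

```python
def records_to_process(batch, in_startup, keep_every=4):
    """Return the records worth processing for this batch."""
    if not in_startup:
        return list(batch)
    # During startup, keep one record out of every `keep_every`.
    return batch[::keep_every]


startup_subset = records_to_process(list(range(10)), in_startup=True)
steady_subset = records_to_process(list(range(10)), in_startup=False)
```

Under this assumption the startup cost of processing drops roughly in proportion to the skip rate, at the price of coarser counts for code that runs early.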
20. Turning the profiler ON and OFF
- The profiler is ON by default
- The sampler thread turns the profiler OFF or back ON
- Number of consecutive ticks in JIT generated code turns the profiler OFF
- Number of consecutive ticks in interpreter turns the profiler back ON
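The ON/OFF rule can be sketched as a small state machine driven by sampler ticks. The class name and both thresholds are hypothetical; the slides state only that runs of consecutive ticks in JIT generated code or in the interpreter flip the switch.

```python
class ProfilerSwitch:
    """Hypothetical sampler-driven switch; thresholds are illustrative."""

    def __init__(self, off_threshold, on_threshold):
        self.enabled = True          # the profiler is ON by default
        self.off_threshold = off_threshold
        self.on_threshold = on_threshold
        self.jit_run = 0             # consecutive ticks seen in JIT generated code
        self.interp_run = 0          # consecutive ticks seen in the interpreter

    def tick(self, in_jit_code):
        if in_jit_code:
            self.jit_run += 1
            self.interp_run = 0
            if self.enabled and self.jit_run >= self.off_threshold:
                self.enabled = False
        else:
            self.interp_run += 1
            self.jit_run = 0
            if not self.enabled and self.interp_run >= self.on_threshold:
                self.enabled = True
        return self.enabled


switch = ProfilerSwitch(off_threshold=3, on_threshold=2)
ticks = [True, True, False, True, True, True, False, False]
states = [switch.tick(t) for t in ticks]
```

In this run a single interpreter tick resets the JIT streak, so the profiler goes OFF only after three JIT ticks in a row and comes back ON after two interpreter ticks in a row. Requiring consecutive ticks keeps the switch from flapping on isolated samples.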
21. Some of the problems we encountered
- Tuning for optimal balance between startup overhead and throughput performance wasn't easy
- Application phase change detection wasn't easy
- Class unloading created lots of problems
22. Summary
- Profiling is critical for performance of run-time systems
- Using a buffered approach to data collection can help build efficient profilers
- Tuning for optimal balance of startup overhead and throughput performance is challenging