
Transcript and Presenter's Notes

Title: Using Platform-Specific Performance Counters for Dynamic Compilation


1
Using Platform-Specific Performance Counters for
Dynamic Compilation
  • Florian Schneider and Thomas Gross
  • ETH Zurich

2
Introduction / Motivation
  • Dynamic compilers are a common execution platform for
    OO languages (Java, C#)
  • Properties of OO programs are difficult to analyze at
    compile-time
  • JIT compiler can immediately use information
    obtained at run-time

3
Introduction / Motivation
  • Types of information:
  • Profiles: e.g., execution frequency of methods /
    basic blocks
  • Hardware-specific properties: cache misses, TLB
    misses, branch prediction failures

4
Outline
  1. Introduction
  2. Requirements
  3. Related work
  4. Implementation
  5. Results
  6. Conclusions

5
Requirements
  • Infrastructure flexible enough to measure
    different execution metrics
  • Hide machine-specific details from the VM
  • Keep changes to the VM/compiler minimal
  • Low runtime overhead for collecting information
    from the CPU
  • Information must be precise to be useful for
    online optimization

6
Related work
  • Profile-guided optimization
  • Code positioning [Pettis, PLDI 1990]
  • Hardware performance monitors
  • Relating HPM data to basic blocks [Ammons, PLDI 1997]
  • Vertical profiling [Hauswirth, OOPSLA 2004]
  • Dynamic optimization
  • Mississippi Delta [Adl-Tabatabai, PLDI 2004]
  • Object reordering [Huang, OOPSLA 2004]
  • Our work:
  • No instrumentation
  • Uses profile data + hardware info
  • Targets fully automatic dynamic optimization

7
Hardware performance monitors
  • Sampling-based counting
  • CPU reports state every n events
  • Precision is platform-dependent (pipelines,
    out-of-order execution)
  • Sampling provides method-, basic-block-, or
    instruction-level information
  • Newer CPUs support precise sampling (e.g. P4,
    Itanium)

8
Hardware performance monitors
  • A way to localize performance bottlenecks
  • Sampling interval determines how fine-grained the
    information is
  • Smaller sampling interval → more data
  • Trade-off: precision vs. runtime overhead
  • Need enough samples for a representative picture
    of the program behavior

9
Implementation
  • Main parts:
  • Kernel module: low-level access to hardware,
    per-process counting
  • User-space library: hides kernel device driver
    details from the VM
  • Java VM thread: collects samples periodically and
    maps them to Java code (see the sketch below)
  • Implemented on top of Jikes RVM
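
To make this division of labor concrete, the VM-side collector can be pictured as a daemon thread that periodically drains the sample buffer exposed by the user-space library and attributes each sampled PC to Java code. The sketch below only illustrates that structure; the class and method names (HpmCollectorThread, SampleBuffer, drain) are hypothetical, not the actual Jikes RVM or paper API.

import java.util.List;

// Minimal sketch of the VM-side collector thread (all names are hypothetical,
// not the actual Jikes RVM / paper API).
public final class HpmCollectorThread extends Thread {

    /** One hardware sample: the sampled PC (plus registers with precise sampling). */
    public record Sample(long pc) {}

    /** Wraps the user-space library, e.g. a JNI call that drains the kernel buffer. */
    public interface SampleBuffer {
        List<Sample> drain();
    }

    private final SampleBuffer buffer;
    private final long pollIntervalMillis;   // fixed polling interval

    public HpmCollectorThread(SampleBuffer buffer, long pollIntervalMillis) {
        this.buffer = buffer;
        this.pollIntervalMillis = pollIntervalMillis;
        setDaemon(true);                     // runs alongside the user program
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            for (Sample s : buffer.drain()) {
                // Map the raw PC back to compiled Java code; the lookup itself
                // is sketched after slide 14.
                System.out.printf("sample at pc=0x%x%n", s.pc());
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                return;                      // VM shutdown
            }
        }
    }
}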

10
System overview
11
Implementation
  • Supported events:
  • L1 and L2 cache misses
  • DTLB misses
  • Branch prediction
  • Parameters of the monitoring module:
  • Buffer size (fixed)
  • Polling interval (fixed)
  • Sampling interval (adaptive)
  • Keep runtime overhead roughly constant by
    automatically adapting the sampling interval at
    run-time (see the sketch below)
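
The adaptive sampling interval can be read as a simple feedback loop: if samples arrive faster than a target rate, sample less often; if slower, sample more often. The sketch below only illustrates that idea; the target rate, bounds, and doubling/halving policy are assumed values, not the parameters used in the paper.

// Sketch of adaptive sampling-interval control (constants and policy are
// illustrative assumptions).
public final class AdaptiveSamplingInterval {

    private static final long MIN_INTERVAL = 1_000;      // events per sample (lower bound)
    private static final long MAX_INTERVAL = 1_000_000;  // events per sample (upper bound)

    private long interval = 10_000;                      // "one sample every N events"
    private final double targetSamplesPerSec;

    public AdaptiveSamplingInterval(double targetSamplesPerSec) {
        this.targetSamplesPerSec = targetSamplesPerSec;
    }

    /** Called after each polling period with the observed sample rate. */
    public long adjust(double observedSamplesPerSec) {
        if (observedSamplesPerSec > targetSamplesPerSec * 1.2) {
            interval = Math.min(MAX_INTERVAL, interval * 2);   // too many samples: sample less often
        } else if (observedSamplesPerSec < targetSamplesPerSec * 0.8) {
            interval = Math.max(MIN_INTERVAL, interval / 2);   // too few samples: sample more often
        }
        return interval;
    }
}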

12
From raw data to Java
0x080485e1  mov  0x4(esi),esi
0x080485e4  mov  0x4,edi
0x080485e9  mov  (esi,edi,4),esi
0x080485ec  mov  ebx,0x4(esi)
0x080485ef  mov  0x4,ebx
0x080485f4  push ebx
0x080485f5  mov  0x0,ebx
0x080485fa  push ebx
0x080485fb  mov  0x8(ebp),ebx
0x080485fe  push ebx
0x080485ff  mov  (ebx),ebx
0x08048601  call 0x4(ebx)
0x08048604  add  0xc,esp
0x08048607  mov  0x8(ebp),ebx
0x0804860a  mov  0x4(ebx),ebx
Bytecodes annotated on the slide: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL
  • Determine method + bytecode instruction
  • Build sorted method table
  • Map offset to bytecode

13
From raw data to Java
  • Sample gives PC + register contents
  • PC → machine code → compiled Java code → bytecode
    instruction
  • For the data address, use registers + machine code to
    calculate the target address (see the sketch below)
  • GETFIELD → indirect load
  • mov 12(eax), eax   // 12 = offset of field
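
Since precise sampling records the register contents together with the PC, the address touched by an indirect load such as mov 12(eax), eax can be recovered by decoding the instruction's addressing mode and adding the displacement to the sampled base-register value. The following sketch shows that effective-address computation; the types and the decoded-operand representation are illustrative assumptions.

// Sketch: recover the data address of a sampled load/store from the
// registers captured with the sample (types and encoding are illustrative).
public final class EffectiveAddress {

    /** Register file captured by precise sampling. */
    public record Registers(long eax, long ebx, long ecx, long edx,
                            long esi, long edi, long ebp, long esp) {}

    /** Decoded memory operand: base register, scaled index register, displacement. */
    public record MemOperand(int baseReg, int indexReg, int scale, long displacement) {}

    private static long reg(Registers r, int idx) {
        return switch (idx) {                 // x86 register numbering
            case 0 -> r.eax(); case 1 -> r.ecx(); case 2 -> r.edx(); case 3 -> r.ebx();
            case 4 -> r.esp(); case 5 -> r.ebp(); case 6 -> r.esi(); case 7 -> r.edi();
            default -> 0L;
        };
    }

    /** e.g. mov 12(eax), eax  ->  address = eax + 12 */
    public static long compute(Registers regs, MemOperand op) {
        long base  = op.baseReg()  >= 0 ? reg(regs, op.baseReg())  : 0;
        long index = op.indexReg() >= 0 ? reg(regs, op.indexReg()) * op.scale() : 0;
        return base + index + op.displacement();
    }
}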

14
Engineering issues
  • Lookup of the PC to get method / bytecode instruction
    must be efficient
  • Done in parallel with the user program
  • Use binary search / hash table (see the sketch below)
  • Update at recompilation and GC
  • Identify 100% of instructions (PCs)
  • Include samples from application, VM, and library
    code
  • Dealing with native parts
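
One way to keep this lookup cheap is a table of compiled-code regions sorted by start address, binary-searched for each sampled PC and updated when methods are recompiled or moved by the garbage collector. The sketch below illustrates that approach; the class and field names are assumptions, not Jikes RVM internals.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a sorted code-region table with binary search for PC -> method
// lookup (illustrative, not the actual Jikes RVM data structure).
public final class MethodTable {

    /** One compiled code region: [start, end) plus the method it belongs to. */
    public record Region(long start, long end, String methodName) {}

    private final List<Region> regions = new ArrayList<>();

    /** Called at (re)compilation and after GC moves compiled code. */
    public synchronized void add(Region r) {
        regions.add(r);
        regions.sort(Comparator.comparingLong(Region::start));
    }

    /** Binary search for the sampled PC; returns null for VM-internal or native code. */
    public synchronized Region lookup(long pc) {
        int lo = 0, hi = regions.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Region r = regions.get(mid);
            if (pc < r.start())      hi = mid - 1;
            else if (pc >= r.end())  lo = mid + 1;
            else                     return r;   // start <= pc < end
        }
        return null;
    }
}

The bytecode instruction is then found by mapping the offset pc - start through the compiler's machine-code-to-bytecode map, as described on slides 12 and 13.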

15
Infrastructure
  • Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime
    platform
  • Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
  • Measured data show:
  • Runtime overhead
  • Extraction of meaningful information

16
Runtime overhead
Program     Orig. runtime [s] / score   Overhead [%] (interval 10000)   Overhead [%] (interval 1000)
javac            7.18                          2.0                             2.4
raytrace         4.04                          2.4                             2.0
jess             2.93                          0.6                             0.1
jack             2.73                          3.5                             2.7
db              10.49                          0.1                             3.1
compress         6.50                          0.9                             1.5
mpegaudio        6.54                          1.3                             0.3
jbb           6209.67 (score)                  2.4                             4.6
average            -                           1.6                             2.1
  • Experiment setup: monitoring L2 cache misses

17
Runtime overhead: specJBB
Total cost per sample: ~3000 cycles
18
Measurements
  • Measure which instructions produce the most events
    (cache misses, branch mispredictions)
  • Potential for data locality and control flow
    optimizations
  • Compare different SPEC benchmarks
  • Find hot spots: instructions that produce 80%
    of all measured events (see the sketch below)
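
The 80% figure can be computed by sorting the per-instruction sample counts in descending order and counting how many instructions are needed before the cumulative count reaches 80% of all events. A small sketch of that computation (input format and names assumed for illustration):

import java.util.Arrays;

// Sketch: how many instructions account for a given fraction of all sampled events.
public final class HotSpotQuantile {

    /** counts[i] = number of samples attributed to instruction (PC) i. */
    public static int instructionsFor(double fraction, long[] counts) {
        long total = Arrays.stream(counts).sum();
        long[] sorted = counts.clone();
        Arrays.sort(sorted);                              // ascending ...

        long cumulative = 0;
        int instructions = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {    // ... so walk from the hottest down
            cumulative += sorted[i];
            instructions++;
            if (cumulative >= fraction * total) break;
        }
        return instructions;
    }

    public static void main(String[] args) {
        long[] counts = {500, 300, 120, 40, 20, 10, 5, 5};   // made-up sample counts
        System.out.println(instructionsFor(0.8, counts));    // prints 2 for this input
    }
}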

19
L1/L2 Cache misses
80% quantile: 21 instructions (N = 571)
80% quantile: 13 instructions (N = 295)
20
L1/L2 Cache misses
80% quantile: 477 instructions (N = 8526)
80% quantile: 76 instructions (N = 2361)
21
L1/L2 Cache misses
80% quantile: 1296 instructions (N = 3172)
80% quantile: 153 instructions (N = 672)
22
Branch prediction
80% quantile: 307 instructions (N = 4193)
80% quantile: 1575 instructions (N = 7478)
23
Summary
80%-quantile (in % of total)   L1 misses   L2 misses   Branch pred.
specJBB                            5.6         3.2          7.3
javac                             40.9        22.7         21.1
db                                 3.7         4.4          0.8
  • Distribution of events over the program differs
    significantly between benchmarks
  • Challenge: Are the data precise enough to guide
    optimizations in a dynamic compiler?

24
Further work
  • Apply information in optimizer
  • Data access: path expressions p.x.y
  • Control flow: inlining, I-cache locality
  • Investigate flexible sampling interval
  • Further optimizations of monitoring system
  • Replacing expensive JNI calls
  • Avoid copying of samples

25
Concluding remarks
  • Precise performance event monitoring is possible
    with low overhead (< 2%)
  • Monitoring infrastructure tied into Jikes RVM
    compiler
  • Instruction level information allows
    optimizations to focus on hot spots
  • Good platform to study coupling of compiler
    decisions to hardware-specific platform properties

26
Backup: P4 performance counters
  • P4 stores samples in an OS-supplied buffer
  • Interrupt is only generated when the buffer is filled
  • Lower runtime overhead
  • All registers are stored together with the IP
    (see the sketch below)
  • Possible to obtain data address profiles
  • Only a subset of events is available for PEBS
    (precise event-based sampling)
  • Future architectures may support more

Registers stored with each sample: EAX EBX ECX EDX ESI EDI EBP ESP EIP
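
On the VM side, a decoded sample might be represented roughly as below, mirroring the registers listed above plus the instruction pointer; this is an assumed illustration, not the actual PEBS buffer layout or a Jikes RVM type.

// Sketch of a decoded PEBS-style sample: general-purpose registers plus EIP
// (illustrative only; not the hardware record layout).
public record PebsSample(long eax, long ebx, long ecx, long edx,
                         long esi, long edi, long ebp, long esp,
                         long eip) {

    /** The sampled instruction pointer is what gets mapped back to Java code. */
    public long pc() {
        return eip;
    }
}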