
Transcript and Presenter's Notes

Title: Using Platform-Specific Performance Counters for Dynamic Compilation


1
Using Platform-Specific Performance Counters for
Dynamic Compilation
  • Florian Schneider and Thomas Gross
  • ETH Zurich

2
Introduction / Motivation
  • Dynamic compilers are a common execution platform for
    OO languages (Java, C#)
  • Properties of OO programs are difficult to analyze at
    compile-time
  • JIT compiler can immediately use information
    obtained at run-time

3
Introduction / Motivation
  • Types of information:
  • Profiles: e.g., execution frequency of methods /
    basic blocks
  • Hardware-specific properties: cache misses, TLB
    misses, branch prediction failures

4
Outline
  1. Introduction
  2. Requirements
  3. Related work
  4. Implementation
  5. Results
  6. Conclusions

5
Requirements
  • Infrastructure flexible enough to measure
    different execution metrics
  • Hide machine-specific details from the VM
  • Keep changes to the VM/compiler minimal
  • Low runtime overhead for collecting information
    from the CPU
  • Information must be precise to be useful for
    online optimization

6
Related work
  • Profile-guided optimization
  • Code positioning [Pettis, PLDI 1990]
  • Hardware performance monitors
  • Relating HPM data to basic blocks [Ammons, PLDI 1997]
  • Vertical profiling [Hauswirth, OOPSLA 2004]
  • Dynamic optimization
  • Mississippi Delta [Adl-Tabatabai, PLDI 2004]
  • Object reordering [Huang, OOPSLA 2004]
  • Our work:
  • No instrumentation
  • Uses profile data + hardware info
  • Targets fully automatic dynamic optimization

7
Hardware performance monitors
  • Sampling-based counting
  • CPU reports state every n events
  • Precision is platform-dependent (pipelines,
    out-of-order execution)
  • Sampling provides method-, basic-block-, or
    instruction-level information
  • Newer CPUs support precise sampling (e.g. P4,
    Itanium)

8
Hardware performance monitors
  • A way to localize performance bottlenecks
  • Sampling interval determines how fine-grained the
    information is
  • Smaller sampling interval → more data
  • Trade-off: precision vs. runtime overhead
  • Need enough samples for a representative picture
    of the program behavior

9
Implementation
  • Main parts:
  • Kernel module: low-level access to hardware,
    per-process counting
  • User-space library: hides kernel device driver
    details from the VM
  • Java VM thread: collects samples periodically and
    maps them to Java code (see the sketch below)
  • Implemented on top of Jikes RVM
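
To make this division of labor concrete, the VM-side collector can be pictured as a daemon thread that periodically drains the sample buffer exposed by the user-space library and attributes each sampled PC to Java code. The sketch below only illustrates that structure; the class and method names (HpmCollectorThread, SampleBuffer, drain) are hypothetical, not the actual Jikes RVM or paper API.

import java.util.List;

// Minimal sketch of the VM-side collector thread (all names are hypothetical,
// not the actual Jikes RVM / paper API).
public final class HpmCollectorThread extends Thread {

    /** One hardware sample: the sampled PC (plus registers with precise sampling). */
    public record Sample(long pc) {}

    /** Wraps the user-space library, e.g. a JNI call that drains the kernel buffer. */
    public interface SampleBuffer {
        List<Sample> drain();
    }

    private final SampleBuffer buffer;
    private final long pollIntervalMillis;   // fixed polling interval

    public HpmCollectorThread(SampleBuffer buffer, long pollIntervalMillis) {
        this.buffer = buffer;
        this.pollIntervalMillis = pollIntervalMillis;
        setDaemon(true);                     // runs alongside the user program
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            for (Sample s : buffer.drain()) {
                // Map the raw PC back to compiled Java code; the lookup itself
                // is sketched after slide 14.
                System.out.printf("sample at pc=0x%x%n", s.pc());
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                return;                      // VM shutdown
            }
        }
    }
}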

10
System overview
11
Implementation
  • Supported events:
  • L1 and L2 cache misses
  • DTLB misses
  • Branch prediction
  • Parameters of the monitoring module:
  • Buffer size (fixed)
  • Polling interval (fixed)
  • Sampling interval (adaptive)
  • Keep runtime overhead roughly constant by
    automatically adapting the sampling interval at
    run-time (see the sketch below)
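
The adaptive sampling interval can be read as a simple feedback loop: if samples arrive faster than a target rate, sample less often; if slower, sample more often. The sketch below only illustrates that idea; the target rate, bounds, and doubling/halving policy are assumed values, not the parameters used in the paper.

// Sketch of adaptive sampling-interval control (constants and policy are
// illustrative assumptions).
public final class AdaptiveSamplingInterval {

    private static final long MIN_INTERVAL = 1_000;      // events per sample (lower bound)
    private static final long MAX_INTERVAL = 1_000_000;  // events per sample (upper bound)

    private long interval = 10_000;                      // "one sample every N events"
    private final double targetSamplesPerSec;

    public AdaptiveSamplingInterval(double targetSamplesPerSec) {
        this.targetSamplesPerSec = targetSamplesPerSec;
    }

    /** Called after each polling period with the observed sample rate. */
    public long adjust(double observedSamplesPerSec) {
        if (observedSamplesPerSec > targetSamplesPerSec * 1.2) {
            interval = Math.min(MAX_INTERVAL, interval * 2);   // too many samples: sample less often
        } else if (observedSamplesPerSec < targetSamplesPerSec * 0.8) {
            interval = Math.max(MIN_INTERVAL, interval / 2);   // too few samples: sample more often
        }
        return interval;
    }
}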

12
From raw data to Java
0x080485e1  mov  0x4(esi),esi
0x080485e4  mov  0x4,edi
0x080485e9  mov  (esi,edi,4),esi
0x080485ec  mov  ebx,0x4(esi)
0x080485ef  mov  0x4,ebx
0x080485f4  push ebx
0x080485f5  mov  0x0,ebx
0x080485fa  push ebx
0x080485fb  mov  0x8(ebp),ebx
0x080485fe  push ebx
0x080485ff  mov  (ebx),ebx
0x08048601  call 0x4(ebx)
0x08048604  add  0xc,esp
0x08048607  mov  0x8(ebp),ebx
0x0804860a  mov  0x4(ebx),ebx
Bytecodes annotated on the slide: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL
  • Determine method + bytecode instruction
  • Build sorted method table
  • Map offset to bytecode

13
From raw data to Java
  • Sample gives PC + register contents
  • PC → machine code → compiled Java code → bytecode
    instruction
  • For the data address, use registers + machine code to
    calculate the target address (see the sketch below)
  • GETFIELD → indirect load
  • mov 12(eax), eax   // 12 = offset of field
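
Since precise sampling records the register contents together with the PC, the address touched by an indirect load such as mov 12(eax), eax can be recovered by decoding the instruction's addressing mode and adding the displacement to the sampled base-register value. The following sketch shows that effective-address computation; the types and the decoded-operand representation are illustrative assumptions.

// Sketch: recover the data address of a sampled load/store from the
// registers captured with the sample (types and encoding are illustrative).
public final class EffectiveAddress {

    /** Register file captured by precise sampling. */
    public record Registers(long eax, long ebx, long ecx, long edx,
                            long esi, long edi, long ebp, long esp) {}

    /** Decoded memory operand: base register, scaled index register, displacement. */
    public record MemOperand(int baseReg, int indexReg, int scale, long displacement) {}

    private static long reg(Registers r, int idx) {
        return switch (idx) {                 // x86 register numbering
            case 0 -> r.eax(); case 1 -> r.ecx(); case 2 -> r.edx(); case 3 -> r.ebx();
            case 4 -> r.esp(); case 5 -> r.ebp(); case 6 -> r.esi(); case 7 -> r.edi();
            default -> 0L;
        };
    }

    /** e.g. mov 12(eax), eax  ->  address = eax + 12 */
    public static long compute(Registers regs, MemOperand op) {
        long base  = op.baseReg()  >= 0 ? reg(regs, op.baseReg())  : 0;
        long index = op.indexReg() >= 0 ? reg(regs, op.indexReg()) * op.scale() : 0;
        return base + index + op.displacement();
    }
}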

14
Engineering issues
  • Lookup of the PC to get method / bytecode instruction
    must be efficient
  • Done in parallel with the user program
  • Use binary search / hash table (see the sketch below)
  • Update at recompilation and GC
  • Identify 100% of instructions (PCs)
  • Include samples from application, VM, and library
    code
  • Dealing with native parts
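
One way to keep this lookup cheap is a table of compiled-code regions sorted by start address, binary-searched for each sampled PC and updated when methods are recompiled or moved by the garbage collector. The sketch below illustrates that approach; the class and field names are assumptions, not Jikes RVM internals.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a sorted code-region table with binary search for PC -> method
// lookup (illustrative, not the actual Jikes RVM data structure).
public final class MethodTable {

    /** One compiled code region: [start, end) plus the method it belongs to. */
    public record Region(long start, long end, String methodName) {}

    private final List<Region> regions = new ArrayList<>();

    /** Called at (re)compilation and after GC moves compiled code. */
    public synchronized void add(Region r) {
        regions.add(r);
        regions.sort(Comparator.comparingLong(Region::start));
    }

    /** Binary search for the sampled PC; returns null for VM-internal or native code. */
    public synchronized Region lookup(long pc) {
        int lo = 0, hi = regions.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Region r = regions.get(mid);
            if (pc < r.start())      hi = mid - 1;
            else if (pc >= r.end())  lo = mid + 1;
            else                     return r;   // start <= pc < end
        }
        return null;
    }
}

The bytecode instruction is then found by mapping the offset pc - start through the compiler's machine-code-to-bytecode map, as described on slides 12 and 13.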

15
Infrastructure
  • Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime
    platform
  • Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
  • Measured data show:
  • Runtime overhead
  • Extraction of meaningful information

16
Runtime overhead
Program     Orig. runtime [s] / score   Overhead [%] (interval 10000)   Overhead [%] (interval 1000)
javac            7.18                          2.0                             2.4
raytrace         4.04                          2.4                             2.0
jess             2.93                          0.6                             0.1
jack             2.73                          3.5                             2.7
db              10.49                          0.1                             3.1
compress         6.50                          0.9                             1.5
mpegaudio        6.54                          1.3                             0.3
jbb           6209.67 (score)                  2.4                             4.6
average            -                           1.6                             2.1
  • Experiment setup: monitoring L2 cache misses

17
Runtime overhead: specJBB
Total cost per sample: ~3000 cycles
18
Measurements
  • Measure which instructions produce the most events
    (cache misses, branch mispredictions)
  • Potential for data locality and control flow
    optimizations
  • Compare different SPEC benchmarks
  • Find hot spots: instructions that produce 80%
    of all measured events (see the sketch below)
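
The 80% figure can be computed by sorting the per-instruction sample counts in descending order and counting how many instructions are needed before the cumulative count reaches 80% of all events. A small sketch of that computation (input format and names assumed for illustration):

import java.util.Arrays;

// Sketch: how many instructions account for a given fraction of all sampled events.
public final class HotSpotQuantile {

    /** counts[i] = number of samples attributed to instruction (PC) i. */
    public static int instructionsFor(double fraction, long[] counts) {
        long total = Arrays.stream(counts).sum();
        long[] sorted = counts.clone();
        Arrays.sort(sorted);                              // ascending ...

        long cumulative = 0;
        int instructions = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {    // ... so walk from the hottest down
            cumulative += sorted[i];
            instructions++;
            if (cumulative >= fraction * total) break;
        }
        return instructions;
    }

    public static void main(String[] args) {
        long[] counts = {500, 300, 120, 40, 20, 10, 5, 5};   // made-up sample counts
        System.out.println(instructionsFor(0.8, counts));    // prints 2 for this input
    }
}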

19
L1/L2 Cache misses
80% quantile: 21 instructions (N = 571)
80% quantile: 13 instructions (N = 295)
20
L1/L2 Cache misses
80% quantile: 477 instructions (N = 8526)
80% quantile: 76 instructions (N = 2361)
21
L1/L2 Cache misses
80% quantile: 1296 instructions (N = 3172)
80% quantile: 153 instructions (N = 672)
22
Branch prediction
80% quantile: 307 instructions (N = 4193)
80% quantile: 1575 instructions (N = 7478)
23
Summary
80%-quantile (in % of total)   L1 misses   L2 misses   Branch pred.
specJBB                            5.6         3.2          7.3
javac                             40.9        22.7         21.1
db                                 3.7         4.4          0.8
  • Distribution of events over the program differs
    significantly between benchmarks
  • Challenge: Are the data precise enough to guide
    optimizations in a dynamic compiler?

24
Further work
  • Apply information in optimizer
  • Data access: path expressions p.x.y
  • Control flow: inlining, I-cache locality
  • Investigate flexible sampling interval
  • Further optimizations of monitoring system
  • Replacing expensive JNI calls
  • Avoid copying of samples

25
Concluding remarks
  • Precise performance event monitoring is possible
    with low overhead (< 2%)
  • Monitoring infrastructure tied into Jikes RVM
    compiler
  • Instruction level information allows
    optimizations to focus on hot spots
  • Good platform to study coupling of compiler
    decisions to hardware-specific platform properties

26
Backup: P4 performance counters
  • P4 stores samples in an OS-supplied buffer
  • Interrupt is only generated when the buffer is filled
  • Lower runtime overhead
  • All registers are stored together with the IP
    (see the sketch below)
  • Possible to obtain data address profiles
  • Only a subset of events is available for PEBS
    (precise event-based sampling)
  • Future architectures may support more

Registers stored with each sample: EAX EBX ECX EDX ESI EDI EBP ESP EIP
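
On the VM side, a decoded sample might be represented roughly as below, mirroring the registers listed above plus the instruction pointer; this is an assumed illustration, not the actual PEBS buffer layout or a Jikes RVM type.

// Sketch of a decoded PEBS-style sample: general-purpose registers plus EIP
// (illustrative only; not the hardware record layout).
public record PebsSample(long eax, long ebx, long ecx, long edx,
                         long esi, long edi, long ebp, long esp,
                         long eip) {

    /** The sampled instruction pointer is what gets mapped back to Java code. */
    public long pc() {
        return eip;
    }
}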