Transcript and Presenter's Notes

Title: Shadow Profiling: Hiding Instrumentation Costs with Parallelism


1
Shadow Profiling: Hiding Instrumentation Costs
with Parallelism
  • Tipp Moseley
  • Alex Shye
  • Vijay Janapa Reddi
  • Dirk Grunwald
  • (University of Colorado)
  • Ramesh Peri
  • (Intel Corporation)

2
Motivation
  • An ideal profiler would
  • Collect arbitrarily detailed and abundant
    information
  • Incur negligible overhead
  • A real profiler, e.g. one built on Pin, satisfies
    the first condition
  • But the cost is high
  • 3X for basic-block (BBL) counting
  • 25X for loop profiling
  • 50X or higher for memory profiling
  • A real profiler, e.g. PMU sampling or code
    patching, satisfies the second condition
  • But the detail is very coarse

3
Motivation
(Figure: a spectrum of profiling approaches, from low-overhead/coarse to
high-overhead/detailed. Sampling and patching tools such as VTune, DCPI,
OProfile, PAPI, pfmon, and Pin Probes sit at the low-overhead end;
instrumentation tools such as Pintools, Valgrind, and ATOM sit at the
detailed end; Bursty Tracing (sampled instrumentation), novel hardware,
and Shadow Profiling aim to bridge the gap.)
4
Goal
  • To create a profiler capable of collecting
    detailed, abundant information while incurring
    negligible overhead
  • Enable developers to focus on other things

5
The Big Idea
  • Stems from fault tolerance work on deterministic
    replication
  • Periodically fork(), then profile the shadow
    processes (see the sketch below the table)

Time  CPU 0           CPU 1     CPU 2     CPU 3
0     Orig. Slice 0   Slice 0   -         -
1     Orig. Slice 1   Slice 0   Slice 1   -
2     Orig. Slice 2   Slice 0   Slice 1   Slice 2
3     Orig. Slice 3   Slice 3   Slice 1   Slice 2
4     Orig. Slice 4   Slice 3   Slice 4   Slice 2
5     -               Slice 3   Slice 4   -
6     -               -         Slice 4   -
Assuming instrumentation overhead of 3X
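
A minimal sketch of the periodic-fork idea, outside of Pin;
enable_instrumentation() and profile_slice() are hypothetical placeholders,
not the authors' API:

  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static void enable_instrumentation(void) { /* hypothetical: switch to heavyweight profiling */ }
  static void profile_slice(long id)       { printf("profiling slice %ld\n", id); }

  static void maybe_fork_shadow(long slice_id)
  {
      pid_t pid = fork();
      if (pid == 0) {                  /* shadow process */
          enable_instrumentation();    /* pay the instrumentation cost here, on a spare core */
          profile_slice(slice_id);
          _exit(0);                    /* the shadow must never affect the real system */
      }
      /* pid > 0: the original continues natively without waiting */
  }

  int main(void)
  {
      signal(SIGCHLD, SIG_IGN);        /* auto-reap finished shadow processes */
      for (long slice = 0; slice < 5; slice++) {
          maybe_fork_shadow(slice);
          /* ... the original's work for this slice runs here ... */
      }
      return 0;
  }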
6
Challenges
  • Threads
  • Shared Memory
  • Asynchronous Interrupts
  • System Calls
  • JIT overhead
  • Overhead vs. Number of CPUs
  • Maximum speedup is the number of CPUs
  • If profiler overhead is 50X, at least 51 CPUs are
    needed to run in real time (one for the original
    run plus 50 to keep shadow slices from falling
    behind), and probably many more
  • Too many complications to ensure deterministic
    replication

7
Goal (Revised)
  • To create a profiler capable of sampling detailed
    traces (bursts) with negligible overhead
  • Trade abundance for low overhead
  • Like SimPoints or SMARTS (but not as smart)

8
The Big Idea (revised)
  • Do not strive for a full, deterministic replica
  • Instead, profile many short, mostly deterministic
    bursts
  • Profile a fixed number of instructions per burst
  • Fake it for system calls
  • Must not allow the shadow to side-effect the
    system

Time  CPU 0           CPU 1     CPU 2     CPU 3
0     Orig. Slice 0   Slice 0   -         Spyware
1     Orig. Slice 1   Slice 0   -         Spyware
2     Orig. Slice 2   Slice 0   Slice 1   Spyware
3     Orig. Slice 3   -         Slice 1   Spyware
4     Orig. Slice 4   -         Slice 1   Spyware
9
Design Overview
10
Design Overview
  • Monitor uses Pin Probes (code patching)
  • Application runs natively
  • Monitor receives periodic timer signal and
    decides when to fork()
  • After fork(), child uses PIN_ExecuteAt()
    functionality to switch Pin from Probe to JIT
    mode.
  • Shadow process profiles as usual, except handling
    of special cases
  • Monitor logs special read() system calls and
    pipes the results to shadow processes (see the
    sketch below)
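
A hedged sketch of the monitor side. The Pin Probe / PIN_ExecuteAt() switch
is elided, and should_fork() plus the pipe protocol are assumptions for
illustration, not the authors' exact mechanism:

  #include <signal.h>
  #include <sys/time.h>
  #include <unistd.h>

  static volatile sig_atomic_t timer_fired = 0;
  static void on_timer(int sig) { (void)sig; timer_fired = 1; }
  static int should_fork(void)  { return timer_fired; }   /* hypothetical sampling policy */

  int main(void)
  {
      signal(SIGALRM, on_timer);
      signal(SIGCHLD, SIG_IGN);                  /* auto-reap finished shadows */
      struct itimerval iv = { {0, 100000}, {0, 100000} };
      setitimer(ITIMER_REAL, &iv, NULL);         /* periodic timer, here every 100 ms */

      for (;;) {
          pause();                               /* wait for the next timer tick */
          if (!should_fork()) continue;
          timer_fired = 0;

          int log_pipe[2];
          if (pipe(log_pipe) < 0) continue;      /* channel for logged read() data */
          if (fork() == 0) {
              close(log_pipe[1]);                /* shadow: keep only the read end */
              /* here the shadow would switch from native (Probe) execution to
                 instrumented (JIT) execution, profile one burst, then exit */
              _exit(0);
          }
          close(log_pipe[0]);                    /* monitor: keep the write end and
                                                    later forward logged read() data */
      }
  }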

11
System Calls
  • For SPEC CPU2000, system calls occur around 35
    times per second
  • Forking after each one puts heavy pressure on
    copy-on-write (CoW) pages and the Pin JIT engine
  • 95% of dynamic system calls can be safely handled
  • Some system calls can be allowed to execute (49%)
  • getrusage, _llseek, times, time, brk, munmap,
    fstat64, close, stat64, umask, getcwd, uname,
    access, exit_group,

12
System Calls
  • Some can be replaced with success assumed (39%)
  • write, ftruncate, writev, unlink, rename,
  • Some are handled specially, but execution may
    continue (1.8%)
  • mmap2, open(creat), mmap, mprotect, mremap, fcntl
  • read() is special (5.4%)
  • For reads from pipes/sockets, the data must be
    logged from the original app
  • For reads from files, the file must be closed and
    reopened after the fork() because the OS file
    pointer is not duplicated
  • ioctl() is special (4.8%)
  • Frequent in perlbmk
  • Behavior is device-dependent, so the safest action
    is to simply terminate the segment and re-fork()
    (a dispatch sketch follows)
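
A rough sketch of how the shadow's system-call filter could be organized
around these categories. The classification below (using Linux syscall
numbers) is illustrative, not the authors' exact lists or interception
mechanism:

  #include <sys/syscall.h>

  typedef enum { SC_ALLOW, SC_FAKE_SUCCESS, SC_SPECIAL, SC_TERMINATE } sc_action;

  /* Classify a system call number the way the slides describe:
     harmless calls execute, externally visible ones are faked,
     a few need special handling, and anything unsafe ends the burst. */
  static sc_action classify(long nr)
  {
      switch (nr) {
      case SYS_brk:  case SYS_times:  case SYS_close: case SYS_uname:
          return SC_ALLOW;           /* no external side effects: let it run */
      case SYS_write: case SYS_unlink: case SYS_rename:
          return SC_FAKE_SUCCESS;    /* suppress the call, return "success" */
      case SYS_read: case SYS_mmap:  case SYS_mprotect: case SYS_fcntl:
          return SC_SPECIAL;         /* per-call handling as described above */
      case SYS_ioctl:
      default:
          return SC_TERMINATE;       /* device-dependent/unknown: end burst, re-fork */
      }
  }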

13
Other Issues
  • Shared Memory
  • Disallow writes to shared memory
  • Asynchronous Interrupts (Userspace signals)
  • Since we are only mostly deterministic, no longer
    an issue
  • When the main program receives a signal, pass it
    along to live shadow children (see the sketch
    below)
  • JIT Overhead
  • After each fork(), it is like Pinning a new
    program
  • Warmup is too slow
  • Use Persistent Code Caching [CGO'07]
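
One way the signal forwarding could look, assuming the monitor keeps a list
of live shadow PIDs; shadow_pids[] and nshadows are hypothetical
bookkeeping, not part of the described tool:

  #include <signal.h>
  #include <sys/types.h>

  #define MAX_SHADOWS 16
  static pid_t shadow_pids[MAX_SHADOWS];   /* filled in as shadows are forked */
  static int   nshadows;

  /* Called when the original application receives a user-space signal:
     re-deliver the same signal to every live shadow process. */
  static void forward_signal(int sig)
  {
      for (int i = 0; i < nshadows; i++)
          if (shadow_pids[i] > 0)
              kill(shadow_pids[i], sig);
  }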

14
Multithreaded Programs
  • Issue: fork() does not duplicate all threads
  • Only the thread that called fork() survives
  • Solution
  • Barrier all threads in the program and store
    their CPU state
  • Fork the process and clone new threads for those
    that were destroyed
  • The address space is identical; only register
    state was really lost
  • In each new thread, restore the previous CPU
    state
  • Modified clone() handling in the Pin VM
  • Continue execution, virtualizing thread IDs for
    relevant system calls (rough sketch below)
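
A rough pthreads/ucontext sketch of the barrier-save-fork-recreate scheme.
The real tool does this with modified clone() handling inside the Pin VM;
the structure below is an assumption for illustration only:

  #include <pthread.h>
  #include <ucontext.h>
  #include <unistd.h>

  #define NTHREADS 4
  static ucontext_t saved_ctx[NTHREADS];   /* per-thread CPU state saved pre-fork */
  static volatile int in_shadow;           /* set in the child after fork() */

  /* Each worker saves its register state, then parks at the barrier (elided);
     if it later wakes up inside the shadow, it simply resumes from here. */
  static void save_state_and_park(long idx)
  {
      getcontext(&saved_ctx[idx]);
      if (in_shadow)
          return;                          /* resumed in the shadow: keep working */
      pause();                             /* original process: wait out the fork */
  }

  /* Shadow-side helper: a freshly created thread jumps back into the
     context its counterpart saved before the fork. */
  static void *resume_thread(void *arg)
  {
      setcontext(&saved_ctx[(long)arg]);   /* does not return */
      return NULL;
  }

  static void fork_with_threads(void)
  {
      if (fork() == 0) {                   /* shadow: address space is intact, but
                                              only the forking thread survived */
          in_shadow = 1;
          for (long i = 1; i < NTHREADS; i++) {
              pthread_t t;
              pthread_create(&t, NULL, resume_thread, (void *)i);
          }
      }
      /* original: release the barrier and continue natively */
  }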

15
Tuning Overhead
  • Load
  • Number of active shadow processes
  • Tested 0.125, 0.25, 0.5, 1.0, 2.0
  • Sample Size
  • Number of instructions to profile
  • Longer samples for less overhead, more data
  • Shorter samples for more evenly dispersed data
  • Tested 1M, 10M, and 100M instructions (an
    illustrative relation between the knobs follows)
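
One plausible way these two knobs interact, reading "load" as the average
number of concurrently active shadow processes; this relation is an
assumption for illustration, not taken from the presentation:

  /* Assumption: a burst of S instructions under a K-times slowdown keeps a
     shadow alive for roughly K*S original-instruction-times, so forking
     every K*S / L instructions keeps about L shadows active on average. */
  static long fork_interval(long sample_size, double slowdown, double load)
  {
      return (long)(sample_size * slowdown / load);
  }

  /* Example: sample_size = 10M, slowdown = 100X, load = 0.5
     => fork roughly every 2 billion original instructions. */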

16
Experiments
  • Value Profiling
  • Typical overhead 100X
  • Accuracy measured by Difference in Invariance
  • Path Profiling
  • Typical overhead 50 - 10X
  • Accuracy measured by the percentage of hot paths
    detected (2% threshold)
  • All experiments use the SPEC2000 INT benchmarks
    with the ref data set
  • The arithmetic mean of 3 runs is presented

17
Results - Value Profiling Overhead
  • Overhead versus native execution
  • Several configurations less than 1%
  • Path profiling exhibits similar trends

18
Results - Value Profiling Accuracy
  • All configurations within 7% of a perfect profile
  • Lower is better

19
Results - Path Profiling Accuracy
  • Most configurations over 90% accurate
  • Higher is better
  • Some benchmarks (e.g., 176.gcc, 186.crafty,
    187.parser) have millions of paths, but few are
    hot

20
Results - Page Fault Increase
  • Proportional increase in page faults
  • Shadow/Native

21
Results - Page Fault Rate
  • Difference in page faults per second experienced
    by native application

22
Future Work
  • Improve stability for multithreaded programs
  • Investigate effects of different persistent code
    cache policies
  • Compare sampling policies
  • Random (current)
  • Phase/event-based
  • Static analysis
  • Study convergence
  • Apply technique
  • Profile-guided optimizations
  • Simulation techniques

23
Conclusion
  • Shadow Profiling allows collection of bursts of
    detailed traces
  • Accuracy is over 90%
  • Incurs negligible overhead
  • Often less than 1%
  • With increasing numbers of cores, allows developer
    focus to shift from profiling to applying
    optimizations