Pipelined Profiling and Analysis on Multi-core Systems (Transcript)
1
Pipelined Profiling and Analysis on Multi-core
Systems
PiPA
  • Qin Zhao
  • Ioana Cutcutache
  • Weng-Fai Wong

2
Why PiPA?
  • Code profiling and analysis
  • very useful for understanding program behavior
  • implemented using dynamic instrumentation systems
  • several challenges: coverage, accuracy, overhead
  • overhead due to instrumentation engine
  • overhead due to profiling code
  • The performance problem!
  • Cachegrind - 100x slowdown
  • Pin dcache - 32x slowdown
  • Need faster tools!

3
Our Goals
  • Improve the performance
  • reduce the overall profiling and analysis
    overhead
  • but maintain the accuracy
  • How?
  • parallelize!
  • optimize
  • Keep it simple
  • easy to understand
  • easy to build new analysis tools

4
Previous Approach
  • Parallelized slice profiling
  • SuperPin, Shadow Profiling
  • Suitable for simple, independent tasks

[Diagram: execution timelines comparing the original uninstrumented application; the instrumented application, lengthened by instrumentation and profiling overhead; and the SuperPinned application, which runs instrumented slices in parallel.]
5
PiPA Key Idea
  • Pipelining!

[Diagram: PiPA pipeline over time. Stage 0 runs the instrumented application (instrumentation and profiling overhead), stage 1 processes the profile, and stage 2 runs parallel analyses on profiles 1-4. Stages run as separate threads or processes that pass profile information along.]
6
PiPA Challenges
  • Minimize the profiling overhead
  • Runtime Execution Profile (REP)
  • Minimize the communication between stages
  • double buffering
  • Design efficient parallel analysis algorithms
  • we focus on cache simulation

7
PiPA Prototype
  • Cache Simulation

8
Our Prototype
  • Implemented in DynamoRIO
  • Three stages
  • Stage 0: instrumented application collects the REP
  • Stage 1: parallel profile recovery and splitting
  • Stage 2: parallel cache simulation
  • Experiments
  • SPEC2000 and SPEC2006 benchmarks
  • 3 systems: dual-core, quad-core, eight-core

9
Communication
  • Keys to minimizing the overhead
  • double buffering
  • shared buffers
  • large buffers
  • Example communication between stage 0 and stage
    1

[Diagram: the profiling thread at stage 0 fills shared buffers that are drained by processing threads at stage 1.]
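The buffer handoff between stage 0 and stage 1 can be sketched as double buffering: the profiling thread fills one shared buffer while a processing thread drains another, swapping when a buffer fills. This is a minimal Python sketch (not PiPA's actual DynamoRIO implementation); buffer sizes and the event stream are made up for illustration.

```python
import queue
import threading

BUFFER_SLOTS = 4          # entries per shared buffer (tiny, for the example)
NUM_BUFFERS = 2           # double buffering: one is filled while one is drained

free_buffers = queue.Queue()   # empty buffers, ready for the profiling thread
full_buffers = queue.Queue()   # filled buffers, ready for a processing thread
for _ in range(NUM_BUFFERS):
    free_buffers.put([])

recovered = []

def stage0_profile(events):
    """Stage 0: append events to the current buffer; hand it off when full."""
    buf = free_buffers.get()
    for ev in events:
        buf.append(ev)
        if len(buf) == BUFFER_SLOTS:
            full_buffers.put(buf)          # publish the full buffer to stage 1
            buf = free_buffers.get()       # grab (or wait for) an empty one
    full_buffers.put(buf)                  # flush the partially filled last buffer
    full_buffers.put(None)                 # sentinel: no more profile data

def stage1_process():
    """Stage 1: drain full buffers and recycle them."""
    while True:
        buf = full_buffers.get()
        if buf is None:
            break
        recovered.extend(buf)              # stand-in for profile recovery work
        free_buffers.put([])               # recycle an empty buffer

t = threading.Thread(target=stage1_process)
t.start()
stage0_profile(range(10))
t.join()
print(recovered)   # all 10 events arrive, in order
```

Because the queues block when empty, stage 0 naturally stalls only when stage 1 falls behind on every buffer, which is why large buffers reduce communication overhead.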
10
Stage 0 Profiling
  • compact profile
  • minimal overhead

11
Stage 0 Profiling
  • Runtime Execution Profile (REP)
  • fast profiling
  • small profile size
  • easy information extraction
  • Hierarchical Structure
  • profile buffers
  • data units
  • slots
  • Can be customized for different analyses
  • in our prototype we consider cache simulation
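The hierarchical REP structure above can be sketched as data types. The field names (tag, num_slots, num_refs, pc, type, size, offset, value_slot, size_slot) follow the REP example on the next slide; the Python classes themselves are an illustrative model, not PiPA's in-memory layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemRef:
    """Static description of one memory reference (the REPS side)."""
    pc: int          # address of the instruction
    type: str        # 'read' or 'write'
    size: int        # access size in bytes
    offset: int      # constant offset from the recorded base register
    value_slot: int  # which dynamic slot holds the base register value
    size_slot: int   # slot for a dynamic size, or -1 if the size is static

@dataclass
class RepUnit:
    """One REP unit: static reference info plus dynamic slots (REPD)."""
    tag: int             # basic-block tag
    refs: List[MemRef]   # static per-reference descriptions
    slots: List[int]     # register values recorded for this execution

# Modeled on the slide's example: bb1 tagged 0x080483d7 with a 4-byte read
# at offset 12 whose base register value sits in slot 1 (value 0x2304).
unit = RepUnit(
    tag=0x080483D7,
    refs=[MemRef(pc=0x080483D7, type='read', size=4,
                 offset=12, value_slot=1, size_slot=-1)],
    slots=[0, 0x2304],   # slot 1 holds the recorded register value
)
ref = unit.refs[0]
address = unit.slots[ref.value_slot] + ref.offset
print(hex(address))   # 0x2310
```

Storing a raw register value plus static offsets is what keeps the profile both compact and cheap to produce: the instrumented code never computes effective addresses itself.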

12
REP Example
[Figure: REP layout. A profile base pointer locates the current buffer, which holds one REP unit per basic block (12 bytes each in this example). The static part (REPS) for bb1 records tag 0x080483d7, num_slots 2, num_refs 3; ref0 is a read of size 4 at pc 0x080483d7 with offset 12, value_slot 1, size_slot -1, and the later references at pc 0x080483dc and 0x080483dd (reads of size 4, offsets 0 and 4) use value_slot 2. The dynamic part (REPD) holds the recorded register values (eax, esp). A canary zone marks the end of each buffer before the next one.]

13
Profiling Optimization
  • Store register values in REP
  • avoid computing the memory address
  • Register liveness analysis
  • avoid register stealing if possible
  • Record a single register value for multiple
    references
  • a single stack pointer value for a sequence of
    push/pop
  • the base address for multiple accesses to the
    same structure
  • More in the paper
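The third optimization above can be made concrete: for a run of pops, only one stack pointer value is recorded, and each reference carries a static offset from it. A tiny sketch (the esp value is hypothetical, not from the slides):

```python
# One recorded esp value serves a whole sequence of stack accesses:
# each reference stores only a static offset from that shared base.
esp = 0xBF9C4620          # recorded once, at basic-block entry (hypothetical)

# pop ebp; pop ebx; pop ecx -> three 4-byte reads at increasing offsets
pop_offsets = [0, 4, 8]
addresses = [esp + off for off in pop_offsets]
print([hex(a) for a in addresses])
```

Recording one value instead of three shrinks both the instrumentation code and the profile, at the cost of a trivial addition during recovery in stage 1.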

14
REP Example
[Figure: the REP example from slide 12 revisited after optimization, highlighting that the references at pc 0x080483dc and 0x080483dd share a single recorded register value (value_slot 2), so only one esp value is stored for the pop sequence.]
15
Profiling Overhead
[Charts: profiling slowdown relative to native execution on the 2-core, 4-core, and 8-core systems, for SPECint2000, SPECfp2000, and SPEC2000, with and without instrumentation optimization. Average slowdown with optimized instrumentation: 3x.]
16
Stage 1 Profile Recovery
  • fast recovery

17
Stage 1 Profile Recovery
  • Need to reconstruct the full memory reference
    information
  • <pc, address, type, size>
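Recovery amounts to walking each REP unit and adding every reference's static offset to the register value in its slot. A minimal sketch, with the pcs, offsets, and slot values taken from the slide's REP example (plain dicts stand in for the real unit layout):

```python
# Recover full <pc, address, type, size> records from one REP unit.
reps = [  # static descriptions, one per memory reference
    {'pc': 0x080483D7, 'type': 'read', 'size': 4, 'offset': 12, 'value_slot': 1},
    {'pc': 0x080483DC, 'type': 'read', 'size': 4, 'offset': 0,  'value_slot': 2},
]
repd = {1: 0x2304, 2: 0x141A}   # dynamic slots: recorded register values

trace = [(r['pc'], repd[r['value_slot']] + r['offset'], r['type'], r['size'])
         for r in reps]
for pc, addr, typ, size in trace:
    print(hex(pc), hex(addr), typ, size)
```

Since the static descriptions are fixed per basic block, a recovery thread only reads the small dynamic slots for each executed block, which is what makes parallel recovery cheap.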

[Figure: profile recovery. The REP unit for bb1 (tag 0x080483d7, num_slots 2, num_refs 3) is expanded into full trace records by adding each reference's static offset to the register value stored in its value slot, e.g. <0x080483d7, 0x2310, read, 4> and <0x080483dc, 0x141a, read, 4>.]
18
Profile Recovery Overhead
  • Factor 1: buffer size
  • Experiments done on the 8-core system, using 8
    recovery threads

[Chart: slowdown relative to native execution on SPECint2000, SPECfp2000, and SPEC2000 with small (64K), medium (1M), and large (16M) buffers.]
19
Profile Recovery Overhead
  • Factor 2: the number of recovery threads
  • Experiments done on the 8-core system, using 16MB
    buffers

[Chart: slowdown relative to native execution on SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads.]
20
Profile Recovery Overhead
  • Factor 3: the number of available cores
  • Experiments done using 16MB buffers and 8
    recovery threads

21
Profile Recovery Overhead
  • Factor 4: the impact of using REP
  • experiments done on the 8-core system with 16MB
    buffers and 8 threads

[Chart: PiPA using REP (PiPA-REP) incurs a 4.5x slowdown vs 20.7x for PiPA using the standard profile format <pc, address, type, size>.]
22
Stage 2 Cache Simulation
  • parallel analysis
  • independent simulators

23
Stage 2 Parallel Cache Simulation
  • How to parallelize?
  • split the address trace into independent groups
  • Set associative caches
  • partition the cache sets and simulate them using
    several independent simulators
  • merge the results (no of hits and misses) at the
    end of the simulation
  • Example
  • 32K cache, 32-byte line, 4-way associative => 256
    sets
  • 4 independent simulators, each one simulates 64
    sets (round-robin distribution)
  • two memory references that access different
    sets are independent
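The partitioning scheme above can be sketched directly: compute each reference's set index and route it round-robin to one of the independent simulators. This is an illustrative Python sketch, not PiPA's simulator code; the addresses are the ones shown in the slide's trace.

```python
LINE_SIZE = 32         # bytes per cache line
NUM_SETS = 256         # 32K cache, 32-byte line, 4-way associative => 256 sets
NUM_SIMS = 4           # independent simulators, 64 sets each (round-robin)

def set_index(addr):
    """Set index: line number modulo the number of sets."""
    return (addr // LINE_SIZE) % NUM_SETS

def simulator_for(addr):
    # Round-robin distribution: set s is handled by simulator s % NUM_SIMS.
    return set_index(addr) % NUM_SIMS

# Addresses from the slide's trace: references that map to different
# sets are independent, so the simulators never need to synchronize.
for addr in (0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C460D):
    print(hex(addr), 'set', set_index(addr), '-> simulator', simulator_for(addr))
```

Each simulator keeps hit and miss counters for its own sets only; because no cache set is shared between simulators, the per-simulator counts can simply be summed at the end of the run.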

[Table: the recovered trace <PC, address, type, size> is split by set index among 4 simulators (round-robin over the 256 sets); e.g. 0xbf9c4614 and 0xbf9c460d both map to set 48 and go to the same simulator, while 0xbf9c4705 (set 56), 0xbf9c4a34 (set 81), 0xbf9c4a5c (set 82), and 0xbf9c4a60 (set 83) map to other sets.]
24
Cache Simulation Overhead
  • Experiments done on the 8-core system
  • 8 recovery threads and 8 cache simulators

[Chart: full cache simulation slowdown relative to native execution: PiPA 10.5x vs Pin dcache 32x, i.e. a 3x speedup over dcache.]
25
SPEC 2006 Results
  • Experiments done using the 8-core system

[Chart: SPEC2006 slowdowns: profiling 3x, profiling + recovery 3.7x, full cache simulation 10.2x; average speedup over dcache 3.27x.]
26
Summary
  • PiPA is an effective technique for parallel
    profiling and analysis
  • based on pipelining
  • drastically reduces both
  • profiling time
  • analysis time
  • full cache simulation incurs only 10.5x slowdown
  • Runtime Execution Profile
  • requires minimal instrumentation code
  • compact enough to ensure optimal buffer usage
  • makes it easy for next stages to recover the full
    trace
  • Parallel cache simulation
  • the cache sets are partitioned among several
    independent simulators

27
Future Work
  • Design APIs
  • hide the communication between the pipeline
    stages
  • focus only on the instrumentation and analysis
    tasks
  • Further improve the efficiency
  • parallel profiling
  • workload monitoring
  • More analysis algorithms
  • branch prediction simulation
  • memory dependence analysis
  • ...

28
Pin Prototype
  • Second implementation in Pin
  • Preliminary results
  • 2.6x speedup over Pin dcache
  • Plan to release PiPA
  • www.comp.nus.edu.sg/ioana