Title: Pipelined Profiling and Analysis on Multi-core Systems
1. PiPA: Pipelined Profiling and Analysis on Multi-core Systems
- Qin Zhao
- Ioana Cutcutache
- Weng-Fai Wong
2. Why PiPA?
- Code profiling and analysis
  - very useful for understanding program behavior
  - implemented using dynamic instrumentation systems
  - several challenges: coverage, accuracy, overhead
    - overhead due to the instrumentation engine
    - overhead due to the profiling code
- The performance problem!
  - Cachegrind: 100x slowdown
  - Pin dcache: 32x slowdown
  - Need faster tools!
3. Our Goals
- Improve the performance
  - reduce the overall profiling and analysis overhead, but maintain the accuracy
- How?
  - parallelize!
  - optimize
- Keep it simple
  - easy to understand
  - easy to build new analysis tools
4. Previous Approach
- Parallelized slice profiling
  - SuperPin, Shadow Profiling
  - suitable for simple, independent tasks
[Figure: timeline comparing the original uninstrumented application, the instrumented application (instrumentation overhead plus profiling overhead), and the SuperPinned application, which runs instrumented slices in parallel]
5. PiPA Key Idea
[Figure: pipeline timeline. Stage 0 runs the instrumented application (instrumentation overhead plus profiling overhead, compared against the original application). Stage 1 threads or processes perform profile processing on the profile information. Stage 2 runs parallel analysis on profiles 1 through 4.]
6. PiPA Challenges
- Minimize the profiling overhead
  - Runtime Execution Profile (REP)
- Minimize the communication between stages
  - double buffering
- Design efficient parallel analysis algorithms
  - we focus on cache simulation
7. PiPA Prototype
8. Our Prototype
- Implemented in DynamoRIO
- Three stages
  - Stage 0: instrumented application collects the REP
  - Stage 1: parallel profile recovery and splitting
  - Stage 2: parallel cache simulation
- Experiments
  - SPEC2000 and SPEC2006 benchmarks
  - 3 systems: dual-core, quad-core, eight-core
9. Communication
- Keys to minimizing the overhead
  - double buffering
  - shared buffers
  - large buffers
- Example: communication between stage 0 and stage 1
[Figure: the profiling thread at stage 0 fills shared buffers, which are drained by the processing threads at stage 1]
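The stage 0 to stage 1 handoff can be sketched as follows. This is an illustrative sketch, not PiPA's actual code: two shared buffers are rotated between a profiling thread and a recovery thread (double buffering), so the profiler fills one buffer while stage 1 drains the other. All names (`profiling_stage`, `recovery_stage`, `BUFFER_SIZE`) are hypothetical.

```python
import queue
import threading

BUFFER_SIZE = 4                # tiny for illustration; PiPA favors large buffers

free_buffers = queue.Queue()   # empty buffers recycled by stage 1
full_buffers = queue.Queue()   # filled buffers handed to stage 1
for _ in range(2):             # two buffers -> double buffering
    free_buffers.put([])

def profiling_stage(events):
    """Stage 0: append profile entries, handing off each full buffer."""
    buf = free_buffers.get()
    for e in events:
        buf.append(e)
        if len(buf) == BUFFER_SIZE:
            full_buffers.put(buf)       # hand off the full buffer...
            buf = free_buffers.get()    # ...and switch to an empty one
    full_buffers.put(buf)               # flush the final, partial buffer
    full_buffers.put(None)              # sentinel: profiling finished

recovered = []

def recovery_stage():
    """Stage 1: drain full buffers, then return them for reuse."""
    while (buf := full_buffers.get()) is not None:
        recovered.extend(buf)           # real stage 1 would rebuild the trace
        buf.clear()
        free_buffers.put(buf)

worker = threading.Thread(target=recovery_stage)
worker.start()
profiling_stage(list(range(10)))
worker.join()
```

With large buffers, the handoff happens rarely, which is why PiPA's communication cost stays low relative to the profiling work itself.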
10. Stage 0: Profiling
- compact profile
- minimal overhead
11. Stage 0: Profiling
- Runtime Execution Profile (REP)
  - fast profiling
  - small profile size
  - easy information extraction
- Hierarchical structure
  - profile buffers
  - data units
  - slots
- Can be customized for different analyses
  - in our prototype we consider cache simulation
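The two-level idea behind the REP can be sketched in a few lines. This is a minimal model, not PiPA's actual definitions: a static side describes each memory reference of a basic block symbolically (so it is recorded once), while the dynamic side written at run time is just a small unit of register values, assuming slot 0 holds the block tag. All field names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RefDesc:             # static description of one memory reference
    pc: int                # instruction address
    type: str              # "read" or "write"
    size: int              # access size in bytes
    offset: int            # constant displacement from the base register
    value_slot: int        # which dynamic slot holds the base register value
    size_slot: int = -1    # -1: size is statically known, no slot needed

@dataclass
class BBDesc:              # static (REPS) entry: all references in one block
    tag: int               # basic-block tag (entry pc)
    refs: List[RefDesc]

# Static side for the bb1 example from the slides: three references,
# but only two distinct base-register values need recording.
bb1 = BBDesc(tag=0x080483D7, refs=[
    RefDesc(pc=0x080483D7, type="read", size=4, offset=12, value_slot=1),
    RefDesc(pc=0x080483DC, type="read", size=4, offset=0,  value_slot=2),
    RefDesc(pc=0x080483DD, type="read", size=4, offset=4,  value_slot=2),
])

# Dynamic (REPD) unit for one execution: tag plus two register values.
rep_unit = [bb1.tag, 0xBF9C4600, 0xBF9C4700]
```

The payoff is size: three full trace records collapse into one small unit per block execution, since pc, type, size, and offset never change and live only in the static side.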
12. REP Example
[Figure: REP layout. A profile base pointer locates the first buffer, which holds one 12-byte REP unit per executed basic block, followed by a canary zone and a link to the next buffer. The dynamic units (REPD) record register values (e.g. eax, esp) in slots. The static side (REPS) describes bb1 (tag 0x080483d7, num_slots 2, num_refs 3; code: mov 0x0c(eax) -> eax, mov ebp -> esp, pop ebp, return) with ref0: pc 0x080483d7, type read, size 4, offset 12, value_slot 1, size_slot -1; then pc 0x080483dc, read, size 4, offset 0, value_slot 2, size_slot -1; then pc 0x080483dd, read, size 4, offset 4, value_slot 2, size_slot -1. bb2 (pop ebx, pop ecx, cmp eax, 0, jz label_bb3) has its own unit recording esp.]
13. Profiling Optimization
- Store register values in the REP
  - avoid computing the memory address
- Register liveness analysis
  - avoid register stealing if possible
- Record a single register value for multiple references
  - a single stack pointer value for a sequence of push/pop
  - the base address for multiple accesses to the same structure
- More in the paper
14. REP Example (optimized)
[Figure: the same REP layout after optimization. The references at pc 0x080483dc and pc 0x080483dd now share value_slot 2, so a single recorded esp value covers the whole pop sequence and fewer slots are written per REP unit.]
15. Profiling Overhead
[Figure: bar charts of slowdown relative to native execution on the 2-core, 4-core, and 8-core systems for SPECint2000, SPECfp2000, and SPEC2000, comparing instrumentation without optimization against optimized instrumentation. Average slowdown with optimization: 3x.]
16. Stage 1: Profile Recovery

17. Stage 1: Profile Recovery
- Need to reconstruct the full memory reference information
  - <pc, address, type, size>
[Figure: recovery example. From the REP unit of bb1 (tag 0x080483d7, recorded slot values 0x2304 and 0x141a) and the static descriptions (ref0: pc 0x080483d7, read, size 4, offset 12, value_slot 1, size_slot -1; ref1: pc 0x080483dc, read, size 4, offset 0, value_slot 2, size_slot -1), stage 1 produces the full trace entries <0x080483d7, 0x2310, read, 4> and <0x080483dc, 0x141a, read, 4>.]
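The recovery step is a simple expansion, sketched below with the slide's own numbers (the tuple layout and function name are our assumptions): each static reference contributes one trace record whose address is the recorded base-register value plus the static offset.

```python
refs = [
    # (pc, type, size, offset, value_slot) -- static REPS side of bb1
    (0x080483D7, "read", 4, 12, 1),
    (0x080483DC, "read", 4, 0,  2),
    (0x080483DD, "read", 4, 4,  2),
]
rep_unit = [0x080483D7, 0x2304, 0x141A]   # slot 0: tag; slots 1..n: values

def recover(refs, unit):
    """Expand one dynamic REP unit into full <pc, address, type, size> records."""
    return [(pc, unit[slot] + offset, typ, size)
            for pc, typ, size, offset, slot in refs]

trace = recover(refs, rep_unit)
# first record: pc 0x080483d7, address 0x2310 (= 0x2304 + 12), read, 4 bytes
```

Because recovery only needs table lookups and one addition per reference, it parallelizes cleanly across the stage 1 threads, each working on its own buffer.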
18. Profile Recovery Overhead
- Factor 1: buffer size
  - experiments done on the 8-core system, using 8 recovery threads
[Figure: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with small (64KB), medium (1MB), and large (16MB) buffers]
19. Profile Recovery Overhead
- Factor 2: the number of recovery threads
  - experiments done on the 8-core system, using 16MB buffers
[Figure: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads]
20. Profile Recovery Overhead
- Factor 3: the number of available cores
  - experiments done using 16MB buffers and 8 recovery threads
21. Profile Recovery Overhead
- Factor 4: the impact of using REP
  - experiments done on the 8-core system with 16MB buffers and 8 threads
[Figure: PiPA using REP (PiPA-REP) incurs a 4.5x slowdown, versus 20.7x for PiPA using the standard profile format <pc, address, type, size> (PiPA-standard)]
22. Stage 2: Cache Simulation
- parallel analysis
- independent simulators
23. Stage 2: Parallel Cache Simulation
- How to parallelize?
  - split the address trace into independent groups
- Set-associative caches
  - partition the cache sets and simulate them using several independent simulators
  - merge the results (number of hits and misses) at the end of the simulation
- Example
  - 32KB cache, 32-byte line, 4-way associative => 256 sets
  - 4 independent simulators, each one simulates 64 sets (round-robin distribution)
  - two memory references that access different sets are independent
[Figure: the recovered trace <pc, address, type, size> is split by set index, e.g. addresses 0xbf9c4614 (set 48), 0xbf9c4705 (set 56), 0xbf9c4a34 (set 81), 0xbf9c4a5c (set 82), 0xbf9c4a60 (set 83), each routed to the simulator that owns that set]
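The example above can be sketched as follows. This is a minimal LRU model of the set-partitioning idea (class and variable names are ours, not PiPA's): each simulator owns every fourth set, accesses are routed by set index, and the per-simulator hit/miss counters are merged at the end.

```python
LINE, WAYS, SETS, NSIM = 32, 4, 256, 4    # 32KB, 32B lines, 4-way => 256 sets

class SetPartitionSim:
    """Simulates only the cache sets assigned to this partition."""
    def __init__(self, sim_id):
        self.sim_id = sim_id
        self.sets = {}                    # set index -> tags in LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        s = (addr // LINE) % SETS         # set index
        assert s % NSIM == self.sim_id    # round-robin routing invariant
        tag = addr // (LINE * SETS)
        lines = self.sets.setdefault(s, [])
        if tag in lines:
            lines.remove(tag)
            lines.append(tag)             # move to MRU position
            self.hits += 1
        else:
            self.misses += 1
            lines.append(tag)
            if len(lines) > WAYS:
                lines.pop(0)              # evict the LRU line

sims = [SetPartitionSim(i) for i in range(NSIM)]
# addresses from the slide (set indices 48, 56, 81, 82, 83), plus a repeat
trace = [0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C4A5C, 0xBF9C4A60,
         0xBF9C4614]
for addr in trace:
    sims[((addr // LINE) % SETS) % NSIM].access(addr)

hits = sum(s.hits for s in sims)          # merge results at the end
misses = sum(s.misses for s in sims)
```

Since a cache set's state depends only on references mapping to that set, the partitions never share state and the simulators can run fully in parallel; summing the counters at the end gives the same totals as one sequential simulator.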
24. Cache Simulation Overhead
- Experiments done on the 8-core system
  - 8 recovery threads and 8 cache simulators
- PiPA: 10.5x slowdown vs. Pin dcache: 32x slowdown
  - PiPA speedup over dcache: 3x
25. SPEC2006 Results
- Experiments done using the 8-core system
- Slowdowns: profiling 3x, profiling + recovery 3.7x, full cache simulation 10.2x
- Average speedup over dcache: 3.27x
26. Summary
- PiPA is an effective technique for parallel profiling and analysis
  - based on pipelining
  - drastically reduces both profiling time and analysis time
  - full cache simulation incurs only a 10.5x slowdown
- Runtime Execution Profile
  - requires minimal instrumentation code
  - compact enough to ensure optimal buffer usage
  - makes it easy for the next stages to recover the full trace
- Parallel cache simulation
  - the cache sets are partitioned across several independent simulators
27. Future Work
- Design APIs
  - hide the communication between the pipeline stages
  - focus only on the instrumentation and analysis tasks
- Further improve the efficiency
  - parallel profiling
  - workload monitoring
- More analysis algorithms
  - branch prediction simulation
  - memory dependence analysis
  - ...
28. Pin Prototype
- Second implementation in Pin
- Preliminary results
  - 2.6x speedup over Pin dcache
- Plan to release PiPA
  - www.comp.nus.edu.sg/ioana