Title: Pipelined Profiling and Analysis on Multi-core Systems
1. PiPA: Pipelined Profiling and Analysis on Multi-core Systems
- Qin Zhao
- Ioana Cutcutache
- Weng-Fai Wong
2. Why PiPA?
- Code profiling and analysis
  - very useful for understanding program behavior
  - implemented using dynamic instrumentation systems
  - several challenges: coverage, accuracy, overhead
    - overhead due to the instrumentation engine
    - overhead due to the profiling code
- The performance problem!
  - Cachegrind: 100x slowdown
  - Pin dcache: 32x slowdown
  - Need faster tools!
3. Our Goals
- Improve the performance
  - reduce the overall profiling and analysis overhead, but maintain the accuracy
- How?
  - parallelize!
  - optimize
- Keep it simple
  - easy to understand
  - easy to build new analysis tools
4. Previous Approach
- Parallelized slice profiling
  - SuperPin, Shadow Profiling
  - suitable for simple, independent tasks
[Figure: timeline comparing the original uninstrumented application, the instrumented application (instrumentation overhead plus profiling overhead), and the SuperPinned application, which runs instrumented slices in parallel]
5. PiPA Key Idea
[Figure: pipeline timeline. Stage 0 runs the instrumented application (instrumentation overhead plus profiling overhead, compared against the original application). Stage 1 threads or processes perform profile processing on the profile information. Stage 2 runs parallel analysis on profiles 1 through 4.]
6. PiPA Challenges
- Minimize the profiling overhead
  - Runtime Execution Profile (REP)
- Minimize the communication between stages
  - double buffering
- Design efficient parallel analysis algorithms
  - we focus on cache simulation
7. PiPA Prototype
8. Our Prototype
- Implemented in DynamoRIO
- Three stages
  - Stage 0: instrumented application collects the REP
  - Stage 1: parallel profile recovery and splitting
  - Stage 2: parallel cache simulation
- Experiments
  - SPEC2000 and SPEC2006 benchmarks
  - 3 systems: dual-core, quad-core, eight-core
9. Communication
- Keys to minimizing the overhead
  - double buffering
  - shared buffers
  - large buffers
- Example: communication between stage 0 and stage 1
[Figure: the profiling thread at stage 0 fills shared buffers, which are drained by the processing threads at stage 1]
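The stage 0 to stage 1 handoff can be sketched as follows. This is an illustrative sketch, not PiPA's actual code: two shared buffers are rotated between a profiling thread and a recovery thread (double buffering), so the profiler fills one buffer while stage 1 drains the other. All names (`profiling_stage`, `recovery_stage`, `BUFFER_SIZE`) are hypothetical.

```python
import queue
import threading

BUFFER_SIZE = 4                # tiny for illustration; PiPA favors large buffers

free_buffers = queue.Queue()   # empty buffers recycled by stage 1
full_buffers = queue.Queue()   # filled buffers handed to stage 1
for _ in range(2):             # two buffers -> double buffering
    free_buffers.put([])

def profiling_stage(events):
    """Stage 0: append profile entries, handing off each full buffer."""
    buf = free_buffers.get()
    for e in events:
        buf.append(e)
        if len(buf) == BUFFER_SIZE:
            full_buffers.put(buf)       # hand off the full buffer...
            buf = free_buffers.get()    # ...and switch to an empty one
    full_buffers.put(buf)               # flush the final, partial buffer
    full_buffers.put(None)              # sentinel: profiling finished

recovered = []

def recovery_stage():
    """Stage 1: drain full buffers, then return them for reuse."""
    while (buf := full_buffers.get()) is not None:
        recovered.extend(buf)           # real stage 1 would rebuild the trace
        buf.clear()
        free_buffers.put(buf)

worker = threading.Thread(target=recovery_stage)
worker.start()
profiling_stage(list(range(10)))
worker.join()
```

With large buffers, the handoff happens rarely, which is why PiPA's communication cost stays low relative to the profiling work itself.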
10. Stage 0: Profiling
- compact profile
- minimal overhead
11. Stage 0: Profiling
- Runtime Execution Profile (REP)
  - fast profiling
  - small profile size
  - easy information extraction
- Hierarchical structure
  - profile buffers
  - data units
  - slots
- Can be customized for different analyses
  - in our prototype we consider cache simulation
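The two-level idea behind the REP can be sketched in a few lines. This is a minimal model, not PiPA's actual definitions: a static side describes each memory reference of a basic block symbolically (so it is recorded once), while the dynamic side written at run time is just a small unit of register values, assuming slot 0 holds the block tag. All field names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RefDesc:             # static description of one memory reference
    pc: int                # instruction address
    type: str              # "read" or "write"
    size: int              # access size in bytes
    offset: int            # constant displacement from the base register
    value_slot: int        # which dynamic slot holds the base register value
    size_slot: int = -1    # -1: size is statically known, no slot needed

@dataclass
class BBDesc:              # static (REPS) entry: all references in one block
    tag: int               # basic-block tag (entry pc)
    refs: List[RefDesc]

# Static side for the bb1 example from the slides: three references,
# but only two distinct base-register values need recording.
bb1 = BBDesc(tag=0x080483D7, refs=[
    RefDesc(pc=0x080483D7, type="read", size=4, offset=12, value_slot=1),
    RefDesc(pc=0x080483DC, type="read", size=4, offset=0,  value_slot=2),
    RefDesc(pc=0x080483DD, type="read", size=4, offset=4,  value_slot=2),
])

# Dynamic (REPD) unit for one execution: tag plus two register values.
rep_unit = [bb1.tag, 0xBF9C4600, 0xBF9C4700]
```

The payoff is size: three full trace records collapse into one small unit per block execution, since pc, type, size, and offset never change and live only in the static side.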
12. REP Example
[Figure: REP layout. A profile base pointer locates the first buffer, which holds one 12-byte REP unit per executed basic block, followed by a canary zone and a link to the next buffer. The dynamic units (REPD) record register values (e.g. eax, esp) in slots. The static side (REPS) describes bb1 (tag 0x080483d7, num_slots 2, num_refs 3; code: mov 0x0c(eax) -> eax, mov ebp -> esp, pop ebp, return) with ref0: pc 0x080483d7, type read, size 4, offset 12, value_slot 1, size_slot -1; then pc 0x080483dc, read, size 4, offset 0, value_slot 2, size_slot -1; then pc 0x080483dd, read, size 4, offset 4, value_slot 2, size_slot -1. bb2 (pop ebx, pop ecx, cmp eax, 0, jz label_bb3) has its own unit recording esp.]
13. Profiling Optimization
- Store register values in the REP
  - avoid computing the memory address
- Register liveness analysis
  - avoid register stealing if possible
- Record a single register value for multiple references
  - a single stack pointer value for a sequence of push/pop
  - the base address for multiple accesses to the same structure
- More in the paper
14. REP Example (optimized)
[Figure: the same REP layout after optimization. The references at pc 0x080483dc and pc 0x080483dd now share value_slot 2, so a single recorded esp value covers the whole pop sequence and fewer slots are written per REP unit.]
15. Profiling Overhead
[Figure: bar charts of slowdown relative to native execution on the 2-core, 4-core, and 8-core systems for SPECint2000, SPECfp2000, and SPEC2000, comparing instrumentation without optimization against optimized instrumentation. Average slowdown with optimization: 3x.]
16. Stage 1: Profile Recovery

17. Stage 1: Profile Recovery
- Need to reconstruct the full memory reference information
  - <pc, address, type, size>
[Figure: recovery example. From the REP unit of bb1 (tag 0x080483d7, recorded slot values 0x2304 and 0x141a) and the static descriptions (ref0: pc 0x080483d7, read, size 4, offset 12, value_slot 1, size_slot -1; ref1: pc 0x080483dc, read, size 4, offset 0, value_slot 2, size_slot -1), stage 1 produces the full trace entries <0x080483d7, 0x2310, read, 4> and <0x080483dc, 0x141a, read, 4>.]
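The recovery step is a simple expansion, sketched below with the slide's own numbers (the tuple layout and function name are our assumptions): each static reference contributes one trace record whose address is the recorded base-register value plus the static offset.

```python
refs = [
    # (pc, type, size, offset, value_slot) -- static REPS side of bb1
    (0x080483D7, "read", 4, 12, 1),
    (0x080483DC, "read", 4, 0,  2),
    (0x080483DD, "read", 4, 4,  2),
]
rep_unit = [0x080483D7, 0x2304, 0x141A]   # slot 0: tag; slots 1..n: values

def recover(refs, unit):
    """Expand one dynamic REP unit into full <pc, address, type, size> records."""
    return [(pc, unit[slot] + offset, typ, size)
            for pc, typ, size, offset, slot in refs]

trace = recover(refs, rep_unit)
# first record: pc 0x080483d7, address 0x2310 (= 0x2304 + 12), read, 4 bytes
```

Because recovery only needs table lookups and one addition per reference, it parallelizes cleanly across the stage 1 threads, each working on its own buffer.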
18. Profile Recovery Overhead
- Factor 1: buffer size
  - experiments done on the 8-core system, using 8 recovery threads
[Figure: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with small (64KB), medium (1MB), and large (16MB) buffers]
19. Profile Recovery Overhead
- Factor 2: the number of recovery threads
  - experiments done on the 8-core system, using 16MB buffers
[Figure: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads]
20. Profile Recovery Overhead
- Factor 3: the number of available cores
  - experiments done using 16MB buffers and 8 recovery threads
21. Profile Recovery Overhead
- Factor 4: the impact of using REP
  - experiments done on the 8-core system with 16MB buffers and 8 threads
[Figure: PiPA using REP (PiPA-REP) incurs a 4.5x slowdown, versus 20.7x for PiPA using the standard profile format <pc, address, type, size> (PiPA-standard)]
22. Stage 2: Cache Simulation
- parallel analysis
- independent simulators
23. Stage 2: Parallel Cache Simulation
- How to parallelize?
  - split the address trace into independent groups
- Set-associative caches
  - partition the cache sets and simulate them using several independent simulators
  - merge the results (number of hits and misses) at the end of the simulation
- Example
  - 32KB cache, 32-byte line, 4-way associative => 256 sets
  - 4 independent simulators, each one simulates 64 sets (round-robin distribution)
  - two memory references that access different sets are independent
[Figure: the recovered trace <pc, address, type, size> is split by set index, e.g. addresses 0xbf9c4614 (set 48), 0xbf9c4705 (set 56), 0xbf9c4a34 (set 81), 0xbf9c4a5c (set 82), 0xbf9c4a60 (set 83), each routed to the simulator that owns that set]
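The example above can be sketched as follows. This is a minimal LRU model of the set-partitioning idea (class and variable names are ours, not PiPA's): each simulator owns every fourth set, accesses are routed by set index, and the per-simulator hit/miss counters are merged at the end.

```python
LINE, WAYS, SETS, NSIM = 32, 4, 256, 4    # 32KB, 32B lines, 4-way => 256 sets

class SetPartitionSim:
    """Simulates only the cache sets assigned to this partition."""
    def __init__(self, sim_id):
        self.sim_id = sim_id
        self.sets = {}                    # set index -> tags in LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        s = (addr // LINE) % SETS         # set index
        assert s % NSIM == self.sim_id    # round-robin routing invariant
        tag = addr // (LINE * SETS)
        lines = self.sets.setdefault(s, [])
        if tag in lines:
            lines.remove(tag)
            lines.append(tag)             # move to MRU position
            self.hits += 1
        else:
            self.misses += 1
            lines.append(tag)
            if len(lines) > WAYS:
                lines.pop(0)              # evict the LRU line

sims = [SetPartitionSim(i) for i in range(NSIM)]
# addresses from the slide (set indices 48, 56, 81, 82, 83), plus a repeat
trace = [0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C4A5C, 0xBF9C4A60,
         0xBF9C4614]
for addr in trace:
    sims[((addr // LINE) % SETS) % NSIM].access(addr)

hits = sum(s.hits for s in sims)          # merge results at the end
misses = sum(s.misses for s in sims)
```

Since a cache set's state depends only on references mapping to that set, the partitions never share state and the simulators can run fully in parallel; summing the counters at the end gives the same totals as one sequential simulator.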
24. Cache Simulation Overhead
- Experiments done on the 8-core system
  - 8 recovery threads and 8 cache simulators
- PiPA: 10.5x slowdown vs. Pin dcache: 32x slowdown
  - PiPA speedup over dcache: 3x
25. SPEC2006 Results
- Experiments done using the 8-core system
- Slowdowns: profiling 3x, profiling + recovery 3.7x, full cache simulation 10.2x
- Average speedup over dcache: 3.27x
26. Summary
- PiPA is an effective technique for parallel profiling and analysis
  - based on pipelining
  - drastically reduces both profiling time and analysis time
  - full cache simulation incurs only a 10.5x slowdown
- Runtime Execution Profile
  - requires minimal instrumentation code
  - compact enough to ensure optimal buffer usage
  - makes it easy for the next stages to recover the full trace
- Parallel cache simulation
  - the cache sets are partitioned across several independent simulators
27. Future Work
- Design APIs
  - hide the communication between the pipeline stages
  - focus only on the instrumentation and analysis tasks
- Further improve the efficiency
  - parallel profiling
  - workload monitoring
- More analysis algorithms
  - branch prediction simulation
  - memory dependence analysis
  - ...
28. Pin Prototype
- Second implementation in Pin
- Preliminary results
  - 2.6x speedup over Pin dcache
- Plan to release PiPA
  - www.comp.nus.edu.sg/ioana