Title: A Smarter Paradyn Performance Consultant: Combining a Call Graph-Based Search with Stack Sampling
1A Smarter Paradyn Performance Consultant
Combining a Call Graph-Based Search with Stack
Sampling
- Barton Miller and Philip Roth
- bart,pcroth_at_cs.wisc.edu
- Computer Sciences Department
- University of Wisconsin
- Madison, WI 53706-1685
- USA
2Paradyn
- Uses two main Paradyn technologies
- Dynamic instrumentation
- Automated bottleneck search (Performance Consultant)
- The PC has been effective for both novices and experts . . .
- . . . and our new call graph-based search is a definite win.
- Underlying theme: automate the techniques of an experienced programmer.
- But we can do better by using sampling data to speed up the search.
3Paradyn Basics Resource Hierarchies
[Figure: three resource hierarchies. Code: modules testutil.C, main.C, vect.C with functions printstatus, debugA, debugB, main, vectinsert, vectdelete, vectsize. Machine: Host1 and Host2 with their processes (Process1, Process2) and threads (Thread1, Thread2). SyncObject: Barrier, Message, Semaphore, SpinLock with instances b1, c1, c2, t1, sem1, spin1.]
4Paradyn Basics Resource Hierarchies
[Figure: the Code, Machine, and SyncObject resource hierarchies (modules testutil.C, main.C, vect.C; hosts Host1, Host2; sync object types Barrier, Message, Semaphore, SpinLock).]
Example focus: /Code/testutil.C/printstatus, /Machine, /SyncObject
5Paradyn Basics Resource Hierarchies
[Figure: the Code, Machine, and SyncObject resource hierarchies (modules testutil.C, main.C, vect.C; hosts Host1, Host2; sync object types Barrier, Message, Semaphore, SpinLock).]
Example focus: /Code/testutil.C/printstatus, /Machine/Host1/Process1, /SyncObject
6Paradyn Basics Performance Metrics
- Metrics are measurable performance characteristics such as CPU time, function calls, I/O bytes transferred, L2 cache misses
- Performance data collected for metric/focus pair
- Example metric/focus pairs
- cpu /Code/mod1/func1
- msgs /Code/mod1/func1, /Machine/host1/proc4/thread2, /SyncObject/Message/1/0
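One way to picture a metric/focus pair is as a metric name plus one resource path per hierarchy. The sketch below is illustrative only: the class names `Focus` and `MetricFocusPair`, the `refines` helper, and the rule that an unnamed hierarchy defaults to its root are assumptions, not Paradyn's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Focus:
    """One resource path per hierarchy (hypothetical representation).

    A hierarchy that is not constrained stays at its root, e.g. "/Machine".
    """
    code: str = "/Code"
    machine: str = "/Machine"
    sync_object: str = "/SyncObject"

    def refines(self, other: "Focus") -> bool:
        """True if every path of self is equal to or below the matching path of other."""
        pairs = [
            (self.code, other.code),
            (self.machine, other.machine),
            (self.sync_object, other.sync_object),
        ]
        return all(mine == theirs or mine.startswith(theirs + "/")
                   for mine, theirs in pairs)


@dataclass(frozen=True)
class MetricFocusPair:
    metric: str
    focus: Focus


whole_program = Focus()
func_only = Focus(code="/Code/testutil.C/printstatus")
func_on_proc = Focus(code="/Code/testutil.C/printstatus",
                     machine="/Machine/Host1/Process1")
pair = MetricFocusPair("cpu", func_on_proc)
```

Here `func_on_proc` narrows `func_only` along the Machine hierarchy, which mirrors the step between the two example foci shown on the resource hierarchy slides.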
7Performance Consultant Basics
- Why is the application running slowly?
- Test bottleneck hypotheses
- CPU Bound?
- I/O Wait Bound?
- Synchronization Wait Bound?
- Memory Bound?
- Performance metric associated with each hypothesis
- Which part of the application is slow?
- Isolates bottleneck to part of resource hierarchy
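The pairing of each hypothesis with a metric can be sketched as a simple lookup followed by a comparison. Everything concrete below is an assumption for illustration: the metric names, the idea of a single fixed threshold, and the value 0.2 do not come from the slides.

```python
# Hypothetical mapping from hypothesis to the metric that tests it.
HYPOTHESES = {
    "CPUBound": "cpu_fraction",
    "IOWaitBound": "io_wait_fraction",
    "SyncWaitBound": "sync_wait_fraction",
    "MemoryBound": "memory_stall_fraction",
}


def evaluate_hypotheses(measurements, threshold=0.2):
    """Return the hypotheses whose metric value exceeds the (assumed) threshold."""
    return [name for name, metric in HYPOTHESES.items()
            if measurements.get(metric, 0.0) > threshold]


# Made-up measurements expressed as fractions of execution time.
sample = {"cpu_fraction": 0.65, "io_wait_fraction": 0.05, "sync_wait_fraction": 0.25}
true_hypotheses = evaluate_hypotheses(sample)
```

Each hypothesis that tests true would then be refined along the resource hierarchies to answer the "which part" question.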
8Call Graph Based Performance Consultant
- Based on the application's call graph
- Code hierarchy search starts at function main, then continues to main's children
- Advantages: lots!
- It's scalable: natural hierarchical refinement from coarse-grained search to fine-grained search
- Uses less costly inclusive metrics
- Functions that are not part of the call graph will never be instrumented
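The top-down refinement described above can be sketched as a breadth-first walk that only descends into functions whose inclusive metric tests true. This is a simplified illustration, not Paradyn's implementation: the function name `call_graph_search`, the fixed threshold, the call graph edges, and the metric values are all made up.

```python
from collections import deque


def call_graph_search(call_graph, inclusive_fraction, root="main", threshold=0.2):
    """Refine top-down from root; callees are examined only if their caller tested true.

    Returns (functions that tested true, set of functions that were instrumented).
    """
    instrumented, true_nodes = set(), []
    queue = deque([root])
    while queue:
        func = queue.popleft()
        if func in instrumented:
            continue
        instrumented.add(func)  # stands in for inserting an inclusive-time metric
        if inclusive_fraction.get(func, 0.0) > threshold:
            true_nodes.append(func)
            queue.extend(call_graph.get(func, []))
    return true_nodes, instrumented


# Hypothetical call graph and inclusive CPU fractions.
call_graph = {"main": ["a1", "a2", "a3", "a4"], "a1": ["c1"],
              "a2": ["b1", "b2"], "a4": ["b3"]}
inclusive = {"main": 0.9, "a1": 0.05, "a2": 0.5, "a3": 0.02, "a4": 0.3,
             "b1": 0.45, "b2": 0.03, "b3": 0.1, "c1": 0.01}
true_nodes, instrumented = call_graph_search(call_graph, inclusive)
```

In this run `c1` is never instrumented because its only caller, `a1`, tested false, which illustrates how the search limits instrumentation to the parts of the call graph it actually reaches.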
9Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis and the hypotheses SyncWaitBound, CPUBound, I/OWaitBound.]
10Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis, the hypotheses SyncWaitBound, CPUBound, I/OWaitBound, and function main.]
11Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis, the hypotheses SyncWaitBound, CPUBound, I/OWaitBound, function main, and functions a1, a2, a3, a4 below main.]
12Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis, the hypotheses SyncWaitBound, CPUBound, I/OWaitBound, function main, and functions a1, a2, a3, a4 below main.]
13Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis, the hypotheses SyncWaitBound, CPUBound, I/OWaitBound, function main, functions a1, a2, a3, a4, and next-level functions b1, b2, b3.]
14Call Graph Based PC Example
[Figure: search graph with Top Level Hypothesis, the hypotheses SyncWaitBound, CPUBound, I/OWaitBound, function main, functions a1, a2, a3, a4, and next-level functions b1, b2, b3.]
15Call Graph Construction
- Problem: cannot statically determine the targets of calls made through function pointers and virtual functions.
- Unknown callees in the static call graph may cause blind spots in the new PC search
- We resolve dynamic callee addresses at run time
- Strategy
- Build static call graph at program start
- Fill in dynamic call graph on demand.
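The two-part strategy above can be pictured as a graph that starts with only the statically known edges and records which callers still contain unresolved dynamic call sites. The class and method names below are hypothetical and the sketch ignores how callee addresses are actually observed at run time.

```python
class CallGraph:
    """Static call graph that is extended with dynamic edges on demand (sketch)."""

    def __init__(self):
        self.callees = {}           # caller -> set of known callees
        self.dynamic_sites = set()  # callers that contain dynamic call sites

    def add_static_call(self, caller, callee):
        """Record an edge that is visible at program start."""
        self.callees.setdefault(caller, set()).add(callee)
        self.callees.setdefault(callee, set())

    def add_dynamic_site(self, caller):
        """Mark caller as containing a call whose target is unknown statically."""
        self.callees.setdefault(caller, set())
        self.dynamic_sites.add(caller)

    def needs_resolution(self, func):
        """True if searching below func requires run-time callee resolution."""
        return func in self.dynamic_sites

    def record_dynamic_callee(self, caller, callee):
        """Add an edge observed at run time; return True if the edge is new."""
        known = self.callees.setdefault(caller, set())
        is_new = callee not in known
        known.add(callee)
        self.callees.setdefault(callee, set())
        return is_new


graph = CallGraph()
graph.add_static_call("main", "foo")
graph.add_dynamic_site("foo")             # foo calls through a function pointer
first = graph.record_dynamic_callee("foo", "bar")
second = graph.record_dynamic_callee("foo", "bar")
```

Only callers that the search actually reaches need their dynamic sites resolved, which is the sense in which the dynamic part of the graph is filled in on demand.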
16New Performance Consultant Enhancements
- Hybrid Search Strategy
- Stack sampling to find functions close to actual bottlenecks (deep starters)
- Call graph search to cover rest of search space
- Bi-directional Search: refine searches both upward and downward in call graph
- Stack sampling data
- Comes from current Paradyn stack-walks
- Collected each time Paradyn tries to insert or remove instrumentation.
17Enhancement Benefits
- Finds bottlenecks hidden by a strict call graph
search
[Figure: call graph with nodes A, B, C, D, E.]
- Finds bottlenecks more quickly and efficiently
18Choosing Deep Starters
- Function counts kept in Graph
[Figure: call graph with nodes A, B, C, D, E, F, G.]
- For each subgraph of the graph whose node counts
are above threshold, use the node furthest from
the root node of the graph
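The selection rule above can be read as: keep only the nodes whose sample counts exceed the threshold, group them into connected subgraphs, and take the deepest node of each group. The sketch below follows that reading. The function name, the tie-breaking rule, the edges, and the counts are assumptions, and depth is taken as shortest call distance from the root.

```python
from collections import deque


def choose_deep_starters(call_graph, counts, root, threshold):
    """Pick, for each connected group of above-threshold nodes, the node furthest from root."""
    # Depth of every reachable function, by breadth-first search from the root.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for callee in call_graph.get(node, []):
            if callee not in depth:
                depth[callee] = depth[node] + 1
                queue.append(callee)

    hot = {n for n in depth if counts.get(n, 0) > threshold}

    # Undirected adjacency restricted to hot nodes.
    neighbors = {n: set() for n in hot}
    for caller, callees in call_graph.items():
        for callee in callees:
            if caller in hot and callee in hot:
                neighbors[caller].add(callee)
                neighbors[callee].add(caller)

    starters, seen = [], set()
    for start in sorted(hot):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Deepest node wins; ties are broken by name (an arbitrary choice).
        starters.append(max(component, key=lambda n: (depth[n], n)))
    return sorted(starters)


# Hypothetical edges and stack-sample counts for nodes A..G.
call_graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
counts = {"A": 10, "B": 8, "C": 2, "D": 7, "E": 1, "F": 1, "G": 9}
deep_starters = choose_deep_starters(call_graph, counts, root="A", threshold=5)
```

With these numbers the hot nodes form two groups, {A, B, D} and {G}, so the deep starters are D and G.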
19Search Algorithms
- Deep Start
- When starting search in the call graph...
- add deep starters at high priority
- add root of hierarchy at normal priority
- Has the benefits of deep start search, but the statistical nature of sampling won't cause it to miss bottlenecks
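The priority scheme above can be sketched with a heap: deep starters are queued ahead of the root, and the root still gets examined, so a function that sampling missed is eventually reached by the ordinary top-down search. The priority constants, the function name, and the choice of deep starters below are assumptions.

```python
import heapq
from itertools import count

# Smaller value is served first. The numeric values are arbitrary.
HIGH, NORMAL = 0, 2


def seed_search(deep_starters, root):
    """Build the initial search queue: deep starters first, then the hierarchy root."""
    queue, order = [], count()  # order keeps insertion order within one priority
    for func in deep_starters:
        heapq.heappush(queue, (HIGH, next(order), func))
    heapq.heappush(queue, (NORMAL, next(order), root))
    return queue


# Hypothetical deep starters taken from stack samples.
queue = seed_search(["density", "factor"], "main")
visit_order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
```

The value 1 is deliberately left free between the two priorities so that a medium level could be slotted in for other kinds of search nodes.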
20Deep Start Example
[Figure: CPUBound example with nodes main, time_step, density, factor.]
21Deep Start Example
[Figure: CPUBound example with nodes main, time_step.]
22Deep Start Example
[Figure: CPUBound example with nodes main, factor, time_step.]
23Deep Start Example
[Figure: CPUBound example with nodes main, factor, time_step, density, factor (factor shown twice).]
24Search Algorithms (cont.)
- Bi-directional Search
- Extends Deep Start to search upward in call graph
- Add upward nodes from deep starters at medium priority
- Prune upward search based on residual metric values
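The slides do not define "residual metric value", so the sketch below rests on an assumption: the residual of a caller is its inclusive metric value minus the value already accounted for by the deep starter beneath it, and a caller is pruned when that residual falls below a threshold. The function name, the edges, and the numbers are made up.

```python
def upward_candidates(callers, inclusive, starter, threshold):
    """Return the callers of starter that are still worth examining.

    ASSUMPTION: residual = caller's inclusive value - starter's inclusive value.
    A small residual means the starter already explains the caller's cost.
    """
    keep = []
    for caller in callers.get(starter, []):
        residual = inclusive[caller] - inclusive[starter]
        if residual > threshold:
            keep.append(caller)
    return keep


# Hypothetical reverse call edges and inclusive CPU fractions.
callers = {"factor": ["density", "time_step"]}
inclusive = {"factor": 0.50, "density": 0.55, "time_step": 0.90}
kept = upward_candidates(callers, inclusive, "factor", threshold=0.2)
```

Under this reading, `density` is pruned because almost all of its time is spent in `factor`, while `time_step` stays in the search because a large share of its time is unexplained.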
25Upward Pruning
[Figure: nodes density and factor.]
26Preliminary Results
- 8-node Pentium-III laptop cluster
27Conclusion
- Call graph based search strategy has been highly effective.
- As programs scale in size, there are many opportunities for improvement.
- New version of PC available in Paradyn 3.2 (next week!)
- Lots of ongoing experiments.
- http://www.cs.wisc.edu/paradyn
28Performance Results
29Retroactive Instrumentation
- Problem: find CPU time for a function when we are already executing in one of its children.
- When do we start the timer for the entry to the function?
- Need mechanism to trigger instrumentation code.
- Retroactive instrumentation walks stack, triggering outstanding timers
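The stack walk described above can be sketched as follows: when a timer is inserted for a function whose entry point has already been passed, every such function found on the current call stack gets its timer started immediately. The dictionary-based timer state, the function name, and the example stack are illustrative assumptions.

```python
def retroactively_start_timers(stack, timers, now):
    """Start each inserted-but-idle timer whose function is already on the stack.

    stack  : function names, outermost frame first
    timers : function name -> {"running": bool, "started_at": float or None}
    Returns the functions whose timers were triggered.
    """
    triggered = []
    for func in stack:
        timer = timers.get(func)
        if timer is not None and not timer["running"]:
            timer["running"] = True
            timer["started_at"] = now
            triggered.append(func)
    return triggered


# A timer was just inserted for time_step while the program runs inside density.
timers = {
    "time_step": {"running": False, "started_at": None},
    "other": {"running": False, "started_at": None},
}
stack = ["main", "time_step", "density"]
triggered = retroactively_start_timers(stack, timers, now=12.5)
```

The timer for `other` stays idle because that function is not on the stack, so it will start normally at its next entry.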
30Dynamic Call Sites
- Characterized by keeping the address of a callee in a register or memory location
- New type of instrumentation necessary to determine callee
- Examples
31Dynamic Call Site Instrumentation
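[Figure: example program in which main() sets fp to bar and calls foo(), foo() calls through fp, and bar() is the callee; components shown: Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application.]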
32Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application.]
1. PC requests instrumentation of the call sites of foo.
33Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application.]
2. Daemon instruments call sites of foo.
34Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application.]
A. Application executes call site, notifies daemon.
35Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application; earlier steps 1, 2, A marked.]
B. Daemon notifies PC that foo called bar.
36Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application; earlier steps 1, 2, A, B marked.]
C. PC requests inclusive-time metric for bar.
37Dynamic Call Site Instrumentation
[Figure: example program (main, foo, bar; foo calls through function pointer fp) and components Code Generator, Performance Consultant, Notifier, Paradyn Front-End, Paradyn Daemon, Application; earlier steps 1, 2, A, B, C marked.]
D. Daemon instruments bar.
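The step sequence 1, 2, A, B, C, D from these slides can be replayed in a small simulation. This is a toy model under stated assumptions: the class and method names are hypothetical, the application's run-time behavior is supplied as a plain dictionary, and the front-end/daemon communication is reduced to direct method calls.

```python
class Daemon:
    """Stands in for the Paradyn daemon (simplified)."""

    def __init__(self, dynamic_callees):
        # caller -> callee the application will really call at run time (made up)
        self.dynamic_callees = dynamic_callees
        self.watched_callers = set()
        self.timed_functions = set()

    def instrument_call_sites(self, caller):        # step 2
        self.watched_callers.add(caller)

    def instrument_inclusive_time(self, func):      # step D
        self.timed_functions.add(func)

    def run_application(self, consultant):
        for caller in sorted(self.watched_callers):
            callee = self.dynamic_callees.get(caller)
            if callee is not None:                   # step A: call site executed
                consultant.on_callee_found(self, caller, callee)  # step B


class PerformanceConsultant:
    """Stands in for the PC in the front-end (simplified)."""

    def __init__(self):
        self.log = []

    def refine(self, daemon, caller):                # step 1
        self.log.append(f"1: request call-site instrumentation of {caller}")
        daemon.instrument_call_sites(caller)

    def on_callee_found(self, daemon, caller, callee):
        self.log.append(f"B: {caller} called {callee}")
        self.log.append(f"C: request inclusive time for {callee}")  # step C
        daemon.instrument_inclusive_time(callee)


daemon = Daemon(dynamic_callees={"foo": "bar"})
pc = PerformanceConsultant()
pc.refine(daemon, "foo")
daemon.run_application(pc)
```

After the run, `bar` carries an inclusive-time metric even though no static edge from `foo` to `bar` was known when the search began.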