Title: Automatic Performance Diagnosis of Parallel Computations with Compositional Models Li Li, Allen D. Malony {lili, malony}@cs.uoregon.edu Performance Research Laboratory Dep. of Computer and Information Science University of Oregon
2 Parallel Performance Diagnosis
- Performance tuning process
  - Process to find and fix performance problems
  - Performance diagnosis: detect and explain problems
  - Performance optimization: repair found problems
  - Diagnosis is critical to the efficiency of performance tuning
- Focus on performance diagnosis
  - Capture diagnosis processes
  - Integrate with performance experimentation and evaluation
  - Formalize the (expert) performance cause inference
  - Support diagnosis in an automated manner
3 Generic Performance Diagnosis Process
- Design and run performance experiments
  - Observe performance under a specific circumstance
  - Generate desirable performance evaluation data
- Find symptoms
  - Observations deviating from performance expectation
  - Detect by evaluating performance metrics
- Infer causes of symptoms
  - Relate symptoms to the program
  - Interpret symptoms at different levels of abstraction
- Iterate the process to refine the performance bug search
  - Refine the performance hypothesis based on symptoms found
  - Generate more data to validate the hypothesis
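
The iterate-and-refine process on this slide can be sketched as a small loop. This is only an illustration under stated assumptions: the `Hypothesis` type, the `expectation` threshold, and the metric names are hypothetical and are not taken from Hercule, and the numbers in the usage lines are illustrative fractions of run time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    """A suspected problem area and the experiment that tests it (hypothetical type)."""
    name: str
    run_experiment: Callable[[], Dict[str, float]]  # returns metric name -> fraction of run time
    expectation: float                              # a metric above this value counts as a symptom
    refinements: Dict[str, "Hypothesis"] = field(default_factory=dict)  # symptom -> narrower hypothesis


def diagnose(root: Hypothesis) -> List[str]:
    """Run experiments, find symptoms, and refine the search until nothing stands out."""
    findings: List[str] = []
    current: Optional[Hypothesis] = root
    while current is not None:
        metrics = current.run_experiment()  # design and run an experiment
        symptoms = {m: v for m, v in metrics.items() if v > current.expectation}
        if not symptoms:
            break  # performance matches expectation at this level
        worst = max(symptoms, key=symptoms.get)  # follow the largest deviation
        findings.append(f"{current.name}: {worst} = {symptoms[worst]:.2f}")
        current = current.refinements.get(worst)  # refine the hypothesis, or stop
    return findings


# Illustrative two-level search: whole program first, then one model.
amr = Hypothesis("AMR", lambda: {"guardcell": 0.49, "refinement": 0.02}, expectation=0.30)
program = Hypothesis(
    "program",
    lambda: {"communication": 0.81, "computation": 0.19},
    expectation=0.50,
    refinements={"communication": amr},
)
print(diagnose(program))
```

Each pass generates new data only for the hypothesis that the previous symptoms point to, which is the refinement step the slide describes.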
4 Knowledge-Based Automatic Performance Diagnosis
- Experts analyze systematically and use experience
  - Implicitly use knowledge of code structure and parallelism
  - Guided by that knowledge to conduct diagnostic analysis
- Knowledge-based approach
  - Capture knowledge about performance problems
  - Capture knowledge about how to detect and explain them
  - Apply the knowledge to performance diagnosis
- Performance knowledge
  - Experiment design and specifications
  - Performance models
  - Performance metrics and evaluation rules
  - High-level performance factors/design parameters (causes)
5 Implications
- Where does the knowledge come from?
  - Extract from parallel computational models
    - Structural and operational characteristics
    - Reusable parallel design patterns
  - Associate computational models with performance models
    - Well-defined computation and communication patterns
  - Model examples
    - Single models: Master-worker, Pipeline, AMR, ...
    - Compositional models
- Use model knowledge to diagnose performance problems
  - Engineer model knowledge
  - Integrate model knowledge with cause inference
6 Model-based Generic Knowledge Generation
[Figure: diagram with two columns, model-based generic knowledge generation and algorithm-specific knowledge extension, across four stages. Labels: Behavioral Modeling (abstract events; event1, event2; extend, instantiate), Performance Modeling (performance composition and coupling descriptions; refine), Metrics Definition (model-based metrics; algorithm-specific metrics; extend, instantiate), Inference Modeling (metric-driven inference: performance bug search and cause inference; performance factor library; algorithm-specific factors; extend).]
7 Hercule Automatic Performance Diagnosis System
[Figure: Hercule architecture diagram. Labels: parallel program, model, model knowledge, algorithm-spec. info, measurement system, perf. data, experiment specifications, Hercule, knowledge base, event recognizer, metric evaluator, inference engine, inference rules, diagnosis results, problems, explanations.]
- Goals of automation, adaptability, extension, and reuse
8 Single Model Knowledge Engineered
- Master-worker
- Divide-and-conquer
- Wavefront (2D pipeline)
- Adaptive Mesh Refinement
- Parallel Recursive Tree
- Geometric Decomposition
- Related publications
  - L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in the proceedings of Euro-Par 2006.
  - L. Li, A. D. Malony and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in the proceedings of HPCC 2006.
  - L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.
9 Characteristics of Model Composition
- Compositional model
  - Combine two or more models
  - Interaction changes individual model behaviors
  - Composition pattern affects performance
- Model abstraction for describing composition
  - Computational component set {C1, C2, ..., Ck}
  - Relative control order F(C1, C2, ..., Ck)
  - Integrate component sets in a compositional model
- Composition forms
  - Model nesting
  - Model restructuring
  - Different implications for performance knowledge engineering
10 Model Nesting
- Formal representation
  - Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl)

\[
F(C_1, C_2, \ldots, C_k) + G(D_1, D_2, \ldots, D_l) \;\rightarrow\; F\bigl(C_1\{G(D_1, D_2, \ldots, D_l)\},\; C_2\{G(D_1, D_2, \ldots, D_l)\},\; \ldots,\; C_k\{G(D_1, D_2, \ldots, D_l)\}\bigr)
\]

where $C_i\{G(D_1, D_2, \ldots, D_l)\}$ means $C_i$ implements the G model.
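
One way to picture nesting is as a data structure in which each root component may carry the child model that implements it. The sketch below is an assumption-laden illustration, not the authors' representation: `Component`, `Model`, `nest`, and `describe` are hypothetical names, the optional `affected` argument (nest only some root components) is an addition of this sketch, and the curly braces printed by `describe` are this sketch's notation for "implements".

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    implements: Optional["Model"] = None  # child model implementing this component, if any


@dataclass(frozen=True)
class Model:
    name: str
    components: Tuple[Component, ...]  # listed in relative control order


def nest(root: Model, child: Model, affected: Optional[Iterable[str]] = None) -> Model:
    """Return a copy of root whose affected components implement the child model.

    With affected=None every root component is nested, as in the formal representation.
    """
    names = {c.name for c in root.components} if affected is None else set(affected)
    nested = tuple(
        Component(c.name, child) if c.name in names else c for c in root.components
    )
    return Model(root.name, nested)


def describe(model: Model) -> str:
    """Render a model as text, e.g. F(C1{G(D1, D2)}, C2)."""
    parts = []
    for c in model.components:
        if c.implements is not None:
            parts.append(f"{c.name}{{{describe(c.implements)}}}")
        else:
            parts.append(c.name)
    return f"{model.name}({', '.join(parts)})"


F = Model("F", (Component("C1"), Component("C2")))
G = Model("G", (Component("D1"), Component("D2")))
print(describe(nest(F, G)))
print(describe(nest(F, G, affected=["C2"])))
```

Because the root keeps its own component order and each child sits inside a single root component, the structure stays hierarchical and can be walked from root to child.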
11 Model Nesting (contd.)
- Examples
  - Iterative, multi-phase applications
  - FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes
- Implications for performance diagnosis
  - Hierarchical model structure dictates analysis order
  - Refine problem discovery from root to child
  - Preserve performance features of individual models
12 Model Restructuring
- Formal representation
  - Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)

\[
F(C_1, C_2, \ldots, C_k) + G(D_1, D_2, \ldots, D_l) \;\rightarrow\; H\bigl((\{C_1^F, \ldots, C_k^F\} \mid \{D_1^G, \ldots, D_l^G\})\bigr)
\]

where $(\{C_1^F, \ldots, C_k^F\} \mid \{D_1^G, \ldots, D_l^G\})$ selects a component $C_i^F$ or $D_j^G$ while preserving relative component order in F and G. H is the new function ruling all components.
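
The order-preserving constraint can be illustrated with plain sequences. The sketch below assumes the simplest reading of restructuring, in which H is an interleaving that uses each component of F and G exactly once; the slide does not say whether components may repeat, so treat this as a simplification. The function names are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def interleavings(f: Sequence[str], g: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every merge of f and g that keeps each sequence's internal order."""
    if not f:
        yield tuple(g)
        return
    if not g:
        yield tuple(f)
        return
    for rest in interleavings(f[1:], g):  # next component taken from F
        yield (f[0],) + rest
    for rest in interleavings(f, g[1:]):  # next component taken from G
        yield (g[0],) + rest


def preserves_order(merged: Sequence[str], original: Sequence[str]) -> bool:
    """True if the components of `original` appear in `merged` in the same relative order."""
    wanted = set(original)
    return [c for c in merged if c in wanted] == list(original)


f_components = ("C1", "C2")
g_components = ("D1",)
for candidate in interleavings(f_components, g_components):
    print(candidate, preserves_order(candidate, f_components))
```

Unlike nesting, the result is a flat sequence under one new control function, so the component boundaries of the original models are no longer visible in the structure itself.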
13 Adapt Performance Knowledge to Composition
- Objective: discover and interpret performance effects caused by model interaction
- Model nesting
  - Behavioral modeling
    - Derive F(C1, C2, ..., Ck) from single model behaviors
    - Replace affected root components with child model behaviors
  - Performance modeling and metric formulation
    - Unite overhead categories according to the nesting hierarchy
    - Evaluate overheads according to the model hierarchy
  - Inference modeling
    - Represent the inference process with an inference tree
    - Merge inference steps of participant models
    - Extend root model inferences with implementing child model inferences
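
The inference-modeling step, extending a root model's inference tree with the child model's subtrees, might look like the following. This is a hedged sketch only: the `Node` type, the `links` table, and the pairing of root nodes with numbered child subtrees are hypothetical and do not reproduce the actual Hercule inference trees.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def extend(root_tree: Node, child_subtrees: Dict[int, Node], links: Dict[str, List[int]]) -> Node:
    """Copy the root tree, attaching the linked child-model subtrees under each root node."""
    copy = Node(root_tree.label, [extend(c, child_subtrees, links) for c in root_tree.children])
    for number in links.get(root_tree.label, []):
        copy.children.append(child_subtrees[number])  # continue the search in the child model
    return copy


def paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every root-to-leaf inference path."""
    here = prefix + (node.label,)
    if not node.children:
        return [here]
    return [p for child in node.children for p in paths(child, here)]


# Hypothetical fragments of a root (AMR) tree and numbered child (PRT) subtrees.
amr_tree = Node("communication", [Node("AMR_Guardcell"), Node("AMR_Restrict")])
prt_subtrees = {1: Node("PRT_comm_to_parent"), 2: Node("PRT_comm_to_child"), 3: Node("PRT_comm_to_sibling")}
links = {"AMR_Guardcell": [1, 2, 3], "AMR_Restrict": [1, 3]}  # assumed pairing, for illustration

for path in paths(extend(amr_tree, prt_subtrees, links)):
    print(" -> ".join(path))
```

The root tree is copied rather than modified, so the single-model inference knowledge stays reusable for other compositions.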
14 Model Nesting Case Study - FLASH
- FLASH
  - Parallel simulations in astrophysical hydrodynamics
  - Use Adaptive Mesh Refinement (AMR) to manage meshes
  - Use a Parallel Recursive Tree (PRT) to manage mesh data
- Model nesting
  - Root: AMR model
  - Child: PRT model
  - AMR implements PRT data operations
15 Single Model Characteristics
- AMR operations
  - AMR_Refinement: refine a mesh grid
  - AMR_Derefinement: coarsen a mesh grid
  - AMR_LoadBalancing: even out work load after refinement or derefinement
  - AMR_Guardcell: update guard cells at the boundary of every grid block with data from the neighbors
  - AMR_Prolong: prolong the solution to newly created leaf blocks after refinement
  - AMR_Restrict: restrict the solution up the block tree after derefinement
  - AMR_MeshRedistribution: mesh redistribution when balancing workload
- PRT operations
  - PRT_comm_to_parent: communicate to parent processor
  - PRT_comm_to_child: communicate to child processor
  - PRT_comm_to_sibling: communicate to sibling processor
  - PRT_build_tree: initialize tree structure, or migrate part of the tree to another processor and rebuild the connection
16 AMR Inference Tree
[Figure: AMR inference tree. Legend: symptoms, intermediate observations, performance factors, inference direction.]
17 PRT Inference Tree
[Figure: PRT inference tree with subtrees numbered 1 to 5. Legend: symptoms, intermediate observations, performance factors, inference direction.]
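18 FLASH Inference Tree
[Figure: FLASH inference tree, with nodes annotated by PRT subtree numbers (labels shown include 1,3; 1,2,3; 3; 4; 5). Legend: a number label on a node A means the performance problem search is refined by following the subtrees of PRT that are relevant to A; the numbers refer to the corresponding subtrees in the PRT inference tree.]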
19 Experiment with FLASH v3.0
- Sedov explosion simulation in FLASH3
- Test platform
  - IBM pSeries 690 SMP cluster with 8 processors
- Execution profiles of a problematic run (ParaProf view)
20 Diagnosis Results Output (Steps 1 and 2)
- Step 1: find performance symptom

  Begin diagnosing ...
  Begin diagnosing AMR program ... ...
  Level 1 experiment -- collect performance profiles with respect to computation and communication.
  ______________________________________________________________
  do experiment 1... ...
  Communication accounts for 80.70% of run time.
  Communication cost of the run degrades performance.

- Step 2: look at root AMR model performance

  Level 2 experiment -- collect performance profiles with respect to AMR refine, derefine, guardcell-fill, prolong, and workload-balance.
  ______________________________________________________________
  do experiment 2... ...
  Processes spent
  4.35% of communication time in checking refinement,
  2.22% in refinement,
  13.83% in checking derefinement (coarsening),
  1.43% in derefinement,
  49.44% in guardcell filling,
  3.44% in prolongating data,
  9.43% in dealing with work balancing,
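
The Step 2 breakdown can be evaluated mechanically to decide which AMR operation to investigate next. The following is a minimal sketch, assuming a simple "share of communication time above a threshold" rule; the 25% threshold, the dictionary keys, and the function name are hypothetical and are not Hercule's actual evaluation rules. The percentages are the ones shown in the output.

```python
from typing import Dict, List

# Shares of communication time reported by the Level 2 experiment, in percent.
level2_shares: Dict[str, float] = {
    "checking refinement": 4.35,
    "refinement": 2.22,
    "checking derefinement": 13.83,
    "derefinement": 1.43,
    "guardcell filling": 49.44,
    "prolong": 3.44,
    "work balancing": 9.43,
}


def dominant_overheads(shares: Dict[str, float], min_share: float = 25.0) -> List[str]:
    """Return the operations at or above min_share percent, largest first."""
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [name for name, pct in ranked if pct >= min_share]


print(dominant_overheads(level2_shares))
print(round(sum(level2_shares.values()), 2))
```

The listed shares add up to 84.14%; the list in the output ends with a comma, so the remaining share is not shown here. Under the assumed threshold only guardcell filling stands out, which matches the focus of the next diagnosis step.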
21 Step 3: Interpret Expensive guardcell_filling with PRT Performance

(AMR model performance)
Level 3 experiment for diagnosing grid guardcell-filling related problems -- collect performance event trace with respect to restriction, intra-level and inter-level commu. associated with the grid block tree.
______________________________________________________________
do experiment 3... ...
Among the guardcell-filling communication, 53.01% is spent restricting the solution up the block tree, 8.27% is spent in building tree connections required by guardcell-filling (updating the neighbor list in terms of morton order), and 38.71% in transferring guardcell data among grid blocks.
______________________________________________________________
(PRT operation perf. in AMR_Restrict)
The restriction communication time consists of 94.77% in transferring physical data among grid blocks, and 5.23% in building tree connections. Among the restriction communication, 92.26% is spent in collective communications. Looking at the performance of data transfer in restrictions from the PRT perspective, remote fetch parent data comprises 0.0%, remote fetch sibling comprises 0.0%, and remote fetch child comprises 100%. Improving block contiguity at the inter-level of the PRT will reduce restriction data communication.
______________________________________________________________
(PRT operation perf. in transferring guardcell data)
Among the guardcell data transfer, 65.78% is spent in collective communications. Looking at the performance of guardcell data transfer from the PRT perspective, remote fetch parent data comprises 3.42%, remote fetch sibling comprises 85.93%, and remote fetch child comprises 10.64%. Improving block contiguity at the intra-level of the PRT will reduce guardcell data communication.
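
The last inference in each section, from the remote-fetch breakdown to a contiguity recommendation, can be sketched as a lookup on the dominant fetch category. This is an illustration under assumptions: the mapping of child fetches to the inter-level and sibling fetches to the intra-level is read off the two recommendations in the output, no rule for parent fetches appears there, and the names below are hypothetical rather than Hercule's inference rules.

```python
from typing import Dict

# Dominant remote-fetch category -> tree level named in the output's recommendation.
LEVEL_FOR_FETCH = {"child": "inter-level", "sibling": "intra-level"}


def recommend(fetch_shares: Dict[str, float]) -> str:
    """Suggest where to improve block contiguity from remote-fetch shares (percent)."""
    kind = max(fetch_shares, key=fetch_shares.get)  # dominant remote-fetch category
    level = LEVEL_FOR_FETCH.get(kind)
    if level is None:
        return f"remote fetch {kind} dominates; no rule recorded in this sketch"
    return f"improve block contiguity at the {level} of the PRT"


restriction = {"parent": 0.0, "sibling": 0.0, "child": 100.0}
guardcell_transfer = {"parent": 3.42, "sibling": 85.93, "child": 10.64}

print(recommend(restriction))
print(recommend(guardcell_transfer))
```

Keeping the rule keyed on the PRT operation rather than on the AMR operation is what lets the same child-model knowledge explain both the restriction and the guardcell transfer costs.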
22 Conclusion and Future Directions
- Model-based performance diagnosis approach
  - Provide performance feedback at a high level of abstraction
  - Support automatic problem discovery and interpretation
  - Enable novice programmers to use established expertise
- Compositional model diagnosis
  - Adapt the knowledge engineering approach to model integration
  - Disentangle cross-model performance effects
  - Enhance applicability of the model-based approach
- Future directions
  - Automate performance knowledge adaptation
    - Algorithmic knowledge, compositional model knowledge
  - Incorporate a system utilization model
    - Reveal interplay between programming model and system utilization
    - Explain performance with the model-system relationship