Automatic Performance Diagnosis of Parallel Computations with Compositional Models. Li Li, Allen D. Malony, {lili, malony}@cs.uoregon.edu, Performance Research Laboratory, Dep. of Computer and Information Science, University of Oregon

1
Automatic Performance Diagnosis of Parallel Computations with Compositional Models
Li Li, Allen D. Malony
{lili, malony}@cs.uoregon.edu
Performance Research Laboratory
Dep. of Computer and Information Science
University of Oregon
2
Parallel Performance Diagnosis

Performance tuning process
Process to find and fix performance problems
Performance diagnosis: detect and explain problems
Performance optimization: repair found problems
Diagnosis is critical to efficiency of performance tuning
Focus on the performance diagnosis
Capture diagnosis processes
Integrate with performance experimentation and evaluation
Formalize the (expert) performance cause inference
Support diagnosis in an automated manner

3
Generic Performance Diagnosis Process

Design and run performance experiments
Observe performance under a specific circumstance
Generate desirable performance evaluation data
Find symptoms
Observation deviating from performance expectation
Detect by evaluating performance metrics
Infer causes of symptoms
Relate symptoms to program
Interpret symptoms at different levels of abstraction
Iterate the process to refine performance bug search
Refine performance hypothesis based on symptoms found
Generate more data to validate the hypothesis
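
The iterative process on this slide can be sketched as a small loop. This is only an illustration: the `Hypothesis` type, the `diagnose` function, and the three callbacks are hypothetical names introduced here, not part of Hercule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    """Hypothetical record of what the next experiment should look at."""
    name: str            # e.g. "run" or "communication"
    metrics: List[str]   # metrics the experiment should collect


def diagnose(
    run_experiment: Callable[[Hypothesis], Dict[str, object]],
    find_symptom: Callable[[Dict[str, object]], Optional[str]],
    refine: Callable[[str], Optional[Hypothesis]],
    start: Hypothesis,
    max_rounds: int = 10,
) -> List[str]:
    """Repeat experiment -> symptom -> refined hypothesis; return the symptoms found."""
    trail: List[str] = []
    hypothesis: Optional[Hypothesis] = start
    for _ in range(max_rounds):
        if hypothesis is None:
            break                         # nothing left to test
        data = run_experiment(hypothesis)  # design and run an experiment
        symptom = find_symptom(data)       # evaluate metrics against expectation
        if symptom is None:
            break                         # no deviation found at this level
        trail.append(symptom)
        hypothesis = refine(symptom)       # narrow the search for the next round
    return trail
```

The callbacks keep the loop independent of any particular measurement system; `max_rounds` is an assumed safeguard against a search that never converges.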

4
Knowledge-Based Automatic Performance Diagnosis

Experts analyze systematically and use experience
Implicitly use knowledge of code structure and parallelism
Guided by the knowledge to conduct diagnostic analysis
Knowledge-based approach
Capture knowledge about performance problems
Capture knowledge about how to detect and explain them
Apply the knowledge to performance diagnosis
Performance knowledge
Experiment design and specifications
Performance models
Performance metrics and evaluation rules
High-level performance factors/design parameters (causes)

5
Implications

Where does the knowledge come from?
Extract from parallel computational models
Structural and operational characteristics
Reusable parallel design patterns
Associate computational models with performance models
Well-defined computation and communication pattern
Model examples
Single models: Master-worker, Pipeline, AMR, ...
Compositional models
Use model knowledge to diagnose performance problem
Engineer model knowledge
Integrate model knowledge with cause inference

6
Model-based Generic Knowledge Generation
Algorithm-specific Knowledge Extension
[Figure: knowledge engineering workflow. Generic, model-based knowledge: Behavioral Modeling (abstract events), Performance Modeling (performance composition and coupling descriptions), Metrics Definition (model-based metrics), Inference Modeling (performance factor library; metric-driven inference for performance bug search and cause inference). Algorithm-specific extensions: events (event1, event2), algorithm-specific metrics, algorithm-specific factors. Arrows are labeled extend, instantiate, and refine.]
7
Hercule Automatic Performance Diagnosis System
[Figure: Hercule architecture. Inputs: parallel program, model, model knowledge, algorithm-specific info. Components: knowledge base, measurement system, event recognizer, metric evaluator, inference engine. Artifacts: experiment specifications, perf. data, inference rules. Output: diagnosis results (problems, explanations).]

Goals of automation, adaptability, extension, and reuse

8
Single Model Knowledge Engineered

Master-worker
Divide-and-conquer
Wavefront (2D pipeline)
Adaptive Mesh Refinement
Parallel Recursive Tree
Geometric Decomposition
Related publications
L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in Proceedings of Euro-Par 2006.
L. Li, A. D. Malony and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in Proceedings of HPCC 2006.
L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.

9
Characteristics of Model Composition

Compositional model
Combine two or more models
Interaction changes individual model behaviors
Composition pattern affects performance
Model abstraction for describing composition
Computational component set: C1, C2, ..., Ck
Relative control order: F(C1, C2, ..., Ck)
Integrate component sets in a compositional model
Composition forms
Model nesting
Model restructuring
Different implications for performance knowledge engineering

10
Model Nesting

Formal representation
Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl)

F(C1, C2, ..., Ck) composed with G(D1, D2, ..., Dl)
→ F(C1 G(D1, D2, ..., Dl), C2 G(D1, D2, ..., Dl), ..., Ck G(D1, D2, ..., Dl))
where Ci G(D1, D2, ..., Dl) means Ci implements the G model.
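
A minimal sketch of nesting as a data structure, assuming a model is just a name plus an ordered tuple of components. `Component`, `Model`, and `nest` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    implements: Optional["Model"] = None  # child model this component implements


@dataclass(frozen=True)
class Model:
    name: str
    components: Tuple[Component, ...]     # kept in relative control order


def nest(root: Model, child: Model) -> Model:
    """Return a copy of root in which every component Ci implements the child model."""
    return Model(
        root.name,
        tuple(Component(c.name, child) for c in root.components),
    )
```

For example, `nest(Model("AMR", (Component("AMR_Guardcell"),)), Model("PRT", (Component("PRT_comm_to_sibling"),)))` returns an AMR model whose single component implements PRT, and leaves the original AMR model unchanged.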
11
Model Nesting (contd.)

Examples
Iterative, multi-phase applications
FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes
Implications for performance diagnosis
Hierarchical model structure dictates analysis order
Refine problem discovery from root to child
Preserve performance features of individual models

12
Model Restructuring

Formal representation
Two models: F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)

F(C1, C2, ..., Ck) composed with G(D1, D2, ..., Dl)
→ H((C1^F, ..., Ck^F) (D1^G, ..., Dl^G))
where (C1^F, ..., Ck^F) (D1^G, ..., Dl^G) selects a component Ci^F or Dj^G while preserving relative component order in F and G. H is the new function ruling all components.
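
The order-preservation condition can be checked mechanically. The sketch below assumes components are identified by unique names and that no name appears in both models; `preserves_order` is a hypothetical helper, not something defined in the talk.

```python
from typing import Sequence


def preserves_order(merged: Sequence[str], f: Sequence[str], g: Sequence[str]) -> bool:
    """True if merged holds exactly the components of f and g,
    each group in its original relative order."""
    if len(merged) != len(f) + len(g):
        return False
    f_names, g_names = set(f), set(g)
    from_f = [c for c in merged if c in f_names]  # components of F, as ordered in merged
    from_g = [c for c in merged if c in g_names]  # components of G, as ordered in merged
    return from_f == list(f) and from_g == list(g)
```

With `f = ["C1", "C2"]` and `g = ["D1", "D2"]`, the interleaving `["C1", "D1", "C2", "D2"]` passes, while `["C2", "D1", "C1", "D2"]` fails because C2 now precedes C1.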
13
Adapt Performance Knowledge to Composition

Objective: discover and interpret performance effects caused by model interaction
Model nesting
Behavioral modeling
Derive F(C1, C2, ..., Ck) from single model behaviors
Replace affected root component with child model behaviors
Performance modeling and metric formulation
Unite overhead categories according to nesting hierarchy
Evaluate overheads according to the model hierarchy
Inference modeling
Represent inference process with an inference tree
Merge inference steps of participant models
Extend root model inferences with implementing child model inferences
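
One way to picture the last step, extending root model inferences with child model inferences, is to graft numbered child subtrees onto selected root nodes. This is a sketch under assumptions: `Node`, `extend`, and the node labels in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def extend(root: Node, child_subtrees: Dict[int, Node],
           relevant: Dict[str, List[int]]) -> Node:
    """Return a copy of the root tree in which each node named in `relevant`
    gains the child-model subtrees whose numbers are listed for it."""
    kids = [extend(c, child_subtrees, relevant) for c in root.children]
    kids += [child_subtrees[i] for i in relevant.get(root.label, [])]
    return Node(root.label, kids)
```

For instance, with child subtrees `{1: Node("comm to parent"), 3: Node("comm to sibling")}` and `relevant = {"guardcell filling": [1, 3]}`, the root node labeled "guardcell filling" gains both subtrees while every other root node is copied unchanged.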

14
Model Nesting Case Study - FLASH

FLASH
Parallel simulations in astrophysical hydrodynamics
Use Adaptive Mesh Refinement (AMR) to manage meshes
Use a Parallel Recursive Tree (PRT) to manage mesh data
Model nesting
Root: AMR model
Child: PRT model
AMR implements PRT data operations

15
Single Model Characteristics

AMR operations
AMR_Refinement: refine a mesh grid
AMR_Derefinement: coarsen a mesh grid
AMR_LoadBalancing: even out work load after refinement or derefinement
AMR_Guardcell: update guard cells at the boundary of every grid block with data from the neighbors
AMR_Prolong: prolong the solution to newly created leaf blocks after refinement
AMR_Restrict: restrict the solution up the block tree after derefinement
AMR_MeshRedistribution: mesh redistribution when balancing workload
PRT operations
PRT_comm_to_parent: communicate to parent processor
PRT_comm_to_child: communicate to child processor
PRT_comm_to_sibling: communicate to sibling processor
PRT_build_tree: initialize tree structure, or migrate part of the tree to another processor and rebuild the connection

16
AMR Inference Tree
[Figure: AMR inference tree. Legend: symptoms, intermediate observations, performance factors, inference direction.]
17
PRT Inference Tree
[Figure: PRT inference tree with subtrees numbered 1 to 5. Legend: symptoms, intermediate observations, performance factors, inference direction.]
18
FLASH Inference Tree
[Figure: FLASH inference tree. For a node A annotated with numbers, the performance problem search is refined by following the PRT subtrees that are relevant to A; the numbers (for example 1,3 or 1,2,3) identify the corresponding subtrees in the PRT inference tree.]
19
Experiment with FLASH v3.0

Sedov explosion simulation in FLASH3
Test platform
IBM pSeries 690 SMP cluster with 8 processors
Execution profiles of a problematic run (ParaProf view)

20
Diagnosis Results Output (Steps 1 and 2)

Step 1: find performance symptom

Begin diagnosing ...
Begin diagnosing AMR program ... ...
Level 1 experiment -- collect performance profiles with respect to computation and communication.
______________________________________________________________
do experiment 1... ...
Communication accounts for 80.70% of run time.
Communication cost of the run degrades performance.

Step 2: look at root AMR model performance

Level 2 experiment -- collect performance profiles with respect to AMR refine, derefine, guardcell-fill, prolong, and workload-balance.
______________________________________________________________
do experiment 2... ...
Processes spent
4.35% of communication time in checking refinement,
2.22% in refinement,
13.83% in checking derefinement (coarsening),
1.43% in derefinement,
49.44% in guardcell filling,
3.44% in prolongating data,
9.43% in dealing with work balancing,
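
The Step 2 breakdown points the search at the most expensive AMR operation. A minimal sketch of that selection, using the shares reported above; the `dominant` helper and its 30% threshold are assumptions made for illustration, not Hercule's actual evaluation rule.

```python
from typing import Dict, Optional

# Shares of communication time reported by the Level 2 experiment (percent).
amr_shares: Dict[str, float] = {
    "checking refinement": 4.35,
    "refinement": 2.22,
    "checking derefinement": 13.83,
    "derefinement": 1.43,
    "guardcell filling": 49.44,
    "prolongating data": 3.44,
    "work balancing": 9.43,
}


def dominant(shares: Dict[str, float], threshold: float = 30.0) -> Optional[str]:
    """Return the category with the largest share, or None if even the
    largest share stays below the (assumed) threshold."""
    name, value = max(shares.items(), key=lambda kv: kv[1])
    return name if value >= threshold else None
```

Here `dominant(amr_shares)` selects guardcell filling, the operation that Step 3 goes on to interpret from the PRT perspective.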

21
Step 3: Interpret Expensive guardcell_filling with PRT Performance

AMR model performance

Level 3 experiment for diagnosing grid guardcell-filling related problems -- collect performance event trace with respect to restriction, intra-level and inter-level commu. associated with the grid block tree.
______________________________________________________________
do experiment 3... ...
Among the guardcell-filling communication, 53.01% is spent restricting the solution up the block tree, 8.27% is spent in building tree connections required by guardcell-filling (updating the neighbor list in terms of morton order), and 38.71% in transferring guardcell data among grid blocks.

PRT operation perf. in AMR_Restrict
______________________________________________________________
The restriction communication time consists of 94.77% in transferring physical data among grid blocks, and 5.23% in building tree connections.
Among the restriction communication, 92.26% is spent in collective communications.
Looking at the performance of data transfer in restrictions from the PRT perspective, remote fetch parent data comprises 0.0%, remote fetch sibling comprises 0.0%, and remote fetch child comprises 100%.
Improving block contiguity at the inter-level of the PRT will reduce restriction data communication.

PRT operation perf. in transferring guardcell data
______________________________________________________________
Among the guardcell data transfer, 65.78% is spent in collective communications.
Looking at the performance of guardcell data transfer from the PRT perspective, remote fetch parent data comprises 3.42%, remote fetch sibling comprises 85.93%, and remote fetch child comprises 10.64%.
Improving block contiguity at the intra-level of the PRT will reduce guardcell data communication.
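
The two recommendations in this output follow the dominant remote-fetch category: child-dominated traffic leads to an inter-level hint, sibling-dominated traffic to an intra-level hint. A sketch of that mapping, assuming a simple "largest share wins" rule; `contiguity_hint` is a hypothetical helper, and the parent-dominated case is left open because the output shown here does not cover it.

```python
from typing import Optional


def contiguity_hint(parent: float, sibling: float, child: float) -> Optional[str]:
    """Map remote-fetch shares (percent) to the PRT level whose block
    contiguity the diagnosis output suggests improving."""
    fetch = {"parent": parent, "sibling": sibling, "child": child}
    top = max(fetch, key=lambda k: fetch[k])  # category with the largest share
    if top == "child":
        return "inter-level"   # as in the restriction case (child = 100%)
    if top == "sibling":
        return "intra-level"   # as in the guardcell transfer case (sibling = 85.93%)
    return None                # parent-dominated: not covered by the output above
```

Applied to the reported numbers, `contiguity_hint(0.0, 0.0, 100.0)` gives the inter-level hint for restriction and `contiguity_hint(3.42, 85.93, 10.64)` gives the intra-level hint for guardcell data transfer.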
22
Conclusion and Future Directions