Automatic Performance Diagnosis of Parallel Computations with Compositional Models
Li Li, Allen D. Malony {lili, malony}@cs.uoregon.edu
Performance Research Laboratory, Dept. of Computer and Information Science, University of Oregon

1
Automatic Performance Diagnosis of Parallel
Computations with Compositional Models
Li Li, Allen D. Malony {lili, malony}@cs.uoregon.edu
Performance Research Laboratory
Dept. of Computer and Information Science
University of Oregon
2
Parallel Performance Diagnosis
  • Performance tuning process
  • Process to find and fix performance problems
  • Performance diagnosis: detect and explain problems
  • Performance optimization: repair the problems found
  • Diagnosis is critical to the efficiency of performance tuning
  • Focus here is on performance diagnosis
  • Capture diagnosis processes
  • Integrate with performance experimentation and evaluation
  • Formalize the (expert) performance cause inference
  • Support diagnosis in an automated manner

3
Generic Performance Diagnosis Process
  • Design and run performance experiments
  • Observe performance under specific circumstances
  • Generate the desired performance evaluation data
  • Find symptoms
  • Observations deviating from performance expectations
  • Detect by evaluating performance metrics
  • Infer causes of symptoms
  • Relate symptoms to the program
  • Interpret symptoms at different levels of abstraction
  • Iterate the process to refine the performance bug search
  • Refine the performance hypothesis based on symptoms found
  • Generate more data to validate the hypothesis
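This iterative refinement can be sketched as a simple loop (a minimal Python sketch; the callables, metric names, and thresholds are hypothetical stand-ins, not part of any actual tool):

```python
def diagnose(run_experiment, evaluate, refine_hypothesis, expectation,
             max_iterations=10):
    """Generic performance diagnosis loop: run an experiment, detect
    symptoms as metrics deviating from expectation, infer/refine a
    cause hypothesis, and iterate until no symptom remains."""
    hypothesis = None
    for _ in range(max_iterations):
        data = run_experiment(hypothesis)        # design and run experiment
        metrics = evaluate(data)                 # evaluate performance metrics
        symptoms = {name: value for name, value in metrics.items()
                    if value > expectation.get(name, float("inf"))}
        if not symptoms:                         # no deviation: done
            return hypothesis
        hypothesis = refine_hypothesis(hypothesis, symptoms)
    return hypothesis
```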

4
Knowledge-Based Automatic Performance Diagnosis
  • Experts analyze systematically and use experience
  • Implicitly use knowledge of code structure and parallelism
  • Guided by this knowledge, conduct diagnostic analysis
  • Knowledge-based approach
  • Capture knowledge about performance problems
  • Capture knowledge about how to detect and explain them
  • Apply the knowledge to performance diagnosis
  • Performance knowledge
  • Experiment design and specifications
  • Performance models
  • Performance metrics and evaluation rules
  • High-level performance factors/design parameters (causes)

5
Implications
  • Where does the knowledge come from?
  • Extract from parallel computational models
  • Structural and operational characteristics
  • Reusable parallel design patterns
  • Associate computational models with performance
    models
  • Well-defined computation and communication
    pattern
  • Model examples
  • Single models Master-worker, Pipeline,AMR, ...
  • Compositional models
  • Use model knowledge to diagnose performance
    problem
  • Engineer model knowledge
  • Integrate model knowledge with cause inference

6
Model-based Generic Knowledge Generation and Algorithm-specific Knowledge Extension
[Diagram: four knowledge-generation stages with algorithm-specific extensions. Behavioral Modeling produces abstract events, which are instantiated and extended into algorithm-specific events; Performance Modeling refines them into performance composition and coupling descriptions; Metrics Definition yields model-based metrics, extended into algorithm-specific metrics; Inference Modeling drives metric-driven performance bug search and cause inference over a performance factor library, extended with algorithm-specific factors.]
7
Hercule Automatic Performance Diagnosis System
[Diagram: Hercule architecture. A parallel program and its model supply model knowledge and algorithm-specific information. Hercule's knowledge base, event recognizer, metric evaluator, and inference engine operate on performance data from the measurement system, guided by experiment specifications and inference rules, and produce diagnosis results: explanations of problems.]
  • Goals of automation, adaptability, extension, and
    reuse

8
Single Model Knowledge Engineered
  • Master-worker
  • Divide-and-conquer
  • Wavefront (2D pipeline)
  • Adaptive Mesh Refinement
  • Parallel Recursive Tree
  • Geometric Decomposition
  • Related publications
  • L. Li and A. D. Malony, "Model-based Performance Diagnosis of Master-worker Parallel Computations", in Proceedings of Euro-Par 2006.
  • L. Li, A. D. Malony and K. Huck, "Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations", in Proceedings of HPCC 2006.
  • L. Li and A. D. Malony, "Knowledge Engineering for Automatic Parallel Performance Diagnosis", to appear in Concurrency and Computation: Practice and Experience.

9
Characteristics of Model Composition
  • Compositional model
  • Combines two or more models
  • Interaction changes individual model behaviors
  • Composition pattern affects performance
  • Model abstraction for describing composition
  • Computational component set C1, C2, ..., Ck
  • Relative control order F(C1, C2, ..., Ck)
  • Integrate component sets in a compositional model
  • Composition forms
  • Model nesting
  • Model restructuring
  • Different implications for performance knowledge engineering

10
Model Nesting
  • Formal representation
  • Two models: root F(C1, C2, ..., Ck) and child G(D1, D2, ..., Dl)

F(C1, C2, ..., Ck) ∘ G(D1, D2, ..., Dl)
  ⇒ F(C1·G(D1, D2, ..., Dl), C2·G(D1, D2, ..., Dl), ..., Ck·G(D1, D2, ..., Dl))
where Ci·G(D1, D2, ..., Dl) means Ci implements the G model.
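The nesting rule can be illustrated with a small sketch, assuming models are represented simply as a name plus a component list (the AMR/PRT component names here are illustrative, chosen to mirror the FLASH case study later in the talk):

```python
def nest(root, child):
    """Model nesting F over G: each component Ci of the root model is
    paired with the child model G, denoting that Ci implements G."""
    f_name, f_components = root
    return (f_name, [(c, child) for c in f_components])

amr = ("AMR", ["Refinement", "Guardcell", "LoadBalancing"])
prt = ("PRT", ["comm_to_parent", "comm_to_child", "comm_to_sibling"])
flash = nest(amr, prt)  # every AMR component now implements the PRT model
```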
11
Model Nesting (contd.)
  • Examples
  • Iterative, multi-phase applications
  • FLASH, developed by the DOE-supported ASC/Alliances Center for Astrophysical Thermonuclear Flashes
  • Implications for performance diagnosis
  • Hierarchical model structure dictates analysis order
  • Refine problem discovery from root to child
  • Preserve performance features of individual models

12
Model Restructuring
  • Formal representation
  • Two models F(C1, C2, ..., Ck) and G(D1, D2, ..., Dl)

F(C1, C2, ..., Ck) ∘ G(D1, D2, ..., Dl)
  ⇒ H(⟨C1F, ..., CkF⟩ ⊕ ⟨D1G, ..., DlG⟩)
where ⟨C1F, ..., CkF⟩ ⊕ ⟨D1G, ..., DlG⟩ selects, at each position, a component CiF or DjG while preserving the relative component order within F and within G, and H is the new function ruling all components.
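The interleaving operator can be made concrete with a short sketch that enumerates every ordering of the two component sequences preserving the relative order within each model (an illustrative helper, not part of Hercule):

```python
from itertools import combinations

def interleavings(f_components, g_components):
    """All orderings of F's and G's components in which each model's
    own relative component order is preserved."""
    n, m = len(f_components), len(g_components)
    results = []
    for f_slots in combinations(range(n + m), n):  # positions for F's components
        f_iter, g_iter = iter(f_components), iter(g_components)
        results.append([next(f_iter) if pos in f_slots else next(g_iter)
                        for pos in range(n + m)])
    return results
```

For two F components and one G component this yields the three orderings among which H may choose.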
13
Adapt Performance Knowledge to Composition
  • Objective: discover and interpret performance effects caused by model interaction
  • Model nesting
  • Behavioral modeling
  • Derive F(C1, C2, ..., Ck) from single-model behaviors
  • Replace affected root components with child model behaviors
  • Performance modeling and metric formulation
  • Unite overhead categories according to the nesting hierarchy
  • Evaluate overheads according to the model hierarchy
  • Inference modeling
  • Represent the inference process with an inference tree
  • Merge inference steps of participant models
  • Extend root model inferences with implementing child model inferences
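The last step, extending root-model inferences with child-model inferences, can be sketched as grafting subtrees onto the leaves of the root inference tree (node labels and the `links` mapping below are hypothetical illustrations, not Hercule's actual representation):

```python
class Node:
    """Inference-tree node: a symptom or intermediate observation,
    refined by its children down to performance factors."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

def extend_inferences(node, links, child_subtrees):
    """Graft child-model subtrees onto root-model leaves: `links`
    maps a root-model performance factor to the names of the
    child-model subtrees that refine it."""
    if not node.children:
        node.children = [child_subtrees[n] for n in links.get(node.label, [])]
    else:
        for c in node.children:
            extend_inferences(c, links, child_subtrees)
    return node
```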

14
Model Nesting Case Study - FLASH
  • FLASH
  • Parallel simulations in astrophysical hydrodynamics
  • Uses Adaptive Mesh Refinement (AMR) to manage meshes
  • Uses a Parallel Recursive Tree (PRT) to manage mesh data
  • Model nesting
  • Root: AMR model
  • Child: PRT model
  • AMR components implement PRT data operations

15
Single Model Characteristics
  • AMR operations
  • AMR_Refinement refine a mesh grid
  • AMR_Derefinement coarsen a mesh grid
  • AMR_LoadBalancing even out work load after
    refinement or derefinement
  • AMR_Guardcell update guard cells at the
    boundary of every grid block with data from the
    neighbors
  • AMR_Prolong prolong the solution to newly
    created leaf blocks after refinement
  • AMR_Restrict restrict the solution up the block
    tree after derefinement
  • AMR_MeshRedistribution mesh redistribution when
    balancing workload
  • PRT operations
  • PRT_comm_to_parent communicate to parent
    processor
  • PRT_comm_to_child communicate to child
    processor
  • PRT_comm_to_sibling communicate to sibling
    processor
  • PRT_build_tree initialize tree structure, or
    migrate part of the tree to another
    processor and rebuild the connection.

16
AMR Inference Tree
[Diagram: AMR inference tree, read in the inference direction from symptoms, through intermediate observations, down to performance factors.]
17
PRT Inference Tree
[Diagram: PRT inference tree, from symptoms through intermediate observations to performance factors; its subtrees are numbered 1-5 for reference from the FLASH inference tree.]
18
FLASH Inference Tree
[Diagram: FLASH inference tree. At a node A, the performance problem search is refined by following the subtrees of the PRT inference tree relevant to A; the numbers annotating the branches (1-5) identify the corresponding PRT subtrees.]
19
Experiment with FLASH v3.0
  • Sedov explosion simulation in FLASH3
  • Test platform
  • IBM pSeries 690 SMP cluster with 8 processors
  • Execution profiles of a problematic run (ParaProf view)

20
Diagnosis Results Output (Steps 1-2)
Begin diagnosing ...
Begin diagnosing AMR program ... ...
Level 1 experiment -- collect performance profiles with respect to computation and communication.
______________________________________________________________
do experiment 1... ...
Communication accounts for 80.70% of run time. Communication cost of the run degrades performance.
  • Step 1: find the performance symptom
  • Step 2: look at root AMR model performance

Level 2 experiment -- collect performance profiles with respect to AMR refine, derefine, guardcell-fill, prolong, and workload-balance.
______________________________________________________________
do experiment 2... ...
Processes spent
  4.35% of communication time in checking refinement,
  2.22% in refinement,
  13.83% in checking derefinement (coarsening),
  1.43% in derefinement,
  49.44% in guardcell filling,
  3.44% in prolongating data,
  9.43% in dealing with work balancing.
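The percentage breakdown printed above reflects the kind of computation Hercule's metric evaluator performs; a minimal sketch (the per-operation timing dictionary is hypothetical sample data, not measured output):

```python
def communication_breakdown(times):
    """Share of total communication time per operation, in percent."""
    total = sum(times.values())
    return {op: round(100.0 * t / total, 2) for op, t in times.items()}

shares = communication_breakdown({"guardcell_fill": 49.44,
                                  "derefine_check": 13.83,
                                  "other": 36.73})
```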


21
Step 3: Interpret Expensive guardcell_filling with PRT Performance

Level 3 experiment for diagnosing grid guardcell-filling related problems -- collect a performance event trace with respect to restriction, intra-level, and inter-level communication associated with the grid block tree.
______________________________________________________________
do experiment 3... ...
Among the guardcell-filling communication, 53.01% is spent restricting the solution up the block tree, 8.27% is spent in building tree connections required by guardcell-filling (updating the neighbor list in terms of Morton order), and 38.71% in transferring guardcell data among grid blocks.
______________________________________________________________
The restriction communication time consists of 94.77% in transferring physical data among grid blocks, and 5.23% in building tree connections. Among the restriction communication, 92.26% is spent in collective communications. Looking at the performance of data transfer in restrictions from the PRT perspective, remote fetches of parent data comprise 0.0%, remote fetches of sibling data comprise 0.0%, and remote fetches of child data comprise 100%. Improving block contiguity at the inter-level of the PRT will reduce restriction data communication.
______________________________________________________________
Among the guardcell data transfer, 65.78% is spent in collective communications. Looking at the performance of guardcell data transfer from the PRT perspective, remote fetches of parent data comprise 3.42%, remote fetches of sibling data comprise 85.93%, and remote fetches of child data comprise 10.64%. Improving block contiguity at the intra-level of the PRT will reduce guardcell data communication.

[Slide callouts: AMR model performance; PRT operation performance in AMR_Restrict; PRT operation performance in transferring guardcell data]
22
Conclusion and Future Directions
  • Model-based performance diagnosis approach
  • Provides performance feedback at a high level of abstraction
  • Supports automatic problem discovery and interpretation
  • Enables novice programmers to use established expertise
  • Compositional model diagnosis
  • Adapts the knowledge engineering approach to model integration
  • Disentangles cross-model performance effects
  • Enhances applicability of the model-based approach
  • Future directions
  • Automate performance knowledge adaptation
  • Algorithmic knowledge, compositional model knowledge
  • Incorporate a system utilization model
  • Reveal the interplay between programming model and system utilization
  • Explain performance with the model-system relationship