Stack Trace Analysis for Large Scale Debugging using MRNet

Transcript and Presenter's Notes

1
Stack Trace Analysis for Large Scale Debugging
using MRNet
UCRL-PRES-230290
  • Dorian C. Arnold, Barton P. Miller
  • University of Wisconsin
  • Dong Ahn, Bronis R. de Supinski, Gregory L. Lee,
    Martin Schulz
  • Lawrence Livermore National Laboratory

2
Scaling Tools
  • Machine sizes are increasing
    • New clusters close to or above 10,000 cores
    • Blue Gene/L: over 131,000 cores
  • Not only applications need to scale
    • Support environment
    • Tools
  • Challenges
    • Data collection, storage, and analysis
    • Scalable process management and control
    • Visualization

3
LLNL Parallel Debug Sessions
18,391 sessions (03/01/2006 to 05/11/2006)
4
Debugging on BlueGene/L
  • Typical debug session includes many interactions

4096 is only 3% of BG/L!
5
Scalability Limitations
  • Large volumes of debug data
  • Single frontend for all node connections
  • Centralized data analysis
  • Vendor licensing limitations
  • Approach: a scalable, lightweight debugger
    • Reduce the exploration space to a small subset
    • Online aggregation using a TBON
    • Full-featured debugger for deeper digging

6
Outline
  • Case study: CCSM
  • STAT Approach
    • Concept of Stack Traces
    • Identification of Equivalence Classes
  • Implementation
    • Using Tree-based Overlay Networks
    • Data and Work Flow in STAT
  • Evaluation
  • Conclusions

7
Case Study: CCSM
  • Community Climate System Model (CCSM)
    • Used to make climate predictions
    • Coupled models for atmosphere, ocean, sea ice, and
      land surface
  • Implementation
    • Multiple Program Multiple Data (MPMD) model
    • MPI-based application
    • Distinct components for each model
  • Typically requires a significant node count
    • Models executed concurrently
    • Several hundred tasks

8
Observations
  • Intermittently hangs with 472 tasks
    • Non-deterministic
    • Only at large scale
    • Appears at seemingly random code locations
    • Hard to reproduce: 2 hangs over the next 10 days (50
      runs)
  • Current approach
    • Attach to the job using TotalView
    • Collect stack traces from all 472 tasks
    • Visualize the cross-node callgraph

9
CCSM Callgraph
10
Lessons Learned
  • Some bugs only occur at large scales
  • Non-deterministic and hard to reproduce
  • Stack traces can provide useful insight
  • Many bugs are temporal in nature
  • Need tools that
    • Combine spatial and temporal observations
    • Discover application behavior
    • Run effectively at scale

11
STAT Approach
  • Sample application stack traces
    • Across time and space
    • Through a third-party interface
    • Using a DynInst-based daemon (see the sketch below)
  • Merge/analyze traces
    • Discover equivalent process behavior
    • Group similar processes
    • Facilitate scalable analysis/data presentation
  • Leverage the TBON model (MRNet)
    • Communicate traces back to a frontend
    • Merge on the fly within MRNet filters
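
To make the sampling step concrete, below is a minimal sketch of a third-party sampler, assuming Dyninst's StackwalkerAPI; Walker::newWalker, walkStack, and Frame::getName are from that API, but exact signatures vary by version, and pausing/resuming the target around each walk is elided. This is an illustrative harness, not STAT's actual daemon code.

    // Minimal third-party stack sampler (hypothetical harness around
    // Dyninst's StackwalkerAPI, not STAT's actual daemon code).
    #include "walker.h"
    #include "frame.h"
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>
    #include <unistd.h>

    using namespace Dyninst::Stackwalker;

    int main(int argc, char **argv) {
        int pid = std::atoi(argv[1]);   // task to sample (third party)
        int count = 10;                 // trace(count, freq): 10 samples...
        useconds_t period = 100000;     // ...taken 100 ms apart
        Walker *walker = Walker::newWalker(pid);
        for (int i = 0; i < count; i++) {
            std::vector<Frame> frames;
            if (walker->walkStack(frames)) {
                // Print root-to-leaf so the order matches the prefix tree.
                for (auto it = frames.rbegin(); it != frames.rend(); ++it) {
                    std::string name;
                    it->getName(name);
                    std::printf("%s;", name.c_str());
                }
                std::printf("\n");
            }
            usleep(period);
        }
        return 0;
    }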

12
Singleton Stack Trace
[Figure: stack trace of a single application process]
13
Merging Stack Traces
  • Multiple traces over space or time
    • Taken independently
    • Stored in a graph representation
  • Create a call graph prefix tree (see the sketch below)
    • Only merge nodes with an identical stack backtrace
    • Retains context information
  • Advantages
    • Compressed representation
    • Scalable visualization
    • Scalable analysis
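
The prefix tree itself can be pictured with a short, self-contained sketch (Node, childFor, and insertTrace are our illustrative names, not STAT's): traces that share an identical backtrace prefix share nodes, and each node records which tasks reached it.

    // Call graph prefix tree sketch (illustrative, not STAT's code).
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct Node {
        std::map<std::string, Node*> children;  // keyed by frame name
        std::set<int> tasks;                    // tasks that reached here
    };

    Node *childFor(Node *n, const std::string &frame) {
        Node *&slot = n->children[frame];
        if (!slot) slot = new Node();  // created on first use; cleanup elided
        return slot;
    }

    // Insert one root-to-leaf trace for one task. Nodes are shared only
    // while the full backtrace prefix is identical, so context (the path
    // from the root) is retained.
    void insertTrace(Node *root, const std::vector<std::string> &trace,
                     int task) {
        Node *cur = root;
        for (const std::string &frame : trace) {
            cur = childFor(cur, frame);
            cur->tasks.insert(task);
        }
    }

Because equivalent tasks collapse onto shared paths, the tree's size tracks the number of distinct behaviors rather than the number of tasks, which is what makes the representation compressed and the visualization scalable.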

14
Merging Stack Traces
15
2D-Trace/Space Analysis
[Figure: traces from multiple application processes merged across space into a single prefix tree]
16
Prefix Tree vs. DAG
[Figure: TotalView's DAG view vs. STAT's call graph prefix tree]
17
2D-Trace/Time Analysis

[Figure: traces sampled over time from a single application process merged into a prefix tree]
18
Time Space Analysis
  • Both 2D techniques are insufficient
    • Spatial aggregation misses the temporal component
    • Temporal aggregation misses parallel aspects
  • Multiple samples, multiple processes
    • Track global program behavior over time
    • Merge into a single, 3D prefix tree (see the sketch below)
  • Challenges
    • Scalable data representation
    • Scalable analysis
    • Scalable and useful visualization/results
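
Extending the earlier sketch from 2D to 3D might look as follows (again our illustration, not STAT's actual representation): each node now records, per sample index, the set of tasks that reached it, so identical behavior collapses across both time and space.

    // 3D extension of the prefix tree sketch: per-sample task sets.
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    struct Node3D {
        std::map<std::string, Node3D*> children;
        std::map<int, std::set<int>> tasksAtTime;  // sample index -> tasks
    };

    void insertTrace3D(Node3D *root, const std::vector<std::string> &trace,
                       int task, int sample) {
        Node3D *cur = root;
        for (const std::string &frame : trace) {
            Node3D *&slot = cur->children[frame];
            if (!slot) slot = new Node3D();  // cleanup elided
            cur = slot;
            cur->tasksAtTime[sample].insert(task);
        }
    }

Tasks that follow the same paths at the same sample indices form one equivalence class, so a full-featured debugger only needs to attach to one representative per class.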

19
3D-Trace/Space/Time Analysis
[Figure: traces from multiple processes and multiple samples merged into a single 3D prefix tree]
20
3D-Trace/Space/Time Analysis
288 Nodes / 10 Snapshots
21
STAT on CCSM Case Study
22
Implementation Details
  • Communication through MRNet
    • Single data stream from BE to FE
    • Filters implement the tree merge (see the sketch below)
    • Tree depth can be configured
  • Three major components
    • Backend (BE) daemons gathering traces
    • Communication processes merging the prefix trees
    • Frontend (FE) tool storing the final graph
  • Final result saved as a GML or DOT file
    • Node classes are color coded
    • External visualization tools
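
The per-level merge a filter performs, and the final DOT output, can be sketched as below; this is a stand-in for the filter body (MRNet's actual filter signature, packet packing, and tree serialization are omitted), redeclaring the Node type from the earlier sketch for self-containment.

    // Stand-in for the STAT merge filter plus a DOT emitter; the real
    // MRNet filter API (packet vectors, format strings) is omitted.
    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>

    struct Node {
        std::map<std::string, Node*> children;
        std::set<int> tasks;
    };

    // Merge child subtree 'src' into 'dst': a recursive union of
    // children and task sets, applied by each communication process
    // to the trees arriving from its children.
    void mergeTree(Node *dst, const Node *src) {
        dst->tasks.insert(src->tasks.begin(), src->tasks.end());
        for (const auto &kv : src->children) {
            Node *&slot = dst->children[kv.first];
            if (!slot) slot = new Node();
            mergeTree(slot, kv.second);
        }
    }

    // Emit the merged tree as DOT; labels carry the frame name and
    // task count (color coding of node classes is left out).
    void emitDot(const Node *n, const std::string &name, int parent,
                 int &nextId, std::FILE *out) {
        int self = nextId++;
        std::fprintf(out, "  n%d [label=\"%s\\n%zu tasks\"];\n",
                     self, name.c_str(), n->tasks.size());
        if (parent >= 0)
            std::fprintf(out, "  n%d -> n%d;\n", parent, self);
        for (const auto &kv : n->children)
            emitDot(kv.second, kv.first, self, nextId, out);
    }

    void writeDot(const Node *root, const char *path) {
        std::FILE *out = std::fopen(path, "w");
        std::fprintf(out, "digraph stat {\n");
        int nextId = 0;
        emitDot(root, "root", -1, nextId, out);
        std::fprintf(out, "}\n");
        std::fclose(out);
    }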

23
Work and Data Flow
[Diagram: the FE issues a trace(count, freq.) request down the MRNet tree; communication processes (CP) perform the tree merge on the traces flowing back up; one BE daemon runs on each of Nodes 1 through N]
24
STAT Performance
1024x4 cluster, 1.4 GHz Itanium2, Quadrics QsNetII
25
Conclusions
  • Scaling tools poses challenges
    • Data management and process control
    • New strategies for tools are needed
  • STAT: Scalable Stack Trace Analysis
    • Lightweight tool to identify process classes
    • Based on merged callgraph prefix trees
    • Aggregation in time and space
    • Orthogonal to full-featured debuggers
  • Implementation based on TBONs
    • Scalable data collection and aggregation
    • Enables significant speedup

26
More Information
  • Paper published at IPDPS 2007: "Stack Trace Analysis
    for Large Scale Debugging", D. Arnold, D.H. Ahn,
    B.R. de Supinski, G. Lee, B.P. Miller, and M. Schulz
  • Project website (demo tomorrow):
    http://www.paradyn.org/STAT
  • TBON computing papers and the open-source prototype,
    MRNet, available at http://www.paradyn.org/mrnet