Title: Automated Support for Classifying Software Failure Reports
1. Automated Support for Classifying Software Failure Reports
- Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang (Case Western Reserve University)
- Presented by Hamid Haidarian Shahri
2. Automated failure reporting
- Recent software products automatically detect and report crashes/exceptions to the developer
  - Netscape Navigator
  - Microsoft products
- Report includes call stack, register values, other debug info
3. Example
4. User-initiated reporting
- Other products permit the user to report a failure at any time
- User describes the problem
- Application state info may also be included in the report
5. Mixed blessing
- Good news
- More failures reported
- More precise diagnostic information
- Bad news
- Dramatic increase in failure reports
- Too many to review manually
6. Our approach
- Help developers group reported failures with the same cause before the cause is known
- Provide semi-automatic support for:
  - Execution profiling
  - Supervised and unsupervised pattern classification
  - Multivariate visualization
- Initial classification is checked and refined by the developer
7. Example classification
8. How classification helps (Benefits)
- Aids prioritization and debugging
- Suggests number of underlying defects
- Reflects how often each defect causes failures
- Assembles evidence relevant to prioritizing and diagnosing each defect
9. Formal view of problem
- Let F = {f1, f2, ..., fm} be the set of reported failures
- The true failure classification is a partition of F into subsets F1, F2, ..., Fk such that in each Fi all failures have the same cause
- Our approach produces an approximate failure classification G1, G2, ..., Gp
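As a toy illustration of the definitions above, the following Python sketch checks the partition property for an invented set of failures and two invented classifications (all failure names and groupings here are hypothetical, not from the experiments):

```python
# Hypothetical illustration of the formal view: both the true classification
# and the approximate one must be partitions of the reported failures F.

def is_partition(F, groups):
    """Check that `groups` is a partition of set F: non-empty,
    pairwise disjoint subsets whose union is F."""
    seen = set()
    for g in groups:
        if not g or seen & g:       # empty subset or overlap with earlier ones
            return False
        seen |= g
    return seen == F

F = {"f1", "f2", "f3", "f4", "f5"}
true_classes = [{"f1", "f2"}, {"f3"}, {"f4", "f5"}]      # same-cause groups
approx_classes = [{"f1", "f2", "f3"}, {"f4", "f5"}]      # produced by clustering

print(is_partition(F, true_classes), is_partition(F, approx_classes))
```

Note that the approximate classification can be a valid partition while still disagreeing with the true one; measuring that disagreement is what the manual refinement step addresses.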
10. Classification strategy
- Software instrumented to collect and upload profiles or captured executions for the developer
- Profiles of reported failures combined with those of apparently successful executions (reducing bias)
- Subset of relevant features selected
- Failure profiles analyzed using cluster analysis and multivariate visualization
- Initial classification of failures examined and refined
11. Execution profiling
- Our approach is not limited to classifying crashes and exceptions
- A user may report a failure well after the critical events leading to it
- Profiles should characterize the entire execution
- Profiles should characterize events potentially relevant to the failure, e.g.:
  - Control flow, data flow, variable values, event sequences, state transitions
- Full execution capture/replay permits arbitrary profiling
12. Feature selection
- Generate candidate feature sets
- Use each one to train a classifier to distinguish failures from successful executions
- Select the features of the classifier that performs best overall
- Use those features to group (cluster) related failures
13. Probabilistic wrapper method
- Used to select features in our experiments
- Due to Liu and Setiono
- Random feature sets generated
- Each used with one part of the profile data to train a classifier
- Misclassification rate of each classifier estimated using another part of the data (testing)
- Features of the classifier with the smallest estimated misclassification rate used for grouping failures
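The wrapper loop above can be sketched as follows. This is a minimal stand-in that uses a toy nearest-centroid classifier rather than the logistic regression used in the experiments, and the profile data, feature counts, and candidate-set size are all invented:

```python
import random

random.seed(0)

def train_centroids(data, labels, feats):
    """Mean profile of each class, restricted to the chosen features."""
    cents = {}
    for lab in set(labels):
        rows = [d for d, l in zip(data, labels) if l == lab]
        cents[lab] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return cents

def error_rate(cents, data, labels, feats):
    """Misclassification rate of nearest-centroid prediction on held-out data."""
    wrong = 0
    for d, l in zip(data, labels):
        pred = min(cents, key=lambda lab: sum((d[f] - c) ** 2
                                              for f, c in zip(feats, cents[lab])))
        wrong += pred != l
    return wrong / len(data)

# Toy profiles of 6 function-call counts; label 1 = failure, 0 = success.
train = [[5, 0, 1, 9, 2, 0], [4, 1, 0, 8, 3, 1], [0, 7, 6, 1, 0, 5], [1, 6, 7, 0, 1, 4]]
train_y = [1, 1, 0, 0]
test = [[5, 1, 1, 9, 2, 1], [0, 6, 6, 0, 0, 5]]
test_y = [1, 0]

best = None
for _ in range(20):                               # random candidate feature sets
    feats = random.sample(range(6), 3)
    cents = train_centroids(train, train_y, feats)
    err = error_rate(cents, test, test_y, feats)  # estimate on held-out part
    if best is None or err < best[0]:
        best = (err, feats)

print("best feature set:", sorted(best[1]), "estimated error:", best[0])
```

The features of the winning candidate would then be the ones passed on to the clustering step.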
14. Logistic regression (skip)
- Simple, widely-used classifier
- Binary dependent variable Y
- Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x)
15. Logistic regression cont. (skip)
- Log odds ratio (logit) g(x) defined by g(x) = ln[π(x) / (1 - π(x))] = β0 + β1 x1 + ... + βp xp
- Coefficients estimated from a sample of x and Y values
- Estimate of Y given x is 1 iff estimate of g(x) is positive
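The decision rule above (predict Y = 1 iff g(x) > 0, which is equivalent to π(x) > 0.5) can be illustrated with made-up coefficients; the beta values below are purely illustrative, not fitted from any real failure data:

```python
import math

def logit(x, beta0, beta):
    """g(x) = beta0 + sum_i beta_i * x_i (the log odds ratio)."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def prob(x, beta0, beta):
    """pi(x) = P(Y = 1 | x) = 1 / (1 + exp(-g(x)))."""
    return 1.0 / (1.0 + math.exp(-logit(x, beta0, beta)))

beta0, beta = -1.0, [0.8, -0.5]      # invented coefficients
x = [3.0, 1.0]                       # g(x) = -1 + 2.4 - 0.5 = 0.9 > 0
print(prob(x, beta0, beta) > 0.5, logit(x, beta0, beta) > 0)  # both True: rules agree
```

Since the logistic function is monotone, thresholding g(x) at 0 and thresholding π(x) at 0.5 always give the same prediction.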
16. Grouping related failures
- Alternatives
- 1) Automatic cluster analysis
- Can be fully automated
- 2) Multivariate visualization
- User must identify groups in display
- Weaknesses of each approach are offset by combining them
17. 1) Automatic cluster analysis
- Identifies clusters among objects based on similarity of feature values
- Employs a dissimilarity metric
  - e.g., Euclidean or Manhattan distance
- Must estimate the number of clusters
  - Difficult problem
  - Several reasonable ways to cluster a population may exist
18. Estimating number of clusters
- Widely-used metric of clustering quality due to Calinski and Harabasz: CH(c) = [B / (c - 1)] / [W / (n - c)], where c is the number of clusters
- B is the total between-cluster sum of squared distances
- W is the total within-cluster sum of squared distances from cluster centroids
- n is the number of objects in the population
- Local maxima represent alternative estimates of the number of clusters
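A minimal sketch of computing the Calinski-Harabasz index, assuming squared Euclidean distances; the 2D points and the two candidate clusterings are invented for illustration:

```python
def centroid(pts):
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(pts[0]))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ch_index(clusters):
    """CH(c) = (B / (c - 1)) / (W / (n - c)) for a list of point clusters."""
    n = sum(len(cl) for cl in clusters)
    c = len(clusters)
    grand = centroid([p for cl in clusters for p in cl])
    B = sum(len(cl) * sq_dist(centroid(cl), grand) for cl in clusters)   # between
    W = sum(sq_dist(p, centroid(cl)) for cl in clusters for p in cl)     # within
    return (B / (c - 1)) / (W / (n - c))

pts_a = [[0, 0], [0, 1], [1, 0]]        # tight group near the origin
pts_b = [[9, 9], [9, 10], [10, 9]]      # tight group far away
good = [pts_a, pts_b]                   # matches the true structure
bad = [[pts_a[0], pts_b[0]], [pts_a[1], pts_a[2], pts_b[1], pts_b[2]]]
print(ch_index(good) > ch_index(bad))   # higher index = better clustering
```

In the experiments this index would be evaluated for each candidate c, and its local maxima taken as candidate cluster counts.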
19. 2) Multidimensional scaling (MDS)
- Represents dissimilarities between objects by a 2D scatter plot
- Distances between points in the display approximate the dissimilarities
- Small dissimilarities are poorly represented with high-dimensional profiles
- Our solution: hierarchical MDS (HMDS)
20. Confirming or refining the initial classification
- Select 2 failures from each group
- Choose ones with maximally dissimilar profiles
- Debug to determine if they are related
- If not, split group
- Examine neighboring groups to see if they should be combined
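The selection step above (pick the two failures in each group with maximally dissimilar profiles) can be sketched with a brute-force pairwise comparison; the profile vectors are invented:

```python
from itertools import combinations

def euclidean_sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def most_dissimilar_pair(group):
    """Return the pair of profile vectors in `group` at maximum distance."""
    return max(combinations(group, 2), key=lambda pq: euclidean_sq(*pq))

group = [(1, 0, 2), (1, 1, 2), (9, 8, 0)]   # third profile is the outlier
a, b = most_dissimilar_pair(group)
print(sorted([a, b]))                        # the outlier is in the selected pair
```

Debugging the most dissimilar pair is a cheap worst-case check: if even those two failures share a cause, the whole group plausibly does; if not, the group is split.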
21. Limitations
- Classification unlikely to be exact
- Sampling error
- Modeling error
- Representation error
- Spurious correlations
- Form of profiling
- Human judgment
22. Experimental validation
- Implemented classification strategy with three large subject programs
  - GCC, Jikes, javac compilers
- Failures clustered automatically (what failure?)
- Resulting clusters examined manually
  - Most or all failures in each cluster examined
23. Subject programs
- GCC 2.95.2 C compiler
  - Written in C
  - Used subset of regression test suite (self-validating execution tests)
  - 3333 tests run, 136 failures
  - Profiled with GNU Gcov (2214 function call counts)
- Jikes 1.15 Java compiler
  - Written in C++
  - Used Jacks test suite (self-validating)
  - 3149 tests run, 225 failures
  - Profiled with Gcov (3644 function call counts)
24. Subject programs cont.
- javac 1.3.1_02-b02 Java compiler
  - Written in Java
  - Used Jacks test suite
  - 3140 tests run, 233 failures
  - Profiled with a function-call profiler written using JVMPI (1554 call counts)
25. Experimental methodology (skip)
- 400-500 candidate logistic regression (LR) models generated per data set
- 500 randomly selected features per model
- Model with the lowest estimated misclassification rate chosen
- Data partitioned into three subsets
  - Train (50%) used to train candidate models
  - TestA (25%) used to pick the best model
  - TestB (25%) used for the final estimate of the misclassification rate
26. Experimental methodology cont. (skip)
- Measure used to pick the best model
  - Gives extra weight to misclassification of failures
- Final LR models correctly classified ≥ 72% of failures and ≥ 91% of successes
- Linearly dependent features omitted from fitted LR models
27. Experimental methodology cont. (skip)
- Cluster analysis
  - S-Plus clustering algorithm clara
  - Based on the k-medoids criterion
  - Calinski-Harabasz index plotted for 2 ≤ c ≤ 50; local maxima examined
- Visualization
  - Hierarchical MDS (HMDS) algorithm used
28. Manual examination of failures (skip)
- Several GCC tests often share the same source file but use different optimization levels
- Such tests often fail or succeed together
- Hence, GCC failures were grouped manually based on:
  - Source file
  - Information about bug fixes
  - Date of the first version to pass the test
29. Manual examination cont. (skip)
- Jikes and javac failures grouped in two stages
  - Automatically formed clusters checked
  - Overlapping clusters in HMDS display checked
- Activities
  - Debugging
  - Comparing versions
  - Examining error codes
  - Inspecting source files
  - Checking correspondence between tests and JLS sections
30. GCC results
31. GCC results cont.
HMDS display of GCC failure profiles after feature selection; convex hulls indicate the results of automatic clustering into 27 clusters.
HMDS display of GCC failure profiles after feature selection; convex hulls indicate failures involving the same defect, grouped using HMDS (more accurate).
32. GCC results cont.
HMDS display of GCC failure profiles before feature selection; convex hulls indicate failures involving the same defect. Hence feature selection helps in grouping.
33. javac results
34. javac results cont.
HMDS display of javac failures; convex hulls indicate the results of manual classification with HMDS.
35. Jikes results
36. Jikes results cont.
HMDS display of Jikes failures; convex hulls indicate the results of manual classification with HMDS.
37. Summary of results
- In most automatically-created clusters, the majority of failures had the same cause
- A few large, non-homogeneous clusters were created
  - Sub-clusters evident in HMDS displays
- Automatic clustering sometimes split groups of failures with the same cause
  - HMDS displays didn't have this problem
- Overall, failures with the same cause formed fairly cohesive clusters
38. Threats to validity
- Only one type of program used in experiments
- Hand-crafted test inputs used for profiling
- Think of Microsoft...
39. Related work
- χSlice (Agrawal et al.)
- Path spectra (Reps et al.)
- Tarantula (Jones et al.)
- Delta debugging (Hildebrandt and Zeller)
- Cluster filtering (Dickinson et al.)
- Clustering IDS alarms (Julisch and Dacier)
40. Conclusions
- Demonstrated that our classification strategy is potentially useful with compilers
- Further evaluation needed with different types of software and with failure reports from the field
- Note: the input space is huge; more accurate reporting (severity, location) could facilitate better grouping and overcome these problems
- Note: limited labeled data is available, and error causes/types change constantly (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow)
41. Future work
- Further experimental evaluation
- Use more powerful classification and clustering techniques
- Use different profiling techniques
- Extract additional diagnostic information
- Use techniques for classifying intrusions reported by anomaly detection systems