Title: Automated Support for Classifying Software Failure Reports
1. Automated Support for Classifying Software Failure Reports
- Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang (Case Western Reserve University)
- Presented by Hamid Haidarian Shahri
2. Automated failure reporting
- Recent software products automatically detect and report crashes/exceptions to the developer
  - Netscape Navigator
  - Microsoft products
- Report includes call stack, register values, other debug info
3. Example
4. User-initiated reporting
- Other products permit the user to report a failure at any time
- User describes the problem
- Application state info may also be included in the report
5. Mixed blessing
- Good news
- More failures reported
- More precise diagnostic information
- Bad news
- Dramatic increase in failure reports
- Too many to review manually
6. Our approach
- Help developers group reported failures with the same cause before the cause is known
- Provide semi-automatic support for:
  - Execution profiling
  - Supervised and unsupervised pattern classification
  - Multivariate visualization
- Initial classification is checked and refined by the developer
7. Example classification
8. How classification helps (Benefits)
- Aids prioritization and debugging
- Suggests number of underlying defects
- Reflects how often each defect causes failures
- Assembles evidence relevant to prioritizing and diagnosing each defect
9. Formal view of problem
- Let F = {f1, f2, ..., fm} be the set of reported failures
- The true failure classification is a partition of F into subsets F1, F2, ..., Fk such that in each Fi all failures have the same cause
- Our approach produces an approximate failure classification G1, G2, ..., Gp
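As a toy illustration of the definitions above, the following Python sketch checks the partition property for an invented set of failures and two invented classifications (all failure names and groupings here are hypothetical, not from the experiments):

```python
# Hypothetical illustration of the formal view: both the true classification
# and the approximate one must be partitions of the reported failures F.

def is_partition(F, groups):
    """Check that `groups` is a partition of set F: non-empty,
    pairwise disjoint subsets whose union is F."""
    seen = set()
    for g in groups:
        if not g or seen & g:       # empty subset or overlap with earlier ones
            return False
        seen |= g
    return seen == F

F = {"f1", "f2", "f3", "f4", "f5"}
true_classes = [{"f1", "f2"}, {"f3"}, {"f4", "f5"}]      # same-cause groups
approx_classes = [{"f1", "f2", "f3"}, {"f4", "f5"}]      # produced by clustering

print(is_partition(F, true_classes), is_partition(F, approx_classes))
```

Note that the approximate classification can be a valid partition while still disagreeing with the true one; measuring that disagreement is what the manual refinement step addresses.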
10. Classification strategy
- Software instrumented to collect and upload profiles or captured executions for the developer
- Profiles of reported failures combined with those of apparently successful executions (reducing bias)
- Subset of relevant features selected
- Failure profiles analyzed using cluster analysis and multivariate visualization
- Initial classification of failures examined and refined
11. Execution profiling
- Our approach is not limited to classifying crashes and exceptions
- A user may report a failure well after the critical events leading to it
- Profiles should characterize the entire execution
- Profiles should characterize events potentially relevant to the failure, e.g.:
  - Control flow, data flow, variable values, event sequences, state transitions
- Full execution capture/replay permits arbitrary profiling
12. Feature selection
- Generate candidate feature sets
- Use each one to train a classifier to distinguish failures from successful executions
- Select the features of the classifier that performs best overall
- Use those features to group (cluster) related failures
13. Probabilistic wrapper method
- Used to select features in our experiments
- Due to Liu and Setiono
- Random feature sets generated
- Each used with one part of the profile data to train a classifier
- Misclassification rate of each classifier estimated using another part of the data (testing)
- Features of the classifier with the smallest estimated misclassification rate used for grouping failures
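The wrapper loop above can be sketched as follows. This is a minimal stand-in that uses a toy nearest-centroid classifier rather than the logistic regression used in the experiments, and the profile data, feature counts, and candidate-set size are all invented:

```python
import random

random.seed(0)

def train_centroids(data, labels, feats):
    """Mean profile of each class, restricted to the chosen features."""
    cents = {}
    for lab in set(labels):
        rows = [d for d, l in zip(data, labels) if l == lab]
        cents[lab] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return cents

def error_rate(cents, data, labels, feats):
    """Misclassification rate of nearest-centroid prediction on held-out data."""
    wrong = 0
    for d, l in zip(data, labels):
        pred = min(cents, key=lambda lab: sum((d[f] - c) ** 2
                                              for f, c in zip(feats, cents[lab])))
        wrong += pred != l
    return wrong / len(data)

# Toy profiles of 6 function-call counts; label 1 = failure, 0 = success.
train = [[5, 0, 1, 9, 2, 0], [4, 1, 0, 8, 3, 1], [0, 7, 6, 1, 0, 5], [1, 6, 7, 0, 1, 4]]
train_y = [1, 1, 0, 0]
test = [[5, 1, 1, 9, 2, 1], [0, 6, 6, 0, 0, 5]]
test_y = [1, 0]

best = None
for _ in range(20):                               # random candidate feature sets
    feats = random.sample(range(6), 3)
    cents = train_centroids(train, train_y, feats)
    err = error_rate(cents, test, test_y, feats)  # estimate on held-out part
    if best is None or err < best[0]:
        best = (err, feats)

print("best feature set:", sorted(best[1]), "estimated error:", best[0])
```

The features of the winning candidate would then be the ones passed on to the clustering step.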
14. Logistic regression (skip)
- Simple, widely-used classifier
- Binary dependent variable Y
- Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x)
15. Logistic regression cont. (skip)
- Log odds ratio (logit) g(x) defined by g(x) = ln[π(x) / (1 - π(x))] = β0 + β1 x1 + ... + βp xp
- Coefficients estimated from a sample of x and Y values
- Estimate of Y given x is 1 iff estimate of g(x) is positive
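The decision rule above (predict Y = 1 iff g(x) > 0, which is equivalent to π(x) > 0.5) can be illustrated with made-up coefficients; the beta values below are purely illustrative, not fitted from any real failure data:

```python
import math

def logit(x, beta0, beta):
    """g(x) = beta0 + sum_i beta_i * x_i (the log odds ratio)."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def prob(x, beta0, beta):
    """pi(x) = P(Y = 1 | x) = 1 / (1 + exp(-g(x)))."""
    return 1.0 / (1.0 + math.exp(-logit(x, beta0, beta)))

beta0, beta = -1.0, [0.8, -0.5]      # invented coefficients
x = [3.0, 1.0]                       # g(x) = -1 + 2.4 - 0.5 = 0.9 > 0
print(prob(x, beta0, beta) > 0.5, logit(x, beta0, beta) > 0)  # both True: rules agree
```

Since the logistic function is monotone, thresholding g(x) at 0 and thresholding π(x) at 0.5 always give the same prediction.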
16. Grouping related failures
- Alternatives
- 1) Automatic cluster analysis
- Can be fully automated
- 2) Multivariate visualization
- User must identify groups in display
- Weaknesses of each approach are offset by combining them
17. 1) Automatic cluster analysis
- Identifies clusters among objects based on similarity of feature values
- Employs a dissimilarity metric
  - e.g., Euclidean or Manhattan distance
- Must estimate the number of clusters
  - Difficult problem
  - Several reasonable ways to cluster a population may exist
18. Estimating number of clusters
- Widely-used metric of clustering quality due to Calinski and Harabasz: CH(c) = [B / (c - 1)] / [W / (n - c)], where c is the number of clusters
- B is the total between-cluster sum of squared distances
- W is the total within-cluster sum of squared distances from cluster centroids
- n is the number of objects in the population
- Local maxima represent alternative estimates of the number of clusters
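A minimal sketch of computing the Calinski-Harabasz index, assuming squared Euclidean distances; the 2D points and the two candidate clusterings are invented for illustration:

```python
def centroid(pts):
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(pts[0]))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ch_index(clusters):
    """CH(c) = (B / (c - 1)) / (W / (n - c)) for a list of point clusters."""
    n = sum(len(cl) for cl in clusters)
    c = len(clusters)
    grand = centroid([p for cl in clusters for p in cl])
    B = sum(len(cl) * sq_dist(centroid(cl), grand) for cl in clusters)   # between
    W = sum(sq_dist(p, centroid(cl)) for cl in clusters for p in cl)     # within
    return (B / (c - 1)) / (W / (n - c))

pts_a = [[0, 0], [0, 1], [1, 0]]        # tight group near the origin
pts_b = [[9, 9], [9, 10], [10, 9]]      # tight group far away
good = [pts_a, pts_b]                   # matches the true structure
bad = [[pts_a[0], pts_b[0]], [pts_a[1], pts_a[2], pts_b[1], pts_b[2]]]
print(ch_index(good) > ch_index(bad))   # higher index = better clustering
```

In the experiments this index would be evaluated for each candidate c, and its local maxima taken as candidate cluster counts.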
19. 2) Multidimensional scaling (MDS)
- Represents dissimilarities between objects by a 2D scatter plot
- Distances between points in the display approximate the dissimilarities
- Small dissimilarities are poorly represented with high-dimensional profiles
- Our solution: hierarchical MDS (HMDS)
20. Confirming or refining the initial classification
- Select 2 failures from each group
- Choose ones with maximally dissimilar profiles
- Debug to determine if they are related
- If not, split group
- Examine neighboring groups to see if they should be combined
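The selection step above (pick the two failures in each group with maximally dissimilar profiles) can be sketched with a brute-force pairwise comparison; the profile vectors are invented:

```python
from itertools import combinations

def euclidean_sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def most_dissimilar_pair(group):
    """Return the pair of profile vectors in `group` at maximum distance."""
    return max(combinations(group, 2), key=lambda pq: euclidean_sq(*pq))

group = [(1, 0, 2), (1, 1, 2), (9, 8, 0)]   # third profile is the outlier
a, b = most_dissimilar_pair(group)
print(sorted([a, b]))                        # the outlier is in the selected pair
```

Debugging the most dissimilar pair is a cheap worst-case check: if even those two failures share a cause, the whole group plausibly does; if not, the group is split.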
21. Limitations
- Classification unlikely to be exact
- Sampling error
- Modeling error
- Representation error
- Spurious correlations
- Form of profiling
- Human judgment
22. Experimental validation
- Implemented classification strategy with three large subject programs
  - GCC, Jikes, javac compilers
- Failures clustered automatically (what failure?)
- Resulting clusters examined manually
  - Most or all failures in each cluster examined
23. Subject programs
- GCC 2.95.2 C compiler
  - Written in C
  - Used subset of regression test suite (self-validating execution tests)
  - 3333 tests run, 136 failures
  - Profiled with GNU Gcov (2214 function call counts)
- Jikes 1.15 Java compiler
  - Written in C++
  - Used Jacks test suite (self-validating)
  - 3149 tests run, 225 failures
  - Profiled with Gcov (3644 function call counts)
24. Subject programs cont.
- javac 1.3.1_02-b02 Java compiler
  - Written in Java
  - Used Jacks test suite
  - 3140 tests run, 233 failures
  - Profiled with a function-call profiler written using JVMPI (1554 call counts)
25. Experimental methodology (skip)
- 400-500 candidate logistic regression (LR) models generated per data set
- 500 randomly selected features per model
- Model with the lowest estimated misclassification rate chosen
- Data partitioned into three subsets
  - Train (50%) used to train candidate models
  - TestA (25%) used to pick the best model
  - TestB (25%) used for the final estimate of the misclassification rate
26. Experimental methodology cont. (skip)
- Measure used to pick the best model
  - Gives extra weight to misclassification of failures
- Final LR models correctly classified ≥ 72% of failures and ≥ 91% of successes
- Linearly dependent features omitted from fitted LR models
27. Experimental methodology cont. (skip)
- Cluster analysis
  - S-Plus clustering algorithm clara
  - Based on the k-medoids criterion
  - Calinski-Harabasz index plotted for 2 ≤ c ≤ 50; local maxima examined
- Visualization
  - Hierarchical MDS (HMDS) algorithm used
28. Manual examination of failures (skip)
- Several GCC tests often share the same source file but use different optimization levels
- Such tests often fail or succeed together
- Hence, GCC failures were grouped manually based on:
  - Source file
  - Information about bug fixes
  - Date of the first version to pass the test
29. Manual examination cont. (skip)
- Jikes and javac failures grouped in two stages
  - Automatically formed clusters checked
  - Overlapping clusters in HMDS display checked
- Activities
  - Debugging
  - Comparing versions
  - Examining error codes
  - Inspecting source files
  - Checking correspondence between tests and JLS sections
30. GCC results
31. GCC results cont.
HMDS display of GCC failure profiles after feature selection; convex hulls indicate the results of automatic clustering into 27 clusters.
HMDS display of GCC failure profiles after feature selection; convex hulls indicate failures involving the same defect, grouped using HMDS (more accurate).
32. GCC results cont.
HMDS display of GCC failure profiles before feature selection; convex hulls indicate failures involving the same defect. Hence feature selection helps in grouping.
33. javac results
34. javac results cont.
HMDS display of javac failures; convex hulls indicate the results of manual classification with HMDS.
35. Jikes results
36. Jikes results cont.
HMDS display of Jikes failures; convex hulls indicate the results of manual classification with HMDS.
37. Summary of results
- In most automatically-created clusters, the majority of failures had the same cause
- A few large, non-homogeneous clusters were created
  - Sub-clusters evident in HMDS displays
- Automatic clustering sometimes split groups of failures with the same cause
  - HMDS displays didn't have this problem
- Overall, failures with the same cause formed fairly cohesive clusters
38. Threats to validity
- Only one type of program used in experiments
- Hand-crafted test inputs used for profiling
- Think of Microsoft...
39. Related work
- χSlice (Agrawal et al.)
- Path spectra (Reps et al.)
- Tarantula (Jones et al.)
- Delta debugging (Hildebrandt and Zeller)
- Cluster filtering (Dickinson et al.)
- Clustering IDS alarms (Julisch and Dacier)
40. Conclusions
- Demonstrated that our classification strategy is potentially useful with compilers
- Further evaluation needed with different types of software and with failure reports from the field
- Note: the input space is huge; more accurate reporting (severity, location) could facilitate better grouping and overcome these problems
- Note: limited labeled data is available, and error causes/types change constantly (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow)
41. Future work
- Further experimental evaluation
- Use more powerful classification and clustering techniques
- Use different profiling techniques
- Extract additional diagnostic information
- Use techniques for classifying intrusions reported by anomaly detection systems