Automated Support for Classifying Software Failure Reports

Transcript and Presenter's Notes

1
Automated Support for Classifying Software
Failure Reports
  • Andy Podgurski, David Leon, Patrick Francis, Wes
    Masri, Melinda Minch, Jiayang Sun, Bin Wang
  • Case Western Reserve University

Presented by Hamid Haidarian Shahri
2
Automated failure reporting
  • Recent software products automatically detect and
    report crashes/exceptions to developer
  • Netscape Navigator
  • Microsoft products
  • Report includes call stack, register values,
    other debug info

3
Example
4
User-initiated reporting
  • Other products permit user to report a failure at
    any time
  • User describes problem
  • Application state info may also be included in
    report

5
Mixed blessing
  • Good news
  • More failures reported
  • More precise diagnostic information
  • Bad news
  • Dramatic increase in failure reports
  • Too many to review manually

6
Our approach
  • Help developers group reported failures with same
    cause before cause is known
  • Provide semi-automatic support
  • For execution profiling
  • Supervised and unsupervised pattern
    classification
  • Multivariate visualization
  • Initial classification is checked, refined by
    developer

7
Example classification
8
How classification helps (Benefits)
  • Aids prioritization and debugging
  • Suggests number of underlying defects
  • Reflects how often each defect causes failures
  • Assembles evidence relevant to prioritizing,
    diagnosing each defect

9
Formal view of problem
  • Let F = {f1, f2, ..., fm} be the set of reported
    failures
  • True failure classification: a partition of F into
    subsets F1, F2, ..., Fk such that all failures in
    each Fi have the same cause
  • Our approach produces an approximate failure
    classification G1, G2, ..., Gp

10
Classification strategy
  • Software instrumented to collect and upload
    profiles or captured executions for developer
  • Profiles of reported failures combined with those
    of apparently successful executions (reducing
    bias)
  • Subset of relevant features selected
  • Failure profiles analyzed using cluster analysis
    and multivariate visualization
  • Initial classification of failures examined,
    refined
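
A minimal end-to-end sketch of this strategy, assuming profiles are already available as numeric call-count vectors; the function and parameter names are illustrative, and scikit-learn components stand in for the tools used in the paper:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

def group_failures(profiles, is_failure, n_groups, subset_size=50,
                   n_candidates=20, seed=0):
    """profiles: (n_executions, n_features) matrix of call counts for failures
    plus apparently successful executions; is_failure: boolean label array."""
    is_failure = np.asarray(is_failure)
    rng = np.random.default_rng(seed)
    # Feature selection: keep the random feature subset whose classifier best
    # separates failures from successes (a simplified stand-in for the
    # wrapper method described later).
    best_feats, best_acc = None, -1.0
    for _ in range(n_candidates):
        feats = rng.choice(profiles.shape[1], size=subset_size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(profiles[:, feats], is_failure)
        acc = clf.score(profiles[:, feats], is_failure)
        if acc > best_acc:
            best_feats, best_acc = feats, acc
    # Cluster only the failure profiles on the selected features; the clusters
    # are the initial classification to be examined and refined by hand.
    labels = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=seed).fit_predict(profiles[is_failure][:, best_feats])
    return best_feats, labels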

11
Execution profiling
  • Our approach not limited to classifying crashes
    and exceptions
  • User may report failure well after critical
    events leading to failure
  • Profiles should characterize entire execution
  • Profiles should characterize events potentially
    relevant to failure, e.g.,
  • Control flow, data flow, variable values, event
    sequences, state transitions
  • Full execution capture/replay permits arbitrary
    profiling
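
A minimal sketch of one such profile, turning per-execution function-call counts (for example, parsed from gcov output) into fixed-length vectors; the function names and counts below are illustrative only:

import numpy as np

# Illustrative call counts for two executions (not real gcov data).
executions = [
    {"parse_decl": 12, "fold_const": 3, "emit_insn": 40},
    {"parse_decl": 9, "gen_rtx": 7},
]

# Fix one feature order so every execution maps into the same vector space.
all_functions = sorted({name for ex in executions for name in ex})
profiles = np.array([[ex.get(name, 0) for name in all_functions]
                     for ex in executions], dtype=float)
print(all_functions)   # ['emit_insn', 'fold_const', 'gen_rtx', 'parse_decl']
print(profiles)        # [[40. 3. 0. 12.], [0. 0. 7. 9.]]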

12
Feature selection
  • Generate candidate feature sets
  • Use each one to train classifier to distinguish
    failures from successful executions
  • Select the features of the classifier that
    performs best overall
  • Use those features to group (cluster) related
    failures

13
Probabilistic wrapper method
  • Used to select features in our experiments
  • Due to Liu and Setiono
  • Random feature sets generated
  • Each used with one part of profile data to train
    classifier
  • Misclassification rate of each classifier
    estimated using another part of data (testing)
  • Features of classifier with smallest estimated
    misclassification rate used for grouping failures
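
A minimal sketch of this wrapper loop, using scikit-learn's logistic regression as the classifier; the subset size and number of candidates are illustrative defaults, not the settings from the experiments:

import numpy as np
from sklearn.linear_model import LogisticRegression

def wrapper_select(X_train, y_train, X_test, y_test, subset_size=50,
                   n_candidates=100, seed=0):
    """Return the random feature subset whose classifier has the lowest
    estimated misclassification rate on the held-out test part."""
    rng = np.random.default_rng(seed)
    best_feats, best_err = None, np.inf
    for _ in range(n_candidates):
        feats = rng.choice(X_train.shape[1], size=subset_size, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[:, feats], y_train)
        err = 1.0 - clf.score(X_test[:, feats], y_test)  # misclassification rate
        if err < best_err:
            best_feats, best_err = feats, err
    return best_feats, best_err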

14
Logistic regression (skip)
  • Simple, widely-used classifier
  • Binary dependent variable Y
  • Expected value E(Y | x) of Y given predictor
    x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x)

15
Logistic regression cont. (skip)
  • Log odds ratio (logit) g(x) defined by
    g(x) = ln[π(x) / (1 - π(x))] = β0 + β1x1 + ... + βpxp
  • Coefficients estimated from sample of x and Y
    values.
  • Estimate of Y given x is 1 iff estimate of g(x)
    is positive
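
A minimal sketch relating the decision rule above (classify as a failure iff the estimated logit is positive) to a fitted scikit-learn model; the data here is random and purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(200, 10)).astype(float)  # stand-in profile features
y = (X[:, 0] + X[:, 1] > 6).astype(int)             # stand-in failure labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
g_hat = clf.decision_function(X)   # estimated logit g(x) = b0 + b1*x1 + ... + bp*xp
y_hat = (g_hat > 0).astype(int)    # estimate of Y is 1 iff estimate of g(x) is positive
assert np.array_equal(y_hat, clf.predict(X))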

16
Grouping related failures
  • Alternatives
  • 1) Automatic cluster analysis
  • Can be fully automated
  • 2) Multivariate visualization
  • User must identify groups in display
  • Weaknesses of each approach offset by combining
    them

17
1) Automatic cluster analysis
  • Identifies clusters among objects based on
    similarity of feature values
  • Employs dissimilarity metric
  • e.g., Euclidean, Manhattan distance
  • Must estimate number of clusters
  • Difficult problem
  • Several reasonable ways to cluster a population
    may exist
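
A minimal sketch of the two dissimilarity metrics named above, computed with SciPy over illustrative profile vectors:

import numpy as np
from scipy.spatial.distance import pdist, squareform

profiles = np.array([[12., 3., 40.],
                     [ 9., 0.,  7.],
                     [11., 2., 38.]])

euclidean = squareform(pdist(profiles, metric="euclidean"))
manhattan = squareform(pdist(profiles, metric="cityblock"))  # Manhattan distance
print(euclidean)   # pairwise dissimilarity matrices fed to the clustering step
print(manhattan)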

18
Estimating number of clusters
  • Widely-used metric of the quality of a clustering,
    due to Calinski and Harabasz:
    CH(k) = (B / (k - 1)) / (W / (n - k))
  • B is total between-cluster sum of squared
    distances
  • W is total within-cluster sum of squared
    distances from cluster centroids
  • n is number of objects in population; k is number
    of clusters
  • Local maxima over k represent alternative estimates
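
A minimal sketch computing the index from B, W, n, and k as defined above, checked against scikit-learn's built-in implementation; the two-cluster data is illustrative:

import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.array([[0., 0.], [0., 1.], [1., 0.], [8., 8.], [8., 9.], [9., 8.]])
labels = np.array([0, 0, 0, 1, 1, 1])
n, k = len(X), len(np.unique(labels))

grand_mean = X.mean(axis=0)
B = sum(np.sum(labels == c) * np.sum((X[labels == c].mean(axis=0) - grand_mean) ** 2)
        for c in np.unique(labels))          # between-cluster sum of squares
W = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
        for c in np.unique(labels))          # within-cluster sum of squares

ch = (B / (k - 1)) / (W / (n - k))
assert np.isclose(ch, calinski_harabasz_score(X, labels))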

19
2) Multidimensional scaling (MDS)
  • Represents dissimilarities between objects by 2D
    scatter plot
  • Distances between points in display approximate
    dissimilarities
  • Small dissimilarities poorly represented with
    high-dimensional profiles
  • Our solution: hierarchical MDS (HMDS)
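
A minimal sketch of the display step, using plain metric MDS from scikit-learn on a precomputed dissimilarity matrix as a stand-in for the paper's hierarchical MDS (HMDS); the profiles are random and illustrative:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
failure_profiles = rng.poisson(2.0, size=(30, 200)).astype(float)

dissim = squareform(pdist(failure_profiles, metric="euclidean"))
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)  # 2-D coordinates for the scatter plot
print(coords[:5])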

20
Confirming or refining the initial classification
  • Select 2 failures from each group
  • Choose ones with maximally dissimilar profiles
  • Debug to determine if they are related
  • If not, split group
  • Examine neighboring groups to see if they should
    be combined
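
A minimal sketch of the first step, picking the two failures in a group whose profiles are maximally dissimilar so they can be debugged and compared; inputs are illustrative:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def most_dissimilar_pair(group_profiles):
    """Return the indices (within the group) of the two failures whose
    profiles are farthest apart."""
    d = squareform(pdist(group_profiles, metric="euclidean"))
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return i, j

group = np.array([[1., 0., 5.], [2., 1., 4.], [9., 7., 0.]])
print(most_dissimilar_pair(group))   # -> (0, 2), the farthest pair in this group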

21
Limitations
  • Classification unlikely to be exact
  • Sampling error
  • Modeling error
  • Representation error
  • Spurious correlations
  • Form of profiling
  • Human judgment

22
Experimental validation
  • Implemented classification strategy with three
    large subject programs
  • GCC, Jikes, javac compilers
  • Failures clustered automatically (what failure?)
  • Resulting clusters examined manually
  • Most or all failures in each cluster examined

23
Subject programs
  • GCC 2.95.2 C compiler
  • Written in C
  • Used subset of regression test suite
    (self-validating execution tests)
  • 3333 tests run, 136 failures
  • Profiled with GNU gcov (2214 function call
    counts)
  • Jikes 1.15 Java compiler
  • Written in C++
  • Used Jacks test suite (self-validating)
  • 3149 tests run, 225 failures
  • Profiled with gcov (3644 function call counts)

24
Subject programs cont.
  • javac 1.3.1_02-b02 java compiler
  • Written in Java
  • Used Jacks test suite
  • 3140 tests run, 233 failures
  • Profiled with function-call profiler written
    using JVMPI (1554 call counts)

25
Experimental methodology (skip)
  • 400-500 candidate Logistic Regression (LR) models
    generated per data set
  • 500 randomly selected features per model
  • Model with lowest estimated misclassification
    rate chosen
  • Data partitioned into three subsets
  • Train (50%) used to train candidate models
  • TestA (25%) used to pick best model
  • TestB (25%) used for final estimate of
    misclassification rate
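
A minimal sketch of the 50% / 25% / 25% partition into Train, TestA, and TestB using scikit-learn; the data is random and illustrative:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 500)).astype(float)
y = rng.integers(0, 2, size=1000)

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
X_testA, X_testB, y_testA, y_testB = train_test_split(
    X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=0)
# Train fits the candidate models, TestA picks the best one, and TestB gives
# the final estimate of its misclassification rate.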

26
Experimental Methodology cont. (skip)
  • Measure used to pick best model
  • Gives extra weight to misclassification of
    failures
  • Final LR models correctly classified ≥ 72% of
    failures and ≥ 91% of successes
  • Linearly dependent features omitted from fitted
    LR models
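
The slide does not give the exact measure; a minimal sketch of one way to weight failure misclassifications more heavily, with an assumed weight of 2.0 purely for illustration:

import numpy as np

def weighted_error(y_true, y_pred, failure_weight=2.0):
    """Count errors, weighting missed failures more than false alarms
    (the weight is an assumption, not the paper's value)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    missed_failures = np.sum((y_true == 1) & (y_pred == 0))
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))
    return failure_weight * missed_failures + false_alarms

print(weighted_error([1, 1, 0, 0], [0, 1, 1, 0]))   # -> 2*1 + 1 = 3.0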

27
Experimental methodology cont. (skip)
  • Cluster analysis
  • S-Plus clustering algorithm clara
  • Based on k-medoids criterion
  • Calinski-Harabasz index plotted for 2 ≤ c ≤ 50,
    local maxima examined
  • Visualization
  • Hierarchical MDS (HMDS) algorithm used
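
A minimal sketch of this scan, using scikit-learn's k-means as a stand-in for the S-Plus clara (k-medoids) algorithm and a shorter range of c for speed; the clustered data is synthetic:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 5)) for c in (0, 5, 10)])

ks = list(range(2, 21))              # the experiments scanned 2 <= c <= 50
scores = [calinski_harabasz_score(
              X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in ks]
local_maxima = [k for k, prev, cur, nxt
                in zip(ks[1:-1], scores, scores[1:], scores[2:])
                if cur > prev and cur > nxt]
print(local_maxima)                  # candidate numbers of clusters to examine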

28
Manual examination of failures (skip)
  • Several GCC tests often have same source file,
    different optimization levels
  • Such tests often fail or succeed together
  • Hence, GCC failures were grouped manually based
    on
  • Source file
  • Information about bug fixes
  • Date of first version to pass test

29
Manual examination cont. (skip)
  • Jikes, javac failures grouped in two stages
  • Automatically formed clusters checked
  • Overlapping clusters in HMDS display checked
  • Activities
  • Debugging
  • Comparing versions
  • Examining error codes
  • Inspecting source files
  • Checking correspondence between tests and JLS
    sections

30
GCC results
31
GCC results cont.
HMDS display of GCC failure profiles after feature
selection; convex hulls indicate the results of
automatic clustering into 27 clusters.
HMDS display of GCC failure profiles after feature
selection; convex hulls indicate failures involving
the same defect, identified manually with HMDS
(more accurate).
32
GCC results cont.
HMDS display of GCC failure profiles before feature
selection; convex hulls indicate failures involving
the same defect. So feature selection helps in
grouping.
33
javac results
34
javac results cont.
HMDS display of javac failures. Convex hulls
indicate results of manual classification with
HMDS.
35
Jikes results
36
Jikes results cont.
HMDS display of Jikes failures. Convex hulls
indicate results of manual classification with
HMDS.
37
Summary of results
  • In most automatically-created clusters, majority
    of failures had same cause
  • A few large, non-homogeneous clusters were created
  • Sub-clusters evident in HMDS displays
  • Automatic clustering sometimes splits groups of
    failures with same cause
  • HMDS displays didn't have this problem
  • Overall, failures with same cause formed fairly
    cohesive clusters

38
Threats to validity
  • One type of program used in experiments
  • Hand-crafted test inputs used for profiling
  • Think of Microsoft...

39
Related work
  • χSlice [Agrawal et al.]
  • Path spectra [Reps et al.]
  • Tarantula [Jones et al.]
  • Delta debugging [Hildebrand & Zeller]
  • Cluster filtering [Dickinson et al.]
  • Clustering IDS alarms [Julisch & Dacier]

40
Conclusions
  • Demonstrated that our classification strategy is
    potentially useful with compilers
  • Further evaluation needed with different types of
    software, failure reports from field
  • Note: Input space is huge. More accurate
    reporting (severity, location) could facilitate
    better grouping and overcome these problems
  • Note: Limited labeled data is available, and error
    causes/types are constantly changing (errors get
    debugged), so the effectiveness of learning is
    somewhat questionable (like following your own
    shadow)

41
Future work
  • Further experimental evaluation
  • Use more powerful classification, clustering
    techniques
  • Use different profiling techniques
  • Extract additional diagnostic information
  • Use techniques for classifying intrusions
    reported by anomaly detection systems