Early Fault Detection and Failure Prediction in Large Software Systems

1
Early Fault Detection and Failure Prediction in
Large Software Systems
  • Felix Salfner and Miroslaw Malek
  • Department of Computer Science, Humboldt University
    Berlin, Germany

2
Outline
  • Our goal
  • Description of the model
  • Validation of the model
  • Two applications using the failure predictor
  • Work in progress
  • Conclusions

3
Our Goal: Highly Available Component-Based
Software Systems
[Figure: a system with components Comp 1 to Comp 3 and resources Res a to Res c; its event logs (event type over time t) are used for fault detection and high-level failure prediction.]

4
Mathematical Model View
Stochastic Occurrence of Faults
[Figure: the system's failures, errors, and time series TS 1 to TS n over time t, with marks at t - Δt and t + Δt; the model performs fault detection and failure prediction.]

5
Model Description
  • The model contains patterns of events:
      • Failure prediction: patterns that lead to
        failures.
      • Early fault detection: patterns that identify
        and locate faults.
  • Patterns reflect the temporal behavior of the system.
  • Patterns are modeled as paths in a directed acyclic
    graph.
  • Events are characterized by multiple system
    properties.
  • Two-phase approach:
      • Model construction
          • Analyze system behavior with the help of
            past logfiles.
          • Extract patterns by means of clustering
            algorithms.
          • Construct a generalized model.
      • Model application
          • Wait for the occurrence of events.
          • Check whether the event matches known
            patterns (paths).
          • If so, calculate the probability and
            timeframe for every path.
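
The matching step of the application phase could look roughly like the sketch below. It assumes that each pattern node stores an event type plus allowed ranges for numeric properties; the class names, event names, and threshold values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One logged event, described by several system properties."""
    timestamp: float
    properties: dict  # hypothetical example: {"type": "alloc_warn", "mem_mb": 410.0}


@dataclass
class Node:
    """One step of a pattern: an event type plus allowed ranges per property."""
    event_type: str
    ranges: dict = field(default_factory=dict)  # property name -> (low, high)

    def matches(self, event: Event) -> bool:
        if event.properties.get("type") != self.event_type:
            return False
        for name, (low, high) in self.ranges.items():
            value = event.properties.get(name)
            if value is None or not (low <= value <= high):
                return False
        return True


def matching_paths(paths: dict, event: Event) -> list:
    """Return the names of all patterns whose first node matches the event."""
    return [name for name, nodes in paths.items()
            if nodes and nodes[0].matches(event)]


# Hypothetical patterns, one per failure type.
paths = {
    "process_memory_failure": [
        Node("alloc_warn", {"mem_mb": (400.0, 512.0)}),
        Node("alloc_fail"),
    ],
    "shared_memory_failure": [
        Node("shm_warn", {"mem_mb": (100.0, 128.0)}),
        Node("shm_fail"),
    ],
}

event = Event(timestamp=12.5, properties={"type": "alloc_warn", "mem_mb": 410.0})
print(matching_paths(paths, event))
```

Only the first node of each pattern is checked here; tracking how far along a path the system has already moved is left out to keep the sketch short.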

6
Model construction
  • Identify target positions in a logfile
  • Cut out segments preceding the target positions
    (extract history)
  • Each segment forms one path in the graph
  • Group events by means of clustering algorithms
  • Simplify the graph
  • Calculate relative likelihoods of branches
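
A minimal sketch of these construction steps, under simplifying assumptions: the log is a list of (timestamp, event type) pairs, the target positions are events of one failure type, and segments are merged into a prefix tree (one simple kind of directed acyclic graph). The clustering and graph simplification steps are skipped, and all function names and sample data are hypothetical.

```python
from collections import defaultdict


def extract_segments(log, target_type, window):
    """Cut out the events within `window` time units before each target event."""
    segments = []
    for i, (t_fail, kind) in enumerate(log):
        if kind != target_type:
            continue
        history = [k for (t, k) in log[:i]
                   if t_fail - window <= t < t_fail and k != target_type]
        segments.append(history + [target_type])
    return segments


def build_tree(segments):
    """Merge segments into a prefix tree and count how often each branch is taken."""
    counts = defaultdict(lambda: defaultdict(int))  # prefix -> next event -> count
    for segment in segments:
        for i, kind in enumerate(segment):
            counts[tuple(segment[:i])][kind] += 1
    return counts


def branch_likelihoods(counts):
    """Turn branch counts into relative likelihoods at each node."""
    result = {}
    for prefix, nexts in counts.items():
        total = sum(nexts.values())
        result[prefix] = {kind: n / total for kind, n in nexts.items()}
    return result


# Hypothetical log: three failures, each preceded by a short history.
log = [(0, "A"), (2, "B"), (5, "FAIL"),
       (20, "A"), (23, "C"), (26, "FAIL"),
       (40, "A"), (42, "B"), (45, "FAIL")]

segments = extract_segments(log, "FAIL", window=10)
likelihoods = branch_likelihoods(build_tree(segments))
print(segments)
print(likelihoods[("A",)])
```

In this toy log, two of the three histories continue from A to B and one continues to C, so the branch at node A splits two to one.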

7
Model application
  • Example: measure memory usage each time an event
    occurs
  • Two types of failures:
      • No process memory available
      • No shared memory available

8
Validation of the model
  • Focus on:
  • Telecommunication system, such as those of AT&T or
    Siemens
  • Large software system
  • Component / container based software architecture
  • Distributed computing system (5 to 5000 servers)
  • Large data set: 500 MB per day of operation
  • Validation of selected paths by domain experts

9
Failure Specific Dynamic Recovery
Checkpointing
  • Failure-specific recovery scheme
      • Risk levels for different failure types
  • Dynamic recovery
      • Low risk
          • Predicted probability of failure occurrence
            is below the risk level
          • Leave out checkpointing and acceptance test
          • Reduce computational overhead
          • Gain efficiency
      • High risk
          • Predicted probability of failure occurrence
            is above the risk level
          • Checkpointing and acceptance test have to be
            carried out
          • Reduce lost computation in case of failure

[Figure: timeline of alternating acceptance test and checkpointing steps.]
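
The low-risk / high-risk decision can be sketched as a comparison of predicted failure probabilities against per-type risk levels. The failure type names and the numeric risk levels below are hypothetical, and treating a probability exactly equal to the risk level as high risk is an assumption; the slides only cover the "below" and "above" cases.

```python
# Hypothetical risk levels, one per failure type.
RISK_LEVELS = {"process_memory": 0.2, "shared_memory": 0.05}


def should_checkpoint(predictions, risk_levels):
    """Return True if any predicted failure probability reaches its risk level."""
    return any(p >= risk_levels[ftype] for ftype, p in predictions.items())


def run_step(step, predictions, risk_levels=RISK_LEVELS):
    """Run one computation step, then checkpoint only when the risk is high."""
    result = step()
    if should_checkpoint(predictions, risk_levels):
        # High risk: an acceptance test and a checkpoint would be carried out here.
        return result, "checkpointed"
    # Low risk: skip both to save computational overhead.
    return result, "skipped"


low = {"process_memory": 0.1, "shared_memory": 0.01}
high = {"process_memory": 0.1, "shared_memory": 0.07}
print(run_step(lambda: 42, low))
print(run_step(lambda: 42, high))
```

Keeping one risk level per failure type is what makes the scheme failure specific: a failure type that is expensive to recover from can be given a lower threshold than a cheap one.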

10
Evaluating Proactive Measures
  • Patterns describe system behavior in the presence
    of faults
  • How does the system usually run into failure
    situations?
  • Proactive techniques take countermeasures to
    prevent the system from running into failure
    situations.
  • The model facilitates evaluation of proactive
    measures while they are applied to a running
    system.

11
Work in Progress
  • Online learning
      • Include new patterns when failures are identified
      • Prune nodes that are rarely used
  • Integration of health paths
      • Include cases where no failure occurred
  • Introduce probability densities to nodes
      • Now: ranges for node parameters
      • Future: probability densities
      • A path's probability also depends on the
        deviation from the center of a given distribution
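
The difference between range-based nodes and density-based nodes can be illustrated as follows. The slides do not say which distribution is planned or how it enters the path probability, so the Gaussian shape, the peak normalization to 1.0, and the multiplicative weighting are all assumptions made for this sketch.

```python
import math


def range_weight(value, low, high):
    """Range scheme: a node either accepts a value (1.0) or rejects it (0.0)."""
    return 1.0 if low <= value <= high else 0.0


def density_weight(value, center, spread):
    """Density scheme (assumed Gaussian): weight shrinks with distance from the center."""
    z = (value - center) / spread
    return math.exp(-0.5 * z * z)


def path_probability(branch_probs, weights):
    """Scale the product of branch likelihoods by the per-node weights."""
    p = 1.0
    for branch, weight in zip(branch_probs, weights):
        p *= branch * weight
    return p


# Hypothetical two-node path; the second node observes a value of 7.0.
branch_probs = [0.5, 0.8]
with_ranges = path_probability(
    branch_probs, [1.0, range_weight(7.0, 3.0, 7.0)])
with_density = path_probability(
    branch_probs, [1.0, density_weight(7.0, center=5.0, spread=2.0)])
print(with_ranges, with_density)
```

With ranges, a value at the edge of the interval counts as much as one at its center; with a density, the same value lowers the path's probability.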

12
Conclusions
  • Temporal system behavior is directly incorporated
    into the model.
  • Calculations during the model's application can
    be performed efficiently: only a depth-first
    search with a few additional multiplications and
    additions is needed.
  • The model is intuitive since paths express
    correlations in a formalism that is easily
    understandable.
  • It is extensible to a hybrid model since it can
    be supplemented by paths obtained from classic
    system analysis (within one model).
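
The depth-first search mentioned above might be sketched like this. It assumes that each edge carries a relative likelihood and an expected delay, so that multiplications give a path's probability and additions give its timeframe; the graph layout, node names, and numbers are hypothetical.

```python
def predict(graph, node, prob=1.0, elapsed=0.0, path=None):
    """Depth-first search from `node`.

    Yields (path, probability, time to end of path) for every leaf reached.
    """
    path = (path or []) + [node]
    edges = graph.get(node, [])
    if not edges:
        yield path, prob, elapsed
        return
    for child, likelihood, delay in edges:
        # One multiplication and one addition per edge.
        yield from predict(graph, child, prob * likelihood, elapsed + delay, path)


# Hypothetical graph: node -> list of (child, branch likelihood, expected delay).
graph = {
    "A": [("B", 0.7, 30.0), ("C", 0.3, 10.0)],
    "B": [("FAIL_process_memory", 1.0, 60.0)],
    "C": [("FAIL_shared_memory", 1.0, 20.0)],
}

results = list(predict(graph, "A"))
for path, prob, time_to_failure in results:
    print(" -> ".join(path), prob, time_to_failure)
```

Because the graph is acyclic, the search visits each path from the matched node exactly once, and the work per edge is constant.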