Early Fault Detection and Failure Prediction in Large Software Systems

1
Early Fault Detection and Failure Prediction in
Large Software Systems

Felix Salfner and Miroslaw Malek
Department of Computer Science
Humboldt University Berlin
Germany

2
Outline

Our goal
Description of the model
Validation of the model
Two applications using the failure predictor
Work in progress
Conclusions

3
Our Goal: Highly Available Component-Based Software Systems

[Figure: a system with components Comp 1, Comp 2, Comp 3 and resources Res a, Res b, Res c; event logs plotted as event type over time t; labels "High-level failure prediction" and "Fault detection".]

4
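Mathematical Model View: Stochastic Occurrence of Faults

[Figure: errors and failures of the system plotted over time t as series TS 1 to TS n; the model is used for fault detection and failure prediction. Time-axis labels as extracted: "t - Dt", "t", "t Dt".]
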
5
Model Description

The model contains patterns of events:
  Failure prediction: patterns that lead to failures.
  Early fault detection: patterns that identify and locate faults.
Patterns reflect temporal behavior of the system.
Patterns are modeled as paths in an acyclic directed graph.
Events are characterized by multiple system properties.
Two-phase approach:
  Model construction
    Analyze system behavior with the help of past logfiles.
    Extract patterns by means of clustering algorithms.
    Construct a generalized model.
  Model application
    Wait for the occurrence of events.
    Check whether the event matches known patterns (paths).
    If true, calculate probability and timeframe for every path.
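
The matching step of the model application can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `advance` function, the `type` and `mem` event fields, and the use of value ranges per property are all assumptions made for illustration. Each path is a list of nodes, and an incoming event advances every path whose next node it matches.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One node of a path (hypothetical structure)."""
    event_type: str
    ranges: dict  # property name -> (low, high); assumed representation

    def matches(self, event):
        # An event matches if its type agrees and every constrained
        # property lies inside the node's range.
        if event["type"] != self.event_type:
            return False
        return all(lo <= event[p] <= hi for p, (lo, hi) in self.ranges.items())


def advance(paths, positions, event):
    """Advance each path whose next node matches `event`.

    Returns the indices of the paths that moved one step forward.
    """
    matched = []
    for i, path in enumerate(paths):
        pos = positions[i]
        if pos < len(path) and path[pos].matches(event):
            positions[i] = pos + 1
            matched.append(i)
    return matched


# Usage: two hypothetical paths, both starting with a "warn" event.
paths = [
    [Node("warn", {"mem": (80, 100)}), Node("err", {})],
    [Node("warn", {"mem": (0, 50)})],
]
positions = [0, 0]
first = advance(paths, positions, {"type": "warn", "mem": 90})
```

Only the first path advances here, because the event's memory value falls outside the range of the second path's first node.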

6
Model construction

Identify target positions in a logfile
Cut out segments preceding the target positions
(extract history)
Each segment forms one path in the graph

Group events by means of clustering algorithms
Simplify the graph
Calculate relative likelihoods of branches
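
The construction steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the clustering step is skipped (events are treated as already-grouped symbols), the graph is built as a prefix tree (one possible acyclic directed graph), and the names `extract_segments`, `build_graph`, `branch_likelihoods` and the `history` parameter are hypothetical.

```python
from collections import defaultdict


def extract_segments(log, is_target, history=3):
    """Cut out up to `history` events preceding each target position,
    followed by the target event itself."""
    segments = []
    for i, event in enumerate(log):
        if is_target(event):
            segments.append(log[max(0, i - history):i] + [event])
    return segments


def build_graph(segments):
    """Merge segments into a prefix tree and count edge usage.

    Nodes are tuples of events (the prefix seen so far); the root is ().
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seg in segments:
        node = ()
        for event in seg:
            child = node + (event,)
            counts[node][child] += 1
            node = child
    return counts


def branch_likelihoods(counts):
    """Relative likelihood of a branch = its count / all counts leaving the node."""
    return {
        node: {child: n / sum(children.values()) for child, n in children.items()}
        for node, children in counts.items()
    }


# Usage with a made-up log in which "FAIL" marks the target positions.
log = ["A", "B", "FAIL", "A", "C", "FAIL", "A", "B", "FAIL"]
segments = extract_segments(log, lambda e: e == "FAIL", history=2)
likelihoods = branch_likelihoods(build_graph(segments))
```

In this made-up log, two of the three segments continue from A to B, so that branch gets a relative likelihood of 2/3 and the branch from A to C gets 1/3.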

7
Model application

Example: measure memory usage each time an event occurs
Two types of failures:
  No process memory available
  No shared memory available

8
Validation of the model

Focus on:
  Telecommunication system such as AT&T or Siemens
  Large software system
  Component / container based software architecture
  Distributed computing system (5 to 5000 servers)
  Large data set: 500 MB per day of operation
Validation of selected paths by domain experts

9
Failure-Specific Dynamic Recovery
Checkpointing

Failure-specific recovery scheme
  Risk levels for different failure types
Dynamic recovery
  Low risk: predicted probability of failure occurrence is below the risk level
    Leave out checkpointing and acceptance test
    Reduce computational overhead
    Gain efficiency
  High risk: predicted probability of failure occurrence is above the risk level
    Checkpointing and acceptance test have to be carried out
    Reduce lost computation in case of failure

[Figure: timeline of alternating checkpointing and acceptance test steps.]
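
The low-risk / high-risk decision described on this slide might look like the following sketch. The function name `needs_checkpoint`, the dictionary layout, and the numeric risk levels are hypothetical. The slide does not say what happens when the predicted probability equals the risk level, so the sketch conservatively treats that case as high risk.

```python
def needs_checkpoint(predicted, risk_levels):
    """Decide whether checkpointing and the acceptance test should run.

    predicted:   failure type -> predicted probability of occurrence
    risk_levels: failure type -> risk level (threshold) for that type
    Returns True if any failure type reaches its risk level
    (equality counts as high risk; this is an assumption).
    """
    return any(p >= risk_levels[ftype] for ftype, p in predicted.items())


# Hypothetical risk levels for the two failure types of the earlier example.
risk_levels = {"no_process_memory": 0.2, "no_shared_memory": 0.05}

low = needs_checkpoint(
    {"no_process_memory": 0.10, "no_shared_memory": 0.01}, risk_levels
)
high = needs_checkpoint(
    {"no_process_memory": 0.10, "no_shared_memory": 0.30}, risk_levels
)
```

In the first call both predictions stay below their risk levels, so checkpointing and the acceptance test can be left out. In the second call the shared-memory prediction exceeds its level, so both are carried out.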

10
Evaluating Proactive Measures

Patterns describe system behavior in the presence
of faults
How does the system usually run into failure
situations?
Proactive techniques take countermeasures to
prevent the system from running into failure
situations.
The model facilitates evaluation of proactive
measures while they are applied to a running
system.

11
Work in Progress

Online learning
  Include new patterns when failures are identified
  Prune nodes that are rarely used
Integration of health paths
  Include cases where no failure occurred
Introduce probability densities to nodes
  Now: ranges for node parameters
  Future: probability densities
  A path's probability also depends on the deviation from the center of a given distribution
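
To illustrate the difference between ranges and densities, here is a rough sketch. The slides do not name a distribution, so the Gaussian shape, the scaling to 1 at the center, and the names `range_weight`, `density_weight` and `path_weight` are assumptions chosen for illustration only.

```python
import math


def range_weight(x, low, high):
    """Range scheme: a value is either inside the node's range or not."""
    return 1.0 if low <= x <= high else 0.0


def density_weight(x, mean, std):
    """Density scheme (assumed Gaussian, scaled to 1 at the center):
    the weight shrinks as x deviates from the center of the distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z)


def path_weight(values, nodes):
    """Multiply the per-node weights along a path.

    values: observed parameter value at each node
    nodes:  (mean, std) per node; hypothetical parameterization
    """
    w = 1.0
    for x, (mean, std) in zip(values, nodes):
        w *= density_weight(x, mean, std)
    return w
```

With ranges, an observation at the edge of a range counts as much as one at its center. With densities, a path whose observed values sit close to the centers receives a higher weight than one whose values are barely plausible.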

12
Conclusions

Temporal system behavior is directly incorporated
into the model.
Calculations during the model's application can
be performed efficiently: only a
depth-first search with a few additional
multiplications and additions is needed.
The model is intuitive since paths express
correlations in a formalism that is easily
understandable.
It is extensible to a hybrid model since it can
be supplemented by paths obtained from classic
system analysis (within one model).
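
The depth-first search mentioned in the conclusions can be sketched as below. The graph layout (a dictionary of children with branch probabilities) and the function name `failure_probability` are assumptions. The sketch covers only the probability part and leaves out the timeframe calculation.

```python
def failure_probability(graph, node, failures):
    """Depth-first search for the probability of reaching a failure node.

    graph:    node -> list of (child, branch_probability)
    failures: set of nodes that represent failures
    Probabilities are multiplied along a path and added across branches.
    """
    if node in failures:
        return 1.0
    total = 0.0
    for child, p in graph.get(node, []):
        total += p * failure_probability(graph, child, failures)
    return total


# Hypothetical graph: "ok" is a leaf where no failure occurs.
graph = {
    "start": [("a", 0.6), ("b", 0.4)],
    "a": [("fail", 0.5), ("ok", 0.5)],
    "b": [("fail", 1.0)],
}
p_fail = failure_probability(graph, "start", {"fail"})
```

For this graph the result is 0.6 * 0.5 + 0.4 * 1.0 = 0.7. Because the graph is acyclic the recursion always terminates, and a leaf that is not a failure contributes zero.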
