Title: Dependability Theory and Methods Part 1: Introduction and definitions
1Dependability Theory and Methods Part 1
Introduction and definitions
- Andrea Bobbio
- Dipartimento di Informatica
- Università del Piemonte Orientale, A. Avogadro
- 15100 Alessandria (Italy)
- bobbio_at_unipmn.it - http://www.mfn.unipmn.it/bobbio
Bertinoro, March 10-14, 2003
2Dependability Definition
Dependability is the property of a system to be
dependable in time, i.e. such that reliance can
justifiably be placed on the service it delivers.
Dependability extends the interest in the system
from the design and construction phase to the
operational phase (life cycle).
3What dependability theory and practice want to
avoid
4Dependability Taxonomy
Dependability measures: reliability, availability, maintainability, safety, security.
5Quantitative analysis
Quantitative analysis aims at numerically
evaluating measures that characterize the
dependability of an item. It is relevant to:
- Risk assessment and safety
- Design specifications
- Technical assistance and maintenance
- Life cycle cost
- Market competition
6 Risk assessment and safety
The risk associated with an activity is taken to be
proportional to the probability of occurrence of
the activity and to the magnitude of its
consequences:
$R = P \times M$
A safety critical system is a system whose
incorrect behavior may cause a risk to occur,
causing undesirable consequences to the item, to
the operators, to the population, to the
environment.
7Design specifications
- Technological items must be dependable.
- Sometimes, dependability requirements (both
qualitative and quantitative) are part of the
design specifications:
- Mean time between failures
- Total down time
8Technical assistance and maintenance
The planning of all the activities related to
technical assistance and maintenance is linked to
the system dependability (expected number of
failures over time).
- planning spare parts and maintenance crews
- cost of the technical assistance (warranty
period)
- preventive vs reactive maintenance.
9Market competition
- The choice of the consumers is strongly
influenced by the perceived dependability.
- Advertisement messages stress the
dependability.
- The image of a product or of a brand may depend
on the dependability.
10Purpose of evaluation
- Understanding a system
- Observation
- Operational environment
- Reasoning
- Predicting the behavior of a system
- Need a model
- A model is a convenient abstraction
- Accuracy based on degree of extrapolation
11Methods of evaluation
- Measurement-Based
- Most believable, most expensive
- Not always possible or cost effective during
system design
- Model-Based
- Less believable, Less expensive
- Analytic vs Discrete-Event Simulation
- Combinatorial vs State-Space Methods
12Measurement-Based
- Most believable, most expensive
- Data are obtained by observing the behavior of
physical objects:
- field observations
- measurements on prototypes
- measurements on components (accelerated tests).
13Models
- Analytic: closed-form answers or numerical solution
- Simulation
All models are wrong; some models are useful.
14Methods of evaluation
- Measurements Models data bank
15The probabilistic approach
The mechanisms that lead to the failure of a
technological object are very complex and depend
on many physical, chemical, technical, human and
environmental factors.
The time to failure cannot be expressed by a
deterministic law.
We are forced to treat the time to failure as a
random variable. The quantitative dependability
analysis is based on a probabilistic approach.
16Reliability
Reliability is a measurable attribute of
dependability, defined as follows:
The reliability R(t) of an item at time t is the
probability that the item performs the required
function in the interval (0, t), given the stress
and environmental conditions in which it operates.
17Basic Definitions cdf
- Let X be the random variable representing the
time to failure of an item.
The cumulative distribution function (cdf) F(t)
of the r.v. X is given by
$F(t) = \Pr\{X \le t\}$
F(t) represents the probability that the item has
already failed at time t (unreliability).
18Basic Definitions cdf
- Equivalent terminology for F(t)
- CDF (cumulative distribution function)
- Probability distribution function
- Distribution function
19Basic Definitions cdf
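[Figure: plot of F(t) versus t, between 0 and 1, with the values F(a) and F(b) marked at times a and b]
Properties: $F(0) = 0$; $\lim_{t \to \infty} F(t) = 1$; $F(t)$ is non-decreasing.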
20Basic Definitions Reliability
- Let X be the random variable representing the
time to failure of an item.
The survivor function (sf) R(t) of the r.v. X is
given by
$R(t) = \Pr\{X > t\} = 1 - F(t)$
R(t) represents the probability that the item is
correctly working at time t and gives the
reliability function.
21Basic Definitions
- Equivalent terminology for $R(t) = 1 - F(t)$
- Reliability
- Complementary distribution function
- Survivor function
22Basic Definitions Reliability
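[Figure: plot of R(t) versus t, between 1 and 0, with the value R(a) marked; times a and b on the axis]
Properties: $R(0) = 1$; $\lim_{t \to \infty} R(t) = 0$; $R(t)$ is non-increasing.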
23Basic Definitions density
- Let X be the random variable representing the
time to failure of an item and let F(t) be a
differentiable cdf.
The density function f(t) is defined as
$f(t) = \frac{dF(t)}{dt}$
$f(t)\,dt = \Pr\{t \le X < t + dt\}$
24Basic Definitions Density
[Figure: plot of the density f(t) versus t, with the interval between a and b marked]
$\int_a^b f(x)\,dx = \Pr\{a < X \le b\} = F(b) - F(a)$
25Basic Definitions Density
[Figure: plot of the density f(t) versus t]
26Basic Definitions
- Equivalent terminology pdf
- probability density function
- density function
- density
- f(t)
For a non-negative random variable
27Quiz 1
The higher the MTTF is, the higher the
item reliability is.
The correct answer is: wrong!
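A minimal numerical sketch of why the statement fails, assuming two hypothetical items that do not appear in the slides: item A has an exponentially distributed time to failure with MTTF = 1000 h, item B a Weibull distributed time to failure with shape 5 and MTTF = 500 h. The function names, the parameter values and the mission time are illustrative assumptions.

```python
import math

def reliability_exponential(t, mttf):
    """R(t) = exp(-t / MTTF) for an exponentially distributed time to failure."""
    return math.exp(-t / mttf)

def reliability_weibull(t, mttf, shape):
    """R(t) = exp(-(t / scale)**shape), with the scale chosen to give the requested MTTF."""
    scale = mttf / math.gamma(1.0 + 1.0 / shape)  # MTTF = scale * Gamma(1 + 1/shape)
    return math.exp(-((t / scale) ** shape))

t = 400.0  # mission time in hours (illustrative)
r_a = reliability_exponential(t, mttf=1000.0)        # item A: higher MTTF
r_b = reliability_weibull(t, mttf=500.0, shape=5.0)  # item B: lower MTTF
print(f"R_A({t:g}) = {r_a:.3f}, R_B({t:g}) = {r_b:.3f}")
```

With these values item B, whose MTTF is half that of item A, is the more reliable of the two at t = 400 h: the MTTF alone does not order the reliability functions.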
28Hazard (failure) rate
- $h(t)\,\Delta t$: conditional probability that the system will fail in
$(t, t + \Delta t)$, given that it has survived until time t
- $f(t)\,\Delta t$: unconditional probability that the system will fail in
$(t, t + \Delta t)$
29The Failure Rate of a Distribution
- $h(t)\,\Delta t$ is the conditional probability that
the unit will fail in the interval $(t, t + \Delta t)$,
given that it is functioning at time t.
- $f(t)\,\Delta t$ is the unconditional probability that
the unit will fail in the interval $(t, t + \Delta t)$.
- Difference between the two sentences:
- probability that someone will die between 90 and
91, given that he lives to 90
- probability that someone will die between 90 and
91
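The conditional and the unconditional probabilities are linked through the reliability function. A short derivation sketch, using only the definitions of F(t), R(t) and f(t) given in the earlier slides and writing $\Delta t$ for a small time increment:

```latex
h(t)\,\Delta t \approx \Pr\{t < X \le t + \Delta t \mid X > t\}
  = \frac{\Pr\{t < X \le t + \Delta t\}}{\Pr\{X > t\}}
  \approx \frac{f(t)\,\Delta t}{R(t)}
\quad\Longrightarrow\quad
h(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}
```

Since $R(t) \le 1$, the conditional probability is never smaller than the unconditional one, which matches the age 90 example.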
30Bathtub curve
[Figure: bathtub curve of the failure rate h(t) versus t, in three phases]
- DFR, decreasing failure rate (infant mortality, burn-in)
- CFR, constant failure rate (useful life)
- IFR, increasing failure rate (wear-out phase)
31Infant mortality (dfr)
Also called infant mortality phase or reliability
growth phase. The failure rate decreases with
time.
- Caused by undetected hardware/software defects
- Can cause significant prediction errors if
steady-state failure rates are used
- Weibull Model can be used
32Useful life (cfr)
The failure rate remains constant in time (age
independent).
- Failure rate much lower than in the early-life
period.
- Failures caused by random effects (such as
environmental shocks).
33Wear-out phase (ifr)
The failure rate increases with age.
It is characteristic of irreversible aging
phenomena (deterioration, wear-out, fatigue,
corrosion, etc.).
- Applicable to mechanical and other systems.
(Properly qualified electronic parts do not
exhibit wear-out failures during their intended
service life.)
- Weibull Failure Model can be used
34Exponential Distribution
Failure rate is age-independent (constant).
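- Cumul. distribution function: $F(t) = 1 - e^{-\lambda t}$
- Reliability: $R(t) = e^{-\lambda t}$
- Density Function: $f(t) = \lambda e^{-\lambda t}$
- Failure Rate (CFR): $h(t) = \lambda$
- Mean Time to Failure: $MTTF = 1/\lambda$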
35The Cumulative Distribution Function of an
Exponentially Distributed Random Variable With
Parameter $\lambda = 1$
[Figure: plot of $F(t) = 1 - e^{-\lambda t}$ for t from 0 to 5.00; vertical axis from 0 to 1.0]
36The Reliability Function of an Exponentially
Distributed Random Variable With Parameter $\lambda = 1$
[Figure: plot of R(t) for t from 0 to 5.00; vertical axis from 0 to 1.0]
37Exponential Density Function (pdf)
[Figure: plot of the density f(t)]
$MTTF = 1/\lambda$
38Memoryless Property of the Exponential
Distribution
- Assume $X > t$: we have observed that the
component has not failed until time t
- Let $Y = X - t$ be the remaining (residual) lifetime
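The step that leads to the next slide can be reconstructed as follows. This is a sketch that assumes $G_t(y)$ denotes the conditional cdf of the residual lifetime $Y$ given $X > t$, and that $X$ is exponentially distributed with parameter $\lambda$, so that $F(t) = 1 - e^{-\lambda t}$:

```latex
G_t(y) = \Pr\{Y \le y \mid X > t\}
       = \frac{\Pr\{t < X \le t + y\}}{\Pr\{X > t\}}
       = \frac{F(t + y) - F(t)}{1 - F(t)}
       = \frac{e^{-\lambda t} - e^{-\lambda (t + y)}}{e^{-\lambda t}}
       = 1 - e^{-\lambda y}
```

The elapsed time $t$ cancels out, so the residual lifetime has the same exponential cdf as the original time to failure.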
39Memoryless Property of the Exponential
Distribution (cont.)
- Thus $G_t(y)$ is independent of t and is identical
to the original exponential distribution of X
- The distribution of the remaining life does not
depend on how long the component has been
operating
- An observed failure is the result of some
suddenly appearing failure, not of gradual
deterioration
40Quiz 3
If two components (say, A and B) have
independent identical exponentially distributed
times to failure, by the memoryless property,
which of the following is true?
- 1. They will always fail at the same time
- 2. They have the same probability of failing at
time t during operation
- 3. When these two components are operating
simultaneously, the component which has been
operational for a shorter duration of time will
survive longer
41Weibull Distribution
- Distribution Function
- Density Function
- Reliability
42Weibull Distribution
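The distribution has a shape parameter and a scale parameter.
Failure rate: decreasing (DFR), constant (CFR) or
increasing (IFR), depending on whether the shape
parameter is less than, equal to or greater than 1.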
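The formulas of these two slides did not survive, so the following is only a sketch under an assumed parametrization, $R(t) = e^{-(t/\eta)^{\beta}}$ with shape $\beta$ and scale $\eta$ (a common textbook form; the slides may use a different one). The function names and the numerical values are hypothetical.

```python
import math

def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)**shape) under the assumed parametrization."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f(t) = (shape/scale) * (t/scale)**(shape-1) * R(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_reliability(t, shape, scale)

def weibull_hazard(t, shape, scale):
    """h(t) = f(t) / R(t) = (shape/scale) * (t/scale)**(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Failure rate at two ages for shape < 1 (DFR), shape = 1 (CFR), shape > 1 (IFR).
for shape in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, shape, scale=10.0)
    late = weibull_hazard(5.0, shape, scale=10.0)
    trend = "decreasing" if late < early else "constant" if late == early else "increasing"
    print(f"shape={shape}: h(1)={early:.4f}, h(5)={late:.4f} -> {trend}")
```

With shape equal to 1 the hazard reduces to the constant 1/scale, i.e. the exponential distribution is recovered as a special case.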
43Failure Rate of the Weibull Distribution with
Various Values of the Shape Parameter
44Weibull Distribution for Various Values of the Shape Parameter
[Figure: two panels, cdf and density]
45Failure Rate Models
- We use a truncated Weibull Model
- Infant mortality phase modeled by DFR Weibull and
the steady-state phase by the exponential
[Figure 2.34: Weibull Failure-Rate Model. Failure-rate multiplier (0 to 7) versus operating time (0 to 17,520 hrs)]
46Failure Rate Models (cont.)
- This model has the form
- where
- steady-state failure rate
- is Weibull shape parameter
- Failure rate multiplier
47Failure Rate Models (cont.)
- There are several ways to incorporate time
dependent failure rates in availability models
- The easiest way is to approximate a continuous
function by a piecewise constant step function
[Figure: Discrete Failure-Rate Model. Failure-rate multiplier (0 to 7) versus operating time (0 to 17,520 hrs)]
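The piecewise constant approximation can be sketched as follows. The continuous failure-rate multiplier used here is a hypothetical decreasing function chosen only for illustration (it is not the model of the slides), and taking the value at the midpoint of each interval is one possible choice among several.

```python
def piecewise_constant(func, breakpoints):
    """Return a step function that is constant on each interval
    [breakpoints[i], breakpoints[i+1]) and equal to func at the midpoint.
    Assumes breakpoints are increasing and t >= breakpoints[0]."""
    intervals = list(zip(breakpoints[:-1], breakpoints[1:]))
    levels = [func((lo + hi) / 2.0) for lo, hi in intervals]

    def step(t):
        for (lo, hi), level in zip(intervals, levels):
            if lo <= t < hi:
                return level
        return levels[-1]  # beyond the last breakpoint, hold the last level

    return step

def multiplier(t_hours):
    """Hypothetical continuous failure-rate multiplier, decaying towards 1."""
    return 1.0 + 6.0 / (1.0 + t_hours / 1000.0)

# Grid with 2,190 h steps, as on the time axis of the figure.
grid = [0, 2190, 4380, 6570, 8760]
step_multiplier = piecewise_constant(multiplier, grid)
for t in (100, 3000, 8000, 12000):
    print(t, round(multiplier(t), 3), round(step_multiplier(t), 3))
```

A finer grid gives a closer approximation at the cost of more distinct failure-rate levels to handle in the availability model.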
48Failure Rate Models (cont.)
- Here the discrete failure-rate model is defined
by
49A lifetime experiment
[Figure: timelines of components 1, 2, 3, 4, ..., N, all starting at t = 0, with times to failure $X_1, X_2, X_3, X_4, \dots, X_N$]
N i.i.d. components are put in a life test
experiment.
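One common use of such an experiment, not spelled out on the slide, is to estimate the reliability function as the fraction of components still working at time t. A minimal sketch under that assumption; the failure times below are made up purely for illustration and the function name is hypothetical.

```python
def empirical_reliability(failure_times, t):
    """Fraction of the N tested components whose time to failure exceeds t."""
    n = len(failure_times)
    survivors = sum(1 for x in failure_times if x > t)
    return survivors / n

# Made-up times to failure (hours) of N = 8 components, for illustration only.
observed = [120.0, 340.0, 410.0, 560.0, 730.0, 910.0, 1250.0, 1800.0]
for t in (0.0, 500.0, 1000.0, 2000.0):
    print(t, empirical_reliability(observed, t))
```

The estimate starts at 1 for t = 0 and drops by 1/N at each observed failure, mirroring the non-increasing shape of R(t).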
50A lifetime experiment
[Figure: components 1, 2, 3, 4, ..., N with their times to failure $X_1, X_2, X_3, X_4, \dots, X_N$]
51Repairable systems: Availability
52Repairable systems
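[Figure: timeline alternating between UP and DOWN states versus t, with UP periods $X_1, X_2, X_3$ and DOWN periods $Y_1, Y_2$]
- $X_1, X_2, \dots, X_n$: successive UP times
- $Y_1, Y_2, \dots, Y_n$: successive DOWN times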
53Repairable systems
- The usual hypothesis in modeling repairable
systems is that:
- The successive UP times $X_1, X_2, \dots, X_n$ are
i.i.d. random variables, i.e. samples from a
common cdf F(t)
- The successive DOWN times $Y_1, Y_2, \dots, Y_n$ are
i.i.d. random variables, i.e. samples from a
common cdf G(t)
54Repairable systems
[Figure: timeline alternating between UP and DOWN states versus t, with UP periods $X_1, X_2, X_3$ and DOWN periods $Y_1, Y_2$]
- The dynamic behaviour of a repairable system is
characterized by:
- the r.v. X of the successive up times
- the r.v. Y of the successive down times
55Maintainability
- Let Y be the r.v. of the successive down times
- $G(t) = \Pr\{Y \le t\}$ (maintainability)
- $g(t) = \frac{dG(t)}{dt}$ (density)
- $h_g(t) = \frac{g(t)}{1 - G(t)}$ (repair rate)
- $MTTR = \int_0^\infty t\, g(t)\, dt$ (Mean Time To Repair)
56Availability
The measure that characterizes a repairable system
is the availability (unavailability).
The availability A(t) of an item at time t is the
probability that the item is correctly working at
time t.
57Availability
- The measure to characterize a repairable system
is the availability (unavailability)
- $A(t) = \Pr\{\text{system UP at time } t\}$
- $U(t) = \Pr\{\text{system DOWN at time } t\}$
- $A(t) + U(t) = 1$
58Definition of Availability
- An important difference between reliability and
availability is:
- reliability refers to failure-free operation
during an interval (0, t)
- availability refers to failure-free operation at
a given instant of time t (the time when a
device or system is accessed to provide a
required function), independently of the number
of failure/repair cycles.
59Definition of Availability
[Figure: System Failure and Restoration Process. Indicator function I(t) versus t: I(t) = 1 while the system is working (operating and providing a required function), I(t) = 0 while it is failed (failed and being restored)]
60Availability evaluation
- In the special case when times to failure and
times to restoration are both exponentially
distributed, the alternating process can be
viewed as a two-state homogeneous Continuous Time
Markov Chain
Time-independent failure rate $\lambda$
Time-independent repair rate $\mu$
61 2-State Markov Availability Model
- Transient Availability analysis
- for each state, we apply a flow balance equation
- Rate of buildup = rate of flow IN - rate of flow
OUT
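Applying the flow balance to the two states gives the equations below. This is a sketch under the usual assumptions: the UP state has probability $A(t)$, the DOWN state has probability $U(t) = 1 - A(t)$, $\lambda$ is the failure rate, $\mu$ is the repair rate, and the system starts in the UP state, $A(0) = 1$.

```latex
\frac{dA(t)}{dt} = \mu\,U(t) - \lambda\,A(t), \qquad
\frac{dU(t)}{dt} = \lambda\,A(t) - \mu\,U(t), \qquad
A(t) + U(t) = 1
\\[4pt]
A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t},
\qquad
A_{ss} = \lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}
```

Substituting $U(t) = 1 - A(t)$ into the first equation gives $dA/dt = \mu - (\lambda + \mu) A(t)$, a linear first-order equation whose solution with $A(0) = 1$ is the expression shown.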
62 2-State Markov Availability Model
63 2-State Markov Availability Model
[Figure: plot of A(t) versus t, with the levels 1 and $A_{ss}$ marked]
64 2-State Markov Model
1) Pointwise availability A(t)
2) Steady-state availability: limiting value of A(t) as $t \to \infty$
- If there is no restoration ($\mu = 0$), the
availability becomes the reliability: $A(t) = R(t)$
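A numerical sketch of these measures, assuming the standard closed-form solution of the two-state model with the system UP at t = 0, $A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$. The function names and the rate values are illustrative and are not taken from the slides.

```python
import math

def pointwise_availability(t, failure_rate, repair_rate):
    """A(t) for the two-state Markov model, starting in the UP state at t = 0."""
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * math.exp(-total * t)

def steady_state_availability(failure_rate, repair_rate):
    """Limit of A(t) as t grows: mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)

lam, mu = 0.001, 0.1  # illustrative rates per hour: MTTF = 1000 h, MTTR = 10 h
for t in (0.0, 10.0, 100.0, 1000.0):
    print(t, round(pointwise_availability(t, lam, mu), 6))
print("A_ss =", round(steady_state_availability(lam, mu), 6))

# With no restoration (mu = 0) the availability reduces to R(t) = exp(-lambda * t).
print(pointwise_availability(500.0, lam, 0.0), math.exp(-lam * 500.0))
```

The printed values decrease from 1 at t = 0 towards the steady-state level, and the last line shows the two quantities coinciding when the repair rate is zero.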
65Steady-state Availability
- Steady-state availability
- In many system models, the limit $\lim_{t \to \infty} A(t) = A_{ss}$
exists and is called the steady-state availability
The steady-state availability represents the
probability of finding a system operational after
many fail-and-restore cycles.
66Steady-state Availability
[Figure: timeline alternating between UP (1) and DOWN (0) states versus t]
Expected UP time: $E[U(t)] = MUT = MTTF$
Expected DOWN time: $E[D(t)] = MDT = MTTR$
67Availability Example (I)
Let a system have a steady-state availability
$A_{ss} = 0.95$. This means that, given a mission time T,
the system is expected to work correctly
for a total time of 0.95 T. Or, alternatively,
the system is expected to be out of service
for a total time $U_{ss} T = (1 - A_{ss}) T$.
68Availability Example (II)
Let a system have a rated productivity of W
per year. The loss due to the system being out of
service can be estimated as $U_{ss} W = (1 - A_{ss}) W$. The
availability (unavailability) is an index to
estimate the real productivity, given the rated
productivity.
Alternatively, if the goal is to have a net
productivity of W per year, the plant must be
designed such that its rated productivity W'
satisfies $(1 - U_{ss})\, W' = A_{ss} W' = W$.
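The arithmetic of Availability Examples (I) and (II) can be sketched as follows. The mission time and the productivity figures are illustrative assumptions, the function names are hypothetical, and the design rule for the rated productivity assumes that the net productivity equals the steady-state availability times the rated productivity.

```python
def expected_downtime(a_ss, mission_time):
    """Expected total out-of-service time: U_ss * T = (1 - A_ss) * T."""
    return (1.0 - a_ss) * mission_time

def productivity_loss(a_ss, rated):
    """Expected loss of production: U_ss * W = (1 - A_ss) * W."""
    return (1.0 - a_ss) * rated

def required_rated_productivity(a_ss, net_target):
    """Rated productivity W' such that A_ss * W' equals the net target W."""
    return net_target / a_ss

a_ss = 0.95
hours_per_year = 8760.0  # mission time T of one year (illustrative)
print(expected_downtime(a_ss, hours_per_year))     # hours out of service per year
print(productivity_loss(a_ss, 1000.0))             # units lost per year for W = 1000
print(required_rated_productivity(a_ss, 1000.0))   # rated W' for a net 1000 per year
```

Note that the required rated productivity is the target divided by the availability, which is slightly more than the target plus the loss computed on the target itself.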
69Availability
We can show that $A_{ss} = \frac{MTTF}{MTTF + MTTR}$. This result is valid without
making any assumptions on the form of the
distributions of times to failure and times to
repair.
Also
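A small Monte Carlo sketch of this distribution-free property, assuming the relation in question is $A_{ss} = MTTF / (MTTF + MTTR)$. The Weibull up times and the uniform down times are arbitrary non-exponential choices made only for illustration, as are the function names and the seed.

```python
import math
import random

def simulate_availability(n_cycles, draw_up, draw_down):
    """Fraction of time spent UP over n_cycles failure/repair cycles."""
    up_total = 0.0
    down_total = 0.0
    for _ in range(n_cycles):
        up_total += draw_up()
        down_total += draw_down()
    return up_total / (up_total + down_total)

rng = random.Random(2003)  # fixed seed so that the run is repeatable

def draw_up():
    return rng.weibullvariate(100.0, 2.0)  # up time: Weibull, scale 100 h, shape 2

def draw_down():
    return rng.uniform(5.0, 15.0)  # down time: uniform between 5 h and 15 h

mttf = 100.0 * math.gamma(1.0 + 1.0 / 2.0)  # mean of the Weibull up time
mttr = 10.0                                 # mean of the uniform down time
predicted = mttf / (mttf + mttr)
simulated = simulate_availability(20000, draw_up, draw_down)
print(f"predicted {predicted:.4f}, simulated {simulated:.4f}")
```

Only the two means enter the prediction; the simulated fraction of UP time agrees with it closely even though neither distribution is exponential.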
70Motivation High Availability
71Maintainability
- MDT (Mean Down Time), or MTTR (mean time to
restoration).
- The total down time (Y) consists of:
- Failure detection time
- Alarm notification time
- Dispatch and travel time of the repair person(s)
- Repair or replacement time
- Reboot time
72Maintainability
- The total down time (Y) consists of:
- Logistic time
- Administrative times
- Dispatch and travel time of the repair
person(s)
- Waiting time for spares, tools
- Effective restoration time
- Access and diagnosis time
- Repair or replacement time
- Test and reboot time
73Maintenance Costs
- The total cost of a maintenance action consists
of:
- Cost of spares and replaced parts
- Cost of person/hours for repair
- Down-time cost (loss of productivity)
- The down-time cost (due to a loss of
productivity) can be the most relevant cost
factor.
74Maintenance Policy
- Is the sequence of actions that minimizes the
total cost related to a down time
- Reactive maintenance
- maintenance action is triggered by a failure.
- Proactive maintenance
- preventive maintenance policy.