Probability and Statistics with Reliability, Queuing and Computer Science Applications: Chapter 8 on Continuous-Time Markov Chains Kishor Trivedi


1
Probability and Statistics with Reliability,
Queuing and Computer Science Applications
Chapter 8 on Continuous-Time Markov Chains
Kishor Trivedi
2
Non-State Space Models

Recall that non-state-space models like RBDs and
FTs can easily be formulated and (assuming
statistical independence) solved for system
reliability, system availability and system MTTF
Each component can have attached to it
A probability of failure
A failure rate
A distribution of time to failure
Steady-state and instantaneous unavailability

3
Markov chain

To model complex interactions between components,
use other kinds of models like Markov chains or
more generally state space models.
Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.

4
MARKOV CHAINS

State-space based model
States represent various conditions of the system
Transitions between states indicate occurrences
of events

5
State-Space-Based Models

States and labeled state transitions
State can keep track of
Number of functioning resources of each type
States of recovery for each failed resource
Number of tasks of each type waiting at each
resource
Allocation of resources to tasks
A transition
Can occur from any state to any other state
Can represent a simple or a compound event

6
State-Space-Based Model (Continued)

Drawn as a directed graph
Transition label:
Probability: homogeneous discrete-time Markov chain (DTMC)
Rate: homogeneous continuous-time Markov chain (CTMC)
Time-dependent rate: non-homogeneous CTMC
Distribution function: semi-Markov process (SMP)
Two distribution functions: Markov regenerative process (MRGP)

7
MARKOV CHAINS (Continued)

For continuous-time Markov chains (CTMCs) the
time variable associated with the system
evolution is continuous
We will mean a CTMC whenever we speak of Markov
model (chain)

8
Chapter 8

Continuous Time Markov Chains

9
Formal Definition

A discrete-state continuous-time stochastic
process $\{X(t),\ t \ge 0\}$ is called a Markov
chain if,
for $t_0 < t_1 < t_2 < \cdots < t_n < t$, the
conditional pmf satisfies the following Markov
property:
$P[X(t) = x \mid X(t_n) = x_n, \ldots, X(t_0) = x_0] = P[X(t) = x \mid X(t_n) = x_n]$
A CTMC is characterized by state changes that can
occur at any arbitrary time.
The index space is continuous.
The state space is discrete valued.

10
Continuous Time Markov Chain (CTMC)

A CTMC can be completely described by:
Initial state probability vector for $X(t_0)$
Transition probability functions (over an
interval): $p_{ij}(v,t) = P[X(t) = j \mid X(v) = i]$, $0 \le v \le t$

11
pmf of X(t)

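Using the theorem of total probability, with $\pi_j(t) = P[X(t) = j]$:
$\pi_j(t) = \sum_i p_{ij}(v,t)\, \pi_i(v)$
If $v = 0$ in the above equation, we get
$\pi_j(t) = \sum_i p_{ij}(0,t)\, \pi_i(0)$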

12
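Homogeneous CTMCs

$\{X(t),\ t \ge 0\}$ is a (time-)homogeneous CTMC iff $p_{ij}(v,t)$ depends only on the elapsed time $t - v$.
Or, the conditional pmf satisfies
$P[X(t) = j \mid X(v) = i] = P[X(t - v) = j \mid X(0) = i] = p_{ij}(t - v)$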
A CTMC is said to be irreducible if every state
can be reached from every other state, with a
non-zero probability.
A state is said to be absorbing if no other state
can be reached from it with non-zero probability.
The notions of transient, recurrent non-null, and
recurrent null states are the same as in a DTMC. There
is no notion of periodicity in a CTMC, however.

13
CTMC Dynamics
Chapman-Kolmogorov Equation

$p_{ij}(v,t) = \sum_k p_{ik}(v,u)\, p_{kj}(u,t)$, $0 \le v < u < t$
Note that these transition probabilities are
functions of elapsed time and not of the number
of elapsed steps.
Direct use of this equation is difficult,
unlike the case of a DTMC, where we could anchor
on one-step transition probabilities.
Hence the notion of rates of transitions, which
follows next.

14
Transition Rates

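Define the rates (probabilities per unit time):
$q_j(t) = \lim_{h \to 0} \frac{1 - p_{jj}(t, t+h)}{h}$: net rate out of state j at time t
$q_{ij}(t) = \lim_{h \to 0} \frac{p_{ij}(t, t+h)}{h}$, $i \ne j$: the rate from state i to state j at time t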

15
Kolmogorov Differential Equation

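The transition probabilities and transition rates
are related by
$p_{ij}(t, t+h) = q_{ij}(t)\, h + o(h)$ for $i \ne j$, and $p_{jj}(t, t+h) = 1 - q_j(t)\, h + o(h)$
Substituting into the Chapman-Kolmogorov equation $p_{ij}(v, t+h) = \sum_k p_{ik}(v,t)\, p_{kj}(t, t+h)$, subtracting $p_{ij}(v,t)$, dividing both sides by h and taking the limit $h \to 0$ gives Kolmogorov's forward equation:
$\frac{\partial p_{ij}(v,t)}{\partial t} = \sum_{k \ne j} p_{ik}(v,t)\, q_{kj}(t) - p_{ij}(v,t)\, q_j(t)$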

16
Kolmogorov Differential Equation (contd.)

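Kolmogorov's backward equation:
$\frac{\partial p_{ij}(v,t)}{\partial v} = q_i(v)\, p_{ij}(v,t) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v,t)$
Writing these equations in matrix form, with $P(v,t) = [p_{ij}(v,t)]$, $Q(t) = [q_{ij}(t)]$ and $q_{jj}(t) = -q_j(t)$:
$\frac{\partial P(v,t)}{\partial t} = P(v,t)\, Q(t)$, $\frac{\partial P(v,t)}{\partial v} = -Q(v)\, P(v,t)$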

17
Homogeneous CTMC

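Specializing the Kolmogorov differential equation to a homogeneous CTMC (HCTMC), where the rates do not depend on t:
$\frac{d\pi_j(t)}{dt} = -q_j\, \pi_j(t) + \sum_{i \ne j} q_{ij}\, \pi_i(t)$
In matrix form, $\frac{d\pi(t)}{dt} = \pi(t)\, Q$, where the matrix $Q = [q_{ij}]$, with $q_{jj} = -q_j$, is called the
infinitesimal generator matrix (or simply the generator matrix).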

18
CTMC Steady-state Solution

The steady-state solution of a CTMC is obtained by
solving the following balance equations:
$\pi Q = 0$, $\sum_j \pi_j = 1$
Irreducible CTMCs with all states recurrent
non-null will have positive steady-state $\pi_j$ values
that are unique and independent of the initial
probability vector. All states of a finite
irreducible CTMC will be recurrent non-null.
Measures of interest may be computed by assigning
reward rates $r_j$ to states and computing the expected
steady-state reward rate $\sum_j r_j\, \pi_j$.
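The balance equations above can be solved numerically for any small finite irreducible CTMC. The following is a minimal sketch using only the Python standard library: it replaces one (redundant) balance equation by the normalization condition and applies Gaussian elimination. The function names, the two-state example chain, and the rate values are illustrative assumptions, not taken from the slides.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for a finite irreducible CTMC.

    Q is a list of rows of the generator matrix (each row sums to zero).
    """
    n = len(Q)
    # Transpose so the unknowns form a column vector: Q^T pi^T = 0.
    A = [[Q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    # The n balance equations are linearly dependent, so replace the
    # last one with the normalization condition sum(pi) = 1.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - tail) / A[r][r]
    return pi


def expected_reward_rate(pi, rewards):
    """Expected steady-state reward rate: sum_j r_j * pi_j."""
    return sum(p * r for p, r in zip(pi, rewards))


# Hypothetical two-state chain: index 0 = up, index 1 = down.
lam, mu = 0.001, 0.1          # assumed failure and repair rates (per hour)
Q = [[-lam, lam],
     [mu, -mu]]
pi = steady_state(Q)
# Reward rate 1 for the up state and 0 for the down state gives availability.
availability = expected_reward_rate(pi, [1.0, 0.0])
```

Assigning reward rate 1 to up states and 0 to down states turns the expected steady-state reward rate into the steady-state availability; other reward vectors give other measures from the same probability vector.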

19
CTMC Measures

Measures of interest may be computed by
assigning reward rates $r_j$ to states and computing the
expected reward rate at time t: $E[Z(t)] = \sum_j r_j\, \pi_j(t)$
Expected accumulated reward (over an interval of
time): $E[Y(t)] = \sum_j r_j\, L_j(t)$
$L_j(t) = \int_0^t \pi_j(x)\, dx$ is the expected time spent in state j
during (0,t)

20
Markov Availability Model
21
2-State Markov Availability Model

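1) Steady-state balance equations for each state, with failure rate $\lambda$ and repair rate $\mu$:
Rate of flow IN = rate of flow OUT
State 1 (up): $\mu \pi_0 = \lambda \pi_1$
State 0 (down): $\lambda \pi_1 = \mu \pi_0$
2 unknowns, 2 equations, but there is only one
independent equation.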

22
2-State Markov Availability Model(Continued)

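Need an additional equation: $\pi_0 + \pi_1 = 1$

Steady-state availability: $A_{ss} = \pi_1 = \frac{\mu}{\lambda + \mu}$
Downtime in minutes per year = $(1 - A_{ss}) \times 8760 \times 60$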
23
2-State Markov Availability Model(Continued)

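2) Transient availability for each state:
Rate of buildup = rate of flow IN - rate of flow
OUT
$\frac{d\pi_1(t)}{dt} = \mu\, \pi_0(t) - \lambda\, \pi_1(t)$, with $\pi_0(t) = 1 - \pi_1(t)$
This equation can be solved, assuming
$\pi_1(0) = 1$, to obtain $\pi_1(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$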
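As a quick numerical illustration of the two-state model, the sketch below evaluates the standard closed-form results: steady-state availability, yearly downtime, and instantaneous availability starting from the up state. The symbols `lam` (failure rate) and `mu` (repair rate) and their values are assumptions chosen for illustration only.

```python
import math


def steady_state_availability(lam, mu):
    """A_ss = mu / (lam + mu) for the 2-state up/down model."""
    return mu / (lam + mu)


def downtime_minutes_per_year(availability):
    """Downtime = (1 - A) * 8760 hours/year * 60 minutes/hour."""
    return (1.0 - availability) * 8760 * 60


def instantaneous_availability(lam, mu, t):
    """pi_1(t), the probability of being up at time t, given pi_1(0) = 1."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


# Assumed values: MTTF = 1000 hours, MTTR = 4 hours.
lam = 1.0 / 1000.0
mu = 1.0 / 4.0
a_ss = steady_state_availability(lam, mu)
print("steady-state availability:", a_ss)
print("downtime (min/year):", downtime_minutes_per_year(a_ss))
for t in (0.0, 1.0, 10.0, 100.0):
    print("A(%g) = %.8f" % (t, instantaneous_availability(lam, mu, t)))
```

The instantaneous availability starts at 1 and decays toward the steady-state value at rate `lam + mu`, so the transient dies out on the time scale of the repair time when repair is much faster than failure.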

24
2-State Markov Availability Model(Continued)

3)
4) Steady State Availability

25
Markov availability model

Assume we have a two-component parallel redundant
system with repair rate $\mu$.
Assume that the failure rate of both the
components is $\lambda$.
When both the components have failed, the system
is considered to have failed.

26
Markov availability model (Continued)

Let the number of properly functioning components
be the state of the system. The state space is
{0, 1, 2}, where 0 is the system down state.
We wish to examine effects of shared vs.
non-shared repair.

27
Markov availability model (Continued)
[Figure: two state diagrams over states 2, 1, 0, one for non-shared (independent) repair and one for shared repair]
28
Markov availability model (Continued)

Note: the non-shared case can be modeled and solved
using an RBD or a fault tree, but the shared case needs the
use of Markov chains.

29
Steady-state balance equations

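For any state:
Rate of flow in = Rate of flow out
Consider the shared case:
$2\lambda \pi_2 = \mu \pi_1$ (state 2), $(\lambda + \mu) \pi_1 = 2\lambda \pi_2 + \mu \pi_0$ (state 1), $\mu \pi_0 = \lambda \pi_1$ (state 0)
$\pi_i$ = steady-state probability that the system is in
state i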

30
Steady-state balance equations (Continued)

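Hence $\pi_1 = \frac{2\lambda}{\mu} \pi_2$ and $\pi_0 = \frac{\lambda}{\mu} \pi_1 = \frac{2\lambda^2}{\mu^2} \pi_2$
Since $\pi_0 + \pi_1 + \pi_2 = 1$
We have $\pi_2 \left(1 + \frac{2\lambda}{\mu} + \frac{2\lambda^2}{\mu^2}\right) = 1$
or $\pi_2 = \frac{1}{1 + \frac{2\lambda}{\mu} + \frac{2\lambda^2}{\mu^2}}$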

31
Steady-state balance equations (Continued)

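Steady-state unavailability (shared case): $\pi_0 = 1 - A_{shared} = \frac{2\lambda^2}{\mu^2 + 2\lambda\mu + 2\lambda^2}$
Similarly for the non-shared case,
steady-state unavailability $= 1 - A_{non-shared} = \frac{\lambda^2}{(\lambda + \mu)^2}$
Downtime in minutes per year = $(1 - A) \times 8760 \times 60$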
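To see how much a shared repair facility costs in availability, the two unavailability expressions can be compared numerically. This is a small sketch under the usual birth-death assumptions (failure rates 2λ and λ out of states 2 and 1; repair rate μ per busy repairer, with one repairer in the shared case and one per failed component in the non-shared case). The rate values are assumed for illustration.

```python
MINUTES_PER_YEAR = 8760 * 60


def unavailability_shared(lam, mu):
    """pi_0 with a single shared repair facility (repair rate mu in states 1 and 0)."""
    return 2.0 * lam ** 2 / (mu ** 2 + 2.0 * lam * mu + 2.0 * lam ** 2)


def unavailability_non_shared(lam, mu):
    """pi_0 with independent repair (repair rate 2*mu in state 0).

    Equals the square of the single-component unavailability lam / (lam + mu).
    """
    return (lam / (lam + mu)) ** 2


lam, mu = 0.0001, 1.0   # assumed rates per hour
u_shared = unavailability_shared(lam, mu)
u_non_shared = unavailability_non_shared(lam, mu)
for name, u in (("shared", u_shared), ("non-shared", u_non_shared)):
    print("%-10s unavailability = %.3e, downtime = %.4f min/year"
          % (name, u, u * MINUTES_PER_YEAR))
```

When λ is much smaller than μ, the shared-repair unavailability is roughly twice the non-shared value, because the only difference is the repair rate out of the down state (μ instead of 2μ).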

32
Steady-state balance equations
33
A larger example

Return to the 2 control and 3 voice channels
example and assume that the control channel
failure rate is $\lambda_c$ and the voice channel failure rate is
$\lambda_v$.
Repair rates are $\mu_c$ and $\mu_v$, respectively.
Assuming a single shared repair facility and
control channel having preemptive repair priority
over voice channels, draw the state diagram of a
Markov availability model. Using SHARPE GUI,
solve the Markov chain for steady-state and
instantaneous availability.

35

WFS Example

36
A Workstations-Fileserver Example

Computing system consisting of
A file-server
Two workstations
Computing network connecting them
System is operational as long as at least one of the
workstations and the file-server are operational
Computer network is assumed to be fault-free

37
The WFS Example
38
Markov Chain for WFS Example

Assuming exponentially distributed times to
failure:
$\lambda_w$: failure rate of workstation
$\lambda_f$: failure rate of file-server
Assume that components are repairable:
$\mu_w$: repair rate of workstation
$\mu_f$: repair rate of file-server
File-server has (preemptive) priority for repair
over workstations (such repair priority cannot be
captured by non-state-space models)

39
Markov Availability Model for WFS
Since every state is reachable from every other
state, the CTMC is irreducible. Furthermore, all
states are positive recurrent.
40
Markov Availability Model for WFS (Continued)

In this figure, the label (i,j) of each state
is interpreted as follows: i represents the
number of workstations that are still functioning,
and j is 1 or 0 depending on whether the
file-server is up or down, respectively.

41
Markov Model

Let $\{X(t),\ t > 0\}$ represent a finite-state
Continuous Time Markov Chain (CTMC) with state
space $\Omega$.
Infinitesimal generator matrix: $Q = [q_{ij}]$
$q_{ij}$ ($i \ne j$): transition rate from state i to
state j
$q_{ii} = -q_i = -\sum_{j \ne i} q_{ij}$: the diagonal
element

42
Markov Availability Model for WFS (Continued)

For the example problem, with the states ordered
as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q
matrix is given by

Q
44
Markov Model (steady-state)
$\pi$: steady-state probability vector
$\pi Q = 0$, $\sum_{i \in \Omega} \pi_i = 1$
These are called steady-state balance equations: rate of
flow in = rate of flow out.
After solving for $\pi$, obtain the steady-state availability.
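The workstations-fileserver model can be assembled and solved in a few lines. The sketch below is an illustration under stated assumptions: the numeric rates are not given in this transcript and are assumed here, and the transition structure is reconstructed from the description (single repair facility, file-server has repair priority, workstations may still fail while the file-server is down). Instead of a direct linear solve, it uses power iteration on the uniformized DTMC $P = I + Q/q$, which has the same stationary vector as the CTMC.

```python
# Assumed rates per hour (illustrative; not stated in the slides).
lam_w, lam_f = 0.0001, 0.00005   # workstation / file-server failure rates
mu_w, mu_f = 1.0, 0.5            # workstation / file-server repair rates

# State (i, j): i = working workstations, j = 1 if file-server is up.
states = [(2, 1), (2, 0), (1, 1), (1, 0), (0, 1), (0, 0)]
index = {s: k for k, s in enumerate(states)}
n = len(states)
Q = [[0.0] * n for _ in range(n)]


def add_rate(src, dst, rate):
    Q[index[src]][index[dst]] += rate


for w, f in states:
    if w > 0:
        add_rate((w, f), (w - 1, f), w * lam_w)   # one of w workstations fails
    if f == 1:
        add_rate((w, 1), (w, 0), lam_f)           # file-server fails
        if w < 2:
            add_rate((w, 1), (w + 1, 1), mu_w)    # workstation repaired (server is up)
    else:
        add_rate((w, 0), (w, 1), mu_f)            # file-server repaired first (priority)

for i in range(n):
    Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)

# Uniformized DTMC: P = I + Q/q with q slightly above the largest exit rate.
q = 1.05 * max(-Q[i][i] for i in range(n))
P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)] for i in range(n)]

pi = [1.0 / n] * n
for _ in range(100000):
    nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    diff = max(abs(a - b) for a, b in zip(nxt, pi))
    pi = nxt
    if diff < 1e-14:
        break

# The system is up in states (2,1) and (1,1).
availability = pi[index[(2, 1)]] + pi[index[(1, 1)]]
print("steady-state availability:", availability)
```

Power iteration is a swapped-in technique chosen here for brevity; a direct solve of $\pi Q = 0$ with the normalization condition gives the same vector.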
45
Markov Model (transient)

$\pi(t)$: transient state probability vector
$\pi(0)$: initial probability vector of the CTMC
Transient behavior is described by the Kolmogorov
differential equation $\frac{d\pi(t)}{dt} = \pi(t)\, Q$
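One common way to solve this differential equation numerically is uniformization (also called Jensen's method or randomization), which writes $\pi(t) = \sum_{k \ge 0} e^{-qt} \frac{(qt)^k}{k!}\, \pi(0) P^k$ with $P = I + Q/q$. The sketch below is a minimal illustration, not the method used by any particular tool; the function name, tolerance, and the two-state test chain are assumptions. It is only suitable for moderate values of $q t$, since $e^{-qt}$ underflows for very large ones.

```python
import math


def transient_probabilities(Q, pi0, t, tol=1e-10, max_terms=100000):
    """Approximate pi(t) = pi(0) exp(Q t) by uniformization."""
    n = len(Q)
    # Uniformization rate: at least the largest exit rate (1.0 if all states absorb).
    q = max(-Q[i][i] for i in range(n)) * 1.02 or 1.0
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)          # Poisson probability of k = 0 jumps
    v = list(pi0)                      # v holds pi(0) P^k
    result = [weight * x for x in v]
    accumulated = weight
    k = 0
    # Add Poisson-weighted terms until the neglected tail mass is below tol.
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        accumulated += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result


# Check against the closed form of a two-state chain (index 0 = up, 1 = down).
lam, mu, t = 0.01, 1.0, 3.0            # assumed illustrative values
Q = [[-lam, lam],
     [mu, -mu]]
p = transient_probabilities(Q, [1.0, 0.0], t)
s = lam + mu
closed_form = mu / s + (lam / s) * math.exp(-s * t)
print("uniformization:", p[0], "closed form:", closed_form)
```

Uniformization only multiplies by a stochastic matrix and adds non-negative terms, which is why it is often preferred over generic ODE solvers for CTMC transient analysis.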

46
Markov Availability Model

We compute the availability of the system. The system
is available as long as it is in states (2,1)
or (1,1).
Instantaneous availability of the system: $A(t) = \pi_{(2,1)}(t) + \pi_{(1,1)}(t)$

47
Availability (Continued)

Interval availability: $A_I(t) = \frac{1}{t} \int_0^t A(x)\, dx$
Steady-state availability: $A_{ss} = \lim_{t \to \infty} A(t)$
There are three kinds of availability:
instantaneous, interval, and steady-state.

48
Markov Availability Model (Continued)
$L_{(i,j)}(t)$: expected total time spent in state
(i,j) during (0,t)

Interval availability: $A_I(t) = \frac{L_{(2,1)}(t) + L_{(1,1)}(t)}{t}$

49
Markov Availability Model (Continued)
50
2-component availability model with finite
detection delay

2-component availability model:
steady-state availability $A_{ss} = 1 - \pi_0$
Failure detection stage takes random time, EXP($\delta$)
Down states are 0 and 1D, so $A_{ss} = 1 - \pi_0 - \pi_{1D}$
Therefore, steady-state unavailability $U(\delta)$ is
given by $U(\delta) = \pi_0 + \pi_{1D}$

51
Redundant System with Finite Detection Switchover
Time

After solving the Markov model, we obtain
steady-state probabilities
Can solve in closed-form or using SHARPE

52
Closed-form
53
2-component availability model with imperfect
coverage

Coverage factor c (conditional probability that
the fault is correctly handled)
1C state is a reboot (down) state.

54
2-component availability model: delay and
imperfect coverage

Model has detection delay and imperfect coverage
Down states are 0, 1C and 1D.

55
Modeling Software Faults: Operating System Failure
Availability model with hardware and software
(OS) redundancy; operational phase; Heisenbugs
Probability and Statistics with Reliability,
Queuing and Computer Science Applications (2nd
ed.), K. S. Trivedi, John Wiley, 2001.

Assumptions
Hardware failures are permanent and require
a repair or replacement action, while OS failures
are cleared by a reboot
Repair or reboot takes place at rates ? and ? for
the hardware and OS, respectively.

56
Webserver Availability Model with warm Replication

Two nodes for hardware redundancy
Each node has a copy of the webserver (software
redundancy replication)
Primary node can fail
Secondary node can fail
Primary process can fail
Secondary process can fail
Failures may have imperfect coverage
Time delay for fault detection
Model of a real system developed at Avaya Labs

57
Modeling Software Faults: Application Failure
Availability model with passive redundancy (warm
replication) of application; operational phase;
Heisenbugs or hardware transients

Assumptions
A web server software that fails at the rate $\lambda_p$,
running on a machine that fails at the rate $\lambda_m$
Mean time to detect server process failure $\delta_p^{-1}$
and the mean time to detect machine failure $\delta_m^{-1}$
The mean restart time of a machine $\mu_m^{-1}$
The mean restart time of a server $\mu_p^{-1}$

"Performance and Reliability Evaluation of
Passive Replication Schemes in Application Level
Fault-Tolerance," S. Garg, Y. Huang, C. Kintala,
K. S. Trivedi and S. Yagnik, Proc. of the 29th
Intl. Symp. on Fault-Tolerant Computing, FTCS-29,
June 1999.
58
Parameters

Process MTTF = 10 days ($1/\lambda_p$)
Node MTTF = 20 days ($1/\lambda_n$)
Process polling interval = 2 seconds ($1/\delta_p$)
Mean process restart time = 30 seconds ($1/\mu_p$)
Mean process failover time = 2 minutes ($1/\mu_n$)
Switching time with mean $1/\mu_s$
c = 0.95

59
Solution for warm replication
60
Modeling an N+1 Protection System
61
Outline

Description of the system
Using a rate approximation
Using a 3-stage Erlang approximation to a uniform
distribution
Using a Semi-Markov model - approximation method
using a 3-stage Erlang distribution
Using equations of the underlying Semi-Markov
Process
Solutions for the models

62
Description of the system

N = number of protected units (we use N = 1)
$\lambda$ = unit failure rate
$\mu$ = unit restoration rate
T = deterministic time between routine
diagnostics
c = probability that a protection switch
successfully restores service
d = probability that a failure in the standby
unit is detected

64
Using a rate approximation (N=1)
[Figure: CTMC state diagram. State numbering: Normal (1+1) = 1, Protection Switch Failure = 2, Simplex (1) = 3, Failure to Detect Protection Fault = 4, Failed (0) = 5. Transition labels include $(1-d)\lambda$, $(1-c)\lambda$, $(c+d)\lambda$ and $2/T$.]
Time to diagnostic is exponentially
distributed with mean T/2
67
[Figure: comparison of probability density functions (pdf), T = 1]
68
[Figure: comparison of cumulative distribution functions
(cdf), T = 1]
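The comparison shown in these two figures can be reproduced numerically. The sketch below evaluates the CDFs of the three candidate distributions for the time to the next diagnostic: uniform over (0, T), exponential with rate 2/T, and a 3-stage Erlang with stage rate 6/T, all with mean T/2. The function names and the evaluation grid are assumptions made for illustration.

```python
import math


def uniform_cdf(t, T):
    """CDF of the uniform distribution over (0, T)."""
    return min(max(t / T, 0.0), 1.0)


def exponential_cdf(t, T):
    """CDF of the exponential distribution with rate 2/T (mean T/2)."""
    return 1.0 - math.exp(-2.0 * t / T) if t > 0 else 0.0


def erlang3_cdf(t, T):
    """CDF of a 3-stage Erlang with stage rate 6/T (mean 3 * T/6 = T/2)."""
    if t <= 0:
        return 0.0
    x = 6.0 * t / T
    return 1.0 - math.exp(-x) * (1.0 + x + x * x / 2.0)


def moments(T):
    """(mean, variance) of each distribution."""
    r = 6.0 / T
    return {
        "uniform": (T / 2.0, T ** 2 / 12.0),
        "exponential": (T / 2.0, (T / 2.0) ** 2),
        "erlang3": (3.0 / r, 3.0 / r ** 2),
    }


T = 1.0
grid = [k * T / 100 for k in range(101)]
max_err_exp = max(abs(exponential_cdf(t, T) - uniform_cdf(t, T)) for t in grid)
max_err_erl = max(abs(erlang3_cdf(t, T) - uniform_cdf(t, T)) for t in grid)
print("max CDF error, exponential vs uniform:", max_err_exp)
print("max CDF error, 3-stage Erlang vs uniform:", max_err_erl)
print("moments:", moments(T))
```

With stage rate 6/T, the 3-stage Erlang matches both the mean (T/2) and the variance (T²/12) of the uniform distribution, whereas the exponential matches only the mean; that is one reason the Erlang is the better phase-type stand-in here.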
69
Using a 3-stage Erlang approximation to a uniform
distribution (N=1)
[Figure: CTMC state diagram. States: Normal (1+1), Protection Switch Failure, Simplex (1), Failure to Detect Protection Fault with additional Erlang stages s1 and s2, and Failed (0). Transition labels include $(1-d)\lambda$, $(1-c)\lambda$, $(c+d)\lambda$ and three stage transitions at rate 6/T.]
Time to diagnostic is uniformly distributed over
(0,T), approximated by a 3-stage Erlang with
mean T/2
72
Using a Semi-Markov model - approximation method
using an Erlang distribution (N=1)
E(t) denotes the 3-stage Erlang distribution
[Figure: semi-Markov state diagram. States: Normal (1+1), Protection Switch Failure, Simplex (1), Failure to Detect Protection Fault, and Failed (0). Transition labels include $(1-d)\lambda$, $(1-c)\lambda$, $(c+d)\lambda$ and E(t).]
Time to diagnostic is uniformly distributed over
(0,T), approximated by a 3-stage Erlang
distribution with mean T/2
74
Using Equations of the underlying Semi-Markov
Process

Steady state solution
One step transition probability matrix, P of the
embedded DTMC

75
Using Equations of the underlying Semi-Markov
Process (Continued)
76
Using Equations of the underlying Semi-Markov
Process (Continued)

Time to the next diagnostic is uniformly
distributed over (0,T)

77
Using Equations of the underlying Semi-Markov
Process (Continued)
79
Solutions for the models

Parameter values assumed:
N = 1
c = 0.9
d = 0.9
$\lambda$ = 0.0001 / hour
$\mu$ = 1 / hour
T = 1 hour

80
Results obtained

Steady-state availability:
Probability of being in states Normal,
Simplex, or Failure to Detect Protection
Fault
Steady-state unavailability:
Probability of being in states Protection Switch
Failure, or Failed (0)
Average downtime in steady state:
Steady-state unavailability × number of minutes
in a year
Average units available:
$2 P_{Normal} + 1 P_{Simplex} + 1 P_{FailureToDetectProtectionFault}$
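Each of these measures is a reward-weighted sum over the steady-state probabilities. The sketch below shows the bookkeeping only: the probability values are made-up placeholders (labeled as such in the code), not results from the slides, and the state names are shortened forms of the ones used above.

```python
MINUTES_PER_YEAR = 8760 * 60

UP_STATES = ("Normal", "Simplex", "FailureToDetectProtectionFault")
DOWN_STATES = ("ProtectionSwitchFailure", "Failed")

# Number of working units in each state, used as a reward rate.
UNITS = {
    "Normal": 2,
    "Simplex": 1,
    "FailureToDetectProtectionFault": 1,
    "ProtectionSwitchFailure": 0,
    "Failed": 0,
}


def measures(prob):
    """Compute the four measures from a dict of steady-state probabilities."""
    availability = sum(prob[s] for s in UP_STATES)
    unavailability = sum(prob[s] for s in DOWN_STATES)
    return {
        "availability": availability,
        "unavailability": unavailability,
        "downtime_min_per_year": unavailability * MINUTES_PER_YEAR,
        "average_units": sum(UNITS[s] * prob[s] for s in prob),
    }


# HYPOTHETICAL placeholder probabilities, for illustration only.
prob = {
    "Normal": 0.9997,
    "Simplex": 0.0002,
    "FailureToDetectProtectionFault": 0.00005,
    "ProtectionSwitchFailure": 0.00004,
    "Failed": 0.00001,
}
result = measures(prob)
for name, value in result.items():
    print(name, "=", value)
```

In practice `prob` would come from solving one of the models above (rate approximation, Erlang approximation, or the semi-Markov equations), and the same function can then be used to compare them.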

82
Markov Reliability Model
83
Markov reliability model with repair

Consider the 2-component parallel system (no
detection delay, perfect coverage) but disallow repair from
the system down state.
Note that state 0 is now an absorbing state. The
state diagram is given in the following figure.
This reliability model with repair cannot be
modeled using a reliability block diagram or a
fault tree. We need to resort to Markov chains.
(This is a form of dependency since in order to
repair a component you need to know the status of
the other component).

84
Markov reliability model with repair (Continued)
Absorbing state

Markov chain has an absorbing state. In the
steady-state, system will be in state 0 with
probability 1. Hence transient analysis is of
interest. States 1 and 2 are transient states.

85
Markov reliability model with repair (Continued)

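Assume that the initial state of the Markov chain
is 2, that is, $\pi_2(0) = 1$, $\pi_k(0) = 0$ for k = 0,
1.
Then the system of differential equations is
written
based on
rate of buildup = rate of flow in - rate of flow
out
for each state:
$\frac{d\pi_2(t)}{dt} = -2\lambda\, \pi_2(t) + \mu\, \pi_1(t)$
$\frac{d\pi_1(t)}{dt} = 2\lambda\, \pi_2(t) - (\lambda + \mu)\, \pi_1(t)$
$\frac{d\pi_0(t)}{dt} = \lambda\, \pi_1(t)$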

86
Markov reliability model with repair
(Continued)
87
Markov reliability model with repair
(Continued)

After solving these equations, we get
$R(t) = \pi_2(t) + \pi_1(t)$
Recalling that
$MTTF = \int_0^\infty R(t)\, dt$, we get $MTTF = \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}$

88
Markov reliability model with repair
(Continued)

Note that the MTTF of the two-component
parallel redundant system, in the absence
of a repair facility (i.e., $\mu = 0$), would
have been equal to the first term,
$3 / (2\lambda)$, in the above expression.
Therefore, the effect of a repair facility is
to increase the mean life by $\mu / (2\lambda^2)$, or by a
factor of $\frac{\mu / (2\lambda^2)}{3 / (2\lambda)} = \frac{\mu}{3\lambda}$.
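The MTTF expression can be cross-checked without solving the differential equations, by computing the mean time spent in each transient state before absorption. The sketch below assumes the transient part of the generator is $Q_B = \begin{pmatrix} -2\lambda & 2\lambda \\ \mu & -(\lambda+\mu) \end{pmatrix}$ for states (2, 1) and that the chain starts in state 2; the rate values are illustrative assumptions.

```python
def mean_times_before_failure(lam, mu):
    """Mean total time spent in states 2 and 1 before absorption in state 0.

    These solve tau * Q_B = -(1, 0) for the Q_B described above.
    """
    tau2 = (lam + mu) / (2.0 * lam ** 2)
    tau1 = 1.0 / lam
    return tau2, tau1


def mttf_closed_form(lam, mu):
    """MTTF = 3/(2 lam) + mu/(2 lam^2)."""
    return 3.0 / (2.0 * lam) + mu / (2.0 * lam ** 2)


lam, mu = 0.0001, 1.0          # assumed rates per hour
tau2, tau1 = mean_times_before_failure(lam, mu)
mttf = tau2 + tau1

# Residuals of the two scalar equations in tau * Q_B = -(1, 0).
res_state2 = tau2 * (-2.0 * lam) + tau1 * mu + 1.0
res_state1 = tau2 * (2.0 * lam) - tau1 * (lam + mu)

print("MTTF with repair   :", mttf)
print("MTTF closed form   :", mttf_closed_form(lam, mu))
print("MTTF without repair:", mttf_closed_form(lam, 0.0))
```

Setting `mu = 0` recovers the no-repair value $3/(2\lambda)$, and the difference between the two printed MTTF values is the $\mu/(2\lambda^2)$ gain attributed to the repair facility.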

89
Markov Reliability Model with Repair ( WFS
Example)

Assume that the computer system does not recover
if both workstations fail, or if the file-server
fails

90
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing
states while (2,1) and (1,1) are transient
states. Note we have made a simplification that,
once the CTMC reaches a system failure state, we
do not allow any more transitions.
92
Markov Model with Absorbing States

If we solve for $\pi_{2,1}(t)$ and $\pi_{1,1}(t)$ then
$R(t) = \pi_{2,1}(t) + \pi_{1,1}(t)$
For a Markov chain with absorbing states:
A = the set of absorbing states
B = $\Omega$ - A = the set of remaining states
$\tau_{i,j}$ = mean time spent in state (i,j) until
absorption

93
Markov Model with Absorbing States (Continued)
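$Q_B$ is derived from Q by restricting it to only
states in B. The vector $\tau$ of mean times satisfies $\tau Q_B = -\pi_B(0)$.
Mean time to absorption (MTTA) is given as $MTTA = \sum_{(i,j) \in B} \tau_{i,j}$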
94
Markov Reliability Model with Repair (Continued)
95
Markov Reliability Model with Repair (Continued)

Mean time to failure is 19992 hours.
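The mean time to failure of the workstations-fileserver reliability model can be computed from the 2×2 transient block of the generator. The sketch below assumes rate values that are not given in this transcript ($\lambda_w = 0.0001$, $\lambda_f = 0.00005$, $\mu_w = 1.0$ per hour) and an initial state of (2,1); with these assumed values the result comes out close to the figure quoted on the slide. The linear system $\tau Q_B = -\pi_B(0)$ is solved by Cramer's rule.

```python
# Assumed rates per hour (illustrative; not stated in the slides).
lam_w, lam_f, mu_w = 0.0001, 0.00005, 1.0

# Transient states B = [(2,1), (1,1)]; all other states are absorbing.
QB = [[-(lam_f + 2.0 * lam_w), 2.0 * lam_w],
      [mu_w, -(mu_w + lam_f + lam_w)]]
pi0 = [1.0, 0.0]               # start with everything working

# Solve tau * QB = -pi0, i.e.
#   tau_21 * QB[0][0] + tau_11 * QB[1][0] = -pi0[0]
#   tau_21 * QB[0][1] + tau_11 * QB[1][1] = -pi0[1]
det = QB[0][0] * QB[1][1] - QB[0][1] * QB[1][0]
tau_21 = (-pi0[0] * QB[1][1] + pi0[1] * QB[1][0]) / det
tau_11 = (-pi0[1] * QB[0][0] + pi0[0] * QB[0][1]) / det

mttf = tau_21 + tau_11
print("mean time in (2,1):", tau_21)
print("mean time in (1,1):", tau_11)
print("MTTF (hours):", mttf)
```

The file-server repair rate does not appear because a file-server failure is absorbing in this model; only workstation repair from state (1,1) extends the time to failure.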

96
Markov Reliability Model without Repair

Assume that neither workstations nor file-server
is repairable

97
Markov Reliability Model without Repair
(Continued)
States (0,1), (1,0) and (2,0) become absorbing
states
100
Markov Reliability Model without Repair
(Continued)