Title: Probability and Statistics with Reliability, Queuing and Computer Science Applications: Chapter 8 on Continuous-Time Markov Chains Kishor Trivedi
1 Probability and Statistics with Reliability, Queuing and Computer Science Applications
Chapter 8 on Continuous-Time Markov Chains
Kishor Trivedi
2 Non-State Space Models
- Recall that non-state-space models like RBDs and FTs can easily be formulated and (assuming statistical independence) solved for system reliability, system availability and system MTTF
- Each component can have attached to it
  - A probability of failure
  - A failure rate
  - A distribution of time to failure
  - Steady-state and instantaneous unavailability
3 Markov chain
- To model complex interactions between components, use other kinds of models like Markov chains or, more generally, state space models.
- Many examples of dependencies among system components have been observed in practice and captured by Markov models.
4MARKOV CHAINS
- State-space based model
- States represent various conditions of the system
- Transitions between states indicate occurrences
of events
5 State-Space-Based Models
- States and labeled state transitions
- State can keep track of
  - Number of functioning resources of each type
  - States of recovery for each failed resource
  - Number of tasks of each type waiting at each resource
  - Allocation of resources to tasks
- A transition
  - Can occur from any state to any other state
  - Can represent a simple or a compound event
6 State-Space-Based Model (Continued)
- Drawn as a directed graph
- Transition label
  - Probability: homogeneous discrete-time Markov chain (DTMC)
  - Rate: homogeneous continuous-time Markov chain (CTMC)
  - Time-dependent rate: non-homogeneous CTMC
  - Distribution function: semi-Markov process (SMP)
  - Two distribution functions: Markov regenerative process (MRGP)
7 MARKOV CHAINS (Continued)
- For continuous-time Markov chains (CTMCs) the time variable associated with the system evolution is continuous
- We will mean a CTMC whenever we speak of a Markov model (chain)
8Chapter 8
- Continuous Time Markov Chains
9 Formal Definition
- A discrete-state continuous-time stochastic process $\{X(t),\ t \ge 0\}$ is called a Markov chain if, for $t_0 < t_1 < t_2 < \cdots < t_n < t$, the conditional pmf satisfies the following Markov property:
  $$P[X(t) = x \mid X(t_n) = x_n, X(t_{n-1}) = x_{n-1}, \ldots, X(t_0) = x_0] = P[X(t) = x \mid X(t_n) = x_n]$$
- A CTMC is characterized by state changes that can occur at any arbitrary time
- Index space is continuous.
- The state space is discrete valued.
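10 Continuous Time Markov Chain (CTMC)
- A CTMC can be completely described by
  - Initial state probability vector for $X(t_0)$: $P[X(t_0) = k]$, $k = 0, 1, 2, \ldots$
  - Transition probability functions (over an interval): $p_{ij}(v,t) = P[X(t) = j \mid X(v) = i]$, $0 \le v \le t$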
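11 pmf of X(t)
- Using the theorem of total probability,
  $$\pi_j(t) = P[X(t) = j] = \sum_i P[X(t) = j \mid X(v) = i]\, P[X(v) = i] = \sum_i p_{ij}(v,t)\, \pi_i(v)$$
- If v = 0 in the above equation, we get
  $$\pi_j(t) = \sum_i p_{ij}(0,t)\, \pi_i(0)$$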
12 Homogeneous CTMCs
- $\{X(t),\ t \ge 0\}$ is a (time-)homogeneous CTMC iff the transition probabilities depend only on the elapsed time: $p_{ij}(v,t) = p_{ij}(t - v)$
- Or, the conditional pmf satisfies $p_{ij}(t) = P[X(t+u) = j \mid X(u) = i]$ for any $u \ge 0$
- A CTMC is said to be irreducible if every state can be reached from every other state with non-zero probability.
- A state is said to be absorbing if no other state can be reached from it with non-zero probability.
- The notions of transient, recurrent non-null and recurrent null states are the same as in a DTMC. There is no notion of periodicity in a CTMC, however.
13 CTMC Dynamics
Chapman-Kolmogorov Equation
- $$p_{ij}(v,t) = \sum_k p_{ik}(v,u)\, p_{kj}(u,t), \quad 0 \le v < u < t$$
- Note that these transition probabilities are functions of elapsed time and not of the number of elapsed steps
- The direct use of this equation is difficult, unlike the case of a DTMC, where we could anchor on one-step transition probabilities
- Hence the notion of rates of transitions, which follows next
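14 Transition Rates
- Define the rates (probabilities per unit time)
  - Net rate out of state j at time t: $q_j(t) = \lim_{h \to 0} \frac{1 - p_{jj}(t, t+h)}{h}$
  - The rate from state i to state j at time t: $q_{ij}(t) = \lim_{h \to 0} \frac{p_{ij}(t, t+h)}{h}, \quad i \ne j$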
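15 Kolmogorov Differential Equation
- The transition probabilities and transition rates are related through the Chapman-Kolmogorov equation:
  $$p_{ij}(v, t+h) - p_{ij}(v,t) = \sum_{k \ne j} p_{ik}(v,t)\, p_{kj}(t, t+h) - p_{ij}(v,t)\,[1 - p_{jj}(t, t+h)]$$
- Dividing both sides by h and taking the limit $h \to 0$ gives Kolmogorov's forward equation, where $q_{kj}(t)$ is the rate from state k to state j and $q_j(t)$ is the net rate out of state j:
  $$\frac{\partial p_{ij}(v,t)}{\partial t} = \sum_{k \ne j} p_{ik}(v,t)\, q_{kj}(t) - p_{ij}(v,t)\, q_j(t)$$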
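16 Kolmogorov Differential Equation (contd.)
- Kolmogorov's backward equation,
  $$\frac{\partial p_{ij}(v,t)}{\partial v} = p_{ij}(v,t)\, q_i(v) - \sum_{k \ne i} q_{ik}(v)\, p_{kj}(v,t)$$
- Writing these equations in matrix form, with $P(v,t) = [p_{ij}(v,t)]$, $Q(t) = [q_{ij}(t)]$ and $q_{jj}(t) = -q_j(t)$,
  $$\frac{\partial P(v,t)}{\partial t} = P(v,t)\, Q(t), \qquad \frac{\partial P(v,t)}{\partial v} = -Q(v)\, P(v,t)$$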
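17 Homogeneous CTMC
- Specializing to a homogeneous CTMC (HCTMC), where the rates do not depend on t, the Kolmogorov differential equation becomes
  $$\frac{d\pi_j(t)}{dt} = \sum_{k \ne j} q_{kj}\, \pi_k(t) - q_j\, \pi_j(t)$$
- In matrix form, with $q_{jj} = -q_j$,
  $$\frac{d\pi(t)}{dt} = \pi(t)\, Q$$
  (Matrix Q is called the infinitesimal generator matrix, or simply the generator matrix.)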
18 CTMC Steady-state Solution
- The steady-state solution of a CTMC is obtained by solving the following balance equations:
  $$\pi Q = 0, \qquad \sum_j \pi_j = 1$$
- Irreducible CTMCs with all states recurrent non-null have positive steady-state $\pi_j$ values that are unique and independent of the initial probability vector. All states of a finite irreducible CTMC are recurrent non-null.
- Measures of interest may be computed by assigning reward rates $r_j$ to states and computing the expected steady-state reward rate, $\sum_j r_j\, \pi_j$
19 CTMC Measures
- Measures of interest may be computed by assigning reward rates $r_j$ to states and computing the expected reward rate at time t: $E[Z(t)] = \sum_j r_j\, \pi_j(t)$
- Expected accumulated reward (over an interval of time): $E[Y(t)] = \sum_j r_j\, L_j(t)$
- $L_j(t)$ is the expected time spent in state j during (0,t): $L_j(t) = \int_0^t \pi_j(x)\, dx$
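As a hedged illustration of reward-based measures, the sketch below weights state probabilities (or expected times in state) by reward rates. The state names, probabilities and reward rates are made-up values for illustration, not figures from the slides.

```python
def expected_reward_rate(reward_rates, state_probs):
    """Return sum_j r_j * pi_j for dicts keyed by state."""
    return sum(reward_rates[s] * state_probs[s] for s in state_probs)


def expected_accumulated_reward(reward_rates, time_in_state):
    """Return sum_j r_j * L_j(t), with L_j(t) the expected time in state j."""
    return sum(reward_rates[s] * time_in_state[s] for s in time_in_state)


# Hypothetical 3-state example: reward rate 1 in up states and 0 in the
# down state, so the expected reward rate equals the availability.
rewards = {"2 up": 1.0, "1 up": 1.0, "down": 0.0}
probs = {"2 up": 0.90, "1 up": 0.09, "down": 0.01}
availability = expected_reward_rate(rewards, probs)
print(round(availability, 6))
```

Choosing other reward rates (for example, the number of working units per state) yields other measures from the same state probabilities.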
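20 Markov Availability Model
21 2-State Markov Availability Model
- 1) Steady-state balance equations for each state, with failure rate $\lambda$ (state 1 to state 0) and repair rate $\mu$ (state 0 to state 1)
  - Rate of flow IN = rate of flow OUT
  - State 1 (up): $\mu \pi_0 = \lambda \pi_1$
  - State 0 (down): $\lambda \pi_1 = \mu \pi_0$
- 2 unknowns, 2 equations, but there is only one independent equation.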
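22 2-State Markov Availability Model (Continued)
- Need an additional equation: $\pi_0 + \pi_1 = 1$
- Steady-state availability, with failure rate $\lambda$ and repair rate $\mu$: $A_{ss} = \pi_1 = \frac{\mu}{\lambda + \mu}$
- Downtime in minutes per year = $(1 - A_{ss}) \times 8760 \times 60$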
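A minimal sketch of this steady-state calculation, assuming a 2-state model with failure rate `lam` (up to down) and repair rate `mu` (down to up). The numeric rates are hypothetical and chosen only for illustration.

```python
MINUTES_PER_YEAR = 8760 * 60  # 8760 hours per year, 60 minutes per hour


def steady_state_availability(lam, mu):
    """Steady-state probability of the up state: mu / (lam + mu)."""
    return mu / (lam + mu)


def downtime_minutes_per_year(availability):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical rates per hour: MTTF of 10000 hours, mean repair time of 2 hours.
lam = 1.0 / 10000.0
mu = 1.0 / 2.0
a_ss = steady_state_availability(lam, mu)
print(f"availability = {a_ss:.6f}")
print(f"downtime = {downtime_minutes_per_year(a_ss):.1f} minutes per year")
```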
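23 2-State Markov Availability Model (Continued)
- 2) Transient availability for each state
  - Rate of buildup = rate of flow IN - rate of flow OUT:
    $$\frac{d\pi_1(t)}{dt} = \mu\, \pi_0(t) - \lambda\, \pi_1(t)$$
- This equation can be solved, assuming $\pi_1(0) = 1$, to obtain
  $$\pi_1(t) = A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu) t}$$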
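The transient solution of the 2-state model can be checked numerically. The sketch below assumes the standard closed form A(t) = mu/(lam+mu) + lam/(lam+mu) exp(-(lam+mu) t) for a system that starts in the up state, and compares it with a crude forward-Euler integration of the buildup equation. The rates used in the comparison are hypothetical.

```python
import math


def transient_availability(t, lam, mu):
    """Closed-form A(t) for the 2-state model, starting in the up state."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def euler_availability(t, lam, mu, steps=100000):
    """Integrate d(pi1)/dt = mu*(1 - pi1) - lam*pi1 from pi1(0) = 1."""
    h = t / steps
    pi1 = 1.0
    for _ in range(steps):
        pi1 += h * (mu * (1.0 - pi1) - lam * pi1)
    return pi1


# Hypothetical rates: lam = 1, mu = 3 (per unit time).
print(transient_availability(1.0, 1.0, 3.0))
print(euler_availability(1.0, 1.0, 3.0))
```

As t grows, A(t) decays from 1 toward the steady-state value mu/(lam+mu).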
24 2-State Markov Availability Model (Continued)
- 3)
- 4) Steady State Availability
25 Markov availability model
- Assume we have a two-component parallel redundant system with repair rate μ.
- Assume that the failure rate of both the components is λ.
- When both the components have failed, the system is considered to have failed.
26 Markov availability model (Continued)
- Let the number of properly functioning components be the state of the system. The state space is {0, 1, 2}, where 0 is the system down state.
- We wish to examine effects of shared vs. non-shared repair.
27 Markov availability model (Continued)
[Figure: two state diagrams over states 2, 1, 0, one for non-shared (independent) repair and one for shared repair]
28 Markov availability model (Continued)
- Note: the non-shared case can be modeled and solved using an RBD or a fault tree, but the shared case needs the use of Markov chains.
29 Steady-state balance equations
- For any state
  - Rate of flow in = Rate of flow out
- Consider the shared case
  - $\pi_i$: steady-state probability that the system is in state i
30 Steady-state balance equations (Continued)
31 Steady-state balance equations (Continued)
- Steady-state unavailability = $\pi_0$ = 1 - A_shared
- Similarly for the non-shared case,
  - steady-state unavailability = 1 - A_non-shared
- Downtime in minutes per year = (1 - A) × 8760 × 60
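The shared and non-shared cases can be compared with a short sketch. It assumes the usual birth-death structure for this example (failure rate 2λ out of state 2 and λ out of state 1, repair rate μ out of state 1, and repair rate μ out of state 0 for shared repair versus 2μ for non-shared repair). These transition rates are an assumption, since the state diagrams did not survive in the transcript, and the numeric rates are hypothetical.

```python
MINUTES_PER_YEAR = 8760 * 60


def unavailability_shared(lam, mu):
    """pi_0 with one repair facility: pi_1 = (2r) pi_2, pi_0 = r pi_1, r = lam/mu."""
    r = lam / mu
    return 2 * r * r / (1 + 2 * r + 2 * r * r)


def unavailability_non_shared(lam, mu):
    """pi_0 with independent repair: pi_1 = (2r) pi_2, pi_0 = (r/2) pi_1."""
    r = lam / mu
    return r * r / (1 + r) ** 2


# Hypothetical rates per hour.
lam, mu = 0.001, 0.5
for name, u in (("shared", unavailability_shared(lam, mu)),
                ("non-shared", unavailability_non_shared(lam, mu))):
    print(f"{name}: unavailability = {u:.3e}, "
          f"downtime = {u * MINUTES_PER_YEAR:.2f} min/year")
```

Under these assumptions the non-shared result equals (λ/(λ+μ))², the product of two independent component unavailabilities, which is why that case can also be handled by an RBD or fault tree.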
32Steady-state balance equations
33 A larger example
- Return to the 2 control and 3 voice channels example and assume that the control channel failure rate is λc and the voice channel failure rate is λv.
- Repair rates are μc and μv, respectively. Assuming a single shared repair facility and control channel having preemptive repair priority over voice channels, draw the state diagram of a Markov availability model. Using SHARPE GUI, solve the Markov chain for steady-state and instantaneous availability.
36 A Workstations-Fileserver Example
- Computing system consisting of
  - A file-server
  - Two workstations
  - Computing network connecting them
- System operational as long as
  - One of the workstations, and
  - The file-server are operational
- Computer network is assumed to be fault-free
37The WFS Example
38 Markov Chain for WFS Example
- Assuming exponentially distributed times to failure
  - λw: failure rate of workstation
  - λf: failure rate of file-server
- Assume that components are repairable
  - μw: repair rate of workstation
  - μf: repair rate of file-server
- File-server has (preemptive) priority for repair over workstations (such repair priority cannot be captured by non-state-space models)
39 Markov Availability Model for WFS
Since every state is reachable from every other state, the CTMC is irreducible. Furthermore, all states are positive recurrent.
40 Markov Availability Model for WFS (Continued)
- In this figure, the label (i,j) of each state is interpreted as follows: i represents the number of workstations that are still functioning, and j is 1 or 0 depending on whether the file-server is up or down, respectively.
41 Markov Model
- Let $\{X(t),\ t \ge 0\}$ represent a finite-state Continuous Time Markov Chain (CTMC) with state space Ω.
- Infinitesimal Generator Matrix $Q = [q_{ij}]$
  - $q_{ij}$ ($i \ne j$): transition rate from state i to state j
  - $q_{ii} = -q_i = -\sum_{j \ne i} q_{ij}$, the diagonal element
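The definition of the generator matrix translates directly into code. The helper below is a hypothetical sketch (the function name and the dictionary format for rates are assumptions): it fills in the off-diagonal rates and then sets each diagonal element to the negative of its row sum.

```python
def build_generator(states, rates):
    """Build Q from off-diagonal rates given as {(source, target): rate}."""
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        Q[index[src]][index[dst]] = rate
    for i in range(n):
        # Diagonal element: q_ii = -(sum of the rates leaving state i).
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


# Hypothetical 2-state example: state 1 (up) fails at rate 2.0,
# state 0 (down) is repaired at rate 5.0.
Q = build_generator([1, 0], {(1, 0): 2.0, (0, 1): 5.0})
for row in Q:
    print(row)
```

Every row of a generator matrix sums to zero, which gives a quick sanity check on any hand-built Q.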
42 Markov Availability Model for WFS (Continued)
- For the example problem, with the states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0), the Q matrix is given by
Q
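44 Markov Model (steady-state)
- π: steady-state probability vector, obtained by solving $\pi Q = 0$ together with $\sum_i \pi_i = 1$
- These are called steady-state balance equations: rate of flow in = rate of flow out
- After solving for π, obtain the steady-state availability as the total probability of the up states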
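A minimal sketch of solving the balance equations numerically, using plain Gaussian elimination. The WFS transition structure and the numeric rates below are assumptions made for illustration (workstations keep failing while the file-server is down, and a failed workstation is repaired only when the file-server is up); the slide's own Q matrix is not available in this transcript.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(Q)
    # Work with the transpose so the unknowns form a column vector.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # Replace the last (redundant) balance equation by the normalization.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


# Assumed rates per hour (hypothetical values).
lw, lf, mw, mf = 0.0001, 0.00005, 1.0, 0.5
# States ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0).
Q = [
    [-(2 * lw + lf), lf, 2 * lw, 0.0, 0.0, 0.0],
    [mf, -(mf + 2 * lw), 0.0, 2 * lw, 0.0, 0.0],
    [mw, 0.0, -(mw + lf + lw), lf, lw, 0.0],
    [0.0, 0.0, mf, -(mf + lw), 0.0, lw],
    [0.0, 0.0, mw, 0.0, -(mw + lf), lf],
    [0.0, 0.0, 0.0, 0.0, mf, -mf],
]
pi = steady_state(Q)
availability = pi[0] + pi[2]  # up states are (2,1) and (1,1)
print(f"steady-state availability = {availability:.6f}")
```

Replacing one balance equation with the normalization condition is the usual way to remove the redundancy among the balance equations.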
45 Markov Model (transient)
- π(t): transient state probability vector
- π(0): initial probability vector of the CTMC
- Transient behavior described by the Kolmogorov differential equation: $\frac{d\pi(t)}{dt} = \pi(t)\, Q$
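One common way to solve the Kolmogorov differential equation numerically is uniformization (also called randomization). The sketch below is a simple, unoptimized version under the assumption that q·t is moderate, so that exp(-q t) does not underflow; it is not necessarily the method used by SHARPE. The function name and tolerance are illustrative choices.

```python
import math


def transient_probabilities(Q, initial, t, tol=1e-12, max_terms=100000):
    """Approximate the state probabilities at time t by uniformization."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))  # uniformization rate
    if q == 0.0 or t == 0.0:
        return list(initial)
    # Transition matrix of the uniformized DTMC: P = I + Q / q.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)  # Poisson probability of 0 jumps
    vec = list(initial)
    result = [weight * v for v in vec]
    accumulated = weight
    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k  # Poisson probability of k jumps
        accumulated += weight
        result = [r + weight * v for r, v in zip(result, vec)]
    return result


# Hypothetical 2-state chain: failure rate 1, repair rate 3, starting up.
Q = [[-1.0, 1.0], [3.0, -3.0]]
print(transient_probabilities(Q, [1.0, 0.0], 0.5))
```

For this 2-state chain the first component can be compared with the closed form 0.75 + 0.25 exp(-4t).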
46 Markov Availability Model
- We compute the availability of the system. The system is available as long as it is in state (2,1) or (1,1).
- Instantaneous availability of the system: $A(t) = \pi_{(2,1)}(t) + \pi_{(1,1)}(t)$
47 Availability (Continued)
- Interval availability: $A_I(t) = \frac{1}{t} \int_0^t A(x)\, dx$
- Steady-state availability: $A_{ss} = \lim_{t \to \infty} A(t)$
- There are three kinds of availabilities!
  - Instantaneous, interval and steady-state
48 Markov Availability Model (Continued)
$L_{(i,j)}(t)$: expected total time spent in state (i,j) during (0,t)
49Markov Availability Model (Continued)
50 2-component Availability Model with Finite Detection Delay
- 2-component availability model
  - Steady-state availability: A_ss = 1 - π0
- Failure detection stage takes random time, EXP(d)
  - Down states are 0 and 1D, so A_ss = 1 - π0 - π1D
- Therefore, steady-state unavailability U(d) is given by
51 Redundant System with Finite Detection Switchover Time
- After solving the Markov model, we obtain steady-state probabilities
- Can solve in closed-form or using SHARPE
52Closed-form
53 2-component availability model with imperfect coverage
- Coverage factor = c (conditional probability that the fault is correctly handled)
- 1C state is a reboot (down) state.
54 2-component availability model with delay and imperfect coverage
- Model has detection delay and imperfect coverage
- Down states are 0, 1C and 1D.
55 Modeling Software Faults: Operating System Failure
Availability model with hardware and software (OS) redundancy; operational phase; Heisenbugs
Probability and Statistics with Reliability, Queuing and Computer Science Applications (2nd ed.), K. S. Trivedi, John Wiley, 2001.
- Assumptions
  - Hardware failures are permanent and require a repair or replacement action, while OS failures are cleared by a reboot
  - Repair (hardware) and reboot (OS) take place at separate rates
56 Webserver Availability Model with Warm Replication
- Two nodes for hardware redundancy
- Each node has a copy of the webserver (software redundancy, replication)
- Primary node can fail
- Secondary node can fail
- Primary process can fail
- Secondary process can fail
- Failures may have imperfect coverage
- Time delay for fault detection
- Model of a real system developed at Avaya Labs
57 Modeling Software Faults: Application Failure
Availability model with passive redundancy (warm replication) of application; operational phase; Heisenbugs or hardware transients
- Assumptions
  - A web server software that fails at the rate λp, running on a machine that fails at the rate λm
  - Mean time to detect server process failure is 1/δp and the mean time to detect machine failure is 1/δm
  - The mean restart time of a machine is 1/μm
  - The mean restart time of a server is 1/μp
Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance, S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik, Proc. of the 29th Intl. Symp. on Fault-Tolerant Computing (FTCS-29), June 1999.
58 Parameters
- Process MTTF = 10 days (1/λp)
- Node MTTF = 20 days (1/λn)
- Process polling interval = 2 seconds (1/δp)
- Mean process restart time = 30 seconds (1/μp)
- Mean process failover time = 2 minutes (1/μn)
- Switching time with mean 1/μs
- c = 0.95
59 Solution for warm replication
60 Modeling an N+1 Protection System
61 Outline
- Description of the system
- Using a rate approximation
- Using a 3-stage Erlang approximation to a uniform distribution
- Using a Semi-Markov model - approximation method using a 3-stage Erlang distribution
- Using equations of the underlying Semi-Markov Process
- Solutions for the models
62 Description of the system
- N: number of protected units (we use N = 1)
- λ: unit failure rate
- μ: unit restoration rate
- T: deterministic time between routine diagnostics
- c: probability that a protection switch successfully restores service
- d: probability that a failure in the standby unit is detected
64 Using a rate approximation (N = 1)
[Figure: state diagram with states 1 Normal (1+1), 2 Protection Switch Failure, 3 Simplex (1), 4 Failure to Detect Protection Fault, 5 Failed (0); the diagnostic transition has rate 2/T]
Time to diagnostic is exponentially distributed with mean T/2
67 Comparison of probability density functions (pdf), T = 1
68 Comparison of cumulative distribution functions (cdf), T = 1
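The plots themselves are not available here, but the distributions involved in the N+1 protection models can be tabulated. The sketch below assumes the comparison is between the uniform distribution on (0,T), the exponential with mean T/2 (rate 2/T) and the 3-stage Erlang with mean T/2 (stage rate 6/T); which curves the original figures actually showed is an assumption.

```python
import math


def uniform_cdf(t, T):
    """CDF of the uniform distribution on (0, T)."""
    return min(max(t / T, 0.0), 1.0)


def exponential_cdf(t, T):
    """CDF of the exponential distribution with mean T/2 (rate 2/T)."""
    return 1.0 - math.exp(-2.0 * t / T) if t > 0 else 0.0


def erlang3_cdf(t, T):
    """CDF of a 3-stage Erlang with stage rate 6/T, so the mean is T/2."""
    if t <= 0:
        return 0.0
    x = 6.0 * t / T
    return 1.0 - math.exp(-x) * (1.0 + x + x * x / 2.0)


T = 1.0
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:4.2f}  uniform={uniform_cdf(t, T):.3f}  "
          f"exponential={exponential_cdf(t, T):.3f}  "
          f"erlang3={erlang3_cdf(t, T):.3f}")
```

All three distributions share the mean T/2; the Erlang has a smaller variance than the exponential, which is the motivation for using it as a closer stand-in for the uniform.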
69 Using a 3-stage Erlang approximation to a uniform distribution (N = 1)
[Figure: state diagram with states Normal (1+1), Simplex (1), Failure to Detect Protection Fault, Protection Switch Failure and Failed (0); the diagnostic transition passes through intermediate stages s1 and s2, each stage with rate 6/T]
Time to diagnostic is uniformly distributed over (0,T), approximated by a 3-stage Erlang with mean T/2
72 Using a Semi-Markov model - approximation method using an Erlang distribution (N = 1)
E(t) -> 3-stage Erlang distribution with mean T/2, given by $E(t) = 1 - e^{-6t/T} \left[ 1 + \frac{6t}{T} + \frac{1}{2} \left( \frac{6t}{T} \right)^2 \right]$
[Figure: state diagram with states Normal (1+1), Simplex (1), Failure to Detect Protection Fault, Protection Switch Failure and Failed (0); the diagnostic transition is labeled E(t)]
Time to diagnostic is uniformly distributed over (0,T), approximated by a 3-stage Erlang distribution with mean T/2
74 Using Equations of the underlying Semi-Markov Process
- Steady state solution
- One step transition probability matrix, P, of the embedded DTMC
75Using Equations of the underlying Semi-Markov
Process (Continued)
76Using Equations of the underlying Semi-Markov
Process (Continued)
- Time to the next diagnostic is uniformly
distributed over (0,T)
77Using Equations of the underlying Semi-Markov
Process (Continued)
79 Solutions for the models
- Parameter values assumed
  - N = 1
  - c = 0.9
  - d = 0.9
  - λ = 0.0001 / hour
  - μ = 1 / hour
  - T = 1 hour
80 Results obtained
- Steady state availability
  - Probability of being in states Normal, Simplex, or Failure to Detect Protection Fault
- Steady state unavailability
  - Probability of being in states Protection Switch Failure, or Failed (0)
- Average downtime in steady state
  - Steady state unavailability × number of minutes in a year
- Average units available
  - 2 P_Normal + 1 P_Simplex + 1 P_FailureToDetectProtectionFault
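Once the steady-state probabilities of the five states are known, the measures listed above follow by simple sums. The sketch below uses made-up probabilities purely to show the arithmetic; they are not results from any of the models in these slides, and the dictionary keys are hypothetical names.

```python
MINUTES_PER_YEAR = 8760 * 60


def protection_measures(p):
    """Compute the listed measures from steady-state probabilities p."""
    availability = p["Normal"] + p["Simplex"] + p["FailureToDetect"]
    unavailability = p["ProtectionSwitchFailure"] + p["Failed"]
    return {
        "availability": availability,
        "unavailability": unavailability,
        "downtime_min_per_year": unavailability * MINUTES_PER_YEAR,
        "average_units_available": (2 * p["Normal"] + p["Simplex"]
                                    + p["FailureToDetect"]),
    }


# Hypothetical probabilities (they sum to 1), for illustration only.
example = {
    "Normal": 0.95,
    "Simplex": 0.03,
    "FailureToDetect": 0.01,
    "ProtectionSwitchFailure": 0.005,
    "Failed": 0.005,
}
measures = protection_measures(example)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```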
82 Markov Reliability Model
83 Markov reliability model with repair
- Consider the 2-component parallel system (no delay, perfect coverage) but disallow repair from the system down state
- Note that state 0 is now an absorbing state. The state diagram is given in the following figure.
- This reliability model with repair cannot be modeled using a reliability block diagram or a fault tree. We need to resort to Markov chains. (This is a form of dependency, since in order to repair a component you need to know the status of the other component.)
84 Markov reliability model with repair (Continued)
Absorbing state
- The Markov chain has an absorbing state. In the steady state, the system will be in state 0 with probability 1. Hence transient analysis is of interest. States 1 and 2 are transient states.
85 Markov reliability model with repair (Continued)
- Assume that the initial state of the Markov chain is 2, that is, π2(0) = 1, πk(0) = 0 for k = 0, 1.
- Then the system of differential equations is written based on: rate of buildup = rate of flow in - rate of flow out, for each state
86Markov reliability model with repair
(Continued)
87 Markov reliability model with repair (Continued)
- After solving these equations, we get
  - R(t) = π2(t) + π1(t)
- Recalling that $MTTF = \int_0^\infty R(t)\, dt$, we get
  $$MTTF = \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}$$
88 Markov reliability model with repair (Continued)
- Note that the MTTF of the two-component parallel redundant system, in the absence of a repair facility (i.e., μ = 0), would have been equal to the first term, 3 / (2λ), in the above MTTF expression.
- Therefore, the effect of a repair facility is to increase the mean life by μ / (2λ²), or by a factor of $\frac{\mu / (2\lambda^2)}{3 / (2\lambda)} = \frac{\mu}{3\lambda}$.
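The MTTF expression can be cross-checked by computing the mean time spent in each transient state before absorption. The sketch assumes the transition structure implied by the text (state 2 to 1 at rate 2λ, state 1 to 0 at rate λ, state 1 to 2 at rate μ, no repair from state 0) and solves the 2x2 system tau * Q_B = -pi_B(0) by Cramer's rule. The helper names and numeric rates are hypothetical.

```python
def mean_times_before_absorption(QB, initial):
    """Solve tau * QB = -initial for a 2x2 matrix QB."""
    (a, b), (c, d) = QB
    det = a * d - b * c
    r0, r1 = -initial[0], -initial[1]
    tau0 = (r0 * d - r1 * c) / det
    tau1 = (r1 * a - r0 * b) / det
    return tau0, tau1


def mttf_numeric(lam, mu):
    """MTTF from the generator restricted to transient states 2 and 1."""
    QB = [[-2 * lam, 2 * lam],
          [mu, -(lam + mu)]]
    return sum(mean_times_before_absorption(QB, [1.0, 0.0]))


def mttf_closed_form(lam, mu):
    """MTTF = 3/(2 lam) + mu/(2 lam^2)."""
    return 3.0 / (2.0 * lam) + mu / (2.0 * lam ** 2)


lam, mu = 0.001, 0.1  # hypothetical rates per hour
print(mttf_numeric(lam, mu), mttf_closed_form(lam, mu))
print("without repair:", mttf_closed_form(lam, 0.0))
```

With μ = 0 both functions reduce to 3/(2λ), the no-repair MTTF of the parallel pair.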
89 Markov Reliability Model with Repair (WFS Example)
- Assume that the computer system does not recover if both workstations fail, or if the file-server fails
90 Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states, while (2,1) and (1,1) are transient states. Note: we have made a simplification that, once the CTMC reaches a system failure state, we do not allow any more transitions.
92 Markov Model with Absorbing States
- If we solve for π(2,1)(t) and π(1,1)(t) then
  - R(t) = π(2,1)(t) + π(1,1)(t)
- For a Markov chain with absorbing states
  - A: the set of absorbing states
  - B = Ω - A: the set of remaining states
  - τ(i,j): mean time spent in state (i,j) until absorption
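93 Markov Model with Absorbing States (Continued)
- The mean times τ satisfy $\tau Q_B = -\pi_B(0)$, where $Q_B$ is derived from Q by restricting it to only the states in B, and $\pi_B(0)$ is the initial probability vector restricted to B
- Mean time to absorption (MTTA) is given as $MTTA = \sum_{(i,j) \in B} \tau_{(i,j)}$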
94Markov Reliability Model with Repair (Continued)
95Markov Reliability Model with Repair (Continued)
- Mean time to failure is 19992 hours.
96Markov Reliability Model without Repair
- Assume that neither workstations nor file-server
is repairable
97Markov Reliability Model without Repair
(Continued)
States (0,1), (1,0) and (2,0) become absorbing
states
100Markov Reliability Model without Repair
(Continued)
- Mean time to failure is 9333 hours.
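The two MTTF figures quoted for the WFS example (19992 hours with repair, 9333 hours without) can be reproduced with a short sketch. The rates below are assumptions: the slides as transcribed do not list them, but λw = 0.0001, λf = 0.00005 and μw = 1.0 per hour are consistent with both quoted values. The transient states are (2,1) and (1,1), and μf plays no role because a file-server failure is absorbing.

```python
def mean_time_to_absorption(QB, initial):
    """Sum of the solution of tau * QB = -initial for a 2x2 matrix QB."""
    (a, b), (c, d) = QB
    det = a * d - b * c
    r0, r1 = -initial[0], -initial[1]
    tau0 = (r0 * d - r1 * c) / det
    tau1 = (r1 * a - r0 * b) / det
    return tau0 + tau1


def wfs_mttf(lam_w, lam_f, mu_w):
    """MTTF of the WFS system, starting in state (2,1)."""
    # Rows and columns are ordered as the transient states (2,1), (1,1).
    QB = [[-(2 * lam_w + lam_f), 2 * lam_w],
          [mu_w, -(lam_w + lam_f + mu_w)]]
    return mean_time_to_absorption(QB, [1.0, 0.0])


# Assumed rates per hour (not given in the transcript).
LAM_W, LAM_F, MU_W = 0.0001, 0.00005, 1.0
print("with repair:   ", round(wfs_mttf(LAM_W, LAM_F, MU_W)), "hours")
print("without repair:", round(wfs_mttf(LAM_W, LAM_F, 0.0)), "hours")
```

Setting the workstation repair rate to zero turns the model with repair into the model without repair, so one function covers both cases.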
101 Markov Reliability Model with Imperfect Coverage
102 Markov model with imperfect coverage
- Next consider a modification of the above example, proposed by Arnold as a model of duplex processors of an electronic switching system. We assume that not all faults are recoverable and that c is the coverage factor, which denotes the conditional probability that the system recovers given that a fault has occurred. The state diagram is now given by the following picture.
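103 Now allow for Imperfect coverage
[Figure: state diagram of the two-component model with coverage factor c]
104 Markov model with imperfect coverage (Continued)
- Assume that the initial state is 2, so that $\pi_2(0) = 1,\ \pi_1(0) = \pi_0(0) = 0$
- Then the system of differential equations is
  $$\frac{d\pi_2(t)}{dt} = -2\lambda c\, \pi_2(t) - 2\lambda (1 - c)\, \pi_2(t) + \mu\, \pi_1(t)$$
  $$\frac{d\pi_1(t)}{dt} = 2\lambda c\, \pi_2(t) - (\lambda + \mu)\, \pi_1(t)$$
  $$\frac{d\pi_0(t)}{dt} = 2\lambda (1 - c)\, \pi_2(t) + \lambda\, \pi_1(t)$$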
105 Markov model with imperfect coverage (Continued)
- After solving the differential equations we obtain
  - R(t) = π2(t) + π1(t)
- From R(t), we can obtain the system MTTF: $MTTF = \frac{\lambda (1 + 2c) + \mu}{2\lambda [\lambda + \mu (1 - c)]}$
- It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor.
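The sensitivity to the coverage factor is easy to see numerically. The sketch below assumes the usual structure of this model (state 2 to 1 at rate 2λc for a covered fault, state 2 to 0 at rate 2λ(1-c) for an uncovered fault, state 1 to 0 at rate λ, state 1 to 2 at rate μ), which gives MTTF = [λ(1+2c) + μ] / (2λ[λ + μ(1-c)]). The transition structure is an assumption, since the state diagram is not in the transcript, and the numeric rates are hypothetical.

```python
def mttf_with_coverage(lam, mu, c):
    """MTTF of the duplex system with coverage factor c (assumed model)."""
    return (lam * (1.0 + 2.0 * c) + mu) / (2.0 * lam * (lam + mu * (1.0 - c)))


lam, mu = 0.0001, 1.0  # hypothetical rates per hour
for c in (1.0, 0.999, 0.99, 0.9, 0.5):
    print(f"c = {c:5.3f}  MTTF = {mttf_with_coverage(lam, mu, c):14.1f} hours")
```

With c = 1 the expression reduces to 3/(2λ) + μ/(2λ²), the perfect-coverage result, and with c = 0 it reduces to 1/(2λ), the mean time to the first failure of either component.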