Probability and Statistics with Reliability, Queuing and Computer Science Applications: Chapter 8 on Continuous-Time Markov Chains (Kishor Trivedi)
1
Probability and Statistics with Reliability, Queuing and Computer Science Applications
Chapter 8 on Continuous-Time Markov Chains
Kishor Trivedi
2
Non-State Space Models
  • Recall that non-state-space models like RBDs and FTs can easily be formulated and (assuming statistical independence) solved for system reliability, system availability and system MTTF
  • Each component can have attached to it:
    • A probability of failure
    • A failure rate
    • A distribution of time to failure
    • Steady-state and instantaneous unavailability

3
Markov chain
  • To model complex interactions between components,
    use other kinds of models like Markov chains or
    more generally state space models.
  • Many examples of dependencies among system
    components have been observed in practice and
    captured by Markov models.

4
MARKOV CHAINS
  • State-space based model
  • States represent various conditions of the system
  • Transitions between states indicate occurrences
    of events

5
State-Space-Based Models
  • States and labeled state transitions
  • The state can keep track of:
    • Number of functioning resources of each type
    • State of recovery for each failed resource
    • Number of tasks of each type waiting at each resource
    • Allocation of resources to tasks
  • A transition:
    • Can occur from any state to any other state
    • Can represent a simple or a compound event

6
State-Space-Based Model (Continued)
  • Drawn as a directed graph
  • Transition label:
    • Probability: homogeneous discrete-time Markov chain (DTMC)
    • Rate: homogeneous continuous-time Markov chain (CTMC)
    • Time-dependent rate: non-homogeneous CTMC
    • Distribution function: semi-Markov process (SMP)
    • Two distribution functions: Markov regenerative process (MRGP)

7
MARKOV CHAINS (Continued)
  • For continuous-time Markov chains (CTMCs), the time variable associated with the system evolution is continuous
  • Whenever we speak of a Markov model (chain), we will mean a CTMC

8
Chapter 8
  • Continuous Time Markov Chains

9
Formal Definition
  • A discrete-state, continuous-time stochastic process {X(t), t >= 0} is called a Markov chain if, for t0 < t1 < t2 < ... < tn < t, the conditional pmf satisfies the Markov property:
    P[X(t) = x | X(tn) = xn, X(tn-1) = xn-1, ..., X(t0) = x0] = P[X(t) = x | X(tn) = xn]
  • A CTMC is characterized by state changes that can occur at any arbitrary time
  • The index space is continuous.
  • The state space is discrete valued.

10
Continuous Time Markov Chain (CTMC)
  • A CTMC can be completely described by:
    • The initial state probability vector for X(t0)
    • The transition probability functions (over an interval): p_ij(v, t) = P[X(t) = j | X(v) = i]

11
pmf of X(t)
  • Using the theorem of total probability: p_j(t) = Σ_i p_i(v) p_ij(v, t)
  • If v = 0 in the above equation, we get p_j(t) = Σ_i p_i(0) p_ij(0, t)

12
Homogeneous CTMCs
  • {X(t)} is a (time-)homogeneous CTMC iff the transition probabilities depend only on the elapsed time: p_ij(v, t) = p_ij(t - v)
  • Or, the conditional pmf satisfies P[X(t) = j | X(v) = i] = P[X(t - v) = j | X(0) = i]
  • A CTMC is said to be irreducible if every state can be reached from every other state with non-zero probability.
  • A state is said to be absorbing if no other state can be reached from it with non-zero probability.
  • The notions of transient, recurrent non-null and recurrent null states are the same as in a DTMC. There is no notion of periodicity in a CTMC, however.

13
CTMC Dynamics
Chapman-Kolmogorov Equation
  • Chapman-Kolmogorov equation: p_ij(v, t) = Σ_k p_ik(v, u) p_kj(u, t), for v <= u <= t
  • Note that these transition probabilities are functions of elapsed time and not of the number of elapsed steps
  • Direct use of this equation is difficult, unlike the DTMC case where we could anchor on one-step transition probabilities
  • Hence the notion of rates of transitions, which follows next

14
Transition Rates
  • Define the rates (probabilities per unit time):
    • q_j(t) = net rate out of state j at time t: q_j(t) = lim_{h→0} [1 - p_jj(t, t+h)] / h
    • q_ij(t) = rate from state i to state j at time t: q_ij(t) = lim_{h→0} p_ij(t, t+h) / h, for i ≠ j

15
Kolmogorov Differential Equation
  • The transition probabilities and transition rates are related by p_jj(t, t+h) = 1 - q_j(t) h + o(h) and p_kj(t, t+h) = q_kj(t) h + o(h) for k ≠ j
  • Dividing both sides by h and taking the limit h → 0 gives Kolmogorov's forward equation:
    ∂p_ij(v, t)/∂t = Σ_{k≠j} p_ik(v, t) q_kj(t) - p_ij(v, t) q_j(t)

16
Kolmogorov Differential Equation (contd.)
  • Kolmogorov's backward equation:
    ∂p_ij(v, t)/∂v = - Σ_k q_ik(v) p_kj(v, t)
  • Writing these equations in matrix form:
    ∂P(v, t)/∂t = P(v, t) Q(t)  (forward),  ∂P(v, t)/∂v = - Q(v) P(v, t)  (backward)

17
Homogeneous CTMC
  • Specializing to a homogeneous CTMC (HCTMC), the Kolmogorov differential equation becomes dp_ij(t)/dt = Σ_{k≠j} p_ik(t) q_kj - p_ij(t) q_j
  • In matrix form, dP(t)/dt = P(t) Q, where the matrix Q = [q_ij] is called the infinitesimal generator matrix (or simply the generator matrix)

18
CTMC Steady-state Solution
  • The steady-state solution of a CTMC is obtained by solving the following balance equations:
    π Q = 0,  Σ_j π_j = 1
  • Irreducible CTMCs with all states recurrent non-null have positive steady-state probabilities π_j that are unique and independent of the initial probability vector. All states of a finite irreducible CTMC are recurrent non-null.
  • Measures of interest may be computed by assigning reward rates to states and computing the expected steady-state reward rate (see the sketch below)
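The recipe above can be illustrated numerically. The following minimal Python/NumPy sketch (not from the slides) solves π Q = 0 together with Σ_j π_j = 1 for a hypothetical 3-state generator and then computes an expected steady-state reward rate; all rates and reward values are made up for illustration.

    import numpy as np

    # Hypothetical 3-state generator matrix (each row sums to zero).
    lam, mu = 0.001, 0.1
    Q = np.array([[-2*lam,        2*lam,  0.0],
                  [    mu, -(lam + mu),   lam],
                  [   0.0,          mu,   -mu]])

    # Solve pi Q = 0 together with sum(pi) = 1 as one least-squares system.
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Expected steady-state reward rate with made-up reward rates r_j.
    r = np.array([1.0, 1.0, 0.0])     # e.g. reward 1 in "up" states, 0 in the "down" state
    print(pi, r @ pi)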

19
CTMC Measures
  • Measures of interest may be computed by assigning reward rates r_j to states and computing the expected reward rate at time t: E[r_X(t)] = Σ_j r_j p_j(t)
  • Expected accumulated reward (over an interval of time): E[Y(t)] = Σ_j r_j L_j(t)
  • L_j(t) = ∫_0^t p_j(u) du is the expected time spent in state j during (0, t)

20
Markov Availability Model
21
2-State Markov Availability Model
  • 1) Steady-state balance equations for each state:
    • Rate of flow IN = rate of flow OUT
    • State 1: μ π_0 = λ π_1
    • State 0: λ π_1 = μ π_0
  • 2 unknowns, 2 equations, but there is only one independent equation.

22
2-State Markov Availability Model(Continued)
  • Need an additional equation: π_0 + π_1 = 1
  • Solving, π_1 = μ / (λ + μ) and π_0 = λ / (λ + μ)
  • Downtime in minutes per year = (1 - A) × 8760 × 60 (a small numeric example follows)
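As a quick numeric illustration (the MTTF and MTTR values below are made up, not from the slides):

    # Illustrative numbers only: MTTF = 1000 h, MTTR = 2 h.
    lam = 1 / 1000.0                  # failure rate per hour
    mu = 1 / 2.0                      # repair rate per hour
    A = mu / (lam + mu)               # steady-state availability pi_1
    print(A, (1 - A) * 8760 * 60)     # availability and downtime in minutes per year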
23
2-State Markov Availability Model(Continued)
  • 2) Transient availability for each state:
    • Rate of buildup = rate of flow IN - rate of flow OUT
    • dπ_1(t)/dt = μ π_0(t) - λ π_1(t)
  • This equation can be solved, assuming π_1(0) = 1, to obtain
    A(t) = π_1(t) = μ/(λ + μ) + [λ/(λ + μ)] e^{-(λ+μ)t}

24
2-State Markov Availability Model(Continued)
  • 3) Limiting availability: A = lim_{t→∞} A(t)
  • 4) Steady-state availability: A = μ / (λ + μ)

25
Markov availability model
  • Assume we have a two-component parallel redundant system with repair rate μ.
  • Assume that the failure rate of both components is λ.
  • When both components have failed, the system is considered to have failed.

26
Markov availability model (Continued)
  • Let the number of properly functioning components be the state of the system. The state space is {0, 1, 2}, where 0 is the system down state.
  • We wish to examine the effects of shared vs. non-shared repair.

27
Markov availability model (Continued)
[State diagrams over states 2, 1, 0: non-shared (independent) repair and shared repair]
28
Markov availability model (Continued)
  • Note: the non-shared case can be modeled and solved using an RBD or a FTREE, but the shared case needs the use of Markov chains.

29
Steady-state balance equations
  • For any state: rate of flow in = rate of flow out
  • Consider the shared repair case; π_i = steady-state probability that the system is in state i
    • State 2: μ π_1 = 2λ π_2
    • State 1: 2λ π_2 + μ π_0 = (λ + μ) π_1
    • State 0: λ π_1 = μ π_0

30
Steady-state balance equations (Continued)
  • Hence π_1 = (2λ/μ) π_2 and π_0 = (λ/μ) π_1 = (2λ²/μ²) π_2
  • Since π_0 + π_1 + π_2 = 1
  • We have π_2 = [1 + 2λ/μ + 2λ²/μ²]^{-1}
  • or A_shared = π_1 + π_2 = 1 - π_0

31
Steady-state balance equations (Continued)
  • Steady-state unavailability = π_0 = 1 - A_shared
  • Similarly for the non-shared case, steady-state unavailability = 1 - A_non-shared
  • Downtime in minutes per year = (1 - A) × 8760 × 60 (a computational sketch follows)
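A minimal Python/NumPy sketch comparing the two repair disciplines; the failure and repair rates are illustrative, and the only modeling assumption beyond the slides is that in the non-shared case both failed components are repaired in parallel (total repair rate 2μ out of state 0).

    import numpy as np

    def steady_state(Q):
        """Solve pi Q = 0 with sum(pi) = 1."""
        n = Q.shape[0]
        A = np.vstack([Q.T, np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return pi

    lam, mu = 0.001, 0.1                       # illustrative failure and repair rates
    # States ordered 2, 1, 0 (number of working components).
    Q_shared = np.array([[-2*lam,        2*lam,  0.0],
                         [    mu, -(lam + mu),   lam],
                         [   0.0,          mu,   -mu]])
    Q_nonshared = np.array([[-2*lam,        2*lam,    0.0],
                            [    mu, -(lam + mu),     lam],
                            [   0.0,        2*mu,  -2*mu]])   # both repair facilities active

    for name, Q in [("shared", Q_shared), ("non-shared", Q_nonshared)]:
        pi = steady_state(Q)
        print(name, "unavailability:", pi[2], "downtime (min/yr):", pi[2] * 8760 * 60)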

32
Steady-state balance equations
33
A larger example
  • Return to the 2 control and 3 voice channels example and assume that the control channel failure rate is λ_c and the voice channel failure rate is λ_v.
  • Repair rates are μ_c and μ_v, respectively. Assuming a single shared repair facility, with the control channel having preemptive repair priority over voice channels, draw the state diagram of a Markov availability model. Using the SHARPE GUI, solve the Markov chain for steady-state and instantaneous availability.

34
(No Transcript)
35
  • WFS Example

36
A Workstations-Fileserver Example
  • Computing system consisting of:
    • A file-server
    • Two workstations
    • A computing network connecting them
  • The system is operational as long as at least one workstation and the file-server are operational
  • The computer network is assumed to be fault-free

37
The WFS Example
38
Markov Chain for WFS Example
  • Assuming exponentially distributed times to failure:
    • λ_w = failure rate of a workstation
    • λ_f = failure rate of the file-server
  • Assume that components are repairable:
    • μ_w = repair rate of a workstation
    • μ_f = repair rate of the file-server
  • The file-server has (preemptive) priority for repair over workstations (such repair priority cannot be captured by non-state-space models)

39
Markov Availability Model for WFS
Since all states are reachable from every other state, the CTMC is irreducible. Furthermore, all states are positive recurrent.
40
Markov Availability Model for WFS (Continued)
  • In this figure, the label (i, j) of each state is interpreted as follows: i represents the number of workstations that are still functioning, and j is 1 or 0 depending on whether the file-server is up or down, respectively.

41
Markov Model
  • Let {X(t), t ≥ 0} represent a finite-state continuous-time Markov chain (CTMC) with state space Ω.
  • Infinitesimal generator matrix Q = [q_ij]:
    • q_ij (i ≠ j) = transition rate from state i to state j
    • q_ii = -q_i = -Σ_{j≠i} q_ij, the diagonal element

42
Markov Availability Model for WFS (Continued)
  • For the example problem, with the states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0), the Q matrix is the 6 × 6 generator whose off-diagonal entries are the failure and repair rates shown in the state diagram and whose diagonal entries make each row sum to zero (a sketch that assembles this matrix follows).
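A sketch (Python/NumPy) of how the 6 × 6 generator can be assembled from the model description. The numeric rates are placeholders, and the construction assumes a single repair facility with preemptive file-server priority and that workstations can still fail while the file-server is down; the actual matrix on the slide may differ in such details.

    import numpy as np

    lw, lf = 1/5000.0, 1/10000.0   # placeholder workstation / file-server failure rates
    mw, mf = 1.0, 0.5              # placeholder repair rates

    states = [(2, 1), (2, 0), (1, 1), (1, 0), (0, 1), (0, 0)]
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((6, 6))

    for (i, j) in states:
        if i > 0:
            Q[idx[(i, j)], idx[(i - 1, j)]] += i * lw   # one of the i workstations fails
        if j == 1:
            Q[idx[(i, j)], idx[(i, 0)]] += lf           # the file-server fails
            if i < 2:
                Q[idx[(i, j)], idx[(i + 1, j)]] += mw   # repair a workstation
        else:
            Q[idx[(i, j)], idx[(i, 1)]] += mf           # file-server repaired first (priority)
    np.fill_diagonal(Q, -Q.sum(axis=1))                 # q_ii = -q_i so each row sums to zero
    print(Q)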
43
(No Transcript)
44
Markov Model (steady-state)
π = steady-state probability vector.
π Q = 0, Σ_i π_i = 1: these are called the steady-state balance equations (rate of flow in = rate of flow out).
After solving for π, obtain the steady-state availability A = π_(2,1) + π_(1,1).
45
Markov Model (transient)
  • p(t) = transient state probability vector of the CTMC
  • p(0) = initial probability vector of the CTMC
  • Transient behavior is described by the Kolmogorov differential equation dp(t)/dt = p(t) Q, with p(0) given (a computational sketch follows)
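For a homogeneous CTMC the solution is p(t) = p(0) e^{Qt}. A self-contained Python sketch with an illustrative 2-state up/down generator (SciPy's matrix exponential does the work; the rates are made up):

    import numpy as np
    from scipy.linalg import expm

    lam, mu = 0.001, 0.5                    # illustrative failure and repair rates
    Q = np.array([[-lam, lam],
                  [  mu, -mu]])
    p0 = np.array([1.0, 0.0])               # start in the up state
    for t in (1.0, 10.0, 100.0):
        p_t = p0 @ expm(Q * t)              # p(t) = p(0) exp(Q t)
        print(t, "A(t) =", p_t[0])          # instantaneous availability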

46
Markov Availability Model
  • We compute the availability of the system. The system is available as long as it is in state (2,1) or (1,1).
  • Instantaneous availability of the system: A(t) = p_(2,1)(t) + p_(1,1)(t)

47
Availability (Continued)
  • Interval availability: A_I(t) = (1/t) ∫_0^t A(u) du
  • Steady-state availability: A = lim_{t→∞} A(t)
  • There are three kinds of availability!
  • Instantaneous, interval and steady-state

48
Markov Availability Model (Continued)
L_(i,j)(t) = expected total time spent in state (i, j) during (0, t)
  • Interval availability: A_I(t) = [L_(2,1)(t) + L_(1,1)(t)] / t (a computational sketch follows)
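L_j(t) can be obtained by numerically integrating p_j(u); a small sketch on the same illustrative 2-state model used above:

    import numpy as np
    from scipy.linalg import expm
    from scipy.integrate import quad

    lam, mu = 0.001, 0.5
    Q = np.array([[-lam, lam], [mu, -mu]])
    p0 = np.array([1.0, 0.0])

    def A_inst(u):                          # instantaneous availability A(u)
        return (p0 @ expm(Q * u))[0]

    t = 100.0
    L_up, _ = quad(A_inst, 0.0, t)          # L_up(t): expected up time in (0, t)
    print("interval availability:", L_up / t)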

49
Markov Availability Model (Continued)
50
2-Component Availability Model with Finite Detection Delay
  • 2-component availability model
  • Steady-state availability A_ss = 1 - π_0
  • The failure detection stage takes a random time, EXP(d)
  • Down states are 0 and 1D ⇒ A_ss = 1 - π_0 - π_1D
  • Therefore, the steady-state unavailability U(d) = π_0 + π_1D

51
Redundant System with Finite Detection and Switchover Time
  • After solving the Markov model, we obtain the steady-state probabilities
  • Can solve in closed form or using SHARPE

52
Closed-form
53
2-Component Availability Model with Imperfect Coverage
  • Coverage factor c = conditional probability that the fault is correctly handled
  • State 1C is a reboot (down) state.

54
2-Component Availability Model with Detection Delay and Imperfect Coverage
  • The model has both detection delay and imperfect coverage
  • Down states are 0, 1C and 1D.

55
Modeling Software Faults: Operating System Failure
Availability model with hardware and software (OS) redundancy, operational-phase Heisenbugs. [Probability and Statistics with Reliability, Queuing and Computer Science Applications (2nd ed.), K. S. Trivedi, John Wiley, 2001.]
  • Assumptions:
    • Hardware failures are permanent and require a repair or replacement action, while OS failures are cleared by a reboot
    • Repair or reboot takes place at a constant rate for the hardware and for the OS, respectively

56
Webserver Availability Model with Warm Replication
  • Two nodes for hardware redundancy
  • Each node has a copy of the webserver (software redundancy via replication)
  • Primary node can fail
  • Secondary node can fail
  • Primary process can fail
  • Secondary process can fail
  • Failures may have imperfect coverage
  • Time delay for fault detection
  • Model of a real system developed at Avaya Labs

57
Modeling Software Faults: Application Failure
Availability model with passive redundancy (warm replication) of the application, for operational-phase Heisenbugs or hardware transients.
  • Assumptions:
    • A web-server software process that fails at rate λ_p, running on a machine that fails at rate λ_m
    • Mean time to detect a server process failure is 1/δ_p, and mean time to detect a machine failure is 1/δ_m
    • Mean restart time of a machine is 1/μ_m
    • Mean restart time of a server process is 1/μ_p

Reference: "Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance", S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yagnik, Proc. of the 29th Intl. Symp. on Fault-Tolerant Computing (FTCS-29), June 1999.
58
Parameters
  • Process MTTF = 10 days (1/λ_p)
  • Node MTTF = 20 days (1/λ_n)
  • Process polling interval = 2 seconds (1/δ_p)
  • Mean process restart time = 30 seconds (1/μ_p)
  • Mean process failover time = 2 minutes (1/μ_n)
  • Switchover time with mean 1/μ_s
  • c = 0.95 (the parameters are converted to rates in the sketch below)
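Converted to rates per hour (the Greek-letter names are the conventional choices assumed above, not guaranteed to match the original slide's symbols):

    # Convert the listed mean times to rates per hour.
    lam_p = 1 / (10 * 24)        # process failure rate, MTTF = 10 days
    lam_n = 1 / (20 * 24)        # node failure rate, MTTF = 20 days
    delta_p = 3600 / 2           # process failure detection (polling) rate, interval = 2 s
    mu_p = 3600 / 30             # process restart rate, mean = 30 s
    mu_n = 60 / 2                # process failover rate, mean = 2 min
    c = 0.95                     # coverage factor
    print(lam_p, lam_n, delta_p, mu_p, mu_n, c)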

59
Solution for warm replication
60
Modeling an N+1 Protection System
61
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

62
Description of the system
  • N = number of protected units (we use N = 1)
  • λ = unit failure rate
  • μ = unit restoration rate
  • T = deterministic time between routine diagnostics
  • c = probability that a protection switch successfully restores service
  • d = probability that a failure in the standby unit is detected

63
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

64
Using a rate approximation (N+1)
[State diagram for the rate approximation, with states numbered 1-5: Normal (1+1), Protection Switch Failure, Simplex (1), Failure to Detect Protection Fault, Failed (0). Transition labels include (1-c)λ, (1-d)λ, (cd)λ, 2λ, λ, μ and 2/T; the time to the next diagnostic is approximated as exponentially distributed with mean T/2.]
65
(No Transcript)
66
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

67
Comparison of probability density functions (pdf), T = 1
68
Comparison of cumulative distribution functions (cdf), T = 1
69
Using a 3-stage Erlang approximation to a uniform distribution (N+1)
[State diagram: as in the rate approximation, with states Normal (1+1), Protection Switch Failure, Simplex (1), Failure to Detect Protection Fault and Failed (0), plus two intermediate Erlang stages s1 and s2, each left at rate 6/T. Transition labels include (1-c)λ, (1-d)λ, (cd)λ, 2λ, λ and μ. The time to the next diagnostic is uniformly distributed over (0, T) and is approximated by a 3-stage Erlang distribution with mean T/2.]
70
(No Transcript)
71
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

72
Using a Semi-Markov model - approximation method using an Erlang distribution (N+1)
E(t) = the 3-stage Erlang distribution with mean T/2, given by E(t) = 1 - e^{-6t/T} [1 + 6t/T + (6t/T)²/2]
[State diagram: states Normal (1+1), Protection Switch Failure, Simplex (1), Failure to Detect Protection Fault and Failed (0); transition labels include (1-c)λ, (1-d)λ, (cd)λ, 2λ, λ and μ. The time to the next diagnostic is uniformly distributed over (0, T) and is approximated by the 3-stage Erlang distribution E(t) with mean T/2.]
73
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

74
Using Equations of the underlying Semi-Markov
Process
  • Steady-state solution: solve ν = ν P for the embedded DTMC and weight each state by its mean sojourn time h_i, i.e., π_i = ν_i h_i / Σ_j ν_j h_j (a sketch follows)
  • P = one-step transition probability matrix of the embedded DTMC
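A generic sketch of that recipe in Python/NumPy; the embedded-DTMC matrix P and the mean sojourn times h below are placeholders, not the actual values of the N+1 protection model.

    import numpy as np

    def smp_steady_state(P, h):
        """Steady-state probabilities of an SMP from the embedded DTMC matrix P
        and the mean sojourn times h: solve nu = nu P, then weight by h."""
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1.0
        nu, *_ = np.linalg.lstsq(A, b, rcond=None)
        w = nu * h
        return w / w.sum()

    # Placeholder 3-state example.
    P = np.array([[0.0, 0.7, 0.3],
                  [0.9, 0.0, 0.1],
                  [1.0, 0.0, 0.0]])
    h = np.array([2.0, 0.5, 1.0])            # mean sojourn times
    print(smp_steady_state(P, h))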

75
Using Equations of the underlying Semi-Markov
Process (Continued)
76
Using Equations of the underlying Semi-Markov
Process (Continued)
  • Time to the next diagnostic is uniformly
    distributed over (0,T)

77
Using Equations of the underlying Semi-Markov
Process (Continued)
78
Outline
  • Description of the system
  • Using a rate approximation
  • Using a 3-stage Erlang approximation to a uniform
    distribution
  • Using a Semi-Markov model - approximation method
    using a 3-stage Erlang distribution
  • Using equations of the underlying Semi-Markov
    Process
  • Solutions for the models

79
Solutions for the models
  • Parameter values assumed:
    • N = 1
    • c = 0.9
    • d = 0.9
    • λ = 0.0001 / hour
    • μ = 1 / hour
    • T = 1 hour

80
Results obtained
  • Steady-state availability:
    • Probability of being in states Normal, Simplex, or Failure to Detect Protection Fault
  • Steady-state unavailability:
    • Probability of being in states Protection Switch Failure or Failed (0)
  • Average downtime in steady state:
    • Steady-state unavailability × number of minutes in a year
  • Average units available:
    • 2 × P(Normal) + 1 × P(Simplex) + 1 × P(Failure to Detect Protection Fault)

81
(No Transcript)
82
Markov Reliability Model
83
Markov reliability model with repair
  • Consider the 2-component parallel system (no detection delay, perfect coverage), but disallow repair from the system-down state
  • Note that state 0 is now an absorbing state. The state diagram is given in the following figure.
  • This reliability model with repair cannot be modeled using a reliability block diagram or a fault tree; we need to resort to Markov chains. (This is a form of dependency: in order to repair a component you need to know the status of the other component.)

84
Markov reliability model with repair (Continued)
Absorbing state
  • The Markov chain has an absorbing state. In the steady state, the system will be in state 0 with probability 1, hence transient analysis is of interest. States 1 and 2 are transient states.

85
Markov reliability model with repair (Continued)
  • Assume that the initial state of the Markov chain is 2, that is, p_2(0) = 1 and p_k(0) = 0 for k = 0, 1.
  • Then the system of differential equations is written based on
    rate of buildup = rate of flow in - rate of flow out
    for each state (a computational sketch follows)
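A sketch of the three equations and their numerical integration (Python/SciPy); the rates are illustrative, and the transition structure follows the state diagram described above (2λ from state 2 to state 1, repair μ from state 1 back to 2, λ from state 1 to the absorbing state 0):

    import numpy as np
    from scipy.integrate import solve_ivp

    lam, mu = 0.001, 0.1            # illustrative failure and repair rates

    def odes(t, p):
        p2, p1, p0 = p
        return [-2*lam*p2 + mu*p1,              # dp2/dt
                 2*lam*p2 - (lam + mu)*p1,      # dp1/dt
                 lam*p1]                        # dp0/dt (state 0 is absorbing)

    sol = solve_ivp(odes, (0.0, 1000.0), [1.0, 0.0, 0.0], rtol=1e-8)
    p2, p1, _ = sol.y[:, -1]
    print("R(1000) =", p2 + p1)     # reliability at t = 1000 h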

86
Markov reliability model with repair
(Continued)
87
Markov reliability model with repair
(Continued)
  • After solving these equations, we get R(t) = p_2(t) + p_1(t)
  • Recalling that MTTF = ∫_0^∞ R(t) dt, we get MTTF = 3/(2λ) + μ/(2λ²)

88
Markov reliability model with repair
(Continued)
  • Note that the MTTF of the two-component parallel redundant system in the absence of a repair facility (i.e., μ = 0) would have been equal to the first term, 3/(2λ), in the above expression.
  • Therefore, the effect of a repair facility is to increase the mean life by μ/(2λ²), or by a factor of 1 + μ/(3λ).

89
Markov Reliability Model with Repair (WFS Example)
  • Assume that the computer system does not recover if both workstations fail, or if the file-server fails

90
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states, while (2,1) and (1,1) are transient states. Note: we have made the simplification that, once the CTMC reaches a system failure state, we do not allow any further transitions.
91
(No Transcript)
92
Markov Model with Absorbing States
  • If we solve for p_(2,1)(t) and p_(1,1)(t), then R(t) = p_(2,1)(t) + p_(1,1)(t)
  • For a Markov chain with absorbing states:
    • A = the set of absorbing states
    • B = Ω - A = the set of remaining (transient) states
    • τ_(i,j) = mean time spent in state (i, j) until absorption

93
Markov Model with Absorbing States (Continued)
Q_B is derived from Q by restricting it to only the states in B.
The vector τ of mean times spent in the transient states satisfies τ Q_B = -p_B(0), and the mean time to absorption is MTTA = Σ_{i ∈ B} τ_i (a computational sketch follows).
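A small Python/NumPy sketch of this computation for the 2-component reliability model with repair discussed earlier (transient states {2, 1}); with these illustrative rates the result matches the closed-form 3/(2λ) + μ/(2λ²):

    import numpy as np

    lam, mu = 0.001, 0.1
    # Generator restricted to the transient states B = {2, 1}.
    QB = np.array([[-2*lam,        2*lam],
                   [    mu, -(lam + mu)]])
    pB0 = np.array([1.0, 0.0])                 # initial probabilities of the states in B

    tau = np.linalg.solve(QB.T, -pB0)          # solves tau QB = -pB(0)
    print("expected time in each state:", tau, "MTTA =", tau.sum())   # 51500 h here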
94
Markov Reliability Model with Repair (Continued)
95
Markov Reliability Model with Repair (Continued)
  • Mean time to failure is 19992 hours.

96
Markov Reliability Model without Repair
  • Assume that neither the workstations nor the file-server is repairable

97
Markov Reliability Model without Repair
(Continued)
States (0,1), (1,0) and (2,0) become absorbing
states
98
(No Transcript)
99
(No Transcript)
100
Markov Reliability Model without Repair
(Continued)

  • Mean time to failure is 9333 hours.

101
Markov Reliability Model with Imperfect Coverage
102
Markov model with imperfect coverage
  • Next consider a modification of the above example, proposed by Arnold as a model of duplex processors of an electronic switching system. We assume that not all faults are recoverable and that c is the coverage factor, which denotes the conditional probability that the system recovers given that a fault has occurred.
  • The state diagram is now given by the following picture.

103
Now allow for imperfect coverage
[State diagram with coverage factor c]
104
Markov modelwith imperfect coverage (Continued)
  • Assume that the initial state is 2, so that p_2(0) = 1 and p_1(0) = p_0(0) = 0.
  • Then the system of differential equations is
    dp_2(t)/dt = -2λ p_2(t) + μ p_1(t)
    dp_1(t)/dt = 2λc p_2(t) - (λ + μ) p_1(t)
    dp_0(t)/dt = 2λ(1 - c) p_2(t) + λ p_1(t)
105
Markov model with imperfect coverage (Continued)
  • After solving the differential equations, we obtain R(t) = p_2(t) + p_1(t)
  • From R(t), we can compute the system MTTF
  • It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor (a small numeric sketch follows).
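A small numeric sketch of that sensitivity, reusing the MTTA recipe above for the imperfect-coverage model (covered failures from state 2 go to state 1 at rate 2λc, uncovered ones go straight to the absorbing down state; the rates are illustrative):

    import numpy as np

    lam, mu = 0.001, 0.1
    for c in (1.0, 0.99, 0.9):
        QB = np.array([[-2*lam,      2*lam*c],      # uncovered failures leave B directly
                       [    mu, -(lam + mu)]])
        tau = np.linalg.solve(QB.T, -np.array([1.0, 0.0]))
        print("c =", c, "MTTF =", round(tau.sum(), 1))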

106
(No Transcript)
107
(No Transcript)