Probability for Computer Science presentation

1
Probability for Computer Science
IIT Kanpur

Kishor S. Trivedi
Visiting Prof. of Computer Science and
Engineering, IITK
Prof., Department of Electrical and Computer
Engineering
Duke University
Durham, NC 27708-0291
Phone 7576
e-mail kst_at_ee.duke.edu
URL www.ee.duke.edu/kst

2
Outline

Introduction
Preliminaries: Sample Space, Probability Axioms,
Independence, Conditioning, Binomial Trials
Random Variables: Binomial, Poisson, Exponential,
Weibull, Erlang, Hyperexponential,
Hypoexponential, Pareto, Defective
Reliability, Hazard Rate
Average Case Analysis of Program Performance
Reliability Analysis Using Block Diagrams and
Fault Trees
Reliability of Standby Systems
Statistical Inference Including Confidence
Intervals
Hypothesis Testing
Regression

3
Schedule & Textbook

Schedule: Jan 21, 23, 28 and Feb 6, 18, 25, 27
Textbook: Probability and Statistics with Reliability,
Queuing, and Computer Science Applications, K. S.
Trivedi, second edition, John Wiley & Sons, 2001
(Indian paperback).

4
Program Performance Evaluation

Worst-case vs. Average case analysis
Data-structure-oriented vs.
Control-structure-oriented
Sequential vs. Concurrent
Centralized vs. Distributed
Structured vs. with unrestricted transfer of
control
Unlimited (hardware) resources vs. limited
resources
Software architecture: modules, their
characteristics (execution time) and interactions
(branching, looping)
Characteristics of hardware on which the software
is run
Measures: completion time (mean, variance &
dist.), throughput
Measurements or Models (simulation vs. analytic)
analytic models: combinatorial, DTMC, SMP,
CTMC, SPN

5
System Performance Evaluation

Workload: traffic arrivals, service time
distributions, pattern of resource requests
Hardware architecture and software architecture
Resource Contention, Scheduling & Allocation
Concurrency, Synchronization, distributed
processing
Timeliness (Have to Meet Deadlines)
Measures: Throughput, Goodput, loss probability,
response time or delay (mean, variance & dist.)
Low-level (Cache, memory interference: ch. 7)
System-level (CPU-I/O, multiprocessing: ch. 8,9)
Network-level (protocols, handoff in wireless:
ch. 7,8)
Measurements or models (simulation or analytic)
analytic models: DTMC, CTMC, PFQN, SPN

6
System Performance Evaluation

Workload
Single vs. multiple types of requests (classes,
chains); in the latter case, the following three
items are needed for each type of request:
traffic arrivals: one time vs. a stream
stream: Poisson (Bernoulli), General renewal,
IPP (IBP), MMPP (MMBP), MAP, BMAP, NHPP,
Self-similar
service time distributions: Exponential
(geometric), deterministic, uniform, Erlang,
Hyperexponential, Hypoexponential, Phase-type,
general (with finite mean and variance), Pareto
pattern of resource requests: service time
distribution (or the mean) at each resource per
visit, and branching probabilities; often described
as a DTMC (discrete-time Markov chain), which can
also be seen as the behavior of an individual
program
All this information should be collected from
actual measurements (if possible) followed by
statistical inference

7
Software Reliability

Black-box (measurements & statistical inference)
vs. Architecture-based approach (models)
Black-box approaches treat software as a
monolithic whole, considering only its
interactions with external environment, without
an attempt to model its internal structure
With growing emphasis on reuse, software
development process moves toward component-based
software design
A white-box approach may be better suited to
analyzing a system with many software components
and how they fit together

8
Software Architecture

Software behavior with respect to the manner in
which different components interact
May include the information about the execution
time of each component
Use control flow graph to represent architecture
Sequential program architecture modeled by
Discrete Time Markov Chain (DTMC)
Continuous Time Markov Chain (CTMC)
Semi-Markov process (SMP)

9
Failure Behavior of Components and Interfaces

Failure can happen
during the execution of any component or
during the transfer of control between components
Failure behavior can be specified in terms of
reliability
constant failure rate
time-dependent failure intensity

10
System Reliability/Availability

Faultload: fault types, fault arrivals,
repair/recovery procedures and delay time
distributions
Hardware architecture and software architecture
Minimum Resource Requirements
Dynamic failures
Performance/Reliability interdependence
Measures: Reliability, Availability, MTTF,
Downtime
Low-level (Physics of failures, chip level)
System-level (CPU-I/O, multiprocessing: ch. 8,9)
Software and Hardware combined together
Network-level
Measurements or models (simulation or analytic)
analytic models: RBD, FTREE, CTMC, SPN

11
Definition of Reliability

Recommendation E.800 of the International
Telecommunication Union (ITU-T) defines
reliability as follows:
"The ability of an item to perform a required
function under given conditions for a given time
interval."
In this definition, an item may be a circuit
board, a component on a circuit board, a module
consisting of several circuit boards, a base
transceiver station with several modules, a
fiber-optic transport-system, or a mobile
switching center (MSC) and all its subtending
network elements. The definition includes systems
with software.

12
Definition of Availability

Availability is closely related to reliability,
and is also defined in ITU-T Recommendation E.800
as follows:
"The ability of an item to be in a state to
perform a required function at a given instant of
time or at any instant of time within a given
time interval, assuming that the external
resources, if required, are provided."
An important difference between reliability and
availability is that reliability refers to
failure-free operation during an interval, while
availability refers to failure-free operation at
a given instant of time, usually the time when a
device or system is first accessed to provide a
required function or service

13
High Reliability/Availability/Safety

Traditional applications
(long-life/life-critical/safety-critical)
Space missions, aircraft control, defense,
nuclear systems
New applications
(non-life-critical/non-safety-critical,
business critical)
Banking, airline reservation, e-commerce
applications, web-hosting, telecommunication
Scientific applications
(non-critical)

14
Motivation: High Availability

Scott McNealy, Sun Microsystems Inc.
"We're paying people for uptime. The only thing
that really matters is uptime, uptime, uptime,
uptime and uptime. I want to get it down to a
handful of times you might want to bring a Sun
computer down in a year. I'm spending all my time
with employees to get this design goal"
Sun Microsystems: SunUP RASCAL program for
high-availability
Motorola: 5NINES Initiative
HP, Cisco, Oracle, SAP: 5nines:5minutes Alliance
IBM: Cornhusker clustering technology for
high-availability, eLiza, autonomic computing
Microsoft: Trustable computing initiative
John Hennessy in IEEE Computer
Microsoft: Regular full page ad on 99.999%
availability in USA Today

15
Motivation High Availability
16
Need for a new term

Reliability is used in a generic sense
Reliability used as a precisely defined
mathematical function
To remove confusion, IFIP WG 10.4 has proposed
Dependability as an umbrella term

17
Dependability: Umbrella term
Trustworthiness of a computer system such that
reliance can justifiably be placed on the service
it delivers
18
IFIP WG10.4

Failure occurs when the delivered service no
longer complies with the specification
Error is that part of the system state which is
liable to lead to subsequent failure
Fault is the adjudged or hypothesized cause of an
error

Faults are the cause of errors that may lead to
failures
Fault -> Error -> Failure
19
Dependability: Reliability, Availability, Safety,
Security

Redundancy: Hardware (Static, Dynamic),
Information, Time, Software
Fault Types: Permanent (needs repair or
replacement), Intermittent (reboot/restart or
replacement), Transient (retry)
Design faults: Heisenbugs, Aging-related bugs,
Bohrbugs
Fault Detection, Automated Reconfiguration
Imperfect Coverage
Maintenance: scheduled, unscheduled

20
Software Fault Classification

Many software bugs are reproducible, easily
found and fixed during the testing and debugging
phase

Bohrbugs

Other bugs that are hard to find and fix remain
in the software during the operational phase
These bugs may never be fixed, but if the
operation is retried or the system is rebooted,
the bugs may not manifest themselves as failures
manifestation is non-deterministic and dependent
on the software reaching very rare states

Heisenbugs
21
Software Fault Classification
22
Failure Classification (Cristian)

Failures
Omission failures (Send/receive failures)
Crash failures
Infinite loop
Timing failures
Early
Late (performance or dynamic failures)
Response failures
Value failures
State-transition failures

23
Security

Security intrusions cause a system to fail
Security Failure
Integrity: Destruction/Unauthorized modification
of information
Confidentiality: Theft of information
Availability: e.g., Denial of Service (DoS)
Similarity (as well as differences) between
Malicious vs. accidental faults
Security vs. reliability/availability
Intrusion tolerance vs. fault tolerance

24
The Need for Performability Modeling

New technologies, services & standards need new
modeling methodologies
Pure performance modeling: too optimistic!
Outage-and-recovery behavior not considered
Pure dependability modeling: too conservative!
Different levels of performance not considered

25
"-ilities" besides performance
Performability: measures of the system's ability to
perform designated functions
R.A.S.-ability concerns grow. High R.A.S. is not
only a selling point for equipment vendors and
service providers: regulatory outage reports
required by the FCC for public switched telephone
networks (PSTN) may soon apply to wireless.
26
Evaluation vs. Optimization

Evaluation of system for desired measures given a
set of parameters
Sensitivity Analysis
Bottleneck analysis
Reliability importance
Optimization
Static: Linear, integer, geometric, nonlinear,
multi-objective; constrained or unconstrained
Dynamic: Dynamic programming, Markov decision
process, semi-Markov decision process

27
PURPOSE OF EVALUATION

Understanding a system
Observation
Operational environment
Controlled environment
Reasoning
A model is a convenient abstraction
Predicting behavior of a system
Need a model
Accuracy based on degree of extrapolation

28
PURPOSE OF EVALUATION(Continued)

These famous quotes bring out the difficulty of
prediction based on models:
"All Models are Wrong; Some Models are Useful"
George Box
"Prediction is fine as long as it is not about
the future"
Mark Twain

29
Basic Definitions

Reliability: R(t)
X: time to failure of a system
F(t): distribution function of system lifetime
MTTF: Mean Time To system Failure
f(t): density function of system lifetime
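
The quantities named on this slide are tied together by standard identities. A short sketch in LaTeX, assuming X is a nonnegative continuous random variable; the exponential case with rate \lambda is added here only as an illustration and is not taken from the slide:

```latex
\begin{align*}
R(t) &= P(X > t) = 1 - F(t), \\
f(t) &= \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}, \\
\mathrm{MTTF} &= E[X] = \int_0^\infty t\, f(t)\, dt = \int_0^\infty R(t)\, dt .
\end{align*}
% Illustration (assumed): for an exponential lifetime, F(t) = 1 - e^{-\lambda t},
% so R(t) = e^{-\lambda t} and MTTF = 1/\lambda.
```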

30
Availability (Continued)

Instantaneous (point) Availability A(t)
A(t) = P(system working at t)
Let H(t) be the convolution of F and G
g(t): density function of system repair time
Then A(t) = R(t) + \int_0^t A(t-x)\, dH(x)
Inst. Availability A(t) \geq R(t)

31
Availability
System working at time t (two mutually exclusive cases):

Never failed in (0,t), prob R(t)

First failed and got repaired at time x < t & UP at
end of interval (x,t), prob A(t-x) dH(x)
[Figure: timeline from 0 to t; the first repair is completed in (x, x+dx)]
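
Adding the probabilities of the two mutually exclusive cases gives a renewal-type equation for A(t). The LaTeX sketch below writes it out and then, as an illustration only, specializes it to exponentially distributed time to failure (rate \lambda) and repair time (rate \mu). Both rate symbols are assumptions introduced here.

```latex
\begin{align*}
A(t) &= R(t) + \int_0^t A(t-x)\, dH(x), \qquad H = F * G .
\end{align*}
% Assumed special case: F(t) = 1 - e^{-\lambda t}, G(t) = 1 - e^{-\mu t}.
% Solving the equation (for example with Laplace transforms) gives
\begin{align*}
A(t) &= \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t},
\qquad \lim_{t\to\infty} A(t) = \frac{\mu}{\lambda+\mu} .
\end{align*}
```

In this special case A(0) = 1 and A(t) decreases monotonically to its limit, whereas R(t) = e^{-\lambda t} decays to zero.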
32
Availability (Continued)

MTTR: Mean Time to Repair
Y: repair period of the system
Availability and Reliability are related but
different!

33
Availability (Continued)

Steady-State Availability
We can show that for systems without redundancy
A_{ss} = \frac{MTTF}{MTTF + MTTR}
For a system with redundancy
A_{ss} = \frac{MTTF_{eq}}{MTTF_{eq} + MTTR_{eq}}
where MTTF_{eq} & MTTR_{eq} must be carefully
defined
Also
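
A minimal Python sketch of the steady-state relation for a system without redundancy, A_ss = MTTF / (MTTF + MTTR), and of the downtime it implies. The function names and the 365-day year are assumptions made for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumed: non-leap year


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of a system without redundancy."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    a = steady_state_availability(mttf=1000.0, mttr=2.0)  # hours, hypothetical
    print(f"availability = {a:.6f}")
    print(f"downtime     = {downtime_minutes_per_year(a):.1f} min/year")
```

Under these assumptions, "five nines" (0.99999) corresponds to a little over five minutes of downtime per year.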

34
MEASURES TO BE EVALUATED

Dependability
Reliability R(t), System MTTF
Availability Steady-state, Transient
Downtime
Performance
Throughput, Blocking Probability, Response Time

"Does it work, and for how long?"
"Given that it works, how well does it work?"
35
MEASURES TO BE EVALUATED (Continued)

Composite Performance and Dependability
Need Techniques and Tools That Can Evaluate
Performance, Dependability and Their Combinations

"How much work will be done (lost) in a given
interval including the effects of
failure/repair/contention?"
36
Methods of EVALUATION

Measurement-Based
Most believable, most expensive
Not always possible or cost effective during
system design
Statistical techniques are very important here
Model-Based

37
Methods of EVALUATION(Continued)

Model-Based
Less believable, Less expensive
1. Discrete-Event Simulation vs. Analytic
2. State-Space Methods vs. Non-State-Space
Methods
3. Hybrid: Simulation + Analytic (SPNP)
4. State Space + Non-State Space (SHARPE)

38
Methods of EVALUATION(Continued)

Measurements + Models
Vaidyanathan et al., ISSRE 99

39
QUANTITATIVE EVALUATION TAXONOMY
Closed-form solution
Numerical solution using a tool
40
Note that

Both measurements & simulations imply statistical
analysis of outputs (ch. 10,11)
Statistical inference
Hypothesis testing
Design of experiments
Analysis of variance
Regression (linear, nonlinear)
Distribution driven simulation requires
generation of random deviates (variates) (ch. 3,
4, 5)
Probability and Statistics are different yet
highly related
Probability models need inputs that generally
come from measurement data (followed by
statistical inference)
Statistics in turn uses probability theory
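
Generation of random deviates for a distribution-driven simulation can be sketched with the inverse-transform method: draw u uniformly on [0, 1) and return the t that solves F(t) = u. The Python sketch below does this for the exponential and Weibull distributions. The function names and the scale/shape parameterization of the Weibull are assumptions for this illustration, not taken from the textbook.

```python
import math
import random


def exponential_variate(rate: float, u: float) -> float:
    """Invert F(t) = 1 - exp(-rate * t) at the uniform draw u in [0, 1)."""
    return -math.log(1.0 - u) / rate


def weibull_variate(scale: float, shape: float, u: float) -> float:
    """Invert F(t) = 1 - exp(-(t / scale) ** shape) at u in [0, 1)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def sample_mean_exponential(rate: float, n: int, seed: int = 1) -> float:
    """Average of n exponential deviates; should be close to 1 / rate."""
    rng = random.Random(seed)
    return sum(exponential_variate(rate, rng.random()) for _ in range(n)) / n


if __name__ == "__main__":
    print(sample_mean_exponential(rate=2.0, n=20000))
```

The sample mean is itself a statistic, so judging how close it is to 1/rate is a question for statistical inference.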

41
MODELING THROUGHOUT SYSTEM LIFECYCLE

System Specification/Design Phase
Answer "What-if" Questions
Compare design alternatives (Bedrock, Wireless
handoff)
Performance-Dependability Trade-offs (Wireless
Handoff)
Design Optimization (optimizing the number of
guard channels)

42
MODELING THROUGHOUT SYSTEM LIFECYCLE (Continued)

Design Verification Phase
Use Measurements + Models
E.g., Fault Injection + Availability Model
Union Switch and Signals, Boeing, Draper
Configuration Selection Phase: DEC, HP
System Operational Phase: IDEN Project
Workload based adaptive rejuvenation

It is fun!

43
MODELING TAXONOMY
44
MODELER'S DILEMMA

Should I Use Discrete-Event Simulation?
Point Estimates and Confidence Intervals
How many simulation runs are sufficient?
What Specification Language to use?
C, SIMULA, SIMSCRIPT, MODSIM, GPSS, RESQ, SPNP
v6, Bones, SES workbench

45
MODELER'S DILEMMA (Continued)

Simulation
+ Detailed System Behavior including
non-exponential distributions or non-Poisson
processes
+ Performance, Availability and Performability
Modeling Possible
- Long Execution Time (Variance Reduction
Possible)
Importance Sampling, importance splitting,
regenerative simulation.
Parallel and Distributed Simulation
- Many users in practice do not realize the need
to calculate confidence intervals
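
A minimal Python sketch of the point estimate and confidence interval for independent simulation replications, plus a rough estimate of how many replications a target relative error would need. It uses the normal approximation; for a small number of replications a Student-t quantile is more appropriate. The function names are assumptions for this illustration.

```python
import math
import statistics


def confidence_interval(samples, level=0.95):
    """Return (mean, half_width) using the normal approximation."""
    n = len(samples)
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean, z * std_err


def replications_needed(pilot_samples, rel_error, level=0.95):
    """Replications so that half_width <= rel_error * |mean| (pilot-based)."""
    mean = statistics.fmean(pilot_samples)
    s = statistics.stdev(pilot_samples)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.ceil((z * s / (rel_error * abs(mean))) ** 2)


if __name__ == "__main__":
    runs = [10.0, 12.0, 11.0, 9.0, 13.0]  # hypothetical replication outputs
    m, h = confidence_interval(runs)
    print(f"{m:.2f} +/- {h:.2f}")
    print(replications_needed(runs, rel_error=0.05))
```

The replication count comes from a pilot sample, so it is only an estimate and should be rechecked once the full set of runs is available.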

46
MODELER'S DILEMMA (Continued)
Should I Use Non-State-Space Methods?

Model Solved Without Generating State Space
Also Known as Combinatorial Models
Use Order Statistics, Mixing, Convolution
Common Dependability Model Types:
Series-Parallel Reliability Block Diagrams (RBD)
Non-Series-Parallel Block Diagrams (or
Reliability Graphs)
Fault-Trees Without Repeated Events
Fault-Trees With Repeated Events

47
RBD example
48
RELIABILITY GRAPH Example
49
Fault tree without repeated events
50
FAULT TREE WITH REPEATED EVENTS
EXAMPLE
52
Combinatorial Models

These techniques are easy to use and can solve for
Mincuts
System Availability (steady-state, inst.)
Downtime in minutes/year
System Reliability, System MTTF
Each component can have attached to it
A probability of failure
A failure rate
A distribution of time to failure
A failure rate and a repair rate

53
Combinatorial Modeling (Continued)

These models can be solved using fast algorithms
assuming stochastic independence between system
components. Systems with several hundred
components can be handled.
For series-parallel RBDs & fault trees w/o
repeated events
Series-parallel composition algorithms
For fault trees with repeated events and
reliability graphs
Factoring (conditioning) algorithms
Sum of disjoint products (SDP) algorithms after
first finding all mincuts
Binary decision diagrams (BDD) algorithms

54
Combinatorial Modeling (Continued)

+ Easy specification, fast computation, no
distributional assumption
+ Can easily solve models with 100s of
components
- Failure/Repair Dependencies are often present;
RBDs, FTREEs cannot easily handle these
(e.g., shared repair, warm/cold spares, imperfect
coverage, non-zero switching time, travel time of
repair person, reliability with repair)

55
COMBINATORIAL MODELING TAXONOMY
SP reliability block diagrams
Non-SP reliability block diagrams
56
Markov chain

To model more complicated interactions between
components, use other kinds of models like Markov
chains or more generally state space models.
Many examples of dependencies among system
components have been observed in practice and
captured by Markov models.

57
State-Space-Based Models

States and labeled state transitions
State can keep track of
Number of functioning resources of each type
States of recovery for each failed resource
Number of tasks of each type waiting at each
resource
Allocation of resources to tasks
A transition
Can occur from any state to any other state
Can represent a simple or a compound event

58
State-Space-Based Models (Continued)

Transitions between states represent the change
of the system state due to the occurrence of an
event
Drawn as a directed graph
Transition label
Probability: homogeneous discrete-time Markov
chain (DTMC)
Rate: homogeneous continuous-time Markov chain
(CTMC)
Time-dependent rate: non-homogeneous CTMC
Distribution function: semi-Markov process (SMP)
Two distribution functions: Markov regenerative
process (MRGP)

59
MODELER'S DILEMMA (Continued)

Should I Use Markov Models?
State-Space-Based Methods
Model Fault-Tolerance and Recovery/Repair
Combined Modeling of hardware and software
Model Dependencies
Model Contention for Resources
Model Concurrency and Timeliness

60
Condition-Based Maintenance

Failure model is stage type with k stages
Inspection carried out randomly to determine
degradation stage
Determine optimal inspection interval
Many extensions to this model are available

61
Condition-Based Maintenance
[Figure: availability plotted against mean time between inspections]

62
Webserver Availability Model with Warm Replication

Two nodes for hardware redundancy
Each node has a copy of the webserver (software
redundancy replication)
Primary node can fail
Secondary node can fail
Primary process can fail
Secondary process can fail
Failures may have imperfect coverage
Time delay for fault detection
Model of a real system developed at Avaya Labs
Both hardware & software faults included

63
Markov Model with Software and Hardware Faults
"Performance and Reliability Evaluation of
Passive Replication Schemes in Application Level
Fault-Tolerance," S. Garg, Y. Huang, C. Kintala,
K. S. Trivedi and S. Yagnik, Proc. of the 29th
Intl. Symp. on Fault-Tolerant Computing (FTCS-29),
June 1999.
64
Parameters

Process MTTF 10 days (1/?p)
Node MTTF 20 days (1/?n)
Process polling interval 2 seconds (1/?p)
Mean process restart time 30 seconds (1/?p)
Mean process failover time 2 minutes (1/?n)
Switching time with mean 1/ ?s
C 0.95

65
Solution for Warm replication
66
MULTIPROCESSOR AVAILABILITY MODEL

n Processors, at least 1 Needed for System to be
UP
Each Processor Fails at Rate ?
Each Processor is Repaired at Rate ?
Coverage Probability c
Average Reconfiguration Delay After a Covered
Failure 1/?
Ave. Reboot Delay After an Uncovered Failure 1/?
Not possible to capture these realistic aspects
in a combinatorial model
Model System Availability Using a Markov Chain

67
MULTIPROCESSOR AVAILABILITY MODEL
[Figure: Markov chain with states n, n-1, n-2, ..., 1, 0 and intermediate states Dn, Dn-1, ..., Bn, Bn-1, ...]
70
LESSONS

To Realize Availability Benefits of
Multiprocessing:
Coverage Must be Near-Perfect
Reconfiguration Delay Must be Very Small
Must Consider Different Levels of (Degradable)
Performance

71
Markov Reward Models (MRMs)

Modeling any system with a pure reliability /
availability model can lead to incomplete, or, at
least, less precise results.
Gracefully degrading systems may be able to
survive the failure of one or more of their
active components and continue to provide service
at a reduced level.
The Markov reward model is a commonly used technique
for modeling gracefully degradable systems

72
Markov Reward Models (MRMs)

Continuous Time Markov Chains are useful models
for performance as well as availability
prediction
Extension of CTMC to Markov reward models makes
them even more useful
Attach a reward rate r_i to state i of CTMC
X(t) is instantaneous reward rate of CTMC

73
Markov Reward Models (MRMs) (Continued)

Expected instantaneous reward rate at time t
E[X(t)] = \sum_i r_i \pi_i(t)
this generalizes instantaneous availability
where \pi_i(t)
is the prob. that the Markov chain is in state
i at time t
Expected steady-state reward rate
E[X] = \sum_i r_i \pi_i
this generalizes steady-state availability
where \pi_i
is the prob. that the Markov chain is in state
i in steady-state

74
Performance model

Use a Finite Buffer Queuing Model To Determine
The Prob. Task is Rejected Due to Buffer Full
Task Arrival Rate ? task Service Rate ?
Number of Buffers b
Buffer Full Prob. qb(i) with i Processors
Results from the lower level performance model
used to assign reward rates to the upper level
availability model
Queuing model: M/M/i/b
[Figure: queue with b buffers served by processors 1, ..., i]
75
TOTAL BLOCKING PROBABILITY

r_i = 1 if i is a down state
r_i = q_b(i) if i is an up state
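
A minimal Python sketch of this reward assignment: the buffer-full probability of an M/M/i/b queue is used as the reward rate of each up state, a down state gets reward 1, and the total blocking probability is the reward-weighted sum over states. The function names are assumptions, b is read here as the total number of tasks the system can hold (with b at least the number of processors), and the state probabilities in the example are hypothetical.

```python
import math


def mmck_blocking(arrival, service, servers, capacity):
    """Probability that an M/M/c/K queue is full (K = total in system)."""
    a = arrival / service
    weights = []
    for n in range(capacity + 1):
        if n <= servers:
            w = a ** n / math.factorial(n)
        else:
            w = a ** n / (math.factorial(servers) * servers ** (n - servers))
        weights.append(w)
    # Unnormalized state probabilities; the last one is the full state.
    return weights[-1] / sum(weights)


def total_blocking_probability(up_probs, down_prob, arrival, service, buffers):
    """up_probs maps i (working processors) to steady-state probability."""
    total = down_prob * 1.0  # reward rate 1 in down states
    for i, p in up_probs.items():
        total += p * mmck_blocking(arrival, service, i, buffers)
    return total


if __name__ == "__main__":
    # Hypothetical availability-model output for a 2-processor system.
    up = {2: 0.997, 1: 0.002}
    print(total_blocking_probability(up, 0.001, arrival=1.0, service=1.0,
                                     buffers=4))
```

This is the hierarchical pattern described on the previous slide: the lower-level queueing model supplies reward rates to the upper-level availability model.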

76
TOTAL BLOCKING PROBABILITY
78
MODELER'S DILEMMA (Continued)

Should I Use Markov Models?
Generalize to Markov Reward Models for Modeling
Degradable Performance
Generalize to Markov Regenerative Models for
Allowing Generally Distributed Event Times
Generalize to Non-Homogeneous Markov Chains for
Allowing Weibull Failure Distributions
Performance, Availability and Performability
Modeling Possible
- Large (Exponential) State Space

79
State Space Modeling Taxonomy
State space methods
Markovian modeling: discrete-time Markov chains,
continuous-time Markov chains, Markov reward models
non-Markovian modeling: Semi-Markov models,
Markov regenerative models, Non-Homogeneous Markov
80
State Space Explosion

State space explosion can be handled in two ways
Largeness tolerance
Model specification: use more concise (and
smaller) model specification (GSPN and SRN
models)
Automatically generate & solve underlying
Markov (reward) model
Largeness avoidance
Hierarchical model composition & fixed-point
iteration: combine results from different
kinds of models
Possible to use state-space methods for those
parts of a system that require them, and use
non-state-space methods for the more
well-behaved parts of the system.
State Truncation

81
LARGENESS TOLERANCE

The Markov chains tend to be large and complex,
leading to:
Model generation problem
Use automated means of generating the Markov
chains: Stochastic Petri Nets, Stochastic Reward
Nets

82
LARGENESS TOLERANCE(Continued)

Model solution problem
Use sparse storage for the matrices
Use sparsity preserving solution methods
Successive Overrelaxation,
Gauss-Seidel,
Uniformization,
ODE-solution methods

83
Stochastic Petri Net (SPN)

Introduced in 1980s by Natkin, Florin, Molloy,
Ajmone Marsan, Balbo, Conte, Bobbio, Trivedi,
others
A modeling formalism for the automated generation
and solution of Markovian stochastic systems
Many extensions to the original formalism: GSPN,
SRN, DSPN, MRSPN, FSPN

84
GSPN Model for Multiprocessor
[Figure: GSPN model of a multiprocessor; note that the GSPN is the same for all n]
85
ERG for Multiprocessor Model (n = 2)
[Figure: Extended Reachability Graph for the multiprocessor model, with markings (2,0,0,0,0), (1,1,0,0,0), (1,0,1,0,0), (1,0,0,1,0), (1,0,0,0,1), (0,1,0,0,1), (0,0,0,0,2) and transitions Tfail, tcov, Tuncov, tquick, Trep, Treboot, Trecon]
[Figure: Reduced ERG (Markov chain) for the multiprocessor model, with states (2,0,0,0,0), (1,0,1,0,0), (1,0,0,1,0), (1,0,0,0,1), (0,0,0,0,2); the rates out of (2,0,0,0,0) carry the factors c and (1-c)]
86
Stochastic Reward Net (SRN)

Introduced by Ciardo, Muppala and Trivedi 1989
Structural characteristics
Extensive Marking dependency allowed for firing
rates and firing probabilities
Transition Priorities
Guards (Enabling functions) for Transitions
Variable cardinality arcs

87
Stochastic Reward Net (SRN)

Stochastic characteristics
Allow definition of reward rates in terms of net
level entities
Automatically generate the reward rates for the
markings
Enables computation of required measures of
interest

88
Example Reward Rates for Multiprocessor
Availability

Reward rate at the net level for steady state
availability
Reward rate at the CTMC level for steady-state
availability (n = 2)

89
Analysis Procedure of SRN
Stochastic Reward Nets -> Reachability Analysis ->
Extended Reachability Graphs -> Eliminate vanishing
markings -> Markov Reward Model -> Solve MRM
(transient or steady-state) -> Measures of Interest
90
LARGENESS AVOIDANCE

Non-State-Space methods
Reliability block diagrams
Fault-trees
Product-Form Queuing Networks
Approximate solutions
State Truncation
SAVE, SPNP (Kantz and Trivedi PNPM91)

91
Case Study: JPL REE System Availability Modeling
in Spacecraft Architecture
92
LARGENESS AVOIDANCE (Cont.)

Stochastic Petri Nets (State-space-based
modeling)
State truncation by introducing a guard function
Guard g is defined as
if (?mark(_dn) > K)
    return (0)
else
    return (1)

93
SPN MODELING
94
AVAILABILITY MEASURES
95
LARGENESS AVOIDANCE (Continued)

Approximate solutions
Hierarchical Decomposition
and Fixed-Point Iteration among submodels
Heidelberger and Trivedi, IEEE-TC, 1983
(Queueing Models)
Ciardo and Trivedi, PNPM91 (SPN Models)
Tomek and Trivedi (Availability Models)
Lanus, Liang & Trivedi (Bedrock)
Wireless handoff work: Ma, Han & Trivedi

96
Hierarchical example

Blocks colored red are expanded into submodels

97
LARGENESS AVOIDANCE (Continued)

Approximate solutions
Performability
Multiprocessor example
Fluid Approximation
Mitra Kulkarni Ciardo Nicol, and Trivedi
FSPN

98
Summary: Modeling Techniques

Combinatorial techniques like RBDs and FTREEs are
easy to represent and solve
Combinatorial models cannot represent intricate
dependencies
State space based models like Markov chains can
handle dependencies
State space explosion problem
Use automated generation methods: stochastic
Petri nets
Hierarchical models

99
IN ORDER TO FULFILL OUR GOALS

Modeling Performance, Availability and
Performability
Modeling Complex Systems
We Need
Automatic Generation and Solution of Large Markov
Reward Models

100
IN ORDER TO FULFILL OUR GOALS (Continued)

Facility for State Truncation, Hierarchical
composition of Non-State-Space and State-Space
Models, Fixed-Point Iteration
There are Two Tools that Potentially meet these
Goals
Stochastic Petri Net Package (SPNP)
Symbolic Hierarchical Automated Reliability and
Performance Evaluator (SHARPE)

101
Model-based Availability evaluation

Choice of the model type is dictated by
Measures of interest
Level of detailed system behavior to be
represented
Ease of model specification and solution
Representation power of the model type
Access to suitable tools or toolkits

102
SPNP Software Package
103
SPNP

Installed at over 250 Sites: companies &
universities
Ported to Most Architectures and Operating
Systems
Used For Performance, Dependability and
Performability
Steady-State as well as Transient Analysis
Analytic-numeric methods for Markovian models.
Simulation for non-Markovian and fluid models
Written in C Language
GUI now available

104
SOME INDUSTRIAL USES

HP
Cluster Availability Modeling
Server Availability
Mass Storage Arrays Availability Modeling
MOTOROLA
Recovery strategies in wireless handoff
proposed and modeled several strategies
Fixed-point iteration used
Software rejuvenation in CMTS
IBM
Software rejuvenation for a cluster system
Boeing, EMC,

105
DISCRETE EVENT SIMULATION ANALYSIS

Can be used for
Markovian SRN
non-Markovian SRN
Fluid SPN
FSPN (Fluid Stochastic Petri net)
Used as a model for
Systems involving fluid variables
Approx. of models with a large number of tokens
No need to generate the reachability graph
Possibility to give the number of replications or
the desired relative error.

106
DISTRIBUTIONS AVAILABLE FOR SIMULATION

Exponential
Constant (including Immediate)
Uniform
Truncated normal
Weibull
Lognormal
Geometric
Erlang
Pareto
Cauchy
Beta
Gamma
Poisson

107
Solution Technique in SPNP
108
An Introduction to SHARPE software tool
109
Overview of SHARPE

SHARPE Symbolic-Hierarchical Automated
Reliability and Performance Evaluator
Well-known modeling tool (Installed at over 300
Sites companies and universities)
Combines flexibility of Markov models and
efficiency of combinatorial models
Ported to most architectures and operating
systems
Used for Education, Research, Engineering Practice

110
Overview of SHARPE (cont.)

Graphical User Interface is available
Used for analysis of performance (traffic),
dependability and performability
Hierarchy facilitates largeness & stiffness
avoidance
Steady-state as well as transient analysis
Written in C language
Used as an engine by several other tools

111
Architecture of SHARPE interface
[Figure: model types accepted by SHARPE (fault tree, reliability block diagram, reliability graph, Markov chain, MRGP, Petri net (GSPN & SRN), task graph, PFQN, MFQN), combined through hierarchical & hybrid compositions to produce reliability/availability, performance, and performability measures]
112
Modeling Steps

Model construction
Model calibration or parameterization
Model solution
Result interpretation
Model Validation

113
MODEL CALIBRATION

What is the failure rate?
Fault Model for Each Component
Design, Manufacturing: Heisenbugs, Bohrbugs
Operational: Permanent, Intermittent, Transient
Human
Fault Arrival Processes (PP, Weibull, NHPP)
Failure Rates (Sources: MIL-STD)

114
MODEL CALIBRATION (Continued)

What is c?
Field Data
Fault/Error Injection (FIAT, MESSALINE)
Analytic Coverage Model
What is the repair rate?
Maintenance Model

Corrective: dispatch, travel, repair time, dead
on arrival, imperfect repair
Preventive
115
MODEL CALIBRATION (Continued)

What is r?
Binary: Up, Down
Capacity-Oriented
Number of Operational Resources in Each State
Performance-Oriented
Evaluate Perf. in Each Degraded Level of Syst.
Config.
1. Measurements
2. Simulation Model
3. Analytic Model -- SHARPE, SPNP

116
VALIDATION & VERIFICATION

Validation: Does the conceptual model faithfully
reflect the behavior of the system?
Verification: Has the conceptual model been
correctly implemented?

117
MODEL VALIDATION (Continued)

Three-step process outlined by Naylor and Finger
Face validation: Discussion with the experts
Input-Output validation: Compare results obtained
from model with those from measurements
Validation of model assumptions: Either prove
that the assumptions are correct or do
statistical testing
Rejection of a hypothesis regarding model
assumption based on measurement data leads to an
improved model

118
MODEL ASSUMPTIONS/ERRORS

Errors in Model Structure
Missing or Extra Arcs
Missing or Extra States
Use Face Validation to avoid these errors.
Errors Due to Non-Independence
Distributional Errors
Parametric Errors

119
MODEL ASSUMPTIONS/ ERRORS(Continued)

Errors Due to Approximations
Decomposition/Aggregation/Iteration
State Truncation
Numerical Solution Errors
Discretization Errors
Round-Off Errors

120
Model Verification

Programming Errors
Approximation errors: Tight bounds due to
approximations are desirable
Numerical errors: Errors in numerical algorithms
should be bounded

121
MODELING AND MEASUREMENTS INTERFACES

Measurements supply Input Parameters to Models
(Model Calibration or Parameterization)
Confidence Intervals should be obtained
Boeing, Draper, Union Switch projects
Model Sensitivity Analysis can suggest which
Parameters to Measure More Accurately: Blake,
Reibman and Trivedi, SIGMETRICS 1988.
