Title: EGR 518 Performability (Performance and Dependability) Analysis for Computer Systems
1 EGR 518 Performability (Performance and Dependability) Analysis for Computer Systems
- Instructor: Meng-Lai Yin
- Office: Bldg. 9, Room 511
- Tel: 909-869-2535
- Email: myin_at_csupomona.edu
2 Expectation
- A student, after taking this class, is expected to
- Know the terminology and state-of-the-art technology in reliability, availability, performance, and performability
- Grasp an overall picture of the system being analyzed
- Recognize and determine the type of analysis needed for a particular task
- Construct corresponding models for the analysis
- Get familiar with the provided computer-aided analysis tools to conduct the analysis
- Obtain quantitative as well as qualitative results from the models
- Validate the modeling results
3 Course Outline
- Basic Concepts about Performability Modeling
- Probability Review
- Fault Tolerance Techniques
- Concepts about Modeling Approaches
- Modeling Tools
- Reliability Block Diagram
- Markov Modeling Technique
- Fault Tree Analysis
- Performance Analysis: Queuing Models
- Integrate Performance and Dependability
- Case Studies, Project Presentations
4 Assessment
- Homework 20%
- Quizzes 20%
- Midterm 20%
- Final 20%
- Project 20%
5 Text & References
- Kishor S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, second edition, John Wiley & Sons, Inc., 2002, ISBN 0-471-33341-7.
- References
- [1] Robin A. Sahner, Kishor S. Trivedi, Antonio Puliafito, Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, 1996. ISBN 0-7923-9650-2.
- [2] Martin L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley & Sons, Inc., 2002. ISBN 0-471-29342-3.
- [3] http://www.crhc.uiuc.edu/PERFORM/home.html
- [4] http://www.eecs.umich.edu/jfm/
- [5] http://www.ee.duke.edu/kst/
6 OK. So, what is "Performability"?
The need for high-performance, fault-tolerant computing
7 Fault-Tolerant Computing
- Fault-tolerant computing is a generic term for redundant design techniques, using duplicate components or repeated computations, that enable uninterrupted (tolerant) operation in the presence of component failures (faults).
8 Links
Performability
- http://www.crhc.uiuc.edu/PERFORM/home.html
- http://www.eecs.umich.edu/jfm/
Conferences
- http://www.dsn.org
- http://www.rams.org
9 An Example
The purpose of this example is to show the existence of performance degradable systems
10 An email received on July 20, 2005, 4:33 PM
- We are experiencing problems with the AIX user account file systems. We need to take the AIX system off-line immediately to fix the problem. We expect the AIX file systems to be off line for approximately an hour and a half. We hope to have the file systems back on-line by 6:00 PM.
- Sorry for any inconvenience.
- Sys Admin Team
11 Later that day: July 20, 2005, 6:26 PM
- All AIX file systems are back on-line except wei_snoop which is in a rebuild stage. Wei_snoop file system will be back on-line by 0600 tomorrow morning.
- Thanks,
- Sys Admin Team
12 Observations
- The system has not totally failed, even with the failed AIX file system
- The system can operate without the wei_snoop file system
- The system can be upgraded while operating
More and more systems are becoming performance degradable
13 Performance Degradable Systems
- Performance degradable systems have the capability of continuing to operate failure-free in the presence of certain faults or errors by diminishing the level of quality of service [7].
Normal Scenario: A system starts with all components operational and performs at its maximum capability. When a component fails, the system reconfigures itself and operates with degraded performance, and so on with each further failure.
14 Reasons for Performability Modeling
- Two separate measures in the past
- Traditional dependability analysis assumes no performance-degraded states.
- Performance measures are always applied to the fully operational state.
- Need an integrated, meaningful metric
- For performance degradable systems, where the system can operate in many different states, how do you address the system's performance?
- Traditional metrics (performance, reliability, availability, etc.) and the corresponding modeling techniques cannot capture the overall performance of performance degradable systems.
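One simple way to see why a combined metric is useful is to weight the performance delivered in each state by the probability of being in that state. The sketch below is only an illustration under stated assumptions: the three states, their probabilities, and their throughput values are hypothetical numbers, not data from the course, and the probability-weighted average is one possible combined measure rather than the course's formal definition of performability.

```python
# Hypothetical degradable system: (state name, probability of being in the state,
# throughput delivered in that state). All numbers are illustrative assumptions.
states = [
    ("2 processors up", 0.95, 100.0),
    ("1 processor up", 0.04, 55.0),
    ("system down", 0.01, 0.0),
]

# The state probabilities must cover all cases.
total_probability = sum(p for _, p, _ in states)
assert abs(total_probability - 1.0) < 1e-12

# Availability alone: probability that the system delivers any service at all.
availability = sum(p for _, p, perf in states if perf > 0)

# Combined measure: probability-weighted average of the delivered throughput.
expected_throughput = sum(p * perf for _, p, perf in states)

print(f"Availability        = {availability:.4f}")
print(f"Expected throughput = {expected_throughput:.2f}")
```

Availability treats the degraded state the same as the fully operational one, and the full-capacity throughput figure ignores failures entirely; the weighted average reflects both.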
15 The Beginning of Performability
- The term Performability was introduced almost three decades ago [4], by Prof. J. F. Meyer.
John F. Meyer, Professor Emeritus, Electrical Engr & Computer Science. Address: 4111 EECS. Phone: (734) 763-0037. Fax: (734) 763-1503. Degree: Ph.D., U-Michigan
16 A Tribute to M. D. Beaudry
- Before Dr. John F. Meyer gave the name performability to the world, several works had already addressed the issue of providing appropriate metrics for performance degradable systems.
- In particular, the work conducted by Danielle Beaudry [1] has been referenced in many places.
- In [1], she addressed performance-related reliability measures for gracefully degrading systems (performance degradable systems).
17 Course Objectives
- At the conclusion of this course, a participant will be able to
- know the basic concepts of performability
- know how to
- conduct and evaluate a dependability analysis using Reliability Block Diagram (RBD) or Markov techniques
- conduct and evaluate a performance analysis using queuing models
- conduct and evaluate a performability analysis using various modeling techniques
18 Approach
19 Performance Analysis
- Purpose
- To assess workload, traffic arrival rates, service time distributions, etc.
- To evaluate resource contention and scheduling
- To assess the effects of concurrency and synchronization
- Measures
- Throughput
- Response time (mean and distribution)
- Others
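As a small illustration of how throughput and mean response time follow from an arrival rate and a service rate, here is a sketch for a single-server queue. It assumes an M/M/1 model (Poisson arrivals, exponential service times, one server), which the slide itself does not specify; the function name `mm1_metrics` and the rates used are hypothetical.

```python
def mm1_metrics(arrival_rate: float, service_rate: float) -> dict:
    """Steady-state measures of an M/M/1 queue (an assumed model, for illustration)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    return {
        "utilization": utilization,
        # In steady state every arriving job is eventually served.
        "throughput": arrival_rate,
        # Mean time a job spends in the system (waiting plus service).
        "mean_response_time": 1.0 / (service_rate - arrival_rate),
        # Mean number of jobs in the system.
        "mean_jobs_in_system": utilization / (1.0 - utilization),
    }

# Illustrative rates: 8 jobs/s arriving, server able to complete 10 jobs/s.
metrics = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The last two measures are related by Little's law: mean jobs in system equals throughput times mean response time.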
20 How about dependability?
So many terms have been used in this area, such as reliability, availability, ...
21 Reliability, Availability, Dependability
- They are all probabilities.
- What are the differences?
Definition of Reliability: "The probability of an item to perform a required function under given conditions for a given time interval." (without any failure)
Definition of Availability: "The probability of an item to be in a state to perform a required function at a given instant of time, assuming that the external resources, if required, are provided." (can have failures with repairs)
22 Picture the Differences
[Figure: two timelines, each running from time t0 to time t]
Reliability: the probability that the item survives the interval [t0, t).
Availability: the probability that the item is working at time t, given that the item was working at time t0.
23 Picture the Differences
[Figure: a typical reliability curve (without repair) and a typical availability curve (with repair), both starting at 1.0; the availability curve levels off at the steady-state availability]
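24 Calculating Reliability & Availability
- Let $\lambda$ be the failure rate for a component, and $\mu$ be the repair rate for that component.
- Assume an exponential distribution for the failures
- Then reliability can be calculated as $R(t) = e^{-\lambda t}$
- SS (Steady-State) availability can be assessed as $A_{SS} = \mu / (\lambda + \mu)$, or equivalently $A_{SS} = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})$, where $\mathrm{MTTF} = 1/\lambda$ and $\mathrm{MTTR} = 1/\mu$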
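The two calculations on this slide can be tried out with a few lines of code. This is a minimal sketch assuming exponentially distributed times to failure and to repair, with the standard results R(t) = exp(-lambda t) and steady-state availability mu / (lambda + mu); the function names and the numeric rates are illustrative assumptions, not values from the course.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure during [0, t)."""
    return math.exp(-failure_rate * t)

def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """A_SS = mu / (lambda + mu), which equals MTTF / (MTTF + MTTR)."""
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical component: on average one failure per 1000 hours, 10 hours to repair.
lam = 1.0 / 1000.0   # failure rate, per hour (assumed)
mu = 1.0 / 10.0      # repair rate, per hour (assumed)

r_100 = reliability(lam, t=100.0)
a_ss = steady_state_availability(lam, mu)

# Cross-check using mean time to failure and mean time to repair.
mttf, mttr = 1.0 / lam, 1.0 / mu
a_ss_from_means = mttf / (mttf + mttr)

print(f"R(100 h) = {r_100:.4f}")
print(f"A_SS     = {a_ss:.4f} (from rates), {a_ss_from_means:.4f} (from MTTF/MTTR)")
```

Reliability keeps falling as t grows because no repair is considered, whereas the steady-state availability is a single number that depends only on the ratio of the two rates.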
25 Dependability: Umbrella Term
Trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers
Copied from course materials provided by Prof. Trivedi
26 Modeling Taxonomy
27 Combinatorial Approach
- If a system consists of n components, and every component is either working or failed, then we can simply list all the possible combinations and calculate the probability of each combination.
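The listing of combinations can be sketched directly in code. The example below assumes the components fail independently, and it needs a rule saying which combinations count as "system working"; both the 2-out-of-3 rule and the component probabilities used here are hypothetical choices for illustration.

```python
from itertools import product

def system_reliability(p_work, works):
    """Sum the probabilities of all component-state combinations in which the system works.

    p_work: probability that each component is working (independence assumed).
    works:  hypothetical structure function, maps a tuple of booleans to True/False.
    """
    total = 0.0
    # product(...) enumerates all 2**n working/failed combinations.
    for states in product([True, False], repeat=len(p_work)):
        prob = 1.0
        for p, up in zip(p_work, states):
            prob *= p if up else (1.0 - p)
        if works(states):
            total += prob
    return total

def two_of_three(states):
    """Illustrative rule: the system works if at least 2 of its 3 components work."""
    return sum(states) >= 2

p = [0.9, 0.9, 0.9]  # assumed component reliabilities
r_system = system_reliability(p, two_of_three)
print(f"System reliability = {r_system:.4f}")
```

The loop visits 2**n combinations, so this brute-force listing is only practical for small n.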
28 Complexity Concerns
- How many possible combinations of the status of these n components are there?
- What can be done to manage the complexity?
- During model construction
- Need a more intelligent way to describe the system's failure behavior
- Series and parallel RBD (Reliability Block Diagram) approach
- During model solution
- Need more efficient ways of calculation, rather than counting individual probabilities
29 Structured Combinatorial Approach
- Reliability block diagrams
- Integrate certain probability events into a module, which contains the info
- A probability of failure
- A failure rate
- A distribution of time to failure
- Steady-state and instantaneous unavailability
- Organize the modules in a structured way, according to the effects of each module's failure
- Statistical independence assumption
- Failures are independent
- Repairs are independent
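Under the independence assumption, series and parallel arrangements of modules reduce to simple products, which avoids enumerating every combination. The following is a minimal sketch of that idea; the helper names `series` and `parallel`, the block layout, and the module reliabilities are hypothetical and chosen only for illustration.

```python
from math import prod

def series(*reliabilities: float) -> float:
    """A series arrangement works only if every module works."""
    return prod(reliabilities)

def parallel(*reliabilities: float) -> float:
    """A parallel arrangement fails only if every module fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical diagram: module A in series with a redundant pair (B1, B2).
r_a, r_b1, r_b2 = 0.99, 0.9, 0.9   # assumed module reliabilities
r_pair = parallel(r_b1, r_b2)
r_system = series(r_a, r_pair)

print(f"Redundant pair: {r_pair:.4f}")
print(f"Whole system:   {r_system:.4f}")
```

Because each call returns a single reliability value, the helpers can be nested to describe larger series-parallel diagrams.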
30 Some Basic Terminology
- Redundancy: hardware (static, dynamic), information, time, software
- Fault types: permanent (needs repair or replacement), intermittent (reboot/restart or replacement), transient (retry)
- Fault, error, failure
- Fault detection, imperfect coverage
- Maintenance: scheduled (preventive), unscheduled (corrective)
31 Terminology (Continued)
- A failure occurs when the delivered service no longer complies with the specification
- An error is that part of the system state which is liable to lead to subsequent failure
- A fault is the adjudged or hypothesized cause of an error
Faults are the cause of errors that may lead to failures
Fault -> Error -> Failure
32 High Availability Intents
- Scott McNealy, Sun Microsystems Inc.
- "We're paying people for uptime. The only thing that really matters is uptime, uptime, uptime, uptime and uptime. I want to get it down to a handful of times you might want to bring a Sun computer down in a year. I'm spending all my time with employees to get this design goal."
- Sun Microsystems: SunUP RASCAL program for high availability
- Motorola: 5NINES Initiative
- HP, Cisco, Oracle, SAP: 5nines:5minutes Alliance
- IBM: Cornhusker clustering technology for high availability, eLiza, autonomic computing
- Microsoft: Trustable computing initiative
- Microsoft: Regular full-page ad on 99.999% availability in USA Today
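To put the "five nines" figure in perspective, an availability level can be converted into allowed downtime per year. The sketch below assumes a 365-day year and treats availability as the long-run fraction of time the system is up; the function name is a hypothetical helper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored

def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, a in [("3 nines", 0.999), ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    print(f"{label} ({a:.5f}): {downtime_minutes_per_year(a):.2f} minutes of downtime per year")
```

At 99.999% availability the allowance is a little over five minutes per year, which is where the "five nines, five minutes" pairing comes from.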