Title: Design of Reliable Systems and Networks, ECE 542/CS 536, Lecture 2: Introduction

1 Design of Reliable Systems and Networks
ECE 542/CS 536, Lecture 2: Introduction
Prof. Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2 Class Information
- Class Times
  - Class meets Monday and Wednesday from 10:30 a.m. to 11:50 a.m. in Rm. 260 Mechanical Engineering Building.
- Instructor
  - Ravi K. Iyer
  - Office: 255 CSL; Phone: 333-2510; Email: iyer_at_crhc.uiuc.edu
  - Hours: 12:30 - 1:30 p.m. on Mondays, and other times by appointment
- Class Web site
  - http://courses.ece.uiuc.edu/ece442
- Class Material
  - Reading list available on the website
  - Class notes on the web
  - Lecture notes on some topics available on the web
3 Outline
- Overview and course objectives
- Motivation for reliable system design
- Taxonomy of dependable computing
- Fault classes (hardware and software)
- Failure sources
4 Course Overview
- Introduce a system view of reliable computing
- Hardware redundancy techniques
- Hardware and software error detection techniques
- Illustrate some of these techniques on selected systems
  - Tandem fault-tolerant platform employing hardware redundancy
  - ARMORs, a software platform for designing highly available applications
    - Providing error detection to a DHCP (Dynamic Host Configuration Protocol) application and to call processing in a wireless telephone network
    - Designing a failure-resilient node/network controller
- Software fault tolerance techniques, including process pairs, robust data structures, recovery blocks, and N-version programming
- Illustrate some of these techniques on selected systems
  - Tandem online transaction processing system
  - High-availability design of an IBM server
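The software voting (fault masking) behind N-version programming can be sketched in a few lines. The three `v*` functions below are hypothetical stand-ins for independently developed versions of the same specification, not code from any system covered in the course; the voter itself is a standard majority vote:

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a majority of versions, or None
    when no majority exists (disagreement detected but not maskable)."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Hypothetical: three independently written versions of a rounded square root.
def v1(x): return round(x ** 0.5)
def v2(x): return int(x ** 0.5 + 0.5)
def v3(x): return 0  # faulty version: always returns 0

print(majority_vote([v(16) for v in (v1, v2, v3)]))  # prints 4: v3's fault is masked
```

With three versions, any single faulty version is outvoted; if all three disagree, the voter can only detect the fault, not mask it.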
5 Course Overview (cont.)
- Network-specific issues in designing reliable systems, including mechanisms/algorithms for supporting consistent data, reliable communications, and replication
  - Broadcast protocols, agreement protocols, and commit protocols
  - Illustrate some of these techniques in maintaining data consistency in a replicated DHCP server executing on the Chameleon ARMORs testbed
- Review of example high-availability networked systems
- Checkpointing and recovery techniques
  - Illustrate checkpointing on examples of
    - a distributed database system,
    - checkpointing of multithreaded processes (micro-checkpointing), and
    - the IRIX operating system.
- Issues in experimental system evaluation
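The checkpoint-and-rollback pattern recurs throughout the course. A toy sketch of the save/restore discipline (the `CheckpointedCounter` class is invented purely for illustration, not taken from any of the systems above):

```python
import copy

class CheckpointedCounter:
    """Toy checkpoint/rollback: save state before a risky step,
    restore the last consistent state if the step fails."""
    def __init__(self):
        self.state = {"count": 0}
        self._checkpoint = None

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def risky_increment(self, fail=False):
        self.checkpoint()                 # save state before the operation
        try:
            self.state["count"] += 1
            if fail:
                raise RuntimeError("injected transient fault")
        except RuntimeError:
            self.rollback()               # recover to last consistent state

c = CheckpointedCounter()
c.risky_increment()             # succeeds: count becomes 1
c.risky_increment(fail=True)    # fault injected: rolled back
print(c.state["count"])         # prints 1
```

Real systems checkpoint to stable storage (so state survives a crash) and must also coordinate checkpoints across processes, which is where the distributed-database and micro-checkpointing examples come in.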
6 Recommended Reading
- IEEE Trans. on Dependable and Secure Computing
- [Prad96] D.K. Pradhan, ed., Fault Tolerant Computer System Design, Prentice-Hall, 1996
- [John89] B.W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989
- [SiSw92] D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, 2nd ed., Digital Press (distributed by Butterworth), 1992
- [Lyu95a] M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1995
- [Lyu95b] M.R. Lyu, ed., Software Fault Tolerance, J. Wiley & Sons, 1995
- [Birm96] K.P. Birman, Building Secure and Reliable Network Applications, Manning, 1996
- [SiSh94] M. Singhal and N.G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994
7 Why Study Reliable Computing!!!
- Traditional needs
  - Long-life applications (e.g., unmanned and manned space missions)
  - Life-critical, short-term applications (e.g., aircraft engine control, fly-by-wire)
  - Defense applications (e.g., aircraft, guidance control)
  - Nuclear industry
  - Telephone switching systems
- Mission-critical applications
  - Health industry
  - Automotive industry
  - Industrial control systems, production lines
  - Banking, reservations, commerce
8 Why Study Reliable Computing!!! (cont.)
- Networks
  - Wired and wireless networked applications
  - Data mining
  - Distributed, networked systems (reliability and security are the major concerns)
  - Commerce stores, catalog industry
- Scientific computing, education
  - Typically, reliability is not an issue yet.
  - This is changing: in the new 10-teraflop machines, reliability is a major concern.
9 Objectives
- System (hardware, software) perspective/view on design issues in reliable computing
- Questions posed at each layer of the system stack:
  - Applications: What can be provided in software and in the application itself?
  - Application program interface (API), SIFT, middleware: How to combine hardware and software fault tolerance techniques: (1) fast error detection in hardware, (2) high-efficiency detection and recovery in software? How to assess whether the achieved availability meets system requirements?
  - Reliable communications: What can be provided in the communication layer?
  - Operating system: What is typically provided in the operating system?
  - Hardware (processing elements, memory, storage system) and system network: What can be provided in COTS hardware to ensure fail-silent behavior of system components (nodes, network)?
10 How Do We Achieve the Objectives?
Techniques at each layer of the same stack:
- Applications: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
- Application program interface (API), SIFT, middleware (reliable communications): CRC on messages, acknowledgments, watchdogs, heartbeats, consistency protocols
- Operating system: memory management, detection of process failures, hooks to support software fault tolerance for applications
- Hardware (processing elements, memory, storage system) and system network: error correcting codes, N-of-M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
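As one concrete illustration of the middleware-level techniques above, a minimal heartbeat monitor: a component is declared failed when no heartbeat arrives within a timeout. The `HeartbeatMonitor` class and its parameters are illustrative assumptions for this sketch, not code from the ARMORs platform; the clock is injectable so the example is deterministic:

```python
import time

class HeartbeatMonitor:
    """Declare a component failed if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable clock, monotonic by default
        self.last_beat = clock()

    def beat(self):
        """Called whenever a heartbeat message arrives from the monitored component."""
        self.last_beat = self.clock()

    def is_alive(self):
        return self.clock() - self.last_beat < self.timeout

# Simulated time so the example is reproducible.
now = [0.0]
mon = HeartbeatMonitor(timeout=2.0, clock=lambda: now[0])
mon.beat()               # heartbeat received at t = 0
now[0] = 1.0
print(mon.is_alive())    # prints True: still within the timeout
now[0] = 3.5
print(mon.is_alive())    # prints False: heartbeats missed, declare failure
```

Note the inherent trade-off: a short timeout detects failures quickly but risks false alarms when the network or the monitored node is merely slow.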
11Examples of Computer-related Failures
FAULTS
FAILURES
Availability / Reliability
Confidentiality
Safety
Design
Localized
Physical
Interaction
Distributed
False alerts at the North American Air
Defense (NORAD) Ford 85 First launch of the
Space Shuttle postponed Gaman 81 Excessive
radiotherapy doses (Therac-25) Leveson Turner
93 The wily hacker penetrates several tens of
sensitive computing facilities Stoll
88 Internet worm Spatford 89 9 hours outage
of the long-distance phone in the USA Neumann
95 Scud missed by a Patriot (Dhahran, Gulf
War) Neumann 95 Crash of the communication
system of the London ambulance service HA
93 Authorization denial of credit card
operations in France The maiden flight of the
Arine 5 launcher ended in a failure (France)
?
?
?
June 1980 April 1981 June 1985 - January
1987 August 1986 - 1987 November 1988 15
January 1990 February 1991 November 1992 26
and 27 June 1993 4 June 1996
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
12 Effect of Major Network Outages on Large Business Customers
- Large insurance carriers: $20k/hour
- Major airlines: $2.5M/hour
- Trading / investment banking: $6M/hour
[Chart: percent of users (0-40%) vs. downtime costs ($/hour) on a logarithmic scale from $1k to $10M]
13 Dependable Computing
- Dependability is the property of a computer system that allows reliance to be justifiably placed on the service it delivers. The service delivered by a system is its behavior as perceived by its user.
- The dependability tree:
  - Attributes: availability, reliability, safety, confidentiality, integrity, maintainability
  - Means: fault prevention, fault tolerance, fault removal, fault forecasting
  - Impairments: faults, errors, failures
14 Fault Classes
- Based on temporal persistence
  - Permanent faults, whose presence is continuous and stable.
  - Intermittent faults, whose presence is only occasional, due to unstable hardware or varying hardware and software states (e.g., as a function of load or activity).
  - Transient faults, resulting from temporary environmental conditions.
- Based on origin
  - Physical faults, stemming from physical phenomena internal to the system, such as threshold changes, shorts, opens, etc., or from external changes, such as environmental conditions, electromagnetics, vibration, etc.
  - Human-made faults, which may be either design faults, introduced during system design, modification, or establishment of operating procedures, or interaction faults, which are violations of operating or maintenance procedures.
15 Fault Cycle and Dependability Measures
- Reliability: a measure of the continuous delivery of service. R(t) is the probability that the system survives (does not fail) throughout [0, t]. Expected value: MTTF (Mean Time To Failure).
- Maintainability: a measure of the service interruption. M(t) is the probability that the system will be repaired within a time less than t. Expected value: MTTR (Mean Time To Repair).
- Availability: a measure of the service delivery with respect to the alternation of delivery and interruptions. A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t. Expected steady-state value: EA = MTTF / (MTTF + MTTR).
- Safety: a measure of the time to catastrophic failure. S(t) is the probability that no catastrophic failures occur during [0, t]. Expected value: MTTCF (Mean Time To Catastrophic Failure).

The fault cycle:
previous repair -> fault occurs -> [fault latency] -> error, i.e., the fault becomes active (e.g., memory has a faulty write of 0) -> [error latency] -> error detection (e.g., read memory, parity error) -> [repair time] repair (e.g., repair memory) -> next fault occurs.
MTTF spans from the previous repair to the next fault occurrence; MTTR covers the repair time; MTBF = MTTF + MTTR.
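The steady-state availability formula EA = MTTF / (MTTF + MTTR) can be exercised directly. The MTTF and MTTR figures below are hypothetical, chosen only to show how even a short repair time translates into yearly downtime:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability EA = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(a):
    """Expected unavailable time over one year, in minutes."""
    return (1 - a) * 365 * 24 * 60

# Hypothetical figures: MTTF of 6 weeks (typical of the non-fault-tolerant
# systems surveyed later in the lecture), MTTR of 2 hours.
a = availability(mttf_hours=6 * 7 * 24, mttr_hours=2.0)
print(f"EA = {a:.6f}, downtime = {downtime_minutes_per_year(a):.0f} min/year")
```

Note that availability alone hides the failure pattern: one 17-hour outage per year and a thousand one-minute outages give nearly the same EA but very different user experiences.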
16 Faults, Errors, and Failures in Computing Systems
Faults -> Errors -> Failures
- Faults
  - Permanent (hard) faults: natural failures, natural radiation, HW design errors
  - Transient (soft) faults: power transients, switching transients, natural radiation (single upsets, multiple upsets)
  - Intermittent faults: natural failures, power transients
  - Software faults: SW design errors, system upgrades, requirements changes
  - External faults
- Errors: detection latencies, containment boundaries, recovery latencies, autonomy
- Failures (failure to meet requirements): reliability, long term (mission life); reliability, short term (critical functions, database protection); availability
17Hardware Fault Models
Stack-at
Module level
Functional level
System level
Example a parallelprocessor topology View
machine as agraph - nodes correspond to
processors - edges correspond to links Fault
Model A processor (node) orlink (edge) faulty
Example Memories One or more cells arestuck at
0 or 1 One or more cells fail to undergo 0-1 or
1-0 transition Two or more cells arecoupled A
1-0 transition in one cell changes contents in
another cell More than one cell isaccessed
during READor WRITE A wrong cell is
accessedduring READ or WRITE
Example decoder No output linesactivated An
incorrect lineactivated instead of desired
line An incorrect lineactivated in additionto
desired line
Example physical failures in circuits Lines in
a gate level stuck at 0 or 1 Faulty
contact Transistor stuck open or closed Metal
lines open Shorts between adjacent metal lines
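The module-level stuck-at fault from the memory example above can be simulated in a few lines. `FaultyMemory` is a toy model invented for this illustration (a real memory test would run march patterns over the full address space):

```python
class FaultyMemory:
    """Toy word-addressable memory with injectable single-cell stuck-at faults."""
    def __init__(self, size):
        self.cells = [0] * size
        self.stuck = {}                    # address -> stuck value

    def inject_stuck_at(self, addr, value):
        self.stuck[addr] = value

    def write(self, addr, value):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck.get(addr, value)

    def read(self, addr):
        return self.stuck.get(addr, self.cells[addr])

mem = FaultyMemory(4)
mem.inject_stuck_at(2, 1)      # cell 2 is stuck at 1
mem.write(2, 0)
print(mem.read(2))             # prints 1: the write of 0 had no effect

# Simple detection pass: write 0 everywhere, read back, flag mismatches.
for addr in range(4):
    mem.write(addr, 0)
faulty = [addr for addr in range(4) if mem.read(addr) != 0]
print(faulty)                  # prints [2]
```

This is why a stuck-at-1 cell is invisible to any test that only ever writes 1 there; detecting both stuck-at-0 and stuck-at-1 requires writing and reading back both values.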
18 Software Fault Models
- IBM OS
  - Allocation management: memory region used after deallocation
  - Copying overrun: program copies data past the end of a buffer
  - Pointer management: variable containing a data address is corrupted
  - Wrong algorithm: program executes but uses the wrong algorithm
  - Uninitialized variable: variable used before initialization
  - Undefined state: system goes into an unanticipated state
  - Data error: program produces or reads wrong data
  - Statement logic: statements executed in the wrong order or omitted
  - Interface error: a module's interface incorrectly defined or incorrectly used
  - Memory leak: program does not deallocate memory it has allocated
  - Synchronization: error in locking or synchronization code
- GUARDIAN 90
  - Incorrect computation: arithmetic overflow or an incorrect arithmetic function
  - Data fault: incorrect constant or variable
  - Data definition fault: fault in declaring data or a data structure
  - Missing operation: omission of a few lines of source code
  - Side effect of code update: not all dependencies between software modules considered when updating software
  - Unexpected situation: not providing routines to handle rare but legitimate operational scenarios
19 Software Fault Models (Myrinet Network Switch)
- Message dropped: a message was dropped.
- Data corrupted: a message with incorrect data was sent.
- Restart: the Myrinet Control Program restarted itself.
- Interface hung: the interface (on the local or remote node) was not able to operate properly.
- Computer crash: the system (local or remote node) crashed.
20 Failure Sources and Frequencies
- Non-fault-tolerant systems
  - Japan, 1383 organizations (Watanabe 1986; Siewiorek & Swarz 1992)
  - USA, 450 companies (FIND/SVP 1993)
  - Mean time to failure: 6 to 12 weeks
  - Average outage duration after failure: 1 to 4 hours
- Fault-tolerant systems
  - Tandem Computers (Gray 1990)
  - Bell Northern Research (Cramp et al. 1992)
  - Mean time to failure: 21 years (Tandem)
[Chart: breakdown of failure sources]
21 Failure Sources and Frequencies: Permanent and Transient Failures
- Transient and permanent failures (CMU, Stanford, Illinois)
  - Ratio of transient failures to permanent failures is 4:1 (80% transient, 20% permanent), varying from 8:1 to 2:1.
- MTBF (hours)
  - Tandem GUARDIAN: 98
  - Tandem NonStop-UX: 480 to 2040
  - Network of 69 SunOS workstations (CRHC): 5
22 Failure Sources and Frequencies: Availability Assessment

23 Typical Recovery Latencies for a Hierarchical Fault-Tolerant Design
[Chart: recovery latency vs. recovery level, on a logarithmic scale from 1 ns to 10 s (1 ns, 10 ns, 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s)]