Title: Design of Reliable Systems and Networks, ECE 542/CS 536, Lecture 2: Introduction

1 Design of Reliable Systems and Networks
ECE 542/CS 536, Lecture 2: Introduction
Prof. Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2 Class Information
- Class Times
  - Class meets Monday and Wednesday from 10:30 a.m. to 11:50 a.m. in Rm. 260 Mechanical Engineering Building.
- Instructor
  - Ravi K. Iyer
  - Office: 255 CSL; Phone: 333-2510; Email: iyer_at_crhc.uiuc.edu
  - Hours: 12:30 - 1:30 p.m. on Mondays, and other times by appointment
- Class Web site
  - http://courses.ece.uiuc.edu/ece442
- Class Material
  - Reading list available on the website
  - Class notes on the web
  - Lecture notes on some topics available on the web
3 Outline
- Overview and course objectives
- Motivation for reliable system design
- Taxonomy of dependable computing
- Fault classes (hardware and software)
- Failure sources
4 Course Overview
- Introduce a system view of reliable computing
- Hardware redundancy techniques
- Hardware and software error detection techniques
- Illustrate some of these techniques on selected systems
  - Tandem fault-tolerant platform employing hardware redundancy
  - ARMORs, a software platform for designing highly available applications
    - Providing error detection to a DHCP (Dynamic Host Configuration Protocol) application and to call processing in a wireless telephone network
    - Designing a failure-resilient node/network controller
- Software fault tolerance techniques, including process pairs, robust data structures, recovery blocks, and N-version programming
- Illustrate some of these techniques on selected systems
  - Tandem online transaction processing system
  - High-availability design of an IBM server
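The software voting (fault masking) behind N-version programming can be sketched in a few lines. The three `v*` functions below are hypothetical stand-ins for independently developed versions of the same specification, not code from any system covered in the course; the voter itself is a standard majority vote:

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a majority of versions, or None
    when no majority exists (disagreement detected but not maskable)."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Hypothetical: three independently written versions of a rounded square root.
def v1(x): return round(x ** 0.5)
def v2(x): return int(x ** 0.5 + 0.5)
def v3(x): return 0  # faulty version: always returns 0

print(majority_vote([v(16) for v in (v1, v2, v3)]))  # prints 4: v3's fault is masked
```

With three versions, any single faulty version is outvoted; if all three disagree, the voter can only detect the fault, not mask it.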
5 Course Overview (cont.)
- Network-specific issues in designing reliable systems, including mechanisms/algorithms for supporting consistent data, reliable communications, and replication
  - Broadcast protocols, agreement protocols, and commit protocols
  - Illustrate some of these techniques in maintaining data consistency in a replicated DHCP server executing on the Chameleon ARMORs testbed
- Review of example high-availability networked systems
- Checkpointing and recovery techniques
  - Illustrate checkpointing on examples of
    - a distributed database system,
    - checkpointing of multithreaded processes (micro-checkpointing), and
    - the IRIX operating system.
- Issues in experimental system evaluation
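The checkpoint-and-rollback pattern recurs throughout the course. A toy sketch of the save/restore discipline (the `CheckpointedCounter` class is invented purely for illustration, not taken from any of the systems above):

```python
import copy

class CheckpointedCounter:
    """Toy checkpoint/rollback: save state before a risky step,
    restore the last consistent state if the step fails."""
    def __init__(self):
        self.state = {"count": 0}
        self._checkpoint = None

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def risky_increment(self, fail=False):
        self.checkpoint()                 # save state before the operation
        try:
            self.state["count"] += 1
            if fail:
                raise RuntimeError("injected transient fault")
        except RuntimeError:
            self.rollback()               # recover to last consistent state

c = CheckpointedCounter()
c.risky_increment()             # succeeds: count becomes 1
c.risky_increment(fail=True)    # fault injected: rolled back
print(c.state["count"])         # prints 1
```

Real systems checkpoint to stable storage (so state survives a crash) and must also coordinate checkpoints across processes, which is where the distributed-database and micro-checkpointing examples come in.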
6 Recommended Reading
- IEEE Trans. on Dependable and Secure Computing
- [Prad96] D.K. Pradhan, ed., Fault Tolerant Computer System Design, Prentice-Hall, 1996
- [John89] B.W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989
- [SiSw92] D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, 2nd ed., Digital Press (distributed by Butterworth), 1992
- [Lyu95a] M.R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1995
- [Lyu95b] M.R. Lyu, ed., Software Fault Tolerance, J. Wiley & Sons, 1995
- [Birm96] K.P. Birman, Building Secure and Reliable Network Applications, Manning, 1996
- [SiSh94] M. Singhal and N.G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994
7 Why Study Reliable Computing!!!
- Traditional needs
  - Long-life applications (e.g., unmanned and manned space missions)
  - Life-critical, short-term applications (e.g., aircraft engine control, fly-by-wire)
  - Defense applications (e.g., aircraft, guidance control)
  - Nuclear industry
  - Telephone switching systems
- Mission-critical applications
  - Health industry
  - Automotive industry
  - Industrial control systems, production lines
  - Banking, reservations, commerce
8 Why Study Reliable Computing!!! (cont.)
- Networks
  - Wired and wireless networked applications
  - Data mining
  - Distributed, networked systems (reliability and security are the major concerns)
  - Commerce stores, catalog industry
- Scientific computing, education
  - Typically, reliability is not an issue yet.
  - This is changing: in the new 10-teraflop machines, reliability is a major concern.
9 Objectives
- System (hardware, software) perspective/view on design issues in reliable computing
- Questions posed at each layer of the system stack:
  - Applications: What can be provided in software and in the application itself?
  - Application program interface (API), SIFT, middleware: How to combine hardware and software fault tolerance techniques: (1) fast error detection in hardware, (2) high-efficiency detection and recovery in software? How to assess whether the achieved availability meets system requirements?
  - Reliable communications: What can be provided in the communication layer?
  - Operating system: What is typically provided in the operating system?
  - Hardware (processing elements, memory, storage system) and system network: What can be provided in COTS hardware to ensure fail-silent behavior of system components (nodes, network)?
10 How Do We Achieve the Objectives?
Techniques at each layer of the same stack:
- Applications: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
- Application program interface (API), SIFT, middleware (reliable communications): CRC on messages, acknowledgments, watchdogs, heartbeats, consistency protocols
- Operating system: memory management, detection of process failures, hooks to support software fault tolerance for applications
- Hardware (processing elements, memory, storage system) and system network: error correcting codes, N-of-M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
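As one concrete illustration of the middleware-level techniques above, a minimal heartbeat monitor: a component is declared failed when no heartbeat arrives within a timeout. The `HeartbeatMonitor` class and its parameters are illustrative assumptions for this sketch, not code from the ARMORs platform; the clock is injectable so the example is deterministic:

```python
import time

class HeartbeatMonitor:
    """Declare a component failed if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable clock, monotonic by default
        self.last_beat = clock()

    def beat(self):
        """Called whenever a heartbeat message arrives from the monitored component."""
        self.last_beat = self.clock()

    def is_alive(self):
        return self.clock() - self.last_beat < self.timeout

# Simulated time so the example is reproducible.
now = [0.0]
mon = HeartbeatMonitor(timeout=2.0, clock=lambda: now[0])
mon.beat()               # heartbeat received at t = 0
now[0] = 1.0
print(mon.is_alive())    # prints True: still within the timeout
now[0] = 3.5
print(mon.is_alive())    # prints False: heartbeats missed, declare failure
```

Note the inherent trade-off: a short timeout detects failures quickly but risks false alarms when the network or the monitored node is merely slow.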
11Examples of Computer-related Failures
FAULTS
FAILURES
Availability / Reliability
Confidentiality
Safety
Design
Localized
Physical
Interaction
Distributed
False alerts at the North American Air
Defense (NORAD) Ford 85 First launch of the
Space Shuttle postponed Gaman 81 Excessive
radiotherapy doses (Therac-25) Leveson Turner
93 The wily hacker penetrates several tens of
sensitive computing facilities Stoll
88 Internet worm Spatford 89 9 hours outage
of the long-distance phone in the USA Neumann
95 Scud missed by a Patriot (Dhahran, Gulf
War) Neumann 95 Crash of the communication
system of the London ambulance service HA
93 Authorization denial of credit card
operations in France The maiden flight of the
Arine 5 launcher ended in a failure (France)
?
?
?
June 1980 April 1981 June 1985 - January
1987 August 1986 - 1987 November 1988 15
January 1990 February 1991 November 1992 26
and 27 June 1993 4 June 1996
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
12 Effect of Major Network Outages on Large Business Customers
- Large insurance carriers: $20k/hour
- Major airlines: $2.5M/hour
- Trading / investment banking: $6M/hour
[Chart: percent of users (0-40%) vs. downtime costs ($/hour) on a logarithmic scale from $1k to $10M]
13 Dependable Computing
- Dependability is the property of a computer system that allows reliance to be justifiably placed on the service it delivers. The service delivered by a system is its behavior as perceived by its user.
- The dependability tree:
  - Attributes: availability, reliability, safety, confidentiality, integrity, maintainability
  - Means: fault prevention, fault tolerance, fault removal, fault forecasting
  - Impairments: faults, errors, failures
14 Fault Classes
- Based on temporal persistence
  - Permanent faults, whose presence is continuous and stable.
  - Intermittent faults, whose presence is only occasional, due to unstable hardware or varying hardware and software states (e.g., as a function of load or activity).
  - Transient faults, resulting from temporary environmental conditions.
- Based on origin
  - Physical faults, stemming from physical phenomena internal to the system, such as threshold changes, shorts, opens, etc., or from external changes, such as environmental conditions, electromagnetics, vibration, etc.
  - Human-made faults, which may be either design faults, introduced during system design, modification, or establishment of operating procedures, or interaction faults, which are violations of operating or maintenance procedures.
15 Fault Cycle and Dependability Measures
- Reliability: a measure of the continuous delivery of service. R(t) is the probability that the system survives (does not fail) throughout [0, t]. Expected value: MTTF (Mean Time To Failure).
- Maintainability: a measure of the service interruption. M(t) is the probability that the system will be repaired within a time less than t. Expected value: MTTR (Mean Time To Repair).
- Availability: a measure of the service delivery with respect to the alternation of delivery and interruptions. A(t) is the probability that the system delivers a proper (conforming to specification) service at a given time t. Expected steady-state value: EA = MTTF / (MTTF + MTTR).
- Safety: a measure of the time to catastrophic failure. S(t) is the probability that no catastrophic failures occur during [0, t]. Expected value: MTTCF (Mean Time To Catastrophic Failure).

The fault cycle:
previous repair -> fault occurs -> [fault latency] -> error, i.e., the fault becomes active (e.g., memory has a faulty write of 0) -> [error latency] -> error detection (e.g., read memory, parity error) -> [repair time] repair (e.g., repair memory) -> next fault occurs.
MTTF spans from the previous repair to the next fault occurrence; MTTR covers the repair time; MTBF = MTTF + MTTR.
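The steady-state availability formula EA = MTTF / (MTTF + MTTR) can be exercised directly. The MTTF and MTTR figures below are hypothetical, chosen only to show how even a short repair time translates into yearly downtime:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability EA = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(a):
    """Expected unavailable time over one year, in minutes."""
    return (1 - a) * 365 * 24 * 60

# Hypothetical figures: MTTF of 6 weeks (typical of the non-fault-tolerant
# systems surveyed later in the lecture), MTTR of 2 hours.
a = availability(mttf_hours=6 * 7 * 24, mttr_hours=2.0)
print(f"EA = {a:.6f}, downtime = {downtime_minutes_per_year(a):.0f} min/year")
```

Note that availability alone hides the failure pattern: one 17-hour outage per year and a thousand one-minute outages give nearly the same EA but very different user experiences.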
16 Faults, Errors, and Failures in Computing Systems
Faults -> Errors -> Failures
- Faults
  - Permanent (hard) faults: natural failures, natural radiation, HW design errors
  - Transient (soft) faults: power transients, switching transients, natural radiation (single upsets, multiple upsets)
  - Intermittent faults: natural failures, power transients
  - Software faults: SW design errors, system upgrades, requirements changes
  - External faults
- Errors: detection latencies, containment boundaries, recovery latencies, autonomy
- Failures (failure to meet requirements): reliability, long term (mission life); reliability, short term (critical functions, database protection); availability
17Hardware Fault Models
Stack-at
Module level
Functional level
System level
Example a parallelprocessor topology View
machine as agraph - nodes correspond to
processors - edges correspond to links Fault
Model A processor (node) orlink (edge) faulty
Example Memories One or more cells arestuck at
0 or 1 One or more cells fail to undergo 0-1 or
1-0 transition Two or more cells arecoupled A
1-0 transition in one cell changes contents in
another cell More than one cell isaccessed
during READor WRITE A wrong cell is
accessedduring READ or WRITE
Example decoder No output linesactivated An
incorrect lineactivated instead of desired
line An incorrect lineactivated in additionto
desired line
Example physical failures in circuits Lines in
a gate level stuck at 0 or 1 Faulty
contact Transistor stuck open or closed Metal
lines open Shorts between adjacent metal lines
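The module-level stuck-at fault from the memory example above can be simulated in a few lines. `FaultyMemory` is a toy model invented for this illustration (a real memory test would run march patterns over the full address space):

```python
class FaultyMemory:
    """Toy word-addressable memory with injectable single-cell stuck-at faults."""
    def __init__(self, size):
        self.cells = [0] * size
        self.stuck = {}                    # address -> stuck value

    def inject_stuck_at(self, addr, value):
        self.stuck[addr] = value

    def write(self, addr, value):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck.get(addr, value)

    def read(self, addr):
        return self.stuck.get(addr, self.cells[addr])

mem = FaultyMemory(4)
mem.inject_stuck_at(2, 1)      # cell 2 is stuck at 1
mem.write(2, 0)
print(mem.read(2))             # prints 1: the write of 0 had no effect

# Simple detection pass: write 0 everywhere, read back, flag mismatches.
for addr in range(4):
    mem.write(addr, 0)
faulty = [addr for addr in range(4) if mem.read(addr) != 0]
print(faulty)                  # prints [2]
```

This is why a stuck-at-1 cell is invisible to any test that only ever writes 1 there; detecting both stuck-at-0 and stuck-at-1 requires writing and reading back both values.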
18 Software Fault Models
- IBM OS
  - Allocation management: memory region used after deallocation
  - Copying overrun: program copies data past the end of a buffer
  - Pointer management: variable containing a data address is corrupted
  - Wrong algorithm: program executes but uses the wrong algorithm
  - Uninitialized variable: variable used before initialization
  - Undefined state: system goes into an unanticipated state
  - Data error: program produces or reads wrong data
  - Statement logic: statements executed in the wrong order or omitted
  - Interface error: a module's interface incorrectly defined or incorrectly used
  - Memory leak: program does not deallocate memory it has allocated
  - Synchronization: error in locking or synchronization code
- GUARDIAN 90
  - Incorrect computation: arithmetic overflow or an incorrect arithmetic function
  - Data fault: incorrect constant or variable
  - Data definition fault: fault in declaring data or a data structure
  - Missing operation: omission of a few lines of source code
  - Side effect of code update: not all dependencies between software modules considered when updating software
  - Unexpected situation: not providing routines to handle rare but legitimate operational scenarios
19 Software Fault Models (Myrinet Network Switch)
- Message dropped: a message was dropped.
- Data corrupted: a message with incorrect data was sent.
- Restart: the Myrinet Control Program restarted itself.
- Interface hung: the interface (on the local or remote node) was not able to operate properly.
- Computer crash: the system (local or remote node) crashed.
20 Failure Sources and Frequencies
- Non-fault-tolerant systems
  - Japan, 1383 organizations (Watanabe 1986; Siewiorek & Swarz 1992)
  - USA, 450 companies (FIND/SVP 1993)
  - Mean time to failure: 6 to 12 weeks
  - Average outage duration after failure: 1 to 4 hours
- Fault-tolerant systems
  - Tandem Computers (Gray 1990)
  - Bell Northern Research (Cramp et al. 1992)
  - Mean time to failure: 21 years (Tandem)
[Chart: breakdown of failure sources]
21 Failure Sources and Frequencies: Permanent and Transient Failures
- Transient and permanent failures (CMU, Stanford, Illinois)
  - Ratio of transient failures to permanent failures is 4:1 (80% transient, 20% permanent), varying from 8:1 to 2:1.
- MTBF (hours)
  - Tandem GUARDIAN: 98
  - Tandem NonStop-UX: 480 to 2040
  - Network of 69 SunOS workstations (CRHC): 5
22 Failure Sources and Frequencies: Availability Assessment

23 Typical Recovery Latencies for a Hierarchical Fault-Tolerant Design
[Chart: recovery latency vs. recovery level, on a logarithmic scale from 1 ns to 10 s (1 ns, 10 ns, 100 ns, 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s)]