Design of Reliable Systems and Networks ECE 542/CS 536 Lecture 2 Introduction

1
Design of Reliable Systems and Networks
ECE 542/CS 536 Lecture 2: Introduction
Prof. Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2
Class Information
  • Class Times
  • Class meets Monday and Wednesday from 10:30 a.m. to
    11:50 a.m. in Rm. 260, Mechanical Engineering
    Building.
  • Instructor
  • Ravi K. Iyer
  • Office: 255 CSL. Phone: 333-2510. Email:
    iyer_at_crhc.uiuc.edu
  • Hours: 12:30 - 1:30 p.m. on Mondays and other
    times by appointment
  • Class Web site
  • http://courses.ece.uiuc.edu/ece442
  • Class Material
  • Reading list available on the website
  • Class Notes on the web
  • Lecture notes on some topics available on the web

3
Outline
  • Overview and course objectives
  • Motivation for reliable system design
  • Taxonomy of dependable computing
  • Fault classes (hardware and software)
  • Failure sources

4
Course Overview
  • Introduce system view of reliable computing.
  • Hardware redundancy techniques.
  • Hardware and software error detection techniques
  • Illustrate some of these techniques on selected
    systems
  • Tandem fault-tolerant platform employing hardware
    redundancy
  • ARMORs, a software platform for designing highly
    available applications
  • Providing error detection to a DHCP (Dynamic Host
    Configuration Protocol) application and to
    call processing in a wireless telephone network
  • Designing a failure resilient node/network
    controller.
  • Software fault tolerance techniques including
    process pairs, robust data structures, recovery
    blocks, and N-version programming
  • Illustrate some of these techniques on selected
    systems
  • Tandem online transaction processing system
  • High availability design of IBM server

5
Course Overview (cont.)
  • Network-specific issues in designing reliable
    systems, including mechanisms/algorithms for
    supporting consistent data, reliable
    communications, and replication
  • Broadcast protocols, agreement protocols, and
    commit protocols
  • Illustrate some of these techniques in
    maintaining data consistency in a replicated
    DHCP server executing on the Chameleon ARMORs
    testbed
  • Review of example high-availability networked
    systems
  • Checkpointing and recovery techniques
  • Illustrate checkpointing on examples of
  • a distributed database system,
  • checkpointing of multithreaded processes
    (micro-checkpointing), and
  • the IRIX operating system.
  • Issues in experimental system evaluation

6
Recommended Reading
  • IEEE Transactions on Dependable and Secure Computing
  • [Prad96] D.K. Pradhan, ed., Fault Tolerant
    Computer System Design, Prentice-Hall, 1996.
  • [John89] B.W. Johnson, Design and Analysis of
    Fault Tolerant Digital Systems, Addison Wesley,
    1989.
  • [SiSw92] D.P. Siewiorek and R.S. Swarz, Reliable
    Computer Systems - Design and Evaluation, 2nd
    edition, Digital Press (distributed by
    Butterworth), 1992.
  • [Lyu95a] M.R. Lyu, Handbook of Software
    Reliability Engineering, McGraw-Hill, 1995.
  • [Lyu95b] M.R. Lyu, ed., Software Fault Tolerance,
    J. Wiley & Sons, 1995.
  • [Birm96] K.P. Birman, Building Secure and
    Reliable Network Applications, Manning, 1996.
  • [SiSh94] M. Singhal and N.G. Shivaratri, Advanced
    Concepts in Operating Systems, McGraw-Hill, 1994.

7
Why Study Reliable Computing!!!
  • Traditional needs
  • Long-life applications (e.g., unmanned and manned
    space missions)
  • Life-critical, short-term applications (e.g.,
    aircraft engine control, fly-by-wire)
  • Defense applications (e.g., aircraft, guidance
    control)
  • Nuclear industry
  • Telephone Switching systems
  • Mission-critical applications
  • Health industry
  • Automotive industry
  • Industrial control systems, production lines
  • Banking, reservations, commerce

8
Why Study Reliable Computing!!! (cont.)
  • Networks
  • Wired and wireless networked applications
  • Data mining
  • Distributed, networked systems (reliability and
    security are the major concerns)
  • commerce stores, catalog industry
  • Scientific computing, education
  • Typically, reliability is not an issue yet.
  • This is changing: in the new 10 Teraflop
    machines, reliability is a major concern.

9
Objectives
  • System (hardware, software) perspective/view on
    design issues in reliable computing

System layers, top to bottom, with the design question raised at each:
  • Applications: What can be provided in software and
    in the application itself?
  • Application program interface (API), SIFT,
    middleware
  • Reliable communications: What can be provided in
    the communication layer?
  • Operating system: What is typically provided in
    the operating system?
  • System network
  • Hardware (processing elements, memory, storage
    system): What can be provided in COTS hardware to
    ensure fail-silent behavior of system components
    (nodes, network)?
Questions that span the layers:
  • How to combine hardware and software fault
    tolerance techniques: (1) fast error detection
    in hardware, (2) high-efficiency detection and
    recovery in software
  • How to assess whether the achieved availability
    meets system requirements
10
How do We Achieve the Objectives?
System layers, top to bottom, with the techniques available at each:
  • Applications: checkpointing and rollback,
    application replication, software voting (fault
    masking), process pairs, robust data structures,
    recovery blocks, N-version programming
  • Application program interface (API), SIFT,
    middleware
  • Reliable communications: CRC on messages,
    acknowledgment, watchdogs, heartbeats,
    consistency protocols
  • Operating system: memory management, detection of
    process failures, hooks to support software fault
    tolerance for the application
  • System network
  • Hardware (processing elements, memory, storage
    system): error correcting codes, N-of-M and
    standby redundancy, voting, watchdog timers,
    reliable storage (RAID, mirrored disks)
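
Voting (fault masking) appears at both the application and hardware layers above. The following is a minimal sketch of a majority voter over replicated results, not a method taken from the lecture; the function name `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value holds a strict majority, which
    means the fault cannot be masked by voting alone.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    # most_common(1) yields the single most frequent (value, count) pair.
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


# Three replicas of the same computation; the second one is faulty.
replica_outputs = [42, 17, 42]
print(majority_vote(replica_outputs))
```

With three replicas this masks a single faulty result; when all three disagree the voter reports the condition instead of guessing.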
11
Examples of Computer-related Failures
Table: each event is classified by fault type (physical, design, interaction), by the attribute affected (availability/reliability, safety, confidentiality), and by failure extent (localized, distributed).
  • June 1980: False alerts at the North American Air
    Defense (NORAD) [Ford 85]
  • April 1981: First launch of the Space Shuttle
    postponed [Garman 81]
  • June 1985 - January 1987: Excessive radiotherapy
    doses (Therac-25) [Leveson & Turner 93]
  • August 1986 - 1987: The wily hacker penetrates
    several tens of sensitive computing facilities
    [Stoll 88]
  • November 1988: Internet worm [Spafford 89]
  • 15 January 1990: 9-hour outage of the
    long-distance phone network in the USA
    [Neumann 95]
  • February 1991: Scud missed by a Patriot (Dhahran,
    Gulf War) [Neumann 95]
  • November 1992: Crash of the communication system
    of the London ambulance service [HA 93]
  • 26 and 27 June 1993: Authorization denial of
    credit card operations in France
  • 4 June 1996: The maiden flight of the Ariane 5
    launcher ended in a failure (France)
12
Effect of major network outages on large business
customers
  • Large insurance carriers: 20k/hour
  • Major airlines: 2.5M/hour
  • Trading / investment banking: 6M/hour
Figure: percent of users (0 to 40) against downtime costs per hour (1k, 10k, 100k, 1M, 10M).
13
Dependable Computing
  • Dependability is the property of a computer system
    that allows reliance to be justifiably placed on
    the service it delivers. The service delivered by a
    system is its behavior as perceived by
    its user.

DEPENDABILITY
  • Attributes: availability, reliability, safety,
    confidentiality, integrity, maintainability
  • Means: fault prevention, fault tolerance, fault
    removal, fault forecasting
  • Impairments: faults, errors, failures
14
Fault Classes
  • Based on the temporal persistence
  • Permanent faults, whose presence is continuous
    and stable.
  • Intermittent faults, whose presence is only
    occasional due to unstable hardware or varying
    hardware and software states (e.g., as a function
    of load or activity).
  • Transient faults, resulting from temporary
    environmental conditions.
  • Based on the origin
  • Physical faults, stemming from physical phenomena
    internal to the system, such as threshold change,
    shorts, opens, etc., or from external changes,
    such as environmental, electromagnetic,
    vibration, etc.
  • Human-made faults, which may be either design
    faults, introduced during system design,
    modification, or establishment of operating
    procedures, or interaction faults, which are
    violations of operating or maintenance procedures.

15
Fault Cycle and Dependability Measures
  • Reliability: a measure of the continuous delivery
    of service. R(t) is the probability that the
    system survives (does not fail) throughout [0, t].
    Expected value: MTTF (Mean Time To Failure).
  • Maintainability: a measure of the service
    interruption. M(t) is the probability that the
    system will be repaired within a time less than t.
    Expected value: MTTR (Mean Time To Repair).
  • Availability: a measure of the service delivery
    with respect to the alternation of delivery
    and interruptions. A(t) is the probability that
    the system delivers a proper (conforming to
    specification) service at a given time t.
    Expected value: E[A] = MTTF / (MTTF + MTTR).
  • Safety: a measure of the time to catastrophic
    failure. S(t) is the probability that no
    catastrophic failures occur during [0, t].
    Expected value: MTTCF (Mean Time To
    Catastrophic Failure).
Fault cycle (timeline, in order):
  • Previous repair
  • Fault occurs
  • Error: the fault becomes active (e.g., memory has
    write 0)
  • Error detection (read memory, parity error)
  • Repair memory
  • Next fault occurs
Fault latency runs from fault occurrence to the error, error latency from the error to its detection, and repair time from detection to repair. MTTF, MTTR, and MTBF are marked along the same cycle.
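
The measures above can be tried out numerically. This is a minimal sketch, assuming a constant failure rate of 1/MTTF (an exponential model that the slide does not state) for R(t), and using the slide's expected-availability formula MTTF / (MTTF + MTTR). The function names and the sample MTTF and MTTR values are hypothetical.

```python
import math


def reliability(t, mttf):
    """R(t) under an ASSUMED constant failure rate of 1/mttf."""
    return math.exp(-t / mttf)


def expected_availability(mttf, mttr):
    """E[A] = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


mttf_hours = 1000.0  # hypothetical value, for illustration only
mttr_hours = 2.0     # hypothetical value, for illustration only

print(f"R(100 h) = {reliability(100.0, mttf_hours):.4f}")
print(f"E[A]     = {expected_availability(mttf_hours, mttr_hours):.6f}")
```

Shortening the repair time raises availability without changing reliability, which is why the two measures are reported separately.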
16
Faults, Errors, and Failures in Computing Systems
Faults lead to errors, which lead to failures.
  • Faults
  • Permanent (hard) faults: natural failures,
    natural radiation, HW design errors
  • Transient (soft) faults: power transients,
    switching transients, natural radiation, single
    upsets, multiple upsets
  • Intermittent faults: natural failures, power
    transients
  • Software faults: SW design errors, system
    upgrades, requirements changes
  • External faults
  • Errors (diagram element: Processor)
  • Failures (failure to meet requirements)
  • Reliability, long term: mission life
  • Reliability, short term: critical functions,
    database protection
  • Availability
  • Detection latencies
  • Containment boundaries
  • Recovery latencies
  • Autonomy
17
Hardware Fault Models
  • Stuck-at level. Example: physical failures in
    circuits
  • Lines in a gate level stuck at 0 or 1
  • Faulty contact
  • Transistor stuck open or closed
  • Metal lines open
  • Shorts between adjacent metal lines
  • Module level. Example: decoder
  • No output lines activated
  • An incorrect line activated instead of the
    desired line
  • An incorrect line activated in addition to the
    desired line
  • Functional level. Example: memories
  • One or more cells are stuck at 0 or 1
  • One or more cells fail to undergo a 0-1 or 1-0
    transition
  • Two or more cells are coupled: a 1-0 transition
    in one cell changes the contents of another cell
  • More than one cell is accessed during READ or
    WRITE
  • A wrong cell is accessed during READ or WRITE
  • System level. Example: a parallel processor
    topology
  • View the machine as a graph: nodes correspond to
    processors, edges correspond to links
  • Fault model: a processor (node) or link (edge) is
    faulty
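
The stuck-at memory model can be illustrated with a small simulation. This is a sketch only: the `FaultyMemory` class and `find_stuck_cells` function are hypothetical names, and the write-then-read test shown here detects stuck-at cells but not the coupling or addressing faults also listed for memories.

```python
class FaultyMemory:
    """Bit-addressable memory in which selected cells are stuck at 0 or 1."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        # Maps a cell address to the value it is stuck at (0 or 1).
        self.stuck_at = dict(stuck_at or {})

    def write(self, address, bit):
        self.cells[address] = bit

    def read(self, address):
        # A stuck cell ignores whatever was written to it.
        return self.stuck_at.get(address, self.cells[address])


def find_stuck_cells(memory, size):
    """Write all 0s, then all 1s, and report cells that fail to read back."""
    faulty = set()
    for pattern in (0, 1):
        for address in range(size):
            memory.write(address, pattern)
        for address in range(size):
            if memory.read(address) != pattern:
                faulty.add(address)
    return sorted(faulty)


# Hypothetical 8-cell memory: cell 2 stuck at 0, cell 5 stuck at 1.
memory = FaultyMemory(8, stuck_at={2: 0, 5: 1})
print(find_stuck_cells(memory, 8))
```

Both patterns are needed: a cell stuck at 0 passes the all-0s pass and only fails when 1s are written, and the reverse holds for a cell stuck at 1.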
18
Software Fault Models
  • IBM OS
  • Allocation management: Memory region used after
    deallocation
  • Copying overrun: Program copies data past the end
    of a buffer
  • Pointer management: Variable containing data
    address corrupted
  • Wrong algorithm: Program executes but
    uses the wrong algorithm
  • Uninitialized variable: Variable used before
    initialization
  • Undefined state: System goes into an unanticipated
    state
  • Data error: Program produces or reads wrong data
  • Statement logic: Statements executed in wrong
    order or omitted
  • Interface error: A module's interface incorrectly
    defined or incorrectly used
  • Memory leak: Program does not deallocate memory
    it has allocated
  • Synchronization: Error in locking or
    synchronization code
  • GUARDIAN 90
  • Incorrect computation: Arithmetic overflow or an
    incorrect arithmetic function
  • Data fault: Incorrect constant or variable
  • Data definition fault: Fault in declaring data or
    a data structure
  • Missing operation: Omission of a few lines of
    source code
  • Side effect of code update: Not all dependencies
    between software modules considered when
    updating software
  • Unexpected situation: Not providing routines to
    handle rare but legitimate operational scenarios

19
Software Fault Models (Myrinet Network Switch)
  • Message dropped: A message was dropped.
  • Data corrupted: A message with incorrect data
    was sent.
  • Restart: The Myrinet Control Program restarted
    itself.
  • Interface hung: The interface (on local or
    remote node) was not able to operate properly.
  • Computer crash: The system (local or remote node)
    crashed.

20
Failure Sources and Frequencies
  • Non-Fault-Tolerant Systems
  • Japan, 1383 organizations (Watanabe 1986,
    Siewiorek & Swarz 1992)
  • USA, 450 companies (FIND/SVP 1993)
  • Mean time to failure: 6 to 12 weeks
  • Average outage duration after failure: 1 to 4
    hours
  • Fault-Tolerant Systems
  • Tandem Computers (Gray 1990)
  • Bell Northern Research (Cramp et al. 1992)
  • Mean time to failure: 21 years (Tandem)

Figure: Failure Sources
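
The figures for non-fault-tolerant systems can be turned into a rough availability range. This is a sketch under a stated assumption: the average outage duration is treated as the MTTR in E[A] = MTTF / (MTTF + MTTR), which the slide does not say explicitly. No comparable number is computed for the fault-tolerant systems because the slide gives no outage duration for them.

```python
HOURS_PER_WEEK = 7 * 24


def availability(mttf_hours, mttr_hours):
    """Expected availability, E[A] = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Worst case from the slide's ranges: shortest MTTF, longest outage.
worst = availability(6 * HOURS_PER_WEEK, 4)
# Best case from the slide's ranges: longest MTTF, shortest outage.
best = availability(12 * HOURS_PER_WEEK, 1)

print(f"worst case: {worst:.4%}")
print(f"best case:  {best:.4%}")
```

Under that assumption the range works out to roughly 99.6% to 99.95% availability.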
21
Failure Sources and Frequencies: Permanent and
Transient Failures
  • Transient and permanent failures (CMU, Stanford,
    Illinois)
  • Ratio of transient failures to permanent failures
    is 4:1 (80% transient, 20% permanent), varying
    from 8:1 to 2:1.
  • MTBF (hours)
  • Tandem GUARDIAN: 98
  • Tandem NonStop-UX: 480 to 2040
  • Network of 69 SunOS workstations (CRHC): 5

22
Failure Sources and Frequencies: Availability
Assessment
23
Typical Recovery Latencies for a Hierarchical
Fault Tolerant Design
Figure: recovery level plotted against recovery latency, in decades from 10 s down to 1 ns (10 s, 1 s, 100 ms, 10 ms, 1 ms, 100 µs, 10 µs, 1 µs, 100 ns, 10 ns, 1 ns).