Fault Tolerance Computing - PowerPoint PPT Presentation

Provided by: steve965
1
Fault Tolerance Computing
  • Adnan Agbaria

2
System Model and Basic Concepts
3
Staff
  • Dr. Adnan Agbaria
  • adnan_at_il.ibm.com
  • Office Hours
  • Right after the class
  • Monday 8:30-10:00
  • Course URL
  • http://cs.haifa.ac.il/courses/ftc

4
Materials
  • Textbooks
  • Distributed Systems, 2nd edition. Sape Mullender
    (Editor), ACM Press Frontier Series,
    Addison-Wesley.
  • K. Birman. Building Secure and Reliable Network
    Applications. Manning Publishing Company and
    Prentice Hall, December 1996.
  • J.-C. Laprie. Dependability: Basic Concepts and
    Terminology. Springer-Verlag, 1992.
  • P. Jalote. Fault Tolerance in Distributed
    Systems. Prentice-Hall, Inc., 1994.
  • Research papers
  • See the list at the web site

5
Grading and Prerequisites
  • Grading
  • Participation: 20%
  • Presentation: 40%
  • Home assignments: 40%
  • Prerequisites
  • Operating systems
  • Networking
  • Algorithms

6
Course Outline
  • Definition and basic concepts
  • Replications
  • Group Communication and Virtual Synchrony
  • Consensus and Byzantine Agreement
  • Checkpoint/Restart: basic concepts
  • Distributed Checkpointing

7
Course Outline (Cont'd)
  • Student presentations: Replication
  • Student presentations: Failure detection
  • Student presentations: Group communication and
    virtual synchrony
  • Student presentations: Distributed checkpointing
  • Network computer security and intrusion tolerance

8
Student Presentations
  • Every student should send me an email
  • Two preferred papers to present
  • 1st paper is the most wanted!
  • Each presentation is
  • 30 min presentation
  • 15 min Q&A
  • Homework questions may include material from the
    student presentations as well as from the
    lecturer's presentations.

9
Outline
  • Motivation
  • Concepts, definitions, notations, and system
    model
  • Fault model
  • Synchronous and asynchronous models
  • Time in distributed systems

10
Motivation
  • The system downtime cost is very high
  • $4 billion annually: the estimated cost of
    system downtime in North American companies
    (source: Computer Economics Infocorp.
    Consulting)
  • Availability is still low
  • "despite the Internet driving a significantly
    increased desire for continuous availability,
    through 2005, fewer than 20 percent of
    mission-critical Web-based applications will
    achieve it. Around 40 percent will achieve high
    availability at lower cost" (Source: The Gartner
    Group)

11
Motivation (Cont'd)
  • The impact of failures is VERY costly.
  • In terms of money, time, and lives.
  • Examples: banking, air traffic control, telephone
    systems, weather forecasting, etc.
  • There is no way to prevent failures
  • So what can we do? Fault tolerance
  • Goals
  • High availability and reliability.
  • Ways
  • Fault tolerance

12
Distributed System - Definition
  • A distributed system consists of a collection of
    autonomous computers, connected through a network
    and distribution middleware, which enables
    computers to coordinate their activities and to
    share the resources of the system, so that users
    perceive the system as a single, integrated
    computing facility.

13
Why Distributed Systems?
  • Information and Hardware Sharing
  • Scalability
  • Availability
  • Fault Tolerance
  • Price/performance

14
Types of Distributed Systems
  • Client/Server
  • Web (HTTP), NFS, Automatic Teller Machines
  • Group computing
  • Distributed/replicated servers
  • Pub/sub and messaging-based systems
  • Collaborative computing (CSCW)
  • Teleteaching, telemedicine, video-conferencing,
    Lotus Notes, shared windows sessions
  • Parallel (cluster) computing in distributed
    environments
  • Message passing interface (MPI)
  • Distributed Shared Memory

15
System Model
  • Distributed system with n processes,
  • Denoted by P1, P2, ..., Pn
  • Each process has local memory and CPU.
  • Processes communicate via an asynchronous network
    by send/receive events.
  • Processes are asynchronous too
  • They do not share a global clock.
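
The system model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Network` and `Process`, the `demo` function, and the use of a seeded random pick to stand in for arbitrary message delay are all introduced here and are not part of the course material.

```python
import random


class Network:
    """Asynchronous network: messages arrive eventually, in no fixed order."""

    def __init__(self, seed=0):
        self.in_transit = []             # messages sent but not yet delivered
        self.rng = random.Random(seed)   # models unpredictable delays

    def send(self, src, dst, payload):
        self.in_transit.append((src, dst, payload))

    def deliver_one(self, processes):
        """Hand one arbitrary in-transit message to its destination."""
        index = self.rng.randrange(len(self.in_transit))
        src, dst, payload = self.in_transit.pop(index)
        processes[dst].receive(src, payload)


class Process:
    """A process with local memory only; no shared state, no global clock."""

    def __init__(self, pid, network):
        self.pid = pid
        self.network = network
        self.inbox = []                  # local memory

    def send(self, dst, payload):        # Send(M) event
        self.network.send(self.pid, dst, payload)

    def receive(self, src, payload):     # Receive(M) event
        self.inbox.append((src, payload))


def demo(n=3):
    """P1 greets every other process, and every other process greets P1."""
    net = Network(seed=42)
    procs = {i: Process(i, net) for i in range(1, n + 1)}
    for i in range(2, n + 1):
        procs[1].send(i, f"hello {i}")
        procs[i].send(1, f"hi from {i}")
    while net.in_transit:
        net.deliver_one(procs)
    return procs
```

Because delivery order is arbitrary, a receiver can rely only on the set of messages it eventually gets, not on the order in which they were sent.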

16
Basic Events
[Figure: basic events. Labels: Computation, Send(M), Network, Receive(M), Recovery]
17
A Drawing Conception
[Figure: processes P1 and P2 exchanging messages m1, m2, and m3]
18
Fault Tolerance
  • Hardware, software and networks fail!
  • Sources of failures
  • Human, radiation, etc.
  • There are
  • Intentional faults: mainly caused by attacks,
    viruses/worms, etc.
  • Non-intentional faults: mainly caused by
  • Bugs in the code
  • Incorrect configuration and deployment
  • Environment
  • The rate of failures is still too high
  • Impossible to prevent failures!
  • So, what can we do?

19
Fault Tolerance (Cont'd)
  • Distributed systems must maintain availability
    even at low levels of hardware/software/network
    reliability.
  • Fault tolerance is achieved, mainly, by
  • Recovery
  • Recover the machine, system, or application upon
    any failure
  • Failures should be detected.
  • Where to recover from?
  • Start everything from scratch, or
  • Restart from a pre-captured state.
  • Cost and performance
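
The two recovery options above (start from scratch, or restart from a pre-captured state) can be contrasted with a small sketch. The class `CheckpointedCounter` and its methods are hypothetical names chosen for this example; a real system would write the checkpoint to stable storage rather than keep it in memory.

```python
import copy


class CheckpointedCounter:
    """Toy application whose state can be checkpointed and restored."""

    INITIAL_STATE = {"step": 0, "total": 0}

    def __init__(self):
        self.state = copy.deepcopy(self.INITIAL_STATE)
        self._checkpoint = copy.deepcopy(self.state)

    def work(self, value):
        """One unit of computation that changes the state."""
        self.state["step"] += 1
        self.state["total"] += value

    def checkpoint(self):
        """Capture the current state (in memory here, for simplicity)."""
        self._checkpoint = copy.deepcopy(self.state)

    def recover(self, from_scratch=False):
        """After a detected failure, restart from scratch or from the checkpoint."""
        if from_scratch:
            self.state = copy.deepcopy(self.INITIAL_STATE)
        else:
            self.state = copy.deepcopy(self._checkpoint)
```

Restarting from the checkpoint loses only the work done since the last capture, which is the cost and performance trade-off: frequent checkpoints cost more during normal operation but save more work after a failure.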

20
Fault Tolerance (Cont'd)
  • Redundancy
  • If one component fails, there is a spare to take
    over.
  • How does the spare know when to take over?
  • How often do we update the spare?
  • How many spares do we need?
  • Cost and performance.
  • Self stabilization
  • If the system is in a faulty state, it should
    detect this and go back to the normal state.
  • How does the system do that?
  • We do not consider this technique in the course.
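
One possible answer to the questions about the spare is a primary-backup pair driven by heartbeats. The sketch below is an assumption made for illustration, not a design taken from the course: the names `Replica` and `PrimaryBackup`, the timeout value, and the choice to ship state on every heartbeat are all hypothetical. Time is passed in explicitly so that the example is deterministic.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.alive = True


class PrimaryBackup:
    """One primary and one spare; the spare takes over after a silent period."""

    def __init__(self, timeout=3.0):
        self.primary = Replica("primary")
        self.spare = Replica("spare")
        self.timeout = timeout
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Sent by the primary while alive; it also carries the latest state."""
        if self.primary.alive:
            self.last_heartbeat = now
            self.spare.state = self.primary.state

    def active(self, now):
        """Return the replica that should serve requests at time `now`."""
        if now - self.last_heartbeat > self.timeout:
            return self.spare   # primary has been silent too long
        return self.primary
```

Updates made on the primary after its last heartbeat never reach the spare, so a longer heartbeat interval lowers overhead but increases the amount of state that can be lost on takeover.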

21
Reliability
  • Means that the system continuously delivers
    correct service.
  • The reliability R(t) of a system SYS can be
    expressed as
  • R(t) = Prob(SYS is fully functioning in [0, t])
  • A metric for reliability R(t) is MTTF, the Mean
    Time To Failure
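
As a worked example under an extra assumption that the slide does not make, suppose failures occur at a constant rate λ (the exponential failure model). Reliability and MTTF are then related as follows, where the identity MTTF = ∫ R(t) dt holds in general and only the closed forms depend on the assumed model:

```latex
R(t) = e^{-\lambda t},
\qquad
\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt
             = \int_0^{\infty} e^{-\lambda t}\,dt
             = \frac{1}{\lambda}
```

For instance, λ = 0.001 failures per hour gives MTTF = 1000 hours, and the probability of surviving the first 1000 hours is R(1000) = e^{-1} ≈ 0.37.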

22
Availability
  • Means that the system delivers service whenever
    it is required by authorized users
  • The availability A(t) of a system SYS can be
    expressed as
  • A(t) = Prob(SYS is fully functioning at time t)
  • A metric for the average, steady-state
    availability is
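
The slide's own expression is not reproduced in this transcript. A common textbook form, assumed here rather than taken from the slide, is MTTF / (MTTF + MTTR), where MTTR (Mean Time To Repair) is a quantity introduced for this sketch. The function names below are hypothetical.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up, assuming A = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability, hours_per_year=8760.0):
    """Expected yearly downtime implied by a given availability."""
    return (1.0 - availability) * hours_per_year
```

Under this assumed formula, a system that runs 999 hours between failures and takes 1 hour to repair has availability 0.999, which corresponds to about 8.76 hours of downtime per year.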

23
Failure, Error, and Fault
  • Failure: transition from proper to improper
    service
  • Error: that part of the system state that is
    liable to lead to subsequent failure
  • Fault: the hypothesized cause of error(s)

[Figure: Fault -> (activation) -> Error -> (propagation) -> Failure -> (causation) -> Fault]
24
Failure Types
  • Crash: fail-stop mode. The process is no longer
    active
  • Omission: failure to send/receive a message
  • Transient: a temporary failure that may affect
    the system functionality
  • Byzantine: the process exhibits arbitrary
    behavior
  • Malicious: an intentional failure, usually caused
    by attacks

25
Means to attain Availability and Reliability
  • Fault prevention
  • Try to prevent faults before they happen.
  • Examples
  • Using a strongly typed programming language
  • Firewalls for preventing intrusions (for
    intrusion tolerance)
  • Fault tolerance
  • Handling failures and trying to continue
    providing correct functionality.
  • Examples
  • Checkpoint/Restart, Replication, and
    Self-Stabilization.
  • Replication with Byzantine Agreement (for
    intrusion tolerance)
  • Fault Detection
  • Detecting and removing the faults.
  • Examples
  • Timeout-based detection: for detecting crash
    failures.
  • Anomaly-based detection: Intrusion Detection
    Systems (IDSs)
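
Timeout-based detection of crash failures can be sketched as follows. The class `TimeoutDetector` and its method names are assumptions made for this example, and time is passed in explicitly instead of being read from a real clock.

```python
class TimeoutDetector:
    """Suspect a process if nothing has been heard from it for `timeout` units."""

    def __init__(self, pids, timeout):
        self.timeout = timeout
        self.last_heard = {pid: 0.0 for pid in pids}

    def heard_from(self, pid, now):
        """Record that a message (e.g. a heartbeat) from `pid` arrived at `now`."""
        self.last_heard[pid] = now

    def suspects(self, now):
        """Processes that have been silent for longer than the timeout."""
        return sorted(
            pid for pid, last in self.last_heard.items()
            if now - last > self.timeout
        )
```

A suspicion is not proof of a crash: a process whose messages are merely slow is suspected too and drops off the list once it is heard from again. In a fully asynchronous system no timeout value can rule out such false suspicions.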

26
Correct vs. Faulty
  • Look at a complete run (execution)
  • External observer's view
  • A process that does not fail in a run is correct
    in that run
  • Otherwise, the process is faulty in the run
  • A process that fails at any time in the run is
    faulty throughout the entire run

27
Threshold Failure Model
  • t out of n processes may fail
  • t is usually given as a function of n, e.g.,
  • t < n
  • 2t < n
  • 3t < n
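
Each bound above has the form k·t < n for k = 1, 2, or 3. A short sketch shows how many failures a system of n processes tolerates under each bound, and how many processes are needed to tolerate t failures. The function names `max_faulty` and `min_processes` are hypothetical.

```python
def max_faulty(n, k):
    """Largest t such that k * t < n."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return (n - 1) // k


def min_processes(t, k):
    """Smallest n such that k * t < n."""
    if t < 0 or k < 1:
        raise ValueError("t must be non-negative and k positive")
    return k * t + 1
```

For example, under 3t < n a system of 4 processes tolerates 1 failure, and tolerating 2 failures requires at least 7 processes.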

28
Examples
29
A Database System
  • Transactions
  • Initiate a connection with the bank server and
    ask for a financial transaction on your account.
  • The server updates the database.
  • The server sends a confirmation message to the
    user.

30
A Database System (Cont'd)
  • Possible Failures
  • The server may crash
  • One of the connections may be down (cut).

31
Synchronous vs. Asynchronous
  • Synchrony assumptions
  • Message latency is bounded
  • Processes have synchronized clocks
  • Processing times are bounded
  • Asynchrony: no assumptions
  • Asynchronous models are more practical.
  • The Internet is an asynchronous system.

32
Example: The Coordinated Attack Problem
  • Definition
  • Two armies (red and blue) surround a town.
  • The two armies want to coordinate their attack on
    the town.
  • Victory is achieved if and only if both armies
    attack simultaneously. Otherwise, the attacking
    army will be defeated.
  • The generals (red and blue) communicate by
    messengers.
  • Messengers can be captured (message loss) and/or
    can take arbitrarily long.

33
The Coordinated Attack Problem (Cont'd)
  • There is no solution for the problem in the
    asynchronous model.
  • There is a solution in the synchronous model (?)
  • Can we add some requirements to the asynchronous
    model to solve the problem?
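
A minimal sketch of why a fixed chain of acknowledgements does not help when messengers can be captured. The protocol and the decision rule are assumptions made for illustration: blue sends the first message, the generals then alternate replies for a fixed number of messages, and a general attacks only if every message addressed to it arrived. The function name `run_handshake` is hypothetical.

```python
def run_handshake(rounds, lost=None):
    """Simulate `rounds` alternating messages, starting with blue -> red.

    `lost` is the index of the captured messenger, or None if all arrive.
    Every message after the first is a reply, so once one messenger is
    captured no further messages are sent.
    Returns (blue_attacks, red_attacks).
    """
    expected = {"blue": 0, "red": 0}
    received = {"blue": 0, "red": 0}
    chain_alive = True
    for i in range(rounds):
        receiver = "red" if i % 2 == 0 else "blue"
        expected[receiver] += 1
        if i == lost:
            chain_alive = False          # messenger captured
        if chain_alive:
            received[receiver] += 1
    return tuple(received[g] == expected[g] for g in ("blue", "red"))
```

With no loss both generals attack. If the last messenger is captured, its sender has received everything it expected and attacks, while the intended receiver does not. Adding more acknowledgement rounds only moves this problem to the new last message.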

34
Time in Distributed Systems
  • Logical time
  • Causality
  • Similar to visible knowledge that advances at the
    speed of light
  • Wall clock time / real time / global time
  • Clock skew
  • The rate at which local clocks drift w.r.t. each
    other. Depends on the clocks' quality, but also
    on temperature, magnetic field, etc.
  • It is possible to obtain time from GPS or radio
    clocks, but the latency (both of the signal and
    of handling the signal inside the computer) can
    vary a bit. Also, it may not be available when
    there is no line of sight to the sky.
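
The slide does not name a specific logical-time scheme. One well-known way to track causality without a global clock is Lamport's logical clock, shown here as an assumed example; the class and method names are chosen for this sketch.

```python
class LamportClock:
    """Logical clock: a counter that respects the causal order of events."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: advance the counter and return the message timestamp."""
        return self.tick()

    def receive(self, message_time):
        """Receive event: jump past both the local time and the timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

If event a causally precedes event b, the timestamp of a is smaller than that of b. The converse does not hold: a smaller timestamp alone does not imply causality.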