CTIS 490 DISTRIBUTED SYSTEMS - PowerPoint PPT Presentation

Slides: 25

Provided by: cneyt

1
CTIS 490 DISTRIBUTED SYSTEMS

WEEK 9
FAULT TOLERANCE

2
INTRODUCTION

Another feature of distributed systems that
distinguishes them from non-distributed systems
is the notion of a partial failure.
A partial failure may happen when one component
in a distributed system fails. This failure may
affect the proper operation of some components,
while at the same time leaving other components
unaffected.
One of the goals of distributed systems design is
to construct the system so that it can
automatically recover from partial failures. In
particular, whenever a failure occurs, the
distributed system should continue to operate in
an acceptable way while repairs are being made.
In other words, it should tolerate faults and
continue to operate.

3
FAULT TOLERANCE

Being fault tolerant is strongly related to what
are called dependable systems.
Dependable systems have the following attributes:
Availability is defined as the property that a
system is ready to be used immediately. It is the
probability that the system is operating
correctly at any given moment and is available to
perform its functions.
Reliability refers to the property that a
system can run continuously without failure. In
contrast to availability, reliability is defined
in terms of a time interval instead of an instant
of time.
A highly reliable system is one that will
most likely continue to work without interruption
during a relatively long period of time. If a
system goes down for 1 millisecond every hour, it
has an availability of over 99.999 percent, but
it is unreliable. If a system never crashes but
is shut down for two weeks every August, it has
high reliability but only 96 percent availability.
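
The two figures above can be checked with a few lines of arithmetic. This is only a sketch: the helper name availability is made up for illustration, and a year is assumed to be 52 weeks.

```python
HOUR_MS = 60 * 60 * 1000  # milliseconds in one hour


def availability(downtime: float, period: float) -> float:
    """Fraction of the period during which the system is up (same unit for both)."""
    return 1.0 - downtime / period


# Down for 1 ms in every hour: very high availability, yet it fails every hour.
a_hourly_glitch = availability(1, HOUR_MS)

# Down for 2 weeks in every 52-week year (assumed year length): never crashes,
# but availability is noticeably lower.
a_august_shutdown = availability(2, 52)

print(f"1 ms per hour:      {a_hourly_glitch:.5%}")
print(f"2 weeks per year:   {a_august_shutdown:.1%}")
```

Availability only looks at the fraction of time the system is up, so it cannot tell these two systems apart in the way reliability does.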

4
FAULT TOLERANCE

Safety refers to the situation in which a system
temporarily fails to operate correctly, yet
nothing catastrophic happens. For example, many
process control systems, such as those used for
controlling nuclear power plants or sending
people into space, are required to provide a high
degree of safety.
Maintainability refers to how easily a failed
system can be repaired. A highly maintainable
system may also show a high degree of
availability, especially if failures can be
detected and repaired automatically.
Dependable systems are also required to provide
a high degree of security.

5
FAULT TOLERANCE

Faults can be classified as follows:
Transient faults occur once and then disappear.
A bird flying through the beam of a microwave
transmitter may cause lost bits on some networks.
Intermittent faults occur, then vanish, then
reappear. A loose contact on a connector will
often cause this kind of fault.
Permanent faults continue to exist until the
faulty component is replaced. Burnt-out chips and
software bugs are examples of this type of fault.

6
FAILURE MODELS
7
FAILURE MODELS

Crash failures occur when a server prematurely
halts, but was working correctly until it
stopped. An example of a crash failure is an
operating system that comes to a halt, where the
only solution is to reboot it.
Omission failures occur when a server fails to
respond to a request. Maybe the server never got
the request in the first place, or maybe it could
not send a response (the server is said to hang).
Timing (performance) failures occur when the
response lies outside a specified real-time
interval. In data streaming, timing is very
important.

8
FAILURE MODELS

Response failures occur when a server's response
is incorrect. For example, a search engine returns
Web pages not related to the search terms. When a
server receives a message that it cannot
recognize, it may incorrectly take default
actions.
Arbitrary failures, also known as Byzantine
failures, are the most serious faults. They may
happen when a server produces an output it should
never have produced, but which cannot be detected
as being incorrect. A server may even be
maliciously working together with other servers
to produce intentionally wrong answers.

9
REDUNDANCY

A fault tolerant system should hide the
occurrence of failures from other processes.
The key technique for masking faults is to use
redundancy.
There are three kinds of redundancy:
Information redundancy: extra bits are added to
allow recovery from garbled bits.
Time redundancy: an action is performed, and
then, if need be, it is performed again. For
example, if a transaction aborts, it can be
redone. Time redundancy is helpful when the
faults are transient or intermittent.

10
REDUNDANCY

Physical redundancy: extra equipment or
processes are added to make it possible for the
system as a whole to tolerate the loss or
malfunctioning of some components. Physical
redundancy can be done either in hardware or
software.
Physical redundancy is used in biology (mammals
have two eyes, two ears, etc.), in aircraft (a
747 has four engines, but it can fly on three),
and in sports (multiple referees).
Physical redundancy is also used in electronic
circuits.
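
Of the three kinds of redundancy, time redundancy is the easiest to show in code: perform the action and, if it fails, perform it again. The sketch below is illustrative only. TransientError, with_retries and make_flaky are hypothetical names, and a fixed number of attempts is assumed.

```python
class TransientError(Exception):
    """Hypothetical error type standing for a fault that may vanish on a retry."""


def with_retries(action, attempts=3):
    """Perform the action; if it fails transiently, perform it again.

    Gives up and re-raises the last error after `attempts` tries, which is
    what happens when the fault turns out to be permanent.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except TransientError as err:
            last_error = err
    raise last_error


def make_flaky(failures):
    """Build a test action that fails `failures` times and then succeeds."""
    remaining = {"count": failures}

    def action():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("garbled message")
        return "committed"

    return action


print(with_retries(make_flaky(2)))  # two transient faults are masked
```

Retrying helps only with transient or intermittent faults. With a permanent fault every attempt fails, and the caller still sees the error.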

11
REDUNDANCY
Triple modular redundancy.
12
TRIPLE MODULAR REDUNDANCY

In triple modular redundancy (TMR), each device
is replicated three times.
Each voter is a circuit that has three inputs and
one output. If two or three of the inputs are the
same, the output is equal to that input. If all
three inputs are different, the output is undefined.
Three voters are also needed at each stage,
because a voter is also a component and can also
be faulty.
Although not all fault tolerant distributed
systems use TMR, the technique is very general
and gives a good idea of what a fault-tolerant
system is, as opposed to a system whose
individual components are highly reliable but
whose overall design cannot tolerate faults.
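
The voter rule described above can be sketched in software. This is a toy model, not a circuit description: vote and tmr_stage are hypothetical names, and None is used to stand for the "undefined" output.

```python
from collections import Counter


def vote(a, b, c):
    """Majority voter: return the value shared by at least two inputs.

    Returns None (standing for "undefined") when all three inputs differ.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def tmr_stage(inputs, units):
    """One TMR stage: three replicated units followed by three replicated voters.

    Each unit processes its own input; every voter sees all three unit outputs.
    """
    outputs = [unit(x) for unit, x in zip(units, inputs)]
    return [vote(*outputs) for _ in range(3)]


# Two healthy units and one faulty unit that produces garbage.
units = [lambda x: x + 1, lambda x: x + 1, lambda x: -999]
print(tmr_stage([1, 1, 1], units))  # the faulty output is outvoted
```

Because the stage returns three voter outputs, the next stage can mask a single faulty voter in the same way.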

13
DISTRIBUTED COMMIT

The distributed commit problem involves having an
operation, for example a distributed transaction,
performed either by every member of a process
group or by none at all.
Distributed commit is established by means of a
coordinator.
In its simplest form, a coordinator tells other
processes whether or not to locally perform the
operation. This scheme is referred to as a
one-phase commit protocol.
In this scheme, if one process cannot perform the
operation, there is no way to tell the
coordinator.
The most commonly used protocol is two-phase
commit.

14
TWO-PHASE COMMIT (2PC)

Consider a distributed transaction involving the
participation of a number of processes each
running on a different machine.
The protocol consists of 2 phases, each
consisting of 2 steps:
Voting phase
Decision phase

15
TWO-PHASE COMMIT (2PC)

The finite state machine for the coordinator in
2PC.
The finite state machine for a participant.

16
TWO-PHASE COMMIT (2PC)

Coordinator sends a VOTE_REQUEST message to all
participants.
In response, each participant sends either a
VOTE_COMMIT or a VOTE_ABORT message.
Coordinator collects all votes from the
participants. If all participants have voted to
commit, then the coordinator sends a GLOBAL_COMMIT
message. If at least one participant has voted to
abort the transaction, the coordinator sends a
GLOBAL_ABORT message.
Each participant that voted for a commit waits
for the coordinator's decision. When a participant
receives a GLOBAL_COMMIT message, it locally
commits the transaction. Otherwise, if it receives
a GLOBAL_ABORT message, the transaction is locally
aborted.
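
The four steps can be sketched as a failure-free, single-process simulation. Message passing is replaced by method calls, and the Participant class and run_two_phase_commit function are hypothetical names chosen for illustration. Only the message names and states come from the slides.

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


class Participant:
    """A participant that starts in INIT and follows the 2PC state machine."""

    def __init__(self, can_commit):
        self.can_commit = can_commit  # whether the local operation can be done
        self.state = "INIT"

    def on_vote_request(self):
        """Voting phase, step 2: answer the coordinator's VOTE_REQUEST."""
        if self.can_commit:
            self.state = "READY"
            return VOTE_COMMIT
        self.state = "ABORT"
        return VOTE_ABORT

    def on_decision(self, decision):
        """Decision phase, step 2: apply the global decision locally."""
        self.state = "COMMIT" if decision == GLOBAL_COMMIT else "ABORT"


def run_two_phase_commit(participants):
    """Coordinator side: request votes, then broadcast the global decision."""
    # Voting phase: VOTE_REQUEST goes out, votes come back.
    votes = [p.on_vote_request() for p in participants]
    # Decision phase: commit only if every single vote was VOTE_COMMIT.
    if all(v == VOTE_COMMIT for v in votes):
        decision = GLOBAL_COMMIT
    else:
        decision = GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision


group = [Participant(True), Participant(True), Participant(False)]
print(run_two_phase_commit(group), [p.state for p in group])
```

A single VOTE_ABORT is enough to abort the transaction everywhere, which is what gives the all-or-nothing behaviour.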

17
TWO-PHASE COMMIT (2PC)

Several problems arise when the 2PC protocol is
used in a system where failures occur, since the
coordinator and participants block waiting for
incoming messages.
The protocol easily fails when one of the
processes crashes. For this reason, timeout
mechanisms are used.
There are a total of three states in which either
a coordinator or participant is blocked waiting
for an incoming message.
A participant may be waiting in its INIT state
for a VOTE_REQUEST. If that message is not
received after some time, it will locally abort
the transaction and send a VOTE_ABORT to the
coordinator.

18
TWO-PHASE COMMIT (2PC)

A coordinator can be blocked in state WAIT,
waiting for the votes of each participant. If not
all votes have been collected after a certain
period of time, the coordinator should vote for
abort as well, and send GLOBAL_ABORT to all
participants.
A participant can be blocked in state READY,
waiting for the global vote as sent by the
coordinator. If that message is not received
within a given time, the participant cannot
simply decide to abort. Instead, it must find out
which message the coordinator actually sent.
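
The coordinator's timeout in state WAIT might look like the sketch below. It assumes votes arrive on a thread-safe queue.Queue and that one overall deadline applies to the whole voting phase. The function name collect_votes and both of these choices are assumptions made for illustration.

```python
import queue
import time

VOTE_COMMIT = "VOTE_COMMIT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


def collect_votes(inbox, expected, timeout_s):
    """Coordinator in state WAIT: gather `expected` votes before a deadline.

    Returns GLOBAL_ABORT if any vote is missing when the deadline passes,
    or if any collected vote is not VOTE_COMMIT.
    """
    deadline = time.monotonic() + timeout_s
    votes = []
    while len(votes) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return GLOBAL_ABORT  # timed out: a participant may have crashed
        try:
            votes.append(inbox.get(timeout=remaining))
        except queue.Empty:
            return GLOBAL_ABORT  # nothing arrived before the deadline
    if all(v == VOTE_COMMIT for v in votes):
        return GLOBAL_COMMIT
    return GLOBAL_ABORT


inbox = queue.Queue()
inbox.put(VOTE_COMMIT)  # only one of the two expected votes ever arrives
print(collect_votes(inbox, expected=2, timeout_s=0.1))
```

Aborting on a timeout is safe for the coordinator because no participant can have committed before a GLOBAL_COMMIT has been sent.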

19
TWO-PHASE COMMIT (2PC)

The simplest solution to this problem is to let
each participant block until the coordinator
recovers again.
A better solution is to let a participant P
contact another participant Q to see if it can
decide from Q's state what it should do.

Actions taken by a participant P when residing in
state READY and having contacted another
participant Q.
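
The table for this caption is not reproduced here. The sketch below encodes the rule as it is usually stated for 2PC, so the mapping should be read as an assumption and not as the slide's exact content. The names action_for_p and decide_from_peers are hypothetical.

```python
def action_for_p(q_state):
    """Action for P (blocked in READY) after learning the state of Q."""
    if q_state == "COMMIT":
        return "COMMIT"  # Q has seen GLOBAL_COMMIT, so that was the decision
    if q_state == "ABORT":
        return "ABORT"   # Q voted abort or has seen GLOBAL_ABORT
    if q_state == "INIT":
        return "ABORT"   # Q never voted, so no GLOBAL_COMMIT can exist
    if q_state == "READY":
        return "CONTACT_ANOTHER"  # Q is as uninformed as P
    raise ValueError(f"unknown state: {q_state}")


def decide_from_peers(peer_states):
    """Ask peers one by one; block if every one of them is also in READY."""
    for state in peer_states:
        action = action_for_p(state)
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"  # nobody knows: wait until the coordinator recovers


print(decide_from_peers(["READY", "INIT"]))
print(decide_from_peers(["READY", "READY"]))
```

The last case is the weakness of 2PC: if all participants are in READY and the coordinator is down, nobody can make progress.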
20
TWO-PHASE COMMIT (2PC)

To ensure that a process can actually recover, it
is necessary that it saves its state to
persistent storage.
If a participant was in state INIT, it can safely
decide to locally abort the transaction when it
recovers, and then inform the coordinator.
Problems arise when a participant crashes while
residing in state READY. When recovering, it
cannot decide on its own what it should do next,
that is, commit or abort the transaction.
Consequently, it is forced to contact other
participants to find out what it should do,
similar to the situation when it times out while
residing in state READY.
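
A minimal sketch of the recovery rule for a participant follows. It assumes the state is kept in a small JSON file that is flushed to disk before the participant acts on it. The file layout and the names save_state and recover are illustrative assumptions, not part of the protocol description.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the participant's state to persistent storage before acting on it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"state": state}, f)
        f.flush()
        os.fsync(f.fileno())    # force the bytes to disk
    os.replace(tmp_path, path)  # atomically swap in the new record


def recover(path):
    """Decide what a restarted participant should do from its saved state."""
    if not os.path.exists(path):
        return "ABORT"          # nothing was logged: treat as state INIT
    with open(path) as f:
        state = json.load(f)["state"]
    if state == "INIT":
        return "ABORT"          # safe: it never voted to commit
    if state == "READY":
        return "ASK_PEERS"      # cannot decide alone, must contact others
    return state                # COMMIT or ABORT: the decision is already known


with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "participant.json")
    save_state(log, "READY")
    print(recover(log))
```

Writing to a temporary file and renaming it means a crash in the middle of a write leaves the previous record intact.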

21
TWO-PHASE COMMIT (2PC)