5. Basic Approaches to Achieve Fault Tolerance in Multiprocessors


Provided by: Fabia179


1
5. Basic Approaches to Achieve Fault Tolerance in Multiprocessors

5.1 Static, or Masking Redundancy
N copies of each processor are used; the minimum degree of replication is triplication (N = 3). The replicated results are voted on.
5.2 Dynamic, or Standby Redundancy
First, the presence of a faulty processor is detected. Then it is replaced with a spare by performing network reconfiguration and error recovery.
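The detect-then-replace sequence of standby redundancy can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: `run_with_standby`, `is_faulty`, and `compute` are hypothetical names, fault detection is reduced to a simple predicate, and error recovery is modeled merely as running the computation on the replacement unit.

```python
def run_with_standby(active, spares, compute, is_faulty):
    """Return (unit_used, result), switching to spares when a fault is detected.

    All names are hypothetical; a real system would also reconfigure the
    interconnection network and restore state during recovery.
    """
    spares = list(spares)
    while is_faulty(active):          # step 1: fault detection
        if not spares:
            raise RuntimeError("no spare processors left")
        active = spares.pop(0)        # step 2: reconfiguration, a spare becomes active
    return active, compute(active)    # step 3: recovery modeled as (re)running the task


# Example: P0 and the first spare S0 are faulty, so S1 ends up doing the work.
unit, result = run_with_standby(
    "P0", ["S0", "S1"],
    compute=lambda u: f"done on {u}",
    is_faulty=lambda u: u in {"P0", "S0"},
)
```

Unlike masking redundancy, nothing is voted on here: the result is trusted once a unit passes the detection step, which is why the quality of the fault detector matters in this approach.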

2
6. Fault Tolerance Through Static Redundancy

Three forms of Static Redundancy

Redundancy for Availability
Redundancy for Safety
Redundancy for Non-Classical Faults

3
6. Fault Tolerance Through Static Redundancy

6.1 Redundancy for Availability

Used in the form of HW, SW, Time, or Information Redundancy. n copies of a module perform the computation simultaneously, and their results are voted on. The scheme is combined with a disagreement detector (voter/comparator) and a switching unit to produce a hybrid redundant system.
This approach can be applied at several levels in a distributed system: each processor can be replicated and the result of each processor's computation voted on, or the entire multiprocessor can be replicated and the combined result voted on. A third option divides the P processors of the multiprocessor into P/N groups of N processors, each group voting on its results before communicating with other groups. To provide robust communication, all critical transactions between groups may be replicated and voted upon.
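The third option, P/N groups of N processors voting internally, might look like the following Python sketch. It assumes a strict-majority voter and consecutive grouping of results; both choices and the function names are illustrative assumptions, not details given in the slides.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or None when no value has a majority."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def vote_in_groups(results, n):
    """Split P processor results into P/N consecutive groups of N and vote per group."""
    if n <= 0 or len(results) % n != 0:
        raise ValueError("P must be a positive multiple of N")
    return [majority(results[i:i + n]) for i in range(0, len(results), n)]


# P = 9 processors in 3 groups of N = 3; one faulty result in groups 1 and 3.
group_outputs = vote_in_groups([1, 1, 2, 5, 5, 5, 3, 4, 3], 3)
```

Here `group_outputs` is `[1, 5, 3]`: each group masks its single faulty member before anything is communicated to the other groups.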

4
6. Fault Tolerance Through Static Redundancy

6.2 Redundancy for Safety

Reliability refers to the probability that the system produces correct output.
Safety is defined as the probability that the system output is either correct or that the error in the output is detectable (Johnson, 1989). High safety is ensured by making the probability of an undetected error in the output negligible. When an uncorrectable error in the output is detected, a recovery or safe shutdown can be carried out.
In practice, a fault-tolerance scheme must be chosen that meets both the reliability and the safety requirements.

5
6. Fault Tolerance Through Static Redundancy

6.2 Redundancy for Safety

\[ R = \sum_{i=k}^{5} \binom{5}{i} \, p_{good}^{\,i} \, (1 - p_{good})^{5-i} \]
\[ S = 1 - \sum_{i=k}^{5} \binom{5}{i} \, p_{good}^{\,5-i} \, (1 - p_{good})^{i} \]

Voter        Reliability  Safety
3-out-of-5   0.99144      0.99144
4-out-of-5   0.91854      0.99954
5-out-of-5   0.59049      0.99999
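The k-out-of-5 sums for R and S on this slide can be evaluated with a few lines of Python. The slide does not state the per-module probability; the value p_good = 0.9 used below is an assumption, chosen because it is consistent with the tabulated figures (for example, 0.9^5 = 0.59049).

```python
from math import comb


def reliability(k, n=5, p_good=0.9):
    """Probability that at least k of n modules are good (p_good = 0.9 is assumed)."""
    return sum(comb(n, i) * p_good**i * (1 - p_good)**(n - i)
               for i in range(k, n + 1))


def safety(k, n=5, p_good=0.9):
    """One minus the probability that at least k of n modules are faulty."""
    return 1 - sum(comb(n, i) * p_good**(n - i) * (1 - p_good)**i
                   for i in range(k, n + 1))


for k in (3, 4, 5):
    print(f"{k}-out-of-5  R={reliability(k):.5f}  S={safety(k):.5f}")
```

Raising k lowers reliability, because more modules must be good, but raises safety, because more modules must fail together before a wrong output can go undetected.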

6
6. Fault Tolerance Through Static Redundancy

6.2 Redundancy for Safety
Design strategies can achieve both high reliability and safety using the generic model below. The outputs of the arbiter constitute the system outputs, which consist of two components: (I) the data output and (II) an unsafe flag.

[Figure: Safe Modular Redundant (SMR) System. Module 1, Module 2, ..., Module n feed an Arbiter, which produces the Data and Unsafe outputs.]
An arbitration strategy is the function implemented by the arbiter to decide what the correct output is, and to determine when the errors in the module outputs exceed the correction capability so that the correct output cannot be provided.
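One possible arbitration strategy, assumed here purely for illustration, is k-out-of-n agreement: the arbiter emits a value only when at least k modules agree on it and raises the unsafe flag otherwise. The function name and the use of `None` as the data output in the unsafe case are hypothetical choices.

```python
from collections import Counter


def arbiter(module_outputs, k):
    """Return (data, unsafe) for a k-out-of-n arbitration strategy.

    data is the value that at least k modules agree on; when no value reaches
    k votes, data is None and the unsafe flag is True.
    """
    n = len(module_outputs)
    if not n // 2 < k <= n:
        raise ValueError("k must be a majority threshold with n/2 < k <= n")
    value, count = Counter(module_outputs).most_common(1)[0]
    if count >= k:
        return value, False   # errors are within the correction capability
    return None, True         # correct output cannot be provided: flag as unsafe


ok = arbiter([4, 4, 4, 4, 9], k=4)       # (4, False)
flagged = arbiter([4, 4, 4, 9, 9], k=4)  # (None, True)
```

The second call shows the trade-off: a 3-out-of-5 voter would have delivered 4, while the stricter 4-out-of-5 arbiter withholds the data and signals that recovery or a safe shutdown is needed.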

7
6. Fault Tolerance Through Static Redundancy

6.3 Redundancy for Tolerating Non-Classical Faults

Even malicious failures, where two or more faulty nodes may cooperate and attempt to foil the operation, must be tolerated.
The Byzantine Fault Model (BFM) Protocol was proposed for precisely such an environment by Pease, Shostak, and Lamport (1982).
BFM considers that a faulty node may not only produce incorrect values, but also send different values to different destinations instead of identical values, as expected.
These are typically complex, timing-related failures that make agreement between good processors difficult in the presence of faulty processors.
BFM does not require foreknowledge of component misbehavior and can tolerate faulty components with even the most malevolent behavior, thus avoiding the costly task of proving the validity of assumptions regarding faulty component misbehavior.

8
6. Fault Tolerance Through Static Redundancy

6.3 Redundancy for Tolerating Non-Classical Faults

For a BFM Protocol to tolerate m faults, the following requirements must be met:
At least (3m + 1) nodes must participate.
At least (2m + 1) disjoint communication paths must exist between nodes.
At least (m + 1) rounds of communication must take place.
All nodes must be synchronized within a known skew of each other.
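The three counting requirements can be turned into a small Python helper. This is a sketch with hypothetical function names; the synchronization requirement is not modeled because it is a timing property rather than a count.

```python
def bfm_requirements(m):
    """Minimum resources for tolerating m Byzantine faults, per the list above."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return {"nodes": 3 * m + 1, "disjoint_paths": 2 * m + 1, "rounds": m + 1}


def max_tolerable_faults(nodes, disjoint_paths, rounds):
    """Largest m for which all three counting requirements hold."""
    return max(0, min((nodes - 1) // 3, (disjoint_paths - 1) // 2, rounds - 1))


single_fault = bfm_requirements(1)        # 4 nodes, 3 disjoint paths, 2 rounds
with_seven = max_tolerable_faults(7, 5, 3)
```

For example, tolerating one fault needs 4 nodes, 3 disjoint paths, and 2 rounds, and a 7-node system with 5 disjoint paths and 3 rounds tolerates m = 2.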

[Figure: two message-exchange diagrams among nodes A, B, C, and D, with messages labeled "a" and "r". Panel captions: Byzantine agreement; Byzantine disagreement.]
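The difference between agreement and disagreement can be reproduced with a small Python simulation of one relay round of message exchange. This is a sketch under several assumptions not stated in the slides: the sender is loyal, "a" and "r" stand for two possible orders, a traitorous node always relays the opposite order, and the node names and the function `om1_decisions` are hypothetical.

```python
from collections import Counter


def om1_decisions(order, lieutenants, traitors, lie):
    """Decisions of the loyal lieutenants after one relay round (loyal commander).

    Returns {lieutenant: value or None}; None means no strict majority was seen.
    """
    # Round 1: the loyal commander sends the same order to every lieutenant.
    received = {name: order for name in lieutenants}
    decisions = {}
    for name in lieutenants:
        if name in traitors:
            continue  # only loyal lieutenants are required to agree
        # Round 2: every other lieutenant relays what it received;
        # a traitor relays `lie` instead.
        relayed = [lie if other in traitors else received[other]
                   for other in lieutenants if other != name]
        votes = Counter([received[name]] + relayed)
        value, count = votes.most_common(1)[0]
        decisions[name] = value if count * 2 > sum(votes.values()) else None
    return decisions


four_nodes = om1_decisions("a", ["B", "C", "D"], traitors={"D"}, lie="r")
three_nodes = om1_decisions("a", ["B", "C"], traitors={"C"}, lie="r")
print(four_nodes)   # {'B': 'a', 'C': 'a'}
print(three_nodes)  # {'B': None}
```

With four nodes (3m + 1 for m = 1) the loyal lieutenants outvote the traitor and settle on "a". With only three nodes, the loyal lieutenant sees one "a" and one "r" and cannot tell which of the other two nodes is lying.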
9
6. Fault Tolerance Through Static Redundancy