Title: 5. Basic Approaches to Achieve Fault Tolerance in Multiprocessors
5. Basic Approaches to Achieve Fault Tolerance in
Multiprocessors
- 5.1 Static, or Masking Redundancy
- N copies of each processor are used; the
minimum degree of replication is triplication.
The replicated results are voted on.
- 5.2 Dynamic, or Standby Redundancy
- First, the presence of a faulty processor is
detected. Then it is replaced with a spare by
performing network reconfiguration and error
recovery.
6. Fault Tolerance Through Static Redundancy
- Three forms of Static Redundancy
- Redundancy for Availability
- Redundancy for Safety
- Redundancy for Non-Classical Faults
- 6.1 Redundancy for Availability
- Used in the form of HW, SW, time, or information
redundancy: n copies of a module perform the
computation simultaneously, and their results are
voted on. The scheme can be combined with a
disagreement detector (voter/comparator) and a
switching unit to produce a hybrid redundant system.
- This approach can be applied at several levels in
a distributed system: each processor can be
replicated and the result of each processor's
computation voted on, or the entire
multiprocessor can be replicated and the combined
result voted on. A third option divides the P
processors of the multiprocessor into P/N groups
of N processors, each group voting on its results
before communicating with other groups. To provide
robust communication, all critical transactions
between groups may be replicated and voted upon.
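The group-voting option above can be illustrated with a minimal Python sketch. The function names `vote` and `vote_by_groups` are hypothetical, and the strict-majority rule is an assumption, since the slides do not say which voting rule is used.

```python
from collections import Counter


def vote(results):
    """Return the value held by a strict majority of results, or None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def vote_by_groups(results, n):
    """Split P processor results into P/N groups of N and vote in each group."""
    if n <= 0 or len(results) % n != 0:
        raise ValueError("P must be a positive multiple of N")
    return [vote(results[i:i + n]) for i in range(0, len(results), n)]


# Assumed example: P = 6 processors, N = 3 per group, one faulty result (9).
processor_outputs = [7, 7, 9, 4, 4, 4]
group_outputs = vote_by_groups(processor_outputs, 3)
print(group_outputs)
```

Each group masks a single faulty member here. A group with no strict majority yields `None`, which a caller could treat as a detected disagreement.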
- 6.2 Redundancy for Safety
- Reliability refers to the probability that the
system produces correct output.
- Safety is defined as the probability that the
system output is either correct or that the
error in the output is detectable (Johnson,
1989). High safety is ensured by making the
probability of an undetected error in the output
negligible. When an uncorrectable error in the
output is detected, a recovery or safe shutdown
can be carried out.
- In practice, a fault-tolerance scheme must be
chosen that meets both the reliability and the
safety requirements.
- $R = \sum_{i=k}^{5} \binom{5}{i}\, p_{good}^{i}\, (1 - p_{good})^{5-i}$
- $S = 1 - \sum_{i=k}^{5} \binom{5}{i}\, p_{good}^{5-i}\, (1 - p_{good})^{i}$
Voter        Reliability   Safety
3-out-of-5   0.99144       0.99144
4-out-of-5   0.91854       0.99954
5-out-of-5   0.59049       0.99999
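The table values can be checked numerically with a short Python sketch. It assumes that R sums the probability of at least k good modules out of 5, and that S is one minus the probability of at least k faulty modules. The value p_good = 0.9 is not stated on the slides; it is inferred here because it reproduces the table. The function names are hypothetical.

```python
from math import comb


def reliability(k, n, p_good):
    """Probability that at least k of n modules are good."""
    return sum(comb(n, i) * p_good**i * (1 - p_good)**(n - i)
               for i in range(k, n + 1))


def safety(k, n, p_good):
    """One minus the probability that at least k of n modules are faulty."""
    return 1 - sum(comb(n, i) * p_good**(n - i) * (1 - p_good)**i
                   for i in range(k, n + 1))


P_GOOD = 0.9  # assumption: inferred from the table, not given in the text

for k in (3, 4, 5):
    r = reliability(k, 5, P_GOOD)
    s = safety(k, 5, P_GOOD)
    print(f"{k}-out-of-5  R={r:.5f}  S={s:.5f}")
```

Raising k trades reliability for safety: a stricter threshold makes a correct output less likely, but it also makes an undetected wrong output less likely.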
- Design strategies can achieve both high
reliability and high safety using the generic model
below. The outputs of the arbiter constitute the
system outputs, which consist of two components:
(I) a data output and (II) an unsafe flag.
[Figure: Safe Modular Redundant (SMR) System. Modules 1, 2, ..., n feed an Arbiter, which produces the Data output and the Unsafe flag.]
An arbitration strategy is the function
implemented by the arbiter to decide what the
correct output is, and to decide when the errors in
the module outputs exceed the correction capability
so that the correct output cannot be provided.
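One possible arbitration strategy is a k-out-of-n threshold, sketched below in Python. This is an illustrative assumption, not a strategy prescribed by the slides. The function name `arbitrate` and the convention of returning `None` as data when the unsafe flag is raised are hypothetical.

```python
from collections import Counter


def arbitrate(module_outputs, k):
    """Return (data, unsafe) for a k-out-of-n arbiter.

    If at least k modules agree on a value, that value is the data output
    and unsafe is False. Otherwise no data is produced and unsafe is True.
    """
    n = len(module_outputs)
    if not n / 2 < k <= n:
        # k must exceed n/2 so that at most one value can reach the threshold.
        raise ValueError("k must satisfy n/2 < k <= n")
    value, count = Counter(module_outputs).most_common(1)[0]
    if count >= k:
        return value, False
    return None, True


# Assumed example: 5 modules with a 4-out-of-5 threshold.
print(arbitrate([1, 1, 1, 1, 0], 4))  # four modules agree
print(arbitrate([1, 1, 1, 0, 0], 4))  # only three agree, so the flag is raised
```

When the flag is raised, the system can attempt recovery or a safe shutdown instead of emitting a possibly wrong value.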
- 6.3 Redundancy for Tolerating Non-Classical Faults
- Even malicious failures, where two or more faulty
nodes may cooperate and attempt to foil the
operation, must be tolerated.
- The Byzantine Fault Model (BFM) Protocol was proposed
for precisely such an environment by Pease,
Shostak, and Lamport (1982).
- BFM considers that a faulty node may not only
produce incorrect values, but also send different
values to different destinations instead of
identical values, as expected.
- Typical examples are complex timing-related
failures, which make it difficult for good
processors to reach agreement in the presence of
faulty processors.
- BFM does not require foreknowledge of component
misbehavior and can tolerate faulty components
with even the most malevolent behavior, thus
avoiding the costly task of proving the validity
of assumptions regarding faulty component
misbehavior.
- For a BFM Protocol to tolerate m faults, the
following requirements must be met:
- At least (3m + 1) nodes must participate
- At least (2m + 1) disjoint communication paths
must exist between nodes
- At least (m + 1) rounds of communication must
take place
- All nodes must be synchronized within a
well-known skew of each other.
[Figure: Byzantine agreement and Byzantine disagreement among nodes A, B, C, and D exchanging values labeled a and r.]
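The difference between agreement and disagreement can be sketched with a small Python simulation of the classic recursive oral-messages exchange. This is an assumption about the kind of protocol meant here, not a transcription of the figure. The values "a" and "r" mirror the figure labels. The node numbering, the deterministic misbehavior in `send`, and the tie-breaking default are all hypothetical choices.

```python
from collections import Counter


def majority(values, default="r"):
    """Most common value; fall back to default on a tie."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def send(sender, receiver, value, faulty):
    """A good node forwards value; a faulty one sends conflicting values."""
    if sender in faulty:
        return "a" if receiver % 2 == 0 else "r"
    return value


def om(m, commander, lieutenants, value, faulty):
    """Oral-messages recursion; returns the value each lieutenant decides."""
    received = {lt: send(commander, lt, value, faulty) for lt in lieutenants}
    if m == 0:
        return received
    relayed = {lt: [] for lt in lieutenants}
    for lt in lieutenants:
        others = [x for x in lieutenants if x != lt]
        # Each lieutenant relays what it received, acting as commander.
        for other, v in om(m - 1, lt, others, received[lt], faulty).items():
            relayed[other].append(v)
    return {lt: majority([received[lt]] + relayed[lt]) for lt in lieutenants}


# 4 nodes, m = 1, good commander 0 sends "a", lieutenant 3 is faulty.
four_nodes = om(1, 0, [1, 2, 3], "a", faulty={3})
# 3 nodes, m = 1, good commander 0 sends "a", lieutenant 2 is faulty.
three_nodes = om(1, 0, [1, 2], "a", faulty={2})
print(four_nodes[1], four_nodes[2], three_nodes[1])
```

With four nodes (3m + 1 for m = 1) the good lieutenants 1 and 2 both decide the commander's value. With only three nodes, the good lieutenant 1 cannot tell which of its two sources is lying and, under the assumed default, ends up with a value that differs from the good commander's.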
- Example: m = 7
- At least (3 x 7 + 1) = 22 nodes
- At least (2 x 7 + 1) = 15 disjoint communication
paths
- At least (7 + 1) = 8 rounds of communication
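The arithmetic in this example follows directly from the three bounds and can be reproduced with a short Python sketch. The function names `bfm_requirements` and `max_tolerable_faults` are hypothetical helpers introduced only for illustration.

```python
def bfm_requirements(m):
    """Minimum resources needed to tolerate m Byzantine faults."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return {
        "nodes": 3 * m + 1,           # at least 3m + 1 participating nodes
        "disjoint_paths": 2 * m + 1,  # at least 2m + 1 disjoint paths
        "rounds": m + 1,              # at least m + 1 communication rounds
    }


def max_tolerable_faults(nodes):
    """Largest m with 3m + 1 <= nodes (node-count bound only)."""
    return max((nodes - 1) // 3, 0)


print(bfm_requirements(7))
print(max_tolerable_faults(22))
```

Read in reverse, the node bound shows how costly this model is: a system of 22 nodes tolerates at most 7 faulty ones, and a system of 3 nodes tolerates none.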