5. Basic Approaches to Achieve Fault Tolerance in Multiprocessors

Provided by: Fabia179
1
5. Basic Approaches to Achieve Fault Tolerance in
Multiprocessors
  • 5.1 Static, or Masking Redundancy
  • N copies of each processor are used; the
    minimum degree of replication is triplication
    (N = 3). The replicated results are voted on.
  • 5.2 Dynamic, or Standby Redundancy
  • First, the presence of a faulty processor is
    detected. Then it is replaced with a spare by
    performing network reconfiguration and error
    recovery.
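
The two approaches can be contrasted in a minimal Python sketch. The function names, the `is_faulty` detector, and the idea of modeling a processor as a callable are all assumptions made for illustration; network reconfiguration and error recovery are not modeled.

```python
from collections import Counter


def masked_result(replica_outputs):
    """Static (masking) redundancy: majority vote over N >= 3 replica outputs."""
    n = len(replica_outputs)
    if n < 3:
        raise ValueError("masking redundancy needs at least three replicas")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count <= n // 2:
        raise RuntimeError("no majority among replica outputs")
    return value


def standby_result(primary, spares, is_faulty, task):
    """Dynamic (standby) redundancy: use the first processor not flagged as faulty.

    `is_faulty` is a hypothetical fault detector supplied by the caller.
    """
    for processor in [primary, *spares]:
        if not is_faulty(processor):
            return processor(task)
    raise RuntimeError("all processors are faulty")
```

With three replicas, `masked_result([4, 4, 5])` masks the single wrong output without ever identifying the faulty replica, whereas `standby_result` depends entirely on the detector to trigger the switch to a spare.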

2
6. Fault Tolerance Through Static Redundancy
  • Three forms of Static Redundancy
  • Redundancy for Availability
  • Redundancy for Safety
  • Redundancy for Non-Classical Faults

3
6. Fault Tolerance Through Static Redundancy
  • 6.1 Redundancy for Availability
  • Used in the form of hardware, software, time, or
    information redundancy: n copies of a module
    perform the computation simultaneously, and their
    results are voted on. The scheme is combined with
    a disagreement detector (voter/comparator) and a
    switching unit to produce a hybrid redundant
    system.
  • This approach can be applied at several levels in
    a distributed system: each processor can be
    replicated and the result of each processor's
    computation voted on, or the entire
    multiprocessor can be replicated and the combined
    result voted on. A third option divides the P
    processors of the multiprocessor into P/N groups
    of N processors, each group voting on its results
    before communicating with other groups. To provide
    robust communication, all critical transactions
    between groups may be replicated and voted upon.
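
The third option, P/N groups of N processors with a vote inside each group, might be sketched as follows. The names `vote` and `group_votes` are hypothetical, and the sketch assumes the per-processor results arrive as a flat list ordered group by group.

```python
from collections import Counter


def vote(results):
    """Return the strict-majority value of `results`, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def group_votes(processor_results, n):
    """Split P results into P/N consecutive groups of N and vote within each."""
    p = len(processor_results)
    if n < 1 or p % n != 0:
        raise ValueError("P must be a positive multiple of N")
    return [vote(processor_results[i:i + n]) for i in range(0, p, n)]
```

For example, `group_votes([1, 1, 2, 5, 5, 5], 3)` treats six processors as two triplicated groups; the first group masks its one deviating result before anything is communicated to the second.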

4
6. Fault Tolerance Through Static Redundancy
  • 6.2 Redundancy for Safety
  • Reliability refers to the probability that the
    system produces correct output.
  • Safety is defined as the probability that the
    system output is either correct or that the
    error in the output is detectable (Johnson,
    1989). High safety is ensured by making the
    probability of an undetected error in the output
    negligible. When an uncorrectable error in the
    output is detected, a recovery or safe shutdown
    can be carried out.
  • In practice, a fault-tolerance scheme must be
    chosen that meets both the reliability and the
    safety requirements.

5
6. Fault Tolerance Through Static Redundancy
  • 6.2 Redundancy for Safety
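  • Reliability of a k-out-of-5 voter:
    $$R = \sum_{i=k}^{5} \binom{5}{i}\, p_{\text{good}}^{\,i}\, (1 - p_{\text{good}})^{5-i}$$
  • Safety of a k-out-of-5 voter:
    $$S = 1 - \sum_{i=k}^{5} \binom{5}{i}\, p_{\text{good}}^{\,5-i}\, (1 - p_{\text{good}})^{i}$$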

Voter        Reliability   Safety
3-out-of-5   0.99144       0.99144
4-out-of-5   0.91854       0.99954
5-out-of-5   0.59049       0.99999
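
The k-out-of-5 reliability and safety sums can be evaluated numerically with a short Python sketch. The slides do not state the per-module probability; the value 0.9 used here is an assumption, chosen because it reproduces the tabulated figures. The function names are illustrative.

```python
from math import comb


def reliability(k, n, p_good):
    """Probability that at least k of n modules are good."""
    return sum(comb(n, i) * p_good**i * (1 - p_good)**(n - i)
               for i in range(k, n + 1))


def safety(k, n, p_good):
    """One minus the probability that at least k of n modules are faulty."""
    return 1 - sum(comb(n, i) * p_good**(n - i) * (1 - p_good)**i
                   for i in range(k, n + 1))


# Assumed per-module probability (not given in the slides).
P_GOOD = 0.9
for k in (3, 4, 5):
    print(f"{k}-out-of-5  R={reliability(k, 5, P_GOOD):.5f}  "
          f"S={safety(k, 5, P_GOOD):.5f}")
```

Raising k trades reliability for safety: a stricter voter delivers an output less often, but an undetected wrong output requires more modules to fail together.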
6
6. Fault Tolerance Through Static Redundancy
  • 6.2 Redundancy for Safety
  • Design strategies can achieve both high
    reliability and safety using the generic model
    below. The outputs of the arbiter constitute the
    system output, which consists of two components:
    (I) a data output and (II) an unsafe flag.

[Figure: Safe Modular Redundant (SMR) System. Modules 1 through n feed an arbiter, which produces two outputs: Data and Unsafe.]
An arbitration strategy is the function
implemented by the arbiter to decide what the
correct output is and when the errors in the
module outputs exceed the correction capability,
so that the correct output cannot be provided.
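
One possible arbitration strategy is a k-out-of-n agreement rule, sketched below in Python. This is an illustrative choice, not necessarily the strategy the slides have in mind; the function name `arbitrate` and the use of `None` as the data output in the unsafe case are assumptions.

```python
from collections import Counter


def arbitrate(module_outputs, k):
    """Return (data, unsafe) for a k-out-of-n agreement rule.

    data   -- the value at least k modules agree on, or None if there is none
    unsafe -- True when the errors exceed the correction capability
    """
    n = len(module_outputs)
    if not n // 2 < k <= n:
        raise ValueError("k must be a strict majority of the n modules")
    value, count = Counter(module_outputs).most_common(1)[0]
    if count >= k:
        return value, False
    return None, True
```

With five modules and k = 4, `arbitrate([7, 7, 7, 7, 9], 4)` delivers 7 with the unsafe flag clear, while `arbitrate([7, 7, 7, 9, 9], 4)` raises the unsafe flag so that recovery or a safe shutdown can follow.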
7
6. Fault Tolerance Through Static Redundancy
  • 6.3 Redundancy for Tolerating Non-Classical Faults
  • Even malicious failures, where two or more faulty
    nodes may cooperate and attempt to foil the
    operation, must be tolerated.
  • The Byzantine Fault Model (BFM) protocol was
    proposed for precisely such an environment by
    Pease, Shostak, and Lamport (1982).
  • BFM considers that a faulty node may not only
    produce incorrect values, but also send different
    values to different destinations instead of
    identical values, as expected.
  • Typical examples are complex timing-related
    failures, which make agreement among good
    processors difficult in the presence of faulty
    processors.
  • BFM does not require foreknowledge of component
    misbehavior and can tolerate faulty components
    with even the most malevolent behavior, thus
    avoiding the costly task of proving the validity
    of assumptions about how faulty components
    misbehave.

8
6. Fault Tolerance Through Static Redundancy
  • 6.3 Redundancy for Tolerating Non-Classical Faults
  • For a BFM protocol to tolerate m faults, the
    following requirements must be met:
  • At least (3m + 1) nodes must participate
  • At least (2m + 1) disjoint communication paths
    must exist between nodes
  • At least (m + 1) rounds of communication must
    take place
  • All nodes must be synchronized within a
    well-known skew of each other.
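
The node-count requirement can be illustrated with a small Python simulation of the recursive oral-messages algorithm OM(m) described by Lamport, Shostak, and Pease, which uses m + 1 rounds of message exchange. This is a sketch under stated assumptions: nodes are integers, a faulty node is modeled (hypothetically) as sending "attack" to even-numbered receivers and "retreat" to odd-numbered ones, and ties default to "retreat". Communication paths and clock skew are not modeled.

```python
from collections import Counter

DEFAULT = "retreat"  # assumed default order when no majority exists


def majority(values):
    """Most common value, or DEFAULT when the top two counts tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]


def send(sender, receiver, value, traitors):
    """A loyal sender forwards `value`; a traitor sends by receiver parity."""
    if sender in traitors:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m); return a dict mapping each lieutenant to its decided value."""
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant j relays what it received by acting as commander
    # in OM(m - 1) towards the other lieutenants.
    relayed = {l: {} for l in lieutenants}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        for l, v in om(m - 1, j, others, received[j], traitors).items():
            relayed[l][j] = v
    return {
        l: majority([received[l]] + [relayed[l][j] for j in lieutenants if j != l])
        for l in lieutenants
    }
```

With four nodes and one traitor (m = 1, so 3m + 1 = 4), `om(1, 0, [1, 2, 3], "attack", traitors={3})` leaves loyal lieutenants 1 and 2 both deciding "attack". With only three nodes, `om(1, 0, [1, 2], "attack", traitors={2})` leaves loyal lieutenant 1 unable to follow the loyal commander's order, which is the disagreement case.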

[Figure: two message-exchange diagrams among nodes A, B, C, and D, with messages labeled a and r. One panel shows Byzantine agreement, the other Byzantine disagreement.]
9
6. Fault Tolerance Through Static Redundancy
  • 6.3 Redundancy for Tolerating Non-Classical Faults
  • Example: m = 7
  • At least (3 × 7 + 1) = 22 nodes
  • At least (2 × 7 + 1) = 15 disjoint communication
    paths
  • At least (7 + 1) = 8 rounds of communication
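
The same arithmetic can be wrapped in a small Python helper, together with its inverse: given the resources a system actually has, how many faults can it tolerate? Both function names are hypothetical, and the synchronization requirement is not modeled.

```python
def bfm_requirements(m):
    """Minimum resources needed to tolerate m Byzantine faults."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return {"nodes": 3 * m + 1, "disjoint_paths": 2 * m + 1, "rounds": m + 1}


def max_tolerable_faults(nodes, disjoint_paths, rounds):
    """Largest m whose three requirements are all satisfied (may be 0)."""
    m = min((nodes - 1) // 3, (disjoint_paths - 1) // 2, rounds - 1)
    return max(m, 0)
```

Calling `bfm_requirements(7)` gives 22 nodes, 15 disjoint paths, and 8 rounds. In the other direction, a system with 22 nodes but only 9 disjoint paths is limited to m = 4 by the path requirement.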
