Algorithm-Based Fault Tolerance: Theory of Check Placement

1
Algorithm-Based Fault Tolerance: Theory of Check Placement
  • Greg Bronevetsky

2
So Far
  • Learned how certain computations could be checked
    using algorithm-specific checks.
  • In any algorithm we can develop checks to verify
    any set of data items.
  • How effective are these checks?
  • How many faults can a given set of checks detect?

3
Abstract Checks
  • Suppose we are given (g,h)-checks
  • Check defined on g data elements
  • If all elements correct, returns 0
  • If > 0 and ≤ h elements erroneous, returns 1
  • If > h elements erroneous, result undefined
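
The abstract check can be modelled in a few lines. This is a minimal sketch, not code from the talk: the function name `abstract_check` is hypothetical, and the "undefined" response is represented as `None` purely for illustration.

```python
from typing import Optional


def abstract_check(erroneous: int, g: int, h: int) -> Optional[int]:
    """Response of a (g,h)-check when `erroneous` of its g elements are wrong."""
    if not 0 <= erroneous <= g:
        raise ValueError("erroneous count must lie between 0 and g")
    if erroneous == 0:
        return 0      # all elements correct
    if erroneous <= h:
        return 1      # between 1 and h errors: reliably detected
    return None       # more than h errors: the response is undefined
```

A check in the "undefined" regime is what later slides call overwhelmed.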

4
Checking Example
[Figure: two check layouts over data elements d1, d2, ..., dn and their sum]
  • Assume (2, 1) checks
  • 2 elements, 1-failure detect
  • Both sets of checks can detect single errors
  • Neither can locate individual errors

5
But with one more check
[Figure: d1, d2, ..., dn and sum. n checks: ∀i, di and sum. 1 more check: sum.]
  • If also check sum
  • can detect any pair of errors
  • can locate single errors
  • Need general theory of effective and efficient
    check placement

6
Goals
  • Need models for correlating processor faults to
    data errors
  • Given fault model and set of checks need to
    derive fault detectability and locatability

7
Papers covered
  • V.S.S. Nair, J.A. Abraham, P. Banerjee.
    "Efficient techniques for the analysis of
    algorithm-based fault tolerance (ABFT) schemes",
    1996.
  • Choon-Sik Park and Mineo Kaneko, "An Efficient
    Technique for Design of ABFT Systems Based on
    Modified PD Graph".
  • Choon-Sik Park, "Algorithm-Based Fault Tolerant
    Systems Based on Graph-Theoretic Error
    Occurrence-Propagation Models", 2000. (PhD Thesis)
  • V.S.S. Nair, J.A. Abraham. "Hierarchical design
    and analysis of fault-tolerant multiprocessor
    systems using concurrent error detection", 1990.

8
Outline
  • Matrix-based formalism of Nair et al
  • Dependence graph-based formalism of Park et al
  • Includes fault propagation models
  • Framework for hierarchical fault tolerant systems
    by Nair et al
  • Building fault tolerant systems out of fault
    tolerant components

9
Basic Framework
[Figure: processors P1-P4, data elements d1-d7, and checks C1-C3]
  • Each processor and check associated with set of
    elements

10
Basic Framework
  • Data(Pi): set of data elements affected by
    processor i
  • If Pi fails, any subset of Data(Pi) may be
    erroneous
  • No notion of errors propagating based on data
    dependences
  • Data() defines the Processor-Data (PD) Matrix

11
Associated PD Matrix
[Figure: PD matrix relating processors P1-P4 to data elements d1-d7]
12
Basic Framework
  • Check(di): set of checks that check data element
    di
  • Must be non-empty if we expect to detect errors
  • Check() defines the Data-Check (DC) Matrix
  • Paper focuses on (g,1) checks
  • g data elements
  • Can detect up to 1 fault

13
Associated DC Matrix
[Figure: DC matrix relating data elements d1-d7 to checks C1-C3]
  • C1 and C2 are (3,1) checks
  • C3 is a (2,1) check

14
The PC Matrix
  • Finally, associate processors and checks
  • Processor-Check (PC) matrix: PC = PD × DC

[Figure: PD (processors × data elements) × DC (data elements × checks) = PC (processors × checks); each PC entry is the number of the processor's data elements verified by the check]
15
Using the PC Matrix
  • PC matrix shows if we can detect single-processor
    errors
  • Assume all checks are (g,h) checks
  • If a row of PC has all entries ≤ h, failure of
    that processor will be detected
  • Regardless of which entries actually become
    erroneous

[Figure: PC matrix; rows are processors, entries count the elements verified by each check]
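
The PC matrix and the row test can be sketched as follows. The matrix values here are hypothetical (the matrices shown on the slides are not part of the transcript), and the helper names `matmul` and `conservatively_detectable` are illustrative. The sketch assumes every data element is covered by at least one check, as the DC slide requires.

```python
# Hypothetical 0/1 incidence matrices, invented for illustration.
PD = [  # rows: processors P1..P3, columns: data elements d1..d4
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
DC = [  # rows: data elements d1..d4, columns: checks C1..C2
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]


def matmul(a, b):
    """Plain integer matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# PC[i][j] = number of processor i's data elements verified by check j.
PC = matmul(PD, DC)


def conservatively_detectable(pc_row, h):
    """Sufficient condition from the slide: no check sees more than h elements."""
    return all(entry <= h for entry in pc_row)
```

With h = 1 the first row contains a 2, so the conservative test rejects P1 even though a finer analysis might still show its failure to be detectable.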
16
Using the PC Matrix
  • If a row of PC has all entries ≤ h, failure of
    that processor will be detected

[Figure: the example system (P1-P4, d1-d7, C1-C3) shown with its PC matrix]
17
Relaxing Detectability
  • Condition is too conservative
  • Suppose we have (3, 2) checks
  • Pi's PD row is: [not in transcript]
  • There are 2 checks. DC matrix: [not in transcript]
  • PC matrix: [not in transcript]

[Figure: P1, data elements d1-d5, checks C1 and C2]
18
Relaxing Detectability
  • C1 may be overwhelmed by errors
  • Will not notice error <d1, d2, d5>
  • By above criterion system can't detect failure in
    P1

[Figure: P1, data elements d1-d5, checks C1 and C2]
19
Reaching New Detectability Definition
[Figure: P1, data elements d1-d5, checks C1 and C2]
  • But how could C1 be overwhelmed?
  • When all 3 of its elements have errors
  • Recall, these are (3,2) checks

20
Reaching New Detectability Definition
[Figure: P1, data elements d1-d5, checks C1 and C2]
  • But C1 and C2 overlap on d5
  • Thus if C1 overwhelmed, C2 detects error
  • It is not overwhelmed
  • Thus, for any error pattern can see if any check
    will notice

21
Trivial Algorithm 2
  • Try every possible error pattern
  • Exponentially many of them
  • For each pattern see if some check will detect
    it
  • Before, we ensured that no check was overwhelmed
  • Pro: Correct and not conservative
  • Con: Expensive
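
The exhaustive approach can be sketched as below. The system used is hypothetical: two (3,2) checks that overlap on d5, with d6 assumed to live on some other processor. All function and variable names are illustrative.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a set."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield set(combo)


def detects(check, pattern, h):
    """A (g,h)-check rings reliably when it meets between 1 and h bad elements."""
    return 1 <= len(check & pattern) <= h


def brute_force_detectable(data_of_p, checks, h):
    """True if every error pattern of the processor is caught by some check."""
    return all(
        any(detects(members, pattern, h) for members in checks.values())
        for pattern in nonempty_subsets(data_of_p)
    )


# Hypothetical example: P1 touches d1, d2, d4, d5; d6 belongs elsewhere.
P1_DATA = {"d1", "d2", "d4", "d5"}
CHECKS = {"C1": {"d1", "d2", "d5"}, "C2": {"d4", "d5", "d6"}}
```

Here C1 meets three of P1's elements, so the row test with h = 2 would reject P1, yet the exhaustive search accepts it because C2 covers every pattern that overwhelms C1. The cost is one pass over 2^n - 1 patterns for n data elements.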

22
New Definition of Detectability
  • Work with error patterns
  • Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc.
  • If one check detects given error pattern, no
    problem if other checks overwhelmed
  • Repeat until all error patterns detected:
  • If some check not overwhelmed, eliminate all
    detectable error patterns from consideration

23
Example of Detectability Algorithm
[Figure: processors P1 and P2, data elements d1-d5, checks C1-C4; all (2,1) checks]
  • Is failure of P1 detectable?
  • P1 fails ⇒ d1, d2 and/or d3 may have errors
  • C1, C2 overwhelmed
  • C3 not overwhelmed

24
Example of Detectability Algorithm
[Figure: processors P1 and P2, data elements d1-d5, checks C1-C4; all (2,1) checks]
  • Look at errors C3 can detect: d3
  • Remove them from consideration
  • Since any error pattern involving d3 will be
    detected

25
Example of Detectability Algorithm
[Figure: the same system with d3 removed; all (2,1) checks]
  • Look at remaining error patterns: combinations of
    d1 and/or d2
  • Now C2 not overwhelmed
  • Remove any error patterns involving d2

26
Example of Detectability Algorithm
[Figure: the same system with d2 and d3 removed; all (2,1) checks]
  • Look at remaining error patterns: d1
  • C1 not overwhelmed
  • Remove any of its error patterns

27
Example of Detectability Algorithm
[Figure: the same system with d1, d2 and d3 removed; all (2,1) checks]
  • All of P1's error patterns detected
  • We are done!
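
The elimination walk-through above can be sketched as a loop. The check memberships below are an assumed reconstruction from the bullets (C1 on d1 and d2, C2 on d2 and d3, C3 on d3 and d4, C4 on d4 and d5); the figure itself is not in the transcript, and the function name is hypothetical.

```python
def elimination_detectable(data_of_p, checks, h):
    """Repeatedly find a check that is not overwhelmed by the remaining
    elements and drop the elements it covers; succeed when none remain."""
    remaining = set(data_of_p)
    progress = True
    while remaining and progress:
        progress = False
        for members in checks.values():
            seen = members & remaining
            if 1 <= len(seen) <= h:
                # Every remaining pattern touching `seen` makes this check ring.
                remaining -= seen
                progress = True
    return not remaining


# Assumed layout of the example; all checks are (2,1), so h = 1.
CHECKS = {
    "C1": {"d1", "d2"},
    "C2": {"d2", "d3"},
    "C3": {"d3", "d4"},
    "C4": {"d4", "d5"},
}
P1_DATA = {"d1", "d2", "d3"}
```

On this layout the loop removes d3 via C3, then d2 via C2, then d1 via C1, matching the order on the slides. If the loop stalls, the pattern consisting of all remaining elements meets every check on either 0 or more than h elements, so no check is guaranteed to ring.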

28
Failing Check Processors
  • What if processor performing check fails?
  • Add pseudo data elements to represent
    processors
  • Each check will also check its processor's
    pseudo-data element
  • New element has ∞ weight, so error in it will
    overwhelm any check

29
Final System
[Figure: final system with processors P1 and P2, data elements d1-d7, checks C1-C4; all (2,1) checks]
  • Check C3 is in P1
  • Checks C1, C2 and C4 on P2

30
The Infinities
[Figure: the final system with its PD and DC matrices and the resulting PC matrix (processors × checks)]
31
The Infinities
[Figure: the final system and its PC matrix]
  • If P1 fails, C1 and C2 overwhelmed
  • C3 also overwhelmed by ∞+1
  • Because C3 runs on failed P1
  • Only C4 not overwhelmed

32
The Infinities
[Figure: the final system and its PC matrix]
  • Remove all error patterns detected by C4
  • Any that include d2

33
The Infinities
[Figure: the system with d2 removed, and its PC matrix. C4's entry must become 0; others may go lower.]
  • C1 and C2 no longer overwhelmed
  • Remove error patterns detected by C1 and C2
  • Any that include d1 and d3

34
The Infinities
[Figure: the system with d1, d2 and d3 removed, and its PC matrix. C1's and C2's entries must become 0; others may go lower.]
  • Now P1's row is all 0s and ∞s
  • All real data elements successfully checked
  • Only pseudo-elements remain
  • Don't care

35
The Infinities
[Figure: the system and its PC matrix]
  • Note: failure of P2 not detectable
  • d5 only checked by C4, which runs on P2
  • Thus, entry will never drop to ∞

36
Multi-Process Errors
  • Want to know if system can detect failures of ≤ r
    processors
  • For every subset of r processors
  • Take union of all data elements they touched
  • Pretend each r-set is single processor
  • Use above algorithm to check if all resulting
    error patterns detectable

37
Fault Locatability
  • We only see errors, not faults
  • For each error pattern, want to know which fault
    caused it
  • Given two fault patterns, are they
    distinguishable?
  • Only if they have different patterns of failed
    checks
  • Will give intuition for analysis

38
0-1 Disagreement
  • Take rows Ri and Rj of rPC (faults Fi and Fj)
  • For every possible error pattern in Ri and Rj
    look at what each check says on this pattern
  • If check responses different on each pattern, Fi
    and Fj can be differentiated

39
1-0 Disagreement
  • Want to differentiate faults Fi and Fi∪Fj, ∀j
  • Compare each error pattern of Fi and Fj: Eik and
    Ejl
  • If some check meets Eik on ≥ 1 and ≤ h spots and
    meets Ejl on 0 spots then Eik and Eik∪Ejl
    distinguishable
  • If this is true for all error patterns then Fi
    and Fi∪Fj distinguishable

40
1-0 Disagreement Example
41
1-0 Disagreement Example
  • Clearly, Eik and Ejl look different
  • Eik∪Ejl corresponds to fault pattern
  • Checks would say
  • Different from Eik or Ejl: Distinguishable!

42
Fault Locatability
  • If can show 1-0 disagreement between every
    single-process fault and every r-process
    fault, system is r-fault locatable
  • Algorithm for locatability is obscure
  • Read the paper

43
Summary
  • Presented matrix-based framework for evaluating
    error detectability and locatability
  • Framework deals with arbitrary errors
  • More work by V.S.S. Nair with other coauthors

44
Outline
  • Matrix-based formalism of Nair et al
  • Dependence graph-based formalism of Park et al
  • Includes fault propagation models
  • Framework for hierarchical fault tolerant systems
    by Nair et al
  • Building fault tolerant systems out of fault
    tolerant components

45
Graph-Based Framework
  • Developed by Choon-Sik Park
  • Does in graphs what Nair et al work does in
    matrices
  • Assumes (g,1) checks
  • Differences
  • Different definition of fault locatability
  • Unknown if equivalent
  • Presents more limited fault→error models
  • As opposed to anything and everything
  • Will first present general view, then specific
    error models

46
Basic Picture
[Figure: faults Fi and Fj produce errors eiu and ejv, which corrupt data elements that are verified by checks c]

Processor→Data, Data→Data dependence info
maintained
47
k-Faults
  • Faults may cause number of possible errors
  • For given fault, many errors possible
  • If given error happens, all associated data
    elements definitely corrupted
  • k-Faults: faults generating errors that corrupt
    ≤ k data elements

[Figure: fault Fi, its error eiu, and the data elements it corrupts]
48
Fault Detectability
  • System is k-fault detectable if for every error
    pattern eiu ∃ check c s.t. |c ∩ eiu| = 1
  • ∩ means intersection of affected data elements
  • Proof
  • If there exists such check then every error
    pattern induced by fault will be detected
  • If k-fault detectable then ∃ some check that
    reliably yells for any possible error pattern
  • Can allow the check that yells to be the check in
    definition
49
Fault Management
  • k-fault detectability: If a fault affects ≤ k data
    elements then checks will detect it
  • k-fault locatability: For all faults that affect
    ≤ k data elements, can tell any pair of faults
    apart
  • Will examine all fault patterns Fi that come from
    ≤ k data elements failing

50
Fault Locatability 1
  • To locate faults, must ensure that different
    faults cause different errors
  • Theorem 1: System k-fault locatable only if for
    error patterns eiu, ejv (from faults Fi and Fj)
    eiu ⊕ ejv ≠ ∅
  • ⊕ = symmetric difference
  • Proof clear: If two faults can show up as same
    error, can't tell them apart
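
Theorem 1's necessary condition can be checked mechanically. In this sketch the fault names and error patterns are invented for illustration, and Python's `^` operator on sets plays the role of the symmetric difference.

```python
from itertools import combinations


def indistinguishable_pairs(patterns_by_fault):
    """List (fault_i, fault_j, pattern) triples where two different faults
    can produce the same error pattern (empty symmetric difference)."""
    clashes = []
    for (fi, pats_i), (fj, pats_j) in combinations(sorted(patterns_by_fault.items()), 2):
        for e_i in pats_i:
            for e_j in pats_j:
                if not (e_i ^ e_j):  # symmetric difference is empty
                    clashes.append((fi, fj, e_i))
    return clashes


# Hypothetical faults and the error patterns each may produce.
PATTERNS = {
    "Fi": [frozenset({"d1"}), frozenset({"d1", "d2"})],
    "Fj": [frozenset({"d2"}), frozenset({"d1", "d2"})],
}
```

Any triple returned is a witness that the system cannot be locatable for that pair of faults, whatever checks are placed.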

51
Fault Locatability 2
  • Theorem 2: System k-fault locatable only if for
    error patterns eiu, ejv ∃ checks c and c' s.t.
  • |c ∩ (eiu ∪ ejv)| = 1 (recall all checks are (g,1))
  • |c ∩ (eiu ∩ ejv)| = 0
  • If |c ∩ (eiu - ejv)| = 1 then |c' ∩ ejv| = 1
  • If |c ∩ (ejv - eiu)| = 1 then |c' ∩ eiu| = 1
  • Intuition: Trying to make tuple <c,c'> be
    different and ≠ <0,0> on errors eiu and ejv
52
Fault Locatability Illustration
[Figure: Venn diagram of error patterns eiu and ejv with regions (eiu - ejv), (eiu ∩ ejv), (ejv - eiu), and the whole (eiu ∪ ejv)]
53
Fault Locatability Illustration
[Figure: Venn diagram of eiu and ejv with check c]
  • |c ∩ (eiu ∪ ejv)| = 1
  • i.e. c overlaps one element ∈ (eiu ∪ ejv)
  • (because of (g,1) checks)
54
Fault Locatability Illustration
[Figure: Venn diagram of eiu and ejv with check c]
  • |c ∩ (eiu ∩ ejv)| = 0
  • i.e. c only touches on the part that is unique to
    ejv
55
Fault Locatability Illustration
[Figure: Venn diagram of eiu and ejv with check c on (ejv - eiu) and check c' on eiu]
  • If |c ∩ (ejv - eiu)| = 1 then |c' ∩ eiu| = 1
  • If c notices ejv make sure that c' notices eiu
56
Fault Locatability Illustration
[Figure: Venn diagram of eiu and ejv with checks c and c']
  • Error eiu: <c,c'> = <0,1>
  • Error ejv: <c,c'> = <1,?>
  • Patterns distinguishable
  • Either error detected
57
Fault Locatability 2
  • Theorem 2: System k-fault locatable only if for
    error patterns eiu, ejv ∃ checks c and c' s.t.
  • |c ∩ (eiu ∪ ejv)| = 1 (recall all checks are (g,1))
  • |c ∩ (eiu ∩ ejv)| = 0
  • If |c ∩ (eiu - ejv)| = 1 then |c' ∩ ejv| = 1
  • If |c ∩ (ejv - eiu)| = 1 then |c' ∩ eiu| = 1
  • Thus, if above is true for every pair of error
    patterns, system k-fault detectable

58
Extra Fault Detectability
  • Theorem if system is k-fault locatable then it
    is 2k-fault detectable
  • Must show for any fault Fl in ≤ 2k processors, ∀
    resulting errors elw, ∃ check c. |c ∩ elw| = 1
  • Note: Failures of ≤ 2k processors result in ≤ 2×
    errors as failures of ≤ k data elements
  • Thus, can break up elw = (eiu ∪ ejv), coming from
    k-fault patterns Fi and Fj
59
Extra Fault Detectability
  • Theorem if system is k-fault locatable then it
    is 2k-fault detectable
  • Must show ∀eiu,ejv ∃ check c. |c ∩ (eiu ∪ ejv)| = 1
  • If (eiu ∪ ejv) happens, both c and c' will notice

[Figure: Venn diagram of eiu and ejv with checks c and c']
60
Fault→Error Models
  • So far trying to deal with arbitrary errors
  • Actual model of how faults turn into errors not
    defined
  • i.e. arbitrary
  • This is unnecessarily general
  • Should focus on realistic models of error
    generation and propagation
  • Makes it easier to design reliable systems

61
Single-Input-Driven Model
  • Output of computation erroneous if any input(s)
    are
  • Even if processor is faulty
  • If processor is faulty, its computations may or
    may not be erroneous
  • (this is where we use data dependence
    information)
  • Will focus on how model treats single-processor
    failures

62
SID Model Picture
[Figure: processor Pi and its data]

  • Diw: data elements on Pi
  • Synonymous with sets of data elements on Pi
  • Focus on single-processor failures

63
Fault Model in Practice
  • If Pi fails, any subset of Diw's may have error
  • If Diw has error, any data depending on it has
    error
  • Bijection between Diw and errors Eiw

[Figure: processor Pi and its data]


64
Single-Fault Detectability in SID
[Figure: processor Pi, its data, and a check c]
  • Brute-force algorithm
  • ∀ sets of Eiw's
  • If ∃ check c s.t. |c ∩ (∪Eiw's)| = 1 then this error
    pattern detectable
  • If all patterns detectable, system is
    single-fault detectable
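
A sketch of the brute-force test under the Single-Input-Driven model. The dependence map and check sets are hypothetical, as are the function names; each Eiw is computed as Diw plus everything that transitively depends on it, following the bullets of the preceding slide.

```python
from itertools import combinations


def propagate(seed, dependents):
    """Eiw: the seed element plus all data that transitively depends on it."""
    corrupted, stack = {seed}, [seed]
    while stack:
        for nxt in dependents.get(stack.pop(), ()):
            if nxt not in corrupted:
                corrupted.add(nxt)
                stack.append(nxt)
    return corrupted


def sid_single_fault_detectable(local_data, dependents, checks):
    """Try every non-empty set of Eiw's; each union must meet some (g,1)
    check on exactly one element."""
    errors = [propagate(d, dependents) for d in sorted(local_data)]
    for size in range(1, len(errors) + 1):
        for chosen in combinations(errors, size):
            pattern = set().union(*chosen)
            if not any(len(check & pattern) == 1 for check in checks):
                return False
    return True


# Hypothetical processor data and dependences (x depends on a and b, etc.).
LOCAL = {"a", "b"}
DEPENDENTS = {"a": ["x"], "b": ["x", "y"], "x": ["z"]}
GOOD_CHECKS = [{"a", "w"}, {"y", "w"}]
BAD_CHECKS = [{"x", "z"}]
```

In this example the check on {x, z} always sees both propagated elements at once, so it can never serve as a single-error detector.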


65
Too Conservative
  • Like before, algorithm too conservative
  • Examines exponentially many error patterns
  • Suppose set of errors E
    detected via check c
  • i.e. |c ∩ E| = 1
  • Look at

c
66
Too Conservative
  • Clearly, all overlap with c on one element
  • Thus, each one detectable
  • Similarly, all unions containing
    detectable
  • Therefore, if a set of errors detectable, all
    unions containing suberrors also detectable
  • And thus, no need to check them

Can ignore E1, E2, E1∪E2, E1∪E3, E2∪E3, E1∪E2∪E3. Can't ignore E3.
67
New Definition of Detectability
  • (start with all possible errors)
  • For each check cs
  • Check that detectable
  • Now ignore detectable subsets of
  • Remove detectable subsets
  • Repeat to ensure rest of also detectable

68
Detectability Example
  • Check ( )
  • c1 meets E1 and E2

[Figure: check c1]
69
Detectability Example
  • Check ( )
  • c1 meets E1 and E2
  • Remove them to get

[Figure: check c1]
70
Detectability Example
  • Check
  • c2 meets E3 and E4
  • Also meets E2, but on error E2, c1 will ring

[Figure: checks c1 and c2]
71
Detectability Example
  • Check
  • c2 meets E3 and E4
  • Also meets E2, but on error E2, c1 will ring
  • Remove them to get

[Figure: checks c1 and c2]
72
Detectability Example
  • Check
  • c3 meets E5

[Figure: checks c1, c2 and c3]
73
Detectability Example
  • Check
  • c3 meets E5
  • Remove it to get

[Figure: checks c1, c2 and c3]
74
Detectability Example
  • Check
  • c4 meets E6
  • Recall circles on left are data on processor i

[Figure: checks c1, c2, c3 and c4]
75
Detectability Example
  • Check
  • c4 meets E6
  • Recall circles on left are data on processor i
  • Remove it to get

[Figure: checks c1, c2, c3 and c4]
76
Detectability Example
  • DONE!

[Figure: checks c1, c2, c3 and c4]
77
Single-Fault Locatability in SID
  • Basic definition: Must exist enough checks s.t.
    all error patterns produced by failure of Pi
    differentiable from error patterns of Pj
  • Involves a lot of error patterns
  • Start with brute-force definition

78
Brute-Force Definition
  • ∀ error patterns Eq = {Ei1, Ei5, Eiw, ...} from Pi
    ∃ checks and s.t.
  • Detects error E
  • Ignores any error from Pj
  • detect Ej and all subsets via above
    algorithm
  • And vice versa (since s may ring on Pi's
    errors)
  • Result
  • Any error pattern in Ei, none in Ej will ring
    some cq
  • Every pattern in Ej detectable

79
Responses of Checks
  • On error pattern Eq (due to failure of Pi)
  • On any error Ej due to failure of Pj
  • Can brute-force evaluate test on every possible Eq

At least one must be 1 (else Ej not detectable)
80
Brute Force Too Exhaustive
  • Recall that if
    then same true for all sets containing E1, Er
  • Thus, can eliminate many of the steps above

81
New Definition of Locatability
  • (start with all possible Pi errors)
  • For each check cs
  • Check cs detects
  • But not Ej
  • Ensure that Ej is detectable via above algorithm

82
New Definition of Locatability
  • Syndrome of Ei and detectable subsets
  • Syndrome of Ej all subsets
  • Can now ignore detectable subsets of
  • Remove detectable subsets
  • Repeat until all covered
  • Do same for
  • In paper, steps for and interleaved

At least one must be 1 (else Ej not detectable)
83
Summary
  • Presented graph-based framework for evaluating
    error detectability and locatability
  • Framework deals with arbitrary errors
  • Can be specialized to a simpler fault model:
    Single-Input Driven
  • Choon-Sik Park's thesis presents the
    Multiple-Input Driven model
  • More realistic but complex
84
Outline
  • Matrix-based formalism of Nair et al
  • Dependence graph-based formalism of Park et al
  • Includes fault propagation models
  • Framework for hierarchical fault tolerant systems
    by Nair et al
  • Building fault tolerant systems out of fault
    tolerant components

85
Building Larger Systems
  • Now know how to analyze systems for detectability
    and locatability
  • For large systems this can be very hard/expensive
  • Large systems typically made up of smaller
    components
  • Simplifies fault tolerance design

86
Basic Idea
  • Have component with known detectability (t) and
    locatability (l)
  • Construct system S out of k components
  • What is resulting fault tolerance?

87
Basic Idea
  • System fault tolerance no better than for
    individual component
  • If > t data elements fail in same component, error
    not detected
  • If > l elements fail in component, will not locate
  • Detectability and locatability ratio tends to 0 as
    system size increases!

88
Hierarchical Design
  • To build fault tolerant systems must introduce
    checks with new components
  • Will present hierarchical design scheme with
    specific detectability and locatability guarantees
  • Assumptions
  • All (g,h) checks have same h
  • No restriction on g
  • Every processor produces only one data element
  • Same true for blocks of processors
  • Checks are fault tolerant
  • Claims that this doesn't change problem

89
Basic Component
  • Start off with basic system
  • System has internal checks
  • Fault detectability t
  • Fault locatability l

B

90
Basic Component
  • Then replicate it k-fold
  • Assumptions
  • Copies are independent
  • (i.e. do not affect each other's data)
  • Each system produces one data element

[Figure: k copies B1, B2, ..., Bk]




91
Basic Component
  • Then replicate it k-fold
  • And add additional checks across all copies
  • Process repeated d-1 times to get d-level
    hierarchical system

[Figure: copies B1, B2, ..., Bk with additional checks c1, c2, ..., cr across all copies]
92
Detectability: 1 ≤ k ≤ h
  • Theorem 1
  • If 1 ≤ k ≤ h then hierarchical system can detect
    ≤ B×k^(d-1) errors
  • Proof
  • Base case: d = 2
  • Suppose every element has error
  • Each check must deal with k ≤ h errors
  • But they are (g,h) checks and will detect such
    errors
  • Thus, system can detect ≤ B×k errors

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
93
Detectability: 1 ≤ k ≤ h
  • Theorem 1
  • If 1 ≤ k ≤ h then hierarchical system can detect
    ≤ B×k^(d-1) errors
  • Proof
  • Inductive case: d+1
  • Components Bi each have ≤ B×k^(d-2) elements
  • By argument above, system detects
    ≤ (B×k^(d-2))×k = B×k^(d-1) errors
  • Argument works because sub-systems at each level
    produce one data element

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
94
Detectability: k > h
  • Theorem 2
  • If k > h then hierarchical system can detect
    ≤ (t+1)(h+1)^(d-1) - 1 errors
  • Proof
  • Base case: d = 2
  • Suppose (t+1)(h+1) errors with h+1 copies of B
    having t+1 errors each
  • Detectability of B = t, so internal checks will
    not notice errors
  • 2nd level checks will get h+1 errors each: will
    not notice
  • Thus, ∃ error pattern of size (t+1)(h+1) that
    will not be detected

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
95
Detectability: k > h
  • Theorem 2
  • If k > h then hierarchical system can detect
    ≤ (t+1)(h+1)^(d-1) - 1 errors
  • Proof
  • Base case: d = 2
  • Suppose (t+1)(h+1) - 1 errors
  • By pigeonhole principle, some unit has ≤ t errors
    or some 2nd level check has ≤ h errors
  • Thus, some check at 1st or 2nd level will ring
  • Thus, system detectability = (t+1)(h+1) - 1

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
96
Detectability: k > h
  • Theorem 2
  • If k > h then hierarchical system can detect
    ≤ (t+1)(h+1)^(d-1) - 1 errors
  • Proof
  • Inductive case: d+1
  • Components Bi detect ≤ Td errors
  • By induction, Td = (t+1)(h+1)^(d-1) - 1
  • By argument above, system detects ≤ (Td+1)(h+1) - 1
    errors
  • Thus, system detectability = (t+1)(h+1)^d - 1

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
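
The two detectability bounds can be evaluated side by side. This sketch assumes `base_size` stands for the quantity written B on the Theorem 1 slides; the function name and parameter names are hypothetical.

```python
def hierarchical_detectability(t, h, k, d, base_size):
    """Number of errors a d-level system is stated to detect.

    t: detectability of the basic component, h: from the (g,h) checks,
    k: replication factor per level, d: number of levels,
    base_size: assumed to be the B of Theorem 1.
    """
    if d < 1 or k < 1:
        raise ValueError("d and k must be at least 1")
    if k <= h:
        return base_size * k ** (d - 1)          # Theorem 1
    return (t + 1) * (h + 1) ** (d - 1) - 1      # Theorem 2
```

For k > h the closed form agrees with the inductive step on the slide: adding one level maps a bound T to (T+1)(h+1) - 1.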
97
Locatability
  • Theorem 3
  • If k > 1 then hierarchical system can locate
    ≤ 2^(d-1)(l+1) - 1 errors
  • Proof
  • Base case: d = 2
  • Suppose fault pattern of 2(l+1) errors, l+1
    errors in two Bi's
  • Bi and Bj can't locate the errors
  • 2nd level checks may locate erroneous rows, not
    columns
  • Thus, ∃ unlocatable fault pattern of size 2(l+1)

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
98
Locatability
  • Theorem 3
  • If k > 1 then hierarchical system can locate
    ≤ 2^(d-1)(l+1) - 1 errors
  • Proof
  • Base case: d = 2
  • Suppose fault pattern of 2(l+1) - 1
  • At most one Bi may have ≥ l+1 errors
  • If none do, we're done
  • Remaining ≤ l errors distributed among other Bj's

[Figure: copies B1, B2, ..., Bk with checks c1, c2, ..., cr]
99
Locatability
[Figure: copies Bi, Bj, ..., Bk with checks c1, c2, ..., cr]
  • Let Bi have l+r errors (r ≥ 1)

100
Locatability
[Figure: copies Bi, Bj, ..., Bk with checks c1, c2, ..., cr]
  • Let Bi have l+r errors (r ≥ 1)
  • Remaining Bj's share remaining l-r+1 errors
  • ≥ (l+r)-(l-r+1) = 2r-1 rows only have errors in Bi
  • 2r-1 rows when all l-r+1 errors are in same Bj

101
Finding Overwhelmed Unit
  • First, find the Bi that have > l errors
  • All but one sub-system detects and locates errors
    correctly
  • Overwhelmed subsystem
  • Detects correctly
  • Locatability l ⇒ Detectability > 2l
  • Citation of 1973 paper by Russell and Kime
  • Error location mistakes

102
Finding Overwhelmed Unit
  • In ≥ 2r-1 rows only Bi has error
  • Thus, no other row will claim an error there
  • 2nd-level checks will catch these errors
  • Bi's checks can't lie about it
  • Will definitely know these are errors

[Figure: Bi, Bj, ..., Bk (l+1); legend: known errors, unknown errors, no error; ≥ 2r-1 rows of known errors]
103
Finding Overwhelmed Unit
  • Number of errors in Bi = l+r
  • Number of known errors ≥ 2r-1
  • Number of unknown errors in Bi ≤ (l+r)-(2r-1) =
    l-r+1
  • Since r ≥ 1, l-r+1 ≤ l
  • Bi's checks can identify ≤ l errors
  • Error patterns ≤ l produce unique check alert
    patterns
  • This data enough to identify remaining unknown
    errors

104
Locatability
  • Theorem 3
  • If k > 1 then hierarchical system can locate
    ≤ 2^(d-1)(l+1) - 1 errors
  • Proof
  • Base case: d = 2
  • Can locate errors of size ≤ 2(l+1) - 1
  • Inductive case: d+1
  • Components Bi can locate ≤ 2^(d-1)(l+1) - 1 errors
  • By argument above, system locates
    ≤ 2((2^(d-1)(l+1) - 1) + 1) - 1 = 2^d(l+1) - 1 errors
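
The induction in Theorem 3 can be replayed numerically. This is a small sketch with a hypothetical function name: it applies the per-level step L → 2(L+1) - 1 starting from the component locatability l, so that the result can be compared with the closed form 2^(d-1)(l+1) - 1.

```python
def hierarchical_locatability(l, d):
    """Locatability of a d-level system built from components of locatability l,
    obtained by applying the base-case step once per added level."""
    if d < 1:
        raise ValueError("d must be at least 1")
    bound = l
    for _ in range(d - 1):
        bound = 2 * (bound + 1) - 1
    return bound
```

Each extra level roughly doubles the locatable error count, independent of the replication factor k (as long as k > 1).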

105
Summary
  • Presented systematic way to build hierarchical
    systems with good fault-detection properties
  • For d-level system composed of identical
    independent components
  • Component detectability = t, locatability = l

106
Conclusion
  • Formalisms for analyzing fault detectability
    and locatability
  • Matrix-based formalism of Nair et al
  • Dependence graph-based formalism of Park et al
  • Includes fault propagation models
  • Framework for hierarchical fault tolerant systems
    by Nair et al
  • Building fault tolerant systems out of fault
    tolerant components

107
Conclusion
  • These schemes have complex rules for acceptable
    check placements
  • Requires detailed analysis of system to place
    them manually
  • More detailed analysis if checks are
    hand-designed
  • Likely since few known automatic techniques
  • Overall, approach can support automatic solutions
    but currently very manual