Title: Design of High Availability Systems and Networks: Software Fault Tolerance

1. Design of High Availability Systems and Networks: Software Fault Tolerance
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2. Outline
- Motivation for software fault tolerance
- N-Version programming
- Recovery blocks
- IBM server example
- Process pairs
- Robust data structures
3. Motivation for Software Fault Tolerance
- The usual route to software reliability is fault avoidance through good software engineering methodologies
- For large and complex systems, fault avoidance alone is not successful
- Redundancy in software is needed to detect, isolate, and recover from software failures
- Use static redundancy or dynamic redundancy
- Hardware fault tolerance is easier to assess
- Software is difficult to prove correct

HARDWARE FAULTS vs. SOFTWARE FAULTS
1. Hardware faults are time-dependent; software faults are time-invariant
2. Duplicate hardware detects faults; duplicate software is not effective
3. Random failure is the main cause of hardware faults; complexity is the main cause of software faults
4. Consequences of Software Failure
- General Accounting Office reports 4.2 missions lost annually due to software errors
- Launch failure of Mariner I (1962)
- Destruction of a French satellite (1988)
- Problems with Space Shuttle and Apollo missions
- STAR WARS (SDI): billions of dollars funded for correct software development
- AT&T network blockages (error in recovery-recognition software) (1990)
- SS7 (signaling system) protocol implementation: untested patch (mistyped character) (1997)
- Therac-25 (overdose of medical radiation: 1000s of rads in excess of the prescribed dosage)
5. Experiences with Current Software
- Many computer crashes are due to software
- Even though one expects software to be correct, it never is
- Software exhibits a fairly constant failure frequency
- The number of failures is correlated with
  - execution time
  - code density
  - software timing and synchronization points
6. Experiences with Current Software (cont.)
Key parameters and variables (with defect reintroduction):
- Defect detection time constant: 17.2 weeks
- Defect repair time constant: 4.7 weeks
- Code delivery: 589,810 lines
- Initial error density: 0.00387 defects per line
- Defect reintroduction rate: 33 percent
- Deployment time T: week 100
- Estimated remaining defects ERD(T): 664 defects
- Estimated current defects ECD(T): 445 defects
- Testing process quality TPQ(T): 90 percent
- Testing process efficiency TPE(T): 60 percent
7. Difficulties
- Improvements in software development methodologies reduce the incidence of faults, yielding fault avoidance
- Need for test and verification
- Formal verification techniques, such as proof of correctness, can be applied only to rather small programs
- Potential exists for faulty translation of user requirements
- Conventional testing is hit-or-miss: "Program testing can show the presence of bugs, but never their absence." - Dijkstra, 1972
- There is a lack of good fault models
8. Approaches to Software Fault Tolerance
- ROBUSTNESS: the extent to which software continues to operate despite the introduction of invalid inputs
  - Example 1: check input data
    - ask for new input, or
    - use a default value and raise a flag
  - Example 2: self-checking software
- FAULT CONTAINMENT: faults in one module should not affect other modules
  - Examples: reasonableness checks
    - watchdog timers
    - overflow/divide-by-zero detection
    - assertion checking
- FAULT TOLERANCE: provides uninterrupted operation in the presence of a program fault through multiple implementations of a given function
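The input-checking flavor of robustness above (validate, then either ask again or fall back to a default and raise a flag) can be sketched as follows; the function name, default, and valid range are hypothetical, chosen only to illustrate the pattern:

```python
def read_fraction(raw, default=0.5):
    """Robust input handling: validate the input; on invalid or out-of-range
    data, return a default value together with a raised flag."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default, True          # unparsable input -> default, flag raised
    if not 0.0 <= value <= 1.0:
        return default, True          # out-of-range input -> default, flag raised
    return value, False               # valid input -> value, no flag

print(read_fraction("0.25"))   # (0.25, False)
print(read_fraction("junk"))   # (0.5, True)
```

The caller can inspect the flag to decide whether to prompt for new input instead of accepting the default.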
9. N-Version Programming: Basic Model
[Figure: the N-version software (NVS) model with N = 3; the versions' outputs feed a decision algorithm that produces consensus results]
10. Recovery Blocks: Basic Model
[Figure: the recovery block (RB) model. Within the execution environment (EE), the j-th recovery block software unit runs Alternate 1; its results go to the acceptance test. If the test fails, the execution support functions restore state from the recovery cache and take the next alternate (Alternate 2, ...); if it passes, the results are accepted.]
11. Execution Models for Software Fault-Tolerance Approaches
[Figure: two flowcharts.
Recovery blocks: start software execution; the primary alternate executes, then the acceptance test; if the test fails, the next alternate (l = 2, ..., N) is selected and executed; if no alternate remains, the software has failed; if the test passes, software execution ends.
N-version programming: start software execution; versions 1 through N execute in parallel; their results are gathered and the decision algorithm runs; if an acceptable result is provided, software execution ends; otherwise the software has failed.]
12. Execution Models for Software Fault-Tolerance Approaches (cont.)
[Figure: N self-checking programming. Self-checking components 1 through N execute in parallel; each either provides an acceptable result or does not; one of the acceptable results is selected; if no result can be selected, the software has failed.]
13. Software Fault-Tolerance Approaches and Their Equivalent Hardware Counterparts
- RB is equivalent to stand-by sparing (passive dynamic redundancy) in HW fault-tolerant architectures
- NVP is equivalent to N-modular redundancy (static redundancy) in HW fault-tolerant architectures
- NSCP is equivalent to active dynamic redundancy
- A self-checking component results either from
  - the association of an acceptance test with a version, or
  - the association of two variants with a comparison algorithm
- Fault tolerance is provided by the parallel execution of N ≥ 2 self-checking components
14. Concepts of N-Version Programming
- N ≥ 2 versions of functionally equivalent programs
- Independent generation of the programs: carried out by N groups of individuals who do not communicate with each other about the programming process (different algorithms, different programming languages, different translations)
- The initial specification is written in some formal specification language; it
  - states the functional requirements unambiguously
  - leaves the widest possible choice of implementation
- By making the development process diverse, it is hoped that the versions will contain diverse faults
- The inventors of NVP emphasized that the definition of NVP has never postulated an assumption of independence, and that NVP is a rigorous process of software development
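The run-the-versions-and-vote structure of NVP can be sketched in a few lines; the three "versions" below are hypothetical stand-ins (two correct implementations and one with a deliberate bug), used only to show how a majority vote masks a minority fault:

```python
from collections import Counter

def n_version_execute(versions, x):
    """Run N independently developed versions on the same input and
    return the majority result (a sketch of the NVP decision algorithm)."""
    results = [v(x) for v in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes > len(versions) // 2:
        return winner
    raise RuntimeError("no majority consensus")   # decision algorithm fails

# hypothetical versions of integer square-root rounding; version_c has a bug
version_a = lambda x: round(x ** 0.5)
version_b = lambda x: round(x ** 0.5)
version_c = lambda x: int(x ** 0.5)               # truncates instead of rounding

print(n_version_execute([version_a, version_b, version_c], 8))   # 3
```

For x = 8, versions a and b return 3 while the buggy version c returns 2; the 2-of-3 vote masks the fault.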
15. Assumption of Independence in N-Version Programming
- Do the N versions of a program fail independently? Are faults unrelated?
- Does Prob(failure of N-version system) = [Prob(failure of one version)]^N?
- If so, then the system reliability can be very high
- Why might this assumption be false?
  - People make the same mistakes, e.g., incorrect treatment of boundary conditions
  - Some parts of a problem are more difficult than others; statistics show similarity in programmers' views of the difficult regions
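To see why the independence assumption matters, here is the standard reliability arithmetic for a majority-voted triple under the (optimistic) assumption of independent failures; the per-version failure probability p = 10^-3 is an illustrative value, not from the slides:

```python
from math import comb

def majority_failure(p, n=3):
    """Failure probability of an n-version majority vote, ASSUMING the
    versions fail independently with per-version probability p: the
    system fails only when a majority (>= n//2 + 1) of versions fail."""
    k = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 1e-3
print(majority_failure(p))   # about 3e-06, far below p -- but only if independent
```

Under correlated faults (the experimentally observed case), the system failure rate stays near p itself, as the PODS example on the next slide shows.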
16. Observations from Experiments
- The assumption that failures of versions are independent DOES NOT hold
- This does not mean N-version programming is useless
- The reliability of the system will not be as high as in the case when the faults in different versions are independent
- Example: PODS (Project on Diverse Software)
  - All faults were caused by omissions and ambiguities in the requirement specifications
  - Two common faults were found in two versions
  - Three versions of software with failure rates 1.5 × 10^-6, 0.8 × 10^-3, and 0.8 × 10^-3 resulted in a failure rate of 0.8 × 10^-3 after majority voting
  - The common/coincident faults could not be excluded by majority voting
17. Limitations of N-Version Programming
- All N versions originate from the same initial specification, whose correctness, completeness, and unambiguity must be assumed
  - Use formal correctness proofs on the specification, rather than proofs on the implementations
  - Exhaustive validation
- Based on the assumption that software faults are distinguishable
  - faults that cause disagreement between versions at the specified voting points may remain even after independent programming efforts to remove identical software defects
18. Concepts of Recovery Blocks
- Characteristics
  - Incorporates a general solution to the problem of switching to a spare
  - Explicitly structures the software system so that the extra software for spares and error detection does not reduce system reliability
  - First considered for a single sequential process; later extended to
    - multiple processes within one system
    - multiple processes in multiple systems → distributed recovery blocks
- Program progress can be viewed as sequences of basic operations: assignments to stored variables
- A structured program has BLOCKS of code that simplify understanding of the functional description
- Choose blocks as the units for error detection and recovery
19. Alternates
- The primary alternate is the one that is to be used normally
- Other alternates attempt less desirable options
- One source of alternates is earlier releases of the primary alternate
- Gracefully degraded alternates, e.g., ensure consistent sequence (S):
  - by extend S with (i)
  - else by concatenate to S
  - else by S := () (empty sequence)
  - else error
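The ensure/by/else-by/else-error structure above can be sketched as a generic recovery-block executor; the sorting "alternates" and the acceptance test are hypothetical examples chosen to keep the sketch self-contained:

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Recovery block sketch: try alternates in order, checkpoint the state
    on entry, and roll back whenever the acceptance test rejects a result."""
    checkpoint = copy.deepcopy(state)            # stand-in for the recovery cache
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result                    # accepted: discard the checkpoint
        except Exception:
            pass                                 # a crash counts as a failed test
        state.clear()                            # restore state before the
        state.update(copy.deepcopy(checkpoint))  # next alternate runs
    raise RuntimeError("error: all alternates failed the acceptance test")

# hypothetical primary (buggy) and degraded-but-correct alternate
primary  = lambda s: s["items"][::-1]                  # wrong: merely reverses
fallback = lambda s: sorted(s["items"])                # slower but correct
is_sorted = lambda r: all(a <= b for a, b in zip(r, r[1:]))   # acceptance test

print(recovery_block({"items": [3, 1, 2]}, [primary, fallback], is_sorted))
```

The primary's wrong answer fails the acceptance test, the state is rolled back, and the fallback's result is accepted.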
20. Acceptance Tests
- Function: ensure that the operation of the recovery block is satisfactory
- Should check variables of the program that are NOT local to the recovery block, since local variables can have no effect after exit; also, different alternates use different local variables
- Need not check for absolute correctness: a cost/complexity trade-off
- Run-time overhead should be LOW
- Must have NO RESIDUAL EFFECTS: if the test itself updated variables, it could cause successive alternates to pass spuriously
21. Restoration of System State
- Restoring system state is automatic
- Taking a copy of the entire system state on entry to each recovery block is too costly
- Use recovery caches (recursive caches) instead
- When a process is backed up, it is restored to the state just before entry to the primary alternate
- Only NONLOCAL variables that have been MODIFIED have to be reset
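The recovery-cache idea above (save a variable's prior value only when it is first modified, so rollback resets only what changed) can be sketched as follows; the class and variable names are illustrative, not from the slides:

```python
class RecoveryCache:
    """Recovery-cache sketch: record the prior value of a nonlocal variable
    on its FIRST modification inside the block; rollback restores only
    the modified variables, not the entire state."""
    def __init__(self, state):
        self.state = state
        self.saved = {}                       # prior values of modified variables

    def write(self, key, value):
        if key not in self.saved:             # first write inside this block
            self.saved[key] = self.state.get(key)
        self.state[key] = value

    def rollback(self):
        for key, old in self.saved.items():
            self.state[key] = old
        self.saved.clear()

state = {"x": 1, "y": 2}
cache = RecoveryCache(state)
cache.write("x", 99)          # only x is cached; y is never copied
cache.rollback()
print(state)                  # {'x': 1, 'y': 2}
```

Because `y` was never written, it is never copied: the cost of the checkpoint is proportional to what the alternate actually modifies.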
22. Process Conversations
- A systematic methodology for extending recovery blocks across processes by taking process interactions into consideration (considers both time and space)
- Prevents the domino effect
[Figure: timelines of processes P1 and P2, each with checkpoints (X); interactions between unplanned checkpoints can force rollbacks to cascade, i.e., the domino effect]
23. Process Conversations (cont.)
- A recovery block spanning two or more processes is called a conversation
- Within a conversation, processes communicate only among the participants, NOT with outside processes
- Operation of a conversation
  - On entry, each process establishes a checkpoint
  - If an error is detected by any process, then all processes restore their checkpoints
  - Next, ALL processes execute their next available alternates
  - All processes leave the conversation together, performing their acceptance tests just prior to leaving
  - At the end of the conversation, ALL processes must satisfy their respective acceptance tests, and none may proceed otherwise
24. Nested Conversations
[Figure: nested conversation boundaries around process timelines. Legend: checkpoint, inter-process communication, acceptance test, conversation boundary]
25. Comparison of Recovery Blocks vs. N-Version Programming
- Advantages of recovery blocks
  - Most software systems evolve by replacement of some modules with new ones; the old modules can be used as alternates
  - Clean hierarchical design: a structured approach
- Disadvantages of recovery blocks
  - System state must be saved before entry to a recovery block: excessive storage
  - Difficult to handle multiple processes: risk of the domino effect
  - Difficult to undo effects in real-time systems
  - Effectiveness depends on the acceptance test
    - higher coverage means more complexity
    - lack of a formal method to check it
26. Comparison of Recovery Blocks vs. N-Version Programming (cont.)
- Advantages of N-version programming
  - Immediate masking of software faults: no delay in operation
  - Self-checking (acceptance tests) not required
  - Conventional fault-tolerant systems (HW and SW) already have redundant hardware, e.g., TMR, so it is easy to run N-version software on the redundant hardware
- Disadvantages of N-version programming
  - How to get N versions?
  - Design diversity must be deliberately imposed, since randomness does not give uncorrelated software faults
  - Extremely dependent on the input specifications (formal correctness proofs)
27. High-Availability System Design: IBM Mainframe 30xx
28. IBM 30xx Simplified System Model
[Figure: system components - expanded storage, central memory, system controller, processor controller, CPUs, power distribution and cooling, channel control, channel adapters and servers]
29. Fault Isolation Using Hardware Checkers
- Error checker placement is determined by fault isolation domains (FIDs)
- Checkers define the boundary of fault containment
[Figure: field replaceable units FRU 1-5 connected by a cable, with checkers 1-3 and a decoder; fault isolation domains shown in red, field replaceable units in blue. Example: if checker 2 is triggered and register C is an input to register B, the implicated set of FRUs is {3, 4}]
30. Mapping of Fault Isolation Domains to Field Replaceable Units

Function        FID  FRU  Syndrome
Memory array 1   1    1     C1
Register A       1    1     C1
Checker 1        1    1     C1
Drivers          2    1     C2
Cable            2    2     C2
Memory array 2   2    3     C2
Register B       2    3     C2
Checker 2        2    3     C2
Register C       2    4     C2
Decoder          3    5     C3
Checker 3        3    5     C3
31. IBM 30xx Data Path Overview
[Figure: data path. Expanded storage and central storage each have a storage controller with a hardware-assisted memory tester, with ECC on the storage paths. The system controller connects storage to the CPUs and the channel control element. Each CPU contains a cache, instruction fetch/decode, instruction execution, and vector execution units, protected by parity (P). Channel adapters and channel servers use Logic Support Station Groups (LSSG), with parity on control storage. The processor controller oversees the system. P = parity]
32. Hardware-Based Retry
Errors are detected by parity checks on register contents and on data buses, and by pattern-validity checks in control logic circuits.
[Flowchart: during instruction execution, operands are copied into retry buffers. If an error is detected, the instruction and execution elements freeze execution, stop on the error, and restore the operands; the error is communicated back to the processor controller through the LSS. If retry is permitted, instructions and data are fetched from the retry buffers, the instruction is retried, and execution restarts. If no retry is permitted or the retry threshold is crossed, the OS is signaled for software recovery.]
33. Checkers in the Central Processor
- Byte parity on data path registers
- Parity checks on input/output of adders
- Parity on microstore
- Parity on microstore addresses
- Encoder/decoder checks
- Single-bit error detection in cache for data
received from memory - Additional illegal pattern checks
34. Levels of Error Recovery
[Flowchart: a machine check interruption during system operation triggers, in order:
1. Functional recovery: perform instruction retry; if successful, the system continues.
2. System recovery: terminate the affected task and continue system operation; if successful, the system continues with the task terminated.
3. System-supported restart: restart system operation; if successful, the system is reloaded and a stop for repair is not required.
4. System repair: stop, repair, restart; the operator is notified if external repair is needed.]
35. System-Level Facilities for Error Detection and Recovery
- Installation error detection capability
  - Tools to build profiles of system software modules and inspect correct usage of system resources
  - Software facilities to detect the occurrence of selected events, e.g., appendages allow user control of I/O; SLIP (serviceability level indication processing) aids in error detection and diagnosis (e.g., access to traps that cause a program interruption)
- The user defines detection mechanisms for programmer-defined exceptions, e.g., an incorrect address or an attempt to execute privileged instructions
- The operator detects evident error conditions, e.g., loop conditions and endless wait states
- The data management and supervisor routines ensure that valid data is processed and that non-conflicting requests are made
36. Recovery Processing Overview: Handling Hardware and Software Errors
[Figure: on an ABEND (abnormal termination), control passes to the recovery termination manager, which invokes the program's termination routines, retry routines, and recovery routines]
37. IBM's S/390 G5 Microprocessor
- A non-superscalar processor in IBM's CMOS technology
- Four logical units
  - The L1 cache, or buffer control element (BCE), contains the cache data arrays, cache directory, translation-lookaside buffer (TLB), and address translation logic
  - The I-unit handles instruction fetching, decoding, and address generation, and contains the queue of instructions awaiting execution
  - The E-unit contains the various execution units, along with the local working copies of the general, access, and floating-point registers
  - The R-unit is the recovery unit; it holds a checkpointed copy of the entire microarchitected state of the processor
38. IBM G5 Microprocessor Recovery Support
- R-unit
  - For every clock cycle in which the E-unit produces a result, that value is also written into the R-unit copy
  - The R-unit checks whether the result is correct and then generates ECC on that result
  - The checkpointed result is written into the R-unit registers along with its ECC
  - The contents of the R-unit registers represent the complete checkpointed state of the processor during any given cycle, should it be necessary to recover from a hardware error
- Millicode
  - Millicode is used to implement instructions that are either more complex or relatively infrequently used
  - Millicode has complete read/write access to all R-unit registers
  - Millicode also performs various service functions: logging data associated with any hardware errors that may have occurred, scrubbing memory for correctable errors, supporting operator console functions, and controlling low-level I/O operations
39. IBM G5 Microprocessor Recovery Support (cont.)
- Full duplication of the I-unit and E-unit
  - On every clock cycle, signals coming from these units, including instruction results, are cross-compared in the R-unit and the L1 cache
  - If the signals do not match, hardware error recovery is invoked
- All arrays in the L1-cache unit are protected with parity, except for the store buffers, which are protected with ECC
- If the R-unit or L1 cache detects an error, the processor automatically enters an error-recovery mode of operation
40. IBM G5 Microprocessor Recovery Procedure
1. The R-unit freezes its checkpoint state and does not allow any pending instructions to update it
2. The L1 cache forwards any store data to the L2 for instructions that have already been checkpointed
3. All arrays in the L1-cache unit and the BTB are reset
4. Each R-unit register is read out in sequence, with the ECC logic correcting any errors it may find, and the corrected values are written back into the register file and to all shadow copies of these registers in the I-unit, E-unit, and L1 cache
5. All R-unit registers are read a second time to ensure there are no solid correctable errors; if there are, the processor is check-stopped, i.e., that chip is no longer available to the system
6. The E-unit forces a serialization interrupt, which restarts instruction fetching and execution
7. An asynchronous interrupt tells millicode to log trace-array and other data for later analysis by IBM product engineering
- Two conditions may cause recovery to fail:
  - an uncorrectable error during step 4, or another error occurring during step 6 before an instruction successfully completes
  - both cases result in a check-stop condition
41. IBM G5 Microprocessor System Recovery Features
- System recovery features are used when the processor goes into a check-stopped state
- Processor availability facility (PAF)
  - The service element scans out the latches of the check-stopped processor and extracts the processor's architectural state
  - The data are stored in an area set aside for the machine check interrupt
  - The operating system uses the saved data to resume executing the job on another processor
- Concurrent processor sparing
  - Uses spare processors not visible to the user
  - Upon a processor check-stop, the user can issue a command on the console that lets the operating system use one of the spare processors
- Transparent processor sparing
  - Moves the microarchitected state (checkpointed in the R-unit) of a failed processor to a spare processor in the system
  - The spare processor begins fetching and executing instructions where the failed processor stopped
42. Process Pairs
- Applicability
  - Permanent and transient hardware and software failures
  - Loosely coupled redundant architectures
  - Message-passing process communication
  - Well suited for maintaining data integrity in a transactional type of system
  - Can be used to replicate a critical system function or a user application
- Assumptions
  - Hardware and software modules are designed to be fail-fast, i.e., to rapidly detect errors and subsequently terminate processing
  - Errors can be corrected by re-executing the same software copy in a changed environment
43. Process Pairs: Overview
- The user application is replicated on two processors as primary and backup processes, i.e., as a process pair
- Normally, only the primary process provides service
- The primary sends checkpoints to the backup
- The backup can take over the function when the primary fails
- The operating system halts the processor when it detects non-recoverable errors
- The "I am alive" message protocol allows the other processors to detect the halt and take over the primaries that were running on the halted processor
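The "I am alive" halt-detection idea can be sketched as simple heartbeat bookkeeping; the processor names, timestamps, and timeout below are hypothetical values chosen for illustration:

```python
# Heartbeat sketch: each processor records the last time it heard
# "I am alive" from each peer; a peer silent for longer than the
# timeout is declared down, and its primaries are taken over.
def detect_down(last_heard, now, timeout):
    """Return the list of peers whose last heartbeat is older than timeout."""
    return [cpu for cpu, t in last_heard.items() if now - t > timeout]

last_heard = {"cpu0": 100.0, "cpu1": 97.2}      # seconds since some epoch
print(detect_down(last_heard, now=102.0, timeout=3.0))   # ['cpu1']
```

Here cpu1 has been silent for 4.8 s against a 3 s timeout, so it is flagged; cpu0, heard 2 s ago, is not.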
44. Process Pairs Mechanism in the Tandem Guardian OS
1. The application executes as the Primary
2. The Primary starts a Backup in another processor
3. Duplicated file images are also created
4. The Primary periodically sends checkpoint information to the Backup
5. The Backup reads the checkpoint messages and updates its data, file status, and program counter; the checkpoint information is inserted in the corresponding memory locations of the Backup
6. The Backup loads and executes if the system reports that the Primary's processor is down; the error is detected by the Primary's OS, or the Primary fails to respond to "I am alive" messages
7. All file activity by the Primary is performed on both the primary and backup file copies
8. The Primary periodically asks the OS whether a Backup exists; if there is no Backup, the Primary can request the creation of a copy of both the process and the file structure
[Figure: Primary and Backup processes on two processors, each asking "Backup exists?"; checkpoints (data, file status, PC) flow from Primary to Backup; the two operating systems exchange "I am alive" messages and perform I/O to mirrored disks]
45. Process Pairs: Transactions
- A major issue in the design of loosely coupled duplicated systems is how both copies can be kept consistent in the face of errors
[Table: a write request traced in six steps across the requester, requester backup, server, and server backup, all starting with SeqNo = 0. The requester issues a request to write a record and later checkpoints the results to its backup (SeqNo 0 → 1). The server, on receiving the request, returns the saved status if SeqNo < MySeqNo; otherwise it reads the disk, performs the operation, checkpoints the request, writes to disk, sets SeqNo = 1, checkpoints the result, and returns the results. The server backup saves the request and the result (SeqNo 0 → 1).]
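The sequence-number test in the table (return the saved status when SeqNo is not newer than the server's own) is what makes retried requests safe after a failover. A minimal sketch of that server-side logic, with hypothetical method and field names:

```python
class Server:
    """Sequence-numbered requests make retries idempotent: a replayed
    request returns the saved status instead of being re-executed."""
    def __init__(self):
        self.my_seqno = 0
        self.saved_status = None
        self.disk = {}                       # stand-in for the mirrored disks

    def write_record(self, seqno, key, value):
        if seqno <= self.my_seqno and self.saved_status is not None:
            return self.saved_status         # duplicate: replay the saved reply
        self.disk[key] = value               # perform the operation exactly once
        self.my_seqno = seqno
        self.saved_status = ("ok", seqno)    # checkpoint the result
        return self.saved_status

server = Server()
print(server.write_record(1, "acct", 500))   # ('ok', 1) -- executed
print(server.write_record(1, "acct", 500))   # ('ok', 1) -- replayed, not re-executed
```

If the requester's backup takes over and reissues the same request, the server recognizes the stale sequence number and returns the checkpointed status rather than writing twice.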
46. Process Pairs: Advantages and Disadvantages
- Advantages
  - Extremely successful in Tandem OLTP applications
  - Tolerates hardware, operating system, and application failures
  - High coverage (> 90%) of hardware and software faults
  - The backup does not significantly reduce performance
- Disadvantages
  - Requires error-detection checks and signaling techniques to make a process fail-fast
  - Process pairs are difficult to construct for non-transaction-based applications
47. Robust Data Structures
- The goal is to find storage structures that are robust in the face of errors and failures
- What do we want to preserve?
  - Semantic integrity: the meaning of the data is not corrupted
  - Structural integrity: the correct data representation is preserved
- The focus here is on techniques for preserving structural integrity
- A robust data structure contains redundant data that allow erroneous changes to be detected, and possibly corrected
  - a change is defined as an elementary (e.g., single-word) modification to the encoded form of a data structure instance (its representation on a storage medium)
  - structural redundancy includes
    - a stored count of the number of nodes in a structure instance
    - identifier fields
    - additional pointers
48. Robust Data Structures (cont.)
- Consider a data structure that consists of a header and a set of nodes
  - the header contains
    - pointers to certain nodes of the instance or to parts of itself
    - counts
    - identifier fields
  - a node contains
    - data items
    - structural information: pointers and node-type identifier fields
- Error detection and correction
  - in-line checks may be introduced into normal system code to perform error detection, and possibly correction, during regular operation
49. Linked Lists
- Non-robust data structure
  - each node stores a pointer to the next node of the list
  - the last node holds a null pointer
[Figure: header → node(data, next) → node(data, next = NULL)]
- 0-detectable and 0-correctable: changing one pointer to NULL can reduce any list to the empty list
50. Robust Data Structures: Singly Linked List Implementation
- Additions for improving robustness
  - an identifier field in each node
  - the NULL pointer in the last node is replaced by a pointer back to the header of the list
  - the header stores a count of the number of nodes
[Figure: header(H-ID, count = 3, next) → node(ID, data, next) → node(ID, data, next) → node(ID, data, next → header)]
- 1-detectable and 0-correctable
  - a change to the count can be detected by comparing it against the number of nodes found by following the pointers
  - a change to a pointer may be detected by a mismatch in the count, or because the new pointer points to a foreign node (which cannot have a valid identifier)
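The count-plus-identifier-plus-header-pointer scheme above can be sketched directly; the identifier constant and class names are illustrative, and the check routine is the kind of in-line check the earlier slide mentions:

```python
ID = 0xBEEF                       # node identifier tag (arbitrary example value)

class Node:
    def __init__(self, data):
        self.id, self.data, self.next = ID, data, None

class RobustList:
    """Singly linked list with a node count, per-node identifier fields,
    and the last node pointing back to the header: 1-detectable,
    0-correctable (a sketch of the slide's scheme)."""
    def __init__(self, items):
        self.count, self.head, prev = len(items), None, None
        for item in items:
            node = Node(item)
            if prev is None:
                self.head = node
            else:
                prev.next = node
            prev = node
        if prev is None:
            self.head = self      # empty list: header points to itself
        else:
            prev.next = self      # last node points back to the header

    def check(self):
        """Detect a single erroneous change: foreign node, bad pointer,
        or corrupted count."""
        seen, node = 0, self.head
        while node is not self and node is not None:
            if getattr(node, "id", None) != ID:
                return False      # foreign node: identifier mismatch
            seen += 1
            node = node.next
        return node is self and seen == self.count

lst = RobustList([10, 20, 30])
print(lst.check())                # True
lst.count = 2                     # simulate a single-word corruption
print(lst.check())                # False -- detected, but not correctable
```

A corrupted count or a pointer redirected to a foreign object is detected; with only one pointer chain, however, nothing can be repaired.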
51. Robust Data Structures: Doubly Linked List Implementation
- Additions for improving robustness
  - a pointer is added to each node, pointing to the node's predecessor in the list
[Figure: header(H-ID, count = 3) and nodes (ID, data) linked by both next and previous pointers]
- 2-detectable and 1-correctable: the data structure has two independent, disjoint sets of pointers, each of which may be used to reconstruct the entire list
52. Error Correction in a Doubly Linked List
- Scan the list in the forward direction until an identifier-field error or a forward/backward pointer mismatch is detected
- When this happens, scan the list in the reverse direction until a similar error is detected
- Repair the data structure
Example: in the list header → A → B → C, node B's forward pointer has been corrupted to point to a foreign node F.
- The forward scan detects a mismatch at node B and sets Local_Ptr(B) = B (the local node's pointer) and Next_Ptr(B) = F (the pointer to the next node)
- The reverse scan detects a mismatch at node C and sets Local_Ptr(C) = C and Back_Ptr(C) = B (the pointer to the previous node)
- Correction: since Local_Ptr(B) = Back_Ptr(C), set Next_Ptr(B) = Local_Ptr(C), i.e., Next_Ptr(B) = C
[Figure: header (H-ID, count = 3) and nodes A, B, C with identifier fields; B's corrupted next pointer (→ F) is repaired to point to C]
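The forward/reverse-scan repair can be sketched with the pointer chains modeled as dictionaries; the node labels (H, A, B, C, F) mirror the example above, and the function name is illustrative:

```python
def repair_forward(next_ptr, prev_ptr, head):
    """Doubly linked list repair sketch: walk forward until the next/previous
    pointers disagree, then use the backward chain (the second, independent
    pointer set) to splice in the true successor."""
    node = head
    while next_ptr[node] != head:
        succ = next_ptr[node]
        if prev_ptr.get(succ) != node:       # mismatch: forward pointer is bad
            # the true successor is the node whose back pointer names `node`
            true_succ = next(n for n, p in prev_ptr.items() if p == node)
            next_ptr[node] = true_succ       # repair: Next_Ptr(node) corrected
            return node, true_succ
        node = succ
    return None                              # forward chain is consistent

# list H -> A -> B -> C -> H, with B's forward pointer corrupted to F
next_ptr = {"H": "A", "A": "B", "B": "F", "C": "H"}
prev_ptr = {"A": "H", "B": "A", "C": "B"}
print(repair_forward(next_ptr, prev_ptr, "H"))   # ('B', 'C')
print(next_ptr["B"])                             # 'C' -- pointer repaired
```

The mismatch is found at B (its successor F has no back pointer to B), and C's intact back pointer identifies the correct repair, matching the Local_Ptr/Back_Ptr argument on the slide.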
53. Robust Data Structures: Concluding Remarks
- Commonly used techniques for supporting robust data structures
  - techniques that preserve the structural integrity of data
    - binary trees, heaps, FIFOs, queues, stacks
    - linked data structures
  - content-based techniques
    - checksums, encoding
- Limitations
  - not transparent to the application
  - best at tolerating errors that corrupt the structure of the data (not its semantics)
  - the increased complexity of the update routines may make them error-prone
  - erroneous changes to the data structure may be propagated by correct update routines
  - faulty update routines may provoke correlated erroneous changes to several fields