Title: Design of High Availability Systems and Networks: Software Fault Tolerance

1. Design of High Availability Systems and Networks: Software Fault Tolerance
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2. Outline
- Motivation for software fault tolerance
- N-Version programming
- Recovery blocks
- IBM server example
- Process pairs
- Robust data structures
3. Motivation for Software Fault Tolerance
- The usual route to software reliability is fault avoidance through good software engineering methodologies
- For large and complex systems, fault avoidance alone is not successful
- Redundancy in software is needed to detect, isolate, and recover from software failures
- Use static redundancy or dynamic redundancy
- Hardware fault tolerance is easier to assess
- Software is difficult to prove correct

HARDWARE FAULTS vs. SOFTWARE FAULTS
1. Hardware faults are time-dependent; software faults are time-invariant
2. Duplicate hardware detects faults; duplicate software is not effective
3. Random failure is the main cause of hardware faults; complexity is the main cause of software faults
4. Consequences of Software Failure
- General Accounting Office reports 4.2 missions lost annually due to software errors
- Launch failure of Mariner I (1962)
- Destruction of a French satellite (1988)
- Problems with Space Shuttle and Apollo missions
- STAR WARS (SDI): billions of dollars funded for correct software development
- AT&T network blockages (error in recovery-recognition software) (1990)
- SS7 (signaling system) protocol implementation: untested patch (mistyped character) (1997)
- Therac-25 (overdose of medical radiation: 1000s of rads in excess of the prescribed dosage)
5. Experiences with Current Software
- Many computer crashes are due to software
- Even though one expects software to be correct, it never is
- Software exhibits a fairly constant failure frequency
- The number of failures is correlated with
  - execution time
  - code density
  - software timing and synchronization points
6. Experiences with Current Software (cont.)
Key parameters and variables (with defect reintroduction):
- Defect detection time constant: 17.2 weeks
- Defect repair time constant: 4.7 weeks
- Code delivery: 589,810 lines
- Initial error density: 0.00387 defects per line
- Defect reintroduction rate: 33 percent
- Deployment time T: week 100
- Estimated remaining defects ERD(T): 664 defects
- Estimated current defects ECD(T): 445 defects
- Testing process quality TPQ(T): 90 percent
- Testing process efficiency TPE(T): 60 percent
7. Difficulties
- Improvements in software development methodologies reduce the incidence of faults, yielding fault avoidance
- Need for test and verification
- Formal verification techniques, such as proof of correctness, can be applied only to rather small programs
- Potential exists for faulty translation of user requirements
- Conventional testing is hit-or-miss: "Program testing can show the presence of bugs, but never their absence." - Dijkstra, 1972
- There is a lack of good fault models
8. Approaches to Software Fault Tolerance
- ROBUSTNESS: the extent to which software continues to operate despite the introduction of invalid inputs
  - Example 1: check input data
    - ask for new input, or
    - use a default value and raise a flag
  - Example 2: self-checking software
- FAULT CONTAINMENT: faults in one module should not affect other modules
  - Examples: reasonableness checks
    - watchdog timers
    - overflow/divide-by-zero detection
    - assertion checking
- FAULT TOLERANCE: provides uninterrupted operation in the presence of a program fault through multiple implementations of a given function
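The input-checking flavor of robustness above (validate, then either ask again or fall back to a default and raise a flag) can be sketched as follows; the function name, default, and valid range are hypothetical, chosen only to illustrate the pattern:

```python
def read_fraction(raw, default=0.5):
    """Robust input handling: validate the input; on invalid or out-of-range
    data, return a default value together with a raised flag."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default, True          # unparsable input -> default, flag raised
    if not 0.0 <= value <= 1.0:
        return default, True          # out-of-range input -> default, flag raised
    return value, False               # valid input -> value, no flag

print(read_fraction("0.25"))   # (0.25, False)
print(read_fraction("junk"))   # (0.5, True)
```

The caller can inspect the flag to decide whether to prompt for new input instead of accepting the default.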
9. N-Version Programming: Basic Model
[Figure: the N-version software (NVS) model with N = 3; the versions' outputs feed a decision algorithm that produces consensus results]
10. Recovery Blocks: Basic Model
[Figure: the recovery block (RB) model. Within the execution environment (EE), the j-th recovery block software unit runs Alternate 1; its results go to the acceptance test. If the test fails, the execution support functions restore state from the recovery cache and take the next alternate (Alternate 2, ...); if it passes, the results are accepted.]
11. Execution Models for Software Fault-Tolerance Approaches
[Figure: two flowcharts.
Recovery blocks: start software execution; the primary alternate executes, then the acceptance test; if the test fails, the next alternate (l = 2, ..., N) is selected and executed; if no alternate remains, the software has failed; if the test passes, software execution ends.
N-version programming: start software execution; versions 1 through N execute in parallel; their results are gathered and the decision algorithm runs; if an acceptable result is provided, software execution ends; otherwise the software has failed.]
12. Execution Models for Software Fault-Tolerance Approaches (cont.)
[Figure: N self-checking programming. Self-checking components 1 through N execute in parallel; each either provides an acceptable result or does not; one of the acceptable results is selected; if no result can be selected, the software has failed.]
13. Software Fault-Tolerance Approaches and Their Equivalent Hardware Counterparts
- RB is equivalent to stand-by sparing (passive dynamic redundancy) in HW fault-tolerant architectures
- NVP is equivalent to N-modular redundancy (static redundancy) in HW fault-tolerant architectures
- NSCP is equivalent to active dynamic redundancy
- A self-checking component results either from
  - the association of an acceptance test with a version, or
  - the association of two variants with a comparison algorithm
- Fault tolerance is provided by the parallel execution of N ≥ 2 self-checking components
14. Concepts of N-Version Programming
- N ≥ 2 versions of functionally equivalent programs
- Independent generation of the programs: carried out by N groups of individuals who do not communicate with each other about the programming process (different algorithms, different programming languages, different translations)
- The initial specification is written in some formal specification language; it
  - states the functional requirements unambiguously
  - leaves the widest possible choice of implementation
- By making the development process diverse, it is hoped that the versions will contain diverse faults
- The inventors of NVP emphasized that the definition of NVP has never postulated an assumption of independence, and that NVP is a rigorous process of software development
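The run-the-versions-and-vote structure of NVP can be sketched in a few lines; the three "versions" below are hypothetical stand-ins (two correct implementations and one with a deliberate bug), used only to show how a majority vote masks a minority fault:

```python
from collections import Counter

def n_version_execute(versions, x):
    """Run N independently developed versions on the same input and
    return the majority result (a sketch of the NVP decision algorithm)."""
    results = [v(x) for v in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes > len(versions) // 2:
        return winner
    raise RuntimeError("no majority consensus")   # decision algorithm fails

# hypothetical versions of integer square-root rounding; version_c has a bug
version_a = lambda x: round(x ** 0.5)
version_b = lambda x: round(x ** 0.5)
version_c = lambda x: int(x ** 0.5)               # truncates instead of rounding

print(n_version_execute([version_a, version_b, version_c], 8))   # 3
```

For x = 8, versions a and b return 3 while the buggy version c returns 2; the 2-of-3 vote masks the fault.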
15. Assumption of Independence in N-Version Programming
- Do the N versions of a program fail independently? Are faults unrelated?
- Does Prob(failure of N-version system) = [Prob(failure of one version)]^N?
- If so, then the system reliability can be very high
- Why might this assumption be false?
  - People make the same mistakes, e.g., incorrect treatment of boundary conditions
  - Some parts of a problem are more difficult than others; statistics show similarity in programmers' views of the difficult regions
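To see why the independence assumption matters, here is the standard reliability arithmetic for a majority-voted triple under the (optimistic) assumption of independent failures; the per-version failure probability p = 10^-3 is an illustrative value, not from the slides:

```python
from math import comb

def majority_failure(p, n=3):
    """Failure probability of an n-version majority vote, ASSUMING the
    versions fail independently with per-version probability p: the
    system fails only when a majority (>= n//2 + 1) of versions fail."""
    k = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 1e-3
print(majority_failure(p))   # about 3e-06, far below p -- but only if independent
```

Under correlated faults (the experimentally observed case), the system failure rate stays near p itself, as the PODS example on the next slide shows.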
16. Observations from Experiments
- The assumption that failures of versions are independent DOES NOT hold
- This does not mean N-version programming is useless
- The reliability of the system will not be as high as in the case when the faults in different versions are independent
- Example: PODS (Project on Diverse Software)
  - All faults were caused by omissions and ambiguities in the requirement specifications
  - Two common faults were found in two versions
  - Three versions of software with failure rates 1.5 × 10^-6, 0.8 × 10^-3, and 0.8 × 10^-3 resulted in a failure rate of 0.8 × 10^-3 after majority voting
  - The common/coincident faults could not be excluded by majority voting
17. Limitations of N-Version Programming
- All N versions originate from the same initial specification, whose correctness, completeness, and unambiguity must be assumed
  - Use formal correctness proofs on the specification, rather than proofs on the implementations
  - Exhaustive validation
- Based on the assumption that software faults are distinguishable
  - faults that cause disagreement between versions at the specified voting points may remain even after independent programming efforts to remove identical software defects
18. Concepts of Recovery Blocks
- Characteristics
  - Incorporates a general solution to the problem of switching to a spare
  - Explicitly structures the software system so that the extra software for spares and error detection does not reduce system reliability
  - First considered for a single sequential process; later extended to
    - multiple processes within one system
    - multiple processes in multiple systems → distributed recovery blocks
- Program progress can be viewed as sequences of basic operations: assignments to stored variables
- A structured program has BLOCKS of code that simplify understanding of the functional description
- Choose blocks as the units for error detection and recovery
19. Alternates
- The primary alternate is the one that is to be used normally
- Other alternates attempt less desirable options
- One source of alternates is earlier releases of the primary alternate
- Gracefully degraded alternates, e.g., ensure consistent sequence (S):
  - by extend S with (i)
  - else by concatenate to S
  - else by S := () (empty sequence)
  - else error
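The ensure/by/else-by/else-error structure above can be sketched as a generic recovery-block executor; the sorting "alternates" and the acceptance test are hypothetical examples chosen to keep the sketch self-contained:

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Recovery block sketch: try alternates in order, checkpoint the state
    on entry, and roll back whenever the acceptance test rejects a result."""
    checkpoint = copy.deepcopy(state)            # stand-in for the recovery cache
    for alternate in alternates:
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result                    # accepted: discard the checkpoint
        except Exception:
            pass                                 # a crash counts as a failed test
        state.clear()                            # restore state before the
        state.update(copy.deepcopy(checkpoint))  # next alternate runs
    raise RuntimeError("error: all alternates failed the acceptance test")

# hypothetical primary (buggy) and degraded-but-correct alternate
primary  = lambda s: s["items"][::-1]                  # wrong: merely reverses
fallback = lambda s: sorted(s["items"])                # slower but correct
is_sorted = lambda r: all(a <= b for a, b in zip(r, r[1:]))   # acceptance test

print(recovery_block({"items": [3, 1, 2]}, [primary, fallback], is_sorted))
```

The primary's wrong answer fails the acceptance test, the state is rolled back, and the fallback's result is accepted.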
20. Acceptance Tests
- Function: ensure that the operation of the recovery block is satisfactory
- Should check variables of the program that are NOT local to the recovery block, since local variables can have no effect after exit; also, different alternates use different local variables
- Need not check for absolute correctness: a cost/complexity trade-off
- Run-time overhead should be LOW
- Must have NO RESIDUAL EFFECTS: if the test itself updated variables, it could cause successive alternates to pass spuriously
21. Restoration of System State
- Restoring system state is automatic
- Taking a copy of the entire system state on entry to each recovery block is too costly
- Use recovery caches (recursive caches) instead
- When a process is backed up, it is restored to the state just before entry to the primary alternate
- Only NONLOCAL variables that have been MODIFIED have to be reset
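The recovery-cache idea above (save a variable's prior value only when it is first modified, so rollback resets only what changed) can be sketched as follows; the class and variable names are illustrative, not from the slides:

```python
class RecoveryCache:
    """Recovery-cache sketch: record the prior value of a nonlocal variable
    on its FIRST modification inside the block; rollback restores only
    the modified variables, not the entire state."""
    def __init__(self, state):
        self.state = state
        self.saved = {}                       # prior values of modified variables

    def write(self, key, value):
        if key not in self.saved:             # first write inside this block
            self.saved[key] = self.state.get(key)
        self.state[key] = value

    def rollback(self):
        for key, old in self.saved.items():
            self.state[key] = old
        self.saved.clear()

state = {"x": 1, "y": 2}
cache = RecoveryCache(state)
cache.write("x", 99)          # only x is cached; y is never copied
cache.rollback()
print(state)                  # {'x': 1, 'y': 2}
```

Because `y` was never written, it is never copied: the cost of the checkpoint is proportional to what the alternate actually modifies.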
22. Process Conversations
- A systematic methodology for extending recovery blocks across processes by taking process interactions into consideration (considers both time and space)
- Prevents the domino effect
[Figure: timelines of processes P1 and P2, each with checkpoints (X); interactions between unplanned checkpoints can force rollbacks to cascade, i.e., the domino effect]
23. Process Conversations (cont.)
- A recovery block spanning two or more processes is called a conversation
- Within a conversation, processes communicate only among the participants, NOT with outside processes
- Operation of a conversation
  - On entry, each process establishes a checkpoint
  - If an error is detected by any process, then all processes restore their checkpoints
  - Next, ALL processes execute their next available alternates
  - All processes leave the conversation together, performing their acceptance tests just prior to leaving
  - At the end of the conversation, ALL processes must satisfy their respective acceptance tests, and none may proceed otherwise
24. Nested Conversations
[Figure: nested conversation boundaries around process timelines. Legend: checkpoint, inter-process communication, acceptance test, conversation boundary]
25. Comparison of Recovery Blocks vs. N-Version Programming
- Advantages of recovery blocks
  - Most software systems evolve by replacement of some modules with new ones; the old modules can be used as alternates
  - Clean hierarchical design: a structured approach
- Disadvantages of recovery blocks
  - System state must be saved before entry to a recovery block: excessive storage
  - Difficult to handle multiple processes: risk of the domino effect
  - Difficult to undo effects in real-time systems
  - Effectiveness depends on the acceptance test
    - higher coverage means more complexity
    - lack of a formal method to check it
26. Comparison of Recovery Blocks vs. N-Version Programming (cont.)
- Advantages of N-version programming
  - Immediate masking of software faults: no delay in operation
  - Self-checking (acceptance tests) not required
  - Conventional fault-tolerant systems (HW and SW) already have redundant hardware, e.g., TMR, so it is easy to run N-version software on the redundant hardware
- Disadvantages of N-version programming
  - How to get N versions?
  - Design diversity must be deliberately imposed, since randomness does not give uncorrelated software faults
  - Extremely dependent on the input specifications (formal correctness proofs)
27. High-Availability System Design: IBM Mainframe 30xx
28. IBM 30xx Simplified System Model
[Figure: system components - expanded storage, central memory, system controller, processor controller, CPUs, power distribution and cooling, channel control, channel adapters and servers]
29. Fault Isolation Using Hardware Checkers
- Error checker placement is determined by fault isolation domains (FIDs)
- Checkers define the boundary of fault containment
[Figure: field replaceable units FRU 1-5 connected by a cable, with checkers 1-3 and a decoder; fault isolation domains shown in red, field replaceable units in blue. Example: if checker 2 is triggered and register C is an input to register B, the implicated set of FRUs is {3, 4}]
30. Mapping of Fault Isolation Domains to Field Replaceable Units

Function        FID  FRU  Syndrome
Memory array 1   1    1     C1
Register A       1    1     C1
Checker 1        1    1     C1
Drivers          2    1     C2
Cable            2    2     C2
Memory array 2   2    3     C2
Register B       2    3     C2
Checker 2        2    3     C2
Register C       2    4     C2
Decoder          3    5     C3
Checker 3        3    5     C3
31. IBM 30xx Data Path Overview
[Figure: data path. Expanded storage and central storage each have a storage controller with a hardware-assisted memory tester, with ECC on the storage paths. The system controller connects storage to the CPUs and the channel control element. Each CPU contains a cache, instruction fetch/decode, instruction execution, and vector execution units, protected by parity (P). Channel adapters and channel servers use Logic Support Station Groups (LSSG), with parity on control storage. The processor controller oversees the system. P = parity]
32. Hardware-Based Retry
Errors are detected by parity checks on register contents and on data buses, and by pattern-validity checks in control logic circuits.
[Flowchart: during instruction execution, operands are copied into retry buffers. If an error is detected, the instruction and execution elements freeze execution, stop on the error, and restore the operands; the error is communicated back to the processor controller through the LSS. If retry is permitted, instructions and data are fetched from the retry buffers, the instruction is retried, and execution restarts. If no retry is permitted or the retry threshold is crossed, the OS is signaled for software recovery.]
33. Checkers in the Central Processor
- Byte parity on data path registers
- Parity checks on input/output of adders
- Parity on microstore
- Parity on microstore addresses
- Encoder/decoder checks
- Single-bit error detection in cache for data
received from memory - Additional illegal pattern checks
34. Levels of Error Recovery
[Flowchart: a machine check interruption during system operation triggers, in order:
1. Functional recovery: perform instruction retry; if successful, the system continues.
2. System recovery: terminate the affected task and continue system operation; if successful, the system continues with the task terminated.
3. System-supported restart: restart system operation; if successful, the system is reloaded and a stop for repair is not required.
4. System repair: stop, repair, restart; the operator is notified if external repair is needed.]
35. System-Level Facilities for Error Detection and Recovery
- Installation error detection capability
  - Tools to build profiles of system software modules and inspect correct usage of system resources
  - Software facilities to detect the occurrence of selected events, e.g., appendages allow user control of I/O; SLIP (serviceability level indication processing) aids in error detection and diagnosis (e.g., access to traps that cause a program interruption)
- The user defines detection mechanisms for programmer-defined exceptions, e.g., an incorrect address or an attempt to execute privileged instructions
- The operator detects evident error conditions, e.g., loop conditions and endless wait states
- The data management and supervisor routines ensure that valid data is processed and that non-conflicting requests are made
36. Recovery Processing Overview: Handling Hardware and Software Errors
[Figure: on an ABEND (abnormal termination), control passes to the recovery termination manager, which invokes the program's termination routines, retry routines, and recovery routines]
37. IBM's S/390 G5 Microprocessor
- A non-superscalar processor in IBM's CMOS technology
- Four logical units
  - The L1 cache, or buffer control element (BCE), contains the cache data arrays, cache directory, translation-lookaside buffer (TLB), and address translation logic
  - The I-unit handles instruction fetching, decoding, and address generation, and contains the queue of instructions awaiting execution
  - The E-unit contains the various execution units, along with the local working copies of the general, access, and floating-point registers
  - The R-unit is the recovery unit; it holds a checkpointed copy of the entire microarchitected state of the processor
38. IBM G5 Microprocessor Recovery Support
- R-unit
  - For every clock cycle in which the E-unit produces a result, that value is also written into the R-unit copy
  - The R-unit checks whether the result is correct and then generates ECC on that result
  - The checkpointed result is written into the R-unit registers along with its ECC
  - The contents of the R-unit registers represent the complete checkpointed state of the processor during any given cycle, should it be necessary to recover from a hardware error
- Millicode
  - Millicode is used to implement instructions that are either more complex or relatively infrequently used
  - Millicode has complete read/write access to all R-unit registers
  - Millicode also performs various service functions: logging data associated with any hardware errors that may have occurred, scrubbing memory for correctable errors, supporting operator console functions, and controlling low-level I/O operations
39. IBM G5 Microprocessor Recovery Support (cont.)
- Full duplication of the I-unit and E-unit
  - On every clock cycle, signals coming from these units, including instruction results, are cross-compared in the R-unit and the L1 cache
  - If the signals do not match, hardware error recovery is invoked
- All arrays in the L1-cache unit are protected with parity, except for the store buffers, which are protected with ECC
- If the R-unit or L1 cache detects an error, the processor automatically enters an error-recovery mode of operation
40. IBM G5 Microprocessor Recovery Procedure
1. The R-unit freezes its checkpoint state and does not allow any pending instructions to update it
2. The L1 cache forwards any store data to the L2 for instructions that have already been checkpointed
3. All arrays in the L1-cache unit and the BTB are reset
4. Each R-unit register is read out in sequence, with the ECC logic correcting any errors it may find, and the corrected values are written back into the register file and to all shadow copies of these registers in the I-unit, E-unit, and L1 cache
5. All R-unit registers are read a second time to ensure there are no solid correctable errors; if there are, the processor is check-stopped, i.e., that chip is no longer available to the system
6. The E-unit forces a serialization interrupt, which restarts instruction fetching and execution
7. An asynchronous interrupt tells millicode to log trace-array and other data for later analysis by IBM product engineering
- Two conditions may cause recovery to fail:
  - an uncorrectable error during step 4, or another error occurring during step 6 before an instruction successfully completes
  - both cases result in a check-stop condition
41. IBM G5 Microprocessor System Recovery Features
- System recovery features are used when the processor goes into a check-stopped state
- Processor availability facility (PAF)
  - The service element scans out the latches of the check-stopped processor and extracts the processor's architectural state
  - The data are stored in an area set aside for the machine check interrupt
  - The operating system uses the saved data to resume executing the job on another processor
- Concurrent processor sparing
  - Uses spare processors not visible to the user
  - Upon a processor check-stop, the user can issue a command on the console that lets the operating system use one of the spare processors
- Transparent processor sparing
  - Moves the microarchitected state (checkpointed in the R-unit) of a failed processor to a spare processor in the system
  - The spare processor begins fetching and executing instructions where the failed processor stopped
42. Process Pairs
- Applicability
  - Permanent and transient hardware and software failures
  - Loosely coupled redundant architectures
  - Message-passing process communication
  - Well suited for maintaining data integrity in a transactional type of system
  - Can be used to replicate a critical system function or a user application
- Assumptions
  - Hardware and software modules are designed to be fail-fast, i.e., to rapidly detect errors and subsequently terminate processing
  - Errors can be corrected by re-executing the same software copy in a changed environment
43. Process Pairs: Overview
- The user application is replicated on two processors as primary and backup processes, i.e., as a process pair
- Normally, only the primary process provides service
- The primary sends checkpoints to the backup
- The backup can take over the function when the primary fails
- The operating system halts the processor when it detects non-recoverable errors
- The "I am alive" message protocol allows the other processors to detect the halt and take over the primaries that were running on the halted processor
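The "I am alive" halt-detection idea can be sketched as simple heartbeat bookkeeping; the processor names, timestamps, and timeout below are hypothetical values chosen for illustration:

```python
# Heartbeat sketch: each processor records the last time it heard
# "I am alive" from each peer; a peer silent for longer than the
# timeout is declared down, and its primaries are taken over.
def detect_down(last_heard, now, timeout):
    """Return the list of peers whose last heartbeat is older than timeout."""
    return [cpu for cpu, t in last_heard.items() if now - t > timeout]

last_heard = {"cpu0": 100.0, "cpu1": 97.2}      # seconds since some epoch
print(detect_down(last_heard, now=102.0, timeout=3.0))   # ['cpu1']
```

Here cpu1 has been silent for 4.8 s against a 3 s timeout, so it is flagged; cpu0, heard 2 s ago, is not.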
44. Process Pairs Mechanism in the Tandem Guardian OS
1. The application executes as the Primary
2. The Primary starts a Backup in another processor
3. Duplicated file images are also created
4. The Primary periodically sends checkpoint information to the Backup
5. The Backup reads the checkpoint messages and updates its data, file status, and program counter; the checkpoint information is inserted in the corresponding memory locations of the Backup
6. The Backup loads and executes if the system reports that the Primary's processor is down; the error is detected by the Primary's OS, or the Primary fails to respond to "I am alive" messages
7. All file activity by the Primary is performed on both the primary and backup file copies
8. The Primary periodically asks the OS whether a Backup exists; if there is no Backup, the Primary can request the creation of a copy of both the process and the file structure
[Figure: Primary and Backup processes on two processors, each asking "Backup exists?"; checkpoints (data, file status, PC) flow from Primary to Backup; the two operating systems exchange "I am alive" messages and perform I/O to mirrored disks]
45. Process Pairs: Transactions
- A major issue in the design of loosely coupled duplicated systems is how both copies can be kept consistent in the face of errors
[Table: a write request traced in six steps across the requester, requester backup, server, and server backup, all starting with SeqNo = 0. The requester issues a request to write a record and later checkpoints the results to its backup (SeqNo 0 → 1). The server, on receiving the request, returns the saved status if SeqNo < MySeqNo; otherwise it reads the disk, performs the operation, checkpoints the request, writes to disk, sets SeqNo = 1, checkpoints the result, and returns the results. The server backup saves the request and the result (SeqNo 0 → 1).]
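The sequence-number test in the table (return the saved status when SeqNo is not newer than the server's own) is what makes retried requests safe after a failover. A minimal sketch of that server-side logic, with hypothetical method and field names:

```python
class Server:
    """Sequence-numbered requests make retries idempotent: a replayed
    request returns the saved status instead of being re-executed."""
    def __init__(self):
        self.my_seqno = 0
        self.saved_status = None
        self.disk = {}                       # stand-in for the mirrored disks

    def write_record(self, seqno, key, value):
        if seqno <= self.my_seqno and self.saved_status is not None:
            return self.saved_status         # duplicate: replay the saved reply
        self.disk[key] = value               # perform the operation exactly once
        self.my_seqno = seqno
        self.saved_status = ("ok", seqno)    # checkpoint the result
        return self.saved_status

server = Server()
print(server.write_record(1, "acct", 500))   # ('ok', 1) -- executed
print(server.write_record(1, "acct", 500))   # ('ok', 1) -- replayed, not re-executed
```

If the requester's backup takes over and reissues the same request, the server recognizes the stale sequence number and returns the checkpointed status rather than writing twice.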
46. Process Pairs: Advantages and Disadvantages
- Advantages
  - Extremely successful in Tandem OLTP applications
  - Tolerates hardware, operating system, and application failures
  - High coverage (> 90%) of hardware and software faults
  - The backup does not significantly reduce performance
- Disadvantages
  - Requires error-detection checks and signaling techniques to make a process fail-fast
  - Process pairs are difficult to construct for non-transaction-based applications
47. Robust Data Structures
- The goal is to find storage structures that are robust in the face of errors and failures
- What do we want to preserve?
  - Semantic integrity: the meaning of the data is not corrupted
  - Structural integrity: the correct data representation is preserved
- The focus here is on techniques for preserving structural integrity
- A robust data structure contains redundant data that allow erroneous changes to be detected, and possibly corrected
  - a change is defined as an elementary (e.g., single-word) modification to the encoded form of a data structure instance (its representation on a storage medium)
  - structural redundancy includes
    - a stored count of the number of nodes in a structure instance
    - identifier fields
    - additional pointers
48. Robust Data Structures (cont.)
- Consider a data structure that consists of a header and a set of nodes
  - the header contains
    - pointers to certain nodes of the instance or to parts of itself
    - counts
    - identifier fields
  - a node contains
    - data items
    - structural information: pointers and node-type identifier fields
- Error detection and correction
  - in-line checks may be introduced into normal system code to perform error detection, and possibly correction, during regular operation
49. Linked Lists
- Non-robust data structure
  - each node stores a pointer to the next node of the list
  - the last node holds a null pointer
[Figure: header → node(data, next) → node(data, next = NULL)]
- 0-detectable and 0-correctable: changing one pointer to NULL can reduce any list to the empty list
50. Robust Data Structures: Singly Linked List Implementation
- Additions for improving robustness
  - an identifier field in each node
  - the NULL pointer in the last node is replaced by a pointer back to the header of the list
  - the header stores a count of the number of nodes
[Figure: header(H-ID, count = 3, next) → node(ID, data, next) → node(ID, data, next) → node(ID, data, next → header)]
- 1-detectable and 0-correctable
  - a change to the count can be detected by comparing it against the number of nodes found by following the pointers
  - a change to a pointer may be detected by a mismatch in the count, or because the new pointer points to a foreign node (which cannot have a valid identifier)
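The count-plus-identifier-plus-header-pointer scheme above can be sketched directly; the identifier constant and class names are illustrative, and the check routine is the kind of in-line check the earlier slide mentions:

```python
ID = 0xBEEF                       # node identifier tag (arbitrary example value)

class Node:
    def __init__(self, data):
        self.id, self.data, self.next = ID, data, None

class RobustList:
    """Singly linked list with a node count, per-node identifier fields,
    and the last node pointing back to the header: 1-detectable,
    0-correctable (a sketch of the slide's scheme)."""
    def __init__(self, items):
        self.count, self.head, prev = len(items), None, None
        for item in items:
            node = Node(item)
            if prev is None:
                self.head = node
            else:
                prev.next = node
            prev = node
        if prev is None:
            self.head = self      # empty list: header points to itself
        else:
            prev.next = self      # last node points back to the header

    def check(self):
        """Detect a single erroneous change: foreign node, bad pointer,
        or corrupted count."""
        seen, node = 0, self.head
        while node is not self and node is not None:
            if getattr(node, "id", None) != ID:
                return False      # foreign node: identifier mismatch
            seen += 1
            node = node.next
        return node is self and seen == self.count

lst = RobustList([10, 20, 30])
print(lst.check())                # True
lst.count = 2                     # simulate a single-word corruption
print(lst.check())                # False -- detected, but not correctable
```

A corrupted count or a pointer redirected to a foreign object is detected; with only one pointer chain, however, nothing can be repaired.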
51. Robust Data Structures: Doubly Linked List Implementation
- Additions for improving robustness
  - a pointer is added to each node, pointing to the node's predecessor in the list
[Figure: header(H-ID, count = 3) and nodes (ID, data) linked by both next and previous pointers]
- 2-detectable and 1-correctable: the data structure has two independent, disjoint sets of pointers, each of which may be used to reconstruct the entire list
52. Error Correction in a Doubly Linked List
- Scan the list in the forward direction until an identifier-field error or a forward/backward pointer mismatch is detected
- When this happens, scan the list in the reverse direction until a similar error is detected
- Repair the data structure
Example: in the list header → A → B → C, node B's forward pointer has been corrupted to point to a foreign node F.
- The forward scan detects a mismatch at node B and sets Local_Ptr(B) = B (the local node's pointer) and Next_Ptr(B) = F (the pointer to the next node)
- The reverse scan detects a mismatch at node C and sets Local_Ptr(C) = C and Back_Ptr(C) = B (the pointer to the previous node)
- Correction: since Local_Ptr(B) = Back_Ptr(C), set Next_Ptr(B) = Local_Ptr(C), i.e., Next_Ptr(B) = C
[Figure: header (H-ID, count = 3) and nodes A, B, C with identifier fields; B's corrupted next pointer (→ F) is repaired to point to C]
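The forward/reverse-scan repair can be sketched with the pointer chains modeled as dictionaries; the node labels (H, A, B, C, F) mirror the example above, and the function name is illustrative:

```python
def repair_forward(next_ptr, prev_ptr, head):
    """Doubly linked list repair sketch: walk forward until the next/previous
    pointers disagree, then use the backward chain (the second, independent
    pointer set) to splice in the true successor."""
    node = head
    while next_ptr[node] != head:
        succ = next_ptr[node]
        if prev_ptr.get(succ) != node:       # mismatch: forward pointer is bad
            # the true successor is the node whose back pointer names `node`
            true_succ = next(n for n, p in prev_ptr.items() if p == node)
            next_ptr[node] = true_succ       # repair: Next_Ptr(node) corrected
            return node, true_succ
        node = succ
    return None                              # forward chain is consistent

# list H -> A -> B -> C -> H, with B's forward pointer corrupted to F
next_ptr = {"H": "A", "A": "B", "B": "F", "C": "H"}
prev_ptr = {"A": "H", "B": "A", "C": "B"}
print(repair_forward(next_ptr, prev_ptr, "H"))   # ('B', 'C')
print(next_ptr["B"])                             # 'C' -- pointer repaired
```

The mismatch is found at B (its successor F has no back pointer to B), and C's intact back pointer identifies the correct repair, matching the Local_Ptr/Back_Ptr argument on the slide.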
53. Robust Data Structures: Concluding Remarks
- Commonly used techniques for supporting robust data structures
  - techniques that preserve the structural integrity of data
    - binary trees, heaps, FIFOs, queues, stacks
    - linked data structures
  - content-based techniques
    - checksums, encoding
- Limitations
  - not transparent to the application
  - best at tolerating errors that corrupt the structure of the data (not its semantics)
  - the increased complexity of the update routines may make them error-prone
  - erroneous changes to the data structure may be propagated by correct update routines
  - faulty update routines may provoke correlated erroneous changes to several fields