Title: SwiFT: SOFTWARE IMPLEMENTED FAULT TOLERANCE Pawan Kumar Choudhary, Kishor S. Trivedi
1 SwiFT: SOFTWARE IMPLEMENTED FAULT TOLERANCE
- Pawan Kumar Choudhary, Kishor S. Trivedi
- Center for Advanced Computing and Communication
- Department of Electrical and Computer Engineering
- Duke University
2 Need for Software Fault Tolerance
- From a user's point of view, fault tolerance has two dimensions:
- Availability: users of telephone switching systems, for example, demand continuous availability.
- Data consistency: bank teller machine customers demand the highest degree of data consistency.
- Safety-critical, real-time systems, such as nuclear power reactors and flight control systems, need the highest levels of both availability and data consistency.
3 What is SwiFT
- SwiFT (Software Implemented Fault Tolerance) is a collection of daemon processes and C/C++ libraries.
- It provides fault tolerance to applications on a cluster of Windows NT nodes, logically configured as a ring.
- It provides automatic error detection and recovery, checkpointing/message logging, fault management, event logging and replay, data replication, and IP packet re-routing.
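As a rough illustration of the logical ring, the sketch below assumes that each node watches its successor in ring order and that failed nodes are skipped so the ring stays closed. The slides do not say how SwiFT uses the ring, so this policy, the function `ring_successor`, and the node names are all hypothetical.

```python
def ring_successor(nodes, node, failed=frozenset()):
    """Return the next live node after `node` in ring order, or None.

    Assumption (not from the slides): each node monitors its successor,
    and failed nodes are skipped so the ring stays closed.
    """
    start = nodes.index(node)
    n = len(nodes)
    for step in range(1, n):
        candidate = nodes[(start + step) % n]
        if candidate not in failed:
            return candidate
    return None  # no other live node remains


# Hypothetical four-node cluster.
nodes = ["nt1", "nt2", "nt3", "nt4"]
print(ring_successor(nodes, "nt4"))                  # wraps around the ring
print(ring_successor(nodes, "nt1", failed={"nt2"}))  # skips the failed node
```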
4 Transaction Processing vs. Process Replication
- To achieve high availability and reliability in applications such as telecommunications in a distributed network environment, two types of techniques have been deployed for fault tolerance:
- Transaction processing
- Applications usually have a well-defined transaction boundary, such as updating a record or establishing a communication channel.
- When a fault occurs, both the client and the server abort the ongoing transaction and roll back to a clean state.
- Process replication
- It allows faster recovery than transaction processing and supports recovery of non-transactional and long-transaction applications, such as switching systems and PBXs.
- It is also suitable for applications that incur long transactions or do not satisfy transaction properties such as atomicity or isolation.
- Process replication uses two techniques, atomic multicasting and checkpoint/message logging, to make process states consistent.
5 Replication Techniques in SwiFT
- SwiFT applies three different techniques for replication:
- Cold: Only one active copy of the FT process is present. If it fails, SwiFT first tries to recover the failed process locally; if the local recovery fails, SwiFT migrates the process onto another machine.
- Warm: One or more backup processes run on the network, and the primary process periodically checkpoints its state to its backup processes.
- Hot: SwiFT monitors all of a fault-tolerant process's replicas. If SwiFT detects any replica failure, it recovers the failed replica so the number of replicas remains constant.
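The cold-replication policy above (try local recovery first, then migrate) can be sketched as follows. This is a minimal illustration, not SwiFT's implementation: `recover_cold`, its two callables, and the `max_local_attempts` parameter are hypothetical stand-ins for whatever SwiFT does internally.

```python
def recover_cold(restart_locally, migrate, max_local_attempts=1):
    """Cold-replication recovery: try a local restart, then migrate.

    `restart_locally` and `migrate` are hypothetical callables that return
    True on success; they are placeholders, not SwiFT functions.
    """
    for _ in range(max_local_attempts):
        if restart_locally():
            return "local"
    if migrate():
        return "migrated"
    return "failed"


# Local restart succeeds, so no migration is attempted.
print(recover_cold(lambda: True, lambda: True))
# Local restart fails, so the process is migrated to another machine.
print(recover_cold(lambda: False, lambda: True))
```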
6 Checkpoint/Message Logging
- SwiFT applies the checkpoint/message-logging technique for fault tolerance.
- SwiFT provides fault-tolerant services by routinely checkpointing the server's state onto backup servers or into stable storage.
- When a failure occurs, SwiFT stops the failed server process and either promotes a backup server to primary or creates a new process.
- Checkpointing in SwiFT is done with the help of application monitoring, application failure recovery, file replication, Windows event logging/replay, IP packet dispatching, and IP address fail-over.
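To make the checkpoint/message-logging idea concrete, here is a toy sketch in which a server's state is rebuilt from its last checkpoint plus the messages logged since then. It assumes message handling is deterministic. The class `CheckpointedCounter` and its methods are hypothetical and do not reflect the Libft or Winckp interfaces.

```python
import copy


class CheckpointedCounter:
    """Toy server whose state is a dict of counters."""

    def __init__(self):
        self.state = {}
        self.checkpoint = {}  # stands in for a backup server or stable storage
        self.log = []         # messages received since the last checkpoint

    @staticmethod
    def _apply(state, key, amount):
        state[key] = state.get(key, 0) + amount

    def handle(self, key, amount):
        self.log.append((key, amount))  # log the message before applying it
        self._apply(self.state, key, amount)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()                # logged messages are now covered

    def recover(self):
        """Rebuild state: restore the checkpoint, then replay the log."""
        state = copy.deepcopy(self.checkpoint)
        for key, amount in self.log:
            self._apply(state, key, amount)
        return state


server = CheckpointedCounter()
server.handle("calls", 1)
server.take_checkpoint()
server.handle("calls", 2)
before_crash = dict(server.state)
server.state = None                      # simulate losing in-memory state
print(server.recover() == before_crash)  # recovered state matches
```

Replaying the log only reproduces the pre-failure state because each message has the same effect every time it is applied. Non-deterministic handlers would need their outcomes logged as well.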
7 Components of SwiFT
- Watchd for process failure detection, recovery, replication management, and distributed system services
- Winckp for transparent process checkpointing and mouse/keyboard event logging and replay
- Libft for data checkpointing and communication message logging and recovery
- REPL for online incremental file replication and disaster recovery
- One-IP for IP packet dispatching, fail-over, and re-routing
- SwiFT's components are designed to handle both client and server error recovery, so they can all be applied within a program.
- Program developers can often access a server program's source code but have no control over client programs developed by other companies, so SwiFT makes client error recovery as transparent as possible.
8 Applications
- Embedded within a system to improve availability and reliability.
- Especially useful in telecommunications, where high availability is required.
- E-commerce and financial transactions over the Internet, where data consistency is of utmost importance.
9 SwiFT and Our Role
- SwiFT has been developed by Lucent Technologies for Windows NT systems.
- http://www.bell-labs.com/projects/swift
- Our emphasis will be on using it to evaluate the effectiveness of different recovery mechanisms. This will also help in verifying several recovery models. For example:
- S. Garg, Y. Huang, C. Kintala, K. S. Trivedi, and S. Yajnik. Performance and reliability evaluation of passive replication schemes in application level fault tolerance. 1999.
10 Modeling with SwiFT for Different Replication Schemes
- CTMC model for a server with no, cold, and warm replication
- [Figure 1: Availability vs. mean time for node failure detection]
- [Figure 2: Loss probability and throughput vs. mean time for node failure detection]
- Effect of polling frequency K; the polling interval for SwiFT is set to 2 seconds.
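To show why availability depends on the mean failure detection time, the sketch below uses a deliberately simplified three-state cyclic CTMC (up, failed but undetected, recovering). This is not the model from the slides or the cited paper, and the function name and every numeric value are assumptions chosen only for illustration.

```python
def availability(mttf, detect_time, recover_time):
    """Steady-state availability of a three-state cyclic CTMC.

    States: UP -> UNDETECTED (rate 1/mttf)
            UNDETECTED -> RECOVERING (rate 1/detect_time)
            RECOVERING -> UP (rate 1/recover_time)
    In a cyclic chain each state's steady-state probability is proportional
    to its mean sojourn time, so availability is the UP share of one cycle.
    """
    return mttf / (mttf + detect_time + recover_time)


# Illustrative values only (in hours); none are taken from the slides.
for detect_seconds in (1, 10, 60):
    a = availability(mttf=1000.0,
                     detect_time=detect_seconds / 3600.0,
                     recover_time=30 / 3600.0)
    print(f"detection {detect_seconds:>2d} s -> availability {a:.8f}")
```

Under these assumptions, a longer detection time adds directly to the downtime in each failure cycle, so availability falls as the detection time grows.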
11 Other Commercial FT Processes
- DOORS: Distributed Object Oriented Reliable Service
- MSCS: Microsoft Cluster Server
- Microsoft Wolfpack
- Veritas First Watch
- Vinca Standby Server
- HP MC/ServiceGuard
12 Summary
- SwiFT is a collection of reusable software components that facilitate the development of fault-tolerant applications for the Windows NT operating system.
- Designed with highly available applications in mind, its components address cold and warm replication management schemes.
- SwiFT specializes in detecting hangs and failures resulting from system crashes.
- SwiFT addresses the checkpointing of process states and the replication of files, processes, and applications; these are useful for implementing low-cost fault tolerance.