SwiFT: SOFTWARE IMPLEMENTED FAULT TOLERANCE Pawan Kumar Choudhary, Kishor S. Trivedi

1
SwiFT: SOFTWARE IMPLEMENTED FAULT TOLERANCE
Pawan Kumar Choudhary, Kishor S. Trivedi

Center for Advanced Computing and Communication
Department of Electrical and Computer Engineering
Duke University

2
Need for Software Fault-tolerance

From a user's point of view, fault tolerance has two dimensions:
Availability: users of telephone switching systems, for example, demand continuous availability.
Data consistency: bank teller machine customers demand the highest degree of data consistency.
Safety-critical, real-time systems, such as nuclear power reactors and flight control systems, need the highest levels of both availability and data consistency.

3
What is SwiFT

SwiFT (Software Implemented Fault Tolerance) is a collection of daemon processes and C/C++ libraries.
It provides fault tolerance to applications on a cluster of Windows NT nodes, logically configured as a ring.
It provides automatic error detection and recovery, checkpointing/message-logging, fault management, event logging and replay, data replication, and IP packet re-routing.

4
Transaction processing vs. Process Replication

To achieve high availability and reliability in applications such as telecommunications in a distributed network environment, two types of techniques have been deployed for fault tolerance:
Transaction processing:
Applications usually have a well-defined transaction boundary, such as updating a record or establishing a communication channel.
When a fault occurs, both the client and the server abort the ongoing transaction and roll back to a clean state.
Process replication:
It allows for faster recovery than transaction processing and for recovery of non-transactional and long-transaction applications, such as switching systems and PBXs.
It is also suitable for applications that incur long transactions or do not satisfy transaction properties such as atomicity or isolation.
Process replication uses two techniques, atomic multicasting and checkpoint/message-logging, to make process states consistent.
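
The abort-and-roll-back behaviour described for transaction processing can be pictured with a small sketch. This is an illustration only, not SwiFT code: the `transaction` helper and the `record` dictionary are hypothetical names, and the "clean state" is simply an in-memory copy taken at the transaction boundary.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state: dict):
    """Run a block against `state`; on any exception, restore the prior contents."""
    snapshot = copy.deepcopy(state)   # clean state at the transaction boundary
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)        # roll back to the clean state
        raise                         # let the caller see the fault


record = {"balance": 100}
try:
    with transaction(record) as r:
        r["balance"] -= 30
        raise RuntimeError("simulated fault mid-transaction")
except RuntimeError:
    pass
print(record)  # {'balance': 100}
```

The sketch only works because the update has a clear boundary; applications without one are the case the slide assigns to process replication.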

5
Replication techniques in SwiFT

SwiFT applies three different techniques for message replication:
Cold: Only one active copy of the FT process is present. If it fails, SwiFT first tries to recover the failed process locally; if the local recovery fails, SwiFT migrates the process onto another machine.
Warm: One or more backup processes run on the network, and the primary process periodically checkpoints its state to its backup processes.
Hot: SwiFT monitors all of a fault-tolerant process's replicas. If SwiFT detects any replica failure, it recovers the failed replica so that the number of replicas remains constant.
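
The cold scheme's order of operations (local recovery first, migration second) can be sketched as below. This is a minimal illustration under stated assumptions, not SwiFT's implementation: `recover_cold`, its callbacks, and the node names are all hypothetical, and the number of local attempts is an invented parameter.

```python
from typing import Callable, Sequence


def recover_cold(restart_locally: Callable[[], bool],
                 migrate_to: Callable[[str], bool],
                 other_nodes: Sequence[str],
                 max_local_attempts: int = 1) -> str:
    """Return where the process ended up: 'local', a node name, or 'failed'."""
    # Step 1: try to bring the process back on the machine where it failed.
    for _ in range(max_local_attempts):
        if restart_locally():
            return "local"
    # Step 2: local recovery failed, so try the other machines in turn.
    for node in other_nodes:
        if migrate_to(node):
            return node
    return "failed"


# Simulated outcome: the local restart fails and only node-b accepts the process.
result = recover_cold(lambda: False, lambda node: node == "node-b",
                      ["node-a", "node-b"])
print(result)  # node-b
```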

6
Checkpoint /Message-Logging

SwiFT applies the checkpoint/message-logging technique for fault tolerance.
SwiFT provides fault-tolerant services by routinely checkpointing the server's state onto backup servers or into stable storage.
When a failure occurs, SwiFT stops the failed server process and either promotes a backup server to primary server or creates a new process.
Checkpointing in SwiFT is done with the help of application monitoring, application failure recovery, file replication, Windows event logging/replay, IP packet dispatching, and IP address fail-over.
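
The general idea of checkpoint/message-logging can be shown with a toy example: save the server state from time to time, log every message received since the last save, and rebuild a replacement by restoring the checkpoint and replaying the log. The classes `CounterServer` and `CheckpointStore` are hypothetical and stand in for a real server and for a backup server or stable storage; none of this is SwiFT's API.

```python
import copy


class CounterServer:
    """Toy server whose whole state is one running total (illustrative only)."""

    def __init__(self):
        self.state = {"total": 0}

    def handle(self, amount: int) -> None:
        self.state["total"] += amount


class CheckpointStore:
    """Stands in for a backup server or stable storage."""

    def __init__(self):
        self.checkpoint = None   # last saved copy of the server state
        self.log = []            # messages received since that checkpoint

    def log_message(self, amount: int) -> None:
        self.log.append(amount)

    def take_checkpoint(self, server: CounterServer) -> None:
        self.checkpoint = copy.deepcopy(server.state)
        self.log.clear()         # older messages are covered by the checkpoint

    def recover(self) -> CounterServer:
        """Build a replacement: restore the checkpoint, then replay the log."""
        server = CounterServer()
        if self.checkpoint is not None:
            server.state = copy.deepcopy(self.checkpoint)
        for amount in self.log:
            server.handle(amount)
        return server


store = CheckpointStore()
primary = CounterServer()
for amount in (1, 2):
    store.log_message(amount)    # log before processing so nothing is lost
    primary.handle(amount)
store.take_checkpoint(primary)
store.log_message(3)
primary.handle(3)
# The primary "fails" here; a new process is created from the store.
replacement = store.recover()
print(replacement.state)  # {'total': 6}
```

Replay only reproduces the lost state if message handling is deterministic, which this toy server is by construction.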

7
Components of SwiFT

Watchd for process failure detection, recovery, replication management, and distributed system services
Winckp for transparent process checkpointing and mouse/keyboard event logging and replaying
Libft for data checkpointing and for logging and recovery of communication messages
REPL for on-line incremental file replication and disaster recovery
One-IP for IP packet dispatching, fail-over, and re-routing
SwiFT's components are designed to handle both client and server error recovery, so they can all be applied within a program.
Program developers can often access a server program's source code but have no control over client programs developed by other companies, so SwiFT makes client error recovery as transparent as possible.
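
As an illustration of the failure-detection role listed for Watchd, the sketch below flags processes whose last heartbeat is too old. It is a generic heartbeat-timeout check and an assumption about how such detection could work, not a description of Watchd's mechanism; the function name, process names, and timeout value are hypothetical.

```python
from typing import Dict, List


def find_unresponsive(last_heartbeat: Dict[str, float],
                      now: float,
                      timeout: float) -> List[str]:
    """Return names of processes whose last heartbeat is older than `timeout` seconds."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > timeout)


# Timestamps are in seconds; server-b has been silent for 8 seconds.
heartbeats = {"server-a": 100.0, "server-b": 93.0}
print(find_unresponsive(heartbeats, now=101.0, timeout=6.0))  # ['server-b']
```

A process that is hung but still running would also be reported here, because the check depends only on missing heartbeats.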

8
Applications

Embedded within the system to improve availability and reliability.
Especially useful in telecommunications, where high availability is desired.
In e-commerce and financial transactions over the Internet, where data consistency is of utmost importance.

9
SwiFT and our Role

SwiFT has been developed by Lucent Technologies for Windows NT systems.
http://www.bell-labs.com/projects/swift
Our emphasis will be on using it to evaluate the effectiveness of different recovery mechanisms. This will also help in verifying several recovery models, for example:
S. Garg, Y. Huang, C. Kintala, K. S. Trivedi, and S. Yajnik. Performance and reliability evaluation of passive replication schemes in application level fault tolerance. 1999.

10
Modeling with SwiFT for different Replication schemes

Figure: CTMC model for a server with no, cold, and warm replication.
Plot 1: Availability vs. mean time for node failure detection.
Plot 2: Loss probability and throughput vs. mean time for node failure detection.
Effect of polling frequency, K 12. Polling interval for SwiFT is set to 2 seconds. ?10, ?P1sec.

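
To show why availability depends on the mean time for failure detection, here is a deliberately simplified calculation. It is not the CTMC model from the slide: it assumes a hypothetical three-state cycle (up, failed but undetected, recovering) with exponentially distributed times, for which the steady-state probability of each state is proportional to its mean sojourn time. All parameter values are invented for illustration.

```python
def steady_state_availability(failure_rate: float,
                              mean_detection_time: float,
                              mean_recovery_time: float) -> float:
    """Fraction of time spent in the 'up' state of the assumed three-state cycle."""
    mean_up = 1.0 / failure_rate   # mean time until the next failure
    return mean_up / (mean_up + mean_detection_time + mean_recovery_time)


# Assumed values: one failure per hour on average, 30 s recovery.
for detection in (1.0, 2.0, 10.0):
    a = steady_state_availability(1.0 / 3600.0, detection, 30.0)
    print(f"detection={detection:5.1f}s  availability={a:.6f}")
```

Under these assumptions, every second added to the mean detection time counts directly as downtime, so availability falls as detection slows.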
11
Other Commercial FT processes

DOORS: Distributed Object Oriented Reliable Service
MSCS: Microsoft Cluster Server
Microsoft Wolfpack
Veritas FirstWatch
Vinca Standby Server
HP MC/ServiceGuard

12
Summary

SwiFT is a collection of reusable software components that facilitate the development of fault-tolerant applications for the Windows NT operating system.
Designed with highly available applications in mind, its components address cold and warm replication management schemes.
SwiFT specializes in detecting hangs and failures resulting from system crashes.
SwiFT addresses the checkpointing of process states and the replication of files, processes, and applications; these are useful for implementing low-cost fault tolerance.