Title: TCP Server Fault Tolerance Using Connection Migration to a Backup Server
- Manish Marwah (1, 2), Shivakant Mishra (1), Christof Fetzer (3)
- (1) Department of Computer Science, University of Colorado, Campus Box 0430, Boulder, CO 80309
- (2) Avaya Labs, 1300 W 120th Avenue, Westminster, CO 80234
- (3) AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932
- IEEE International Conference on Dependable Systems and Networks 2003, San Francisco, June 22-25, 2003
Outline
- Introduction
- ST-TCP Architecture
- ST-TCP Protocol Details
- ST-TCP System Architecture
- Experimental Results
- Conclusions
- Future Work
Introduction
- TCP is hugely popular
- Used in numerous applications
- Provides a rich set of features
- However, TCP does not provide server fault tolerance
TCP Server Fault Tolerance
- Consider a TCP-based client-server application
- Server failure -> TCP connection failure -> application session failure
- A backup server provides fast service restoration
- Restoring the application session additionally requires an application-level recovery protocol
- Custom software on clients is required
ST-TCP (Server fault-Tolerant TCP)
- We propose ST-TCP
- A light-weight active primary-backup server fault tolerance mechanism, implemented at the TCP layer, for fast, transparent failover of a TCP connection to a backup server
ST-TCP Design Principles
- No changes required in the client
- No changes required in the server application
- Fast and transparent failover
- Behavior exactly the same as standard TCP
- Minimal overhead during normal operation
- Simple to implement: minimal kernel changes
- Assumes primary and backup on the same LAN and a deterministic application
ST-TCP Architecture
- Architecture Overview
- Ethernet Tapping Architecture
Architecture Overview
- Active primary-backup system
- Application replica
- Ethernet tapping
- Backup receives the same byte stream as the primary
- Primary-backup heartbeat
Architecture Overview (cont.)
- Application Replica
  - Runs on the backup
  - Processes the same client TCP stream
  - Produces the same TCP stream (as the primary) for the client
  - Suppresses the output TCP stream during normal operation
Architecture Overview (cont.)
- Failure Detection and Recovery
  - Primary and backup exchange heartbeats (HB) for failure detection
  - Backup takes over the client-primary TCP connection if it detects that the primary has failed
  - To guard against false positives (a live primary wrongly declared failed), the backup turns off the power to the primary before taking over
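The detection-then-takeover order above can be sketched as follows. This is a minimal illustration, not the ST-TCP implementation: the class and function names, the `max_missed` threshold, and the two callbacks are assumptions introduced here, since the slides do not say how many missed heartbeats trigger a failover.

```python
class HeartbeatMonitor:
    """Declares the peer failed once `max_missed` heartbeat intervals pass in silence.

    `max_missed` is an assumed tuning knob; the slides do not specify one.
    """

    def __init__(self, interval_s, max_missed=3):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = None

    def on_heartbeat(self, now):
        """Record the arrival time of a heartbeat from the peer."""
        self.last_seen = now

    def peer_failed(self, now):
        if self.last_seen is None:
            return False  # nothing received yet, so no basis for a verdict
        return (now - self.last_seen) > self.interval_s * self.max_missed


def take_over(monitor, now, power_off_primary, start_sending):
    """Backup-side failover step: power off the primary, then stop suppressing output."""
    if not monitor.peer_failed(now):
        return False
    power_off_primary()  # done first, so a live primary cannot keep answering
    start_sending()      # only now does the backup's TCP output reach the client
    return True
```

Passing time in as `now` keeps the sketch deterministic; a real monitor would read a monotonic clock.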
Ethernet Tapping Architecture
- Option 1: Promiscuous Mode
[Figure: Hub/Switch]
- Works for hubs
- For switches:
  - Replicate all primary port traffic on the backup port, or
  - Make sure that the switch does not learn the MAC address of the primary
Ethernet Tapping Architecture (cont.)
- Option 2: Multicast Ethernet Address
  - Create virtual NICs with the service IP address (SVI) on primary and backup
  - Associate a multicast Ethernet address (SME) with both of these virtual interfaces
  - Configure a static ARP entry in the router mapping SVI to SME
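Option 2 relies on SME being a genuine multicast Ethernet address, so that frames sent to it are delivered to both hosts. An Ethernet address is a group (multicast) address when the least significant bit of its first octet is set. The helper below is a small sketch for checking a candidate SME; the function names and the example addresses in the usage note are illustrative and do not come from the slides.

```python
def parse_mac(mac):
    """Parse 'aa:bb:cc:dd:ee:ff' into a list of six integer octets."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 octets, got {len(parts)}")
    octets = [int(p, 16) for p in parts]
    if any(o < 0 or o > 0xFF for o in octets):
        raise ValueError("octet out of range")
    return octets


def is_multicast_mac(mac):
    """True when the group bit (LSB of the first octet) is set."""
    return bool(parse_mac(mac)[0] & 0x01)
```

For instance, `is_multicast_mac("01:00:5e:00:00:01")` is true, while a typical unicast NIC address such as `"00:11:22:33:44:55"` fails the check. The broadcast address also has the group bit set, so it passes the check as well.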
ST-TCP Protocol Details
- Initialization
- Failure Free Period
- Primary-Backup Synchronization
- Failure Detection and Recovery
Initialization
- Replica application started on backup
- Uses same port number as primary
- Uses same sequence numbers
[Figure: three-way handshake between Client and ST-TCP Server (SYN, SYN/ACK, ACK); backup syncs sequence numbers with primary]
Failure Free Period
- Backup drops all TCP segments destined for the client
- Client ACKs destined for the primary serve as ACKs for the backup as well
- Primary and backup exchange heartbeat (HB) messages over a UDP channel
- The backup also uses this channel to send the sequence number of the latest client bytes it has received
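A heartbeat that also carries the backup's receive position could be encoded as below. The wire format is hypothetical (the slides do not describe the message layout): a one-byte message type followed by a 32-bit sequence number in network byte order. The constant and function names are likewise invented for this sketch.

```python
import struct

# Hypothetical layout, not taken from the slides:
# 1-byte message type, then a 32-bit TCP sequence number, network byte order.
HB_FORMAT = "!BI"
HB_TYPE_BACKUP_STATUS = 1


def pack_backup_heartbeat(latest_client_seq):
    """Encode the backup's heartbeat with the latest client sequence number received."""
    # TCP sequence numbers are 32-bit, so wrap the value modulo 2**32.
    return struct.pack(HB_FORMAT, HB_TYPE_BACKUP_STATUS, latest_client_seq % 2**32)


def unpack_heartbeat(payload):
    """Decode a heartbeat payload into (message_type, sequence_number)."""
    msg_type, seq = struct.unpack(HB_FORMAT, payload)
    return msg_type, seq
```

The resulting bytes would be handed to a UDP socket (for example with `socket.sendto`); the primary decodes them and learns how far the backup has received the client stream.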
Primary-Backup Synchronization
- Problem: the backup may miss bytes
- Primary discards client bytes only after the backup server has received them
  - Uses additional receive buffer space
- Backup periodically informs the primary of the latest client bytes it has received
  - Monitors primary-client segments to determine if it missed any bytes
  - Asks the primary for missing bytes
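The retention rule above can be illustrated with a toy buffer. This is a sketch under simplifying assumptions, not the kernel change itself: positions are plain stream offsets starting at 0 rather than wrapping 32-bit sequence numbers, and the class and method names are invented for illustration.

```python
class PrimaryRetentionBuffer:
    """Keeps client bytes on the primary until the backup confirms receiving them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.base_seq = 0        # stream offset of the first retained byte
        self.data = bytearray()

    def retain(self, chunk):
        """Hold on to bytes that standard TCP would already free; False if full."""
        if len(self.data) + len(chunk) > self.capacity:
            return False
        self.data += chunk
        return True

    def backup_has_received(self, upto_seq):
        """Discard every retained byte below `upto_seq`, the backup's reported position."""
        drop = max(0, min(upto_seq - self.base_seq, len(self.data)))
        del self.data[:drop]
        self.base_seq += drop

    def fetch_missing(self, start_seq, length):
        """Return retained bytes that the backup requests after noticing a gap."""
        offset = start_seq - self.base_seq
        if offset < 0:
            raise ValueError("requested bytes were already discarded")
        return bytes(self.data[offset:offset + length])
```

A `retain` call that returns `False` corresponds to the additional buffer filling up because the backup has stopped confirming bytes.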
Primary TCP Receive Buffer Management
[Figure: Standard TCP Receive Buffer compared with ST-TCP Receive Buffer]
- Additional buffer and standard buffer are independent
- Only kernel change in the primary
Failure Detection and Recovery
- Failure detection of the primary by the backup involves monitoring the primary-backup heartbeat (HB)
- Failure detection of the backup by the primary involves monitoring:
  - the primary-backup heartbeat (HB)
  - the status of the primary's additional receive buffer
System Architecture
- No single point of failure
Experimental Results
- Setup
- Applications
- Performance Results
Experimental Setup
- Primary: 800 MHz AMD Athlon, 512 KB cache, 256 MB RAM, 10/100 Mbps NIC, Linux 2.2.18
- Backup: 800 MHz AMD Athlon, 512 KB cache, 256 MB RAM, 10/100 Mbps NIC, Linux 2.2.18
- Client: 900 MHz Pentium III, 10/100 Mbps NIC, Linux 2.4.9 (can be any OS)
- Network: 10/100 Mbps hub
Applications
- Performance of ST-TCP is measured with simulated applications representing various communication characteristics
- Three such applications are considered:
  - Echo: client sends 150 bytes; server echoes them back (rlogin/telnet)
  - Interactive: client sends 150 bytes; server responds with 10 kbytes (http)
  - Bulk Transfer: client sends 150 bytes; server responds with a large data transfer (1, 5, 20, 100 MB) (ftp)
- Experiments are run with varying heartbeat intervals (5 s to 50 ms)
Performance Results
- Two quantities are measured:
  - Comparison of ST-TCP with standard TCP during the failure-free period
    - No significant overhead of using ST-TCP
  - Failover times
    - Failure detection time
    - How much the backup TCP has backed off
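One way to read the two failover components is that the client sees data again at the backup's first retransmission after the failure has been detected. The toy model below is an interpretation, not a result from the talk. It assumes the backup's retransmission timer starts when the primary fails and doubles on each expiry; the 0.2 s initial timeout and the 120 s cap are illustrative defaults.

```python
def first_retransmission_after(detection_time_s, initial_rto_s=0.2, max_rto_s=120.0):
    """Time since primary failure of the backup's first unsuppressed retransmission.

    Toy model: the timer fires after `initial_rto_s`, then doubles on every
    expiry up to `max_rto_s`. Expiries before detection are still suppressed.
    """
    t = 0.0
    rto = initial_rto_s
    while True:
        t += rto                      # next timer expiry
        if t >= detection_time_s:
            return t                  # first expiry at or after detection
        rto = min(rto * 2, max_rto_s)
```

Under these assumptions, a failure detected after 0.5 s is masked at about 0.6 s, while one detected after 1.5 s is masked only at about 3.0 s.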
Failover Times: Echo
Failover Times: Interactive
Failover Times: Bulk Transfer
Conclusions
- ST-TCP extends TCP to tolerate server failures
- Hot backup server; tapping architecture minimizes overhead
- No changes to the client or the server application
- Negligible impact during failure-free periods
  - Insignificant performance overhead, minimal impact of backup on primary, no bandwidth impact
- No deviation from standard TCP
- Fast failover (a few hundred ms), completely transparent to the client
- Logger for double-failure scenarios
- Light-weight
Future Work
- Performance enhancements
- Address application nondeterminism issues
- Extend it to other transport-layer protocols, e.g. SCTP
- Run real applications on ST-TCP
- Address issues related to using one backup for multiple primary servers
- Primary and backup on different LANs
Backup Slides
Bandwidth Used by Primary-Backup Heartbeat
- Assuming each HB message (including all headers) is 128 bytes
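With the 128-byte message size, heartbeat bandwidth follows directly from the heartbeat interval: bits per second = 128 x 8 x (number of directions) / interval. The short calculation below assumes one message per interval in each direction, which is a simplification introduced here; the function name is illustrative.

```python
HB_MESSAGE_BYTES = 128  # from the slide: one heartbeat message, all headers included


def heartbeat_bandwidth_bps(interval_s, directions=2):
    """Bits per second used by heartbeats sent once per interval in each direction."""
    return HB_MESSAGE_BYTES * 8 * directions / interval_s
```

For the interval range used in the experiments, this gives about 41 kbit/s at 50 ms and about 410 bit/s at 5 s, counting both directions.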