Transcript and Presenter's Notes

Title: Failure Detectors


1
Failure Detectors
  • CS 717
  • Ashish Motivala
  • Dec 6th 2001

2
Some Relevant Papers
  • Unreliable Failure Detectors for Reliable
    Distributed Systems. Tushar Deepak Chandra and
    Sam Toueg. Journal of the ACM.
  • A gossip-style failure detection service. R. van
    Renesse, Y. Minsky, and M. Hayden. Middleware
    '98.
  • Scalable Weakly-consistent Infection-style
    Process Group Membership protocol. Ashish
    Motivala, Abhinandan Das, Indranil Gupta. To be
    submitted to DSN 2002 tomorrow.
    http://www.cs.cornell.edu/gupta/swim
  • On the Quality of Service of Failure Detectors.
    Wei Chen, Cornell University (with Sam Toueg,
    Advisor, and Marcos Aguilera, Contributing
    Author). DSN 2000.
  • Fail-aware failure detectors. C. Fetzer and F.
    Cristian. In Proceedings of the 15th Symposium on
    Reliable Distributed Systems.

3
Asynchronous vs Synchronous Model
  • No value to assumptions about process speed
  • Network can arbitrarily delay a message
  • But we assume that messages are sequenced and
    retransmitted (arbitrary numbers of times), so
    they eventually get through.
  • Failures in asynchronous model?
  • Usually, limited to process crash faults
  • If detectable, we call this fail-stop; but how
    to detect?

4
Asynchronous vs Synchronous Model
  • No value to assumptions about process speed
  • Network can arbitrarily delay a message
  • But we assume that messages are sequenced and
    retransmitted (arbitrary numbers of times), so
    they eventually get through.
  • Assume that every process will run within bounded
    delay
  • Assume that every link has bounded delay
  • Usually described as synchronous rounds

5
Failures in Asynchronous and Synchronous Systems
  • Usually, limited to process crash faults
  • If detectable, we call this fail-stop; but how
    to detect?
  • Can talk about message omission failures:
    failure to send is the usual approach
  • But network assumed reliable (loss charged to
    sender)
  • Process crash failures, as in asynchronous
    setting
  • Byzantine failures: arbitrary misbehavior by
    processes

6
Realistic???
  • The asynchronous model is too weak, since it has
    no clocks (real systems have clocks, and most
    timing meets expectations, but with heavy tails)
  • The synchronous model is too strong (real systems
    lack a way to implement synchronized rounds)
  • Partially Synchronous Model: an asynchronous
    network with a reliable channel
  • Timed Asynchronous Model: time bounds on clock
    drift rates and message delays (Fetzer)

7
Impossibility Results
  • Consensus: all processes need to agree on a value
  • FLP: impossibility of consensus
  • A single faulty process can prevent consensus
  • Realistic, because a slow process is
    indistinguishable from a crashed one
  • Chandra/Toueg showed that the FLP impossibility
    applies to many problems, not just consensus
  • In particular, they show that FLP applies to
    group membership and reliable multicast
  • So these practical problems are impossible in
    asynchronous systems
  • They also look at the weakest condition under
    which consensus can be solved

8
Byzantine Consensus
  • Example: 3 processes (A, B, C), 1 of them faulty
  • Non-faulty processes A and B start with input 0
    and 1, respectively
  • They exchange messages; each now has a set of
    inputs {0, 1, x}, where x comes from C
  • C sends 0 to A and 1 to B
  • A has {0, 1, 0} and wants to pick 0. B has
    {0, 1, 1} and wants to pick 1.
  • By definition, impossibility in this model means
    the problem can't always be solved
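A minimal numerical illustration of this argument (my own sketch, not from the
slides; the majority rule is just one example of a deterministic decision rule):

    # Toy check of the 3-process Byzantine scenario above.
    # A and B are correct; C is Byzantine and tells A "0" but tells B "1".
    from collections import Counter

    def decide(values):
        # Any fixed deterministic rule; majority vote is used for illustration.
        return Counter(values).most_common(1)[0][0]

    view_A = [0, 1, 0]   # A's input, B's input, what C told A
    view_B = [0, 1, 1]   # A's input, B's input, what C told B
    print(decide(view_A), decide(view_B))   # -> 0 1: the correct processes disagree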

9
Chandra/Toueg Idea
  • Theoretical Idea
  • Separate the problem into:
  • The consensus algorithm itself
  • A failure detector: a form of oracle that
    announces suspected failures
  • But the oracle can change its decision (revise
    its suspicions)
  • Question: what is the weakest oracle for which
    consensus is always solvable?

10
Sample properties
  • Completeness: detection of every crash
  • Strong completeness: eventually, every process
    that crashes is permanently suspected by every
    correct process
  • Weak completeness: eventually, every process that
    crashes is permanently suspected by some correct
    process

11
Sample properties
  • Accuracy: does it make mistakes?
  • Strong accuracy: no process is suspected before
    it crashes.
  • Weak accuracy: some correct process is never
    suspected
  • Eventual strong/weak accuracy: there is a time
    after which strong/weak accuracy is satisfied.

12
A sampling of failure detectors
13
Perfect Detector?
  • Named Perfect, written P
  • Strong completeness and strong accuracy
  • Immediately detects all failures
  • Never makes mistakes

14
Example of a failure detector
  • The detector they call W ("eventually weak")
  • More commonly written ◇W ("diamond W")
  • Defined by two properties:
  • There is a time after which every process that
    crashes is suspected by some correct process
    (weak completeness)
  • There is a time after which some correct process
    is never suspected by any correct process
    (eventual weak accuracy)
  • E.g., we can eventually agree upon a leader. If it
    crashes, we eventually and accurately detect the
    crash

15
◇W Weakest failure detector
  • They show that ◇W is the weakest failure detector
    for which consensus is guaranteed to be achieved
  • The algorithm is pretty simple (see the toy sketch below)
  • Rotate a token around a ring of processes
  • Decision can occur once token makes it around
    once without a change in failure-suspicion status
    for any process
  • Subsequently, as token is passed, each recipient
    learns the decision outcome
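A toy, single-machine sketch of this token rule (my own illustration, not the
actual Chandra/Toueg algorithm; the suspicion oracle and process names are
assumed, and message passing is omitted):

    # The token circulates until it completes one full circuit of the ring
    # with no change in the failure-suspicion status reported by the oracle.
    def token_ring_decide(ring, suspicion_oracle, proposal):
        snapshot = frozenset(suspicion_oracle())    # suspicions when the circuit began
        stable_hops = 0
        while stable_hops < len(ring):              # need one unchanged circuit
            current = frozenset(suspicion_oracle())
            if current == snapshot:
                stable_hops += 1                    # another hop without a change
            else:
                snapshot, stable_hops = current, 0  # suspicion changed: restart circuit
        return proposal                             # decided; later hops learn this value

    # Example: two unsuspected processes, an oracle that always suspects p3.
    print(token_ring_decide(["p1", "p2"], lambda: {"p3"}, proposal=0))   # -> 0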

16
Building systems with ◇W
  • Unfortunately, this failure detector is not
    implementable
  • This is the weakest failure detector that solves
    consensus
  • Using timeouts we can make mistakes at arbitrary
    times

17
Group Membership Service
[Figure: a process group (processes pi, pj) communicating over an
asynchronous lossy network, with Join, Leave, and Failure events]
18
Data Dissemination using Epidemic Protocols
  • Want efficiency, robustness, speed and scale
  • Tree distribution is efficient, but fragile and
    hard to configure
  • Gossip is efficient and robust, but has high
    latency: network load is almost linear, and
    detection time scales as O(n log n) with the
    number of processes

19
State Monotonic Property
  • A gossip message contains the state of the sender
    of the gossip.
  • The receiver uses a merge function to merge the
    received state with its own state.
  • Need some kind of monotonicity in the state and in
    the gossip (see the example below)
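A concrete example of such a monotone merge, matching the heartbeat lists used
on the following slides (an assumed illustration): take the elementwise maximum
of the two heartbeat maps, so no entry ever decreases.

    # Elementwise-max merge of two {address: heartbeat} maps; monotone in both inputs.
    def merge(local, received):
        merged = dict(local)
        for addr, hb in received.items():
            merged[addr] = max(merged.get(addr, hb), hb)
        return merged

    print(merge({"a": 5, "b": 2}, {"b": 7, "c": 1}))   # {'a': 5, 'b': 7, 'c': 1}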

20
Simple Epidemic
  • Assume a fixed population of size n
  • For simplicity, assume homogeneous spreading
  • Simple epidemic: anyone can infect anyone with
    equal probability
  • Assume that k members are already infected
  • And that infection occurs in rounds

21
Probability of Infection
  • What is the probability Pinfect(k,n) that a
    particular uninfected member is infected in a
    round if k members are already infected?
  • Pinfect(k,n) = 1 - P(nobody infects the member)
  •             = 1 - (1 - 1/n)^k
  • E(newly infected members) = (n-k) x Pinfect(k,n)
  • Basically it's a binomial distribution
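A quick numerical check of these formulas (illustrative Python; the population
size and the values of k are arbitrary):

    def p_infect(k, n):
        # Probability that a particular uninfected member is infected in one
        # round when k members are already infected: 1 - (1 - 1/n)^k
        return 1 - (1 - 1 / n) ** k

    def expected_new(k, n):
        # Expected number of newly infected members in that round.
        return (n - k) * p_infect(k, n)

    n = 1000
    for k in (1, 10, n // 2):
        print(k, round(p_infect(k, n), 3), round(expected_new(k, n), 1))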

22
2 Phases
  • Intuition: 2 phases
  • First half: 1 -> n/2 (Phase 1)
  • Second half: n/2 -> n (Phase 2)
  • For large n, Pinfect(n/2,n) ≈ 1 - (1/e)^0.5 ≈ 0.4
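The 0.4 figure is the large-n limit of the formula above, since (1 - 1/n)^(n/2)
tends to (1/e)^0.5; a quick check under the same assumptions:

    import math
    n = 10_000
    print(1 - (1 - 1 / n) ** (n // 2))   # about 0.393
    print(1 - math.exp(-0.5))            # limit: 1 - (1/e)^0.5, also about 0.393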

23
Infection and Uninfection
  • Infection
  • Initial growth factor is very high, about 2
  • At the halfway mark it's about 1.4
  • Exponential growth
  • Uninfection
  • Slow decline of the uninfected population to start
  • At the halfway mark it's about 0.4
  • Exponential decline

24
Rounds
  • Number of rounds necessary to infect the entire
    population is O(log n)
  • Robbert uses a base of 1.585 for the experiments
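For instance, taking that base literally, the round count for a given n can be
estimated as the base-1.585 logarithm of n (a rough estimate only; it just
combines the O(log n) claim above with the 1.585 base):

    import math
    for n in (100, 1_000, 10_000):
        print(n, math.ceil(math.log(n, 1.585)))   # about 10, 15, 20 rounds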

25
How the Protocol Works
  • Each member maintains a list of (address,
    heartbeat) pairs.
  • Periodically, each member gossips:
  • Increments its own heartbeat
  • Sends (part of) its list to a randomly chosen
    member
  • On receipt of a gossip message, merges the lists
  • Each member maintains the last heartbeat heard
    from each list member (see the sketch below)
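A compact sketch of this gossip loop (my own illustration; the transport,
timing, and failure-timeout handling are deliberately omitted, and all names
are assumed):

    import random

    class GossipMember:
        def __init__(self, addr, peers):
            self.addr = addr
            self.peers = peers                  # addresses of the other members
            self.heartbeats = {addr: 0}         # the (address, heartbeat) list

        def gossip_once(self, send):
            self.heartbeats[self.addr] += 1             # increment own heartbeat
            target = random.choice(self.peers)          # randomly chosen member
            send(target, dict(self.heartbeats))         # send (part of) the list

        def on_gossip(self, received):
            for addr, hb in received.items():           # merge: keep the highest
                if hb > self.heartbeats.get(addr, -1):  # heartbeat seen per member
                    self.heartbeats[addr] = hb
            # A member whose heartbeat stops increasing for long enough would be
            # declared failed; that timeout logic is not shown here.

    # Two members exchanging one gossip message:
    a, b = GossipMember("A", ["B"]), GossipMember("B", ["A"])
    a.gossip_once(lambda tgt, state: b.on_gossip(state))
    print(b.heartbeats)   # {'B': 0, 'A': 1}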

26
(No Transcript)
27

28
(No Transcript)
29
(No Transcript)
30
SWIM: Group Membership Service
[Figure: a process group (processes pi, pj) communicating over an
asynchronous lossy network, with Join, Leave, and Failure events]
31
System Design
  • Join, Leave, Failure: broadcast to all processes
  • Need to detect a process failure at some process
    quickly (to be able to broadcast it)
  • Failure Detector Protocol Specifications:
  • Detection Time
  • Accuracy
  • Load

(Detection time and accuracy are specified by the
application designer to SWIM; load is optimized by SWIM)
32
SWIM Failure Detector Protocol
Protocol period = T time units
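Since the protocol figure itself is not transcribed, here is a rough sketch of
one such protocol period in the SWIM failure detector (my own paraphrase of the
protocol; send_ping, send_ping_req and wait_for_ack are assumed placeholder
helpers, not real APIs):

    import random

    def protocol_period(members, K, send_ping, send_ping_req, wait_for_ack):
        # One period of length T at some member M.
        target = random.choice(members)              # pick a random member to probe
        send_ping(target)
        if wait_for_ack(target):                     # direct ack: target is alive
            return None
        others = [m for m in members if m != target]
        for helper in random.sample(others, min(K, len(others))):
            send_ping_req(helper, target)            # ask K members to ping target
        if wait_for_ack(target):                     # indirect ack relayed back
            return None
        return target                                # no ack by period end: suspect it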
33
Properties
  • Expected detection time:
  • e/(e-1) protocol periods
  • Load: O(K) per process
  • Inaccuracy probability: exponential in K
  • Process failures detected:
  • in O(log N) protocol periods w.h.p.
  • in O(N) protocol periods deterministically
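A sketch of where the e/(e-1) figure comes from (my own reasoning, following
the period sketch above): each member probes one uniformly chosen target per
period, so a crashed member is probed by at least one of the others with
probability about 1 - (1 - 1/N)^(N-1), which tends to 1 - 1/e; the expected
number of periods until it is first probed is therefore about
1/(1 - 1/e) = e/(e-1), roughly 1.58 protocol periods.

    import math
    N = 1000
    p = 1 - (1 - 1 / N) ** (N - 1)           # crashed member probed in a given period
    print(1 / p, math.e / (math.e - 1))      # both about 1.58 protocol periods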

34
Why not Heartbeating ?
  • Centralized: single point of failure
  • All-to-all: O(N) load per process
  • Logical ring: unpredictable under multiple
    failures

35
LAN Scalability
Win2000, 100Base-T Ethernet LAN
Protocol period = 3 x RTT, RTT = 10 ms, K = 1
36
Deployment
  • Broadcast suspicion before declaring process
    failure
  • Piggyback broadcasts through ping messages
  • Epidemic-style broadcast
  • WAN:
  • Load on core routers
  • No representatives per subnet/domain