1
A Gossip-Style Failure Detection Service
  • Robbert van Renesse
  • Yaron Minsky
  • Mark Hayden

2
Accurate Failure Detection
  • Difficult, if not impossible
  • Mistakes can be allowed
  • Useful for
    • system administration
    • replication
    • load balancing
    • group communication
  • Existing failure detectors are slow or too unreliable

3
Informal Properties
  • Mistake probability is fixed
    (independent of the number of members)
  • Detection time scales with the number of members (O(n log n))
  • Bandwidth scales with the number of members (O(n))
  • Resilient against message loss
  • Resilient against crashes

4
Environment
  • Crash failures and partitions
  • Unbounded message delay
  • Negligible clock drift

5
Basic Gossip Protocol
  • Each member maintains a list of (address,
    heartbeat) pairs
  • Periodically, each member gossips:
    • increments its own heartbeat
    • sends its list to a randomly chosen member
  • On receipt of gossip, merge lists
  • Each member records the last time the heartbeat
    of each other member increased
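
The steps on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' implementation: the class name `Member`, the timeout parameter `t_fail`, and the direct method call standing in for a network message are all assumptions made for the example.

```python
import random


class Member:
    """Hypothetical member holding (address, heartbeat) pairs."""

    def __init__(self, address, t_fail):
        self.address = address
        self.t_fail = t_fail                   # assumed timeout before suspecting a member
        self.heartbeats = {address: 0}         # address -> highest heartbeat seen
        self.last_increase = {address: 0.0}    # address -> time that heartbeat last rose

    def gossip(self, peers, now, rng=random):
        # Increment own heartbeat, then send the whole list to one random peer.
        self.heartbeats[self.address] += 1
        self.last_increase[self.address] = now
        target = rng.choice(peers)
        target.receive(dict(self.heartbeats), now)

    def receive(self, incoming, now):
        # Merge: keep the larger heartbeat per address and note when it increased.
        for addr, hb in incoming.items():
            if hb > self.heartbeats.get(addr, -1):
                self.heartbeats[addr] = hb
                self.last_increase[addr] = now

    def suspected(self, now):
        # Members whose heartbeat has not increased for longer than t_fail.
        return [a for a, t in self.last_increase.items()
                if a != self.address and now - t > self.t_fail]
```

For example, after `a.gossip([b], now=1.0)` member `b` knows heartbeat 1 for `a`, and `b.suspected(now)` starts listing `a` once more than `t_fail` seconds pass without a newer heartbeat.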

6
Linear Bandwidth
  • Gossip message size grows linearly with n
  • So slow down gossiping linearly with n to keep
    bandwidth per member fixed
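
As a rough worked example of this trade-off: if each member may spend a fixed number of bytes per second, the interval between its gossips has to grow in proportion to the list size. The per-entry size of 6 bytes and the function name are assumptions for illustration only.

```python
def gossip_interval(n, entry_bytes, budget_bytes_per_sec):
    """Seconds between gossips so that one member stays within its byte budget.

    entry_bytes is a hypothetical size for one (address, heartbeat) pair.
    """
    message_bytes = n * entry_bytes          # message grows linearly with n
    return message_bytes / budget_bytes_per_sec


# With 100 members, 6-byte entries, and 250 bytes/sec per member:
print(gossip_interval(100, 6, 250))
```

Doubling n doubles the interval, which is why detection time depends on how many members there are.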

How long to wait before reporting failure?
7
Gossiping in Practice
8
Model
  • Each round one random member gossips to another
    random one.
  • We track infection of one heartbeat of one
    member.
  • Calculate probability that k members are infected
    in round i
  • f members have failed from the start
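
One way to carry out the calculation described here is a small dynamic program over rounds. The sketch below rests on an assumed reading of the model: the number of infected members grows by one when the gossiping member (uniform over all n) is infected and the receiver (uniform over the other n - 1) is alive and not yet infected. Message loss is ignored, and the function name is hypothetical.

```python
def infection_distribution(n, f, rounds):
    """Return dist where dist[k] = P(k members infected after `rounds` rounds).

    Assumed transition: P(k -> k+1) = (k / n) * ((n - f - k) / (n - 1)).
    Starts with exactly one infected member (the heartbeat's owner).
    """
    alive = n - f
    dist = [0.0] * (alive + 1)
    dist[1] = 1.0
    for _ in range(rounds):
        nxt = [0.0] * (alive + 1)
        for k in range(1, alive + 1):
            p = dist[k]
            if p == 0.0:
                continue
            p_inc = (k / n) * ((alive - k) / (n - 1))
            nxt[k] += p * (1.0 - p_inc)      # no new infection this round
            if k < alive:
                nxt[k + 1] += p * p_inc      # one more member infected
        dist = nxt
    return dist


# Probability that every live member has seen the heartbeat after 400 rounds.
print(infection_distribution(n=20, f=2, rounds=400)[-1])
```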

9
Analysis
10
Failure Detection Time
B = 250 bytes/sec/member
11
Quality of Detection
12
Effect of Failed Members
13
Effect of Message Loss
14
Multi-Level Protocol
  • Basic protocol overloads Internet routers with
    gossips.
  • Solution: gossip mostly within subnets,
    occasionally between subnets
  • Internet address structure:
    DOMAIN | SUBNET | HOST NUMBER
15
Multi-Level Protocol (2)
  • Subnet numbers are variable-sized
    • so also gossip a list of (domain, subnet mask) pairs
  • Say m subnets, n hosts
  • Each member gossips to another subnet within the same
    domain once every n gossips
  • To another domain once every n × m gossips
  • Linear in top-level, constant in subnets
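
The target selection on this slide could be sketched as follows. It assumes that "once every n gossips" is realized probabilistically, with probability 1/n per gossip, rather than by a counter; the function name, the scope labels, and passing the random draw `u` as an argument are choices made for the example.

```python
import random


def choose_scope(n, m, u):
    """Pick the scope of the next gossip from a uniform draw u in [0, 1).

    n: hosts per subnet, m: subnets per domain (as on the slide).
    Assumed rates: another domain with probability 1/(n*m),
    another subnet in the same domain with probability 1/n.
    """
    p_domain = 1.0 / (n * m)
    p_subnet = 1.0 / n
    if u < p_domain:
        return "other_domain"
    if u < p_domain + p_subnet:
        return "other_subnet"
    return "own_subnet"


# Typical use: draw u fresh for every gossip.
print(choose_scope(n=256, m=16, u=random.random()))
```

Since each of the n hosts in a subnet leaves the subnet about once every n gossips, a subnet as a whole sends roughly one cross-subnet gossip per gossip period, regardless of n.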

16
Multi-Level Penalty
n = 256
17
Catastrophe Recovery
  • Ending up in small partition is bad

18
Recovery by Broadcast
  • Generate broadcast with high probability at least
    every 60 secs
  • On average, once every 30 secs
  • Broadcast carries (part of) membership list.

19
Broadcast Protocol
  • Divide 60 seconds into 3 sec intervals
  • Each member tosses a weighted coin each interval
    to decide on broadcast
  • Weight depends on time since last reception of a
    broadcast

Choose a so that the expected time to the next broadcast
is 30 secs.
Probability at t = 60 is 1.
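
The slide's formula for the coin weight did not survive in this transcript. As an illustration only, the sketch below assumes a power-law weight p(t) = (t/60)^a, which satisfies the stated condition that the probability at t = 60 is 1, and solves for a by bisection. The function names, the assumed form of p(t), and the choice of n = 1000 members are all hypothetical.

```python
def expected_time(a, n, horizon=60, step=3):
    """Expected seconds until some member broadcasts, under p(t) = (t/horizon)**a.

    Each of n members tosses its coin independently at t = step, 2*step, ..., horizon.
    """
    survive = 1.0        # probability that nobody has broadcast yet
    expected = 0.0
    for t in range(step, horizon + 1, step):
        p_member = (t / horizon) ** a
        p_any = 1.0 - (1.0 - p_member) ** n
        expected += survive * p_any * t
        survive *= 1.0 - p_any
    return expected


def solve_a(n, target=30.0, lo=0.0, hi=200.0, iters=60):
    """Bisection on a: a larger exponent delays broadcasts, so expected_time rises with a."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_time(mid, n) < target:
            lo = mid     # broadcasts come too early, increase a
        else:
            hi = mid
    return (lo + hi) / 2.0


a = solve_a(n=1000)
print(a, expected_time(a, 1000))
```

The exponent depends on n: with more members tossing coins, each individual coin has to be weighted more heavily against broadcasting to keep the same expected interval.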
20
Broadcast Protocol Practice
21
Broadcast Storms?
n = 1000
22
Conclusion
  • Over a period of a month, 200 failures and
    recoveries detected
  • No mistakes, so far
  • Can probably do better
    • gossip failures, rather than heartbeats