CS556: Distributed Systems - PowerPoint PPT Presentation

Provided by: mar177
1
CS-556 Distributed Systems
Fault Tolerance (I)
  • Manolis Marazakis
  • maraz_at_csd.uoc.gr

2
The gossip architecture (I)
  • Replicate data close to points where groups of
    clients need it
  • Periodic exchange of msgs among RMs
  • Front-ends send queries & updates to any RM they
    choose
  • Any RM that is available can provide acceptable
    response times
  • Consistent service over time
  • Relaxed consistency bet. replicas

3
The gossip architecture (II)
  • Causal update ordering
  • Forced ordering
  • Causal + total
  • A forced-order and a causal-order update that are
    not related by the happened-before relation may be
    applied in different orders at different RMs!
  • Immediate ordering
  • Updates are applied in a consistent order
    relative to any other update at all RMs

4
The gossip architecture (III)
  • Bulletin board application example
  • Posting items -> causal order
  • Adding a subscriber -> forced order
  • Removing a subscriber -> immediate order
  • Gossip messages: updates among RMs
  • Front-ends maintain prev vector timestamp
  • One entry per RM
  • RMs respond with new vector timestamp

5
State components of a gossip RM
6
Query operations in gossip
  • RM must return a value that is at least as recent
    as the request's timestamp
  • q.prev ≤ valueTS
  • List of pending query operations
  • Hold back until above condition is satisfied
  • RM can wait for missing updates
  • or request updates from the RMs concerned
  • RM's response includes valueTS
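
The hold-back rule above can be sketched in a few lines of Python. This is only an illustration: the class name `QueryRM`, the `pending` list, and the list-of-integers encoding of vector timestamps are assumptions, not part of the slides.

```python
from typing import List, Optional, Tuple


def leq(a: List[int], b: List[int]) -> bool:
    """a <= b for vector timestamps: every entry of a is <= the matching entry of b."""
    return all(x <= y for x, y in zip(a, b))


class QueryRM:
    """Hypothetical replica manager holding only the state needed for queries."""

    def __init__(self, n_rms: int) -> None:
        self.value = None                      # current application state
        self.value_ts = [0] * n_rms            # timestamp of the applied updates
        self.pending: List[List[int]] = []     # held-back queries (their q.prev)

    def query(self, q_prev: List[int]) -> Optional[Tuple[object, List[int]]]:
        # Answer only if the value is at least as recent as the request.
        if leq(q_prev, self.value_ts):
            return self.value, list(self.value_ts)
        # Otherwise hold the query back until missing updates arrive.
        self.pending.append(q_prev)
        return None
```

A front-end would merge the returned `valueTS` into its own `prev` timestamp before issuing the next request.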

7
Updates in causal order
  • RM-i checks to see if operation ID is in its
    executed table or in its log
  • Discard update if it has already seen it
  • Increment i-th element of replica timestamp
  • Count of updates received from front-ends
  • Assign vector timestamp (ts) to the update
  • Replace i-th element of u.prev by i-th element of
    replica timestamp
  • Insert log entry
  • <i, ts, u.op, u.prev, u.id>
  • Stability condition: u.prev ≤ valueTS
  • All updates on which u depends have been applied
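
A minimal sketch of the update path at RM-i, assuming vector timestamps are plain integer lists and that applying an update simply appends its operation to a list. The names `GossipRM`, `receive_update`, and `apply_stable` are hypothetical.

```python
def leq(a, b):
    """Vector timestamp comparison a <= b, entry by entry."""
    return all(x <= y for x, y in zip(a, b))


def merge(a, b):
    """Entry-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


class GossipRM:
    def __init__(self, i, n_rms):
        self.i = i
        self.replica_ts = [0] * n_rms   # updates accepted into the log
        self.value_ts = [0] * n_rms     # updates applied to the value
        self.log = []                   # records <i, ts, op, prev, id>
        self.executed = set()           # ids of applied updates
        self.value = []                 # assumed state: list of applied ops

    def receive_update(self, u_id, u_op, u_prev):
        # Discard an update that was already seen.
        if u_id in self.executed or any(r[4] == u_id for r in self.log):
            return None
        # Count the update and derive its unique timestamp ts.
        self.replica_ts[self.i] += 1
        ts = list(u_prev)
        ts[self.i] = self.replica_ts[self.i]
        self.log.append((self.i, ts, u_op, list(u_prev), u_id))
        self.apply_stable()
        return ts

    def apply_stable(self):
        # Apply every logged update whose dependencies are satisfied;
        # repeat, since applying one update may make others stable.
        progress = True
        while progress:
            progress = False
            for _, ts, op, prev, u_id in self.log:
                if u_id not in self.executed and leq(prev, self.value_ts):
                    self.value.append(op)
                    self.value_ts = merge(self.value_ts, ts)
                    self.executed.add(u_id)
                    progress = True
```

In the example below, the second update depends on an update from RM 1 that has not arrived yet, so it stays in the log unapplied.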

8
Forced & immediate order
  • Unique sequence number is appended to update
    timestamps
  • Primary RM acts as sequencer
  • Another RM can be elected to take over
    consistently as sequencer
  • Majority of RMs (including primary) must record
    which update is the next in sequence
  • Immediate ordering by having the primary order
    them in the sequence (along with forced updates,
    considering causal updates as well)
  • Agreement protocol on sequence

9
Gossip timestamps
  • Gossip msgs bet. RMs
  • Replica timestamp + log
  • Receiver's tasks
  • Merge arriving log m.log with its own
  • Add record r to local log if replicaTS < r.ts
  • Apply any updates that have become stable
  • This may in turn make pending updates become
    stable
  • Eliminate records from log & entries in executed
    table
  • Once it is established that they have been
    applied everywhere
  • Sort the set of stable updates in timestamp order
  • r is applied only if there is no s s.t. s.prev <
    r.prev
  • tableTS[j] := m.ts
  • If tableTS[i][c] ≥ r.ts[c], for all i, then r is
    discarded
  • c: RM that created record r
  • ACKs by front-ends to discard records from
    executed table
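
The merge and discard rules can be sketched as two small functions. The dictionary layout of a log record (`creator`, `ts`) is an assumption made for illustration. The merge test used here, "r.ts is not already covered by replicaTS", is a slightly more permissive reading than the strict comparison on the slide, and that choice is also an assumption.

```python
def leq(a, b):
    """Vector timestamp comparison a <= b, entry by entry."""
    return all(x <= y for x, y in zip(a, b))


def merge(a, b):
    """Entry-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


def receive_gossip(log, replica_ts, m_log, m_ts):
    """Merge an arriving gossip message (m_log, m_ts) into the local state."""
    new_log = list(log)
    for r in m_log:
        # Keep only records that the local replica timestamp does not cover yet.
        if not leq(r["ts"], replica_ts):
            new_log.append(r)
    return new_log, merge(replica_ts, m_ts)


def can_discard(r, table_ts):
    """True once every RM is known to have received record r.

    table_ts[i] is the latest replica timestamp known for RM i;
    c is the RM that created r.
    """
    c = r["creator"]
    return all(row[c] >= r["ts"][c] for row in table_ts)
```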

10
Update propagation
  • How long before all RMs receive an update ?
  • Frequency & duration of network partitions
  • Beyond system's control!
  • Frequency of gossip msgs
  • Policy for choosing a gossip partner
  • Random
  • Weighted probabilities to favor near partners
  • Surprisingly robust !
  • But exhibits variable update propagation times
  • Deterministic
  • Simple function of RM's state
  • E.g., examine timestamp table & choose the RM that
    appears to be the furthest behind in updates
    received
  • Topological
  • Based on fixed arrangement of RMs into a graph
  • Ring, mesh, trees
  • Trade-off: amount of communication against higher
    latencies & the possibility that a single failure
    will affect other RMs
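
Two of the partner-selection policies can be sketched as follows. Both functions are illustrative only: using the sum of a timestamp-table row as the measure of "furthest behind", and weighting peers by inverse distance, are assumptions rather than rules stated in the slides.

```python
import random


def pick_furthest_behind(table_ts, self_id):
    """Deterministic policy: choose the RM whose known timestamp is lowest.

    table_ts maps RM id -> latest replica timestamp known for that RM.
    Ties are broken by the smaller RM id.
    """
    candidates = [j for j in table_ts if j != self_id]
    return min(candidates, key=lambda j: (sum(table_ts[j]), j))


def pick_weighted_random(distances, self_id, rng=random):
    """Random policy that favors near partners.

    distances maps RM id -> a positive distance from this RM
    (for example a measured round-trip time).
    """
    peers = [j for j in distances if j != self_id]
    weights = [1.0 / distances[j] for j in peers]
    return rng.choices(peers, weights=weights, k=1)[0]
```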

11
Scalability concerns
  • 2 messages per query (bet. front-end & RM)
  • Causal update
  • G updates per gossip message
  • 2 + (R-1)/G messages exchanged
  • Increasing G leads to
  • Fewer messages
  • but also worse delivery latencies
  • RM has to wait for more updates to arrive before
    propagating them
  • Improvement by having read-only replicas
  • Provided that update/query ratio is low !
  • Updated by gossip msgs but do not receive
    updates directly from front-ends
  • Can be situated close to client groups
  • Vector timestamps need only include updateable RMs
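
As a worked example of the message count, the function below evaluates 2 + (R-1)/G for a causal update, where R is the number of RMs and G the number of updates batched into one gossip message. The function name is hypothetical, and the formula assumes every other RM eventually receives the update via gossip.

```python
def messages_per_causal_update(r: int, g: int) -> float:
    """2 messages between front-end and RM, plus this update's share of gossip."""
    if r < 1 or g < 1:
        raise ValueError("R and G must be positive")
    return 2 + (r - 1) / g
```

With 5 RMs, sending every update in its own gossip message costs 6 messages per update, while batching 4 updates per gossip message brings the cost down to 3.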

12
Dependability: Basic Concepts
  • Availability
  • Reliability
  • Safety
  • Maintainability

Fault → Error → Failure
  • Faults
  • Transient
  • Intermittent
  • Permanent

13
Failure Models
14
Failure detectors
  • Not necessarily reliable !
  • "P is here" message, every T sec, assuming a max.
    message transmission delay D
  • Categorization of processes (hints)
  • suspected vs unsuspected
  • A process may be functioning correctly on the
    other side of a partitioned network
  • or it could be slow to respond to probes
  • Reliable detection
  • unsuspected vs failed (crashed)
  • Feasible only in synchronous systems
  • It is possible to give different responses to
    different processes
  • different comm. conditions
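
A minimal sketch of a heartbeat-based detector, assuming each process sends a "P is here" message every T seconds and that a message takes at most D seconds to arrive. The class name and the explicit `now` argument (used instead of a real clock) are assumptions for illustration.

```python
class HeartbeatDetector:
    """Marks a process as suspected if no heartbeat arrived within T + D."""

    def __init__(self, period_t: float, max_delay_d: float) -> None:
        self.timeout = period_t + max_delay_d
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, process: str, now: float) -> None:
        self.last_seen[process] = now

    def suspected(self, process: str, now: float) -> bool:
        last = self.last_seen.get(process)
        # A suspicion is only a hint: the process may just be slow
        # or sit on the other side of a network partition.
        return last is None or now - last > self.timeout
```

A later heartbeat from a suspected process moves it back to unsuspected, which is why the result is a hint rather than a reliable verdict.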

15
Failure Masking by Redundancy (I)
  • Hide the occurrence of failures from other
    processes, by redundancy
  • Information
  • Extra bits to allow recovery
  • Time
  • Transactions to allow abort/redo
  • Physical
  • Extra equipment to tolerate loss/malfunction of
    some components
  • Voter circuitry
  • Voters are components too → they may themselves
    fail!

16
Failure Masking by Redundancy (II)
  • Triple modular redundancy (TMR)
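
The voter in a TMR stage can be sketched as a majority over three inputs. This is a software illustration of the idea, not a model of voter circuitry, and raising an error when all three inputs differ is an assumption about how to signal an unmaskable fault.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value reported by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no two replicas agree; the fault cannot be masked")
```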

17
Flat vs Hierarchical Groups (I)
Process resilience by replicating processes into
groups
Group membership protocols
18
Flat vs Hierarchical Groups (II)
  • Flat groups
  • Symmetrical (no special roles)
  • No single point of failure
  • Complex operation protocols (eg voting)
  • Hierarchical groups
  • Coordinator is a single point of failure

19
Failure Masking & Replication
  • Having a group of identical processes allows us
    to mask ≥1 faulty processes
  • Primary-backup protocols
  • Hierarchical organization
  • Replicated-write protocols
  • Flat process groups
  • Active replication
  • Quorum protocols
  • K-fault tolerant system
  • Fail-silent processes → group size (k + 1)
  • Byzantine failures → group size (2k + 1)

20
Coordination/Agreement
  • A set of processes must collaborate
  • or agree with one or more processes
  • without a fixed master/slave relationship
  • failure assumptions & failure detectors
  • Problems
  • mutual exclusion
  • election
  • multicast
  • reliability & ordering semantics
  • consensus
  • Byzantine agreement

21
Problems of Agreement
  • A set of processes need to agree on a value
    (decision), after one or more processes have
    proposed what that value (decision) should be
  • Examples
  • mutual exclusion, election, transactions
  • Processes may be correct, crashed, or they may
    exhibit arbitrary (Byzantine) failures
  • Messages are exchanged on a one-to-one basis,
    and they are not signed

22
Two Agreement Problems
  • Consensus problem: every process i proposes a
    value vi, while in the undecided state. Process i
    exchanges messages until it makes decision di and
    moves to the decided state.
  • Termination: all correct processes must make a
    decision
  • Agreement: same decision for all correct
    processes
  • Integrity: if all correct processes proposed the
    same value, any correct process decides that value
  • Byzantine generals problem: a commander
    process i orders value v.
  • The lieutenant processes must agree on what the
    commander ordered.
  • Processes may be faulty
  • provide wrong or contradictory messages
  • Integrity requirement
  • A distinguished process decides a value for
    others to agree upon
  • Solution only exists if N > 3f, where f = number
    of faulty processes

23
Consensus for 3 processes
24
The Two-Army Problem
  • How can two perfect processes reach agreement
    about 1 bit of information ?
  • over an unreliable comm. channel
  • Red army: 5000 troops
  • Blue armies 1 & 2: 3000 troops each
  • How can the blue armies reach agreement on when
    to attack ?
  • Their only means of communication is by sending
    messengers
  • that may be captured by the enemy !
  • No solution!
  • Proof by contradiction: assume there is a
    solution with a minimum number of messages

25
Consensus: No Failures Case
  • majority(v1, ..., vN) returns the most frequently
    occurring value; returns a special value (⊥) if no
    majority exists
  • Consensus via reliable multicast
  • For ordered values, min/max could be used instead
    of majority
  • In general, if failures can occur it is not 100%
    certain that consensus can be reached in finite
    time!
  • Terminating Reliable Multicast (TRB): a single
    process multicasts a msg, and all correct
    processes must agree on that msg
  • Even if sender crashes, all correct processes
    must deliver a special msg (Server-Fault)
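
The failure-free case can be sketched directly: every process reliably multicasts its proposal, so all processes evaluate `majority` on the same set of values and reach the same decision. Representing the "no majority" result as `None`, and the function names, are assumptions for illustration.

```python
from collections import Counter

BOTTOM = None  # assumed stand-in for the special "no majority" value


def majority(values):
    """Most frequent value if it occurs in more than half of the inputs."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else BOTTOM


def consensus_no_failures(proposals):
    """proposals maps process id -> proposed value.

    With reliable multicast and no failures, every process receives
    every proposal, so each one computes the same decision.
    """
    received = list(proposals.values())
    return {p: majority(received) for p in proposals}
```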
26
Relation among problems
  • A problem B reduces to a problem A if there is an
    algorithm which transforms any algorithm for A
    into an algorithm for B
  • Synchronous systems: TRB is equivalent to
    Consensus
  • Asynchronous systems: Consensus reduces to TRB,
    but not vice versa!
  • Asynchronous systems with crash failures: Atomic
    Multicast is equivalent to Consensus
27
Consensus in synchronous systems
  • Duration of round: max. delay of B-multicast
  • Up to f faulty processes
  • Dolev & Strong, 1983: Any algorithm to reach
    consensus despite up to f failures requires
    (f + 1) rounds
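
A small round-based simulation can show why f + 1 rounds are needed under crash failures. It is a sketch under stated assumptions: processes multicast only the values they have not sent before, a crashing process reaches only a chosen subset of receivers in its last round, and the decision is the minimum of the collected values. The `crashes` parameter and the function name are hypothetical.

```python
def sync_consensus(proposals, f, crashes=None):
    """Simulate f + 1 synchronous rounds of value exchange.

    proposals: process id -> proposed value
    crashes:   process id -> (round, receivers reached before crashing)
    Returns the decision of every process that did not crash.
    """
    crashes = crashes or {}
    values = {p: {v} for p, v in proposals.items()}
    sent = {p: set() for p in proposals}
    alive = set(proposals)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in proposals}
        for p in sorted(alive):
            new = values[p] - sent[p]          # only values not sent before
            sent[p] |= new
            crash = crashes.get(p)
            if crash and crash[0] == rnd:
                receivers = crash[1]           # partial multicast, then crash
            else:
                receivers = set(proposals) - {p}
            for q in receivers:
                inbox[q] |= new
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            values[p] |= inbox[p]
    return {p: min(values[p]) for p in sorted(alive)}
```

If process 1 crashes in round 1 after reaching only process 2, running a single round leaves processes 2 and 3 with different value sets, while a second round lets process 2 forward what it learned.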
28
Byzantine agreement: synchronous
Faulty process
  • Nothing can be done to improve a correct
    process's knowledge beyond the first stage
  • It cannot tell which process is faulty
  • "3 says 1 says u"
  • Lamport et al, 1982: No solution for N = 3, f = 1
  • Pease et al, 1980: No solution for N ≤ 3f
    (assuming private comm. channels)
29
Agreement in Faulty Systems (I)
  • The Byzantine generals problem for 3 loyal
    generals and 1 traitor
  • The generals announce their troop strengths
  • The vectors that each general assembles based on
    (a)
  • The vectors that each general receives in step 3.

Consensus by generals 1, 2, 4 → (1, 2, UNKNOWN, 4)
30
Agreement in Faulty Systems (II)
  • The same as in previous slide, except now with 2
    loyal generals and one traitor.

31
Byzantine agreement for N > 3f
  • Example with N = 4, f = 1
  • 1st round: Commander sends a value to each
    lieutenant
  • 2nd round: Each of the lieutenants sends the value
    it has received to each of its peers
  • A lieutenant receives a total of (N - 2) + 1
    values, of which (N - 2) are correct
  • By majority(), the correct lieutenants compute
    the same value
  • In general, O(N^(f+1)) msgs
  • O(N^2) for signed msgs
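
The two rounds for N = 4, f = 1 can be sketched as a tiny simulation. The function names, the `relay_lies` parameter for modelling a faulty lieutenant, and the choice of "RETREAT" as the default when no majority exists are all assumptions for illustration.

```python
from collections import Counter


def majority(values, default="RETREAT"):
    """Most frequent value if it holds a strict majority, else the default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def byzantine_two_rounds(commander_sends, relay_lies=None):
    """commander_sends: lieutenant -> value received in round 1.

    relay_lies: (faulty lieutenant, receiver) -> value sent in round 2
    instead of the value the faulty lieutenant actually received.
    Returns each lieutenant's decision.
    """
    relay_lies = relay_lies or {}
    decisions = {}
    for lt in commander_sends:
        received = [commander_sends[lt]]            # round 1
        for peer in commander_sends:                # round 2
            if peer != lt:
                honest = commander_sends[peer]
                received.append(relay_lies.get((peer, lt), honest))
        decisions[lt] = majority(received)
    return decisions
```

With a loyal commander and one lying lieutenant, the two correct lieutenants still see two matching values out of three. With a faulty commander that sends three different values, every lieutenant sees the same inconclusive set and falls back to the same default.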
32
Impossibility of (deterministic) consensus in
asynchronous systems
  • M.J. Fischer, N. Lynch, and M. Paterson,
    "Impossibility of distributed consensus with one
    faulty process", J. ACM, 32(2), pp. 374-382,
    1985.
  • A crashed process cannot be distinguished from a
    slow one
  • Not even with a 100% reliable comm. network!
  • There is always a chance that some continuation
    of the processes' execution avoids consensus being
    reached
  • No guarantee for consensus, but Prob(consensus)
    > 0
  • Solutions based on randomization, (unreliable)
    failure detectors, or fault masking
33
Reliable client-server communication
What about reliable point-to-point transport
protocols ?
  • TCP masks omission failures
  • by using ACKs & retransmissions
  • but it does not mask crash failures!
  • E.g., when a connection is broken, the client is
    only notified via an exception

34
5 classes of failures in RPC
  • Client is unable to locate server
  • Binding exception
  • at the expense of transparency
  • Request message is lost
  • Is it safe to retransmit ?
  • Allow server to detect it is dealing with a retry
  • Server crashes after receiving a request
  • Reply message is lost
  • Client crashes after sending a request

35
Lost Request Messages & Server Crashes (I)
  • A server in client-server communication
  • Normal case
  • Crash after execution
  • Crash before execution

36
Server Crashes (II)
  • At-least-once semantics
  • Client keeps retransmitting until it gets a
    response
  • At-most-once semantics
  • Give up immediately & report failure
  • Guarantee nothing
  • Ideal would be exactly-once semantics
  • no general way to arrange this !

37
Server Crashes (III)
  • Print server scenario
  • M: server's completion message
  • Server may send M either before or after printing
  • P: server's print operation
  • C: server's crash
  • Possible event orderings
  • M → P → C
  • M → C (→ P)
  • P → M → C
  • P → C (→ M)
  • C (→ P → M)
  • C (→ M → P)

38
Server Crashes (IV)
  • Different combinations of client & server
    strategies in the presence of server crashes.

No combination of client & server strategy is
correct for all cases!
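
The claim can be checked by enumerating the cases of the print server scenario. The sketch assumes four client strategies (always reissue, never reissue, reissue only when the completion message M was received, reissue only when it was not) and that a reissued request is printed exactly once after recovery. These strategy names and the outcome labels OK, DUP, and ZERO are assumptions for illustration.

```python
# Events that happen before the crash, per server strategy.
ORDERINGS = {
    "M -> P": [("M", "P"), ("M",), ()],
    "P -> M": [("P", "M"), ("P",), ()],
}

# Client strategies: decide whether to reissue, given whether M arrived.
CLIENT_STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when ACKed": lambda acked: acked,
    "only when not ACKed": lambda acked: not acked,
}


def outcome(events_before_crash, reissue):
    """Count how often the text is printed for one crash scenario."""
    prints = events_before_crash.count("P")
    acked = "M" in events_before_crash
    if reissue(acked):
        prints += 1          # the retry is printed after recovery
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]


def table():
    """Outcome list for every (client strategy, server strategy) pair."""
    return {
        (client, server): [outcome(ev, rule) for ev in orderings]
        for client, rule in CLIENT_STRATEGIES.items()
        for server, orderings in ORDERINGS.items()
    }
```

Every row of the resulting table contains at least one DUP or ZERO entry, which is the point of the slide.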
39
Lost Reply Messages
  • Is it safe to retransmit the request ?
  • Idempotent requests
  • Example: read a file's first 1024 bytes
  • Counterexample: money transfer order
  • Assign sequence number to request
  • Server keeps track of client's most recently
    received sequence
  • additionally, set a RETRANSMISSION bit in the
    request header
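
A sketch of duplicate filtering for a non-idempotent request such as a money transfer. It assumes each client has at most one outstanding request, so a retry always carries the most recent sequence number. Caching the last reply so that it can be replayed is an addition for illustration, not something the slide states.

```python
class DedupServer:
    """Applies each (client, sequence number) request at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self.last = {}  # client -> (sequence number, cached reply)

    def deposit(self, client: str, seq: int, amount: int) -> int:
        seen = self.last.get(client)
        if seen is not None and seq == seen[0]:
            # Retransmission of a request already executed:
            # replay the reply instead of executing again.
            return seen[1]
        self.balance += amount
        self.last[client] = (seq, self.balance)
        return self.balance
```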

40
Client Crashes (I)
  • Orphan computation
  • No process waiting for the result
  • Waste of resources (CPU cycles, locks)
  • Possible confusion upon client's recovery
  • 4 alternative strategies proposed by Nelson
    (1981)
  • Extermination
  • Client keeps log of requests to be issued
  • Upon recovery, explicitly kill orphans
  • Overhead of logging (for every RPC)
  • Problems with grand-orphans
  • Problems with network partitions

41
Client Crashes (II)
  • Reincarnation
  • Divide time up into epochs (sequentially
    numbered)
  • Upon reboot, client broadcasts start-of-epoch
  • Upon receipt, all remote computations on behalf
    of this client are killed
  • After a network partition, an orphan's response
    will contain an obsolete epoch number → easily
    detected
  • Gentle reincarnation
  • Upon receipt of start-of-epoch, each server
    checks to see if it has any remote computations
  • If the owner cannot be found, the computation is
    killed
  • Expiration
  • Each RPC is given a time quantum T to complete
  • must explicitly ask for another if it cannot
    finish in time
  • After reboot, client only needs to wait a time T
  • How to select a reasonable value for T ?
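
Reincarnation with epoch numbers can be sketched as follows. The class and function names are hypothetical, and computations are represented as plain labels rather than real processes.

```python
class EpochServer:
    """Tracks remote computations per (client, epoch)."""

    def __init__(self) -> None:
        self.jobs = {}  # (client, epoch) -> list of running computations

    def start(self, client: str, epoch: int, job: str) -> None:
        self.jobs.setdefault((client, epoch), []).append(job)

    def start_of_epoch(self, client: str, new_epoch: int) -> list:
        """Kill all computations of this client from earlier epochs."""
        stale = [k for k in self.jobs if k[0] == client and k[1] < new_epoch]
        killed = []
        for key in stale:
            killed.extend(self.jobs.pop(key))
        return killed


def accept_reply(current_epoch: int, reply_epoch: int) -> bool:
    """A reply tagged with an obsolete epoch comes from an orphan."""
    return reply_epoch == current_epoch
```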

42
Basic Reliable-Multicasting Schemes
  • A simple solution to reliable multicasting when
    all receivers are known & are assumed not to fail
  • Message transmission
  • Reporting feedback

43
Nonhierarchical Feedback Control
  • Several receivers have scheduled a request for
    retransmission, but the first retransmission
    request leads to the suppression of others.

44
Hierarchical Feedback Control
  • The essence of hierarchical reliable
    multicasting
  • Each coordinator forwards the message to its
    children.
  • A coordinator handles retransmission requests.

45
Virtual Synchrony (I)
  • The logical organization of a distributed system
    to distinguish between message receipt and
    message delivery

46
Virtual Synchrony (II)
  • The principle of virtually synchronous multicast.

47
Message Ordering (I)
  • Three communicating processes in the same group.
    The ordering of events per process is shown along
    the vertical axis.

48
Message Ordering (II)
  • Four processes in the same group with two
    different senders, and a possible delivery order
    of messages under FIFO-ordered multicasting
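
FIFO-ordered delivery can be sketched with a hold-back queue that separates message receipt from message delivery. Per-sender sequence numbers starting at 0 and the class name `FifoReceiver` are assumptions for illustration.

```python
class FifoReceiver:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self) -> None:
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> received but undelivered msg
        self.delivered = []  # messages handed to the application, in order

    def receive(self, sender: str, seq: int, msg: str) -> None:
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of an already delivered message
        self.held[(sender, seq)] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected
```

Messages from different senders may interleave in any order; only the order of messages from the same sender is constrained.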

49
Implementing Virtual Synchrony (I)
50
Implementing Virtual Synchrony (II)
  • Process 4 notices that process 7 has crashed,
    sends a view change
  • Process 6 sends out all its unstable messages,
    followed by a flush message
  • Process 6 installs the new view when it has
    received a flush message from everyone else