Title: Recap
Recap
- Fault Tolerance
- Process Resilience

Today
- Reliable Client-Server Communication
- Reliable Group Communication
Reliable Communication
- There are multiple types of communication failure
  - Crashes - the communication channel breaks in some way
  - Omission - messages are dropped
  - Timing - messages arrive too slowly (or too quickly)
  - Arbitrary - messages are duplicated, corrupted, etc.
- Two primary types of communication
  - Point-to-Point
  - RPC
Point-to-Point Communication
- Reliable point-to-point communication typically takes the form of TCP sockets
- Omission failures are masked using a system of acknowledgements and retransmissions
- Arbitrary failures are masked using packet numbering and the reliability of the underlying Internet Protocol
- Crash failures and timing failures cannot always be masked - the way to get reliability in the face of crash failures is to have the system automatically re-establish broken connections
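The reconnect-on-crash idea can be sketched as a small helper (Python for illustration; `connect_with_retry` and its parameters are hypothetical names, not a standard API):

```python
import socket
import time

def connect_with_retry(host, port, attempts=5, delay=0.1):
    # Mask a crash of the peer or channel by re-establishing the TCP
    # connection, with a short pause between attempts.
    last_err = None
    for _ in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except OSError as exc:
            last_err = exc
            time.sleep(delay)  # back off, then retry
    raise ConnectionError("could not re-establish connection") from last_err
```

A real client would wrap every send/receive this way, reconnecting whenever an operation fails mid-stream; note that reconnection alone says nothing about what happened to requests in flight when the connection broke.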
Remote Procedure Calls
- The goal of RPC is to hide communication by making remote calls look local
- As long as the client and server are functioning perfectly, and the network is reasonably speedy, it does a good job
- When errors occur in communication, the differences between local and remote calls aren't always easy to mask
Remote Procedure Calls
- Five main classes of failure can occur in RPC systems
  - The client is unable to locate the server
  - The request message from the client to the server is lost
  - The server crashes after receiving a request
  - The reply message from the server to the client is lost
  - The client crashes after sending a request
- Each of these has its own set of problems
Remote Procedure Calls: Client Cannot Locate the Server
- This can happen if the server is down, or if the server has been changed since the client was built (so the interface isn't compatible anymore)
- One solution is to raise an exception on the client side that must be dealt with by an exception handler
- Drawbacks: not every language has exceptions, and this destroys the transparency
- We pretty much can't maintain transparency in this case
Remote Procedure Calls: Lost Request Messages
- This can happen for many reasons, and is the easiest failure to deal with
- We have the client (or OS) start a timer when sending the request; if there's no reply before the timer runs out, we send the request again
- If the message was really lost, everything's OK because the server never saw the first one
- If the message wasn't lost, as long as the server can detect that it's a duplicate, everything is still OK
- It's possible for the client to incorrectly conclude that the server is down, which isn't good, but can't be avoided
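The timer-and-retransmit scheme can be sketched as follows; the transport is abstracted into a `send` callable and a reply queue, and all names are illustrative rather than any real RPC library's API:

```python
import queue

def call_with_retries(send, reply_queue, request, timeout=0.05, max_tries=3):
    # Start a timer after each transmission; if no reply arrives before
    # it expires, retransmit the same request.
    for _ in range(max_tries):
        send(request)
        try:
            return reply_queue.get(timeout=timeout)  # wait for the reply
        except queue.Empty:
            continue  # timer ran out: send the request again
    # Possibly a wrong conclusion (the server may just be slow), but the
    # client has no way to tell the difference.
    raise TimeoutError("server presumed down")
```

This is exactly the mechanism that makes server-side duplicate detection necessary: the second transmission may be a duplicate of a request the server already executed.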
Remote Procedure Calls: Server Crashes
- There are multiple places where the server can crash, all of which look the same to the client (it doesn't get a reply)
Remote Procedure Calls: Server Crashes
- There are three schools of thought on what the RPC system should do in these scenarios
  - Keep trying until a reply has been received (on the assumption that the server will restart eventually), then return that reply to the client - at-least-once semantics, which guarantees the call was executed one or more times
  - Give up immediately and report the failure - at-most-once semantics, which guarantees the call was executed one time or not at all
  - Don't guarantee anything (very easy to implement)
Remote Procedure Calls: Server Crashes
- None of those options is what we really want - we really want exactly-once semantics, but unfortunately, we can't have it
- No matter what strategy is used by the client to reissue unanswered requests, or by the server to send completion messages, duplicate executions (or no execution) can result
Remote Procedure Calls: Lost Reply Messages
- One solution to lost reply messages is to just rely on a timer again, as with lost request messages
- The problem is that this may cause trouble if the request is not idempotent (an idempotent request is one for which executing it more than once has the same effect as executing it once)
- We can structure many calls idempotently, but with some it simply isn't possible
Remote Procedure Calls: Lost Reply Messages
- One solution is to use sequence numbering, or some other scheme, to let the server detect duplicates
- However, the server still has to respond to the requests and track the sequence numbers, which might substantially increase its processing overhead
- Another solution is to have a bit in the message header that distinguishes originals from duplicates - originals can always be processed safely (this doesn't help too much, though)
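Sequence-number duplicate detection might look like the sketch below: the server caches the reply for each (client, sequence number) pair and replays it for retransmissions instead of re-executing, which is what makes non-idempotent requests safe. `DedupServer` is a hypothetical name; a real server would also bound or age out the cache.

```python
class DedupServer:
    # Server-side duplicate filter keyed on (client_id, seq_no).
    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # (client_id, seq_no) -> cached reply

    def handle(self, client_id, seq_no, request):
        key = (client_id, seq_no)
        if key in self.seen:
            # Retransmitted request: replay the old reply, do NOT
            # execute the operation a second time.
            return self.seen[key]
        reply = self.handler(request)
        self.seen[key] = reply
        return reply
```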
Remote Procedure Calls: Client Crashes
- A client can send a request to a server, but then crash before it receives the response - this leaves an orphan computation running on the server
- Orphans can cause problems such as wasting CPU cycles, locking files, or otherwise using resources - also, if the client resends the request and receives a response from the orphan, chaos can ensue
- Nelson (1981) proposed four solutions to the problem of orphans
Remote Procedure Calls: Client Crashes
- Extermination
  - The client stub logs all RPC transmissions to disk; when the machine comes back up after a crash, it explicitly cancels any RPCs that were in progress
- Disadvantages of this approach
  - It's expensive to keep a log
  - The orphans themselves may make RPC calls that are difficult to cancel
  - It's possible that the network will be partitioned in a way such that the cancellation doesn't make it to the server
Remote Procedure Calls: Client Crashes
- Reincarnation
  - Time is divided into sequentially numbered epochs, and the epoch number is incremented on every reboot
  - When a client boots, it broadcasts its epoch number to all machines, and they cancel any RPCs that have an old epoch number
- Disadvantages of this approach
  - It requires a broadcast to the entire network
  - If the network is partitioned, some orphans may survive (though they can be detected once they communicate)
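The epoch bookkeeping on each machine is simple enough to sketch directly (illustrative names throughout; the broadcast transport itself is elided):

```python
class Machine:
    # Reincarnation sketch: computations are tagged with the epoch of
    # the client that started them, and an epoch broadcast kills any
    # computation from an earlier epoch of that client.
    def __init__(self):
        self.computations = []  # (owner, epoch, name)

    def start(self, owner, epoch, name):
        self.computations.append((owner, epoch, name))

    def on_epoch_broadcast(self, owner, new_epoch):
        # Cancel orphans: anything this owner started in an older epoch.
        self.computations = [
            c for c in self.computations
            if not (c[0] == owner and c[1] < new_epoch)
        ]
```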
Remote Procedure Calls: Client Crashes
- Gentle Reincarnation
  - Like reincarnation, but less draconian
  - When an epoch broadcast comes in, each machine kills only those computations for which it cannot locate the owner on the network
  - This mainly addresses the possible situation where a false epoch message is received from some faulty (or malicious) process on the network
Remote Procedure Calls: Client Crashes
- Expiration
  - Each RPC is given a quantum of time T to run to completion, and must explicitly ask for another quantum if it can't finish
  - After a crash, a client has to wait for at least one quantum to pass before coming back online, by which time all its orphans will have disappeared
  - The main problem with this method is deciding on a reasonable value for T, balancing the need to clean up orphans quickly against the communication overhead that renewal requests can cause
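Expiration is essentially a lease: a sketch under that reading (hypothetical `Lease` class, not any particular RPC system's mechanism) might look like this.

```python
import time

class Lease:
    # Each RPC runs under a quantum T; it must call renew() before the
    # quantum lapses, or the server is free to kill it as an orphan.
    def __init__(self, quantum):
        self.quantum = quantum
        self.expires = time.monotonic() + quantum

    def renew(self):
        # Explicitly ask for another quantum of time.
        self.expires = time.monotonic() + self.quantum

    def alive(self):
        return time.monotonic() < self.expires
```

A crashed client stops renewing, so its computations lapse within one quantum; the tension in choosing T is visible here as the trade-off between how often `renew()` must cross the network and how long a dead orphan can linger.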
Remote Procedure Calls: Client Crashes
- In practice, none of these solutions is particularly desirable
- Killing an orphan may also have unforeseen consequences, such as database corruption or files staying locked forever
- An orphan may have taken various actions, such as setting timers to start other processes at future times, that make removing all traces of it from the system impossible
Reliable Group Communication
- Because process resilience by replication is so important, reliable multicast services are important as well
- It turns out to be rather difficult to multicast reliably - some of the difficulty lies in defining exactly what "reliably" means in terms of multicast communication
- We distinguish between reliable multicast in the presence of faulty processes and reliable multicast when processes are assumed to operate correctly
Reliable Multicast
- If there are faulty processes, multicasting is considered reliable if it is guaranteed that all non-faulty group members receive the messages
- However, agreement needs to be reached on what the group looks like before messages can be delivered
- If there are no faulty processes, and the group membership doesn't change during communication, multicasting is considered reliable if every message is delivered to every group member - we get agreement for free
Reliable Multicast Implementations
- It's (relatively) easy to implement reliable multicast with non-faulty processes if we don't require messages to be delivered in the same order to all group members
- Unfortunately, the easy solution isn't scalable to large groups
- There are harder solutions that do scale to large groups, of which we will discuss two categories
The Easy Reliable Multicast Implementation
Scalability of the Easy Reliable Multicast Implementation
- If there are N receivers, the sender has to accept at least N acknowledgements - a feedback implosion
- One solution is to have receivers send only negative acknowledgements - when they receive a message and detect they've missed one, they ask for the one they missed
- In theory, the sender then has to keep all messages forever
- This still isn't guaranteed to prevent feedback implosions
- We need more sophisticated solutions
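The negative-acknowledgement idea rests on gap detection in sequence numbers; a minimal sketch (illustrative names, and simplified in that it delivers across the gap instead of buffering until the retransmission arrives):

```python
class NackReceiver:
    # Detects missing multicast messages by gaps in the sequence
    # numbers and records a NACK for each skipped message.
    def __init__(self):
        self.expected = 0    # next sequence number we expect
        self.delivered = []
        self.nacks = []      # sequence numbers to request again

    def receive(self, seq, payload):
        if seq > self.expected:
            # A gap: ask the sender for every message we skipped.
            self.nacks.extend(range(self.expected, seq))
        if seq >= self.expected:
            self.delivered.append((seq, payload))
            self.expected = seq + 1
```

Note the cost the slide mentions: since the receiver may NACK arbitrarily old messages, the sender cannot in general discard anything it has sent.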
Nonhierarchical Feedback Control
- The goal is to reduce the number of feedback messages - we use a technique called feedback suppression
- This technique underlies the Scalable Reliable Multicasting (SRM) protocol (Floyd et al., 1997)
- Receivers never acknowledge the successful delivery of a message
- Negative acknowledgements are multicast, not sent just to the message sender
Nonhierarchical Feedback Control
- This allows other receivers that missed the same message to suppress their feedback, because the replacement message will be multicast when the original sender gets one negative acknowledgement
- The negative acknowledgements are scheduled with random delays, to prevent feedback implosions
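Feedback suppression in this style might be sketched as below. This is a toy in the spirit of SRM, not the protocol itself (the class and method names are invented, and real SRM derives its delay window from measured distances to the sender):

```python
import random

class SuppressingReceiver:
    # A receiver that misses message m schedules a NACK after a random
    # delay, and cancels it if it overhears another receiver's
    # multicast NACK for the same m first.
    def __init__(self, max_delay=0.5):
        self.max_delay = max_delay
        self.pending = {}  # msg_id -> time at which to multicast a NACK

    def missed(self, msg_id, now):
        # Random delay: with luck only one of the receivers that missed
        # msg_id fires before the others hear it and suppress theirs.
        self.pending[msg_id] = now + random.uniform(0.0, self.max_delay)

    def overheard_nack(self, msg_id):
        # Someone else already asked; the retransmission will be
        # multicast, so our own NACK is redundant.
        self.pending.pop(msg_id, None)

    def due_nacks(self, now):
        due = sorted(m for m, t in self.pending.items() if t <= now)
        for m in due:
            del self.pending[m]
        return due
```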
Nonhierarchical Feedback Control
- Drawbacks
  - Feedback messages must be scheduled accurately to prevent feedback implosion
  - Receivers that received a message are forced to receive it again if other receivers missed it
Nonhierarchical Feedback Control
- One workaround is to let receivers that didn't get a particular message m join a separate multicast group for m - but this requires very efficient group management
- Receivers can assist in recovery to increase scalability - if a receiver has successfully received m and then sees a negative acknowledgement for m, it can multicast m itself before the negative acknowledgement reaches the original sender
Hierarchical Feedback Control
- To scale to very large groups, we need some sort of hierarchical organization
- Assume we have one sender that needs to multicast to a very large group of receivers
- We can partition the receivers into subgroups, within which any multicast method that scales to small groups can be used, and elect a local coordinator for each subgroup
Hierarchical Feedback Control
- Within each subgroup, the coordinator handles the negative acknowledgements of subgroup members by retransmitting to the subgroup
- If the coordinator misses a message, it can request it from the coordinator of its parent group
- If we base the implementation on acknowledgements rather than negative acknowledgements, the coordinator doesn't need to keep too large a buffer
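The coordinator's role can be sketched as a buffer with an escalation path (hypothetical `Coordinator` class; subgroup membership and the multicast transport are elided):

```python
class Coordinator:
    # Subgroup coordinator: buffers messages and serves retransmission
    # requests from its subgroup; if it missed a message itself, it
    # asks the coordinator of its parent group.
    def __init__(self, parent=None):
        self.parent = parent
        self.buffer = {}  # seq -> payload

    def store(self, seq, payload):
        self.buffer[seq] = payload

    def retransmit(self, seq):
        if seq in self.buffer:
            return self.buffer[seq]
        if self.parent is not None:
            payload = self.parent.retransmit(seq)  # escalate up the tree
            if payload is not None:
                self.buffer[seq] = payload  # cache for our own subgroup
            return payload
        return None  # root has nothing either
```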
Hierarchical Feedback Control
- The main problem with this scheme is the construction of the tree
- This often has to be done dynamically
  - One way is to make use of the multicast tree in the underlying network, if one exists, by adding extra software to multicast routers - but it's not easy to make that kind of change to routers already deployed on existing networks
Atomic Multicast
- Often, we need to guarantee that, in the presence of process failures, a message is delivered either to all processes in a group or to none at all - this is the atomic multicast problem
- We can define reliable multicast in the presence of process failures in terms of process groups and changes to group membership
Atomic Multicast: Communication Model
- We distinguish between message receipt and message delivery, for the purpose of modeling communication in such a system
Atomic Multicast
- Each multicast message m is associated with a list of processes to which it should be delivered - this list corresponds to the group view that the sender had at the time m was sent
- This group view is shared by the rest of the processes on the list - so each process on the list believes that m should be delivered to all processes on the list, and to no other processes
Atomic Multicast
- Suppose m is multicast when its sender has group view G
- Another process joins or leaves the group while the multicast of m is taking place - this causes a view change (which is communicated with a multicast message vc)
- There are now two messages in transit (m and vc) - we need to guarantee either that m is delivered before vc to all processes, or that m is not delivered at all
Virtual Synchrony
- In principle, the only case where m should not be delivered at all is when the group membership change is caused by the sender of m crashing - either all members of G should hear that the sender crashed before m was sent, or none should
- A reliable multicast that satisfies the requirement that a message multicast to group view G is delivered to each nonfaulty process in G is called virtually synchronous
Virtual Synchrony and Message Ordering
- Virtual synchrony is much like using a synchronization variable in a data store - view changes are barriers that messages cannot cross
- There are four different possible message orderings for virtually synchronous multicast
  - Unordered
  - FIFO-ordered
  - Causally-ordered
  - Totally-ordered
Virtual Synchrony and Message Ordering
- In reliable, unordered multicast, no guarantees are given about the order in which received messages are delivered by different processes
- In reliable, FIFO-ordered multicast, all messages from each individual process are delivered to all other processes in the same order, but no guarantees are made about the relative delivery order of messages from different processes
Virtual Synchrony and Message Ordering
- In reliable, causally-ordered multicast, messages are delivered so that potential causality among messages is preserved (this can be done with vector timestamps)
- In reliable, totally-ordered multicast, all messages are delivered in the same order to all group members (this is generally combined with a requirement of causal or FIFO ordering)
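The vector-timestamp delivery test for causal ordering can be sketched as follows (a minimal version of the classic scheme; the class name and the buffering of not-yet-deliverable messages are left as illustrative assumptions):

```python
class CausalReceiver:
    # A message from sender j carrying vector timestamp v is
    # deliverable at this process iff it is the next message in
    # sequence from j (v[j] == clock[j] + 1) and we have already
    # delivered everything j had delivered when it sent the message
    # (v[k] <= clock[k] for all k != j).
    def __init__(self, n_processes, my_id):
        self.clock = [0] * n_processes  # deliveries seen from each process
        self.my_id = my_id

    def deliverable(self, j, v):
        if v[j] != self.clock[j] + 1:
            return False
        return all(v[k] <= self.clock[k]
                   for k in range(len(v)) if k != j)

    def deliver(self, j, v):
        assert self.deliverable(j, v)
        self.clock[j] = v[j]  # record the delivery from j
```

Messages that fail the test are held back and re-checked after each delivery, which is what preserves potential causality.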
Next Class
- Distributed commit, a general distributed-systems problem of which atomic multicast is an example
- Recovery