Distributed Systems: Atomicity, decision making, snapshots

Slides adapted from Ken's CS514 lectures.

1
Distributed Systems: Atomicity, decision making, snapshots
2
Announcements
  • Please complete course evaluations
  • http://www.engineering.cornell.edu/CourseEval/
  • Prelim II coming up this week
  • Thursday, April 26th, 7:30-9:00pm, 1½ hour exam
  • 101 Phillips
  • Closed book, no calculators/PDAs/
  • Bring ID
  • Topics
  • Since last Prelim, up to (and including) Monday,
    April 23rd
  • Lectures 19-34, chapters 10-18 (7th ed)
  • Review Session Tuesday, April 24th
  • during second half of 415 Section
  • Homework 6 (and solutions) available via CMS
  • Do it without looking at solutions. However, it
    will not be graded

3
Review: What time is it?
  • In distributed system we need practical ways to
    deal with time
  • E.g. we may need to agree that update A occurred
    before update B
  • Or offer a lease on a resource that expires at
    time 10:10:01.50
  • Or guarantee that a time critical event will
    reach all interested parties within 100ms

4
Review: Event Ordering
  • Problem: distributed systems do not share a clock
  • Many coordination problems would be simplified if
    they did (first one wins)
  • Distributed systems do have some sense of time
  • Events in a single process happen in order
  • Messages between processes must be sent before
    they can be received
  • How helpful is this?

5
Review: Happens-before
  • Define a happens-before relation (denoted by →).
  • 1) If A and B are events in the same process, and
    A was executed before B, then A → B.
  • 2) If A is the event of sending a message by one
    process and B is the event of receiving that
    message by another process, then A → B.
  • 3) If A → B and B → C then A → C.

6
Review: Total ordering?
  • Happens-before gives a partial ordering of events
  • We still do not have a total ordering of events
  • We are not able to order events that happen
    concurrently
  • Concurrent if (not A→B) and (not B→A)

7
Review: Partial Ordering
Pi → Pi+1, Qi → Qi+1, Ri → Ri+1
R0 → Q4, Q3 → R4, Q1 → P4, P1 → Q2
8
Review: Total Ordering?
P0, P1, Q0, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, Q1, P1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
P0, Q0, P1, Q1, Q2, P2, P3, P4, Q3, R0, Q4, R1,
R2, R3, R4
9
Review: Timestamps
  • Assume each process has a local logical clock
    that ticks once per event and that the processes
    are numbered
  • Clocks tick once per event (including message
    send)
  • When sending a message, send your clock value
  • When receiving a message, set your clock to MAX(
    your clock, timestamp of message) + 1
  • Thus sending comes before receiving
  • Only visibility into actions at other nodes
    happens during communication; communication
    synchronizes the clocks
  • If the timestamps of two events A and B are the
    same, then use the process identity numbers to
    break ties.
  • This gives a total ordering!
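The clock rules above can be sketched in a few lines (a minimal sketch; the class and method names are illustrative, not from the lecture):

```python
# Sketch of a Lamport logical clock with pid tie-breaking.

class LamportProcess:
    def __init__(self, pid):
        self.pid = pid      # process number, used only to break ties
        self.clock = 0

    def local_event(self):
        self.clock += 1     # tick once per event
        return (self.clock, self.pid)

    def send(self):
        self.clock += 1     # sending a message is itself an event
        return self.clock   # timestamp carried on the message

    def receive(self, msg_ts):
        # MAX(own clock, message timestamp) + 1: sending comes before receiving
        self.clock = max(self.clock, msg_ts) + 1
        return (self.clock, self.pid)

# Total order: compare (timestamp, pid) tuples lexicographically.
p, q = LamportProcess(0), LamportProcess(1)
ts = p.send()           # p's clock ticks to 1
ev_q = q.receive(ts)    # q's clock becomes max(0, 1) + 1 = 2
ev_p = p.local_event()  # p's clock ticks to 2
# ev_p = (2, 0) and ev_q = (2, 1): equal timestamps, pid breaks the tie
```

Comparing the tuples gives ev_p before ev_q even though the two events are concurrent, which is exactly the arbitrary-but-consistent tie-break the slide describes.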

10
Review: Distributed Mutual Exclusion
  • Want mutual exclusion in a distributed setting
  • The system consists of n processes; each process
    Pi resides at a different processor
  • Each process has a critical section that requires
    mutual exclusion
  • Problem: cannot use an atomic testAndSet primitive,
    since memory is not shared and processes may be on
    physically separated nodes
  • Requirement:
  • If Pi is executing in its critical section, then
    no other process Pj is executing in its critical
    section
  • Compare three solutions:
  • Centralized Distributed Mutual Exclusion (CDME)
  • Fully Distributed Mutual Exclusion (DDME)
  • Token passing

11
Today
  • Atomicity and Distributed Decision Making
  • What time is it now?
  • Synchronized clocks
  • What does the entire system look like at this
    moment?

12
Atomicity
  • Recall:
  • Atomicity: either all the operations associated
    with a program unit are executed to completion,
    or none are performed.
  • In a distributed system we may have multiple copies
    of the data
  • (e.g. replicas are good for reliability/availability)
  • PROBLEM: How do we atomically update all of the
    copies?
  • That is, either all replicas reflect a change or
    none do

13
Generals' Paradox
  • Generals' paradox
  • Constraints of the problem:
  • Two generals, on separate mountains
  • Can only communicate via messengers
  • Messengers can be captured
  • Problem: need to coordinate an attack
  • If they attack at different times, they all die
  • If they attack at the same time, they win
  • Named after Custer, who died at Little Bighorn
    because he arrived a couple of days too early!
  • Can messages over an unreliable network be used
    to guarantee two entities do something
    simultaneously?
  • Remarkably, no, even if all messages get
    through
  • No way to be sure the last message gets through!

14
Replica Consistency Problem - Concurrent and
conflicting updates
  • Imagine we have multiple bank servers and a
    client desiring to update their bank account
  • How can we do this?
  • Allow a client to update any server, then have
    that server propagate the update to other servers?
  • Simple and wrong!
  • Simultaneous and conflicting updates can occur at
    different servers
  • Have the client send the update to all servers?
  • Same problem - a race condition: which of the
    conflicting updates will reach each server first?

15
Two-phase commit
  • Since we can't solve the Generals' Paradox (i.e.
    simultaneous action), and concurrent and conflicting
    updates may be sent by clients, let's solve a
    related problem
  • Distributed transaction: two machines agree to do
    something, or not do it, atomically
  • Algorithm for providing atomic updates in a
    distributed system
  • Give the servers (or replicas) a chance to say no,
    and if any server says no, the client aborts the
    operation

16
Framework
  • Goal: update all replicas atomically
  • Either everyone commits or everyone aborts
  • No inconsistencies even in the face of failures
  • Caveat: assume only crash or fail-stop failures
  • Crash: servers stop when they fail; they do not
    continue and generate bad data
  • Fail-stop: in addition to crash, the
    failure is detectable.
  • Definitions:
  • Coordinator: software entity that shepherds the
    process (in our example could be one of the
    servers)
  • Ready to commit: side effects of the update safely
    stored on non-volatile storage
  • Even if I crash, once I say I am ready to commit,
    a recovery procedure will find evidence and
    continue with the commit protocol

17
Two Phase Commit Phase 1
  • Coordinator sends a PREPARE message to each
    replica
  • Coordinator waits for all replicas to reply with
    a vote
  • Each participant replies with a vote:
  • Votes PREPARED if ready to commit, and locks the
    data items being updated
  • Votes NO if unable to get a lock or unable to
    ensure it is ready to commit

18
Two Phase Commit Phase 2
  • If the coordinator receives a PREPARED vote from all
    replicas, then it may decide to commit or abort
  • Coordinator sends its decision to all participants
  • If a participant receives a COMMIT decision, it
    commits the changes resulting from the update
  • If a participant receives an ABORT decision, it
    discards the changes resulting from the update
  • Participant replies DONE
  • When the coordinator has received DONE from all
    participants, it can delete its record of the outcome
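The two phases above can be sketched in a few lines (an illustrative, failure-free sketch: all names are invented, and a real implementation would force each vote and decision to stable storage before sending it):

```python
# Sketch of the two-phase commit message flow.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit   # can this replica get its locks?
        self.state = "INIT"

    def prepare(self):
        # Phase 1: vote PREPARED (locking data) or NO
        if self.can_commit:
            self.state = "PREPARED"    # would be forced to disk here
            return "PREPARED"
        self.state = "ABORTED"
        return "NO"

    def decide(self, decision):
        # Phase 2: apply the coordinator's decision, reply DONE
        self.state = decision
        return "DONE"

def coordinator(participants):
    votes = [p.prepare() for p in participants]        # Phase 1
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    acks = [p.decide(decision) for p in participants]  # Phase 2
    assert all(a == "DONE" for a in acks)              # bookkeeping only
    return decision
```

A single NO vote drives every participant to ABORT, which is the "any server can say no" property from the previous slide.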

19
Performance
  • In the absence of failure, 2PC (two-phase commit)
    makes a total of 2 (1.5?) round trips of messages
    before the decision is made:
  • Prepare
  • Vote NO or PREPARED
  • Commit/abort
  • Done (but Done is just for bookkeeping; it does not
    affect response time)

20
Failure Handling in 2PC - Replica Failure
  • The log contains a <commit T> record.
  • In this case, the site executes redo(T).
  • The log contains an <abort T> record.
  • In this case, the site executes undo(T).
  • The log contains a <ready T> record.
  • In this case, consult the coordinator Ci.
  • If Ci is down, the site sends a query-status T message
    to the other sites.
  • The log contains no control records concerning T.
  • In this case, the site executes undo(T).
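These recovery rules amount to a small decision function over the log's control records (a sketch; the record and action names are illustrative strings, not a real logging API):

```python
# Sketch of a recovering replica's decision for transaction T,
# based on which control records its log contains.

def recover(log_records):
    if "commit" in log_records:
        return "redo"             # T committed: execute redo(T)
    if "abort" in log_records:
        return "undo"             # T aborted: execute undo(T)
    if "ready" in log_records:
        return "ask-coordinator"  # consult Ci (or query the other sites)
    return "undo"                 # no control records: T never prepared
```

Note the order of the checks matters: a log that reached <commit T> also contains <ready T>, so the commit/abort records must be consulted first.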

21
Failure Handling in 2PC - Coordinator Ci Failure
  • If an active site contains a <commit T> record in
    its log, then T must be committed.
  • If an active site contains an <abort T> record in
    its log, then T must be aborted.
  • If some active site does not contain the record
    <ready T> in its log, then the failed coordinator
    Ci cannot have decided to commit T. Rather than
    wait for Ci to recover, it is preferable to abort
    T.
  • All active sites have a <ready T> record in their
    logs, but no additional control records. In this
    case we must wait for the coordinator to recover.
  • Blocking problem: T is blocked pending the
    recovery of site Si.

22
Failure Handling
  • Failures detected with timeouts
  • If a participant times out before getting a PREPARE,
    it can abort
  • If the coordinator times out waiting for a vote, it
    can abort
  • If a participant times out waiting for a decision,
    it is blocked!
  • Wait for the coordinator to recover?
  • Punt to some other resolution protocol
  • If the coordinator times out waiting for DONE, it
    keeps its record of the outcome
  • other sites may hold a replica.

23
Failures in distributed systems
  • We may want to avoid relying on a single
    server/coordinator/boss to make progress
  • Thus we want the decision making to be distributed
    among the participants (all nodes created
    equal) => the consensus problem in distributed
    systems.
  • However, depending on what we can assume about the
    network, it may be impossible to reach a decision
    in some cases!

24
Impossibility of Consensus
  • Network characteristics:
  • Synchronous - some upper bound on
    network/processing delay.
  • Asynchronous - no upper bound on
    network/processing delay.
  • Fischer, Lynch, and Paterson showed:
  • With even just one failure possible, you cannot
    guarantee consensus.
  • Cannot guarantee the consensus process will terminate
  • Assumes an asynchronous network
  • Essence of proof: just before a decision is
    reached, we can delay a node slightly too long to
    reach a decision.
  • But we still want to do it... right?

25
Distributed Decision Making Discussion
  • Why is distributed decision making desirable?
  • Fault tolerance!
  • A group of machines can come to a decision even
    if one or more of them fail during the process
  • Simple failure mode called fail-stop (different
    modes later)
  • After the decision is made, the result is recorded
    in multiple places
  • Undesirable feature of two-phase commit: blocking
  • One machine can be stalled until another site
    recovers:
  • Site B writes a "prepared to commit" record to its
    log, sends a "yes" vote to the coordinator (site
    A), and crashes
  • Site A crashes
  • Site B wakes up, checks its log, and realizes that
    it has voted "yes" on the update. It sends a
    message to site A asking what happened. At this
    point, B cannot decide to abort, because the update
    may have committed
  • B is blocked until A comes back
  • A blocked site holds resources (locks on updated
    items, pages pinned in memory, etc.) until it
    learns the fate of the update
  • Alternative: there are alternatives, such as
    Three-Phase Commit, which don't have this
    blocking problem
  • What happens if one or more of the nodes is
    malicious?
  • Malicious: attempting to compromise the decision
    making
  • Known as Byzantine fault tolerance. More on this
    next time

26
Introducing wall clock time
  • Back to the notion of time
  • Distributed systems sometimes need a more precise
    notion of time than happens-before
  • There are several options
  • Instead of network/process identity to break
    ties:
  • Extend a logical clock with the wall-clock time and
    use it to break ties
  • Makes meaningful statements like "B and D were
    concurrent, although B occurred first"
  • But unless clocks are closely synchronized, such
    statements could be erroneous!
  • We use a clock synchronization algorithm to
    reconcile differences between clocks on various
    computers in the network

27
Synchronizing clocks
  • Without help, clocks will often differ by many
    milliseconds
  • The problem is that when a machine downloads time
    from a network clock, it can't be sure what the
    delay was
  • This is because the uplink and downlink
    delays are often very different in a network
  • Outright failures of clocks are rare

28
Synchronizing clocks
  • Suppose p synchronizes with time.windows.com and
    notes that 123 ms elapsed while the protocol was
    running - what time is it now?

[Figure: p asks time.windows.com "What time is it?"; the reply 09:23:02.921 arrives after a 123 ms round-trip delay]
29
Synchronizing clocks
  • Options?
  • p could guess that the delay was evenly split,
    but this is rarely the case in WAN settings
    (downlink speeds are higher)
  • p could ignore the delay
  • p could factor in only the certain delay, e.g. if
    we know that the link takes at least 5 ms in each
    direction. Works best with GPS time sources!
  • In general, we can't do better than the uncertainty
    in the link delay from the time source down to p
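The even-split option above can be made concrete (a sketch of delay compensation with an optional known minimum one-way delay; the function name and all values are illustrative):

```python
# Sketch: estimate remote time from a server reading and the measured
# round-trip time (RTT), assuming the delay split evenly - which the
# slide notes is often wrong on WAN links.

def estimate_time(server_time_ms, rtt_ms, min_one_way_ms=0):
    """Best guess is server_time + rtt/2; the error is bounded by
    +/- (rtt/2 - min_one_way), the uncertainty in the downlink delay."""
    estimate = server_time_ms + rtt_ms / 2
    uncertainty = rtt_ms / 2 - min_one_way_ms
    return estimate, uncertainty

# p measured a 123 ms round trip around a server reading of 1,000,000 ms:
est, err = estimate_time(1_000_000, 123)
# est = 1_000_061.5 ms, uncertainty +/- 61.5 ms
```

Knowing a 5 ms minimum link delay in each direction tightens the bound to +/- 56.5 ms, which is the "factor in only the certain delay" option.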

30
Consequences?
  • In a network of processes, we must assume that
    clocks are:
  • Not perfectly synchronized.
  • We say that clocks are inaccurate
  • Even GPS has uncertainty, although it is small
  • And clocks can drift during periods between
    synchronizations
  • The relative drift between clocks is their precision

31
Temporal distortions
  • Things can be complicated because we can't
    predict:
  • Message delays (they vary constantly)
  • Execution speeds (often a process shares a
    machine with many other tasks)
  • Timing of external events
  • Lamport looked at this question too

32
Temporal distortions
  • What does "now" mean?

[Figure: timelines for processes p0-p3 with events a-f]
34
Temporal distortions
Timelines can "stretch", caused by
scheduling effects, message delays, message loss

[Figure: timelines for processes p0-p3 with events a-f]
35
Temporal distortions
Timelines can "shrink": e.g. something lets a
machine speed up

[Figure: timelines for processes p0-p3 with events a-f]
36
Temporal distortions
Cuts represent instants of time. But not
every cut makes sense: black cuts could occur,
but not gray ones.

[Figure: timelines for processes p0-p3 with events a-f, crossed by black (consistent) and gray (inconsistent) cuts]
37
Consistent cuts and snapshots
  • The idea is to identify system states that might
    have occurred in real life
  • Need to avoid capturing states in which a message
    is received but nobody is shown as having sent it
  • This is the problem with the gray cuts

38
Temporal distortions
Red messages cross gray cuts "backwards"

[Figure: timelines for processes p0-p3 with events a-f; red messages cross the gray cuts backwards]
39
Temporal distortions
Red messages cross gray cuts "backwards": in
a nutshell, the cut includes a message that was
never sent

[Figure: timelines for processes p0-p3; a gray cut includes a received message whose send lies after the cut]
40
Who cares?
  • Suppose, for example, that we want to do
    distributed deadlock detection
  • System lets processes wait for actions by other
    processes
  • A process can only do one thing at a time
  • A deadlock occurs if there is a circular wait

41
Deadlock detection algorithm
  • p worries: perhaps we have a deadlock
  • p is waiting for q, so it sends "what's your state?"
  • q, on receipt, is waiting for r, so it sends the
    same question; and r for s. And s is waiting on
    p.
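The probe described above is essentially a walk along wait-for edges (a sketch over an in-memory wait-for graph; in the real protocol each step is a "what's your state?" message to another process):

```python
# Sketch: follow "waiting for" edges starting from `start`; report a
# deadlock if the probe comes back to where it started.

def detect_cycle(start, wait_for):
    seen = {start}
    current = wait_for.get(start)
    while current is not None:
        if current == start:
            return True      # probe returned to the initiator: cycle
        if current in seen:
            return False     # a cycle exists, but it doesn't include start
        seen.add(current)
        current = wait_for.get(current)
    return False             # chain ended at a process that isn't waiting

wait_for = {"p": "q", "q": "r", "r": "s", "s": "p"}
detect_cycle("p", wait_for)  # True: p -> q -> r -> s -> p
```

The next slides show why this answer can still be wrong: the edges are gathered at different moments, so the "cycle" may lie on an inconsistent cut.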

42
Suppose we detect this state
  • We see a cycle...
  • but is it a deadlock?

[Figure: cycle of wait-for edges: p waiting for q, q for r, r for s, s for p]
43
Phantom deadlocks!
  • Suppose the system has a very high rate of locking.
  • Then perhaps a lock release message passed a
    query message
  • i.e. we see "q waiting for r" and "r waiting for
    s", but in fact, by the time we checked r, q was
    no longer waiting!
  • In effect we checked for deadlock on a gray cut -
    an inconsistent cut.

44
Consistent cuts and snapshots
  • The goal is to draw a line across the system state
    such that:
  • Every message received by a process is shown as
    having been sent by some other process
  • Some pending messages might still be in
    communication channels
  • A cut is the frontier of a snapshot

45
Chandy/Lamport Algorithm
  • Assume that if pi can talk to pj, they do so using
    a lossless, FIFO connection
  • Now think about logical clocks
  • Suppose someone sets his clock way ahead and
    triggers a flood of messages
  • As these reach each process, it advances its own
    time; eventually all do so.
  • The point where time "jumps forward" is a
    consistent cut across the system

46
Using logical clocks to make cuts
Message sets the time forward "by a lot"

[Figure: timelines for processes p0-p3 with events a-f]

Algorithm requires FIFO channels: must delay e
until b has been delivered!
47
Using logical clocks to make cuts
The cut occurs at the point where time advanced

[Figure: timelines for processes p0-p3; the cut crosses each timeline where its clock jumped forward]
48
Turn idea into an algorithm
  • To start a new snapshot, pi:
  • Builds a message "pi is initiating snapshot k"
  • The tuple (pi, k) uniquely identifies the
    snapshot
  • In general, on first learning about snapshot (pi,
    k), px:
  • Writes down its state: px's contribution to the
    snapshot
  • Starts "tape recorders" for all communication
    channels
  • Forwards the message on all outgoing channels
  • Stops the "tape recorder" for a channel when a
    snapshot message for (pi, k) is received on it
  • The snapshot consists of all the local state
    contributions and all the tape recordings for the
    channels
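The per-process rules above can be sketched for a single snapshot (simplified: one snapshot, no (pi, k) identifiers, and forwarding the marker on outgoing channels is omitted; all names are illustrative):

```python
# Sketch of one process's role in a Chandy/Lamport snapshot.

class SnapshotProcess:
    def __init__(self, channels_in):
        self.state = 0                 # stand-in for local application state
        self.recorded_state = None     # this process's snapshot contribution
        self.channel_log = {c: [] for c in channels_in}  # "tape recorders"
        self.stopped = set()           # channels whose marker has arrived

    def start_snapshot(self):
        # First learning of the snapshot: write down local state and start
        # recording every incoming channel (a real process would also
        # forward the marker on all outgoing channels here).
        self.recorded_state = self.state

    def on_message(self, channel, msg):
        if msg == "MARKER":
            if self.recorded_state is None:
                self.start_snapshot()
            self.stopped.add(channel)  # stop this channel's tape recorder
        else:
            if self.recorded_state is not None and channel not in self.stopped:
                self.channel_log[channel].append(msg)  # in-flight message
            self.state += 1            # stand-in for applying the message

    def done(self):
        # Finished once a marker has arrived on every incoming channel
        return self.stopped == set(self.channel_log)
```

A short run, assuming process p has incoming channels from q and r:

```python
p = SnapshotProcess(channels_in=["q->p", "r->p"])
p.on_message("q->p", "m1")      # before the snapshot: just applied
p.on_message("q->p", "MARKER")  # record local state, stop recorder q->p
p.on_message("r->p", "m2")      # in flight on r->p: goes into the channel log
p.on_message("r->p", "MARKER")  # all channels stopped: p is done
```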

49
Chandy/Lamport
  • This algorithm, but implemented with an outgoing
    flood, followed by an incoming wave of snapshot
    contributions
  • The snapshot ends up accumulating at the initiator,
    pi
  • The algorithm doesn't tolerate process failures or
    message failures.

50
Chandy/Lamport
[Figure: a network of processes p, q, r, s, t, u, v, w, x, y, z]
A network
51
Chandy/Lamport
p: "I want to start a snapshot"

[Figure: the network]
52
Chandy/Lamport
p records local state

[Figure: the network]
53
Chandy/Lamport
p starts monitoring incoming channels

[Figure: the network]
54
Chandy/Lamport
"contents of channel p-y"

[Figure: the network]
55
Chandy/Lamport
p floods the message on outgoing channels

[Figure: the network]
57
Chandy/Lamport
q is done

[Figure: the network]
[Slides 58-62: figures showing each process's snapshot contribution flowing back toward the initiator p]
63
Chandy/Lamport
Done!

[Figure: every process's recorded state has been collected at p]

A snapshot of a network
64
What's in the state?
  • In practice we only record things important to
    the application running the algorithm, not the
    whole state
  • E.g. locks currently held, lock release
    messages
  • The idea is that the snapshot will be:
  • Easy to analyze, letting us build a picture of
    the system state
  • And will have everything that matters for our
    real purpose, like deadlock detection