Distributed Systems 600.437 presentation

1
Distributed Systems 600.437: Replication II
Department of Computer Science, The Johns Hopkins University
2
Resilient Replication
Lecture 8

Further readings
Paxos Made Simple, Leslie Lamport.
Practical Byzantine Fault Tolerance, Miguel Castro and Barbara Liskov, OSDI 99.
Paper from DSN 2006 on our www.cnds.jhu.edu/publications web page (Scaling Byzantine Fault-Tolerant Replication to Wide Area Networks).

3
Introduction

Resilient State Machine Replication
What does resilient mean?
We want our replicated system to work even when
faults occur.
What kinds of faults can occur?
Servers may crash!
Servers might recover.
Network partitions can separate servers from each other.
Byzantine faults: servers may lie!
We are going to use a class of protocols that is
able to handle network connectivity changes
better than those based on membership.

4
State Machine Replication

Servers start in the same state.
Servers change their state only when they execute
an update.
State changes are deterministic. Two servers in
the same state will move to identical states, if
they execute the same update.
If servers execute updates in the same order,
they will progress through exactly the same
states. State Machine Replication!

5
State Machine Replication Example

Our state: one variable.
Operations (cause state changes):
Op 1) +n: Add n to our variable.
Op 2) ?v=n: If the variable equals v, then set it to n.
Start: All servers have variable = 0.
If we apply the above operations in the same order, then the servers will remain consistent.

[Figure: four servers each start at v=0 and apply +2, +1, ?3=9 in the same order, passing through v=2, v=3, and v=9.]
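
The example above can be replayed in a few lines of Python. This is a minimal sketch, not code from the lecture: the tuple encoding of updates and the names `apply_update` and `run` are assumptions made for illustration. It shows that applying the same updates in the same order yields the same states, while a different order does not.

```python
from typing import List, Tuple

# An update is ("add", n) or ("cas", expected, new).
# Both tags are hypothetical names for the two operations on the slide.
Update = Tuple


def apply_update(state: int, update: Update) -> int:
    """Deterministically compute the next state from the current one."""
    if update[0] == "add":
        return state + update[1]
    if update[0] == "cas":
        _, expected, new = update
        return new if state == expected else state
    raise ValueError(f"unknown update: {update!r}")


def run(updates: List[Update], start: int = 0) -> List[int]:
    """Return every state a server passes through, starting from `start`."""
    states = [start]
    for update in updates:
        states.append(apply_update(states[-1], update))
    return states


ordered = [("add", 2), ("add", 1), ("cas", 3, 9)]
reordered = [("cas", 3, 9), ("add", 2), ("add", 1)]

print(run(ordered))    # the order used in the example
print(run(reordered))  # same updates, different order
```

With the slide's order the variable ends at 9; with the conditional update moved to the front the condition fails and the variable ends at 3, which is why the servers must agree on the order.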
6
State Machine Replication
[Figure: Clients generate updates (B,3; D,1; A,2; C,2). The servers establish an order. Each of the four servers holds the same ordered updates (C,1; B,2; A,1; B,1) and applies them.]
7
Simple Replication
Can we use a Leader to establish an order?
[Figure: a server sends update u to the Leader; the Leader sends (u,s) to the other servers.]

Server sends update, u, to Leader.
Leader assigns a sequence number, s, to u, and sends the update to the non-leader servers.
Servers order update u with sequence number s.
Is this resilient?
If the leader fails, then the system is not live!
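
The leader-based scheme can be sketched as follows. This is an illustrative toy under stated assumptions: the class names `Server` and `Leader`, the method names, and the use of direct method calls in place of network messages are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    name: str
    ordered: Dict[int, str] = field(default_factory=dict)  # seq -> update

    def on_ordered_update(self, update: str, seq: int) -> None:
        """Order update `update` with sequence number `seq`."""
        self.ordered[seq] = update


@dataclass
class Leader(Server):
    next_seq: int = 0
    followers: List[Server] = field(default_factory=list)

    def on_update(self, update: str) -> int:
        """Assign the next sequence number and send (u, s) to the others."""
        self.next_seq += 1
        seq = self.next_seq
        self.on_ordered_update(update, seq)
        for follower in self.followers:
            follower.on_ordered_update(update, seq)
        return seq


followers = [Server("s2"), Server("s3")]
leader = Leader("s1", followers=followers)
leader.on_update("a")
leader.on_update("b")
```

Every server ends with the same sequence-number-to-update map, but only the leader ever calls `on_update`. If it crashes, no new sequence numbers are assigned, which is the liveness problem the slide points out.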

8
How can we improve resiliency?
Elect another leader.
Use more messages.
Assign a sequence number to each leader. (Views)
Use the fact that two sets, each having at least
a majority of servers, must intersect!
First, we need to describe our system model and service properties.
9
Our System Model

N servers
Uniquely identified in 1..N
Asynchronous communication
Message loss, duplication, and delay
Network partitions
No message corruption
Benign faults
Crash/recovery with stable storage
No Byzantine behavior (not yet, anyway)

10
What is Safety?

Safety: If two servers execute the i-th update, then these updates are the same.
Another way to look at safety:
If there exists an ordered update (u,s) at some server, then there cannot exist an ordered update (u',s) at any other server, where u' ≠ u.
We will now focus on achieving safety: making sure that we don't execute updates in different orders on different servers.
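
The safety property can be stated as a small checker over the servers' logs. This is a sketch for illustration only: the log representation (a dictionary from sequence number to update) and the function name `find_safety_violation` are assumptions.

```python
from typing import Dict, Optional, Tuple

Log = Dict[int, str]  # sequence number -> update ordered at that number


def find_safety_violation(logs: Dict[str, Log]) -> Optional[Tuple[int, str, str]]:
    """Return (seq, server_a, server_b) for the first conflict, or None.

    A conflict means two servers ordered different updates with the same
    sequence number. A server that has not ordered a sequence number yet
    is not a violation.
    """
    first_seen: Dict[int, Tuple[str, str]] = {}  # seq -> (server, update)
    for server in sorted(logs):
        for seq in sorted(logs[server]):
            update = logs[server][seq]
            if seq not in first_seen:
                first_seen[seq] = (server, update)
            elif first_seen[seq][1] != update:
                return (seq, first_seen[seq][0], server)
    return None


safe_logs = {"a": {1: "x", 2: "y"}, "b": {1: "x"}}
unsafe_logs = {"a": {1: "x", 2: "y"}, "b": {1: "x", 2: "z"}}
print(find_safety_violation(safe_logs))
print(find_safety_violation(unsafe_logs))
```

Note that server "b" lagging behind in `safe_logs` is allowed; safety only forbids disagreement, not delay.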

11
Achieving Safety
[Figure: Leader 1 and Leader 2 both send proposals for the same sequence number s to the servers.]
Is this safe?
A new leader can violate safety! Can we fix this?

A new leader must not violate previously
established ordering!
The new leader must know about all updates that
may have been ordered.

12
Achieving Safety
[Figure: a server sends update u to the Leader; the Leader sends (u,s) to all servers, and the servers exchange (u,s) messages with each other.]

Leader sends Proposal(u,s) to all servers.
All servers send Accept(u,s) to all servers.
Servers order (u,s) when they receive a majority of Proposal/Accept(u,s) messages.
What does this give us?
If a new leader gets information from any majority of servers, it can determine what may have been ordered!
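
The argument rests on the fact that any two majorities share at least one server. The brute-force check below is a sketch that confirms this for small N; the function names `majorities` and `all_majorities_intersect` are made up for this example.

```python
from itertools import combinations
from typing import FrozenSet, Iterator


def majorities(n: int) -> Iterator[FrozenSet[int]]:
    """Yield every subset of servers 1..n that holds a strict majority."""
    servers = range(1, n + 1)
    for size in range(n // 2 + 1, n + 1):
        for subset in combinations(servers, size):
            yield frozenset(subset)


def all_majorities_intersect(n: int) -> bool:
    """True if every pair of majorities of n servers shares a server."""
    groups = list(majorities(n))
    return all(a & b for a in groups for b in groups)


for n in range(1, 8):
    print(n, all_majorities_intersect(n))
```

So if some server ordered (u,s) after hearing from a majority, any other majority that a new leader contacts must contain at least one server that saw (u,s).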

13
Changing Leaders

Changing Leaders is commonly called a View
Change.
Servers use timeouts to detect failures.
If the current leader fails, the servers elect a new leader.
The new leader cannot propose updates until it
collects information from a majority of servers!
Each server reports any Proposals that it knows
about.
If any server ordered a Proposal(u,s), then at
least one server in any majority will report a
Proposal for that sequence number!
The new leader will never violate prior ordering!!
Now we have a safe protocol!!

14
Changing Leaders Example
[Figure: Leader 1, Leader 2, and servers 3, 4, and 5, with (u,s) messages between them.]
Leader 2 can send a Proposal(u,s). We say it replays (u,s).

If any server orders (u,s), then at least a majority of servers must have received Proposal(u,s).
If a new server is elected leader, it will gather
Proposals from a majority of servers.
The new leader will learn about the ordered
update!!

15
Is Our Protocol Live?

Liveness: If there is a set, C, consisting of a majority of connected servers, then if any server in set C has a new update, this update will eventually be executed.
Is there a problem with our protocol? It is safe,
but is it live?

16
Liveness Example
[Figure: Leaders 1, 2, and 3 with servers 3, 4, and 5; Leader 1 and Leader 2 have each sent a Proposal for sequence number s.]

Leader 3 gets conflicting Proposal messages!
Which one should it choose?
What should we add??

17
Adding View Numbers
[Figure: Leaders 1, 2, and 3 with servers 3 and 4; Leader 1's messages carry view 1 and Leader 2's messages carry view 2 for the same sequence number s.]

We add view numbers to the Proposal(v,u,s)!
Leader 3 gets conflicting Proposal messages!
Which one should it choose?
It chooses the one with the greatest view number!!
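
The selection rule can be sketched as follows: for each sequence number, the new leader keeps the reported Proposal with the greatest view number. The `Proposal` tuple and the function name `proposals_to_replay` are assumptions made for illustration, and the reports are assumed to come from a majority of servers.

```python
from typing import Dict, Iterable, NamedTuple


class Proposal(NamedTuple):
    view: int
    update: str
    seq: int


def proposals_to_replay(reports: Iterable[Iterable[Proposal]]) -> Dict[int, Proposal]:
    """For each sequence number, keep the reported Proposal with the greatest view."""
    chosen: Dict[int, Proposal] = {}
    for report in reports:
        for proposal in report:
            best = chosen.get(proposal.seq)
            if best is None or proposal.view > best.view:
                chosen[proposal.seq] = proposal
    return chosen


# One report per server; the third server knows about no Proposals.
reports = [
    [Proposal(view=1, update="u", seq=7)],
    [Proposal(view=2, update="u2", seq=7)],
    [],
]
replay = proposals_to_replay(reports)
print(replay[7])
```

Here the new leader replays the update from view 2 for sequence number 7 and discards the older view-1 Proposal.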

18
Normal Case

Assign-Sequence()
A1. u = NextUpdate()
A2. next_seq++
A3. SEND Proposal(view, u, next_seq)
Upon receiving Proposal(v,u,s)
B1. if not leader and v == my_view
B2. SEND Accept(v,u,s)
Upon receiving Proposal(v,u,s) and majority - 1 Accept(v,u,s)
C1. ORDER (u,s)

We use view numbers to determine which Proposal
may have been ordered previously.
A server sends an Accept(v,u,s) message only for
a view that it is currently in, and never for a
lower view!
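
The normal case can be exercised with a small in-memory simulation. This is a sketch under several assumptions that are not in the slides: the `Replica` class, its field names, the message tuples, and the loss-free `run_until_quiet` network are all hypothetical, and messages from any view other than the current one are simply dropped.

```python
from collections import defaultdict


class Replica:
    def __init__(self, rid, n, leader_id, view=1):
        self.rid, self.n, self.leader_id = rid, n, leader_id
        self.my_view = view
        self.next_seq = 0
        self.proposals = {}              # (view, seq) -> update
        self.accepts = defaultdict(set)  # (view, update, seq) -> sender ids
        self.ordered = {}                # seq -> update
        self.outbox = []                 # messages waiting to be broadcast

    def assign_sequence(self, update):
        """Steps A1-A3: the leader picks the next sequence number and proposes."""
        self.next_seq += 1
        self.outbox.append(("Proposal", self.rid, self.my_view, update, self.next_seq))

    def receive(self, msg):
        kind, sender, v, u, s = msg
        if v != self.my_view:
            return  # simplification: ignore messages from any other view
        if kind == "Proposal" and sender == self.leader_id:
            self.proposals[(v, s)] = u
            if self.rid != self.leader_id:  # steps B1-B2
                self.outbox.append(("Accept", self.rid, v, u, s))
        elif kind == "Accept":
            self.accepts[(v, u, s)].add(sender)
        self.try_order(v, u, s)

    def try_order(self, v, u, s):
        """Step C1: order once we hold the Proposal and majority - 1 Accepts."""
        majority = self.n // 2 + 1
        if self.proposals.get((v, s)) == u and len(self.accepts[(v, u, s)]) >= majority - 1:
            self.ordered[s] = u


def run_until_quiet(replicas):
    """Deliver every queued message to every listed replica (no loss)."""
    while any(r.outbox for r in replicas):
        for r in replicas:
            pending, r.outbox = r.outbox, []
            for msg in pending:
                for dest in replicas:
                    dest.receive(msg)


replicas = [Replica(i, 3, leader_id=1) for i in (1, 2, 3)]
replicas[0].assign_sequence("x")
run_until_quiet(replicas)
print([r.ordered for r in replicas])
```

Passing only a subset of replicas to `run_until_quiet` models a partition: a leader that can reach fewer than a majority never collects enough Accepts to order anything.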
19
Leader Election

Elect Leader()
Upon Timer T Expire
A1. my_view++
A2. SEND New-Leader(my_view)
Upon receiving New-Leader(v)
B1. if Timer T Stopped
B2. if v > my_view, then my_view = v
B3. SEND New-Leader(my_view)
Upon receiving majority New-Leader(v) where v == my_view
C1. timeout = 2 * Timer T timeout
C2. Start Timer T

Let V_max be the highest view that any server
has. Then, at least a majority of servers are in
view V_max or V_max - 1.
Servers will stay in the maximum view for at
least one full timeout period.
A server that becomes disconnected/connected repeatedly cannot disrupt the other servers.
20
We Have Paxos
[Figure: message flow among client C and servers 0, 1, and 2: request, proposal, accept, reply.]

The Part-Time Parliament [Lamport, 98]
A very resilient protocol. Only a majority of
participants are required to make progress.
Works well on unstable networks.
Only handles benign failures (not Byzantine).

21
What Happens If Servers Lie?

Servers must be able to verify who sent each
message.
Crypto! Digital signatures or MACs.
The leader might be bad!
What might happen?
The leader can send Proposal(u,s) to 2 out of 5 servers and Proposal(u',s) to 2 out of 5 servers. Can we have a safety violation?
Correct servers must make sure the malicious servers don't cause safety errors.
The bad servers might send messages or they might
not.

22
Byzantine Leader Example
[Figure: a Bad Leader and servers 2, 3, 4, and 5; the leader sends (u,s) to some servers and (u',s) to the others.]

Bad Leader sends Proposal(u,s) to servers 4 and 5.
Bad Leader sends Proposal(u',s) to servers 2 and 3.
Server 4 could order (u,s) and server 3 could order (u',s).

23
How Do We Solve this Problem?

Assume that at most f servers can fail or become malicious. All of the other servers are correct.
Let N denote the number of servers in our system.
Any correct server can wait for at most N - f
messages from servers, because f may fail or be
malicious (and not send their messages).
Can we add more servers?

24
How many servers do we need?

Malicious servers can lie.
Good servers tell the truth.
We need to guarantee that a malicious server cannot generate two groups of Accept/Proposal messages that conflict (i.e., (u,s) and (u',s)) within the same view.
We need at least N = 3f + 1 servers to do this!!
We wait for 2f + 1 messages that say the same thing!
The f bad servers can say Accept(u,s) and Accept(u',s).
The good servers say only one thing, but a bad leader can lie to them.
Let's try to generate the two sets of messages. Can we do it?
Liar tells f + 1 of the good servers (u,s), and f of the good servers (u',s).

(u,s): f (bad) + f + 1 (good), total 2f + 1
(u',s): f (bad) + f (good), total 2f
Only (u,s) reaches the 2f + 1 threshold, so the two conflicting sets cannot both exist.
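
The counting argument can be checked mechanically. This is a minimal sketch assuming the worst case described above: the f bad servers vote for both values, and each good server votes for exactly one. The function name `can_forge_two_quorums` is hypothetical.

```python
def can_forge_two_quorums(n: int, f: int, quorum: int) -> bool:
    """Can two conflicting values both collect `quorum` votes in one view?

    Worst case: the f bad servers vote for both values, and the n - f good
    servers are split between the values, one vote each.
    """
    good = n - f
    needed_from_good = max(0, quorum - f)  # good votes each value still needs
    return 2 * needed_from_good <= good


for f in range(1, 4):
    # N = 3f + 1 servers, waiting for 2f + 1 matching messages.
    print(f, can_forge_two_quorums(3 * f + 1, f, 2 * f + 1))
    # One server fewer: N = 3f, where a correct server can wait for only N - f = 2f.
    print(f, can_forge_two_quorums(3 * f, f, 2 * f))
```

With N = 3f + 1 each value needs f + 1 good votes, so two values need 2f + 2, but only 2f + 1 good servers exist. With N = 3f the good servers can be split evenly, and both conflicting sets reach the threshold.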
25
Let's use N = 3f + 1!
[Figure: a Bad Leader and servers 2, 3, and 4; the leader sends (u,s) to some servers and (u',s) to the others.]

f = 1, N = 4
Bad Leader sends Proposal(u,s) to Servers 3 and 4.
Bad Leader sends Proposal(u',s) to Server 2.
Can the Bad Leader violate safety?

26
Is the Protocol Live?

f = 2, N = 3*2 + 1 = 7
Bad Leader is Server 7, and Server 4 is bad, too!
Bad Leader sends Proposal(v,u,s) to Servers 1, 2, and 3.
Bad Leader sends Proposal(v,u',s) to Servers 4, 5, and 6.
There is a partition; Servers 2, 3, 4, 5, and 6 are together.
They can't determine which update server 1 ordered.

Server 1: (u,s)
Server 2: (u,s)
Server 3: (u,s)
Server 7 (bad): (u or u',s)
Server 4 (bad): (u or u',s)
Server 5: (u',s)
Server 6: (u',s)
27
How Can We Guarantee Liveness?

We can add another round to our fault-tolerant protocol. The Normal Case Protocol becomes:
The Leader broadcasts a Pre-Prepare(v,u,s).
If not Leader: upon receiving a Pre-Prepare(v,u,s) that doesn't conflict with what I know about, broadcast a Prepare(v,u,s).
Upon receiving 2f Prepare(v,u,s) and 1 Pre-Prepare(v,u,s), broadcast Commit(v,u,s).
Upon receiving 2f + 1 Commit messages, order the message.
Rounds 1 and 2 allow the correct servers to preserve safety within the same view.
Round 3 preserves safety across view changes.
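
The thresholds of the three rounds can be sketched as a small bookkeeping class for a single (view, sequence number) slot at one server. This is an illustrative sketch, not the BFT implementation: the class name `BftSlot` and its methods are assumptions, and signatures, view changes, and networking are left out.

```python
from collections import defaultdict


class BftSlot:
    """One server's bookkeeping for a single (view, seq) slot."""

    def __init__(self, f: int):
        self.f = f
        self.pre_prepare = None           # update named by the leader, if any
        self.prepares = defaultdict(set)  # update -> ids of Prepare senders
        self.commits = defaultdict(set)   # update -> ids of Commit senders

    def record_pre_prepare(self, update) -> bool:
        """Return True if the Pre-Prepare does not conflict with what we know."""
        if self.pre_prepare is None:
            self.pre_prepare = update
        return self.pre_prepare == update

    def record_prepare(self, sender, update) -> None:
        self.prepares[update].add(sender)

    def record_commit(self, sender, update) -> None:
        self.commits[update].add(sender)

    def may_send_commit(self) -> bool:
        """True once we hold 1 Pre-Prepare and 2f matching Prepares."""
        if self.pre_prepare is None:
            return False
        return len(self.prepares[self.pre_prepare]) >= 2 * self.f

    def ordered_update(self):
        """Return the update backed by 2f + 1 Commits, or None."""
        for update, senders in self.commits.items():
            if len(senders) >= 2 * self.f + 1:
                return update
        return None


slot = BftSlot(f=1)
slot.record_pre_prepare("u")
slot.record_prepare(2, "u")
slot.record_prepare(3, "u")
for sender in (1, 2, 3):
    slot.record_commit(sender, "u")
print(slot.may_send_commit(), slot.ordered_update())
```

Senders are tracked in sets so that a duplicated message from the same server is not counted twice toward a threshold.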

28
What About Changing Leaders?

If any server orders (v,u,s), then 2f + 1 servers must have collected a set of 2f Prepare(v,u,s) messages and 1 Pre-Prepare(v,u,s).
We call such a set a Prepare-Certificate(v,u,s).
If Prepare-Certificate(v,u,s) exists, then Prepare-Certificate(v,u',s) cannot exist.
How do we change Leaders (View Changes)?
The new leader collects information from 2f + 1 servers. The servers supply Prepare-Certificates.
If something was ordered, the new leader will
find out.
The new leader needs to send this information to
all of the correct servers, otherwise the correct
servers will not participate in the protocol.
A Prepare-Certificate can be viewed as a trusted
message (agreed upon by all of the servers). We
use it like we use a Proposal message in Paxos.

29
We have BFT
[Figure: message flow among client C and servers 0, 1, 2, and 3: request, pre-prepare, prepare, commit, reply.]

Byzantine Fault Tolerance [Castro and Liskov, 99]
Excellent LAN performance. Over 1000 updates/sec.
2/3 of total servers + 1 required to make progress.
Three rounds of message exchanges.
Expensive for large networks.

30
Steward: A Hierarchical Approach

Each site acts as a trusted logical unit that can crash or partition.
Within each site: Byzantine-tolerant agreement (similar to BFT).
Masks f malicious faults in each site.
Threshold signatures prove agreement to other sites.
Between sites: lightweight, fault-tolerant protocol (similar to Paxos).

31
Main Idea 1: Common Case Operation

A client sends an update to a server at its local site.
The update is forwarded to the leader site.
The representative of the leader site assigns order in agreement and issues a threshold signed proposal.
Each site issues a threshold signed accept.
Upon receiving a majority of accepts, servers in each site order the update.
The original server sends a response to the client.

32
Hierarchy Benefits

Reduces the number of messages sent on the wide
area network.
O(N^2) -> O(S^2), for N servers in total and S sites: helps both in throughput and latency.
Reduces the number of wide area crossings.
BFT-based protocols require 3 wide area crossings.
Paxos-based protocols require 2 wide area crossings.
Availability of the system is increased.
(2/3 of total servers + 1) -> (a majority of sites).
Read-only queries can be answered locally.
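
To get a feel for the gap between O(N^2) and O(S^2), the sketch below counts wide area messages for one all-to-all exchange. The model is an assumption made for illustration: equal-size sites, one message per sender and receiver pair, and a hypothetical deployment of 5 sites with 4 servers each. It is not a measurement of Steward or BFT.

```python
def flat_wide_area_messages(sites: int, per_site: int) -> int:
    """All-to-all among all servers: each server sends to every server
    located at a different site."""
    n = sites * per_site
    return n * (n - per_site)


def hierarchical_wide_area_messages(sites: int) -> int:
    """All-to-all among sites: each site sends one message to every other site."""
    return sites * (sites - 1)


sites, per_site = 5, 4  # hypothetical deployment
print(flat_wide_area_messages(sites, per_site))
print(hierarchical_wide_area_messages(sites))
```

Under these assumptions a flat exchange puts 320 messages on the wide area network, while the hierarchical exchange needs 20.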

33
Hierarchy Challenges

Each site has a representative that:
Coordinates the Byzantine protocol inside the site.
Forwards packets in and out of the site.
One of the sites acts as the leader in the wide area protocol.
The representative of the leader site assigns sequence numbers to updates.
View Changes:
How do we select and change representatives and the leader site in agreement?
How do we transition safely when we need to change them?

34
Main Idea 2: View Changes

Sites change their local representatives based on timeouts.
Leader site representative has a larger timeout. This allows contact with at least one correct representative at other sites.
After changing enough leader site
representatives, servers at all sites stop
participating in the protocol, and elect a
different leader site.

35
What About Pending Updates?

The new representative or leader site should not violate the order assigned by previous ones.
After changing the site representative: The new representative gathers information from 2f + 1 local servers.
After changing the leader site: The representative of the new leader site gathers information from a majority of sites.

36
Non-Byzantine Comparison
[Figure: CAIRN wide area topology with sites in Boston, Delaware, San Jose, Virginia, and Los Angeles (hosts MITPC, UDELPC, TISWPC, ISEPC, ISEPC3, ISIPC, ISIPC4). Wide area links: 4.9 ms, 9.81 Mbits/sec; 3.6 ms, 1.42 Mbits/sec; 1.4 ms, 1.47 Mbits/sec; 38.8 ms, 1.86 Mbits/sec. Local links: 100 Mb/s, < 1 ms.]

Based on a real experimental network (CAIRN).
Several years ago we benchmarked benign
replication on this network.
Modeled on our cluster, emulating bandwidth and
latency constraints, both for Steward and BFT.

37
CAIRN Emulation Performance

Steward is limited by bandwidth at 51 updates per second.
1.8 Mbps can barely accommodate 2 updates per second for BFT.
Earlier experimentation with benign fault 2-phase commit protocols achieved up to 76 updates per sec. [Amir et al., 02]

38
The Basic Idea

Reduce database replication to Global Consistent Persistent Order.
Use group communication ordering to establish the Global Consistent Persistent Order on the updates.
deterministic + serialized = consistent
Group Communication membership + quorum = primary partition.
Only replicas in the primary component can commit
updates.
Updates ordered in a primary component are marked
green and applied. Updates ordered in a
non-primary component are marked red and will be
delayed.
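
The green/red rule can be sketched as follows. The quorum rule used here, a strict majority of all replicas, is an assumption for illustration, as are the names `in_primary_component` and `DbReplica`; the sketch does not model what later happens to red updates.

```python
from typing import List


def in_primary_component(component_size: int, total_replicas: int) -> bool:
    """Assumed quorum rule: a strict majority of all replicas."""
    return component_size > total_replicas // 2


class DbReplica:
    def __init__(self, total_replicas: int):
        self.total_replicas = total_replicas
        self.green: List[str] = []  # ordered in a primary component, applied
        self.red: List[str] = []    # ordered in a non-primary component, delayed

    def on_ordered(self, update: str, component_size: int) -> str:
        """Mark an ordered update green or red based on the current component."""
        if in_primary_component(component_size, self.total_replicas):
            self.green.append(update)
            return "green"
        self.red.append(update)
        return "red"


replica = DbReplica(total_replicas=5)
print(replica.on_ordered("u1", component_size=3))
print(replica.on_ordered("u2", component_size=2))
```

With five replicas, an update ordered in a component of three is marked green, while one ordered in a component of two is marked red and held back.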

39
Throughput Comparison (WAN)
40
Resilient Replication Summary

We can build resilient replication protocols that
do not use membership!
Failures can be benign (crashes, network
partitions) or Byzantine (servers can lie)
We can add message exchanges (rounds) to
guarantee Safety and Liveness with different
fault models.
Performance can be improved by adding hierarchy!
