Title: Distributed Systems 600.437
1Distributed Systems 600.437 Replication II
Department of Computer Science The Johns Hopkins
University
2Resilient Replication
Lecture 8
- Further readings
- Paxos Made Simple, Leslie Lamport
- Practical Byzantine Fault Tolerance, Miguel Castro and Barbara Liskov, OSDI 99.
- Paper from DSN 2006 on our www.cnds.jhu.edu/publications web page (Scaling Byzantine Fault-Tolerant Replication to Wide Area Networks)
3Introduction
- Resilient State Machine Replication
- What does resilient mean?
- We want our replicated system to work even when faults occur.
- What kinds of faults can occur?
- Servers may crash!
- Servers might recover
- Network partitions can separate servers from each other.
- Byzantine faults. Servers may lie!
- We are going to use a class of protocols that is
able to handle network connectivity changes
better than those based on membership.
4State Machine Replication
- Servers start in the same state.
- Servers change their state only when they execute an update.
- State changes are deterministic. Two servers in the same state will move to identical states, if they execute the same update.
- If servers execute updates in the same order, they will progress through exactly the same states. State Machine Replication!
5State Machine Replication Example
- Our State: one variable
- Operations (cause state changes)
- Op 1) +n: Add n to our variable
- Op 2) ?v=n: If variable = v, then set it to n
- Start: All servers have variable = 0
- If we apply the above operations in the same
order, then the servers will remain consistent
[Figure: four servers, each starting at v = 0 and applying the updates +2, +1, ?3=9 in the same order, pass through the states v = 2, v = 3, v = 9.]
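The example on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of operations (`"add"`, `"cas"`) and the function names are assumptions, and the update sequence +2, +1, ?3=9 is read off the slide's figure.

```python
from typing import List, Tuple

# An operation is ("add", n) for +n, or ("cas", v, n) for ?v=n.
# These names are illustrative, not part of the lecture.
Op = Tuple


def apply(state: int, op: Op) -> int:
    """Deterministically apply one update to the single-variable state."""
    if op[0] == "add":            # +n: add n to the variable
        return state + op[1]
    if op[0] == "cas":            # ?v=n: if variable == v, set it to n
        _, v, n = op
        return n if state == v else state
    raise ValueError(f"unknown operation: {op!r}")


def run(ops: List[Op], start: int = 0) -> int:
    """Execute the updates in the given order, starting from `start`."""
    state = start
    for op in ops:
        state = apply(state, op)
    return state


ordered = [("add", 2), ("add", 1), ("cas", 3, 9)]
replicas = [run(ordered) for _ in range(4)]          # four servers, same order
reordered = run([("cas", 3, 9), ("add", 2), ("add", 1)])  # a different order
```

All four replicas end in the same state, while a server that executed the same updates in a different order ends somewhere else, which is why the order has to be agreed on.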
6State Machine Replication
[Figure: Clients generate updates (B,3), (D,1), (A,2), (C,2). ESTABLISH ORDER: every server holds the same ordered updates (C,1), (B,2), (A,1), (B,1). Apply updates.]
7Simple Replication
Can we use a Leader to establish an order?
[Figure: a server sends u to the Leader; the Leader sends (u,s) to the other servers.]
Is this resilient?
If leader fails, then the system is not live!
- Server sends update, u, to Leader
- Leader assigns a sequence number, s, to u, and sends the update to the non-leader servers.
- Servers order update u with sequence number s.
8How can we improve resiliency?
Elect another leader.
Use more messages.
Assign a sequence number to each leader. (Views)
Use the fact that two sets, each having at least
a majority of servers, must intersect!
First: we need to describe our system model and service properties.
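The majority-intersection fact used above can be checked by brute force for small N. The sketch below is illustrative only; the function names are assumptions. It checks minimum-size majorities, which is enough because any larger majority contains one of them.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest number of servers that is more than half of n."""
    return n // 2 + 1


def all_majorities_intersect(n: int) -> bool:
    """True if every pair of minimum-size majorities of n servers overlaps."""
    servers = range(1, n + 1)
    quorums = [set(q) for q in combinations(servers, majority(n))]
    return all(a & b for a in quorums for b in quorums)


checked = {n: all_majorities_intersect(n) for n in range(1, 8)}
```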
9Our System Model
- N servers
- Uniquely identified in {1...N}
- Asynchronous communication
- Message loss, duplication, and delay
- Network partitions
- No message corruption
- Benign faults
- Crash/recovery with stable storage
- No Byzantine behavior (not yet, anyway)
10What is Safety?
- Safety: If two servers execute the ith update, then these updates are the same.
- Another way to look at safety:
- If there exists an ordered update (u,s) at some server, then there cannot exist an ordered update (u',s) at any other server, where u' ≠ u.
- We will now focus on achieving safety -- making sure that we don't execute updates in different orders on different servers.
11Achieving Safety
[Figure: Leader 1 and Leader 2 each send an update with sequence number s to different servers.]
Is this safe?
A new leader can violate safety! Can we fix this?
- A new leader must not violate previously established ordering!
- The new leader must know about all updates that may have been ordered.
12Achieving Safety
[Figure: a server sends u to the Leader; the Leader sends (u,s) to all servers, and the servers exchange (u,s) with each other.]
What does this give us?
If a new leader gets information from any majority of servers, it can determine what may have been ordered!
- Leader sends Proposal(u,s) to all servers
- All servers send Accept(u,s) to all servers.
- Servers order (u,s) when they receive a majority
of Proposal/Accept(u,s) messages
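The three steps above can be sketched as follows. This is a minimal model under stated assumptions: there are no view numbers yet, the leader's Proposal counts as one of the messages toward the majority, and the class and function names (`Server`, `run_round`) are hypothetical.

```python
class Server:
    """One replica in the crash-tolerant normal case (no view numbers yet)."""

    def __init__(self, server_id: int, n: int):
        self.server_id = server_id
        self.n = n
        self.heard = {}    # (u, s) -> ids that sent Proposal or Accept for it
        self.ordered = {}  # s -> u

    def on_message(self, sender: int, u: str, s: int) -> None:
        """Record a Proposal(u,s) or Accept(u,s) received from `sender`."""
        senders = self.heard.setdefault((u, s), set())
        senders.add(sender)
        if len(senders) >= self.n // 2 + 1:   # majority of the N servers
            self.ordered.setdefault(s, u)


def run_round(n: int, leader: int, u: str, s: int, reachable: set) -> dict:
    """Deliver one Proposal and the resulting Accepts among `reachable`.

    Assumes the leader is in `reachable`. The leader's message is the
    Proposal; every other reachable server sends an Accept to everyone
    it can reach (including itself).
    """
    assert leader in reachable
    servers = {i: Server(i, n) for i in range(1, n + 1)}
    for sender in reachable:
        for receiver in reachable:
            servers[receiver].on_message(sender, u, s)
    return servers


connected = run_round(n=5, leader=1, u="u", s=1, reachable={1, 2, 3})
minority = run_round(n=5, leader=1, u="u", s=1, reachable={1, 2})
```

With three of five servers connected, each of them orders (u,1); with only two connected, nobody does.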
13Changing Leaders
- Changing Leaders is commonly called a View Change.
- Servers use timeouts to detect failures.
- If the current leader fails, the servers elect a new leader.
- The new leader cannot propose updates until it collects information from a majority of servers!
- Each server reports any Proposals that it knows about.
- If any server ordered a Proposal(u,s), then at least one server in any majority will report a Proposal for that sequence number!
- The new leader will never violate prior ordering!!
- Now we have a safe protocol!!
14Changing Leaders Example
Leader 2 can send a Proposal(u,s). We say it replays (u,s).
[Figure: Leader 1, Leader 2, and servers 3, 4, 5, with (u,s) messages between them.]
- If any server orders (u,s), then at least a majority of servers must have received Proposal(u,s).
- If a new server is elected leader, it will gather Proposals from a majority of servers.
- The new leader will learn about the ordered update!!
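A small sketch of why the replay works, assuming (as on this slide) that only one previous leader has proposed anything, so there is at most one Proposal per sequence number. The function name `recover_proposals` and the data layout are illustrative assumptions.

```python
from itertools import combinations


def recover_proposals(reports: dict) -> dict:
    """Merge the reports gathered by a new leader.

    `reports` maps server id -> {sequence number: update} for the Proposals
    that server knows about. Returns the Proposals the new leader must replay.
    """
    replay = {}
    for known in reports.values():
        for s, u in known.items():
            replay.setdefault(s, u)
    return replay


N = 5
holders = {1, 2, 3}   # a majority that received Proposal("u", 1)
state = {i: ({1: "u"} if i in holders else {}) for i in range(1, N + 1)}

# Whichever majority the new leader hears from, it learns about ("u", 1).
always_found = all(
    recover_proposals({i: state[i] for i in quorum}).get(1) == "u"
    for quorum in combinations(range(1, N + 1), N // 2 + 1)
)
```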
15Is Our Protocol Live?
- Liveness: If there is a set, C, consisting of a majority of connected servers, then if any server in set C has a new update, then this update will eventually be executed.
- Is there a problem with our protocol? It is safe, but is it live?
16Liveness Example
[Figure: Leader 1 and Leader 2 have each sent a Proposal for sequence number s to different servers; Leader 3 now collects them.]
- Leader 3 gets conflicting Proposal messages!
- Which one should it choose?
- What should we add??
17Adding View Numbers
[Figure: Leader 1 sent Proposals tagged with view 1, (1,u,s); Leader 2 sent Proposals tagged with view 2, (2,u,s); Leader 3 collects both.]
- We add view numbers to the Proposal(v,u,s)!
- Leader 3 gets conflicting Proposal messages!
- Which one should it choose?
- It chooses the one with the greatest view number!!
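The selection rule can be sketched as a small function. The tuple layout `(view, update, seq)` and the name `choose_replay` are assumptions made for illustration.

```python
def choose_replay(reported):
    """Pick, for each sequence number, the Proposal with the greatest view.

    `reported` is an iterable of (view, update, seq) tuples gathered from a
    majority of servers. Returns {seq: (view, update)}.
    """
    best = {}
    for view, update, seq in reported:
        if seq not in best or view > best[seq][0]:
            best[seq] = (view, update)
    return best


# Leader 3 hears a view-1 Proposal and a conflicting view-2 Proposal for s = 7.
chosen = choose_replay([(1, "u", 7), (2, "u'", 7), (1, "u", 7)])
```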
18Normal Case
- Assign-Sequence()
- A1. u = NextUpdate()
- A2. next_seq++
- A3. SEND Proposal(view, u, next_seq)
- Upon receiving Proposal(v,u,s)
- B1. if not leader and v == my_view
- B2. SEND Accept(v,u,s)
- Upon receiving Proposal(v,u,s) and (majority - 1) Accept(v,u,s)
- C1. ORDER (u,s)
We use view numbers to determine which Proposal
may have been ordered previously.
A server sends an Accept(v,u,s) message only for
a view that it is currently in, and never for a
lower view!
19Leader Election
- Elect Leader()
- Upon Timer T Expire
- A1. my_view++
- A2. SEND New-Leader(my_view)
- Upon receiving New-Leader(v)
- B1. if Timer T Stopped
- B2. if v > my_view, then my_view = v
- B3. SEND New-Leader(my_view)
- Upon receiving majority New-Leader(v), where v == my_view
- C1. timeout = 2 * Timer T timeout
- C2. Start Timer T
Let V_max be the highest view that any server
has. Then, at least a majority of servers are in
view V_max or V_max - 1.
Servers will stay in the maximum view for at
least one full timeout period.
A server that becomes disconnected/connected repeatedly cannot disrupt the other servers.
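One possible reading of the election rules, sketched in Python. Several details are assumptions rather than part of the slide: a server counts its own New-Leader message, it re-sends New-Leader only when it jumps to a higher view, and the timer is restarted only if it was stopped. The class name `ElectionState` is hypothetical.

```python
class ElectionState:
    """Per-server leader-election state; a sketch, not a full implementation."""

    def __init__(self, server_id: int, n: int, timeout: float = 1.0):
        self.server_id = server_id
        self.n = n
        self.my_view = 0
        self.timeout = timeout
        self.timer_running = True
        self.votes = {}    # view -> ids that sent New-Leader(view)
        self.outbox = []   # messages this server would broadcast

    def _send_new_leader(self) -> None:
        self.outbox.append(("New-Leader", self.my_view))
        self.on_new_leader(self.server_id, self.my_view)  # count own message

    def on_timer_expire(self) -> None:
        self.timer_running = False     # Timer T is now stopped
        self.my_view += 1              # A1
        self._send_new_leader()        # A2

    def on_new_leader(self, sender: int, v: int) -> None:
        self.votes.setdefault(v, set()).add(sender)
        if self.timer_running:
            return                     # B1: react only while Timer T is stopped
        if v > self.my_view:           # B2
            self.my_view = v
            self._send_new_leader()    # B3 (assumed: only after a view jump)
            return
        if v == self.my_view and len(self.votes[v]) >= self.n // 2 + 1:
            self.timeout *= 2          # C1: double the timeout
            self.timer_running = True  # C2: start Timer T


suspecting = ElectionState(server_id=1, n=3)
suspecting.on_timer_expire()
suspecting.on_new_leader(sender=2, v=1)   # a majority of 3 now agrees on view 1

content = ElectionState(server_id=2, n=3)
content.on_new_leader(sender=3, v=7)      # timer still running: ignored
```

The second server illustrates the last remark on the slide: a New-Leader message from a flapping server does not move a server whose timer has not expired.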
20We Have Paxos
[Figure: message flow among client C and servers 0, 1, 2: request, proposal, accept, reply.]
- The Part-Time Parliament [Lamport, 98]
- A very resilient protocol. Only a majority of participants are required to make progress.
- Works well on unstable networks.
- Only handles benign failures (not Byzantine).
21What Happens If Servers Lie?
- Servers must be able to verify who sent each message.
- Crypto! Digital Signatures or MACs
- The leader might be bad!
- What might happen?
- The leader can send Proposal(u,s) to 2 out of 5 servers and Proposal(u',s) to 2 out of 5 servers -- can we have a safety violation?
- Correct servers must make sure the malicious servers don't cause safety errors.
- The bad servers might send messages or they might not.
22Byzantine Leader Example
[Figure: the Bad Leader sends conflicting Proposals for sequence number s to servers 2, 3, 4, and 5.]
- Bad Leader sends Proposal(u,s) to servers 4 and 5.
- Bad Leader sends Proposal(u',s) to servers 2 and 3.
- Server 4 could order (u,s) and server 3 could order (u',s).
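The violation can be reproduced with a small model of the majority rule from the crash-tolerant protocol. This is a sketch under assumptions: the leader's Proposal and each correct server's Accept count as one message each, every correct server sees every Accept, and the function name `majority_rule_outcome` is hypothetical.

```python
from collections import Counter


def majority_rule_outcome(n: int, sent: dict) -> dict:
    """What each correct server orders under the majority rule.

    `sent` maps each correct (non-leader) server to the update the leader
    proposed to it. Each of those servers broadcasts one Accept for it.
    """
    majority = n // 2 + 1
    accepts = Counter(sent.values())
    outcome = {}
    for server, update in sent.items():
        # Count the leader's Proposal plus the matching Accepts,
        # including the server's own Accept.
        if 1 + accepts[update] >= majority:
            outcome[server] = update
    return outcome


bad = majority_rule_outcome(5, {2: "u'", 3: "u'", 4: "u", 5: "u"})
good = majority_rule_outcome(5, {2: "u", 3: "u", 4: "u", 5: "u"})
```

With a lying leader, servers 2 and 3 order u' while servers 4 and 5 order u for the same sequence number; with an honest leader everyone orders u.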
23How Do We Solve this Problem?
- Assume that there are at most f faulty servers, which can fail or become malicious. All of the other servers are correct.
- Let N denote the number of servers in our system.
- Any correct server can wait for at most N - f messages from servers, because f may fail or be malicious (and not send their messages).
- Can we add more servers?
24How many servers do we need?
- Malicious servers can lie.
- Good servers tell the truth.
- We need to guarantee that a malicious server cannot generate two groups of Accept/Proposal messages that conflict (i.e., (u,s) and (u',s)) within the same view.
- We need at least N = 3f+1 servers to do this!!
- We wait for 2f+1 messages that say the same thing!
- The f bad servers can say Accept(u,s) and Accept(u',s).
- The good servers say only one thing, but a bad leader can lie to them.
- Let's try to generate the two sets of messages -- Can we do it?
- Liar tells f+1 of the good servers (u,s), and f of the good servers (u',s).
(u,s): f (bad) + f+1 (good) = total 2f+1
(u',s): f (bad) + f (good) = total 2f
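The arithmetic behind N = 3f+1 can be checked numerically. The sketch below encodes two conditions as assumptions drawn from the slides: a server can wait for at most N - f messages, and any two sets of q matching messages must share at least one correct server (so at least f+1 servers in common). The function names are illustrative.

```python
def quorum_is_workable(n: int, f: int, q: int) -> bool:
    """Check a quorum size q for n servers with at most f bad ones."""
    live = q <= n - f            # q messages arrive even if f servers stay silent
    safe = 2 * q - n >= f + 1    # two quorums overlap in at least one correct server
    return live and safe


def min_servers(f: int) -> int:
    """Smallest n for which some quorum size satisfies both conditions."""
    n = 1
    while not any(quorum_is_workable(n, f, q) for q in range(1, n + 1)):
        n += 1
    return n


smallest = {f: min_servers(f) for f in range(5)}
```

For every f tried, the smallest workable system has 3f+1 servers, and with N = 3f+1 the quorum size 2f+1 satisfies both conditions. The earlier five-server example with a majority of 3 and one bad server fails the overlap condition.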
25Let's use N = 3f+1!
[Figure: the Bad Leader sends conflicting Proposals for sequence number s to servers 2, 3, and 4.]
- f = 1, N = 4
- Bad Leader sends Proposal(u,s) to Server 3 and 4.
- Bad Leader sends Proposal(u',s) to Server 2.
- Can the Bad Leader violate safety?
26Is the Protocol Live?
- f = 2, N = 3*2+1 = 7
- Bad Leader is Server 7, and Server 4 is bad, too!
- Bad Leader sends Proposal(v,u,s) to Servers 1, 2, and 3
- Bad Leader sends Proposal(v,u',s) to Servers 4, 5, and 6
- There is a partition, Servers 2,3,4,5,6 are together.
- They can't determine which update server 1 ordered.
[Figure: what each server holds for sequence number s. Servers 1, 2, 3: (u,s). Servers 5, 6: (u',s). The bad servers 7 and 4 are shown with no fixed update.]
27How Can We Guarantee Liveness?
- We can add another round to our fault tolerant protocol. The Normal Case Protocol becomes:
- The Leader broadcasts a Pre-Prepare(v,u,s)
- If not Leader, upon receiving a Pre-Prepare(v,u,s) that doesn't conflict with what I know about, broadcast a Prepare(v,u,s)
- Upon receiving 2f Prepare(v,u,s) and 1 Pre-Prepare(v,u,s), broadcast Commit(v,u,s)
- Upon receiving 2f+1 Commit messages, order the message
- Rounds 1 and 2 allow the correct servers to preserve safety within the same view.
- Round 3 preserves safety across view changes.
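The three rounds can be sketched as a per-slot state machine at a non-leader replica. This is a simplified illustration, not the BFT implementation: message authentication, view changes, and multiple sequence numbers are left out, the caller is assumed to feed in the replica's own Prepare and Commit along with everyone else's, and the class name `Replica` is hypothetical.

```python
class Replica:
    """Tracks one (v, u, s) slot at a non-leader replica; f is the fault bound."""

    def __init__(self, f: int):
        self.f = f
        self.accepted = None      # the Pre-Prepare(v, u, s) this replica took
        self.prepares = set()     # senders of matching Prepare messages
        self.commits = set()      # senders of matching Commit messages
        self.sent_commit = False
        self.ordered = None

    def on_pre_prepare(self, v, u, s):
        """Round 1: take the first Pre-Prepare; ignore any later one."""
        if self.accepted is None:
            self.accepted = (v, u, s)
            return ("Prepare", v, u, s)       # message to broadcast
        return None

    def on_prepare(self, sender, v, u, s):
        """Round 2: 2f matching Prepares plus the Pre-Prepare trigger a Commit."""
        if (v, u, s) != self.accepted:
            return None
        self.prepares.add(sender)
        if len(self.prepares) >= 2 * self.f and not self.sent_commit:
            self.sent_commit = True
            return ("Commit", v, u, s)        # message to broadcast
        return None

    def on_commit(self, sender, v, u, s):
        """Round 3: 2f+1 matching Commits order the update."""
        if (v, u, s) != self.accepted:
            return
        self.commits.add(sender)
        if len(self.commits) >= 2 * self.f + 1:
            self.ordered = (u, s)


r = Replica(f=1)
first = r.on_pre_prepare(1, "u", 1)
conflicting = r.on_pre_prepare(1, "u'", 1)
after_one_prepare = r.on_prepare(2, 1, "u", 1)
after_two_prepares = r.on_prepare(3, 1, "u", 1)
for sender in (1, 2):
    r.on_commit(sender, 1, "u", 1)
ordered_after_two_commits = r.ordered
r.on_commit(3, 1, "u", 1)
```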
28What About Changing Leaders?
- If any server orders (v,u,s), then 2f+1 servers must have collected a set of 2f Prepare(v,u,s) messages and 1 Pre-Prepare(v,u,s)
- We call such a set a Prepare-Certificate(v,u,s).
- If Prepare-Certificate(v,u,s) exists, then Prepare-Certificate(v,u',s) cannot exist.
- How do we change Leaders (View Changes)?
- The new leader collects information from 2f+1 servers. The servers supply Prepare-Certificates. If something was ordered, the new leader will find out.
- The new leader needs to send this information to all of the correct servers, otherwise the correct servers will not participate in the protocol.
- A Prepare-Certificate can be viewed as a trusted message (agreed upon by all of the servers). We use it like we use a Proposal message in Paxos.
29We have BFT
[Figure: message flow among client C and servers 0, 1, 2, 3: request, pre-prepare, prepare, commit, reply.]
- Byzantine Fault Tolerance [Castro and Liskov, 99]
- Excellent LAN performance. Over 1000 updates/sec.
- (2/3 of total servers) + 1 required to make progress
- Three rounds of message exchanges
- Expensive for large networks.
30Steward: A Hierarchical Approach
- Each site acts as a trusted logical unit that can crash or partition.
- Within each site: Byzantine-tolerant agreement (similar to BFT).
- Masks f malicious faults in each site.
- Threshold signatures prove agreement to other sites.
- Between sites: light-weight, fault-tolerant protocol (similar to Paxos).
31Main Idea 1: Common Case Operation
- A client sends an update to a server at its local site.
- The update is forwarded to the leader site.
- The representative of the leader site assigns order in agreement and issues a threshold signed proposal.
- Each site issues a threshold signed accept.
- Upon receiving a majority of accepts, servers in each site order the update.
- The original server sends a response to the client.
32Hierarchy Benefits
- Reduces the number of messages sent on the wide area network.
- O(N^2) → O(S^2): helps both in throughput and latency.
- Reduces the number of wide area crossings.
- BFT-based protocols require 3 wide area crossings.
- Paxos-based protocols require 2 wide area crossings.
- Availability of the system is increased.
- (2/3 of total servers + 1) → (a majority of sites).
- Read-only queries can be answered locally.
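A rough count makes the message reduction concrete. The sketch reads N as the total number of servers and S as the number of sites, which is an interpretation of the slide, and it models a single all-to-all round in which each site in the hierarchy speaks through one representative. The site and server counts in the example are hypothetical.

```python
def flat_wide_area_messages(sites: int, per_site: int) -> int:
    """All-to-all among every server, counting only messages that leave a site."""
    n = sites * per_site
    total = n * (n - 1)                            # every ordered pair of servers
    local = sites * per_site * (per_site - 1)      # pairs inside the same site
    return total - local


def hierarchical_wide_area_messages(sites: int) -> int:
    """All-to-all among sites, one message per ordered pair of sites."""
    return sites * (sites - 1)


flat = flat_wide_area_messages(sites=4, per_site=4)    # 16 servers in 4 sites
hier = hierarchical_wide_area_messages(sites=4)
```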
33Hierarchy Challenges
- Each site has a representative that
- Coordinates the Byzantine protocol inside the site.
- Forwards packets in and out of the site.
- One of the sites acts as the leader in the wide area protocol
- The representative of the leader site assigns sequence numbers to updates.
- View Changes
- How do we select and change representatives and the leader site in agreement?
- How do we transition safely when we need to change them?
34Main Idea 2: View Changes
- Sites change their local representatives based on timeouts.
- Leader site representative has a larger timeout.
- This allows contact with at least one correct representative at other sites.
- After changing enough leader site representatives, servers at all sites stop participating in the protocol, and elect a different leader site.
35What About Pending Updates?
- The new representative or leader site should not violate the order assigned by previous ones.
- After changing the site representative: The new representative gathers information from 2f+1 local servers.
- After changing the leader site: The representative of the new leader site gathers information from a majority of sites.
36Non-Byzantine Comparison
[Figure: CAIRN wide-area topology. Sites: Boston, Delaware, San Jose, Virginia, Los Angeles. Hosts: MITPC, UDELPC, TISWPC, ISEPC, ISEPC3, ISIPC, ISIPC4. Wide-area link labels: 4.9 ms, 9.81 Mbits/sec; 3.6 ms, 1.42 Mbits/sec; 1.4 ms, 1.47 Mbits/sec; 38.8 ms, 1.86 Mbits/sec. Local link labels: 100 Mb/s, < 1 ms.]
- Based on a real experimental network (CAIRN).
- Several years ago we benchmarked benign replication on this network.
- Modeled on our cluster, emulating bandwidth and latency constraints, both for Steward and BFT.
37CAIRN Emulation Performance
- Steward is limited by bandwidth at 51 updates per second.
- 1.8 Mbps can barely accommodate 2 updates per second for BFT.
- Earlier experimentation with benign fault 2-phase commit protocols achieved up to 76 updates per sec. [Amir et al., 02]
38The Basic Idea
- Reduce database replication to Global Consistent Persistent Order
- Use group communication ordering to establish the Global Consistent Persistent Order on the updates.
- deterministic + serialized = consistent
- Group Communication membership + quorum = primary partition.
- Only replicas in the primary component can commit updates.
- Updates ordered in a primary component are marked green and applied. Updates ordered in a non-primary component are marked red and will be delayed.
39Throughput Comparison (WAN)
40Resilient Replication Summary
- We can build resilient replication protocols that do not use membership!
- Failures can be benign (crashes, network partitions) or Byzantine (servers can lie)
- We can add message exchanges (rounds) to guarantee Safety and Liveness with different fault models.
- Performance can be improved by adding hierarchy!