Title: Distributed Systems 600.437

1
Distributed Systems 600.437: Replication II
Department of Computer Science, The Johns Hopkins University
2
Resilient Replication
Lecture 8
  • Further readings:
  • Paxos Made Simple, Leslie Lamport.
  • Practical Byzantine Fault Tolerance, Miguel
    Castro and Barbara Liskov, OSDI '99.
  • Paper from DSN 2006 on our www.cnds.jhu.edu/publications
    web page (Scaling Byzantine
    Fault-Tolerant Replication to Wide Area Networks).

3
Introduction
  • Resilient State Machine Replication
  • What does resilient mean?
  • We want our replicated system to work even when
    faults occur.
  • What kinds of faults can occur?
  • Servers may crash!
  • Servers might recover
  • Network partitions can separate servers from each
    other.
  • Byzantine faults. Servers may lie!
  • We are going to use a class of protocols that is
    able to handle network connectivity changes
    better than those based on membership.

4
State Machine Replication
  • Servers start in the same state.
  • Servers change their state only when they execute
    an update.
  • State changes are deterministic. Two servers in
    the same state will move to identical states, if
    they execute the same update.
  • If servers execute updates in the same order,
    they will progress through exactly the same
    states. State Machine Replication!

5
State Machine Replication Example
  • Our State: one variable
  • Operations (cause state changes):
  • Op 1) +n: Add n to our variable
  • Op 2) ?v,n: If the variable equals v, then set it to n
  • Start: All servers have variable = 0
  • If we apply the above operations in the same
    order, then the servers will remain consistent

[Slide diagram: four servers each start at v=0 and apply the same updates in the same order (+2, then +1, then ?3,9), moving through v=2 and v=3 to v=9; all four remain consistent.]
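The slide's two operations are easy to render as code. The sketch below (names are mine, not from the slides) applies the same update sequence to four replicas and shows they end in the same state:

```python
# Minimal sketch of the slide's single-variable state machine.
# Updates are deterministic, so replicas that apply the same
# sequence in the same order reach the same state.

def apply(state, op):
    """Apply one update to the single-variable state."""
    kind, *args = op
    if kind == "add":        # Op 1) +n: add n to the variable
        return state + args[0]
    if kind == "cas":        # Op 2) ?v,n: if variable == v, set it to n
        v, n = args
        return n if state == v else state
    raise ValueError(kind)

# Four replicas start at 0 and apply the same updates in order:
# +2 -> 2, +1 -> 3, ?3,9 -> 9.
updates = [("add", 2), ("add", 1), ("cas", 3, 9)]
replicas = [0, 0, 0, 0]
for op in updates:
    replicas = [apply(s, op) for s in replicas]
```

After the loop, every replica holds 9, matching the slide's diagram.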
6
State Machine Replication
[Slide diagram: clients generate updates (B,3), (D,1), (A,2), (C,2); the servers ESTABLISH ORDER; every server then applies the same ordered sequence of updates (B,1), (A,1), (B,2), (C,1).]
7
Simple Replication
Can we use a Leader to establish an order?
[Slide diagram: a server sends update u to the Leader, which broadcasts (u,s) to all servers.]
Is this resilient? If the leader fails, then the system is not live!
  • Server sends update, u, to Leader
  • Leader assigns a sequence number, s, to u, and
    sends the update to the non-leader servers.
  • Servers order update u with sequence number s.
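The three steps above can be sketched directly (class and method names are assumptions for illustration, not part of the lecture):

```python
# Sketch of simple leader-based ordering: the leader stamps each
# update with the next sequence number and broadcasts it; the other
# servers order updates by that number.

class Leader:
    def __init__(self):
        self.next_seq = 0

    def sequence(self, u):
        """Assign a sequence number s to update u."""
        self.next_seq += 1
        return (u, self.next_seq)

class Server:
    def __init__(self):
        self.ordered = {}        # s -> u

    def receive(self, u, s):
        self.ordered[s] = u

leader = Leader()
servers = [Server(), Server()]
for u in ["A,1", "B,1"]:
    u_s = leader.sequence(u)
    for srv in servers:          # leader broadcasts (u,s)
        srv.receive(*u_s)
```

Every server ends with the same order, but note the slide's warning: if the single leader fails, nothing new gets ordered.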

8
How can we improve resiliency?
Elect another leader.
Use more messages.
Assign a sequence number to each leader. (Views)
Use the fact that two sets, each having at least
a majority of servers, must intersect!
First, we need to describe our system model and service properties.
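The majority-intersection fact is pure counting: two sets of at least ⌊N/2⌋+1 servers together exceed N, so they must share a member. A quick exhaustive check:

```python
from itertools import combinations

# Any two majorities of N servers must intersect: each has at least
# floor(N/2)+1 members, so their combined size exceeds N.
def majorities_intersect(n):
    servers = range(n)
    m = n // 2 + 1                       # minimum majority size
    return all(set(a) & set(b)
               for a in combinations(servers, m)
               for b in combinations(servers, m))
```

This intersection property is exactly what lets a new leader learn about anything a prior leader may have ordered.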
9
Our System Model
  • N servers
  • Uniquely identified in 1..N
  • Asynchronous communication
  • Message loss, duplication, and delay
  • Network partitions
  • No message corruption
  • Benign faults
  • Crash/recovery with stable storage
  • No Byzantine behavior - Not yet, anyway :-)

10
What is Safety?
  • Safety: If two servers execute the i-th update,
    then these updates are the same.
  • Another way to look at safety:
  • If there exists an ordered update (u,s) at some
    server, then there cannot exist an ordered update
    (u',s) at any other server, where u' ≠ u.
  • We will now focus on achieving safety -- making
    sure that we don't execute updates in different
    orders on different servers.

11
Achieving Safety
[Slide diagram: Leader 1 sends (u,s) to some servers while a new Leader 2 sends a conflicting (u',s) to others.]
Is this safe?
A new leader can violate safety! Can we fix this?
  • A new leader must not violate previously
    established ordering!
  • The new leader must know about all updates that
    may have been ordered.

12
Achieving Safety
[Slide diagram: the Leader sends Proposal(u,s) to all servers; every server then sends Accept(u,s) to all servers.]
What does this give us? If a new leader gets information from any majority of servers, it can determine what may have been ordered!
  • Leader sends Proposal(u,s) to all servers
  • All servers send Accept(u,s) to all servers.
  • Servers order (u,s) when they receive a majority
    of Proposal/Accept(u,s) messages
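The ordering rule can be sketched as a counting check (the message shapes below are assumptions for illustration): a server orders (u,s) once the Proposal plus matching Accepts reach a majority of N.

```python
# Sketch of the ordering rule: a server orders (u,s) once the
# Proposal plus the matching Accept messages total a majority of N.

def try_order(n, proposal, accepts):
    """proposal: the (u,s) pair from the leader, or None.
    accepts: list of (u,s) pairs received in Accept messages."""
    if proposal is None:
        return None
    matching = sum(1 for a in accepts if a == proposal)
    majority = n // 2 + 1
    # the Proposal itself counts as the leader's vote
    return proposal if 1 + matching >= majority else None
```

With N = 5, the Proposal plus two matching Accepts reaches the majority of three; one Accept is not enough.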

13
Changing Leaders
  • Changing Leaders is commonly called a View
    Change.
  • Servers use timeouts to detect failures.
  • If the current leader fails, the servers elect a
    new leader.
  • The new leader cannot propose updates until it
    collects information from a majority of servers!
  • Each server reports any Proposals that it knows
    about.
  • If any server ordered a Proposal(u,s), then at
    least one server in any majority will report a
    Proposal for that sequence number!
  • The new leader will never violate prior
    ordering!!
  • Now we have a safe protocol!!

14
Changing Leaders Example
Leader 2 can send a Proposal(u,s). We say it replays (u,s).
[Slide diagram: Leader 1 sent Proposal(u,s) to servers 3, 4, and 5; new Leader 2 gathers their reports.]
  • If any server orders (u,s), then at least a
    majority of servers must have received
    Proposal(u,s).
  • If a new server is elected leader, it will gather
    Proposals from a majority of servers.
  • The new leader will learn about the ordered
    update!!

15
Is Our Protocol Live?
  • Liveness: If there is a set, C, consisting of a
    majority of connected servers, and any server
    in C has a new update, then this update will
    eventually be executed.
  • Is there a problem with our protocol? It is safe,
    but is it live?

16
Liveness Example
[Slide diagram: Leader 1 proposed (u,s) and Leader 2 proposed (u',s); new Leader 3 gathers reports from servers 3, 4, and 5 and receives both.]
  • Leader 3 gets conflicting Proposal messages!
  • Which one should it choose?
  • What should we add??

17
Adding View Numbers
[Slide diagram: new Leader 3 receives Proposal(1,u,s) from some servers and Proposal(2,u',s) from others.]
  • We add view numbers to the Proposal(v,u,s)!
  • Leader 3 gets conflicting Proposal messages!
  • Which one should it choose?
  • It chooses the one with the greatest view number!!
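The tie-breaking rule on this slide is a one-liner (function name assumed):

```python
# Sketch: given conflicting replayed Proposals (view, update, seq)
# for one sequence number, the new leader keeps the one with the
# greatest view number.

def choose_proposal(reports):
    """reports: list of (view, update, seq) tuples, possibly empty."""
    if not reports:
        return None
    return max(reports, key=lambda p: p[0])
```

A Proposal from view 2 wins over one from view 1, because only the most recent leader's Proposal can have been ordered.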

18
Normal Case
  • Assign-Sequence():
  • A1. u ← NextUpdate()
  • A2. next_seq++
  • A3. SEND Proposal(view, u, next_seq)
  • Upon receiving Proposal(v,u,s):
  • B1. if not leader and v == my_view
  • B2. SEND Accept(v,u,s)
  • Upon receiving Proposal(v,u,s) and
    majority - 1 Accept(v,u,s):
  • C1. ORDER (u,s)

We use view numbers to determine which Proposal
may have been ordered previously.
A server sends an Accept(v,u,s) message only for
a view that it is currently in, and never for a
lower view!
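The Accept rule in step B can be rendered directly (the server-state layout here is an assumption for illustration):

```python
# Sketch of rule B from the pseudocode: a non-leader server sends
# Accept(v,u,s) only for the view it is currently in, never a lower one.

def on_proposal(server, v, u, s):
    """Return the Accept message to send, or None."""
    if not server["is_leader"] and v == server["my_view"]:
        return ("Accept", v, u, s)
    return None
```

A Proposal from an old view is silently ignored, which is what keeps stale leaders from interfering.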
19
Leader Election
  • Elect-Leader():
  • Upon Timer T expiry:
  • A1. my_view++
  • A2. SEND New-Leader(my_view)
  • Upon receiving New-Leader(v):
  • B1. if Timer T stopped
  • B2. if v > my_view, then my_view ← v
  • B3. SEND New-Leader(my_view)
  • Upon receiving majority New-Leader(v)
    where v == my_view:
  • C1. timeout ← 2 × timeout
  • C2. Start Timer T

Let V_max be the highest view that any server
has. Then, at least a majority of servers are in
view V_max or V_max - 1.
Servers will stay in the maximum view for at
least one full timeout period.
A server that becomes disconnected/connected repeatedly cannot disrupt the other servers.
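The view and timeout bookkeeping above can be sketched as two handlers (the state layout is an assumption, not the lecture's code):

```python
# Sketch of Elect-Leader: on timer expiry a server advances its view
# and announces it; on a majority of matching New-Leader messages it
# doubles its timeout and restarts the timer, so servers linger in
# the maximum view for at least one full timeout period.

def on_timer_expire(state):
    state["my_view"] += 1
    return ("New-Leader", state["my_view"])

def on_majority_new_leader(state, v):
    if v == state["my_view"]:
        state["timeout"] *= 2
        state["timer_running"] = True

state = {"my_view": 0, "timeout": 1, "timer_running": False}
on_timer_expire(state)
on_majority_new_leader(state, state["my_view"])
```

The doubling is the key to tolerating a flapping server: each failed election round buys the stable servers more uninterrupted time.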
20
We Have Paxos
[Slide diagram: client C sends a request to server 0; proposal and accept rounds run among servers 0, 1, and 2; a reply goes to the client.]
  • The Part-Time Parliament, Lamport '98.
  • A very resilient protocol. Only a majority of
    participants are required to make progress.
  • Works well on unstable networks.
  • Only handles benign failures (not Byzantine).

21
What Happens If Servers Lie?
  • Servers must be able to verify who sent each
    message.
  • Crypto! Digital signatures or MACs.
  • The leader might be bad!
  • What might happen?
  • The leader can send Proposal(u,s) to 2 out of 5
    servers and Proposal(u',s) to 2 other servers
    -- can we have a safety violation?
  • Correct servers must make sure the malicious
    servers don't cause safety errors.
  • The bad servers might send messages or they might
    not.

22
Byzantine Leader Example
[Slide diagram: a Bad Leader sends Proposal(u,s) to servers 4 and 5, and a conflicting Proposal(u',s) to servers 2 and 3.]
  • Bad Leader sends Proposal(u,s) to servers 4 and
    5.
  • Bad Leader sends Proposal(u',s) to servers 2 and
    3.
  • Server 4 could order (u,s) and server 3 could
    order (u',s).

23
How Do We Solve this Problem?
  • Assume that at most f servers are faulty: they
    can crash or become malicious. All of the other
    servers are correct.
  • Let N denote the number of servers in our system.
  • Any correct server can wait for at most N - f
    messages from servers, because f may fail or be
    malicious (and not send their messages).
  • Can we add more servers?

24
How many servers do we need?
  • Malicious servers can lie.
  • Good servers tell the truth.
  • We need to guarantee that a malicious server
    cannot generate two groups of Accept/Proposal
    messages that conflict (i.e., (u,s) and (u',s))
    within the same view.
  • We need at least N = 3f+1 servers to do this!!
  • We wait for 2f+1 messages that say the same
    thing!
  • The f bad servers can say both Accept(u,s) and
    Accept(u',s).
  • The good servers say only one thing, but a bad
    leader can lie to them.
  • Let's try to generate the two sets of messages --
    can we do it?
  • Liar tells f+1 of the good servers (u,s), and f
    of the good servers (u',s).

(u,s): f (bad) + f+1 (good) = 2f+1 total
(u',s): f (bad) + f (good) = 2f total
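The counting argument can be checked directly: with N = 3f+1 there are 2f+1 good servers, and however the liar splits them, only one side of the conflict can reach the 2f+1 threshold.

```python
# Check of the slide's counting for N = 3f+1. The liar splits the
# 2f+1 good servers f+1 / f between the conflicting updates and adds
# all f bad votes to each side; the short side tops out at 2f,
# one vote below the 2f+1 threshold.
def conflicting_vote_counts(f):
    good = 2 * f + 1                  # correct servers when N = 3f+1
    side_a = f + (f + 1)              # bad votes + f+1 good servers
    side_b = f + (good - (f + 1))     # bad votes + remaining f good
    return side_a, side_b

a, b = conflicting_vote_counts(1)     # best case for the liar at f = 1
```

For every f, side_a reaches exactly 2f+1 while side_b is stuck at 2f, so two conflicting updates can never both be ordered in one view.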
25
Let's use N = 3f+1!
[Slide diagram: f = 1, N = 4; a Bad Leader sends Proposal(u,s) to servers 3 and 4, and Proposal(u',s) to server 2.]
  • f = 1, N = 4
  • Bad Leader sends Proposal(u,s) to servers 3 and 4.
  • Bad Leader sends Proposal(u',s) to server 2.
  • Can the Bad Leader violate safety?

26
Is the Protocol Live?
  • f = 2, N = 3×2+1 = 7
  • Bad Leader is server 7, and server 4 is bad, too!
  • Bad Leader sends Proposal(v,u,s) to servers 1, 2,
    and 3.
  • Bad Leader sends Proposal(v,u',s) to servers 4,
    5, and 6.
  • There is a partition: servers 2, 3, 4, 5, 6 are
    together.
  • They can't determine which update server 1
    ordered.

[Slide diagram: servers 1, 2, and 3 hold (u,s); servers 5 and 6 hold (u',s); the bad servers 4 and 7 can claim either.]
27
How Can We Guarantee Liveness?
  • We can add another round to our fault-tolerant
    protocol. The Normal Case Protocol becomes:
  • The Leader broadcasts a Pre-Prepare(v,u,s).
  • If not Leader: upon receiving a
    Pre-Prepare(v,u,s) that doesn't conflict with
    what I know about, broadcast a Prepare(v,u,s).
  • Upon receiving 2f Prepare(v,u,s) and 1
    Pre-Prepare(v,u,s), broadcast Commit(v,u,s).
  • Upon receiving 2f+1 Commit messages, order the
    message.
  • Rounds 1 and 2 allow the correct servers to
    preserve safety within the same view.
  • Round 3 preserves safety across view changes.

28
What About Changing Leaders?
  • If any server orders (v,u,s), then 2f+1 servers
    must have collected a set of 2f Prepare(v,u,s)
    messages and 1 Pre-Prepare(v,u,s).
  • We call such a set a Prepare-Certificate(v,u,s).
  • If Prepare-Certificate(v,u,s) exists, then
    Prepare-Certificate(v,u',s) cannot exist.
  • How do we change Leaders (View Changes)?
  • The new leader collects information from 2f+1
    servers. The servers supply Prepare-Certificates.
    If something was ordered, the new leader will
    find out.
  • The new leader needs to send this information to
    all of the correct servers, otherwise the correct
    servers will not participate in the protocol.
  • A Prepare-Certificate can be viewed as a trusted
    message (agreed upon by all of the servers). We
    use it like we use a Proposal message in Paxos.

29
We have BFT
[Slide diagram: client C sends a request to server 0; pre-prepare, prepare, and commit rounds run among servers 0-3; a reply goes to the client.]
  • Byzantine Fault Tolerance, Castro and Liskov '99.
  • Excellent LAN performance: over 1000 updates/sec.
  • 2/3 of the total servers + 1 are required to make
    progress.
  • Three rounds of message exchanges.
  • Expensive for large networks.

30
Steward: A Hierarchical Approach
  • Each site acts as a trusted logical unit that can
    crash or partition.
  • Within each site: Byzantine-tolerant agreement
    (similar to BFT).
  • Masks f malicious faults in each site.
  • Threshold signatures prove agreement to other
    sites.
  • Between sites: a lightweight, fault-tolerant
    protocol (similar to Paxos).

31
Main Idea 1: Common Case Operation
  • A client sends an update to a server at its local
    site.
  • The update is forwarded to the leader site.
  • The representative of the leader site assigns an
    order, in agreement with its site, and issues a
    threshold-signed proposal.
  • Each site issues a threshold-signed accept.
  • Upon receiving a majority of accepts, servers in
    each site order the update.
  • The original server sends a response to the
    client.

32
Hierarchy Benefits
  • Reduces the number of messages sent on the wide
    area network.
  • O(N²) → O(S²), which helps both throughput and
    latency.
  • Reduces the number of wide area crossings.
  • BFT-based protocols require 3 wide area
    crossings.
  • Paxos-based protocols require 2 wide area
    crossings.
  • Availability of the system is increased:
  • (2/3 of total servers + 1) → (a majority of
    sites).
  • Read-only queries can be answered locally.

33
Hierarchy Challenges
  • Each site has a representative that
  • Coordinates the Byzantine protocol inside the
    site.
  • Forwards packets in and out of the site.
  • One of the sites acts as the leader in the wide
    area protocol
  • The representative of the leader site assigns
    sequence numbers to updates.
  • View Changes
  • How do we select and change representatives and
    the leader site, in agreement?
  • How do we transition safely when we need to
    change them?

34
Main Idea 2: View Changes
  • Sites change their local representatives based on
    timeouts.
  • The leader site representative has a larger
    timeout, which allows contact with at least one
    correct representative at other sites.
  • After changing enough leader site
    representatives, servers at all sites stop
    participating in the protocol, and elect a
    different leader site.

35
What About Pending Updates?
  • The new representative or leader site should not
    violate the order assigned by previous ones.
  • After changing the site representative: The new
    representative gathers information from 2f+1
    local servers.
  • After changing the leader site: The
    representative of the new leader site gathers
    information from a majority of sites.

36
Non-Byzantine Comparison
[Slide diagram: the CAIRN network topology spanning Boston (MIT), Delaware (UDEL), Virginia (TIS, ISE), San Jose, and Los Angeles (ISI); wide-area links range from 1.4 ms / 1.47 Mbits/sec to 38.8 ms / 1.86 Mbits/sec, with 100 Mb/s, <1 ms local links.]
  • Based on a real experimental network (CAIRN).
  • Several years ago we benchmarked benign
    replication on this network.
  • Modeled on our cluster, emulating bandwidth and
    latency constraints, both for Steward and BFT.

37
CAIRN Emulation Performance
  • Steward is limited by bandwidth at 51 updates per
    second.
  • 1.8 Mbps can barely accommodate 2 updates per
    second for BFT.
  • Earlier experimentation with benign-fault
    2-phase commit protocols achieved up to 76
    updates per second [Amir et al. '02].

38
The Basic Idea
  • Reduce database replication to Global Consistent
    Persistent Order.
  • Use group communication ordering to establish the
    Global Consistent Persistent Order on the
    updates.
  • deterministic + serialized → consistent
  • Group Communication membership + quorum →
    primary partition.
  • Only replicas in the primary component can commit
    updates.
  • Updates ordered in a primary component are marked
    green and applied. Updates ordered in a
    non-primary component are marked red and will be
    delayed.

39
Throughput Comparison (WAN)
40
Resilient Replication Summary
  • We can build resilient replication protocols that
    do not use membership!
  • Failures can be benign (crashes, network
    partitions) or Byzantine (servers can lie)
  • We can add message exchanges (rounds) to
    guarantee Safety and Liveness with different
    fault models.
  • Performance can be improved by adding hierarchy!