Title: LINF 2345 Leader election and consensus with crash and Byzantine failures
1. LINF 2345: Leader election and consensus with crash and Byzantine failures
- Seif Haridi
- Peter Van Roy
2. Overview
- Synchronous systems with crash failures
- Leader election in rings
- Fault-tolerant consensus
3. Leader Election in Rings
4. Background: Rings
- The ring topology is a circular arrangement of nodes, often used as a control topology in distributed computations
- A graph is regular if all nodes have the same degree
- A ring is an undirected, connected, regular graph of degree 2
- G is a ring if and only if there is a one-to-one mapping of V to {0, ..., n-1} such that the neighbors of node i are nodes i-1 and i+1 (modulo n)
5. The Leader Election Problem
- A situation where a group of processors must select a leader among them
- Simplifies coordination
- Helpful in achieving fault tolerance
  - Coordinator in two/three-phase commits
- Represents a general class of symmetry-breaking problems
  - Deadlock removal
6. The Leader Election Problem
- An algorithm solves the leader election problem if:
  - The terminated states are partitioned into elected and non-elected states
  - Once a processor enters an elected/non-elected state, its transition function will only move it to another (or the same) elected/non-elected state
  - In every admissible execution, exactly one processor enters an elected state and all others enter a non-elected state
7. The Leader Election Problem: Rings
- In fact, we have seen an election algorithm in the previous section, on arbitrary network topologies
- For rings:
  - Edges go between pi and pi+1 (addition modulo n), for all i, 0 ≤ i ≤ n-1
  - Processors have a consistent notion of left (clockwise) and right (counterclockwise)
[Figure: a simple oriented ring of three processors p0, p1, p2; each edge carries the port labels 1 and 2, oriented consistently around the ring]
8. Anonymous Rings
- A leader election algorithm for a ring is anonymous if every processor has the same state machine
  - Implies that processors do not have unique identifiers
- An algorithm is uniform if it does not use the value n, the number of processors
- Otherwise the algorithm is nonuniform
  - For each size n there is a state machine, but it might be different for different sizes n
9. Anonymous Rings: Impossibility Results
- Main result: there is no anonymous leader election algorithm for ring systems
- The result can be stated more comprehensively as: there is no nonuniform anonymous algorithm for leader election in synchronous rings
- Impossibility results for synchronous systems imply the same impossibility results for asynchronous systems. Why?
- Impossibility results for nonuniform algorithms imply the same for uniform algorithms. Why?
10. Anonymous Rings: Impossibility Results
- Impossibility results for synchronous systems imply the same impossibility results for asynchronous systems. Why?
  - Answer: an admissible execution in a synchronous system (SS) is also an admissible execution in an asynchronous system (AS)
  - Therefore there is always at least one admissible execution of any AS algorithm that does not satisfy the correctness condition of a leader election algorithm
- Impossibility results for nonuniform algorithms imply the same for uniform algorithms. Why?
  - If there were a uniform algorithm, it could be used as a nonuniform algorithm
11. Asynchronous Rings
- Processors have unique identifiers, which can be any natural numbers
- For each pi, there is a variable idi initialized to the identifier of pi
- We specify a ring by listing the processors starting from the one with the smallest identifier
- Each processor pi, 0 ≤ i ≤ n-1, is assigned idi
[Figure: an example ring listed from the smallest identifier: (p0, 0), (p1, 10), (p2, 5), (p3, 97)]
12. Asynchronous Rings: An O(n²) Algorithm
- Each processor sends a message with its id to its left neighbor, and waits for messages from its right neighbor
- When a processor pi receives a message m, it checks the id in m
  - If m.id > pi.id, pi forwards m to its own left neighbor
  - Otherwise the message is consumed (swallowed)
- A processor pk that receives a message with its own id declares itself the leader, and sends a termination message to its left neighbor
- A processor that receives a termination message forwards it to the left, and terminates as a non-leader
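The rules above can be simulated round by round. A minimal sketch (illustrative, not the course's code): the left neighbor of pi is taken to be p((i+1) mod n), termination messages are omitted, and the function name `ring_election` is invented here.

```python
def ring_election(ids):
    """ids[i] is the identifier of processor pi; the 'left' neighbor
    of pi is taken to be p((i+1) mod n). Returns the index of the
    elected leader and the number of id messages sent."""
    n = len(ids)
    messages = n                        # every processor first sends its id left
    inbox = [[] for _ in range(n)]
    for i in range(n):
        inbox[(i + 1) % n].append(ids[i])
    leader = None
    while leader is None:
        nxt = [[] for _ in range(n)]
        for i in range(n):
            for m in inbox[i]:
                if m == ids[i]:
                    leader = i                   # own id came back: elected
                elif m > ids[i]:
                    nxt[(i + 1) % n].append(m)   # forward larger id leftward
                    messages += 1                # smaller ids are swallowed
        inbox = nxt
    return leader, messages
```

On the ring with identifiers [3, 1, 4, 2] this elects the processor holding the maximum id, 4, after 8 id messages.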
13. Asynchronous Rings: An O(n²) Algorithm
- The algorithm never sends more than O(n²) messages
  - O(n²) means c·n² is an upper bound, for some constant c
- Each processor, e.g. the one with the lowest id, may forward up to n messages plus one termination message
- There is an admissible execution in which the algorithm sends Θ(n²) messages
  - Θ(n²) means c1·n² is an upper bound and c2·n² is a lower bound, for some constants c1 and c2
14. Asynchronous Rings: An O(n²) Algorithm
- Example (an execution): identifiers n-1, n-2, ..., 2, 1, 0 arranged in decreasing order in the direction the messages travel
- The message of the processor with identifier i is sent exactly i+1 times
- n termination messages
- Total: Σ_{i=0}^{n-1} (i+1) + n = n(n+1)/2 + n = Θ(n²)
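The closing sum can be checked numerically (an illustrative helper, not from the slides):

```python
def worst_case_messages(n):
    # the id message carrying identifier i travels i+1 hops,
    # plus n termination messages: n(n+1)/2 + n, which is Θ(n²)
    return sum(i + 1 for i in range(n)) + n
```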
15. Asynchronous Rings: An O(n log n) Algorithm
- The k-neighborhood of a processor pi is the set of processors up to distance k from pi in the ring (to the left and to the right)
- The algorithm operates in phases, starting at 0
- In the kth phase a processor tries to be the winner of that phase
  - To be a phase-k winner it must have the largest id in its 2^k-neighborhood
- Only winners of phase k continue to phase k+1
- At the end only one processor survives, and is elected as the leader
16. Asynchronous Rings: An O(n log n) Algorithm
- In phase 0, each processor pi attempts to be a phase-0 winner
  - pi sends a ⟨probe, idi⟩ message to its 1-neighborhood
  - If the id of a neighbor receiving the probe is greater than idi, the message is swallowed
  - Otherwise the neighbor sends back a reply message
  - If pi receives a reply message from both its neighbors, it becomes a phase-0 winner and continues with phase 1
17. Asynchronous Rings: An O(n log n) Algorithm
- In phase k, each processor pi that is a phase-(k-1) winner sends probe messages to its 2^k-neighborhood
- Each probe traverses up to 2^k processors, one by one
- A probe is forwarded by a processor if the processor's id is smaller than the probe's id and the processor is not the last one on the probe's path
18. Asynchronous Rings: An O(n log n) Algorithm
- If the probe is not swallowed by the last processor on its path, that processor sends back a reply
- If pi receives reply messages from both directions, it becomes a phase-k winner and continues with phase k+1
- A processor that receives its own probe declares itself the leader and sends a termination message around the ring
19. Asynchronous Rings: An O(n log n) Algorithm
[Figure: a ring p1, ..., p9 shown twice: first with p1, p3, p5, p7 as phase-0 winners, then with p1, p5 as phase-1 winners]
20. Asynchronous Rings: An O(n log n) Algorithm
Messages:
- ⟨probe, id, k, d⟩
- ⟨reply, id, k⟩
- id: identifier of the processor
- k: integer, the phase number
- d: integer, a hop counter
21. Asynchronous Rings: An O(n log n) Algorithm
- Initially: asleep = true
- Upon receiving no message:
  - if asleep then
    - asleep := false
    - send ⟨probe, id, 0, 1⟩ to left and right
22. Asynchronous Rings: An O(n log n) Algorithm
- Upon receiving ⟨probe, j, k, d⟩ from left (resp. right):
  - if j = id then terminate as the leader
  - if j > id and d < 2^k then
    - send ⟨probe, j, k, d+1⟩ to right (resp. left)
  - if j > id and d = 2^k then
    - send ⟨reply, j, k⟩ to left (resp. right)
23. Asynchronous Rings: An O(n log n) Algorithm
- Upon receiving ⟨reply, j, k⟩ from left (resp. right):
  - if j ≠ id then send ⟨reply, j, k⟩ to right (resp. left)
  - else
    - if already received ⟨reply, j, k⟩ from right (resp. left) then
      - send ⟨probe, id, k+1, 1⟩ to left and right
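Abstracting away the probe/reply mechanics: a processor survives phase k exactly when it holds the largest id in its 2^k-neighborhood. A centralized sketch of this invariant (assuming unique identifiers; the function name `hs_election` is invented here, not from the slides):

```python
def hs_election(ids):
    """ids[i] is the unique identifier of pi on a ring of n processors.
    Returns the index of the elected leader."""
    n = len(ids)
    winners = list(range(n))            # before phase 0, everyone competes
    k = 0
    while len(winners) > 1:
        r = 2 ** k                      # radius of the phase-k neighborhood
        survivors = []
        for i in winners:
            neigh = [ids[(i + d) % n] for d in range(-r, r + 1)]
            if max(neigh) == ids[i]:    # largest id in its 2^k-neighborhood
                survivors.append(i)
        winners = survivors
        k += 1
    return winners[0]
```

At most ⌊n/(2^k + 1)⌋ processors can win phase k, which is where the O(n log n) message bound comes from.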
24. Fault-Tolerant Consensus
25. Fault-Tolerant Consensus: Overview
- We study problems that arise when a distributed system is unreliable, i.e. when processors behave incorrectly
- The consensus problem requires processors to agree on a common output based on their (possibly conflicting) inputs
- Types of failures:
  - Crash failure (a processor stops operating)
  - Byzantine failure (a processor behaves arbitrarily; also known as malicious failure)
26. Fault-Tolerant Consensus: Overview
- Synchronous systems
  - To solve consensus with Byzantine failures, fewer than a third of the processors may behave arbitrarily
  - We will show one algorithm in detail, which uses the optimal number of rounds but has exponential message complexity
  - More sophisticated algorithms are possible, for example, algorithms with polynomial message complexity
27. Fault-Tolerant Consensus: Overview
- Asynchronous message-passing systems
  - The consensus problem cannot be solved by deterministic algorithms, neither for crash nor for Byzantine failures
  - This is a famous impossibility result, first proved in 1985 by Fischer, Lynch, and Paterson
- How do we get around this impossibility?
  - We can introduce a synchrony assumption, or we can make the algorithm randomized (probabilistic)
  - Both solutions can be practical, but each has its limitations
28. Synchronous Systems with Crash Failures
- Assumptions
  - The communication graph is complete, i.e. a clique
  - Communication links are fully reliable
- In the reliable synchronous system:
  - An execution consists of rounds
  - Each round consists of the delivery of all messages pending in outbuf variables, followed by one computation step by each processor
29. Synchronous Systems with Crash Failures
- An f-resilient system is a system where up to f processors can fail
- Execution in an f-resilient system:
  - There exists a subset F of at most f processors, the faulty processors (different for different executions)
  - Each round contains exactly one computation event for every processor not in F, and at most one computation event for every processor in F
30. Synchronous Systems with Crash Failures
- Execution in an f-resilient system (continued):
  - Each round contains exactly one computation event for every processor not in F, and at most one computation event for every processor in F
  - If a processor in F does not have a computation event in some round, then it has no computation event in any subsequent round
  - In the last round in which a faulty processor has a computation event, an arbitrary subset of its outgoing messages is delivered
31. Synchronous Systems with Crash Failures
- Clean failure: a situation where all or none of a processor's messages are delivered in its last step
- Consensus is easy and efficient for clean failures
- We have to deal with non-clean failures
  - As we shall see, this is what makes the algorithm expensive
32. The Consensus Problem
- Each pi has a special component xi, called the input, and yi, called the output
- Initially:
  - Each xi contains a value from some well-ordered set
  - yi is undefined
- A solution to the consensus problem must satisfy the following conditions:
  - Termination
  - Agreement
  - Validity
33. The Consensus Problem
- Termination: in every admissible execution, yi is eventually assigned a value, for every nonfaulty processor pi
- Agreement: in every execution, if yi and yj are assigned, then yi = yj, for all nonfaulty processors pi and pj
- Validity: in every execution, if yi is assigned v for some value v on a nonfaulty processor pi, then there exists a processor pj such that xj = v
34. Simple Algorithm
- Needs f+1 rounds
- Every processor maintains a set of values it knows to exist in the system
  - Initially this set contains only its input value
- In later rounds:
  - A processor updates its set by adding new values received from other processors
  - And broadcasts any new additions
- At round f+1 each processor decides on the smallest value in its set
35. Simple Algorithm: Consensus in the Presence of Crash Failures
- Initially V = {x}
- Round k, 1 ≤ k ≤ f+1:
  - send {v ∈ V : pi has not already sent v} to all processors
  - receive Sj from pj, 0 ≤ j ≤ n-1, j ≠ i
  - V := V ∪ Sj for every received Sj
  - if k = f+1 then y := min(V)
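The rounds above can be simulated directly. A sketch, not the course's code: the `crash_schedule` format and the function name are invented here to model non-clean failures (a crashing processor delivers its final round's messages only to a chosen subset).

```python
def flood_consensus(inputs, crash_schedule, f):
    """inputs[i] is xi. crash_schedule maps a faulty index to
    (round it crashes in, set of indices that still receive its
    messages in that round). Returns {i: yi} for surviving processors."""
    n = len(inputs)
    V = [{inputs[i]} for i in range(n)]      # values known to exist
    sent = [set() for _ in range(n)]
    crashed = set()
    for rnd in range(1, f + 2):              # rounds 1 .. f+1
        deliveries = [set() for _ in range(n)]
        for i in range(n):
            if i in crashed:
                continue
            new = V[i] - sent[i]             # broadcast only new values
            sent[i] |= new
            if i in crash_schedule and crash_schedule[i][0] == rnd:
                receivers = crash_schedule[i][1]   # non-clean failure
                crashed.add(i)
            else:
                receivers = range(n)
            for j in receivers:
                deliveries[j] |= new
        for j in range(n):
            if j not in crashed:
                V[j] |= deliveries[j]
    return {i: min(V[i]) for i in range(n) if i not in crashed}
```

For example, with inputs [2, 0, 3, 1] and p1 crashing in round 1 while delivering only to p2, the f+1 = 2 rounds still give every survivor the minimum value 0.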
36. Illustration of the Algorithm: f = 3
- The algorithm requires f+1 = 4 rounds, and tolerates f = 3 crash failures
[Figure: processors p0-p4 over rounds 1-4; in each round a crashing processor passes the value x over only a single link]
37. Illustration of the Algorithm: f = 3
- p2 and p4 survive
- The others crash one at a time
- p2 and p4 have the value x
[Figure: processors p0-p4 over rounds 1-4, showing how x reaches p2 and p4 through a chain of non-clean crashes]
38. How the Algorithm Works
- Why is one round not enough?
  - Hint: non-clean failures!
- In the previous slides, the value x is sent across only one link instead of all links, because the processor has a non-clean failure
- We need enough rounds to cover the possibility of a non-clean failure in each round: with f faulty processors at most f rounds can contain a failure, so at least one of the f+1 rounds is failure-free and floods all values to everyone
39. Synchronous Systems with Byzantine Failures
- We want to reach agreement in spite of malicious processors
- In an execution of an f-resilient Byzantine system, there is a subset of at most f processors which are faulty
- In a computation step of a faulty processor, its state and the messages it sends are completely unconstrained
- A faulty processor may also mimic the behavior of a crashed processor
40. The Consensus Problem
- Termination: in every admissible execution, yi is eventually assigned a value, for every nonfaulty processor pi
- Agreement: in every execution, if yi and yj are assigned, then yi = yj, for all nonfaulty processors pi and pj
- Validity: in every execution, if yi is assigned v for some value v on a nonfaulty processor pi, then there exists a processor pj such that xj = v
41. Lower Bounds on the Number of Faulty Processors
- If a third or more of the processors can be Byzantine, then consensus cannot be reached
42. Lower Bounds on the Number of Faulty Processors
- If a third or more of the processors can be Byzantine, then consensus cannot be reached
- In particular, in a system with three processors of which one is Byzantine, there is no algorithm that solves the consensus problem
43. Three-Processor System
- Assume that there is a 3-processor algorithm A that solves the Byzantine agreement problem when one processor is faulty
- Take two copies of A and configure them into a hexagonal system S
[Figure: the triangle system A with processors 1, 2, 3]
44. Three-Processor System
[Figure: the triangle system A with processors 1, 2, 3, next to the hexagonal system S obtained from two copies of A, with processors 1, 2, 3, 1, 2, 3 around the ring]
- Input value for processors 1, 2, and 3 in the first copy is 0
- Input value for processors 1, 2, and 3 in the second copy is 1
45. Three-Processor System
- S is a synchronous system; each processor runs its algorithm from the triangle system A
- Each processor in S knows its neighbors and is unaware of the other nodes
- We expect S to exhibit a well-defined behavior with its input
- Observe: S does not solve the consensus problem
- Call the resulting execution α (an infinite synchronous execution)
[Figure: the hexagonal system S, with input 0 for the first copy's processors and input 1 for the second copy's]
46. Execution α from the Point of View of Processors 2 and 3
[Figure: execution α1 of the triangle system A, in which processor 1 is faulty and processors 2 and 3 have input 0; to processors 2 and 3 it is indistinguishable from α]
- Processors 2 and 3 see 1 as faulty, and since A is a consensus algorithm they both decide on 0 in execution α of S
47. Execution α from the Point of View of Processors 1 and 2
[Figure: execution α2 of the triangle system A, in which processor 3 is faulty and processors 1 and 2 have input 1; to processors 1 and 2 it is indistinguishable from α]
- Processors 1 and 2 see 3 as faulty, and since A is a consensus algorithm they both decide on 1 in execution α of S
48. Execution α from the Point of View of Processors 1 and 3
[Figure: execution α3 of the triangle system A, in which processor 2 is faulty; to processors 1 and 3 it is indistinguishable from α]
- Processors 1 and 3 see 2 as faulty, and since A is a consensus algorithm they both must decide on one output value in execution α of S
- This is not possible, since they have already decided differently (0 and 1)
- A contradiction! Therefore A does not exist
49. Consensus Algorithm 1
- Takes exactly f+1 rounds
- Requires n ≥ 3f+1
- The algorithm has two stages:
  - First, information is gathered by communication among the processors
  - Second, each processor computes its decision value locally
50. Information Gathering Phase
- Information in each processor is represented as a tree, in which each path from the root to a leaf contains f+2 nodes (height f+1)
- Nodes are labeled by sequences of processor names
  - The root is labeled by the empty sequence
- Let the label of an internal node v be (i1, i2, ..., ir); then for each i, 0 ≤ i ≤ n-1, not in the sequence, v has a child labeled (i1, i2, ..., ir, i)
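The labeling rule can be written out directly (an illustrative helper; the name `tree_labels` is not from the slides):

```python
def tree_labels(n, f):
    """Return the node labels level by level: the root is the empty
    sequence, and a node (i1, ..., ir) has a child (i1, ..., ir, i)
    for every index i not already in its label, down to height f+1."""
    levels = [[()]]                     # level 0: the root
    for _ in range(f + 1):
        nxt = []
        for label in levels[-1]:
            for i in range(n):
                if i not in label:      # processor names never repeat
                    nxt.append(label + (i,))
        levels.append(nxt)
    return levels
```

For n = 4 and f = 1 this yields 1 root, 4 children, and 12 leaves, so every root-to-leaf path has f+2 = 3 nodes.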
51. Information Gathering Phase: n = 4, f = 1
- xi is the input value of pi
- At () is the value x of this processor
- At (j) is the value of pj as given by pj
- At (j,k) is the value of pj as given by pk (the opinion of pk about xj)
- For example, (1,2) is the opinion of p2 about the value x1
- (1,2,0) is the opinion of p0 about the value of x1 given to p0 by p2
[Figure: tree with root (), children (0), (1), (2), (3), and the children (1,0), (1,2), (1,3) of node (1)]
52. Information Gathering Phase: n = 4, f = 1
- In the first round, each processor sends its initial value to all processors, including itself
- When pi receives a value x from pj, it stores x at the node labeled (j)
[Figure: the same tree, with the first-round values stored at nodes (0), (1), (2), (3)]
53. Information Gathering Phase: n = 4, f = 1
- At the beginning of round r, each processor broadcasts the rth level of its tree
- When a processor receives from pj the value labeled (i1, ..., ir), it stores it at the node labeled (i1, ..., ir, j) in its tree
54. Information Gathering Phase: n = 4, f = 1
[Figure: tree with the root () at level 1, nodes (0), (1), (2), (3) at level 2, and nodes (0,2), (1,2), (3,2) at level 3]
- A processor stores at level 3 the opinion of p2 about the values at level 2
55. Information Gathering Phase: n = 4, f = 1
- pi stores at (i1, ..., ir, j) the value that pj says that ir says that ... that i1 says it has
- We denote this value by tree(i1, ..., ir, j)
- Information gathering continues for f+1 rounds, until the entire tree has been filled
- The function resolve (a majority voting function) is then applied to the tree locally
56. Information Gathering Phase: n = 4, f = 1, resolve
- function resolve(t: Tree) = if t is a leaf then value(t) else Majority({resolve(ti) : ti is a child of t})
- Majority takes a list of values and returns the most popular one (the one that occurs most frequently); if there is no majority, it returns a default value v
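A direct transcription of resolve, under two assumptions made for this sketch: the tree is a dict from label tuples to stored values, and ties fall back to a default value.

```python
from collections import Counter

DEFAULT = 0   # assumed default value when no majority exists

def resolve(tree, label=()):
    """tree maps each label (a tuple of processor indices) to its
    stored value; the children of `label` extend it by one index."""
    children = [l for l in tree
                if len(l) == len(label) + 1 and l[:len(label)] == label]
    if not children:                    # leaf: the stored value itself
        return tree[label]
    votes = Counter(resolve(tree, c) for c in children)
    value, count = votes.most_common(1)[0]
    return value if count > len(children) // 2 else DEFAULT
```

For example, if the three reports about x0 are 0, 0, 1, then node (0) resolves to 0 by majority.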
57. Information Gathering Phase: n = 4, f = 1, Resolve
[Figure: the full tree with root (), level-1 nodes (0), (1), (2), (3), and level-2 leaves (0,1), (0,2), (0,3), (1,0), (1,2), (1,3), (2,0), (2,1), (2,3), (3,0), (3,1), (3,2)]
58. Information Gathering Phase: n = 4, f = 1, p0 is malicious: p3's tree
[Figure: p3's tree filled with example 0/1 values when the malicious p0 reports inconsistent values]
59. Information Gathering Phase: n = 3, f = 1, Resolve at p2
[Figure: p2's tree for n = 3, with root (), children (0), (1), (2) and leaves (0,1), (0,2), (1,0), (1,2), (2,0), (2,1), filled with example 0/1 values; one value is unknown (?)]
60. Consensus Algorithms for Byzantine Failures
- The minimum number of rounds is f+1, since crash failures are a special case of Byzantine failures
61. Exponential Tree Algorithm
- Each processor maintains a tree data structure in its local state
- Each node of the tree is labeled with a sequence of processor indices with no repeats
- The root's label is the empty sequence λ (the root has level 0)
- The root has n children, labeled 0 through n-1
- A child node labeled (i) has n-1 children, labeled (i,0) through (i,n-1), skipping (i,i)
62. Exponential Tree Algorithm
- The root's label is the empty sequence λ (the root has level 0)
- The root has n children, labeled 0 through n-1
- A child node labeled (i) has n-1 children, labeled (i,0) through (i,n-1), skipping (i,i)
- In general, a node at level d with label v has n-d children, labeled (v,0) through (v,n-1), skipping any index appearing in v
- Nodes at level f+1 are the leaves
63. The Tree when n = 4 and f = 1
64. The Exponential Tree Algorithm
- Each processor fills in the tree nodes with values as the rounds go by
- Initially, store your input in the root (level 0)
- Round 1: send level 0 of your tree (the root); store the value received from pj in node (j) at level 1 (default if none)
- Round 2: send level 1 of your tree; store the value received from pj for node (k) in node (k,j) at level 2 (the value that pj told me that pk told pj) (default if none)
- Continue for f+1 rounds
65. The Exponential Tree Algorithm
- In the last round, each processor uses the values in its tree to compute its decision. The decision is resolve(λ), where resolve(π) equals:
  - The value in the tree node labeled π, if π is a leaf
  - Majority{resolve(π′) : π′ is a child of π} otherwise (default if there is no majority)
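Putting the pieces together for n = 4, f = 1: a self-contained sketch (not the lecture's code; the Byzantine strategy for p0 and all names are invented for illustration) in which the three nonfaulty processors, all with input 1, still agree.

```python
from collections import Counter

N, DEFAULT = 4, 0
GOOD = [1, 2, 3]                        # nonfaulty processors, inputs all 1

# trees[i] maps a node label (tuple) to the value pi stored there
trees = {i: {(): 1} for i in GOOD}

# Round 1: everyone broadcasts its root; the Byzantine p0 reports
# inconsistent values to different processors (one possible strategy)
lie = {1: 0, 2: 1, 3: 0}
for i in GOOD:
    for j in GOOD:
        trees[i][(j,)] = trees[j][()]   # honest reports are faithful
    trees[i][(0,)] = lie[i]             # p0's inconsistent report

# Round 2: everyone relays level 1; honest relays are faithful,
# while p0 again relays corrupted values
for i in GOOD:
    for k in range(N):
        for j in GOOD:
            if j != k:
                trees[i][(k, j)] = trees[j][(k,)]
        if k != 0:
            trees[i][(k, 0)] = 1 - trees[i][(k,)]

def resolve(tree, label=()):
    children = [l for l in tree
                if len(l) == len(label) + 1 and l[:len(label)] == label]
    if not children:
        return tree[label]
    votes = Counter(resolve(tree, c) for c in children)
    v, c = votes.most_common(1)[0]
    return v if c > len(children) // 2 else DEFAULT

decisions = {i: resolve(trees[i]) for i in GOOD}
```

All three nonfaulty processors decide 1, matching validity, even though p0 lied in both rounds; they also resolve the same value for node (0).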
66. Proof of the Exponential Tree Algorithm
- Lemma (5.10): a nonfaulty processor pi's resolved value for node πj (what pj reports for π) equals what the nonfaulty pj has stored for π
- Basis: πj is a leaf. Then pi stores in node πj what pj sends it for π in the last round
  - For leaves, the resolved value is the tree value
67. Proof of the Exponential Tree Algorithm
- Induction step: πj is not a leaf
  - By the tree definition, πj has at least n-f children
  - Since n > 3f, a majority of πj's children end in nonfaulty indices
  - Let πjk be a child of πj such that pk is nonfaulty
  - Since pj is nonfaulty, pj correctly reports to pk that it has some value v in node π; thus pk stores v in node πj
  - By induction, pi's resolved value for πjk equals the value v that pk has in its tree node πj
  - So all of πj's nonfaulty children resolve to v in pi's tree, and thus πj resolves to v in pi's tree
68. The Exponential Tree Algorithm
69. The Exponential Tree Algorithm
- Validity: suppose all inputs are v
  - A nonfaulty processor pi decides on resolve(λ), which is the majority among resolve(j), 0 ≤ j ≤ n-1
  - The previous lemma implies that for each nonfaulty pj, resolve(j) is the value stored at the root of pj's tree, which is pj's input v
  - Thus pi decides v
70. The Exponential Tree Algorithm
- Agreement: show that all nonfaulty processors resolve the same value for their tree roots
- A node is common if all nonfaulty processors resolve the same value for it. We will show that the root is common
- Strategy:
  - Show that every node with a certain property is common
  - Show that the root has the property
71. The Exponential Tree Algorithm
- If every π-to-leaf path has a common node, then π is common
72. The Exponential Tree Algorithm
- Show that every root-to-leaf path has a common node:
  - There are f+2 nodes on a root-to-leaf path
  - The label of each non-root node on a root-to-leaf path ends in a distinct processor index: i1, i2, ..., i(f+1)
  - At least one of these f+1 indices is that of a nonfaulty processor, say ik
  - Lemma 5.10 implies that the node whose label ends in ik is common