1
Virtual Synchrony
  • Krzysztof Ostrowski
  • krzys@cs.cornell.edu

2
A motivating example
[Diagram: SENSORS detect a DANGER and send notifications; the detection leads to a decision made by the GENERALS, who issue orders to a WEAPON for a well-coordinated response.]
3
Requirements for a distributed system (or a
replicated service)
  • Consistent views across components
  • E.g. vice-generals see the same events as the
    chief general
  • Agreement on what messages have been received or
    delivered
  • E.g. each general has same view of the world
    (consistent state)
  • Replicas of the distributed service do not
    diverge
  • E.g. everyone should have same view of membership
  • If a component is unavailable, all others decide
    up or down together

4
Requirements for a distributed system (or a
replicated service)
  • Consistent actions
  • E.g. generals don't contradict each other (don't
    issue conflicting orders)
  • A single service may need to respond to a single
    request
  • Responses to independent requests may need to be
    consistent
  • But consistent ≠ same (same actions ⇒
    determinism ⇒ no fault tolerance)

5
System as a set of groups
[Diagram: the system as a set of groups: client-server groups, a peer group multicasting a decision, and a diffusion group spreading an event.]
6
A process group
  • A notion of process group
  • Members know each other, cooperate
  • Suspected failures → the group restructures itself,
    consistently
  • A failed node is excluded, eventually learns of
    its failure
  • A recovered node rejoins
  • A group maintains a consistent common state
  • Consistency refers only to members
  • Membership is a problem on its own... but it CAN
    be solved

7
A model of a dynamic process group
[Diagram: processes A–F move through membership views while keeping a consistent state; a CRASH removes a member, RECOVER and JOIN add members back after state transfers, and each change installs a new membership view.]
8
The lifecycle of a member (a replica)
[Diagram: the lifecycle of a replica: dead or suspected to be dead → comes up alive, but not in the group (assumed tabula rasa, all information lost) → joins, state is transferred here → in the group, processing requests → unjoins, or fails / becomes unreachable.]
9
The Idea (Roughly)
  • Take some membership protocol, or an external
    service
  • Guarantee consistency in an inductive manner
  • Start in an identical replicated state
  • Apply any changes
  • Atomically, that is either everywhere or nowhere
  • In the exact same order at all replicas
  • Consistency of all actions / responses comes as a
    result
  • Same events seen
  • Rely on ordering and atomicity of failures and
    message delivery

10
The Idea (Roughly)
  • We achieve it by the following primitives (sketched
    below)
  • Lower-level
  • Create / join / leave a group
  • Multicasting: FBCAST, CBCAST / ABCAST (the
    "CATOCS")
  • Higher-level
  • Download current state from the existing active
    replicas
  • Request / release locks (read-only / read-write)
  • Update
  • Read (locally)
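These primitives can be pictured as a small API. A minimal sketch in Python, assuming a hypothetical Group class; the names (join, leave, fbcast, cbcast, abcast, request_lock) are illustrative and not the actual ISIS interface:

    # Illustrative sketch of the primitives above (hypothetical names, not the real ISIS API).
    from typing import Any, Callable

    class Group:
        """A process group offering membership, multicast, state transfer and locks."""

        def __init__(self, name: str) -> None:
            self.name = name
            self.members: list[str] = []          # current membership view

        def join(self, member: str, fetch_state: Callable[[], Any]) -> Any:
            """Join the group; the new member downloads state from an active replica."""
            self.members.append(member)
            return fetch_state()

        def leave(self, member: str) -> None:
            self.members.remove(member)

        def fbcast(self, msg: Any) -> None: ...   # FIFO-ordered multicast
        def cbcast(self, msg: Any) -> None: ...   # causally ordered multicast
        def abcast(self, msg: Any) -> None: ...   # totally ordered ("atomic") multicast

        def request_lock(self, mode: str) -> None: ...   # "read-only" or "read-write"
        def release_lock(self) -> None: ...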

11
Why another approach, though?
  • We have the whole range of other tools
  • Transactions: ACID, one-copy serializability with
    durability
  • Paxos, Chandra-Toueg (FLP-style consensus schemes)
  • All kinds of locking schemes, e.g. two-phase
    locking (2PL)
  • Virtual Synchrony is a point in the space of
    solutions
  • Why are other tools not perfect?
  • Some are very slow: lots of messages, round-trip
    latencies
  • Some limit asynchrony (e.g. transactions at
    commit time)
  • They have us pay a very high cost for features we
    may not need

12
A special class of applications
  • Command / Control
  • Joint Battlespace Infosphere, telecommunications,
  • Distribution / processing / filtering data
    streams
  • Trading system, air traffic control system, stock
    exchange, real-time data for banking,
    risk-management
  • Real-Time Systems
  • Shop floor process control, medical decision
    support, power grid
  • What do they have in common?
  • Distributed, but coordinated, processing and
    control
  • Highly efficient, pipelined distributed data
    processing

13
Distributed trading system
[Diagram: market data feeds and historical data / pricing DBs (1), current pricing (2), and analytics (3) serve trader clients; a long-haul WAN spooler links Tokyo, London, Zurich, ...]
  • Availability for historical data
  • Load balancing and consistent message delivery
    for price distribution
  • Parallel execution for analytics
14
What's special about these systems?
  • Need high performance: we must weaken consistency
  • Data is of a different nature: more dynamic
  • More relevant online, in context
  • Storing it persistently often doesn't make that
    much sense
  • Communication-oriented
  • Online progress: nobody cares about faulty nodes
  • Faulty nodes can be rebooted
  • Rebooted nodes are just spare replicas in the
    common pool

15
Differences (in a Nutshell)
Databases vs. Command / Control:
  • relatively independent programs vs. closely cooperating
    programs, organized into process groups
  • consistent data, (external) strong consistency vs.
    weakened consistency, with the focus instead on making
    online progress
  • persistency, durable operations vs. mostly replicated
    state and control info
  • one-copy serializability vs. serializability w/o
    durability (nobody cares)
  • atomicity of groups of operations vs. atomicity of
    messages, causality
  • heavy-weight, slow mechanisms vs. lightweight ones,
    with stress on responsiveness
  • relationships in data vs. relationships between actions
    and in the sequences of messages
  • multi-phase protocols, ACKs etc. vs. preferably one-way,
    pipelined processing
16
Back to virtual synchrony
  • Our plan
  • Group membership
  • Ordered multicast within group membership views
  • Delivery of new views synchronized with multicast
  • Higher-level primitives

17
A process group: joining / leaving
[Diagram: a group membership protocol (or an external Group Membership Service) installs successive views: V1 = A,B,C; D's request to join leads to sending a new view V2 = A,B,C,D; A's request to leave leads to V3 = B,C,D.]
18
A process group: joining / leaving
  • How it looks in practice
  • Application makes a call to the virtual synchrony
    library
  • Node communicates with other nodes
  • Locates the appropriate group, follows the join
    protocol etc.
  • Obtains state or whatever it needs (see later)
  • Application starts communicating, e.g. joins the
    replica pool

[Diagram: two nodes, each with an Application on top of a V.S. Module on top of the Network; all the virtual synchrony just fits into the protocol stack.]
19
A process group: handling failures
  • We rely on a failure detector (it doesn't
    concern us here)
  • A faulty or unreachable node is simply excluded
    from the group
  • Primary partition model: the group cannot split into
    two active parts

[Diagram: in view V1 = A,B,C,D, A crashes; B realizes that something is wrong with A and initiates the membership change protocol, producing V2 = B,C,D; after A recovers and rejoins, V3 = A,B,C,D.]
20
Causal delivery and vector clocks
[Diagram: processes A–E exchange CBCAST / FBCAST messages stamped with vector clocks; a message stamped (0,1,1,0,0) arriving at a process whose clock is (0,0,0,0,0) cannot be delivered and is delayed until the message stamped (0,0,1,0,0) has been delivered.]
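A minimal sketch of the delivery rule the diagram illustrates, assuming one vector-clock entry per group member and that each message carries its sender's vector timestamp; the function names are illustrative:

    # Causal delivery rule used by CBCAST (vector clocks; illustrative names).
    def can_deliver(msg_vt: list[int], sender: int, local_vt: list[int]) -> bool:
        """Deliver only if this is the next message from its sender and everything
        it causally depends on has already been delivered here."""
        if msg_vt[sender] != local_vt[sender] + 1:
            return False                        # gap: an earlier message from the sender is missing
        return all(msg_vt[k] <= local_vt[k]
                   for k in range(len(local_vt)) if k != sender)

    def deliver(msg_vt: list[int], sender: int, local_vt: list[int]) -> None:
        local_vt[sender] += 1                   # record the delivery in the local clock

    # Example matching the diagram: at a process with clock (0,0,0,0,0), a message
    # stamped (0,1,1,0,0) from B (index 1) is delayed, because it depends on the
    # message stamped (0,0,1,0,0) from C (index 2), which must be delivered first.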
21
What's great about fbcast / cbcast?
  • Can deliver to itself immediately
  • Asynchronous sender: no need to wait for anything
  • No need to wait for delivery order on the receiver
  • Can issue many sends in bursts
  • Since processes are less synchronous...
  • ...the system is more resilient to failures
  • Very efficient, overheads comparable with TCP

22
Asynchronous pipelining
[Diagram: a sender (A) never needs to wait and can send requests to B and C at a high rate; a second run shows that buffering at the sender may reduce overhead.]
23
What's to be careful with?
  • Asynchrony
  • Data accumulates in buffers at the sender
  • Must put limits on it!
  • Explicit flushing (sketched below): send data to the
    others, force it out of the buffers; if it completes,
    the data is safe; needed as a ...
  • A failure of the sender causes lots of updates to
    be lost
  • The sender gets ahead of everybody else... good if
    the others are doing something that doesn't
    conflict
  • Cannot do multiple conflicting tasks without a
    form of locking, while ABCAST can
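A minimal sketch of the sender-side buffering and explicit flushing mentioned above; the transport object and its send_to_all / wait_until_stable calls are assumptions for illustration:

    # Illustrative sender-side buffer with a bound and an explicit flush.
    class AsyncSender:
        def __init__(self, transport, max_unstable: int = 128) -> None:
            self.transport = transport
            self.max_unstable = max_unstable
            self.unstable = []                  # sent, but not yet known safe everywhere

        def send(self, msg) -> None:
            if len(self.unstable) >= self.max_unstable:
                self.flush()                    # limit how far the sender can get ahead
            self.unstable.append(msg)
            self.transport.send_to_all(msg)     # asynchronous: no round trip per message

        def flush(self) -> None:
            """Force buffered data out; when this returns, the data is safe."""
            for msg in self.unstable:
                self.transport.wait_until_stable(msg)
            self.unstable.clear()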

24
Why use causality?
  • The sender need not include context with every
    message
  • One of the very reasons why we use FIFO delivery,
    TCP
  • Could "encode" context in the message, but costly or
    complex
  • Causal delivery simply extends FIFO to multiple
    processes
  • The sender knows that the receiver will have received
    the same msgs
  • Think of a thread "migrating" between servers,
    causing msgs to be sent

25
A migrating thread and FIFO analogy
[Diagram: a thread "migrating" across processes A–E, sending messages along the way; a second view of the same run suggests why this is analogous to FIFO delivery.]
26
Why use causality?
  • Receiver does not need to worry about "gaps"
  • Could be done by looking at "state", but may be
    quite hard
  • Ordering may simply be driven by correctness
  • Synchronizing after every message could be
    unacceptable
  • Reduces inconsistency
  • Doesn't prevent it altogether, but it isn't
    always necessary
  • State-level synchronization can thus be done more
    rarely!
  • All this said... causality makes most sense in
    context

27
Causal vs. total ordering
Note: unlike causal ordering, total ordering may require that local delivery be postponed!
[Diagram: two runs of concurrent multicasts from A and E. "Causal, but not total ordering": some processes deliver them as (A,E) and others as (E,A). "Causal and total ordering": every process delivers them in the same order.]
28
Total ordering: atomic, synchronous
[Diagram: two runs across processes A–E showing totally ordered multicast: delivery is atomic and appears synchronous.]
29
Why total ordering?
  • State machine approach
  • Natural, easy to understand, still cheaper than
    transactions
  • Guarantees atomicity, which is sometimes
    desirable
  • Consider a primary-backup scheme

30
Implementing totally ordered multicast
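The slide's figure is not transcribed; as a minimal sketch, one common way to implement totally ordered multicast is a fixed sequencer that assigns consecutive sequence numbers, with delivery held back until a message's turn comes. The group / fbcast calls are assumed, and a real protocol must also replace a failed sequencer during a membership change:

    # Fixed-sequencer total order (one common implementation; illustrative names).
    def app_deliver(msg) -> None:               # application-level delivery (placeholder)
        print("delivered:", msg)

    class Sequencer:
        """A distinguished member assigns consecutive sequence numbers."""
        def __init__(self, group) -> None:
            self.group = group
            self.next_seq = 0

        def on_data(self, msg) -> None:
            self.group.fbcast(("ORDER", self.next_seq, msg))   # piggyback the payload
            self.next_seq += 1

    class Member:
        """Everyone delivers strictly in sequence-number order."""
        def __init__(self) -> None:
            self.next_to_deliver = 0
            self.held_back: dict[int, object] = {}  # out-of-order messages wait here

        def on_order(self, seq: int, msg) -> None:
            self.held_back[seq] = msg
            while self.next_to_deliver in self.held_back:
                app_deliver(self.held_back.pop(self.next_to_deliver))  # same order everywhere
                self.next_to_deliver += 1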
31
Atomicity of failures
  • Delivery guarantees
  • Among all the surviving processes, delivery
    is all or none (in both cases)
  • Uniform: all-or-none holds also if (but not only if)
    the message was delivered to a crashed node
  • No guarantees for the newly joined

[Diagram: three runs with crashes: a uniform one, a nonuniform one, and a wrong one that violates failure atomicity.]
32
Why atomicity of failures?
  • Reduce complexity
  • After we hear about failure, we need to quickly
    "reconfigure"
  • We like to think in stable epochs...
  • During epochs, failures don't occur
  • Any failure or a membership change begins a new
    epoch
  • Communication does not cross epoch boundaries
  • System does not begin a new epoch before all
    messages are either consistently delivered or all
    consistently forgotten
  • We want to know we got everything the faulty
    process sent, to completely finish the old epoch
    and open a "fresh" one

33
Atomicity: message flushing
[Diagram: after a failure (a logical partitioning), the surviving processes among A–E retransmit their own messages and the messages of the failed nodes while changing membership.]
34
A multi-phase failure-atomic protocol
[Diagram: phases 1–4 across processes A–E: save message, OK to deliver, all have seen, garbage collect; a marked point is where messages are delivered to the applications; one phase is needed only for uniform atomicity, and three phases are always present.]
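As a rough outline only: the phases named above could be sequenced as below. The group calls are assumed, and treating the "all have seen" round as the uniform-only phase is an assumption, not something the slide states:

    # Rough outline of the phases named above (illustrative pseudo-API; a real
    # protocol interleaves these with failure handling and membership changes).
    def failure_atomic_multicast(group, msg, uniform: bool) -> None:
        group.send_and_collect_acks(("SAVE", msg))        # phase: everyone saves a copy
        group.send(("OK_TO_DELIVER", msg))                # phase: deliver to the applications
        if uniform:
            group.send_and_collect_acks(("ALL_HAVE_SEEN", msg))  # extra round, assumed here
                                                          # to be the uniform-only phase
        group.send(("GARBAGE_COLLECT", msg))              # phase: drop the saved copies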
35
Simple tools
  • Replicated services: locking for updates, state
    transfer
  • Divide responsibility for requests: load
    balancing
  • Simpler because of all the communication
    guarantees we get
  • Work partitioning: subdivide tasks within a group
  • Simple schemes for fault-tolerance
  • Primary-Backup, Coordinator-Cohort

36
Simple replication: state machine
  • Replicate all data and actions
  • Simplest pattern of usage for v.s. groups
  • Same state everywhere (state transfer)
  • Updates propagated atomically using ABCAST
    (sketched below)
  • Updates applied in the exact same order
  • Reads or queries can always be served locally
  • Not very efficient, updates too synchronous
  • A little too slow
  • We try to sneak CBCAST into the picture...
  • ...and use it for data locking and for updates
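A minimal sketch of this state-machine pattern, assuming a hypothetical group whose abcast delivers updates in one total order at every replica:

    # State-machine replication over ABCAST (illustrative; the group API is assumed).
    class ReplicatedCounter:
        """Every replica holds the same value and applies updates in the same order."""
        def __init__(self, group) -> None:
            self.group = group
            self.value = 0                      # identical initial state (via state transfer)
            group.on_deliver(self.apply)        # ABCAST delivers in one total order

        def increment(self, amount: int) -> None:
            self.group.abcast(("inc", amount))  # propagate the update atomically

        def apply(self, update) -> None:
            op, amount = update
            if op == "inc":
                self.value += amount            # applied in the exact same order everywhere

        def read(self) -> int:
            return self.value                   # reads and queries are served locally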

37
Updates with token-style locking
We may not need anything more than just causality...
[Diagram: ownership of a shared resource circulates among A–E as a lock; a process requests the lock, the owner grants it, and the new owner performs an update; a crash of one process is also shown.]
38
Updates with token-style locking
[Diagram: a lock request goes to the token owner; the others must confirm (release any read locks) with individual confirmations before the owner grants the lock; updates arrive as messages from the token owner.]
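A minimal sketch of the token idea from the two slides above, assuming a hypothetical group with cbcast; a real scheme also tracks read locks and reassigns the token when its holder crashes:

    # Token-style write lock on top of causal multicast (illustrative only).
    class TokenLock:
        def __init__(self, group, me: str, initial_holder: str) -> None:
            self.group = group
            self.me = me
            self.holder = initial_holder        # current token owner, as known locally

        def request(self) -> None:
            self.group.cbcast(("REQUEST", self.me))

        def on_request(self, who: str) -> None:
            if self.holder == self.me and who != self.me:
                self.group.cbcast(("GRANT", who))   # the grant causally follows all of the
                                                    # owner's earlier updates, so the next
                                                    # owner's updates are ordered after them

        def on_grant(self, who: str) -> None:
            self.holder = who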
39
Multiple locks on unrelated data
[Diagram: several token-style locks on unrelated data circulate independently among A–E; one process crashes.]
40
Replicated services
[Diagram: a replicated service handling queries and updates under a load-balancing scheme.]
41
Replicated services
[Diagram: two schemes: Primary-Backup, where requests go to the primary server, traces keep the backup server up to date, and results are returned; and Coordinator-Cohort.]
42
Other types of tools
  • Publish-Subscribe (sketched below)
  • Every topic as a separate group
  • subscribe = join the group
  • publish = multicast
  • state transfer = load prior postings
  • Rest of the ISIS toolkit
  • News, file storage, job scheduling with load
    sharing, a framework for reactive control apps, etc.
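A minimal sketch of the pub-sub mapping above, assuming the same kind of hypothetical group API as in the earlier sketches (this is not the actual ISIS News interface):

    # Publish-subscribe mapped onto process groups (illustrative pseudo-API;
    # fetch_prior_postings and on_deliver are assumed helpers).
    class Topic:
        def __init__(self, group) -> None:
            self.group = group                  # every topic is a separate group

        def subscribe(self, member: str, on_message) -> None:
            prior = self.group.join(member, self.group.fetch_prior_postings)
            for post in prior:                  # state transfer: load prior postings
                on_message(post)
            self.group.on_deliver(on_message)   # future postings arrive as multicasts

        def publish(self, msg) -> None:
            self.group.cbcast(msg)              # publish is just a multicast to the group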

43
Complaints (Cheriton / Skeen)
  • Other techniques are better: transactions,
    pub./sub.
  • Depends on the applications... we don't compete with
    ACID!
  • A matter of taste
  • Too many costly features, which we may not need
  • Indeed, and we need to be able to use them
    selectively
  • Stackable microprotocols - ideas picked up on in
    Horus
  • End-to-End argument
  • Here taken to the extreme, could be used against
    TCP
  • But indeed, there are overheads; use this stuff
    wisely

44
At what level to apply causality?
  • Communication level (ISIS)
  • An efficient technique that usually captures all
    that matters
  • Speeds up implementation, simplifies design
  • May not always be the most efficient: might sometimes
    over-order (might lead to incidental causality)
  • Not a complete or accurate solution, but often
    just enough
  • What kinds of causality really matter to us?

45
At what level to apply causality?
  • Communication level (ISIS)
  • May not recognize all sources of causality
  • Existence of external channels (shared data,
    external systems)
  • Semantic ordering, recognized / understood only
    by applications
  • Semantics- or State-level
  • Prescribe ordering by the senders (prescriptive
    causality)
  • Timestamping, version numbers

46
Overheads
(skip)
  • Causality information in messages
  • With unicast as a dominant communication pattern,
    it could be a graph, but only in unlikely
    patterns of communication
  • With multicast, it's simply one vector (possibly
    compressed)
  • Usually we have only a few active senders and
    bursty traffic
  • Buffering
  • Overhead linear in N, but with a small constant,
    similar to TCP
  • Buffering is bounded together with the communication
    rate
  • It is needed anyway for failure atomicity (an
    essential feature)
  • Can be efficiently traded for control traffic via
    explicit flushing
  • Can be greatly reduced by introducing
    hierarchical structures

47
Overheads
(skip)
  • Overheads on the critical path
  • Delays in delivery, but in reality comparable to
    those in TCP
  • Arriving out of order is uncommon, a window with
    a few messages
  • Checking / updating causality info, maintaining
    msg buffers
  • A false (non-semantic) causality: messages
    unnecessarily delayed
  • Group membership changes
  • Require agreement and slow multi-phase protocols
  • A tension between latency and bandwidth
  • Do introduce a disruption: delivery of new messages
    is suppressed
  • Costly flushing protocols place load on the
    network

48
Overheads
(skip)
  • Control traffic
  • Acknowledgements, 2nd / 3rd phase: additional
    messages
  • Not on the critical path, but latency matters, as it
    affects buffering
  • Can be piggybacked on other communication
  • Atomic communication, flush, membership changes:
    slowed down to the slowest participant
  • Heterogeneity
  • Need sophisticated protocols to avoid overloading
    nodes
  • Scalability
  • Group size: asymmetric load on senders, large
    timestamps
  • Number of groups: complex protocols, not easy to
    combine

49
Conclusions
  • Virtual synchrony is an intermediate solution...
  • ...less than consistency with serializability, but
    better than nothing.
  • Strong enough for some classes of systems
  • Effective in practice, successfully used in many
    real systems
  • Inapplicable or inefficient in database-style
    settings
  • Not a monolithic scheme...
  • Selected features should be used only based on
    need
  • At the same time, a complete design paradigm
  • Causality isn't so much helpful as an isolated
    feature, but...
  • ...it is a key piece of a much larger picture