1
Virtual Synchrony
  • Krzysztof Ostrowski
  • krzys@cs.cornell.edu

2
A motivating example
[Diagram: SENSORS detect a DANGER and send notifications; the detection leads to a decision made by the GENERALS, who issue orders to a WEAPON for a well-coordinated response.]
3
Requirements for a distributed system (or a
replicated service)
  • Consistent views across components
  • E.g. vice-generals see the same events as the
    chief general
  • Agreement on what messages have been received or
    delivered
  • E.g. each general has same view of the world
    (consistent state)
  • Replicas of the distributed service do not
    diverge
  • E.g. everyone should have same view of membership
  • If a component is unavailable, all others decide
    up or down together

4
Requirements for a distributed system (or a
replicated service)
  • Consistent actions
  • E.g. generals don't contradict each other (don't
    issue conflicting orders)
  • A single service may need to respond to a single
    request
  • Responses to independent requests may need to be
    consistent
  • But consistent ≠ same (same actions ⇒
    determinism ⇒ no fault tolerance)

5
System as a set of groups
[Diagram: the system as a set of groups: client-server groups, a peer group multicasting a decision, and a diffusion group spreading an event.]
6
A process group
  • A notion of process group
  • Members know each other, cooperate
  • Suspected failures → the group restructures itself,
    consistently
  • A failed node is excluded, eventually learns of
    its failure
  • A recovered node rejoins
  • A group maintains a consistent common state
  • Consistency refers only to members
  • Membership is a problem on its own... but it CAN
    be solved

7
A model of a dynamic process group
[Diagram: processes A–F move through membership views while keeping a consistent state; a CRASH removes a member, RECOVER and JOIN add members back after state transfers, and each change installs a new membership view.]
8
The lifecycle of a member (a replica)
[Diagram: the lifecycle of a replica: dead or suspected to be dead → comes up alive, but not in the group (assumed tabula rasa, all information lost) → joins, state is transferred here → in the group, processing requests → unjoins, or fails / becomes unreachable.]
9
The Idea (Roughly)
  • Take some membership protocol, or an external
    service
  • Guarantee consistency in an inductive manner
  • Start in an identical replicated state
  • Apply any changes
  • Atomically, that is either everywhere or nowhere
  • In the exact same order at all replicas
  • Consistency of all actions / responses comes as a
    result
  • Same events seen
  • Rely on ordering and atomicity of failures and
    message delivery

10
The Idea (Roughly)
  • We achieve it by the following primitives (sketched
    below)
  • Lower-level
  • Create / join / leave a group
  • Multicasting: FBCAST, CBCAST / ABCAST (the
    "CATOCS")
  • Higher-level
  • Download current state from the existing active
    replicas
  • Request / release locks (read-only / read-write)
  • Update
  • Read (locally)
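These primitives can be pictured as a small API. A minimal sketch in Python, assuming a hypothetical Group class; the names (join, leave, fbcast, cbcast, abcast, request_lock) are illustrative and not the actual ISIS interface:

    # Illustrative sketch of the primitives above (hypothetical names, not the real ISIS API).
    from typing import Any, Callable

    class Group:
        """A process group offering membership, multicast, state transfer and locks."""

        def __init__(self, name: str) -> None:
            self.name = name
            self.members: list[str] = []          # current membership view

        def join(self, member: str, fetch_state: Callable[[], Any]) -> Any:
            """Join the group; the new member downloads state from an active replica."""
            self.members.append(member)
            return fetch_state()

        def leave(self, member: str) -> None:
            self.members.remove(member)

        def fbcast(self, msg: Any) -> None: ...   # FIFO-ordered multicast
        def cbcast(self, msg: Any) -> None: ...   # causally ordered multicast
        def abcast(self, msg: Any) -> None: ...   # totally ordered ("atomic") multicast

        def request_lock(self, mode: str) -> None: ...   # "read-only" or "read-write"
        def release_lock(self) -> None: ...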

11
Why another approach, though?
  • We have the whole range of other tools
  • Transactions: ACID, one-copy serializability with
    durability
  • Paxos, Chandra-Toueg (FLP-style consensus schemes)
  • All kinds of locking schemes, e.g. two-phase
    locking (2PL)
  • Virtual Synchrony is a point in the space of
    solutions
  • Why are other tools not perfect?
  • Some are very slow: lots of messages, round-trip
    latencies
  • Some limit asynchrony (e.g. transactions at
    commit time)
  • They have us pay a very high cost for features we
    may not need

12
A special class of applications
  • Command / Control
  • Joint Battlespace Infosphere, telecommunications,
  • Distribution / processing / filtering data
    streams
  • Trading system, air traffic control system, stock
    exchange, real-time data for banking,
    risk-management
  • Real-Time Systems
  • Shop floor process control, medical decision
    support, power grid
  • What do they have in common?
  • Distributed, but coordinated, processing and
    control
  • Highly efficient, pipelined distributed data
    processing

13
Distributed trading system
[Diagram: market data feeds and historical data / pricing DBs (1), current pricing (2), and analytics (3) serve trader clients; a long-haul WAN spooler links Tokyo, London, Zurich, ...]
  • Availability for historical data
  • Load balancing and consistent message delivery
    for price distribution
  • Parallel execution for analytics
14
What's special about these systems?
  • Need high performance: we must weaken consistency
  • Data is of a different nature: more dynamic
  • More relevant online, in context
  • Storing it persistently often doesn't make that
    much sense
  • Communication-oriented
  • Online progress: nobody cares about faulty nodes
  • Faulty nodes can be rebooted
  • Rebooted nodes are just spare replicas in the
    common pool

15
Differences (in a Nutshell)
Databases vs. Command / Control:
  • relatively independent programs vs. closely cooperating
    programs, organized into process groups
  • consistent data, (external) strong consistency vs.
    weakened consistency, with the focus instead on making
    online progress
  • persistency, durable operations vs. mostly replicated
    state and control info
  • one-copy serializability vs. serializability w/o
    durability (nobody cares)
  • atomicity of groups of operations vs. atomicity of
    messages, causality
  • heavy-weight, slow mechanisms vs. lightweight ones,
    with stress on responsiveness
  • relationships in data vs. relationships between actions
    and in the sequences of messages
  • multi-phase protocols, ACKs etc. vs. preferably one-way,
    pipelined processing
16
Back to virtual synchrony
  • Our plan
  • Group membership
  • Ordered multicast within group membership views
  • Delivery of new views synchronized with multicast
  • Higher-level primitives

17
A process group: joining / leaving
[Diagram: a group membership protocol (or an external Group Membership Service) installs successive views: V1 = A,B,C; D's request to join leads to sending a new view V2 = A,B,C,D; A's request to leave leads to V3 = B,C,D.]
18
A process group: joining / leaving
  • How it looks in practice
  • Application makes a call to the virtual synchrony
    library
  • Node communicates with other nodes
  • Locates the appropriate group, follows the join
    protocol etc.
  • Obtains state or whatever it needs (see later)
  • Application starts communicating, e.g. joins the
    replica pool

[Diagram: two nodes, each with an Application on top of a V.S. Module on top of the Network; all the virtual synchrony just fits into the protocol stack.]
19
A process group: handling failures
  • We rely on a failure detector (it doesn't
    concern us here)
  • A faulty or unreachable node is simply excluded
    from the group
  • Primary partition model: the group cannot split into
    two active parts

[Diagram: in view V1 = A,B,C,D, A crashes; B realizes that something is wrong with A and initiates the membership change protocol, producing V2 = B,C,D; after A recovers and rejoins, V3 = A,B,C,D.]
20
Causal delivery and vector clocks
[Diagram: processes A–E exchange CBCAST / FBCAST messages stamped with vector clocks; a message stamped (0,1,1,0,0) arriving at a process whose clock is (0,0,0,0,0) cannot be delivered and is delayed until the message stamped (0,0,1,0,0) has been delivered.]
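A minimal sketch of the delivery rule the diagram illustrates, assuming one vector-clock entry per group member and that each message carries its sender's vector timestamp; the function names are illustrative:

    # Causal delivery rule used by CBCAST (vector clocks; illustrative names).
    def can_deliver(msg_vt: list[int], sender: int, local_vt: list[int]) -> bool:
        """Deliver only if this is the next message from its sender and everything
        it causally depends on has already been delivered here."""
        if msg_vt[sender] != local_vt[sender] + 1:
            return False                        # gap: an earlier message from the sender is missing
        return all(msg_vt[k] <= local_vt[k]
                   for k in range(len(local_vt)) if k != sender)

    def deliver(msg_vt: list[int], sender: int, local_vt: list[int]) -> None:
        local_vt[sender] += 1                   # record the delivery in the local clock

    # Example matching the diagram: at a process with clock (0,0,0,0,0), a message
    # stamped (0,1,1,0,0) from B (index 1) is delayed, because it depends on the
    # message stamped (0,0,1,0,0) from C (index 2), which must be delivered first.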
21
What's great about fbcast / cbcast?
  • Can deliver to itself immediately
  • Asynchronous sender: no need to wait for anything
  • No need to wait for delivery order on the receiver
  • Can issue many sends in bursts
  • Since processes are less synchronous...
  • ...the system is more resilient to failures
  • Very efficient, overheads comparable with TCP

22
Asynchronous pipelining
[Diagram: a sender (A) never needs to wait and can send requests to B and C at a high rate; a second run shows that buffering at the sender may reduce overhead.]
23
What's to be careful with?
  • Asynchrony
  • Data accumulates in buffers at the sender
  • Must put limits on it!
  • Explicit flushing (sketched below): send data to the
    others, force it out of the buffers; if it completes,
    the data is safe; needed as a ...
  • A failure of the sender causes lots of updates to
    be lost
  • The sender gets ahead of everybody else... good if
    the others are doing something that doesn't
    conflict
  • Cannot do multiple conflicting tasks without a
    form of locking, while ABCAST can
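A minimal sketch of the sender-side buffering and explicit flushing mentioned above; the transport object and its send_to_all / wait_until_stable calls are assumptions for illustration:

    # Illustrative sender-side buffer with a bound and an explicit flush.
    class AsyncSender:
        def __init__(self, transport, max_unstable: int = 128) -> None:
            self.transport = transport
            self.max_unstable = max_unstable
            self.unstable = []                  # sent, but not yet known safe everywhere

        def send(self, msg) -> None:
            if len(self.unstable) >= self.max_unstable:
                self.flush()                    # limit how far the sender can get ahead
            self.unstable.append(msg)
            self.transport.send_to_all(msg)     # asynchronous: no round trip per message

        def flush(self) -> None:
            """Force buffered data out; when this returns, the data is safe."""
            for msg in self.unstable:
                self.transport.wait_until_stable(msg)
            self.unstable.clear()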

24
Why use causality?
  • The sender need not include context with every
    message
  • One of the very reasons why we use FIFO delivery,
    TCP
  • Could "encode" context in the message, but costly or
    complex
  • Causal delivery simply extends FIFO to multiple
    processes
  • The sender knows that the receiver will have received
    the same msgs
  • Think of a thread "migrating" between servers,
    causing msgs to be sent

25
A migrating thread and FIFO analogy
[Diagram: a thread "migrating" across processes A–E, sending messages along the way; a second view of the same run suggests why this is analogous to FIFO delivery.]
26
Why use causality?
  • Receiver does not need to worry about "gaps"
  • Could be done by looking at "state", but may be
    quite hard
  • Ordering may simply be driven by correctness
  • Synchronizing after every message could be
    unacceptable
  • Reduces inconsistency
  • Doesn't prevent it altogether, but it isn't
    always necessary
  • State-level synchronization can thus be done more
    rarely!
  • All this said... causality makes most sense in
    context

27
Causal vs. total ordering
Note: unlike causal ordering, total ordering may require that local delivery be postponed!
[Diagram: two runs of concurrent multicasts from A and E. "Causal, but not total ordering": some processes deliver them as (A,E) and others as (E,A). "Causal and total ordering": every process delivers them in the same order.]
28
Total ordering: atomic, synchronous
[Diagram: two runs across processes A–E showing totally ordered multicast: delivery is atomic and appears synchronous.]
29
Why total ordering?
  • State machine approach
  • Natural, easy to understand, still cheaper than
    transactions
  • Guarantees atomicity, which is sometimes
    desirable
  • Consider a primary-backup scheme

30
Implementing totally ordered multicast
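The slide's figure is not transcribed; as a minimal sketch, one common way to implement totally ordered multicast is a fixed sequencer that assigns consecutive sequence numbers, with delivery held back until a message's turn comes. The group / fbcast calls are assumed, and a real protocol must also replace a failed sequencer during a membership change:

    # Fixed-sequencer total order (one common implementation; illustrative names).
    def app_deliver(msg) -> None:               # application-level delivery (placeholder)
        print("delivered:", msg)

    class Sequencer:
        """A distinguished member assigns consecutive sequence numbers."""
        def __init__(self, group) -> None:
            self.group = group
            self.next_seq = 0

        def on_data(self, msg) -> None:
            self.group.fbcast(("ORDER", self.next_seq, msg))   # piggyback the payload
            self.next_seq += 1

    class Member:
        """Everyone delivers strictly in sequence-number order."""
        def __init__(self) -> None:
            self.next_to_deliver = 0
            self.held_back: dict[int, object] = {}  # out-of-order messages wait here

        def on_order(self, seq: int, msg) -> None:
            self.held_back[seq] = msg
            while self.next_to_deliver in self.held_back:
                app_deliver(self.held_back.pop(self.next_to_deliver))  # same order everywhere
                self.next_to_deliver += 1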
31
Atomicity of failures
  • Delivery guarantees
  • Among all the surviving processes, delivery
    is all or none (in both cases)
  • Uniform: all-or-none holds also if (but not only if)
    the message was delivered to a crashed node
  • No guarantees for the newly joined

[Diagram: three runs with crashes: a uniform one, a nonuniform one, and a wrong one that violates failure atomicity.]
32
Why atomicity of failures?
  • Reduce complexity
  • After we hear about failure, we need to quickly
    "reconfigure"
  • We like to think in stable epochs...
  • During epochs, failures don't occur
  • Any failure or a membership change begins a new
    epoch
  • Communication does not cross epoch boundaries
  • System does not begin a new epoch before all
    messages are either consistently delivered or all
    consistently forgotten
  • We want to know we got everything the faulty
    process sent, to completely finish the old epoch
    and open a "fresh" one

33
Atomicity: message flushing
[Diagram: after a failure (a logical partitioning), the surviving processes among A–E retransmit their own messages and the messages of the failed nodes while changing membership.]
34
A multi-phase failure-atomic protocol
[Diagram: phases 1–4 across processes A–E: save message, OK to deliver, all have seen, garbage collect; a marked point is where messages are delivered to the applications; one phase is needed only for uniform atomicity, and three phases are always present.]
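As a rough outline only: the phases named above could be sequenced as below. The group calls are assumed, and treating the "all have seen" round as the uniform-only phase is an assumption, not something the slide states:

    # Rough outline of the phases named above (illustrative pseudo-API; a real
    # protocol interleaves these with failure handling and membership changes).
    def failure_atomic_multicast(group, msg, uniform: bool) -> None:
        group.send_and_collect_acks(("SAVE", msg))        # phase: everyone saves a copy
        group.send(("OK_TO_DELIVER", msg))                # phase: deliver to the applications
        if uniform:
            group.send_and_collect_acks(("ALL_HAVE_SEEN", msg))  # extra round, assumed here
                                                          # to be the uniform-only phase
        group.send(("GARBAGE_COLLECT", msg))              # phase: drop the saved copies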
35
Simple tools
  • Replicated services: locking for updates, state
    transfer
  • Divide responsibility for requests: load
    balancing
  • Simpler because of all the communication
    guarantees we get
  • Work partitioning: subdivide tasks within a group
  • Simple schemes for fault-tolerance
  • Primary-Backup, Coordinator-Cohort

36
Simple replication: state machine
  • Replicate all data and actions
  • Simplest pattern of usage for v.s. groups
  • Same state everywhere (state transfer)
  • Updates propagated atomically using ABCAST
    (sketched below)
  • Updates applied in the exact same order
  • Reads or queries can always be served locally
  • Not very efficient, updates too synchronous
  • A little too slow
  • We try to sneak CBCAST into the picture...
  • ...and use it for data locking and for updates
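A minimal sketch of this state-machine pattern, assuming a hypothetical group whose abcast delivers updates in one total order at every replica:

    # State-machine replication over ABCAST (illustrative; the group API is assumed).
    class ReplicatedCounter:
        """Every replica holds the same value and applies updates in the same order."""
        def __init__(self, group) -> None:
            self.group = group
            self.value = 0                      # identical initial state (via state transfer)
            group.on_deliver(self.apply)        # ABCAST delivers in one total order

        def increment(self, amount: int) -> None:
            self.group.abcast(("inc", amount))  # propagate the update atomically

        def apply(self, update) -> None:
            op, amount = update
            if op == "inc":
                self.value += amount            # applied in the exact same order everywhere

        def read(self) -> int:
            return self.value                   # reads and queries are served locally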

37
Updates with token-style locking
We may not need anything more than just causality...
[Diagram: ownership of a shared resource circulates among A–E as a lock; a process requests the lock, the owner grants it, and the new owner performs an update; a crash of one process is also shown.]
38
Updates with token-style locking
[Diagram: a lock request goes to the token owner; the others must confirm (release any read locks) with individual confirmations before the owner grants the lock; updates arrive as messages from the token owner.]
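A minimal sketch of the token idea from the two slides above, assuming a hypothetical group with cbcast; a real scheme also tracks read locks and reassigns the token when its holder crashes:

    # Token-style write lock on top of causal multicast (illustrative only).
    class TokenLock:
        def __init__(self, group, me: str, initial_holder: str) -> None:
            self.group = group
            self.me = me
            self.holder = initial_holder        # current token owner, as known locally

        def request(self) -> None:
            self.group.cbcast(("REQUEST", self.me))

        def on_request(self, who: str) -> None:
            if self.holder == self.me and who != self.me:
                self.group.cbcast(("GRANT", who))   # the grant causally follows all of the
                                                    # owner's earlier updates, so the next
                                                    # owner's updates are ordered after them

        def on_grant(self, who: str) -> None:
            self.holder = who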
39
Multiple locks on unrelated data
[Diagram: several token-style locks on unrelated data circulate independently among A–E; one process crashes.]
40
Replicated services
[Diagram: a replicated service handling queries and updates under a load-balancing scheme.]
41
Replicated services
[Diagram: two schemes: Primary-Backup, where requests go to the primary server, traces keep the backup server up to date, and results are returned; and Coordinator-Cohort.]
42
Other types of tools
  • Publish-Subscribe (sketched below)
  • Every topic as a separate group
  • subscribe = join the group
  • publish = multicast
  • state transfer = load prior postings
  • Rest of the ISIS toolkit
  • News, file storage, job scheduling with load
    sharing, a framework for reactive control apps, etc.
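A minimal sketch of the pub-sub mapping above, assuming the same kind of hypothetical group API as in the earlier sketches (this is not the actual ISIS News interface):

    # Publish-subscribe mapped onto process groups (illustrative pseudo-API;
    # fetch_prior_postings and on_deliver are assumed helpers).
    class Topic:
        def __init__(self, group) -> None:
            self.group = group                  # every topic is a separate group

        def subscribe(self, member: str, on_message) -> None:
            prior = self.group.join(member, self.group.fetch_prior_postings)
            for post in prior:                  # state transfer: load prior postings
                on_message(post)
            self.group.on_deliver(on_message)   # future postings arrive as multicasts

        def publish(self, msg) -> None:
            self.group.cbcast(msg)              # publish is just a multicast to the group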

43
Complaints (Cheriton / Skeen)
  • Other techniques are better: transactions,
    pub./sub.
  • Depends on the applications... we don't compete with
    ACID!
  • A matter of taste
  • Too many costly features, which we may not need
  • Indeed, and we need to be able to use them
    selectively
  • Stackable microprotocols - ideas picked up on in
    Horus
  • End-to-End argument
  • Here taken to the extreme, could be used against
    TCP
  • But indeed, there are overheads; use this stuff
    wisely

44
At what level to apply causality?
  • Communication level (ISIS)
  • An efficient technique that usually captures all
    that matters
  • Speeds up implementation, simplifies design
  • May not always be the most efficient: might sometimes
    over-order (might lead to incidental causality)
  • Not a complete or accurate solution, but often
    just enough
  • What kinds of causality really matter to us?

45
At what level to apply causality?
  • Communication level (ISIS)
  • May not recognize all sources of causality
  • Existence of external channels (shared data,
    external systems)
  • Semantic ordering, recognized / understood only
    by applications
  • Semantics- or State-level
  • Prescribe ordering by the senders (prescriptive
    causality)
  • Timestamping, version numbers

46
Overheads
(skip)
  • Causality information in messages
  • With unicast as a dominant communication pattern,
    it could be a graph, but only in unlikely
    patterns of communication
  • With multicast, it's simply one vector (possibly
    compressed)
  • Usually we have only a few active senders and
    bursty traffic
  • Buffering
  • Overhead linear in N, but with a small constant,
    similar to TCP
  • Buffering is bounded together with the communication
    rate
  • It is needed anyway for failure atomicity (an
    essential feature)
  • Can be efficiently traded for control traffic via
    explicit flushing
  • Can be greatly reduced by introducing
    hierarchical structures

47
Overheads
(skip)
  • Overheads on the critical path
  • Delays in delivery, but in reality comparable to
    those in TCP
  • Arriving out of order is uncommon, a window with
    a few messages
  • Checking / updating causality info, maintaining
    msg buffers
  • A false (non-semantic) causality: messages
    unnecessarily delayed
  • Group membership changes
  • Require agreement and slow multi-phase protocols
  • A tension between latency and bandwidth
  • Do introduce a disruption: delivery of new messages
    is suppressed
  • Costly flushing protocols place load on the
    network

48
Overheads
(skip)
  • Control traffic
  • Acknowledgements, 2nd / 3rd phase: additional
    messages
  • Not on the critical path, but latency matters, as it
    affects buffering
  • Can be piggybacked on other communication
  • Atomic communication, flush, membership changes:
    slowed down to the slowest participant
  • Heterogeneity
  • Need sophisticated protocols to avoid overloading
    nodes
  • Scalability
  • Group size: asymmetric load on senders, large
    timestamps
  • Number of groups: complex protocols, not easy to
    combine

49
Conclusions
  • Virtual synchrony is an intermediate solution...
  • ...less than consistency with serializability, but
    better than nothing.
  • Strong enough for some classes of systems
  • Effective in practice, successfully used in many
    real systems
  • Inapplicable or inefficient in database-style
    settings
  • Not a monolithic scheme...
  • Selected features should be used only based on
    need
  • At the same time, a complete design paradigm
  • Causality isn't so much helpful as an isolated
    feature, but...
  • ...it is a key piece of a much larger picture