QuickSilver: Middleware for Scalable Self-Regenerative Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
QuickSilver: Middleware for Scalable
Self-Regenerative Systems
  • Cornell University: Ken Birman, Johannes Gehrke,
    Paul Francis, Robbert van Renesse, Werner Vogels
  • Raytheon Corporation: Lou DiPalma, Paul Work

2
Our topic
  • Computing systems are growing larger and more
    complex, and we are hoping to use them in a more
    and more "unattended" manner
  • But the technology for managing growth and
    complexity is lagging

3
Our goal
  • Build a new platform in support of massively
    scalable, self-regenerative applications
  • Demonstrate it by offering a specific military
    application interface
  • Work with Raytheon to apply it in other military
    settings

4
Representative scenarios
  • Massive data centers maintained by the military
    (or by companies like Amazon)
  • Enormous publish-subscribe information bus
    systems (broadly, OSD calls these GIG and NCES
    systems)
  • Deployments of large numbers of lightweight
    sensors
  • New network architectures to control autonomous
    vehicles over media shared with other mundane
    applications

5
How to approach the problem?
  • Web Services architecture has emerged as a likely
    standard for large systems
  • But WS is document-oriented and lacks:
  • High availability (or any kind of quick-response
    guarantees)
  • A convincing scalability story
  • Self-monitoring/adaptation features

6
Signs of trouble?
  • Most technologies are pushed way beyond their
    normal scalability limits in this kind of center:
    we are good at small clusters, but not huge ones
  • Pub-sub was a big hit. No longer
  • Curious side-bar: pub-sub is used heavily for
    point-to-point communication! (Why?)
  • Extremely hard to diagnose problems

7
We lack the right tools!
  • Today, our applications navigate in the dark
  • They lack a way to find things
  • They lack a way to sense system state
  • There are no rules for adaptation, if/when needed
  • In effect, we are starting to build very big
    systems, yet doing so in the usual client-server
    manner
  • This denies applications any information about
    system state, configuration, loads, etc.

8
QuickSilver
  • QuickSilver: a platform to help developers build
    these massive new systems
  • It has four major components:
  • Astrolabe: a novel kind of virtual database
  • Bimodal Multicast: for faster "few to many" data
    transfer patterns
  • Kelips: a fast lookup mechanism
  • Group replication technologies based on virtual
    synchrony or other similar models

9
QuickSilver Architecture
[Architecture diagram: pub-sub (JMS, JBI) and native
APIs layered over distributed query/event detection,
massively scalable group communication, composable
microprotocol stacks, monitoring/indexing, a message
repository, and overlay networks]
10
Astrolabe's role is to collect and report system
state, which is used for many purposes, including
self-configuration and repair.
11
What does Astrolabe do?
  • Astrolabe's role is to track information residing
    at a vast number of sources
  • Structured to look like a database
  • Approach: peer-to-peer gossip. Basically, each
    machine has a piece of a jigsaw puzzle; Astrolabe
    assembles it on the fly.

12
Astrolabe in a single domain
[Figure: the domain's table, one row per machine,
with sample load values between 0.8 and 5.3]
  • Row can have many columns
  • Total size should be kilobytes, not megabytes
  • Configuration certificate determines what data is
    pulled into the table (and can change)

13
So how does it work?
  • Each computer has
  • Its own row
  • Replicas of some objects (configuration
    certificate, other rows, etc.)
  • Periodically, but at a fixed rate, pick a friend
    pseudo-randomly and exchange states efficiently
    (bound the size of data exchanged); see the
    sketch below
  • States converge exponentially rapidly
  • Loads are low and constant, and the protocol is
    robust against all sorts of disruptions!
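
A minimal sketch of that exchange, assuming a toy
dict-of-rows representation with per-row timestamps
(class and field names here are illustrative, not
Astrolabe's actual API):

    import random
    import time

    class AstrolabeNode:
        """Toy model of one Astrolabe agent in a single domain."""

        def __init__(self, name, peers):
            self.name = name
            self.peers = peers  # other AstrolabeNode objects in the domain
            # One row per machine: name -> (timestamp, attributes)
            self.table = {name: (time.time(), {"load": 0.0})}

        def update_own_row(self, attrs):
            # Each machine may write only its own row.
            self.table[self.name] = (time.time(), attrs)

        def gossip_round(self):
            # At a fixed rate, pick a peer pseudo-randomly and merge
            # states in both directions.
            peer = random.choice(self.peers)
            self._merge(peer.table)
            peer._merge(self.table)

        def _merge(self, remote_table):
            # State merge: keep the fresher copy of every row, so all
            # tables in the domain converge.
            for row, (ts, attrs) in remote_table.items():
                if row not in self.table or ts > self.table[row][0]:
                    self.table[row] = (ts, attrs)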

14
State Merge: Core of the Astrolabe epidemic
[Animation spanning slides 14-16: swift.cs.cornell.edu
and cardinal.cs.cornell.edu gossip, exchange tables,
and each keeps the fresher copy of every row]
17
Observations
  • Merge protocol has constant cost
  • One message sent and received (on average) per
    unit time
  • The data changes slowly, so there is no need to
    run the protocol quickly; we usually run it every
    five seconds or so
  • Information spreads in O(log N) time (see the
    argument below)
  • But this assumes bounded region size
  • In Astrolabe, we limit regions to 50-100 rows
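
Why O(log N)? A back-of-the-envelope push-gossip
argument (standard epidemic analysis, not Astrolabe's
exact protocol): while an update is known to few
nodes, each round roughly doubles the number of nodes
that know it,

    I_{t+1} \approx 2 I_t \quad\Longrightarrow\quad I_t \approx 2^t ,

so reaching all N nodes takes about \log_2 N rounds.
(The classical analysis of push gossip sharpens this
to \log_2 N + \ln N + O(1) rounds with high
probability.)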

18
Scaling up and up
  • With a stack of domains, we don't want every
    system to see every domain
  • Cost would be huge
  • So instead, we'll see a summary

[Figure: cardinal.cs.cornell.edu sees summaries of
remote domains rather than their full tables]
19
Build a hierarchy using a P2P protocol that assembles
the puzzle without any servers. An SQL query
summarizes the data in each domain, and the
dynamically changing query output is visible
system-wide.
[Figure: leaf domains in New Jersey and San Francisco
roll up into a single root domain]
20
(1) The query goes out, (2) each domain computes
locally, and (3) results flow to the top level of the
hierarchy; a sketch of step (2) follows.
[Figure: steps 1-3 across the New Jersey and San
Francisco domains]
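
Here is that local-aggregation step with an invented
schema: each region collapses its rows into one
summary row, the way an SQL aggregate such as
MIN(load) or SUM(free_disk) would:

    # Hypothetical aggregation function: collapse a leaf domain's rows
    # into one summary row for the parent domain.
    def aggregate(rows):
        return {
            "load": min(r["load"] for r in rows),            # least-loaded machine
            "free_disk": sum(r["free_disk"] for r in rows),  # total free disk
        }

    new_jersey = [{"load": 1.9, "free_disk": 120},
                  {"load": 3.1, "free_disk": 80}]
    san_francisco = [{"load": 0.9, "free_disk": 200},
                     {"load": 2.7, "free_disk": 50}]

    # (3) the per-region summaries flow to the top of the hierarchy
    root = {"NJ": aggregate(new_jersey), "SF": aggregate(san_francisco)}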
21
Hierarchy is virtual; the data is replicated
[Figure repeated on slides 21-22: the New Jersey and
San Francisco domains each hold replicas of the
aggregated hierarchy, so no server stores it
centrally]
23
The key to self-* properties!
  • A flexible, reprogrammable mechanism:
  • "Which clustered services are experiencing
    timeouts, and what were they waiting for when
    they happened?"
  • "Find 12 idle machines with the NMR-3D package
    that can download a 20MB dataset rapidly"
  • "Which machines have inventory for warehouse 9?"
  • "Where's the cheapest gasoline in the area?"
  • Think of aggregation functions as small agents
    that look for information

24
What about security?
  • Astrolabe requires
  • Read permissions to see the database
  • Write permissions to contribute data
  • Administrative permission to change aggregation
    or configuration certificates
  • Users decide what data Astrolabe can see
  • A VPN setup can be used to hide Astrolabe's
    internal messages from intruders
  • New: Byzantine Agreement based on threshold
    cryptography is used to secure aggregation
    functions

25
Data Mining
  • Quite a hot area, usually done by collecting
    information to a centralized node, then querying
    within that node
  • Astrolabe does the comparable thing, but its
    query evaluation occurs in a decentralized manner
  • This is massively parallel, hence faster
  • And more robust against disruption, too!

26
Cool Astrolabe Properties
  • Parallel: everyone does a tiny bit of work, so we
    accomplish huge tasks in seconds
  • Flexible: decentralized query evaluation, in
    seconds
  • One aggregate can answer lots of questions. For
    "where's the nearest supply shed?", say, the
    hierarchy encodes many answers in one tree!

27
Aggregation and Hierarchy
  • Nearby information
  • Maintained in more detail; can query it directly
  • Changes seen sooner
  • Remote information is summarized
  • High-quality aggregated data
  • This also changes as information evolves

28
Astrolabe summary
  • Scalable: could support millions of machines
  • Flexible: can easily extend the domain hierarchy,
    define new columns, or eliminate old ones. Adapts
    as conditions evolve.
  • Secure:
  • Uses keys for authentication and can even encrypt
  • Handles firewalls gracefully, including issues of
    IP address re-use behind firewalls
  • Performs well: updates propagate in seconds
  • Cheap to run: tiny load, small memory impact

29
Bimodal Multicast
  • A quick glimpse of scalable multicast
  • Think about really large Internet configurations
  • A data center as the data source
  • Typical publication might be going to thousands
    of client systems

30
Swiss Stock Exchange problem: virtually synchronous
multicast is fragile
[Figure: most members are healthy, but a single slow
or perturbed member drags down the whole group]
31
Performance degrades as the system scales up
[Graph: average throughput at non-perturbed members
vs. perturb rate (0 to 0.9) for virtually synchronous
Ensemble multicast protocols at group sizes 32, 64,
and 96; throughput falls steeply as the perturb rate
and the group size grow]
32
Why doesn't multicast scale?
  • With weak semantics
  • Faulty behavior may occur more often as system
    size increases (think of the Internet)
  • With stronger reliability semantics
  • We encounter a system-wide cost (e.g. membership
    reconfiguration, congestion control)
  • That can be triggered more often as a function of
    scale (more failures, more network events, bigger
    latencies)
  • A similar observation led Jim Gray to speculate
    that parallel databases scale as O(n²)

33
But none of this is inevitable
  • Recent work on probabilistic solutions suggests
    that a gossip-based repair strategy scales quite
    well
  • It also gives very steady throughput
  • And it can take advantage of hardware support for
    multicast, if available

34
Start by using unreliable multicast to rapidly
distribute the message. But some messages may not get
through, and some processes may be faulty. So the
initial state involves partial distribution of the
multicast(s).
35
Periodically (e.g. every 100ms) each process sends a
digest describing its state to some randomly selected
group member. The digest identifies messages; it
doesn't include them.
36
Recipient checks the gossip digest against its
own history and solicits a copy of any missing
message from the process that sent the gossip
37
Processes respond to solicitations received during a
round of gossip by retransmitting the requested
message. The round lasts much longer than a typical
RPC time. A sketch of one such round follows.
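
A minimal sketch of the digest/solicit/retransmit
loop, with invented names (real Bimodal Multicast
adds rate limits, bounded retransmission, and message
garbage collection):

    import random

    class PbcastProcess:
        """Toy model of one group member's gossip repair loop."""

        def __init__(self, pid, group):
            self.pid = pid
            self.group = group    # list of all PbcastProcess objects
            self.messages = {}    # seqno -> payload, possibly with gaps

        def gossip_round(self):
            # Every ~100 ms: send a digest (message ids only, not
            # bodies) to a randomly selected group member.
            peer = random.choice([p for p in self.group if p is not self])
            peer.on_digest(self, set(self.messages))

        def on_digest(self, sender, their_seqnos):
            # Check the digest against our own history and solicit a
            # copy of each missing message from the gossip's sender.
            for seqno in their_seqnos - set(self.messages):
                self.messages[seqno] = sender.messages[seqno]  # retransmission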
38
This solves our problem!
[Two graphs: low-bandwidth and high-bandwidth
comparisons of pbcast performance at faulty and
correct hosts, plotting average throughput vs.
perturb rate (0.1 to 0.9). Throughput of the
traditional virtually synchronous protocol collapses
at both perturbed and unperturbed hosts as the
perturb rate grows, while pbcast holds nearly steady:
Bimodal Multicast rides out disturbances!]
39
Bimodal Multicast Summary
  • An extremely scalable technology
  • Remains steady and reliable
  • Even with high rates of message loss (in our
    tests, as high as 20%)
  • Even with large numbers of perturbed processes
    (we tested with up to 25%)
  • Even with router failures
  • Even when IP multicast fails
  • And we've secured it using digital signatures

40
Kelips
  • Third in our set of tools
  • A P2P index:
  • Put(name, value)
  • Get(name)
  • Kelips can do lookups with one RPC, and is
    self-stabilizing after disruption
  • Unlike Astrolabe, nodes can put varying amounts
    of data out there

41
Kelips
Take a collection of nodes
[Figure: nodes 110, 230, 202, and 30]
42
Kelips
Map nodes to affinity groups
[Figure: affinity groups 0, 1, and 2; peer membership
is determined by a consistent hash, giving roughly
N/√N members per affinity group, here nodes 110, 230,
202, and 30]
43
Kelips
110 knows about the other members of its own affinity
group: 230 and 30
[Figure: 110's "affinity group view" holds pointers
to 230 and 30]
44
Kelips
202 is a contact for 110 in group 2
[Figure: 110's contact list holds a pointer to 202, a
member of affinity group 2]
45
Kelips
"dot.com" maps to group 2, so 110 tells group 2 to
route inquiries about dot.com to it
[Figure: the resource tuple (dot.com -> 110) is
stored in affinity group 2; a gossip protocol
replicates the data cheaply within the group]
46
Kelips
To look up "dot.com", just ask some contact in group
2. It returns 110 (or forwards your request).
[Figure: one RPC from the querier to its group-2
contact resolves the lookup]
47
Kelips summary
  • Split the system into √N subgroups
  • Map (key, value) pairs to some subgroup by
    hashing the key
  • Replicate within that subgroup
  • Each node tracks
  • Its own group membership
  • k members of each of the other groups
  • To look up a key, hash it and ask one or more of
    your contacts if they know the value

48
Kelips summary
  • O(√N) storage overhead, which is higher than for
    other DHTs
  • Same space overhead for the member list, the
    contact list, and the replicated data itself
  • A heuristic is used to keep contacts fresh and to
    avoid contacts that seem to churn
  • This buys us O(1) lookup cost
  • And background overhead is constant (see the
    sketch below)
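
A sketch of the resulting put/get path, with the
consistent hash reduced to a plain hash mod √N and a
contact modeled as direct access to its group's
replicated store (names are illustrative):

    import hashlib
    import math

    N = 100
    GROUPS = math.isqrt(N)       # sqrt(N) affinity groups

    def affinity_group(name: str) -> int:
        # Hash the key to one of the sqrt(N) groups.
        h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
        return h % GROUPS

    # Each group gossip-replicates the tuples that hash to it; here a
    # whole group is modeled as one shared dict.
    stores = {g: {} for g in range(GROUPS)}

    def put(name, value):
        stores[affinity_group(name)][name] = value

    def get(name):
        # One RPC in the real protocol: ask any contact in the group.
        return stores[affinity_group(name)].get(name)

    put("dot.com", "node 110")
    assert get("dot.com") == "node 110"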

49
Virtual Synchrony
  • Last piece of the puzzle
  • Outcome of a decade of DARPA-funded work; the
    technology is at the core of:
  • The AEGIS integrated console
  • The New York and Swiss Stock Exchanges
  • The French Air Traffic Control System
  • The Florida Electric Power and Light System

50
Virtual Synchrony Model
[Figure: the virtual synchrony execution model]
51
Roles in QuickSilver?
  • Provides a way for groups of components to
  • Replicate data and synchronize
  • Perform tasks in parallel (like parallel database
    lookups, for improved speed)
  • Detect failures, then reconfigure to compensate
    by regenerating lost functionality (see the
    sketch below)
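
One way to picture the group abstraction (a toy
model, not QuickSilver's actual API): every member
sees the same sequence of views and multicasts, so
when a view change removes a failed member, the
survivors can agree deterministically on who
regenerates its work.

    class Member:
        def __init__(self, name):
            self.name = name

        def deliver(self, msg, view):
            # All members deliver the same messages in the same views.
            print(f"{self.name} got {msg!r} in view "
                  f"{[m.name for m in view]}")

        def on_view_change(self, view):
            # Rank in the agreed-upon view picks a successor without
            # extra protocol: the lowest rank regenerates lost state.
            if view[0] is self:
                print(f"{self.name} takes over the failed member's role")

    class Group:
        """Toy virtually synchronous process group."""

        def __init__(self, members):
            self.view = list(members)   # identical order at every member

        def multicast(self, msg):
            for m in self.view:
                m.deliver(msg, self.view)

        def fail(self, member):
            # Failure detection installs a new view at every survivor
            # at the same point in the message stream.
            self.view.remove(member)
            for m in self.view:
                m.on_view_change(self.view)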

52
Replication: the key to understanding QuickSilver
[Figure: Astrolabe, Bimodal Multicast, Kelips, and
Virtual Synchrony all build on replication]
53
Metrics
  • We plan to look at several:
  • Robustness to externally imposed stress and
    overload: we expect to demonstrate significant
    improvements
  • Scalability: graph performance/overheads as a
    function of scale, load, etc.
  • End-user power: implement JBI, sensor networks,
    and a data-center management platform
  • Total cost: with Raytheon, explore impact on real
    military applications
  • Under DURIP funding, we have acquired a clustered
    evaluation platform

54
Our plan
  • Integrate these core components
  • Then:
  • Build a JBI layer over the system
  • Integrate Johannes Gehrke's data-mining
    technology into the platform
  • Support scalable overlay multicast (Francis)
  • Raytheon: teaming with us to tackle military
    applications, notably for the Navy

55
More information?
  • www.cs.cornell.edu/Info/Projects/QuickSilver