Title: Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker
1A Scalable, Content-Addressable Network
- Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker
Tahoe Networks, U.C. Berkeley, ACIRI
2Outline
- Introduction
- Design
- Evaluation
- Strengths & Weaknesses
3Internet-scale hash tables
- Hash tables
- essential building block in software systems
- Internet-scale distributed hash tables
- equally valuable to large-scale distributed systems?
- peer-to-peer systems
- Napster, Gnutella, Groove, FreeNet, MojoNation
- large-scale storage management systems
- Publius, OceanStore, PAST, Farsite, CFS ...
- mirroring on the Web
4Content-Addressable Network(CAN)
- CAN: Internet-scale hash table
- Interface
- insert(key,value)
- value = retrieve(key)
- Properties
- scalable
- operationally simple
- good performance
- Related systems: Chord/Pastry/Tapestry/Buzz/Plaxton ...
5Problem Scope
- Design a system that provides the interface
- scalability
- robustness
- performance
- security
- Application-specific, higher level primitives
- keyword searching
- mutable content
- anonymity
6Outline
- Introduction
- Design
- Evaluation
- Strengths & Weaknesses
- Ongoing Work
7CAN basic idea
[Figure sequence, slides 7-11: insert(K1,V1) is issued, the pair (K1,V1) is stored at a node in the network, and retrieve(K1) is later issued for it.]
12CAN solution
- virtual Cartesian coordinate space
- entire space is partitioned amongst all the nodes
- every node owns a zone in the overall space
- abstraction
- can store data at points in the space
- can route from one point to another
- point = node that owns the enclosing zone
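The abstraction above can be sketched in a few lines. This is only an illustrative model, not the authors' implementation: the `Zone` class, the `owner` helper, the half-open box convention, and the four node names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Axis-aligned box [lo, hi) in the coordinate space (hypothetical model)."""
    lo: tuple
    hi: tuple

    def contains(self, point):
        # A point belongs to the zone if every coordinate falls in [lo, hi).
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))


def owner(zones, point):
    """Return the name of the node whose zone encloses `point`."""
    for name, zone in zones.items():
        if zone.contains(point):
            return name
    raise ValueError("point lies outside the coordinate space")


# Four hypothetical nodes partition the 2-d unit square into quadrants.
zones = {
    "A": Zone((0.0, 0.0), (0.5, 0.5)),
    "B": Zone((0.5, 0.0), (1.0, 0.5)),
    "C": Zone((0.0, 0.5), (0.5, 1.0)),
    "D": Zone((0.5, 0.5), (1.0, 1.0)),
}
print(owner(zones, (0.7, 0.2)))  # B
```

Because the zones tile the whole space without overlap, every point maps to exactly one owning node.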
13CAN simple example
node I: insert(K,V)
(1) a = hx(K), b = hy(K)
(2) route(K,V) -> (a,b)
(3) (a,b) stores (K,V)
[Figure, built up over slides 13-17: node I in the 2-d coordinate space, with x = a and y = b marking the point (a,b) where (K,V) ends up.]
18CAN simple example
node J: retrieve(K)
(1) a = hx(K), b = hy(K)
(2) route retrieve(K) to (a,b)
[Figure: node J routes its request to the point (a,b), where (K,V) is stored.]
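A minimal sketch of the insert/retrieve steps, under stated assumptions: the slides do not say which hash functions hx and hy are, so salted SHA-256 is used here purely for illustration, and the `ToyCAN` class replaces real routing with a direct lookup of the owning cell in a fixed grid.

```python
import hashlib


def _h(key, salt):
    """Hash `key` to a coordinate in [0, 1]; the salted SHA-256 is an assumption."""
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def key_to_point(key):
    # (1) a = hx(K), b = hy(K)
    return (_h(key, "x:"), _h(key, "y:"))


class ToyCAN:
    """Hypothetical single-process stand-in: a grid x grid set of equal zones."""

    def __init__(self, grid=2):
        self.grid = grid
        self.store = {}  # zone index -> {key: value}

    def owner(self, point):
        # Index of the grid cell (zone) that encloses the point.
        return tuple(min(int(c * self.grid), self.grid - 1) for c in point)

    def insert(self, key, value):
        # (2) locate (a,b); (3) the zone enclosing (a,b) stores (K,V)
        self.store.setdefault(self.owner(key_to_point(key)), {})[key] = value

    def retrieve(self, key):
        # The same hashes lead any node to the same point, hence the same zone.
        return self.store.get(self.owner(key_to_point(key)), {}).get(key)


can = ToyCAN()
can.insert("song.mp3", "10.0.0.7")
print(can.retrieve("song.mp3"))
```

The point of the sketch is that insert and retrieve never exchange a location: both derive (a,b) from the key alone.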
19CAN
- Data stored in the CAN is addressed by name
(i.e. key), not location (i.e. IP address)
20CAN routing table
21CAN routing
[Figure: routing between the points (x,y) and (a,b) in the coordinate space.]
22CAN routing
- A node only maintains state for its immediate neighboring nodes
- Compared to geographical routing
- can be considered as greedy forwarding in Cartesian space instead of physical space.
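Greedy forwarding with neighbor-only state can be illustrated on a simplified layout. The sketch below assumes a k x k grid of equal zones without wrap-around and uses zone centers as representative points; these simplifications and all function names are hypothetical, not taken from the slides.

```python
import math


def center(cell, k):
    """Center point of grid cell (i, j) in the unit square."""
    return ((cell[0] + 0.5) / k, (cell[1] + 0.5) / k)


def neighbors(cell, k):
    """Cells sharing an edge with `cell`: the only state a node keeps."""
    i, j = cell
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < k and 0 <= b < k]


def owner_cell(point, k):
    return tuple(min(int(c * k), k - 1) for c in point)


def greedy_route(start, target, k):
    """Forward hop by hop to the neighbor closest to the target point."""
    path, current = [start], start
    destination = owner_cell(target, k)
    while current != destination:
        current = min(neighbors(current, k),
                      key=lambda n: math.dist(center(n, k), target))
        path.append(current)
        if len(path) > 2 * k:  # guard against ties on zone boundaries
            raise RuntimeError("routing did not converge")
    return path


path = greedy_route((0, 0), (0.9, 0.9), k=4)
print(len(path) - 1, "hops:", path)
```

Each node decides using only its own neighbors' coordinates, which is why per-node state stays constant as the network grows.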
23CAN node insertion
1) discover some node I already in CAN
2) pick random point (p,q) in space
3) I routes to (p,q), discovers node J
4) split J's zone in half; new node owns one half
[Figures, slides 23-27: the new node, a bootstrap node, node I, the point (p,q), and node J, whose zone is split.]
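Step 4 can be sketched as follows. The slides only say that J's zone is split in half, so two details here are assumptions: the split is made along the zone's longest side, and the new node receives the upper half. Zones are modeled as `(lo, hi)` tuples and the lookup of J stands in for routing.

```python
def contains(zone, point):
    lo, hi = zone
    return all(l <= p < h for l, p, h in zip(lo, point, hi))


def split_in_half(zone):
    """Split a (lo, hi) box along its longest side (assumed rule)."""
    lo, hi = zone
    sides = [h - l for l, h in zip(lo, hi)]
    dim = sides.index(max(sides))
    mid = (lo[dim] + hi[dim]) / 2
    lower = (lo, hi[:dim] + (mid,) + hi[dim + 1:])
    upper = (lo[:dim] + (mid,) + lo[dim + 1:], hi)
    return lower, upper


def join(zones, new_node, point):
    """Find node J owning `point`, split its zone, give one half to `new_node`."""
    j = next(name for name, zone in zones.items() if contains(zone, point))
    zones[j], zones[new_node] = split_in_half(zones[j])
    return j


zones = {"J": ((0.0, 0.0), (1.0, 1.0))}
join(zones, "new", (0.3, 0.8))
print(zones)
```

Only J's entry changes and one entry is added, which matches the claim that a join touches a single existing node (plus neighbor updates, not modeled here).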
28CAN node insertion
- Inserting a new node affects only a single other node and its immediate neighbors
- Problem
- Inefficient if the new node and its neighbor (J) are far away from each other in terms of communication.
29CAN node failures
- Need to repair the space
- recover database (weak point)
- soft-state updates
- use replication, rebuild database from replicas
- repair routing
- takeover algorithm
30CAN takeover algorithm
- Simple failures
- know your neighbors' neighbors
- a node periodically broadcasts its zone coordinates and a list of its neighbors and their zone coordinates.
- when a node fails, one of its neighbors takes over its zone
- a self-set timer decides which neighbor takes over.
- More complex failure modes
- simultaneous failure of multiple adjacent nodes
- scoped flooding to discover neighbors
- hopefully, a rare event
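The slides do not say how the self-set timer is chosen. The sketch below assumes, as the CAN paper describes, that each neighbor starts a timer proportional to the volume of its own zone, so the neighbor with the smallest zone fires first and takes over. The scaling constant and all names are hypothetical.

```python
def zone_volume(zone):
    """Volume of a (lo, hi) box."""
    lo, hi = zone
    volume = 1.0
    for l, h in zip(lo, hi):
        volume *= h - l
    return volume


def takeover_winner(neighbor_zones, seconds_per_unit_volume=10.0):
    """Return the neighbor whose timer expires first, plus all timer values.

    Assumption: timer = constant * own zone volume, so small zones win.
    """
    timers = {name: seconds_per_unit_volume * zone_volume(zone)
              for name, zone in neighbor_zones.items()}
    return min(timers, key=timers.get), timers


# Hypothetical neighbors of a failed node.
neighbor_zones = {
    "A": ((0.0, 0.0), (0.5, 0.5)),
    "B": ((0.5, 0.0), (0.75, 0.5)),
    "C": ((0.0, 0.5), (0.5, 1.0)),
}
winner, timers = takeover_winner(neighbor_zones)
print(winner, timers)
```

Favoring the smallest zone also nudges the partition back toward equal volumes after a failure.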
31CAN node failures
- Only the failed node's immediate neighbors are required for recovery
32Design recap
- Basic CAN
- completely distributed
- self-organizing
- nodes only maintain state for their immediate neighbors
- Comment
- basic CAN does not work very well; additional design features are necessary
33Design improvements
- The neighboring relationship in coordinate space may be completely different from that in the underlying IP network.
- How can coordinate space approximately map to physical space?
- Topologically-sensitive CAN construction
- distributed binning
34Distributed Binning
- Goal
- bin nodes such that co-located nodes land in the same bin
- neighbors in the coordinate space are likely close in the IP network
- reduce per-hop latency, avoid overlay-network routing anomalies
- Idea
- well known set of landmark machines
- each CAN node measures its RTT to each landmark
- orders the landmarks in order of increasing RTT
- CAN construction
- place nodes from the same bin close together on the CAN
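The binning idea can be sketched as: a node's bin is the ordering of the landmarks by increasing RTT. The landmark names, node names, and RTT values below are made up for illustration, and the final placement of bins in the coordinate space is not modeled.

```python
from collections import defaultdict


def landmark_bin(rtts_ms):
    """Bin = landmark names sorted by increasing measured RTT."""
    return tuple(sorted(rtts_ms, key=rtts_ms.get))


# Hypothetical RTT measurements (milliseconds) from three nodes to three landmarks.
measurements = {
    "node1": {"L1": 10.0, "L2": 50.0, "L3": 30.0},
    "node2": {"L1": 12.0, "L2": 80.0, "L3": 25.0},
    "node3": {"L1": 90.0, "L2": 5.0, "L3": 40.0},
}

bins = defaultdict(list)
for node, rtts in measurements.items():
    bins[landmark_bin(rtts)].append(node)

for ordering, members in bins.items():
    print(ordering, members)
```

With m landmarks there are at most m! orderings, so nodes with similar network positions tend to share a bin without any central coordination.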
35Distributed Binning
- 4 Landmarks (placed 5 hops away from each other)
- naïve partitioning
[Figure: latency stretch (y-axis, 5 to 20) vs. number of nodes (256, 1K, 4K), without and with binning; panels for dimensions = 2 and dimensions = 4.]
36Design improvements
- Multi-dimensional coordinate spaces
- To reduce path length
- path length is $O(d\,n^{1/d})$
- Hash function more complex?
37Design improvements
- Multiple, independent spaces (realities)
- To forward a message, a node checks all its neighbors on each reality instead of one reality, and does greedy forwarding.
- Reduce routing path length
- other benefits
- Improve data availability (the hash table is replicated on each reality).
- Improve routing fault tolerance.
38Design improvements
- Better CAN routing metrics.
- Use RTT instead of Cartesian distance when selecting a next hop neighbor.
- To improve per-hop (CAN hop) latency
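One way to fold RTT into next-hop selection is sketched below. It assumes the rule described in the CAN paper, where a node forwards to the neighbor with the best ratio of progress (distance gained toward the target) to RTT; the slide itself does not spell out the rule. Neighbors are represented by single points, and all names and values are hypothetical.

```python
import math


def next_hop(current, target, neighbors):
    """Pick the neighbor maximizing progress / RTT (assumed metric).

    `neighbors` maps a name to (representative point, RTT in ms).
    Neighbors that do not move closer to the target are ignored.
    """
    here = math.dist(current, target)
    best, best_ratio = None, 0.0
    for name, (point, rtt_ms) in neighbors.items():
        progress = here - math.dist(point, target)
        if progress > 0 and progress / rtt_ms > best_ratio:
            best, best_ratio = name, progress / rtt_ms
    return best


neighbors = {
    "east": ((0.30, 0.1), 80.0),   # most progress, but a slow link
    "east2": ((0.25, 0.1), 20.0),  # slightly less progress, much faster link
    "north": ((0.10, 0.3), 10.0),  # fast link, but moves away from the target
}
print(next_hop((0.1, 0.1), (0.9, 0.1), neighbors))  # east2
```

Plain greedy forwarding would choose "east" here; weighting by RTT trades a little progress per hop for a cheaper hop.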
39Design improvements
- Overloading coordinate zones
- allow multiple nodes to share the same zone.
- A node maintains a list of its peers in addition to its neighbors.
- A node selects one neighbor from the peers of the neighboring zone
- The contents of the hash table may be either divided or replicated across the nodes in a zone.
- Reduce path length
- reduces the number of zones
- reduce per-hop latency
- a node has more choice in selecting a neighbor.
40CAN load balancing
- Two pieces
- Dealing with hot-spots
- popular (key,value) pairs
- nodes cache recently requested entries
- overloaded node replicates popular entries at neighbors
- Need to deal with cache consistency and update policy problems.
- Uniform coordinate space partitioning
- uniformly spread (key,value) entries
- uniformly spread out routing load
41Uniform Partitioning
- Added check
- at join time, pick a zone
- check neighboring zones
- pick the largest zone and split that one
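The added check amounts to a small selection step at join time. In this sketch the zone representation, the neighbor map, and the tie-breaking (ties keep the originally picked zone) are assumptions made for illustration.

```python
def volume(zone):
    """Volume of a (lo, hi) box."""
    lo, hi = zone
    result = 1.0
    for l, h in zip(lo, hi):
        result *= h - l
    return result


def zone_to_split(zones, neighbor_map, picked):
    """Among the picked zone and its neighbors, return the owner of the largest."""
    pool = [picked] + list(neighbor_map[picked])
    # max() keeps the first of equal candidates, so ties favor `picked`.
    return max(pool, key=lambda name: volume(zones[name]))


# Hypothetical state: J was split earlier, its neighbor N is still large.
zones = {
    "J": ((0.0, 0.0), (0.25, 0.5)),
    "K": ((0.25, 0.0), (0.5, 0.5)),
    "N": ((0.0, 0.5), (1.0, 1.0)),
}
neighbor_map = {"J": ["K", "N"], "K": ["J", "N"], "N": ["J", "K"]}
print(zone_to_split(zones, neighbor_map, "J"))  # N
```

Redirecting the split to the largest nearby zone is what evens out zone volumes across joins.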
42Uniform Partitioning
65,000 nodes, 3 dimensions
[Figure: percentage of nodes (y-axis) vs. zone volume (x-axis: V, 2V, 4V, 8V), without and with the check.]
43CAN Robustness
- Completely distributed
- no single point of failure (does not apply to the pieces of the database lost when a node fails)
- Not exploring database recovery (in case there are multiple copies of the database)
- Resilience of routing
- can route around trouble
44Outline
- Introduction
- Design
- Evaluation
- Strengths & Weaknesses
45Evaluation
- Scalability
- Low-latency
- Load balancing
- Robustness
46CAN scalability
- For a uniformly partitioned space with n nodes and d dimensions
- per node, number of neighbors is 2d
- average routing path is $(d\,n^{1/d})/4$ hops
- simulations show that the above results hold in practice
- Can scale the network without increasing per-node state
- Chord/Plaxton/Tapestry/Buzz
- log(n) neighbors with log(n) hops
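To get a feel for these formulas, the short script below evaluates them for a few dimension counts. The network size and the use of a base-2 logarithm for the log(n) systems are assumptions chosen only to make the comparison concrete.

```python
import math


def can_neighbors(d):
    """Per-node neighbor count in a uniformly partitioned d-dimensional CAN."""
    return 2 * d


def can_avg_hops(n, d):
    """Average routing path length: (d * n^(1/d)) / 4 hops."""
    return (d * n ** (1 / d)) / 4


n = 65536  # hypothetical network size
for d in (2, 4, 10):
    print(f"d={d}: {can_neighbors(d)} neighbors, "
          f"{can_avg_hops(n, d):.1f} hops on average")
print(f"log2(n) = {math.log2(n):.0f}")
```

Per-node state depends only on d, while the hop count falls quickly as d grows for a fixed n.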
47CAN low-latency
- Problem
- latency stretch = (CAN routing delay) / (IP routing delay)
- application-level routing may lead to high stretch
- Solution
- increase dimensions, realities (reduce the path length)
- Heuristics (reduce the per-CAN-hop latency)
- RTT-weighted routing
- multiple nodes per zone (peer nodes)
- deterministically replicate entries
48CAN low-latency
[Figure: latency stretch (y-axis) vs. number of nodes (16K, 32K, 65K, 131K), without and with heuristics; dimensions = 2.]
49CAN low-latency
[Figure: latency stretch (y-axis) vs. number of nodes (16K, 32K, 65K, 131K), without and with heuristics; dimensions = 10.]
50Outline
- Introduction
- Design
- Evaluation
- Strengths & Weaknesses
51Strengths
- More resilient than flooding broadcast networks
- Efficient at locating information
- Fault tolerant routing
- Node & Data High Availability (w/ improvement)
- Manageable routing table size & network traffic
52Weaknesses
- Impossible to perform a fuzzy search
- Susceptible to malicious activity
- Maintain coherence of all the indexed data (Network overhead, Efficient distribution)
- Still relatively higher routing latency
- Poor performance w/o improvement
53Compare CAN and Pastry
- CAN is greedy forwarding in Cartesian coordinate space.
- Pastry is maximum address prefix matching in a tree-like routing table.
- The routing table at each node in Pastry maintains more information
- Routing table maintenance
- Both are local.
54Compare CAN and Pastry
- State maintained at each node
- CAN: 2d
- Pastry: O(log N)
- Overlay network path length
- CAN: average routing path is $(d\,n^{1/d})/4$
- Pastry: O(log N)
- Latency stretch
- Pastry: 50% longer than the underlying IP network.
- CAN: most of the time much longer, except when using CAN with many additional features.