Title: Scalable peer-to-peer substrates: A new foundation for distributed applications?
1. Scalable peer-to-peer substrates: A new foundation for distributed applications?
- Antony Rowstron, Microsoft Research Cambridge, UK
- Peter Druschel, Rice University
- Collaborators:
  - Miguel Castro, Anne-Marie Kermarrec, MSR Cambridge
  - Y. Charlie Hu, Sitaram Iyer, Dan Wallach, Rice University
2. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
3. Background: Peer-to-peer Systems
- distributed
- decentralized control
- self-organizing
- symmetric communication/roles
4. Background: Peer-to-peer applications
- Pioneers: Napster, Gnutella, FreeNet
- File sharing: CFS, PAST [SOSP'01]
- Network storage: FarSite [Sigmetrics'00], OceanStore [ASPLOS'00], PAST [SOSP'01]
- Multicast: Herald [HotOS'01], Bayeux [NOSSDAV'01], CAN-multicast [NGC'01], SCRIBE [NGC'01]
5. Common issues
- Organize and maintain the overlay network
  - node arrivals
  - node failures
- Resource allocation / load balancing
- Resource location
- Locality (network proximity)
- Idea: a generic P2P substrate
6. Architecture
[Layer diagram, top to bottom:]
- P2P application layer: event notification, network storage, ? (other applications)
- P2P substrate (self-organizing overlay network): the DHT
- Internet: TCP/IP
7. DHTs
- Peer-to-peer object location and routing substrate
- Distributed Hash Table: maps an object key to a live node
- Insert(key, object)
- Lookup(key)
- Keys are typically 128 bits
- Pastry (developed at MSR Cambridge/Rice) is an example of such an infrastructure (a minimal interface sketch follows).
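The slides specify only the two operations and the key size. Below is a minimal sketch of that interface in Java (the language of the Pastry prototype); the names and the BigInteger key type are assumptions for illustration, not from the talk.

```java
import java.math.BigInteger;

// Minimal sketch of the DHT interface described on the slide above.
// The substrate maps a 128-bit key to the live node responsible for it;
// names and types here are illustrative assumptions.
interface DHT {
    void insert(BigInteger key, byte[] object); // store object under key
    byte[] lookup(BigInteger key);              // fetch the object stored under key
}
```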
8. DHT: Related work
- Chord (MIT/UCB) [Sigcomm'01]
- CAN (ACIRI/UCB) [Sigcomm'01]
- Tapestry (UCB) [TR UCB/CSD-01-1141]
- PNRP (Microsoft) [Huitema et al., unpublished]
- Kleinberg '99
- Plaxton et al. '97
9. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
10. Squirrel: Web caching
- Reduce latency
- Reduce external bandwidth
- Reduce web server load
- ISPs, corporate network boundaries, etc.
- Cooperative web caching: a group of web caches working together, acting as one web cache
11. Web Cache
[Figure: browsers (each with a browser cache) on a LAN share a centralized web cache, which fetches from web servers across the Internet. Sharing!]
12. Decentralized Web Cache
[Figure: the same LAN without the centralized cache: the browsers' caches cooperate, forming a decentralized web cache between the browsers and the web server across the Internet.]
13. Why peer-to-peer?
- Cost of a dedicated web cache → no additional hardware
- Administrative costs → self-organizing
- Scaling needs upgrading → resources grow with clients
- Single point of failure → fault-tolerant by design
14. Setting
- Corporate LAN
- 100 to 100,000 desktop machines
- Single physical location
- Each node runs an instance of Squirrel
  - and sets it as the browser's proxy
15. Approaches
- Home-store model
- Directory model
- Both approaches require key generation:
  - key = Hash(URL)
  - collision-resistant hash (e.g. SHA-1)
  - Hash(http://www.research.microsoft.com/~antr) → 4ff367a14b374e3dd99f (sketch below)
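As a concrete illustration of the key generation above, a small sketch using the JDK's SHA-1 digest; the class and method names are assumptions, not Squirrel code, and a 128-bit key could be taken as the first 16 bytes of the 160-bit digest.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of Squirrel's key generation: key = Hash(URL), using a
// collision-resistant hash (SHA-1). Names are illustrative.
public class UrlKey {
    public static BigInteger keyFor(String url) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(url.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest); // non-negative 160-bit value
    }

    public static void main(String[] args) throws Exception {
        // the example URL from the slide
        System.out.println(keyFor("http://www.research.microsoft.com/~antr").toString(16));
    }
}
```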
16. Home-store model
[Figure: the client computes hash(URL) and routes the request to the home node, the node whose id is closest to that hash; the home node stores the object.]
17. Home-store model
[Figure: the home node serves the object back to the client. That's how it works!]
18. Directory model
- Client nodes always store objects in their local caches.
- Main difference between the two schemes: whether the home node also stores the object.
- In the directory model, the home node only stores pointers to recent clients, and forwards requests to them (see the sketch below).
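A sketch of that distinction at the home node, with all names assumed: in home-store the home node caches the object itself; in the directory model it keeps only pointers to recent clients and forwards each request to a randomly chosen delegate.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch (not Squirrel code) of per-key home-node state
// in the two models.
public class HomeNode {
    // home-store model: the home node stores the object itself
    final Map<BigInteger, byte[]> objects = new HashMap<>();

    // directory model: only pointers to recent clients (delegates)
    final Map<BigInteger, List<String>> delegates = new HashMap<>();
    final Random rnd = new Random();

    void recordDelegate(BigInteger key, String clientAddr) {
        delegates.computeIfAbsent(key, k -> new ArrayList<>()).add(clientAddr);
    }

    // directory model: forward a request to a randomly chosen delegate
    String pickDelegate(BigInteger key) {
        List<String> d = delegates.get(key);
        return (d == null || d.isEmpty()) ? null : d.get(rnd.nextInt(d.size()));
    }
}
```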
19. Directory model
[Figure: the client routes its request to the home node for the URL's key.]
20. Directory model
[Figure: the home node picks a random entry from its directory and forwards the request to that delegate, which serves the object to the client.]
21. (skip) Full directory protocol
22. Recap
- Two endpoints of the design space, based on the choice of storage location.
- At first sight, both seem to do about as well (e.g. hit ratio, latency).
23. Quirk
- Consider:
  - a web page with many images, or
  - a heavily browsing node
- In the directory scheme: many home nodes pointing to one delegate
- Home-store: natural load balancing
... evaluation on trace-based workloads ...
24. Trace characteristics [table]
25. Total external bandwidth [plot: Redmond trace]
26. Total external bandwidth [plot: Cambridge trace]
27. LAN hops [plot: Redmond trace]
28. LAN hops [plot: Cambridge trace]
29. Load in requests per sec [plot: Redmond trace; x-axis: max objects served per node per second (0 to 50); y-axis: number of such seconds (1 to 100,000, log scale); curves: Home-store vs. Directory]
30. Load in requests per sec [plot: Cambridge trace; x-axis: max objects served per node per second (0 to 50); y-axis: number of such seconds (1 to 1e07, log scale); curves: Home-store vs. Directory]
31. Load in requests per min [plot: Redmond trace; x-axis: max objects served per node per minute (0 to 350); y-axis: number of such minutes (1 to 100, log scale); curves: Home-store vs. Directory]
32. Load in requests per min [plot: Cambridge trace; x-axis: max objects served per node per minute (0 to 120); y-axis: number of such minutes (1 to 10,000, log scale); curves: Home-store vs. Directory]
33. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
34. Pastry
- Generic p2p location and routing substrate (DHT)
- Self-organizing overlay network
- Consistent hashing
- Lookup/insert object in < log16 N routing steps (expected)
- O(log N) per-node state
- Network locality heuristics
35. Pastry: Object distribution
- Consistent hashing [Karger et al. '97]
- 128-bit circular id space: 0 to 2^128 - 1
- nodeIds (uniform random)
- objIds/keys (uniform random)
- Invariant: the node with the numerically closest nodeId maintains the object (see the sketch below)
[Figure: the circular id space with nodeIds and an objId/key marked.]
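A minimal sketch of the invariant, with names assumed and numeric closeness taken as circular distance on the 128-bit id space:

```java
import java.math.BigInteger;
import java.util.List;

// Sketch: the node whose nodeId is numerically closest to the objId
// (on the circular id space) maintains the object. Illustrative only.
public class ClosestNode {
    static final BigInteger RING = BigInteger.ONE.shiftLeft(128); // 2^128 ids

    // circular distance between two ids
    static BigInteger dist(BigInteger a, BigInteger b) {
        BigInteger d = a.subtract(b).abs();
        return d.min(RING.subtract(d));
    }

    // the nodeId responsible for objId, among all live nodeIds
    static BigInteger responsibleFor(BigInteger objId, List<BigInteger> nodeIds) {
        BigInteger best = null;
        for (BigInteger id : nodeIds)
            if (best == null || dist(id, objId).compareTo(dist(best, objId)) < 0)
                best = id;
        return best;
    }
}
```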
36. Pastry: Object insertion/lookup
A message with key X is routed to the live node with nodeId closest to X.
Problem: a complete routing table is not feasible.
[Figure: Route(X) travels around the id circle to the node closest to X.]
37. Pastry: Routing
- Tradeoff:
  - O(log N) routing table size
  - O(log N) message forwarding steps
38. Pastry: Routing table (#65a1fcx)
[Figure: the routing table of node 65a1fcx, Row 0 through Row 3 of log16 N rows; the entry in row l and column d points to a node whose id shares its first l digits with 65a1fcx and has d as the next digit.]
39. Pastry: Routing
[Figure: Route(d46a1c) from node 65a1fc; each hop (d13da3, d4213f, d462ba, d467c4, d471f1 shown) shares a progressively longer prefix with the key.]
- Properties:
  - log16 N steps
  - O(log N) state
40. Pastry: Leaf sets
- Each node maintains the IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively.
- routing efficiency/robustness
- fault detection (keep-alive)
- application-specific local coordination
41. Pastry: Routing procedure

    if (destination D is within range of our leaf set)
        forward to the numerically closest leaf set member
    else
        let l = length of prefix shared with D
        let d = value of the l-th digit in D's address
        if (R[l][d] exists)
            forward to R[l][d]
        else
            forward to a known node that
            (a) shares at least as long a prefix with D, and
            (b) is numerically closer to D than this node
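A runnable sketch of this procedure, not Pastry's implementation: ids are assumed to be 32-digit lowercase hex strings (128 bits, 4 bits per digit), R[l][d] is the routing table entry for shared-prefix length l and next digit d, and wraparound at the ends of the id space is ignored for brevity.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative sketch of the routing procedure on the slide above.
public class PastryRoute {
    String self;                                     // this node's id (32 hex digits)
    NavigableSet<String> leafSet = new TreeSet<>();  // numerically closest nodeIds
    String[][] R = new String[32][16];               // routing table

    static BigInteger val(String id) { return new BigInteger(id, 16); }

    // is a numerically closer to d than b is?
    static boolean closer(String a, String b, String d) {
        return val(a).subtract(val(d)).abs()
                .compareTo(val(b).subtract(val(d)).abs()) < 0;
    }

    static String numericallyClosest(String d, Collection<String> ids) {
        String best = null;
        for (String id : ids)
            if (best == null || closer(id, best, d)) best = id;
        return best;
    }

    static int sharedPrefix(String a, String b) {
        int l = 0;
        while (l < a.length() && a.charAt(l) == b.charAt(l)) l++;
        return l;
    }

    // returns the id of the next hop for destination D (self = deliver here)
    String nextHop(String d) {
        // case 1: D is within range of our leaf set
        if (!leafSet.isEmpty() && d.compareTo(leafSet.first()) >= 0
                && d.compareTo(leafSet.last()) <= 0) {
            List<String> candidates = new ArrayList<>(leafSet);
            candidates.add(self);
            return numericallyClosest(d, candidates);
        }
        int l = sharedPrefix(self, d);
        if (l == self.length()) return self;          // D equals our own id
        int digit = Character.digit(d.charAt(l), 16);
        if (R[l][digit] != null) return R[l][digit];  // case 2: routing table entry
        // case 3 (rare): any known node that shares at least as long a
        // prefix with D and is numerically closer to D than this node
        List<String> known = new ArrayList<>(leafSet);
        for (String[] row : R) for (String n : row) if (n != null) known.add(n);
        for (String n : known)
            if (sharedPrefix(n, d) >= l && closer(n, self, d)) return n;
        return self; // no better node known: deliver here
    }
}
```

Each hop either resolves the destination within the leaf set or extends the matched prefix by at least one digit, which is where the expected log16 N step count comes from.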
42. Pastry: Routing
- Integrity of overlay:
  - guaranteed unless L/2 simultaneous failures of nodes with adjacent nodeIds
- Number of routing hops:
  - no failures: < log16 N expected (about 4 hops for N = 100,000), 128/4 + 1 max
  - during failure recovery: O(N) worst case, average case much better
43. Demonstration
44. Pastry: Self-organization
- Initializing and maintaining routing tables and leaf sets
- Node addition
- Node departure (failure)
45. Pastry: Node addition
[Figure: new node d46a1c joins by routing a message with its own id as the key; the message passes nodes 65a1fc, d13da3, d4213f, d462ba, d467c4, d471f1, which supply the new node's routing state.]
46. Node departure (failure)
- Leaf set members exchange keep-alive messages
- Leaf set repair (eager): request the set from the farthest live node in the set
- Routing table repair (lazy): get the table from peers in the same row, then higher rows
47. Pastry: Experimental results
- Prototype
  - implemented in Java
  - emulated network
48. Pastry: Average # of hops [plot; |L| = 16, 100k random queries]
49. Pastry: # of hops (100k nodes) [plot; |L| = 16, 100k random queries]
50. Pastry: # routing hops (failures) [plot; |L| = 16, 100k random queries, 5k nodes, 500 failures]
51. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
52. Pastry: Locality properties
- Assumption: a scalar proximity metric
  - e.g. ping/RTT delay, IP hops
  - a node can probe its distance to any other node
- Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.
53. Pastry: Routes in proximity space
54. Pastry: Distance traveled [plot; |L| = 16, 100k random queries]
55. Pastry: Locality properties
1. The expected distance traveled by a message in the proximity space is within a small constant of the minimum.
2. Routes of messages sent by nearby nodes with the same key converge at a node near the source nodes.
3. Among the k nodes with nodeIds closest to the key, a message is likely to reach the node closest to the source node first.
56. Demonstration
57. Pastry: Node addition
58. Pastry API
- route(M, X): route message M to the node with nodeId numerically closest to X
- deliver(M): deliver message M to the application (callback)
- forwarding(M, X): message M is being forwarded towards key X (callback)
- newLeaf(L): report a change in leaf set L to the application (callback)
(The sketch below renders this API as Java interfaces.)
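Message, Key, and LeafSet are placeholder types assumed for the sketch.

```java
// Sketch of the Pastry API above; the substrate calls the application's
// methods back as messages move through the overlay.
interface Pastry {
    // route message M to the live node with nodeId numerically closest to X
    void route(Message m, Key x);
}

interface Application {
    void deliver(Message m);           // M arrived at the responsible node (callback)
    void forwarding(Message m, Key x); // M is being forwarded towards X (callback)
    void newLeaf(LeafSet l);           // the local leaf set changed (callback)
}

// placeholder types, assumed for illustration
class Message {}
class Key {}
class LeafSet {}
```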
59. Pastry: Security
- Secure nodeId assignment
- Randomized routing
- Byzantine fault-tolerant leaf set membership protocol
60. Pastry: Summary
- Generic p2p overlay network
- Scalable, fault resilient, self-organizing, secure
- O(log N) routing steps (expected)
- O(log N) routing table size
- Network locality properties
61. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
62. SCRIBE: Large-scale, decentralized event notification
- Infrastructure to support topic-based publish-subscribe applications
- Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
- Efficient: low delay, low link stress, low node overhead
(A sketch of the tree-building idea follows.)
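A sketch with assumed names, not the Scribe code: each topic has a multicast tree rooted at the node whose id is closest to topicId; a SUBSCRIBE message routed towards topicId stops at the first node already in the tree, and every node on the path records the child the message arrived from.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of per-node Scribe state built on Pastry's route().
public class ScribeNode {
    // per-topic children; the union of these edges forms the topic's tree
    private final Map<String, Set<String>> children = new HashMap<>();

    // invoked (e.g. from Pastry's forwarding() callback) when a SUBSCRIBE
    // for topicId passes through or arrives here; returns true if the
    // SUBSCRIBE should keep travelling towards topicId so that this node
    // acquires a parent in the tree as well
    boolean onSubscribe(String topicId, String childAddr) {
        Set<String> kids = children.computeIfAbsent(topicId, t -> new HashSet<>());
        boolean wasInTree = !kids.isEmpty();
        kids.add(childAddr);
        return !wasInTree;
    }

    // dissemination: the root (the node closest to topicId) starts it, and
    // each node forwards a published event to its children recursively
    Set<String> childrenOf(String topicId) {
        return children.getOrDefault(topicId, Set.of());
    }
}
```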
63. SCRIBE: Large-scale event notification
[Figure: nodes Subscribe(topicId) and Publish(topicId); both kinds of message are routed towards the node whose id is closest to topicId.]
64. Scribe: Results
- Simulation results
- Comparison with IP multicast: delay, node stress and link stress
- Experimental setup:
  - Georgia Tech transit-stub model
  - 60,000 nodes randomly selected out of 500,000
  - Zipf-like subscription distribution, 1500 topics
65. Scribe: Topic distribution
[Plot: number of subscribers vs. topic rank, with example topics Windows Update, Stock Alert, and Instant Messaging marked.]
66. Scribe: Delay penalty [plot]
67. Scribe: Node stress [plot]
68. Scribe: Link stress [plot]
69. Related work
- Narada
- Bayeux/Tapestry
- Multicast/CAN
70. Summary
- Self-configuring P2P framework for topic-based publish-subscribe
- Scribe achieves reasonable performance compared with IP multicast
- Scales to a large number of subscribers
- Scales to a large number of topics
- Good distribution of load
71. Outline
- Background
- Squirrel
- Pastry
- Pastry locality properties
- SCRIBE
- PAST
- Conclusions
72. PAST: Cooperative, archival file storage and distribution
- Layered on top of Pastry
- Strong persistence
- High availability
- Scalability
- Reduced cost (no backup)
- Efficient use of pooled resources
73. PAST API
- Insert: store a replica of the file at k diverse storage nodes
- Lookup: retrieve the file from a nearby live storage node that holds a copy
- Reclaim: free the storage associated with a file
- Files are immutable (interface sketch below)
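A sketch of this API as a Java interface, with names and types assumed; fileIds and the k-replica storage invariant appear on the following slides.

```java
import java.math.BigInteger;

// Sketch of the PAST API above. Files are immutable once inserted.
interface Past {
    // store replicas of the file on k diverse storage nodes
    void insert(BigInteger fileId, byte[] file, int k);

    // retrieve the file from a nearby live storage node holding a replica
    byte[] lookup(BigInteger fileId);

    // free the storage associated with the file
    void reclaim(BigInteger fileId);
}
```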
74. PAST: File storage
[Figure: Insert(fileId) routes the file towards the node with nodeId closest to fileId.]
75. PAST: File storage
Storage invariant: file replicas are stored on the k nodes with nodeIds closest to fileId (k is bounded by the leaf set size).
76. PAST: File retrieval
[Figure: client C issues Lookup(fileId); the file, replicated on k nodes, is located in log16 N steps (expected), and the lookup usually finds the replica nearest to C.]
77. PAST: Exploiting Pastry
- Random, uniformly distributed nodeIds
  - replicas stored on diverse nodes
- Uniformly distributed fileIds
  - e.g. SHA-1(filename, public key, salt) (sketch below)
  - approximate load balance
- Pastry routes to the closest live nodeId
  - availability, fault tolerance
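A sketch of the fileId computation named above, again with the JDK's SHA-1 and assumed names:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// fileId = SHA-1(filename, public key, salt): uniformly distributed
// fileIds, hence approximate load balance across storage nodes.
public class FileId {
    public static BigInteger fileId(String filename, byte[] publicKey, byte[] salt)
            throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(filename.getBytes(StandardCharsets.UTF_8));
        sha1.update(publicKey);
        sha1.update(salt);
        return new BigInteger(1, sha1.digest());
    }
}
```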
78. PAST: Storage management
- Maintain the storage invariant
- Balance free space when global utilization is high
  - statistical variation in the assignment of files to nodes (fileId vs. nodeId)
  - file size variations
  - node storage capacity variations
- Local coordination only (leaf sets)
79. Experimental setup
- Web proxy traces from NLANR
  - 18.7 GB total; mean 10.5 KB, median 1.4 KB, min 0, max 138 MB
- Filesystem
  - 166.6 GB total; mean 88 KB, median 4.5 KB, min 0, max 2.7 GB
- 2250 PAST nodes (k = 5)
  - truncated normal distributions of node storage sizes, mean 27/270 MB
80. Need for storage management
- No diversion (tpri = 1, tdiv = 0):
  - max utilization 60.8%
  - 51.1% of inserts failed
- Replica/file diversion (tpri = 0.1, tdiv = 0.05):
  - max utilization > 98%
  - < 1% of inserts failed
81. PAST: File insertion failures [plot]
82. PAST: Caching
- Nodes cache files in the unused portion of their allocated disk space
- Files are cached on nodes along the route of lookup and insert messages
- Goals:
  - maximize query throughput for popular documents
  - balance query load
  - improve client latency
83. PAST: Caching
[Figure: Lookup(fileId) is routed towards fileId; nodes along the route cache the file.]
84. PAST: Caching [figure]
85. PAST: Security
- No read access control: users may encrypt content for privacy
- File authenticity: file certificates
- System integrity: nodeIds and fileIds non-forgeable, sensitive messages signed
- Routing randomized
86. PAST: Storage quotas
- Balance storage supply and demand
  - a user holds a smartcard issued by brokers
    - hides the user's private key and usage quota
    - debits the quota upon issuing a file certificate
  - storage nodes hold smartcards
    - advertise supply quota
    - storage nodes subject to random audits within leaf sets
87. PAST: Related work
- CFS [SOSP'01]
- OceanStore [ASPLOS 2000]
- FarSite [Sigmetrics 2000]
88. Status
- Functional prototypes
  - Pastry [Middleware 2001]
  - PAST [HotOS-VIII, SOSP'01]
  - SCRIBE [NGC 2001]
  - Squirrel (submitted)
- http://www.research.microsoft.com/~antr/Pastry
89. Current work
- Security
  - secure nodeId assignment
  - quota system
- Keyword search capabilities
- Support for mutable files in PAST
- Anonymity/anti-censorship
- New applications
- Software releases
90. Conclusion
- For more information:
  - http://www.research.microsoft.com/~antr/Pastry
- I'll be here till Friday lunchtime; feel free to stop me and talk.