1
DISTRIBUTED HASH TABLES: Building large-scale, robust distributed applications
  • Frans Kaashoek
  • kaashoek@lcs.mit.edu
  • Joint work with: H. Balakrishnan, P. Druschel, J. Hellerstein, D. Karger, R. Karp, J. Kubiatowicz, B. Liskov, D. Mazières, R. Morris, S. Shenker, I. Stoica

2
P2P: an exciting social development
  • Internet users cooperating to share, for example,
    music files
  • Napster, Gnutella, Morpheus, KaZaA, etc.
  • Lots of attention from the popular press
  • The ultimate form of democracy on the Internet
  • The ultimate threat to copyright protection on the Internet
  • Many vendors have launched P2P efforts

3
What is P2P?
[Diagram: many clients connected to one another through the Internet; there is no server]
  • A distributed system architecture
  • No centralized control
  • Nodes are symmetric in function
  • Typically many nodes, but unreliable and
    heterogeneous

4
Traditional distributed computing: client/server
[Diagram: clients connected through the Internet to a single central server]
  • Successful architecture, and will continue to be
    so
  • Tremendous engineering necessary to make server
    farms scalable and robust

5
Application-level overlays
[Diagram: overlay nodes (N) at several sites, connected across ISP1, ISP2, and ISP3]
  • One per application
  • Nodes are decentralized
  • The NOC (network operations center) is centralized

P2P systems are overlay networks without central
control
6
(Potential) P2P advantages
  • Allows for scalable incremental growth
  • Aggregates tremendous amounts of computation and storage resources
  • Tolerates faults or intentional attacks

7
Example P2P problem: lookup
[Diagram: a publisher stores (key = title, value = file data) at one node; a client issues Lookup(title), which must find that node among N1-N6 across the Internet]
  • At the heart of all P2P systems

8
Centralized lookup (Napster)
[Diagram: the publisher registers its content with SetLoc(title, N4) at a central database (DB); the client sends Lookup(title) to the DB, which returns N4, and the client fetches (key = title, value = file data) from N4]
Simple, but O(N) state and a single point of
failure
9
Flooded queries (Gnutella)
[Diagram: the client floods Lookup(title) to its neighbors, which forward it hop by hop until it reaches the publisher holding (key = title, value = MP3 data)]
Robust, but worst case O(N) messages per lookup
10
Another approach: distributed hash tables
[Diagram: distributed applications call Insert(key, data) and Lookup(key) on a distributed-hash-table layer that spans many nodes]
  • Nodes are the hash buckets
  • Key identifies data uniquely
  • DHT balances keys and data across nodes
  • DHT replicates, caches, routes lookups, etc.

11
Why DHTs now?
  • Demand pulls
  • Growing need for security and robustness
  • Large-scale distributed apps are difficult to
    build
  • Many applications use location-independent data
  • Technology pushes
  • Bigger, faster, and better: every PC can be a server
  • Scalable lookup algorithms are available
  • Trustworthy systems from untrusted components

12
DHT is a good interface
  DHT: lookup(key) → data; insert(key, data)
  UDP/IP: send(IP address, data); receive(IP address) → data
  • Supports a wide range of applications, because it imposes few restrictions
  • Keys have no semantic meaning
  • Value is application dependent
  • Minimal interface
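A minimal sketch of this two-call interface as a single-process toy; the class name ToyDHT and the bucket layout are illustrative, not part of the talk:

    import hashlib

    class ToyDHT:
        # Toy, in-process stand-in for the interface above:
        # insert(key, data) and lookup(key) -> data.
        def __init__(self, nodes):
            self.names = sorted(nodes)
            self.buckets = {name: {} for name in nodes}   # nodes are the hash buckets

        def _node_for(self, key):
            # Keys carry no semantic meaning; hash them to pick a bucket.
            h = int(hashlib.sha1(key).hexdigest(), 16)
            return self.names[h % len(self.names)]

        def insert(self, key, data):
            self.buckets[self._node_for(key)][key] = data

        def lookup(self, key):
            return self.buckets[self._node_for(key)].get(key)

    dht = ToyDHT(["node1", "node2", "node3"])
    dht.insert(b"some-title", b"file data")
    assert dht.lookup(b"some-title") == b"file data"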

13
DHT is a good shared infrastructure
  • Applications inherit some security and robustness
    from DHT
  • DHT replicates data
  • Resistant to malicious participants
  • Low-cost deployment
  • Self-organizing across administrative domains
  • Can be shared among applications
  • Its large scale supports Internet-scale workloads

14
DHTs support many applications
  • File sharing: CFS, OceanStore, PAST, ...
  • Web cache: Squirrel, ...
  • Censor-resistant stores: Eternity, FreeNet, ...
  • Event notification: Scribe
  • Naming systems: ChordDNS, INS, ...
  • Query and indexing: Kademlia, ...
  • Communication primitives: I3, ...
  • Backup store: HiveNet
  • Web archive: Herodotus

data is location-independent
15
Cooperative read-only file sharing
[Diagram: a file-system layer calls insert(key, block) and lookup(key) on a distributed-hash-table layer that spans many nodes]
  • DHT is a robust block store
  • Client of DHT implements file system

16
File representation: self-authenticating data
[Diagram: a signed root block (key 995) points to directory blocks (keys 901, 732); a directory entry maps a.txt to i-node block 144, which lists data-block keys 431 and 795; each key is the SHA-1 hash of the block it names]
  • The DHT key for a block is SHA-1(block contents)
  • Files and file systems form Merkle hash trees (see the sketch below)
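A sketch of the two bullets above, with a plain dictionary standing in for the DHT: each block is stored under the SHA-1 hash of its contents, and a parent block that lists its children's keys forms a Merkle hash tree. Block contents and helper names are made up for illustration:

    import hashlib

    store = {}   # stand-in for the DHT

    def block_key(block: bytes) -> str:
        # A block's key is the SHA-1 hash of its contents, so any fetched
        # block can be verified against the key used to fetch it.
        return hashlib.sha1(block).hexdigest()

    def put_block(block: bytes) -> str:
        key = block_key(block)
        store[key] = block
        return key

    # Data blocks -> i-node block -> root block: a Merkle hash tree.
    data_keys = [put_block(b"hello "), put_block(b"world")]
    inode_key = put_block(" ".join(data_keys).encode())
    root_key = put_block(inode_key.encode())   # a real root block is also signed

    # Verification on read: recompute the hash and compare with the key used.
    assert block_key(store[root_key]) == root_key
    assert block_key(store[inode_key]) == inode_key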

17
DHT distributes blocks by hashing IDs
[Diagram: blocks 407, 705, 732, 901, 992 are spread over nodes A-D according to their hashed IDs; two signed root blocks (995 and 247) list the keys of the blocks they reference]
  • DHT replicates blocks for fault tolerance
  • DHT caches popular blocks for load balance

18
Historical web archiver
  • Goal: make and archive a daily checkpoint of the Web
  • Estimates:
  • The Web is about 57 Tbytes, compressed HTML + images
  • New data per day: 580 Gbytes
  • 128 Tbytes per year with 5 replicas
  • Design:
  • 12,810 nodes, each with a 100-Gbyte disk and 61 Kbit/s of bandwidth

19
Implementation using DHT
[Diagram: the crawler calls Insert(sha-1(URL), page) and Insert(sha-1(URL), URL); the client calls Lookup(URL); both run on top of a distributed-hash-table layer that spans many nodes]
  • DHT usage:
  • The crawler distributes crawling load by hash(URL)
  • The crawler inserts Web pages by hash(URL)
  • Clients retrieve Web pages by hash(URL)
  • DHT replicates data for fault tolerance
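A sketch of this usage, again with a plain dictionary standing in for the DHT's insert/lookup calls; the URL and page contents are made up:

    import hashlib

    dht = {}   # stand-in for Insert(key, data) / Lookup(key)

    def sha1(x: bytes) -> str:
        return hashlib.sha1(x).hexdigest()

    def crawl_and_insert(url: str, fetch):
        # The crawler inserts the fetched page under hash(URL).
        dht[sha1(url.encode())] = fetch(url)

    def lookup_page(url: str):
        # A client retrieves the page by hashing the same URL.
        return dht.get(sha1(url.encode()))

    crawl_and_insert("http://example.org/", lambda u: b"<html>snapshot</html>")
    assert lookup_page("http://example.org/") == b"<html>snapshot</html>"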

20
Backup store
  • Goal: back up data onto other users' machines
  • Observations
  • Many user machines are not backed up
  • Backup requires significant manual effort
  • Many machines have lots of spare disk space
  • Using a DHT:
  • Merkle tree to validate integrity of data
  • Administrative and financial costs are less for
    all participants
  • Backups are robust (automatic off-site backups)
  • Blocks are stored only once, if key = SHA-1(data) (see the dedup sketch below)
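A sketch of the last point: because the key is derived from the data itself, identical blocks backed up from different machines collapse to a single stored copy. Names and contents are illustrative:

    import hashlib

    dht = {}

    def backup_block(data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical blocks hash to the same key, so the DHT keeps one copy.
        dht.setdefault(key, data)
        return key

    k1 = backup_block(b"common OS file contents")   # machine A
    k2 = backup_block(b"common OS file contents")   # machine B
    assert k1 == k2 and len(dht) == 1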

21
Research challenges
  • Scalable lookup
  • Balance load (flash crowds)
  • Handling failures
  • Coping with systems in flux
  • Network-awareness for performance
  • Robustness with untrusted participants
  • Programming abstraction
  • Heterogeneity
  • Anonymity
  • Goal: simple, provably good algorithms

(this talk covers the first seven of these challenges)
22
1. Scalable lookup
  • Map keys to nodes in a load-balanced way
  • Hash keys and nodes into a string of digits
  • Assign each key to the closest node
  • Forward a lookup for a key to a closer node
  • Insert: lookup, then store
  • Join: insert the node into the ring

Examples: CAN, Chord, Kademlia, Pastry, Tapestry, Viceroy, ...
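A sketch of the first three steps above, assuming a Chord-style rule in which the "closest" node is the first node clockwise at or after the key's identifier; the node names are made up:

    import bisect
    import hashlib

    def ident(x: bytes) -> int:
        # Hash keys and nodes into the same 160-bit identifier space.
        return int(hashlib.sha1(x).hexdigest(), 16)

    def responsible_node(node_ids, key_id):
        # Assign the key to the first node clockwise at or after it,
        # wrapping around the top of the identifier space.
        node_ids = sorted(node_ids)
        i = bisect.bisect_left(node_ids, key_id)
        return node_ids[i % len(node_ids)]

    nodes = [ident(f"node-{i}".encode()) for i in range(8)]
    key = ident(b"some file name")
    print("key %x -> node %x" % (key, responsible_node(nodes, key)))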
23
Chord's routing table: fingers
[Diagram: node N80's fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the identifier ring]
24
Lookups take O(log(N)) hops
[Diagram: Lookup(K19), issued at N80, hops around a ring of nodes N5, N10, N20, N32, N60, N80, N99, N110, each hop moving to a closer predecessor of K19]
  • Lookup: route to the closest predecessor (sketched below)
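A sketch of the finger table and the greedy lookup rule on a small 7-bit ring, using the node IDs from the diagrams; each hop forwards to the finger that most closely precedes the key, so the remaining distance roughly halves per hop and a lookup takes O(log N) hops:

    import bisect

    RING = 2 ** 7      # small identifier space (0..127) for illustration

    def successor(node_ids, x):
        # First node clockwise at or after identifier x.
        i = bisect.bisect_left(node_ids, x % RING)
        return node_ids[i % len(node_ids)]

    def fingers(n, node_ids):
        # finger[i] = successor(n + 2^i): nodes 1/2, 1/4, 1/8, ... of the
        # ring away from n, as in the finger diagram above.
        return [successor(node_ids, n + 2 ** i) for i in range(7)]

    def dist(a, b):
        return (b - a) % RING          # clockwise distance on the ring

    def lookup(start, key, node_ids):
        cur, hops, target = start, 0, successor(node_ids, key)
        while cur != target:
            best = cur
            for f in fingers(cur, node_ids):
                # Pick the finger that precedes the key and is closest to it.
                if dist(cur, f) < dist(cur, key) and dist(f, key) < dist(best, key):
                    best = f
            cur = best if best != cur else target   # final hop: the key's successor
            hops += 1
        return cur, hops

    nodes = sorted([5, 10, 20, 32, 60, 80, 99, 110])
    print(lookup(80, 19, nodes))    # Lookup(K19) from N80 ends at N20 in 3 hops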

25
CAN: exploit d dimensions
  • Each node is assigned a zone
  • Nodes are identified by zone boundaries
  • Join: choose a random point and split the zone that contains it

26
Routing in 2-dimensions
  • Routing is navigating a d-dimensional ID space
  • Route to closest neighbor in direction of
    destination
  • Routing table contains O(d) neighbors
  • Number of hops is O(d · N^(1/d))
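A sketch of this greedy coordinate routing, assuming an idealized torus in which every zone is a unit cell (real CAN zones vary in size as nodes join and split them):

    def torus_step(a, b, side):
        # Signed step (-1, 0, +1) along one dimension of a torus,
        # taking the shorter way around.
        d = (b - a) % side
        if d == 0:
            return 0
        return 1 if d <= side - d else -1

    def route(src, dst, side):
        # Each hop hands the message to the neighboring cell that moves
        # one coordinate closer to the destination.
        cur, hops = list(src), 0
        while tuple(cur) != tuple(dst):
            for i in range(len(cur)):
                step = torus_step(cur[i], dst[i], side)
                if step:
                    cur[i] = (cur[i] + step) % side
                    hops += 1
                    break
        return hops

    # 2-d example: 16 cells per side (N = 256 zones); hops grow as O(d * N^(1/d)).
    print(route((0, 0), (7, 12), side=16))   # 7 + 4 = 11 hops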

27
2. Balance load
[Diagram: the same Lookup(K19) path as before; copies of K19 are cached at the nodes along the path]
  • Hash function balances keys over nodes
  • For popular keys, cache along the path
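A sketch of caching along the lookup path: after a hit, every node the query passed through keeps a copy, so the next lookup for the same popular key stops early. The Node class and the explicit path are illustrative:

    class Node:
        def __init__(self, ident):
            self.ident = ident
            self.store = {}    # keys this node is responsible for
            self.cache = {}    # cached copies of popular keys

    def lookup_with_caching(path, key):
        # 'path' is the sequence of nodes the lookup visits; the last one
        # is the key's home node.
        for i, node in enumerate(path):
            value = node.cache.get(key) or node.store.get(key)
            if value is not None:
                for earlier in path[:i]:
                    earlier.cache[key] = value   # cache along the path
                return value, i                  # value and hops used
        return None, len(path)

    n80, n5, n10, n20 = Node(80), Node(5), Node(10), Node(20)
    n20.store["K19"] = "data"
    print(lookup_with_caching([n80, n5, n10, n20], "K19"))   # ('data', 3)
    print(lookup_with_caching([n80, n5, n10, n20], "K19"))   # ('data', 0): cache hit at N80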

28
Why Caching Works Well
  • Only O(log N) nodes have fingers pointing to N20
  • This limits the single-block load on N20

29
3. Handling failures: redundancy
[Diagram: a ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; each node tracks the r nodes that follow it]
  • Each node knows IP addresses of next r nodes
  • Each key is replicated at next r nodes
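A sketch of successor-list replication: a key's replicas live on its successor and the r-1 nodes that follow, so a lookup still succeeds when the primary has failed. Node IDs follow the diagram:

    import bisect

    def replica_set(node_ids, key, r):
        # The key's successor and the next r-1 nodes hold its replicas.
        node_ids = sorted(node_ids)
        i = bisect.bisect_left(node_ids, key)
        return [node_ids[(i + k) % len(node_ids)] for k in range(r)]

    def lookup_replicated(node_ids, key, alive, r=3):
        # A lookup succeeds if any of the r replica holders is still alive.
        return next((n for n in replica_set(node_ids, key, r) if n in alive), None)

    nodes = [5, 10, 20, 32, 40, 60, 80, 99, 110]
    print(replica_set(nodes, 19, r=3))            # K19 lives at N20, N32, N40
    alive = set(nodes) - {20}                     # the primary for K19 has failed
    print(lookup_replicated(nodes, 19, alive))    # 32: a replica answers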

30
Lookups find replicas
[Diagram: Lookup(K19), issued at N80, takes numbered hops (1, 2, 3, 4) around a ring of nodes N5-N110 and ends at the replicas of K19]
  • Tradeoff between latency and bandwidth (Kademlia)

31
4. Systems in flux
  • Lookup takes O(log N) hops
  • if the system is stable
  • But the system is never stable!
  • What we desire are theorems of the type:
  • in the almost-ideal state, lookups still take O(log N) hops
  • System maintains almost-ideal state as nodes join
    and fail

32
Half-life [Liben-Nowell 2002]
[Diagram: a system of N nodes doubles when N new nodes join, and halves when N/2 of the old nodes leave]
  • Doubling time: time for N new nodes to join
  • Halving time: time for N/2 old nodes to fail
  • Half-life: min(doubling time, halving time)

33
Applying the half-life
  • For any node u in any P2P network:
  • if u wishes to stay connected with high probability,
  • then, on average, u must be notified about Ω(log N) new nodes per half-life
  • And so on ...

34
5. Optimize routing to reduce latency
[Diagram: N20, N40, N41, N80 are adjacent on the ring but far apart in the underlying Internet]
  • Nodes that are close on the ring may be far apart in the Internet
  • Goal: put nodes in the routing table that give few hops and low latency

35
The closeness metric impacts the choice of nearby nodes
[Diagram: nodes N06, N60, N105 (USA), N103 (Europe), N32, and key K104 (Far East) spread around the ring and around the world]
  • Chord's numerical closeness and fixed table restrict the choice
  • Prefix-based routing allows for choice
  • Kademlia offers choice in nodes and places nodes in an absolute order: close(a, b) = XOR(a, b) (sketched below)
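A small sketch of the XOR metric: the distance between two IDs is their bitwise XOR read as an integer, so every node ranks the same candidates for a key in the same absolute order. The 8-bit IDs are made up:

    def xor_distance(a: int, b: int) -> int:
        # Kademlia closeness: close(a, b) = a XOR b, read as an integer.
        return a ^ b

    key = 0b10110100
    candidates = [0b10110000, 0b10100000, 0b11110100, 0b00110100]

    # All nodes agree on this ranking; among candidates that share a prefix
    # range, a node is free to keep the lowest-latency ones it has seen.
    for node in sorted(candidates, key=lambda n: xor_distance(n, key)):
        print(f"{node:08b}  distance {xor_distance(node, key)}")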

36
Neighbor set
[Diagram: the same geographically spread nodes and key K104 as on the previous slide]
  • From k candidate nodes, insert the nearest node with the appropriate prefix into the routing table
  • Assumption: the triangle inequality holds

37
Finding k near neighbors
  • Ping random nodes
  • Swap neighbor sets with neighbors
  • Combine with random pings to explore
  • A provably good algorithm finds nearby neighbors based on sampling [Karger and Ruhl 02]

38
Finding the nearest neighbor [Karger and Ruhl 02]
  • Maintain a neighbor table
  • entry i: k nodes within distance 2^i · r
  • Find the nearest node:
  • ask the nodes in entry i for their nodes in entry i
  • insert the nearest into entry i-1

[Diagram: concentric balls of radius r, 2r, 4r, ... around node A; each neighbor-table entry covers one ball]

  • Claim: the algorithm finds the nearest nodes with high probability, provided that
  • the triangle inequality holds, and
  • the doubling property holds
  • Chord maintains both a finger table and a neighbor table

39
6. Malicious participants
  • Attacker denies service
  • Flood DHT with data
  • Attacker returns incorrect data (detectable)
  • Self-authenticating data
  • Attacker denies that data exists (a liveness attack)
  • A bad node is responsible for the data, but says it has none
  • A bad node supplies incorrect routing info
  • Bad nodes form a bad ring, and a good node joins it

Basic approach: use redundancy
40
Sybil attack [Douceur 02]
  • Attacker creates multiple identities
  • Attacker controls enough nodes to foil the
    redundancy

  • Need a way to control creation of node IDs

41
One solution: secure node IDs
  • Every node has a public key
  • Certificate authority signs public key of good
    nodes
  • Every node signs and verifies messages
  • Quotas per publisher

42
Another solution: exploit practical Byzantine protocols
[Diagram: a pre-configured core set of servers (N06, N32, N60, N103, N105) admits ordinary nodes (N) into the system]
  • A core set of servers is pre-configured with keys and performs admission control
  • The servers achieve consensus with a practical Byzantine recovery protocol [Castro and Liskov 99 and 00]
  • The servers serialize updates (OceanStore) or assign secure node IDs (configuration service)

43
A more decentralized solution: weak secure node IDs
  • ID = SHA-1(IP address, node)
  • Assumption: the attacker controls a limited number of IP addresses
  • Before using a node, challenge it to verify its ID (sketched below)
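A sketch of weak secure node IDs and the verification challenge; the exact hash input (IP address plus a per-host node index) is an assumption for illustration:

    import hashlib

    def node_id(ip: str, index: int) -> str:
        # Weak secure ID: SHA-1 over (IP address, node index). An attacker
        # can only claim IDs derived from IP addresses it actually controls.
        return hashlib.sha1(f"{ip}/{index}".encode()).hexdigest()

    def verify(claimed_id: str, ip: str, index: int) -> bool:
        # Challenge: before using a node reached at 'ip', recompute the ID
        # it should have and compare it with the ID it claims.
        return node_id(ip, index) == claimed_id

    nid = node_id("18.26.4.9", 0)
    assert verify(nid, "18.26.4.9", 0)        # honest node passes
    assert not verify(nid, "10.0.0.1", 0)     # ID does not match this address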

44
Using weak secure node IDs
  • Detect malicious nodes
  • Define verifiable system properties
  • Each node has a successor
  • Data is stored at its successor
  • Allow querier to observe lookup progress
  • Each hop should bring the query closer
  • Cross-check routing tables with random queries
  • Recovery: assume a limited number of bad nodes
  • Quota per node ID

45
7. Programming abstraction
  • Blocks versus files
  • Database queries (join, etc.)
  • Mutable data (writers)
  • Atomicity of DHT operations

46
Philosophical questions
  • How decentralized should systems be?
  • Gnutella versus content distribution network
  • Have a bit of both? (e.g., OceanStore)
  • Why does the distributed systems community have
    more problems with decentralized systems than the
    networking community?
  • A distributed system is a system in which a computer you don't know about renders your own computer unusable
  • Internet (BGP, NetNews)

47
What are we doing at MIT?
  • Building a system based on Chord
  • Applications: CFS, Herodotus, Melody, backup store, R/W file system, ...
  • Collaborate with other institutions
  • P2P workshop
  • Big ITR
  • Building a large-scale testbed
  • RON, PlanetLab

48
Summary
  • Once we have DHTs, building large-scale
    distributed applications is easy
  • Single, shared infrastructure for many
    applications
  • Robust in the face of failures and attacks
  • Scalable to a large number of servers
  • Self-configuring across administrative domains
  • Easy to program
  • Let's build DHTs ... stay tuned ...