P2P Databases - PowerPoint PPT Presentation

Provided by: ashish9
Transcript and Presenter's Notes

1
P2P Databases
2
Overview
  • 0. Data objects, pointers (URLs), and attributes
  • 1. Freeform versus structured attribute data
  • 2. Centralized indices for attribute data and
    pointers (e.g., Napster)
  • 3. Query by flooding (e.g., Gnutella)
  • 4. DHTs (e.g., Chord)
  • 5. Problems with DHTs
  • 6. Keyword queries in DHTs (Magnolia)
  • 7. Popularity queries
  • 8. Demo of system
  • 9. (if time) Data transmission: overlay vs.
    DHT multicast; BitTorrent / SplitStream
  • 10. (if time) P2P file systems and versioning
    (precursor to undo/redo logging from later in the
    course)

3
P2P Today
edonkey, bittorrent, pastry, jxta, can, fiorana, napster, freenet,
united devices, open cola, ?, aim, ocean store, netmeeting, farsite,
gnutella, icq, ebay, morpheus, limewire, seti@home, bearshare, uddi,
grove, jabber, popular power, kazaa, folding@home, tapestry,
mojo nation, process tree, chord
4
Object representation and storage
Objects
Attributes: Name, Artist, Album, Genre
Pointer to object
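A minimal sketch of the record this slide describes: attribute data plus a pointer to the object itself. The class name `SharedObject`, the field names, and the example URL are illustrative assumptions, not part of the original system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedObject:
    """Attribute data describing one shared file, plus a pointer to it."""
    name: str
    artist: str
    album: str
    genre: str
    pointer: str  # URL of the peer holding the actual bytes (hypothetical)


song = SharedObject(
    name="Example Song",
    artist="Example Artist",
    album="Example Album",
    genre="Jazz",
    pointer="http://peer.example/files/example-song.mp3",
)
```

Only the small attribute record has to travel through an index or a query; the object itself stays where the pointer says it is.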
5
P2P vs. Distributed DBMS
Traditional DDBMS Issues
  • Transactions
  • Distributed Query Optimization
  • Interoperation of heterogeneous data sources
  • Reliability/failure of nodes

Complex features do not scale
6
P2P vs. Distributed DBMS
  • Example application: file sharing
  • Simple data model and query language
  • No complex query optimization
  • Easy interoperation
  • No guarantee on quality of results
  • Individual site availability unimportant
  • Local updates
  • No transactions
  • Network partitions OK

Simple: amenable to a large-scale network of PCs
7
Example: file sharing
  • Challenge 1: Performance
      • Asking everyone is expensive!
      • If I am smart, I only need to ask one peer
      • How can I be smart?

File X?
8
Search in P2P
  • System can control:
      • Connections made by users/topology
      • Data placement
      • Query type
  • Tight control: structured
      • Efficient, comprehensive
  • Loose control: unstructured
      • Inefficient, not comprehensive, simple,
        expressive
      • Used in real life

9
Centralized
  • Napster model
  • Benefits:
      • Efficient search
      • Limited bandwidth usage
      • No per-node state
  • Drawbacks:
      • Central point of failure
      • Limited scale

[Figure: peers Bob, Alice, Jane, and Judy]
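A rough sketch of the centralized model, assuming a single server that maps keywords to (peer, filename) pairs. `CentralIndex` and its methods are hypothetical names for illustration; this is not Napster's actual protocol.

```python
from collections import defaultdict


class CentralIndex:
    """One server holds the whole index; peers hold only their own files."""

    def __init__(self):
        self._by_keyword = defaultdict(set)  # keyword -> {(peer, filename)}

    def register(self, peer, filename):
        # Index the file under every word of its name.
        for word in filename.lower().split():
            self._by_keyword[word].add((peer, filename))

    def unregister_peer(self, peer):
        # Drop every entry that points at a peer that has left.
        for entries in self._by_keyword.values():
            entries.difference_update({e for e in entries if e[0] == peer})

    def search(self, keyword):
        # A single request to the server answers the query.
        return sorted(self._by_keyword.get(keyword.lower(), set()))


index = CentralIndex()
index.register("Alice", "blue moon")
index.register("Bob", "blue train")
```

A search costs one round trip and peers keep no routing state, but every query depends on this one server, which is the central point of failure listed above.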
10
http://www.snocap.com/
11
Unstructured: Query Flooding
12
Problems with unstructured
  • Inefficient
      • Query messages are flooded
      • Even if routing is intelligent, worst-case load
        is still O(n), where n is the number of nodes in
        the system
  • Not comprehensive
      • If I do not get a result for my query, is it
        because none exists?
  • (Of course, many optimizations are possible)
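Both problems show up in a small simulation. This sketch assumes a simplified flooding rule (each peer forwards the query to every neighbor until a hop limit runs out); the graph, the `flood_query` name, and the TTL values are made up for illustration and do not reproduce Gnutella's wire protocol.

```python
from collections import deque


def flood_query(graph, holders, start, ttl):
    """Flood a query from `start`; return (peers that answered, messages sent)."""
    seen = {start}
    hits = {start} if start in holders else set()
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if hops_left == 0:
            continue                 # hop limit reached, stop forwarding
        for neighbor in graph[peer]:
            messages += 1            # every forwarded copy costs one message
            if neighbor in seen:
                continue             # duplicate copy is dropped by the receiver
            seen.add(neighbor)
            if neighbor in holders:
                hits.add(neighbor)
            frontier.append((neighbor, hops_left - 1))
    return hits, messages


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(flood_query(graph, {"E"}, "A", ttl=2))  # the file on E is out of reach
print(flood_query(graph, {"E"}, "A", ttl=3))  # found, at the cost of more messages
```

With the smaller hop limit the query returns nothing even though peer E has the file, so an empty answer does not prove that no result exists.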

13
Distributed Hash Tables (DHTs)
  • Model
      • Key/object pair; the key is hashed to get an ID
  • Example
      • Objects are files
      • The key is the content of the file
      • The ID is the hash of the file contents
  • Single operation: Lookup(ID)
      • Input: integer ID
      • Output: the object with the corresponding ID
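A small sketch of the key-to-ID step. SHA-1 and the 8-bit ID width are assumptions chosen for illustration; the slides do not say which hash function or ID size is used.

```python
import hashlib

M_BITS = 8  # illustrative ID width; real systems use far more bits


def object_id(content: bytes, m: int = M_BITS) -> int:
    """Hash the file contents and reduce the digest to an m-bit integer ID."""
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


file_id = object_id(b"contents of some shared file")
```

The same contents always hash to the same ID, so any peer can compute where to look without asking anyone.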

14
Identifiers
  • IDs are m-bit integers
  • Nodes are also assigned IDs
      • Commonly assigned by hashing a node's IP address,
        although there are many problems with this
  • An object is stored on the node with the smallest
    ID greater than or equal to the object's ID
      • This node is called the successor of the object's
        ID
  • IDs are arranged on a circle, so 2^m - 1 wraps
    around to 0

15
Data Placement
  • m = 3
  • Nodes: 0, 1, 3
  • Data: 1, 2, 6

[Figure: ring of IDs 0 through 7 showing the nodes and data items]
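The placement rule can be checked against the node and data IDs on this slide. The sketch assumes the usual Chord convention that a key equal to a node ID is stored on that node, and that a key past the largest node ID wraps around to the smallest; `successor` is an illustrative helper name.

```python
from bisect import bisect_left


def successor(node_ids, key):
    """Return the first node ID at or after `key` on the ring."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key)
    return ring[i % len(ring)]  # past the largest ID, wrap to the smallest


nodes = [0, 1, 3]
placement = {key: successor(nodes, key) for key in (1, 2, 6)}
print(placement)
```

Under these assumptions, data item 1 lands on node 1, item 2 on node 3, and item 6 wraps around to node 0.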
16
Connections
  • Finger pointers at distances 2^0, 2^1, ..., 2^(m-1)

[Figure: ring of IDs 0 through 7 with finger pointers]
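A sketch of how a finger table could be computed, assuming finger i of node n points at the successor of (n + 2^i) mod 2^m for i = 0 .. m-1. It reuses the nodes 0, 1, 3 and m = 3 from the Data Placement slide; the function names are illustrative.

```python
from bisect import bisect_left


def successor(ring, key):
    """First node ID at or after `key`, wrapping around the ring."""
    i = bisect_left(ring, key)
    return ring[i % len(ring)]


def finger_table(node, node_ids, m):
    """Finger i points at the successor of (node + 2^i) mod 2^m."""
    ring = sorted(node_ids)
    return [successor(ring, (node + 2 ** i) % 2 ** m) for i in range(m)]


for n in (0, 1, 3):
    print(n, finger_table(n, [0, 1, 3], m=3))
```

Each node keeps only m pointers, yet the largest one jumps halfway around the ring, which is what makes short lookups possible.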
17
Query
  • Lookup(objectID)
      • objectID is typically the ID of the object you
        are looking for, but not necessarily
  • Approach
      • Find the predecessor of the object,
        i.e., the node with the largest ID that is smaller
        than the object ID
      • Return the successor of the predecessor

18
Query Example
  • Say node 0 wants to find the object with ID 7
  • For simplicity, we will assume a node exists at
    every ID in the space

19
Query Example
Node 0: Lookup(7)
Node 0: FindPred(7)
[Figure: ring of IDs 0 through 7]
20
Query Example
Node 4: FindPred(7)
[Figure: ring of IDs 0 through 7]
21
Query Example
Node 6: FindPred(7)
Node 6 is the predecessor. Return successor node 7.
[Figure: ring of IDs 0 through 7]
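The three query-example slides can be replayed in code. This is a sketch of a Chord-style lookup under the example's own assumption (m = 3, a node at every ID); the helper names follow the slides' FindPred wording, but the code itself is illustrative, not the course implementation.

```python
M = 3
RING = 2 ** M
NODES = list(range(RING))  # the example assumes a node at every ID


def in_open(x, a, b):
    """True if x lies strictly between a and b going clockwise."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """True if x lies in (a, b] going clockwise."""
    return a < x <= b if a < b else (x > a or x <= b)


def successor(key):
    """First node at or after `key`, wrapping to the smallest node."""
    return min((n for n in NODES if n >= key), default=min(NODES))


def fingers(node):
    """Finger i is the successor of (node + 2^i) mod 2^M; finger 0 is the successor."""
    return [successor((node + 2 ** i) % RING) for i in range(M)]


def closest_preceding_finger(node, key):
    """Farthest finger that still lies before `key`."""
    for f in reversed(fingers(node)):
        if in_open(f, node, key):
            return f
    return node


def find_predecessor(start, key):
    """Hop along fingers until `key` falls between a node and its successor."""
    node, path = start, [start]
    while not in_half_open(key, node, fingers(node)[0]):
        node = closest_preceding_finger(node, key)
        path.append(node)
    return node, path


def lookup(start, key):
    pred, path = find_predecessor(start, key)
    return fingers(pred)[0], path


print(lookup(0, 7))
```

Starting at node 0, FindPred runs at nodes 0, 4, and 6, and the lookup returns node 7, matching the slides.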
22
Query characteristics
  • With high probability, a query can be answered by
    contacting O(log N) nodes
      • N is the total number of nodes in the network
      • Efficient!
  • Also notice: if an object with the ID exists in
    the network, it will be found
      • Comprehensive!
  • Per-node state is also O(log N) in size

23
Query characteristics
  • Note that finger pointers are not required for
    correct operation
      • Only successor pointers are needed
      • But then the cost of a query increases
      • O(N) in the worst case
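A tiny sketch of the successor-only case, assuming a node at every ID from 0 to N-1 as in the query example; the function name is illustrative.

```python
def hops_with_successors_only(num_nodes, start, target):
    """Count hops when each node knows only its immediate successor."""
    hops, node = 0, start
    while node != target:
        node = (node + 1) % num_nodes  # follow the single successor pointer
        hops += 1
    return hops


print(hops_with_successors_only(8, 0, 7))
```

On the 8-node ring, reaching node 7 from node 0 takes N - 1 = 7 hops this way, compared with the three hops (0 to 4 to 6 to 7) in the finger-pointer example.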

24
Advantages of Structured?
  • Scalability/Efficiency
      • Load grows as O(log N)
  • Comprehensiveness

25
Disadvantages? (cont)
  • Availability of data
      • If a node dies suddenly, what happens to the data
        it was storing?
      • MUST replicate data across multiple nodes
  • Query language
      • How can we express keyword queries efficiently?
      • Many useful applications require different
        languages

26
Magnolia
27
Resulting Distribution
28
Prefix hashing
29
Balancing
Innovation: balanced over the sibling group
Sibling group ID: 100
All siblings in a group share the same prefix
30
Insert
Keyword -> hP -> SiblingGroup ID
Random Sibling
Locate a sibling node via SIFT
31
Advantages
  • Good Balancing Properties

32
Advantages
  • Low Traffic Load on nodes for popular queries
  • Quick Lookup
  • Popularity Ranking of Objects
  • Distributed Replication for resilience

33
Implementing Magnolia
  • Developed on top of a Chord clone written in
    Python
      • If you're going to write a peer-to-peer app, why
        not leverage existing modules and libraries?
  • Challenge: how do we implement group-based stores
    and queries without requiring additional network
    maintenance?

34
Chord's Finger Table
  • A Chord node maintains a finger table of M IPs
    pointing to nodes ahead of it in the ring.
  • The pointer at index i is the successor of
    (node id + 2^(i-1)). This lets us reach any node in the
    network in O(log M) hops
  • We use the M most significant bits in a node's
    id to indicate its group. We want to reach any
    group in O(log M) hops.
  • Do we need another table?
  • Nope. The last M entries in our finger table
    provide this.
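A sketch of why the last finger entries are enough to hop between groups. The widths (8-bit IDs, 3 group bits) and all names are assumptions for illustration. The targets shown are the IDs each finger aims at; the actual finger is the successor of that target.

```python
ID_BITS = 8     # total bits in a node ID (illustrative)
GROUP_BITS = 3  # most significant bits that name the sibling group (illustrative)


def group_of(node_id):
    """The group is read off the most significant bits of the ID."""
    return node_id >> (ID_BITS - GROUP_BITS)


def finger_targets(node_id):
    """ID targeted by each finger: node_id + 2^i for i = 0 .. ID_BITS-1."""
    return [(node_id + 2 ** i) % 2 ** ID_BITS for i in range(ID_BITS)]


def group_hop_targets(node_id):
    """The last GROUP_BITS fingers add a multiple of the group size."""
    return finger_targets(node_id)[-GROUP_BITS:]


node = 0b00010110  # ID 22, in group 0
for target in group_hop_targets(node):
    print(target, group_of(target))
```

The largest offsets only change the group bits and leave the low bits untouched, so those entries already behave like a finger table over groups and no second table is needed.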

35
Talking to Siblings
  • How do we propagate queries through the group?
  • Naïve solution: send to our predecessor and
    successor.
  • A better solution: we can send a query throughout
    the group by treating the sibling group as a tree.

36
Sibling Tree
N/N 16 M/M 4
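Nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Tree edges (parent: children)
  • 0: 0+2^3 = 8, 0+1 = 1
  • 8: 8+2^2 = 12, 8+1 = 9
  • 1: 1+2^2 = 5, 1+1 = 2
  • 12: 12+2^1 = 14, 12+1 = 13
  • 9: 9+2^1 = 11, 9+1 = 10
  • 5: 5+2^1 = 7, 5+1 = 6
  • 2: 2+2^1 = 4, 2+1 = 3
  • 14: 14+2^0 = 15
  • Leaves: 3, 4, 6, 7, 10, 11, 13, 15

Every edge can be found in the finger table!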
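The slide does not spell out how the tree is built. One rule that reproduces its edges for 16 sibling positions is sketched here: a node responsible for the range [node, end) hands the far part to node + 2^k (the largest power of two that fits) and the near part to node + 1. The rule is inferred from the figure, and `build_sibling_tree` is a hypothetical name, not Magnolia's code.

```python
def build_sibling_tree(root, end):
    """Return {parent: [children]} for a broadcast tree over IDs [root, end)."""
    tree = {}
    stack = [(root, end)]
    while stack:
        node, hi = stack.pop()
        remaining = hi - node - 1          # IDs after `node` still to cover
        if remaining <= 0:
            tree[node] = []                # leaf: nothing left to delegate
            continue
        offset = 1 << (remaining.bit_length() - 1)  # largest power of two <= remaining
        far = node + offset
        children = [far]
        stack.append((far, hi))            # far child covers [far, hi)
        if offset > 1:
            near = node + 1
            children.append(near)
            stack.append((near, far))      # near child covers [near, far)
        tree[node] = children
    return tree


tree = build_sibling_tree(0, 16)
for parent in sorted(tree):
    print(parent, tree[parent])
```

Every child sits at a power-of-two offset from its parent, which is why each edge is already a finger-table entry, and the depth of the tree grows with the logarithm of the group size.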
37
Sibling Tree Problems
  • Problems
      • Not every possible node will exist
      • Not every node will have results to report
      • The query maker needs to know when the search is
        done
  • But we're okay!
      • Nodes can determine if a child sub-tree is dead
      • Even if a child node in our sibling table is of a
        higher ID than expected,
        its sub-tree contains all existing descendants of
        the expected id
      • We can predict when a child is in a sibling's or
        our ancestor's tree

38
Bigger Problems
  • What if a pointer in our finger table fails?
      • We either have to find the successor to its id
        or fail to query the sub-tree
  • What if the lowest ID node isn't the root of our
    tree?
      • Some of our edges won't be in our finger table

39
Popularity queries
40
Yulania, Demo
41
BitTorrent
42
SplitStream