Peer-to-Peer (p2p) Querying - PowerPoint PPT Presentation

1
Peer-to-Peer (p2p) Querying
  • Joe Hellerstein
  • CS186 Fall 2005

2
Note
  • These slides are based on a tutorial given at
    VLDB 2004
  • http://db.cs.berkeley.edu/jmh/talks/vldb04-p2ptut-final.ppt
  • http://db.cs.berkeley.edu/jmh/talks/vldb04-p2ptut-2upbw.pdf
  • These slides were made on a Mac
  • May not display correctly in PPT for Windows
  • Animation is often a portability problem for PPT
  • PPT's Compatibility Check finds 185 issues!

3
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

4
p2p
  • Distributed applications without servers
  • Scale
  • Peers
  • Churn
  • Self-admin
  • People tend to think of file stealing
  • Respect the musicians, don't steal music.
  • Also used for, e.g., swapping biological data,
    open-source software, etc.
  • Lots of potential applications of the technology
  • Go make some up!
  • My favorite: Public Health for the Internet
  • P2P is an inherently democratic architecture

5
p2p, cont.
  • p2p is organic
  • Start the next phenomenon in your dorm room
  • No need for a hosted server, administrator, etc.
  • Hence no need for Venture Capital
  • Hence no need to worry if it will take off
  • Infrastructure right-sizes itself

6
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

7
Early P2P I: Client-Server
  • Napster

xyz.mp3
xyz.mp3 ?
8
Early P2P I: Client-Server
  • Napster
  • Client-Server search

xyz.mp3
9
Early P2P I: Client-Server
  • Napster
  • Client-Server search

xyz.mp3
xyz.mp3 ?
10
Early P2P I: Client-Server
  • Napster
  • Client-Server search
  • pt2pt file xfer

xyz.mp3
xyz.mp3 ?
11
Early P2P I: Client-Server
  • Napster
  • Client-Server search
  • pt2pt file xfer

xyz.mp3
xyz.mp3 ?
12
Early P2P II: Flooding on Overlays
xyz.mp3
xyz.mp3 ?
An overlay network. Unstructured.
13
Early P2P II: Flooding on Overlays
xyz.mp3
xyz.mp3 ?
Flooding
14
Early P2P II: Flooding on Overlays
xyz.mp3
xyz.mp3 ?
Flooding
15
Early P2P II: Flooding on Overlays
xyz.mp3
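A minimal sketch of flooding on an unstructured overlay, to make the cost of the approach concrete. The overlay graph, the node names, the `files` table, and the hop limit `ttl` are all hypothetical; this is not Gnutella's wire protocol, only the forwarding idea.

```python
from collections import deque


def flood_search(overlay, files, start, filename, ttl):
    """Flood a query breadth-first; return (nodes_with_file, messages_sent)."""
    seen = {start}                      # nodes that already handled the query
    frontier = deque([(start, ttl)])
    hits, messages = [], 0
    while frontier:
        node, hops_left = frontier.popleft()
        if filename in files.get(node, ()):
            hits.append(node)
        if hops_left == 0:
            continue                    # hop limit reached, stop forwarding
        for neighbor in overlay[node]:
            messages += 1               # every forwarded copy costs bandwidth
            if neighbor not in seen:    # duplicate copies are dropped on arrival
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return hits, messages


# Hypothetical five-node overlay; only node "e" stores the file.
overlay = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
files = {"e": {"xyz.mp3"}}

print(flood_search(overlay, files, "a", "xyz.mp3", ttl=2))  # file is out of reach
print(flood_search(overlay, files, "a", "xyz.mp3", ttl=3))  # file is found
```

With a hop limit of 2 the query never reaches the node holding the file, even though the file exists; raising the limit finds it but sends more messages.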
16
Early P2P II.v: Ultrapeers
  • Ultrapeers can be installed (KaZaA) or
    self-promoted (Gnutella)

17
Gnutella Network
Oct 2003 Crawl
  • Popular open-source file-sharing network
  • 450,000 users as of 2003
  • 2,000,000 today
  • Ultrapeer-based Topology
  • Queries flooded among ultrapeers
  • Leaf nodes shielded from query traffic
  • Based on multiple crawlers from 30 vantage points
    on PlanetLab

[Figure legend: ultrapeer nodes, leaf nodes; file counts per node: 100 Files, 0 Files, 0-100 Files]
18
PlanetLab
  • PlanetLab: open, globally distributed platform
    for deploying planetary-scale network services
  • 631 nodes at 299 sites, 5 continents
  • URL: http://www.planet-lab.org

19
Gnutella Measurements
  • Quality of Searches
  • Recall (% of all relevant items retrieved)
  • Distinct Recall (% of all relevant distinct items
    retrieved)
  • Response Time (Latency) to 1st result
  • Software utilized
  • Modified the LimeWire Gnutella Client
  • Run as leaf or ultrapeer
  • Monitor Gnutella traffic
  • Inject queries and gather results

20
Gnutella Search Quality
  • Log of Gnutella queries from LimeWire clients
  • Reissued Gnutella queries
  • 700 randomly chosen queries
  • 30 LimeWire Ultrapeers on PlanetLab
  • 3 different times
  • Computed Query Recall
  • Each query issued simultaneously from 30
    ultrapeers
  • Union of results from 30 ultrapeers
  • Union-of-30 is our approximation of perfect
    answer
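The recall computation described above can be sketched as follows. The result sets are made-up identifiers, three vantage points stand in for the 30 ultrapeers, and treating each result as a plain set member is an assumption about how items are compared.

```python
def query_recall(single_results, all_results):
    """Recall of one vantage point against the union of all vantage points."""
    reference = set().union(*all_results)   # the "union-of-N" approximation
    if not reference:
        return None                         # no vantage point found anything
    return len(set(single_results) & reference) / len(reference)


# Hypothetical results for one query, seen from three vantage points.
results_per_ultrapeer = [
    {"f1", "f2"},
    {"f2"},
    {"f1", "f3", "f4"},
]

for results in results_per_ultrapeer:
    print(query_recall(results, results_per_ultrapeer))
```

The union is only an approximation of the perfect answer: items that none of the vantage points reached are invisible to this measure.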

21
Queries with Small Result Sets
22
Result Size CDF
Single Query
Union-of-30 Query
Large fraction of queries return few or no
results even when they exist
23
Query Latency
24
Summary of Measurements
  • Searching on Flood-based networks
  • Highly effective for popular (highly replicated)
    items
  • Less effective for rare items
  • Significant opportunity to do better
  • Large fraction of queries return few or no
    results even when they exist
  • Bad response times for queries on rare items
  • Aggressive flooding is not the solution
  • Diminishing returns with flooding
  • Does not improve response times

25
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

26
High-Level Idea: Indirection
  • Indirection in space
  • Logical (content-based) IDs, routing to those IDs
  • Content-addressable network
  • Tolerant of churn
  • nodes joining and leaving the network
  • Indirection in time
  • Want some scheme to temporally decouple send and
    receive
  • Persistence required. Typical Internet solution:
    soft state
  • Combo of persistence via storage and via retry
  • Publisher requests TTL on storage
  • Republishes as needed
  • Metaphor: Distributed Hash Table

hz
27
What is a DHT?
  • Hash Table
  • data structure that maps keys to values
  • essential building block in software systems
  • Distributed Hash Table (DHT)
  • similar, but spread across the Internet
  • Interface
  • insert(key, value)
  • lookup(key)
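
A toy, single-process sketch of the insert/lookup interface. It is a simulation under stated assumptions, not a real DHT: the class name `ToyDHT`, the 16-bit identifier space, the use of SHA-1, and the rule "a key lives on the first node whose ID is at or after the key's hash" are all choices made for illustration.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """All nodes live in one process; only the key-to-node mapping is modeled."""

    def __init__(self, node_names, bits=16):
        self.space = 2 ** bits
        # Place every node on the identifier ring by hashing its name.
        self.ring = sorted((self._hash(name), name) for name in node_names)
        self.store = {name: {} for name in node_names}

    def _hash(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.space

    def owner(self, key):
        """First node at or after hash(key), wrapping around the ring."""
        ids = [node_id for node_id, _ in self.ring]
        index = bisect_left(ids, self._hash(key)) % len(self.ring)
        return self.ring[index][1]

    def insert(self, key, value):
        self.store[self.owner(key)].setdefault(key, []).append(value)

    def lookup(self, key):
        return list(self.store[self.owner(key)].get(key, []))


dht = ToyDHT(["n1", "n2", "n3", "n4"])
dht.insert("xyz.mp3", "10.0.0.7")
print(dht.owner("xyz.mp3"), dht.lookup("xyz.mp3"))
```

Callers only see keys and values; which node stores a key is decided by the hash, which is the content-based indirection described earlier.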

28
How?
  • Every DHT node supports a single operation:
  • Given key as input, route messages toward node
    holding key

29
DHT in action
30
DHT in action
31
DHT in action
Operation: take key as input; route messages to
node holding key
32
DHT in action: put()
insert(K1,V1)
Operation: take key as input; route messages to
node holding key
33
DHT in action: put()
insert(K1,V1)
Operation: take key as input; route messages to
node holding key
34
DHT in action: put()
(K1,V1)
Operation: take key as input; route messages to
node holding key
35
DHT in action: get()
retrieve (K1)
Operation: take key as input; route messages to
node holding key
36
DHT Design Goals
  • An overlay network with
  • Flexible mapping of keys to physical nodes
  • Small network diameter
  • Small degree (fanout)
  • Local routing decisions
  • Robustness to churn
  • Routing flexibility
  • Decent locality (low stretch)
  • A storage or memory mechanism with
  • No guarantees on persistence
  • Maintenance via soft state
  • Each fact has a time-to-live
  • Publisher of the fact must republish to achieve
    persistence
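
Soft-state storage can be sketched as a table whose entries expire unless the publisher refreshes them. The class name `SoftStateStore` and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions for illustration.

```python
class SoftStateStore:
    """Each stored fact carries a time-to-live; expired facts disappear."""

    def __init__(self):
        self._items = {}  # key -> (value, expiry_time)

    def put(self, key, value, ttl, now):
        # Publishing again simply overwrites the entry and extends its life.
        self._items[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._items.get(key)
        if entry is None or entry[1] <= now:
            self._items.pop(key, None)  # drop the expired fact lazily
            return None
        return entry[0]


store = SoftStateStore()
store.put("xyz.mp3", "10.0.0.7", ttl=30, now=0)
print(store.get("xyz.mp3", now=10))   # still alive
store.put("xyz.mp3", "10.0.0.7", ttl=30, now=25)  # publisher republishes
print(store.get("xyz.mp3", now=40))   # alive only because of the republish
print(store.get("xyz.mp3", now=60))   # expired, publisher went silent
```

If a publisher leaves the network, its facts vanish on their own after the TTL, so no explicit cleanup protocol is needed.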

37
DHT Outline
  • High-level overview
  • Fundamentals of structured network topologies
  • And examples
  • One concrete DHT
  • Chord
  • Some systems issues
  • Storage models and soft state
  • Locality
  • Churn management

38
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

39
An Example DHT: Chord
  • Assume n = 2^m nodes for a moment
  • A complete Chord ring
  • We'll generalize shortly

40
An Example DHT: Chord
41
An Example DHT: Chord
42
An Example DHT: Chord
  • Overlaid 2^k-Gons

43
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0

44
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0

45
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0

46
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0

47
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0

48
Routing in Chord
  • At most one of each Gon
  • E.g. 1-to-0
  • What happened?
  • We constructed the binary number 15!
  • Routing from x to y is like computing (y - x) mod
    n by summing powers of 2

Figure labels (hop lengths): 2, 4, 8, 1
Diameter: log n (1 hop per gon type)
Degree: log n (one outlink per gon type)
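The routing argument can be sketched in a few lines, assuming a complete ring of n = 2^m nodes as on the slides. The function name `chord_route` and the choice to take the largest hop first are illustrative assumptions; the point is only that the hops are the set bits of (y - x) mod n.

```python
def chord_route(source, target, m):
    """Route on a complete 2**m-node ring using at most one hop per power of 2."""
    n = 2 ** m
    distance = (target - source) % n    # clockwise distance still to cover
    path = [source]
    current = source
    for k in reversed(range(m)):        # try hop lengths 2**(m-1), ..., 2, 1
        hop = 2 ** k
        if distance & hop:              # this power of 2 is part of the distance
            current = (current + hop) % n
            path.append(current)
    return path


# The 1-to-0 example on a 16-node ring: distance 15 = 8 + 4 + 2 + 1.
print(chord_route(1, 0, m=4))
```

Because each power of 2 is used at most once, no route is longer than m = log2(n) hops, and each node needs only m outgoing links.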
49
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

50
File Search using DHTs
  • Inverted Index in a DHT
  • To answer the query "term1 AND term2":
  • Route query to hash(term1) and hash(term2)
  • Rehash postings for term1 on DocID
  • Rehash postings for term2 on DocID
  • Do local intersection at each node that received
    tuples
  • Send matches to querier
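
The steps above can be sketched as a single-process simulation. The number of nodes, the helper names (`node_for`, `publish`, `search_and`), the use of SHA-1, and the sample documents are all hypothetical; a real system would send these tuples over the network instead of touching dictionaries.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 8  # hypothetical network size


def node_for(key):
    """Map a key (term or DocID) to the node responsible for it."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES


def publish(index, doc_id, terms):
    """Store one posting per term at the node that owns hash(term)."""
    for term in terms:
        index[node_for(term)][term].add(doc_id)


def search_and(index, term1, term2):
    # Route the query to hash(term1) and hash(term2) to read the posting lists.
    postings1 = index[node_for(term1)].get(term1, set())
    postings2 = index[node_for(term2)].get(term2, set())
    # Rehash both posting lists on DocID, so equal DocIDs meet at one node.
    received = defaultdict(lambda: (set(), set()))
    for doc_id in postings1:
        received[node_for(doc_id)][0].add(doc_id)
    for doc_id in postings2:
        received[node_for(doc_id)][1].add(doc_id)
    # Each node that received tuples intersects locally and reports matches.
    matches = set()
    for from_term1, from_term2 in received.values():
        matches |= from_term1 & from_term2
    return matches


index = defaultdict(lambda: defaultdict(set))  # node -> term -> set of DocIDs
publish(index, "doc1", ["madonna", "live"])
publish(index, "doc2", ["madonna", "remix"])
publish(index, "doc3", ["live", "remix"])
print(search_and(index, "madonna", "live"))
```

Note that both full posting lists are shipped before any intersection happens, so a query over two very common terms moves a lot of data even when the final answer is small.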

51
Keyword Search using DHTs
  • Inverted Lists hashed by keyword (term) in the
    DHT

Query: T1 AND T2
52
File Search: Flooding vs. DHTs
  • Recall
  • Flooding can miss files
  • DHTs should never
  • Query complexity
  • Flooding can handle arbitrary single-site logic
  • DHTs can do equijoins, selections, aggregates,
    etc.
  • But not so good at fancy selections like
    wildcards
  • Query Performance
  • Flooding can be slow to find things, uses lots of
    BW
  • DHTs expensive to publish documents with lots of
    terms
  • DHTs expensive to intersect really long term
    lists
  • Even if output is really small!
  • Not likely to replace Google any time soon
  • Hybrid solution!

53
Hybrid Search
  • Hybrid: best of both worlds

Flood-based Network (Search Popular Items)
DHT (Search Rare Items)
54
Challenges
  • Identifying Rare Items
  • Query Results Size (QRS)
  • Publish items from previous queries that return
    few results
  • Term Frequency Statistics
  • Single Term (TF)
  • Term Pairs (TPF)
  • Sample items on neighboring nodes (SAM)
  • Network Churn
  • Use Ultrapeers as DHT nodes
  • Avoid publishing items from short-lived nodes
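
A rough sketch of the Query Results Size (QRS) heuristic and of a hybrid lookup that falls back to the DHT. The function names, the thresholds, and the stand-in `flood` and `dht_lookup` callables are assumptions for illustration, not the scheme evaluated in the slides.

```python
def select_rare_items(observed_results, threshold):
    """QRS heuristic: items seen in small result sets are treated as rare."""
    rare = set()
    for results in observed_results.values():
        if len(results) <= threshold:
            rare.update(results)        # these items get published to the DHT
    return rare


def hybrid_search(query, flood, dht_lookup, min_results):
    """Flood first; consult the DHT only if flooding found too little."""
    results = set(flood(query))
    if len(results) < min_results:
        results |= set(dht_lookup(query))
    return results


# Hypothetical results observed for earlier queries.
observed = {"q1": ["a", "b", "c", "d"], "q2": ["e"], "q3": []}
print(select_rare_items(observed, threshold=2))

# Stand-ins for the two search paths.
print(hybrid_search("q", lambda q: ["x"], lambda q: ["x", "y"], min_results=2))
```

Popular items never touch the DHT in this arrangement, which keeps publishing cost down while still giving rare items a reliable lookup path.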

55
Results
  • Trace-driven simulations
  • 315,000 files, 75,000 nodes, 350 queries
  • Improved Response Time
  • PIER (DHT) returns first result in 10 seconds
  • 40 seconds in aggregate, including a 30-second
    timeout
  • Gnutella (Flood) queries return first result in
    65 seconds
  • 25 seconds (38%) reduction in latency
  • Improved Recall Analysis
  • 18% reduction in queries with empty results
  • Using a naïve rare-item selection scheme
  • Opportunity to do far better
  • Recall: 66% potential reduction based on
    Union-of-30
  • Approaches: larger-scale deployments and
    better rare-item schemes

56
Outline
  • What is p2p?
  • Querying in early p2p systems
  • Napster
  • Gnutella
  • KaZaA, Gnutella with Ultrapeers
  • Some problems with queries in Gnutella
  • Distributed Hash Tables (DHTs)
  • Chord
  • Keyword search over DHTs
  • More fun
  • Towards full-service p2p querying
  • Get involved!
  • DB ideas infecting the network more deeply

57
DHTs Gave Us Equality Lookups
  • What else might we want?
  • Range Search
  • Aggregation
  • Group By
  • More complex Joins
  • Intelligent Query Dissemination
  • Theme
  • All can be built elegantly on DHTs!
  • PIER

pier.cs.berkeley.edu
58
Joining the Fun
59
OpenDHT
  • A shared DHT service
  • Hosted on PlanetLab
  • Simple API
  • You don't need to deploy or host to play with a
    real DHT!
  • A playground for killer apps?
  • Needn't be as big as PIER!
  • Example: FreeDB replacement

60
Infecting the network even more deeply?
  • Today's Internet infrastructure is a mess
  • Very complex to configure
  • Very limited in functionality
  • Assume they cannot know all kinds of things
  • Things like p2p are threatening to make it
    obsolete
  • What do networks do?
  • Managing routing and forwarding tables
  • Perform dataflow
  • Uhhhhh... isn't that kind of like query engines?
  • Yes!
  • But don't try using Oracle for this just yet

62
p2.cs.berkeley.edu
  • P2: A declarative networking system
  • Based on recursive queries over graphs
  • E.g., find the shortest path between me and you
  • A topic in DB theory that was mostly abandoned in
    practice up until recently
  • Can be used to implement routing protocols, DHTs,
    etc.
  • Chord in 47 rules
  • Instead of 10,000 lines of C
  • Lots of new fun query processing challenges
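
To give a feel for a recursive query over a graph, here is the shortest-path example written as a fixpoint loop in Python. This is not P2's rule language or its implementation; the function name, the link table, and the node names are hypothetical. The recursion being evaluated is "a path from a to c exists if there is a link from a to b and a path from b to c".

```python
def shortest_paths(links):
    """links: {(src, dst): cost}. Returns the cheapest known cost per pair."""
    best = dict(links)                  # base case: every link is a path
    changed = True
    while changed:                      # repeat until no rule adds anything new
        changed = False
        for (a, b), cost_ab in links.items():
            for (b2, c), cost_bc in list(best.items()):
                if b2 == b and a != c:
                    total = cost_ab + cost_bc
                    if total < best.get((a, c), float("inf")):
                        best[(a, c)] = total   # recursive case: link + path
                        changed = True
    return best


# Hypothetical directed links with positive costs.
links = {("me", "x"): 1, ("x", "you"): 1, ("me", "you"): 5}
print(shortest_paths(links)[("me", "you")])
```

The loop stops once no cheaper path can be derived, which is the same fixpoint idea a recursive query engine applies to its rules.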