Full-Text Search in P2P Networks

1
Full-Text Search in P2P Networks
  • Christof Leng
  • Databases and Distributed Systems Group
  • TU Darmstadt

2
Content
  • Short Intro to full-text search
  • Full-Text search on DHTs
  • Performance Comparison
  • Conclusion / Outlook

3
What is full-text search?
  • Searching for documents containing all of a list
    of specified words
    • Search for QuaP2P ∧ Darmstadt ∧ Research
  • Very common operation
    • Google
    • Filesharing
    • Wikis
    • Source Code
    • Document / Knowledge Management
  • Can be extended to phrase search
    • Search for "TU Darmstadt" ∧ "Christof Leng"

4
Inverted Index
  • Full-text search is normally solved with inverted
    indexes
  • Query result is intersection of all searched word
    entries
  • Stemming can reduce the number of word entries

doc1: New P2P system could provide speed increase.
doc2: Similarity searches accelerate P2P downloads by 30-70 percent.
doc3: I fail to see how this will make downloads faster.

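The three example documents can be turned into a small inverted index. The following is a minimal sketch, not the presenter's implementation: the tokenizer, the function names, and the decision to skip stemming are assumptions made for illustration.

```python
import re
from collections import defaultdict

# The three example documents from the slide.
DOCS = {
    "doc1": "New P2P system could provide speed increase.",
    "doc2": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the word entries
    return result

index = build_index(DOCS)
print(sorted(search(index, "P2P")))            # ['doc1', 'doc2']
print(sorted(search(index, "P2P downloads")))  # ['doc2']
```

With stemming, "downloads" and "download" would share one word entry; this sketch keeps them separate.
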
5
Overlay Types and Full-Text Search
  • Inverted index on central server
  • Inverted index on each (super-)node
  • Distributed inverted index
    → Challenge

6
Naïve Approach
  • Map inverted index to DHT
  • Key Lookup for every word
  • Intersect result lists at client
  • Pro
    • Simple
    • Short latency
  • Con
    • Result lists may be extremely large!
    • Result list sizes may vary extremely!

Figure: search for QuaP2P ∧ Darmstadt ∧ Research (index entries for Darmstadt, QuaP2P, and Research)

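The naïve approach can be sketched with a toy DHT. Everything here is an assumption for illustration: `ToyDHT`, `responsible_node`, the eight-node overlay, and the made-up result lists. A real DHT would route each lookup over the network; here a hash simply picks one of a few in-memory dictionaries.

```python
import hashlib

NUM_NODES = 8  # hypothetical overlay size, chosen for illustration

def responsible_node(word):
    """Hash a word onto a node id (stand-in for a real DHT key lookup)."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

class ToyDHT:
    def __init__(self):
        # One word -> result list dictionary per node.
        self.nodes = [{} for _ in range(NUM_NODES)]

    def publish(self, word, doc_ids):
        store = self.nodes[responsible_node(word)]
        store.setdefault(word, set()).update(doc_ids)

    def lookup(self, word):
        return set(self.nodes[responsible_node(word)].get(word, ()))

def naive_search(dht, words):
    """Fetch every result list in full and intersect them at the client."""
    lists = [dht.lookup(word) for word in words]
    transferred = sum(len(entries) for entries in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred

dht = ToyDHT()
dht.publish("quap2p", range(0, 5))          # rare word: 5 documents
dht.publish("darmstadt", range(0, 200))     # 200 documents
dht.publish("research", range(0, 1000, 2))  # common word: 500 documents

result, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(sorted(result), transferred)  # [0, 2, 4] 705
```

In this made-up data set, 705 list entries cross the network to produce three hits, which is the cost the "Con" bullets describe.
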
7
Zipf Distributions in Natural Text
  • Some words are extremely common
  • Most words are extremely uncommon
  • Largest word frequency is proportional to number
    of distinct words
  • → Avoid transferring result lists before
    intersection!

Figure: word occurrences by rank

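A short numeric sketch of the distribution, assuming an idealized Zipf law with exponent 1 in which the word of rank r occurs top_count / r times. Setting top_count equal to the number of distinct words is one way to read the proportionality bullet above (constant of proportionality 1); the vocabulary size of 100,000 is arbitrary.

```python
def zipf_occurrences(num_distinct_words, top_count):
    """Idealized Zipf law: the word of rank r occurs top_count / r times."""
    return [top_count / rank for rank in range(1, num_distinct_words + 1)]

# Assumption: the most frequent word occurs as often as there are distinct words.
occurrences = zipf_occurrences(num_distinct_words=100_000, top_count=100_000)

total = sum(occurrences)
top10_share = sum(occurrences[:10]) / total
rare_words = sum(1 for count in occurrences if count < 2)

print(f"share of all occurrences covered by the 10 most common words: {top10_share:.1%}")
print(f"distinct words that occur fewer than twice: {rare_words}")  # 50000
```

Under these assumptions, ten words account for roughly a quarter of all occurrences, while half of the vocabulary occurs fewer than twice. The result lists of an inverted index inherit the same skew.
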
8
Intersecting on the way
  • Query least common word first
  • Forward result list to next word
  • Intersect on the way
  • Pro
    • Reduces traffic
  • Con
    • High latency
    • Knowledge about word frequencies required
    • Search for "the" and "who" (7.2 and 2.4 billion
      hits on Google, respectively)

Figure: search for QuaP2P ∧ Darmstadt ∧ Research (index entries for Darmstadt, QuaP2P, and Research)

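A minimal sketch of intersecting on the way. The `postings` dictionary stands in for the nodes that hold each word entry, and the list sizes are made up. Sorting by list length assumes the querying peer already knows the word frequencies, which is exactly the requirement listed under "Con".

```python
def forwarded_search(postings, words):
    """Visit the words from rarest to most common, forwarding the running result."""
    if not words:
        return set(), [], 0
    order = sorted(words, key=lambda word: len(postings.get(word, ())))
    result = set(postings.get(order[0], ()))
    transferred = 0
    for word in order[1:]:
        transferred += len(result)           # running result is sent to the next node
        result &= set(postings.get(word, ()))
    transferred += len(result)               # final result is returned to the client
    return result, order, transferred

# Hypothetical result lists (document ids) for the three query words.
postings = {
    "quap2p": set(range(0, 5)),
    "darmstadt": set(range(0, 200)),
    "research": set(range(0, 1000, 2)),
}

result, order, transferred = forwarded_search(
    postings, ["research", "quap2p", "darmstadt"]
)
print(order)                        # ['quap2p', 'darmstadt', 'research']
print(sorted(result), transferred)  # [0, 2, 4] 13
```

Shipping all three lists to the client would move 5 + 200 + 500 = 705 entries; forwarding moves 13, but the three nodes are now visited one after another, which is where the extra latency comes from.
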
9
Using Bloom Filters
  • Bloom Filters reduce result list size
  • Forward Bloom Filters and return result list
    recursively
  • Pro
    • Reduces traffic even more (by up to a factor of 50)
  • Con
    • Even higher latency
    • Getting complicated

Figure: search for QuaP2P ∧ Darmstadt ∧ Research (index entries for Darmstadt, QuaP2P, and Research)

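A sketch of a two-node Bloom filter exchange. The filter size, the number of hash functions, and the SHA-256-based hashing are arbitrary choices for illustration, not parameters from the talk. Node A sends a filter of its result list, node B answers with the entries of its own list that match the filter, and node A removes the false positives.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two result lists while shipping only a filter and the candidates."""
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)                              # built at node A
    candidates = {d for d in list_b if d in bloom}     # computed at node B
    exact = candidates & set(list_a)                   # false positives dropped at node A
    return exact, candidates

list_a = range(0, 40)        # hypothetical result list held by node A
list_b = range(0, 1000, 2)   # hypothetical result list held by node B

exact, candidates = bloom_intersect(list_a, list_b)
print(len(exact))       # 20
print(len(candidates))  # at least 20; any surplus consists of false positives
```

Node A ships a fixed 1024-bit filter instead of its list, and node B ships only the candidates instead of all 500 entries. A Bloom filter never produces false negatives, so the final intersection is exact.
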
10
Zipf Distributions in Query Terms, too
  • Query popularity obeys Zipf's law (déjà vu!)
  • This puts high load on nodes with the most
    popular keys
  • Even worse, this load scales linearly with the
    network size and user activity
  • The responsible nodes are randomly assigned
    (could be a modem user)
  • → Hotspots will occur
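
The hotspot effect can be estimated with a few lines of arithmetic. This is a sketch under stated assumptions: key popularity follows a Zipf law with exponent 1, and the number of keys and the query rates are made-up values.

```python
def zipf_shares(num_keys, exponent=1.0):
    """Fraction of all lookups that goes to each key, ordered by popularity rank."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_keys + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]

def hottest_node_load(num_keys, lookups_per_second):
    """Lookups per second arriving at the node that holds the most popular key."""
    return zipf_shares(num_keys)[0] * lookups_per_second

share = zipf_shares(10_000)[0]
print(f"share of lookups for the most popular key: {share:.1%}")

# Doubling the overall lookup rate doubles the load on that single node.
print(hottest_node_load(10_000, 1_000))
print(hottest_node_load(10_000, 2_000))
```

With 10,000 keys, about a tenth of all lookups lands on one node, and that node's load grows in step with the total lookup rate regardless of how much bandwidth it has.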

11
Caching and Precomputation
  • Caching
    • Keep lists received for intersection
    • Keep answers to popular queries
    • Traffic reduction: 38%
    • But how to ensure coherence?
  • Precomputation
    • Inverted index for pairs or tuples of words
    • Only feasible for the most popular words
      (but most effective there anyway)
    • Traffic reduction: 50%
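
Both ideas can be sketched briefly. `ResultCache` and `precompute_pairs` are hypothetical names, and the result lists are made up. The cache uses least-recently-used eviction as one possible policy and ignores the coherence question raised above: a cached answer goes stale as soon as the index changes.

```python
from collections import OrderedDict
from itertools import combinations

class ResultCache:
    """Small LRU cache for query answers, keyed by the set of query words."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, words):
        key = frozenset(words)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, words, result):
        key = frozenset(words)
        self.entries[key] = result
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

def precompute_pairs(postings, popular_words):
    """Build an inverted index entry for every pair of popular words."""
    return {
        frozenset(pair): postings[pair[0]] & postings[pair[1]]
        for pair in combinations(sorted(popular_words), 2)
    }

postings = {
    "quap2p": set(range(0, 5)),
    "darmstadt": set(range(0, 200)),
    "research": set(range(0, 1000, 2)),
}

pairs = precompute_pairs(postings, ["research", "darmstadt"])
print(len(pairs[frozenset({"darmstadt", "research"})]))  # 100

cache = ResultCache(capacity=2)
cache.put(["quap2p", "darmstadt"], {0, 1, 2, 3, 4})
cache.put(["research"], postings["research"])
print(cache.get(["darmstadt", "quap2p"]))  # {0, 1, 2, 3, 4}
```

The pair entry for the two common words holds 100 documents, so a query containing both starts from that list instead of from the 200- and 500-entry lists.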

12
Further Optimizations
  • Compression of result lists
    • Adaptive Set Intersection
    • Gap Compression
    • Clustering of keys
  • Incremental Results
    • Do not return all results at once
    • Should be used in conjunction with a ranking
      algorithm
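
Gap compression is the easiest of these to show in code. The sketch below assumes document ids are non-negative integers and uses a variable-length byte encoding (7 payload bits per byte) for the gaps; the talk does not say which encoding is meant, so this is only one common choice.

```python
def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress(doc_ids):
    """Sort the ids, replace each by the gap to its predecessor, encode the gaps."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)

def decompress(data):
    """Invert compress(): decode the gaps and add them back up."""
    doc_ids, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            doc_ids.append(current)
            value, shift = 0, 0
    return doc_ids

ids = [1000000, 1000003, 1000010, 1000500]
packed = compress(ids)
print(len(packed))                # 7
print(decompress(packed) == ids)  # True
```

Four ids that would take 16 bytes as fixed 32-bit integers fit into 7 bytes here, because only the first value is large and the gaps 3, 7, and 490 are small.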

13
Comparison of different approaches
  • Yang et al. compared
    • DHT with Bloom Filters
    • Supernode with exhaustive flooding
    • Unstructured Random Walk w/o replication
  • Network size: 1000
  • Random data set from the WWW
  • All approaches have strengths

14
Feasibility of P2P Web Search Engine
  • Li et al. calculated the bandwidth usage of a
    P2P-based web search engine
    • 3 billion documents (10 KB each)
    • 60,000 peers
  • Basic DHT was 100x worse than basic Gnutella
  • DHT Optimizations (e.g. Bloom Filters) made it
    competitive
  • No index creation or maintenance cost included
    (60 TB)
  • No replica maintenance cost included
15
Conclusion
  • Distributed Inverted Indexes are challenging
  • Implementation requires a lot of tricks
  • Performance is not outstanding
  • No comparison to state-of-the-art unstructured
    systems available
  • Maybe even more tricks from information retrieval
    research will help
  • Modeling the correct workload is really important
    for system design

16
Outlook
  • Examine robustness of full-text search under Zipf
    query workloads
  • Implement DHT full-text search in simulator
  • Compare state-of-the-art unstructured and
    structured full-text search overlays
  • Improve consistency and coherence in DHT
    full-text search systems
  • Implement wiki and source code management with
    full-text search for Scenario B
  • Phrase search is even more challenging

17
Recommended Reading
  • Performance Comparison
    • Li et al. On the Feasibility of Peer-to-Peer Web
      Indexing and Search. IPTPS 2003.
    • Yang et al. Performance of Full Text Search in
      Structured and Unstructured Peer-to-Peer Systems.
      INFOCOM 2006.
  • DHT Full-Text Search
    • P. Reynolds and A. Vahdat. Efficient Peer-to-Peer
      Keyword Searching. Middleware 2003.
    • O. Gnawali. A Keyword Set Search System for
      Peer-to-Peer Networks. Master's thesis, MIT, 2002.
  • Workload Modeling
    • Breslau et al. Web Caching and Zipf-like
      Distributions: Evidence and Implications. INFOCOM
      1999.
    • Gummadi et al. Measurement, Modeling, and Analysis
      of a Peer-to-Peer File-Sharing Workload. SOSP
      2003.