An Evaluation and Comparison of Current Peer-to-Peer Full-Text Keyword Search Techniques

1
An Evaluation and Comparison of Current
Peer-to-Peer Full-Text Keyword Search Techniques
  • Ming Zhong, Justin Moore, Kai Shen
  • University of Rochester
  • Amy Murphy
  • University of Lugano

2
Current P2P Full-Text Keyword Search Techniques
  • Document-based partitioning: Gnutella, KaZaA
  • Keyword-based partitioning: Bhattacharjee03,
    Gnawali02, Reynold03, Suel03
  • Hybrid indexing: Tang04
  • Semantic search: Ganesan04, Li04, Tang03
  • There is no comprehensive quantitative
    performance evaluation and comparison of these
    techniques! (Li03)

3
Our Work
  • Quantitative performance evaluation results on
    real, large datasets (3.8 million web pages from
    www.dmoz.org and 6.8 million web queries from
    AskJeeves).
  • Performance metrics:
      • Total storage consumption
      • Communication overhead
      • Search latency
      • Search quality
  • Performance evaluation results linearly projected
    to 1 billion web pages.

4
Evaluation Setup
  • 8-byte page ID (from the MD5 of the page URL).
  • Each entry in the inverted lists has 10 bytes (8-byte
    ID + 2-byte term frequency).
  • Each query retrieves only the top 20 most relevant
    page IDs, ranked by the TF.IDF term weighting scheme.
  • Underlying topology: Chord.
  • Network settings:
      • P2P overlay link latency: 40 ms.
      • Maximum per-query bandwidth consumption:
        1.5 Mbps.
      • Maximum whole-system bandwidth consumption:
        1 Gbps (0.26% of US Internet backbone bandwidth
        in 2002).
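
The 10-byte inverted-list entry described above can be sketched as follows. This is only an illustration, not the authors' code: taking the first 8 bytes of the MD5 digest as the page ID, the big-endian layout, the capping of the term frequency, and the names page_id, pack_entry and unpack_entry are all assumptions.

```python
import hashlib
import struct

# 8-byte page ID followed by a 2-byte unsigned term frequency (big-endian).
ENTRY_FORMAT = ">8sH"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 10 bytes, no padding


def page_id(url: str) -> bytes:
    """Return an 8-byte ID; assumed here to be the first 8 bytes of MD5(url)."""
    return hashlib.md5(url.encode("utf-8")).digest()[:8]


def pack_entry(url: str, term_frequency: int) -> bytes:
    """Serialize one inverted-list entry for the given page."""
    # Cap the term frequency so it fits in 2 bytes (an assumption).
    return struct.pack(ENTRY_FORMAT, page_id(url), min(term_frequency, 0xFFFF))


def unpack_entry(entry: bytes):
    """Return (page_id_bytes, term_frequency) from a 10-byte entry."""
    return struct.unpack(ENTRY_FORMAT, entry)
```

With this layout, an inverted list of m pages occupies 10 × m bytes, which is the quantity the storage figures on the following slides are built from.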

5
Document-Based Partitioning
  • Each node holds a partition of documents. A query
    is broadcast to all nodes and each node returns
    its top 20 most relevant documents.
  • Tree-based (log n depth) message broadcast and
    aggregation.
  • Total storage consumption: 4.24 GB for our data
    set.
  • Total communication cost: 300n bytes for our data
    set (n is the network size).
  • Search latency: 0.08 × log(n) sec.
  • Search quality
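
A minimal sketch of the tree-based aggregation step, assuming each node scores its own documents and every interior node forwards only the best 20 results it has seen. The node structure (a dict with "docs" and "children") and the function names are hypothetical, chosen for illustration.

```python
import heapq

TOP_K = 20


def local_top_k(scored_docs, k=TOP_K):
    """Best k (score, doc_id) pairs among one node's own documents."""
    return heapq.nlargest(k, scored_docs)


def merge_top_k(result_lists, k=TOP_K):
    """Merge several result lists and keep only the k best entries."""
    merged = [item for result in result_lists for item in result]
    return heapq.nlargest(k, merged)


def aggregate(node, k=TOP_K):
    """Recursively collect the top-k results of the subtree rooted at node.

    node is a dict: {"docs": [(score, doc_id), ...], "children": [node, ...]}.
    """
    results = [local_top_k(node["docs"], k)]
    results.extend(aggregate(child, k) for child in node["children"])
    return merge_top_k(results, k)
```

Because every node sends at most k entries to its parent, the reply traffic grows with the number of nodes rather than with the number of matching documents, which is in line with the per-node communication cost quoted on the slide.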

6
Baseline Keyword-Based Partitioning
  • Each node holds the inverted lists of some
    keywords (randomly distributed). To reduce the
    communication overhead of inverted list
    intersection, a k-word query visits k peers in
    ascending order of inverted list size.
  • No quality degradation for baseline keyword-based
    partitioning.
  • Storage consumption equal to that of doc-based
    partitioning.
  • Average comm. cost per query: 96.61 KB.
  • Max comm. cost per query: 18.65 MB.
  • Search latency < 0.14 × log(n) + 0.52 sec.
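
The smallest-list-first intersection can be sketched as below. It assumes that the intermediate result is forwarded from peer to peer and counts how many page IDs are shipped on each hop; the data layout (a dict from keyword to a set of page IDs) and the function name are assumptions made for illustration.

```python
def intersect_smallest_first(inverted_lists):
    """Intersect inverted lists, visiting keywords by ascending list size.

    inverted_lists: dict mapping keyword -> set of page IDs (one peer each).
    Returns (matching page IDs, number of IDs shipped on each hop).
    """
    if not inverted_lists:
        return set(), []
    # Visit the peer with the shortest list first.
    order = sorted(inverted_lists, key=lambda kw: len(inverted_lists[kw]))
    candidates = set(inverted_lists[order[0]])
    shipped = []
    for keyword in order[1:]:
        # The current candidates travel to the peer holding the next list.
        shipped.append(len(candidates))
        candidates &= inverted_lists[keyword]
    return candidates, shipped
```

Starting from the shortest list bounds every shipped intermediate result by the size of that list, which is why the visiting order matters for communication cost.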

7
Improved Keyword-Based Partitioning (I)
The average comm. cost is reduced to 0.137 times
that of the baseline keyword-based partitioning.

8
Improved Keyword-Based Partitioning (II)
san ca diego francisco puerto tx austin rico
antonio earthquake jose juan vallarta lucas rican
luis cabo fransisco bernardino

9
Improved Keyword-Based Partitioning (III)

10
Hybrid Indexing
  • Each page ID in the inverted lists carries some
    metadata about the corresponding page.
  • A naive approach: each page ID carries the complete
    term list of the corresponding page. Too much
    storage consumption!
  • Tang's approach: the inverted list of x contains
    only those pages that have x among their top terms.
    Query expansion is also used.
  • Storage consumption: 13 times that of
    keyword-based partitioning.
  • Latency < 0.16 × DHT diameter sec.
  • Comm. cost per query: 7.5 KB.
  • Search quality
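
A rough sketch of the top-terms idea, under stated assumptions: "top terms" are taken to be the most frequent terms of a page, the metadata stored with each entry is the page's term counts, and query expansion is left out. The parameter top_n and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_hybrid_index(pages, top_n=2):
    """Index each page only under its top_n most frequent terms.

    pages: dict mapping page ID -> list of terms on that page.
    Each entry stores the page's term counts as metadata (an assumption).
    """
    index = defaultdict(dict)
    for pid, terms in pages.items():
        counts = Counter(terms)
        for term, _ in counts.most_common(top_n):
            index[term][pid] = counts
    return index


def search(index, query_terms):
    """Return pages listed under any query term that contain all query terms."""
    hits = set()
    for term in query_terms:
        for pid, counts in index.get(term, {}).items():
            # The stored metadata lets the peer filter locally,
            # without contacting the peers of the other query terms.
            if all(t in counts for t in query_terms):
                hits.add(pid)
    return hits
```

A page is never found through a term outside its top terms, which illustrates why this technique trades some search quality for lower communication cost.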

11
Semantic Search
  • Documents and queries are mapped into points in a
    semantic space, so keyword search becomes
    nearest-point search.
  • pSearch: LSI, rolling index.
  • Storage consumption: 30.06 GB, 7.09 times that of
    keyword-based partitioning, if p = 10.
  • Comm. cost per query: 1.29 MB
    (p = 10, k = 160).
  • Search latency: DHT comm. latency
    + network transmission time.
  • Search quality
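
To illustrate the nearest-point view of search, the sketch below ranks documents by cosine similarity to a query vector. It assumes the vectors have already been produced by some semantic mapping such as LSI and uses a brute-force scan; it does not model pSearch's rolling index or the DHT. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_documents(query_vec, doc_vecs, k=20):
    """Return the IDs of the k documents closest to the query vector.

    doc_vecs: dict mapping document ID -> vector in the semantic space.
    """
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]
```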

12
Scaled Performance on 10^9 pages and n peers

13
Examples of Choosing Techniques
  • Application 1: 10^5 peers, 10^7 pages.
      • Use keyword-based partitioning
      • 29.64 MB storage per peer
      • < 12.22 KB communication overhead per query
      • < 2.4 sec search latency
      • 100% search quality
  • Application 2: 10^3 peers, 10^9 pages.
      • Use document-based partitioning
      • 1.14 GB storage per peer
      • 300 KB communication overhead per query
      • 0.8 sec search latency
      • 75% to 95% search quality
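
As a worked check, the communication and latency figures for Application 2 can be reproduced from the document-based partitioning formulas given earlier (300n bytes and 0.08 × log(n) sec). The sketch assumes the logarithm is base 2, which the slides do not state; the function names are hypothetical.

```python
import math


def doc_based_comm_bytes(n_peers: int) -> int:
    """Total communication cost per query: 300 bytes per peer."""
    return 300 * n_peers


def doc_based_latency_sec(n_peers: int) -> float:
    """Search latency: 0.08 sec per tree level, assuming log base 2."""
    return 0.08 * math.log2(n_peers)


n = 10**3  # Application 2: 10^3 peers
print(doc_based_comm_bytes(n) / 1000, "KB")       # 300.0 KB
print(round(doc_based_latency_sec(n), 2), "sec")  # 0.8 sec
```

Under the base-2 assumption, both values agree with the 300 KB and 0.8 sec listed for Application 2.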