An Evaluation and Comparison of Current Peer-to-Peer Full-Text Keyword Search Techniques

1
An Evaluation and Comparison of Current
Peer-to-Peer Full-Text Keyword Search Techniques

Ming Zhong, Justin Moore, Kai Shen
University of Rochester
Amy Murphy
University of Lugano

2
Current P2P Full-Text Keyword Search Techniques

Document-based partitioning: Gnutella, KaZaA
Keyword-based partitioning: Bhattacharjee03, Gnawali02, Reynold03, Suel03
Hybrid indexing: Tang04
Semantic search: Ganesan04, Li04, Tang03
There is no comprehensive quantitative performance evaluation and comparison of these techniques! (Li03)

3
Our Work

Quantitative performance evaluation results on real, large datasets (3.8 million web pages from www.dmoz.org and 6.8 million web queries from AskJeeves).
Performance metrics:
Total storage consumption
Communication overhead
Search latency
Search quality
Performance evaluation results are linearly projected to 1 billion web pages.
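The linear projection can be sketched as scaling a measured value by the ratio of corpus sizes. This is only an illustration: the function name `project_linearly` is hypothetical, and the 4.24 GB input is the storage figure reported for document-based partitioning later in the talk.

```python
MEASURED_PAGES = 3.8e6  # pages in the dmoz.org data set
TARGET_PAGES = 1e9      # projection target named on the slide


def project_linearly(measured_value, measured_pages=MEASURED_PAGES,
                     target_pages=TARGET_PAGES):
    """Scale a measurement in proportion to the number of pages.

    Assumption: the metric grows linearly with corpus size, as the
    slide states for its projected results.
    """
    return measured_value * (target_pages / measured_pages)


# Example: project the 4.24 GB storage figure to 1 billion pages.
projected_gb = project_linearly(4.24)
print(f"projected storage: {projected_gb:.0f} GB")
```

Metrics that do not depend on corpus size in this way would need a different model.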

4
Evaluation Setup

8-byte page ID (the MD5 of the page URL).
Each entry in the inverted lists takes 10 bytes (8-byte ID + 2-byte term frequency).
Each query retrieves only the top 20 most relevant page IDs, ranked by the TF·IDF term weighting scheme.
Underlying topology: Chord.
Network settings:
P2P overlay link latency: 40 ms.
Maximum per-query bandwidth consumption: 1.5 Mbps.
Maximum whole-system bandwidth consumption: 1 Gbps (0.26% of US Internet backbone bandwidth in 2002).
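The 10-byte inverted-list entry can be illustrated with a small packing routine. This is a sketch under assumptions the slides do not confirm: the 8-byte ID is taken to be the first 8 bytes of the 16-byte MD5 digest, the byte order is big-endian, and term frequencies above 65535 are clamped. The function names are hypothetical.

```python
import hashlib
import struct

ENTRY_FORMAT = ">8sH"  # 8-byte page ID followed by an unsigned 2-byte term frequency


def page_id(url: str) -> bytes:
    """Return an 8-byte page ID (assumption: leading 8 bytes of the URL's MD5)."""
    return hashlib.md5(url.encode("utf-8")).digest()[:8]


def pack_entry(url: str, term_frequency: int) -> bytes:
    """Pack one inverted-list entry into 10 bytes."""
    tf = min(term_frequency, 0xFFFF)  # clamp to the 2-byte range (assumption)
    return struct.pack(ENTRY_FORMAT, page_id(url), tf)


def unpack_entry(entry: bytes):
    """Recover (page ID, term frequency) from a packed entry."""
    return struct.unpack(ENTRY_FORMAT, entry)


entry = pack_entry("http://www.example.org/", 7)
print(len(entry), unpack_entry(entry)[1])
```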

5
Document-Based Partitioning

Each node holds a partition of the documents. A query is broadcast to all nodes, and each node returns its top 20 most relevant documents.
Tree-based (log n depth) message broadcast and aggregation.
Total storage consumption: 4.24 GB for our data set.
Total communication cost: 300n bytes for our data set (n is the network size).
Search latency: 0.08 × log(n) secs.
Search Quality
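The cost and latency figures for document-based partitioning can be turned into a small model. This is a sketch with assumptions labeled in the code: the logarithm is taken as base 2, and the 0.08 factor is read as one 40 ms hop down the broadcast tree plus one 40 ms hop back up per level. The throughput bound is derived here from the 1 Gbps system cap and is not a figure from the slides.

```python
import math

LINK_LATENCY_S = 0.040       # overlay link latency from the evaluation setup
BYTES_PER_NODE = 300         # per-node communication cost per query (slide figure)
SYSTEM_BANDWIDTH_BPS = 1e9   # whole-system bandwidth cap from the evaluation setup


def broadcast_latency_s(n: int) -> float:
    """Latency of tree broadcast plus aggregation over n nodes.

    Assumption: log base 2; each tree level costs one hop down and one hop up.
    """
    return 2 * LINK_LATENCY_S * math.log2(n)


def query_cost_bytes(n: int) -> int:
    """Total bytes moved by one query when every node is contacted."""
    return BYTES_PER_NODE * n


def max_queries_per_second(n: int) -> float:
    """Upper bound on query rate if all system bandwidth carried query traffic."""
    return (SYSTEM_BANDWIDTH_BPS / 8) / query_cost_bytes(n)


for n in (1_000, 100_000):
    print(n, round(broadcast_latency_s(n), 2), query_cost_bytes(n),
          round(max_queries_per_second(n), 1))
```

Because the per-query cost grows linearly with n, the sustainable query rate under a fixed bandwidth cap falls as the network grows.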

6
Baseline Keyword-Based Partitioning

Each node holds the inverted lists of some keywords (randomly distributed). To reduce the communication overhead of inverted list intersection, a k-word query visits k peers in ascending order of their inverted list sizes.
No quality degradation for baseline keyword-based partitioning.
Storage consumption equal to that of doc-based partitioning.
Average comm. cost per query: 96.61 KB.
Max comm. cost per query: 18.65 MB.
Search latency < 0.14 × log(n) + 0.52 secs.
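The ascending-size visiting order can be sketched as follows. This is a single-process illustration, not the evaluated system: a plain dictionary stands in for the peers that hold each keyword's inverted list, the function name `keyword_query` is hypothetical, and the byte count covers only the running intersection forwarded from peer to peer at 10 bytes per entry.

```python
ENTRY_BYTES = 10  # 8-byte page ID + 2-byte term frequency


def keyword_query(index, keywords):
    """Intersect inverted lists, visiting the shortest list first.

    index: dict mapping keyword -> set of page IDs (stand-in for remote peers).
    Returns (matching page IDs, bytes forwarded between peers).
    """
    if not keywords:
        return set(), 0
    # Starting from the smallest list keeps the forwarded intersection small.
    ordered = sorted(keywords, key=lambda kw: len(index.get(kw, ())))
    result = set(index.get(ordered[0], ()))
    shipped = 0
    for kw in ordered[1:]:
        shipped += len(result) * ENTRY_BYTES  # forward current result to the next peer
        result &= index.get(kw, set())
    return result, shipped


index = {
    "san": {1, 2, 3, 4},
    "diego": {2, 3},
    "zoo": {3, 5, 6},
}
print(keyword_query(index, ["san", "diego", "zoo"]))
```

As a consistency check on the latency figure: sending the 96.61 KB average at the 1.5 Mbps per-query cap takes about 0.52 seconds when 1 KB is taken as 1000 bytes, which agrees with the 0.52 secs term.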

7
Improved Keyword-Based Partitioning (I)

The average comm. cost is reduced to 0.137 times that of the baseline keyword-based partitioning.

8
Improved Keyword-Based Partitioning (II)

san ca diego francisco puerto tx austin rico antonio earthquake jose juan vallarta lucas rican luis cabo fransisco bernardino

9
Improved Keyword-Based Partitioning (III)

10
Hybrid Indexing

Each page ID in the inverted lists carries some metadata about the corresponding page.
A naive approach: each page ID carries the complete term list of the corresponding page. Too much storage consumption!
Tang's approach: the inverted list of a term x contains only those pages that have x as one of their top terms. Query expansion is also used.
Storage consumption: 13 times that of keyword-based partitioning.
Latency < 0.16 × DHT diameter secs.
Comm. cost per query: 7.5 KB.
Search quality
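The idea of indexing a page only under its top terms, with metadata attached to each posting, can be sketched as below. This is a simplified illustration rather than Tang's system: top terms are chosen by raw frequency, the metadata is assumed to be the page's term counts, query expansion is omitted, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_hybrid_index(pages, top_k=3):
    """Index each page only under its top_k most frequent terms.

    pages: dict mapping page ID -> list of terms.
    Every posting carries the page's term counts as metadata (assumption).
    """
    index = defaultdict(list)
    for pid, terms in pages.items():
        counts = Counter(terms)
        for term, _ in counts.most_common(top_k):
            index[term].append((pid, dict(counts)))
    return index


def local_search(index, keywords):
    """Answer a query at the one peer holding the first keyword's list.

    The remaining keywords are checked against posting metadata, so no
    other inverted list has to be fetched.
    """
    first, rest = keywords[0], keywords[1:]
    return [pid for pid, meta in index.get(first, [])
            if all(kw in meta for kw in rest)]


pages = {
    "p1": ["san", "san", "diego", "zoo"],
    "p2": ["san", "jose", "jose"],
}
index = build_hybrid_index(pages, top_k=1)
print(local_search(index, ["san", "diego"]))  # p1 is listed under "san"
print(local_search(index, ["san", "jose"]))   # p2 is missed: "san" is not its top term
```

The second query shows where quality can degrade: a page that matches every keyword is not returned when the keyword used for the lookup is not among its top terms.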

11
Semantic Search

Documents and queries are mapped to points in a semantic space, so keyword search becomes nearest-point search.
pSearch: LSI, rolling index.
Storage consumption: 30.06 GB, 7.09 times that of keyword-based partitioning, if p = 10.
Comm. cost per query: 1.29 MB (p = 10, K = 160).
Search latency: DHT comm. latency + network transmission time.
Search quality
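Nearest-point search in a semantic space can be illustrated with a brute-force cosine-similarity ranking. This sketch is centralized and does not implement pSearch's rolling index or the LSI projection itself: the vectors are assumed to be already reduced to the semantic space, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_documents(query_vec, doc_vecs, top=20):
    """Return the IDs of the `top` documents closest to the query.

    doc_vecs: dict mapping document ID -> vector in the semantic space.
    """
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return ranked[:top]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(nearest_documents([1.0, 0.1], docs, top=2))
```

A distributed design would route the query point to the peers responsible for nearby regions of the space instead of scanning every document.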

12
Scaled Performance on 10^9 pages and n peers
13
Examples of Choosing Techniques