Text-Based Content Search and Retrieval in ad hoc P2P Communities

1
Text-Based Content Search and Retrieval in ad hoc P2P Communities

Francisco Matias Cuenca-Acuna
Thu D. Nguyen
http://www.panic-lab.rutgers.edu/

2
Motivation

It is hard to find information in current P2P infrastructures
  They are designed for name-based search
  They don't have quality metrics
  They don't rank results
  Most are optimized to find popular content
The current Internet search model has proven effective at locating data
  Intuitive term-based query model
  Quality metrics and ranking are critical factors in the success of Internet search engines
  They help users quickly pinpoint relevant documents in a vast repository

3
Goals and challenges

Empower P2P communities with search capabilities similar to Internet search engines
  No central servers
  Fault tolerance
Cannot employ the current model used by Internet search engines
  No central management and administration
  Resources are fragmented
  Peers' behavior is uncontrolled

4
Summary of PlanetP

Nodes maintain an index of their content
  Represented as Bloom filters
Indexes and directories are replicated everywhere
  Gossiping keeps peers synchronized
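
The per-node index described above can be illustrated with a toy Bloom filter. This is a minimal sketch, not PlanetP's implementation: the filter size, the number of hash functions, the use of SHA-256, and the names `BloomFilter` and `summarize` are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Compact, lossy summary of a set of terms (false positives possible)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, not taken from the slides
        self.num_hashes = num_hashes  # assumed count, not taken from the slides
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(documents):
    """Build one filter over every term in a peer's local documents."""
    bf = BloomFilter()
    for text in documents:
        for term in text.lower().split():
            bf.add(term)
    return bf
```

A peer would gossip only the filter, which is much smaller than the term list; the price is that a membership test can report a term that the peer does not actually hold.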

[Figure: each peer holds local files (XML snippets), a Bloom filter over their keys K1,..,Kn, and a local directory; gossiping connects the peers.]

Local Directory
Nickname  Status   IP  Keys
Alice     Online       K1,..,Kn
Bob       Offline      K1,..,Kn
Charles   Online       K1,..,Kn
5
Content search in PlanetP
6
The Vector Space model

Documents and queries are represented as k-dimensional vectors
  Words are weighted according to their relevance to the document
  Documents are weighted according to their words
The angle between a query vector and a document vector indicates their similarity
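
The angle-based comparison above is usually computed as cosine similarity. The sketch below assumes raw term counts as weights and whitespace tokenization; the helper names `to_vector` and `cosine_similarity` are illustrative, not from the presentation.

```python
import math
from collections import Counter


def to_vector(text):
    """Map text to a sparse term -> count vector (raw counts as weights)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 1.0 means same direction."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A smaller angle gives a cosine closer to 1, so documents can be ranked by this value in decreasing order.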

7
Weight assignment (TFxIDF)

Idea
  Use the per-document Term Frequency (TF) to weight words ($W_{D,t}$)
  Use inverse global popularity (IDF) to find good discriminators among the query terms
Intuition
  TF indicates how related a document is to a particular concept
  Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
$W_{D,t} = f(\text{frequency of } t \text{ in } D)$
$IDF_t = f(\text{No. of documents} / \text{frequency of } t \text{ across documents})$
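
The slide leaves the function f unspecified. As a sketch only, the code below assumes logarithmic damping for both weights, a common choice in TFxIDF variants, and reads "frequency of t across documents" as the number of documents containing t. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_weight(count):
    # Assumed damping: 1 + ln(count) for terms that occur, else 0.
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(term, docs):
    # Assumed form: ln(1 + N / n_t), with n_t = documents containing the term.
    n_t = sum(1 for d in docs if term in d)
    return math.log(1.0 + len(docs) / n_t) if n_t else 0.0


def score(query_terms, doc, docs):
    """Sum of TF x IDF over the query terms for one document."""
    return sum(tf_weight(doc[t]) * idf(t, docs) for t in query_terms)


# Hypothetical three-document collection.
docs = [Counter(text.split()) for text in (
    "gossip protocol gossip membership",
    "bloom filter membership test",
    "vector space ranking",
)]
query = ["gossip", "membership"]
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i], docs),
                reverse=True)
```

Here "gossip" appears in one document and "membership" in two, so "gossip" receives the larger IDF and dominates the ranking.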

8
Node document ranking in PlanetP

Unfortunately, IDF is not well suited to P2P
  It requires an appearance count for every word in the community
We introduce the use of the Inverse Peer Frequency (IPF)
  $IPF_t = f(\text{No. of peers} / \text{No. of peers with documents containing } t)$
  IPF can be computed with local information
  IPF is compatible across the community
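
Because every peer holds a copy of the directory, IPF can be computed without contacting anyone. The sketch below makes several assumptions: plain Python sets stand in for the per-peer Bloom filters, f is taken to be ln(1 + x), and peers are ordered by the summed IPF of the query terms they may hold. The ranking rule and the function names are illustrative, not confirmed by the slides.

```python
import math


def ipf(term, directory):
    # Assumed form: ln(1 + N / N_t), N_t = peers whose summary matches the term.
    n_t = sum(1 for summary in directory.values() if term in summary)
    return math.log(1.0 + len(directory) / n_t) if n_t else 0.0


def rank_peers(query_terms, directory):
    """Order peers by the summed IPF of the query terms they may hold."""
    scores = {
        peer: sum(ipf(t, directory) for t in query_terms if t in summary)
        for peer, summary in directory.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Local copy of the directory; sets stand in for the gossiped Bloom filters.
directory = {
    "Alice": {"gossip", "bloom", "filter"},
    "Bob": {"gossip", "ranking"},
    "Charles": {"gossip", "vector"},
}
order = rank_peers(["gossip", "bloom"], directory)
```

Only the `in` test is used on each summary, so a Bloom filter object that supports membership queries could be substituted for the sets.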

9
Stopping condition

Intuitive idea: stop as soon as k documents are retrieved
  Not good: a node might have a few highly ranked documents and many that have a low rank
We propose an adaptive approach
  Contact nodes one by one and keep a list of the top k documents retrieved
  Stop contacting candidates when p nodes in a row fail to contribute to the top k
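
The adaptive approach above can be sketched as a loop over peers in rank order. This is an illustration under stated assumptions: `fetch` is a hypothetical callback that returns (score, document id) pairs for one peer, and a peer "contributes" if at least one of its documents enters the current top k.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order; stop after p consecutive non-contributors.

    fetch(peer) is a hypothetical callback returning (score, doc_id) pairs.
    Returns the top-k pairs (best first) and the number of peers contacted.
    """
    top = []        # min-heap holding the best k (score, doc_id) pairs so far
    misses = 0      # consecutive peers that added nothing to the top k
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for item in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, item)
                contributed = True
            elif item > top[0]:
                heapq.heapreplace(top, item)  # evict the current weakest entry
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top, reverse=True), contacted
```

With k = 2 and p = 2, a search over five peers stops after the fourth if the third and fourth return only documents scoring below the current top two; the fifth peer is never contacted.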

10
Evaluation method

We use five well-known document collections
  Each collection comes with a set of queries and relevance judgments
  Here we present results for one (AP89)
We measure recall and precision

Trace  Queries  Documents  Number of words  Collection size (MB)
AP89   97       84678      129603           266.0
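
For reference, recall and precision for a single query follow the standard information-retrieval definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids in the usage note are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving d1 to d4 when d1, d3 and d5 are relevant gives a precision of 2/4 and a recall of 2/3.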
11
Evaluation method

We use a simulator to test our algorithm
  With different file distributions
  Against a central search engine
  Quantifying the effect of not using an adaptive stopping condition

12
Results
13
Results cont.
14
More results

Adjusting the stop condition according to the community size and the number of results expected
  We provide a linear function to determine p
Recall as the community grows to 1000 peers (scalability)
Overlap between PlanetP's results and the ones obtained by using standard TFxIDF
  80% on average

15
Conclusions

PlanetP matches TFxIDF's performance using the TFxIPF approximation
  Gives P2P communities search capabilities as powerful as environments with centralized resources
  TFxIPF is applicable beyond PlanetP
PlanetP matches TFxIDF's performance regardless of how documents are distributed throughout the community
Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results

16
Related Work

Tapestry, Pastry, Chord and CAN
  Implement a distributed hash table for P2P environments
  Oriented towards name-based searches (for FS)
  They already store all the information needed to implement TFxIPF
CORI and GlOSS
  Address the problem of indexing and searching distributed collections of documents
  They build a centralized index that has total knowledge of word usage, so they don't contact unnecessary nodes