Text-Based Content Search and Retrieval in ad hoc P2P Communities

Francisco Matias Cuenca-Acuna, Thu D. Nguyen, http://www.panic-lab.rutgers.edu/

1
Text-Based Content Search and Retrieval in ad hoc
P2P Communities
  • Francisco Matias Cuenca-Acuna
  • Thu D. Nguyen
  • http://www.panic-lab.rutgers.edu/

2
Motivation
  • It is hard to find information in current P2P
    infrastructures
      • They are designed for name-based search
      • They don't have quality metrics
      • They don't rank results
      • Most are optimized to find popular content
  • The current Internet search model has proven
    effective at locating data
      • Intuitive term-based query model
      • Quality metrics and ranking are critical factors
        in the success of Internet search engines
      • They help users quickly pinpoint relevant
        documents in a vast repository

3
Goals & challenges
  • Empower P2P communities with search capabilities
    similar to Internet search engines
      • No central servers
      • Fault tolerance
  • Cannot employ the current model used by Internet
    search engines
      • No central management and administration
      • Resources are fragmented
      • Peers' behavior is uncontrolled

4
Summary of PlanetP
  • Nodes maintain an index of their content
      • Represented as Bloom filters
  • Indexes and directories are replicated everywhere
  • Gossiping keeps peers synchronized

[Figure: two peers, each with a Local Directory, Local Files (XML snippets), and a Bloom filter over keys K1,..,Kn, linked by gossiping]

Local Directory
Nickname  Status   IP  Keys
Alice     Online       K1,..,Kn
Bob       Offline      K1,..,Kn
Charles   Online       K1,..,Kn
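
The slide says each node summarizes its index as a Bloom filter but does not give the construction. The following is a minimal sketch of a generic Bloom filter, not PlanetP's actual implementation. The class name `BloomFilter`, the sizes `num_bits` and `num_hashes`, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the slides.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


# Hypothetical peer "alice" summarizing the terms found in her local files.
alice = BloomFilter()
for term in ["gossip", "bloom", "filter"]:
    alice.add(term)
```

A peer holding a copy of this filter can test `"gossip" in alice` locally, without contacting Alice, which is what makes replicating the filters everywhere useful for search.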
5
Content search in PlanetP
[Figure: content search diagram not recoverable; only the label "STOP" survives]
6
The Vector Space model
  • Documents and queries are represented as
    k-dimensional vectors
      • Words are weighted according to their relevance
        to the document
      • Documents are weighted according to their words
  • The angle between a query vector and a document
    vector indicates their similarity
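
As an illustration of the angle-based similarity, here is a small sketch that computes the cosine of the angle between two sparse vectors. Representing a vector as a dict from term to weight, and the function name `cosine_similarity`, are assumptions for this example.

```python
import math


def cosine_similarity(query, doc):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * doc.get(term, 0.0) for term, weight in query.items())
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc.values()))
    if query_norm == 0.0 or doc_norm == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (query_norm * doc_norm)
```

A value near 1 means the query and document point in almost the same direction (small angle); a value of 0 means they share no weighted terms.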

7
Weight assignment (TFxIDF)
  • Idea
      • Use per-document Term Frequency (TF) to weight
        words ($W_{D,t}$)
      • Use inverse global popularity (IDF) to find good
        discriminators among the query terms
  • Intuition
      • TF indicates how related a document is to a
        particular concept
      • Inverse Document Frequency (IDF) identifies the
        words that are good discriminators between
        documents
  • $W_{D,t} = f(\text{frequency of } t \text{ in } D)$
  • $IDF_t = f(\text{No. documents} / \text{frequency of } t \text{ across documents})$
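
The slide leaves the function f unspecified. The sketch below assumes one common choice, logarithmic damping, with weights 1 + log(frequency) and IDF = log(1 + N / frequency). These formulas, and the names `term_weights` and `idf`, are assumptions for illustration rather than something the slides state.

```python
import math
from collections import Counter


def term_weights(doc_tokens):
    """W_{D,t}: assumed here to be 1 + log(frequency of t in D)."""
    counts = Counter(doc_tokens)
    return {term: 1.0 + math.log(count) for term, count in counts.items()}


def idf(term, collection):
    """IDF_t: assumed here to be log(1 + N / frequency of t across documents)."""
    total = sum(doc.count(term) for doc in collection)
    if total == 0:
        return 0.0  # a term that never appears cannot discriminate
    return math.log(1.0 + len(collection) / total)


# Tiny hypothetical collection of tokenized documents.
docs = [
    ["p2p", "search", "p2p"],
    ["search", "engine"],
    ["bloom", "filter", "search"],
]
```

With this collection, `idf("bloom", docs)` is larger than `idf("search", docs)`, matching the intuition that rare words are better discriminators than words that appear everywhere.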

8
Node & document ranking in PlanetP
  • Unfortunately, IDF is not suited to P2P
      • It requires an appearance count for every word
        in the community
  • We introduce the use of the Inverse Peer
    Frequency (IPF)
      • $IPF_t = f(\text{No. peers} / \text{peers with documents containing } t)$
      • IPF can be computed with local information
      • IPF is compatible across the community
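
A minimal sketch of how IPF could be computed from a locally replicated directory. The directory is modeled as a dict from peer nickname to a set of terms, which stands in for the per-peer Bloom filters. The log form of f and the peer score (sum of IPF over the query terms a peer may hold) are assumptions for illustration, as are the names `ipf` and `rank_peers`.

```python
import math


def ipf(term, directory):
    """IPF_t: assumed here to be log(1 + No. peers / peers containing t)."""
    peers_with_term = sum(1 for terms in directory.values() if term in terms)
    if peers_with_term == 0:
        return 0.0
    return math.log(1.0 + len(directory) / peers_with_term)


def rank_peers(query_terms, directory):
    """Order peers by the summed IPF of the query terms they may hold."""
    scores = {}
    for peer, terms in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in terms)
        if score > 0.0:
            scores[peer] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical directory; each set stands in for a peer's Bloom filter.
directory = {
    "Alice": {"gossip", "bloom"},
    "Bob": {"bloom"},
    "Charles": {"search"},
}
ranking = rank_peers(["gossip", "bloom"], directory)
```

Because every peer holds the same replicated directory, each can compute these values without contacting anyone, which is the property the slide highlights.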

9
Stopping condition
  • Intuitive idea: stop as soon as k documents are
    retrieved
      • Not good
      • A node might have a few highly ranked documents
        and many that have a low rank
  • We propose an adaptive approach
      • Contact nodes one by one and keep a list of the
        top k documents retrieved
      • Stop contacting candidates when p nodes in a row
        fail to contribute to the top k
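
The adaptive approach above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `ranked_peers` is an already ordered list of candidates, `fetch` is a hypothetical callback returning `(score, doc_id)` pairs for a peer, and the scores in the usage example are made up.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in order; stop after p consecutive non-contributors."""
    top = []        # current top-k list of (score, doc_id)
    misses = 0      # consecutive peers that added nothing to the top k
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        before = set(top)
        candidates = before | set(fetch(peer))
        top = heapq.nlargest(k, candidates)
        if any(item not in before for item in top):
            misses = 0  # this peer placed a document in the top k
        else:
            misses += 1
            if misses >= p:
                break
    return [doc_id for _, doc_id in top], contacted


# Hypothetical per-peer results; scores are invented for illustration.
results = {
    "n1": [(0.90, "D11"), (0.80, "D12")],
    "n2": [(0.95, "D21")],
    "n3": [(0.10, "D31")],
    "n4": [(0.99, "D41")],
}
top_docs, contacted = adaptive_search(["n1", "n2", "n3", "n4"], results.get, k=2, p=1)
```

With k = 2 and p = 1, the search stops after the third peer fails to contribute, so the fourth peer is never contacted. A larger p trades extra messages for a lower chance of missing a good document held by a lower-ranked peer.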

10
Evaluation method
  • We use five well-known document collections
      • Each collection comes with a set of queries and
        relevance judgments
      • Here we present results for one (AP89)
  • We measure recall and precision

Trace  Queries  Documents  Number of words  Collection size (MB)
AP89   97       84678      129603           266.0
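
For reference, recall and precision for a single query can be computed from the retrieved set and the relevance judgments as in this short sketch. The function name and the document identifiers in the test data are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Precision penalizes returning irrelevant documents, while recall penalizes missing relevant ones, so the two are reported together.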
11
Evaluation method
  • We use a simulator to test our algorithm
      • Different file distributions
      • Against a central search engine
      • Quantifying the effect of not using an adaptive
        stopping condition

12
Results
13
Results cont.
14
More results
  • Adjusting the stop condition according to the
    community size and number of results expected
      • We provide a linear function to determine p
  • Recall as the community grows to 1000 peers
    (scalability)
  • Overlap between PlanetP's results and the ones
    obtained by using standard TFxIDF
      • 80% on average

15
Conclusions
  • PlanetP matches TFxIDF's performance using the
    TFxIPF approximation
      • Gives P2P communities search capabilities as
        powerful as environments with centralized
        resources
      • TFxIPF is applicable beyond PlanetP
  • PlanetP matches TFxIDF's performance regardless
    of how documents are distributed throughout the
    community
  • Our stopping heuristic limits searches to a small
    subset of the community yet allows enough peers
    to be contacted to guarantee good results

16
Related Work
  • Tapestry, Pastry, Chord and CAN
      • Implement a distributed hash table for P2P
        environments
      • Oriented towards name-based searches (for FS)
      • They already store all the information needed to
        implement TFxIPF
  • CORI and GlOSS
      • Address the problem of indexing and searching
        distributed collections of documents
      • They build a centralized index that has total
        knowledge of word usage, so they don't contact
        unnecessary nodes

17
Questions?
18
Example
  • Assume k = 2 and p = 1
  • Documents with a tick (✓) have been judged
    relevant
  • Documents with a cross (✗) are related but not
    relevant

[The tick and cross marks in the grid below were lost in extraction and appear as "?"]
D11 ?  D12 ?  D13 ?  D14 ?
D21 ?  D22 ?  D23 ?  D24 ?
D31 ?  D32 ?  D33 ?  D34 ?
Trivial stop: D11, D12
D21, D11
D21, D11
Adaptive stop: D11, D12