Fulltext federated search of textbased digital libraries in peertopeer networks Information Retrieva - PowerPoint PPT Presentation

1
Full-text federated search of text-based digital
libraries in peer-to-peer networks
Information Retrieval 2006, Springer
Paper presentation
  • Jie Lu, Jamie Callan
  • Language Technologies Institute, School of
    Computer Science, Carnegie Mellon University

Konstantinos Zaharis, Dept. of Comp. & Comm.
Engineering, UTH
2
Paper Outline
  • Introduction
  • Overview / prior research
  • Full-text federated search in p2p
  • Test data
  • Evaluation methodology and experimental settings
  • Results
  • Conclusions and future work

3
Introduction
  • Federated = distributed
  • Problem addressed: use of p2p nets as a search
    layer for text-based digital libraries (dls).
  • Why p2p? Because they need no central
    authority (decentralized) and they connect
    heterogeneous, multi-vendor and lightly managed
    enterprise nets. In short, they are robust and
    scalable

4
Two types of environments in a p2p net
  • Cooperative p2p environments: each provider gives
    its own accurate resource description to each
    neighbouring directory service (hub)
  • Uncooperative p2p environments: each directory
    service independently conducts query-based
    sampling to obtain sample documents from its
    neighbouring providers in order to create its
    own resource descriptions
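
The uncooperative case can be illustrated with a minimal sketch of query-based sampling. It assumes a hypothetical `search(term, k)` callable that stands in for a provider's search interface; the stopping rule, the sample sizes and the toy provider are illustrative choices, not values from the paper.

```python
import random
from collections import Counter


def query_based_sampling(search, seed_terms, max_docs=300, docs_per_query=4, rng=None):
    """Build a term-frequency description of a provider by probing it.

    `search(term, k)` is a hypothetical callable returning up to k
    (doc_id, text) pairs; the provider exposes nothing else.
    """
    rng = rng or random.Random(0)
    vocabulary = list(seed_terms)   # terms we may use as one-word queries
    seen = set()                    # ids of documents already sampled
    term_freq = Counter()           # the learned resource description
    queries = 0
    # Cap the number of probes so the loop ends even for tiny providers.
    while len(seen) < max_docs and vocabulary and queries < 10 * max_docs:
        term = rng.choice(vocabulary)
        queries += 1
        for doc_id, text in search(term, docs_per_query):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            term_freq.update(text.lower().split())
        if term_freq:
            # Later queries are drawn from terms learned so far.
            vocabulary = sorted(term_freq)
    return term_freq, len(seen)


# Toy provider used only for illustration.
DOCS = {
    1: "peer to peer search",
    2: "federated search of libraries",
    3: "digital libraries and peer networks",
}


def toy_search(term, k):
    hits = [(i, t) for i, t in sorted(DOCS.items()) if term in t.split()]
    return hits[:k]


description, sampled = query_based_sampling(toy_search, ["search"], max_docs=3)
```

The directory service never sees the provider's index; it only learns whatever the returned documents reveal, which is why the resulting description is an estimate.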

5
Overview
  • Distributed IR poses three main problems:
  • Resource representation: discover content areas
    covered by each dl
  • Resource selection: decide which dls are most
    appropriate for an information need based on
    their descriptions
  • Result merging: merge ranked retrieval results
    from a set of selected dls

6
Source representation (prior research)
  • STARTS cooperative protocol (Gravano et al.,
    Proc. of ACM SIGMOD, 1997)
  • Query-based sampling for uncooperative
    environments (Callan, 2000). Directly related to
    the hidden web problem

7
Source selection (prior research)
  • Algorithms based on resource ranking (CORI,
    gGlOSS, Kullback-Leibler divergence based)
  • Threshold for resource selection usually set to a
    heuristic value (e.g. 5 or 10)

8
Result merging (prior research)
  • 1st approach: normalize resource-specific document
    scores into resource-independent document scores
    (CORI, SSL merging algorithms)
  • 2nd approach: recalculate document scores at the
    directory service (Kirsch algorithm: each
    resource provides summary statistics)

9
P2p network architecture
  • Clients (information consumers) issue requests
    (queries)
  • Servers (information providers, dls) route
    requests (query routing) to other servers
    (directory services) and respond to requests
    (retrieval)
  • Lower level (leaf nodes): providers and consumers.
    They only connect to hubs
  • Upper level (hub nodes): directory services.
    They connect with leaves and other hubs
  • Query routing is the unique/critical issue in p2p
    nets

10
Structured vs hierarchical p2p architecture
  • Important distinction: a structured architecture
    uses a DHT (distributed hash table), which maps
    every data object to a distributed key. In
    contrast, hierarchical architectures
    automatically discover the contents of each dl
    (appropriate for dynamic, heterogeneous,
    privacy-protected nets)
  • Hierarchical architectures support sophisticated
    search techniques that are not constrained to
    controlled or small vocabularies (more
    appropriate for full-text search). However, they
    are more complex and incur higher communication
    costs
  • Common characteristic: construction of an overlay
    to organize peers for efficient query routing
    (semantic overlay networks)

11
Existing implementations
  • PlanetP: each peer uses a TF.IDF algorithm to
    decide which peers to contact for an information
    request (Cuenca-Acuna and Nguyen, 2002)
  • pSearch: uses the semantic vector of each
    document (through LSI) to distribute document
    indices in a structured p2p net (Tang et al.,
    2003)

12
Paper contribution
  • Revise and adapt methods to solve the problems
    in hierarchical p2p nets more efficiently
  • Develop new approaches (e.g. resource ranking)
  • Distinguish between cooperative and
    uncooperative environments
  • Support the thesis with extensive experimental
    results

13
Resource description (1)
  • Format: a collection language model (lists of
    terms and frequencies along with corpus
    statistics)
  • A resource can be a single provider (dl), a hub
    (multiple connected providers) or a neighborhood
    (all peers reachable from a hub)
  • Description of providers: cf. slide 9
  • Description of hubs: aggregation of the
    descriptions of neighboring providers (within 1
    hop)
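
A minimal sketch of such a description and of its aggregation at a hub follows. The class name `ResourceDescription`, its fields and the whitespace tokenizer are assumptions made for illustration; the slides only state that a description holds terms, frequencies and corpus statistics.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ResourceDescription:
    """Hypothetical container: term frequencies plus simple corpus statistics."""
    term_freq: Counter = field(default_factory=Counter)
    num_docs: int = 0

    @property
    def num_terms(self):
        # Total number of term occurrences in the described collection.
        return sum(self.term_freq.values())


def describe_provider(documents):
    """Describe a single provider from its own documents (cooperative case)."""
    desc = ResourceDescription()
    for text in documents:
        desc.term_freq.update(text.lower().split())
        desc.num_docs += 1
    return desc


def describe_hub(provider_descriptions):
    """Describe a hub by summing the descriptions of its 1-hop providers."""
    hub = ResourceDescription()
    for desc in provider_descriptions:
        hub.term_freq.update(desc.term_freq)   # Counter.update adds counts
        hub.num_docs += desc.num_docs
    return hub


provider_a = describe_provider(["p2p search", "search engines"])
provider_b = describe_provider(["digital libraries search"])
hub = describe_hub([provider_a, provider_b])
```

Because the hub description is a plain sum, a provider joining or leaving only requires adding or subtracting that provider's counts.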

14
Neighborhood description
  • Routing indices: terms, frequencies and path to
    other docs (Crespo and Garcia-Molina, 2002b)
  • Each hub calculates and sends to its hub neighbor
    the resource description of its neighborhood
  • Total number of documents aggregated in
    exponential time
  • Graph cycles must be detected/avoided because
    they affect the accuracy of descriptions
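
The need for cycle handling can be shown with a small sketch that aggregates the descriptions of all hubs reachable through one neighbor. The function name, the adjacency-list representation and the plain summation are assumptions; the paper's actual aggregation rule is not given on the slide.

```python
from collections import Counter


def neighborhood_description(start_hub, via_neighbor, hub_links, hub_descriptions):
    """Sum the term counts of every hub reachable from `start_hub`
    through `via_neighbor`, visiting each hub at most once.

    Without the `visited` set, a cycle in the hub graph would count the
    same hubs repeatedly (or loop forever) and distort the description.
    """
    visited = {start_hub}          # never walk back into the asking hub
    stack = [via_neighbor]
    total = Counter()
    while stack:
        hub = stack.pop()
        if hub in visited:
            continue
        visited.add(hub)
        total.update(hub_descriptions[hub])
        stack.extend(hub_links.get(hub, ()))
    return total


# Three hubs forming a cycle A - B - C - A.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["B", "A"]}
descriptions = {
    "A": Counter({"search": 1}),
    "B": Counter({"search": 2}),
    "C": Counter({"search": 3}),
}
via_b = neighborhood_description("A", "B", links, descriptions)
```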

15
Resource selection (2)
  • Query routing: directing queries to peers that
    are most likely to contain relevant documents.
    Cost is proportional to the number of messages
    carrying the query
  • Flooding technique: accurate but inefficient
    (exponential number of query messages)
  • Random forwarding technique: relatively efficient
    but inaccurate
  • Then what?
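
The trade-off between the two baseline techniques above can be made concrete with a small simulation that counts query messages. The graph, the TTL value and the simplification that a flooding peer forwards to all of its neighbors (including the one it received the query from) are assumptions for illustration only.

```python
import random


def flood(graph, start, ttl):
    """Forward the query to every neighbor for `ttl` hops.

    Returns (set of peers reached, number of query messages sent).
    """
    reached = {start}
    frontier = [start]
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for peer in frontier:
            for neighbor in graph[peer]:
                messages += 1                  # one message per forwarded copy
                if neighbor not in reached:
                    reached.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return reached, messages


def random_forward(graph, start, ttl, rng):
    """Forward the query to a single randomly chosen neighbor per hop."""
    reached = {start}
    current = start
    messages = 0
    for _ in range(ttl):
        if not graph[current]:
            break
        current = rng.choice(graph[current])
        messages += 1
        reached.add(current)
    return reached, messages


graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
flood_reached, flood_messages = flood(graph, 0, ttl=2)
walk_reached, walk_messages = random_forward(graph, 0, ttl=2, rng=random.Random(1))
```

On this toy graph flooding reaches every peer at a cost of 8 messages, while the random walk sends only 2 messages and can visit at most 3 peers.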

16
Resource ranking (full-text)
  • Providers: use of the K-L divergence resource
    ranking algorithm to calculate P(Pi | Q) (Si and
    Callan, 2004)
  • Hubs: same as above, with aggregation over
    selected neighborhoods
  • After ranking, the idea is to select the
    top-ranked entities by either a) specifying a
    predetermined number (not as good for dynamically
    changing nets) or b) letting the entities learn
    their own threshold values automatically and
    autonomously
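
A minimal sketch of K-L divergence based resource ranking follows. It assumes Jelinek-Mercer smoothing against a background model built from all descriptions, a hypothetical mixing weight `lam = 0.5`, and uniform resource priors; none of these settings are taken from the paper. Ranking by the smoothed sum of weighted log probabilities is rank-equivalent to ranking by negative K-L divergence between the query model and each resource model, because the remaining term depends only on the query.

```python
import math
from collections import Counter


def kl_score(query_terms, resource_tf, background_tf, lam=0.5):
    """Negative K-L divergence between query and resource language models,
    up to a constant that depends only on the query."""
    query = Counter(query_terms)
    query_len = sum(query.values())
    resource_len = sum(resource_tf.values())
    background_len = sum(background_tf.values())
    score = 0.0
    for term, count in query.items():
        p_res = resource_tf.get(term, 0) / resource_len if resource_len else 0.0
        p_bg = background_tf.get(term, 0) / background_len if background_len else 0.0
        p = lam * p_res + (1 - lam) * p_bg     # smoothed P(term | resource)
        if p == 0.0:
            return float("-inf")               # term unseen in every resource
        score += (count / query_len) * math.log(p)
    return score


def rank_resources(query_terms, descriptions, lam=0.5):
    """Return resource names ordered from best to worst match."""
    background = Counter()
    for term_freq in descriptions.values():
        background.update(term_freq)
    scores = {
        name: kl_score(query_terms, term_freq, background, lam)
        for name, term_freq in descriptions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


descriptions = {
    "dl_music": Counter({"jazz": 5, "guitar": 5}),
    "dl_cs": Counter({"search": 6, "peer": 4}),
    "dl_mixed": Counter({"search": 1, "jazz": 9}),
}
ranking = rank_resources(["peer", "search"], descriptions)
top_1 = ranking[:1]   # option a) above: a predetermined number of resources
```

Option b) would replace the fixed slice with a per-entity score threshold learned from training queries, which this sketch does not attempt.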

17
Unsupervised threshold learning method
  • Providers estimate ranking scores of relevant and
    non-relevant documents using the merged retrieval
    results of a set of training queries (set-based
    threshold learning)
  • Hubs, however, use individual training queries for
    each member of their neighborhoods
    (individual-based threshold learning)
18
Result merging (3)
  • Cooperative environments: use of the Kirsch
    algorithm (Kirsch, 1997), modified so that it no
    longer needs global statistics (lower cost)
  • Uncooperative environments: no summary statistics
    are available, so adapt the Semi-Supervised
    Learning (SSL) algorithm (Si and Callan, 2003a).
    Use linear regression with local weights and
    overlapping documents
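
The regression step for the uncooperative case can be sketched as follows. Overlapping documents (those that appear both in a provider's result list and in the hub's sampled documents) serve as training points for a per-provider linear mapping from provider-specific scores to hub-side scores. The function names, the data layout and the policy of skipping providers with too few overlapping documents are assumptions; the local weighting mentioned above is not modeled.

```python
def fit_linear(pairs):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        return 0.0, mean_y                      # degenerate case: constant scores
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def merge_results(result_lists, reference_scores, min_points=3):
    """Map each provider's scores onto the hub's scale and merge.

    result_lists: {provider: [(doc_id, provider_score), ...]}
    reference_scores: {doc_id: hub-side score} for sampled documents
    """
    merged = []
    for provider, results in result_lists.items():
        pairs = [
            (score, reference_scores[doc_id])
            for doc_id, score in results
            if doc_id in reference_scores       # overlapping documents only
        ]
        if len(pairs) < min_points:
            continue                            # assumed policy: too little evidence
        slope, intercept = fit_linear(pairs)
        merged.extend((doc_id, slope * score + intercept) for doc_id, score in results)
    return sorted(merged, key=lambda item: item[1], reverse=True)


results = {
    "A": [("a1", 10.0), ("a2", 8.0), ("a3", 6.0), ("a4", 4.0)],
    "B": [("b1", 0.9), ("b2", 0.5), ("b3", 0.1), ("b4", 0.7)],
}
reference = {"a1": 1.0, "a2": 0.8, "a3": 0.6, "b1": 0.45, "b2": 0.25, "b3": 0.05}
merged = merge_results(results, reference)
merged_ids = [doc_id for doc_id, _ in merged]
```

In this toy data provider A scores on a 0 to 10 scale and provider B on a 0 to 1 scale; after regression both are expressed on the hub's scale, so documents `a4` and `b4`, which were never sampled, can still be placed in the merged list.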

19
A real P2P network search model (figure)
Source: Full-text federated search in P2P
networks, Lu J., PhD Dissertation, CMU 2007
20
Test data and evaluations
  • Use of WT10g-based testbed collection
  • Number of providers (websites): 2500
  • Number of hubs (content-based clustering): 25
  • Number of documents: 1500000
  • Queries automatically generated (by extracting
    key terms from documents)
  • Evaluation criteria: a) search accuracy and b)
    query routing efficiency

21
Experimental settings
  • Four methods for resource selection
  • Flooding
  • Random selection
  • Full-text selection using a fixed threshold (e.g.
    1 of the top-ranked neighbouring hubs)
  • Full-text selection using learned thresholds
  • TTL (time-to-live) value for each query message
    set to 6
  • Query-based sampling for resource representation
    in uncooperative environments
  • Number of training points to apply SSL method set
    to 3

22
Experimental results (numbers)
23
Experimental results show
  • Full-text selection performs better than flooding
    or random selection
  • Using learned thresholds for resource selection
    yields a few more query messages (than using a
    fixed threshold) but improves search accuracy
  • Uncooperative environments exhibit 10% search
    performance degradation in comparison to
    cooperative ones, which is generally considered
    acceptable

24
Conclusions and future work
  • Enhance hub functionality so that a hub not only
    provides sufficient information for its connected
    providers, but also calculates a path to other,
    probably useful, peers (provider routing
    technique)
  • Method works well for small/medium sized p2p nets
    with regulated network structures and organized
    content distribution. But what happens in
    larger-scale networks?
  • What happens in dynamically/temporally evolving
    nets? What about load balancing, dynamic
    clustering and fault tolerance?

25
Comments
  • Paper contained no intriguing ideas but proposed
    practical modifications to existing methods
  • Writing style showed frequent repetition,
    verbosity and often vagueness
  • The researchers are evidently more inclined
    toward better empirical results/tools for
    real-world applications than toward theoretical
    models
  • All references are taken from the reference list
    of the paper under review

26
Thank you for your attention!
  • Any questions?