Full-text federated search of text-based digital libraries in peer-to-peer networks (Information Retrieval)

1
Full-text federated search of text-based digital
libraries in peer-to-peer networks
Information Retrieval 2006, Springer
Paper presentation

Jie Lu, Jamie Callan
Language Technologies Institute, School of
Computer Science, Carnegie Mellon University

Konstantinos Zaharis, Dept. of Comp. and
Comm. Engineering, UTH
2
Paper Outline

Introduction
Overview / prior research
Full-text federated search in p2p
Test data
Evaluation methodology and experimental settings
Results
Conclusions and future work

3
Introduction

Federated = distributed
Problem addressed: use of p2p nets as a search
layer for text-based digital libraries (dls).
Why p2p? Because they need no central authority
(decentralized) and they connect heterogeneous,
multi-vendor and lightly managed enterprise nets.
In short, they are robust and scalable

4
Two types of environments in a p2p net

Cooperative p2p environments: each provider gives
its own accurate resource description to each
neighbouring directory service (hub)
Uncooperative p2p environments: each directory
service independently conducts query-based
sampling to obtain sample documents from its
neighbouring providers and builds its own
resource descriptions from them
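
The query-based sampling step for uncooperative environments can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the `search(term, k)` provider interface, the seed term and the stopping parameters are all hypothetical.

```python
import random
from collections import Counter


def query_based_sampling(search, seed_term, max_docs=300,
                         docs_per_query=4, max_queries=1000, rng=None):
    """Build a resource description by probing a provider with one-term queries.

    `search(term, k)` is a hypothetical provider interface that returns up to
    k (doc_id, text) pairs; the provider reveals nothing else about its index.
    Returns the learned term frequencies and the number of sampled documents.
    """
    rng = rng or random.Random(0)
    seen = set()
    term_freq = Counter()
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        for doc_id, text in search(term, docs_per_query):
            if doc_id not in seen:
                seen.add(doc_id)
                term_freq.update(text.lower().split())
        if not term_freq:
            break  # the seed term matched nothing, so there is no vocabulary yet
        # Next probe: a term drawn at random from the vocabulary learned so far.
        term = rng.choice(sorted(term_freq))
    return term_freq, len(seen)
```

The directory service only ever sees documents returned for its probe queries, so the resulting description is an estimate of the provider's contents rather than an exact summary.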

5
Overview

Distributed IR poses three main problems:
Resource representation: discover the content
areas covered by each dl
Resource selection: decide which dls are most
appropriate for an information need based on
their descriptions
Result merging: merge ranked retrieval results
from a set of selected dls

6
Source representation (prior research)

STARTS: cooperative protocol (Gravano et al.,
Proc. of ACM SIGMOD, 1997)
Query-based sampling for uncooperative
environments (Callan, 2000). Directly addresses
the hidden web problem

7
Source selection (prior research)

Algorithms based on resource ranking (CORI,
gGlOSS, Kullback-Leibler divergence based)
Threshold for resource selection usually set to a
heuristic value (e.g. 5 or 10)

8
Result merging (prior research)

1st approach: normalize resource-specific document
scores into resource-independent document scores
(CORI, SSL merging algorithms)
2nd approach: recalculate document scores at the
directory service (Kirsch algorithm: each
resource provides summary statistics)

9
P2p network architecture

Clients (information consumers) issue requests
(queries)
Servers (information providers, dls) route
requests (query routing) to other servers
(directory services) and respond to requests
(retrieval)
Lower level: leaf nodes (providers and consumers).
Only connect to hubs
Upper level: hub nodes (directory services).
Connect with leaves and other hubs
Query routing is the unique/critical issue in p2p
nets

10
Structured vs hierarchical p2p architecture

Important distinction: a structured architecture
uses DHTs (distributed hash tables), which map
every data object to a distributed key.
Hierarchical architectures, by contrast,
automatically discover the contents of each dl
(appropriate for dynamic, heterogeneous,
privacy-protected nets)
Hierarchical architectures support sophisticated
search techniques that are not constrained to
controlled or small vocabularies (more
appropriate for full-text search). However, they
are more complex and incur higher communication
costs
Common characteristic: construction of an overlay
to organize peers for efficient query routing
(semantic overlay networks)

11
Existing implementations

PlanetP: each peer uses a TF.IDF algorithm to
decide which peers to contact for an information
request (Cuenca-Acuna and Nguyen, 2002)
pSearch: uses the semantic vector of each
document (through LSI) to distribute document
indices in a structured p2p net (Tang et al.,
2003)

12
Paper contribution

Revise and adapt existing methods to solve the
problems more efficiently in hierarchical p2p nets
Develop new approaches (e.g. resource ranking)
Distinguish between cooperative and
uncooperative environments
Support the thesis with extensive experimental
results

13
Resource description (1)

Format: a collection language model (lists of
terms and frequencies along with corpus
statistics)
A resource can be a single provider (dl), a hub
(multiple connected providers) or a neighborhood
(all peers reachable from a hub)
Description of providers: cf. slide 9
Description of hubs: aggregation of the
descriptions of neighboring providers (within 1 hop)
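
A hub description built by aggregating provider descriptions might look like the sketch below. The `ResourceDescription` class and `describe_hub` function are hypothetical names introduced for illustration; the only corpus statistic assumed here is the document count.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ResourceDescription:
    """Collection language model: term frequencies plus corpus statistics."""
    term_freq: Counter = field(default_factory=Counter)
    num_docs: int = 0

    @property
    def num_terms(self) -> int:
        # Total number of term occurrences covered by this description.
        return sum(self.term_freq.values())


def describe_hub(provider_descriptions):
    """Aggregate the descriptions of providers one hop away from a hub."""
    hub = ResourceDescription()
    for description in provider_descriptions:
        hub.term_freq.update(description.term_freq)  # add up term counts
        hub.num_docs += description.num_docs
    return hub
```

Because a hub description has the same format as a provider description, the same ranking code can be applied to providers and hubs alike.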

14
Neighborhood description

Routing indices: terms, frequencies, path to other
docs (Crespo and Garcia-Molina, 2002b)
Each hub calculates and sends to its hub neighbors
the resource description of its neighborhood
Total number of documents aggregated in
exponential time
Detection/avoidance of graph cycles, because they
affect the accuracy of descriptions
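
Cycle avoidance during neighborhood aggregation can be illustrated with a breadth-first traversal that keeps a visited set. This is a simplified sketch: the graph layout, the `max_hops` cut-off and the plain summation of term counts are assumptions, and it is not the routing-index scheme of the cited work.

```python
from collections import Counter, deque


def neighborhood_description(hub_graph, hub_desc, origin, neighbor, max_hops=3):
    """Aggregate descriptions of hubs reachable from `origin` via `neighbor`.

    hub_graph: dict hub -> list of neighbouring hubs (hypothetical layout).
    hub_desc:  dict hub -> Counter of term frequencies for that hub.
    The visited set ensures that a hub lying on a cycle is counted only once.
    """
    total = Counter()
    visited = {origin, neighbor}
    queue = deque([(neighbor, 1)])
    while queue:
        hub, hops = queue.popleft()
        total.update(hub_desc[hub])
        if hops == max_hops:
            continue  # do not look beyond the hop limit
        for nxt in hub_graph[hub]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops + 1))
    return total
```

Without the visited set, a triangle of hubs would contribute the same term counts repeatedly, which is the kind of inaccuracy the slide warns about.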

15
Resource selection (2)

Query routing: directing queries to peers that
are most likely to contain relevant documents.
Cost is proportional to the number of messages
carrying the query
Flooding technique: accurate but inefficient
(exponential number of query messages)
Random forward technique: relatively efficient
but inaccurate
Then what?
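
The cost difference between flooding and random forwarding can be made concrete by counting query messages on a toy graph. The sketch assumes that each peer forwards a given query at most once and that a message counts even when its recipient has already seen the query; all function names are hypothetical.

```python
import random
from collections import deque


def count_query_messages(graph, start, ttl, choose):
    """Count messages sent when a query is forwarded for up to `ttl` hops.

    graph: dict peer -> list of neighbouring peers.
    `choose(neighbors)` returns the neighbours a peer forwards the query to.
    """
    messages = 0
    forwarded = {start}
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # time-to-live exhausted, the query stops here
        for nxt in choose(graph[peer]):
            messages += 1  # sent even if nxt has already seen the query
            if nxt not in forwarded:
                forwarded.add(nxt)
                queue.append((nxt, remaining - 1))
    return messages


def flood(neighbors):
    """Flooding: forward to every neighbour."""
    return list(neighbors)


def make_random_forward(rng):
    """Random forwarding: forward to a single randomly chosen neighbour."""
    def pick(neighbors):
        return [rng.choice(neighbors)] if neighbors else []
    return pick
```

On a fully connected graph of four peers with a TTL of 2, flooding sends 12 messages while random forwarding sends 2, which shows why neither extreme is satisfactory.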

16
Resource ranking (full-text)

Providers: use of the K-L divergence resource
ranking algorithm to calculate P(Pi | Q) (Si and
Callan, 2004)
Hubs: same as above, with aggregation over
selected neighborhoods
After ranking, the idea is to select the
top-ranked entities by either a) specifying a
predetermined number (not as good for dynamically
changing nets) or b) letting the entities learn
their own threshold values automatically and
autonomously
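
A language-model resource ranking of this kind can be sketched as below. It scores each resource by a smoothed query log-likelihood, which orders resources the same way as negative K-L divergence between the query and the resource model. The smoothing weight `lam`, the background model pooled from the candidate descriptions and the uniform prior over resources are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def rank_resources(query_terms, descriptions, lam=0.5):
    """Rank resources by smoothed query log-likelihood log P(Q | resource).

    descriptions: dict name -> Counter of term frequencies.
    With a uniform prior over resources this orders them like P(resource | Q).
    """
    background = Counter()
    for tf in descriptions.values():
        background.update(tf)
    bg_total = sum(background.values())

    scores = {}
    for name, tf in descriptions.items():
        total = sum(tf.values())
        score = 0.0
        for q in query_terms:
            p_res = tf[q] / total if total else 0.0
            p_bg = background[q] / bg_total if bg_total else 0.0
            p = lam * p_res + (1 - lam) * p_bg  # linear interpolation smoothing
            if p == 0.0:
                score = float("-inf")  # term unseen in every description
                break
            score += math.log(p)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_top(ranked, k):
    """Option a): keep a predetermined number of top-ranked resources."""
    return ranked[:k]
```

Option b), learned thresholds, would replace `select_top` with a cut-off on the score itself; that step is not sketched here.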

17
Unsupervised threshold learning method

Providers estimate ranking scores of relevant and
non-relevant documents using the merged retrieval
results of a set of training queries (set-based
threshold learning)
Hubs, however, use individual training queries for
each member of their neighborhoods
(individual-based threshold learning)
18
Result merging (3)

Cooperative environments: use of the Kirsch
algorithm (Kirsch, 1997), modified so that it no
longer needs global statistics (lower cost)
Uncooperative environments: no summary statistics
are available, so adapt the Semi-Supervised
Learning (SSL) algorithm (Si and Callan, 2003a).
Use linear regression with local weights and
overlapping documents
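
The regression idea behind SSL-style merging can be sketched as follows: documents that appear both in a provider's result list and in the hub's own sample give pairs of (local score, hub score), and a line fitted through those pairs maps the remaining local scores onto a comparable scale. The function names and data layout are hypothetical, and the local weighting mentioned above is left out.

```python
def fit_score_mapping(pairs):
    """Least-squares fit of central = a * local + b from overlapping documents.

    pairs: list of (local_score, central_score) for documents found both in a
    provider's result list and in the hub's sampled collection.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two overlapping documents")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("local scores are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def merge_results(result_lists, mappings):
    """Map each provider's local scores to comparable scores, then sort.

    result_lists: dict provider -> list of (doc_id, local_score).
    mappings:     dict provider -> (a, b) as returned by fit_score_mapping.
    """
    merged = []
    for provider, results in result_lists.items():
        a, b = mappings[provider]
        merged.extend((a * score + b, doc_id) for doc_id, score in results)
    merged.sort(reverse=True)
    return [doc_id for _, doc_id in merged]
```

A provider with too few overlapping documents cannot be fitted this way, which is why a minimum number of training points matters in practice.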

19
A real P2P network search model
Source: Full-text federated search in P2P
networks, Lu J., PhD Dissertation, CMU 2007
20
Test data and evaluations

Use of a WT10g-based testbed collection
Providers (websites): 2500
Hubs (content-based clustering): 25
Documents: 1500000
Queries: automatically generated (by extracting
key terms from documents)
Evaluation criteria: a) search accuracy and b)
query routing efficiency

21
Experimental settings

Four methods for resource selection:
Flooding
Random selection
Full-text selection using a fixed threshold (e.g.
1 of the top-ranked neighbouring hubs)
Full-text selection using learned thresholds
TTL (time-to-live) value for each query message
set to 6
Query-based sampling for resource representation
in uncooperative environments
Number of training points to apply SSL method set to 3

22
Experimental results (numbers)
23
Experimental results show

Full-text selection performs better than flooding
or random selection
Using learned thresholds for resource selection
requires a few more query messages (than using a
fixed threshold) but improves search accuracy
Uncooperative environments exhibit a 10% search
performance degradation in comparison to
cooperative ones, which is generally considered
acceptable

24
Conclusions and future work

Enhance hub functionality so that a hub not only
provides sufficient information for its connected
providers, but also calculates a path to other,
potentially useful, peers (provider routing
technique)
Method works well for small/medium sized p2p nets
with regulated network structures and organized
content distribution. But what happens in
larger-scale networks?
What happens in dynamically/temporally evolving
nets? What about load balancing, dynamic
clustering and fault tolerance?

25
Comments
Paper contained no intriguing ideas but proposed
practical modifications to existing methods
Writing style showed frequent repetition,
verbosity and often vagueness
The researchers are clearly more interested in
better empirical results/tools for real-world
applications than in theoretical models
All references are taken from the reference list
of the paper under discussion

26
Thank you for your attention!