Title: Full-text federated search of text-based digital libraries in peer-to-peer networks, Information Retrieval
1 Full-text federated search of text-based digital
libraries in peer-to-peer networks, Information
Retrieval 2006, Springer
Paper presentation
- Jie Lu, Jamie Callan
- Language Technologies Institute, School of
Computer Science, Carnegie Mellon University
Konstantinos Zaharis, Dept. of Comp. and
Comm. Engineering, UTH
2 Paper Outline
- Introduction
- Overview / prior research
- Full-text federated search in p2p
- Test data
- Evaluation methodology and experimental settings
- Results
- Conclusions and future work
3 Introduction
- Federated = distributed
- Problem addressed: use of p2p nets as a search
layer for text-based digital libraries (dls).
- Why p2p? Because they need no central
authority (decentralized) and can connect
heterogeneous, multi-vendor and lightly managed
enterprise nets. In short, they are robust and
scalable.
4 Two types of environments in a p2p net
- Cooperative p2p environments: each provider gives
an accurate description of its own resources to each
neighbouring directory service (hub).
- Uncooperative p2p environments: each directory
service independently conducts query-based
sampling to obtain sample documents from its
neighbouring providers and builds its own
resource descriptions from those samples.
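The sampling loop used in uncooperative environments can be sketched roughly as follows. This is a minimal illustration, not the procedure from the paper: the `search(term, k)` provider interface, the seed term, the stopping rule and the toy documents are all assumptions made for the example.

```python
import random
from collections import Counter

def query_based_sampling(search, seed_term, max_docs=4, docs_per_query=2, rng=None):
    """Build a term-frequency description of a provider from sampled documents.

    `search(term, k)` is a hypothetical provider interface that returns up to
    k (doc_id, text) pairs for a one-term query.
    """
    rng = rng or random.Random(0)
    description = Counter()   # term frequencies learned from the samples
    seen = set()              # ids of documents already sampled
    tried = set()             # query terms already sent
    term = seed_term
    while len(seen) < max_docs and term is not None:
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id not in seen:
                seen.add(doc_id)
                description.update(text.lower().split())
        # Draw the next query term from the vocabulary learned so far.
        candidates = sorted(set(description) - tried)
        term = rng.choice(candidates) if candidates else None
    return description, seen

# Toy provider standing in for a real digital library (assumption).
DOCS = {
    "d1": "peer to peer search",
    "d2": "digital library search",
    "d3": "library of digital text",
}

def toy_search(term, k):
    hits = [(i, t) for i, t in sorted(DOCS.items()) if term in t.split()]
    return hits[:k]

description, sampled = query_based_sampling(toy_search, "search", max_docs=3)
```

The hub never sees the provider's index; it only sees documents returned for its own probe queries, which is why the resulting description is an estimate.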
5 Overview
- Distributed IR poses three main problems:
- Resource representation: discover the content areas
covered by each dl.
- Resource selection: decide which dls are most
appropriate for an information need, based on
their descriptions.
- Result merging: merge the ranked retrieval results
from the set of selected dls.
6 Source representation (prior research)
- STARTS cooperative protocol (Gravano et al.,
Proc. of ACM SIGMOD, 1997)
- Query-based sampling for uncooperative
environments (Callan, 2000). Directly addresses the
hidden web problem.
7 Source selection (prior research)
- Algorithms based on resource ranking (CORI,
gGlOSS, Kullback-Leibler divergence based)
- Threshold for resource selection usually set to a
heuristic value (e.g. 5 or 10)
8 Result merging (prior research)
- 1st approach: normalize resource-specific document
scores into resource-independent document scores
(CORI, SSL merging algorithms).
- 2nd approach: recalculate document scores at the
directory service (Kirsch algorithm: each
resource provides summary statistics).
9 P2p network architecture
- Clients (information consumers) issue requests
(queries).
- Servers (information providers, dls) route
requests (query routing) to other servers
(directory services) and respond to requests
(retrieval).
- Lower level, leaf nodes: providers and consumers.
They connect only to hubs.
- Upper level, hub nodes: directory services.
They connect to leaves and to other hubs.
- Query routing is the unique/critical issue in p2p
nets.
10 Structured vs hierarchical p2p architecture
- Important distinction: structured architectures
use a DHT (distributed hash table), which maps
every data object to a distributed key. In
contrast, hierarchical architectures
automatically discover the contents of each dl
(appropriate for dynamic, heterogeneous,
privacy-protected nets).
- Hierarchical architectures support sophisticated
search techniques that are not constrained to
controlled or small vocabularies (more
appropriate for full-text search). However, they
are more complex and incur higher communication
costs.
- Common characteristic: construction of an overlay
to organize peers for efficient query routing
(semantic overlay networks).
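To make the structured side of the contrast concrete, here is a small sketch of DHT-style placement, where a key is hashed onto a ring and assigned to the next peer clockwise. The class, the peer names and the ring size are hypothetical and do not correspond to any particular DHT implementation.

```python
import hashlib
from bisect import bisect_left

def ring_position(name, bits=32):
    """Hash a string to a position on a ring of size 2**bits."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class ToyDHT:
    """Assigns each key to the first peer at or after its ring position."""

    def __init__(self, peer_names):
        self.ring = sorted((ring_position(p), p) for p in peer_names)
        self.positions = [pos for pos, _ in self.ring]

    def responsible_peer(self, key):
        idx = bisect_left(self.positions, ring_position(key))
        # Wrap around to the first peer when the key hashes past the last one.
        return self.ring[idx % len(self.ring)][1]

dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
owner = dht.responsible_peer("federated")
```

An exact key always lands on the same peer, which suits lookups over a controlled vocabulary. A free-text query has no single key, which is one reason the slides favour hierarchical architectures for full-text search.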
11 Existing implementations
- PlanetP: each peer uses a TF.IDF algorithm to
decide which peers to contact for an information
request (Cuenca-Acuna and Nguyen, 2002).
- pSearch: uses the semantic vector of each
document (obtained through LSI) to distribute
document indices in a structured p2p net
(Tang et al., 2003).
12 Paper contribution
- Revise and adapt existing methods to solve the
problems more efficiently in hierarchical p2p nets
- Develop new approaches (e.g. resource ranking)
- Discriminate between cooperative and
uncooperative environments
- Support the thesis with extensive experimental results
13 Resource description (1)
- Format: a collection language model (lists of
terms and frequencies along with corpus
statistics).
- A resource can be a single provider (dl), a hub
(multiple connected providers) or a neighborhood
(all peers reachable from a hub).
- Description of providers: cf. slide 9
- Description of hubs: aggregation of the descriptions
of neighboring providers (within 1 hop).
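The hub-level aggregation can be sketched as a sum over the descriptions of directly connected providers. The dictionary layout (`terms`, `num_docs`) and the toy counts are assumptions for illustration; the paper's actual description format may carry more statistics.

```python
from collections import Counter

def aggregate_descriptions(provider_descriptions):
    """Sum term frequencies and document counts of directly connected providers."""
    terms = Counter()
    num_docs = 0
    for desc in provider_descriptions:
        terms.update(desc["terms"])
        num_docs += desc["num_docs"]
    return {"terms": terms, "num_docs": num_docs}

# Two hypothetical providers one hop away from the hub.
provider_a = {"terms": Counter({"library": 4, "search": 2}), "num_docs": 3}
provider_b = {"terms": Counter({"search": 5, "peer": 1}), "num_docs": 2}

hub_description = aggregate_descriptions([provider_a, provider_b])
```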
14 Neighborhood description
- Routing indices: terms, frequencies and paths to other docs
(Crespo and Garcia-Molina, 2002b)
- Each hub calculates the resource description of its
neighborhood and sends it to its hub neighbor.
- Total number of documents aggregated in exponential
time
- Graph cycles must be detected/avoided because they
affect the accuracy of the descriptions.
15 Resource selection (2)
- Query routing: directing queries to the peers that
are most likely to contain relevant documents.
Cost is proportional to the number of messages
carrying the query.
- Flooding technique: accurate but inefficient
(exponential number of query messages).
- Random forwarding technique: relatively efficient
but inaccurate.
- Then what?
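A rough sketch of why flooding is costly: on a toy tree-shaped network (an assumption, real overlays have cycles and uneven degree), the number of query messages grows geometrically with the TTL. The function names and the forwarding rule are illustrative.

```python
from collections import deque

def flood(graph, start, ttl):
    """Flood a query from `start`; return (peers reached, messages sent).

    A peer forwards the query to every neighbour except the sender,
    the first time it receives it and while TTL remains.
    """
    reached = {start}
    messages = 0
    queue = deque([(start, None, ttl)])
    while queue:
        node, sender, remaining = queue.popleft()
        if remaining == 0:
            continue
        for neighbour in graph[node]:
            if neighbour == sender:
                continue
            messages += 1
            if neighbour not in reached:
                reached.add(neighbour)
                queue.append((neighbour, node, remaining - 1))
    return reached, messages

def tree_graph(branching, depth):
    """Adjacency lists of a complete tree; node 0 is the root."""
    graph = {0: []}
    frontier = [0]
    next_id = 1
    for _level in range(depth):
        new_frontier = []
        for parent in frontier:
            for _child in range(branching):
                graph[parent].append(next_id)
                graph[next_id] = [parent]
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return graph

graph = tree_graph(branching=3, depth=4)
counts = [flood(graph, 0, ttl)[1] for ttl in range(1, 5)]
# counts == [3, 12, 39, 120]
```

Each extra hop multiplies the newly sent messages by the branching factor, which is the exponential cost the slide refers to.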
16 Resource ranking (full-text)
- Providers: use the K-L divergence resource ranking
algorithm to calculate P(Pi | Q) (Si and Callan,
2004).
- Hubs: same as above, with aggregation over
selected neighborhoods.
- After ranking, the idea is to select the
top-ranked entities by either a) specifying a
predetermined number (less suitable for dynamically
changing nets) or b) letting the entities learn
their own threshold values automatically and
autonomously.
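The ranking step can be illustrated with a smoothed query-likelihood score, which for a fixed query orders resources the same way as the negative K-L divergence between the query model and each resource's language model. This is a sketch under assumptions: the linear smoothing with weight `lam`, the uniform prior over resources and the toy descriptions are choices made for the example, not the exact formulation of Si and Callan.

```python
import math
from collections import Counter

def rank_resources(query_terms, descriptions, lam=0.5):
    """Rank resources by smoothed query log-likelihood, highest first."""
    # Background model built from all known descriptions (assumption).
    global_model = Counter()
    for terms in descriptions.values():
        global_model.update(terms)
    global_total = sum(global_model.values())

    scores = {}
    for name, terms in descriptions.items():
        total = sum(terms.values())
        log_likelihood = 0.0
        for q in query_terms:
            p_resource = terms.get(q, 0) / total
            p_global = global_model[q] / global_total
            p = lam * p_resource + (1 - lam) * p_global
            if p == 0.0:
                continue  # term unseen in every description: ignore it
            log_likelihood += math.log(p)
        scores[name] = log_likelihood
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical resource descriptions (term frequencies).
descriptions = {
    "P1": Counter({"library": 8, "digital": 6, "music": 1}),
    "P2": Counter({"music": 9, "guitar": 5, "library": 1}),
}
ranking = rank_resources(["digital", "library"], descriptions)
```

Selecting the top-ranked entities then amounts to cutting this list either at a fixed position or at a learned score threshold.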
17 Unsupervised threshold learning method
- Providers: estimate the ranking scores of relevant and
non-relevant documents using the merged retrieval
results of a set of training queries (set-based
threshold learning).
- Hubs, however, use individual training queries for
each member of their neighborhoods
(individual-based threshold learning).
18 Result merging (3)
- Cooperative environments: use of the Kirsch algorithm
(Kirsch, 1997), modified so that it no
longer needs global statistics (lower cost).
- Uncooperative environments: no summary statistics
are available, so the Semi-Supervised Learning (SSL)
algorithm (Si and Callan, 2003a) is adapted. It uses
linear regression with local weights and overlapping
documents.
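The regression idea behind SSL-style merging can be sketched as follows: documents that appear both in a provider's result list and in the hub's own sample serve as training points for a linear map from provider-specific scores to comparable scores. This sketch uses plain unweighted least squares and two training points per provider for brevity; the data layout, names and numbers are assumptions, and the local weighting mentioned on the slide is not modelled.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct x values")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    b = mean_y - a * mean_x
    return a, b

def merge_results(provider_results, reference_scores):
    """Map each provider's scores onto the reference scale and merge.

    provider_results: {provider: {doc_id: provider-specific score}}
    reference_scores: {doc_id: score computed by the hub on its sample}
    """
    merged = {}
    for results in provider_results.values():
        overlap = [d for d in results if d in reference_scores]
        a, b = fit_line([results[d] for d in overlap],
                        [reference_scores[d] for d in overlap])
        for doc_id, score in results.items():
            merged[doc_id] = a * score + b
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores: P1 and P2 use incomparable scales.
provider_results = {
    "P1": {"a1": 10.0, "a2": 6.0, "a3": 2.0},
    "P2": {"b1": 0.9, "b2": 0.5, "b3": 0.1},
}
reference_scores = {"a1": 0.8, "a3": 0.4, "b1": 0.7, "b3": 0.3}

merged = merge_results(provider_results, reference_scores)
```

After the mapping, results from both providers interleave in a single list even though their raw scores differ by an order of magnitude.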
19 A real P2P network search model
Source: Full-text federated search in P2P
networks, Lu J., PhD Dissertation, CMU 2007
20 Test data and evaluations
- Use of a WT10g-based testbed collection
- Number of providers (websites): 2500
- Number of hubs (content-based clustering): 25
- Number of documents: 1500000
- Queries automatically generated (by extracting
key terms from documents)
- Evaluation criteria: a) search accuracy and b)
query routing efficiency
21 Experimental settings
- Four methods for resource selection:
- Flooding
- Random selection
- Full-text selection using a fixed threshold (e.g.
1 of the top-ranked neighbouring hubs)
- Full-text selection using learned thresholds
- TTL (time-to-live) value for each query message
set to 6
- Query-based sampling for resource representation
in uncooperative environments
- Number of training points to apply the SSL method
set to 3
22 Experimental results (numbers)
23 Experimental results show
- Full-text selection performs better than flooding
or random selection.
- Using learned thresholds for resource selection
yields a few more query messages (than using a
fixed threshold) but improves search accuracy.
- Uncooperative environments exhibit a 10% search
performance degradation in comparison to
cooperative ones, which is generally considered
acceptable.
24 Conclusions and future work
- Enhance hub functionality so that a hub not only
provides sufficient information for its connected
providers, but also calculates a path to other,
probably useful, peers (provider routing
technique).
- The method works well for small/medium-sized p2p nets
with regulated network structures and organized
content distribution. But what happens in
larger-scale networks?
- What happens in dynamically/temporally evolving
nets? What about load balancing, dynamic
clustering and fault tolerance?
25 Comments
- The paper contained no intriguing ideas but proposed
practical modifications to existing methods.
- The writing style showed frequent repetition,
verbosity and often vagueness.
- It is evident that the researchers are more inclined
towards better empirical results/tools for real-world
applications than towards theoretical models.
- All references are taken from the reference list of
the paper under review.
26 Thank you for your attention!