1
Query Routing in Peer-to-Peer Web Search Engine
International Max Planck Research School for
Computer Science
  • Speaker: Pavel Serdyukov
  • Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender

2
Talk Outline
  • Motivation
  • Proposed search engine architecture
  • Query routing and database selection
  • Similarity-based measures
  • Example: GlOSS
  • Document-frequency-based measures
  • Example: CORI
  • Evaluation of methods
  • Proposals
  • Conclusion

3
Problems of present Web search engines
  • Size of the indexable Web
  • The Web is huge; it is difficult to cover it all
  • Timely re-crawls are required
  • Technical limits
  • Deep Web
  • Monopoly of Google
  • Controls 80% of web search requests
  • Paid sites get updated more frequently and get a higher rating
  • Sites may be censored by the engine

4
Make use of Peer-to-Peer technology
  • Exploit previously unused CPU/memory/disk power
  • Provide up-to-date results for small portions of the Web
  • Conquer the Deep Web with personalized and specialized web crawlers

Global Directory
The global directory must be shared among the peers!
5
Query routing
  • Goal: find peers with relevant documents
  • Previously known as the Database Selection Problem
  • Not all existing techniques are applicable to P2P

6
Database Selection Problem
  • 1st inference: Is this document relevant?
  • This is a subjective user judgment; we can only model it
  • We use only representations of user needs and documents (keywords, inverted indices)
  • 2nd inference: a database can potentially satisfy the query if it
  • has many documents (naive, size-based approach)
  • has many documents containing all query words
  • has a high number of documents whose similarity exceeds a given threshold
  • has a high summed similarity over its documents

7
Measuring usefulness
  • The number of documents containing all query words is unknown
  • no full document representations are available,
  • only database summaries (representatives)
  • The 3rd inference (usefulness) is built on top of the previous two
  • Steps of database selection (see the sketch below):
  • Rely on sensible 1st and 2nd inferences
  • Choose database representatives for the 3rd inference
  • Calculate usefulness measures
  • Choose the most useful databases
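
A minimal sketch of these steps in Python; the summary layout and the scoring callback are illustrative assumptions, not the actual system interface.

```python
# Sketch of the database-selection steps above (illustrative only; the
# summary format and usefulness function are assumptions, not the system API).

def select_peers(query_terms, peer_summaries, usefulness, top_k=3):
    """peer_summaries: dict peer_id -> compact summary (e.g. per-term DFs).
    usefulness: function (query_terms, summary) -> score (the 3rd inference)."""
    scored = [(usefulness(query_terms, s), pid) for pid, s in peer_summaries.items()]
    scored.sort(reverse=True)                  # most useful databases first
    return [pid for _, pid in scored[:top_k]]  # route the query to these peers
```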

8
Similarity-based measures
  • Definition: usefulness is the sum of all document similarities exceeding a threshold l
  • Simplest case: the summed weight of the query terms across the collection (see the sketch below)
  • no assumptions about word co-occurrence
  • l = 0
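
A minimal sketch of this simplest measure, assuming the peer summary stores the total (summed) document weight of each term; that summary format is an assumption made for illustration.

```python
# Simplest similarity-based measure (threshold l = 0): usefulness is the
# summed weight of the query terms across the whole collection. The summary
# field "total_weight" is an assumed format, chosen for illustration.

def usefulness_sum(query_terms, summary):
    # summary: dict term -> {"df": ..., "total_weight": ...}
    return sum(summary[t]["total_weight"] for t in query_terms if t in summary)
```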

9
GlOSS
  • High-correlation assumption
  • Sort all n query terms Ti in descending order of their document frequencies (DFs)
  • DF_n documents contain T_n, T_(n-1), ..., T_1
  • DF_(n-1) - DF_n documents contain T_(n-1), T_(n-2), ..., T_1, and so on
  • DF_1 - DF_2 documents contain only T_1
  • Use averaged term weights to calculate document similarity
  • l > 0
  • l is query dependent
  • l is collection dependent
  • usually because local IDFs differ between collections
  • Proposal: use global term importance
  • Usually l is set to 0 in experiments
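
A sketch of a GlOSS-style Max(l) estimate under the high-correlation assumption; the summary fields (per-term DF and average per-document weight avg_w) and the unit query-term weights are assumptions made for illustration.

```python
# GlOSS-style estimate under the high-correlation assumption (sketch).
# Terms are bucketed as on the slide: df_i - df_{i+1} documents are assumed
# to contain exactly the i most frequent query terms.

def gloss_max(query_terms, summary, l=0.0):
    # per-term (DF, average per-document weight), highest DF first
    stats = sorted(((summary[t]["df"], summary[t]["avg_w"])
                    for t in query_terms if t in summary), reverse=True)
    usefulness, sim = 0.0, 0.0
    for i, (df, avg_w) in enumerate(stats):
        sim += avg_w                                   # docs here contain T1..Ti
        next_df = stats[i + 1][0] if i + 1 < len(stats) else 0
        if sim > l:                                    # count only docs above threshold l
            usefulness += (df - next_df) * sim
    return usefulness
```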

10
Problems of similarity-based measures
  • Is this inference good?
  • A few highly scored documents and many low-scored documents are treated as equally useful
  • Proposal: sum only the top K similarities (see the sketch below)
  • Highly scored documents can be a bad indicator of usefulness
  • Most relevant documents have moderate scores
  • Highly scored documents may be non-relevant
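
A sketch of the top-K proposal; the per-document similarity estimates are assumed to come from whichever similarity model the earlier inferences use.

```python
# "Sum only the top-K similarities" proposal (sketch): a mass of weak matches
# can no longer outweigh a handful of strong ones.

import heapq

def usefulness_top_k(similarities, k=10):
    # similarities: estimated per-document scores for one collection
    return sum(heapq.nlargest(k, similarities))
```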

11
Document-frequency-based measures
  • Do not use term frequencies (actual similarities)
  • Exploit document frequencies only
  • Exploit a global measure of term importance (see the sketch below):
  • average IDF
  • ICF (inverse collection frequency)
  • Main assumption: many documents with rare terms
  • carry more meaning for the user
  • most likely contain the other query terms as well
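
A sketch of the two global term-importance notions mentioned above; the smoothing constants (+0.5, +1.0) follow the usual CORI-style convention and are an assumption here.

```python
# Global term-importance notions (sketch): ICF and averaged local IDF.

import math

def icf(num_collections, cf):
    """Inverse collection frequency: terms occurring in few peers score high."""
    return math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)

def avg_idf(local_idfs):
    """Average of the local IDF values reported by the individual peers."""
    return sum(local_idfs) / len(local_idfs)
```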

12
CORI: using TF/IDF-style normalization
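
A sketch of the CORI belief formula in its commonly published form; the constants 50 and 150 and the default belief b = 0.4 are the usual published defaults, assumed here.

```python
# CORI belief for one query term in one collection (sketch of the commonly
# published form). df: documents in this collection containing the term;
# cf: collections containing the term; cw: words in this collection;
# avg_cw: average cw over all collections.

import math

def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)      # TF-like part (DF normalization)
    i = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))         # IDF-like part (ICF)
    return b + (1.0 - b) * t * i

# A collection's usefulness for a query is then the sum (or average) of these
# beliefs over the query terms.
```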

13
CORI Issues
  • Pure document frequencies make CORI better
  • The less statistics, the simpler:
  • smaller variance
  • better estimates of the ranking, rather than of the actual database summaries
  • No use of document richness
  • To be normalized or not to be?
  • Small databases are not necessarily better
  • A collection may specialize well in several topics

14
Using usefulness measures
Example: query "Information Retrieval", C = 1000
CF(Information) = 120, CF(Retrieval) = 40

           Information (CF = 120)      Retrieval (CF = 40)
           DFmax   avg_tf   DF         DFmax   avg_tf   DF
  Peer1      20      60     12           60       8      5
  Peer2     400       6     60          400       4     10
  Peer3      60      15     20           60      10      5
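
One way to score this example, shown below as a sketch; the plain DF * ICF measure used here and the reading of C = 1000 as the number of collections are assumptions made for illustration.

```python
# Scoring the three peers from the example with a plain DF * ICF measure
# (illustrative choice; C = 1000 is read as the number of collections).

import math

C = 1000
cf = {"information": 120, "retrieval": 40}     # collection frequencies
df = {                                         # per-peer document frequencies
    "Peer1": {"information": 12, "retrieval": 5},
    "Peer2": {"information": 60, "retrieval": 10},
    "Peer3": {"information": 20, "retrieval": 5},
}

def icf(term):
    return math.log((C + 0.5) / cf[term]) / math.log(C + 1.0)

for peer, stats in df.items():
    score = sum(n * icf(t) for t, n in stats.items())
    print(peer, round(score, 2))   # the rarer term "retrieval" counts more per document
```

Under these assumptions Peer2 ranks highest.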
15
Analysis of experiments
  • CORI is the best, but
  • only when choosing more than 50 out of 236 databases
  • only about 10% better when choosing more than 90 databases
  • Test collections are strange:
  • chronologically or even randomly separated documents
  • no topic specificity
  • no actual Web data used
  • no overlap among collections
  • The experiments are unrealistic, so it is unclear
  • which method is better
  • whether there is any satisfactory method at all

16
Possible solutions
  • Most of the measures can be unified in a common framework
  • We can play with it and try:
  • various normalization schemes
  • different notions of term importance (ICF, local IDF)
  • using statistics of the top documents only
  • changing the power of the factors
  • DF·ICF^4 is not worse than CORI (see the sketch below)
  • changing the form of the expression

(Framework diagram: GlOSS and CORI as special cases of the unified framework)
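
A sketch of the unified-framework idea as one parametrised scorer; the parameter names and layout are illustrative assumptions, not taken from the talk.

```python
# Parametrised scoring function: the knobs (exponents, term importance,
# normalisation) select a concrete measure within the framework.

def framework_score(query_terms, dfs, importance, df_power=1.0,
                    imp_power=1.0, normalizer=1.0):
    """dfs: per-term document frequencies for one peer;
    importance: per-term global weights (e.g. ICF or averaged local IDF)."""
    score = sum((dfs.get(t, 0) ** df_power) * (importance.get(t, 0.0) ** imp_power)
                for t in query_terms)
    return score / normalizer

# df_power = 1, imp_power = 4 gives the DF * ICF^4 variant mentioned above.
```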
17
Conclusion
  • What has been done:
  • measures have been analytically evaluated
  • a sensible subset of the measures has been chosen
  • the measures have been implemented
  • What could be done next:
  • carry out new, sensible experiments
  • choose an appropriate usefulness measure
  • experiment with database representatives
  • build our own measure
  • try to exploit collection metadata
  • bookmarks, authoritative documents, collection descriptions

18
  • Thank you for your attention!