1
Query Routing in Peer-to-Peer Web Search Engine
International Max Planck Research School for
Computer Science
  • Speaker: Pavel Serdyukov
  • Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender

2
Talk Outline
  • Motivation
  • Proposed search engine architecture
  • Query routing and database selection
  • Similarity-based measures
  • Example: GlOSS
  • Document-frequency-based measures
  • Example: CORI
  • Evaluation of methods
  • Proposals
  • Conclusion

3
Problems of present Web search engines
  • Size of the indexable Web
  • The Web is huge; it is difficult to cover it all
  • Timely re-crawls are required
  • Technical limits
  • Deep Web
  • Monopoly of Google
  • Controls 80% of web search requests
  • Paid sites get updated more frequently and get a higher rating
  • Sites may be censored by the engine

4
Make use of Peer-to-Peer technology
  • Exploit previously unused CPU/memory/disk power
  • Provide up-to-date results for small portions of the Web
  • Conquer the Deep Web with personalized and specialized web crawlers

Global Directory
The global directory must be shared among the peers!
5
Query routing
  • Goal: find peers with relevant documents
  • Previously known as the Database Selection Problem
  • Not all existing techniques are applicable to P2P

6
Database Selection Problem
  • 1st inference: Is this document relevant?
  • This is a subjective user judgment; we can only model it
  • We use only representations of user needs and documents (keywords, inverted indices)
  • 2nd inference: a database can potentially satisfy the query if it
  • has many documents (naive, size-based approach)
  • has many documents containing all query words
  • has a high number of documents whose similarity exceeds a given threshold
  • has a high summed similarity over its documents

7
Measuring usefulness
  • The number of documents containing all query words is unknown
  • no full document representations are available,
  • only database summaries (representatives)
  • The 3rd inference (usefulness) is built on top of the previous two
  • Steps of database selection (see the sketch below):
  • Rely on sensible 1st and 2nd inferences
  • Choose database representatives for the 3rd inference
  • Calculate usefulness measures
  • Choose the most useful databases
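
A minimal sketch of these steps in Python; the summary layout and the scoring callback are illustrative assumptions, not the actual system interface.

```python
# Sketch of the database-selection steps above (illustrative only; the
# summary format and usefulness function are assumptions, not the system API).

def select_peers(query_terms, peer_summaries, usefulness, top_k=3):
    """peer_summaries: dict peer_id -> compact summary (e.g. per-term DFs).
    usefulness: function (query_terms, summary) -> score (the 3rd inference)."""
    scored = [(usefulness(query_terms, s), pid) for pid, s in peer_summaries.items()]
    scored.sort(reverse=True)                  # most useful databases first
    return [pid for _, pid in scored[:top_k]]  # route the query to these peers
```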

8
Similarity-based measures
  • Definition: usefulness is the sum of all document similarities exceeding a threshold l
  • Simplest case: the summed weight of the query terms across the collection (see the sketch below)
  • no assumptions about word co-occurrence
  • l = 0
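
A minimal sketch of this simplest measure, assuming the peer summary stores the total (summed) document weight of each term; that summary format is an assumption made for illustration.

```python
# Simplest similarity-based measure (threshold l = 0): usefulness is the
# summed weight of the query terms across the whole collection. The summary
# field "total_weight" is an assumed format, chosen for illustration.

def usefulness_sum(query_terms, summary):
    # summary: dict term -> {"df": ..., "total_weight": ...}
    return sum(summary[t]["total_weight"] for t in query_terms if t in summary)
```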

9
GlOSS
  • High-correlation assumption
  • Sort all n query terms Ti in descending order of their document frequencies (DFs)
  • DF_n documents contain T_n, T_(n-1), ..., T_1
  • DF_(n-1) - DF_n documents contain T_(n-1), T_(n-2), ..., T_1, and so on
  • DF_1 - DF_2 documents contain only T_1
  • Use averaged term weights to calculate document similarity
  • l > 0
  • l is query dependent
  • l is collection dependent
  • usually because local IDFs differ between collections
  • Proposal: use global term importance
  • Usually l is set to 0 in experiments
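
A sketch of a GlOSS-style Max(l) estimate under the high-correlation assumption; the summary fields (per-term DF and average per-document weight avg_w) and the unit query-term weights are assumptions made for illustration.

```python
# GlOSS-style estimate under the high-correlation assumption (sketch).
# Terms are bucketed as on the slide: df_i - df_{i+1} documents are assumed
# to contain exactly the i most frequent query terms.

def gloss_max(query_terms, summary, l=0.0):
    # per-term (DF, average per-document weight), highest DF first
    stats = sorted(((summary[t]["df"], summary[t]["avg_w"])
                    for t in query_terms if t in summary), reverse=True)
    usefulness, sim = 0.0, 0.0
    for i, (df, avg_w) in enumerate(stats):
        sim += avg_w                                   # docs here contain T1..Ti
        next_df = stats[i + 1][0] if i + 1 < len(stats) else 0
        if sim > l:                                    # count only docs above threshold l
            usefulness += (df - next_df) * sim
    return usefulness
```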

10
Problems of similarity-based measures
  • Is this inference good?
  • A few highly scored documents and many low-scored documents are treated as equally useful
  • Proposal: sum only the top K similarities (see the sketch below)
  • Highly scored documents can be a bad indicator of usefulness
  • Most relevant documents have moderate scores
  • Highly scored documents may be non-relevant
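
A sketch of the top-K proposal; the per-document similarity estimates are assumed to come from whichever similarity model the earlier inferences use.

```python
# "Sum only the top-K similarities" proposal (sketch): a mass of weak matches
# can no longer outweigh a handful of strong ones.

import heapq

def usefulness_top_k(similarities, k=10):
    # similarities: estimated per-document scores for one collection
    return sum(heapq.nlargest(k, similarities))
```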

11
Document-frequency-based measures
  • Do not use term frequencies (actual similarities)
  • Exploit document frequencies only
  • Exploit a global measure of term importance (see the sketch below):
  • average IDF
  • ICF (inverse collection frequency)
  • Main assumption: many documents with rare terms
  • carry more meaning for the user
  • most likely contain the other query terms as well
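
A sketch of the two global term-importance notions mentioned above; the smoothing constants (+0.5, +1.0) follow the usual CORI-style convention and are an assumption here.

```python
# Global term-importance notions (sketch): ICF and averaged local IDF.

import math

def icf(num_collections, cf):
    """Inverse collection frequency: terms occurring in few peers score high."""
    return math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)

def avg_idf(local_idfs):
    """Average of the local IDF values reported by the individual peers."""
    return sum(local_idfs) / len(local_idfs)
```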

12
CORI: using TF/IDF-style normalization
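
A sketch of the CORI belief formula in its commonly published form; the constants 50 and 150 and the default belief b = 0.4 are the usual published defaults, assumed here.

```python
# CORI belief for one query term in one collection (sketch of the commonly
# published form). df: documents in this collection containing the term;
# cf: collections containing the term; cw: words in this collection;
# avg_cw: average cw over all collections.

import math

def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)      # TF-like part (DF normalization)
    i = (math.log((num_collections + 0.5) / cf)
         / math.log(num_collections + 1.0))         # IDF-like part (ICF)
    return b + (1.0 - b) * t * i

# A collection's usefulness for a query is then the sum (or average) of these
# beliefs over the query terms.
```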

13
CORI Issues
  • Pure document frequencies make CORI better
  • The less statistics, the simpler:
  • smaller variance
  • better estimates of the ranking, rather than of the actual database summaries
  • No use of document richness
  • To be normalized or not to be?
  • Small databases are not necessarily better
  • A collection may specialize well in several topics

14
Using usefulness measures
Example: query "Information Retrieval", C = 1000
CF(Information) = 120, CF(Retrieval) = 40

           Information (CF = 120)      Retrieval (CF = 40)
           DFmax   avg_tf   DF         DFmax   avg_tf   DF
  Peer1      20      60     12           60       8      5
  Peer2     400       6     60          400       4     10
  Peer3      60      15     20           60      10      5
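
One way to score this example, shown below as a sketch; the plain DF * ICF measure used here and the reading of C = 1000 as the number of collections are assumptions made for illustration.

```python
# Scoring the three peers from the example with a plain DF * ICF measure
# (illustrative choice; C = 1000 is read as the number of collections).

import math

C = 1000
cf = {"information": 120, "retrieval": 40}     # collection frequencies
df = {                                         # per-peer document frequencies
    "Peer1": {"information": 12, "retrieval": 5},
    "Peer2": {"information": 60, "retrieval": 10},
    "Peer3": {"information": 20, "retrieval": 5},
}

def icf(term):
    return math.log((C + 0.5) / cf[term]) / math.log(C + 1.0)

for peer, stats in df.items():
    score = sum(n * icf(t) for t, n in stats.items())
    print(peer, round(score, 2))   # the rarer term "retrieval" counts more per document
```

Under these assumptions Peer2 ranks highest.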
15
Analysis of experiments
  • CORI is the best, but
  • only when choosing more than 50 out of 236 databases
  • only about 10% better when choosing more than 90 databases
  • Test collections are strange:
  • chronologically or even randomly separated documents
  • no topic specificity
  • no actual Web data used
  • no overlap among collections
  • The experiments are unrealistic, so it is unclear
  • which method is better
  • whether there is any satisfactory method at all

16
Possible solutions
  • Most of the measures can be unified in a common framework
  • We can play with it and try:
  • various normalization schemes
  • different notions of term importance (ICF, local IDF)
  • using statistics of the top documents only
  • changing the power of the factors
  • DF·ICF^4 is not worse than CORI (see the sketch below)
  • changing the form of the expression

(Framework diagram: GlOSS and CORI as special cases of the unified framework)
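
A sketch of the unified-framework idea as one parametrised scorer; the parameter names and layout are illustrative assumptions, not taken from the talk.

```python
# Parametrised scoring function: the knobs (exponents, term importance,
# normalisation) select a concrete measure within the framework.

def framework_score(query_terms, dfs, importance, df_power=1.0,
                    imp_power=1.0, normalizer=1.0):
    """dfs: per-term document frequencies for one peer;
    importance: per-term global weights (e.g. ICF or averaged local IDF)."""
    score = sum((dfs.get(t, 0) ** df_power) * (importance.get(t, 0.0) ** imp_power)
                for t in query_terms)
    return score / normalizer

# df_power = 1, imp_power = 4 gives the DF * ICF^4 variant mentioned above.
```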
17
Conclusion
  • What has been done:
  • measures have been analytically evaluated
  • a sensible subset of the measures has been chosen
  • the measures have been implemented
  • What could be done next:
  • carry out new, sensible experiments
  • choose an appropriate usefulness measure
  • experiment with database representatives
  • build our own measure
  • try to exploit collection metadata
  • bookmarks, authoritative documents, collection descriptions

18
  • Thank you for your attention!