Searching Web Better - PowerPoint PPT Presentation

Slides: 40
Provided by: tqz
Title: Searching Web Better


1
Searching Web Better
  • Dr Wilfred Ng
  • Department of Computer Science
  • The Hong Kong University of Science and
    Technology

2
Outline
  • Introduction
  • Main Techniques (RSCF)
  • Clickthrough Data
  • Ranking Support Vector Machine Algorithm
  • Ranking SVM in Co-training Framework
  • The RSCF-based Metasearch Engine
  • Search Engine Components
  • Feature Extraction
  • Experiments
  • Current Development

3
Search Engine Adaptation
[Figure: example domains and query categories: Social Science, Computer Science, Finance, Product, CS terms, News]
  • Search engines: Google, MSNsearch, Wisenut, Overture, …
  • Adapt the search engine by learning from implicit
    feedback: clickthrough data
4
Clickthrough Data
  • Clickthrough data: data that indicates which
    links in the returned ranking results have been
    clicked by the user
  • Formally, a triplet (q, r, c)
  • q: the input query
  • r: the ranking result presented to the user
  • c: the set of links the user clicked on
  • Benefits
  • Can be obtained in a timely manner
  • Requires no intervention in the search activity
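
The triplet (q, r, c) can be sketched as a small record type. This is only an illustration: the class name `Clickthrough`, its field names and the sample links are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Clickthrough:
    """One clickthrough record (q, r, c); names are hypothetical."""
    query: str                 # q: the input query
    ranking: Tuple[str, ...]   # r: the ranking presented to the user
    clicked: FrozenSet[str]    # c: the set of links the user clicked on

    def __post_init__(self):
        # A click can only land on a link that was actually presented.
        missing = self.clicked - set(self.ranking)
        if missing:
            raise ValueError(f"clicked links not in ranking: {sorted(missing)}")


record = Clickthrough(
    query="radix sort",
    ranking=("l1", "l2", "l3"),
    clicked=frozenset({"l1", "l3"}),
)
```

Because the record is built only from what the search engine already observes, logging it needs no extra action from the user.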

5
An Example of Clickthrough Data
[Figure: a user's input query and the ranked list of returned links, with the links clicked by the user marked]
6
Target Ranking (Preference Pairs Set)
7
An Example of Clickthrough Data
[Figure: a user's input query and the returned links, with the clicked links marked; the links are divided into a labelled data set and an unlabelled data set]
8
Target Ranking (Preference Pairs Set)
  • Labelled data set: l1, l2, …, l10
  • Unlabelled data set: l11, l12, …
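
The preference pairs themselves did not survive in this transcript. As a rough sketch, assuming they follow Joachims' heuristic (a clicked link is preferred over every unclicked link ranked above it) and that the user clicked l1, l7 and l10, the pairs could be derived as follows. The function name and the click set are illustrative assumptions.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, other) pairs under Joachims' heuristic (assumed)."""
    pairs = []
    for i, link in enumerate(ranking):
        if link not in clicked:
            continue
        # A clicked link beats every unclicked link shown above it.
        for above in ranking[:i]:
            if above not in clicked:
                pairs.append((link, above))
    return pairs


ranking = [f"l{i}" for i in range(1, 11)]   # l1 .. l10
clicked = {"l1", "l7", "l10"}               # assumed clicks
pairs = preference_pairs(ranking, clicked)
```

Under this assumption l1 produces no pairs, because nothing is ranked above it, and links below the last click (l11, l12, …) take part in no pair, which is consistent with treating them as unlabelled.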

9
The Ranking SVM Algorithm
  • Three links, each described by a feature vector
  • Target ranking: l1 …
  • Weight vector: the ranker
  • Distance between the two closest projected links
[Figure: links l1, l2 and l3 and their projections onto the weight vector]
  • Cons: it needs a large set of labelled data
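
A minimal sketch of the idea, assuming a linear ranker trained by subgradient descent on a pairwise hinge loss. This is a simplification for illustration, not the solver used by the author; the feature values and the target ranking l1, l2, l3 are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_ranking_svm(features, preferences, lam=0.01, lr=0.1, epochs=200):
    """features: link -> feature vector; preferences: (better, worse) pairs."""
    dim = len(next(iter(features.values())))
    # Each preference becomes a difference vector that should score >= 1.
    diffs = [[a - b for a, b in zip(features[hi], features[lo])]
             for hi, lo in preferences]
    w = [0.0] * dim
    for _ in range(epochs):
        # Subgradient of lam/2 * ||w||^2 plus the sum of hinge losses.
        grad = [lam * wi for wi in w]
        for d in diffs:
            if dot(w, d) < 1.0:
                grad = [g - di for g, di in zip(grad, d)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical feature vectors for three links.
features = {"l1": [3.0, 1.0], "l2": [2.0, 2.0], "l3": [1.0, 0.5]}
# Assumed target ranking: l1 before l2 before l3.
preferences = [("l1", "l2"), ("l2", "l3"), ("l1", "l3")]

w = train_ranking_svm(features, preferences)
ranked = sorted(features, key=lambda link: -dot(w, features[link]))
```

Projecting each link onto `w` and sorting by the projection reproduces the target order here. Every training pair has to come from labelled data, which is the drawback noted on the slide.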
10
The Ranking SVM in Co-training Framework
  • Divide the feature vector into two subvectors
  • Two rankers are built over these two feature
    subvectors
  • Each ranker chooses several unlabelled
    preference pairs and adds them to the labelled
    data set
  • Rebuild each ranker from the augmented labelled
    data set

[Figure: the labelled preference feedback pairs P_l are used to train Ranker a_A and Ranker a_B; each ranker selects confident pairs from the unlabelled preference pairs P_u and adds them to P_l as augmented pairs]
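
The loop described on this slide can be sketched as below. Several things are assumptions made for the sake of a short, runnable example: each pair is stored as a difference vector (preferred minus other), a margin perceptron stands in for the Ranking SVM, confidence is the absolute score, each ranker adds one pair per round, and the data are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def view(d, idx):
    """Restrict a difference vector to one feature subvector."""
    return [d[i] for i in idx]


def train(diffs, epochs=50, lr=0.1):
    """Margin perceptron on difference vectors (stand-in for Ranking SVM)."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        for d in diffs:
            if dot(w, d) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


def co_train(labelled, unlabelled, view_a, view_b, rounds=5):
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        # Build one ranker per feature subvector from the labelled pairs.
        rankers = [(idx, train([view(d, idx) for d in labelled]))
                   for idx in (view_a, view_b)]
        for idx, w in rankers:
            if not pool:
                break
            # Pick the unlabelled pair this ranker is most confident about.
            best = max(pool, key=lambda d: abs(dot(w, view(d, idx))))
            pool.remove(best)
            if dot(w, view(best, idx)) < 0:
                best = [-x for x in best]   # orient as "preferred minus other"
            labelled.append(best)
    # Final ranker on the augmented labelled set, using all features.
    return labelled, train(labelled)


labelled = [[1.0, 0.5, 1.0, 0.5], [0.5, 1.0, 0.5, 1.0]]
unlabelled = [[2.0, 1.0, 1.0, 2.0], [-1.0, -2.0, -2.0, -1.0]]
augmented, w_final = co_train(labelled, unlabelled, view_a=[0, 1], view_b=[2, 3])
```

In this toy run both unlabelled pairs end up in the labelled set with a consistent orientation, and the final ranker is trained on the augmented set.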
11
Some Issues
  • Guideline for partitioning the feature vector
  • After the partition, each subvector must be
    sufficient for the later ranking
  • Number of rankers
  • Depends on the number of features
  • When to terminate the procedure?
  • Prediction difference indicates the ranking
    difference between the two rankers
  • After termination, get a final ranker on the
    augmented labelled data set

12
Metasearch Engine
  • Receives query from user
  • Sends query to multiple search engines
  • Combines the retrieved results from the
    underlying search engines
  • Presents a unified ranking result to user

[Figure: the user's query goes to the metasearch engine, which sends it to Search Engine 1, Search Engine 2, …, Search Engine n; Retrieved Results 1, 2, …, n are merged into the Unified Ranking Result]
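
The four steps can be sketched as follows. The engines here are stub functions and the default scoring rule (sum of reciprocal ranks) is only a placeholder assumption; in the RSCF-based engine the unified ranking would come from the learned ranker instead.

```python
from typing import Callable, Dict, List

Engine = Callable[[str], List[str]]   # query -> ranked list of URLs


def metasearch(query: str, engines: Dict[str, Engine], score=None) -> List[str]:
    # Send the query to every engine and record each link's rank per engine.
    ranks: Dict[str, Dict[str, int]] = {}
    for name, engine in engines.items():
        for pos, url in enumerate(engine(query), start=1):
            ranks.setdefault(url, {})[name] = pos
    if score is None:
        # Placeholder scoring: sum of reciprocal ranks (an assumption).
        score = lambda r: sum(1.0 / p for p in r.values())
    # Present one unified ranking; ties are broken alphabetically.
    return sorted(ranks, key=lambda u: (-score(ranks[u]), u))


# Stub engines standing in for real search back ends.
engines = {
    "engine_1": lambda q: ["a", "b", "c"],
    "engine_2": lambda q: ["b", "d", "a"],
}
unified = metasearch("radix sort", engines)
```

Passing a different `score` function is the hook where a trained ranker over per-engine rank features could be plugged in.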
13
Search Engine Components
  • Powered by Inktomi, relatively mature
  • One of the most powerful search engines nowadays
  • A new but growing search engine
  • Ranks links based on the prices paid by the
    sponsors on the links

14
Feature Extraction
  • Ranking Features (12 binary features)
  • Rank(E, T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10}
  • (M: MSNsearch, W: Wisenut, O: Overture)
  • Indicate the ranking of the links in each
    underlying search engine
  • Similarity Features (4 features)
  • Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  • URL, Title, Abstract Cover, Abstract Group
  • Indicate the similarity between the query and the
    link

15
Experiments
  • Experiment data: within the same domain
    (computer science)
  • Objectives
  • Offline experiments: compared with RSVM
  • Online experiments: compared with Google

16
Prediction Error
  • Prediction error: the difference between the
    ranker's ranking and the target ranking
  • Target ranking: l1 …
  • Projected ranking: l2 …
  • Prediction error: 33%

[Figure: links l1, l2 and l3 in the target ranking and in the projected ranking]
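
One way to make this measure concrete, assuming prediction error is the fraction of link pairs that the ranker orders differently from the target ranking. The two example rankings are an assumption, chosen because one swapped pair out of three gives the 33 figure on the slide.

```python
from itertools import combinations


def prediction_error(target, projected):
    """Fraction of pairs in `target` whose order is reversed in `projected`."""
    pos = {link: i for i, link in enumerate(projected)}
    pairs = list(combinations(target, 2))   # (a, b): a is ranked before b
    wrong = sum(1 for a, b in pairs if pos[a] > pos[b])
    return wrong / len(pairs)


target = ["l1", "l2", "l3"]      # assumed target ranking
projected = ["l2", "l1", "l3"]   # assumed ranking produced by the ranker
error = prediction_error(target, projected)
```

Only the pair (l1, l2) is reversed, so the error is one third.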
17
Offline Experiment (Compared with RSVM)
[Figure: prediction error curves in three panels: 10 queries, 30 queries, 60 queries]
  • The ranker trained by the RSVM algorithm on the
    whole feature vector
  • The ranker trained by the RSCF algorithm on one
    feature subvector
  • The ranker trained by the RSCF algorithm on the
    other feature subvector
  • The prediction error rises again! The number of
    iterations in the RSCF algorithm is about four to
    five!
18
Offline Experiment (Compared with RSVM)
[Figure: overall comparison of the ranker trained by the RSVM algorithm and the final ranker trained by the RSCF algorithm]
19
Online Experiment (Compared with Google)
  • Experiment data: CS terms
  • e.g. radix sort, TREC collection, …
  • Experiment setup
  • Combine the results returned by RSCF and those
    returned by Google into one shuffled list
  • Present them to the users in a unified way
  • Record the users' clicks
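
A sketch of such a blind comparison, under the assumption that duplicate links are merged and that a click on a link returned by both systems is credited to both. The function names and sample URLs are hypothetical.

```python
import random


def blind_merge(rscf_results, google_results, seed=None):
    """Merge two result lists into one shuffled list, remembering the sources."""
    source = {}
    for url in rscf_results:
        source.setdefault(url, set()).add("RSCF")
    for url in google_results:
        source.setdefault(url, set()).add("Google")
    merged = list(source)
    random.Random(seed).shuffle(merged)   # hide which system returned what
    return merged, source


def credit_clicks(clicked, source):
    """Count clicks per system; shared links credit both systems."""
    counts = {"RSCF": 0, "Google": 0}
    for url in clicked:
        for system in source[url]:
            counts[system] += 1
    return counts


merged, source = blind_merge(["u1", "u2", "u3"], ["u3", "u4"], seed=7)
counts = credit_clicks(["u1", "u3"], source)
```

Shuffling removes position as a cue, so the click counts reflect the links themselves rather than which system ranked them first.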

20
Experimental Analysis
21
Experimental Analysis
22
Experimental Analysis
23
Conclusion on RSCF
  • Search engine adaptation
  • The RSCF algorithm
  • Train on clickthrough data
  • Apply RSVM in the co-training framework
  • The RSCF-based metasearch engine
  • Offline experiments: better than RSVM
  • Online experiments: better than Google

24
Current Development
  • Feature extraction and division
  • Application in different domains
  • Search engine personalization
  • SpyNoby Project: personalized search engine with
    clickthrough analysis

25
Modified Target Ranking for Metasearch Engines
  • If l1 and l7 are from the same underlying search
    engine, the preference pairs set arising from l1
    should be
  • l1
  • Advantages
  • Alleviate the penalty on high-ranked links
  • Give more credit to the ranking ability of the
    underlying search engines

26
Modified Target Ranking
  • Labelled data set: l1, l2, …, l10
  • Unlabelled data set: l11, l12, …

27
RSCF-based Metasearch Engine - MEA
[Figure: the user's query q goes to MEA, which sends q to three underlying search engines; each returns a ranked list of links numbered up to 30, and the lists are merged into the Unified Ranking Result]
28
RSCF-based Metasearch Engine - MEB
[Figure: the user's query q goes to MEB, which sends q to four underlying search engines; each returns a ranked list of links numbered up to 30, and the lists are merged into the Unified Ranking Result]
29
Generating Clickthrough Data
  • Probability of being clicked on, following Zipf's
    law: Pr(k) = 1 / (k^α · H_n(α))
  • k: the ranking of the link in the metasearch
    engine
  • n: the number of all the links in the metasearch
    engine
  • α: the skewness parameter in Zipf's law
  • H_n(α): the harmonic number, H_n(α) = Σ_{i=1..n} 1 / i^α
  • Judge the links' relevance manually
  • If the link is irrelevant → it is not clicked on
  • If the link is relevant → it is clicked on with
    probability Pr(k)
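
A minimal simulation of this procedure, assuming the standard Zipf form Pr(k) = 1 / (k^α · H_n(α)) with H_n(α) the generalized harmonic number. The symbol α, the default α = 1 and the function names are assumptions, since the formula itself is missing from the transcript.

```python
import random


def zipf_click_prob(k, n, alpha=1.0):
    """Probability that the link at rank k (1-based) of n links is clicked."""
    harmonic = sum(1.0 / i ** alpha for i in range(1, n + 1))
    return 1.0 / (k ** alpha * harmonic)


def simulate_clicks(relevant, alpha=1.0, rng=None):
    """relevant[k-1] is the manual relevance judgement for rank k."""
    rng = rng or random.Random()
    n = len(relevant)
    return [k for k, is_relevant in enumerate(relevant, start=1)
            # Irrelevant links are never clicked; relevant ones with Pr(k).
            if is_relevant and rng.random() < zipf_click_prob(k, n, alpha)]


judgements = [True, False, True, True, False]
clicks = simulate_clicks(judgements, rng=random.Random(42))
```

The probabilities over k = 1, …, n sum to 1 and decrease with rank, so highly ranked relevant links are the most likely to be clicked.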

30
Feature Extraction
  • Ranking Features (binary features)
  • Rank(E, T): whether the link is ranked within S_T
    in E
  • where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
  • S_1 = {1}, S_3 = {2, 3}, S_5 = {4, 5}, S_10 = {6, 7, 8, 9, 10}, …
  • (G: Google, M: MSNsearch, W: Wisenut, O:
    Overture)
  • Indicate the ranking of the links in each
    underlying search engine
  • Similarity Features (4 features)
  • Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  • Measure the similarity between the query and the
    link
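
The ranking features can be sketched as below, assuming each S_T covers the ranks after the previous threshold up to T (so S_15 = {11, …, 15} and so on, extrapolating from the sets listed on the slide). The function name and the input format are hypothetical.

```python
ENGINES = ["G", "M", "W", "O"]              # Google, MSNsearch, Wisenut, Overture
THRESHOLDS = [1, 3, 5, 10, 15, 20, 25, 30]


def rank_features(ranks):
    """ranks: engine letter -> rank of the link in that engine (absent if not returned)."""
    features = {}
    for engine in ENGINES:
        rank = ranks.get(engine)
        lower = 0
        for t in THRESHOLDS:
            # Rank(E, T) = 1 iff the rank falls in S_T = {lower + 1, ..., T} (assumed).
            features[f"Rank({engine},{t})"] = int(rank is not None and lower < rank <= t)
            lower = t
    return features


# A link ranked 2nd by Google and 7th by MSNsearch, not returned by the others.
features = rank_features({"G": 2, "M": 7})
```

With four engines and eight thresholds this yields 32 binary features, and at most one feature per engine is set for any link.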

31
Experiments
  • Experiment data: three different domains
  • CS terms
  • News
  • E-shopping
  • Objectives
  • Prediction error: better than RSVM
  • Top-k precision: adaptation ability

32
Top-k Precision
  • Advantages
  • Precision is easier to obtain than recall
  • Users care only about the top-k links (k10)
  • Evaluation data: 30 queries in each domain
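
Top-k precision is the share of the first k returned links that are relevant. A short sketch, with the function name and sample data as assumptions:

```python
def top_k_precision(ranked, relevant, k=10):
    """Fraction of the top-k links in `ranked` that are in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for link in ranked[:k] if link in relevant)
    return hits / k


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = top_k_precision(ranked, relevant, k=2)
```

Unlike recall, this needs relevance judgements only for the k links actually shown, not for every relevant page on the Web.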

33
Comparison of Top-k precision
[Figure: top-k precision in three panels: News, CS terms, E-shopping]
34
Statistical Analysis
  • Hypothesis Testing
  • (two-sample hypothesis testing about means)
  • used to analyze whether there is a statistically
    significant difference between two means of two
    samples
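
The slide does not say which two-sample test was used. As one possible sketch, Welch's t statistic for the difference of two means can be computed with the standard library alone; the sample values are invented.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)   # squared standard error of mean A
    vb = variance(sample_b) / len(sample_b)   # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df = welch_t(a, b)
```

The difference between the means is called significant when |t| exceeds the critical value of the t distribution for `df` degrees of freedom at the chosen level.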

35
Comparison Results
  • MEA can produce better search quality than Google
  • Google does not excel in every query category
  • MEA and MEB are able to adapt to bring out the
    strengths of each underlying search engine
  • MEA and MEB are better than, or comparable to, all
    their underlying search engine components in
    every query category
  • The RSCF-based metasearch engine
  • Comparison of prediction error: better than RSVM
  • Comparison of top-k precision: adaptation
    ability

36
Spy Naïve Bayes Motivation
  • The problem with Joachims' method
  • Strong assumptions
  • Excessively penalizes high-ranked links
  • l1, l2, l3 are apt to appear on the right,
    while l7, l10 appear on the left
  • New interpretation of clickthrough data
  • Clicked: positive (P)
  • Unclicked: unlabeled (U), containing both
    positive and negative samples
  • Goal: identify Reliable Negatives (RN) from U

37
Spy Naïve Bayes Ideas
  • Standard naïve Bayes: classifies positive and
    negative samples
  • One-step spy naïve Bayes: spying out RN from U
  • Put a small number of positive samples into U to
    act as spies (to scout the behavior of real
    positive samples in U)
  • Take U as negative samples to train a naïve Bayes
    classifier
  • Samples with lower probabilities of being positive
    are assigned to RN
  • Voting procedure: makes spying more robust
  • Run one-step SpyNB n times and get n sets
    RN_i
  • A sample that appears in at least m (m ≤ n) of the
    sets RN_i will appear in the final RN
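
The spying and voting steps can be sketched as follows. Several details are assumptions rather than the author's method: samples are bags of tokens, the classifier is a multinomial naïve Bayes with add-one smoothing, and the RN threshold is the lowest positive probability observed among the spies.

```python
import math
import random
from collections import Counter


def train_nb(pos, neg):
    """Multinomial naive Bayes with add-one smoothing over token lists."""
    vocab = {t for doc in pos + neg for t in doc}
    total = len(pos) + len(neg)
    model = {}
    for label, docs in (("pos", pos), ("neg", neg)):
        counts = Counter(t for doc in docs for t in doc)
        denom = sum(counts.values()) + len(vocab)
        model[label] = (math.log(len(docs) / total),
                        {t: math.log((counts[t] + 1) / denom) for t in vocab})
    return model


def prob_pos(model, doc):
    scores = {label: prior + sum(loglik[t] for t in doc if t in loglik)
              for label, (prior, loglik) in model.items()}
    top = max(scores.values())
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    return exp["pos"] / (exp["pos"] + exp["neg"])


def one_step_spy(P, U, n_spies, rng):
    spy_idx = set(rng.sample(range(len(P)), n_spies))
    spies = [p for i, p in enumerate(P) if i in spy_idx]
    rest = [p for i, p in enumerate(P) if i not in spy_idx]
    # Treat U plus the spies as negative, the remaining positives as positive.
    model = train_nb(rest, U + spies)
    threshold = min(prob_pos(model, s) for s in spies)   # assumed threshold rule
    return {i for i, u in enumerate(U) if prob_pos(model, u) < threshold}


def spy_nb(P, U, n=5, m=3, spy_ratio=0.2, seed=0):
    """Return indices of U voted into RN by at least m of n runs."""
    rng = random.Random(seed)
    n_spies = max(1, int(spy_ratio * len(P)))   # P needs at least 2 samples
    votes = Counter()
    for _ in range(n):
        votes.update(one_step_spy(P, U, n_spies, rng))
    return {i for i, v in votes.items() if v >= m}


P = [["python", "code"]] * 5
U = [["python", "code"], ["python", "code"],
     ["cooking", "recipe"], ["cooking", "recipe"], ["cooking", "recipe"]]
reliable_negatives = spy_nb(P, U)
```

In this toy data the unlabeled samples that look like the positives score as high as the spies and stay out of RN, while the remaining three are voted in on every run.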

38
http://dleecpu1.cs.ust.hk:8080/SpyNoby/
39
My publications
  • Wilfred NG. Book Review: An Introduction to
    Search Engines and Web Navigation. An
    International Journal of Information Processing
    & Management, 43(1), pp. 290-292, (2007).
  • Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out
    Real User Preferences in Web Searching. Accepted
    and to appear in ACM Transactions on Internet
    Technology, (2006).
  • Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE.
    Web Dynamics and their Ramifications for the
    Development of Web Search Engines. Accepted and
    to appear in Computer Networks Journal, Special
    Issue on Web Dynamics, (2005).
  • Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A
    Web Usage Manipulation Language for Querying Web
    Log Data. International Conference on Conceptual
    Modeling ER 2004, Lecture Notes in Computer
    Science Vol. 3288, Shanghai, China, pp. 567-581,
    (2004).
  • Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred
    NG and Dik-Lun LEE. Spying Out Real User
    Preferences for Metasearch Engine Personalization.
    ACM Proceedings of WEBKDD Workshop on Web Mining
    and Web Usage Analysis 2004, Seattle, USA, (2004).
  • Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and
    Dik-Lun LEE. Applying Co-training to Clickthrough
    Data for Search Engine Adaptation. 9th
    International Conference on Database Systems for
    Advanced Applications DASFAA 2004, Lecture Notes
    in Computer Science Vol. 2973, Jeju Island,
    Korea, pp. 519-532, (2004).
  • Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred
    NG, Wei WANG and Baile SHI. Refining Web
    Authoritative Resource by Frequent Structures.
    IEEE Proceedings of the International Database
    Engineering and Applications Symposium IDEAS
    2003, Hong Kong, pp. 236-241, (2003).
  • Wilfred NG. Capturing the Semantics of Web Log
    Data by Navigation Matrices. A book chapter in
    "Semantic Issues in E-Commerce Systems", edited
    by R. Meersman, K. Aberer and T. Dillon, Kluwer
    Academic Publishers, pp. 155-170, (2003).