Entropy-biased Models for Query Representation on the Click Graph

1
Entropy-biased Models for Query Representation on
the Click Graph
  • Hongbo Deng, Irwin King and Michael R. Lyu
  • Department of Computer Science and Engineering
  • The Chinese University of Hong Kong
  • July 21st, 2009

2
Introduction
  • Query log analysis: improve search engines'
    capabilities
  • Query suggestion
  • Query classification
  • Targeted advertising
  • Ranking

3
Introduction
  • Click graph: an important technique
  • A bipartite graph between queries and URLs
  • Edges connect a query with the URLs clicked for it
  • Captures some semantic relations, e.g., between
    "map" and "travel"
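As a minimal sketch of the bipartite structure described above, the click graph can be stored as a nested mapping from each query to the URLs clicked for it. The function name build_click_graph and the example records are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

def build_click_graph(records):
    """Return {query: {url: click_count}} from (query, url) click records."""
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in records:
        graph[query][url] += 1
    # Convert to plain dicts so missing keys raise instead of being created.
    return {q: dict(urls) for q, urls in graph.items()}

# Hypothetical click records, invented for illustration only.
clicks = [
    ("map", "maps.example.com"),
    ("map", "maps.example.com"),
    ("map", "travel.example.com"),
    ("travel", "travel.example.com"),
]
click_graph = build_click_graph(clicks)
```

Here "map" and "travel" share an edge to the same URL, which is the kind of co-click that lets the graph relate the two queries.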

How to utilize and model the click graph to
represent queries?
Traditional model: based on the raw click
frequency (CF)
  • Robustness: some queries with skewed click
    counts may dominate the click graph
  • Spam: raw CF can be easily manipulated

Propose an entropy-biased framework
4
Motivation
Is a single click on different URLs equally
important?
  • Basic idea
  • Various query-URL pairs should be treated
    differently
  • Intuition
  • Common clicks on less frequent but more specific
    URLs are of greater value than common clicks on
    frequent and general URLs

5
Outline
  • Introduction
  • Related Work
  • Methodology
  • Preliminaries
  • Click Frequency Model
  • Entropy-biased Model
  • Experiments
  • Conclusion

6
Related Work
  • Using click graph
  • Query clustering (Beeferman and Berger, KDD'00;
    Wen et al., WWW'01)
  • Random walks for relevance rank in image search
    (Craswell and Szummer, SIGIR'05)
  • Query suggestion by computing the hitting time on
    a click graph (Mei et al., CIKM'08)
  • Query classification from regularized click graph
    (Li et al., SIGIR'08)

These methods are proposed based on the click
graph, while our objective is to investigate a
better model to utilize and represent the click
graph.
7
Related Work
  • Modeling the representation
  • Use the content of clicked Web pages to define a
    term-weight vector model for a query (Baeza-Yates
    et al., 2004)
  • Represent a query as a vector of documents (URLs)
    without considering the content information
    (Baeza-Yates and Tiberi, KDD'07)
  • Propose the query-set document model to represent
    documents by mining frequent query patterns
    rather than the content information of the
    documents (Poblete et al., WWW'08)

These existing methods do not distinguish the
differences among query-URL pairs
8
Related Work
  • For personalization
  • Explore click entropy to measure the variability
    in click results (Dou et al., WWW'07)
  • Propose result entropy to capture how often
    results change (Teevan et al., SIGIR'08)

These methods focus on personalization for
different queries, while our entropy-biased
models focus on the weighting scheme of
various query-URL pairs
10
Preliminaries
Query instance
Query
URL
User
11
Traditional Click Frequency Model
  • Edges of click graph
  • Weighted by the raw click frequency (CF)
  • Transition probability
  • Normalize CF

From query to URL
From URL to query
Based on the transition probabilities, each query
and each document (URL) can be represented by its
vector of transition probabilities.
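The normalization step can be sketched as follows, assuming the usual reading of "normalize CF": divide each edge weight by the total weight of its query (query to URL) or of its URL (URL to query). The function name and the counts are hypothetical, and the slide formulas themselves are not preserved in this transcript.

```python
def transition_probabilities(weights):
    """weights: {query: {url: weight}} -> (P(url | query), P(query | url))."""
    q_to_d = {}
    d_totals = {}
    for q, urls in weights.items():
        total = sum(urls.values())
        q_to_d[q] = {d: w / total for d, w in urls.items()}
        for d, w in urls.items():
            d_totals[d] = d_totals.get(d, 0) + w
    d_to_q = {}
    for q, urls in weights.items():
        for d, w in urls.items():
            d_to_q.setdefault(d, {})[q] = w / d_totals[d]
    return q_to_d, d_to_q

# Hypothetical raw click frequencies (CF), invented for illustration.
cf = {"yahoo": {"d1": 8}, "map": {"d1": 2, "d2": 2}, "travel": {"d2": 6}}
p_d_given_q, p_q_given_d = transition_probabilities(cf)
```

Each row of p_d_given_q is then the vector representation of a query, and each row of p_q_given_d is the vector representation of a URL.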
12
Traditional Click Frequency Model
  • Measure the similarity between queries
  • The most similar query
  • q2 (map) → q1 (Yahoo)
  • More reasonable
  • q2 (map) → q3 (travel)

Cosine similarity
The CF model only considers the raw click
frequency, and treats different query-URL pairs
equally, even if some URLs are heavily clicked.
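The effect can be illustrated with a small sketch. The click counts below are hypothetical (the figure from the slide is not preserved); d1 stands for a heavily clicked general URL. Cosine similarity is scale-invariant, so raw counts give the same value as the normalized transition probabilities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical raw click frequencies, invented for illustration.
cf = {
    "yahoo": {"d1": 9},
    "map": {"d1": 4, "d2": 3},
    "travel": {"d2": 6, "d3": 4},
}
sim_yahoo = cosine(cf["map"], cf["yahoo"])    # shared general URL d1
sim_travel = cosine(cf["map"], cf["travel"])  # shared specific URL d2
```

With these numbers the CF model ranks "yahoo" above "travel" as the neighbor of "map", purely because of the clicks on the general URL.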
13
Methodology
Traditional click frequency model
Entropy-biased models
14
Entropy-biased Model
It would be more reasonable to weight these two
edges differently because of the variation of the
connected URLs.
  • A more general and highly ranked URL
  • Connects with more queries
  • Increases the ambiguity and uncertainty
  • The entropy of a URL
  • Suppose
  • Tends to be proportional to n(dj)
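A rough sketch of the entropy of a URL, assuming it is the Shannon entropy of the distribution P(q | d) over the queries connected to that URL. In the special case where the clicks are spread uniformly over n queries, the entropy equals log n, which is one way to read the claim that it grows with n(dj). The function name and the counts are hypothetical.

```python
import math

def url_entropy(clicks_per_query):
    """Shannon entropy of P(q | d) for one URL, given its clicks per query."""
    total = sum(clicks_per_query.values())
    probs = [c / total for c in clicks_per_query.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical URLs: one general (four queries), one specific (two queries).
general_url = {"yahoo": 5, "map": 5, "news": 5, "mail": 5}
specific_url = {"map": 5, "travel": 5}
```

Under the uniform assumption the general URL has entropy log 4 and the specific one log 2, so the URL connected to more queries is the more uncertain one.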

15
Entropy and Discriminative Ability
  • As entropy increases, discriminative ability
    decreases
  • The two are inversely related
  • A URL with a high query frequency is less
    discriminative overall
  • Inverse query frequency
  • Measure the discriminative ability of the URL
  • Benefits
  • Constrain the influence of some heavily clicked
    URLs
  • Balance the inherent bias of clicks toward
    highly ranked URLs
  • Can be combined with other factors to tune the
    model
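The inverse query frequency can be sketched by analogy with IDF. The formula is not preserved in this transcript, so the form log(|Q| / n(d)) used below is an assumption, where |Q| is the number of distinct queries and n(d) is the number of queries connected to URL d. The function name and the toy graph are hypothetical.

```python
import math

def inverse_query_frequency(click_graph):
    """Return {url: log(|Q| / n(url))} for a {query: {url: count}} graph."""
    num_queries = len(click_graph)
    query_freq = {}
    for urls in click_graph.values():
        for d in urls:
            query_freq[d] = query_freq.get(d, 0) + 1
    return {d: math.log(num_queries / n) for d, n in query_freq.items()}

# Hypothetical click graph: d1 is a general URL reached from four queries.
cf = {
    "yahoo": {"d1": 9},
    "map": {"d1": 4, "d2": 3},
    "travel": {"d2": 6, "d3": 4},
    "news": {"d1": 5},
    "mail": {"d1": 7},
}
iqf = inverse_query_frequency(cf)
```

The general URL d1 receives the smallest IQF and the URL reached from a single query receives the largest, which matches the intended notion of discriminative ability.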

16
CF-IQF Model
  • Incorporate the IQF with the click frequency
  • A high click frequency
  • A low query frequency
  • A is weighted higher than B

17
CF-IQF Model
  • Transition probability

The most similar query: q2 (map) → q3 (travel),
rather than q2 (map) → q1 (Yahoo)
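A small sketch of how CF-IQF weighting can change the nearest neighbor of a query. It assumes the edge weight is the click frequency multiplied by an IDF-style factor log(|Q| / n(d)); both that form and the counts below are assumptions, not values from the talk. Normalizing each query's weights to sum to one would give the transition probabilities, and cosine similarity is unaffected by that scaling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cf_iqf_weights(cf):
    """Weight each edge by click frequency times log(|Q| / n(url))."""
    num_queries = len(cf)
    n = {}
    for urls in cf.values():
        for d in urls:
            n[d] = n.get(d, 0) + 1
    return {
        q: {d: c * math.log(num_queries / n[d]) for d, c in urls.items()}
        for q, urls in cf.items()
    }

# Hypothetical click counts; d1 is a general URL clicked from four queries.
cf = {
    "yahoo": {"d1": 9},
    "map": {"d1": 4, "d2": 3},
    "travel": {"d2": 6, "d3": 4},
    "news": {"d1": 5},
    "mail": {"d1": 7},
}
weights = cf_iqf_weights(cf)
sim_yahoo = cosine(weights["map"], weights["yahoo"])
sim_travel = cosine(weights["map"], weights["travel"])
```

On this toy graph the raw counts place "yahoo" closest to "map", while the CF-IQF weights place "travel" closest, because the shared general URL is down-weighted.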
18
UF Model and UF-IQF Model
  • Drawback of CF model
  • Prone to spam by some malicious clicks (if a
    single user clicks on a certain URL thousands of
    times)
  • UF model
  • Weight by user frequency instead of click
    frequency
  • Improve the resistance against malicious clicks
  • UF-IQF model
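The difference between click frequency and user frequency can be sketched from raw log records. The record layout (user, query, url), the function name, and the log below are hypothetical.

```python
from collections import defaultdict

def frequency_models(records):
    """records: iterable of (user, query, url). Returns (cf, uf) nested dicts."""
    cf = defaultdict(lambda: defaultdict(int))
    users = defaultdict(lambda: defaultdict(set))
    for user, query, url in records:
        cf[query][url] += 1          # every click counts
        users[query][url].add(user)  # each user counts once per pair
    uf = {q: {d: len(s) for d, s in urls.items()} for q, urls in users.items()}
    return {q: dict(urls) for q, urls in cf.items()}, uf

# Hypothetical log: one user clicks the same query-URL pair 1000 times.
log = [("spammer", "map", "spam.example.com")] * 1000
log += [("u1", "map", "maps.example.com"), ("u2", "map", "maps.example.com")]
cf, uf = frequency_models(log)
```

Under CF the spammed pair has weight 1000, while under UF it has weight 1. A UF-IQF weight would then multiply the UF value by the URL's inverse query frequency, in the same way as CF-IQF.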

19
Connection with TF-IDF
  • TF-IDF has been extensively and successfully used
    in the vector space model for text retrieval
  • Several researchers have tried to interpret IDF
    based on binary independence retrieval (BIR),
    Poisson models, information entropy, and language
    models (LM)
  • TF-IDF has not previously been applied to
    bipartite graphs, and the IQF is new. The CF-IQF
    is a simplified version of the entropy-biased model
  • The entropy-biased model is employed to identify
    the edge weighting of the click graph, which can
    be applied to other bipartite graphs

20
Mining Query Log on Click Graph
Query-to-query similarity
Models
Query clustering
Query suggestion
21
Similarity Measurement
  • Cosine similarity
  • Jaccard coefficient
  • The similarity results are reported and analyzed
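The transcript does not say which form of the Jaccard coefficient is used on weighted vectors. The sketch below assumes the extended (Tanimoto) form, which reduces to the set-based Jaccard coefficient when all weights are 0 or 1. The function name and the example vectors are hypothetical.

```python
def extended_jaccard(u, v):
    """Tanimoto coefficient: u.v / (|u|^2 + |v|^2 - u.v) for sparse dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    denom = sum(w * w for w in u.values()) + sum(w * w for w in v.values()) - dot
    return dot / denom if denom else 0.0

# Hypothetical binary vectors: the result matches |A & B| / |A | B| = 1/3.
a = {"d1": 1, "d2": 1}
b = {"d2": 1, "d3": 1}
similarity = extended_jaccard(a, b)
```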

22
Graph-based Random Walk
  • Query-to-query graph
  • The transition probability from qi to qj
  • The personalized PageRank
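A minimal sketch of the random walk, under two assumptions that are not confirmed by the transcript: the query-to-query transition probability is the two-step walk through shared URLs, P(qj | qi) = sum over k of P(dk | qi) P(qj | dk), and the personalized PageRank restarts at the seed query with probability alpha. The value of alpha, the iteration count, and the toy probabilities are hypothetical.

```python
def query_transition(p_d_given_q, p_q_given_d):
    """Two-step transition: query -> URL -> query."""
    trans = {}
    for qi, urls in p_d_given_q.items():
        row = {}
        for d, p_d in urls.items():
            for qj, p_q in p_q_given_d[d].items():
                row[qj] = row.get(qj, 0.0) + p_d * p_q
        trans[qi] = row
    return trans

def personalized_pagerank(trans, seed, alpha=0.15, iterations=50):
    """Iterate r = alpha * e_seed + (1 - alpha) * P^T r."""
    rank = {q: 0.0 for q in trans}
    rank[seed] = 1.0
    for _ in range(iterations):
        nxt = {q: 0.0 for q in trans}
        for qi, row in trans.items():
            for qj, p in row.items():
                nxt[qj] += (1 - alpha) * rank[qi] * p
        nxt[seed] += alpha  # restart mass returns to the seed query
        rank = nxt
    return rank

# Hypothetical transition probabilities on a three-query, two-URL graph.
p_d_given_q = {
    "map": {"d1": 0.5, "d2": 0.5},
    "travel": {"d2": 1.0},
    "yahoo": {"d1": 1.0},
}
p_q_given_d = {
    "d1": {"map": 0.5, "yahoo": 0.5},
    "d2": {"map": 0.5, "travel": 0.5},
}
trans = query_transition(p_d_given_q, p_q_given_d)
rank = personalized_pagerank(trans, seed="map")
```

Sorting the other queries by their rank value would give a suggestion list for the seed query. Swapping in CF-IQF transition probabilities changes only the inputs, not the walk itself.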

24
Experimental Evaluation
  • Data collection
  • AOL query log data
  • Cleaning the data
  • Removing the queries that appear fewer than 2
    times
  • Combining the near-duplicate queries
  • 883,913 queries and 967,174 URLs
  • 4,900,387 edges
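The cleaning steps can be sketched as follows. The talk does not say how near-duplicate queries are detected, so the normalize rule below (lowercasing and collapsing whitespace) is only a stand-in assumption, and the sample queries are invented.

```python
from collections import Counter

def normalize(query):
    """Assumed near-duplicate rule: lowercase and collapse whitespace."""
    return " ".join(query.lower().split())

def clean_queries(raw_queries, min_count=2):
    """Drop queries seen fewer than min_count times, then merge near-duplicates."""
    raw_counts = Counter(raw_queries)
    merged = Counter()
    for query, count in raw_counts.items():
        if count >= min_count:
            merged[normalize(query)] += count
    return dict(merged)

# Hypothetical raw query strings.
raw = [
    "map", "map", "Map", "Map",
    "american airline", "american airline",
    "rare query",
]
kept = clean_queries(raw)
```

The order follows the slide (filter first, then merge). Merging before filtering would keep more queries, so the order is a design choice worth stating when reproducing the dataset sizes.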

25
Distributions
26
Evaluation: ODP Similarity
  • A simple measure of similarity among queries
    using ODP categories (query → category)
  • Definition
  • Example
  • Q1: "United States" → Regional > North America >
    United States
  • Q2: "National Parks" → Regional > North America
    > United States > Travel and Tourism > National
    Parks and Monuments
  • Similarity: 3/5
  • Precision at rank n (P@n)
  • 300 distinct queries

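The definition itself is not preserved in this transcript. The sketch below assumes the similarity is the length of the shared category prefix divided by the length of the longer path, which is consistent with the 3/5 in the example. The P@n function assumes precision is the average ODP similarity over the top n results. Both function names are hypothetical.

```python
def odp_similarity(path_a, path_b):
    """Shared category prefix length divided by the longer path length."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return common / max(len(path_a), len(path_b))

def precision_at_n(query_path, ranked_paths, n):
    """Assumed P@n: mean ODP similarity between the query and its top n results."""
    top = ranked_paths[:n]
    return sum(odp_similarity(query_path, p) for p in top) / n

q1 = ["Regional", "North America", "United States"]
q2 = [
    "Regional", "North America", "United States",
    "Travel and Tourism", "National Parks and Monuments",
]
example_similarity = odp_similarity(q1, q2)
```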
27
Experimental Results
  • Query similarity analysis

1. CF-IQF is better than CF; UF-IQF > UF
The results support the intuition behind the
entropy-biased framework: various query-URL pairs
should be treated differently.
2. UF is better than CF; UF-IQF > CF-IQF
The results indicate that the user frequency
associated with a query-URL pair is more robust
than the click frequency for modeling the click
graph.
28
Experimental Results
  • Query similarity analysis

3. TF-IDF is better than TF
The improvements of CF-IQF over CF and of UF-IQF
over UF are consistent with the improvement of
TF-IDF over TF. The reason is that they share the
same key idea: identify and tune the importance
of a term or a query-URL edge.
29
Experimental Results
  • Query similarity analysis

4. Jaccard coefficient
The improvements are consistent with the Cosine
similarity
30
Experimental Results
  • Query similarity analysis

5. UF-IQF achieves the best performance in most cases.
6. CF and UF models > TF; CF-IQF, UF-IQF >
TF-IDF
The click graph captures more semantic relations
between queries than the query terms do.
It is essential and promising to consider
the entropy-biased models for the click graph.
31
Experimental Results
  • Random Walk Evaluation

Results
1. As n increases, both models improve their
performance.
2. The CF-IQF model always performs better than
the CF model.
32
Experimental Results
  • Random Walk Evaluation

In general, the results generated by the CF and
the CF-IQF models are similar and mostly
semantically related to the original query,
such as "American airline". Another important
observation is that the CF-IQF model promotes
more relevant queries as suggestions and removes
some irrelevant ones.
33
Conclusions
  • Introduce the inverse query frequency (IQF) to
    measure the discriminative ability of a URL
  • Identify a new source, user frequency, for
    reducing manipulation by malicious clicks
  • Propose the entropy-biased models to combine the
    IQF with the CF as well as UF for click graphs
  • Experimental results show that the improvements
    of our proposed models are consistent and
    promising

34
Q&A
  • Thanks!