Title: An Exploration of Proximity Measures in Information Retrieval (Tao Tao et al., Microsoft Corp., SIGIR'07)
1. Papers covered:
- An Exploration of Proximity Measures in Information Retrieval --- Tao Tao et al., Microsoft Corp., SIGIR'07
- Learn from Web Search Logs to Organize Search Results --- Xuanhui Wang et al., UIUC, SIGIR'07
2. An Exploration of Proximity Measures in Information Retrieval
- Tao Tao, Microsoft Corp.
- ChengXiang Zhai, University of Illinois at Urbana-Champaign
Published in SIGIR'07
3. Outline
- 1. Motivation
- 2. Measuring proximity
- 3. Proximity retrieval models
- 4. Experiments
- 5. Related work
- 6. Conclusions and future work
4. Motivation
Heuristics to measure proximity
Query: <space, program>
- Document 1
"... however, the first practical solar cell was not introduced until 1900 in response to the program of the space, this first solar photovoltaic cell were made of single crystal silicon and show about 50 percent efficiency ..."
- Document 2
"... film have been determine in from space charge limit current measure ..."
"... this paper summarizes the result of a program initial at the naval research laboratory ..."
Document 1 is more relevant than Document 2, since the two query words are closer to each other.
5. Motivation
Heuristics to measure proximity
Query: <space, program>
- Document
"... however, the first program of space was not introduced until 1900 in response to the space program, this first solar photovoltaic cell were made of single crystal silicon and show about 50 percent efficiency ..."
When there are multiple query words and multiple occurrences, there is no clear way to define the distance between those occurrences.
6. Measuring Proximity: Five Heuristics
Query: <t1, t2>
t1, t2, t1, t3, t5, t4, t2, t3, t4
Span = 7
Span: the length of the shortest document segment covering all occurrences of the query words (including repeated occurrences).
7. Five Heuristics
Query: <t1, t2, t4>
t1, t2, t1, t3, t5, t4, t2, t3, t4
MinCover = 5
MinCover: the length of the minimum segment that covers each query word at least once.
8. Five Heuristics: Pair-Wise Distances
Query: <t1, t2, t4>
- t1, t2, t1, t3, t5, t4, t2, t3, t4
<t1, t2>: distance 2
<t2, t4>: distance 2
<t1, t4>: distance 4
Aggregation:
MinDist = 2
AveDist = 8/3
MaxDist = 4
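To make the five heuristics concrete, here is a minimal Python sketch (my own illustration, not code from the paper; the 1-based positions and the convention that adjacent terms have distance 2 are assumptions inferred from the worked examples above):

```python
from itertools import combinations

def positions(doc, terms):
    """Map each query term to its (1-based) positions in the document."""
    return {t: [i for i, w in enumerate(doc, 1) if w == t] for t in terms}

def span(doc, terms):
    """Length of the shortest segment covering ALL occurrences of query terms."""
    occ = [i for ps in positions(doc, terms).values() for i in ps]
    return max(occ) - min(occ) + 1

def min_cover(doc, terms):
    """Length of the shortest segment covering each query term at least once."""
    pos = positions(doc, terms)
    best = len(doc)
    for start in range(1, len(doc) + 1):
        # for this start position, find the earliest end that covers every term
        last, ok = 0, True
        for ps in pos.values():
            later = [p for p in ps if p >= start]
            if not later:
                ok = False
                break
            last = max(last, later[0])
        if ok:
            best = min(best, last - start + 1)
    return best

def pair_distances(doc, terms):
    """Closest distance for each term pair, counting both endpoints
    (so adjacent terms have distance 2, matching the slide)."""
    pos = positions(doc, terms)
    return {(a, b): min(abs(i - j) + 1 for i in pos[a] for j in pos[b])
            for a, b in combinations(terms, 2)}

doc = ["t1", "t2", "t1", "t3", "t5", "t4", "t2", "t3", "t4"]
terms = ["t1", "t2", "t4"]
d = pair_distances(doc, terms)
print(span(doc, terms))                  # 9 for <t1,t2,t4> (7 for <t1,t2>)
print(min_cover(doc, terms))             # 5
print(min(d.values()),                   # MinDist = 2
      sum(d.values()) / len(d),          # AveDist = 8/3
      max(d.values()))                   # MaxDist = 4
```

Note that Span must cover all occurrences while MinCover only needs each query word once, which is why Span = 9 but MinCover = 5 for the three-term query.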
9. Data Set
- Queries without relevant documents in the FR collection are removed; only 21 queries are left.
10. Data-Driven Analysis: Measuring Proximity Globally
- We use the KL-divergence retrieval method to retrieve the top 1000 documents for each query, calculate each proximity distance measure for each document, and then average these values over the relevant and non-relevant documents respectively.
All results here are negative: the non-normalized measures do not separate relevant from non-relevant documents.
11. Normalization
- Normalized Span: Span divided by the total number of query word occurrences in the document. Span = 9/6
- Normalized MinCover: MinCover divided by the number of unique query words occurring in the document. MinCover = 5/3
Query: <t1, t2, t4>
t1, t2, t1, t3, t5, t4, t2, t3, t4
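In symbols, the two normalized measures can be written as follows (a reconstruction from the slide's wording; the notation $c(q;D)$ for the count of $q$ in $D$ is my own):

$$\mathrm{NormSpan}(Q,D)=\frac{\mathrm{Span}(Q,D)}{\sum_{q\in Q} c(q;D)}=\frac{9}{6},\qquad \mathrm{NormMinCover}(Q,D)=\frac{\mathrm{MinCover}(Q,D)}{\left|\{q\in Q \mid c(q;D)>0\}\right|}=\frac{5}{3}$$

for the example sequence above.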
12. Data-Driven Analysis: Measuring Proximity Globally (Normalized)
- The normalized measures are more promising than the non-normalized versions.
13. Data-Driven Analysis: Measuring Proximity Locally
- The pair-wise distance measures are more promising than (normalized) Span and MinCover.
- We do not need to normalize pair-wise distances.
14. Proximity Retrieval Models
Desired properties of the mapping from distance to relevance:
- 1) The smaller the distance, the larger the relevance.
- 2) It should drop quickly in the beginning and go flat in the end.
[Figure: relevance contribution as a function of the distance measure; see the formula below]
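One transformation with both properties, which I believe is the one used in the paper (verify against the published version), is

$$\pi(Q,D)=\log\left(\alpha+e^{-\delta(Q,D)}\right)$$

where $\delta(Q,D)$ is any of the five proximity distance measures and $\alpha > 0$ is the parameter tuned on slides 20-21: for small $\delta$ the term rises steeply, and as $\delta$ grows it flattens out at $\log\alpha$.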
15. Proximity Retrieval Models
- Incorporating proximity into other retrieval models.
- We are trying to understand which of the five proximity measures works, so we directly incorporate them into existing successful retrieval models (a sketch follows below).
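A minimal sketch of this direct incorporation, assuming the additive combination above (all names here are mine, not the paper's):

```python
import math

def proximity_adjusted_score(base_score: float, distance: float, alpha: float = 0.3) -> float:
    """Add a proximity reward to an existing retrieval score.

    base_score: score from an existing successful model (e.g., KL-divergence or BM25).
    distance:   one of the five proximity measures (e.g., MinDist).
    alpha:      controls how strongly proximity influences the final score.
    """
    return base_score + math.log(alpha + math.exp(-distance))
```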
16. Experiments
[Experimental setup table omitted]
17. Experiments
[Table: MAP of the proximity-enhanced models vs. the baseline]
18. The Best Performance
[Table: best runs, reporting Pr@10, Pr@0.1 recall, and the number of retrieved relevant documents]
19. Comparison with Other Retrieval Models
- Incorporating proximity into retrieval models linearly.
[Table: results of linearly incorporating proximity]
R3 cannot improve retrieval performance no matter how we tune the parameter beta.
20. Tuning Parameter alpha
- Sensitivity of the different methods to parameter alpha.
[Figure: performance as a function of alpha for each method]
21. Tuning Parameter alpha
- Sensitivity of MinDist to parameter alpha across different data sets.
[Figure: performance as a function of alpha for each data set]
22. Related Work
- E. M. Keen '91, '92; M. Beigbeder, SAC'05 ------ Boolean model
- S. Buttcher, SIGIR'03; S. Buttcher, TREC'05; Y. Rasolofo, ECIR'03 ------ BM25 + distance
- C. L. A. Clarke, D. Hawking, TREC'95 ------ span distance including all query terms
- X. Liu, CIKM'02 ------ passage retrieval
- F. Song, SIGIR'99 ------ n-gram
23. Conclusions and Future Work
- Conclusions
- Systematically explored query term proximity heuristics.
- Proposed five different proximity distance measures, each modeling proximity from a different perspective.
- The MinDist proximity distance measure is found to be highly correlated with document relevance.
- Future work
- Further understand why MinDist models proximity best.
- Explore other transformation functions for incorporating proximity into an existing model.
25. Learn from Web Search Logs to Organize Search Results
- Xuanhui Wang and ChengXiang Zhai
- Department of Computer Science
- University of Illinois, Urbana-Champaign
Published in SIGIR'07
26. Motivation
- Search engine utility depends on:
- Ranking accuracy: put the best results on top
- Result presentation: make results easy for users to digest
- Lots of research on improving ranking accuracy
- Relatively little work on improving result presentation
What's the best way to present search results?
27. Ranked List Presentation
[Screenshot of a ranked-list results page omitted]
28. However, When the Query Is Ambiguous...
Query: Jaguar
[Screenshot of mixed results omitted]
29. Cluster Presentation (e.g., Hearst et al. '96, Zamir & Etzioni '99)
From http://vivisimo.com (http://clusty.com/)
30. Deficiencies of Data-Driven Clustering
- Users may prefer different ways to group the results. E.g., query: area codes
- phone codes vs. zip codes
- international codes vs. local codes
- Cluster labels may not be informative enough to help a user choose the right cluster. E.g., label: panthera onca
We need to group search results from a user's perspective.
31. Our Idea: User-Oriented Clustering
- User-oriented clustering (aspects as clusters)
- Partition search results according to the aspects interesting to users
- Label each aspect with words meaningful to users
- Exploit search logs to do both
- Partitioning
- Learn the interesting aspects of an arbitrary query
- Classify results into these aspects
- Labeling
- Learn representative queries for the identified aspects
- Use the representative queries to label the aspects
32. Why Logs Make a Difference
- Search logs record:
- Queries submitted and URLs clicked
- Partitioning based on logs is user-oriented
- Search logs record the users' search activities
- They reflect general user interests
- Labeling based on past queries is more accessible
- Past queries in search logs were typed by end users
- They are easy for users to understand
- E.g., "jaguar cat" vs. "panthera onca"
33. Rest of the Talk
- General approach
- Technical details
- Experiment results
34. Illustration of the General Idea
[Diagram: for the query "car", results such as www.avis.com, www.hertz.com, and www.cars.com are grouped into user-oriented aspects]
35. User-Oriented Clustering via Log Mining
[Diagram: the input query and its results are matched against a search history collection of query pseudo-docs (query pseudo-doc 1, query pseudo-doc 2, ...)]
36. Implementation Strategy
[Diagram: the pipeline from slide 35, with one stage highlighted]
37. Implementation Strategy
[Diagram: the pipeline from slide 35, with the next stage highlighted]
38. More Details: Search Engine Logs
- Logs record user activities (queries, clicks)
- They are a valuable resource for learning to improve search engine utility
- Logs consist of sessions. Each session contains:
- A single query
- The URLs clicked by a particular user for that query
39. More Details: Build History Collection
For every query (e.g., "car rental"), its sessions are aggregated into one entry of the history collection.
[Diagram: sessions sharing a query merged into the history collection]
40. Retrieving Past Queries
Given an input query q (e.g., q = "car"), retrieve similar past queries from the history collection, where each past query is represented as a pseudo-document Qi (a sketch of this step follows below).
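A hedged sketch of this step (the session format, tokenization, and the TF-IDF/cosine retrieval are assumptions; the paper's exact representation may differ):

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sessions standing in for the search log.
sessions = [
    {"query": "car rental", "clicked": ["avis car rental", "hertz rent a car"]},
    {"query": "car rental", "clicked": ["enterprise rental cars"]},
    {"query": "used cars",  "clicked": ["cars for sale", "used car listings"]},
]

# Build one pseudo-document per distinct past query by pooling the
# query text and clicked-result text across all of its sessions.
pseudo_docs = defaultdict(list)
for s in sessions:
    pseudo_docs[s["query"]].extend([s["query"]] + s["clicked"])
queries = list(pseudo_docs)
docs = [" ".join(words) for words in pseudo_docs.values()]

# Retrieve past queries similar to a new input query q.
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
sims = cosine_similarity(vec.transform(["car"]), doc_matrix)[0]
for q, score in sorted(zip(queries, sims), key=lambda x: -x[1]):
    print(f"{q}: {score:.3f}")
```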
41. Implementation Strategy
[Diagram: full pipeline. The input query is run against the search history collection to retrieve similar past queries (query pseudo-docs); star clustering groups them into query aspects 1..k; the star center query of each cluster becomes its label (Label 1, Label 2, ...); categorization then assigns the search results to these aspects.]
42. More Details: Star Clustering [Aslam et al. '04]
Input: the similar queries retrieved in the previous step.
- 1. Form a similarity graph
- TF-IDF weight vectors
- Cosine similarity
- Thresholding
- 2. Iteratively identify a star center and its satellites
The star center query serves as the label for its cluster.
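A compact sketch of star clustering as described on this slide (my reading of the algorithm; the threshold value and choosing the vertex with the most uncovered neighbors as the next star center are assumptions):

```python
import numpy as np

def star_clustering(vectors, threshold=0.3):
    """vectors: (n, d) array of L2-normalized TF-IDF query vectors.
    Returns a list of (center_index, [satellite_indices]) clusters."""
    sims = vectors @ vectors.T                              # cosine similarity graph
    adj = (sims >= threshold) & ~np.eye(len(vectors), dtype=bool)
    uncovered = set(range(len(vectors)))
    clusters = []
    while uncovered:
        # Pick the uncovered vertex with the most uncovered neighbors.
        center = max(uncovered,
                     key=lambda v: sum(u in uncovered for u in np.flatnonzero(adj[v])))
        satellites = [int(u) for u in np.flatnonzero(adj[center]) if u in uncovered]
        clusters.append((center, satellites))
        uncovered -= {center, *satellites}
    return clusters
```

Each cluster's star center query then serves directly as its label, as the slide notes.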
43. Implementation Strategy
[Diagram: the same pipeline, now highlighting categorization: a centroid-based classifier assigns the search results to query aspects 1..k, labeled by their star center queries.]
44. Centroid-Based Classifier
- Represent each query pseudo-doc as a term vector (TF-IDF weighting)
- Compute a centroid vector for each cluster/aspect
- Assign a new result vector to the cluster whose centroid is closest to the new vector (sketched below)
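A minimal sketch of the classifier (using cosine as the closeness measure is my assumption; the slide only says "closest"):

```python
import numpy as np

def centroids(vectors, clusters):
    """vectors: (n, d) TF-IDF matrix of query pseudo-docs;
    clusters: list of index lists, one per aspect."""
    return [vectors[idx].mean(axis=0) for idx in clusters]

def assign(result_vec, cents):
    """Assign a search-result vector to the aspect whose centroid is closest."""
    sims = [result_vec @ c / (np.linalg.norm(result_vec) * np.linalg.norm(c) + 1e-12)
            for c in cents]
    return int(np.argmax(sims))
```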
45. Evaluation: Data Preparation
- Log data: May 2006 search log released by Microsoft Live Labs
- First 2/3 used to simulate history; last 1/3 used to simulate future queries
- History collection: 169,057 queries; 3.5 clicked URLs/query
- The future collection is further split into two sets, for validation and testing
- Test case: a session with more than 4 clicks and at least 100 matching queries in the history (172 and 177 test cases in the two test sets)
- Use clicked URLs to approximate relevant documents [Joachims, 2003]
46. Evaluation: Data Preparation
[Diagram: history collection (169,057 queries; 3.5 clicked URLs/query) and two test sets of 172 and 177 test cases; a test case is a session with more than 4 clicks and at least 100 matching queries in the history]
47. Experiment Design
- Methods:
- Baseline method: the original search engine ranking
- Cluster-based method: traditional clustering based solely on content
- Log-based method: our method based on search logs
48. Evaluation
- Each test case is a session
- Use clicked URLs to approximate relevant documents [Joachims, 2003]
- A user is assumed to first view the cluster with the largest number of relevant docs
- Measures:
- Precision at 5 documents (P@5)
- Mean Reciprocal Rank (MRR) of the first relevant document
[Figure: the viewed cluster within the ranked results]
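For reference, the standard definitions of the two measures (not spelled out on the slide):

$$P@5 = \frac{\#\{\text{relevant documents in the top 5}\}}{5}, \qquad \mathrm{MRR} = \frac{1}{|T|}\sum_{i\in T}\frac{1}{\mathrm{rank}_i}$$

where $T$ is the set of test cases and $\mathrm{rank}_i$ is the rank of the first relevant document for test case $i$.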
49. Overall Comparison
[Table: overall comparison of the three methods]
50. Overall Comparison
[Table: the number of test cases whose P@5 improved versus decreased w.r.t. the baseline]
Why?
51. Diversity Analysis
- Do queries with diverse results benefit more?
- Bin test cases by the size ratio of the two largest clusters
[Figure: Log vs. Baseline across bins of the primary/secondary cluster size ratio, annotated "more diverse"]
Queries with diverse results benefit more.
52. Query Difficulty Analysis
- Do difficult queries benefit more?
- Bin test cases by Mean Average Precision (MAP)
[Figure: Log vs. Baseline across MAP bins, annotated "more difficult"]
Difficult queries benefit more.
53. Effectiveness of Learning
[Figure: P@5 as a function of the amount of history information used]
54. Sample Results: Partitioning
- The log-based method and regular clustering partition the results differently
Query: area codes
- One method separates international codes from local codes; the other separates phone codes from zip codes
55. Sample Results: Labeling
Query: apple
Query: jaguar
[Tables of cluster labels omitted]
56. Related Work
- Categorization-based (e.g., Chen & Dumais '00, '01)
- Cluster presentation (Hearst et al. '96, Zamir & Etzioni '99)
57. Conclusions and Future Work
- Proposed a general strategy for organizing search results based on interesting topic aspects learned from search logs
- Experimented with one way to implement the strategy
- Results show that:
- User-oriented clustering is better than data-oriented clustering
- It particularly helps difficult topics and topics with diverse results
- Future directions:
- Mix data-driven and user-driven clustering
- Study user interaction/feedback with the cluster interface
- Use the general search log to smooth personal search logs
58. Thank You!