An Exploration of Proximity Measures in Information Retrieval (Tao Tao et al., SIGIR07) and Learn from Web Search Logs to Organize Search Results (Xuanhui Wang et al., SIGIR07)

1
  • An Exploration of Proximity Measures in Information Retrieval --- Tao Tao et al., Microsoft Corp., SIGIR07
  • Learn from Web Search Logs to Organize Search Results --- Xuanhui Wang et al., UIUC, SIGIR07

2
An Exploration of Proximity Measures in
Information Retrieval
  • Tao Tao, Microsoft Corp.
  • ChengXiang Zhai, University of Illinois at
    Urbana-Champaign

Published in SIGIR07
3
Outline
  • 1 Motivation
  • 2 Measuring proximity
  • 3 Proximity retrieval models
  • 4 Experiment
  • 5 Related work
  • 6 Conclusion and future work

4
Motivation
Heuristics to measure proximity
Query: <space program>
Document 1: "... however, the first practical solar cell was not introduced until 1900 in response to the program of the space, this first solar photovoltaic cell were made of single crystal silicon and show about 50 percent efficiency ..."
  • Document 2: "... film have been determine in from space charge limit current measure ... this paper summarizes the result of a program initial at the naval research laboratory ..."

Document 1 is more relevant than document 2,
since the two query words are closer to each
other.
5
Motivation
Heuristics to measure proximity
Query: <space program>
  • Document: "... however, the first program of space was not introduced until 1900 in response to the space program, this first solar photovoltaic cell were made of single crystal silicon and show about 50 percent efficiency ..."

When there are multiple query words and multiple
occurrences, there is no clear way to define the
distance between those occurrences.
6
Measuring Proximity: Five Heuristics
  • 1) Span: the length of the shortest segment covering all query term occurrences

Query: <t1, t2>
t1, t2, t1, t3, t5, t4, t2, t3, t4
Span = 7
7
Five heuristics
  • 2) MinCover: the length of the minimum segment covering all query words at least once

Query: <t1, t2, t4>
t1, t2, t1, t3, t5, t4, t2, t3, t4
MinCover = 5
8
Five heuristics: 3)-5) pair-wise distances
Query: <t1, t2, t4>
  • t1, t2, t1, t3, t5, t4, t2, t3, t4

<t1, t2>: distance = 2
<t2, t4>: distance = 2
<t1, t4>: distance = 4
Aggregation:
MinDist = 2
AveDist = 8/3
MaxDist = 4
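The following is a minimal Python sketch of these measures on the slide's toy document; the helper names are mine, and the pair-wise distance is counted as the inclusive length of the segment spanning the two occurrences so that it reproduces the slide's numbers (MinDist = 2, AveDist = 8/3, MaxDist = 4).

from itertools import combinations

def positions(doc, terms):
    """Map each query term to the 1-based positions where it occurs in the document."""
    return {t: [i for i, w in enumerate(doc, 1) if w == t] for t in terms if t in doc}

def span(doc, terms):
    """Length of the shortest segment covering ALL occurrences of the query terms."""
    flat = [i for ps in positions(doc, terms).values() for i in ps]
    return max(flat) - min(flat) + 1 if flat else 0

def min_cover(doc, terms):
    """Length of the minimum segment covering every matched query term at least once."""
    pos = positions(doc, terms)
    if not pos:
        return 0
    best = len(doc)
    for start in range(1, len(doc) + 1):
        # earliest occurrence of each matched term at or after this start position
        ends = [min((p for p in ps if p >= start), default=None) for ps in pos.values()]
        if all(e is not None for e in ends):
            best = min(best, max(ends) - start + 1)
    return best

def pair_distances(doc, terms):
    """Closest distance between each pair of matched terms, counted as segment length."""
    pos = positions(doc, terms)
    return {(a, b): min(abs(i - j) + 1 for i in pos[a] for j in pos[b])
            for a, b in combinations(pos, 2)}

doc = ["t1", "t2", "t1", "t3", "t5", "t4", "t2", "t3", "t4"]
print(span(doc, ["t1", "t2"]))              # 7
print(min_cover(doc, ["t1", "t2", "t4"]))   # 5
d = pair_distances(doc, ["t1", "t2", "t4"])
print(min(d.values()), sum(d.values()) / len(d), max(d.values()))  # 2, 8/3, 4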
9
Data set
  • Queries without relevant documents in the FR collection are removed, leaving only 21 queries

10
Data-Driven Analysis: The Global Proximity Measures
  • We use the KL-divergence retrieval method to
    retrieve top 1000 documents for each query,
    calculate different proximity distance measures
    for each document, and then take the average of
    these values for relevant and non-relevant
    documents respectively.

All results here are negative.
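A small sketch of this per-query analysis step; the document layout, the KL-divergence ranking, and the relevance judgments are all assumed to come from elsewhere.

from statistics import mean

def average_proximity(ranked_docs, relevant_ids, proximity):
    """Average a proximity measure (e.g. MinDist) separately over the relevant and
    non-relevant documents among the top-ranked results of one query.
    ranked_docs: list of dicts with an "id" key (assumed layout); proximity: doc -> value."""
    rel = [proximity(d) for d in ranked_docs if d["id"] in relevant_ids]
    nonrel = [proximity(d) for d in ranked_docs if d["id"] not in relevant_ids]
    return (mean(rel) if rel else None, mean(nonrel) if nonrel else None)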
11
Normalization
  • Normalized Span: Span divided by the total number of query word occurrences in the document. Normalized Span = 9/6
  • Normalized MinCover: MinCover divided by the number of unique query words occurring in the document. Normalized MinCover = 5/3

Query: <t1, t2, t4>
t1, t2, t1, t3, t5, t4, t2, t3, t4
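Continuing the earlier sketch (it reuses the positions, span, and min_cover helpers assumed there), the normalized variants could look like this:

def normalized_span(doc, terms):
    """Span divided by the total number of query word occurrences in the document."""
    occurrences = sum(len(ps) for ps in positions(doc, terms).values())
    return span(doc, terms) / occurrences if occurrences else 0.0

def normalized_min_cover(doc, terms):
    """MinCover divided by the number of distinct query words occurring in the document."""
    unique = len(positions(doc, terms))
    return min_cover(doc, terms) / unique if unique else 0.0

doc = ["t1", "t2", "t1", "t3", "t5", "t4", "t2", "t3", "t4"]
print(normalized_span(doc, ["t1", "t2", "t4"]))       # 9 / 6 = 1.5
print(normalized_min_cover(doc, ["t1", "t2", "t4"]))  # 5 / 3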
12
Data-Driven Analysis: The Global Proximity Measures (Normalized)
  • They are more promising than the non-normalized versions.

13
Data-Driven Analysis: The Local Proximity Measures
  • They are more promising than (normalized) Span and MinCover.
  • We don't need to normalize the pair-wise distances.

14
Proximity retrieval models
  • 1) The smaller the distance, the larger the relevance contribution
  • 2) The contribution should drop quickly in the beginning and go flat in the end

(Figure: the proximity transformation plotted against the distance measure.)
15
Proximity retrieval models
  • Incorporating proximities into other retrieval
    models
  • We want to understand which of the five proximity measures works best, so we incorporate them directly into existing successful retrieval models.
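As an illustration of the last two slides, here is a sketch of a transformation with the two desired properties (decreasing, dropping fast then flattening), added directly to an existing model's score. The exponential-decay form, the alpha value, and the simple additive combination are assumptions for illustration, not necessarily the paper's exact parameterization.

import math

def proximity_bonus(distance, alpha=0.3):
    """Decreasing in the distance: large bonus for small distances,
    quickly flattening toward log(alpha) as the distance grows."""
    return math.log(alpha + math.exp(-distance))

def proximity_score(base_score, distance, alpha=0.3):
    """Add the proximity bonus to the score of an existing retrieval model
    (e.g. a KL-divergence or BM25 score for the same query-document pair)."""
    return base_score + proximity_bonus(distance, alpha)

for d in (1, 2, 4, 8, 16):
    print(d, round(proximity_bonus(d), 3))  # drops quickly, then levels off near log(0.3)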

16
Experiment
  • Normalization

17
Experiment
  • The best performance

(Figure: MAP of the proximity-enhanced models compared against the baseline.)
18
The best performance
(Figures: Pr@10, precision at 0.1 recall, and the number of retrieved relevant documents.)
19
Compare with the other retrieval models
  • Incorporating proximity into retrieval models linearly

Linear incorporation (R3) cannot improve retrieval performance no matter how we tune the parameter beta.
20
Tuning parameter alpha
  • Sensitivity to parameter alpha of different
    methods

21
Tuning parameter alpha
  • Parameter alpha sensitivity of MinDist over
    different data sets

22
RELATED WORK
  • E. M. Keen 91, 92; M. Beigbeder SAC05 --- Boolean model
  • S. Buttcher SIGIR03; S. Buttcher TREC05; Y. Rasolofo ECIR03 --- BM25 + distance
  • C. L. A. Clarke, D. Hawking TREC95 --- span distance including all query terms
  • X. Liu CIKM02 --- passage retrieval
  • F. Song SIGIR99 --- n-gram

23
Conclusions and Future Work
  • Conclusions
    - Systematically explored query term proximity heuristics.
    - Proposed five different proximity distance measures, each modeling proximity from a different perspective.
    - The MinDist proximity distance measure is found to be highly correlated with document relevance.
  • Future work
    - Further understand why MinDist models proximity best.
    - Explore other transformation functions for incorporating proximity into an existing model.

24
  • Q&A

25
Learn from Web Search Logs to Organize Search
Results
  • Xuanhui Wang and ChengXiang Zhai
  • Department of Computer Science
  • University of Illinois, Urbana-Champaign

Published in SIGIR2007
26
Motivation
  • Search engine utility depends on both ranking accuracy and result presentation
    - Ranking accuracy: put the best results on top
    - Result presentation: make results easy for users to digest
  • Lots of research on improving ranking accuracy
  • Relatively little work on improving result presentation

What's the best way to present search results?
27
Ranked List Presentation
28
However, when the query is ambiguous
Query: Jaguar
29
Cluster Presentation (e.g., Hearst et al. 96, Zamir & Etzioni 99)
From http://vivisimo.com (http://clusty.com/)
30
Deficiencies of Data-Driven Clustering
  • Users may prefer different ways to group the results. E.g., query: area codes
    - phone codes vs. zip codes
    - international codes vs. local codes
  • Cluster labels may not be informative enough to help a user choose the right cluster. E.g., label: panthera onca

Need to group search results from a user's perspective
31
Our Idea: User-Oriented Clustering
  • User-oriented clustering (aspects as clusters)
  • Partition search results according to the aspects
    interesting to users
  • Label each aspect with words meaningful to users
  • Exploit search logs to do both
  • Partitioning
  • Learn interesting aspects of an arbitrary query
  • Classify results into these aspects
  • Labeling
  • Learn representative queries of the identified
    aspects
  • Use representative queries to label the aspects

32
Why Logs Make a Difference
  • Search logs record
    - queries submitted and URLs clicked
  • Partitioning based on logs is user-oriented
    - Search logs record users' search activities
    - They reflect general user interests
  • Labeling based on past queries is more accessible
    - Past queries in search logs are input by end users
    - They are easily understood by users
    - E.g., jaguar cat vs. panthera onca

33
Rest of the Talk
  • General Approach
  • Technical Details
  • Experiment Results

34
Illustration of the General Idea
Query: car
Results: www.avis.com, www.hertz.com, www.cars.com, ...
(Figure: the results are grouped into aspects learned from the log, e.g., {car rental, hertz car rental, ...} covering www.avis.com and www.hertz.com, and {car pricing, used car, ...} covering www.cars.com, with labels such as Car rental and Used cars.)
35
User-Oriented Clustering via Log Mining
(Diagram: the input query and its results on one side; the search history collection on the other, with each past query represented as a query pseudo-document.)
36
Implementation Strategy
(Diagram: the same query / results / search history collection of query pseudo-documents; the full pipeline is built up over the following slides.)
38
More Details: Search Engine Logs
  • Record user activities (queries, clicks)
  • Valuable resources for learning to improve search engine utility
  • Logs consist of sessions. Each session contains
    - a single query
    - the URLs clicked by a particular user for that query

39
More Details: Build History Collection
For every query (e.g., car rental):
(Diagram: all sessions of the query are merged into one query pseudo-document in the history collection.)
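A sketch of how the history collection might be assembled from log sessions; the session layout (a query plus its clicked URLs) and the choice to represent a pseudo-document simply as the bag of clicked URLs are simplifying assumptions.

from collections import defaultdict

def build_history_collection(sessions):
    """Merge all sessions of the same query into one query pseudo-document."""
    pseudo_docs = defaultdict(list)
    for query, clicked_urls in sessions:
        pseudo_docs[query].extend(clicked_urls)
    return dict(pseudo_docs)

sessions = [
    ("car rental", ["www.avis.com", "www.hertz.com"]),
    ("car rental", ["www.hertz.com"]),
    ("used car", ["www.cars.com"]),
]
history = build_history_collection(sessions)
# {"car rental": ["www.avis.com", "www.hertz.com", "www.hertz.com"], "used car": ["www.cars.com"]}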
40
Retrieve Past Queries
(Diagram: the input query q (e.g., q = car) is matched against the pseudo-documents Qi in the history collection to retrieve similar past queries.)
41
Implementation Strategy
(Diagram: similar past queries are retrieved from the search history collection of query pseudo-documents; star clustering groups them into query aspects 1..k, with each star center query used as the cluster label; the search results are then categorized into the labeled aspects.)
42
More Details: Star Clustering [Aslam et al. 04]
Input: the similar past queries
  • 1. Form a similarity graph
  • TF-IDF weight vectors
  • Cosine similarity
  • Thresholding

2. Iteratively identify a star center and
its satellites
Star center query serves as a label for a
cluster
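A compact sketch of this clustering step over the retrieved similar queries; the TF-IDF vectorization is assumed to happen elsewhere, and the similarity threshold of 0.3 is an arbitrary placeholder.

import numpy as np

def star_clustering(vectors, threshold=0.3):
    """vectors: row-normalized TF-IDF matrix, one row per similar past query.
    Returns (center_index, [satellite indices]) pairs; the star center query
    serves as the label of its cluster."""
    sim = vectors @ vectors.T                                 # cosine similarity (unit-norm rows)
    adj = (sim >= threshold) & ~np.eye(len(vectors), dtype=bool)
    degree = adj.sum(axis=1)
    unassigned = set(range(len(vectors)))
    clusters = []
    while unassigned:
        center = max(unassigned, key=lambda i: degree[i])     # highest-degree unassigned vertex
        satellites = [int(j) for j in np.flatnonzero(adj[center]) if j in unassigned]
        clusters.append((center, satellites))
        unassigned -= {center, *satellites}
    return clusters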
43
Implementation Strategy
(Diagram: the same pipeline, with the search results assigned to the labeled query aspects by a centroid-based classifier.)
44
Centroid-Based Classifier
  • Represent each query doc as a term vector (TF-IDF
    weighting)
  • Compute a centroid vector for each cluster/aspect
  • Assign a new result vector to the cluster whose
    centroid is the closest to the new vector
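A sketch of this centroid-based assignment of search results to the learned aspects; TF-IDF vectorization of the query pseudo-documents and of the result text is assumed to be done elsewhere, and the function names are mine.

import numpy as np

def centroids(query_vectors, clusters):
    """query_vectors: TF-IDF matrix of query pseudo-docs; clusters: lists of row indices.
    Returns one centroid vector per cluster/aspect."""
    return np.vstack([query_vectors[idx].mean(axis=0) for idx in clusters])

def assign(result_vector, centroid_matrix):
    """Assign a search result to the aspect whose centroid is closest by cosine similarity."""
    sims = centroid_matrix @ result_vector / (
        np.linalg.norm(centroid_matrix, axis=1) * np.linalg.norm(result_vector) + 1e-12)
    return int(np.argmax(sims))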

45
Evaluation: Data Preparation
  • Log data: the May 2006 search log released by Microsoft Live Labs
  • First 2/3 used to simulate history; last 1/3 to simulate future queries
  • History collection: 169,057 queries, 3.5 clicked URLs/query
  • The future collection is further split into two sets, for validation and testing
  • Test case: a session with more than 4 clicks and at least 100 matching queries in history (172 and 177 test cases in the two test sets)
  • Clicked URLs are used to approximate relevant documents [Joachims, 2003]

46
Evaluation: Data Preparation
(Diagram: the Microsoft Live Labs log is split into the history collection (169,057 queries, 3.5 clicked URLs/query) and two test sets of 172 and 177 test cases; a test case is a session with more than 4 clicks and at least 100 matching queries in history.)

47
Experiment Design
  • Methods
  • Baseline method
  • the original search engine ranking
  • Cluster-based method
  • Traditional method solely based on content
  • Log-based method
  • Our method based on search logs

48
Evaluation
  • Each test case is a session
  • Clicked URLs are used to approximate relevant documents [Joachims, 2003]
  • A user is assumed to first view the cluster with the largest number of relevant docs
  • Measures (computed over the results of the viewed cluster)
    - Precision at 5 documents (P@5)
    - Mean Reciprocal Rank (MRR) of the first relevant document
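A sketch of the two measures as they might be computed on the viewed cluster's result list, with clicked URLs standing in for relevant documents; names and data layout are assumptions.

def precision_at_k(ranked_urls, relevant_urls, k=5):
    """Fraction of the top-k results that are relevant (clicked)."""
    return sum(url in relevant_urls for url in ranked_urls[:k]) / k

def reciprocal_rank(ranked_urls, relevant_urls):
    """1/rank of the first relevant result; MRR averages this over all test cases."""
    for rank, url in enumerate(ranked_urls, 1):
        if url in relevant_urls:
            return 1.0 / rank
    return 0.0

viewed_cluster = ["u3", "u7", "u1", "u9", "u2", "u5"]
clicked = {"u1", "u5"}
print(precision_at_k(viewed_cluster, clicked))   # 0.2 (1 of the top 5 was clicked)
print(reciprocal_rank(viewed_cluster, clicked))  # 0.333... (first clicked result at rank 3)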


49
Overall Comparison
50
Overall Comparison
The number of test cases whose P@5 is improved versus decreased w.r.t. the baseline.
Why?
51
Diversity Analysis
  • Do queries with diverse results benefit more?
  • Bin by size ratios of the two largest clusters

(Figure: Log vs. Baseline, binned by the primary/secondary cluster size ratio, ordered toward more diverse result sets.)
Queries with diverse results benefit more
52
Query Difficulty Analysis
  • Do difficult queries benefit more?
  • Bin by Mean Average Precisions (MAPs)

(Figure: Log vs. Baseline, binned by MAP, ordered toward more difficult queries.)
Difficult queries benefit more
53
Effectiveness of Learning
(Figure: P@5 as a function of the amount of history information used.)
54
Sample Results: Partitioning
  • The log-based method and regular clustering partition the results differently

Query: area codes
One partitioning: international codes vs. local codes; the other: phone codes vs. zip codes
55
Sample Results: Labeling
Query: apple
Query: jaguar
56
Related Work
  • Categorization-based (e.g., Chen & Dumais 00, 01)
  • Cluster presentation (Hearst et al. 96, Zamir & Etzioni 99)

57
Conclusions and Future Work
  • Proposed a general strategy for organizing search results based on interesting topic aspects learned from search logs
  • Experimented with one way to implement the strategy
  • Results show that
    - User-oriented clustering is better than data-oriented clustering
    - It particularly helps difficult queries and queries with diverse results
  • Future directions
    - Mixture of data-driven and user-driven clustering
    - Study user interaction/feedback with the cluster interface
    - Use the general search log to smooth a personal search log

58
Thank You!