Title: Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents
1. Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents
- Harr Chen, David R. Karger
- MIT CSAIL
- ACM SIGIR 2006
- August 9, 2006
2. Outline
- Motivations
- Expected Metric Principle
- Metrics
- Bayesian Retrieval
- Objectives
- Heuristics
- Experimental Results
- Related Work
- Future Work and Conclusions
3. Motivation
- In IR, we have formal models and formal metrics
- Models provide a framework for retrieval
  - E.g., probabilistic models
- Metrics provide a rigorous evaluation mechanism
  - E.g., precision and recall
- The probability ranking principle (PRP) is provably optimal for precision/recall
  - Rank by probability of relevance
- But other metrics capture other notions of result set quality, and the PRP isn't necessarily optimal
4. Example: Diversity
- User may be satisfied with one relevant result
  - Navigational queries, question answering
- In this case, we want to hedge our bets by retrieving for diversity in the result set
  - Better to satisfy different users with different interpretations than one user many times over
- Reciprocal rank/search length metrics capture this notion
- PRP is suboptimal
5. IR System Design
- Metrics define a preference ordering on result sets
  - Metric(Result set 1) > Metric(Result set 2) ⇒ Result set 1 preferred to Result set 2
- Traditional approach: try out heuristics that we believe will improve relevance performance
  - Heuristics not directly motivated by the metric
  - E.g., synonym expansion, pseudo-relevance feedback
- Observation: given a model, we can try to directly optimize for some metric
6. Expected Metric Principle (EMP)
- Knowing which metric to use tells us what to maximize: the expected value of the metric for each result set, given a model
[Diagram: enumerate candidate result sets from the corpus (e.g., all ordered pairs drawn from Documents 1, 2, 3), calculate E[metric] for each using the model, and return the set with the maximum score.]
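The enumeration in the diagram is short to write down. Below is a minimal brute-force Python sketch, assuming a hypothetical per-document relevance probability table `p_rel` and, for simplicity, independence of document relevances given the query (the paper's actual model updates probabilities conditioned on earlier selections). It is exponential in n and meant only to illustrate the principle.

```python
from itertools import permutations

def expected_one_call(result_set, p_rel):
    # E[1-call] = Pr(at least one relevant) = 1 - prod_i (1 - p_i),
    # under the (simplifying) independence assumption.
    prob_all_irrelevant = 1.0
    for doc in result_set:
        prob_all_irrelevant *= 1.0 - p_rel[doc]
    return 1.0 - prob_all_irrelevant

def emp_brute_force(corpus, p_rel, n):
    # Enumerate every ordered size-n result set, score each, keep the best.
    return max(permutations(corpus, n),
               key=lambda s: expected_one_call(s, p_rel))

# Toy example mirroring the three-document diagram above.
p_rel = {"doc1": 0.9, "doc2": 0.85, "doc3": 0.4}
print(emp_brute_force(["doc1", "doc2", "doc3"], p_rel, 2))
```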
7. Our Contributions
- Primary: EMP, the metric as retrieval goal
  - Metric designed to measure retrieval quality
  - Metrics we consider: precision/recall @ n, search length, reciprocal rank, instance recall, k-call
  - Build a probabilistic model
  - Retrieve to maximize an objective: the expected value of the metric
  - Expectations calculated according to our probabilistic model
  - Use computational heuristics to make the optimization problem tractable
- Secondary: retrieving for diversity (special case)
  - A natural side effect of optimizing for certain metrics
8. Detour: What is a Heuristic?
- Ad hoc approach
  - Use heuristics that are believed to be correlated with good performance
  - Heuristics used to improve relevance
  - Heuristics (probably) make the system slower
  - Infinite number of possibilities, no formalism
  - Model and heuristics intertwined
- Our approach
  - Build a model that directly optimizes for good performance
  - Heuristics used to improve efficiency
  - Heuristics (probably) make the optimization worse
  - Well-known space of optimization techniques
  - Clean separation between model and heuristics
10. Search Length / Reciprocal Rank
- (Mean) search length (MSL): number of irrelevant results until the first relevant one
- (Mean) reciprocal rank (MRR): one over the rank of the first relevant result
- Example: first relevant result at rank 3 ⇒ search length = 2, reciprocal rank = 1/3
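For concreteness, here is a short Python sketch of both metrics; the list-of-booleans encoding of judged results is my own convention, not the paper's.

```python
def search_length(rels):
    """Number of irrelevant results before the first relevant one."""
    for i, rel in enumerate(rels):
        if rel:
            return i
    return len(rels)  # no relevant result found

def reciprocal_rank(rels):
    """One over the rank of the first relevant result (0 if none)."""
    for i, rel in enumerate(rels):
        if rel:
            return 1.0 / (i + 1)
    return 0.0

# The slide's example: first relevant result at rank 3.
rels = [False, False, True, False]
assert search_length(rels) == 2 and reciprocal_rank(rels) == 1/3
```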
11. Instance Recall
- Each topic has multiple instances (subtopics, aspects)
- Instance recall is how many instances are covered (in union) over the first n results
- Example: instance recall @ 5 = 0.75
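A corresponding sketch for instance recall, assuming each result is annotated with the set of instances it covers:

```python
def instance_recall_at_n(instances, n, total_instances):
    """Fraction of a topic's instances covered by the union of the top n."""
    covered = set()
    for s in instances[:n]:
        covered |= s
    return len(covered) / total_instances

# Hypothetical example: 3 of 4 instances covered in the top 5 results.
instances = [{"a"}, {"a", "b"}, set(), {"c"}, {"b"}]
assert instance_recall_at_n(instances, 5, 4) == 0.75
```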
12. k-call @ n
- Binary metric: 1 if the top n results contain at least k relevant documents, 0 otherwise
- 1-call is (1 − %no); see the TREC robust track
- Example: 1-call @ 5 = 1, 2-call @ 5 = 1, 3-call @ 5 = 0
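And a sketch of k-call, using the same boolean-relevance convention as the earlier metric sketches:

```python
def k_call_at_n(rels, k, n):
    """1 if the top n results contain at least k relevant, else 0."""
    return 1 if sum(rels[:n]) >= k else 0

# The slide's example: 2 relevant documents in the top 5.
rels = [False, True, False, True, False]
assert k_call_at_n(rels, 1, 5) == 1
assert k_call_at_n(rels, 2, 5) == 1
assert k_call_at_n(rels, 3, 5) == 0
```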
13. Motivation for k-call
- 1-call: want one relevant document
  - Many queries are satisfied with one relevant result
  - Only need one relevant document, so there is more room to explore ⇒ promotes result set diversity
- n-call: want all n results relevant
  - Perfect precision
  - Hone in on one interpretation and stick to it!
- Intermediate k
  - Risk/reward tradeoff
- Plus, easily modeled in our framework
  - Binary variable
15. Bayesian Retrieval Model
- There exist distributions that generate relevant documents and irrelevant documents
- PRP: rank by Pr(document relevant | query)
- Remaining modeling questions: the form of the relevant/irrelevant distributions, and the parameters for those distributions
- In this paper, we assume multinomial models, and choose parameters by maximum a posteriori (MAP) estimation
  - Prior is the background corpus word distribution
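As a concrete illustration of this modeling choice, here is a minimal Python sketch. It assumes (as is standard, though the slide does not spell it out) that the MAP estimate of a unigram multinomial under a Dirichlet prior proportional to the background corpus distribution yields Dirichlet-smoothed word probabilities; `mu`, `doc_tf`, and `corpus_prob` are illustrative names, not the paper's.

```python
import math

def map_word_prob(word, doc_tf, doc_len, corpus_prob, mu=2000.0):
    """MAP estimate of Pr(word | document multinomial) under a Dirichlet
    prior scaled by mu from the background corpus word distribution.
    Assumes corpus_prob covers the full vocabulary."""
    return (doc_tf.get(word, 0) + mu * corpus_prob[word]) / (doc_len + mu)

def query_log_likelihood(query_words, doc_tf, doc_len, corpus_prob):
    """log Pr(query | document model): a PRP-style ranking score."""
    return sum(math.log(map_word_prob(w, doc_tf, doc_len, corpus_prob))
               for w in query_words)
```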
17. Objective
- Probability Ranking Principle (PRP): maximize Pr(d_i relevant | query) at each step in the ranking
- Expected Metric Principle (EMP): maximize E[metric(d_1, ..., d_n) | query] for the complete result set
- In particular, for k-call, maximize Pr(at least k of d_1, ..., d_n relevant | query)
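Written out (the notation r_i for "result i is relevant" is mine, matching the paper's definitions), the k-call objective and its 1-call special case are:

```latex
\[
  \text{EMP for } k\text{-call:}\quad
  \max_{d_1,\dots,d_n}\ \Pr\!\Big(\sum_{i=1}^{n} r_i \ge k \,\Big|\, q\Big)
\]
\[
  k = 1:\quad
  \Pr\!\big(\exists\, i : r_i \,\big|\, q\big)
  \;=\; 1 - \Pr\big(\bar r_1, \bar r_2, \dots, \bar r_n \,\big|\, q\big)
\]
```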
19. Optimization of the Objective
- Exact optimization of the objective is usually NP-hard
  - E.g., exact optimization for k-call is NP-hard (by reduction from the maximum graph clique problem)
- Approximation heuristic: greedy algorithm
  - Select documents successively in rank order
  - Hold previous documents fixed; optimize the objective at each rank
[Diagram, animated across slides 19-21: first choose d1 to maximize E[metric | d]; then hold d1 fixed and choose d2 to maximize E[metric | d, d1]; then hold d1 and d2 fixed and choose d3 to maximize E[metric | d, d1, d2].]
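The greedy procedure in the diagram is short to state in code. A generic sketch, where `expected_metric` is an assumed callback that evaluates E[metric] of a candidate prefix under the model:

```python
def greedy_emp(corpus, expected_metric, n):
    """Greedy approximation: at each rank, extend the fixed prefix with
    the document that maximizes the expected metric of the new prefix."""
    chosen = []
    remaining = set(corpus)
    for _ in range(n):
        best = max(remaining, key=lambda d: expected_metric(chosen + [d]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```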
22. Greedy on 1-call and n-call
- 1-greedy
  - The greedy algorithm reduces to ranking each successive document assuming all previous documents are irrelevant
  - The algorithm has discovered incremental negative pseudo-relevance feedback (see the sketch below)
- n-greedy: assume all previous documents are relevant
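Specializing the greedy loop to 1-call gives the reduction described above. A toy sketch, where `score_fn` and `term_freqs` are assumed hooks rather than the paper's actual implementation: after each pick, the chosen document's words are folded into the irrelevant model's counts, so later candidates that repeat those words score lower, which is the negative-feedback, diversity-promoting effect.

```python
def one_greedy(docs, term_freqs, score_fn, n):
    # score_fn(d, irrel_counts): assumed to compute
    # Pr(d relevant | query, counted documents irrelevant).
    chosen, irrel_counts = [], {}
    for _ in range(n):
        candidates = [d for d in docs if d not in chosen]
        best = max(candidates, key=lambda d: score_fn(d, irrel_counts))
        chosen.append(best)
        # Negative feedback: treat the chosen document as irrelevant evidence.
        for w, tf in term_freqs[best].items():
            irrel_counts[w] = irrel_counts.get(w, 0) + tf
    return chosen
```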
23. Greedy on Other Metrics
- Greedy with precision/recall ⇒ reduces to the PRP!
- Greedy on k-call for general k (k-greedy)
  - More complicated
- Greedy with MSL, MRR, or instance recall works out to the 1-greedy algorithm
  - Intuition: to make the first relevant document appear earlier, we want to hedge our bets as to the query interpretation (i.e., diversify)
24. Experiments: Overview
- Experiments verify that optimizing for a metric improves performance on that metric
  - They do not tell us which metrics to use
- Looked at ad hoc diversity examples
- TREC topics/queries
  - Tuned weights on a separate development set
- Tested on:
  - Standard ad hoc (robust track) topics
  - Topics with multiple annotators
  - Topics with multiple instances
25. Diversity on Google Results
- Task: reranking the top 1,000 Google results
- In optimizing 1-call, our algorithm finds more diverse results than the PRP or Google's own ranking
26. Experiments: Robust Track
- TREC 2003, 2004 robust tracks
  - 249 topics
  - 528,000 documents
- 1-call and 10-call results statistically significant
27. Experiments: Instance Retrieval
- TREC-6, 7, 8 interactive tracks
  - 20 topics
  - 210,000 documents
  - 7 to 56 instances per topic
- PRP baseline: instance recall @ 10 = 0.234
- Greedy 1-call: instance recall @ 10 = 0.315
28. Experiments: Multi-annotator
- TREC-4, 6 ad hoc retrieval
  - Independent annotators assessed the same topics
  - TREC-4: 49 topics, 568,000 documents, 3 annotators
  - TREC-6: 50 topics, 556,000 documents, 2 annotators
- ⇒ More annotators are satisfied using 1-greedy
29. Related Work
- Fits in the risk minimization framework (objective as a negative loss function)
- Other approaches look at optimizing for metrics directly, with training data
- Pseudo-relevance feedback
- Subtopic retrieval
- Maximal marginal relevance
- Clustering
- See the paper for references
30. Future Work
- General k-call (k = 2, etc.)
  - Determining whether this is what users want
- Better underlying probabilistic models
  - Our contribution is in the ranking objective, not the model ⇒ the model can be arbitrarily sophisticated
- Better optimization techniques
  - E.g., local search would differentiate the algorithms for MRR and 1-call
- Other metrics
  - Preliminary work on mean average precision and precision @ recall
  - (Perhaps) surprisingly, these metrics are not optimized by the PRP!
31. Conclusions
- EMP: the metric can motivate the model; choosing and believing in a metric already gives us a reasonable objective, E[metric]
- EMP can potentially be applied on top of a variety of different underlying probabilistic models
- Diversity is one practical example of a natural side effect of using EMP with the right metric
32. Acknowledgments
- Harr Chen is supported by the Office of Naval Research through a National Defense Science and Engineering Graduate Fellowship
- Jaime Teevan, Susan Dumais, and the anonymous reviewers provided constructive feedback
- ChengXiang Zhai, William Cohen, and Ellen Voorhees provided code and data