Title: Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
1 Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
- Martin Theobald
- Ralf Schenkel
- Gerhard Weikum
- Max-Planck Institute for Informatics
- Saarbrücken
- Germany
ACM SIGIR '05
2 An Initial Example
- Robust Track '04, hard query no. 363 (Aquaint news corpus): "transportation tunnel disasters"
- Increased robustness
  - Count only the best match per document and expansion set
- Increased efficiency
  - Top-k-style query evaluations
  - Open scans on new terms only on demand
  - No threshold tuning
[Figure: the query terms transportation, tunnel, disasters (each with similarity 1.0) and their expansion sets, e.g. transportation → {transit, highway, train, truck, metro, rail, car, ...}, tunnel → {tube, underground, Mont Blanc, ...}, disasters → {catastrophe, accident, fire, flood, earthquake, landslide}, with term similarities ranging from 1.0 down to 0.1; documents d1 and d2 are scored via the best match per expansion set.]
- Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.
- Term similarities, e.g., Robertson-Sparck-Jones, concept similarities, or other correlation measures
3 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions
4 Computational Model
- Vector space model with a Cartesian product space D1 × ... × Dm and a data set D ⊆ D1 × ... × Dm
- Precomputed local scores s(ti, d) ∈ Di for all d ∈ D
  - e.g., TF·IDF variations, probabilistic models (Okapi BM25), etc.
  - typically normalized to s(ti, d) ∈ [0, 1]
- Monotone score aggregation
  - aggr: (D1 × ... × Dm) × (D1 × ... × Dm) → R+
  - e.g., sum, max, product (sum over log sij), cosine (L2 norm)
- Partial-match queries (aka "andish")
  - Non-conjunctive query evaluations
  - Weak local matches can be compensated
- Access model
  - Inverted index over a large text corpus, inverted lists sorted by decreasing local scores
  - → Inexpensive sequential accesses to per-term lists: getNextItem()
  - → More expensive random accesses: getItemBy(docid)
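The "andish" partial-match semantics can be sketched in a few lines of Python. This is an illustrative toy, not the paper's system: the term names, document ids, and scores are made up, and sum serves as the monotone aggregation (a term missing from a document simply contributes 0, so weak local matches are compensated by strong ones).

```python
# Toy local scores: term -> {doc -> precomputed score in [0, 1]}
local_scores = {
    "t1": {"d78": 0.9, "d23": 0.8, "d10": 0.8},
    "t2": {"d64": 0.8, "d23": 0.6, "d10": 0.6},
    "t3": {"d10": 0.7, "d78": 0.5, "d64": 0.4},
}

def aggregate(query, doc):
    # sum is monotone: raising any local score never lowers the total;
    # missing terms contribute 0.0 (partial match, not conjunctive)
    return sum(local_scores[t].get(doc, 0.0) for t in query)

query = ["t1", "t2", "t3"]
docs = {d for t in query for d in local_scores[t]}
ranked = sorted(docs, key=lambda d: aggregate(query, d), reverse=True)
# d10 aggregates 0.8 + 0.6 + 0.7 = 2.1 and ranks first even though
# it is not the best match for any single term
```

Note that d78 still receives score 1.4 although it never matches t2, which is exactly what a conjunctive evaluation would forbid.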
5 No-Random-Access (NRA) Algorithm [Fagin et al., PODS 02]

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel  // e.g., round-robin
    <d, s(ti, d)> := Li.getNextItem()
    E(d) := E(d) ∪ {i}
    high_i := s(ti, d)
    worstscore(d) := Σ_{ν ∈ E(d)} s(tν, d)
    bestscore(d) := worstscore(d) + Σ_{ν ∉ E(d)} high_ν
    if worstscore(d) > min-k then
      add d to top-k
      min-k := min{worstscore(d') | d' ∈ top-k}
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    if max{bestscore(d') | d' ∈ candidates} ≤ min-k then
      return top-k
Example: query q = (t1, t2, t3), k = 1, over a corpus {d1, ..., dn}; the inverted index stores precomputed local scores (e.g., s(t1, d1) = 0.7, ..., s(tm, d1) = 0.2).

Inverted lists, sorted by descending local score:
t1: d78:0.9, d23:0.8, d10:0.8, d1:0.7, d88:0.2
t2: d64:0.8, d23:0.6, d10:0.6, ..., d78:0.1
t3: d10:0.7, d78:0.5, d64:0.4, d99:0.2, d34:0.1

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3 (STOP: no candidate's best-score exceeds min-k = 2.1):
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0

A naive join-then-sort evaluation needs between O(m·n) and O(m·n²) runtime.
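The scan above can be made concrete with a minimal executable sketch of NRA. This is an illustrative reimplementation, not the authors' code: round-robin is simplified to "one sequential access per list per step", the incompletely shown tail of t2 is omitted, and ties in worst-score are broken arbitrarily.

```python
# Inverted lists from the slide example, sorted by descending local score
lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
}

def nra(lists, k):
    terms = list(lists)
    high = {t: lists[t][0][1] for t in terms}   # highest still-unseen score per list
    seen = {}                                   # doc -> {term: local score}
    max_depth = max(len(l) for l in lists.values())
    for depth in range(1, max_depth + 1):
        for t in terms:                         # one sequential access per list
            if depth <= len(lists[t]):
                d, s = lists[t][depth - 1]
                seen.setdefault(d, {})[t] = s
                high[t] = s                     # lists are sorted, so this bounds the rest
        worst = {d: sum(ts.values()) for d, ts in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in terms if t not in ts)
                for d, ts in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = min(worst[d] for d in topk)
        # stop when no remaining candidate can still overtake the top-k
        if all(best[d] <= min_k for d in seen if d not in topk):
            return topk, depth
    return topk, max_depth

topk, depth = nra(lists, k=1)  # stops at scan depth 3 with d10 as top-1
```

Running this reproduces the table: after three round-robin steps d10 has worst-score 2.1, which no candidate's best-score can beat, so the scan terminates early without ever reading the tails of the lists.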
7 Dynamic Self-tuning Query Expansions
- Incrementally merge inverted lists Li1, ..., Lim in descending order of local scores
- Dynamically add lists into the set of active expansions exp(ti) according to the combined term similarities and local scores
- Best-match score aggregation

[Figure: top-k operator over (t1, t2, t3); an incremental merge feeds documents (d66, d95, d93, d17, d11, d99, d101, ...) from the expansion lists of t1 and t2.]

- Increased retrieval robustness: fewer topic drifts
- Increased efficiency through fewer active expansions
- No threshold tuning required!
8 Incremental Merge Operator
- Expansion terms exp(t) = {t1, t2, t3} from relevance feedback, thesaurus lookups, ...
- Expansion similarities from correlation measures, large corpus statistics, ...: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
- Index list meta data (e.g., histograms), initial high-scores
- Incremental Merge is iteratively triggered by the top-k operator via getNextItem()
- Merged virtual list (local scores weighted by the similarities): d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, d88:0.3, ...
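The merge itself can be sketched as a lazy generator over a priority queue: at any time the queue holds one frontier entry per expansion list, keyed by the similarity-weighted score, so output is produced in globally descending order and lists are advanced only on demand. The similarities below match the slide; the list contents are hypothetical, chosen so that the weighted merge reproduces the slide's merged sequence.

```python
import heapq

sims = {"t1": 1.0, "t2": 0.9, "t3": 0.5}
lists = {  # hypothetical expansion lists, each sorted by local score
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4)],
    "t2": [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d11", 0.5)],
    "t3": [("d78", 0.9), ("d88", 0.6)],
}

def incremental_merge(lists, sims):
    """Yield (doc, sim-weighted score) in globally descending score order."""
    heap = []  # entries: (-weighted score, term, position, doc)
    for t in lists:
        d, s = lists[t][0]
        heapq.heappush(heap, (-sims[t] * s, t, 0, d))
    while heap:
        neg, t, pos, d = heapq.heappop(heap)
        yield d, -neg
        if pos + 1 < len(lists[t]):            # advance only the popped list
            nd, ns = lists[t][pos + 1]
            heapq.heappush(heap, (-sims[t] * ns, t, pos + 1, nd))

merged = list(incremental_merge(lists, sims))
# yields d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, ...
```

Because each heappop triggers at most one sequential access, a top-k operator calling getNextItem() on this virtual list only opens deeper expansion entries when their weighted scores are actually needed.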
10 Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB 04]
- For each physical index list Li
  - Treat each s(ti, d) ∈ [0, 1] as a random variable Si
  - Approximate the local score distribution using an equi-width histogram with n buckets
- For a virtual index list Li = Li1 ∪ ... ∪ Lim
  - Consider the max-distribution
  - Alternatively, construct a meta histogram for the active expansions
- For all d in the candidate queue
  - Consider the convolution over the score distributions to obtain the distribution of the aggregated score
  - Drop d from the candidates if P[worstscore(d) + Σ_{i ∉ E(d)} Si > min-k] ≤ ε
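A sketch of the pruning test under this model: equi-width histograms over [0, 1] approximate the score distributions of the unseen dimensions, their convolution gives the distribution of the remaining score mass, and the candidate is dropped when the probability of still exceeding min-k falls below the threshold ε. All histograms and scores below are toy data, not taken from the paper.

```python
def convolve(h1, h2):
    """Convolution of two equal-width histograms: distribution of S1 + S2."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p1 in enumerate(h1):
        for j, p2 in enumerate(h2):
            out[i + j] += p1 * p2
    return out

def prob_exceeds(hist, width, delta):
    """Approximate P[S > delta]: mass of buckets lying entirely above delta."""
    return sum(p for i, p in enumerate(hist) if i * width > delta)

width = 0.25                    # 4 equi-width buckets over [0, 1]
h_t2 = [0.5, 0.3, 0.15, 0.05]   # toy score distribution of unseen term t2
h_t3 = [0.4, 0.3, 0.2, 0.1]     # toy score distribution of unseen term t3
h_sum = convolve(h_t2, h_t3)    # distribution of S2 + S3 over [0, 2]

min_k, worstscore_d = 1.4, 0.6  # candidate d has accumulated only 0.6 so far
p = prob_exceeds(h_sum, width, min_k - worstscore_d)
drop = p < 0.2                  # prune d at threshold epsilon = 0.2
```

Here the candidate needs more than 0.8 from its two unseen terms, which happens with probability about 0.105 under the toy histograms, so at ε = 0.2 it is dropped from the candidate queue.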
14 Incremental Merge for Multidimensional Predicates

Example query: q = (undersea "fiber optic cable")
- A nested Top-k operator iteratively prefetches and joins candidate items for each subquery condition via getNextItem()
  - Provides worstscore(d) and bestscore(d) guarantees to the superordinate top-k operator
  - Propagates candidates in descending order of bestscore(d) values for monotonicity
- The top-level top-k operator performs phrase tests only for the most promising items, using random accesses to a term-to-position index
  - (Expensive predicates & minimal probes [Chang & Hwang, SIGMOD 02])
- Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)

Expansion similarities: sim("fiber optic cable", "fiber optic cable") = 1.0, sim("fiber optic cable", "fiber optics") = 0.8
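The phrase test itself can be sketched as a positional containment check: for a promising candidate, the term-to-position index is probed by random access, and the document passes if the phrase terms occur at consecutive positions. The index contents below are invented for illustration.

```python
# Toy term-to-position index: doc -> term -> sorted word positions
positions = {
    "d7": {"fiber": [3, 40], "optic": [4], "cable": [5, 19]},
    "d9": {"fiber": [12], "optic": [2], "cable": [13]},
}

def phrase_match(doc, phrase):
    """True iff the phrase terms occur at consecutive positions in doc."""
    terms = phrase.split()
    idx = positions.get(doc, {})
    if any(t not in idx for t in terms):
        return False
    # try each start position of the first term; term i must sit at offset i
    return any(all(p + i in idx[t] for i, t in enumerate(terms))
               for p in idx[terms[0]])

# phrase_match("d7", "fiber optic cable") -> True  (positions 3, 4, 5)
# phrase_match("d9", "fiber optic cable") -> False (terms present, not adjacent)
```

Since each probe costs a random IO, it pays off to schedule this expensive predicate only for items near the top of the candidate queue, as the minimal-probes principle above suggests.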
16 Experiments: Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 "hard" queries from the TREC 2004 Robust track
- WordNet expansions using a simple form of WSD
- Okapi BM25 model for local scores, Dice coefficients as term similarities
- Fixed expansion technique (synonyms + first-order hyponyms) with m ≤ 118

Method            m    # SA         # RA      CPU sec  Memory     P@10   MAP    rel. prec.
Title-only
  JoinSort        4    2,305,637
  NRA-baseline    4    1,439,815    0         9.4      432 KB     0.252  0.092  1.000
Static expansions
  JoinSort        118  20,582,764
  NRA+Phrases     118  18,258,834   210,531   245.0    37,355 KB  0.286  0.105  1.000
  NRA+Phrases     118  3,622,686    49,783    79.6     5,895 KB   0.238  0.086  0.541
Dynamic expansions
  Incr.Merge      118  7,908,666    53,050    159.1    17,393 KB  0.310  0.118  1.000
  Incr.Merge      118  5,908,017    48,622    79.4     13,424 KB  0.298  0.110  0.786
17 Experiments: Aquaint with Fixed Expansions (contd.)

Probabilistic pruning performance: Incremental Merge vs. Top-k with static expansions; epsilon controls the pruning aggressiveness.
18 Experiments: Aquaint with Large Expansions

Query expansion performance: Incremental Merge vs. Top-k with static expansions; theta controls the expansion size. Aggressive expansion technique (synonyms + hyponyms + hypernyms) with 36 ≤ m ≤ 876.
19 Conclusions & Current Work
- Increased efficiency
  - Incremental Merge vs. Join-then-Sort and top-k using static expansions
  - Very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness
  - Largely avoids topic drifts
  - Modeling of fine-grained semantic similarities (Incremental Merge & Nested Top-k operators)
- Scalability (see paper)
  - Large expansions (up to 876 terms per query) on Aquaint
  - Experiments on the Terabyte collection
- Efficient support for XML-IR (INEX benchmark)
  - Inverted lists for combined tag-term pairs, e.g., sec=mining
  - Efficiently supports the child-or-descendant axis, e.g., //article//sec=mining
  - Vague content-and-structure queries (VCAS)
  - Incremental Merge over DataGuide-like XPath locators
  - [VLDB 05, Trondheim]
20 Thank you!