Title: Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
1 Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing
- Martin Theobald
- Ralf Schenkel
- Gerhard Weikum
- Max-Planck Institute for Informatics
- Saarbrücken
- Germany
ACM SIGIR '05
2 An Initial Example
- Robust Track '04, hard query no. 363 (Aquaint news corpus): "transportation tunnel disasters"
- Increased robustness
  - Count only the best match per document and expansion set
- Increased efficiency
  - Top-k-style query evaluations
  - Open scans on new terms only on demand
  - No threshold tuning
[Figure: the query terms transportation, tunnel, disasters (each with similarity 1.0) and their expansion sets, e.g. transportation → {transit, highway, train, truck, metro, rail, car, ...}, tunnel → {tube, underground, Mont Blanc, ...}, disasters → {catastrophe, accident, fire, flood, earthquake, landslide}, with term similarities ranging from 1.0 down to 0.1; documents d1 and d2 are scored via the best match per expansion set.]
- Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.
- Term similarities, e.g., Robertson-Sparck-Jones, concept similarities, or other correlation measures
3 Outline
- Computational model & background on top-k algorithms
- Incremental Merge over inverted lists
- Probabilistic candidate pruning
- Phrase matching
- Experiments & Conclusions
4 Computational Model
- Vector space model with a Cartesian product space D1 × ... × Dm and a data set D ⊆ D1 × ... × Dm
- Precomputed local scores s(ti, d) ∈ Di for all d ∈ D
  - e.g., TF·IDF variations, probabilistic models (Okapi BM25), etc.
  - typically normalized to s(ti, d) ∈ [0, 1]
- Monotone score aggregation
  - aggr: (D1 × ... × Dm) × (D1 × ... × Dm) → R+
  - e.g., sum, max, product (sum over log sij), cosine (L2 norm)
- Partial-match queries (aka "andish")
  - Non-conjunctive query evaluations
  - Weak local matches can be compensated
- Access model
  - Inverted index over a large text corpus, inverted lists sorted by decreasing local scores
  - → Inexpensive sequential accesses to per-term lists: getNextItem()
  - → More expensive random accesses: getItemBy(docid)
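The "andish" partial-match semantics can be sketched in a few lines of Python. This is an illustrative toy, not the paper's system: the term names, document ids, and scores are made up, and sum serves as the monotone aggregation (a term missing from a document simply contributes 0, so weak local matches are compensated by strong ones).

```python
# Toy local scores: term -> {doc -> precomputed score in [0, 1]}
local_scores = {
    "t1": {"d78": 0.9, "d23": 0.8, "d10": 0.8},
    "t2": {"d64": 0.8, "d23": 0.6, "d10": 0.6},
    "t3": {"d10": 0.7, "d78": 0.5, "d64": 0.4},
}

def aggregate(query, doc):
    # sum is monotone: raising any local score never lowers the total;
    # missing terms contribute 0.0 (partial match, not conjunctive)
    return sum(local_scores[t].get(doc, 0.0) for t in query)

query = ["t1", "t2", "t3"]
docs = {d for t in query for d in local_scores[t]}
ranked = sorted(docs, key=lambda d: aggregate(query, d), reverse=True)
# d10 aggregates 0.8 + 0.6 + 0.7 = 2.1 and ranks first even though
# it is not the best match for any single term
```

Note that d78 still receives score 1.4 although it never matches t2, which is exactly what a conjunctive evaluation would forbid.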
5 No-Random-Access (NRA) Algorithm [Fagin et al., PODS 02]

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel  // e.g., round-robin
    <d, s(ti, d)> := Li.getNextItem()
    E(d) := E(d) ∪ {i}
    high_i := s(ti, d)
    worstscore(d) := Σ_{ν ∈ E(d)} s(tν, d)
    bestscore(d) := worstscore(d) + Σ_{ν ∉ E(d)} high_ν
    if worstscore(d) > min-k then
      add d to top-k
      min-k := min{worstscore(d') | d' ∈ top-k}
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    if max{bestscore(d') | d' ∈ candidates} ≤ min-k then
      return top-k
Example: query q = (t1, t2, t3), k = 1, over a corpus {d1, ..., dn}; the inverted index stores precomputed local scores (e.g., s(t1, d1) = 0.7, ..., s(tm, d1) = 0.2).

Inverted lists, sorted by descending local score:
t1: d78:0.9, d23:0.8, d10:0.8, d1:0.7, d88:0.2
t2: d64:0.8, d23:0.6, d10:0.6, ..., d78:0.1
t3: d10:0.7, d78:0.5, d64:0.4, d99:0.2, d34:0.1

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3 (STOP: no candidate's best-score exceeds min-k = 2.1):
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0

A naive join-then-sort evaluation needs between O(m·n) and O(m·n²) runtime.
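The scan above can be made concrete with a minimal executable sketch of NRA. This is an illustrative reimplementation, not the authors' code: round-robin is simplified to "one sequential access per list per step", the incompletely shown tail of t2 is omitted, and ties in worst-score are broken arbitrarily.

```python
# Inverted lists from the slide example, sorted by descending local score
lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
}

def nra(lists, k):
    terms = list(lists)
    high = {t: lists[t][0][1] for t in terms}   # highest still-unseen score per list
    seen = {}                                   # doc -> {term: local score}
    max_depth = max(len(l) for l in lists.values())
    for depth in range(1, max_depth + 1):
        for t in terms:                         # one sequential access per list
            if depth <= len(lists[t]):
                d, s = lists[t][depth - 1]
                seen.setdefault(d, {})[t] = s
                high[t] = s                     # lists are sorted, so this bounds the rest
        worst = {d: sum(ts.values()) for d, ts in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in terms if t not in ts)
                for d, ts in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = min(worst[d] for d in topk)
        # stop when no remaining candidate can still overtake the top-k
        if all(best[d] <= min_k for d in seen if d not in topk):
            return topk, depth
    return topk, max_depth

topk, depth = nra(lists, k=1)  # stops at scan depth 3 with d10 as top-1
```

Running this reproduces the table: after three round-robin steps d10 has worst-score 2.1, which no candidate's best-score can beat, so the scan terminates early without ever reading the tails of the lists.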
7 Dynamic Self-tuning Query Expansions
- Incrementally merge inverted lists Li1, ..., Lim in descending order of local scores
- Dynamically add lists into the set of active expansions exp(ti) according to the combined term similarities and local scores
- Best-match score aggregation

[Figure: top-k operator over (t1, t2, t3); an incremental merge feeds documents (d66, d95, d93, d17, d11, d99, d101, ...) from the expansion lists of t1 and t2.]

- Increased retrieval robustness: fewer topic drifts
- Increased efficiency through fewer active expansions
- No threshold tuning required!
8 Incremental Merge Operator
- Expansion terms exp(t) = {t1, t2, t3} from relevance feedback, thesaurus lookups, ...
- Expansion similarities from correlation measures, large corpus statistics, ...: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
- Index list meta data (e.g., histograms), initial high-scores
- Incremental Merge is iteratively triggered by the top-k operator via getNextItem()
- Merged virtual list (local scores weighted by the similarities): d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, d88:0.3, ...
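The merge itself can be sketched as a lazy generator over a priority queue: at any time the queue holds one frontier entry per expansion list, keyed by the similarity-weighted score, so output is produced in globally descending order and lists are advanced only on demand. The similarities below match the slide; the list contents are hypothetical, chosen so that the weighted merge reproduces the slide's merged sequence.

```python
import heapq

sims = {"t1": 1.0, "t2": 0.9, "t3": 0.5}
lists = {  # hypothetical expansion lists, each sorted by local score
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4)],
    "t2": [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d11", 0.5)],
    "t3": [("d78", 0.9), ("d88", 0.6)],
}

def incremental_merge(lists, sims):
    """Yield (doc, sim-weighted score) in globally descending score order."""
    heap = []  # entries: (-weighted score, term, position, doc)
    for t in lists:
        d, s = lists[t][0]
        heapq.heappush(heap, (-sims[t] * s, t, 0, d))
    while heap:
        neg, t, pos, d = heapq.heappop(heap)
        yield d, -neg
        if pos + 1 < len(lists[t]):            # advance only the popped list
            nd, ns = lists[t][pos + 1]
            heapq.heappush(heap, (-sims[t] * ns, t, pos + 1, nd))

merged = list(incremental_merge(lists, sims))
# yields d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, ...
```

Because each heappop triggers at most one sequential access, a top-k operator calling getNextItem() on this virtual list only opens deeper expansion entries when their weighted scores are actually needed.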
10 Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB 04]
- For each physical index list Li
  - Treat each s(ti, d) ∈ [0, 1] as a random variable Si
  - Approximate the local score distribution using an equi-width histogram with n buckets
- For a virtual index list Li = Li1 ∪ ... ∪ Lim
  - Consider the max-distribution
  - Alternatively, construct a meta histogram for the active expansions
- For all d in the candidate queue
  - Consider the convolution over the score distributions to obtain the distribution of the aggregated score
  - Drop d from the candidates if P[worstscore(d) + Σ_{i ∉ E(d)} Si > min-k] ≤ ε
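A sketch of the pruning test under this model: equi-width histograms over [0, 1] approximate the score distributions of the unseen dimensions, their convolution gives the distribution of the remaining score mass, and the candidate is dropped when the probability of still exceeding min-k falls below the threshold ε. All histograms and scores below are toy data, not taken from the paper.

```python
def convolve(h1, h2):
    """Convolution of two equal-width histograms: distribution of S1 + S2."""
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p1 in enumerate(h1):
        for j, p2 in enumerate(h2):
            out[i + j] += p1 * p2
    return out

def prob_exceeds(hist, width, delta):
    """Approximate P[S > delta]: mass of buckets lying entirely above delta."""
    return sum(p for i, p in enumerate(hist) if i * width > delta)

width = 0.25                    # 4 equi-width buckets over [0, 1]
h_t2 = [0.5, 0.3, 0.15, 0.05]   # toy score distribution of unseen term t2
h_t3 = [0.4, 0.3, 0.2, 0.1]     # toy score distribution of unseen term t3
h_sum = convolve(h_t2, h_t3)    # distribution of S2 + S3 over [0, 2]

min_k, worstscore_d = 1.4, 0.6  # candidate d has accumulated only 0.6 so far
p = prob_exceeds(h_sum, width, min_k - worstscore_d)
drop = p < 0.2                  # prune d at threshold epsilon = 0.2
```

Here the candidate needs more than 0.8 from its two unseen terms, which happens with probability about 0.105 under the toy histograms, so at ε = 0.2 it is dropped from the candidate queue.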
14 Incremental Merge for Multidimensional Predicates

Example query: q = (undersea "fiber optic cable")
- A nested Top-k operator iteratively prefetches and joins candidate items for each subquery condition via getNextItem()
  - Provides worstscore(d) and bestscore(d) guarantees to the superordinate top-k operator
  - Propagates candidates in descending order of bestscore(d) values for monotonicity
- The top-level top-k operator performs phrase tests only for the most promising items, using random accesses to a term-to-position index
  - (Expensive predicates & minimal probes [Chang & Hwang, SIGMOD 02])
- Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)

Expansion similarities: sim("fiber optic cable", "fiber optic cable") = 1.0, sim("fiber optic cable", "fiber optics") = 0.8
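The phrase test itself can be sketched as a positional containment check: for a promising candidate, the term-to-position index is probed by random access, and the document passes if the phrase terms occur at consecutive positions. The index contents below are invented for illustration.

```python
# Toy term-to-position index: doc -> term -> sorted word positions
positions = {
    "d7": {"fiber": [3, 40], "optic": [4], "cable": [5, 19]},
    "d9": {"fiber": [12], "optic": [2], "cable": [13]},
}

def phrase_match(doc, phrase):
    """True iff the phrase terms occur at consecutive positions in doc."""
    terms = phrase.split()
    idx = positions.get(doc, {})
    if any(t not in idx for t in terms):
        return False
    # try each start position of the first term; term i must sit at offset i
    return any(all(p + i in idx[t] for i, t in enumerate(terms))
               for p in idx[terms[0]])

# phrase_match("d7", "fiber optic cable") -> True  (positions 3, 4, 5)
# phrase_match("d9", "fiber optic cable") -> False (terms present, not adjacent)
```

Since each probe costs a random IO, it pays off to schedule this expensive predicate only for items near the top of the candidate queue, as the minimal-probes principle above suggests.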
16 Experiments: Aquaint with Fixed Expansions
- Aquaint corpus of English news articles (528,155 docs)
- 50 "hard" queries from the TREC 2004 Robust track
- WordNet expansions using a simple form of WSD
- Okapi BM25 model for local scores, Dice coefficients as term similarities
- Fixed expansion technique (synonyms + first-order hyponyms) with m ≤ 118

Method            m    # SA         # RA      CPU sec  Memory     P@10   MAP    rel. prec.
Title-only
  JoinSort        4    2,305,637
  NRA-baseline    4    1,439,815    0         9.4      432 KB     0.252  0.092  1.000
Static expansions
  JoinSort        118  20,582,764
  NRA+Phrases     118  18,258,834   210,531   245.0    37,355 KB  0.286  0.105  1.000
  NRA+Phrases     118  3,622,686    49,783    79.6     5,895 KB   0.238  0.086  0.541
Dynamic expansions
  Incr.Merge      118  7,908,666    53,050    159.1    17,393 KB  0.310  0.118  1.000
  Incr.Merge      118  5,908,017    48,622    79.4     13,424 KB  0.298  0.110  0.786
17 Experiments: Aquaint with Fixed Expansions (contd.)

Probabilistic pruning performance: Incremental Merge vs. Top-k with static expansions; epsilon controls the pruning aggressiveness.
18 Experiments: Aquaint with Large Expansions

Query expansion performance: Incremental Merge vs. Top-k with static expansions; theta controls the expansion size. Aggressive expansion technique (synonyms + hyponyms + hypernyms) with 36 ≤ m ≤ 876.
19 Conclusions & Current Work
- Increased efficiency
  - Incremental Merge vs. Join-then-Sort and top-k using static expansions
  - Very good precision/runtime ratio for probabilistic pruning
- Increased retrieval robustness
  - Largely avoids topic drifts
  - Modeling of fine-grained semantic similarities (Incremental Merge & Nested Top-k operators)
- Scalability (see paper)
  - Large expansions (up to 876 terms per query) on Aquaint
  - Experiments on the Terabyte collection
- Efficient support for XML-IR (INEX benchmark)
  - Inverted lists for combined tag-term pairs, e.g., sec=mining
  - Efficiently supports the child-or-descendant axis, e.g., //article//sec=mining
  - Vague content-and-structure queries (VCAS)
  - Incremental Merge over DataGuide-like XPath locators
  - [VLDB 05, Trondheim]
20 Thank you!