Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing

1
Efficient and Self-tuning Incremental Query
Expansions for Top-k Query Processing
  • Martin Theobald
  • Ralf Schenkel
  • Gerhard Weikum
  • Max Planck Institute for Informatics
  • Saarbrücken
  • Germany

ACM SIGIR '05
2
An Initial Example
  • TREC Robust Track '04, hard query no. 363 (Aquaint
    news corpus)
  • "transportation tunnel disasters"

  • Increased robustness
  • Count only the best match per document and
    expansion set
  • Increased efficiency
  • Top-k-style query evaluations
  • Open scans on new terms only on demand
  • No threshold tuning

[Figure: the three query terms transportation, tunnel, disasters, each weighted 1.0, with weighted expansion sets:
  transportation → {transit, highway, train, truck, metro, rail, car}
  tunnel → {tube, underground, Mont Blanc}
  disasters → {catastrophe, accident, fire, flood, earthquake, landslide}
Expansion terms carry similarity weights between 0.1 and 1.0; documents d1 and d2 are scored by the best match per expansion set]
  • Expansion terms from relevance feedback,
    thesaurus lookups, Google top-10 snippets, etc.
  • Term similarities, e.g., Robertson/Sparck-Jones weights,
    concept similarities,
    or other correlation measures

3
Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

4
Computational Model
  • Vector space model with a Cartesian product space
    D1 × … × Dm and a data set D ⊆ D1 × … × Dm ⊆ ℝ^m
  • Precomputed local scores s(ti,d) ∈ Di for all d ∈ D
  • e.g., TF·IDF variations, probabilistic models
    (Okapi BM25), etc.
  • typically normalized to s(ti,d) ∈ [0,1]
  • Monotonic score aggregation
  • aggr: (D1 × … × Dm) → ℝ
  • e.g., sum, max, product (as sum over log sij),
    cosine (L2 norm)
  • Partial-match queries (a.k.a. "andish")
  • Non-conjunctive query evaluations
  • Weak local matches can be compensated
  • Access model
  • Inverted index over a large text corpus
  • Inverted lists sorted by decreasing local scores
  • → Inexpensive sequential accesses to per-term
    lists: getNextItem()
  • → More expensive random accesses:
    getItemBy(docid)
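To make the access model concrete, here is a minimal Python sketch of an inverted list with the two access primitives above (the in-memory structure and the names get_next_item/get_item_by are illustrative, not the paper's implementation):

```python
class InvertedList:
    """Illustrative in-memory inverted list, sorted by decreasing local score."""

    def __init__(self, postings):
        # postings: iterable of (docid, local_score) pairs
        self.postings = sorted(postings, key=lambda p: -p[1])
        self.pos = 0
        self.by_doc = dict(self.postings)  # docid -> score, for random access

    def get_next_item(self):
        """Inexpensive sequential access: the next posting in score order."""
        if self.pos >= len(self.postings):
            return None
        item = self.postings[self.pos]
        self.pos += 1
        return item

    def get_item_by(self, docid):
        """More expensive random access: look up a document's local score."""
        return self.by_doc.get(docid)
```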

5
No-Random-Access (NRA) Algorithm [Fagin et al., PODS '02]
  1. NRA(q, L):
  2. scan all lists Li (i = 1..m) in parallel  // e.g., round-robin
  3.   ⟨d, s(ti,d)⟩ := Li.getNextItem()
  4.   E(d) := E(d) ∪ {i}
  5.   high_i := s(ti,d)
  6.   worstscore(d) := Σ_{ν ∈ E(d)} s(tν,d)
  7.   bestscore(d) := worstscore(d) + Σ_{ν ∉ E(d)} high_ν
  8.   if worstscore(d) > min-k then
  9.     add d to top-k
  10.    min-k := min{worstscore(d') | d' ∈ top-k}
  11.  else if bestscore(d) > min-k then
  12.    candidates := candidates ∪ {d}
  13.  if max{bestscore(d') | d' ∈ candidates} ≤ min-k then return top-k
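A compact executable rendering of the NRA loop above (a sketch, not the authors' code: it recomputes the bounds naively each round, where a real implementation would maintain them incrementally):

```python
from collections import defaultdict

def nra(lists, k):
    """lists: iterators over (docid, score) pairs, each sorted by descending score."""
    m = len(lists)
    high = [1.0] * m                 # high_i: last score seen in list i
    seen = defaultdict(dict)         # docid -> {list index: local score}
    while True:
        progressed = False
        for i, li in enumerate(lists):          # round-robin scan
            item = next(li, None)
            if item is None:
                high[i] = 0.0                   # list i is exhausted
                continue
            d, s = item
            seen[d][i] = s
            high[i] = s
            progressed = True
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        topk = sorted(seen, key=worst.get, reverse=True)[:k]
        min_k = worst[topk[-1]] if len(topk) == k else 0.0
        others = [d for d in seen if d not in topk]
        # an entirely unseen document can score at most sum(high)
        done = (len(topk) == k and sum(high) <= min_k
                and all(best[d] <= min_k for d in others))
        if done or not progressed:
            return [(d, worst[d]) for d in topk]
```

Run against the three example lists below (as iterators) with k = 1, this terminates at scan depth 3 with d10 as the top result, matching the figure.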

[Figure: query q = (t1, t2, t3) against an inverted index over corpus {d1, …, dn} with precomputed local scores, e.g., s(t1,d1) = 0.7, …, s(tm,d1) = 0.2; the score-sorted inverted lists begin
  t1: d78:0.9, d23:0.8, d10:0.8, …
  t2: d64:0.8, d23:0.6, d10:0.6, …
  t3: d10:0.7, d78:0.5, d64:0.4, …
and are scanned round-robin]

Top-k bounds per scan depth (k = 1):

Scan depth 1:
Rank | Doc | Worst-score | Best-score
1 | d78 | 0.9 | 2.4
2 | d64 | 0.8 | 2.4
3 | d10 | 0.7 | 2.4

Scan depth 2:
Rank | Doc | Worst-score | Best-score
1 | d78 | 1.4 | 2.0
2 | d23 | 1.4 | 1.9
3 | d64 | 0.8 | 2.1
4 | d10 | 0.7 | 2.1

Scan depth 3 (STOP: all candidates' best-scores ≤ min-k):
Rank | Doc | Worst-score | Best-score
1 | d10 | 2.1 | 2.1
2 | d78 | 1.4 | 2.0
3 | d23 | 1.4 | 1.8
4 | d64 | 1.2 | 2.0

Naive Join-then-Sort instead takes between O(mn) and O(mn²) runtime.

6
Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

7
Dynamic Self-tuning Query Expansions
top-k (t1,t2,t3)
  • Incrementally merge inverted lists Li1, …, Lim in
    descending order of local scores
  • Dynamically add lists to the set of active
    expansions exp(ti) according to the combined term
    similarities and local scores
  • Best-match score aggregation (see the formula after this slide)

[Figure: incremental merge operators over the expansion lists of t1 and t2 feed candidate documents (d66, d95, d93, d17, d11, d99, d101, …) to the top-k operator]
  • Increased retrieval robustness: fewer topic
    drifts
  • Increased efficiency through fewer active
    expansions
  • No threshold tuning required!
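Written out, the best-match aggregation sketched above amounts to the following (reconstructed from the slide's description; only the best expansion match per document and per expansion set contributes):

```latex
% Best-match score aggregation over the active expansion sets exp(t_i):
\mathrm{score}(d) \;=\; \sum_{i=1}^{m} \; \max_{t' \in \exp(t_i)} \; \mathrm{sim}(t_i, t') \cdot s(t', d)
```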

8
Incremental Merge Operator
Index list meta data (e.g., histograms), initial high-scores
Expansion terms from relevance feedback and thesaurus lookups: t → {t1, t2, t3}
Expansion similarities from correlation measures and large-corpus statistics:
sim(t, t1) = 1.0
sim(t, t2) = 0.9
sim(t, t3) = 0.5
Incremental Merge iteratively triggered by the top-k
operator via getNextItem()
[Figure: merged output in descending order of similarity-weighted scores:
d78:0.9, d23:0.8, d10:0.8, d64:0.72, d23:0.72, d10:0.63, d11:0.45, d78:0.45, d1:0.4, d88:0.3, …]
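A heap-based Python sketch of this merge step (names assumed; it scales each list's scores by its expansion similarity and emits postings in globally descending order, pulling from a list only when its head surfaces):

```python
import heapq

def incremental_merge(lists, sims):
    """lists: iterators over (docid, score), each sorted by descending score;
    sims[i] = sim(t, t_i). Yields (docid, sims[i] * score) in global order.
    Correct because scaling by a constant preserves each list's order."""
    heap = []
    for i, li in enumerate(lists):
        item = next(li, None)
        if item is not None:
            d, s = item
            heapq.heappush(heap, (-sims[i] * s, i, d))  # max-heap via negation
    while heap:
        neg_score, i, d = heapq.heappop(heap)
        yield d, -neg_score
        item = next(lists[i], None)                     # refill from list i
        if item is not None:
            d2, s2 = item
            heapq.heappush(heap, (-sims[i] * s2, i, d2))
```

With sims = [1.0, 0.9, 0.5] as on this slide, the merged stream starts d78:0.9, d23:0.8, d10:0.8, d64:0.72, …, exactly as in the figure.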
9
Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

10
Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB '04]
  • For each physical index list Li:
  • Treat each s(ti,d) ∈ [0,1] as a random variable
    Si and consider its score distribution
  • Approximate the local score distribution using an
    equi-width histogram with n buckets

11
Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB '04]
  • For each physical index list Li:
  • Treat each s(ti,d) ∈ [0,1] as a random variable
    Si and consider its score distribution
  • Approximate the local score distribution using an
    equi-width histogram with n buckets
  • For a virtual index list Li = Li1 ∪ … ∪ Lim:
  • Consider the max-distribution of the
    similarity-weighted scores
  • Alternatively, construct a meta histogram for the
    active expansions
12
Probabilistic Candidate Pruning [Theobald, Schenkel, Weikum, VLDB '04]
  • For each physical index list Li:
  • Treat each s(ti,d) ∈ [0,1] as a random variable
    Si and consider its score distribution
  • Approximate the local score distribution using an
    equi-width histogram with n buckets
  • For a virtual index list Li = Li1 ∪ … ∪ Lim:
  • Consider the max-distribution of the
    similarity-weighted scores
  • Alternatively, construct a meta histogram for the
    active expansions
  • For all d in the candidate queue:
  • Consider the convolution over the score distributions
    to bound the distribution of d's aggregated score
  • Drop d from the candidate queue if
    P[worstscore(d) + Σ_{i ∉ E(d)} Si > min-k] ≤ ε
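A rough Python sketch of this pruning test under the definitions above (it assumes independent per-list score variables and equi-width histograms over [0,1]; the bucket granularity and the independence assumption are simplifications of the paper's method):

```python
import numpy as np

def score_histogram(scores, n=10):
    """Equi-width histogram over [0,1], normalized to a probability mass function."""
    counts, _ = np.histogram(scores, bins=n, range=(0.0, 1.0))
    return counts / max(counts.sum(), 1)

def sum_distribution(pmfs):
    """PMF of the sum of independent score variables S_i (discrete convolution)."""
    out = np.array([1.0])
    for pmf in pmfs:
        out = np.convolve(out, pmf)
    return out

def drop_candidate(worstscore_d, missing_pmfs, min_k, eps, n=10):
    """Drop d if P[worstscore(d) + sum of missing S_i > min-k] <= eps."""
    conv = sum_distribution(missing_pmfs)
    delta = np.arange(len(conv)) / n   # approximate score mass per bucket
    p_exceed = conv[worstscore_d + delta > min_k].sum()
    return p_exceed <= eps
```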

13
Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

14
Incremental Merge for Multidimensional Predicates
q = (undersea, "fiber optic cable")
  • Nested Top-k operator iteratively prefetches and
    joins candidate items for each subquery condition
    via getNextItem()
  • Provides worstscore(d) and bestscore(d) guarantees
    to the superordinate top-k operator
  • Propagates candidates in descending order of
    bestscore(d) values for monotonicity
  • Top-level top-k operator performs phrase tests
    only for the most promising items (random I/O)
  • (Expensive predicates & minimal probes
    [Chang & Hwang, SIGMOD '02])
  • Single threshold condition for algorithm
    termination (candidate pruning at the top-level
    queue only)

[Figure: nested top-k operator tree over the subqueries, with
sim("fiber optic cable", "fiber optic cable") = 1.0 and
sim("fiber optic cable", "fiber optics") = 0.8;
phrase tests via random access to a term-to-position index]
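The phrase test itself reduces to a positional join; a minimal Python sketch against an assumed term-to-position structure (pos_index mapping (docid, term) to sorted word offsets is illustrative, not the paper's index layout):

```python
def contains_phrase(pos_index, docid, terms):
    """True iff the terms occur contiguously, in order, in document docid."""
    offsets = [pos_index.get((docid, t)) for t in terms]
    if not all(offsets):
        return False
    rest = [set(o) for o in offsets[1:]]
    # the phrase occurs iff some start offset p has terms[i] at p + i for all i
    return any(all(p + i + 1 in s for i, s in enumerate(rest))
               for p in offsets[0])
```

Because each such probe costs a random access, the top-level operator invokes it only for the most promising candidates, as the slide notes.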
15
Outline
  • Computational model & background on top-k algorithms
  • Incremental Merge over inverted lists
  • Probabilistic candidate pruning
  • Phrase matching
  • Experiments & Conclusions

16
Experiments: Aquaint with Fixed Expansions
  • Aquaint corpus of English news articles (528,155
    docs)
  • 50 hard queries from TREC 2004 Robust track
  • WordNet expansions using a simple form of WSD
  • Okapi BM25 model for local scores, Dice
    coefficients as term similarities
  • Fixed expansion technique (synonyms + first-order
    hyponyms) with m ≤ 118

Method | m | # SA | # RA | Time (s) | Memory | P@10 | MAP | Rel. prec.
--- Title-only ---
Join&Sort | 4 | 2,305,637 | – | – | – | – | – | –
NRA-baseline | 4 | 1,439,815 | 0 | 9.4 | 432 KB | 0.252 | 0.092 | 1.000
--- Static Expansions ---
Join&Sort | 118 | 20,582,764 | – | – | – | – | – | –
NRA+Phrases | 118 | 18,258,834 | 210,531 | 245.0 | 37,355 KB | 0.286 | 0.105 | 1.000
NRA+Phrases (pruned) | 118 | 3,622,686 | 49,783 | 79.6 | 5,895 KB | 0.238 | 0.086 | 0.541
--- Dynamic Expansions ---
Incr.Merge | 118 | 7,908,666 | 53,050 | 159.1 | 17,393 KB | 0.310 | 0.118 | 1.000
Incr.Merge (pruned) | 118 | 5,908,017 | 48,622 | 79.4 | 13,424 KB | 0.298 | 0.110 | 0.786
(SA/RA = sequential/random index accesses; "pruned" rows use probabilistic candidate pruning)
17
Experiments: Aquaint with Fixed Expansions (cont'd)
[Figure: probabilistic pruning performance, Incremental Merge vs. top-k with static expansions; epsilon controls the pruning aggressiveness]
18
Experiments: Aquaint with Large Expansions
[Figure: query expansion performance, Incremental Merge vs. top-k with static expansions; theta controls the expansion size]
Aggressive expansion technique (synonyms + hyponyms + hypernyms) with 36 ≤ m ≤ 876
19
Conclusions & Current Work
  • Increased efficiency
  • Incremental Merge vs. Join-then-Sort & top-k
    using static expansions
  • Very good precision/runtime ratio for
    probabilistic pruning
  • Increased retrieval robustness
  • Largely avoids topic drifts
  • Modeling of fine-grained semantic similarities
  • (Incremental Merge & Nested Top-k operators)
  • Scalability (see paper)
  • Large expansions (up to 876 terms per query) on
    Aquaint
  • Experiments on the Terabyte collection
  • Efficient support for XML-IR (INEX benchmark)
  • Inverted lists for combined tag-term pairs,
    e.g., sec=mining
  • Efficiently supports the child-or-descendant axis,
    e.g., //article//sec=mining
  • Vague content-and-structure queries (VCAS),
    e.g., //article//sec=mining
  • Incremental Merge over DataGuide-like XPath
    locators
  • [VLDB '05, Trondheim]

20
Thank you!