TopX: Efficient and Versatile Top-k Query Processing for Text, Structured, and Semistructured Data

Description:

Text, Structured, and Semistructured Data. PhD Defense. May 16th. 2006. Martin Theobald. Max Planck Institute for Informatics. VLDB 05 ' ... – PowerPoint PPT presentation

Provided by: MartinT59
Learn more at: http://www.mpi-inf.mpg.de
1
TopX: Efficient and Versatile Top-k Query
Processing for Text, Structured, and
Semistructured Data
  • PhD Defense
  • May 16th
  • 2006
  • Martin Theobald
  • Max Planck Institute for Informatics

VLDB 05
2
An XML-IR Scenario (INEX IEEE)
//article[.//bib[about(.//item, "W3C")]]
//sec[about(.//, "XML retrieval")]
//par[about(.//, "native XML databases")]
RANKING
VAGUENESS
PRUNING
3
TopX: Efficient XML-IR
Goal: Efficiently retrieve the best results of a
similarity query
  • Extend top-k query processing algorithms for
    sorted lists [Güntzer, Balke & Kießling, VLDB '00;
    Fagin, PODS '01]
    to XML data and XPath-like full-text search
  • Non-schematic, heterogeneous data sources
  • Efficient support for IR-style vague search
  • Combined inverted index for content & structure
  • Avoid full index scans; postpone expensive random
    accesses to large disk-resident data structures
  • Exploit cheap disk space for redundant index
    structures

4
Outline
  • Data & relevance scoring model
  • Database schema & indexing
  • TopX query processing
  • Index access scheduling & probabilistic candidate
    pruning
  • Dynamic query relaxation & expansion
  • Experiments & conclusions

6
Data Model
Example document (article1):
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>
Full-content text node of article: "xml data manage xml manage system vary wide
expressive power native xml data base native xml
data base system store schemaless data"
Full-content text node of sec: "native xml data base native xml data base
system store schemaless data"
ftf("xml", article1) = 4
  • XML tree model
  • Pre/postorder labels for all tags and merged
    tag-term pairs
  • → XPath Accelerator [Grust, SIGMOD '02]
  • Redundant full-content text nodes
  • Full-content term frequencies ftf(t_i, e)
7
Full-Content Scoring Model
Individual element statistics:
tag      N          avg. length  k1    b
article  12,223     2,903        10.5  0.75
sec      96,709     413          10.5  0.75
par      1,024,907  32           10.5  0.75
fig      109,230    13           10.5  0.75
  • Basic scoring idea within the IR-style family of
    TF·IDF ranking functions

bib["transactions"] vs. par["transactions"]
  • Extended Okapi-BM25 probabilistic model for XML
    with element-specific parameterization [VLDB '05,
    INEX '05]
  • Additional static score mass c for relaxable
    structural conditions
    and non-conjunctive ("andish") XPath
    evaluations
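
The slide names the model but does not show the formula, so the following is only a sketch of a standard Okapi-BM25 term weight with per-tag statistics plugged in. The per-tag values N and average length come from the table above; the function name `bm25_element_score` and the element frequency argument `ef` (number of elements with that tag containing the term) are assumptions for illustration.

```python
import math

# (N, average length) per tag, taken from the statistics table on the slide.
STATS = {
    "article": (12_223, 2_903),
    "sec": (96_709, 413),
    "par": (1_024_907, 32),
    "fig": (109_230, 13),
}
K1, B = 10.5, 0.75  # parameters listed on the slide

def bm25_element_score(tag, ftf, length, ef):
    """BM25-style score of one term for one element of the given tag.

    ftf:    full-content term frequency of the term in the element
    length: full-content length of the element (in terms)
    ef:     hypothetical count of `tag` elements that contain the term
    """
    n, avg_length = STATS[tag]
    k = K1 * ((1 - B) + B * length / avg_length)   # length normalization
    tf_part = (K1 + 1) * ftf / (k + ftf)
    idf_part = math.log((n - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part
```

Because N and the average length differ by tag, the same term frequency yields different scores for a `bib` element and a `par` element, which is the point of the bib["transactions"] vs. par["transactions"] comparison.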

9
Inverted Block-Index for Content & Structure
Index lists for sec[xml], title[native], and par[retrieval], read by
Sorted Access (SA) and Random Access (RA):

sec[xml]
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

par[retrieval]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

title[native]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

  • Combined inverted index over merged tag-term
    pairs
    (on redundant element full-contents)
  • Sequential block-scans
  • Group elements in descending order of (maxscore,
    docid) per list
  • Block-scan all elements per doc for a given (tag,
    term) key
  • Stored as inverted files or database tables
    (two B-tree indexes over full range of
    attributes)
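
The block ordering described above can be sketched in a few lines. This assumes an in-memory list instead of inverted files or database tables, and it assumes elements inside one document block are kept in descending score order, which the slide does not state. The function name `build_block_list` is hypothetical; the entries are those of the sec[xml] list.

```python
from collections import defaultdict

# (eid, docid, score, pre, post) entries of the sec[xml] list, in arbitrary order.
ENTRIES = [
    (84, 3, 0.1, 1, 12),
    (9, 2, 0.5, 10, 8),
    (171, 5, 0.85, 1, 20),
    (46, 2, 0.9, 2, 15),
]

def build_block_list(entries):
    """Order entries by (max-score of their document, docid) and append max-score."""
    max_score = defaultdict(float)
    for eid, docid, score, pre, post in entries:
        max_score[docid] = max(max_score[docid], score)
    ordered = sorted(entries, key=lambda e: (-max_score[e[1]], e[1], -e[2]))
    return [e + (max_score[e[1]],) for e in ordered]

block_list = build_block_list(ENTRIES)
```

Sorting this way keeps all elements of a document adjacent, so a single sorted access can fetch the whole block, while documents still appear in descending order of their best element score.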

10
Navigational Index
Query conditions: sec (static score c = 1.0), title[native], par[retrieval].
Sorted Access (SA) on title[native] and par[retrieval];
Random Access (RA) on title[native], par[retrieval], and sec.

sec (element directory)
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

title[native]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    14   10    0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[retrieval]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

  • Additional element directory
  • Random accesses on B-tree index using (docid,
    tag) as key
  • Carefully scheduled probes
  • Schema-oblivious indexing & querying
  • Non-schematic, heterogeneous data sources (no DTD
    required)
  • Supports full NEXI syntax
  • Supports all 13 XPath axes (+ level)
12
TopX Query Processor
  • Adapt the Threshold Algorithm (TA) paradigm [Fagin et
    al., PODS '01]
  • Focus on inexpensive SA & postpone expensive RA
    (NRA & CA)
  • Keep intermediate top-k & enqueue partially
    evaluated candidates
  • Lower/upper score guarantees for each candidate d
  • Remember set of evaluated query dimensions E(d)
  • $\mathrm{worstscore}(d) = \sum_{i \in E(d)} \mathrm{score}(t_i, e_d)$
  • $\mathrm{bestscore}(d) = \mathrm{worstscore}(d) + \sum_{i \notin E(d)} \mathrm{high}_i$
  • Early min-k threshold termination
  • Return current top-k iff $\mathrm{bestscore}(d) \le \text{min-}k$ for every
    candidate d outside the current top-k
  • TopX core engine [VLDB '04]
  • SA batching & efficient queue management
  • Multi-threaded SA & query processing
  • Probabilistic cost model for RA scheduling
  • Probabilistic candidate pruning for approximate
    top-k results
  • XML engine [VLDB '05]
  • Efficiently deals with uncertainty in the
    structure & content ("andish" XPath)
  • Controlled amount of RA (unique among current
    XML-top-k engines)
  • Dynamically switch between document & element
    granularity
13
TopX Query Processing By Example (NRA)
Index lists scanned by sorted access:

title[native]
eid  docid  score  pre  post
216  17     0.9    2    15
72   3      0.8    14   10
51   2      0.5    4    12
671  31     0.4    12   23

par[retrieval]
eid  docid  score  pre  post
3    1      1.0    1    21
28   2      0.8    8    14
182  5      0.75   3    7
96   4      0.75   6    4

sec[xml]
eid  docid  score  pre  post
46   2      0.9    2    15
9    2      0.5    10   8
171  5      0.85   1    20
84   3      0.1    1    12

[Animated figure: top-2 results and candidate queue holding doc1, doc2, doc3, doc5, doc17, and a pseudo-doc; min-2 threshold values shown across the animation frames: 0.0, 0.5, 0.9, 1.6, 1.0]
14
Andish XPath over Element Blocks
[Figure: element blocks of one document d for the conditions bib, item, and w3c, each entry shown as a score with its pre/post label pair (for example 1.0 at 1, 419 and 0.21 at 169, 348); structural conditions carry static scores c = 1.0 or c = 0.2 and are resolved by RA, content conditions are read by SA; the frames show worstscore(d) = 0.14 and the values 0.63, 1.18, 1.38, and 3.69]
  • Incremental non-conjunctive XPath evaluations
    using
  • Hash joins on the content conditions
  • Staircase joins [Grust, VLDB '03] on the
    structure
  • Tight & accurate worstscore(d), bestscore(d)
    bounds for early pruning (ensuring monotonic
    updates)
  • → Virtual support elements for navigation
16
Random Access Scheduling: Minimal Probing
[Figure: element blocks of one candidate; entries resolved by RA: 1.0 at 1, 419; 1.0 at 398, 418; 1.0 at 169, 348; entries read by SA: 0.16 at 351, 389; 0.06 at 354, 353; 0.24 at 354, 353; 0.12 at 354, 353; 0.11 at 351, 389; 0.49 at 174, 324]
  • MinProbe
  • Schedule RAs only for the most promising
    candidates
  • Extending "Expensive Predicates & Minimal
    Probing" [Chang & Hwang, SIGMOD '02]
  • Schedule batch of RAs on d only iff
    $\mathrm{worstscore}(d) + r_d \cdot c > \text{min-}k$
    where worstscore(d) is the evaluated content- &
    structure-related score, r_d · c is the unresolved,
    static structural score mass, and min-k is the
    rank-k worstscore
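
The MinProbe test translates directly into a predicate. In this sketch, `unresolved` stands for r_d, the number of structural conditions of d that have not been checked yet; the function name `should_probe` and the sample numbers are illustrative only.

```python
def should_probe(worstscore, unresolved, c, min_k):
    """MinProbe: probe d only if resolving all open structural conditions
    could lift it above the current rank-k worstscore (min-k)."""
    return worstscore + unresolved * c > min_k

# A candidate with two open structural conditions worth c = 1.0 each
# can still beat min-k = 1.6; with one condition worth c = 0.2 it cannot.
promising = should_probe(0.14, 2, 1.0, 1.6)
hopeless = should_probe(0.14, 1, 0.2, 1.6)
```

Candidates that fail the test stay in the queue without random accesses, since even a fully satisfied structure could not bring them into the top-k.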
17
Cost-based Scheduling (CA): Ben Probing
  • Goal: Minimize overall execution cost
    #SA + (c_R/c_S) · #RA
  • Access costs on d are wasted if d does not make
    it into the final top-k (considering both
    structural selectivities & content scores)
  • Probabilistic cost model comparing different
    types of
  • Expected Wasted Costs
  • EWC-RA_s(d) of looking up d in the remaining
    structure
  • EWC-RA_c(d) of looking up d in the remaining
    content
  • EWC-SA(d) of not seeing d in the next batch of b
    SAs
  • BenProbe: Schedule batch of RAs on d iff
    EWC-RA_{s|c}(d) · c_R/c_S < EWC-SA
  • Bounds the ratio between RA and SA
  • Schedule RAs late & last
  • Schedule RAs in ascending order of EWC-RA_{s|c}(d)
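
The BenProbe rule can be sketched as a filter plus a sort, assuming the expected wasted costs have already been estimated elsewhere. The function names and the sample values are hypothetical; only the comparison and the ascending order come from the slide.

```python
def execution_cost(num_sa, num_ra, cost_ratio):
    """Overall cost #SA + (c_R/c_S) * #RA, with cost_ratio = c_R/c_S."""
    return num_sa + cost_ratio * num_ra

def schedule_random_accesses(ewc_ra, ewc_sa, cost_ratio):
    """ewc_ra: dict doc -> EWC-RA(d). Returns the docs to probe, in probing order.

    A doc is probed only if its expected wasted RA cost, weighted by c_R/c_S,
    is below the expected wasted cost of another batch of sorted accesses.
    """
    chosen = [d for d, waste in ewc_ra.items() if waste * cost_ratio < ewc_sa]
    return sorted(chosen, key=lambda d: ewc_ra[d])   # ascending EWC-RA

# Illustrative estimates only.
order = schedule_random_accesses({"d1": 0.02, "d2": 0.5, "d3": 0.01},
                                 ewc_sa=1.0, cost_ratio=10)
```

Here d2 is skipped because its weighted expected waste (5.0) exceeds the waste of continuing with sorted accesses, and d3 is probed before d1 because it is the safer bet.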

18
Selectivity Estimator [VLDB '05]
Example query: //sec[//figure="java"][//par="xml"][//bib="vldb"]
  • Split the query into a set of basic,
    characteristic XML patterns:
    twigs, paths & tag-term pairs
  • Consider structural selectivities of unresolved
    & non-redundant patterns Y
  • Conjunctive: P_S[d satisfies all structural
    conditions Y]
  • Andish: P_S[d satisfies a subset Y' of structural
    conditions Y]
  • Consider binary correlations between
    structural patterns and/or tag-term pairs
    (data sampling, query logs, etc.)

Patterns shown: //sec//figure//par, //sec//figure//bib, //sec//par//bib,
//sec//figure, //sec//par, //sec//bib, //bib="vldb", //par="xml", //figure="java"
Selectivities shown: p1 = 0.682, p2 = 0.001, p3 = 0.002, p4 = 0.688, p5 = 0.968,
p6 = 0.002, p7 = 0.023, p8 = 0.067, p9 = 0.011

19
Score Predictor [VLDB '04]
  • Consider score distributions of the
    content-related inverted lists
  • P_C[d gets in the final top-k]
  • Probabilistic candidate pruning: Drop d from
    the candidate queue iff P_C[d gets in
    the final top-k] < ε (with probabilistic
    guarantees for relative precision & recall)
  • Convolutions of score histograms (assuming
    independence)

Sampled index lists:

title[native]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5

par[retrieval]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75

  • Closed-form convolutions, e.g., truncated Poisson
  • Moment-generating functions & Chernoff-Hoeffding
    bounds
  • Combined score predictor & selectivity estimator
21
Dynamic and Self-tuning Query Expansion [SIGIR '05]
Example: TREC Robust Topic no. 363
  • Incrementally merge inverted lists for a set of
    active expansions exp(t1)..exp(tm) in descending
    order of scores s(ti, d)
  • Max-score aggregation for fending off topic
    drifts
  • Dynamically expand set of active expansions only
    when beneficial for finding the final top-k
    results
  • Specialized expansion operators
  • Incremental Merge operator
  • Nested Top-k operator (phrase matching)
  • Supports text, structured records & XML
  • Boolean (but ranked) retrieval mode

[Figure: Top-k (transport, tunnel, disaster) operator tree with one SA input per query term, one of them fed by an Incr. Merge operator]
22
Incremental Merge Operator
Inputs:
  • Expansion terms t → {t1, t2, t3}, from thesaurus lookups / relevance feedback
  • Expansion similarities sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5,
    from large corpus statistics (DF, etc.)
  • Initial high-scores, from index list metadata (e.g., histograms)
Merged list for t, read by SA:
d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45,
d1 0.4, d88 0.3, ...
Meta histograms seamlessly integrate the
Incremental Merge operator into probabilistic
scheduling and pruning strategies
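
The merge with max-score aggregation can be sketched with `heapq.merge` from the Python standard library. The three per-term lists below are hypothetical: they were chosen so that scaling them by the slide's similarities 1.0, 0.9, and 0.5 produces the merged scores shown above. Unlike the real operator, this sketch opens all expansion lists up front instead of activating them on demand.

```python
import heapq

# (similarity, score-descending list of (doc, score)) per expansion term.
EXPANSIONS = [
    (1.0, [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]),
    (0.9, [("d64", 0.8), ("d23", 0.8), ("d10", 0.7)]),
    (0.5, [("d11", 0.9), ("d78", 0.9)]),
]

def scaled(sim, postings):
    """Yield (sim * score, doc) lazily for one expansion list."""
    for doc, score in postings:
        yield sim * score, doc

def incremental_merge(expansions):
    """Yield (doc, score) in descending order of sim(t, ti) * s(ti, d).

    Max-score aggregation: only the first, highest-scoring occurrence of a
    document is reported; later duplicates from other expansions are skipped.
    """
    streams = [scaled(sim, postings) for sim, postings in expansions]
    seen = set()
    for score, doc in heapq.merge(*streams, key=lambda entry: -entry[0]):
        if doc not in seen:
            seen.add(doc)
            yield doc, score

merged_docs = [doc for doc, _ in incremental_merge(EXPANSIONS)]
```

Taking the maximum rather than the sum means a document matching many loosely related expansion terms cannot outrank one that matches the original term well, which is how the operator fends off topic drift.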
24
Data Collections & Competitors
  • INEX '04 Ad-hoc Track setting
  • IEEE collection with 12,223 docs & 12M elements in
    534 MB XML data
  • 46 NEXI queries with official relevance judgments
    and a strict quantization
  • e.g., //article[.//bib="QBIC" and
    .//par="image retrieval"]
  • TREC '04 Robust Track setting
  • Aquaint news collection with 528,155 docs in
    1,904 MB text data
  • 50 "hard" queries from TREC Robust Track '04 with
    official relevance judgments
  • e.g., "transportation tunnel disasters" or
    "Hubble telescope achievements"
  • Competitors for XML setup
  • DBMS-style JoinSort
  • Using index full scans on the TopX index
    (Holistic Twig Joins)
  • StructIndex [Kaushik et al., SIGMOD '04]
  • Top-k with separate indexes for content &
    structure
  • DataGuide-like structural index
  • Eager RAs (Fagin's TA)
  • StructIndex+
  • Extent chaining technique for DataGuide-based
    extent identifiers
25
INEX: TopX vs. JoinSort & StructIndex
46 NEXI Queries
26
IMDB Results
20 NEXI Queries
Method         k   epsilon  SA          RA       CPU sec
JoinSort       10  n/a      14,510,077  0        754.00
StructIndex    10  n/a      346,697     291,655  3.20
StructIndex+   10  n/a      22,445      301,647  3.40
TopX MinProbe  10  0.0      317,380     72,196   1.60
TopX BenProbe  10  0.0      241,471     50,016   1.20
Further columns on the slide: P@k, MAP@k, rel.Prec (surviving values: n/a, 1.00)
27
INEX: TopX with Probabilistic Pruning
46 NEXI Queries
28
TREC Robust: Dynamic vs. Static Query Expansion
  • Careful WordNet expansions using automatic Word
    Sense Disambiguation & phrase detection [WebDB
    '03, PKDD '05] with (m < 118)
  • MinProbe RA scheduling for phrase matching
    (auxiliary term-offset table)
  • Incremental Merge + Nested Top-k (m_top < 22) vs.
    Static Expansions (m_top < 118)

50 Keyword + Phrase Queries
29
Conclusions
  • Efficient and versatile TopX query processor
  • Extensible framework for XML-IR & full-text
    search
  • Very good precision/runtime ratio for
    probabilistic candidate pruning
  • Self-tuning solution for robust query expansions
    & IR-style vague search
  • Combined SA and RA scheduling close to lower
    bound for CA access cost [submitted for VLDB '06]
  • Scalability
  • Optimized for query processing I/O
  • Exploits cheap disk space for redundant index
    structures
    (constant redundancy factor of 4-5 for INEX
    IEEE)
  • Extensive TREC Terabyte runs with 25,000,000 text
    documents (426 GB)
  • INEX 2006
  • New Wikipedia XML collection with 660,000
    documents & 120,000,000 elements (6 GB raw XML)
  • Official host for the Topic Development and
    Interactive Track
    (69 groups registered worldwide)
  • TopX WebService available (SOAP connector)

30
XML-IR History and Related Work
[Timeline figure with the marks 1995, 2000, and 2005]
  • IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U
    Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
  • Web query languages: W3QS (Technion Haifa), Araneus (U Roma),
    Lorel (Stanford U), WebSQL (U Toronto)
  • IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland
    / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U
    Dublin), JuruXML (IBM Haifa), XSearch (Hebrew
    U), Timber (U Michigan), XRank & Quark (Cornell
    U), FleXPath (AT&T Labs), XKeyword (UCSD)
  • XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C),
    NEXI (INEX Benchmark), XPath 2.0 & XQuery
    1.0 Full-Text (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C),
    TeXQuery (AT&T Labs)
  • Commercial software: MarkLogic, Verity?,
    IBM?, Oracle?, ...
2005
31
That's it. Thank you!
32
TREC Terabyte: Comparison of Scheduling Strategies
Thanks to Holger Bast & Deb Majumdar!