Title: Indexing and Searching XML Documents based on Content and Structure Synopses
1. Indexing and Searching XML Documents based on Content and Structure Synopses
- Weimin He, Leonidas Fegaras, David Levine
- University of Texas at Arlington
- http://lambda.uta.edu
2. Outline
- Motivation
- Key Contributions
- Related Work
- Data Synopses Indexing
- Query Processing
- Experimental Results
- Conclusion
3. Why not Google?
- Need to query both structure and content
- an opportunity for more precise search
- Keyword queries are NOT adequate for XML search
- An example query beyond Google
- Find the price of the book whose author's last name is Smith and whose title contains "XML" and "SAX"
- Semantic search using an XPath query
- //book[author/lastname ~ "Smith"][title ~ "XML" and "SAX"]/price
- Simpler query formats cannot express complex containment relationships
- (lastname, Smith), (title, XML SAX), price
- Fully indexing XML data is neither efficient nor scalable
4. Key Contributions
- A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents
- Two novel data synopsis structures that can achieve higher query precision and scalability
- A hash-based processing algorithm to speed up searching
- A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision
5. Related Work
- Extend keyword queries to XML
- XRank
- XKSearch
- Integrate IR constructs and scoring into XQuery
- TIX
- TeXQuery
- XML Summarization Techniques
- XSketch
- XCluster
6. System Architecture
7. Specification of Search Queries
- XPath is extended with a simple IR syntax
- Queries may contain predicates of the form e ~ S
- e is an XPath expression
- S is a search predicate that takes one of the forms
- term | S1 and S2 | S1 or S2 | (S)
- A running query example
- //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
- Query result
- A list of document locations (path names) that satisfy the query
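A minimal sketch of how a search predicate S (a term, a conjunction, a disjunction, or a parenthesized predicate) could be represented and evaluated against the text of one element. The class names `Term`, `And`, `Or` and the helpers `tokenize` and `matches` are hypothetical, introduced only for illustration; the slides do not show the prototype's actual data structures.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Term:
    word: str


@dataclass(frozen=True)
class And:
    left: "Pred"
    right: "Pred"


@dataclass(frozen=True)
class Or:
    left: "Pred"
    right: "Pred"


# A parenthesized predicate (S) is just a nested node, so it needs no class.
Pred = Union[Term, And, Or]


def tokenize(text: str) -> Set[str]:
    """Assumed tokenizer: lowercase, whitespace-separated words."""
    return {w.lower() for w in text.split()}


def matches(pred: Pred, words: Set[str]) -> bool:
    """Evaluate a search predicate against the set of words of an element."""
    if isinstance(pred, Term):
        return pred.word.lower() in words
    if isinstance(pred, And):
        return matches(pred.left, words) and matches(pred.right, words)
    if isinstance(pred, Or):
        return matches(pred.left, words) or matches(pred.right, words)
    raise TypeError(f"unknown predicate node: {pred!r}")


# The description predicate of the running example: mountain and bicycle
description_pred = And(Term("mountain"), Term("bicycle"))
print(matches(description_pred, tokenize("Red mountain bicycle for sale")))
```

This evaluates predicates exactly, on the raw text; the synopses introduced on the following slides approximate the same test without storing the text.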
8. Data Indexing
- Structural Summary (SS)
- A tree that captures all unique paths in an XML document
- It is constructed from XML data incrementally
- Each SS node corresponds to a unique full label path
- e.g., SS node 9: /auction/sponsor/address
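The idea of a structural summary can be sketched as a tree keyed by tag name, where repeated elements with the same label path collapse into one node. This is an illustrative sketch using Python's standard `xml.etree.ElementTree`; the names `SSNode`, `build_summary`, and `label_paths` are hypothetical, and the real prototype may build and store the summary differently.

```python
import xml.etree.ElementTree as ET


class SSNode:
    """One node of the structural summary: a unique full label path."""

    def __init__(self, tag):
        self.tag = tag
        self.children = {}  # child tag -> SSNode


def build_summary(xml_text):
    """Build the summary of one document by merging element paths one by one."""
    virtual_root = SSNode("")  # hypothetical holder for the document root

    def merge(ss_node, element):
        # Reuse the existing child for this tag, or create it on first sight.
        child = ss_node.children.setdefault(element.tag, SSNode(element.tag))
        for sub in element:
            merge(child, sub)

    merge(virtual_root, ET.fromstring(xml_text))
    return virtual_root


def label_paths(ss_node, prefix=""):
    """List every full label path captured by the summary."""
    paths = []
    for tag, child in ss_node.children.items():
        path = prefix + "/" + tag
        paths.append(path)
        paths.extend(label_paths(child, path))
    return paths


doc = (
    "<auction><sponsor><address>A</address></sponsor>"
    "<item><price>1</price></item>"
    "<item><price>2</price><location>Dallas</location></item></auction>"
)
for p in label_paths(build_summary(doc)):
    print(p)
```

The two `item` elements contribute a single `/auction/item` node, which is why the summary stays small even for large documents.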
9. Data Indexing (cont.)
- Content Synopsis (CS)
- Summarizes the text associated with an SS node in an XML document
- Approximated as a bit matrix of size W × L
- L is fixed but W may depend on the document size
- Stored as a B-tree that implements the mapping
- (SSnode, doc) → bit-matrix
- Used in evaluating search predicates in the query
- Positional Filter (PF)
- Captures the position spans of all XML elements associated with an SS node in an XML document
- Represented as a bit matrix of size M × L, where M ≥ 2
- Stored as a B-tree that implements the mapping
- (SSnode, doc) → bit-matrix
- Used in enforcing containment constraints among query predicates
- Do we need the positional dimension?
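One way to picture a content synopsis is as a Bloom-filter-like matrix with W term rows and L position columns. The sketch below assumes that each term is hashed to a single row and that an element's position span is scaled to a range of columns; both the hash function (`zlib.crc32`) and the scaling rule are assumptions made for illustration, not details given on the slides.

```python
import zlib


class ContentSynopsis:
    """A W x L bit matrix; each row is stored as an L-bit Python integer."""

    def __init__(self, width, length):
        self.width = width    # W: number of term rows
        self.length = length  # L: number of position columns
        self.rows = [0] * width

    def _row(self, term):
        # Assumed hash: CRC32 of the lowercased term, modulo W.
        return zlib.crc32(term.lower().encode("utf-8")) % self.width

    def _col(self, position, doc_size):
        # Assumed scaling of a document position into one of L columns.
        return min(position * self.length // doc_size, self.length - 1)

    def add(self, term, begin, end, doc_size):
        """Record that `term` occurs in an element spanning [begin, end]."""
        row = self._row(term)
        for col in range(self._col(begin, doc_size), self._col(end, doc_size) + 1):
            self.rows[row] |= 1 << col

    def positions(self, term):
        """Bit vector of the positions where `term` may occur."""
        return self.rows[self._row(term)]


cs = ContentSynopsis(width=64, length=16)
cs.add("mountain", begin=10, end=20, doc_size=100)
cs.add("bicycle", begin=15, end=20, doc_size=100)
cs.add("bicycle", begin=80, end=90, doc_size=100)

# "mountain and bicycle": intersect the two position vectors.
both = cs.positions("mountain") & cs.positions("bicycle")
print(bin(both))
```

Under these assumptions, `and` becomes a bitwise AND and `or` a bitwise OR of position vectors. Hash collisions can produce false positives but never false negatives, and keeping the positional dimension lets the filter reject documents where the two terms occur far apart.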
10. Data Synopsis Example
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
11. Containment Filtering
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
12. Query Processing Overview
- Query Footprint (QF) Extraction
- Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
- QF: //auction//item:0[location:1][description:2]/price
- Structural Summary Matching
- Retrieve all structural summaries that match the QF
- We use the standard preorder numbering scheme to represent an SS
- An SS is stored as a B-tree that implements the mapping
- tag → (SS, SSnode, begin, end, level)
- We use containment joins to retrieve the qualified full label paths that match the entry points in the QF
- /auction/item, /auction/item/location, /auction/item/description
- Containment Filtering
- Qualified document locations are collected and returned
- The unit of query processing is a mapping from a doc to a bit matrix of size M × L (positions)
- An empty bit matrix means an unqualified document
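The structural summary matching step can be illustrated with a small preorder numbering and a nested-loop containment join. This is a sketch under simplifying assumptions: the summary is given as nested `(tag, children)` tuples, each index entry stores the full label path in place of the (SS, SSnode) pair, and the join is a naive nested loop. The names `Entry`, `number_summary`, and `containment_join` are hypothetical.

```python
from collections import defaultdict, namedtuple

# Simplified index entry: the full label path stands in for (SS, SSnode).
Entry = namedtuple("Entry", "path begin end level")


def number_summary(tree):
    """Assign preorder numbers; return the mapping tag -> list of entries."""
    index = defaultdict(list)
    counter = 0

    def visit(node, prefix, level):
        nonlocal counter
        tag, children = node
        path = prefix + "/" + tag
        begin = counter          # preorder number of this node
        counter += 1
        for child in children:
            visit(child, path, level + 1)
        # end = largest preorder number inside this node's subtree
        index[tag].append(Entry(path, begin, counter - 1, level))

    visit(tree, "", 0)
    return index


def containment_join(ancestors, descendants, child_only=False):
    """Pair each ancestor entry with the descendant entries it contains."""
    pairs = []
    for a in ancestors:
        for d in descendants:
            inside = a.begin < d.begin and d.end <= a.end
            if inside and (not child_only or d.level == a.level + 1):
                pairs.append((a, d))
    return pairs


summary = ("auction", [
    ("sponsor", [("address", [])]),
    ("item", [("location", []), ("description", []), ("price", [])]),
])
index = number_summary(summary)

# //auction//item, then item/location
items = [d for _, d in containment_join(index["auction"], index["item"])]
locations = [d for _, d in containment_join(items, index["location"], child_only=True)]
print([e.path for e in items], [e.path for e in locations])
```

With this numbering, "a contains d" reduces to two integer comparisons, so the `//` and `/` steps of a query footprint can be resolved without walking the tree.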
13. Two-Phase Containment Filtering
- Many sources of inefficiency
- A large number of full label paths may match a single generic XPath query
- A long list of data synopses has to be retrieved for each label path in a QF
- The retrieved lists of data synopses have to be correlated at each step during containment filtering
- Solution
- Aggregate data synopses lists from multiple documents into a single bit matrix, called Document Synopsis, of size W × D
- path → bit-matrix
- so that, given a term t and a full label path p, the document doc is a candidate if the document synopsis for p is set at (hash(t), hash(doc))
- Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering
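The pruning phase can be sketched as a lookup in a W × D bit matrix kept per full label path. The sketch assumes CRC32-based hashes for both terms and document names and handles only conjunctions of terms; the names `DocumentSynopsis`, `bucket`, and `prune` are hypothetical and not taken from the prototype.

```python
import zlib


def bucket(value, modulus):
    """Assumed hash: CRC32 of the string, modulo the matrix dimension."""
    return zlib.crc32(value.encode("utf-8")) % modulus


class DocumentSynopsis:
    """A W x D bit matrix for one label path; rows are D-bit integers."""

    def __init__(self, width, docs):
        self.width = width  # W: term buckets
        self.docs = docs    # D: document buckets
        self.rows = [0] * width

    def add(self, term, doc):
        self.rows[bucket(term, self.width)] |= 1 << bucket(doc, self.docs)

    def may_contain(self, term, doc):
        row = self.rows[bucket(term, self.width)]
        return bool((row >> bucket(doc, self.docs)) & 1)


def prune(synopses, path, terms, candidates):
    """Phase 1: keep only documents that may hold every term under `path`."""
    ds = synopses[path]
    return [d for d in candidates if all(ds.may_contain(t, d) for t in terms)]


path = "/article/body/section/p"
ds = DocumentSynopsis(width=64, docs=32)
ds.add("hockey", "doc1.xml")
ds.add("patterns", "doc1.xml")
ds.add("hockey", "doc2.xml")

kept = prune({path: ds}, path, ["hockey", "patterns"],
             ["doc1.xml", "doc2.xml", "doc3.xml"])
print(kept)
```

Only the documents that survive this cheap test need their content synopses and positional filters fetched for the second, exact containment filtering phase. Collisions may let an unqualified document through, but a qualified one is never dropped.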
14. Document Synopsis
The document synopsis for /biblio/book/paragraph
15. Experimental Setup
- A prototype system is implemented in Java
- Employed Berkeley DB Java Edition 3.2.13 as a storage manager
- Datasets
- XMark
- XBench

Data Set | Data Size (MB) | Files | Avg. File Size (KB) | Avg. SS Size (bytes) | Avg. CS Size (bytes) | Avg. PF Size (bytes)
XBench   | 1050           | 2666  | 394                 | 432                  | 20564                | 178
XMark    | 55.8           | 11500 | 5                   | 417                  | 306                  | 16
16. Query Workload

Dataset | Query | Query Expression
XMark   | Q1    | /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark   | Q2    | //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark   | Q3    | /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark   | Q4    | //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark   | Q5    | /site//item[description//text ~ "gold"]/name
XMark   | Q6    | /regions//item[description//text ~ "character"]/payment
XMark   | Q7    | //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark   | Q8    | //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark   | Q9    | //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark   | Q10   | //closed_auction[annotation//text ~ "country"]/type
XBench  | Q11   | /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench  | Q12   | //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench  | Q13   | /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench  | Q14   | /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench  | Q15   | /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench  | Q16   | /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench  | Q17   | //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench  | Q18   | //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench  | Q19   | //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench  | Q20   | //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title
17. Indexing Scheme Comparison
ILI: a standard XML indexing scheme based on full inverted lists
DSI: our indexing scheme based on data synopses
18. Query Precision Measurement
ODBF: one-dimensional Bloom filters
TDBF: two-dimensional Bloom filters
19. Efficiency of Optimization Algorithm
OPCF: one-phase containment filtering
TPCF: two-phase containment filtering
20. Future Research Directions
- Develop an effective ranking function
- Adopt top-k algorithms to improve search efficiency
- Apply our framework to structured P2P networks
- Evaluate our framework over INEX data