Indexing and Searching XML Documents based on Content and Structure Synopses

Transcript and Presenter's Notes

1
Indexing and Searching XML Documents based on
Content and Structure Synopses

Weimin He, Leonidas Fegaras, David Levine
University of Texas at Arlington
http://lambda.uta.edu

2
Outline

Motivation
Key Contributions
Related Work
Data Synopses Indexing
Query Processing
Experimental Results
Conclusion

3
Why not Google?

Need to query both structure and content:
an opportunity for more precise search
Keyword queries are NOT adequate for XML search
An example query beyond Google:
Find the price of the book whose author's
lastname is Smith and whose title contains
XML and SAX
Semantic search using an XPath query:
//book[author/lastname ~ "Smith"][title ~ "XML" and "SAX"]/price
Simpler query formats cannot express complex
containment relationships:
(lastname, Smith), (title, XML SAX), price
Fully indexing XML data is neither efficient nor
scalable

4
Key Contributions

A framework for indexing and searching
schema-less XML documents based on data synopses
extracted from documents
Two novel data synopsis structures that can
achieve higher query precision and scalability
A hash-based processing algorithm to speed up
searching
A prototype implementation to evaluate the
performance of the indexing scheme and to
validate the data synopsis precision

5
Related Work

Extend keyword queries to XML
XRank
XKSearch
Integrate IR constructs and scoring into XQuery
TIX
TeXQuery
XML Summarization Techniques
XSketch
XCluster

6
System Architecture

7
Specification of Search Queries

XPath is extended with a simple IR syntax
Queries may contain predicates of the form [e ~ S]
e is an XPath expression
S is a search predicate that takes one of the forms
term | S1 and S2 | S1 or S2 | (S)
A running query example:
//auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
Query result
A list of document locations (path names) that
satisfy the query
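
The search-predicate grammar above (term, and, or, parentheses) can be illustrated with a small evaluator. This is only a sketch under stated assumptions: terms are written in double quotes, `and` binds tighter than `or`, and a term matches when it appears as a whole word in the element text. The function names `tokenize` and `matches` are hypothetical and not part of the presented system.

```python
import re

def tokenize(pred):
    # Quoted terms, parentheses, and the keywords and/or (assumed syntax).
    return re.findall(r'"[^"]*"|\(|\)|\band\b|\bor\b', pred)

def matches(pred, text):
    """Return True if `text` satisfies the search predicate `pred`."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = tokenize(pred)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of predicate")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in (")", "and", "or"):
            raise ValueError("unexpected token: " + tok)
        # A quoted term: test whole-word membership in the text.
        return tok.strip('"').lower() in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing token")
    return result

print(matches('"mountain" and "bicycle"', "A sturdy mountain bicycle."))
```

In the presented system such predicates are evaluated against synopses rather than the raw text, so this evaluator only shows the intended semantics of S.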

8
Data Indexing

Structural Summary (SS)
A tree that captures all unique paths in an XML
document
It is constructed from XML data incrementally
Each SS node corresponds to a unique full label
path
Example: SS node 9 corresponds to /auction/sponsor/address
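
As a rough illustration of how a structural summary could be collected, the sketch below walks one XML document and records each unique full label path the first time it is seen. The function name `build_structural_summary` and the choice to number nodes in order of first appearance are assumptions for illustration; the slides do not specify the actual construction.

```python
import xml.etree.ElementTree as ET

def build_structural_summary(xml_text):
    """Map each unique full label path to an SS node id.

    Ids are assigned in order of first appearance (an assumption).
    The set of paths implicitly forms a tree: each path's parent is
    the path with its last step removed.
    """
    summary = {}

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        if path not in summary:
            summary[path] = len(summary)  # new SS node
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return summary

doc = """<auction>
  <item><location>Dallas</location><price>10</price></item>
  <item><location>Austin</location><description>bike</description></item>
</auction>"""

for path, node_id in build_structural_summary(doc).items():
    print(node_id, path)
```

Repeated elements such as the second `item` add no new nodes, which is why the summary stays small compared with the document.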

9
Data Indexing (cont.)

Content Synopsis (CS)
Summarizes the text associated with an SS node in
an XML document
Approximated as a bit matrix of size W × L
L is fixed but W may depend on the document size
Stored as a B-tree that implements the mapping
(SSnode, doc) → bit-matrix
Used in evaluating search predicates in the query
Positional Filter (PF)
Captures the position spans of all XML elements
associated with an SS node in an XML document
Represented as a bit matrix of size M × L, where M
2
Stored as a B-tree that implements the mapping
(SSnode, doc) → bit-matrix
Used in enforcing containment constraints among
query predicates
Do we need positional dimension?
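
One way to picture a content synopsis is as a Bloom-filter-like bit matrix with a term dimension (W rows) and a position dimension (L columns). The sketch below is a simplified assumption, not the authors' encoding: a term is hashed to a row, its document position is scaled into one of L buckets, and each row is kept as an L-bit integer. The names `term_row`, `build_content_synopsis`, and `may_cooccur`, the value of L, and the use of CRC32 as the hash are all hypothetical.

```python
import zlib

L = 8  # number of position buckets; fixed, value chosen for illustration

def term_row(term, W):
    # Stable hash of a term into one of W rows (CRC32 is an assumption).
    return zlib.crc32(term.lower().encode("utf-8")) % W

def build_content_synopsis(occurrences, doc_length, W):
    """occurrences: iterable of (position, term), 0 <= position < doc_length.

    Returns W rows, each an L-bit integer; bit b of row r is set when a
    term hashing to r occurs in position bucket b.
    """
    matrix = [0] * W
    for pos, term in occurrences:
        bucket = pos * L // doc_length
        matrix[term_row(term, W)] |= 1 << bucket
    return matrix

def may_cooccur(matrix, terms):
    """True if all terms may occur in a common position bucket.

    False positives are possible (hash collisions), false negatives are not.
    """
    mask = (1 << L) - 1
    for term in terms:
        mask &= matrix[term_row(term, len(matrix))]
    return mask != 0

cs = build_content_synopsis(
    [(0, "mountain"), (1, "bicycle"), (90, "helmet")], doc_length=100, W=64)
print(may_cooccur(cs, ["mountain", "bicycle"]))  # True: both fall in bucket 0
```

The positional dimension is what lets a conjunction such as "mountain" and "bicycle" be rejected when the two terms occur far apart in the document; a one-dimensional filter could only confirm that both terms occur somewhere.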

10
Data Synopsis Example
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

11
Containment Filtering
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price

12
Query Processing Overview

Query Footprint (QF) Extraction
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
QF: //auction//item:0[location:1][description:2]/price
Structural Summary Matching
Retrieve all structural summaries that match the
QF
We use the standard preorder numbering scheme to
represent an SS
An SS is stored as a B-tree that implements the
mapping
tag → (SS, SSnode, begin, end, level)
We use containment joins to retrieve the
qualified full label paths that match the entry
points in the QF
/auction/item, /auction/item/location,
/auction/item/description
Containment Filtering
Qualified document locations are collected and
returned
The unit of query processing is a mapping from a
doc to a bit matrix of size M × L (positions)
An empty bit matrix means an unqualified document

13
Two-Phase Containment Filtering

Many sources of inefficiency
A large number of full label paths may match a
single generic XPath query
A long list of data synopses has to be retrieved
for each label path in a QF
The retrieved lists of data synopses have to be
correlated at each step during containment
filtering
Solution
Aggregate data synopses lists from multiple
documents into a single bit matrix, called
Document Synopsis, of size W × D
path → bit-matrix
so that, given a term t and a full label
path p, the document doc is a candidate if the
document synopsis for p is set at
[hash(t), hash(doc)]
Need a two-phase containment filtering algorithm
to prune unqualified document locations before
the actual containment filtering
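
The pruning phase can be sketched as follows, assuming one W × D bit matrix per full label path, with a bit set at row hash(t) and column hash(doc) whenever document doc contains term t under that path. The class `DocumentSynopsis`, its method names, the matrix dimensions, and the CRC32 hash are illustrative assumptions rather than details taken from the slides.

```python
import zlib
from collections import defaultdict

W, D = 64, 32  # matrix dimensions, chosen for illustration

def h(value, modulus):
    # Stable hash (CRC32 is an assumption, not the authors' choice).
    return zlib.crc32(value.encode("utf-8")) % modulus

class DocumentSynopsis:
    def __init__(self):
        # path -> W rows, each row a D-bit integer
        self.matrices = defaultdict(lambda: [0] * W)

    def add(self, path, term, doc):
        self.matrices[path][h(term, W)] |= 1 << h(doc, D)

    def is_candidate(self, path, term, doc):
        rows = self.matrices.get(path)
        if rows is None:
            return False
        return bool((rows[h(term, W)] >> h(doc, D)) & 1)

    def prune(self, docs, requirements):
        """Phase one: keep docs that pass every (path, term) test."""
        return [d for d in docs
                if all(self.is_candidate(p, t, d) for p, t in requirements)]

ds = DocumentSynopsis()
ds.add("/auction/item/location", "Dallas", "doc1.xml")
ds.add("/auction/item/description", "mountain", "doc1.xml")
survivors = ds.prune(
    ["doc1.xml", "doc2.xml"],
    [("/auction/item/location", "Dallas"),
     ("/auction/item/description", "mountain")])
print(survivors)
```

Documents that survive this check are only candidates, since hash collisions can produce false positives; the second phase still runs the actual containment filtering on them.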

14
Document Synopsis
The document synopsis for /biblio/book/paragraph

15
Experimental Setup

A prototype system is implemented in Java
Employed Berkeley DB Java Edition 3.2.13 as a
storage manager
Datasets
XMark
XBench

Data Set  Data Size (MB)  Files  Avg. File Size (KB)  Avg. SS Size (bytes)  Avg. CS Size (bytes)  Avg. PF Size (bytes)
XBench    1050            2666   394                  432                   20564                 178
XMark     55.8            11500  5                    417                   306                   16

16
Query Workload
Dataset  Query  Query Expression
XMark    Q1     /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark    Q2     //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark    Q3     /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark    Q4     //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark    Q5     /site//item[description//text ~ "gold"]/name
XMark    Q6     /regions//item[description//text ~ "character"]/payment
XMark    Q7     //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark    Q8     //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark    Q9     //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark    Q10    //closed_auction[annotation//text ~ "country"]/type
XBench   Q11    /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench   Q12    //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench   Q13    /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench   Q14    /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench   Q15    /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench   Q16    /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench   Q17    //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench   Q18    //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench   Q19    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench   Q20    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title

17
Indexing Scheme Comparison
ILI: using a standard XML indexing scheme based on full inverted lists
DSI: using our indexing scheme based on data synopses

18
Query Precision Measurement
ODBF: using one-dimensional Bloom filters
TDBF: using two-dimensional Bloom filters

19
Efficiency of Optimization Algorithm
OPCF: using one-phase containment filtering
TPCF: using two-phase containment filtering

20
Future Research Directions