Indexing and Searching XML Documents based on Content and Structure Synopses

1
Indexing and Searching XML Documents based on
Content and Structure Synopses
  • Weimin He, Leonidas Fegaras, David Levine
  • University of Texas at Arlington
  • http://lambda.uta.edu

2
Outline
  • Motivation
  • Key Contributions
  • Related Work
  • Data Synopses Indexing
  • Query Processing
  • Experimental Results
  • Conclusion

3
Why not Google?
  • Need to query both structure and content
  • an opportunity for more precise search
  • Keyword queries are NOT adequate for XML search
  • An example query beyond Google
  • Find the price of the book whose author's
    last name is Smith and whose title contains
    XML and SAX
  • Semantic search using an XPath query
  • //book[author/lastname ~ "Smith"][title ~ "XML" and "SAX"]/price
  • Simpler query formats cannot express complex
    containment relationships
  • (lastname, Smith), (title, XML SAX), price
  • Fully indexing XML data is neither efficient nor
    scalable

4
Key Contributions
  • A framework for indexing and searching
    schema-less XML documents based on data synopses
    extracted from documents
  • Two novel data synopsis structures that can
    achieve higher query precision and scalability
  • A hash-based processing algorithm to speed up
    searching
  • A prototype implementation to evaluate the
    performance of the indexing scheme and to
    validate the data synopsis precision

5
Related Work
  • Extend keyword queries to XML
  • XRank
  • XKSearch
  • Integrate IR constructs and scoring into XQuery
  • TIX
  • TeXQuery
  • XML Summarization Techniques
  • XSketch
  • XCluster

6
System Architecture
7
Specification of Search Queries
  • XPath is extended with a simple IR syntax
  • Queries may contain predicates of the form
    e ~ S
  • e is an XPath expression
  • S is a search predicate that takes the form
  • term | S1 and S2 | S1 or S2 | (S)
  • A running query example
  • //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
  • Query result
  • A list of document locations (path names) that
    satisfy the query
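
The predicate grammar above can be illustrated with a minimal Python sketch. It represents a search predicate S as nested tuples and evaluates it against the set of words in an element's text. The tuple encoding and the function name `matches` are assumptions made for illustration; they are not taken from the authors' prototype.

```python
def matches(pred, words):
    """Evaluate a search predicate S against a set of lowercase words.

    pred is either a term (str) or a tuple ("and" | "or", S1, S2).
    Parentheses in the grammar correspond to tuple nesting.
    """
    if isinstance(pred, str):
        return pred.lower() in words
    op, left, right = pred
    if op == "and":
        return matches(left, words) and matches(right, words)
    if op == "or":
        return matches(left, words) or matches(right, words)
    raise ValueError(f"unknown operator: {op!r}")


# description ~ "mountain" and "bicycle"
description = set("red mountain bicycle in good condition".split())
print(matches(("and", "mountain", "bicycle"), description))  # True
print(matches(("and", "mountain", "tent"), description))     # False
```

This evaluates predicates exactly, on the full text. The synopses introduced on the following slides answer the same question approximately, without storing the text.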

8
Data Indexing
  • Structural Summary (SS)
  • A tree that captures all unique paths in an XML
    document
  • It is constructed from XML data incrementally
  • Each SSnode corresponds to a unique full label
    path
  • e.g., SSnode 9: /auction/sponsor/address
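
A minimal sketch of incremental structural-summary construction, assuming the summary can be kept as a flat map from full label path to node id (the tree shape is implied by the paths). The function name `add_to_summary`, the sample document, and the numbering in first-seen order are illustrative assumptions, so the ids will not match the node numbers in the slide's figure.

```python
import xml.etree.ElementTree as ET


def add_to_summary(summary, xml_text):
    """Merge the unique full label paths of one document into `summary`.

    `summary` maps each full label path to an SS node id, assigned in
    first-seen order, so it can be extended one document at a time.
    """
    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        summary.setdefault(path, len(summary))
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return summary


doc = """<auction>
  <sponsor><address>Arlington</address></sponsor>
  <item><location>Dallas</location><price>100</price></item>
  <item><location>Austin</location></item>
</auction>"""

ss = add_to_summary({}, doc)
for path, node_id in ss.items():
    print(node_id, path)
```

The second `item` element adds no new entries, because its paths are already in the summary. This is why the summary stays small relative to the data.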

9
Data Indexing (cont.)
  • Content Synopsis (CS)
  • Summarizes the text associated with an SS node in
    an XML document
  • Approximated as a bit matrix of size W × L
  • L is fixed but W may depend on the document size
  • Stored as a B-tree that implements the mapping
  • (SSnode, doc) → bit-matrix
  • Used in evaluating search predicates in the query
  • Positional Filter (PF)
  • Captures the position spans of all XML elements
    associated with an SS node in an XML document
  • Represented as a bit matrix of size M × L, where M
    2
  • Stored as a B-tree that implements the mapping
  • (SSnode, doc) → bit-matrix
  • Used in enforcing containment constraints among
    query predicates
  • Do we need positional dimension?
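
To make the role of the positional dimension concrete, here is a simplified sketch of a content synopsis as a W × L bit matrix. The sizes, the CRC32 hash, the mapping of an element's start position to one of L position ranges, and the function names are all assumptions for illustration; the slides do not specify these details.

```python
import zlib

W, L = 64, 16  # assumed sizes: W term buckets, L position ranges


def bucket(term):
    # Deterministic hash; Python's built-in hash() is salted per process.
    return zlib.crc32(term.lower().encode("utf-8")) % W


def content_synopsis(elements, doc_length):
    """Build a W x L bit matrix, stored as one L-bit integer per row.

    `elements` is a list of (start_position, text) pairs for all elements
    under one SS node in one document; start_position < doc_length.
    """
    rows = [0] * W
    for start, text in elements:
        column = start * L // doc_length
        for word in text.split():
            rows[bucket(word)] |= 1 << column
    return rows


def may_contain_all(rows, terms):
    """Position ranges where every term may occur (0 means no match)."""
    result = (1 << L) - 1
    for term in terms:
        result &= rows[bucket(term)]
    return result


cs = content_synopsis(
    [(10, "mountain bicycle"), (500, "road bicycle"), (900, "mountain tent")],
    doc_length=1000,
)
print(bin(may_contain_all(cs, ["mountain", "bicycle"])))
```

In this example the bit for the first element's position range is always set in the result. The bits for the other two ranges are cleared unless a hash collision sets them, whereas a one-dimensional filter over the whole document could not tell the three elements apart. Like any Bloom-filter-style structure, the synopsis can report false positives but not false negatives.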

10
Data Synopsis Example
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
11
Containment Filtering
Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
12
Query Processing Overview
  • Query Footprint (QF) Extraction
  • Query: //auction//item[location ~ "Dallas"][description ~ "mountain" and "bicycle"]/price
  • QF: //auction//item:0[location:1][description:2]/price
  • Structural Summary Matching
  • Retrieve all structural summaries that match the
    QF
  • We use the standard preorder numbering scheme to
    represent an SS
  • An SS is stored as a B-tree that implements the
    mapping
  • tag → (SS, SSnode, begin, end, level)
  • We use containment joins to retrieve the
    qualified full label paths that match the entry
    points in the QF
  • /auction/item, /auction/item/location,
    /auction/item/description
  • Containment Filtering
  • Qualified document locations are collected and
    returned
  • The unit of query processing is a mapping from a
    doc to a bit matrix of size M × L (positions)
  • An empty bit matrix means an unqualified document

13
Two-Phase Containment Filtering
  • Many sources of inefficiency
  • A large number of full label paths may match a
    single generic XPath query
  • A long list of data synopses has to be retrieved
    for each label path in a QF
  • The retrieved lists of data synopses have to be
    correlated at each step during containment
    filtering
  • Solution
  • Aggregate data synopses lists from multiple
    documents into a single bit matrix, called
    Document Synopsis, of size W × D
  • path → bit-matrix
  • so that, given a term t and a full label
    path p, the document doc is a candidate if the
    document synopsis for p is set at
    [hash(t), hash(doc)]
  • Need a two-phase containment filtering algorithm
    to prune unqualified document locations before
    the actual containment filtering

14
Document Synopsis
The document synopsis for /biblio/book/paragraph
15
Experimental Setup
  • A prototype system is implemented in Java
  • Employed Berkeley DB Java Edition 3.2.13 as a
    storage manager
  • Datasets
  • XMark
  • XBench

Data Set   Data Size (MB)   Files   Avg. File Size (KB)   Avg. SS Size (bytes)   Avg. CS Size (bytes)   Avg. PF Size (bytes)
XBench     1050             2666    394                   432                    20564                  178
XMark      55.8             11500   5                     417                    306                    16
16
Query Workload
Dataset  Query  Query Expression
XMark    Q1     /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark    Q2     //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark    Q3     /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark    Q4     //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark    Q5     /site//item[description//text ~ "gold"]/name
XMark    Q6     /regions//item[description//text ~ "character"]/payment
XMark    Q7     //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark    Q8     //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark    Q9     //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark    Q10    //closed_auction[annotation//text ~ "country"]/type
XBench   Q11    /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench   Q12    //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench   Q13    /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench   Q14    /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench   Q15    /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench   Q16    /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench   Q17    //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench   Q18    //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench   Q19    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench   Q20    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title
17
Indexing Scheme Comparison
ILI: a standard XML indexing scheme based on full Inverted Lists
DSI: our indexing scheme based on Data Synopses
18
Query Precision Measurement
ODBF: one-dimensional Bloom Filters
TDBF: two-dimensional Bloom Filters
19
Efficiency of Optimization Algorithm
OPCF: one-phase containment filtering
TPCF: two-phase containment filtering
20
Future Research Directions
  • Develop an effective ranking function
  • Adopt top-k algorithms to improve search
    efficiency
  • Apply our framework to structured P2P networks
  • Evaluate our framework over INEX data