Efficient Processing of XPath Queries Using Indexes

Slides: 37
Provided by: kalp7
Learn more at: https://web.mst.edu

Transcript and Presenter's Notes


1
Efficient Processing of XPath Queries Using
Indexes
  • Yan Chen (1), Sanjay Madria (1), Kalpdrum Passi (2),
    Sourav Bhowmick (3)
  • (1) Department of Computer Science, University of
    Missouri-Rolla, Rolla, MO 65409, USA
  • madrias@umr.edu
  • (2) Dept. of Math. & Computer Science, Laurentian
    University, Sudbury, ON P3E 2C6, Canada
  • kpassi@cs.laurentian.ca
  • (3) School of Computer Engineering, Nanyang
    Technological University, Singapore
  • assourav@ntu.edu.sg

2
Querying Semistructured Data
  • Query languages for semistructured data include
    XQuery, XML-QL, XML-GL, Lorel, and Quilt
  • Semistructured data is represented as a graph
  • Queries on such data are expressed as
    regular path expressions
  • XPath is a language that defines the syntax for
    addressing path expressions over XML data
  • Indexes on XML data improve query performance
    on large XML files
  • Indexing techniques used in relational and
    object-oriented databases do not suffice for
    semistructured data due to the nature of the data

3
Indexing Semistructured Data
  • Dataguides
  • record information about the existing paths in a
    database
  • do not provide any information about parent-child
    relationships between nodes in the database
  • as a result, they cannot be used for navigation
    from an arbitrary node
  • T-indexes
  • specialized path indexes that summarize only a
    limited class of paths
  • The 1-index and 2-index are special cases of T-indexes
4
Indexing Semistructured Data
  • LORE
  • Uses four different types of index structures -
    value, text, link, and path indexes
  • Value index and text index are used to search
    objects that have specific values
  • Link index and path index provide fast access to
    the parents of an object and to all objects reachable
    via a given labeled path
  • Lore uses OEM (Object Exchange Model) to store
    data and OQL (Object Query Language) as its query
    language

5
Indexing Semistructured Data
  • ToXin
  • has two types of index structures: the
    value index and the path index
  • The path index has two parts, an index tree and
    instance functions; these functions can be
    used to trace parent-child relationships
  • ToXin's path index contains only parent and
    child information, whereas our model stores
    the complete path from the root to each node
  • ToXin uses an index for a single level, while we use
    multiple indexes for different levels

6
A Sample XML File
<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world">
    <ISBN>1-1-1</ISBN>
    <AUTHOR>David</AUTHOR>
  </BOOK>
  <BOOK title="Glory days">
    <ISBN>1-1-2</ISBN>
    <AUTHOR>Chris</AUTHOR>
  </BOOK>
  <BOOK title="I love the game">
    <ISBN>1-1-3</ISBN>
    <AUTHOR>Chris</AUTHOR>
  </BOOK>
  <BOOK title="What lies beneath">
    <ISBN>1-1-4</ISBN>
    <AUTHOR>Michael</AUTHOR>
  </BOOK>
  <BOOK title="Matrix II">
    <ISBN>1-1-5</ISBN>
    <AUTHOR>Jason</AUTHOR>
  </BOOK>
  <BOOK title="The Root">
    <ISBN>1-1-6</ISBN>
    <AUTHOR>Tomas</AUTHOR>
  </BOOK>
</BOOKSTORE>

7
XML as DOM Tree
8
Indexing XML Data - Motivation
  • Retrieve all the books with author name
    "Chris" from the Benny-bookstore
  • Without an index, we need to find all BOOK nodes in the
    DOM tree that are children of BOOKSTORE.
  • Then, for each BOOK, we need to test the author's
    name.
  • After about 100,000 comparisons, we get a couple
    of books with author "Chris" as the output
  • By using an index on AUTHOR, we do not need to test
    the author of each BOOK node.
  • With an index keyed on "Chris", we can find
    all matching AUTHOR nodes faster
  • The nodes obtained can then be checked against
    the rest of the query condition.
  • This is a bottom-up query plan.
  • Such a plan is useful when we have a
    relatively small result set at the bottom
    that can be pre-selected
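
The bottom-up plan above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it uses a three-book excerpt of the sample file, and the names `author_index`, `parent_of`, and `books_by` are hypothetical helpers introduced here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """
<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world"><ISBN>1-1-1</ISBN><AUTHOR>David</AUTHOR></BOOK>
  <BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
  <BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
</BOOKSTORE>
"""

root = ET.fromstring(XML)

# ElementTree keeps no parent pointers, so record them once (assumed helper map).
parent_of = {child: parent for parent in root.iter() for child in parent}

# Value index on AUTHOR: author name -> list of AUTHOR elements.
author_index = defaultdict(list)
for author in root.iter("AUTHOR"):
    author_index[author.text.strip()].append(author)


def books_by(author_name, store_name):
    """Bottom-up plan: start at the indexed AUTHOR nodes, then check ancestors."""
    titles = []
    for author in author_index.get(author_name, []):
        book = parent_of.get(author)
        store = parent_of.get(book) if book is not None else None
        if (book is not None and book.tag == "BOOK"
                and store is not None and store.tag == "BOOKSTORE"
                and store.get("name") == store_name):
            titles.append(book.get("title"))
    return titles


print(books_by("Chris", "Benny-bookstore"))  # ['Glory days', 'I love the game']
```

Only the two AUTHOR nodes found through the index are tested against the rest of the condition; the BOOK nodes of other authors are never visited.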

9
Indexing XML Data - Motivation
  • Find all the books whose name begins with
    "glory" and whose author is "Chris"
  • The query plan could be to get all the books whose
    name begins with "glory", disregarding their authors.
  • If only a small number of books satisfy this
    constraint (e.g., four "glory" books), it might
    be useful to introduce another type of index,
    built on the values of some nodes.
  • Here, we need an index on strings.
  • On the basis of the nodes obtained in the first
    step, we can then test the other condition of
    the query.
  • Hence, we can build a set of nodes as the entry
    set, which will depend on the specific query and
    on the type of XML data

10
Types of Indexes
  • Name-index (Nindex)
  • A name-index locates nodes by tag name
  • The Nindex for the incoming tag <BOOK> over the
    XML fragment in figure 2 is then {2, 3,
    4, 13, 16, 19}
  • Value-index (Vindex)
  • A value-index locates nodes with a given value
  • The Vindex for the word "Chris" is {10,
    12}; for the word "the" it is {2, 4}
  • Path-index (Pindex)
  • A path-index locates nodes by their path from the
    root node
  • The path index is the information we attach to each
    node to record the path of its ancestors
  • In the DOM tree, the path information of node 11 is (1,
    4) and that of node 7 is (1, 2)
  • Descent Number (DN)
  • The Descent Number is the information we attach to
    every node to record the number of its descendants.
  • In the DOM tree, the DN of node 11 is 0 and the DN
    of node 3 is 2
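
As a rough illustration of these four structures, the sketch below builds them in one pass over a small tree. It assumes preorder node ids starting at 1, which will not match the node numbers on the slide, and the dictionary names `nindex`, `vindex`, `pindex`, and `dn` are chosen here for readability rather than taken from the authors' code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world"><AUTHOR>David</AUTHOR></BOOK>
  <BOOK title="Glory days"><AUTHOR>Chris</AUTHOR></BOOK>
</BOOKSTORE>"""

nindex = defaultdict(list)  # tag name -> node ids
vindex = defaultdict(list)  # word -> ids of nodes whose title or text contains it
pindex = {}                 # node id -> ids of its ancestors, root first
dn = {}                     # node id -> number of descendants


def build(elem, ancestors, counter):
    """Assign a preorder id to elem, fill all four structures, return its DN."""
    counter[0] += 1
    node_id = counter[0]
    nindex[elem.tag].append(node_id)
    pindex[node_id] = list(ancestors)
    words = (elem.get("title", "") + " " + (elem.text or "")).split()
    for word in set(words):
        vindex[word].append(node_id)
    # A node's DN is its number of children plus all of their descendants.
    dn[node_id] = sum(1 + build(child, ancestors + [node_id], counter)
                      for child in elem)
    return dn[node_id]


build(ET.fromstring(XML), [], [0])

print(nindex["BOOK"])    # [2, 4]
print(vindex["Chris"])   # [5]
print(pindex[5])         # [1, 4]
print(dn[1], dn[3])      # 4 0
```

Storing the full ancestor list per node is what later allows a path condition to be checked without walking the tree.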

11
Example for XPath Queries
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author>
      <first-name>Rick</first-name>
      <last-name>Hull</last-name>
    </author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>

12
Data Model for XPath
[Figure: XPath data model tree. Node labels: the root; processing instruction; comment; the root element; book; book; publisher (Addison-Wesley); author (Serge Abiteboul); ...]
  • Much like the XQuery data model

13
XPath Simple Expressions
  • /bib/book/year
  • Result: <year>1995</year>
    <year>1998</year>
  • /bib/paper/year
  • Result: empty (there are no papers)
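
These two expressions can be checked with Python's standard `xml.etree.ElementTree` module, which supports a limited XPath subset. The sketch below uses a shortened copy of the bib example; because ElementTree evaluates paths relative to an element, `/bib/book/year` is written as `book/year` starting from the `<bib>` element.

```python
import xml.etree.ElementTree as ET

BIB = """<bib>
  <book><publisher>Addison-Wesley</publisher>
        <title>Foundations of Databases</title><year>1995</year></book>
  <book price="55"><publisher>Freeman</publisher>
        <title>Principles of Database and Knowledge Base Systems</title>
        <year>1998</year></book>
</bib>"""

bib = ET.fromstring(BIB)  # the root element <bib>

# /bib/book/year, evaluated relative to <bib>
years = [year.text for year in bib.findall("book/year")]

# /bib/paper/year: no <paper> elements exist, so the result is empty
papers = bib.findall("paper/year")

print(years)   # ['1995', '1998']
print(papers)  # []
```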

14
Entry-point Technique
  • We find an entry-point node among a set of middle
    level nodes in the XPath expression.
  • Then we split the XPath expression at the
    entry-point and test for the path condition for
    the first part and eliminate nodes from DOM tree
    that do not satisfy the path condition.
  • Then we test the remaining part of the XPath
    expression recursively eliminating nodes that do
    not satisfy the path condition.
  • The algorithm can be implemented using either a
    top-down or a bottom-up approach

15
Entry-point Technique: An Example
  • Select BOOKSTORE/BOOK
    where BOOK.name = "Glory days" and /AUTHOR.title =
    "Chris" and BOOKSTORE.name = "Benny-bookstore"
  • The above query is transformed to the following
    XPath expression
  • /BOOKSTORE[@name = "Benny-bookstore"]/child::BOOK[@title =
    "Glory Days"]/child::AUTHOR/child::FIRSTNAME[@name = "Chris"]
  • Use Nindex to get all BOOK nodes or AUTHOR nodes
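
For comparison, a query of this shape can be run directly with the predicate syntax of Python's `xml.etree.ElementTree`. This sketch is adapted to the sample file, where AUTHOR holds the name as text and there is no FIRSTNAME element, so the predicate `[AUTHOR='Chris']` replaces the last two steps of the expression above. It evaluates the path by plain traversal and uses no index.

```python
import xml.etree.ElementTree as ET

store = ET.fromstring(
    '<BOOKSTORE name="Benny-bookstore">'
    '<BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR>Chris</AUTHOR></BOOK>'
    '<BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR>Chris</AUTHOR></BOOK>'
    '</BOOKSTORE>'
)

matches = []
# The root element is BOOKSTORE, so its name attribute is tested directly.
if store.get("name") == "Benny-bookstore":
    # [@title='...'] tests an attribute; [AUTHOR='...'] tests a child's text.
    matches = store.findall("BOOK[@title='Glory days'][AUTHOR='Chris']")

print([book.findtext("ISBN") for book in matches])  # ['1-1-2']
```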

16
Entry-point Technique: An Example
  • In the first strategy, get all books named "Glory Days" and then
    test on each one of them whether the author
    is "Chris"
  • /BOOKSTORE[@name = "Benny-bookstore"]/child::BOOK[@title =
    "Glory Days"]
  • Then, we test each author child node, which is
    the latter part of the XPath expression
  • /child::AUTHOR/child::FIRSTNAME[@name = "Chris"]
  • In the second strategy, first get all authors named
    "Chris", and then test the parent nodes to check whether the
    book name is "Glory Days"

17
Entry-point Root-first Algorithm
  • INPUT: XPath expression root/X1/X2/.../Xi/.../Xm
  • STEP 1: FOR each Xi
      BEGIN
        IF Xi is indexed THEN
        BEGIN
          get every node xi of type Xi
          get the DN ni of each xi
          Sum_i = Σ ni
        END
      END
  • STEP 2: Get the entry point Xn with minimum Sum; add
    all xn to a node set S
    Consider the tree obtained after deleting all
    branches that do not have a node xn in their
    path.
    Split the XPath into root/X1/X2/.../Xn-1 and
    /Xn+1/.../Xm at the entry point Xn
  • STEP 3: FOR each node xn in S
      BEGIN
        IF the path from the root to node xn
           is not included in the path
           root/X1/X2/.../Xn-1/Xn
        THEN
          delete the subtree that does not
          satisfy the path condition
      END
  • STEP 4: FOR each node xn in S, consider all
    subtrees starting with xn
      BEGIN
        IF /Xn+1/.../Xm is the same as /Xm
        THEN return nodes Xm
        ELSE INPUT Xn/Xn+1/.../Xm
             GO TO STEP 1
      END
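
The steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' code: it handles child steps only (no `//`), it walks the remaining steps below the entry nodes directly instead of re-entering Step 1 recursively, and the names `Node`, `build_nindex`, `root_path`, and `entry_point_query` are hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, tag, *children):
        self.tag, self.children, self.parent = tag, list(children), None
        for child in self.children:
            child.parent = self
        # DN: number of descendants (children are built before their parent).
        self.dn = sum(1 + child.dn for child in self.children)


def build_nindex(root):
    """Name index: tag -> nodes, in document order."""
    index, stack = defaultdict(list), [root]
    while stack:
        node = stack.pop()
        index[node.tag].append(node)
        stack.extend(reversed(node.children))
    return index


def root_path(node):
    """Tags from the root down to node (stands in for the path index)."""
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = node.parent
    return tags[::-1]


def entry_point_query(root, steps, indexed):
    """Evaluate steps[0]/steps[1]/.../steps[-1]; `indexed` = tags with an index."""
    nindex = build_nindex(root)
    # Steps 1-2: among indexed steps, pick the one with the smallest DN sum
    # (falls back to the root step if no step is indexed).
    candidates = [i for i, tag in enumerate(steps) if tag in indexed] or [0]
    n = min(candidates, key=lambda i: sum(x.dn for x in nindex[steps[i]]))
    # Step 3: keep entry nodes whose root path matches root/X1/.../Xn.
    current = [x for x in nindex[steps[n]] if root_path(x) == steps[:n + 1]]
    # Step 4: evaluate the remaining steps below the surviving entry nodes.
    for tag in steps[n + 1:]:
        current = [c for x in current for c in x.children if c.tag == tag]
    return current


tree = Node("A",
            Node("B", Node("C", Node("E", Node("H")), Node("E"))),
            Node("D", Node("E", Node("H"))))

result = entry_point_query(tree, ["A", "B", "C", "E", "H"], indexed={"B", "E"})
print([node.tag for node in result])  # ['H']
```

In this toy tree the DN sums are 4 for B and 2 for E, so E becomes the entry point; the E node under D fails the path test in Step 3, and only one H remains after Step 4.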

18
Example: Entry-point Root-first Algorithm
XPath: A/B/C/E//H
19
Example: Entry-point Root-first Algorithm
  • Step 1: calculate the descent numbers (DN) of the
    nodes that have indexes
  • DN of node B: 31
  • DN of node E: 18
  • Entry-point node: E (minimum DN)

20
Example: Entry-point Root-first Algorithm
  • Step 2: delete the branches that do not contain E

The XPath is split into A/B/C/E and E//H
21
Example: Entry-point Root-first Algorithm
  • Step 3: test A/B/C/E on each E node and discard
    the rightmost subtree with node E
  • Step 4: evaluate E//H on each E; finally we
    get the three H nodes
  • Cost: O(N), where N is the number of nodes

22
Rest-tree Conception
  • Performance deterioration in Entry-point
    algorithm
  • Example: find books written by David where the title of
    the book contains the word "book"
  • The XML file might have hundreds of books with
    the word "book" in the title, and
  • there might also be a large number of books by
    author David, but only one of them has the word
    "book" in its title
  • The Entry-point algorithm first eliminates all
    the nodes that do not have the word "book" in the
    title.
  • Then it eliminates the nodes that do not have
    David as the author
  • Because of the relatively large number of instances at
    the two levels, a large number of eliminations is
    required

23
Rest-tree Conception
  • A Rest-tree is the tree formed by the nodes that meet a
    certain condition at their level, along with their descendant
    and ancestor nodes
  • In the example, the Rest-tree of the node that
    satisfies the condition that the <BOOK> node has
    the word "glory" in its title is as shown
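
To make the definition concrete, the sketch below computes the Rest-tree of a single node as a set of node ids: the node itself, all of its ancestors, and all of its descendants. The tiny tree and the `children` / `parent` maps are made up for illustration.

```python
# Hypothetical tree: node id -> list of child ids.
children = {1: [2, 3], 2: [4, 5], 3: [6], 4: [], 5: [], 6: []}
parent = {c: p for p, cs in children.items() for c in cs}


def rest_tree(node):
    """Return the node plus all of its ancestors and descendants."""
    nodes = {node}
    up = node
    while up in parent:          # climb to the root
        up = parent[up]
        nodes.add(up)
    stack = list(children[node])
    while stack:                 # collect every descendant
        current = stack.pop()
        nodes.add(current)
        stack.extend(children[current])
    return nodes


print(sorted(rest_tree(2)))                 # [1, 2, 4, 5]
print(sorted(rest_tree(2) & rest_tree(4)))  # [1, 2, 4]
```

Representing each Rest-tree as a set makes combining the Rest-trees of several qualifying nodes a plain set intersection.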

24
Rest-tree Conception
  • First employ the Entry-point algorithm to find all
    nodes that meet the condition statements at each
    level
  • The final result is then the intersection of
    the Rest-trees of these nodes
  • In practice, we do not need to find the Rest-tree
    of every node satisfying the condition.
  • Only a small set of nodes is left after applying the
    Entry-point algorithm
  • So we need to find the Rest-trees of a relatively
    small set of nodes within a small subtree
  • To get the intersection of the Rest-trees, note that
    the nodes that satisfy the query condition and
    have the minimum number of descendants are
    available from the Entry-point algorithm

25
Rest-tree Conception
  • The minimum level is the anchor level of the
    Rest-tree algorithm.
  • We just need to intersect the Rest-trees at this
    minimum level.
  • For example, after the first step of the Entry-point
    algorithm, we know there are 2000 nodes at Level
    A that meet, say, condition A, 1000 nodes at Level
    B that meet condition B, 200 nodes at Level C,
    3000 at Level D, and 400 at Level E.
  • The minimum level is C and the order of the
    levels is C -> E -> B -> A -> D
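
The ordering in this example is simply the levels sorted by their number of qualifying nodes, which can be verified with a few lines of Python (the variable names are illustrative):

```python
# Number of nodes that meet the condition at each level (from the example).
counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}

# Sort the levels by ascending node count; the first one is the anchor level.
order = sorted(counts, key=counts.get)
anchor = order[0]

print(anchor)              # C
print(" -> ".join(order))  # C -> E -> B -> A -> D
```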

26
Rest-tree Conception
  • Ancestor node information is available from the
    path-index
  • Filter some nodes at Level C by checking the
    grandparent node information of the 400 nodes at
    Level E
  • Similarly, we can filter some other nodes at
    Level C by checking the parent node information
    of the nodes at Level B.
  • The intersection at Level C is completed by
    checking the ancestor information of the Level D nodes.
  • The final step is to get all the nodes that
    satisfy the query requirement

27
Rest-tree Algorithm
  • INPUT: XPath expression root/X1/X2/.../Xi/.../Xm
  • STEP 1: FOR each Xi
      BEGIN
        IF Xi is indexed THEN
        BEGIN
          get every node xi of type Xi
          get the DN ni of each xi
          Sum_i = Σ ni
        END
      END
  • STEP 2: Get the entry point Xj with minimum Sum; add
    all xj to a node set Sj
    Get the comparison point Xk with the second minimum
    Sum; add all xk to a node set Sk
  • STEP 3: IF level j > k THEN
        FOR each node xk in Sk
          IF its ancestor is not in Sj THEN
            delete xk from Sk
      ELSE
        FOR each node xj in Sj
          IF its ancestor is not in Sk THEN
            delete xj from Sj
  • STEP 4: FOR each node xj in Sj
      BEGIN
        IF the path from the root to node xj
           is not included in the path
           root/X1/X2/.../Xj
        THEN
          delete the subtree that does not
          satisfy the path condition
      END
  • STEP 5: FOR each node xj in Sj, consider all
    subtrees starting with xj
      BEGIN
        IF /Xj+1/.../Xm is the same as /Xm
        THEN return nodes Xm
        ELSE INPUT Xj/Xj+1/.../Xm
             GO TO STEP 1
      END
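
The pruning in Step 3 can be sketched with a path index that maps each node id to the ids of its ancestors. This is an assumed representation with made-up node ids, and `filter_by_ancestor` is a hypothetical helper; it shows only the ancestor test, not the whole algorithm.

```python
def filter_by_ancestor(lower, upper, pindex):
    """Keep the nodes of `lower` that have at least one ancestor in `upper`."""
    upper = set(upper)
    return [node for node in lower if upper.intersection(pindex[node])]


# Hypothetical path index: node id -> ancestor ids, root first.
pindex = {
    3: [1, 2],          # a C node
    4: [1, 2, 3],       # an E node below C node 3
    6: [1, 2, 3],       # another E node below C node 3
    12: [1, 10, 11],    # an E node with no C ancestor
}

c_nodes = [3]           # comparison set (C nodes)
e_nodes = [4, 6, 12]    # entry set (E nodes)

print(filter_by_ancestor(e_nodes, c_nodes, pindex))  # [4, 6]
```

Because the ancestors are read from the index, no tree traversal is needed to discard node 12.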

28
Rest-tree Algorithm - Example
XPath: A/B/C/E//H
Step 1: Calculate DNs
[Figure: DOM tree]
29
Rest-tree Algorithm - Example
Step 2: Minimum DN
DN of node B: 32
DN of node C: 20
DN of node E: 18
30
Rest-tree Algorithm - Example
Step 3: Delete E nodes that have no C node among
their ancestors
31
Rest-tree Algorithm - Example
Step 4: Delete the subtree that does not satisfy
the path A/B/C/E
Step 5: Get all the nodes from E//H
32
Test Cases and Comparisons
  • Size of DOM Tree
  • Entry-point algorithm performs much better than
    the traditional algorithm, taking less than one
    third of the processing time of the traditional
    algorithm

[Figure: Increasing Number of Nodes for XPath //A20//C30//A80]
33
Test Cases and Comparisons
  • Result Nodes Set
  • The processing time of the Entry-point algorithm
    increases slightly with an increasing number of
    result nodes.
  • This is partly due to the recursive
    function call in the Entry-point algorithm code

Increasing Number of Result Nodes
34
Test Cases and Comparisons
  • Tree Height
  • The processing time of all three methods
    follows the same trend as the height of the
    tree increases

Tree Height Increasing
35
Test Cases and Comparisons
  • Without Index on result nodes
  • The traditional method performs very poorly,
    falling into the no-index method category.
  • However, the Entry-point algorithm still
    performs well

Tree Height Increasing
36
Conclusions
  • We proposed three types of indexes on XML data to
    execute XPath queries efficiently.
  • We proposed two algorithms that use these indexes
    to process and optimize XPath queries.
  • We have also simulated both bottom-up and
    top-down approaches
  • Processing an XPath query using the Entry-point
    indexing technique performs much better than
    traditional algorithms with or without indexes