Efficient Processing of XPath Queries Using Indexes - PowerPoint PPT Presentation

Slides: 37

Provided by: kalp7

Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

Title: Efficient Processing of XPath Queries Using Indexes

1
Efficient Processing of XPath Queries Using
Indexes

Yan Chen (1), Sanjay Madria (1), Kalpdrum Passi (2), Sourav Bhowmick (3)
(1) Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409, USA
madrias_at_umr.edu
(2) Dept. of Math. & Computer Science, Laurentian University, Sudbury ON P3E 2C6, Canada
kpassi_at_cs.laurentian.ca
(3) School of Computer Engineering, Nanyang Technological University, Singapore
assourav_at_ntu.edu.sg

2
Querying Semistructured Data

Query languages for semistructured data: XQuery, XML-QL, XML-GL, Lorel, and Quilt
Semistructured data is represented as a graph
Queries on such data are expressed as regular path expressions
XPath is a language that defines the syntax for addressing path expressions over XML data
Indexes on XML data improve query performance on large XML files
Indexing techniques used in relational and object-oriented databases do not suffice for semistructured data because of the nature of the data

3
Indexing Semistructured Data

Dataguides
Record information on the existing paths in a database.
Do not provide any information about parent-child relationships between nodes in the database, so they cannot be used for navigation from an arbitrary node.
T-indexes
Specialized path indexes that summarize only a limited class of paths.
The 1-index and 2-index are special cases of T-indexes.
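
To make the Dataguide bullets concrete, here is a minimal Python sketch of a path summary in the spirit of a Dataguide for tree-shaped data. It is a simplification, not Lore's implementation: the function name `label_paths` and the two-book document are assumptions made for illustration. The summary lists each label path once and keeps no per-node parent links, which is why it cannot be used to navigate from an arbitrary node.

```python
import xml.etree.ElementTree as ET

def label_paths(root):
    """Return every distinct root-to-node label path, each listed once."""
    paths = set()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        paths.add(path)
        for child in elem:
            visit(child, path)

    visit(root, ())
    return sorted("/".join(p) for p in paths)

# Hypothetical two-book document, used only for illustration.
doc = ET.fromstring(
    "<BOOKSTORE>"
    "<BOOK><ISBN/><AUTHOR/></BOOK>"
    "<BOOK><ISBN/><AUTHOR/></BOOK>"
    "</BOOKSTORE>"
)
summary = label_paths(doc)
print(summary)
```

Both BOOK elements collapse into the same summary entries, so the summary stays small even when the document holds many books.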

4
Indexing Semistructured Data

LORE
Uses four different types of index structures: value, text, link, and path indexes
The value index and text index are used to search for objects that have specific values
The link index provides fast access to the parents of an object, and the path index to all objects reachable via a given labeled path
Lore uses OEM (Object Exchange Model) to store data and Lorel, an extension of OQL (Object Query Language), as its query language

5
Indexing Semistructured Data

ToXin
Has two different types of index structure: the value index and the path index.
The path index has two parts, the index tree and the instance functions; the functions can be used to trace parent-child relationships.
ToXin's path index contains only parent and child information, whereas our model stores the complete path from the root to each node.
ToXin uses an index for a single level, while we use multiple indexes for different levels.

6
A Sample XML File
<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world">
    <ISBN>1-1-1</ISBN>
    <AUTHOR>David</AUTHOR>
  </BOOK>
  <BOOK title="Glory days">
    <ISBN>1-1-2</ISBN>
    <AUTHOR>Chris</AUTHOR>
  </BOOK>
  <BOOK title="I love the game">
    <ISBN>1-1-3</ISBN>
    <AUTHOR>Chris</AUTHOR>
  </BOOK>
  <BOOK title="What lies beneath">
    <ISBN>1-1-4</ISBN>
    <AUTHOR>Michael</AUTHOR>
  </BOOK>
  <BOOK title="Matrix II">
    <ISBN>1-1-5</ISBN>
    <AUTHOR>Jason</AUTHOR>
  </BOOK>
  <BOOK title="The Root">
    <ISBN>1-1-6</ISBN>
    <AUTHOR>Tomas</AUTHOR>
  </BOOK>
</BOOKSTORE>
7
XML as DOM Tree
8
Indexing XML Data - Motivation

Query: retrieve all books whose author's name is Chris from the Benny-bookstore.
Without an index, we first find all BOOK nodes in the DOM tree that are children of BOOKSTORE.
Then, for each BOOK, we test the author's name.
After about 100,000 comparisons, we get a couple of books by Chris as the output.
With an index on AUTHOR, we do not need to test the author of each BOOK node.
Looking up the key Chris in the index finds all matching author nodes faster.
The nodes obtained are then checked against the rest of the query condition.
This is a bottom-up query plan.
Such a plan is useful when a relatively small result set at the bottom of the tree can be pre-selected.
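
The contrast between the full scan and the bottom-up plan can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the synthetic store of 1000 books, the helper `make_store`, and the dictionary `author_index` are all hypothetical stand-ins for the slide's example and its AUTHOR index.

```python
import xml.etree.ElementTree as ET

def make_store(n_books, chris_positions):
    """Build a synthetic bookstore (hypothetical data, for illustration only)."""
    store = ET.Element("BOOKSTORE", name="Benny-bookstore")
    for i in range(n_books):
        book = ET.SubElement(store, "BOOK", title=f"Book {i}")
        author = ET.SubElement(book, "AUTHOR")
        author.text = "Chris" if i in chris_positions else f"Author {i}"
    return store

store = make_store(1000, {10, 500})

# Plan 1: scan every BOOK and test its author.
comparisons = 0
scan_hits = []
for book in store.findall("BOOK"):
    comparisons += 1
    if book.findtext("AUTHOR") == "Chris":
        scan_hits.append(book.get("title"))

# Plan 2 (bottom-up): look up "Chris" in an index on AUTHOR values,
# then check the remaining conditions on the ancestors of each hit.
parent = {child: p for p in store.iter() for child in p}
author_index = {}
for a in store.iter("AUTHOR"):
    author_index.setdefault(a.text, []).append(a)

checks = 0
index_hits = []
for a in author_index.get("Chris", []):
    checks += 1
    book = parent[a]
    if book.tag == "BOOK" and parent[book].get("name") == "Benny-bookstore":
        index_hits.append(book.get("title"))

print(comparisons, checks, index_hits)
```

Both plans return the same titles; the scan touches every BOOK, while the bottom-up plan only touches the AUTHOR nodes returned by the index.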

9
Indexing XML Data - Motivation

Query: find all books whose name begins with "glory" and whose author is Chris.
One query plan is to get all books whose name begins with "glory", disregarding their authors.
If only a small number of books satisfy this constraint (e.g., four "glory" books), it is useful to introduce another type of index, built on the values of some nodes.
Here, we need an index on strings.
Starting from the nodes obtained in the first step, we then test the remaining condition of the query.
Hence, we can build a set of nodes as the entry set, which depends on the specific query and on the type of XML data.
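
One possible shape for such an index on strings is a sorted list of titles searched with binary search, sketched below in Python. The slides do not say how the string index is implemented, so the sorted-list design, the name `titles_with_prefix`, and the extra book "Glory road" by "Dana" are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical (title, author) pairs; "Glory road" and "Dana" are invented.
books = [
    ("Brave the new world", "David"),
    ("Glory days", "Chris"),
    ("Glory road", "Dana"),
    ("I love the game", "Chris"),
]

# String index: lower-cased titles kept sorted, so that all titles sharing
# a prefix form one contiguous run.
title_index = sorted((title.lower(), pos) for pos, (title, _) in enumerate(books))

def titles_with_prefix(prefix):
    """Return positions of books whose title starts with the given prefix."""
    prefix = prefix.lower()
    start = bisect_left(title_index, (prefix,))
    hits = []
    for key, pos in title_index[start:]:
        if not key.startswith(prefix):
            break
        hits.append(pos)
    return hits

# Step 1: the entry set comes from the string index.
entry_set = titles_with_prefix("glory")
# Step 2: test the remaining condition only on the entry set.
result = [books[pos][0] for pos in entry_set if books[pos][1] == "Chris"]
print(entry_set, result)
```

The author test runs only on the two "glory" candidates rather than on every book.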

10
Types of Indexes

Name-index (Nindex)
A name-index locates nodes by tag name.
The Nindex for the tag <BOOK> over the XML fragment in figure 2 is {2, 3, 4, 13, 16, 19}.
Value-index (Vindex)
A value-index locates nodes with a given value.
The Vindex for the word "Chris" is {10, 12}; for the word "the" it is {2, 4}.
Path-index (Pindex)
A path-index locates a node by its path from the root node.
It is the information attached to each node to record the path through its ancestors.
In the DOM tree, the path information of node 11 is {1, 4}; that of node 7 is {1, 2}.
Descent Number (DN)
The Descent Number is the information attached to every node to record the number of its descendants.
In the DOM tree, the DN of node 11 is 0; the DN of node 3 is 2.
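
The four structures can be built in a single traversal, as in the Python sketch below. It is a minimal illustration under assumptions: only the first three books of the sample are used, nodes are numbered in document (pre-order) order, and only element nodes are counted, so the node ids differ from those in the slide's DOM figure. The function name `build_indexes` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """
<BOOKSTORE name="Benny-bookstore">
  <BOOK title="Brave the new world"><ISBN>1-1-1</ISBN><AUTHOR>David</AUTHOR></BOOK>
  <BOOK title="Glory days"><ISBN>1-1-2</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
  <BOOK title="I love the game"><ISBN>1-1-3</ISBN><AUTHOR>Chris</AUTHOR></BOOK>
</BOOKSTORE>
"""

def build_indexes(root):
    nindex = defaultdict(list)  # tag name -> node ids
    vindex = defaultdict(list)  # word -> node ids whose text or attributes contain it
    pindex = {}                 # node id -> ids of its ancestors, root first
    dn = {}                     # node id -> number of descendant elements
    counter = 0

    def visit(elem, ancestors):
        nonlocal counter
        counter += 1
        node_id = counter
        nindex[elem.tag].append(node_id)
        pindex[node_id] = ancestors
        words = (elem.text or "").split()
        for value in elem.attrib.values():
            words += value.split()
        for word in set(words):
            vindex[word].append(node_id)
        total = 0
        for child in elem:
            total += 1 + visit(child, ancestors + (node_id,))
        dn[node_id] = total
        return total

    visit(root, ())
    return nindex, vindex, pindex, dn

nindex, vindex, pindex, dn = build_indexes(ET.fromstring(SAMPLE.strip()))
print(nindex["BOOK"], vindex["Chris"], pindex[10], dn[5])
```

With this numbering, the BOOK nodes are 2, 5 and 8, and the second Chris AUTHOR (node 10) carries the ancestor path (1, 8).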

11
Example for XPath Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author>
      <first-name> Rick </first-name>
      <last-name> Hull </last-name>
    </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

12
Data Model for XPath
[Figure: XPath data model tree. Node labels: the root; processing instruction; comment; the root element; book; book; publisher; author; ...; text values "Addison-Wesley" and "Serge Abiteboul".]
Much like the XQuery data model
13
XPath Simple Expressions

/bib/book/year
Result: <year> 1995 </year>
        <year> 1998 </year>
/bib/paper/year
Result: empty (there were no papers)
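
The two expressions above can be tried with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. `ElementTree` does not accept absolute paths on an element, so the small helper `select` below (a hypothetical name, handling only simple child steps) checks the root tag itself and passes the rest of the path to `findall`.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
"""

def select(root, path):
    """Evaluate an absolute path made of child steps, e.g. /bib/book/year."""
    first, _, rest = path.lstrip("/").partition("/")
    if root.tag != first:
        return []
    return root.findall(rest) if rest else [root]

root = ET.fromstring(BIB.strip())
years = [y.text.strip() for y in select(root, "/bib/book/year")]
papers = select(root, "/bib/paper/year")
print(years, papers)
```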

14
Entry-point Technique

We find an entry-point node among a set of middle-level nodes in the XPath expression.
We then split the XPath expression at the entry point, test the path condition for the first part, and eliminate nodes from the DOM tree that do not satisfy it.
Next, we test the remaining part of the XPath expression recursively, eliminating nodes that do not satisfy the path condition.
The algorithm can be implemented using either a top-down or a bottom-up approach.

15
Entry-point Technique An Example

Select BOOKSTORE/BOOK
where BOOK.name = "Glory days" and /AUTHOR.title = "Chris" and BOOKSTORE.name = "Benny-bookstore"
The above query is transformed into the following XPath expression:
/BOOKSTORE[@name="Benny-bookstore"]/child::BOOK[@title="Glory Days"]/child::AUTHOR/child::FIRSTNAME[@name="Chris"]
Use the Nindex to get all BOOK nodes or all AUTHOR nodes.

16
Entry-point Technique An Example

First strategy: get all books named "Glory Days", then test on each of them whether the author is Chris.
/BOOKSTORE[@name="Benny-bookstore"]/child::BOOK[@title="Glory Days"]
Then we test each author child node against the latter part of the XPath expression:
/child::AUTHOR/child::FIRSTNAME[@name="Chris"]
Second strategy: first get all authors named Chris, then test whether the parent node is a book named "Glory Days".

17
Entry-point Root-first Algorithm

INPUT: XPath expression root/X1/X2/.../Xi/.../Xm
STEP 1: FOR each Xi
  BEGIN
    IF Xi is indexed THEN
    BEGIN
      get every node xi of type Xi
      get the DN ni of each xi
      Sum_i = Σ ni
    END
  END
STEP 2: Get the entry point Xn with minimum Sum; add all xn to a node set S
  Consider the tree obtained after deleting all branches that do not have a node xn in their path.
  Split the XPath into root/X1/X2/.../Xn-1 and /Xn+1/.../Xm at the entry point Xn

STEP 3: FOR each node xn in S
  BEGIN
    IF the path from root to node xn is not included in the path root/X1/X2/.../Xn-1/Xn
    THEN delete the subtree that does not satisfy the path condition
  END
STEP 4: FOR each node xn in S, consider all subtrees starting with xn
  BEGIN
    IF Xn+1/.../Xm is the same as /Xm
    THEN return nodes Xm
    ELSE INPUT Xn/Xn+1/.../Xm
         GO TO STEP 1
  END
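
The steps above can be sketched in Python as follows. This is a simplified reading, not the authors' implementation: the steps before the entry point are assumed to use the child axis only, the remainder after the entry point is evaluated by plain traversal instead of re-entering Step 1, and the `Node` class, the function names, and the small tree are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(eq=False, repr=False)
class Node:
    tag: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

    def add(self, tag):
        child = Node(tag, parent=self)
        self.children.append(child)
        return child

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def dn(node):
    """Descent Number: how many descendants the node has."""
    return sum(1 for _ in descendants(node))

def path_from_root(node):
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = node.parent
    return tags[::-1]

def evaluate(nodes, steps):
    """Evaluate (axis, tag) steps by traversal; axis is "/" or "//"."""
    for axis, tag in steps:
        pool = []
        for node in nodes:
            candidates = node.children if axis == "/" else descendants(node)
            pool.extend(c for c in candidates if c.tag == tag)
        nodes = pool
    return nodes

def entry_point_query(root, steps, indexed_tags):
    nindex = {}
    for node in [root, *descendants(root)]:
        nindex.setdefault(node.tag, []).append(node)
    # Step 1: sum the DNs of the nodes of every indexed step.
    sums = {i: sum(dn(x) for x in nindex.get(tag, []))
            for i, (_, tag) in enumerate(steps) if tag in indexed_tags}
    if not sums:  # nothing indexed: fall back to traversal from the root
        return evaluate([root], steps[1:]) if root.tag == steps[0][1] else []
    # Step 2: the entry point is the indexed step with the minimum sum.
    n = min(sums, key=sums.get)
    prefix = [tag for _, tag in steps[:n + 1]]
    # Step 3: keep entry nodes whose root path matches the first part.
    survivors = [x for x in nindex.get(steps[n][1], []) if path_from_root(x) == prefix]
    # Step 4: evaluate the remaining part below each surviving entry node.
    return evaluate(survivors, steps[n + 1:])

# Hypothetical tree for A/B/C/E//H (not the tree from the slides).
a = Node("A")
b = a.add("B")
c = b.add("C")
e1 = c.add("E")
e1.add("F").add("H")
e1.add("H")
e2 = c.add("E")
e2.add("G").add("H")
b.add("D").add("X")
a.add("D").add("E").add("H")  # this E fails the A/B/C/E path test

steps = [("/", "A"), ("/", "B"), ("/", "C"), ("/", "E"), ("//", "H")]
result = entry_point_query(a, steps, indexed_tags={"B", "E"})
print(["/".join(path_from_root(h)) for h in result])
```

In this tree the DN sum is 10 for B and 6 for E, so E becomes the entry point; the E under A/D is discarded in Step 3, leaving three H nodes.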

18
Example Entry-point Root-first Algorithm
XPath: A/B/C/E//H
19
Example Entry-point Root-first Algorithm

Step 1: calculate the descent numbers (DN) of the nodes that have indexes
DN of node B = 31
DN of node E = 18
Entry point: node E (minimum DN)

20
Example Entry-point Root-first Algorithm

Step 2: delete the branches that do not contain E

The XPath is split into A/B/C/E and E//H
21
Example Entry-point Root-first Algorithm

Step 3: test A/B/C/E on each E node and discard the rightmost subtree with node E
Step 4: evaluate E//H on each E; finally we get the three H nodes
Cost: O(N), where N is the number of nodes

22
Rest-tree Conception

Performance can deteriorate in the Entry-point algorithm.
Example: find books written by David whose title contains the word "book".
The XML file might have hundreds of books with the word "book" in the title, and a large number of books by author David, but only one of David's books has the word "book" in its title.
The Entry-point algorithm first eliminates all nodes that do not have the word "book" in the title.
Then it eliminates the nodes that do not have David as the author.
Because of the relatively large number of instances at the two levels, a large number of eliminations is required.

23
Rest-tree Conception

A Rest-tree is the tree formed by the nodes that meet a certain condition at their level, along with their descendant and ancestor nodes.
In the example, the Rest-tree of the <BOOK> node that satisfies the condition of having the word "glory" in its title is as shown.

24
Rest-tree Conception

First employ the Entry-point algorithm to find all nodes that meet the condition statements at each level.
The final result is then the intersection of the Rest-trees of these nodes.
In practice, we do not need to find the Rest-tree of every node satisfying the condition.
Only a small set of nodes is left after applying the Entry-point algorithm.
So we need to find the Rest-trees of only a relatively small set of nodes within a small subtree.
To get the intersection of the Rest-trees, note that the nodes that satisfy the query condition and have the minimum number of descendants are available from the Entry-point algorithm.

25
Rest-tree Conception

The minimum level (the level with the fewest candidate nodes) is the anchor level of the Rest-tree algorithm.
We only need to intersect the Rest-trees at this minimum level.
For example, after the first step of the Entry-point algorithm, suppose there are 2000 nodes at Level A that meet condition A, 1000 nodes at Level B that meet condition B, 200 nodes at Level C, 3000 at Level D, and 400 at Level E.
The minimum level is C, and the order of the levels is C -> E -> B -> A -> D.
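
The level ordering in this example is a sort by candidate count, as the following short Python sketch shows. The dictionary `counts` simply restates the numbers from the slide; the variable names are illustrative.

```python
# Candidate nodes per level, as given in the example above.
counts = {"A": 2000, "B": 1000, "C": 200, "D": 3000, "E": 400}

# Process levels from the fewest candidates to the most.
order = sorted(counts, key=counts.get)
anchor = order[0]  # the minimum level anchors the intersection

print(" -> ".join(order))
```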

26
Rest-tree Conception

Ancestor node information is available from the path-index.
Filter some nodes at Level C by checking the grandparent node information of the 400 nodes at Level E.
Similarly, we can filter some other nodes at Level C by checking the parent node information of the nodes at Level B.
The intersection at Level C is complete after checking the ancestor information of the Level D nodes.
The final step is to get all the nodes that satisfy the query requirement.
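
A minimal Python sketch of this intersection at the anchor level follows. It assumes the path-index is a mapping from a node id to the tuple of its ancestor ids (root first); the function name `intersect_at_anchor`, the node ids, and the small four-level tree are hypothetical. An anchor node survives a level above it if one of its ancestors is in that level's candidate set, and survives a level below it if it is an ancestor of some candidate there.

```python
def intersect_at_anchor(anchor, others, pindex):
    """Keep anchor-level nodes consistent with the candidate sets of the other levels."""
    survivors = set(anchor)
    for nodes in sorted(others, key=len):  # smallest candidate set first
        # Ancestors of this level's candidates (relevant when the level is below the anchor).
        covered = {a for n in nodes for a in pindex[n]}
        survivors = {
            c for c in survivors
            if c in covered or not nodes.isdisjoint(pindex[c])
        }
    return survivors

# Hypothetical path-index: node id -> ancestor ids, root first.
pindex = {
    0: (),                                # root
    1: (0,), 2: (0,),                     # Level B
    3: (0, 1), 4: (0, 1), 5: (0, 2),      # Level C (anchor level)
    6: (0, 1, 3), 7: (0, 1, 4), 8: (0, 2, 5),  # Level D
    9: (0, 1, 3, 6), 10: (0, 2, 5, 8),    # Level E
}

level_c = {3, 4, 5}   # candidates at the anchor level
level_b = {1}         # candidates one level above
level_e = {9, 10}     # candidates two levels below

survivors = intersect_at_anchor(level_c, [level_b, level_e], pindex)
print(survivors)
```

Node 5 is dropped because its parent is not a Level B candidate, and node 4 is dropped because no Level E candidate has it as a grandparent, leaving node 3.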

27
Rest-tree Algorithm

INPUT: XPath expression root/X1/X2/.../Xi/.../Xm
STEP 1: FOR each Xi
  BEGIN
    IF Xi is indexed THEN
    BEGIN
      get every node xi of type Xi
      get the DN ni of each xi
      Sum_i = Σ ni
    END
  END
STEP 2: Get the entry point Xj with minimum Sum; add all xj to a node set Sj
  Get the comparison point Xk with second minimum Sum; add all xk to a node set Sk
STEP 3: IF level j > k
  FOR each node xk in Sk
    IF its ancestor is not in Sj
    THEN delete xk from Sk
  ELSE
  FOR each node xj in Sj
    IF its ancestor is not in Sk
    THEN delete xj from Sj

STEP 4: FOR each node xj in Sj
  BEGIN
    IF the path from root to node xj is not included in the path root/X1/X2/.../Xj
    THEN delete the subtree that does not satisfy the path condition
  END
STEP 5: FOR each node xj in Sj, consider all subtrees starting with xj
  BEGIN
    IF Xj+1/.../Xm is the same as /Xm
    THEN return nodes Xm
    ELSE INPUT Xj/Xj+1/.../Xm
         GO TO STEP 1
  END

28
Rest-tree Algorithm - Example
XPath: A/B/C/E//H
Step 1: Calculate DNs
[Figure: DOM tree]
29
Rest-tree Algorithm - Example
Step 2: Minimum DN
DN of node B = 32, DN of node C = 20, DN of node E = 18
30
Rest-tree Algorithm - Example
Step 3: Delete E nodes that have no C node among their ancestors
31
Rest-tree Algorithm - Example
Step 4: Delete the subtree that does not satisfy the path A/B/C/E
Step 5: Get all the H nodes from E//H
32
Test Cases and Comparisons

Size of DOM Tree
The Entry-point algorithm performs much better than the traditional algorithm, taking less than one third of its processing time

Increasing Number of Nodes for XPath
//A20//C30//A80
33
Test Cases and Comparisons

Result Nodes Set
The processing time of the Entry-point algorithm increases slightly as the number of result nodes grows.
This is partly due to the recursive function call in the Entry-point algorithm code

Increasing Number of Result Nodes
34
Test Cases and Comparisons

Tree Height
The processing time of the three methods follows the same trend as the height of the tree increases

Tree Height Increasing
35
Test Cases and Comparisons

Without Index on result nodes
The traditional method performs very poorly here, falling into the no-index category.
However, the Entry-point algorithm still performs well

Tree Height Increasing
36
Conclusions

We proposed three types of indexes on XML data to execute XPath queries efficiently.
We proposed two algorithms that use these indexes to process and optimize XPath queries.
We also simulated both bottom-up and top-down approaches.
Processing XPath queries with the Entry-point indexing technique performs much better than traditional algorithms, with or without indexes