CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions

Provided by: amnons8

Learn more at: http://web.cs.wpi.edu

1
CS 561 Presentation Indexing and
Querying XML Data for Regular Path Expressions

A Paper by Quanzhong Li and Bongki Moon
Presented by Ming Li

2
Our Objective

Developing a system that will enable us to
perform XML data queries efficiently.

3
XML Query Languages

Used for retrieving data from XML files.
Use a regular path expression syntax.
e.g. XPath, XQuery.

4
Queries Today - Inefficient

Usually evaluated by XML tree traversals, which is inefficient.
Top-Down Approach
Bottom-Up Approach
An example:
the query
/chapter/_*/figure
(finding all figures in all chapters.)

5
Our Objective - Refined

Developing a system that will enable us to
perform XML data queries efficiently.
Developing such a system consists of:
Developing a way to efficiently store XML data.
Developing efficient algorithms for processing
regular path expressions (e.g. XQuery
expressions).

6
Storing XML Documents - XISS

XISS - XML Indexing and Storage System.
Provides us with ways to:
efficiently find all elements or attributes with
the same name string, grouped by the document to
which they belong.
quickly determine the ancestor-descendant
relationship between elements and/or attributes
in the hierarchy of XML data.

7
Determining Ancestor-Descendant Relationship

According to Dietz's numbering scheme, for two
given nodes x and y of a tree T, x is an ancestor
of y iff x occurs before y in the preorder
traversal and after y in the postorder traversal.
Example
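
A minimal Python sketch of this test, assuming a tree stored as a dictionary from node name to child list (the tree and its node names are made up for illustration and are not taken from the slides):

```python
def number_tree(tree, root):
    """Assign preorder and postorder ranks to every node.

    `tree` maps a node name to the list of its children (hypothetical layout).
    """
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre) + 1          # rank on first visit
        for child in tree.get(node, []):
            visit(child)
        post[node] = len(post) + 1        # rank after all descendants

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """Dietz's test: x precedes y in preorder and follows y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]


tree = {"book": ["chapter", "appendix"], "chapter": ["section", "figure"]}
pre, post = number_tree(tree, "book")
print(is_ancestor("chapter", "figure", pre, post))   # True
print(is_ancestor("appendix", "figure", pre, post))  # False
```

Once the two ranks are stored, each test is two integer comparisons, regardless of tree size.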

8
Determining Ancestor-Descendant Relationship
cont.

Advantage: the ancestor-descendant relationship
can be determined in constant time.
Disadvantage: a lack of flexibility.
e.g. inserting a new node requires recomputation
of many tree nodes.

9
Determining Ancestor-Descendant Relationship
cont.

A new numbering scheme:
Each node is associated with an <order, size>
pair.
For a tree node y and its parent x,
[order(y), order(y) + size(y)] ⊂ (order(x),
order(x) + size(x)].
For two sibling nodes x and y, if x is the
predecessor of y in preorder traversal, then
order(x) + size(x) < order(y).

10
Determining Ancestor-Descendant Relationship
cont.

Fact: for two given nodes x and y of a tree T, x
is an ancestor of y iff
order(x) < order(y) <= order(x) + size(x)
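
As an illustration, the sketch below assigns <order, size> pairs in one preorder pass and then applies the test above. The `slack` parameter (how many order values are left unused at the end of each node's interval) and the example tree are assumptions made for this sketch, not values from the slides.

```python
def assign_order_size(tree, root, slack=2):
    """Extended preorder numbering with reserved room.

    `tree` maps a node to its children; `slack` is a hypothetical number of
    unused order values left at the end of every node's interval.
    Returns {node: (order, size)}.
    """
    labels = {}

    def visit(node, order):
        last = order                      # largest order value used so far
        for child in tree.get(node, []):
            last = visit(child, last + 1)
        size = (last - order) + slack     # descendants' range plus slack
        labels[node] = (order, size)
        return order + size               # end of this node's interval

    visit(root, 1)
    return labels


def is_ancestor(x, y, labels):
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    order_x, size_x = labels[x]
    order_y, _ = labels[y]
    return order_x < order_y <= order_x + size_x


tree = {"book": ["chapter", "appendix"], "chapter": ["section", "figure"]}
labels = assign_order_size(tree, "book")
print(labels["chapter"])                        # (2, 8)
print(is_ancestor("chapter", "figure", labels))  # True
print(is_ancestor("section", "figure", labels))  # False
```

Only the ancestor's pair and the descendant's order value are needed, so the check stays constant-time.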

11
Determining Ancestor-Descendant Relationship
cont.

Properties:
The ancestor-descendant relationship can be
determined in constant time.
Flexibility: node insertion usually doesn't
require recomputation of tree nodes.
An element can be uniquely identified in a
document by its order value.
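
To see why insertion usually needs no recomputation, here is a rough sketch that places a new last child into the unused part of its parent's interval. The function name, the policy of giving the new leaf size 0, and the sample labels are all assumptions for illustration; the slides do not specify an insertion policy.

```python
def insert_last_child(labels, children, parent, new_node):
    """Try to place `new_node` as the last child of `parent` in unused space.

    `labels` maps node -> (order, size); `children` maps node -> child list.
    Returns True if no other node had to be renumbered, False if the parent's
    interval is full (a renumbering would then be needed).
    """
    p_order, p_size = labels[parent]
    kids = children.setdefault(parent, [])
    # First free order value: just past the last child's interval, or just
    # past the parent itself when it has no children yet.
    if kids:
        k_order, k_size = labels[kids[-1]]
        free = k_order + k_size + 1
    else:
        free = p_order + 1
    if free > p_order + p_size:
        return False
    labels[new_node] = (free, 0)   # size 0: no room reserved below the new leaf
    kids.append(new_node)
    return True


labels = {"chapter": (2, 8), "section": (3, 2), "figure": (6, 2)}
children = {"chapter": ["section", "figure"]}
print(insert_last_child(labels, children, "chapter", "table"))  # True
print(labels["table"])                                          # (9, 0)
```

The existing pairs are untouched; only when the reserved room runs out does the function report that renumbering is required.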

12
XISS System Overview

13
Name Index and Value Table

Objective: minimizing the storage and computation
overhead by eliminating replicated strings and
string comparisons.
Name Index: maps distinct name strings into
unique name identifiers (nid).
Value Table: maps distinct value strings
(i.e. attribute values and text values) into unique
value identifiers (vid).
Both are implemented as B+-trees.
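
The idea of replacing strings with identifiers can be sketched as follows. A plain Python dict stands in for the tree-based index here, and the class and method names are hypothetical:

```python
class StringTable:
    """Maps each distinct string to a small integer identifier.

    A plain dict stands in for the tree index used by the real system.
    """

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, s):
        """Return the identifier for `s`, assigning a new one if unseen."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, ident):
        """Return the string that was assigned identifier `ident`."""
        return self._strings[ident]


name_index = StringTable()    # name string  -> nid
value_table = StringTable()   # value string -> vid

nid_first = name_index.intern("figure")
nid_again = name_index.intern("figure")
print(nid_first == nid_again)         # True
print(name_index.lookup(nid_first))   # figure
```

After interning, comparing two names is an integer comparison, and each distinct string is stored only once.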

14
The Element Index

Objective: quickly finding all elements with the
same name string.
Structure

15
The Attribute Index

Objective: quickly finding all attributes with the
same name string.
Structure
Same structure as the Element Index, except that
each record in the Attribute Index has a value
identifier (vid), which is a key used to obtain the
attribute value from the Value Table.

16
The Structure Index

Objectives
Finding the parent element and child elements (or
attributes) for a given element.
Finding the parent element for a given attribute.
Structure

17
The Structure Index cont.

Structure
B+-tree using document identifier (did) as a key.
Leaf nodes: linear arrays with records for all
elements and attributes from an XML document.
Each record: {nid, <order, size>, Parent order,
Child order, Sibling order, Attribute order}.
Records are ordered by order value.
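
A possible in-memory picture of one leaf array, written as a hedged sketch. The interpretation of the Child, Sibling, and Attribute fields (first child, next sibling, first attribute) is an assumption of this example, as are the class names and the sample records:

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent: Optional[int] = None      # order of the parent element
    child: Optional[int] = None       # order of the first child element (assumed)
    sibling: Optional[int] = None     # order of the next sibling element (assumed)
    attribute: Optional[int] = None   # order of the first attribute (assumed)


class DocumentRecords:
    """Records of one document, kept sorted by order value."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def find(self, order):
        """Binary search by order value; None if no such record."""
        i = bisect_left(self.orders, order)
        if i < len(self.orders) and self.orders[i] == order:
            return self.records[i]
        return None

    def children(self, order):
        """Follow the child link, then the chain of sibling links."""
        rec = self.find(order)
        nxt = rec.child if rec else None
        found = []
        while nxt is not None:
            found.append(nxt)
            nxt = self.find(nxt).sibling
        return found


doc = DocumentRecords([
    Record(nid=0, order=1, size=3, child=3, attribute=2),
    Record(nid=1, order=2, size=0, parent=1),
    Record(nid=0, order=3, size=1, parent=1, attribute=4),
    Record(nid=1, order=4, size=0, parent=3),
])
print(doc.children(1))      # [3]
print(doc.find(4).parent)   # 3
```

Because the array is sorted by order value, locating a record by its order is a binary search rather than a scan.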

18
Querying Method

Decomposing path expressions into simple path
expressions.
Applying algorithms on simple path expressions
and their intermediate results.

19
Decomposition of Path Expressions

The main idea
A complex path expression is decomposed into
several simple path expressions.
Each simple path expression produces an
intermediate result that can be used in the
subsequent stage of processing.
The results of the simple path expressions are
then combined or joined together to obtain the
final result of the given query.

20
Basic Subexpressions - Example

Decomposition of (E1/E2)*/E3/((E4[@a=V])|(E5/_*/E6))

21
Example: EA-Join (Element and Attribute Join)

22
EA-Join: Element and Attribute Join

Input:
{E1, ..., Em}: each Ei is a set of elements
having a common document identifier (did).
{A1, ..., An}: each Aj is a set of attributes
having a common document identifier (did).
Output:
A set of (e, a) pairs such that the element e is
the parent of the attribute a.

23
EA-Join: Element and Attribute Join

The Algorithm
// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by PARENT-CHILD relationship.
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e, a)
      end
    end

24
EA-Join Example

Consider the XML document
<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
And the query /Ele[@Att="A1"]

25
EA-Join: Querying /Ele[@Att="A1"]

<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
Sort-merging the Ele elements and Att attributes
by parent-child relationship gives us the list
<1,3>, <2,0>, <3,1>, <4,0>
Finding the Ele elements that have a child
attribute Att with the value A1 in the resulting
list is easy using the information in
the Element Record.

26
EA-Join Comments

Only a two-stage sort-merge operation, without
the additional cost of sorting:
First, merge by did.
Second, merge by examining the parent-child
relationship.
This merge is based on the order values of the
element and attribute as defined by the numbering
scheme.
Attributes should be placed before their sibling
elements in the order of the numbering scheme.
This guarantees that elements and attributes with
the same did can be merged in a single scan.

27
Conclusions

XISS can efficiently process regular path
expression queries.
Performance improvement over the conventional
methods by up to an order of magnitude.
Future work: optimal page size or the break-even
point between the two criteria.

28
Thank you so much!