Title: CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions
1CS 561 Presentation Indexing and
Querying XML Data for Regular Path Expressions
- A Paper by Quanzhong Li and Bongki Moon
- Presented by Ming Li
2Our Objective
- Developing a system that will enable us to
perform XML data queries efficiently.
3XML Query Languages
- Used for retrieving data from XML files.
- Use a regular path expression syntax.
- e.g. XPath, XQuery.
4Queries Today - Inefficient
- Usually evaluated by XML tree traversals, which are inefficient.
- Top-Down Approach
- Bottom-Up Approach
- An example
- the query
- /chapter/_*/figure
- (finding all figures in all chapters.)
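To make the cost of a traversal concrete, here is a minimal sketch of a top-down evaluation of the example query (all figures in all chapters). It assumes Python's standard xml.etree.ElementTree module and a made-up sample document; the name find_figures_top_down is hypothetical and the code is not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
SAMPLE = """<chapter>
  <section>
    <figure id="f1"/>
    <section><figure id="f2"/></section>
  </section>
  <figure id="f3"/>
</chapter>"""


def find_figures_top_down(root):
    """Collect figure ids below a root <chapter>, counting every node visited."""
    found = []
    visited = 0

    def visit(node):
        nonlocal visited
        visited += 1
        if node.tag == "figure":
            found.append(node.get("id"))
        for child in node:
            visit(child)

    # Only descend if the document root is a chapter element.
    if root.tag == "chapter":
        for child in root:
            visit(child)
    return found, visited


figures, visited = find_figures_top_down(ET.fromstring(SAMPLE))
print(figures, "nodes visited:", visited)
```

The traversal touches every node below the chapter element, whether or not it is a figure. An index keyed on element names would avoid visiting the non-matching nodes.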
5Our Objective - Refined
- Developing a system that will enable us to
perform XML data queries efficiently.
- Developing such a system consists of:
- Developing a way to efficiently store XML data.
- Developing efficient algorithms for processing
regular path expressions (e.g. XQuery
expressions).
6Storing XML Documents - XISS
- XISS - XML Indexing and Storage System.
- Provides us with ways to
- efficiently find all elements or attributes with
the same name string, grouped by the document to
which they belong.
- quickly determine the ancestor-descendant
relationship between elements and/or attributes
in the hierarchy of XML data.
7Determining Ancestor-Descendent Relationship
- According to Dietz, for two given nodes x and y
of a tree T, x is an ancestor of y iff x occurs
before y in the preorder traversal and after y in
the postorder traversal.
- Example
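The preorder/postorder test can be sketched in a few lines. This is an illustration only: the tree is a made-up dictionary mapping each node to its children, and the names number_tree and is_ancestor are hypothetical.

```python
def number_tree(tree, root):
    """Return (pre, post): the preorder and postorder rank of every node."""
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre) + 1          # rank when the node is first reached
        for child in tree.get(node, []):
            visit(child)
        post[node] = len(post) + 1        # rank once the whole subtree is done

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """x is an ancestor of y iff x comes before y in preorder and after y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]


# Made-up tree: a has children b and e; b has children c and d.
tree = {"a": ["b", "e"], "b": ["c", "d"]}
pre, post = number_tree(tree, "a")
print(is_ancestor("a", "d", pre, post), is_ancestor("b", "e", pre, post))
```

Once the two ranks are stored, the test is two integer comparisons and needs no traversal.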
8Determining Ancestor-Descendent Relationship
cont.
- Advantage: the ancestor-descendent relationship
can be determined in constant time.
- Disadvantage: a lack of flexibility.
- e.g. inserting a new node requires renumbering
many tree nodes.
9Determining Ancestor-Descendent Relationship
cont.
- A new numbering scheme
- Each node is associated with an <order, size>
pair.
- For a tree node y and its parent x:
- [order(y), order(y) + size(y)] ⊂ (order(x),
order(x) + size(x)]
- For two sibling nodes x and y, if x is the
predecessor of y in preorder traversal, the following holds:
- order(x) + size(x) < order(y).
10Determining Ancestor-Descendent Relationship
cont.
- Fact: for two given nodes x and y of a tree T, x
is an ancestor of y iff
- order(x) < order(y) ≤ order(x) + size(x)
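The <order, size> test can be sketched as follows. The numbering pass below is only one possible way to satisfy the two conditions of the scheme: it reserves a fixed number of unused values (the slack parameter, an assumption of this sketch) at the end of every node's interval. The function names are hypothetical.

```python
def assign_order_size(tree, root, slack=2):
    """Give every node an (order, size) pair, leaving `slack` spare numbers per node."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        order = counter
        for child in tree.get(node, []):
            visit(child)
        counter += slack                      # spare values for future insertions
        labels[node] = (order, counter - order)

    visit(root)
    return labels


def is_ancestor(x, y, labels):
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    order_x, size_x = labels[x]
    order_y, _ = labels[y]
    return order_x < order_y <= order_x + size_x


# Made-up tree: a has children b and e; b has children c and d.
tree = {"a": ["b", "e"], "b": ["c", "d"]}
labels = assign_order_size(tree, "a")
print(labels)
print(is_ancestor("a", "d", labels), is_ancestor("b", "e", labels))
```

As with the preorder/postorder test, the check is a constant number of integer comparisons.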
11Determining Ancestor-Descendent Relationship
cont.
- Properties
- the ancestor-descendent relationship can be
determined in constant time.
- flexibility: node insertion usually doesn't
require recomputation of tree nodes.
- an element can be uniquely identified in a
document by its order value.
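The flexibility property can be illustrated with a small sketch. It assumes a hand-made labeling in which each node's size is larger than its current subtree needs, so unused order values remain inside the interval. The helper insert_last_child and the sample labels are hypothetical.

```python
def insert_last_child(labels, children, parent, new_node):
    """Add new_node as the last child of parent without touching other labels.

    Returns False when the parent's interval has no unused value left,
    which is the case where renumbering would be required.
    """
    parent_order, parent_size = labels[parent]
    parent_end = parent_order + parent_size
    # Largest value already covered by the parent itself or its children.
    used_end = max(
        (labels[c][0] + labels[c][1] for c in children.get(parent, [])),
        default=parent_order,
    )
    new_order = used_end + 1
    if new_order > parent_end:
        return False
    labels[new_node] = (new_order, 0)
    children.setdefault(parent, []).append(new_node)
    return True


# Hand-made (order, size) labels with spare room in every interval.
labels = {"a": (1, 14), "b": (2, 8), "c": (3, 2), "d": (6, 2), "e": (11, 2)}
children = {"a": ["b", "e"], "b": ["c", "d"]}

print(insert_last_child(labels, children, "b", "f"), labels["f"])
```

Here the new node receives an order value from the spare range of its parent, and no existing label changes. When the spare range is exhausted the function reports failure instead of renumbering.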
12XISS System Overview
13Name Index and Value Table
- Objective: minimizing the storage and computation
overhead by eliminating replicated strings and
string comparisons.
- Name Index - mapping distinct name strings into
unique name identifiers (nid).
- Value Table - mapping distinct value strings
(i.e. attribute value and text value) into unique
value identifiers (vid).
- Both implemented as a B-tree.
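The idea of replacing strings with identifiers can be sketched as below. A Python dictionary stands in for the B-tree here, which is a simplification of this sketch and not how XISS stores the index; the class name StringIndex and its methods are hypothetical.

```python
class StringIndex:
    """Maps distinct strings to small integer identifiers and back."""

    def __init__(self):
        self._ids = {}        # string -> identifier (stand-in for the B-tree)
        self._strings = []    # identifier -> string

    def intern(self, text):
        """Return the identifier of text, assigning a new one on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, identifier):
        """Return the string that an identifier stands for."""
        return self._strings[identifier]


name_index = StringIndex()
nids = [name_index.intern(name) for name in ["chapter", "figure", "chapter"]]
print(nids, name_index.lookup(nids[1]))
```

Each distinct string is stored once, and later comparisons between names become integer comparisons on identifiers.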
14The Element Index
- Objective: quickly finding all elements with the
same name string.
- Structure
15The Attribute Index
- Objective: quickly finding all attributes with the
same name string.
- Structure
- Same structure as the Element Index except that
the record in the attribute index has a value
identifier (vid), which is a key used to obtain the
attribute value from the value table.
16The Structure Index
- Objectives
- Finding the parent element and child elements (or
attributes) for a given element.
- Finding the parent element for a given attribute.
- Structure
17The Structure Index cont.
- Structure
- B-tree using document identifier (did) as a key.
- Leaf nodes: linear arrays with records for all
elements and attributes from an XML document.
- Each record: nid, <order, size>, Parent order,
Child order, Sibling order, Attribute order.
- Records are ordered by order value.
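A leaf array of the structure index might be sketched as follows. The sketch assumes that Child order refers to the first child and Sibling order to the next sibling; the slide does not spell this out. Because records are ordered by order value, a record can be found by binary search. The Record class, the helper names and the sample data are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent: Optional[int]      # order of the parent element
    child: Optional[int]       # order of the first child (assumed meaning)
    sibling: Optional[int]     # order of the next sibling (assumed meaning)
    attribute: Optional[int]   # order of the first attribute (assumed meaning)


def find(records, order):
    """Binary search in a list of records sorted by order value."""
    orders = [r.order for r in records]
    i = bisect_left(orders, order)
    if i < len(records) and records[i].order == order:
        return records[i]
    return None


def children_of(records, order):
    """Follow the child link once, then the sibling links."""
    result = []
    current = find(records, order).child
    while current is not None:
        rec = find(records, current)
        result.append(rec)
        current = rec.sibling
    return result


# Made-up document: element 1 has children 2 and 5; element 2 has child 3.
records = [
    Record(nid=0, order=1, size=6, parent=None, child=2, sibling=None, attribute=None),
    Record(nid=1, order=2, size=1, parent=1, child=3, sibling=5, attribute=None),
    Record(nid=2, order=3, size=0, parent=2, child=None, sibling=None, attribute=None),
    Record(nid=3, order=5, size=0, parent=1, child=None, sibling=None, attribute=None),
]
print([r.order for r in children_of(records, 1)], find(records, 3).parent)
```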
18Querying Method
- Decomposing path expressions into simple path
expressions.
- Applying algorithms on simple path expressions
and their intermediate results.
19Decomposition of Path Expressions
- The main idea
- A complex path expression is decomposed into
several simple path expressions.
- Each simple path expression produces an
intermediate result that can be used in the
subsequent stage of processing.
- The results of the simple path expressions are
then combined or joined together to obtain the
final result of the given query.
20Basic Subexpressions - Example
Decomposition of (E1/E2)*/E3/((E4[@a=V]) |
(E5/_*/E6))
21Example EA-Join Element and Attribute Join
22EA-Join Element and Attribute Join
Input: {E1, ..., Em} and {A1, ..., An}
Ei is a set of elements having a common document
identifier (did).
Aj is a set of attributes having a common document
identifier (did).
Output: a set of (e, a) pairs such that the element e
is the parent of the attribute a.
23EA-Join Element and Attribute Join
The Algorithm
// Sort-merge Ei and Aj by did.
(1) foreach Ei and Aj with the same did do
      // Sort-merge Ei and Aj by
      // PARENT-CHILD relationship
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e, a)
      end
    end
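The two merge stages can be sketched in Python as below. This is a sketch under stated assumptions, not the paper's implementation: each attribute record is assumed to carry the order value of its parent element (the parent field), inputs are assumed to be lists of (did, records) sorted by did, and records within a document are assumed sorted by order value. All names are hypothetical.

```python
from collections import namedtuple

Element = namedtuple("Element", "order size")
# `parent` (order value of the owning element) is an assumed field.
Attribute = namedtuple("Attribute", "order parent")


def merge_parent_child(elems, attrs):
    """Second stage: a single scan over two lists sorted by order value.

    Relies on attributes being numbered right after their parent element,
    so the parent values of successive attributes never decrease.
    """
    pairs = []
    i = 0
    for a in attrs:
        # Skip elements that come before this attribute's parent.
        while i < len(elems) and elems[i].order < a.parent:
            i += 1
        if i < len(elems) and elems[i].order == a.parent:
            pairs.append((elems[i], a))
    return pairs


def ea_join(element_sets, attribute_sets):
    """First stage: merge the two inputs by did, then join within each document."""
    result = []
    i = j = 0
    while i < len(element_sets) and j < len(attribute_sets):
        e_did, elems = element_sets[i]
        a_did, attrs = attribute_sets[j]
        if e_did < a_did:
            i += 1
        elif e_did > a_did:
            j += 1
        else:
            result.extend(merge_parent_child(elems, attrs))
            i += 1
            j += 1
    return result


# Made-up input: document 1 holds two nested elements, each with one attribute.
element_sets = [(1, [Element(1, 3), Element(3, 1)]), (2, [Element(1, 0)])]
attribute_sets = [(1, [Attribute(2, 1), Attribute(4, 3)]), (3, [Attribute(2, 1)])]
for e, a in ea_join(element_sets, attribute_sets):
    print(e, a)
```

Documents 2 and 3 appear in only one input, so the did merge skips them without comparing any of their records.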
24EA-Join Example
- Consider the XML document
- <Ele Att="A1">
- <Ele Att="A2"> </Ele>
- </Ele>
- And the query /Ele[@Att="A1"]
25EA-Join Querying /Ele[@Att="A1"]
- <Ele Att="A1">
- <Ele Att="A2"> </Ele>
- </Ele>
- Sort-merging Eles and Atts by parent-child
relationship will give us the list
- <1,3>, <2,0>, <3,1>, <4,0>
- Finding the elements Ele with a child
attribute Att with the value A1 from the
accepted list is easy using the information in
the Element Record.
26EA-Join Comments
- Only a two-stage sort-merge operation without
additional cost of sorting.
- First merge by did.
- Second merge by examining parent-child
relationship.
- This merge is based on the order values of the
element and attribute as defined by the numbering
scheme.
- Attributes should be placed before their sibling
elements in the order of the numbering scheme.
- This guarantees that elements and attributes with the
same did can be merged in a single scan.
27Conclusions
- XISS can efficiently process regular path
expression queries.
- Performance improvement over the conventional
methods by up to an order of magnitude.
- Future work: optimal page size or the break-even
point between the two criteria.
28Thank you so much!