Title: Indexing and Querying XML Data for Regular Path Expressions
1Indexing and Querying XML Data for Regular Path
Expressions
- Quanzhong Li and Bongki Moon
- Dept. of Computer Science
- University of Arizona
- VLDB 2001.
2Querying XML
- XML has a tree-structured data model.
- Queries navigate the data using regular path expressions (e.g., XPath).
- e.g. /chapter/_*/figure[@caption="Tree Frogs"]
- Accessing all elements with the same name string.
- Ancestor-descendant relationship between elements.
3Contribution
- New system for indexing XML data.
- Querying XML data based on a numbering scheme for elements.
- Join algorithms for processing complex regular path expressions.
4Outline
- Numbering scheme
- Index structure
- Join algorithms
- Experimental results
5Path expression evaluation
- Previous approaches
- Conventional tree traversals
- Disadvantage: overhead of traversing for long or unknown path lengths.
- New approach
- Indexing for efficient element access.
- Numbering scheme for ancestor-descendant relationship.
6Dietz's Numbering Scheme
- For two given nodes x and y of a tree T, x is an ancestor of y if and only if
- x occurs before y in the preorder traversal of T, and
- x occurs after y in the postorder traversal of T.
[Figure: example tree with nodes labeled (preorder, postorder): (1,7), (2,4), (3,1), (4,2), (5,3), (6,6), (7,5)]
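The ancestor test above can be tried out with a short sketch. The tree shape used here is an assumption: it is one shape that reproduces the seven (preorder, postorder) labels from the figure, and the names `Node`, `number_tree`, and `is_ancestor` are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    pre: int = 0   # rank in preorder traversal (1-based)
    post: int = 0  # rank in postorder traversal (1-based)


def number_tree(root):
    """Assign preorder and postorder ranks in a single depth-first pass."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x, y):
    """Dietz's test: x comes before y in preorder and after y in postorder."""
    return x.pre < y.pre and x.post > y.post


# Assumed tree shape: a has children b and f; b has c, d, e; f has g.
c, d, e, g = Node("c"), Node("d"), Node("e"), Node("g")
b = Node("b", [c, d, e])
f = Node("f", [g])
a = Node("a", [b, f])
number_tree(a)
labels = {n.name: (n.pre, n.post) for n in (a, b, c, d, e, f, g)}
```

With this shape, `labels` reproduces the pairs in the figure, and `is_ancestor(a, g)` holds while `is_ancestor(b, g)` does not.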
7Proposed numbering scheme
- Associates with each node a pair of numbers <order, size> as follows:
- For a tree node y and its parent x,
- order(x) < order(y)
- order(y) + size(y) <= order(x) + size(x)
- For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then
- order(x) + size(x) < order(y)
[Figure: example tree with nodes labeled <order, size>: <1,100>, <10,30>, <11,5>, <17,5>, <25,5>, <41,10>, <45,5>]
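One way to produce such <order, size> pairs is sketched below. This is not the authors' procedure: the `slack` parameter, the helper names, and the policy of leaving the same gap everywhere are assumptions made for illustration. The sketch only shows that leaving gaps keeps the three conditions above true while allowing a later insertion without renumbering.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign(node, start, slack):
    """Number the subtree at `node` from `start`; return order + size of `node`."""
    node.order = start
    last = start
    for child in node.children:
        # Leave `slack` unused numbers before each child.
        last = assign(child, last + 1 + slack, slack)
    # Reserve `slack` unused numbers at the end of the node's own range.
    node.size = last + slack - start
    return node.order + node.size


def insert_leaf(parent, leaf):
    """Append a leaf as the last child, using reserved numbers if any remain."""
    if parent.children:
        prev_end = parent.children[-1].order + parent.children[-1].size
    else:
        prev_end = parent.order
    candidate = prev_end + 1
    if candidate > parent.order + parent.size:
        raise ValueError("no reserved space left; renumbering would be needed")
    leaf.order, leaf.size = candidate, 0
    parent.children.append(leaf)


def check_invariants(node):
    """Verify the parent/child and sibling conditions for the whole subtree."""
    for child in node.children:
        assert node.order < child.order
        assert child.order + child.size <= node.order + node.size
    for left, right in zip(node.children, node.children[1:]):
        assert left.order + left.size < right.order
    return all(check_invariants(child) for child in node.children)


x, y = Node("x"), Node("y")
root = Node("r", [x, y])
assign(root, 1, slack=2)
z = Node("z")
insert_leaf(root, z)  # fits into the reserved range of the root
```

With `slack=0` the scheme degenerates to plain preorder ranks, with size equal to the number of descendants.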
8Advantages
- Efficient Updates
- Extra space can be reserved to accommodate future
insertions.
9Ancestor-descendant relationship
- For two given nodes x and y of a tree T, x is an ancestor of y if and only if
- order(x) < order(y) <= order(x) + size(x).
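As a quick check, the condition can be applied to the <order, size> labels from the proposed-numbering figure. The sketch below assumes nodes are plain `(order, size)` tuples; the function names are illustrative.

```python
def is_ancestor(x, y):
    """x and y are (order, size) pairs; True if x is an ancestor of y."""
    x_order, x_size = x
    y_order, _ = y
    return x_order < y_order <= x_order + x_size


def descendants(x, nodes):
    """All nodes in `nodes` whose order falls inside the range owned by x."""
    return [y for y in nodes if is_ancestor(x, y)]


# Labels taken from the figure on the proposed numbering scheme.
labels = [(1, 100), (10, 30), (11, 5), (17, 5), (25, 5), (41, 10), (45, 5)]
under_10 = descendants((10, 30), labels)
under_41 = descendants((41, 10), labels)
```

The test needs only two integer comparisons per pair, with no tree traversal.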
10Outline
- Numbering scheme
- Index structure
- Join algorithms
- Experimental results
11Index and Data Organization
[Figure: XISS architecture. Labels: Query, Result, Query Processor, XML Raw Data, Document Loader, XISS (Element Index, Attribute Index, Structure Index, Name Index, Value Table), Paged File]
12Element Index
[Figure: element index layout. Labels: element nid; B-tree; document ID list; B-tree; element list with the same name in the same document; element record: <order, size>, depth, parent ID]
13Structure Index
[Figure: structure index layout. Labels: B-tree; document ID (did); array of all elements and attributes in the same document; record: nid, <order, size>, parent order, child order, sibling order, attribute order]
14Outline
- Numbering scheme
- Index structure
- Join algorithms
- Experimental results
15Regular Path expression
- Complex regular path expressions.
- e.g., /chapter/_*/figure[@caption="Tree Frogs"]
Symbol   Function of symbol
_        Any single node
|        Union of nodes
*        Zero or more occurrences of a node
@        Denotes attributes
16Regular expression Decomposition
- A regular path expression can be decomposed into a combination of the following basic subexpressions:
- A subexpression with a single element or a single attribute,
- A subexpression with an element and an attribute (e.g., figure[@caption="Tree Frogs"]),
- A subexpression with two elements (e.g., chapter/figure or chapter/_*/figure),
- A subexpression with a Kleene closure (*, +) of another subexpression, and
- A subexpression that is a union of two other subexpressions.
17Example
- (E1/E2)*/E3/((E4[@A=v]) | (E5/_*/E6))
[Figure: decomposition tree of the expression. Leaves: E1, E2, E3, E4, @A=v, E5, E6. Operator labels: EE-Join (x4), EA-Join, KC-Join, Union. Edge labels: /, /_*/]
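The decomposition can be mimicked with a small expression tree. The sketch below is an assumption-laden illustration, not XISS code: the class names (`Elem`, `AttrPred`, `Path`, `Star`, `UnionOf`) and the `plan` function are hypothetical, and the expression is built by hand rather than parsed. It maps each basic subexpression type to the join that would evaluate it and lists the joins bottom-up.

```python
from dataclasses import dataclass


@dataclass
class Elem:            # a single element name, e.g. E1
    name: str


@dataclass
class AttrPred:        # element with an attribute test, e.g. E4[@A=v]
    elem: Elem
    attr: str
    value: str


@dataclass
class Path:            # left/right, or left/_*/right when any_depth is True
    left: object
    right: object
    any_depth: bool = False


@dataclass
class Star:            # Kleene closure of a subexpression
    inner: object


@dataclass
class UnionOf:         # union of two subexpressions
    left: object
    right: object


def plan(expr):
    """Return the join operators needed for `expr`, in bottom-up order."""
    if isinstance(expr, Elem):
        return []
    if isinstance(expr, AttrPred):
        return ["EA-Join"]
    if isinstance(expr, Path):
        return plan(expr.left) + plan(expr.right) + ["EE-Join"]
    if isinstance(expr, Star):
        return plan(expr.inner) + ["KC-Join"]
    if isinstance(expr, UnionOf):
        return plan(expr.left) + plan(expr.right) + ["Union"]
    raise TypeError(f"unknown subexpression: {expr!r}")


# (E1/E2)*/E3/((E4[@A=v]) | (E5/_*/E6)), built by hand.
expr = Path(
    Path(Star(Path(Elem("E1"), Elem("E2"))), Elem("E3")),
    UnionOf(
        AttrPred(Elem("E4"), "A", "v"),
        Path(Elem("E5"), Elem("E6"), any_depth=True),
    ),
)
operators = plan(expr)
```

For this expression the plan contains four EE-Joins, one EA-Join, one KC-Join, and one Union, matching the operator labels in the figure.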
18Join algorithms
- Element Attribute join
- Element Element join
- Kleene Closure join
19EA-Join Algorithm
- Input
- E1..Em: Ei is a set of elements having a common document identifier
- A1..An: Aj is a set of attributes having a common document identifier
- Output
- A set of (e, a) pairs such that the element e is the parent of the attribute a.
- // Sort-merge Ei and Aj by document identifier.
- For each Ei and Aj with the same did do
- // Sort-merge Ei and Aj by PARENT-CHILD relationship.
- For each e in Ei and a in Aj do
- If (e is a parent of a) then output (e, a)
- End
- End
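The steps above can be sketched as a single merge over two sorted lists. Several things here are assumptions rather than details from the slides: the record layouts, the idea that each attribute record carries its parent's order number (`parent_order`), and the sample data. Both inputs are sorted so that one forward scan pairs each attribute with its parent element.

```python
from typing import NamedTuple


class Element(NamedTuple):
    did: int      # document identifier
    order: int
    size: int
    name: str


class Attribute(NamedTuple):
    did: int
    order: int
    parent_order: int  # assumed field: order number of the owning element
    name: str
    value: str


def ea_join(elements, attributes):
    """Return (element, attribute) pairs where the element owns the attribute."""
    es = sorted(elements, key=lambda e: (e.did, e.order))
    ats = sorted(attributes, key=lambda a: (a.did, a.parent_order, a.order))
    out = []
    i = 0
    for a in ats:
        key = (a.did, a.parent_order)
        # Skip elements that sort before this attribute's parent.
        while i < len(es) and (es[i].did, es[i].order) < key:
            i += 1
        if i < len(es) and (es[i].did, es[i].order) == key:
            out.append((es[i], a))
    return out


# Illustrative data: figure elements and caption attributes in two documents.
figures = [Element(2, 5, 1, "figure"), Element(1, 3, 2, "figure")]
captions = [
    Attribute(2, 6, 5, "caption", "Toads"),
    Attribute(1, 4, 3, "caption", "Tree Frogs"),
    Attribute(1, 2, 1, "caption", "Overview"),  # parent is not a figure
]
pairs = ea_join(figures, captions)
```

A value filter such as `[@caption="Tree Frogs"]` could then be applied to the attribute side of each pair.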
20Example
[Figure: example document tree. Node labels: book; chapter (x3); appendix; Figure (x3)]
21Attribute-element position
[Figure: chapter and name nodes labeled <order, size>: chapter <1,3>, chapter <1,3>, chapter <2,1>, chapter <3,1>, name <4,0>, name <2,0>, name <4,0>, name <3,0>]
22EE-Join Algorithm
- Input
- E1..Em and F1..Fn: each Ei and Fj is a set of elements having a common document identifier.
- Output
- A set of (e, f) pairs such that the element e is an ancestor of the element f.
- // Sort-merge Ei and Fj by document identifier.
- For each Ei and Fj with the same did do
- // Sort-merge Ei and Fj by ANCESTOR-DESCENDANT relationship.
- For each e in Ei and f in Fj do
- If (e is an ancestor of f) then output (e, f)
- End
- End
23Extreme case of EE-Join
[Figure: nested chapter elements and figure elements labeled <order, size>: chapter <1,90>, chapter <2,80>, chapter <8,20>, chapter <9,10>, figure <10,0>, figure <11,0>, figure <19,0>]
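A minimal sketch of an ancestor-descendant merge, run on the labels from this slide, is given below. It is a simplified illustration and not the authors' implementation: the record layout and function name are assumptions, and the ancestor test is taken as order(e) < order(f) <= order(e) + size(e). The start pointer `lo` only moves forward, but the inner scan restarts from `lo` for every ancestor candidate, which is where nested ancestors cause repeated scanning.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    did: int    # document identifier
    order: int
    size: int
    name: str


def ee_join(ancestors, descendants):
    """Return (e, f) pairs where e is an ancestor of f in the same document."""
    es = sorted(ancestors, key=lambda e: (e.did, e.order))
    fs = sorted(descendants, key=lambda f: (f.did, f.order))
    out = []
    lo = 0
    for e in es:
        # Elements at or before e in document order can never be descendants
        # of e or of any later ancestor candidate, so skip them for good.
        while lo < len(fs) and (fs[lo].did, fs[lo].order) <= (e.did, e.order):
            lo += 1
        # Scan the range owned by e without consuming it: a nested ancestor
        # candidate later in the list may need the same descendants again.
        j = lo
        while j < len(fs) and fs[j].did == e.did and fs[j].order <= e.order + e.size:
            out.append((e, fs[j]))
            j += 1
    return out


chapters = [Elem(1, 1, 90, "chapter"), Elem(1, 2, 80, "chapter"),
            Elem(1, 8, 20, "chapter"), Elem(1, 9, 10, "chapter")]
figures = [Elem(1, 19, 0, "figure"), Elem(1, 10, 0, "figure"),
           Elem(1, 11, 0, "figure")]
pairs = ee_join(chapters, figures)
```

In this extreme case every figure lies inside every chapter, so the join emits all 4 x 3 combinations.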
24KC-Join Algorithm
- Input
- E1..Em, where Ei is a group of elements from an XML document.
- Output
- A Kleene closure of E1..Em
- // Apply the EE-Join algorithm repeatedly.
- Set i = 1
- Set K1 = E1..Em
- Repeat
- Set i = i + 1
- Set Ki = EE-Join(Ki-1, K1)
- Until (Ki is empty)
- Output union of K1, K2, .., Ki-1.
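The loop can be sketched as repeated path extension. The sketch makes several simplifying assumptions that are not in the slides: results are represented as tuples of elements (paths), all elements come from one document, the join step is a plain nested loop rather than a sort-merge, and it uses the ancestor test as the join predicate. The element labels are the nested chapter labels from the EE-Join extreme case.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    order: int
    size: int


def is_ancestor(x, y):
    return x.order < y.order <= x.order + x.size


def ee_join(paths, elems):
    """Extend each path by every element that descends from its last node."""
    return [p + (f,) for p in paths for f in elems if is_ancestor(p[-1], f)]


def kc_join(elems):
    """Union of K1, K2, ... where Ki = EE-Join(Ki-1, K1), until Ki is empty."""
    k1 = [(e,) for e in elems]
    levels = [k1]
    while True:
        nxt = ee_join(levels[-1], elems)
        if not nxt:          # Ki is empty: stop
            break
        levels.append(nxt)
    return [p for level in levels for p in level]


chapters = [Elem(1, 90), Elem(2, 80), Elem(8, 20), Elem(9, 10)]
closure = kc_join(chapters)
longest = max(len(p) for p in closure)
```

The loop terminates because every extension appends an element with a strictly larger order number, so paths cannot grow beyond the number of input elements.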
25Outline
- Numbering scheme
- Index structure
- Join algorithms
- Experimental results
26Experiment Results
- Comparison with top-down and bottom-up evaluation methods.
- Comparison for
- EE-Join (E1/_*/E2)
- EA-Join (E[@A])
- Scalability test
27EE-Join performance
28EA-Join performance
29Results
- EE-Join algorithm outperformed bottom-up.
- EA-Join algorithm is comparable with top-down but outperformed bottom-up.
- Both are linearly scalable.