Title: Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach
1Efficient Processing of XML Twig Patterns with
Parent Child Edges: A Look-ahead Approach
2Outline
- → XML Twig Pattern Matching
- Problem definition
- State of the art: TwigStack
- Sub-optimality of TwigStack
- Our algorithm: TwigStackList
- Performance
- Conclusion
3XML Twig Pattern Matching
- XML Data Model
- An XML document is commonly modeled as a rooted, ordered, labeled tree.
- E.g. document D1. Note that identifiers (e.g. b1) are given to tree nodes for easy reference.
[Figure: document tree D1, rooted at book b1, with preface pf1, chapters c1 and c2, and nested section, title, paragraph and figure elements (s1-s3, t1-t2, p1-p4, f1-f3).]
4XML Twig Pattern Matching
- Regional Coding [1]
- Node label: (startPos:endPos, LevelNum)
- startPos and endPos are calculated by performing a pre-order traversal of the document tree
- LevelNum is the level of the node in the tree.
- E.g. region codes for document D1:
book (0:50, 1)
preface (1:3, 2)
chapter (4:22, 2)
chapter (23:45, 2)
section (5:21, 3)
paragraph (2:2, 3)
section (13:17, 4)
section (7:12, 4)
paragraph (18:20, 4)
title (6:6, 4)
paragraph (14:16, 5)
figure (19:19, 5)
title (8:8, 5)
paragraph (9:11, 5)
figure (15:15, 6)
figure (10:10, 6)
- [1] M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
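As a minimal sketch of how such labels can be used, the ancestor-descendant and parent-child relationships between two nodes can be tested from their region codes alone. The `Region` type and the function names are illustrative assumptions, not taken from the slides, and the codes are read as startPos:endPos.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of one node; the field names are illustrative."""
    tag: str
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    # a is a proper ancestor of d when a's interval strictly encloses d's.
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    # A parent is an ancestor that sits exactly one level above the child.
    return is_ancestor(p, c) and c.level == p.level + 1


# A few nodes of D1, with the codes read as startPos:endPos (an assumption).
book = Region("book", 0, 50, 1)
chapter = Region("chapter", 4, 22, 2)
section = Region("section", 5, 21, 3)
figure = Region("figure", 10, 10, 6)

print(is_ancestor(section, figure))  # True
print(is_parent(section, figure))    # False
print(is_parent(chapter, section))   # True
```

Both tests take constant time per pair of nodes, which is what lets a join algorithm decide structural relationships without traversing the document tree.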
5XML Twig Pattern Matching
- What is a Twig Pattern?
- A twig pattern is a small tree whose nodes are predicates (e.g. element type tests) and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
- E.g. the XPath query Q1 selects Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements having at least one child element Title.
Q1: Section[Title]/Paragraph//Figure
[Figure: twig pattern for Q1 with nodes Section, Title, Paragraph and Figure.]
6XML Twig Pattern Matching
- Twig Pattern Matching
- Problem Statement
- Given a query twig pattern Q and an XML database D that has index structures (e.g. a regional coding scheme) to identify the database nodes that satisfy each of Q's node predicates, compute ALL the answers to Q in D.
- E.g. the matches for the twig pattern Section[Title]/Paragraph//Figure in the document D1 are
- (s1, t1, p4, f3)
- (s2, t2, p2, f1)
[Figure: document tree D1 with its nodes labeled by identifier.]
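The problem statement can be illustrated with a brute-force matcher over region codes. This is only a naive baseline sketch, not TwigStack or TwigStackList. The pairing of identifiers (s1, t1, ...) with region codes is an assumption inferred from the example tree, and only the elements under chapter c1 are listed.

```python
# (id, tag, startPos, endPos, level); the id-to-code pairing is assumed.
ELEMENTS = [
    ("s1", "section", 5, 21, 3),
    ("t1", "title", 6, 6, 4),
    ("s2", "section", 7, 12, 4),
    ("t2", "title", 8, 8, 5),
    ("p2", "paragraph", 9, 11, 5),
    ("f1", "figure", 10, 10, 6),
    ("s3", "section", 13, 17, 4),
    ("p3", "paragraph", 14, 16, 5),
    ("f2", "figure", 15, 15, 6),
    ("p4", "paragraph", 18, 20, 4),
    ("f3", "figure", 19, 19, 5),
]


def is_ancestor(a, d):
    # Interval containment on (startPos, endPos).
    return a[2] < d[2] and d[3] < a[3]


def is_parent(p, c):
    return is_ancestor(p, c) and c[4] == p[4] + 1


def by_tag(tag):
    return [e for e in ELEMENTS if e[1] == tag]


def match_q1():
    """Naively enumerate matches of Section[Title]/Paragraph//Figure."""
    results = []
    for s in by_tag("section"):
        for t in by_tag("title"):
            if not is_parent(s, t):
                continue
            for p in by_tag("paragraph"):
                if not is_parent(s, p):
                    continue
                for f in by_tag("figure"):
                    if is_ancestor(p, f):
                        results.append((s[0], t[0], p[0], f[0]))
    return results


MATCHES = match_q1()
print(MATCHES)  # [('s1', 't1', 'p4', 'f3'), ('s2', 't2', 'p2', 'f1')]
```

The nested loops compare every combination of candidate elements, which is the cost that holistic stream-based algorithms are designed to avoid.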
7XML Twig Pattern Matching
- TwigStack [2]: a holistic approach
- Tag streaming: all elements with tag q are grouped in a stream Tq, ordered by their startPos
- Optimal when all the edges in the twig pattern are A-D edges
- Two-phase algorithm
- Phase 1 (TwigJoin): a list of intermediate path solutions is output
- Phase 2 (Merge): the intermediate path lists are merged to produce the final result
- [2] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
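The two ingredients named above, tag streams and the merge of path solutions, can be sketched as follows. This is a simplified illustration under stated assumptions: `build_streams` and `merge_paths` are hypothetical helper names, the merge is a hash-based stand-in that joins on the root element only, and the path lists are written out by hand rather than produced by Phase 1.

```python
from collections import defaultdict


def build_streams(elements):
    """Group (tag, start, end, level) tuples into per-tag streams by startPos."""
    streams = defaultdict(list)
    for tag, start, end, level in sorted(elements, key=lambda e: e[1]):
        streams[tag].append((start, end, level))
    return dict(streams)


def merge_paths(left, right):
    """Join root-to-leaf path solutions that share the same root element."""
    by_root = defaultdict(list)
    for path in right:
        by_root[path[0]].append(path[1:])
    return [l + r for l in left for r in by_root.get(l[0], [])]


STREAMS = build_streams([
    ("section", 7, 12, 4),
    ("title", 6, 6, 4),
    ("section", 5, 21, 3),
])
print(STREAMS)

# Hand-written path solutions for the two root-to-leaf paths of
# Section[Title]/Paragraph//Figure (assumed, not computed by Phase 1 here).
title_paths = [("s1", "t1"), ("s2", "t2")]
figure_paths = [("s1", "p4", "f3"), ("s2", "p2", "f1")]
MERGED = merge_paths(title_paths, figure_paths)
print(MERGED)  # [('s1', 't1', 'p4', 'f3'), ('s2', 't2', 'p2', 'f1')]
```

A path solution whose root has no partner in the other list produces no output row, which is what "not merge-joinable" means for an intermediate result.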
8XML Twig Pattern Matching
- TwigStack Review
- A node q in a twig pattern Q is associated with a stack Sq
- An element e is pushed onto its stack if and only if e is in some match to Q.
- E.g. only the color-highlighted elements are pushed onto their stacks.
- Thus it is ensured that no redundant paths are output.
- An element e is popped from its stack once all matches involving it have been reported
- Thus the memory space used by the stacks is bounded.
Q: Section//Title//Paragraph//Figure
[Figure: document tree D1 next to the stacks S_Section, S_Title, S_Paragraph and S_Figure.]
9XML Twig Pattern Matching
- Optimality of TwigStack for twig patterns with only A-D edges
- Each stream Tq is scanned only once, where q is a node appearing in the twig pattern
- No redundant intermediate results: all intermediate paths output in Phase 1 appear in the final result
- CPU and I/O cost: O(|Input| + |Output|)
- Space complexity: O(longest path in the XML tree)
10Sub-optimality of TwigStack
- Unfortunately, TwigStack is sub-optimal for queries with any parent-child relationship.
- TwigStack may output a large number of intermediate results that are not merge-joinable into final solutions for queries with parent-child relationships.
11Example for sub-optimality of TwigStack
[Figure: a simple XML tree with elements s1, t1, p1, t2 and f2, and a twig pattern with nodes Section, title, paragraph and figure.]
- TwigStack outputs (s1,t1) as an intermediate result, since s1 has descendants t1 and p1, and p1 in turn has a descendant f2.
- Observe that p1 has no child with tag figure, so there is no match in this XML tree and (s1,t1) is a useless solution.
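The gap between the two tests can be made concrete with a small sketch. The tree shape (s1 with children t1 and p1, p1 with child t2, t2 with child f2), the region codes, and the edge types of the query (section//title, section//paragraph, paragraph/figure) are all assumptions made for illustration, since the figure itself is not reproduced here.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    tag: str
    start: int
    end: int
    level: int


def contains(a: Elem, d: Elem) -> bool:
    # Ancestor-descendant test on region codes.
    return a.start < d.start and d.end < a.end


def is_child(p: Elem, c: Elem) -> bool:
    # Parent-child test: containment plus adjacent levels.
    return contains(p, c) and c.level == p.level + 1


# Hypothetical codes for the assumed tree s1 -> (t1, p1), p1 -> t2, t2 -> f2.
s1 = Elem("s1", "section", 1, 10, 1)
t1 = Elem("t1", "title", 2, 3, 2)
p1 = Elem("p1", "paragraph", 4, 9, 2)
t2 = Elem("t2", "title", 5, 8, 3)
f2 = Elem("f2", "figure", 6, 7, 4)

# A descendant-only test, as described for TwigStack above: every edge is
# treated as ancestor-descendant when deciding whether s1 is worth pushing.
descendant_test_passes = (
    contains(s1, t1) and contains(s1, p1) and contains(p1, f2)
)

# The query itself requires figure to be a child of paragraph.
real_match_exists = (
    contains(s1, t1) and contains(s1, p1) and is_child(p1, f2)
)

print(descendant_test_passes)  # True: (s1, t1) gets output
print(real_match_exists)       # False: no final answer contains s1
```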
12Main problem and my experiment
- As shown before, TwigStack may output intermediate results that are not merge-joinable into final solutions for queries with parent-child edges.
- To understand the problem better, we run TwigStack on a real dataset.
- Data set: TreeBank (UW XML repository)
- Queries
- Q1: VP /DT //PRP_DOLLAR_
- Q2: S//NP//PP/TO/VP/_NONE_/JJ
- Q3: S /JJ /NP
- All queries contain parent-child relationships.
13Our experimental results
Query | Intermediate paths by TwigStack | Merge-joinable paths | Useless intermediate paths (%)
Q1 | 10,663 | 5 | 99.9
Q2 | 24,493 | 49 | 99.5
Q3 | 70,967 | 10 | 99.9
Most intermediate paths do not contribute to
final answers due to parent-child edges! It is a
big challenge to improve TwigStack to answer
queries with parent-child edges.
14Our intuitive observation
- We can improve TwigStack for queries in the
previous example.
[Figure: a simple XML tree with elements s1, t1, p1, t2 and f1, and a twig pattern with nodes Section, title, paragraph and figure.]
- Our intuitive observation: why not read more paragraph elements and cache them in main memory?
- For example, in this XML tree, after scanning p1 we do not stop but continue to read the next element. We then find that there is only one paragraph element and that f1 is not a child of it, so we should not output any solution.
15Outline
- XML Twig Pattern Matching
- Problem definition
- State of the art: TwigStack
- Sub-optimality of TwigStack
- → Our algorithm: TwigStackList
- Experimental results
- Conclusion
16Our main idea
- Main idea: we read more elements from the input streams and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to a final answer.
- One desideratum: we cannot cache too many elements in main memory. For each node q in the twig query, the number of elements with tag q cached in main memory should not exceed the length of the longest path in the XML dataset.
17Our caching strategy
- Which elements should be cached in main memory?
- Only those that may contribute to final answers
[Figure: a simple XML tree with elements s1, t1, s2, s3, s4 and p1, and a twig pattern with nodes Section, title and paragraph.]
- We only need to cache s1, s2 and s4 in main memory. Why not s3?
- Because if s3 contributed to a final answer, there would be a paragraph element before p1 that is a child of s3. But p1 is the first paragraph element, so s3 is guaranteed not to contribute to any final answer.
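A minimal sketch of this caching rule: while scanning the section stream up to the head of the paragraph stream, keep only the sections that are still open when that paragraph starts. The region codes and the `cache_candidates` helper are hypothetical; they assume s1, s2 and s4 are nested ancestors of p1 and that s3 closes before p1 begins.

```python
from typing import List, NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int


def cache_candidates(section_stream: List[Elem], paragraph_head: Elem) -> List[Elem]:
    """Keep the sections that enclose the first unread paragraph element."""
    cached = []
    for sec in section_stream:  # the stream is ordered by start position
        if sec.start > paragraph_head.start:
            break  # later sections cannot be ancestors of this paragraph
        if sec.end > paragraph_head.end:
            cached.append(sec)  # sec encloses the paragraph, so keep it
        # otherwise sec ended before the paragraph started and is skipped
    return cached


# Hypothetical region codes for the tree in this example.
sections = [
    Elem("s1", 1, 20),
    Elem("s2", 4, 19),
    Elem("s3", 5, 6),
    Elem("s4", 7, 18),
]
p1 = Elem("p1", 8, 9)

CACHED = [s.name for s in cache_candidates(sections, p1)]
print(CACHED)  # ['s1', 's2', 's4']
```

Because every cached element encloses the same paragraph, the cached elements lie on one root-to-leaf path, so the list cannot grow beyond the depth of the document.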
18Our criteria for pushing an element to stack
- Whether an element can be pushed onto its stack is very important for controlling intermediate results. Why?
- Because once an element is pushed onto its stack, it is ready to be output. So the fewer elements are pushed onto stacks, the fewer intermediate results are output.
- Our criteria: given an element e_q from stream Tq, before e_q is pushed onto stack Sq, we ensure that
- (i) element e_q has a descendant e_q' for each child q' of q, and
- (ii) if (q, q') is a parent-child relationship, e_q' has a parent with tag q on the path from e_q to e_q^max, where e_q^max is the descendant of e_q with the maximal start value, and
- (iii) each q' recursively satisfies the first two conditions.
19Examples
- Let us see two examples to understand the
criteria.
[Figure: a simple XML tree with elements s1, t1, s2, p1, s3 and f1, and a twig pattern with nodes Section, title, paragraph and figure.]
- Element s1 can be pushed onto its stack, but s2 and s3 cannot.
- Note that s1 can be pushed not just because t1, p1 and f1 are its descendants but, more importantly, because on the path from s1 to f1, elements t1, p1 and f1 can each find a parent with tag section.
20Examples
[Figure: a simple XML tree with elements s1, o1, t1, p1, s2 and f1, and a twig pattern with nodes Section, title, paragraph and figure.]
- In this example, s1 cannot be pushed onto its stack. Although elements t1, p1 and f1 are still descendants of s1, on the path from s1 to f1 element p1 cannot find a parent with tag section. Observe that the parent of p1 is o1 (o1 stands for an element with some other tag).
- In this example, we cache s1 and s2 in main memory, since they might be involved in query answers later.
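The two examples can be replayed with a simplified version of the push test. This sketch works on a small in-memory tree rather than on streams and lists, and it assumes a one-level twig in which title, paragraph and figure are all parent-child children of section. The tree shapes below are also assumptions, since the figures are not reproduced here, and `can_push` is a hypothetical helper name.

```python
def can_push(elem, children, tags, child_tags):
    """Simplified push test: for every child tag of the query node, some
    descendant of elem with that tag must have a parent whose tag equals
    elem's tag."""
    parent = {c: p for p, cs in children.items() for c in cs}

    def descendants(node):
        stack = list(children.get(node, []))
        while stack:
            current = stack.pop()
            yield current
            stack.extend(children.get(current, []))

    for child_tag in child_tags:
        found = any(
            tags[d] == child_tag and tags[parent[d]] == tags[elem]
            for d in descendants(elem)
        )
        if not found:
            return False
    return True


QUERY_CHILDREN = ["title", "paragraph", "figure"]
TAGS = {
    "s1": "section", "s2": "section", "s3": "section", "o1": "other",
    "t1": "title", "p1": "paragraph", "f1": "figure",
}

# Assumed shape for the first example: s1 -> (t1, s2), s2 -> (p1, s3), s3 -> f1.
TREE_A = {"s1": ["t1", "s2"], "s2": ["p1", "s3"], "s3": ["f1"]}
# Assumed shape for the second example: s1 -> (t1, o1), o1 -> (p1, s2), s2 -> f1.
TREE_B = {"s1": ["t1", "o1"], "o1": ["p1", "s2"], "s2": ["f1"]}

print(can_push("s1", TREE_A, TAGS, QUERY_CHILDREN))  # True
print(can_push("s2", TREE_A, TAGS, QUERY_CHILDREN))  # False: no title below s2
print(can_push("s1", TREE_B, TAGS, QUERY_CHILDREN))  # False: p1's parent is o1
```

The test is necessary but not sufficient for a match: in the first tree s1 passes even though p1 and f1 are not its own children, which matches the statement that s1 can be pushed.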
21TwigStackList
- We propose a novel holistic twig join algorithm, TwigStackList, to evaluate a twig query.
- Unlike the previous TwigStack, TwigStackList has the following features
- It takes the parent-child edges in the query into account and strengthens the criteria for pushing elements onto stacks.
- It uses a list data structure to cache elements that are likely to participate in final solutions. The number of elements in any list is strictly bounded by the length of the longest path in the dataset.
- It has a broader class of optimal queries. TwigStackList guarantees that each output intermediate solution contributes to final answers when the query contains only ancestor-descendant edges below branching nodes.
22Example
- TwigStackList is I/O optimal for the following query, whereas TwigStack is sub-optimal. Note that below the branching node section, all edges in the query are A-D relationships.
[Figure: a simple XML tree with elements s1, t1, p1, t2 and f1, and a twig pattern with nodes Section, title, paragraph and figure.]
- In this case, TwigStackList does not push s1 onto its stack and thereby avoids outputting (s1,t1).
- But TwigStack pushes s1 onto its stack and outputs (s1,t1).
- Observe that (s1,t1) is a useless intermediate solution.
23Sub-optimality of TwigStackList
- Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges below branching nodes.
[Figure: a twig pattern with nodes Section, paragraph and title, and a simple XML tree with elements s1, t1, s2 and f1.]
- Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 onto its stack, so (s1,t1) is output as a useless solution.
24Outline
- XML Twig Pattern Matching
- Problem definition
- State of the art: TwigStack
- Sub-optimality of TwigStack
- Our algorithm: TwigStackList
- → Experimental results
- Conclusion
25Experimental Setting
- Experimental Setting
- Pentium 4 CPU, RAM 768MB, disk 2GB
- TreeBank
- Maximal depth 36, 2.4 million nodes
- DTD data
- a → bc | cb | d
- c → a
- a and c are non-terminals; b and d are terminals
- Random
- Seven tags: a, b, c, d, e, f, g, uniformly distributed
- Fan-out of elements varied from 2 to 100, depth varied from 10 to 100
26Performance against TreeBank
- Queries with XPath expression
Q1 S//MD//ADJ
Q2 S/VP/PP/NP/VBN/IN
Q3 S/VP//PP//NP/VBN//IN
Q4 VP/DT//PRP_DOLLAR_
Q5 S//VP/IN//NP
Q6 S/JJ/NP
- Number of intermediate path solutions for TwigStackList vs. TwigStack
Query | TwigStack | TwigStackList | Reduction (%) | Useful paths
Q1 | 35 | 35 | 0 | 35
Q2 | 2957 | 143 | 95 | 92
Q3 | 25892 | 4612 | 82 | 4612
Q4 | 10663 | 11 | 99.9 | 5
Q5 | 702391 | 22565 | 96.8 | 22565
Q6 | 70988 | 30 | 99.9 | 10
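As a worked check of the table, the reduction percentage can be recomputed from the two path counts, and a query can be flagged as optimal for TwigStackList when every path it outputs is useful. The formula (TwigStack - TwigStackList) / TwigStack is an assumption about how the column was derived; it agrees with the listed values up to rounding.

```python
# (TwigStack paths, TwigStackList paths, useful paths), copied from the table.
ROWS = {
    "Q1": (35, 35, 35),
    "Q2": (2957, 143, 92),
    "Q3": (25892, 4612, 4612),
    "Q4": (10663, 11, 5),
    "Q5": (702391, 22565, 22565),
    "Q6": (70988, 30, 10),
}


def reduction_pct(twigstack: int, twigstacklist: int) -> float:
    """Share of TwigStack's intermediate paths that TwigStackList avoids."""
    return 100.0 * (twigstack - twigstacklist) / twigstack


# Queries for which TwigStackList outputs no useless path at all.
OPTIMAL = [q for q, (_, tsl, useful) in ROWS.items() if tsl == useful]

for query, (ts, tsl, useful) in ROWS.items():
    print(query, round(reduction_pct(ts, tsl), 1), "useless paths:", tsl - useful)
print(OPTIMAL)  # ['Q1', 'Q3', 'Q5']
```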
27Performance analysis
- We have three observations
- (1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.
- (2) When edges below branching nodes contain only ancestor-descendant relationships, TwigStackList is optimal, but TwigStack is sub-optimal. See Q3, Q5.
- (3) When edges below branching nodes contain parent-child relationships, both TwigStack and TwigStackList are sub-optimal. But TwigStackList typically outputs far fewer useless intermediate solutions than TwigStack. See Q2, Q4, Q6.
28Performance against DTD data
There is no matching solution for the query a//b//c/d in the DTD dataset, yet TwigStack outputs many redundant path solutions. In contrast, TwigStackList is optimal and significantly outperforms TwigStack on this query.
29Performance against random dataset
Twig queries
From the following table, we see that for all
queries, TwigStackList again is more efficient
than TwigStack in terms of the size of
intermediate results.
Query | TwigStack | TwigStackList | Reduction (%) | Useful paths
Q1 | 9048 | 4354 | 52 | 2077
Q2 | 1098 | 467 | 57 | 100
Q3 | 25901 | 14476 | 44 | 14476
Q4 | 32875 | 16775 | 49 | 16775
Q5 | 3896 | 1320 | 66 | 566
30Outline
- XML Twig Pattern Matching
- Problem definition
- State of the art: TwigStack
- Sub-optimality of TwigStack
- Our algorithm: TwigStackList
- Experimental results
- → Conclusion
31Conclusion
- The previous algorithm, TwigStack, is sub-optimal for queries with parent-child edges.
- We propose a new algorithm, TwigStackList, to address this problem.
- TwigStackList broadens the class of queries with I/O optimality.
- Experiments show that TwigStackList typically outputs far fewer useless intermediate results when the query contains parent-child relationships.
- We recommend using TwigStackList to evaluate queries with parent-child relationships.
32Backup questions
- 1. Turn back to the slide about Performance against DTD data. In the two figures, what is the X-axis?
- The X-axis shows the ratio of the number of elements with tag d to the number of elements with tags b and c. This ratio matters because, given the DTD a → bc | cb | d, c → a and the query a//b//c/d, the useless intermediate results output by TwigStack increase as the ratio decreases. In contrast, TwigStackList is optimal in this case, so it is not affected by changes in the ratio. Therefore, we show the superiority of TwigStackList over TwigStack by varying the ratio.
33Backup questions
- 2. You say that TwigStackList is more efficient than TwigStack because it outputs fewer intermediate results. So it is easy to see that TwigStackList is better than TwigStack in terms of I/O cost, but what about CPU cost?
- TwigStackList is more efficient than TwigStack for evaluating queries with parent-child relationships in terms of not only intermediate result size but also execution time. Of course, TwigStackList needs to scan the elements cached in main memory, which slightly increases the CPU cost. But compared with the large benefit from the reduced I/O cost, this cost is worthwhile.