Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach

Slides: 34
Provided by: luj92

1
Efficient Processing of XML Twig Patterns with
Parent-Child Edges: A Look-ahead Approach
  • Presenter: Qi He

2
Outline
  • → XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • Performance
  • Conclusion

3
XML Twig Pattern Matching
  • XML Data Model
  • An XML document is commonly modeled as a rooted,
    ordered and labeled tree.
  • E.g. document D1. Note that identifiers (e.g. b1)
    are given to tree nodes for easy reference.

[Figure: document tree D1. The root is book (b1); the other nodes are preface (pf1), chapter (c1, c2), section (s1-s3), title (t1, t2), paragraph (p1-p4) and figure (f1-f3) elements.]

4
XML Twig Pattern Matching
  • Regional coding [1]
  • Node label: (startPos:endPos, LevelNum)
  • startPos and endPos are calculated by performing
    a pre-order traversal of the document tree
  • LevelNum is the level of the node in the tree.
  • E.g.

D1 (node labels, in document order):
book (0:50, 1)
preface (1:3, 2)
paragraph (2:2, 3)
chapter (4:22, 2)
section (5:21, 3)
title (6:6, 4)
section (7:12, 4)
title (8:8, 5)
paragraph (9:11, 5)
figure (10:10, 6)
section (13:17, 4)
paragraph (14:16, 5)
figure (15:15, 6)
paragraph (18:20, 4)
figure (19:19, 5)
chapter (23:45, 2)
  • [1] M. P. Consens and T. Milo. Optimizing queries on
    files. In Proceedings of ACM SIGMOD, 1994.
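
The structural tests that this labeling enables can be sketched as follows. This is a minimal illustration, not code from the presentation: the `Node` class and the function names are assumptions, and the strict containment test assumes every element's region lies strictly inside its ancestors' regions, as in the D1 labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical container for one region label (startPos:endPos, LevelNum).
    tag: str
    start: int
    end: int
    level: int


def is_ancestor(a: Node, d: Node) -> bool:
    # a is an ancestor of d when d's region is strictly inside a's region.
    return a.start < d.start and d.end < a.end


def is_parent(p: Node, c: Node) -> bool:
    # A parent is an ancestor exactly one level above the child.
    return is_ancestor(p, c) and c.level == p.level + 1


# Labels taken from D1.
chapter = Node("chapter", 4, 22, 2)
section = Node("section", 5, 21, 3)
title = Node("title", 6, 6, 4)
figure = Node("figure", 10, 10, 6)

print(is_ancestor(chapter, figure))  # chapter (4:22) contains figure (10:10)
print(is_parent(section, title))     # section is one level above title
```

Both tests need only the two labels, so no tree traversal is required to decide an A-D or P-C relationship.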

5
XML Twig Pattern Matching
  • What is a Twig Pattern?
  • A twig pattern is a small tree whose nodes are
    predicates (e.g. element type test) and edges are
    either Parent-Child (P-C) edges or
    Ancestor-Descendant (A-D) edges.
  • E.g. An XPath query Q1 selects Figure elements
    which are descendants of some Paragraph elements
    which in turn are children of Section elements
    having at least one child element Title


Q1: Section[Title]/Paragraph//Figure
[Figure: twig pattern for Q1 with nodes Section, Title, Paragraph and Figure.]

6
XML Twig Pattern Matching
  • Twig Pattern Matching
  • Problem Statement
  • Given a query twig pattern Q and an XML database
    D that has index structures (e.g. a regional coding
    scheme) to identify database nodes that satisfy
    each of Q's node predicates, compute ALL the
    answers to Q in D.
  • E.g. The matches for twig pattern
    Section[Title]/Paragraph//Figure in the document
    D1 are
  • (s1, t1, p4, f3)
  • (s2, t2, p2, f1)
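
To make the problem statement concrete, here is a brute-force sketch that evaluates Section[Title]/Paragraph//Figure over the region labels of D1. It is only a baseline for checking answers, not TwigStack or TwigStackList. The pairing of identifiers with labels is partly an assumption: it is chosen to agree with the two matches listed above, and the identifiers of the remaining nodes are guesses.

```python
from collections import namedtuple

Node = namedtuple("Node", "name tag start end level")

# Region labels from D1; the name-to-label pairing is assumed (see lead-in).
D1 = [
    Node("p1", "paragraph", 2, 2, 3),
    Node("s1", "section", 5, 21, 3),
    Node("t1", "title", 6, 6, 4),
    Node("s2", "section", 7, 12, 4),
    Node("t2", "title", 8, 8, 5),
    Node("p2", "paragraph", 9, 11, 5),
    Node("f1", "figure", 10, 10, 6),
    Node("s3", "section", 13, 17, 4),
    Node("p3", "paragraph", 14, 16, 5),
    Node("f2", "figure", 15, 15, 6),
    Node("p4", "paragraph", 18, 20, 4),
    Node("f3", "figure", 19, 19, 5),
]


def is_ancestor(a, d):
    return a.start < d.start and d.end < a.end


def is_parent(p, c):
    return is_ancestor(p, c) and c.level == p.level + 1


def by_tag(tag):
    return [n for n in D1 if n.tag == tag]


def match_q1():
    # Section[Title]/Paragraph//Figure:
    # Title and Paragraph are children of Section, Figure is a descendant of Paragraph.
    results = []
    for s in by_tag("section"):
        for t in by_tag("title"):
            if not is_parent(s, t):
                continue
            for p in by_tag("paragraph"):
                if not is_parent(s, p):
                    continue
                for f in by_tag("figure"):
                    if is_ancestor(p, f):
                        results.append((s.name, t.name, p.name, f.name))
    return results


print(match_q1())  # [('s1', 't1', 'p4', 'f3'), ('s2', 't2', 'p2', 'f1')]
```

The nested loops compare every combination of elements, which is exactly the cost that holistic algorithms such as TwigStack are designed to avoid.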


[Figure: document tree D1, showing node identifiers only.]

7
XML Twig Pattern Matching
  • TwigStack [2]: a holistic approach
  • Tag streaming: all elements of tag q are grouped
    in a stream Tq, ordered by their startPos
  • Optimal when all the edges in the twig pattern are
    A-D edges
  • Two-phase algorithm
  • Phase 1 (TwigJoin): a list of intermediate paths
    is output
  • Phase 2 (Merge): the intermediate path lists are
    merged to get the result
  • [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic
    twig joins: optimal XML pattern matching. In
    Proceedings of ACM SIGMOD, 2002.
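
Tag streaming can be sketched as below. This is an illustrative in-memory version under assumptions: elements are plain tuples (tag, startPos, endPos, LevelNum), and `build_streams` is a hypothetical helper name; a real system would read the streams from an index on disk.

```python
from collections import defaultdict


def build_streams(elements):
    # Group elements by tag, then order each stream Tq by startPos.
    streams = defaultdict(list)
    for element in elements:
        streams[element[0]].append(element)
    for tag in streams:
        streams[tag].sort(key=lambda element: element[1])
    return dict(streams)


# A few D1 labels, deliberately given out of document order.
elements = [
    ("section", 7, 12, 4),
    ("title", 6, 6, 4),
    ("section", 5, 21, 3),
    ("title", 8, 8, 5),
]
streams = build_streams(elements)
print([e[1] for e in streams["section"]])  # [5, 7]
print([e[1] for e in streams["title"]])    # [6, 8]
```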

8
XML Twig Pattern Matching
  • TwigStack Review
  • A node q in a twig pattern Q is coupled with a
    stack Sq
  • An element e is pushed into its stack if and only
    if e is in some match to Q.
  • E.g. Only color highlighted elements are pushed
    into their stacks.
  • Thus it is ensured that no redundant paths are
    output.
  • An element e is popped out from its stack if all
    matches involving it have been reported
  • Thus we ensure that the memory space used by
    stacks is bounded.

[Figure: document tree D1 and query Q: Section//Title//Paragraph//Figure, with one stack per query node (SSection, STitle, SParagraph, SFigure).]

9
XML Twig Pattern Matching
  • Optimality of TwigStack for twig patterns with
    only A-D edges
  • Each stream Tq, where q appears in the twig
    pattern, is scanned only once
  • No redundant intermediate results: all
    intermediate paths output in Phase 1 appear in
    the final result
  • CPU and I/O cost: O(|Input| + |Output|)
  • Space complexity: O(longest path in the XML
    tree)

10
Sub-optimality of TwigStack
  • Unfortunately, TwigStack is sub-optimal for
    queries with any parent-child relationship.
  • For such queries, TwigStack may output a large
    number of intermediate results that cannot be
    merge-joined into final solutions.

11
Example for sub-optimality of TwigStack
[Figure: a simple XML tree (elements s1, t1, p1, t2, f2) and a twig pattern over Section, title, paragraph and figure.]
  • TwigStack outputs (s1,t1) as an intermediate
    result, since s1 has descendants t1 and p1, and
    p1 in turn has a descendant f2.
  • Observe that p1 has no child with tag figure, so
    there is no match in this XML tree and (s1,t1)
    is a useless intermediate solution.
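
Why (s1,t1) is useless becomes visible in the merge phase. The sketch below is a simplified stand-in for Phase 2, not the merge procedure from the TwigStack paper: it assumes the branching node is the query root, so two path-solution lists are joined on their first element, and `merge_paths` is a hypothetical helper name.

```python
def merge_paths(left_paths, right_paths):
    # Index the right-hand path solutions by their root element.
    by_root = {}
    for path in right_paths:
        by_root.setdefault(path[0], []).append(path)
    # Join each left-hand path with every right-hand path sharing its root.
    merged = []
    for path in left_paths:
        for partner in by_root.get(path[0], []):
            merged.append(path + partner[1:])
    return merged


# Example from the slide: the Section-title path (s1, t1) was output, but no
# Section-paragraph-figure path exists because p1 has no figure child.
title_paths = [("s1", "t1")]
figure_paths = []
print(merge_paths(title_paths, figure_paths))  # []

# Hypothetical contrast: if a path (s1, p1, f1) did exist, the join would succeed.
print(merge_paths(title_paths, [("s1", "p1", "f1")]))  # [('s1', 't1', 'p1', 'f1')]
```

An intermediate path with no join partner is written out and read back for nothing, which is the I/O cost the look-ahead approach tries to remove.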

12
Main problem and my experiment
  • As shown before, TwigStack might output some
    intermediate results that are not merge-joinable
    to final solutions for queries with parent-child
    edges.
  • To quantify this, we ran TwigStack on a real
    dataset.
  • Dataset: TreeBank (UW XML repository)
  • Queries
  • Q1: VP /DT //PRP_DOLLAR_
  • Q2: S//NP//PP/TO/VP/_NONE_/JJ
  • Q3: S /JJ /NP
  • All queries contain parent-child relationships.

13
Our experimental results
Query  Intermediate paths by TwigStack  Merge-joinable paths  Useless intermediate paths (%)
Q1     10,663                           5                     99.9
Q2     24,493                           49                    99.5
Q3     70,967                           10                    99.9
Most intermediate paths do not contribute to
final answers due to parent-child edges! It is a
big challenge to improve TwigStack to answer
queries with parent-child edges.
14
Our intuitive observation
  • We can improve TwigStack for queries in the
    previous example.

[Figure: a simple XML tree (elements s1, t1, p1, t2, f1) and a twig pattern over Section, title, paragraph and figure.]
  • Our intuitive observation: why not read more
    paragraph elements and cache them in main
    memory?
  • For example, in this XML tree, after scanning p1
    we do not stop but continue to read the next
    element. We then find that there is only one
    paragraph element and that f1 is not a child of
    it, so we should not output any solution.

15
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • → Our algorithm: TwigStackList
  • Experimental results
  • Conclusion

16
Our main idea
  • Main idea: we read more elements from the input
    streams and cache some of them in main memory,
    so that we can decide more accurately whether an
    element can contribute to a final answer.
  • One desideratum: we cannot cache too many
    elements in main memory. For each node q in the
    twig query, the number of cached elements with
    tag q should not exceed the length of the
    longest path in the XML dataset.

17
Our caching strategy
  • What elements should be cached into the main
    memory?
  • Only those that may contribute to final answers

[Figure: a simple XML tree (elements s1, t1, s2, s3, s4, p1) and a twig pattern over Section, title and paragraph.]
  • We only need to cache s1, s2 and s4 in main
    memory. Why not s3?
  • Because if s3 contributed to a final answer,
    there would be an element before p1 that is a
    child of s3. But p1 is the first paragraph
    element, so s3 is guaranteed not to contribute
    to any final answer.
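
The caching rule in this example can be sketched as follows. This is a simplified illustration, not the TwigStackList algorithm itself: the region labels and the tree shape (s1 contains s2, which contains s3 and s4, with p1 under s4) are assumptions made for the example, and `cache_ancestors` is a hypothetical helper name.

```python
from collections import namedtuple

Elem = namedtuple("Elem", "name start end level")


def cache_ancestors(stream, head):
    # Scan a stream (sorted by start) up to the head element of the child
    # stream and keep only the elements whose region contains that head.
    cached = []
    for element in stream:
        if element.start >= head.start:
            break
        if element.end > head.end:
            cached.append(element)
    return cached


# Hypothetical labels: s3 ends before p1 starts, so it cannot contain p1.
sections = [
    Elem("s1", 1, 20, 1),
    Elem("s2", 2, 15, 2),
    Elem("s3", 3, 5, 3),
    Elem("s4", 6, 10, 3),
]
p1 = Elem("p1", 7, 8, 4)

cached = cache_ancestors(sections, p1)
print([e.name for e in cached])  # ['s1', 's2', 's4']
```

Every cached element contains the same head element, so the cached elements are nested inside one another and lie on a single root-to-leaf path. That is why the cache size is bounded by the longest path in the document.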

18
Our criteria for pushing an element to stack
  • Whether an element can be pushed into stack is
    very important for controlling intermediate
    results. Why?
  • Because once an element is pushed onto its
    stack, it is ready to be output. So the fewer
    elements are pushed onto stacks, the fewer
    intermediate results are output.
  • Our criteria: given an element eq from stream
    Tq, before eq is pushed onto stack Sq, we
    ensure that
  • (i) element eq has a descendant eq' for each
    child q' of q, and
  • (ii) if (q, q') is a parent-child relationship,
    eq' has a parent with tag q in the path from eq
    to eqmax, where eqmax is the descendant of eq
    with the maximal start value, and
  • (iii) each q' recursively satisfies the first
    two conditions.

19
Examples
  • Let us see two examples to understand the
    criteria.

[Figure: a simple XML tree (elements s1, t1, s2, p1, s3, f1) and a twig pattern over Section, title, paragraph and figure.]
  • Element s1 can be pushed onto its stack, but s2
    and s3 cannot.
  • Note that s1 can be pushed not just because t1,
    p1 and f1 are its descendants but, more
    importantly, because in the path from s1 to f1,
    elements t1, p1 and f1 can each find a parent
    with tag section.

20
Examples
[Figure: a simple XML tree (elements s1, o1, t1, p1, s2, f1) and a twig pattern over Section, title, paragraph and figure.]
  • In this example, s1 cannot be pushed onto its
    stack. Although t1, p1 and f1 are still
    descendants of s1, in the path from s1 to f1
    element p1 cannot find a parent with tag
    section. Observe that the parent of p1 is o1
    (where o1 denotes some other element).
  • In this example, we cache s1 and s2 in main
    memory, because they might be involved in query
    answers later.

21
TwigStackList
  • We propose a novel holistic twig algorithm,
    TwigStackList, to evaluate a twig query.
  • Unlike the previous TwigStack, TwigStackList has
    the following features:
  • It considers the parent-child edges in the query
    and strengthens the criteria for pushing
    elements onto stacks.
  • It uses a list data structure to cache some
    elements that are likely to participate in final
    solutions. The number of elements in any list is
    strictly bounded by the longest path in the
    dataset.
  • It has a broader class of optimal queries.
    TwigStackList guarantees that each output
    intermediate solution contributes to final
    answers when queries contain only
    ancestor-descendant edges below branching nodes.

22
Example
  • TwigStackList is I/O optimal for the following
    query, whereas TwigStack is sub-optimal. Note
    that below the branching node section, all edges
    in the query are A-D relationships.

[Figure: a simple XML tree (elements s1, t1, p1, t2, f1) and a twig pattern over Section, title, paragraph and figure.]
  • In this case, TwigStackList does not push s1 onto
    its stack and thereby avoids outputting (s1,t1).
  • But TwigStack pushes s1 onto its stack and
    outputs (s1,t1).
  • Observe that (s1,t1) is a useless intermediate
    solution.

23
Sub-optimality of TwigStackList
  • Although TwigStackList broadens the class of
    optimal queries compared to TwigStack, it is
    still sub-optimal for queries with parent-child
    edges below branching nodes.

[Figure: a twig pattern over Section, title and paragraph, and a simple XML tree (elements s1, t1, s2, f1).]
  • Observe that there is no matching solution for
    this dataset. But TwigStackList caches s1 and s2
    in the list and pushes s1 onto its stack, so
    (s1,t1) is output as a useless solution.

24
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • → Experimental results
  • Conclusion

25
Experimental Setting
  • Experimental Setting
  • Pentium 4 CPU, RAM 768MB, disk 2GB
  • TreeBank
  • Maximal depth 36, 2.4 million nodes
  • DTD data
  • a → bc | cb | d
  • c → a
  • a and c are non-terminals, b and d are terminals
  • Random
  • Seven tags (a, b, c, d, e, f, g), uniformly
    distributed
  • Fan-out of elements varied from 2 to 100, depth
    varied from 10 to 100

26
Performance against TreeBank
  • Queries with XPath expression

Q1: S//MD//ADJ
Q2: S/VP/PP/NP/VBN/IN
Q3: S/VP//PP//NP/VBN//IN
Q4: VP/DT//PRP_DOLLAR_
Q5: S//VP/IN//NP
Q6: S/JJ/NP
  • Number of intermediate path solutions for
    TwigStackList vs. TwigStack

Query  TwigStack  TwigStackList  Reduction (%)  Useful paths
Q1     35         35             0              35
Q2     2957       143            95             92
Q3     25892      4612           82             4612
Q4     10663      11             99.9           5
Q5     702391     22565          96.8           22565
Q6     70988      30             99.9           10
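
The reduction column appears to be the relative drop in intermediate path solutions. The sketch below recomputes it from the TwigStack and TwigStackList counts in the table; the formula and the name `reduction_pct` are assumptions inferred from the numbers, not something the slides state.

```python
def reduction_pct(twigstack_paths, twigstacklist_paths):
    # Relative reduction in intermediate path solutions, as a percentage.
    return 100.0 * (twigstack_paths - twigstacklist_paths) / twigstack_paths


# (TwigStack, TwigStackList) counts copied from the table above.
rows = {
    "Q1": (35, 35),
    "Q2": (2957, 143),
    "Q3": (25892, 4612),
    "Q5": (702391, 22565),
}
for query, (twigstack, twigstacklist) in rows.items():
    print(query, round(reduction_pct(twigstack, twigstacklist), 1))
```

Under this assumption the recomputed values agree with the table for these rows up to rounding.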
27
Performance analysis
  • We have three observations:
  • (1) When queries contain only ancestor-descendant
    edges, the two algorithms have similar
    performance. See Q1.
  • (2) When edges below branching nodes contain
    only ancestor-descendant relationships,
    TwigStackList is optimal, but TwigStack is
    sub-optimal. See Q3, Q5.
  • (3) When edges below branching nodes contain
    parent-child relationships, both TwigStack and
    TwigStackList are sub-optimal, but TwigStackList
    typically outputs far fewer useless intermediate
    solutions than TwigStack. See Q2, Q4, Q6.

28
Performance against DTD data
There is no matching solution for query
a//b//c/d in the DTD dataset, but TwigStack
outputs many redundant path solutions. In
contrast, TwigStackList is optimal and
significantly outperforms TwigStack on this
query.
29
Performance against random dataset
Twig queries
From the following table, we see that for all
queries, TwigStackList again is more efficient
than TwigStack in terms of the size of
intermediate results.
Query  TwigStack  TwigStackList  Reduction (%)  Useful paths
Q1     9048       4354           52             2077
Q2     1098       467            57             100
Q3     25901      14476          44             14476
Q4     32875      16775          49             16775
Q5     3896       1320           66             566
30
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • Experimental results
  • → Conclusion

31
Conclusion
  • The previous algorithm, TwigStack, is
    sub-optimal for queries with parent-child
    edges.
  • We propose a new algorithm, TwigStackList, to
    address this problem.
  • TwigStackList broadens the class of queries with
    I/O optimality.
  • Experiments show that TwigStackList typically
    outputs far fewer useless intermediate results
    whenever the query contains parent-child
    relationships.
  • We recommend using TwigStackList to evaluate
    queries with parent-child relationships.

32
Backup questions
  • 1. Turn back to the slide about Performance
    against DTD data. In the two figures, what is the
    X-axis?
  • The X-axis shows the ratio of the number of
    elements with tag d to the number with tags b
    and c. This ratio is important because, given
    the DTD a → bc | cb | d, c → a, for query
    a//b//c/d the useless intermediate results
    output by TwigStack increase as the ratio
    decreases. In contrast, TwigStackList is
    optimal in this case, so it is not affected by
    changes in the ratio. Therefore, we show the
    superiority of TwigStackList over TwigStack by
    varying the ratio.

33
Backup questions
  • 2. You say that TwigStackList is more efficient
    than TwigStack, since it outputs fewer
    intermediate results. So it is easy to understand
    that TwigStackList is better than TwigStack in
    terms of I/O cost, but how about CPU cost?
  • TwigStackList is more efficient than TwigStack
    for evaluating queries with parent-child
    relationships in terms of not only intermediate
    result size but also execution time. Of
    course, TwigStackList needs to scan the elements
    cached in main memory, which slightly increases
    the CPU cost. But compared to the large benefit
    from the reduction in I/O cost, this cost is
    worthwhile.