Title: Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach


1
Efficient Processing of XML Twig Patterns with
Parent Child Edges: A Look-ahead Approach
CIKM 2004, Washington D.C., U.S.A.
  • Jiaheng Lu, Ting Chen, Tok Wang Ling
  • National University of Singapore
  • Nov. 11, 2004

2
Outline
  • → XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • Performance
  • Conclusion

3
XML Twig Pattern Matching
  • An XML document is commonly modeled as a rooted,
    ordered and labeled tree.

[Figure: example XML document tree rooted at book, with preface, chapter, section, title, paragraph and figure elements and the text values Intro, Data and XML]

4
Regional Coding
  • Node label [1]: (startPos: endPos, LevelNum)
  • E.g. (nodes listed in document order)

book (0:32, 1)
preface (1:3, 2)
Intro (2:2, 3)
chapter (4:29, 2)
section (5:28, 3)
title (6:8, 4)
Data (7:7, 3)
section (9:17, 4)
title (10:12, 5)
XML (11:11, 3)
paragraph (13:16, 5)
figure (14:15, 6)
section (18:23, 4)
paragraph (19:22, 5)
figure (20:21, 6)
paragraph (24:27, 4)
figure (25:26, 5)
chapter (30:31, 2)

  [1] M. P. Consens and T. Milo. Optimizing queries on
    files. In Proceedings of ACM SIGMOD, 1994.
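
The region codes above make structural tests cheap: containment of intervals gives the ancestor-descendant relationship, and a level difference of one narrows it to parent-child. A minimal sketch, assuming the (startPos: endPos, LevelNum) reading of the codes; the names `Region`, `is_ancestor` and `is_parent` are illustrative and not from the presentation:

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region code of one node: (startPos: endPos, LevelNum)."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    # a is an ancestor of d when a's interval strictly encloses d's.
    return a.start < d.start and d.end < a.end


def is_parent(a: Region, d: Region) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and d.level == a.level + 1


book = Region(0, 32, 1)
chapter = Region(4, 29, 2)
section = Region(5, 28, 3)
title = Region(6, 8, 4)

print(is_ancestor(book, title))   # True
print(is_parent(section, title))  # True
print(is_parent(chapter, title))  # False
```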

5
What is a Twig Pattern?
  • A twig pattern is a small tree whose nodes are
    tags, attributes or text values and edges are
    either Parent-Child (P-C) edges or
    Ancestor-Descendant (A-D) edges.
  • E.g., select Figure elements that are
    descendants of Paragraph elements, which in turn
    are children of Section elements that have a
    Title child.
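  • Twig pattern: Section[Title]/Paragraph//Figure

[Figure: twig pattern tree with root Section, children Title and Paragraph, and Figure below Paragraph]
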
6
XML Twig Pattern Matching
  • Problem Statement
  • Given a query twig pattern Q, and an XML database
    D, we need to compute ALL the answers to Q in D.
  • E.g., consider Q1 and Doc1.
  • Query solutions:
    (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

[Figure: twig pattern Q1 (Section, title, figure) and document tree Doc1 (elements s1, t1, s2, p1, t2, f1)]

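
To make the problem statement concrete, here is a brute-force sketch that enumerates every answer to a twig pattern over region-coded elements. It is not TwigStack or TwigStackList, only a reference point. The data reuses the section, title, paragraph and figure codes from the Regional Coding slide (as read from the transcript), and the twig is Section[Title]/Paragraph//Figure; the names `Elem`, `holds` and `answers` and the tuple encoding of the twig are assumptions made for illustration.

```python
from itertools import product
from typing import NamedTuple


class Elem(NamedTuple):
    tag: str
    start: int
    end: int
    level: int


def holds(axis, a, d):
    # "//" is ancestor-descendant, "/" is parent-child.
    inside = a.start < d.start and d.end < a.end
    return inside and (axis == "//" or d.level == a.level + 1)


def answers(twig, elems, anchor=None, axis=None):
    """Enumerate all matches of twig = (tag, [(axis, child_twig), ...])."""
    tag, children = twig
    found = []
    for e in elems:
        if e.tag != tag:
            continue
        if anchor is not None and not holds(axis, anchor, e):
            continue
        # Matches of each child sub-twig below e; any empty list kills e.
        per_child = [answers(c, elems, e, ax) for ax, c in children]
        for combo in product(*per_child):
            found.append((e,) + tuple(x for part in combo for x in part))
    return found


doc = [
    Elem("section", 5, 28, 3), Elem("title", 6, 8, 4),
    Elem("section", 9, 17, 4), Elem("title", 10, 12, 5),
    Elem("paragraph", 13, 16, 5), Elem("figure", 14, 15, 6),
    Elem("section", 18, 23, 4), Elem("paragraph", 19, 22, 5),
    Elem("figure", 20, 21, 6), Elem("paragraph", 24, 27, 4),
    Elem("figure", 25, 26, 5),
]
twig = ("section", [("/", ("title", [])),
                    ("/", ("paragraph", [("//", ("figure", []))]))])

for match in answers(twig, doc):
    print([(e.tag, e.start) for e in match])
```

The section at 18:23 yields no answer because it has no title child, which is the kind of element a good join algorithm should avoid touching more than once.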
7
Previous work: TwigStack
  • TwigStack [2]: a holistic approach
  • Two-phase algorithm
  • Phase 1 (TwigJoin): intermediate root-to-leaf
    paths are output
  • Phase 2 (Merge): the intermediate path lists are
    merged to get the result
  [2] N. Bruno, N. Koudas, and D. Srivastava.
    Holistic twig joins: optimal XML pattern
    matching. In Proceedings of ACM SIGMOD, 2002.

8
Previous work: TwigStack
  • A node q in a twig pattern Q is associated with a
    stack S_q.
  • Insertion and deletion in a stack S_q:
  • Insertion: an element e_q from stream T_q is
    pushed onto its stack S_q if and only if
      • e_q has a descendant e_qi in each stream T_qi,
        where q_i is a child of q, and
      • each e_qi recursively satisfies the first property.
  • Deletion: an element e_q is popped from its
    stack once all matches involving it have been
    output.
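
One common way to realize the deletion rule with region codes is to pop, before each push, every stack entry whose region ends before the new element starts: such an entry cannot be an ancestor of the new element or of anything that follows it in document order. The sketch below shows only this stack discipline for a single query node, with elements reduced to hypothetical (start, end) pairs; it is not the full TwigStack algorithm.

```python
def push(stack, elem):
    """Push elem = (start, end) after popping entries that end before it starts."""
    while stack and stack[-1][1] < elem[0]:
        stack.pop()  # top cannot be an ancestor of elem or of anything later
    stack.append(elem)


stack = []
for elem in [(5, 28), (9, 17), (18, 23)]:
    push(stack, elem)
print(stack)  # [(5, 28), (18, 23)]
```

After these pushes the stack holds a chain of nested elements, so every entry is an ancestor of the entries above it.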

9
Sub-optimality of TwigStack
  • TwigStack is I/O optimal only for queries whose
    edges are all ancestor-descendant.
  • Unfortunately, TwigStack is sub-optimal for
    queries with any parent-child edge.
  • For queries with parent-child relationships,
    TwigStack may output a large number of
    intermediate results that are not merge-joinable
    into any final solution.

10
Sub-optimality of TwigStack: an example
[Figure: a simple XML tree (elements s1, t1, p1, t2, f1) and a twig pattern (Section, title, paragraph, figure)]
  • Since s1 has descendants t1 and p1, and p1 in turn
    has descendant f1, TwigStack outputs an
    intermediate path solution <s1, t1>.
  • But this path is useless, because there is no
    solution for this example at all.
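
The effect can be reproduced with a small check that compares a descendant-only test (the relaxation TwigStack relies on) with a test that also respects parent-child edges. The tree layout, region codes and edge types below are assumptions chosen to be consistent with the slide's description (f1 is a descendant of p1 but not its child); `qualifies` and `respect_pc` are hypothetical names.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    tag: str
    start: int
    end: int
    level: int


def contains(a, d):
    return a.start < d.start and d.end < a.end


def qualifies(e, twig, elems, respect_pc):
    """True if e has, for every child query node, a qualifying element below it.

    With respect_pc=False every edge is treated as ancestor-descendant.
    """
    _tag, children = twig
    for axis, child in children:
        ok = any(
            d.tag == child[0]
            and contains(e, d)
            and (axis == "//" or not respect_pc or d.level == e.level + 1)
            and qualifies(d, child, elems, respect_pc)
            for d in elems
        )
        if not ok:
            return False
    return True


# Assumed layout: s1 -> (t1, p1), p1 -> t2 -> f1.
s1 = Elem("section", 1, 10, 1)
doc = [s1, Elem("title", 2, 3, 2), Elem("paragraph", 4, 9, 2),
       Elem("title", 5, 8, 3), Elem("figure", 6, 7, 4)]
# Assumed twig: section//title, section/paragraph, paragraph/figure.
twig = ("section", [("//", ("title", [])),
                    ("/", ("paragraph", [("/", ("figure", []))]))])

print(qualifies(s1, twig, doc, respect_pc=False))  # True
print(qualifies(s1, twig, doc, respect_pc=True))   # False
```

Under the relaxed test s1 looks promising, so a path such as <s1, t1> is emitted even though the strict test shows that no answer exists.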

11
Main problem and our experiment
  • TwigStack might output intermediate results
    that are useless for the query answers.
  • To understand this better, we ran TwigStack on
    a real dataset.
  • Dataset: TreeBank, from the U. of Washington XML
    datasets
  • Queries
  • Q1: VP/DT//PRP_DOLLAR_
  • Q2: S//NP//PP/TO/VP/_NONE_/JJ
  • Q3: S/JJ/NP
  • All queries contain parent-child relationships.

12
Our experimental results
Query  Intermediate paths by TwigStack  Merge-joinable paths  Useless intermediate paths (%)
Q1     10,663                           5                     99.9
Q2     24,493                           49                    99.5
Q3     70,967                           10                    99.9
Most intermediate paths do not contribute to
final answers due to parent-child edges! It is a
big challenge to improve TwigStack to answer
queries with parent-child edges.
13
Intuition for improvement
[Figure: a simple XML tree (elements s1, t1, p1, t2, f1) and a twig pattern (Section, title, paragraph, figure)]
  • Our intuitive observation: why not read more
    paragraph elements and cache them in main
    memory?
  • For example, after scanning p1 we do not stop but
    continue to read the next paragraph element.
    We then find that p1 is the only paragraph
    element and that f1 is not a child of it, so
    we should not output any intermediate solution.

14
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • → Our algorithm: TwigStackList
  • Experimental results
  • Conclusion

15
Our main idea
  • Main idea: we read more elements from the input
    streams and cache some of them in main memory,
    so that we can decide more accurately whether an
    element can contribute to a final answer.
  • But we cannot cache too many elements in main
    memory. For each node q in the twig query, the
    number of elements with tag q cached in main
    memory should not exceed the length of the
    longest path in the XML dataset.

16
Our caching method
  • What elements should be cached into the main
    memory?
  • Only those that might contribute to final answers

[Figure: a simple XML tree with elements s1, p1, t1, p3, p2, f1]
  • We only need to cache p1 and p3 in main memory.
    Why not p2?
  • Because if p2 contributed to a final answer,
    some element before f1 (in the figure stream)
    would have to be a child of p2. But f1 is the
    first element, so p2 is guaranteed not to
    contribute to any final answer.

17
Our criteria for pushing an element to stack
  • The criterion for pushing an element onto the
    stack is very important for controlling
    intermediate results. Why?
  • Because once an element is pushed onto the stack,
    it is ready to be output. So the fewer elements
    are pushed onto the stack, the fewer
    intermediate results are output.
  • Our criteria: given an element e_q from stream
    T_q, before e_q is pushed onto stack S_q, we
    ensure that
  • (i) element e_q has a descendant e_q' for each
    child q' of q, and
  • (ii) if (q, q') is a parent-child relationship,
    e_q' has a parent with tag q on the path from e_q
    to e_qmax, where e_qmax is the descendant of e_q
    with the maximal start value, q_max being a child
    of q, and
  • (iii) each e_q' recursively satisfies the first
    two conditions.
18
Examples
[Figure: a simple XML tree with elements s1, t1, p1, p2, p3, f1]
  • Element p3 can be pushed onto the stack, but p1
    and p2 cannot.
  • This is because p3 has a child f1.
  • Although p1 has a descendant f1, f1 is not
    a child of p1.
19
Our algorithm: TwigStackList
  • We propose a novel holistic twig join algorithm,
    TwigStackList, to evaluate a twig query.
  • Unique features of TwigStackList
  • It takes the parent-child edges in the query
    into account.
  • There is a list for each query node to cache
    elements that are likely to participate in final
    solutions.
  • It identifies a broader class of optimal queries.
    TwigStackList can guarantee I/O optimality
    for queries with only ancestor-descendant edges
    connecting branching nodes and their children.

20
TwigStackList: an example
[Figure: twig pattern (Section, title, paragraph, figure) and an XML tree under Root with elements s1, s2, t1, t2, t3, p1, p2, p3, f1, f2, shown with the current stack and list contents]
Scan s1, t1, p1, f1.
21
TwigStackList: an example
[Figure: same twig pattern and XML tree, with updated stack and list contents]
Since p1 is an ancestor of f1 but not its parent,
we put p1 in the list and continue to scan p2.
22
TwigStackList: an example
[Figure: same twig pattern and XML tree, with updated stack and list contents]
Put p2 and p3 in the list; the cursor points to p3
because it is the parent of f2.
23
TwigStackList: an example
[Figure: same twig pattern and XML tree, with updated stack and list contents]
Output intermediate solutions: <s2, t3>, <s2, p3, f2>
Merge
Final: <s2, t3, p3, f2>
24
TwigStackList vs. TwigStack
[Figure: twig pattern (Section, title, paragraph, figure) and an XML tree under Root with elements s1, s2, t1, t2, t3, p1, p2, p3, f1, f2]
  • TwigStackList is I/O optimal for the above
    query. In contrast, TwigStack is sub-optimal,
    because it outputs the useless path solution <s1, t1>.

25
Sub-optimality of TwigStackList
  • Although TwigStackList broadens the class of
    optimal queries compared to TwigStack, it is
    still sub-optimal for queries with parent-child
    edges connecting branching nodes to their
    children.

[Figure: twig pattern (Section, title, paragraph) and a simple XML tree with elements s1, t1, s2, p1]
  • Observe that there is no matching solution for
    this dataset. But TwigStackList caches s1 and s2
    in the list and pushes s1 onto the stack, so
    (s1, t1) will be output as a useless solution.

26
Sub-optimality of TwigStackList
[Figure: the same twig pattern and XML tree, with an additional element p2]
  • Here the behavior of TwigStackList is still
    reasonable, since before advancing p1 we do not
    know whether s1 has a child p2 following p1.

27
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • → Experimental results
  • Conclusion

28
Experimental Setting
  • Machine
      • Pentium 4 CPU, 768 MB RAM, 2 GB disk
  • TreeBank
      • Downloaded from the University of Washington
        XML datasets
      • Maximal depth 36, 2.4 million nodes
  • Random
      • Seven tags (a, b, c, d, e, f, g), uniformly
        distributed
      • Fan-out of elements varied from 2 to 100,
        depth varied from 10 to 100

29
Performance against TreeBank
  • Queries (XPath expressions)

Q1: S//MD//ADJ            Q4: VP/DT//PRP_DOLLAR_
Q2: S/VP/PP/NP/VBN/IN     Q5: S//VP/IN//NP
Q3: S/VP//PP//NP/VBN//IN  Q6: S/JJ/NP
  • Number of intermediate path solutions,
    TwigStackList vs. TwigStack

Query  TwigStack  TwigStackList  Reduction (%)  Useful paths
Q1     35         35             0              35
Q2     2957       143            95             92
Q3     25892      4612           82             4612
Q4     10663      11             99.9           5
Q5     702391     22565          96.8           22565
Q6     70988      30             99.9           10
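
The reduction column can be recomputed from the path counts as 100 × (TwigStack − TwigStackList) / TwigStack, and the gap between TwigStackList's output and the useful paths gives the number of useless paths that remain. A small sketch using three rows copied from the table above; the function name `reduction_percent` is illustrative:

```python
# (TwigStack paths, TwigStackList paths, useful paths), copied from the table above.
results = {
    "Q2": (2957, 143, 92),
    "Q3": (25892, 4612, 4612),
    "Q5": (702391, 22565, 22565),
}


def reduction_percent(twigstack, twigstacklist):
    # Share of TwigStack's intermediate paths that TwigStackList avoids.
    return 100.0 * (twigstack - twigstacklist) / twigstack


for query, (ts, tsl, useful) in results.items():
    print(query, round(reduction_percent(ts, tsl), 1), "useless left:", tsl - useful)
```

For Q3 and Q5 the number of useless paths left is zero, which is what optimality means in this setting: every path TwigStackList outputs is part of a final answer.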
30
Performance analysis
  • We have three observations
  • (1) When queries contain only ancestor-descendant
    edges, the two algorithms have similar
    performance. See Q1.
  • (2) When the edges connecting branching nodes to
    their children are all ancestor-descendant,
    TwigStackList is optimal but TwigStack is
    sub-optimal. See Q3, Q5.
  • (3) When the edges connecting branching nodes to
    their children include parent-child
    relationships, both TwigStack and TwigStackList
    are sub-optimal. But TwigStackList typically
    outputs far fewer useless intermediate solutions
    (<5%) than TwigStack. See Q2, Q4, Q6.

31
Performance against random dataset
From the following table, we see that for all
queries, TwigStackList again is more efficient
than TwigStack in terms of the size of
intermediate results.
Query  TwigStack  TwigStackList  Reduction (%)  Useful paths
Q1     9048       4354           52             2077
Q2     1098       467            57             100
Q3     25901      14476          44             14476
Q4     32875      16775          49             16775
Q5     3896       1320           66             566
32
Outline
  • XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
  • Our algorithm: TwigStackList
  • Experimental results
  • → Conclusion

33
Conclusion
  • The previous algorithm, TwigStack, is
    sub-optimal for queries with parent-child
    edges.
  • We propose a new algorithm, TwigStackList, to
    address this problem.
  • TwigStackList broadens the class of queries that
    can be evaluated with I/O optimality.
  • Experiments show that TwigStackList typically
    outputs far fewer useless intermediate results
    when the query contains parent-child edges.
  • We recommend using TwigStackList as a new
    holistic join algorithm to evaluate queries with
    parent-child edges.

34
  • Thank You!
  • Q & A