Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents

1
Twig2Stack: Bottom-up Processing of
Generalized-Tree-Pattern Queries over XML
Documents
  • Songting Chen, Hua-Gang Li, Junichi Tatemura,
  • Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk
    Candan
  • NEC Laboratories America
  • University of California, Santa Barbara

2
Background
  • XML
    • Hierarchical (tree) structured data
    • Provides flexibility to model semi-structured data
    • Widely accepted as a universal data exchange format
  • Query over XML
    • XPath, XQuery (W3C)
    • Extensively used by many applications
    • Adopted by a number of commercial systems

3
State-of-the-art XML Query Processing
  • Algebraic approach
    • Binary structural joins [Timber]: large
      intermediate results
    • Optimize multiple path expressions of XQuery
      [Chen et al.]: expensive post-processing
  • Holistic approach
    • PathStack [Bruno et al.]
    • TwigStack [Bruno et al.]
    • Twig2Stack

4
Processing Generalized Tree Pattern (GTP) Queries
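  • XQuery: FOR $b in //AE/B, $d in $b/D
    LET $c := $b/C RETURN $b, $c, $d
  • Existing approaches require
    • Structural joins
    • Structural outer joins
    • Grouping
    • Duplicate elimination
    • Sort
  • Our goal: avoid ALL of these!
  • [Figure: GTP with query nodes A, B, C, D; sample
    matches of //A//B and //A/B over a1, a2, b1, b2]
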
5
Motivation: PathStack [Bruno et al.]
  • Query: //A//B
  • Key observation: minimize intermediate results
    through a compact representation of path matches,
    by
    • Inter-node: recording the ancestor-descendant (AD)
      relationship between elements in different query
      nodes, e.g., b1 → a2, b2 → a2
    • Intra-node: recording the AD relationship between
      elements within the same query node, e.g., b1,
      b2
  • TwigStack [Bruno et al.] minimizes intermediate
    results by
    • Outputting only those path matches that are in the
      final twig results
    • However, such optimality cannot be guaranteed
      [Choi et al.]
    • Not helpful for processing GTP queries
  • Question: can we minimize intermediate results
    for twig queries through compact result encoding
    (similar to PathStack)?
    • Would this be useful for processing GTP queries
      as well?
  • [Figure: example data (a1, a2, b1, b2) with
    PathStack stacks SA and SB]
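The compact representation described above can be sketched in a few lines of Python. This is an illustration only, not code from PathStack: since the slide's figure is not preserved, it assumes the data is a single chain a1 > a2 > b1 > b2, and the names `S_A`, `S_B`, and `expand` are made up for the example. Each entry of `S_B` keeps one inter-node pointer to the entry that was on top of `S_A` when it was pushed, and every `S_A` entry at or below that pointer is an ancestor.

```python
# Stacks are bottom-to-top lists (assumed data: a1 > a2 > b1 > b2).
S_A = ["a1", "a2"]
# Each S_B entry: (element, index of the S_A entry it points to).
S_B = [("b1", 1), ("b2", 1)]  # b1 -> a2, b2 -> a2


def expand(s_a, s_b):
    """Expand the compact encoding into explicit (A, B) path matches.

    Every S_A entry at or below the pointer is an ancestor of the B entry.
    """
    return [(a, b) for b, ptr in s_b for a in s_a[: ptr + 1]]


matches = expand(S_A, S_B)
print(matches)
# [('a1', 'b1'), ('a2', 'b1'), ('a1', 'b2'), ('a2', 'b2')]
```

Four stack entries and two pointers are enough to encode all four matches of //A//B; the matches are only spelled out when results are enumerated.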
6
Hierarchical Stack Encoding
  • Inter-node: //A//B
    • Can still use explicit edges
  • Intra-node: A
    • Matching elements form a tree structure as well
    • Associate each query node with a hierarchical
      stack
  • Push element e into hierarchical stack HS[E] iff
    e satisfies the sub-twig query rooted at E
    • Matching can be determined once the entire
      sub-tree of e has been seen
    • Requires post-order document traversal
  • [Figure: data elements a1, a2, a3, a4 and the
    corresponding hierarchical stack HS[A]]
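The push rule above can be sketched in Python. This is a simplified illustration, not the authors' implementation: plain lists stand in for the hierarchical stacks, no stack merging is done, and only ancestor-descendant edges are handled. The twig A[.//B][.//C], the small document, and the names `Elem`, `QUERY`, and `match_bottom_up` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Elem:
    name: str  # e.g. "a1"; the first letter is treated as the tag
    children: list = field(default_factory=list)


# Hypothetical twig A[.//B][.//C]: query node -> child query nodes (AD edges).
QUERY = {"A": ["B", "C"], "B": [], "C": []}


def match_bottom_up(root):
    """Post-order traversal that keeps e for query node E only if the
    sub-tree of e satisfies the sub-twig rooted at E."""
    kept = {q: [] for q in QUERY}  # flat stand-in for the stacks HS[q]
    order = []                     # post-order visiting sequence

    def visit(e):
        # found[q]: kept q-elements strictly inside e's sub-tree
        found = {q: [] for q in QUERY}
        for child in e.children:
            for q, names in visit(child).items():
                found[q].extend(names)
        order.append(e.name)  # e is decided only after its whole sub-tree
        tag = e.name[0].upper()
        if tag in QUERY and all(found[q] for q in QUERY[tag]):
            edges = {q: list(found[q]) for q in QUERY[tag]}  # inter-node edges
            kept[tag].append((e.name, edges))
            found[tag].append(e.name)
        return found

    visit(root)
    return kept, order


doc = Elem("a1", [
    Elem("a2", [Elem("b1"), Elem("c1")]),
    Elem("a3", [Elem("b2")]),  # has a B but no C, so it is never kept
    Elem("c2"),
])
kept, order = match_bottom_up(doc)
print(order)                        # ['b1', 'c1', 'a2', 'b2', 'a3', 'c2', 'a1']
print([n for n, _ in kept["A"]])    # ['a2', 'a1']
```

Because an element is examined only after all of its descendants, the decision to keep it never has to be revisited, which is why the traversal has to be post-order.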
7
Twig2Stack Running Example
  • Query: twig over nodes A, B, C, D
    (paths //A/B//D and //A/B//C)
  • Data, as element (start, end, level):
    a1 (1, 20, 1), a2 (2, 15, 2), b1 (3, 14, 3),
    d1 (4, 11, 4), b2 (5, 10, 5), d2 (6, 7, 6),
    c1 (8, 9, 6), c2 (12, 13, 4), b3 (16, 19, 2),
    d3 (17, 18, 3)
  • Hierarchical stacks: HS[A], HS[B], HS[C], HS[D]
  • Merging stacks
  • TwigStack needs to enumerate 3 matches for
    //A/B//D and 2 for //A/B//C, then join them
    together. Twig2Stack requires neither path
    joins nor path enumeration!
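The match counts quoted above can be checked with a small Python sketch that tests containment on the region encodings of the running example and enumerates both paths the naive way. It assumes each triple is (start, end, level) and that labels pair with triples as listed in the code (the c1/d2 pairing in particular is read off a flattened figure and may be swapped, which does not change the counts). The helper names are illustrative only.

```python
from itertools import product

# Region encodings from the running example, assumed (start, end, level).
DATA = {
    "a1": (1, 20, 1), "a2": (2, 15, 2),
    "b1": (3, 14, 3), "b2": (5, 10, 5), "b3": (16, 19, 2),
    "c1": (8, 9, 6), "c2": (12, 13, 4),
    "d1": (4, 11, 4), "d2": (6, 7, 6), "d3": (17, 18, 3),
}


def is_ancestor(a, d):
    """True if the region of a strictly contains the region of d."""
    (s1, e1, _), (s2, e2, _) = DATA[a], DATA[d]
    return s1 < s2 and e2 < e1


def is_parent(p, c):
    """Parent-child = containment plus a level difference of exactly one."""
    return is_ancestor(p, c) and DATA[p][2] + 1 == DATA[c][2]


def by_tag(tag):
    return sorted(n for n in DATA if n.startswith(tag))


def path_matches(leaf_tag):
    """Naively enumerate all matches of //A/B//<leaf_tag>."""
    return [
        (a, b, x)
        for a, b, x in product(by_tag("a"), by_tag("b"), by_tag(leaf_tag))
        if is_parent(a, b) and is_ancestor(b, x)
    ]


d_paths = path_matches("d")  # matches of //A/B//D
c_paths = path_matches("c")  # matches of //A/B//C

# Path-join step: combine paths that agree on the shared (A, B) prefix.
twig = [
    (a, b, c, d)
    for (a, b, d) in d_paths
    for (a2, b2, c) in c_paths
    if (a, b) == (a2, b2)
]
print(len(d_paths), len(c_paths), len(twig))  # 3 2 4
```

The path (a1, b3, d3) is enumerated even though b3 has no C descendant and so contributes nothing to the final twig results; this is the kind of intermediate work the slide says Twig2Stack avoids.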
8
GTP Result Enumeration
  • Bottom-up computation vs. top-down enumeration
    • Visit only those elements that are in the twig
      matches
  • Handling grouping results
    • Automatic grouping through inter-node edges
  • Handling duplicates and out-of-order results
    • Problems come from non-return nodes
    • If D is a return node while B is not:
      b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
    • Observation: the intra-node hierarchy provides
      hints
  • [Figure: example stacks with elements a4, b1, b2,
    c1, c2, d1, d2, d3]
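The duplicate problem above can be illustrated with a few lines of Python. This is one possible reading of the "hint", not the paper's enumeration algorithm: it assumes b2 is nested inside b1 and that B and D are connected by an ancestor-descendant edge, so everything reachable from b2 is also reachable from b1. The dictionaries `b_to_d` and `b_parent` are hypothetical stand-ins for the inter-node edges and the intra-node hierarchy.

```python
# Inter-node edges from the slide: which D elements each B element reaches.
b_to_d = {"b1": ["d1", "d2", "d3"], "b2": ["d2", "d3"]}
# Assumed intra-node hierarchy of the B stack: b2 is nested inside b1.
b_parent = {"b1": None, "b2": "b1"}

# Naive enumeration walks every B element, so d2 and d3 come out twice.
naive = [d for b in b_to_d for d in b_to_d[b]]

# With an AD edge, a nested B element reaches a subset of what its ancestor
# reaches, so it is enough to enumerate from the top-level B elements.
roots = [b for b, parent in b_parent.items() if parent is None]
deduped = [d for b in roots for d in b_to_d[b]]

print(naive)    # ['d1', 'd2', 'd3', 'd2', 'd3']
print(deduped)  # ['d1', 'd2', 'd3']
```

No sort or explicit duplicate-elimination pass is needed in this sketch; the hierarchy alone tells the enumeration which B elements can be skipped.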
9
Experiment Setup
  • Implementation
    • Twig2Stack: Java 1.4.2
    • TwigStack, TJFast: Java 1.4.2
      (kindly provided by Jiaheng Lu from the National
      University of Singapore (NUS))
  • Datasets
    • XMark, DBLP, TreeBank
  • Metrics
    • Query processing time
    • IO time

10
Processing Full Twig Queries
  • Optimization of query processing: TwigStack,
    Twig2Stack
  • Optimization of IO: TJFast
  • [Charts not captured in the transcript]

11
Not yet done: Memory Usage
  • Hierarchical stack encoding could hold the entire
    document in memory in the worst case
  • Unlike the DOM approach, only matches need to be
    stored
    • Tag match
    • (Partial) twig match
    • Predicate evaluation
  • Early result enumeration dramatically reduces
    memory usage
    • Enumerate query results before the end of the
      document and release the buffer
    • Main idea: hybrid of top-down (PathStack) and
      bottom-up (Twig2Stack) approaches

12
Early Result Enumeration (ERM)
  • Enumerate results and release buffer when
    elements in top-branch node are popped from
    PathStack

  • [Figure: the running example's query (A, B, C, D)
    and region-encoded document, shown with a
    PathStack holding a1 and a2 for the top-branch
    node]

13
Memory Usage
  • DBLP: small sub-trees
    (dblp, article, title, year)
  • XMark: huge sub-trees
    (site, open_auctions, bid, reserve, bidder,
    increase)

14
Conclusions and Future Work
  • Proposed a bottom-up GTP processing solution
    • A twig encoding scheme
    • A GTP enumeration algorithm that avoids any
      post-processing operations
    • A hybrid scheme to reduce memory usage
  • Future directions
    • Handling worst-case memory issues
    • Optimizing IO cost by exploiting indexes
    • Handling other axes, full XQuery, graph input
    • Handling XML streams

16
Processing GTP
  • Optimization of non-return nodes
  • Automatic grouping