<AUTHORS> - PowerPoint PPT Presentation

About This Presentation
Title:

<AUTHORS>

Description:

By Value: get me 'document'; get me 'element= node1' ' or 'attribute=10' ... To join intermediate results from sub-expressions with a list of elements and a ... – PowerPoint PPT presentation

Slides: 21
Provided by: gnanas
Learn more at: http://www.cise.ufl.edu
Category:
Tags: subelement


Transcript and Presenter's Notes



1
<TITLE>Indexing and Querying XML Data for
../Regular Path Expressions/</TITLE>
  • <AUTHORS>
  • <NAME ID="1">Quanzhong Li</NAME>
  • <NAME ID="2">Bongki Moon</NAME>
  • </AUTHORS>
  • <PRESENTERS>
  • <NAME UFID="1234567">SUNDAR</NAME>
  • <NAME UFID="7654321">SUPRIYA</NAME>
  • </PRESENTERS>

2
Need for this paper
  • XML has emerged as a popular standard for data
    representation and data exchange on the Internet
  • XML query languages use regular path expressions
    to query the data
  • Conventional approaches (for indexing and searching
    this data) based on tree traversals go for a
    toss under heavy access requests
  • Traversing this hierarchy of XML data becomes an
    overhead if the path lengths are long or unknown
  • What can be done?

3
Try our System and the Algorithms!!!
  • New system for indexing and storing XML data: XISS
  • New numbering scheme for elements and attributes
  • Quick at figuring out the ancestor-descendant
    relationship
  • New index structures
  • Easier to find all elements and attributes with a
    given name string
  • Join algorithms for processing regular path
    expression queries
  • EE-Join to search paths from element to element
  • EA-Join to find element-attribute pairs
  • KC-Join to find the Kleene closure (*) on repeated
    paths or elements

4
Go XISS!!!
  • In general, XML data can be queried for a
    particular value or a structure
  • By Value: get me 'document'; get me
    'element=node1' or 'attribute=10'
  • By Structure: get me the parent and child
    elements/attributes for a given element
  • Components:
  • Index Structure: element, attribute and structure
    (index)
  • Data Loader
  • Query Processor
  • Numbering Scheme first...

5
Dietz vs. Li-Moon
  • Dietz says, "If x and y are the nodes of a tree
    T, x is an ancestor of y iff x comes before y
    when I climb down the tree (pre-order), and
    after y when I climb up (post-order)", and shows
    us his scheme:
  • Ancestor-descendant relationship
    determination in constant time
  • Li-Moon says, "But this lacks flexibility:
    it leads to many re-computations
    when a new node is inserted."
  • Hmm, let us check out Li-Moon's.
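
Dietz's test can be sketched in a few lines. This is only an illustration: the `children` dictionary, the node names and both function names are assumptions made for the example, not something taken from the slides.

```python
def number_tree(children, root):
    """Assign preorder and postorder ranks to every node of a tree.

    `children` maps a node name to its ordered list of child names
    (a hypothetical representation chosen for this sketch).
    """
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre[node] = counters["pre"]        # rank when climbing down
        counters["pre"] += 1
        for child in children.get(node, []):
            visit(child)
        post[node] = counters["post"]      # rank when climbing back up
        counters["post"] += 1

    visit(root)
    return pre, post


def is_ancestor(pre, post, x, y):
    # x is an ancestor of y iff x precedes y in preorder
    # and follows y in postorder: two comparisons, constant time.
    return pre[x] < pre[y] and post[x] > post[y]


tree = {"book": ["chapter1", "chapter2"], "chapter1": ["figure1"]}
pre, post = number_tree(tree, "book")
print(is_ancestor(pre, post, "book", "figure1"))      # True
print(is_ancestor(pre, post, "chapter2", "figure1"))  # False
```

Inserting a node shifts the ranks of every node that follows it in either order, which is the inflexibility Li-Moon point out.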

6
Li-Moon's Numbering
  • "Hey folks, we are going to extend this preorder
    and cover a range of descendants."
  • Just associate a pair of numbers <order, size>
    with each node
  • Parent node x says to its child node y, "I came
    before you, so my order is less than yours, and my
    order + my size is >= (your order + your size), so
    your interval is always contained in my interval"
  • If there are siblings x and y (same parent) and x
    is before y, then order(x) + size(x) < order(y)
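
One way to produce numbers that satisfy these rules is a preorder walk that leaves some unused room in every node's interval. The sketch below is an assumption-laden illustration: the `slack` parameter, the `children` dictionary and the function name are hypothetical, and the slides do not say how much space to reserve.

```python
def assign_order_size(children, root, slack=2):
    """Return {node: (order, size)} for a tree given as a dict of child lists.

    `slack` is the number of spare slots reserved in every node's interval
    (an assumed policy, used here only to make the example concrete).
    """
    numbering = {}

    def visit(node, order):
        next_free = order + 1                  # children start after the parent
        for child in children.get(node, []):
            next_free = visit(child, next_free)
        # next_free - 1 is the end of the last child's interval (or `order`
        # itself for a leaf); extend past it by `slack` spare slots.
        size = (next_free - 1 - order) + slack
        numbering[node] = (order, size)
        return order + size + 1                # first order after this interval

    visit(root, 1)
    return numbering


tree = {"book": ["chapter1", "chapter2"], "chapter1": ["figure1"]}
numbering = assign_order_size(tree, "book")
print(numbering["book"])      # (1, 11)
print(numbering["chapter1"])  # (2, 5)
print(numbering["figure1"])   # (3, 2)
print(numbering["chapter2"])  # (8, 2)
```

Here chapter1 covers orders 2 to 7 and chapter2 starts at 8, so the sibling rule order(x) + size(x) < order(y) holds, and both intervals sit inside the book interval 1 to 12.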

7
Voila!
  • Here it goes:
  • So, for any node x, size(x) >= the sum of the sizes
    of all its direct children, so size(x) can be Laarrrge!
  • That being said, given nodes x and y of a tree
    T, x is an ancestor of y iff
  • order(x) < order(y) <= order(x) + size(x)
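
The test needs only the two pairs of numbers, with no tree in sight. A minimal sketch, where the function name and the sample <order, size> values are made up for illustration:

```python
def is_ancestor(x, y):
    """Ancestor test on (order, size) pairs.

    x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x).
    """
    order_x, size_x = x
    order_y, _size_y = y
    return order_x < order_y <= order_x + size_x


# Hypothetical numbers for a small document.
book, chapter1, figure1, chapter2 = (1, 11), (2, 5), (3, 2), (8, 2)

print(is_ancestor(book, figure1))      # True
print(is_ancestor(chapter1, figure1))  # True
print(is_ancestor(chapter2, figure1))  # False
print(is_ancestor(figure1, chapter1))  # False
```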

8
Good news!
  • Easy accommodation of future insertions: more
    flexible
  • Global reordering is not necessary until the
    reserved space runs out
  • The order in an <order, size> pair is a unique
    identifier for each element and attribute in the
    document
  • Attribute nodes are placed before their sibling
    elements in the order. Why?
  • How does this scheme help? Wait till the
    algorithms!
  • Switching back to XISS...
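
To see why insertions are cheap, here is a rough sketch of appending a new last child into a parent's reserved space. The function name, the choice to always append after the existing children, and the sample numbers are all assumptions for illustration; the slides do not describe the insertion procedure itself.

```python
def insert_last_child(parent, children, new_size=0):
    """Try to number a new child placed after the existing children.

    parent:   (order, size) of the parent node
    children: list of (order, size) pairs of its current children
    Returns the new child's (order, size), or None when the parent's
    reserved space is used up and a renumbering would be needed.
    """
    p_order, p_size = parent
    # End of the last occupied interval (the parent's own order if childless).
    last_used = max((o + s for o, s in children), default=p_order)
    new_order = last_used + 1
    if new_order + new_size > p_order + p_size:
        return None                     # no reserved space left
    return (new_order, new_size)


chapter = (2, 5)                        # covers orders 2..7
kids = [(3, 2)]                         # one child covering 3..5

first = insert_last_child(chapter, kids)
print(first)                            # (6, 0)
second = insert_last_child(chapter, kids + [first])
print(second)                           # (7, 0)
third = insert_last_child(chapter, kids + [first, second])
print(third)                            # None
```

No other node is touched by the first two insertions. Only the third, which finds the interval full, would force a reordering.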

9
Internals of XISS
  • Index Structure Overview

10
More structures
  • Element Index
  • Structure Index

11
Path Join Algorithms
  • Conventional approaches (top-down, bottom-up and
    hybrid traversals) are not effective
  • Main idea of the proposed algorithm
  • For a given query chapter/-/figure:
  • - find all chapter elements
  • - find all figure elements
  • - join the qualified chapter-figure
    pairs without traversing the XML data trees
    (if the ancestor-descendant relationship
    can be obtained quickly)
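
The three steps can be sketched as follows. This is a deliberately naive version: a plain dictionary stands in for the element index, the join is a nested loop rather than one of the join algorithms of the later slides, and every name and number below is hypothetical.

```python
from collections import defaultdict

# Hypothetical element index: name -> list of (doc_id, order, size).
element_index = defaultdict(list)
for name, doc_id, order, size in [
    ("chapter", 1, 2, 5),
    ("figure", 1, 3, 2),
    ("chapter", 1, 8, 2),
    ("figure", 2, 4, 0),
]:
    element_index[name].append((doc_id, order, size))


def join_by_name(index, ancestor_name, descendant_name):
    """Pair elements by name using only their numbers (no tree traversal)."""
    pairs = []
    for a_doc, a_order, a_size in index[ancestor_name]:       # all chapters
        for d_doc, d_order, _ in index[descendant_name]:      # all figures
            same_doc = a_doc == d_doc
            if same_doc and a_order < d_order <= a_order + a_size:
                pairs.append(((a_doc, a_order), (d_doc, d_order)))
    return pairs


pairs = join_by_name(element_index, "chapter", "figure")
print(pairs)  # [((1, 2), (1, 3))]
```

Each pair is identified by (document ID, order), and the qualifying test is the constant-time interval check, so the cost no longer depends on how long the path between chapter and figure is.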

12
Complex -> Simple
  • A complex path expression is decomposed into many
    simple path expressions
  • Intermediate results are joined to get the final
    result
  • Different types of sub-expressions

13
EA-Join Algorithm
  • Joins intermediate results from sub-expressions
    that yield a list of elements and a list of attributes
  • E.g. figure[@caption="flowchart"]
  • Attributes should be placed before sibling
    elements in the order by the numbering scheme

14
EA-Join Algorithm
  • Input: a list of figure elements and a list of
    caption attributes, grouped by document
  • Steps (2 stages):
  • Element sets and attribute sets are merged by
    document ID (single scan)
  • Elements and attributes are merged by figuring
    out the parent-child relationship using the <order>
    value (single scan)
  • Output: a set of (e, a) pairs where e is the
    parent of a
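
A minimal sketch of such a single-scan merge is given below. It rests on one loud assumption that the slides do not confirm: each attribute record carries the order of its parent element (`parent_order`). Because attributes are numbered right after their element and before its child elements, sorting attributes by order also sorts them by parent order, which is what lets one forward pass suffice. Record layouts and names are hypothetical.

```python
def ea_join(elements, attributes):
    """Merge-join elements with their attributes in one forward scan.

    elements:   list of (doc_id, order), sorted
    attributes: list of (doc_id, order, parent_order), sorted
    (parent_order on the attribute record is an assumption of this sketch)
    """
    pairs = []
    i = j = 0
    while i < len(elements) and j < len(attributes):
        e_key = elements[i]                       # (doc_id, order)
        a_doc, a_order, a_parent = attributes[j]
        a_key = (a_doc, a_parent)
        # Tuple comparison checks the document ID first, then the order,
        # so both merge stages happen in the same pass.
        if e_key < a_key:
            i += 1        # this element has no further matching attribute
        elif e_key > a_key:
            j += 1        # this attribute's parent is not in the element list
        else:
            pairs.append((e_key, (a_doc, a_order)))
            j += 1
    return pairs


figures = [(1, 3), (1, 9), (2, 4)]
captions = [(1, 4, 3), (1, 6, 5), (1, 10, 9), (2, 2, 1)]
print(ea_join(figures, captions))
# [((1, 3), (1, 4)), ((1, 9), (1, 10))]
```

The captions whose parents are at orders 5 (document 1) and 1 (document 2) belong to elements that are not figures, so they are skipped without any backtracking.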

15
EE-Join Algorithm
  • Joins intermediate results, each of which is a
    list of elements from a sub-expression
  • E.g. chapter/-/figure
  • Input: a list of chapter elements and a list of
    figure elements
  • Steps (2 stages) are similar to the EA-Join algorithm:
  • Both element sets are merged by document ID (single
    scan)
  • Chapter elements and figure elements are merged by
    finding the ancestor-descendant relationship
    using the <order, size> values
  • Output: a set of (e, f) pairs where e is the
    ancestor of f

16
EE-Join Algorithm
  • The second stage cannot be done in a single scan
  • In this example, a figure element can be a
    descendant of more than one chapter element
    (see book1.xml)
  • order(figure) will lie in more than one chapter
    interval (order(chapter), order(chapter) +
    size(chapter)]
  • Even with multiple scans, this is still highly
    effective for searching long or unknown-length
    paths compared to conventional tree
    traversals
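
The need to rescan shows up directly in a sketch of the join. The version below is one plausible way to write it, not the presenters' code: both inputs are assumed to be lists of (doc_id, order, size) tuples sorted by document ID and order, and the sample numbers are invented.

```python
def ee_join(ancestors, descendants):
    """Join two sorted element lists on the ancestor-descendant relationship."""
    pairs = []
    start = 0
    for a_doc, a_order, a_size in ancestors:
        # Skip descendants that come before this ancestor. Later ancestors
        # have larger keys, so the skipped ones can never match again.
        while (start < len(descendants)
               and descendants[start][:2] <= (a_doc, a_order)):
            start += 1
        # Scan forward from `start` without moving it: a nested ancestor
        # coming next may have to revisit the same descendants.
        k = start
        while (k < len(descendants)
               and descendants[k][0] == a_doc
               and descendants[k][1] <= a_order + a_size):
            pairs.append(((a_doc, a_order), descendants[k][:2]))
            k += 1
    return pairs


# Chapter at order 4 is nested inside the chapter at order 2.
chapters = [(1, 2, 10), (1, 4, 3), (1, 20, 2)]
figures = [(1, 5, 0), (1, 9, 0), (1, 21, 0), (2, 3, 0)]
print(ee_join(chapters, figures))
# [((1, 2), (1, 5)), ((1, 2), (1, 9)), ((1, 4), (1, 5)), ((1, 20), (1, 21))]
```

The figure at order 5 is reported twice, once per enclosing chapter, which is exactly the case that rules out a single scan of the figure list.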

17
KC-Join Algorithm
  • Processes a regular path expression with zero,
    one or more occurrences of a subexpression
  • E.g. chapter*, chapter+
  • Input: a set of elements from an XML document
  • Steps:
  • Each stage applies the EE-Join algorithm to the
    previous stage's result
  • Repeat until there is no change in the result
  • Output: the Kleene closure of all elements in the
    given input set
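
The "repeat until no change" loop is a fixed-point iteration, sketched below under simplifying assumptions: the input is a set of (start, end) pairs matched by one occurrence of the subexpression, a plain nested-loop match stands in for EE-Join, and only the one-or-more case is computed (zero occurrences would add each element paired with itself). Names and numbers are hypothetical.

```python
def kc_join(step_pairs):
    """Closure (one or more occurrences) of a one-step relation.

    step_pairs: set of (start, end) pairs produced by a single occurrence
    of the subexpression, with nodes identified by their order.
    """
    result = set(step_pairs)
    frontier = set(step_pairs)
    while frontier:
        # Join the previous stage's result with the one-step pairs.
        extended = {(a, d)
                    for (a, b) in frontier
                    for (c, d) in step_pairs
                    if b == c}
        frontier = extended - result     # keep only pairs not seen before
        result |= frontier
    return result


# chapter at order 2 contains chapter 4, which contains chapter 6.
steps = {(2, 4), (4, 6), (20, 21)}
closure = kc_join(steps)
print(sorted(closure))  # [(2, 4), (2, 6), (4, 6), (20, 21)]
```

The loop stops as soon as a stage contributes nothing new, which here happens after the second stage.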

18
Experiments.. ? ?
  • Prototype of XISS was implemented
  • Query interface: C++; XML parsing: Gnome XML
    Parser; B-Tree: GiST C++ Library
  • Workstation:
  • Sun UltraSPARC-II running Solaris 2.7
  • RAM: 256 MB; hard disk: 20 GB
  • Data sets:
  • Shakespeare's Plays
  • SIGMOD Record
  • NITF100 and NITF1

19
Performance Comparison
  • EE-Join query:
  • Outperformed the bottom-up method by a wide margin
  • Real-world data set: an order of magnitude faster
  • Synthetic data set: 6 to 10 times faster
  • Disk I/O was a dominant cost factor: 60% to 90%
    of total elapsed time
  • EA-Join query:
  • It was comparatively better than the top-down and
    bottom-up approaches
  • KC-Join query:
  • Performance was not measured, as it depends on
    EE-Join's performance

20
THE END!
  • Hope this presentation was useful
  • THANKS!