Indexing Methods for Efficient XML Query Processing

1
Indexing Methods for Efficient XML Query
Processing
  • Jun-Ki Min
  • KAIST
  • http://islab.kaist.ac.kr/jkmin/

2
XML
  • eXtensible Markup Language
  • The de facto standard for data representation and exchange on the Web
  • XML Data
  • An instance of semistructured data
  • self-describing
  • irregularly structured

3
XML Data
  • Comprises hierarchically nested collections of
    elements
  • An element can contain
  • An atomic data value
  • A sequence of subelements
  • Attributes composed of name-value pairs
  • ID-IDREF relationships
  • Tree or Graph representation

4
XML Example
<libraryDB>
  <book editor="1">
    <title>title1</title>
    <author>author1</author>
    <chapter> </chapter>
  </book>
  <paper>
    <title>title2</title>
    <author id="1">author2</author>
    <author>author3</author>
    <section> </section>
  </paper>
</libraryDB>

5
XML Query
  • XML Query Language
  • XSLT, XML-QL, XPath, XQuery
  • Use path expressions to traverse the irregularly
    structured data
  • e.g., /libraryDB/book/title or //title
  • Searching the whole XML data => inefficiency
  • Structural Summary, Path Index
  • Restrict the search to only the relevant
    portions of the XML data
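
As a rough illustration of the two path expressions above, the following sketch evaluates them with Python's standard xml.etree.ElementTree module. The document string is a reconstruction of the slide-4 example (empty chapter and section elements and the attribute quoting are assumptions). Without an index, both queries walk the parsed tree, and //title has to visit every element.

```python
import xml.etree.ElementTree as ET

# Reconstruction of the example document (attribute quoting assumed).
LIBRARY_XML = """
<libraryDB>
  <book editor="1">
    <title>title1</title>
    <author>author1</author>
    <chapter/>
  </book>
  <paper>
    <title>title2</title>
    <author id="1">author2</author>
    <author>author3</author>
    <section/>
  </paper>
</libraryDB>
"""

root = ET.fromstring(LIBRARY_XML)

# /libraryDB/book/title: evaluated relative to the root element <libraryDB>.
book_titles = [t.text for t in root.findall("./book/title")]

# //title: descendant search, which visits every element in the tree.
all_titles = [t.text for t in root.iter("title")]

print(book_titles)  # ['title1']
print(all_titles)   # ['title1', 'title2']
```

A path index aims to answer the same queries by looking only at the index nodes that match the label path instead of scanning the whole document.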

6
Schemas for XML
  • DTD, XML Schema
  • Specify the constraints on XML data
  • <!ELEMENT book (title, author, chapter)>
  • Are not mandatory
  • => lack of an external schema
  • Structural Summary
  • Summary of label paths
  • Path Index
  • Structural Summary + Extents

7
Schemas for XML
  • Applications
  • User Interface
  • XML Data Design, Editing
  • Query Formulation
  • Query Validation
  • Query Optimization
  • Path Index

8
Structural Summary
  • DTD Extraction
  • XTRACT
  • based on element information
  • Structural Summary
  • Representative Objects
  • based on path information

9
XTRACT
  • Garofalakis, Gionis, Rastogi, Seshadri, Shim
    SIGMOD 00
  • Infers a concise and accurate DTD
  • Chooses a DTD from candidate DTDs
  • (a b), (b a) => (a|b)* or (ab)|(ba)
  • Based on the Minimum Description Length (MDL)
    principle
  • Ranks each candidate DTD by the number of bits
    required to describe the subelement sequences
    in terms of that candidate DTD
  • (a|b)*: 6 (for DTD) + 3 + 3 = 12
  • (ab)|(ba): 9 (for DTD) + 1 + 1 = 11
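
The two cost lines on this slide appear to read 6 + 3 + 3 = 12 and 9 + 1 + 1 = 11, which matches the candidate DTDs written without spaces as (a|b)* (6 characters) and (ab)|(ba) (9 characters). The sketch below only reproduces that arithmetic. Counting characters instead of bits, and the reading of the per-sequence costs, are assumptions made for illustration and not XTRACT's actual encoding.

```python
# Simplified MDL cost: (characters in the DTD) + (units to encode each sequence).
# The per-sequence costs 3 and 1 are taken from the slide; their interpretation
# in the comments is an assumption.
candidates = {
    "(a|b)*": [3, 3],     # e.g. a repetition count plus one choice per symbol
    "(ab)|(ba)": [1, 1],  # e.g. a single choice between the two alternatives
}


def mdl_cost(dtd: str, sequence_costs: list) -> int:
    """Total description length: the DTD itself plus the encoded sequences."""
    return len(dtd) + sum(sequence_costs)


costs = {dtd: mdl_cost(dtd, seq) for dtd, seq in candidates.items()}
best = min(costs, key=costs.get)

print(costs)  # {'(a|b)*': 12, '(ab)|(ba)': 11}
print(best)   # (ab)|(ba)
```

Under this count the more specific DTD wins because the extra characters in the DTD are repaid by cheaper sequence encodings.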

10
Representative Objects(RO)
  • Nestorov, Ullman, Wiener, Chawathe ICDE 97
  • Provide a concise representation of the inherent
    schema of semistructured hierarchical data
  • Full-RO
  • Describes all simple paths
  • k-RO
  • A k-RO guarantees that its paths of length at most
    k+1 exist in the data.
  • 1-RO
  • Simplest, very compact representation

11
Representative Objects(RO)
12
Path Index
  • Access Support Relations
  • Deterministic: Strong DataGuide, Index Fabric,
    ToXin, APEX
  • Non-Deterministic: 1-Index, A(k)-Index, FB-Index

13
Access Support Relations
  • Kemper, Moerkotte IS 92
  • Originated in OODBMSs
  • select Name
  • from Mercedes.Manufactures.Composition.Division
  • Support joins along arbitrary reference chains
  • Generalization of the Join Index [Valduriez 87]
  • Based on the paths in the schema
  • Materialize access paths of arbitrary length
  • Support only predefined subsets of paths.

14
DataGuides
  • Goldman, Widom VLDB 97
  • An implementation of the Full-RO
  • Summary of label paths from the root (simple
    paths)
  • Concise: describes every unique simple path
    exactly once, regardless of the number of times
    it appears
  • Accurate: does not contain label paths that do not
    appear in the data
  • Convenient: can be stored and accessed using
    techniques similar to those available for
    processing semistructured data

15
DataGuides
  • The construction algorithm emulates the conversion
    of a non-deterministic finite automaton (NFA)
    into a deterministic finite automaton (DFA)
  • Intuitively, a simple path is represented as a
    node in the DataGuide
  • One XML data set may have multiple DataGuides
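
The NFA-to-DFA analogy can be sketched as a subset construction over an edge-labeled data graph: each DataGuide node is the set of data nodes reached by one label path (its target set), and equal target sets share a node, which gives a strong DataGuide. The function names, the triple format, and the node ids below are assumptions chosen for illustration; this is a minimal sketch, not the authors' implementation.

```python
from collections import defaultdict, deque


def strong_dataguide(edges, root):
    """Subset construction over (source, label, target) triples.

    Returns (guide, start): guide maps a target set (frozenset of data nodes)
    to {label: target set}; start is the target set of the empty path.
    """
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset([root])
    guide = {start: {}}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Group the children of every data node in this target set by label.
        by_label = defaultdict(set)
        for node in current:
            for label, targets in out[node].items():
                by_label[label] |= targets
        for label, targets in by_label.items():
            target_set = frozenset(targets)
            guide[current][label] = target_set
            if target_set not in guide:  # a new DataGuide node
                guide[target_set] = {}
                queue.append(target_set)
    return guide, start


def lookup(guide, start, path):
    """Follow a label path from the root; return the data nodes it reaches."""
    node = start
    for label in path:
        node = guide[node].get(label)
        if node is None:
            return frozenset()
    return node


# Hypothetical node numbering of the library example (0 is a virtual root).
EDGES = [
    (0, "libraryDB", 1), (1, "book", 2), (1, "paper", 6),
    (2, "title", 3), (2, "author", 4), (2, "chapter", 5),
    (6, "title", 7), (6, "author", 8), (6, "author", 9), (6, "section", 10),
]
guide, start = strong_dataguide(EDGES, 0)
paper_authors = lookup(guide, start, ["libraryDB", "paper", "author"])
print(sorted(paper_authors))  # [8, 9]
```

On tree-structured input every data node falls into exactly one target set, so the guide stays linear in the data size; on graphs the number of distinct target sets can grow exponentially.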

16
Strong DataGuide
  • If two simple paths reach the same set of nodes,
    the paths are represented by a single node.
  • Linear time and linear space for tree-structured
    data
  • Exponential time and exponential space for
    graph-structured data

17
1/2/T-Index
  • Milo and Suciu ICDT 99
  • 1-Index
  • Summarizes all label paths starting from the root
  • Supports queries of the form q = P x, where
    P = /l1/l2/.../ln
  • Non-deterministic
  • Based on backward bisimulation, which originated
    in graph verification
  • Extents are disjoint
  • More compact than Strong DataGuides

18
1-Index
  • Equivalence relation (≡)
  • v ≡ u iff L_v = L_u
  • where L_x = { w | w is a simple path from the root
    to x }
  • The collection of all equivalence classes
  • Exponential construction cost
  • Backward bisimulation (≈b)
  • If x ≈b y and x is the root, then y is the root
  • Conversely, if x ≈b y and y is the root, then x is
    the root
  • If x ≈b y and (x', l, x) is an edge, then there
    exists an edge (y', l, y) such that x' ≈b y'
  • Conversely, if x ≈b y and (y', l, y) is an edge,
    then there exists an edge (x', l, x) such that
    x' ≈b y'

19
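≡ vs ≈b
[Figure: example graph with labels a, b, c, d and two nodes X and Y]
  • X ≡ Y since L_X = L_Y = {a.b.d, a.c.d}
  • X and Y are not backward bisimilar (X ≉b Y)
  • v ≈b u ⇒ v ≡ u
  • O(m log m) construction cost [Paige and Tarjan 87]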
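
The gap between the two relations can be checked on a small graph. The graph below is hypothetical (it is not the slide's figure), built so that X and Y have the same set of label paths but different parent structure. The refinement loop is a naive fixed-point computation, not the Paige and Tarjan algorithm, and the path enumeration assumes an acyclic graph.

```python
from collections import defaultdict

# Hypothetical edge-labeled DAG: (source, label, target).
EDGES = [
    ("r", "a", "p"), ("p", "b", "q1"), ("p", "c", "q2"),
    ("q1", "d", "X"), ("q2", "d", "X"),
    ("r", "a", "t1"), ("r", "a", "t2"),
    ("t1", "b", "s"), ("t2", "c", "s"), ("s", "d", "Y"),
]
ROOT = "r"


def label_path_sets(edges, root):
    """L_x for every node x; terminates only on acyclic graphs."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    paths = defaultdict(set)
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        paths[node].add(".".join(path))
        for label, dst in out[node]:
            stack.append((dst, path + (label,)))
    return paths


def backward_bisimulation(edges, root):
    """Naive partition refinement; returns node -> block id."""
    parents = defaultdict(set)
    nodes = {root}
    for src, label, dst in edges:
        parents[dst].add((label, src))
        nodes.update((src, dst))
    block = {n: (n == root) for n in nodes}  # the root is only similar to itself
    while True:
        # A node's signature: its old block plus the (label, block) of each parent.
        signature = {
            n: (block[n], frozenset((lab, block[p]) for lab, p in parents[n]))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block  # no block was split: fixed point reached
        block = new_block


paths = label_path_sets(EDGES, ROOT)
blocks = backward_bisimulation(EDGES, ROOT)
print(paths["X"] == paths["Y"])     # True
print(blocks["X"] == blocks["Y"])   # False
```

X has two parents, each reached by one path, while Y has a single parent reached by both paths, so the refinement separates them even though their label-path sets coincide.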

20
1-Index vs Strong DataGuide
  • For tree-structured data, the strong DataGuide and
    the 1-Index coincide

21
2/T-Index
  • 2-Index
  • Supports queries of the form x1 P x2
  • e.g., //title
  • Equivalence relation (≡)
  • (v, u) ≡ (v', u') iff L(v, u) = L(v', u')
  • where L(x, y) = { w | w is a label path from x to
    y }
  • Summary of path information between two arbitrary
    nodes
  • T-Index
  • Generalization of the 1-Index and 2-Index
  • (v1, ..., vn) ≡ (u1, ..., un) iff
    L(v1, ..., vn) = L(u1, ..., un)
  • Conceptually similar to Access Support Relations
  • Supports only predefined paths

22
Index Fabric
  • Cooper, Sample, Franklin, Hjaltason, Shadmon,
    VLDB 01
  • Tree-structured data
  • Conceptually similar to the strong DataGuide
  • Layered structure
  • Uses a Patricia trie to index a large number of
    search keys
  • The simple path of each element that has a data
    value is encoded as a sequence of special
    characters
  • Each key is the combination of the encoded
    sequence and the data value.

23
Index Fabric
XML Data
  • Keeps only the information of elements that have
    data values
  • Patricia trie: lossy compression
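
The key encoding can be sketched as follows: each tag gets a one-character code, and the key for a valued element is its encoded root path followed by the value. In this sketch plain letters serve as the codes and a sorted list with binary search stands in for the Patricia trie; both are simplifying assumptions, as are the function names.

```python
import bisect
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author></paper>"
    "</libraryDB>"
)

designators = {}  # tag -> single character, assigned on first sight


def designator(tag):
    # Plain letters can collide with value text; a real encoding would use
    # characters that never occur in values.
    if tag not in designators:
        designators[tag] = chr(ord("A") + len(designators))
    return designators[tag]


def collect_keys(elem, prefix=""):
    """Yield one key per element that carries a data value."""
    prefix += designator(elem.tag)
    text = (elem.text or "").strip()
    if text:
        yield prefix + text
    for child in elem:
        yield from collect_keys(child, prefix)


keys = sorted(collect_keys(DOC))


def search(path_tags, value_prefix=""):
    """Keys whose path equals path_tags and whose value starts with value_prefix."""
    probe = "".join(designators[t] for t in path_tags) + value_prefix
    lo = bisect.bisect_left(keys, probe)
    hi = bisect.bisect_left(keys, probe + "\uffff")
    return keys[lo:hi]


print(keys)
# ['ABCtitle1', 'ABDauthor1', 'AECtitle2', 'AEDauthor2', 'AEDauthor3']
print(search(["libraryDB", "paper", "author"]))
# ['AEDauthor2', 'AEDauthor3']
```

Because path and value live in one string, a path query with a value predicate becomes a single prefix lookup over the keys.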

24
ToXin
  • Rizzolo, Mendelzon WebDB 01
  • Tree Structured Data
  • Conceptually similar to the strong DataGuide (not
    the minimal DataGuide)
  • Supports both forward and backward navigation
  • Path Tree (= strong DataGuide)
  • Each node of the Path Tree has an Index Table or a
    Value Table
  • Index Table (IT): parent-child relationships
  • Value Table (VT): owner-value relationships

25
ToXin
XML Data
  • Index Tables
    path              parent  child
    LibraryDB         null    1
    LibraryDB.book    1       2
    LibraryDB.paper   1       6
  • Value Tables
    LibraryDB.book.author
    parent  value
            author1

  • Since ToXin keeps parent-child relationships,
    it supports path expressions with value
    predicates
  • e.g., /libraryDB/book[author = "author1"]
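
A minimal sketch of how index tables and value tables could be built and used for a value predicate. Elements are numbered in document order, which reproduces the parent-child pairs listed above (1-2 for book, 1-6 for paper). The table layout, the function names, and the choice to record the element itself as the owner of its value are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = ET.fromstring(
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author><chapter/></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author><section/></paper>"
    "</libraryDB>"
)

index_tables = defaultdict(list)  # label path -> [(parent id, child id)]
value_tables = defaultdict(list)  # label path -> [(owner id, value)]
counter = 0


def build(elem, parent_id=None, prefix=""):
    """Number elements in document order and fill both kinds of table."""
    global counter
    counter += 1
    node_id = counter
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    index_tables[path].append((parent_id, node_id))
    text = (elem.text or "").strip()
    if text:
        value_tables[path].append((node_id, text))
    for child in elem:
        build(child, node_id, path)


build(DOC)


def parents_with_value(path, value):
    """Backward navigation: parents whose child at `path` holds `value`."""
    owners = {owner for owner, v in value_tables[path] if v == value}
    return [parent for parent, child in index_tables[path] if child in owners]


print(index_tables["libraryDB.book"])   # [(1, 2)]
print(index_tables["libraryDB.paper"])  # [(1, 6)]
print(parents_with_value("libraryDB.book.author", "author1"))  # [2]
```

The last call answers the value-predicate query by starting at the value table and stepping back through the index table to the matching book.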

26
A(k)-Index
  • Kaushik, Shenoy, Bohannon, Gudes ICDE 02
  • Strong DataGuide and 1-Index record all
    simple paths
  • Increased index size => increased search space
  • Approximation of the 1-Index
  • Non-deterministic
  • Utilizes local similarity (degree k)
  • Reduces the size of the index graph

27
A(k)-Index
  • k-bisimulation (≈k)
  • For any two nodes v and u, v ≈0 u iff u and v
    have the same label
  • v ≈k u iff v ≈k-1 u and for every parent v' of
    v, there is a parent u' of u such that
    v' ≈k-1 u', and vice versa
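
The definition translates almost directly into k rounds of partition refinement over a node-labeled graph. The sketch below assumes the library example with a hypothetical node numbering (1 = libraryDB, 2 = book, 6 = paper, and so on); the function name and input format are also assumptions.

```python
def k_bisimulation(labels, parents, k):
    """labels: node -> label; parents: node -> iterable of parent nodes.

    Returns node -> block id for the k-bisimulation partition.
    """
    block = dict(labels)  # k = 0: nodes with the same label share a block
    for _ in range(k):
        # Two nodes stay together only if they were together before and
        # their parents cover the same set of previous-round blocks.
        signature = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in labels
        }
        ids = {}
        block = {n: ids.setdefault(signature[n], len(ids)) for n in labels}
    return block


# Hypothetical numbering of the library example.
LABELS = {
    1: "libraryDB", 2: "book", 3: "title", 4: "author", 5: "chapter",
    6: "paper", 7: "title", 8: "author", 9: "author", 10: "section",
}
PARENTS = {2: [1], 6: [1], 3: [2], 4: [2], 5: [2],
           7: [6], 8: [6], 9: [6], 10: [6]}

b0 = k_bisimulation(LABELS, PARENTS, 0)
b1 = k_bisimulation(LABELS, PARENTS, 1)
print(b0[3] == b0[7])  # True: both are title nodes
print(b1[3] == b1[7])  # False: one parent is a book, the other a paper
print(b1[8] == b1[9])  # True: both authors have a paper parent
```

With k = 0 the index has one node per label; each extra round looks one step further up the incoming paths and can only split blocks, never merge them.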

28
A(k)-Index
  • Building cost: O(km)
  • In general, for 1-Index, k < log m
  • Query Processing
  • Label path expressions of length at most k+1
  • Precise
  • Label path expressions of length > k+1
  • Safe, but may include false results
  • Validation => requires a data scan

29
APEX: Adaptive Path indEx for XML Data
  • Chung, Min, Shim SIGMOD 02
  • Strong DataGuide and 1-Index keep all
    simple paths
  • Users use partial matching path queries
  • //book/title
  • Exhaustive navigation of the index structure for
    partial matching path queries may degrade
    performance

30
APEX
  • Deterministic
  • Approximation of DataGuides
  • Efficient processing of partial matching path
    queries
  • Workload-Aware
  • Self-tuning strategies [Chaudhuri et al. 00]
  • Utilizes the query workload
  • Builds APEX from both the XML data and frequently
    used paths
  • Sequential pattern mining [Agrawal and Srikant 95]

31
APEX
APEX (frequently used paths: book.title)
Hash tree, top level:
  label      xnode  next
  xroot      0
  libraryDB  1
  book       2
  paper      3
  title             (subtable below)
  author     4
  chapter    5
  section    6
  editor     7
Hash tree, subtable reached from title:
  label      count  xnode  next
  book              8
  remainder         9
Extents:
  xnode  extent
  0      <null,0>
  1      <0,1>
  2      <1,2>
  3      <1,6>
  4      <2,4>, <6,8>, <6,9>
  5      <2,5>
  6      <6,10>
  7      <2,8>
  8      <2,3>
  9      <6,7>
  • Hash Tree
  • Keeps frequently used paths
  • Prevents exhaustive search
  • Graph Structure
  • Structural summary + extents
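
One way to read the hash tree on this slide: a label path is looked up from its last label backwards, and the "remainder" entry catches every prefix that is not a frequently used path. The sketch below transcribes the slide's labels, xnode ids, and extents into Python dictionaries; the nesting under title, the function names, and the lookup rules are assumptions about how the structure is meant to be traversed.

```python
# A label maps either to an xnode id or to a nested table keyed by the
# preceding label of the path (transcribed from the slide).
HASH_TREE = {
    "xroot": 0, "libraryDB": 1, "book": 2, "paper": 3,
    "title": {"book": 8, "remainder": 9},
    "author": 4, "chapter": 5, "section": 6, "editor": 7,
}

# xnode id -> list of <parent, node> pairs, as printed on the slide.
EXTENTS = {
    0: [(None, 0)], 1: [(0, 1)], 2: [(1, 2)], 3: [(1, 6)],
    4: [(2, 4), (6, 8), (6, 9)], 5: [(2, 5)], 6: [(6, 10)],
    7: [(2, 8)], 8: [(2, 3)], 9: [(6, 7)],
}


def leaves(table):
    """All xnode ids stored anywhere below a (nested) table."""
    for entry in table.values():
        if isinstance(entry, dict):
            yield from leaves(entry)
        else:
            yield entry


def find_xnodes(label_path):
    """Walk the label path backwards through the hash tree."""
    table = HASH_TREE
    for label in reversed(label_path):
        entry = table.get(label, table.get("remainder"))
        if entry is None:
            return []
        if not isinstance(entry, dict):
            return [entry]
        table = entry
    # Path exhausted inside a nested table: every xnode below it qualifies.
    return sorted(leaves(table))


def extent_nodes(label_path):
    """Data nodes in the extents of the matching xnodes."""
    return [node for x in find_xnodes(label_path) for _, node in EXTENTS[x]]


print(find_xnodes(["book", "title"]))   # [8]
print(find_xnodes(["paper", "title"]))  # [9]
print(extent_nodes(["title"]))          # [3, 7]
```

The lookup touches only the hash tree, so //book/title is answered without navigating the whole summary graph. For a path whose matched suffix is shorter than the query, a full query processor would presumably still have to join extents, which this sketch does not attempt.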

32
FB Index
  • Kaushik, Bohannon, Naughton, Korth SIGMOD 02
  • Supports twig (branching) path expressions
  • e.g., /A/B[C]
  • Basic Idea
  • For every edge e labelled l from v to u, add an
    (inverse) edge e^-1 with label l^-1 from u to v
  • Then compute the 1-Index on this modified graph.
  • Very large index space
  • Apply some heuristics
  • Exploit local similarity: k-bisimulation

33
Discussion
  • Path Index
  • Improves query performance by restricting the
    search space
  • Can be applied to various applications
  • Selectivity Estimation
  • QBE(Query By Example)
  • Future Work
  • Support twig queries
  • Query Optimization
  • cost formula of path index

34
Thank You!
  • Any questions?
  • http://islab.kaist.ac.kr/jkmin
  • jkmin_at_islab.kaist.ac.kr

35
References
  1. C. Chung, J. Min and K. Shim, APEX: An Adaptive
    Path Index for XML Data, SIGMOD 02
  2. B. Cooper, N. Sample, M. Franklin, G. Hjaltason
    and M. Shadmon, A Fast Index for Semistructured
    Data, VLDB 01
  3. M. Garofalakis, A. Gionis, R. Rastogi, S.
    Seshadri and K. Shim, XTRACT: A System for
    Extracting Document Type Descriptors from XML
    Documents, SIGMOD 00
  4. R. Goldman and J. Widom, DataGuides: Enabling
    Query Formulation and Optimization in
    Semistructured Databases, VLDB 97
  5. R. Kaushik, P. Bohannon, J. Naughton and H.
    Korth, Covering Indexes for Branching Path
    Queries, SIGMOD 02
  6. R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes,
    Exploiting Local Similarity for Indexing Paths
    in Graph-Structured Data, ICDE 02
  7. A. Kemper and G. Moerkotte, Access Support
    Relations: An Indexing Method for Object Bases,
    Information Systems 92
  8. T. Milo and D. Suciu, Index Structures for Path
    Expressions, ICDT 99
  9. S. Nestorov, J. Ullman, J. Wiener and S.
    Chawathe, Representative Objects: Concise
    Representations of Semistructured, Hierarchical
    Data, ICDE 97
  10. F. Rizzolo and A. Mendelzon, Indexing XML Data
    with ToXin, WebDB 01
  11. R. Paige and R. Tarjan, Three Partition
    Refinement Algorithms, SIAM Journal on Computing
    87
  12. P. Valduriez, Join Indices, TODS 87