Title: ViST: a dynamic index method for querying XML data by tree structures
1ViST: a dynamic index method for querying XML data by tree structures
- Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu
- Presenter: Elena Zheleva, November 2003
2Overview
- Modeling XML Queries
- Structure-encoded sequences
- Indexing
- ViST
- Experimental Results
3Modeling XML Queries
4- DTD of purchase records
- <!ELEMENT purchases (purchase*)>
- <!ELEMENT purchase (seller, buyer)>
- <!ATTLIST seller ID ID location CDATA name CDATA>
- <!ELEMENT seller (item*)>
- <!ATTLIST buyer ID ID location CDATA name CDATA>
- <!ELEMENT item (item*)>
- <!ATTLIST item name CDATA manufacturer CDATA>
5Modeling XML Queries
- Focus in XML query language design: the ability to express complex structural or graphical queries
6Modeling XML Queries
- Querying XML data: finding substructures of the data graph that match the structure of the query
- Structure-encoded sequences: a sequential representation of both XML data and XML queries
7Structure-Encoded Sequences
8Structure-Encoded Sequences
- Maps both the data and the queries to sequences
- Matches a query against the data as a subsequence
- Purpose: to avoid as many join operations as possible
- Definition: a sequence of (symbol, prefix) pairs
9Mapping Data
- Represent the XML document tree by its preorder traversal
- Convert the preorder sequence to a structure-encoded sequence
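The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `structure_encode`, the sample document, and the choice to treat element text as a leaf symbol are all assumptions made for the example. Each node is emitted in preorder as a (symbol, prefix) pair, where the prefix is the path of tags from the root down to the node's parent.

```python
import xml.etree.ElementTree as ET

def structure_encode(root):
    """Return the preorder list of (symbol, prefix) pairs for an element tree."""
    sequence = []

    def visit(node, prefix):
        # prefix: tuple of tags on the path from the root to this node's parent
        sequence.append((node.tag, prefix))
        child_prefix = prefix + (node.tag,)
        text = (node.text or "").strip()
        if text:
            # Assumption for this sketch: a text value is a leaf symbol
            # under its enclosing element.
            sequence.append((text, child_prefix))
        for child in node:
            visit(child, child_prefix)

    visit(root, ())
    return sequence

# Hypothetical sample document, loosely following the purchase-record DTD.
doc = ET.fromstring(
    "<purchase><seller><item>ibm</item></seller><buyer>dell</buyer></purchase>"
)
encoded = structure_encode(doc)
for symbol, prefix in encoded:
    print(symbol, "/".join(prefix))
```

The sketch ignores attributes and any normalization of sibling order; it only shows how preorder position and path prefix combine into one sequence.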
10Mapping Queries
- Benefit of sequence matching: the query is processed as a whole
- Path expressions are converted to structure-encoded sequences
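As a rough illustration of mapping a query, the sketch below encodes a simple child-step path expression as (symbol, prefix) pairs. The helper name `encode_path` is hypothetical, and the sketch deliberately rejects wildcards and branching queries, which need more careful treatment than shown here.

```python
def encode_path(path):
    """Encode a simple path such as '/a/b/c' as (symbol, prefix) pairs."""
    if "//" in path or "*" in path:
        # Wildcards are outside the scope of this sketch.
        raise ValueError("only plain child steps are supported")
    steps = [step for step in path.split("/") if step]
    # The prefix of the i-th step is the list of steps before it.
    return [(step, tuple(steps[:i])) for i, step in enumerate(steps)]

path_query = encode_path("/purchase/seller/item")
print(path_query)
```

Because the whole path becomes one sequence, it can be matched in a single pass rather than step by step with joins.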
11Structure-Encoded Sequences
12Querying XML
- through Structure-Encoded Sequence Matching
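Querying by sequence matching can be pictured as an ordered, gap-tolerant comparison of two lists of (symbol, prefix) pairs. The sketch below is an assumption-laden simplification: it requires exact equality of pairs (no wildcard prefixes) and scans one data sequence linearly, with no index. The function name and sample sequences are made up for the example.

```python
def is_subsequence(query, data):
    """True if the query pairs occur in data in the same order (gaps allowed)."""
    remaining = iter(data)
    # `pair in remaining` advances the iterator until the pair is found,
    # so later query pairs can only match later data positions.
    return all(pair in remaining for pair in query)

data_seq = [
    ("purchase", ()),
    ("seller", ("purchase",)),
    ("item", ("purchase", "seller")),
    ("buyer", ("purchase",)),
]
buyer_query = [("purchase", ()), ("buyer", ("purchase",))]
wrong_order = [("buyer", ("purchase",)), ("seller", ("purchase",))]

print(is_subsequence(buyer_query, data_seq))
print(is_subsequence(wrong_order, data_seq))
```

A linear scan like this does not scale to many documents, which is the motivation for the indexing slides that follow.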
13Indexing
14Role of Indexing
- To provide an algorithm to perform this sequence matching
- Desired features for the algorithm:
- Efficient support for subsequence matching
- Use well-supported DB indexing techniques such as B+ trees
- Allow dynamic index insertion
15What is indexing useful for?
- Auxiliary access structures
- Used to speed up the retrieval of records
- In response to certain search conditions
- Provide efficient support for arbitrarily structured queries
- Including queries that use the wildcards // and *
16Indexing
- State-of-the-art approaches:
- Indexes on paths
- Indexes on nodes
- Indexes on both (structures): ViST
17ViST
18Algorithms
- Naïve algorithm: based on suffix trees
- RIST: Relationships Indexed Suffix Tree
- ViST: Virtual Suffix Tree
19Algorithm Using Suffix Trees
- Suffix tree: a compact index to all distinct, contiguous substrings of a string
- D-Ancestorship: ancestorship in the XML document tree
- Determined through the structure-encoded sequence
- S-Ancestorship: ancestorship in the suffix tree
20Example Using Suffix Trees
21Algorithm Using Suffix Trees
- Searches:
- first by S-Ancestorship: searching under the suffix tree
- then by D-Ancestorship: matching nodes and prefixes
- Disadvantages:
- Costly: must traverse a large portion of the subtree
- Most commercial DBMSs do not support suffix trees
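The cost of the naïve search can be seen in a small sketch. Here a plain trie over whole document sequences stands in for the suffix tree (an assumption made to keep the example short), and the class and function names are hypothetical. For each query element, the search walks every descendant of the current node looking for a match, which is the subtree traversal listed as a disadvantage above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # (symbol, prefix) pair -> TrieNode
        self.doc_ids = set()  # documents whose sequence ends at this node

def insert(root, sequence, doc_id):
    node = root
    for pair in sequence:
        node = node.children.setdefault(pair, TrieNode())
    node.doc_ids.add(doc_id)

def docs_below(node):
    """Collect the ids of all documents stored at or under node."""
    found = set(node.doc_ids)
    for child in node.children.values():
        found |= docs_below(child)
    return found

def naive_search(node, query):
    """Ids of documents whose sequence contains query as a subsequence."""
    if not query:
        return docs_below(node)
    found = set()
    for pair, child in node.children.items():
        if pair == query[0]:
            # Matched the next query element; continue with the rest.
            found |= naive_search(child, query[1:])
        # Whether or not it matched, keep looking deeper (gaps are allowed).
        found |= naive_search(child, query)
    return found

trie = TrieNode()
insert(trie, [("P", ()), ("S", ("P",)), ("I", ("P", "S"))], doc_id=1)
insert(trie, [("P", ()), ("B", ("P",))], doc_id=2)

print(naive_search(trie, [("P", ()), ("B", ("P",))]))
print(naive_search(trie, [("P", ())]))
```

Every recursive call may visit the whole subtree below it, so the work grows with the size of the tree rather than with the number of matches.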
22RIST: Indexing by Ancestor-Descendant Relationships
- Jumps directly to the nodes y of which x is both a D-Ancestor and an S-Ancestor
- Index construction uses B+ trees
23RIST: Indexing by Ancestor-Descendant Relationships
- Subsequence matching:
- Determine D-Ancestorship by prefixes
- Determine S-Ancestorship by the label <n_x, size_x>
- x: suffix tree node (root of S-tree)
- n_x: prefix traversal order of x
- size_x: number of descendants of x
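A small sketch may help show why a label of the form <n_x, size_x> is enough to decide S-Ancestorship. Assuming n_x is the preorder number of x and size_x its number of descendants, y lies under x exactly when n_x < n_y <= n_x + size_x. The function names and the toy tree below are illustrative only.

```python
def label_tree(children, root):
    """Assign each node a label (n, size): preorder number and descendant count."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        size = 0
        for child in children.get(node, []):
            size += 1 + visit(child)  # the child itself plus its descendants
        labels[node] = (n, size)
        return size

    visit(root)
    return labels

def is_s_ancestor(labels, x, y):
    """True if x is a proper ancestor of y, judged from the labels alone."""
    n_x, size_x = labels[x]
    n_y, _ = labels[y]
    return n_x < n_y <= n_x + size_x

# Toy tree given as an adjacency mapping (hypothetical node names).
tree = {"root": ["a", "d"], "a": ["b", "c"]}
labels = label_tree(tree, "root")
print(labels)
print(is_s_ancestor(labels, "a", "c"))
print(is_s_ancestor(labels, "a", "d"))
```

Since the test is a range condition on n_y, descendants of x can be fetched with a range lookup on an ordered index instead of a subtree traversal.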
24ViST: the Virtual Suffix Tree
- Same sequence matching algorithm as RIST
- BUT supports dynamic insertions
- Uses a dynamic method to assign labels
- Once assigned, the labels are fixed and are not affected by subsequent data insertion or deletion
- Labels the suffix tree without building it
- Relies on statistical information about the XML data
25ViST: the Virtual Suffix Tree
- Index structure contains the sequence
- Sequence to be inserted
- Dynamic scope of x: <n_x, size_x, k_x>
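One way to picture a dynamic scope <n_x, size_x, k_x> is as a reserved range of labels from which child ranges are carved out on demand. The sketch below is only an illustration under stated assumptions: k is taken to be the number of child subscopes handed out so far, and each new child receives a fixed fraction 1/lam of the parent's remaining range. The class `Scope`, the parameter `lam`, and this allocation rule are hypothetical and are not the allocation formula from the paper, which can also draw on statistics about the data.

```python
class Scope:
    def __init__(self, n, size):
        self.n = n              # label of this (virtual) node
        self.size = size        # labels n+1 .. n+size are reserved for descendants
        self.k = 0              # child subscopes handed out so far (assumed meaning)
        self.next_free = n + 1  # first label not yet given to any child

    def allocate_child(self, lam=2):
        """Carve a child scope out of 1/lam of the remaining label range."""
        remaining = self.n + self.size - self.next_free + 1
        block = remaining // lam
        if block < 1:
            raise OverflowError("scope exhausted")
        # The first label of the block is the child's own label;
        # the rest of the block is reserved for the child's descendants.
        child = Scope(self.next_free, block - 1)
        self.next_free += block
        self.k += 1
        return child

root_scope = Scope(0, 1024)
first = root_scope.allocate_child()
second = root_scope.allocate_child()
print((first.n, first.size), (second.n, second.size), root_scope.k)
```

Labels handed out this way never change when later children are added, which is the property that allows insertions without relabeling; the price is that a scope can run out of room.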
26ViST: the Virtual Suffix Tree
27Experimental Results
- Datasets used
- DBLP: computer science bibliography database
- 289,627 records (publications)
- Each publication is a tree of maximum depth 6
- Average length of a structure-encoded sequence: 31
- XMARK
- 1 record
- Complicated tree structure
- Synthetic
28Experimental Results
- Comparison Methods
- Index Fabric: indexes XML paths
- XISS: uses nodes as the basic query unit
- ViST: takes approx. 1/10 of the time to perform queries, since the other methods need (multiple) join operations
29Experimental Results
- Index structure and size (1/3 less than the suffix tree)
- DocId B+ tree: N elements
- Combined D-Ancestor and S-Ancestor B+ tree: N x L elements
- Index construction
30Conclusion
- XML queries are answered by subsequence matching
- Advantages of the ViST algorithm for subsequence matching:
- Avoids expensive join operations
- Indexes both the content and the structure of XML documents
- Uses B+ trees, which are well supported for disk-based data
- Supports dynamic data insertion and deletion