A Summary of XISS and Index Fabric

Provided by: CSI115

1
A Summary of XISS and Index Fabric
  • Ho Wai Shing

2
Contents
  • Definition of Terms
  • XISS (Li and Moon, VLDB2001)
  • Numbering Scheme
  • Indices Stored
  • Join Algorithms
  • Index Fabric (Cooper et al, VLDB2001)
  • Patricia
  • Balanced Trie
  • Raw Path Index

3
Definition of Terms
  • Absolute Path Expression (APE)
  • a path that starts from the root, where each step
    traverses the child axis or the attribute axis,
    with no wildcards
  • e.g., /, /A/B, /A/@C

4
Definition of Terms
  • Regular Path Expression (RPE)
  • may start from root or not,
  • may traverse different axes (restricted to child,
    descendant-or-self, attribute for discussions
    since they are the most commonly used ones)
  • may contain wildcards
  • e.g., //, /A//C, /A/*/B, //A/B//C/D/@E

5
XISS
  • XISS: XML Indexing and Storage System
  • by Li and Moon, published in VLDB 2001 under the
    title "Indexing and Querying XML Data for Regular
    Path Expressions"
  • decomposes and stores XML documents in the
    indices
  • can answer regular path expressions

6
XISS - General Idea
  • solves an RPE by decomposing it into these 5 basic
    subexpressions
  • element retrieval
  • attribute retrieval
  • steps involving an element and an attribute
  • steps involving two elements
  • a Kleene closure of another subexpression

7
XISS - General Idea
  • each subexpression is solved by its own method
  • element index lookup
  • attribute index lookup
  • EA-join
  • EE-join
  • KC-join

8
XISS - General Idea
  • result lists from the subexpressions are joined
    to produce the final result
  • to make this decomposition and join efficient, an
    efficient method to determine ancestor-descendant
    relationship is needed
  • XISS uses an extended preorder based numbering
    scheme

9
XISS - Numbering Scheme
  • number all the nodes with an <order, size> tuple
  • order is assigned based on an extended preorder
    traversal
  • size can be imagined as the size of the subtree
    rooted at that node

10
XISS - Numbering Scheme
  • The rules for number assignment
  • if x precedes y in the preorder traversal,
    x.order < y.order (preorder)
  • if x and y are siblings, either x.order + x.size
    < y.order or y.order + y.size < x.order (siblings
    won't overlap)
  • if x is an ancestor of y, x.order < y.order <
    x.order + x.size (ancestor contains descendant)
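
The three rules above can be checked directly on <order, size> pairs. The following is a minimal Python sketch, assuming the strict inequalities exactly as written on this slide; the names `Node`, `is_ancestor`, and `siblings_disjoint` and the sample numbers are hypothetical and do not come from XISS.

```python
from typing import NamedTuple


class Node(NamedTuple):
    order: int  # position in the extended preorder traversal
    size: int   # width of the interval reserved for the node's subtree


def is_ancestor(x: Node, y: Node) -> bool:
    """Containment rule: y's order falls inside x's reserved interval."""
    return x.order < y.order < x.order + x.size


def siblings_disjoint(x: Node, y: Node) -> bool:
    """Sibling rule: the two reserved intervals do not overlap."""
    return x.order + x.size < y.order or y.order + y.size < x.order


# Sample numbering (made up): a is the root, b and c are its children,
# d is a child of b.
a, b, c, d = Node(1, 100), Node(10, 30), Node(50, 20), Node(15, 5)
```

With these sample numbers, `is_ancestor(a, d)` holds without visiting b, which is the property the joins rely on: the relationship is decided from two records alone.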

11
XISS - Numbering Scheme
  • Actual Assignment
  • uses heuristics to reserve some space between
    orders
  • reserves extra room in the sizes for future node
    insertions
  • attributes are placed before sibling elements
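
One way to picture the assignment is a preorder walk that leaves a gap before every node and some slack at the end of every subtree. The sketch below is only an illustration under that assumption: the fixed `gap` of 10 is a stand-in for whatever heuristics XISS actually uses, and `XNode` and `assign` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class XNode:
    name: str
    children: list["XNode"] = field(default_factory=list)  # attributes first
    order: int = 0
    size: int = 0


def assign(node: XNode, next_order: int = 1, gap: int = 10) -> int:
    """Number the subtree rooted at node; return the end of its interval."""
    node.order = next_order
    cursor = node.order
    for child in node.children:
        # Leave `gap` unused order values before each child.
        cursor = assign(child, cursor + gap, gap)
    end = cursor + gap            # slack after the last descendant
    node.size = end - node.order  # interval covers all descendants
    return end


# Sample tree (made up): A has children B and C, and B has child D.
d = XNode("D")
b = XNode("B", [d])
c = XNode("C")
a = XNode("A", [b, c])
assign(a)
numbering = [(n.name, n.order, n.size) for n in (a, b, d, c)]
```

For this tree the numbering comes out as A <1, 70>, B <11, 30>, D <21, 10>, C <51, 10>, so a new node can later be inserted under any parent without renumbering the rest, as long as it fits into the reserved gaps.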

12
XISS - Index Organization
  • There are 5 indices
  • Name Index
  • Element Index
  • Attribute Index
  • Structure Index
  • Value Table

13
XISS - Name Index
  • maps element or attribute name to a name
    identifier (or nid)
  • nid is used for further query evaluation
    representing that element or attribute
  • reduces the time spent on string comparisons in
    later index lookups
  • stored in a B-tree

14
XISS - Name Index
  • [Figure: a B-tree that maps each Name to its nid]

15
XISS - Value Table
  • stores all the string values of the XML document

16
XISS - Element Index
  • input: nid; output: a list of element records
  • implemented by a B-tree
  • leaves point to a list of document IDs (did);
    each entry in that list points to a list of all
    elements with the same name in the same document

17
XISS - Element Index
  • [Figure: a B-tree keyed by nid; each leaf points to
    a did list, and each did entry points to an
    element list; each element record holds
    <order, size>, Depth, ParentID]

18
XISS - Attribute Index
  • Very similar to element index
  • always has a value identifier, vid

19
XISS - Structure Index
  • input: did; output: an array containing all the
    elements and attributes in the document
  • implemented by a B-tree

20
XISS - Structure Index
  • [Figure: a B-tree keyed by did; each leaf points to
    a record array; each record holds nid,
    <order, size>, Parent order, Child order,
    Sibling order, Attribute order]

21
XISS - Indices
  • When to use which index?
  • first use Name Index to find nid of the
    element/attribute to be queried
  • search Element/Attribute index for the records
  • if we need values, look them up in the Value Table
  • use Structure Index to rebuild or traverse the
    XML document tree
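
The first two lookup steps can be sketched with plain dictionaries standing in for the B-trees. Everything here is an assumption made for illustration: the sample names, the identifiers, and the record layout (order, size, depth, parent id) are made up, and `find_elements` is a hypothetical helper.

```python
# Plain dicts stand in for the B-trees; all contents are made-up sample data.
name_index = {"book": 1, "title": 2}   # Name Index: name -> nid
element_index = {                      # Element Index: nid -> did -> records
    1: {"doc1": [(1, 70, 0, None)]},
    2: {"doc1": [(11, 10, 1, 1), (31, 10, 1, 1)]},
}


def find_elements(name):
    """Return (did, record) pairs for every element with the given name."""
    nid = name_index.get(name)            # step 1: name -> nid
    if nid is None:
        return []
    by_doc = element_index.get(nid, {})   # step 2: nid -> per-document lists
    return [(did, rec) for did in sorted(by_doc) for rec in by_doc[did]]
```

Returning the records grouped by did and in stored order matches what the join algorithms expect as input.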

22
XISS - Join Algorithms
  • After getting the record lists from each
    subexpression, we need to find out which are
    answers to the original query
  • e.g., to find /A/B, we found a record list of all
    element A, another list of all element B, and we
    have to find out which Bs are A/B

23
XISS - Join Algorithms
  • Three join algorithms proposed
  • EA-join - merges an element record list and an
    attribute record list (solves A/@B)
  • EE-join - merges two element record lists (solves
    A/B or A//B)
  • KC-join - self-merges an element record list
    (solves (E)*, the Kleene closure of E)

24
XISS - EA-join
  • to solve E/@A
  • input: an element record list and an attribute
    record list
  • finds the attribute records whose parents are in
    the element record list
  • both lists are sorted by did, then by order

25
XISS - EA-join
  • 2-stage sort-merge
  • group by did first
  • then merge using order
  • output criterion: E is the parent of A
  • a single scan of both lists is enough
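
The single-scan merge can be sketched as follows. This is an illustration only and makes two assumptions that the slides do not state: each attribute record carries the order of its owning element in a `parent` field, and, because attributes are numbered directly after their parent, an attribute list sorted by order is also sorted by parent order. `Elem`, `Attr`, and `ea_join` are hypothetical names.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    did: str
    order: int


class Attr(NamedTuple):
    did: str
    order: int
    parent: int  # order of the owning element (assumed field)


def ea_join(elements, attributes):
    """Pair each attribute with its parent element in one pass over both lists.

    Both inputs must be sorted by (did, order).
    """
    out = []
    i = 0
    for a in attributes:
        key = (a.did, a.parent)
        # Advance the element cursor; it never moves backwards.
        while i < len(elements) and (elements[i].did, elements[i].order) < key:
            i += 1
        if i < len(elements) and (elements[i].did, elements[i].order) == key:
            out.append((elements[i], a))
    return out


elems = [Elem("d1", 1), Elem("d1", 20), Elem("d2", 5)]
attrs = [Attr("d1", 2, 1), Attr("d1", 3, 1), Attr("d1", 31, 30), Attr("d2", 6, 5)]
pairs = ea_join(elems, attrs)
```

In the sample data the attribute with parent order 30 has no matching element in the list, so it is dropped; the other three attributes are paired with their parents.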

26
XISS - EE-join
  • to solve steps involving two elements, e.g., E/E,
    E//E, E/*/E
  • input: two element record lists, E and F
  • output: (e, f) where e is an ancestor of f
  • also uses 2-stage sort-merge
  • however, it may need to scan parts of the lists
    multiple times (in special cases, e.g., when the
    document contains /A/A/B/B)
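
The need for repeated scanning shows up in a small sketch of the merge. This is an illustrative version under stated assumptions, not the published algorithm: it uses the containment test from the numbering scheme with strict inequalities, and `Rec` and `ee_join` are hypothetical names.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    did: str
    order: int
    size: int


def ee_join(E, F):
    """Return (e, f) pairs where e is an ancestor of f.

    Both lists must be sorted by (did, order).
    """
    out = []
    start = 0
    for e in E:
        # F records at or before e can never be descendants of e or of any
        # later element, so this cursor only moves forward.
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        # Scan the F records inside e's interval. For nested elements this
        # range is scanned again, which is the "multiple scans" case.
        j = start
        while j < len(F) and F[j].did == e.did and F[j].order < e.order + e.size:
            out.append((e, F[j]))
            j += 1
    return out


# Numbering for a document containing /A/A/B/B (sample values, made up).
a1, a2 = Rec("d1", 1, 100), Rec("d1", 10, 50)
b1, b2 = Rec("d1", 20, 30), Rec("d1", 30, 10)
pairs = ee_join([a1, a2], [b1, b2])
```

Both B records are visited once for the outer A and again for the inner A, yielding four pairs. For a parent-child step such as A/B, the pairs would additionally be filtered on depth or parent id.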

27
XISS - KC-join
  • to solve the Kleene closure of a subexpression
  • input: a list of element records that fit the base
    case
  • repeatedly applies EE-join to the list, stopping
    when the result list no longer grows
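
The stop-when-nothing-grows loop is a fixpoint iteration. The sketch below is a simplification made for illustration: it represents each match of the base subexpression as a (start, end) pair of node ids and uses a plain equality join on those pairs in place of EE-join. `kc_join` is a hypothetical name.

```python
def kc_join(base):
    """Close a set of (start, end) pairs under repeated application.

    base holds the pairs matched by one application of the subexpression.
    """
    result = set(base)
    frontier = set(base)
    while frontier:
        # Extend the newest paths by one more step: (a, b) + (b, c) -> (a, c).
        step = {(a, c) for (a, b) in frontier for (b2, c) in base if b == b2}
        frontier = step - result  # keep only pairs not seen before
        result |= frontier
    return result


closure = kc_join({(1, 2), (2, 3), (3, 4)})
```

Here the chain 1, 2, 3, 4 yields the three extra pairs (1, 3), (2, 4), and (1, 4). Because only unseen pairs are carried into the next round, the loop terminates once a round adds nothing.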

28
Index Fabric
  • by Cooper et al., published in VLDB 2001 under the
    title "A Fast Index for Semistructured Data"
  • has 2 subtypes: raw path index and refined path
    index
  • uses the Patricia technique to compress the index

29
Index Fabric - General Idea
  • it is a balanced, disk-based indexing structure
    built on Patricia
  • each data node is associated with a key string,
    and this string is stored in the trie index for
    retrieval
  • the layered approach to building the index bounds
    the number of disk pages accessed per query

30
Index Fabric - General Idea
  • raw path index answers absolute path queries
  • refined path index answers any predefined queries
  • the difference is how to generate the key

31
Patricia
  • Patricia: Practical Algorithm To Retrieve
    Information Coded in Alphanumeric
  • by Morrison, in JACM 1968
  • a method to store and retrieve strings in a
    space-efficient way
  • binary: uses bit comparisons, and each internal
    node has a skip

32
Patricia
  • an example Patricia trie
  • [Figure: a binary trie with edges labeled 0 and 1
    and leaves 101110, 101111, 110000, 110011]

33
Patricia
  • it's basically a trie with the single-child
    internal nodes removed
  • search is done by
  • branching according to the value of the bit at
    each node's skip position
  • retrieving the string at the leaf
  • comparing it with the query string
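
The three search steps can be sketched over the four example keys 101110, 101111, 110000, and 110011. This is an illustration under assumptions: each internal node stores the absolute index of the bit to test rather than a relative skip count, the trie is built by hand, and `Leaf`, `Inner`, and `search` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    key: str


@dataclass
class Inner:
    bit: int      # index of the bit to test; bits in between are skipped
    zero: "Tree"
    one: "Tree"


Tree = Union[Leaf, Inner]


def search(node: Tree, query: str) -> bool:
    # Step 1: branch on the tested bits only.
    while isinstance(node, Inner):
        if node.bit >= len(query):
            return False
        node = node.one if query[node.bit] == "1" else node.zero
    # Steps 2 and 3: fetch the leaf's string and compare it with the query,
    # because the skipped bits were never checked on the way down.
    return node.key == query


# All keys share bit 0, so the root tests bit 1; the left pair differs at
# bit 5 and the right pair at bit 4.
trie = Inner(1,
             Inner(5, Leaf("101110"), Leaf("101111")),
             Inner(4, Leaf("110000"), Leaf("110011")))
```

A query such as 100011 follows the same branches as 101111 and reaches that leaf, and only the final comparison rejects it.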

34
Index Fabric - Balanced Trie
  • The number of disk pages accessed per query is
    bounded by the number of layers in the layered
    index
  • The idea is similar to that of a B-tree: the
    Patricia trie is decomposed into blocks, and
    there is an upper-layer trie which traverses the
    blocks

35
Index Fabric - Balanced Trie
  • e.g.
  • [Figure: the example Patricia trie (leaves 101110,
    101111, 110000, 110011) divided into blocks across
    Layer 0 and Layer 1]

36
Index Fabric - Balanced Trie
  • There are 3 types of links in the balanced trie
  • far link: across layers, a result of branching
  • near link: within the same block, a result of
    branching
  • direct link: across layers, where the root nodes
    are the same
  • Each query accesses 1 block in each layer

37
Index Fabric - Balanced Trie
  • increases speed by using traversals in the upper
    layers to skip nodes of the original trie
  • the number of pages accessed is bounded

38
Index Fabric - Raw Path
  • each data node is associated with a key
  • key = path (encoded in designators) + value
  • designators are special characters, each
    representing a name
  • APE queries are translated into key prefixes and
    submitted to the index trie

39
Index Fabric - Raw Path
  • Example
  • <invoice><buyer><name>HKU</name></buyer></invoice>
    is translated to IBNHKU (I, B and N are the
    designators, shown bold and underlined on the
    slide)
  • the query /invoice/buyer/name with value HKU is
    translated to the query string IBNHKU
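
The key encoding and the prefix lookup can be sketched as below. This is an illustration under assumptions: the designator table covers only the three names from the example, a sorted list searched with `bisect` stands in for the index trie, and `encode` and `prefix_lookup` are hypothetical names. The slides describe designators as special characters; plain letters are used here only to match the example, so they could collide with value text.

```python
from bisect import bisect_left

# Assumed designator table, matching the slide's example.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N"}


def encode(path: str, value: str = "") -> str:
    """Turn an absolute path plus an optional value into a key string."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value


def prefix_lookup(sorted_keys, prefix):
    """Return all keys starting with prefix (sorted list stands in for the trie)."""
    i = bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out


keys = sorted([
    encode("/invoice/buyer/name", "HKU"),
    encode("/invoice/buyer/name", "CUHK"),
])
```

With these keys, `prefix_lookup(keys, encode("/invoice/buyer/name"))` returns both entries, and adding the value HKU to the prefix narrows the result to the single key IBNHKU.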

40
Index Fabric - Refined Path
  • Special designators can be assigned to special
    queries (which can be regular path expressions)
  • e.g., we define P as the path //buyer/name, and
    PHKU means there is a buyer/name with value HKU in
    the document
  • can answer any predefined RPE very quickly

41
Comparison
  • XISS
  • can solve general RPEs
  • solves an APE by dividing it into steps
  • Index Fabric
  • solves RPEs by compile-time expansion of the RPE
    or by using a predefined Refined Path Index
  • solves an APE with a single index lookup