XML Indexing Techniques

Author: GARDARIN. Last modified by: GARDARIN. Created: 12/25/2003 4:19:30 PM. Presentation format: on-screen show.


1
XML Indexing Techniques
  1. Requirements
  2. Dataguide and Variation
  3. Adaptive Path Index (APEX)
  4. Index Fabric
  5. Node Numbering Scheme
  6. Compact Structural Summary
  7. Conclusion

2
Requirements
  • XML queries navigate data using regular path
    expressions (e.g., XPath), for example
  • /Livre//Auteur[@specialite="informatique"]
  • Accessing all elements with same name string.
  • Ancestor-descendant relationship between
    elements.
  • Content based access on values included in text.

3
Index Types
  • Structural index
  • Accessing all elements of given name
  • Ancestor-descendant and parent-child relationship
    between elements
  • Content index
  • Accessing elements containing given keywords
  • Supporting most text search functionalities

4
Classical Content Index
  • Classically based on inverted lists
  • For each term, gives the document ID and location
  • Several variations allow different search types
  • Offset, Relative, Proximity
  • Generally stored in a B-tree to optimize search
    for a given word
  • Size is an important issue
  • Memory and disk
  • (word, location)
  • Fixed-length entry (word repeated)
  • (word, frequency, (location))
  • Variable-length entry
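
The variable-length entry form (word, frequency, (location)) can be sketched as follows. This is a minimal illustration, not the layout of any particular system: the function names `build_inverted_index` and `postings` and the two sample documents are assumptions, and locations are simple word offsets.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [word offsets]} (one variable-length entry per word)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(offset)
    return index

def postings(index, word):
    """Return (frequency, locations), each location being (doc_id, offset)."""
    entry = index.get(word.lower(), {})
    locations = [(d, o) for d in sorted(entry) for o in entry[d]]
    return len(locations), locations

# Hypothetical sample documents.
docs = {1: "xml index for xml data", 2: "index size matters"}
index = build_inverted_index(docs)
print(postings(index, "xml"))    # (2, [(1, 0), (1, 3)])
print(postings(index, "index"))  # (2, [(1, 1), (2, 0)])
```

A production index would typically keep the words in a B-tree and compress the location lists; the dictionary here only shows the shape of an entry.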

5
Problem with XML
  • Support of element addressing
  • Doc.ID should be extended with NodeId (XPath) and offset
  • Index size becomes very large
  • XPath expressions are long
  • Support of typed data
  • Integer, float, simple types of XML schema
  • Requires classical indexes for certain elements
  • Query processing
  • Structural joins
  • Text search
  • Exact search
  • Support of updates
  • Incremental updates would be a plus

6
Evaluation Criteria
  • Identifiers
  • Per node or per document
  • Descendant/Ancestor Search
  • By join algorithm
  • By graph traversal
  • By OID comparison
  • Keyword Search
  • By element scan
  • By B-tree traversal
  • Update
  • Incremental
  • Index size
  • Entry number
  • Entry size

7
2-Dataguide and Variation
  • Goldman & Widom, VLDB 97
  • Dynamic schemas
  • Help in query formulation
  • Concise and accurate structural summaries
  • Every path in the database has one and only one
    corresponding path in the DataGuide with the same
    sequence of labels
  • A legal label path
  • Restaurant/Name
  • Target set
  • for e = Restaurant/Entree, Ts(e) = {6, 10, 11}
  • DocId can be added to identifiers

8
Dataguide Principle
  • To achieve conciseness
  • a DataGuide describes every unique label path of
    a source exactly once.
  • To ensure accuracy
  • a DataGuide encodes no label path that does not
    appear in the source.
  • And for convenience
  • a DataGuide should itself be an object (OEM or XML).
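
The conciseness and accuracy properties can be illustrated on tree-shaped data, where building the summary amounts to collecting each distinct label path once, together with its target set. This is a sketch under assumptions: the node table, the node identifiers and the names `build_dataguide` and `target_set` are hypothetical, and graph-shaped sources need a determinization step that is not shown.

```python
from collections import defaultdict

# Hypothetical source tree: node id -> (label, list of child ids).
nodes = {
    1: ("Restaurant", [2, 3, 4]),
    2: ("Name", []),
    3: ("Entree", []),
    4: ("Entree", []),
    5: ("Restaurant", [6, 7]),
    6: ("Name", []),
    7: ("Phone", []),
}
roots = [1, 5]

def build_dataguide(nodes, roots):
    """Return {label path: target set}; every label path of the source appears exactly once."""
    guide = defaultdict(set)
    stack = [(r, (nodes[r][0],)) for r in roots]
    while stack:
        nid, path = stack.pop()
        guide[path].add(nid)
        for child in nodes[nid][1]:
            stack.append((child, path + (nodes[child][0],)))
    return dict(guide)

def target_set(guide, label_path):
    """Target set of a label path written as 'A/B/C' (empty if the path is not legal)."""
    return guide.get(tuple(label_path.split("/")), set())

guide = build_dataguide(nodes, roots)
print(target_set(guide, "Restaurant/Entree"))  # {3, 4}
print(len(guide))                              # 4 distinct label paths
```

Seven source nodes are summarized by four label paths, and a path that does not occur in the source, such as Restaurant/Price, has no entry.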

9
Dataguide Evaluation
  • Identifier
  • One per node
  • Descendant/Ancestor Search
  • By graph traversal
  • Keyword Search
  • By element scan
  • Update
  • Insertion is incremental
  • Deletion is complex
  • Index size
  • Entry number: linear for a tree, can be
    exponential in the number of DB nodes for a graph
  • Entry size: number of elements for a path

10
T-Index
  • Milo & Suciu, LNCS 1997
  • T-index stands for Template-index
  • A path template t has the form
  • T1 x1 T2 x2 ... Tn xn
  • where each Ti is either a regular path expression
    or one of the following two placeholders: P (any
    path) and F (any formula)
  • Example: //restaurant/ x P y /Address/City z F u
  • A query path q is obtained from t by
    instantiating
  • P by any path and F by any formula

11
Principle
  • T-index indexes all sequences of objects
    connected by a sequence of path expressions
    defined by a template.
  • Particular cases
  • 1-index: indexes the template "any path P"
  • Indexes all objects reachable through an
    arbitrary path expression P from a root
  • Two nodes are equivalent (same entry) if the set
    of paths into them from the root is the same
  • 1-index is a nondeterministic version of the
    strong DataGuide
  • 2-index: indexes the template P x P
  • all pairs of objects connected by an arbitrary
    path expression P

12
Building a T-index
  • Group objects into equivalence classes containing
    objects that are indistinguishable w.r.t. a
    class of paths defined by a path template
  • Finer equivalence classes are more efficient to
    construct using bisimulation
  • Construct a nondeterministic automaton
  • states represent the equivalence classes
  • transitions correspond to edges between objects
    in those classes.
  • T-index can be used to answer queries of more
    general forms than the template
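
The grouping step can be sketched for the 1-index case with a naive backward-bisimulation refinement: nodes start grouped by label and are split until two nodes in the same class also have parents in the same classes. This is an illustration under assumptions, not the construction from the paper: labels are placed on nodes rather than edges, the sample graph is hypothetical, and `one_index_classes` is an invented name.

```python
def one_index_classes(labels, parents):
    """Partition nodes by naive backward-bisimulation refinement.

    labels:  node -> label
    parents: node -> list of parent nodes (missing key means no parent)
    Returns node -> class id.
    """
    block = dict(labels)  # initial partition: one class per label
    while True:
        # A node's signature is its current class plus the classes of its parents.
        signature = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, [])))
            for n in labels
        }
        ids = {}
        new_block = {n: ids.setdefault(signature[n], len(ids)) for n in labels}
        # Each round only splits classes, so an unchanged count means a fixpoint.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# Hypothetical data: a db root, two restaurants with a name each, and a name directly under db.
labels = {1: "db", 2: "restaurant", 3: "restaurant", 4: "name", 5: "name", 6: "name"}
parents = {2: [1], 3: [1], 4: [2], 5: [3], 6: [1]}
classes = one_index_classes(labels, parents)
print(classes[4] == classes[5], classes[4] == classes[6])  # True False
```

The classes become the states of the index automaton, and an edge between two objects yields a transition between their classes.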

13
3-Adaptive Path Index (APEX)
  • Adaptive Path Index for XML, Chung et al.,
    SIGMOD 2002
  • Summarizes paths that appear frequently in the
    query workload
  • Maintains all paths of length 1
  • Efficient for partial-match paths
  • Incremental update of the index

14
APEX details
  • Each node has an identifier (nid)
  • Required paths for indexing (labels + some
    composed paths)
  • APEX = graph (structural summary) + hash tree
    (incoming required paths to nodes of the graph)
  • The hash tree is used to find the graph nodes for a
    given label path, and also for incremental update
  • Frequently used paths are determined from the
    query workload using sequential pattern mining

15
APEX Example
[Figure: XML data structure, with the corresponding APEX hash tree and graph]
16
APEX Evaluation
  • Identifiers
  • One per node
  • Descendant/Ancestor Search
  • Hash tree access if required or graph traversal
    or join
  • Keyword Search
  • Not supported
  • Update
  • Insertion is incremental
  • Index size (two structures)
  • Entry number: linear in the number of nodes
  • Entry size: number of elements for a path

17
4-Index Fabric
  • Cooper et al., "A Fast Index for Semistructured
    Data", VLDB 2001
  • Extension of dataguide for text search
  • Keeps all label paths starting from the root
  • Encode each label path with data value as a
    string
  • Use efficient index for strings to store it
    (Patricia trie)
  • Perform queries on keywords for elements as
    string search
  • Does not keep information on non-terminal nodes

18
Patricia Trie
  • Trie: Key → Value
  • A Patricia trie is a simple form of compressed
    trie which merges single-child nodes with their
    parents
  • More efficient for long keys (the non-shared
    suffix of a key is kept in one node)

Trie: a tree for storing strings in which there
is one node for every common prefix. The strings
are stored in extra leaf nodes.
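
The merging of single-child nodes can be sketched with a character-level compressed trie (a radix tree), used here as a simplified stand-in for a bit-level Patricia trie. The class `Node`, the functions `insert` and `lookup` and the sample keys are assumptions made for the illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # edge label (a string) -> child Node
        self.value = None    # payload if a key ends at this node

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def insert(root, key, value):
    node = root
    while True:
        for label, child in list(node.children.items()):
            n = common_prefix_len(label, key)
            if n == 0:
                continue
            if n < len(label):
                # Split the edge: keep the shared part, push the rest one level down.
                mid = Node()
                mid.children[label[n:]] = child
                del node.children[label]
                node.children[label[:n]] = mid
                child = mid
            node, key = child, key[n:]
            break
        else:
            # No edge shares a prefix with what is left of the key.
            if key:
                leaf = Node()
                leaf.value = value
                node.children[key] = leaf
            else:
                node.value = value
            return

def lookup(root, key):
    node = root
    while key:
        for label, child in node.children.items():
            if key.startswith(label):
                node, key = child, key[len(label):]
                break
        else:
            return None
    return node.value

root = Node()
for i, k in enumerate(["romane", "romanus", "romulus"], start=1):
    insert(root, k, i)
print(lookup(root, "romanus"))  # 2
print(list(root.children))      # ['rom']
```

The three keys share a single edge labelled "rom", then split into "an" and "ulus"; the non-shared suffixes "e", "us" and "ulus" each sit on one edge instead of one node per character.
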
19
Example
  • Doc 1:
    <invoice>
      <buyer>
        <name>ABC Corp</name>
        <address>1 Industrial Way</address>
      </buyer>
      <seller>
        <name>Acme Inc</name>
        <address>2 Acme Rd.</address>
      </seller>
      <item count="3">saw</item>
      <item count="2">drill</item>
    </invoice>
  • Doc 2:
    <invoice>
      <buyer>
        <name>Oracle Inc</name>
        <phone>555-1212</phone>
      </buyer>
      <seller>
        <name>IBM Corp</name>
      </seller>
      <item>
        <count>4</count>
        <name>nail</name>
      </item>
    </invoice>

20
Patricia Trie
21
Search on Paths
  • Example of queries
  • /invoice/buyer/name/ABC Corp
  • /invoice/buyer//ABC Corp
  • A key lookup operator searches for the path key
    corresponding to the path expression.
  • If the path can expand to an unbounded number of
    tag sequences (e.g., with //)
  • start by using a prefix key lookup operator,
  • then navigate through children to check the rest
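
The two lookup operators can be sketched by encoding each root-to-leaf label path with its value as one string key. This is a simplified illustration, not the Index Fabric implementation: a sorted list searched with `bisect` stands in for the Patricia trie, attributes are ignored, the two invoices are shortened versions of the example documents, and `path_keys` and `prefix_lookup` are invented names.

```python
from bisect import bisect_left
import xml.etree.ElementTree as ET

def path_keys(xml_text, doc_id):
    """Yield (key, doc_id) for every root-to-leaf label path followed by its text value."""
    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        children = list(elem)
        if not children and elem.text and elem.text.strip():
            yield (path + "/" + elem.text.strip(), doc_id)
        for child in children:
            yield from walk(child, path)
    yield from walk(ET.fromstring(xml_text), "")

def prefix_lookup(keys, prefix):
    """Return the entries of the sorted key list whose key starts with prefix."""
    i = bisect_left(keys, (prefix,))
    out = []
    while i < len(keys) and keys[i][0].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

doc1 = ("<invoice><buyer><name>ABC Corp</name><address>1 Industrial Way</address></buyer>"
        "<seller><name>Acme Inc</name></seller></invoice>")
doc2 = ("<invoice><buyer><name>Oracle Inc</name><phone>555-1212</phone></buyer>"
        "<seller><name>IBM Corp</name></seller></invoice>")
keys = sorted(list(path_keys(doc1, 1)) + list(path_keys(doc2, 2)))

# /invoice/buyer/name/ABC Corp: the whole path is known, so one key lookup is enough.
exact = prefix_lookup(keys, "/invoice/buyer/name/ABC Corp")

# /invoice/buyer//ABC Corp: prefix lookup first, then check the rest of each key.
descendant = [e for e in prefix_lookup(keys, "/invoice/buyer/")
              if e[0].endswith("/ABC Corp")]
print(exact, descendant)
```

Both queries return the single entry for document 1. Only document identifiers are returned, since the keys carry no information on non-terminal nodes.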

22
Fabric Evaluation
  • Identifiers
  • One per document
  • Descendant/Ancestor Search
  • By string search; the order of elements is not kept
  • Keyword Search
  • By Patricia trie leaves if values are expanded,
    by a value index otherwise
  • Update
  • Insertion is incremental
  • Deletion is complex
  • Index size (index stored with document)
  • Entry number: linear for a tree
  • Entry size: number of elements for a path

23
5-Node Numbering Scheme
  • Used for indexing elements
  • Node Identifier (NID) → element
  • The NID aims at replacing structural joins by
    simple function computation
  • Check parent and ancestor relationships
  • is_parent(NID1, NID2), is_ancestor(NID1, NID2)
  • Determine parent and children
  • get_parent(NID1), get_children(NID1)

24
Virtual nodes (1)
  • Lee & Yoo, Digital Libraries 99
  • Document structure mapped onto a k-ary tree
  • Node identifiers assigned according to the
    level-order tree traversal
  • parent(i) = floor((i-2)/k) + 1
  • child(i,j) = k(i-1) + j + 1
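
The numbering can be checked with a few lines of code. This sketch assumes the root has identifier 1, that j ranges over 1..k, and that missing children are virtual nodes that still consume identifiers; the function names are illustrative.

```python
def parent(i, k):
    """Identifier of the parent of node i in a complete k-ary tree (root = 1)."""
    if i <= 1:
        raise ValueError("the root has no parent")
    return (i - 2) // k + 1

def child(i, j, k):
    """Identifier of the j-th child (1 <= j <= k) of node i."""
    return k * (i - 1) + j + 1

def is_ancestor(a, d, k):
    """True if a is a proper ancestor of d, computed by walking up from d."""
    if d <= a:
        return False
    while d > a:
        d = parent(d, k)
    return d == a

k = 3
print([child(1, j, k) for j in (1, 2, 3)])  # [2, 3, 4]
print([child(2, j, k) for j in (1, 2, 3)])  # [5, 6, 7]
print(parent(7, k))                          # 2
print(is_ancestor(1, 7, k), is_ancestor(3, 7, k))  # True False
```

No index access is needed for these tests, which is the point of the scheme: structural relationships reduce to integer arithmetic on identifiers.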

25
Virtual nodes (2)
  • NID can be used to address elements in index of
    elements
  • Only certain nodes (e.g., leaves) have to be
    indexed as parent nodes can be determined by
    computation
  • Problems
  • Arity of the tree may be variable and large
  • Determining whether a computed parent/child really exists
  • Updates when the arity increases?

26
XML trees node pre/post numbering
  • Dietz 82
  • Identification of nodes
  • Identifier = (preorder rank, postorder rank)
  • X ancestor of Y <=>
  • pre(X) < pre(Y) and
  • post(X) > post(Y)
  • Example
  • 1 < 5 and 7 > 3 => (1,7) is an ancestor of (5,3)

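[Figure: tree whose nodes carry (pre, post) labels: root (1,7) with children (2,4) and (6,6); (2,4) has leaves (3,1), (4,2), (5,3); (6,6) has leaf (7,5)]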
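
The ancestor test can be sketched as follows. The tree shape is an assumption: it is the shape implied by the seven (pre, post) labels of the example, and the node names a to g and the function names are invented for the illustration.

```python
def number_tree(tree, root):
    """Assign (preorder rank, postorder rank) to every node of a tree."""
    pre, post = {}, {}
    counters = [0, 0]

    def visit(n):
        counters[0] += 1
        pre[n] = counters[0]
        for c in tree.get(n, []):
            visit(c)
        counters[1] += 1
        post[n] = counters[1]

    visit(root)
    return {n: (pre[n], post[n]) for n in pre}

def is_ancestor(x, y):
    """x, y are (pre, post) pairs; x is an ancestor of y iff it starts before and ends after."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical node names; children are listed in document order.
tree = {"a": ["b", "f"], "b": ["c", "d", "e"], "f": ["g"]}
ids = number_tree(tree, "a")
print(ids["a"], ids["e"])                 # (1, 7) (5, 3)
print(is_ancestor(ids["a"], ids["e"]))    # True
print(is_ancestor(ids["b"], ids["g"]))    # False
```

The test uses two integer comparisons per pair of nodes and no tree traversal, but inserting a node changes the ranks of the nodes that follow it.
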
27
Interval encoding
  • Li & Moon, VLDB 2001
  • Identify each node by a pair of numbers <order,
    size> as follows
  • For a tree node y of parent x
  • order(x) < order(y)
  • order(y) + size(y) < order(x) + size(x)
  • For two sibling nodes x and y, if x is the
    predecessor of y in preorder traversal then
  • order(x) + size(x) < order(y)

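[Figure: tree labeled with <order, size> pairs: root (1,100) with children (10,30) and (41,10); (10,30) has children (11,5), (17,5), (25,5); (41,10) has child (45,5)]
Size keeps space for updates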
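
The two conditions translate directly into interval tests. This sketch uses strict inequalities and the <order, size> pairs of the example; the function names and the inserted node (23, 1) are assumptions.

```python
def is_ancestor(x, y):
    """x, y are (order, size) pairs; the interval of y must lie inside the interval of x."""
    (ox, sx), (oy, sy) = x, y
    return ox < oy and oy + sy < ox + sx

def siblings_in_order(x, y):
    """Sibling condition: x ends before y begins."""
    return x[0] + x[1] < y[0]

root, left, right = (1, 100), (10, 30), (41, 10)
print(is_ancestor(root, (45, 5)))   # True
print(is_ancestor(right, (45, 5)))  # True
print(is_ancestor(left, (45, 5)))   # False

# The gap between (17,5) and (25,5) leaves room for a new sibling without renumbering.
new = (23, 1)
print(siblings_in_order((17, 5), new), siblings_in_order(new, (25, 5)), is_ancestor(left, new))
```

The inserted node satisfies both conditions without touching any existing identifier, which is what the reserved size is for.
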
28
Relative Region Coordinates (1)
  • Kha & Yoshikawa, IEEE Data Eng. 2001
  • The RRC of a node n of an XML tree is a pair
    (sp-sn, sp-en) of addresses in the region of its
    parent, i.e., relative to the parent's start

[Figure: a child region nested in its parent region, with start s and end e]
29
Relative Region Coordinates (2)
  • Absolute region coordinate (ARC)
  • Relative to the beginning of the root (from the Nth to the Mth byte)
  • Allows extracting the XML data
  • Can be derived from the RRCs of the parents and self
  • Begin = Σ(parents ∪ self) s + (k-1)
  • End = Σ(parents) s + e(self) + (k-1)
  • Advantages
  • Updates are kept local to a region
  • To access parent-child efficiently
  • A B-tree like structure is maintained (à la
    Natix).

30
Xyleme
  • Generate a form of dataguide per cluster
  • Generalized DTD
  • Manage a label and value index (full index)
  • Keep document ID and element ID
  • Two forms of element ID
  • Bit-structured scheme: structure + position
  • Prefix-postfix scheme: left-deep traversal
  • Stores XML DOM trees in pages
  • NATIX (Mannheim Univ.) technology

31
Xyleme
32
6-Compact Structural Summary
  • Bremer & Gertz, Tech Report 2003
  • Compact addressing of words in XML doc.
  • Encode XPath as reference to a path in a document
    guide (path set, DTD or schema)

33
Managing a Compact Index
  • Naïve XML Indexing
  • (Word, docId, (XPath))
  • Example
  • book/chapter[2]/resume/section[3]
  • article/author/name
  • Difficulties
  • Index size!
  • Processing time!
  • Intersection of lists
  • Problem
  • How to record the location of a word inside an
    element?
  • Solution (Bremer & Gertz 02)
  • Encode the XPath as a reference to a path in a
    document guide (path sequence or schema)

34
XPath Encoding
  • XPath encoded as a path ID (PID) of structure
    (N, (p1, p2, ...))
  • N being a node identifier in the guide
  • (p1, p2, ...) being the positions of repetitive
    ancestors from the root to N

PID (V, (1, 3)) = /db/article[1]/text/sect[3]
35
PID Ordering and Encoding
  • PID order
  • (IV, (1)) < (V, (1, 2)) < (V, (1, 3))
  • Pre-order relationship
  • X parent of Y
  • => PID(X) < PID(Y)
  • Compact PID encoding
  • Path number
  • Integer (short)
  • Repetitive node
  • log2(n) bits
  • Compact PID encoding of (V, (1, 3)) =
    /db/article[1]/text/sect[3]

2 children: 1 bit
1 child: 0 bit
3 children: 2 bits
Total: 3 bits
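
The bit count can be reproduced with a short computation. This is a sketch under assumptions: each position along the path is stored as a zero-based number in just enough bits for the largest sibling count at that level, the fan-outs (2, 1, 3) are those of the example, and the bit layout and function names are illustrative rather than the encoding of the cited report.

```python
def bits_needed(n):
    """Smallest number of bits that distinguishes n siblings (0 bits when n == 1)."""
    return (n - 1).bit_length()

def pid_bits(fanouts):
    """Total number of bits for the position part of a PID."""
    return sum(bits_needed(n) for n in fanouts)

def encode_positions(positions, fanouts):
    """Concatenate the 1-based positions as fixed-width, zero-based binary fields."""
    bits = ""
    for p, n in zip(positions, fanouts):
        width = bits_needed(n)
        if width:
            bits += format(p - 1, "0{}b".format(width))
    return bits

fanouts = [2, 1, 3]      # up to 2 articles, 1 text, 3 sections
positions = [1, 1, 3]    # article[1] / text / sect[3]
print(pid_bits(fanouts))                     # 3
print(encode_positions(positions, fanouts))  # 010
```

The three bits are added to a short integer for the path number, compared with a full XPath string per index entry in the naive scheme.
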
36
Index Implementation
<livre> <titre>Les Misérables, Tome 1
Fantine</titre> <auteur>Victor
Hugo</auteur> <histoire> 1815. Alors que tous
les aubergistes de la ville l'ont chassé, le
bagnard Jean Valjean est hébergé par Mgr Myriel (
que les pauvres ont baptisé, d'après l'un de ses
prénoms, Mgr Bienvenu). L'évêque de la ville de
Digne, l'accueille avec bienveillance, le fait
manger à sa table et lui offre un bon
lit. ... </histoire> </livre>
  • Entry
  • Word (stem) → Address
  • Address is
  • PID + offset in element
  • Example
  • City → (V, (1, 3)), offsets (9, 36)

Word → (PID, offsets)
Valjean → (PID, 15)
Ville → (PID, 9, 36)

37
XQuery Text Evaluator
  • Normalize the query through thesaurus
  • Translation
  • Synonyms
  • Conceptualization
  • Access to the text index
  • Intersection, union, difference of PIDs
  • Access to the relevant elements from PIDs
  • Verification of relevance
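
The first two steps can be sketched with set operations on PIDs. Everything here is hypothetical: the thesaurus, the index content and the function names `normalize` and `search` are assumptions, and a PID is modelled as a (guide node, positions) tuple.

```python
def normalize(word, thesaurus):
    """Lower-case the word and map translations or synonyms to the indexed term."""
    word = word.lower()
    return thesaurus.get(word, word)

def search(index, thesaurus, op, left, right):
    """Combine the PID sets of two words with 'and', 'or' or 'not' (difference)."""
    a = index.get(normalize(left, thesaurus), set())
    b = index.get(normalize(right, thesaurus), set())
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "not":
        return a - b
    raise ValueError("unknown operator: " + op)

# Hypothetical thesaurus and text index (word -> set of PIDs).
thesaurus = {"city": "ville", "town": "ville"}
index = {
    "ville": {("V", (1, 3)), ("V", (1, 2))},
    "valjean": {("V", (1, 3))},
}
print(search(index, thesaurus, "and", "Valjean", "city"))  # {('V', (1, 3))}
print(search(index, thesaurus, "not", "town", "Valjean"))  # {('V', (1, 2))}
```

The resulting PIDs identify the candidate elements, which still have to be fetched and checked for relevance.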

38
7-Conclusion
  • Various indexing techniques for XML
  • Main dimensions of variation
  • Structural summary
  • Dataguide, Schema guide, Generalized DTD
  • Identification of nodes (NID)
  • Should keep parent-child relationship
  • Should be stable to updates
  • Index of keywords
  • Should be compact
  • Should give NID and offset of instances

39
Classification
[Figure: classification tree of XML indexing methods]
XML Indexing Methods
  • Numbering Scheme: Hierarchy, Pre/Post Order, Interval Encoding, RRC
  • Graph Traversal: Dataguide, T-Index, APEX
  • Text Search: Fabric
40
Index for XQuery Text
  • Facilitate the retrieval of
  • Non-stop words
  • Suffixes, prefixes
  • Location of words in elements
  • Relevant nodes for a search
  • Entries should focus on elements
  • Word → (docId, NID)

41
Treeguide patterns
[Figure: two tree patterns, (a) and (b), built from the nodes Book, Author, Category, @speciality, Company, Address and City]