L09: Introduction to XML Data Management - PowerPoint PPT Presentation

1 / 89
About This Presentation
Title:

L09: Introduction to XML Data Management

Description:

L09: Introduction to XML Data Management XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems – PowerPoint PPT presentation

Number of Views:153
Avg rating:3.0/5.0
Slides: 90
Provided by: Dom197
Category:

less

Transcript and Presenter's Notes

Title: L09: Introduction to XML Data Management


1
L09 Introduction to XML Data Management
  • XML and XML Query Languages
  • Structural Summary and Coding Scheme
  • Managing XML Data in Relational Systems

2
XML and XML Query Languages
  • XML and XML Query Languages
  • Structural Summary and Coding Scheme
  • Managing XML Data in Relational Systems

3
XML
  • Extensible Markup Language for data
  • A W3C standard to complement HTML
  • http//www.w3.org/TR/2000/REC-xml-20001006
    (version 2, 10/2000)
  • Standard for publishing and interchange
  • Origins structured text SGML
  • Cleaner SGML for the Internet
  • Motivation
  • HTML describes presentation
  • XML describes content

4
XML Describing the Content
  • ltprojectgt
  • lttalk gt
  • lttitlegt XML Query Processing Optimization
    lt/titlegt
  • ltdategt March 18, 2004 lt/dategt
  • ltinstructorgt Instructor
  • ltnamegt Lu Hongjun lt/namegt
  • ltaffiliationgt HKUST lt/
    affiliation gt
  • ltemailgt luhj_at_cs.ust.hk lt/emailgt
  • ltnamegt Jeffrey X. Yu lt/namegt
  • ltaffiliationgt CUHK lt/
    affiliation gt
  • ltemailgt yu_at_se.cuhk.edu.hk
    lt/emailgt
  • lt/ instructor gt
  • lt/talkgt
  • lt/projectgt

5
XML Document/Data
  • Hierarchical document format for information
    exchange in WWW
  • Self describing data (tags)
  • Nested element structures having a root
  • Element data can have
  • Attributes
  • Sub-elements

6
Basic XML Structures
  • Elements lttitlegt lt/titlegt,ltnamegt lt/namegt
  • Open close tags or empty tag
  • Ordered, nestable
  • an element can be empty
  • Attributes
  • PCDATA/CDATA
  • An XML document single root element
  • well formed XML document if it has matching tags

7
Basic XML Structures Attributes
  • Single-valued, ordered
  • ltproject proj_id P1234 budget 1000000gt
  • lttitlegt XML Data Management lt/titlegt
  • ltyeargt 2003-2004 lt/yeargt
  • lt/projectgt
  • Special types ID, IDREF, IDREFS
  • ltmember idm007gt ltnamegt James lt/namegt
    lt/membergt
  • ltproject idp123gt lttitlegt XML Data Management
    lt/titlegt
  • ltmember
    idrefm007 m008/gt
  • lt/projectgt

8
Other XML Structures
  • Processing instructions instructions for
    applications
  • lt?xml version1.0?gt
  • CDATA sections treat content as char data
  • lt!CDATAlttaggtWhatever!!!lt/taggtltwhatevergtgt
  • Comments just like HTML
  • lt!-- Comments --gt
  • Entities external resources and macros
  • my-entity (non-parameter entity)
  • param-entity (parameter entity for DTD
    declarations)

9
Data Centric vs. Document centric
  • ltprojectgt
  • ltpnamegt XML lt/pnamegt
  • ltmember ID3, age 50 gt
  • ltnamegtH. Lu lt/namegt
  • ltemailgt luhj_at_cs.ust.hk lt/emailgt
  • ltpublication author H. Lugt
  • lttitlegt Managing XML data using RDBMS lt/titlegt
  • ltyeargt 2001 lt/yeargt
  • lt/publicationgt
  • lt/membergt
  • ltmember ID24, age 35 gt
  • ltnamegt J.X. Yu lt/namegt
  • ltprojectgt
  • ltpnamegt Data mining lt/pnamegt
  • lt/projectgt
  • lt/membergt
  • lt/projectgt

ltbiogt ltpgt Dr Lu is a professor at ltbgt
HKUST. lt/bgt He worked at ltbgt NUSgt lt/bgt
before 1998. lt/pgt lt/biogt
10
XML Data Model
  • Several competing models
  • Document Object Model (DOM)
  • a platform- and language-neutral interface that
    will allow programs and scripts to dynamically
    access and update the content, structure and
    style of documents
  • http//www.w3.org/DOM/

11
DOM Core Interface Node
  • DOM tree a tree-like structure of Node objects
    the root of the tree is a document object.
  • Node Object (nodeName, nodeValue, nodeType,
    parentNode, childnodes, firstChild, lastChild,
    previousSibling, nextSibling, attributes,
    ownerDocument)
  • nodeType ELEMENT_NODE, ATTRIBUTE_NODE,
    TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE,
    PROCESSING_INSTRUCTION_NODE, COMMENT_NODE,
    DOCUMENT_NODE, DOCUMENT_TYPE_NODE,
    DOCUMENT_FRAGMENT_NODE, NOTATION_NODE

12
DOM Interface
  • Each node of the document tree may have a number
    of child nodes, contained in a NodeList object.
  • Two ways of accessing a node object
  • Based on the location of an object in the
    document tree
  • Based on the name of an object

13
A Sample DOM Tree
Project Node NodeTypeELEMENT_NODE tagName
project NodeValue nill
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
name Node NodeTypeTEXT_NODE tagName name
NodeValue H. Lu
publicatiom Node NodeTypeELEMENT_NODE tagName
publication NodeValue nill
14
Data Graph
  • Similar to DOM tree, but may have different
    notations that represent an XML document

1
project
2
3
24
pname
member
member
28
26
65
27
66
author
XML
publication
project
name
email
name
50
69
70
294
71
50
J.X. Yu
luhj_at_cs.ust.hk
H. Lu
title
year
pname
Managing
2001
Data mining
15
Document Type Definition
  • Inherited from SGML DTD standard
  • BNF grammar establishing constraints on element
    structure and content
  • Specification of attributes and their types
  • Definitions of entities

16
A Sample DTD
project
1


publication
member
pname

3
4
2
author

?
5
6
7
9
8
  • lt?xml version"1.0" standalone"yes"?gt
  • lt!DOCTYPE Research gt
  • lt!ELEMENT project(pname,member,publication)gt
  • lt!ELEMENT pname(PCDATA)gt
  • lt!ELEMENT member (name,email?, publication,
    project)gt
  • lt!ATTLIST member ID ID REQUIREDgt
  • lt!ELEMENT name(PCDATA)gt
  • lt!ELEMENT email(PCDATA)gt
  • lt!ELEMENT publication(title,year)gt
  • lt!ATTLIST publication author IDREF IMPLIED)gt
  • lt!ELEMENT title(PCDATA)gt
  • lt!ELEMENT year(PCDATA)gt

ID
name
email
title
year
17
XML Query Languages
  • There have been a large number of proposals
    during the past few years
  • XPath Clark, DeRose, W3C 1999
  • XQuery Boag, Chamberlin et al, W3C 2003
  • XML-QLDeutsch, Fernandez et al, QL99
  • XQL Robie, Lapp, QL99
  • XML_GL Ceri, Comai et al, WWW99
  • Quilt Chamberlin, Robie et al, 2000
  • From W3C
  • XQuery 1.0 (W3C Working Draft, 12 November 2003)
  • http//www.w3.org/TR/xquery/
  • XPath 2.0 (W3C Working Draft 12 November 2003)
  • http//www.w3.org/TR/xpath20/

18
XPath XML Path Language
  • The purpose
  • To address the node of an XML tree using a path
    notation for navigating through the hierarchical
    structure of an XML document.
  • Uses a compact, non-XML syntax
  • Designed to be embedded in a host language (e.g.,
    XSLT, XQuery)
  • XPath Expressions
  • String of characters
  • Value of an expression is always an ordered
    collection of zero or more items (atomic value,
    node)

19
XPath Steps
  • An XPath expression has following syntax
  • Path/Step1/Step2//Stepn,
  • where each Xpath step is defined as follows
  • StepAxisNode-test Predicate
  • Axis specifies the direction in which the
    document should be navigated.
  • For example, childtitleposition() 2
  • There are 12 axes child, descendant,
    descendant-or-self, parent, ancestor,
    ancestor-or-self, following, preceding,
    following-sibling, preceding-sibling, attribute,
    self, namespace

20
XPath Path Expressions
  • project matches a project element
  • matches any element
  • / matches the root element
  • /project matches a project element under root
  • project/member matches a member in project
  • project//name matches a name in project, at any
    depth
  • //title matches a title at any depth
  • memberpublciation matches a member or a
    publication
  • _at_age matches an age attribute
  • project/member/_at_age matches age attribute in
    member, in project
  • project/member/_at_agelt45 matches a member with
    age lt 45

21
XPath Query Examples
/project/member/name matches a name of member in
project
Result ltnamegt H. Lu lt/namegt ltnamegt J.X. Yu
lt/namegt
/project/publication/venue
Result empty there was no venue element
//pname matches a pname at any depth
Result ltpnamegt XML lt/pnamegt ltpnamegt Data
mining lt/pnamegt
  • /project/member/name/text() text of name elements

Result H. Lu J.X. Yu
22
More XPath Queries
  • /project/memberpublication
  • ltmember ID3, age 50 gt
  • ltnamegtH. Lu lt/namegt
  • ltemailgt luhj_at_cs.ust.hk lt/emailgt
  • ltpublication author H. Lugt
  • lttitlegt Managing XML data using RDBMS lt/titlegt
  • ltyeargt 2001 lt/yeargt
  • lt/publicationgt
  • lt/membergt
  • /project/member_at_age lt 45
  • ltmember ID24, age 35 gt
  • ltnamegt J.X. Yu lt/namegt
  • ltprojectgt
  • ltpnamegt Data mining lt/pnamegt
  • lt/projectgt
  • lt/membergt
  • /project member/_at_age lt 25
  • No element returned
  • /project/memberemail/text()

23
XQuery
  • XQuery 1.0 An XML Query Language
  • W3C Working Draft 12 November 2003
  • http//www.w3.org/TR/xquery/
  • XPath expressions are still the basic building
    block

24
XQuery
  • XQuery 1.0 An XML Query Language
  • W3C Working Draft 12 November 2003
  • http//www.w3.org/TR/xquery/
  • FLWR Expressions FOR-LET-WHERE-RETURN

25
XQuery Examples
  • ltresultgt
  • FOR x in /project/member/publication
  • WHERE x/year gt 2000
  • RETURN ltrecentpubgt x/title
  • lt/ recentpub gt
  • lt/resultgt

distinct a function that eliminates duplicates
ltactive_membersgt FOR m IN distinct(document(pro
ject.xml")//member) LET p document(project.x
ml")//publicationauthor m WHERE count(p) gt
10 RETURN m lt/ active_members gt
count a (aggregate) function that returns the
number of elements
26
Structural Summary and Coding Scheme
  • XML and XML Query Languages
  • Structural Summary and Coding Scheme
  • Managing XMLData in Relational Systems

27
Structural Summary
  • A structural summary for a data graph GD(VD, ED )
    is another labeled graph GI (VI, EI ).
  • Each node vi ? GI represents a set of nodes,
    extent(vi ), and extent(vi ) ? VD.
  • An edge ed (vi , vi) ? GI exists if there is an
    edge ed (vd , vd) ? GD vd ? extent(vi ), vd ?
    extent(vi ).
  • The summary preserves all the paths in the data
    graph. A path expression query can be executed on
    GI instead of GD, which is most likely more
    efficient since size of GI is much smaller than
    GD.

28
Structural Summary
  • Basically, nodes in the data graph is grouped
    based on certain criteria, each group of nodes is
    represented by one node in the summary.
  • The size of summary will be determined by the
    grouping criteria.
  • Desired properties in supporting evaluating path
    expression queries using summary
  • The results are safe (no false negatives)
  • If not safe, only approximate answers can be
    obtained
  • The results are precise contains no false
    positives
  • If not precise, need validate results using the
    data graph

29
Structural Summary
R
r
A
a1,a2,a3
B
b1,b2,b3
C
c1,c2,c3
Data Graph
Structural summary
30
Sample Structural Summaries
  • Query workload independent summaries
  • Data Guide
  • 1-index Milo, Suciu, ICDT99
  • A(k) index Kaushik, Shenoy, ICDE02
  • Query workload dependent summaries
  • APEX Chung, Min et al, SIGMOD02
  • D(k)-index Chen, Lim et al, SIGMOD03

31
Data Guides
  • DataGuide dynamic structural summary of current
    database
  • Each label path in database appears once in
    DataGuide
  • No extraneous paths in DataGuide
  • Maintained incrementally as database evolves
  • Serves role of schema

C1 is duplicated to achieve determinism in
DataGuides
32
Bisimilarity and 1-Index
  • Most existing structural summary are based on
    graph bisimilarity, defined as follows
  • Two data nodes u and v are bisimilar (u ? v) if
  • u and v have the same label
  • if u is a parent of u, then there is a parent
    v of v such that u? v, and vice versa
  • Intuitively, the set of paths coming into them is
    the same if two nodes are bisimilar
  • Tova Milo and Dan Suciu. Index structures for
    path expressions. In ICDT99. 277-295, January
    1999.

33
1-Index
  • 1-index Each index node represents an
    equivalence class, in which data nodes are
    mutually bisimilar.
  • Evaluating path expression query using 1-index
  • safe the result always contains the result of
    evaluating on the data graph
  • precise its result contains no false data node

34
K-bisimilarity
  • 1-index can be big
  • Formally, based on the notion of k-bisimilarity
    (?k ) which is defined inductively
  • Node u ?k v iff u ?k-1 v, and for every parent
    u of u, there is a parent v of v such that u
    ?k-1 v, and vice versa
  • For any two nodes, u and v, u ?0 v iff u and v
    have the same label
  • Intuitively, if two data nodes are k-bisimilar,
    the set of paths coming into them with length (?
    k) is the same

35
A(k)-Index
  • A(k)-Index group nodes based on their local
    structure paths of length up to k, instead of
    the global path information
  • data nodes in each index nodes of A(k) index are
    mutually k-bisimilar
  • Evaluation path expression query using
    A(k)-index
  • safe its result always contains the result of
    evaluating on the data graph
  • precision its result contains no false data
    node
  • Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
    and Ehud Gudes. Exploiting local similarity for
    indexing paths in graph-structured data. ICDE02,
    129-140.

36
A(2)-Index
C2 and C3 can be grouped because their length-2
incoming paths are the same
37
APEX Adaptive Path Index
  • 1-index, A(k)-index and FB index are all
    workload independent
  • APEX Adaptive Path index
  • Maintains two types of paths in the summary
  • All paths of length two so that all queries can
    be answered using APEX
  • Full paths are maintained for those paths that
    frequently appear in query workload so that
    frequently asked queries can be answered
    efficiently
  • A hash table is included in the index so that
    partial matching queries with the
    self-or-descendent axis (//) can be processed
    efficiently
  • C-W Chung, J-K Min, K. Shim, APEX An Adaptive
    Path Index for XML Data, SIGMOD 02

38
D(k)-Index
  • A generalization of 1-Index and A(k)-Index.
  • Assigning different local bisimilarites to index
    nodes in the summary structure according to the
    query load to optimize its structure.
  • for any two index nodes ni and nj, k(ni) ?
    k(nj)-1 if there is an edge from ni to nj, in
    which k(ni) and k(ni) are ni and njs local
    bisimilarities, respectively.
  • Advantage over 1-Index and A(k)-Index
  • workload-sensitive
  • can be more efficiently updated
  • Qun Chen, Andrew Lim and Kian Win Ong.
    D(k)-index An adaptive structural summary for
    graph-structured data. SIGMOD 03, 134-144.

39
Node (Edge) Encoding
  • Structural relationships
  • Is node u an ancestor of node v?
  • Is node u the parent of node v?
  • Assigning a unique code to a node (edge) in the
    data graph so that the above question can be
    answered by looking at the codes rather than the
    original data graphs.
  • Issues
  • Length of the code.
  • Complexity for computing the structural
    relationship. between two nodes from their codes.
  • Efficient code generation and code maintenance.

40
XML Data Coding Scheme
  • Region-based
  • XML document is ordered
  • Codes are assigned based on the lexicographical
    location of an element in the original document
  • Path-based
  • XML document is nested
  • Codes are assigned based on the nesting structure
    of the document, or the path that reaches and
    element from the root.
  • There are quite a number of variants for both
    categories of coding schemes

41
XML Region Based Coding
  • Region code (start, end, level)
  • u is an ancestor of v iff u.start lt v.start lt
    u.end
  • u is the parent of v, additionally, u.level
    v.level-1
  • Only a depth-first traversal for code generation
  • Property strictly nesting
  • Completely disjoint (case 1,4) or containing
    (case 2,3)
  • Formally, a.start lt b.start lt a.end, if a is an
    ancestor of b

42
Sample of Region Codes
  • The order of start values is also the document
    order
  • The region can also be interpreted as an interval

43
Dewey
1
  • ltcontactgt
  • ltnamegtblahlt/namegt
  • ltphonegt
  • ltofficegt1234lt/officegt
  • lthomegt5678lt/homegt
  • ltmobilegt0000lt/mobilegt
  • lt/phonegt
  • lt/contactgt

contact
1.2
1.1
phone
name
1.2.1
1.2.3
1.2.2
office
blah
home
mobile
1.1.1
0000
5678
1234
1.2.1.1
1.2.2.1
1.2.3.1
a.Dewey is a prefix of d.Dewey
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
Jayavel Shanmugasundaram, Eugene Shekita, and
Chun Zhang. Storing and querying ordered XML
using a relational database system. SIGMOD 2002.
44
Managing XML Data in Relational Systems
  • XML and XML Query Languages
  • XML Coding Scheme and Structural Summary
  • Managing XMLData in Relational Systems

45
XML-Enabled DB Systems
  • IBM DB2 XML Extender
  • XML column support, XML Collection, File liked
    from the DBMS, or Character Large Objects
    (CLOBs).
  • Side Tables server as XML indexes
  • Oracle 9i
  • CLOB, OracleText Cartridge, XMLType, and XML SQL
    Utility
  • Microsoft SQL Server
  • CLOBs, Generic Edge technique and user-defined
    decomposition (from XML to tables), XML views.

46
Storing XML Data in RDBMSs
  • RDBMS a matured technology
  • RDBMS widely available
  • Less investment to adopt the new technology
  • Easy to be integrated with other existing
    applications
  • Impedance mismatch
  • Two level nature of relational schema (tuples and
    attributes) vs. arbitrary nesting of XML DTD
  • Flat structure vs. recursion
  • Structure-based and content-based query

47
XQuery vs SQL Different Culture
  • Data Characteristics
  • Relational data regular, homogeneous, flat
    structure in nature, and no order among tuples.
  • XML data irregular, heterogeneous, unpredictable
    structure, order sensitive.
  • Query Languages
  • SQL
  • Select-from-where
  • With capability to support some fix-point
    operation
  • XQuery
  • FLWOR (pronounced flower) For-let-where-order-r
    eturn
  • Simple/Regular Path expressions

48
Storing XML Data in RDBMSs Architecture
Automatic Schema/Data Mapping
SQL Query
Tuples
Commercial RDBMS
49
Storing XML Data in RDBMSs Issues
  • Schema/Data mapping
  • Automate storage of XML in RDBMS
  • Query mapping
  • Provide XML views of relational sources
  • Result construction
  • Export existing data as XML

50
XML-Relational Mapping
  • Model mapping
  • Database schemas represent constructs of the XML
    document model.
  • DTD Independent
  • Florescu Kossmann 99, Yoshikawa, et. al.
    TOIT01
  • Structure mapping
  • Database schemas represent the logical structure
    of target XML documents
  • DTD Dependent
  • Shanmugasundaram et. al. VDLB 99

51
A Simple XML Document
  • ltprojectgt
  • ltpnamegt XML lt/pnamegt
  • ltmember ID3gt
  • ltnamegtH. Lu lt/namegt
  • ltemailgt luhj_at_cs.ust.hk lt/emailgt
  • ltpublication author H. Lugt
  • lttitlegt Managing XML data using RDBMS lt/titlegt
  • ltyeargt 2001 lt/yeargt
  • lt/publicationgt
  • lt/membergt
  • ltmember ID24gt
  • ltnamegt J.X. Yu lt/namegt
  • ltprojectgt
  • ltpnamegt Data mining lt/pnamegt
  • lt/projectgt
  • lt/membergt
  • lt/projectgt

52
A Sample DOM Tree
1
project
2
3
24
pname
member
member
XML
28
26
27
65
66
publication
project
name
email
name
H. Lu
69
luhj_at_cs.ust.hk
J.X. Yu
70
294
71
author
title
pname
year
Managing
2001
Data mining
53
Model Mapping Document Model to Relation
  • Database schema represents the constructs of XML
    documents
  • Fixed database schema for all XML documents
  • Data graph tree (may contain cycles)
  • Relational schema represents a tree
  • Pros and cons
  • DTD is not required. Documents may not conform
    to DTD
  • Fixed schema no schema evolution issue
  • Large collection of documents with various DTDs
  • Semantics get (totally) lost

54
Model Mapping Edge/Monet Approach
  • Edge oriented approach
  • Single table schema Florescu Kossmann 99
  • Edge (source, ordinal, target, label, flag,
    value)
  • Monet Schmidt et. al. WebDB00
  • multiple tables, horizontal partitions of edge
    table on label-path

Note Document ID is omitted here
55
Querying with Edge
/DBGroup/MemberAgegt20/Name
  • select name.Value
  • from Edge dbgroup, Edge member, Edge age, Edge
    name
  • where dbgroup.Label DBGroup'
  • and member.Label Member'
  • and age.Label Age'
  • and name.Label Name'
  • and dbgroup.Source 0
  • and dbgroup.Target member.Source
  • and member.Target age.Source
  • and member.Target name.Source
  • and cast (age.Value as int) gt 20

56
Model Mapping Node Approach
  • XRel Yoshikawa et. al. TOIT 2001
  • Four table schema
  • Element(pathID, start, end, ordinal)
  • Attribute(pathID, start, end, value)
  • Text(pathID, start, end, value)
  • Path(pathID, pathexp)

57
Querying with XRel
/DBGroup/MemberAgegt20/Name
  • select v2.Value
  • from Element e1, Path p1, Path p2, Path p3, Text
    v1, Text v2
  • where p1.Pathexp \/DBGroup\/Member'
  • and p2.Pathexp \/DBGroup\/Member\/Age'
  • and p3.Pathexp \/DBGroup\/Member\/Name'
  • and e1.PathID p1.PathID
  • and v1.PathID p2.PathID
  • and v2.PathID p3.PathID
  • / containment testing /
  • and e1.Start lt v1.Start and e1.End gt v1.End
  • and e1.Start lt v2.Start and e1.End gt v2.End
  • and cast(v1.Value as int ) gt 20

58
Structural Mapping Simplifying DTDs
  • DTD element specifications can be of arbitrary
    complexity
  • lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt is
    valid!
  • Simple DTD for translation purposes
  • Key observations not necessary to regenerate DTD
    from relational schema
  • XML queries query the position of an element,
    relative to its siblings, and the parent/child
    relationships.

59
DTD Simplification Transformations
Simplification Transformations
Flattening Transformations
(e1, e2) ? e1, e2 (e1, e2)? ? e1?, e2? (e1e2)
? e1?, e2?
e1 ? e1 e1? ? e1 e1? ? e1 e1?? ? e1?
Grouping Transformations
..., a, ..., a, ... ? a, ... ..., a, ...,
a?, ... ? a, ... ..., a?, ..., a, ... ? a,
... ..., a?, ..., a?, ... ? a, , ...a, ,
a, ? a,
lt!ELEMENT a ((bce)?,(e?(f?,(b,b))))gt
lt!ELEMENT a (b, c?, e, f)gt
Deutsch, Fernandez, and Suciu,
SIGMOD99 Shanmugasundaram, Tufte, He, Zhang,
DeWitt, and Naughton, VLDB99
60
A Sample DTD
lt!ELEMENT book (booktitle, author) lt!ELEMENT
booktitle (PCDATA)gt lt!ELEMENT author (name,
address)gt lt!ATTLIST author id ID REQUIREDgt
lt!ELEMENT name (firstname?, lastname)gt
lt!ELEMENT firstname (PCDATA)gt lt!ELEMENT
lastname (PCDATA)gt lt!ELEMENT address
ANYgt lt!ELEMENT article (title, author,
contactauthor)gt lt!ELEMENT title (PCDATA)gt
lt!ELEMENT contactauthor EMPTYgt
lt!ATTLIST contactauthor authorID IDREF
IMPLIEDgt lt!ELEMENT monograph (title, author,
editor)gt lt!ELEMENT editor (monograph)gt
lt!ATTLIST editor name CDATA REQUIREDgt
Shanmugasundaram et. al. VDLB 99
61
DTD to Relational Schema Naïve Approach
  • Each Element gt Relation
  • Each Attribute of Element gt Column of Relation
  • Connect elements using foreign keys

lt!ELEMENT author (name, address)gt lt!ATTLIST
author id ID REQUIREDgt lt!ELEMENT name
(firstname?, lastname)gt lt!ELEMENT firstname
(PCDATA)gt lt!ELEMENT lastname (PCDATA)gt lt!ELEMENT
address ANYgt
author (authorID integer, id string) name
(nameID integer, authorID integer) firstname
(firstnameID integer, nameID integer, value
string) lastname (lastnameID integer, nameID
integer, value string) address (addressID
integer, authorID integer, value string)
62
Basic Inlining Technique
  • Problem of the naïve approach fragmentation
    too many tables
  • Results in 5 relations in the previous example
    retrieving first and last names of an author
  • Intuition
  • Inline as many sub-elements as possible
  • Do not inline only if it is a set sub-element
  • RDBMSs do not all support set-valued columns.
  • Connect relations using foreign keys
  • Can handle recursions
  • A document can be rooted at any element
  • Create separate a relation for each root

63
Basic Inlining Technique Relation Schemas
article
  • article (articleID integer, article.contactauthor
    .authorid string, article.title string)
  • article.author (article.authorID integer,
    article.author.parentID integer,
    article.author.name.firstname string,
  • article.author.name.lastname string,
    article.author.address string,
    article.author.authorid string)

?
contactauthor
title

authorid
author
name
address
authorid
?
firstname
lastname
64
Basic Inlining Technique Pros Cons
  • Reduces number of joins for queries like get the
    first and last names of a book author
  • Efficient for queries such as list all authors
    of books
  • Queries like list all authors with name Ullman
  • Union of 5 queries!
  • Large number of relations
  • Unrolling recursive strongly connected components
    (major)
  • Separate relational schema for each element as
    root (minor)

65
Shared Inlining Technique
  • Intuition
  • Inline as many sub-elements as possible.
  • Do not inline only if it is a shared, recursive
    or set sub-element.
  • An element node is represented in exactly one
    relation.
  • Technique
  • Mapping the following nodes into relations
  • Shared In-degree gt 2 in DTD graph
  • Root elements In-degree 0

66
Issues with Sharing Elements
  • Parent of elements not fixed at schema level
  • Need to store type and ids of parents (or if
    there are no parents)
  • parentCODE field (type of parent)
  • parentID field (id of parent)
  • Not foreign key relationship

67
Shared Relational Schema
  • book (bookID integer, book.booktitle.isroot
    boolean, book.booktitle string)
  • article (articleID integer, article.contactauthor
    .isroot boolean,
  • article.contactauthor.authorid
    string)
  • monograph (monographID integer,
    monograph.parentID integer,
  • monograph.parentCODE
    integer, monograph.editor.isroot boolean,
  • monograph.editor.name
    string)
  • title (titleID integer, title.parentID integer,
    title.parentCODE integer, title string)
  • author (authorID integer, author.parentID
    integer, author.parentCODE integer,
  • author.name.isroot boolean,
    author.name.firstname.isroot boolean,
  • author.name.firstname string,
    author.name.lastname.isroot boolean,
  • author.name.lastname string,
    author.address.isroot boolean,
  • author.address string,
    author.authorid string)

68
Shared Inlining Techniques Pros Cons
  • Reduces number of joins for queries like get the
    first and last names of an author
  • Efficient for queries such as list all authors
    with name Ullman
  • Sharing whenever possible implies extra joins for
    path expressions
  • Article with a given title name

69
Hybrid Inlining Technique
  • Inlines some elements that are shared in Shared
  • Elements with in-degree gt 2 that are not set
    sub-elements or recursive
  • Handles set and recursive sub-elements as in
    Shared

70
Hybrid Relational Schema
  • book (bookID integer, book.booktitle.isroot
    boolean, book.booktitle string,
  • author.name.firstname string,
    author.name.lastname string,
  • author.address string,
    author.authorid string)
  • article (articleID integer, article.contactauthor
    .isroot boolean,
  • article.contactauthor.authorid string,
    article.title.isroot boolean, article.title
    string)
  • monograph (monographID integer,
    monograph.parentID integer,
  • monograph.parentCODE integer,
    monograph.title string,
  • monograph.editor.isroot boolean,
    monograph.editor.name string,
  • author.name.firstname string,
    author.name.lastname string, author.address
    string,
  • author.authorid string)
  • author (authorID integer, author.parentID
    integer, author.parentCODE integer,
  • author.name.isroot boolean,
    author.name.firstname.isroot boolean,
  • author.name.firstname string,
    author.name.lastname.isroot boolean,
  • author.name.lastname string,
    author.address.isroot boolean,
  • author.address string,
    author.authorid string)

71
Hybrid Inlining Technique Pros Cons
  • Reduces joins through shared elements (that are
    not set or recursive elements)
  • Shares some strengths of Shared
  • Reduces joins for queries like get first and
    last names of a book author
  • Requires more SQL sub-queries to retrieve all
    authors with name Ullman.
  • Tradeoff between reducing number of queries and
    reducing number of joins
  • Shared and Hybrid target query- and
    join-reduction respectively

72
More on Shared and Hybrid
  • Shared and Hybrid have pros and cons
  • In many cases, Shared and Hybrid are nearly
    identical
  • Number of joins per SQL query path length
  • Mainly due to large number of set nodes
  • Problem as join processing is expensive!

73
Regular Expressions
  • Path expression queries can be represented by
    regular expressions.
  • Considering path expressions in the following
    from
  • r (r) (r) (r)? r1/r2 r1r2 r1//r2
    name.
  • 0 or more occurrences
  • 1 or more occurrences
  • ? 0 or 1 occurrences
  • r1/r2 form a path from r1 to r2 (child)
  • r1//r2 form a path from r1 to r2 (descendant)
  • disjunction.

74
SPE to SQL
Find the name of the authors for all members
publications
  • /member/publication/author/name
  • select m2.name
  • from member m1, publication, member m2
  • where publication.perantid m1.ID
  • and publication.author m2.ID

75
RPE Expansion
List the title of publications for all projects
  • project//publication

Substitute //
  • project/member/(project.member)/publication
  • project/(member.project)/publication

76
RPE Expansion
List the title of publications for all projects
  • project/member/(project/member)/publication/title
  • project/(member/project)/publication/title

Expanding
select project.publication.title
union select project.member.publication.title
union select project.member.project.publication.ti
tle
77
Recursive Path Expression Queries to SQL
  • Some DBMS supports least-fixed point computation.
    E.g., WITH statement in DB2

project/(member/project)/publication
WITH R(PARENTID, ID) AS ( select m.PARENTID,
p1.ID from member m, project p1
where m.IDp1.PARENTID UNION ALL
select R.PARENTID, p1.ID from R, member
m, project p1 where R.IDm.PARENTID and
m.IDp1.PARENTID) select p3. from project p2,
R, publicaton p3 where p2.IDR.PARENTID and
R.IDp3.PARENTID
78
Expanding Recursive Path Expression Queries
  • Expanding wild cards before sending to DBMS
  • Transitive closure operation is not always
    supported by RDBMS
  • Transitive closure with arbitrary nesting seems
    not supported
  • Can handle nested recursive queries (though DB2
    does not support it)
  • How many SQL statements are required?
  • Executing SQL until empty result returned
  • VXMLR approach keep statistics
  • Zhou et. al. VLDB 2001

79
Query Translation for Structural Mapping
  • Translating XML-QL into SQL Shanmugasundaram, et
    al, VLDB99
  • Simple Path Expressions to SQL
  • Simple Recursive Path Expressions to SQL
  • Arbitrary Path Expressions to Simple Recursive
    Path Expressions
  • Discussion based on Shared approach

80
Queries with Simple Path Expressions
  • WHERE ltbookgt
  • ltbooktitlegt The Selfish Gene lt/booktitlegt
  • ltauthorgt
  • ltnamegt
  • ltfirstnamegt f lt/firstnamegt
  • ltlastnamegt l lt/lastnamegt
  • lt/namegt
  • lt/authorgt
  • lt/bookgt IN CONFORMING TO pubs.dtd
  • CONSTRUCT ltresultgt f l lt/resultgt

Select A.author.name.firstname,
A.author.name.lastname From author A, book
B Where B.bookID A.parentID AND A.parentCODE
0 AND B.book.booktitle The Selfish Gene
81
Queries with Recursive Path Expressions
  • WHERE lt.monographgt
  • lteditor.(monograph.editor)gt
  • ltnamegt n lt/namegt
  • lt/gt
  • lttitlegt Subclass Cirripedia lt/titlegt
  • lt/gt IN CONFORMING TO pubs.dtd
  • CONSTRUCT ltresultgt n lt/resultgt

With Q1 (monographID, name) AS (Select
X.monographID, X.editor.name From
monograph X Where X.title Subclass
Cirripedia UNION ALL Select Z.monographID,
Z.editor.name From Q1 Y, monograph Z
Where Y.monographID Z.parentID AND Z.parentCODE
0 ) Select A.name From Q1 A
82
Queries with Arbitrary Path Expressions
WHERE lt(articlemonograph)..namegt n
lt/gt CONSTRUCT ltnamegt n lt/gt
  • Split complex path expression to (possibly many)
    simple recursive path expressions
  • Has effect of splitting a single XML-QL query to
    (possibly many) SQL queries
  • Can handle nested recursive queries

83
References (1)
  • Aboulnaga, Alameldeen et al, VLDB01
  • Ashraf Aboulnaga, Alaa R. Alameldeen, and
    Jeffrey F. Naughton. Estimating the selectivity
    of XML path expressions for Internet scale
    applications. VLDB 2001.
  • Bohannon et al, ICDE 2002
  • P. Bohannon, J. Freire, P. Roy, and J. Simeon.
    From XML schema to relations A cost-based
    approach to XML storage. In Proceedings of ICDE,
    2002.
  • Boag, Chamberlin et al, W3C 2003
  • Scott Boag, Don Chamberlin, Mary F.
    Fernández, Daniela Florescu, Jonathan Robie,
    Jérôme Siméon, XQuery 1.0 An XML Query Language,
    http//www.w3.org/TR/xquery
  • Bruno et al, SIGMOD02
  • N. Bruno, N. Koudas, D. Srivastava. Holistic
    twig joins Optimal XML pattern matching. In
    SIGMOD Int'l Conf. on Management of Data,
    310-311, 2002.
  • Chen, Jagadish et al, ICDE01
  • Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S.
    Muthukrishnan, R. T. Ng, and D. Srivastava.
    Counting twig matches in a tree. In Proceedings
    of the IEEE International Conference on Data
    Engineering, pages 595-604, 2001.
  • Cohen, Kaplen et al, PODS02
  • E. Cohen, H. Kaplan, T. Milo. Labeling dynamic
    XML trees. In Symposium on Principles of Database
    Systems (PODS), 271-281, 2002.
  • Cark, DeRose, W3C 1999
  • James Clark, and Steven DeRose, XML Path
    Language (XPath) Version 1.0
  • http//www.w3.org/TR/xpath

84
References (2)
  • Ceri, Comai et al, WWW99
  • S. Ceri, S. Comai, E. Damiani, P. Fraternali, S.
    Paraboschi, and L. Tanca. XMLGL a graphical
    language for querying and restructuring WWW data.
    In International World Wide Web Conference (WWW),
    Toronto, Canada, May 1999.
  • Chamberlin, Robie et al, 2000
  • Don Chamberlin, Jonathan Robie, and Daniela
    Florescu. Quilt An XML query language for
    heterogeneous data source. In Proceedings of the
    Third International Workshop on the Web and
    Databases, May 2000.
  • Chamberlin, Draper et al, 2003
  • Don Chamberlin, Denie Draper, Mary Fernandez,
    Michael Kay, Jonathan Robie, Michael Rys, Jerome
    Simeon, Jim Tivy, Philip Wadler. Editor Howard
    Katz. XQuery from the Experts A Guide to the W3C
    XML Query Language. Addison-Wesley Press, 2003
  • Chaudhri, Rashid et al, 2003
  • Akmal B. Chaudhri , Awais Rashid , Roberto
    Zicari. XML Data Management Native XML and
    XML-Enabled Database Systems. Addison-Wesley
    Press, 2003
  • Chen, Lim et al, SIGMOD03
  • Qun Chen, Andrew Lim and Kian Win Ong.
    D(k)-index An adaptive structural summary for
    graph-structured data. In SIGMOD'03, 134-144.
  • Chien, Vagena and Zhang et al, VLDB02
  • S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras,
    and C. Zaniolo. Efficient structural joins on
    indexed XML documents. In VLDB02, pages 263--274,
    2002.
  • Chung, Min et al, SIGMOD02
  • C-W Chung, J-K Min, K. Shim, APEX An Adaptive
    Path Index for XML Data, In SIGMOD'02, 2002

85
References (3)
  • Deutsch, Fernandez et al, QL98
  • A. Deutsch, M. Fernandez, D. Florescu,
    A. Levy, and D. Suciu. XML-QL A query language
    for XML. In M. Marchiori, editor. QL'98--The
    Query Languages Workshop. W3C, Dec. 1998.
  • http//www.w3.org/TR/1998/NOTE-xml-ql-19980819/
  • Deutsch, Fernandez, and Suciu, SIGMOD99
  • A. Deutsch, M. Fernandez, and D. Suciu. Storing
    Semistructured Data with STORED. In Proc. of the
    ACM SIGMOD Conference on Management of Data, June
    1999.
  • Dietz STOC 82
  • Paul F. Dietz. Maintaining order in a linked
    list. STOC 1982.
  • Grust SIGMOD02
  • Torsten Grust. Accelerating XPath Location
    Steps. In Proc. of the 21st ACM SIGMOD
    Conference, pages 109--120, Madison, Wisconsin,
    USA, June 2002. ACM Press.
  • Jiang, Lu, Wang and Ooi, ICDE03
  • Haifeng Jiang, Hongjun Lu, Wei Wang, Beng Chin
    Ooi, XR-Tree Indexing XML Data for Efficient
    Structural Joins, The 19th International
    Conference on Data Engineering (ICDE 2003), page
    253-264, Bangalore, India, March 5-8, 2003.
  • Jiang, Wang, Lu and Yu, VLDB03
  • Haifeng Jiang, Wei Wang, Hongjun Lu, Jeffrey Xu
    Yu, Holistic Twig Joins on Indexed XML Documents,
    The 29th International Conference on Very Large
    Data Bases (VLDB 2003), pages 273-284, Berlin,
    Germany, September 9-12, 2003.

86
References (4)
  • Kaushik, Shenoy, ICDE02
  • Raghav Kaushik, Pradeep Shenoy, Philip Bohannon
    and Ehud Gudes. Exploiting local similarity for
    indexing paths in graph-structured data. In
    ICDE02, 129-140.
  • Kha et al, ICDE01
  • Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke
    Uemura. An XML indexing structure with relative
    region coordinate. ICDE 2001.
  • Krishnamurthy et al, 2003
  • R. Krishnamurthy, R. Kaushik, J. Naughto XML,
    XML-to-SQL Query Translation Literature The
    State of the Art and Open Problems, Symposium
    (XSym), Sep 2003.
  • Li and Moon, VLDB01
  • Quanzhong Li and Bongki Moon. Indexing and
    querying XML data for regular path expressions.
    VLDB 2001.
  • Milo, Suciu, ICDT99
  • Tova Milo and Dan Suciu. Index structures for
    path expressions. In ICDT99. 277-295, January
    1999.
  • Lee, Srivastava DASFAA04
  • Dongwon Lee and Divesh Srivastava. Counting
    relaxed twig matches in a tree. DASFAA 2004.
  • Lim, Wang et al, VLDB02
  • Lipyeow Lim, Min Wang, Sriram Padmanabhan,
    Jeffrey Scott Vitter, and Ronald Parr.
    XPathLearner An on-line self-tuning Markov
    histogram for XML path selectivity estimation.
    VLDB 2002.
  • Lee, Yoo et al, 1996
  • Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and
    P. Bruce Berra. Index structures forstructured
    documents. In Proceedings of the ACM Conference
    on Digital Libraries, 1996.

87
References (5)
  • Manolescu, Florescu et al, 2001
  • I. Manolescu, D. Florescu, and D. Kossmann.
    Pushing XML queries inside relational databases.
    Tech. Report no. 4112, INRIA, 2001
  • Manolescu, Florescu et al, VLDB01
  • I. Manolescu, D. Florescu, and D. Kossmann.
    Answering xml queries over heterogeneous data
    sources. In proceedings on the International
    Conference on Very Large Data Bases (VLDB), Rome,
    Italy, September 2001.
  • Meier, 2002
  • Wolfgang Meier. eXist An open source native XML
    database. In Web, Web-Services, and Database
    Systems 2002, 2002.
  • McHugh, Widom, VLDB99
  • Jason McHugh and Jennifer Widom. Query
    optimization for XML. VLDB 1999.
  • Polyzotis, Garofalakis SIGMOD02
  • Neoklis Polyzotis and Minos N. Garofalakis.
    Statistical synopses for graph-structured XML
    databases. SIGMOD 2002.
  • Polyzotis, Garofalakis VLDB02
  • Neoklis Polyzotis and Minos N. Garofalakis.
    Structure and value synopses for XML data graphs.
    VLDB 2002.
  • Robie, Lapp, QL98
  • J. Robie, J. Lapp, and D. Schach. XML query
    language (XQL). In M. Marchiori, editor.
    QL'98--The Query Languages Workshop. W3C, Dec.
    1998.
  • http//www.w3.org/TandS/QL/QL98/pp/xql.html

88
References (6)
  • Schmidt et. al. WebDB00
  • A. Schmidt, M. L. Kersten, M. Windhouwer, and F.
    Waas. Efficient relational storage and retrieval
    of XML documents. In WebDB (Informal
    Proceedings), pages 47--52, 2000.
  • Shanmugasundaram, Tufte, He, Zhang, DeWitt, and
    Naughton, VLDB99
  • Jayavel Shanmugasundaram, Kristin Tufte, Chun
    Zhang, Gang He, David J. DeWitt, and Jeffrey F.
    Naughton. Relational databases for querying XML
    documents Limitations and oppor- tunities. In
    Proceedings of 25th International Conference on
    Very Large Data Bases (VLDB'99), pages 79-90.
    Morgan Kaufmann, 1999.
  • Shanmugasundaram et. al. VDLB 99
  • Jayavel Shanmugasundaram, Kristin Tufte, Chun
    Zhang, Gang He, David J. DeWitt, Jeffrey F.
    Naughton Relational Databases for Querying XML
    Documents Limitations and Opportunities. VLDB
    1999 302-314
  • Srivastava, Al-Khalifa et al, ICDE02
  • D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N.
    Koudas, J. M. Patel, and Y. Wu. Structural joins
    A primitive for efficient XML query pattern
    matching. In ICDE, pages 141-- 152, 2002.
  • Tatarinov, Viglas et al, SIGMOD02
  • Igor Tatarinov, Stratis D. Viglas, Kevin Beyer,
    Jayavel Shanmugasundaram, Eugene Shekita, and
    Chun Zhang. Storing and querying ordered XML
    using a relational database system. SIGMOD 2002.
  • Wang, Jiang et al, SIGMOD03
  • Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Xu
    Yu. Containment Join Size Estimation Models and
    Methods. The 2003 ACM SIGMOD International
    Conference on Management of Data(SIGMOD03), San
    Diego, California, June 9 - June 12, 2003. Pages
    145-156

89
References (7)
  • Wang, Jiang et al, ICDE03
  • Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey
    Xu Yu. PBiTree coding and efficient processing of
    containment joins. ICDE 2003.
  • Wu et al., EDBT02
  • Yuqing Wu, Jignesh Patel, H. V. Jagadish. Using
    Histograms to Estimate Answer Size for XML
    Queries. Information Systems 28 (1-2) 33-59
    (2003) -- Special Issue Best Papers from EDBT
    2002.
  • Wu et al, ICDE03
  • Yuqing Wu, Jignesh Patel and H.V. Jagadish,
    Structural Join Order Selection for XML Query
    Optimization. ICDE 2003.
  • Yoshikawa, et. al. TOIT01
  • Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki
    Shimura, Shunsuke Uemura XRel a path-based
    approach to storage and retrieval of XML
    documents using relational databases. ACM Trans.
    Internet Techn. 1(1) 110-141 (2001)
  • Zhou et. al. VLDB 2001
  • Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi
    Liang, Long Zhang, Wenyun Ji, Zengping Tian
    VXMLR A Visual XML-Relational Database System.
    VLDB 2001 719-720
  • Zhang, Naughton SIGMOD01
  • Chun Zhang, Jeffrey F. Naughton, David J.
    DeWitt, Qiong Luo, and Guy M. Lohman. On
    supporting containment queries in relational
    database management systems. SIGMOD 2001.
Write a Comment
User Comments (0)
About PowerShow.com