Database Systems and XML - PowerPoint PPT Presentation

About This Presentation
Title:

Database Systems and XML

Description:

... queries is run within the engine. Must add XML constructor support to the engine. ... A query engine designed to tap information from XML documents is valuable. ... – PowerPoint PPT presentation

Number of Views:49
Avg rating:3.0/5.0
Slides: 53
Provided by: willia160
Category:
Tags: xml | database | systems

less

Transcript and Presenter's Notes

Title: Database Systems and XML


1
Database Systems and XML
  • David Wu
  • CS 632
  • April 23, 2001

2
Researched Papers
  • J. Shanmugasundaram, et al. "Efficiently
    Publishing Relational Data as XML Documents",
    VLDB Conference, September 2000.
  • J. Shanmugasundaram, et al. "Relational Databases
    for Querying XML Documents Limitations and
    Opportunities," VLDB Conference, September 1999.

3
Efficiently Publishing Relational Data as XML
Documents
4
Motivation
  • Relational database systems and XML are heavily
    used on the Web.
  • Would like some way to publish relational data as
    XML.

5
What is Needed
  • Language to specify the conversion from
    relational data to XML.
  • Implementation to efficiently carry out the
    conversion.

6
SQL Based Language
7
Implementation Alternatives
  • Main differences between relations and XML
  • XML docs have tags
  • XML has nested structure

8
Early Tagging, Early Structuring
  • Stored Procedure Approach (outside engine)
  • Performs a nested-loop join by issuing queries
    for each nested structure in the desired XML.
  • High overhead due to the number of queries.
  • Fixed join order.

9
Early Tagging, Early Structuring
  • Correlated CLOB Approach (inside engine)
  • Have one large query with sub-queries is run
    within the engine.
  • Must add XML constructor support to the engine.
  • XML fragments from the constructors are stored as
    CLOBs (Character Long Objects). Costly to
    handle.
  • De-Correlated CLOB Approach (inside)
  • Perform query de-correlation to give optimizer
    more flexibility.

10
Late Tagging, Late Structuring
  • Two phases
  • Content creation
  • Tagging and structuring

11
Late Tagging, Late Structuring
  • Content Creation Redundant Relation Approach
  • Join all source tables
  • Both content and process redundancy

12
Late Tagging, Late Structuring
  • Content creation Outer Union Approach
  • Separate the children of the same parent (e.g.
    one tuple should represent either account or
    purchaseOrder).
  • At the end outer union the results.
  • Still some data redundancy (e.g. parent info)

13
Late Tagging, Late Structuring
  • Outer Union Plan

14
Late Tagging, Late Structuring
  • Structuring/Tagging Hashed-based Tagger
  • Group by hashing
  • Extract tuples and tag them.

15
Late Tagging, Early Structuring
  • Late Tagging, Late Structuring requires much
    memory for the hash table.
  • Fix by creating structured content and then tag.

16
Late Tagging, Early Structuring
  • Structured content Sorted Outer Union Approach
  • Desired format
  • Parent information comes before or with its child
  • All info of a node and its descendants occur
    together
  • Relative order of the tuples matches
    user-specified order
  • Achieve by performing a sort on ids on the result
    of the outer union.

17
Late Tagging, Early Structuring
  • Tagging Sorted DataConstantSpaceTagger
  • Can append tags as soon as data is seen.
  • Only need to remember the parent ids of the last
    tuple seen to know when to append closing tags.

18
Experiement
  • Inside Engine
  • Outside Engine

19
Breakdown of Construction
20
Summary of Results
  • Constructing inside the relational engine is more
    efficient.
  • When processing can be done in main mem, the
    Unsorted Outer Union approach wins.
  • When main mem is not enough, the Sorted Outer
    Union approach is best.

21
Relational Databases for Querying XML Documents
22
Why Bother?
  • XML is becoming the standard for data
    representation in WWW.
  • A query engine designed to tap information from
    XML documents is valuable.
  • Relational database system is a mature technology
    and could be used to support XML querying.

23
Basic Idea
  • Step 1 Generate a relational schema from the
    DTD
  • Step 2 Parse the XML document and load the data
    into tuples of the relational table.
  • Step 3 Translate the semi-structured XML
    queries into SQL corresponding to the
    relational data.
  • Step 4 Convert the result back to XML.

24
Translating XML to Relational Schema
  • Main Issues
  • DTDs complexity
  • Arbitrary nesting of XML DTDs vs. two-level
    nature of relational schemas.
  • Set-valued attributes and recursion

25
  • Flattening transformation
  • Simplification transformation of unary operations
  • Grouping transformation

26
Techniques to translate XML DTD to relations.
  • Basic Inlining Technique
  • Shared Inlining Technique
  • Hybrid Inlining technique

27
Basic Inlining Technique
  • Inlining as many descendants of an element into a
    relation. (authorfirstname,lastname,address)
  • Every element will have a relation corresponding
    to it. (firstname, lastname, and address will all
    have elements)

28
Basic Inlining Technique (cont.)
  • Complications
  • Set-valued attributes (eg. Article)
  • Solve by using foreign keys and other tables.
  • Recursion
  • Solve with relational keys and relational
    recursive processing to retrieve the
    relationship.

29
Tools used in creating relations
  • DTD Graph
  • Nodes are elements, attributes,operators
  • Each element appears once
  • Attributes and operators appear as many times as
    they do in the DTD
  • Cycles in the graph indicates recursion

30
Tools used in creating relations
  • Element Graphs
  • Generated from the DTD graph
  • Created by doing a DFS from an element node

31
Creating a Relation
  • Given an element graph, the root it made into
  • a relation with all descendents inlined into it,
  • except
  • Children directly below a are made into
    separate relations
  • Each node with a backpointer edge are made into
    separate relations.
  • These additional relations are named by their
    path
  • from the root and have parentID fields that serve
    as
  • foreign keys (e.g. Article.author has the
    attribute
  • article.author.parentID)

32
Problems with Basic
  • Large number of relations it creates
  • Not efficient for certain queries
  • Good list all authors of books
  • Bad list all authors having first name Jack

33
Shared Inlining Technique
  • Idea Identify commonly used element nodes and
    share them by creating separate relations for
    them.

34
Shared Inlining Technique
  • Rules for creating relations
  • Nodes with in-degreegt1 have relations made
  • Nodes with in-degree1 are inlined
  • Nodes with in-degree0 have relations made
  • Nodes following have relations made
  • Nodes with in-degree1 AND mutually recurive, one
    of them is made into a relation

35
Shared Inlining Technique
  • Rules for designing the schema
  • Relation X inlines all nodes Y that it an reach
    such that the path from X to Y does not contain a
    node that is to be made a separate relation.
  • Inlined elements are flagged as being a root with
    the isRoot field.

36
Problems with Shared
  • Too many joins required!

37
Hybrid Inlining Technique
  • Same as Shared except Hybrid also inlines
    elements that
  • have in-degreegt1 AND
  • are not recursive AND
  • are not reached through a node.

38
Evaluation Metric
  • For path expressions of length N, data was
    gathered on
  • The avg number of SQL queries generated
  • The avg number of joins in each SQL query
  • The total average number of joins in order to
    process the path expression

39
Results for N3
  • For Basic, 1/3 of the DTDs tests didnt run to
    completion due to lack of virtual memory. Basic
    is thus ignored.

40
Results for N3
41
Results for N3
  • Group 1 Hybrid reduce join/query, increases a
    smaller amount of queries gt Hybrid requires
    fewer joins than shared.
  • Group 2 Hybrid reduces join/query, increases a
    comparable amount of queriesgt Hybrid and Shared
    are the same.

42
Results for N3
  • Group 3 Hybrid reduces some joins/query, but
    increased the queries by a lot gt Hybrid
    generates more joins than Shared.
  • Hybrid and Shared performed similarly in both
    joins/query and of queries gt Hybrid and
    Shared are about the same.

43
Semi-Structured Queries to SQL
  • Semi-structured query languages
  • Allow path expressions with various operators and
    wildcards.
  • XML-QL Query Lorel

44
Simple Path to SQL
  • The relations corresponding to the start of the
    root path is added to the FROM clause.
  • If needed, the path expressions are translated to
    joins.

45
Simple Recursive Path to SQL
  • Find initialization of the recursion (e.g.
    .monograph.editor with condition
    monograph.title Subclass Cirripedia)
  • Find the actual recursive path expression (e.g.
    monograph.editor)
  • Union the two

46
Arbitrary Path to Simple Recursive Path
  • Use a general technique to translate path
    expressions to many simple (recursive) path
    expressions.

47
Relational Results to XML Simple Structuring
  • Requires only attaching appropriate tags to each
    tuple.

48
Relational Results to XML Tag Variables
  • Have the relational query contain the tag value
    in the result tuple. Then just covert it to a
    tag during XML generation.

49
Grouping
  • Could sort the result tuples by the group-by
    field and and scan through it in order when
    generating the XML.
  • Could do a grouping operation.

50
Other Cases
  • Complex Element Construction
  • e.g. asking for all article elements and assume
    that may be multiple elements (e.g. author
    title)
  • Difficult to do in traditional relational model.
  • Heterogeneous Results
  • e.g. asking for either title or author of
    article.
  • Could be done in two queries and then merged.

51
Other Cases
  • Nested Queries
  • Could be rewritten in terms of SQL queries using
    outer joins.

52
Conclusion
  • Suggested modifications to relational systems
  • Untyped/variable-typed references.
  • Information retrieval style indices
  • Flexible comparison operators
  • Multiple-query optimization/execution
  • More powerful recursion support.
Write a Comment
User Comments (0)
About PowerShow.com