Introduction to Semistructured Data and XML


Slides: 38
Provided by: www249
Learn more at: https://www2.cs.uh.edu

1
Introduction to Semistructured Data and XML
  • Based on slides by Dan Suciu
  • University of Washington

2
How the Web is Today
  • HTML documents
  • often generated by applications
  • consumed by humans only
  • easy access: across platforms, across organizations
  • No application interoperability
  • HTML not understood by applications
  • screen scraping is brittle
  • Database technology: client-server
  • still vendor specific

3
New Universal Data Exchange Format: XML
  • A recommendation from the W3C
  • XML data
  • XML generated by applications
  • XML consumed by applications
  • Easy access across platforms, organizations

Remark: HTML is for data presentation
4
Paradigm Shift on the Web
  • From documents (HTML) to data (XML)
  • From information retrieval to data management
  • For databases, also a paradigm shift
  • from relational model to semistructured data
  • from data processing to data/query translation
  • from storage to transport

5
Semistructured Data
  • Origins
  • Integration of heterogeneous sources
  • Data sources with non-rigid structure
  • Biological data
  • Web data

6
The Semistructured Data Model
[Figure: Object Exchange Model (OEM) graph. The root edge Bib leads to the complex object o1, whose paper, book, and paper edges lead to o12, o24, and o29. Edges labelled author, title, year, publisher, http, page, and references lead further down; firstname, lastname, first, and last edges end in atomic objects such as "Serge", "Abiteboul", "Victor", "Vianu", 1997, 122, and 133.]
7
Syntax for Semistructured Data
    Bib: &o1 { paper: &o12 { ... },
               book: &o24 { ... },
               paper: &o29
                 { author: &o52 "Abiteboul",
                   author: &o96 { firstname: &243 "Victor",
                                  lastname: &o206 "Vianu" },
                   title: &o93 "Regular path queries with constraints",
                   references: &o12,
                   references: &o24,
                   pages: &o25 { first: &o64 122, last: &o92 133 }
                 }
             }

Observe: nested tuples, set-values, oids!
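
One way to picture the oid-based syntax is as a table from oids to either an atomic value or a list of labelled edges. The sketch below is only an illustration: the dictionary layout and the helper names `children` and `is_atomic` are assumptions, not part of OEM. The contents of o12 and o24 are elided on the slide, so they are left out here.

```python
# Hypothetical in-memory encoding of the OEM fragment for paper o29.
# Each oid maps either to an atomic value or to a list of (label, oid) edges.
oem = {
    "o29": [("author", "o52"), ("author", "o96"), ("title", "o93"),
            ("references", "o12"), ("references", "o24"), ("pages", "o25")],
    "o52": "Abiteboul",
    "o96": [("firstname", "243"), ("lastname", "o206")],
    "243": "Victor",
    "o206": "Vianu",
    "o93": "Regular path queries with constraints",
    "o25": [("first", "o64"), ("last", "o92")],
    "o64": 122,
    "o92": 133,
}

def children(oid, label):
    """Return the oids reachable from `oid` over edges named `label`."""
    node = oem.get(oid)
    if not isinstance(node, list):
        return []  # atomic object, or an oid whose contents are elided
    return [target for (name, target) in node if name == label]

def is_atomic(oid):
    """True when the oid is known and holds a value rather than edges."""
    return oid in oem and not isinstance(oem[oid], list)
```

Note that `children("o29", "author")` yields two oids: the same label may appear on several edges, which is what makes attributes set-valued.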
8
Syntax for Semistructured Data
  • May omit oids:

    { paper: { author: "Abiteboul",
               author: { firstname: "Victor",
                         lastname: "Vianu" },
               title: "Regular path queries ...",
               page: { first: 122, last: 133 } } }

9
Characteristics of Semistructured Data
  • Missing or additional attributes
  • Multi-valued attributes
  • Different types in different objects
  • Heterogeneous collections

Self-describing, irregular data, no a priori
structure
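
A small sketch of what these characteristics mean for code that consumes such data. The records and the helper `values_of` are hypothetical; the point is only that a reader cannot assume an attribute exists, is single-valued, or has one fixed type.

```python
# Hypothetical records; attribute names follow the bibliography example.
papers = [
    {"author": ["Abiteboul"], "title": "Regular path queries with constraints"},
    {"author": [{"firstname": "Victor", "lastname": "Vianu"}], "year": 1997},
    {"title": "Untitled draft"},  # no author attribute at all
]

def values_of(records, attribute):
    """Collect every value of `attribute`, tolerating missing and multi-valued cases."""
    found = []
    for record in records:
        value = record.get(attribute, [])  # missing attribute: nothing to add
        if not isinstance(value, list):
            value = [value]                # single value: treat as one-element list
        found.extend(value)
    return found

authors = values_of(papers, "author")  # mixes a string and a nested tuple
titles = values_of(papers, "title")
```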
10
Comparison with Relational Data
    { row: { name: "John", phone: 3634 },
      row: { name: "Sue", phone: 6343 },
      row: { name: "Dick", phone: 6363 } }

11
XML
  • A W3C standard to complement HTML
  • Origins: structured text, SGML
  • Motivation
  • HTML describes presentation
  • XML describes content
  • http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)

12
From HTML to XML
HTML describes the presentation
13
HTML
    <h1> Bibliography </h1>
    <p> <i> Foundations of Databases </i>
        Abiteboul, Hull, Vianu
        <br> Addison Wesley, 1995
    <p> <i> Data on the Web </i>
        Abiteboul, Buneman, Suciu
        <br> Morgan Kaufmann, 1999

14
XML
    <bibliography>
      <book> <title> Foundations </title>
             <author> Abiteboul </author>
             <author> Hull </author>
             <author> Vianu </author>
             <publisher> Addison Wesley </publisher>
             <year> 1995 </year>
      </book>
    </bibliography>

XML describes the content
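
Because the tags name the content, an application can pull out fields directly instead of scraping a rendered page. A minimal sketch using Python's standard `xml.etree.ElementTree`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

BIB = """
<bibliography>
  <book> <title> Foundations </title>
         <author> Abiteboul </author>
         <author> Hull </author>
         <author> Vianu </author>
         <publisher> Addison Wesley </publisher>
         <year> 1995 </year>
  </book>
</bibliography>
"""

root = ET.fromstring(BIB)
book = root.find("book")

# Element text keeps the surrounding spaces from the document, so strip it.
title = book.findtext("title").strip()
authors = [a.text.strip() for a in book.findall("author")]
year = int(book.findtext("year"))
```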
15
Why are we DBers interested?
  • It's data, stupid. That's us.
  • Proof by AltaVista:
  • database+XML -- 40,000 pages.
  • Database issues
  • How are we going to model XML? (graphs)
  • How are we going to query XML? (XML-QL)
  • How are we going to store XML? (in a relational database? object-oriented?)
  • How are we going to process XML efficiently? (uh well..., um..., ah..., get some good grad students!)

16
Document Type Definitions (DTDs)
  • Sort of like a schema but not really.
  • Inherited from SGML DTD standard
  • BNF grammar establishing constraints on element
    structure and content
  • Definitions of entities

17
Shortcomings of DTDs
  • Useful for documents, but not so good for data
  • No support for structural re-use
  • Object-oriented-like structures aren't supported
  • No support for data types
  • Can't do data validation
  • Can have a single key item (ID), but
  • No support for multi-attribute keys
  • No support for foreign keys (references to other keys)
  • No constraints on IDREFs (e.g., cannot require that a reference point only to a Section)

18
XML Schema
  • In XML format
  • Includes primitive data types (integers, strings,
    dates, etc.)
  • Supports value-based constraints (integers > 100)
  • User-definable structured types
  • Inheritance (extension or restriction)
  • Foreign keys
  • Element-type reference constraints

19
Sample XML Schema
    <schema version="1.0" xmlns="http://www.w3.org/1999/XMLSchema">
      <element name="author" type="string" />
      <element name="date" type="date" />
      <element name="abstract">
        <type>
        </type>
      </element>
      <element name="paper">
        <type>
          <attribute name="keywords" type="string"/>
          <element ref="author" minOccurs="0" maxOccurs="*" />
          <element ref="date" />
          <element ref="abstract" minOccurs="0" maxOccurs="1" />
          <element ref="body" />
        </type>
      </element>
    </schema>

20
Important XML Standards
  • XSL/XSLT: presentation and transformation standards
  • RDF: resource description framework (meta-info such as ratings, categorizations, etc.)
  • XPath/XPointer/XLink: standard for linking to documents and elements within
  • Namespaces: for resolving name clashes
  • DOM: Document Object Model for manipulating XML documents
  • SAX: Simple API for XML parsing
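
The difference between the last two items can be sketched with Python's standard library, which ships a DOM implementation (`xml.dom.minidom`) and a SAX parser (`xml.sax`). The document string and the `AuthorCounter` class are illustrative assumptions.

```python
import xml.dom.minidom
import xml.sax

DOC = "<book><author>Abiteboul</author><author>Hull</author><author>Vianu</author></book>"

# DOM: the whole document is loaded into a tree that can then be navigated.
dom = xml.dom.minidom.parseString(DOC)
dom_authors = [node.firstChild.data for node in dom.getElementsByTagName("author")]

# SAX: the parser streams events to a handler; no tree is built.
class AuthorCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "author":
            self.count += 1

counter = AuthorCounter()
xml.sax.parseString(DOC.encode("utf-8"), counter)
```

DOM is convenient when the document fits in memory and needs to be revisited; SAX suits a single pass over large inputs.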

21
XML Data Model (Graph)
Think of the labels as names of binary relations.
  • Issues
  • Distinguish between attributes and
    sub-elements?
  • Should we conserve order?

22
XML Terminology
  • Tags: book, title, author, ...
  • start tag: <book>, end tag: </book>
  • Elements: <book>...</book>, <author>...</author>
  • elements can be nested
  • empty element: <red></red> (can be abbreviated <red/>)
  • XML document: has a single root element
  • Well-formed XML document: has matching tags
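
A quick way to test these two rules (single root, matching tags) is to hand the text to a parser and see whether it is rejected. The helper name `is_well_formed` is an assumption for this sketch; it relies on `xml.etree.ElementTree` raising `ParseError` on malformed input.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as a single-rooted XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok_nested = is_well_formed("<book><title>Foundations</title></book>")
ok_empty = is_well_formed("<red/>")
bad_mismatch = is_well_formed("<book><title>Foundations</book>")  # </title> missing
bad_two_roots = is_well_formed("<book/><book/>")                  # more than one root
```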

23
More XML: Attributes
    <book price = "55" currency = "USD">
      <title> Foundations of Databases </title>
      <author> Abiteboul </author>
      <year> 1995 </year>
    </book>

Attributes are alternative ways to represent data
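
To a consumer the two representations are reached through different calls. A short sketch with `xml.etree.ElementTree`, using the book element from this slide:

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book price="55" currency="USD">'
    "<title> Foundations of Databases </title>"
    "<author> Abiteboul </author>"
    "<year> 1995 </year>"
    "</book>"
)

# Attributes live on the element itself and are always strings.
price = book.get("price")
currency = book.attrib["currency"]

# Sub-elements are child nodes and must be looked up by tag.
year = book.findtext("year").strip()

# A sub-element is not visible through the attribute interface.
year_as_attribute = book.get("year")
```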
24
More XML: Oids and References
    <person id="o555"> <name> Jane </name> </person>
    <person id="o456"> <name> Mary </name>
        <children idref="o123 o555"/>
    </person>
    <person id="o123" mother="o456"><name>John</name>
    </person>
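
References are just strings until an application resolves them. The sketch below builds an id index and follows the `idref` and `mother` attributes. The `<people>` wrapper element is an assumption added so that the three sibling persons parse as one single-rooted document; `name_of` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DOC = """
<people>
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name>
      <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(DOC)

# Index every person by its id so that references can be followed.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(person):
    return person.findtext("name").strip()

# idref holds a whitespace-separated list of oids.
child_ids = by_id["o456"].find("children").get("idref").split()
child_names = [name_of(by_id[i]) for i in child_ids]

# mother holds a single oid.
mother_name = name_of(by_id[by_id["o123"].get("mother")])
```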

25
XML-Query Data Model
  • Describes XML data as a tree
  • Node = DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode

http://www.w3.org/TR/query-datamodel/ (2/2001)
26
XML Query Data Model
  • Example

    price2 = attrNode(price, string10)
    string10 = valueNode(stringValue("55"))
    currency3 = attrNode(currency, string11)
    string11 = valueNode(stringValue("USD"))
    title4 = elemNode(title, string9)
    string9 = valueNode(stringValue("Foundations"))

    <book price = "55" currency = "USD">
      <title> Foundations </title>
      <author> Abiteboul </author>
      <author> Hull </author>
      <author> Vianu </author>
      <year> 1995 </year>
    </book>
27
XML vs. Semistructured Data
  • Both described best by a graph
  • Both are schema-less, self-describing
  • XML is ordered, ssd is not
  • XML can mix text and elements
    <talk> Making Java easier to type and easier to type
      <speaker> Phil Wadler </speaker>
    </talk>
  • XML has lots of other stuff: entities, processing instructions, comments
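
Mixed content shows up in a parsed tree as text attached to the parent element alongside its children. A minimal sketch with `xml.etree.ElementTree`, using the talk example above; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

talk = ET.fromstring(
    "<talk> Making Java easier to type and easier to type"
    "<speaker> Phil Wadler </speaker></talk>"
)

# Text that appears before the first child is stored on the parent element.
leading_text = talk.text.strip()

# The child element carries its own text.
speaker = talk.find("speaker").text.strip()

# itertext() walks all text in document order; whitespace is normalised here.
all_text = " ".join("".join(talk.itertext()).split())
```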

28
XQUERY --- Path Expressions
  • Examples
  • Bib.paper
  • Bib.book.publisher
  • Bib.paper.author.lastname
  • Given an OEM instance, the value of a path
    expression p is a set of objects

29
Path Expressions
  • Examples
  • DB

    Bib.paper = {&o12, &o29}
    Bib.book.publisher = {&o51}
    Bib.paper.author.lastname = {&o71, &206}
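
Evaluating a path expression amounts to following one edge label at a time from the root and keeping the set of objects reached. The sketch below assumes a hypothetical edge table holding only part of the graph; the contents of o12 are not shown in the transcript, so the last query finds only the lastname under o29 rather than both objects listed on the slide.

```python
# Hypothetical fragment of the OEM graph: each complex object maps to its
# labelled edges; oids not listed here are treated as having no edges.
EDGES = {
    "root": [("Bib", "o1")],
    "o1": [("paper", "o12"), ("book", "o24"), ("paper", "o29")],
    "o24": [("publisher", "o51")],
    "o29": [("author", "o52"), ("author", "o96")],
    "o96": [("firstname", "243"), ("lastname", "o206")],
}

def evaluate(path, start="root"):
    """Return the set of oids reached by following `path` (dot-separated labels)."""
    current = {start}
    for label in path.split("."):
        current = {
            target
            for oid in current
            for (name, target) in EDGES.get(oid, [])
            if name == label
        }
    return current

papers = evaluate("Bib.paper")
publishers = evaluate("Bib.book.publisher")
lastnames = evaluate("Bib.paper.author.lastname")
```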
30
XQuery
  • Summary
  • FOR-LET-WHERE-RETURN = FLWR

FOR/LET clauses -> list of tuples -> WHERE clause -> list of tuples -> RETURN clause -> instance of the XQuery data model
31
XQuery
  • FOR $x in expr -- binds $x to each value in the list expr
  • LET $x := expr -- binds $x to the entire list expr
  • LET is useful for common subexpressions and for aggregations

32
FOR v.s. LET
    FOR $x IN document("bib.xml")/bib/book
    RETURN <result> $x </result>

Returns:
    <result> <book>...</book></result>
    <result> <book>...</book></result>
    <result> <book>...</book></result>
    ...

    LET $x := document("bib.xml")/bib/book
    RETURN <result> $x </result>

Returns:
    <result> <book>...</book>
             <book>...</book>
             <book>...</book>
             ...
    </result>
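
The same contrast can be sketched in Python as an analogy, not as XQuery semantics in full: FOR behaves like iterating over a list, LET like binding the whole list to one name. The three placeholder book strings are hypothetical.

```python
books = ["<book>A</book>", "<book>B</book>", "<book>C</book>"]  # placeholder items

# FOR: iterate, so the RETURN expression is evaluated once per book.
for_results = ["<result>" + x + "</result>" for x in books]

# LET: bind the entire list once, so RETURN is evaluated a single time.
x = books
let_result = "<result>" + "".join(x) + "</result>"
```

With three books, FOR produces three result elements while LET produces one result element containing all three books.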
33
XQuery
  • Find all book titles published after 1995

    FOR $x IN document("bib.xml")/bib/book
    WHERE $x/year > 1995
    RETURN $x/title

Result:
    <title> abc </title>
    <title> def </title>
    <title> ghi </title>
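
For readers without an XQuery engine at hand, the clauses map onto a list comprehension over a parsed tree. The contents of `BIB` below are a hypothetical stand-in for bib.xml (only the /bib/book/year and /bib/book/title shape comes from the query), and the sketch returns title strings rather than title elements.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book><title>Foundations of Databases</title><year>1995</year></book>
  <book><title>Data on the Web</title><year>1999</year></book>
</bib>
"""

bib = ET.fromstring(BIB)

titles = [
    book.findtext("title")                 # RETURN $x/title
    for book in bib.findall("book")        # FOR $x IN /bib/book
    if int(book.findtext("year")) > 1995   # WHERE $x/year > 1995
]
```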
34
XQuery
  • For each author of a book by Morgan Kaufmann,
    list all books she published

    FOR $a IN distinct(document("bib.xml")
                /bib/book[publisher="Morgan Kaufmann"]/author)
    RETURN <result>
             $a,
             FOR $t IN /bib/book[author=$a]/title
             RETURN $t
           </result>

distinct = a function that eliminates duplicates
35
XQuery
  • Result
    <result>
      <author> Jones </author>
      <title> abc </title>
      <title> def </title>
    </result>
    <result>
      <author> Smith </author>
      <title> ghi </title>
    </result>
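
The nested query can be mimicked in Python as an outer loop over distinct authors and an inner lookup of titles. The `BIB` contents are hypothetical: they are chosen to reproduce the result shown above, and the Addison Wesley entry (Brown, jkl) is invented to show that other publishers' authors are filtered out.

```python
import xml.etree.ElementTree as ET

BIB = """
<bib>
  <book><publisher>Morgan Kaufmann</publisher><author>Jones</author><title>abc</title></book>
  <book><publisher>Morgan Kaufmann</publisher><author>Jones</author><title>def</title></book>
  <book><publisher>Morgan Kaufmann</publisher><author>Smith</author><title>ghi</title></book>
  <book><publisher>Addison Wesley</publisher><author>Brown</author><title>jkl</title></book>
</bib>
"""

bib = ET.fromstring(BIB)
books = bib.findall("book")

# distinct(...): authors of Morgan Kaufmann books, duplicates removed, order kept.
authors = []
for book in books:
    if book.findtext("publisher") == "Morgan Kaufmann":
        for author in book.findall("author"):
            if author.text not in authors:
                authors.append(author.text)

# Inner FOR: every title of a book having that author, from any publisher.
results = {
    a: [b.findtext("title") for b in books
        if a in [x.text for x in b.findall("author")]]
    for a in authors
}
```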

36
XQuery
    <big_publishers>
      FOR $p IN distinct(document("bib.xml")//publisher)
      LET $b := document("bib.xml")/bib/book[publisher = $p]
      WHERE count($b) > 100
      RETURN $p
    </big_publishers>

count = an (aggregate) function that returns the number of elements
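
A rough Python counterpart, assuming the books have already been parsed: group books by publisher, count each group, and keep the publishers above a threshold. The function name `big_publishers`, its `threshold` parameter, and the tiny sample document are assumptions made so that the example stays small.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def big_publishers(bib, threshold=100):
    """Publishers with more than `threshold` books, in first-seen order."""
    counts = Counter(b.findtext("publisher") for b in bib.iter("book"))
    return [p for p, n in counts.items() if n > threshold]

# Hypothetical sample: three Morgan Kaufmann books and one Addison Wesley book.
bib = ET.fromstring(
    "<bib>"
    + "<book><publisher>Morgan Kaufmann</publisher></book>" * 3
    + "<book><publisher>Addison Wesley</publisher></book>"
    + "</bib>"
)

with_low_threshold = big_publishers(bib, threshold=2)
with_slide_threshold = big_publishers(bib)  # nobody has more than 100 books here
```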
37
XQuery
  • Find books whose price is larger than average

    LET $a := avg(document("bib.xml")/bib/book/price)
    FOR $b in document("bib.xml")/bib/book
    WHERE $b/price > $a
    RETURN $b
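
The LET-then-FOR pattern here computes one aggregate over the whole list and then filters item by item. A sketch in Python, with hypothetical prices and with titles returned in place of whole book elements:

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring(
    "<bib>"
    "<book><title>abc</title><price>30</price></book>"
    "<book><title>def</title><price>50</price></book>"
    "<book><title>ghi</title><price>70</price></book>"
    "</bib>"
)
books = bib.findall("book")

# LET $a := avg(...): one value computed over the entire list of prices.
prices = [float(b.findtext("price")) for b in books]
average = sum(prices) / len(prices)

# FOR $b ... WHERE $b/price > $a RETURN $b
above = [b.findtext("title") for b in books
         if float(b.findtext("price")) > average]
```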