Provided by: DAN3107
Learn more at: http://lsirwww.epfl.ch
1
Lecture 08: XML and Semistructured Data
2
Outline
  • XML (Section 17)
  • XML syntax, semistructured data
  • Document Type Definitions (DTDs)
  • XPath

3
Additional Readings on XML
  • XML
    • http://www.w3.org/XML/1999/XML-in-10-points
    • www.zvon.org/xxl/XMLTutorial/General/book_en.html
    • http://db.bell-labs.com/galax/
    • http://www.w3.org/TR/REC-xml-names (1/99)
  • XPath
    • http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
  • XQuery
    • http://www.w3.org/TR/xmlquery-use-cases/
    • http://www.xmlportfolio.com/xquery.html
  • Main source: www.w3.org (but hard to read)

4
XML
  • eXtensible Markup Language
  • XML 1.0: a recommendation from W3C, 1998
  • Roots: SGML (a very nasty language).
  • After the roots: a format for sharing data

5
XML Data
  • Relational data does not have a syntax
    • I can't give you my relational database
    • Need to import it from another syntax, like CSV (comma-separated values)
  • XML: a rich syntax for data
    • But XML is not relational: it is semistructured
  • Usage:
    • Map any data to XML
    • Store it in files, exchange it on the Web, etc.
    • Even query it directly, using XPath, XQuery

6
XML Data Sharing and Exchange
[Figure: applications and data sources (object-relational, relational data, legacy data) exchange XML data over the Web (HTTP). Specific data management tasks on the XML data: integrate, transform, warehouse.]
7
From HTML to XML
HTML describes the presentation
8
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

9
XML
<bibliography>
  <book> <title> Foundations </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>

XML describes the content
10
XML Terminology
  • tags: book, title, author, ...
  • start tag: <book>, end tag: </book>
  • elements: <book>...</book>, <author>...</author>
  • elements are nested
  • empty element: <red></red>, abbreviated <red/>
  • an XML document has a single root element

An XML document is well formed if it has matching tags
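The matching-tags and single-root rules can be checked mechanically by handing the text to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are illustrative assumptions, not part of the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML (matching tags, single root)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<book><title>Foundations</title></book>")
crossed = is_well_formed("<book><title>Foundations</book></title>")  # tags overlap
two_roots = is_well_formed("<book/><book/>")                         # no single root
print(ok, crossed, two_roots)
```

Only the first string is accepted; the other two violate the nesting rule and the single-root rule respectively.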
11
More XML: Attributes
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>

attributes are alternative ways to represent data
12
More XML: IDs and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456"> <name>John</name>
</person>

Scope of IDs and references is the document
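Because IDs are scoped to the document, a reference can be resolved with a single document-wide lookup table. This is a minimal sketch, assuming the three person elements are wrapped in a hypothetical `<people>` root (the slide shows them without one) and that `id`, `idref`, and `mother` are treated as plain attributes, since no DTD declares their types here.

```python
import xml.etree.ElementTree as ET

doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""
root = ET.fromstring(doc)

# One table for the whole document: id value -> element.
by_id = {p.get("id"): p for p in root.iter("person")}

def name_of(oid: str) -> str:
    """Follow a reference and return the referenced person's name."""
    return by_id[oid].findtext("name")

# idref holds a whitespace-separated list of ids.
mary_children = [name_of(ref) for ref in by_id["o456"].find("children").get("idref").split()]
john_mother = name_of(by_id["o123"].get("mother"))
print(mary_children, john_mother)
```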
13
More XML: CDATA Section
  • Syntax: <![CDATA[ .....any text here... ]]>
  • Example:

<example> <![CDATA[ some text here
</notAtag> <>]]></example>
14
More XML: Entity References
  • Syntax: &entityname;
  • Example: <element> this is less than &lt; </element>
  • Some entities:

&lt;     <
&gt;     >
&amp;    &
&apos;   '
&quot;   "
&#38;    Unicode char
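In practice, entity references are produced when text is written out and resolved when it is parsed back. The following sketch round-trips a string through Python's standard library; the sample string is an assumption made for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "this is less than < & more"

# escape() replaces &, < and > with the predefined entities.
escaped = escape(raw)

# The parser resolves the entity references back to characters.
elem = ET.fromstring(f"<element>{escaped}</element>")

# A numeric character reference names a Unicode code point (38 is '&').
amp = ET.fromstring("<e>&#38;</e>").text
print(escaped, "|", elem.text, "|", amp)
```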
15
More XML: Processing Instructions
  • Syntax: <?target argument?>
  • Example:

<product> <name> Alarm Clock </name>
  <?ringBell 20?> <price> 19.99 </price>
</product>

  • Processed by external applications, e.g. PHP (bad style)
16
More XML: Comments
  • Syntax: <!-- .... Comment text... -->
  • Yes, they are part of the data model!

17
XML Namespaces
  • An XML namespace is a collection of names (markup vocabulary)
  • identified by a prefix (URL reference)
  • name: prefix:localname

<book xmlns='urn:loc.gov:book'                 (default name space)
      xmlns:isbn='www.isbn-org.org/def'>
  <title> </title>                             (names belong to default name space)
  <number> 15 </number>
  <isbn:number> ... </isbn:number>
</book>
18
XML Namespaces
  • syntactic: <number>, <isbn:number>
  • semantic: URL used as unique identifier
  • URL may not exist, has no function

<tag xmlns:mystyle="http://...">
  <mystyle:title> ... </mystyle:title>         (belong to this namespace)
  <mystyle:number> ...
</tag>
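A parser makes the "URL as unique identifier" point concrete: after parsing, the prefix is gone and each name carries its namespace identifier. This sketch uses `xml.etree.ElementTree`, which writes qualified names as `{namespace}localname`; the element text values are placeholders, not data from the lecture.

```python
import xml.etree.ElementTree as ET

doc = """
<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'>
  <title>placeholder title</title>
  <number>15</number>
  <isbn:number>placeholder isbn</isbn:number>
</book>
"""
root = ET.fromstring(doc)

# Unprefixed names land in the default namespace; isbn:number does not.
tags = [child.tag for child in root]

# Lookups go through the identifier; the local prefix chosen here is arbitrary.
ns = {"b": "urn:loc.gov:book", "i": "www.isbn-org.org/def"}
plain_number = root.findtext("b:number", namespaces=ns)
isbn_number = root.findtext("i:number", namespaces=ns)
print(root.tag, tags, plain_number, isbn_number)
```

The two `number` elements are distinguishable only through their namespace, which is why the prefix used in the query (`i`) does not need to match the one in the document (`isbn`).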
19
XML Data: a Tree!
<data>
  <person id="o555"> <name> Mary </name>
    <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address>
  </person>
  <person> <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>

[Figure: the same document drawn as a tree]
data
  person
    id: o555
    name: Mary
    address
      street: Maple
      no: 345
      city: Seattle
  person
    name: John
    address: Thailand
    phone: 23456

Order matters!
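The tree view can be reproduced by walking the parsed document in document order. The sketch below is one possible rendering, assuming attributes are shown with an `@` prefix; the helper name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person id="o555"> <name> Mary </name>
    <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address>
  </person>
  <person> <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>"""
root = ET.fromstring(doc)

def outline(elem, depth=0):
    """Return the subtree as indented lines: tag, then attributes, text, children."""
    lines = ["  " * depth + elem.tag]
    for key, value in elem.attrib.items():
        lines.append("  " * (depth + 1) + f"@{key} = {value}")
    text = (elem.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in elem:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines

lines = outline(root)
print("\n".join(lines))
```

Children come back in the order they appear in the document, which is the sense in which order matters in this data model.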
20
From Relational Data to XML Data

persons
name   phone
John   3634
Sue    6343
Dick   6363

[Figure: XML tree with root persons and three row children, each with a name and a phone child]

XML:
<persons>
  <row> <name>John</name>
        <phone> 3634</phone></row>
  <row> <name>Sue</name>
        <phone> 6343</phone></row>
  <row> <name>Dick</name>
        <phone> 6363</phone></row>
</persons>
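The relational-to-XML mapping shown here is mechanical: one element for the table, one `row` element per tuple, one child element per column. A minimal sketch, assuming all values are already strings; the function name `table_to_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

def table_to_xml(table, columns, rows):
    """Build <table><row><col>value</col>...</row>...</table>."""
    root = ET.Element(table)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = value
    return root

persons = table_to_xml("persons", ("name", "phone"), rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```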

21
XML Data
  • XML is self-describing
  • Schema elements become part of the data
  • Relational schema: persons(name, phone)
  • In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times
  • Consequence: XML is much more flexible
  • XML = semistructured data

22
Semi-structured Data Explained
  • Missing attributes
  • Could represent in a table with nulls

<person> <name> John</name>
         <phone>1234</phone> </person>
<person> <name>Joe</name> </person>          ← no phone!

name   phone
John   1234
Joe    -
23
Semi-structured Data Explained
  • Repeated attributes
  • Impossible in tables

<person> <name> Mary</name>
         <phone>2345</phone>
         <phone>3456</phone> </person>       ← two phones!

name   phone
Mary   2345   3456   ???
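Both cases, the missing phone and the repeated phone, are easy to handle once each person's phones are treated as a list of zero or more values rather than a single column. This sketch assumes the three person elements from these slides are wrapped in a hypothetical `<people>` root.

```python
import xml.etree.ElementTree as ET

doc = """
<people>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</people>
"""
root = ET.fromstring(doc)

# findall returns every matching child: an empty list, one item, or several.
phones = {
    person.findtext("name"): [phone.text for phone in person.findall("phone")]
    for person in root
}
print(phones)
```

A flat table would need a null for Joe and either a second row or a second column for Mary; the list-valued view avoids both.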
24
Semistructured Data Explained
  • Attributes with different types in different
    objects
  • Nested collections (no 1NF)
  • Heterogeneous collections
    • <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first>
                <last> Smith </last>
         </name>                             ← structured name!
         <phone>1234</phone> </person>
25
Document Type Definitions (DTD)
  • part of the original XML specification
  • an XML document may have a DTD
  • XML document:
    • well-formed if tags are correctly closed
    • valid if it has a DTD and conforms to it
  • validation is useful in data exchange

26
Very Simple DTD
<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
27
Very Simple DTD
Example of valid XML document
<company>
  <person> <ssn> 123456789 </ssn>
           <name> John </name>
           <office> B432 </office>
           <phone> 1234 </phone> </person>
  <person> <ssn> 987654321 </ssn>
           <name> Jim </name>
           <office> B123 </office> </person>
  <product> ... </product>
  ...
</company>
28
DTD: The Content Model
<!ELEMENT tag (CONTENT)>                     ← CONTENT is the content model
  • Content model:
    • Complex = a regular expression over other elements
    • Text-only = #PCDATA
    • Empty = EMPTY
    • Any = ANY
    • Mixed content = (#PCDATA | A | B | C)*
29
DTD: Regular Expressions

sequence
  DTD: <!ELEMENT name (firstName, lastName)>
  XML: <name> <firstName> . . . . . </firstName>
              <lastName> . . . . . </lastName> </name>

optional
  DTD: <!ELEMENT name (firstName?, lastName)>

star (repeated occurrence)
  DTD: <!ELEMENT person (name, phone*)>
  XML: <person> <name> . . . . . </name>
                <phone> . . . . . </phone>
                <phone> . . . . . </phone>
                <phone> . . . . . </phone>
                . . . . . . </person>

alternation
  DTD: <!ELEMENT person (name, (phone|email))>
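Since an element-only content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is an illustration of that idea, not a DTD validator: it assumes models built only from names, `,`, `|`, `?`, `*`, `+` and parentheses, and ignores `#PCDATA`, `EMPTY` and `ANY`. The helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate e.g. '(name, phone*)' into a regex over 'name;phone;phone;'."""
    # Each element name becomes a group that consumes 'name;'.
    body = re.sub(r"[A-Za-z_][\w.-]*",
                  lambda m: f"(?:{re.escape(m.group())};)", model)
    # Sequence (',') is plain concatenation in a regex; whitespace is noise.
    body = re.sub(r"[\s,]", "", body)
    return re.compile(body)

def conforms(elem, model: str) -> bool:
    """True if the sequence of child tags of `elem` matches the content model."""
    children = "".join(child.tag + ";" for child in elem)
    return content_model_to_regex(model).fullmatch(children) is not None

person = ET.fromstring(
    "<person><name>A</name><phone>1</phone><phone>2</phone></person>")
name = ET.fromstring("<name><lastName>B</lastName></name>")

star_ok = conforms(person, "(name, phone*)")          # repeated phone allowed
alt_ok = conforms(person, "(name, (phone|email))")    # exactly one phone or email
opt_ok = conforms(name, "(firstName?, lastName)")     # firstName may be absent
seq_ok = conforms(name, "(firstName, lastName)")      # firstName is required
print(star_ok, alt_ok, opt_ok, seq_ok)
```

The `;` terminator after each tag keeps names from running together, so `name` can never be mistaken for the tail of `firstName`.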
30
DTD: Attributes
  • Document Type Definition:
    <!ELEMENT person (ssn, name, office, phone?)>
    <!ATTLIST person
        age         CDATA         #REQUIRED
        birthdate   CDATA         #IMPLIED
        nationality CDATA         #FIXED "CH"
        gender      (male|female) "female">
  • Document:
    <person age="24" nationality="CH" gender="male">
      <ssn> </ssn> <phone> </phone>
    </person>

#REQUIRED: mandatory
#IMPLIED: optional
"female": default
(male|female): enumeration
31
Inclusion of DTD in Documents
External DTD Declaration
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN"
               "http://www.test.org/test.dtd">
<test> "test" is a document element </test>
Internal DTD Declaration
<!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
<test/>
Mixed usage
<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [
  <!ENTITY hello "hello world"> ]>
<test>&hello;</test>
32
Querying XML Data
  • XPath: simple navigation through the tree
  • XQuery: the SQL of XML
  • XSLT: recursive traversal
    • will not discuss
  • XQuery and XSLT build on XPath

33
Sample Data for Queries
<bib>
  <book> <publisher> Addison-Wesley </publisher>
         <author> Serge Abiteboul </author>
         <author> <first-name> Rick </first-name>
                  <last-name> Hull </last-name>
         </author>
         <author> Victor Vianu </author>
         <title> Foundations of Databases </title>
         <year> 1995 </year>
  </book>
  <book price="55">
         <publisher> Freeman </publisher>
         <author> Jeffrey D. Ullman </author>
         <title> Principles of Database and Knowledge Base Systems </title>
         <year> 1998 </year>
  </book>
</bib>

34
Data Model for XPath
[Figure: tree for the sample data]
The root
  bib (the root element)
    book
      publisher: Addison-Wesley
      author: Serge Abiteboul
      . . . .
    book
      . . . .
35
XPath: Simple Expressions
/bib/book/year
  • Result: <year> 1995 </year>
            <year> 1998 </year>

/bib/paper/year
  • Result: empty (there were no papers)
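These two path expressions can be tried with `xml.etree.ElementTree`, which implements a limited subset of XPath. It does not accept absolute paths on an element, so the sketch starts at the `bib` root element and writes `book/year` for `/bib/book/year`. The sample is an abridged copy of the lecture's bibliography.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<bib>
  <book><author>Serge Abiteboul</author>
        <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
        <author>Victor Vianu</author>
        <year>1995</year></book>
  <book price="55"><author>Jeffrey D. Ullman</author><year>1998</year></book>
</bib>""")

# /bib/book/year, written relative to the bib element
years = [year.text for year in bib.findall("book/year")]

# /bib/paper/year: no paper elements, so the result is empty, not an error
papers = bib.findall("paper/year")
print(years, papers)
```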
36
XPath: Restricted Kleene Closure
//author
  • Result: <author> Serge Abiteboul </author>
            <author> <first-name> Rick </first-name>
                     <last-name> Hull </last-name>
            </author>
            <author> Victor Vianu </author>
            <author> Jeffrey D. Ullman </author>

/bib//first-name
  • Result: <first-name> Rick </first-name>
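The descendant step `//` can be sketched the same way. In `xml.etree.ElementTree` it is written `.//` relative to a starting element; the sample is again an abridged copy of the lecture's bibliography.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<bib>
  <book><author>Serge Abiteboul</author>
        <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
        <author>Victor Vianu</author></book>
  <book price="55"><author>Jeffrey D. Ullman</author></book>
</bib>""")

# //author: author elements at any depth, in document order
authors = bib.findall(".//author")

# /bib//first-name: first-name elements at any depth below bib
first_names = [e.text for e in bib.findall(".//first-name")]
print(len(authors), first_names)
```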
37
XPath: Text Nodes
/bib/book/author/text()
  • Result: Serge Abiteboul
            Victor Vianu
            Jeffrey D. Ullman
  • Rick Hull doesn't appear because his author element has first-name and last-name subelements instead of text
  • Functions in XPath:
    • text() = matches the text value
    • node() = matches any node (= * or @* or text())
    • name() = returns the name of the current tag
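`xml.etree.ElementTree` does not implement `text()` or `name()`, so the sketch below approximates them with the `.text` and `.tag` properties of each element. This is a stand-in for the XPath functions, not an implementation of them; whitespace-only text is skipped here by choice.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<bib>
  <book><author>Serge Abiteboul</author>
        <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
        <author>Victor Vianu</author></book>
  <book price="55"><author>Jeffrey D. Ullman</author></book>
</bib>""")

# Approximates /bib/book/author/text(): keep authors that have direct text.
texts = [a.text.strip() for a in bib.findall("book/author")
         if a.text and a.text.strip()]

# Approximates name(): the tag of each element, collected over the whole tree.
names = sorted({e.tag for e in bib.iter()})
print(texts, names)
```

The author element for Rick Hull contributes nothing because its content is made of subelements rather than text.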

38
XPath: Wildcard
//author/*
  • Result: <first-name> Rick </first-name>
            <last-name> Hull </last-name>
  • * matches any element
39
XPath: Attribute Nodes
/bib/book/@price
  • Result: "55"
  • @price means that price has to be an attribute

40
XPath: Predicates
/bib/book/author[first-name]
  • Result: <author> <first-name> Rick </first-name>
                     <last-name> Hull </last-name>
            </author>
41
XPath: More Predicates
/bib/book/author[firstname][address[//zip][city]]/lastname
  • Result: <lastname> ... </lastname>
            <lastname> ... </lastname>
42
XPath: More Predicates
/bib/book[@price < "60"]
/bib/book[author/@age < "25"]
/bib/book[author/text()]
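Wildcards, attribute tests and simple predicates can be tried with `xml.etree.ElementTree`, with two caveats stated as assumptions of this sketch: its XPath subset selects elements only (so attribute values are read with `.get`), and it has no `<` comparison, so the price test is done in Python after selecting books that have a `price` attribute.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""
<bib>
  <book><author>Serge Abiteboul</author>
        <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
        <year>1995</year></book>
  <book price="55"><author>Jeffrey D. Ullman</author><year>1998</year></book>
</bib>""")

# //author/* : any element child of any author
children = [e.tag for e in bib.findall(".//author/*")]

# /bib/book/author[first-name] : authors that have a first-name child
structured = [a.findtext("last-name") for a in bib.findall("book/author[first-name]")]

# /bib/book/@price : select books carrying the attribute, then read its value
prices = [b.get("price") for b in bib.findall("book[@price]")]

# /bib/book[@price < 60] : the comparison is done outside the path expression
cheap_years = [b.findtext("year") for b in bib.findall("book[@price]")
               if float(b.get("price")) < 60]
print(children, structured, prices, cheap_years)
```

A full XPath engine would evaluate all four expressions directly; the split between path selection and Python filtering here reflects only the limits of the standard-library subset.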
43
XPath: Summary
  • bib matches a bib element
  • * matches any element
  • / matches the root element
  • /bib matches a bib element under root
  • bib/paper matches a paper in bib
  • bib//paper matches a paper in bib, at any depth
  • //paper matches a paper at any depth
  • paper|book matches a paper or a book
  • @price matches a price attribute
  • bib/book/@price matches price attribute in book, in bib
  • bib/book[@price<"55"]/author/lastname matches ...