XML and Databases - PowerPoint PPT Presentation

Provided by: Preferr162

1
  • XML and Databases

2
Outline (ambitious)
  • Background: documents (SGML/HTML) and databases
    (structured and semistructured data)
  • XML basics and Document Type Descriptors
  • XML query languages: XML-QL and XSL
  • XML additions: XLink, XPointer, RDF, SOX,
    XML-Data
  • Document Object Model (XML APIs)

3
Some Useful Articles
  • XML, Java, and the future of the web
  • http://webreview.com/wr/pub/97/12/19/xml/index.html
  • XML and the Second-Generation Web
  • http://www.sciam.com/1999/0599issue/0599bosak.html
  • Articles/standards for XML, XSL, XML-QL:
    http://www.w3c.org/
  • http://www.w3.org/TR/REC-xml

4
Background
  • What's the difference between the world of
    documents and information retrieval, and the world
    of databases and query interfaces?

5
Documents vs Databases
  • Document world
    • plenty of small documents
    • usually static
    • implicit structure: section, paragraph, toc, ...
    • tagging
    • human friendly
    • content: form/layout, annotation
    • paradigms: "Save as", WYSIWYG
    • meta-data: author name, date, subject
  • Database world
    • a few large databases
    • usually dynamic
    • explicit structure (schema)
    • records
    • machine friendly
    • content: schema, data, methods
    • paradigms: Atomicity, Consistency, Isolation, Durability
    • meta-data: schema description

6
What to do with them
  • Documents
    • editing
    • printing
    • spell-checking
    • counting words
    • retrieving (IR)
    • searching
  • Database
    • updating
    • cleaning
    • querying
    • composing/transforming

7
HTML
  • Publishing hypertext on the World Wide Web
  • Designed to describe how a Web browser should
    arrange text, images and push-buttons on a page.
  • Easy to learn, but does not convey structure.
  • Fixed tag set.

    <HTML>
      <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
      <BODY>
        <H1>Introduction</H1>
        <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
      </BODY>
    </HTML>

(Figure callouts on this markup: text (PCDATA), opening tag, closing tag,
bachelor tag, attribute name, attribute value)

8
The Structure of XML
  • XML consists of tags and text
  • Tags come in pairs: <date> ... </date>
  • They must be properly nested
  • <date> <day> ... </day> ... </date> --- good
  • <date> <day> ... </date> ... </day> --- bad
  • (You can't do <i> ... <b> ... </i> ... </b> in
    HTML)

9
XML text
  • XML has only one basic type -- text.
  • It is bounded by tags, e.g.
  • <title> The Big Sleep </title>
  • <year> 1935 </year> --- 1935 is still text
  • XML text is called PCDATA (for parsed
    character data). Its characters are Unicode; a
    character can be written as a numeric reference,
    e.g. &#x05DE; for the Hebrew letter Mem
  • Later we shall see how new types are specified by
    XML-Data

10
XML structure
  • Nesting tags can be used to express various
    structures. E.g. A tuple (record)

    <person>
      <name> Malcolm Atchison </name>
      <tel> (215) 898 4321 </tel>
      <email> mp@dcs.gla.ac.sc </email>
    </person>

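To make the tuple reading concrete, here is a minimal Python sketch that turns the person element into a record (a dictionary). It uses the standard library's xml.etree.ElementTree; the helper name element_to_record is made up for this illustration, and it assumes every child tag occurs at most once.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """
<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <email> mp@dcs.gla.ac.sc </email>
</person>
"""

def element_to_record(elem):
    """Map each child tag to its text (hypothetical helper; assumes unique tags)."""
    return {child.tag: (child.text or "").strip() for child in elem}

record = element_to_record(ET.fromstring(PERSON_XML))
print(record)
```

A repeated tag (for example two tel elements) would overwrite the earlier value here, which is one reason a list needs different handling than a tuple.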
11
XML structure (cont.)
  • We can represent a list by using the same
  • tag repeatedly

    <addresses>
      <person> ... </person>
      <person> ... </person>
      <person> ... </person>
      ...
    </addresses>

12
Terminology
  • The segment of an XML document between an opening
    and a corresponding closing tag is called an
    element.

    <person>
      <name> Malcolm Atchison </name>
      <tel> (215) 898 4321 </tel>
      <tel> (215) 898 4321 </tel>
      <email> mp@dcs.gla.ac.sc </email>
    </person>

(Figure callouts on this markup: "element"; "element, a sub-element of";
"not an element")

13
XML is tree-like
[Figure: the person element drawn as a tree, with leaves
Malcolm Atchison, (215) 898 4321, (215) 898 4321 and mp@dcs.gla.ac.sc]
Semistructured data models typically put the
labels on the edges

14
Mixed Content
  • An element may contain a mixture of sub-elements
    and PCDATA
    <airline>
      <name> British Airways </name>
      <motto>
        World's <dubious> favorite </dubious> airline
      </motto>
    </airline>
  • Data of this form is not typically generated from
    databases. It is needed for consistency with HTML
15
A Complete XML Document
    <?xml version="1.0"?>
    <person>
      <name> Malcolm Atchison </name>
      <tel> (215) 898 4321 </tel>
      <email> mp@dcs.gla.ac.sc </email>
    </person>

16
Two ways of representing a DB

projects: title, budget, managedBy
employees: name, ssn, age

17
Project and Employee relations in XML
Projects and employees are intermixed
    <db>
      <project>
        <title> Pattern recognition </title>
        <budget> 10000 </budget>
        <managedBy> Joe </managedBy>
      </project>
      <employee>
        <name> Joe </name>
        <ssn> 344556 </ssn>
        <age> 34 </age>
      </employee>
      <employee>
        <name> Sandra </name>
        <ssn> 2234 </ssn>
        <age> 35 </age>
      </employee>
      <project>
        <title> Auto guided vehicle </title>
        <budget> 70000 </budget>
        <managedBy> Sandra </managedBy>
      </project>
    </db>

18
Project and Employee relations in XML (contd)
Employees follow projects
    <db>
      <projects>
        <project>
          <title> Pattern recognition </title>
          <budget> 10000 </budget>
          <managedBy> Joe </managedBy>
        </project>
        <project>
          <title> Auto guided vehicles </title>
          <budget> 70000 </budget>
          <managedBy> Sandra </managedBy>
        </project>
      </projects>
      <employees>
        <employee>
          <name> Joe </name>
          <ssn> 344556 </ssn>
          <age> 34 </age>
        </employee>
        <employee>
          <name> Sandra </name>
          <ssn> 2234 </ssn>
          <age> 35 </age>
        </employee>
      </employees>
    </db>

19
Project and Employee relations in XML (contd)
Or without separator tags
    <db>
      <projects>
        <title> Pattern recognition </title>
        <budget> 10000 </budget>
        <managedBy> Joe </managedBy>
        <title> Auto guided vehicles </title>
        <budget> 70000 </budget>
        <managedBy> Sandra </managedBy>
      </projects>
      <employees>
        <name> Joe </name>
        <ssn> 344556 </ssn>
        <age> 34 </age>
        <name> Sandra </name>
        <ssn> 2234 </ssn>
        <age> 35 </age>
      </employees>
    </db>

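The "employees follow projects" layout can be generated mechanically from relational rows. The following Python sketch shows one way to do it with xml.etree.ElementTree. The function rows_to_xml and the in-memory tuples are assumptions made for the illustration; a real system would read the rows from a database.

```python
import xml.etree.ElementTree as ET

# Rows of the two relations, copied from the slides.
projects = [("Pattern recognition", 10000, "Joe"),
            ("Auto guided vehicles", 70000, "Sandra")]
employees = [("Joe", 344556, 34), ("Sandra", 2234, 35)]

def rows_to_xml(parent, group_tag, row_tag, columns, rows):
    """Append <group_tag> holding one <row_tag> per row (hypothetical helper)."""
    group = ET.SubElement(parent, group_tag)
    for row in rows:
        row_elem = ET.SubElement(group, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(row_elem, column).text = str(value)
    return group

db = ET.Element("db")
rows_to_xml(db, "projects", "project", ("title", "budget", "managedBy"), projects)
rows_to_xml(db, "employees", "employee", ("name", "ssn", "age"), employees)
xml_text = ET.tostring(db, encoding="unicode")
print(xml_text)
```

Dropping the row_tag level would give the "without separator tags" variant, at the cost of relying on element order to recover row boundaries.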
20
Attributes
  • An (opening) tag may contain attributes. These
    are typically used to describe the content of
    an element
    <entry>
      <word language="en"> cheese </word>
      <word language="fr"> fromage </word>
      <word language="ro"> branza </word>
      <meaning> A food made ... </meaning>
    </entry>
  • Order of attributes in an element does not matter
  • XML elements are ordered

21
Attributes (contd)
  • Another common use for attributes is to express
    dimension or type
    <picture>
      <height dim="cm"> 2400 </height>
      <width dim="in"> 96 </width>
      <data encoding="gif" compression="zip">
        M05-.C@02!G96YE<FEC ...
      </data>
    </picture>
  • A document that obeys the nested tags rule and
    does not repeat an attribute within a tag is said
    to be well-formed.

22
When to use attributes
  • It's not always clear when to use attributes

    <person ssno="123 45 6789">
      <name> F. MacNiel </name>
      <email> fmacn@dcs.barra.ac.sc </email>
      ...
    </person>

    <person>
      <ssno> 123 45 6789 </ssno>
      <name> F. MacNiel </name>
      <email> fmacn@dcs.barra.ac.sc </email>
      ...
    </person>

23
XML Misc.
  • Apart from elements and attributes, XML allows
    processing instructions and comments. A
    processing instruction is a statement of the
    form
  • <?xml version="1.0"?>
  • <?XML ENCODING="UTF-8" VERSION="1.0"?>
  • A comment is enclosed between <!-- and -->:
  • <!-- A Comment -->
24
Document Type Descriptors
  • Imposing structure on XML documents

25
Document Type Descriptors
  • Document Type Descriptors (DTDs) impose structure
    on an XML document.
  • There is some relationship between a DTD and a
    schema, but it is not close -- hence the need for
    additional typing systems.
  • The DTD is a syntactic specification.

26
Example The Address Book
    <person>
      <name> MacNiel, John </name>         (exactly one name)
      <greet> Dr. John MacNiel </greet>    (at most one greeting)
      <addr> 1234 Huron Street </addr>     (as many address lines as needed, in order)
      <addr> Rome, OH 98765 </addr>
      <tel> (321) 786 2543 </tel>          (mixed telephones and faxes)
      <fax> (321) 786 2543 </fax>
      <tel> (321) 786 2543 </tel>
      <email> jm@abc.com </email>          (as many as needed)
    </person>

27
Specifying the structure
  • name: to specify a name element
  • greet?: to specify an optional (0 or 1)
    greet element
  • name, greet?: to specify a name followed by an
    optional greet

28
Specifying the structure (cont)
  • addr*: to specify 0 or more address lines
  • tel | fax: a tel or a fax element
  • (tel | fax)*: 0 or more repeats of tel or fax
  • email*: 0 or more email elements

29
Specifying the structure (cont)
  • So the whole structure of a person entry is
    specified by
  • name, greet?, addr*, (tel | fax)*, email*
  • This is known as a regular expression. Why is it
    important?
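One practical consequence is that a content model can be checked with ordinary regular-expression machinery. The Python sketch below is an illustration only, not how a validating parser is actually built: it encodes the sequence of child tags as a space-terminated string and matches it against a translation of the person content model. The names PERSON_MODEL and matches_person_model are invented for the example.

```python
import re

# name, greet?, addr*, (tel | fax)*, email*
# Each tag is followed by one space so that tag boundaries are unambiguous.
PERSON_MODEL = re.compile(r"(name )(greet )?(addr )*((tel|fax) )*(email )*")

def matches_person_model(child_tags):
    """Return True if the child tag sequence fits the person content model."""
    encoded = "".join(tag + " " for tag in child_tags)
    return PERSON_MODEL.fullmatch(encoded) is not None

print(matches_person_model(["name", "greet", "addr", "addr", "tel", "fax", "tel", "email"]))
print(matches_person_model(["name", "email", "tel"]))
```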

30
Regular Expressions
  • Each regular expression determines a
    corresponding finite state automaton. Let's start
    with a simpler example
  • name, addr*, email
  • This suggests a simple parsing program

[Figure: finite state automaton with transitions labeled name, addr and email]
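The "simple parsing program" for name, addr*, email can be sketched as a table-driven automaton. The state names below (start, after_name, done) are made up for this illustration; only the three tag names come from the slide.

```python
# Transition table for the content model: name, addr*, email
TRANSITIONS = {
    ("start", "name"): "after_name",
    ("after_name", "addr"): "after_name",   # loop for addr*
    ("after_name", "email"): "done",
}
ACCEPTING = {"done"}

def accepts(child_tags):
    """Run the automaton over a sequence of child tag names."""
    state = "start"
    for tag in child_tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Because each state has at most one transition per tag, the check needs a single pass over the children and no backtracking.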
31
Another example
  • name, address*, (tel | fax)*, email*

[Figure: finite state automaton for this expression, with transitions
labeled name, address, tel, fax and email]
Adding in the optional greet further complicates
things
32
A DTD for the address book
    <!DOCTYPE addressbook [
      <!ELEMENT addressbook (person*)>
      <!ELEMENT person
        (name, greet?, address*, (fax | tel)*, email*)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT greet (#PCDATA)>
      <!ELEMENT address (#PCDATA)>
      <!ELEMENT tel (#PCDATA)>
      <!ELEMENT fax (#PCDATA)>
      <!ELEMENT email (#PCDATA)>
    ]>

33
Our relational DB revisited

projects: title, budget, managedBy
employees: name, ssn, age

34
Two DTDs for the relational DB
    <!DOCTYPE db [
      <!ELEMENT db (projects, employees)>
      <!ELEMENT projects (project*)>
      <!ELEMENT employees (employee*)>
      <!ELEMENT project (title, budget, managedBy)>
      <!ELEMENT employee (name, ssn, age)>
      ...
    ]>

    <!DOCTYPE db [
      <!ELEMENT db (project | employee)*>
      <!ELEMENT project (title, budget, managedBy)>
      <!ELEMENT employee (name, ssn, age)>
      ...
    ]>

35
Some things are hard to specify
  • Each employee element is to contain name, age and
    ssn elements in some order.
    <!ELEMENT employee
      ( (name, age, ssn) | (age, ssn, name) |
        (ssn, name, age) | ...
      )>
  • Suppose there were many more fields!

36
Summary of XML regular expressions
  • A: the tag A occurs
  • e1,e2: the expression e1 followed by e2
  • e*: 0 or more occurrences of e
  • e?: optional -- 0 or 1 occurrences
  • e+: 1 or more occurrences
  • e1 | e2: either e1 or e2
  • (e): grouping

37
Specifying attributes in the DTD
    <!ELEMENT height (#PCDATA)>
    <!ATTLIST height
      dimension CDATA #REQUIRED
      accuracy  CDATA #IMPLIED >
  • The dimension attribute is required; the accuracy
    attribute is optional.
  • CDATA is the type of the attribute -- it means
    string.

38
The DTD Language
  • Default modifiers in DTD attributes

39
The DTD Language
  • Datatypes in DTD attributes

40
Consistency of ID and IDREF attribute values
  • If an attribute is declared as ID
  • the associated values must all be distinct (no
    confusion)
  • ID is a poor cousin of a key in relational
    databases.
  • If an attribute is declared as IDREF
  • the associated value must exist as the value of
    some ID attribute (no dangling pointers)
  • IDREF is a poor cousin of foreign key in
    relational databases.
  • Similarly for all the values of an IDREFS
    attribute
  • An attribute of type IDREFS represents a
    space-separated list of references to
    valid IDs.
  • ID and IDREF attributes are not typed

41
Specifying ID and IDREF attributes
    <!DOCTYPE family [
      <!ELEMENT family (person)*>
      <!ELEMENT person (name)>
      <!ELEMENT name (#PCDATA)>
      <!ATTLIST person
        id       ID     #REQUIRED
        mother   IDREF  #IMPLIED
        father   IDREF  #IMPLIED
        children IDREFS #IMPLIED>
    ]>

42
Some conforming data
    <family>
      <person id="jane" mother="mary" father="john">
        <name> Jane Doe </name>
      </person>
      <person id="john" children="jane jack">
        <name> John Doe </name>
      </person>
      <person id="mary" children="jane jack">
        <name> Mary Doe </name>
      </person>
      <person id="jack" mother="mary" father="john">
        <name> Jack Doe </name>
      </person>
    </family>
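The two consistency rules for ID and IDREF(S) values (distinct IDs, no dangling references) can be checked by hand when no validating parser is available. The Python sketch below does this for the family data; Python's xml.etree.ElementTree does not validate against a DTD, so the attribute names are hard-coded here as an assumption, and check_references is a name invented for the example.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """
<family>
  <person id="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person id="john" children="jane jack"><name>John Doe</name></person>
  <person id="mary" children="jane jack"><name>Mary Doe</name></person>
  <person id="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>
"""

def check_references(root):
    """Return a list of ID/IDREF problems found under root."""
    problems = []
    seen = set()
    for person in root.iter("person"):
        value = person.get("id")
        if value in seen:
            problems.append(f"duplicate ID: {value}")
        seen.add(value)
    for person in root.iter("person"):
        refs = [person.get("mother"), person.get("father")]   # IDREF
        refs += (person.get("children") or "").split()        # IDREFS
        for ref in refs:
            if ref is not None and ref not in seen:
                problems.append(f"dangling reference: {ref}")
    return problems

problems = check_references(ET.fromstring(FAMILY_XML))
print(problems)
```

Note that the check cannot tell whether a mother reference points at a person rather than some other kind of element, which is the sense in which ID and IDREF are untyped.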

43
An alternative specification
    <!DOCTYPE family [
      <!ELEMENT family (person)*>
      <!ELEMENT person (name, mother?, father?, children?)>
      <!ATTLIST person id ID #REQUIRED>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT mother EMPTY>
      <!ATTLIST mother idref IDREF #REQUIRED>
      <!ELEMENT father EMPTY>
      <!ATTLIST father idref IDREF #REQUIRED>
      <!ELEMENT children EMPTY>
      <!ATTLIST children idrefs IDREFS #REQUIRED>
    ]>

44
The revised data
    <family>
      <person id="jane">
        <name> Jane Doe </name>
        <mother idref="mary"></mother>
        <father idref="john"></father>
      </person>
      <person id="john">
        <name> John Doe </name>
        <children idrefs="jane jack"></children>
      </person>
      ...
    </family>

45
The DTD Language
  • Example Sales Order Document
  • An order document is comprised of several sales
    orders. Each individual order has a number and it
    contains the customer information, the date when
    the order was received, and the items ordered.
    Each customer has a number, a name, street, city,
    state, and ZIP code. Each item has an item
    number, parts information and a quantity. The
    parts information contains a number, a
    description of the product and its unit price.
  • The numbers should be treated as attributes.

46
The DTD Language
  • Example: Sales Order Document DTD

    <!-- DTD for example sales order document -->
    <!ELEMENT Orders (SalesOrder+)>
    <!ELEMENT SalesOrder (Customer, OrderDate, Item+)>
    <!ELEMENT Customer (CustName, Street, City, State, ZIP)>
    <!ELEMENT OrderDate (#PCDATA)>
    <!ELEMENT Item (Part, Quantity)>
    <!ELEMENT Part (Description, Price)>
    <!ELEMENT CustName (#PCDATA)>
    <!ELEMENT Street (#PCDATA)>
    <!ELEMENT ... (#PCDATA)>
    <!ATTLIST SalesOrder SONumber CDATA #REQUIRED>
    <!ATTLIST Customer CustNumber CDATA #REQUIRED>
    <!ATTLIST Part PartNumber CDATA #REQUIRED>
    <!ATTLIST Item ItemNumber CDATA #REQUIRED>

47
The DTD Language
  • Example: Sales Order XML Document

    <Orders>
      <SalesOrder SONumber="12345">
        <Customer CustNumber="543">
          <CustName>ABC Industries</CustName>
          <Street>123 Main St.</Street>
          <City>Chicago</City>
          <State>IL</State>
          <ZIP>60609</ZIP>
        </Customer>
        <OrderDate>10222000</OrderDate>
        <Item ItemNumber="1">
          <Part PartNumber="234">
            <Description>Turkey wrench</Description>
            <Price>9.95</Price>
          </Part>
          <Quantity>10</Quantity>
        </Item>
      </SalesOrder>
    </Orders>
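To show what a program on the database side might do with such a document, here is a small Python sketch that computes the total value of each sales order. It is only an illustration: order_totals is an invented name, and because a DTD gives every value the type PCDATA or CDATA, the conversion of Price and Quantity to numbers has to be done by the application.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

ORDERS_XML = """
<Orders>
  <SalesOrder SONumber="12345">
    <Customer CustNumber="543">
      <CustName>ABC Industries</CustName>
      <Street>123 Main St.</Street>
      <City>Chicago</City>
      <State>IL</State>
      <ZIP>60609</ZIP>
    </Customer>
    <OrderDate>10222000</OrderDate>
    <Item ItemNumber="1">
      <Part PartNumber="234">
        <Description>Turkey wrench</Description>
        <Price>9.95</Price>
      </Part>
      <Quantity>10</Quantity>
    </Item>
  </SalesOrder>
</Orders>
"""

def order_totals(root):
    """Map each SONumber to the sum of price * quantity over its items."""
    totals = {}
    for order in root.iter("SalesOrder"):
        total = Decimal("0")
        for item in order.iter("Item"):
            price = Decimal(item.findtext("Part/Price"))   # text -> number
            quantity = int(item.findtext("Quantity"))
            total += price * quantity
        totals[order.get("SONumber")] = total
    return totals

totals = order_totals(ET.fromstring(ORDERS_XML))
print(totals)
```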

48
A useful abbreviation
  • When an element has empty content we can use
  • <tag blahblahbla/> for <tag blahblahbla></tag>
  • For example

    <family>
      <person id="jane">
        <name> Jane Doe </name>
        <mother idref="mary"/>
        <father idref="john"/>
      </person>
      ...
    </family>

49
Schema.dtd
    <!DOCTYPE db [
      <!ELEMENT db (movie+, actor+)>
      <!ELEMENT movie (title, director, casts, budget)>
      <!ATTLIST movie id ID #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT director (#PCDATA)>
      <!ELEMENT casts EMPTY>
      <!ATTLIST casts idrefs IDREFS #REQUIRED>
      <!ELEMENT budget (#PCDATA)>

50
Schema.dtd (contd)
      <!ELEMENT actor (name, acted_In, age?, directed*)>
      <!ATTLIST actor id ID #REQUIRED>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT acted_In EMPTY>
      <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
      <!ELEMENT age (#PCDATA)>
      <!ELEMENT directed (#PCDATA)>
    ]>

51
Connecting the document with its DTD
  • In line:
      <?xml version="1.0"?>
      <!DOCTYPE db [ <!ELEMENT ...> ]>
      <db> ... </db>
  • Another file:
      <!DOCTYPE db SYSTEM "schema.dtd">
  • A URL:
      <!DOCTYPE db SYSTEM
        "http://www.schemaauthority.com/schema.dtd">

52
Well-formed and Valid Documents
  • Well-formed: applies to any document (with or
    without a DTD); requires proper nesting of tags and
    no repeated attribute within a tag
  • Valid: the document conforms to its DTD, i.e. it
    conforms to the regular expression grammar, the
    types of attributes are correct, and the
    constraints on references are satisfied
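Well-formedness can be tested without any DTD, since it only concerns syntax. A minimal Python sketch follows, using the standard library parser, which raises xml.etree.ElementTree.ParseError on input that is not well-formed. The function name is_well_formed is an assumption for this example. Checking validity would need a validating parser, which the standard library module used here does not provide.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<date><day>1</day></date>"))    # properly nested
print(is_well_formed("<date><day>1</date></day>"))    # improperly nested
print(is_well_formed('<height dim="cm" dim="in"/>'))  # repeated attribute
```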

53
DTDs vs. Schemas (or Types)
  • By database (or programming language) standards
    DTDs are rather weak specifications.
  • Only one base type -- PCDATA
  • No useful abstractions, e.g., sets
  • IDREFs are untyped. You point to something, but
    you don't know what!
  • No constraints, e.g., child is inverse of parent
  • No methods
  • Tag definitions are global
  • Some of the XML extensions impose something like
    a schema or type on an XML document. We'll see
    these later

54
Lots of possibilities for schemas
  • XML Schema (under W3C's spotlight)
  • XDR (Microsoft's BizTalk)
  • SOX (Schema for Object-Oriented XML)
  • Schematron
  • DSD (AT&T Labs and BRICS)
  • and more.

55
Some tools
  • XML Authority: http://www.extensibility.com/tibco/solutions/xml_authority/index.htm
  • XML Spy: http://www.xmlspy.com/download.html

56
Summary
  • XML is a new data format. Its main virtues are
    widespread acceptance and the (important) ability
    to handle semistructured data (data without
    schema)
  • DTDs provide some useful syntactic constraints on
    documents. As schemas they are weak.
  • How to store large XML documents?
  • How to query them?
  • How to map between XML and other representations?