Title: Semistructured Data and XML
1Chapter 29
- Semistructured Data and XML
- Transparencies
2Chapter - Objectives
- What semistructured data is.
- Concepts of the Object Exchange Model (OEM), a model for semistructured data.
- Main language elements of XML.
- Difference between well-formed and valid XML documents.
- How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.
3Chapter - Objectives
- About other related XML technologies.
- Limitations of DTDs and how the W3C XML Schema overcomes these limitations.
- How RDF and RDF Schema provide a foundation for processing meta-data.
- Proposals for a W3C Query Language.
5Introduction
- In 1998, XML 1.0 was formally ratified by the W3C.
- It is set to impact every aspect of programming, including graphical interfaces, embedded systems, distributed systems, and database management.
- Already becoming the de facto standard for data communication within the software industry, and is quickly replacing EDI systems as the primary medium for data interchange among businesses.
- Some analysts believe it will become the language in which most documents are created and stored, both on and off the Internet.
6Introduction
- Due to the nature of information on the Web and the inherent flexibility of XML, it is expected that much of the data encoded in XML will be semistructured, i.e., the data may be irregular or incomplete, and its structure may change rapidly or unpredictably.
- Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.
7Semistructured Data
- Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably.
- Semistructured data is data that has some structure, but the structure may not be rigid, regular, or complete.
- Generally, the data does not conform to a fixed schema (sometimes the terms schema-less or self-describing are used to describe such data).
8Semistructured Data
- The information normally associated with a schema is contained within the data itself.
- In some forms of semistructured data there is no separate schema, in others it exists but only places loose constraints on the data.
- Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.
9Semistructured Data
- Has gained importance recently for various reasons:
- may be desirable to treat Web sources like a database, but cannot constrain these sources with a schema
- may be desirable to have a flexible format for data exchange between disparate databases
- emergence of XML as standard for data representation and exchange on the Web, and similarity between XML documents and semistructured data.
10Example 29.1
11Example 29.1
- Note, data is not regular:
- for John White, hold first and last names, but for Ann Beech store single name and also store a salary
- for property at 2 Manor Rd, store a monthly rent whereas for property at 18 Dale Rd, store an annual rent
- for property at 2 Manor Rd, store property type (flat) as a string, whereas for property at 18 Dale Rd, store type (house) as an integer value.
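The irregularities listed above can be sketched in Python as plain dictionaries. This is only an illustration: the field names, the rent figures, and the integer code used for "house" are assumptions made up for the sketch; only the names, the addresses, and the kinds of irregularity come from the example.

```python
# Hypothetical records mirroring the irregularities in Example 29.1.
# Field names and numeric values are illustrative assumptions.
staff = [
    {"name": {"fName": "John", "lName": "White"}},  # first and last name held separately
    {"name": "Ann Beech", "salary": 12000},         # single name string, plus a salary
]

properties = [
    {"street": "2 Manor Rd", "type": "flat", "monthlyRent": 375},  # type as a string
    {"street": "18 Dale Rd", "type": 1, "annualRent": 7200},       # type as an integer code
]


def field_names(records):
    """Collect every field name that appears in at least one record."""
    names = set()
    for record in records:
        names.update(record)
    return names


def shared_field_names(records):
    """Collect only the field names present in every record."""
    return set.intersection(*(set(record) for record in records))


# No fixed schema: the records agree on some fields only.
irregular_staff_fields = field_names(staff) - shared_field_names(staff)
irregular_property_fields = field_names(properties) - shared_field_names(properties)
```

A relational table would force every row to carry the same columns with the same types; here each record carries its own field names, which is what "self-describing" means in this context.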
12Example 29.1
13Object Exchange Model (OEM)
- Data in OEM is schema-less and self-describing, and can be thought of as a labeled directed graph where nodes are objects, consisting of:
- unique object identifier (for example, 7),
- descriptive textual label (street),
- type (string),
- a value (22 Deer Rd).
- Objects are decomposed into atomic and complex:
- an atomic object contains a value for a base type (e.g., integer or string) and can be recognized in the diagram as one that has no outgoing edges.
- All other objects are complex objects whose type is a set of object identifiers.
14Object Exchange Model (OEM)
- A label indicates what the object represents and is used to identify the object and to convey the meaning of the object, and so should be as informative as possible.
- Labels can change dynamically.
- A name is a special label that serves as an alias for a single object and acts as an entry point into the database (for example, DreamHome is a name that denotes object 1).
15Object Exchange Model (OEM)
- An OEM object can be considered as a quadruple (label, oid, type, value).
- For example:
- (Staff, 4, set, {9, 10})
- (name, 9, string, "Ann Beech")
- (salary, 10, decimal, 12000)
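The quadruple view can be sketched in Python as follows. The class name `OEMObject` and the helper functions are hypothetical names chosen for this sketch; only the three example objects come from the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    label: str
    oid: int
    type: str
    value: Union[str, int, float, frozenset]  # atomic value, or a set of oids


# The three objects from the slide, keyed by oid.
objects = {
    4: OEMObject("Staff", 4, "set", frozenset({9, 10})),
    9: OEMObject("name", 9, "string", "Ann Beech"),
    10: OEMObject("salary", 10, "decimal", 12000),
}


def is_atomic(obj):
    """An atomic object holds a base-type value; a complex one holds oids."""
    return obj.type != "set"


def children(obj):
    """Follow the outgoing edges of a complex object, in oid order."""
    if is_atomic(obj):
        return []
    return [objects[oid] for oid in sorted(obj.value)]


# Walk from the complex Staff object to its atomic children.
staff_fields = {child.label: child.value for child in children(objects[4])}
```

Storing object references as oids rather than nested Python objects keeps the structure a graph, so the same object could be shared by several parents.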
16Semistructured Data - Case Study: Object Exchange Model
17OEM Features
- Common model for heterogeneous information exchange, self-describing
- Each object has four fields: OID, Label, Type, Value
- OID: unique identifier or NULL
- Label: character string descriptor
- Type: atomic data type or set
- Value: atomic value or set of object references
- Help pages for labels
- Query language: OEM-QL
18Representing Semistructured Data Using OEM
<collection, {b1, a1, ...}>
b1: <book, {t, a}>
t: <title, "Database and ...">
a: <author, {n, p}>
n: <name, "Jeff Ullman">
p: <picture, "/gifs/ullman.gif">
a1: <article, {v, w, x}>
v: <author, "Gio Wiederhold">
w: <title, "Mediators in the ...">
x: <journal, "IEEE Computer">
...
[Figure callouts: Label, Set Value, Memory Addresses, Atomic Value]
19An OEM Query Language OEM-QL
- Logic-based language for OEM
- Match object patterns, generate variable bindings, construct new OEM objects from existing ones
- Get articles published in IEEE Computer
- P :-
- P:<articles {<journal "IEEE Computer">}>
- Get titles of books by Jeff Ullman
- <answer_title T> :-
- <book {<author "Jeff Ullman"> <title T>}>
20XML
- Vendors introduced some browser-specific HTML tags, making it difficult to develop sophisticated, widely viewable Web documents.
- W3C has produced a new standard called XML, which could preserve the general application independence that makes HTML portable and powerful.
21XML
- XML is a restricted version of SGML, designed especially for Web documents.
- SGML allows a document to be logically separated into two parts: one that defines the structure of the document (the DTD), the other containing the text itself.
- By giving documents a separately defined structure, and by giving authors the ability to define custom structures, SGML provides an extremely powerful document management system.
- However, SGML has not been widely adopted due to its inherent complexity.
22XML
- XML attempts to provide a similar function to SGML, but is less complex and, at the same time, network-aware.
- XML retains key SGML advantages of extensibility, structure, and validation.
- Since XML is a restricted form of SGML, any fully compliant SGML system will be able to read XML documents (although the opposite is not true).
- XML is not intended as a replacement for SGML or HTML.
23XML (eXtensible Markup Language)
- origins: HTML + SGML (ISO Standard, 1986, 600 pp)
- W3C standard (26 pp): XML syntax + DTDs
- XML = HTML - presentational tags + user-defined DTD (tags + nesting)
- => a metalanguage for defining other languages via DTDs
- => XML is more like SGML than HTML
- XML = SGML - complexity, document perspective + simplicity, data exchange perspective
24Advantages of XML
- Simplicity
- Open standard and platform/vendor-independent
- Extensibility
- Reuse
- Separation of content and presentation
- Improved load balancing
25Advantages of XML
- Support for integration of data from multiple sources
- Ability to describe data from a wide variety of applications
- More advanced search engines
- New opportunities.
26Why are Database folks so excited about XML?
- XML is just a syntax for (self-describing) data
- This is still exciting because
- No standard syntax for relational data
- With XML, we can
- Translate any legacy data to XML
- Can exchange data in XML format
- Ship over the web, input to any application
27XML ≠ machine accessible meaning
This is what a web page in natural language looks like for a machine.
28XML ≠ machine accessible meaning
XML allows meaningful tags to be added to parts of the text.
29XML ≠ machine accessible meaning
But to your machine, the tags look like this.
30XML ≠ machine accessible meaning
Schemas help by relating common terms between documents.
[Figure: schema fragment with <CV> and private elements]
31But other people use other schemas
Someone else has one like this.
32But other people use other schemas
which don't fit in.
[Figure: second schema fragment with <CV> and private elements]
Moral: There is still a need for ontology mapping.
33An HTML document
34HTML code
- <title>ICS185/ICS180 - Spring, 2003</title>
- <body bgcolor="#d0d0ff">
- <H2>Index</H2>
- <UL>
- <LI> <a HREF="#announcements">Announcements</a>
- <LI> <a HREF="#geninfo">Course Information</a>
- </UL>
- <H2>Course Information</H2>
- <a href="geninfo.html">General Information</a>. The following are a few important entries
- <UL>
- <li> <A HREF="geninfo.html#goals">Course Goals</A><BR>
- <li> <A HREF="geninfo.html#crsenum">About the course numbers</A><BR>
- </UL>
- </body>
35What is the problem?
- To do more fancy things with documents
- need to make their logical structure explicit.
- Otherwise, software applications
- do not know what is what
- do not have any handle over documents.
36An XML document
- <?xml version="1.0" ?>
- <bib>
- <vendor id="id3_4">
- <name>QuickBooks</name>
- <email>booksales@quickbooks.com</email>
- <phone>1-800-333-9999</phone>
- <book>
- <title>Inorganic Chemistry</title>
- <publisher>Brooks/Cole Publishing</publisher>
- <year>1991</year>
- <author>
- <firstname>James</firstname>
- <lastname>Bowser</lastname>
- </author>
- <price>43.72</price>
- </book>
- </vendor>
- </bib>
37HTML describes presentation
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
XML describes content
<bibliography>
<book> <title> Foundations </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
</bibliography>
38What is XML?
- eXtensible Markup Language
- Data are identified using tags (identifiers enclosed in angle brackets <...>)
- Collectively, the tags are known as markup
- XML tags tell you what the data means, rather than how to display it
39XML versus relational
- Relational: structured
- XML: semi-structured
- Plain text file: unstructured
40How does XML work?
- XML allows developers to write their own Document Type Definitions (DTD)
- A DTD is a markup language's rule book that describes the set of tags and attributes that are used to describe specific content
- If you want to use a certain tag, then it must first be defined in the DTD
41Key Components in XML
- Three generic components, and one customizable
component
XML Content
XML Parser
Application
DTD Rules
42Meta Markup Language
- Not a language but a way of specifying other languages
- Meta-markup language gives the rules by which other markup languages can be written
- Portable - platform independent
43Markup Languages
- Presentation based
- Markup languages that describe information for presentation for human consumption
- Content based
- Describe information that is of interest to another computer application
44HTML and XML
- HTML tag says "display this data in bold font"
- <b>...</b>
- XML tag acts like a field name in your program
- It puts a label on a piece of data that identifies it
- <message>...</message>
45HTML vs. XML
- <bibliography>
- <book> <title> Foundations </title>
- <author> Abiteboul </author>
- <author> Hull </author>
- <author> Vianu </author>
- <publisher> Addison Wesley </publisher>
- <year> 1995 </year>
- </book>
- ...
- </bibliography>
- <h1> Bibliography </h1>
- <p> <i> Foundations of Databases </i>
- Abiteboul, Hull, Vianu
- <br> Addison Wesley, 1995
- <p> <i> Data on the Web </i>
- Abiteboul, Buneman, Suciu
- <br> Morgan Kaufmann, 1999
- Self-describing
- Schema info part of the data
- Good for data exchange (albeit baroque for storage)
46Simple Example
- XML data for a messaging application
- <message>
- <to>you@yourAddress.com</to>
- <from>me@myAddress.com</from>
- <text> Why is it good? Let me count the ways... </text>
- </message>
47Element
- Data between the tag and its matching end tag defines an element of the data
- Comment
- <!-- This is a comment -->
48Example
- <!-- Using attributes -->
- <message to="you@yourAddress.com" from="me@myAddress.com">
- <text>Why is it good? Let me count the ways...</text>
- </message>
49Attributes
- Tags can also contain attributes
- Attributes contain additional information included as part of the tag, within the tag's angle brackets
- Attribute name is followed by an equality sign and the attribute value
50Other Basics
- White space is essentially irrelevant
- Commas between attributes are not ignored - if present, they generate an error
- Case sensitive: message and MESSAGE are different
51Well Formed XML
- Every tag has a closing tag
- XML represents hierarchical data structures having one tag to contain others
- Tags have to be completely nested
- Correct
- <message>..<to>..</to>..</message>
- Incorrect
- <message>..<to>..</message>..</to>
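A quick way to see these rules in action is to hand both forms to a parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


nested = "<message><to>you</to></message>"       # tags completely nested
overlapping = "<message><to>you</message></to>"  # tags overlap
mixed_case = "<message>hello</MESSAGE>"          # names are case sensitive

results = [is_well_formed(doc) for doc in (nested, overlapping, mixed_case)]
```

Only the first string should be accepted; the other two fail because the end tag does not match the most recently opened start tag.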
52Empty Tag
- Empty tag is used when it makes sense to have a tag that stands by itself and doesn't enclose any content - a "flag"
- You can create an empty tag by ending it with />
- <flag/>
53Example
- <message to="you@yourAddress.com" from="me@myAddress.com" subject="XML is good">
- <flag/>
- <text> Why is it good? Let me count the ways...
- </text>
- </message>
54Tree representation
- <BOOKS>
- <book id="123" loc="library">
- <author>Hull</author>
- <title>California</title>
- <year> 1995 </year>
- </book>
- <article id="555" ref="123">
- <author>Su</author>
- <title> Purdue</title>
- </article>
- </BOOKS>
[Figure: tree representation of the document above]
55Prolog in XML Files
- XML file always starts with a prolog
- The minimal prolog contains a declaration that identifies the document as an XML document
- <?xml version="1.0"?>
- The declaration may also contain additional information
- version - version of the XML used in the data
- encoding - identifies the character set used
- standalone - whether the document references an external entity or data type specification
56Detailed Example of XML File
- Simple version of the kind of XML data you could use for a slide presentation
- You can use your text editor to create the data
- Step 1: create a file named slideSample01.xml
- Step 2: write the declaration, which identifies the file as an XML document
- <?xml version='1.0' encoding='us-ascii'?>
57Defining the Root Element
- Step 3: Adding a comment
- <!-- A SAMPLE set of slides -->
- Step 4: Defining the Root Element
- <slideshow>
- </slideshow>
- After the declaration, every XML file defines exactly one element, known as the root element
- Any other elements in the file are contained within that element
58Attributes
- A slide presentation has a title
- ...
- <slideshow
- title="Sample Slide Show">
- </slideshow>
59Adding Nested Elements
- Step 5: Adding Nested Elements
- <slideshow ...>
- <!-- TITLE SLIDE -->
- <slide title="Title of Talk"/>
- <!-- TITLE SLIDE -->
- <slide type="all">
- <title>Introduction to XML </title>
- </slide>
- </slideshow>
60Attribute vs. Element
- The type of the slide is defined as an attribute
- Slides could be earmarked for a mostly technical or mostly executive audience with type="tech" or type="exec", or identified as suitable for both with type="all"
- The title is defined as an element
- The title is something the audience will see
- So it is an element
- The type is something that never gets presented
- So it is an attribute
61Adding Text
- Step 6: Adding Text
- <slideshow>
- <!-- OVERVIEW -->
- <slide type="all"> <title>Overview</title>
- <item>Why is XML great?</item>
- <item>Who uses it?</item>
- </slide>
- </slideshow>
62Adding an Empty Element
- Step 7: Adding an Empty Element
- <slideshow>
- <!-- OVERVIEW -->
- <slide>
- <!-- define an empty list item -->
- <item/>
- </slide>
- </slideshow>
63Complete Example
- <?xml version="1.0" encoding="us-ascii" ?>
- <!-- A SAMPLE set of slides -->
- <slideshow title="Sample Slide Show">
- <!-- TITLE SLIDE -->
- <slide type="all">
- <title>Introduction to XML</title>
- </slide>
- <!-- OVERVIEW -->
- <slide type="all">
- <title>Overview</title>
- <item>Why is XML great?</item>
- <item />
- </slide>
- </slideshow>
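To show how an application might consume such a file, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The document is embedded as a byte string rather than read from slideSample01.xml, and the variable names are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# The slide-show document, held in memory for the sake of the example.
SLIDESHOW = b"""<?xml version="1.0" encoding="us-ascii" ?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>"""

root = ET.fromstring(SLIDESHOW)

# Attributes are read with get(); child elements are found by tag name.
show_title = root.get("title")
slide_titles = [slide.findtext("title") for slide in root.findall("slide")]

# An empty element such as <item /> has no text at all.
item_texts = [item.text for item in root.iter("item")]
```

Note the split the deck describes: the slide type is metadata read from an attribute, while the title is content read from a child element.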
64XML Parsing IE Example
65Processing Instructions
- An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data
- <?target instructions?>
- target is the name of the application that is expected to do the processing
- instructions is a string of characters that embodies the information or commands for the application to process
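As a rough illustration of the target/instructions split, the sketch below reads a processing instruction with Python's standard `xml.dom.minidom`. The target name `slideRenderer` and its instruction string are invented for the example.

```python
from xml.dom import minidom

# A document whose prolog carries one processing instruction.
doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?slideRenderer theme="plain"?>'
    "<slideshow/>"
)

# Collect (target, instructions) pairs from the top level of the document.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
```

The parser does not interpret the instruction string; it simply hands it over, and the named application decides what to do with it.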
66XML
67XML -Elements
- Elements, or tags, are the most common form of markup.
- First element must be a root element, which can contain other (sub)elements.
- XML document must have one root element (<STAFFLIST>). Element begins with start-tag (<STAFF>) and ends with end-tag (</STAFF>).
- XML elements are case sensitive.
- An element can be empty, in which case it can be abbreviated to <EMPTYELEMENT/>.
- Elements must be properly nested.
68XML - Attributes
- Attributes are name-value pairs that contain descriptive information about an element.
- Attribute is placed inside start-tag after corresponding element name with the attribute value enclosed in quotes.
- <STAFF branchNo = "B005">
- Could also have represented branch as subelement of STAFF.
- A given attribute may only occur once within a tag, while subelements with same tag may be repeated.
69Document Type Definition (DTD)
- A DTD specifies the types of tags that can be included in the XML document
- it defines which tags are valid, and in what arrangements
- it defines where text is expected, letting the parser determine whether the whitespace it sees is significant or ignorable
- An optional part of the document prolog
70XML document and DTD
[Figure: tree view of an XML document (slideshow, slide, title, and item nodes) shown beside its DTD]
- <?xml version='1.0' encoding='us-ascii'?>
- <!-- DTD for a simple "slide show". -->
- <!ELEMENT slideshow (slide+)>
- <!ELEMENT slide (title, item*)>
- <!ELEMENT title (#PCDATA)>
- <!ELEMENT item (#PCDATA | item)* >
71Detailed DTD Example
- Step 1: Create a file named slideshow.dtd
- Step 2: Enter an XML declaration
- <?xml version='1.0' encoding='us-ascii'?>
- <!-- DTD for a simple "slide show". -->
- Step 3: Specify the contents of a slideshow element
- <!-- slideshow contains one or more slide elements -->
- <!ELEMENT slideshow (slide+)>
72Qualifiers
- <?xml version='1.0' encoding='us-ascii'?>
- <!-- DTD for a simple example. -->
- <!ELEMENT slideshow (slide+)>
- slideshow element contains slide elements and nothing else
73Grouping multiple items
- ((image, title)+)
- Every image element must be paired with a title element
- Plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur
74Defining Text and Nested Elements
- Step 4: Defining Text and Nested Elements
- <!ELEMENT slide (title, item*)>
- <!ELEMENT title (#PCDATA)>
- <!ELEMENT item (#PCDATA | item)* >
- Text: Parsed Character DATA (PCDATA)
- "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name
75Complete Example
- <?xml version='1.0' encoding='us-ascii'?>
- <!-- DTD for a simple "slide show". -->
- <!ELEMENT slideshow (slide+)>
- <!ELEMENT slide (title, item*)>
- <!ELEMENT title (#PCDATA)>
- <!ELEMENT item (#PCDATA | item)* >
76Attribute Types
- (#PCDATA | item)
- Vertical bar (|) indicates an "or" condition
- In this case, either PCDATA or an item can occur
77What you cannot do?
- Double-definition for an item element doesn't work
- <!ELEMENT item (#PCDATA) >
- <!ELEMENT item (#PCDATA, item) >
- Produces a "duplicate definition" warning
- The second definition is ignored
78XML Names and NMTOKEN
- Name Characters are letters, digits, hyphens, underscores, colons or full stops.
- An NMTOKEN is any collection of Name Characters
- NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)
- Case is significant: PERSON and person are distinct names
- Attribute and Element names must be (a subset of) NMTOKEN, with restrictions:
- Names cannot begin with a digit
- Names cannot begin with xml (or any variant obtained by case changes), because this prefix is reserved for the system
79Element Declarations EMPTY
- The keyword ELEMENT introduces a new element: <!ELEMENT NAME CONTENT_MODEL>
- Element name must begin with a letter, and may additionally contain digits and some punctuation, i.e. ".", "-", "_", and ":" as we described earlier under NMTOKEN
- If an element can hold no child elements, and also no text, then it is known as an empty element and is denoted by EMPTY for CONTENT_MODEL
- This seems trivial but it isn't, because the presence or absence of this element in an XML file can be used as a flag
- As an example, we can find several in HTML, such as HR and IMG, which never have children and include no text. Here we would write <!ELEMENT HR EMPTY> and then <HR/> or <HR></HR> generates a horizontal line
- EMPTY elements can have attributes, such as the SRC attribute in <IMG/> to specify the source of the image.
80Element Declarations ANY
- An element declared to have a content of ANY may contain all of the other elements declared in the DTD
- This is not quite the same as no DTD for the file
- <!DOCTYPE fred [ <!ELEMENT fred ANY > ]>
- <fred> <people>Me and You</people> <people>Them</people></fred>
- Gets an error due to presence of the <people> tag
- Adding <!ELEMENT people ANY > inside the DTD declaration produces a valid document.
81Entities
- The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages.
- ENTITYs are defined in the DTD and come in several flavors
- General Entities are referenced as &EntName;
- Parameter Entities are referenced as %EntName;
- We have already seen the character entities
- &amp; for &
- &apos; for '
- &gt; for >
- &lt; for <
- &quot; for "
- These are built in, but you could add other such entities with
- <!ENTITY aitself "A" > and &aitself; would be replaced by A
82General Entities
- As another example, we can use in the DTD <!ENTITY TODAY "May 12 2003" > and <comment>&TODAY; was very quiet in Irvine</comment> is parsed as <comment>May 12 2003 was very quiet in Irvine</comment>
- General Entity references can be nested inside a DTD, e.g., one can write <!ENTITY YEAR "2003" > <!ENTITY TODAY "May 12 &YEAR;" >
- However, one must use Parameter Entities and not General Entities for macro substitution in other DTD declarations like <!ATTLIST and <!ELEMENT
- Parameter entities are defined as in <!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)" >
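The general-entity substitution described above can be observed with Python's standard `xml.etree.ElementTree`, which expands entities declared in an internal DTD subset. This is a sketch; wrapping the declarations in a DOCTYPE for a `comment` root element is an assumption made so the example is a complete document.

```python
import xml.etree.ElementTree as ET

# General entities declared in the internal DTD subset; TODAY nests YEAR.
DOCUMENT = """<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine</comment>"""

comment = ET.fromstring(DOCUMENT)
expanded = comment.text

# The five built-in entities need no declaration at all.
builtin = ET.fromstring("<sample>&lt;tag&gt; &amp; &quot;text&quot;</sample>").text
```

The application never sees the entity references themselves, only the replacement text, which is why the slides compare entities to macro substitution.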
83Parameter Entities
- <!ENTITY % peopletags "(firstname,lastname,dateofbirth)" > <!ELEMENT student %peopletags; > <!ELEMENT teacher %peopletags; > <!ELEMENT administrator %peopletags; >
- Defines a bunch of people ELEMENTS to have the same child elements
- Parameter entities are even more commonly used for attributes, because almost always several ELEMENTS share the same attributes (often with a basic set being augmented in different ways for different ELEMENTS)
- This basic set can be set in a parameter Entity
84Defining Implied Attributes
- Attributes must be declared in the DTD to be able to be used
- Implied means that this attribute is optional and there is no default value
- <!ELEMENT population (#PCDATA)>
- <!ATTLIST population year CDATA #IMPLIED>
- The attribute year can be defined or undefined in the element population. Valid examples:
- <population year="2000">80</population>
- <population>80</population>
85Defining Required Attributes
- <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #REQUIRED>
- The population must contain a year attribute
- <population year="1996">80</population>
- <!ELEMENT population (#PCDATA)> <!ATTLIST population year (2000|2001) #REQUIRED>
- The population must contain a year attribute of 2000 or 2001
- <population year="2000">80</population>
- No quotes on the enumeration values
86Defining Default Attributes
- <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA "2000">
- All these are valid
- <population year="2001">80</population>
- <population year="2000">80</population>
- <population>80</population>
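The effect of a default attribute value can be sketched with Python's standard `xml.parsers.expat` module, which passes the start-element handler both the attributes written in the document and those supplied by ATTLIST defaults in the internal subset. The `census` wrapper element is an assumption added so that two population elements fit in one document; expat here is non-validating, so this shows default filling only, not validation.

```python
import xml.parsers.expat

DOCUMENT = """<!DOCTYPE census [
  <!ELEMENT census (population*)>
  <!ELEMENT population (#PCDATA)>
  <!ATTLIST population year CDATA "2000">
]>
<census>
  <population year="2001">80</population>
  <population>80</population>
</census>"""

years = []


def start_element(name, attrs):
    # attrs holds written attributes plus any defaults taken from the DTD.
    if name == "population":
        years.append(attrs.get("year"))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOCUMENT, True)
```

The second population element carries no year in the text, so whatever value the handler sees for it comes from the declaration.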
87Defining Fixed Attributes
- <!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #FIXED "2000">
- Invalid: <population year="2001">80</population>
- Valid: <population year="2000">80</population>
- Valid: <population>80</population>
88Defining Unique Attributes
- <!ELEMENT animal (name)>
- <!ATTLIST animal code ID #REQUIRED>
- The code attribute has to be unique in the XML document
- <animal code="T50"><name>Lion</name> </animal>
- <animal code="T51"><name>Rabbit</name> </animal>
89Referring Unique Attributes
- <!ELEMENT website (url)>
- <!ATTLIST website animal_refer IDREF #REQUIRED>
- animal_refer attribute refers to a previously defined ID attribute
- <website animal_refer="T50">
- <url>http://www.lions.com</url>
- </website>
90Referring Multiple Unique Attributes
- <!ELEMENT website (url)>
- <!ATTLIST website contents IDREFS #REQUIRED>
- contents attribute contains a series of IDs
- <website contents="T50 T51">
- <url>http://www.animals.com</url>
- </website>
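What a validating parser checks for ID, IDREF, and IDREFS can be approximated by hand. The sketch below is not a DTD validator: it hard-codes the attribute names from the slides, and the `zoo` wrapper element, the second website, and the dangling reference `T99` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<zoo>
  <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
  <website contents="T50 T51"><url>http://www.animals.com</url></website>
  <website contents="T50 T99"><url>http://www.lions.com</url></website>
</zoo>"""


def check_ids(root):
    """Return (duplicate ids, referenced ids with no matching animal)."""
    seen, duplicates = set(), set()
    for animal in root.iter("animal"):
        code = animal.get("code")
        if code in seen:
            duplicates.add(code)
        seen.add(code)
    dangling = set()
    for site in root.iter("website"):
        # An IDREFS value is a whitespace-separated list of ids.
        for ref in site.get("contents", "").split():
            if ref not in seen:
                dangling.add(ref)
    return duplicates, dangling


duplicates, dangling = check_ids(ET.fromstring(DOCUMENT))
```

A validating parser would report both kinds of problem as validity errors: an ID value used twice, and an IDREF that points at no ID in the document.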
91XML Example - the DTD
- <!ELEMENT addressBook (person)>
- <!ELEMENT person (name, email, link?) >
- <!ATTLIST person id ID #REQUIRED >
- <!ATTLIST person gender (male|female) #IMPLIED>
- <!ELEMENT name (#PCDATA|family|given)*>
- <!ELEMENT family (#PCDATA)>
- <!ELEMENT given (#PCDATA)>
- <!ELEMENT email (#PCDATA)>
- <!ELEMENT link EMPTY >
- <!ATTLIST link manager IDREF #IMPLIED subordinates IDREF #IMPLIED>
92DOCTYPE declarations
- Internal: local definition of DTD
- External: reference to an external file
- Can combine both
93Internal DTD
- <?xml version="1.0" standalone="yes" ?>
- <!--open the DOCTYPE declaration -
- the open square bracket indicates an internal DTD-->
- <!DOCTYPE foo [
- <!--define the internal DTD-->
- <!ELEMENT foo (#PCDATA)>
- <!--close the DOCTYPE declaration-->
- ]>
- <foo>Hello World.</foo>
94Internal DTD rules
- The document type declaration must be placed between the XML declaration and the first element (root element) in the document.
- The keyword DOCTYPE must be followed by the name of the root element in the XML document.
- The keyword DOCTYPE must be in upper case.
95External DTD
- Useful for creating a common DTD that can be shared between multiple documents.
- Any changes that are made to the external DTD automatically update all the documents that reference it.
- Two types: private and public.
- Rules
- If any elements, attributes, or entities are used in the XML document that are referenced or defined in an external DTD, standalone="no" must be included in the XML declaration.
96"Private" External DTDs
- Identified by the keyword SYSTEM
- Intended for use by a single author or group of authors.
- Example
- <!DOCTYPE root_element SYSTEM "DTD_location">
- where DTD_location is a relative or absolute URL (such as http:// and file://).
97"Private" External DTDs (cont)
- XML document
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE document SYSTEM "subjects.dtd">
- <document> </document>
- subjects.dtd
- <!ELEMENT document ...>
98"Public" External DTDs
- Identified by the keyword PUBLIC
- Intended for broad use.
- <!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location"> where
- DTD_location: relative or absolute URL
- DTD_name: follows the syntax
- "prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier"
- "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".
99"Public" External DTDs (cont)
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
- <HTML>
- <HEAD>
- <TITLE>A typical HTML file</TITLE>
- </HEAD>
- <BODY>
- </BODY>
- </HTML>
100"Public" External DTDs (cont)
- Valid DTD_name prefixes:
- ISO: The DTD is an ISO standard. All ISO standards are approved.
- +: The DTD is an approved non-ISO standard.
- -: The DTD is an unapproved non-ISO standard.
101Combining Internal and External DTDs
- A document can use both internal and external DTD subsets.
- The internal DTD subset is specified between the square brackets of the DOCTYPE declaration.
- The declaration for the external DTD subset is placed before the square brackets, immediately after the SYSTEM keyword.
- Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid.
102Example
- <?xml version="1.0" standalone="no" ?>
- <!DOCTYPE document SYSTEM "subjects.dtd" [
- <!ATTLIST assessment assessment_type (exam | assignment | prac)>
- <!ELEMENT results (#PCDATA)>
- ]>
- subjects.dtd
- <!ELEMENT document (title,subjectID,subjectname,prerequisite?,classes,assessment,syllabus,textbooks)>
- <!ELEMENT prerequisite (subjectID,subjectname)>
103DTD Validation
- XML content can be well-formed but invalid under DTD rules
- e.g. DTD rule: <!ELEMENT name (#PCDATA)>
- Acceptable: <name> Giancarlo Succi </name>
- Unacceptable:
- <name>
- <first_name> Giancarlo </first_name>
- <last_name> Succi </last_name>
- </name>
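The gap between well-formed and valid can be sketched as follows. Python's standard `xml.etree.ElementTree` checks well-formedness only, so the one DTD rule from the slide is imitated by a hand-written function; `is_text_only` is a hypothetical stand-in for that rule, not a real validator.

```python
import xml.etree.ElementTree as ET


def is_text_only(element):
    """Mimic the rule <!ELEMENT name (#PCDATA)>: no child elements allowed."""
    return len(element) == 0


acceptable = ET.fromstring("<name> Giancarlo Succi </name>")
unacceptable = ET.fromstring(
    "<name><first_name> Giancarlo </first_name>"
    "<last_name> Succi </last_name></name>"
)

# Both documents parsed, so both are well-formed; only one obeys the rule.
verdicts = (is_text_only(acceptable), is_text_only(unacceptable))
```

In practice this check would be delegated to a validating parser that reads the DTD itself, rather than being coded by hand for each rule.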
104Beyond DTDs
- DTD limitations
- Simple document structures
- Lack of real datatypes
- Advanced schema languages
- XML Schema
- Relax NG