Semistructured Data and XML - PowerPoint PPT Presentation

Slides: 105
Provided by: thomas861
Learn more at: http://impact.asu.edu
Title: Semistructured Data and XML


1
Chapter 29
  • Semistructured Data and XML
  • Transparencies

2
Chapter - Objectives
  • What semistructured data is.
  • Concepts of the Object Exchange Model (OEM), a
    model for semistructured data.
  • Main language elements of XML.
  • Difference between well-formed and valid XML
    documents.
  • How Document Type Definitions (DTDs) can be used
    to define the valid syntax of an XML document.

3
Chapter - Objectives
  • About other related XML technologies.
  • Limitations of DTDs and how the W3C XML Schema
    overcomes these limitations.
  • How RDF and RDF Schema provide a foundation for
    processing meta-data.
  • Proposals for a W3C Query Language.

5
Introduction
  • XML 1.0 was formally ratified by W3C in 1998,
    yet it is set to impact every aspect of programming
    including graphical interfaces, embedded systems,
    distributed systems, and database management.
  • Already becoming de facto standard for data
    communication within software industry, and is
    quickly replacing EDI systems as primary medium
    for data interchange among businesses.
  • Some analysts believe it will become language in
    which most documents are created and stored, both
    on and off Internet.

6
Introduction
  • Due to nature of information on Web and inherent
    flexibility of XML, it is expected that much of the
    data encoded in XML will be semistructured, i.e.,
    data may be irregular or incomplete, and its
    structure may change rapidly or unpredictably.
  • Unfortunately, relational, object-oriented, and
    object-relational DBMSs do not handle data of
    this nature particularly well.

7
Semistructured Data
  • Data that may be irregular or incomplete and
    have a structure that may change rapidly or
    unpredictably.
  • Semistructured data is data that has some
    structure, but structure may not be rigid,
    regular, or complete.
  • Generally, the data does not conform to a fixed
    schema (sometimes the terms schema-less or
    self-describing are used to describe such data).

8
Semistructured Data
  • The information normally associated with a schema
    is contained within the data itself.
  • In some forms of semistructured data there is no
    separate schema, in others it exists but only
    places loose constraints on the data.
  • Unfortunately, relational, object-oriented, and
    object-relational DBMSs do not handle data of
    this nature particularly well.

9
Semistructured Data
  • Has gained importance recently for various
    reasons:
  • may be desirable to treat Web sources like a
    database, but cannot constrain these sources with
    a schema
  • may be desirable to have a flexible format for
    data exchange between disparate databases
  • emergence of XML as standard for data
    representation and exchange on the Web, and
    similarity between XML documents and
    semistructured data.

10
Example 29.1
11
Example 29.1
  • Note, data is not regular:
  • for John White, hold first and last names, but
    for Ann Beech store single name and also store a
    salary
  • for property at 2 Manor Rd, store a monthly rent
    whereas for property at 18 Dale Rd, store an
    annual rent
  • for property at 2 Manor Rd, store property type
    (flat) as a string, whereas for property at 18
    Dale Rd, store type (house) as an integer value.

12
Example 29.1
13
Object Exchange Model (OEM)
  • Data in OEM is schema-less and self-describing,
    and can be thought of as a labeled directed graph
    where nodes are objects, consisting of:
  • unique object identifier (for example, 7),
  • descriptive textual label (street),
  • type (string),
  • a value (22 Deer Rd).
  • Objects are decomposed into atomic and complex:
  • an atomic object contains a value for a base type
    (e.g., integer or string) and can be recognized in
    the diagram as one that has no outgoing edges.
  • All other objects are complex objects whose type
    is a set of object identifiers.

14
Object Exchange Model (OEM)
  • A label indicates what the object represents and
    is used to identify the object and to convey the
    meaning of the object, and so should be as
    informative as possible.
  • Labels can change dynamically.
  • A name is a special label that serves as an alias
    for a single object and acts as an entry point
    into the database (for example, DreamHome is a
    name that denotes object 1).

15
Object Exchange Model (OEM)
  • An OEM object can be considered as a quadruple
    (label, oid, type, value).
  • For example:
  • {Staff, 4, set, {9, 10}}
  • {name, 9, string, "Ann Beech"}
  • {salary, 10, decimal, 12000}
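
The quadruples above can be sketched in Python. This is only an illustration under stated assumptions: the class name `OEMObject`, the `objects` dictionary, and the helper functions are hypothetical and not part of OEM itself; a complex object's value is modelled as a tuple of child oids.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    label: str
    oid: int
    type: str  # "set" for complex objects, otherwise a base type name
    value: Union[str, int, float, tuple]  # atomic value, or tuple of child oids


# The three objects from the slide, keyed by object identifier.
objects = {
    4: OEMObject("Staff", 4, "set", (9, 10)),
    9: OEMObject("name", 9, "string", "Ann Beech"),
    10: OEMObject("salary", 10, "decimal", 12000),
}


def is_atomic(obj):
    """Atomic objects hold a base-type value and have no outgoing edges."""
    return obj.type != "set"


def children(obj):
    """Follow the outgoing edges of a complex object."""
    return [] if is_atomic(obj) else [objects[oid] for oid in obj.value]


child_labels = [child.label for child in children(objects[4])]
```

Here `child_labels` lists the labels reachable in one step from the Staff object, which mirrors following edges in the labeled directed graph.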

16
Semistructured Data - Case Study: Object Exchange Model
17
OEM Features
  • Common model for heterogeneous information
    exchange, self-describing
  • Each object has four fields: OID, Label, Type, Value
  • OID: unique identifier or NULL
  • Label: character string descriptor
  • Type: atomic data type or set
  • Value: atomic value or set of object references
  • Help pages for labels
  • Query language: OEM-QL

18
Representing Semistructured Data Using OEM
<collection, b1, a1, ...>
  b1: <book, t, a>
    t: <title, "Database and ...">
    a: <author, n, p>
      n: <name, "Jeff Ullman">
      p: <picture, "/gifs/ullman.gif">
  a1: <article, v, w, x>
    v: <author, "Gio Wiederhold">
    w: <title, "Mediators in the ...">
    x: <journal, "IEEE Computer">
...
(Figure annotations: Label, Set Value, Memory Addresses, Atomic Value)
19
An OEM Query Language OEM-QL
  • Logic-based language for OEM
  • Match object patterns, generate variable
    bindings, construct new OEM objects from existing
    ones
  • Get articles published in IEEE Computer:
  • P :-
  • P:<articles <journal "IEEE Computer">>
  • Get titles of books by Jeff Ullman:
  • <answer_title T> :-
  • <book <author "Jeff Ullman"> <title T>>

20
XML
  • Vendors introduced some browser-specific HTML
    tags, making it difficult to develop
    sophisticated, widely viewable Web documents.
  • W3C has produced new standard called XML, which
    could preserve general application independence
    that makes HTML portable and powerful.

21
XML
  • XML is a restricted version of SGML, designed
    especially for Web documents.
  • SGML allows a document to be logically separated
    into two parts: one that defines the structure of the
    document (DTD), the other containing the text itself.
  • By giving documents a separately defined
    structure, and by giving authors ability to
    define custom structures, SGML provides extremely
    powerful document management system.
  • However, SGML has not been widely adopted due to
    its inherent complexity.

22
XML
  • XML attempts to provide a similar function to
    SGML, but is less complex and, at same time,
    network-aware.
  • XML retains key SGML advantages of extensibility,
    structure, and validation.
  • Since XML is a restricted form of SGML, any fully
    compliant SGML system will be able to read XML
    documents (although the opposite is not true).
  • XML is not intended as a replacement for SGML or
    HTML.

23
XML (eXtensible Markup Language)
  • Origins: HTML + SGML (ISO Standard, 1986,
    600 pp)
  • W3C standard (26 pp): XML syntax + DTDs
  • XML = HTML - presentational tags
    + user-defined DTD (tags + nesting)
  • => a metalanguage for defining other languages
    via DTDs
  • => XML is more like SGML than HTML
  • XML = SGML - complexity, document perspective
    + simplicity, data exchange perspective

24
Advantages of XML
  • Simplicity
  • Open standard and platform/vendor-independent
  • Extensibility
  • Reuse
  • Separation of content and presentation
  • Improved load balancing

25
Advantages of XML
  • Support for integration of data from multiple
    sources
  • Ability to describe data from a wide variety of
    applications
  • More advanced search engines
  • New opportunities.

26
Why are Database folks so excited about XML?
  • XML is just a syntax for (self-describing) data
  • This is still exciting because
  • No standard syntax for relational data
  • With XML, we can
  • Translate any legacy data to XML
  • Can exchange data in XML format
  • Ship over the web, input to any application

27
XML ≠ machine accessible meaning
This is what a web-page in natural language
looks like for a machine
28
XML ≠ machine accessible meaning
XML allows meaningful tags to be added to parts
of the text
29
XML ≠ machine accessible meaning
But to your machine, the tags look like this.
30
XML ≠ machine accessible meaning
Schemas help
by relating common terms between documents.
(Figure labels: <CV>, private)
31
But other people use other schemas
Someone else has one like this.
32
But other people use other schemas
which don't fit in.
(Figure labels: <CV>, private)
Moral: There is still need for
ontology mapping.
33
An HTML document
34
HTML code
  • <title>ICS185/ICS180 - Spring, 2003</title>
  • <body bgcolor="#d0d0ff">
  • <H2>Index</H2>
  • <UL>
  • <LI> <a HREF="#announcements">Announcements
    </a>
  • <LI> <a HREF="#geninfo">Course Information
    </a>
  • </UL>
  • <H2>Course Information</H2>
  • <a href="geninfo.html">General Information</a>.
    The following are a
  • few important entries
  • <UL>
  • <li> <A HREF="geninfo.html#goals">Course
    Goals</A><BR>
  • <li> <A HREF="geninfo.html#crsenum">About
    the course numbers</A><BR>
  • </UL>
  • </body>

35
What is the problem?
  • To do more fancy things with documents
  • need to make their logical structure explicit.
  • Otherwise, software applications
  • do not know what is what
  • do not have any handle over documents.

36
An XML document
  • <?xml version="1.0" ?>
  • <bib>
  • <vendor id="id3_4">
  • <name>QuickBooks</name>
  • <email>booksales@quickbooks.com</email>
  • <phone>1-800-333-9999</phone>
  • <book>
  • <title>Inorganic Chemistry</title>
  • <publisher>Brooks/Cole Publishing</publisher>
  • <year>1991</year>
  • <author>
  •   <firstname>James</firstname>
  •   <lastname>Bowser</lastname>
  • </author>
  • <price>43.72</price>
  • </book>
  • </vendor>
  • </bib>
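
As a rough illustration of how an application might read this document, the sketch below parses it with Python's standard `xml.etree.ElementTree` module. The variable names (`BIB_XML`, `summary`) are made up for this example, and the choice of Python is an assumption; the slides do not prescribe a parser.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <email>booksales@quickbooks.com</email>
    <phone>1-800-333-9999</phone>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>"""

root = ET.fromstring(BIB_XML)      # root is the <bib> element
vendor = root.find("vendor")       # first <vendor> child
book = vendor.find("book")

# Tags act as field names: each value is looked up by the path of its labels.
summary = {
    "vendor_id": vendor.get("id"),
    "title": book.findtext("title"),
    "author": book.findtext("author/firstname") + " " + book.findtext("author/lastname"),
    "price": float(book.findtext("price")),
}
```

Because the tags describe content rather than presentation, the program can pick out the price or the author without guessing from layout.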

37
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
</bibliography>
HTML describes presentation
XML describes content
38
What is XML?
  • eXtensible Markup Language
  • Data are identified using tags (identifiers
    enclosed in angle brackets <...>)
  • Collectively, the tags are known as markup
  • XML tags tell you what the data means, rather
    than how to display it

39
XML versus relational
  • Relational: structured
  • XML: semi-structured
  • Plain text file: unstructured

40
How does XML work?
  • XML allows developers to write their own Document
    Type Definitions (DTD)
  • A DTD is a markup language's rule book that
    describes the set of tags and attributes that are
    used to describe specific content
  • If you want to use a certain tag, then it must
    first be defined in DTD

41
Key Components in XML
  • Three generic components, and one customizable
    component

XML Content
XML Parser
Application
DTD Rules
42
Meta Markup Language
  • Not a language but a way of specifying other
    languages
  • Meta-markup language gives the rules by which
    other markup languages can be written
  • Portable - platform independent

43
Markup Languages
  • Presentation based
  • Markup languages that describe information for
    presentation for human consumption
  • Content based
  • Describe information that is of interest to
    another computer application

44
HTML and XML
  • HTML tag says "display this data in bold font"
  • <b>...</b>
  • XML tag acts like a field name in your program
  • It puts a label on a piece of data that
    identifies it
  • <message>...</message>

45
HTML vs. XML
  • <bibliography>
  • <book> <title> Foundations </title>
  • <author> Abiteboul </author>
  • <author> Hull </author>
  • <author> Vianu </author>
  • <publisher> Addison Wesley
    </publisher>
  • <year> 1995 </year>
  • </book>
  • </bibliography>
  • <h1> Bibliography </h1>
  • <p> <i> Foundations of Databases </i>
  • Abiteboul, Hull, Vianu
  • <br> Addison Wesley, 1995
  • <p> <i> Data on the Web </i>
  • Abiteboul, Buneman, Suciu
  • <br> Morgan Kaufmann, 1999

Self-describing: schema info is part of the
data; good for data exchange (albeit
baroque for storage)
46
Simple Example
  • XML data for a messaging application
  • <message>
  • <to>you@yourAddress.com</to>
  • <from>me@myAddress.com</from>
  • <text> Why is it good? Let me count
    the ways... </text>
  • </message>

47
Element
  • Data between the tag and its matching end tag
    defines an element of the data
  • Comment
  • <!-- This is a comment -->

48
Example
  • <!-- Using attributes -->
  • <message to="you@yourAddress.com"
    from="me@myAddress.com">
  • <text>Why is it good? Let me count the
    ways...</text>
  • </message>

49
Attributes
  • Tags can also contain attributes
  • Attributes contain additional information
    included as part of the tag, within the tag's
    angle brackets
  • Attribute name is followed by an equality sign
    and the attribute value

50
Other Basics
  • White space is essentially irrelevant
  • Commas between attributes are not ignored - if
    present, they generate an error
  • Case sensitive: message and MESSAGE are
    different

51
Well Formed XML
  • Every tag has a closing tag
  • XML represents hierarchical data structures
    having one tag to contain others
  • Tags have to be completely nested
  • Correct
  • <message>..<to>..</to>..</message>
  • Incorrect
  • <message>..<to>..</message>..</to>
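
The well-formedness rules can be tried out with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well formed; the helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


results = {
    # Tags completely nested: accepted.
    "nested": is_well_formed("<message><to>you</to></message>"),
    # Overlapping tags: rejected.
    "overlapping": is_well_formed("<message><to>you</message></to>"),
    # Names are case sensitive, so the end tag does not match.
    "case_mismatch": is_well_formed("<message></MESSAGE>"),
    # A comma between attributes is an error.
    "comma_between_attributes": is_well_formed('<message to="a", from="b"/>'),
}
```

Only the first sample passes; the other three fail for the reasons listed on this slide and the previous one.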

52
Empty Tag
  • Empty tag is used when it makes sense to have a
    tag that stands by itself and doesn't enclose any
    content - a "flag"
  • You can create an empty tag by ending it with />
  • <flag/>

53
Example
  • <message to="you@yourAddress.com"
    from="me@myAddress.com" subject="XML is good">
    <flag/> <text> Why is it good? Let me count the
    ways...
  • </text>
  • </message>

54
Tree representation
  • <BOOKS>
  • <book id="123" loc="library">
  • <author>Hull</author>
  • <title>California</title>
  • <year> 1995 </year>
  • </book>
  • <article id="555" ref="123">
  • <author>Su</author>
  • <title> Purdue</title>
  • </article>
  • </BOOKS>
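
To make the tree view concrete, here is a small sketch that walks the parsed document and emits one indented line per node. It assumes Python's standard `xml.etree.ElementTree`; the function name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<BOOKS>
  <book id="123" loc="library">
    <author>Hull</author>
    <title>California</title>
    <year> 1995 </year>
  </book>
  <article id="555" ref="123">
    <author>Su</author>
    <title> Purdue</title>
  </article>
</BOOKS>"""


def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, stripped text."""
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    text = (element.text or "").strip()
    if text:
        line += ": " + text
    lines = [line]
    for child in element:  # iterating an element yields its children in order
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(BOOKS_XML))
```

Each level of indentation corresponds to one level of nesting, so `BOOKS` is the root, `book` and `article` are its children, and the leaves carry the text.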

55
Prolog in XML Files
  • XML file always starts with a prolog
  • The minimal prolog contains a declaration that
    identifies the document as an XML document
  • <?xml version="1.0"?>
  • The declaration may also contain additional
    information
  • version - version of the XML used in the data
  • encoding - Identifies the character set used
  • standalone - whether the document references an
    external entity or data type specification

56
Detailed Example of XML File
  • simple version of the kind of XML data you could
    use for a slide presentation
  • You can use your text editor to create the data
  • Step 1: create a file named slideSample01.xml
  • Step 2: write the declaration, which identifies
    the file as an XML document
  • <?xml version='1.0' encoding='us-ascii'?>

57
Defining the Root Element
  • Step 3: Adding a comment
  • <!-- A SAMPLE set of slides -->
  • Step 4: Defining the Root Element
  • <slideshow>
  • </slideshow>
  • After the declaration, every XML file defines
    exactly one element, known as the root element
  • Any other elements in the file are contained
    within that element

58
Attributes
  • A slide presentation has a title
  • ...
  • <slideshow
  • title="Sample Slide Show">
  • </slideshow>

59
Adding Nested Elements
  • Step 5: Adding Nested Elements
  • <slideshow
  • ... >
  • <!-- TITLE SLIDE -->
  • <slide title="Title of Talk"/>
  • <!-- TITLE SLIDE -->
  • <slide type="all">
  • <title>Introduction to XML </title>
  • </slide>
  • </slideshow>

60
Attribute vs. Element
  • type of the slide is defined as an attribute
  • Slides could be earmarked for a mostly technical
    or mostly executive audience with type="tech" or
    type="exec", or identified as suitable for both
    with type="all"
  • title of the slide is defined as an element
  • The title is something the audience will see
  • So it is an element
  • The type is something that never gets presented
  • So it is an attribute

61
Adding Text
  • Step 6: Adding Text
  • <slideshow>
  • <!-- OVERVIEW -->
  • <slide type="all"> <title>Overview</title>
  • <item>Why is XML great?</item>
  • <item>Who uses it?</item>
  • </slide>
  • </slideshow>

62
Adding an Empty Element
  • Step 7: Adding an Empty Element
  • <slideshow>
  • <!-- OVERVIEW -->
  • <slide>
  • <!-- define an empty list item -->
  • <item/>
  • </slide>
  • </slideshow>

63
Complete Example
  • <?xml version="1.0" encoding="us-ascii" ?>
  • <!-- A SAMPLE set of slides -->
  • <slideshow title="Sample Slide Show">
  • <!-- TITLE SLIDE -->
  • <slide type="all">
  • <title>Introduction to XML</title>
  • </slide>
  • <!-- OVERVIEW -->
  • <slide type="all">
  • <title>Overview</title>
  • <item>Why is XML great?</item>
  • <item />
  • </slide>
  • </slideshow>
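
The complete example can be loaded and inspected as follows. This is a sketch assuming Python's standard `xml.etree.ElementTree`; the slide title text "Introduction to XML" follows the earlier nested-elements step, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SLIDESHOW_XML = """<?xml version="1.0" encoding="us-ascii" ?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>"""

# Pass bytes so the declared us-ascii encoding is what the parser actually sees.
show = ET.fromstring(SLIDESHOW_XML.encode("ascii"))

slides = [
    {
        "type": slide.get("type"),          # attribute: never shown to the audience
        "title": slide.findtext("title"),   # element: shown to the audience
        # An empty element such as <item /> has no text, so .text is None.
        "items": [item.text for item in slide.findall("item")],
    }
    for slide in show.findall("slide")
]
```

Comments are skipped by the default parser, so only the root element, its attribute, and the two slides survive into the tree.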

64
XML Parsing IE Example
65
Processing Instructions
  • An XML file can also contain processing
    instructions that give commands or information to
    an application that is processing the XML data
  • <?target instructions?>
  • target is the name of the application that is
    expected to do the processing
  • instructions is a string of characters that
    embodies the information or commands for the
    application to process
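
A short sketch of how an application could pick up processing instructions, assuming Python's standard `xml.dom.minidom` (which keeps them as nodes). The `<slideshow/>` root is just a placeholder document for this example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    "<?target instructions?>"
    "<slideshow/>"
)

# Collect (target, instructions) pairs from the top level of the document.
# The XML declaration is not reported as a processing-instruction node.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
```

An application would typically compare `target` with its own name and ignore instructions addressed to other applications.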

66
XML
67
XML -Elements
  • Elements, or tags, are most common form of
    markup.
  • First element must be a root element, which can
    contain other (sub)elements.
  • XML document must have one root element
    (<STAFFLIST>). Element begins with start-tag
    (<STAFF>) and ends with end-tag (</STAFF>).
  • XML elements are case sensitive
  • An element can be empty, in which case it can be
    abbreviated to <EMPTYELEMENT/>.
  • Elements must be properly nested.

68
XML - Attributes
  • Attributes are name-value pairs that contain
    descriptive information about an element.
  • Attribute is placed inside start-tag after
    corresponding element name with the attribute
    value enclosed in quotes.
  • <STAFF branchNo = "B005">
  • Could also have represented branch as subelement
    of STAFF.
  • A given attribute may only occur once within a
    tag, while subelements with same tag may be
    repeated.

69
Document Type Definition (DTD)
  • DTD specifies the types of tags that can be
    included in the XML document
  • it defines which tags are valid, and in what
    arrangements
  • where text is expected, letting the parser
    determine whether the whitespace it sees is
    significant or ignorable
  • An optional part of the document prolog

70
XML document and DTD
(Figure: the XML DTD for a slideshow shown next to an XML document tree with slideshow, slide, title, and item nodes)
  • <?xml version='1.0' encoding='us-ascii'?>
  • <!-- DTD for a simple "slide show". -->
  • <!ELEMENT slideshow (slide+)>
  • <!ELEMENT slide (title, item*)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT item (#PCDATA | item)* >
71
Detailed DTD Example
  • Step 1: Create a file named slideshow.dtd
  • Step 2: Enter an XML declaration
  • <?xml version='1.0' encoding='us-ascii'?>
  • <!-- DTD for a simple "slide show". -->
  • Step 3: Specify the contents of a slideshow element
  • <!-- slideshow contains one or more slide elements -->
  • <!ELEMENT slideshow (slide+)>

72
Qualifiers
  • <?xml version='1.0' encoding='us-ascii'?>
  • <!-- DTD for a simple example. -->
  • <!ELEMENT slideshow (slide+)>
  • slideshow element contains slide elements and
    nothing else

73
Grouping multiple items
  • ((image, title)+)
  • Every image element must be paired with a title
    element
  • Plus sign applies to the image/title pair to
    indicate that one or more pairs of the specified
    items can occur

74
Defining Text and Nested Elements
  • Step 4: Defining Text and Nested Elements
  • <!ELEMENT slide (title, item*)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT item (#PCDATA | item)* >
  • Text: Parsed Character DATA (#PCDATA)
  • "#" that precedes PCDATA indicates that what
    follows is a special word, rather than an element
    name

75
Complete Example
  • <?xml version='1.0' encoding='us-ascii'?>
  • <!-- DTD for a simple "slide show". -->
  • <!ELEMENT slideshow (slide+)>
  • <!ELEMENT slide (title, item*)>
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT item (#PCDATA | item)* >
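
One way to see what element content models mean is to treat them as regular expressions over the sequence of child tags. The sketch below is an illustration, not a DTD validator: it assumes the qualifiers read `(slide+)` and `(title, item*)`, the two models are translated to regexes by hand, and `CONTENT_MODELS` and `children_match` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models (element children only, text is ignored).
CONTENT_MODELS = {
    "slideshow": r"(slide )+",   # (slide+): one or more slide elements
    "slide": r"title (item )*",  # (title, item*): a title, then zero or more items
}


def children_match(element):
    """Check an element's child-tag sequence against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:  # no model recorded for this tag: nothing to check here
        return True
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, sequence) is not None


good = ET.fromstring(
    "<slideshow><slide><title>T</title><item>a</item></slide></slideshow>"
)
bad = ET.fromstring(
    "<slideshow><slide><item>a</item><title>T</title></slide></slideshow>"
)

good_ok = all(children_match(e) for e in good.iter())
bad_ok = all(children_match(e) for e in bad.iter())
```

Both documents are well formed, but only the first respects the ordering that the comma in `(title, item*)` imposes.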

76
Attribute Types
  • (#PCDATA | item)
  • Vertical bar (|) indicates an or condition
  • In this case, either PCDATA or an item can occur

77
What you cannot do?
  • Double-definition for an item element doesn't
    work
  • <!ELEMENT item (#PCDATA) >
  • <!ELEMENT item (#PCDATA, item) >
  • Produces a "duplicate definition" warning
  • The second definition is ignored

78
XML Names and NMTOKEN
  • Name Characters are letters, digits, hyphens,
    underscores, colons or full stops.
  • An NMTOKEN is any collection of Name Characters
  • NMTOKENS is any list of NMTOKENs separated by
    white space (space, tab, newline etc.)
  • Case is significant: PERSON and person are
    distinct names
  • Attribute and Element names must be (a subset of)
    NMTOKEN with restrictions:
  • Names cannot begin with a digit
  • Names cannot begin with xml (or any variant
    obtained by case changes); the system reserves this
    prefix
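
These naming rules can be approximated with two regular expressions. The sketch is deliberately simplified and ASCII-only (real XML names also allow many non-ASCII letters), and it additionally requires a name to start with a letter, underscore, or colon, which is the XML rule behind "cannot begin with a digit". The function names are hypothetical.

```python
import re

# Name characters per the slide: letters, digits, hyphens, underscores,
# colons, full stops (ASCII subset only in this sketch).
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")

# A name is an NMTOKEN with a restricted first character.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")


def is_nmtoken(s):
    """Any non-empty run of name characters."""
    return NMTOKEN_RE.fullmatch(s) is not None


def is_user_name(s):
    """A name usable for elements/attributes: valid start, no reserved xml prefix."""
    return NAME_RE.fullmatch(s) is not None and not s.lower().startswith("xml")


def is_nmtokens(s):
    """A whitespace-separated list of NMTOKENs."""
    tokens = s.split()
    return bool(tokens) and all(is_nmtoken(t) for t in tokens)
```

For example, `2003` is an NMTOKEN but not a name, and `XmlThing` is rejected because of the reserved prefix.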

79
Element Declarations EMPTY
  • Keyword ELEMENT introduces a new element:
    <!ELEMENT NAME CONTENT_MODEL>
  • Element name must begin with a letter, and may
    additionally contain digits and some
    punctuation, i.e. ".", "-", "_", and ":" as we
    described earlier under NMTOKEN
  • If an element can hold no child elements, and
    also no text, then it is known as an empty element
    and denoted by EMPTY for CONTENT_MODEL
  • This seems trivial but it isn't, because the
    presence or absence of this element in an XML file
    can be used as a flag
  • As an example we can find several in HTML such as
    HR and IMG which never have children and include
    no text. Here we would write <!ELEMENT HR EMPTY>
    and then <HR/> or <HR></HR> generates a
    horizontal line
  • EMPTY elements can have attributes such as the
    SRC attribute in <IMG/> to specify source of
    image.

80
Element Declarations ANY
  • An element declared to have a content of ANY may
    contain all of the other elements declared in the
    DTD
  • This is not quite the same as no DTD for the file
  • <!DOCTYPE fred [ <!ELEMENT fred ANY > ]>
  • <fred> <people>Me and You</people> <people>Them
    </people></fred>
  • Gets an error due to presence of <people> tag
  • Adding <!ELEMENT people ANY > inside DTD
    declaration produces a valid document.

81
Entities
  • The DTD of an XML document can contain entity
    declarations. These are like macro substitutions
    in other languages.
  • ENTITYs are defined in the DTD and come in
    several flavors
  • General Entities are referenced as &EntName;
  • Parameter Entities are referenced as %EntName;
  • We have already seen the character entities
  • &amp; for &
  • &apos; for '
  • &gt; for >
  • &lt; for <
  • &quot; for "
  • These are built in but you could add other such
    entities with
  • <!ENTITY aitself "A" > and &aitself; would be
    replaced by A

82
General Entities
  • As another example, we can use in the DTD
    <!ENTITY TODAY "May 12 2003" > and
    <comment>&TODAY; was very quiet in Irvine</comment>
    is parsed as <comment>May 12 2003 was very quiet in
    Irvine</comment>
  • General Entity references can be nested inside a
    DTD, e.g., one can write <!ENTITY YEAR "2003" >
    <!ENTITY TODAY "May 12 &YEAR;" >
  • However one must use Parameter Entities and not
    General Entities for macro substitution in other
    DTD declarations like <!ATTLIST and <!ELEMENT
  • Parameter entities are defined as in <!ENTITY %
    CUSTARDTAGS "(NAME,DATE,ORDERS)" >
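
General entity expansion can be observed directly with a parser that reads the internal DTD subset. The sketch below assumes Python's standard `xml.etree.ElementTree` (built on expat) and uses the TODAY and YEAR entities described above, with the nested form; the trailing "&amp; nearby" text is added only to show a built-in entity alongside the user-defined ones.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine &amp; nearby</comment>"""

# The parser replaces &TODAY; (and the nested &YEAR;) before the
# application ever sees the text; &amp; is one of the built-in entities.
expanded = ET.fromstring(DOC).text
```

The application only ever receives the substituted text, which is why entities behave like macros.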

83
Parameter Entities
  • <!ENTITY % peopletags "(firstname,lastname,dateofbirth)" >
    <!ELEMENT student %peopletags; >
    <!ELEMENT teacher %peopletags; >
    <!ELEMENT administrator %peopletags; >
  • Defines a bunch of people ELEMENTS to have the
    same child elements
  • Parameter entities are even more commonly used
    for attributes because almost always several
    ELEMENTS share the same attributes (with often a
    basic set being augmented in different ways for
    different ELEMENTS)
  • This basic set can be set in a parameter Entity

84
Defining Implied Attributes
  • Attributes must be declared in the DTD to be able
    to be used
  • Implied means that this attribute is optional and
    there is no default value
  • <!ELEMENT population (#PCDATA)>
  • <!ATTLIST population year CDATA #IMPLIED>
  • The attribute year can be defined or undefined in
    the element population. Valid examples:
  • <population year="2000">80</population>
  • <population>80</population>

85
Defining Required Attributes
  • <!ELEMENT population (#PCDATA)> <!ATTLIST
    population year CDATA #REQUIRED>
  • The population must contain a year attribute
  • <population year="1996">80</population>
  • <!ELEMENT population (#PCDATA)> <!ATTLIST
    population year (2000|2001) #REQUIRED>
  • The population must contain a year attribute of
    2000 or 2001
  • <population year="2000">80</population>
  • No quotes on the enumeration values

86
Defining Default Attributes
  • <!ELEMENT population (#PCDATA)> <!ATTLIST
    population year CDATA "2000">
  • All these are valid
  • <population year="2001">80</population>
  • <population year="2000">80</population>
  • <population>80</population>
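
The effect of a default value can be sketched with Python's standard `xml.etree.ElementTree`. The underlying expat parser does not validate, but it does read ATTLIST declarations in the internal DTD subset and, by default, reports attributes derived from them. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring a default for the year attribute.
DTD = '<!DOCTYPE population [<!ATTLIST population year CDATA "2000">]>'

explicit = ET.fromstring(DTD + '<population year="2001">80</population>')
defaulted = ET.fromstring(DTD + "<population>80</population>")

# An explicit value wins; a missing attribute is filled in from the DTD.
years = (explicit.get("year"), defaulted.get("year"))
```

Because the parser is non-validating, this sketch would not reject a #REQUIRED attribute that is missing or a #FIXED attribute with the wrong value; that takes a validating parser.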

87
Defining Fixed Attributes
  • <!ELEMENT population (#PCDATA)> <!ATTLIST
    population year CDATA #FIXED "2000">
  • Invalid: <population year="2001">80</population>
  • Valid: <population year="2000">80</population>
  • Valid: <population>80</population>

88
Defining Unique Attributes
  • <!ELEMENT animal (name)>
  • <!ATTLIST animal code ID #REQUIRED>
  • The code attribute has to be unique in the XML
    document
  • <animal code="T50"><name>Lion</name> </animal>

    <animal code="T51"><name>Rabbit</name> </animal>

89
Referring Unique Attributes
  • <!ELEMENT website (url)>
    <!ATTLIST website animal_refer IDREF #REQUIRED>
  • animal_refer attribute refers to previous ID
    attribute defined
  • <website animal_refer="T50">
    <url>http://www.lions.com</url>
    </website>

90
Referring Multiple Unique Attributes
  • <!ELEMENT website (url)>
    <!ATTLIST website contents IDREFS #REQUIRED>
  • contents attribute contains a series of IDs
  • <website contents="T50 T51">
    <url>http://www.animals.com</url>
    </website>
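
The ID, IDREF, and IDREFS constraints amount to two checks: every ID value is unique, and every reference points at an existing ID. The sketch below performs those checks by hand in Python; it is not a DTD validator. The attribute tables are copied manually from the declarations above, and the `<zoo>` wrapper element and the dangling `T99` reference are invented for the example.

```python
import xml.etree.ElementTree as ET

# (element, attribute) pairs, copied by hand from the ATTLIST declarations.
ID_ATTRS = {("animal", "code")}
IDREF_ATTRS = {("website", "animal_refer")}
IDREFS_ATTRS = {("website", "contents")}


def check_ids(root):
    """Return a list of ID/IDREF problems found in the tree."""
    problems, seen = [], set()
    # First pass: collect IDs and report duplicates.
    for el in root.iter():
        for tag, attr in ID_ATTRS:
            if el.tag == tag and attr in el.attrib:
                value = el.attrib[attr]
                if value in seen:
                    problems.append("duplicate ID " + value)
                seen.add(value)
    # Second pass: every reference must name a collected ID.
    for el in root.iter():
        refs = []
        for tag, attr in IDREF_ATTRS:
            if el.tag == tag and attr in el.attrib:
                refs.append(el.attrib[attr])
        for tag, attr in IDREFS_ATTRS:
            if el.tag == tag and attr in el.attrib:
                refs.extend(el.attrib[attr].split())  # whitespace-separated list
        problems.extend("dangling reference " + r for r in refs if r not in seen)
    return problems


zoo = ET.fromstring(
    "<zoo>"
    '<animal code="T50"><name>Lion</name></animal>'
    '<animal code="T51"><name>Rabbit</name></animal>'
    '<website contents="T50 T51"><url>http://www.animals.com</url></website>'
    '<website animal_refer="T99"><url>http://www.lions.com</url></website>'
    "</zoo>"
)
id_problems = check_ids(zoo)
```

Two passes are used so that a reference may appear before the element that carries the ID it points to.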

91
XML Example - the DTD
  • <!ELEMENT addressBook (person)>
  • <!ELEMENT person (name, email, link?) >
  • <!ATTLIST person id ID #REQUIRED >
  • <!ATTLIST person gender (male|female) #IMPLIED>
  • <!ELEMENT name (#PCDATA|(family,given))>
  • <!ELEMENT family (#PCDATA)>
  • <!ELEMENT given (#PCDATA)>
  • <!ELEMENT email (#PCDATA)>
  • <!ELEMENT link EMPTY > <!ATTLIST link manager
    IDREF #IMPLIED
    subordinates IDREF #IMPLIED>

92
DOCTYPE declarations
  • Internal: local definition of DTD
  • External: refers to an external file
  • Can combine both

93
Internal DTD
  • <?xml version="1.0" standalone="yes" ?>
  • <!--open the DOCTYPE declaration -
  • the open square bracket indicates an internal
    DTD-->
  • <!DOCTYPE foo [
  • <!--define the internal DTD-->
  • <!ELEMENT foo (#PCDATA)>
  • <!--close the DOCTYPE declaration-->
  • ]>
  • <foo>Hello World.</foo>

94
Internal DTD rules
  • The document type declaration must be placed
    between the XML declaration and the first element
    (root element) in the document.
  • The keyword DOCTYPE must be followed by the name
    of the root element in the XML document.
  • The keyword DOCTYPE must be in upper case.
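
The rule that the DOCTYPE name must match the root element can be checked with a few lines on top of Python's standard `xml.parsers.expat` module, which reports the document type declaration through `StartDoctypeDeclHandler`. The helper name `doctype_and_root` is hypothetical; expat itself does not enforce this rule because it is a non-validating parser.

```python
import xml.parsers.expat


def doctype_and_root(text):
    """Return (DOCTYPE name, root element name, has internal subset)."""
    info = {"doctype": None, "root": None, "internal": False}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["doctype"] = name
        info["internal"] = bool(has_internal_subset)

    def on_start(name, attrs):
        if info["root"] is None:  # the first start tag is the root element
            info["root"] = name

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(text, True)
    return info["doctype"], info["root"], info["internal"]


DOC = """<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
]>
<foo>Hello World.</foo>"""

result = doctype_and_root(DOC)
names_agree = result[0] == result[1]
```

A validating parser would report a mismatch as a validity error; here the comparison is left to the caller.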

95
External DTD
  • Useful for creating a common DTD that can be
    shared between multiple documents.
  • Any changes that are made to the external DTD
    automatically update all the documents that
    reference it.
  • Two types: private and public.
  • Rules:
  • If any elements, attributes, or entities are used
    in the XML document that are referenced or
    defined in an external DTD, standalone="no" must
    be included in the XML declaration.

96
"Private" External DTDs
  • Identified by the keyword SYSTEM
  • Intended for use by a single author or group of
    authors.
  • Example
  • <!DOCTYPE root_element SYSTEM "DTD_location">
  • where DTD_location is a relative or absolute URL
    (such as http:// and file://).

97
"Private" External DTDs (cont)
  • XML document:
  • <?xml version="1.0" standalone="no" ?>
  • <!DOCTYPE document SYSTEM "subjects.dtd">
  • <document> </document>
  • subjects.dtd:
  • <!ELEMENT document >

98
"Public" External DTDs
  • Identified by the keyword PUBLIC
  • Intended for broad use.
  • <!DOCTYPE root_element PUBLIC "DTD_name"
    "DTD_location"> where
  • DTD_location: relative or absolute URL
  • DTD_name: follows the syntax
  • "prefix//owner_of_the_DTD//description_of_the_DTD//ISO
    639_language_identifier"
  • "DTD_location" is used to find the public DTD if
    it cannot be located by the "DTD_name".

99
"Public" External DTDs (cont)
  • <?xml version="1.0" standalone="no" ?>
  • <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0
    Transitional//EN" "http://www.w3.org/TR/REC-html40
    /loose.dtd">
  • <HTML>
  • <HEAD>
  • <TITLE>A typical HTML file</TITLE>
  • </HEAD>
  • <BODY>
  • </BODY>
  • </HTML>
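
The parts of a public identifier can be pulled apart once the DOCTYPE declaration has been read. The sketch below assumes Python's standard `xml.parsers.expat`, which reports the public and system identifiers without fetching the external DTD; the helper name `read_doctype` and the dictionary keys are made up for illustration.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0" standalone="no" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
  "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML><HEAD><TITLE>A typical HTML file</TITLE></HEAD><BODY></BODY></HTML>"""


def read_doctype(text):
    """Return the DOCTYPE name, public identifier, and system identifier."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, public_id=public_id, system_id=system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return found


doctype = read_doctype(DOC)

# DTD_name syntax: prefix//owner//description//language
prefix, owner, description, language = doctype["public_id"].split("//")
```

Here the prefix `-` marks the DTD as an unapproved non-ISO standard, the owner is W3C, and the language identifier is EN.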

100
"Public" External DTDs (cont)
  • Valid DTD_name prefixes:
  • ISO: The DTD is an ISO standard. All ISO
    standards are approved.
  • +: The DTD is an approved non-ISO standard.
  • -: The DTD is an unapproved non-ISO standard.

101
Combining Internal and External DTDs
  • A document can use both internal and external DTD
    subsets.
  • The internal DTD subset is specified between the
    square brackets of the DOCTYPE declaration.
  • The declaration for the external DTD subset is
    placed before the square brackets immediately
    after the SYSTEM keyword.
  • Declaring an ELEMENT with the same name in both
    the internal and external DTD subsets is invalid

102
Example
  • <?xml version="1.0" standalone="no" ?>
  • <!DOCTYPE document SYSTEM "subjects.dtd" [
  • <!ATTLIST assessment assessment_type (exam |
    assignment | prac)>
  • <!ELEMENT results (#PCDATA)>
  • ]>
  • subjects.dtd:
  • <!ELEMENT document (title,subjectID,subjectname,
    prerequisite?,classes,assessment,syllabus,textbooks)>
  • <!ELEMENT prerequisite (subjectID,subjectname)>

103
DTD Validation
  • XML content can be well-formed but invalid
    under DTD rules
  • e.g. DTD rule: <!ELEMENT name (#PCDATA)>
  • Acceptable: <name> Giancarlo Succi </name>
  • Unacceptable:
  • <name>
  • <first_name> Giancarlo </first_name>
  • <last_name> Succi </last_name>
  • </name>
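
The difference between well-formed and valid can be sketched as two separate steps: parsing succeeds for both samples, and only a further check against the rule tells them apart. The code assumes Python's standard `xml.etree.ElementTree`, which does not validate, so the `(#PCDATA)` rule is checked by a hand-written, hypothetical helper.

```python
import xml.etree.ElementTree as ET


def valid_pcdata_only(element):
    """(#PCDATA) content model: character data only, no child elements."""
    return len(element) == 0  # len() of an element counts its children


# Both samples parse, so both are well formed.
acceptable = ET.fromstring("<name> Giancarlo Succi </name>")
unacceptable = ET.fromstring(
    "<name><first_name> Giancarlo </first_name>"
    "<last_name> Succi </last_name></name>"
)

checks = (valid_pcdata_only(acceptable), valid_pcdata_only(unacceptable))
```

Only the first sample satisfies the rule; the second is well formed but invalid because `name` contains child elements.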

104
Beyond DTDs
  • DTD limitations
  • Simple document structures
  • Lack of real datatypes
  • Advanced schema languages
  • XML Schema
  • Relax NG
