Title: DTD (Document Type Definition)
1DTD (Document Type Definition)
- Imposing Structure on XML Documents
- (W3Schools on DTDs)
2Motivation
- A DTD adds syntactical requirements in addition to the well-formed requirement
- It helps in eliminating errors when creating or editing XML documents
- It clarifies the intended semantics
- It simplifies the processing of XML documents
3An Example
- In an address book, where can a phone number appear?
- Under <person>, under <name> or under both?
- If we have to check for all possibilities, processing takes longer and it may not be clear to whom a phone belongs
4Document Type Definitions
- Document Type Definitions (DTDs) impose structure on XML documents
- There is some relationship between a DTD and a schema, but it is not close; hence the need for additional typing systems (XML schemas)
- The DTD is a syntactic specification
5Example: An Address Book
- <person>
- <name> Homer Simpson </name>
- <greet> Dr. H. Simpson </greet>
- <addr> 1234 Springwater Road </addr>
- <addr> Springfield USA, 98765 </addr>
- <tel> (321) 786 2543 </tel>
- <fax> (321) 786 2544 </fax>
- <tel> (321) 786 2544 </tel>
- <email> homer@math.springfield.edu </email>
- </person>
6Specifying the Structure
- name: specifies a name element
- greet?: specifies an optional (0 or 1) greet element
- name, greet?: specifies a name followed by an optional greet
7Specifying the Structure (contd)
- addr*: specifies 0 or more address lines
- tel | fax: a tel or a fax element
- (tel | fax)*: 0 or more repeats of tel or fax
- email*: 0 or more email elements
8Specifying the Structure (contd)
- So the whole structure of a person entry is specified by
- name, greet?, addr*, (tel | fax)*, email*
- This is known as a regular expression
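As a rough illustration of why this is called a regular expression, the person structure can be hand-translated into an ordinary Python regex that is matched against the sequence of child element names. This is only a sketch: the helper names encode and conforms are hypothetical, and a real validating parser works from the DTD rather than from a handwritten pattern.

```python
import re

def encode(children):
    """Join child element names into one string, each followed by a space."""
    return "".join(name + " " for name in children)

# Hand translation of: name, greet?, addr*, (tel | fax)*, email*
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def conforms(children):
    """Return True if the child sequence matches the person content model."""
    return PERSON_MODEL.fullmatch(encode(children)) is not None

# The children of the Homer Simpson entry, in document order.
homer = ["name", "greet", "addr", "addr", "tel", "fax", "tel", "email"]
print(conforms(homer))              # True
print(conforms(["greet", "name"]))  # False: name must come first
```

The trailing space after each name keeps one element name from matching as a prefix of another.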
9Element Type Definition
- For each element type E, a declaration of the form
- <!ELEMENT E P>
- where P is a regular expression, i.e.,
- P ::= EMPTY | ANY | #PCDATA | E | P1, P2 | P1 | P2 | P? | P+ | P*
- E: element type
- P1, P2: concatenation
- P1 | P2: disjunction
- P?: optional
- P+: one or more occurrences
- P*: the Kleene closure
10Summary of Regular Expressions
- A: the tag (i.e., element) A occurs
- e1, e2: the expression e1 followed by e2
- e*: 0 or more occurrences of e
- e?: optional, 0 or 1 occurrences
- e+: 1 or more occurrences
- e1 | e2: either e1 or e2
- (e): grouping
11The Definition of an Element Consists of Exactly One of the Following
- A regular expression (as defined earlier)
- EMPTY means that the element has no content
- ANY means that content can be any mixture of PCDATA and elements defined in the DTD
- Mixed content, which is defined as described on the next slide
- (#PCDATA)
12The Definition of Mixed Content
- Mixed content is described by a repeatable OR group
- (#PCDATA | element-name | ...)*
- Inside the group, no regular expressions, just element names
- #PCDATA must be first, followed by 0 or more element names, separated by |
- The group can be repeated 0 or more times
13An Address-Book XML Document with an Internal DTD
- <?xml version="1.0" encoding="UTF-8"?>
- <!DOCTYPE addressbook [
- <!ELEMENT addressbook (person*)>
- <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
- <!ELEMENT name (#PCDATA)>
- <!ELEMENT greet (#PCDATA)>
- <!ELEMENT address (#PCDATA)>
- <!ELEMENT tel (#PCDATA)>
- <!ELEMENT fax (#PCDATA)>
- <!ELEMENT email (#PCDATA)>
- ]>
The syntax of a DTD is not XML syntax
14The Rest of the Address-Book XML Document
<addressbook>
  <person>
    <name> Jeff Cohen </name>
    <greet> Dr. Cohen </greet>
    <email> jc@penny.com </email>
  </person>
</addressbook>
15Regular Expressions
- Each regular expression determines a corresponding finite-state automaton
- Let's start with a simpler example
- name, addr*, email
A double circle denotes an accepting state
This suggests a simple parsing program
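One way such a parsing program might look, for the content model name, addr*, email, is a table-driven automaton. This is a minimal sketch: the state numbers and the names TRANSITIONS, ACCEPTING and accepts are assumptions made for illustration, not taken from the slides.

```python
# Transition table for the content model: name, addr*, email
# State 0 is the start state and state 2 is the only accepting state.
TRANSITIONS = {
    (0, "name"): 1,   # the mandatory name
    (1, "addr"): 1,   # loop: zero or more addr elements
    (1, "email"): 2,  # the mandatory email ends the sequence
}
ACCEPTING = {2}

def accepts(children):
    """Run the automaton over a sequence of child element names."""
    state = 0
    for tag in children:
        key = (state, tag)
        if key not in TRANSITIONS:
            return False  # no transition: the sequence is rejected
        state = TRANSITIONS[key]
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))  # True
print(accepts(["name", "addr"]))                   # False: email is missing
```

Each child element is inspected exactly once, so checking takes time linear in the number of children.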
16Another Example
- name, address*, (tel | fax)*, email*
17Some Things are Hard to Specify
- Each employee element should contain name, age and ssn elements in some order
- <!ELEMENT employee
- ( (name, age, ssn) | (age, ssn, name) |
- (ssn, name, age) | ...
- )>
- Suppose that there were many more fields!
18Some Things are Hard to Specify (contd)
There are n! different orders of n elements, so the size of the declaration is not even polynomial in n
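To get a feeling for the growth, the following sketch generates the "any order" content model by listing every permutation of the fields. The function name any_order_model is hypothetical. The sketch only illustrates the size of the declaration; it makes no attempt to produce a deterministic content model, which XML processors may additionally require.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build a content model listing every ordering of the given fields."""
    groups = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(groups) + ")"

# The employee declaration with three fields already needs 3! = 6 alternatives.
print("<!ELEMENT employee " + any_order_model(["name", "age", "ssn"]) + ">")

# The number of alternatives grows as n!.
for n in (3, 5, 8):
    print(n, "fields ->", factorial(n), "alternatives")
```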
19Specifying Attributes in the DTD
- <!ELEMENT height (#PCDATA)>
- <!ATTLIST height
- dimension CDATA #REQUIRED
- accuracy CDATA #IMPLIED >
- The dimension attribute is required
- The accuracy attribute is optional
- CDATA is the type of the attribute; it means character data, and may take any literal string as a value
20The Format of an Attribute Definition
- <!ATTLIST element-name attr-name attr-type default-value>
- The default value is given inside quotes
- attribute types
- CDATA
- ID, IDREF, IDREFS
-
21Summary of Attribute Default Values
- #REQUIRED means that the attribute must be included in the element
- #IMPLIED means that the attribute is optional
- #FIXED "value": the given value (inside quotes) is the only possible one
- "value": the default value of the attribute if none is given
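The four kinds of default declaration can be summarized by a small function that computes the value an application would see for an attribute. This is an illustrative sketch only: the function effective_value and the way a declaration is represented (a string, or a ("#FIXED", value) tuple) are assumptions, not part of any XML library.

```python
def effective_value(default_decl, given=None):
    """Return the attribute value seen by the application.

    default_decl is "#REQUIRED", "#IMPLIED", ("#FIXED", value),
    or a plain string holding the default value.
    given is the value written in the document, or None if absent.
    """
    if default_decl == "#REQUIRED":
        if given is None:
            raise ValueError("required attribute is missing")
        return given
    if default_decl == "#IMPLIED":
        return given  # may be None: no value and no default
    if isinstance(default_decl, tuple) and default_decl[0] == "#FIXED":
        fixed = default_decl[1]
        if given is not None and given != fixed:
            raise ValueError("attribute must have the fixed value")
        return fixed
    # Plain default: used only when the document gives no value.
    return default_decl if given is None else given

print(effective_value("#IMPLIED"))        # None
print(effective_value(("#FIXED", "cm")))  # cm
print(effective_value("cm", "inch"))      # inch
```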
22Recursive DTDs
- <!DOCTYPE genealogy [
- <!ELEMENT genealogy (person*)>
- <!ELEMENT person (
- name,
- dateOfBirth,
- person, -- mother
- person )> -- father
- ...
- ]>
- What is the problem with this?
- A parser does not notice it!
Each person should have a father and a mother. This leads to either infinite data or a person who is a descendant of herself.
23Recursive DTDs (contd)
- <!DOCTYPE genealogy [
- <!ELEMENT genealogy (person*)>
- <!ELEMENT person (
- name,
- dateOfBirth,
- person?, -- mother
- person? )> -- father
- ...
- ]>
- What is now the problem with this?
If a person only has a father, how can you
tell that he has a father and does not have a
mother?
24Using ID and IDREF Attributes
- <!DOCTYPE family [
- <!ELEMENT family (person*)>
- <!ELEMENT person (name)>
- <!ELEMENT name (#PCDATA)>
- <!ATTLIST person
- id ID #REQUIRED
- mother IDREF #IMPLIED
- father IDREF #IMPLIED
- children IDREFS #IMPLIED>
- ]>
25IDs and IDREFs
- ID attribute: its value is unique within the entire document.
- An element can have at most one ID attribute.
- No default (fixed default) value is allowed.
- #REQUIRED: a value must be provided
- #IMPLIED: a value is optional
- IDREF attribute: its value must be the ID value of some element in the document.
- IDREFS attribute: its value is a set; each element of the set is the ID value of some element in the document.
- An ID value must be an XML name, so it cannot start with a digit.
- <person id="p898" father="p332" mother="p336" children="p982 p984 p986">
26Some Conforming Data
- <family>
- <person id="lisa" mother="marge" father="homer">
- <name> Lisa Simpson </name>
- </person>
- <person id="bart" mother="marge" father="homer">
- <name> Bart Simpson </name>
- </person>
- <person id="marge" children="bart lisa">
- <name> Marge Simpson </name>
- </person>
- <person id="homer" children="bart lisa">
- <name> Homer Simpson </name>
- </person>
- </family>
27ID References do not Have Types
- The attributes mother and father are references to IDs of other elements
- However, those are not necessarily person elements!
- The mother attribute is not necessarily a reference to a female person
28An Alternative Specification
- <?xml version="1.0" encoding="UTF-8"?>
- <!DOCTYPE family [
- <!ELEMENT family (person*)>
- <!ELEMENT person (name, mother?, father?, children?)>
- <!ATTLIST person id ID #REQUIRED>
- <!ELEMENT name (#PCDATA)>
- <!ELEMENT mother EMPTY>
- <!ATTLIST mother idref IDREF #REQUIRED>
- <!ELEMENT father EMPTY>
- <!ATTLIST father idref IDREF #REQUIRED>
- <!ELEMENT children EMPTY>
- <!ATTLIST children idrefs IDREFS #REQUIRED>
- ]>
29The Revised Data
- <family>
- <person id="marge">
- <name> Marge Simpson </name>
- <children idrefs="bart lisa"/>
- </person>
- <person id="homer">
- <name> Homer Simpson </name>
- <children idrefs="bart lisa"/>
- </person>
- <person id="bart">
- <name> Bart Simpson </name>
- <mother idref="marge"/>
- <father idref="homer"/>
- </person>
- <person id="lisa">
- <name> Lisa Simpson </name>
- <mother idref="marge"/>
- <father idref="homer"/>
- </person>
- </family>
30Consistency of ID and IDREF Attribute Values
- If an attribute is declared as ID
- The associated value must be distinct, i.e., different elements (in the given document) must have different values for the ID attribute (no confusion)
- Even if the two elements have different element names
- If an attribute is declared as IDREF
- The associated value must exist as the value of some ID attribute (no dangling pointers)
- Similarly for all the values of an IDREFS attribute
- ID, IDREF and IDREFS attributes are not typed
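The two consistency rules (distinct IDs, no dangling references) can be sketched with Python's standard xml.etree.ElementTree module. That parser does not validate and does not know attribute types, so this sketch hard-codes which attributes play the ID, IDREF and IDREFS roles, following the person declaration used earlier. The function check_references and the sample data are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Attribute roles, assumed from the ATTLIST for person (not read from a DTD).
ID_ATTR = "id"
IDREF_ATTRS = ("mother", "father")
IDREFS_ATTRS = ("children",)

def check_references(xml_text):
    """Return a list of error messages; an empty list means consistent."""
    root = ET.fromstring(xml_text)
    errors = []
    seen = set()
    # Rule 1: ID values must be distinct across the whole document.
    for elem in root.iter():
        value = elem.get(ID_ATTR)
        if value is None:
            continue
        if value in seen:
            errors.append(f"duplicate ID: {value}")
        seen.add(value)
    # Rule 2: every IDREF / IDREFS value must match some ID.
    for elem in root.iter():
        refs = [elem.get(a) for a in IDREF_ATTRS if elem.get(a) is not None]
        for a in IDREFS_ATTRS:
            refs.extend((elem.get(a) or "").split())
        for ref in refs:
            if ref not in seen:
                errors.append(f"dangling reference: {ref}")
    return errors

FAMILY = """<family>
  <person id="lisa" mother="marge" father="homer"><name>Lisa Simpson</name></person>
  <person id="marge" children="lisa"><name>Marge Simpson</name></person>
  <person id="homer" children="lisa"><name>Homer Simpson</name></person>
</family>"""

print(check_references(FAMILY))  # []
```

Note that the check never looks at element names, which mirrors the point that ID references are not typed.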
31Adding a DTD to the Document
- A DTD can be internal
- The DTD is part of the document file
- or external
- The DTD and the document are in separate files
- An external DTD may reside
- In the local file system (where the document is)
- In a remote file system
32Connecting a Document with its DTD
- An internal DTD
- <?xml version="1.0"?>
- <!DOCTYPE db [ <!ELEMENT ...> ]>
- <db> ... </db>
- A DTD from the local file system
- <!DOCTYPE db SYSTEM "schema.dtd">
- A DTD from a remote file system
- <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">
33Well-Formed XML Documents
- An XML document (with or without a DTD) is well-formed if
- Tags are syntactically correct
- Every tag has an end tag
- Tags are properly nested
- There is a root tag
- A start tag does not have two occurrences of the
same attribute
An XML document must be well formed
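Well-formedness can be checked without any DTD. As a small sketch, Python's standard xml.etree.ElementTree parser raises ParseError on input that is not well formed; the wrapper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<db><a>1</a></db>"))  # True
print(is_well_formed("<db><a>1</db></a>"))  # False: tags are not properly nested
print(is_well_formed("<db><a>1</a>"))       # False: missing end tag
print(is_well_formed('<db x="1" x="2"/>'))  # False: attribute occurs twice
print(is_well_formed("<a/><b/>"))           # False: more than one root
```

This parser does not validate, so a True result says nothing about conformance to a DTD.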
34Valid Documents
- A well-formed XML document is valid if it conforms to its DTD, that is,
- The document conforms to the regular-expression grammar,
- The types of attributes are correct, and
- The constraints on references are satisfied