e-Science, e-Business, e-Government and their Technologies: Core XML


Slides: 196
Provided by: itUomGrt4
Learn more at: http://www.it.uom.gr



e-Science e-Business e-Government and their
Technologies: Core XML
  • Bryan Carpenter, Geoffrey Fox, Marlon Pierce
  • Pervasive Technology Laboratories
  • Indiana University Bloomington IN 47404
  • January 12 2004
  • dbcarpen_at_indiana.edu
  • gcf_at_indiana.edu
  • mpierce_at_cs.indiana.edu
  • http://www.grid2004.org/spring2004

What are we doing
  • This is a semester-long course on Grids (viewed
    as technologies and infrastructure) and their
    application, mainly to science but also to
    business and government.
  • We will assume a basic knowledge of the Java
    language and then interweave 6 topic areas; the
    first four cover technologies that will be used
    by students:
  • 1) Advanced Java, including networking, Java
    Server Pages and perhaps servlets
  • 2) XML: Specification, Tools, Linkage to Java
  • 3) Web Services: Basic Ideas, WSDL, Axis and
  • 4) Grid Systems: GT3/Cogkit, Gateway, XSOAP,
  • 5) Advanced Technology Discussions: CORBA as
    history, OGSA-DAI, security, Semantic Grid,
  • 6) Applications: Bioinformatics, Particle
    Physics, Engineering, Crises, Computing-on-demand
    Grid, Earth Science

Contents of this Lecture Set
  • Intro HTML and XML and Unicode
  • Core XML
  • XML syntax and well-formedness, DTDs and
  • XML namespaces.
  • The XML DOM with linkage to Java.
  • XPath basics.
  • XML Schema.
  • Validation for data-centric applications.
  • Later lectures may include additional information:
  • XML style languages: XSLT and CSS.
  • XML Databases (Xindice, Sleepycat).
  • Search: advanced XPath, XQuery.

Motivations for XML: a Better HTML?
  • Limitations of HTML
  • Extensibility: HTML does not allow users to
    specify their own tags or attributes in order to
    parameterize or otherwise semantically qualify
    their data.
  • Structure: HTML does not support the
    specification of deep structures needed to
    represent database schema or object-oriented
  • Validation: HTML does not support the kind of
    language specification that allows applications
    to check data for structural validity when it is

XML in the HTML world
  • XML: eXtensible Markup Language.
  • XML is a subset of SGML (Standard Generalized
    Markup Language), but XML is specifically designed
    for the web.
  • Specification by W3C: http://www.w3.org/XML and
    lots of links like http://www.xml.org
  • XML 1.0 in February 1998.
  • XML 1.1 became a W3C recommendation 4 Feb, 2004!
  • How XML fits into the new HTML world
  • XML describes the logical structure of the
  • CSS (Cascading Style Sheets) and/or XSL describes
    the visual presentation of the document.
  • DOM (Document Object Model) allows scripting
    languages like JavaScript to access and
    dynamically change document objects.

Logical vs. Visual Design
  • Logical design of a document (content) should be
    separate from its visual design (presentation).
  • Promotes sound typography.
  • Encourages better writing.
  • Is more flexible.
  • Allows the same knowledge/information (defined
    in XML) to be displayed on PCs, PDAs, Braille
    devices etc.
  • XML is used to define the logical design, with XSL
    (Extensible Stylesheet Language) or other mechanism
    used to define the visual layout (e.g. by mapping
    XML into HTML).

XML Design Goals
  1. XML shall be usable over the Internet.
  2. XML shall support a variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs that process
    XML documents.
  5. Optional features in XML shall be kept to the
    absolute minimum, ideally zero.
  6. XML documents should be human-legible and
    reasonably clear.
  7. Design of XML should be prepared quickly.
  8. Design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

Document-Centric or Data-Centric?
  • Roots of XML in document markup (HTML-like).
  • In practice use of XML as a data format has
    become at least as pervasive. Examples:
  • Use of XML format in configuration and deployment
    files of EJB, Tomcat,
  • Uses of XML as a format for message exchange
    (e.g. SOAP, BEEP).
  • There is also an important intermediate case: XML
    as program text for machine interpretation.
  • XSLT: declarative transformation language.
  • WSDL: interface definition language for Web
  • BPEL: Web services workflow language.

Features of XML
  • Documents are stored in plain text and thus can
    be transferred and processed anywhere.
  • Unifying principles make it easily acceptable
  • Everything is a tree (DOM).
  • Unicode for different languages.

XML and Unicode
  • All XML documents must be written using the
    Unicode character set.
  • Unicode is also the character set used by Java,
    C#, ECMAScript, ..., so we should know something
    about it.

Special Topic: Unicode
  • Unicode (http://www.unicode.org) is an
    international standard character set that covers
    alphabets of all the World's common written
  • Eventually it should cover all languages, living
    and dead.
  • Unicode helps make the Web truly worldwide?!
  • Unlike, say, ASCII, which allows for 128
    characters, Unicode has space for over 1,000,000,
    of which around 96,000 are currently allocated.
  • Unicode itself assigns a unique sequence number
    (code point) to any character, regardless its
  • Three Unicode encoding forms map these code
    points to sequences of fixed size units: UTF-8,
    UTF-16, UTF-32.

Unicode Code Points
  • A Unicode code point is a numeric value between 0
    and 10FFFF (hexadecimal), commonly denoted in one of the
  • U+10XXXX
  • where X is a hexadecimal digit.
  • There are a total of 1,114,112 (= 17 × 16^4) code
    points, but most of the World's common characters
    are encoded in the first 65,536 points: the Basic
    Multilingual Plane (BMP).
  • 2048 code points in BMP are disallowed because
    their values have a special role in UTF-16
  • For each assigned character code, the Unicode
    standard defines a name, and semantic
    properties like case, directionality, ...

  • The space of 17 × 2^16 Unicode code points is
    conventionally divided into 17 planes of 2^16
    points each.
  • Currently used planes include

Plane  Plane Name                         Code Range (hexadecimal)
0      Basic Multilingual Plane           0000..FFFF
1      Supplementary Multilingual Plane   10000..1FFFF
2      Supplementary Ideographic Plane    20000..2FFFF
  • Note: early versions of Unicode used a strict
    16-bit encoding, and essentially contained just
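
The plane arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names (UnicodePlanes, planeOf, totalCodePoints) are made up for this example, and the only library call used is the standard Character.isValidCodePoint.

```java
public class UnicodePlanes {
    /** Plane number (0..16) of a code point: its value divided by 2^16. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    /** Total number of code points: 17 planes of 2^16 points each. */
    static int totalCodePoints() {
        return 17 * (1 << 16);
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());   // size of the whole code space
        System.out.println(planeOf('A'));        // Basic Latin lies in the BMP
        System.out.println(planeOf(0x1D11E));    // a Supplementary Multilingual Plane point
        System.out.println(planeOf(0x20000));    // first point of the Supplementary Ideographic Plane
    }
}
```

The total agrees with Character.MAX_CODE_POINT + 1 in the Java library, which is one way to convince yourself that the 17 × 2^16 figure is right.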

Unicode Allocation
  • Layout of planes

  • Planes are subdivided into blocks.
  • Blocks have variable size. Each block contains
    the characters of one alphabet or a group of
    related alphabets.
  • The following slides are a random sampling of the
    blocks in BMP.
  • I have put 128 code points on each slide, but
    this is just what would fit; there is no general
    significance to pages of size 128.
  • For all blocks in the current Unicode standard, see
  • http://www.unicode.org/charts/

Basic Latin (a.k.a. ASCII)
Latin 1 (supplement)
Greek and Coptic
Arabic (1 of 2)
Hangul Jamo (1 of 2)
CJK Unified Ideographs (1 of 164)
Unicode Allocation
  • Layout of Basic Multilingual Plane

Unicode Allocation
  • Layout of Plane 1

Encoding Forms
  • In electronic documents or computer programs the
    space of Unicode code points is normally broken
    down into a sequence of units, each unit having a
    convenient, fixed number of bits.
  • The Unicode standard defines 3 encoding forms.
  • The most straightforward is UTF-32, in which the
    units have size 32 bits.
  • This unit is easily large enough to hold the
    integer value of a single code point, so UTF-32
    encoding is obvious.
  • But for nearly all documents, UTF-32 wastes at
    least half the available storage space.
  • Also, most programming languages work with 8 bit
    or 16 bit character units.

  • The UTF-16 encoding form breaks Unicode
    characters into 16 bit units.
  • Java, for example, uses UTF-16 for chars and
  • One 16 bit unit is not large enough to represent
    all possible Unicode code points.
  • Code points higher than 2^16 - 1 are split over two
    consecutive units.
  • These are called surrogate pairs. The leading
    unit is a high-surrogate unit; the trailing unit is a
    low-surrogate unit.
  • There are 1024 code points reserved in the BMP
    for high surrogates, and 1024 more reserved for
    low surrogates.
  • This allows for 1024 × 1024 surrogate pairs
    representing code points higher than 2^16 - 1, while
    ensuring a legal BMP code point can always be
    represented in a single unit, and such a unit can
    never be confused with a surrogate unit.
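
The surrogate scheme can be sketched by hand in Java and compared against the library. The class name SurrogateDemo and method toUtf16 are invented for this illustration; the constants D800 and DC00 (hexadecimal) are the starts of the high- and low-surrogate ranges in the Unicode standard, and Character.toChars is the standard library routine that does the same job.

```java
import java.util.Arrays;

public class SurrogateDemo {
    /**
     * Splits a code point into UTF-16 units by hand.
     * Assumes the argument is a legal code point that is not itself a surrogate.
     */
    static char[] toUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            return new char[] { (char) codePoint };        // BMP: a single unit
        }
        int offset = codePoint - 0x10000;                  // 20 significant bits remain
        char high = (char) (0xD800 + (offset >>> 10));     // top 10 bits: one of 1024 high surrogates
        char low  = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits: one of 1024 low surrogates
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E;                                // a code point outside the BMP
        char[] units = toUtf16(clef);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
        // The hand-rolled result should agree with the Java library.
        System.out.println(Arrays.equals(units, Character.toChars(clef)));
        System.out.println(Character.isHighSurrogate(units[0]) && Character.isLowSurrogate(units[1]));
    }
}
```

Ten bits in each surrogate give the 1024 × 1024 combinations mentioned above, which is exactly the number of code points in planes 1 to 16.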

UTF-16 Bit Distribution
  • The UTF-8 encoding form breaks Unicode characters
    into 8 bit units (i.e., individual bytes).
  • UTF-8 is a variable-width encoding with the
    following properties
  • Any Unicode code point maps to 1, 2, 3, or 4
  • Byte sub-sequences for individual characters can
    always be recognized by local search in the
    encoded string.
  • The Basic Latin block code points
    (U+0000..U+007F) map to one byte, identical to
    their ASCII value.
  • All code points in the BMP map to at most 3
  • For European texts UTF-8 will normally use 8 or
    16 bits per character (vs 16 bits for UTF-16).
  • For East Asian texts UTF-8 will normally use 24
    bits per character (vs 16 bits for UTF-16).
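
The size comparison between UTF-8 and UTF-16 is easy to try out. The sketch below uses only String.getBytes with StandardCharsets.UTF_8 and Character.charCount from the standard library; the class name Utf8Lengths and the sample code points are choices made for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Number of bytes UTF-8 needs for one (non-surrogate) code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes UTF-16 needs: two per 16-bit unit. */
    static int utf16Length(int codePoint) {
        return 2 * Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0041,   // LATIN CAPITAL LETTER A (Basic Latin)
            0x00E9,   // LATIN SMALL LETTER E WITH ACUTE (Latin-1 Supplement)
            0x4E2D,   // a CJK unified ideograph
            0x1D11E   // a code point outside the BMP
        };
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```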

UTF-8 Bit Distribution
Encoding Schemes
  • The 3 encoding forms don't quite complete the
    encoding schemes of Unicode, because they don't
    address the endianness with which the UTF-32,
    UTF-16 numeric unit values are rendered to bytes
  • To allow applications to distinguish the
    endianness of a given document instance, Unicode
    allows a Byte Order Mark (BOM) as the first
    character of a document.
  • BOM is a code point (U+FEFF) for which the
    byte-reversed unit value doesn't correspond to a
    legal code point, so serves to determine the
    actual byte order.
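
As a rough illustration of how a BOM reveals byte order, the following sketch inspects the first bytes of a document. It is not taken from any particular parser: the class name BomSniffer, the method detect, and the returned label strings are all assumptions of this example. The UTF-8 case is included because U+FEFF encoded in UTF-8 is the byte sequence EF BB BF.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Guesses the encoding scheme from a leading BOM, or returns "unknown". */
    static String detect(byte[] data) {
        // UTF-32 must be tested before UTF-16: FF FE 00 00 also begins with FF FE.
        // (FF FE 00 00 could in principle be UTF-16LE text starting with U+0000;
        // this sketch ignores that corner case.)
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return "unknown";
    }

    public static void main(String[] args) {
        String text = "\uFEFFhello";   // U+FEFF followed by some content
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(detect(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(detect("hello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```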

The Seven Unicode Encoding Schemes
Unicode Summary
  • Unicode is a large and important standard that is
    a foundation for XML, HTML, etc.
  • Although you are unlikely to manipulate the
    encodings yourself, you should be aware of the
    pros and cons of UTF-16, UTF-8.
  • UTF-8 is backwards compatible with ASCII: Basic
    Latin texts can be read by legacy applications.
  • UTF-16 is better-suited for internationalization.
    It is the internal representation used by Java,
    C#, ECMAScript,

Core XML I: The XML Specification
  • In this section we will describe core XML, as
    defined by the XML specification document from
  • XML is a format for documents (originally
    documents for the Web), but its scope is wider than
  • XML is a subset of SGML (Standard Generalized
    Markup Language). Some features of XML exist
    simply for compatibility with SGML.
  • XML can also be viewed as a kind of
    generalization of HTML, presumably familiar from
    the Web.

XML Parsers and Applications
  • For purposes of this section an application is
    any program that reads data from an XML document.
  • Applications normally do not (and probably should
    not) read the text of XML documents directly.
  • The XML specification assumes that this text is
    initially processed by a piece of software called
    an XML processor. We will also refer to this as
    an XML parser.
  • The parser exhaustively checks that the text is
    in a legal XML form, then extracts the essential
    data from the document, and hands that data to
    the application.

Reading XML Data
XML Source

    <?xml version="1.0"?>
    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"
      "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
    <svg width="500" height="500">
      <g transform='rotate(45)'>
        <circle cx='150' cy='50' r='25'/>
        <text x='125' y='100'>A Circle</text>
      </g>
    </svg>

XML Parser

Parsed XML Data

    width = 500
    height = 500
    transform = rotate(45)
    cx = 150
    cy = 50
    r = 25
    x = 125
    y = 100
    A Circle
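
To make the parser/application split concrete, here is a minimal sketch in which the "application" is a small Java program and the parser is whatever JAXP implementation the JDK provides (the standard javax.xml.parsers API). The class name ReadSvg and its helper methods are invented for this example, and the DOCTYPE line from the figure is deliberately left out so that the parser does not try to fetch the external DTD over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ReadSvg {
    // The SVG document from the figure, without its DOCTYPE declaration.
    static final String SVG =
          "<?xml version=\"1.0\"?>\n"
        + "<svg width=\"500\" height=\"500\">\n"
        + "  <g transform='rotate(45)'>\n"
        + "    <circle cx='150' cy='50' r='25'/>\n"
        + "    <text x='125' y='100'>A Circle</text>\n"
        + "  </g>\n"
        + "</svg>\n";

    /** Lets the XML parser turn the text into a DOM tree. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    /** Reads one attribute of the first element with the given name. */
    static String attribute(Document doc, String elementName, String attributeName) {
        Element e = (Element) doc.getElementsByTagName(elementName).item(0);
        return e.getAttribute(attributeName);
    }

    public static void main(String[] args) {
        Document doc = parse(SVG);
        // The application sees names and values, never the raw markup.
        System.out.println("width = " + attribute(doc, "svg", "width"));
        System.out.println("transform = " + attribute(doc, "g", "transform"));
        System.out.println("r = " + attribute(doc, "circle", "r"));
        System.out.println(doc.getElementsByTagName("text").item(0).getTextContent());
    }
}
```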
Well-formed Documents
  • An XML document follows a strict syntax. For example:
  • An XML document contains regions of text called
    elements, delimited by matching start-tags and
    end-tags. Elements must be correctly nested.
  • Start-tags may include attribute specifications,
    where attribute values are strings delimited by
    matching quote marks.
  • A document that obeys the full set of these rules
    is called well-formed.
  • Every legal XML document must be well-formed,
    otherwise it cannot be parsed.

  • Well-formed
  • <html>
  • <body style="font-style: italic">
  • This is a well-formed document.
  • </body>
  • </html>
  • Not well-formed
  • <html>
  • <body style=font-style: italic>
  • This is not a well-formed document.
  • </html>
  • </body>
  • The style attribute value is not in quote marks,
    and the html and body tags don't nest correctly.
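
A well-formedness check can be sketched with the JDK's standard JAXP parser: a document is accepted if parsing completes, and rejected if the parser reports a fatal error. The class name WellFormedCheck and method isWellFormed are invented for this example. Installing org.xml.sax.helpers.DefaultHandler as the error handler simply keeps the parser from printing its own diagnostics; fatal errors are still thrown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    static final String GOOD =
        "<html><body style=\"font-style: italic\">"
        + "This is a well-formed document.</body></html>";

    // Unquoted attribute value, and html/body end-tags in the wrong order.
    static final String BAD =
        "<html><body style=font-style: italic>"
        + "This is not a well-formed document.</html></body>";

    /** Returns true if a non-validating parser accepts the text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler());   // rethrows fatal errors, prints nothing
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                                    // syntax error: not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(GOOD));
        System.out.println(isWellFormed(BAD));
    }
}
```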

Install Xerces
  • The Xerces parser is a product of the Apache XML
    project, http://xml.apache.org.
  • Follow the Xerces Java 2 project link and go to
    the download area, then to the master
    distribution directory or a mirror directory.
  • Download Xerces-J-bin.2.6.2.zip, and extract it
    to a suitable place, e.g. C:\
  • When extracting, remember to select Use folder
  • This should create a folder called, e.g.,

Put Xerces on your Class Path
  • Using the menu at
  • Control Panel → System → Advanced → Environment
  • add the jar files xercesSamples.jar,
    xercesImpl.jar, and xml-apis.jar, to your class
  • E.g. append
  • C:\xerces2_6_2\xercesSamples.jar;C:\xerces2_6_2\
  • to the value of your CLASSPATH variable.

Example Using Xerces
  • Copy the two HTML examples given above to files
    called, say, wellformed.html and illformed.html.
    Then, in a new Command Prompt window, try running
    the commands
  • > java dom.Writer wellformed.html
  • > java dom.Writer illformed.html
  • The first command should just echo the document.
    The second should print a syntax error message.
  • dom.Writer is one of the sample applications in
    the Xerces release. It simply uses the Xerces
    parser to convert the source file to a tree data
    structure (DOM), then converts the tree back to
    nicely formatted XML, which it prints.

Rolling Your Own Parser?
  • People approaching XML sometimes decide they can
    write their own lightweight parser that handles
    just the bit of XML their application needs.
  • In general this is a bad idea!
  • We will see later that even basic XML is a
    moderately complex specification; unless you are
    going to invest a lot of effort it is unlikely
    you can parse the full specification more
    efficiently than existing parsers.
  • If you subset the specification you may be
    compromising the most crucial advantage that XML
    brings to your application: interoperability.
  • Later in these lectures we will see how to use
    the Xerces parser from your own Java programs, to
    read XML input.

Valid Documents
  • An XML document may optionally include a Document
    Type Definition (DTD).
  • This declares the names of all elements and
    attributes appearing in the document, and how
    they may nest.
  • The DTD also declares and defines entities that
    may be referenced from within document content.
  • A well-formed XML document that includes a
    DTD, and accurately follows the declarations in
    that DTD, is called valid.

Invalid Documents
  • It is quite possible to parse invalid (but
    well-formed) documents, by using a non-validating
  • Many applications accept XML files without DTDs,
    which are therefore technically invalid.
  • Applications may exploit validation mechanisms
    other than DTDs. An important one is XML Schema
    which we will discuss later.
  • A document validated against an XML Schema
    usually does not have a DTD, so technically is
    invalid as far as the base XML specification is
  • But of course it is valid relative to the XML
    Schema specification!

Validation Side Effects
  • The use of a validating parser certainly affects
    what documents are treated as legal.
  • In some cases switching on validation may also
    alter the exact data passed from the parser to
    application. These effects will be considered
    when we discuss DTDs.

Physical Entities
  • An XML document is represented by one or more
    storage units (typically files), called
  • We can enumerate five kinds:
  • Document entities: root XML documents.
  • Parsed external entities, which contain
    fragmentary XML content.
  • External DTD subsets, which contain some or all
    of the DTD declarations needed by a document.
  • External parameter entities, which also contain
    fragmentary DTD content.
  • Unparsed external entities, which are usually
    complete binary files in some native format
    (not XML).

Physical Structure
  • The structure of a non-trivial XML document is
    illustrated in the following figure.
  • Every XML document must have exactly one document
  • It may also involve zero or more external
  • The document entity may reference any number of
    external general entities. These can be parsed
    external entities or unparsed external entities.
    A parsed external entity may in turn reference
    other external general entities.
  • The document may have at most one external DTD
  • A DTD subset in the document entity, or an
    external DTD subset, may reference any number of
    external parameter entities (which may in turn
    reference other external parameter entities).

A Complex XML Document
[Figure: a document entity together with an external DTD subset, two external parameter entities, three parsed external entities, and one unparsed external entity.]
Syntactic Features
  • The following two tables summarize the
    top-level syntax of all the constructs in XML.
    Full details will be given in later slides, as
  • The first columns give an abbreviated example of
    the syntax, the second columns (what?)
    describe the construct, and the third columns
    (where?) specify the places in an XML document
    where the construct may appear.
  • In a where? column, Document means at the
    top-level of the document entity, and Content
    means in the kind of content allowed in an
    element, also called Parsed Character Data.
  • A Literal is character data in quotes; exactly
    what can appear in a literal depends strongly on
    its context.
  • XML Names will be discussed shortly.

Syntax I Logical Structures
Example Syntax            What?                                Where?
<Name ...>Content</Name>  Element                              Document, Content
Name = Literal            Attribute specification              Element start tag
<?xml ... ?>              XML declaration / Text declaration   Document / External entity
<?Name ... ?>             Processing instruction               Document, DTD, Content
<!-- ... -->              Comment                              Document, DTD, Content
<!DOCTYPE ... >           DTD                                  Document
<!ELEMENT ... >           Element declaration                  DTD
<!ATTLIST ... >           Attributes declaration               DTD
<!ENTITY ... >            Entity declaration                   DTD
<!NOTATION ... >          Notation declaration                 DTD
Syntax II References, Sections
Example Syntax            What?                                Where?
&#Code-point;             Character reference                  Content, Literal
&Name;                    Entity reference                     Content, Literal
%Name;                    Parameter entity reference           DTD
<![CDATA[ ... ]]>         CDATA section                        Content
<![IGNORE[ ... ]]>        Conditional section                  DTD
<![INCLUDE[ ... ]]>       Conditional section                  DTD
Character Set
  • Every XML document must be composed using the
    Unicode character set.
  • The specification does not stipulate any
    particular encoding, though defaults are UTF-8 or
  • ASCII is a subset of Unicode, so you can create
    XML documents using your favorite, pre-Unicode,
    text editor.

Allowed Character Ranges
  • The allowed characters are
  • white space:
  • U+0020 (space), U+0009 (tab), U+000A (line
    feed), U+000D (carriage return),
  • plus all Unicode characters higher than
    U+0020, excluding:
  • The surrogate blocks U+D800..U+DFFF.
  • U+FFFE and U+FFFF (noncharacters in Unicode.
    Note FFFE (hexadecimal) is the BOM after byte-reversal in
  • Because some codes are forbidden, we can't consider
    including raw binary data in parsed XML (without
    additional encoding).
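
The allowed ranges translate directly into a small predicate. The sketch below follows the XML 1.0 rules as listed above; the class name XmlChars and both method names are invented for this example. String.codePoints (Java 8 and later) is used so that characters outside the BMP are tested as whole code points, and unpaired surrogates are rejected.

```java
public class XmlChars {
    /** True if the code point may appear in an XML 1.0 document. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)          // space up to just below the surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)        // above the surrogates, minus FFFE and FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // planes 1 to 16
    }

    /** True if every code point of the string is a legal XML 1.0 character. */
    static boolean isLegalXmlText(String s) {
        return s.codePoints().allMatch(XmlChars::isXmlChar);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlText("plain text\twith a tab"));
        System.out.println(isLegalXmlText("binary \u0000 data"));   // NUL is forbidden
        System.out.println(isXmlChar(0xFFFE));                      // byte-reversed BOM
    }
}
```

This is why raw binary data needs an additional encoding (Base64, for example) before it can be placed in parsed XML: arbitrary bytes will sooner or later produce a forbidden code.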

Names and Name Tokens
  • In XML, names are used in many places
  • An element has a name, an attribute has a name,
    an entity is referenced by a name, etc.
  • As in programming languages, there are rules
    about what constitutes a valid name (next slide).
  • In XML there is also a concept of name tokens,
    which are strings similar to names that can be
    specified as values of certain types of
  • They are less restricted. For example a number
    can be a valid name token.

What is a Name?
  • Well-formed XML name tokens include any sequence
    of the following characters:
  • a letter
  • a digit
  • a period (.), hyphen (-), underscore (_),
    or colon (:)
  • Well-formed XML names include any name token that
    starts with a letter, _ or :.
  • Names that begin with XML (in upper, lower or
    mixed case) are reserved.
  • Case is significant in XML names; names are only
    identified if they consist of identical character

  • Some well-formed names
  • bryan
  • Big-Foot
  • Century_21
  • ratedPG
  • _.--._
  • Some illegal or reserved names
  • 2004
  • .com
  • xFFFF
  • xml-1.0
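
The naming rules can be approximated in a few lines of Java. This is a simplified sketch, not the exact XML 1.0 definition: it uses Character.isLetter and Character.isDigit as stand-ins for the specification's own letter and digit tables, ignores combining characters and extenders, and assumes that a name may start with a letter, underscore or colon, as in XML 1.0. The class and method names are invented for this example.

```java
public class XmlNames {
    private static boolean isNameChar(char c) {
        return Character.isLetter(c) || Character.isDigit(c)
            || c == '.' || c == '-' || c == '_' || c == ':';
    }

    /** A name token: any non-empty sequence of name characters. */
    static boolean isNameToken(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) return false;
        }
        return true;
    }

    /** A name: a name token that starts with a letter, underscore or colon. */
    static boolean isName(String s) {
        if (!isNameToken(s)) return false;
        char first = s.charAt(0);
        return Character.isLetter(first) || first == '_' || first == ':';
    }

    /** Names beginning with "xml" in any mix of case are reserved. */
    static boolean isReserved(String s) {
        return s.regionMatches(true, 0, "xml", 0, 3);
    }

    public static void main(String[] args) {
        String[] samples = { "bryan", "Big-Foot", "Century_21", "_.--._", "2004", ".com", "xml-1.0" };
        for (String s : samples) {
            System.out.println(s + ": name=" + isName(s)
                    + " token=" + isNameToken(s) + " reserved=" + isReserved(s));
        }
    }
}
```

Note that 2004 and .com fail as names but pass as name tokens, which is the sense in which name tokens are less restricted.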

What is a Letter?
  • In defining names and name tokens, XML 1.0 relies
    on a definition of letter and digit. These are
    Western-centric notions; see Chapter 4 of the
    Unicode 4 standard for relevant discussion.
  • XML 1.0 defines a letter as an alphabetic or
    syllabic base character or an ideograph, and
    gives a (probably dated) recipe for extracting
    these from Unicode character databases.
  • XML 1.1 gives intentionally liberal Unicode
    ranges for NameStartChar and NameChar, then has a
    non-normative appendix suggesting what kind of
    characters should be preferred.

  • The most characteristic markup feature of XML is
    the element.
  • The basic syntax for an element is either
  • <Name>Content</Name>
  • or
  • <Name/>
  • Here Name is an XML name: the type of element.
  • In the first case, Content stands for further
    text, which may include nested elements.
  • The second case is called an empty element. It
    is formally equivalent to
  • <Name></Name>

Examples from XHTML
  • XHTML is an XML-compatible dialect of HTML.
  • Examples
  • <h1>A header</h1>
  • This is an element of type h1 with content text
    "A header".
  • <body><h1>My Page</h1>Welcome.</body>
  • This is an element of type body. The content
    is a nested h1 element plus the text "Welcome.".
  • <hr/>
  • This is an empty element of type hr.
  • The examples here don't illustrate it, but white
    space is allowed in the tags after the element
    name (though not between the < and the name).

  • An element start tag (or an empty element tag)
    may include one or more attribute specifications.
  • An attribute specification has the general syntax
  • Name = "Value"
  • or
  • Name = 'Value'
  • where Name is the name of the attribute and
    Value is some text.
  • This value text must not include the literal
    character < (directly or through an entity
    replacement, see later).
  • It must not include the character (" or ') used
    to delimit it.
  • The value text may include line breaks and other
    white space.
  • Attribute specifications can have white space
    around the =.

Examples from SVG
  • SVG (Scalable Vector Graphics) is an XML notation
    for representing graphical objects.
  • Examples
  • <rect x="50" y="50" width="100" height="75"/>
  • An empty element representing a rectangle at
    position (50,50), with shape 100 × 75 pixels.
  • <g transform='rotate(45)'>
  • <circle cx='150' cy='50' r='25'/>
  • <text x='125' y='100'>A Circle</text>
  • </g>
  • A group containing a circle and some text. The
    group as a whole is rotated 45° about the origin,
    as specified by the attribute with name transform
    and value rotate(45).

Possible Display of Examples
The Document
  • It is recommended to start a document entity with
    an XML declaration, although technically this is
  • An XML declaration must be strictly the first
    thing in the file; no white space can go before an
    XML declaration (a single BOM is allowed).
  • Every XML document entity must contain exactly
    one top-level element called the root element.
  • Of course there can be any number of elements
    nested inside the root element.
  • If there is a Document Type Declaration, this
    must appear before the root element.
  • Miscellaneous other things are allowed at the
    document level (anywhere between XML declaration
    and end of file)
  • White space
  • Comments
  • Processing instructions
  • Anything before the root element is collectively
    called the prolog.

Document Layout
  • XML declaration (optional)
  • optional comments and processing instructions
  • Document Type Declaration (optional)
  • optional comments and processing instructions
  • root element (required)
  • optional comments and processing instructions

XML Declaration
  • It is strongly recommended to start any document
    entity with an XML Declaration.
  • The XML declaration has syntax
  • <?xml version=Literal optional
    declarations ?>
  • The version specification and optional
    declarations have a syntax similar to attribute
    specifications on elements (declared values are
    quoted the same way).
  • The value assigned to version must be 1.0 or 1.1,
    according to which version of the XML
    specification you are following (1.0 may be
    prudent, for now).
  • The optional declarations are the encoding
    declaration and the standalone declaration.

Example XML Declaration
  • Here is a complete (and completely pointless) XML
    document with an XML declaration (including
    optional parts) and an empty root element
  • <?xml version="1.0" encoding="UTF-8"
  • <my-root/>
  • In general the encoding declaration, if included,
    specifies the character encoding scheme.
  • This must be an encoding of Unicode, but it may
    be some encoding not defined in the Unicode
  • Informally a standalone declaration says whether
    this document stands alone. See later for
    details. Meanwhile, if in doubt, leave it
    out; the default is always safe.

On Encoding Declarations
  • If you are awake, you may wonder what is the
    point of declaring the character encoding inside
    a document encoded according to that
    scheme; apparently we can't read the declaration
    unless we already know the scheme!
  • The XML specification gives an argument about
    auto-detection of the encoding scheme allowing us
    to read far enough into the document to read the
    contents of the declaration.
  • It is debatable how much point there is in giving
    an explicit encoding declaration; it seems more
    logical to rely on auto-detection of standard
    encodings alone, or external metadata?

  • Comments can appear at the top level of a
    document, in the DTD, or in the document content.
  • Specification says the parser may or may not pass
    the text of the comment to the application.
  • XML comments are not white space; placement is
    more restricted.
  • Comments have the syntax
  • <!-- Text -->
  • where Text is any text, except it must not
    contain two adjacent hyphens, --, and it must
    not end with a hyphen:
  • <!-- This--is--not--legal -->
  • <!-- Neither is this--->
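
The two restrictions on comment text are simple enough to check before generating XML. The following sketch is only an illustration of the rule as stated above; the class XmlComments and its methods are invented names, not part of any XML library.

```java
public class XmlComments {
    /** True if the text may legally appear between the comment delimiters. */
    static boolean isLegalCommentText(String text) {
        return !text.contains("--") && !text.endsWith("-");
    }

    /** Wraps text in comment delimiters, refusing text that would break the rule. */
    static String toComment(String text) {
        if (!isLegalCommentText(text)) {
            throw new IllegalArgumentException("illegal comment text: " + text);
        }
        return "<!--" + text + "-->";
    }

    public static void main(String[] args) {
        System.out.println(toComment(" Text "));
        System.out.println(isLegalCommentText(" This--is--not--legal "));
        System.out.println(isLegalCommentText(" Neither is this-"));
    }
}
```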

Processing Instructions
  • Processing instructions can appear anywhere that
    comments can; they effectively are comments so far
    as the XML parser is concerned.
  • But the specification requires the parser to pass
    the text of a processing instruction to the
  • The syntax is
  • <?Target Text ?>
  • where Target is an XML name, and Text is any
    text, except that it must not contain the string
  • Processing instructions are a convenience for the
    application. They allow application-specific
    annotation of the document (example: <?php ... ?>).
    The target name may be declared as a notation
    (see later).

The Document Type Definition
Document Types
  • In the syntax for the document entity, we saw
    that the document type declaration was an
    optional feature.
  • This declaration, if present, contains the
    document type definition, or DTD.
  • A validating parser will read the DTD, which
    should contain (among other things) declarations
    of all the elements and attributes appearing in
    the body of the document.
  • The DTD is required if the parser is validating,
    but optional for a non-validating parser
  • Even a non-validating parser may read the DTD if
    it is present, to look for entity declarations.
    These will be discussed later.

Document Type Declaration
  • The document type declaration, if present,
    appears before the root element of the document.
  • The most general form of this declaration
    contains three things
  • The type (i.e. name) of the following root
  • An identifier for an External DTD Subset
  • An Internal DTD Subset.
  • Items 2. and 3. are optional.
  • The DTD itself is either given in an external
    file (external subset), or in line in the
    document (internal subset), or divided between
    the two.

General Syntax
  • General syntax is one of
  • <!DOCTYPE Name [ Declarations ]>
  • <!DOCTYPE Name External-ID>
  • <!DOCTYPE Name External-ID [ Declarations ]>
  • where Name is the type of the root element,
    External-ID is an identifier for an external
    entity (a separate file containing the external
    DTD subset), and Declarations is the internal DTD
  • Syntactically, the form
  • <!DOCTYPE Name>
  • is allowed, but can never yield a valid
    document. (Why?)

External Entity Identifiers
  • External entity identifiers can occur in a couple
    of places; they will be discussed in the next
    section when we discuss entity declarations.
  • Meanwhile the simplest form is SYSTEM Literal,
    where Literal contains a URI: a file name or URL
    or (in principle) a URN.
  • So a valid document with an external DTD might
  • <!DOCTYPE my-root SYSTEM "my-type.dtd">
  • <my-root>
  • </my-root>
  • where my-type.dtd is a local file name.

Example DTD
  • A DTD (subset) contains a series of DTD
  • Comments and processing instructions can be
    interleaved amongst the declarations.
  • A possible pair of declarations in an internal
  • <!DOCTYPE my-root [
  • <!ELEMENT my-root
  • <!ELEMENT my-leaf EMPTY>
  • ]>
  • These declare two types of element: my-root and
  • They also specify a very restricted kind of
    nesting. The only valid document content would
    be something equivalent to
  • <my-root> <my-leaf/> </my-root>

Validation Using dom.Writer
  • If I save the document type declaration and root
    element in a file called internaldtd.xml, I can
    validate the file with the Xerces parser by using
    the dom.Writer sample application as follows
  • > java dom.Writer -v internaldtd.xml
  • If validation is successful, this simply prints a
    formatted version of the input file. If
    validation fails, you will see error messages
    early in the output.
  • The -v flag is needed here; validation is off
    by default.
  • -v must be lower case; -V forces no validation.

Element Type Declarations
  • The general form of an element type declaration
    is
  • <!ELEMENT Name Content-Specification >
  • where
  • Name is an XML name, the element type, and
  • Content-Specification describes the content
    allowed in elements of type Name (i.e. it
    describes the structure of any text surrounded by
    Name tags).
  • No element type may be declared more than once.

Empty Content Specification
  • The simplest content specification that can
    appear in a document type definition is EMPTY.
    This means elements of this type never have any
    content, not even white space!
  • E.g. the declaration
  • <!ELEMENT b EMPTY>
  • allows elements
  • <b/>, <b></b>
  • but not
  • <b>Hello</b>, <b><!-- Hello --></b>, <b> </b>,
  • <b>
  • </b>
  • The form <b/> is preferred when the element type
    is declared EMPTY; <b></b> is preferred when
    empty content matches another declaration.
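  • The EMPTY rule can be sketched as a check that the
    parser reports nothing at all between the start and
    end of the element. This is an illustration using
    Python's xml.parsers.expat, not a validator; the
    function name has_empty_content is hypothetical.

```python
from xml.parsers import expat

def has_empty_content(document):
    """Return True if the root element contains nothing at all."""
    content = []   # everything reported inside the root element
    depth = 0      # 0 outside the root element, 1 directly inside it

    def start(name, attributes):
        nonlocal depth
        if depth > 0:
            content.append(("element", name))
        depth += 1

    def end(name):
        nonlocal depth
        depth -= 1

    def text(data):
        if depth > 0:
            content.append(("text", data))      # includes white space

    def comment(data):
        if depth > 0:
            content.append(("comment", data))

    def instruction(target, data):
        if depth > 0:
            content.append(("pi", target))

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.CommentHandler = comment
    parser.ProcessingInstructionHandler = instruction
    parser.Parse(document, True)
    return not content

for sample in ["<b/>", "<b></b>", "<b>Hello</b>", "<b><!-- Hello --></b>", "<b>\n</b>"]:
    print(repr(sample), has_empty_content(sample))
```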

Parsed Character Data
  • The content specification (#PCDATA) means
    elements include only flat text; no nested
    elements are allowed. PCDATA stands for Parsed
    Character Data.
  • Comments and processing instructions are allowed
    in parsed character data.
  • E.g. the declaration
  • <!ELEMENT a (#PCDATA)>
  • allows elements
  • <a>Hello, hello </a>,
  • <a>Hello <!-- pause --> hello</a>,
  • but not
  • <a>Hello <b/> hello</a>,
  • <a> <a>Hello</a> hello </a>.

Content Models and Mixed Content
  • Two sorts of content specification allow for
    nested elements.
  • If an element may contain only nested elements,
    with no interspersed character data, one gives a
    content model.
  • E.g. typical content like this
  • <c> <a>Hello</a> <b/> </c>
  • c contains nested a and b elements, but no
    directly nested character data.
  • Otherwise, if character data and elements can
    both appear, the content is specified (less
    precisely) as mixed content.
  • E.g. typical content like this
  • <c> Hello <b/> </c>
  • c can't have a content model because the
    character data Hello is directly nested in c.

Structure of Content Models
  • A content model is a kind of regular expression,
    in which whole elements are treated as the
    basic tokens.
  • The content model
  • (a, b, c)
  • means the content is a sequence of an a
    element, a b element, and a c element, in exactly
    that order.
  • The content model
  • (a | b | c)
  • means the content is a choice of an a
    element, a b element, or a c element.

Composing Content Models
  • You can compose sequences and choices with
    appropriate parentheses, e.g.
  • ((a | b), c)
  • means a sequence a c or a sequence b c, and
  • ((a, b) | c)
  • means a sequence a b or a single c.
  • In these expressions you can follow an individual
    element name or a parenthesized expression by one
    of the modifiers ?, *, or +.
  • ? means the term is optional (occurs zero or one
    times).
  • * means the term can be repeated zero or more
    times.
  • + means the term can be repeated one or more
    times, e.g.
  • (a | b | c)+
  • means any combination of one or more a's,
    b's and c's.
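  • Because a content model is a kind of regular
    expression over element names, it can be
    translated mechanically into an ordinary regex.
    The sketch below does this in Python, assuming
    each child element is written as its name followed
    by one space. The helper names content_model_regex
    and matches are hypothetical, and mixed content
    (#PCDATA) is not handled.

```python
import re

def content_model_regex(model):
    """Translate a DTD content model such as '((a | b), c)' into a regex.

    Each element name becomes the token 'name ' (the name plus one space),
    so a list of child elements is matched as a space-terminated string.
    """
    pattern = re.sub(r"\s+", "", model)        # drop insignificant white space
    pattern = pattern.replace("(", "(?:")      # groups need not capture
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group()) + " )",
                     pattern)
    pattern = pattern.replace(",", "")         # ',' is plain concatenation
    return re.compile(pattern)

def matches(model, children):
    """True if the list of child element names satisfies the content model."""
    text = "".join(name + " " for name in children)
    return content_model_regex(model).fullmatch(text) is not None

print(matches("((a | b), c)", ["b", "c"]))
print(matches("((a, b) | c)", ["a", "c"]))
print(matches("(a | b | c)+", ["c", "a", "a", "b"]))
```

  • The modifiers ?, * and + carry over unchanged
    because they mean the same thing in both
    notations.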

Content Model Example
  • Suppose the format of a simple report is a
    title, followed by any mix of paragraphs and
    figures, followed by an optional bibliography
  • A suitable document type might start
  • <!DOCTYPE report [
  • <!ELEMENT report
  • (title, (paragraph |
    figure)*, bibliography?) >
  • <!ELEMENT title (#PCDATA)>
  • <!ELEMENT paragraph (#PCDATA)>
  • <!ELEMENT figure EMPTY>
  • <!ELEMENT bibliography (reference*) >
  • ]>

Example report Data
  • <report>
  • <title>Early Use of XML</title>
  • <paragraph>Recently uncovered documents (see [...])
  • prove XML was first used by the Incas of
  • ancient Peru.</paragraph>
  • <figure source="notafake.jpg"/>
  • <paragraph>The author is grateful to W3C for
    making this
  • research possible.</paragraph>
  • <bibliography>
  • </bibliography>
  • </report>
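  • As a rough check of the report example, the sketch
    below reads the children of the report element with
    Python's xml.etree.ElementTree and matches their
    order against a hand translation of the content
    model (title, (paragraph | figure)*,
    bibliography?). The paragraph text is abbreviated,
    and the names REPORT_MODEL and child_sequence are
    hypothetical. This checks only the order of child
    elements; it is not a validating parse.

```python
import re
import xml.etree.ElementTree as ET

REPORT = """<report>
  <title>Early Use of XML</title>
  <paragraph>Recently uncovered documents prove XML was first used
  by the Incas of ancient Peru.</paragraph>
  <figure source="notafake.jpg"/>
  <paragraph>The author is grateful to W3C for making this
  research possible.</paragraph>
  <bibliography>
  </bibliography>
</report>"""

# Hand translation of (title, (paragraph | figure)*, bibliography?),
# with every element written as its name followed by one space.
REPORT_MODEL = re.compile(r"title (?:paragraph |figure )*(?:bibliography )?")

def child_sequence(element):
    """Names of the directly nested elements, each followed by a space."""
    return "".join(child.tag + " " for child in element)

root = ET.fromstring(REPORT)
sequence = child_sequence(root)
print(sequence)
print(REPORT_MODEL.fullmatch(sequence) is not None)
```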

Content Model Miscellany
  • As stated earlier, character data cannot appear
    in a (valid) element described by a content
    model. But white space, comments, and processing
    instructions can be interleaved in an allowed
    element sequence.
  • The first character in the top level of the
    content model expression must be a left
    parenthesis: (a) and (a)* are allowed as content
    models, but a is not.
  • The specification says a content model must be
    deterministic. This fairly technical requirement
    (needed for SGML compatibility only) is outlined
    in the following few slides.

Positions in Content Models
  • For purposes of this discussion, we label every
    appearance of an element type in a content model
    with some unique identifier. We call a labeled
    element type a position.
  • E.g. if the content model is
  • (a, (b, c)*, (b | a))
  • then a labeled version could be
  • (a1, (b2, c3)*, (b4 | a5))
  • which includes positions a1, b2, c3, b4, and
    a5.

  • Imagine reading some input string of elements
    from the document, one element at a time, and
    assume that at the time each individual element
    is read we have no knowledge of what follows it
    in the document.
  • A content model is deterministic if, as elements
    are read from the document and matched against
    the content model, there is never more than one
    position in the content model that could match
    each input element.
  • This must be true for all possible input strings.
  • To give a full formal definition, we would need
    to define a "follows" relation between positions.
    We won't do that here.

Determinism Examples
  • The content model
  • ((b1, c2) | (b3, d4))
  • is not deterministic. Suppose the first
    element read is a b. At the time it is read, the
    b may match against b1 or against b3.
  • The content model (b, (c | d)) is equivalent and
    deterministic.
  • The content model
  • (a1, (b2, c3)*, (b4 | a5))
  • is not deterministic. Suppose one a element
    has already been read and the next element
    encountered is a b. At the time b is read, it
    may match against b2 or b4.
  • In this case DTDs provide no equivalent
    deterministic model.
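  • The claim that (b, (c | d)) is equivalent to the
    non-deterministic ((b, c) | (b, d)) can be checked
    by brute force. The sketch below writes both
    models as Python regexes over space-terminated
    element names and compares them on every short
    sequence. Python's backtracking regex engine does
    not care about determinism, so this illustrates
    only the equivalence, not the determinism test
    itself. All names in the code are hypothetical.

```python
import re
from itertools import product

# Elements are written as their name followed by one space, e.g. "b c ".
AMBIGUOUS = re.compile(r"(?:b c |b d )")      # ((b, c) | (b, d))
DETERMINISTIC = re.compile(r"b (?:c |d )")    # (b, (c | d))

def sequences(names, max_length):
    """Every sequence of element names up to max_length, as token strings."""
    for length in range(max_length + 1):
        for combination in product(names, repeat=length):
            yield "".join(name + " " for name in combination)

def same_language(first, second, names=("b", "c", "d"), max_length=3):
    """True if both patterns accept exactly the same short sequences."""
    return all(
        (first.fullmatch(s) is None) == (second.fullmatch(s) is None)
        for s in sequences(names, max_length)
    )

accepted = [s for s in sequences(("b", "c", "d"), 3) if AMBIGUOUS.fullmatch(s)]
print(accepted)
print(same_language(AMBIGUOUS, DETERMINISTIC))
```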

Mixing Content
  • Many XML formats allow character data and
    elements to be intermingled in the content of a
    single element.
  • For example XHTML allows a mixture of text and
    element markup in the content of its body element
    (or, say, its p element).
  • Neither the parsed character content
    specification (#PCDATA), nor a content model
    specification, will allow this mixture.
  • Instead one must use a mixed content
    specification.

Mixed Content Specifications
  • A mixed content specification looks like
  • (#PCDATA | Name1 | Name2 | ... | Namen)*
  • where Name1, Name2, ..., Namen are different
    element types.
  • This may look like a content model, but it is
    not! A mixed content specification must appear in
    the ELEMENT declaration in exactly the form
    given.
  • You can't replace any of the Name fields with
    more general expressions; the #PCDATA token must
    appear first in the parentheses; the right
    parenthesis must be followed by *.
  • Nevertheless the matching rules are what the
    syntax suggests: a valid element can contain an
    unrestricted mix of character data and the listed
    types of element.

Mixed Content Example
  • Earlier we said that an element c like
  • <c> Hello <b/> </c>
  • cannot be described by a content model.
  • It can be described by the declaration
  • <!ELEMENT c (#PCDATA | b)* >
  • Note however this allows any sequence of text and
    b elements.
  • There is no way to specify there must be exactly
    one piece of text preceding exactly one b element.
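  • A small sketch of how loose the mixed content
    declaration (#PCDATA | b)* is. It flattens the
    content of c into a list of tokens (text chunks
    and element names) using Python's
    xml.etree.ElementTree and matches that list against
    the corresponding regex. The names MIXED_C,
    content_tokens and allowed are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Token form of (#PCDATA | b)*: any run of text chunks and b elements.
MIXED_C = re.compile(r"(?:#PCDATA |b )*")

def content_tokens(element):
    """Flatten the content of an element into text and element tokens."""
    tokens = []
    if element.text:
        tokens.append("#PCDATA")
    for child in element:
        tokens.append(child.tag)
        if child.tail:                 # text that follows the child element
            tokens.append("#PCDATA")
    return tokens

def allowed(markup):
    tokens = content_tokens(ET.fromstring(markup))
    return MIXED_C.fullmatch("".join(t + " " for t in tokens)) is not None

print(content_tokens(ET.fromstring("<c> Hello <b/> </c>")))
print(allowed("<c> Hello <b/> </c>"))
print(allowed("<c><b/><b/>Hello<b/></c>"))   # also accepted: order is unconstrained
print(allowed("<c> Hello <a/> </c>"))        # a is not in the list of names
```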

ANY Content
  • The last kind of content specification allowed in
    an element declaration is ANY.
  • This is equivalent to a mixed specification
    naming all elements declared anywhere in this
    DTD.
  • It does not allow appearance of elements that are
    undeclared, and all nested elements must be valid
    according to their own declarations!

Attribute-List Declarations
  • For a valid document, besides declaring the
    elements themselves, you must also declare all
    attributes that appear on all elements.
  • An ATTLIST declaration declares a list of
    attributes for a named element type.
  • For each attribute in the list, the declaration
    defines three things
  • the name of the attribute,
  • the type of values it can be assigned, and
  • whether the attribute has a default value, and if
    so what it is.

Syntax of ATTLIST
  • The syntax is fairly unstructured: it contains
    the name of the element the attributes apply to,
    then simply a list of triples
  • <!ATTLIST Element-Name
  • Name1 Type1 Default1
  • Name2 Type2 Default2
  • ...
  • >
  • where Name1, Type1, Default1, etc. are the
    attribute properties mentioned on the previous
    slide.

Attribute Types
  • There is a longish list of allowed types for
    attributes (not nearly as long as in XML Schema).
    The specification subdivides them into
  • string type,
  • tokenized types, and
  • enumerated types.
  • The simplest and most general is string type,
    indicated by the keyword CDATA in the attribute
    declaration.
  • The value of an attribute declared with this
    type can be any literal string.
  • Other types will be described shortly.

Attribute Defaults
  • The attribute type is followed by a default rule,
    one of
  • Literal
  • #FIXED Literal
  • #REQUIRED
  • #IMPLIED
  • Literal is a default value. The attribute is
    always logically present (passed to application),
    but optionally specified.
  • The #FIXED modifier means the attribute can only
    take its default value (trying to specify
    something else is invalid).
  • #REQUIRED means the attribute must be specified
    (so no default is necessary).
  • #IMPLIED means the attribute is optional (if
    unspecified it is absent, so no default is
    necessary).

Attribute Default Examples
  • Attribute list declaration
  • <!ATTLIST a
  • val CDATA "nothing"
  • fix CDATA #FIXED "constant"
  • req CDATA #REQUIRED
  • opt CDATA #IMPLIED>
  • Instances of element a
  • <a val="something" fix="constant"
  • req="reading"/>
  • <a req="no experience"/>
  • <!-- OK: val = nothing, fix =
    constant, opt absent. -->
  • <a fix="variable"/>
  • <!-- Invalid! fix not constant and
    req unspecified. -->
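  • The effect of default rules on what the
    application sees can be sketched with Python's
    xml.parsers.expat. Expat is non-validating, so it
    supplies defaults from the internal subset but
    does not reject the invalid third instance. The
    declaration below is an assumed reconstruction:
    the attribute names come from the comments above,
    and the CDATA types of fix, req and opt are
    guesses. The function name
    attributes_passed_to_application is hypothetical.

```python
from xml.parsers import expat

# Assumed reconstruction of the declaration: the types of fix, req and opt
# are taken to be CDATA.
DTD = """<!DOCTYPE a [
  <!ELEMENT a EMPTY>
  <!ATTLIST a
    val CDATA "nothing"
    fix CDATA #FIXED "constant"
    req CDATA #REQUIRED
    opt CDATA #IMPLIED>
]>
"""

def attributes_passed_to_application(instance):
    """Attributes reported for the element, including supplied defaults."""
    seen = {}

    def start(name, attributes):
        seen.update(attributes)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(DTD + instance, True)
    return seen

first = attributes_passed_to_application(
    '<a val="something" fix="constant" req="reading"/>')
second = attributes_passed_to_application('<a req="no experience"/>')

print(first)
print(second)   # val and fix are filled in from the declaration; opt is absent
```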

Tokenized Attribute Types
  • In place of CDATA we may have
  • NMTOKEN: syntax of attribute value is an XML name
    token (defined earlier), e.g. 2004.
  • NMTOKENS: a list of name tokens.
  • ID: attribute is an identifier for this element.
  • IDREF: attribute is a reference to another
    element in this document.
  • IDREFS: a list of references to other elements.
  • ENTITY: attribute is a reference to an unparsed
    external entity (see next section).
  • ENTITIES: a list of references to unparsed
    external entities.
  • Items in a list are separated by white space.

Element Identifiers
  • The value given to an attribute with type ID
    should follow the syntax of an XML name. This
    name acts as an identifier for the element
    instance on which it appears. For validity:
  • For each element type in the DTD, there should be
    at most one attribute declared with type ID.
  • All identifiers on all element instances in the
    document must be different (regardless of element
    type).
  • For validity, the value given to an attribute of
    type IDREF should be an identifier for an element
    appearing somewhere in the document.

  • Assume the attribute name on element type agent
    is declared to have type ID, and the attribute
    boss is declared to have type IDREF.

  • <agent name="Alice" boss="Alice"/>
  • <agent name="Bob" boss="Alice"/>
  • <agent name="Carole" boss="Alice"/>
  • <agent name="Dave" boss="Bob"/>
  • This document captures the hierarchy illustrated
    on the right. Can use this technique to
    represent general graphs.
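  • A minimal sketch of the two ID/IDREF validity
    rules applied to the agent example, written with
    Python's xml.etree.ElementTree. The wrapper
    element agents and the function name
    check_and_link are assumptions; the slide shows
    only the agent elements. The checks are done by
    hand here because ElementTree does not validate.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <agents> wrapper is an assumption; a document needs a single root.
DOCUMENT = """<agents>
  <agent name="Alice" boss="Alice"/>
  <agent name="Bob" boss="Alice"/>
  <agent name="Carole" boss="Alice"/>
  <agent name="Dave" boss="Bob"/>
</agents>"""

def check_and_link(root):
    """Check the ID and IDREF rules, then return boss -> direct reports."""
    ids = [agent.get("name") for agent in root.iter("agent")]
    if len(ids) != len(set(ids)):
        raise ValueError("ID values must be unique within the document")

    reports = defaultdict(list)
    for agent in root.iter("agent"):
        boss = agent.get("boss")
        if boss not in ids:
            raise ValueError(f"IDREF {boss!r} does not match any ID")
        if boss != agent.get("name"):      # Alice is her own boss: skip the loop
            reports[boss].append(agent.get("name"))
    return dict(reports)

hierarchy = check_and_link(ET.fromstring(DOCUMENT))
print(hierarchy)
```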

Enumerated Attribute Types
  • The type field in an ATTLIST declaration may have
    the form
  • (Token1 | Token2 | ... | Tokenn)
  • where each Token is an XML name token.
  • This says that the specified value of the
    attribute must be one of these token values.
  • Example attribute list declaration
  • <!ATTLIST a
  • color (red | green |
    blue | white) "white">
  • Instances of element a
  • <a color="red"/> <!-- OK. -->
  • <a color="black"/> <!--
    Invalid! -->

Notation Attribute Types
  • The type field in an ATTLIST declaration may have
    the form
  • NOTATION (Name1 | Name2 | ... | Namen)
  • where each Name is declared elsewhere in the
    DTD as a notation (see next section).
  • The handling of this type is similar to
    enumeration types.

Attribute Declaration Miscellany
  • Attributes for a single element type may all
    appear in a single ATTLIST declaration, or they
    may be divided over several ATTLIST declarations.
  • It is allowed to declare the same attribute more
    than once, but any declarations after the first
    are ignored.
  • It is allowed (but pointless) to declare an
    attribute for an undeclared element type.

Attribute Order
  • Attribute specifications can appear in element
    tags in any order.
  • The same attribute cannot be specified more than
    once on a single element.

Entities, References, and other Processing Issues
Collecting Things Together
  • So far we have described an ideal subset of XML
    in which a document contains DTD, elements,
    attributes, and character data, all laid out
    in place.
  • In reality fragments of content and DTD may be
    defined in places other than where they
    ultimately appear (perhaps in other files), and
    portions of text may need special processing
    before they are made available to the
    application.
  • To complete the discussion we must cover
  • Character references and entity references.
  • Internal and external entity declarations, and
  • CDATA sections.
  • Conditional sections.

Character References
  • Character references can be viewed as an escape
    mechanism that allows us to include
    specially-treated or hard-to-type characters in
    the XML document.
  • They include the Unicode code point for a
    character, taking either of the forms &#dd; or
    &#xXX;, where the d's are decimal digits and the
    X's are hexadecimal digits.
  • For example
  • &#38; or &#x26; represents & (ampersand).
  • &#60; or &#x3C; represents < (left angle
    bracket).
  • &#931; or &#x3A3; represents Σ (capital Greek
    sigma).
  • One application is for including the literal
    characters < or & in parsed character data.
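  • The decimal and hexadecimal forms can be checked
    against each other with a short sketch using
    Python's xml.etree.ElementTree. The wrapper
    element t and the function name expand are
    hypothetical.

```python
import xml.etree.ElementTree as ET

def expand(markup):
    """Parse markup inside a throwaway <t> element and return its text."""
    return ET.fromstring(f"<t>{markup}</t>").text

# Decimal and hexadecimal references to the same code point agree.
ampersand = (expand("&#38;"), expand("&#x26;"))
less_than = (expand("&#60;"), expand("&#x3C;"))
sigma = (expand("&#931;"), expand("&#x3A3;"))

print(ampersand, less_than, sigma)
print(0x26, 0x3C, 0x3A3)    # the decimal values of the hexadecimal forms
```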

CDATA Sections
  • CDATA sections provide a way of including a
    section of character data in an XML document.
  • The section can include < and & characters
    without escaping; in a CDATA section these
    characters have no special significance (so
    markup syntax is ignored).
  • The syntax is
  • <![CDATA[ Text ]]>
  • where Text is any text, except that it must
    not contain the string ]]>.
  • One application is for including scripting in
    XML, e.g. JavaScript uses < and & for operators.
  • You cannot include any characters generally
    forbidden in XML: you can't put raw binary data in
    a CDATA section!
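  • A short sketch of the scripting use case, using
    Python's xml.etree.ElementTree: the same script
    text is supplied once in a CDATA section and once
    with escaped characters, and the parser delivers
    identical character data in both cases. The script
    text and the names in the code are hypothetical.

```python
import xml.etree.ElementTree as ET

SCRIPT = "if (a < b && b < c) { swap(a, c); }"

# The same content written with a CDATA section and with escapes.
with_cdata = ET.fromstring("<script><![CDATA[" + SCRIPT + "]]></script>")
with_escapes = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; b &lt; c) { swap(a, c); }</script>")

def is_well_formed(document):
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(with_cdata.text == SCRIPT, with_escapes.text == SCRIPT)
# Without CDATA or escaping, the raw < and & break well-formedness.
print(is_well_formed("<script>" + SCRIPT + "</script>"))
```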

Entity References
  • A character reference includes a single Unicode
    character in the document. An entity reference
    includes the content of some entity, which may
    be the contents of an external file.
  • The simple syntax is
  • &Name;
  • where Name is an XML name.
  • The name of the entity, Name, must have been
    declared in the document DTD. There are just
    five exceptions to this rule.
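  • A minimal sketch of the declaration rule, using
    Python's xml.etree.ElementTree: a reference to an
    entity declared in the internal DTD subset is
    replaced by its content, while a reference to an
    undeclared entity is rejected. The entity name
    course, the element name note and the function
    name text_of are hypothetical.

```python
import xml.etree.ElementTree as ET

DECLARED = ('<!DOCTYPE note [ <!ENTITY course "Core XML"> ]>'
            '<note>Welcome to &course;!</note>')
UNDECLARED = "<note>Welcome to &course;!</note>"

def text_of(document):
    """Text of the root element, or None if the document is rejected."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None

print(text_of(DECLARED))
print(text_of(UNDECLARED))
```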

Predefined Entities
  • As a convenience the entities amp, lt, gt, apos,
    and quot are considered predefined.
  • You may declare them in a DTD, but it isn't
    necessary.
  • They must contain the single-character values
  • &amp; expands to & (ampersand).
  • &lt; expands to < (left angle bracket, or
    less than).
  • &gt; expands to > (right angle bracket, or
    greater than).
  • &apos; expands to ' (single quote, or
    apostrophe).
  • &quot; expands to " (double quote).
  • They provide a more convenient way of including
    reserved characters.
  • Note these are entity references, not character
    references (affects processing in some contexts).
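  • Going the other way, text can be prepared for
    inclusion in a document by replacing reserved
    characters with the five predefined entity
    references. The sketch below uses escape from
    Python's xml.sax.saxutils, which handles &, < and >
    by default; the QUOTES table and the function name
    escape_all are hypothetical additions for the two
    quote characters.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# escape() covers &, < and > itself; the quotes are passed as extra entries.
QUOTES = {"'": "&apos;", '"': "&quot;"}

def escape_all(text):
    """Replace all five reserved characters by predefined entity references."""
    return escape(text, QUOTES)

original = "AT&T said \"1 < 2\" and 'that is all'"
escaped = escape_all(original)
round_trip = ET.fromstring("<t>" + escaped + "</t>").text

print(escaped)
print(round_trip == original)
```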

A Hofstadteresque XHTML Example
  • <html>
  • <body>
  • The source of this document is
  • <pre>
  • &lt;html&gt;
  • &lt;body&gt;
  • The source of this document is
  • &lt;pre&gt;
  • &amp;lt;html&amp;gt;
  • &amp;lt;html&amp;gt;
  • &lt;/pre&gt;
  • &lt;/body&gt;
  • &lt;/html&gt;
  • </pre>
  • </body>
  • </html>
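  • Each level of quotation in this example is one
    more round of escaping. A small sketch with
    escape from Python's xml.sax.saxutils; the
    function name quoted is hypothetical.

```python
from xml.sax.saxutils import escape

def quoted(markup, levels):
    """Escape markup repeatedly, once per level of quotation."""
    for _ in range(levels):
        markup = escape(markup)
    return markup

for level in range(3):
    print(level, quoted("<html>", level))
```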

Declaring Entities
  • Entities are defined in the DTD by an ENTITY
    declaration with