Using XML to Describe Hierarchically Structured Documents

1
Using XML to Describe Hierarchically Structured
Documents
  • Miles Efron
  • School of Information
  • UT Austin

2
The Idea of Metadata
  • Assuming we know how we want to represent our
    documents, metadata provides us with a suitable
    medium.
  • Metadata is structured data about information

Metadata typically adheres to some agreed-upon
conventions. This consistency promotes
interoperability, so your web browser always
knows how to represent a properly structured
document.
  • This structure is usually expressed as
    attribute-value pairs:
  • Element = attribute + value
  • where an attribute is a feature (who, what, when,
    etc.) and a value assigns a specific measurement
    (or answer) to the feature.

3
Elements: The Attribute-Value Model
  • Title: Sense and Sensibility
  • Author: Jane Austen
  • Year: 1811
  • Chapter Heading: 1
  • Text: The family of ... The End

Each document is composed of a group of
elements, each of which is made up of an
attribute and a value.
4
Encoding: The Grammar of Metadata
Title: Sense and Sensibility; Creator: Jane Austen; Year: 1811; Number: 1; Body: The family of ... The End
  • Title: Sense and Sensibility
  • Author: Jane Austen
  • Year: 1811
  • Chapter Heading: 1
  • Text: The family of ... The End

<title>Sense and Sensibility</title>
<author>Jane Austen</author>
<year>1811</year>
<chapterHeading>1</chapterHeading>
<text>The family of ... The End</text>
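The step from attribute-value pairs to tagged elements can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. This is only an illustration: the helper record_to_xml and the wrapper element name "document" are assumptions, since the slide shows no enclosing element, and the elided text element is left out.

```python
import xml.etree.ElementTree as ET

# Attribute-value pairs from the slide; element names follow the slide's markup.
record = {
    "title": "Sense and Sensibility",
    "author": "Jane Austen",
    "year": "1811",
    "chapterHeading": "1",
}

def record_to_xml(pairs, wrapper="document"):
    """Wrap each value in an element named after its attribute."""
    root = ET.Element(wrapper)  # hypothetical wrapper element, not from the slides
    for name, value in pairs.items():
        child = ET.SubElement(root, name)
        child.text = value
    return root

xml_text = ET.tostring(record_to_xml(record), encoding="unicode")
print(xml_text)
```

Each dictionary key becomes a start tag and end tag, and each value becomes the content between them.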
5
SGML-Based Metadata
SGML
XML
HTML
XHTML
A document is composed of pieces called elements.
The elements nest inside each other like small
boxes inside larger boxes, shaping and labeling
the content of the document. (Ray 4)
6
The Idea of Markup Languages
  • A markup language defines metadata that we add to
    a document.
  • Typically markup is interspersed with data from
    the document itself in order to communicate the
    structure inherent in the document in a
    machine-readable format.

7
HTML: Representing form
<html>
<body bgcolor="white">
<b>10 September 2008</b>
<p>Dear Dean Dillon,</p>
<p>I would like very much to teach another section of
INF384C this year. Would you please let me know
if any opportunities become available?</p>
<p>Sincerely, <br/><em>Miles</em> </p>
</body>
</html>
8
HTML: Representing form
9
HTML: a Familiar Markup Language
  • Markup:
  • <b>Hello, World</b>
  • <ul>Hello, World</ul>

Displayed as: Hello, World Hello, World
10
Anatomy of an Element
  • <example>This is an example</example>

11
Anatomy of an Element
  • <example>This is an example</example>

Start tag: <example>
Content: This is an example
End tag: </example>
HTML samples: <body>here is some text</body> <b>This is bold text</b>
12
Anatomy of an Element
  • <e1>This <e2>is</e2> an example</e1>

This element contains another element, in
addition to its textual content.
HTML sample: <body>This is <b>bold</b> text</body>
13
Anatomy of an Element
  • <example3 interesting="no">an example</example3>

Some elements contain one or more
attributes. An attribute consists of a name and
a value. Attributes modify the behavior of an
element.
HTML sample: <a href="miles.html">A link to my home page</a>
14
Anatomy of an Element
  • <example4 exampleAttribute="value" />

Some elements ONLY contain attributes. These
are called empty elements.
HTML sample: <img src="miles.jpg" alt="a terrible picture" />
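The parts of an element named on the last few slides (tag, content, nested elements, attributes, empty elements) can be inspected programmatically. The sketch below uses Python's standard xml.etree.ElementTree on the slide examples; the variable names are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

# An element with a start tag, content, and an end tag.
simple = ET.fromstring("<example>This is an example</example>")
print(simple.tag, "|", simple.text)

# An element that contains another element in addition to its text.
nested = ET.fromstring("<e1>This <e2>is</e2> an example</e1>")
child = nested[0]
print(repr(nested.text), child.tag, repr(child.text), repr(child.tail))

# An element with an attribute: a name paired with a value.
link = ET.fromstring('<a href="miles.html">A link to my home page</a>')
print(link.attrib)

# An empty element: attributes only, no content and no children.
empty = ET.fromstring('<example4 exampleAttribute="value" />')
print(empty.text, len(empty), empty.get("exampleAttribute"))
```

Note that ElementTree stores the text following a child element in that child's tail, which is why " an example" appears on e2 rather than on e1.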
15
A really simple XML document
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/>
  </paragraph>
</message>

16
A really simple XML document
<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/></paragraph>
</message>

Elements in this document:
  • message
  • exclamation
  • paragraph
  • graphic
  • emphasis
17
A really simple XML document
Elements in this document: message, exclamation,
paragraph, graphic, emphasis.
These are all called nodes in the XML document
tree.
18
A really simple XML document
Elements in this document: message, exclamation,
paragraph, graphic, emphasis.
The message node is special. We call it the root node.
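To make the document tree concrete, here is a small Python sketch that parses the message document and walks its element nodes from the root down. It relies on the standard xml.etree.ElementTree module; the walk helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<message>
  <exclamation>Hello, World!</exclamation>
  <paragraph>XML is <emphasis>fun</emphasis> and easy.
    <graphic fileref="image.jpg"/>
  </paragraph>
</message>"""

root = ET.fromstring(doc)  # the root node of the document tree

def walk(node, depth=0):
    """Yield (depth, tag) for a node and all of its descendants, in document order."""
    yield depth, node.tag
    for child in node:
        yield from walk(child, depth + 1)

nodes = list(walk(root))
for depth, tag in nodes:
    print("  " * depth + tag)
```

The indentation of the printed names mirrors the nesting of the elements: exclamation and paragraph sit under message, while emphasis and graphic sit under paragraph.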
19
Logical Markup and Markup for formatting
  • HTML is largely geared toward formatting
    documents for display in Web browsers.
  • People usually use XML for describing the
    logical structure of documents.
  • What, to your thinking, does this mean?

20
Some Virtues of XML (cf. Ray 11-13)
  • Application-Specific Markup: Unlike HTML, where
    everyone uses the same set of tags, with XML, user
    communities define their own element sets to suit
    their needs.
  • Maximum Portability: Despite its expressiveness,
    XML is an open standard. Thus user communities
    can exchange metadata expressed in XML freely.

21
Some Virtues of XML (cf. Ray 11-13)
  • Unambiguous Structure: Unlike HTML, XML documents
    are considered to be in error if they violate
    basic rules of syntax. While this makes XML a
    bit difficult to write, it makes manipulating XML
    easy.

22
Some Virtues of XML (cf. Ray 14-15)
  • Separation of Format and Content: When written
    well, XML should reflect the logical structure of
    data, not its formatting. This allows us to
    change formatting as we see fit, and allows us to
    treat documents logically.
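One way to picture this separation is to render the same logically marked-up content in two different formats. The Python sketch below is an illustration under assumptions: the element names are borrowed from the letter example used later in this presentation, the body text is shortened, and the two rendering functions are invented here. It also skips HTML escaping, which real code would need.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring(
    '<letter letterDate="2008-09-10">'
    "<greeting><salutation>Dear</salutation><recipient>Dean Dillon</recipient></greeting>"
    "<body>Would you please let me know if any opportunities become available?</body>"
    "<closing><signoff>Sincerely</signoff><sender>Miles</sender></closing>"
    "</letter>"
)

def as_plain_text(doc):
    """One presentation of the content: a plain-text letter."""
    return "\n".join([
        f"{doc.findtext('greeting/salutation')} {doc.findtext('greeting/recipient')},",
        doc.findtext("body"),
        f"{doc.findtext('closing/signoff')}, {doc.findtext('closing/sender')}",
    ])

def as_html(doc):
    """Another presentation of the same content: an HTML fragment."""
    return (
        f"<p>{doc.findtext('greeting/salutation')} {doc.findtext('greeting/recipient')},</p>"
        f"<p>{doc.findtext('body')}</p>"
        f"<p>{doc.findtext('closing/signoff')},<br/><em>{doc.findtext('closing/sender')}</em></p>"
    )

print(as_plain_text(letter))
print(as_html(letter))
```

The XML says only what each piece of text is (a salutation, a sender); the decision to italicize the sender's name lives entirely in the rendering code.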

23
Structure of an XML Document
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of INF
    384c this year. Would you please let me know if
    any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
24
Structure of an XML Document
PROLOGUE
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
BODY
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of INF
    384c this fall. Would you please let me know if
    any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
25
Structure of an XML Document
PROLOGUE
<?xml version="1.0"?>
<!DOCTYPE letter SYSTEM "http://www.ibiblio.org/mefron/xml/dtd/letter.dtd">
XML declaration: Tells the parser what language
the document is expressed in. Additionally, it may
specify the character encoding of the document.
DOCTYPE declaration: This defines the tagset
that will be used to mark up the document. DTDs
may be defined locally or remotely.
26
Structure of an XML Document
BODY
<letter letterDate="2008-09-10">
  <greeting>
    <salutation>Dear</salutation>
    <recipient>Dean Dillon</recipient>
  </greeting>
  <body>
    I would like very much to teach another section of
    INF384C this year. Would you please let me know
    if any opportunities become available?
  </body>
  <closing>
    <signoff>Sincerely</signoff>
    <sender>Miles</sender>
  </closing>
</letter>
Well formed?
Valid?
27
Well formed XML
Well formed:
<list>
  <item>one</item>
  <item>two</item>
</list>

Not well formed:
<list>
  <item>one
  <item>two
</list>
28
Well formed XML
Well formed:
<figure fileName="f.jpg" />

Not well formed:
<figure fileName="f.jpg" >
29
Well formed XML
Well formed:
<a>A good <b>nesting</b> example</a>
<math>2 &lt; 5</math>

Not well formed:
<a>A <b>bad</a> nesting</b>
<math>2 < 5</math>
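Because an XML parser must reject text that breaks the syntax rules, well-formedness is easy to test: try to parse and see whether an error is raised. The Python sketch below applies that idea to the kinds of examples shown on these slides, using the standard xml.etree.ElementTree parser; is_well_formed is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<list><item>one</item><item>two</item></list>",  # every item is closed
    "<list><item>one<item>two</list>",                # items are never closed
    '<figure fileName="f.jpg" />',                    # empty element, self-closed
    '<figure fileName="f.jpg" >',                     # start tag with no end tag
    "<a>A good <b>nesting</b> example</a>",           # properly nested
    "<a>A <b>bad</a> nesting</b>",                    # overlapping elements
    "<math>2 &lt; 5</math>",                          # < written as an entity
    "<math>2 < 5</math>",                             # raw < inside content
]
results = [is_well_formed(s) for s in samples]
for sample, ok in zip(samples, results):
    print("well formed    " if ok else "not well formed", sample)
```

This checks well-formedness only. Whether a document is also valid depends on a DTD or schema, which this parser does not consult.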

30
Valid XML
  • What does it mean for a document to be valid XML?

31
Valid XML
  • What does it mean for a document to be valid XML?
  • It is well-formed
  • Its syntax also follows the rules specified in
    the document type definition (DTD) to which it
    refers.
  • Instead of a DTD, the rules that define a valid
    document can be expressed using an XML schema
    (we'll focus more on DTDs).

32
Document Type Definitions (DTDs)
  • What is a DTD?

33
Document Type Definitions (DTDs)
  • In a markup language, a DTD serves the function
    of a grammar.
  • It provides the rules governing how elements are
    expressed and combined in efforts to organize
    document content.

34
Document Type Definitions (DTDs)
  • In the most literal sense, a DTD is a file on a
    computer system. Using special statements, this
    file defines what is legal behavior for a given
    markup language.
  • In a more conceptual sense, a DTD operates as a
    document model, expressing the relationship
    among elements in a family of documents.

35
Document Type Definitions (DTDs)
doc1.xml:
<?xml version="1.0"?>
<!DOCTYPE rootElement SYSTEM "root.dtd">
<rootElement> </rootElement>

root.dtd:
<!ELEMENT rootElement (#PCDATA)>
36
What if I don't have a DTD?
  • Without a DTD your XML can still be well-formed.
  • Without a DTD your XML can still be very useful.

37
DTDs: Doctype Definitions
<!ELEMENT letter (greeting,body,closing)>
<!ELEMENT greeting (salutation?,recipient)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT closing (signoff?,sender)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT recipient (#PCDATA)>
<!ELEMENT signoff (#PCDATA)>
<!ELEMENT sender (#PCDATA)>
<!ATTLIST letter letterDate CDATA #REQUIRED>

letter.dtd: file containing the DTD for the
letter document.
38
DTDs: Doctype Definitions
<!ELEMENT letter (greeting,body,closing)>
<!ATTLIST letter letterDate CDATA #REQUIRED>

A letter element contains 3 sub-elements:
greeting, body, and closing.
It also has a mandatory attribute, letterDate.
39
DTDs: Doctype Definitions
<!ELEMENT greeting (salutation?,recipient)>

A greeting element contains 2 sub-elements:
salutation (optional) and recipient (occurs
exactly once).
40
DTDs: Doctype Definitions
<!ELEMENT body (#PCDATA)>

A body element contains parsed character
data, i.e. text.
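A content model such as (salutation?,recipient) behaves much like a regular expression over the names of an element's children. The Python sketch below uses that analogy to check a letter against the rules in letter.dtd. It is a teaching aid under stated assumptions, not a real validator: the DTD is translated into the tables by hand, the check function is invented here, and the standard xml.etree.ElementTree parser itself does no DTD validation.

```python
import re
import xml.etree.ElementTree as ET

# Content models from letter.dtd, hand-translated into regular expressions
# over the space-separated names of an element's children.
CONTENT_MODELS = {
    "letter": r"greeting body closing",
    "greeting": r"(salutation )?recipient",
    "closing": r"(signoff )?sender",
}
# Elements declared as (#PCDATA): text only, no child elements.
TEXT_ONLY = {"body", "salutation", "recipient", "signoff", "sender"}
# Attributes declared #REQUIRED in the ATTLIST.
REQUIRED_ATTRIBUTES = {"letter": ["letterDate"]}

def check(element):
    """Return a list of problems for this element and all of its descendants."""
    problems = []
    children = " ".join(child.tag for child in element)
    if element.tag in CONTENT_MODELS:
        if not re.fullmatch(CONTENT_MODELS[element.tag], children):
            problems.append(f"{element.tag}: children [{children}] do not match the content model")
    elif element.tag in TEXT_ONLY:
        if children:
            problems.append(f"{element.tag}: child elements are not allowed")
    else:
        problems.append(f"{element.tag}: element is not declared")
    for name in REQUIRED_ATTRIBUTES.get(element.tag, []):
        if name not in element.attrib:
            problems.append(f"{element.tag}: missing required attribute {name}")
    for child in element:
        problems.extend(check(child))
    return problems

good = ET.fromstring(
    '<letter letterDate="2008-09-10">'
    "<greeting><recipient>Dean Dillon</recipient></greeting>"
    "<body>Would you please let me know?</body>"
    "<closing><signoff>Sincerely</signoff><sender>Miles</sender></closing>"
    "</letter>"
)
bad = ET.fromstring(
    "<letter>"
    "<greeting><salutation>Dear</salutation></greeting>"
    "<body>No closing, no date, no recipient.</body>"
    "</letter>"
)
good_problems = check(good)
bad_problems = check(bad)
for problem in bad_problems:
    print(problem)
```

Both sample letters are well formed, but only the first follows the DTD's rules. The second omits the closing element, the letterDate attribute, and the recipient, so it would not be valid.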
41
Creating a Simple DTD: a novel
42
Creating a Simple DTD: a novel
  • What are some of the elements we might keep
    track of in a novel?

43
Creating a Simple DTD: a novel
  • What are some of the elements we might keep
    track of in a novel?

chapter, author, year, title, text
44
Creating a Simple DTD: a novel
  • Let's arrange these into a tree structure

45
Creating a Simple DTD: a novel
  • Let's arrange these into a tree structure

novel
46
Creating a Simple DTD: a novel
  • Let's arrange these into a tree structure

novel
  title
  author
  year
  chapter
    text
47
Creating a Simple DTD: a novel
  • Let's arrange these into a tree structure

novel
  title
  author
  year
  chapter
    text
Could some of these elements be attributes of the
novel? What are the implications/motivations for
making them elements or attributes?
48
Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
This just says, hey, we're defining an element now
49
Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
and the element we're defining is called novel.
50
Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
and a novel contains a title, an author, and 1
or more chapters.
51
Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
<!ATTLIST novel year CDATA #IMPLIED>
Lastly, a novel has an attribute called year
that contains unparsed text. Including a year is
optional.
52
Creating a Simple DTD: a novel
<!ELEMENT novel (title, author, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT chapter (text)>
<!ELEMENT text (#PCDATA)>
<!ATTLIST novel year CDATA #IMPLIED>
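For comparison, here is what a document written to this DTD might look like, read back with Python's standard xml.etree.ElementTree. The sample instance is an assumption assembled from the values used earlier in the presentation, and it shows one practical consequence of the element-versus-attribute choice: child elements are read with findtext, while the year attribute is read with get.

```python
import xml.etree.ElementTree as ET

# A sample instance following the novel DTD: title, author, one or more
# chapters, and an optional year attribute on the root element.
novel_xml = """<novel year="1811">
  <title>Sense and Sensibility</title>
  <author>Jane Austen</author>
  <chapter><text>The family of ...</text></chapter>
</novel>"""

novel = ET.fromstring(novel_xml)
summary = {
    "title": novel.findtext("title"),           # child element
    "author": novel.findtext("author"),         # child element
    "year": novel.get("year"),                  # attribute; None if omitted
    "chapters": len(novel.findall("chapter")),  # the DTD requires at least one
}
print(summary)
```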
53
XML Namespaces: Combining Markup Languages
  • As we read in Ray, XML isn't a markup language;
    it provides a way of defining your own markup
    language.
  • Often we might want to combine two pre-existing
    sets of tags in a single document.
  • To do this, we can use XML namespaces to clarify
    the relationships among our elements.

54
XML Namespaces: Combining Markup Languages
  • Syntax for declaring a namespace: namespaces are
    declared as attributes of an element. The
    namespace is available to all elements below this
    one.
  • <elementName xmlns:nsName="url">

55
XML Namespaces: Combining Markup Languages
  • Syntax for declaring a namespace: namespaces are
    declared as attributes of an element. The
    namespace is available to all elements below this
    one.
  • <elementName xmlns:nsName="url">

We say that all elements below this element are
in the scope of the namespace url.
56
XML Namespaces: Combining Markup Languages
  • Example: combining Dublin Core and Vcard (a
    standard for representing business card
    information).
  • Dublin Core elements are defined at
    http://purl.org/dc/elements/1.1/
  • Vcard elements are defined at
    http://www.imc.org/rfc2426

57
<?xml version="1.0"?>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:vc="http://www.imc.org/rfc2426">
  <dc:creator>
    <vc:n>
      <vc:family>Efron</vc:family>
      <vc:given>Miles</vc:given>
    </vc:n>
  </dc:creator>
  <dc:title>Miles Efron's Home Page</dc:title>
</dc:dc>
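When a namespace-aware parser reads a document like this one, each prefixed name is resolved to the namespace URL it was declared with. The Python sketch below shows one way to query such a document with the standard xml.etree.ElementTree module. The NS mapping is defined in this example; its prefixes only need to agree with the URLs, not with the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:vc="http://www.imc.org/rfc2426">
  <dc:creator>
    <vc:n>
      <vc:family>Efron</vc:family>
      <vc:given>Miles</vc:given>
    </vc:n>
  </dc:creator>
  <dc:title>Miles Efron's Home Page</dc:title>
</dc:dc>"""

# Prefix-to-URL mapping used for the queries below.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "vc": "http://www.imc.org/rfc2426",
}

root = ET.fromstring(doc)
print(root.tag)  # the prefix is expanded to {namespace-url}localname

family = root.findtext("dc:creator/vc:n/vc:family", namespaces=NS)
given = root.findtext("dc:creator/vc:n/vc:given", namespaces=NS)
title = root.findtext("dc:title", namespaces=NS)
print(given, family, "|", title)
```

The two vocabularies stay distinguishable even though they share one tree, because every element carries its namespace URL as part of its name.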

58
(No Transcript)
59
OAI-PMH
  • Data structure standard ???
  • Data communication standard ???

60
XML 1 Summary
  • XML is a standard maintained by the W3C.
  • XML imposes a tree structure on documents. Can
    we think of a document type that doesn't lend
    itself to a tree structure?
  • The nodes of the document tree are the ______ of
    the document?
  • (Some) types of XML elements (Ray p. 50)
  • Empty
  • Container
  • Character reference

61
XML 1 Summary
  • In order to parse, an XML document must be:
  • Well formed (always)
  • Valid (under what condition?)
  • A DTD defines a particular XML language (i.e. a
    data structure definition).
  • Namespaces allow us to combine XML-encoded data
    structure definitions.