Title: XML
1 XML
- An introduction in relation to semistructured data
2 Overview
- Background / History
- Basic syntax
- XML and semistructured data
- Document type definitions
- Extensions for XML
- Paraphernalia
3 Overview
- Background / History
- SGML
- SGML, HTML and XML
- World Wide Web Consortium
- Basic syntax
- XML and semistructured data
- Document type definitions
- Extensions for XML
- Paraphernalia
4 Standard Generalized Markup Language (SGML)
- models information exclusively on the basis of its intrinsic structure and its function
- → platform-independent storage of structured information
- standard: ISO 8879, from 1986
5 SGML, HTML and XML
- SGML → (web application) → HTML (one particular application of SGML)
- XML ⊂ SGML (XML is a subset of SGML)
6 Why XML from SGML?
- SGML
- is exceedingly complex and difficult to understand
- is formally so complex that online applications have difficulty processing it in reasonable time
- has many properties that were not designed for use in network environments (remember that it is a standard from 1986)
7 World Wide Web Consortium
- Nov 1996: initial XML draft
- Dec 1997: XML 1.0 Proposed Recommendation
- Feb 1998: W3C Recommendation Extensible Markup Language (XML) 1.0
- Oct 2000: XML 1.0, 2nd edition
8 Overview
- Background / History
- Basic syntax
- Elements
- Attributes
- Well-formed XML documents
- XML and semistructured data
- Document type definitions
- Extensions for XML
- Paraphernalia
9 Elements
- element: <tag> content </tag>
- <tag>, </tag>: markups
- content: structures between markups
- no predefined tags
- basic content (no markups) is treated as text: PCDATA (Parsed Character Data)
- abbreviation for empty elements: <tag />
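The element rules above can be tried out with a small sketch. It assumes Python's standard `xml.etree.ElementTree` module; the document string, the `note` tag, and the helper name `describe` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tag names are free-form: XML itself has no predefined tags.
# <note /> is the abbreviated form of an empty element.
doc = "<person><name>John Cage</name><function>Bearer</function><note /></person>"
root = ET.fromstring(doc)

def describe(element):
    """Return (tag, PCDATA content) for one element; content is None if empty."""
    return element.tag, element.text

children = [describe(child) for child in root]
print(root.tag)
print(children)
```

The empty element reports `None` as its text, while the other two elements carry their PCDATA as plain strings.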
10 Example
    <personnel>
      <person>
        <name> John Cage </name>
        <function> Bearer </function>
      </person>
      <person>
        <name> Elaine Vassal </name>
        <function> chief secretary </function>
      </person>
      ...
    </personnel>
11 Attributes
- sometimes called properties in data models
- (name=value) pairs
- value is always a string (in a DTD typed e.g. as CDATA or NMTOKEN)
- allow building groups of elements
- ambiguity: information as attribute or element?
12 Example
    <personnel>
      <person sex="m">
        <name> John Cage </name>
        <function department="civil rights"> Bearer </function>
      </person>
      <person sex="f">
        <name> Elaine Vassal </name>
        <function department="admin"> chief secretary </function>
      </person>
      ...
    </personnel>
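A minimal sketch of reading the attributes from the example above, assuming Python's standard `xml.etree.ElementTree`. The variable names `by_sex` and `departments` are illustrative only.

```python
import xml.etree.ElementTree as ET

personnel = """
<personnel>
  <person sex="m">
    <name>John Cage</name>
    <function department="civil rights">Bearer</function>
  </person>
  <person sex="f">
    <name>Elaine Vassal</name>
    <function department="admin">chief secretary</function>
  </person>
</personnel>
"""
root = ET.fromstring(personnel)

# Attributes arrive as (name, value) pairs; every value is a string.
# Here the sex attribute is used to build groups of elements.
by_sex = {}
for person in root.findall("person"):
    by_sex.setdefault(person.get("sex"), []).append(person.findtext("name"))

departments = [function.get("department") for function in root.iter("function")]
print(by_sex)
print(departments)
```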
13 Well-formed XML documents
- an XML document is well-formed if
- tags nest properly
- (not <t1><t2></t1></t2>)
- attributes are unique within one element
- (not <tag att="a" att="b">)
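Both conditions can be checked by handing the text to a non-validating parser. The sketch below assumes Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that is not well-formed; the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<t1><t2></t2></t1>"))       # proper nesting
print(is_well_formed("<t1><t2></t1></t2>"))       # overlapping tags
print(is_well_formed('<tag att="a" att="b" />'))  # duplicate attribute
```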
14 Overview
- Background / History
- Basic syntax
- XML and semistructured data
- Simple transformations
- Differences that make transformation more difficult
- Additional constructs
- Document type definitions
- Extensions for XML
- Paraphernalia
15 Simple transformations
- with basic XML syntax (no attributes, tree as data structure)
- from XML to ssd
    <person>
      <name> John Cage </name>
      <function> Bearer </function>
    </person>
- → {person: {name: "John Cage", function: "Bearer"}}
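The XML-to-ssd direction can be sketched as a short recursive function. This assumes Python's standard `xml.etree.ElementTree` and an ad hoc encoding of ssd as (label, value) pairs, where a value is either a string or a list of pairs; the encoding and the name `xml_to_ssd` are assumptions made for this sketch. It covers only the basic syntax (no attributes, no mixed content).

```python
import xml.etree.ElementTree as ET

def xml_to_ssd(element):
    """Map an element to a (label, value) pair.

    A leaf element becomes (tag, text); an inner element becomes
    (tag, [pairs for its children]). A list of pairs is used instead
    of a dict so that repeated labels are preserved.
    """
    children = list(element)
    if not children:
        return element.tag, (element.text or "").strip()
    return element.tag, [xml_to_ssd(child) for child in children]

person = ET.fromstring(
    "<person><name> John Cage </name><function> Bearer </function></person>"
)
print(xml_to_ssd(person))
```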
16 Simple transformations II
- from ssd to XML (transformation function T)
- T(atomic value) = atomic value
- T({l1: v1, ..., ln: vn}) =
    <l1> T(v1) </l1>
    ...
    <ln> T(vn) </ln>
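A minimal sketch of the transformation function T in Python. It assumes ssd values are encoded as nested dicts (label to value) with strings or numbers as atomic values, and that every label is a valid XML name; both are assumptions of this sketch. A dict cannot hold repeated labels and its iteration order fixes the element order, so this is a simplification of the unordered ssd model.

```python
from xml.sax.saxutils import escape

def T(value):
    """Translate an ssd value into XML text.

    T(atomic value) = atomic value (escaped for XML)
    T({l1: v1, ..., ln: vn}) = <l1>T(v1)</l1> ... <ln>T(vn)</ln>
    """
    if isinstance(value, dict):
        return "".join(f"<{label}>{T(v)}</{label}>" for label, v in value.items())
    return escape(str(value))

print(T({"person": {"name": "John Cage", "function": "Bearer"}}))
```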
17 Differences that make transformation more difficult
- different semantics of labels
- element or attribute
- order
- mixing elements and text
18 Semantics of labels
- XML
- graphs with labels on nodes
- ssd
- graphs with labels on edges
    <person> <name>Alan</name> <age>42</age> <email>ab@com</email> </person>
    {person: {name: "Alan", age: 42, email: "ab@com"}}
19 Element or attribute
- ambiguity between representation of information as element or as attribute
- → different possibilities of encoding
- in particular in combination with references
    <a> <b id="o123"> some string </b> </a>
    <a c="o123" />
- or
    <a b="o123" />
    <a> <c id="o123"> some string </c> </a>
20 Order
- the ssd model is based on unordered collections
- XML elements are ordered
- but XML attributes are not
- unordered data can be processed more efficiently
- → for data exchange applications, the order of XML elements can be ignored
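One way to ignore element order when comparing two documents is to reduce each to an order-insensitive canonical form. The sketch below assumes Python's standard `xml.etree.ElementTree`; the names `canonical` and `same_ignoring_order` are hypothetical, and text that follows a subelement (mixed content) is deliberately not considered.

```python
import xml.etree.ElementTree as ET

def canonical(element):
    """Order-insensitive form: tag, sorted attributes, text, sorted children."""
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        (element.text or "").strip(),
        tuple(sorted(canonical(child) for child in element)),
    )

def same_ignoring_order(xml_a, xml_b):
    """Compare two XML texts as unordered trees."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))

print(same_ignoring_order(
    "<person><name>Alan</name><age>42</age></person>",
    "<person><age>42</age><name>Alan</name></person>",
))
```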
21 Mixing elements and text
- XML allows mixing of PCDATA and subelements
    <talk>
      XML - An introduction in relation to semistructured data
      <speaker> Sebastian Bitzer </speaker>
    </talk>
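Mixed content is why a plain label-to-value mapping is not enough. As a sketch, assuming Python's standard `xml.etree.ElementTree` (which stores text before the first child in `.text` and text after a child in that child's `.tail`), the helper below lists text chunks and subelements in document order; the name `mixed_content` is hypothetical.

```python
import xml.etree.ElementTree as ET

def mixed_content(element):
    """List the PCDATA chunks and subelements of an element in document order."""
    parts = []
    if element.text and element.text.strip():
        parts.append(("text", element.text.strip()))
    for child in element:
        parts.append(("element", child.tag))
        if child.tail and child.tail.strip():
            parts.append(("text", child.tail.strip()))
    return parts

talk = ET.fromstring(
    "<talk>XML - An introduction in relation to semistructured data"
    "<speaker> Sebastian Bitzer </speaker></talk>"
)
print(mixed_content(talk))
```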
22 Additional constructs in XML
- comments
    <!-- comment -->
- processing instructions
    <?application-name instruction-text?>
- CDATA (for escaping)
    <![CDATA[ markups won't be processed here ]]>
- entities
- e.g. &auml;, but external files can also be declared as entities, e.g. a GIF file as pic-1
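A small sketch of how comments and CDATA sections look to an application, assuming Python's standard `xml.etree.ElementTree` with its default parser settings (which drop comments). The `note` element is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = "<note><!-- a comment --><![CDATA[<b>markup</b> won't be processed here]]></note>"
root = ET.fromstring(doc)

# The CDATA content arrives as ordinary text: <b> is not parsed as an element,
# and the comment does not show up in the tree.
print(root.text)
print(len(root))

# When written back out, the former CDATA content is escaped instead.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```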
23 Overview
- Background / History
- Basic syntax
- XML and semistructured data
- Document type definitions
- DTDs as grammars
- DTDs as schemas
- Attributes
- Valid XML documents
- Limitations
- Extensions for XML
- Paraphernalia
24 DTDs as grammars
- a document type definition (DTD) serves as a grammar for the underlying XML document
- is precisely a context-free grammar (non-terminal → ordered list of one or more terminals and non-terminals)
- can be recursive
25 Definitions
- DTD
    <!DOCTYPE root-name [ element-def.s ]>
- element-def.s
    <!ELEMENT name ( content model )>
    ...
- content model
- ordered list of names of elements which can occur in the outer element
26 Variations of content model
    <!ELEMENT r1 (a?, b*, (c | d+))>
- means that elements of type r1 contain
- 0 or 1 a (a is optional) and
- arbitrarily many b (0 - ∞) and
- either exactly 1 c (c is obligatory)
- or at least 1 d (d is required)
- groups can be built, too
    <!ELEMENT r2 ((a, b)+, c?)>
- means at least one sequence of a followed by b comes in front of the optional c
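Content models are regular expressions over child element names, so they can be checked with an ordinary regex engine. The sketch below hand-translates the two content models for r1 and r2 into Python `re` patterns over a space-terminated sequence of child tags; the table `CONTENT_MODELS` and the function `children_match` are assumptions of this sketch, not part of any DTD tooling.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; each child tag is followed by one space.
CONTENT_MODELS = {
    "r1": re.compile(r"(a )?(b )*(c |(d )+)"),  # (a?, b*, (c | d+))
    "r2": re.compile(r"(a b )+(c )?"),          # ((a, b)+, c?)
}

def children_match(element):
    """Check the sequence of child tags against the element's content model."""
    sequence = "".join(child.tag + " " for child in element)
    return CONTENT_MODELS[element.tag].fullmatch(sequence) is not None

print(children_match(ET.fromstring("<r1><a/><b/><b/><c/></r1>")))
print(children_match(ET.fromstring("<r1><c/><d/></r1>")))
print(children_match(ET.fromstring("<r2><a/><b/><a/><b/><c/></r2>")))
```

A validating parser does the same job for a whole DTD; this only illustrates the reading of `?`, `*`, `+`, `,` and `|`.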
27 DTDs as Schemas
- DTD
    <!DOCTYPE db [
      <!ELEMENT db ((r1,r2)*)>
      <!ELEMENT r1 ((a,b,c)|(a,c,b)|(b,a,c)|(b,c,a)|(c,a,b)|(c,b,a))>
      <!ELEMENT r2 ((c,d)|(d,c))>
      <!ELEMENT a (#PCDATA)>
      <!ELEMENT b (#PCDATA)>
      <!ELEMENT c (#PCDATA)>
      <!ELEMENT d (#PCDATA)>
    ]>
- can be seen as a representation of the relational schema r1(a,b,c), r2(c,d)
28 Declaring attributes
    <!ATTLIST el.name att.name1 type1 spec1
                      att.name2 type2 spec2
    >
- el.name: element which is modified by the att.s
- type: often CDATA, but also more restricted, e.g. (m|f) for male or female in att. sex
- spec: #REQUIRED, #IMPLIED, #FIXED or a default value
29 Unique Identifiers
- e.g.
    <!ATTLIST person id       ID     #REQUIRED
                     mom      IDREF  #IMPLIED
                     dad      IDREF  #IMPLIED
                     children IDREFS #IMPLIED>
- instance
    <person id="john" mom="jane" dad="james" children="jack jim">
30 Valid XML documents
- an XML document is valid if
- the document is well-formed
- it additionally has a DTD
- it conforms to that DTD
- elements are only nested as described in the DTD
- only attributes allowed by the DTD are used
- all attributes of type ID must have distinct values
- all IDREFS must point to existing identifiers
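The last two validity conditions can be sketched without a validating parser. The code below assumes Python's standard `xml.etree.ElementTree` (which does not validate against DTDs) and hard-codes the attribute typing from the person ATTLIST on the Unique Identifiers slide; the sets `ID_ATTRS`, `IDREF_ATTRS`, `IDREFS_ATTRS`, the function `check_ids`, and the `family` wrapper element are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed attribute typing, mirroring the ATTLIST for person.
ID_ATTRS = {"id"}
IDREF_ATTRS = {"mom", "dad"}
IDREFS_ATTRS = {"children"}

def check_ids(root):
    """Return a list of problems: duplicate IDs and dangling references."""
    problems = []
    seen = set()
    # First pass: collect all ID values and report duplicates.
    for element in root.iter():
        for name, value in element.attrib.items():
            if name in ID_ATTRS:
                if value in seen:
                    problems.append(f"duplicate ID: {value}")
                seen.add(value)
    # Second pass: every IDREF / IDREFS token must name an existing ID.
    for element in root.iter():
        for name, value in element.attrib.items():
            if name in IDREF_ATTRS:
                refs = [value]
            elif name in IDREFS_ATTRS:
                refs = value.split()
            else:
                refs = []
            problems.extend(f"dangling reference: {r}" for r in refs if r not in seen)
    return problems

bad = ET.fromstring('<family><person id="john" mom="jane"/><person id="john"/></family>')
print(check_ids(bad))
```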
31 Limitations of DTDs as schemas (summarized)
- order
- only one atomic type (#PCDATA, but no INT etc.)
- names are global (partial solution: namespaces)
- IDREFs are not constrained to a certain type (a mother reference should point to a person)
32 Overview
- Background / History
- Basic syntax
- XML and semistructured data
- Document type definitions
- Extensions for XML
- DCD
- Document navigation
- Paraphernalia
33 Document Content Definitions
- making typing more precise
- seems to have been abandoned
- recent approach: XML Schema, which must e.g.
- provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.
- allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of its properties
- mechanism for URI reference to standard semantic understanding of a construct
- (http://www.w3.org/TR/NOTE-xml-schema-req)
34 XLink and XPointer
- pointing to arbitrary positions in documents
- using IDs or relative position
- links can be defined externally to both source and target (files)
35 Overview
- Background / History
- Basic syntax
- XML and semistructured data
- Document type definitions
- Extensions for XML
- Paraphernalia
- RDF
- Stylesheets
- SAX and DOM
36 Resource Description Framework
- for representing metadata
- consists of data model and syntax
- simple form: edge-labelled graph
- additionally
- containers (bag, sequence or alternative)
- higher-order statements (John says that ...)
37 Stylesheets
- to specify presentation of data
- Cascading Style Sheets (CSS)
- associate with each element type a presentation
- Extensible Stylesheet Language (XSL)
- specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary
- http://www.w3.org/Style/XSL/
38 SAX and DOM
- Application Programming Interfaces
- Simple API for XML (SAX)
- standard for parsing
- Document Object Model (DOM)
- interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
- parses the whole document and builds a tree representation of it
- http://www.w3.org/DOM/
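The difference between the two APIs can be sketched with Python's standard `xml.sax` and `xml.dom.minidom` modules. The handler class `NameCollector`, the sample document, and the replacement value "J. Cage" are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = (
    "<personnel><person><name>John Cage</name></person>"
    "<person><name>Elaine Vassal</name></person></personnel>"
)

class NameCollector(xml.sax.ContentHandler):
    """SAX: react to parse events as they stream by; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect them.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False

handler = NameCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

# DOM: parse the whole document into a tree, then navigate and update it.
dom = minidom.parseString(DOC)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("name")]
dom.getElementsByTagName("name")[0].firstChild.data = "J. Cage"
updated = dom.documentElement.toxml()
print(sax_names, dom_names)
print(updated)
```

SAX suits large documents that are read once, while DOM trades memory for random access and in-place updates.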
39 Outlook
- Database issues
- How are we going to model XML? (graphs)
- How are we going to query XML? (XML-QL)
- How are we going to store XML? (in a relational database? object-oriented?)
- How are we going to process XML efficiently? (uh well..., um..., ah..., get some good grad students!)
Raghu Ramakrishnan, http://www.cs.wisc.edu/cs784-1/handouts/intro-ssxml.ppt
40 References
- S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Publishers, San Francisco, 2000
- H. Lobin, Informationsmodellierung in XML und SGML, Berlin, Heidelberg, 2000
- World Wide Web Consortium, Extensible Markup Language (XML), http://www.w3.org/XML/