Introduction to XML: A Librarian's Perspective


1
Introduction to XML: A Librarian's Perspective
  • Delphine Khanna, Rutgers University
  • Palinet, July/August 1999

2
Overview of the workshop
  • What is XML? How does it work?
  • Why XML? What is it going to change?
  • Overview of other XML-related standards.
  • XML in libraries (standards projects).
  • Practical skills
  • Creating an XML document,
  • Creating an XSL style sheet,
  • Work with MS Internet Explorer 5.0.

3
Workshop Web site
  • http://scc01.rutgers.edu/ceth/intromat/xml/
  • Contents:
  • This slide presentation,
  • XML Samples used in this workshop,
  • List of useful Web links and other resources.

4
A First Look at XML
5
Basics
  • Simplified definition: XML is a kind of
    super-HTML where you can define your own tags.

6
Term Clarification
  • XML can be called:
  • an encoding format,
  • a language,
  • a standard.
  • We will prefer "standard".

7
The XML Family: A whole family of standards
  • XML,
  • XSL,
  • XLINK, XPOINTER,
  • Namespaces,
  • RDF,
  • XML Schemas,
  • DOM,
  • and more

8
XML: Who? When?
  • XML family developed by W3C.
  • Very recent:
  • XML 1.0: February 1998.
  • Namespaces: January 1999.
  • RDF: February 1999.
  • XLINK, XPOINTER, XSL, XML Schemas: still working
    drafts.

9
XML: To develop the next generation of Web
applications
  • People want to do more sophisticated things with
    the Web.
  • HTML is too limited for that.
  • Need for a more powerful language: XML.

10
XML Hype: 2 myths
  • XML will replace everything.
  • (HTML, back-end relational databases, etc.)
  • XML is completely different from Web technologies
    we had before.

11
Why is XML better?
12
Let's look at a typical HTML document
  • Lines Written in Early Spring
  • William Wordsworth
  • I heard a thousand blended notes,
  • While in a grove I sate reclined,
  • In that sweet mood when pleasant thoughts
  • Bring sad thoughts to the mind.
  • To her fair works did nature link
  • The human soul that through me ran
  • And much it griev'd my heart to think
  • What man has made of man.

13
What is the problem?
  • To do fancier things with documents, we need to
    make their logical structure explicit.
  • Otherwise, software applications
  • do not know what is what,
  • do not have any handle on documents.

14
Why XML is better: Overall
  • HTML:
  • Encoding too vague and messy.
  • Logical structure is not clearly encoded.
  • XML:
  • Allows us to create clean structured documents,
    where the logical structure of the document is
    totally explicit.

15
The same document in XML
  • Lines Written in Early Spring
  • William Wordsworth
  • I heard a thousand blended notes,
  • While in a grove I sate reclined,
  • In that sweet mood when pleasant thoughts
  • Bring sad thoughts to the mind.
  • To her fair works did nature link
  • The human soul that through me ran
  • And much it griev'd my heart to think
  • What man has made of man.
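
The tags of this slide did not survive in the transcript. As a minimal sketch, assuming the element names listed on the later "Tree Representation" slide (POEM, TITLE, AUTHOR, FIRSTNAME, LASTNAME, STANZA, LINE), the poem could be marked up as below. Python and its standard xml.etree.ElementTree module are used only for illustration and are not part of the workshop; the point is that a program can now ask for "the title" or "the lines of each stanza" by name.

```python
import xml.etree.ElementTree as ET

# Assumed markup: element names follow the workshop's tree diagram,
# but the exact tags of the original slide are not known.
POEM_XML = """<?xml version="1.0"?>
<POEM>
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR>
    <FIRSTNAME>William</FIRSTNAME>
    <LASTNAME>Wordsworth</LASTNAME>
  </AUTHOR>
  <STANZA>
    <LINE>I heard a thousand blended notes,</LINE>
    <LINE>While in a grove I sate reclined,</LINE>
    <LINE>In that sweet mood when pleasant thoughts</LINE>
    <LINE>Bring sad thoughts to the mind.</LINE>
  </STANZA>
  <STANZA>
    <LINE>To her fair works did nature link</LINE>
    <LINE>The human soul that through me ran</LINE>
    <LINE>And much it griev'd my heart to think</LINE>
    <LINE>What man has made of man.</LINE>
  </STANZA>
</POEM>"""

poem = ET.fromstring(POEM_XML)

# Because the logical structure is explicit, each part has a handle.
title = poem.findtext("TITLE")
author = " ".join([poem.findtext("AUTHOR/FIRSTNAME"),
                   poem.findtext("AUTHOR/LASTNAME")])
line_counts = [len(stanza.findall("LINE")) for stanza in poem.findall("STANZA")]

print(f"{title}, by {author}")
print("Lines per stanza:", line_counts)
```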

16
Why XML is better: Reason 1
  • HTML: One single fixed tag set
  • , , , , etc.
  • XML: You can define your own tag set
  • , , .
  • , , , .
  • Possible to describe the logical structure
    exactly.

17
Why XML is better: Reason 2
  • HTML: Lack of syntax control. "Hello" with a
    missing or mismatched closing tag is considered OK.
  • XML: Documents have to be at least well-formed.
    "Hello" with matching opening and closing tags is
    the only acceptable form.
  • Code much cleaner.

18
Why XML is better: Reason 3
  • HTML: Logical structure and display are mixed up
  • , , .
  • This text is important.
  • XML: Clear distinction between logical structure
    and display
  • This text is important.
  • Code much cleaner.

19
By the way, HTML is not that bad
  • HTML:
  • Really simple: attractive to basic users.
  • Works fine for basic Web pages.
  • XML:
  • Clearly more complex: will scare off basic users.
  • Probably overkill for basic Web pages.

20
What will XML change? Or, why do we need to
make the logical structure explicit?
21
Different displays for different output devices
  • Regular computer screens,
  • Pocket computers, Palm Pilot,
  • WebTV,
  • Audio (visually-impaired, cars),
  • Braille,
  • Print.

22
Term Clarification
  • The Web is based on a client/server architecture.

23
Server-side: Databases should speak to each other
  • A very successful model:
  • relational databases on the server side.
  • Next step: data integration.
  • Example 1: An online bookstore,
  • Example 2: Medical records,
  • Example 3: An index that knows which journals
    are available in the library.

24
XML: representing structured data
  • If XML can represent structured text, it can also
    represent structured data.
  • XML is also very good at representing mixed data
    seamlessly.

25
XML for Interchange: Example of a converted R-DB
record
  • 33456
  • Next day delivery
  • 15
  • New York City
  • Pittsburgh
  • 07/30/1999
  • 07/31/1999

26
Client-side: The Web, more than an online
fax-machine
  • Web browsers: thin clients
  • They just display documents.
  • Clients can do more:
  • Client workstation has a lot of unused power,
  • Less strain on the network and on the server,
  • Example: Viewing and sorting of a medical record.

27
Client-side: The Web, more than an online
fax-machine (2)
  • Clients can do more:
  • Personalized and sophisticated processing
    possible.
  • Processing possibly provided by 3rd-party client
    applications.
  • Example: Bibliography manager.

28
XML: The nitty-gritty details
29
Term Clarification
  • Element,
  • Tag (opening tag / closing tag / delimiter),
  • Element content,
  • Attribute (name / value).
  • Example:
  • John Smith

30
Differences in Syntax between XML and HTML
  • XML declaration.
  • Every opening tag must have a closing tag.
  • Empty tags have a different syntax.
  • Tags are case sensitive: the same name written in
    a different case is a different tag.

31
2-Level Syntax Control
  • XML documents can be
  • Well-formed,
  • Valid.

32
Syntax Control: Well-formed documents
  • All XML documents must be well-formed.
  • XML parsers check the well-formedness.
  • Criteria of well-formedness:
  • Every opening tag must have a closing tag.
    Illegal: "Hello" with an opening tag but no
    closing tag.
  • No overlapping elements. Illegal: "Hello" inside
    two elements whose closing tags are crossed.
  • One unique root element.
33
Tree Representation
  POEM
  ├── TITLE
  ├── AUTHOR
  │   ├── FIRSTNAME
  │   └── LASTNAME
  ├── STANZA
  │   ├── LINE
  │   ├── LINE
  │   └── LINE
  └── STANZA
      ├── LINE
      ├── LINE
      └── LINE

34
Create your own XML document
  • The cooking recipe document
  • 1. Brainstorming on the structure of the
    document,
  • 2. Creation of the document with a template.

35
Editing XML Documents
  • Textpad + Internet Explorer 5 as a parser.
  • Caution: IE5 comes with limitations and
    proprietary features.
  • Alternative:
  • XML editor (e.g., SoftQuad's XMetaL).

36
To get started
  • Create file in Textpad and load it in IE5.
  • File extension: .xml.
  • Save regularly and reload in IE5.
  • Begin with the XML declaration.

37
Document Type Definitions (DTD)
38
Document Type Definitions (DTD)
  • Formal way of defining the tags used in a series
    of documents.
  • A DTD:
  • specifies a list of tags,
  • defines the relationships between these tags.
  • Allows us to create consistency across a
    collection of documents (e.g., 5000 poems).

39
What does a DTD look like?

40
Creating a DTD
  • Non-trivial task.
  • Higher level of expertise needed than for using a
    DTD.
  • In-depth knowledge of XML,
  • In-depth knowledge of the type of documents being
    described.
  • Preliminary Document Analysis.
  • A DTD can be dozens of pages long.

41
Syntax Control: Valid documents
  • Higher level of control than well-formed
    documents.
  • An XML document is valid if it conforms to its
    DTD.
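
To make the difference between well-formed and valid concrete, here is a much simplified stand-in for DTD validation. It is only a sketch under stated assumptions: the ALLOWED_CHILDREN table and the find_violations function are invented for this example, the element names are taken from the poem tree, and a real DTD also constrains order, repetition, and attributes. Python's standard parser does not validate against DTDs, which is why the rule table is written by hand here.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a DTD: each declared element maps to the
# set of child elements it may contain (an assumption for this sketch).
ALLOWED_CHILDREN = {
    "POEM": {"TITLE", "AUTHOR", "STANZA"},
    "AUTHOR": {"FIRSTNAME", "LASTNAME"},
    "STANZA": {"LINE"},
    "TITLE": set(),
    "FIRSTNAME": set(),
    "LASTNAME": set(),
    "LINE": set(),
}

def find_violations(element):
    """Return a list of messages for elements that break the rule table."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"undeclared element {element.tag}"]
    problems = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"{child.tag} not allowed inside {element.tag}")
        problems.extend(find_violations(child))
    return problems

valid_doc = ET.fromstring(
    "<POEM><TITLE>t</TITLE><STANZA><LINE>l</LINE></STANZA></POEM>")
# Well-formed, but VERSE is not declared in the rule table.
invalid_doc = ET.fromstring("<POEM><TITLE>t</TITLE><VERSE>l</VERSE></POEM>")

print(find_violations(valid_doc))
print(find_violations(invalid_doc))
```

Both documents pass the well-formedness check, but only the first one conforms to the declared structure.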

42
XML DTD declaration
  • Local file
  • URL
  • edu/ceth/intromat/xml/samples/poem/poem.dtd

43
Validation with IE5
  • When loading a document, the IE5 parser does not
    validate it.
  • Possible to validate a document through a script.
  • Possible also to use a separate validating
    parser.
  • For instance, the Scholarly Technology Group's
    XML parser (Brown U.).
  • Validating vs. non-validating parsers.

44
Validation Strategy
  • For now, best model:
  • When creating documents: use a validating parser.
  • For instance: Scholarly Technology Group's
    On-line XML Validating Parser
    (http://www.stg.brown.edu/service/xmlvalid/).
  • When users download them: parser only checks if
    well-formed.

45
Namespaces
  • Need to use elements from several DTDs in the
    same document.
  • Scheme to identify the source of each element.
  • Special case: Same element name used by 2 DTDs.

46
Namespace Example
  • xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'
  • Cheaper by the Dozen
  • 1568491379
  • This is a funny book
  • Note: Adapted from example in the Namespaces
    recommendation.

47
Namespace Example (2): Default Namespace
  • xmlns:isbn='url:ISBN:http://www.isbn.org/isbndtd'
  • Cheaper by the Dozen
  • 1568491379
  • This is a funny book
  • Note: Adapted from example in the Namespaces
    recommendation.
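
The markup of both namespace examples was lost in the transcript. The sketch below shows the general idea rather than the slide's exact document: a default namespace for the book elements and an isbn prefix for the one element borrowed from another DTD. The namespace name http://example.org/bookdtd and the element names book, title, number, and notes are assumptions; Python's standard xml.etree.ElementTree module is used to show how a parser tells the two sources apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names, loosely modeled on the slide.
BOOK_NS = "http://example.org/bookdtd"
ISBN_NS = "http://www.isbn.org/isbndtd"

BOOK_XML = f"""<book xmlns="{BOOK_NS}" xmlns:isbn="{ISBN_NS}">
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>This is a funny book</notes>
</book>"""

book = ET.fromstring(BOOK_XML)

# ElementTree expands every name to "{namespace}localname", so the
# source of each element is always identified.
title = book.findtext(f"{{{BOOK_NS}}}title")
number = book.findtext("isbn:number", namespaces={"isbn": ISBN_NS})

print(book.tag)
print(title, number)
```

Unprefixed elements fall into the default namespace, so two DTDs can use the same element name without clashing.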

48
More good things about XML
49
Positive side-effects of XML (1)
  • XML fosters the development of community-based
    standards.
  • Concept of 2-level standard very powerful:
  • XML: universal,
  • DTDs: community-specific.
  • Now developing a new standard amounts to writing
    a DTD.
  • Much easier than starting from scratch.
  • E.g., Xlit.

50
Positive side-effects of XML (2)
  • Widespread standards are stronger than those
    used by a limited community (regardless of their
    intrinsic value).
  • HL7 -> XML.
  • Easier to hire programmers.
  • More documentation available.
  • Actively maintained by very large base of people.

51
Positive side-effects of XML (3)
  • A set of standards bundled together is stronger
    than an isolated one.
  • Likely to appeal to more people (the Microsoft
    Office idea).
  • The standards reinforce each other.

52
Stylesheet Languages for XML
53
Stylesheet Languages for XML
  • Specify how to display logical elements.
  • XML supports 2 stylesheet languages:
  • CSS:
  • Quite limited,
  • But eases transition HTML -> XML.
  • XSL:
  • Very powerful,
  • Still a working draft.

54
Extensible Stylesheet Language (XSL)
  • 2 Parts:
  • Transformations:
  • Transform the XML document (reorder, hide, add
    elements).
  • Formatting Objects (FO):
  • Attach formatting properties to XML elements.

55
XSL in IE 5.0
  • Supports transformations but not the FO.
  • Trick: transform XML DTD-specific elements into
    HTML elements.
  • Convenient because everybody knows HTML.

56
XSL-to-HTML Stylesheets: Syntax
  • Style Sheet Excerpt / XML Document Excerpt
  • Mary Brown
  • Easy Cooking
  • John Smith
  • 101 Recipes
  • Sue Meyer
  • Italian Cuisine
  • HTML Output
  • Mary Brown Easy Cooking
  • John Smith 101 Recipes
  • Sue Meyer Italian Cuisine
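
The stylesheet excerpt itself was lost in the transcript. The sketch below is not XSL: it imitates what one XSL template does (match an element, emit HTML for it) in plain Python, because Python's standard library has no XSL processor. The element names booklist, book, author, and title, and the choice of HTML tags in the output, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical element names; the slide's own tags were not preserved.
BOOKS_XML = """<booklist>
  <book><author>Mary Brown</author><title>Easy Cooking</title></book>
  <book><author>John Smith</author><title>101 Recipes</title></book>
  <book><author>Sue Meyer</author><title>Italian Cuisine</title></book>
</booklist>"""

def book_to_html(book):
    """Play the role of one template: a book element becomes an HTML paragraph."""
    author = escape(book.findtext("author"))
    title = escape(book.findtext("title"))
    return f"<p><b>{author}</b> <i>{title}</i></p>"

books = ET.fromstring(BOOKS_XML).findall("book")
html_lines = [book_to_html(book) for book in books]
print("\n".join(html_lines))
```

The logical elements (author, title) stay in the XML document; the decision to show one in bold and the other in italics lives only in the transformation.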

57
Beginning of an XSL-to-HTML Stylesheet
  • -xsl"

58
Example of XSL-to-HTML Stylesheet
  • See poem.xsl.

59
Declaring an XSL Stylesheet in an XML document
  • Just after the XML declaration (and the DTD
    declaration if there is one).
  • Local file:
  • ... href="poem.xsl"?>
  • URL:
  • ... href="http://scc01.rutgers.edu/ceth/intromat/xml/samples/poem/poem.xsl"?>

60
Creating your own Stylesheet
  • The XSL-to-HTML recipe stylesheet
  • XSL stylesheets can be tricky.
  • Always use another stylesheet as a model.
  • Name the file recipe.xsl.
  • Make sure to declare it in the XML document.
  • ... href="recipe.xsl"?>
  • Always add one template at a time, and reload in
    IE5 to make sure it works.

61
Recipe Stylesheet: Step 1
  • -xsl"

62
Recipe Stylesheet: Step 2

63
Recipe Stylesheet: Step 3

  • Country
  • Yield
  • Calories

64
Recipe Stylesheet: Step 4
  • Ingredients

65
Recipe Stylesheet: Step 5
  • Step-by-step

66
XML Formatting Objects: Example
  • ,255)
  • font-size: 16pt
  • Note: Adapted from stylesheet created by Lynn
    Lobash.

67
Some Other XML-related Standards
68
Linking Standards
  • HTML links:
  • Really primitive and limited.
  • Linking standards for XML:
  • Much more powerful.
  • 2 parts:
  • XLink (a.k.a. XLL),
  • XPointer (a.k.a. XLP).
  • Still working drafts.

69
XLink
  • To define links to one or several documents.
  • 2 types of links:
  • Simple,
  • Extended.

70
XLink: Simple link
  • Example:
  • href="poem1.xml", with link text "Go to related
    poem"
  • Other attributes / Alternative values:
  • inline: true, false (link to same document vs.
    outside).
  • show: replace, new, embed.
  • actuate: user, auto.
  • title (a caption).
  • Similar to HTML links, but slightly fancier.

71
XLink: Extended Link
  • One link, several targets.
  • For instance, the link "See related poems" would
    open as a list of links in a pop-up window.

72
XLink: Extended Link (2)
  • Example:
  • title="See related poems"
  • title="Blue Mountains" href="poem1.xml"/
  • title="Pink Flowers" href="poem2.xml"/
  • title="Sea of Green" href="poem3.xml"/

73
XPointer
  • To define links that target points within
    documents.
  • Special language to explain which spot is
    targeted.
  • In HTML:
  • Need to manually insert an anchor tag at the
    target spot.
  • Hence need to own the document.
  • With XPointer:
  • No need to add anything to the target document.

74
XPointer (2)
  • Example:
  • href pointing to poem1.xml plus the XPointer
    root().child(2), with link text "Go to related
    poem"
  • Other possibilities:
  • root().child(3).child(4)
  • id(poem273)
  • root().descendant(2, stanza)
  • root().string(1, "my heart")
  • span(root().child(3), root().child(5))
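
The XPointer expressions above come from a working draft, and Python's standard library does not implement them. The sketch below only imitates their meaning on a small poem document, assuming that child(n) counts child elements from 1 and that descendant(n, NAME) picks the n-th descendant with that name; the child helper and the sample markup are invented for this example.

```python
import xml.etree.ElementTree as ET

# Sample target document; nothing in it was added for the sake of linking.
POEM_XML = """<POEM ID="poem273">
  <TITLE>Lines Written in Early Spring</TITLE>
  <AUTHOR>William Wordsworth</AUTHOR>
  <STANZA><LINE>I heard a thousand blended notes,</LINE></STANZA>
  <STANZA><LINE>To her fair works did nature link</LINE></STANZA>
</POEM>"""

root = ET.fromstring(POEM_XML)

def child(element, n):
    """Return the n-th child element, counting from 1 as in child(n)."""
    return list(element)[n - 1]

second = child(root, 2)                        # like root().child(2)
nested = child(child(root, 3), 1)              # like root().child(3).child(1)
second_stanza = list(root.iter("STANZA"))[1]   # like root().descendant(2, STANZA)

print(second.tag)
print(nested.text)
print(second_stanza.findtext("LINE"))
```

The target is reached purely by position in the tree, which is why the linking party does not need to own or modify the target document.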

75
Unicode
  • Default character encoding for XML.
  • Great improvement for encoding of non-Western
    languages:
  • more than 65,000 characters,
  • Eventually will represent all alphabets and
    writing systems,
  • Also includes special typographic characters
    (e.g., ¼).

76
SGML, XML, HTML: What is the difference?
  • XML: SGML slightly simplified.
  • HTML: just an SGML DTD.
  • Can be easily converted to an XML DTD.
  • Relationship:
  • XML and SGML are meta-languages,
  • HTML is a language.

77
Searching XML Documents
78
Models for XML Repositories: Flat file system
  • A bunch of XML documents in a folder.
  • Native XML search engine:
  • an XML-aware Web site search engine.
  • XML Query Language: XQL
  • Still in development.
  • Example: find the word "milk" only when it appears
    in attribute DIETINFO2 of element PRODUCT.
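
XQL is still in development, so the sketch below does not use it. It expresses the same structure-aware query in Python with the standard xml.etree.ElementTree module. PRODUCT and DIETINFO2 come from the slide; the CATALOG and NOTE elements, the NAME attribute, and the sample data are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample data: "milk" appears in three places, but only one of them is
# inside the DIETINFO2 attribute of a PRODUCT element.
CATALOG_XML = """<CATALOG>
  <PRODUCT NAME="Oat cookies" DIETINFO2="contains milk and wheat"/>
  <PRODUCT NAME="Milk chocolate" DIETINFO2="contains soy"/>
  <NOTE>Keep milk refrigerated.</NOTE>
</CATALOG>"""

def products_mentioning(catalog, word):
    """Return PRODUCT elements whose DIETINFO2 attribute contains the word."""
    return [
        product
        for product in catalog.iter("PRODUCT")
        if word in product.get("DIETINFO2", "").lower().split()
    ]

catalog = ET.fromstring(CATALOG_XML)
hits = [product.get("NAME") for product in products_mentioning(catalog, "milk")]
print(hits)
```

A plain full-text search would have matched all three occurrences of the word; the structure-aware query returns only the product whose dietary information mentions it.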

79
Models for XML Repositories: Regular relational
databases
  • E.g., Web-based OPACs, Ovid, Amazon.
  • Back-end relational DBMS:
  • MS Access or Oracle, for instance.
  • Web interface:
  • uses scripts like CGI or Cold Fusion,
  • Easy to change the scripts to output XML instead
    of HTML,
  • Can even produce XML OR HTML according to the
    capabilities of the requesting browser.
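
As a rough sketch of the last two points, the same database row can be serialized either way by one small script. Everything here is illustrative: the render and browser_handles_xml functions, the book record, and the rule that only IE 5 is treated as XML-capable are assumptions, not a description of any real Web interface.

```python
import xml.etree.ElementTree as ET
from html import escape

def render(record, wants_xml):
    """Serialize one database row as XML or as an HTML table row."""
    if wants_xml:
        element = ET.Element("book")
        for column, value in record.items():
            ET.SubElement(element, column).text = value
        return ET.tostring(element, encoding="unicode")
    cells = "".join(f"<td>{escape(value)}</td>" for value in record.values())
    return f"<tr>{cells}</tr>"

def browser_handles_xml(user_agent):
    """Hypothetical capability test: treat only IE 5 as XML-capable."""
    return "MSIE 5" in user_agent

record = {"author": "Mary Brown", "title": "Easy Cooking"}
agents = [
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)",
    "Mozilla/4.5 [en] (Win98; I)",
]
for agent in agents:
    print(render(record, browser_handles_xml(agent)))
```

The database and the query stay the same; only the final serialization step depends on the client.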

80
Models for XML Repositories: XML-aware relational
DBs
  • Benefit from R-Databases AND XML advantages.
  • Mixed record:
  • Nested structured text is difficult to map to an
    R-DB.
  • However, many structured texts have a table-like
    section (the bibliographic information).
  • R-Databases: very mature technology (data
    integrity, security, load balancing, etc.).

81
Models for XML Repositories: XML-aware relational
DBs (2)
  • Example of Oracle:
  • Enhanced full-text capabilities:
  • indexing,
  • truncations, stemming, thesaurus, etc.,
  • XML-like searching,
  • can create SQL queries with embedded XML
    subqueries.
  • Automatic mapping:
  • R-DB record -> XML document,
  • XML document -> R-DB record,
  • Virtual flat file system.

82
Information Retrieval Standard for XML
  • Needed to implement cross-repository search:
  • To query across several XML servers seamlessly,
  • Whatever the implementation on the server side
    (flat file system, R-DBMS, etc.).

83
Information Retrieval Standard: Z39.50
  • Used in the library community.
  • To query OPACs, indexes, etc.
  • Possible to specify:
  • A query language,
  • The format of the results,
  • A session protocol.

84
Information Retrieval Standard: Z39.50 and XML
  • Currently beginning to integrate XML:
  • Defined as a possible output format,
  • Some proposals to use XML as an alternative to
    BER for overall Z39.50 syntax.
  • Once XQL is stabilized it could be ported to
    Z39.50.
  • Good candidate to become the IR standard for XML.
  • Little known outside the library community.

85
XML in Libraries
86
Which library projects are already using XML/SGML?
  • Mostly academic institutions
  • (as well as the Library of Congress and NYPL).
  • Usually in SGML
  • (very recent ones in XML).
  • Mostly:
  • large and long-term digitization projects,
  • involving the digitization of numerous texts.
  • Converted to HTML on the fly.

87
Text Encoding Initiative (TEI)
  • Standard to encode primary sources in the
    Humanities.
  • SGML-based. (It is an SGML DTD.)
  • Currently being converted to XML.
  • Maintained by TEI Consortium.
  • Widely adopted in Humanities computing community.
  • Has spread to libraries.

88
Examples of TEI Projects
  • Special collections:
  • Library of Congress's American Memory Project,
  • Literary texts:
  • U. of Virginia's E-text Collection,
  • Brown's Women Writers Project,
  • Historical editions (MEP DTD):
  • Abraham Lincoln Papers,
  • Susan B. Anthony Papers.

89
Encoded Archival Description (EAD)
  • Finding Aids to Special Collections and Archives.
  • SGML/XML-based standard. (It is a DTD.)
  • Maintained by the Library of Congress.
  • Widely adopted.

90
Examples of EAD Projects
  • Among many others: California Digital Library's
    Online Archive of California.
  • Union Database: RLG's Archival Resources Project
  • (MARC AMC records and EAD finding aids).

91
Materials Used by Libraries
  • Reference Materials:
  • Oxford English Dictionary,
  • American National Biography,
  • Electronic Journals:
  • Springer-Verlag's Link.

92
XML in Libraries: What will it change? (1)
  • EAD finding aids:
  • Offer precise and controlled search capabilities,
  • Make the creation of union databases possible.

93
XML in Libraries: What will it change? (2)
  • Full-text databases of primary sources:
  • Easy to search, display, etc.
  • Next step: union databases.
  • With precise and controlled search capabilities.
  • Full-text databases of e-journals, monographs:
  • Competition with PDF/page images, though.
  • Again, next step: union databases.

94
XML in Libraries: What will it change? (3)
  • More sophisticated and customized clients:
  • Bibliography manager,
  • Concordance program.
  • New library standards based on XML:
  • TEI, EAD
  • MARC (!)
  • XML is not just a fad: more than 10 years of
    SGML-based TEI.

95
XML in Libraries: What will it change? (4)
  • XML is more likely than any other format to
    resist obsolescence:
  • Platform independent,
  • Open standard
  • (not proprietary),
  • Written in ASCII/Unicode plain text
  • (no binary encoding, the simplest text editor can
    read it),
  • Tags are human-readable.

96
Should you use XML in your project today?
  • Are your data made of a repetition of similar
    objects? (e.g., 3000 poems)
  • Is your project database-based?
  • Is your project large?
  • Do you plan to:
  • deliver to different output devices?
  • integrate your project with others? (e.g., union
    database)
  • develop advanced capabilities? (server-side or
    client-side)