<title>XML and the Encoded Archival Description: Providing Access to Collections</title>


1
<title>XML and the Encoded Archival Description:
Providing Access to Collections</title>
  • Elizabeth Nielsen
  • elizabeth.nielsen_at_oregonstate.edu
  • Oregon State University
  • Terry Reese
  • terry.reese_at_oregonstate.edu
  • Oregon State University

2
<roadmap />
  • Introduction to XML
  • EAD as an example of an XML schema
  • Demonstrate some cross-walking
  • Questions/Comments

3
<xmlintro>
  • History of XML
  • XML's Family Tree
  • Brief descriptions of related technologies
  • Current developments and ancillary XML-related
    technologies
  • Early uses of XML in the Library
  • CONTENTdm as an example of current usage
  • Why XML?
  • XML 101: What you need to know
  • XML 101: The Basics
  • How XML works
  • Myth vs. Reality

4
<xmlintro><history><tree>
Abbreviated XML Family Tree

5
<xmlintro><history><tree><sgml />
  • SGML: Standard Generalized Markup Language
  • Grew out of GML, developed in 1969 by Ed Mosher,
    Ray Lorie and Charles Goldfarb
  • http://www.sgmlsource.com/history/roots.htm
  • Was initially developed as a method of
    structuring legal documents
  • In 1970, IBM (with whom these individuals were
    associated) extended the project to encompass
    text processing in general.
  • Example:
  • http://www.cs.rpi.edu/wiseb/talks/html-talk/node9.html

6
<xmlintro><history><tree><html />
  • HTML: HyperText Markup Language
  • Initially developed in the early 1990s at CERN
    and formally accepted by the W3 in 1992.
  • Lingua franca for publishing linked documents on
    the WWW.
  • Non-proprietary application of SGML
  • Example:
  • http://www.w3.org

7
<xmlintro><history><tree><xml>
  • XML: Extensible Markup Language
  • Initially conceived in 1996
  • Driven primarily by the W3C, Jon Bosak (Sun) and
    other SGML experts
  • First specification was developed by Tim Bray and
    C.M. Sperberg-McQueen.
  • Version 1.0 was accepted by the W3C in 1998
  • Examples:
  • http://oregonstate.edu/reeset/presentations/ola/2004/connexion.xml
  • http://oregonstate.edu/reeset/presentations/ola/2004/blm_metadata.xml

8
<xmlintro><history><tree></xml>
  • XML: Extensible Markup Language
  • Sought to meet the following goals (see the W3C's
    XML 1.0 recommendation for more info):
  • Should support a variety of different
    applications and uses by allowing
    users/developers to create their own markup.
  • Should be human readable.
  • Should follow a strict set of defined rules and
    principles.
  • i.e., XML parsers are required to stop when a
    markup error has been encountered.

9
<xmlintro><history><tree></ancillaryTech></tree>
  • XML: Extensible Markup Language
  • Additional development and ancillary technologies
  • XML Schema: http://www.w3.org/XML/Schema
  • XLink: http://www.w3.org/XML/Linking
  • XPointer: http://www.w3.org/XML/Linking
  • XSL and XSLT: http://www.w3.org/Style/XSL/
  • XHTML: http://www.w3.org/MarkUp/
  • XForms: http://www.w3.org/MarkUp/Forms/
  • XML Query: http://www.w3.org/XML/Query/
  • XML Encryption: http://www.w3.org/Encryption/2001/

10
<xmlintro><history><earlyDev><Lane />
  • Medlane: Lane Medical Library (Stanford
    University) http://laneweb.stanford.edu:2380/wiki/medlane/overview
  • Project started in 1998 in an attempt to bridge
    the access gap between print and digital
    resources.
  • Developed one of the first MARC to XML DTDs and
    crosswalks
  • Developed XMLMARC, one of the first MARC to XML
    mapping tools
  • Current projects:
  • XMLMARC: http://laneweb.stanford.edu:2380/wiki/medlane/xmlmarc
  • xobis: http://laneweb.stanford.edu:2380/wiki/medlane/schema
  • Note: xobis stands for XML organic bibliographic
    information schema. xobis is Medlane's own
    bibliographic XML schema.

11
<xmlintro><history><earlyDev><OAC />
  • OAC (Online Archive of California)
    http://www.oac.cdlib.org/
  • Between 1993 and 1995, OAC developed a prototype
    finding aid syntax in SGML. Known then as the
    Berkeley Finding Aid Project, we know it better
    today as EAD.
  • 1995-1997: OAC created one of the very first
    union databases composed primarily of EAD-encoded
    finding aids.
  • Current projects:
  • Now a part of the California Digital Library, the
    OAC remains a major contributor to and developer
    of the EAD standard.

12
<xmlintro><history><earlyDev><TEI /></earlyDev>
  • TEI (Text Encoding Initiative) http://www.tei-c.org
  • An international body consisting of libraries,
    museums, publishers, etc., started in 1987 to
    create and manage a standard descriptive language
    for linguistic texts.
  • In 1994, TEI released its first DTD and guidelines
    for creating SGML-based descriptions; XML-based
    guidelines followed later.

13
<xmlintro><history><currentDev><CONTENTdm /></currentDev></history>
  • CONTENTdm (http://www.contentdm.com)
  • CONTENTdm is a content management system for
    digital collections, and like many digital
    collection packages, it produces data in XML
    (Dublin Core specifically) and stores most data
    internally as pseudo-XML.
  • Examples:
  • Collection: http://oregonstate.edu/reeset/presentations/ola/2004/pawardsmedalsgHZw-dcq.xml
  • Internal format: http://digitalcollections.library.oregonstate.edu/pawardsmedals/index/description/desc.all

14
<xmlintro><whyXML><XML101><needToKnow>
  • When people talk about XML, they are often
    talking about both XML and its ancillary
    technologies (see Slide 9)
  • While XML stands for Extensible Markup Language,
    it is probably easier to think of XML as a
    grammar rather than a language.
  • This underscores XML's extensibility. XML
    defines the rules and structures for defining new
    markup formats.

15
<xmlintro><whyXML><XML101><needToKnow>
  • XML is a container format
  • Only defines the structure for storing
    information
  • Allows users/developers to define structures that
    best suit their data.
  • Examples:
  • http://oregonstate.edu/reeset/presentations/ola/2004/marcive.xml
  • http://oregonstate.edu/reeset/presentations/ola/2004/lindex.xml
  • Does not define how a document will be displayed,
    i.e., content is not defined by display.
  • Display of XML documents is done via a custom
    parser or a combination of ancillary technologies
    (like XSLT).

16
<xmlintro><whyXML><XML101></needToKnow>
  • How does XML compare to MARC?
  • Similarities
  • Both are formats used to transmit data
  • Both have strict markup rules
  • Differences
  • While XML has strict markup rules, its tag
    structure can change to suit its data; MARC's
    cannot.
  • MARC must be processed sequentially; XML need
    not be.
  • Examples:
  • http://oregonstate.edu/reeset/presentations/ola/2004/marc.txt
  • http://oregonstate.edu/reeset/presentations/ola/2004/marc.xml

17
<xmlintro><whyXML><XML101><basics>
  • Like standard SGML/HTML
  • Uses start/end tags
  • <name></name>
  • <name />
  • Default character set is UTF-8, not ANSI, and all
    XML parsers must support both UTF-8 and UTF-16 to
    be compliant.
  • i.e., XML can represent multiple character sets.
  • Examples:
  • http://oregonstate.edu/reeset/presentations/ola/2004/DIAC8.xml
  • http://oregonstate.edu/reeset/presentations/ola/2004/unicode.xml
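
A minimal sketch, using only Python's standard library, of the two points on this slide: both spellings of an empty element parse to the same thing, and the same document can be delivered as UTF-8 or UTF-16. The element name comes from the slide; the sample text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two spellings of an empty element: start/end pair and self-closing tag.
long_form = ET.fromstring("<name></name>")
short_form = ET.fromstring("<name />")

# Both parse to an element with no text and no children.
same_empty = (
    long_form.tag == short_form.tag
    and not long_form.text
    and not short_form.text
    and len(long_form) == len(short_form) == 0
)

# The same (hypothetical) document serialized in two encodings.
body = "<name>Zo\u00eb</name>"
as_utf8 = body.encode("utf-8")  # no declaration: UTF-8 is assumed
as_utf16 = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

# The parser works out the encoding and yields the same characters.
from_utf8 = ET.fromstring(as_utf8).text
from_utf16 = ET.fromstring(as_utf16).text
```

Either byte string gives back the same text, which is the practical meaning of "XML can represent multiple character sets."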

18
<xmlintro><whyXML><XML101><basics>
  • Structure
  • Uses, but doesn't require, a DTD (Document Type
    Definition)
  • Rigid structure
  • Empty tags must be identified
  • <name></name>
  • <name />
  • Tags must always be closed
  • Not valid (the Paragraph tag is not closed):
    <Paragraph> <Sentence>
    <Line></Line> <Line></Line> <Paragraph>
    <Sentence>
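
The "tags must always be closed" rule can be checked mechanically. The sketch below is one way to do it with Python's standard library; `is_well_formed` is a hypothetical helper name, and the repaired document is only an assumption about what the slide's example was meant to look like. Note that this checks the markup rules only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return (True, None) if text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        # The parser stops at the first markup error it meets.
        return False, str(err)


# The slide's example: elements are opened but never closed.
broken = (
    "<Paragraph> <Sentence> <Line></Line> <Line></Line> "
    "<Paragraph> <Sentence>"
)

# One possible repair (an assumption about the intended structure).
fixed = (
    "<Paragraph> <Sentence> <Line></Line> <Line></Line> "
    "</Sentence> </Paragraph>"
)

broken_ok, broken_error = is_well_formed(broken)
fixed_ok, _ = is_well_formed(fixed)
```

The broken document is rejected outright rather than partially displayed, which is the behavior described earlier: an XML parser is required to stop on a markup error.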

19
<xmlintro><whyXML><XML101><basics>
  • XML Attributes
  • Used to describe an element or provide additional
    information
  • Generally used to provide information that
    doesn't necessarily fit within the rest of the
    XML document, i.e., information that is more
    useful to an XML parser than to a human reader.
  • Example: <book id="120111021">The Catcher in the
    Rye</book>
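
To illustrate the split between content for readers and attributes for software, here is a short sketch with Python's standard library. It reuses the slide's book element; the `isbn` lookup is a made-up attribute name, included only to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring('<book id="120111021">The Catcher in the Rye</book>')

title = book.text            # element content, meant for the reader
record_id = book.get("id")   # attribute value, meant for software
missing = book.get("isbn")   # hypothetical attribute: absent, so None
```

A program would typically key on `record_id` while displaying only `title`.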

20
<xmlintro><whyXML><XML101></basics>
  • XML Entities
  • Defined within a document's DTD
  • Provide a list of parameters required (or
    applicable) for each XML element or link external
    information to an XML document.
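
As a sketch of the simplest case only, the example below declares one internal entity in a document's own DTD subset and lets the parser expand it. The entity name `osu`, its value, and the `note` element are all made up for illustration; external entities and attribute-list declarations are not covered here.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring one general entity (hypothetical name/value).
doc = """<!DOCTYPE note [
  <!ENTITY osu "Oregon State University">
]>
<note>Finding aid prepared at &osu;.</note>"""

note = ET.fromstring(doc)

# The parser substitutes the entity's replacement text while reading.
expanded = note.text
```

The entity is written once in the DTD and can then be referenced anywhere in the document, so a change to the declaration changes every use.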

21
<xmlintro><whyXML><XML101><howitworks>
  • If XML is simply a grammar, then ...
  • XML requires other technologies to make it work
    (i.e., to be more than simply a container)
  • Includes both ancillary technologies and
    XML-specific technologies like parsers, editors,
    validators, etc.
  • The implication is that XML brings with it lots
    of new software.

22
<xmlintro><whyXML><XML101><howitworks>
  • DOM: Document Object Model (developers)
  • The W3C DOM is a platform- and language-neutral
    interface that permits script to access and
    update the content, structure, and style of a
    document. [1]
  • DOM is not new to Web development: JavaScript
    users have used DOM for years to access document
    elements
  • Very robust tool set for manipulating XML
    documents.
  • Very resource intensive: a parser must create an
    in-memory tree of the entire document, so larger
    documents are sometimes difficult to process
    using DOM.

[1] Microsoft Corp. About the W3C object model.
Retrieved from http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dom/domoverview.asp
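
A rough sketch of the DOM style of working, using `xml.dom.minidom` from Python's standard library: the whole document is loaded into a tree, which can then be read and changed in place. The sample record and its element names are hypothetical.

```python
from xml.dom import minidom

# Parse a small (hypothetical) record into an in-memory tree.
dom = minidom.parseString(
    "<collection><title>Sample Papers</title></collection>"
)

# Access: look up an element by name and read its text node.
title = dom.getElementsByTagName("title")[0]
original_text = title.firstChild.data

# Update: change content and add structure without re-reading the file.
title.firstChild.data = "Sample Papers, 1895-1962"
date = dom.createElement("date")
date.appendChild(dom.createTextNode("1895-1962"))
dom.documentElement.appendChild(date)

updated_xml = dom.documentElement.toxml()
```

Because the entire tree lives in memory, this convenience is also the source of the resource cost noted on the slide.
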
23
<xmlintro><whyXML><XML101><howitworks>
  • SAX: Simple API for XML (developers)
  • Originally developed for the Java language
    (first widely accepted XML API for use with Java)
  • Unlike DOM, SAX is event-driven, not
    object-driven.
  • SAX requires a smaller memory footprint, so it is
    ideal for processing larger files.
  • In general, it offers fewer capabilities for
    manipulating a document's data than DOM.
    However, it is ideal for reading back data from
    an XML file.
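
For contrast with DOM, here is a sketch of event-driven reading with `xml.sax` from Python's standard library. The `LineCounter` class and the sample document are made up for illustration; the point is that the handler reacts to start, text, and end events as they stream past and never holds the whole document.

```python
import xml.sax


class LineCounter(xml.sax.ContentHandler):
    """Counts <Line> elements and collects their text as events arrive."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.lines = []
        self._inside = False

    def startElement(self, name, attrs):
        if name == "Line":
            self.count += 1
            self._inside = True
            self.lines.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self._inside:
            self.lines[-1] += content

    def endElement(self, name):
        if name == "Line":
            self._inside = False


sample = (
    b"<Paragraph><Sentence><Line>first</Line>"
    b"<Line>second</Line></Sentence></Paragraph>"
)

handler = LineCounter()
xml.sax.parseString(sample, handler)
```

The handler only keeps what it chooses to keep, which is why this approach suits large files but offers little help for editing a document.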

24
<xmlintro><whyXML><XML101></howitworks>
  • XSL and XSLT
  • Provides a simple method to define display
    elements
  • Works like style sheets
  • Can embed JavaScript/VBScript/objects into style
    sheets
  • Example:
  • http://oregonstate.edu/reeset/presentations/ola/2004/EADtoMARC21slimXML.xsl

25
<xmlintro><whyXML><XML101><implications>
  • In addition to new software and technologies
    specifically geared toward XML, we see:
  • The document itself as information (not simply a
    page)
  • Illustrates relationships between data
  • Interlinks documents
  • Documents become smarter
  • Document elements have properties that can now be
    manipulated and reordered without needing to
    rework the document.

26
<xmlintro><whyXML><XML101></implications>
  • Using XML will very rarely provide an
    out-of-the-box solution. In addition to coding
    the XML, someone must also:
  • Define DTDs
  • Create stylesheets or processing software to
    render the XML document
  • Set up a system for validating XML documents.

27
<xmlintro><whyXML><XML101><myths>
  • Myth: XML will eliminate the need for proprietary
    formats
  • While it is true that XML is a non-proprietary
    markup format, the DTDs that are created to
    validate an XML document need not be.
  • Example of this: Adobe's PDF files are stored in
    an encrypted, yet proprietary, form of XML.
  • In a lot of ways, XML could actually exacerbate
    the problem, since proprietary DTDs essentially
    lock a document format.

28
<xmlintro><whyXML><XML101><myths>
  • Myth: XML will do my work for me
  • Well, no. XML provides a uniform structure that
    will allow a programmer (or programs) to
    transmit, manipulate, arrange or redisplay your
    data, but it does have its limitations ...

29
<xmlintro><whyXML><XML101><myths>
  • Myth: XML will help me do my work faster
  • Depending on the type of work you do, this may
    be true. XML is:
  • very flexible
  • a good general-purpose tool
  • But:
  • it is also verbose
  • computationally intensive (flexibility, here,
    comes at a cost).

30
<xmlintro><whyXML><XML101></myths></whyXML></xmlintro>
  • Myth: XML is the best tool for everything
  • Again, like any tool, XML excels in some areas
    and lags behind in others.

31
<eadelizabeth />
  • EAD as an example of an XML schema

32
<xmlcrosswalk><intro>
  • Why are computer systems and description systems
    migrating to XML?
  • The use of XML takes us one step closer to what
    I, and many programmers, look at as the Holy
    Grail: true data interoperability.

33
<xmlcrosswalk><intro>
  • Before XML
  • Outside of delimited formats, few good,
    non-proprietary formats existed that could be
    used to place information in a system-neutral
    format.
  • Problems with delimited formats:
  • Lose document formatting
  • Lose established relationships within a document.
  • Examples:
  • http://oregonstate.edu/reeset/presentations/ola/2004/dindex.txt
  • http://oregonstate.edu/reeset/presentations/ola/2004/lindex.xml
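
A small sketch of what "losing established relationships" means in practice, using Python's standard library. The record, its element names, and its values are hypothetical and are not taken from the linked example files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record: one title, two subjects, each with its own subdivision.
xml_record = ET.fromstring(
    "<record><title>Sample Papers</title>"
    "<subject><term>Home economics</term><sub>Study and teaching</sub></subject>"
    "<subject><term>Nutrition</term><sub>Research</sub></subject>"
    "</record>"
)

# In XML, each subdivision stays attached to its own subject.
pairs = [
    (subject.findtext("term"), subject.findtext("sub"))
    for subject in xml_record.findall("subject")
]

# Flattened to one delimited row, the grouping is gone: a reader has to
# guess which of the trailing columns belong together.
buffer = io.StringIO()
csv.writer(buffer).writerow(
    [xml_record.findtext("title")] + [text for pair in pairs for text in pair]
)
flat_row = buffer.getvalue().strip()
```

The delimited row still carries every value, but nothing in it says that "Study and teaching" qualifies "Home economics" rather than "Nutrition."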

34
<xmlcrosswalk><intro>
  • Before XML
  • Migrating from MARC to something else was fairly
    difficult. Non-MARC systems could not share data
    with MARC systems, and vice versa (unless a
    custom loader was created).

35
<xmlcrosswalk></intro>
  • After XML
  • Migration is fairly easy.
  • Systems require data to be available in a
    specific XML format (like Dublin Core, EAD or
    MARC21XML).
  • Data crosswalking is made easier because the
    XML structure is portable.
  • A programmer no longer needs to become familiar
    with a whole data format, just its DTD.

36
<xmlcrosswalk><example>
  • EAD to MARC crosswalk
  • Elizabeth had mentioned the need to create MARC
    records by hand from the EAD finding aid. Not
    so.
  • Using XSLT, EAD records can be easily converted
    to MARC

37
<xmlcrosswalk><example>
  • System Requirements
  • A component to handle XSLT translation
  • Examples:
  • SAXON (for UNIX folks)
  • MSXML (for Windows folks)
  • A component to handle MARC21XML to MARC translation
  • Examples:
  • MARC::Record and additional Perl libraries to
    handle XML translations (for UNIX folks)
  • MarcEdit (http://oregonstate.edu/reeset/marcedit/html/):
    a free MARC editing software suite that I
    developed, which integrates both traditional
    MARC and XML functionality.

38
<xmlcrosswalk><example>
  • For this example we will use MarcEdit.
  • Two demonstrations
  • Translating a single record
  • Translating a batch of records

39
<xmlcrosswalk><example>
  • MarcEdit utilizes:
  • MARC21XML as its base metadata format.
  • XSLT v.1 to transform XML to other formats.

40
<xmlcrosswalk><example><e1>
  • Example 1 (individual record translation)
  • Record to translate: http://nwda.wsulibs.wsu.edu/project_info/xml_markup.asp?docname=OREEDWARDS.XML
  • XSLT stylesheet: http://oregonstate.edu/reeset/presentations/ola/2004/EADtoMARC21slimXML.xsl

41
<xmlcrosswalk><example><e1>
  • Example 1
  • Live Demo
  • View AVI file: http://oregonstate.edu/reeset/presentations/ola/2004/ead_marc2004_04_0013.avi
  • Resulting MARC file:
  • http://oasis.oregonstate.edu/search/taliceledwards/taliceledwards/1,2,3,E/framesetFFtaliceledwardspapers189519622,,2
  • Comparison file (manually created record):
  • http://oasis.oregonstate.edu/search/taliceledwards/taliceledwards/1,2,3,E/framesetFFtaliceledwardspapers189519621,,2

42
<xmlcrosswalk><example></e1>
  • Observations
  • Creation of a TOC using a 505 (experimental)
  • EAD record creates local subjects
  • Translation could be better if NWDA practices
    were amended to allow nesting of <controlaccess>
    tags
  • Example:
    <controlaccess> <controlaccess>
    <subject source="lcsh" encodinganalog="650a">Home economics</subject>
    <subject source="lcsh" encodinganalog="650x">Study and teaching (Higher).</subject>
    </controlaccess></controlaccess>
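
To show why nesting helps a crosswalk, here is a sketch in plain Python. It is a stand-in for illustration only: it is not XSLT, and it is not the EADtoMARC21slimXML.xsl stylesheet or anything MarcEdit does internally. The function name `group_to_datafield`, the indicator values, and the omission of the MARC21 XML namespace are all assumptions made to keep the example short.

```python
import xml.etree.ElementTree as ET

# The nested <controlaccess> example from the slide.
ead_fragment = """<controlaccess><controlaccess>
<subject source="lcsh" encodinganalog="650a">Home economics</subject>
<subject source="lcsh" encodinganalog="650x">Study and teaching (Higher).</subject>
</controlaccess></controlaccess>"""


def group_to_datafield(group):
    """Build one MARC-style datafield from the subjects in a nested group."""
    datafield = None
    for subject in group.findall("subject"):
        analog = subject.get("encodinganalog", "")
        tag, code = analog[:3], analog[3:]  # "650a" -> "650", "a"
        if datafield is None:
            # Indicator values are assumed for this sketch.
            datafield = ET.Element(
                "datafield", {"tag": tag, "ind1": " ", "ind2": "0"}
            )
        subfield = ET.SubElement(datafield, "subfield", {"code": code})
        subfield.text = subject.text
    return datafield


outer = ET.fromstring(ead_fragment)
inner = outer.find("controlaccess")  # the nested group marks one heading
field = group_to_datafield(inner)
subfields = [(sf.get("code"), sf.text) for sf in field.findall("subfield")]
```

Because the inner `<controlaccess>` groups the two `<subject>` elements, the converter can tell that subfield x belongs to the same 650 field as subfield a instead of starting a new heading.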

43
<xmlcrosswalk><example></e2></example>
  • Example 2
  • Batch EAD Translation
  • .vbs file.
  • Demo

44
<questions_comments />
  • Contact information
  • Elizabeth Nielsen: elizabeth.nielsen_at_oregonstate.edu
  • Terry Reese: terry.reese_at_oregonstate.edu
  • Presentation address: http://oregonstate.edu/reeset/presentations/ola/2004/ola_EAD_XML_2004.ppt