Title: <title>XML and the Encoded Archival Description: Providing Access to Collections</title>
1 <title>XML and the Encoded Archival Description: Providing Access to Collections</title>
- Elizabeth Nielsen
- elizabeth.nielsen_at_oregonstate.edu
- Oregon State University
- Terry Reese
- terry.reese_at_oregonstate.edu
- Oregon State University
2 <roadmap />
- Introduction to XML
- EAD as an example of an XML schema
- Demonstrate some cross-walking
- Questions/Comments
3 <xmlintro>
- History of XML
- XML's Family Tree
- Brief descriptions of related technologies
- Current developments and ancillary XML-related technologies
- Early uses of XML in the Library
- CONTENTdm as an example of current usage
- Why XML?
- XML 101: What you need to know
- XML 101: The Basics
- How XML works
- Myth vs. Reality
4 <xmlintro><history><tree>
Abbreviated XML Family Tree
5 <xmlintro><history><tree><sgml />
- SGML: Standard Generalized Markup Language
- Grew out of GML (Generalized Markup Language), developed at IBM in 1969 by Ed Mosher, Ray Lorie and Charles Goldfarb; SGML itself became an ISO standard (ISO 8879) in 1986
- http://www.sgmlsource.com/history/roots.htm
- Was initially developed as a method of structuring legal documents
- In 1970, IBM (with whom these individuals were associated) extended the project to encompass text processing in general.
- Example
- http://www.cs.rpi.edu/wiseb/talks/html-talk/node9.html
6 <xmlintro><history><tree><html />
- HTML: HyperText Markup Language
- Initially developed in the early 1990s at CERN; first standardized as HTML 2.0 (IETF RFC 1866, 1995), with W3C Recommendations following from HTML 3.2 (1997).
- Lingua franca for publishing linked documents on the WWW.
- Non-proprietary application of SGML
- Example
- http://www.w3.org
7 <xmlintro><history><tree><xml>
- XML: Extensible Markup Language
- Initially conceived in 1996
- Driven primarily by the W3C, Jon Bosak (Sun) and other SGML experts
- First specification was developed by Tim Bray and C.M. Sperberg-McQueen.
- Version 1.0 was accepted by the W3C in 1998
- Examples
- http://oregonstate.edu/reeset/presentations/ola/2004/connexion.xml
- http://oregonstate.edu/reeset/presentations/ola/2004/blm_metadata.xml
8 <xmlintro><history><tree></xml>
- XML: Extensible Markup Language
- Sought to meet the following goals (see the W3C's XML 1.0 recommendation for more info)
- Should support a variety of different applications and uses by allowing users/developers to create their own markup.
- Should be human readable.
- Should follow a strict set of defined rules and principles.
- i.e., XML parsers are required to stop when a markup error has been encountered.
9 <xmlintro><history><tree></ancillaryTech></tree>
- XML: Extensible Markup Language
- Additional development and ancillary technologies
- XML Schema: http://www.w3.org/XML/Schema
- XLink: http://www.w3.org/XML/Linking
- XPointer: http://www.w3.org/XML/Linking
- XSL and XSLT: http://www.w3.org/Style/XSL/
- XHTML: http://www.w3.org/MarkUp/
- XForms: http://www.w3.org/MarkUp/Forms/
- XML Query: http://www.w3.org/XML/Query/
- XML Encryption: http://www.w3.org/Encryption/2001/
10 <xmlintro><history><earlyDev><Lane />
- Medlane: Lane Medical Library (Stanford University) http://laneweb.stanford.edu2380/wiki/medlane/overview
- Project started in 1998 in an attempt to bridge the access gap between print and digital resources.
- Developed one of the first MARC-to-XML DTDs and crosswalks
- Developed XMLMARC, one of the first MARC-to-XML mapping programs
- Current projects
- XMLMARC: http://laneweb.stanford.edu2380/wiki/medlane/xmlmarc
- xobis: http://laneweb.stanford.edu2380/wiki/medlane/schema
- Note: xobis stands for XML organic bibliographic information schema. xobis is Medlane's own bibliographic XML schema.
11 <xmlintro><history><earlyDev><OAC />
- OAC (Online Archive of California) http://www.oac.cdlib.org/
- Between 1993 and 1995, OAC developed a prototype finding aid syntax in SGML. Known then as the Berkeley Finding Aid Project, we know it better today as EAD.
- 1995-1997: OAC created one of the very first union databases composed primarily of EAD-encoded finding aids.
- Current projects
- Now a part of the California Digital Library, the OAC remains a major contributor to and developer of the EAD standard.
12 <xmlintro><history><earlyDev><TEI /></earlyDev>
- TEI (Text Encoding Initiative) http://www.tei-c.org
- An international body consisting of libraries, museums, publishers, etc., started in 1987 to create and manage a standard descriptive language for linguistic texts.
- 1994: TEI released its first DTD and guidelines (P3) for creating SGML-based descriptions; XML support followed with P4 in 2002.
13 <xmlintro><history><currentDev><CONTENTdm /></currentDev></history>
- CONTENTdm (http://www.contentdm.com)
- CONTENTdm is a content management system for digital collections and, like many digital collection packages, it produces data in XML (Dublin Core specifically) and stores most data internally as pseudo-XML.
- Examples
- Collection: http://oregonstate.edu/reeset/presentations/ola/2004/pawardsmedalsgHZw-dcq.xml
- Internal format: http://digitalcollections.library.oregonstate.edu/pawardsmedals/index/description/desc.all
14 <xmlintro><whyXML><XML101><needToKnow>
- When people talk about XML, they are often talking about both XML and its ancillary technologies (see Slide 9)
- While XML stands for Extensible Markup Language, it is probably easier to think of XML as a grammar rather than a language.
- This underscores XML's extensibility. XML defines the rules and structures for defining new markup formats.
15 <xmlintro><whyXML><XML101><needToKnow>
- XML is a container format
- Only defines the structure for storing information
- Allows users/developers to define structures that best suit their data.
- Example
- http://oregonstate.edu/reeset/presentations/ola/2004/marcive.xml
- http://oregonstate.edu/reeset/presentations/ola/2004/lindex.xml
- Does not define how a document will be displayed, i.e., content is not defined by display.
- Display of XML documents is done via a custom parser or a combination of ancillary technologies (like XSLT).
16 <xmlintro><whyXML><XML101></needToKnow>
- How does XML compare to MARC?
- Similarities
- Both are formats used to transmit data
- Both have strict markup rules
- Differences
- While XML has strict markup rules, its tag structure can change to suit its data; MARC's cannot.
- MARC must be processed sequentially; XML need not be.
- Examples
- http://oregonstate.edu/reeset/presentations/ola/2004/marc.txt
- http://oregonstate.edu/reeset/presentations/ola/2004/marc.xml
17 <xmlintro><whyXML><XML101><basics>
- Like standard SGML/HTML
- Uses start/end tags
- <name></name>
- <name />
- Default character encoding is UTF-8, not ANSI, and all XML parsers must support both UTF-8 and UTF-16 to be compliant.
- i.e., XML can represent multiple character sets.
- Example
- http://oregonstate.edu/reeset/presentations/ola/2004/DIAC8.xml
- http://oregonstate.edu/reeset/presentations/ola/2004/unicode.xml
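
The encoding requirement above can be checked with a minimal sketch in Python's standard library. The sample element and its text are hypothetical and are not taken from the presentation's example files. The same document is serialized once as UTF-8 and once as UTF-16, and a compliant parser should hand back identical text for both.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the element name and text are illustrative only.
DOCUMENT = '<?xml version="1.0" encoding="{enc}"?><name>Café Müller</name>'


def parse_as(encoding: str) -> str:
    """Serialize the sample in the given encoding, parse it, and return its text."""
    raw = DOCUMENT.format(enc=encoding).encode(encoding)
    return ET.fromstring(raw).text


utf8_text = parse_as("UTF-8")
utf16_text = parse_as("UTF-16")  # Python's UTF-16 codec prepends a byte order mark

# Both byte streams decode to the same characters.
print(utf8_text == utf16_text)
```

The bytes on disk differ between the two encodings, but the parsed content does not, which is what lets XML carry multiple character sets safely.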
18 <xmlintro><whyXML><XML101><basics>
- Structure
- Uses, but doesn't require, a DTD (Document Type Definition)
- Rigid structure
- Empty tags must be identified
- <name></name>
- <name />
- Tags must always be closed
- Not well-formed (Paragraph tag is not closed): <Paragraph> <Sentence> <Line></Line> <Line></Line> <Paragraph> <Sentence>
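
As a rough illustration of these structural rules, the sketch below uses Python's built-in ElementTree parser. The element names follow the slide; the helper `is_well_formed` is a hypothetical name introduced here. It shows that the two empty-tag spellings parse to the same thing, and that a parser stops with an error when a tag is left unclosed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the parser accepts the markup, False if it stops with an error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The two ways of writing an empty element are equivalent to a parser.
long_form = ET.fromstring("<name></name>")
short_form = ET.fromstring("<name />")
same_element = (long_form.tag, long_form.text) == (short_form.tag, short_form.text)

# <Paragraph> is opened but never closed, so parsing fails.
unclosed = "<Paragraph><Sentence><Line></Line><Line></Line></Sentence>"
closed = unclosed + "</Paragraph>"

print(same_element, is_well_formed(unclosed), is_well_formed(closed))
```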
19 <xmlintro><whyXML><XML101><basics>
- XML Attributes
- Used to describe an element or provide additional information
- Generally used to provide information that doesn't necessarily fit within the rest of the XML document, i.e., information that is more useful to an XML parser than to a human reader.
- Example: <book id="120111021">The Catcher in the Rye</book>
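
A small sketch of how a program might read the attribute separately from the element's content, using Python's ElementTree. It assumes the slide's example was a `book` element with an `id` attribute holding the quoted value `120111021`; the extraction lost the original punctuation, so that reading is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the slide's example element.
book = ET.fromstring('<book id="120111021">The Catcher in the Rye</book>')

record_id = book.get("id")   # attribute: machine-oriented identifier
title = book.text            # element content: what a human reader sees

print(record_id, "|", title)
```

Keeping the identifier in an attribute lets software address the record without changing the text shown to readers.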
20 <xmlintro><whyXML><XML101></basics>
- XML Entities
- Defined within a document's DTD
- Provide named shortcuts, such as a reusable list of parameters required (or applicable) for an XML element, or link external information to an XML document.
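
To make the idea concrete, here is a minimal sketch of an internal entity declared in a DTD and expanded by the parser. The entity name `osu` and the element names are hypothetical and introduced only for illustration. External entities are deliberately left out, since many parsers restrict them for security reasons.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "osu" and the element names are
# illustrative and do not come from the presentation.
DOCUMENT = """<!DOCTYPE findingaid [
  <!ENTITY osu "Oregon State University">
]>
<findingaid><repository>&osu; Archives</repository></findingaid>"""

root = ET.fromstring(DOCUMENT)

# The parser replaces &osu; with the text declared in the DTD.
repository = root.find("repository").text
print(repository)
```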
21 <xmlintro><whyXML><XML101><howitworks>
- If XML is simply a grammar, then ...
- XML requires other technologies to make it work (i.e., to be more than simply a container)
- Includes both ancillary technologies and XML-specific technologies like parsers, editors, validators, etc.
- Implication is that XML brings with it lots of new software.
22 <xmlintro><whyXML><XML101><howitworks>
- DOM: Document Object Model (developers)
- "The W3C DOM is a platform- and language-neutral interface that permits script to access and update the content, structure, and style of a document." [1]
- DOM is not new to Web development: JavaScript users have used DOM for years to access document elements
- Very robust tool set for manipulating XML documents.
- Very resource intensive: a parser must create an in-memory tree of the entire document, so larger documents are sometimes difficult to process using DOM.
[1] Microsoft Corp. About the W3C object model. Retrieved from http://msdn.microsoft.com/library/default.asp?url/workshop/author/dom/domoverview.asp
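
A minimal DOM sketch using Python's standard `xml.dom.minidom` module. The sample collection is hypothetical. The whole document is loaded into an in-memory tree, which is what makes DOM convenient for both reading and updating, and also what makes it expensive for large files.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data, not taken from the presentation.
SAMPLE = "<collection><item>Letters</item><item>Photographs</item></collection>"

dom = parseString(SAMPLE)  # builds the full tree in memory

# Access: walk the tree and read each item's text node.
titles = [node.firstChild.data for node in dom.getElementsByTagName("item")]

# Update: create a new element and attach it to the root.
new_item = dom.createElement("item")
new_item.appendChild(dom.createTextNode("Diaries"))
dom.documentElement.appendChild(new_item)

updated = dom.documentElement.toxml()
print(titles)
print(updated)
```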
23 <xmlintro><whyXML><XML101><howitworks>
- SAX: Simple API for XML (developers)
- Originally developed for Java by members of the xml-dev mailing list (the first widely accepted XML API for use with Java)
- Unlike DOM, SAX is event driven, not object driven.
- SAX requires a smaller memory footprint, so it is ideal for processing larger files.
- In general, it offers fewer capabilities for manipulating a document's data than DOM. However, it is ideal for reading back data from an XML file.
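
For contrast with DOM, the sketch below reads data back through SAX events using Python's standard `xml.sax` module. The sample data and the `ItemCollector` class are hypothetical. No tree is built: the handler sees start tags, text, and end tags as they stream past, and keeps only what it needs.

```python
import xml.sax


class ItemCollector(xml.sax.ContentHandler):
    """Collect the text of every <item> element as parser events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # holds text chunks while inside an <item>

    def startElement(self, name, attrs):
        if name == "item":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate rather than assign.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.titles.append("".join(self._buffer))
            self._buffer = None


# Hypothetical sample data, not taken from the presentation.
SAMPLE = b"<collection><item>Letters</item><item>Photographs</item></collection>"

handler = ItemCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

The handler holds only one item's text at a time, which is why this style scales to files that would be awkward to load as a DOM tree.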
24 <xmlintro><whyXML><XML101></howitworks>
- XSL and XSLT
- Provides a simple method to define display elements
- Works like style sheets
- Can embed JavaScript/VBScript/objects into style sheets
- Example
- http://oregonstate.edu/reeset/presentations/ola/2004/EADtoMARC21slimXML.xsl
25 <xmlintro><whyXML><XML101><implications>
- In addition to new software and technologies specifically geared toward XML, we see
- Document itself as information (not simply a page)
- Illustrates relationships between data
- Interlinks documents
- Documents become smarter
- Document elements have properties that can now be manipulated and reordered without needing to rework the document.
26 <xmlintro><whyXML><XML101></implications>
- Using XML will very rarely provide an out-of-the-box solution. In addition to coding the XML, someone must also
- Define DTDs
- Create stylesheets or processing software to render the XML document
- Set up a system for validating XML documents.
27 <xmlintro><whyXML><XML101><myths>
- Myth: XML will eliminate the need for proprietary formats
- While it is true that XML is a non-proprietary markup format, the DTDs that are created to validate an XML document need not be.
- Example of this: a vendor-specific XML file format whose DTD is controlled, and possibly withheld, by the vendor.
- In a lot of ways, XML could actually exacerbate the problem, since proprietary DTDs essentially lock a document format.
28 <xmlintro><whyXML><XML101><myths>
- Myth: XML will do my work for me
- Well, no. XML provides a uniform structure that will allow a programmer (or programs) to transmit, manipulate, arrange or redisplay your data, but it does have its limitations...
29 <xmlintro><whyXML><XML101><myths>
- Myth: XML will help me do my work faster
- Depending on the type of work you do, this may be true. XML is
- very flexible
- a good general-purpose tool
- But
- it is also verbose
- computationally intensive (flexibility here comes at a cost).
30 <xmlintro><whyXML><XML101></myths></whyXML></xmlintro>
- Myth: XML is the best tool for everything
- Again, like any tool, XML excels in some areas and lags behind in others.
31 <eadelizabeth />
- EAD as an example of an XML schema
32 <xmlcrosswalk><intro>
- Why are computer systems and description systems migrating to XML?
- The use of XML takes us one step closer to what I, and many programmers, look at as the Holy Grail: true data interoperability.
33 <xmlcrosswalk><intro>
- Before XML
- Outside of delimited formats, few good, non-proprietary formats existed that could be used to place information in a system-neutral format.
- Problems with delimited formats
- Lose document formatting
- Lose established relationships within a document.
- Example
- http://oregonstate.edu/reeset/presentations/ola/2004/dindex.txt
- http://oregonstate.edu/reeset/presentations/ola/2004/lindex.xml
34 <xmlcrosswalk><intro>
- Before XML
- Migrating from MARC to something else was fairly difficult. Non-MARC systems could not share data with MARC systems, and vice versa (unless a custom loader was created).
35 <xmlcrosswalk></intro>
- After XML
- Migration is fairly easy.
- Systems require data to be available in a specific XML format (like Dublin Core, EAD or MARC21XML).
- Data crosswalking is made easier because the XML structure is portable.
- Programmer no longer needs to become familiar with a data format, just its DTD.
36 <xmlcrosswalk><example>
- EAD>MARC crosswalk
- Elizabeth had mentioned the need to create MARC records by hand from the EAD finding aid: not so.
- Using XSLT, EAD records can be easily converted to MARC
37 <xmlcrosswalk><example>
- System Requirements
- Component to handle XSLT translation
- Examples
- SAXON (for UNIX folks)
- MSXML (for Windows folks)
- Component to handle MARC21XML>MARC translation
- Examples
- MARC::Record plus additional Perl libraries to handle XML translations (for UNIX folks)
- MarcEdit (http://oregonstate.edu/reeset/marcedit/html/): a free MARC editing suite that I developed, which integrates both traditional MARC and XML functionality.
38 <xmlcrosswalk><example>
- For this example we will use MarcEdit.
- Two demonstrations
- Translating a single record
- Translating a batch of records
39 <xmlcrosswalk><example>
- MarcEdit utilizes
- MARC21XML as its base metadata format.
- XSLT v.1 to transform XML to other formats.
40 <xmlcrosswalk><example><e1>
- Example 1 (individual record translation)
- Record to translate: http://nwda.wsulibs.wsu.edu/project_info/xml_markup.asp?docnameOREEDWARDS.XML
- XSLT stylesheet: http://oregonstate.edu/reeset/presentations/ola/2004/EADtoMARC21slimXML.xsl
41 <xmlcrosswalk><example><e1>
- Example 1
- Live Demo
- View AVI file: http://oregonstate.edu/reeset/presentations/ola/2004/ead_marc2004_04_0013.avi
- Resulting MARC file
- http://oasis.oregonstate.edu/search/taliceledwards/taliceledwards/1,2,3,E/framesetFFtaliceledwardspapers189519622,,2
- Comparison file (manually created record)
- http://oasis.oregonstate.edu/search/taliceledwards/taliceledwards/1,2,3,E/framesetFFtaliceledwardspapers189519621,,2
42 <xmlcrosswalk><example></e1>
- Observations
- Creation of a TOC using a 505 (experimental)
- EAD record creates local subjects
- Translation could be better if NWDA practices were amended to allow nesting of <controlaccess> tags
- Example: <controlaccess> <controlaccess> <subject source="lcsh" encodinganalog="650a">Home economics</subject> <subject source="lcsh" encodinganalog="650x">Study and teaching (Higher).</subject> </controlaccess></controlaccess>
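
The `encodinganalog` attributes in the example above carry a MARC tag plus a subfield code, which is what makes a mechanical crosswalk possible. The sketch below illustrates that idea in Python's ElementTree rather than XSLT, so it is not the stylesheet used in the demo. The function name, the grouping of all same-tag subjects into one datafield, and the indicator values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML ("MARC21 slim") namespace
ET.register_namespace("marc", MARC_NS)

# EAD fragment modeled on the nested <controlaccess> example from the slide.
EAD_FRAGMENT = """
<controlaccess>
  <controlaccess>
    <subject source="lcsh" encodinganalog="650a">Home economics</subject>
    <subject source="lcsh" encodinganalog="650x">Study and teaching (Higher).</subject>
  </controlaccess>
</controlaccess>
"""


def crosswalk_subjects(ead_xml: str) -> ET.Element:
    """Build a MARCXML record whose datafields mirror the EAD encodinganalog values.

    Assumption: every <subject> sharing a MARC tag belongs to one datafield,
    and indicators are fixed at ' ' and '0' (hypothetical defaults).
    """
    record = ET.Element(f"{{{MARC_NS}}}record")
    datafields = {}
    for subject in ET.fromstring(ead_xml).iter("subject"):
        analog = subject.get("encodinganalog", "")
        if len(analog) != 4:
            continue  # expect a three-character tag plus a one-character subfield code
        tag, code = analog[:3], analog[3]
        if tag not in datafields:
            datafields[tag] = ET.SubElement(
                record, f"{{{MARC_NS}}}datafield",
                {"tag": tag, "ind1": " ", "ind2": "0"},
            )
        subfield = ET.SubElement(datafields[tag], f"{{{MARC_NS}}}subfield", {"code": code})
        subfield.text = subject.text
    return record


marc_record = crosswalk_subjects(EAD_FRAGMENT)
print(ET.tostring(marc_record, encoding="unicode"))
```

A production crosswalk would also need rules for deciding where one subject heading ends and the next begins, which is the role the nested `<controlaccess>` grouping plays in the slide's proposal.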
43 <xmlcrosswalk><example></e2></example>
- Example 2
- Batch EAD Translation
- .vbs file.
- Demo
44 <questions_comments />
- Contact information
- Elizabeth Nielsen: elizabeth.nielsen_at_oregonstate.edu
- Terry Reese: terry.reese_at_oregonstate.edu
- Presentation address: http://oregonstate.edu/reeset/presentations/ola/2004/ola_EAD_XML_2004.ppt