Title: New Digital Library Possibilities Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
1 New Digital Library Possibilities Using the Open
Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH)
- Michael L. Nelson
- Old Dominion University
- Norfolk, Virginia, USA
- mln_at_cs.odu.edu
- http://www.cs.odu.edu/mln/icsep/
International Conference on Scientific Electronic
Publishing in Developing Countries, Valparaiso,
Chile, October 2, 2002
Several slides also from Van de Sompel & Warner
2 Random Thoughts
- Thanks to the Organizing Committee for inviting
me
- I wish I had paid attention to my high school
Spanish classes
- Publishers & Editors: if you want increased
coverage, exposure and readership, you must do
OAI
3 Outline
- OAI-PMH history and technical highlights
- a full technical review is out of the scope of
this presentation
- Example data provider use
- Example service provider uses
- Implications for authors and editors
- Looking to the future
4 Open Archives Initiative
5 The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the
time, is attractive in theory but troublesome in
practice
- Davis & Lagoze, JASIS 51(3), pp. 273-80
- Powell & French, Proc 5th ACM DL, pp. 264-265
- distributed searching of N nodes still viable,
but only for small values of N
- NCSTRL: N > 100, bad
- NTRS/NIX: N < 20, ok (but could be better)
6 The Rise and Fall of Distributed Searching
- Other problems of distributed searching (from
STARTS)
- source-metadata problem
- how do you know which nodes to search?
- query-language problem
- syntax varies and drifts over time between the
various nodes
- rank-merging problem
- how do you meaningfully merge multiple result
sets?
- Temptations
- centralize all functions
- everything will be done at X
- standardize on a single product
- everyone will use system Y
7 Santa Fe Convention 02/2000
- goal: optimize discovery of e-prints
- http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
- input
- the UPS prototype
- http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
- RePEc / SODA: data provider / service provider
model
- Dienst protocol
- deliberations at Santa Fe meeting 10/99
8 Data and Service Providers
- Data Providers
- publishing into an archive
- providing methods for metadata harvesting
- provide non-technical context for sharing
information also
- Service Providers
- harvest metadata from providers
- implement user interface to data
- Self-describing archives
- Much of the learning about the constituent UPS
archives occurred out of band
Even if these are done by the same DL, these are
distinct roles
9 Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
[Figure: a user query ("search for cfd applications") runs against the service provider's local copy of metadata, which is harvested offline from each independently maintained node. All searching, browsing, etc. is performed on the metadata at the service provider; individual nodes can still support direct user interaction.]
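The harvesting model on this slide can be sketched as a small local index: metadata is copied in ahead of time, and queries never touch the remote nodes. This is only a minimal illustration. The identifiers and titles below are invented, and `build_index` and `search` are hypothetical helper names, not part of any OAI tool.

```python
from collections import defaultdict

def build_index(harvested):
    """Map each lowercase title word to the identifiers whose title contains it."""
    index = defaultdict(set)
    for identifier, title in harvested.items():
        for word in title.lower().split():
            index[word].add(identifier)
    return index

def search(index, query):
    """Return identifiers whose titles contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical metadata, as if harvested offline from two independent nodes.
harvested = {
    "oai:nodeA:1": "CFD applications in wing design",
    "oai:nodeA:2": "Wind tunnel calibration",
    "oai:nodeB:7": "Survey of CFD applications",
}
index = build_index(harvested)
print(sorted(search(index, "cfd applications")))
```

The search runs entirely on the local copy; the full documents stay at the nodes that own them.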
10 OAI-PMH v.1.0 01/2001
- low-barrier interoperability specification
- metadata harvesting model: data provider /
service provider
- focus on document-like objects
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- experimental: 12-18 months
11Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
12 OAI-PMH 2.0
- Good news: OAI-PMH is still
- Six Verbs & Dublin Core
- Incremental improvements
- single XML schema
- ambiguities removed
- more expressive options
- cleaner separation of roles & responsibilities
- Bad news: not backwards compatible with 1.1
13 Dublin Core
- Dublin Core Metadata Initiative
- http://www.dublincore.org/
- from 1994-1995, recognizing the need for simple,
interoperable metadata for resource discovery
- good overview of metadata & DC:
http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
- 15 elements (qualifiers possible)
14 OAI Mechanics
Request is encoded in HTTP
Response is encoded in XML
XML Schemas for the responses are defined in the
OAI-PMH document
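As a rough sketch of these mechanics, the request side is just an HTTP query string and the response side is XML that any standard parser can read. The helper names `build_request_url` and `parse_get_record_header` are hypothetical, and the sample response is a trimmed illustration modeled on the arXiv example in this talk rather than a captured reply. No network access is attempted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def build_request_url(base_url, verb, **arguments):
    """Encode an OAI-PMH request as an HTTP GET URL."""
    params = {"verb": verb, **arguments}
    return base_url + "?" + urlencode(params)

def parse_get_record_header(xml_text):
    """Pull identifier, datestamp and set membership out of a GetRecord response."""
    root = ET.fromstring(xml_text)
    header = root.find(f"{OAI}GetRecord/{OAI}record/{OAI}header")
    if header is None:
        raise ValueError("no GetRecord header in response")
    return {
        "identifier": header.findtext(f"{OAI}identifier"),
        "datestamp": header.findtext(f"{OAI}datestamp"),
        "sets": [s.text for s in header.findall(f"{OAI}setSpec")],
    }

# Illustrative response body (assumed shape, trimmed for brevity).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
    </record>
  </GetRecord>
</OAI-PMH>"""

url = build_request_url("http://arXiv.org/oai2", "GetRecord",
                        identifier="oai:arXiv:cs/0112017",
                        metadataPrefix="oai_dc")
header = parse_get_record_header(SAMPLE_RESPONSE)
print(url)
print(header)
```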
15 Overview of OAI-PMH Verbs
Verb                  Function
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record
Identify, ListMetadataFormats and ListSets return metadata about the repository; ListIdentifiers, ListRecords and GetRecord are the harvesting verbs
most verbs take arguments: dates, sets, ids,
metadata formats and resumption token (for flow
control)
16 protocol vs periphery
- clear distinction between protocol and periphery
- fixed protocol document
- extensible implementation guidelines
- e.g. sample metadata formats, description
containers, about containers
- allows for OAI guidelines and community
guidelines
17 OAI-PMH vs HTTP
- clear separation of OAI-PMH and HTTP
- OAI-PMH error handling
- all OK at HTTP level? => 200 OK
- something wrong at OAI-PMH level? => OAI-PMH
error (e.g. badVerb)
- HTTP codes 302, 503, etc. still available to
implementers, but no longer represent OAI-PMH
events
18 resource - item - record
item = identifier
record = identifier + metadata format + datestamp
19 other general changes
- better definitions of harvester, repository,
item, unique identifier, record, set, selective
harvesting
- oai_dc schema builds on DCMI XML Schema for
unqualified Dublin Core
- usage of must, must not etc. as in RFC 2119
- wording on response compression
20 other general changes
- all protocol responses can be validated with a
single XML Schema
- easier for data providers
- no redundancy in type definitions
- SOAP-ready
- clean for error handling
21 response: no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" ...>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        ..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
22 response: with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
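Because protocol errors arrive inside a response that is fine at the HTTP level, a harvester has to inspect the XML body rather than rely on the status code. The following is a minimal sketch of that check. `oai_errors` and `classify` are hypothetical names, and the sample body is an illustration modeled on the badVerb example, with the OAI-PMH 2.0 namespace added.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def oai_errors(xml_text):
    """Return (code, message) pairs for every <error> element in a response."""
    root = ET.fromstring(xml_text)
    return [(e.get("code"), (e.text or "").strip())
            for e in root.findall(f"{OAI}error")]

def classify(http_status, xml_text):
    """Separate HTTP-level problems from OAI-PMH-level errors."""
    if http_status != 200:
        return "http-level problem"
    return "oai-pmh error" if oai_errors(xml_text) else "ok"

# Illustrative error response (assumed shape).
ERROR_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

print(classify(200, ERROR_RESPONSE), oai_errors(ERROR_RESPONSE))
```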
23 resumptionToken
scenario: harvesting 2770 records in 3
separate chunks of up to 1000 records each
24 resumptionToken
- idempotency of resumptionToken: return same
incomplete list when rT is reissued
- while no changes occur in the repo: strict
- while changes occur in the repo: all items with
unchanged datestamp
- new, optional attributes for the resumptionToken
- expirationDate
- completeListSize
- cursor
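The flow-control loop a harvester runs can be sketched as follows: keep re-issuing the request with the last resumptionToken until the repository stops returning one. The in-memory repository here is a hypothetical stand-in (its token is simply the next offset as a string), sized to match the 2770-record scenario. Real tokens are opaque and repository-defined.

```python
def harvest_all(fetch_page):
    """Collect every record by following resumptionTokens until none is returned."""
    records, token = [], None
    while True:
        page, token = fetch_page(token)
        records.extend(page)
        if token is None:
            return records

def make_fake_repository(total=2770, chunk=1000):
    """Hypothetical repository: serves `total` records in chunks of at most `chunk`."""
    calls = []  # start offset of every request, for inspection

    def fetch_page(token):
        start = int(token) if token else 0
        end = min(start + chunk, total)
        calls.append(start)
        next_token = str(end) if end < total else None
        return list(range(start, end)), next_token

    return fetch_page, calls

fetch_page, calls = make_fake_repository()
records = harvest_all(fetch_page)
print(len(records), "records in", len(calls), "requests")
```

Re-issuing the same token against this unchanged repository returns the same incomplete list, which is the strict form of idempotency described above.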
25 harvesting granularity
- harvesting granularity
- mandatory support of YYYY-MM-DD
- optional support of YYYY-MM-DDThh:mm:ssZ
- other granularities considered, but ultimately
rejected
- granularity of from and until must be the same
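A repository or harvester might enforce these granularity rules along the following lines. This is a sketch only: `granularity_of` and `check_range` are hypothetical helpers, the patterns check the shape of a datestamp rather than whether it is a real calendar date, and raising `ValueError` stands in for whatever error a real implementation would report.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

def granularity_of(stamp):
    """Classify a datestamp as day or seconds granularity (shape check only)."""
    if DAY.fullmatch(stamp):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(stamp):
        return "YYYY-MM-DDThh:mm:ssZ"
    raise ValueError(f"unsupported datestamp: {stamp!r}")

def check_range(from_stamp, until_stamp, supports_seconds=False):
    """Require from and until to share a granularity the repository supports."""
    g_from = granularity_of(from_stamp)
    g_until = granularity_of(until_stamp)
    if g_from != g_until:
        raise ValueError("from and until must have the same granularity")
    if g_from == "YYYY-MM-DDThh:mm:ssZ" and not supports_seconds:
        raise ValueError("repository only supports YYYY-MM-DD")
    return g_from

print(check_range("2001-12-14", "2002-02-08"))
```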
26 Identify
<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>r.e.gillian_at_larc.nasa.gov</adminEmail>
  <adminEmail>rgillian_at_visi.net</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
27 header
- header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    ..
  </metadata>
</record>
28 ListIdentifiers
- ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="ListIdentifiers" ...>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
29 provenance
- introduction of provenance container to
facilitate tracing of harvesting history
<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
      </originDescription>
    </originDescription>
  </provenance>
</about>
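Since originDescription elements nest, the harvesting history can be recovered by walking inward from the outermost one. The sketch below assumes that nesting and matches elements by local name so it does not depend on a particular namespace. The inner repository URL and its harvest date are invented for illustration, and `harvest_history` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def child(element, name):
    """Return the first direct child with the given local name, or None."""
    for candidate in element:
        if local(candidate.tag) == name:
            return candidate
    return None

def text_of(element, name):
    """Return the stripped text of a named child, or None if it is absent."""
    found = child(element, name)
    return None if found is None or found.text is None else found.text.strip()

def harvest_history(provenance_xml):
    """List (baseURL, harvestDate) pairs, most recent harvest first."""
    node = child(ET.fromstring(provenance_xml), "originDescription")
    history = []
    while node is not None:
        history.append((text_of(node, "baseURL"), text_of(node, "harvestDate")))
        node = child(node, "originDescription")
    return history

# Illustrative two-step history; the inner entry is hypothetical.
SAMPLE = """<provenance>
  <originDescription>
    <baseURL>http://an.oa.org</baseURL>
    <harvestDate>2001-08-15T12:01:30Z</harvestDate>
    <originDescription>
      <baseURL>http://another.oa.example</baseURL>
      <harvestDate>2001-07-01T00:00:00Z</harvestDate>
    </originDescription>
  </originDescription>
</provenance>"""

print(harvest_history(SAMPLE))
```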
30 friends
- introduction of friends container to facilitate
discovery of repositories
<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>
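A service provider could use the friends container to grow its list of repositories to harvest. The sketch below only extracts the advertised base URLs from a description fragment. `friend_base_urls` is a hypothetical helper, and elements are matched by local name so the code makes no assumption about the container's namespace.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def friend_base_urls(description_xml):
    """Return every baseURL listed inside a <friends> container."""
    root = ET.fromstring(description_xml)
    urls = []
    for element in root.iter():
        if local(element.tag) == "friends":
            urls.extend((c.text or "").strip()
                        for c in element if local(c.tag) == "baseURL")
    return urls

# Fragment modeled on the example above.
SAMPLE = """<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
  </friends>
</description>"""

print(friend_base_urls(SAMPLE))
```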
31 NASA <friends> example (1)
- A lightweight, DP-centric method to communicate
the existence of others
- http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
..
<description>
  <friends ..namespace stuff..>
    <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
    <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
    <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
    <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
  </friends>
</description>
..
32 NASA <friends> example (2)
33 branding
- introduction of branding container for DPs to
suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
                        http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
34 oai-identifier
- revision of oai-identifier
<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                          http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>
domain based repository names
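The identifier layout shown here (scheme, repository identifier, local identifier, joined by the delimiter) is easy to take apart. The sketch below is one way to do it. `OAIIdentifier` and `parse_oai_identifier` are hypothetical names, and the check is deliberately loose: it does not verify that the repository part is a domain name.

```python
from typing import NamedTuple

class OAIIdentifier(NamedTuple):
    scheme: str
    repository: str
    local_id: str

def parse_oai_identifier(identifier, delimiter=":"):
    """Split 'oai:<repository>:<local id>' into its three parts."""
    # Split at most twice so delimiters inside the local id are preserved.
    parts = identifier.split(delimiter, 2)
    if len(parts) != 3 or parts[0] != "oai" or not all(parts):
        raise ValueError(f"not an oai-identifier: {identifier!r}")
    return OAIIdentifier(*parts)

print(parse_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(parse_oai_identifier("oai:arXiv:cs/0112017"))
```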
35 did not make it into OAI-PMH v.2.0
- SOAP implementation
- Result set filtering
- Multiple / best metadata
- GetRecord -> GetRecords
- Machine readable rights management
- XML format for mini-archives
36 So What Does OAI-PMH Mean for Your Digital
Library?
- Resources on DL projects are typically spent in 2
areas
- creating & maintaining the collection
- data provider
- developing access services for the collection
(searching, browsing, etc.)
- service provider
- OAI-PMH allows for specialization based on
resources / interest
37 NACA Report 1345 as seen through its native
DL: http://naca.larc.nasa.gov/
38 NACA Report 1345 as seen through
MAGiC: http://www.magic.ac.uk/
39 NACA Report 1345 as seen through
Scirus (Elsevier): http://www.scirus.com/
40 NACA Report 1345 as seen through my.OAI (FS
Consulting): http://www.myoai.com/
41 Scientific Communication
- With only some exceptions, which interface is
used for discovery is not as important as the
fact that discovery occurred in the first place
- control of the discovered objects is not lost
by data providers
- however, higher level mirroring services can be
built on top of OAI (cf. NACA ARC mirroring
between NASA LaRC and MAGiC)
- The real power of OAI-PMH derives as much from
what it does not do as what it actually does
42 What Does OAI-PMH Mean for Authors?
- On the surface, absolutely nothing!
- the ideal OAI deployment should be absolutely
invisible to normal DL operations
- uninterested users should not even notice or care
- Indirectly, they should enjoy the benefits of the
critical mass of current and developing DL tools
& systems
- personal, institutional data providers
- proliferation of targeted, value-added service
providers
43 What Does OAI-PMH Mean For Editors?
- Absolutely everything
- The decoupling of SPs and DPs will have
significant and profound implications for
scientific and technical information exchange
- OAI-PMH is actually just one component in a
larger engineering effort for scholarly
communication (e.g. OpenURL)
- Service and resource integration will be the
focus of journals, professional societies,
universities, etc.
- OAI-PMH will be as basic a core technology for
scientific publishing as HTTP & XML
44 Field of Dreams
- It should be easy to be a data provider, even if
it makes more work for the service provider.
- if enough data providers exist, the service
providers will come (DPs >> SPs)
- Open-source / freely available tools
- drop-in data providers
- industrial strength: http://www.eprints.org/
- personal size: http://kepler.cs.odu.edu/
- tools to make your existing DL a data provider
- http://www.openarchives.org/tools/tools.htm
- also OAI-implementers mailing list / mail
archive!
- service providers
- Arc: http://sourceforge.net/projects/oaiarc/
45 OAI Observation: Front-End Only
- No input/registry mechanism
- OAI harvesting protocol is always a front-end for
something else
- filesystem, Dienst, RDBMS, LDAP, etc.
- convenient for pre-existing DLs, but does not
address new DLs
- e.g., "we want to do OAI"
- Bounds the scope of OAI
- responsibilities and domain of OAI are still being
discussed
- tension between functionality and simplicity
46 OAI Observation: No T&C
- Possible to use multiple OAI servers in a
DMZ-like configuration
[Figure: a Public OAI Server answers OAI requests from arbitrary hosts and a Private OAI Server answers OAI requests from trusted hosts; both sit in front of the source database. One could even use a separate copy of the database.]
47 OAI Observation: No T&C
- Possible to use OAI harvesting protocol in
closed, restricted systems
[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate from these 4 DLs.]
48 Metadata
- Q: Which format should I use?
- A: any/all of them
- lowest common denominator: unqualified Dublin
Core
- Again, little known about actual behavior
- will DC actually be useful? or too lossy?
- will communities create/adopt specific formats?
- will native (presumably richer) formats be
harvested?
49 The Future: Community Building
- Ultimately, protocols and metadata formats are
not what makes a difference
- Rather, the critical mass afforded by a common
set of utilities (cf. HTTP, Dublin Core, XML)
- The best current example: The Open Language
Archives Community
- http://www.language-archives.org
- OAI-PMH provides the basis for communication
between strangers, but allows even richer
communication between friends
50 http://www.openarchives.org openarchives_at_openarchives.org
51 Backup Slides
52 Detailed Review of the OAI-PMH 2.0 Verbs
53 Identify
1.1
- Arguments
- none
- Errors
- none
2.0
- Arguments
- none
- Errors
- badArgument
54 ListMetadataFormats
1.1
- Arguments
- identifier (OPTIONAL)
- Errors
- id does not exist
2.0
- Arguments
- identifier (OPTIONAL)
- Errors
- badArgument
- noMetadataFormats
- idDoesNotExist
55 ListSets
1.1
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- no set hierarchy
2.0
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- badArgument
- badResumptionToken
- noSetHierarchy
56 ListIdentifiers
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- Errors
- no records match
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- badResumptionToken
- noSetHierarchy
- noRecordsMatch
57 ListRecords
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- no records match
- metadata format cannot be disseminated
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- noRecordsMatch
- cannotDisseminateFormat
- badResumptionToken
- noSetHierarchy
- badArgument
58 GetRecord
1.1
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- id does not exist
- metadata format cannot be disseminated
2.0
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- idDoesNotExist
59 Argument Summary
Verb                  metadataPrefix  from      until     set       resumptionToken  identifier
Identify              -               -         -         -         -                -
ListMetadataFormats   -               -         -         -         -                optional
ListSets              -               -         -         -         exclusive        -
ListIdentifiers       required        optional  optional  optional  exclusive        -
ListRecords           required        optional  optional  optional  exclusive        -
GetRecord             required        -         -         -         -                required
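60 Error Summary
Verb                  Errors
Identify              BA
ListMetadataFormats   BA NMF IDDNE
ListSets              BA BRT NSH
ListIdentifiers       BA BRT CDF NRM NSH
ListRecords           BA BRT CDF NRM NSH
GetRecord             BA CDF IDDNE
BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6
defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH
specification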
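The argument and error summaries can be combined into a small request checker. This is a sketch under the argument rules listed on the 2.0 verb slides. `VERBS` and `check_request` are hypothetical names, and only badVerb and badArgument are modeled; repeated arguments and the content-dependent errors (e.g. noRecordsMatch) are out of scope.

```python
LIST_RULES = {"required": {"metadataPrefix"},
              "optional": {"from", "until", "set"},
              "exclusive": "resumptionToken"}

# Argument rules per verb, following the 2.0 columns of the verb slides.
VERBS = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"},
                            "exclusive": None},
    "ListSets": {"required": set(), "optional": set(),
                 "exclusive": "resumptionToken"},
    "ListIdentifiers": LIST_RULES,
    "ListRecords": LIST_RULES,
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set(), "exclusive": None},
}

def check_request(params):
    """Return 'badVerb', 'badArgument', or None when the arguments are acceptable."""
    verb = params.get("verb")
    if verb not in VERBS:
        return "badVerb"
    rules = VERBS[verb]
    given = set(params) - {"verb"}
    exclusive = rules["exclusive"]
    if exclusive in given:
        # An exclusive argument must be the only argument besides the verb.
        return None if given == {exclusive} else "badArgument"
    if not rules["required"] <= given:
        return "badArgument"
    if given - rules["required"] - rules["optional"]:
        return "badArgument"
    return None

print(check_request({"verb": "ShowMe"}))
print(check_request({"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "cs"}))
```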