Title: Tutorial OAI and OAI-PMH for Beginners An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
1Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
- Uwe Müller
- Humboldt University Berlin, Germany
- u.mueller_at_rz.hu-berlin.de
- Andy Powell
- UKOLN, University of Bath
- a.powell_at_ukoln.ac.uk
2Agenda
- Part I
- History and overview
- Part II
- Technical introduction
- Coffee/tea break
- Part III
- Implementation issues: data provider and service provider
- Part IV
- Implementation issues: XML schema and supporting multiple record formats
3Acknowledgements
- Some of the slides presented here are our own!
- Many of them have been kindly donated by (taken from!)
- Herbert Van de Sompel
- Carl Lagoze
- Michael Nelson
- Simeon Warner
- (and others probably!)
4Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
- Part I: History and overview
- Andy Powell
- UKOLN, University of Bath
- a.powell_at_ukoln.ac.uk
5OAI roots
- the roots of OAI lie in the development of eprint archives
- arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL
- each offered a Web interface for deposit of articles and for end-user searches
- difficult for end-users to work across archives without having to learn multiple different interfaces
- recognised need for a single search interface to all archives
- Universal Pre-print Service (UPS)
6Searching vs. harvesting
- two possible approaches to building the UPS
- cross-searching multiple archives based on a protocol like Z39.50
- harvesting metadata into one or more central services: bulk move data to the user-interface
- US digital library experience in this area (e.g. NCSTRL) indicated that cross-searching is not the preferred approach
- distributed searching of N nodes is viable, but only for small values of N
- NCSTRL: N > 100, bad
7Problems of cross-searching
- collection description
- how do you know which targets to search?
- query-language problem
- syntax varies and drifts over time between the various nodes
- rank-merging problem
- how do you meaningfully merge multiple result sets?
- performance
- tends to be limited by slowest target
- difficult to build browse interface
8Universal Preprint Service
- a cross-archive DL that provides services on a collection of metadata harvested from multiple archives
- based on NCSTRL; a modified version of Dienst
- demonstrated at Santa Fe NM, October 21-22, 1999
- http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
- http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
9RDN experience
- similar experience within the UK Resource Discovery Network (RDN)
- cross-searching of only 5 subject gateways
- problems with cross-searching approach
- performance
- central browse interface
- looking for metadata harvesting solution
10Data and service providers
- UPS identified two logical groups of services
- data providers
- handle deposit/publishing of resources in archive
- expose metadata about resources in archive
- service providers
- harvest metadata from data providers
- use it to offer single user-interface across all harvested metadata
- note
- data provider may also be responsible for human-oriented (i.e. Web) interface to archive
- both functions may be offered by same service
11Human vs. machine interfaces
- move away from only supporting human end-user interfaces for each archive
- to supporting both human end-user interface and machine interfaces for harvesting
[Diagram: providers, each with an input interface and a native end-user interface; a native harvesting interface is added alongside them]
12Service provider harvesting
[Diagram: a service provider with its own native end-user interface harvests from two data providers through their native harvesting interfaces; each data provider has an input interface, and its native end-user interface is optional (e.g., RePEc)]
13Metadata harvesting requirements
- in order that the harvesting approach can work, need agreements about
- transport protocols: HTTP vs. FTP vs. ...
- metadata formats: DC vs. MARC vs. ...
- quality assurance: mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
- intellectual property and usage rights: who can do what with the records
- work in this area resulted in the Santa Fe Convention
14Santa Fe Convention 02/2000
- goal: optimize discovery of e-prints
- inputs
- UPS prototype
- RePEc/SODA data provider / service provider model
- Dienst protocol
- deliberations at Santa Fe meeting 10/1999
15OAI-PMH v 1.0 01/2001
- goal: optimise discovery of document-like objects
- inputs
- Santa Fe Convention
- various DLF meetings on metadata harvesting
- deliberations at Cornell
- alpha-testers of OAI-PMH v 1.0
- recognition of DC as best core metadata format for interoperability across multiple archives
16OAI-PMH v 1.0 01/2001
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- focus on document-like objects
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- experimental: 12-18 months
17What's in a name?
Open Archives Initiative
18OAI timeline before v. 2.0
- October 21-22, 1999 - initial UPS meeting
- February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
- precursor to the OAI metadata harvesting protocol
- June 3, 2000 - workshop at ACM DL 2000 (Texas)
- August 25, 2000 - OAI steering committee formed, DLF/CNI support
- September 7-8, 2000 - technical meeting at Cornell University
- defined the core of the current OAI metadata harvesting protocol
- September 21, 2000 - workshop at ECDL 2000 (Portugal)
- November 1, 2000 - Alpha test group announced (15 organizations)
- January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
- purpose: freeze protocol for 12-16 months, generate critical mass
- February 26, 2001 - OAI Open Day in Europe (Berlin)
- July 3, 2001 - OAI protocol 1.1 announced
- to reflect changes in the W3C's latest XML Schema recommendation
- September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
19OAI-PMH v.2.0 06/2002
- goal: recurrent exchange of metadata about resources between systems
- inputs
- OAI-PMH v.1.0
- feedback on OAI-implementers
- deliberations by OAI-tech 09/01 - 06/02
- alpha test group of OAI-PMH v.2.0 03/02 - 06/02
- officially released June 14, 2002
20OAI-PMH v.2.0 06/2002
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- metadata about resources
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- stable
21Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
22Flexible deployment
- simple protocol based on HTTP and XML allows for rapid deployment
- a number of toolkits available: see part III
- systems can be deployed in variety of configurations
- multiple service providers can harvest from multiple data providers
- aggregators can sit between data and service providers
- harvesting approach can be complemented with searching based on Z39.50 or SRW
23Multiple data and service ps
Data providers
Harvesting based on OAI-PMH
Service providers
24Aggregators
Data providers
Aggregator
Service providers
25Can be mixed with x-searching
Data providers
Harvesting based on OAI-PMH
Searching based on Z39.50 or SRW
Service providers
26Summary
- OAI-PMH: OAI Protocol for Metadata Harvesting
- low-cost mechanism for harvesting metadata records from one system to another
- from data providers to service providers
- development over last 2-3 years has seen move from specific (discovery of e-prints) to generic (sharing descriptions of any resources)
- based on HTTP and XML: Web-friendly
- allows client to say "give me some or all of your records", where "some" is based on
- date-stamps, sets, metadata formats
27Summary (2)
- mandates simple DC as record format but extensible to any format encoded in XML
- OAI-PMH is not a search protocol
- but its use can underpin search-based services based on Z39.50 or SRW or ...
- metadata and full-text typically made freely available, but not a requirement
- OAI-PMH can be used between closed groups
- access-control and compression mechanisms based on underlying HTTP protocol
- simple protocol allows easy deployment
- systems can be combined in variety of ways
28Important resources
- OAI Web site
- http://www.openarchives.org/
- OAI-PMH specification
- http://www.openarchives.org/OAI/openarchivesprotocol.html
- Implementation guidelines
- http://www.openarchives.org/OAI/2.0/guidelines.htm
- Discussion lists
- http://www.openarchives.org/mailman/listinfo/oai-general
- http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
- Repository explorer
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
- Tools http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
29Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
- Part II: Technical Introduction
- Uwe Müller
- Humboldt University Berlin, Germany
- u.mueller_at_rz.hu-berlin.de
30Agenda
- Protocol Basics
- Protocol Details
- Request Types
- Examples
31The Open Archives Initiative (OAI)
- Main ideas
- world-wide consolidation of scholarly archives
- free access to the archives (at least metadata)
- consistent interfaces for archives and service provider
- low barrier protocol / effortless implementation
- based on existing standards (e.g. HTTP, XML, DC)
- Basic functioning
[Diagram: the harvester of a service provider sends requests (based on HTTP) to the repository of a data provider, which returns metadata (encoded in XML); the data provider holds metadata and documents, the service provider builds a service on the harvested metadata]
32OAI General Assumptions
- two groups of participants
- Data Providers (Open Archives, Repositories)
- free access to metadata
- not necessarily free access to full texts / resources
- easy to implement, low barriers
- Service Providers
- use OAI interfaces of the Data Providers
- harvest and store metadata (no live requests!)
- may select certain subsets from Data Providers (set hierarchy, date stamp)
- may enrich metadata
- offer (value-added) service on the basis of the metadata
33OAI-PMH Structure Model
[Diagram: the harvester of a service provider talks to the repositories of several data providers (e-prints, images, OPAC, museum, archive)]
- Requests: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord
- Responses: general information, metadata formats, set structure, record identifier, metadata
34OAI-PMH Protocol Overview
- protocol based on HTTP
- request arguments as GET or POST parameters
- six request types
- e.g. http://archive.org?verb=ListRecords&from=2002-11-01
- responses are encoded in XML syntax
- supports any metadata format (at least Dublin Core)
- logical set hierarchy (definition: data providers)
- date stamps (last change of metadata set)
- error messages
- flow control
35Agenda
- Protocol Basics
- Protocol Details
- Request Types
- Examples
36Protocol Details Definitions
- Harvester
- client application issuing OAI-PMH requests
- Repository
- network accessible server, able to process OAI-PMH requests correctly
- Resource
- object the metadata is about; the nature of resources is not defined in the OAI-PMH
- Item
- component of a repository from which metadata about a resource can be disseminated
- has a unique identifier
- Record
- metadata in a specific metadata format
- Identifier
- unique key for an item in a repository
- Set
- optional construct for grouping items in a repository
37Protocol Details Definitions (2)
[Diagram: an item, addressed by its item identifier, stands for all available metadata about a resource (here: "David"); Dublin Core, MARC and SPECTRUM metadata are records of that item]
38Protocol Details Records
- metadata of a resource in a specific format
- three parts
- header (mandatory)
- identifier (1)
- datestamp (1)
- setSpec elements (*)
- status attribute for deleted item (?)
- metadata (mandatory)
- XML encoded metadata with root tag, namespace
- repositories must support Dublin Core
- about (optional)
- rights statements
- provenance statements
39Protocol Details Datestamps
- date of last modification of a metadata set
- mandatory characteristic of every item
- two possible granularities: YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ
- function: information on metadata, selective harvesting (from and until arguments)
- applications: incremental update mechanisms
- modification, creation, deletion
- deletion: three support levels
- no, persistent, transient
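As a minimal sketch of the two granularities, the following Python helper renders a timestamp for use as a from or until argument. The function name format_datestamp and the constant names are illustrative assumptions, not part of the protocol.

```python
from datetime import datetime, timezone

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def format_datestamp(moment: datetime, granularity: str = DAY) -> str:
    """Render a datestamp in one of the two OAI-PMH granularities (always UTC)."""
    if moment.tzinfo is None:
        # Assumption of this sketch: naive datetimes are already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    if granularity == DAY:
        return moment.strftime("%Y-%m-%d")
    if granularity == SECONDS:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unknown granularity: {granularity}")
```

A harvester would pick the granularity that the repository announces in its Identify response and use the result for selective harvesting.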
40Protocol Details Metadata Schema
- OAI-PMH supports dissemination of multiple metadata formats from a repository
- properties of metadata formats
- id string to specify the format (metadataPrefix)
- metadata schema URL (XML schema to test validity)
- XML namespace URI (global identifier for metadata format)
- repositories must be able to disseminate unqualified Dublin Core
- arbitrary metadata formats can be defined and transported via the OAI-PMH
- returned metadata must comply with XML namespace specification
41Protocol Details Metadata Schema (2)
- minimum standard: unqualified Dublin Core
- http://dublincore.org/
- Dublin Core Metadata Element Set contains 15 elements
- elements are optional
- elements may be repeated
- The Dublin Core Metadata Element Set
Title Contributor Source
Creator Date Language
Subject Type Relation
Description Format Coverage
Publisher Identifier Rights
42Protocol Details Sets
- logical partitioning of repositories
- optional: archives do not have to define sets
- no recommendations
- not necessarily exhaustive
- not necessarily strictly hierarchical
- function: selective harvesting (set parameter)
- applications: subject gateways, dissertation search engine, ...
- examples (Germany, see http://www.dini.de)
- publication types (thesis, article, ...)
- document types (text, audio, image, ...)
- content sets, according to DNB (medicine, biology, ...)
43Protocol Details Request Format
- requests must be submitted using the GET or POST methods of HTTP
- repositories must support both methods
- at least one key=value pair: verb=RequestType
- additional key=value pairs depend on request type
- example for GET request: http://archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
- encoding of special characters, e.g. ":" (host port separator) becomes "%3A"
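The request format above can be sketched in a few lines of Python. The helper name build_request_url is hypothetical; urlencode from the standard library takes care of the key=value pairs and of the percent-encoding of special characters.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_request_url(base_url: str, verb: str,
                      arguments: Optional[Dict[str, str]] = None) -> str:
    """Build a GET request URL: the verb first, then the verb-specific arguments."""
    query = {"verb": verb}
    query.update(arguments or {})
    # urlencode joins the pairs with "&" and escapes reserved characters such as ":".
    return f"{base_url}?{urlencode(query)}"


url = build_request_url(
    "http://archive.org/oai",
    "GetRecord",
    {"identifier": "oai:HUBerlin.de:3000218", "metadataPrefix": "oai_dc"},
)
print(url)
```

The same dictionary could be sent as the body of a POST request, since repositories must support both methods.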
44Protocol Details Response
- formatted as HTTP responses
- content type must be text/xml
- status codes (distinguished from OAI-PMH errors), e.g. 302 (redirect), 503 (service not available)
- compression optional in OAI-PMH, only identity encoding is mandatory
- response format: well formed XML with markup
- XML declaration (<?xml version="1.0" encoding="UTF-8" ?>)
- root element named OAI-PMH with three attributes (xmlns, xmlns:xsi, xsi:schemaLocation)
- three child elements
- responseDate (UTC datetime)
- request (request that generated this response)
- a) error (in case of an error or exception condition) b) element with the name of the OAI-PMH request
45Protocol Details Flow Control
- four of the request types return a list of entries
- three of them may return large lists
- OAI-PMH supports partitioning
- decision on partitioning: repository
- response to a request includes
- incomplete list
- resumption token: expiration date, size of complete list, cursor (optional)
- new request with same request type
- resumption token as parameter
- all other parameters omitted!
- response includes
- next (maybe last) section of the list
- resumption token (empty if last section of list enclosed)
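Seen from the harvester, the flow control described above is a simple loop. The sketch below is written in Python under one assumption: fetch_page is a hypothetical callable that sends the request parameters to a repository and returns the entries of one response together with its resumption token (None or empty on the last section).

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# One response: (entries, resumption token); the token is None or "" on the last page.
Page = Tuple[List[str], Optional[str]]


def harvest_all(fetch_page: Callable[[Dict[str, str]], Page],
                verb: str, arguments: Dict[str, str]) -> Iterator[str]:
    """Yield every entry of a possibly partitioned list."""
    params = {"verb": verb, **arguments}
    while True:
        entries, token = fetch_page(params)
        yield from entries
        if not token:
            break  # empty token: the last section of the list was enclosed
        # Follow-up request: same request type, token only, all other parameters omitted.
        params = {"verb": verb, "resumptionToken": token}
```

Passing the transport in as a function keeps the loop independent of any particular HTTP library.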
46Protocol Details Flow Control (2)
[Diagram: example of a partitioned list exchange between the harvester (service provider) and the repository (data provider)]
47Protocol Details Errors and Exceptions
- repositories must indicate OAI-PMH errors
- inclusion of one or more error elements
- defined error identifiers
- badArgument
- badResumptionToken
- badVerb
- cannotDisseminateFormat
- idDoesNotExist
- noRecordsMatch
- noMetadataFormats
- noSetHierarchy
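A harvester can detect these conditions by looking for error elements directly below the OAI-PMH root element. The following Python sketch uses the standard library XML parser; the function name extract_errors and the sample response are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

SAMPLE_ERROR_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-05T15:00:00Z</responseDate>
  <request>http://archive.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>
"""


def extract_errors(response_xml: str) -> List[Tuple[str, str]]:
    """Return (code, message) pairs for every error element in a response."""
    root = ET.fromstring(response_xml)
    return [(error.get("code"), (error.text or "").strip())
            for error in root.findall(f"{OAI_NS}error")]


print(extract_errors(SAMPLE_ERROR_RESPONSE))
```

An empty result means the response carries the element named after the request instead of an error.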
48Agenda
- Protocol Basics
- Protocol Details
- Request Types
- Examples
49Request Types
- six different request types
- Identify
- ListMetadataFormats
- ListSets
- ListIdentifiers
- ListRecords
- GetRecord
- a harvester does not have to use all types
- repository must implement all types
- required and optional arguments
- depend on request types
50Request Type Identify
- function: description of an archive
- example: archive.org/oai-script?verb=Identify
- parameters: none
- errors / exceptions: badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology
51Request Type Identify (2)
Element | Example | Occurrence
repositoryName | My Archive | 1
baseURL | http://archive.org/oai | 1
protocolVersion | 2.0 | 1
earliestDatestamp | 1999-01-01 | 1
deletedRecord | no, transient, persistent | 1
granularity | YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ | 1
adminEmail | oai-admin_at_archive.org | 1 or more
compression | deflate, compress, ... | 0 or more
description | oai-identifier, eprints, friends, ... | 0 or more
52Request Type ListMetadataFormats
- function: retrieve available metadata formats from archive
- example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218
- parameters: identifier (optional)
- errors / exceptions: badArgument; idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier; noMetadataFormats
53Request Type ListSets
- function: retrieve set structure of a repository
- example: archive.org/oai-script?verb=ListSets
- parameters: resumptionToken (exclusive)
- errors / exceptions: badArgument; badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token; noSetHierarchy
54Request Type ListIdentifiers
- function: abbreviated form of ListRecords, retrieving only headers
- example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
- parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive)
- errors / exceptions: badArgument, e.g. from=2002-12-01-13:45:00; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy
55Request Type ListRecords
- function: harvest records from a repository
- example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
- parameters: from (optional), until (optional), metadataPrefix (required), set (optional), resumptionToken (exclusive)
- errors / exceptions: badArgument; badResumptionToken; cannotDisseminateFormat; noRecordsMatch; noSetHierarchy
56Request Type GetRecord
- function: retrieve individual metadata record from a repository
- example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc
- parameters: identifier (required), metadataPrefix (required)
- errors / exceptions: badArgument; cannotDisseminateFormat; idDoesNotExist
57Agenda
- Protocol Basics
- Protocol Details
- Request Types
- Examples
58Example: http://edoc.hu-berlin.de/OAI-2.0?verb=ListIdentifiers&from=2002-01-06&until=2002-01-08&metadataPrefix=oai_dc&set=doctypes:dissertations
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-10-22T17:49:49+01:00</responseDate>
  <request verb="ListIdentifiers" from="2002-01-03" until="2002-01-08" metadataPrefix="oai_dc" set="doctypes:dissertations">http://edoc.hu-berlin.de/OAI-2.0</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:HUBerlin.de:3000819</identifier>
      <datestamp>2002-01-08</datestamp>
      <setSpec>doctypes</setSpec>
      <setSpec>doctypes:dissertations</setSpec>
      <setSpec>dnb</setSpec>
      <setSpec>dnb:dnb33</setSpec>
    </header>
    <header>
      <identifier>oai:HUBerlin.de:3000831</identifier>
      <datestamp>2002-01-07</datestamp>
      <setSpec>doctypes</setSpec>
      <setSpec>doctypes:dissertations</setSpec>
      <setSpec>dnb</setSpec>
      <setSpec>dnb:dnb27</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
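The headers of a ListIdentifiers response such as the example above could be read with the Python standard library as follows. This is a sketch only: the function name parse_headers, the dictionary layout and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

SAMPLE_LIST_IDENTIFIERS = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:HUBerlin.de:3000819</identifier>
      <datestamp>2002-01-08</datestamp>
      <setSpec>doctypes</setSpec>
      <setSpec>doctypes:dissertations</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def parse_headers(response_xml: str) -> List[Dict[str, object]]:
    """Extract identifier, datestamp, set membership and deletion status per header."""
    root = ET.fromstring(response_xml)
    headers = []
    for header in root.iter(f"{OAI_NS}header"):
        headers.append({
            "identifier": header.findtext(f"{OAI_NS}identifier"),
            "datestamp": header.findtext(f"{OAI_NS}datestamp"),
            "sets": [spec.text for spec in header.findall(f"{OAI_NS}setSpec")],
            # The optional status attribute marks a deleted item.
            "deleted": header.get("status") == "deleted",
        })
    return headers


print(parse_headers(SAMPLE_LIST_IDENTIFIERS))
```

Because ListRecords and GetRecord responses contain the same header element inside each record, the same function also finds the headers there.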
59Example: http://edoc.hu-berlin.de/OAI-2.0?verb=GetRecord&identifier=oai:HUBerlin:3000819&metadataPrefix=oai_dc
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-11-27T14:57:01+01:00</responseDate>
  <request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:HUBerlin.de:3000819">http://edoc.hu-berlin.de/OAI-2.0</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:HUBerlin.de:3000819</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                                       http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
60Technical Introduction Questions?
- OAI official site
- http://www.openarchives.org/
- protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html
- general mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/
- implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/
61Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
- Part III: Implementation Issues
- Data Provider and Service Provider
- Uwe Müller
- Humboldt University Berlin, Germany
- u.mueller_at_rz.hu-berlin.de
62Agenda
- General Considerations
- Data Provider
- Service Provider
63General First Questions
- Data Provider
- Which data do I want to deliver?
- Which service providers do I want to provide with data?
- Service Provider
- Which service do I want to provide?
- From which data providers do I get the metadata?
- In which way do the metadata have to be processed?
- Data Provider and Service Provider
- Which aspects do we have to agree upon?
64General Metadata Formats / Sets
- required: unqualified Dublin Core
- special subjects / communities: other metadata specifications may be required
- describe resources in a specialised way
- definition of an XML schema (publicly available for validation)
- define set hierarchy
- sensible partitioning for selective harvesting
- agreement between data providers and between data and service providers
65General Organisational Structure
- aggregated data providers
- if harvested by a service provider, sub data providers should not be harvested by same SP (duplication ...)
- subject gateways
- selective harvesting if corresponding sets have been defined and implemented
66Agenda
- General Considerations
- Data Provider
- Service Provider
67Data Provider Prerequisites
- metadata on resources (items)
- should be stored in (SQL) database
- possible in case of need: file system
- unique identifier for each item
- web server, accessible via the internet
- e.g. Apache, IIS
- programming interface / API
- e.g. Perl, PHP, Java-Servlet
- web server extension
- access to database (or filesystem)
- not needed: session management
68Data Provider Prerequisites (2)
- archive identifier / base URL
- unique identifier for items
- metadata format (at least unqualified Dublin Core)
- datestamps for metadata (created / last modified)
- logical set hierarchy (may have)
- agreement within (subject) communities
- flow control / implementation of resumption token (optional, larger archives should have that)
69Data Provider Architecture
OAI request (HTTP request)
70Data Provider General Structure
- Argument Parser
- validates OAI requests
- Error Generator
- creates XML responses with encoded error messages
- Database Query / Local Metadata Extraction
- retrieves metadata from repository
- according to the required metadata format
- XML Generator / Response Creation
- creates XML responses with encoded metadata information
- Flow Control
- realises incomplete list sequences for larger repositories
- uses resumption token as mechanism
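The argument parser is the component that decides between badVerb, badArgument and a valid request. A minimal sketch in Python follows, using the required, optional and exclusive arguments listed on the request-type slides. The names ARGUMENTS, TOKEN_VERBS and validate_request are hypothetical, and repeated arguments are not modelled because the parameters arrive as a plain dictionary.

```python
from typing import Dict, Optional

# Required and optional arguments per request type, as listed on the request-type slides.
ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"}},
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
}
# Request types that accept resumptionToken as an exclusive argument.
TOKEN_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def validate_request(params: Dict[str, str]) -> Optional[str]:
    """Return an OAI-PMH error code, or None if the arguments are acceptable."""
    verb = params.get("verb")
    if verb not in ARGUMENTS:
        return "badVerb"
    given = set(params) - {"verb"}
    if "resumptionToken" in given:
        # Exclusive: allowed only for list requests and only on its own.
        if verb in TOKEN_VERBS and given == {"resumptionToken"}:
            return None
        return "badArgument"
    spec = ARGUMENTS[verb]
    if not spec["required"] <= given:
        return "badArgument"  # a required argument is missing
    if given - spec["required"] - spec["optional"]:
        return "badArgument"  # an argument that this verb does not know
    return None
```

Checks that need the repository itself, such as idDoesNotExist or cannotDisseminateFormat, would follow in the database query component.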
71Data Provider Flow Chart
- verb, metadataPrefix, resumptionToken: OAI arguments
- rows: size of the result list
- 100: here maximal list size for responses
[Flow chart: handling of an incoming HTTP request]
72Data Provider Resumption Token
- should be implemented for large lists
- initiated by data provider
- store parameters (set, from, ...) and number of already delivered records
- properties
- expiration: expirationDate (optional)
- completeListSize (optional)
- already delivered records: cursor (optional)
- recovery from network errors (possibility to re-issue most recent resumption token)
- problem
- database changes
- two possible solutions
- duplicate data in a request table
- store date of first request with the other parameters and use it like an additional until argument
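The second solution can be sketched in Python as follows. All names here (list_identifiers_page, the in-memory token table, the state layout) are hypothetical. One deliberate deviation from the slide: instead of a plain count of delivered records, the sketch stores the last delivered (datestamp, identifier) key, so that a record modified during the harvest does not shift the remaining window.

```python
import secrets
from typing import Dict, List, Optional, Tuple

PAGE_SIZE = 100  # maximal list size per response

# Hypothetical in-memory token table; a real repository would persist and expire entries.
_token_table: Dict[str, dict] = {}


def list_identifiers_page(
    records: List[Tuple[str, str]],          # (identifier, datestamp) pairs
    request_date: str,                        # datestamp of the incoming request
    resumption_token: Optional[str] = None,
    page_size: int = PAGE_SIZE,
) -> Tuple[List[str], Optional[str]]:
    """Return one section of the list and the token for the next one (None if last)."""
    if resumption_token is None:
        state = {"snapshot": request_date, "last_key": None}
    else:
        # An unknown token raises KeyError; a repository would answer badResumptionToken.
        state = _token_table[resumption_token]

    # The stored date of the first request acts like an additional until argument.
    keys = sorted((stamp, ident) for ident, stamp in records
                  if stamp <= state["snapshot"])
    if state["last_key"] is not None:
        keys = [key for key in keys if key > state["last_key"]]

    page = keys[:page_size]
    identifiers = [ident for _, ident in page]
    if len(keys) <= page_size:
        return identifiers, None

    token = secrets.token_hex(8)
    _token_table[token] = {"snapshot": state["snapshot"], "last_key": page[-1]}
    return identifiers, token
```

Token entries are never modified after creation, so a harvester can re-issue its most recent resumption token after a network error and receive the same section again.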
73Data Provider Resumption Token (2)
[Diagram: example of a resumption token exchange between the harvester (service provider) and the repository (data provider)]
74Data Provider Resumption Token (3)
Example (2)
[Diagram: the data provider keeps a table of issued resumption tokens next to the repository database, e.g. token anyID1: from=empty, until=empty, set=empty, mdP=oai_dc, date=2002-12-05T15:00:00Z, delivered=100]
75Data Provider Data Representation
- use recommended data representation
- dates
- recommended: 2002-12-05
- avoid: 2002-xx-xx, 2002, 05.12.2002
- language code
- recommended: eng, ger, ...
- avoid: en, de, english, german
- multi values: use own XML element for each entity
- author
- recommended: <dc:creator>Smith, Adam</dc:creator><dc:creator>Nash, John</dc:creator>
- avoid: <dc:creator>Smith, Adam; Nash, John</dc:creator>
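The multi-value rule, one XML element per entity, could look like this in a Python response generator. The helper name add_creators and the bare metadata parent element are assumptions for illustration; the Dublin Core namespace is the one used in the examples of Part II.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialise with the usual "dc" prefix


def add_creators(parent: ET.Element, creators: Iterable[str]) -> None:
    """Append one dc:creator element per author instead of one joined string."""
    for name in creators:
        ET.SubElement(parent, f"{{{DC_NS}}}creator").text = name


metadata = ET.Element("metadata")
add_creators(metadata, ["Smith, Adam", "Nash, John"])
print(ET.tostring(metadata, encoding="unicode"))
```

Keeping the authors in a list until serialisation makes it hard to fall back to a single concatenated element by accident.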
76Data Provider Compression
- method to reduce traffic and enhance performance
- optional for both sides: data and service providers
- handled on HTTP level
- harvesters may include an Accept-Encoding header in their requests specifying preferences
- harvesters without Accept-Encoding header always receive uncompressed data
- repositories must support HTTP identity encoding
- repositories should specify supported encodings by including compression elements in the Identify response
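On the harvester side, a compressed response has to be decoded according to the Content-Encoding header of the HTTP response before the XML is parsed. The Python sketch below covers identity, gzip and deflate with the standard library; the function name decode_body is hypothetical, and the compress encoding is left out because the standard library has no decoder for it.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "identity") -> bytes:
    """Undo the HTTP content encoding of a response body."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body  # mandatory encoding: data is sent as-is
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)  # HTTP "deflate" is zlib-wrapped data
    raise ValueError(f"unsupported content encoding: {content_encoding}")
```

A harvester using this helper would announce its preferences with a request header such as Accept-Encoding: gzip, deflate.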
77Data Provider Test and Registration
- create own OAI-PMH requests and send to OAI interface: check results
- use the Repository Explorer (VT University)
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/
- provide arguments via HTML forms
- responses are validated
- browsing to other requests
- automatic conformance tester
- official registration site
- http://www.openarchives.org/data/registerasprovider.html
- provide base URL
- extensive conformance test (incl. error conditions ...)
- information on incorrect behaviour
- in case of conformance: added to the official list
- regular checks
78Agenda
- General Considerations
- Data Provider
- Service Provider
79Service Provider Examples
- Repository Explorer
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/
- search engines / subject gateways
- Cross Archive Searching Service http://arc.cs.odu.edu/
- MyOAI http://www.myoai.org/
- DINI http://edoc.hu-berlin.de/oaisearch/
- Physnet http://physnet.uni-oldenburg.de/oai/query.php
- internal communication
- ProPrint http://edoc.hu-berlin.de/proprint/
- library compounds
80Service Provider Prerequisites
- internet connected server
- database system (relational or XML)
- programming environment
- can issue HTTP requests to web servers
- can issue database requests
- XML parser
81Service Provider Structure (1)
- Archive Management
- selection of archives to be harvested
- enter entries manually or
- automatically add / remove archives using the official registry
- Request Component
- creates HTTP requests and sends them to OAI archives (data provider)
- demands metadata using the allowed verbs of the OAI-PMH
- possibly selective harvesting (set parameter)
82Service Provider Structure (2)
- Scheduler
- realises timed and regular retrieval of the associated archives
- simplest case: manual initiation of the jobs
- else: e.g. cron job
- Flow Control
- resumption token: partitioning of the result list into incomplete sections, new request to retrieve more results
- HTTP error 503 (service not available): analysis of response to extract retry-after period
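Extracting the retry-after period from a 503 response might be sketched in Python as below. The function name retry_delay, the default of 60 seconds and the header dictionary are assumptions; the Retry-After header itself is standard HTTP and carries either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def retry_delay(status: int, headers: Dict[str, str],
                now: Optional[datetime] = None,
                default: float = 60.0) -> Optional[float]:
    """Seconds to wait before repeating a request, or None if no retry is needed."""
    if status != 503:
        return None
    value = headers.get("Retry-After", "").strip()
    if value.isdigit():
        return float(value)  # delay given in seconds
    try:
        when = parsedate_to_datetime(value)  # delay given as an HTTP date
    except (TypeError, ValueError):
        return default  # header missing or unreadable
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

The scheduler can then sleep for the returned number of seconds and re-issue the same request.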
83Service Provider Structure (3)
- Update Mechanism
- realises consolidation of metadata which have been harvested earlier (merge old and new data)
- easiest case: always delete all old metadata of an archive before harvesting it
- reasonable: incremental update (from parameter), insert new metadata and overwrite changed / deleted metadata (assignment using the unique identifiers)
- XML Parser
- analyses the responses received from the archives
- validation using the XML schema
- transforms the metadata encoded in XML into the internal data structure
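The incremental update can be sketched in Python with a dictionary standing in for the local database. The names apply_harvest and store, and the record layout with identifier, datestamp and deleted keys, are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional


def apply_harvest(store: Dict[str, dict], harvested: Iterable[dict]) -> Optional[str]:
    """Merge harvested records into the local store, keyed by the unique identifier.

    Returns the latest datestamp seen, which can serve as the from argument
    of the next incremental harvest.
    """
    latest = None
    for record in harvested:
        identifier = record["identifier"]
        if record.get("deleted"):
            store.pop(identifier, None)   # remove metadata of deleted items
        else:
            store[identifier] = record    # insert new or overwrite changed metadata
        if latest is None or record["datestamp"] > latest:
            latest = record["datestamp"]
    return latest
```

Comparing datestamps as strings works here because both OAI-PMH granularities sort chronologically when the granularity is the same for all records.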
84Service Provider Structure (4)
- Normaliser
- transforms data into a homogeneous structure (different metadata formats)
- harmonises representation (e.g. date, author, language code)
- maps / translates different languages
- Database
- mapping the XML structure of the metadata into a relational database (multi values ...)
- or use an XML database
85Service Provider Structure (5)
- Duplication Checker
- merges identical records from different data providers
- possibility: unique identifier for the item (e.g. URN, ...)
- but often not easily practicable and not risk / error free
- Service Module
- provides the actual service to the public
- basis: harvested and stored records of the associated archives
- uses only local database for requests etc.
86Service Provider Architecture
[Diagram: architecture of an OAI service provider with the components harvester, scheduler, flow control, XML parser, normaliser, duplication checker, update mechanism, database and service module; it harvests several data providers and is used by users and an administrator]
87Service Provider Resumption Token
- optional from the data provider's point of view
- but mandatory for service providers
- for complete lists: resume sequences of incomplete lists
- recognise that response contains incomplete list
- re-issue OAI request to data provider in order to get next part of the list
88Service Provider Test and Registration
- harvest registered (and therefore OAI compliant!) data providers
- test behaviour of service provider
- official registration site
- http://www.openarchives.org/service/registerasprovider.html
- provide institutional information
- web site, email address, ...
89Data Service Provider Questions?
90Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting
- Part IV: Implementation issues - XML schemas and support for multiple record formats
- Andy Powell
- UKOLN, University of Bath
- a.powell_at_ukoln.ac.uk
91Agenda
- basics
- XML schema details
- extending oai_dc for your application
- using IMS metadata as new record format
92Basics
- OAI-PMH uses XML Schemas to define record formats
- you can exchange any data you like using OAI-PMH as long as you can encode it as XML and define an XML Schema for it!
- OAI-PMH mandates the oai_dc XML schema
- OAI-PMH documentation also describes use of XML schema to exchange
- rfc1807: a schema for rfc1807 format metadata
- marc21: a recommended schema for MARC21 metadata, provided by the Library of Congress
- oai_marc: a schema for MARC format metadata
93A closer look at oai_dc
- the simple DC schema used as mandatory record format in OAI-PMH defines a container schema
- container schema is OAI-specific
- container schema is hosted on the OAI Web site
- imports a generic DCMES schema
- generic DCMES schema is hosted on the DCMI Web site
- same model likely to be used for qualified DC schema: container schema hosted by OAI, generic schema hosted by DCMI
94An oai_dc record
- an example oai_dc record (viewed via the repository explorer)
- here's the full GetRecord response
- three important things to notice
- namespace for the oai_dc format
- xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
- namespace for DCMES elements
- xmlns:dc="http://purl.org/dc/elements/1.1/"
- container schema associated with the oai_dc namespace
- xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
95The XML schemas
- The oai_dc container schema
- http://www.openarchives.org/OAI/2.0/oai_dc.xsd
- imports DCMES schema from
- http://dublincore.org/schemas/xmls/simpledc20020312.xsd
- defines a container element called dc
- lists the allowed elements within the dc container (from the DCMES namespace/schema above)
96 When oai_dc isn't enough
- when the 15 DCMES elements are too limited, e.g.
adding extra metadata elements
- when you need greater precision in your metadata
records, e.g. adding encoding schemes to
existing elements
- when you want to exchange other metadata formats
- IMS/IEEE LOM: eLearning metadata
- ODRL: Open Digital Rights Language
97 Extending the oai_dc schema
- simple scenario
- RDN currently uses oai_dc schema to exchange
records but wants to add one additional element
called
- accessControl
- note: this is not a real scenario
- RDN really wants to use qualified DC records,
but doing qualified DC is too complicated for this
tutorial!
- hope to write up RDN work on exchanging qualified
DC in a future issue of Ariadne
98 Step 1: metadata format name
- the new metadata format needs a name
- in this case, we've chosen
- rdn_dc
- following OAI's naming of oai_dc
- alternative possibilities
- rdndc
- rdn
- etc.
99 Step 2: create namespaces
- two namespaces are required
- namespace for the rdn_dc format
- http://www.rdn.ac.uk/oai/rdn_dc/
- namespace for the new metadata elements
(properties) that we are going to use in this
format
- http://purl.org/rdn/terms/
- note
- use of PURL for the elements namespace follows
DCMI usage but is not mandatory
- however, both these namespace URIs should be
under your control to ensure uniqueness and
prevent re-use in the future
- URIs do not need to resolve to anything
100 Step 3: local copy of DC schema
- make local copy of the DCMES schema
- in this case the copy is at
- http://www.rdn.ac.uk/oai/rdn_dc/20021204/dc.xsd
- this step isn't strictly necessary
- in fact it is probably bad practice to do this
- but there are currently some minor problems with
the DCMI-hosted copy of the schema
- working with local copy is easier
101 Step 4: schema for new terms
- create an XML schema for the new rdnterms
- in this case the schema is available at
- http://www.rdn.ac.uk/oai/rdn_dc/20021204/rdnterms.xsd
- the schema defines the new element/property
- accessControl
- and adds it to the dc:any group
- also creates a new container type
- rdnterms:elementContainer
- note
- schema URI contains a date-stamp
- this should make future enhancements to the
schema easier to implement
102 Step 5: container schema
- create a container schema for the new record
format
- in this case the schema is available at
- http://www.rdn.ac.uk/oai/rdn_dc/20021204/rdn_dc.xsd
- this simply imports the rdnterms schema
- then defines a container element called rdndc
of type
- rdnterms:elementContainer
- again, the schema URI contains a date-stamp
103 Step 6: validate, validate, val
- create some test records using your new schemas
- http://www.rdn.ac.uk/oai/rdn_dc/20021204/test.xml
- http://www.rdn.ac.uk/oai/rdn_dc/20021204/oai-test.xml
- use the XML schema validator at
- http://www.w3.org/2001/03/webdata/xsv
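Before submitting a test record to a schema validator, a quick local pre-check can catch the most basic mistakes. The sketch below uses only the Python standard library, so it tests well-formedness and namespace usage, not schema validity. It assumes the container element is rdndc in the rdn_dc namespace and that accessControl lives in the rdnterms namespace, as described in the steps above. The prefixes, the record content, the value `public` and the function name `precheck` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from Step 2, plus the DCMES namespace.
RDN_DC_NS = "http://www.rdn.ac.uk/oai/rdn_dc/"
RDNTERMS_NS = "http://purl.org/rdn/terms/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical test record; the element values are made up.
TEST_RECORD = """\
<rdn_dc:rdndc
    xmlns:rdn_dc="http://www.rdn.ac.uk/oai/rdn_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdnterms="http://purl.org/rdn/terms/">
  <dc:title>A test resource</dc:title>
  <rdnterms:accessControl>public</rdnterms:accessControl>
</rdn_dc:rdndc>
"""


def precheck(xml_text):
    """Return a list of problems; an empty list means the pre-check passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    problems = []
    if root.tag != "{%s}rdndc" % RDN_DC_NS:
        problems.append("unexpected container: %s" % root.tag)
    # Child elements must come from the DCMES or the rdnterms namespace.
    allowed = ("{%s}" % DC_NS, "{%s}" % RDNTERMS_NS)
    for child in root:
        if not child.tag.startswith(allowed):
            problems.append("unexpected element: %s" % child.tag)
    return problems


print(precheck(TEST_RECORD))
```

A record that passes this pre-check still needs to go through a real XML schema validator, since the standard library does not check element order, types or cardinality.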
104 Step 7: ListMetadataFormats
- add information about the new format to your
repository's response to the ListMetadataFormats
request
<metadataFormat>
  <metadataPrefix>rdn_dc</metadataPrefix>
  <schema>http://www.rdn.ac.uk/oai/rdn_dc/20021113/rdn_dc.xsd</schema>
  <metadataNamespace>http://www.rdn.ac.uk/oai/rdn_dc/</metadataNamespace>
</metadataFormat>
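A repository that keeps its supported formats in a simple table can generate this part of the response from it. The following is a minimal sketch in Python. The `FORMATS` table and the function names are assumptions made for illustration, and the rdn_dc schema URL is copied from the example above. The sketch leaves out the OAI-PMH response envelope and the protocol namespace that a real response carries.

```python
import xml.etree.ElementTree as ET

# (metadataPrefix, schema URL, namespace URI) for each supported format.
FORMATS = [
    ("oai_dc",
     "http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
     "http://www.openarchives.org/OAI/2.0/oai_dc/"),
    ("rdn_dc",
     "http://www.rdn.ac.uk/oai/rdn_dc/20021113/rdn_dc.xsd",
     "http://www.rdn.ac.uk/oai/rdn_dc/"),
]


def metadata_format(prefix, schema, namespace):
    """Build one metadataFormat element."""
    fmt = ET.Element("metadataFormat")
    ET.SubElement(fmt, "metadataPrefix").text = prefix
    ET.SubElement(fmt, "schema").text = schema
    ET.SubElement(fmt, "metadataNamespace").text = namespace
    return fmt


def list_metadata_formats(formats):
    """Build the ListMetadataFormats element for all supported formats."""
    body = ET.Element("ListMetadataFormats")
    for prefix, schema, namespace in formats:
        body.append(metadata_format(prefix, schema, namespace))
    return body


xml_text = ET.tostring(list_metadata_formats(FORMATS), encoding="unicode")
print(xml_text)
```

Keeping the formats in one table means that adding a further format later only requires a new row.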
105 Step 8: other verbs
- modify your repository's response to the
ListSets, ListIdentifiers, ListRecords and
GetRecord requests
- accept metadataPrefix set to new format name
rdn_dc
- return records formatted according to the new
schema(s)
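One way to organise this is a table that maps each metadataPrefix to a function producing the record in that format. The sketch below is an illustration under assumptions, not a description of any particular repository: the record is a plain dictionary, the field names `title` and `access_control` are invented, and only a single DC element is emitted. In OAI-PMH 2.0 the metadataPrefix argument is taken by ListIdentifiers, ListRecords and GetRecord, and `cannotDisseminateFormat` is the protocol's error code for an unsupported prefix.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
RDN_DC_NS = "http://www.rdn.ac.uk/oai/rdn_dc/"
RDNTERMS_NS = "http://purl.org/rdn/terms/"


class OAIError(Exception):
    """Carries an OAI-PMH error code alongside the message."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def _add(parent, namespace, name, value):
    ET.SubElement(parent, "{%s}%s" % (namespace, name)).text = value


def to_oai_dc(record):
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    _add(root, DC_NS, "title", record["title"])
    return root


def to_rdn_dc(record):
    # Same DC content, plus the additional accessControl element.
    root = ET.Element("{%s}rdndc" % RDN_DC_NS)
    _add(root, DC_NS, "title", record["title"])
    _add(root, RDNTERMS_NS, "accessControl", record["access_control"])
    return root


SERIALIZERS = {"oai_dc": to_oai_dc, "rdn_dc": to_rdn_dc}


def disseminate(record, metadata_prefix):
    """Return the record's metadata element in the requested format."""
    try:
        serializer = SERIALIZERS[metadata_prefix]
    except KeyError:
        raise OAIError("cannotDisseminateFormat",
                       "unsupported metadataPrefix: %s" % metadata_prefix)
    return serializer(record)


# Hypothetical record, for illustration only.
record = {"title": "A test resource", "access_control": "public"}
print(ET.tostring(disseminate(record, "rdn_dc"), encoding="unicode"))
```

Because the existing oai_dc serializer is untouched, harvesters that only ask for oai_dc see no change.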
106 Step 9: validate again
- use the Repository Explorer to check that
- all requests work with new metadataPrefix
- oai_dc format still works!
- appropriate records are returned for each format
- responses validate correctly
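Alongside the Repository Explorer, it can help to have the list of request URLs to try for each format. The following sketch only builds the URLs and sends nothing. The base URL and the identifier are hypothetical, and the verbs covered are those that take a metadataPrefix argument in OAI-PMH 2.0.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL and record identifier.
BASE_URL = "http://www.example.org/oai"
IDENTIFIER = "oai:example.org:1"


def request_urls(base_url, prefixes, identifier):
    """List the requests worth checking for every supported format."""
    urls = [base_url + "?" + urlencode({"verb": "ListMetadataFormats"})]
    for prefix in prefixes:
        urls.append(base_url + "?" + urlencode(
            {"verb": "ListIdentifiers", "metadataPrefix": prefix}))
        urls.append(base_url + "?" + urlencode(
            {"verb": "ListRecords", "metadataPrefix": prefix}))
        urls.append(base_url + "?" + urlencode(
            {"verb": "GetRecord", "identifier": identifier,
             "metadataPrefix": prefix}))
    return urls


urls = request_urls(BASE_URL, ["oai_dc", "rdn_dc"], IDENTIFIER)
for url in urls:
    print(url)
```

Including oai_dc in the list is a reminder to confirm that the mandatory format still works after the change.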
107 Summary
- decide on name for your new metadata format and
appropriate namespaces
- develop XML schemas for container and new
elements if appropriate
- create test records and validate
- modify your repository (source code and/or
configuration files) to support the new format
- validate and test repository
108 Other record formats
- can take similar approach with other metadata
record formats
- IMS/IEEE LOM
- ODRL
- in these cases, XML schemas and namespaces have
already been agreed
- deployment of these formats should be easier
because you don't need to define your own
schemas
- BUT XML schema specs are currently undergoing
continual revision, so it is sometimes hard for
applications like IMS to keep up!
109 Adding support for IMS
- modify ListMetadataFormats response to include
<metadataFormat>
  <metadataPrefix>ims</metadataPrefix>
  <schema>http://www.imsglobal.org/xsd/imsmd_v1p2p2.xsd</schema>
  <metadataNamespace>http://www.imsglobal.org/xsd/imsmd_v1p2</metadataNamespace>
</metadataFormat>
- extend ListSets, ListIdentifiers,
ListRecords and GetRecord requests
- accept metadataPrefix set to ims and return
records formatted appropriately
111 Summary
- during today's tutorial we hope that you have
- gained an overview of the history behind the
OAI-PMH and an overview of its key features
- been given a deeper technical insight into how
the protocol works
- learned something about some of the main
implementation issues
- found some useful starting points and hints that
will help you as implementors
112 Questions
- now
- feel free to tell us what you didn't understand
- and ask general questions (of course!)
Uwe Müller, Humboldt University Berlin,
Germany, u.mueller_at_rz.hu-berlin.de
Andy Powell, UKOLN, University of Bath,
a.powell_at_ukoln.ac.uk