Tutorial OAI and OAI-PMH for Beginners: An introduction to the Open Archives Initiative and the Protocol for Metadata Harvesting

Learn more at: https://www.oaforum.org
1
Tutorial OAI and OAI-PMH for Beginners: An
introduction to the Open Archives Initiative and
the Protocol for Metadata Harvesting
  • Uwe Müller
  • Humboldt University Berlin, Germany
  • u.mueller_at_rz.hu-berlin.de
  • Andy Powell
  • UKOLN, University of Bath
  • a.powell_at_ukoln.ac.uk

2
Agenda
  • Part I: History and overview
  • Part II: Technical introduction
  • Coffee/tea break
  • Part III: Implementation issues: data provider and
    service provider
  • Part IV: Implementation issues: XML schema and
    supporting multiple record formats

3
Acknowledgements
  • Some of the slides presented here are our own!
  • Many of them have been kindly donated by (taken
    from!)
  • Herbert Van de Sompel
  • Carl Lagoze
  • Michael Nelson
  • Simeon Warner
  • (and others probably!)

4
Tutorial OAI and OAI-PMH for Beginners: An
introduction to the Open Archives Initiative and
the Protocol for Metadata Harvesting
  • Part I: History and overview
  • Andy Powell
  • UKOLN, University of Bath
  • a.powell_at_ukoln.ac.uk

5
OAI roots
  • the roots of OAI lie in the development of eprint
    archives
  • arXiv, CogPrints, NACA (NASA), RePEc, NDLTD,
    NCSTRL
  • each offered Web interface for deposit of
    articles and for end-user searches
  • difficult for end-users to work across archives
    without having to learn multiple different
    interfaces
  • recognised need for single search interface to
    all archives
  • Universal Pre-print Service (UPS)

6
Searching vs. harvesting
  • two possible approaches to building the UPS
  • cross-searching multiple archives based on
    protocol like Z39.50
  • harvesting metadata into one or more central
    services: bulk move data to the user interface
  • US digital library experience in this area (e.g.
    NCSTRL) indicated that cross-searching is not the
    preferred approach: distributed searching of N
    nodes is viable, but only for small values of N
  • NCSTRL: N > 100 is bad

7
Problems of cross-searching
  • collection description
  • how do you know which targets to search?
  • query-language problem
  • syntax varies and drifts over time between the
    various nodes
  • rank-merging problem
  • how do you meaningfully merge multiple result
    sets?
  • performance
  • tends to be limited by slowest target
  • difficult to build browse interface

8
Universal Preprint Service
  • a cross-archive DL that provides services on
    a collection of metadata harvested from multiple
    archives
  • based on NCSTRL+, a modified version of Dienst
  • demonstrated at Santa Fe, NM, October 21-22, 1999
  • http://ups.cs.odu.edu/
  • D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
  • UPS was soon renamed the Open Archives Initiative
    (OAI): http://www.openarchives.org/

9
RDN experience
  • similar experience within the UK Resource
    Discovery Network (RDN)
  • cross-searching of only 5 subject gateways
  • problems with cross-searching approach
  • performance
  • central browse interface
  • looking for metadata harvesting solution

10
Data and service providers
  • UPS identified two logical groups of services
  • data providers
  • handle deposit/publishing of resources in archive
  • expose metadata about resources in archive
  • service providers
  • harvest metadata from data providers
  • use it to offer single user-interface across all
    harvested metadata
  • note:
  • data provider may also be responsible for
    human-oriented (i.e. Web) interface to archive
  • both functions may be offered by same service

11
Human vs. machine interfaces
  • move away from only supporting human end-user
    interfaces for each archive
  • to supporting both human end-user interface and
    machine interfaces for harvesting

[Diagram: a provider with only an input interface and a native end-user interface, next to a provider that also offers a native harvesting interface]
12
Service provider harvesting
[Diagram: a service provider with its own native end-user interface harvests from the native harvesting interfaces of two data providers; each data provider has an input interface, and its native end-user interface is optional (e.g., RePEc)]
13
Metadata harvesting requirements
  • for the harvesting approach to work, agreements
    are needed about
  • transport protocols: HTTP vs. FTP vs. ...
  • metadata formats: DC vs. MARC vs. ...
  • quality assurance: mandatory elements,
    mechanisms for naming of people, subjects, etc.,
    handling duplicated records, best practice
  • intellectual property and usage rights: who can
    do what with the records
  • work in this area resulted in the Santa Fe
    Convention

14
Santa Fe Convention (02/2000)
  • goal: optimize discovery of e-prints
  • inputs:
  • UPS prototype
  • RePEc/SODA data provider / service provider
    model
  • Dienst protocol
  • deliberations at Santa Fe meeting (10/1999)

15
OAI-PMH v.1.0 (01/2001)
  • goal: optimise discovery of document-like objects
  • inputs:
  • Santa Fe Convention
  • various DLF meetings on metadata harvesting
  • deliberations at Cornell
  • alpha-testers of OAI-PMH v.1.0
  • recognition of DC as best core metadata format
    for interoperability across multiple archives

16
OAI-PMH v.1.0 (01/2001)
  • low-barrier interoperability specification
  • metadata harvesting model: data provider /
    service provider
  • focus on document-like objects
  • autonomous protocol
  • HTTP based
  • XML responses
  • unqualified Dublin Core
  • experimental: 12-18 months

17
What's in a name?
Open Archives Initiative
18
OAI timeline before v. 2.0
  • October 21-22, 1999 - initial UPS meeting
  • February 15, 2000 - Santa Fe Convention published
    in D-Lib Magazine
  • precursor to the OAI metadata harvesting protocol
  • June 3, 2000 - workshop at ACM DL 2000 (Texas)
  • August 25, 2000 - OAI steering committee formed,
    DLF/CNI support
  • September 7-8, 2000 - technical meeting at
    Cornell University
  • defined the core of the current OAI metadata
    harvesting protocol
  • September 21, 2000 - workshop at ECDL 2000
    (Portugal)
  • November 1, 2000 - Alpha test group announced
    (15 organizations)
  • January 23, 2001 - OAI protocol 1.0 announced,
    OAI Open Day in the U.S. (Washington DC)
  • purpose: freeze protocol for 12-16 months,
    generate critical mass
  • February 26, 2001 - OAI Open Day in Europe
    (Berlin)
  • July 3, 2001 - OAI protocol 1.1 announced
  • to reflect changes in the W3C's latest XML Schema
    recommendation
  • September 8, 2001 - workshop at ECDL 2001
    (Darmstadt)

19
OAI-PMH v.2.0 (06/2002)
  • goal: recurrent exchange of metadata about
    resources between systems
  • inputs:
  • OAI-PMH v.1.0
  • feedback on the OAI-implementers list
  • deliberations by OAI-tech (09/01 - 06/02)
  • alpha test group of OAI-PMH v.2.0 (03/02 -
    06/02)
  • officially released June 14, 2002

20
OAI-PMH v.2.0 (06/2002)
  • low-barrier interoperability specification
  • metadata harvesting model: data provider /
    service provider
  • metadata about resources
  • autonomous protocol
  • HTTP based
  • XML responses
  • unqualified Dublin Core
  • stable

21
Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
22
Flexible deployment
  • simple protocol based on HTTP and XML allows for
    rapid deployment
  • a number of toolkits are available (see Part III)
  • systems can be deployed in variety of
    configurations
  • multiple service providers can harvest from
    multiple data providers
  • aggregators can sit between data and service
    providers
  • harvesting approach can be complemented with
    searching based on Z39.50 or SRW

23
Multiple data and service providers
[Diagram: multiple data providers feeding multiple service providers through harvesting based on OAI-PMH]
24
Aggregators
[Diagram: an aggregator sitting between the data providers and the service providers]
25
Can be mixed with x-searching
[Diagram: service providers reach the data providers both by harvesting based on OAI-PMH and by searching based on Z39.50 or SRW]
26
Summary
  • OAI-PMH: OAI Protocol for Metadata Harvesting
  • low-cost mechanism for harvesting metadata
    records from one system to another
  • from data providers to service providers
  • development over last 2-3 years has seen move
    from specific (discovery of e-prints) to generic
    (sharing descriptions of any resources)
  • based on HTTP and XML: Web-friendly
  • allows client to say "give me some or all of your
    records", where "some" is based on
  • date-stamps, sets, metadata formats

27
Summary (2)
  • mandates simple DC as record format but
    extensible to any format encoded in XML
  • OAI-PMH is not a search protocol
  • but its use can underpin search-based services
    based on Z39.50 or SRW or ...
  • metadata and full-text typically made freely
    available but not a requirement
  • OAI-PMH can be used between closed groups
  • access-control and compression mechanisms based
    on underlying HTTP protocol
  • simple protocol allows easy deployment
  • systems can be combined in variety of ways

28
Important resources
  • OAI Web site
  • http://www.openarchives.org/
  • OAI-PMH specification
  • http://www.openarchives.org/OAI/openarchivesprotocol.html
  • Implementation guidelines
  • http://www.openarchives.org/OAI/2.0/guidelines.htm
  • Discussion lists
  • http://www.openarchives.org/mailman/listinfo/oai-general
  • http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
  • Repository explorer
  • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
  • Tools: http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai

29
Tutorial OAI and OAI-PMH for Beginners: An
introduction to the Open Archives Initiative and
the Protocol for Metadata Harvesting
  • Part II: Technical Introduction
  • Uwe Müller
  • Humboldt University Berlin, Germany
  • u.mueller_at_rz.hu-berlin.de

30
Agenda
  1. Protocol Basics
  2. Protocol Details
  3. Request Types
  4. Examples

31
The Open Archives Initiative (OAI)
  • Main ideas
  • world-wide consolidation of scholarly archives
  • free access to the archives (at least metadata)
  • consistent interfaces for archives and service
    providers
  • low barrier protocol / effortless implementation
  • based on existing standards (e.g. HTTP, XML, DC)
  • Basic functioning

[Diagram: the harvester of the service provider sends requests (based on HTTP) to the repository of the data provider, which holds metadata about documents; the repository returns metadata (encoded in XML), on which the service provider builds its service]
32
OAI General Assumptions
  • two groups of participants
  • Data Providers (Open Archives, Repositories)
  • free access to metadata
  • not necessarily free access to full texts /
    resources
  • easy to implement, low barriers
  • Service Providers
  • use OAI interfaces of the Data Providers
  • harvest and store metadata (no live requests!)
  • may select certain subsets from Data
    Providers (set hierarchy, date stamp)
  • may enrich metadata
  • offer (value-added) service on the basis of the
    metadata

33
OAI-PMH Structure Model
[Diagram: the harvester of a service provider sends requests (Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord) to the repositories of several data providers (e-prints, images, OPAC, museum, archive) and receives responses (general information, metadata formats, set structure, record identifier, metadata)]
34
OAI-PMH Protocol Overview
  • protocol based on HTTP
  • request arguments as GET or POST parameters
  • six request types
  • e.g. http://archive.org?verb=ListRecords&from=2002-11-01
  • responses are encoded in XML syntax
  • supports any metadata format (at least Dublin
    Core)
  • logical set hierarchy (definition: data
    providers)
  • date stamps (last change of metadata set)
  • error messages
  • flow control

35
Agenda
  1. Protocol Basics
  2. Protocol Details
  3. Request Types
  4. Examples

36
Protocol Details: Definitions
  • Harvester
  • client application issuing OAI-PMH requests
  • Repository
  • network accessible server, able to process
    OAI-PMH requests correctly
  • Resource
  • object the metadata is about, nature of
    resources is not defined in the OAI-PMH
  • Item
  • component of a repository from which metadata
    about a resource can be disseminated
  • has a unique identifier
  • Record
  • metadata in a specific metadata format
  • Identifier
  • unique key for an item in a repository
  • Set
  • optional construct for grouping items in a
    repository

37
Protocol Details: Definitions (2)
[Diagram: an item, addressed by its item identifier, comprises all available metadata about a resource ("David"); its records are the Dublin Core, MARC and SPECTRUM metadata]
38
Protocol Details: Records
  • metadata of a resource in a specific format
  • three parts
  • header (mandatory)
  • identifier (exactly 1)
  • datestamp (exactly 1)
  • setSpec elements (0 or more)
  • status attribute for deleted item (optional)
  • metadata (mandatory)
  • XML encoded metadata with root tag, namespace
  • repositories must support Dublin Core
  • about (optional)
  • rights statements
  • provenance statements

39
Protocol Details: Datestamps
  • date of last modification of a metadata set
  • mandatory characteristic of every item
  • two possible granularities: YYYY-MM-DD,
    YYYY-MM-DDThh:mm:ssZ
  • function: information on metadata, selective
    harvesting (from and until arguments)
  • applications: incremental update mechanisms
  • modification, creation, deletion
  • deletion: three support levels
  • no, persistent, transient
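
The two datestamp granularities can be handled with a small helper. This is a minimal sketch, not part of the original slides: the function names `parse_datestamp` and `format_datestamp` are made up for illustration, and `strptime` is slightly more lenient than the protocol (it accepts single-digit months, for instance).

```python
from datetime import datetime, timezone

DAY = "%Y-%m-%d"
SECONDS = "%Y-%m-%dT%H:%M:%SZ"

def parse_datestamp(value):
    """Parse a datestamp in either granularity; return (datetime, granularity)."""
    for fmt, name in ((SECONDS, "YYYY-MM-DDThh:mm:ssZ"), (DAY, "YYYY-MM-DD")):
        try:
            # Datestamps are expressed in UTC, so attach that timezone.
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc), name
        except ValueError:
            continue
    raise ValueError(f"not a valid datestamp: {value!r}")

def format_datestamp(moment, seconds=False):
    """Format an aware datetime as a from/until argument value."""
    return moment.astimezone(timezone.utc).strftime(SECONDS if seconds else DAY)

moment, granularity = parse_datestamp("2002-12-05T15:00:00Z")
print(granularity, format_datestamp(moment))
```

A harvester would typically keep the datestamp of its last successful harvest and pass it as the `from` argument of the next one, in the granularity the repository supports.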

40
Protocol Details: Metadata Schema
  • OAI-PMH supports dissemination of multiple
    metadata formats from a repository
  • properties of metadata formats
  • id string to specify the format (metadataPrefix)
  • metadata schema URL (XML schema to test validity)
  • XML namespace URI (global identifier for metadata
    format)
  • repositories must be able to disseminate
    unqualified Dublin Core
  • arbitrary metadata formats can be defined and
    transported via the OAI-PMH
  • returned metadata must comply with XML namespace
    specification

41
Protocol Details: Metadata Schema (2)
  • minimum standard: unqualified Dublin Core
  • http://dublincore.org/
  • Dublin Core Metadata Element Set contains 15
    elements
  • elements are optional
  • elements may be repeated
  • The Dublin Core Metadata Element Set

Title        Contributor  Source
Creator      Date         Language
Subject      Type         Relation
Description  Format       Coverage
Publisher    Identifier   Rights
42
Protocol Details: Sets
  • logical partitioning of repositories
  • optional: archives do not have to define sets
  • no recommendations
  • not necessarily exhaustive
  • not necessarily strictly hierarchical
  • function: selective harvesting (set parameter)
  • applications: subject gateways, dissertation
    search engine, ...
  • examples (Germany, see http://www.dini.de)
  • publication types (thesis, article, ...)
  • document types (text, audio, image, ...)
  • content sets, according to DNB (medicine,
    biology, ...)

43
Protocol Details: Request Format
  • requests must be submitted using the GET or POST
    methods of HTTP
  • repositories must support both methods
  • at least one key=value pair: verb=RequestType
  • additional key=value pairs depend on request type
  • example for GET request:
    http://archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
  • encoding of special characters, e.g. ":" (host
    port separator) becomes "%3A"
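
The key=value structure and the percent-encoding described on this slide can be sketched with the Python standard library. The helper name `build_request_url` is hypothetical; the base URL is the example one used on the slides.

```python
from urllib.parse import urlencode

def build_request_url(base_url, verb, **arguments):
    """Return a GET URL: the verb plus any further key=value pairs."""
    params = {"verb": verb}
    # Leave out arguments that were not supplied (None).
    params.update({k: v for k, v in arguments.items() if v is not None})
    # urlencode percent-encodes reserved characters, e.g. ":" becomes "%3A".
    return base_url + "?" + urlencode(params)

url = build_request_url("http://archive.org/oai", "ListRecords", metadataPrefix="oai_dc")
print(url)
```

Because `from` is a reserved word in Python, that argument has to be passed as `**{"from": "2002-11-01"}` when using this sketch.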

44
Protocol Details: Response
  • formatted as HTTP responses
  • content type must be text/xml
  • status codes (distinguished from OAI-PMH
    errors), e.g. 302 (redirect), 503 (service not
    available)
  • compression: optional in OAI-PMH, only identity
    encoding is mandatory
  • response format: well-formed XML with markup
  • XML declaration (<?xml version="1.0"
    encoding="UTF-8" ?>)
  • root element named OAI-PMH with three
    attributes (xmlns, xmlns:xsi, xsi:schemaLocation)
  • three child elements
  • responseDate (UTC datetime)
  • request (request that generated this response)
  • a) error (in case of an error or exception
    condition) or b) element with the name of the
    OAI-PMH request
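
A harvester can take this envelope apart with any XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the function name `parse_envelope` and the sample response are illustrative, and the `code` attribute on the error element follows the OAI-PMH 2.0 specification rather than anything shown on the slide.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_envelope(xml_data):
    """Split a response into responseDate, request, errors and payload element."""
    root = ET.fromstring(xml_data)
    if root.tag != OAI_NS + "OAI-PMH":
        raise ValueError("root element is not OAI-PMH")
    envelope = {
        "response_date": root.findtext(OAI_NS + "responseDate"),
        "request": root.findtext(OAI_NS + "request"),
        "errors": [],
        "payload": None,
    }
    for child in root:
        if child.tag == OAI_NS + "error":
            envelope["errors"].append((child.get("code"), (child.text or "").strip()))
        elif child.tag not in (OAI_NS + "responseDate", OAI_NS + "request"):
            # The remaining child is named after the request type.
            envelope["payload"] = child
    return envelope

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-10-22T16:49:49Z</responseDate>
  <request>http://archive.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

envelope = parse_envelope(SAMPLE)
print(envelope["errors"])
```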

45
Protocol Details: Flow Control
  • four of the request types return a list of
    entries
  • three of them may return large lists
  • OAI-PMH supports partitioning
  • decision on partitioning: repository
  • response to a request includes
  • incomplete list
  • resumption token, with expiration date, size of
    complete list, cursor (all optional)
  • new request with same request type
  • resumption token as parameter
  • all other parameters omitted!
  • response includes
  • next (maybe last) section of the list
  • resumption token (empty if last section of list
    enclosed)
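
From the harvester's side, this flow control amounts to a loop that re-issues the request with only the resumption token until an empty token (or none) comes back. The following is a sketch under assumptions: `fetch` is a hypothetical callable that takes the request parameters and returns the XML response, so the loop can be tried without a network.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_identifiers(fetch, metadata_prefix, **selectors):
    """Yield the identifiers from every section of a ListIdentifiers list."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix, **selectors}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: this was the last section
        # Follow-up request: same verb, the token, and nothing else.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
```

Passing `fetch` in as a parameter keeps the loop independent of the HTTP library; in a real harvester it would wrap the actual request to the repository.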

46
Protocol Details: Flow Control (2)
[Diagram: example exchange between the harvester of the service provider and the repository of the data provider]
47
Protocol Details: Errors and Exceptions
  • repositories must indicate OAI-PMH errors
  • inclusion of one or more error elements
  • defined error identifiers
  • badArgument
  • badResumptionToken
  • badVerb
  • cannotDisseminateFormat
  • idDoesNotExist
  • noRecordsMatch
  • noMetadataFormats
  • noSetHierarchy
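
On the repository side, an error response is an ordinary OAI-PMH envelope whose third child is an error element. The sketch below is an illustration, not the original authors' code: `error_response` is a made-up name, the `code` attribute follows the OAI-PMH 2.0 specification, and the `xmlns:xsi` / `xsi:schemaLocation` attributes are left out for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ERROR_CODES = {
    "badArgument", "badResumptionToken", "badVerb", "cannotDisseminateFormat",
    "idDoesNotExist", "noRecordsMatch", "noMetadataFormats", "noSetHierarchy",
}

def error_response(base_url, code, message):
    """Build a minimal error response as an XML string."""
    if code not in ERROR_CODES:
        raise ValueError(f"unknown OAI-PMH error code: {code}")
    root = ET.Element("OAI-PMH", {"xmlns": "http://www.openarchives.org/OAI/2.0/"})
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.SubElement(root, "responseDate").text = now
    ET.SubElement(root, "request").text = base_url
    ET.SubElement(root, "error", {"code": code}).text = message
    return ET.tostring(root, encoding="unicode")

print(error_response("http://archive.org/oai", "badVerb", "Illegal OAI verb"))
```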

48
Agenda
  1. Protocol Basics
  2. Protocol Details
  3. Request Types
  4. Examples

49
Request Types
  • six different request types
  • Identify
  • ListMetadataFormats
  • ListSets
  • ListIdentifiers
  • ListRecords
  • GetRecord
  • harvester does not have to use all types
  • repository must implement all types
  • required and optional arguments
  • depend on request types

50
Request Type: Identify
  • function: description of an archive
  • example: archive.org/oai-script?verb=Identify
  • parameters: none
  • errors / exceptions: badArgument, e.g.
    archive.org/oai-script?verb=Identify&set=biology

51
Request Type: Identify (2)
  • response format

Element            Example                                Occurrence
repositoryName     My Archive                             1
baseURL            http://archive.org/oai                 1
protocolVersion    2.0                                    1
earliestDatestamp  1999-01-01                             1
deletedRecord      no, transient, persistent              1
granularity        YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ       1
adminEmail         oai-admin_at_archive.org               1 or more
compression        deflate, compress, ...                 0 or more
description        oai-identifier, eprints, friends, ...  0 or more
52
Request Type: ListMetadataFormats
  • function: retrieve available metadata formats from
    archive
  • example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218
  • parameters: identifier (optional)
  • errors / exceptions: badArgument; idDoesNotExist,
    e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier;
    noMetadataFormats

53
Request Type: ListSets
  • function: retrieve set structure of a repository
  • example: archive.org/oai-script?verb=ListSets
  • parameters: resumptionToken (exclusive)
  • errors / exceptions: badArgument; badResumptionToken,
    e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token;
    noSetHierarchy

54
Request Type: ListIdentifiers
  • function: abbreviated form of ListRecords,
    retrieving only headers
  • example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
  • parameters: from (optional), until (optional),
    metadataPrefix (required), set (optional),
    resumptionToken (exclusive)
  • errors / exceptions: badArgument, e.g.
    from=2002-12-01-13:45:00; badResumptionToken;
    cannotDisseminateFormat; noRecordsMatch; noSetHierarchy

55
Request Type: ListRecords
  • function: harvest records from a repository
  • example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
  • parameters: from (optional), until (optional),
    metadataPrefix (required), set (optional),
    resumptionToken (exclusive)
  • errors / exceptions: badArgument; badResumptionToken;
    cannotDisseminateFormat; noRecordsMatch; noSetHierarchy

56
Request Type: GetRecord
  • function: retrieve individual metadata record from
    a repository
  • example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc
  • parameters: identifier (required), metadataPrefix
    (required)
  • errors / exceptions: badArgument;
    cannotDisseminateFormat; idDoesNotExist
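
The required, optional and exclusive arguments of the six request types can be written down as a table and checked mechanically. This is a sketch of what an argument parser might do, assuming the request arrives as a dictionary of parameters; `ARGUMENTS` and `check_arguments` are illustrative names, and the check covers only argument names, not their values.

```python
LIST_ARGS = {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"},
             "exclusive": "resumptionToken"}

ARGUMENTS = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}, "exclusive": None},
    "ListSets": {"required": set(), "optional": set(), "exclusive": "resumptionToken"},
    "ListIdentifiers": LIST_ARGS,
    "ListRecords": LIST_ARGS,
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set(),
                  "exclusive": None},
}

def check_arguments(params):
    """Return None if the argument names are legal, else 'badVerb' or 'badArgument'."""
    verb = params.get("verb")
    if verb not in ARGUMENTS:
        return "badVerb"
    spec = ARGUMENTS[verb]
    given = set(params) - {"verb"}
    if spec["exclusive"] in given:
        # An exclusive argument must be the only one besides the verb.
        return None if given == {spec["exclusive"]} else "badArgument"
    if not spec["required"] <= given:
        return "badArgument"
    if given - spec["required"] - spec["optional"]:
        return "badArgument"
    return None

print(check_arguments({"verb": "Identify", "set": "biology"}))
```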

57
Agenda
  1. Protocol Basics
  2. Protocol Details
  3. Request Types
  4. Examples

58
Example: http://edoc.hu-berlin.de/OAI-2.0?verb=ListIdentifiers&from=2002-01-06&until=2002-01-08&metadataPrefix=oai_dc&set=doctypes:dissertations

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-10-22T17:49:49+01:00</responseDate>
  <request verb="ListIdentifiers" from="2002-01-03"
           until="2002-01-08" metadataPrefix="oai_dc"
           set="doctypes:dissertations">http://edoc.hu-berlin.de/OAI-2.0</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:HUBerlin.de:3000819</identifier>
      <datestamp>2002-01-08</datestamp>
      <setSpec>doctypes</setSpec>
      <setSpec>doctypes:dissertations</setSpec>
      <setSpec>dnb</setSpec>
      <setSpec>dnb:dnb33</setSpec>
    </header>
    <header>
      <identifier>oai:HUBerlin.de:3000831</identifier>
      <datestamp>2002-01-07</datestamp>
      <setSpec>doctypes</setSpec>
      <setSpec>doctypes:dissertations</setSpec>
      <setSpec>dnb</setSpec>
      <setSpec>dnb:dnb27</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
59
Example: http://edoc.hu-berlin.de/OAI-2.0?verb=GetRecord&identifier=oai:HUBerlin.de:3000819&metadataPrefix=oai_dc

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-11-27T14:57:01+01:00</responseDate>
  <request verb="GetRecord" metadataPrefix="oai_dc"
           identifier="oai:HUBerlin.de:3000819">http://edoc.hu-berlin.de/OAI-2.0</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:HUBerlin.de:3000819</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                                       http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          ...

60
Technical Introduction: Questions?
  • OAI official site
  • http://www.openarchives.org/
  • protocol specification:
    http://www.openarchives.org/OAI/openarchivesprotocol.html
  • general mailing list:
    http://www.openarchives.org/mailman/listinfo/OAI-general/
  • implementers mailing list:
    http://www.openarchives.org/mailman/listinfo/OAI-implementers/

61
Tutorial OAI and OAI-PMH for Beginners: An
introduction to the Open Archives Initiative and
the Protocol for Metadata Harvesting
  • Part III: Implementation Issues
  • Data Provider and Service Provider
  • Uwe Müller
  • Humboldt University Berlin, Germany
  • u.mueller_at_rz.hu-berlin.de

62
Agenda
  • General Considerations
  • Data Provider
  • Service Provider

63
General: First Questions
  • Data Provider
  • Which data do I want to deliver?
  • Which service providers do I want to provide with
    data?
  • Service Provider
  • Which service do I want to provide?
  • From which data providers do I get the metadata?
  • In which way do the metadata have to be processed?
  • Data Provider and Service Provider
  • Which aspects do we have to agree upon?

64
General: Metadata Formats / Sets
  • required: unqualified Dublin Core
  • special subjects / communities: other metadata
    specifications may be required
  • describe resources in a specialised way
  • definition of an XML schema (publicly available
    for validation)
  • define set hierarchy
  • sensible partitioning for selective harvesting
  • agreement between data providers and between data
    and service providers

65
General: Organisational Structure
  • aggregated data providers
  • if harvested by a service provider, sub data
    providers should not be harvested by same SP
    (duplication ...)
  • subject gateways
  • selective harvesting if corresponding sets have
    been defined and implemented

66
Agenda
  • General Considerations
  • Data Provider
  • Service Provider

67
Data Provider: Prerequisites
  • metadata on resources (items)
  • should be stored in (SQL) database
  • if need be, storage in the file system is possible
  • unique identifier for each item
  • web server, accessible via the internet
  • e.g. Apache, IIS
  • programming interface / API
  • e.g. Perl, PHP, Java Servlet
  • web server extension
  • access to database (or file system)
  • not needed: session management

68
Data Provider: Prerequisites (2)
  • archive identifier / base URL
  • unique identifier for items
  • metadata format (at least unqualified Dublin
    Core)
  • datestamps for metadata (created / last modified)
  • logical set hierarchy (may have)
  • agreement within (subject) communities
  • flow control / implementation of resumption token
    (optional, larger archives should have that)

69
Data Provider: Architecture
[Diagram: an OAI request (HTTP request) entering the data provider]
70
Data Provider: General Structure
  • Argument Parser
  • validates OAI requests
  • Error Generator
  • creates XML responses with encoded error messages
  • Database Query / Local Metadata Extraction
  • retrieves metadata from repository
  • according to the required metadata format
  • XML Generator / Response Creation
  • creates XML responses with encoded metadata
    information
  • Flow Control
  • realises incomplete list sequences for larger
    repositories
  • uses resumption token as mechanism

71
Data Provider: Flow Chart
  • verb, metadataPrefix, resumptionToken: OAI
    arguments
  • rows: size of the result list
  • 100: here, the maximal list size for responses

[Flow chart: handling of an HTTP request; visible labels: HTTP request, metadataPrefix]
72
Data Provider: Resumption Token
  • should be implemented for large lists
  • initiated by data provider
  • store parameters (set, from, ...) and number of
    already delivered records
  • properties
  • expiration: expirationDate (optional)
  • completeListSize (optional)
  • already delivered records: cursor (optional)
  • recovery from network errors (possibility to
    re-issue most recent resumption token)
  • problem
  • database changes
  • two possible solutions
  • duplicate data in a request table
  • store date of first request with the other
    parameters → use like an additional until argument

73
Data Provider: Resumption Token (2)
[Diagram: example exchange between the harvester of the service provider and the repository of the data provider]
74
Data Provider: Resumption Token (3)
Example (2)
[Diagram: the repository of the data provider stores the state for each issued token in its database, e.g.
anyID1: from=empty, until=empty, set=empty, mdP=oai_dc, date=2002-12-05T15:00:00Z, delivered=100]
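
The second solution (store the date of the first request and treat it as an additional until argument) can be sketched as follows. Everything here is hypothetical scaffolding: `TokenStore`, `list_page` and the in-memory dictionary stand in for the request table, and records are reduced to (identifier, datestamp) pairs with datestamps in a single granularity. The sketch protects against records added during a harvest; a record modified in the meantime would still drop out of the frozen list and shift the pages.

```python
import itertools

class TokenStore:
    """Keeps the state behind each issued resumption token (in memory only)."""

    def __init__(self):
        self._states = {}
        self._ids = itertools.count(1)

    def issue(self, selectors, delivered, date):
        token = f"anyID{next(self._ids)}"
        self._states[token] = {"selectors": selectors, "delivered": delivered, "date": date}
        return token

    def resume(self, token):
        try:
            return self._states[token]
        except KeyError:
            raise ValueError("badResumptionToken") from None

def list_page(records, state, page_size=100):
    """Return (next section, is_last) for a list of (identifier, datestamp) pairs."""
    # The date of the first request acts as an extra `until` bound.
    frozen = [r for r in records if r[1] <= state["date"]]
    start = state["delivered"]
    page = frozen[start:start + page_size]
    state["delivered"] = start + len(page)
    return page, state["delivered"] >= len(frozen)
```

With 150 matching records and a page size of 100, the first call returns 100 records and the second the remaining 50, regardless of records that were added after the first request.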
75
Data Provider: Data Representation
  • use recommended data representation
  • dates
  • recommended: 2002-12-05
  • not recommended: 2002-xx-xx, 2002, 05.12.2002
  • language code
  • recommended: eng, ger, ...
  • not recommended: en, de, english, german
  • multi values: use own XML element for each entity
  • author
  • recommended: <dc:creator>Smith, Adam</dc:creator><dc:creator>Nash, John</dc:creator>
  • not recommended: <dc:creator>Smith, Adam; Nash, John</dc:creator>
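
These conventions are easy to enforce when the metadata is generated. The sketch below builds Dublin Core elements with one `dc:creator` per author, an ISO date and a three-letter language code; the function name `dc_fields`, the wrapping `record` element and the small `LANGUAGE_CODES` table are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Illustrative mapping only; a real table would cover many more languages.
LANGUAGE_CODES = {"en": "eng", "english": "eng", "de": "ger", "german": "ger"}

def dc_fields(creators, issued, language):
    """Return an element holding one dc:creator per author, dc:date and dc:language."""
    fields = [("creator", name) for name in creators]
    fields.append(("date", issued.isoformat()))  # YYYY-MM-DD
    fields.append(("language", LANGUAGE_CODES.get(language.lower(), language)))
    root = ET.Element("record")
    for tag, text in fields:
        ET.SubElement(root, f"{{{DC_NS}}}{tag}").text = text
    return root

record = dc_fields(["Smith, Adam", "Nash, John"], date(2002, 12, 5), "german")
print(ET.tostring(record, encoding="unicode"))
```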

76
Data Provider: Compression
  • method to reduce traffic and enhance performance
  • optional for both sides: data and service
    providers
  • handled on HTTP level
  • harvesters may include an Accept-Encoding header
    in their requests specifying preferences
  • harvesters without Accept-Encoding header always
    receive uncompressed data
  • repositories must support HTTP identity encoding
  • repositories should specify supported encodings
    by including compression elements in the Identify
    response
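
On the harvester side this comes down to sending an Accept-Encoding header and undoing whatever Content-Encoding the repository chose. A minimal sketch, with hypothetical helper names and assuming that "deflate" bodies use the zlib format:

```python
import gzip
import zlib

def request_headers(preferred=("gzip", "deflate")):
    """Headers a harvester could send to state its encoding preferences."""
    return {"Accept-Encoding": ", ".join(preferred)}

def decode_body(body, content_encoding=None):
    """Undo the Content-Encoding of a response body (bytes in, bytes out)."""
    encoding = (content_encoding or "identity").lower()
    if encoding == "identity":
        return body  # uncompressed: the mandatory baseline
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)
    raise ValueError(f"unsupported content encoding: {encoding}")

print(request_headers())
```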

77
Data Provider: Test and Registration
  • create own OAI-PMH requests and send them to the
    OAI interface: check results
  • use the Repository Explorer (VT University)
  • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/
  • provide arguments via HTML forms
  • responses are validated
  • browsing to other requests
  • automatic conformance tester
  • official registration site
  • http://www.openarchives.org/data/registerasprovider.html
  • provide base URL
  • extensive conformance test (incl. error
    conditions ...)
  • information on incorrect behaviour
  • in case of conformance: added to the official
    list
  • regular checks

78
Agenda
  • General Considerations
  • Data Provider
  • Service Provider

79
Service Provider: Examples
  • Repository Explorer
  • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/
  • search engines / subject gateways
  • Cross Archive Searching Service:
    http://arc.cs.odu.edu/
  • MyOAI: http://www.myoai.org/
  • DINI: http://edoc.hu-berlin.de/oaisearch/
  • Physnet: http://physnet.uni-oldenburg.de/oai/query.php
  • internal communication
  • ProPrint: http://edoc.hu-berlin.de/proprint/
  • library consortia

80
Service Provider: Prerequisites
  • internet connected server
  • database system (relational or XML)
  • programming environment
  • can issue HTTP requests to web servers
  • can issue database requests
  • XML parser

81
Service Provider: Structure (1)
  • Archive Management
  • selection of archives to be harvested
  • enter entries manually or
  • automatically add / remove archives using the
    official registry
  • Request Component
  • creates HTTP requests and sends them to OAI
    archives (data provider)
  • demands metadata using the allowed verbs of the
    OAI-PMH
  • possibly selective harvesting (set parameter)

82
Service Provider: Structure (2)
  • Scheduler
  • realises timed and regular retrieval of the
    associated archives
  • simplest case: manual initiation of the jobs
  • else: e.g. cron job
  • Flow Control
  • resumption token: partitioning of the result list
    into incomplete sections; new request to
    retrieve more results
  • HTTP error 503 (service not available): analysis
    of response to extract retry-after period
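
The 503 handling can be sketched as a small retry loop. `send` is a hypothetical callable returning `(status, headers, body)`, so the logic can be exercised without a network; only the numeric form of Retry-After (seconds) is parsed here, and the HTTP-date form falls back to a default delay.

```python
import time

def retry_delay(headers, default=60):
    """Seconds to wait according to Retry-After, or `default` if absent or not numeric."""
    value = headers.get("Retry-After", "").strip()
    return int(value) if value.isdigit() else default

def fetch_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it stops answering 503; return (status, body)."""
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 503:
            return status, body
        # Repository is busy: wait as long as it asked us to.
        sleep(retry_delay(headers))
    raise RuntimeError("repository still unavailable after retries")
```

Injecting `sleep` as a parameter is a design choice for testability; a scheduler could equally re-queue the job instead of blocking.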

83
Service Provider: Structure (3)
  • Update Mechanism
  • realises consolidation of metadata which have
    been harvested earlier (merge old and new data)
  • easiest case: always delete all old metadata of
    an archive before harvesting it
  • reasonable: incremental update (from parameter),
    insert new metadata and overwrite changed /
    deleted metadata (assignment using the unique
    identifiers)
  • XML Parser
  • analyses the responses received from the archives
  • validation using the XML schema
  • transforms the metadata encoded in XML into the
    internal data structure
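
An incremental update keyed on the unique identifier might look like the following sketch, which uses an in-memory SQLite table. The schema, the function names and the `(identifier, datestamp, metadata, deleted)` tuple layout are assumptions for illustration; here deleted records are simply removed from the local store.

```python
import sqlite3

def open_store():
    """Create an in-memory store with one row per harvested record."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records (identifier TEXT PRIMARY KEY, datestamp TEXT, metadata TEXT)"
    )
    return db

def apply_update(db, harvested):
    """Merge (identifier, datestamp, metadata, deleted) tuples into the store."""
    for identifier, datestamp, metadata, deleted in harvested:
        if deleted:
            db.execute("DELETE FROM records WHERE identifier = ?", (identifier,))
        else:
            # Insert new records and overwrite changed ones via the primary key.
            db.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
                (identifier, datestamp, metadata),
            )
    db.commit()

def last_datestamp(db):
    """Latest stored datestamp: a candidate `from` value for the next harvest."""
    return db.execute("SELECT MAX(datestamp) FROM records").fetchone()[0]
```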

84
Service Provider: Structure (4)
  • Normaliser
  • transforms data into a homogeneous structure
    (different metadata formats)
  • harmonises representation (e.g. date, author,
    language code)
  • maps / translates different languages
  • Database
  • mapping the XML structure of the metadata into a
    relational database (multi values ...)
  • or use an XML database

85
Service Provider: Structure (5)
  • Duplication Checker
  • merges identical records from different data
    providers
  • possibility: unique identifier for the item (e.g.
    URN, ...)
  • but often not easily practicable and not risk /
    error free
  • Service Module
  • provides the actual service to the public
  • basis: harvested and stored records of the
    associated archives
  • uses only local database for requests etc.

86
Service Provider Architecture
  • [Architecture diagram; labels shown: User,
    Administrator, OAI Service Provider, Harvester,
    Scheduler, Flow control, XML Parser, Update
    mechanism, Normaliser, Duplication checker,
    Database, Service module, and three Data
    Providers]

87
Service Provider Resumption Token
  • optional from the data provider's point of view
  • but mandatory for service providers
  • to obtain complete lists: resume sequences of
    incomplete lists
  • recognise that response contains incomplete
    list
  • re-issue OAI request to data provider in order to
    get next part of the list
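
The loop described above might look like the following sketch. In OAI-PMH 2.0 an incomplete list carries a non-empty resumptionToken element, and the final part carries an empty one or none at all. The `fetch` callable is a hypothetical stand-in for whatever code actually sends the HTTP request.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_resumption_token(response_xml):
    """Return the token of an incomplete list, or None if complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None   # no token, or the empty token of the last part
    return token.text.strip()


def harvest_all(fetch, verb, **first_args):
    """Yield every response of a list request, part by part.

    `fetch` takes a dict of request arguments and returns the response
    as XML text (caller-supplied, assumed for this example).
    """
    args = {"verb": verb, **first_args}
    while True:
        response_xml = fetch(args)
        yield response_xml
        token = next_resumption_token(response_xml)
        if token is None:
            break
        # A resumed request carries only the verb and the token.
        args = {"verb": verb, "resumptionToken": token}
```

Note that the follow-up request drops metadataPrefix, set and the date arguments, since the protocol treats resumptionToken as an exclusive argument.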

88
Service Provider Test and Registration
  • harvest registered (i.e. OAI-compliant!) data
    providers
  • test behaviour of service provider
  • official registration site
  • http://www.openarchives.org/service/registerasprovider.html
  • provide institutional information
  • web site, email address, ...

89
Data / Service Provider: Questions?
90
Tutorial OAI and OAI-PMH for Beginners: An
introduction to the Open Archives Initiative and
the Protocol for Metadata Harvesting
  • Part IV: Implementation issues - XML schemas and
    support for multiple record formats
  • Andy Powell
  • UKOLN, University of Bath
  • a.powell_at_ukoln.ac.uk

91
Agenda
  1. basics
  2. XML schema details
  3. extending oai_dc for your application
  4. using IMS metadata as new record format

92
Basics
  • OAI-PMH uses XML Schemas to define record formats
  • you can exchange any data you like using OAI-PMH
    as long as you can encode it as XML and define an
    XML Schema for it!
  • OAI-PMH mandates the oai_dc XML schema
  • OAI-PMH documentation also describes the use of
    XML schemas to exchange
  • rfc1807: a schema for rfc1807 format metadata
  • marc21: a recommended schema for MARC21
    metadata, provided by the Library of Congress
  • oai_marc: a schema for MARC format metadata

93
A closer look at oai_dc
  • the simple DC schema used as mandatory record
    format in OAI-PMH defines a container schema
  • container schema is OAI-specific
  • container schema is hosted on the OAI Web site
  • imports a generic DCMES schema
  • generic DCMES schema is hosted on the DCMI Web
    site
  • same model likely to be used for the qualified DC
    schema: container schema hosted by OAI, generic
    schema hosted by DCMI

94
An oai_dc record
  • an example oai_dc record (viewed via the
    repository explorer)
  • here's the full GetRecord response
  • three important things to notice
  • namespace for the oai_dc format
  • xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  • namespace for DCMES elements
  • xmlns:dc="http://purl.org/dc/elements/1.1/"
  • container schema associated with the oai_dc
    namespace
  • xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
    http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
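
To show how the two namespaces matter in practice, the sketch below reads the DCMES elements out of an oai_dc container using Python's standard library. The sample record and its values are invented for the example; only the namespace URIs come from the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample content, shown without the surrounding GetRecord
# response for brevity.
SAMPLE = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:creator>Author, B.</dc:creator>
</oai_dc:dc>
"""


def read_oai_dc(xml_text):
    """Return {element name: [values]} for the first oai_dc container."""
    root = ET.fromstring(xml_text)
    if root.tag == "{%s}dc" % NS["oai_dc"]:
        container = root
    else:
        container = root.find(".//oai_dc:dc", NS)
    if container is None:
        raise ValueError("no oai_dc:dc container found")
    dc_prefix = "{%s}" % NS["dc"]
    fields = {}
    for child in container:
        if child.tag.startswith(dc_prefix):
            name = child.tag[len(dc_prefix):]
            # DCMES elements are repeatable, so collect values in lists.
            fields.setdefault(name, []).append(child.text or "")
    return fields


fields = read_oai_dc(SAMPLE)
```

Matching on the namespace URI rather than on the `dc:` prefix is the safer choice, because a data provider is free to pick a different prefix.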

95
The XML schemas
  • The oai_dc container schema
  • http://www.openarchives.org/OAI/2.0/oai_dc.xsd
  • imports DCMES schema from
  • http://dublincore.org/schemas/xmls/simpledc20020312.xsd
  • defines a container element called dc
  • lists the allowed elements within the dc
    container (from the DCMES namespace/schema above)

96
When oai_dc isn't enough
  • when the 15 DCMES elements are too limited, e.g.
    when adding extra metadata elements
  • when you need greater precision in your metadata
    records, e.g. when adding encoding schemes to
    existing elements
  • when you want to exchange other metadata formats
  • IMS/IEEE LOM (eLearning metadata)
  • ODRL (Open Digital Rights Language)

97
Extending the oai_dc schema
  • simple scenario
  • RDN currently uses oai_dc schema to exchange
    records but wants to add one additional element
    called
  • accessControl
  • note: this is not a real scenario
  • RDN really wants to use qualified DC records,
    but doing qualified DC is too complicated for this
    tutorial!
  • hope to write-up RDN work on exchanging qualified
    DC in future issue of Ariadne

98
Step 1: metadata format name
  • the new metadata format needs a name
  • in this case, we've chosen
  • rdn_dc
  • following OAI's naming of oai_dc
  • alternative possibilities
  • rdndc
  • rdn
  • etc.

99
Step 2: create namespaces
  • two namespaces are required
  • namespace for the rdn_dc format
  • http://www.rdn.ac.uk/oai/rdn_dc/
  • namespace for the new metadata elements
    (properties) that we are going to use in this
    format
  • http://purl.org/rdn/terms/
  • note
  • use of a PURL for the elements namespace follows
    DCMI usage but is not mandatory
  • however, both these namespace URIs should be
    under your control to ensure uniqueness and
    prevent re-use in the future
  • URIs do not need to resolve to anything

100
Step 3: local copy of DC schema
  • make local copy of the DCMES schema
  • in this case the copy is at
  • http://www.rdn.ac.uk/oai/rdn_dc/20021204/dc.xsd
  • this step isn't strictly necessary
  • in fact it is probably bad practice to do this
  • but there are currently some minor problems with
    the DCMI-hosted copy of the schema
  • working with a local copy is easier

101
Step 4: schema for new terms
  • create an XML schema for the new rdnterms
  • in this case the schema is available at
  • http://www.rdn.ac.uk/oai/rdn_dc/20021204/rdnterms.xsd
  • the schema defines the new element/property
  • accessControl
  • and adds it to the dc:any group
  • also creates a new container type
  • rdnterms:elementContainer
  • note
  • schema URI contains a date-stamp
  • this should make future enhancements to the
    schema easier to implement

102
Step 5: container schema
  • create a container schema for the new record
    format
  • in this case the schema is available at
  • http://www.rdn.ac.uk/oai/rdn_dc/20021204/rdn_dc.xsd
  • this simply imports the rdnterms schema
  • then defines a container element called rdndc
    of type
  • rdnterms:elementContainer
  • again, the schema URI contains a date-stamp

103
Step 6: validate, validate, val
  • create some test records using your new schemas
  • http://www.rdn.ac.uk/oai/rdn_dc/20021204/test.xml
  • http://www.rdn.ac.uk/oai/rdn_dc/20021204/oai-test.xml
  • use the XML schema validator at
  • http://www.w3.org/2001/03/webdata/xsv
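
A test record for the new format could be generated along these lines. The namespace URIs, the rdndc container and the accessControl element are taken from the earlier steps; the choice of dc:title as the only DC element, the field values and the helper name are assumptions. The sketch only builds the XML; it does not validate it against the schemas, which still has to be done with a schema validator.

```python
import xml.etree.ElementTree as ET

RDN_DC = "http://www.rdn.ac.uk/oai/rdn_dc/"
RDNTERMS = "http://purl.org/rdn/terms/"
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA = "http://www.rdn.ac.uk/oai/rdn_dc/20021204/rdn_dc.xsd"

# Use readable prefixes instead of the generated ns0, ns1, ...
for prefix, uri in [("rdn_dc", RDN_DC), ("rdnterms", RDNTERMS),
                    ("dc", DC), ("xsi", XSI)]:
    ET.register_namespace(prefix, uri)


def make_test_record(title, access_control):
    """Return a serialised rdndc container with two child elements."""
    root = ET.Element(f"{{{RDN_DC}}}rdndc")
    # schemaLocation pairs the format namespace with its schema URI.
    root.set(f"{{{XSI}}}schemaLocation", f"{RDN_DC} {SCHEMA}")
    ET.SubElement(root, f"{{{DC}}}title").text = title
    ET.SubElement(root, f"{{{RDNTERMS}}}accessControl").text = access_control
    return ET.tostring(root, encoding="unicode")


record_xml = make_test_record("A test title", "restricted")
```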

104
Step 7: ListMetadataFormats
  • add information about the new format to your
    repository's response to the ListMetadataFormats
    request

<metadataFormat>
  <metadataPrefix>rdn_dc</metadataPrefix>
  <schema>http://www.rdn.ac.uk/oai/rdn_dc/20021113/rdn_dc.xsd</schema>
  <metadataNamespace>http://www.rdn.ac.uk/oai/rdn_dc/</metadataNamespace>
</metadataFormat>
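
If the repository keeps its supported formats in a configuration table, the ListMetadataFormats entries can be generated from it, as in this sketch. The table layout and function name are assumptions, and for brevity the elements are built without the OAI-PMH namespace and response envelope that a real response needs.

```python
import xml.etree.ElementTree as ET

# metadataPrefix -> (schema URI, metadata namespace URI)
FORMATS = {
    "oai_dc": ("http://www.openarchives.org/OAI/2.0/oai_dc.xsd",
               "http://www.openarchives.org/OAI/2.0/oai_dc/"),
    "rdn_dc": ("http://www.rdn.ac.uk/oai/rdn_dc/20021113/rdn_dc.xsd",
               "http://www.rdn.ac.uk/oai/rdn_dc/"),
}


def list_metadata_formats(formats):
    """Build one metadataFormat element per configured format."""
    parent = ET.Element("ListMetadataFormats")
    for prefix, (schema, namespace) in formats.items():
        entry = ET.SubElement(parent, "metadataFormat")
        ET.SubElement(entry, "metadataPrefix").text = prefix
        ET.SubElement(entry, "schema").text = schema
        ET.SubElement(entry, "metadataNamespace").text = namespace
    return parent


formats_element = list_metadata_formats(FORMATS)
```

Keeping the mandatory oai_dc entry in the same table makes it harder to drop it by accident when a new format is added.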
105
Step 8: other verbs
  • modify your repository's response to the
    ListIdentifiers, ListRecords and GetRecord
    requests
  • accept metadataPrefix set to new format name
    rdn_dc
  • return records formatted according to the new
    schema(s)
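
On the repository side, accepting the new prefix usually comes down to a check like the one sketched here. The error codes badArgument and cannotDisseminateFormat are defined by OAI-PMH 2.0; the function and the set of supported prefixes are assumptions for the example.

```python
SUPPORTED_PREFIXES = {"oai_dc", "rdn_dc"}

# Verbs that take a metadataPrefix argument in OAI-PMH 2.0.
VERBS_WITH_PREFIX = {"ListIdentifiers", "ListRecords", "GetRecord"}


def check_metadata_prefix(verb, args):
    """Return an OAI-PMH error code, or None if the request is fine."""
    if verb not in VERBS_WITH_PREFIX:
        return None
    if verb != "GetRecord" and "resumptionToken" in args:
        # A resumed list request carries only the token.
        return None
    prefix = args.get("metadataPrefix")
    if prefix is None:
        return "badArgument"
    if prefix not in SUPPORTED_PREFIXES:
        return "cannotDisseminateFormat"
    return None
```

A fuller implementation would also return cannotDisseminateFormat when a prefix is supported in general but not available for the particular item requested.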

106
Step 9: validate again
  • use the Repository Explorer to check that
  • all requests work with new metadataPrefix
  • oai_dc format still works!
  • appropriate records are returned for each format
  • responses validate correctly

107
Summary
  • decide on name for your new metadata format and
    appropriate namespaces
  • develop XML schemas for container and new
    elements if appropriate
  • create test records and validate
  • modify your repository (source code and/or
    configuration files) to support the new format
  • validate and test repository

108
Other record formats
  • can take similar approach with other metadata
    record formats
  • IMS/IEEE LOM
  • ODRL
  • in these cases, XML schemas and namespaces have
    already been agreed
  • deployment of these formats should be easier
    because you don't need to define your own
    schemas
  • BUT the XML schema specs are currently undergoing
    continual revision, so it is sometimes hard for
    applications like IMS to keep up!

109
Adding support for IMS
  • modify ListMetadataFormats response to include
    the metadataFormat element shown below
  • extend ListIdentifiers, ListRecords and GetRecord
    requests
  • accept metadataPrefix set to ims and return
    records formatted appropriately

<metadataFormat>
  <metadataPrefix>ims</metadataPrefix>
  <schema>http://www.imsglobal.org/xsd/imsmd_v1p2p2.xsd</schema>
  <metadataNamespace>http://www.imsglobal.org/xsd/imsmd_v1p2</metadataNamespace>
</metadataFormat>
111
Summary
  • during today's tutorial we hope that you have
  • gained an overview of the history behind the
    OAI-PMH and an overview of its key features
  • been given a deeper technical insight into how
    the protocol works
  • learned something about some of the main
    implementation issues
  • found some useful starting points and hints that
    will help you as implementors

112
Questions
  • now
  • feel free to tell us what you didn't understand
  • and ask general questions (of course!)

  • Uwe Müller
  • Humboldt University Berlin, Germany
  • u.mueller_at_rz.hu-berlin.de
  • Andy Powell
  • UKOLN, University of Bath
  • a.powell_at_ukoln.ac.uk