1
New Digital Library Possibilities Using the Open
Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH)
  • Michael L. Nelson
  • Old Dominion University
  • Norfolk, Virginia, USA
  • mln@cs.odu.edu
  • http://www.cs.odu.edu/mln/icsep/

International Conference on Scientific Electronic
Publishing in Developing Countries, Valparaiso,
Chile, October 2, 2002
Several Slides Also from Van de Sompel & Warner
2
Random Thoughts
  1. Thanks to the Organizing Committee for inviting
    me
  2. I wish I had paid attention in my high school
    Spanish classes
  3. Publishers & Editors: if you want increased
    coverage, exposure and readership, you must do
    OAI

3
Outline
  • OAI-PMH history and technical highlights
  • a full technical review is out of the scope of
    this presentation
  • Example data provider use
  • Example service provider uses
  • Implications for authors and editors
  • Looking to the future

4
Open Archives Initiative
5
The Rise and Fall of Distributed Searching
  • wholesale distributed searching, popular at the
    time, is attractive in theory but troublesome in
    practice
  • Davis & Lagoze, JASIS 51(3), pp. 273-80
  • Powell & French, Proc 5th ACM DL, pp. 264-265
  • distributed searching of N nodes still viable,
    but only for small values of N
  • NCSTRL: N > 100, bad
  • NTRS/NIX: N < 20, ok (but could be better)

6
The Rise and Fall of Distributed Searching
  • Other problems of distributed searching (from
    STARTS)
  • source-metadata problem
  • how do you know which nodes to search?
  • query-language problem
  • syntax varies and drifts over time between the
    various nodes
  • rank-merging problem
  • how do you meaningfully merge multiple result
    sets?
  • Temptations
  • centralize all functions
  • everything will be done at X
  • standardize on a single product
  • everyone will use system Y

7
Santa Fe Convention 02/2000
  • goal: optimize discovery of e-prints
  • http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
  • input
  • the UPS prototype
  • http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
  • RePEc / SODA data provider / service provider
    model
  • Dienst protocol
  • deliberations at Santa Fe meeting 10/99

8
Data and Service Providers
  • Data Providers
  • publishing into an archive
  • providing methods for metadata harvesting
  • provide non-technical context for sharing
    information also
  • Service Providers
  • harvest metadata from providers
  • implement user interface to data
  • Self-describing archives
  • Much of the learning about the constituent UPS
    archives occurred out of band

Even if these are done by the same DL, these are
distinct roles
9
Metadata Harvesting
  • Move away from distributed searching
  • Extract metadata from various sources
  • Build services on local copies of metadata
  • data remains at remote repositories

[Figure: a user's query ("search for cfd applications") is answered from a
local copy of the metadata, where all searching, browsing, etc. are
performed. The metadata is harvested offline from each node; each node is
independently maintained and can still support direct user interaction.]
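The harvesting model above can be sketched in a few lines: metadata gathered offline goes into a local index, and the user's search never touches the remote nodes. This is only an illustration; the record identifiers, titles, and the helper names `build_index` and `search` are invented here and are not part of OAI-PMH.

```python
def build_index(harvested):
    """Map each lower-cased title word to the identifiers whose title contains it."""
    index = {}
    for identifier, title in harvested:
        for word in title.lower().split():
            index.setdefault(word, set()).add(identifier)
    return index


def search(index, query):
    """Return identifiers whose titles contain every word of the query."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()


# Invented sample metadata standing in for records harvested offline
# from three independently maintained nodes.
harvested = [
    ("oai:nodeA:1", "CFD applications in wing design"),
    ("oai:nodeB:7", "Applications of metadata harvesting"),
    ("oai:nodeC:3", "Unsteady CFD methods"),
]

index = build_index(harvested)
hits = search(index, "cfd applications")
print(sorted(hits))
```

The search runs entirely against the local copy; the full resources stay at the remote repositories and would be fetched from there only after discovery.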
10
OAI-PMH v.1.0 01/2001
  • low-barrier interoperability specification
  • metadata harvesting model: data provider /
    service provider
  • focus on document-like objects
  • autonomous protocol
  • HTTP based
  • XML responses
  • unqualified Dublin Core
  • experimental: 12-18 months

11
Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
12
OAI-PMH 2.0
  • Good news: OAI-PMH is still
  • Six Verbs & Dublin Core
  • Incremental improvements
  • single XML schema
  • ambiguities removed
  • more expressive options
  • cleaner separation of roles & responsibilities
  • Bad news: not backwards compatible with 1.1

13
Dublin Core
  • Dublin Core Metadata Initiative
  • http://www.dublincore.org/
  • from 1994-1995, recognizing the need for simple,
    interoperable metadata for resource discovery
  • good overview of metadata & DC:
    http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  • 15 elements (qualifiers possible)

14
OAI Mechanics
Request is encoded in http
Response is encoded in XML
XML Schemas for the responses are defined in the
OAI-PMH document
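As a rough illustration of "request is encoded in HTTP", the sketch below assembles a GetRecord request URL. The helper name `build_request` is invented for this example; the base URL and identifier are the arXiv ones used in the sample responses in this deck, and `oai_dc` is the unqualified Dublin Core prefix. It only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Return an OAI-PMH GET request URL for the given verb and arguments."""
    params = {"verb": verb}
    # Arguments passed as None are omitted, so optional ones can be skipped.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


url = build_request(
    "http://arXiv.org/oai2",
    "GetRecord",
    identifier="oai:arXiv:cs/0112017",
    metadataPrefix="oai_dc",
)
print(url)
```

`urlencode` percent-encodes the reserved characters in the identifier (`:` and `/`), which is why the argument values are passed in unencoded.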
15
Overview of OAI-PMH Verbs
Verb                 Function
Identify             description of archive
ListMetadataFormats  metadata formats supported by archive
ListSets             sets defined by archive
ListIdentifiers      OAI unique ids contained in archive
ListRecords          listing of N records
GetRecord            listing of a single record

Identify, ListMetadataFormats and ListSets return metadata about the
repository; ListIdentifiers, ListRecords and GetRecord are the harvesting verbs.
Most verbs take arguments: dates, sets, ids, metadata formats and
resumption token (for flow control).
16
protocol vs periphery
  • clear distinction between protocol and periphery
  • fixed protocol document
  • extensible implementation guidelines
  • e.g. sample metadata formats, description
    containers, about containers
  • allows for OAI guidelines and community
    guidelines

17
OAI-PMH vs HTTP
  • clear separation of OAI-PMH and HTTP
  • OAI-PMH error handling
  • all OK at HTTP level? => 200 OK
  • something wrong at OAI-PMH level? => OAI-PMH
    error (e.g. badVerb)
  • HTTP codes 302, 503, etc. still available to
    implementers, but no longer represent OAI-PMH
    events
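
One consequence for harvesters: a 200 OK response can still carry a protocol-level error, so the XML body has to be inspected. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `oai_errors` is made up for this example, and the sample body is the badVerb response shown in this deck without namespace declarations.

```python
import xml.etree.ElementTree as ET


def oai_errors(response_text):
    """Return (code, message) pairs for every <error> element in a response."""
    root = ET.fromstring(response_text)
    errors = []
    for element in root.iter():
        # Compare local names so the check works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] == "error":
            errors.append((element.get("code"), (element.text or "").strip()))
    return errors


sample = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

print(oai_errors(sample))
```

An empty list means the response carried no OAI-PMH error; HTTP-level failures (for example a 503) would be handled separately by whatever HTTP client is in use.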

18
resource - item - record
item = identifier
record = identifier + metadata format + datestamp
19
other general changes
  • better definitions of harvester, repository,
    item, unique identifier, record, set, selective
    harvesting
  • oai_dc schema builds on DCMI XML Schema for
    unqualified Dublin Core
  • usage of "must", "must not" etc. as in RFC 2119
  • wording on response compression

20
other general changes
  • all protocol responses can be validated with a
    single XML Schema
  • easier for data providers
  • no redundancy in type definitions
  • SOAP-ready
  • clean for error handling

21
response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        ..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
22
response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
23
resumptionToken
scenario: harvesting 2770 records in 3 separate
chunks of up to 1000 records (1000 + 1000 + 770)
24
resumptionToken
  • idempotency of resumptionToken: return same
    incomplete list when the resumptionToken is reissued
  • while no changes occur in the repository: strict
  • while changes occur in the repository: all items
    with unchanged datestamp
  • new, optional attributes for the resumptionToken
  • expirationDate
  • completeListSize
  • cursor
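
A minimal sketch of the flow-control loop a harvester might run, using the 2770-record scenario. To stay self-contained it talks to a stand-in function rather than a real repository: `fake_fetch`, its integer-offset tokens, and the `harvest` helper are all assumptions made for illustration. Real resumptionTokens are opaque strings chosen by the repository.

```python
def harvest(fetch_page, **first_request):
    """Yield every record, following resumptionTokens until the list is complete.

    fetch_page is a caller-supplied function that issues one request and
    returns (records, resumption_token_or_None).
    """
    records, token = fetch_page(**first_request)
    yield from records
    while token:
        # resumptionToken is exclusive: it is sent without the other arguments.
        records, token = fetch_page(resumptionToken=token)
        yield from records


# Stand-in repository: 2770 records served in chunks of at most 1000.
ALL_RECORDS = list(range(2770))
CHUNK = 1000
calls = []


def fake_fetch(resumptionToken=None, **_other_arguments):
    calls.append(resumptionToken)
    start = int(resumptionToken) if resumptionToken else 0
    end = start + CHUNK
    next_token = str(end) if end < len(ALL_RECORDS) else None
    return ALL_RECORDS[start:end], next_token


harvested = list(harvest(fake_fetch, metadataPrefix="oai_dc"))
print(len(harvested), "records in", len(calls), "requests")
```

Because reissuing a token returns the same incomplete list, a harvester that loses a response can retry the last request instead of starting over.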

25
harvesting granularity
  • harvesting granularity
  • mandatory support of YYYY-MM-DD
  • optional support of YYYY-MM-DDThh:mm:ssZ
  • other granularities considered, but ultimately
    rejected
  • granularity of from and until must be the same
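
The "same granularity" rule lends itself to a small check on the repository side. The sketch below is one possible reading, not the normative algorithm: the function names are invented, the patterns test only syntax (not calendar validity), and mapping a mismatch to `badArgument` is an assumption based on that code covering illegal argument values.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(value):
    """Return the granularity name of a datestamp, or None if it fits neither form."""
    if DAY.fullmatch(value):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(value):
        return "YYYY-MM-DDThh:mm:ssZ"
    return None


def check_date_range(from_=None, until=None):
    """Return None when the range is acceptable, else the code 'badArgument'."""
    found = [granularity(v) for v in (from_, until) if v is not None]
    if None in found:
        return "badArgument"  # unrecognised datestamp syntax
    if len(set(found)) > 1:
        return "badArgument"  # from and until differ in granularity
    return None


print(check_date_range("2001-12-14", "2002-02-08"))
print(check_date_range("2001-12-14", "2002-02-08T08:55:46Z"))
```

A fuller check would also reject seconds-level values when the repository's Identify response advertises only day granularity.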

26
Identify
  • Identify more expressive

<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
  <adminEmail>rgillian@visi.net</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
27
header
  • header contains set membership of item

<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    ..
  </metadata>
</record>
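
Since the header now carries set membership, a harvester can sort records into sets without a second request. A small sketch with Python's standard `xml.etree.ElementTree` follows; `parse_header` is a name invented here, and the sample omits the OAI-PMH namespace that a real response would declare (real code would need namespace-qualified lookups).

```python
import xml.etree.ElementTree as ET


def parse_header(header):
    """Extract identifier, datestamp and set membership from a <header> element."""
    return {
        "identifier": header.findtext("identifier"),
        "datestamp": header.findtext("datestamp"),
        "sets": [spec.text for spec in header.findall("setSpec")],
    }


record = ET.fromstring(
    """<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>..</metadata>
</record>"""
)

info = parse_header(record.find("header"))
print(info)
```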
28
ListIdentifiers
  • ListIdentifiers returns headers

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="..">http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
29
provenance
  • introduction of provenance container to
    facilitate tracing of harvesting history

<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
      </originDescription>
    </originDescription>
  </provenance>
</about>
30
friends
  • introduction of friends container to facilitate
    discovery of repositories

<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>
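
Repository discovery through the friends container amounts to a graph walk: start at one baseURL, read its friends, and repeat for each newly seen repository. The sketch below shows one way to do that, breadth-first. `discover` and the `get_friends` callback are hypothetical names; in practice `get_friends` would issue an Identify request and read the `<baseURL>` elements inside `<friends>`, while here an invented in-memory graph stands in for the network.

```python
from collections import deque


def discover(start_url, get_friends):
    """Return every baseURL reachable from start_url through friends links."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for friend in get_friends(url):
            if friend not in seen:  # guards against repositories that list each other
                seen.add(friend)
                queue.append(friend)
    return seen


# Invented friends graph; note the cycle between a.example and b.example.
FRIENDS = {
    "http://a.example/oai": ["http://b.example/oai", "http://c.example/oai"],
    "http://b.example/oai": ["http://a.example/oai"],
    "http://c.example/oai": ["http://d.example/oai"],
}

found = discover("http://a.example/oai", lambda url: FRIENDS.get(url, []))
print(sorted(found))
```

The visited set matters because nothing stops two data providers from naming each other as friends.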
31
NASA <friends> example (1)
  • A lightweight, DP-centric method to communicate
    the existence of others
  • http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify

..
<description>
  <friends ..namespace stuff..>
    <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
    <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
    <baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
    <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
  </friends>
</description>
..

32
NASA <friends> example (2)
33
branding
  • introduction of branding container for DPs to
    suggest rendering & association hints

<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
                              http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>

34
oai-identifier
  • revision of oai-identifier

<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                                      http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>

domain based repository names
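
Identifiers in this scheme have three parts: the scheme `oai`, a repository identifier, and a local identifier, separated by `:`. A minimal sketch of taking one apart follows; the function name `split_oai_identifier` is invented for this example, and it performs no validation beyond checking the scheme.

```python
def split_oai_identifier(identifier, delimiter=":"):
    """Split 'oai:<repository>:<local id>' into (repository, local id)."""
    # maxsplit=2 keeps any further delimiters inside the local identifier.
    scheme, repository, local_id = identifier.split(delimiter, 2)
    if scheme != "oai":
        raise ValueError("not an oai-identifier: " + identifier)
    return repository, local_id


print(split_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(split_oai_identifier("oai:arXiv:cs/0112017"))
```

With domain-based repository names, the middle part also tells a service provider which organization is responsible for the item.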
35
did not make it into OAI-PMH v.2.0
  • SOAP implementation
  • Result set filtering
  • Multiple / best metadata
  • GetRecord -> GetRecords
  • Machine readable rights management
  • XML format for mini-archives

36
So What Does OAI-PMH Mean for Your Digital
Library?
  • Resources on DL projects are typically spent in 2
    areas
  • creating & maintaining the collection
  • data provider
  • developing access services for the collection
    (searching, browsing, etc.)
  • service provider
  • OAI-PMH allows for specialization based on
    resources / interest

37
NACA Report 1345 as seen through its native
DL: http://naca.larc.nasa.gov/
38
NACA Report 1345 as seen through
MAGiC: http://www.magic.ac.uk/
39
NACA Report 1345 as seen through
Scirus (Elsevier): http://www.scirus.com/
40
NACA Report 1345 as seen through my.OAI (FS
Consulting): http://www.myoai.com/
41
Scientific Communication
  • With only some exceptions, which interface is
    used for discovery is not as important as the
    fact that discovery occurred in the first place
  • control of the discovered objects is not lost
    by data providers
  • however, higher level mirroring services can be
    built on top of OAI (cf. NACA ARC mirroring
    between NASA LaRC and MAGiC)
  • The real power of OAI-PMH derives as much from
    what it does not do as what it actually does

42
What Does OAI-PMH Mean for Authors?
  • On the surface, absolutely nothing!
  • the ideal OAI deployment should be absolutely
    invisible to normal DL operations
  • uninterested users should not even notice or care
  • Indirectly, they should enjoy the benefits of the
    critical mass of current and developing DL tools
    & systems
  • personal, institutional data providers
  • proliferation of targeted, value-added service
    providers

43
What Does OAI-PMH Mean For Editors?
  • Absolutely everything
  • The decoupling of SPs and DPs will have
    significant and profound implications on
    scientific and technical information exchange
  • OAI-PMH is actually just one component in a
    larger engineering effort for scholarly
    communication (e.g. OpenURL)
  • Service and resource integration will be the
    focus of journals, professional societies,
    universities, etc.
  • OAI-PMH will be as basic and core a technology
    for scientific publishing as http & XML

44
Field of Dreams
  • It should be easy to be a data provider, even if
    it makes more work for the service provider.
  • if enough data providers exist, the service
    providers will come (DPs >> SPs)
  • Open-source / freely available tools
  • drop-in data providers
  • industrial strength: http://www.eprints.org/
  • personal size: http://kepler.cs.odu.edu/
  • tools to make your existing DL a data provider
  • http://www.openarchives.org/tools/tools.htm
  • also OAI-implementers mailing list / mail
    archive!
  • service providers
  • Arc: http://sourceforge.net/projects/oaiarc/

45
OAI Observation: Front-End Only
  • No input/registry mechanism
  • OAI harvesting protocol is always a front-end for
    something else
  • filesystem, Dienst, RDBMS, LDAP, etc.
  • convenient for pre-existing DLs, but does not
    address new DLs
  • e.g., "we want to do OAI"
  • Bounds the scope of OAI
  • responsibilities and domain of OAI are still being
    discussed
  • tension between functionality and simplicity

46
OAI Observation: No T&C
  • Possible to use multiple OAI servers in a
    DMZ-like configuration

[Figure: a private OAI server answers OAI requests from trusted hosts and a
public OAI server answers OAI requests from arbitrary hosts; both sit in
front of the source database, and a separate copy of the database could
even be used.]
47
OAI Observation: No T&C
  • Possible to use OAI harvesting protocol in
    closed, restricted systems

[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate
from these 4 DLs.]
48
Metadata
  • Q: Which format should I use?
  • A: any/all of them
  • lowest common denominator: unqualified Dublin
    Core
  • Again, little known about actual behavior
  • will DC actually be useful? or too lossy?
  • will communities create/adopt specific formats?
  • will native (presumably richer) formats be
    harvested?

49
The Future: Community Building
  • Ultimately, protocols and metadata formats are
    not what makes a difference
  • Rather, the critical mass afforded by a common
    set of utilities (cf. http, Dublin Core, XML)
  • The best current example: The Open Language
    Archives Community
  • http://www.language-archives.org
  • OAI-PMH provides the basis for communication
    between strangers, but allows even richer
    communication between friends

50
http://www.openarchives.org
openarchives@openarchives.org
51
Backup Slides
52
Detailed Review of the OAI-PMH 2.0 Verbs
53
Identify
1.1
  • Arguments
  • none
  • Errors
  • none
2.0
  • Arguments
  • none
  • Errors
  • badArgument

54
ListMetadataFormats
1.1
  • Arguments
  • identifier (OPTIONAL)
  • Errors
  • id does not exist
2.0
  • Arguments
  • identifier (OPTIONAL)
  • Errors
  • badArgument
  • noMetadataFormats
  • idDoesNotExist

55
ListSets
1.1
  • Arguments
  • resumptionToken (EXCLUSIVE)
  • Errors
  • no set hierarchy
2.0
  • Arguments
  • resumptionToken (EXCLUSIVE)
  • Errors
  • badArgument
  • badResumptionToken
  • noSetHierarchy

56
ListIdentifiers
1.1
  • Arguments
  • from (OPTIONAL)
  • until (OPTIONAL)
  • set (OPTIONAL)
  • resumptionToken (EXCLUSIVE)
  • Errors
  • no records match
2.0
  • Arguments
  • from (OPTIONAL)
  • until (OPTIONAL)
  • set (OPTIONAL)
  • resumptionToken (EXCLUSIVE)
  • metadataPrefix (REQUIRED)
  • Errors
  • badArgument
  • cannotDisseminateFormat
  • badResumptionToken
  • noSetHierarchy
  • noRecordsMatch

57
ListRecords
1.1
  • Arguments
  • from (OPTIONAL)
  • until (OPTIONAL)
  • set (OPTIONAL)
  • resumptionToken (EXCLUSIVE)
  • metadataPrefix (REQUIRED)
  • Errors
  • no records match
  • metadata format cannot be disseminated
2.0
  • Arguments
  • from (OPTIONAL)
  • until (OPTIONAL)
  • set (OPTIONAL)
  • resumptionToken (EXCLUSIVE)
  • metadataPrefix (REQUIRED)
  • Errors
  • noRecordsMatch
  • cannotDisseminateFormat
  • badResumptionToken
  • noSetHierarchy
  • badArgument

58
GetRecord
1.1
  • Arguments
  • identifier (REQUIRED)
  • metadataPrefix (REQUIRED)
  • Errors
  • id does not exist
  • metadata format cannot be disseminated
2.0
  • Arguments
  • identifier (REQUIRED)
  • metadataPrefix (REQUIRED)
  • Errors
  • badArgument
  • cannotDisseminateFormat
  • idDoesNotExist

59
Argument Summary
                     metadataPrefix  from      until     set       resumptionToken  identifier
Identify             -               -         -         -         -                -
ListMetadataFormats  -               -         -         -         -                optional
ListSets             -               -         -         -         exclusive        -
ListIdentifiers      required        optional  optional  optional  exclusive        -
ListRecords          required        optional  optional  optional  exclusive        -
GetRecord            required        -         -         -         -                required
60
Error Summary
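Verb                 Errors
Identify             BA
ListMetadataFormats  BA NMF IDDNE
ListSets             BA BRT NSH
ListIdentifiers      BA BRT CDF NRM NSH
ListRecords          BA BRT CDF NRM NSH
GetRecord            BA CDF IDDNE

BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat,
IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch,
NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs.
This is an inversion of the table in section 3.6 of the OAI-PMH
specification.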
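
The argument and error summaries can be combined into a small request check on the data provider side. The sketch below is an illustration, not the normative algorithm: the `ARGUMENTS` table is transcribed from the 2.0 columns of the verb review slides, `check_request` is an invented name, and it covers only badVerb and badArgument (the other codes depend on repository content).

```python
HARVEST_ARGUMENTS = {
    "metadataPrefix": "required",
    "from": "optional",
    "until": "optional",
    "set": "optional",
    "resumptionToken": "exclusive",
}

# Legal arguments per verb, transcribed from the 2.0 verb review slides.
ARGUMENTS = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": HARVEST_ARGUMENTS,
    "ListRecords": HARVEST_ARGUMENTS,
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}


def check_request(params):
    """Return 'badVerb', 'badArgument', or None if the arguments look legal."""
    verb = params.get("verb")
    if verb not in ARGUMENTS:
        return "badVerb"
    rules = ARGUMENTS[verb]
    given = set(params) - {"verb"}
    if not given <= set(rules):
        return "badArgument"  # argument not defined for this verb
    exclusive = {name for name, rule in rules.items() if rule == "exclusive"}
    if given & exclusive:
        # An exclusive argument must be the only argument besides the verb.
        return None if len(given) == 1 else "badArgument"
    required = {name for name, rule in rules.items() if rule == "required"}
    return None if required <= given else "badArgument"


print(check_request({"verb": "ShowMe"}))
print(check_request({"verb": "ListRecords", "resumptionToken": "abc", "set": "cs"}))
```

Because the request is modeled as a dictionary, this sketch cannot detect repeated arguments, which a real implementation would also need to reject.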