1
Introduction to the OAI-PMH
  • Michael L. Nelson
  • mln@cs.odu.edu
  • http://www.cs.odu.edu/~mln/
  • Several Slides from
  • Herbert Van de Sompel, Simeon Warner and Terry L.
    Harrison
  • University of Southern California
  • 6/15/04

2
Outline
  • History of OAI-PMH
  • UPS, Santa Fe Convention
  • Overview of the OAI-PMH
  • verbs
  • data model
  • OAI 1.0, 1.1, 2.0 and how 2.0 was created
  • Example data providers and service providers
  • More information
  • http://www.openarchives.org/

3
UPS and SFC
4
The Rise and Fall of Distributed Searching
  • wholesale distributed searching, popular at the
    time, is attractive in theory but troublesome in
    practice
  • Davis & Lagoze, JASIS 51(3), pp. 273-80
  • Powell & French, Proc. 5th ACM DL, pp. 264-265
  • distributed searching of N nodes still viable,
    but only for small values of N
  • NCSTRL: N > 100, bad
  • NTRS/NIX: N < 20, ok (but could be better)

5
The Rise and Fall of Distributed Searching
  • Other problems of distributed searching (from
    STARTS)
  • source-metadata problem
  • how do you know which nodes to search?
  • query-language problem
  • syntax varies and drifts over time between the
    various nodes
  • rank-merging problem
  • how do you meaningfully merge multiple result
    sets?

6
Universal Preprint Service
  • A cross-archive DL that provides services on
    a collection of metadata harvested from multiple
    archives
  • based on NCSTRL+, a modified version of Dienst
  • support for clustering
  • support for buckets
  • Demonstrated at Santa Fe, NM, October 21-22, 1999
  • http://ups.cs.odu.edu/
  • D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
  • UPS was soon renamed the Open Archives Initiative
    (OAI): http://www.openarchives.org/

7
UPS Participants
totals ca. July 1999
8
Metadata Harvesting
  • Getting metadata out of archives
  • not all archives support metadata extraction
  • some archives have undocumented metadata
    extraction procedures
  • not all archives support rich criteria for
    extraction
  • single dump concept only
  • Intellectual property and use rights not always
    clear
  • many policies akin to "don't ask, don't tell"

9
Metadata Formatting and Quality
  • Quality problems with
  • record duplication
  • crucial missing fields
  • internal errors
  • ambiguous references to people and places,
    publications
  • Different formats!

observation: n digital libraries result in
O(n) metadata formats
10
Buckets: Information Surrogates in UPS
  • Limitations on intellectual property, file size,
    transmission time, system load, etc. caused us
    to focus on metadata only
  • Metadata was collected into buckets, with pointers
    back to the data files (still at the original sites)

11
Value Added Services Attached to the Buckets
  • SFX Reference Linking Service, developed at
    Univ. of Ghent, Belgium
  • provides a layer of indirection between reference
    services available at a local site and the object
    itself
  • SFX buttons are attached to the buckets themselves
  • communication occurs between the SFX server and
    the bucket
  • Adding other services to the buckets is easy...
12
Data and Service Providers
  • Data Providers
  • publishing into an archive
  • Self-describing archives
  • Much of the learning about the constituent UPS
    archives occurred out of band
  • providing methods for metadata harvesting
  • provide non-technical context for sharing
    information also
  • Service Providers
  • harvest metadata from providers
  • implement user interface to data

Even if these are done by the same DL, these are
distinct roles
13
Metadata Harvesting
  • Move away from distributed searching
  • Extract metadata from various sources
  • Build services on local copies of metadata
  • data remains at remote repositories

[Figure: a user searches (e.g. for "cfd applications") against a
local copy of metadata that was harvested offline from each
independently maintained node. All searching, browsing, etc. is
performed on the local metadata; individual nodes can still
support direct user interaction.]
14
Result: OAI
  • The OAI was the result of the demonstration and
    discussion during the Santa Fe meeting
  • Initial focus was on federating collections of
    scholarly e-print materials
  • however, interest grew and the scope and
    application of OAI expanded to become a generic
    bulk metadata transport protocol
  • Note:
  • OAI is only about metadata -- not full text!
  • what is metadata and what is full text?
  • OAI is neutral with respect to the nature of the
    metadata or the resources the metadata describes
  • read: commercial publishers have an interest in
    OAI too...

15
Open Archives Initiative
16
Open Archives Initiative
  • Open Archives Initiative (OAI): exposure of
    metadata for harvesting
  • Open Archival Information System (OAIS): ensuring
    long-term preservation of archival materials
  • OAIS w/ an OAI interface
  • http://www.dlib.org/dlib/april01/04editorial.html
  • http://www.dlib.org/dlib/may01/05letters.html
  • http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
17
OAI Protocol for Metadata Harvesting
  • Then
  • OAI-PMH originally a subset of the Dienst
    (NCSTRL) protocol
  • and originally called the Santa Fe Convention
  • originally defined an OAI-specific metadata
    format
  • Now
  • OAI metadata format dropped in favor of
    unqualified Dublin Core
  • other formats possible, but DC is required as
    lowest common denominator
  • No longer dependent on Dienst (Cornell CS TR
    95-1514)
  • defined independently (though still easily
    mappable)

18
Dublin Core
  • Dublin Core Metadata Initiative
  • http://www.dublincore.org/
  • from 1994-1995, recognizing the need for simple,
    interoperable metadata for resource discovery
  • good overview of metadata & DC:
    http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  • 15 elements (qualifiers/refinements possible)

19
Open Archives Initiative Protocol for Metadata
Harvesting
20
OAI-PMH Actors
  • data providers / repositories
  • A repository is a network accessible server that
    can process the 6 OAI-PMH requests in the manner
    described in the OAI-PMH document.   A
    repository is managed by a data provider to
    expose metadata to harvesters. 
  • service providers / harvesters
  • A harvester is a client application that issues
    OAI-PMH requests.  A harvester is operated by a
    service provider as a means of collecting
    metadata from repositories.

21
Data Providers / Service Providers
22
Aggregators
  • aggregators allow for
  • scalability for OAI-PMH
  • load balancing
  • community building
  • discovery

[Figure: an aggregator sits between the data providers
(repositories) and the service providers (harvesters).]
23
Aggregators
  • Frequently interchangeable terms
  • aggregators: likely to be community /
    institutionally focused
  • caches: store a copy, less likely to be
    community-oriented
  • proxies: less likely to store a copy, may gateway
    between OAI-PMH and other protocols
  • Dienst / OAI Gateway: Harrison, Nelson, Zubair,
    JCDL '03
  • To learn more about aggregators, caches &
    proxies
  • http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
  • http://www.cs.odu.edu/~mln/jcdl03/

24
OAI-PMH Data Model
  • item = identifier
  • record = identifier + metadata format + datestamp
25
Overview of OAI-PMH Verbs
Verb                 Function
Identify             description of repository
ListMetadataFormats  metadata formats supported by repository
ListSets             sets defined by repository
ListIdentifiers      OAI unique ids contained in repository
ListRecords          listing of N records
GetRecord            listing of a single record

  • Identify, ListMetadataFormats, ListSets: metadata
    about the repository
  • ListIdentifiers, ListRecords, GetRecord: harvesting
    verbs
  • most verbs take arguments: dates, sets, ids,
    metadata formats and resumption token (for flow
    control)
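
A minimal sketch of how a verb and its arguments could be turned into an HTTP GET request. The helper name `build_request` and the base URL `http://example.org/oai` are made up for illustration; only the verb and argument names come from the slides.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs from the overview table.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request(base_url, verb, **arguments):
    """Return a GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    query = {"verb": verb}
    for name, value in arguments.items():
        # "from" is a Python keyword, so callers pass it as from_.
        query[name.rstrip("_")] = value
    # Keep ":" and "/" readable inside identifiers and set names.
    return base_url + "?" + urlencode(query, safe=":/")


url = build_request(
    "http://example.org/oai",      # assumed base URL, not a real repository
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2002-01-01",
)
print(url)
```

Anything that is not one of the six verbs is rejected before a request is ever sent, mirroring the badVerb rule on the repository side.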
26
Supporting protocol requests: Identify
[Figure: the service provider (harvester) sends Identify to the
data provider (repository), which returns:]
  • Identify / Time / Request
  • Repository identifier
  • Base-URL
  • Admin e-mail
  • OAI protocol version
  • Description

Figure credit: Herbert Van de Sompel
27
Identify
  • 1.1 arguments: none
  • 1.1 errors: none
  • 2.0 arguments: none
  • 2.0 errors: badArgument

28
Supporting protocol requests: ListMetadataFormats
[Figure: the service provider (harvester) sends
ListMetadataFormats with identifier=oai:mlib:123a to the data
provider (repository), which returns:]
  • ListMetadataFormats / Time / Request
  • REPEAT
  • Format prefix
  • Format XML schema
  • /REPEAT
29
ListMetadataFormats
  • 1.1 arguments: identifier (OPTIONAL)
  • 1.1 errors: id does not exist
  • 2.0 arguments: identifier (OPTIONAL)
  • 2.0 errors: badArgument, noMetadataFormats,
    idDoesNotExist

30
Supporting protocol requests: ListSets
[Figure: the service provider (harvester) sends ListSets (or a
resumptionToken) to the data provider (repository), which
returns:]
  • ListSets / Time / Request
  • REPEAT
  • SetSpec
  • SetName
  • /REPEAT
31
ListSets
  • 1.1 arguments: resumptionToken (EXCLUSIVE)
  • 1.1 errors: no set hierarchy
  • 2.0 arguments: resumptionToken (EXCLUSIVE)
  • 2.0 errors: badArgument, badResumptionToken,
    noSetHierarchy

32
Harvesting requests: ListRecords
[Figure: the service provider (harvester) sends ListRecords with
from=a, until=b, set=klm, metadataPrefix=dc (or a
resumptionToken) to the data provider (repository), which
returns:]
  • ListRecords / Time / Request
  • REPEAT
  • Identifier
  • Datestamp
  • Metadata
  • /REPEAT
33
ListRecords
  • 1.1 arguments: from (OPTIONAL), until (OPTIONAL),
    set (OPTIONAL), resumptionToken (EXCLUSIVE),
    metadataPrefix (REQUIRED)
  • 1.1 errors: no records match; metadata format
    cannot be disseminated
  • 2.0 arguments: from (OPTIONAL), until (OPTIONAL),
    set (OPTIONAL), resumptionToken (EXCLUSIVE),
    metadataPrefix (REQUIRED)
  • 2.0 errors: noRecordsMatch, cannotDisseminateFormat,
    badResumptionToken, noSetHierarchy, badArgument

34
Harvesting requests: ListIdentifiers
[Figure: the service provider (harvester) sends ListIdentifiers
with from=a, until=b, set=klam, metadataPrefix (or a
resumptionToken) to the data provider (repository), which
returns:]
  • ListIdentifiers / Time / Request
  • REPEAT
  • Identifier
  • Datestamp
  • /REPEAT
35
ListIdentifiers
  • 1.1 arguments: from (OPTIONAL), until (OPTIONAL),
    set (OPTIONAL), resumptionToken (EXCLUSIVE)
  • 1.1 errors: no records match
  • 2.0 arguments: from (OPTIONAL), until (OPTIONAL),
    set (OPTIONAL), resumptionToken (EXCLUSIVE),
    metadataPrefix (REQUIRED)
  • 2.0 errors: badArgument, cannotDisseminateFormat,
    badResumptionToken, noSetHierarchy, noRecordsMatch

36
Harvesting requests: GetRecord
[Figure: the service provider (harvester) sends GetRecord with
identifier=oai:mlib:123a and metadataPrefix=dc to the data
provider (repository), which returns:]
  • GetRecord / Time / Request
  • Identifier
  • Datestamp
  • Metadata
37
GetRecord
  • 1.1 arguments: identifier (REQUIRED),
    metadataPrefix (REQUIRED)
  • 1.1 errors: id does not exist; metadata format
    cannot be disseminated
  • 2.0 arguments: identifier (REQUIRED),
    metadataPrefix (REQUIRED)
  • 2.0 errors: badArgument, cannotDisseminateFormat,
    idDoesNotExist

38
Argument Summary
Verb                 metadataPrefix  from      until     set       resumptionToken  identifier
Identify             -               -         -         -         -                -
ListMetadataFormats  -               -         -         -         -                optional
ListSets             -               -         -         -         exclusive        -
ListIdentifiers      required        optional  optional  optional  exclusive        -
ListRecords          required        optional  optional  optional  exclusive        -
GetRecord            required        -         -         -         -                required
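
The argument rules can be encoded as a small lookup table, sketched below. The rules are transcribed from the 2.0 columns of the per-verb slides; the function name `check_arguments` and its return convention are assumptions made for this example.

```python
# Argument rules per verb (OAI-PMH 2.0, as listed on the per-verb slides).
# "exclusive" means the argument must be the only one in the request.
_LIST_RULES = {
    "metadataPrefix": "required",
    "from": "optional",
    "until": "optional",
    "set": "optional",
    "resumptionToken": "exclusive",
}
RULES = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": dict(_LIST_RULES),
    "ListRecords": dict(_LIST_RULES),
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}


def check_arguments(verb, arguments):
    """Return None if the request is legal, else 'badVerb' or 'badArgument'."""
    if verb not in RULES:
        return "badVerb"
    rules = RULES[verb]
    names = set(arguments)
    if names - set(rules):
        return "badArgument"          # argument not defined for this verb
    if any(rules[name] == "exclusive" for name in names):
        # An exclusive argument may not be combined with anything else.
        return None if len(names) == 1 else "badArgument"
    missing = [n for n, rule in rules.items()
               if rule == "required" and n not in names]
    return "badArgument" if missing else None


print(check_arguments("ListRecords", {"resumptionToken": "abc", "set": "cs"}))
```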
39
Error Summary
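Verb                 Errors
Identify             BA
ListMetadataFormats  BA NMF IDDNE
ListSets             BA BRT NSH
ListIdentifiers      BA BRT CDF NRM NSH
ListRecords          BA BRT CDF NRM NSH
GetRecord            BA CDF IDDNE

  • BA = badArgument, BRT = badResumptionToken,
    CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist,
    NMF = noMetadataFormats, NRM = noRecordsMatch,
    NSH = noSetHierarchy
  • Generate badVerb on any input not matching the 6
    defined verbs
  • this is an inversion of the table in section 3.6
    of the OAI-PMH specification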
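
The "inversion" mentioned here can be shown in a few lines: the slide lists errors per verb, and flipping the mapping gives verbs per error. This is only a sketch; the dictionary is transcribed from the 2.0 error lists on the per-verb slides, and the helper name `invert` is made up.

```python
# Verb -> error codes, as on the Error Summary slide.
_LIST_ERRORS = ["badArgument", "badResumptionToken",
                "cannotDisseminateFormat", "noRecordsMatch", "noSetHierarchy"]
ERRORS_BY_VERB = {
    "Identify": ["badArgument"],
    "ListMetadataFormats": ["badArgument", "noMetadataFormats", "idDoesNotExist"],
    "ListSets": ["badArgument", "badResumptionToken", "noSetHierarchy"],
    "ListIdentifiers": list(_LIST_ERRORS),
    "ListRecords": list(_LIST_ERRORS),
    "GetRecord": ["badArgument", "cannotDisseminateFormat", "idDoesNotExist"],
}


def invert(errors_by_verb):
    """Flip verb -> errors into error -> verbs."""
    verbs_by_error = {}
    for verb, errors in errors_by_verb.items():
        for error in errors:
            verbs_by_error.setdefault(error, []).append(verb)
    return verbs_by_error


VERBS_BY_ERROR = invert(ERRORS_BY_VERB)
for error in sorted(VERBS_BY_ERROR):
    print(error, "->", ", ".join(VERBS_BY_ERROR[error]))
```

badVerb does not appear in either direction because it is raised before any verb has been recognized.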
40
Flow Control
  • ListSets, ListIdentifiers, ListRecords are all
    allowed to return partial responses, via a
    combination of
  • resumptionToken: an opaque, archive-defined data
    string that, when passed back to the archive,
    allows the response to begin where it left off
  • each archive defines its own resumptionToken
    syntax; it may have visible semantics or not
  • HTTP status code 503 with Retry-After
  • up to the harvester to understand this code and
    respect it, and up to the archive to enforce it
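
A harvester that respects the 503 / Retry-After signal might compute its pause roughly as follows. This is a sketch: the function name, the plain-dict `headers` argument and the 60-second fallback are assumptions for illustration, not part of the protocol.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_to_wait(status, headers, now=None):
    """Return how long to pause before retrying (0 means no pause needed)."""
    if status != 503:
        return 0
    value = headers.get("Retry-After", "").strip()
    if not value:
        return 60                     # assumed fallback, not from the slides
    if value.isdigit():
        return int(value)             # delay expressed in seconds
    # Otherwise the header carries an HTTP date.
    now = now or datetime.now(timezone.utc)
    retry_at = parsedate_to_datetime(value)
    return max(0, int((retry_at - now).total_seconds()))


print(seconds_to_wait(503, {"Retry-After": "120"}))
```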

41
resumptionToken
scenario: harvesting 277 records in 3
separate chunks of up to 100 records
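
The 277-record scenario can be simulated with a toy in-memory repository. Everything here is hypothetical: the identifiers are made up, and the token is simply an offset (a real repository defines its own opaque token syntax).

```python
CHUNK = 100
RECORDS = [f"oai:example:{n}" for n in range(277)]   # made-up identifiers


def list_identifiers(resumption_token=None):
    """Toy repository: return (chunk, next_token)."""
    start = int(resumption_token) if resumption_token else 0
    chunk = RECORDS[start:start + CHUNK]
    end = start + len(chunk)
    # No token on the last chunk tells the harvester the list is complete.
    next_token = str(end) if end < len(RECORDS) else None
    return chunk, next_token


def harvest():
    """Re-issue the request until no resumptionToken comes back."""
    harvested, requests, token = [], 0, None
    while True:
        chunk, token = list_identifiers(token)
        harvested.extend(chunk)
        requests += 1
        if token is None:
            return harvested, requests


records, requests = harvest()
print(len(records), "records in", requests, "requests")
```

The harvester never inspects the token; it only hands it back, which is what lets each archive choose its own syntax.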
42
Let's Look at Some Repositories
  • Repository Explorer
  • http://www.purl.org/NET/oai_explorer

43
OAI-PMH 1.0, 1.1, 2.0
44
Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
45
Santa Fe Convention 02/2000
  • goal: optimize discovery of e-prints
  • input:
  • the UPS prototype
  • RePEc /SODA data provider / service provider
    model
  • Dienst protocol
  • deliberations at Santa Fe meeting 10/99

46
OAI-PMH v.1.0 01/2001
  • goal: optimize discovery of document-like
    objects
  • input:
  • SFC
  • DLF meetings on metadata harvesting
  • deliberations at Cornell meeting 09/00
  • alpha test group of OAI-PMH v.1.0

47
OAI-PMH v.1.0 01/2001
  • low-barrier interoperability specification
  • metadata harvesting model: data provider /
    service provider
  • focus on document-like objects
  • autonomous protocol
  • HTTP based
  • XML responses
  • unqualified Dublin Core
  • experimental: 12-18 months

48
Selected Pre- 2.0 OAI Highlights
  • October 21-22, 1999 - initial UPS meeting
  • February 15, 2000 - Santa Fe Convention published
    in D-Lib Magazine
  • precursor to the OAI metadata harvesting protocol
  • June 3, 2000 - workshop at ACM DL 2000 (Texas)
  • August 25, 2000 - OAI steering committee formed,
    DLF/CNI support
  • September 7-8, 2000 - technical meeting at
    Cornell University
  • defined the core of the current OAI metadata
    harvesting protocol
  • September 21, 2000 - workshop at ECDL 2000
    (Portugal)
  • November 1, 2000 - Alpha test group announced
    (15 organizations)
  • January 23, 2001 - OAI protocol 1.0 announced,
    OAI Open Day in the U.S. (Washington DC)
  • purpose: freeze protocol for 12-16 months,
    generate critical mass
  • February 26, 2001 - OAI Open Day in Europe
    (Berlin)
  • July 3, 2001 - OAI protocol 1.1 announced
  • to reflect changes in the W3C's latest XML Schema
    recommendation
  • September 8, 2001 - workshop at ECDL 2001
    (Darmstadt)

49
OAI-PMH v.2.0 06/2002
  • goal: recurrent exchange of metadata about
    resources between systems
  • input:
  • OAI-PMH v.1.0
  • feedback on OAI-implementers
  • deliberations by OAI-tech 09/01 - 06/02
  • alpha test group of OAI-PMH v.2.0 03/02 -
    06/02
  • officially released June 14, 2002

50
OAI-PMH v.2.0 06/2002
  • low-barrier interoperability specification
  • metadata harvesting model: data provider /
    service provider
  • metadata about resources
  • autonomous protocol
  • HTTP based
  • XML responses
  • unqualified Dublin Core
  • stable

51
releasing OAI-PMH v.2.0 (illustrating the OAI
process)
  • See also: Lagoze, Carl and Van de Sompel, Herbert.
    "The making of the Open Archives Initiative
    Protocol for Metadata Harvesting." Library Hi Tech,
    21(2), 2003. (Draft)
52
  • creation of OAI-tech
  • pre-alpha phase
  • alpha-phase
  • beta-phase

53
creation of OAI-tech 06/01
  • created for 1 year period
  • charge
  • review functionality and nature of OAI-PMH v.1.0
  • investigate extensions
  • release stable version of OAI-PMH by 05/02
  • determine need for infrastructure to support
    broad adoption of the protocol
  • communication: listserv, SourceForge, conference
    calls

54
OAI-tech
  • US representatives: Thomas Krichel (Long Island U),
    Jeff Young (OCLC), Tim Cole (U of Illinois at
    Urbana-Champaign), Hussein Suleman (Virginia Tech),
    Simeon Warner (Cornell U), Michael Nelson (NASA),
    Caroline Arms (LoC), Mohammad Zubair (Old Dominion
    U), Steven Bird (U Penn.)
  • European representatives: Andy Powell (Bath U. &
    UKOLN), Mogens Sandfaer (DTV), Thomas Baron (CERN),
    Les Carr (U of Southampton)
55
pre-alpha phase 09/01 - 02/02
  • review process by OAI-tech
  • identification of issues
  • conference call to filter/combine issues
  • white paper per issue
  • on-line discussion per white paper
  • proposal for resolution of issue by OAI-exec
  • discussion of proposal & closure of issue
  • conference call to resolve open issues

56
pre-alpha phase 02/02
  • creation of revised protocol document
  • in-person meeting: Lagoze - Van de Sompel -
    Nelson - Warner
  • autonomous decisions
  • internal vetting of protocol document

57
alpha phase 02/02 - 05/02
  • alpha-1 release to OAI-tech March 1st, 2002
  • OAI-tech extended with alpha testers
  • discussions/implementations by OAI-tech
  • ongoing revision of protocol document

58
OAI-PMH 2.0 alpha testers (1/2)
  • The British Library
  • Cornell U. -- NSDL project & e-print arXiv
  • Ex Libris
  • FS Consulting Inc -- harvester for my.OAI
  • Humboldt-Universität zu Berlin
  • InQuirion Pty Ltd, RMIT University
  • Library of Congress
  • NASA
  • OCLC

59
OAI-PMH 2.0 alpha testers (2/2)
  • Old Dominion U. -- Arc, DP9
  • U. of Illinois at Urbana-Champaign
  • U. of Southampton -- OAIA (now Celestial),
    CiteBase, eprints.org
  • UCLA, Johns Hopkins U., Indiana U., NYU -- sheet
    music collection
  • UKOLN, U. of Bath -- RDN
  • Virginia Tech -- repository explorer

60
beta phase 05/02-06/02
  • beta release on May 1st 2002 to
  • registered data providers and service providers
  • interested parties
  • fine tuning of protocol document
  • preparation for the release of 2.0 conformant
    tools by alpha testers

61
OAI-PMH v.2.0 highlights
62
  • quick recap
  • important improvements in 2.0
  • corrections
  • new functionality

63
important improvements
64
protocol vs periphery
  • clear distinction between protocol and periphery
  • fixed protocol document
  • extensible implementation guidelines
  • e.g. sample metadata formats, description
    containers, about containers
  • allows for OAI guidelines and community
    guidelines

65
OAI-PMH vs HTTP
  • clear separation of OAI-PMH and HTTP
  • OAI-PMH error handling
  • all OK at HTTP level? => 200 OK
  • something wrong at OAI-PMH level? => OAI-PMH
    error (e.g. badVerb)
  • HTTP codes 302, 503, etc. still available to
    implementers, but no longer represent OAI-PMH
    events

66
other improvements
  • better definitions of harvester, repository,
    item, unique identifier, record, set, selective
    harvesting
  • oai_dc schema builds on DCMI XML Schema for
    unqualified Dublin Core
  • usage of "must", "must not", etc. as in RFC 2119
  • wording on response compression

67
other improvements
  • all protocol responses can be validated with a
    single XML Schema
  • easier for data providers
  • no redundancy in type definitions
  • SOAP-ready
  • clean for error handling

68
response: no errors

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" ...>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        ..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
69
response: with error

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
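
Because errors now travel inside a 200 OK response, a harvester has to look for error elements in the XML rather than at the HTTP status. A minimal sketch with the standard library follows; the sample mirrors the error response on this slide, with the OAI-PMH 2.0 namespace declaration added (the slide omits it), and the helper name `oai_errors` is made up.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses (not shown on the slide).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


def oai_errors(xml_text):
    """Return a list of (code, message) pairs; empty if the response is clean."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(element.get("code"), (element.text or "").strip())
            for element in root.iter(OAI_NS + "error")]


for code, message in oai_errors(SAMPLE):
    print(code, "-", message)
```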
70
corrections
71
dates/times
  • all dates/times are UTC, encoded in ISO8601,
    Z-notation
  • 1957-03-20T20:30:00Z
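
A short sketch of producing and reading this Z-notation with the Python standard library. The helper names and the fixed -05:00 offset are chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

OAI_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_oai_datestamp(moment):
    """Convert a timezone-aware datetime to UTC and format it with a Z suffix."""
    if moment.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime(OAI_FORMAT)


def from_oai_datestamp(text):
    """Parse a seconds-granularity datestamp back into an aware UTC datetime."""
    return datetime.strptime(text, OAI_FORMAT).replace(tzinfo=timezone.utc)


# 15:30 at a fixed UTC-05:00 offset is 20:30 UTC.
local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_oai_datestamp(local))
```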

72
resumptionToken
  • idempotency of resumptionToken: return same
    incomplete list when rT is reissued
  • while no changes occur in the repo: strict
  • while changes occur in the repo: all items with
    unchanged datestamp
  • new, optional attributes for the resumptionToken
  • expirationDate
  • completeListSize
  • cursor

73
noRecordsMatch
  • 1.x - if no records match, an empty list was
    returned

74
noRecordsMatch
  • 2.0 - if no records match, the error condition
    noRecordsMatch is returned -- not an empty list

75
new functionality
76
harvesting granularity
  • harvesting granularity
  • mandatory support of YYYY-MM-DD
  • optional support of YYYY-MM-DDThh:mm:ssZ
  • other granularities considered, but ultimately
    rejected
  • granularity of from and until must be the same
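
A repository-side check of these rules could look like the sketch below. The function names and the `supports_seconds` flag are assumptions for the example; rejecting a finer granularity than the repository supports with badArgument is also an assumption of this sketch. The check covers syntax only, not calendar validity.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(value):
    """Classify a from/until value, or return None if it matches neither form."""
    if DAY.fullmatch(value):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(value):
        return "YYYY-MM-DDThh:mm:ssZ"
    return None


def check_range(from_=None, until=None, supports_seconds=False):
    """Return None if the date arguments are acceptable, else 'badArgument'."""
    seen = [granularity(v) for v in (from_, until) if v is not None]
    if None in seen:
        return "badArgument"          # unrecognized syntax
    if len(set(seen)) > 1:
        return "badArgument"          # from and until differ in granularity
    if seen and seen[0] == "YYYY-MM-DDThh:mm:ssZ" and not supports_seconds:
        return "badArgument"          # repository only offers day granularity
    return None


print(check_range("2001-01-01", "2001-12-31T00:00:00Z", supports_seconds=True))
```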

77
Identify
  • Identify more expressive

<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
  <adminEmail>rgillian@visi.net</adminEmail>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <deletedRecord>transient</deletedRecord>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
78
header
  • header contains set membership of item

<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    ..
  </metadata>
</record>
79
ListIdentifiers
  • ListIdentifiers returns headers

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="..." ...>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
80
ListIdentifiers
  • ListIdentifiers mandates metadataPrefix as
    argument

http://www.perseus.tufts.edu/cgi-bin/pdataprov?
  verb=ListIdentifiers&metadataPrefix=olac
  &from=2001-01-01&until=2001-01-01
  &set=Perseus:collection:PersInfo
81
ListIdentifiers
  • the changes to ListIdentifiers are subtle, and
    reflect a change in the OAI-PMH data model
  • Could have been named ListHeaders or reduced to
    an option for ListRecords
  • ListIdentifiers kept for lexicographical
    consistency

82
metadataPrefix
  • character set for metadataPrefix and setSpec
    extended to URL-safe characters

A-Z a-z 0-9 _ ! ( ) - .
83
in the periphery
84
provenance
  • introduction of provenance container to
    facilitate tracing of harvesting history

<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription> </originDescription>
    </originDescription>
  </provenance>
</about>
85
friends
  • introduction of friends container to facilitate
    dynamic discovery of repositories

<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>
86
branding
  • introduction of branding container for DPs to
    suggest rendering & association hints

<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
                        http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>

87
oai-identifier
  • revision of oai-identifier

<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                          http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>

domain based repository names
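
The scheme / repositoryIdentifier / delimiter layout means an identifier can be split mechanically. A small sketch, assuming ":" as the delimiter; the class and function names are made up for this example.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    scheme: str
    repository: str
    local: str


def parse_oai_identifier(text, delimiter=":"):
    """Split scheme, repository identifier and local identifier."""
    # Split at most twice so the local part may itself contain the delimiter.
    parts = text.split(delimiter, 2)
    if len(parts) != 3 or parts[0] != "oai" or not all(parts):
        raise ValueError(f"not an oai-identifier: {text!r}")
    return OaiIdentifier(*parts)


parsed = parse_oai_identifier("oai:oai-stuff.foo.org:5324")
print(parsed.repository, parsed.local)
```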
88
oai_dc
  • OAI 1.x: oai_dc Schema defined by OAI
  • OAI 2.0: oai_dc Schema imports from DCMI Schema
    for unqualified DC elements

89
MARC21
  • OAI 1.x: oai_marc
  • OAI 2.0: LoC marcxml, oai_marc
  • http://www.loc.gov/standards/marcxml/

90
did not make it into OAI-PMH v.2.0
91
  • SOAP implementation
  • Result set filtering
  • Multiple / best metadata
  • GetRecord -> GetRecords
  • Machine readable rights management
  • XML format for mini-archives

92
Example Data and Service Providers
93
NTRS OAI Architecture
[Figure: a user searches (e.g. for "cfd applications") NTRS, which
holds a local copy of metadata harvested offline, through the OAI
interface, from independently maintained nodes (LTRS, ATRS, GTRS,
CASITRS, ...). All searching, browsing, etc. is performed on the
metadata at NTRS; individual nodes can still support direct user
interaction; content (reports) remains archived at the local
sites.]
94
NASA Technical Report Server
  • replacement for the previous distributed
    searching version of NTRS
  • MySQL
  • Va Tech harvester
  • modified bucket
  • details in Nelson, Rocker, Harrison, Library
    Hi-Tech, 21(2) (March 2003)
  • a service provider & aggregator
  • same OAI baseURL as used for interactive searching

http://ntrs.nasa.gov/
95
NASA Technical Report Server
  • advanced, fielded search
  • explicit query routing
  • 12 NASA repositories
  • 4 non-NASA repositories
  • turned off by default
  • > 600k abstracts, > 300k full-text

96
NASA DLs in the Larger STI Realm
[Figure: NASA DLs linked with DOE, DOD, Universities, Publishers,
International, ...; this could be a fully connected graph]
  • NTRS could also be a data provider from the point
    of view of other DLs, allowing the harvesting of
    NASA report metadata.
  • NTRS could also harvest metadata from other DLs,
    and provide access to non-NASA content.
  • We hope to influence the direction of the
    science.gov effort to use OAI-PMH
97
New Kinds of DLs
  • Drawing from the same pool of DPs
  • different interfaces, capabilities and collection
    policies for
  • public affairs
  • K-12 education
  • science research
  • authors / librarians / managers
  • NTRS and NIX could harvest from the same sources
  • be the same DL, but with different interfaces?
  • be replaced with a new, all-encompassing DL?
  • DL creators can now focus on collection
    management
  • ala carting their collections and sub
    collections
  • instead of fussing over syntax synchronization of
    remote search services

98
Scientific Communication
  • With only some exceptions, which interface is
    used for discovery is not as important as the
    fact that discovery occurred in the first place
  • control of the discovered objects is not lost
    by data providers
  • however, higher level mirroring services can be
    built on top of OAI (cf. NACA ARC mirroring
    between NASA LaRC and MAGiC)

99
NACA Technical Report Server
  • publicly available
  • began in 1996
  • details in NASA TM-1999-209127
  • scanned reports from 1917-1958
  • NACA: predecessor to NASA
  • contents mirrored with the MAGiC project
  • a UK-based grey-literature preservation project
  • OAI-PMH used to mirror contents

http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/
100
NACA Report 1345 as seen through its native
DL: http://naca.larc.nasa.gov/
101
NACA Report 1345 as seen through
MAGiC: http://www.magic.ac.uk/
102
NACA Report 1345 as seen through Scirus
(Elsevier): http://www.scirus.com/
103
NACA Report 1345 as seen through my.OAI (FS
Consulting): http://www.myoai.com/
104
What Does OAI-PMH Mean for Authors?
  • On the surface, absolutely nothing!
  • the ideal OAI deployment should be absolutely
    invisible to normal DL operations
  • uninterested users should not even notice or care
  • Indirectly, they should enjoy the benefits of the
    critical mass of current and developing DL tools
    & systems
  • personal, institutional data providers
  • proliferation of targeted, value-added service
    providers

105
What Does OAI-PMH Mean for Publishers &
Institutions?
  • Absolutely everything
  • The decoupling of SPs and DPs will have
    significant and profound implications on
    scientific and technical information exchange
  • OAI-PMH is actually just one component in a
    larger engineering effort for scholarly
    communication (e.g. OpenURL)
  • Service and resource integration will be the
    focus of journals, professional societies,
    universities, etc.
  • OAI-PMH will be a basic, core technology for
    scientific publishing, as HTTP & XML are

106
Field of Dreams
  • It should be easy to be a data provider, even if
    it makes more work for the service provider.
  • if enough data providers exist, the service
    providers will come (DPs >> SPs)
  • Open-source / freely available tools
  • drop-in data providers
  • industrial strength: http://www.eprints.org/
  • personal size: http://kepler.cs.odu.edu/
  • tools to make your existing DL a data provider
  • http://www.openarchives.org/tools/tools.htm
  • also OAI-implementers mailing list / mail
    archive!
  • service providers
  • Arc: http://sourceforge.net/projects/oaiarc/

107
OAI-PMH Meeting History
108
Shift of Topics
  • From the protocol itself, supporting debugging
    tools and how to retrofit (existing) DLs
  • to building (new) services that use the OAI-PMH
    as a core technology and reporting on their
    impact to the institution/community

109
Arc
  • http://arc.cs.odu.edu/
  • harvests all known archives
  • first end-user service provider
  • source available through SourceForge
  • hierarchical harvesting

110
NCSTRL
  • http://www.ncstrl.org/
  • metadata harvesting replacement for Dienst-based
    NCSTRL
  • based on Arc
  • computer science metadata

111
Archon
  • http://archon.cs.odu.edu/
  • physics metadata
  • based on Arc
  • features
  • citation indexing
  • equation-based searching

112
Torii
  • http://torii.sissa.it/
  • physics metadata
  • features
  • personalization
  • recommendations
  • WAP access

113
iCite
  • http://icite.sissa.it/
  • physics metadata
  • features
  • citation based access to arXiv metadata

114
my.OAI
  • http://www.myoai.com/
  • covers all registered metadata
  • features
  • result sets
  • personalization
  • many other advanced features

115
Cyclades
  • http://www.ercim.org/cyclades
  • scientific metadata
  • features
  • personalization
  • recommendations
  • collaboration
  • status?

116
citebase
  • http://citebase.eprints.org/
  • arXiv metadata
  • citation based indexing, reporting

117
OAIster
  • http://oaister.umdl.umich.edu/
  • harvests all known archives

118
Public Knowledge Project
  • http://www.pkp.ubc.ca/harvester/
  • domain-specific filtering of harvested metadata
    (?)

119
Perseus
  • http://www.perseus.tufts.edu/
  • they claim to harvest all DPs, but only
    humanities related DPs appear in the pull down
    menu

120
Others
  • Commercial publishers
  • American Physical Society (APS)
  • Institute of Physics (IOP)
  • Elsevier / Scirus (www.scirus.com)
  • Department of Energy
  • OSTI
  • LANL
  • Institutional servers
  • DSpace (MIT, www.dspace.org)
  • Eprints (www.eprints.org)
  • DARE (All Dutch universities)

121
Service Providers
  • It is clear that SPs are proliferating, despite
    (because of?) the inherent bias toward DPs in the
    protocol
  • easy to be a DP -> many DPs -> SPs eventually
    emerge
  • hard to be a DP -> SPs starve
  • currently 5x more DPs than SPs
  • SPs are beginning to offer increasingly
    sophisticated services
  • competitive market originally envisioned for SPs
    is emerging

122
OAI-PMH Observation: Front-End Only
  • No input/registry mechanism
  • OAI-PMH is always a front-end for something else
  • filesystem, Dienst, RDBMS, LDAP, etc.
  • convenient for pre-existing DLs, but does not
    address new DLs
  • e.g., "we want to do OAI"
  • Bounds the scope of OAI
  • tension between functionality and simplicity

123
OAI-PMH Observation: No T&C
  • No terms & conditions (T&C) provisions
  • assumes all metadata has uniform access rights
  • how to restrict metadata to certain hosts?
  • (see upcoming OAI-rights discussion)
  • introducing T&C would increase the scope of
    application, but at the expense of simplicity
  • how expensive do we want to make a
    just-a-front-end protocol?

124
OAI-PMH Observation: No T&C
  • Possible to use multiple repositories in a
    DMZ-like configuration

[Diagram labels: OAI requests from trusted hosts; OAI requests from arbitrary hosts; Public OAI Server; Private OAI Server; Source database (could even use a separate copy of the database)]
125
OAI-PMH Observation: No T&C
  • Possible to use OAI-PMH in closed, restricted
    systems

[Diagram: four repositories (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate from these 4 DLs]
see Technical Report Interchange Project:
http://www.cs.odu.edu/~mln/pubs/tri.pdf
126
OAI-PMH Observation: Monolithic
  • A repository has no protocol-defined concept of
    other OAI repositories
  • <friends> was added in 2.0
  • backups, mirrors, etc. have to be resolved
    outside of the scope of OAI
  • scope vs. complexity again
  • fully connected graph of DLs harvesting from each
    other is unnecessary
  • cf. web crawlers vs. gatherers in U of Colorado's
    Harvest System
  • 3rd party harvesting interfaces raise more T&C
    and data coherency issues

127
OAI-PMH Observation: Data Coherency
  • In the interest of implementer simplicity,
    several issues are left for the service provider
    to interpret
  • what is an update vs. addition?
  • in the NACA repository, they are reported as the
    same and it's up to the harvesting system to
    figure it out
  • deletions?
  • it is currently optional for repositories to mark
    records as deleted or not
  • still left to the harvester to interpret
  • Liu, et al., JCDL 2003: Repository
    Synchronization in the OAI Framework
  • http://www.cs.odu.edu/~mln/pubs/freshness-jcdl.pdf
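
A minimal sketch of how a harvester might do that interpretation on its own side. The function name apply_record, its arguments, and the in-memory dictionary store are hypothetical and only illustrate the idea: the repository reports additions and updates the same way, so the harvester tells them apart by checking whether it has seen the identifier before, and it handles a deleted flag only if the repository chooses to send one.

```python
def apply_record(store, identifier, datestamp, deleted, metadata):
    """Apply one harvested record to a local store and report what happened.

    store is a plain dict keyed by OAI identifier (an assumption made for
    this sketch; a real harvester would likely use a database).
    """
    known = identifier in store
    if deleted:
        # Deletion support is optional, so this branch may never run
        # for repositories that do not track deleted records.
        store.pop(identifier, None)
        return "deleted" if known else "ignored"
    # Additions and updates arrive identically; only local state differs.
    store[identifier] = {"datestamp": datestamp, "metadata": metadata}
    return "updated" if known else "added"
```

A harvester that never receives deleted records from a repository would need some other strategy, such as an occasional full re-harvest, to notice removals.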

128
OAI-PMH Observation: Harvest Model
  • Frequency of harvests
  • all-at-once harvests?
  • initial harvest
  • resolving data coherency
  • frequent incremental harvests?
  • far more efficient for both service and data
    providers
  • Webcrawling vs. digital library models
  • webcrawlers: little to no a priori information
    about target
  • DLs: frequent harvesting of a small number of
    known targets
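
A minimal sketch of the difference between an all-at-once and an incremental harvest request. The helper name list_records_url and the example base URL are assumptions for illustration; verb, metadataPrefix and from are the protocol's own request arguments.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL.

    With from_date=None this asks for everything (initial harvest);
    with a date it asks only for records changed since then.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository base URL, used only for illustration.
initial = list_records_url("http://example.org/oai")
incremental = list_records_url("http://example.org/oai", from_date="2004-06-01")
```

A harvester would typically remember the date of its last successful harvest per repository and pass it as from_date on the next run.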

129
DC?!
  • Metadata
  • Q: Which format should I use?
  • A: any/all of them
  • lowest common denominator: unqualified Dublin
    Core
  • Again, little known about actual behavior
  • will DC actually be useful? or too lossy?
  • will communities create/adopt specific formats?
  • will native (presumably richer) formats be
    harvested?
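
A minimal sketch of how a harvester might act on the "any/all of them" answer: prefer a richer format when the repository offers one, and fall back to unqualified Dublin Core (oai_dc) otherwise. The function name choose_prefix and the example prefixes other than oai_dc are assumptions for illustration.

```python
def choose_prefix(offered, preferred):
    """Pick the first preferred metadataPrefix the repository offers.

    offered:   prefixes the repository advertises (e.g. from ListMetadataFormats)
    preferred: the harvester's own ranking, richest format first
    Falls back to oai_dc, the lowest common denominator.
    """
    for prefix in preferred:
        if prefix in offered:
            return prefix
    return "oai_dc"
```

For example, a harvester that understands two richer formats would pass them in ranked order and still get usable Dublin Core from repositories that offer nothing else.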

130
XML Observations
  • Service providers
  • XML can be pretty picky: a large ListRecords
    result can be invalidated with a single error
  • harvest in chunks? individual records?
  • author-contributed metadata is particularly a
    problem (e.g. control characters from
    copy-n-paste)
  • one advantage of resumptionToken is that it
    compartmentalizes bad data
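
A minimal sketch of both points, assuming each resumptionToken response has already been fetched as a separate string. The function names parse_chunks and strip_control_chars are hypothetical. Parsing chunk by chunk means one bad response costs only that chunk, and stripping the control characters that XML 1.0 forbids is one possible repair for copy-n-paste damage.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that are not allowed in XML 1.0
# (everything below 0x20 except tab, line feed and carriage return).
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Remove characters that would make an XML 1.0 parser reject the text."""
    return _ILLEGAL.sub("", text)


def parse_chunks(chunks):
    """Parse each response chunk on its own.

    Returns (list of parsed root elements, list of indexes of chunks that
    failed to parse), so one malformed chunk does not discard the rest.
    """
    good, bad = [], []
    for index, text in enumerate(chunks):
        try:
            good.append(ET.fromstring(text))
        except ET.ParseError:
            bad.append(index)
    return good, bad
```

Whether to repair a bad chunk, re-request it record by record, or simply log and skip it is a policy choice left to the service provider.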

131
Why The OAI-PMH is NOT Important
  • Users don't care
  • OAI-PMH is middleware
  • if done right, the uninterested user should never
    have to know
  • Using OAI-PMH does not ensure a good SP
  • OAI-PMH is (or is becoming) HTTP for DLs
  • few people get excited about HTTP now
  • HTTP and OAI-PMH are core technologies whose
    presence is now assumed