Title: Institutional Archives Technology Overview
1Institutional Archives Technology Overview
Michael L. Nelson
Old Dominion University
mln@cs.odu.edu
http://www.cs.odu.edu/~mln/
- Institutional Archives & Repositories: What this digital movement means for Federal Libraries
- Library of Congress Workshop
- September 12, 2003
2Acknowledgements
- ODU: K. Maly, M. Zubair, J. Bollen
- LANL: R. Luce, X. Liu
- NASA: G. Roncaglia, J. Rocker
- Cornell: C. Lagoze, S. Warner
- MAGiC (UK): Paul Needham
- and, of course, Herbert Van de Sompel (LANL)
- the OpenURL slides are nicked from his presentations
3Outline
- A bit of history
- Core technologies
- OAI-PMH
- OpenURL
- Example implementations
- Download and go
4OAI-PMH
5Background
- I met Herbert Van de Sompel in April 1999...
- we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
- We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.
- most digital libraries (DLs) had grown up along single disciplines or institutions
- little to no interoperability: isolated DL "gardens"
6Universal Preprint Service
- A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
- Nelson: NCSTRL+, a modified version of Dienst
- support for clustering
- support for buckets
- Krichel: ReDIF metadata format
- Van de Sompel: SFX Linking
- Demonstrated at Santa Fe NM, October 21-22, 1999
- http://web.archive.org/web/*/http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
- http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative (OAI): http://www.openarchives.org/
7Data and Service Providers
- Self-describing archives
- Much of the learning about the constituent UPS archives occurred out of band
- Data Providers
- publishing into an archive
- providing methods for metadata harvesting
- providing non-technical context for sharing information also
- Service Providers
- harvest metadata from providers
- implement user interface to data
Even if these are done by the same DL, these are distinct roles
8Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
[Figure: a user query (e.g. "search for cfd applications") goes to a service holding a local copy of metadata; all searching, browsing, etc. is performed on the metadata there. The metadata is harvested offline from each node. Each node is independently maintained and can still support direct user interaction.]
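The harvesting model above can be illustrated with a minimal sketch. It assumes the metadata has already been harvested offline; the archive names, record fields, and titles are made up for illustration and do not come from any real repository.

```python
from typing import Dict, List


def build_local_copy(harvests: Dict[str, List[dict]]) -> List[dict]:
    """Merge records harvested from several archives into one local list."""
    local_copy = []
    for archive_name, records in harvests.items():
        for record in records:
            # Remember which archive each record came from; the data itself
            # stays at the remote repository.
            local_copy.append({"archive": archive_name, **record})
    return local_copy


def search(local_copy: List[dict], query: str) -> List[dict]:
    """Return records whose title contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return [r for r in local_copy
            if all(term in r["title"].lower() for term in terms)]


# Made-up harvest results from two hypothetical archives.
harvests = {
    "archive-a": [{"id": "a1", "title": "CFD Applications in Wing Design"}],
    "archive-b": [{"id": "b1", "title": "Metadata Formats for E-prints"},
                  {"id": "b2", "title": "Applications of CFD to Inlets"}],
}
local_copy = build_local_copy(harvests)
hits = search(local_copy, "cfd applications")
print([h["id"] for h in hits])
```

The search never contacts the archives: it runs entirely against the local copy, which is the point of moving away from distributed searching.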
9Result: OAI
- The OAI was the result of the demonstration and discussion during the Santa Fe meeting
- OAI: a bunch of people, a religion, a cult, etc.
- OAI Protocol For Metadata Harvesting (OAI-PMH): the protocol created and maintained by the OAI
- Initial focus was on federating collections of scholarly e-print materials
- however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol
- Note:
- OAI-PMH is only about metadata -- not full text!
- but what is metadata vs. full-text?
- OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
- read: commercial publishers have an interest in OAI-PMH too...
10Open Archives Initiative
11OAI-PMH Mechanics
Requests are encoded in HTTP
Responses are encoded in XML
XML Schemas for the responses are defined in the OAI-PMH document
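As a rough illustration of these mechanics, the sketch below builds a request as an HTTP query string and reads one field out of an XML response. The base URL, identifier, and repository name are hypothetical, and the sample response is abbreviated (a real Identify response carries more elements); only the standard library is used and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request(base_url, verb, arguments=None):
    """Encode an OAI-PMH request as a base URL plus query string."""
    params = [("verb", verb)] + sorted((arguments or {}).items())
    return base_url + "?" + urlencode(params)


# Abbreviated, made-up response for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-09-12T12:00:00Z</responseDate>
  <request verb="Identify">http://repository.example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""


def repository_name(xml_text):
    """Pull the repository name out of an Identify response."""
    root = ET.fromstring(xml_text)
    return root.findtext(f"{OAI_NS}Identify/{OAI_NS}repositoryName")


url = build_request(
    "http://repository.example.org/oai",  # hypothetical baseURL
    "GetRecord",
    {"identifier": "oai:example.org:1345", "metadataPrefix": "oai_dc"},
)
print(url)
print(repository_name(SAMPLE_RESPONSE))
```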
12Overview of OAI-PMH Verbs
archival metadata
harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
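The flow-control role of the resumption token can be sketched as a simple loop. The six verb names are those of OAI-PMH 2.0; assigning them to the two slide labels is an assumed reading of the slide. `fetch_page` is a hypothetical callable standing in for the HTTP request and XML parsing, and the fake implementation below returns made-up identifiers so the sketch runs without a network. The loop is limited to the two list verbs used for harvesting.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# The six OAI-PMH 2.0 verbs, grouped by the two labels on the slide
# (the grouping is an assumption).
ARCHIVAL_METADATA_VERBS = ("Identify", "ListMetadataFormats", "ListSets")
HARVESTING_VERBS = ("ListIdentifiers", "ListRecords", "GetRecord")

# A page is (items, resumption token or None when the list is complete).
Page = Tuple[List[str], Optional[str]]


def harvest(fetch_page: Callable[[Dict[str, str]], Page],
            verb: str = "ListIdentifiers",
            metadata_prefix: str = "oai_dc") -> Iterator[str]:
    """Yield every item of a list response, following resumption tokens."""
    if verb not in ("ListIdentifiers", "ListRecords"):
        raise ValueError("this sketch only pages ListIdentifiers/ListRecords")
    arguments = {"verb": verb, "metadataPrefix": metadata_prefix}
    while True:
        items, token = fetch_page(arguments)
        yield from items
        if not token:
            return
        # A follow-up request carries only the verb and the token.
        arguments = {"verb": verb, "resumptionToken": token}


def fake_fetch_page(arguments: Dict[str, str]) -> Page:
    """Stand-in for a repository that splits its list into two pages."""
    pages = {
        None: (["oai:example.org:1", "oai:example.org:2"], "token-1"),
        "token-1": (["oai:example.org:3"], None),
    }
    return pages[arguments.get("resumptionToken")]


identifiers = list(harvest(fake_fetch_page))
print(identifiers)
```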
13OAI-PMH Data Model
- item: identifier
- record: identifier + metadata format + datestamp
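One way to picture this data model is as two small types: an item known by its identifier, and records that pair that identifier with a metadata format and a datestamp. The class names, field names, and sample values below are illustrative assumptions, not part of the protocol.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Record:
    """A record: identifier + metadata format + datestamp (plus the metadata)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str = ""


@dataclass
class Item:
    """An item: one identifier, with one record per metadata format."""
    identifier: str
    records: Dict[str, Record] = field(default_factory=dict)

    def add_record(self, record: Record) -> None:
        if record.identifier != self.identifier:
            raise ValueError("record belongs to a different item")
        self.records[record.metadata_prefix] = record


# Made-up identifier and metadata, for illustration only.
item = Item("oai:example.org:report-1345")
item.add_record(Record(item.identifier, "oai_dc", "2003-09-12", "<dc>...</dc>"))
item.add_record(Record(item.identifier, "marc21", "2003-09-12", "<marc>...</marc>"))
print(sorted(item.records))
```

The same item can thus be disseminated in several metadata formats, each as its own record.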
14Data Providers / Service Providers
15Aggregators
- aggregators allow for
- scalability for OAI-PMH
- load balancing
- community building
- discovery
[Figure: an aggregator sits between service providers (harvesters) and data providers (repositories)]
16Aggregators
- Frequently interchangeable terms:
- aggregators: likely to be community / institutionally focused
- caches: store a copy, less likely to be community-oriented
- proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
- Dienst / OAI Gateway: Harrison, Nelson, Zubair, JCDL 03
- To learn more about aggregators, caches & proxies:
- http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
- http://www.cs.odu.edu/~mln/jcdl03/
17Example Aggregators
- Arc - http://arc.cs.odu.edu/
- first described hierarchical harvesting in D-Lib Magazine, 7(4) 2001
- http://www.dlib.org/dlib/april01/liu/04liu.html
- Celestial - http://celestial.eprints.org/
- among other services, it provides a history of harvests (successful vs. errors)
- http://celestial.eprints.org/cgi-bin/status
18OAI-PMH 2.0 Registration
- unregistered because:
- testing / development
- not for public harvesting
- public, but low-profile
- never got around to it
- ???
??? unregistered repositories
75 repositories registered
DP:SP ratio 5:1
Data Providers: http://www.openarchives.org/Register/BrowseSites.pl
Service Providers: http://www.openarchives.org/service/listproviders.html
19Registration is Nice, But Not Required
- OAI-PMH is (becoming) the HTTP for digital libraries
- there is no central registry of HTTP servers
- remember the NCSA "What's New" page? (ca. 1994)
- There will never be registration support in OAI-PMH
- registries are a type of service provider, built on top of OAI-PMH
- registration will be an integral part of community building
- friends
20NASA <friends> example
21Field of Dreams
- It should be easy to be a data provider, even if it makes more work for the service provider.
- if enough data providers exist, the service providers will come (DPs >> SPs)
- Open-source / freely available tools
- drop-in data providers
- at the end of this presentation
- tools to make your existing DL a data provider
- http://www.openarchives.org/tools/tools.htm
- also: OAI-implementers mailing list / mail archive!
- service providers
- http://oaiarc.sourceforge.net/
22OAI-PMH Meeting History
23Shift of Topics
- From the protocol itself, supporting debugging tools and how to retrofit (existing) DLs
- to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community
24Arc
- http://arc.cs.odu.edu/
- harvests all known archives
- first end-user service provider
- source available through SourceForge
- hierarchical harvesting
25NCSTRL
- http://www.ncstrl.org/
- metadata harvesting replacement for Dienst-based NCSTRL
- based on Arc
- computer science metadata
26Archon
- http://archon.cs.odu.edu/
- physics metadata
- based on Arc
- features
- citation indexing
- equation-based searching
27Torii
- http://torii.sissa.it/
- physics metadata
- features
- personalization
- recommendations
- WAP access
28iCite
- http://icite.sissa.it/
- physics metadata
- features
- citation based access to arXiv metadata
29my.OAI
- http://www.myoai.com/
- covers all registered metadata
- features
- result sets
- personalization
- many other advanced features
30Cyclades
- http://www.ercim.org/cyclades
- scientific metadata
- features
- personalization
- recommendations
- collaboration
- status?
31citebase
- http://citebase.eprints.org/
- arXiv metadata
- citation based indexing, reporting
32OAIster
- http://oaister.umdl.umich.edu/
- harvests all known archives
33Others
- Commercial publishers
- American Physical Society (APS)
- Institute of Physics
- Elsevier / Scirus (www.scirus.com)
- Department of Energy
- OSTI
- LANL
- Institutional servers
- DSpace (MIT, www.dspace.org)
- Eprints (www.eprints.org)
- DARE (All Dutch universities)
34NACA Technical Report Server
- publicly available
- began in 1996
- details in NASA TM-1999-209127
- scanned reports from 1917-1958
- NACA: predecessor to NASA
- contents mirrored with the MAGiC project
- a UK-based grey-literature preservation project
- OAI-PMH used to mirror contents
http://naca.larc.nasa.gov/
http://naca.larc.nasa.gov/oai2.0/
35NACA Report 1345 as seen through its native DL
http://naca.larc.nasa.gov/
36NACA Report 1345 as seen through MAGiC
http://www.magic.ac.uk/
37NACA Report 1345 as seen through Scirus (Elsevier)
http://www.scirus.com/
38NACA Report 1345 as seen through my.OAI (FS Consulting)
http://www.myoai.com/
39NASA Technical Report Server
- replacement for the previous distributed searching version of NTRS
- MySQL
- Va Tech harvester
- modified bucket
- details in Nelson, Rocker, Harrison, Library Hi-Tech, 21(2) (March 2003)
- a service provider & aggregator
- same OAI baseURL as used for interactive searching
http://ntrs.nasa.gov/
40NASA Technical Report Server
- advanced, fielded search
- explicit query routing
- 10 NASA repositories
- 4 non-NASA repositories
- turned off by default
41non-NASA repositories
> 0.5M records
42NASA DLs in the Larger STI Realm
[Figure: NTRS among other STI digital libraries: DOE, DOD, Universities, Publishers, International, ...; this could be a fully connected graph]
NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
We hope to influence the direction of the science.gov effort to use OAI-PMH
43Service Providers
- It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol
- easy to be a DP -> many DPs -> SPs eventually emerge
- hard to be a DP -> SPs starve
- currently 5x more DPs than SPs
- SPs are beginning to offer increasingly sophisticated services
- competitive market originally envisioned for SPs is emerging
44OpenURL
45Origins & Motivation
- The Context: Library Automation Environment anno 1998
- distributed information environment
- local & remote A&I databases
- rapidly growing e-journal collection
- need to interlink the available information
- The Problem
- links are delivered by info providers
- links are not sensitive to user's context
- appropriate copy problem
- links dependent on business agreements between information vendors
- links don't cover the complete collection
46Origins & Motivation
- The Context: Library Automation Environment anno 1998
- distributed information environment
- local & remote A&I databases
- rapidly growing e-journal collection
- need to interlink the available information
- The REAL Problem
- libraries have no say in linking
- libraries are losing core part of the "organizing information" task
- expensive collection is not used optimally
- users are not well served
47Origins & Motivation
- The Solution
- In information services:
- DO NOT provide a link which is an actual service related to a referenced item (e.g. a link from a record in an A&I database to the corresponding full-text)
- BUT rather provide
- a link that transports metadata about the referenced item
- to
- others that are better placed to provide service links
OpenURL
Linking server operated by library
48non-OpenURL linking
[Figure: within a resource, the metadata of a reference is resolved into a link, giving a direct link to the referenced work in another resource]
49OpenURL linking
[Figure: the resource provides an OpenURL for a reference; the OpenURL transports metadata & identifiers to a user-specific, context-sensitive component, which resolves the metadata & identifiers into services]
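To make the idea concrete, here is a minimal sketch of an OpenURL as "resolver base URL plus metadata in the query string". The key names (genre, issn, volume, issue, spage, date) follow the OpenURL 0.1 draft as best recalled and should be checked against the specification; the resolver addresses and the citation values are made up.

```python
from urllib.parse import urlencode


def make_openurl(resolver_base, metadata):
    """Attach citation metadata to the base URL of a library's linking server."""
    # Skip empty values so only known metadata is transported.
    pairs = [(key, value) for key, value in metadata.items() if value]
    return resolver_base + "?" + urlencode(pairs)


# Made-up citation metadata for illustration.
citation = {
    "genre": "article",
    "issn": "1234-5678",
    "volume": "5",
    "issue": "4",
    "spage": "",       # unknown start page: left out of the link
    "date": "1999",
}

# The same reference, sent to two hypothetical institutional resolvers.
link_for_library_a = make_openurl("http://resolver.library-a.example/menu", citation)
link_for_library_b = make_openurl("http://resolver.library-b.example/menu", citation)
print(link_for_library_a)
print(link_for_library_b)
```

The metadata is identical in both links; only the resolver differs, so each library decides which services its own users are offered.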
50Evolution 1998
- Nature of solution determined
- Experiment with local databases at Ghent University
- Demonstrated October 1998 at Belgian Library meeting
- Problem statement & Experiment described in 2 D-Lib Magazine papers, April 1999
51Evolution 1999
- Feasibility of solution tested in 2 complex environments
- Experiments
- SFX@Ghent & SFX@LANL: LANL, Ghent, APS, Wiley, SilverPlatter, Ex Libris
- UPS Prototype: arXiv, SLAC/SPIRES, LANL, Ghent, ...
- Demonstrated
- June 1999 at ALA LITA session, New Orleans
- October 1999 at OAI meeting, Santa Fe
- Experiments described in 2 D-Lib Magazine papers, October 1999 and February 2000
52Evolution 2000
- OpenURL 0.1 released
- Quick adoption of OpenURL 0.1 in information community
- SFX linking server goes beta
53Evolution 2001
- Integration of OpenURL Framework and DOI/CrossRef framework
- Experiment involving CNRI, LANL, OhioLink, Academic Press, Ex Libris, ...
- DOI/OpenURL integration described in 2 D-Lib Magazine papers, March 2001 and September 2001
- First non-SFX linking servers appear
54Evolution 2001
- Proposal to standardize OpenURL
- Generalization of OpenURL Framework concepts beyond scholarly information community
- Described in
- Van de Sompel, Herbert and Beit-Arie, Oren. Generalizing the OpenURL Framework beyond References to Scholarly Works: the Bison-Futé model. July/August 2001. D-Lib Magazine.
- NISO AX Committee starts standardization of the OpenURL Framework using the Bison-Futé model as the basis of its work.
55NISO OpenURL Standardization Charge
- Use existing OpenURL Framework as starting point
- notion of context-sensitive services
- notion of transporting contextual metadata packages to obtain context-sensitive services
- Define syntax and transport-method for contextual metadata packages
- Ensure extensibility
- must support future applications
- must support other information communities
-> Generalize and Standardize
56NISO OpenURL Standardization Charge
- Therefore, to be addressed were
- OpenURL Framework beyond scholarly resources
- contextual metadata packages
- Syntax for contextual metadata packages
- Transport of contextual metadata packages
57- default links
- restricted in nature
- action-radius restricted by business agreements
- not context-sensitive
[Figure: resource1, resource2 and resource3 linked in the metadata plane (credit: Herbert Van de Sompel)]
58 extended services plane
[Figure: service component1 and service component2 in the extended services plane, above resource1, resource2 and resource3 in the metadata plane (credit: Herbert Van de Sompel)]
59Download and Go!
60Where Do You Want to Build?
[Figure: user, service provider, multiple data providers, and local context-sensitive services; labeled tool: EPrints.org]
61Fedora
- joint project between Cornell & UVa
- funded by the Mellon Foundation
- a repository management system
- focuses on complex digital objects and their behaviors
- more info
- http://www.fedora.info/
- D-Lib Magazine, 9(4)
- http://www.dlib.org/dlib/april03/staples/04staples.html
62DSpace
- MIT & HP Labs
- constructed to capture all the output of MIT's faculty
- now generalized to the DSpace Federation
- 8 top universities in the US & Canada
- More info
- http://www.dspace.org/
- http://sourceforge.net/projects/dspace/
- D-Lib Magazine 9(1)
- http://www.dlib.org/dlib/january03/smith/01smith.html
63EPrints.org
- developed at Southampton University
- part of larger suite of institutional/author self-archiving tools and services
- e.g. citebase & paracite
- widely adopted -- 100 sites
- http://software.eprints.org/ep2
- more info
- http://www.eprints.org/
- http://www.arl.org/sparc/core/index.asp?pageg206
64Kepler
- P2P publishing for academia
- community servers for coordination, management
- archivelets for individual laptops, PCs
- more info
- http://kepler.cs.odu.edu/
- D-Lib Magazine 7(4)
- http://www.dlib.org/dlib/april01/maly/04maly.html
65OpenResolver
- developed by UKOLN
- open source
- OpenURL 0.1 format resolver
- NISO 1.0 format???
- more info
- Ariadne, 28
- http://www.ariadne.ac.uk/issue28/resolver/
- ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/
- http://www.ukoln.ac.uk/distributed-systems/openurl/
66Conclusions
67Why The OAI-PMH is NOT Important
- Users don't care
- OAI-PMH is middleware
- if done right, the uninterested user should never have to know
- Using OAI-PMH does not ensure a good SP
- OAI-PMH is (or is becoming) HTTP for DLs
- few people get excited about HTTP now
- HTTP & OAI-PMH are core technologies whose presence is now assumed
68Other Uses For the OAI-PMH
- Assumptions
- Traditional DLs / SPs will continue on their present path of increasing sophistication
- citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc.
- growth rates remain the same (5x DPs as SPs)
- Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state
- Future opportunities are possible by creatively interpreting the OAI-PMH data model
- See Van de Sompel, Young & Hickey, D-Lib Magazine July 2003, http://www.dlib.org/dlib/july03/young/07young.html
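The premise about synchronizing distributed state can be sketched as an incremental update driven by datestamps. This is an illustration only: the change list stands in for what a datestamp-selective harvest would return, and the field names, identifiers, and `deleted` flag layout are assumptions made for the sketch.

```python
def synchronize(replica, changes, last_sync):
    """Apply changes stamped on or after last_sync; return the newest datestamp."""
    newest = last_sync
    for change in changes:
        stamp = change["datestamp"]
        # UTC datestamps in ISO 8601 form compare correctly as plain strings.
        if stamp < last_sync:
            continue
        if change.get("deleted"):
            replica.pop(change["identifier"], None)
        else:
            replica[change["identifier"]] = change["metadata"]
        newest = max(newest, stamp)
    return newest


# Made-up local state and remote changes.
replica = {"oai:example.org:1": "old title", "oai:example.org:2": "to be removed"}
changes = [
    {"identifier": "oai:example.org:0", "datestamp": "2003-01-15", "metadata": "stale"},
    {"identifier": "oai:example.org:1", "datestamp": "2003-07-01", "metadata": "new title"},
    {"identifier": "oai:example.org:2", "datestamp": "2003-08-20", "deleted": True},
    {"identifier": "oai:example.org:3", "datestamp": "2003-09-01", "metadata": "added"},
]
next_sync = synchronize(replica, changes, last_sync="2003-06-01")
print(replica, next_sync)
```

Nothing in the loop is specific to bibliographic metadata, which is what makes the data model reusable for other kinds of distributed state.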
69OpenURL Framework evolution
70The Future: Community Building
- Ultimately, protocols and metadata formats are not what makes a difference
- Rather, the critical mass afforded by a common set of utilities (cf. HTTP, Dublin Core, XML)
- The best current example: The Open Language Archives Community
- http://www.language-archives.org/
- OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends