Title: Much of the learning about the constituent UPS archives occ
1
Herbert Van de Sompel, Michael Nelson, Thomas Krichel
the UPS protoproto project
UPS 1 Meeting, Santa Fe - October 21st, 1999
2
description
the UPS protoproto
the data exchange framework
3
- UPS: enable cross-archive end-user services
- protoproto
- facilitate discussions
- identify issues involved in creating cross-archive services
- experiment with digital object concepts for archive material
- does not claim to be a solution
- protoproto is multi-disciplinary
- a special instance of cross-archive
- there is a market
- promotional value
4
- coordination: Herbert Van de Sompel, Michael Nelson, Thomas Krichel
- involvement of
- Old Dominion U NASA Langley
- U of Surrey
- U of Ghent
- Los Alamos National Laboratory - Library
- Russian Academy of Science - Siberian branch
5
- Los Alamos National Laboratory - Research Library
- JISC eLib WoPEc project
6
- metadata only
- full text remains at archives
- static dumps obtained ca. July 99

           arXiv   CogPrints  NACA   NCSTRL  NDLTD  RePEc   Total
objects    85,223  742        3,036  29,184  1,590  73,367  193,142
full-text  85,223  659        3,036  9,084   951    13,582  112,535
!organization 17,983 14 100 93 1 2,453
7
        arXiv     CogPrints  NACA   NCSTRL   NDLTD  RePEc
format  internal  internal   Refer  RFC1807  MARC   ReDIF
8
- Getting metadata out of archives
- not all archives support metadata extraction
- some archives have undocumented metadata extraction procedures
- not all archives support rich criteria for extraction
- single dump concept only
- Intellectual property and use rights not always clear
9
- Metadata has problems with
- record duplication
- crucial missing fields
- internal errors
- ambiguous references to people and places, publications
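The duplication and missing-field problems listed above can be illustrated with a minimal sketch. It assumes records are plain dictionaries with hypothetical `id`, `title` and `authors` fields and uses a deliberately crude matching key (normalized title plus first-author surname). It is not the procedure used for UPS, where duplicates were removed by hand.

```python
import re
import unicodedata
from collections import defaultdict


def match_key(record):
    """Build a crude comparison key: normalized title + first author surname."""
    title = unicodedata.normalize("NFKD", record.get("title", ""))
    # Drop combining accents so "métadata" and "metadata" compare equal.
    title = "".join(c for c in title if not unicodedata.combining(c)).lower()
    title = re.sub(r"[^a-z0-9]+", " ", title).strip()
    authors = record.get("authors") or [""]
    surname = authors[0].split(",")[0].strip().lower()
    return (title, surname)


def find_duplicates(records):
    """Return (groups of ids sharing a key, ids of records lacking a title)."""
    groups = defaultdict(list)
    incomplete = []
    for rec in records:
        if not rec.get("title"):
            incomplete.append(rec["id"])  # crucial field missing
            continue
        groups[match_key(rec)].append(rec["id"])
    duplicates = [ids for ids in groups.values() if len(ids) > 1]
    return duplicates, incomplete


# Hypothetical sample records, invented for illustration only.
records = [
    {"id": "a:1", "title": "An Example Preprint", "authors": ["Doe, J."]},
    {"id": "b:7", "title": "An example  preprint.", "authors": ["Doe, Jane"]},
    {"id": "b:8", "title": "", "authors": ["Roe, R."]},
]
duplicates, incomplete = find_duplicates(records)
```

A key this coarse will also merge distinct works that share a title and surname, so in practice its output would be a list of candidates for review rather than an automatic merge.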
10
- all datasets converted to ReDIF
- essential to have a single format for the creation of services
- supply by archives in a single format was not realistic
- no downgrading of data
- data enhancements
- creation of unique identifier
- addition of raw subject-classification
- normalization of publication types
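The enhancement steps above (unique identifier, raw subject classification, normalized publication type, no downgrading) can be sketched as follows. Everything here is an assumption made for illustration: the `CommonRecord` type, the `archive:local_id` identifier pattern and the type vocabulary are hypothetical, and the sketch does not reproduce ReDIF syntax.

```python
from dataclasses import dataclass

# Hypothetical mapping from archive-specific type labels to one vocabulary.
PUBLICATION_TYPES = {
    "techreport": "report",
    "technical report": "report",
    "preprint": "preprint",
    "e-print": "preprint",
    "thesis": "thesis",
    "etd": "thesis",
}


@dataclass
class CommonRecord:
    identifier: str        # unique across all archives
    title: str
    publication_type: str  # normalized value
    raw_subjects: tuple    # subject codes copied unchanged from the source
    source: dict           # original record kept intact (no downgrading)


def to_common(archive, local_id, record):
    """Convert one archive-specific record into the common form."""
    raw_type = record.get("type", "").strip().lower()
    return CommonRecord(
        identifier=f"{archive}:{local_id}",
        title=record.get("title", "").strip(),
        publication_type=PUBLICATION_TYPES.get(raw_type, "unknown"),
        raw_subjects=tuple(record.get("subjects", ())),
        source=dict(record),
    )


# Hypothetical input record, invented for illustration only.
example = to_common(
    "ncstrl", "tr-42",
    {"title": " A Report ", "type": "TechReport", "subjects": ["H.3.7"]},
)
```

Keeping the untouched source record next to the normalized fields is one way to honor "no downgrading": nothing the archive supplied is lost when a field fails to map.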
11
- creation of archives for ReDIF-ed metadata
- using intelligent digital objects: buckets
RePEc
arXiv
NCSTRL
12
- Buckets were chosen to study the implications of using rich, intelligent objects in UPS
- Buckets are
- DL protocol / system independent
- self-contained and mobile
- handle their own display, enforcement of terms and conditions, and dissemination of their contents
- designed for bundling multiple data representations and data instance types
- The aggregative nature of buckets is well suited for adding value-added services at the object level
13
- NCSTRL digital library service
- indexing buckets in archives by requesting their metadata
- enhanced user-interface
- NCSTRL search results point at buckets
- buckets auto-display
- buckets provide link to full-text in native archive
14
- UPS contains 193K objects
- using buckets consumed inodes (60 inodes per bucket)
- filesystem reformatted with a more generous number of inodes
- Solaris and Dienst conflict
- Dienst wants each object in a publishing authority to be in a single directory
- Solaris has a hard limit of 32K objects in a directory
- resolution: use many (100) authorities for UPS
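The figures on this slide can be checked with a little arithmetic. The sketch below uses the object total from the statistics table (193,142) and reads "32K" as 32,768, which is an assumption. The source does not say how objects were assigned to the 100 authorities, so the hash-based `authority_for` function is purely a hypothetical illustration of one stable way to spread them.

```python
import hashlib
import math

OBJECTS = 193_142        # total objects reported for UPS
INODES_PER_BUCKET = 60   # inodes consumed per bucket
DIR_LIMIT = 32 * 1024    # "32K" read as 32,768 (assumption)
AUTHORITIES = 100        # number of authorities chosen for UPS

inodes_needed = OBJECTS * INODES_PER_BUCKET          # 11,588,520
min_directories = math.ceil(OBJECTS / DIR_LIMIT)     # 6
avg_per_authority = math.ceil(OBJECTS / AUTHORITIES) # 1,932


def authority_for(object_id, n=AUTHORITIES):
    """Map an object id to a stable authority number in range(n).

    Hypothetical scheme: hash the id and reduce it modulo n.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n
```

Six directories would already satisfy the limit, so 100 authorities leave ample headroom: roughly 1,900 objects per directory on average against a ceiling of 32K.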
15
- integrate the archives with the traditional communication mechanism
- context-sensitive linking to deliver extended services via SFX technology
16
evaluate metadata
system A
18
- buckets for arXiv, NCSTRL and RePEc are SFX-aware
- CogPrints, NACA, NDLTD not SFX-aware
- SLAC/SPIRES is SFX-aware
- linking services for preprint metadata for published version
19
- will be available starting at the beginning of November
- UPS list will be notified
- disclaimer: not a production system
http://ups.cs.odu.edu:8000/
http://ups.cs.odu.edu
20
- data exchange framework
- data provision vs. data implementation
- central searching, distributed archives
- need for a framework by which archives can describe themselves
- content
- terms and conditions
- protocols, criteria supported to extract (meta)data
- metadata scheme, subject classification scheme, material-type scheme, ...
21
- need for an identifier scheme for archives and archive objects
- (cf. ISSN, ISBN, DOI)
- metadata quality obstructs the creation of services
- desirable to extend metadata with citation information
- smart objects
- archived objects that are active, not passive
22
- Providing data
- publishing into an archive
- providing methods for metadata harvesting
- also provide non-technical context for sharing information
- Implementing data
- harvest metadata from providers
- implement user interface to data
- Even if provided by the same DL, these are distinct functions
23
[Diagram: two providers]
- Provider with input interface and native end-user interface only: no machine-based way to extract metadata
- Provider with native harvesting interface, input interface and native end-user interface: machine and user interfaces for extracting metadata
24
[Diagram: an implementor and two providers]
- Implementor: native end-user interface; input and harvesting interfaces optional
- Providers: native harvesting interface and input interface; native end-user interface optional (e.g., RePEc)
25
- Much of the learning about the constituent UPS archives occurred out of band
- Given an unknown archive, we should be able to algorithmically determine the archive's metadata...
- Where possible, the harvesting interface should provide the same criteria as the end-user interface
[Diagram: a provider with native harvesting interface, input interface and native end-user interface]
26
- Recommended criteria for metadata extraction
- subject classification
- accession date
- publication date
- Criteria for archive description
- metadata formats employed
- contact information for archive
- publication type scheme
- identifier scheme
- subject classification scheme
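The two lists above suggest a machine-readable archive description and a small set of harvesting criteria. The sketch below is one possible shape for both, under assumptions: the class names, field names, record layout and sample values are all hypothetical and do not correspond to any interface the archives actually offered.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchiveDescription:
    """Self-description an archive could publish (hypothetical fields)."""
    metadata_formats: list
    contact: str
    publication_type_scheme: str
    identifier_scheme: str
    subject_classification_scheme: str


@dataclass
class HarvestCriteria:
    """Extraction criteria; a criterion left as None is not applied."""
    subject: Optional[str] = None             # subject classification prefix
    accessioned_since: Optional[date] = None
    published_since: Optional[date] = None


def select(records, criteria):
    """Return the records that satisfy every criterion that is set."""
    selected = []
    for rec in records:
        if criteria.subject and not any(
            s.startswith(criteria.subject) for s in rec["subjects"]
        ):
            continue
        if (criteria.accessioned_since
                and rec["accession_date"] < criteria.accessioned_since):
            continue
        if (criteria.published_since
                and rec["publication_date"] < criteria.published_since):
            continue
        selected.append(rec)
    return selected


# Hypothetical records, invented for illustration only.
records = [
    {"id": "x:1", "subjects": ["cs.DL"],
     "accession_date": date(1999, 7, 1), "publication_date": date(1998, 3, 1)},
    {"id": "x:2", "subjects": ["physics.ao"],
     "accession_date": date(1999, 7, 2), "publication_date": date(1999, 5, 1)},
    {"id": "x:3", "subjects": ["cs.IR"],
     "accession_date": date(1998, 1, 5), "publication_date": date(1997, 9, 1)},
]
recent_cs = select(
    records, HarvestCriteria(subject="cs.", accessioned_since=date(1999, 1, 1))
)
```

Filtering by accession date is what would make incremental harvesting possible, in contrast to the single-dump approach used for the prototype.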
27
- Useful in
- reference linking
- can be used in citations
- resolving duplications
- UPS duplications were removed by hand
- tracking publication lifecycle
- Need the ability for an object to have multiple unique identifiers
- organization, discipline, etc.
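The requirement that one object may carry several unique identifiers can be sketched as a small alias index. The `IdentifierIndex` class and the sample identifiers are hypothetical, and the sketch assumes one identifier per object is designated as canonical.

```python
class IdentifierIndex:
    """Maps every known identifier of an object to one canonical identifier."""

    def __init__(self):
        self._canonical = {}

    def register(self, canonical, *aliases):
        """Record that all given identifiers name the same object."""
        for ident in (canonical, *aliases):
            existing = self._canonical.setdefault(ident, canonical)
            if existing != canonical:
                raise ValueError(f"{ident} already refers to {existing}")

    def resolve(self, ident):
        """Return the canonical identifier, or None if unknown."""
        return self._canonical.get(ident)

    def same_object(self, a, b):
        """True when both identifiers are known and name the same object."""
        ra, rb = self.resolve(a), self.resolve(b)
        return ra is not None and ra == rb


# Hypothetical identifiers: one global, one organizational, one disciplinary.
index = IdentifierIndex()
index.register("ups:0001", "org:example-lab/tr-17", "discipline:physics/9907001")
```

With such an index, a citation made under the organizational identifier and one made under the disciplinary identifier resolve to the same object, which supports both reference linking and duplicate detection.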
28
- Premise: objects are more important than the archives that hold them
- SODA: Smart Objects, Dumb Archives
- Objects should be the canonical authority for
- metadata
- contents
- use
- Objects should be able to grow and change
- correct metadata
- add new formats
- add new services
- reflect the lifecycle of the object
29
- It would be beneficial if the archived objects could be heterogeneous
- with their own look-and-feel
- unique functionality / services
- e.g., the data archiving needs of an atmospheric scientist can be different from those of a computer scientist, engineer or medical researcher
- yet maintain a standard API for
- extracting metadata
- content retrieval
- resource discovery on the object
- terms and conditions
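The idea of heterogeneous objects behind a standard API can be sketched with an abstract base class. This is an illustration under assumptions: the class and method names below are hypothetical and are not the bucket API.

```python
from abc import ABC, abstractmethod


class ArchivedObject(ABC):
    """Standard API that every archived object exposes (hypothetical)."""

    @abstractmethod
    def metadata(self) -> dict: ...

    @abstractmethod
    def content(self, fmt: str) -> bytes: ...

    @abstractmethod
    def terms_and_conditions(self) -> str: ...

    def methods(self) -> list:
        """Resource discovery: names of the public methods this object offers."""
        return sorted(
            name for name in dir(self)
            if not name.startswith("_") and callable(getattr(self, name))
        )


class Preprint(ArchivedObject):
    def __init__(self, title, formats):
        self._title = title
        self._formats = formats  # maps a format name to its bytes

    def metadata(self):
        return {"title": self._title, "formats": sorted(self._formats)}

    def content(self, fmt):
        return self._formats[fmt]

    def terms_and_conditions(self):
        return "hypothetical terms text"


class AtmosphericDataset(Preprint):
    def subset(self, start, stop):
        """Extra, discipline-specific service: slice the raw data."""
        return self._formats["raw"][start:stop]


paper = Preprint("An Example Preprint", {"txt": b"full text"})
data = AtmosphericDataset("Example Soundings", {"raw": b"0123456789"})
```

A service can treat `paper` and `data` identically through the shared methods, while `data.methods()` additionally advertises `subset`, the kind of per-discipline functionality the slide describes.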
30
- A strong distinction between the provision of data, and the implementation of data
- also, a socio-legal context for sharing metadata
- Open, self-describing archives
- A universal, unique identifier name space
- Archived objects with more intelligence and flexibility