Much of the learning about the constituent UPS archives occ

Slides: 31

Provided by: centralebi

Learn more at: https://www.cs.odu.edu

Transcript and Presenter's Notes

1
herbert van de sompel, michael nelson, thomas
krichel
the UPS protoproto project
UPS 1 Meeting, Santa Fe - October 21st, 1999
2
description
the UPS protoproto
the data exchange framework
3

UPS: enable cross-archive end-user services
protoproto:
facilitate discussions
identify issues involved in creating
cross-archive services
experiment with digital object concepts for
archive material
does not claim to be a solution
protoproto is multi-disciplinary
a special instance of cross-archive
there is a market
promotional value

coordination: herbert van de sompel, michael nelson, thomas krichel
involvement of
Old Dominion U NASA Langley
U of Surrey
U of Ghent
Los Alamos National Laboratory - Library
Russian Academy of Science - Siberian branch

Los Alamos National Laboratory - Research Library
JISC eLib WoPEc project

metadata only
full text remains at archives
static dumps obtained ca. July 99

               arXiv   CogPrints  NACA   NCSTRL  NDLTD  RePEc   Total
objects        85,223  742        3,036  29,184  1,590  73,367  193,142
full-text      85,223  659        3,036  9,084   951    13,582  112,535
!organization  17,983  14         100    93      1      2,453
7
        arXiv     CogPrints  NACA   NCSTRL   NDLTD  RePEc
format  internal  internal   Refer  RFC1807  MARC   ReDIF
8

Getting metadata out of archives
not all archives support metadata extraction
some archives have undocumented metadata
extraction procedures
not all archives support rich criteria for
extraction
single dump concept only
Intellectual property and use rights not always
clear

Metadata has problems with
record duplication
crucial missing fields
internal errors
ambiguous references to people and places,
publications

all datasets converted to ReDIF
essential to have a single format for the
creation of services
supply by archives in a single format was not
realistic
no downgrading of data

data enhancements
creation of unique identifier
addition of raw subject-classification
normalization of publication types
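
Two of the enhancements listed above (unique identifiers and normalized publication types) can be illustrated with a minimal sketch. The `ups:` identifier prefix, the type vocabulary, and all names below are assumptions made for illustration; the slides do not say how UPS built identifiers or which publication types it used. The original type value is kept next to the normalized one, in the spirit of "no downgrading of data".

```python
from dataclasses import dataclass

# Hypothetical normalized vocabulary; the slides do not list the real one.
TYPE_MAP = {
    "technical report": "report",
    "techreport": "report",
    "preprint": "preprint",
    "e-print": "preprint",
    "thesis": "thesis",
    "dissertation": "thesis",
}


@dataclass
class SourceRecord:
    archive: str    # e.g. "NCSTRL"
    local_id: str   # identifier as used inside the source archive
    title: str
    pub_type: str   # publication type as supplied by the archive


def unique_identifier(record: SourceRecord) -> str:
    """Build an identifier that is unique across archives (assumed scheme)."""
    return f"ups:{record.archive.lower()}:{record.local_id}"


def normalize_type(raw: str) -> str:
    """Map an archive-specific publication type onto the shared vocabulary."""
    return TYPE_MAP.get(raw.strip().lower(), "other")


def enhance(record: SourceRecord) -> dict:
    """Return the enhanced record; the original type is kept, not discarded."""
    return {
        "id": unique_identifier(record),
        "title": record.title.strip(),
        "type": normalize_type(record.pub_type),
        "type_original": record.pub_type,
    }
```

Prefixing the local identifier with the archive name is the simplest way to avoid collisions between archives that number their records independently.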

creation of archives for ReDIF-ed metadata
using intelligent digital objects: buckets

RePEc
arXiv
NCSTRL
12

Buckets were chosen to study the implications of
using rich, intelligent objects in UPS
Buckets are
DL protocol / system independent
self-contained and mobile
handle their own display, enforcement of terms
and conditions, and dissemination of their
contents
designed for bundling multiple data
representations and data instance types
The aggregative nature of buckets is well suited
for adding value-added services at the object
level

NCSTRL digital library service
indexing buckets in archives by requesting their
metadata
enhanced user-interface
NCSTRL search results point at buckets
buckets auto-display
buckets provide link to full-text in native
archive

UPS contains 193K objects
using buckets consumed inodes (60 inodes per
bucket)
filesystem reformatted with more generous amount
of inodes
Solaris and Dienst conflict
Dienst wants each object in a publishing
authority to be in a single directory
Solaris has a hard limit of 32K objects in a
directory
resolution: use many (100) authorities for UPS
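
The numbers on this slide can be checked with a short sketch. The object count, the inodes per bucket, and the number of authorities come from the slides; reading "32K" as 32,000, the `upsNN` authority names, and the use of a CRC32 hash to assign objects are assumptions made for illustration only.

```python
import math
import zlib

OBJECTS = 193_142          # total objects in UPS (from the slides)
INODES_PER_BUCKET = 60     # from the slides
AUTHORITIES = 100          # from the slides
DIRECTORY_LIMIT = 32_000   # "32K" read as 32,000; the exact figure is an assumption


def inodes_needed(objects: int = OBJECTS, per_bucket: int = INODES_PER_BUCKET) -> int:
    """Total inodes consumed when every object is stored as a bucket."""
    return objects * per_bucket


def average_per_authority(objects: int = OBJECTS, authorities: int = AUTHORITIES) -> int:
    """Average number of objects per authority directory, rounded up."""
    return math.ceil(objects / authorities)


def authority_for(object_id: str, authorities: int = AUTHORITIES) -> str:
    """Pick an authority name for an object (hypothetical naming and hashing)."""
    # crc32 gives the same value on every run, unlike the built-in hash().
    index = zlib.crc32(object_id.encode("utf-8")) % authorities
    return f"ups{index:02d}"
```

With these figures the buckets need about 11.6 million inodes, and spreading them over 100 authorities brings the average directory down to roughly 1,900 objects, well under the per-directory limit.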

integrate the archives with the traditional
communication mechanism
context-sensitive linking to deliver extended
services via SFX technology

16
evaluate metadata
system A
18

buckets for arXiv, NCSTRL and RePEc are
SFX-aware
Cogprints, NACA, NDLTD not SFX-aware
SLAC/SPIRES is SFX-aware
linking services for preprint metadata for
published version

will be available starting at the beginning of November
UPS list will be notified
disclaimer: not a production system

http://ups.cs.odu.edu:8000/
http://ups.cs.odu.edu
20

data exchange framework
data provision vs. data implementation
central searching, distributed archives
need for a framework by which archives can
describe themselves
content
terms and conditions
protocols, criteria supported to extract
(meta)data
metadata scheme, subject classification scheme,
material-type scheme, ...
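
A self-description of this kind could be as simple as a small machine-readable record. The sketch below is only one possible shape: the field names follow the bullet points above, but the class, its methods, and the placeholder values are assumptions and not part of any UPS specification. Only the ReDIF metadata scheme for RePEc is taken from the slides.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ArchiveDescription:
    name: str
    content: str
    terms_and_conditions: str
    protocols: list = field(default_factory=list)
    extraction_criteria: list = field(default_factory=list)
    metadata_scheme: str = ""
    subject_scheme: str = ""
    material_type_scheme: str = ""

    def supports(self, criterion: str) -> bool:
        """Tell a harvester whether a given extraction criterion is offered."""
        return criterion in self.extraction_criteria

    def to_json(self) -> str:
        """Serialize the description so that software can read it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


example = ArchiveDescription(
    name="RePEc",
    content="placeholder: what the archive holds",
    terms_and_conditions="placeholder: use rights for harvested metadata",
    protocols=["static dump"],                 # placeholder label
    extraction_criteria=["publication date"],  # placeholder value
    metadata_scheme="ReDIF",                   # from the slides
)
```

A harvester could then call `example.supports("publication date")` before deciding how to query the archive, instead of learning this by correspondence with the archive's maintainers.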

need for an identifier scheme for archives and
archive objects
(cf. ISSN, ISBN, DOI)
metadata quality obstructs the creation of
services
desirable to extend metadata with citation
information
smart objects
archived objects that are active, not passive

Providing data
publishing into an archive
providing methods for metadata harvesting
provide non-technical context for sharing
information also
Implementing Data
harvest metadata from providers
implement user interface to data
Even if provided by the same DL, these are
distinct functions

23
[Diagram: two providers. One has only an input interface and a native end-user interface: no machine-based way to extract metadata. The other also has a native harvesting interface: machine and user interfaces for extracting metadata.]
24
[Diagram: an implementor with a native end-user interface (input and harvesting interfaces optional) draws on two providers, each with a native harvesting interface and an input interface; a provider's native end-user interface is optional (e.g., RePEc).]
25

Much of the learning about the constituent UPS
archives occurred out of band
Given an unknown archive, we should be able to
algorithmically determine the archive's
metadata...

Where possible, the harvesting interface should
provide the same criteria as the end-user
interface
[Diagram: a provider with a native harvesting interface, an input interface, and a native end-user interface.]
26

Recommended criteria for metadata extraction
subject classification
accession date
publication date
Criteria for archive description
metadata formats employed
contact information for archive
publication type scheme
identifier scheme
subject classification scheme
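
The three recommended extraction criteria can be sketched as an optional filter over metadata records. The record fields and the `extract` function are hypothetical names chosen for illustration; the slides recommend the criteria but do not define an interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Optional


@dataclass
class MetadataRecord:
    identifier: str
    subject: str
    accession_date: date
    publication_date: date


def extract(
    records: Iterable[MetadataRecord],
    subject: Optional[str] = None,
    accessioned_since: Optional[date] = None,
    published_since: Optional[date] = None,
) -> Iterator[MetadataRecord]:
    """Yield the records that match every criterion that was supplied."""
    for record in records:
        if subject is not None and record.subject != subject:
            continue
        if accessioned_since is not None and record.accession_date < accessioned_since:
            continue
        if published_since is not None and record.publication_date < published_since:
            continue
        yield record
```

Filtering on accession date is what would let an implementor fetch only the records added since its last harvest, rather than relying on a single full dump.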

Useful in
reference linking
can be used in citations
resolving duplications
UPS duplications were removed by hand
tracking publication lifecycle
Need the ability for an object to have multiple
unique identifiers
organization, discipline, etc.
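
If objects carry several identifiers, duplicates can in principle be found by machine: two records that share any identifier describe the same object. The sketch below shows that idea only; it is not how UPS removed duplicates (that was done by hand), and the function name is an assumption.

```python
from typing import Iterable, List, Set


def merge_by_shared_identifier(records: Iterable[Set[str]]) -> List[Set[str]]:
    """Group records that share at least one identifier.

    Each input record is the set of identifiers known for one metadata
    record; each output set collects all identifiers of one object.
    """
    merged: List[Set[str]] = []
    for identifiers in records:
        group = set(identifiers)
        untouched = []
        for existing in merged:
            if existing & group:
                # Shared identifier: same object, so pool the identifiers.
                group |= existing
            else:
                untouched.append(existing)
        merged = untouched + [group]
    return merged
```

For example, a record known only by an archive identifier and a record carrying both that identifier and a second one end up in the same group, so the object keeps both names.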

Premise: Objects are more important than the
archives that hold them
SODA: Smart Objects, Dumb Archives
Objects should be the canonical authority for
metadata
contents
use
Objects should be able to grow and change
correct metadata
add new formats
add new services
reflect the lifecycle of the object

It would be beneficial if the archived objects
could be heterogeneous
with their own look-and-feel
unique functionality / services
e.g., the data archiving needs of an atmospheric
scientist can be different from those of a
computer scientist, engineer or medical
researcher
yet maintain a standard API for
extracting metadata
content retrieval
resource discovery on the object
terms and conditions
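
A standard API over heterogeneous objects could look like the abstract interface below. This is a sketch under stated assumptions: the method names mirror the four bullet points above, but they are invented for illustration and are not the bucket API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SmartObject(ABC):
    """Minimal interface every archived object would expose (assumed)."""

    @abstractmethod
    def metadata(self) -> Dict[str, str]:
        """Extract the object's metadata."""

    @abstractmethod
    def content(self, fmt: str) -> bytes:
        """Retrieve the content in one of the available formats."""

    @abstractmethod
    def formats(self) -> List[str]:
        """Resource discovery: list what the object contains."""

    @abstractmethod
    def terms_and_conditions(self) -> str:
        """State the terms under which the content may be used."""


class SimpleObject(SmartObject):
    """One possible object; others may add their own services."""

    def __init__(self, metadata: Dict[str, str], contents: Dict[str, bytes], terms: str):
        self._metadata = dict(metadata)
        self._contents = dict(contents)
        self._terms = terms

    def metadata(self) -> Dict[str, str]:
        return dict(self._metadata)

    def content(self, fmt: str) -> bytes:
        return self._contents[fmt]

    def formats(self) -> List[str]:
        return sorted(self._contents)

    def terms_and_conditions(self) -> str:
        return self._terms

    def add_format(self, fmt: str, data: bytes) -> None:
        """Objects can grow: a new format is added without touching the archive."""
        self._contents[fmt] = data
```

Because callers depend only on `SmartObject`, an archive can hold objects with very different internals and still serve metadata extraction and content retrieval uniformly.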

A strong distinction between the provision of
data, and the implementation of data
also, a socio-legal context for sharing metadata
Open, self-describing archives
A universal, unique identifier name space
Archived objects with more intelligence and
flexibility
