Title: The Other Security: A New and Nimble Approach to Digital Preservation
1 The Other Security: A New and Nimble Approach to Digital Preservation
UCCSC 2009: Focus on Security, UC Davis, June 16-17, 2009
- Stephen Abrams
- Perry Willett
- Digital Preservation Program
- California Digital Library
- University of California
2 Focus on Security
- Traditional security risks
- Natural disaster
- Infrastructure failure
- Storage failure
- Server failure
- Operating system failure
- Application failure
- Human error
- Malicious attack
3 Focus on Security
- The other security risks
- Legal encumbrances
- External dependencies
- Media obsolescence
- Format obsolescence
- Staff competencies
- Institutional commitment
- Financial stability
- Changing user expectations
4 Focus on Security
- The other security risks
- Anything that interferes with the usability of
managed digital assets now or in the future
5 Libraries Have a Long Time Horizon
- The UC Melvyl union catalog holds over 28 million items; 11,000 are more than 500 years old
6 Libraries Have a Long Time Horizon
- What can we do to ensure that today's digital assets are still usable 500 years from now?
7 Agenda
- What is digital curation?
- Redefining the repository: a micro-services approach to curation
- Web archiving
- CDL/campus curation collaborations
- Trusted digital curation services
- Summary
8 Digital Curation
- Activities focused on maintaining and adding value to trusted digital content
- Encompasses preservation and access, which are complementary, not disparate, functions
- Preservation ensures access over time
- Access depends on preservation up to a point in time
- How can we make the "Save" button really mean "save"?
9 Curation Imperatives
- Integrated business process
- Robust technological infrastructure and
- Human analysis and decision-making
- Programmatic (not project-oriented)
- Services (not systems)
- Content (not repositories)
11 D'où venons-nous ? Que sommes-nous ? Où allons-nous ?
Paul Gauguin, 1897-98, Museum of Fine Arts, Boston, 36.270
12 Where are we from, what are we, where are we going?
13 Where is our stuff from, what is it, where are we going with it?
14 Where From? What? Where To?
- Producer
- Repository
- Consumer
15 Where From? What? Where To?
- Producer | Ingest
- Repository | Data management / archival storage
- Consumer | Access / preservation planning
16 Where From? What? Where To?
- Producer | Ingest | Provenance
- Repository | Data management / archival storage | Characterization
- Consumer | Access / preservation planning | View paths
17 Information Landscape
- Increasing diversity in types and uses of content
- Content arising from non-library contexts
- Inevitable technological change
18 Infrastructure Design Goals
- Devolve repository function into a set of independent, but interoperable, services
- Because each service is small and self-contained, it is more easily developed and maintained
- Because the level of investment in each is lower, it is more easily replaced
- Provide complex function through the flexible combination of atomistic services
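The idea of building complex function from small, replaceable services can be illustrated with a minimal Python sketch. Everything named here (`identify`, `characterize`, `store`, `compose`, and the record fields) is hypothetical and chosen only for illustration; it is not the CDL services' actual interface.

```python
from functools import reduce
from typing import Callable, Dict

# In this sketch a "service" is any function that takes an object
# record and returns an updated copy of it.
Service = Callable[[Dict], Dict]

def identify(obj: Dict) -> Dict:
    """Hypothetical identity service: assign an identifier if one is missing."""
    return {**obj, "id": obj.get("id", "obj-0001")}

def characterize(obj: Dict) -> Dict:
    """Hypothetical characterization service: record the payload size."""
    return {**obj, "size": len(obj["payload"])}

def store(obj: Dict) -> Dict:
    """Hypothetical storage service: mark the object as stored."""
    return {**obj, "stored": True}

def compose(*services: Service) -> Service:
    """Combine independent services into one pipeline, applied left to right."""
    return lambda obj: reduce(lambda acc, svc: svc(acc), services, obj)

# A more complex "ingest" function is just a combination of small ones.
ingest = compose(identify, characterize, store)
result = ingest({"payload": b"hello"})
```

Because each step only depends on the record passed to it, any one service could be swapped for another implementation without touching the rest.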
19 Infrastructure Design Goals
- Support interaction through procedural APIs, command line applications, and web interfaces
- Let content managers and curators interact with the services without requiring changes to existing work practices
- Rather than force content to come to the services, push the services out to the content
- Easy deployment centrally or locally, either independently or in strategic combinations
20 Infrastructure Design Goals
- Defer implementation decision making until needs and outcomes are clearly articulated
- Requirements are first stated as sets of values and strategies that promote those values
- Strategies are then embodied as abstract services and, finally, instantiated in technical systems
21 Object-Centric Values and Strategies
22 Service-Centric Values and Strategies
23 Micro-Services
24 Design Process
- What are the conceptual entities underlying the service?
- What are their state properties?
- What are their behaviors?
25 Storage Service
- Storage service: An aggregation of storage nodes
- Storage node: A particular configuration of object storage
- Object: An aggregation of files over time
- Version: A particular configuration of files at a point in time
- File: A formatted bit stream
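The entity definitions above can be read as a simple containment hierarchy. The following is a minimal sketch of that hierarchy as Python dataclasses; the class and field names are assumptions made for illustration and do not come from the Storage service itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class File:
    """A formatted bit stream."""
    name: str
    content: bytes

@dataclass
class Version:
    """A particular configuration of files at a point in time."""
    files: Dict[str, File] = field(default_factory=dict)

@dataclass
class StoredObject:
    """An aggregation of files over time, kept as an ordered list of versions."""
    object_id: str
    versions: List[Version] = field(default_factory=list)

    def current(self) -> Version:
        """Return the most recent version."""
        return self.versions[-1]

@dataclass
class StorageNode:
    """A particular configuration of object storage."""
    objects: Dict[str, StoredObject] = field(default_factory=dict)

@dataclass
class StorageService:
    """An aggregation of storage nodes."""
    nodes: Dict[int, StorageNode] = field(default_factory=dict)

# Usage: one object with two versions, held on node 1.
v1 = Version({"a.txt": File("a.txt", b"one")})
v2 = Version({"a.txt": File("a.txt", b"one"), "b.txt": File("b.txt", b"two")})
obj = StoredObject("obj-1", [v1, v2])
service = StorageService({1: StorageNode({"obj-1": obj})})
```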
26 Storage Service Methods
- Help: idempotent, safe
- Get-state: idempotent, safe
- Get-node-state: idempotent, safe
- Get-object-state: idempotent, safe
- Get-object: idempotent, unsafe
- Get-version-state: idempotent, safe
- Get-version: idempotent, unsafe
- Get-file-state: idempotent, safe
- Get-file: idempotent, unsafe
- Add-version: non-idempotent, unsafe
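To make the idempotent/non-idempotent distinction concrete, here is a minimal sketch in Python. The `METHODS` table copies the traits listed on the slide; the `ToyStore` class and its in-memory representation are hypothetical and are not the Storage service's implementation.

```python
from typing import Dict, List, NamedTuple

class Traits(NamedTuple):
    idempotent: bool
    safe: bool

# Method traits as listed on the slide.
METHODS: Dict[str, Traits] = {
    "help": Traits(True, True),
    "get-state": Traits(True, True),
    "get-node-state": Traits(True, True),
    "get-object-state": Traits(True, True),
    "get-object": Traits(True, False),
    "get-version-state": Traits(True, True),
    "get-version": Traits(True, False),
    "get-file-state": Traits(True, True),
    "get-file": Traits(True, False),
    "add-version": Traits(False, False),
}

class ToyStore:
    """Hypothetical in-memory store: object id -> list of versions."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[Dict[str, bytes]]] = {}

    def add_version(self, object_id: str, files: Dict[str, bytes]) -> int:
        """Non-idempotent: every call appends a new version."""
        versions = self._objects.setdefault(object_id, [])
        versions.append(dict(files))
        return len(versions)

    def get_object_state(self, object_id: str) -> Dict[str, int]:
        """Idempotent and safe: repeated calls change nothing."""
        return {"versions": len(self._objects.get(object_id, []))}

store = ToyStore()
first = store.add_version("obj-1", {"a.txt": b"one"})
second = store.add_version("obj-1", {"a.txt": b"one"})  # same input, new version
state = store.get_object_state("obj-1")
```

Repeating `add_version` with identical input still yields a second version, which is why a client cannot blindly retry it, whereas the state queries can be repeated freely.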
27 Storage Service Interfaces
28 Technological Change and Invariance
- Circa 1989
- FTP
- POSIX
- SQL
- Circa 2029?
- HTTP
- URI
- XML
- Due to their inherent abstracting nature,
protocols and interfaces last longer than systems
29 Storage Service Implementation
- Using the file system as the controlling managerial abstraction, what is the thinnest smear of additional functionality that will make it an effective object store?
- Namaste
- CAN
- Pairtree
- Dflat
- ReDD
30 Name As Text (Namaste) Tags
- Directory-level signature files extending Dublin Core Kernel metadata
- Tag | h0 | 0=name_version
- Who | h1 | 1=who
- What | h2 | 2=what
- When | h3 | 3=when
- Where | h4 | 4=where
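A minimal sketch of writing and reading such directory-level tag files in Python follows. It assumes the `<digit>=<value>` file-naming pattern of the Namaste specification and stores the value again in the file body; the helper names and the sample values are hypothetical, and values are assumed to be safe to use in file names.

```python
import tempfile
from pathlib import Path
from typing import Dict

def write_namaste_tag(directory: Path, number: int, value: str) -> Path:
    """Create a tag file named '<number>=<value>' whose body repeats the value."""
    tag = directory / f"{number}={value}"
    tag.write_text(value + "\n", encoding="utf-8")
    return tag

def read_namaste_tags(directory: Path) -> Dict[int, str]:
    """Recover tags from file names alone, without opening any file."""
    tags: Dict[int, str] = {}
    for entry in directory.iterdir():
        number, sep, value = entry.name.partition("=")
        if sep and number.isdigit() and entry.is_file():
            tags[int(number)] = value
    return tags

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_namaste_tag(root, 0, "dflat_0.11")   # type tag: name_version
    write_namaste_tag(root, 1, "example_who")  # illustrative "who" value
    (root / "data.txt").write_text("not a tag\n", encoding="utf-8")
    tags = read_namaste_tags(root)
```

The point of the convention is visible in `read_namaste_tags`: a plain directory listing is enough to tell what kind of directory this is.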
31 Content Access Node (CAN)
- File system conventions (structure and reserved names) for an object store
can/
  0=can_0.2
  can-info.txt
  log/
  store/
    pairtree...
32 Pairtree
- Use a bigram decomposition of an object's identifier to determine its file system path
pairtree/
  0=pairtree_0.1
  pairtree-info.txt
  pairtree_root/
    id/
      en/
        ti/
          fi/
            er/
              dflat...
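The bigram decomposition can be sketched in a few lines of Python. This is a simplified illustration: the function name is hypothetical, and the character-escaping rules of the full Pairtree specification are deliberately omitted, so it is only suitable for identifiers that are already safe as file names.

```python
from pathlib import PurePosixPath

def pairtree_path(identifier: str, root: str = "pairtree_root") -> PurePosixPath:
    """Split an identifier into two-character segments and nest them as directories.

    An odd-length identifier ends in a one-character segment.
    """
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(root, *pairs)

# The identifier "identifier" maps to id/en/ti/fi/er under the root.
path = pairtree_path("identifier")
```

Because the path is computed from the identifier alone, no index or database lookup is needed to locate an object on disk.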
33 Dflat
- A digital flat for object data and metadata
dflat/
  0=dflat_0.11
  dflat-info.txt
  v001/
    d-manifest.txt
    delta/
      redd...
  v002/
    f-manifest.txt
    full/
      data/
      metadata/
      enrichment/
      annotation/
34 Reverse Delta Directory (ReDD)
- File-level reverse delta compression
redd/
  0=redd_0.1
  add/
  delete.txt
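File-level reverse deltas can be illustrated with a small Python sketch. It assumes that `add/` holds the files to restore and `delete.txt` lists the files to remove when stepping back from a newer full version to the previous one; that reading of the layout, and all function names, are assumptions made for illustration. Versions are modelled as in-memory dictionaries rather than directories.

```python
from typing import Dict, Iterable, List, Tuple

Files = Dict[str, bytes]  # file name -> content

def make_reverse_delta(older: Files, newer: Files) -> Tuple[Files, List[str]]:
    """Compute what must be added to and deleted from `newer` to recover `older`."""
    # Files that are missing from, or different in, the newer version.
    add = {name: data for name, data in older.items() if newer.get(name) != data}
    # Files that exist only in the newer version.
    delete = sorted(name for name in newer if name not in older)
    return add, delete

def apply_reverse_delta(newer: Files, add: Files, delete: Iterable[str]) -> Files:
    """Rebuild the older version from the newer full copy plus the delta."""
    restored = dict(newer)
    for name in delete:
        restored.pop(name, None)
    restored.update(add)
    return restored

v1 = {"a.txt": b"one", "b.txt": b"two"}
v2 = {"a.txt": b"ONE", "b.txt": b"two", "c.txt": b"three"}

add, delete = make_reverse_delta(v1, v2)
restored = apply_reverse_delta(v2, add, delete)
```

Only the newest version is kept in full; the unchanged `b.txt` is stored once, and the older version costs only the changed `a.txt` plus a one-line delete list.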
35 Performance Scaling
- Modern file systems, e.g. ZFS, exhibit good performance characteristics at reasonable scale
- 2,272,000 files, 28.5 TB
- 127,058,820 files, 25.7 TB
36 Status
- We are completing development of the foundational Storage and Identity services
- Identity is based on the N2T (name-to-thing) and Noid systems
- We are planning for the Ingest, Catalog, and Characterization services
- Characterization is based on JHOVE2
- As these services become available, they will be deployed centrally and locally on campuses
38 Today's Web Is History's Source Material
- The web is indispensable to science, commerce, education, entertainment, and culture
- Yet, it is highly volatile
- UC faculty and researchers have their own web publications
- Libraries and archives wish to preserve important websites
- How can we secure this valuable content into the future?
39 Web Archiving Service (WAS)
- Provides open source tools for curators to select and preserve content from the free web
- Allows curators to define the scope of a collection and the frequency of crawling, and to work collaboratively
- Content is saved in projects, grouped by common subject matter or publisher
40 Crawl Operation in WAS
41 WAS Public Access
- Starting in July, curators will be able to provide public access to their projects
- Rights based on recommendations of the Section 108 Study Group
- 6-month embargo
- Opportunities for content owners to opt out
- Libraries will add links in their online catalogs to documents and websites
- Advantages: curated collections, persistent access and URLs, full-text searching
45 WAS Partners
- Library of Congress: grant funding for development
- UC campuses, University of North Texas, and others
- Internet Archive: software and experience
- Heritrix crawler, Wayback display, Nutch indexing
- National Library of France: standards and leadership
- IWAW: International Web Archiving Workshop
- IIPC (national libraries consortium): commitment
47 CDL Curation Collaborations
- DataONE
- NSF-funded project to preserve distributed scientific data and develop infrastructure for distributed scientific research on global change
- University of New Mexico, UC Santa Barbara
- Media Vault Program
- UC Berkeley
- Historical Newspapers
- UC Riverside
49 Trusted Digital Repositories
- Trustworthy Repositories Audit & Certification (TRAC)
- Criteria for evaluating repository trustworthiness
- Developed by RLG, OCLC, NARA, CRL
- Based on the Open Archival Information System (OAIS) reference model (ISO 14721)
50 TRAC
- Basic approach
- TRAC checklist provides framework
- Organization documents planning and policies
- Allows organizations to self-audit and identify gaps
- Allows other organizations to perform external audits
51 TRAC
- Three sections
- Organization
- Digital Object Management
- Technologies
52 Audits Aren't Perfect
53 Total Transparency Is Not Possible
- Budgets, personnel issues
- NDAs, competitive environment
- Computer security, firewalls
- Burden of documentation and maintenance
54 Trust but Verify
55 Trust but Verify
- The process requires both trust and a willingness to question assumptions
- For the process to work, the underlying motivation must be a desire to improve service
- Resulting in greater transparency
- Leading to trust between repositories and clients
56 Summary
- Safety through redundancy
- Meaning through description
- Utility through service
- Value through use
- Code to interfaces
- Orthogonality, but interoperability
- Composition, not addition
- Bring services to content, not content to services
57 Questions?
http://www.cdlib.org/inside/diglib/
Stephen.Abrams_at_ucop.edu
Perry.Willett_at_ucop.edu