The Other Security: A New and Nimble Approach to Digital Preservation

1
The Other Security: A New and Nimble Approach to Digital Preservation
UCCSC 2009: Focus on Security, UC Davis, June 16-17, 2009
  • Stephen Abrams
  • Perry Willett
  • Digital Preservation Program
  • California Digital Library
  • University of California

2
Focus on Security
  • Traditional security risks
  • Natural disaster
  • Infrastructure failure
  • Storage failure
  • Server failure
  • Operating system failure
  • Application failure
  • Human error
  • Malicious attack

3
Focus on Security
  • The other security risks
  • Legal encumbrances
  • External dependencies
  • Media obsolescence
  • Format obsolescence
  • Staff competencies
  • Institutional commitment
  • Financial stability
  • Changing user expectations

4
Focus on Security
  • The other security risks
  • Anything that interferes with the usability of
    managed digital assets now or in the future

5
Libraries Have a Long Time Horizon
  • The UC Melvyl union catalog holds over 28 million
    items; 11,000 are more than 500 years old

6
Libraries Have a Long Time Horizon
  • What can we do to ensure that today's digital
    assets are still usable 500 years from now?

7
Agenda
  • What is digital curation?
  • Redefining the repository: a micro-services
    approach to curation
  • Web archiving
  • CDL/campus curation collaborations
  • Trusted digital curation services
  • Summary

8
Digital Curation
  • Activities focused on maintaining and adding
    value to trusted digital content
  • Encompasses preservation and access, which are
    complementary, not disparate functions
  • Preservation ensures access over time
  • Access depends on preservation up to a point in
    time
  • How can we make the "Save" button really mean
    "save"?

9
Curation Imperatives
  • Integrated business process
  • Robust technological infrastructure and
  • Human analysis and decision-making
  • Programmatic (not project-oriented)
  • Services (not systems)
  • Content (not repositories)

10
Agenda
  • ✓ What is digital curation?
  • Redefining the repository: a micro-services
    approach to curation
  • Web archiving
  • CDL/campus curation collaborations
  • Trusted digital curation services
  • Summary

11
D'où venons nous, que sommes nous, où allons nous?
Paul Gauguin, 1897-98, Museum of Fine Arts
Boston, 32.270
12
Where are we from, what are we, where are we
going?
Paul Gauguin, 1897-98, Museum of Fine Arts
Boston, 32.270
13
Where is our stuff from, what is it, where
are we going with it?
Paul Gauguin, 1897-98, Museum of Fine Arts
Boston, 32.270
14
Where From? What? Where To?
Producer
Repository
Consumer
15
Where From? What? Where To?
Producer: Ingest
Repository: Data management / archival storage
Consumer: Access / preservation planning
16
Where From? What? Where To?
Producer: Ingest; Provenance
Repository: Data management / archival storage; Characterization
Consumer: Access / preservation planning; View paths
17
Information Landscape
  • Increasing diversity in types and uses of content
  • Content arising from non-library contexts
  • Inevitable technological change

18
Infrastructure Design Goals
  • Devolve repository function into a set of
    independent, but interoperable, services
  • Since each is small and self-contained, they are
    more easily developed and maintained
  • Since the level of investment is lower, they are
    more easily replaced
  • Provide complex function through the flexible
    combination of atomistic services

19
Infrastructure Design Goals
  • Support interaction through procedural APIs,
    command line applications, and web interfaces
  • Let content managers and curators interact with
    the services without requiring changes to
    existing work practices
  • Rather than force content to come to the
    services, push the services out to the content
  • Easy deployment centrally or locally, either
    independently or in strategic combinations

20
Infrastructure Design Goals
  • Defer implementation decision making until needs
    and outcomes are clearly articulated
  • Requirements are first stated as sets of values
    and strategies that promote those values
  • Strategies are then embodied as abstract
    services, and, finally, instantiated in technical
    systems

21
Object-Centric Values and Strategies
22
Service-Centric Values and Strategies
23
Micro-Services
24
Design Process
  • What are the conceptual entities underlying the
    service?
  • What are their state properties?
  • What are their behaviors?

25
Storage Service
  • Storage service: an aggregation of storage nodes
  • Storage node: a particular configuration of object storage
  • Object: an aggregation of files over time
  • Version: a particular configuration of files at a point in
    time
  • File: a formatted bit stream
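
The entity hierarchy above can be sketched as a handful of nested data classes. This is only an illustrative model, not the CDL implementation: the class names, the `identifier` field, and the `add_version` helper are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    """A formatted bit stream."""
    name: str
    content: bytes


@dataclass
class Version:
    """A particular configuration of files at a point in time."""
    files: list[File] = field(default_factory=list)


@dataclass
class StoredObject:
    """An aggregation of files over time, kept as an ordered list of versions."""
    identifier: str  # hypothetical field; the slides do not name it
    versions: list[Version] = field(default_factory=list)

    def add_version(self, files: list[File]) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(Version(files=list(files)))
        return len(self.versions)


@dataclass
class StorageNode:
    """A particular configuration of object storage."""
    objects: dict[str, StoredObject] = field(default_factory=dict)


@dataclass
class StorageService:
    """An aggregation of storage nodes."""
    nodes: list[StorageNode] = field(default_factory=list)
```

Each level only aggregates the level below it, which mirrors the slide's definitions: a service holds nodes, a node holds objects, an object holds versions, and a version holds files.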

26
Storage Service Methods
  • Help: idempotent, safe
  • Get-state: idempotent, safe
  • Get-node-state: idempotent, safe
  • Get-object-state: idempotent, safe
  • Get-object: idempotent, unsafe
  • Get-version-state: idempotent, safe
  • Get-version: idempotent, unsafe
  • Get-file-state: idempotent, safe
  • Get-file: idempotent, unsafe
  • Add-version: non-idempotent, unsafe
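
One way a client might use this classification is to decide which calls can be repeated blindly after a timeout. The sketch below copies the idempotent/safe labels exactly as listed on the slide; the lowercase method keys and the `can_retry` helper are assumptions for illustration, not part of the service's published interface.

```python
from typing import NamedTuple


class MethodProperties(NamedTuple):
    idempotent: bool  # repeating the call has the same effect as calling once
    safe: bool        # label taken verbatim from the slide


# Classification transcribed from the method list above.
METHODS = {
    "help": MethodProperties(idempotent=True, safe=True),
    "get-state": MethodProperties(idempotent=True, safe=True),
    "get-node-state": MethodProperties(idempotent=True, safe=True),
    "get-object-state": MethodProperties(idempotent=True, safe=True),
    "get-object": MethodProperties(idempotent=True, safe=False),
    "get-version-state": MethodProperties(idempotent=True, safe=True),
    "get-version": MethodProperties(idempotent=True, safe=False),
    "get-file-state": MethodProperties(idempotent=True, safe=True),
    "get-file": MethodProperties(idempotent=True, safe=False),
    "add-version": MethodProperties(idempotent=False, safe=False),
}


def can_retry(method: str) -> bool:
    """A call may be repeated without coordination only if it is idempotent."""
    return METHODS[method].idempotent
```

Under this reading, only Add-version needs extra care on retry, since repeating it would create a second version.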

27
Storage Service Interfaces
28
Technological Change and Invariance
  • Circa 1989: FTP, POSIX, SQL
  • Circa 2029?: HTTP, URI, XML
  • Due to their inherent abstracting nature,
    protocols and interfaces last longer than systems

29
Storage Service Implementation
  • Using the file system as the controlling
    managerial abstraction, what is the thinnest
    smear of additional functionality that will make
    it an effective object store?
  • Namaste
  • CAN
  • Pairtree
  • Dflat
  • ReDD

30
Name As Text (Namaste) Tags
  • Directory-level signature files extending Dublin
    Core Kernel metadata
  • Tag (h0): 0=name_version
  • Who (h1): 1=who
  • What (h2): 2=what
  • When (h3): 3=when
  • Where (h4): 4=where
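
As a rough illustration of directory-level signature files, the sketch below writes and reads tag files whose names follow the `number=value` pattern seen elsewhere in this deck (for example `0=redd_0.1`). The function names are hypothetical, storing the value as the file's content is an assumption, and no escaping of characters that are illegal in file names is attempted.

```python
from pathlib import Path


def write_tag(directory: Path, number: int, value: str) -> Path:
    """Create a tag file named '<number>=<value>' inside the directory."""
    tag = directory / f"{number}={value}"
    # Assumption: the file body repeats the value so it survives renaming.
    tag.write_text(value + "\n", encoding="utf-8")
    return tag


def read_tags(directory: Path) -> dict[int, str]:
    """Return {number: value} for every '<digits>=<value>' file found."""
    tags = {}
    for entry in directory.iterdir():
        prefix, sep, value = entry.name.partition("=")
        if entry.is_file() and sep and prefix.isdigit():
            tags[int(prefix)] = value
    return tags
```

Because the metadata lives in the file names themselves, a plain directory listing is enough to see what a directory is and which convention it follows.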

31
Content Access Node (CAN)
  • File system conventions (structure and reserved
    names) for an object store

can/
    0=can_0.2
    can-info.txt
    log/
    store/
        pairtree...
32
Pairtree
  • Use a bigram decomposition of an object's
    identifier to determine its file system path

pairtree/
    0=pairtree_0.1
    pairtree-info.txt
    pairtree_root/
        id/
            en/
                ti/
                    fi/
                        er/
                            dflat...
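
The bigram decomposition shown on this slide can be sketched in a few lines. This is a simplified illustration: the function name is hypothetical, and the character escaping that the full Pairtree convention applies to identifiers before splitting is deliberately omitted.

```python
from pathlib import PurePosixPath


def pairtree_path(identifier: str, root: str = "pairtree_root") -> PurePosixPath:
    """Split an identifier into two-character segments under the root.

    An identifier of odd length ends with a one-character segment.
    """
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return PurePosixPath(root, *pairs)


print(pairtree_path("identifier"))  # pairtree_root/id/en/ti/fi/er
```

Splitting into short segments keeps any single directory from accumulating an unmanageable number of entries as the number of objects grows.
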
33
Dflat
  • A digital flat for object data and metadata

dflat/
    0=dflat_0.11
    dflat-info.txt
    v001/
        d-manifest.txt
        delta/
            redd...
    v002/
        f-manifest.txt
        full/
            data/
            metadata/
            enrichment/
            annotation/
34
Reverse Delta Directory (ReDD)
  • File-level reverse delta compression

redd/
    0=redd_0.1
    add/
    delete.txt
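
The idea of a file-level reverse delta can be sketched as follows: the newest version is kept in full, and an older version is stored only as the changes needed to rebuild it from the newer one. The sketch assumes that `add/` holds the files to restore and `delete.txt` lists the files to remove; that reading of the layout, the function names, and the in-memory dict representation are all assumptions made here for illustration.

```python
def reverse_delta(old: dict[str, bytes], new: dict[str, bytes]):
    """Return (add, delete) such that applying them to `new` rebuilds `old`."""
    # Files that existed in the old version with different (or any) content.
    add = {path: data for path, data in old.items() if new.get(path) != data}
    # Files introduced by the new version that the old version did not have.
    delete = sorted(path for path in new if path not in old)
    return add, delete


def apply_delta(new: dict[str, bytes], add: dict[str, bytes], delete: list[str]):
    """Rebuild the older version from the newer one plus its reverse delta."""
    rebuilt = {path: data for path, data in new.items() if path not in delete}
    rebuilt.update(add)
    return rebuilt
```

Unchanged files appear in neither part of the delta, so an old version costs only as much storage as the files that differ from its successor.
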
35
Performance Scaling
  • Modern file systems, e.g. ZFS, exhibit good
    performance characteristics at reasonable scale

2,272,000 files, 28.5 TB
127,058,820 files, 25.7 TB
36
Status
  • We are completing development of the foundational
    Storage and Identity services
  • Identity is based on N2T (name-to-thing) and Noid
    systems
  • We are planning for the Ingest, Catalog, and
    Characterization services
  • Characterization is based on JHOVE2
  • As these services become available they will be
    deployed centrally and locally on campuses

37
Agenda
  • ✓ What is digital curation?
  • ✓ Redefining the repository: a micro-services
    approach to curation
  • Web archiving
  • CDL/campus curation collaborations
  • Trusted digital curation services
  • Summary

38
Today's Web Is History's Source Material
  • The web is indispensable to science, commerce,
    education, entertainment, and culture
  • Yet, it is highly volatile
  • UC faculty and researchers have their own web
    publications
  • Libraries and archives wish to preserve important
    websites
  • How can we secure this valuable content into the
    future?

39
Web Archiving Service (WAS)
  • Provides open source tools for curators to select
    and preserve content from the free web
  • Allows curators to define scope of collection, frequency
    of crawling, work collaboratively
  • Content is saved in projects, grouped by common subject
    matter or publisher
40
Crawl operation in WAS
41
WAS Public Access
  • Starting in July, curators will be able to
    provide public access to their projects
  • Rights based on recommendations of Section 108
    Study Group
  • 6-month embargo
  • Opportunities for content owners to opt out
  • Libraries will add links in their online catalogs
    to documents, websites
  • Advantages: curated collections, persistent
    access and URLs, full-text searching

45
WAS Partners
  • Library of Congress: grant funding for
    development
  • UC campuses, University of North Texas, and
    others
  • Internet Archive: software and experience
  • Heritrix crawler, Wayback display, Nutch indexing
  • National Library of France: standards and
    leadership
  • IWAW: International Web Archiving Workshop
  • IIPC (national libraries consortium): commitment

46
Agenda
  • ✓ What is digital curation?
  • ✓ Redefining the repository: a micro-services
    approach to curation
  • ✓ Web archiving
  • CDL/campus curation collaborations
  • ✓ Trusted digital curation services
  • Summary

47
CDL Curation Collaborations
  • DataOne
  • NSF-funded project to preserve distributed
    scientific data and develop infrastructure for
    distributed scientific research on global change
  • University of New Mexico, UC Santa Barbara
  • Media Vault Program
  • UC Berkeley
  • Historical Newspapers
  • UC Riverside

48
Agenda
  • ✓ What is digital curation?
  • ✓ Redefining the repository: a micro-services
    approach to curation
  • ✓ Web archiving
  • ✓ CDL/campus curation collaborations
  • Trusted digital curation services
  • Summary

49
Trusted Digital Repositories
  • Trusted Repositories Audit and Certification
    (TRAC)
  • Criteria for evaluating repository
    trustworthiness
  • Developed by RLG, OCLC, NARA, CRL
  • Based on Open Archival Information System (OAIS)
    reference model (ISO 14721)

50
TRAC
  • Basic approach
  • TRAC checklist provides framework
  • Organization documents planning and policies
  • Allows organizations to self-audit and identify
    gaps
  • Allows other organizations to perform external
    audit

51
TRAC
  • Three sections
  • Organization
  • Digital Object Management
  • Technologies

52
Audits Aren't Perfect
53
Total Transparency Is Not Possible
  • Budgets, personnel issues
  • NDAs, competitive environment
  • Computer security, firewalls
  • Burden of documentation and maintenance

54
Trust but Verify
55
Trust but Verify
  • Process requires both trust and willingness to
    question assumptions
  • For process to work, the underlying motivation
    must be a desire to improve service
  • Resulting in greater transparency
  • Leading to trust between repositories and clients

56
Summary
  • Safety through redundancy
  • Meaning through description
  • Utility through service
  • Value through use
  • Code to interfaces
  • Orthogonality, but interoperability
  • Composition, not addition
  • Bring services to content, not content to services

57
Questions?
http://www.cdlib.org/inside/diglib/
Stephen.Abrams@ucop.edu
Perry.Willett@ucop.edu