Digital Object Storage and Versioning in the Stanford Digital Repository

For tape backup we use IBM's Tivoli Storage Manager (TSM). TSM has two flavors of command for storing files on tape: 'backup' and 'archive'.

1
Digital Object Storage and Versioning in
the Stanford Digital Repository
  • Richard N Anderson
  • Digital Library Systems and Services
  • Stanford University Libraries and Academic Information Resources
  • 13 January 2012

2
Outline
  • Introduction
  • Functional Requirements
  • Versioning Approaches
  • Replication and Tape Archives
  • Version Manifest Structure
  • Content Metadata
  • Ongoing Design Work

3
Introduction
  • Scope of this Document
  • The Stanford Repository Architecture
  • Role of Fedora
  • Digital Object Diversity
  • The Need for Versioning

4
The Stanford Repository Architecture
5
Digital Object Diversity
6
The Need for Versioning
https://consul.stanford.edu/display/chimera/IdentityandVersioning
7
Functional Requirements
  • Identity Requirements
  • Object Modification Requirements
  • Accessioning Requirements
  • Retrieval Requirements
  • Fidelity Requirements
  • Efficiency Requirements
  • Replication Requirements
  • Cost Requirements

8
Identity Requirements
  • Object Identifiers
    • primary key identifier
    • alternate source ID
  • Version Identifiers
    • version number
    • version label
    • descriptive information

9
Modification Requirements
  • Content or Metadata May Change
  • Atomic Operations
    • add file
    • delete file
    • modify file contents
    • rename file
  • Composite Transactions
    • insert file into a sequence

10
Accessioning Requirements
  • The system should allow submission of:
    • An initial version
    • A new full object version
    • Only the changes that have occurred since the
      previous version

11
Retrieval Requirements
  • Retrieve any version
    • latest version
    • version by version number/label
    • version at a point in time
  • Retrieve any portion of an object
    • full version
    • subset of files

12
Fidelity Requirements
  • Preserve original filesystem properties
  • File content should not be modified by
    compression or headers
  • Support message digests for fixity checking
    • MD5
    • SHA1
    • SHA2 ?
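
A minimal sketch of computing the digests listed above for fixity checking, assuming Python's standard hashlib module. The function name, the default algorithm list, and the chunk size are illustrative choices, not part of the SDR design. The file is only read, so its content is left unmodified.

```python
import hashlib

def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for one file, reading it in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as stream:
        # Read fixed-size chunks so large binary files never sit fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Computing all digests in a single pass keeps I/O to one read per file, which matters for large image and video content.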

13
Efficiency Requirements
  • Must support efficient storage of large binary
    files
  • Must support storage of all media types including
    image and video
  • Minimize duplicate storage of identical file
    content

14
Replication Requirements
  • Facilitate copying of object store to another
    disk location
  • Facilitate creation of tape copies using Tivoli
    Storage Manager (TSM)
  • Minimize processing, I/O, and storage needed for
    backup copies
  • Minimize resources needed for media migration.

15
Cost Requirements
  • Open Source Software
  • Commodity Hardware

16
Versioning Approaches
  • Whole Object
  • Forward-Delta
  • Reverse-Delta
  • Content-Addressable

17
Whole Object Versioning
Store each version's complete set of files in a
series of dedicated folders
https://consul.stanford.edu/display/chimera/IdentityandVersioning
18
Whole Object Versioning
  • Pro
    • Simple design
    • Fast reconstruction of any version
  • Con
    • High level of file duplication
    • Consumes extra resources

19
Delta Versioning
https://consul.stanford.edu/display/chimera/IdentityandVersioning
20
Forward-Delta Versioning
  • Store complete file set in earliest version
    folder, and newer versions of files in later
    folders
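
A rough sketch of how a version could be reconstructed under forward-delta storage: start from the earliest folder and overlay each later one. The data shapes are assumptions made for illustration; in particular, the slide does not say how deletions are recorded, so the explicit list of deleted paths is hypothetical.

```python
def reconstruct_forward(deltas, target_version):
    """Rebuild the file set of `target_version` (1-based).

    `deltas` is a list ordered oldest first; each entry is a pair
    (changed, deleted) where `changed` maps relative path -> stored location
    and `deleted` lists paths removed in that version (hypothetical shape).
    """
    files = {}
    for changed, deleted in deltas[:target_version]:
        files.update(changed)          # newer copies override older ones
        for path in deleted:
            files.pop(path, None)      # drop files removed in this version
    return files
```

Every earlier delta has to be replayed, which is why reconstructing the latest version costs more under this approach.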

26
Reverse-Delta Versioning
  • Store complete file set in latest version folder,
    but only older versions of files in previous
    folders

29
Delta Versioning
  • Pro
    • Lower level of file duplication than the whole
      version option
  • Con
    • More complex algorithm for storage and
      retrieval
    • Rename is handled as delete and re-add (?)

30
  • Forward-Delta
    • Pro
      • Less work to add a new version
      • Less impact on replication and tape archive
    • Con
      • More work to reconstruct latest version
  • Reverse-Delta
    • Pro
      • Less work to reconstruct latest version
    • Con
      • More work to add a new version
      • More impact on replication and tape archive

31
CDL Dflat/ReDD Structure
<object-identifier>/      Dflat home directory
  current.txt             pointer to current version (e.g. v002)
  dflat-info.txt          Dflat properties file
  v001/                   reverse-delta version
    manifest.txt          version manifest
    delta/                ReDD directory
      add/                files to be added relative to subsequent
      delete.txt          files to be deleted relative to subsequent
  v002/                   reverse-delta version
    manifest.txt          version manifest
    delta/                ReDD directory
      add/                files to be added relative to subsequent
      delete.txt          files to be deleted relative to subsequent
  v003/                   current version
    manifest.txt          version manifest
    full/                 dNatural directory
      consumer/           consumer-supplied files directory
      producer/           producer-supplied files directory
      system/             system-generated files directory
32
ReDD Process for a new version
  • Build a new full version
  • Convert previous full version to a reverse-delta
    version
  • Compare new version manifest and previous version
    manifest
  • Create temp delta version containing
    • adds folder
    • deletes.txt file
  • Replace previous full version with delta version
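
The manifest comparison step can be sketched as below. This is an illustration under stated assumptions, not the CDL implementation: manifests are modeled as plain dictionaries from relative path to digest, and both function names are hypothetical.

```python
def reverse_delta(previous, new):
    """Return (add, delete) that turn the `new` file set back into `previous`.

    Both arguments map relative path -> digest.
    """
    # Files that existed before but are missing or different in the new version.
    add = {path: digest for path, digest in previous.items()
           if new.get(path) != digest}
    # Files that exist only in the new version.
    delete = sorted(path for path in new if path not in previous)
    return add, delete

def apply_reverse_delta(new, add, delete):
    """Rebuild the previous manifest from the new one plus its reverse delta."""
    restored = {path: digest for path, digest in new.items() if path not in delete}
    restored.update(add)
    return restored
```

Only the `add` files and the delete list need to stay in the older version folder; everything else is recovered from the later version.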

33
CDL MicroService Concerns
  • Reverse-delta design requires a subsequent backup
    of a full version's files to tape.
  • There does not seem to be a provision for
    renaming files across versions other than to
    delete the original file and then add in the file
    with the new name.

34
Content-Addressable Storage (CAS)
Rename files using a message digest (checksum),
store files in a common pool, and use version
manifests to specify a version's contents
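
A minimal sketch of the idea, assuming SHA-1 as the digest and a flat pool directory. The function names and the dictionary-based manifest are illustrative only.

```python
import hashlib
import shutil
from pathlib import Path

def store_file(source, pool):
    """Copy `source` into `pool` under its SHA-1 hex digest; return the digest."""
    # read_bytes() keeps the sketch short; large files should be hashed in chunks.
    digest = hashlib.sha1(Path(source).read_bytes()).hexdigest()
    target = Path(pool) / digest
    if not target.exists():            # identical content is stored only once
        shutil.copyfile(source, target)
    return digest

def store_version(files, pool):
    """Store {relative path: source file}; return a manifest {relative path: digest}."""
    Path(pool).mkdir(parents=True, exist_ok=True)
    return {rel: store_file(src, pool) for rel, src in files.items()}
```

The manifest is what preserves the original file names, since the stored copies are named by digest alone.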
35
Using Checksums for CAS
  • Content-addressable storage
    • http://en.wikipedia.org/wiki/Content-addressable_storage
  • The code monkey's guide to cryptographic hashes
    for content-based addressing
    • http://valerieaurora.org/monkey.html
  • Content addressable storage FAQ
    • http://searchsmbstorage.techtarget.com/feature/Content-addressable-storage-FAQ

36
Cryptographic hash functions
  • Cryptographic hash function
    • http://en.wikipedia.org/wiki/Cryptographic_hash_function
  • NIST Cryptographic Toolkit
    • http://csrc.nist.gov/groups/ST/toolkit/index.html
  • Which hash function should I choose?
    • http://stackoverflow.com/questions/800685/which-hash-function-should-i-choose

37
What About Collisions?
Algorithm  Checksum Length  Probability of Collision  Time to hash 500MB
MD5        16 bytes         1 in 2^128                1462 ms
SHA1       20 bytes         1 in 2^160                1644 ms
SHA256     32 bytes         1 in 2^256                5618 ms
SHA384     48 bytes         1 in 2^384                3839 ms
SHA512     64 bytes         1 in 2^512                3820 ms
http://stackoverflow.com/questions/800685/which-hash-function-should-i-choose
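
The odds listed above apply to a single pair of files. When many files share one pool, the chance that any two collide follows the birthday bound, which can be estimated as sketched here. The function is an illustration that assumes an ideal, uniformly distributed hash; it says nothing about deliberate attacks on MD5 or SHA1.

```python
import math

def collision_probability(n_items, digest_bits):
    """Approximate chance that any two of n_items share a digest (birthday bound)."""
    pairs = n_items * (n_items - 1) / 2
    # 1 - exp(-pairs / 2**bits), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / 2.0 ** digest_bits)
```

For example, `collision_probability(10**9, 160)` estimates the risk for a billion files named by a 160-bit digest such as SHA1.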
38
Content-Addressable Storage (CAS)
  • Pro
    • Simple design
    • Eliminates file duplication
    • Easy reconstruction of any version
  • Con
    • Concern about file renames in preservation

39
CAS Implementations
  • Git
  • Boar
  • SDR's Moab Design

40
Git Object Model
http://book.git-scm.com/1_the_git_object_model.html
41
Git Object Model
http://www.slideshare.net/pbhogan/power-your-workflow-with-git
47
Git Concerns
  • Does not store unaltered original files unless
    you use a 3rd-party plugin (adding complexity)
  • Adds header structure to each file, then packs
    files into a container
  • Requires special configuration settings to avoid
    zlib compression and delta compression
  • Requires a local copy of the entire repository
    in order to make revisions

48
Boar's approach
  • repository
    • blobs
      • bc
        • bc7b0fb8c2e096693acacbd6cb070f16
    • sessions
      • <snapshot name>
        • bloblist.json
        • session.json
  • Each file is checksummed, and the MD5 digest is
    used to rename and store the file inside the
    blobs hierarchy.
  • Each snapshot is a version of a session
    associated with a working directory's contents.
  • bloblist.json contains a map from filenames to
    blobs.
  • session.json contains session metadata.

49
Boar's version manifest
  • bloblist.json
    [
      {
        "size": 16,
        "md5sum": "c8fdfe3e715b32c56bf97a8c9ba05143",
        "mtime": 1317549043,
        "ctime": 1317549026,
        "filename": "mynewfile.txt"
      },
      {
        "size": 112490,
        "md5sum": "6bd3ef5e2d25d72b028dce1437a0e89a",
        "mtime": 1215443139,
        "ctime": 1215443138,
        "filename": "MARC21slim2MODS3-2.xsl"
      }
    ]

50
Boar Concerns
  • Poor documentation of internal design
  • The definitions of session and snapshot are not
    clearly articulated.
  • Pools content files of all versions in a single
    folder structure
  • Has a limited command-line API
  • Does not yet work over network protocols
  • Single developer, small community of users
  • Target audience is individual users needing a
    backup system that integrates file versioning.

51
Summary of Desired Features
Versioning Mechanism      File Fixity  No Disk Duplicates  No Tape Duplicates  No Local Repo Copy
Forward-Delta             Good         Fair                Fair                Good
Reverse-Delta (CDL ReDD)  Good         Fair                Poor                Good
Git                       Poor         Good                Poor                Poor
Boar                      Good         Good                Poor                Good
52
Moab Design
  • Imitates a mixture of best features
  • Uses a simple folder structure
  • Renames files using SHA1
  • Files are immutable
  • Uses version manifests
  • Stores new content files in new buckets

53
Moab Folder Structure
  • druidxyz/
    • v001/
      • manifest.txt manifest.sha1
      • filemap.txt filemap.sha1
      • data/
        • 2fd4e1c6
        • 7a2d28fc
        • ed849ee1
        • bb76e739
    • v002/
      • manifest.txt manifest.sha1
      • filemap.txt filemap.sha1
      • data/
        • 1b93eb12

59
Version Manifest
File Name                 Checksum
relative/path/page-1.tif  2fd4e1c6
relative/path/page-2.tif  7a2d28fc
relative/path/page-3.tif  1b93eb12
relative/path/page-4.tif  ed849ee1
relative/path/page-5.tif  bb76e739
60
File Map
Checksum  File Location
2fd4e1c6  v001/data/2fd4e1c6
7a2d28fc  v001/data/7a2d28fc
1b93eb12  v002/data/1b93eb12
ed849ee1  v001/data/ed849ee1
bb76e739  v001/data/bb76e739
61
Reconstructing an Object
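
Reconstruction amounts to joining the version manifest with the file map. The sketch below uses the sample values from the two preceding slides; the dictionary representation and the function name are assumptions made for illustration.

```python
def resolve_version(manifest, file_map):
    """Map each original file name to the stored location of its content.

    manifest: {file name: checksum}; file_map: {checksum: storage location}.
    """
    return {name: file_map[checksum] for name, checksum in manifest.items()}

# Sample data taken from the Version Manifest and File Map slides.
manifest = {
    "relative/path/page-1.tif": "2fd4e1c6",
    "relative/path/page-2.tif": "7a2d28fc",
    "relative/path/page-3.tif": "1b93eb12",
    "relative/path/page-4.tif": "ed849ee1",
    "relative/path/page-5.tif": "bb76e739",
}
file_map = {
    "2fd4e1c6": "v001/data/2fd4e1c6",
    "7a2d28fc": "v001/data/7a2d28fc",
    "1b93eb12": "v002/data/1b93eb12",
    "ed849ee1": "v001/data/ed849ee1",
    "bb76e739": "v001/data/bb76e739",
}
locations = resolve_version(manifest, file_map)
```

Here `page-3.tif` resolves to the `v002` bucket while the other pages still point at their `v001` copies, so unchanged content is never stored twice.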
62
Replication and Tape Archives
  • Disk to Disk Replication
    • Rsync ( http://rsync.samba.org/ )
  • Disk to Tape Replication
    • Tivoli Storage Manager (TSM)
      ( http://www-01.ibm.com/software/tivoli/products/storage-mgr/ )

63
Tivoli Storage Manager (TSM)
  • Backup Command
    • usually run daily by a scheduler
    • picks up incremental changes
    • wipes out deleted files after expiration period
  • Archive Command
    • requires explicit scripting to run
    • copies full directory structure to an archive
    • persistent storage of archive until explicit
      delete

64
TSM Archive Commands
  • Example
    • dsmc archive druidxyz -subdir=yes -desc="druidxyz"
    • dsmc archive druidxyz/v002 -subdir=yes -desc="druidxyz v002"
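
Because the archive command requires explicit scripting, a wrapper could assemble the command line per version folder. This sketch only builds the argument list and does not run anything. It assumes the options are written as `-subdir=yes` and `-desc=...` (the equals signs appear to have been lost in the transcript), and the function name is hypothetical.

```python
def dsmc_archive_command(path, description):
    """Build the argument list for archiving one directory tree with dsmc."""
    return ["dsmc", "archive", path, "-subdir=yes", f"-desc={description}"]

# Archive only the newest version folder of an object.
command = dsmc_archive_command("druidxyz/v002", "druidxyz v002")
```

Passing the list to `subprocess.run(command, check=True)` would execute it on a host where the TSM client is installed.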

65
Tape Archive Efficiency vs. Versioning Design
  • Whole Object
    • poor performance (must transfer full versions)
  • Reverse-Delta
    • poor performance (latest version is always full)
  • Forward-Delta
    • good performance (new versions are deltas)
  • Content-Addressable
    • poor if data for all versions is pooled
    • good if new data is segregated by version

66
Version Manifest Structure
  • CDL Checkm specification
    • https://confluence.ucop.edu/display/Curation/Checkm
  • Format
    • <pathname> <digest-type> <digest-value>
      <file-size> <modtime> ...
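
A hedged sketch of reading one manifest line in the field order shown above. The "|" separator and the treatment of lines starting with "#" as comments are assumptions based on the Checkm specification, not something the slide states; the field and function names are illustrative.

```python
FIELDS = ("pathname", "digest_type", "digest_value", "file_size", "modtime")

def parse_manifest_line(line, separator="|"):
    """Split one manifest line into named fields; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    values = [value.strip() for value in line.split(separator)]
    # zip() stops at the shorter sequence, so trailing optional fields are ignored.
    return dict(zip(FIELDS, values))
```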

67
versionMetadata XML
  • versionMetadata
  • __objectId
  • __versionIdentity
  • __versionId
  • __label
  • __timestamp
  • __description
  • __versionData
  • __dataGroup

68
versionMetadata XML (cont.)
  • __dataGroup
  • __group_name -> id "contentmetadata"
  • __baseDirectory
  • __file (repeating)
  • __id (relative_path)
  • __size
  • __signature (SHA1 hash)
  • __modtime
  • __
  • __checksum (repeating)
  • __type
  • __value

69
Content Metadata
  • Depositor Submitted Files
  • Stanford's Repository Datastreams
  • Keeping Metadata Files Straight
  • contentMetadata Datastream
  • Building a Version Manifest

70
Depositor Submitted Files
71
Stanford's Repository Datastreams
  • identityMetadata
  • descMetadata
  • relationshipMetadata
  • provenanceMetadata
  • technicalMetadata
  • contentMetadata
  • versionManifest

72
Digital Object Structure
73
contentMetadata Datastream
  • structural information about how the files are
    arranged within the object
  • key technical information to aid delivery
  • fixity information (checksums)
  • tags to categorize resource types, including
    differentiation of data files from ancillary
    files

74
contentMetadata vs versionMetadata
  • contentMetadata datastream
    • inventories only user-deposited files
    • contains metadata beyond fixity info
  • versionMetadata datastream
    • inventories all files, both user files and repository metadata
    • records file paths and directory info
    • records fixity information

75
Building a Version Manifest
76
Do we still need BagIt?
  • BagIt is just a data directory plus manifest
    files that contain checksums.
  • Can still use BagIt for transfer, but no need to
    re-calculate checksums -- just extract them from
    the version manifest.
  • We may be transferring only a subset of files for
    a new version. Would need to filter out the
    manifest entries for files not present
  • Would not use BagIt in deep storage.

77
Ongoing Design Work
  • Review Accessioning Requirements
  • APIs and Tools for Accessioning
  • Digital Stacks/Shelver Considerations
  • Versioning Workflow

78
Accessioning Requirements
  • The system should allow submission of:
    • An initial version
    • A new full object version
    • Only the changes that have occurred since the
      previous version

79
APIs and Tools for Accessioning
  • Command line
  • Procedural (e.g. Java or Ruby API)
  • Web API (RESTful services)
  • Graphical User Interface (GUI)

80
Metadata Revisions
81
Digital Stacks/Shelver Considerations
82
Versioning Workflow
  • Submission Package
  • DOR work steps
  • SDR work steps

83
Submission Package
  • A digital object identifier that is either the
    druid or the submitting agent's sourceID.
  • An explicit indication that this submission is a
    new version of an existing object.
  • A folder (or other container) holding the set of
    versioned files.
  • A submission manifest that lists relative path
    names and digests of ALL files comprising the
    new version

84
DOR Work Steps
  1. Bootstrap a versioning workflow
  2. Determine which druid is involved (may require
    lookup using sourceID), and verify that a DOR
    object exists
  3. Transfer the submitted content into a workspace
    object/content directory and the manifest into an
    object/metadata directory.
  4. Perform a fixity validation procedure to make
    sure that all content files being submitted are
    correctly included in the submission manifest.
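
The fixity validation in step 4 could look roughly like this. It is a sketch under assumptions: the manifest is modeled as a dictionary of relative path to SHA-1 hex digest, and the function name and problem labels are hypothetical. Files listed in the manifest but absent from the folder are not reported, since a new version may ship only its changed files.

```python
import hashlib
from pathlib import Path

def validate_submission(content_dir, manifest):
    """Return a list of (relative path, problem) for submitted files.

    manifest: {relative path: expected SHA-1 hex digest}.
    """
    root = Path(content_dir)
    problems = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        # read_bytes() keeps the sketch short; hash large files in chunks instead.
        actual = hashlib.sha1(path.read_bytes()).hexdigest()
        if rel not in manifest:
            problems.append((rel, "not listed in manifest"))
        elif manifest[rel] != actual:
            problems.append((rel, "digest mismatch"))
    return problems
```

An empty list means every submitted file is listed with a matching digest.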

85
DOR Work Steps
  • Compare the submission manifest against the
    current DOR content manifest for this object.
    This procedure will allow us to determine:
    • unchanged files (same filename, same digest)
    • renamed files (new filename for a given digest)
    • modified files (same filename, different digest)
    • added files (new filename, new digest)
    • deleted files (neither the old filename nor its
      digest is present)
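
The comparison can be sketched as a classification over two manifests. This is an illustrative reading of the categories above, not DOR code: manifests are modeled as dictionaries from filename to digest, and the function name is hypothetical.

```python
def compare_manifests(current, submitted):
    """Classify files by comparing filename -> digest maps of two versions."""
    current_digests = set(current.values())
    submitted_digests = set(submitted.values())
    result = {"unchanged": [], "renamed": [], "modified": [], "added": [], "deleted": []}
    for name, digest in submitted.items():
        if current.get(name) == digest:
            result["unchanged"].append(name)   # same filename, same digest
        elif name in current:
            result["modified"].append(name)    # same filename, different digest
        elif digest in current_digests:
            result["renamed"].append(name)     # new filename for a known digest
        else:
            result["added"].append(name)       # new filename, new digest
    for name, digest in current.items():
        # Deleted: neither the old filename nor its digest appears any more.
        if name not in submitted and digest not in submitted_digests:
            result["deleted"].append(name)
    return result
```

A copy of an existing file under a second name is also reported as renamed by this sketch; distinguishing the two cases would need an extra check on whether the old name survives.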

86
DOR Work Steps
  1. Verify that any modified or added files were
    included in the submission package.
  2. Run JHOVE against any new or modified files and
    merge the results into the technicalMetadata
    datastream. If there are deleted or renamed
    files, make the appropriate revisions to the
    datastream.
  3. Update the DOR contentMetadata datastream to
    incorporate structural and other changes, as well
    as extracts from technical metadata

87
DOR Work Steps
  1. Update descMetadata and other datastreams, if
    appropriate. This may be derived from
    depositor-submitted metadata that was included as
    part of the submitted package.
  2. Add a new stanza to provenanceMetadata
  3. Generate a datastream manifest that itemizes all
    DOR datastream names and fixity data.
  4. Generate new versionMetadata that includes both
    content and DOR datastream file info.

88
DOR Work Steps
  1. Run sdr-ingest-transfer in a mode similar to what
    occurs during initial accessioning. The
    sdr-ingest-transfer robot would create a BagIt
    bag in the export area and bootstrap the SDR
    Ingest workflow. The difference would be that
    only modified or added files would be included in
    the bag payload. The bag's data directory will
    also contain the content and metadata manifest
    files that document the full inventory of the
    object version. This will be essential for SDR
    storage versioning.

89
SDR Work Steps
  1. Bootstrap the workflow
  2. Determine whether this is a new object or a new
    version of an existing object
  3. Transfer the bag to a temp location
  4. Validate the bag
  5. Update any standard SDR Fedora datastreams
  6. Create the new version in storage, using the
    storage mechanism described previously