1
The Role of Format Registries in Digital Preservation
Archiving Web Resources: Issues for Cultural Heritage Institutions
National Library of Australia, Canberra, November 9-11, 2004
  • Stephen L. Abrams
  • Digital Library Program Manager
  • Harvard University Library

2
Introduction
  • Almost all aspects of repository operation are
    conditioned by the format of the objects in the
    repository
  • Without proper characterization of digital
    objects (format typing and technical metadata),
    effective long-term preservation is difficult, if
    not impossible
  • Repositories need to ensure that:
  • Digital object content streams are valid with
    respect to their format
  • Metadata encapsulated within object content
    streams are consistent with externally supplied
    metadata
  • Formatted content streams remain accessible over
    time

3
What is a Format, Anyway?
  • A reversible byte-serialized encoding of an
    information model
  • A set of syntactic and semantic rules that:
  • Map from abstract content to a sequence of bytes
  • Map back from a sequence of bytes to the abstract
    content represented by those bytes
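
The two mappings can be illustrated with a toy format. This is a minimal sketch, not anything from the presentation: the "TOY1" signature, the layout (signature, count, then 32-bit big-endian integers), and the function names are all invented for illustration.

```python
import struct

MAGIC = b"TOY1"  # hypothetical signature, invented for this illustration


def encode(values):
    """Map abstract content (a list of unsigned 32-bit integers) to bytes."""
    body = b"".join(struct.pack(">I", v) for v in values)
    return MAGIC + struct.pack(">I", len(values)) + body


def decode(data):
    """Map a sequence of bytes back to the abstract content it represents."""
    if data[:4] != MAGIC:
        raise ValueError("not a TOY1 stream")
    (count,) = struct.unpack(">I", data[4:8])
    return [struct.unpack(">I", data[8 + 4 * i:12 + 4 * i])[0]
            for i in range(count)]


content = [3, 1, 4, 1, 5]
# Reversibility: decoding the encoded bytes recovers the original content.
assert decode(encode(content)) == content
```

Without knowledge of the rules in `encode` and `decode`, the byte stream alone does not reveal that it holds a list of integers.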

4
Without Format Typing, All Content is Opaque
ffd8ffe000104a464946000102010083 00830000ffed0fb05
0686f746f73686f 7020332e30003842494d03e90a507269 6
e7420496e666f000000007800000000 004800480000000002
f40240ffeeffee 030602520347052803fc000200000048 00
480000000002d80228000100000064 0000000100030303000
00001270f0001 00010000000000000000000000006008 001
90190000000000000000000000000 00000000000000000000
000000000000 000000003842494d03ed0a5265736f6c 7574
696f6e0000000010008313a30002 0001008313a3000200013
842494d040d 18465820476c6f62616c204c69676874 696e6
720416e676c650000000004000
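
The bytes shown on this slide stop being opaque once a tool knows a few format signatures. The following is a minimal sketch of signature-based identification. The signature table is a small hand-picked subset and the function name is an assumption for illustration; it is not the method of any particular tool.

```python
# Leading byte signatures ("magic numbers") for a few well-known formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
]


def identify(data):
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


# The first 16 bytes of the hex dump on the slide.
head = bytes.fromhex("ffd8ffe000104a464946000102010083")
print(identify(head))  # JPEG
print(head[6:10])      # b'JFIF'
```

A leading signature only suggests a format; it does not show that the rest of the stream is valid.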
6
Use Cases
  • Identification
  • I have an object: what format is it?
  • Validation
  • I have an object purportedly of format F: is
    it?
  • Characterization
  • I have an object of format F: what are its
    salient properties?
  • Assessment
  • I have an object of format F: is it at risk of
    obsolescence?
  • Processing
  • I have an object of format F: how can I perform
    operation X on it?
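
The first three use cases can be sketched for a single format. The example below uses GIF and looks only at the 13-byte header and logical screen descriptor. The function names are assumptions for illustration, and real validation would have to examine the whole stream.

```python
import struct


def identify(data):
    """Identification: what format is it? (Only GIF is known here.)"""
    return "GIF" if data[:3] == b"GIF" else None


def validate(data):
    """Validation: is a purported GIF plausible? Checks only the header."""
    return len(data) >= 13 and data[:6] in (b"GIF87a", b"GIF89a")


def characterize(data):
    """Characterization: extract a few salient properties from the header."""
    # Width and height are 16-bit little-endian integers after the signature.
    width, height = struct.unpack("<HH", data[6:10])
    return {"version": data[3:6].decode("ascii"),
            "width": width, "height": height}


# A hand-built 13-byte GIF header for a 320 x 200 image.
sample = b"GIF89a" + struct.pack("<HH", 320, 200) + b"\x00\x00\x00"
if identify(sample) == "GIF" and validate(sample):
    print(characterize(sample))
```

Assessment and processing depend on knowledge about the format rather than on the bytes of one object, which is where a registry comes in.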

7
Repository Format Dependencies
Based on Open Archival Information System (OAIS)
Reference Model, ISO 14721
8
Preservation Strategy Dependencies
  • Migration
  • Transform object content from format F, supported
    by yesterday's platform, to format G, supported
    by tomorrow's platform
  • Emulation
  • Ensure that yesterday's platform continues to
    work tomorrow
  • Recreate the behavior of yesterday's platform in
    the context of tomorrow's platform
  • Recreate the behavior of yesterday's platform in
    the context of a Universal Virtual Computer (UVC)

9
Institutional Archives
  • Often under an obligation to accept material of
    unknown provenance
  • Library of Congress Archive Ingest and Handling
    Test (AIHT)
  • Investigate issues surrounding the transfer of
    digital collections between institutions
  • Test corpus is the George Mason University 9/11
    archive
  • 57,000 files (13 GB)
  • Collected via submission and web harvesting
  • 97% of all files are in 9 formats
  • AIFF, ASCII, GIF, HTML, JPEG, PDF, TIFF, WAVE,
    XML
  • The remaining 3% are in 100 formats (as indicated
    by file extension)

10
Format Representation Information
  • Information that maps formatted content to more
    meaningful concepts
  • Syntax
  • A TIFF header is composed of a two-byte string,
    "II" or "MM"; a two-byte value, 0x2A00 or
    0x002A; and an unsigned 32-bit integer
  • Semantics
  • "II" indicates little-endian byte order; "MM",
    big-endian
  • The two-byte value is the decimal number 42 in
    the indicated byte order
  • The integer is the byte offset of the first IFD
    structure
  • Assessment
  • Factors bearing on a format's amenability to
    long-term preservation
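
The TIFF syntax and semantics above translate directly into code. This is a minimal sketch assuming the TIFF 6.0 conventions ("II" for little-endian, "MM" for big-endian, and the value 42 as the magic number); the function name and the returned dictionary keys are chosen here for illustration.

```python
import struct


def parse_tiff_header(data):
    """Parse the 8-byte TIFF header: byte order, magic number, IFD offset."""
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("byte-order mark is neither II nor MM")
    # Read the 16-bit magic number and 32-bit offset in the indicated order.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("magic number is not 42")
    return {
        "byte_order": "little-endian" if endian == "<" else "big-endian",
        "first_ifd_offset": ifd_offset,
    }


print(parse_tiff_header(b"II*\x00\x08\x00\x00\x00"))
print(parse_tiff_header(b"MM\x00*\x00\x00\x00\x08"))
```

Both sample headers place the first IFD at byte offset 8; they differ only in byte order.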

11
Library of Congress Assessment Model
  • Sustainability
  • Disclosure
  • Adoption
  • Transparency
  • Self-documentation
  • External dependencies
  • DRM
  • Quality and functionality

12
A New Generation of Format-Aware Tools
  • JHOVE - JSTOR/Harvard Object Validation
    Environment
  • Format-specific object identification,
    validation, and characterization
  • Modules for AIFF, ASCII, GIF, HTML, JPEG, JPEG
    2000, PDF, TIFF, UTF-8, WAVE, XML
  • NLNZ Preservation Metadata Extraction Tool
  • Adaptors for BMP, GIF, HTML, JPEG, MS Office,
    OpenOffice, PDF, TIFF, WAVE, WordPerfect
  • Short-listed for Pilgrim Trust Conservation Award
  • A great deal of knowledge about formats is
    encapsulated in these tools; where does this
    representation information come from?

13
The Harvard Format Registry
15
Format Representation Information Sources
  • There are lots of sources, but for the most part
    they are informal, inconsistent, and ephemeral

16
Diffuse Web Site c/o Internet Archive
17
Digital Formats for Library of Congress
Collections
18
What's Wrong with MIME Types?
  • Level of detail
  • Level of disclosure
  • Level of granularity
  • Non-actionable

19
What's Wrong with MIME Types?
MIME TYPE NAME: application
MIME SUBTYPE NAME: msword
REQUIRED PARAMETERS: none
OPTIONAL PARAMETERS: An optional version parameter can be
specified. Some of the more common versions are:
  4   Microsoft Word 4.0 for the Macintosh.
  5   Microsoft Word 5.0 and 5.1 for the Macintosh.
  2w  Microsoft Word for Windows 2.0
  6   Microsoft Word 6 for Windows and Macintosh platform
      independent format (coming soon)
ENCODING CONSIDERATIONS: Microsoft word files are in a
binary format. Some encoding will be necessary
for MIME mailers as in application/octet-stream.
Microsoft Word files for the Macintosh are
encoded in the data fork of a macintosh file. The
type creator is MSWD, the file type is WDBN.
Microsoft Word files that contain external data
references such as publish & subscribe services
are explicitly not allowed.
SECURITY CONSIDERATIONS: None known.
PUBLISHED SPECIFICATION: Specification by example. From any
microsoft word application select "Save As..."
from the "File" menu. Enter a filename, make sure
that "Normal" is specified for the file type, and
click "Save".
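
The granularity problem is easy to see in practice. The sketch below uses Python's standard `email.message` module to split a media type from its optional version parameter; the helper name `split_media_type` and the sample header values are assumptions for illustration.

```python
from email.message import Message


def split_media_type(header_value):
    """Return (media type, version parameter or None) for a Content-Type."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_param("version")


# The bare type says nothing about which Word version produced the file.
print(split_media_type("application/msword"))
# The version is only available if the sender chose to supply it.
print(split_media_type('application/msword; version="2w"'))
```

Because the version parameter is optional, two files of very different vintages can carry exactly the same type label.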
21
Characteristics of a Format Registry
  • Predictable data
  • Arbitrary granularity
  • Inclusive
  • Trustworthy
  • Authoritative
  • Honest broker with regard to proprietary
    information
  • Machine actionable
  • Interoperable
  • Informative, not evaluative

22
So, When Will Any of This Happen?
  • And will it be in time to be useful for existing
    at-risk digital assets?

23
National Archives (UK) PRONOM
24
Global Digital Format Registry
  • DLF funded two invitational workshops in 2002 to
    investigate issues surrounding the establishment
    of a GDFR

- Bibliothèque nationale de France
- California Digital Library
- Digital Library Federation
- Harvard University
- Internet Engineering Task Force
- JISC
- JSTOR
- Library of Congress
- MIT
- National Archives, UK
- NARA
- National Archives of Canada
- New York University
- NIST
- Online Computer Library Center
- Research Libraries Group
- Stanford University
- University of Pennsylvania
25
GDFR Architecture
  • Not a single monolithic registry
  • A distributed network of cooperating registries
  • Standard protocol
  • Standard abstract data model
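
One way to picture a network of cooperating registries is a set of nodes that answer the same lookup call and forward unanswered queries to their peers. This is only a sketch under that assumption: the class, the method names, and the sample identifiers are hypothetical, and the slides do not specify the actual GDFR protocol.

```python
class Registry:
    """A hypothetical registry node holding some records and knowing peers."""

    def __init__(self, name, records=None):
        self.name = name
        self.records = dict(records or {})
        self.peers = []

    def lookup(self, format_id, _seen=None):
        """Answer locally if possible, otherwise ask each peer once."""
        seen = _seen if _seen is not None else set()
        if self.name in seen:          # avoid looping around the network
            return None
        seen.add(self.name)
        if format_id in self.records:
            return self.name, self.records[format_id]
        for peer in self.peers:
            found = peer.lookup(format_id, seen)
            if found is not None:
                return found
        return None


# Two nodes that know about each other (all names are made up).
a = Registry("registry-a", {"example/tiff-6.0": "TIFF 6.0 description"})
b = Registry("registry-b", {"example/gif-89a": "GIF 89a description"})
a.peers.append(b)
b.peers.append(a)
print(a.lookup("example/gif-89a"))
```

Every node exposes the same call and the same record shape, which is what a standard protocol and a standard data model provide.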

26
Distributed Network of Cooperating Registries
27
Data Model
  • General descriptive properties, including
    canonical and alias identifiers for formats
  • Characterization properties, detailing the
    syntactic and semantic properties for formats
  • Processing properties, describing systems and
    services for which registered formats are inputs
    or outputs
  • Administrative properties, capturing important
    events in a registration's provenance
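
The four property groups could be held in a record along the following lines. This is a minimal sketch: every field name, identifier, and sample value is an assumption made for illustration, not the GDFR schema.

```python
from dataclasses import dataclass, field


@dataclass
class FormatRecord:
    # General descriptive properties
    canonical_id: str
    aliases: list = field(default_factory=list)
    # Characterization properties (syntax and semantics)
    characterization: dict = field(default_factory=dict)
    # Processing properties (systems that read or write the format)
    processing: list = field(default_factory=list)
    # Administrative properties (events in the registration's provenance)
    history: list = field(default_factory=list)

    def matches(self, identifier):
        """True if the identifier is the canonical ID or a known alias."""
        return identifier == self.canonical_id or identifier in self.aliases


tiff = FormatRecord(
    canonical_id="example/tiff-6.0",          # hypothetical identifier
    aliases=["image/tiff", ".tif"],
    characterization={"signature": ["II*\\x00", "MM\\x00*"]},
    processing=[{"tool": "JHOVE", "role": "validation"}],
    history=[{"event": "registered", "year": 2004}],
)
print(tiff.matches("image/tiff"))
```

Keeping aliases next to a canonical identifier lets coarse labels such as MIME types or file extensions resolve to a more precisely typed record.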

28
Service Model
  • Management services, providing mechanisms for
    maintenance, technical review, and notification
  • Access services, providing discovery and delivery
    of format representation information
  • Interoperation/synchronization

29
FRED: A Format Registry Demonstration
30
GDFR Technical Track
  • Deliverables
  • Data model
  • Network protocol
  • Editorial process
  • Reference implementation
  • Initial population
  • Schedule
  • Year one: Analysis, design, and prototype
  • Year two: Development and deployment
  • Year three: Production operation and integration
    with repository workflows

31
GDFR Administrative Track
  • Deliverables
  • Recommendations for sustainable governance
    structure and business model
  • Schedule
  • Year one: Analysis and consultation
  • Year two: White papers and consultation
  • Year three: Final recommendations

32
Why is This Important to You?
  • The GDFR is an enabling technology underlying
    digital repository operations and preservation
    activities
  • It permits typing of digital objects at an
    appropriate level of granularity
  • It enables the future recovery of the syntax and
    semantics associated with typed digital objects
  • It provides a mechanism to pool and redistribute
    the expertise of the digital preservation
    community

33
More Information
OAIS/ISO 14721: <www.ccsds.org/CCSDS/documents/650x0b1.pdf>
UVC: <www-5.ibm.com/nl/dias/preservation2.html>
LC Assessment Model: <www.digitalpreservation.gov/>
JHOVE: <hul.harvard.edu/jhove/>
NLNZ: <www.natlib.govt.nz/files/Project%20Description_v3-final.pdf>
Diffuse: <web.archive.org/web/20030128052128/http://www.diffuse.org/>
IANA MIME registry: <www.iana.org/assignments/media-types/>
PRONOM: <www.nationalarchives.gov.uk/PRONOM/>
GDFR: <hul.harvard.edu/gdfr/>
FRED: <tom.library.upenn.edu/fred/>
<stephen_abrams_at_harvard.edu>