Title: Archiving E-Mail and Public Records: Challenges, Strategies, and NARA's Electronic Records Archives Pr
1. Archiving E-Mail and Public Records: Challenges,
Strategies, and NARA's Electronic Records
Archives Program
- June 13, 2001
- Daniel M. Jansen, ERA Staff
- National Archives and Records Administration
2. The Issues
- Electronic formats are replacing paper as the
medium for communication and transactions
- Records that are indispensable for documenting
citizens' rights, the actions of Federal
officials, and the nation's history will be lost
without effective technology for preserving and
providing access to them.
- Electronic records, no less than those in
traditional forms, are critical for the effective
functioning of democracy
- Digital technology is both necessary and
advantageous for discovering and delivering
records
3. NARA's Current Electronic Records Preservation
Capabilities
- Program in existence for 30 years
- Holdings of electronic records that have received
full archival processing (accessioning,
preservation, and access to software- and
hardware-independent files) are limited to
structured data files (fielded, fixed-length,
comma-separated ASCII)
- Holdings of electronic records that have, for the
time being, received only minimal processing
(bitstream preservation only) include a wider
variety and a large quantity of files
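To illustrate why fielded, fixed-length ASCII files can be read without the software that produced them, here is a minimal Python sketch. The record layout (field names and column widths) and the sample data are hypothetical, invented for illustration; they are not taken from NARA holdings or documentation.

```python
import csv
import io

# Hypothetical record layout: (field name, width in characters).
# A real layout would come from the documentation transferred with the file.
LAYOUT = [("case_id", 6), ("year", 4), ("agency", 10)]


def parse_fixed_length(line, layout):
    """Slice one fixed-length ASCII record into named fields."""
    record = {}
    offset = 0
    for name, width in layout:
        record[name] = line[offset:offset + width].strip()
        offset += width
    return record


def to_csv(lines, layout):
    """Convert fixed-length records to comma-separated ASCII text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[name for name, _ in layout],
                            lineterminator="\n")
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_fixed_length(line, layout))
    return out.getvalue()


# Two invented 20-character records that follow LAYOUT.
sample = ["0000011999NARA      ", "0000022001USPTO     "]
csv_text = to_csv(sample, LAYOUT)
```

Only the layout and a general-purpose language are needed to read the data, which is the sense in which such files are software and hardware independent.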
4. Recent Challenges
- Diversity (office automation, image, video, and
audio formats)
- Complexity (decision support systems or GIS,
applets, and interactive WWW pages)
- Volume
  - Files
  - Bytes
- Rapidly changing nature of systems used to create
records
5. What do we need to do?
- Overcome technological obsolescence in a way that
preserves demonstrably authentic records.
- Build a dynamic solution that incorporates the
expectation of continuing change in information
technology and in the records it produces.
- Find ways to take advantage of continuing
progress in information technology in order to
maintain and improve both performance and
customer service.
6. Strategies for Digital Preservation
- Existing
  - Technology Preservation
    - Maintain original hardware and software
    - Imitate original technology
  - Data Format Migration
    - Version Migration
    - Standardize Formats
- Emerging
  - Transformation to Persistent Form
    - Persistent Object Preservation
7. Technology Preservation Strategies
- Includes emulation, maintaining original hardware
and software, etc.
- Perpetuates the problems of current technologies
- Increases in complexity over time
- Particularly complex when collections of records
to be preserved are accumulated over time, or
constitute a system rather than individual files
- Advanced technology still required to improve
information discovery, delivery, and management
- In emulation, claims of authenticity derived from
original technology are invalid
- Ultimately, addresses technological obsolescence,
not preservation of records
8. Format Migration Strategies
- Includes systematic migration through versions of
a single software package, some forms of
standardization, etc.
- Short-term and ad hoc solutions
- Replacement formats may not exist in some cases
- Even standardized products contain features that
extend the standard or implement standards in
different ways
- Each migration brings risk of alteration and
associated difficulty demonstrating integrity of
the records
- Complexity increases with growth of the number of
formats
- Ultimately, addresses technological obsolescence,
not preservation of records
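The integrity risk that each migration brings can be made concrete with a small sketch. It assumes, hypothetically, that the significant properties of a record can be extracted as a dictionary both before and after a migration; the property names and the `fingerprint` helper are illustrative and not part of any NARA system. Because the bytes of a migrated file change by design, the check compares a digest of the extracted properties rather than of the file itself.

```python
import hashlib
import json


def fingerprint(properties):
    """Digest of a record's significant properties, independent of key order."""
    canonical = json.dumps(properties, sort_keys=True, ensure_ascii=True)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


def migration_preserved(before, after):
    """True when the migrated record carries the same significant properties."""
    return fingerprint(before) == fingerprint(after)


# Hypothetical properties extracted from the same memo in different formats.
original = {"author": "J. Smith", "date": "2001-06-13", "body": "Budget approved."}
migrated_ok = {"date": "2001-06-13", "body": "Budget approved.", "author": "J. Smith"}
migrated_bad = {"author": "J. Smith", "date": "2001-06-13", "body": "Budget approved"}

unchanged = migration_preserved(original, migrated_ok)
altered = not migration_preserved(original, migrated_bad)
```

A check like this has to be repeated, and its extraction step re-validated, for every format pair, which is one way to see why complexity grows with the number of formats.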
9. Transformation Strategies
- Includes Persistent Object Preservation, an
approach emerging from collaborative research in
which NARA is participating
- Does not preserve records in their original
technological state
- Requires precise specification of archival
requirements related to content, context,
structure, and presentation of records and the
collections to which they belong
- Currently beyond the state of the art
- Focuses on requirements for preserving records
10. Comparison of Transformation and Other Strategies
- Unlike format migration strategies,
transformation approaches may minimize the number
of migrations and actively manage authenticity
- Unlike technology preservation strategies,
transformation may allow diverse and complex
genres of records to be managed and made
available regardless of the technology used to
manage them or make them accessible
- Unlike transformation strategies, format
migration and technology preservation strategies
are currently possible in limited situations,
although the sustainability of these approaches
is questionable
11. Persistent Object Preservation (POP)
- Application of Distributed Object Computation
Testbed (DOCT) technologies to archival
preservation and access (DARPA/USPTO/NARA
sponsored work at San Diego Supercomputer Center)
- Comprehensive, scalable,
infrastructure-independent, flexible
- Established in the core technologies of the
next-generation National Information Infrastructure
- Backed by high-performance computing, high-speed
communications, physical science, life science,
and digital library communities
- Implementation of Reference Model for an Open
Archival Information System (NASA, Consultative
Committee for Space Data Systems)
12. POP Transformation
- Identify all significant properties of the
classes of objects (genres of records) that are
to be preserved.
- Express these properties in formal models (i.e.,
define the canonical form for each class of
record)
- Transform the objects and the collections to
which they belong by
  - Encapsulating them in metadata defined in the
  formal models
  - Eliminating other technical characteristics that
  are proprietary, dependent on specific hardware
  or software, or subject to obsolescence.
- Use software mediators to enable future
technologies to interpret the models and metadata
  - to rebuild and repopulate collections
  - to support information discovery and delivery.
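The transformation steps above can be sketched as follows. This is a minimal illustration, not the POP implementation: the formal model (a list of significant property names for an e-mail genre), the element names, the `transform` and `mediate` helpers, and the sample message are all assumptions made for the example. XML is used here only as one example of an open, self-describing syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical formal model: the significant properties of the "email" genre.
EMAIL_MODEL = ["sender", "recipient", "date", "subject", "body"]


def transform(native_record, genre, model):
    """Encapsulate a record's significant properties in model-defined markup.

    Properties not named in the model (for example, client-specific flags)
    are dropped, since they depend on particular software.
    """
    root = ET.Element("record", {"genre": genre})
    for name in model:
        child = ET.SubElement(root, name)
        child.text = native_record.get(name, "")
    return ET.tostring(root, encoding="unicode")


def mediate(xml_text):
    """Stand-in for a software mediator: rebuild the record from its markup."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


# Invented message in a made-up native representation.
native = {
    "sender": "a@example.gov",
    "recipient": "b@example.gov",
    "date": "2001-06-13",
    "subject": "Status",
    "body": "Report attached.",
    "x_client_flags": "0x1F",  # proprietary detail, not in the model
}
persistent_form = transform(native, "email", EMAIL_MODEL)
rebuilt = mediate(persistent_form)
```

The mediator needs only the markup and the model to rebuild the record, so it could in principle be rewritten for a later technology without touching the stored form.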
13. POP Assumptions
- Content, context, structure, and presentation of
records need to be maintained to prove
authenticity
- All records can be represented as objects with
characteristics and behaviors
- Provenance and original order of records can be
represented by grouping objects into collections
- For expression of canonical form, there exists an
open, standardized, self-describing, and flexible
modeling language -- a hardware- and
software-independent method of expressing a
complex data model
- For transformation of the records and
collections, POP requires an open and
standardized syntax
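The assumption that provenance and original order can be captured by grouping objects into collections can be illustrated with a short sketch. The class names, fields, and sample values are hypothetical; XML stands in for the open, self-describing modeling language the assumptions call for.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordObject:
    identifier: str
    title: str


@dataclass
class Collection:
    provenance: str  # the office or person that created the records
    records: List[RecordObject] = field(default_factory=list)

    def to_xml(self):
        """Serialize the collection; position attributes record original order."""
        root = ET.Element("collection", {"provenance": self.provenance})
        for position, rec in enumerate(self.records, start=1):
            node = ET.SubElement(
                root, "record",
                {"id": rec.identifier, "position": str(position)})
            node.text = rec.title
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, xml_text):
        """Rebuild the collection, restoring order from recorded positions."""
        root = ET.fromstring(xml_text)
        ordered = sorted(root.findall("record"),
                         key=lambda e: int(e.get("position")))
        records = [RecordObject(e.get("id"), e.text or "") for e in ordered]
        return cls(root.get("provenance"), records)


# Original order deliberately differs from identifier order.
series = Collection("Office of the Example Secretary", [
    RecordObject("r-003", "Third memo filed first"),
    RecordObject("r-001", "First memo filed second"),
])
restored = Collection.from_xml(series.to_xml())
```

Recording the position explicitly, rather than relying on file or element order alone, keeps original order recoverable even if a later tool reorders the elements.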
14. POP Implications
- Domain-specific semantics, for every class of
records you wish to maintain, are necessary for
authenticity
- Archival processing of legacy records promises
to be difficult
  - Full automation of processes may not be possible
  - Other preservation strategies may be required in
  some instances
- The real opportunity for POP will come in the
future, as commercial products incorporate standard
markup-based data structures and communities of
interest define domain-specific markup languages
and semantics
- To handle the volume of materials being generated
and provide necessary services, POP requires
distributed, redundant storage; distributed
processing; distributed security; and high-speed
communications
15. POP Demonstrations
- Ingest and storage demonstration for several
diverse collections of electronic records
representing the following genres:
  - E-mail
  - Geospatial data
  - Office automation products
  - Databases
  - Images
- Access demonstrated for a few collections
- Usenet e-mail example: 1 million messages
collected, ingested, stored, and made available
for access in just over one day
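For a sense of scale, the Usenet figure implies a sustained rate on the order of ten messages per second. The short calculation below treats "just over one day" as exactly 24 hours, which is an assumption; the true rate would be slightly lower.

```python
MESSAGES = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # assumes exactly one day of elapsed time

# Upper bound on the sustained end-to-end rate for the demonstration.
messages_per_second = MESSAGES / SECONDS_PER_DAY
```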
16. What is the Electronic Records Archives?
- The Electronic Records Archives (ERA) is NARA's
vision for a comprehensive, systematic, and
dynamic means of preserving and providing
continuing access to authentic electronic records
over time.
17. NARA's Plan to Build ERA
- Research: Leverage, sponsor, partner in, and
conduct research into archival issues and
emerging technologies for risk mitigation
purposes
- Systems Development: Develop the ERA system with
the most promising technologies as they become
available in the market.
- Business Development: Address policy issues
arising from research and systems development,
articulate archival rules to be implemented in
the ERA system, and facilitate organizational
change
18. Research Activities: Partnerships
- Open Archival Information System (OAIS) Reference
Model
  - NASA, Consultative Committee for Space Data
  Systems
- Distributed Object Computation Testbed (DOCT)
  - Defense Advanced Research Projects Agency, U.S.
  Patent and Trademark Office
- National Partnership for Advanced Computational
Infrastructure (NPACI)
  - National Science Foundation
- Presidential Electronic Records Processing
Operational System (PERPOS)
  - Army Research Laboratory, Georgia Tech Research
  Institute
- Archivist's Workbench
  - NHPRC Grant to San Diego Supercomputer Center
- International research on Permanent Authentic
Records in Electronic Systems (InterPARES)
  - 7 international, multidisciplinary research
  teams, 10 national archives
19. Research Activities: Current Investigations
- Ingest of heterogeneous collections (includes WWW
sites)
- Ingest of geospatial data/GIS
- Demonstration of POP processes in distributed
mode
- Demonstration of POP security
- Use of computational linguistics techniques and
technologies to identify, retrieve, and extract
information from unmanaged electronic records
20. Systems Development Activities
- Initial planning, scheduling, and cost estimating
- Progressive deployment of prototypes, pilots, and
production systems that demonstrate ERA concepts
and address current NARA requirements
  - Access to Archival Databases project
  - Presidential Electronic Records Processing
  Operational System project
  - Digital Official Military Personnel Files
  Repository project
- Preparing for development of ERA proper
21. Business Development Activities
- Business process analysis/development effort
beginning in FY2002
- Will include, at a minimum, accessioning,
preservation, and access functions
- Also includes analysis of existing processes,
policies, rules, procedures, and organizational
structures in other functional areas (appraisal,
scheduling, etc.)
- Includes initial planning for management of
organizational change
- Communications
22. ERA Concept Diagrams
- ERA Functional Model
- ERA Architectural Model
- ERA Design Strategy
23. ERA Functional Model: An Open Archival
Information System Implementation
[Diagram: a Producer sends Submission Information Packages to the OAIS, which holds Archival Information Packages. A Consumer sends queries and orders to the OAIS and receives result sets and Dissemination Information Packages.]
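The flow in the functional model can be sketched in a few lines. This is only an illustration of the OAIS package types named in the diagram; the class and method names are hypothetical and do not describe the ERA design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubmissionPackage:  # SIP, sent by a producer
    identifier: str
    content: str


@dataclass
class ArchivalPackage:  # AIP, held inside the archive
    identifier: str
    content: str
    preservation_note: str


@dataclass
class DisseminationPackage:  # DIP, delivered to a consumer
    identifier: str
    content: str


class Archive:
    """Toy OAIS: ingest SIPs, answer queries, fill orders."""

    def __init__(self):
        self._holdings: Dict[str, ArchivalPackage] = {}

    def ingest(self, sip):
        """Turn a submission package into an archival package and store it."""
        aip = ArchivalPackage(sip.identifier, sip.content, "ingested from SIP")
        self._holdings[aip.identifier] = aip

    def query(self, term) -> List[str]:
        """Return a result set: identifiers whose content mentions the term."""
        return sorted(i for i, a in self._holdings.items() if term in a.content)

    def order(self, identifier):
        """Build a dissemination package from a stored archival package."""
        aip = self._holdings[identifier]
        return DisseminationPackage(aip.identifier, aip.content)


archive = Archive()
archive.ingest(SubmissionPackage("sip-1", "budget memo"))
archive.ingest(SubmissionPackage("sip-2", "travel voucher"))
results = archive.query("memo")
dip = archive.order(results[0])
```

Keeping the three package types distinct mirrors the model's separation between what producers submit, what the archive preserves, and what consumers receive.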
24. ERA Infrastructure Concept
[Diagram labels. Infrastructure: Gb/sec Internet Grid Security; Distributed Processing; Mediation among Systems; Distributed, redundant Storage; Infrastructure Independence. Components: Workbench, Trusted Repository, Digital Library. Participants: Records Creator, NARA User, Government User, Public User.]
25. ERA Design Strategy
[Diagram labels: NARA System; ERA Framework; Information Technology Architecture for Persistent Digital Collections]
26. ERA Program Schedule
- Research will continue throughout in order to
accumulate knowledge, experience, and metrics
- Develop primary system(s) at point where enabling
technologies mature and are available on the
market, estimate a 5-10 project to deploy primary
capability
- Prototypes, pilots, and operational components
rolled out annually over next few years
27. For more information
- http://www.nara.gov/era
- Dan Jansen
- (301) 713-6730x285
- dan.jansen_at_nara.gov