GGF18 Preservation Environments Research Group


Slides: 34
Provided by: sdsc

GGF18 Preservation Environments Research Group
  • Organizers: Reagan Moore (
  • Bruce Barkstrom <Bruce.Barkstrom_at_noaa.gov>
  • Goals
  • Analyze capabilities required by a preservation
    environment
  • Barkstrom GGF paper - based on NASA Langley
    preservation model
  • NARA Electronic Records Archive capability
  • RLG/NARA assessment criteria for a Trusted
    Digital Repository
  • Demonstrate creation of a preservation
    environment based on data grid technology
  • Demonstrate federation of 17 SRB data grids
    (shared name spaces)
  • Demonstrate replication of a collection
  • Analyze capabilities that can be based on grid
  • iRODS rule-oriented data system
  • Participants
  • 19 contributors to data grid federation for GIN
  • MIT - PLEDGE project on preservation policies
  • SDSC - NARA research prototype persistent archive
  • U Md - Producer Archive Workflow Network

GGF Grid Interoperability Now
  • Organizers Erwin Laure (
  • Reagan Moore (
  • Arun Jagatheesan ( - grid
  • Sheau-Yen Chen ( - data grid
  • Chien-Yi Hou ( - collection
  • Goals
  • Demonstrate federation of 17 SRB data grids
    (shared name spaces)
  • Demonstrate replication of a collection
  • Participants (19 data grids)
  • APAC Australia Stephen McMahon
  • ASGC Taiwan Eric Yen, Wei-Long Ueng
  • ChinaGrid China Li Qi
  • DEISA-Italy Giuseppe Fiameni
  • IB-New Zealand Daniel Hanlon
  • IB-UK Daniel Hanlon
  • IN2P3-France Jean-Yves Nief
  • KEK- Japan Yoshimi Iida
  • LCDRG-US Chien-Yi Hou
  • NCHC Taiwan Hsu-Mei Chou

Intellectual Property Policy
  • I acknowledge that participation in GGF18 is
    subject to the GGF Intellectual Property Policy.
  • Intellectual Property Notices, Note Well: All
    statements related to the activities of the GGF
    and addressed to the GGF are subject to all
    provisions of Section 17 of GFD-C.1, which
    grants to the GGF and its participants certain
    licenses and rights in such statements. Such
    statements include verbal statements in GGF
    meetings, as well as written and electronic
    communications made at any time or place, which
    are addressed to the GGF plenary session,
  • any GGF working group or portion thereof,
  • the GFSG, or any member thereof on behalf of the
  • the GFAC, or any member thereof on behalf of the
  • any GGF mailing list, including any working group
    or research group list, or any other list
    functioning under GGF auspices,
  • the GFD Editor or the GWD process
  • Statements made outside of a GGF meeting, mailing
    list or other function, that are clearly not
    intended to be input to an GGF activity, group or
    function, are not subject to these provisions.
  • Excerpt from Section 17 of GFD-C.1: Where the GFSG
    knows of rights, or claimed rights, the GGF
    secretariat shall attempt to obtain from the
    claimant of such rights, a written assurance that
    upon approval by the GFSG of the relevant GGF
    document(s), any party will be able to obtain the
    right to implement, use and distribute the
    technology or works when implementing, using or
    distributing technology based upon the specific
    specification(s) under openly specified,
    reasonable, non-discriminatory terms. The working
    group or research group proposing the use of the
    technology with respect to which the proprietary
    rights are claimed may assist the GGF secretariat
    in this effort. The results of this procedure
    shall not affect advancement of document, except
    that the GFSG may defer approval where a delay
    may facilitate the obtaining of such assurances.
    The results will, however, be recorded by the GGF
    Secretariat, and made available. The GFSG may
    also direct that a summary of the results be
    included in any GFD published containing the
    specification. GGF Intellectual Property
    Policies are adapted from the IETF Intellectual
    Property Policies that support the Internet
    Standards Process.

Preservation Requirements
  • Authenticity
  • Maintain information about provenance of data
  • Assertions made about the file at the time of
  • Integrity
  • Maintain information about the management of the
  • Assertions made by the archivist
  • Access controls, audit trails, checksums,
    replication, synchronization, federation
  • Infrastructure independence
  • Management of properties of records independently
    of choice of storage system
  • Scalability
  • Management of large collections (billions of
    records, petabytes of data, thousands of

GIN - Two Approaches
  • Virtualize the storage resource
  • Provide a standard access interface to the
    storage system
  • Storage Resource Manager
  • Asynchronous interface to storage
  • Virtualize the shared collection - provides
    ability to implement infrastructure
  • Manage the properties of a shared collection
    independently of the multiple storage systems
    where it is distributed
  • Storage Resource Broker
  • Collection management
  • Federation of independent collections
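
The second approach, virtualizing the shared collection, can be pictured as a catalog that maps logical names to physical replicas on independent storage systems. The following is a minimal sketch under stated assumptions: the class names, resource names, and paths are hypothetical and are not the SRB data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalFile:
    """One entry in the logical file name space (hypothetical structure)."""
    logical_name: str
    metadata: dict = field(default_factory=dict)
    # Physical replicas as (storage_resource, physical_path) pairs.
    replicas: list = field(default_factory=list)


class SharedCollection:
    """Catalog mapping logical names to replicas on independent storage systems."""

    def __init__(self):
        self._files = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        """Record one more physical copy of a logical file."""
        entry = self._files.setdefault(logical_name, LogicalFile(logical_name))
        entry.replicas.append((resource, physical_path))
        entry.metadata.update(metadata)
        return entry

    def locate(self, logical_name, offline=()):
        """Return the first replica whose storage resource is not offline."""
        for resource, path in self._files[logical_name].replicas:
            if resource not in offline:
                return resource, path
        raise LookupError(f"no available replica for {logical_name}")


# Hypothetical names, for illustration only.
catalog = SharedCollection()
catalog.register("/zoneA/home/demo/img.fits", "disk-1", "/vault/0001", owner="demo")
catalog.register("/zoneA/home/demo/img.fits", "tape-1", "/archive/0001")
print(catalog.locate("/zoneA/home/demo/img.fits"))
print(catalog.locate("/zoneA/home/demo/img.fits", offline={"disk-1"}))
```

Because clients only use the logical name, a storage system can be taken offline or replaced without changing how the collection is addressed.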

SRB Data Grid Federation Status

Data Grid Federation
  • Builds on
  • Registry for data grid names - ensures each data
    grid has a unique identity
  • Trust establishment - explicit registration
    command issued by the data grid administrator of
    each data grid
  • Peer-to-peer server interaction - each SRB server
    can respond to commands from any other SRB
    server, provided trust has been established
    between the data grids
  • Administrator controlled registration of name
    spaces - each grid controls whether they will
    share user names, file names, replicate data,
    replicate metadata or allow remote data storage
  • Shibboleth style user authentication - a person
    is identified by
  • /Zone-name/user-name.domain-name.
  • Authentication is done by the home zone. No
    passwords are shared between zones.
  • Local authorization - operations are under the
    control of the zone being accessed, including
    controls on access to files, storage resources,
    metadata and user quotas. Owners of data can set
    access controls for other persons

Federation Between Data Grids
  • Figure: data access methods (Web Browser, Scommands,
    and others) reach Data Collection A and Data
    Collection B, each managed by its own data grid
  • Each data grid provides
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
  • Access controls and consistency constraints on
    cross registration of name spaces

Challenge - Replicate a Collection
  • Replicate files in a collection
  • Demonstrated at GGF17
  • Replicate metadata associated with a shared
  • Authenticity metadata - describe provenance of
  • Integrity metadata - state information such as
    checksums, access controls
  • SRB information synchronization command
  • -d -z remotezone
  • Synchronize data information with zone remotezone
  • -u -z remotezone
  • Synchronize user information with zone remotezone
  • -r -z remotezone
  • Synchronize user and resource information with
    zone remotezone
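
Replicating a collection means carrying the metadata along with the files and confirming integrity on arrival. The sketch below uses local directories as stand-ins for two data grids and plain dictionaries as stand-ins for their catalogs. The function names and the sample record are assumptions made for illustration and are not SRB commands.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file checksum in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_collection(source: Path, target: Path, catalog: dict) -> dict:
    """Copy every file, carry its metadata along, and verify the checksum."""
    target.mkdir(parents=True, exist_ok=True)
    remote_catalog = {}
    for item in sorted(source.iterdir()):
        if not item.is_file():
            continue
        copy = target / item.name
        shutil.copy2(item, copy)
        entry = dict(catalog.get(item.name, {}))  # authenticity metadata
        entry["checksum"] = sha256_of(item)       # integrity metadata
        if sha256_of(copy) != entry["checksum"]:
            raise IOError(f"checksum mismatch for {item.name}")
        remote_catalog[item.name] = entry
    return remote_catalog


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "local"
    src.mkdir()
    (src / "memo.txt").write_text("example record")
    local_catalog = {"memo.txt": {"provenance": "example donor"}}
    replicated = replicate_collection(src, Path(tmp) / "remote", local_catalog)
    print(replicated["memo.txt"]["provenance"])
```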

  • Two collections have been made available for use
    in the collection replication demonstration
  • NOAO Astronomy collection contributed by Irene
    Barg (NOAO). This contains 120 zip files and is
    about 10.6 Gigabytes in size. Each zip file
    contains multiple FITS image files. For the
    demonstration, we replicate a FITS image to the
    remote data grid, then extract metadata from the
    FITS image and load it into the remote data grid.
    We then issue a query against the user-defined
    metadata. The files are located in the SRB
    collection
  • /SDSC-GGF/home/ggfsdsc.sdsc/mtn/20051010/ct4m/noao
  • NARA Amelia Earhart collection contributed by
    Mark Conrad (NARA). This contains 43 pdf
    documents, and has extensive preservation
    metadata. The total size is less than 200
    megabytes. For the demonstration, we replicate
    the files. We load a hierarchical metadata
    description for each file through SQL commands on
    the database. We then issue a query against the
    hierarchical metadata. The files are located in
    the SRB collection
  • /SDSC-GGF/home/ggfsdsc.sdsc/nara

Collection Management
  • Metadata extraction - user-defined metadata
  • Demonstrated on FITS astronomy image
  • Images provided by Irene Barg - NOAO
    (noao-ls-t3-z1 data grid)
  • Execute remote procedure to extract metadata from
    a file
  • Created parsing template to extract metadata
    attributes from FITS header
  • Load the extracted metadata into the remote zone
    MCAT catalog
  • Modified SRB to support extraction of multiple
    versions of the same metadata attribute from
    large files
  • Executed commands on the /GGF-RNP data grid in
  • Extracted 183 metadata attributes from a FITS
  • ./Sufmeta ct422131.fits
  • DTPI 'Christopher Stubbs'
  • 89 DTPIAFFL 'University of Washington'
  • 90 DTTITLE 'A Next Generation Microlensing
    Survey of the LMC'
  • 91 DTACQUIS ''
  • 92 DTACCOUN 'mosaic '
  • 93 DTACQNAM '/ua00/mosaic/tonight/sm84.051011_05
  • 94 DTNSANAM 'ct422131.fits '
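
A parsing template for header attributes could look like the sketch below. It assumes the header is available as text cards of the form `KEYWORD = 'value'`, which is an assumption about the input and not the actual Sufmeta template format. Repeated keywords are kept as lists, mirroring the need to hold multiple versions of the same attribute.

```python
import re

# Template: keyword, '=', then a quoted string or a bare token.
CARD = re.compile(
    r"^(?P<key>[A-Z0-9_-]{1,8})\s*=\s*(?:'(?P<text>[^']*)'|(?P<bare>[^/\s]+))"
)


def extract_attributes(cards):
    """Return {keyword: [values]}, keeping repeated keywords as multiple versions."""
    attributes = {}
    for card in cards:
        match = CARD.match(card)
        if match is None:
            continue  # cards without '=' (for example END) carry no attribute
        value = match.group("text")
        if value is None:
            value = match.group("bare")
        attributes.setdefault(match.group("key"), []).append(value.strip())
    return attributes


# Keywords and values are taken from the slide; the card layout is assumed.
header = [
    "DTPI    = 'Christopher Stubbs'",
    "DTPIAFFL= 'University of Washington'",
    "DTNSANAM= 'ct422131.fits '",
    "END",
]
meta = extract_attributes(header)
print(meta["DTPI"])
```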

Collection Management
  • Metadata hierarchy - extensible schema
  • Demonstrated on a state department collection of
    communiques about Amelia Earhart
  • Collection provided by Mark Conrad - NARA
    (LCDRG-GGF data grid)
  • Create additional tables in MCAT catalog to
    support schema extension
  • Created scripts to add 70 tables to the MCAT
  • Had to map schema from 2nd normal form to 3rd
    normal form
  • Load a metadata hierarchy into the remote zone
    MCAT catalog
  • Created scripts to load the Life Cycle Data
    Requirements Guide metadata into MCAT
  • Added LCDRG metadata hierarchy to 43 files in the
    Amelia Earhart collection on the UERJ-HEPGrid in
  • Created scripts to load the metadata via SQL
  • Queried the metadata hierarchy
  • SRB commands

LCDRG Metadata Hierarchy
  • Based on a subset of attributes in NARA's Life
    Cycle Data Requirements Guide
  • RECORD GROUP or Collection
  • 14 versus 21 attributes
  • DATE, GRNO - group number
  • 60 attributes
  • DATE, GRNO - group number
  • 43 attributes
  • DATE, GRNO - group number
  • ITEM or audioVisual ITEM
  • 39 attributes or 23 attributes
  • DATE, GRNO - group number
  • 65 attributes

Information Management
  • Squery -N LCDRG_object -S LCDRG_object.object_data
  • --------------------------- RESULTS
  • data_id 503
  • --------------------------------------------------
  • data_id 558
  • --------------------------------------------------
  • Squery -N LCDRG_recordgroup -S
    LCDRG_recordgroup.recordgroup_grno -N LCDRG_object
    LCDRG_object.object_data_id 558
  • --------------------------- RESULTS
  • grno 59
  • --------------------------------------------------
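
The second query, which looks up the record group number for a given object, corresponds roughly to a join across two extension tables. The sketch below uses an in-memory SQLite database. The table names, `data_id`, and `grno` follow the slide, while the linking column `recordgroup_id` and the exact table layout are assumptions and not the MCAT schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LCDRG_recordgroup (
        recordgroup_id INTEGER PRIMARY KEY,  -- hypothetical key column
        grno INTEGER
    );
    CREATE TABLE LCDRG_object (
        data_id INTEGER PRIMARY KEY,
        recordgroup_id INTEGER REFERENCES LCDRG_recordgroup(recordgroup_id)
    );
    INSERT INTO LCDRG_recordgroup VALUES (1, 59);
    INSERT INTO LCDRG_object VALUES (558, 1);
""")


def group_number(connection, data_id):
    """Return the record group number for one object, or None if unknown."""
    row = connection.execute(
        "SELECT g.grno FROM LCDRG_object AS o "
        "JOIN LCDRG_recordgroup AS g ON g.recordgroup_id = o.recordgroup_id "
        "WHERE o.data_id = ?",
        (data_id,),
    ).fetchone()
    return None if row is None else row[0]


print(group_number(conn, 558))
```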

Demonstration - Web Browser
  • https//
  • Log onto shared collection at SDSC
  • Collection defined by port number and host
  • Differentiate between local collection and shared
  • Local collection - /home/user.domain/collection
  • Shared collection - /Zone/home/user.domain/collect
  • Web browser displays status of federated zone
  • Select remote data grid by clicking on zone
  • Browse metadata, list files, perform authorized

Demonstration - Shell Commands
  • SRB shell commands located in ./SRB3_4_1/utilities
  • ./Sinit /* connect to default collection
    specified in .srb environment file;
    authenticate yourself with
    challenge-response or GSI certificate */
  • ./Sls /* list collections and files */
  • ./Scd collection-name /* change to another
    collection */
  • ./Sufmeta -e stylesheet file /* extract metadata
    from a file */
  • ./Smeta file-name /* list user-defined metadata */
  • ./Squery -N namespace -S attributename
    /* query extensible schema */

Emerging Preservation Technology
  • NARA research prototype persistent archive
    demonstrated use of data grid technology to
    manage authenticity and integrity
  • Federated data grids
  • Current challenge is the management of
    preservation policies
  • Characterize policies as rules
  • Apply rules on each operation performed by the
    data grid
  • Manage state information describing the results
    of rule application
  • Validate that the preservation policies are being
  • Same challenge exists in grid services
  • Characterize and apply rules that govern grid
    service application

Preservation Environment Requirements
  • Bruce Barkstrom paper
  • GGF paper disappeared from Gridforge, will
  • Described capabilities needed in a preservation
    environment
  • ERA capabilities list
  • http//
  • RLG/NARA trusted digital repository assessment
  • http//
  • Can we express these capabilities and assessment
    criteria as rules applied by the data grid?

integrated Rule-Oriented Data System
  • Integrate a rule engine with a data grid
  • Express operations within the data grid as
  • Support rule sets for each collection and user
  • On access to the system
  • Select rule set (Collection user role desired
  • Load required metadata (state information) into a
    temporary metadata cache
  • Evaluate rule input parameters and perform
    desired actions
  • Rules cast as Event-Condition-Action sets
  • Rules invoke both micro-services and rules
  • Provide recovery mechanism for each micro-service
  • On completion, load changed state information
    back into persistent metadata repository
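
The processing steps above can be sketched as a small Event-Condition-Action loop. Everything here is hypothetical and is not the iRODS rule engine: the `Rule` class, the glob-style condition, and the choice to undo only the micro-services that completed are assumptions made for illustration.

```python
import copy
from fnmatch import fnmatchcase


class Rule:
    def __init__(self, event, pattern, actions):
        self.event = event      # Event that triggers the rule
        self.pattern = pattern  # Condition: glob on the object path (assumed syntax)
        self.actions = actions  # Action: list of (micro_service, recovery) pairs


def apply_rules(rules, event, obj_path, state):
    """Run the first matching rule; undo completed steps if a later one fails."""
    for rule in rules:
        if rule.event != event or not fnmatchcase(obj_path, rule.pattern):
            continue
        cache = copy.deepcopy(state)  # temporary metadata cache
        done = []
        try:
            for service, recover in rule.actions:
                service(obj_path, cache)
                done.append(recover)
        except Exception:
            for recover in reversed(done):
                recover(obj_path, cache)
            return False
        state.clear()
        state.update(cache)           # load changed state back on completion
        return True
    return False


def register_data(path, cache):
    cache.setdefault("registered", []).append(path)


def recover_register_data(path, cache):
    cache["registered"].remove(path)


def add_acl(path, cache):
    cache.setdefault("acl", {})[path] = "write"


def recover_add_acl(path, cache):
    cache["acl"].pop(path, None)


rules = [
    Rule("register_data", "/home/collections.nvo/2mass/*",
         [(register_data, recover_register_data), (add_acl, recover_add_acl)]),
    Rule("register_data", "*", [(register_data, recover_register_data)]),
]
state = {}
print(apply_rules(rules, "register_data",
                  "/home/collections.nvo/2mass/a.fits", state))
print(state)
```

Working on a copy of the state means a failed rule leaves the persistent metadata untouched, while the recovery functions remain available for undoing side effects outside the cache.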

iRODS - integrated Rule-Oriented Data System
  • Architecture diagram with the following components
  • Client Interface and Admin Interface
  • Rule Invoker and Service Manager
  • Rule Modifier Module, Config Modifier Module and
    Metadata Modifier Module
  • Three Consistency Check Modules
  • Rule Base, Current State and Metadata Persistent
    Repository
  • Resource-based Services and Metadata-based
    Services
  • Micro Service Modules (shown twice)

Example Rules
  • 0 ON register_data IF objPath like
    /home/collections.nvo/2mass/fits-images/ DO
    cut nop AND check_data_type(fits image) nop AND
    get_resource(nvo-image-resource) nop AND
    registerData recover_registerData AND
    addACLForDataToUser(2massusers.nvo,write)
    recover_addACLForDataToUser AND
    extractMetadataForFitsImage
    recover_extractMetadataForFitsImage
  • 1 ON register_data IF objPath like
    /home/collections.nvo/2mass/ DO
    get_resource(2mass-other-resource) nop AND
    registerData recover_registerData AND
    addACLForDataToUser(2massusers.nvo,write)
    recover_addACLForDataToUser
  • 2 ON register_data DO get_resource(null) nop AND
    registerData recover_registerData

ERA Capabilities
  • List of 854 required capabilities
  • Management of disposition agreements describing
    how record retention and disposal actions
  • Accession, the formal acceptance of records into
    the data management system
  • Arrangement, the organization of the records to
    preserve a required structure (implemented as a
    collection/sub-collection hierarchy)
  • Description, the management of descriptive
    metadata as well as text indexing
  • Preservation, the generation of Archival
    Information Packages
  • Access, the generation of Dissemination
    Information Packages
  • Subscription, the specification of services that
    a user picks for execution
  • Notification, the delivery of notices on service
    execution results
  • Queuing of large scale tasks through interaction
    with workflow systems
  • System performance and failure reports. Of
    particular interest is the identification of all
    failures within the data management system and
    the recovery procedures that were invoked.
  • Transformative migration, the ability to convert
    specified data formats to new standards. In this
    case, each new encoding format is managed as a
    version of the original record.
  • Display transformation, the ability to reformat a
    file for presentation.
  • Automated client specification, the ability to
    pick the appropriate client for each user.

Example Rules

Summary of Mapping to Rules
  • Multiple systems need to be integrated
  • PAWN submission pipeline - 34 operations
  • Cheshire indexing system - 13 operations
  • Kepler workflow - 53 operations
  • iRODS data management - 597 operations
  • Operations facility - the remaining
  • The 597 operations are executed by 174 generic
  • The analysis identified five types of metadata
  • Collection metadata - 11 attributes
  • File metadata - 123 attributes
  • User metadata - 38 attributes
  • Resource metadata - 9 attributes
  • Rule metadata - 32 attributes

Data Management Rules
  • Execute rule
  • Suspend rule
  • Add rule
  • Modify rule
  • List rules
  • List rule metadata
  • Validate rule set
  • Approve rule
  • Queue rule
  • List queued rules
  • Set queued rule priority
  • Adjust max run time
  • Estimate service resources
  • List metadata
  • Get metadata
  • Set metadata
  • Bulk metadata load
  • Delete metadata
  • Define extensible schema
  • Query metadata
  • Save query
  • Select saved query
  • Run saved query
  • Modify query
  • Modify running query
  • Save query result set
  • Modify query result set
  • Delete search results
  • Annotate search result
  • Sinit - set default workbench interface
  • Register user
  • Self-registration
  • Delete user
  • Suspend user
  • Activate user
  • Add resource
  • Remove resource
  • Set resource offline

File Operations
  • List files
  • Display file (template)
  • Set number of items per display page
  • Format file
  • Delete file
  • Delete file authorized
  • Delete file copies
  • Delete file versions
  • Erase file
  • Replace file
  • Set file version
  • Create soft link
  • Replicate file
  • Synchronize replicas
  • Physmove file
  • Annotate file
  • Access URL
  • Regenerate system metadata
  • Check vault
  • Delete collection
  • Bulk move files (new hierarchy)
  • Queue file for transfer
  • Queue file for encrypted transfer
  • Output file to media
  • Modify file
  • Redact file
  • Edit file
  • Replicate archives
  • Monitor resources - hot page
  • Track usage
  • Set system parameter
  • Predict resource requirements
  • Inventory resources
  • Log event
  • Delete event log entry
  • Identify data type
  • Create access role
  • Modify access control
  • Modify subscription
  • Suspend subscription
  • Resume subscription
  • Validate authenticity

Example Rules - Templates
  • File display template (file type)
  • Format conversion format template
  • Workbench display template
  • Request help format template
  • System message format template
  • Event log display template
  • System report format template
  • Monitor hot page format template
  • Hot page report template
  • Create DIP
  • Modify DIP
  • Application hot page report template
  • COTS hot page report template
  • Usage workflow report template
  • System configuration display template
  • Logistics report format template
  • Inventory report format template
  • Description extraction rule template
  • Accounting report rule template
  • DIP format template
  • Disposition agreement format template
  • Disposition action format template
  • Physical location report template
  • Inventory report template
  • Data movement summary report template
  • Access report template
  • File migration report template
  • Document internal access control template
  • AIP format template
  • Transfer format template
  • Access review determination rule template
  • Access review determination report template
  • Validate access classification rule template
  • File transfer discrepancy report template
  • Notification review report template
  • Redaction rule template
  • Search display template

Example Rules - Templates
  • Lifecycle parsing rules template
  • Authenticity validation rule template
  • Assess preservation
  • Modify workbench
  • Select workbench
  • Create description
  • Validate description
  • Modify description
  • Update description
  • Approve description
  • Create unique identifier
  • Approve disposition agreement
  • Validate transfer request
  • Validate access classification
  • Queue record for destruction
  • Certify deletion of records
  • Set disposition hold
  • Unset disposition hold
  • Record disposition action
  • Identify template use
  • Create template
  • Modify template
  • Delete template
  • List templates
  • Approve template
  • Check template
  • Assign template
  • Template-based default setting
  • Parse file
  • Generate report
  • Modify report
  • Export record
  • Export records
  • Create disposition agreement
  • Disposition record check
  • Modify disposition agreement
  • Compare disposition agreements
  • Compare access review determinations

Rule Metadata

RLG/NARA TDR Assessment Criteria
  • The assessment criteria can be mapped to
    management policies.
  • The management policies can be mapped to a set of
    rules whose execution can be automated.
  • The rules require definition of input parameters
    that define the assertion being implemented.
  • The execution of the rules generates state
    information that can be evaluated to verify the
    assertion result
  • The types of rules that are needed include
  • Specification of assertions (setting rule
    parameters - flags and descriptive metadata)
  • Deferred consistency constraints that may be
    applied at any time
  • Periodic rules that execute defined procedures
  • Atomic rules applied on each operation (access
    controls, audit trails)
  • The rules determine the metadata attributes that
    need to be managed
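
The chain from assertion to rule to state information can be illustrated with a deferred consistency check. The policy chosen here (a minimum number of replicas with matching checksums) and all names in the code are hypothetical examples and are not taken from the RLG/NARA criteria.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPolicy:
    """Assertion: every record keeps at least `min_replicas` valid copies."""
    min_replicas: int  # input parameter that defines the assertion


def check_policy(policy, catalog):
    """Deferred consistency check: produce state information per record."""
    report = {}
    for name, record in catalog.items():
        good = [r for r in record["replicas"]
                if r["checksum"] == record["checksum"]]
        report[name] = {
            "valid_replicas": len(good),
            "compliant": len(good) >= policy.min_replicas,
        }
    return report


def verify(report):
    """Evaluate the state information to verify the assertion result."""
    return all(entry["compliant"] for entry in report.values())


# Hypothetical catalog contents, for illustration only.
catalog = {
    "memo.pdf": {"checksum": "abc", "replicas": [
        {"zone": "A", "checksum": "abc"}, {"zone": "B", "checksum": "abc"}]},
    "note.pdf": {"checksum": "def", "replicas": [
        {"zone": "A", "checksum": "def"}, {"zone": "B", "checksum": "000"}]},
}
report = check_policy(ReplicaPolicy(min_replicas=2), catalog)
print(verify(report))
print(report["note.pdf"])
```

The attributes the check reads (checksums, replica locations) are exactly the metadata the rule requires the system to manage.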

TDR Rules

iRODS Development
  • Open source software
  • Collaborative development
  • Generic rule environment
  • Logical names for rules
  • Logical names for micro-services
  • Logical names for rule metadata
  • Manage versions of rules, micro-services,
  • Implies can automate the management of new
    versions of the data grid
  • Can change individual micro-services or rule sets
  • System tracks which versions are compatible and
    performs the requested operations

More Information
  • SRB http//
  • iRODS http//