Slides: 34
Provided by: sdsc
Learn more at: http://www.sdsc.edu
Transcript and Presenter's Notes

Title: GGF18 Preservation Environments Research Group


1
GGF18 Preservation Environments Research Group
  • Organizers: Reagan Moore (moore_at_sdsc.edu)
  • Bruce Barkstrom (Bruce.Barkstrom_at_noaa.gov)
  • Goals
  • Analyze capabilities required by a preservation
    environment
  • Barkstrom GGF paper - based on NASA Langley
    preservation model
  • NARA Electronic Records Archive capability
    requirements
  • RLG/NARA assessment criteria for a Trusted
    Digital Repository
  • Demonstrate creation of a preservation
    environment based on data grid technology
  • Demonstrate federation of 17 SRB data grids
    (shared name spaces)
  • Demonstrate replication of a collection
  • Analyze capabilities that can be based on grid
    technology
  • iRODS rule-oriented data system
  • Participants
  • 19 contributors to data grid federation for GIN
  • MIT - PLEDGE project on preservation policies
  • SDSC - NARA research prototype persistent archive
  • U Md - Producer Archive Workflow Network

2
GGF Grid Interoperability Now
  • Organizers: Erwin Laure (Erwin.Laure_at_cern.ch)
  • Reagan Moore (moore_at_sdsc.edu)
  • Arun Jagatheesan (arun_at_sdsc.edu) - grid
    coordination
  • Sheau-Yen Chen (sheauc_at_sdsc.edu) - data grid
    administrator
  • Chien-Yi Hou (chienyi_at_sdsc.edu) - collection
    administrator
  • Goals
  • Demonstrate federation of 17 SRB data grids
    (shared name spaces)
  • Demonstrate replication of a collection
  • Participants (19 data grids)
  • APAC Australia Stephen McMahon
    stephen.mcmahon_at_anu.edu.au
  • ASGC Taiwan Eric Yen, Wei-Long Ueng
    wlueng_at_twgrid.org
  • ChinaGrid China Li Qi quick.qi_at_gmail.com
  • DEISA-Italy Giuseppe Fiameni
    g.fiameni_at_cineca.it
  • IB-New Zealand Daniel Hanlon
    d.j.hanlon_at_dl.ac.uk
  • IB-UK Daniel Hanlon d.j.hanlon_at_dl.ac.uk
  • IN2P3-France Jean-Yves Nief nief_at_cc.in2p3.fr
  • KEK-Japan Yoshimi Iida yoshimi.iida_at_kek.jp
  • LCDRG-US Chien-Yi Hou chienyi_at_sdsc.edu
  • NCHC Taiwan Hsu-Mei Chou hmchou_at_nchc.org.tw

3
Intellectual Property Policy
  • I acknowledge that participation in GGF18 is
    subject to the GGF Intellectual Property Policy.
  • Intellectual Property Notices Note Well: All
    statements related to the activities of the GGF
    and addressed to the GGF are subject to all
    provisions of Section 17 of GFD-C.1, which
    grants to the GGF and its participants certain
    licenses and rights in such statements. Such
    statements include verbal statements in GGF
    meetings, as well as written and electronic
    communications made at any time or place, which
    are addressed to:
  • the GGF plenary session,
  • any GGF working group or portion thereof,
  • the GFSG, or any member thereof on behalf of the
    GFSG,
  • the GFAC, or any member thereof on behalf of the
    GFAC,
  • any GGF mailing list, including any working group
    or research group list, or any other list
    functioning under GGF auspices,
  • the GFD Editor or the GWD process
  • Statements made outside of a GGF meeting, mailing
    list or other function, that are clearly not
    intended to be input to a GGF activity, group or
    function, are not subject to these provisions.
  • Excerpt from Section 17 of GFD-C.1: Where the GFSG
    knows of rights, or claimed rights, the GGF
    secretariat shall attempt to obtain from the
    claimant of such rights, a written assurance that
    upon approval by the GFSG of the relevant GGF
    document(s), any party will be able to obtain the
    right to implement, use and distribute the
    technology or works when implementing, using or
    distributing technology based upon the specific
    specification(s) under openly specified,
    reasonable, non-discriminatory terms. The working
    group or research group proposing the use of the
    technology with respect to which the proprietary
    rights are claimed may assist the GGF secretariat
    in this effort. The results of this procedure
    shall not affect advancement of document, except
    that the GFSG may defer approval where a delay
    may facilitate the obtaining of such assurances.
    The results will, however, be recorded by the GGF
    Secretariat, and made available. The GFSG may
    also direct that a summary of the results be
    included in any GFD published containing the
    specification. GGF Intellectual Property
    Policies are adapted from the IETF Intellectual
    Property Policies that support the Internet
    Standards Process.

4
Preservation Requirements
  • Authenticity
  • Maintain information about provenance of data
  • Assertions made about the file at the time of
    ingestion
  • Integrity
  • Maintain information about the management of the
    data
  • Assertions made by the archivist
  • Access controls, audit trails, checksums,
    replication, synchronization, federation
  • Infrastructure independence
  • Management of properties of records independently
    of choice of storage system
  • Scalability
  • Management of large collections (billions of
    records, petabytes of data, thousands of
    attributes)
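
The integrity requirement above relies on checksums and replication. A minimal sketch of a fixity check follows. It assumes SHA-256 as the digest (the slides do not name an algorithm), and the function names and site labels are hypothetical, not part of SRB.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a record (the algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_replicas(recorded: str, replicas: dict[str, bytes]) -> dict[str, bool]:
    """Compare every replica against the checksum recorded at ingestion."""
    return {site: checksum(content) == recorded for site, content in replicas.items()}


original = b"communique about Amelia Earhart"
recorded_at_ingest = checksum(original)

# Hypothetical storage sites; the second copy has been corrupted.
report = verify_replicas(
    recorded_at_ingest,
    {"site-a": original, "site-b": b"communique about Amelia Earhar?"},
)
print(report)
```

Keeping the recorded checksum as metadata, separate from the storage systems that hold the replicas, is what lets the check run the same way regardless of where a copy is stored.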

5
GIN - Two Approaches
  • Virtualize the storage resource
  • Provide a standard access interface to the
    storage system
  • Storage Resource Manager
  • Asynchronous interface to storage
  • Virtualize the shared collection - provides
    ability to implement infrastructure
    independence
  • Manage the properties of a shared collection
    independently of the multiple storage systems
    where it is distributed
  • Storage Resource Broker
  • Collection management
  • Federation of independent collections

6
SRB Data Grid Federation Status
7
Data Grid Federation
  • Builds on
  • Registry for data grid names - ensures each data
    grid has a unique identity
  • Trust establishment - explicit registration
    command issued by the data grid administrator of
    each data grid
  • Peer-to-peer server interaction - each SRB server
    can respond to commands from any other SRB
    server, provided trust has been established
    between the data grids
  • Administrator controlled registration of name
    spaces - each grid controls whether they will
    share user names, file names, replicate data,
    replicate metadata or allow remote data storage
  • Shibboleth style user authentication - a person
    is identified by
  • /Zone-name/user-name.domain-name.
  • Authentication is done by the home zone. No
    passwords are shared between zones.
  • Local authorization - operations are under the
    control of the zone being accessed, including
    controls on access to files, storage resources,
    metadata and user quotas. Owners of data can set
    access controls for other persons
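
The identity and authorization model above can be sketched in a few lines. This is an illustration only: the class and function names are hypothetical, and the access control list is modeled as a plain set of identity strings rather than anything SRB actually stores.

```python
from typing import NamedTuple


class FederatedUser(NamedTuple):
    zone: str
    user: str
    domain: str


def parse_user(identity: str) -> FederatedUser:
    """Split '/Zone-name/user-name.domain-name' into its three parts."""
    if not identity.startswith("/"):
        raise ValueError(f"identity must start with '/': {identity!r}")
    zone, _, rest = identity[1:].partition("/")
    user, _, domain = rest.partition(".")
    if not (zone and user and domain):
        raise ValueError(f"malformed identity: {identity!r}")
    return FederatedUser(zone, user, domain)


def authenticating_zone(identity: str) -> str:
    """Authentication is done by the home zone named in the identity."""
    return parse_user(identity).zone


def may_access(identity: str, local_acl: set[str]) -> bool:
    """Authorization is local: the zone being accessed consults its own ACL."""
    return identity in local_acl


who = parse_user("/SDSC-GGF/ggfsdsc.sdsc")
print(who, authenticating_zone("/SDSC-GGF/ggfsdsc.sdsc"))
```

Because the zone name is part of the identity, a remote zone can grant access to a user without ever holding that user's password.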

8
Federation Between Data Grids
[Figure: two federated data grids, one holding Data
Collection A and the other Data Collection B, both
reached through the same data access methods (Web
browser, Scommands, OAI-PMH). Access controls and
consistency constraints govern the cross-registration
of name spaces between the two grids.]
  • Each data grid manages
  • Logical resource name space
  • Logical user name space
  • Logical file name space
  • Logical context (metadata)
  • Control/consistency constraints
9
Challenge - Replicate a Collection
  • Replicate files in a collection
  • Demonstrated at GGF17
  • Replicate metadata associated with a shared
    collection
  • Authenticity metadata - describe provenance of
    file
  • Integrity metadata - state information such as
    checksums, access controls
  • SRB information synchronization command
  • Szonesync.pl -d -z remotezone
  • Synchronize data information with zone
    remotezone
  • Szonesync.pl -u -z remotezone
  • Synchronize user information with zone
    remotezone
  • Szonesync.pl -r -z remotezone
  • Synchronize user and resource information with
    zone remotezone

10
Collections
  • Two collections have been made available for use
    in the collection replication demonstration
  • NOAO Astronomy collection contributed by Irene
    Barg (NOAO). This contains 120 zip files and is
    about 10.6 gigabytes in size. Each zip file
    contains multiple FITS image files. For the
    demonstration, we replicate a FITS image to the
    remote data grid, then extract metadata from the
    FITS image and load it into the remote data grid.
    We then issue a query against the user-defined
    metadata. The files are located in the SRB
    collection:
  • /SDSC-GGF/home/ggfsdsc.sdsc/mtn/20051010/ct4m/noao
  • NARA Amelia Earhart collection contributed by
    Mark Conrad (NARA). This contains 43 PDF
    documents and has extensive preservation
    metadata. The total size is less than 200
    megabytes. For the demonstration, we replicate
    the files. We load a hierarchical metadata
    description for each file through SQL commands on
    the database. We then issue a query against the
    hierarchical metadata. The files are located in
    the SRB collection:
  • /SDSC-GGF/home/ggfsdsc.sdsc/nara

11
Collection Management
  • Metadata extraction - user-defined metadata
  • Demonstrated on FITS astronomy image
  • Images provided by Irene Barg - NOAO
    (noao-ls-t3-z1 data grid)
  • Execute remote procedure to extract metadata from
    a file
  • Created parsing template to extract metadata
    attributes from FITS header
  • Load the extracted metadata into the remote zone
    MCAT catalog
  • Modified SRB to support extraction of multiple
    versions of the same metadata attribute from
    large files
  • Executed commands on the /GGF-RNP data grid in
    Brazil
  • Extracted 183 metadata attributes from a FITS
    header
  • ./Sufmeta ct422131.fits
  • DTPI 'Christopher Stubbs'
  • 89 DTPIAFFL 'University of Washington'
  • 90 DTTITLE 'A Next Generation Microlensing Survey of the LMC'
  • 91 DTACQUIS 'ctioa8.ctio.noao.edu'
  • 92 DTACCOUN 'mosaic '
  • 93 DTACQNAM '/ua00/mosaic/tonight/sm84.051011_0516.100.fits'
  • 94 DTNSANAM 'ct422131.fits '
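
The extraction step above can be illustrated with a small parser for FITS header cards (80 characters each, keyword in columns 1-8, value indicator "= " in columns 9-10). This is a sketch, not the SRB parsing template: the function names are hypothetical, the "template" is modeled as a set of keywords to keep, and only the last value of a repeated keyword is retained.

```python
def parse_card(card: str):
    """Parse one 80-character FITS header card into (keyword, value), or None."""
    keyword = card[:8].strip()
    if not keyword or card[8:10] != "= ":
        return None  # COMMENT, HISTORY, END and blank cards carry no value
    body = card[10:]
    if body.lstrip().startswith("'"):
        # String value: runs to the closing quote; '' is an escaped quote.
        i = body.index("'") + 1
        chars = []
        while i < len(body):
            if body[i] == "'":
                if body[i + 1 : i + 2] == "'":
                    chars.append("'")
                    i += 2
                    continue
                break
            chars.append(body[i])
            i += 1
        return keyword, "".join(chars).rstrip()
    # Non-string value: everything before an optional '/' comment.
    return keyword, body.split("/", 1)[0].strip()


def extract_metadata(header: str, template: set[str]) -> dict[str, str]:
    """Keep only the attributes named in the template."""
    found = {}
    for start in range(0, len(header), 80):
        parsed = parse_card(header[start : start + 80])
        if parsed and parsed[0] in template:
            found[parsed[0]] = parsed[1]
    return found


def make_card(keyword: str, value: str) -> str:
    """Build a padded card for the example below."""
    return f"{keyword:<8}= {value}".ljust(80)


header = "".join(
    [
        make_card("DTPI", "'Christopher Stubbs'"),
        make_card("DTPIAFFL", "'University of Washington'"),
        make_card("NAXIS", "2 / number of axes"),
        "END".ljust(80),
    ]
)
metadata = extract_metadata(header, {"DTPI", "DTPIAFFL"})
print(metadata)
```

The extracted attribute/value pairs are what would then be loaded into the remote zone's catalog as user-defined metadata.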

12
Collection Management
  • Metadata hierarchy - extensible schema
  • Demonstrated on a state department collection of
    communiques about Amelia Earhart
  • Collection provided by Mark Conrad - NARA
    (LCDRG-GGF data grid)
  • Create additional tables in MCAT catalog to
    support schema extension
  • Created scripts to add 70 tables to the MCAT
    catalog
  • Had to map schema from 2nd normal form to 3rd
    normal form
  • Load a metadata hierarchy into the remote zone
    MCAT catalog
  • Created scripts to load the Life Cycle Data
    Requirements Guide metadata into MCAT
  • Added LCDRG metadata hierarchy to 43 files in the
    Amelia Earhart collection on the UERJ-HEPGrid in
    Brazil
  • Created scripts to load the metadata via SQL
    commands
  • Queried the metadata hierarchy
  • SRB commands

13
LCDRG Metadata Hierarchy
  • Based on a subset of attributes in NARA's Life
    Cycle Data Requirements Guide
  • RECORD GROUP or Collection
  • 14 versus 21 attributes
  • DATE, GRNO - group number
  • SERIES
  • 60 attributes
  • DATE, GRNO - group number
  • FILE UNIT
  • 43 attributes
  • DATE, GRNO - group number
  • ITEM or audioVisual ITEM
  • 39 attributes or 23 attributes
  • DATE, GRNO - group number
  • OBJECT
  • 65 attributes

14
Information Management
  • Squery -N LCDRG_object -S LCDRG_object.object_data_id
    --------------------------- RESULTS ------------------------------
    data_id 503
    -----------------------------------------------------------------
    data_id 558
    -----------------------------------------------------------------
  • Squery -N LCDRG_recordgroup -S LCDRG_recordgroup.recordgroup_grno -N LCDRG_object LCDRG_object.object_data_id 558
    --------------------------- RESULTS ------------------------------
    grno 59
    -----------------------------------------------------------------
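
The two queries above list object identifiers and then look up the record group number for one object. A rough relational equivalent is sketched here with an in-memory SQLite database. The table and column names are hypothetical (they are not the MCAT schema), the intermediate levels of the hierarchy (series, file unit, item) are collapsed into a single foreign key, and the assignment of object 503 to the same record group is an assumption made only to populate the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lcdrg_recordgroup (
        recordgroup_id INTEGER PRIMARY KEY,
        grno           INTEGER NOT NULL
    );
    CREATE TABLE lcdrg_object (
        object_data_id INTEGER PRIMARY KEY,
        recordgroup_id INTEGER NOT NULL
            REFERENCES lcdrg_recordgroup(recordgroup_id)
    );
    INSERT INTO lcdrg_recordgroup VALUES (1, 59);
    INSERT INTO lcdrg_object VALUES (503, 1), (558, 1);
    """
)


def record_group_number(connection, data_id):
    """Return the group number (grno) of the record group holding an object."""
    row = connection.execute(
        "SELECT rg.grno FROM lcdrg_recordgroup AS rg "
        "JOIN lcdrg_object AS o ON o.recordgroup_id = rg.recordgroup_id "
        "WHERE o.object_data_id = ?",
        (data_id,),
    ).fetchone()
    return row[0] if row else None


object_ids = [
    r[0]
    for r in conn.execute(
        "SELECT object_data_id FROM lcdrg_object ORDER BY object_data_id"
    )
]
print(object_ids, record_group_number(conn, 558))
```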

15
Demonstration - Web Browser
  • https://srb.npaci.edu/mysrb331reagan.shtml
  • Log onto shared collection at SDSC
  • Collection defined by port number and host
    machine
  • Differentiate between local collection and shared
    collection
  • Local collection - /home/user.domain/collection
  • Shared collection - /Zone/home/user.domain/collection
  • Web browser displays status of federated zone
  • Select remote data grid by clicking on zone
  • Browse metadata, list files, perform authorized
    operations

16
Demonstration - Shell Commands
  • SRB shell commands located in
    ./SRB3_4_1/utilities/bin
  • ./Sinit - connect to the default collection
    specified in the .srb environment file and
    authenticate yourself with challenge-response
    or a GSI certificate
  • ./Sls - list collections and files
  • ./Scd collection-name - change to another
    collection
  • ./Sufmeta -e stylesheet file - extract metadata
    from a file
  • ./Smeta file-name - list user-defined metadata
  • ./Squery -N namespace -S attributename - query
    extensible schema

17
Emerging Preservation Technology
  • NARA research prototype persistent archive
    demonstrated use of data grid technology to
    manage authenticity and integrity
  • Federated data grids
  • Current challenge is the management of
    preservation policies
  • Characterize policies as rules
  • Apply rules on each operation performed by the
    data grid
  • Manage state information describing the results
    of rule application
  • Validate that the preservation policies are being
    followed
  • Same challenge exists in grid services
  • Characterize and apply rules that govern grid
    service application

18
Preservation Environment Requirements
  • Bruce Barkstrom paper
  • GGF paper disappeared from Gridforge, will
    reinstall
  • Described capabilities needed in a preservation
    environment
  • ERA capabilities list
  • http://www.crl.edu/content.asp?l113l258l3160
  • RLG/NARA trusted digital repository assessment
    criteria
  • http://www.dlib.org.ar/dlib/july06/ross/07ross.html
  • Can we express these capabilities and assessment
    criteria as rules applied by the data grid?

19
integrated Rule-Oriented Data System
  • Integrate a rule engine with a data grid
  • Express operations within the data grid as
    micro-services
  • Support rule sets for each collection and user
    role
  • On access to the system
  • Select rule set (collection, user role, desired
    operation)
  • Load required metadata (state information) into a
    temporary metadata cache
  • Evaluate rule input parameters and perform
    desired actions
  • Rules cast as Event-Condition-Action sets
  • Rules invoke both micro-services and rules
  • Provide recovery mechanism for each micro-service
  • On completion, load changed state information
    back into persistent metadata repository
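
The processing cycle above (load state into a temporary cache, evaluate the rule, run micro-services with a recovery mechanism for each, write changed state back on completion) can be sketched as follows. This is not the iRODS implementation: every class and function name here is hypothetical, and micro-services are modeled as plain Python callables that mutate a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict  # stands in for the temporary metadata cache
MicroService = Callable[[State], None]


@dataclass
class Step:
    action: MicroService
    recover: MicroService  # undoes the action if a later step fails


@dataclass
class Rule:
    event: str
    condition: Callable[[State], bool]
    steps: list[Step] = field(default_factory=list)


def apply_rule(rule: Rule, event: str, persistent: State) -> bool:
    """Run a rule on a working copy of the state; commit only on success."""
    cache = dict(persistent)  # load state information into a temporary cache
    if event != rule.event or not rule.condition(cache):
        return False
    done: list[Step] = []
    try:
        for step in rule.steps:
            step.action(cache)
            done.append(step)
    except Exception:
        for step in reversed(done):  # recover completed steps, newest first
            step.recover(cache)
        return False
    persistent.clear()
    persistent.update(cache)  # load changed state back into the repository
    return True


log = []


def register(state):
    state["registered"] = True
    log.append("register")


def undo_register(state):
    state.pop("registered", None)
    log.append("undo register")


def failing_extract(state):
    raise RuntimeError("metadata extraction failed")


def nop(state):
    pass


def in_home(state):
    return state["objPath"].startswith("/home/")


store = {"objPath": "/home/collections.nvo/2mass/x.fits"}
bad = Rule("register_data", in_home, [Step(register, undo_register), Step(failing_extract, nop)])
good = Rule("register_data", in_home, [Step(register, undo_register)])

failed = apply_rule(bad, "register_data", store)
succeeded = apply_rule(good, "register_data", store)
print(failed, succeeded, log, store)
```

Working on a copy keeps the persistent state unchanged when a rule fails, while the recovery callables are still needed for side effects outside the metadata (for example, a file already written to storage).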

20
iRODS - integrated Rule-Oriented Data System
[Figure: iRODS architecture diagram. Labeled
components: Client Interface, Admin Interface, Rule
Invoker, Rule Engine, Rule Base, Service Manager,
Resources, Resource-based Services, Metadata-based
Services, Micro Service Modules, Rule Modifier
Module, Config Modifier Module, Metadata Modifier
Module, three Consistency Check Modules, Current
State, Confs, Metadata Persistent Repository.]
21
Example Rules
0 ON register_data IF objPath like /home/collections.nvo/2mass/fits-images/ DO
    cut nop
    AND check_data_type(fits image) nop
    AND get_resource(nvo-image-resource) nop
    AND registerData recover_registerData
    AND addACLForDataToUser(2massusers.nvo,write) recover_addACLForDataToUser
    AND extractMetadataForFitsImage recover_extractMetadataForFitsImage
1 ON register_data IF objPath like /home/collections.nvo/2mass/ DO
    get_resource(2mass-other-resource) nop
    AND registerData recover_registerData
    AND addACLForDataToUser(2massusers.nvo,write) recover_addACLForDataToUser
2 ON register_data DO
    get_resource(null) nop
    AND registerData recover_registerData
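
The three rules above share one event and run from the most specific condition to an unconditional default. One plausible reading, sketched below, is that rules are tried in index order and that "cut" commits to the current rule so later rules are not attempted if it fails. Both points are assumptions, as is treating "like" as a path-prefix match. All names in the sketch are hypothetical.

```python
from typing import Callable, Optional

Action = Callable[[str], bool]  # returns True on success


class Rule:
    def __init__(self, prefix: Optional[str], actions: list[Action], cut: bool = False):
        self.prefix = prefix    # None means the rule has no condition
        self.actions = actions
        self.cut = cut          # assumed meaning: no fallback to later rules

    def matches(self, obj_path: str) -> bool:
        return self.prefix is None or obj_path.startswith(self.prefix)


def run_rules(rules: list[Rule], obj_path: str) -> Optional[int]:
    """Try rules in order; return the index of the rule that succeeded, else None."""
    for index, rule in enumerate(rules):
        if not rule.matches(obj_path):
            continue
        if all(action(obj_path) for action in rule.actions):
            return index
        if rule.cut:
            return None  # committed to this rule, so stop here
    return None


def ok(path: str) -> bool:
    return True


def fail(path: str) -> bool:
    return False


FITS = "/home/collections.nvo/2mass/fits-images/"
TWO_MASS = "/home/collections.nvo/2mass/"

rules = [Rule(FITS, [ok], cut=True), Rule(TWO_MASS, [ok]), Rule(None, [ok])]
chosen = [
    run_rules(rules, FITS + "a.fits"),
    run_rules(rules, TWO_MASS + "notes.txt"),
    run_rules(rules, "/home/other/file"),
]
with_cut = run_rules([Rule(FITS, [fail], cut=True), Rule(TWO_MASS, [ok])], FITS + "a.fits")
without_cut = run_rules([Rule(FITS, [fail]), Rule(TWO_MASS, [ok])], FITS + "a.fits")
print(chosen, with_cut, without_cut)
```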
22
ERA Capabilities
  • List of 854 required capabilities
  • Management of disposition agreements describing
    record retention and disposal actions
  • Accession, the formal acceptance of records into
    the data management system
  • Arrangement, the organization of the records to
    preserve a required structure (implemented as a
    collection/sub-collection hierarchy)
  • Description, the management of descriptive
    metadata as well as text indexing
  • Preservation, the generation of Archival
    Information Packages
  • Access, the generation of Dissemination
    Information Packages
  • Subscription, the specification of services that
    a user picks for execution
  • Notification, the delivery of notices on service
    execution results
  • Queuing of large scale tasks through interaction
    with workflow systems
  • System performance and failure reports. Of
    particular interest is the identification of all
    failures within the data management system and
    the recovery procedures that were invoked.
  • Transformative migration, the ability to convert
    specified data formats to new standards. In this
    case, each new encoding format is managed as a
    version of the original record.
  • Display transformation, the ability to reformat a
    file for presentation.
  • Automated client specification, the ability to
    pick the appropriate client for each user.

23
Example Rules
24
Summary of Mapping to Rules
  • Multiple systems need to be integrated
  • PAWN submission pipeline - 34 operations
  • Cheshire indexing system - 13 operations
  • Kepler workflow - 53 operations
  • iRODS data management - 597 operations
  • Operations facility - the remaining
    capabilities
  • The 597 operations are executed by 174 generic
    rules
  • The analysis identified five types of metadata
    attributes
  • Collection metadata - 11 attributes
  • File metadata - 123 attributes
  • User metadata - 38 attributes
  • Resource metadata - 9 attributes
  • Rule metadata - 32 attributes

25
Data Management Rules
  • Execute rule
  • Suspend rule
  • Add rule
  • Modify rule
  • List rules
  • List rule metadata
  • Validate rule set
  • Approve rule
  • Queue rule
  • List queued rules
  • Set queued rule priority
  • Adjust max run time
  • Estimate service resources
  • List metadata
  • Get metadata
  • Set metadata
  • Bulk metadata load
  • Delete metadata
  • Define extensible schema
  • Query metadata
  • Save query
  • Select saved query
  • Run saved query
  • Modify query
  • Modify running query
  • Save query result set
  • Modify query result set
  • Delete search results
  • Annotate search result
  • Sinit - set default workbench interface
  • Register user
  • Self-registration
  • Delete user
  • Suspend user
  • Activate user
  • Add resource
  • Remove resource
  • Set resource offline

26
File Operations
  • List files
  • Display file (template)
  • Set number of items per display page
  • Format file
  • Delete file
  • Delete file authorized
  • Delete file copies
  • Delete file versions
  • Erase file
  • Replace file
  • Set file version
  • Create soft link
  • Replicate file
  • Synchronize replicas
  • Physmove file
  • Annotate file
  • Access URL
  • Regenerate system metadata
  • Check vault
  • Delete collection
  • Bulk move files (new hierarchy)
  • Queue file for transfer
  • Queue file for encrypted transfer
  • Output file to media
  • Modify file
  • Redact file
  • Edit file
  • Replicate archives
  • Monitor resources - hot page
  • Track usage
  • Set system parameter
  • Predict resource requirements
  • Inventory resources
  • Log event
  • Delete event log entry
  • Identify data type
  • Create access role
  • Modify access control
  • Modify subscription
  • Suspend subscription
  • Resume subscription
  • Validate authenticity

27
Example Rules - Templates
  • File display template (file type)
  • Format conversion format template
  • Workbench display template
  • Request help format template
  • System message format template
  • Event log display template
  • System report format template
  • Monitor hot page format template
  • Hot page report template
  • Create DIP
  • Modify DIP
  • Application hot page report template
  • COTS hot page report template
  • Usage workflow report template
  • System configuration display template
  • Logistics report format template
  • Inventory report format template
  • Description extraction rule template
  • Accounting report rule template
  • DIP format template
  • Disposition agreement format template
  • Disposition action format template
  • Physical location report template
  • Inventory report template
  • Data movement summary report template
  • Access report template
  • File migration report template
  • Document internal access control template
  • AIP format template
  • Transfer format template
  • Access review determination rule template
  • Access review determination report template
  • Validate access classification rule template
  • File transfer discrepancy report template
  • Notification review report template
  • Redaction rule template
  • Search display template

28
Example Rules - Templates
  • Lifecycle parsing rules template
  • Authenticity validation rule template
  • Assess preservation
  • Modify workbench
  • Select workbench
  • Create description
  • Validate description
  • Modify description
  • Update description
  • Approve description
  • Create unique identifier
  • Approve disposition agreement
  • Validate transfer request
  • Validate access classification
  • Queue record for destruction
  • Certify deletion of records
  • Set disposition hold
  • Unset disposition hold
  • Record disposition action
  • Identify template use
  • Create template
  • Modify template
  • Delete template
  • List templates
  • Approve template
  • Check template
  • Assign template
  • Template-based default setting
  • Parse file
  • Generate report
  • Modify report
  • Export record
  • Export records
  • Create disposition agreement
  • Disposition record check
  • Modify disposition agreement
  • Compare disposition agreements
  • Compare access review determinations

29
Rule Metadata
30
RLG/NARA TDR Assessment Criteria
  • The assessment criteria can be mapped to
    management policies.
  • The management policies can be mapped to a set of
    rules whose execution can be automated.
  • The rules require definition of input parameters
    that define the assertion being implemented.
  • The execution of the rules generates state
    information that can be evaluated to verify the
    assertion result
  • The types of rules that are needed include
  • Specification of assertions (setting rule
    parameters - flags and descriptive metadata)
  • Deferred consistency constraints that may be
    applied at any time
  • Periodic rules that execute defined procedures
  • Atomic rules applied on each operation (access
    controls, audit trails)
  • The rules determine the metadata attributes that
    need to be managed
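
The chain described above (an assertion set through rule parameters, rule execution that generates state information, and evaluation of that state to verify the assertion) can be sketched with a single example assertion. The choice of assertion (a minimum number of replicas per file), the class name, and the catalog contents are all hypothetical and are not taken from the RLG/NARA criteria.

```python
from dataclasses import dataclass


@dataclass
class ReplicaAssertion:
    """Assertion: every file has at least `min_replicas` copies."""

    min_replicas: int  # rule input parameter that defines the assertion

    def evaluate(self, catalog: dict[str, list[str]]) -> dict[str, bool]:
        """Generate state information: one pass/fail flag per file."""
        return {
            name: len(sites) >= self.min_replicas
            for name, sites in catalog.items()
        }


def verify(state: dict[str, bool]) -> bool:
    """The assertion holds only if every file passed."""
    return all(state.values())


# Hypothetical catalog mapping file names to the sites holding a replica.
catalog = {
    "earhart-001.pdf": ["site-a", "site-b"],
    "earhart-002.pdf": ["site-a"],
}

rule = ReplicaAssertion(min_replicas=2)
state = rule.evaluate(catalog)  # could be run periodically or on demand
print(state, verify(state))
```

Run on a schedule, `evaluate` plays the role of a periodic rule; run on demand, it acts as a deferred consistency check. In both cases the per-file flags are the metadata attributes that the rule requires the system to manage.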

31
TDR Rules
32
iRODS Development
  • Open source software
  • Collaborative development
  • Generic rule environment
  • Logical names for rules
  • Logical names for micro-services
  • Logical names for rule metadata
  • Manage versions of rules, micro-services,
    metadata
  • Implies that the management of new versions of
    the data grid can be automated
  • Can change individual micro-services or rule sets
  • System tracks which versions are compatible and
    performs the requested operations

33
More Information
  • Reagan Moore: moore_at_sdsc.edu
  • SRB: http://www.sdsc.edu/srb
  • iRODS: http://www.sdsc.edu/srb/future/index.php/Main_Page