Trusted Datagrids: Library of Congress Projects with UCSD - PowerPoint PPT Presentation

Slides: 68
Provided by: cliffat
Transcript and Presenter's Notes

Title: Trusted Datagrids: Library of Congress Projects with UCSD


1
Trusted Datagrids: Library of Congress Projects
with UCSD
Ardys Kozbial - UCSD Libraries; David Minor - SDSC
2
Building Trust in a 3rd Party Repository: A
Pilot Project
David Minor, San Diego Supercomputer Center
3
(No Transcript)
4
(No Transcript)
5
(No Transcript)
6
How can the LC trust
someone they can't control?
7
(No Transcript)
8
Moving forward in the right direction requires
more than fuzzy promises
9
… it takes a combination of experts and tools.
10
Cyberinfrastructure is the collection of ...
Resources: computers, data storage, networks, scientific
instruments, experts, etc.
Glue: integrating software, systems, and organizations
11
Effective cyberinfrastructure for the humanities
and social sciences will allow scholars to focus
their intellectual and scholarly energies on the
issues that engage them, and to be effective
users of new media and new technologies, rather
than having to invent them.
- ACLS Commission on Cyberinfrastructure for
the Humanities and Social Sciences
12
  • The mission of the San Diego Supercomputer
    Center (SDSC) is to empower communities in
    data-oriented research, education, and practice
    through the innovation and provision of
    Cyberinfrastructure

13
  • SDSC ...
  • Is one of the original NSF supercomputer centers
  • Supports high performance computing systems
  • Supports data applications for science,
    engineering, social sciences, cultural heritage
    institutions
  • Has LARGE data capabilities
  • 3 PB Disk Storage
  • 25 PB Tape Storage

14
UCSD Libraries
  • 3.5 million volumes
  • Digital Access Management System (in
    development)
  • 250,000 objects
  • 15 TB
  • Shared collections with UC
  • California Digital Library
  • Digital Preservation Repository
  • eScholarship repository

15
Partnerships and Collaborations
  • LC Pilot Project: Building Trust in a 3rd Party
    Repository
      • Using test image collections/web crawls, ingest
        content to SDSC repository
      • Allow access for content audit
      • Track usage of content over time
      • Deliver content back to LC at end of project
  • Library of Congress NDIIPP Chronopolis Program
      • Build Production Capable Chronopolis Grid (50 TB
        x 3)
      • Further define transmission packaging for
        archival communities
      • Investigate best network transfer models for I2
        and TeraGrid networks
  • California Digital Library (CDL) Mass Transit
    Program
      • Enable UC System Libraries to transfer high-speed
        mass digitization collections across CENIC/I2
      • Develop transmission packaging for CDL content
  • UCSD Libraries Digital Asset Management System
      • RDF System with data managed in SRB at SDSC

16
SDSC DPI Group
  • Digital Preservation Initiatives Group
  • Charged with Developing and Supporting Digital
    Preservation Services within the Production
    Systems Division of SDSC.
  • http://dpi.sdsc.edu
  • Cross-Organizational Group
  • SDSC Personnel/UCSD Libraries Personnel
  • Libraries
  • Archives
  • Technology
  • Information Science

17
Cyberinfrastructure
Trust
18
For Example
19
We worked together to set up high-speed data
replication services
Achieved 200 Mb/s (2 TB/day); highly reliable
Checksums
Internet2
Checksums
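The two figures on this slide agree with each other, as a quick unit conversion shows. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a link that sustains the full rate around the clock; the function names are illustrative, and the 50 TB figure is the per-site grid size quoted on the partnerships slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def tb_per_day(megabits_per_second: float) -> float:
    """Convert a sustained line rate in Mb/s to decimal terabytes per day."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12

def days_to_move(terabytes: float, megabits_per_second: float) -> float:
    """Days needed to move a collection at a constant sustained rate."""
    return terabytes / tb_per_day(megabits_per_second)

print(f"{tb_per_day(200):.2f} TB/day")                 # 2.16 TB/day
print(f"{days_to_move(50, 200):.1f} days for 50 TB")   # 23.1 days for 50 TB
```

At this rate a single 50 TB copy would take a little over three weeks of uninterrupted transfer, which is one reason sustained reliability matters as much as peak speed.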
20
Network setup involved …
  • LC and SDSC staff working together
  • Configurations on networks and computers
  • Resolving different security environments
  • Network monitoring

21
Networking is hard!
It's not magic - there's always a reason
Lessons Learned
It highlights collaborative nature of work
Can't forget it once it's set up
22
Have multi-institutional issues been solved?
Does new infrastructure improve process?
Trust Elements
Has a long-term solution been found?
Is solution useful for other organizations?
23
(No Transcript)
24
SDSC created a robust storage environment for
this data
Multiple replications … at SDSC … and at
geographically diverse locations
25
(a process with several characteristics)
  • Needed to replicate structure exactly
  • This had to be done for 5 replications
  • Complex environment had to be transparent
  • Data had to be available for manipulation

26
  • The Storage Resource Broker provided replication
    services ...

27
... and extensive monitoring, logging and
reporting functions
(which led to many conversations)
28
Logging and monitoring procedures
  • Scripts which compared the files within the
    system with a master list checked changes on
    either side … fairly straightforward
  • But …

What is the master list and who maintains
it? Who decides what is a legitimate change? Do
you want a dark archive or an active remote data
center?
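A comparison script of the kind described on this slide can be sketched as follows. This is not the project's actual script: it assumes the master list is a simple mapping from relative path to SHA-256 digest, and the function names are hypothetical. Note that the code can only report differences; deciding which side is authoritative remains the policy question the slide raises.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_with_master(root: Path, master: dict) -> dict:
    """Report files that are missing, unexpected, or changed relative to the master list."""
    actual = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        "missing": sorted(set(master) - set(actual)),
        "unexpected": sorted(set(actual) - set(master)),
        "changed": sorted(
            name for name in set(master) & set(actual)
            if master[name] != actual[name]
        ),
    }

# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"alpha")
    (root / "b.txt").write_bytes(b"beta, edited")
    (root / "c.txt").write_bytes(b"gamma")
    master = {
        "a.txt": hashlib.sha256(b"alpha").hexdigest(),
        "b.txt": hashlib.sha256(b"beta").hexdigest(),
        "d.txt": hashlib.sha256(b"delta").hexdigest(),
    }
    report = compare_with_master(root, master)

print(report)
# {'missing': ['d.txt'], 'unexpected': ['c.txt'], 'changed': ['b.txt']}
```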
29
We tested a new Front-End
30
… and explored an important issue
  • Reliability
  • Versus
  • Accessibility

31
Always keep expectations aligned
Duplication of structure is complicated
Lessons Learned
Don't confuse accessibility and reliability
Communication highlights communication
32
Can remote data be accessed?
Can remote data be verified?
Trust Elements
Can remote data be retrieved and re-used?
Can ownership be clearly defined?
33
SDSC and LC explored a new approach to working
with web archives
Parallel indexing and display system; looked like
the default to the user
50,000 ARC files; 6 terabytes of data; short
processing time
34
Using default tools, our initial indexing rate
was 1000 files per day…
… more than 6 weeks of constant computing to
index entire collection.
This was over our time budget.
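The estimate follows directly from the collection size on the previous slide. A minimal sketch, assuming a constant indexing rate and the 50,000 ARC files quoted earlier (the function name is illustrative):

```python
def days_to_index(total_files: int, files_per_day: int) -> float:
    """Days of continuous computing at a constant indexing rate."""
    return total_files / files_per_day

days = days_to_index(50_000, 1_000)
print(f"{days:.0f} days, about {days / 7:.1f} weeks")  # 50 days, about 7.1 weeks
```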
35
We ran 18 parallel indexing instances and reduced
processing to a week
We modified the Wayback source code to create a
new access infrastructure
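One way to divide such a workload is sketched below. It is not the modified Wayback code: `index_arc` is a hypothetical stand-in for indexing one ARC file, the file names are invented, and a thread pool substitutes for what were presumably 18 separate indexing instances.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(items, workers):
    """Deal items round-robin into `workers` lists of near-equal size."""
    return [items[i::workers] for i in range(workers)]

def index_arc(name):
    """Hypothetical stand-in for indexing a single ARC file."""
    return f"indexed:{name}"

def index_shard(names):
    """Index every file in one shard, in order."""
    return [index_arc(name) for name in names]

# Invented file names standing in for the 50,000 ARC files.
arc_files = [f"crawl-{i:05d}.arc.gz" for i in range(50_000)]
shards = shard(arc_files, 18)

with ThreadPoolExecutor(max_workers=18) as pool:
    results = list(pool.map(index_shard, shards))

print(len(shards), max(len(s) for s in shards), sum(len(r) for r in results))
# 18 2778 50000
```

With about 2,778 files per shard, perfectly linear scaling at the original rate would finish in under three days; the week reported on the slide suggests that real runs carry overheads this sketch ignores.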
36
Default setup isn't always easiest
Time is a wonderful motivator
Lessons Learned
Sometimes you need to start over
Experts are often interested in your work
37
Are the final results the same?
Can the results be reached in a better way?
Trust Elements
Can a new organization bring new expertise?
Can a new organization work with your partners?
38
Next steps ….
39
Chronopolis: A Partnership
  • Chronopolis is being developed by a national
    consortium led by SDSC and the UCSD Libraries.
  • Initial Chronopolis provider sites include
  • SDSC and UCSD Libraries at UC San Diego
  • University of Maryland
  • National Center for Atmospheric Research (NCAR)
    in Boulder, CO

40
Institutions and Roles - UCSD
  • SDSC
  • Storage and networking services
  • SRB support
  • Transmission Packaging Modules
  • UCSD Libraries
  • Metadata services (PREMIS)
  • DIPs (Dissemination Information Packages)
  • Other advanced data services as needed

41
Institutions and Roles - NCAR
  • National Center for Atmospheric Research
  • Archives complete copy of all data
  • Storage and network support
  • Network testing

42
Institutions and Roles - UMIACS
  • University of Maryland Institute for Advanced
    Computer Studies
  • Archives complete copy of all data
  • Advanced data services
  • PAWN: Producer Archive Workflow Network in
    Support of Digital Preservation
  • ACE: Auditing Control Environment to Ensure the
    Long Term Integrity of Digital Archives
  • Other advanced data services as needed

43
SDSC Chronopolis Program
44
Chronopolis Vocabulary
  • Partners: UCSD Libraries, National Center for
    Atmospheric Research, University of Maryland
    Institute for Advanced Computer Studies; all
    provide grid enabled storage nodes for
    Chronopolis services.
  • Clients: ICPSR, CDL; contribute content to the
    Chronopolis preservation network.
  • SRB: Storage Resource Broker, datagrid
    software.
  • iRODS: integrated Rule Oriented Data System,
    datagrid software.
  • ACE: Auditing Control Environment, part of the
    ADAPT project at UMD.
  • PAWN: Producer Archive Workflow Network, part
    of the ADAPT project at UMD.
  • INCA: user level grid monitoring - executes
    periodic, automated, user-level testing of Grid
    software and services grid middleware.
  • Bagit: Transfer specification developed by CDL
    and the Library of Congress.
  • GridFTP: parallel transfer technology - moves
    large collections within a grid wide-area network.

45
Chronopolis Inside
  • Linked by main staging grid where data is
    verified for integrity, and quarantined for
    security purposes.
  • Collections are independently pulled into each
    system.
  • Manifest layer provides added security for
    database management and data integrity
    validation.
  • Benefits
  • 3 independently managed copies of the
    collection
  • High availability
  • High reliability

Grid Brick Disks
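The benefit of three independently managed copies can be illustrated with a small audit sketch. This is not how ACE or the Chronopolis manifest layer is implemented: it assumes each site can report a digest per file, and the site names, file names, and short digest strings are placeholders. Any copy that disagrees with the manifest is paired with a site whose copy still matches.

```python
def repair_plan(manifest, replicas):
    """List (file, bad_site, source_site) for every copy that fails the manifest.

    source_site is None when no site holds a copy that matches the manifest.
    """
    plan = []
    for name, expected in sorted(manifest.items()):
        sites = sorted(replicas)
        good = [s for s in sites if replicas[s].get(name) == expected]
        bad = [s for s in sites if replicas[s].get(name) != expected]
        for site in bad:
            plan.append((name, site, good[0] if good else None))
    return plan

# Placeholder digests, not real checksums.
manifest = {"img/001.tif": "aa11", "img/002.tif": "bb22"}
replicas = {
    "SDSC": {"img/001.tif": "aa11", "img/002.tif": "bb22"},
    "NCAR": {"img/001.tif": "aa11", "img/002.tif": "ffff"},   # corrupted copy
    "UMIACS": {"img/001.tif": "aa11", "img/002.tif": "bb22"},
}

plan = repair_plan(manifest, replicas)
print(plan)  # [('img/002.tif', 'NCAR', 'SDSC')]
```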
46
SDSC Leveraged Infrastructure
  • Serves both HPC & Digital Preservation
  • Archive
  • 25 PB capacity
  • Both HPSS & SAM-QFS
  • Online disk
  • 3PB total
  • HPC parallel file systems
  • Collections
  • Databases
  • Access Tools

Adapted from Richard Moore (SDSC)
47
Chronopolis Demonstration Project
  • Demonstration Project 2006-2007
  • Demonstration Collections Ingested within Chronopolis
      • National Virtual Observatory (NVO):
        3 TB Hyperatlas Images (partial collection)
      • Library of Congress PG Image Collection:
        600 GB Prokudin-Gorskii Image Collection
      • Interuniversity Consortium for Political
        and Social Research (ICPSR):
        2 TB Web Accessible Data
      • NCAR Observational Data:
        3 TB Observational Re-Analysis Data

48
NDIIPP Chronopolis Project
  • Creating a 3-node federated data grid at SDSC,
    NCAR and UMD, with up to 50 TB of data from CDL
    and ICPSR
  • Installing and testing a suite of monitoring
    tools using ACE, PAWN, INCA
  • Creating Appropriate Transmission Information
    Packages
  • Generating PREMIS definitions for data
  • Writing Best Practices documents for clients and
    partners

49
Chronopolis Grid Framework
(Diagram. Labels: Chronopolis Data 12-25TB; Chronopolis Data 12TB;
Sun 6140 62TB; Apple Xsan; Sun SAM-QFS; Tape Silos; SRB D-Broker;
SRB MCAT; CDL Server; ICPSR Server; ICPSR Network; UC Berkeley
Network; NCAR Network; SDSC Network; Maryland Network; UMD Network)
Adapted from Bryan Banister (SDSC)
50
NDIIPP Chronopolis Clients-CDL
  • California Digital Library
  • A part of UCOP, supports the University of
    California libraries
  • Providing up to 25TB of data from the Web-At-Risk project
  • Five years of political and governmental websites
  • ARC files created from web crawls
  • Using Bagit Transfer Structure
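To make the transfer structure concrete, the sketch below writes a minimal bag using only the Python standard library: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. It is an illustration, not the CDL tooling. The version string and SHA-256 manifest follow the later published BagIt specification (RFC 8493), whereas the draft in use at the time of this talk presumably declared an earlier version; the file name and contents are invented.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into bag_dir/data and write bagit.txt plus a manifest."""
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir itself
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # read_bytes() keeps the sketch short; hash in chunks for large ARC files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest

# Demonstration with one invented payload file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    (source / "crawl-0001.arc.gz").write_bytes(b"example crawl bytes")
    manifest = make_bag(source, Path(tmp) / "bag")
    manifest_text = manifest.read_text(encoding="utf-8")
    bagit_txt = (Path(tmp) / "bag" / "bagit.txt").read_text(encoding="utf-8")

print(manifest_text, end="")
```

The receiving side can recompute each digest and compare it with the manifest line, which is what lets a bag be verified after transfer without trusting the network in between.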

51
Diagram of CDL Data Transfer
(Diagram. Labels: Wget Bagit; CDL Virtual Machine at UCB; Wget files
1-10, 11-20; Parallel Wget Xfer; Possible SRB/Bagit Module; Bagit
Manifest; File 1 … File n; Chron Staging; Chron Repository; SDSC
Network; UMIACS Network; UMIACS; NCAR Network; NCAR)
Adapted from Bryan Banister (SDSC)
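The "files 1-10, 11-20" labels suggest that the manifest is split into fixed-size groups, each fetched by its own Wget process. A minimal sketch of that batching step, assuming manifest lines of the form "digest path"; the placeholder digests and file names are invented, and no transfer is actually performed.

```python
def parse_manifest(text):
    """Return (digest, path) pairs from manifest text, skipping blank lines."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            digest, path = line.split(None, 1)
            entries.append((digest, path.strip()))
    return entries

def batches(items, size=10):
    """Split items into consecutive groups of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 25 invented entries; the hex strings are placeholders, not real checksums.
manifest_text = "".join(
    f"{i:032x}  data/file-{i:03d}.arc.gz\n" for i in range(1, 26)
)
groups = batches(parse_manifest(manifest_text))

print([len(g) for g in groups])  # [10, 10, 5]
print(groups[1][0][1])           # data/file-011.arc.gz
```

Each group could then be handed to a separate download process, and the digests checked against the manifest once the files arrive in staging.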
52
NDIIPP Chronopolis Clients-ICPSR
  • Inter-University Consortium for Political and
    Social Research, University of Michigan
  • Providing approximately 12TB of data; wide variety of types
  • Already working with SDSC using SRB

53
Diagram of ICPSR Transfer
(Diagram. Labels: Sput/Srsync Files; ICPSR SRB Repository UMich;
Sput tar files; Parallel Sput/Srsync Xfer; Chron SRB MCAT; EMC SAN;
File 1 … File n; Chron Staging; Chron Repository; SDSC Network;
UMIACS Network; UMIACS; NCAR Network; NCAR)
Adapted from Bryan Banister (SDSC)
54
Ongoing and Future Initiatives
  • Migration of Chronopolis from SRB to iRODS
  • Develop Interoperability with Community Based
    Archival Systems/Standards
  • TRAC compliance for SDSC Production Preservation
    Services/Chronopolis Consortium

55
Looking for Partnerships
  • Repositories interested in moving large digital
    collections among heterogeneous repository
    systems.
  • Fedora, DSpace or E-Prints sites interested in
    managed datagrid storage.
  • Institutions interested in personnel swaps to
    conduct TRAC audit assessment compliance.
  • Community Needs for Mass-Scale Data Transmission
    and Storage.

56
Chronopolis Credits
  • SDSC
  • Fran Berman
  • Richard Moore
  • David Minor
  • Chris Jordan
  • Jim D'Aoust
  • Robert McDonald
  • Don Sutton
  • Bryan Banister
  • Phong Dinh
  • Jay Dombrowski
  • Emilio Valente
  • UCSD Libraries
  • Brian Schottlaender
  • Luc Declerck
  • Ardys Kozbial
  • Brad Westbrook
  • Arwen Hutt
  • NCAR
  • Don Middleton
  • Michael Burek
  • Linda McGinley
  • UMIACS
  • Joseph JaJa
  • Mike Smorul
  • Mike McGann
  • Library of Congress
  • Martha Anderson
  • Lisa Hoppis
  • CACI
  • Mike Ivey

57
http://chronopolis.sdsc.edu
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
  • Chronopolis is ...
  • a geographically distributed preservation
    environment that supports long-term management
    and stewardship of digital collections
  • implemented by developing and deploying a
    distributed data grid, and by supporting its
    human, policy, and technological infrastructure.
  • technology forecasting and migration in support
    of long-term life-cycle management of the
    dedicated preservation environment.

62
Chronopolis focuses on ...
  • Assessment of the needs of potential user
    communities and development of appropriate
    service models
  • Development of Memoranda of Understanding (MOUs),
    Service Level Agreements (SLAs), etc. to
    formalize trust relationships and manage
    expectations
  • Assessment and prototyping of best practices for
    bit preservation, authentication, metadata, etc.
  • Development of cost and risk models for long-term
    preservation
  • Development of appropriate success metrics to
    evaluate usefulness, reliability, and usability
    of infrastructure

63
The people of Chronopolis are ...
64
Organizations need ways to validate trust in 3rd
parties
In conclusion …
65
(No Transcript)
66
SDSC and the Library of Congress explored one way
to do this …
by working with Cyberinfrastructure
… and demonstrating trust.
67
With a trusted relationship, many journeys become
possible