Dr. Natasha Balac - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Dr. Natasha Balac


1
A Grand Challenge for the Information Age
  • Dr. Natasha Balac
  • San Diego Supercomputer Center
  • UC San Diego

2
SDSC overview
Production Systems
User Services and Development
  • National NSF facility since 1985 and UCSD
    Organized Research Unit
  • 400 staff and students
  • Core TeraGrid programs provide high-end
    computational and storage resources to US
    researchers, based on a national peer-review
    proposal/allocation process
  • Supports many programs including
  • Core cyberinfrastructure program
  • National TeraGrid program
  • Protein Data Bank (PDB)
  • Biomedical Informatics Research Network (BIRN)
  • OptIPuter
  • Network for Earthquake Engineering Simulation
    (NEES)
  • Geosciences Network (GEON)
  • Alliance for Cell Signaling (AfCS)
  • High Performance Wireless Research and Education
    Network (HPWREN)
  • National Virtual Observatory

Technology Research and Development
Science Research and Development
Data and Knowledge Systems
3
SDSC in Brief
SDSC CAIDA image from the 2007 Design and the
Elastic Mind Exhibit at the NY Museum of Modern
Art
  • Funding
  • In 2007, SDSC was home to over 110 research
    projects and received research funding in excess
    of $45M
  • 85% of funding from NSF; also NIH, DOE, NARA, LC,
    and other agencies/industry
  • Facilities
  • SDSC's data center is the largest academic data
    center in the world, with 36 PB of capacity
  • SDSC hosted over 90 resident events, workshops,
    and summer institutes in our facilities in 2007.
  • SDSC's focus on increased efficiency reduced our
    utility usage by 18%
  • SDSC's new building is LEED Silver-equivalent,
    the first on the UCSD campus.
  • Research
  • SDSC hosted over 100 separate community
    digital data sets and collections for sponsors
    such as NSF, NIH, and the Library of Congress.
  • SDSC staff and collaborators published scholarly
    articles in a spectrum of journals including
    Cell, Science, Nature, Journal of Seismology,
    Journal of the American Chemical Society,
    Journal of Medicinal Chemistry, Nano Letters,
    PLoS Computational Biology, and many others

4
Cyberinfrastructure
  • "the organized aggregate of information
    technologies coordinated to address problems in
    science and society"
  • "If infrastructure is required for an industrial
    economy, then we could say that
    cyberinfrastructure is required for a knowledge
    economy."
  • NSF 2003 Final Report of the Blue Ribbon Advisory
    Panel on Cyberinfrastructure (the "Atkins Report")

5
SDSC's Mission: To transform science and society
through Cyberinfrastructure
  • Cloud Platforms and Virtualization

6
SDSC Initiatives Harnessing the 2 Most
Significant Trends in Information Technology
  • Computing Cyberinfrastructure
  • New computers will provide a unique resource for
    massive data analysis, and provide the seed for
    growing a large-scale, professionally maintained
    computational platform at UCSD
  • Data Cyberinfrastructure
  • A new data resource will complement SDSC's green
    data center, one of the largest academic data
    centers in the world
  • CI Innovation
  • Ongoing collaborations in cloud computing, power
    efficiency, virtualization, disaster response,
    drug design, etc. accelerating research,
    education, and practice

7
The Fundamental Driver of the Information Age is
Digital Data
Education
Entertainment
Shopping
Health
Information
Business
8
Digital Data Critical for Research and Education
  • Data from multiple sources in the Geosciences
  • Data at multiple scales in the Biosciences

Where should we drill for oil? What is the
Impact of Global Warming? How are the continents
shifting?
Data Integration
Complex multiple-worlds mediation
What genes are associated with cancer? What
parts of the brain are responsible for
Alzheimer's?
Geo-Chemical
Geo-Physical
Geo-Chronologic
Foliation Map
Geologic Map
9
Today's Presentation
  • Data Cyberinfrastructure Today: designing and
    developing infrastructure to enable today's
    data-oriented applications
  • Challenges in Building and Delivering Capable
    Data Infrastructure
  • Sustainable Digital Preservation: a Grand
    Challenge for the Information Age

10
Data Cyberinfrastructure Today: Designing and
Developing Infrastructure for Today's
Data-Oriented Applications
11
Today's Data-oriented Applications Span the
Spectrum
Designing Infrastructure for Data: Data and
High Performance Computing; Data and Grids;
Data and Cyberinfrastructure Services
[Quadrant diagram: data-intensive applications; data-intensive and
compute-intensive HPC applications; home, lab, campus, desktop
applications; compute-intensive HPC applications; data grid applications;
grid applications. Axes: COMPUTE (more FLOPS), NETWORK (more BW)]
12
Data and High Performance Computing
  • For many applications, balanced systems are
    needed to support codes that are both
    data-intensive and compute-intensive: codes
    for which
  • Grid platforms not a strong option
  • Data must be local to computation
  • I/O rates exceed WAN capabilities (rough
    arithmetic after this slide)
  • Continuous and frequent I/O is latency
    intolerant
  • Scalability is key
  • Need high-bandwidth and large-capacity local
    parallel file systems, archival storage

[Quadrant diagram repeated from the previous slide: data-intensive,
compute-intensive, and data- plus compute-intensive HPC applications;
axis: COMPUTE (more FLOPS)]
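A rough, illustrative calculation of the WAN point above. The 43 TB figure is the TeraShake volume-data size from the next slide; the 10 Gbit/s wide-area link and 20 GB/s local parallel file system rates are assumed for the example, not taken from the deck.

  # Moving a 43 TB output set over an assumed 10 Gbit/s WAN versus
  # reading it from an assumed 20 GB/s local parallel file system.
  DATASET_TB = 43
  WAN_GBIT_PER_S = 10        # assumed wide-area bandwidth
  LOCAL_FS_GB_PER_S = 20     # assumed parallel file system bandwidth

  dataset_bytes = DATASET_TB * 1e12
  wan_hours = dataset_bytes / (WAN_GBIT_PER_S * 1e9 / 8) / 3600
  local_hours = dataset_bytes / (LOCAL_FS_GB_PER_S * 1e9) / 3600

  print(f"WAN transfer:      {wan_hours:.1f} hours")    # ~9.6 hours
  print(f"Local parallel FS: {local_hours:.1f} hours")  # ~0.6 hours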
13
Earthquake Simulation at Petascale: better
prediction accuracy creates greater
data-intensive demands
Estimated figures for a simulated 240-second period, 100-hour run time:

                              TeraShake domain           PetaShake domain
                              (600 x 300 x 80 km^3)      (800 x 400 x 100 km^3)
  Fault system interaction    NO                         YES
  Inner scale                 200 m                      25 m
  Resolution of terrain grid  1.8 billion mesh points    2.0 trillion mesh points
  Magnitude of earthquake     7.7                        8.1
  Time steps                  20,000 (0.012 sec/step)    160,000 (0.0015 sec/step)
  Surface data                1.1 TB                     1.2 PB
  Volume data                 43 TB                      4.9 PB
Information courtesy of the Southern California
Earthquake Center
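A hedged back-of-the-envelope check of the volume-data rows above, under assumptions not stated on the slide (three velocity components per mesh point, stored as 4-byte floats, one full-volume snapshot written at intervals).

  BYTES_PER_POINT = 3 * 4    # assumed: 3 velocity components, 4-byte floats

  def snapshot_tb(mesh_points):
      """Size of one full-volume snapshot, in terabytes."""
      return mesh_points * BYTES_PER_POINT / 1e12

  tera_snap = snapshot_tb(1.8e9)     # ~0.02 TB per TeraShake snapshot
  peta_snap = snapshot_tb(2.0e12)    # ~24 TB per PetaShake snapshot

  # Number of snapshots the quoted volume-data totals would correspond to:
  print(f"TeraShake: {43 / tera_snap:.0f} snapshots for 43 TB")      # ~2000
  print(f"PetaShake: {4.9e3 / peta_snap:.0f} snapshots for 4.9 PB")  # ~200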
14
Data and Grids
  • Data applications were among the first applications
    that
  • required Grid environments
  • could naturally tolerate longer latencies
  • Grid model supports key data application profiles
  • Compute at site A with data from site B
  • Store Data Collection at site A with copies at
    sites B and C
  • Operate instrument at site A, move data to site
    B for storage, post-processing, etc.

CERN data providing key driver for grid
technologies
15
Data Services Key for TeraGrid Science Gateways
  • Science Gateways provide common application
    interface for science communities on TeraGrid
  • Data services key for Gateway communities
  • Analysis
  • Visualization
  • Management
  • Remote access, etc.

Information and images courtesy of Nancy
Wilkins-Diehr
16
Unifying Data over the Grid: the TeraGrid GPFS-WAN
Effort
  • User wish list
  • Unlimited data capacity (everyone's aggregate
    storage almost looks like this)
  • Transparent, high speed access anywhere on the
    Grid
  • Automatic archiving and retrieval
  • No Latency
  • TeraGrid GPFS-WAN effort focuses on providing
    "infinite" (SDSC) storage over the grid
  • Looks like local disk to grid sites
  • Uses automatic migration with a large cache to
    keep files always online and accessible
    (conceptual sketch after this slide)
  • Data automatically archived without user
    intervention

Information courtesy of Phil Andrews
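A minimal conceptual sketch of the cache-plus-migration idea described above. It is not GPFS-WAN or SDSC code; the archive object, its get()/put() interface, and the sizes are assumptions made for illustration.

  import time

  class MigratingCache:
      """Disk cache in front of a tape archive (illustrative only)."""

      def __init__(self, capacity_bytes, archive):
          self.capacity = capacity_bytes
          self.archive = archive        # assumed object with get()/put()
          self.resident = {}            # path -> (size_bytes, last_access)

      def read(self, path, size):
          if path not in self.resident:
              self.archive.get(path)    # cache miss: transparent recall
              self._make_room(size)
          self.resident[path] = (size, time.time())
          return f"data for {path}"

      def write(self, path, size):
          self._make_room(size)
          self.resident[path] = (size, time.time())
          self.archive.put(path)        # archived without user intervention

      def _make_room(self, needed):
          used = sum(s for s, _ in self.resident.values())
          # Evict least-recently-used files; an archive copy already exists.
          by_age = sorted(self.resident.items(), key=lambda kv: kv[1][1])
          for path, (size, _) in by_age:
              if used + needed <= self.capacity:
                  break
              used -= size
              del self.resident[path]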
17
Data Grids
  • SRB - Storage Resource Broker
  • Persistent naming of distributed data
  • Management of data stored in multiple types of
    storage systems
  • Organization of data as a shared collection with
    descriptive metadata, access controls, audit
    trails
  • iRODS - integrated Rule-Oriented Data System
  • Rules control execution of remote micro-services
  • Manage persistent state information
  • Validate assertions about collection
  • Automate execution of management policies

Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
18
iRODS: integrated Rule-Oriented Data System
(http://irods.sdsc.edu)
  • Organizes distributed data into shared
    collections, while automating the application of
    management policies
  • Each policy is expressed as a set of rules that
    control the execution of a set of micro-services
  • Persistent state information is maintained to
    track the results of all operations

Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
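A conceptual sketch only, not the iRODS rule language or its API: it mimics the pattern described above, where a policy rule chains micro-services and persistent state records the outcome of every operation. The service names and the log structure are hypothetical.

  state_log = []   # stands in for the persistent state catalog

  def checksum_service(obj):            # hypothetical micro-service
      return {"op": "checksum", "object": obj, "ok": True}

  def replicate_service(obj, target):   # hypothetical micro-service
      return {"op": "replicate", "object": obj, "target": target, "ok": True}

  def on_ingest_rule(obj, replica_target="siteB"):
      """A 'rule': on ingest, checksum then replicate, logging every step."""
      for result in (checksum_service(obj),
                     replicate_service(obj, replica_target)):
          state_log.append(result)      # persistent state information
          if not result["ok"]:
              raise RuntimeError(f"policy violated at {result['op']}")

  on_ingest_rule("/collection/file001.dat")
  print(state_log)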
19
integrated Rule-Oriented Data System
[Architecture diagram components: Client Interface, Admin Interface, Rule
Invoker, Rule Modifier Module, Config Modifier Module, Metadata Modifier
Module, Rule Base, Current State, Consistency Check Modules, Confs,
Resources, Metadata-based Services, Resource-based Services, Metadata
Persistent Repository, Micro Service Modules]
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
20
Data Management Applications (What do they have
in common?)
  • Data grids
  • Share data - organize distributed data as a
    collection
  • Digital libraries
  • Publish data - support browsing and discovery
  • Persistent archives
  • Preserve data - manage technology evolution
  • Real-time sensor systems
  • Federate sensor data - integrate across sensor
    streams
  • Workflow systems
  • Analyze data - integrate client- and server-side
    workflows

Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
21
iRODS Approach
  • To meet the diverse requirements, the
    architecture must
  • Be highly modular
  • Be highly extensible
  • Provide infrastructure independence
  • Enforce management policies
  • Provide scalability mechanisms
  • Manipulate structured information
  • Enable community standards

Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
22
Data Management Challenges
  • Authenticity
  • Manage descriptive metadata for each file
  • Manage access controls
  • Manage consistent updates to administrative
    metadata
  • Integrity
  • Manage checksums
  • Replicate files
  • Synchronize replicas
  • Federate data grids
  • Infrastructure independence
  • Manage collection properties
  • Manage interactions with storage systems
  • Manage distributed data

Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
23
Types of Risk
  • Media failure
  • Replicate data onto multiple media
  • Vendor specific systemic errors
  • Replicate data onto multiple vendor products
  • Operational error
  • Replicate data onto a second administrative
    domain
  • Natural disaster
  • Replicate data to a geographically remote site
  • Malicious user
  • Replicate data to a deep archive

24
How Many Replicas?
  • Three sites minimize risk (rough arithmetic after
    this slide)
  • Primary site
  • Supports interactive user access to data
  • Secondary site
  • Supports interactive user access when first site
    is down
  • Provides 2nd media copy, located at a remote
    site, uses different vendor product, independent
    administrative procedures
  • Deep archive
  • Provides 3rd media copy, staging environment for
    data ingestion, no user access

25
Data Reliability
  • Manage checksums
  • Verify integrity
  • Rule to verify checksums
  • Synchronize replicas
  • Verify consistency between metadata and records
    in vault
  • Rule to verify presence of required metadata
  • Federate data grids
  • Synchronize metadata catalogs

26
Data Services Beyond Storage to Use
What services do users want?
How can I combine my data with my colleagues'
data?
How should I organize my data?
How do I make sure that my data will be there
when I want it?
What are the trends and what is the noise in my
data?
My data is confidential: how do I make sure that
it is seen/used only by the right people?
How should I display my data?
How can I make my data accessible to my
collaborators?
27
Services: Integrated Environment Key to Usability
  • Database selection and schema design
  • Portal creation and collection publication
  • Data analysis
  • Data mining
  • Data hosting
  • Preservation services
  • Domain-specific tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (Workflow management)
  • Data visualization
  • Data anonymization, etc.

Integrated Infrastructure
Many Data Sources
28
Data Hosting: SDSC DataCentral, a Comprehensive
Facility for Research Data
  • Broad program to support research and community
    data collections and databases
  • DataCentral services include
  • Public Data Collections and Database Hosting
  • Long-term storage and preservation (tape and
    disk)
  • Remote data management and access (SRB, iRODS
    portals)
  • Data Analysis, Visualization and Data Mining
  • Professional, qualified 24/7 support

PDB 28 TB
  • DataCentral resources include
  • 3 PB On-line disk
  • 36 PB StorageTek tape library capacity
  • 540 TB Storage-area Network (SAN)
  • DB2, Oracle, MySQL
  • Storage Resource Broker, iRODS
  • GPFS-WAN with 800 TB

Web-based portal access
29
Data Cyberinfrastructure at SDSC
  • Comprehensive data environment that incorporates
    access to the full spectrum of data enabling
    resources
  • hosting, managing and publishing data in digital
    libraries
  • sharing data through the Web and data grids
  • creating, optimizing, porting large scale
    databases
  • data intensive computing with high bandwidth data
    movement
  • analyzing, visualizing, rendering and data mining
    large scale data
  • preservation of data in persistent archives
  • building collections, portals, ontologies, etc.
  • providing resources, services, and expertise

30
Data to Discovery
31
SDSC Data Infrastructure Resources
  • 3 PB On-line disk
  • 36 PB StorageTek tape library capacity
  • 540 TB Storage-area Network (SAN)
  • DB2, Oracle, MySQL
  • SAS, R, MATLAB, Mathematica
  • Storage Resource Broker
  • Wide area file system with 800TB

Petabyte-scale high-performance tape storage
system
High-performance SATA SAN disk storage system
32
36 PB
33
DataCentral Allocated Collections include
Seismology 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences 50-year Downscaling of Global Analysis over California Region
Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics AMANDA data
Biology AfCS Molecule Pages
Biomedical Neuroscience BIRN
Networking Backbone Header Traces
Networking Backscatter Data
Biology Bee Behavior
Biology Biocyc (SRI)
Art C5 landscape Database
Geology Chronos
Biology CKAAPS
Biology DigEmbryo
Earth Science Education ERESE
Earth Sciences UCI ESMF
Earth Sciences EarthRef.org
Earth Sciences ERDA
Earth Sciences ERR
Biology Encyclopedia of Life
Life Sciences Protein Data Bank
Geosciences GEON
Geosciences GEON-LIDAR
Geochemistry Kd
Biology Gene Ontology
Geochemistry GERM
Networking HPWREN
Ecology HyperLter
Networking IMDC
Biology Interpro Mirror
Biology JCSG Data
Government Library of Congress Data
Geophysics Magnetics Information Consortium data
Education UC Merced Japanese Art Collections
Geochemistry NAVDAT
Earthquake Engineering NEESIT data
Education NSDL
Astronomy NVO
Government NARA
Anthropology GAPP
Neurobiology Salk data
Seismology SCEC TeraShake
Seismology SCEC CyberShake
Oceanography SIO Explorer
Networking Skitter
Astronomy Sloan Digital Sky Survey
Geology Sensitive Species Map Server
Geology SD and Tijuana Watershed data
Oceanography Seamount Catalogue
Oceanography Seamounts Online
Biodiversity WhyWhere
Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering TeraBridge
Various TeraGrid data collections
Biology Transporter Classification Database
Biology TreeBase
Art Tsunami Data
Education ArtStor
Biology Yeast regulatory network
Biology Apoptosis Database
Cosmology LUSciD
34
Data Visualization is key
  • Visualization of Cancer Tumors
  • Prokudin-Gorskii historical images
  • SCEC Earthquake simulations

Information and images courtesy of Amit
Chourasia, SCEC, Steve Cutchin, Moores Cancer
Center, David Minor, U.S. Library of Congress
35
Building and Delivering Capable Data
Cyberinfrastructure
36
Building Capable Data Cyberinfrastructure
Incorporating the "ilities"
  • Scalability
  • Interoperability
  • Reliability
  • Capability
  • Sustainability
  • Predictability
  • Accessibility
  • Responsibility
  • Accountability

37
Reliability
  • How can we maximize data reliability?
  • Replication, UPS systems, heterogeneity, etc.
  • How can we measure data reliability?
  • Network availability: 99.999% uptime ("five nines")
  • What is the equivalent number of 9s for data
    reliability? (one interpretation follows the table
    below)

Entity at risk   What can go wrong                                  Frequency
File             Corrupted media, disk failure                      1 year
Tape             Simultaneous failure of 2 copies                   5 years
System           Systemic errors in vendor SW, malicious user, or
                 operator error that deletes multiple copies        15 years
Archive          Natural disaster, obsolescence of standards        50 - 100 years
Information courtesy of Reagan Moore
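One possible way to answer the "how many 9s" question, offered only as an interpretation of the table above: treat each row as a mean time between loss events, assume Poisson arrivals, and count the nines of the annual survival probability (the archive row uses the 100-year end of its range).

  import math

  mean_years_between_loss = {"file": 1, "tape": 5, "system": 15, "archive": 100}

  for entity, years in mean_years_between_loss.items():
      p_loss_per_year = 1 - math.exp(-1 / years)   # Poisson assumption
      nines = -math.log10(p_loss_per_year)
      print(f"{entity:8s} annual loss ~{p_loss_per_year:.3f} (~{nines:.1f} nines)")
  # Even the archive row corresponds to only about two nines per year,
  # far short of the five nines quoted for network availability.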
38
Responsibility and Accountability
  • What are reasonable expectations between users
    and repositories?
  • What are reasonable expectations between
    federated partner repositories?
  • What are appropriate models for evaluating
    repositories?
  • What incentives promote good stewardship? What
    should happen if/when the system fails?
  • Who owns the data?
  • Who takes care of the data?
  • Who pays for the data?
  • Who can access the data?

39
Good Data Infrastructure Incurs Real Costs
Capacity Costs
  • Most valuable data must be replicated
  • SDSC research collections have been doubling
    every 15 months (see the arithmetic after this
    slide)
  • SDSC storage is 36 PB and counting. Data is from
    supercomputer simulations, digital library
    collections, etc.

Information courtesy of Richard Moore
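Quick arithmetic behind the doubling claim above.

  annual_factor = 2 ** (12 / 15)          # doubling every 15 months
  print(f"growth per year:   {annual_factor:.2f}x")      # ~1.74x
  print(f"growth in 5 years: {annual_factor ** 5:.0f}x")  # 16x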
40
Economic Sustainability
Relay Funding
  • Making Infinite Funding Finite
  • Difficult to support infrastructure for data
    preservation as an infinite, increasing mortgage
  • Creative partnerships help create sustainable
    economic models

User fees, recharges
Geisel Library at UCSD
Consortium support
Endowments
Hybrid solutions
41
Preserving Digital Information Over the Long Term
42
How much Digital Data is there?
SDSC HPSS tape archive 36 PetaBytes
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
Exa 10^18
Zetta 10^21
  • 5 exabytes of digital information produced in
    2003
  • 161 exabytes of digital information produced in
    2006
  • 25% of the 2006 digital universe is born digital
    (digital pictures, keystrokes, phone calls, etc.)
  • 75% is replicated (emails forwarded, backed-up
    transaction records, movies in DVD format)
  • 1 zettabyte aggregate digital information
    projected for 2010

iPod (up to 20K songs) 80 GB
1 novel 1 MegaByte
U.S. Library of Congress manages 295 TB of
digital data, 230 TB of which is born digital
Source: "The Expanding Digital Universe: A
Forecast of Worldwide Information Growth through
2010", IDC White Paper, March 2007
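The growth rates implied by the figures quoted above; the 2003 and 2006 numbers come from different studies, so the first rate is only indicative.

  eb_2003, eb_2006, eb_2010 = 5, 161, 1000   # 1 zettabyte = 1000 exabytes

  cagr_03_06 = (eb_2006 / eb_2003) ** (1 / 3) - 1
  cagr_06_10 = (eb_2010 / eb_2006) ** (1 / 4) - 1
  print(f"2003-2006: ~{cagr_03_06:.0%} per year (about 3.2x yearly)")
  print(f"2006-2010: ~{cagr_06_10:.0%} per year projected")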
43
How much Storage is there?
  • 2007 is the crossover year where the amount of
    digital information is greater than the amount of
    available storage
  • Given the projected rates of growth, we will
    never have enough space again for all digital
    information

Source: "The Expanding Digital Universe: A
Forecast of Worldwide Information Growth through
2010", IDC White Paper, March 2007
44
Focus for Preservation: the most valuable data
  • What is valuable?
  • Community reference data collections (e.g.
    UniProt, PDB)
  • Irreplaceable collections
  • Official collections (e.g. census data,
    electronic federal records)
  • Collections which are very expensive to replicate
    (e.g. CERN data)
  • Longitudinal and historical data
  • and others

45
A Framework for Digital Stewardship
  • Preservation efforts should focus on collections
    deemed most valuable
  • Key issues
  • What do we preserve?
  • How do we guard against data loss?
  • Who is responsible?
  • Who pays? Etc.

[Diagram labels: Increasing Value, Increasing Trust, Increasing
risk/responsibility, Increasing stability, Increasing infrastructure]
46
Digital Collections of Community Value
National, International Scale
Regional Scale
Local Scale
  • Key techniques for preservation: replication,
    heterogeneous support

The Data Pyramid
47
A Conceptual Model for Preservation Data Grids
  • The Chronopolis Model
  • Geographically distributed preservation data grid
    that supports long-term management, stewardship
    of, and access to digital collections
  • Implemented by developing and deploying a
    distributed data grid, and by supporting its
    human, policy, and technological infrastructure
  • Integrates targeted technology forecasting and
    migration to support long-term life-cycle
    management and preservation

Distributed Production Preservation Environment
Digital Information of Long-Term Value
Technology Forecasting and Migration
Administration, Policy, Outreach
48
Chronopolis Focus Areas and Demonstration Project
Partners
  • Chronopolis R&D, Policy, and Infrastructure Focus
    Areas
  • Assessment of the needs of potential user
    communities and development of appropriate
    service models
  • Development of formal roles and responsibilities
    of providers, partners, users
  • Assessment and prototyping of best practices for
    bit preservation, authentication, metadata, etc.
  • Development of appropriate cost and risk models
    for long-term preservation
  • Development of appropriate success metrics to
    evaluate usefulness, reliability, and usability
    of infrastructure

2 Prototypes: National Demonstration Project;
Library of Congress Pilot Project
Partners: SDSC/UCSD, U Maryland, UCSD Libraries,
NCAR, NARA, Library of Congress, NSF, ICPSR,
Internet Archive, NVO
Information courtesy of Robert McDonald, David
Minor, Ardys Kozbial
49
National Demonstration Project: Large-scale
Replication and Distribution
  • Focus on supporting multiple, geographically
    distributed copies of preservation collections
  • Bright copy: Chronopolis site supports
    ingestion, collection management, user access
  • Dim copy: Chronopolis site supports remote
    replica of bright copy and supports user access
  • Dark copy: Chronopolis site supports reference
    copy that may be used for disaster recovery but
    no user access
  • Each site may play different roles for different
    collections (illustrated after this slide)

[Diagram: each site holds a mix of bright, dim, and dark copies of
collections C1 and C2]
  • Demonstration collections included
  • National Virtual Observatory (NVO) 1 TB Digital
    Palomar Observatory Sky Survey
  • Copy of Interuniversity Consortium for Political
    and Social Research (ICPSR) data 1 TB
    Web-accessible Data
  • NCAR Observational Data 3 TB of Observational
    and Re-Analysis Data
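A hypothetical illustration of the bright/dim/dark model above. The sites are the Chronopolis partners named later in the deck, but the collection-to-site role assignments below are invented for the example, not the project's actual layout.

  ROLES = {
      # collection -> {site: role}; assignments invented for illustration
      "NVO_DPOSS": {"SDSC": "bright", "UMIACS": "dim", "NCAR": "dark"},
      "ICPSR_web": {"UMIACS": "bright", "NCAR": "dim", "SDSC": "dark"},
      "NCAR_obs":  {"NCAR": "bright", "SDSC": "dim", "UMIACS": "dark"},
  }

  def user_accessible_sites(collection):
      """Dark copies are for disaster recovery only, never user access."""
      return [site for site, role in ROLES[collection].items()
              if role != "dark"]

  print(user_accessible_sites("NVO_DPOSS"))   # ['SDSC', 'UMIACS']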

50
SDSC/UCSD Libraries Pilot Project with U.S.
Library of Congress
Prokudin-Gorskii Photographs (Library of Congress
Prints and Photographs Division)
http://www.loc.gov/exhibits/empire/
(also collection of web crawls from the Internet Archive)
Goal: To demonstrate the feasibility and
performance of current approaches for a
production digital data center to support the
Library of Congress requirements.
  • Historically important 600 GB Library of Congress
    image collection
  • Images over 100 years old with red, blue, green
    components (kept as separate digital files).
  • SDSC stores 5 copies with dark archival copy at
    NCAR
  • Infrastructure must support idiosyncratic file
    structure. Special logging and monitoring
    software developed so that both SDSC and Library
    of Congress could access information

Library of Congress Pilot Project information
courtesy of David Minor
51
Pilot Projects provided invaluable experience
with key issues
  • Technical Issues
  • How to address integrity, verification,
    provenance, authentication, etc.
  • Legal/Policy Issues
  • Who is responsible?
  • Who is liable?
  • Social Issues
  • What formats/standards are acceptable to the
    community?
  • How do we formalize trust?
  • Infrastructure Issues
  • What kinds of resources (servers, storage,
    networks) are required?
  • How should they operate?
  • Evaluation Issues
  • What is reliable?
  • What is successful?
  • Cost Issues
  • What is cost-effective?
  • How can support be sustained over time?

52
Chronopolis: A Partnership
  • Chronopolis is being developed by a national
    consortium led by SDSC and the UCSD Libraries.
  • Initial Chronopolis sites include
  • SDSC and UCSD Libraries at UC San Diego
  • University of Maryland Institute for Advanced
    Computer Studies (UMIACS)
  • National Center for Atmospheric Research (NCAR)
    in Boulder, CO

53
Chronopolis: An NDIIPP Project
  • Current funding for the project is from the
    Library of Congress National Digital Information
    Infrastructure and Preservation Program (NDIIPP)
  • Capturing, preserving, and making available
    significant digital content
  • Building and strengthening a network of partners
  • Developing a technical infrastructure of tools
    and services

54
It's Hard to be Successful in the Information Age
without reliable, persistent information
  • Inadequate/unrealistic general solution: "Let X
    do it", where X is
  • The Government
  • The Libraries
  • The Archivists
  • Google
  • The private sector
  • Data owners
  • Data generators, etc.
  • Creative partnerships needed to provide
    preservation solutions with
  • Trusted stewards
  • Feasible costs for users
  • Sustainable costs for infrastructure
  • Very low risk for data loss, etc.

55
Blue Ribbon Task Force to Focus on Economic
Sustainability
  • International Blue Ribbon Task Force (BRTF-SDPA)
    to begin in 2008 to study issues of economic
    sustainability of digital preservation and
    access
  • Support from
  • National Science Foundation
  • Library of Congress
  • Mellon Foundation
  • Joint Information Systems Committee
  • National Archives and Records Administration
  • Council on Library and Information Resources

Image courtesy of Chris Greer
56
BRTF-SDPA
  • Charge to the Task Force
  • To conduct a comprehensive analysis of previous
    and current efforts to develop and/or implement
    models for sustainable digital information
    preservation (First year report)
  • To identify and evaluate best practices regarding
    sustainable digital preservation among existing
    collections, repositories, and analogous
    enterprises
  • To make specific recommendations for actions that
    will catalyze the development of sustainable
    resource strategies for the reliable preservation
    of digital information (Second Year report)
  • Provide a research agenda to organize and
    motivate future work.
  • How you can be involved
  • Contribute your ideas (oral and written
    testimony)
  • Suggest readings (website will serve as a
    community bibliography)
  • Write an article on the issues for a new
    community (Important component will be to educate
    decision makers and the public about digital
    preservation)
  • Website to be launched this Fall. Will link from
    www.sdsc.edu

57
Many Thanks
  • Fran Berman, Phil Andrews, Reagan Moore, Ian
    Foster, Jack Dongarra, authors of the IDC Report,
    Ben Tolo, Richard Moore, David Moore, Robert
    McDonald, Southern California Earthquake Center,
    David Minor, Ardys Kozbial, Amit Chourasia,
    U.S. Library of Congress, Moores Cancer Center,
    National Archives and Records Administration,
    NSF, Chris Greer, Nancy Wilkins-Diehr, and many
    others

www.sdsc.edu
natashab@sdsc.edu
58
UCSD Libraries
  • 3.5 million volumes
  • Digital Access Management System
  • 250,000 objects
  • 15 TB
  • Shared collections with UC
  • California Digital Library Digital Preservation
    Repository
  • eScholarship repository

59
Previous Collaborations
  • LC Pilot Project: Building Trust in a 3rd-Party
    Repository
  • Using test image collections/web crawls, ingest
    content into the SDSC repository
  • Allow access for content audit
  • Track usage of content over time
  • Deliver content back to LC at end of project
  • California Digital Library (CDL) Mass Transit
    Program

  • Enable UC System Libraries to transfer
    high-speed mass digitization collections across
    CENIC/I2
  • Develop transmission packaging for CDL content
60
UCSD Organizations Provide
  • SDSC
  • Storage (50 TB) and networking services
  • SRB support
  • Transmission Packaging Modules
  • UCSD Libraries
  • Metadata services (PREMIS)
  • DIPs (Dissemination Information Packages)
  • Other advanced data services as needed

61
NCAR and UMIACS Provide
  • Archives: complete copy of all data (50 TB
    each)
  • Advanced data services
  • PAWN: Producer Archive Workflow Network in
    Support of Digital Preservation
  • ACE: Auditing Control Environment to Ensure the
    Long-Term Integrity of Digital Archives

62
Chronopolis Data Providers: CDL
  • California Digital Library
  • A part of UCOP, supports the University of
    California libraries
  • Providing data from the Web-At-Risk project
  • Five years of political and governmental websites
  • ARC files created from web crawls
  • Using BagIt Transfer Structure

63
Chronopolis Data Providers: ICPSR
  • Inter-university Consortium for Political and
    Social Research, University of Michigan
  • Providing ~12 TB of data; wide variety of types
  • Already working with SDSC using SRB

64
Chronopolis Data Providers: SIO
  • Scripps Institution of Oceanography
  • Providing data from the Geological Data Center
  • Files stored using the SRB
  • Data collected under previous NDIIPP project

65
Chronopolis Data Providers: NCSU
  • North Carolina State University Libraries
  • Providing ~5 TB of data: state-wide geospatial
    studies and data from many sources
  • Another test-case for BagIt transfer
    specification

66
(No Transcript)
67
http://chronopolis.sdsc.edu