Title: Dr. Natasha Balac
1A Grand Challenge for the Information Age
- Dr. Natasha Balac
- San Diego Supercomputer Center
- UC San Diego
2SDSC overview
Production Systems
User Services and Development
- National NSF facility since 1985 and UCSD Organized Research Unit with 400 staff and students
- Core TeraGrid programs provide high-end computational and storage resources to US researchers, based on a national peer-review proposal/allocation process
- Supports many programs including
- Core cyberinfrastructure program
- National TeraGrid program
- Protein Data Bank (PDB)
- Biomedical Informatics Research Network (BIRN)
- OptIPuter
- Network for Earthquake Engineering Simulation (NEES)
- Geosciences Network (GEON)
- Alliance for Cell Signaling (AfCS)
- High Performance Wireless Research and Education Network (HPWREN)
- National Virtual Observatory
Technology Research and Development
Science Research and Development
Data and Knowledge Systems
3SDSC in Brief
SDSC CAIDA image from the 2007 Design and the
Elastic Mind Exhibit at the NY Museum of Modern
Art
- Funding
- In 2007, SDSC was home to over 110 research projects and received research funding in excess of $45M
- 85% of funding from NSF; also NIH, DOE, NARA, LC, and other agencies/industry
- Facilities
- SDSC's data center is the largest academic data center in the world, with 36 PB capacity
- SDSC hosted over 90 resident events, workshops, and summer institutes in our facilities in 2007
- SDSC's focus on increased efficiency reduced our utility usage by 18%
- SDSC's new building is LEED silver-equivalent, the first on the UCSD campus
- Research
- SDSC hosted over 100 separate community digital data sets and collections for sponsors such as NSF, NIH, and the Library of Congress
- SDSC staff and collaborators published scholarly articles in a spectrum of journals including Cell, Science, Nature, Journal of Seismology, Journal of the American Chemical Society, Journal of Medicinal Chemistry, Nano Letters, PLoS Computational Biology, and many others
4Cyberinfrastructure
- the organized aggregate of information technologies coordinated to address problems in science and society
- "If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy."
- NSF 2003 Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure (the Atkins Report)
5SDSC's Mission: To transform science and society through Cyberinfrastructure
- Cloud Platforms and Virtualization
6SDSC Initiatives: Harnessing the 2 Most Significant Trends in Information Technology
- Computing Cyberinfrastructure
- New computers will provide a unique resource for massive data analysis, and provide the seed for growing a large-scale, professionally maintained computational platform at UCSD
- Data Cyberinfrastructure
- New data resources will complement SDSC's green data center, one of the largest academic data centers in the world
- CI Innovation
- Ongoing collaborations in cloud computing, power efficiency, virtualization, disaster response, drug design, etc., accelerating research, education, and practice
7The Fundamental Driver of the Information Age is
Digital Data
Education
Entertainment
Shopping
Health
Information
Business
8Digital Data Critical for Research and Education
- Data from multiple sources in the Geosciences
- Data at multiple scales in the Biosciences
Where should we drill for oil? What is the impact of global warming? How are the continents shifting?
Data Integration
Complex multiple-worlds mediation
What genes are associated with cancer? What parts of the brain are responsible for Alzheimer's?
Geo-Chemical
Geo-Physical
Geo-Chronologic
Foliation Map
Geologic Map
9Today's Presentation
- Data Cyberinfrastructure Today: Designing and developing infrastructure to enable today's data-oriented applications
- Challenges in Building and Delivering Capable Data Infrastructure
- Sustainable Digital Preservation: a Grand Challenge for the Information Age
10Data Cyberinfrastructure Today: Designing and Developing Infrastructure for Today's Data-Oriented Applications
11Today's Data-oriented Applications Span the Spectrum
Designing Infrastructure for Data: Data and High Performance Computing, Data and Grids, Data and Cyberinfrastructure Services
Data-intensive and Compute-intensive HPC applications
Data-intensive applications
Home, Lab, Campus, Desktop Applications
Compute-intensive HPC Applications
Data Grid Applications
COMPUTE (more FLOPS)
NETWORK (more BW)
Grid Applications
12Data and High Performance Computing
- For many applications, balanced systems are needed to support codes which are both data-intensive and compute-intensive. Codes for which:
- Grid platforms are not a strong option
- Data must be local to computation
- I/O rates exceed WAN capabilities
- Continuous and frequent I/O is latency intolerant
- Scalability is key
- Need high-bandwidth and large-capacity local parallel file systems and archival storage
Data-intensive and Compute-intensive HPC applications
Data-intensive applications
Compute-intensive HPC Applications
Compute-intensive applications
COMPUTE (more FLOPS)
13 Earthquake Simulation at Petascale: better prediction accuracy creates greater data-intensive demands
Estimated figures for a simulated 240 second period, 100 hour run-time | TeraShake domain (600x300x80 km3) | PetaShake domain (800x400x100 km3)
Fault system interaction | NO | YES
Inner scale | 200 m | 25 m
Resolution of terrain grid | 1.8 billion mesh points | 2.0 trillion mesh points
Magnitude of earthquake | 7.7 | 8.1
Time steps | 20,000 (.012 sec/step) | 160,000 (.0015 sec/step)
Surface data | 1.1 TB | 1.2 PB
Volume data | 43 TB | 4.9 PB
Information courtesy of the Southern California
Earthquake Center
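The mesh and time-step counts in the table follow directly from the stated domain sizes, inner scales, and step lengths. A minimal sketch of that arithmetic, assuming a uniform grid over the stated box domains:

```python
def mesh_points(dims_km, spacing_m):
    """Grid points in a box domain at a uniform inner scale (spacing)."""
    n = 1
    for d_km in dims_km:
        n *= int(d_km * 1000 / spacing_m)
    return n

def time_steps(simulated_s, dt_s):
    """Number of steps needed to cover the simulated period."""
    return round(simulated_s / dt_s)

# TeraShake: 600 x 300 x 80 km domain, 200 m inner scale, 0.012 s steps
print(f"{mesh_points((600, 300, 80), 200):,}")   # 1,800,000,000  (~1.8 billion)
print(f"{time_steps(240, 0.012):,}")             # 20,000

# PetaShake: 800 x 400 x 100 km domain, 25 m inner scale, 0.0015 s steps
print(f"{mesh_points((800, 400, 100), 25):,}")   # 2,048,000,000,000 (~2.0 trillion)
print(f"{time_steps(240, 0.0015):,}")            # 160,000
```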
14Data and Grids
- Data applications were some of the first applications which
- required Grid environments
- could naturally tolerate longer latencies
- Grid model supports key data application profiles (a workflow sketch follows this slide)
- Compute at site A with data from site B
- Store a data collection at site A with copies at sites B and C
- Operate an instrument at site A; move data to site B for storage, post-processing, etc.
CERN data providing key driver for grid
technologies
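As a rough illustration of the instrument profile above (produce at site A, store and post-process at site B, keep a further copy at site C), here is a hedged sketch; stage_in, run_job, and replicate are hypothetical placeholders, not a real grid, SRB, or iRODS API:

```python
# Hypothetical sketch of one data-grid application profile; these helpers are
# placeholders for whatever transfer/job/replication tools a grid actually uses.

def stage_in(dataset, src_site, dst_site):
    """Move the dataset from the producing site to the storage/compute site."""
    print(f"transfer {dataset}: {src_site} -> {dst_site}")

def run_job(command, site):
    """Run the post-processing where the data now lives (latency-tolerant)."""
    print(f"run '{command}' at {site}")

def replicate(dataset, extra_sites):
    """Keep additional copies at remote sites."""
    for site in extra_sites:
        print(f"replicate {dataset} -> {site}")

# Instrument at site A produces data; store and post-process at site B,
# with a further copy kept at site C.
stage_in("instrument-run-042", src_site="A", dst_site="B")
run_job("post_process", site="B")
replicate("instrument-run-042", extra_sites=["C"])
```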
15Data Services Key for TeraGrid Science Gateways
- Science Gateways provide a common application interface for science communities on TeraGrid
- Data services key for Gateway communities
- Analysis
- Visualization
- Management
- Remote access, etc.
Information and images courtesy of Nancy
Wilkins-Diehr
16Unifying Data over the Grid: the TeraGrid GPFS-WAN Effort
- User wish list
- Unlimited data capacity (everyone's aggregate storage almost looks like this)
- Transparent, high-speed access anywhere on the Grid
- Automatic archiving and retrieval
- No latency
- The TeraGrid GPFS-WAN effort focuses on providing "infinite" (SDSC) storage over the grid (a cache/migration sketch follows this slide)
- Looks like local disk to grid sites
- Uses automatic migration with a large cache to keep files always online and accessible
- Data automatically archived without user intervention
Information courtesy of Phil Andrews
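A toy sketch of the cache-plus-archive behavior described above (illustrative Python only, not GPFS or HPSS code): reads hit a large disk cache and transparently recall from the archive when a file has been migrated out, and every write is also archived without user action.

```python
from collections import OrderedDict

class MigratingStore:
    def __init__(self, cache_capacity, archive):
        self.cache = OrderedDict()          # path -> data, kept in LRU order
        self.capacity = cache_capacity
        self.archive = archive              # dict standing in for the tape archive

    def write(self, path, data):
        self._cache_put(path, data)
        self.archive[path] = data           # archived automatically

    def read(self, path):
        if path in self.cache:              # online in the disk cache
            self.cache.move_to_end(path)
            return self.cache[path]
        data = self.archive[path]           # transparent recall from archive
        self._cache_put(path, data)
        return data

    def _cache_put(self, path, data):
        self.cache[path] = data
        self.cache.move_to_end(path)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

store = MigratingStore(cache_capacity=2, archive={})
store.write("/gpfs-wan/run1.dat", b"...")
store.write("/gpfs-wan/run2.dat", b"...")
store.write("/gpfs-wan/run3.dat", b"...")   # run1 migrated out of the cache
print(store.read("/gpfs-wan/run1.dat"))     # recalled from archive, still accessible
```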
17Data Grids
- SRB - Storage Resource Broker
- Persistent naming of distributed data
- Management of data stored in multiple types of storage systems
- Organization of data as a shared collection with descriptive metadata, access controls, audit trails
- iRODS - integrated Rule-Oriented Data System
- Rules control execution of remote micro-services
- Manage persistent state information
- Validate assertions about collection
- Automate execution of management policies
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
18 iRODS: integrated Rule-Oriented Data System (http://irods.sdsc.edu)
- Organizes distributed data into shared collections, while automating the application of management policies (a toy rule-engine sketch follows this slide)
- Each policy is expressed as a set of rules that control the execution of a set of micro-services
- Persistent state information is maintained to track the results of all operations
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
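The rule/micro-service idea can be pictured with a toy policy engine. This is a conceptual Python sketch, not iRODS rule-language syntax, and the micro-service names and event names are made up:

```python
# Toy policy engine: each rule maps an event to a chain of micro-services,
# and every operation leaves persistent state behind.

state_log = []   # stands in for the persistent state catalog

def checksum(obj):
    state_log.append(("checksum", obj))
    return f"md5:{hash(obj) & 0xffffffff:08x}"

def replicate(obj):
    state_log.append(("replicate", obj))

RULES = {
    # event -> ordered list of micro-services to execute
    "on_put": [checksum, replicate],
}

def apply_policy(event, obj):
    for micro_service in RULES.get(event, []):
        micro_service(obj)

apply_policy("on_put", "/collection/survey/image001.tif")
print(state_log)   # results of all operations are tracked
```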
19integrated Rule-Oriented Data System
Client Interface
Admin Interface
Rule Invoker
Rule Modifier Module
Config Modifier Module
Metadata Modifier Module
Rule Base
Current State
Consistency Check Module
Consistency Check Module
Confs
Resources
Metadata-based Services
Resource-based Services
Metadata Persistent Repository
Micro Service Modules
Micro Service Modules
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
20Data Management Applications (What do they have in common?)
- Data grids
- Share data - organize distributed data as a collection
- Digital libraries
- Publish data - support browsing and discovery
- Persistent archives
- Preserve data - manage technology evolution
- Real-time sensor systems
- Federate sensor data - integrate across sensor streams
- Workflow systems
- Analyze data - integrate client- and server-side workflows
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
21iRODS Approach
- To meet the diverse requirements, the architecture must
- Be highly modular
- Be highly extensible
- Provide infrastructure independence
- Enforce management policies
- Provide scalability mechanisms
- Manipulate structured information
- Enable community standards
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
22Data Management Challenges
- Authenticity
- Manage descriptive metadata for each file
- Manage access controls
- Manage consistent updates to administrative metadata
- Integrity
- Manage checksums
- Replicate files
- Synchronize replicas
- Federate data grids
- Infrastructure independence
- Manage collection properties
- Manage interactions with storage systems
- Manage distributed data
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
23Types of Risk
- Media failure
- Replicate data onto multiple media
- Vendor-specific systemic errors
- Replicate data onto multiple vendor products
- Operational error
- Replicate data onto a second administrative domain
- Natural disaster
- Replicate data to a geographically remote site
- Malicious user
- Replicate data to a deep archive
24How Many Replicas?
- Three sites minimize risk
- Primary site
- Supports interactive user access to data
- Secondary site
- Supports interactive user access when the first site is down
- Provides 2nd media copy, located at a remote site, using a different vendor product and independent administrative procedures
- Deep archive
- Provides 3rd media copy and a staging environment for data ingestion; no user access
25Data Reliability
- Manage checksums
- Verify integrity
- Rule to verify checksums (a verification sketch follows this slide)
- Synchronize replicas
- Verify consistency between metadata and records in the vault
- Rule to verify presence of required metadata
- Federate data grids
- Synchronize metadata catalogs
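A minimal sketch of the checksum-verification idea in plain Python; the logical names, recorded checksums, and replica paths below are invented for illustration.

```python
import hashlib
from pathlib import Path

catalog = {
    # logical name -> (recorded md5, list of replica paths) -- illustrative only
    "terashake/surface_001.dat": ("d41d8cd98f00b204e9800998ecf8427e",
                                  ["/vault1/surface_001.dat",
                                   "/vault2/surface_001.dat"]),
}

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(logical_name):
    """Return the replicas that are missing or fail the recorded checksum."""
    recorded, replicas = catalog[logical_name]
    bad = []
    for replica in replicas:
        if not Path(replica).exists() or md5_of(replica) != recorded:
            bad.append(replica)          # missing or corrupted: re-synchronize
    return bad

for name in catalog:
    stale = verify(name)
    if stale:
        print(f"{name}: resync needed for {stale}")
```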
26Data Services: Beyond Storage to Use
What services do users want?
How can I combine my data with my colleagues' data?
How should I organize my data?
How do I make sure that my data will be there when I want it?
What are the trends and what is the noise in my data?
My data is confidential; how do I make sure that it is seen/used only by the right people?
How should I display my data?
How can I make my data accessible to my collaborators?
27Services: Integrated Environment Key to Usability
- Database selection and schema design
- Portal creation and collection publication
- Data analysis
- Data mining
- Data hosting
- Preservation services
- Domain-specific tools
- Biology Workbench
- Montage (astronomy mosaicking)
- Kepler (Workflow management)
- Data visualization
- Data anonymization, etc.
Integrated Infrastructure
Many Data Sources
28Data Hosting: SDSC DataCentral, A Comprehensive Facility for Research Data
- Broad program to support research and community data collections and databases
- DataCentral services include
- Public Data Collections and Database Hosting
- Long-term storage and preservation (tape and disk)
- Remote data management and access (SRB, iRODS portals)
- Data Analysis, Visualization, and Data Mining
- Professional, qualified 24/7 support
PDB 28 TB
- DataCentral resources include
- 3 PB On-line disk
- 36 PB StorageTek tape library capacity
- 540 TB Storage-area Network (SAN)
- DB2, Oracle, MySQL
- Storage Resource Broker, iRODS
- GPFS-WAN with 800 TB
Web-based portal access
29 Data Cyberinfrastructure at SDSC
- Comprehensive data environment that incorporates access to the full spectrum of data-enabling resources
- hosting, managing, and publishing data in digital libraries
- sharing data through the Web and data grids
- creating, optimizing, and porting large-scale databases
- data-intensive computing with high-bandwidth data movement
- analyzing, visualizing, rendering, and data mining large-scale data
- preservation of data in persistent archives
- building collections, portals, ontologies, etc.
- providing resources, services, and expertise
30Data to Discovery
31SDSC Data Infrastructure Resources
- 3 PB On-line disk
- 36 PB StorageTek tape library capacity
- 540 TB Storage-area Network (SAN)
- DB2, Oracle, MySQL
- SAS, R, MATLAB, Mathematica
- Storage Resource Broker
- Wide area file system with 800 TB
Petabyte-scale high-performance tape storage
system
High-performance SATA SAN disk storage system
3236 PB
33DataCentral Allocated Collections include
Seismology 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences 50-year Downscaling of Global Analysis over California Region
Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics AMANDA data
Biology AfCS Molecule Pages
Biomedical Neuroscience BIRN
Networking Backbone Header Traces
Networking Backscatter Data
Biology Bee Behavior
Biology Biocyc (SRI)
Art C5 landscape Database
Geology Chronos
Biology CKAAPS
Biology DigEmbryo
Earth Science Education ERESE
Earth Sciences UCI ESMF
Earth Sciences EarthRef.org
Earth Sciences ERDA
Earth Sciences ERR
Biology Encyclopedia of Life
Life Sciences Protein Data Bank
Geosciences GEON
Geosciences GEON-LIDAR
Geochemistry Kd
Biology Gene Ontology
Geochemistry GERM
Networking HPWREN
Ecology HyperLter
Networking IMDC
Biology Interpro Mirror
Biology JCSG Data
Government Library of Congress Data
Geophysics Magnetics Information Consortium data
Education UC Merced Japanese Art Collections
Geochemistry NAVDAT
Earthquake Engineering NEESIT data
Education NSDL
Astronomy NVO
Government NARA
Anthropology GAPP
Neurobiology Salk data
Seismology SCEC TeraShake
Seismology SCEC CyberShake
Oceanography SIO Explorer
Networking Skitter
Astronomy Sloan Digital Sky Survey
Geology Sensitive Species Map Server
Geology SD and Tijuana Watershed data
Oceanography Seamount Catalogue
Oceanography Seamounts Online
Biodiversity WhyWhere
Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering TeraBridge
Various TeraGrid data collections
Biology Transporter Classification Database
Biology TreeBase
Art Tsunami Data
Education ArtStor
Biology Yeast regulatory network
Biology Apoptosis Database
Cosmology LUSciD
34Data Visualization is key
- Visualization of Cancer Tumors
- Prokudin-Gorskii historical images
- SCEC Earthquake simulations
Information and images courtesy of Amit
Chourasia, SCEC, Steve Cutchin, Moores Cancer
Center, David Minor, U.S. Library of Congress
35Building and Delivering Capable Data
Cyberinfrastructure
36Building Capable Data Cyberinfrastructure: Incorporating the "ilities"
- Scalability
- Interoperability
- Reliability
- Capability
- Sustainability
- Predictability
- Accessibility
- Responsibility
- Accountability
37Reliability
- How can we maximize data reliability?
- Replication, UPS systems, heterogeneity, etc.
- How can we measure data reliability?
- Network availability is quoted as 99.999% uptime (5 nines); what is the equivalent number of 9s for data reliability? (a back-of-the-envelope sketch follows this slide)
Entity at risk | What can go wrong | Frequency
File | Corrupted media, disk failure | 1 year
Tape | Simultaneous failure of 2 copies | 5 years
System | Systemic errors in vendor SW, malicious user, or operator error that deletes multiple copies | 15 years
Archive | Natural disaster, obsolescence of standards | 50 - 100 years
Information courtesy of Reagan Moore
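One way to put a number of 9s on data reliability is to assume independent replicas and a per-replica annual loss probability. The sketch below uses an invented 1%-per-year figure purely for illustration; real replicas are not fully independent, which is exactly why heterogeneity of media, vendors, and sites matters.

```python
import math

def nines(p_loss_per_replica, n_replicas):
    """Reliability and 'number of nines' if all replicas must be lost to lose the data."""
    p_all_lost = p_loss_per_replica ** n_replicas   # assumes independent failures
    reliability = 1 - p_all_lost
    return reliability, -math.log10(p_all_lost)

for n in (1, 2, 3):
    r, n9 = nines(p_loss_per_replica=0.01, n_replicas=n)   # assumed 1%/year per copy
    print(f"{n} replica(s): reliability {r:.6f} (~{n9:.0f} nines)")
```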
38Responsibility and Accountability
- What are reasonable expectations between users and repositories?
- What are reasonable expectations between federated partner repositories?
- What are appropriate models for evaluating repositories?
- What incentives promote good stewardship? What should happen if/when the system fails?
- Who owns the data?
- Who takes care of the data?
- Who pays for the data?
- Who can access the data?
39Good Data Infrastructure Incurs Real Costs
Capacity Costs
- Most valuable data must be replicated
- SDSC research collections have been doubling every 15 months
- SDSC storage is 36 PB and counting. Data is from supercomputer simulations, digital library collections, etc.
Information courtesy of Richard Moore
40Economic Sustainability
Relay Funding
- Making Infinite Funding Finite
- Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
- Creative partnerships help create sustainable economic models
User fees, recharges
Geisel Library at UCSD
Consortium support
Endowments
Hybrid solutions
41Preserving Digital Information Over the Long Term
42How much Digital Data is there?
SDSC HPSS tape archive: 36 PetaBytes
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
Exa 10^18
Zetta 10^21
- 5 exabytes of digital information produced in 2003
- 161 exabytes of digital information produced in 2006 (a quick scale check follows this slide)
- 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.)
- 75% is replicated (emails forwarded, backed-up transaction records, movies in DVD format)
- 1 zettabyte of aggregate digital information projected for 2010
iPod (up to 20K songs) 80 GB
1 novel 1 MegaByte
U.S. Library of Congress manages 295 TB of
digital data, 230 TB of which is born digital
Source: The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010, IDC Whitepaper, March 2007
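For a sense of scale, the 2006 figure can be restated in terms of the iPod and novel sizes quoted on this slide (decimal prefixes assumed):

```python
EXA  = 10**18
GIGA = 10**9
MEGA = 10**6

digital_2006 = 161 * EXA          # 161 exabytes produced in 2006
ipod         = 80 * GIGA          # one 80 GB iPod
novel        = 1 * MEGA           # one ~1 MB novel

print(f"{digital_2006 / ipod:.2e} iPods")    # ~2.0e9: about two billion iPods
print(f"{digital_2006 / novel:.2e} novels")  # ~1.6e14 novels
```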
43How much Storage is there?
- 2007 is the crossover year where the amount of
digital information is greater than the amount of
available storage - Given the projected rates of growth, we will
never have enough space again for all digital
information
Source: The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010, IDC Whitepaper, March 2007
44Focus for Preservation: the most valuable data
- What is valuable?
- Community reference data collections (e.g. UniProt, PDB)
- Irreplaceable collections
- Official collections (e.g. census data, electronic federal records)
- Collections which are very expensive to replicate (e.g. CERN data)
- Longitudinal and historical data
- and others
45A Framework for Digital Stewardship
- Preservation efforts should focus on collections deemed most valuable
- Key issues
- What do we preserve?
- How do we guard against data loss?
- Who is responsible?
- Who pays? Etc.
Increasing Value, Increasing Trust
Increasing risk/responsibility, Increasing stability, Increasing infrastructure
46Digital Collections of Community Value
National, International Scale
Regional Scale
Local Scale
- Key techniques for preservation: replication, heterogeneous support
The Data Pyramid
47 A Conceptual Model for Preservation Data Grids
- The Chronopolis Model
- Geographically distributed preservation data grid that supports long-term management of, stewardship of, and access to digital collections
- Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
- Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation
Distributed Production Preservation Environment
Digital Information of Long-Term Value
Technology Forecasting and Migration
Administration, Policy, Outreach
48Chronopolis Focus Areas and Demonstration Project
Partners
- Chronopolis R&D, Policy, and Infrastructure focus areas
- Assessment of the needs of potential user communities and development of appropriate service models
- Development of formal roles and responsibilities of providers, partners, users
- Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
- Development of appropriate cost and risk models for long-term preservation
- Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
2 Prototypes: National Demonstration Project, Library of Congress Pilot Project
Partners: SDSC/UCSD, U Maryland, UCSD Libraries, NCAR, NARA, Library of Congress, NSF, ICPSR, Internet Archive, NVO
Information courtesy of Robert McDonald, David
Minor, Ardys Kozbial
49National Demonstration Project: Large-scale Replication and Distribution
- Focus on supporting multiple, geographically distributed copies of preservation collections (a role-assignment sketch follows this slide)
- Bright copy: Chronopolis site supports ingestion, collection management, user access
- Dim copy: Chronopolis site supports a remote replica of the bright copy and supports user access
- Dark copy: Chronopolis site supports a reference copy that may be used for disaster recovery but no user access
- Each site may play different roles for different collections
Dim copy C1
Dark copy C1
Bright copy C2
Dark copy C2
Bright copy C1
Dim copy C2
- Demonstration collections included
- National Virtual Observatory (NVO): 1 TB Digital Palomar Observatory Sky Survey
- Copy of Inter-university Consortium for Political and Social Research (ICPSR) data: 1 TB of Web-accessible data
- NCAR Observational Data: 3 TB of observational and re-analysis data
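A small sketch of how per-collection roles might be recorded across the three sites; the specific assignments below are invented, since the slide does not say which site holds which copy.

```python
from enum import Enum

class Role(Enum):
    BRIGHT = "ingest + collection management + user access"
    DIM    = "remote replica + user access"
    DARK   = "reference copy, disaster recovery only, no user access"

# Each site can play a different role for different collections (illustrative).
placement = {
    "NVO-DPOSS": {"SDSC": Role.BRIGHT, "UMIACS": Role.DIM,    "NCAR": Role.DARK},
    "ICPSR-web": {"SDSC": Role.DIM,    "UMIACS": Role.BRIGHT, "NCAR": Role.DARK},
}

def user_access_sites(collection):
    """Users may read only from bright and dim copies, never the dark copy."""
    return [site for site, role in placement[collection].items()
            if role in (Role.BRIGHT, Role.DIM)]

print(user_access_sites("NVO-DPOSS"))   # ['SDSC', 'UMIACS']
```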
50SDSC/UCSD Libraries Pilot Project with U.S. Library of Congress
Prokudin-Gorskii Photographs (Library of Congress Prints and Photographs Division), http://www.loc.gov/exhibits/empire/ (also a collection of web crawls from the Internet Archive)
Goal: To demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress requirements.
- Historically important 600 GB Library of Congress image collection
- Images over 100 years old with red, blue, and green components (kept as separate digital files)
- SDSC stores 5 copies, with a dark archival copy at NCAR
- Infrastructure must support an idiosyncratic file structure; special logging and monitoring software was developed so that both SDSC and the Library of Congress could access information (a toy completeness check is sketched after this slide)
Library of Congress Pilot Project information
courtesy of David Minor
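As an illustration of the kind of logging and completeness monitoring such an idiosyncratic layout needs, here is a hypothetical sketch; the directory layout and file-naming pattern are invented, not the actual collection structure.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
CHANNELS = ("red", "green", "blue")   # each image is kept as three separate files

def check_image(collection_root, image_id):
    """Log whether all three color-separation files exist for one image ID."""
    missing = [c for c in CHANNELS
               if not (Path(collection_root) / f"{image_id}_{c}.tif").exists()]
    if missing:
        logging.warning("%s incomplete: missing %s", image_id, missing)
    else:
        logging.info("%s complete", image_id)
    return not missing

check_image("/collections/prokudin-gorskii", "prok_00001")
```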
51Pilot Projects provided invaluable experience with key issues
- Technical Issues
- How to address integrity, verification, provenance, authentication, etc.
- Legal/Policy Issues
- Who is responsible?
- Who is liable?
- Social Issues
- What formats/standards are acceptable to the community?
- How do we formalize trust?
- Infrastructure Issues
- What kinds of resources (servers, storage, networks) are required?
- How should they operate?
- Evaluation Issues
- What is reliable?
- What is successful?
- Cost Issues
- What is cost-effective?
- How can support be sustained over time?
52Chronopolis: A Partnership
- Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries
- Initial Chronopolis sites include
- SDSC and UCSD Libraries at UC San Diego
- University of Maryland Institute for Advanced Computer Studies (UMIACS)
- National Center for Atmospheric Research (NCAR) in Boulder, CO
53Chronopolis: An NDIIPP Project
- Current funding for the project is from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)
- Capturing, preserving, and making available significant digital content
- Building and strengthening a network of partners
- Developing a technical infrastructure of tools and services
54It's Hard to be Successful in the Information Age without reliable, persistent information
- Inadequate/unrealistic general solution: "Let X do it", where X is
- The Government
- The Libraries
- The Archivists
- Google
- The private sector
- Data owners
- Data generators, etc.
- Creative partnerships needed to provide preservation solutions with
- Trusted stewards
- Feasible costs for users
- Sustainable costs for infrastructure
- Very low risk of data loss, etc.
55Blue Ribbon Task Force to Focus on Economic Sustainability
- International Blue Ribbon Task Force (BRTF-SDPA) to begin in 2008 to study issues of economic sustainability of digital preservation and access
- Support from
- National Science Foundation
- Library of Congress
- Mellon Foundation
- Joint Information Systems Committee
- National Archives and Records Administration
- Council on Library and Information Resources
Image courtesy of Chris Greer
56BRTF-SDPA
- Charge to the Task Force
- To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report)
- To identify and evaluate best practices regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises
- To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report)
- To provide a research agenda to organize and motivate future work
- How you can be involved
- Contribute your ideas (oral and written testimony)
- Suggest readings (the website will serve as a community bibliography)
- Write an article on the issues for a new community (an important component will be to educate decision makers and the public about digital preservation)
- Website to be launched this Fall; it will link from www.sdsc.edu
57Many Thanks
- Fran Berman, Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC Report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Ardys Kozbial, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others
www.sdsc.edu
natashab@sdsc.edu
58UCSD Libraries
- 3.5 million volumes
- Digital Access Management System
- 250,000 objects
- 15 TB
- Shared collections with UC
- California Digital Library Digital Preservation Repository
- eScholarship repository
59Previous Collaborations
- LC Pilot Project Building Trust in a 3rd Party
Repository - Using test image collections/web crawls ingest
content to SDSC repository - Allow access for content audit
- Track usage of content over time
- Deliver content back to LC at end of project
- California Digital Library (CDL) Mass Transit
Program
- Enable UC System Libraries to transfer
high-speed mass digitization collections across
CENIC/I2 - Develop transmission packaging for
CDL content
60UCSD Organizations Provide
- SDSC
- Storage (50 TB) and networking services
- SRB support
- Transmission Packaging Modules
- UCSD Libraries
- Metadata services (PREMIS)
- DIPs (Dissemination Information Packages)
- Other advanced data services as needed
61NCAR and UMIACS Provide
- Archives: complete copy of all data (50 TB each)
- Advanced data services
- PAWN: Producer Archive Workflow Network in Support of Digital Preservation
- ACE: Auditing Control Environment to Ensure the Long-Term Integrity of Digital Archives
62Chronopolis Data Providers: CDL
- California Digital Library
- A part of UCOP; supports the University of California libraries
- Providing data from the Web-At-Risk project
- Five years of political and governmental websites
- ARC files created from web crawls
- Using the BagIt Transfer Structure (a minimal bag-building sketch follows this slide)
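A minimal sketch of assembling a BagIt-style bag with the standard layout (a bagit.txt declaration, a data/ payload directory, and a checksum manifest). File names are illustrative, and a real transfer would normally use a maintained BagIt tool rather than this hand-rolled version.

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir, payload_files):
    """Copy payload files into data/, write bagit.txt and an md5 manifest."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for src in map(Path, payload_files):
        dest = bag / "data" / src.name
        dest.write_bytes(src.read_bytes())
        digest = hashlib.md5(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

# Example call (file name is hypothetical):
# make_bag("webcrawl_bag", ["crawl-2007-01.arc.gz"])
```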
63Chronopolis Data Providers: ICPSR
- Inter-university Consortium for Political and Social Research, University of Michigan
- Providing about 12 TB of data of a wide variety of types
- Already working with SDSC using SRB
64Chronopolis Data Providers: SIO
- Scripps Institution of Oceanography
- Providing data from the Geological Data Center
- Files stored using the SRB
- Data collected under previous NDIIPP project
65Chronopolis Data Providers: NCSU
- North Carolina State University Libraries
- Providing about 5 TB of data: state-wide geospatial studies and data from many sources
- Another test case for the BagIt transfer specification
66(No Transcript)
67http://chronopolis.sdsc.edu