Title: Dr. Natasha Balac
1A Grand Challenge for the Information Age
- Dr. Natasha Balac
- San Diego Supercomputer Center
- UC San Diego
2SDSC overview
Production Systems
User Services and Development
- National NSF facility since 1985 and UCSD Organized Research Unit with 400 staff and students
- Core TeraGrid programs provide high-end computational and storage resources to US researchers, based on a national peer-review proposal/allocation process
- Supports many programs including
- Core cyberinfrastructure program
- National TeraGrid program
- Protein Data Bank (PDB)
- Biomedical Informatics Research Network (BIRN)
- OptIPuter
- Network for Earthquake Engineering Simulation (NEES)
- Geosciences Network (GEON)
- Alliance for Cell Signaling (AfCS)
- High Performance Wireless Research and Education Network (HPWREN)
- National Virtual Observatory
Technology Research and Development
Science Research and Development
Data and Knowledge Systems
3SDSC in Brief
SDSC CAIDA image from the 2007 Design and the
Elastic Mind Exhibit at the NY Museum of Modern
Art
- Funding
- In 2007, SDSC was home to over 110 research projects and received research funding in excess of $45M
- 85% of funding from NSF; also NIH, DOE, NARA, LC, and other agencies/industry
- Facilities
- SDSC's data center is the largest academic data center in the world, with 36 PB capacity
- SDSC hosted over 90 resident events, workshops, and summer institutes in our facilities in 2007
- SDSC's focus on increased efficiency reduced our utility usage by 18%
- SDSC's new building is LEED silver-equivalent, the first on the UCSD campus
- Research
- SDSC hosted over 100 separate community digital data sets and collections for sponsors such as NSF, NIH, and the Library of Congress
- SDSC staff and collaborators published scholarly articles in a spectrum of journals including Cell, Science, Nature, Journal of Seismology, Journal of the American Chemical Society, Journal of Medicinal Chemistry, Nano Letters, PLoS Computational Biology, and many others
4Cyberinfrastructure
- the organized aggregate of information technologies coordinated to address problems in science and society
- "If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy."
- NSF 2003 Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure (the Atkins Report)
5SDSC's Mission: To transform science and society through Cyberinfrastructure
- Cloud Platforms and Virtualization
6SDSC Initiatives: Harnessing the 2 Most Significant Trends in Information Technology
- Computing Cyberinfrastructure
- New computers will provide a unique resource for massive data analysis, and provide the seed for growing a large-scale, professionally maintained computational platform at UCSD
- Data Cyberinfrastructure
- New data resources will complement SDSC's green data center, one of the largest academic data centers in the world
- CI Innovation
- Ongoing collaborations in cloud computing, power efficiency, virtualization, disaster response, drug design, etc., accelerating research, education, and practice
7The Fundamental Driver of the Information Age is
Digital Data
Education
Entertainment
Shopping
Health
Information
Business
8Digital Data Critical for Research and Education
- Data from multiple sources in the Geosciences
- Data at multiple scales in the Biosciences
Where should we drill for oil? What is the impact of global warming? How are the continents shifting?
Data Integration
Complex multiple-worlds mediation
What genes are associated with cancer? What parts of the brain are responsible for Alzheimer's?
Geo-Chemical
Geo-Physical
Geo-Chronologic
Foliation Map
Geologic Map
9Today's Presentation
- Data Cyberinfrastructure Today: Designing and developing infrastructure to enable today's data-oriented applications
- Challenges in Building and Delivering Capable Data Infrastructure
- Sustainable Digital Preservation: a Grand Challenge for the Information Age
10Data Cyberinfrastructure Today: Designing and Developing Infrastructure for Today's Data-Oriented Applications
11Today's Data-oriented Applications Span the Spectrum
Designing Infrastructure for Data: Data and High Performance Computing, Data and Grids, Data and Cyberinfrastructure Services
Data-intensive and Compute-intensive HPC applications
Data-intensive applications
Home, Lab, Campus, Desktop Applications
Compute-intensive HPC Applications
Data Grid Applications
COMPUTE (more FLOPS)
NETWORK (more BW)
Grid Applications
12Data and High Performance Computing
- For many applications, balanced systems are needed to support codes which are both data-intensive and compute-intensive. Codes for which:
- Grid platforms are not a strong option
- Data must be local to computation
- I/O rates exceed WAN capabilities
- Continuous and frequent I/O is latency intolerant
- Scalability is key
- Need high-bandwidth and large-capacity local parallel file systems and archival storage
Data-intensive and Compute-intensive HPC applications
Data-intensive applications
Compute-intensive HPC Applications
Compute-intensive applications
COMPUTE (more FLOPS)
13 Earthquake Simulation at Petascale: better prediction accuracy creates greater data-intensive demands
Estimated figures for a simulated 240 second period, 100 hour run-time | TeraShake domain (600x300x80 km3) | PetaShake domain (800x400x100 km3)
Fault system interaction | NO | YES
Inner scale | 200 m | 25 m
Resolution of terrain grid | 1.8 billion mesh points | 2.0 trillion mesh points
Magnitude of earthquake | 7.7 | 8.1
Time steps | 20,000 (.012 sec/step) | 160,000 (.0015 sec/step)
Surface data | 1.1 TB | 1.2 PB
Volume data | 43 TB | 4.9 PB
Information courtesy of the Southern California
Earthquake Center
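The mesh and time-step counts in the table follow directly from the stated domain sizes, inner scales, and step lengths. A minimal sketch of that arithmetic, assuming a uniform grid over the stated box domains:

```python
def mesh_points(dims_km, spacing_m):
    """Grid points in a box domain at a uniform inner scale (spacing)."""
    n = 1
    for d_km in dims_km:
        n *= int(d_km * 1000 / spacing_m)
    return n

def time_steps(simulated_s, dt_s):
    """Number of steps needed to cover the simulated period."""
    return round(simulated_s / dt_s)

# TeraShake: 600 x 300 x 80 km domain, 200 m inner scale, 0.012 s steps
print(f"{mesh_points((600, 300, 80), 200):,}")   # 1,800,000,000  (~1.8 billion)
print(f"{time_steps(240, 0.012):,}")             # 20,000

# PetaShake: 800 x 400 x 100 km domain, 25 m inner scale, 0.0015 s steps
print(f"{mesh_points((800, 400, 100), 25):,}")   # 2,048,000,000,000 (~2.0 trillion)
print(f"{time_steps(240, 0.0015):,}")            # 160,000
```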
14Data and Grids
- Data applications were some of the first applications which
- required Grid environments
- could naturally tolerate longer latencies
- Grid model supports key data application profiles (a workflow sketch follows this slide)
- Compute at site A with data from site B
- Store a data collection at site A with copies at sites B and C
- Operate an instrument at site A; move data to site B for storage, post-processing, etc.
CERN data providing key driver for grid
technologies
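As a rough illustration of the instrument profile above (produce at site A, store and post-process at site B, keep a further copy at site C), here is a hedged sketch; stage_in, run_job, and replicate are hypothetical placeholders, not a real grid, SRB, or iRODS API:

```python
# Hypothetical sketch of one data-grid application profile; these helpers are
# placeholders for whatever transfer/job/replication tools a grid actually uses.

def stage_in(dataset, src_site, dst_site):
    """Move the dataset from the producing site to the storage/compute site."""
    print(f"transfer {dataset}: {src_site} -> {dst_site}")

def run_job(command, site):
    """Run the post-processing where the data now lives (latency-tolerant)."""
    print(f"run '{command}' at {site}")

def replicate(dataset, extra_sites):
    """Keep additional copies at remote sites."""
    for site in extra_sites:
        print(f"replicate {dataset} -> {site}")

# Instrument at site A produces data; store and post-process at site B,
# with a further copy kept at site C.
stage_in("instrument-run-042", src_site="A", dst_site="B")
run_job("post_process", site="B")
replicate("instrument-run-042", extra_sites=["C"])
```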
15Data Services Key for TeraGrid Science Gateways
- Science Gateways provide a common application interface for science communities on TeraGrid
- Data services key for Gateway communities
- Analysis
- Visualization
- Management
- Remote access, etc.
Information and images courtesy of Nancy
Wilkins-Diehr
16Unifying Data over the Grid: the TeraGrid GPFS-WAN Effort
- User wish list
- Unlimited data capacity (everyone's aggregate storage almost looks like this)
- Transparent, high-speed access anywhere on the Grid
- Automatic archiving and retrieval
- No latency
- The TeraGrid GPFS-WAN effort focuses on providing "infinite" (SDSC) storage over the grid (a cache/migration sketch follows this slide)
- Looks like local disk to grid sites
- Uses automatic migration with a large cache to keep files always online and accessible
- Data automatically archived without user intervention
Information courtesy of Phil Andrews
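A toy sketch of the cache-plus-archive behavior described above (illustrative Python only, not GPFS or HPSS code): reads hit a large disk cache and transparently recall from the archive when a file has been migrated out, and every write is also archived without user action.

```python
from collections import OrderedDict

class MigratingStore:
    def __init__(self, cache_capacity, archive):
        self.cache = OrderedDict()          # path -> data, kept in LRU order
        self.capacity = cache_capacity
        self.archive = archive              # dict standing in for the tape archive

    def write(self, path, data):
        self._cache_put(path, data)
        self.archive[path] = data           # archived automatically

    def read(self, path):
        if path in self.cache:              # online in the disk cache
            self.cache.move_to_end(path)
            return self.cache[path]
        data = self.archive[path]           # transparent recall from archive
        self._cache_put(path, data)
        return data

    def _cache_put(self, path, data):
        self.cache[path] = data
        self.cache.move_to_end(path)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

store = MigratingStore(cache_capacity=2, archive={})
store.write("/gpfs-wan/run1.dat", b"...")
store.write("/gpfs-wan/run2.dat", b"...")
store.write("/gpfs-wan/run3.dat", b"...")   # run1 migrated out of the cache
print(store.read("/gpfs-wan/run1.dat"))     # recalled from archive, still accessible
```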
17Data Grids
- SRB - Storage Resource Broker
- Persistent naming of distributed data
- Management of data stored in multiple types of storage systems
- Organization of data as a shared collection with descriptive metadata, access controls, audit trails
- iRODS - integrated Rule-Oriented Data System
- Rules control execution of remote micro-services
- Manage persistent state information
- Validate assertions about collection
- Automate execution of management policies
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
18 iRODS: integrated Rule-Oriented Data System (http://irods.sdsc.edu)
- Organizes distributed data into shared collections, while automating the application of management policies (a toy rule-engine sketch follows this slide)
- Each policy is expressed as a set of rules that control the execution of a set of micro-services
- Persistent state information is maintained to track the results of all operations
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
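The rule/micro-service idea can be pictured with a toy policy engine. This is a conceptual Python sketch, not iRODS rule-language syntax, and the micro-service names and event names are made up:

```python
# Toy policy engine: each rule maps an event to a chain of micro-services,
# and every operation leaves persistent state behind.

state_log = []   # stands in for the persistent state catalog

def checksum(obj):
    state_log.append(("checksum", obj))
    return f"md5:{hash(obj) & 0xffffffff:08x}"

def replicate(obj):
    state_log.append(("replicate", obj))

RULES = {
    # event -> ordered list of micro-services to execute
    "on_put": [checksum, replicate],
}

def apply_policy(event, obj):
    for micro_service in RULES.get(event, []):
        micro_service(obj)

apply_policy("on_put", "/collection/survey/image001.tif")
print(state_log)   # results of all operations are tracked
```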
19integrated Rule-Oriented Data System
Client Interface
Admin Interface
Rule Invoker
Rule Modifier Module
Config Modifier Module
Metadata Modifier Module
Rule Base
Current State
Consistency Check Module
Consistency Check Module
Confs
Resources
Metadata-based Services
Resource-based Services
Metadata Persistent Repository
Micro Service Modules
Micro Service Modules
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
20Data Management Applications (What do they have in common?)
- Data grids
- Share data - organize distributed data as a collection
- Digital libraries
- Publish data - support browsing and discovery
- Persistent archives
- Preserve data - manage technology evolution
- Real-time sensor systems
- Federate sensor data - integrate across sensor streams
- Workflow systems
- Analyze data - integrate client- and server-side workflows
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
21iRODS Approach
- To meet the diverse requirements, the architecture must
- Be highly modular
- Be highly extensible
- Provide infrastructure independence
- Enforce management policies
- Provide scalability mechanisms
- Manipulate structured information
- Enable community standards
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
22Data Management Challenges
- Authenticity
- Manage descriptive metadata for each file
- Manage access controls
- Manage consistent updates to administrative metadata
- Integrity
- Manage checksums
- Replicate files
- Synchronize replicas
- Federate data grids
- Infrastructure independence
- Manage collection properties
- Manage interactions with storage systems
- Manage distributed data
Slide adapted from presentation by Dr. Reagan
Moore, UCSD/SDSC
23Types of Risk
- Media failure
- Replicate data onto multiple media
- Vendor-specific systemic errors
- Replicate data onto multiple vendor products
- Operational error
- Replicate data onto a second administrative domain
- Natural disaster
- Replicate data to a geographically remote site
- Malicious user
- Replicate data to a deep archive
24How Many Replicas?
- Three sites minimize risk
- Primary site
- Supports interactive user access to data
- Secondary site
- Supports interactive user access when the first site is down
- Provides 2nd media copy, located at a remote site, using a different vendor product and independent administrative procedures
- Deep archive
- Provides 3rd media copy and a staging environment for data ingestion; no user access
25Data Reliability
- Manage checksums
- Verify integrity
- Rule to verify checksums (a verification sketch follows this slide)
- Synchronize replicas
- Verify consistency between metadata and records in the vault
- Rule to verify presence of required metadata
- Federate data grids
- Synchronize metadata catalogs
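A minimal sketch of the checksum-verification idea in plain Python; the logical names, recorded checksums, and replica paths below are invented for illustration.

```python
import hashlib
from pathlib import Path

catalog = {
    # logical name -> (recorded md5, list of replica paths) -- illustrative only
    "terashake/surface_001.dat": ("d41d8cd98f00b204e9800998ecf8427e",
                                  ["/vault1/surface_001.dat",
                                   "/vault2/surface_001.dat"]),
}

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(logical_name):
    """Return the replicas that are missing or fail the recorded checksum."""
    recorded, replicas = catalog[logical_name]
    bad = []
    for replica in replicas:
        if not Path(replica).exists() or md5_of(replica) != recorded:
            bad.append(replica)          # missing or corrupted: re-synchronize
    return bad

for name in catalog:
    stale = verify(name)
    if stale:
        print(f"{name}: resync needed for {stale}")
```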
26Data Services: Beyond Storage to Use
What services do users want?
How can I combine my data with my colleagues' data?
How should I organize my data?
How do I make sure that my data will be there when I want it?
What are the trends and what is the noise in my data?
My data is confidential; how do I make sure that it is seen/used only by the right people?
How should I display my data?
How can I make my data accessible to my collaborators?
27Services: Integrated Environment Key to Usability
- Database selection and schema design
- Portal creation and collection publication
- Data analysis
- Data mining
- Data hosting
- Preservation services
- Domain-specific tools
- Biology Workbench
- Montage (astronomy mosaicking)
- Kepler (Workflow management)
- Data visualization
- Data anonymization, etc.
Integrated Infrastructure
Many Data Sources
28Data Hosting: SDSC DataCentral, A Comprehensive Facility for Research Data
- Broad program to support research and community data collections and databases
- DataCentral services include
- Public Data Collections and Database Hosting
- Long-term storage and preservation (tape and disk)
- Remote data management and access (SRB, iRODS portals)
- Data Analysis, Visualization, and Data Mining
- Professional, qualified 24/7 support
PDB 28 TB
- DataCentral resources include
- 3 PB On-line disk
- 36 PB StorageTek tape library capacity
- 540 TB Storage-area Network (SAN)
- DB2, Oracle, MySQL
- Storage Resource Broker, iRODS
- GPFS-WAN with 800 TB
Web-based portal access
29 Data Cyberinfrastructure at SDSC
- Comprehensive data environment that incorporates access to the full spectrum of data-enabling resources
- hosting, managing, and publishing data in digital libraries
- sharing data through the Web and data grids
- creating, optimizing, and porting large-scale databases
- data-intensive computing with high-bandwidth data movement
- analyzing, visualizing, rendering, and data mining large-scale data
- preservation of data in persistent archives
- building collections, portals, ontologies, etc.
- providing resources, services, and expertise
30Data to Discovery
31SDSC Data Infrastructure Resources
- 3 PB On-line disk
- 36 PB StorageTek tape library capacity
- 540 TB Storage-area Network (SAN)
- DB2, Oracle, MySQL
- SAS, R, MATLAB, Mathematica
- Storage Resource Broker
- Wide area file system with 800 TB
Petabyte-scale high-performance tape storage
system
High-performance SATA SAN disk storage system
3236 PB
33DataCentral Allocated Collections include
Seismology 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences 50-year Downscaling of Global Analysis over California Region
Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics AMANDA data
Biology AfCS Molecule Pages
Biomedical Neuroscience BIRN
Networking Backbone Header Traces
Networking Backscatter Data
Biology Bee Behavior
Biology Biocyc (SRI)
Art C5 landscape Database
Geology Chronos
Biology CKAAPS
Biology DigEmbryo
Earth Science Education ERESE
Earth Sciences UCI ESMF
Earth Sciences EarthRef.org
Earth Sciences ERDA
Earth Sciences ERR
Biology Encyclopedia of Life
Life Sciences Protein Data Bank
Geosciences GEON
Geosciences GEON-LIDAR
Geochemistry Kd
Biology Gene Ontology
Geochemistry GERM
Networking HPWREN
Ecology HyperLter
Networking IMDC
Biology Interpro Mirror
Biology JCSG Data
Government Library of Congress Data
Geophysics Magnetics Information Consortium data
Education UC Merced Japanese Art Collections
Geochemistry NAVDAT
Earthquake Engineering NEESIT data
Education NSDL
Astronomy NVO
Government NARA
Anthropology GAPP
Neurobiology Salk data
Seismology SCEC TeraShake
Seismology SCEC CyberShake
Oceanography SIO Explorer
Networking Skitter
Astronomy Sloan Digital Sky Survey
Geology Sensitive Species Map Server
Geology SD and Tijuana Watershed data
Oceanography Seamount Catalogue
Oceanography Seamounts Online
Biodiversity WhyWhere
Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering TeraBridge
Various TeraGrid data collections
Biology Transporter Classification Database
Biology TreeBase
Art Tsunami Data
Education ArtStor
Biology Yeast regulatory network
Biology Apoptosis Database
Cosmology LUSciD
34Data Visualization is key
- Visualization of Cancer Tumors
- Prokudin-Gorskii historical images
- SCEC Earthquake simulations
Information and images courtesy of Amit
Chourasia, SCEC, Steve Cutchin, Moores Cancer
Center, David Minor, U.S. Library of Congress
35Building and Delivering Capable Data
Cyberinfrastructure
36Building Capable Data Cyberinfrastructure: Incorporating the "ilities"
- Scalability
- Interoperability
- Reliability
- Capability
- Sustainability
- Predictability
- Accessibility
- Responsibility
- Accountability
37Reliability
- How can we maximize data reliability?
- Replication, UPS systems, heterogeneity, etc.
- How can we measure data reliability?
- Network availability is quoted as 99.999% uptime (5 nines); what is the equivalent number of 9s for data reliability? (a back-of-the-envelope sketch follows this slide)
Entity at risk | What can go wrong | Frequency
File | Corrupted media, disk failure | 1 year
Tape | Simultaneous failure of 2 copies | 5 years
System | Systemic errors in vendor SW, malicious user, or operator error that deletes multiple copies | 15 years
Archive | Natural disaster, obsolescence of standards | 50 - 100 years
Information courtesy of Reagan Moore
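One way to put a number of 9s on data reliability is to assume independent replicas and a per-replica annual loss probability. The sketch below uses an invented 1%-per-year figure purely for illustration; real replicas are not fully independent, which is exactly why heterogeneity of media, vendors, and sites matters.

```python
import math

def nines(p_loss_per_replica, n_replicas):
    """Reliability and 'number of nines' if all replicas must be lost to lose the data."""
    p_all_lost = p_loss_per_replica ** n_replicas   # assumes independent failures
    reliability = 1 - p_all_lost
    return reliability, -math.log10(p_all_lost)

for n in (1, 2, 3):
    r, n9 = nines(p_loss_per_replica=0.01, n_replicas=n)   # assumed 1%/year per copy
    print(f"{n} replica(s): reliability {r:.6f} (~{n9:.0f} nines)")
```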
38Responsibility and Accountability
- What are reasonable expectations between users and repositories?
- What are reasonable expectations between federated partner repositories?
- What are appropriate models for evaluating repositories?
- What incentives promote good stewardship? What should happen if/when the system fails?
- Who owns the data?
- Who takes care of the data?
- Who pays for the data?
- Who can access the data?
39Good Data Infrastructure Incurs Real Costs
Capacity Costs
- Most valuable data must be replicated
- SDSC research collections have been doubling every 15 months
- SDSC storage is 36 PB and counting. Data is from supercomputer simulations, digital library collections, etc.
Information courtesy of Richard Moore
40Economic Sustainability
Relay Funding
- Making Infinite Funding Finite
- Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
- Creative partnerships help create sustainable economic models
User fees, recharges
Geisel Library at UCSD
Consortium support
Endowments
Hybrid solutions
41Preserving Digital Information Over the Long Term
42How much Digital Data is there?
SDSC HPSS tape archive: 36 PetaBytes
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
Exa 10^18
Zetta 10^21
- 5 exabytes of digital information produced in 2003
- 161 exabytes of digital information produced in 2006 (a quick scale check follows this slide)
- 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.)
- 75% is replicated (emails forwarded, backed-up transaction records, movies in DVD format)
- 1 zettabyte of aggregate digital information projected for 2010
iPod (up to 20K songs) 80 GB
1 novel 1 MegaByte
U.S. Library of Congress manages 295 TB of
digital data, 230 TB of which is born digital
Source: The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010, IDC Whitepaper, March 2007
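For a sense of scale, the 2006 figure can be restated in terms of the iPod and novel sizes quoted on this slide (decimal prefixes assumed):

```python
EXA  = 10**18
GIGA = 10**9
MEGA = 10**6

digital_2006 = 161 * EXA          # 161 exabytes produced in 2006
ipod         = 80 * GIGA          # one 80 GB iPod
novel        = 1 * MEGA           # one ~1 MB novel

print(f"{digital_2006 / ipod:.2e} iPods")    # ~2.0e9: about two billion iPods
print(f"{digital_2006 / novel:.2e} novels")  # ~1.6e14 novels
```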
43How much Storage is there?
- 2007 is the crossover year where the amount of
digital information is greater than the amount of
available storage - Given the projected rates of growth, we will
never have enough space again for all digital
information
Source: The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010, IDC Whitepaper, March 2007
44Focus for Preservation: the most valuable data
- What is valuable?
- Community reference data collections (e.g. UniProt, PDB)
- Irreplaceable collections
- Official collections (e.g. census data, electronic federal records)
- Collections which are very expensive to replicate (e.g. CERN data)
- Longitudinal and historical data
- and others
45A Framework for Digital Stewardship
- Preservation efforts should focus on collections deemed most valuable
- Key issues
- What do we preserve?
- How do we guard against data loss?
- Who is responsible?
- Who pays? Etc.
Increasing Value, Increasing Trust
Increasing risk/responsibility, Increasing stability, Increasing infrastructure
46Digital Collections of Community Value
National, International Scale
Regional Scale
Local Scale
- Key techniques for preservation: replication, heterogeneous support
The Data Pyramid
47 A Conceptual Model for Preservation Data Grids
- The Chronopolis Model
- Geographically distributed preservation data grid that supports long-term management of, stewardship of, and access to digital collections
- Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
- Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation
Distributed Production Preservation Environment
Digital Information of Long-Term Value
Technology Forecasting and Migration
Administration, Policy, Outreach
48Chronopolis Focus Areas and Demonstration Project
Partners
- Chronopolis R&D, Policy, and Infrastructure focus areas
- Assessment of the needs of potential user communities and development of appropriate service models
- Development of formal roles and responsibilities of providers, partners, users
- Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
- Development of appropriate cost and risk models for long-term preservation
- Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
2 Prototypes: National Demonstration Project, Library of Congress Pilot Project
Partners: SDSC/UCSD, U Maryland, UCSD Libraries, NCAR, NARA, Library of Congress, NSF, ICPSR, Internet Archive, NVO
Information courtesy of Robert McDonald, David
Minor, Ardys Kozbial
49National Demonstration Project: Large-scale Replication and Distribution
- Focus on supporting multiple, geographically distributed copies of preservation collections (a role-assignment sketch follows this slide)
- Bright copy: Chronopolis site supports ingestion, collection management, user access
- Dim copy: Chronopolis site supports a remote replica of the bright copy and supports user access
- Dark copy: Chronopolis site supports a reference copy that may be used for disaster recovery but no user access
- Each site may play different roles for different collections
Dim copy C1
Dark copy C1
Bright copy C2
Dark copy C2
Bright copy C1
Dim copy C2
- Demonstration collections included
- National Virtual Observatory (NVO): 1 TB Digital Palomar Observatory Sky Survey
- Copy of Inter-university Consortium for Political and Social Research (ICPSR) data: 1 TB of Web-accessible data
- NCAR Observational Data: 3 TB of observational and re-analysis data
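A small sketch of how per-collection roles might be recorded across the three sites; the specific assignments below are invented, since the slide does not say which site holds which copy.

```python
from enum import Enum

class Role(Enum):
    BRIGHT = "ingest + collection management + user access"
    DIM    = "remote replica + user access"
    DARK   = "reference copy, disaster recovery only, no user access"

# Each site can play a different role for different collections (illustrative).
placement = {
    "NVO-DPOSS": {"SDSC": Role.BRIGHT, "UMIACS": Role.DIM,    "NCAR": Role.DARK},
    "ICPSR-web": {"SDSC": Role.DIM,    "UMIACS": Role.BRIGHT, "NCAR": Role.DARK},
}

def user_access_sites(collection):
    """Users may read only from bright and dim copies, never the dark copy."""
    return [site for site, role in placement[collection].items()
            if role in (Role.BRIGHT, Role.DIM)]

print(user_access_sites("NVO-DPOSS"))   # ['SDSC', 'UMIACS']
```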
50SDSC/UCSD Libraries Pilot Project with U.S. Library of Congress
Prokudin-Gorskii Photographs (Library of Congress Prints and Photographs Division), http://www.loc.gov/exhibits/empire/ (also a collection of web crawls from the Internet Archive)
Goal: To demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress requirements.
- Historically important 600 GB Library of Congress image collection
- Images over 100 years old with red, blue, and green components (kept as separate digital files)
- SDSC stores 5 copies, with a dark archival copy at NCAR
- Infrastructure must support an idiosyncratic file structure; special logging and monitoring software was developed so that both SDSC and the Library of Congress could access information (a toy completeness check is sketched after this slide)
Library of Congress Pilot Project information
courtesy of David Minor
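As an illustration of the kind of logging and completeness monitoring such an idiosyncratic layout needs, here is a hypothetical sketch; the directory layout and file-naming pattern are invented, not the actual collection structure.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
CHANNELS = ("red", "green", "blue")   # each image is kept as three separate files

def check_image(collection_root, image_id):
    """Log whether all three color-separation files exist for one image ID."""
    missing = [c for c in CHANNELS
               if not (Path(collection_root) / f"{image_id}_{c}.tif").exists()]
    if missing:
        logging.warning("%s incomplete: missing %s", image_id, missing)
    else:
        logging.info("%s complete", image_id)
    return not missing

check_image("/collections/prokudin-gorskii", "prok_00001")
```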
51Pilot Projects provided invaluable experience with key issues
- Technical Issues
- How to address integrity, verification, provenance, authentication, etc.
- Legal/Policy Issues
- Who is responsible?
- Who is liable?
- Social Issues
- What formats/standards are acceptable to the community?
- How do we formalize trust?
- Infrastructure Issues
- What kinds of resources (servers, storage, networks) are required?
- How should they operate?
- Evaluation Issues
- What is reliable?
- What is successful?
- Cost Issues
- What is cost-effective?
- How can support be sustained over time?
52Chronopolis: A Partnership
- Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries
- Initial Chronopolis sites include
- SDSC and UCSD Libraries at UC San Diego
- University of Maryland Institute for Advanced Computer Studies (UMIACS)
- National Center for Atmospheric Research (NCAR) in Boulder, CO
53Chronopolis: An NDIIPP Project
- Current funding for the project is from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)
- Capturing, preserving, and making available significant digital content
- Building and strengthening a network of partners
- Developing a technical infrastructure of tools and services
54It's Hard to be Successful in the Information Age without reliable, persistent information
- Inadequate/unrealistic general solution: "Let X do it", where X is
- The Government
- The Libraries
- The Archivists
- Google
- The private sector
- Data owners
- Data generators, etc.
- Creative partnerships needed to provide preservation solutions with
- Trusted stewards
- Feasible costs for users
- Sustainable costs for infrastructure
- Very low risk of data loss, etc.
55Blue Ribbon Task Force to Focus on Economic Sustainability
- International Blue Ribbon Task Force (BRTF-SDPA) to begin in 2008 to study issues of economic sustainability of digital preservation and access
- Support from
- National Science Foundation
- Library of Congress
- Mellon Foundation
- Joint Information Systems Committee
- National Archives and Records Administration
- Council on Library and Information Resources
Image courtesy of Chris Greer
56BRTF-SDPA
- Charge to the Task Force
- To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report)
- To identify and evaluate best practices regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises
- To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report)
- To provide a research agenda to organize and motivate future work
- How you can be involved
- Contribute your ideas (oral and written testimony)
- Suggest readings (the website will serve as a community bibliography)
- Write an article on the issues for a new community (an important component will be to educate decision makers and the public about digital preservation)
- Website to be launched this Fall; it will link from www.sdsc.edu
57Many Thanks
- Fran Berman, Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC Report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Ardys Kozbial, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others
www.sdsc.edu
natashab@sdsc.edu
58UCSD Libraries
- 3.5 million volumes
- Digital Access Management System
- 250,000 objects
- 15 TB
- Shared collections with UC
- California Digital Library Digital Preservation Repository
- eScholarship repository
59Previous Collaborations
- LC Pilot Project Building Trust in a 3rd Party
Repository - Using test image collections/web crawls ingest
content to SDSC repository - Allow access for content audit
- Track usage of content over time
- Deliver content back to LC at end of project
- California Digital Library (CDL) Mass Transit
Program
- Enable UC System Libraries to transfer
high-speed mass digitization collections across
CENIC/I2 - Develop transmission packaging for
CDL content
60UCSD Organizations Provide
- SDSC
- Storage (50 TB) and networking services
- SRB support
- Transmission Packaging Modules
- UCSD Libraries
- Metadata services (PREMIS)
- DIPs (Dissemination Information Packages)
- Other advanced data services as needed
61NCAR and UMIACS Provide
- Archives: complete copy of all data (50 TB each)
- Advanced data services
- PAWN: Producer Archive Workflow Network in Support of Digital Preservation
- ACE: Auditing Control Environment to Ensure the Long-Term Integrity of Digital Archives
62Chronopolis Data Providers: CDL
- California Digital Library
- A part of UCOP; supports the University of California libraries
- Providing data from the Web-At-Risk project
- Five years of political and governmental websites
- ARC files created from web crawls
- Using the BagIt Transfer Structure (a minimal bag-building sketch follows this slide)
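A minimal sketch of assembling a BagIt-style bag with the standard layout (a bagit.txt declaration, a data/ payload directory, and a checksum manifest). File names are illustrative, and a real transfer would normally use a maintained BagIt tool rather than this hand-rolled version.

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir, payload_files):
    """Copy payload files into data/, write bagit.txt and an md5 manifest."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for src in map(Path, payload_files):
        dest = bag / "data" / src.name
        dest.write_bytes(src.read_bytes())
        digest = hashlib.md5(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

# Example call (file name is hypothetical):
# make_bag("webcrawl_bag", ["crawl-2007-01.arc.gz"])
```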
63Chronopolis Data Providers: ICPSR
- Inter-university Consortium for Political and Social Research, University of Michigan
- Providing about 12 TB of data of a wide variety of types
- Already working with SDSC using SRB
64Chronopolis Data Providers: SIO
- Scripps Institution of Oceanography
- Providing data from the Geological Data Center
- Files stored using the SRB
- Data collected under previous NDIIPP project
65Chronopolis Data Providers: NCSU
- North Carolina State University Libraries
- Providing about 5 TB of data: state-wide geospatial studies and data from many sources
- Another test case for the BagIt transfer specification
66(No Transcript)
67http://chronopolis.sdsc.edu