TG 06 Data Collections Tutorial - PowerPoint PPT Presentation

Slides: 35
Provided by: sdsc3

1
TG 06 Data Collections Tutorial
  • Natasha Balac
  • Roman Olschanowsky

2
TG Collections: Present and Future
  • Data collections represent permanent data storage
    that is organized, searchable, and available to a
    wide audience, either a collaborative group or
    the scientific public in general 
  • The term "collections" is also used to refer to
    the libraries or groupings of data in Storage
    Resource Broker (SRB), which is a
    client/server-based suite of data storage and
    movement tools

3
TG Collections: Present and Future
  • A number of data collection resources are
    available at the TeraGrid sites
  • The data collections table lists collections that
    are available to or created by TeraGrid users
  • The list contains a brief description or abstract
    of each collection that is currently in
    production at TeraGrid sites
  • URL links will connect directly to the collection
    interface or to more detailed information
  • http://www.teragrid.org/userinfo/data/collections.php

4
State of Collections
  • 89 collections from 5 sites have been listed as
    TeraGrid Collections on the web site
  • http://www.teragrid.org/userinfo/guide_data_collections_table.html
  • Many of the listed collections lack critical
    pieces of information needed for them to be
    useful and usable by the user community
  • A large number of the listed collections have
    inadequate or completely missing documentation
  • Many of the listed collections do not have any
    apparent connection to the rest of TG's
    resources or users

5
State of Collections
  • This is due to the lack of a coherent policy
    stating the requirements that a collection
    must satisfy in order to be made available on
    TG resources and become an official TG
    collection
  • TG Collections and usage models requirement
    analysis team (RAT)

6
Collections RAT Charter
  • Data Collections are a significant resource
    providing one of the vital non-compute classes of
    service to the TeraGrid user community
  • Data Collections differ from the traditional
    compute resources and therefore necessitate
    special attention
  • Potential and value must be presented to and
    utilized by the scientific user community in an
    effective manner
  • Special requirements must be made clear to and
    understood by both users and TeraGrid staff

7
Collections RAT Charter
  • Data collections should represent permanent data
    storage that is well organized, documented,
    searchable, publicly available and valuable to a
    wide audience
  • All TG collections should provide common look and
    feel
  • Since providing data collections as a resource is
    a fairly new endeavor, there are many
    uncertainties about collection usability,
    infrastructure, policies and many practical usage
    questions

8
Collections RAT Recommendations
  • We worked on answering questions regarding what
    defines and what constitutes a formally
    designated TG collection
  • What is an appropriate TG collection?
  • What criteria must a collection meet to qualify
    as a TG collection? What is the value added to
    the user community?
  • What does it mean to make data sets available on
    TG? Who, how and where?
  • What are the usage models for TG data
    collections, whether or not they are tied to
    other TG resources? Should the data be used in
    conjunction with compute resources,
    visualization resources, other data collections,
    science gateways or some other TG resource?

9
TG Collection Categories
  • Recommendation is that there should be several
    designated categories of TG collections based on
    usage model criteria
  • Category 1: Grid-related Data Collections
  • Category 2: Compute-related Data Collections
  • Category 3: GPFS-WAN Data Collections
  • Category 4: TG-affiliated Collections
  • Category 5: Non-TG Collection, other options
    (DataCentral, etc.)

10
Category 1: Grid-related Data Collections
  • Data is stored on TG resources and accessed
    over the TG network
  • Data is hosted on hardware connected to the TG
    network, taking advantage of the high-speed
    network
  • Usage model: an interface is provided to
    efficiently retrieve data from the repository
  • Data stored at the resource might require or be
    protected by Globus authentication
  • Data collection provides low-level APIs for
    accessibility or a resource-defined API (web
    pages vs. JDBC or CGI script returning XML)
  • Data is accessible through standard TG software
    stack utilities such as Globus, SRB, GridFTP,
    etc.
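
As a rough sketch of the last point, a client could assemble a GridFTP transfer for the `globus-url-copy` command-line tool. The host name and paths below are hypothetical placeholders, not real TeraGrid endpoints, and the function only builds the argument list; it does not run the transfer or handle Globus authentication.

```python
def gridftp_copy_command(host: str, remote_path: str, local_path: str) -> list[str]:
    """Build (but do not run) a globus-url-copy invocation.

    Assumes a GridFTP server is reachable on `host` and that both
    paths are absolute. The host and paths are supplied by the caller.
    """
    if not remote_path.startswith("/") or not local_path.startswith("/"):
        raise ValueError("paths must be absolute")
    source = f"gsiftp://{host}{remote_path}"   # remote GridFTP URL
    destination = f"file://{local_path}"       # local file URL
    return ["globus-url-copy", source, destination]


# Hypothetical endpoint and file names, for illustration only.
command = gridftp_copy_command(
    "gridftp.example.org", "/collections/demo/data.tar", "/tmp/data.tar"
)
print(" ".join(command))
```

Returning a list rather than a single string makes it easy to hand the command to a process launcher without shell quoting problems.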

11
Category 2: Compute-related Data Collections
  • Collections using TG compute resources
  • Collections using visualization or data analysis
    resources provided by TG
  • Collections might have different front ends:
    portal, gateway, etc.
  • Examples
      • Purdue portal: consolidates several earth
        observation data collections into one
        convenient portal
      • http://www.purdue.teragrid.org/portal
      • Gateways

12
Category 3: GPFS-WAN Data Collections
  • Collections sitting on GPFS-WAN
  • Taking advantage of proximity to the compute
    resources
  • Collections that are being computed on
  • Allocations RAT is working on allocating a
    section of GPFS-WAN disk space for collections

13
Category 4: TG-affiliated Collections
  • Collection belongs to a TG-related project
  • Collection exhibits some tenuous link to TG
  • Manifests a potential to be ingested into the
    grid and become a true TG collection

14
Category 5: Non-TG Collection, Other Options
  • DataCentral and Data allocations process
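
The five categories above can be sketched as a small classification routine. This is only an illustration: the attribute names on `Collection` are hypothetical, and the order in which the checks are applied is an assumption, since the slides do not say how a collection that matches several usage models should be categorized.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    GRID_RELATED = 1      # stored on TG resources, accessed over the TG network
    COMPUTE_RELATED = 2   # uses TG compute, visualization or analysis resources
    GPFS_WAN = 3          # sits on GPFS-WAN
    TG_AFFILIATED = 4     # belongs to a TG-related project
    NON_TG = 5            # other options such as DataCentral


@dataclass
class Collection:
    # Hypothetical usage-model flags; not part of any TeraGrid schema.
    name: str
    on_gpfs_wan: bool = False
    uses_tg_compute: bool = False
    hosted_on_tg_network: bool = False
    tg_related_project: bool = False


def categorize(collection: Collection) -> Category:
    """Pick a category; the precedence below is an assumption."""
    if collection.on_gpfs_wan:
        return Category.GPFS_WAN
    if collection.uses_tg_compute:
        return Category.COMPUTE_RELATED
    if collection.hosted_on_tg_network:
        return Category.GRID_RELATED
    if collection.tg_related_project:
        return Category.TG_AFFILIATED
    return Category.NON_TG


demo = Collection("demo", hosted_on_tg_network=True)
print(categorize(demo).name)
```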

15
What is Data Central?
  • The first program of its kind to support research
    and community data collections and databases
  • Data Central makes it possible to store, manage,
    analyze, mine, share and publish data collections
    thereby enabling access and collaboration in the
    broader scientific community

16
Data Central at work
  • Eligible researchers can request a data
    allocation from SDSC (with or without a compute
    allocation) that permits expanded access to
    SDSC's Data Central facilities and services for
    data collections management, data analysis and
    data mining

17
Why SDSC Data Central?
  • Today's scientists and engineers are increasingly
    dependent on valued community data collections
    and databases
  • SDSC has experienced increasing demand by the
    domain communities for collaborations on data
    management, including:
      • publishing of data in digital libraries
      • sharing of data through the Web and data grids
      • creating, optimizing and porting large-scale
        databases
      • analyzing and data mining large-scale data

18
A Deluge of Data
  • Today, data comes from everywhere
      • Scientific instruments
      • Experiments
      • Sensors and sensor nets
      • New devices
  • And is used by everyone
      • Scientists
      • Consumers
      • Educators
      • General public
  • IT environments must support unprecedented
    diversity, globalization, integration, scale, and
    use

[Image labels: Life Sciences; Preservation and Archiving; Astronomy]

19
What does SDSC Data Central offer?
  • SDSC has been actively working with and
    collaborating with many researchers and national
    scale projects in their data management efforts
  • We offer expertise and resources for:
      • Public Data Collections and Database Hosting
      • Long-term storage (tape and disk)
      • Remote data management and access (SRB)
      • Data Analysis and Data Mining
      • Professional, qualified 24/7 support

20
SDSC Data Resources
  • 540 TB Storage-area Network (SAN)
  • 1 PB On-line disk
  • 6 PB StorageTek tape library capacity
  • DB2, Oracle, MySQL
  • Storage Resource Broker

[Image captions: Petabyte-scale high-performance tape storage system;
High-performance SATA SAN disk storage system]

21
Data Resources Available through DataCentral
  • Disk
      • 400 Terabytes SATA SAN, Fibre Channel attached
      • Enables multiple high-end computers, using a
        range of operating systems, to share data
        rapidly and seamlessly
      • Growing data storage capabilities are
        integrated with high-end computational
        resources such as SDSC's 15.6 Teraflop DataStar
        IBM supercomputer and parallel I/O
      • Accessible: mounted, Web, SRB, GridFTP
  • Tape
      • 6 Petabyte capacity, high-speed robotic silos
      • Disk cache front end, transparently mounted via
        Sun SAM-QFS file system
      • Accessible: mounted, Web, SRB, GridFTP

22
Data Resources Available through DataCentral
  • Databases
      • DB2, Oracle, MySQL servers
      • High availability, high performance
      • Accessible: standard RDBMS connectivity, client
        software installed on most systems
  • Software
      • Storage Resource Broker (SRB): state-of-the-art
        data management and collaboration software for
        grid file access
      • Powerful software applications covering a range
        of disciplines including bioscience,
        geoscience, astronomy, chemistry, medicine, etc.
      • A wide array of data analysis, mining and
        visualization tools

23
Data Resources Available through DataCentral
  • Expertise in:
      • High-performance large data management
      • Data migration, upload and sharing through the
        grid
      • Database application tuning, porting and
        optimization
      • SQL query tuning
      • Schema design
      • Data analysis and data mining
      • Portal creation and collection publication

24
Data Resources Available through DataCentral
Quality User Support
  • Consulting
      • Phone, Web, e-mail
      • M-F, 9 a.m. - 5 p.m.
  • 24x7 Help Desk/Operational Support
  • Training
  • Documentation
  • User Portals
  • Targeted Optimization and Porting (TOP)
  • Strategic Applications Collaborations (SAC)
  • Strategic Community Collaborations (SCC)

25
SDSC Data Central Architecture
[Architecture diagram. Labels recovered from the slide: data-login;
web farm; DataStar GPFS 108 TB; HPSS; DB2; Oracle; 35 TB; 13 TB; 25 TB;
TeraGrid GPFS 51 TB; 6 PB Tape Capacity; TeraGrid GPFS-WAN 210 TB;
SamQFS 400 TB; Blue Gene (Intimidata) GPFS 40 TB]

26
Partial list of databases and data collections
currently housed at SDSC
  • Protein Data Bank (protein data)
  • National Virtual Observatory (astronomical data)
  • UCSD Libraries Image Collection (ArtStore)
  • National Science Digital Library (education
    collection)
  • SCEC (earthquake data)
  • BIRN (neuroscience data)
  • Encyclopedia of Life (genomic data)
  • TreeBase (phylogeny and ontology information)
  • Transport Classification Database (protein
    information)
  • Library of Congress data
  • CKAAPS (protein evolutionary information)
  • AfCS Molecule Pages (protein information)
  • SLAC-JCSG (structural genomics data)
  • APOPTOSIS DB (proteins related to cell death
    data)
  • NAVDAT (geochemistry data)
  • QRC (NSF data on Supercomputer Centers and PACI)
  • Network Topology Data (Skitter project)
  • UC Merced Library
  • Biology Workbench Databases (mirrors and
    originals of over 80 biology databases)
  • 2 Micron All Sky Survey (astronomy data)
  • Digital Palomar Observatory Sky Survey Collection
    (astronomy data)
  • Sloan Digital Sky Survey Collection (astronomy
    data)
  • Interpro Mirror (protein data)
  • HPWREN (wireless network analysis data)
  • HPWREN (sensor network data)
  • Security logs and archives (security information)
  • EarthRef Digital Archive (earth science
    information)
  • GERM (earth reservoir information)
  • Braindata (Rutgers neuroscience collection)
  • HyperLTER (hyperspectral images)
  • SIO-Explorer (oceanographic voyages)
  • Transana (classroom video)
  • WebBase (web crawls)
  • Alexandria Digital Library (photographs)
  • Backscatter Data (from UCSD network telescope)
  • Digital Earth Data Library (earth sciences
    related datasets)
  • GEON (PaleoGeographic Atlas project)
  • IMDC (Internet measurement data catalog)
  • Seamount Catalogue (bathymetric seamount maps)
  • Hayden Planetarium Collection (astronomical data)
  • TeraGrid Data (science and engineering
    collections)
  • Biocyc (collection of pathway/genome DBs)
  • Digital Embryo (human embryology)
  • National Archives (persistent archive)
  • San Diego Conservation Resources Network
    (sensitive species map server)
  • LDAS (land data assimilation system)
  • ROADNET (sensor data)
  • NPACI Data Grid (scientific simulation output)
  • Salk (biology data archive)
  • Backbone Packet Header Traces (OC48, OC12)
  • TeraGrid (science and engineering collections)
  • CHRONOS (analytical tools for chronostratigraphy)
  • ERESE (educational Earth science portal)
  • TeraBridge (Sensor stream data)
  • C5 Landscape (UCSD Art dept)

27
Sites Using the SRB

28
SDSC SRB Projects (60 million, 0.5 PB)
  • Digital Libraries
  • UCB, UMich, UCSB, Stanford, CDL
  • NSF NSDL - UCAR / DLESE
  • NASA Information Power Grid
  • Astronomy
  • National Virtual Observatory
  • 2MASS Project (2 Micron All Sky Survey)
  • Particle Physics
  • Particle Physics Data Grid (DOE)
  • GriPhyN
  • SLAC Synchrotron Data Repository
  • Medicine
  • Digital Embryo (NLM)
  • Earth Systems Sciences
  • ESIPS
  • LTER
  • Persistent Archives
  • NARA
  • LOC

29
Integrated Data Cyberinfrastructure
[Diagram labels: coordination; integration]

30
Getting an Allocation
  • Who should apply?
      • Open to researchers affiliated with US
        educational institutions
      • Proposals merit-reviewed quarterly by Data
        Allocations Committee
  • Types of Allocations
      • Expedited Allocations
          • 1 TB or less of disk/tape, 1st year
          • 30 GB database, 1st year
          • Yearly review
      • Medium Allocations
          • Under 30 TB
      • Large Allocations
          • Larger than 30 TB
  • Data Allocations
  • Getting Started: http://datacentral.sdsc.edu
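
The size thresholds listed above can be sketched as a simple lookup. This is an illustration only: the function name is hypothetical, it considers just the disk/tape size (not the 30 GB database limit for expedited requests), and treating a request of exactly 30 TB as "Large" is an assumption, since the slide leaves that boundary unspecified.

```python
def allocation_type(requested_tb: float) -> str:
    """Map a requested disk/tape size in terabytes to an allocation type."""
    if requested_tb <= 0:
        raise ValueError("requested size must be positive")
    if requested_tb <= 1:
        return "Expedited"   # 1 TB or less
    if requested_tb < 30:
        return "Medium"      # under 30 TB
    return "Large"           # 30 TB boundary handling is an assumption


for size in (0.5, 12, 45):
    print(size, "TB:", allocation_type(size))
```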

31
TG Data Collections Allocations
  • How to provide a mechanism for the addition of
    new Data Collections that satisfy the specified
    requirements
  • How should the current allocations process be
    modified to accommodate the data collections
    resource?
  • What are the storage and allocations guidelines
    and policies in conjunction with data
    collections? Who is eligible to request a data
    allocation?
  • How to extend/terminate allocations? What kind of
    review process is involved? What parameters
    would be considered?
  • Usage monitoring and usage tracking tools

32
TG Collection Process and Procedures
  • Documentation: well-defined metadata
  • Yearly review for each collection
  • Formal Allocation process
  • Accounting process
  • TG Collections coordinator
  • SLA for each collection, renewal
  • Parking vs. Collections vs. Project data
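
The documentation and yearly review items above could be supported by a small automated check. The sketch below is hypothetical: the required field names and the 365-day review interval are assumptions chosen for illustration, not a metadata standard or policy defined by TeraGrid.

```python
from datetime import date

# Hypothetical minimum metadata for a listed collection.
REQUIRED_FIELDS = ("name", "abstract", "url", "site", "contact")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def review_overdue(last_review: date, today: date) -> bool:
    """True if more than 365 days have passed since the last review."""
    return (today - last_review).days > 365


# Made-up record for illustration only.
record = {
    "name": "Demo Collection",
    "abstract": "",
    "url": "http://example.org/demo",
    "site": "SDSC",
}
print(missing_fields(record))
print(review_overdue(date(2005, 1, 1), date(2006, 6, 1)))
```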

33
Other tools provided supporting collections
  • What tools for management, analysis, mining,
    access and documentation will be provided with
    the collections?
  • Who is responsible for providing, installing and
    maintaining tools?
  • What data managing tools are available?

34
Thank You
  • Natasha Balac natashab_at_sdsc.edu
  • Roman Olschanowsky roman2u_at_sdsc.edu