Scientific Data Management - PowerPoint PPT Presentation

Title:

Scientific Data Management

Slides: 14
Provided by: ariesho
Learn more at: https://www.anl.gov

Transcript and Presenter's Notes


1
Scientific Data Management Center (SDM-ISIC)
Arie Shoshani
Computing Sciences Directorate
Lawrence Berkeley National Laboratory
http://sdm.lbl.gov/sdmcenter
2
Participants
Center Director: Arie Shoshani
DOE Laboratories
  • ANL: Bill Gropp <gropp@mcs.anl.gov> (coordinating PI),
    Rob Ross <rross@mcs.anl.gov>
  • LBNL: Ekow Otoo <ejotoo@lbl.gov>,
    Arie Shoshani <shoshani@lbl.gov> (coordinating PI)
  • LLNL: Terence Critchlow <critchlow@llnl.gov> (coordinating PI)
  • ORNL: Randy Burris <burrisrd@ornl.gov>,
    Thomas Potok <potokte@ornl.gov> (coordinating PI)
Universities
  • Georgia Institute of Technology: Ling Liu <lingliu@cc.gatech.edu>,
    Calton Pu <calton.pu@cc.gatech.edu> (coordinating PI)
  • North Carolina State University: Mladen Vouk <vouk@csc.ncsu.edu>
    (coordinating PI)
  • Northwestern University: Alok Choudhary <choudhar@ece.nwu.edu>
    (coordinating PI), Wei-Keng Liao <wkliao@ece.nwu.edu>
  • UC San Diego (Supercomputer Center): Amarnath Gupta <gupta@sdsc.edu>,
    Reagan Moore <moore@sdsc.edu> (coordinating PI)
3
Original Goals and Framework
  • Coordinated framework for the unification, development, deployment,
    and reuse of scientific data management software
  • Framework
  • 4 areas
  • Very large databases
  • Distributed databases
  • Heterogeneous databases
  • Data mining
  • (Agent technology)
  • 4 tier levels
  • Storage level
  • File level
  • Dataset level
  • Federated data level

4
Master Diagram
(Diagram: projects laid out by area and by tier level)
Areas
  • 1) Storage and retrieval of very large datasets
  • 2) Access optimization of distributed data
  • 3) Data mining and discovery of access patterns
  • 4) Distributed, heterogeneous data access
  • 5) Agent technology
Tier levels
  • a) Storage Level
  • b) File Level
  • c) Dataset Level
  • d) Dataset Federation Level
Projects
  • Multi-tier metadata system for querying heterogeneous data sources
    (LLNL, Georgia Tech)
  • Knowledge-based federation of heterogeneous databases (SDSC)
  • Analysis of application-level query patterns (LLNL, NWU)
  • Optimizing shared access to tertiary storage (LBNL, ORNL)
  • High-dimensional indexing techniques (LBNL)
  • Multi-agent high-dimensional cluster analysis (ORNL)
  • MPI I/O implementation based on file-level hints (ANL, NWU)
  • Low-level API for grid I/O (ANL)
  • Dimension reduction and sampling (LLNL, LBNL)
  • Parallel I/O: improving parallel access from clusters (ANL, NWU)
  • Adaptive file caching in a distributed system (LBNL)
  • Optimization of low-level data storage, retrieval and transport (ORNL)
  • Enabling communication among tools and data (ORNL, NCSU)
Diagram label: Grid Enabling Technology
5
Scientific Data Management ISIC
  • DOE Labs: ANL, LBNL, LLNL, ORNL
  • Universities: GTech, NCSU, NWU, SDSC

Scientific Simulations & Experiments
  • Climate Modeling
  • Astrophysics
  • Genomics and Proteomics
  • High Energy Physics

Data volumes: Petabytes, Terabytes

SDM-ISIC Technology
  • Optimizing shared access from mass storage systems
  • Metadata and knowledge-based federations
  • API for Grid I/O
  • High-dimensional cluster analysis
  • High-dimensional indexing
  • Adaptive file caching
  • Agents

Data Manipulation
  • Getting files from tape archive
  • Extracting subset of data from files
  • Reformatting data
  • Getting data from heterogeneous, distributed systems
  • Moving data over the network

Share of time
  • Current: Data Manipulation 80%, Scientific Analysis & Discovery 20%
  • Goal (using SDM-ISIC technology): Data Manipulation 20%,
    Scientific Analysis & Discovery 80%

Goals
  • Optimize and simplify
  • access to very large datasets
  • access to distributed data
  • access to heterogeneous data
  • data mining of very large datasets
6
Benefits to Applications
  • Efficiency
  • Example: by removing I/O bottlenecks, matching storage structures
    to the application
  • Effectiveness
  • Example: by making access to data from tertiary storage or various
    sites on the data grid transparent, more effective data exploration
    is possible
  • New algorithms
  • Example: by developing a more effective high-dimensional clustering
    technique for large datasets, discovery of new correlations is
    possible
  • Enabling ad-hoc exploration of data
  • Example: by enabling a "run and render" capability to visualize
    simulation output while the code is running, it is possible to
    monitor and steer a long-running simulation

7
Current Projects
  • 1) High-Dimensional Clustering
  • Target applications: Astrophysics, Climate Modeling
  • LLNL, ORNL
  • Scientific problem targeted: to understand the mechanism(s) behind
    core-collapse supernovae, it is crucial to explore and quantify
  • The correlations between the neutrino flux and stellar core
    convection
  • The correlations between convection and spatial dimensionality
  • The correlations between convection and rotation
  • Contact: Anthony Mezzacappa, ORNL
  • Scientific problem targeted: separating volcano and ENSO (El Niño
    Southern Oscillation) signals from the rest of the climate data to
    study variability in temperature
  • Contact: Ben Santer, PCMDI, LLNL
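
As a rough illustration of what clustering high-dimensional points involves, here is a minimal sketch using plain k-means on a handful of in-memory points. It is not the multi-agent technique named in the slides, which the transcript does not describe. The function name kmeans, the sample points, and the starting centers are all assumptions made for this example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:
                new_centers.append(old)  # keep an empty cluster's center unchanged
                continue
            new_centers.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
            )
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Hypothetical 3-dimensional samples forming two well-separated groups.
points = [(0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (0.1, 0.2, 0.0),
          (5.0, 5.1, 4.9), (5.2, 4.9, 5.0), (4.9, 5.0, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

Every distance computation here touches all dimensions of every point. That cost is one reason the slides pair clustering with dimension reduction and sampling for large datasets.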

8
Current Projects
  • 2) Efficient Parallel I/O to Disk Storage
  • Target application: Astrophysics
  • ANL, NWU, LLNL
  • Scientific problem targeted: astrophysics simulation code (FLASH).
    Early production runs spent as much as half of their time writing
    checkpoint and visualization data
  • Contact: Mike Zingale, U of Chicago
  • Scientific problem targeted: improving parallel I/O efficiency for
    tiled displays, a popular medium for collaborative viewing of
    high-resolution visualizations of astrophysics data
  • Contact: Mike Papka, ANL
  • Scientific problem targeted: query pattern analysis for astrophysics
    star data, devising a disk layout for the data such that overall
    data access time across multiple applications and users is reduced
  • Contact: LLNL
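
The idea of choosing a disk layout from observed query patterns can be sketched as follows. This is a toy model, not the project's method. It assumes a 2-D table (stars by attributes), rectangular queries, and that cost is proportional to the number of contiguous runs read. The names contiguous_runs and best_layout and the query log are hypothetical.

```python
def contiguous_runs(shape, rows, cols, layout):
    """Count contiguous on-disk runs needed to read a rectangular query.

    shape is (total_rows, total_cols); rows and cols are half-open
    (start, stop) index ranges selected by the query.
    """
    n_rows = rows[1] - rows[0]
    n_cols = cols[1] - cols[0]
    if layout == "row-major":
        # Full-width rows sit next to each other on disk and merge into one run.
        return 1 if n_cols == shape[1] else n_rows
    if layout == "column-major":
        # Full-height columns sit next to each other on disk and merge into one run.
        return 1 if n_rows == shape[0] else n_cols
    raise ValueError(f"unknown layout: {layout}")

def best_layout(shape, query_log):
    """Pick the layout with the fewest total runs over a log of (rows, cols, count)."""
    totals = {
        layout: sum(contiguous_runs(shape, r, c, layout) * count
                    for r, c, count in query_log)
        for layout in ("row-major", "column-major")
    }
    return min(totals, key=totals.get), totals

# Hypothetical log: 40 queries read two attributes for every star,
# 5 queries read every attribute for two stars.
shape = (1000, 50)
query_log = [((0, 1000), (3, 5), 40), ((10, 12), (0, 50), 5)]
layout, totals = best_layout(shape, query_log)
```

With this log the attribute-wise scans dominate, so the column-major layout has the lower total. A log dominated by per-star lookups would favor the row-major layout instead.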

9
Current Projects
  • 3) Providing transparent access to grid data
  • Target application: High Energy Physics
  • LBNL, ORNL
  • Scientific problem targeted: given a logical request (expressed on
    event attributes), get relevant data from grid sites and tertiary
    storage to application code without human intervention
  • Contact: Doug Olson, LBNL
  • Contact: Stephen Gowdy, SLAC
  • Contact: Jackie Chen, Sandia Livermore (combustion)
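
The flow from a logical request to physical files might look like the sketch below. It assumes an event index that maps attributes to file names and a replica catalog that lists where each file lives. The record layout, site names, file names, and relative costs are all invented for illustration and do not come from the project's software.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    site: str
    medium: str  # "disk" or "tape" (tertiary storage)

# Hypothetical relative access costs: tape is assumed far slower than disk.
MEDIUM_COST = {"disk": 1, "tape": 20}

def files_for_request(event_index, predicate):
    """Return the sorted file names holding events whose attributes satisfy predicate."""
    return sorted({event["file"] for event in event_index if predicate(event)})

def plan_transfers(files, replica_catalog):
    """Choose the cheapest known replica for each needed file."""
    return {
        name: min(replica_catalog[name], key=lambda r: MEDIUM_COST[r.medium])
        for name in files
    }

# Hypothetical event index and replica catalog.
event_index = [
    {"event_id": 1, "energy": 12.5, "n_tracks": 40, "file": "run1.dat"},
    {"event_id": 2, "energy": 3.1, "n_tracks": 12, "file": "run1.dat"},
    {"event_id": 3, "energy": 15.0, "n_tracks": 55, "file": "run2.dat"},
    {"event_id": 4, "energy": 2.0, "n_tracks": 8, "file": "run3.dat"},
]
replica_catalog = {
    "run1.dat": [Replica("site-a", "tape"), Replica("site-b", "disk")],
    "run2.dat": [Replica("site-a", "tape")],
    "run3.dat": [Replica("site-c", "disk")],
}

# The logical request is a condition on event attributes, not a list of files.
needed = files_for_request(
    event_index, lambda e: e["energy"] > 10 and e["n_tracks"] > 30
)
plan = plan_transfers(needed, replica_catalog)
```

The application code only states the condition on event attributes. Which files are fetched, and from which site or storage tier, is decided by the lookup and planning steps.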

10
Current Projects
  • 4) Heterogeneous Data Federation
  • Target application: Biology
  • LLNL, SDSC, GTech, NCSU, ORNL
  • Scientific problem targeted: developing our infrastructure in
    support of cancer researchers at LLNL, who expect to use it to help
    identify genes that respond to low doses of radiation. This problem
    is difficult because the information required by the scientists is
    spread across many independent, web-based data sources, each using
    its own interfaces and data formats
  • Contact: Matt Coleman, LLNL
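
One common way to handle sources that each use their own interface and format is to put a small wrapper in front of each source and merge the results into a common schema. The sketch below assumes two made-up sources, one CSV and one JSON, with placeholder gene names and values. It is not the project's federation infrastructure.

```python
import csv
import io
import json

# Two hypothetical sources describing the same genes with different
# formats and field names. All names and values are placeholders.
SOURCE_A_CSV = "gene_symbol,fold_change\nGENE_A,2.4\nGENE_B,3.1\nGENE_C,1.0\n"
SOURCE_B_JSON = (
    '[{"name": "GENE_B", "organism": "human"},'
    ' {"name": "GENE_A", "organism": "human"}]'
)

def read_source_a(text):
    """Wrapper for the CSV source: yields records in the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"gene": row["gene_symbol"], "fold_change": float(row["fold_change"])}

def read_source_b(text):
    """Wrapper for the JSON source: yields records in the common schema."""
    for item in json.loads(text):
        yield {"gene": item["name"], "organism": item["organism"]}

def federate(*record_streams):
    """Merge records from all wrappers into one view keyed by gene."""
    merged = {}
    for stream in record_streams:
        for record in stream:
            merged.setdefault(record["gene"], {}).update(record)
    return merged

view = federate(read_source_a(SOURCE_A_CSV), read_source_b(SOURCE_B_JSON))

# A single query over the merged view, with an arbitrary example threshold.
responsive = sorted(
    gene for gene, rec in view.items() if rec.get("fold_change", 0.0) >= 2.0
)
```

Adding a third source in this sketch means writing one more wrapper; the query over the merged view stays the same.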
