Put Your Title Here - PowerPoint PPT Presentation

About This Presentation
Title:

Put Your Title Here

Description:

Can't steer the simulation waste of time and resource ... Dynamic steering of workflow by user. Dynamic user examination of results ... – PowerPoint PPT presentation

Slides: 21
Provided by: arie70
Learn more at: https://www.csm.ornl.gov

Transcript and Presenter's Notes



1
Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov contains extensive
publication list
2
Scientific Data Management Center
Participating Institutions
Center PI: Arie Shoshani (LBNL)
DOE Laboratories co-PIs
  • Bill Gropp, Rob Ross (ANL)
  • Arie Shoshani, Doron Rotem (LBNL)
  • Terence Critchlow, Chandrika Kamath (LLNL)
  • Nagiza Samatova, Andy White (ORNL)
Universities co-PIs
  • Mladen Vouk (North Carolina State)
  • Alok Choudhary (Northwestern)
  • Reagan Moore, Bertram Ludaescher (UC San Diego, SDSC)
  • Calton Pu (Georgia Tech)
  • Steve Parker (U of Utah, future)
3
Phases of Scientific Exploration
  • Data Generation
  • From large scale simulations or experiments
  • Fast data growth with computational power
  • Examples
  • HENP: 100 Teraops and 10 Petabytes by 2006
  • Climate: spatial resolution T42 (280 km) -> T85
    (140 km) -> T170 (70 km); T42 is about 1 TB per
    100-year run => factor of 10-20
  • Problems
  • Can't dump the data to storage fast enough:
    waste of compute resources
  • Can't move terabytes of data over WAN robustly:
    waste of scientists' time
  • Can't steer the simulation: waste of time and
    resources
  • Need to reorganize and transform data: large
    data-intensive tasks slowing progress
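
The climate example above can be checked with rough arithmetic. The sketch below assumes, as a simplification not stated on the slide, that output volume scales with the number of horizontal grid cells (the inverse square of the grid spacing) while vertical levels, output frequency, and the variables written stay fixed. The figure of about 1 TB per 100-year run at T42 is from the slide; the function name is illustrative.

```python
# Rough scaling of climate output volume with horizontal resolution.
# Assumption (not from the slide): volume is proportional to the number of
# horizontal grid cells, i.e. to 1 / spacing**2, with everything else fixed.

T42_SPACING_KM = 280.0
T42_VOLUME_TB = 1.0  # about 1 TB per 100-year run (from the slide)


def estimated_volume_tb(spacing_km: float) -> float:
    """Estimate output per 100-year run for a given grid spacing."""
    factor = (T42_SPACING_KM / spacing_km) ** 2
    return T42_VOLUME_TB * factor


for name, spacing in [("T42", 280.0), ("T85", 140.0), ("T170", 70.0)]:
    print(f"{name}: ~{estimated_volume_tb(spacing):.0f} TB per 100-year run")
```

Under this assumption, T170 produces about 16 times the T42 volume, which falls inside the slide's stated factor of 10-20.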

4
Phases of Scientific Exploration
  • Data Analysis
  • Analysis of large data volume
  • Can't fit all data in memory
  • Problems
  • Find the relevant data: need efficient indexing
  • Cluster analysis: need linear scaling
  • Feature selection: efficient high-dimensional
    analysis
  • Data heterogeneity: combine data from diverse
    sources
  • Streamline analysis steps: output of one step
    needs to match input of next

5
Example Data Flow in TSI
Logistical Network
Courtesy: John Blondin
6
Goal: Reduce the Data Management Overhead
  • Efficiency
  • Example: parallel I/O, indexing, matching storage
    structures to the application
  • Effectiveness
  • Example: access data by attributes, not files;
    facilitate massive data movement
  • New algorithms
  • Example: specialized PCA techniques to separate
    signals or to achieve better spatial data
    compression
  • Enabling ad-hoc exploration of data
  • Example: an exploratory "run and render"
    capability to analyze and visualize simulation
    output while the code is running

7
Approach
  • Use an integrated framework that
  • Provides a scientific workflow capability
  • Supports data mining and analysis tools
  • Accelerates storage and access to data
  • Simplify data management tasks for the scientist
  • Hide details of underlying parallel and
    indexing technology
  • Permit assembly of modules using a simple
    graphical workflow description tool

Figure: SDM Framework, showing the Scientific Process Automation Layer,
the Data Mining and Analysis Layer, and the Storage Efficient Access Layer,
along with Scientific Application and Scientific Understanding
8
Technology Details by Layer
9
Accomplishments: Storage Efficient Access (SEA)
Shared memory communication
Parallel Virtual File System Enhancements and
deployment
  • Developed Parallel netCDF
  • Enables high performance parallel I/O to
    netCDF datasets
  • Achieves up to 10-fold performance
    improvement over HDF5
  • Enhanced ROMIO
  • Provides MPI access to PVFS
  • Advanced parallel file system interfaces for
    more efficient access
  • Developed PVFS2
  • Adds Myrinet GM and InfiniBand support
  • Improved fault tolerance
  • Asynchronous I/O
  • Offered by Dell and HP for clusters
  • Deployed an HPSS Storage Resource Manager (SRM)
    with PVFS
  • Automatic access of HPSS files to PVFS through
    MPI-IO library
  • SRM is a middleware component

Figure: FLASH I/O Benchmark Performance (8x8x8 block
sizes), before and after
10
Robust Multi-file Replication
  • Problem: move thousands of files robustly
  • Takes many hours
  • Need error recovery
  • Mass storage systems failures
  • Network failures
  • Use Storage Resource Managers (SRMs)
  • Problem: too slow
  • Use parallel streams
  • Use concurrent transfers
  • Use large FTP windows
  • Pre-stage files from MSS
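
The combination of concurrent transfers and error recovery listed above can be sketched in a few lines. This is a minimal illustration, not the SRM interface: `transfer` is a hypothetical callable that moves one file and raises `OSError` on a transient failure, and the retry count and worker count are arbitrary. Parallel streams, FTP window tuning, and pre-staging from mass storage are not modeled.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def replicate(files, transfer, max_workers=4, max_retries=3, backoff_s=0.0):
    """Copy every file concurrently, retrying transient failures.

    Returns (done, failed): done maps file name -> attempts used,
    failed maps file name -> last error message.
    """

    def copy_with_retry(name):
        for attempt in range(1, max_retries + 1):
            try:
                transfer(name)  # hypothetical: moves a single file
                return attempt
            except OSError:
                if attempt == max_retries:
                    raise  # give up on this file only
                time.sleep(backoff_s * attempt)

    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_with_retry, f): f for f in files}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                done[name] = fut.result()
            except OSError as exc:
                failed[name] = str(exc)
    return done, failed


# Simulated run: one file fails once, then succeeds on the retry.
attempts = {}


def flaky_transfer(name):
    attempts[name] = attempts.get(name, 0) + 1
    if name == "f2.dat" and attempts[name] == 1:
        raise OSError("simulated network failure")


done, failed = replicate(["f1.dat", "f2.dat", "f3.dat"], flaky_transfer)
print(done, failed)
```

Recording the number of attempts per file gives the kind of per-file tracking that makes bottlenecks and recoveries visible afterwards.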

11
File tracking helps to identify bottlenecks
Shows that archiving is the bottleneck
12
File tracking shows recovery from transient
failures
Total: 45 GB
13
Accomplishments: Data Mining and Analysis (DMA)
  • Developed Parallel-VTK
  • Efficient 2D/3D Parallel Scientific
    Visualization for NetCDF and HDF files
  • Built on top of PnetCDF
  • Developed region tracking tool
  • For exploring 2D/3D scientific databases
  • Using bitmap technology to identify regions
    based on multi-attribute conditions
  • Implemented Independent Component Analysis (ICA)
    module
  • Used for accurate signal separation
  • Used for discovering key parameters that
    correlate with observed data
  • Developed highly effective data reduction
  • Achieves 15-fold reduction with high level of
    accuracy
  • Using parallel Principal Component Analysis (PCA)
    technology
  • Developed ASPECT
  • A framework that supports a rich set of pluggable
    data analysis tools
  • Including all the tools above
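
The idea of using bitmaps to select regions by multi-attribute conditions can be illustrated with a small, uncompressed index. This is a sketch under stated assumptions: the slides do not describe the center's compressed bitmap scheme, so the binning, the function names, and the sample values below are all illustrative. Each bitmap is a Python integer in which bit i is set when cell i falls into that bin.

```python
# Equality-encoded bitmap index over binned attribute values (uncompressed).


def build_index(values, bin_edges):
    """Return one bitmap per bin; bin k covers [bin_edges[k], bin_edges[k+1])."""
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for k in range(len(bitmaps)):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                bitmaps[k] |= 1 << i
                break
    return bitmaps


def at_least(bitmaps, first_bin):
    """Bitmap of cells whose value lies in bin `first_bin` or a higher bin."""
    result = 0
    for bm in bitmaps[first_bin:]:
        result |= bm
    return result


def cells(bitmap):
    """List the cell numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up data for six cells (illustrative values only).
temperature = [300, 1500, 1800, 400, 2100, 1700]
fuel = [0.9, 0.2, 0.1, 0.8, 0.05, 0.6]

temp_index = build_index(temperature, [0, 1000, 2000, 3000])
fuel_index = build_index(fuel, [0.0, 0.3, 0.7, 1.01])

# Condition: temperature >= 1000 AND fuel < 0.3
hot = at_least(temp_index, 1)
lean = fuel_index[0]
region = hot & lean
print(cells(region))  # cells satisfying both conditions
```

The multi-attribute condition reduces to bitwise OR and AND over precomputed bitmaps, so the raw attribute arrays are not rescanned at query time.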

Combustion region tracking
El Nino signal (red) and estimation (blue)
closely match
14
ASPECT Analysis Environment
pVTK Tool
R Analysis Tool
Select Data
Take Sample
Data Mining Analysis Layer
Read Data (buffer-name) Write Data
Read Data (buffer-name) Write Data
Read Data (buffer-name)
Get variables (var-names, ranges)
Use Bitmap (condition)
Bitmap Index Selection
Storage Efficient Access Layer
PVFS
Parallel NetCDF
Hardware, OS, and MSS (HPSS)
15
Accomplishments: Scientific Process Automation
(SPA)
  • Unique requirements of scientific WFs
  • Moving large volumes between modules
  • Tightly coupled, efficient data movement
  • Specification of granularity-based iteration
  • e.g., in spatio-temporal simulations, a time
    step is a granule
  • Support for data transformation
  • Complex data types (including file formats, e.g.,
    netCDF, HDF)
  • Dynamic steering of workflow by user
  • Dynamic user examination of results
  • Developed a working scientific workflow system
  • Automatic microarray analysis
  • Using web-wrapping tools developed by the center
  • Using Kepler WF engine
  • Kepler is an adaptation of the UC Berkeley tool,
    Ptolemy
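
Granularity-based iteration and user steering, as listed above, can be sketched with plain Python generators. This is not Kepler's API: the module names, the `Granule` type, and the `keep_going` callback (a stand-in for a user examining results and steering the run) are all hypothetical. Each time step is one granule that flows from module to module without the full run ever being held in memory.

```python
from typing import Callable, Iterable, Iterator, List

Granule = List[float]  # one time step of a (tiny) simulated field


def simulate(n_steps: int) -> Iterator[Granule]:
    """Hypothetical source module: emits one granule per time step."""
    for t in range(n_steps):
        yield [float(t + x) for x in range(4)]


def transform(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Hypothetical transformation module: rescales each granule."""
    for g in granules:
        yield [v * 0.5 for v in g]


def summarize(granules: Iterable[Granule],
              keep_going: Callable[[int, float], bool]) -> List[float]:
    """Sink module: records a mean per step; stops when steering says so."""
    means = []
    for step, g in enumerate(granules):
        mean = sum(g) / len(g)
        means.append(mean)
        if not keep_going(step, mean):  # stand-in for user steering
            break
    return means


# Stop after the first step whose mean reaches 2.0.
means = summarize(transform(simulate(100)), lambda step, mean: mean < 2.0)
print(means)
```

Because the generators are lazy, stopping the sink also stops the upstream modules, so no further time steps are computed once the run is steered to a halt.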

workflow steps defined graphically
workflow results presented to user
16
GUI for setting up and running workflows
17
Re-applying Technology
SDM technology, developed for one application,
can be effectively targeted at many other
applications
Technology                 Initial Application   New Applications
Parallel NetCDF            Astrophysics          Climate
Parallel VTK               Astrophysics          Climate
Compressed bitmaps         HENP                  Combustion, Astrophysics
Storage Resource Managers  HENP                  Astrophysics
Feature Selection          Climate               Fusion
Scientific Workflow        Biology               Astrophysics (planned)
18
Broad Impact of the SDM Center
  • Astrophysics
  • High speed storage technology, parallel NetCDF,
    parallel VTK, and ASPECT integration software
    used for Terascale Supernova Initiative (TSI) and
    FLASH simulations
  • Tony Mezzacappa - ORNL, John Blondin - NCSU,
    Mike Zingale - U of Chicago, Mike Papka - ANL
  • Climate
  • High speed storage technology, Parallel NetCDF,
    and ICA technology used for Climate Modeling
    projects
  • Ben Santer - LLNL, John Drake - ORNL, John
    Michalakes - NCAR
  • Combustion
  • Compressed Bitmap Indexing used for fast
    generation of flame regions and tracking their
    progress over time
  • Wendy Koegler, Jacqueline Chen - Sandia Lab

ASCI FLASH parallel NetCDF
Dimensionality reduction
Region growing
19
Broad Impact (cont.)
  • Biology
  • Kepler workflow system and web-wrapping
    technology used for executing complex highly
    repetitive workflow tasks for processing
    microarray data
  • Matt Coleman - LLNL
  • High Energy Physics
  • Compressed Bitmap Indexing and Storage Resource
    Managers used for locating desired subsets of
    data (events) and automatically retrieving data
    from HPSS
  • Doug Olson - LBNL, Eric Hjort - LBNL, Jerome
    Lauret - BNL
  • Fusion
  • A combination of PCA and ICA technology used to
    identify the key parameters that are relevant to
    the presence of edge harmonic oscillations in a
    Tokamak
  • Keith Burrell - General Atomics

Building a scientific workflow
Dynamic monitoring of HPSS file transfers
Identifying key parameters for the DIII-D
Tokamak
20
Goals for Years 4-5
  • Fully develop the integrated SDM framework
  • Implement the 3-layer framework on SDM center
    facility
  • Provide a way to select only components needed
  • Develop self-guiding web pages on the use of SDM
    components
  • Use existing successful examples as guides
  • Generalize components for reuse
  • Develop general interfaces between components in
    the layers
  • Support loosely-coupled WSDL interfaces
  • Support tightly-coupled components for efficient
    dataflow
  • Integrate operation of components in the
    framework
  • Hide details from user: automate parallel access
    and indexing
  • Develop a reusable library of components that can
    be selected for use in the workflow system