On Large Data-Flow Scientific Workflows: An Astrophysics Case Study - Integration of Heterogeneous Datasets using Scientific Workflow Engineering

Transcript and Presenter's Notes

1
On Large Data-Flow Scientific Workflows: An
Astrophysics Case Study - Integration of
Heterogeneous Datasets using Scientific
Workflow Engineering
  • Presenter: Mladen A. Vouk

2
Team (Scientific Process Automation - SPA)
  • Sangeeta Bhagwanani (MS student - GUI interfaces)
  • John Blondin (NCSU Faculty, TSI PI)
  • Zhengang Cheng (PhD student, services, V&V)
  • Dan Colonnese (MS student, graduated, workflow
    grid and reliability issues)
  • Ruben Lobo (PhD student, packaging)
  • Pierre Moualem (MS student, fault-tolerance)
  • Jason Kekas (PhD student, Technical Support)
  • Phoemphun Oothongsap (NCSU, Postdoc,
    high-throughput flows)
  • Elliot Peele (NCSU, Technical Support)
  • Mladen A. Vouk (NCSU faculty, SPA PI)
  • Brent Marinello (NCSU, workflow extensions)
  • Others

3
NC State researchers are simulating the death of
a massive star leading to a supernova explosion.
Of particular interest are the dynamics of the
shock wave generated by the initial implosion of
the star, which ultimately destroys the star as a
highly energetic supernova.
4
Key Current Task: Emulating live workflows
5
Key Issue
  • Very important to distinguish between a
    custom-made workflow solution and a more
    canonical set of operations, methods, and
    solutions that can be composed into a scientific
    workflow.
  • Complexity, skill level needed to implement,
    usability, maintainability, standardization
  • e.g., sort, uniq, grep, ftp, ssh on unix boxes
  • vs.
  • SAS (that can do sorting), home-made sort, SABUL,
    bbcp (free, but not standard), etc.

6
Topic: Computational Astrophysics
  • Dr. Blondin is carrying out research in the field
    of circumstellar gas dynamics. The numerical
    hydrodynamical code VH-1 is used on supercomputers
    to study a vast array of objects observed by
    astronomers, both from ground-based observatories
    and from orbiting satellites.
  • The two primary subjects under investigation are
    interacting binary stars - including normal stars
    like the Algol binary, and compact object systems
    like the high mass X-ray binary SMC X-1 - and
    supernova remnants - from very young, like SNR
    1987A, to older remnants like the Cygnus Loop.
  • Other astrophysical processes of current interest
    include radiatively driven winds from hot stars,
    the interaction of stellar winds with the
    interstellar medium, the stability of radiative
    shockwaves, the propagation of jets from young
    stellar objects, and the formation of globular
    clusters.

7
(Diagram - image slide: end-to-end data pipeline with
the following labeled components)
  • Logistic Network (L-Bone)
  • Aggregate to 500 files (<10 GB each)
  • Local Mass Storage (14 TB)
  • Input Data
  • Aggregate to one file (1 TB each)
  • Data Depot
  • HPSS archive
  • Local 44-processor Data Cluster - data sits on
    local nodes for weeks
  • Highly Parallel Compute - output 500x500 files
  • Viz Software / Viz Wall / Viz Client
8
Workflow - Abstraction
(Diagram - image slide: abstract workflow with the
following labeled components)
  • Stages: Model, Merge, Backup, Move, Split, Viz
  • Parallel Computation / Parallel Visualization
  • Mass Storage
  • Fiber Channel or local NFS
  • Data Mover Channel (e.g., LORS, BCC, SABUL, FC
    over SONET)
  • Head Node Services: SendData, RecvData
  • To VizWall
  • Web or Client GUI / Web Services
  • Control: Construct, Orchestrate, Monitor/Steer,
    Change, Stop/Start (see the composition sketch
    below)
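
The abstraction above names the stages (Model, Merge,
Backup, Move, Split, Viz) and a control layer that
constructs and orchestrates them. As a minimal sketch of
how such stages could be composed, assuming simple
placeholder stage functions and a sequential driver (none
of this is the project's actual implementation; Python is
used since the deck lists it among the scripting options):

# Minimal sketch: compose the abstract stages Model -> Merge -> Backup
# -> Move -> Split -> Viz as a simple sequential pipeline. All stage
# bodies are placeholders; the real system used head-node services,
# data movers (LORS/SABUL/BBCP), and a parallel viz back end.

from typing import Callable, List

Stage = Callable[[str], str]  # each stage maps a data location to a new one

def model(loc: str) -> str:
    print(f"model: producing time-step files under {loc}")
    return loc

def merge(loc: str) -> str:
    print(f"merge: combining per-processor files in {loc} into one file")
    return loc + "/merged.nc"

def backup(loc: str) -> str:
    print(f"backup: archiving {loc} to mass storage")
    return loc

def move(loc: str) -> str:
    print(f"move: transferring {loc} to the remote analysis cluster")
    return "remote:" + loc

def split(loc: str) -> str:
    print(f"split: partitioning {loc} for parallel visualization")
    return loc

def viz(loc: str) -> str:
    print(f"viz: rendering {loc} to the viz wall")
    return loc

def run_workflow(stages: List[Stage], start: str) -> str:
    """Drive the stages in order, passing each result to the next."""
    loc = start
    for stage in stages:
        loc = stage(loc)
    return loc

if __name__ == "__main__":
    run_workflow([model, merge, backup, move, split, viz], "/scratch/run0001")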
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
Current and Future Bottlenecks
  • Computing Resources and Computational Speed
    (1000 Cray X1 processors, compute times of 30 hrs,
    wait time)
  • Storage and Disks (14 TB, reliable and sustainable
    transfer speeds of 300 MB/s)
  • Automation
  • Reliable and Sustainable Network Transfer Rates
    (300 MB/s; see the quick arithmetic below)
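
As a quick sanity check on these rates (my arithmetic, not
from the slides), the transfer times implied by 300 MB/s
sustained and by the 30 MB/s wide-area rate quoted on the
next slide:

# Back-of-the-envelope transfer times for the quoted rates.
# Assumption: decimal units (1 TB = 1,000,000 MB); the slide does not
# say which convention it uses.

def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s MB/s."""
    return size_tb * 1.0e6 / rate_mb_s / 3600.0

print(f"1 TB at 300 MB/s  : {transfer_hours(1, 300):.1f} h")   # ~0.9 h
print(f"1 TB at 30 MB/s   : {transfer_hours(1, 30):.1f} h")    # ~9.3 h
print(f"14 TB at 300 MB/s : {transfer_hours(14, 300):.1f} h")  # ~13.0 h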
13
Bottlenecks (B-specific)
  • Supercomputer, Storage, HPSS, EnSight memory
  • Average per-job wait time is 24-48 hrs (could be
    longer if more processors are requested or more
    time slices are calculated).
  • One run of 6 hrs (run time) on the Cray X1
    currently uses 140 processors and produces 10 time
    steps. Each time step has 140 Fortran-binary files
    (28 GB total). Hence, currently, this is 280 GB per
    6-hr run. Full visualization takes about 300 to 500
    time slices (30 to 50 runs), i.e., about 28 GB x
    (300 to 500), or roughly 10 to 14 TB of space.
  • The 140 files of a time step are merged into one
    (1) netCDF file (takes about 10 min).
  • BBCP the file to NCSU at about 30 MB/s, or about
    15 min per time slice (this can be done in
    parallel with the next time-slice computation; see
    the sketch below). In the future, network transfer
    speeds and disk access speeds may be an issue.
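
A minimal sketch of the per-time-step pattern described
above: merge the 140 per-processor files into one netCDF
file, then ship it while the next time step is being
computed. The merge tool name, remote host, and paths are
assumptions for illustration; only the overall
merge-then-overlapped-transfer pattern comes from the
slide:

# Sketch of the per-time-step pattern: merge, then transfer with bbcp
# while the next time step is being produced. The merge tool name,
# remote host, and paths are assumptions, not taken from the slides.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def merge_timestep(step: int) -> str:
    """Merge the 140 per-processor files of one time step into a single
    netCDF file (hypothetical merge tool)."""
    out = f"timestep_{step:04d}.nc"
    subprocess.run(["merge_to_netcdf", f"run/step_{step:04d}", out],
                   check=True)
    return out

def transfer(path: str) -> None:
    """Move the merged file to NCSU with bbcp (host and path assumed)."""
    subprocess.run(["bbcp", path, "ncsu-host:/data/incoming/"], check=True)

def process_run(num_steps: int = 10) -> None:
    """Merge each time step (~10 min), and overlap its ~15 min transfer
    with the merge of the next one."""
    with ThreadPoolExecutor(max_workers=1) as mover:
        pending = None
        for step in range(num_steps):
            merged = merge_timestep(step)
            if pending is not None:
                pending.result()              # wait for previous transfer
            pending = mover.submit(transfer, merged)
        if pending is not None:
            pending.result()

if __name__ == "__main__":
    process_run()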

14
B-specific Top-Level W/F Operations
  • Operators: Create W/F (reserve resources), Run
    Model, Backup Output, PostProcess Output (e.g.,
    Merge, Split), MoveData, AnalyzeData (Viz,
    other?), Monitor Progress (state, audit,
    backtrack, errors, provenance), Modify Parameters
  • States: Modeling, Backup, Postprocessing (A, ..
    Z), MovingData, Analyzing Remotely
  • Creators: CreateWF, Model?, Expand
  • Modifiers: Merge, Split, Move, Backup, Start,
    Stop, ModifyParameters
  • Behaviors: Monitor, Audit, Visualize,
    Error/Exception Handling, Data Provenance, ...
    (a sketch of these states and operators follows
    below)
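
The lists above can be read as a small workflow
vocabulary: a set of states and named operators that act
on a workflow instance. A minimal sketch of that
vocabulary, with names taken from the slide but all
structure, typing, and behavior assumed:

# Sketch of the slide's workflow vocabulary: states a workflow can be
# in, plus operator names grouped as on the slide. Structure is assumed.

from enum import Enum, auto

class WorkflowState(Enum):
    MODELING = auto()
    BACKUP = auto()
    POSTPROCESSING = auto()     # the slide's "Postprocessing (A, .. Z)"
    MOVING_DATA = auto()
    ANALYZING_REMOTELY = auto()

# Operator names grouped as on the slide.
CREATORS  = ["CreateWF", "Model", "Expand"]
MODIFIERS = ["Merge", "Split", "Move", "Backup", "Start", "Stop",
             "ModifyParameters"]
BEHAVIORS = ["Monitor", "Audit", "Visualize", "ErrorHandling",
             "DataProvenance"]

class Workflow:
    """Tiny state holder; real operators would reserve resources,
    launch jobs, move data, etc."""
    def __init__(self) -> None:
        self.state = WorkflowState.MODELING

    def transition(self, new_state: WorkflowState) -> None:
        print(f"{self.state.name} -> {new_state.name}")
        self.state = new_state

if __name__ == "__main__":
    wf = Workflow()
    wf.transition(WorkflowState.POSTPROCESSING)
    wf.transition(WorkflowState.MOVING_DATA)
    wf.transition(WorkflowState.ANALYZING_REMOTELY)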

15
Goal: Ubiquitous Canonical Operations for
Scientific W/F Support
  • Fast data transfer from A to B (e.g., LORS,
    SABUL, GridFTP, BBCP?, other; see the interface
    sketch below)
  • Database access
  • Stream merging and splitting
  • Flow monitoring
  • Tracking, Auditing, provenance
  • Verification and Validation
  • Communication service (web services, grid
    services, xmlrpc, etc.)
  • Other
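
The canonical-operations goal can be made concrete by
hiding tool-specific movers (LORS, SABUL, GridFTP, BBCP)
behind one stable interface, so workflow code never
depends on which tool a given site provides. The interface
and class names below are assumptions, not the project's
actual API:

# Sketch of a canonical "move data from A to B" operation with
# pluggable backends. Interface and class names are assumptions.

from abc import ABC, abstractmethod
import subprocess

class DataMover(ABC):
    """Canonical transfer operation: move src to dst, return success."""
    @abstractmethod
    def move(self, src: str, dst: str) -> bool: ...

class BbcpMover(DataMover):
    def move(self, src: str, dst: str) -> bool:
        # bbcp is invoked like scp: bbcp <src> <dst>
        return subprocess.run(["bbcp", src, dst]).returncode == 0

class GridFtpMover(DataMover):
    def move(self, src: str, dst: str) -> bool:
        # globus-url-copy <source-url> <destination-url>
        return subprocess.run(["globus-url-copy", src, dst]).returncode == 0

def transfer(mover: DataMover, src: str, dst: str) -> None:
    """Workflow code depends only on the canonical interface,
    not on which mover is installed at a given site."""
    if not mover.move(src, dst):
        raise RuntimeError(f"transfer of {src} failed")

if __name__ == "__main__":
    transfer(BbcpMover(), "timestep_0001.nc", "ncsu-host:/data/incoming/")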

16
Issues (1)
  • Communication Coupling (loose, tight, v. tight,
    code-level) and Granularity (fine, medium?,
    coarse)
  • Communication Methods (e.g., ssh tunnels, xmlrpc,
    snmp, web/grid services, etc.) - e.g., apparently
    poor support for Cray
  • Storage issues (e.g., p-netcdf support,
    bandwidth)
  • Direct and Indirect Data Flows (functionality,
    throughput, delays, other QoS parameters)
  • End-to-end performance
  • Level of abstraction
  • Workflow description language(s) and exchange
    issues, interoperability
  • Standard scientific computing W/F functions

17
Issues (2)
  • Problem is currently similar to old-time
    punched-card job submissions (long turn-around
    time, can be expensive due to front-end
    computational resource I/O bottleneck) - need
    up-front verification and validation; things will
    change
  • Back-end bottleneck due to hierarchical storage
    issues (e.g., retrieval from HPSS)
  • Long-term workflow state preservation - needed
  • Recovery (transfers, other failures) - more
    needed
  • Tracking data and files - needed
  • Who maintains equipment, storage, data, scripts,
    workflow elements? Elegant solutions may not be
    good solutions from the perspective of autonomy.
  • EXTREMELY IMPORTANT!!! We are trying to get out
    of the business of totally custom-made solutions.

18
Workflow - Abstraction
(Diagram - image slide: the same abstract workflow as
slide 8, annotated with performance goals)
  • Stages: Model, Merge, Backup, Move, Split, Viz
  • Goal: 2-3 Gbps transfer rates end-to-end
  • Goal: 1 TB per night
  • Parallel Computation / Parallel Visualization
  • Mass Storage
  • Fiber Channel or local NFS
  • Data Mover Channel (e.g., LORS, SABUL, FC over
    SONET)
  • Head Node Services: SendData, RecvData
  • To VizWall
  • Web Services / Web or Client GUI
  • Control: Construct, Orchestrate, Monitor/Steer,
    Change, Stop/Start
19
Communications
  • Web/Java-based GUI
  • Web Services for orchestration - overall and
    less than tightly coupled sub-workflows (see the
    XML-RPC sketch below)
  • LSF and MPI for parallel computation
  • Scripts (in this example csh/sh based; could be
    Perl, Python, etc.) on local machines -
    interpreted language
  • High-level programming language for simulations,
    complex data movement algorithms, and similar -
    compiled language
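
As a concrete example of the loosely coupled option, the
sketch below shows a control client calling a head-node
workflow service over XML-RPC (one of the communication
methods listed on the Issues slide). The endpoint URL and
method names are illustrative assumptions, not the
project's actual service interface:

# Sketch: a loosely coupled control client talking to a head-node
# workflow service over XML-RPC. URL and method names are assumed.

import xmlrpc.client

def main() -> None:
    # Hypothetical head-node service endpoint.
    head_node = xmlrpc.client.ServerProxy("http://head-node.example.org:8000/")

    # Hypothetical control operations mirroring the slides'
    # Construct / Orchestrate / Monitor / Stop-Start vocabulary.
    wf_id = head_node.create_workflow("tsi-run-0001")
    head_node.start(wf_id)
    print("state:", head_node.monitor(wf_id))

if __name__ == "__main__":
    main()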

20
B-specific Top-Level W/F Operations
  • Operators: Create W/F (reserve resources), Run
    Model, Backup Output, PostProcess Output (e.g.,
    Merge, Split), MoveData, AnalyzeData (Viz,
    other?), Monitor Progress (state, audit,
    backtrack, errors, provenance), Modify Parameters
  • States: Modeling, Backup, Postprocessing (A, ..
    Z), MovingData, Analyzing Remotely
  • Constructor: CreateWF, Model?, Expand
  • Modifiers: Merge, Split, Move, Backup, Start,
    Stop, ModifyParameters
  • Behaviors: Monitor, Audit, Visualize, ...
21
(No Transcript)