Scientific Workflows - PowerPoint PPT Presentation

About This Presentation
Title:

Scientific Workflows

Description:

Scientific Workflows – PowerPoint PPT presentation

Slides: 146
Provided by: bertr68
Transcript and Presenter's Notes

Title: Scientific Workflows


1
Introduction to Scientific Workflow Management
and the Kepler System
Ilkay Altintas (1), Roselyne Barreto (4), Paul Breimyer (5), Terence Critchlow (2),
Daniel Crawl (1), Ayla Khan (3), David Koop (3), Scott Klasky (4), Jeff Ligon (5),
Bertram Ludaescher (6), Pierre Mouallem (5), Meiyappan Nagappan (5), Steve Parker (3),
Norbert Podhorszki (6), Claudio Silva (3), Mladen Vouk (5)
(1) San Diego Supercomputer Center
(2) Pacific Northwest National Laboratory
(3) University of Utah
(4) Oak Ridge National Laboratory
(5) North Carolina State University
(6) University of California, Davis
2
Tutorial Overview
  • 9:00-9:30am Introduction to Scientific Workflows
    (Bertram)
  • 9:30-10:00am Workflow Demos (Daniel)
  • 10:00-10:30am BREAK
  • 10:30-11:00am Kepler Basics (Bertram)
  • 11:00am-12:30pm (install Kepler) Hands-On 1 (Daniel)
  • 12:30-1:30pm LUNCH
  • 1:30-2:00pm Advanced Features (Bertram)
  • 2:00-3:00pm Hands-On 2 (Daniel)
  • 3:00-3:30pm BREAK
  • 3:30-4:10pm Workflows & Provenance (Bertram)
  • 4:10-4:40pm Provenance Demo (Daniel)
  • 4:40-5:00pm Q&A and Open Discussion

3
Introduction to Scientific Workflows
  • Motivating Examples
  • Ecological Niche Modeling
  • Processing Sensor Data Streams
  • Ecology, Oceanography use cases
  • Fusion Simulation
  • Requirements & Features

4
Scientific Workflows Cyberinfrastructure
UPPER-WARE
5
Scientific Workflow
  • Capture how a scientist works with data and
    analytical tools
  • data access, transformation, analysis,
    visualization
  • possible worldview: dataflow-oriented (cf.
    signal processing)
  • Scientific workflow (wf) benefits (vs.
    script-based approaches)
  • wf automation
  • wf component reuse
  • wf design, documentation
  • wf archival, sharing
  • built-in concurrency
  • (task-, pipeline-parallelism)
  • built-in provenance support
  • distributed & parallel execution
  • Grid & cluster support

6
Kepler science domains
  • Ecology
  • SEEK Ecological Niche Modeling and climate
    change
  • REAP Modeling parasite invasions in grasslands
    using sensor networks
  • NEON Ecological sensor networks
  • COMET environmental science
  • Geosciences
  • GEON LiDAR data processing
  • GEON Geological data integration
  • Molecular biology
  • SDM Gene promoter identification
  • ChIP-chip genome-scale research
  • CAMERA metagenomics
  • Physics
  • CPES Plasma fusion simulation
  • FermiLab particle physics
  • Oceanography
  • REAP SST data processing
  • LOOKING ocean observing CI
  • NORIA ocean observing CI
  • Phylogenetics
  • ATOL/pPOD Processing Phylodata
  • CiPRES phylogenetic tools
  • Chemistry
  • Resurgence Computational chemistry
  • DART (X-Ray crystallography)
  • Library science
  • DIGARCH Digital preservation
  • Cheshire digital library archival
  • Conservation biology
  • SanParks Thresholds of Potential Concerns

Slide Matt Jones
7
Simple Kepler workflow using R (a statistics
package)
8
Ecological Niche Modeling
Temperature layer
Many other layers
Slide from D. Pennington
9
Managing Complexity
  • Scientific workflows use hierarchy to manage
    complexity
  • Top level workflows can be a conceptual
    representation of the science process that is
    easy to comprehend at a glance
  • Drilling down into sub-workflows reveals
    increasing levels of detail
  • Composing models using hierarchy promotes the
    development of re-usable components that can be
    shared with other scientists

10
Partial ENM Workflow
Slide Matt Jones
11
Workflow features required by ENM use case
  • Design phase
  • Access to distributed data specimens and climate
  • Streamline, automate labor-intensive data
    preparation
  • Workflow GUI environment
  • communication about complex models
  • experimentation and rapid modification of models
  • re-usable, sharable components
  • Software environment
  • Multi-platform (Mac/Windows/Linux), open,
    extensible

Slide Matt Jones
12
Workflow features required by ENM use case
  • Execution phase
  • Execution using multiple analytical environments
  • Java, C, R, Matlab, GDAL, web services, ...
  • Integration of multiple computing environments
    into a single environment
  • → Glue-ware
  • High-throughput distributed execution
  • Iterate across many species with many model runs
  • Assume no prior knowledge of distributed
    computing technologies
  • Thread-safe components: no back-channel
    communication
  • Archiving products in community repositories
  • Provenance and metadata for derived products

Slide Matt Jones
13
NSF/CEOP REAP (Real-time Analysis Pipelines)
Ecology, Oceanography case studies
  • Terrestrial Ecology
  • Predictive Modeling to Examine the Role of an
    Insect Vectored Pathogen in Exotic Plant Invasion
  • temperature, precipitation, light interception at
    7 core research areas
  • integrate Metacat archived data with these
    sensors in analyses implemented in Kepler
  • Oceanography
  • Integrated Framework for Hybrid, Adaptive Ocean
    Modeling
  • Sea Surface Temperature (SST) fields from OPeNDAP
    servers
  • Kepler workflows to quantitatively evaluate SST
    data sets

Slide Matt Jones
14
REAP Project Goals
  • For scientists
  • capabilities for designing and executing complex
    analytical models over near real-time and
    archived data sources
  • For data-grid engineers
  • monitoring and management capabilities of
    underlying sensor networks
  • For outside users
  • access to observatory data and results of models,
    approachable to non-scientists.

15
[Figure: REAP sensor network diagram. Labels: key (sensor, battery, data logger, radio with antenna); vegetation plots with CR800 datalogger, linear light probes (A, B), reflectometers (A, B), rain gauge, anemometer, RH/temperature probe, quantum point light sensor; relay station (1 km); Internet Point of Presence (IPP); RBNB; Internet; OPeNDAP, EcoGrid, Metacat; public website.]
Slide Matt Jones
16
Ring Buffered Network Bus (RBNB) streaming data
component
  • Scientists can discover data streams
  • accessing streams requires little IT knowledge
  • Can easily assimilate streams
  • into existing or new workflow models

Slide Matt Jones
17
Modeling Disease Effects on Competition
  • Discrete Time Model
  • Survival between seasons
  • Reset of system: loss of disease
  • Continuous Time Model
  • Growing (Winter Rainy) Season
  • Ongoing infection processes (SI model)
  • Competition (Lotka-Volterra)
  • Integro-Difference Equations
  • Parameterized with data from field experiments
  • Can utilize coupled models (aka hybrid models)
  • Continuous time model that is coupled to a
    discrete time model
  • Each model developed independently, joined via
    the workflow engine
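
The coupling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the REAP model: the continuous part is a plain Lotka-Volterra competition system integrated with forward Euler over one growing season, the discrete part is a between-season survival map, and the infection (SI) process is left out. All parameter values and function names are hypothetical.

```python
def competition_rhs(n1, n2, r1, r2, k1, k2, a12, a21):
    """Lotka-Volterra competition: growth rates of two species."""
    dn1 = r1 * n1 * (1.0 - (n1 + a12 * n2) / k1)
    dn2 = r2 * n2 * (1.0 - (n2 + a21 * n1) / k2)
    return dn1, dn2


def growing_season(n1, n2, days=100, dt=0.1, **params):
    """Continuous-time model: forward-Euler integration over one season."""
    steps = int(round(days / dt))
    for _ in range(steps):
        dn1, dn2 = competition_rhs(n1, n2, **params)
        n1 += dt * dn1
        n2 += dt * dn2
    return n1, n2


def between_seasons(n1, n2, s1=0.5, s2=0.3):
    """Discrete-time model: only a fraction survives to the next season."""
    return s1 * n1, s2 * n2


def simulate(years, n1=10.0, n2=10.0):
    """Alternate the two independently written models, one pair per year."""
    # Hypothetical parameters, chosen only so the sketch runs stably.
    params = dict(r1=0.1, r2=0.08, k1=100.0, k2=80.0, a12=0.6, a21=0.7)
    history = [(n1, n2)]
    for _ in range(years):
        n1, n2 = growing_season(n1, n2, **params)
        n1, n2 = between_seasons(n1, n2)
        history.append((n1, n2))
    return history
```

Here `simulate` plays the role the slide assigns to the workflow engine: it knows nothing about either model beyond its inputs and outputs.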

Slide Matt Jones
18
Models of Computations (Directors) in Kepler
  • Continuous time
  • Lotka-Volterra predator-prey dynamics
  • Synchronize on a global clock
  • Synchronous Data Flow
  • Sensor data access, analysis
  • Static dependency analysis,
  • fixed data flow rate

Director controls Model of Computation (MoC)
Slide Matt Jones
19
Requirements of REAP use cases
  • All features from the ENM, plus
  • Design phase
  • Access to sensor data streams via catalog
  • RBNB and Antelope support
  • Bi-directional communication, monitor and control
    sensors
  • Execution phase
  • Support hybrid models
  • Population and community dynamics mix discrete
    and continuous time models
  • Provenance
  • archive modeling scenarios
  • support exploratory modeling

Slide Matt Jones
20
Discovery Streaming Workflows
  • Typical analytical models are complex and
    difficult to comprehend and maintain
  • Use cases described here are only two of many
    overlapping cases
  • Scientific workflows provide
  • An intuitive visual model
  • Structure and efficiency (user-time) in modeling
    and analysis
  • Abstractions to help deal with complexity
  • Direct access to data
  • Means to publish and share models

Slide Matt Jones
21
Plumbing Workflows Fusion Simulation (SDM
CPES)
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command Control site
22
Plumbing Workflows Archive Migration
Stage data files from NERSC HPSS to local disk
transfer to ORNL disk store at ORNL HPSS
Moved 10TB of data from NERSC archive to ORNL
archive in 11 days (network issues, bugs, and
more)
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
23
  • Plumbing workflow
  • to accomplish all these tasks
  • 50 composite actors (subworkflows)
  • 4 levels of hierarchy
  • 1000 atomic (Java) actors

Norbert Podhorszki UC Davis, soon ORNL
24
Summary a broad range of workflow types
  • Desktop / discovery workflows
  • analysis/method-intensive, R, Matlab, custom
    algorithms
  • e.g. bioinformatics, ecoinformatics, genomics,
    phylogenetics
  • exploratory workflow, rapidly evolving
  • need data & workflow provenance
  • Streaming workflows
  • (near) real-time processing and data analysis
  • distributed setting
  • Plumbing workflows
  • data-intensive, e.g. moving TBs from
    ORNL (compute) to LBL/NERSC (archive)
  • Production workflow: reliable, fault-tolerant,
    high-throughput, runtime monitoring
  • HPC workflows
  • CPU-intensive, need to utilize a local cluster
    or a distributed Grid, e.g. Ecological Niche
    Modeling, parameter studies
  • Parallel/distributed workflow

25
Workflow Demos
26
Bioinformatics Web Service
  • Retrieve genetic sequence from DNA Data Bank of
    Japan (DDBJ).
  • Data transformations via XSLT and XPath.

27
Bioinformatics Web Service Access
28
REAP Data Streaming
29
30
Transfer-Convert-Archive-Image-Workflow
31
Basic Kepler Features
32
Kepler is a Scientific Workflow System
http://www.kepler-project.org
  • Kepler is a cross-project collaboration
  • Latest release available from the website
  • Builds upon the open-source Ptolemy II framework

33
Kepler Communities Collaboration
  • Open-source
  • Builds on Ptolemy II from UC Berkeley
  • Contributors from
  • SEEK
  • SciDAC SDM
  • Ptolemy
  • GEON
  • ROADNet
  • Resurgence
  • AToL CIPRES, POD
  • Goals
  • Create powerful analytical tools that are useful
    across disciplines
  • Ecology, Biology, Engineering, Geology, Physics,
    Chemistry, Astronomy,

Ptolemy II
34
Vergil is the GUI for Kepler
but Kepler can also run in batch mode as a
command-line engine.
Data Search
Actor Search
Actor ontology (semantic search): Search → Drag &
drop → Link via ports
Metadata-based search for datasets
35
Actor-Oriented Modeling Design
  • Actor
  • single component or task
  • well-defined interface (signature)
  • given input data, produces output data

36
Actor-Oriented Modeling Ports
  • Ports
  • each actor has a set of input and output ports
  • denote the actor's signature
  • produce/consume data (a.k.a. tokens)
  • Parameters
  • (visible after double-click) can be seen as
    special static ports

37
Actor-Oriented Modeling Connections / Channels
  • Dataflow Connections
  • actor communication channels
  • directed (hyper) edges
  • connect output ports with input ports
  • can fork (cloning tokens) at relation nodes
    (little diamonds)

38
Actor-Oriented Modeling Subworkflows
  • Sub-workflows / Composite Actors
  • composite actors wrap sub-workflows
  • like actors, have signatures (i/o ports of
    sub-workflow)
  • hierarchical workflows (arbitrary nesting levels)

39
Actor-Oriented Modeling Directors
  • Directors
  • define the Model of Computation (MoC) of workflow
    graphs
  • executes workflow graph (some schedule)
  • sub-workflows may have different directors
  • Facilitates actor (sub-)workflow reusability
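
The actor, port, connection and director concepts from the last few slides can be illustrated with a small Python sketch. It is not Kepler or Ptolemy II code (those are Java); the class names, the data-driven firing rule and the `Upper` actor are assumptions made for this example only.

```python
from collections import deque


class Actor:
    """A task with named input ports (token queues) and output ports."""

    def __init__(self, name, func, inputs=(), outputs=()):
        self.name = name
        self.func = func  # maps input tokens to a tuple of output tokens
        self.queues = {port: deque() for port in inputs}
        self.links = {port: [] for port in outputs}  # port -> [(actor, in_port)]

    def connect(self, out_port, target, in_port):
        self.links[out_port].append((target, in_port))

    def can_fire(self):
        # True when every input port holds at least one token.
        return all(self.queues.values())

    def fire(self):
        tokens = [q.popleft() for q in self.queues.values()]
        results = self.func(*tokens)
        for out_port, token in zip(self.links, results):
            # Fork: every connection on this port receives the token.
            for target, in_port in self.links[out_port]:
                target.queues[in_port].append(token)


class Director:
    """Decides when actors fire: sources once, then purely data-driven."""

    def __init__(self, actors):
        self.actors = actors

    def run(self):
        for actor in self.actors:
            if not actor.queues:  # source actor: no input ports
                actor.fire()
        progress = True
        while progress:
            progress = False
            for actor in self.actors:
                if actor.queues and actor.can_fire():
                    actor.fire()
                    progress = True


def hello_world():
    shown = []
    constant = Actor("Constant", lambda: ("Hello World",), outputs=("output",))
    upper = Actor("Upper", lambda s: (s.upper(),),
                  inputs=("input",), outputs=("output",))
    display = Actor("Display", lambda s: shown.append(s) or (),
                    inputs=("input",))
    constant.connect("output", upper, "input")
    constant.connect("output", display, "input")  # forked connection
    upper.connect("output", display, "input")
    Director([constant, upper, display]).run()
    return shown
```

The actors never call each other directly; swapping in a different `Director` would change the execution order without touching any actor, which is the reuse argument made on the slide.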

40
Models of Computation
  • Directors separate the concerns of WF
    orchestration from Actor execution
  • Synchronous Dataflow (SDF)
  • Connections have queues for sending/receiving
    fixed numbers of tokens at each firing. Schedule
    is statically predetermined. SDF models are
    highly analyzable and used often in SWFs.
  • Downside: need to know token consumption/production
    rate ahead of time
  • Process Networks (PN)
  • Generalizes SDF. Actors execute as a separate
    thread/process, with queues of (in principle)
    unbounded size. Closely related to Kahn/MacQueen
    semantics.
  • Continuous Time (CT)
  • Connections represent the value of a continuous
    time signal at some point in time ... Often used
    to model physical processes.
  • Discrete Event (DE)
  • Actors communicate through a queue of events in
    time. Used for instantaneous reactions in
    physical systems.

41
Searching Components (Actors)
  • Kepler Actor Ontology (tags hierarchy)
  • Used in searching actors and creating conceptual
    views (virtual folders)
  • currently > 370 actors

APAC07/Kepler Tutorial/V1
SC07/Kepler Tutorial/V8bc/Nov-07
42
Searching Binding Data
  • Kepler DataGrid
  • Discovery of data resources through local and
    remote services
  • SRB,
  • Grid and Web Services,
  • DB connections
  • Registry of datasets on the fly using workflows

43
Hands-On Exercises 1
44
Opening and Running a Workflow
  • Start Kepler
  • Open the HelloWorld.xml under the demos/sc07
    directory in your local Kepler folder
  • Two options to run a workflow
  • PLAY BUTTON in the toolbar
  • RUNTIME WINDOW from the run menu

45
Modifying an Existing Workflow Saving It
  • GOAL
  • Modify the HelloWorld workflow to display a
    parameter-based message
  • Step-by-step instructions
  • Open the HelloWorld workflow as before
  • From actors search tab, search for Parameter
  • Drag and drop the parameter to the workflow
    canvas on the right
  • Double click the parameter and type your name
  • Right click the parameter and select "Customize
    Name", type in "name".
  • Double click the Constant actor and type the
    following
  • Hello name
  • Save
  • Run the workflow

46
Creating a HelloWorld! Workflow
  • Open a new blank workflow canvas
  • From toolbar: File → New Workflow → Blank
  • In the Components tab, search for Constant and
    select the Constant actor.
  • Drag the Constant actor onto the Workflow canvas
  • Configure the Constant actor
  • Right-click the actor and select Configure
    Actor from the menu
  • Or, double click the actor
  • Type Hello World in the value field and click
    Commit
  • In the Components and Data Access area, search
    for Display and select the Display actor found
    under Textual Output.
  • Drag the Display actor to the Workflow canvas.
  • Connect the output port of the Constant actor to
    the input port of the Display actor.
  • In the Components and Data Access area, select
    the Components tab, then navigate to the
    /Components/Director/ directory.
  • Drag the SDF Director to the top of the Workflow
    canvas.
  • Run the model

47
48
Using Various Displays
  • GOAL Use different graphical output actors.
  • Step-by-step instructions
  • Open the "03-ImageDisplay.xml" under the
    demos/getting-started directory in your local
    Kepler folder.
  • Run the workflow.
  • Search for "browser" in the components tab.
  • Drag and drop "Browser Display" onto the canvas.
  • Replace "ImageJ" with "Browser Display" (connect
    Image Converter output to "Browser Display"
    inputURL).
  • Run workflow again.
  • Replace "Browser Display" with a textual
    "Display".
  • Run workflow.

49
Advanced Kepler Features
50
Process Networks
  • The partial (or total linear) order implied by a
    DAG gives us a schedule for workflows for
    one-time tasks (jobs)
  • What about pipelined workflows on token streams?
  • Communicating processes with directed token flow
  • Dataflow Process Networks
  • communication: token stream between two
    processes
  • process: operations on tokens
  • host language: process description
  • coordination language: network description

process
process
token stream
channel
51
Kahn process networks (1974)
  • special class of process networks
  • stream is FIFO with unbounded capacity
  • process
  • destructive read (consumption) at process
    start,
  • non-destructive write (production) at process
    end,
  • blocking read: process only executes if data is
    available,
  • non-blocking write
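
A minimal Python sketch of these rules, using threads as processes and `queue.Queue` as the unbounded FIFO channel. The `DONE` end-of-stream sentinel and the two example processes are assumptions of this sketch, not part of Kahn's model.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def producer(out_channel, count):
    for i in range(count):
        out_channel.put(i)  # never blocks: the queue is unbounded
    out_channel.put(DONE)


def doubler(in_channel, out_channel):
    while True:
        token = in_channel.get()  # blocking, destructive read
        if token is DONE:
            out_channel.put(DONE)
            return
        out_channel.put(2 * token)


def run_network(count=5):
    a = queue.Queue()  # maxsize=0 means unbounded FIFO
    b = queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(a, count)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        token = b.get()
        if token is DONE:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results
```

Because reads block and each channel has a single writer and a single reader, the output stream is the same no matter how the threads are interleaved.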

EXAMPLE
52
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
53
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
54
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
55
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
56
Problems with Process Networks
  • How to run/schedule a process network without
    accumulating arbitrarily many tokens?
  • Difficult to schedule because of need to balance
    relative process rates
  • System inherently gives the scheduler few hints
    about appropriate rates
  • Tom Parks Algorithm
  • runs in bounded memory whenever possible
  • (bounded memory condition is undecidable)
  • Synchronous Dataflow (SDF)
  • Edward Lee and David Messerschmitt, Berkeley,
    1987
  • Restricts Kahn Process Networks to allow
    compile-time scheduling
  • Basic idea: each process reads and writes a fixed
    number of tokens each time it fires. Example:
  • Loop forever
  • read 2 tokens from A, 3 tokens from B
  • compute
  • write 1 token to C, 2 tokens to D
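
The loop in this example can be written out as a small Python sketch. The channel names A, B, C, D and the rates come from the slide; the `compute` callback and the "fire until blocked" driver are assumptions added for illustration.

```python
from collections import deque


def fire(a, b, c, d, compute):
    """Fire the actor once if enough tokens are present; report success."""
    if len(a) < 2 or len(b) < 3:
        return False
    xs = [a.popleft() for _ in range(2)]  # read 2 tokens from A
    ys = [b.popleft() for _ in range(3)]  # read 3 tokens from B
    one, two = compute(xs, ys)            # compute
    c.append(one)                         # write 1 token to C
    d.extend(two)                         # write 2 tokens to D
    return True


def run_until_blocked(a, b, compute):
    """Keep firing while the fixed consumption rates can be satisfied."""
    c, d = deque(), deque()
    firings = 0
    while fire(a, b, c, d, compute):
        firings += 1
    return firings, list(c), list(d)
```

Since the rates are fixed, the number of firings depends only on how many tokens are queued, never on their values; that is what makes a compile-time schedule possible.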

57
Synchronous Dataflow (SDF): Fixed
Production/Consumption Rates
  • Balance equations (one for each channel)
  • Schedulable statically
  • Decidable
  • buffer memory requirements
  • deadlock

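[Figure: actor A (fire A: produce N) connected by a channel to actor B (fire B: consume M). Balance equation for the channel: $f_A N = f_B M$, where $f_A$ and $f_B$ are the numbers of firings per iteration, $N$ is the number of tokens produced and $M$ the number of tokens consumed.]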
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
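
To make the "schedulable statically" claim concrete, here is a hedged Python sketch that solves the balance equations (firings of the producer times tokens produced equals firings of the consumer times tokens consumed, one equation per channel) for the smallest positive integer number of firings per actor. The function name and the channel tuple layout are assumptions, and the sketch only handles a connected graph.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(channels):
    """channels: list of (producer, n_produced, consumer, n_consumed).

    Returns the smallest positive integer firings per actor, or None if
    the balance equations are inconsistent. Assumes a connected graph.
    """
    neighbours = {}
    for src, n, dst, m in channels:
        # firings[src] * n == firings[dst] * m
        neighbours.setdefault(src, []).append((dst, Fraction(n, m)))
        neighbours.setdefault(dst, []).append((src, Fraction(m, n)))

    start = next(iter(neighbours))
    rate = {start: Fraction(1)}
    stack = [start]
    while stack:
        actor = stack.pop()
        for other, ratio in neighbours[actor]:
            expected = rate[actor] * ratio
            if other not in rate:
                rate[other] = expected
                stack.append(other)
            elif rate[other] != expected:
                return None  # no non-trivial solution exists

    # Scale by the least common multiple of the denominators.
    scale = 1
    for value in rate.values():
        scale = scale * value.denominator // gcd(scale, value.denominator)
    return {actor: int(value * scale) for actor, value in rate.items()}
```

For example, if A produces 2 tokens per firing and B consumes 3, A must fire 3 times for every 2 firings of B. An inconsistent graph (returning `None` here) is one that cannot run forever in bounded memory.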
58
Parallel Scheduling of SDF Models
Many scheduling optimization problems can be
formulated. Some can be solved, too!
SDF is suitable for automated mapping onto
parallel processors and synthesis of parallel
circuits.
[Figure: actors A, B, C, D shown in a sequential schedule and in a parallel schedule.]
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
59
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
60
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
61
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
62
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
63
Selected Generalizations
  • Multidimensional Synchronous Dataflow (1993)
  • Arcs carry multidimensional streams
  • One balance equation per dimension per arc
  • Cyclo-Static Dataflow (Lauwereins, et al., 1994)
  • Periodically varying production/consumption rates
  • Boolean Integer Dataflow (1993/4)
  • Balance equations are solved symbolically
  • Permits data-dependent routing of tokens
  • Heuristic-based scheduling (undecidable)
  • Dynamic Dataflow (1981-)
  • Firings scheduled at run time
  • Challenge maintain bounded memory, deadlock
    freedom, liveness
  • Demand driven, data driven, and fair policies all
    fail
  • Kahn Process Networks (1974-)
  • Replace discrete firings with process suspension
  • Challenge maintain bounded memory, deadlock
    freedom, liveness
  • Heterochronous Dataflow (1997)
  • Combines state machines with SDF graphs
  • Very expressive, yet decidable

Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
64
(Internal) Workflow Format MoML
65
Sharing Kepler Workflows -- Use Cases
  • UC-1) Facilitate transport of workflows to
    grid/distributed/server/p2p systems
  • UC-2) Preserve an analysis to allow replication
  • UC-3) Allow the development and distribution of
    components (actors/directors) which can be
    released on a schedule independently from Kepler
    itself.

66
Kepler Archive File (KAR)
67
KAR File Functional Requirements
  • FR-1) Mechanism to package resources required to
    implement a component in kepler system.
  • FR-1a) must be able to contain java class files
  • FR-1b) must be able to contain native binary
    executable files
  • FR-1c) must be able to contain native library
    files
  • FR-1d) must be able to contain MoML and other XML
    based text
  • FR-1e) must be able to contain data in binary and
    ascii formats including zipped data.
  • FR-2) Must describe the contained components so
    they can be utilized in a Kepler system.
  • FR-2a) each component must have a unique LSID
    identifier which is tied to the specific
    implementation of the component.
  • FR-2b) must contain an OWL document with semantic
    ordering for the contained objects
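
As a rough illustration of requirements FR-1 and FR-2a, the Python sketch below packages arbitrary resources into one zip-based archive together with a manifest that maps an identifier to each entry. The manifest file name, its line format and the example LSID strings are invented for this sketch; they are not the actual KAR layout, and the OWL document from FR-2b is not modelled.

```python
import zipfile


def build_archive(path, components):
    """components: {lsid: (entry_name, payload_bytes)}.

    Writes every payload plus a plain-text manifest that maps each
    identifier to its entry. Dictionary keys keep the identifiers unique.
    """
    manifest_lines = []
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        for lsid, (entry_name, payload) in components.items():
            archive.writestr(entry_name, payload)
            manifest_lines.append(f"{lsid} {entry_name}")
        # Hypothetical manifest name and format, for illustration only.
        archive.writestr("MANIFEST.txt", "\n".join(manifest_lines) + "\n")


def read_manifest(path):
    """Return {lsid: entry_name} for an archive written by build_archive."""
    with zipfile.ZipFile(path) as archive:
        text = archive.read("MANIFEST.txt").decode("utf-8")
    return dict(line.split(" ", 1) for line in text.splitlines())
```

A consumer would read the manifest first and then extract only the entries it needs, whether they are class files, native binaries, MoML text or data.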

68
The need for Plumbing WorkflowsTales from the
life of a simulation scientist
69
A few days in the life of Sim Scientist. Day 1,
morning.
  • 8:00AM Get coffee, check to see if job is
    running.
  • Ssh into jaguar.ccs.ornl.gov (job 1).
  • Ssh into seaborg.nersc.gov (job 2) (this is
    running, yea!).
  • Run gnuplot to see if run is going ok on seaborg.
    This looks ok.
  • 9:00AM Look at data from old run for post
    processing.
  • Legacy code (IDL, Matlab) to analyze most data.
  • Visualize some of the data to see if there is
    anything interesting.
  • Is my job running on jaguar? I submitted this 4K
    processor job 2 days ago!
  • 10:00AM scp some files from seaborg to my local
    cluster.
  • Luckily I only have 10 files (which are only 1
    GB/file).
  • 10:30AM First file appears on my local machine
    for analysis.
  • Visualize data with Matlab. Seems to be ok.
  • 11:30AM See that the second file had trouble
    coming over.
  • Scp the files over again. Dohhh!

Slide Scott Klasky
70
A few days in the life of Sim Scientist. Day 1,
evening.
  • 1:00PM Look at the output from the second file.
  • Oops, I had a mistake in my input parameters.
  • Ssh into seaborg, kill job. Emacs the input,
    submit job.
  • Ssh into jaguar, see status. Cool, it's running.
  • bbcp 2 files over to my local machine (8
    GB/file).
  • Gnuplot data. This looks ok too, but still need
    to see more information.
  • 1:30PM Files are on my cluster.
  • Run Matlab on hdf5 output files. Looks good.
  • Write down some information in my notebook about
    the run.
  • Visualize some of the data. All looks good.
  • Go to meetings.
  • 4:00PM Return from meetings.
  • Ssh into jaguar. Run gnuplot. Still looks good.
  • Ssh into seaborg. My job still isn't running.
  • 8:00PM Are my jobs running?
  • Ssh into jaguar. Run gnuplot. Still looks good.
  • Ssh into seaborg. Cool. My job is running. Run
    gnuplot. Looks good this time!

Slide Scott Klasky
71
And Later
  • 4:00AM (yawn) Is my job on jaguar done?
  • Ssh into jaguar. Cool. Job is finished.
  • Start bbcp of files over to my work machine
    (2 TB of data).
  • 8:00AM Bbcp is having troubles.
  • Resubmit some of my bbcp from jaguar to my local
    cluster.
  • 8:00AM (next day).
  • Still need to get the rest of my 200GB of data
    over to my machine.
  • 3:00PM My data is finally here!
  • Run Matlab. Run Ensight. Oops. Something's
    wrong!!! Where did that instability come
    from?
  • 6:00PM Finish screaming!

Slide Scott Klasky
72
And 2 years from now.
  • Simulations/computers are getting larger and
    more expensive to operate.
  • In Fusion, large runs will be using >50K cores /
    100 wallclock hours, to understand turbulent
    transport in ITER-size reactors.
  • The cost of a simulation approaches 0.6M (power,
    cooling, system cost averaged over 5 years).
  • Data Sizes are getting larger.
  • Large simulations produce 2 TB/simulation
    (today), 100TB/simulation(week) in the future.
  • Demand for real-time monitoring/analysis of
    simulations.
  • Demand for fast-reliable data movement to local
    machines for post processing.
  • Demand to keep data provenance at 1 location.

Slide Scott Klasky
73
Workflows to the rescue!
  • In our demo section (SC07 tutorial only) you
    will see us automate this process.
  • Job submission starts services on ORNL IB cluster
    (ewok).
  • Files are automatically moved from Cray XT3 to
    ORNL IB cluster.
  • Files are converted from binary to hdf5 files.
  • Files accumulate until they exceed 6 GB. Then they
    are tarred.
  • Files use hsi commands to place tar files into
    HPSS. (xml file describes which files are in
    which tar files).
  • Hdf5 file is read into SciRun service which
    creates a jpeg.
  • Jpeg files create an mpeg file via a mpeg
    service.
  • Jpeg and mpeg files are moved to web portal.
  • Hdf5 files are archived to PPPL.
  • And of course we will keep track of the
    provenance of the workflow in a database!
  • And we can monitor this on our dashboard.
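
The "accumulate, then tar" step in this list can be sketched in Python as follows. This is an assumption-laden illustration, not the demo workflow: function names are hypothetical, the threshold is a parameter (the slide uses 6 GB), and the transfer to HPSS with hsi is not shown.

```python
import os
import tarfile


def batch_by_size(paths, threshold_bytes):
    """Group files in arrival order; close a batch once it exceeds the threshold.

    Returns (closed_batches, still_accumulating).
    """
    batches, current, size = [], [], 0
    for path in paths:
        current.append(path)
        size += os.path.getsize(path)
        if size > threshold_bytes:
            batches.append(current)
            current, size = [], 0
    return batches, current


def write_tar(batch, tar_path):
    """Write one batch into an uncompressed tar file and return its path."""
    with tarfile.open(tar_path, "w") as tar:
        for path in batch:
            tar.add(path, arcname=os.path.basename(path))
    return tar_path
```

The list of batches is also the information needed for the index that records which file ended up in which tar file.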

Slide Scott Klasky
74
Why do Pflop computing scientists care?
  • Typical situation for Sim Scientist.
  • We run on 1-60K processors, producing lots of
    data.
  • Typical method of work.
  • Prepare input data for smaller simulations.
  • Iterate until we come up with the correct
    parameters for the large run.
  • Run the large simulation only at a handful of
    locations (usually < 4).
  • Must Archive results. Must be of the correct size
    archives on HPSS.
  • Must move some data over to our local clusters
    for analysis after the simulation.
  • Did we make a mistake with the input parameters?
    Is something going wrong? Fix the code/input
    start the run over again.
  • Wow, I just wasted 100K CPU hours because I
    missed a sign. Duhh.
  • Where are all of my files? I want to look at the
    temperature in the 200 time slice, where is it on
    HPSS.

Slide Scott Klasky
75
Post Processing Workflow A day in the life of
Sim Scientist
  • 9:00AM Get Diet Coke, decide which runs/
    experimental data to analyze.
  • 9:30AM Start to download files from HPSS from
    NERSC and ORNL.
  • 10:00AM Move files from NERSC/ORNL to local
    desktop machine. Smallish data (10GB/location).
  • 11:00AM Start IDL, and compute various post
    processing quantities.
  • 11:30AM Look at the data from the simulations,
    and grab data from a database which has
    experimental data.
  • 1:00PM Move some more data from ORNL to local
    desktop to compare to more experimental data.
  • Save the plot from Matlab to Postscript to
    include in paper.
  • Write down results into notebook, copy figure
    into notebook.
  • 2:00PM Think about results, and decide on new
    analysis routines to write in the future.
  • 4:00PM Start moving more data from NERSC to local
    desktop.

Slide Scott Klasky
76
What's changing in his life?
  • Collaboration.
  • More clusters, more simulations.
  • Just analyze the data where we run. Don't move
    the data.
  • But
  • What if the network goes down (I'm on a plane)?
  • What if the resource is not available for my
    late-breaking analysis before the BIG conference?
  • OK, but what about the large data?
  • OK. Large data will be server-side analysis. Not
    DESKTOP. But can run workflow on a server.
  • Data from multiple resources
  • V&V data from multiple simulations/experiments.
  • But can't we just run VISIT/SciRun?
  • Yes. But if you need to orchestrate the data
    movement from different sources, track the
    provenance, and perhaps use multiple
    analysis/visualization packages, then a workflow
    system can help.

Slide Scott Klasky
77
How do we help this scientist?
  • The workflow is the glue for the scientists.
  • The scientist hooks up all of the analysis
    routines.
  • The director makes sure that the data movement
    occurs, and is reliable, and secure.
  • All of the tedious portions of ssh, start this
    program, is removed by the workflow automation.
  • The workflow will be able to keep the provenance
    information which allows the user to understand
    how they processed the dataset.
  • This enables the scientist to compare new data
    with old data.

Slide Scott Klasky
78
So what are the requirements?
  • Must be EASY to use.
  • If you need a manual, then FORGET IT!
  • Good user support, and long-term DOE support.
  • The workflow should work for all of my workflows.
  • NOT just for the Petascale computers.
  • And on multiple platforms!
  • Must be easy to incorporate my own services into
    the workflow.
  • Must be customizable by the users.
  • Users need to easily change the workflow to work
    with the way users work.
  • Long-term requirements. NOT being worked on
    yet.
  • Autonomics/ User Adaptivity.
  • Faster data movement in the workflow? High
    Quality front-end for the end-user interaction.
  • You tell us!

Slide Scott Klasky
79
SWF Systems Requirements
  • Design tools-- especially for non-expert users
  • Ease of use-- fairly simple user interface having
    more complex features hidden in the background
  • Reusable generic features
  • Generic enough to serve different communities
    but specific enough to serve one domain (e.g.
    geosciences) → customizable
  • Extensibility for the expert user
  • Registration, publication & provenance of data
    products and process products (workflows)
  • Dynamic plug-in of data and processes from
    registries/repositories
  • Distributed WF execution (e.g. Web and Grid
    awareness)
  • Semantics awareness
  • WF Deployment
  • as a web site, as a web service, Power apps.

Slide Scott Klasky
80
The Big Picture Supporting the Scientist
From Napkin Drawings
to Executable
Workflows
Conceptual SWF
Executable SWF
Here: John Blondin, NC State Astrophysics, Terascale
Supernova Initiative, SciDAC, DOE
Slide M. Vouk
81
CPES Fusion Simulation Workflow
  • Fusion Simulation Codes (a) GTC (b) XGC with
    M3D
  • e.g. (a) currently 4,800 (soon 9,600) nodes Cray
    XT3 9.6TB RAM 1.5TB simulation data/run
  • GOAL
  • automate remote simulation job submission
  • continuous file movement to secondary analysis
    cluster for dynamic visualization simulation
    control
  • with runtime-configurable observables

Submit FileMover Job
Submit Simulation Job
Execution Log (=> Data Provenance)
Select JobMgr
Overall architect (& prototypical user): Scott
Klasky (ORNL). WF design & implementation:
Norbert Podhorszki (UC Davis)
82
CPES Analysis Workflow
  • Concurrent analysis pipeline (at Analysis Cluster)
  • convert → analyze → copy-to-Web-portal
  • easy configuration, re-purposing

Reusable Actor Class
Specialized Actor Instances
SpecializeActor instances
SpecializeActor instances
Pipelined Execution Model
Inline Documentation
Inline Display
Easy-to-edit Parameter Settings
Overall architect (& prototypical user): Scott
Klasky (ORNL). WF design & implementation:
Norbert Podhorszki (UC Davis)
83
Dashboard integration with Kepler
  • Dashboard presents information created from the
    workflow.
  • We have been developing a dashboard for Kepler
    workflows.
  • AJAX.
  • FLASH.
  • PHP.
  • MySQL.

Slide SDM/SPA, Klasky,Vouk, et al
84
Machine Monitoring
  • DOE Machine monitoring
  • Which machines are up?
  • Which machines have long queues, which are idle?
  • Where can I run my job?
  • Where am I running jobs?
  • Where are my running jobs and can I look at my
    old runs?
  • Can I monitor a new job, and compare this to an
    old job?

Slide SDM/SPA, Klasky,Vouk, et al
85
Dashboards for Simulation Monitoring
  • Back end shell scripts, python scripts and PHP.
  • Machine queues command
  • Users personal information
  • Services to display and manipulate data before
    display
  • Dynamic Front end
  • Machine monitoring: standard web technology,
    Ajax
  • Simulation monitoring: Flash
  • Storage: MySQL (queue-info, min-max data, users'
    notes)

Slide SDM/SPA, Klasky,Vouk, et al
86
Scientific Workflow Systems
  • Combination of
  • Data management, integration, analysis, and
    visualization steps
  • Larger, automated "scientific process"
  • Mission of scientific workflow systems
  • Promote scientific discovery by providing tools
    and methods to generate scientific workflows
  • Provide an extensible and customizable graphical
    user interface for scientists from different
    scientific domains
  • Support workflow design, execution, sharing,
    reuse and provenance
  • Design frameworks which define efficient ways to
    connect to the existing data and integrate
    heterogeneous data from multiple resources
  • Make technology useful through the user's computer!

87
Two typical types of Workflows for SC
  • Real-time Monitoring (Server-Side Workflows)
  • Job submission.
  • File movement.
  • Launch Analysis Services.
  • Launch Visualization Services.
  • Launch Automatic Archiving.
  • Post Processing (Desktop Workflows).
  • Read in Files from different locations.
  • File movement.
  • Launch Analysis Services.
  • Launch Visualization Services.
  • Connect to Databases.
  • Obviously there are other types of workflows
  • What is your type of workflow?

88
Plumbing Workflow using Kepler
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command & Control site
89
Plumbing Workflow for Archive Migration
Stage from NERSC HPSS to local disk → transfer
to ORNL disk → store at ORNL HPSS
Moved 10 TB of data from the NERSC archive to the ORNL
archive in 11 days (network issues, bugs, and
more)
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
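As a back-of-the-envelope check on the migration figure above, the average sustained rate implied by 10 TB in 11 days can be computed as follows. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and a transfer running continuously for the full 11 days, and the function name is illustrative rather than part of the workflow.

```python
def average_rate_mb_per_s(terabytes: float, days: float) -> float:
    """Average transfer rate in MB/s, assuming decimal units (1 TB = 1e12 bytes)."""
    total_bytes = terabytes * 1e12
    total_seconds = days * 24 * 60 * 60
    return total_bytes / total_seconds / 1e6


rate = average_rate_mb_per_s(10, 11)
print(f"{rate:.1f} MB/s, about {rate * 8:.0f} Mbit/s")
```

Under these assumptions the average works out to roughly 10.5 MB/s (about 84 Mbit/s).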
90
Pipeline and parallel processing
Norbert Podhorszki (UC Davis)
91
  • Plumbing workflow
  • to accomplish all these tasks
  • 50 composite actors (subworkflows)
  • 4 levels of hierarchy
  • 1000 atomic (Java) actors

Norbert Podhorszki (UC Davis, soon ORNL)
92
Distributed Execution: Many ways to skin a cat
  • Do it all in Kepler (white-box)
  • Single machine: single-threaded and/or
    multi-threaded
  • Multiple nodes (cluster)
  • Distributed Kepler, Kepler/HPC
  • Medium-tightly coupled (grey-box)
  • use remote commands
  • and their exit status
  • Loosely coupled (black-box -- Norbert's
    workflows)
  • Launch remote scripts
  • Inquire about their status, e.g. via ls -1
  • Minimalist approach
  • works even w/ tough ORNL constraints!
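The loosely coupled (black-box) style can be sketched as "launch a job, then keep listing a directory until a marker file appears". The example below is a local stand-in, not the tutorial's actual workflow: the job is a child Python process instead of a remote script, os.listdir plays the role of a remote "ls -1", and the marker name job.done is made up for illustration.

```python
import os
import subprocess
import sys
import tempfile
import time


def launch_and_poll(workdir: str, marker: str = "job.done",
                    interval: float = 0.1, timeout: float = 10.0) -> bool:
    """Start a detached job, then poll the directory listing for a marker file."""
    # Stand-in for "launch remote script": a child process that writes the marker.
    marker_path = os.path.join(workdir, marker)
    job = f"import pathlib; pathlib.Path({marker_path!r}).write_text('ok')"
    proc = subprocess.Popen([sys.executable, "-c", job])
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            # Stand-in for a remote "ls -1": look only at the file names.
            if marker in os.listdir(workdir):
                return True
            time.sleep(interval)
        return False
    finally:
        proc.wait()


with tempfile.TemporaryDirectory() as d:
    finished = launch_and_poll(d)
    print("job finished:", finished)
```

The caller never inspects the job itself, only the files it leaves behind, which is why this style needs so little from the remote site.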

93
Authoring Distributed Workflows
Normal Workflow
Distributed Workflow
  • Place wf in a DistributedCompositeActor (DCA).
  • At runtime, the contents of the DCA are packaged
    up and shipped to the remote nodes.
  • The workflow is executed and the output is
    returned to the master Kepler node to be
    viewed/further processed.

Slide from C. Berkley
94
Node Discovery and Remote Management
Slide from C. Berkley
95
Efficient Data Transfer
  • Large datasets need special handling
  • Inefficient data transfer could wipe out time
    savings of distributed computation

Slave1 depends on Slave0; Slave2 depends on Slave1
[Figure: two layouts for moving a large dataset among Master, Slave0, Slave1, and Slave2; labels: Large Dataset, Results]
Inefficient (6 possible transfers)
More efficient (4 possible transfers)
Slide from C. Berkley
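The figure itself is not recoverable from the transcript. One reading that is consistent with the two counts, offered here purely as an assumption, is that the inefficient layout routes the dataset through the master for every slave, while the more efficient layout hands it from slave to slave. The function names below are illustrative.

```python
def transfers_via_master(slaves):
    """Each slave receives the dataset from the master and sends it back."""
    hops = []
    for s in slaves:
        hops.append(("Master", s))
        hops.append((s, "Master"))
    return hops


def transfers_direct(slaves):
    """The dataset passes from slave to slave; only the last hop returns to the master."""
    chain = ["Master", *slaves, "Master"]
    return list(zip(chain, chain[1:]))


slaves = ["Slave0", "Slave1", "Slave2"]
print(len(transfers_via_master(slaves)))
print(len(transfers_direct(slaves)))
```

With three chained slaves this gives 6 hops through the master versus 4 direct hops, matching the counts on the slide.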
96
A Hierarchical View of the Architecture
Control Plane (light data flows)
Provenance, Tracking & Meta-Data (DBs and
Portals)
Execution Plane ("Heavy Lifting"
Computations and flows)
Synchronous or Asynchronous?
97
Scientific Workflow Automation (e.g.,
Astrophysics)
In conjunction with John Blondin, NC State University
Automate data acquisition, transfer and
visualization of a large-scale simulation at ORNL
[Figure labels: Logistic Network (L-Bone or bbcp); Aggregate to 500 files (< 50 GB each); Input Data; Local Mass Storage (14 TB); VH1; Depot; HPSS archive; Local 44 Proc. Data Cluster (data sits on local nodes for weeks); Provenance; Highly Parallel Compute; Output 500x500 files; Web; Viz Software; Viz Wall; Viz Client]
98
Scientific Workflow Modeling & Design
And that's why our scientific workflows are
much easier to develop, understand, reuse and
maintain!
99
Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
100
Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
101
But how do we get from messy to neat, reusable
designs?
102
The Problem: Evolving Workflows
Daniel Zinn (UC Davis)
103
What we want: Simple Analysis Pipelines
Author: Tim McPhillips, UC Davis
104
The Answer (YMMV)
  • Collection-Oriented Modeling & Design (COMAD)
  • embrace the assembly line metaphor fully
  • → Virtual Assembly Lines (VALs)
  • → cf. Flow-based Programming (J. Morrison)
  • data = tagged nested collections
  • pipelined (XML) token streams
  • passing the buck on what's not in your scope

Timothy McPhillips (UC Davis)
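The assembly-line idea can be illustrated with a toy token stream. This is a sketch of the general pattern only, not COMAD's actual API: the token encoding ("open", "close", "data") and the function scoped_actor are made up for illustration. The actor works on collections carrying its scope tag and passes everything else downstream untouched.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Token = Tuple[str, object]  # ("open", tag), ("close", tag), or ("data", value)


def scoped_actor(tokens: Iterable[Token], scope: str,
                 fn: Callable[[List[object]], object]) -> Iterator[Token]:
    """Apply fn to the data items of each collection tagged `scope`;
    forward every other token unchanged."""
    depth = 0                      # > 0 while inside an in-scope collection
    buffered: List[object] = []
    for kind, value in tokens:
        if kind == "open" and value == scope and depth == 0:
            depth, buffered = 1, []
            yield (kind, value)
        elif depth and kind == "open":
            depth += 1             # nested collection inside the scope
            yield (kind, value)
        elif depth and kind == "close":
            depth -= 1
            if depth == 0:
                # Add the derived item just before the scope closes.
                yield ("data", fn(buffered))
            yield (kind, value)
        else:
            if depth == 1 and kind == "data":
                buffered.append(value)
            yield (kind, value)


stream = [
    ("open", "Project"),
    ("open", "Temps"), ("data", 10.0), ("data", 14.0), ("close", "Temps"),
    ("open", "Notes"), ("data", "calibrated"), ("close", "Notes"),
    ("close", "Project"),
]
out = list(scoped_actor(stream, "Temps", lambda xs: sum(xs) / len(xs)))
for token in out:
    print(token)
```

The "Notes" collection is outside the actor's scope, so it flows through unchanged; that is the "passing the buck" behavior that removes the need for shims between actors.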
105
Conventional vs Assembly Line Delta-XML
Thinking
Daniel Zinn (UC Davis)
106
More secret sauce: User vs. Optimized Dataflow
Daniel Zinn (UC Davis)
107
What we got: Simple, Change-Resilient Pipelines
Author: Tim McPhillips, UC Davis
108
Result: Change-Resilience (Wf graph)
[Figure: original workflow graph and its automatically configured counterpart; labels: X, A, B, C, S, R, W; "Original"; "Automatic Configuration"; "Infer Configuration X of X"]
Daniel Zinn (UC Davis)
109
Related: Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)
110
Scientific Workflow Modeling & Design Paradigms
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-oriented Modeling & Design framework
    (COMAD)
  • Look Ma: No Shims!
  • also running: DAGs, Petri Nets, easyBPEL, ...

111
Hands-On Exercises 2
112
Using R in Kepler
  • GOAL: Use the R actor to generate a histogram plot.
  • Step-by-step instructions
  • In the demos/getting-started directory, open
    05-LinearRegression.xml.
  • Run the workflow to view the linear regression.
  • Add another RExpression actor to the canvas.
  • Double-click on the new R actor and enter the
    following for "R function or script":
  • Mean <- mean(Values)
  • hist(Values)
  • Right-click on the new R actor and select
    Configure Ports
  • Add an input called Values
  • Add an output called Mean
  • Control-click on the canvas to create a new
    Relation diamond
  • Connect the T_AIR port from Datos to the diamond.
  • Connect the T_AIR port from R_linear_regression
    to the diamond.
  • Connect the Values port from the new R actor to
    the diamond.
  • Place a second ImageJ actor on the canvas (you can
    copy and paste the existing one).
  • Connect the R actor's graphicsFileName port to the
    second ImageJ actor's input.
  • Run the workflow.

SC07/Kepler Tutorial/V8bc/Nov-07
113
Creating Web Service Workflows
  • GOAL: Execute a Web Service using the generic
    Web Service client.
  • Step-by-step instructions
  • In the Components and Data Access area, select
    the Components tab.
  • Search for "web service".
  • Drag "Web Service Actor" onto the canvas.
  • Double-click the actor, enter
    http://xml.nig.ac.jp/wsdl/DDBJ.wsdl, commit.
  • Double-click the actor again, select
    "getXMLEntry" as method name, commit.
  • Search for "String Constant" in the Components
    tab. Drag and drop "String Constant" onto the
    workflow canvas.
  • Double-click the "String Constant", set AA045112
    as value, commit.
  • Connect the "String Constant" output with the "Web
    Service Actor" input.
  • Add a "Display" and connect its input with the
    "Web Service Actor" "Result" output.
  • Add the SDF director.
  • Run the workflow.

114
SSH Actor and Including Existing Scripts in a
Workflow
  • GOAL: Use the SSH actor to execute a command on a
    remote host.
  • Step-by-step instructions
  • Search for "ssh" in the Components tab in the left
    pane.
  • Drag "SSH To Execute" onto the canvas.
  • Double-click the actor
  • Type in a remote host you have access to.
  • Type in your username.
  • Search for "String Constant" in the Components
    tab. Drag and drop "String Constant" onto the
    workflow canvas.
  • Double-click the "String Constant", type "ls" and
    commit.
  • Connect the "String Constant" output with the "SSH To
    Execute" command input (lowest).
  • Add a "Display" and connect its input with the
    "SSH To Execute" stdout output (top).
  • Add the SDF director.
  • Run the workflow.
  • If you have a script deployed on the server, you
    can replace the "ls" command with a command that
    invokes the script.
  • e.g., perl tmp.pl
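Outside Kepler, the same "run a command on a remote host" step can be approximated from a script. The sketch below only builds the argument list for the standard ssh client and prints it; the user name and host are placeholders, and the actual call is left commented out because it needs a host you can log in to. It is not how the "SSH To Execute" actor is implemented.

```python
import shlex
from typing import List


def build_ssh_command(user: str, host: str, remote_command: str) -> List[str]:
    """Return the argv for running `remote_command` on `host` via the ssh client."""
    # The remote command is passed as one argument; ssh hands it to the remote shell.
    return ["ssh", f"{user}@{host}", remote_command]


# "alice" and "cluster.example.org" are placeholders, not real accounts or hosts.
argv = build_ssh_command("alice", "cluster.example.org", "perl tmp.pl")
print(shlex.join(argv))

# To actually run it (requires access to the host):
#   import subprocess
#   result = subprocess.run(argv, capture_output=True, text=True)
#   print(result.stdout)
```

Swapping "ls" for "perl tmp.pl" changes only the last argument, which mirrors replacing the String Constant value in the workflow.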

115
Using Relational Databases
  • GOAL: Access a geoscience database using a
    generic database actor
  • Step-by-step instructions
  • In the Components and Data Access area, select
    the Components tab
  • Search for "database"
  • Drag "Open Database Connection" and "Database
    Query" onto the canvas
  • Configure "Open Database Connection" with the
    following parameters
  • Database format: PostgreSQL
  • Database URL:
    jdbc:postgresql://geon17.sdsc.edu:5432/igneous
  • Username: readonly
  • Password: read0n1y
  • Connect the output of "Open Database Connection"
    with the dbcon input port of "Database Query"
  • Double-click to customize the actor
  • Query: SELECT * FROM IGROCKS.ModalData WHERE SSID =
    227
  • 227 for ssID
  • Add a Display actor (from the Components tab), connect
    ports, add the SDF director (as in the previous example)
  • Run the workflow
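The query step can be tried without access to the geoscience server. The sketch below uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the PostgreSQL database; the column names and rows are invented for illustration, and only the table name ModalData and the sample id 227 come from the exercise.

```python
import sqlite3

# In-memory stand-in for the remote table; column names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ModalData (SSID INTEGER, mineral TEXT, percent REAL)")
conn.executemany(
    "INSERT INTO ModalData VALUES (?, ?, ?)",
    [(227, "quartz", 31.5), (227, "biotite", 4.2), (301, "olivine", 12.0)],
)


def rows_for_sample(connection: sqlite3.Connection, ssid: int) -> list:
    """Select all rows for one sample id, binding the id as a parameter."""
    cursor = connection.execute("SELECT * FROM ModalData WHERE SSID = ?", (ssid,))
    return cursor.fetchall()


rows = rows_for_sample(conn, 227)
for row in rows:
    print(row)
conn.close()
```

Binding the id as a parameter instead of pasting it into the SQL string makes it easy to rerun the same query for a different sample.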

116
Provenance
  • Two different takes on it
  • Scientist's View (discovery workflows)
  • Engineer's View (plumbing workflows)

117
A Scientific Publication (the final
provenance frontier)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006);
doi:10.1038/nature05113; Received 27 June 2006;
Accepted 25 July 2006; Published online 16 August
2006
some metadata
118
More Evidence
data reference
type of evidence
tool reference
trust me on this one
  • provenance/data lineage shows the history and
    evidence
  • related to proof trees
  • unlike w/ scripts, a SWF system can keep track of
    what happened
  • In the future: deposit your data & workflows in a
    repository

119
Pipelined workflow for inferring phylogenetic
trees
Author: Tim McPhillips, UC Davis
120
Different Dependency Graphs
A Model for User-Oriented Data Provenance in
Pipelined Scientific Workflows. Shawn Bowers,
Timothy McPhillips, Bertram Ludäscher, Shirley
Cohen, Susan B. Davidson. International
Provenance and Annotation Workshop (IPAW'06),
Chicago, May 3-5, 2006.
121
Scientific Provenance: Questions we can ask
  • What DNA sequences were input to the workflow?
  • What phylogenetic trees were output by the
    workflow?
  • What DNA sequences input to the workflow does
    this consensus tree depend on?
  • What input sequences were not used to derive any
    output consensus trees?
  • What was the sequence alignment (key intermediate
    data) used in the process of inferring this tree?
  • plus the usual: smart re-run, VCR replay, ...
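Questions like these reduce to reachability queries over a recorded dependency graph. The sketch below is a minimal illustration with invented item names (seq1, alignment1, and so on); it is not the provenance model of the cited paper, just a plain "depends on" map with a transitive closure.

```python
from typing import Dict, List, Set

# derived item -> items it was directly computed from (names are invented)
DEPENDS_ON: Dict[str, List[str]] = {
    "alignment1": ["seq1", "seq2", "seq3"],
    "tree1": ["alignment1"],
    "tree2": ["alignment1"],
    "consensus": ["tree1", "tree2"],
}
INPUTS = {"seq1", "seq2", "seq3", "seq4"}


def lineage(item: str) -> Set[str]:
    """All items that `item` transitively depends on."""
    seen: Set[str] = set()
    stack = list(DEPENDS_ON.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON.get(current, []))
    return seen


used_inputs = lineage("consensus") & INPUTS
unused_inputs = INPUTS - used_inputs
print(sorted(used_inputs))    # inputs the consensus tree depends on
print(sorted(unused_inputs))  # inputs not used to derive it
```

The same lineage call also answers the "key intermediate data" question: the alignment shows up in the lineage of the consensus tree.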

122
Provenance in the COMAD Framework
Without Provenance
With Provenance
123
The Answer (YMMV)
  • Collection-Oriented Modeling & Design (COMAD)
  • embrace the assembly line metaphor fully
  • → Virtual Assembly Lines (VALs)
  • → cf. Flow-based Programming (J. Morrison)
  • data = tagged nested collections
  • pipelined (XML) token streams
  • passing the buck on what's not in your scope

Timothy McPhillips (UC Davis)
124
Provenance for the WF Engineer / Plumber
  • A Workflow Engineer's View
  • Monitor, benchmark, and optimize workflow
    performance
  • Record resource usage for a workflow execution
  • Smart re-run of (variants of) previous
    executions
  • Checkpointing & restart (e.g. for crash recovery,
    load balancing)
  • Debug or troubleshoot a workflow run
  • Explain when, where, why a workflow crashed
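One common way to get a "smart re-run" is to key each step's result by its name and parameters and skip the step when that key has been seen before. The sketch below shows that idea with an in-memory cache; it is an assumption about how such a feature could work, not a description of Kepler's implementation, and smart_run and scale are illustrative names.

```python
import hashlib
import json
from typing import Callable, Dict

_cache: Dict[str, object] = {}
executions = 0  # counts real (non-cached) step executions


def smart_run(step: Callable[..., object], **params) -> object:
    """Run `step` only if this (step name, parameters) combination is new."""
    global executions
    key_src = json.dumps({"step": step.__name__, "params": params}, sort_keys=True)
    key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
    if key not in _cache:
        executions += 1
        _cache[key] = step(**params)
    return _cache[key]


def scale(values, factor):
    return [v * factor for v in values]


first = smart_run(scale, values=[1, 2, 3], factor=2)
again = smart_run(scale, values=[1, 2, 3], factor=2)    # reused from the cache
variant = smart_run(scale, values=[1, 2, 3], factor=3)  # parameter changed: re-executed
print(executions)
```

A persistent version of the same cache, written to disk after every step, would also serve as a simple checkpoint for restarting after a crash.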

125
Provenance for Domain Scientists!
  • Query the lineage of a data product
  • from what data was this computed? (real
    dependencies please!)
  • Evaluate the results of a workflow
  • do I like how this result was computed?
  • Reuse data products of one workflow run in
    another
  • (re-)attach prior data products to a new workflow
  • Archive scientific results in a repository
  • Replicate the results reported by another
    researcher
  • Discover all results derived from a given dataset
  • i.e. across all runs
  • Explain unexpected results
  • via parameter-, dataset-, object-dependencies
    in the scientist's terms (yes, you may think
    ontology here)

126
Observables
  • Model of Computation (MoC) M
  • specification/algorithm to compute o = M(W,P,i)
  • a director or scheduler implements M
  • gives rise to formal notions of
  • computation (aka run) R; typically tree models
  • Model of Provenance (MoP) M'
  • approximation M' of M
  • a trace T approximates a run R by
    inclusion/exclusion of observables
  • T = R - Ignored-observables +
    Model-observables
  • Observables (of a MoC M)
  • functional observables (may influence output o)
  • token rate, notions of firing, ...
  • non-functional observables (not part of M, do not
    influence o)
  • token timestamp, size, ... (unless the MoC cares
    about those)
  • What is a good model of provenance?
  • What is a good provenance schema?
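The idea that a trace approximates a run by including some observables and excluding others can be made concrete with a small projection function. Everything in this sketch is an illustrative assumption: a run is modeled as a list of event dictionaries, and the observable names (actor, firing, timestamp, size, run_id) are invented.

```python
from typing import Dict, List, Set


def project_trace(run: List[Dict[str, object]], ignored: Set[str],
                  extra: Dict[str, object]) -> List[Dict[str, object]]:
    """Build a trace from a run: drop ignored observables, add model observables."""
    trace = []
    for event in run:
        kept = {k: v for k, v in event.items() if k not in ignored}
        kept.update(extra)
        trace.append(kept)
    return trace


run = [
    {"actor": "A", "firing": 1, "timestamp": 1001.5, "size": 64},
    {"actor": "B", "firing": 1, "timestamp": 1002.0, "size": 32},
]
trace = project_trace(run, ignored={"timestamp", "size"}, extra={"run_id": "r1"})
for event in trace:
    print(event)
```

Choosing which observables go into `ignored` is exactly the modeling decision the two closing questions on the slide ask about.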

127
Provenance in the General Architecture (SDM/SPA
View)
Analytics
Computations
Control Panel (Dashboard) & Display
Local and/or remote communications (networks)
Orchestration (Kepler)
Data, DataBases, Provenance, Storage
128
What is Provenance? (SDM/SPA view)
  • Provenance is about meta-data (data about data),
    the history (lineage) of data, code execution and
    conditions applied to a workflow run.
  • Run-time monitoring may be part of the provenance
    meta-data, but it also may require collection of
    additional information and display of that
    information in a user-friendly format, for
    example on a dashboard, so that run-time
    tracking, problem determination, computational
    steering, and other workflow-related feedback may
    take place.

129
Why Provenance?
  • Recreate results and rebuild workflows using the
    evolution information
  • Associate the workflow with the results it
    produced
  • Create links between generated data in different
    runs, and compare different runs
  • Checkpoint a workflow and recover from a system
    failure
  • Debug and explain results (via lineage tracing,
    ...)
  • Smart Reruns
  • Other

130
Types of Provenance
  • Process provenance: dynamics of control flows
    and their progression, execution, etc.
  • Data provenance: dynamics of data and data
    flows, file locations, application input/output
    information, etc.
  • Workflow provenance: structure, form, evolution,
    ...
  • System (or Environment) provenance: system
    information, O/S, compiler versions, etc.
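System (environment) provenance is the easiest of the four types to capture automatically. The sketch below records a few environment facts with Python's standard platform module; the field names are illustrative, and a real recorder would likely also store compiler and library versions as the slide suggests.

```python
import platform
import sys
from datetime import datetime, timezone


def system_provenance() -> dict:
    """Collect basic environment facts to store alongside a workflow run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "executable": sys.executable,
    }


record = system_provenance()
for key, value in record.items():
    print(f"{key}: {value}")
```

Storing one such record per run makes it possible to tell later whether two differing results came from different environments.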

131
Other Data Views and Concepts
  • Raw data
  • Application/Simulation monitoring (input, output,
    configuration, intermediate states, ...)
  • Data history and location
  • Machine monitoring
  • Shelf-life of data
  • Auditability
  • Error and execution logs
  • Analytics and data information summation
    (visual, formulas, smoothing, etc.)

132
Framework
Storage
Supercomputer Analytics
Kepler
Dash
Meta-Data about Processes, Data, Workflows,
System Environment
Orchestration
133
A Hierarchical View of the Architecture
Control Plane (light data flows)
Provenance, Tracking & Meta-Data (DBs and
Portals)
Execution Plane ("Heavy Lifting"
Computations and flows)
Synchronous or Asynchronous?
134
Implementation
  • Kepler + Linux, Apache, MySQL, PHP (K-LAMP)
  • Windows-based solutions
  • Communications: sockets, xmlrpc, http, files,
    NFS, synchronous, asynchronous, etc.
  • Single node vs. distributed solutions
  • Service-based solutions
  • Which information?

135
Data Model
136
Kepler Provenance Framework
137
A Key