Towards Scientific Workflows Based on Dataflow Process Networks or from Ptolemy to Kepler

Presented at the SEEK meeting, UCSB, 10/22-26/2003.

1
Towards Scientific Workflows Based on Dataflow
Process Networks (or from Ptolemy to Kepler)

Bertram Ludäscher
San Diego Supercomputer Center
ludaesch_at_SDSC.edu

2
A Note on the Style of the following Slides

Due to lack of time, most of the following slides
will be by reference only :-)
"Each speaker was given four minutes to present
his paper, as there were so many scheduled -- 198
from 64 different countries. To help expedite the
proceedings, all reports had to be distributed
and studied beforehand, while the lecturer would
speak only in numerals, calling attention in this
fashion to the salient paragraphs of his work.
... Stan Hazelton of the U.S. delegation
immediately threw the hall into a flurry by
emphatically repeating: 4, 6, 11, and therefore
22; 5, 9, hence 22; 3, 7, 2, 11, from which it
followed that 22 and only 22!! Someone jumped up,
saying yes but 5, and what about 6, 18, or 4 for
that matter; Hazelton countered this objection
with the crushing retort that, either way, 22. I
turned to the number key in his paper and
discovered that 22 meant the end of the world."
The Futurological Congress, Stanislaw Lem,
translated from the Polish by Michael Kandel,
Futura 1977

3
Acknowledgements

NSF, NIH, DOE
GEOsciences Network (NSF)
www.geongrid.org
Biomedical Informatics Research Network (NIH)
www.nbirn.net
Science Environment for Ecological Knowledge
(NSF)
seek.ecoinformatics.org
Scientific Data Management Center (DOE)
sdm.lbl.gov/sdmcenter/

4
Example Promoter Identification Workflow (PIW)
(simplified)
From SciDAC/SDM project and collaboration w/
Matt Coleman (LLNL)
5
Conceptual Workflow (Promoter Identification
Workflow PIW)
Compute clusters (min. distance)
For each promoter
Select gene-set (cluster-level)
Compute Subsequence labels
For each gene
With all Promoter Models
Compute Joint Promoter Model
7
Details of the Functional MRI (Magnetic Resonance
Imaging) Analysis Workflow (Jeffrey Grethe)

Collect data (K-space images in Fourier space) from the MR scanner while the subject performs a specific task
Reconstruct K-space data to image data (this requires scanner parameters for the reconstruction)
Now have anatomical and functional data
Pre-process the functional data:
Correct for differences in slice acquisition (each slice in a volume is collected at a slightly different time), so that all slices seem to be acquired at the same time
Now correct for subject motion (head movement in scanner) by realigning all functional images
Register the functional images with the anatomical image -> all images are now in the same space (all aligned with one another)
Move all subjects into template space through non-linear spatial normalization. There exist atlas templates (made from many subjects) that one can normalize to so that all subjects are in the same space, allowing for direct comparison across subjects.
DATA VERIFICATION: check if all these procedures worked. If not, go back and try again (possibly tweaking some parameters for the routines or re-doing some of it by hand).
Move on to statistics. First we do single-subject statistics: in addition to the images, information about the experimental paradigm is required. The results can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task.
Can also combine statistical data from multiple subjects, do a group/population analysis, and display these results.
=> Interactive nature of these workflows is critical (data verification): can these steps be automated or semi-automated?
=> Need metadata from collection equipment and experimental design!

8
GARP Invasive Species Pipeline
From NSF SEEK (Deana Pennington et al)
9
Scientific Workflows: Some Findings

More dataflow than workflow
but some branching, looping, merging, ...
not documents/objects undergoing modifications
instead: dataset-out = analysis(dataset-in)
Need for collection/functional-style programming (FP)
Iterations over lists (foreach); filtering; functional composition; generic higher-order operations (zip, map(f), ...)
Need for abstraction and nested workflows
Need for data transformations (compute/transform alternations)
Need for rich user interaction / steering
pause / resume
select branch; e.g., web browser capability at specific steps as part of a coordinated SWF
Need for high-throughput transfers (grid-enabling, streaming)
Need for persistence of intermediate products
=> data provenance (virtual data; cf. several ITR and e-Science projects)
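
The collection-style operations named on this slide (filtering, composition, zip, map) can be illustrated with a small Haskell sketch. It is only an illustration of the "dataset-out = analysis(dataset-in)" style; `Sample`, `calibrate`, and the validity rule are hypothetical and not taken from any of the workflows in this talk.

```haskell
-- Hypothetical dataset: (sample id, raw measurement) pairs.
type Sample = (Int, Double)

-- Hypothetical analysis step: rescale a raw measurement.
calibrate :: Double -> Double
calibrate x = 2 * x

-- Filtering: keep only samples whose raw value is non-negative.
valid :: [Sample] -> [Sample]
valid = filter (\(_, x) -> x >= 0)

-- dataset-out = analysis(dataset-in): the input list is never modified;
-- a new list is built by combining filter, map, composition, and zip.
analyse :: [Sample] -> [(Int, Double)]
analyse samples = zip ids values
  where
    kept   = valid samples
    ids    = map fst kept
    values = map (calibrate . snd) kept
```

For example, `analyse [(1, 1.5), (2, -1.0), (3, 0.5)]` drops the second sample and yields `[(1, 3.0), (3, 1.0)]`.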

10
(Analytical) Pipelines . (Scientific) Workflows

Spectrum of languages & formalisms
Pipelines (a la Unix)
Dataflow languages
Synchronous dataflow networks (SDF)
Kahn's process networks (PN)
Web page-flow
Active XML, WebML, ...
Hesitating-weak-alternating-tree-automata-ML
(Business) Workflows
WfMC's XPDL, WSFL, BPEL4WS, ...

11
Business Workflows

Business Workflows
show their office automation ancestry
documents and work-tasks are passed
no data streaming, no data-intensive pipelines
lots of standards to choose from: WfMC, BPML, BPEL4WS, ..., XPDL, ...
but often no clear semantics for constructs as simple as this:

Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
12
The ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al., http://tmitwww.tm.tue.nl/research/patterns/
13
More on Scientific WF vs Business WF

Business WF
Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects are still identifiable throughout
Complex control flow, task-oriented
Transactions w/o rollback (ticket reserved -> purchased)
SWF
data-in and data-out of an analysis step are not the same object!
dataflow, data-oriented (cf. AVS/Express, Khoros, ...)
re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
data integration & semantic mediation as part of the SWF framework!

14
SWF vs Distributed Computing

Distributed Computing (e.g. a la Condor(-G))
Batch oriented
Transparent distributed computing (remote Unix/Java; standard/Java universes in Condor)
HPC resource allocation & scheduling
SWF
Often highly interactive for decision making/steering of the WF and visualization (data analysis)
Transparent data access (Grid) and integration (database mediation & semantic extensions)
Desktop metaphor (microworkflow!?); often (but not always!) light-weight web service invocation

15
Ptolemy-II

Recommendations following:
must read
must see (now: snippets following; watch for new ways to compress slides :-)
must try
Bottom line:
a sophisticated system to do simple things (dataflows) as well as highly complex things (hybrid models)
(compare to your favorite standard/approach/system)

16
Dataflow Process Networks and Ptolemy-II
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
20
In our (SEEK) terminology: think of it as a
Workflow Execution Model
22
Our SEEK/ SciDAC/Kepler extensions here!
23
Some Glimpses on the PT-II Execution Models
(Domains)
24
Kahn Process Networks (PN)

Concurrent processes communicating through one-way FIFO channels with unbounded capacity
A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
increasing chain of sets of sequences -> outputs may not increase!
Consider increasing chains Xs (wrt. prefix ordering)
F is continuous if lub(Xs) exists for all increasing chains Xs and
F(lub(Xs)) = lub(F(Xs))
Continuous implies monotonic:
if Xs ⊑ Xs' then F(Xs) ⊑ F(Xs')

25
Process Networks (cont'd)

PN in essence: simultaneous relations between sequences
A network of functional processes can be described by a mapping
X = F(X, I)
X denotes all the sequences in the network (inputs I and outputs)
An X that forms a solution is a fixed point
Continuity implies exactly one minimal fixed point
(minimal in the sense of prefix ordering) for any inputs I
Execution of the network: given I, find the minimal fixed point (works because of the monotonic property)
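
A minimal Haskell sketch of the fixed-point view, under the assumption that channel contents are modelled as lazy lists. The process names `addP`, `delayP`, and `runningSum` are made up for illustration; this is not Ptolemy-II code. The recursive definition of `x` plays the role of the equation X = F(X, I), and lazy evaluation produces the minimal fixed point one token at a time.

```haskell
-- Streams on channels are modelled as lazy lists (an assumption of this sketch).
type Stream a = [a]

-- A functional process: adds its two input streams element-wise.
addP :: Stream Integer -> Stream Integer -> Stream Integer
addP = zipWith (+)

-- A delay process: emits one initial token, then forwards its input.
delayP :: a -> Stream a -> Stream a
delayP initial xs = initial : xs

-- Network with feedback: x = delayP 0 (addP x input).
-- This recursive definition is the equation X = F(X, I); evaluating it
-- lazily yields the least solution with respect to prefix ordering.
runningSum :: Stream Integer -> Stream Integer
runningSum input = x
  where
    x = delayP 0 (addP x input)
```

With an unbounded input, `take 5 (runningSum [1 ..])` gives `[0,1,3,6,10]`. Without the initial token from `delayP`, no process in the loop could ever produce a first output; in this encoding the evaluation would simply fail to make progress.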

26
Synchronous Data Flow Networks (SDF)

Special case of PN
Ptolemy-II SDF overview
SDF supports efficient execution of dataflow graphs that lack control structures
with control structures -> Process Networks (PN)
requires that the rates on the ports of all actors be known beforehand and do not change during execution
in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted
=> SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
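
To make "determine the execution sequence before execution begins" concrete, here is a small Haskell sketch for the simplest possible graph: one actor A producing `p` tokens per firing, connected to one actor B consuming `c` tokens per firing, with no feedback and no initial tokens. The function names and the greedy firing rule are assumptions of this sketch, not the Ptolemy-II scheduler; both rates are assumed to be positive.

```haskell
-- Smallest positive repetition counts (rA, rB) satisfying the balance
-- equation rA * p == rB * c for an edge A --p--> --c--> B.
repetitions :: Int -> Int -> (Int, Int)
repetitions p c = (c `div` g, p `div` g)
  where
    g = gcd p c

-- Static schedule for one iteration, computed without running any actor:
-- fire B whenever enough tokens are buffered, otherwise fire A.
schedule :: Int -> Int -> String
schedule p c = go rA rB 0
  where
    (rA, rB) = repetitions p c
    go :: Int -> Int -> Int -> String
    go a b buffered
      | b > 0 && buffered >= c = 'B' : go a (b - 1) (buffered - c)
      | a > 0                  = 'A' : go (a - 1) b (buffered + p)
      | otherwise              = []
```

For rates `p = 2` and `c = 3`, `repetitions 2 3` is `(3, 2)` and `schedule 2 3` is `"AABAB"`: after one iteration the buffer is empty again, so the same schedule can be repeated indefinitely with bounded memory.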

27
Extended Kahn-MacQueen Process Networks

A process is considered active from its creation
until its termination
An active process can block when trying to read
from a channel (read-blocked), when trying to
write to a channel (write-blocked) or when
waiting for a queued topology change request to
be processed (mutation-blocked)
A deadlock is when all the active processes are blocked
real deadlock: all the processes are blocked on a read
artificial deadlock: all processes are blocked, and at least one process is blocked on a write
-> increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock.
If the increase results in a capacity that
exceeds the value of maximumQueueCapacity, then
instead of breaking the deadlock, an exception is
thrown. This can be used to detect erroneous
models that require unbounded queues.
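
The deadlock rules above can be sketched as a small decision function in Haskell. This is only a model of the rules as stated on the slide: the types `Blocked` and `Resolution`, the function `resolve`, and in particular the growth policy (doubling the capacity, with a minimum of 1) are assumptions of this sketch, since the slide does not say by how much a queue grows.

```haskell
-- Why an active process is blocked (names are illustrative).
data Blocked
  = ReadBlocked
  | WriteBlocked Int   -- current capacity of the full receiver
  | MutationBlocked
  deriving (Eq, Show)

data Resolution
  = RealDeadlock             -- every process is blocked on a read
  | Grow Int Int             -- grow a receiver from old to new capacity
  | QueueLimitExceeded Int   -- requested capacity exceeds the maximum
  | Unclassified             -- a case the slide does not cover
  deriving (Eq, Show)

-- Input: the maximum queue capacity and the blocking state of every
-- active process at a moment when all of them are blocked.
resolve :: Int -> [Blocked] -> Resolution
resolve maxQueueCapacity blocked
  | null blocked                 = Unclassified
  | all (== ReadBlocked) blocked = RealDeadlock
  | null writeCaps               = Unclassified
  | grown > maxQueueCapacity     = QueueLimitExceeded grown
  | otherwise                    = Grow smallest grown
  where
    writeCaps = [c | WriteBlocked c <- blocked]
    smallest  = minimum writeCaps
    grown     = max 1 (2 * smallest)   -- assumed growth policy: doubling
```

For example, `resolve 16 [ReadBlocked, WriteBlocked 4, WriteBlocked 2]` picks the smallest full receiver and returns `Grow 2 4`, while `resolve 3 [WriteBlocked 2]` returns `QueueLimitExceeded 4`, the case in which an exception would be raised.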

28
Towards SciMod/SDMSWE/Kepler/

(my vote is for Kepler)

29
Scientific Workflows = Dataflow Process Networks + X

Kepler = current Ptolemy-II plus X, where X =
Extended type system (structural & semantic extensions)
Collection programming extensions (declarative/FP)
Rich user interactions/workflow steering
Rich data transformations (compute/transform alternations)
(Eco-)Grid extensions
Actors as web/grid services
3rd party data transfer, high-throughput data streaming
Data and service repositories, discovery
Data provenance
(semi-)automatic meta-data creation
What else???
minus upcoming Ptolemy-II extensions!
The slower we are, the less we have to do ourselves :-)

30
Extended Type System (here: OWL Semantic Types)
SemType m1
Observation itemMeasured.AbundanceCount hasContext.appliesTo.LifeStageProperty
-> DerivedObservation itemMeasured.MortalityRate hasContext.appliesTo.LifeStageProperty
Substructure association XML raw-data
(X)Query object model link OWL ontology
31
Actor Repositories (here a commercial tool)
See why we said user-definable (or
auto-generated) actor libraries?
32
Collection Programming (some lessons from
SciDAC/SSDBM demo)
33
Promoter Identification Workflow in
Ptolemy-II (SSDBM'03)
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
34
Promoter Identification Workflow in FP
    genBankG       :: GeneId -> GeneSeq
    genBankP       :: PromoterId -> PromoterSeq
    blast          :: GeneSeq -> [PromoterId]
    promoterRegion :: PromoterSeq -> PromoterRegion
    transfac       :: PromoterRegion -> TFBS
    gpr2str        :: (PromoterId, PromoterRegion) -> String

    d0 = Gid "7"                 -- start with some gene-id
    d1 = genBankG d0             -- get its gene sequence from GenBank
    d2 = blast d1                -- BLAST to get a list of potential promoters
    d3 = map genBankP d2         -- get list of promoter sequences
    d4 = map promoterRegion d3   -- compute list of promoter regions and ...
    d5 = map transfac d4         -- ... get transcription factor binding sites
    d6 = zip d2 d4               -- create list of pairs promoter-id/region
    d7 = map gpr2str d6          -- pretty print into a list of strings
    d8 = concat d7               -- concat into a single "file"
    d9 = putStr d8               -- output that file
35
Simplified Process Network PIW

Back to purely functional dataflow process network
(a data streaming model!)
Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
no control-flow spaghetti
data-intensive apps
free concurrent execution
free type checking
automatic support to go from piw(GeneId) to PIW = map(piw) over [GeneId]

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
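
A runnable Haskell sketch of the forward-only sub-workflow piw(GeneId) and its lifting to a collection via map. The function names follow the FP specification of the workflow, but every function body here is a hypothetical stub standing in for a real service call (GenBank, BLAST), and all sequence types are plain strings purely for illustration. The transfac step is left out because its result does not feed into the printed output.

```haskell
-- Placeholder types; the real workflow would use richer representations.
newtype GeneId = Gid String
type GeneSeq        = String
type PromoterId     = String
type PromoterSeq    = String
type PromoterRegion = String

-- Hypothetical stubs standing in for remote services.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = "geneseq-" ++ g

blast :: GeneSeq -> [PromoterId]
blast s = [s ++ "-p1", s ++ "-p2"]

genBankP :: PromoterId -> PromoterSeq
genBankP p = "seq(" ++ p ++ ")"

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion s = "region(" ++ s ++ ")"

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (p, r) = p ++ "\t" ++ r ++ "\n"

-- The per-gene workflow as one forward-only function: no backward
-- control flow, only map, zip, and composition.
piw :: GeneId -> String
piw g = concatMap gpr2str (zip promoters regions)
  where
    promoters = blast (genBankG g)
    regions   = map (promoterRegion . genBankP) promoters

-- Lifting from a single gene to a collection of genes: PIW = map(piw).
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

`putStr (piw (Gid "7"))` prints one tab-separated line per promoter, and `piwAll [Gid "1", Gid "2"]` produces one report per gene without any change to `piw` itself.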
36
Optimization by Declarative Rewriting I

PIW as a declarative, referentially transparent functional process
optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g)
Details
Technical report & PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
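
The rewriting rule map(f o g) = map(f) o map(g) can be checked on a toy example. The two per-item steps `normalise` and `score` below are hypothetical; the point is only that both pipelines compute the same list, while the fused one avoids building an intermediate collection between the two steps.

```haskell
-- Two hypothetical per-item steps.
normalise :: Int -> Int
normalise = subtract 1

score :: Int -> Int
score = (* 3)

-- Unoptimized form: two traversals with an intermediate list between them.
twoPasses :: [Int] -> [Int]
twoPasses = map score . map normalise

-- After rewriting map f . map g into map (f . g): a single traversal.
onePass :: [Int] -> [Int]
onePass = map (score . normalise)
```

Both `twoPasses [1, 2, 3]` and `onePass [1, 2, 3]` evaluate to `[0,3,6]`. The rewrite is only valid because the steps are referentially transparent, which is the requirement the slide states.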
37
Optimization by Declarative Rewriting II

Rewritings require that data transformation
semantics is known
e.g., Haskell-like for FP and SQL (XQuery)-like
for (XML) database querying

Source: Real-Time Signal Processing: Dataflow,
Visual, and Functional Programming, Hideki John
Reekie, University of Technology, Sydney
38
Data Transformation Actors: Chaining together
web services is easy

(NOT)

40
MAP Data Massaging a la Blue-Titan/Perl
41
Data Transformation Actors: Our Approach
(proposal)

Manual
XQuery, XSLT, Perl, Python, ... transformation actors (development)
(Semi-)automatic
Semantic-type guided transformation generation (research)
Also: Web Service Composition is
a hot topic
a reincarnation of many old ideas
(e.g., AI-style planning, born-again functional composition, query composition, ...)
a separate topic

42
User Interaction

Browser Actor demo (Ilkay)

43
F I N (addtl. material follows)

FYI: Flow-based programming has been
re-discovered/re-invented several times
Flow-based Programming, http://www.jpaulmorrison.com/fbp/index.shtm
