Title: Towards Scientific Workflows Based on Dataflow Process Networks or from Ptolemy to Kepler
1 Towards Scientific Workflows Based on Dataflow
Process Networks (or from Ptolemy to Kepler)
- Bertram Ludäscher
- San Diego Supercomputer Center
- ludaesch_at_SDSC.edu
2A Note on the Style of the following Slides
- Due to lack of time, most of the following slides
will be by reference only -) - Each speaker was given four minutes to present
his paper, as there were so many scheduled -- 198
from 64 different countries. To help expedite the
proceedings, all reports had to be distributed
and studied beforehand, while the lecturer would
speak only in numerals, calling attention in this
fashion to the salient paragraphs of his work.
... Stan Hazelton of the U.S. delegation
immediately threw the hall into a flurry by
emphatically repeating 4, 6, 11, and therefore
22 5, 9, hence 22 3, 7, 2, 11, from which it
followed that 22 and only 22!! Someone jumped up,
saying yes but 5, and what about 6, 18, or 4 for
that matter Hazelton countered this objection
with the crushing retort that, either way, 22. I
turned to the number key in his paper and
discovered that 22 meant the end of the world
The Futurological Congress, Stanislaw Lem,
translated from the Polish by Michael Kandel,
Futura 1977
3Acknowledgements
- NSF, NIH, DOE
-
- GEOsciences Network (NSF)
- www.geongrid.org
- Biomedical Informatics Research Network (NIH)
- www.nbirn.net
- Science Environment for Ecological Knowledge
(NSF) - seek.ecoinformatics.org
- Scientific Data Management Center (DOE)
- sdm.lbl.gov/sdmcenter/
4Example Promoter Identification Workflow (PIW)
(simplified)
From SciDAC/SDM project and collaboration w/
Matt Coleman (LLNL)
5Conceptual Workflow (Promoter Identification
Workflow PIW)
Compute clusters (min. distance)
For each promoter
Select gene-set (cluster-level)
Compute Subsequence labels
For each gene
With all Promoter Models
Compute Joint Promoter Model
6(No Transcript)
7Details of the Functional MRI (Magnetic Resonance
Imaging) Analysis Workflow (Jeffrey Grethe)
- Collect data (K-Space images in Fourier space)
from MR scanner while subject performs a specific
task - Reconstruct K-Space data to image data (this
requires scanner parameters for the
reconstruction) - Now have anatomical and functional data
- Pre-process the functional data
- Correct for difference in slice acquisition (each
slice in a volume is collected at a slightly
different time). Try to correct for these
differences so that all slices seem to be
acquired at same time - Not correct for subject motion (head movement in
scanner) by realigning all functional images - Register the functional images with the
anatomical image ? all images are now in the same
space (all aligned with one another) - Move all subjects into template space through
non-linear spatial normalization. There exist
atlas templates (made from many subjects) that
one can normalize to so that all subjects are in
the same space, allowing for direct comparison
across subjects. - DATA VERIFICATION - check if all these procedures
worked. If not, go back and try again (possibly
tweaking some parameters for the routines or by
re-doing some of it by hand). - Move onto statistics. First we do single subject
statistics in addition to the images,
information about the experimental paradigm is
required. These can be overlayed onto an
anatomical to create visual displays of brain
activation during a particular task. - Can also combine statistical data from multiple
subjects and do a group/population analysis and
display these results. - ? Interactive nature of these workflows is
critical (data verification) - can these steps be
automated or semi-automated? - ? need metadata from collection equipment and
experimental design !
8GARP Invasive Species Pipeline
From NSF SEEK (Deana Pennington et al)
9Scientific Workflows Some Findings
- More dataflow than workflow
- but some branching looping, merging,
- not documents/objects undergoing modifications
- instead dataset-out analysis(dataset-in)
- Need for collection/functional-style
programming (FP) - Iterations over lists (foreach) filtering
functional composition generic higher-order
operations (zip, map(f), ) - Need for abstraction and nested workflows
- Need for data transformations (compute/transform
alternations) - Need for rich user interaction / steering
- pause resume
- select branch e.g., web browser capability at
specific steps as part of a coordinated SWF - Need for high-throughput transfers
(grid-enabling, streaming) - Need for persistence of intermediate products
- ? data provenance (virtual data cf. several
ITR and e-Science projects)
10(Analytical) Pipelines . (Scientific) Workflows
- Spectrum of languages formalisms
- Pipelines (a la Unix)
- Dataflow languages
- Synchronous dataflow networks (SDF)
- Kahns process networks (PN)
- Web page-flow
- Active XML, WebML,
- Hesitating-weak-alternating-tree-automata-ML
-
- (Business) Workflows
- WfMCs XPDL, WSFL, BPELWS,
11Business Workflows
- Business Workflows
- show their office automation ancestry
- documents and work-tasks are passed
- no data streaming, data-intensive pipelines
- lots of standards to choose from WfMC, BMPL,
BPEL4WS,.. XPDL, - but often no clear semantics for constructs as
simple as this
Source Expressiveness and Suitability of
Languages for Control Flow Modelling in
Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
12The ZOO of Workflow Standards and Systems
Source W.M.P. van der Aalst et
al. http//tmitwww.tm.tue.nl/research/patterns/
13More on Scientific WF vs Business WF
- Business WF
- Tasks, documents, etc. undergo modifications
(e.g., flight reservation from reserved to
ticketed), but modified WF objects still
identifiable throughout - Complex control flow, task-oriented
- Transactions w/o rollback (ticket reserved ?
purchased) -
- SWF
- data-in and data-out of an analysis step are not
the same object! - dataflow, data-oriented (cf. AVS/Express, Khoros,
) - re-run automatically (a la distrib. comp., e.g.
Condor) or user-driven/interactively (based on
failure type) - data integration semantic mediation as part of
SWF framework!
14SWF vs Distributed Computing
- Distributed Computing (e.g. a la Condor-(G) )
- Batch oriented
- Transparent distributed computing (remote
Unix/Java standard/Java universes in Condor) - HPC resource allocation scheduling
- SWF
- Often highly interactive for decision
making/steering of the WF and visualization (data
analysis) - Transparent data access (Grid) and integration
(database mediation semantic extensions) - Desktop metaphor (microworkflow!?) often (but
not always!) light-weight web service invocation
15Ptolemy-II
- Recommendations following
- must read
- must see (now snippets following watch for new
ways to compress slides -) - must try
- Bottom line
- a sophisticated system to do simple things
(dataflows) as well as highly complex things
(hybrid models) - (compare to your favorite standard/approach/syste
m)
16Dataflow Process Networks and Ptolemy-II
see!
read!
try!
Source Edward Lee et al. http//ptolemy.eecs.berk
eley.edu/ptolemyII/
17(No Transcript)
18(No Transcript)
19(No Transcript)
20In our (SEEK) terminology Think of it as
Workflow Execution Model
21(No Transcript)
22Our SEEK/ SciDAC/Kepler extensions here!
23Some Glimpses on the PT-II Execution Models
(Domains)
24Kahn Process Networks (PN)
- Concurrent processes communication through
one-way FIFO channels with unbounded capacity - A functional process F maps a set of input
sequences into a set of output sequences (sounds
like XSM!) - increasing chain of sets of sequences ? outputs
may not increase! - Consider increasing chains (wrt. prefix ordering
- PN is continuous if lub(Xs) exists for all
increasing chains Xs and - F(lub(Xs))
- Continuous implies montonic
- if Xs
25Process Networks (contd)
- PN in essence simultaneous relations between
sequences - Network of functional processes can be described
by a mapping - X F(X,I)
- X denotes all the sequences in the network
(inputs Ioutputs) - X that forms a solution is a fixed point
- Continuity implies exactly one minimal fixed
point - minimal in the sense of pre-fix ordering for any
inputs I - execution of the network given I and find
the minimal fixed point (works because of the
monotonic property)
26Synchronous Data Flow Networks (SDF)
- Special case of PN
- Ptolemy-II SDF overview
- SDF supports efficient execution of Dataflow
graphs that lack control structures - with control structures ? Process Networks(PN)
- requires that the rates on the ports of all
actors be known before hand - do not change during execution
- in systems with feedback, delays, which are
represented by initial tokens on relations must
be explicitly noted ? SDF uses this rate and
delay information to determine the execution
sequence of the actors before execution begins.
27Extended Kahn-MacQueen Process Networks
- A process is considered active from its creation
until its termination - An active process can block when trying to read
from a channel (read-blocked), when trying to
write to a channel (write-blocked) or when
waiting for a queued topology change request to
be processed (mutation-blocked) - A deadlock is when all the active processes are
blocked - real deadlock all the processes are blocked on a
read - artificial deadlock all processes are blocked,
at least one process is blocked on a write ?
increase the capacity of receiver with the
smallest capacity amongst all the receivers on
which a process is blocked on a write. This
breaks the deadlock. - If the increase results in a capacity that
exceeds the value of maximumQueueCapacity, then
instead of breaking the deadlock, an exception is
thrown. This can be used to detect erroneous
models that require unbounded queues.
28Towards SciMod/SDMSWE/Kepler/
29Scientific Workflows Dataflow Process Networks
X
- Kepler current Ptolemy-II plus X, where X
- Extended type system (structural semantic
extensions) - Collection programming extensions
(declarative/FP) and - Rich user interactions/workflow steering
- Rich data transformations (compute/transform
alternations) - (Eco-)Grid extensions
- Actors as web/grid services
- 3rd party data transfer, high-throughput data
streaming - Data and service repositories, discovery
- Data provenance
- (semi-)automatic meta-data creation
- What else???
- minus upcoming Ptolemy-II extensions!
- The slower we are, the less we have to do
ourselves -)
30Extended Type System (here OWL Semantic Types)
SemType m1 Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty ?
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
Substructure association XML raw-data
(X)Query object model link OWL ontology
31Actor Repositories (here a commercial tool)
See why we said user-definable (or
auto-generated) actor libraries?
32Collection Programming(some lessons from
SciDAC/SSDBM demo)
33Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
34Promoter Identification Workflow in FP
genBankG GeneId - GeneSeqgenBankP
PromoterId - PromoterSeqblast GeneSeq -
PromoterIdpromoterRegion PromoterSeq -
PromoterRegiontransfac PromoterRegion -
TFBSgpr2str (PromoterId, PromoterRegion)
- Stringd0 Gid "7" -- start
with some gene-id d1 genBankG d0 --
get its gene sequence from GenBankd2 blast d1
-- BLAST to get a list of potential
promotersd3 map genBankP d2 -- get list
of promoter sequences d4 map promoterRegion d3
-- compute list of promoter regions and ...d5
map transfac d4 -- ... get transcription
factor binding sitesd6 zip d2 d4
-- create list of pairs promoter-id/regiond7
map gpr2str d6 -- pretty print into a list
of strings d8 concat d7 -- concat
into a single "file" d9 putStr d8
-- output that file
35Simplified Process Network PIW
- Back to purely functional dataflow process
network - ( a data streaming model!)
- Re-introducing map(f) to Ptolemy-II (was there in
PT Classic) - no control-flow spaghetti
- data-intensive apps
- free concurrent execution
- free type checking
- automatic support to go from piw(GeneId) to PIW
map(piw) over GeneId
map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
36Optimization by Declarative Rewriting I
- PIW as a declarative, referentially transparent
functional process - optimization via functional rewriting possible
- e.g. map(f o g) map(f) o map(g)
- Details
- Technical report PIW specification in Haskell
map(f o g) instead of map(f) o map(g)
Combination of map and zip
http//kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-const
ructs.pdf
37Optimization by Declarative Rewriting II
- Rewritings require that data transformation
semantics is known - e.g., Haskell-like for FP and SQL (XQuery)-like
for (XML) database querying
Source Real-Time Signal Processing Dataflow,
Visual, and Functional Programming, Hideki John
Reekie, University of Technology, Sydney
38 Data Transformation ActorsChaining together
web services is easy
39(No Transcript)
40MAP Data Massaging a la Blue-Titan/Perl
41Data Transformation Actors Our Approach
(proposal)
- Manual
- XQuery, XSLT, Perl, Python, transformation
actor (development) - (Semi-)automatic
- Semantic-type guided transformation generation
(research) - Also Web Service Composition is
- a hot topic
- a reincarnation of many old ideas
- (e.g., AI-style planning born-again functional
composition query composition ) - a separate topic
42User Interaction
- Brower Actor demo (Ilkay)
43F I N(addtl. material follows)
- FYI Flow-based programming has been
re-discovered/re-invented several times - Flow-based Programming, http//www.jpaulmorrison
.com/fbp/index.shtm