Title: Scientific Workflows Based on Dataflow Process Networks or from Ptolemy to Kepler or Workflow Consid
1Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler) (or Workflow Considered Harmful)
- Bertram Ludäscher
- San Diego Supercomputer Center
- ludaesch@SDSC.edu
2Overview
- Scientific Workflow (SWF) Examples
- SWF Requirements & Characteristics
- Workflow standards considered harmful for SWF!?
- Dataflow Process Networks (Ptolemy II)
- Scientific Workflows (Kepler = Ptolemy II + X)
3Acknowledgements I
- NSF, NIH, DOE
- GEOsciences Network (NSF)
- www.geongrid.org
- Biomedical Informatics Research Network (NIH)
- www.nbirn.net
- Science Environment for Ecological Knowledge (NSF)
- seek.ecoinformatics.org
- Scientific Data Management Center (DOE)
- sdm.lbl.gov/sdmcenter/
4Acknowledgements II
- Ilkay Altintas SDM
- Chad Berkley SEEK
- Shawn Bowers SEEK
- Jeffrey Grethe BIRN
- Christopher H. Brooks Ptolemy II
- Zhengang Cheng SDM
- Efrat Jaeger GEON
- Matt Jones SEEK
- Edward A. Lee Ptolemy II
- Kai Lin GEON
- Bertram Ludaescher BIRN, GEON, SDM, SEEK
- Stephen Neuendorffer Ptolemy II
- Mladen Vouk SDM
- Yang Zhao Ptolemy II
- Coming soon!?
- ROADNet, myGrid, GriPhyN, ...
Ptolemy II
5Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
6Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
Execution Semantics
7GARP Invasive Species Pipeline
Source: NSF SEEK (Deana Pennington et al., UNM)
8Rock Mineral Classification Workflow
9A Look Inside Classification
Finer granularity
Extracted from the mineral composition and this
levels diagram coordinates.
Classifier Locates the points region.
Diagrams information and transitions between them.
SVG to polygons.
Displays the point in the diagram for this level.
10Source: NIH BIRN (Jeffrey Grethe, UCSD)
11SWF Requirements & Characteristics
- Scientist friendly "problem solving environment"
- WF design
- WF execution
- WF steering and UI
- pause / revise / resume / rollback (cf. SCIRun)
- repositories of reusable components
- data and WF provenance (virtual data concept)
- logging, cache reuse/partial re-derive, reports,
- Conceptual modeling support
- complex data (semantics) support
- wiring support (cf. web service composition)
- planning support
12SWF Requirements & Characteristics
- "Modeling" support
- Abstraction, hierarchical modeling
- Models of Computation (MoC)
- component interaction & combination of MoCs (cf. CCA)
- WF multi-grain/granola: powder to boulders (and back)
- Boolean (N)AND, (N)OR, vs. chaining together Grid-apps
- Rich data structures and type systems
- End user "programming" support
- high-level programming constructs
- e.g. map/3 for iteration, filter, select, branch, merge, ...
- data transformations
- legacy tool integration (plug-ins)
- data streaming
- How to tame (e.g., starve a dataflow then resume)?
- => Zauberlehrling's problem (the sorcerer's apprentice)
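One way to picture "starve a dataflow then resume" is a two-actor process network whose channel is a bounded queue and whose source waits on a pause gate. This is a minimal sketch under stated assumptions: the actor functions, the gate, and the end-of-stream marker are hypothetical illustrations, not part of Ptolemy II or Kepler.

```python
import queue
import threading

END = None  # hypothetical end-of-stream marker


def producer(out_q, n, gate):
    """Source actor: emits 0..n-1, but only while the gate is open."""
    for i in range(n):
        gate.wait()      # blocks while the workflow is paused
        out_q.put(i)     # blocks when the bounded channel is full
    out_q.put(END)


def consumer(in_q, results):
    """Sink actor: blocking read, squares each token."""
    while True:
        token = in_q.get()
        if token is END:
            break
        results.append(token * token)


def run_network(n, capacity=2):
    """Start paused, record what was consumed so far, then resume."""
    channel = queue.Queue(maxsize=capacity)
    gate = threading.Event()          # cleared: the network starts paused
    results = []
    threads = [
        threading.Thread(target=producer, args=(channel, n, gate)),
        threading.Thread(target=consumer, args=(channel, results)),
    ]
    for t in threads:
        t.start()
    while_paused = list(results)      # nothing can have flowed yet
    gate.set()                        # resume
    for t in threads:
        t.join()
    return results, while_paused
```

The bounded queue keeps a fast source from flooding the network, and closing the gate starves the downstream actor without losing tokens.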
13SWF Requirements Characteristics
- Grid-enabling SWFs
- transparent use of (remote) resources
- big data
- big computation requirements
- early/late binding of logical to physical resources,
- planning, scheduling,
- => cf. Chimera, Pegasus, DAGMan, Condor(-G)
14Scientific Workflows: Some Findings
- More dataflow than (business) workflow
- but some branching, looping, merging,
- not documents/objects undergoing modifications
- instead often dataset-out = analysis(dataset-in)
- Need for programming extension
- Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), ...)
- Need for abstraction and nested workflows
- Need for data transformations (compute/transform alternations)
- Need for rich user interaction & workflow steering
- pause / revise / resume
- select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
- Need for high-throughput transfers (grid-enabling, streaming)
- Need for persistence of intermediate products
- => data provenance (virtual data concept)
15Scientific WF vs Business WF
- Scientific Workflows
- Dataflow and data transformations
- Data problems: volume, complexity, heterogeneity
- Grid-aspects
- Distributed computation
- Distributed data
- User-interactions/WF steering
- Data, tool, and analysis integration
- => Dataflow and control-flow are married!
- Business Workflows
- Process composition
- Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout
- Complex control flow, task-oriented: travel reservations, credit approval
- => Dataflow and control-flow are divorced!
16A ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al., http://tmitwww.tm.tue.nl/research/patterns/
17Business Workflows
- Business Workflows
- show their office automation ancestry
- documents and work-tasks are passed
- no data streaming, no data-intensive pipelines
- lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, ..., XPDL, ...
- but often no clear execution semantics for constructs as simple as this
Source: "Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows", PhD thesis, Bartosz Kiepuszewski, 2002
18On Workflow Standards
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
19Workflow Standards Debunked
Source: "Don't go with the flow: Web services composition standards exposed", W.M.P. van der Aalst, Trends & Controversies, Jan/Feb 2003 issue of IEEE Intelligent Systems, "Web Services - Been there done that?"
20Workflow Standards Debunked
Source: "Don't go with the flow: Web services composition standards exposed", W.M.P. van der Aalst, Trends & Controversies, Jan/Feb 2003 issue of IEEE Intelligent Systems, "Web Services - Been there done that?"
21But never mind the standards discussion: Many Scientific Workflows are Dataflows!
22Commercial Workflow/Dataflow Systems
23SCIRun: Component-Based Problem Solving Environments for Large-Scale Scientific Computing
- SCIRun: problem solving environment for interactive construction, debugging, and steering of large-scale scientific computations
- Component model, based on generalized dataflow programming
- Source: Steve Parker (cs.utah.edu), SciDAC/SDM collaboration
24Workflow and distributed computation grid created
with Kensington Discovery Edition from
InforSense.
25Dataflow Process Networks: Putting Computation Models first!
- Synchronous Dataflow Network (SDF)
- Statically schedulable single-threaded dataflow
- Can execute multi-threaded, but the firing-sequence is known in advance
- Maximally well-behaved, but also limited expressiveness
- Process Network (PN)
- Multi-threaded dynamically scheduled dataflow
- More expressive than SDF (dynamic token rate prevents static scheduling)
- Natural streaming model
- Other Execution Models ("Domains")
- Implemented through different "Directors"
advanced push/pull
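Why an SDF firing sequence is known in advance can be illustrated with the usual balance equations: on every channel, producer firings times tokens produced must equal consumer firings times tokens consumed. The sketch below solves them for a hypothetical three-actor chain; the actor names and token rates are invented for illustration, and this is not how the Ptolemy II SDF director is implemented.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Hypothetical SDF graph: (producer, tokens produced per firing,
#                          consumer, tokens consumed per firing)
EDGES = [
    ("source", 2, "filter", 3),
    ("filter", 1, "sink", 2),
]


def _lcm(a, b):
    return a * b // gcd(a, b)


def repetition_vector(edges):
    """Smallest positive firing counts that satisfy every balance equation."""
    rate = {edges[0][0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, prod, dst, cons = edge
            if src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no static schedule")
            elif src in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate:
                rate[src] = rate[dst] * cons / prod
            else:
                continue              # neither end reached yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    scale = reduce(_lcm, (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


def static_schedule(edges):
    """One period of a firing order, computed before any data exists."""
    reps = repetition_vector(edges)
    remaining = dict(reps)
    tokens = {(s, d): 0 for s, _, d, _ in edges}
    order = []
    while any(remaining.values()):
        for actor in reps:
            if remaining[actor] == 0:
                continue
            inputs = [(s, d, c) for s, _, d, c in edges if d == actor]
            if all(tokens[(s, d)] >= c for s, d, c in inputs):
                for s, d, c in inputs:
                    tokens[(s, d)] -= c
                for s, p, d, _ in edges:
                    if s == actor:
                        tokens[(s, d)] += p
                order.append(actor)
                remaining[actor] -= 1
                break
        else:
            raise ValueError("deadlock: no actor can fire")
    return order
```

With data-dependent token rates, as in PN, the rates in EDGES are not fixed numbers, so no such schedule can be computed ahead of time.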
26Dataflow Process Networks and Ptolemy-II
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
27Why Ptolemy-II?
- PTII Objective
- "The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation."
- Data & Process oriented
- Dataflow process networks
- Natural Data Streaming Support
- End user WF console (Vergil GUI)
- PRAGMATICS
- mature, actively maintained, well-documented
- open source system
- leverage sister projects' activities (e.g. SEEK, SDM, BIRN, ...)
28Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
29Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
30Marrying & Divorcing Control- & Dataflow
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
31Another Goodie: Ptolemy-II Type System
32Support for Multiple Workflow Granularities
Boulders
Plumbing
Powder
Abstraction: Sand to Rocks
Sand
33Scientific Workflows = Dataflow Process Networks + X
Kepler = Ptolemy-II + X
- X =
- Database plug-ins
- Legacy application plug-ins (via command line, as web services, ...)
- Grid extensions
- Actors as web/grid services
- 3rd party data transfer, high-throughput data streaming
- Dealing with thousands of files (cf. astrophysics, astronomy, HEP, ... examples)
- Data and service repositories, discovery
- Extended type system (structural & semantic extensions)
- Programming extensions (declarative/FP) and ...
- Rich user interactions/workflow steering
- Rich data transformations (compute/transform alternations)
- Data provenance
- (semi-)automatic meta-data creation
34Status update / specific tasks for Kepler: DONE, ONGOING, NEW
- User interaction, workflow steering
- Pause/revise/resume
- BrowserUI actor (browser as a 0-learning display and selection tool)
- Distributed execution
- Dynamically port-specializing WSDL actor
- Dynamically specializing Grid service actor
- Port & actor type extensions (SEEK leverage)
- Structural types (XML Schema)
- Semantic types (OWL) incl. unit types w/ automatic conversion
- Programming extensions
- Data transformation actors (XSLT, XQuery, Python, Perl, ...)
- map, zip, zipWith, ..., loop, switch "patterns"
- Specialized Data Sources
- EML (SEEK),
- MS Access (GEON), JDBC,
- XML, NetCDF, ...
35Some specific tasks for Kepler (all NEW)
- Design & develop transparent, Grid-enabled PNs
- Communication protocol details
- Grid-actor extensions and/or
- Grid-Process Network director (G-PN)
- Host/Source-location becomes actor parameter
- add active-inline parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@src-loc/catalog-loc)
- Activity Monitoring
- Add activity status display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!)
- Registration & Deployment mechanisms
- Actor/Data/Workflow repository (composite actors)
- Shows up as (configurable) actor library
- OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited, says MattJ)
- http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf
- Extensions to deal with failures (fault tolerance)
36Example Database actors for Ptolemy II
- (Kepler-GEON Efrat Jaeger)
37Database Actors
- Database Connection actor
- Database Query actor
38Database Actors Example
39Database Actors Example
40Example Web service-enabling Ptolemy II
- (Kepler-SDM Ilkay Altintas)
41A Generic Web Service Actor
42Set Parameters and Commit => Specialized Actor
Set parameters and commit
43Web Service Actor after Instantiation
44Composing Third-Party Web Services
Input of next web service
User interaction Transformations
45Results of the Execution
User I/O via standard browser!
Run Window / WF Deployment
46Composing Legacy Applications (here Phylogeny)
Shell / Command-Line Actors
Source Alex Borchers, UCSD
47Example Grid-enabling Ptolemy II
- (Kepler-SEEK, Chad Berkley
- Kepler-SDM, Ilkay Altintas,
- myGrid?,
- GriPhyN?,
- OGSA-DAI ...)
48Transparently Grid-Enabling PTII: Handles
Logical token transfer (3) requires get_handle (1, 2), then exec_handle (4, 5, 6, 7) for completion.
- (1) A -> GA: get_handle
- (2) GA -> A: return X
- (3) A -> B: send X
- (4) B -> GB: request X
- (5) GB -> GA: request X
- (6) GA -> GB: send X
- (7) GB -> B: send done(X)
- Example
- handle X = GA.17
- data behind X = <some_huge_file>
[Figure: actors A and B in PTII space, grid services GA and GB in Grid space, message arrows numbered 1 to 7]
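The handle idea can be traced in a few lines: actors A and B only ever exchange a small handle string, while the payload moves directly between the grid services. This is a sketch under stated assumptions; the GridService class, its methods, and the log format are hypothetical and do not correspond to any Kepler or Globus API.

```python
class GridService:
    """Hypothetical grid-side store; stands in for GA or GB."""

    def __init__(self, name):
        self.name = name
        self.store = {}       # handle -> payload held at this service
        self.next_id = 17     # arbitrary start, echoing the GA.17 example

    def get_handle(self, payload):
        """Steps 1-2: register the payload, hand back a small handle."""
        handle = f"{self.name}.{self.next_id}"
        self.next_id += 1
        self.store[handle] = payload
        return handle

    def send(self, handle):
        """Step 6: ship the payload behind a handle held here."""
        return self.store[handle]

    def request(self, handle, services):
        """Steps 4-7: pull the payload from its owner, then acknowledge."""
        owner = handle.split(".")[0]
        self.store[handle] = services[owner].send(handle)   # steps 5 and 6
        return ("done", handle)                             # step 7


def transfer(payload, ga, gb, log):
    """Run the seven-step exchange; log only what crosses PTII space."""
    services = {ga.name: ga, gb.name: gb}
    handle = ga.get_handle(payload)        # 1, 2: A asks GA for a handle
    log.append(("A->B", handle))           # 3: the logical token transfer
    ack = gb.request(handle, services)     # 4-7: data moves GA -> GB
    log.append(("GB->B", ack))
    return handle
```

The log never contains the payload itself, which is the point of the design: the workflow engine sees tokens of constant size regardless of how large the file is.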
49Transparently Grid-Enabling PTII
- Different phases
- Register designed WF (could include external validation service)
- Find suitable grid service hosts for actors
- Pre-stage execution
- Execute (w/ provenance)
- Interactively steer (pause / revise / resume)
- Batch process re-run parts later
- Register/store data products and execution logs
- Kepler implementation choices
- Grid-actors (no change of Director necessary!?) and/or
- Grid-(PN)-director (also need to change actors!?)
- Add grid service host id as actor parameter: A@GA
- Similar for data: myDB@GA
50C-z bg Detach your WF execution!
- Currently in PTII
- tight coupling of WF execution and PTII Java client (also Vergil GUI)
- To-do for Kepler
- detaching WF console (Vergil) from a Grid-aware execution engine
Grid-PN Director!
Transport protocol parameter
Data location parameter
Host location parameter
51Semantic Type-enabling Ptolemy II (OWL here we go :-)
- (Kepler-SEEK Shawn Bowers)
52Semantic Type Extensions
- Take concepts and relationships from an ontology to semantically type the data-in/out ports
- Application: e.g., design support
- smart/semi-automatic wiring, generation of massaging actors
m1 (normalize)
p3
p4
Takes Abundance Count Measurements for Life
Stages
Returns Mortality Rate Derived Measurements for
Life Stages
55Semantic Types
- The semantic type signature
- Type expressions over the (OWL) ontology
m1 (normalize)
p3
p4
SemType m1 Observation itemMeasured.AbundanceCount hasContext.appliesTo.LifeStageProperty
-> DerivedObservation itemMeasured.MortalityRate hasContext.appliesTo.LifeStageProperty
56Extended Type System (here: OWL Semantic Types)
SemType m1 Observation itemMeasured.AbundanceCount hasContext.appliesTo.LifeStageProperty
-> DerivedObservation itemMeasured.MortalityRate hasContext.appliesTo.LifeStageProperty
Substructure association: XML raw-data -(X)Query-> object model -link-> OWL ontology
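Semi-automatic wiring could be supported by a subsumption check: an output port may feed an input port if, for every property the input requires, the output offers the same concept or a subconcept of it. The sketch below is only an illustration. The concept names come from the slides, but the is-a links, the "Measurement" concept, the "kind" key, and the dict encoding of port types are assumptions, and a plain parent lookup stands in for real OWL reasoning.

```python
# Assumed is-a links; the slides name the concepts but not their hierarchy.
SUBCLASS_OF = {
    "DerivedObservation": "Observation",
    "AbundanceCount": "Measurement",
    "MortalityRate": "Measurement",
}


def is_subconcept(concept, ancestor, hierarchy=SUBCLASS_OF):
    """True if concept equals ancestor or reaches it via is-a links."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == ancestor:
            return True
        seen.add(concept)
        concept = hierarchy.get(concept)
    return False


def can_wire(out_type, in_type, hierarchy=SUBCLASS_OF):
    """Port types are dicts mapping a property name to a concept."""
    return all(
        prop in out_type and is_subconcept(out_type[prop], concept, hierarchy)
        for prop, concept in in_type.items()
    )


# Semantic type of actor m1 (normalize), encoded as two port types.
M1_INPUT = {
    "kind": "Observation",
    "itemMeasured": "AbundanceCount",
    "appliesTo": "LifeStageProperty",
}
M1_OUTPUT = {
    "kind": "DerivedObservation",
    "itemMeasured": "MortalityRate",
    "appliesTo": "LifeStageProperty",
}
```

A design tool could call can_wire for each candidate connection and, where it fails, suggest a "massaging" actor whose input and output types bridge the gap.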
57Programming Extensions
- (some lessons from SciDAC/SSDBM demo)
58Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
59Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> TFBS
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
60Cleaned up Process Network PIW
- Back to purely functional dataflow process network
- (also a data streaming model!)
- Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
- no control-flow spaghetti
- data-intensive apps
- free concurrent execution
- free type checking
- automatic support to go from piw(GeneId) to
- PIW = map(piw) over [GeneId]
map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
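Going from piw(GeneId) to map(piw) over a list of gene ids can be sketched as a generic iterator that applies a forward-only sub-workflow to each element. Every step function below is a hypothetical stub that only builds strings; none of them calls GenBank, BLAST, or TRANSFAC, and the thread pool is just one possible way to exploit the independence of the calls.

```python
from concurrent.futures import ThreadPoolExecutor


def genbank_g(gene_id):
    """Stub for the gene sequence lookup."""
    return f"seq({gene_id})"


def blast(gene_seq):
    """Stub: pretend every gene has two candidate promoters."""
    return [f"{gene_seq}-p1", f"{gene_seq}-p2"]


def genbank_p(promoter_id):
    """Stub for the promoter sequence lookup."""
    return f"pseq({promoter_id})"


def promoter_region(promoter_seq):
    """Stub for the promoter region computation."""
    return f"region({promoter_seq})"


def gpr2str(pair):
    """Pretty-print one (promoter id, promoter region) pair."""
    promoter_id, region = pair
    return f"{promoter_id}\t{region}\n"


def piw(gene_id):
    """Forward-only sub-workflow for a single gene id."""
    promoter_ids = blast(genbank_g(gene_id))
    regions = [promoter_region(genbank_p(p)) for p in promoter_ids]
    return "".join(gpr2str(pair) for pair in zip(promoter_ids, regions))


def map_actor(f, items, workers=4):
    """map(f)-style iterator; calls are independent, so they may overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))   # results keep the input order
```

Usage: map_actor(piw, ["7", "8"]) yields one report string per gene id, in input order, without any hand-crafted loop-back wiring in the workflow.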
61Optimization by Declarative Rewriting I
- PIW as a declarative, referentially transparent functional process
- optimization via functional rewriting possible
- e.g. map(f o g) = map(f) o map(g)
- Details
- Technical report & PIW specification in Haskell
map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
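The rewrite from map(f) o map(g) to map(f o g) can be tried on a toy pipeline representation. This is a minimal sketch assuming pure stage functions; the (kind, function) tuple encoding and the fuse_maps and run helpers are hypothetical and not taken from the technical report.

```python
def compose(f, g):
    """f o g: apply g first, then f."""
    return lambda x: f(g(x))


def fuse_maps(pipeline):
    """Rewrite adjacent ('map', f) stages into a single map stage."""
    fused = []
    for kind, fn in pipeline:
        if kind == "map" and fused and fused[-1][0] == "map":
            earlier = fused.pop()[1]
            fused.append(("map", compose(fn, earlier)))  # later after earlier
        else:
            fused.append((kind, fn))
    return fused


def run(pipeline, xs):
    """Interpret the pipeline stage by stage over a list."""
    xs = list(xs)
    for kind, fn in pipeline:
        if kind == "map":
            xs = [fn(x) for x in xs]       # one full pass per map stage
        elif kind == "filter":
            xs = [x for x in xs if fn(x)]
        else:
            raise ValueError(f"unknown stage kind: {kind}")
    return xs
```

Because the stages are referentially transparent, the fused pipeline returns the same result while skipping one intermediate list per merged pair of maps.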
62Optimizing II: Streams & Pipelines
Source: "Real-Time Signal Processing: Dataflow, Visual, and Functional Programming", Hideki John Reekie, University of Technology, Sydney
- Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens), e.g. mapS f . mapS g => mapS (f . g)
63Summary
- Many (most of ours anyways) scientific workflows are dataflows
- lots of workflow standards (messy and not focused on SWF problems)
- should we start a new wave of dataflow standards??
- Importance of clear semantics for
- different MoCs (models of computation: PN, SDF, DE, CT, ...)
- component composition across MoCs
- component interaction
- => Ptolemy II directors
- Kepler
- Based on extensible Ptolemy II system
- Cross-project activity (SEEK, SDM, Ptolemy II, GEON, BIRN, ... and counting)
- Plug-in / interface with your SWF planner, execution engine, grid-WF tool!
64Your Projects Icons <HERE>
65A Note on the Style of these Slides
- Due to lack of time, most of the following slides are by reference only :-)
- "Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world ..."
The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977