Scientific Workflows Based on Dataflow Process Networks or from Ptolemy to Kepler or Workflow Consid - PowerPoint PPT Presentation

San Diego Supercomputer Center. ludaesch_at_SDSC.edu. NeSCR Dec-3 -2003 Bertram Ludaescher ... A ZOO of Workflow Standards and Systems. Source: W.M.P. van der ... – PowerPoint PPT presentation

1
Scientific Workflows Based on Dataflow Process
Networks (or: from Ptolemy to Kepler) (or:
Workflow Considered Harmful)
  • Bertram Ludäscher
  • San Diego Supercomputer Center
  • ludaesch@SDSC.edu

2
Overview
  • Scientific Workflow (SWF) Examples
  • SWF Requirements & Characteristics
  • Workflow standards considered harmful for SWF!?
  • Dataflow Process Networks (Ptolemy II)
  • Scientific Workflows (Kepler = Ptolemy II + X)

3
Acknowledgements I
  • NSF, NIH, DOE
  • GEOsciences Network (NSF)
  • www.geongrid.org
  • Biomedical Informatics Research Network (NIH)
  • www.nbirn.net
  • Science Environment for Ecological Knowledge
    (NSF)
  • seek.ecoinformatics.org
  • Scientific Data Management Center (DOE)
  • sdm.lbl.gov/sdmcenter/

4
Acknowledgements II
  • Ilkay Altintas SDM
  • Chad Berkley SEEK
  • Shawn Bowers SEEK
  • Jeffrey Grethe BIRN
  • Christopher H. Brooks Ptolemy II
  • Zhengang Cheng SDM
  • Efrat Jaeger GEON
  • Matt Jones SEEK
  • Edward A. Lee Ptolemy II
  • Kai Lin GEON
  • Bertram Ludaescher BIRN, GEON, SDM, SEEK
  • Stephen Neuendorffer Ptolemy II
  • Mladen Vouk SDM
  • Yang Zhao Ptolemy II
  • Coming soon!?
  • ROADNet, myGrid, GriPhyN, ...

Ptolemy II
5
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
6
Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
Execution Semantics
7
GARP Invasive Species Pipeline
Source: NSF SEEK (Deana Pennington et al., UNM)
8
Rock Mineral Classification Workflow
9
A Look Inside Classification
Finer granularity
Extracted from the mineral composition and this
level's diagram coordinates.
Classifier: locates the point's region.
Diagrams: information and transitions between them.
SVG to polygons.
Displays the point in the diagram for this level.
10
Source: NIH BIRN (Jeffrey Grethe, UCSD)
11
SWF Requirements & Characteristics
  • Scientist friendly "problem solving environment"
  • WF design
  • WF execution
  • WF steering and UI
  • pause / revise / resume / rollback (cf. SCIRun)
  • repositories of reusable components
  • data and WF provenance (virtual data concept)
  • logging, cache reuse/partial re-derive, reports, ...
  • Conceptual modeling support
  • complex data (semantics) support
  • wiring support (cf. web service composition)
  • planning support

12
SWF Requirements & Characteristics
  • "Modeling" support
  • Abstraction, hierarchical modeling
  • Models of Computation (MoC)
  • component interaction & combination of MoCs (cf.
    CCA)
  • WF multi-grain/granola: from powder to boulders (and
    back)
  • Boolean (N)AND, (N)OR, ... vs. chaining together
    Grid-apps
  • Rich data structures and type systems
  • End user "programming" support
  • high-level programming constructs
  • e.g. map/3 for iteration, filter, select, branch,
    merge, ...
  • data transformations
  • legacy tool integration (plug-ins)
  • data streaming
  • How to tame (e.g., starve a dataflow, then
    resume)?
  • → Zauberlehrling's (sorcerer's apprentice) problem
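
The high-level constructs named above (select, branch, merge, plus map for iteration) can be sketched over lazy lists standing in for token streams. This is only an illustration, not Ptolemy II or Kepler code: the function names and the `process` sub-workflow are hypothetical. Because the lists are lazy, a consumer that stops asking for tokens also stops the producer, which is one simple way to "starve" a dataflow.

```haskell
import Data.List (partition)

-- | branch: split one channel into two, routing each token by a predicate.
branch :: (a -> Bool) -> [a] -> ([a], [a])
branch = partition

-- | merge: interleave two channels, alternating while both still have tokens.
merge :: [a] -> [a] -> [a]
merge (x:xs) ys = x : merge ys xs
merge []     ys = ys

-- | select: keep only the tokens of interest (a plain filter).
select :: (a -> Bool) -> [a] -> [a]
select = filter

-- | Hypothetical sub-workflow: scale the even readings, pass the odd ones
-- through unchanged, then merge both branches back into one stream.
process :: [Int] -> [Int]
process readings = merge (map (* 10) evens) odds
  where
    (evens, odds) = branch even readings
```

In GHCi, `process [1 .. 6]` alternates scaled even readings with untouched odd ones, and `take 4 (process [1 ..])` pulls only a finite prefix out of an unbounded source.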

13
SWF Requirements & Characteristics
  • Grid-enabling SWFs
  • transparent use of (remote) resources
  • big data
  • big computation requirements
  • early/late binding of logical to physical
    resources,
  • planning, scheduling,
  • → cf. Chimera, Pegasus, DAGMan, Condor(-G)

14
Scientific Workflows: Some Findings
  • More dataflow than (business) workflow
  • but some branching, looping, merging, ...
  • not documents/objects undergoing modifications
  • instead often dataset-out = analysis(dataset-in)
  • Need for programming extensions
  • Iterations over lists (foreach); filtering;
    functional composition; generic higher-order
    operations (zip, map(f), ...)
  • Need for abstraction and nested workflows
  • Need for data transformations (compute/transform
    alternations)
  • Need for rich user interaction & workflow
    steering
  • pause / revise / resume
  • select & branch; e.g., web browser capability at
    specific steps as part of a coordinated SWF
  • Need for high-throughput transfers
    (grid-enabling, streaming)
  • Need for persistence of intermediate products
  • → data provenance (virtual data concept)

15
Scientific WF vs Business WF
  • Scientific Workflows
  • Dataflow and data transformations
  • Data problems: volume, complexity, heterogeneity
  • Grid-aspects
  • Distributed computation
  • Distributed data
  • User-interactions/WF steering
  • Data, tool, and analysis integration
  • → Dataflow and control-flow are married!
  • Business Workflows
  • Process composition
  • Tasks, documents, etc. undergo modifications
    (e.g., flight reservation from reserved to
    ticketed), but modified WF objects still
    identifiable throughout
  • Complex control flow, task-oriented: travel
    reservations; credit approval
  • → Dataflow and control-flow are divorced!

16
A ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al.,
http://tmitwww.tm.tue.nl/research/patterns/
17
Business Workflows
  • Business Workflows
  • show their office automation ancestry
  • documents and work-tasks are passed
  • no data streaming, no data-intensive pipelines
  • lots of standards to choose from: WfMC, WSFL,
    BPML, BPEL4WS, ..., XPDL, ...
  • but often no clear execution semantics for
    constructs as simple as this

Source: "Expressiveness and Suitability of
Languages for Control Flow Modelling in
Workflows", PhD thesis, Bartosz Kiepuszewski, 2002
18
On Workflow Standards
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
19
Workflow Standards Debunked
Source: "Don't go with the flow: Web services
composition standards exposed", W.M.P. van der
Aalst, in Trends & Controversies: "Web Services -
Been there done that?", IEEE Intelligent Systems,
Jan/Feb 2003 issue
20
Workflow Standards Debunked
Source: "Don't go with the flow: Web services
composition standards exposed", W.M.P. van der
Aalst, in Trends & Controversies: "Web Services -
Been there done that?", IEEE Intelligent Systems,
Jan/Feb 2003 issue
21
But never mind the standards discussion: Many
Scientific Workflows are Dataflows!
  • (Check YOUR examples )

22
Commercial Workflow/Dataflow Systems
23
SCIRun: Component-Based Problem Solving
Environments for Large-Scale Scientific Computing
  • SCIRun: problem solving environment for
    interactive construction, debugging, and steering
    of large-scale scientific computations
  • Component model, based on generalized dataflow
    programming
  • Source: Steve Parker (cs.utah.edu), SciDAC/SDM
    collaboration

24
Workflow and distributed computation grid created
with Kensington Discovery Edition from
InforSense.
25
Dataflow Process Networks: Putting Computation
Models First!
  • Synchronous Dataflow Network (SDF)
  • Statically schedulable single-threaded dataflow
  • Can execute multi-threaded, but the
    firing-sequence is known in advance
  • Maximally well-behaved, but also limited
    expressiveness
  • Process Network (PN)
  • Multi-threaded dynamically scheduled dataflow
  • More expressive than SDF (dynamic token rate
    prevents static scheduling)
  • Natural streaming model
  • Other Execution Models (Domains)
  • Implemented through different Directors

advanced push/pull
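
Why an SDF firing sequence is "known in advance" can be illustrated with the standard balance equations: for every channel, the tokens produced per schedule period must equal the tokens consumed. The sketch below solves them for a linear pipeline only. The `Edge` type and `repetitions` function are hypothetical names, not part of Ptolemy II, and all rates are assumed to be positive.

```haskell
import Data.Ratio (denominator, numerator, (%))

-- | One channel of a linear pipeline: tokens produced per firing of the
-- upstream actor and tokens consumed per firing of the downstream actor.
-- Both rates are assumed to be positive.
data Edge = Edge { produced :: Integer, consumed :: Integer }

-- | Smallest positive firing counts r_0 .. r_n such that, for every channel i,
-- r_i * produced_i == r_(i+1) * consumed_i (the SDF balance equations).
repetitions :: [Edge] -> [Integer]
repetitions edges = map (\r -> numerator (r * fromInteger scale)) ratios
  where
    -- firing rate of each actor relative to the first one
    ratios = scanl step 1 edges :: [Rational]
    step r (Edge p c) = r * (p % c)
    -- clearing all denominators yields the smallest integer solution
    scale = foldr (lcm . denominator) 1 ratios
```

For a three-actor pipeline with rates 2:3 and 1:2, `repetitions [Edge 2 3, Edge 1 2]` gives the firing counts of one schedule period. A PN actor with data-dependent token rates has no such fixed solution, which is why it needs dynamic scheduling.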
26
Dataflow Process Networks and Ptolemy-II
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
27
Why Ptolemy-II?
  • PTII Objective
  • The focus is on assembly of concurrent
    components. The key underlying principle in the
    project is the use of well-defined models of
    computation that govern the interaction between
    components. A major problem area being addressed
    is the use of heterogeneous mixtures of models of
    computation.
  • Data & Process oriented
  • Dataflow process networks
  • Natural Data Streaming Support
  • End user WF console (Vergil GUI)
  • PRAGMATICS
  • mature, actively maintained, well-documented
  • open source system
  • leverage sister projects' activities (e.g., SEEK,
    SDM, BIRN, ...)

28
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
29
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
30
Marrying & Divorcing Control- & Dataflow
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
31
Another Goodie: Ptolemy-II Type System
32
Support for Multiple Workflow Granularities
Boulders
Plumbing
Powder
Abstraction: Sand to Rocks
Sand
33
Scientific Workflows = Dataflow Process Networks + X
Kepler = Ptolemy-II + X
  • X:
  • Database plug-ins
  • Legacy application plug-ins (via command line, as
    web services, ...)
  • Grid extensions
  • Actors as web/grid services
  • 3rd party data transfer, high-throughput data
    streaming
  • Dealing with thousands of files (cf.
    astrophysics, astronomy, HEP, ... examples)
  • Data and service repositories, discovery
  • Extended type system (structural & semantic
    extensions)
  • Programming extensions (declarative/FP) and
  • Rich user interactions/workflow steering
  • Rich data transformations (compute/transform
    alternations)
  • Data provenance
  • (semi-)automatic meta-data creation

34
Status update / specific tasks for Kepler: DONE,
ONGOING, NEW
  • User interaction, workflow steering
  • Pause/revise/resume
  • BrowserUI actor (browser as a 0-learning
    display and selection tool)
  • Distributed execution
  • Dynamically port-specializing WSDL actor
  • Dynamically specializing Grid service actor
  • Port & actor type extensions (SEEK leverage)
  • Structural types (XML Schema)
  • Semantic types (OWL) incl. unit types w/
    automatic conversion
  • Programming extensions
  • Data transformation actors (XSLT, XQuery,
    Python, Perl, ...)
  • map, zip, zipWith, ..., loop, switch patterns
  • Specialized Data Sources
  • EML (SEEK), ...
  • MS Access (GEON), JDBC, ...
  • XML, NetCDF, ...

35
Some specific tasks for Kepler (all NEW)
  • Design & develop transparent, Grid-enabled PNs
  • Communication protocol details
  • Grid-actor extensions and/or
  • Grid-Process Network director (G-PN)
  • Host/Source-location becomes actor parameter
  • add active-inline parameter display for
    grid-actors (@exec-loc), channels
    (@transport-protocol), source-actors
    (@src-loccatalog-loc)
  • Activity Monitoring
  • Add activity status display (green, yellow,
    red) to replace PtII animation (needed for
    concurrently executing PN!)
  • Registration & Deployment mechanisms
  • Actor/Data/Workflow repository (composite
    actors)
  • Shows up as (configurable) actor library
  • OGSA Service Registry approach? (SEEK leverage;
    UDDI is complex and limited, says MattJ)
  • http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf
  • Extensions to deal with failures (fault tolerance)

36
Example: Database actors for Ptolemy II
  • (Kepler-GEON: Efrat Jaeger)

37
Database Actors
  • Database Connection actor
  • Database Query actor

38
Database Actors Example
39
Database Actors Example
40
Example: Web service-enabling Ptolemy II
  • (Kepler-SDM: Ilkay Altintas)

41
A Generic Web Service Actor
42
Set Parameters and Commit → Specialized Actor
Set parameters and commit
43
Web Service Actor after Instantiation
44
Composing Third-Party Web Services
Input of next web service
User interaction & Transformations
45
Results of the Execution
User I/O via standard browser!
Run Window / WF Deployment
46
Composing Legacy Applications (here: Phylogeny)
Shell / Command-Line Actors
Source: Alex Borchers, UCSD
47
Example: Grid-enabling Ptolemy II
  • (Kepler-SEEK: Chad Berkley,
  • Kepler-SDM: Ilkay Altintas,
  • myGrid?,
  • GriPhyN?,
  • OGSA-DAI, ...)

48
Transparently Grid-Enabling PTII: Handles
Logical token transfer (3) requires
get_handle(1,2) then exec_handle(4,5,6,7) for
completion.
  • (1) A → GA: get_handle
  • (2) GA → A: return X
  • (3) A → B: send X
  • (4) B → GB: request X
  • (5) GB → GA: request X
  • (6) GA → GB: send X
  • (7) GB → B: send done(X)
  • Example:
  • handle: X = GA.17
  • data behind X: <some_huge_file>
[Figure: actors A and B in PTII space, grid hosts GA and GB in Grid space, with arrows numbered 1-7]
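
The handle protocol above can be sketched as a pure model in which only a small handle travels between the actors while the payload moves from grid host to grid host. Everything here is an assumption made for illustration: the `Handle` and `Grid` types and the signatures of `getHandle` and `execHandle` are hypothetical and do not correspond to an actual Kepler or Globus API.

```haskell
import qualified Data.Map as Map

type Host = String

-- | A handle names a data item by the grid host that stores it and a local
-- slot, e.g. Handle "GA" 17 for the slide's X = GA.17.
data Handle = Handle { hostOf :: Host, slot :: Int }
  deriving (Eq, Show)

-- | Storage of all grid hosts: host -> (slot -> payload).
type Grid = Map.Map Host (Map.Map Int String)

-- | Steps 1-2: actor A asks its grid host for a handle to an item stored there.
getHandle :: Host -> Int -> Grid -> Maybe Handle
getHandle host s grid = do
  store <- Map.lookup host grid
  _     <- Map.lookup s store
  pure (Handle host s)

-- | Steps 4-7: actor B asks its grid host `dest` to resolve the handle. The
-- payload is copied host-to-host and never passes through the actors.
execHandle :: Host -> Handle -> Grid -> Maybe Grid
execHandle dest h grid = do
  payload <- Map.lookup (hostOf h) grid >>= Map.lookup (slot h)
  let local = Map.findWithDefault Map.empty dest grid
  pure (Map.insert dest (Map.insert (slot h) payload local) grid)
```

Step 3, the logical token transfer from A to B, is then nothing more than passing the `Handle` value along the PTII channel.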
49
Transparently Grid-Enabling PTII
  • Different phases
  • Register designed WF (could include external
    validation service)
  • Find suitable grid service hosts for actors
  • Pre-stage execution
  • Execute (w/ provenance)
  • Interactively steer (pause / revise / resume)
  • Batch process re-run parts later
  • Register/store data products and execution logs
  • Kepler implementation choices
  • Grid-actors (no change of Director necessary!?)
    and/or
  • Grid-(PN)-director (also need to change actors!?)
  • Add grid service host id as actor parameter: A@GA
  • Similar for data: myDB@GA

50
C-z bg Detach your WF execution!
  • Currently in PTII
  • tight coupling of WF execution and PTII Java
    client (also Vergil GUI)
  • To-do for Kepler
  • detaching WF console (Vergil) from a Grid-aware
    execution engine

Grid-PN Director!
Transport protocol parameter
Data location parameter
Host location parameter
51
Semantic Type-enabling Ptolemy II (OWL, here we
go ;-)
  • (Kepler-SEEK: Shawn Bowers)

52
Semantic Type Extensions
  • Take concepts and relationships from an ontology
    to semantically type the data-in/out ports
  • Application: e.g., design support
  • smart/semi-automatic wiring, generation of
    massaging actors

m1 (normalize)
p3
p4
Takes Abundance Count Measurements for Life
Stages
Returns Mortality Rate Derived Measurements for
Life Stages
55
Semantic Types
  • The semantic type signature
  • Type expressions over the (OWL) ontology

m1 (normalize)
p3
p4
SemType m1: Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty ->
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
56
Extended Type System (here OWL Semantic Types)
SemType m1: Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty ->
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
Substructure association: XML raw-data
=(X)Query=> object model =link=> OWL ontology
Programming Extensions
  • (some lessons from SciDAC/SSDBM demo)

58
Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
59
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> TFBS
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
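
To try the specification above without GenBank, BLAST, or TRANSFAC, the signatures can be given placeholder bodies. All function bodies below are hypothetical stubs that only build strings; the list result type of `blast` is inferred from its use with `map`. The pipeline itself follows the d0 to d8 steps, packaged as a single function `piw`.

```haskell
newtype GeneId = Gid String

type GeneSeq        = String
type PromoterId     = String
type PromoterSeq    = String
type PromoterRegion = String
type TFBS           = String

-- Hypothetical stubs standing in for the real services.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = "seq-of-gene-" ++ g

blast :: GeneSeq -> [PromoterId]
blast s = [s ++ "/p1", s ++ "/p2"]

genBankP :: PromoterId -> PromoterSeq
genBankP p = "seq(" ++ p ++ ")"

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion s = "region(" ++ s ++ ")"

transfac :: PromoterRegion -> TFBS
transfac r = "tfbs@" ++ r

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (p, r) = p ++ "\t" ++ r ++ "\n"

-- | Steps d0..d8 as one function from a gene id to the report text.
piw :: GeneId -> String
piw gid = concat (map gpr2str (zip promoterIds regions))
  where
    promoterIds = blast (genBankG gid)
    regions     = map (promoterRegion . genBankP) promoterIds

-- | Step d5: binding sites per promoter region (not part of the report).
bindingSites :: GeneId -> [TFBS]
bindingSites = map (transfac . promoterRegion . genBankP) . blast . genBankG

-- | Lifting the single-gene workflow to a list of gene ids is a plain map.
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

`putStr (piw (Gid "7"))` then plays the role of d9. Because no step has side effects, the steps can be regrouped freely, as done here by composing `promoterRegion . genBankP` into one map.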
60
Cleaned up Process Network PIW
  • Back to purely functional dataflow process
    network
  • ( also a data streaming model!)
  • Re-introducing map(f) to Ptolemy-II (was there in
    PT Classic)
  • no control-flow spaghetti
  • data-intensive apps
  • free concurrent execution
  • free type checking
  • automatic support to go from piw(GeneId) to
    PIW = map(piw) over [GeneId]

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
61
Optimization by Declarative Rewriting I
  • PIW as a declarative, referentially transparent
    functional process
  • optimization via functional rewriting possible
  • e.g. map(f o g) = map(f) o map(g)
  • Details
  • Technical report PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
62
Optimizing II: Streams & Pipelines
Source: "Real-Time Signal Processing: Dataflow,
Visual, and Functional Programming", Hideki John
Reekie, University of Technology, Sydney
  • Clean functional semantics facilitates algebraic
    workflow (program) transformations
    (Bird-Meertens), e.g. mapS f . mapS g ⇒ mapS (f
    . g)
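
The map fusion rewrite can be checked on ordinary lazy lists, used here as a stand-in for Reekie's streams (`mapS` itself is not reproduced). The stage functions `(+ 1)` and `(* 2)` and all names below are illustrative choices. The second pair shows the map-and-zip combination, where `zipWith` replaces a `zip` followed by a `map`.

```haskell
-- | Two conceptual passes over the stream, one per stage.
unfused :: [Int] -> [Int]
unfused = map (* 2) . map (+ 1)

-- | One pass: the stages are composed first, then mapped once.
fused :: [Int] -> [Int]
fused = map ((* 2) . (+ 1))

-- | Pairing two streams and then combining each pair ...
viaZip :: [Int] -> [Int] -> [Int]
viaZip xs ys = map (uncurry (+)) (zip xs ys)

-- | ... is the same as combining them directly.
viaZipWith :: [Int] -> [Int] -> [Int]
viaZipWith = zipWith (+)

-- | Both pipelines agree on the first n tokens of an unbounded stream.
agreeOn :: Int -> Bool
agreeOn n = take n (unfused [1 ..]) == take n (fused [1 ..])
```

In GHCi, `take 3 (fused [1 ..])` yields `[4,6,8]`, and `agreeOn 1000` compares the fused and unfused forms on a longer prefix. Such a check on samples only illustrates the law; its validity for all inputs follows from the definition of `map`.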

63
Summary
  • Many (most of ours anyways) scientific workflows
    are dataflows
  • lots of workflow standards (messy and not
    focused on SWF problems)
  • should we start a new wave of dataflow
    standards??
  • Importance of clear semantics for
  • different MoCs (models of computation: PN, SDF,
    DE, CT, ...)
  • component composition across MoCs
  • component interaction
  • → Ptolemy II directors
  • Kepler
  • Based on extensible Ptolemy II system
  • Cross-project activity (SEEK, SDM, Ptolemy II,
    GEON, BIRN, and counting)
  • Plug-in / interface with your SWF planner,
    execution engine, grid-WF tool!

64
Your Project's Icons <HERE>
65
A Note on the Style of these Slides
  • Due to lack of time, most of the following slides
    are by reference only ;-)
  • "Each speaker was given four minutes to present
    his paper, as there were so many scheduled -- 198
    from 64 different countries. To help expedite the
    proceedings, all reports had to be distributed
    and studied beforehand, while the lecturer would
    speak only in numerals, calling attention in this
    fashion to the salient paragraphs of his work.
    ... Stan Hazelton of the U.S. delegation
    immediately threw the hall into a flurry by
    emphatically repeating: 4, 6, 11, and therefore
    22; 5, 9, hence 22; 3, 7, 2, 11, from which it
    followed that 22 and only 22!! Someone jumped up,
    saying yes but 5, and what about 6, 18, or 4 for
    that matter; Hazelton countered this objection
    with the crushing retort that, either way, 22. I
    turned to the number key in his paper and
    discovered that 22 meant the end of the world."
    The Futurological Congress, Stanislaw Lem,
    translated from the Polish by Michael Kandel,
    Futura 1977