Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler)

San Diego Supercomputer Center. ludaesch_at_SDSC.edu. SEEK meeting, UCSB, 10/22-26/2003.



1
Towards Scientific Workflows Based on Dataflow
Process Networks (or from Ptolemy to Kepler)
  • Bertram Ludäscher
  • San Diego Supercomputer Center
  • ludaesch_at_SDSC.edu

2
A Note on the Style of the following Slides
  • Due to lack of time, most of the following slides
    will be by reference only :-)
  • "Each speaker was given four minutes to present
    his paper, as there were so many scheduled -- 198
    from 64 different countries. To help expedite the
    proceedings, all reports had to be distributed
    and studied beforehand, while the lecturer would
    speak only in numerals, calling attention in this
    fashion to the salient paragraphs of his work.
    ... Stan Hazelton of the U.S. delegation
    immediately threw the hall into a flurry by
    emphatically repeating: 4, 6, 11, and therefore
    22; 5, 9, hence 22; 3, 7, 2, 11, from which it
    followed that 22 and only 22!! Someone jumped up,
    saying yes but 5, and what about 6, 18, or 4 for
    that matter; Hazelton countered this objection
    with the crushing retort that, either way, 22. I
    turned to the number key in his paper and
    discovered that 22 meant the end of the world."
  • The Futurological Congress, Stanislaw Lem,
    translated from the Polish by Michael Kandel,
    Futura 1977

3
Acknowledgements
  • NSF, NIH, DOE
  • GEOsciences Network (NSF): www.geongrid.org
  • Biomedical Informatics Research Network (NIH):
    www.nbirn.net
  • Science Environment for Ecological Knowledge
    (NSF): seek.ecoinformatics.org
  • Scientific Data Management Center (DOE):
    sdm.lbl.gov/sdmcenter/

4
Example: Promoter Identification Workflow (PIW)
(simplified)
From the SciDAC/SDM project and collaboration with
Matt Coleman (LLNL)
5
Conceptual Workflow (Promoter Identification
Workflow, PIW)
  • Compute clusters (min. distance)
  • For each promoter
  • Select gene-set (cluster-level)
  • Compute subsequence labels
  • For each gene
  • With all promoter models
  • Compute joint promoter model
6
(No Transcript)
7
Details of the Functional MRI (Magnetic Resonance
Imaging) Analysis Workflow (Jeffrey Grethe)
  • Collect data (K-Space images in Fourier space)
    from MR scanner while subject performs a specific
    task
  • Reconstruct K-Space data to image data (this
    requires scanner parameters for the
    reconstruction)
  • Now have anatomical and functional data
  • Pre-process the functional data
  • Correct for difference in slice acquisition (each
    slice in a volume is collected at a slightly
    different time). Try to correct for these
    differences so that all slices seem to be
    acquired at same time
  • Now correct for subject motion (head movement in
    scanner) by realigning all functional images
  • Register the functional images with the
    anatomical image → all images are now in the same
    space (all aligned with one another)
  • Move all subjects into template space through
    non-linear spatial normalization. There exist
    atlas templates (made from many subjects) that
    one can normalize to so that all subjects are in
    the same space, allowing for direct comparison
    across subjects.
  • DATA VERIFICATION - check if all these procedures
    worked. If not, go back and try again (possibly
    tweaking some parameters for the routines or by
    re-doing some of it by hand).
  • Move on to statistics. First we do single-subject
    statistics; in addition to the images,
    information about the experimental paradigm is
    required. The results can be overlaid onto an
    anatomical image to create visual displays of
    brain activation during a particular task.
  • Can also combine statistical data from multiple
    subjects and do a group/population analysis and
    display these results.
  • → Interactive nature of these workflows is
    critical (data verification) - can these steps be
    automated or semi-automated?
  • → Need metadata from collection equipment and
    experimental design!

8
GARP Invasive Species Pipeline
From NSF SEEK (Deana Pennington et al.)
9
Scientific Workflows: Some Findings
  • More dataflow than workflow
  • but some branching, looping, merging, ...
  • not documents/objects undergoing modifications
  • instead: dataset-out = analysis(dataset-in)
  • Need for collection/functional-style
    programming (FP)
  • Iterations over lists (foreach); filtering;
    functional composition; generic higher-order
    operations (zip, map(f), ...)
  • Need for abstraction and nested workflows
  • Need for data transformations (compute/transform
    alternations)
  • Need for rich user interaction / steering
  • pause / resume
  • select branch; e.g., web browser capability at
    specific steps as part of a coordinated SWF
  • Need for high-throughput transfers
    (grid-enabling, streaming)
  • Need for persistence of intermediate products
  • → data provenance (virtual data; cf. several
    ITR and e-Science projects)

10
(Analytical) Pipelines . (Scientific) Workflows
  • Spectrum of languages & formalisms
  • Pipelines (a la Unix)
  • Dataflow languages
  • Synchronous dataflow networks (SDF)
  • Kahn's process networks (PN)
  • Web page-flow
  • Active XML, WebML, ...
  • Hesitating-weak-alternating-tree-automata-ML
  • (Business) Workflows
  • WfMC's XPDL, WSFL, BPEL4WS, ...

11
Business Workflows
  • Business Workflows
  • show their office automation ancestry
  • documents and work-tasks are passed
  • no data streaming, data-intensive pipelines
  • lots of standards to choose from: WfMC, BPML,
    BPEL4WS, ..., XPDL, ...
  • but often no clear semantics for constructs as
    simple as this

Source: Expressiveness and Suitability of
Languages for Control Flow Modelling in
Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
12
The ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al.,
http://tmitwww.tm.tue.nl/research/patterns/
13
More on Scientific WF vs Business WF
  • Business WF
  • Tasks, documents, etc. undergo modifications
    (e.g., flight reservation from reserved to
    ticketed), but modified WF objects still
    identifiable throughout
  • Complex control flow, task-oriented
  • Transactions w/o rollback (ticket reserved →
    purchased)
  • SWF
  • data-in and data-out of an analysis step are not
    the same object!
  • dataflow, data-oriented (cf. AVS/Express, Khoros,
    ...)
  • re-run automatically (a la distrib. comp., e.g.
    Condor) or user-driven/interactively (based on
    failure type)
  • data integration & semantic mediation as part of
    SWF framework!

14
SWF vs Distributed Computing
  • Distributed Computing (e.g. a la Condor(-G))
  • Batch oriented
  • Transparent distributed computing (remote
    Unix/Java standard/Java universes in Condor)
  • HPC resource allocation & scheduling
  • SWF
  • Often highly interactive for decision
    making/steering of the WF and visualization (data
    analysis)
  • Transparent data access (Grid) and integration
    (database mediation & semantic extensions)
  • Desktop metaphor (microworkflow!?); often (but
    not always!) light-weight web service invocation

15
Ptolemy-II
  • Recommendations following:
  • must read
  • must see (now: snippets following; watch for new
    ways to compress slides :-)
  • must try
  • Bottom line:
  • a sophisticated system to do simple things
    (dataflows) as well as highly complex things
    (hybrid models)
  • (compare to your favorite
    standard/approach/system)

16
Dataflow Process Networks and Ptolemy-II
see!
read!
try!
Source: Edward Lee et al.,
http://ptolemy.eecs.berkeley.edu/ptolemyII/
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
In our (SEEK) terminology: think of it as a
Workflow Execution Model
21
(No Transcript)
22
Our SEEK/ SciDAC/Kepler extensions here!
23
Some Glimpses on the PT-II Execution Models
(Domains)
24
Kahn Process Networks (PN)
  • Concurrent processes communicating through
    one-way FIFO channels with unbounded capacity
  • A functional process F maps a set of input
    sequences into a set of output sequences (sounds
    like XSM!)
  • increasing chain of sets of sequences → outputs
    may not increase!
  • Consider increasing chains (wrt. prefix ordering ⊑)
  • PN is continuous if lub(Xs) exists for all
    increasing chains Xs and
  • F(lub(Xs)) = lub(F(Xs))
  • Continuous implies monotonic:
  • if Xs ⊑ Xs' then F(Xs) ⊑ F(Xs')

25
Process Networks (contd.)
  • PN in essence: simultaneous relations between
    sequences
  • Network of functional processes can be described
    by a mapping
  • X = F(X, I)
  • X denotes all the sequences in the network
    (inputs I and outputs)
  • X that forms a solution is a fixed point
  • Continuity implies exactly one minimal fixed
    point
  • minimal in the sense of prefix ordering, for any
    inputs I
  • execution of the network: given I, find
    the minimal fixed point (works because of the
    monotonic property)
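
The fixed-point reading of X = F(X, I) can be tried out directly in Haskell, where lazy lists can stand in for (possibly infinite) token sequences. The following is only a minimal sketch under that assumption: the tiny feedback network (an adder whose output is fed back through a unit delay) and the names `Stream`, `delay`, `adder`, `step`, and `network` are invented for illustration and are not part of Ptolemy-II.

```haskell
import Data.Function (fix)

-- A channel carries a (possibly infinite) sequence of tokens.
type Stream a = [a]

-- Functional process: emit one initial token, then copy the input.
delay :: a -> Stream a -> Stream a
delay x0 xs = x0 : xs

-- Functional process: add two input streams token by token.
adder :: Stream Integer -> Stream Integer -> Stream Integer
adder = zipWith (+)

-- The network equation X = F(X, I): the internal stream x depends on
-- the external input i and, through the feedback loop, on itself.
step :: Stream Integer -> Stream Integer -> Stream Integer
step i x = delay 0 (adder i x)

-- Executing the network means taking the least fixed point of (step i).
network :: Stream Integer -> Stream Integer
network i = fix (step i)
```

With this definition `network [1,2,3]` evaluates to `[0,1,3,6]` (running sums), and `network [1,2]` to `[0,1,3]`: a prefix of the input yields a prefix of the output, which is the monotonicity the slide relies on. Lazy evaluation computes the fixed point incrementally, so `take 5 (network (repeat 1))` terminates even though the input is infinite.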

26
Synchronous Data Flow Networks (SDF)
  • Special case of PN
  • Ptolemy-II SDF overview
  • SDF supports efficient execution of dataflow
    graphs that lack control structures
  • with control structures → Process Networks (PN)
  • requires that the rates on the ports of all
    actors be known beforehand
  • do not change during execution
  • in systems with feedback, delays (represented by
    initial tokens on relations) must be explicitly
    noted → SDF uses this rate and delay information
    to determine the execution sequence of the actors
    before execution begins.
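
Because all port rates are fixed and known, how often each actor fires per schedule iteration can be computed before execution by solving the balance equations (tokens produced on a relation per iteration = tokens consumed). The sketch below does this for the simplest case only, a linear chain of actors without feedback or initial tokens; it is not Ptolemy-II's scheduler, and the names `Edge` and `repetitions` are assumptions made for this example. All rates are assumed to be positive.

```haskell
import Data.Ratio ((%), numerator, denominator)

-- One relation in a chain A0 -> A1 -> ... -> An: the upstream actor
-- produces `produced` tokens per firing, the downstream actor
-- consumes `consumed` tokens per firing (both assumed > 0).
data Edge = Edge { produced :: Integer, consumed :: Integer }

-- Solve q[k] * produced[k] = q[k+1] * consumed[k] for every edge k
-- and return the smallest positive integer firing counts q.
repetitions :: [Edge] -> [Integer]
repetitions edges = map (\q -> numerator (q * fromInteger scale)) qs
  where
    -- Fix q[0] = 1 and propagate rational firing counts down the chain.
    qs :: [Rational]
    qs = scanl (\q (Edge p c) -> q * (p % c)) 1 edges
    -- Multiplying by the lcm of all denominators clears the fractions.
    scale = foldr (lcm . denominator) 1 qs
```

For example, `repetitions [Edge 2 3, Edge 1 2]` yields `[3,2,1]`: three firings of the first actor produce six tokens, which the second actor consumes in two firings, and so on. A full SDF scheduler would additionally handle arbitrary graphs, check consistency, and use initial tokens to order firings in feedback loops.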

27
Extended Kahn-MacQueen Process Networks
  • A process is considered active from its creation
    until its termination
  • An active process can block when trying to read
    from a channel (read-blocked), when trying to
    write to a channel (write-blocked) or when
    waiting for a queued topology change request to
    be processed (mutation-blocked)
  • A deadlock is when all the active processes are
    blocked
  • real deadlock: all the processes are blocked on a
    read
  • artificial deadlock: all processes are blocked,
    at least one process is blocked on a write →
    increase the capacity of the receiver with the
    smallest capacity amongst all the receivers on
    which a process is blocked on a write. This
    breaks the deadlock.
  • If the increase results in a capacity that
    exceeds the value of maximumQueueCapacity, then
    instead of breaking the deadlock, an exception is
    thrown. This can be used to detect erroneous
    models that require unbounded queues.
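
The deadlock rules above can be written down as a small decision function. This is a hedged sketch of the classification only, not Ptolemy-II code: the types `Block` and `Verdict`, the function `resolve`, and the doubling growth policy are assumptions for illustration (the slide only says the capacity is increased), mutation-blocking is left out, and capacities are assumed positive. The `maxCap` argument plays the role of maximumQueueCapacity.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Why a process is blocked. WriteBlocked carries the index of the
-- full receiver it is trying to write to.
data Block = ReadBlocked | WriteBlocked Int
  deriving (Eq, Show)

data Verdict
  = Running               -- at least one process can still proceed
  | RealDeadlock          -- every process is blocked on a read
  | Grow Int Int          -- artificial deadlock: receiver index, new capacity
  | CapacityExceeded Int  -- growing this receiver would pass the maximum
  deriving (Eq, Show)

-- states: Nothing for a runnable process, Just b for a blocked one.
-- caps:   current capacity of each receiver, indexed from 0.
resolve :: Int -> [Int] -> [Maybe Block] -> Verdict
resolve maxCap caps states =
  case sequence states of
    Nothing -> Running                      -- someone is not blocked
    Just blocks ->
      case [r | WriteBlocked r <- blocks] of
        []   -> RealDeadlock                -- only read-blocked processes
        full ->
          let r      = minimumBy (comparing (caps !!)) full
              newCap = 2 * (caps !! r)      -- assumed growth policy
          in if newCap > maxCap
               then CapacityExceeded r
               else Grow r newCap
```

With receivers of capacity `[4,2]` and both processes write-blocked, `resolve 8 [4,2] [Just (WriteBlocked 0), Just (WriteBlocked 1)]` returns `Grow 1 4`, i.e. the smaller queue is enlarged; with `maxCap = 3` the same state returns `CapacityExceeded 1`, the case in which the slide says an exception is thrown.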

28
Towards SciMod/SDMSWE/Kepler/
  • (my vote is for Kepler)

29
Scientific Workflows = Dataflow Process Networks
+ X
  • Kepler = current Ptolemy-II plus X, where X =
  • Extended type system (structural & semantic
    extensions)
  • Collection programming extensions
    (declarative/FP) and
  • Rich user interactions/workflow steering
  • Rich data transformations (compute/transform
    alternations)
  • (Eco-)Grid extensions
  • Actors as web/grid services
  • 3rd party data transfer, high-throughput data
    streaming
  • Data and service repositories, discovery
  • Data provenance
  • (semi-)automatic meta-data creation
  • What else???
  • minus upcoming Ptolemy-II extensions!
  • The slower we are, the less we have to do
    ourselves :-)

30
Extended Type System (here: OWL Semantic Types)
SemType m1: Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty →
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
Substructure association XML raw-data
(X)Query object model link OWL ontology
31
Actor Repositories (here: a commercial tool)
See why we said user-definable (or
auto-generated) actor libraries?
32
Collection Programming (some lessons from
SciDAC/SSDBM demo)
33
Promoter Identification Workflow in
Ptolemy-II (SSDBM03)
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
34
Promoter Identification Workflow in FP
    genBankG       :: GeneId -> GeneSeq
    genBankP       :: PromoterId -> PromoterSeq
    blast          :: GeneSeq -> [PromoterId]
    promoterRegion :: PromoterSeq -> PromoterRegion
    transfac       :: PromoterRegion -> TFBS
    gpr2str        :: (PromoterId, PromoterRegion) -> String

    d0 = Gid "7"                -- start with some gene-id
    d1 = genBankG d0            -- get its gene sequence from GenBank
    d2 = blast d1               -- BLAST to get a list of potential promoters
    d3 = map genBankP d2        -- get list of promoter sequences
    d4 = map promoterRegion d3  -- compute list of promoter regions and ...
    d5 = map transfac d4        -- ... get transcription factor binding sites
    d6 = zip d2 d4              -- create list of pairs promoter-id/region
    d7 = map gpr2str d6         -- pretty print into a list of strings
    d8 = concat d7              -- concat into a single "file"
    d9 = putStr d8              -- output that file
35
Simplified Process Network PIW
  • Back to purely functional dataflow process
    network
  • (a data streaming model!)
  • Re-introducing map(f) to Ptolemy-II (was there in
    PT Classic)
  • no control-flow spaghetti
  • data-intensive apps
  • free concurrent execution
  • free type checking
  • automatic support to go from piw(GeneId) to
    PIW = map(piw) over GeneId

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
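
To make the step from piw(GeneId) to map(piw) concrete, here is a minimal runnable sketch of the sub-workflow as a single function. Everything below the type names is assumed: the services (`genBankG`, `blast`, `genBankP`, `promoterRegion`) are stubs that only build strings, standing in for the real GenBank/BLAST calls, and the `transfac` step is omitted because its result does not feed the printed output.

```haskell
-- Stub data types; the real workflow would carry sequence data.
newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String

-- Stub services (assumptions): each just tags its input.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("g" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ suffix) | suffix <- ["-p1", "-p2"]]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq(" ++ p ++ ")")

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region(" ++ s ++ ")")

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ "\t" ++ r ++ "\n"

-- The forward-only sub-workflow for one gene.
piw :: GeneId -> String
piw g = concatMap gpr2str (zip promoters regions)
  where
    promoters = blast (genBankG g)
    regions   = map (promoterRegion . genBankP) promoters

-- Lifting the sub-workflow to a collection of genes is just map.
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

With these stubs, `putStr (piw (Gid "7"))` prints two tab-separated lines, one per promoter returned by the stubbed `blast`. Since `piw` has no side effects, `piwAll` needs no hand-crafted control flow: the type checker verifies the plumbing, and the individual `piw` applications are independent of one another.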
36
Optimization by Declarative Rewriting I
  • PIW as a declarative, referentially transparent
    functional process
  • optimization via functional rewriting possible
  • e.g. map(f o g) = map(f) o map(g)
  • Details
  • Technical report & PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
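
The two rewritings named on this slide can be illustrated with ordinary Haskell lists. This is a sketch of the laws themselves, not of the optimizer described in the technical report; `f` and `g` are arbitrary stand-ins for pure analysis steps, and all function names are chosen for this example.

```haskell
-- Two pure steps standing in for arbitrary side-effect-free actors.
g :: Int -> Int
g = (* 2)

f :: Int -> Int
f = (+ 1)

-- Unoptimized: two traversals and one intermediate list.
twoPasses :: [Int] -> [Int]
twoPasses = map f . map g

-- Rewritten with map f . map g = map (f . g): a single traversal.
onePass :: [Int] -> [Int]
onePass = map (f . g)

-- Combination of map and zip: pair up, then apply a binary function.
pairUp :: [Int] -> [Int] -> [Int]
pairUp xs ys = map (uncurry (+)) (zip xs ys)

-- Rewritten with map (uncurry h) . zip = zipWith h: no list of pairs.
pairUpFused :: [Int] -> [Int] -> [Int]
pairUpFused = zipWith (+)
```

Both `twoPasses [1,2,3]` and `onePass [1,2,3]` give `[3,5,7]`. The rewriting is only valid because `f` and `g` are referentially transparent; for actors with side effects the order and number of invocations could matter.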
37
Optimization by Declarative Rewriting II
  • Rewritings require that data transformation
    semantics is known
  • e.g., Haskell-like for FP and SQL (XQuery)-like
    for (XML) database querying

Source: Real-Time Signal Processing: Dataflow,
Visual, and Functional Programming, Hideki John
Reekie, University of Technology, Sydney
38
Data Transformation Actors: Chaining together
web services is easy
  • (NOT)

39
(No Transcript)
40
MAP Data Massaging a la Blue-Titan/Perl
41
Data Transformation Actors: Our Approach
(proposal)
  • Manual
  • XQuery, XSLT, Perl, Python, ... transformation
    actor (development)
  • (Semi-)automatic
  • Semantic-type guided transformation generation
    (research)
  • Also: Web Service Composition is
  • a hot topic
  • a reincarnation of many old ideas
  • (e.g., AI-style planning, born-again functional
    composition, query composition, ...)
  • a separate topic
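
As an illustration of the manual route, a transformation actor can be thought of as a pure function sitting between two compute steps whose data formats do not match. The sketch below is entirely hypothetical: `upstream` and `downstream` stand in for two services, the tab-separated line format is invented, and `toHit` is the hand-written transformation that would otherwise be expressed in XQuery, XSLT, Perl, or Python.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical upstream actor: emits one "name<TAB>score" line per hit.
upstream :: String -> [String]
upstream q = [q ++ "-a\t0.9", q ++ "-b\t0.4", "malformed line"]

-- Hypothetical downstream actor: expects typed (name, score) pairs
-- and keeps the names scoring above a fixed threshold.
downstream :: [(String, Double)] -> [String]
downstream hits = [name | (name, score) <- hits, score > 0.5]

-- The transformation actor: parse one line, rejecting malformed input.
toHit :: String -> Maybe (String, Double)
toHit line =
  case break (== '\t') line of
    (name, '\t' : rest)
      | [(score, "")] <- reads rest -> Just (name, score)
    _ -> Nothing

-- Compute / transform / compute alternation as plain composition.
workflow :: String -> [String]
workflow = downstream . mapMaybe toHit . upstream
```

Here `workflow "q"` evaluates to `["q-a"]`: the malformed line is dropped by the transformation step and the low-scoring hit by the downstream step. The semi-automatic route on the slide would aim to generate a function like `toHit` from the semantic types of the two ports instead of writing it by hand.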

42
User Interaction
  • Browser Actor demo (Ilkay)

43
F I N (addtl. material follows)
  • FYI: Flow-based programming has been
    re-discovered/re-invented several times
  • Flow-based Programming,
    http://www.jpaulmorrison.com/fbp/index.shtm