The KEPLER Scientific Workflow System

1
The KEPLER Scientific Workflow System
Bertram Ludäscher, Ilkay Altintas and the Kepler Team
San Diego Supercomputer Center, University
of California, San Diego
SDM Center AHM, LBL, August 3-5, 2004
2
Outline
  • Project Overview
  • from Ptolemy II to Kepler
  • Workflow Modeling Issues
  • from Dataflow to Control-flow (CCA et al)
  • Current Kepler Features
  • from plumbing to distributed execution
  • Example Workflows
  • from bioinformatics to geoinformatics
  • Future Plans
  • from today to tomorrow :-)

3
What is a Scientific Workflow (SWF)?
  • Goals
  • automate a scientist's repetitive steps (data
    analysis, data transformation, computational
    steps, ...)
  • can encompass data generation, aggregation,
    analysis, visualization (WF granularity)
  • design, test, share, deploy, execute, reuse,
    SWFs
  • Typical requirements/characteristics
  • data-intensive and/or compute-intensive
  • plumbing-intensive
  • dataflow-oriented
  • distribution (data, processing)
  • user-interaction in the middle, ...
  • vs. (C-z bg fg)-ing (detach and reconnect)
  • advanced programming constructs (map(f), zip,
    takewhile, ...)
  • logging, provenance, registering back
    (intermediate) products
  • easy to recognize a SWF when you see one!
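
The "advanced programming constructs" named above (map(f), zip, takewhile) have direct counterparts in most languages. A minimal Python sketch, assuming made-up station data and a hypothetical `to_celsius` step standing in for a real workflow component:

```python
from itertools import takewhile


def to_celsius(fahrenheit):
    """Hypothetical per-item transformation step (not from the slides)."""
    return (fahrenheit - 32) * 5 / 9


stations = ["A", "B", "C", "D"]          # made-up station ids
readings_f = [32.0, 50.0, 212.0, 14.0]   # made-up readings in Fahrenheit

# map(f): apply one step to every item of a stream
readings_c = list(map(to_celsius, readings_f))

# zip: pair up two streams item by item
paired = list(zip(stations, readings_c))

# takewhile: consume a stream only while a condition holds
below_boiling = list(takewhile(lambda pair: pair[1] < 100.0, paired))
```

In a workflow system each of these would be a reusable higher-order actor rather than a library call; the sketch only illustrates the semantics.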

4
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
5
Source: NIH BIRN (Jeffrey Grethe, UCSD)
6
Ecology: GARP Analysis Pipeline for Invasive
Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
8
Starting Point for SDM-Center/SPA and SEEK:
Ptolemy II
see!
read!
try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
9
An Early Example: Promoter Identification (SSDBM,
AD 2003)
  • Scientist models application as a workflow of
    connected components (actors)
  • If all components exist, the workflow can be
    automated/ executed
  • Different directors can be used to pick an
    appropriate execution model (often pipelined
    execution: PN director)

10
Why Ptolemy II (and thus KEPLER)?
  • Ptolemy II Objective
  • The focus is on assembly of concurrent
    components. The key underlying principle in the
    project is the use of well-defined models of
    computation that govern the interaction between
    components. A major problem area being addressed
    is the use of heterogeneous mixtures of models of
    computation.
  • Dataflow Process Networks w/ natural
    pipelining/streaming support
  • User-Orientation
  • Workflow design & exec console (Vergil GUI)
  • Application/Glue-Ware
  • excellent modeling and design support
  • run-time support, monitoring, ...
  • not a middle-/underware (we use someone else's,
    e.g. Globus, SRB, ...)
  • but middle-/underware is conveniently accessible
    through actors!
  • PRAGMATICS
  • Ptolemy II is mature, continuously extended &
    improved, well-documented (500pp)
  • open source system
  • Ptolemy II folks actively participate in KEPLER

11
KEPLER: An Open Collaboration
  • Founding projects
  • DOE SDM/SPA and NSF SEEK
  • Open Source (BSD-style license)
  • Intensive Communications
  • Web-archived mailing lists
  • IRC (!)
  • Co-development
  • via shared CVS repository
  • joining as a new co-developer (currently)
  • get a CVS account (read-only)
  • local development contribution via existing
    KEPLER member
  • be voted in as a member/co-developer
  • Software social engineering
  • How to better accommodate new groups/communities?
  • How to better accommodate different
    usage/contribution models (core dev, special
    purpose extender, user)?

12
KEPLER/CSP: Contributors, Sponsors, Projects (or
loosely coupled Communicating Sequential Persons
:-)
  • Ilkay Altintas: SDM, Resurgence
  • Kim Baldridge: Resurgence, NMI
  • Chad Berkley: SEEK
  • Shawn Bowers: SEEK
  • Terence Critchlow: SDM
  • Tobin Fricke: ROADNet
  • Jeffrey Grethe: BIRN
  • Christopher H. Brooks: Ptolemy II
  • Zhengang Cheng: SDM
  • Dan Higgins: SEEK
  • Efrat Jaeger: GEON
  • Matt Jones: SEEK
  • Werner Krebs: EOL
  • Edward A. Lee: Ptolemy II
  • Kai Lin: GEON
  • Bertram Ludaescher: BIRN, SDM, SEEK, GEON
  • Mark Miller: EOL
  • Steve Mock: NMI
  • Steve Neuendorffer: Ptolemy II

Ptolemy II
13
History
  • Gabriel (1986-1991)
  • Written in Lisp
  • Aimed at signal processing
  • Synchronous dataflow (SDF) block diagrams
  • Parallel schedulers
  • Code generators for DSPs
  • Hardware/software co-simulators
  • Ptolemy Classic (1990-1997)
  • Written in C++
  • Multiple models of computation
  • Hierarchical heterogeneity
  • Dataflow variants: BDF, DDF, PN
  • C/VHDL/DSP code generators
  • Optimizing SDF schedulers
  • Higher-order components
  • Ptolemy II (1996-2022)
  • Written in Java
  • Domain polymorphism
  • Multithreaded
  • PtPlot (1997-??)
  • Java plotting package
  • Tycho (1996-1998)
  • Itcl/Tk GUI framework
  • Diva (1998-2000)
  • Java GUI framework
  • Copernicus (code generator)
  • KEPLER (2003-2028)
  • scientific workflow extensions

Source (Ptolemy): Edward Lee et al.
http://ptolemy.eecs.berkeley.edu/
14
KEPLER then
15
and KEPLER today
What's a polymorphic actor?
What's a scientific workflow?
What is HPC?
BTW: Kepler is NOT a GUI (Vergil is)
16
The KEPLER/Ptolemy II GUI (Vergil)
Directors define the component interaction &
execution semantics
Large, polymorphic component (Actor) and
Director libraries (drag & drop)
17
Actor-Oriented Design
  • Object orientation
  • Diagram labels: class name, data, methods, call,
    return
  • What flows through an object is sequential
    control (cf. CCA, MPI)
  • Actor/Dataflow orientation
  • Diagram labels: actor name, data (state),
    parameters, ports, input data, output data
  • What flows through an object is a stream of data
    tokens (in SWFs/KEPLER also references!!)

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
18
Object-Oriented vs. Actor-Oriented Interfaces
  • Object Oriented
  • OO interface gives procedures that have to be
    invoked in an order not specified as part of the
    interface definition.
  • Actor/Dataflow Oriented
  • AO interface definition says: Give me text and
    I'll give you speech

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
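
The contrast between the two interface styles can be sketched in a few lines of Python. Everything here is hypothetical: `TextToSpeechObject`, its method names and the `<speech:...>` placeholder output are illustrative assumptions, not Ptolemy II or Kepler code.

```python
class TextToSpeechObject:
    """Object-oriented style: the methods must be called in a particular
    order, but nothing in the interface states that order."""

    def __init__(self):
        self._text = None

    def initialize(self):
        self._text = ""

    def set_text(self, text):
        if self._text is None:
            raise RuntimeError("initialize() must be called first")
        self._text = text

    def synthesize(self):
        return f"<speech:{self._text}>"


def text_to_speech_actor(text_tokens):
    """Actor-oriented style: a stream of text in, a stream of speech out."""
    for text in text_tokens:
        yield f"<speech:{text}>"
```

The actor version exposes only what flows in and out, so there is no hidden call protocol for a client to get wrong.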
19
Ptolemy II Actor-Oriented Modeling
  • Component (actor) interaction semantics not
    hard-wired inside components, but factored out
    in a director
  • Different directors for different modeling and
    execution needs (can even be combined!)
  • Better abstraction, modeling, component reuse, ...
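
A minimal sketch of what "factored out in a director" can mean, assuming a simple linear chain of actors. The two `run_*` functions are hypothetical stand-ins for directors and are far simpler than the real Ptolemy II ones; the point is only that the same unchanged actors run under different execution orders.

```python
def run_staged(actors, inputs, trace):
    """Director 1: each actor processes the whole batch before the next starts."""
    data = list(inputs)
    for actor in actors:
        out = []
        for item in data:
            trace.append((actor.__name__, item))
            out.append(actor(item))
        data = out
    return data


def run_pipelined(actors, inputs, trace):
    """Director 2: each item flows through all actors before the next item enters."""
    results = []
    for item in inputs:
        for actor in actors:
            trace.append((actor.__name__, item))
            item = actor(item)
        results.append(item)
    return results


def double(x):
    return x * 2


def increment(x):
    return x + 1
```

Both directors compute the same results for side-effect-free actors; only the firing order recorded in `trace` differs.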

20
Behavioral Polymorphism in Ptolemy
  • Behavioral polymorphism is the idea that
    components can be defined to operate with
    multiple models of computation and multiple
    middleware frameworks.
  • These polymorphic methods implement the
    communication semantics of a domain in Ptolemy
    II. The receiver instance used in communication
    is supplied by the director, not by the
    component. (cf. CCA, WS-??, GBPL4??, !)
  • Diagram labels: producer actor, IOPort, Receiver,
    consumer actor

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
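
The idea that the director, not the component, supplies the receiver can be sketched as follows. This is an illustrative Python toy, not the Ptolemy II Java API: the class names and the `put`/`get`/`has_token` methods are assumptions loosely modeled on the description above.

```python
from collections import deque


class FifoReceiver:
    """Buffers every token in arrival order (dataflow-like channel)."""

    def __init__(self):
        self._queue = deque()

    def put(self, token):
        self._queue.append(token)

    def has_token(self):
        return bool(self._queue)

    def get(self):
        return self._queue.popleft()


class LatestValueReceiver:
    """Keeps only the most recent token (hypothetical overwrite semantics)."""

    def __init__(self):
        self._slot = []

    def put(self, token):
        self._slot = [token]

    def has_token(self):
        return bool(self._slot)

    def get(self):
        return self._slot.pop()


class Director:
    """Decides which communication semantics every connection gets."""

    def __init__(self, receiver_class):
        self._receiver_class = receiver_class

    def new_receiver(self):
        return self._receiver_class()


def connect_and_send(director, tokens):
    receiver = director.new_receiver()   # chosen by the director, not the actors
    for token in tokens:                 # producer side
        receiver.put(token)
    received = []
    while receiver.has_token():          # consumer side
        received.append(receiver.get())
    return received
```

The producer and consumer code in `connect_and_send` is identical in both cases; swapping the director changes what the consumer observes.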
21
Component Composition & Interaction
  • Components linked via ports
  • Dataflow (and msg/ctl-flow)
  • Where is the component interaction semantics
    defined??
  • each component is its own director!
  • But still useful for special applications, e.g.
    parallel programs (MPI, ...)

Source: GRIST/SC4DEVO workshop, July 2004, Caltech
22
Data/Control-Flow Spectrum
Spectrum labels: clean data(ctl)-flow; special tokens
flow; message passing, control flow
  • Data (tokens) flow
  • (almost) no other side effects
  • WYSIWYG (usually)
  • References flow
  • token reference type may be http-get,
    ftp-get, hsi put
  • generic handling still possible
  • Application specific tokens flow
  • e.g. current Nimrod job management in Resurgence
  • invisible contract between components
  • Director is unaware of what's going on (sounds
    familiar? :-)
  • Specific message passing protocols (e.g., CSP,
    MPI)
  • for systems of tightly coupled components

23
CCA via special ("look the other way")
Director(s)?
  • Dataflow in CCA
  • a CCA convention can be used to accommodate
    actor-oriented/dataflow modeling
  • CCA/Message Passing in KEPLER
  • Kepler/Ptolemy can be extended to accommodate
    message passing semantics (CSP is already in
    Ptolemy II)

24
Domains and Directors: Semantics for Component
Interaction
  • CI: Push/pull component interaction
  • CSP: concurrent threads with rendezvous
  • CT: continuous-time modeling
  • DE: discrete-event systems
  • DDE: distributed discrete events
  • FSM: finite state machines
  • DT: discrete time (cycle driven)
  • Giotto: synchronous periodic
  • GR: 2-D and 3-D graphics
  • PN: process networks
  • SDF: synchronous dataflow
  • SR: synchronous/reactive
  • TM: timed multitasking

For (finer-grained) concurrent jobs!?
For (coarse grained) Scientific Workflows!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
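
Of the domains listed, process networks (PN) map most directly onto ordinary threads: each actor runs its own loop and blocks when an input channel is empty. A minimal Python sketch using only the standard library; the `STOP` sentinel and the function names are assumptions added here so that the example terminates, and they are not part of any PN director.

```python
import queue
import threading

STOP = object()  # sentinel token (assumption of this sketch) that ends the stream


def source(out_q, values):
    """Source actor: emits each value, then the end-of-stream sentinel."""
    for value in values:
        out_q.put(value)
    out_q.put(STOP)


def square(in_q, out_q):
    """Transformer actor: loops forever, blocking on reads from an empty channel."""
    while True:
        value = in_q.get()          # blocks while the input channel is empty
        if value is STOP:
            out_q.put(STOP)
            return
        out_q.put(value * value)


def run_network(values):
    """Wire source -> square with FIFO channels and collect the output stream."""
    a, b = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=square, args=(a, b)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        value = b.get()
        if value is STOP:
            break
        results.append(value)
    for thread in threads:
        thread.join()
    return results
```

Because every channel is a FIFO with blocking reads, the output order is deterministic even though the actors run concurrently, which is what makes this model attractive for pipelined, coarse-grained workflows.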
25
Polymorphic Actor Components: Working Across Data
Types and Domains
  • Actor Data Polymorphism
  • Add numbers (int, float, double, Complex)
  • Add strings (concatenation)
  • Add complex types (arrays, records, matrices)
  • Add user-defined types
  • Actor Behavioral Polymorphism
  • In dataflow, add when all connected inputs have
    data
  • In a time-triggered model, add when the clock
    ticks
  • In discrete-event, add when any connected input
    has data, and add in zero time
  • In process networks, execute an infinite loop in
    a thread that blocks when reading empty inputs
  • In CSP, execute an infinite loop that performs
    rendezvous on input or output
  • In push/pull, ports are push or pull (declared or
    inferred) and behave accordingly
  • In real-time CORBA, priorities are associated
    with ports and a dispatcher determines when to
    add

By not choosing among these when defining the
component, we get a huge increment in component
re-usability. But how do we ensure that the
component will work in all these circumstances?
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
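
Both kinds of polymorphism described above can be illustrated with a toy "add" actor. In this hedged Python sketch the function names are hypothetical and `None` is an assumed marker for "no token on this input"; the two `fire_*` functions mimic only the dataflow and discrete-event firing rules from the list.

```python
from functools import reduce
import operator


def add_tokens(tokens):
    """Data polymorphism: one 'add' for numbers, strings, lists, ..."""
    return reduce(operator.add, tokens)


def fire_dataflow(inputs):
    """Dataflow rule: fire only when every connected input has a token."""
    if all(value is not None for value in inputs.values()):
        return add_tokens(list(inputs.values()))
    return None


def fire_discrete_event(inputs):
    """Discrete-event rule: fire as soon as any connected input has a token."""
    present = [value for value in inputs.values() if value is not None]
    return add_tokens(present) if present else None
```

The addition logic is written once; which firing rule applies is decided outside the actor, which is the reuse argument the slide makes.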
26
Directors and Combining Different Component
Interaction Semantics
  • Possible app. in SWF
  • time-series aware
  • parameter-sweep aware
  • XY aware
  • execution models

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
27
A Few Specific Kepler Features and Example
Workflows
28
Web Services → Actors (WS Harvester)
  • → Minute-made (MM) WS-based application
    integration
  • Similarly: MM workflow design & sharing w/o
    implemented components

29
Recent Actor Additions
30
Digression: Who are the clients?
  • Domain scientists
  • C/Perl/Python/Java/WS/DB-enabled ones
  • others (the rest of us?)
  • Goal: make life better for both categories!
  • Workflow automation
  • Plumbing support
  • Execution monitoring, steering, runtime revision
    (pause-inspect-modify-resume cycle)

31
GEON Mineral Classification Workflow
32
inside the Classifier
BrowserUI actor w/ SVG client display
33
GEON Dataset Generation & Registration (and
co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al. (Ptolemy)
34
GEON Data Registration UI
35
GEON Data Registration in KEPLER
36
Registered Resources show up in Vergil (joint
SEEK, SPA, GEON, Registry!?)
37
Data Analysis: Biodiversity Indices
38
Traffic info for a list of highways: uses the iterate
(higher-order map) actor to access the highway info
web service repeatedly, sending out one email per
highway.
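
The pattern described here, one sub-workflow applied once per list element, is an ordinary higher-order map. A minimal Python sketch under stated assumptions: `fake_traffic_service` and its data are made up, and appending to `outbox` stands in for sending an email; none of this is the actual Kepler actor code.

```python
def iterate(sub_workflow, items):
    """Higher-order 'map': run the same sub-workflow once per input item."""
    return [sub_workflow(item) for item in items]


def fake_traffic_service(highway):
    """Stand-in for the highway-info web service (made-up data)."""
    conditions = {"I-5": "slow", "I-8": "clear"}
    return conditions.get(highway, "unknown")


def make_report_step(service, outbox):
    """Build the per-highway sub-workflow: look up traffic, then 'send' one email."""
    def step(highway):
        message = f"{highway}: {service(highway)}"
        outbox.append(message)   # stands in for sending one email
        return message
    return step


outbox = []
reports = iterate(make_report_step(fake_traffic_service, outbox),
                  ["I-5", "I-8", "I-15"])
```

The list of highways is the only thing that changes between runs; the sub-workflow itself is written once.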
41
Re-engineered PIW w/ Iteration Constructs (AD 2004)
map(GenbankWS)
Input: NM_001924, NM020375
Output: CAGTAATATGAC", GGGGACAA AGA
42
Streaming Real-time Data
Straightforward Example: Laser Strainmeter Channels in;
Scientific Workflow; Earth-tide signal out
Seismic Waveforms
44
Job Management (here: NIMROD)
  • Job management infrastructure in place
  • Results database under development
  • Goal: 1000s of GAMESS jobs (quantum mechanics),
    Fall/Winter '04

45
KEPLER Today
  • Support for SWF life cycle
  • Design, share, prototype, run, monitor, deploy, ...
  • Coarse-grained scientific workflows, e.g.,
  • web service actors, grid actors, command-line
    actors, ...
  • Fine-grained workflows and simulations, e.g.,
  • Database access, XSLT transformations, ...
  • Kepler Extensions
  • SDM Center/SPA support for data- and
    compute-intensive workflows!
  • real-time data streaming (ROADNet)
  • other special and generic extensions (e.g. GEON,
    SEEK)
  • Status
  • first release (alpha) was in May 2004
  • nightly builds w/ version tests
  • Link-Up Sister Project w/ other SWF systems (UK
    Taverna, Triana, ...)
  • Participation in various workshops and
    conferences (GGF10, SSDBMs, eScience WF workshop,
    ...)

46
KEPLER Tomorrow
  • Application-driven extensions
  • access to/integration with other IDMAF components
  • SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?,
    parallel-R?, ASPECT?, FastBit, ...
  • support for execution of new SWF domains
  • Astrophysics: TSI/Blondin (SPA/NCSU)
  • Nuclear Physics: Swesty (SPA/LLNL)
  • Generic extensions
  • addtl. support for data-intensive and
    compute-intensive workflows (all SRB Scommands,
    CCA support, ...)
  • (C-z bg fg)-ing (detach and reconnect)
  • workflow deployment models
  • Additional domain awareness (e.g. via new
    directors)
  • time series, parameter sweeps, job scheduling, ...
  • hybrid type system with semantic types
  • Consolidation
  • More installers, regular releases, improved
    documentation, ...

47
KEPLER SPA
  • First alpha releases since May 2004

https://www-casc.llnl.gov/sdm/
http://kepler.ecoinformatics.org
48
Breaking into the Parallel (MPI) and Stream
Computing Worlds!?
Source: Real-Time Signal Processing: Dataflow,
Visual, and Functional Programming, Hideki John
Reekie, University of Technology, Sydney
  • Clean functional semantics facilitates algebraic
    workflow (program) transformations
    (Bird-Meertens), e.g. mapS f . mapS g = mapS (f .
    g)
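
The map-fusion law mentioned above can be checked on a small example. A minimal Python sketch, assuming pure (side-effect-free) functions; `map_s` and `compose` are hypothetical helper names chosen to mirror the slide's notation.

```python
def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


def map_s(f):
    """Lift f to a (lazy) stream transformer."""
    return lambda stream: map(f, stream)


def inc(x):
    return x + 1


def dbl(x):
    return x * 2


data = [1, 2, 3]

# Two separate passes over the stream: first dbl, then inc.
two_passes = list(compose(map_s(inc), map_s(dbl))(data))

# One fused pass applying inc(dbl(x)) per element.
one_pass = list(map_s(compose(inc, dbl))(data))
```

A workflow optimizer could use this rewrite to merge two adjacent map actors into one, saving an intermediate stream; the rewrite is only safe when the mapped functions have no side effects.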

49
Hybrid Types (Structure & Semantics)
  • Services can be semantically compatible, but
    structurally incompatible

Diagram: a Source Service with port Ps and a Target
Service with port Pt, joined by a Desired Connection.
SemanticType Ps and SemanticType Pt are marked
Compatible with respect to Ontologies (OWL);
StructuralType Ps and StructuralType Pt are marked
Incompatible.
Source: Bowers-Ludaescher, DILS'04