Title: Scientific Workflows: Even More eScience Mileage from Cyberinfrastructure
1Scientific Workflows: (Even) More e-Science Mileage from Cyberinfrastructure
- Bertram Ludäscher
- Dept. of Computer Science
- Genome Center
- University of California, Davis
- ludaesch_at_ucdavis.edu
2SUMMARY (first things first, but not necessarily in that order)
- Scientific workflows are CI upper-ware for eScience
- Scientific workflows are universal (every-ware) and there are many interesting technical challenges
- execution models, semantic extensions, provenance, ...
- modeling and design!
- Workflow Thinking!
- Collection-Oriented Modeling & Design (COMAD) is good for you! (YMMV)
- Consilience: The Holy eScience Grail
3Science has been changing lately
- "All science is either physics or stamp collecting."
- Ernest Rutherford, British chemist & physicist (1871 - 1937)
- J. B. Birks, "Rutherford at Manchester" (1962)
- From few data, lots of thinking
- to lots of data, lots of analysis
- ... and hopefully still lots of thinking!
4The Diversity & Unity of Science
Natural Sciences
Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses Understanding, Prediction
in vivo, in vitro, in situ, in silico
compute-intensive
structurally & semantics-intensive
data-intensive
metadata-intensive
5Scientific workflows are CI upper-ware, i.e. the scientist's way to harness cyberinfrastructure
- Domain Scientist's View
- Q: When is CI (middle-ware, under-ware) good?
- A: When I can't see it!
- Q: When is a scientific workflow tool (CI upper-ware) good?
- A: When I can get more, new, faster, better science done!
- Workflow Engineer's View
- How can I (help the scientist) design & implement the desired wfs?
- How does wf make my life easier? Beyond Perl, Python?
- Choice of platforms, standards & reuse of existing tools, semantic extensions, scheduling on the Grid?
- How do I make all of this robust, fault-tolerant, etc.?
- Computer Scientist's View
- workflow language design, static analysis, optimization, theoretical limits: what can / can't be done
- The quest for the right language(s)
- The holy grail of eScience: Join the Quest!
6Science Environment for Ecological Knowledge
(SEEK)
7Scientific Workflow
- Capture how a scientist works with data and analytical tools
- data access, transformation, analysis, visualization
- possible worldview: dataflow-oriented (cf. signal-processing)
- Scientific workflow (wf) benefits (compare w/ script-based approaches)
- wf automation
- wf component reuse
- wf design, documentation
- wf archival, sharing
- built-in concurrency (task-, pipeline-parallelism)
- built-in provenance support
- distributed execution (Grid) support
- ...
8Ex: SEEK Ecological Niche Modeling Pipeline
- Scientific Workflow paradigm
- Reusable components (actors): a scientist's verbs/actions
- Top-level workflows: conceptual representation of the science process, sentences in the scientist's language
- Sub-workflows: increasing levels of detail
- Separation of concerns
- actors: what to do
- parameters: configurable behavior
- channels: dataflow, pipeline composition
- directors: fix execution model, scheduling
- semantic types: smart discovery, linking
D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer.
9Simple Kepler workflow using R (a statistics
package)
10Plumbing with Style (Norbert Podhorszki, UC Davis; Scott Klasky, ORNL)
Monitor
- Plasma physics simulation on 2048 processors on Seaborg @ NERSC (LBL)
- Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
- Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30 hour simulation run
- Under workflow control
- Monitor (watch) simulation progress (via remote scripts)
- Transfer from NERSC to ORNL concurrently with the simulation run
- Convert each file to an HDF5 file
- Archive files in 4GB chunks into HPSS
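The archiving step can be illustrated with a small sketch. This is not the workflow's actual code: the function name `plan_archives` is hypothetical, the grouping strategy (greedy, in file order) is an assumption, and 4GB is read here as 4000MB.

```python
from typing import List


def plan_archives(file_sizes_mb: List[int], chunk_limit_mb: int) -> List[List[int]]:
    """Greedily group consecutive files (by index) into chunks whose
    total size stays within chunk_limit_mb."""
    chunks: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(file_sizes_mb):
        if size > chunk_limit_mb:
            raise ValueError(f"file {index} alone exceeds the chunk limit")
        # Close the current chunk if this file would overflow it.
        if current and current_size + size > chunk_limit_mb:
            chunks.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Figures from the slide: 3000 files of 267 MB each.
# Assumption: 4 GB is taken as 4000 MB.
chunks = plan_archives([267] * 3000, 4000)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Under these assumptions each archive holds at most 14 files; a real pipeline would also have to plan chunks incrementally, since files arrive while the simulation is still running.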
11Kepler and Sensor Networks
- These just in (new NSF CEOP projects)
- Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System, NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
- standardize services for sensor networks, support multiple views, protocols
- COMET: Coast-to-Mountain Environmental Transect, UC Davis, Bodega Marine Lab, Lake Tahoe Research Center
- study how environmental factors affect ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada
CEOP/COMET
CEOP/Kepler
12Workflow Thinking (cf. Computational Thinking)
- How should we think about scientific workflows?
- From "What scientists do to produce scientific papers"
- to "it's just a program"
- Is that helpful?
- Depends on who you ask! What are you trying to do?
- (Domain) Scientist → Workflow Engineer → Computer Scientist
13Our Starting Point: Actor-Oriented Modeling
- Ports
- each actor has a set of input and output ports
- denote the actor's signature
- produce/consume data (a.k.a. tokens)
- parameters are special static ports
14Actor-Oriented Modeling
- Dataflow Connections
- unidirectional actor communication channels
- connect output ports with input ports
- for composing analysis pipelines
15Actor-Oriented Modeling
- Sub-workflows / Composite Actors
- composite actors wrap sub-workflows
- like actors, have signatures (i/o ports of sub-workflow)
- hierarchical workflows (arbitrary nesting levels)
16Actor-Oriented Modeling
- Directors
- define the execution semantics of workflow graphs
- execute the workflow graph (according to some schedule)
- sub-workflows may have different directors
- promotes reusability
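A minimal Python sketch of the ingredients from the Actor-Oriented Modeling slides: actors with ports, unidirectional channels, and a director that decides when actors fire. It is not Kepler or Ptolemy II code; `Actor`, `Workflow`, and `run_data_driven_director` are hypothetical names, and the director shown is a toy data-driven scheduler.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class Actor:
    """A component whose input/output ports form its signature."""
    name: str
    inputs: List[str]
    outputs: List[str]
    # fire() maps one token per input port to tokens on output ports.
    fire: Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Workflow:
    """Actors plus unidirectional channels (output port -> input port)."""
    actors: Dict[str, Actor] = field(default_factory=dict)
    channels: Dict[Tuple[str, str], List[Tuple[str, str]]] = field(default_factory=dict)

    def add(self, actor: Actor) -> None:
        self.actors[actor.name] = actor

    def connect(self, src: str, out_port: str, dst: str, in_port: str) -> None:
        self.channels.setdefault((src, out_port), []).append((dst, in_port))


def run_data_driven_director(wf: Workflow,
                             initial: Dict[Tuple[str, str], List[object]]):
    """Toy director: fire any actor that has a token on every input port."""
    queues: Dict[Tuple[str, str], Deque[object]] = {
        (a.name, p): deque() for a in wf.actors.values() for p in a.inputs
    }
    for key, tokens in initial.items():
        queues[key].extend(tokens)
    results: List[Tuple[str, str, object]] = []  # tokens on unconnected outputs
    fired = True
    while fired:
        fired = False
        for actor in wf.actors.values():
            if actor.inputs and all(queues[(actor.name, p)] for p in actor.inputs):
                args = {p: queues[(actor.name, p)].popleft() for p in actor.inputs}
                for port, token in actor.fire(args).items():
                    targets = wf.channels.get((actor.name, port), [])
                    for target in targets:
                        queues[target].append(token)
                    if not targets:
                        results.append((actor.name, port, token))
                fired = True
    return results


wf = Workflow()
wf.add(Actor("scale", ["x"], ["y"], lambda t: {"y": t["x"] * 2}))
wf.add(Actor("shift", ["x"], ["y"], lambda t: {"y": t["x"] + 1}))
wf.connect("scale", "y", "shift", "x")
out = run_data_driven_director(wf, {("scale", "x"): [1, 2, 3]})
print(out)
```

The actors know nothing about scheduling: swapping in a different director function would change the execution model without touching `scale` or `shift`, which is the reuse argument the slide makes.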
17Models of Computation (A Wf Engineer's Issue)
- Directors separate the concerns of orchestration and scheduling from conceptual design
- Synchronous Dataflow (SDF)
- Statically analyzable schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director.
- Process Networks (PN)
- Generalizes SDF. Actors execute as separate threads/processes, with queues of unbounded size (Kahn/MacQueen networks).
- Directed Acyclic Graph (DAG)
- Special case of SDF. No loops, no pipelining.
- Continuous Time (CT)
- Connections represent the value of a continuous time signal at some point in time ... Often used to model physical processes.
- Discrete Event (DE)
- Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
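What makes an SDF schedule statically analyzable is that every actor produces and consumes a fixed number of tokens per firing, so the number of firings per schedule period follows from the standard balance equations r[src] * produced = r[dst] * consumed. The sketch below solves them for a connected graph. It is an illustration only, not the Ptolemy II or Kepler scheduler, and `repetition_vector` is a hypothetical name.

```python
from fractions import Fraction
from math import lcm
from typing import Dict, List, Tuple

# Edge: (producer, tokens produced per firing, consumer, tokens consumed per firing)
Edge = Tuple[str, int, str, int]


def repetition_vector(edges: List[Edge]) -> Dict[str, int]:
    """Solve r[src] * produced == r[dst] * consumed for a connected graph.

    Raises ValueError if the rates are inconsistent, i.e. no periodic
    schedule with bounded buffers exists."""
    rates: Dict[str, Fraction] = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rates and dst not in rates:
                rates[dst] = rates[src] * produced / consumed
                changed = True
            elif dst in rates and src not in rates:
                rates[src] = rates[dst] * consumed / produced
                changed = True
            elif src in rates and dst in rates:
                if rates[src] * produced != rates[dst] * consumed:
                    raise ValueError("inconsistent rates: no bounded-buffer schedule")
    # Scale the rational solution to the smallest positive integers.
    scale = lcm(*(r.denominator for r in rates.values()))
    return {actor: int(r * scale) for actor, r in rates.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
rv = repetition_vector([("A", 2, "B", 3), ("B", 1, "C", 2)])
print(rv)
```

In this example A must fire three times and B twice for every firing of C. A PN director needs no such analysis, which is why it is more general but gives up the fixed buffer bounds.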
18Everything is a service / actor (yeah, right)
19Scientific Workflow Design Challenges
And that's why our scientific workflows are much easier to develop, understand and maintain!
20Shimology Part 1: Structure & Semantics
- Components and their i/o ports typically have
- Explicit structural type
- e.g., int, float, string, array ... of double, ...
- Implicit semantic type
- Not sure whether the stream of values from a port represents rainfall values or body size values
21Semantic Annotation
- Label data with semantic types
- Label inputs and outputs of analytical components
with semantic types (and overall component
function)
22Semantic Type Annotation in Kepler
- Component input and output port annotation
- a port can be annotated with multiple concepts from multiple ontologies
- Annotations are stored with the actor metadata
23Component Annotation and Indexing
- Component Annotations
- New components can be annotated and indexed into the actor library (specializing generic actors)
- Existing components can be revised, annotated, and indexed (hiding previous versions)
24Smart Discovery
- Find a component (here an actor) in different locations (categories)
- based on the semantic annotation of the component (or its ports)
25Smart Linking (Workflow Design)
- Statically perform semantic and structural type
checking
- Navigate errors and warnings within the workflow
- !! Search for and insert adapters (aka shims) to
fix (structural and semantic) mismatches
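A hedged sketch of what a static check of one link might look like, combining the structural and the semantic type of each port. The ontology, the `Port` class, and `check_link` are all hypothetical; Kepler's actual annotation format and reasoning are not shown here, and structural compatibility is simplified to string equality.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical toy ontology: concept -> parent concept (None for a root).
ONTOLOGY: Dict[str, Optional[str]] = {
    "Measurement": None,
    "Rainfall": "Measurement",
    "BodySize": "Measurement",
}


@dataclass(frozen=True)
class Port:
    name: str
    structural_type: str  # e.g. "double" or "array of double"
    semantic_type: str    # a concept from ONTOLOGY


def is_subconcept(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or lies below it in ONTOLOGY."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)
    return False


def check_link(out_port: Port, in_port: Port) -> List[str]:
    """Return the mismatches found when connecting out_port -> in_port."""
    problems: List[str] = []
    if out_port.structural_type != in_port.structural_type:
        problems.append(
            f"structural: {out_port.structural_type} vs {in_port.structural_type}")
    if not is_subconcept(out_port.semantic_type, in_port.semantic_type):
        problems.append(
            f"semantic: {out_port.semantic_type} is not a {in_port.semantic_type}")
    return problems


rain_out = Port("values", "double", "Rainfall")
stats_in = Port("samples", "double", "Measurement")
size_in = Port("sizes", "array of double", "BodySize")

ok_link = check_link(rain_out, stats_in)
bad_link = check_link(rain_out, size_in)
print(ok_link, bad_link)
```

Each reported mismatch is a candidate place for an adapter (shim): a structural one would convert between types, a semantic one would map between concepts.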
26Smart Linking (addressing Shimology Type 1)
Source: Bowers & Ludaescher, DILS 2004
27Scientific Workflow Design More Challenges
And that's why our scientific workflows are much easier to develop, understand and maintain!
28Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
29Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
30And now a Real Example! (ChIP-chip workflow)
- Fear Not!
- Higher-order functions (map, fold, zipWith, ...) shall oil your gears, e.g.,
- map f x1, ..., xn → f(x1), ..., f(xn)
- MAP f1, ..., fk (x) → f1(x), ..., fk(x)
- Your analysis pipeline shall be automated, your data lineage shall be recorded, and your provenance explained!
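The higher-order functions named above can be written out in a few lines of Python. The function names `map_each`, `map_functions`, `fold`, and `zip_with` are illustrative choices, not part of any workflow system.

```python
from functools import reduce
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def map_each(f: Callable[[A], B], xs: Iterable[A]) -> List[B]:
    """map f x1, ..., xn -> f(x1), ..., f(xn): one function, many inputs."""
    return [f(x) for x in xs]


def map_functions(fs: Iterable[Callable[[A], B]], x: A) -> List[B]:
    """MAP f1, ..., fk (x) -> f1(x), ..., fk(x): many functions, one input."""
    return [f(x) for f in fs]


def fold(f: Callable[[B, A], B], initial: B, xs: Iterable[A]) -> B:
    """Combine all elements into one value, left to right."""
    return reduce(f, xs, initial)


def zip_with(f: Callable[[A, B], C], xs: Iterable[A], ys: Iterable[B]) -> List[C]:
    """Apply a two-argument function pairwise to two sequences."""
    return [f(x, y) for x, y in zip(xs, ys)]


squares = map_each(lambda x: x * x, [1, 2, 3])
summaries = map_functions([min, max, sum], [3, 1, 2])
total = fold(lambda acc, x: acc + x, 0, [1, 2, 3])
pair_sums = zip_with(lambda x, y: x + y, [1, 2], [10, 20])
print(squares, summaries, total, pair_sums)
```

In a workflow setting, `f` would be an actor or sub-workflow, so one `map_each` replaces a hand-wired loop over a list of inputs.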
31But how do we get from messy to neat reusable
designs?
32The Answer (YMMV)
- Collection-Oriented Modeling & Design (COMAD)
- embrace the assembly line metaphor fully
- → cf. Flow-based Programming (J. Morrison)
- data: tagged nested collections
- e.g. represented as flattened, pipelined (XML) token streams
33How does COMAD work?
- Some COMAD principles
- data: tagged, flattened, nested collections (token streams)
- data tokens
- metadata tokens
- inherited downwards into (sub)collections
- define an actor's read scope via an (X)Path-like expression
- default actor behavior
- not mine?
- → don't do anything, just pass the buck!
- stuff within my scope? →
- add-only to it (default)
- consume scope & write out result
- (but remember the bypass!)
- iteration scope is a query involving group-by and further refines the granularity/subtrees that constitute the tokens consumed by an actor firing
- has aspects of implicit iteration (a la Taverna)
- default iteration level to fix signature mismatches
- but also
- granularity/grouping is definable
- works on anything (assuming scope is matched correctly)
- T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
- T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), 2006.
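The default behavior described on this slide (pass through what is out of scope, add-only within scope) can be sketched over a flattened token stream. This is a simplification, not the COMAD implementation from the cited papers: the read scope is reduced to a single collection tag instead of an (X)Path-like expression, metadata tokens are omitted, and `comad_actor` is a hypothetical name.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# A token is (kind, value): ("open", tag), ("close", tag) or ("data", payload).
Token = Tuple[str, object]


def comad_actor(stream: Iterable[Token], scope: str,
                compute: Callable[[List[object]], object]) -> Iterator[Token]:
    """Forward every token unchanged; when a collection tagged `scope`
    closes, add one data token computed from the data directly inside it."""
    # One entry per open collection: a buffer if in scope, else None.
    stack: List[Optional[List[object]]] = []
    for kind, value in stream:
        if kind == "open":
            stack.append([] if value == scope else None)
        elif kind == "data":
            if stack and stack[-1] is not None:
                stack[-1].append(value)
        elif kind == "close":
            buffer = stack.pop()
            if buffer is not None:
                # Add-only: the result is written inside the collection.
                yield ("data", compute(buffer))
        yield (kind, value)  # the bypass: nothing is consumed


tokens: List[Token] = [
    ("open", "project"), ("data", "field notes"),
    ("open", "sample"), ("data", 1), ("data", 2), ("close", "sample"),
    ("open", "sample"), ("data", 10), ("close", "sample"),
    ("close", "project"),
]
result = list(comad_actor(tokens, scope="sample", compute=sum))
for token in result:
    print(token)
```

The actor never sees the "field notes" token as input, yet the token still reaches downstream actors, and the same actor works at any nesting depth because the nesting structure of the stream drives the iteration.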
34COMAD: What we gained
- from fragile, messy workflow designs
- to more reusable actors
- just change the read/iteration scope parameters!
- sometimes not even that is needed (working on that ...)
- and cleaner workflow design (The A-B-C method of wf design!)
- Crux: keep the nesting structure of data (pass through, add-only)
- and let it drive the (semi-)implicit iteration (aka structural recursion)
35Workflow Thinking: Modeling & Design Paradigms
- Vanilla Process Network
- Functional Programming Dataflow Network
- XML Transformation Network
- Collection-oriented Modeling & Design framework (COMAD)
36A Scientific Publication (the final provenance frontier)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006), doi:10.1038/nature05113. Received 27 June 2006; Accepted 25 July 2006; Published online 16 August 2006
some metadata
37More Evidence
data reference
type of evidence
tool reference
trust me on this one
- provenance/lineage: show the history and evidence
- related to proof trees
- unlike w/ scripts, a SWF system can keep track of what happened
- In the future: deposit your data & workflows in a repository
38Pipelined workflow for inferring phylogenetic
trees
Author: Tim McPhillips, UC Davis
39Scientific Provenance: Questions we can ask
- What DNA sequences were input to the workflow?
- What phylogenetic trees were output by the workflow?
- What DNA sequences input to the workflow does this consensus tree depend on?
- What input sequences were not used to derive any output consensus trees?
- What was the sequence alignment (key intermediate data) used in the process of inferring this tree?
- plus the usual: smart-rerun, VCR replay, ...
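Questions of this kind become graph queries once the workflow system records which item was derived from which. The sketch below assumes a very simple lineage record (a dictionary from each item to its direct sources) with made-up item names; it does not reflect the provenance model of the COMAD framework or any Kepler API.

```python
from typing import Dict, Iterable, Set

# Hypothetical lineage record: item -> items it was directly derived from.
DERIVED_FROM: Dict[str, Set[str]] = {
    "alignment1": {"seq1", "seq2", "seq3"},
    "tree1": {"alignment1"},
    "tree2": {"alignment1"},
    "consensus1": {"tree1", "tree2"},
}
INPUTS: Set[str] = {"seq1", "seq2", "seq3", "seq4"}


def dependencies(item: str) -> Set[str]:
    """All items that `item` transitively depends on."""
    seen: Set[str] = set()
    todo = [item]
    while todo:
        for parent in DERIVED_FROM.get(todo.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def input_dependencies(item: str) -> Set[str]:
    """Which workflow inputs does this item depend on?"""
    return dependencies(item) & INPUTS


def unused_inputs(outputs: Iterable[str]) -> Set[str]:
    """Which workflow inputs were not used to derive any of the outputs?"""
    used: Set[str] = set()
    for output in outputs:
        used |= input_dependencies(output)
    return INPUTS - used


tree_inputs = input_dependencies("consensus1")
leftover = unused_inputs(["consensus1"])
alignments = {d for d in dependencies("consensus1") if d.startswith("alignment")}
print(sorted(tree_inputs), sorted(leftover), sorted(alignments))
```

Here the consensus tree depends on three of the four input sequences, the fourth was never used, and the intermediate alignment is found by filtering the same dependency set.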
40Provenance in the COMAD Framework
Without Provenance
With Provenance
41So what should we focus on?
- What is the bottleneck in Scientific Workflows?
- The human resource: workflow design support!
- includes
- new modeling paradigms (e.g. COMAD, FP, NRC, ...)
- and data-orientation!
- Business workflows
- top-down, engineered, many times the same
- Scientific workflows
- bottom-up, exploratory, each time unique
- Combine best of both
- explore, capture, evolve!
- workflow sharing and reuse
42Workflow design: when was the last time ...
- ... that we ate our own dog food?
- Do we really want to use formalism X for scientist-oriented workflow design?
- X in Petri-net, π-calculus, process networks, Turing machines, BPEL4WS, ...
- What are the observables of approach/language X?
- What does language X talk about, ignore, and allow in terms of analysis, understanding?
- Example: Data Provenance in Scientific Workflows
- T R I M
- Trace (MoP) Run (MoC) I(gnored) M(odeled)
43The Emperor's Old Clothes
- Computer Science / Thin approach
- Minimize to the max: Lambda calculus, Turing machines, Register machines, Petri nets, Kahn Process Networks, Relational Algebra & Calculus, ...
- Thick approach
- Algol68, PL/1, XML Schema, BPEL4WS, SQL, ... (bloat to the max?)
- Premature optimization
- is the root of all evil
- Tony Hoare, Donald Knuth
- Premature standardization
- is the root of all evil
44The Evolution of Language
Source: Phil Wadler, Peter Buneman (?)
45Consilience: The Unity of Knowledge (E. O. Wilson)
- "Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." E. O. Wilson
- eScience, Cyberinfrastructure: mechanisms to make progress
- Scientific Workflows: crucial elements to get the most mileage out of CI to fuel eScience, accelerating knowledge discovery
- Identify the real bottlenecks in this quest!
- Need workflow engineers, computer scientists, bioinformaticians, hybrids!
46The Holy Grail of eScience / Scientific Workflows
- Evolution programmed us to enjoy certain things
- We should feel lucky
- the brain is so powerful a control system that self-consciousness emerged; now we also enjoy thinking
- hence we've been asking provenance questions since the dawn of man (where from? to? why?)
- Science (and now eScience) yield answers
- aside: so does religion, but only science is strongly constrained by reality
- "We are an intelligent species and the use of our intelligence quite properly gives us pleasure. In this respect the brain is like a muscle. When we think well, we feel good. Understanding is a kind of ecstasy." Carl Sagan
- Call to Arms/Ploughshares
- Join the Quest for the right language for eScience Workflow Thinking!
47Acknowledgements, Q&A
- Data and Knowledge Systems Lab (DAKS) @ UC Davis
- Dr. Shawn Bowers, Dr. Timothy McPhillips, Dr. Norbert Podhorszki
- Dave Thau, Daniel Zinn, Alex Chen
- Many Kepler collaborators
- Ilkay Altintas (SDSC/UCSD), Matt Jones (UCSB), Arie Shoshani (LBL), Terence Critchlow (LLNL), Mladen Vouk (NCSU), ...
48Some Related Publications
- Semantic Type Annotation
- S Bowers, B Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
- S Bowers, B Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
- B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
- S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
- S Bowers, K Lin, B Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
- S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
- S Bowers, B Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. International Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
- Workflow Design and Modeling
- T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2006.
- S Bowers, T McPhillips, B Ludaescher, S Cohen, SB Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW), LNCS, 2006.
- S Bowers, B Ludaescher, AHH Ngu, T Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
- S Bowers, B Ludaescher. Actor-Oriented Design of Scientific Workflows. International Conference on Conceptual Modeling (ER), LNCS, 2005.
- T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
- Kepler
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer-Verlag, to appear.
- W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
- S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.
49Kepler Collaboration
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from
- SEEK
- SciDAC SDM
- Ptolemy
- GEON
- ROADNet
- Resurgence
- AToL CIPRES, POD
- ...
- Goals
- Create powerful analytical tools that are useful across disciplines
- Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, ...
Ptolemy II
Natural Diversity Discovery Project
50Coming up: Kepler Actor Repository
51Workflow and component repositories: myExperiment is your library, or Our Workflow Repository!
- Taverna Repository, Kepler Repository
- Workflow system speciation,
- looking for a new symbiotic relation --- need enzymes!
- (yes, that's for the rest of us)