Title: Scientific Workflows: More eScience Mileage from Cyberinfrastructure


1
Scientific Workflows: More e-Science Mileage
from Cyberinfrastructure
  • Bertram Ludäscher
  • Dept. of Computer Science
  • Genome Center
  • University of California, Davis
  • ludaesch_at_ucdavis.edu

2
SUMMARY (first things first, but
not necessarily in that order)
  • Scientific workflows are CI upperware for
    eScience
  • Scientific workflows are universal (everyware)
    and there are many interesting technical
    challenges
  • execution models, semantic extensions,
    provenance,
  • modeling and design!
  • Workflow Thinking!
  • Collection-Oriented Modeling & Design (COMAD) is
    good for you! (YMMV)
  • Introduction
  • on workflow legacy
  • i.e. data provenance issues.

3
Scientific workflows are CI upperware, i.e.
the scientist's way to harness
cyberinfrastructure
  • Domain Scientist's View
  • Q: When is CI (middleware, underware) good?
  • A: When I can't see it!
  • Q: When is a scientific workflow tool (CI
    upperware) good?
  • A: When I can get more, new, faster, better
    science done!
  • Workflow Engineer's View
  • How can I (help the scientist) design & implement
    the desired wfs?
  • How does a wf make my life easier, beyond just
    using Perl or Python?
  • Choice of platforms, standards, reuse of existing
    tools, semantic extensions, scheduling on the
    Grid?
  • How do I make all of this robust, fault-tolerant,
    etc.?
  • Computer Scientist's View
  • workflow language design, static analysis,
    optimization, theoretical limits: what can and
    can't be done

4
Science Environment for Ecological Knowledge
(SEEK)
5
Scientific Workflow
  • Capture how a scientist works with data and
    analytical tools
  • data access, transformation, analysis,
    visualization
  • possible worldview: dataflow-oriented (cf.
    signal processing)
  • Scientific workflow (wf) benefits (compare w/
    script-based approaches)
  • wf automation
  • wf component reuse
  • wf design, documentation
  • wf archival, sharing
  • built-in concurrency (task-, pipeline-parallelism)
  • built-in provenance support
  • distributed execution (Grid) support

6
Ex.: SEEK Ecological Niche Modeling Pipeline
  • Scientific Workflow paradigm
  • Reusable components (actors): a scientist's
    verbs/actions
  • Top-level workflows: conceptual representation
    of the science process, sentences in the
    scientist's language
  • Sub-workflows: increasing levels of detail
  • Separation of concerns
  • actors: what to do
  • parameters: configurable behavior
  • channels: dataflow, pipeline composition
  • directors: fix execution model, scheduling
  • semantic types: smart discovery, linking

D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
7
Simple Kepler workflow using R (a statistics
package)
8
Plumbing with Style (Norbert Podhorszki, UC
Davis; Scott Klasky, ORNL)
Monitor
  • Plasma physics simulation on 2048 processors on
    Seaborg @ NERSC (LBL)
  • Gyrokinetic Toroidal Code (GTC) to study energy
    transport in fusion devices (plasma
    microturbulence)
  • Generating 800GB of data (3000 files, 6000
    timesteps, 267MB/timestep), 30 hour simulation
    run
  • Under workflow control
  • Monitor (watch) simulation progress (via remote
    scripts)
  • Transfer from NERSC to ORNL concurrently with the
    simulation run
  • Convert each file to HDF5 file
  • Archive files in 4GB chunks into HPSS

10
Kepler and Sensor Networks
  • These just in (new NSF CEOP projects)
  • Management and Analysis of Environmental
    Observatory Data using the Kepler Scientific
    Workflow System: NCEAS, SDSC, UC Davis, OSU,
    CENS (UCLA), OPeNDAP
  • standardize services for sensor networks, support
    multiple views, protocols
  • COMET: Coast-to-Mountain Environmental Transect;
    UC Davis, Bodega Marine Lab, Lake Tahoe Research
    Center
  • study how environmental factors affect ecosystems
    along an elevation gradient from coastal
    California to the summit of the Sierra Nevada

CEOP/COMET
CEOP/Kepler
11
Workflow Thinking (cf. Computational
Thinking)
  • How should we think about scientific workflows?
  • From "what scientists do to produce scientific
    papers"
  • to "it's just a (Turing) program"
  • Is that helpful?
  • Who are you? What are you trying to do?
  • (Domain) Scientist → Workflow Engineer → Computer
    Scientist

12
Our Starting Point: Actor-Oriented Modeling
  • Ports
  • each actor has a set of input and output ports
  • denote the actor's signature
  • produce/consume data (a.k.a. tokens)
  • parameters are special static ports

13
Actor-Oriented Modeling
  • Dataflow Connections
  • unidirectional actor communication channels
  • connect output ports with input ports
  • for composing analysis pipelines
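
The port and channel vocabulary of the last two slides can be made concrete with a small sketch. This is plain Python for illustration only, not Kepler or Ptolemy II code: the names Channel, Actor, connect and run_pipeline are hypothetical, and a channel is assumed to be a simple FIFO queue of tokens.

```python
from collections import deque


class Channel:
    """Unidirectional FIFO link from an output port to an input port."""

    def __init__(self):
        self.tokens = deque()


class Actor:
    """A component whose signature is its set of named input/output ports."""

    def __init__(self, func, inputs, outputs, **params):
        self.func = func                          # maps port values to {port: value}
        self.inputs = {p: None for p in inputs}   # input port -> incoming channel
        self.outputs = {p: [] for p in outputs}   # output port -> outgoing channels
        self.params = params                      # parameters: static ports

    def fire(self):
        """Consume one token per input port, produce one per output port."""
        args = {p: ch.tokens.popleft() for p, ch in self.inputs.items()}
        for port, value in self.func(**args, **self.params).items():
            for ch in self.outputs[port]:
                ch.tokens.append(value)


def connect(src, out_port, dst, in_port):
    """Wire an output port of `src` to an input port of `dst`."""
    ch = Channel()
    src.outputs[out_port].append(ch)
    dst.inputs[in_port] = ch
    return ch


def run_pipeline(values):
    """Build a two-actor pipeline (scale, then shift) and push values through."""
    scale = Actor(lambda x, factor: {"y": x * factor}, ["x"], ["y"], factor=10)
    shift = Actor(lambda x, delta: {"y": x + delta}, ["x"], ["y"], delta=1)
    feed, sink = Channel(), Channel()
    scale.inputs["x"] = feed            # external input to the pipeline
    connect(scale, "y", shift, "x")
    shift.outputs["y"].append(sink)     # external output of the pipeline
    feed.tokens.extend(values)
    for _ in values:                    # a trivial, hand-written schedule
        scale.fire()
        shift.fire()
    return list(sink.tokens)
```

Here run_pipeline([1, 2]) returns [11, 21]. The hand-written firing loop stands in for the scheduling that a real workflow system would take over.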

14
Actor-Oriented Modeling
  • Sub-workflows / Composite Actors
  • composite actors wrap sub-workflows
  • like actors, have signatures (i/o ports of
    sub-workflow)
  • hierarchical workflows (arbitrary nesting levels)

15
Actor-Oriented Modeling
  • Directors
  • define the execution semantics of workflow graphs
  • execute the workflow graph (according to some schedule)
  • sub-workflows may have different directors
  • promotes reusability
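
That the director, rather than the actors, fixes the execution model can be illustrated with a sketch in which one chain of stages is run by two interchangeable directors. Everything here is an assumption made for illustration: sequential_director and threaded_director are hypothetical names, the workflow is reduced to a linear list of functions, and neither function reflects how Kepler's directors are implemented.

```python
import queue
import threading

STOP = object()  # sentinel token marking the end of a stream


def sequential_director(stages, inputs):
    """Run stage after stage in a single thread, each over all tokens."""
    tokens = list(inputs)
    for stage in stages:
        tokens = [stage(t) for t in tokens]
    return tokens


def threaded_director(stages, inputs):
    """Run every stage in its own thread, linked by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(stage, q_in, q_out):
        while True:
            token = q_in.get()
            if token is STOP:
                q_out.put(STOP)      # propagate end-of-stream downstream
                return
            q_out.put(stage(token))

    threads = [
        threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
        for i, s in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for token in inputs:
        queues[0].put(token)
    queues[0].put(STOP)

    results = []
    while True:
        token = queues[-1].get()
        if token is STOP:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results
```

Both directors return the same tokens in the same order for the same stages; only the execution strategy differs, which is the sense in which the stages stay reusable.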

16
Models of Computation (A Wf Engineer's Issue)
  • Directors separate the concerns of orchestration
    and scheduling from conceptual design
  • Synchronous Dataflow (SDF)
  • Statically analyzable schedule, no deadlocks,
    fixed buffer requirements; executable as a single
    thread by the director.
  • Process Networks (PN)
  • Generalizes SDF. Actors execute as separate
    threads/processes, with queues of unbounded size
    (Kahn/MacQueen networks).
  • Directed Acyclic Graph (DAG)
  • Special case of SDF. No loops, no pipelining.
  • Continuous Time (CT)
  • Connections represent the value of a continuous
    time signal at some point in time ... Often used
    to model physical processes.
  • Discrete Event (DE)
  • Actors communicate through a queue of events in
    time. Used for instantaneous reactions in
    physical systems.
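
The "statically analyzable schedule" of SDF rests on the usual balance equations: for every channel, firings of the producer times tokens produced per firing must equal firings of the consumer times tokens consumed per firing. A minimal sketch of solving them follows, assuming a connected graph with fixed integer rates and Python 3.9+ for math.lcm. The function name repetition_vector and the edge format are hypothetical, not taken from Kepler or Ptolemy II.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive firing counts that balance every channel.

    edges: list of (src, tokens_produced, dst, tokens_consumed).
    Raises ValueError if the rates are inconsistent or the graph is
    not connected.
    """
    rate = {actors[0]: Fraction(1)}   # relative firing rate per actor
    pending = list(edges)
    progress = True
    while pending and progress:
        progress = False
        for edge in list(pending):
            src, prod, dst, cons = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no bounded schedule")
            else:
                continue              # neither end reached yet, retry later
            pending.remove(edge)
            progress = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}
```

For a chain in which A produces 2 tokens per firing, B consumes 3 and produces 1, and C consumes 2, the result is {"A": 3, "B": 2, "C": 1}: one period of the schedule fires A three times, B twice and C once, after which every buffer is back at its starting level.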

17
Everything is a service / actor
(yeah, right)
18
Scientific Workflow Design Challenges
And that's why our scientific workflows are
much easier to develop, understand and maintain!
19
Shimology Part 1: Structure & Semantics
  • Components and their i/o ports typically have
  • Explicit structural type
  • e.g., int, float, string, array of
    double, ...
  • Implicit semantic type
  • Not sure whether the stream of values from a port
    represents rainfall values or body size values

20
Semantic Annotation
  • Label data with semantic types
  • Label inputs and outputs of analytical components
    with semantic types (and overall component
    function)
  • Grounded at level of measurement and data,
    avoiding some pitfalls of upper ontologies

21
Hybrid Types: Semantic & Structural Typing
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
22
Semantic Type Annotation in Kepler
  • Component input and output port annotation
  • a port can be annotated with multiple concepts
    from multiple ontologies
  • Annotations are stored with the actor metadata

23
Component Annotation and Indexing
  • Component Annotations
  • New components can be annotated and indexed into
    the actor library
  • (specializing generic actors)
  • Existing components can be revised, annotated,
    and indexed (hiding previous versions)

24
Smart Discovery
  • Find a component (here an actor) in different
    locations (categories)
  • based on the semantic annotation of the
    component (or its ports)

25
Smart Linking (Workflow Design)
  • Statically perform semantic and structural type
    checking
  • Navigate errors and warnings within the workflow
  • !! Search for and insert adapters (aka shims) to
    fix (structural and semantic) mismatches
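
A static link check of this kind can be sketched as follows. The sketch is a deliberate simplification under stated assumptions: the toy ONTOLOGY, the SHIMS registry, the adapter name WrapInArray and the function check_link are all hypothetical, semantic compatibility is reduced to "is a sub-concept of", and shims are looked up for structural mismatches only. It does not reproduce Kepler's type system.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: concept -> parent concept (None at the root).
ONTOLOGY = {
    "Rainfall": "Precipitation",
    "Precipitation": "Measurement",
    "BodySize": "Measurement",
    "Measurement": None,
}

# Hypothetical shim registry: (from structure, to structure) -> adapter name.
SHIMS = {("float", "array<double>"): "WrapInArray"}


@dataclass(frozen=True)
class PortType:
    structural: str   # e.g. "float", "array<double>"
    semantic: str     # a concept name from ONTOLOGY


def is_subconcept(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in ONTOLOGY."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False


def check_link(out_type, in_type):
    """Return ("ok", None), ("shim", adapter) or ("error", reason)."""
    if not is_subconcept(out_type.semantic, in_type.semantic):
        reason = f"{out_type.semantic} is not a kind of {in_type.semantic}"
        return ("error", reason)
    if out_type.structural == in_type.structural:
        return ("ok", None)
    shim = SHIMS.get((out_type.structural, in_type.structural))
    if shim is not None:
        return ("shim", shim)
    reason = f"no adapter from {out_type.structural} to {in_type.structural}"
    return ("error", reason)
```

Linking a float port carrying Rainfall to an array<double> port expecting Precipitation would return ("shim", "WrapInArray"), while linking a BodySize port to a Precipitation port is rejected even though both are floats.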

26
Smart Linking (addressing Shimology
Type 1)
Source: Bowers, Ludaescher, DILS'04
27
Scientific Workflow Design More Challenges
And that's why our scientific workflows are
much easier to develop, understand and maintain!
28
Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
29
Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
30
And now a Real Example! (ChIP-chip workflow)
  • Fear Not!
  • You are in the good hands of the DAKS (and
    Farnham) Labs @ UC Davis Genome Center
  • Higher-order functions (map, fold, zipWith, ...)
    shall oil your gears, e.g.,
  • map f x1,...,xn → f(x1), ..., f(xn)
  • MAP f1, ..., fk (x) → f1(x), ..., fk(x)
  • Your analysis pipeline shall be automated,
  • your data lineage shall be recorded,
  • and your provenance explained!
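
The higher-order functions named on this slide can be written out in a few lines. This is a generic sketch, not code from the ChIP-chip workflow: the function names are hypothetical, map_over_data corresponds to the slide's "map" (one function, many data items) and map_over_functions to its "MAP" (many functions, one data item).

```python
from functools import reduce


def map_over_data(f, xs):
    """map: apply one function to every item of a collection."""
    return [f(x) for x in xs]


def map_over_functions(fs, x):
    """MAP: apply every function of a collection to one item."""
    return [f(x) for f in fs]


def fold(f, init, xs):
    """fold: combine all items into one value, starting from init."""
    return reduce(f, xs, init)


def zip_with(f, xs, ys):
    """zipWith: combine two collections pairwise with a binary function."""
    return [f(x, y) for x, y in zip(xs, ys)]
```

For instance, map_over_data(abs, [-1, 2]) gives [1, 2], and map_over_functions([min, max], [3, 1, 2]) gives [1, 3].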

31
But how do we get from messy to neat reusable
designs?
32
The Answer (YMMV)
  • Collection-Oriented Modeling & Design (COMAD)
  • embrace the assembly line metaphor fully
  • → cf. Flow-based Programming (J. Morrison)
  • data: tagged nested collections
  • e.g. represented as flattened, pipelined
    (XML) token streams

33
How does COMAD work?
  • Some COMAD principles
  • data: tagged, flattened, nested collections
    (token streams)
  • data tokens
  • metadata tokens
  • inherited downwards into (sub)collections
  • define an actor's read scope via an (X)Path-like
    expression
  • default actor behavior
  • not mine?
  • → don't do anything, just pass the buck!
  • stuff within my scope? →
  • add-only to it (default)
  • consume scope, write out result
  • (but remember the bypass!)
  • iteration scope is a query involving group-by; it
    further refines the granularity/subtrees that
    constitute the tokens consumed by an actor firing
  • has aspects of implicit iteration (à la Taverna)
  • default iteration level to fix signature
    mismatches
  • but also
  • granularity/grouping is definable
  • works on anything (assuming scope is matched
    correctly)
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), 2006
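
The default actor behavior listed on this slide (pass along what is not yours, add to what is in scope) can be sketched over a flattened token stream. The encoding is an assumption made for illustration: tokens are tuples ("open", tag), ("close", tag) and ("data", label, payload), the read scope is reduced to a single collection tag instead of an (X)Path-like expression, and comad_actor is a hypothetical name rather than the COMAD API.

```python
def comad_actor(stream, scope_tag, read_label, compute, write_label):
    """Pass every token through; add one result per in-scope collection.

    Data tokens labelled `read_label` that sit directly inside a
    collection tagged `scope_tag` are collected, and compute(collected)
    is added as a new data token just before that collection closes.
    """
    stack = []  # per open collection: list of collected payloads, or None
    for token in stream:
        kind = token[0]
        if kind == "open":
            stack.append([] if token[1] == scope_tag else None)
        elif kind == "data":
            _, label, payload = token
            if stack and stack[-1] is not None and label == read_label:
                stack[-1].append(payload)
        elif kind == "close":
            collected = stack.pop()
            if collected:  # in scope and something was read
                yield ("data", write_label, compute(collected))
        yield token  # add-only: the original token is always kept


STREAM = [
    ("open", "Study"),
    ("open", "Site"),
    ("data", "rainfall", 3.0),
    ("data", "rainfall", 5.0),
    ("close", "Site"),
    ("open", "Notes"),
    ("data", "text", "not in scope"),
    ("close", "Notes"),
    ("close", "Study"),
]


def mean(values):
    return sum(values) / len(values)


RESULT = list(comad_actor(STREAM, "Site", "rainfall", mean, "mean_rainfall"))
```

RESULT is the original stream with a single extra token, ("data", "mean_rainfall", 4.0), placed just before ("close", "Site"). The Notes collection flows past the actor untouched, which is the bypass the slide refers to.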

34
COMAD: What we gained
  • from fragile, messy workflow designs
  • to more reusable actors
  • just change the read/iteration scope parameters!
  • sometimes not even that is needed (working on
    that)
  • and cleaner workflow design (the A-B-C method
    of wf design!)
  • Crux: keep the nesting structure of data (pass
    through, add-only)
  • and let it drive the (semi-)implicit iteration
    (aka structural recursion)

35
Workflow Thinking: Modeling & Design Paradigms
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-oriented Modeling & Design framework
    (COMAD)

36
A Scientific Publication (the final
provenance frontier)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006)
doi:10.1038/nature05113; Received 27 June 2006;
Accepted 25 July 2006; Published online 16 August
2006
some metadata
37
More Evidence
data reference
type of evidence
tool reference
trust me on this one
  • provenance/lineage show the history and evidence
  • related to proof trees
  • unlike w/ scripts, SWF system can keep track of
    what happened
  • In the future: deposit your data & workflows in a
    repository

38
Pipelined workflow for inferring phylogenetic
trees
Author: Tim McPhillips, UC Davis
39
Scientific Provenance: Questions we can ask
  • What DNA sequences were input to the workflow
    (this run)?
  • What phylogenetic trees were output by the
    workflow?
  • What phylogenetic trees were created
    (intermediate or final) by the workflow?
  • What actor (recall: a verb in the scientist's
    language) created this phylogenetic tree?
  • What sequences input to the workflow does this
    consensus tree depend on?
  • What input sequences were not used to derive any
    output consensus trees?
  • What was the sequence alignment (key intermediate
    data) used in the process of inferring this tree?
  • Which actors were involved in creating this tree?
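
Several of these questions reduce to traversals of a dependency graph. The sketch below assumes a provenance log that records, for each derived item, the actor that created it and the items it was derived from. The item names, the actor names and the contents of PROVENANCE are made up for illustration and are not the output of the phylogenetics workflow.

```python
# Hypothetical provenance log: derived item -> (creating actor, input items).
PROVENANCE = {
    "alignment": ("Align", ["seq1", "seq2", "seq3"]),
    "tree1": ("InferTree", ["alignment"]),
    "tree2": ("InferTree", ["alignment"]),
    "consensus": ("Consensus", ["tree1", "tree2"]),
}
WORKFLOW_INPUTS = {"seq1", "seq2", "seq3", "seq4"}


def created_by(item):
    """What actor created this item?"""
    return PROVENANCE[item][0]


def lineage(item):
    """All items that `item` depends on, directly or indirectly."""
    seen, todo = set(), [item]
    while todo:
        current = todo.pop()
        for parent in PROVENANCE.get(current, (None, []))[1]:
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


def input_dependencies(item):
    """What workflow inputs does this item depend on?"""
    return lineage(item) & WORKFLOW_INPUTS


def unused_inputs(outputs):
    """What inputs were not used to derive any of the given outputs?"""
    used = set().union(*(input_dependencies(o) for o in outputs))
    return WORKFLOW_INPUTS - used


def actors_involved(item):
    """Which actors were involved in creating this item?"""
    items = lineage(item) | {item}
    return {PROVENANCE[i][0] for i in items if i in PROVENANCE}
```

With this log, the consensus tree depends on seq1, seq2 and seq3, seq4 is reported as unused, and the actors involved are Align, InferTree and Consensus.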

40
Provenance in the COMAD Framework
Without Provenance
  • COMAD in a nutshell
  • look at an unstructured XML token stream
  • through the scope expression lenses
  • thus seeing nested, tagged data collections
  • pass the buck on stuff that's not yours
  • add computed data (workflow actors do that)
  • add provenance metadata (COMAD framework does
    that)

With Provenance
41
Kepler Collaboration
  • Open-source
  • Builds on Ptolemy II from UC Berkeley
  • Contributors from
  • SEEK
  • SciDAC SDM
  • Ptolemy
  • GEON
  • ROADNet
  • Resurgence
  • AToL CIPRES, POD
  • Goals
  • Create powerful analytical tools that are useful
    across disciplines
  • Ecology, Biology, Engineering, Geology, Physics,
    Chemistry, Astronomy,

Ptolemy II
Natural Diversity Discovery Project
42
Coming up: Kepler Actor Repository
43
Workflow and component repositories myExperiment
is your library, or Our
Workflow Repository!
  • Taverna Repository Kepler
    Repository
  • Workflow system speciation,
  • looking for a new symbiotic relation --- need
    enzymes!
  • (yes, that's for the rest of us)

44
Acknowledgements, Q&A
  • Data and Knowledge Systems Lab (DAKS) @ UC Davis
  • Dr. Shawn Bowers, Dr. Timothy McPhillips, Dr.
    Norbert Podhorszki
  • Dave Thau, Daniel Zinn, Alex Chen
  • Many Kepler collaborators
  • Ilkay Altintas (SDSC), Matt Jones (UCSB), Arie
    Shoshani (LBL), Terence Critchlow (LLNL), Mladen
    Vouk (NCSU),

45
Further Reading
  • Special issues
  • General: SIGMOD Record, Sept. '05
  • SWF Provenance: CCPE, on 1st Provenance
    Challenge WS
  • Oldies but goldies
  • The Emperor's Old Clothes, Tony Hoare, ACM Turing
    Award Lecture
  • The Evolution of Language (a slide by Peter
    Buneman on Phil Wadler's website, I think)
  • Additional References
  • next page

46
Some Related Publications
  • Semantic Type Annotation
  • S Bowers, B Ludaescher. A Calculus for
    Propagating Semantic Annotations through
    Scientific Workflow Queries. ICDE Workshop on
    Query Languages and Query Processing (QLQP),
    LNCS, 2006.
  • S Bowers, B Ludaescher. Towards Automatic
    Generation of Semantic Types in Scientific
    Workflows. International Workshop on Scalable
    Semantic Web Knowledge Base Systems (SSWS), WISE
    2005 Workshop Proceedings, LNCS, 2005.
  • C Berkley, S Bowers, M Jones, B Ludaescher, M
    Schildhauer, J Tao. Incorporating Semantics in
    Scientific Workflow Authoring. SSDBM, 2005.
  • B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B
    Brodaric, C Baru. Managing Scientific Data: From
    Data Integration to Scientific Workflows. GSA
    Today, Special Issue on Geoinformatics, 2006.
  • S Bowers, D Thau, R Williams, B Ludaescher. Data
    Procurement for Enabling Scientific Workflows: On
    Exploring Inter-Ant Parasitism. VLDB Workshop on
    Semantic Web and Databases (SWDB), 2004.
  • S Bowers, K Lin, B Ludaescher. On Integrating
    Scientific Resources through Semantic
    Registration. SSDBM, 2004.
  • S Bowers, B Ludaescher. An Ontology-Driven
    Framework for Data Transformation in Scientific
    Workflows. International Workshop on Data
    Integration in the Life Sciences (DILS), LNCS,
    2004.
  • S Bowers, B Ludaescher. Towards a Generic
    Framework for Semantic Registration of Scientific
    Data. International Semantic Web Conference
    Workshop on Semantic Web Technologies for
    Searching and Retrieving Scientific Data, 2003.
  • Workflow Design and Modeling
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), 2006, to appear.
  • S Bowers, T McPhillips, B Ludaescher, S Cohen, SB
    Davidson. A Model for User-Oriented Data
    Provenance in Pipelined Scientific Workflows.
    International Provenance and Annotation Workshop
    (IPAW), LNCS, 2006.
  • S Bowers, B Ludaescher, AHH Ngu, T Critchlow.
    Enabling Scientific Workflow Reuse through
    Structured Composition of Dataflow and
    Control-Flow. IEEE Workshop on Workflow and Data
    Flow for Scientific Applications (SciFlow), 2006.
  • S Bowers, B Ludaescher. Actor-Oriented Design of
    Scientific Workflows. International Conference on
    Conceptual Modeling (ER), LNCS, 2005.
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • Kepler
  • D Pennington, D Higgins, AT Peterson, M Jones, B
    Ludaescher, S Bowers. Ecological Niche Modeling
    using the Kepler Workflow System. Workflows for
    e-Science, Springer-Verlag, to appear.
  • W Michener, J Beach, S Bowers, L Downey, M Jones,
    B Ludaescher, D Pennington, A Rajasekar, S
    Romanello, M Schildhauer, D Vieglais, J Zhang.
    SEEK: Data Integration and Workflow Solutions for
    Ecology. Workshop on Data Integration in the Life
    Sciences (DILS), LNCS, 2005.
  • S Romanello, W Michener, J Beach, M Jones, B
    Ludaescher, A Rajasekar, M Schildhauer, S Bowers,
    D Pennington. Creating and Providing Data
    Management Services for the Biological and
    Ecological Sciences: Science Environment for
    Ecological Knowledge. SSDBM, 2005.

47
The Diversity & Unity of Science
Natural Sciences

Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses Understanding, Prediction
in vivo, in vitro, in situ, in silico,
compute-intensive
structurally & semantics-intensive
data-intensive
metadata-intensive