Scientific Workflows: Research Opportunities for the Practically-Oriented Theoretician

1
Scientific Workflows: Research Opportunities for
the Practically-Oriented Theoretician
  • Bertram Ludäscher
  • Dept. of Computer Science
  • Genome Center
  • University of California, Davis
  • ludaesch_at_ucdavis.edu

2
SUMMARY (first things first, but not
necessarily in that order)
  • Motivation: (e-)Science today is data-driven
  • Scientific workflows are CI "upper-ware" for
    e-Science
  • Scientific workflows are ubiquitous
    ("every-ware")
  • ... and there are many interesting technical
    challenges:
  • wf modeling and design
  • wf execution models
  • semantic extensions, semantic type propagation
  • wf optimization
  • data (& workflow) provenance
  • Some early solutions, but HELP NEEDED!

3
All Science is Physics or Stamp Collecting
4
Science has been changing lately
  • "All science is either physics or stamp
    collecting."
  • Ernest Rutherford, British chemist &
    physicist (1871 - 1937)
  • J. B. Birks, "Rutherford at Manchester" (1962)
  • That is, from few data, lots of thinking ...
  • ... to LOTS OF DATA and ANALYSIS
  • → Data-driven scientific discovery!

5
The Diversity & Unity of Science
Natural Sciences

Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses, Understanding, Prediction, ...
in vivo, in vitro, in situ, in silico, ...
Data-, Knowledge-, Workflow-Management is
central to most of them!
[Diagram labels: compute-intensive; structurally &
semantics-intensive; data-intensive;
metadata-intensive]
6
e-Science (UK) and Cyberinfrastructure (US)
  • "e-Science is about global collaboration in key
    areas of science and the next generation of
    computing infrastructure that will enable it."
  • Sir John Taylor, Director Office of Science and
    Technology, UK
  • "Cyberinfrastructure is the coordinated aggregate
    of software, hardware and other technologies, as
    well as human expertise, required to support
    current and future discoveries in science and
    engineering. The challenge of Cyberinfrastructure
    is to integrate relevant and often disparate
    resources to provide a useful, usable, and
    enabling framework for research and discovery
    characterized by broad access and 'end-to-end'
    coordination."
  • Fran Berman, San Diego Supercomputer Center, UCSD

7
Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
  • new development at the intersection of computer
    science and the sciences: a leap from the
    application of computing to support scientists to
    do science (i.e. computational science) to
    the integration of computer science concepts,
    tools and theorems into the very fabric of
    science. We believe this development
    represents the foundations of a new revolution in
    science
  • we believe computer science is poised to become
    as fundamental to biology as mathematics has
    become to physics
  • to understand cells and cellular systems
    requires viewing them as information processing
    systems, as evidenced by the fundamental
    similarity between molecular machines of the
    living cell and computational automata, and by
    the natural fit between computer process algebras
    and biological signalling and between
    computational logical circuits and regulatory
    systems in the cell
  • We highlight that an immediate and important
    challenge is that of end-to-end scientific data
    management, from data acquisition and data
    integration, to data treatment, provenance and
    persistence.
  • dramatic in its impact, will be the integration
    of new conceptual and technological tools from
    computer science into the sciences.

8
Scientific Information Integration
  • Traditional Information (& Data) Integration
  • syntactic & structural heterogeneities, schema
    mappings, schema matching, query rewriting
    (GAV, LAV, Chase), ...
  • dealing with fundamentally same kind of
    information
  • that happens to be represented differently,
    incompletely,
  • find the correct, best way to integrate
    different representations
  • Scientific Information Integration (SII)
  • has the traditional II as a small (but very
    important) piece
  • but often deals with combining fundamentally
    different info
  • not a single correct / best way to integrate
  • SII invokes scientific theories or models that
    cannot be inferred from the data / schema
    (ontologies may help though)
  • → "joining" of data, "chaining" of tools is in
    the scientist's head!
  • Scientific Workflows can provide the end-to-end
    framework

9
Types of Information Integration
  • Conventional information integration
  • schema-based
  • view-based
  • at the data-level
  • Spatial (co-)registration/overlay of different
    data
  • from 2D, 3D, 4D (x,y,z,t), (4+n)D → GIS
  • Extended DI approaches using ontologies
  • controlled vocabularies, metadata, annotations
  • Scientific Information Integration
  • data + process/application integration
  • → scientific workflows
  • can include all the others and
  • statistics, data mining, visualization, ...

10
Assembling the Tree of Life (AToL)
All organisms (alive or extinct) are part of one
large, genetically connected group: Life on
Earth. Major subgroups: Eubacteria, Archaea, and
Eukaryotes, further divided into hierarchically
nested subgroups; e.g., eukaryotes contain
plants, animals, fungi; animals contain
sponges, cnidarians, Bilateria; Bilateria
contain arthropods, molluscs, nematodes, etc.
11
Inferring a phylogenetic tree from disparate data
[Figure: three datasets (aligned DNA sequences,
discrete morphological data, continuous characters)
feed actors that infer a maximum likelihood tree
(DNA), a maximum parsimony tree, and a maximum
likelihood tree (continuous characters); an
Integrate actor combines them into consensus
tree(s).]
12
Pipelined workflow for inferring phylogenetic
trees
Author Tim McPhillips, UC Davis
13
How is this different from good old data
integration?
  • Some white-box actors (queries, XML
    transformations), ...
  • ... but many black-box actors (R call, WS-call,
    built-ins)
  • ... and grey-box actors (nested subworkflows)
  • Transformation & analysis pipelines (cf. ETL)
  • different Models of Computation (MoCs)
  • DAG(man)-ish, SDF, Kahn process networks, ...
  • hence different Models of Provenance (MoPs)
  • could use semantic extensions (semantic types)
  • could try to optimize / rewrite (depends on MoC,
    ...)

14
Scientific Workflows Cyberinfrastructure
UPPER-WARE
15
Scientific workflows are CI "upper-ware", i.e.
the scientist's way to harness
cyberinfrastructure
  • Domain Scientist's View
  • Q: When is CI (middle-ware, under-ware) good?
  • A: When I can't see it!
  • Q: When is a scientific workflow tool (CI
    upper-ware) good?
  • A: When I can get more, new, faster, better
    science done!
  • Workflow Engineer's View
  • How can I (help the scientist) design & implement
    the desired wfs?
  • How does wf make my life easier? Is there life
    beyond Perl & Python?
  • Choice of platforms, standards; reuse of existing
    tools, semantic extensions, scheduling on the
    Grid?
  • How do I make all of this robust, fault-tolerant,
    etc.?
  • Computer Scientist's View
  • workflow modeling & design, static analysis,
    optimization, theoretical limits: what can /
    can't be done
  • The quest for the right models & languages
  • The holy grail of eScience: Join the Quest!

16
Scientific Workflows are EVERY-WARE Völker,
höret die Signale! (Then come comrades rally )
  • Wainer, Weske, Vossen, Bauzer-Medeiros.
    Scientific workflow systems. NSF Workshop on
    Workflow and Process Automation in Information
    Systems, May 1996
  • Anastassia Ailamaki, Yannis E. Ioannidis, Miron
    Livny. Scientific Workflow Management by Database
    Management. SSDBM 1998
  • Workflow in Grid Systems, GGF-10 Berlin, March
    2004
  • Data Integration in the Life Sciences workshop
  • DILS'04 (Leipzig, Germany), DILS'05 (San Diego,
    California Republic),
  • DILS'06 (Cambridge, UK), DILS'07 (U Penn)
  • SIGMOD-Record on Scientific Workflows, Sept. 2005
  • IEEE Workshop on Workflow and Data Flow for
    Scientific Applications (SciFlow06), w/ ICDE,
    Atlanta, April 2006
  • NSF Workshop on Challenges of Scientific
    Workflows, Arlington May 2006
  • Microsoft eScience Workshop, Johns Hopkins Univ,
    Oct 2006
  • Scientific Workflows and Business workflow
    standards in e-Science, Amsterdam 12/06
  • 2nd Intl. Workshop on Workflow Systems in
    e-Science (w/ Intl. Conf. on Computational
    Science), May 2007, Beijing
  • Workflows for eScience (book)
  • Taylor, Deelman, Gannon, Shields, editors,
    Springer 2006

17
Some Research Challenges
  • Goal: helping scientists and workflow engineers
    in SII
  • ... to optimize the human resource
  • workflow modeling & design
  • software engineering, query optimization, type
    inference
  • rich provenance support
  • data models, computation models, query languages
  • use/exploit semantic information, static analysis
  • type inference, automated deduction
  • ... and to optimize system resources
  • resource scheduling, distributed execution, ...
  • cost models, scheduling, distributed computing

18
Scientific Workflow
  • Capture how a scientist works with data and
    analytical tools
  • data access, transformation, analysis,
    visualization
  • possible worldview: dataflow-oriented (cf.
    signal-processing)
  • Scientific workflow (wf) benefits (compare w/
    script-based approaches)
  • wf automation
  • wf component reuse
  • wf design, documentation
  • wf archival, sharing
  • built-in concurrency (task-, pipeline-parallelism)
  • built-in provenance support
  • distributed execution (Grid) support

19
Ex: SEEK Ecological Niche Modeling Pipeline
  • Scientific Workflow paradigm
  • Reusable components (actors): a scientist's
    "verbs"/actions
  • Top-level workflows: conceptual representation
    of the science process, "sentences" in the
    scientist's language
  • Sub-workflows: increasing levels of detail
  • Separation of concerns
  • actors: what to do
  • parameters: configurable behavior
  • channels: dataflow, pipeline composition
  • directors: fix execution model, scheduling
  • semantic types: smart discovery, linking

D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
20
Simple Kepler workflow using R (a statistics
package)
21
Plumbing with Style (Norbert Podhorszki UC
Davis, Scott Klasky ORNL)
Monitor
  • Plasma physics simulation on 2048 processors on
    Seaborg @ NERSC (LBL)
  • Gyrokinetic Toroidal Code (GTC) to study energy
    transport in fusion devices (plasma
    microturbulence)
  • Generating 800GB of data (3000 files, 6000
    timesteps, 267MB/timestep), 30 hour simulation
    run
  • Under workflow control
  • Monitor (watch) simulation progress (via remote
    scripts)
  • Transfer from NERSC to ORNL concurrently with the
    simulation run
  • Convert each file to an HDF5 file
  • Archive files in 4GB chunks into HPSS

22
Kepler and Sensor Networks
  • These ones just in (new NSF CEOP projects)
  • Management and Analysis of Environmental
    Observatory Data using the Kepler Scientific
    Workflow System, NCEAS, SDSC, UC Davis, OSU,
    CENS (UCLA), OPeNDAP
  • standardize services for sensor networks, support
    multiple views, protocols
  • COMET: Coast-to-Mountain Environmental Transect,
    UC Davis, Bodega Marine Lab, Lake Tahoe Research
    Center
  • study how environmental factors affect ecosystems
    along an elevation gradient from coastal
    California to the summit of the Sierra Nevada

CEOP/COMET
CEOP/Kepler
23
Workflow Thinking (cf. Computational
Thinking)
  • How should we think about scientific workflows?
  • From "What scientists do to produce scientific
    papers" ...
  • ... to "it's just a program"
  • Is that helpful?
  • Depends on who you ask! What are you trying to
    do?
  • (Domain) Scientist ↔ Workflow Engineer ↔ Computer
    Scientist

24
Our Starting Point: Actor-Oriented Modeling
  • Ports
  • each actor has a set of input and output ports
  • denote the actor's signature
  • produce/consume data (a.k.a. tokens)
  • parameters are special static ports

25
Actor-Oriented Modeling
  • Dataflow Connections
  • unidirectional actor communication channels
  • connect output ports with input ports
  • for composing analysis pipelines

26
Actor-Oriented Modeling
  • Sub-workflows / Composite Actors
  • composite actors wrap sub-workflows
  • like actors, have signatures (i/o ports of
    sub-workflow)
  • hierarchical workflows (arbitrary nesting levels)

27
Actor-Oriented Modeling
  • Directors
  • define the execution semantics of workflow graphs
  • execute the workflow graph (according to some
    schedule)
  • sub-workflows may have different directors
  • promotes reusability
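
The actor-oriented vocabulary of the last four slides (ports, parameters, channels, directors) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kepler or Ptolemy II API: the classes Actor, Workflow, and DagDirector and the three toy actors are hypothetical names invented here, and the director simply fires every actor once in topological order.

```python
from collections import deque


class Actor:
    """A component with named input/output ports and static parameters."""

    def __init__(self, name, inputs, outputs, fire, **params):
        self.name = name
        self.inputs = list(inputs)     # input port names
        self.outputs = list(outputs)   # output port names
        self.fire = fire               # fire(inputs: dict, params: dict) -> dict
        self.params = params           # parameters act like static ports


class Workflow:
    """Actors plus unidirectional channels from output ports to input ports."""

    def __init__(self):
        self.actors = {}
        self.channels = []             # (src actor, out port, dst actor, in port)

    def add(self, actor):
        self.actors[actor.name] = actor
        return actor

    def connect(self, src, out_port, dst, in_port):
        assert out_port in src.outputs and in_port in dst.inputs
        self.channels.append((src.name, out_port, dst.name, in_port))


class DagDirector:
    """Execution semantics: fire each actor once, in topological order."""

    def run(self, wf):
        pending = {name: 0 for name in wf.actors}
        for _, _, dst, _ in wf.channels:
            pending[dst] += 1
        ready = deque(name for name, count in pending.items() if count == 0)
        tokens, results = {}, {}
        while ready:
            name = ready.popleft()
            actor = wf.actors[name]
            ins = {port: tokens[(name, port)] for port in actor.inputs}
            results[name] = actor.fire(ins, actor.params)
            for src, out_port, dst, in_port in wf.channels:
                if src == name:
                    tokens[(dst, in_port)] = results[name][out_port]
                    pending[dst] -= 1
                    if pending[dst] == 0:
                        ready.append(dst)
        return results


wf = Workflow()
source = wf.add(Actor("source", [], ["out"],
                      lambda ins, p: {"out": p["values"]}, values=[1, 2, 3]))
scale = wf.add(Actor("scale", ["in"], ["out"],
                     lambda ins, p: {"out": [p["k"] * x for x in ins["in"]]}, k=10))
total = wf.add(Actor("total", ["in"], ["out"],
                     lambda ins, p: {"out": sum(ins["in"])}))
wf.connect(source, "out", scale, "in")
wf.connect(scale, "out", total, "in")
results = DagDirector().run(wf)
```

Swapping DagDirector for a different director class would change how the same graph is executed without touching the actors, which is the reuse argument made on the slide.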

28
Models of Computation (A Wf Engineer's Issue)
  • Directors separate the concerns of orchestration
    and scheduling from conceptual design
  • Synchronous Dataflow (SDF)
  • Statically analyzable schedule, no deadlocks,
    fixed buffer requirements; executable as a single
    thread by the director.
  • Process Networks (PN)
  • Generalizes SDF. Actors execute as separate
    threads/processes, with queues of unbounded size
    (Kahn/MacQueen networks).
  • Directed Acyclic Graph (DAG)
  • Special case of SDF. No loops, no pipelining.
  • Continuous Time (CT)
  • Connections represent the value of a continuous
    time signal at some point in time ... Often used
    to model physical processes.
  • Discrete Event (DE)
  • Actors communicate through a queue of events in
    time. Used for instantaneous reactions in
    physical systems.
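
Why an SDF schedule is statically analyzable: every actor declares fixed token production and consumption rates, so the number of firings per schedule iteration follows from the balance equations r[src] * produced = r[dst] * consumed on every channel. The sketch below is a hedged illustration, not a director implementation: the function name and the example rates are assumptions, the graph is assumed to be connected, and initial tokens and deadlock checks are ignored.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations.

    edges: list of (producer, tokens_produced, consumer, tokens_consumed).
    Returns the smallest positive integer firing counts, or raises ValueError
    if the rates are inconsistent (no bounded-buffer periodic schedule).
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates along channels
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, produced, dst, consumed in edges:
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError("inconsistent rates")
    scale = 1                            # least common multiple of denominators
    for r in rate.values():
        scale = scale * r.denominator // gcd(scale, r.denominator)
    return {actor: int(r * scale) for actor, r in rate.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
firings = repetition_vector(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)])
```

With these rates A fires 3 times, B twice, and C once per iteration, so buffer sizes are known before the workflow runs.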

29
Everything is a service / actor
( yeah right)
30
Scientific Workflow Design Challenges
And that's why our scientific workflows are
much easier to develop, understand and maintain!
31
Shimology Part 1: Structure & Semantics
  • Components and their i/o ports typically have
  • Explicit structural type
  • e.g., int, float, string, array ... of
    double, ...
  • Implicit semantic type
  • Not sure whether the stream of values from a port
    represents rainfall values or body size values

32
Semantic Annotation
  • Label data with semantic types
  • Label inputs and outputs of analytical components
    with semantic types (and overall component
    function)

33
Semantic Type Annotation in Kepler
  • Component input and output port annotation
  • a port can be annotated with multiple concepts
    from multiple ontologies
  • Annotations are stored with the actor metadata

34
Component Annotation and Indexing
  • Component Annotations
  • New components can be annotated and indexed into
    the actor library
  • (specializing generic actors)
  • Existing components can be revised, annotated,
    and indexed (hiding previous versions)

35
Smart Discovery
  • Find a component (here an actor) in different
    locations (categories)
  • based on the semantic annotation of the
    component (or its ports)

36
Smart Linking (Workflow Design)
  • Statically perform semantic and structural type
    checking
  • Navigate errors and warnings within the workflow
  • !! Search for and insert adapters (aka shims) to
    fix (structural and semantic) mismatches
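
A minimal sketch of what static "smart linking" could look like, assuming each port carries one structural type and one semantic concept. The ontology, the concept names, the shim registry, and the function names are all made up for illustration; Kepler's actual annotation and type-checking machinery is richer (for example, a port may carry several concepts from several ontologies), and only structural shims are handled here.

```python
from dataclasses import dataclass

# Toy ontology as child -> parent ("is-a") links; concept names are hypothetical.
IS_A = {
    "Rainfall": "Precipitation",
    "Precipitation": "Measurement",
    "BodySize": "Measurement",
}

# Hypothetical adapter registry: (from structural type, to structural type) -> shim.
SHIMS = {
    ("float", "array<float>"): "WrapInArray",
    ("int", "float"): "IntToFloat",
}


@dataclass(frozen=True)
class Port:
    structural: str   # explicit structural type, e.g. "float"
    semantic: str     # semantic type: a concept from the ontology


def subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or is a descendant of it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def check_link(out_port, in_port):
    """Return ("ok", None), ("shim", adapter_name) or ("error", reason)."""
    if not subsumed_by(out_port.semantic, in_port.semantic):
        return ("error", f"{out_port.semantic} is not a {in_port.semantic}")
    if out_port.structural == in_port.structural:
        return ("ok", None)
    shim = SHIMS.get((out_port.structural, in_port.structural))
    if shim is not None:
        return ("shim", shim)
    return ("error", f"no adapter from {out_port.structural} to {in_port.structural}")


rain = Port("float", "Rainfall")
ok = check_link(rain, Port("float", "Precipitation"))
needs_shim = check_link(rain, Port("array<float>", "Measurement"))
mismatch = check_link(Port("float", "BodySize"), Port("float", "Precipitation"))
```

The third link is structurally fine (float to float) but semantically wrong, which is exactly the case a purely structural type checker would miss.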

37
Smart Linking (addressing Shimology
Type 1)
Source: Bowers-Ludaescher, DILS'04
38
CS Challenge: Hybrid (semantic & structural) Types
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
39
CS Challenge: Propagating Semantic Types
  • Creating semantic annotations is difficult
  • Potentially large numbers of derived data
    products
  • Thousands of workflow components
  • Getting it right can be difficult for the
    domain scientist
  • → Annotation Propagation

[Figure: Forward Propagation automatically derives
annotations downstream from given annotations;
Backward Propagation automatically derives
annotations upstream.]
40
CS Research Problems in Propagation
  • Computing Forward and Backward Propagation
  • Under different schema constraint languages
  • What can and cannot be computed
  • Approximate what cannot be computed
  • Algorithms for propagation through a single actor
  • Algorithms for propagation through an entire
    workflow

join:
Biom1(ob, yr, seas, plt, spp, bm) :-
Biom(ob, yr, seas, plt, spp, bm), Sscd(spp).
aggregation:
Biom2(yr, plt, spp, z = sum(bm)) :-
Biom1(ob, yr, seas, plt, spp, bm).
union:
Biom3(yr, plt, spp, 1) :- Biom2(yr, plt,
spp, bm), bm > 0.
Biom3(yr, plt, spp, 0) :-
Biom2(yr, plt, spp, bm), bm < 0.
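
The Datalog-style rules on this slide describe a three-step pipeline (join, aggregation, union). As a rough reading aid, the same steps can be written as plain Python over tuples. This is a sketch under assumptions: the sample rows are invented, the aggregation is read as "sum of bm per (yr, plt, spp)", and the comparisons bm > 0 and bm < 0 are taken as transcribed.

```python
from collections import defaultdict


def biom1(biom, sscd):
    """Join: keep Biom rows whose species also appears in Sscd."""
    species = {spp for (spp,) in sscd}
    return [row for row in biom if row[4] in species]


def biom2(biom1_rows):
    """Aggregation: total biomass per (yr, plt, spp)."""
    totals = defaultdict(float)
    for ob, yr, seas, plt, spp, bm in biom1_rows:
        totals[(yr, plt, spp)] += bm
    return [(yr, plt, spp, z) for (yr, plt, spp), z in totals.items()]


def biom3(biom2_rows):
    """Union of two rules: flag 1 where bm > 0, flag 0 where bm < 0."""
    positive = [(yr, plt, spp, 1) for yr, plt, spp, bm in biom2_rows if bm > 0]
    negative = [(yr, plt, spp, 0) for yr, plt, spp, bm in biom2_rows if bm < 0]
    return positive + negative


# Invented sample data: (ob, yr, seas, plt, spp, bm)
biom = [
    (1, 2005, "spring", "p1", "sppA", 2.0),
    (2, 2005, "fall", "p1", "sppA", 3.0),
    (3, 2005, "spring", "p1", "sppB", 1.0),
    (4, 2005, "spring", "p2", "sppA", -1.0),
]
sscd = [("sppA",)]
presence = biom3(biom2(biom1(biom, sscd)))
```

Each function corresponds to one actor whose behavior is described by a query, which is the kind of metadata that annotation propagation relies on.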
41
Propagation via Query Expressions
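[Figure: a workflow step (actor) described by a
query q maps source schema S to target schema T;
annotations α (on S) and α' (on T) link the schemas
to the ontology O. Forward propagation: α' = α(q⁻¹).
Backward propagation: α = α'(q).]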
  • To propagate, we need information about the actor
  • The function of an actor is given by a query q: S
    → S'
  • q is a special kind of metadata, possibly an
    approximation
  • q maintains input-to-output structural
    associations
  • Propagation as annotation and query composition
42
Results on S-T Finite Dependencies (Fagin et al)
  • Full dependencies Lfull (e.g., ?/??, ?, ?/??,
    ?): ∀x φ(x) → ψ(x)
  • Embedded dependencies Lem (e.g., ??): ∀x φ(x) →
    ∃y ψ(x, y)
  • Skolemized dependencies LSko
  • ∃f ∀x φ(x), χ(x) → ψ(x)
  • Composition (we want L?(Lq?) ? L? )
  • Lfull(Lfull) ? Lfull Lfull(Lem) ? Lfull
  • Lem(Lfull) ? Lem Lem(Lem) ? Lem
  • LSko(LSko) ? LSko
  • In general, annotations take the form of
    embedded (or Skolemized) s-t dependencies

43
Example queries and annotations
[Figure: Actor A reads R1(o, x, y, t, v) and
R2(u, p) and writes S(o, x, y, v, u, p), computed
by the query
q = π_{o,x,y,v}(σ_{t=c}(R1)) ⋈ σ_{u=d}(R2).]
  • Forward propagation
  • α1: R1(o, x, y, t, v) → Observation(o) ∧
    hasVal(o, v)
  • α2: R2(u, p) → Site(u) ∧ Species(p) ∧
    observedIn(p, u)
  • α' = α(q⁻¹) where α = α1 ∪ α2
  • Backward propagation
  • α': S(o, x, y, v, u, p) → Observation(o) ∧
    hasVal(o, v) ∧ Species(p)
  • α = α'(q)

44
Open Problem (for now)
How does reasoning with logic constraints (the
Chase, FO-resolution) relate to composition of
relational mappings (Fagin et al.)?
45
Another (Partially Solved) Reasoning Problem
47
The Concept Problem in Taxonomy
  • For information integration
  • (e.g. compute combined abundance)
  • need to know how X (Benson48) relates to
    Y (Kartesz04)!!
  • A 3rd taxon authority may state this relation!

48
becomes a Reasoning Problem in Taxonomy
  • Peet05 articulates relation between Benson48 and
    Kartesz04 names
  • Is that articulation consistent?
  • Can we infer additional information?

49
Approach: Potential Taxon Graph (Berendsohn et
al)
  • Got FO reasoning?

50
Maximal Tractable Subclasses R28 ,
5
51
Scientific Workflow Design: More Challenges
And that's why our scientific workflows are
much easier to develop, understand and maintain!
52
Behold the Beauty of Scientific Workflow Design
Author Kristian Stevens, UC Davis
53
Shimology Part 2: the ugly truth inside
Author Kristian Stevens, UC Davis
54
A Simple Motivating Example
  • Take the services (actors, components) in (a) ...
  • ... and chain them together in a scientist-friendly
    form a la (b)
  • ... considering the following signatures (cf.
    Haskell, ML, ...)
  • (c) BLAST :: DNA → [DNA]
  • (d) MotifSearch :: DNA → [Motif]
  • (e) MotifSearch ∘ BLAST = λx.
    MotifSearch(BLAST(x))
  • oops: (e) is not type correct; note the
    signatures of (c) and (d)!
  • a neat solution: implicit or explicit iteration /
    map(f)[x1, ..., xn]
  • cf. Kepler and Taverna, Kepler solutions

55
Extended Example Workflow Evolution
  • (a) → (b): replace A: a → b with A: a → [b]
  • need to call B iteratively, i.e. wrap B inside a
    component or add control-flow
  • (b) → (c): upstream produces [a], [a], ...
    instead of a, a, ...
  • (d): need to bypass data components since B
    can't handle d's
  • This gets messy quickly ...

56
A Realistic Example (ChIP-chip workflow)
57
But how do we get from messy to neat reusable
designs?
58
The Answer (YMMV)
  • Collection-Oriented Modeling & Design (COMAD)
  • embrace the assembly line metaphor fully
  • → cf. Flow-based Programming (J. Morrison)
  • data: tagged nested collections
  • e.g. represented as flattened, pipelined
    (XML) token streams

59
How does COMAD work?
  • Some COMAD principles
  • data: tagged, flattened, nested collections
    (token streams)
  • data tokens
  • metadata tokens
  • inherited downwards into (sub)collections
  • define an actor's read scope via an (X)Path-like
    expression
  • default actor behavior
  • not mine?
  • → don't do anything, just pass the buck!
  • stuff within my scope? →
  • add-only to it (default)
  • consume scope & write out result
  • (... but remember the bypass!)
  • iteration scope is a query involving group-by and
    further refines the granularity/subtrees that
    constitute the tokens consumed by an actor firing
  • has aspects of implicit iteration (a la Taverna)
  • default iteration level to fix signature
    mismatches
  • but also
  • granularity/grouping is definable
  • works on anything (assuming scope is matched
    correctly)
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), 2006
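
The default COMAD actor behavior described above (pass through what is out of scope, add-only inside the scope) can be sketched as a generator over a flattened token stream. The token encoding ("open"/"data"/"close" tuples), the function name, and its parameters are assumptions made for this illustration; the read scope here is just a collection label rather than an (X)Path-like expression.

```python
def comad_actor(stream, scope, in_label, compute, out_label):
    """Stream tokens through; add one derived data token to each `scope` collection.

    Tokens: ("open", label), ("data", label, value), ("close", label).
    """
    open_collections = []                      # stack of (label, collected values)
    for token in stream:
        kind = token[0]
        if kind == "open":
            open_collections.append((token[1], []))
            yield token
        elif kind == "data":
            for label, values in open_collections:
                if label == scope and token[1] == in_label:
                    values.append(token[2])
            yield token                        # add-only: inputs stay in the stream
        else:                                  # "close"
            label, values = open_collections.pop()
            if label == scope and values:
                yield ("data", out_label, compute(values))
            yield token


stream = [
    ("open", "Project"),
    ("open", "Sample"),
    ("data", "reading", 1.0),
    ("data", "reading", 3.0),
    ("close", "Sample"),
    ("open", "Note"),                          # not in scope: passed through untouched
    ("data", "text", "n/a"),
    ("close", "Note"),
    ("close", "Project"),
]
out = list(comad_actor(stream, scope="Sample", in_label="reading",
                       compute=lambda vs: sum(vs) / len(vs), out_label="mean"))
```

Because the actor never looks at the "Note" collection, a downstream actor that does care about it still receives it, and changing the `scope` argument retargets the same actor to a different nesting level.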

60
COMAD: What we gained
  • from fragile, messy workflow designs ...
  • ... to more reusable actors
  • just change the read/iteration scope parameters!
  • sometimes not even that is needed (working on
    that ...)
  • ... and cleaner workflow design (the "A-B-C method"
    of wf design!)
  • Crux: keep the nesting structure of data (pass
    through, add-only)
  • ... and let it drive the (semi-)implicit iteration
    (aka structural recursion ...)
61
COMAD Optimization Potential
  • When is it worth bypassing data?

62
Challenge: Modeling & Design Paradigms
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-oriented Modeling & Design framework
    (COMAD)

The limitations of my modeling language are the
limitations of my design world. BL
63
A Scientific Publication (the final
provenance frontier )
Title (Statement, Theorem)
Abstract (1st-Level- Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006),
doi:10.1038/nature05113. Received 27 June 2006;
Accepted 25 July 2006; Published online 16 August
2006
some metadata
64
More Evidence
data reference
type of evidence
tool reference
trust me on this one
  • provenance/data lineage: show the history and
    evidence
  • related to proof trees
  • unlike w/ scripts, a SWF system can keep track of
    what happened
  • In the future: deposit your data & workflows in a
    repository

65
Pipelined workflow for inferring phylogenetic
trees
Author Tim McPhillips, UC Davis
66
Scientific Provenance Questions we can ask
  • What DNA sequences were input to the workflow?
  • What phylogenetic trees were output by the
    workflow?
  • What DNA sequences input to the workflow does
    this consensus tree depend on?
  • What input sequences were not used to derive any
    output consensus trees?
  • What was the sequence alignment (key intermediate
    data) used in the process of inferring this tree?
  • plus the usual: smart-rerun, VCR replay, ...
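
Most of the questions above reduce to reachability over recorded data dependencies. A hedged sketch, assuming the workflow system logged a simple "derived from" relation for one run; the item names and the shape of the log are invented for illustration and do not reflect Kepler's provenance store.

```python
def lineage(derived_from, item):
    """All items that `item` transitively depends on."""
    seen, stack = set(), [item]
    while stack:
        for parent in derived_from.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Invented dependency log of one run: product -> items it was computed from.
derived_from = {
    "alignment": ["seq1", "seq2", "seq3"],
    "tree_ml": ["alignment"],
    "tree_mp": ["alignment"],
    "consensus": ["tree_ml", "tree_mp"],
}
inputs = {"seq1", "seq2", "seq3", "seq4"}
outputs = {"consensus"}

# Which input sequences does this consensus tree depend on?
used_by_consensus = lineage(derived_from, "consensus") & inputs

# Which input sequences were not used to derive any output?
unused_inputs = inputs - set().union(*(lineage(derived_from, o) for o in outputs))

# Which alignment was used in inferring this tree?
alignments_used = {x for x in lineage(derived_from, "consensus")
                   if x.startswith("alignment")}
```

A script-based pipeline would have to log these dependencies by hand; a workflow system can record them as a side effect of moving tokens between actors.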

67
Provenance in the COMAD Framework
Without Provenance
With Provenance
68
Provenance for the WF Engineer / Plumber
  • A Workflow Engineer's View
  • Monitor, benchmark, and optimize workflow
    performance
  • Record resource usage for a workflow execution
  • Smart Re-run of (variants of) previous
    executions
  • Checkpointing & restart (e.g. for crash recovery,
    load balancing)
  • Debug or troubleshoot a workflow run
  • Explain when, where, why a workflow crashed

69
Provenance for Domain Scientists!
  • Query the lineage of a data product
  • from what data was this computed? (real
    dependencies please!)
  • Evaluate the results of a workflow
  • do I like how this result was computed?
  • Reuse data products of one workflow run in
    another
  • (re-)attach prior data products to a new workflow
  • Archive scientific results in a repository
  • Replicate the results reported by another
    researcher
  • Discover all results derived from a given dataset
  • i.e. across all runs
  • Explain unexpected results
  • via parameter-, dataset-, object-dependencies
    in the scientist's terms (yes, you may think
    "ontology" here ...)

70
Observables
  • Model of Computation (MoC) M
  • specification/algorithm to compute o = M(W,P,i)
  • a director or scheduler implements M
  • gives rise to formal notions of
  • computation (aka run) R, typically tree models
  • Model of Provenance (MoP) M'
  • approximation M' of M
  • a trace T approximates a run R by
    inclusion/exclusion of observables
  • T = R - Ignored-observables +
    Model-observables
  • Observables (of a MoC M)
  • functional observables (may influence output o)
  • token rate, notions of firing, ...
  • non-functional observables (not part of M, do not
    influence o)
  • token timestamp, size, ... (unless the MoC cares
    about those)
  • What is a good model of provenance? What is a
    good provenance schema?
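
One way to read "a trace approximates a run by inclusion/exclusion of observables" is as a projection over run events: drop the observables the model of provenance ignores and add the ones it models explicitly. The sketch below is an interpretation under that assumption; the event fields and the function name are hypothetical.

```python
def make_trace(run, ignored, modeled):
    """Build a trace from a run.

    run:     list of event dicts (observable name -> value)
    ignored: observable names the model of provenance drops
    modeled: {new observable name: function(event) -> value} to add
    """
    trace = []
    for event in run:
        record = {k: v for k, v in event.items() if k not in ignored}
        for name, derive in modeled.items():
            record[name] = derive(event)
        trace.append(record)
    return trace


# Invented run with one actor firing.
run = [{
    "actor": "Align", "firing": 1,
    "reads": ["seq1", "seq2"], "writes": ["aln1"],
    "timestamp": 12.5, "bytes": 2048,          # non-functional observables
}]
trace = make_trace(
    run,
    ignored={"timestamp", "bytes"},
    modeled={"depends": lambda e: [(w, r) for w in e["writes"] for r in e["reads"]]},
)
```

Choosing `ignored` and `modeled` differently yields a different model of provenance for the same model of computation, which is the design question the slide ends on.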

71
So what should we focus on?
  • What is the bottleneck in Scientific Workflows?
  • The human resource: workflow design support!
  • includes
  • new modeling paradigms (e.g. COMAD, FP, NRC, ...)
  • and data-orientation!
  • Business workflows
  • top-down, engineered, many times the same
  • Scientific workflows
  • bottom-up, exploratory, each time unique
  • Combine best of both
  • explore, capture, evolve!
  • workflow sharing and reuse

72
Workflow design: when was the last time ...
  • ... that we ate our own dog food?
  • Do we really want to use formalism X for
    scientist-oriented workflow design?
  • X in {Petri-net, π-calculus, process networks,
    Turing machines, BPEL4WS, ...}
  • What are the observables of approach/language X?
  • What does language X talk about, ignore, and
    allow in terms of analysis, understanding?
  • Example: Data Provenance in Scientific Workflows
  • T = R - I + M
  • Trace (MoP) = Run (MoC) - I(gnored) + M(odeled)

73
The Emperor's Old Clothes
  • Computer Science / "Thin" approach
  • Minimize to the max: Lambda calculus, Turing
    machines, Register machines, Petri nets, Kahn
    Process Networks, Relational Algebra & Calculus, ...
  • "Thick" approach
  • Algol68, PL/1, XML Schema, BPEL4WS, SQL, ... (bloat
    to the max?)
  • Premature optimization ...
  • ... is the root of all evil
  • Tony Hoare, Donald Knuth
  • Premature standardization ...
  • ... is the soil the root lives in

74
Consilience: The Unity of Knowledge (E. O. Wilson)
  • "Literally a jumping together of knowledge by the
    linking of facts and fact-based theory across
    disciplines to create a common groundwork for
    explanation." E. O. Wilson
  • eScience, Cyberinfrastructure: mechanisms to make
    progress
  • Scientific Workflows: crucial elements to get the
    most mileage out of CI to fuel eScience,
    accelerating knowledge discovery
  • Identify the real bottlenecks in this quest!
  • Need workflow engineers, computer scientists,
    bioinformaticians, hybrids!

75
The Holy Grail of eScience / Scientific Workflows
  • Evolution programmed us to enjoy certain things
  • We should feel lucky
  • the brain is so powerful a control system that
    self-consciousness emerged; now we also enjoy
    thinking
  • hence we've been asking provenance questions
    since the dawn of man (where from? where to? why?)
  • Science (and now eScience) yield answers
  • aside: so does religion, but only science is
    strongly constrained by reality
  • "We are an intelligent species and the use of our
    intelligence quite properly gives us pleasure. In
    this respect the brain is like a muscle. When we
    think well, we feel good. Understanding is a kind
    of ecstasy." Carl Sagan
  • Call to Arms/Ploughshares
  • Join the Quest for the right language for
    eScience & Workflow Thinking!

76
Conclusion
  • From Science to eScience via scientific workflows
  • Many interesting challenges & opportunities,
    e.g.,
  • quest for suitable models & languages for
    scientific workflows
  • support pipelining, nested collections,
    provenance, ...
  • exploit static analysis, type inference,
    provenance,
  • optimization
  • Examples
  • Propagating semantic types (logic inference,
    Chase, composition)
  • Efficient reasoning w/ taxon constraints in RCC-5
    subalgebra
  • Combining XML, streaming, XPath/CDuce, ... for
    COMAD
  • Wf optimization (bypass, scheduling, ...)
  • From MoCs to MoPs (Models of Provenance)
  • "Wir müssen wissen, wir werden wissen!" ("We must
    know, we will know!") (D. Hilbert)

77
Acknowledgements, Q&A
  • Data and Knowledge Systems Lab (DAKS) @ UC Davis
  • Dr. Shawn Bowers, Dr. Timothy McPhillips, Dr.
    Norbert Podhorszki
  • Dave Thau, Daniel Zinn, Alex Chen
  • Many Kepler collaborators
  • Ilkay Altintas (SDSC/UCSD), Matt Jones (UCSB),
    Arie Shoshani (LBL), Terence Critchlow (LLNL),
    Mladen Vouk (NCSU),

78
Some Related Publications
  • Semantic Type Annotation
  • S Bowers, B Ludaescher. A Calculus for
    Propagating Semantic Annotations through
    Scientific Workflow Queries. ICDE Workshop on
    Query Languages and Query Processing (QLQP),
    LNCS, 2006.
  • S Bowers, B Ludaescher. Towards Automatic
    Generation of Semantic Types in Scientific
    Workflows. International Workshop on Scalable
    Semantic Web Knowledge Base Systems (SSWS), WISE
    2005 Workshop Proceedings, LNCS, 2005.
  • C Berkley, S Bowers, M Jones, B Ludaescher, M
    Schildhauer, J Tao. Incorporating Semantics in
    Scientific Workflow Authoring. SSDBM, 2005.
  • B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B
    Brodaric, C Baru. Managing Scientific Data: From
    Data Integration to Scientific Workflows. GSA
    Today, Special Issue on Geoinformatics, 2006.
  • S Bowers, D Thau, R Williams, B Ludaescher. Data
    Procurement for Enabling Scientific Workflows: On
    Exploring Inter-Ant Parasitism. VLDB Workshop on
    Semantic Web and Databases (SWDB), 2004.
  • S Bowers, K Lin, B Ludaescher. On Integrating
    Scientific Resources through Semantic
    Registration. SSDBM, 2004.
  • S Bowers, B Ludaescher. An Ontology-Driven
    Framework for Data Transformation in Scientific
    Workflows. International Workshop on Data
    Integration in the Life Sciences (DILS), LNCS,
    2004.
  • S Bowers, B Ludaescher. Towards a Generic
    Framework for Semantic Registration of Scientific
    Data. International Semantic Web Conference
    Workshop on Semantic Web Technologies for
    Searching and Retrieving Scientific Data, 2003.
  • Workflow Design and Modeling
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), LNCS, 2006.
  • S Bowers, T McPhillips, B Ludaescher, S Cohen, SB
    Davidson. A Model for User-Oriented Data
    Provenance in Pipelined Scientific Workflows.
    International Provenance and Annotation Workshop
    (IPAW), LNCS, 2006.
  • S Bowers, B Ludaescher, AHH Ngu, T Critchlow.
    Enabling Scientific Workflow Reuse through
    Structured Composition of Dataflow and
    Control-Flow. IEEE Workshop on Workflow and Data
    Flow for Scientific Applications (SciFlow), 2006.
  • S Bowers, B Ludaescher. Actor-Oriented Design of
    Scientific Workflows. International Conference on
    Conceptual Modeling (ER), LNCS, 2005.
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • Kepler
  • D Pennington, D Higgins, AT Peterson, M Jones, B
    Ludaescher, S Bowers. Ecological Niche Modeling
    using the Kepler Workflow System. Workflows for
    e-Science, Springer-Verlag, to appear.
  • W Michener, J Beach, S Bowers, L Downey, M Jones,
    B Ludaescher, D Pennington, A Rajasekar, S
    Romanello, M Schildhauer, D Vieglais, J Zhang.
    SEEK: Data Integration and Workflow Solutions for
    Ecology. Workshop on Data Integration in the Life
    Sciences (DILS), LNCS, 2005.
  • S Romanello, W Michener, J Beach, M Jones, B
    Ludaescher, A Rajasekar, M Schildhauer, S Bowers,
    D Pennington. Creating and Providing Data
    Management Services for the Biological and
    Ecological Sciences: Science Environment for
    Ecological Knowledge. SSDBM, 2005.

79
Kepler Collaboration
  • Open-source
  • Builds on Ptolemy II from UC Berkeley
  • Contributors from
  • SEEK
  • SciDAC SDM
  • Ptolemy
  • GEON
  • ROADNet
  • Resurgence
  • AToL CIPRES, POD
  • Goals
  • Create powerful analytical tools that are useful
    across disciplines
  • Ecology, Biology, Engineering, Geology, Physics,
    Chemistry, Astronomy,

Ptolemy II
Natural Diversity Discovery Project
80
Databases & Information Systems (DBIS)
DBIS.ucdavis.edu
DAKS.ucdavis.edu
  • Profs. Michael Gertz, Bertram Ludaescher
  • Drs. Shawn Bowers, Timothy McPhillips, Norbert
    Podhorszki
  • 12 graduate students

81
Databases and Information Systems (DBIS)
  • DBIS.ucdavis.edu @ Dept of Computer Science (CS)
  • DAKS.ucdavis.edu (Data & Knowledge Systems) @
    Genome Center (GC)
  • Faculty
  • Michael Gertz, Bertram Ludäscher
  • Researchers
  • Drs. Shawn Bowers (GC), Timothy McPhillips (GC),
    Norbert Podhorszki (CS)
  • Current Students
  • Omar Alonso, Michael Byrd, Conny Franke,
  • Quinn Hart, Carlos Rueda, Dave Thau, Alex Chen