Scientific Workflows: Research Opportunities for the Practically-Oriented Theoretician

Slides: 82

Provided by: bertr68

1
Scientific Workflows: Research Opportunities for the Practically-Oriented Theoretician

Bertram Ludäscher
Dept. of Computer Science
Genome Center
University of California, Davis
ludaesch_at_ucdavis.edu

2
SUMMARY (first things first, but not necessarily in that order)

Motivation: (e-)Science today is data-driven
Scientific workflows are CI "upper-ware" for e-Science
Scientific workflows are ubiquitous ("every-ware")
… and there are many interesting technical challenges:
wf modeling and design
wf execution models
semantic extensions, semantic type propagation
wf optimization
data (& workflow) provenance
Some early solutions, but HELP NEEDED!

3
All Science is Physics or Stamp Collecting
4
Science has been changing lately

"All science is either physics or stamp collecting."
Ernest Rutherford, British chemist & physicist (1871 - 1937), quoted in J. B. Birks, "Rutherford at Manchester" (1962)
That is, from "few data, lots of thinking" to LOTS OF DATA and ANALYSIS
→ Data-driven scientific discovery!

5
The Diversity & Unity of Science
Natural Sciences

Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations, Analyses, Hypotheses, Understanding, Prediction, …

in vivo, in vitro, in situ, in silico, …
Data, Knowledge, and Workflow Management are central to most of them!
compute-intensive
structurally & semantics-intensive
data-intensive
metadata-intensive
6
e-Science (UK) and Cyberinfrastructure (US)

"e-Science is about global collaboration in key areas of science and the next generation of computing infrastructure that will enable it."
Sir John Taylor, Director, Office of Science and Technology, UK
"Cyberinfrastructure is the coordinated aggregate of software, hardware and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering. The challenge of Cyberinfrastructure is to integrate relevant and often disparate resources to provide a useful, usable, and enabling framework for research and discovery characterized by broad access and 'end-to-end' coordination."
Fran Berman, San Diego Supercomputer Center, UCSD

7
Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science

… new development at the intersection of computer science and the sciences: a leap from the application of computing to support scientists to do science (i.e. computational science) to the integration of computer science concepts, tools and theorems into the very fabric of science. We believe this development represents the foundations of a new revolution in science …
… we believe computer science is poised to become as fundamental to biology as mathematics has become to physics …
… to understand cells and cellular systems requires viewing them as information processing systems, as evidenced by the fundamental similarity between molecular machines of the living cell and computational automata, and by the natural fit between computer process algebras and biological signalling and between computational logical circuits and regulatory systems in the cell …
We highlight that an immediate and important challenge is that of end-to-end scientific data management, from data acquisition and data integration, to data treatment, provenance and persistence.
… dramatic in its impact, will be the integration of new conceptual and technological tools from computer science into the sciences.

8
Scientific Information Integration

Traditional Information (& Data) Integration
syntactic & structural heterogeneities, schema mappings, schema matching, query rewriting (GAV, LAV, Chase), …
dealing with fundamentally the same kind of information that happens to be represented differently, incompletely, …
find the "correct", "best" way to integrate different representations
Scientific Information Integration (SII)
has the traditional II as a small (but very important) piece
but often deals with combining fundamentally different info
not a single correct / best way to integrate
SII invokes scientific theories or models that cannot be inferred from the data / schema (ontologies may help though)
→ "joining" of data, "chaining" of tools is in the scientist's head!
Scientific Workflows can provide the end-to-end framework

9
Types of Information Integration

Conventional information integration
schema-based
view-based
at the data-level
Spatial (co-)registration/overlay of different data
from 2D, 3D, 4D (x,y,z,t), (4+n)D → GIS
Extended DI approaches using ontologies
controlled vocabularies, metadata, annotations, …
Scientific Information Integration
data and process/application integration
→ scientific workflows
can include all the others and
statistics, data mining, visualization, …

10
Assembling the Tree of Life (AToL)
All organisms (alive or extinct) are part of one large, genetically connected group: Life on Earth. Major subgroups: Eubacteria, Archaea, and Eukaryotes, further divided into hierarchically nested subgroups; e.g., eukaryotes contain plants, animals, fungi, …; animals contain sponges, cnidarians, Bilateria, …; Bilateria contain arthropods, molluscs, nematodes, etc.
11
Inferring a phylogenetic tree from disparate data
[Figure: datasets (aligned DNA sequences, discrete morphological data, continuous characters) are processed by actors (maximum likelihood tree (DNA), maximum parsimony tree, maximum likelihood tree (continuous characters)) and integrated into consensus tree(s).]
12
Pipelined workflow for inferring phylogenetic trees
Author: Tim McPhillips, UC Davis
13
How is this different from good old data integration?

Some white-box actors (queries, XML transformations), …
… but many black-box actors (R call, WS-call, built-ins)
… and grey-box actors (nested subworkflows)
Transformation & analysis pipelines (cf. ETL)
different Models of Computation (MoCs): DAG(man)-ish, SDF, Kahn process networks, …
hence different Models of Provenance (MoPs)
could use semantic extensions (semantic types)
could try to optimize / rewrite (depends on MoC, …)

14
Scientific Workflows: Cyberinfrastructure UPPER-WARE
15
Scientific workflows are CI "upper-ware", i.e. the scientist's way to harness cyberinfrastructure

Domain Scientist's View
Q: When is CI (middle-ware, under-ware) good?
A: When I can't see it!
Q: When is a scientific workflow tool (CI upper-ware) good?
A: When I can get more, new, faster, better science done!
Workflow Engineer's View
How can I (help the scientist) design & implement the desired wfs?
How does wf make my life easier? Is there life beyond Perl & Python?
Choice of platforms, standards; reuse of existing tools, semantic extensions, scheduling on the Grid?
How do I make all of this robust, fault-tolerant, etc.?
Computer Scientist's View
workflow modeling & design, static analysis, optimization, theoretical limits: what can / can't be done
The quest for the right models & languages
The holy grail of eScience: Join the Quest!

16
Scientific Workflows are EVERY-WARE: Völker, höret die Signale! (Then come, comrades, rally …)

Wainer, Weske, Vossen, Bauzer-Medeiros. Scientific workflow systems. NSF Workshop on Workflow and Process Automation in Information Systems, May 1996
Anastassia Ailamaki, Yannis E. Ioannidis, Miron Livny. Scientific Workflow Management by Database Management. SSDBM 1998
Workflow in Grid Systems, GGF-10, Berlin, March 2004
Data Integration in the Life Sciences workshop: DILS'04 (Leipzig, Germany), DILS'05 (San Diego, California Republic), DILS'06 (Cambridge, UK), DILS'07 (U Penn)
SIGMOD Record on Scientific Workflows, Sept. 2005
IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow'06), w/ ICDE, Atlanta, April 2006
NSF Workshop on Challenges of Scientific Workflows, Arlington, May 2006
Microsoft eScience Workshop, Johns Hopkins Univ, Oct 2006
Scientific Workflows and Business workflow standards in e-Science, Amsterdam, 12/06
2nd Intl. Workshop on Workflow Systems in e-Science (w/ Intl. Conf. on Computational Science), May 2007, Beijing
Workflows for eScience (book). Taylor, Deelman, Gannon, Shields, editors, Springer 2006

17
Some Research Challenges

Goal: helping scientists and workflow engineers in SII
… to optimize the human resource
workflow modeling & design
software engineering, query optimization, type inference
rich provenance support
data models, computation models, query languages
use/exploit semantic information, static analysis
type inference, automated deduction
… and to optimize system resources
resource scheduling, distributed execution, …
cost models, scheduling, distributed computing

18
Scientific Workflow

Capture how a scientist works with data and
analytical tools
data access, transformation, analysis,
visualization
possible worldview: dataflow-oriented (cf. signal-processing)
Scientific workflow (wf) benefits (compare w/ script-based approaches)
wf automation
wf component reuse
wf design, documentation
wf archival, sharing
built-in concurrency (task-, pipeline-parallelism)
built-in provenance support
distributed execution (Grid) support

19
Ex.: SEEK Ecological Niche Modeling Pipeline

Scientific Workflow paradigm
Reusable components (actors): a scientist's "verbs"/actions
Top-level workflows: conceptual representation of the science process, "sentences" in the scientist's language
Sub-workflows: increasing levels of detail
Separation of concerns
actors: what to do
parameters: configurable behavior
channels: dataflow, pipeline composition
directors: fix execution model, scheduling
semantic types: smart discovery, linking

D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
20
Simple Kepler workflow using R (a statistics
package)
21
Plumbing with Style (Norbert Podhorszki, UC Davis; Scott Klasky, ORNL)

Plasma physics simulation on 2048 processors on Seaborg@NERSC (LBL)
Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30 hour simulation run
Under workflow control:
Monitor (watch) simulation progress (via remote scripts)
Transfer from NERSC to ORNL concurrently with the simulation run
Convert each file to an HDF5 file
Archive files in 4GB chunks into HPSS

22
Kepler and Sensor Networks

These just in (new NSF CEOP projects):
Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
standardize services for sensor networks, support multiple views, protocols
COMET (Coast-to-Mountain Environmental Transect): UC Davis, Bodega Marine Lab, Lake Tahoe Research Center
study how environmental factors affect ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada

CEOP/COMET
CEOP/Kepler
23
Workflow Thinking (cf. Computational Thinking)

How should we think about scientific workflows?
From "what scientists do to produce scientific papers" …
… to "it's just a program"
Is that helpful?
Depends on who you ask! What are you trying to do?
(Domain) Scientist / Workflow Engineer / Computer Scientist

24
Our Starting Point: Actor-Oriented Modeling

Ports
each actor has a set of input and output ports
denote the actor's signature
produce/consume data (a.k.a. tokens)
parameters are special static ports

25
Actor-Oriented Modeling

Dataflow Connections
unidirectional actor communication channels
connect output ports with input ports
for composing analysis pipelines

26
Actor-Oriented Modeling

Sub-workflows / Composite Actors
composite actors wrap sub-workflows
like actors, have signatures (i/o ports of
sub-workflow)
hierarchical workflows (arbitrary nesting levels)

27
Actor-Oriented Modeling

Directors
define the execution semantics of workflow graphs
execute the workflow graph (according to some schedule)
sub-workflows may have different directors
promotes reusability
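
The actor-oriented ingredients above (ports, parameters, dataflow connections, directors) can be sketched in a few lines of Python. This is only an illustrative toy, not Kepler's or Ptolemy II's API: the class names `Actor` and `Director`, the `fire_fn` callback, and the "fire whatever is ready" schedule are all assumptions made for the example. Composite actors are omitted.

```python
from collections import deque


class Actor:
    """A component with input/output ports; parameters act as static ports."""

    def __init__(self, name, inputs, outputs, fire_fn, **params):
        self.name = name
        self.inputs = {port: deque() for port in inputs}  # one token queue per input port
        self.outputs = list(outputs)
        self.fire_fn = fire_fn  # maps (input tokens, params) -> {output port: token}
        self.params = params

    def ready(self):
        # An actor can fire when every input port holds at least one token.
        return bool(self.inputs) and all(self.inputs.values())

    def fire(self):
        tokens = {port: queue.popleft() for port, queue in self.inputs.items()}
        return self.fire_fn(tokens, self.params)


class Director:
    """Owns the schedule: here, keep firing ready actors until none is ready."""

    def __init__(self):
        self.actors = []
        self.channels = {}   # (actor name, output port) -> [(actor, input port)]
        self.collected = {}  # tokens leaving unconnected output ports

    def add(self, actor):
        self.actors.append(actor)
        return actor

    def connect(self, src, out_port, dst, in_port):
        self.channels.setdefault((src.name, out_port), []).append((dst, in_port))

    def run(self):
        progress = True
        while progress:
            progress = False
            for actor in self.actors:
                if not actor.ready():
                    continue
                progress = True
                for port, token in actor.fire().items():
                    targets = self.channels.get((actor.name, port))
                    if targets:
                        for dst, in_port in targets:
                            dst.inputs[in_port].append(token)
                    else:
                        self.collected.setdefault((actor.name, port), []).append(token)


def scale(tokens, params):
    return {"out": tokens["in"] * params["factor"]}


def offset(tokens, params):
    return {"out": tokens["in"] + params["delta"]}


director = Director()
scaler = director.add(Actor("scale", ["in"], ["out"], scale, factor=10))
shifter = director.add(Actor("offset", ["in"], ["out"], offset, delta=1))
director.connect(scaler, "out", shifter, "in")
for x in (1, 2, 3):
    scaler.inputs["in"].append(x)  # seed the pipeline with three tokens
director.run()
results = director.collected[("offset", "out")]
```

Swapping in a different `Director.run` changes the execution model without touching the actors, which is the separation of concerns the slides describe.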

28
Models of Computation (A Wf Engineer's Issue)

Directors separate the concerns of orchestration
and scheduling from conceptual design
Synchronous Dataflow (SDF)
Statically analyzable schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director.
Process Networks (PN)
Generalizes SDF. Actors execute as separate
threads/processes, with queues of unbounded size
(Kahn/MacQueen networks).
Directed Acyclic Graph (DAG)
Special case of SDF. No loops, no pipelining.
Continuous Time (CT)
Connections represent the value of a continuous
time signal at some point in time ... Often used
to model physical processes.
Discrete Event (DE)
Actors communicate through a queue of events in
time. Used for instantaneous reactions in
physical systems.
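
To make "statically analyzable schedule" concrete for SDF: since each actor produces and consumes a fixed number of tokens per firing, one can solve the standard SDF balance equations (firings of source × tokens produced = firings of destination × tokens consumed, per channel) before running anything. The sketch below computes the smallest such firing counts; the function name `repetition_vector` and the edge encoding are assumptions for illustration, not part of any workflow system's API.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive firing counts balancing every channel.

    edges: list of (src, tokens_produced, dst, tokens_consumed).
    Raises ValueError if the rates are inconsistent (no static schedule).
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no static schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


# A emits 2 tokens per firing, B consumes 3; B emits 1, C consumes 2.
firings = repetition_vector(
    ["A", "B", "C"],
    [("A", 2, "B", 3), ("B", 1, "C", 2)],
)
```

Here `firings` says that one schedule period fires A three times, B twice, and C once, after which every buffer returns to its initial fill level. That is why buffer sizes can be bounded ahead of time.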

29
Everything is a service / actor (… yeah, right)
30
Scientific Workflow Design Challenges
And that's why our scientific workflows are much easier to develop, understand and maintain!
31
Shimology Part 1: Structure & Semantics

Components and their i/o ports typically have
Explicit structural type
e.g., int, float, string, array[...] of double, …
Implicit semantic type
Not sure whether the stream of values from a port
represents rainfall values or body size values

32
Semantic Annotation

Label data with semantic types
Label inputs and outputs of analytical components
with semantic types (and overall component
function)

33
Semantic Type Annotation in Kepler

Component input and output port annotation
a port can be annotated with multiple concepts
from multiple ontologies
Annotations are stored with the actor metadata

34
Component Annotation and Indexing

Component Annotations
New components can be annotated and indexed into
the actor library
(specializing generic actors)
Existing components can be revised, annotated,
and indexed (hiding previous versions)

35
Smart Discovery

Find a component (here an actor) in different
locations (categories)
based on the semantic annotation of the
component (or its ports)

36
Smart Linking (Workflow Design)

Statically perform semantic and structural type
checking

Navigate errors and warnings within the workflow
!! Search for and insert adapters (aka shims) to
fix (structural and semantic) mismatches
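
A minimal sketch of what static "smart linking" could check, assuming each port carries one structural type and one semantic concept, and that semantic compatibility means "the output concept is the same as, or a subconcept of, the input concept". The toy `ONTOLOGY`, the `Port` class, and `check_link` are hypothetical names for illustration, not Kepler's annotation API.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: concept -> parent concept (None marks the root).
ONTOLOGY = {
    "Rainfall": "Measurement",
    "BodySize": "Measurement",
    "Measurement": None,
}


def is_subconcept(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in ONTOLOGY."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False


@dataclass(frozen=True)
class Port:
    name: str
    structural: str  # explicit structural type, e.g. "float"
    semantic: str    # semantic annotation, e.g. "Rainfall"


def check_link(out_port, in_port):
    """Return a list of mismatch messages; an empty list means the link is fine."""
    problems = []
    if out_port.structural != in_port.structural:
        problems.append(
            f"structural mismatch: {out_port.structural} -> {in_port.structural}"
        )
    if not is_subconcept(out_port.semantic, in_port.semantic):
        problems.append(
            f"semantic mismatch: {out_port.semantic} is not a {in_port.semantic}"
        )
    return problems


rain_out = Port("rain", "float", "Rainfall")
any_measurement_in = Port("values", "float", "Measurement")
body_size_in = Port("sizes", "float", "BodySize")

ok = check_link(rain_out, any_measurement_in)
bad = check_link(rain_out, body_size_in)
```

In this example the second link is structurally fine (float to float) but semantically wrong, which is exactly the case a purely structural type checker would miss and where a shim or a different component is needed.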

37
Smart Linking (addressing Shimology Type 1)
Source: Bowers & Ludaescher, DILS'04
38
CS Challenge: Hybrid (semantic & structural) Types
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
39
CS Challenge: Propagating Semantic Types

Creating semantic annotations is difficult
Potentially large numbers of derived data products
Thousands of workflow components
"Getting it right" can be difficult for the domain scientist
→ Annotation Propagation

[Figure: forward propagation and backward propagation automatically derive annotations along a chain of workflow steps.]
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
40
CS Research Problems in Propagation

Computing Forward and Backward Propagation
Under different schema constraint languages
What can and cannot be computed
Approximate what cannot be computed
Algorithms for propagation through a single actor
Algorithms for propagation through an entire
workflow

Biom1(ob, yr, seas, plt, spp, bm) :- Biom(ob, yr, seas, plt, spp, bm), Sscd(spp).   (join)
Biom3(yr, plt, spp, 1) :- Biom2(yr, plt, spp, bm), bm > 0.
Biom3(yr, plt, spp, 0) :- Biom2(yr, plt, spp, bm), bm < 0.   (union)
Biom2(yr, plt, spp, z = sum(b | y, t, p)) :- Biom1(ob, yr, seas, plt, spp, bm).   (aggregation)
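
The three query steps above (a join of Biom with Sscd, a sum aggregation per year/plot/species, and a two-rule union that turns the summed value into a 1/0 flag) can be mimicked on a toy relation as follows. The sample tuples are invented, and reading the two comparisons as "greater than 0" and "less than 0" is an assumption about the garbled operators in the transcript.

```python
from collections import defaultdict

# Biom(ob, yr, seas, plt, spp, bm): made-up observations.
biom = [
    (1, 2001, "spring", "p1", "sppA", 2.0),
    (2, 2001, "fall", "p1", "sppA", 3.0),
    (3, 2001, "spring", "p1", "sppB", -1.0),
    (4, 2001, "spring", "p2", "sppC", 5.0),
]
sscd = {"sppA", "sppB"}  # Sscd(spp): species of interest

# Join: keep only observations whose species appears in Sscd.
biom1 = [t for t in biom if t[4] in sscd]

# Aggregation: sum bm per (yr, plt, spp), dropping ob and seas.
totals = defaultdict(float)
for ob, yr, seas, plt, spp, bm in biom1:
    totals[(yr, plt, spp)] += bm
biom2 = [(yr, plt, spp, z) for (yr, plt, spp), z in totals.items()]

# Union of two rules: flag 1 for positive totals, flag 0 for negative ones.
biom3 = sorted(
    [(yr, plt, spp, 1) for yr, plt, spp, bm in biom2 if bm > 0]
    + [(yr, plt, spp, 0) for yr, plt, spp, bm in biom2 if bm < 0]
)
```

Each step changes which columns survive and how values relate to the inputs, which is what makes propagating a semantic annotation (say, "bm is a biomass measurement") through the pipeline non-trivial: after the last step the fourth column is a presence flag, not a biomass.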
41
Propagation via Query Expressions
[Figure: workflow steps (actors) with query $q$ between schemas $S$ and $T$, each annotated with $O$; forward propagation $\alpha' = \alpha(q^{-1})$, backward propagation $\alpha = \alpha'(q)$.]

To propagate, we need information about the actor
The function of an actor is given by a query $q : S \rightarrow S'$
q is a special kind of metadata, possibly an approximation
q maintains input-to-output structural associations
Propagation as annotation and query composition

42
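Results on S-T Finite Dependencies (Fagin et al.)

Full dependencies $L_{full}$ (e.g., ?/??, ?, ?/??, ?): $\forall x\, \varphi(x) \rightarrow \psi(x)$
Embedded dependencies $L_{em}$ (e.g., ??): $\forall x\, \varphi(x) \rightarrow \exists y\, \psi(x, y)$
Skolemized dependencies $L_{Sko}$: $\exists f\, \forall x\, \chi(x), \varphi(x) \rightarrow \psi(x)$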
Composition (we want L?(Lq?) ? L? )
Lfull(Lfull) ? Lfull Lfull(Lem) ? Lfull
Lem(Lfull) ? Lem Lem(Lem) ? Lem
LSko(LSko) ? LSko
In general, annotations take the form of
embedded (or Skolemized) s-t dependencies

43
Example queries and annotations
[Figure: Actor A reads $R_1(o, x, y, t, v)$ and $R_2(u, p)$ and produces $S(o, x, y, v, u, p)$.]
$q = \pi_{o,x,y,v}(\sigma_{t=c}(R_1)) \times \sigma_{u=d}(R_2)$

Forward propagation
$\alpha_1$: $R_1(o, x, y, t, v) \rightarrow Observation(o) \wedge hasVal(o, v)$
$\alpha_2$: $R_2(u, p) \rightarrow Site(u) \wedge Species(p) \wedge observedIn(p, u)$
$\alpha' = \alpha(q^{-1})$ where $\alpha = \alpha_1 \cup \alpha_2$
Backward propagation
$\alpha'$: $S(o, x, y, v, u, p) \rightarrow Observation(o) \wedge hasVal(o, v) \wedge Species(p)$
$\alpha = \alpha'(q)$

S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
44
Open Problem (for now)
How does reasoning with logic constraints (The
Chase, FO-resolution) relate to composition of
relational mappings (Fagin et al) ?
45
Another (Partially Solved) Reasoning Problem
47
The Concept Problem in Taxonomy

For information integration
(e.g. compute combined abundance)
need to know how X_Benson48 relates to Y_Kartesz04!
3rd taxon authority may state this relation!

48
… becomes a Reasoning Problem in Taxonomy

Peet05 articulates relation between Benson48 and
Kartesz04 names
Is that articulation consistent?
Can we infer additional information?

49
Approach: Potential Taxon Graph (Berendsohn et al.)

Got FO reasoning?

50
Maximal Tractable Subclasses $R_5^{28}$, …
51
Scientific Workflow Design: More Challenges
And that's why our scientific workflows are much easier to develop, understand and maintain!
52
Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
53
Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
54
A Simple Motivating Example

Take the services (actors, components) in (a) and chain them together in a scientist-friendly form à la (b)
considering the following signatures (cf. Haskell, ML, …)
(c) BLAST :: DNA -> [DNA]
(d) MotifSearch :: DNA -> [Motif]
(e) MotifSearch o BLAST = \x. MotifSearch(BLAST(x))
oops: (e) is not type correct; note the signatures of (c) and (d)!
a neat solution: implicit or explicit iteration / map(f)[x1,…,xn]
cf. Kepler and Taverna, Kepler solutions
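
The mismatch and its `map` fix can be sketched as follows. `blast` and `motif_search` are stand-ins with made-up behavior (they only mimic signatures in which BLAST returns a list of sequences and MotifSearch expects a single sequence); `map_shim` and `compose` are hypothetical helper names.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def blast(dna: str) -> List[str]:
    """Stand-in for BLAST: one sequence in, a list of 'similar' sequences out."""
    return [dna, dna[::-1]]


def motif_search(dna: str) -> List[str]:
    """Stand-in for MotifSearch: here a 'motif' is any 3-mer starting with A."""
    return [dna[i:i + 3] for i in range(len(dna) - 2) if dna[i] == "A"]


def map_shim(f: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Lift f so that it is applied to every element of a list (explicit iteration)."""
    return lambda xs: [f(x) for x in xs]


def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    return lambda x: g(f(x))


# compose(motif_search, blast) would hand a list to a function expecting one
# sequence; a static type checker would reject it. Inserting the shim fixes that:
pipeline = compose(map_shim(motif_search), blast)
motifs = pipeline("ACGAT")
```

Note that the result is a list of lists (one motif list per BLAST hit), so the shim repairs the link but also changes the nesting seen by every downstream actor. This is where designs start to get messy.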

55
Extended Example Workflow Evolution

(a) -> (b): replace A: a -> b with A: a -> [b]
need to call B iteratively, i.e. wrap B inside a component or add control-flow
(b) -> (c): upstream produces a, a, instead of a, a,
(d) need to bypass data components since B can't handle ds
This gets messy quickly

56
A Realistic Example (ChIP-chip workflow)
57
But how do we get from messy to neat reusable
designs?
58
The Answer (YMMV)

Collection-Oriented Modeling & Design (COMAD)
embrace the assembly line metaphor fully
→ cf. Flow-based Programming (J. Morrison)
data = tagged nested collections
e.g. represented as flattened, pipelined (XML) token streams

59
How does COMAD work?

Some COMAD principles
data = tagged, flattened, nested collections (token streams)
data tokens
metadata tokens
inherited downwards into (sub)collections
define an actor's read scope via an (X)Path-like expression
default actor behavior
not mine? → don't do anything, just pass the buck!
stuff within my scope? → add-only to it (default), or consume scope & write out result (but remember the bypass!)
iteration scope is a query involving group-by and further refines the granularity/subtrees that constitute the tokens consumed by an actor firing
has aspects of implicit iteration (à la Taverna)
default iteration level to fix signature mismatches
but also
granularity/grouping is definable
works on anything (assuming scope is matched correctly)
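
A toy rendering of these principles: the stream is a flat sequence of open/data/close tokens, the actor's read scope is simplified to a single collection tag (standing in for the (X)Path-like expression), and the default behavior is "pass everything through, add-only inside my scope". The token encoding and the name `comad_actor` are assumptions for this sketch, not the actual COMAD implementation.

```python
def comad_actor(stream, scope, fn, result_tag):
    """Pass all tokens through; for each collection tagged `scope`, append a
    new sub-collection `result_tag` holding fn(data tokens of that collection)."""
    stack = []  # one entry per open collection: (tag, buffer or None)
    for kind, value in stream:
        if kind == "open":
            stack.append((value, [] if value == scope else None))
            yield kind, value
        elif kind == "data":
            # Remember the token for the innermost enclosing in-scope collection.
            for _tag, buffer in reversed(stack):
                if buffer is not None:
                    buffer.append(value)
                    break
            yield kind, value  # not consumed: data is bypassed downstream
        elif kind == "close":
            _tag, buffer = stack.pop()
            if buffer is not None:  # add-only: insert the result before closing
                yield "open", result_tag
                yield "data", fn(buffer)
                yield "close", result_tag
            yield kind, value


stream = [
    ("open", "Project"),
    ("open", "Sequences"), ("data", "ACGT"), ("data", "AC"), ("close", "Sequences"),
    ("open", "Notes"), ("data", "not mine"), ("close", "Notes"),
    ("close", "Project"),
]

out = list(comad_actor(stream, "Sequences",
                       lambda seqs: sum(len(s) for s in seqs), "TotalLength"))
```

The `Notes` collection is outside the actor's scope and flows through untouched, so a downstream actor that does care about it still finds it in place. Retargeting the actor to a different part of the data only requires changing the `scope` argument.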

T McPhillips, S Bowers. Pipelining Nested Data
Collections in Scientific Workflows. SIGMOD
Record, 2005.
T McPhillips, S Bowers, B Ludaescher.
Collection-Oriented Scientific Workflows for
Integrating and Analyzing Biological Data.
Workshop on Data Integration in the Life Sciences
(DILS), 2006

60
COMAD: What we gained

from fragile, messy workflow designs
to more reusable actors
just change the read/iteration scope parameters!
sometimes not even that is needed (working on that …)
and cleaner workflow design (The A-B-C method of wf design!)
Crux: keep the nesting structure of data (pass through, add-only)
… and let it drive the (semi-)implicit iteration (aka structural recursion …)

61
COMAD: Optimization Potential

When is it worth bypassing data?

62
Challenge: Modeling & Design Paradigms

Vanilla Process Network
Functional Programming Dataflow Network
XML Transformation Network
Collection-oriented Modeling & Design framework (COMAD)

"The limitations of my modeling language are the limitations of my design world." (BL)
63
A Scientific Publication (the final provenance frontier …)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006), doi:10.1038/nature05113. Received 27 June 2006; Accepted 25 July 2006; Published online 16 August 2006
some metadata
64
More Evidence
data reference
type of evidence
tool reference
trust me on this one

provenance/data lineage: show the history and evidence
related to proof trees
unlike w/ scripts, a SWF system can keep track of what happened
In the future: deposit your data & workflows in a repository

65
Pipelined workflow for inferring phylogenetic
trees
Author: Tim McPhillips, UC Davis
66
Scientific Provenance Questions we can ask

What DNA sequences were input to the workflow?
What phylogenetic trees were output by the
workflow?
What DNA sequences input to the workflow does
this consensus tree depend on?
What input sequences were not used to derive any
output consensus trees?
What was the sequence alignment (key intermediate
data) used in the process of inferring this tree?
… plus the usual: smart re-run, VCR replay, …
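
Several of these questions reduce to reachability over recorded data dependencies. The sketch below assumes the workflow system has logged, for every derived item, the items it was directly computed from; the item names and the `deps` dictionary are invented for illustration.

```python
# Hypothetical recorded dependencies: derived item -> items it was computed from.
deps = {
    "alignment1": {"seq1", "seq2", "seq3"},
    "tree1": {"alignment1"},
    "tree2": {"alignment1"},
    "consensus1": {"tree1", "tree2"},
}
workflow_inputs = {"seq1", "seq2", "seq3", "seq4"}
workflow_outputs = {"consensus1"}


def lineage(item, deps):
    """All items that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        for parent in deps.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Which input sequences does this consensus tree depend on?
used_by_consensus = lineage("consensus1", deps) & workflow_inputs

# Which input sequences were not used to derive any output?
used = set()
for output in workflow_outputs:
    used |= lineage(output, deps)
unused_inputs = workflow_inputs - used

# Which alignment (key intermediate data) was used to infer this tree?
alignments_for_tree = {x for x in lineage("tree1", deps) if x.startswith("alignment")}
```

The answers are only as good as the recorded dependencies: if the log merely says "the actor read everything on its input port", every output appears to depend on every input.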

67
Provenance in the COMAD Framework
Without Provenance
With Provenance
68
Provenance for the WF Engineer / Plumber

A Workflow Engineer's View
Monitor, benchmark, and optimize workflow
performance
Record resource usage for a workflow execution
Smart Re-run of (variants of) previous
executions
Checkpointing & restart (e.g. for crash recovery, load balancing)
Debug or troubleshoot a workflow run
Explain when, where, why a workflow crashed

69
Provenance for Domain Scientists!

Query the lineage of a data product
from what data was this computed? (real
dependencies please!)
Evaluate the results of a workflow
do I like how this result was computed?
Reuse data products of one workflow run in
another
(re-)attach prior data products to a new workflow
Archive scientific results in a repository
Replicate the results reported by another
researcher
Discover all results derived from a given dataset
i.e. across all runs
Explain unexpected results
via parameter-, dataset-, object-dependencies
in the scientist's terms (yes, you may think "ontology" here …)

70
Observables

Model of Computation (MoC) M
specification/algorithm to compute o = M(W,P,i)
a director or scheduler implements M
gives rise to formal notions of computation (aka run) R; typically tree models
Model of Provenance (MoP) M′
approximation M′ of M
a trace T approximates a run R by inclusion/exclusion of observables
T = R − Ignored-observables + Model-observables
Observables (of a MoC M)
functional observables (may influence output o)
token rate, notions of firing,
non-functional observables (not part of M, do not
influence o)
token timestamp, size, (unless the MoC cares
about those)
What is a good model of provenance? What is a
good provenance schema?
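
One way to read the relation between a run and its trace is: take the events of the run, drop the observables the provenance model ignores, and add the observables it models explicitly. The sketch below follows that reading; the event fields, the choice to ignore timestamps, and the derived `depends` observable are all assumptions made for the example.

```python
def make_trace(run, ignored, modeled):
    """Approximate a run by a trace: remove ignored observables from each
    event and attach the modeled ones (computed from the event)."""
    trace = []
    for event in run:
        record = {key: value for key, value in event.items() if key not in ignored}
        for name, derive in modeled.items():
            record[name] = derive(event)
        trace.append(record)
    return trace


# A made-up run: one event per actor firing.
run = [
    {"actor": "Align", "reads": ["seq1", "seq2"], "writes": ["aln1"], "timestamp": 17.2},
    {"actor": "InferTree", "reads": ["aln1"], "writes": ["tree1"], "timestamp": 42.0},
]

trace = make_trace(
    run,
    ignored={"timestamp"},  # non-functional observable in this MoC
    modeled={
        # modeled observable: each written item depends on each item read
        "depends": lambda e: [(w, r) for w in e["writes"] for r in e["reads"]],
    },
)
```

Choosing `ignored` and `modeled` differently yields a different model of provenance over the same model of computation, which is one way to frame the question of what a good provenance schema is.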

71
So what should we focus on?

What is the bottleneck in Scientific Workflows?
The human resource → workflow design support!
includes
new modeling paradigms (e.g. COMAD, FP, NRC, …)
and data-orientation!
Business workflows
top-down, engineered, many times the same
Scientific workflows
bottom-up, exploratory, each time unique
Combine best of both
explore, capture, evolve!
workflow sharing and reuse

72
Workflow design: when was the last time …

… that we ate our own dog food?
Do we really want to use formalism X for
scientist-oriented workflow design?
X in Petri-net, ?-calculus, process networks,
Turing machines, BPEL4WS,
What are the observables of approach/language X?
What does language X talk about, ignore, and
allow in terms of analysis, understanding?
Example: Data Provenance in Scientific Workflows
T = R − I + M
Trace (MoP) = Run (MoC) − I(gnored) + M(odeled)

73
The Emperor's Old Clothes

Computer Science / Thin approach
Minimize to the max: Lambda calculus, Turing machines, Register machines, Petri nets, Kahn Process Networks, Relational Algebra & Calculus, …
Thick approach
Algol68, PL/1, XML Schema, BPEL4WS, SQL, … (bloat to the max?)
"Premature optimization is the root of all evil" (Tony Hoare, Donald Knuth)
"Premature standardization is the soil the root lives in"

74
Consilience: The Unity of Knowledge (E. O. Wilson)

"Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." (E. O. Wilson)
eScience, Cyberinfrastructure: mechanisms to make progress
Scientific Workflows: crucial elements to get the most mileage out of CI to fuel eScience, accelerating knowledge discovery
Identify the real bottlenecks in this quest!
Need workflow engineers, computer scientists, bioinformaticians, hybrids!

75
The Holy Grail of eScience / Scientific Workflows

Evolution programmed us to enjoy certain things
We should feel lucky:
the brain is so powerful a control system that self-consciousness emerged; now we also enjoy thinking
hence we've been asking provenance questions since the dawn of man (where from? to? why?)
Science (and now eScience) yield answers
(aside: so does religion, but only science is strongly constrained by reality)
"We are an intelligent species and the use of our intelligence quite properly gives us pleasure. In this respect the brain is like a muscle. When we think well, we feel good. Understanding is a kind of ecstasy." (Carl Sagan)
Call to Arms/Ploughshares
Join the Quest for the right language for eScience & Workflow Thinking!

76
Conclusion

From Science to eScience via scientific workflows
Many interesting challenges & opportunities, e.g.,
quest for suitable models & languages for scientific workflows
support pipelining, nested collections, provenance, …
exploit static analysis, type inference, provenance, optimization
Examples
Propagating semantic types (logic inference, Chase, composition)
Efficient reasoning w/ taxon constraints in RCC-5 subalgebra
Combining XML, streaming, XPath/CDuce, … for COMAD
Wf optimization (bypass, scheduling, …)
From MoCs to MoPs (Models of Provenance)
"Wir müssen wissen, wir werden wissen!" ("We must know, we will know!") (D. Hilbert)

77
Acknowledgements, Q&A

Data and Knowledge Systems Lab (DAKS) @ UC Davis
Dr. Shawn Bowers, Dr. Timothy McPhillips, Dr.
Norbert Podhorszki
Dave Thau, Daniel Zinn, Alex Chen
Many Kepler collaborators
Ilkay Altintas (SDSC/UCSD), Matt Jones (UCSB),
Arie Shoshani (LBL), Terence Critchlow (LLNL),
Mladen Vouk (NCSU),

78
Some Related Publications

Semantic Type Annotation
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
S Bowers, B Ludaescher. Towards Automatic
Generation of Semantic Types in Scientific
Workflows. International Workshop on Scalable
Semantic Web Knowledge Base Systems (SSWS), WISE
2005 Workshop Proceedings, LNCS, 2005.
C Berkley, S Bowers, M Jones, B Ludaescher, M
Schildhauer, J Tao. Incorporating Semantics in
Scientific Workflow Authoring. SSDBM, 2005.
B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
S Bowers, K Lin, B Ludaescher. On Integrating
Scientific Resources through Semantic
Registration. SSDBM, 2004.
S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
S Bowers, B Ludaescher. Towards a Generic
Framework for Semantic Registration of Scientific
Data. International Semantic Web Conference
Workshop on Semantic Web Technologies for
Searching and Retrieving Scientific Data, 2003.
Workflow Design and Modeling
T McPhillips, S Bowers, B Ludaescher.
Collection-Oriented Scientific Workflows for
Integrating and Analyzing Biological Data.
Workshop on Data Integration in the Life Sciences
(DILS), LNCS, 2006.
S Bowers, T McPhillips, B Ludaescher, S Cohen, SB
Davidson. A Model for User-Oriented Data
Provenance in Pipelined Scientific Workflows.
International Provenance and Annotation Workshop
(IPAW), LNCS, 2006.
S Bowers, B Ludaescher, AHH Ngu, T Critchlow.
Enabling Scientific Workflow Reuse through
Structured Composition of Dataflow and
Control-Flow. IEEE Workshop on Workflow and Data
Flow for Scientific Applications (SciFlow), 2006.
S Bowers, B Ludaescher. Actor-Oriented Design of
Scientific Workflows. International Conference on
Conceptual Modeling (ER), LNCS, 2005.
T McPhillips, S Bowers. Pipelining Nested Data
Collections in Scientific Workflows. SIGMOD
Record, 2005.
Kepler
D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer-Verlag, to appear.
W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.

79
Kepler Collaboration

Open-source
Builds on Ptolemy II from UC Berkeley
Contributors from
SEEK
SciDAC SDM
Ptolemy
GEON
ROADNet
Resurgence
AToL CIPRES, POD
Goals
Create powerful analytical tools that are useful
across disciplines
Ecology, Biology, Engineering, Geology, Physics,
Chemistry, Astronomy,

Ptolemy II
Natural Diversity Discovery Project
80
Databases & Information Systems (DBIS)
DBIS.ucdavis.edu
DAKS.ucdavis.edu

Profs. Michael Gertz, Bertram Ludaescher
Drs. Shawn Bowers, Timothy McPhillips, Norbert
Podhorszki
12 graduate students

81
Databases and Information Systems (DBIS)

DBIS.ucdavis.edu @ Dept of Computer Science (CS)
DAKS.ucdavis.edu (Data & Knowledge Systems) @ Genome Center (GC)
Faculty
Michael Gertz, Bertram Ludäscher
Researchers
Drs. Shawn Bowers (GC), Timothy McPhillips (GC),
Norbert Podhorszki (CS)
Current Students
Omar Alonso, Michael Byrd, Conny Franke,
Quinn Hart, Carlos Rueda, Dave Thau, Alex Chen
