Scientific Data Management: From Data Integration to Analytical Pipelines

1
Scientific Data Management: From Data
Integration to Analytical Pipelines
Bertram Ludäscher ludaesch@sdsc.edu
  • Knowledge-Based Information System
  • San Diego Supercomputer Center
  • University of California, San Diego

2
Outline
  • Data Integration for Scientific Data
  • Difference to conventional data integration
  • Need for mediators + KR
  • => semantic (model-based) mediation
  • Demo: Ontology-enabled geologic map integration
  • Scientific Workflows and Analytical Pipelines
  • Business WFs vs. Scientific WFs
  • Example SciWF (promoter identification) and
    critique
  • General requirements

3
Acknowledgements
  • National Science Foundation (NSF)
  • www.nsf.gov
  • GEOsciences Network (NSF)
  • www.geongrid.org
  • Biomedical Informatics Research Network (NIH)
  • www.nbirn.net
  • Science Environment for Ecological Knowledge
    (NSF)
  • seek.ecoinformatics.org
  • Scientific Data Management Center (DOE)
  • sdm.lbl.gov/sdmcenter/

4
An Online Shopper's Information Integration
Problem
El Cheapo: "Where can I get the cheapest copy
(including shipping cost) of Wittgenstein's
Tractatus Logico-Philosophicus within a week?"
One-World Scenario: XML-based mediator
Mediator (virtual DB) (vs. data warehouse)
5
A Home Buyer's Information Integration Problem
"What houses for sale under 500k have at least 2
bathrooms, 2 bedrooms, a nearby school ranking
in the upper third, in a neighborhood with
below-average crime rate and diverse population?"
Multiple-Worlds Mediation
6
Some BIRNing Data Integration Questions
Biomedical Informatics Research
Network http://nbirn.net
  • Data Integration Approaches
  • Let's just share data, e.g., link everything from
    a web page!
  • ... or better, put everything into a relational
    or XML database
  • ... and do remote access using the Grid
  • ... or just use Web services!
  • Nice try. But ...
  • "Find the files where the amygdala was
    segmented."
  • "Which other structures were segmented in the
    same files?"
  • "Did the volume of any of those structures differ
    much from normal?"
  • "What is the cerebellar distribution of rat
    proteins with more than 70% homology with human
    NCS-1? Any structure specificity? How about other
    rodents?"

7
A Neuroscientist's Information Integration Problem
Biomedical Informatics Research
Network http://nbirn.net
"What is the cerebellar distribution of rat
proteins with more than 70% homology with human
NCS-1? Any structure specificity? How about other
rodents?"
Complex Multiple-Worlds Mediation
8
Information Integration Challenges
Heterogeneities: S4 ...
  • System Aspects
  • platforms, devices, distribution, APIs,
    protocols, ...
  • Syntaxes
  • heterogeneous data formats (one for each tool
    ...)
  • Structures
  • heterogeneous schemas (one for each DB ...)
  • heterogeneous data models (RDBs, ORDBs, OODBs,
    XMLDBs, flat files, ...)
  • Semantics
  • unclear, hidden semantics, e.g., incoherent
    terminology, multiple taxonomies, implicit
    assumptions, ...

9
Information Integration Challenges
  • System aspects: Grid Middleware
  • distributed data & computing
  • Web Services, WSDL/SOAP, OGSA, ...
  • sources = functions, files, data sets, ...
  • Syntax & Structure
  • (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources = (XML) databases
  • Semantics
  • Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description
    logics (RDF(S), OWL, ...)
  • sources = knowledge bases (DB + CMs + ICs)

10
Information Integration from a DB Perspective
  • Information Integration Problem
  • Given: data sources S1, ..., Sk (DBMS, web sites,
    ...) and user questions Q1, ..., Qn that can be
    answered using the Si
  • Find: the answers to Q1, ..., Qn
  • The Database Perspective: source = "database"
  • Si has a schema (relational, XML, OO, ...)
  • Si can be queried
  • define virtual (or materialized) integrated
    views V over S1, ..., Sk using database
    query languages (SQL, XQuery, ...)
  • questions become queries Qi against V(S1, ..., Sk)
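
The database perspective above can be illustrated with a minimal sketch, assuming two made-up sources with different schemas. The source contents, field names, and the functions `integrated_view` and `cheapest` are hypothetical and chosen only to echo the "El Cheapo" question; the view is virtual in the sense that it is computed on demand and never stored.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]

# Two hypothetical sources with different schemas (made-up data).
S1: List[Row] = [
    {"title": "Tractatus", "price": 12.0, "shipping": 3.0},
    {"title": "Zettel", "price": 9.0, "shipping": 3.0},
]
S2: List[Row] = [
    {"book": "Tractatus", "total_cost": 14.0},
]


def integrated_view() -> Iterator[Row]:
    """Virtual view V(S1, S2): maps both schemas to (title, cost, source)."""
    for r in S1:
        yield {"title": r["title"], "cost": r["price"] + r["shipping"], "source": "S1"}
    for r in S2:
        yield {"title": r["book"], "cost": r["total_cost"], "source": "S2"}


def cheapest(title: str) -> Row:
    """A user question posed as a query against V only, not against the sources."""
    candidates = (r for r in integrated_view() if r["title"] == title)
    return min(candidates, key=lambda r: r["cost"])


result = cheapest("Tractatus")
print(result)
```

The user-facing query never mentions `S1` or `S2`; only the view definition knows how the source schemas differ.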

11
Standard (XML-Based) Mediator Architecture
[Figure: (XML) views are layered over wrappers, one
wrapper per source S1, S2, ..., Sk; wrappers are
implemented as web services]
12
Scientific Data Integration ... Questions to
Queries ...
"What are the distribution and U/Pb zircon ages of
A-type plutons in VA? How about their 3-D
geometry? How does it relate to host rock
structures?"
Complex Multiple-Worlds Mediation
GeoPhysical (gravity contours)
Geologic Map (Virginia)
GeoChronologic (Concordia)
Foliation Map (structure DB)
GeoChemical
13
Towards Shared Conceptualizations: Data
Contextualization via Concept Spaces
14
Rock Classification Ontology
Genesis
Fabric
Composition
Texture
15
Some enabling operations on ontology data
  • Concept expansion
  • what else to look for when asking for "Mafic"

Composition
16
Some enabling operations on ontology data
  • Generalization
  • finding data that is like X and Y

Composition
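
The two ontology operations named on these slides, concept expansion and generalization, can be sketched over a toy is-a hierarchy. This is a minimal illustration, not the GEON implementation: the `PARENT` table is a hand-written example and is not taken from the GSC or BGS ontologies, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Toy is-a hierarchy (child -> parent); illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "Composition": None,
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Rhyolite": "Felsic",
}


def expand(concept: str) -> Set[str]:
    """Concept expansion: the concept plus all of its descendants."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def ancestors(concept: str) -> List[str]:
    """Path from the concept up to the root, starting with the concept itself."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def generalize(x: str, y: str) -> str:
    """Generalization: the most specific concept that covers both x and y."""
    covers_y = set(ancestors(y))
    for candidate in ancestors(x):
        if candidate in covers_y:
            return candidate
    raise ValueError("no common ancestor")


print(sorted(expand("Mafic")))
print(generalize("Basalt", "Granite"))
```

A query for "Mafic" would then also look for data labeled with any concept in `expand("Mafic")`, and "data that is like X and Y" can be found under `generalize(X, Y)`.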
17
Ontology-Based GEON Workbench
  • Uploadable
  • OWL ontologies
  • OWL inter-ontology mappings (articulations)
  • Data sets (shape files)
  • Semantic Registration
  • Link data set D with ontology O1 (w/
    instance-based heuristic)
  • Query D using ontology O2
  • (e.g., rock classification: O1 = GSC, O2 = BGS)
  • Ontology-Enabled Application

18
Amalgamating OWL Rocktype Ontologies
  • OWL ontologies
  • Geologic age
  • GSC's hierarchies
  • Composition
  • Texture
  • Fabric
  • Genesis
  • BGS
  • All-in-one hierarchy
  • Mappings
  • for now: synonyms only
  • next steps: more of OWL DL

19
Navigable, Amalgamated Rocktype Ontology
21
Example: Geologic Map Integration
22
Ontology-Enabled Geologic Map Integration
Demonstration (Kai Lin @ SDSC)
  • http://kbis.sdsc.edu/GEON/map-integration.html

23
GEON and Semantic Data Integration
24
Mediator Demo
25
Getting Formal: Source Contextualization &
Ontology Refinement in Logic
Biomedical Informatics Research
Network http://nbirn.net
26
Distributed Query Processing Challenges, Part
I: The Basics
  • Scientific data integration (BIRN, GEON, ...) is
    a variant of the data integration problem studied
    by the database (CS) community
  • Given
  • user query against integrated view
  • view to source mappings (GAV/LAV)
  • sources with limited access patterns
  • Compute a distributed query plan P s.t.
  • P has a feasible execution order
  • P is optimized wrt. time/space/networking complexity
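
The notion of a "feasible execution order" under limited access patterns can be sketched as follows. This is a minimal greedy ordering, assuming each source call declares which variables must already be bound (inputs) and which it binds (outputs); the `Goal` type, the variable names, and the three example calls are hypothetical. It does not attempt the optimization part of the problem.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set


class Goal(NamedTuple):
    name: str
    inputs: FrozenSet[str]   # variables that must be bound before the call
    outputs: FrozenSet[str]  # variables bound by the call


def feasible_order(goals: List[Goal], initially_bound: Set[str]) -> Optional[List[str]]:
    """Return a call order that respects all access patterns, or None if none exists."""
    bound = set(initially_bound)
    remaining = list(goals)
    order: List[str] = []
    while remaining:
        # Pick any goal whose required inputs are already bound.
        ready = next((g for g in remaining if g.inputs <= bound), None)
        if ready is None:
            return None  # every remaining goal is blocked
        order.append(ready.name)
        bound |= ready.outputs
        remaining.remove(ready)
    return order


# Hypothetical plan, deliberately listed in a non-executable order.
goals = [
    Goal("transfac", frozenset({"Seq"}), frozenset({"Model"})),
    Goal("blast", frozenset({"CDNASeq", "DB"}), frozenset({"Seq"})),
    Goal("genbank", frozenset({"GeneId"}), frozenset({"CDNASeq"})),
]
plan = feasible_order(goals, {"GeneId", "DB"})
print(plan)
```

Because executing a call only ever adds bindings, the greedy choice never blocks a goal that some other order could have reached, so returning `None` means no feasible order exists for the given bindings.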

27
Real-time Observatories, Applications, and
Data management Network (ROADNet)
  • Autonomous field sensors
  • Seismic, oceanic, climate, ecological, ..., video,
    audio, ...
  • RT Data Acquisition
  • ANZA Seismic Network (1981-present): 13 Broadband
    Stations, 3 Borehole Strong Motion Arrays, 5
    Infrasound Stations, 1 Bridge Monitoring System
  • Kyrgyz Seismic Network (1991-present): 10
    Broadband Stations
  • IRIS PASSCAL Transportable Array (1997-present):
    15-60 Broadband and Short Period Stations
  • IDA Global Seismic Network (1990-present):
    38 Broadband Stations
  • High Performance Wireless Research Network
    (HPWREN)
  • High-performance backbone network: 45 Mbps duplex
    point-to-point links, backbone nodes at quality
    locations, network performance monitors at
    backbone sites
  • High-speed access links: hard-to-reach areas,
    typically 45 Mbps or 802.11 radios,
    point-to-point or point-to-multipoint
  • Data Grid Technology (SRB)
  • collaborative access to distributed heterogeneous
    data, single sign-on authentication and seamless
    authorization, data scaling to petabytes and 100s
    of millions of files, data replication, etc.

28
A P2P Problem from ROADNet
  • Networks of ORBs send each other various data
    streams
  • Avoid actual loops in the presence of virtual
    loops
  • A -> B -> C -> A
  • A --c1--> B
  • B --c2--> C
  • C --c3--> A
  • ...
  • Idea: check whether L(c1) ∩ L(c2) ∩ L(c3) is
    empty (a stream loops only if it passes every
    filter on the cycle)
  • In the real system: Unix regexps
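
A rough sketch of the loop check, under a simplifying assumption: instead of deciding emptiness of the intersection of the regular languages (which would need an automaton product), it tests a finite list of known stream names against every filter on the cycle. The stream names and filter patterns below are invented for illustration and are not ROADNet's actual naming scheme.

```python
import re
from typing import Iterable, List


def names_surviving_cycle(filters: List[str], known_streams: Iterable[str]) -> List[str]:
    """Stream names accepted by every filter on the cycle, i.e. names that would loop."""
    compiled = [re.compile(f) for f in filters]
    return [s for s in known_streams if all(c.fullmatch(s) for c in compiled)]


# Hypothetical stream names and per-link filters c1, c2, c3.
streams = ["AZ_PFO_BHZ", "AZ_PFO_LHZ", "KN_AAK_BHZ"]
cycle_filters = [r"AZ_.*", r".*_BHZ", r".*_PFO_.*"]

looping = names_surviving_cycle(cycle_filters, streams)
print(looping)

# A cycle whose filters are mutually exclusive is only a virtual loop.
safe = names_surviving_cycle([r"AZ_.*", r"KN_.*"], streams)
print(safe)
```

An empty result only shows that none of the sampled names loop; a full check would have to reason about the filter languages themselves.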

29
Scientific Workflows and Analytical Pipelines
30
What is a Scientific Workflow?
  • A Misnomer
  • well, at least for a number of examples
  • Scientific Workflows ≠ Business Workflows
  • Business Workflows: control-flow-rich
  • Scientific Workflows: data-flow-rich

31
Scientific Workflows ≠ Business Workflows
  • Business Workflows
  • show their office automation ancestry
  • documents and work-tasks are passed
  • no data streaming, no data-intensive pipelines
  • lots of standards to choose from: WfMC, BPEL,
    BPEL4WS, ..., XPDL, ...
  • but no clear semantics for constructs as simple
    as this

Source: "Expressiveness and Suitability of
Languages for Control Flow Modelling in
Workflows", PhD thesis, Bartosz Kiepuszewski, 2002
32
Scientific Workflows ≠ Business Workflows
  • Scientific Workflows
  • Data-intensive, data streaming approach
  • Execution pipelines
  • Have analysis and modeling (simulation) steps
  • cf. Ptolemy-II, SEEK, ...
  • are often really Scientific Analysis
    Pipelines/Dataflows
  • with some control-flow, e.g., for collection
    programming
  • inherit features from
  • good old visualization pipelines: AVS, Khoros, ...
  • problem solving environments, workbenches, ...
  • heterogeneous modeling systems (Ptolemy-II)
  • Grid environments, portals, ...

33
The Brave New Old World of Workflow Management
34
The ZEN of Workflow Patterns (Source:
http://tmitwww.tm.tue.nl/research/patterns/)
  • Basic Control Patterns
  • Sequence - execute activities in sequence
  • Parallel Split - execute activities in parallel
  • Synchronization - synchronize two parallel
    threads of execution
  • Exclusive Choice - choose one execution path
    from many alternatives
  • Simple Merge - merge two alternative execution
    paths
  • Advanced Branching and Synchronization Patterns
  • Multiple Choice - choose several execution paths
    from many alternatives
  • Multiple Merge - merge many execution paths
    without synchronizing
  • Discriminator - merge many execution paths
    without synchronizing. Execute the subsequent
    activity only once.
  • N-out-of-M Join - merge many execution paths.
    Perform partial synchronization and execute
    subsequent activity only once.
  • Synchronizing Join - merge many execution paths.
    Synchronize if many paths are taken. Simple merge
    if only one execution path is taken

35
The ZEN of Workflow Patterns
  • Structural Patterns
  • Arbitrary Cycles - execute workflow graph w/out
    any structural restriction on loops
  • Implicit Termination - terminate if there is
    nothing to be done
  • Patterns Involving Multiple Instances
  • MI with a priori known design time knowledge -
    generate many instances of one activity when a
    number of instances is known at the design time
  • MI with a priori known runtime knowledge -
    generate many instances of one activity when a
    number of instances can be determined at some
    point during the runtime (as in FOR loop)
  • MI with no a priori runtime knowledge - generate
    many instances of one activity when a number of
    instances cannot be determined (as in WHILE loop)
  • MI requiring synchronization - generate many
    instances of one activity and synchronize them
    afterwards

36
The ZEN of Workflow Patterns
  • State-based patterns
  • Deferred Choice - execute one of two
    alternative threads. The choice of which thread
    is to be executed should be implicit.
  • Interleaved Parallel Routing - execute two
    activities in random order, but not in parallel.
  • Milestone - enable an activity until a milestone
    is reached
  • Cancellation Patterns
  • Cancel Activity - cancel (disable) an enabled
    activity
  • Cancel Case - cancel (disable) the process

37
The ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et
al., http://tmitwww.tm.tue.nl/research/patterns/
38
OK, back to dataflow
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
  • Extensible Open Source Tool (EECS UC Berkeley)
  • Various combinable, clearly defined execution
    models (domains)
  • Process Networks (Kahn, MacQueen), Synchronous
    Dataflow Networks,
  • Discrete Events, Continuous Events, ...
  • executed by corresponding directors

39
Some Commercial Tools
40
Another one
41
and another one
42
MAP Data Massaging a la Blue-Titan/Perl
43
Scientific Workflows Everywhere
  • Chimera, Pegasus, Dagman, CondorG, ...
  • The Mission
  • Sort out all the issues! For example:
  • Supported execution models
  • Physical vs. logical resources (function nodes,
    data transformer nodes, data nodes)
  • Smarter Typing
  • e.g., for connecting output(f_n) -> input(f_n+1)
  • Storage types (XML Schema?)
  • PL types (Haskell-like? Hindley-Milner?
    Mycroft-O'Keefe?)
  • Semantic types (OWL DL?)
  • Unit types (EML unit dictionary, derived from
    STMML)
  • Grid resource type !???
  • Work towards a common framework (yet another
    standard!)

44
Example Workflows/Pipelines
  • BIRN (Neuroimaging pipelines, ...)
  • GEON (Pluton characterization, ...)
  • GriPhyN (Sloan Sky Survey, ...)
  • SCEC (Pathways 1 & 2, ...)
  • SEEK (Ecological niche modeling, ...)
  • SciDAC/SDM (Promoter identification, ...)

45
Biomedical Informatics Research
Network http://nbirn.net
Scientific Workflows/Analytical Pipelines over
Brain Data
46
SEEK Vision Overview
  • Large collaborative NSF/ITR project: UNM, UCSB,
    SDSC/UCSD, UKansas, ...
  • Fundamental improvements for researchers: global
    access to ecologically relevant data; rapidly
    locate and utilize distributed computation;
    capture, reproduce, extend analysis process

47
SEEK Components
  • EcoGrid
  • Seamless access to distributed, heterogeneous
    data: ecological, biodiversity, environmental
    data
  • Semantically mediated and metadata driven
  • Centralized search management portal(s)
  • Analysis and Modeling System
  • Capture, reproduce, and extend analysis process
  • Declarative means for documenting analysis
  • Pipeline system for linking generic analysis
    steps
  • Strong version control for analysis steps
  • Easy-to-use interface between data and analysis
  • Semantic Mediation System
  • smart data discovery, type-correct pipeline
    construction & data binding
  • determine whether/how to link analytic steps
  • determine how data sets can be combined
  • determine whether/how data sets are appropriate
    inputs for analysis steps

48
AMS Overview
  • Objective
  • Create a semi-automated system for analyzing data
    and executing models that provides documentation,
    archiving, and versioning of the analyses,
    models, and their outputs (visual programming
    language?)
  • Scope
  • Any type of analysis or model in ecology and
    biodiversity science
  • Massively streamline the analysis and modeling
    process
  • Archiving, rerunning analyses in SAS, Matlab, R,
    SysStat, C(++), ...

49
SMS Requirements from AMS
  • ...assist users in determining the
    appropriateness of combining various analytical
    steps and data sources based on semantic
    mediation...
  • Semantic mediation should occur in three areas
  • determine whether it is appropriate to link
    together particular analytic steps.
  • mediate between multiple data sets to determine
    in what ways they can be combined.
  • determine whether the selected data sources are
    appropriate inputs for the selected analysis.

50
Some functional requirements
  • SMS should have the ability to ...
  • FR1: recognize data types (XML Schema types!? EML
    types?) of registered EcoGrid data sets
  • FR2: recognize semantic types (OWL and/or RDF(S)
    !?) of registered EcoGrid data sets
  • FR3: recognize registered EcoGrid ontologies
  • Note: semantic types reference those ontologies
  • FR4: recognize data type signature (XML Schema?
    WSDL?) of analytical steps (ASs)
  • FR5: recognize semantic type signature of
    analytical steps
  • FR6: recognize semantic constraints (OWL?
    First-order? What syntax? KIF? Prolog?)
  • Note: data schemas and signatures of analytical
    steps have those

51
... some functional requirements
  • Ability to ...
  • FR8: check well-typedness (data and semantics) of
    a data set wrt. an analytical step
  • FR9: check compatibility of two data sets wrt.
    "generalized operations" between those data sets
    (e.g., "semantic" join and union)
  • FR10: check well-typedness (data and semantics)
    of chained analytical steps
  • FR11: introduce data type conversions (e.g., int
    -> float)
  • FR12: perform and "explain" semantic type
    substitutions
  • (e.g., if some AS works for Cs and D-isa-C, it
    also works for Ds)
  • FR13 (optional): generate type-correct APs from a
    given schema of desired output and (optionally)
    input parameters
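
Requirements FR10 to FR12 can be illustrated with a small sketch that checks a chain of analytical steps, accepts a semantic subtype where a supertype is expected, and inserts a data type conversion when one is registered. Everything here is an assumption made for illustration: the `ISA` facts, the `CONVERSIONS` table, the step names, and the `Port`/`Step` types are hypothetical and not part of the SMS design.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# Hypothetical semantic is-a facts (D isa C) and registered data type conversions.
ISA: Dict[str, str] = {"Abundance": "Measurement"}
CONVERSIONS: Dict[Tuple[str, str], str] = {("int", "float"): "int_to_float"}


class Port(NamedTuple):
    data_type: str
    semantic_type: str


class Step(NamedTuple):
    name: str
    input: Port
    output: Port


def is_subtype(d: str, c: str) -> bool:
    """True if d equals c or d isa ... isa c (semantic type substitution)."""
    node: Optional[str] = d
    while node is not None:
        if node == c:
            return True
        node = ISA.get(node)
    return False


def check_chain(steps: List[Step]) -> List[str]:
    """Return the step names with conversions inserted; raise TypeError on a mismatch."""
    if not steps:
        return []
    plan = [steps[0].name]
    for prev, nxt in zip(steps, steps[1:]):
        out, inp = prev.output, nxt.input
        if not is_subtype(out.semantic_type, inp.semantic_type):
            raise TypeError(f"{prev.name} -> {nxt.name}: semantic type mismatch")
        if out.data_type != inp.data_type:
            conversion = CONVERSIONS.get((out.data_type, inp.data_type))
            if conversion is None:
                raise TypeError(f"{prev.name} -> {nxt.name}: no data type conversion")
            plan.append(conversion)
        plan.append(nxt.name)
    return plan


count = Step("count_individuals", Port("str", "SiteId"), Port("int", "Abundance"))
model = Step("fit_model", Port("float", "Measurement"), Port("float", "Prediction"))
plan = check_chain([count, model])
print(plan)
```

Here `fit_model` is declared for `Measurement`, so it also accepts `Abundance`, and the int output is bridged by the registered `int_to_float` conversion.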

52
Use Cases
  • Clients of the SMS include the AMS, the EcoGrid,
    and "scientific workflow engineers".
  • UC1: Client requests type signature (data and
    semantic types) of a registered EcoGrid data set
    (DS)
  • UC2: Client requests "other semantic constraints"
    of a DS.
  • UC3: Client requests type signature (data and
    semantic types) of an analytical step (AS)
  • UC4: Client requests "other semantic constraints"
    of an AS.
  • UC5: Client requests type signature of an AP.
  • UC6: Client requests type checking of AP.
  • UC7: Client requests registered data sets
    compatible with the inputs of an AS (e.g., if AS
    is scale sensitive, then all data sets must have
    the same scale; a flag is raised if data needs
    scaling).
  • UC8: Client requests all registered ASs which can
    produce a given parameter (the latter is part
    of a registered ontology)
  • UC9: Client requests candidate predecessor and
    successor steps for a given AS.

53
Planned Components
  • SW1: Formal language(s) for representing/instantiating
    data types, semantic types, ontologies, and
    "other semantic constraints".
  • SW2: System for data type checking and inference
    (includes introduction of data type conversion
    steps)
  • SW3: System for semantic type checking and
    inference
  • SW4 (optional): System for "planning" APs given
    some of output parameters, data sets, and input
    parameters

54
A Problem: Abstract to Executable WFs
  • Scientists would like to ...
  • create a high-level abstract WF and
  • not bother about web service URLs, parameter
    passing, low-level data transformations, control
    flow, ...
  • How to go from ...
  • a high-level Abstract Workflow (AWF) to
  • an Executable Workflow (EWF) of web services?
  • Basic idea
  • Use nested definitions (sub-workflows) to conquer
    complexity
  • => Abstract-as-View (AAV) approach: demo and short
    paper @ SSDBM'03
  • Possible approaches
  • WF engine to execute complex nested workflows
  • => compile (or flatten / unfold) AWF into EWF
    via AAV
  • inspired by (and potential of) query rewriting in
    database mediation
55
Conceptual Workflow (Promoter Identification
Workflow, PIW)
Compute clusters (min. distance)
For each promoter
Select gene-set (cluster-level)
Compute Subsequence labels
For each gene
With all Promoter Models
Compute Joint Promoter Model
56
Unfolded EWF
[Figure: unfolded executable workflow for the PIW. Diagram
labels include prepareClustalWInput, manageClustalW Loop,
Genbank1, Genbank2, BlastPromoter, complement Sequence,
trimSequence, outputNext Promoter, ClustalW, and
TRANSFACMatInspector, connected by data such as geneList,
cDNASeq, promoters, multipleSeq Alignment, and inspected
TFBSs, with branches on orient > 0 / orient < 0 and a loop
back while geneList is not empty]
57
Simple WF Language Constructs (both AWF & EWF)
[Figure: graphical constructs, several with a
conditional (cond) branch]
58
AWF for Matt's PIW
[Figure: diagram labels include selectGeneSet (expression
Array -> geneList) and LOOP1, "for each gene"
(managegeneLoop, while geneList not EMPTY), which emits
the next gene and the updated Gene List; Loop1 Final is
reached when geneList is EMPTY]
59
Figure 1: EWF for Matt's PIW (partially unfolded)
[Figure: diagram labels include prepareClustalWInput,
manageClustalW Loop (geneListEmpty / geneListNOTEmpty,
loop back), complement Sequence (orient > 0 / orient < 0,
plusSeq / minusSeq), ClustalW (Sequence List -> multipleSeq
Alignment), and TRANSFACMatInspector (sequence -> inspected
TFBSs)]
60
AWF to EWF via Abstract-as-View (AAV)
[Figure: three levels. AWF: piw takes DB and Gene and
yields promoters and tfbs_models (Promoters, TFBSModels).
AAV: the promoters AAV uses gene_seq (Gene -> CDNASeq) and
localAlignment (DB, CDNASeq -> Promoters). EWF: the
gene_seq AAV uses convertToAcc (GeneId) and the services
genbank_service (GenbankId -> cDNASeq), embl_service
(EMBLId -> cDNASeq), and ddbj_service (DDBJId -> cDNASeq)]
61
Abstract Workflow (AWF) (here: Datalog chain
program over relations with i/o patterns)
AWF:
piw(DB, Gene, TFBSModel) :-
    cDNASequence(Gene, GeneSeq),
    localAlignment(DB, CDNASeq, RankedPromoterList),
    firstRest(Promoter, RankedPromoters, RankedPromoters1),
    promoter_detail(Promoter, PromoterId, Start, End, Orientation),
    cDNASequence(PromoterId, GenomicSeq),
    trim_sequence(GenomicSeq, Start, End, Orientation, ShortSeq),
    convertSeq(Orientation, ShortSeq, PosSeq),
    transfac(PosSeq, TFBSModel).
62
Abstract-As-View (AAV) Definitions (note:
Control-Flow Issues)
AAV:
cDNASequence(GeneId, CDNASeq) :-
    genbank(GeneId, CDNASeq) ;
    fail(genbank), embl(GeneId, CDNASeq) ;
    fail(genbank), fail(embl), ddbj(GeneId, CDNASeq).
localAlignment(DB, CDNASeq, RankedPromoterList) :-
    blast(CDNASeq, DB, xml, RankedPromoterList) ;
    fail(blast), fasta(CDNASeq, DB, RankedPromoterList) ;
    fail(blast), fail(fasta),
    blat(CDNASeq, querytype, sortcriteria, outputtype, RankedPromoterList).
convertSeq(Orientation, ShortSeq, PosSeq) :-
    negative(Orientation), complement(ShortSeq, PosSeq) ;
    equals(ShortSeq, PosSeq).
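
The control-flow issue hinted at in the AAV definitions, trying genbank first and falling back to embl and then ddbj only on failure, can be sketched as ordered failover. The three service functions below are stand-in stubs that record their invocation and either fail or return a placeholder string; they are not real clients for these databases, and `first_success` and `ServiceFailure` are hypothetical helper names.

```python
from typing import Callable, List, Sequence


class ServiceFailure(Exception):
    """Raised by a (stub) service call that could not deliver a result."""


calls: List[str] = []  # records which stubs were actually invoked


def genbank(gene_id: str) -> str:
    calls.append("genbank")
    raise ServiceFailure("unavailable")  # simulate fail(genbank)


def embl(gene_id: str) -> str:
    calls.append("embl")
    return f"cDNA({gene_id}) via embl"


def ddbj(gene_id: str) -> str:
    calls.append("ddbj")
    return f"cDNA({gene_id}) via ddbj"


def first_success(services: Sequence[Callable[[str], str]], arg: str) -> str:
    """Try the alternatives in order; a later one runs only if all earlier ones failed."""
    errors = []
    for service in services:
        try:
            return service(arg)
        except ServiceFailure as exc:
            errors.append(f"{service.__name__}: {exc}")
    raise ServiceFailure("; ".join(errors))


def cdna_sequence(gene_id: str) -> str:
    """Executable counterpart of the abstract task cDNASequence(GeneId, CDNASeq)."""
    return first_success([genbank, embl, ddbj], gene_id)


sequence = cdna_sequence("g42")
print(sequence, calls)
```

The abstract task stays a single step for the workflow designer, while the ordering and failure handling live in the view definition.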
63
AWF to EWF (cont'd)
User supplied
Declarative specification
GetGenomicSequence(selectedGene, -GenomicSequence) :-
    GENBANK(selectedGene, -cDNASequence),
    BLAST(cDNASequence, dbName, format, -rankedGenomicSequenceList).
GetGenomicSequence(selectedGene, -GenomicSequence) :-
    GENBANK(selectedGene, -cDNASequence),
    BLAT(cDNASequence, QueryType, SortCriteria, OutputType,
         -rankedGenomicSequenceList).
IdentifyPromoterElements(rankedGenomicSequenceList, -element) :-
    PromoterSequences(rankedGenomicSequenceList,
        getBeginEnd(Species, -Begin, -End), -element).
For each gene
Need extra domain knowledge
Translation to EWF needs creation of iterators
Same functionality, different operational
constraints and availability
64
AWF -> EWF Translation
  • Check whether the AWF is well-formed and well-typed;
    if not, corresponding warnings are issued (a
    semantic type mismatch may not only be a workflow
    design error, but often indicates the
    incompleteness of the underlying ontology).
  • Next, the AWF is successively unfolded, using the
    AAV view definitions.
  • (Compiling AWF into EWF using AAV is similar to
    rewriting a query against a global schema into
    queries against the sources.)
  • The unfolded logic query plan then undergoes
    several rewriting steps until a certain normal
    form (DNF/UCQ?) is reached. If the join variables
    (the connection edges) are not of the same data
    type (but at least of compatible semantic types),
    then the insertion of conversion rules is
    attempted; if this fails, an error is reported.
  • For each list of conjunctive goals, the system
    tries to find an executable goal order, i.e., one
    which satisfies all i/o restrictions imposed by
    the web service descriptions of executable tasks.
  • Implementation: a set of Java and Prolog
    programs, rules, ontologies and repositories
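
The unfolding step can be sketched as view expansion into a union of conjunctive plans (a DNF-like form). This is a deliberately simplified illustration: it tracks task names only, ignores variables, i/o patterns, and the fail(...) guards, and assumes the view definitions are non-recursive. The `AAV` table below is a hand-written abbreviation of the definitions on the earlier slides, not the actual repository content.

```python
from itertools import product
from typing import Dict, List

# Abstract task -> alternative bodies; each body is a sequence of task names.
AAV: Dict[str, List[List[str]]] = {
    "piw": [["cDNASequence", "localAlignment", "transfac"]],
    "cDNASequence": [["genbank"], ["embl"], ["ddbj"]],
    "localAlignment": [["blast"], ["fasta"]],
}


def unfold(task: str) -> List[List[str]]:
    """All fully unfolded plans for a task; tasks without a view are executable."""
    if task not in AAV:
        return [[task]]
    plans: List[List[str]] = []
    for body in AAV[task]:
        # Unfold each goal of the body, then combine one alternative per goal.
        alternatives = [unfold(goal) for goal in body]
        for combination in product(*alternatives):
            plans.append([t for part in combination for t in part])
    return plans


plans = unfold("piw")
for p in plans:
    print(" , ".join(p))
```

Each resulting plan is a conjunctive list of executable tasks, which is the input the goal-ordering step then works on.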

65
Generated EWF Plan (using NIH BIRN Mediation
Tool)
66
More Problems: Reconcile this
  • Simple, intuitive graph/pipeline language,
  • which is expressive enough to handle real-world
    flows
  • and allows some static analysis
  • e.g., compile-time type checking, resource
    allocation and staging, ...
  • while trying to leverage existing work
  • e.g., Ptolemy-II directors: Process Networks
    (PN), Synchronous Dataflow (SDF), ...,
  • standards efforts (Grid workflow languages, ...)

67
The Spectrum: (Analytical) Pipelines ...
(Scientific) Workflows
  • Spectrum of languages and formalisms
  • Pipelines (a la Unix)
  • Dataflow languages
  • Kahn's process networks (PN)
  • Synchronous dataflow networks (SDF)
  • Web page-flow
  • Active XML, WebML, ...
  • Hesitating-weak-alternating-tree-automata-ML
  • (Business) Workflows
  • WfMC's XPDL, WSFL, BPEL4WS, ...
  • Grid workflows (Chimera, Pegasus, Condor, ??)
  • Which ones?
  • Consolidation? Standardization?

68
Example: Promoter Identification Workflow (PIW)
(simplified)
  • scientific data sets flow between the steps
  • abstraction of tasks into higher conceptual
    levels
  • branching/merging of tasks and looping

69
(Ptolemy II-Based Architecture)
WF-Pilot
[Figure: design, execution, and execution monitoring are
all based on Ptolemy-II (directors: PN, SDF, ...,
XPDL/OFBiz-style Ptolemy-II director), with SciDAC
extensions to Ptolemy-II and a web service plug-in. An AWF
is turned into a Valid-AWF or validation errors via query
rewriting, semantic type checking, data type conversion,
and web service matching; ETs perform web service
invocations (e.g., Genbank, BLAST). Repositories: Abstract
Task (AT) Repository, Executable Task (ET) Repository,
Data & Parameter Ontologies, Datatype Conversion
Repository]
ET = web service; AT = mini workflow of ETs (composition
of ETs and ATs), which may become a web service if
deployed
70
It's demo time again
71
Details of PIW
72
Some Problems and Possible Solutions
  • PIW-1: control flow overly complicated
  • while scientific workflows are really
    dataflow analysis pipelines
  • PIW-1 designed to fit (an unfit design)
  • Now: a PT-II custom solution (replacing Perl
    PIW-0 custom solution)
  • => General WF design and execution methodology
    (SWF Engineering)
  • Now: every web service is hand-crafted
  • => different instantiations of a single generic
    WSDL actor
  • Now: the control flow is hand-crafted
  • => Declarative functional programming constructs
    (map, ...)
  • Now: no plugging together of 3rd party tools/web
    services/actors
  • => General Data Transformation Actor
  • Now we know better how to design for reuse!!
  • (Ptolemy folks knew it all along ...)

73
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
74
Simplified Process Network PIW
  • Back to purely functional dataflow process
    network
  • (a data streaming model!)
  • Re-introducing map(f) to Ptolemy-II (was there in
    PT Classic)
  • no control-flow spaghetti
  • data-intensive apps
  • free concurrent execution
  • free type checking
  • automatic support to go from piw(GeneId) to PIW =
    map(piw) over GeneId
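
The step from piw(GeneId) to map(piw) over a collection of gene IDs can be sketched as a generic map actor. This is only an analogy in Python, not Ptolemy-II code: `piw` is a placeholder that returns a string, `map_actor` is a hypothetical name, and a thread pool stands in for the concurrency that a process network would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def piw(gene_id: str) -> str:
    """Placeholder for the single-gene sub-workflow piw(GeneId)."""
    return f"promoter_model({gene_id})"


def map_actor(f: Callable[[A], B], items: Iterable[A], workers: int = 4) -> List[B]:
    """Apply f to every item; calls may run concurrently, results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))


gene_ids = ["g1", "g2", "g3"]
models = map_actor(piw, gene_ids)
print(models)
```

Because `piw` has no hand-written loop or backward control flow, the iteration over genes is expressed once, in the generic actor, and independent applications can run concurrently.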

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
75
Optimization by Declarative Rewriting
  • PIW as a declarative, referentially transparent
    functional process
  • optimization via functional rewriting possible
  • e.g., map(f o g) = map(f) o map(g)
  • Details
  • Technical report: PIW specification in Haskell
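
The rewriting rule map(f o g) = map(f) o map(g) can be checked on a small example. The sketch below is in Python rather than Haskell, uses list comprehensions for map, and picks two arbitrary pure functions `f` and `g` for illustration; the equality relies on both being side-effect free.

```python
from typing import Callable, List


def compose(f: Callable, g: Callable) -> Callable:
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def fmap(f: Callable) -> Callable[[List], List]:
    """Lift a function on elements to a function on lists."""
    return lambda xs: [f(x) for x in xs]


def g(x: int) -> int:
    return x + 1


def f(x: int) -> int:
    return x * 2


data = [1, 2, 3]
two_passes = compose(fmap(f), fmap(g))(data)  # map(f) o map(g): intermediate list
one_pass = fmap(compose(f, g))(data)          # map(f o g): single traversal
print(two_passes, one_pass)
```

Both forms give the same result, but the fused form avoids materializing the intermediate collection, which is the kind of optimization a declarative specification makes available.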

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
76
Summary and More Open Issues
  • Generic WSDL & Grid service actors
  • creates a web service actor from WSDL description
  • creates a grid service actor from a grid service
    description !??
  • Data transformation actors
  • XSLT, XQuery, Perl, ...
  • can be enhanced by semantic types
  • Declarative collection programming (map(f), ...)
  • Compilation approach: AWF -> EWF
  • User interaction at design/compile-time and
    runtime
  • Automated WF auditing, pause & resume, ...
  • Standard exchange language for scientific
    workflows (MoML vs. SciDAC, SEEK,
    Chimera/Pegasus, Dagman, DiscoveryNet, ...)
  • Virtualize and Grid-enable everything!
  • Computation@LOC, Data@LOC, HELP!!

77
Combine Everything: Die eierlegende Wollmilchsau ("the egg-laying wool-milk-sow", i.e., an all-in-one solution)
  • Database Federation/Mediation
  • query rewriting under GAV/LAV
  • w/ binding pattern constraints
  • distributed query processing
  • Semantic Mediation
  • semantic integrity constraints, reasoning w/
    plans, automated deduction
  • deductive database/logic programming technology,
    AI stuff...
  • Semantic Web technology (OWL, ...)
  • Scientific Workflow Management
  • more procedural than database mediation (often
    the scientist is the query planner)
  • deployment using grid services!

78
F I N