Title: Kepler: Towards a Grid-Enabled System for Scientific Workflows
1. Kepler: Towards a Grid-Enabled System for Scientific Workflows
Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock
ludaesch@SDSC.EDU
San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD)
2. Outline
- Motivation: Scientific Workflows (SEEK, SDM, GEON, ...)
- Current Features of the Kepler Scientific Workflow System
- Extending Kepler
- Grid-Enabling Kepler
- 3rd party transfer
- WF planning and optimization
- Shipping and Handling Algebra (SHA)
- Web Service Composition as Declarative Query Plans
- Semantic Types for Scientific Workflows
- Conclusions
3. Kepler Team, Projects, Sponsors
- Ilkay Altintas (SDM)
- Chad Berkley (SEEK)
- Shawn Bowers (SEEK)
- Jeffrey Grethe (BIRN)
- Christopher H. Brooks (Ptolemy II)
- Zhengang Cheng (SDM)
- Efrat Jaeger (GEON)
- Matt Jones (SEEK)
- Edward A. Lee (Ptolemy II)
- Kai Lin (GEON)
- Bertram Ludäscher (BIRN, GEON, SDM, SEEK)
- Steve Mock (NMI)
- Steve Neuendorffer (Ptolemy II)
- Jing Tao (SEEK)
- Mladen Vouk (SDM)
- Yang Zhao (Ptolemy II)
4. Example: SEEK, Science Environment for Ecological Knowledge (large NSF ITR)
- Analysis and Modeling System
- Design and execution of ecological models and analysis
- End user focus
- application-/upperware
- Semantic Mediation System
- Data integration of hard-to-relate sources and processes
- Semantic Types and Ontologies
- upper middleware
- EcoGrid
- Access to ecology data and tools
- middle-/underware
Architecture Overview (cf. Cyberinfrastructure)
5. Ecology: GARP Analysis Pipeline for Invasive Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
6. Genomics Example: Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
7. Source: NIH BIRN (Jeffrey Grethe, UCSD)
8. Scientific Workflows: Some Findings
- More dataflow than (business control-/) workflow
- DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, ...
- Need for programming extension
- Iterations over lists (foreach); filtering; functional composition; generic and higher-order operations (zip, map(f), ...)
- Need for abstraction and nested workflows
- Need for data transformations (WS1 -> DT -> WS2)
- Need for rich user interaction and workflow steering
- pause / revise / resume
- select and branch; e.g., web browser capability at specific steps as part of a coordinated SWF
- Need for high-throughput transfers (grid-enabling, streaming)
- Need for persistence of intermediate products and provenance
9. Scientific Workflows vs Business Workflows
- Scientific Workflows
- Dataflow and data transformations
- Data problems: volume, complexity, heterogeneity
- Grid aspects
- Distributed computation
- Distributed data
- User interactions/WF steering
- Data, tool, and analysis integration
- => Dataflow and control-flow are married!
- Business Workflows (BPEL4WS, ...)
- Task-orientation: travel reservations; credit approval; BPM
- Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects are still identifiable throughout
- Complex control flow, complex process composition (danger of control flow/dataflow spaghetti)
- => Dataflow and control-flow are divorced!
10. In a Flux: Workflow Standards
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
11. Commercial and Open Source Scientific Workflow (well, Dataflow) Systems
Kensington Discovery Edition from InforSense
Triana
Taverna
12. SCIRun: Problem Solving Environments for Large-Scale Scientific Computing
- SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
- New collaboration under Kepler/SDM
- Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)
13. Our Starting Point: Ptolemy II and Dataflow Process Networks
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
14. Why Ptolemy II?
- Ptolemy II Objective
- The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.
- Data and Process oriented: Dataflow process networks
- Natural Data Streaming Support
- User-Orientation
- application-ware (not middle-/under-ware)
- Workflow design and exec console (Vergil GUI)
- PRAGMATICS
- mature, actively maintained, well-documented (500pp)
- open source system
- developed across multiple projects (NSF/ITRs SEEK and GEON, DOE SciDAC SDM, ...)
- hoping to leverage e-sister projects (e.g., Taverna, ...)
15. Dataflow Process Networks: Putting Computation Models (Orchestration) first!
- Synchronous Dataflow Network (SDF)
- Statically schedulable single-threaded dataflow
- Can execute multi-threaded, but the firing-sequence is known in advance
- Maximally well-behaved, but also limited expressiveness
- Process Network (PN)
- Multi-threaded dynamically scheduled dataflow
- More expressive than SDF (dynamic token rate prevents static scheduling)
- Natural streaming model
- Other Execution Models (Domains)
- Implemented through different Directors
advanced push/pull
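The claim that an SDF firing sequence is known in advance can be made concrete with the balance equation for a single edge. The following is a minimal sketch, not Ptolemy II code; the two-actor graph and the names `repetitions` and `schedule` are assumptions introduced for illustration.

```haskell
-- Sketch (not Ptolemy II code): static scheduling for one SDF edge.
-- Actor A produces `p` tokens per firing; actor B consumes `c` per firing.
-- Both rates are assumed to be positive.

-- Smallest firing counts (rA, rB) solving the balance equation rA * p == rB * c.
repetitions :: Int -> Int -> (Int, Int)
repetitions p c = (c `div` g, p `div` g)
  where g = gcd p c

-- One period of a single-threaded schedule, computed before execution:
-- fire B whenever enough tokens are buffered, otherwise fire A.
schedule :: Int -> Int -> String
schedule p c = go rA rB 0
  where
    (rA, rB) = repetitions p c
    go 0 0 _ = []
    go a b buf
      | b > 0 && buf >= c = 'B' : go a (b - 1) (buf - c)
      | otherwise         = 'A' : go (a - 1) b (buf + p)

main :: IO ()
main = do
  print (repetitions 2 3)
  putStrLn (schedule 2 3)
```

With rates 2 and 3 the repetition counts are (3,2) and one period of the schedule is AABAB. A PN director cannot precompute such a sequence once token rates vary at run time.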
16. Actor-/Dataflow Orientation vs Object-/Control-flow Orientation
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
17. Marrying or Divorcing Control- and Dataflow
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
18. Overview: Scientific Workflows in Kepler
- Modeling and Workflow Design
- Web services as individual components (actors)
- Minute-Made Application Integration
- Plugging in and harvesting web service components is easy and fast
- Rich SWF modeling semantics (directors)
- Different and precise dataflow models of computation
- Clear and composable component interaction semantics
- => Web service composition and application integration tool
- Coming soon
- Shrink-wrapped, pre-packaged Kepler-to-Go
- Structural and semantic typing (better design support)
- Grid-enabled web services (for big data, big computations, ...)
- Different deployment models (web service, web site, applet, ...)
19. The KEPLER GUI: Vergil (Steve Neuendorffer, Ptolemy II)
Drag and drop utilities, director and actor libraries.
20. Running a Genomics WF (Ilkay Altintas, SDM)
21. Support for Multiple Workflow Granularities
Abstraction: Sand to Rocks
Figure labels: Powder, Sand, Boulders, Plumbing
22. Directors and Combining Different Component Interaction Semantics
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
23. Application Examples: Mineral Classification with Kepler (Efrat Jaeger, GEON)
24. ... inside the Classifier
25. Standard BrowserUI: Client-Side SVG
26. SWF Reengineering (Ashraf, Efrat, Kai, GEON)
27. DataMapper Sub-Workflow
28. Result launched via BrowserUI actor (coupling with ESRI's ArcIMS)
29. Distributed Workflows in KEPLER
- Web and Grid Service plug-ins
- WSDL (now) and Grid services (stay tuned ...)
- ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
- SSH, SCP, SDSC SRB, OGS?-??? coming
- WS Harvester
- Import query-defined WS operations as Kepler actors
- XSLT and XQuery Data Transformers
- to link web services that were not designed to fit together
- WS-deployment interface (planned)
30. Generic Web Service Actor (Ilkay Altintas)
- Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.
31. Set Parameters and Commit
32. Specialized WS Actor (after instantiation)
33. Web Service Harvester (Ilkay Altintas, SDM)
- Imports the web services in a repository into the actor library.
- Has the capability to search for web services based on a keyword.
34. Composing 3rd-Party WSs (NMI, Steve Mock)
Input of next web service
User interaction and Transformations
35. A Special Generic Ingestion Actor for EML Data (SEEK, Chad Berkley)
- Ingests any data format described by EML metadata
- Converts raw data to Ptolemy format
- Data can then be operated on with other actors
36. Wrapping Legacy Applications
37. Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
38. Execution Semantics
Promoter Identification Workflow in Ptolemy-II (SSDBM03)
39. Hand-crafted control solution; also forces sequential execution!
No data transformations available
Complex backward control-flow
40. Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> TFBS
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
41. Cleaned up Process Network PIW
- Back to purely functional dataflow process network
- (= also a data streaming model!)
- Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
- no control-flow spaghetti
- data-intensive apps
- free concurrent execution
- free type checking
- automatic support to go from piw(GeneId) to PIW = map(piw) over [GeneId]
map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow piw(GeneId)
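Going from a single-gene workflow to one that runs over a list of gene ids is a one-line lift in a functional setting. This is a minimal sketch under stated assumptions: `GeneId`, `Report`, `piwAll`, and the body of `piw` are placeholders, not the actual workflow.

```haskell
-- Sketch: lifting a single-gene workflow to a list of gene ids with map.
-- GeneId, Report, and the body of piw are placeholders (assumptions).
newtype GeneId = Gid String deriving (Show, Eq)
type Report = String

-- Stand-in for the forward-only sub-workflow piw(GeneId).
piw :: GeneId -> Report
piw (Gid g) = "promoter report for gene " ++ g

-- The lifted workflow: apply piw independently to every gene id.
piwAll :: [GeneId] -> [Report]
piwAll = map piw

main :: IO ()
main = mapM_ putStrLn (piwAll [Gid "7", Gid "42"])
```

Because each application of `piw` is independent of the others, the lifted workflow carries no control-flow wiring and leaves the runtime free to evaluate the elements concurrently.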
42. Optimization by Declarative Rewriting I
- PIW as a declarative, referentially transparent functional process
- optimization via functional rewriting possible
- e.g. map(f o g) = map(f) o map(g)
- Technical report and PIW specification in Haskell
map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
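The rewrite from two mapped actors to one can be checked on a small example. This sketch is not taken from the technical report; `normalize` and `scale` are hypothetical stand-ins for workflow actors.

```haskell
-- Sketch: the rewrite  map f . map g  ==>  map (f . g).
-- normalize and scale are hypothetical stand-ins for workflow actors.
normalize :: Double -> Double
normalize x = x / 100

scale :: Double -> Double
scale x = x * 2

-- Two traversals, with an intermediate list between them.
twoPass :: [Double] -> [Double]
twoPass = map scale . map normalize

-- One traversal after fusing the two actors.
fused :: [Double] -> [Double]
fused = map (scale . normalize)

main :: IO ()
main = print (twoPass [50, 100] == fused [50, 100])
```

The rewrite is only valid because both actors are side-effect free; referential transparency is what licenses it.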
43. Optimizing II: Streams and Pipelines
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
- Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens), e.g. mapS f . mapS g => mapS (f . g)
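The same transformation applies to pipelines over unbounded streams. In this sketch lazy lists stand in for streams, and `mapS` is defined locally for illustration rather than taken from the cited source; `calibrate` and `square` are hypothetical stages.

```haskell
-- Sketch: a pipeline over an unbounded stream, before and after fusion.
-- Lazy lists stand in for streams; mapS is defined here for illustration.
mapS :: (a -> b) -> [a] -> [b]
mapS = map

-- Two pipeline stages (hypothetical).
calibrate :: Int -> Int
calibrate = (+ 1)

square :: Int -> Int
square x = x * x

pipeline, fusedPipeline :: [Int] -> [Int]
pipeline      = mapS square . mapS calibrate
fusedPipeline = mapS (square . calibrate)

main :: IO ()
main = do
  let sensor = [0 ..]          -- infinite input stream
  print (take 5 (pipeline sensor))
  print (take 5 (fusedPipeline sensor))
```

Both versions consume the stream incrementally; the fused one simply removes the intermediate stream between the two stages.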
44. Middle/Underware Access: Querying Databases
- Database connection actor
- Opening a database connection and passing it to all actors accessing this database.
- Database query actor
- A generic actor that queries a database and provides its result.
- DBConnection type and DBConnectionToken
- A new IOPort type and a token to distinguish a database connection from any general type.
45. Database Connection Actor
- OpenDBConnection actor
- Input: database connection information
- Output: DBConnectionToken (reference to a DB connection instance, via a DBConnection output port)
46. Database Query Actor
- Database Query actor
- Input: SQL query string and a DB connection token
- Parameters
- output type: XML, Record, or String
- tuple-at-a-time vs set-at-a-time
- Process
- execute query
- produce results according to parameters
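The split between a connection actor and a query actor can be sketched with plain types. This is an illustration only, not the Kepler actors: there is no real database driver, the table is held in memory, a predicate stands in for the SQL string, and every name below is an assumption.

```haskell
-- Sketch (hypothetical types, no real database driver): a connection token
-- produced once and shared by every query actor.
import Data.List (intercalate)

type Row = [(String, String)]                    -- (column name, value)

-- Stand-in for DBConnectionToken: a reference to an open connection.
newtype DBConnectionToken = DBConnectionToken [Row]

data OutputType = AsXML | AsRecord | AsString

-- OpenDBConnection actor: connection info in, token out.
-- Here the "connection info" is simply an in-memory table.
openDBConnection :: [Row] -> DBConnectionToken
openDBConnection = DBConnectionToken

-- Database Query actor: the predicate stands in for an SQL query string.
-- Returns one formatted result per matching row.
queryActor :: OutputType -> DBConnectionToken -> (Row -> Bool) -> [String]
queryActor out (DBConnectionToken rows) matches =
  map (render out) (filter matches rows)

render :: OutputType -> Row -> String
render AsXML    row = "<row>" ++ concatMap cell row ++ "</row>"
  where cell (k, v) = "<" ++ k ++ ">" ++ v ++ "</" ++ k ++ ">"
render AsRecord row = "{" ++ intercalate ", " [k ++ " = " ++ v | (k, v) <- row] ++ "}"
render AsString row = unwords (map snd row)

main :: IO ()
main = do
  let table = [[("id", "1"), ("name", "quartz")], [("id", "2"), ("name", "calcite")]]
      conn  = openDBConnection table
  mapM_ putStrLn (queryActor AsXML conn ((== Just "2") . lookup "id"))
```

Giving the token its own type means a port that expects a connection cannot be wired to an arbitrary string or record by accident.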
47. Querying Example
48. An (oversimplified) Model of the Grid
- Hosts: h1, h2, h3, ...
- Data@Hosts: d1@hi, d2@hj, ...
- Functions@Hosts: f1@hi, f2@hj, ...
- Given: data/workflow
- as a functional plan: Y = f(X); Z = g(Y)
- as a logic plan: f(X,Y) ∧ g(Y,Z)
- Find: Host Assignment di -> hi, fj -> hj for all di, fj
- s.t. d3@h3 = f@h2(d1@h1), ... is a valid plan
49. Shipping and Handling Algebra (SHA)
Logical view
- plan: Y@C = F@A of X@B
Physical view: SHA Plans
- (1) X@B to A, Y@A = F@A(X@A), Y@A to C
- (2) F@A => B, Y@B = F@B(X@B), Y@B to C
- (3) X@B to C, F@A => C, Y@C = F@C(X@C)
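Choosing among the three physical plans for Y@C = F@A of X@B amounts to comparing what each one ships. The sketch below assumes a toy cost model (cost equals total size shipped, all links equally fast); the sizes and names are hypothetical and not part of SHA itself.

```haskell
-- Sketch: comparing the three physical SHA plans for Y@C = F@A of X@B
-- under a toy cost model (cost = total size shipped). Sizes are assumptions.
import Data.List (minimumBy)
import Data.Ord (comparing)

data Sizes = Sizes { sizeX :: Int, sizeF :: Int, sizeY :: Int }

data Plan = ShipDataToFunction      -- (1) X to A, run F at A, Y to C
          | ShipFunctionToData      -- (2) F to B, run F at B, Y to C
          | ShipBothToTarget        -- (3) X to C, F to C, run F at C
  deriving (Show, Eq, Enum, Bounded)

-- Total size moved between hosts by each plan.
cost :: Sizes -> Plan -> Int
cost s ShipDataToFunction = sizeX s + sizeY s
cost s ShipFunctionToData = sizeF s + sizeY s
cost s ShipBothToTarget   = sizeX s + sizeF s

-- Pick the plan with the smallest shipping cost.
cheapest :: Sizes -> Plan
cheapest s = minimumBy (comparing (cost s)) [minBound .. maxBound]

main :: IO ()
main = print (cheapest (Sizes { sizeX = 1000, sizeF = 5, sizeY = 20 }))
```

With a large input and a small function, shipping the function to the data wins; with a large result, running at the target host C avoids moving Y at all.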
50. Grid-Enabling PTII: Handles
- (1) A -> GA: get_handle
- (2) GA -> A: return X
- (3) A -> B: send X
- (4) B -> GB: request X
- (5) GB -> GA: request X
- (6) GA -> GB: send X
- (7) GB -> B: send done(X)
- Example
- X = GA.17 (the handle)
- X = <some_huge_file> (the data it refers to)
- Candidate Formalisms
- GridFTP
- SSH, SCP
- SDSC SRB
- OGS?-??? WSRF?
Logical token transfer (3) requires get_handle (1,2) then exec_handle (4,5,6,7) for completion.
Figure: Kepler space with actors A and B; Grid space with nodes GA and GB; arrows numbered 1-7.
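The handle mechanism can be sketched as follows: the actors exchange only a small handle, while the payload moves between grid nodes. This is a toy model, not a Kepler or Globus API; `GridStore`, `Handle`, `getHandle`, and `execHandle` are names assumed for illustration.

```haskell
-- Sketch: passing a handle between actors while the data stays in grid space.
-- GridStore, Handle, and the function names are illustrative assumptions.
import qualified Data.Map as Map

newtype Handle = Handle String deriving (Show, Eq, Ord)
type GridStore = Map.Map Handle String        -- handle -> (huge) payload

-- Steps 1-2: actor A registers data with its grid node GA and gets a handle.
getHandle :: String -> String -> GridStore -> (Handle, GridStore)
getHandle name payload ga = (h, Map.insert h payload ga)
  where h = Handle name

-- Steps 4-7: B's grid node GB pulls the payload from GA by handle.
-- Returns the updated GB store, or Nothing if GA does not know the handle.
execHandle :: Handle -> GridStore -> GridStore -> Maybe GridStore
execHandle h ga gb = fmap (\payload -> Map.insert h payload gb) (Map.lookup h ga)

main :: IO ()
main = do
  let (h, ga) = getHandle "GA.17" "<some_huge_file>" Map.empty
      -- Step 3: only the small handle h travels from A to B in Kepler space.
      gb = execHandle h ga Map.empty
  print h
  print (fmap (Map.lookup h) gb)
```

The payload never passes through A or B, which is what makes third-party transfers between grid nodes possible.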
51. Extensions: Semantic Types
- Take concepts and relationships from an ontology to semantically type the data-in/out ports
- Application: e.g., design support
- smart/semi-automatic wiring, generation of massaging actors
m1 (normalize)
p3
p4
Takes Abundance Count Measurements for Life Stages
Returns Mortality Rate Derived Measurements for Life Stages
54. Semantic Types
- The semantic type signature
- Type expressions over the (OWL) ontology
m1 (normalize)
p3
p4
SemType m1: Observation itemMeasured.AbundanceCount hasContext.appliesTo.LifeStageProperty -> DerivedObservation itemMeasured.MortalityRate hasContext.appliesTo.LifeStageProperty
55. Extended Type System (here: OWL Semantic Types)
SemType m1: Observation itemMeasured.AbundanceCount hasContext.appliesTo.LifeStageProperty -> DerivedObservation itemMeasured.MortalityRate hasContext.appliesTo.LifeStageProperty
Substructure association: XML raw-data =(X)Query=> object model =link=> OWL ontology
56. Semantic Types for Scientific Workflows
57. Deriving Data Transformations from Semantic Service Registration
Bowers-Ludaescher, DILS04
58. Structural and Semantic Mappings
Bowers-Ludaescher, DILS04
59. Workflow Planning as Planning Queries with Limited Access Patterns
- User query Q: answer(ISBN, Author, Title) <-
- book(ISBN, Author, Title),
- catalog(ISBN, Author),
- not library(ISBN).
- Limited (web service) Access Patterns (API)
- Src1.books: in ISBN; out Author, Title
- Src1.books: in Author; out ISBN, Title
- Src2.catalog: in (none); out ISBN, Author
- Src3.library: in (none); out ISBN
- Q is not executable, but feasible (equivalent to executable Q': catalog, book, not library)
- => ICDE (poster), EDBT, PODS (papers), Nash-Ludaescher, 2004
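Why the reordered query is executable can be shown by checking, literal by literal, whether some access pattern has all of its inputs bound. The sketch below only searches for an executable ordering with a greedy strategy; it is not the feasibility test from the cited papers, and the representation and names are assumptions.

```haskell
-- Sketch: ordering query literals so each call respects an access pattern.
-- The representation and the greedy strategy are assumptions made here.
import Data.List (delete, find)

type Var = String

-- A literal names a source, says whether it is negated, and lists its arguments.
data Literal = Literal { source :: String, negated :: Bool, args :: [Var] }
  deriving (Show, Eq)

-- Each pattern lists the argument positions that must be bound on input.
patterns :: [(String, [[Int]])]
patterns =
  [ ("book",    [[0], [1]])   -- in ISBN, or in Author
  , ("catalog", [[]])         -- no inputs required
  , ("library", [[]]) ]       -- no inputs required

-- A literal is callable if some pattern has all its inputs bound;
-- a negated literal additionally needs every argument bound.
callable :: [Var] -> Literal -> Bool
callable bound l =
  any inputsBound (maybe [] id (lookup (source l) patterns))
    && (not (negated l) || all (`elem` bound) (args l))
  where inputsBound = all (\i -> (args l !! i) `elem` bound)

-- Greedily pick any callable literal; Nothing means no executable order was found.
plan :: [Var] -> [Literal] -> Maybe [Literal]
plan _ [] = Just []
plan bound ls = do
  l    <- find (callable bound) ls
  rest <- plan (bound ++ args l) (delete l ls)
  return (l : rest)

main :: IO ()
main = print (fmap (map source) (plan [] query))
  where
    query = [ Literal "book" False ["ISBN", "Author", "Title"]
            , Literal "catalog" False ["ISBN", "Author"]
            , Literal "library" True ["ISBN"] ]
```

Starting with no bound variables, `book` cannot be called first because both of its patterns need an input, so the planner calls `catalog`, then `book`, then the negated `library` check.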
60. Conclusions
- Summary
- Kepler Scientific Workflow System
- Open source, cross-project collaboration (SEEK, GEON, SDM, ...)
- Actor- and Dataflow-oriented Modeling, Design, Execution (Ptolemy II heritage)
- Prototyping, static analysis, web services, data transformations
- Next Steps
- First official release (Kepler-to-Go): April/May 04
- e-Science meeting, NeSC, Edinburgh
- Grid-enabling
- 3rd party transfer, planning, optimization, ...
- Semantic Typing (DILS04)
- Provenance, Fault tolerance, ...
- Link-Up w/ e.g. Taverna, Pegasus, ...
- Become a member or co-developer (You!)