Kepler: Towards a Grid-Enabled System for Scientific Workflows

1
Kepler: Towards a Grid-Enabled System for
Scientific Workflows
Ilkay Altintas, Chad Berkley, Efrat Jaeger,
Matthew Jones, Bertram Ludäscher, Steve Mock
ludaesch@SDSC.EDU
San Diego Supercomputer Center (SDSC)
University of California, San Diego (UCSD)
2
Outline
  • Motivation: Scientific Workflows (SEEK, SDM,
    GEON, ...)
  • Current Features of the Kepler Scientific
    Workflows System
  • Extending Kepler
  • Grid-Enabling Kepler
  • 3rd party transfer
  • WF planning & optimization
  • Shipping and Handling Algebra (SHA)
  • Web Service Composition as Declarative Query
    Plans
  • Semantic Types for Scientific Workflows
  • Conclusions

3
Kepler: Team, Projects, Sponsors
  • Ilkay Altintas SDM
  • Chad Berkley SEEK
  • Shawn Bowers SEEK
  • Jeffrey Grethe BIRN
  • Christopher H. Brooks Ptolemy II
  • Zhengang Cheng SDM
  • Efrat Jaeger GEON
  • Matt Jones SEEK
  • Edward A. Lee Ptolemy II
  • Kai Lin GEON
  • Bertram Ludäscher BIRN, GEON, SDM, SEEK
  • Steve Mock NMI
  • Steve Neuendorffer Ptolemy II
  • Jing Tao SEEK
  • Mladen Vouk SDM
  • Yang Zhao Ptolemy II

Ptolemy II
4
Example: SEEK, Science Environment for
Ecological Knowledge (large NSF ITR)
  • Analysis & Modeling System
      • Design and execution of ecological models and
        analysis
      • End user focus
      • application-/upperware
  • Semantic Mediation System
      • Data integration of hard-to-relate sources and
        processes
      • Semantic Types and Ontologies
      • upper middleware
  • EcoGrid
      • Access to ecology data and tools
      • middle-/underware

Architecture Overview (cf. Cyberinfrastructure)
5
Ecology: GARP Analysis Pipeline for Invasive
Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
6
Genomics Example: Promoter Identification
Workflow (PIW)
Source: Matt Coleman (LLNL)
7
Source: NIH BIRN (Jeffrey Grethe, UCSD)
8
Scientific Workflows: Some Findings
  • More dataflow than (business control-/) workflow
      • DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna,
        Triana, ...
  • Need for programming extensions
      • Iterations over lists (foreach); filtering;
        functional composition; generic & higher-order
        operations (zip, map(f), ...)
  • Need for abstraction and nested workflows
  • Need for data transformations (WS1 → DT → WS2)
  • Need for rich user interaction & workflow
    steering
      • pause / revise / resume
      • select & branch; e.g., web browser capability at
        specific steps as part of a coordinated SWF
  • Need for high-throughput transfers
    (grid-enabling, streaming)
  • Need for persistence of intermediate products and
    provenance

9
Scientific Workflows vs Business Workflows
  • Scientific Workflows
      • Dataflow and data transformations
      • Data problems: volume, complexity, heterogeneity
      • Grid aspects: distributed computation,
        distributed data
      • User interactions/WF steering
      • Data, tool, and analysis integration
      • ⇒ Dataflow and control-flow are married!
  • Business Workflows (BPEL4WS, ...)
      • Task-orientation: travel reservations; credit
        approval; BPM; ...
      • Tasks, documents, etc. undergo modifications
        (e.g., flight reservation from reserved to
        ticketed), but modified WF objects are still
        identifiable throughout
      • Complex control flow, complex process composition
        (danger of control flow/dataflow spaghetti)
      • ⇒ Dataflow and control-flow are divorced!

10
In a Flux: Workflow Standards
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
11
Commercial & Open Source Scientific Workflow
(well, Dataflow) Systems
Kensington Discovery Edition from InforSense
Triana
Taverna
12
SCIRun: Problem Solving Environments for
Large-Scale Scientific Computing
  • SCIRun: PSE for interactive construction,
    debugging, and steering of large-scale scientific
    computations
  • New collaboration under Kepler/SDM
  • Component model, based on generalized dataflow
    programming

Steve Parker (cs.utah.edu)
13
Our Starting Point: Ptolemy II & Dataflow Process
Networks
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
14
Why Ptolemy II?
  • Ptolemy II Objective
      • "The focus is on assembly of concurrent
        components. The key underlying principle in the
        project is the use of well-defined models of
        computation that govern the interaction between
        components. A major problem area being addressed
        is the use of heterogeneous mixtures of models of
        computation."
  • Data & Process oriented: Dataflow process
    networks
  • Natural Data Streaming Support
  • User-Orientation
      • (application-ware, not middle-/under-ware)
      • Workflow design & exec console (Vergil GUI)
  • PRAGMATICS
      • mature, actively maintained, well-documented
        (500pp)
      • open source system
      • developed across multiple projects (NSF/ITRs SEEK
        and GEON, DOE SciDAC SDM, ...)
      • hoping to leverage e-sister projects (e.g.
        Taverna, ...)

15
Dataflow Process Networks: Putting Computation
Models (Orchestration) first!
  • Synchronous Dataflow Network (SDF)
      • Statically schedulable single-threaded dataflow
      • Can execute multi-threaded, but the
        firing sequence is known in advance
      • Maximally well-behaved, but also limited
        expressiveness
  • Process Network (PN)
      • Multi-threaded dynamically scheduled dataflow
      • More expressive than SDF (dynamic token rate
        prevents static scheduling)
      • Natural streaming model
  • Other Execution Models (Domains)
      • Implemented through different Directors

advanced push/pull
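
The claim that an SDF firing sequence is known in advance can be illustrated with the standard SDF balance equations: for every connection, the tokens produced per schedule period must equal the tokens consumed. The sketch below is not Ptolemy II code; the function name, the edge format, and the three-actor chain are assumptions made up for illustration.

```python
from collections import deque
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    edges holds (src, dst, produced, consumed) token rates per firing.
    Assumes the graph is connected. Returns the smallest positive integer
    firing counts, or None when the rates are inconsistent.
    """
    neighbours = {actor: [] for actor in actors}
    for src, dst, produced, consumed in edges:
        neighbours[src].append((dst, Fraction(produced, consumed)))
        neighbours[dst].append((src, Fraction(consumed, produced)))

    rate = {actors[0]: Fraction(1)}  # relative firing rates, first actor as unit
    queue = deque([actors[0]])
    while queue:
        actor = queue.popleft()
        for other, ratio in neighbours[actor]:
            implied = rate[actor] * ratio
            if other not in rate:
                rate[other] = implied
                queue.append(other)
            elif rate[other] != implied:
                return None  # rates disagree: no periodic static schedule

    # Scale the fractions to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


# Hypothetical chain: A produces 2 tokens per firing and B consumes 3;
# B produces 1 token per firing and C consumes 2.
chain = [("A", "B", 2, 3), ("B", "C", 1, 2)]
schedule_counts = repetition_vector(["A", "B", "C"], chain)
```

With fixed rates the counts (here 3 firings of A, 2 of B, 1 of C per period) can be computed before execution. A PN actor whose token rate depends on the data has no such fixed entry in `edges`, which is why it needs dynamic scheduling.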
16
Actor-/Dataflow Orientation vs Object-/Control-flow
Orientation
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
17
Marrying or Divorcing Control- & Dataflow
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
18
Overview: Scientific Workflows in Kepler
  • Modeling and Workflow Design
  • Web services = individual components (actors)
  • Minute-Made Application Integration
      • Plugging in and harvesting web service components
        is easy and fast
  • Rich SWF modeling semantics (directors)
      • Different and precise dataflow models of
        computation
      • Clear and composable component interaction
        semantics
  • ⇒ Web service composition and application
    integration tool
  • Coming soon
      • Shrink-wrapped, pre-packaged Kepler-to-Go
      • Structural and semantic typing (better design
        support)
      • Grid-enabled web services (for big data, big
        computations, ...)
      • Different deployment models (web service, web
        site, applet, ...)

19
The KEPLER GUI: Vergil (Steve Neuendorffer,
Ptolemy II)
Drag-and-drop utilities, director and actor
libraries.
20
Running a Genomics WF (Ilkay Altintas, SDM)
21
Support for Multiple Workflow Granularities
Boulders
Plumbing
Powder
Abstraction: Sand to Rocks
Sand
22
Directors and Combining Different Component
Interaction Semantics
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
23
Application Examples: Mineral Classification with
Kepler (Efrat Jaeger, GEON)
24
Inside the Classifier
25
Standard BrowserUI: Client-Side SVG
26
SWF Reengineering (Ashraf, Efrat, Kai, GEON)
27
DataMapper Sub-Workflow
28
Result launched via BrowserUI actor (coupling
with ESRI's ArcIMS)
29
Distributed Workflows in KEPLER
  • Web and Grid Service plug-ins
      • WSDL (now) and Grid services (stay tuned ...)
      • ProxyInit, GlobusGridJob, GridFTP,
        DataAccessWizard
      • SSH, SCP, SDSC SRB, OGS?-??? ... coming
  • WS Harvester
      • Imports query-defined WS operations as Kepler
        actors
  • XSLT and XQuery Data Transformers
      • to link web services that were not designed to
        fit together
  • WS-deployment interface (planned)

30
Generic Web Service Actor (Ilkay Altintas)
  • Given a WSDL and the name of an operation of a
    web service, dynamically customizes itself to
    implement and execute that method.

31
Set Parameters and Commit
32
Specialized WS Actor (after instantiation)
33
Web Service Harvester (Ilkay Altintas, SDM)
  • Imports the web services in a repository into
    the actor library.
  • Has the capability to search for web services
    based on a keyword.

34
Composing 3rd-Party WSs (NMI, Steve Mock)
Input of next web service
User interaction & Transformations
35
A Special Generic Ingestion Actor for EML Data
(SEEK, Chad Berkley)
  • Ingests any data format described by EML metadata
  • Converts raw data to Ptolemy format
  • Data can then be operated on with other actors

36
Wrapping Legacy Applications
37
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
38
Execution Semantics
Promoter Identification Workflow in
Ptolemy-II [SSDBM'03]
39
hand-crafted control solution also forces
sequential execution!
No data transformations available
Complex backward control-flow
40
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> TFBS
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
41
Cleaned up Process Network PIW
  • Back to purely functional dataflow process
    network
      • (= also a data streaming model!)
  • Re-introducing map(f) to Ptolemy-II (was there in
    PT Classic)
      • no control-flow spaghetti
      • data-intensive apps
      • free concurrent execution
      • free type checking
      • automatic support to go from piw(GeneId) to
        PIW = map(piw) over [GeneId]

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
42
Optimization by Declarative Rewriting I
  • PIW as a declarative, referentially transparent
    functional process
      • optimization via functional rewriting possible
      • e.g. map(f o g) = map(f) o map(g)
  • Technical report: PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
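
The rewriting rule on this slide, map(f o g) in place of map(f) o map(g), can be checked on a small example. This is a minimal sketch in Python rather than the Haskell of the technical report. The two per-item functions are made-up stand-ins, and the equality only holds when both are pure (referentially transparent), as the slide requires.

```python
def compose(f, g):
    """Return f o g, i.e. the function x -> f(g(x))."""
    return lambda x: f(g(x))


def map_then_map(f, g, items):
    """map(f) o map(g): two passes with an intermediate list."""
    intermediate = [g(x) for x in items]
    return [f(y) for y in intermediate]


def fused_map(f, g, items):
    """map(f o g): a single pass, no intermediate list."""
    h = compose(f, g)
    return [h(x) for x in items]


# Hypothetical pure stand-ins for two per-item workflow steps.
def promoter_region(sequence):
    return sequence[:4]


def count_ta(region):
    return region.count("TA")


sequences = ["TATAAGGC", "GGCCTATA", "TTTT"]
two_pass = map_then_map(count_ta, promoter_region, sequences)
one_pass = fused_map(count_ta, promoter_region, sequences)
```

Both calls return the same list. The fused form avoids materialising the intermediate list, which matters when each item is a large data product.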
43
Optimizing II: Streams & Pipelines
Source: Real-Time Signal Processing: Dataflow,
Visual, and Functional Programming, Hideki John
Reekie, University of Technology, Sydney
  • Clean functional semantics facilitates algebraic
    workflow (program) transformations
    (Bird-Meertens), e.g.
    mapS f . mapS g ⇒ mapS (f . g)

44
Middle/Underware Access: Querying Databases
  • Database connection actor
      • Opens a database connection and passes it to
        all actors accessing this database.
  • Database query actor
      • A generic actor that queries a database and
        provides its result.
  • DBConnection type and DBConnectionToken
      • A new IOPort type and a token to distinguish a
        database connection from any general type.

45
Database Connection Actor
  • OpenDBConnection actor
      • Input: database connection information
      • Output: DBConnectionToken (reference to a DB
        connection instance, via a DBConnection output
        port)

46
Database Query Actor
  • Database Query actor
      • Input: SQL query string and a DB connection token
      • Parameters: output type (XML, Record, or String);
        tuple-at-a-time vs set-at-a-time
      • Process: execute query; produce results according
        to parameters
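
The division of labour between the two database actors can be sketched as two plain functions. This is an illustration only, not the Kepler actor code: it uses Python's built-in sqlite3 module in place of whatever driver the actors use, and the function names, the token class, and the sample table are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class DBConnectionToken:
    """Token that carries a reference to an open connection (stand-in)."""
    connection: sqlite3.Connection


def open_db_connection(database):
    """OpenDBConnection: connection information in, DBConnectionToken out."""
    return DBConnectionToken(sqlite3.connect(database))


def database_query(token, sql, output_type="Record", tuple_at_a_time=False):
    """Database Query: SQL string and connection token in, results out."""
    cursor = token.connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    formatters = {
        "Record": lambda row: dict(zip(columns, row)),
        "String": lambda row: ", ".join(str(value) for value in row),
        "XML": lambda row: "<row>" + "".join(
            f"<{col}>{escape(str(value))}</{col}>"
            for col, value in zip(columns, row)) + "</row>",
    }
    fmt = formatters[output_type]
    rows = (fmt(row) for row in cursor)
    # tuple-at-a-time streams rows lazily; set-at-a-time returns them all.
    return rows if tuple_at_a_time else list(rows)


# One connection token is opened once and shared by every query below.
token = open_db_connection(":memory:")
token.connection.execute("CREATE TABLE mineral (name TEXT, sio2 REAL)")
token.connection.execute("INSERT INTO mineral VALUES ('quartz', 100.0)")
records = database_query(token, "SELECT name, sio2 FROM mineral")
as_xml = database_query(token, "SELECT name, sio2 FROM mineral", "XML")
stream = database_query(token, "SELECT name FROM mineral", "String",
                        tuple_at_a_time=True)
```

Passing a token instead of connection details means downstream query steps never reopen the database, which is the point of giving the connection its own port type.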

47
Querying Example
48
An (oversimplified) Model of the Grid
  • Hosts: h1, h2, h3, ...
  • Data@Hosts: d1@hi, d2@hj, ...
  • Functions@Hosts: f1@hi, f2@hj, ...
  • Given: data/workflow
      • as a functional plan: Y = f(X); Z = g(Y)
      • as a logic plan: f(X,Y) ∧ g(Y,Z)
  • Find: host assignment di → hi, fj → hj
    for all di, fj
      • s.t. d3@h3 = f@h2(d1@h1), ... is a valid
        plan

49
Shipping and Handling Algebra (SHA)
Logical view
  • plan Y@C = F@A of X@B

Physical view: SHA Plans
  • (1) X@B to A, Y@A = F@A(X@A), Y@A to C
  • (2) F@A => B, Y@B = F@B(X@B), Y@B to C
  • (3) X@B to C, F@A => C, Y@C = F@C(X@C)
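
The three physical plans on this slide correspond to the three possible execution hosts for F. The sketch below enumerates them and ranks them by bytes shipped. It is not the SHA formalism itself: the step names, the cost model, and the sizes are assumptions chosen for illustration.

```python
def sha_plans(data_host, func_host, result_host):
    """Build one physical plan per candidate execution host.

    A plan is a list of steps: ("ship_data", src, dst),
    ("ship_func", src, dst), ("apply", host), ("ship_result", src, dst).
    """
    plans = {}
    for host in dict.fromkeys([func_host, data_host, result_host]):
        steps = []
        if data_host != host:
            steps.append(("ship_data", data_host, host))
        if func_host != host:
            steps.append(("ship_func", func_host, host))
        steps.append(("apply", host))
        if result_host != host:
            steps.append(("ship_result", host, result_host))
        plans[host] = steps
    return plans


def plan_cost(steps, sizes):
    """Total units moved; steps without an entry in sizes cost nothing."""
    return sum(sizes.get(step[0], 0) for step in steps)


# Logical plan from the slide: X lives at B, F at A, and Y is wanted at C.
plans = sha_plans(data_host="B", func_host="A", result_host="C")

# Made-up sizes: a huge input, a tiny function, a small result.
sizes = {"ship_data": 1000, "ship_func": 1, "ship_result": 10}
costs = {host: plan_cost(steps, sizes) for host, steps in plans.items()}
cheapest_host = min(costs, key=costs.get)
```

With these sizes, shipping the function to the data (executing at B) is cheapest. Different sizes change the winner, which is why the host assignment is treated as a planning problem.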
50
Grid-Enabling PTII: Handles
  • (1) A → GA: get_handle
  • (2) GA → A: return &X
  • (3) A → B: send &X
  • (4) B → GB: request &X
  • (5) GB → GA: request &X
  • (6) GA → GB: send X
  • (7) GB → B: send done(&X)
  • Example
      • &X = GA.17
      • X = <some_huge_file>
  • Candidate Formalisms
      • GridFTP
      • SSH, SCP
      • SDSC SRB
      • OGS?-??? WSRF?

Logical token transfer (3) requires
get_handle (1,2), then exec_handle (4,5,6,7) for
completion.
[Figure: actors A and B in Kepler space and grid hosts GA and GB in Grid space, with arrows numbered 1-7 for the message steps]
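
The seven messages of the handle protocol can be traced with a toy simulation. This is only a sketch of the idea that a small handle travels through Kepler space while the data moves between grid hosts. The class, the handle format, and the trace layout are hypothetical, and no real grid middleware is involved.

```python
class GridHost:
    """A grid-space store that hands out handles instead of data."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self._next_id = 0

    def put(self, data):
        """Register data and return a small handle such as 'GA.1'."""
        self._next_id += 1
        handle = f"{self.name}.{self._next_id}"
        self.store[handle] = data
        return handle


def logical_send(data, ga, gb):
    """Run steps 1-7; return the handle B ends up with and the trace."""
    trace = [("A", ga.name, "get_handle")]                  # step 1
    handle = ga.put(data)
    trace.append((ga.name, "A", f"return {handle}"))        # step 2
    trace.append(("A", "B", f"send {handle}"))              # step 3
    trace.append(("B", gb.name, f"request {handle}"))       # step 4
    trace.append((gb.name, ga.name, f"request {handle}"))   # step 5
    gb.store[handle] = ga.store[handle]                     # step 6: data moves
    trace.append((ga.name, gb.name, "send data"))
    trace.append((gb.name, "B", f"done {handle}"))          # step 7
    return handle, trace


ga, gb = GridHost("GA"), GridHost("GB")
huge_file = b"x" * 1_000_000  # placeholder for some huge file
handle, trace = logical_send(huge_file, ga, gb)
```

Only step 6 touches the payload, and it runs host to host. Every message involving A or B carries just the handle string.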
51
Extensions: Semantic Types
  • Take concepts and relationships from an ontology
    to semantically type the data-in/out ports
  • Application: e.g., design support
      • smart/semi-automatic wiring, generation of
        massaging actors

m1 (normalize)
p3
p4
Takes: Abundance Count Measurements for Life
Stages
Returns: Mortality Rate Derived Measurements for
Life Stages
54
Semantic Types
  • The semantic type signature
  • Type expressions over the (OWL) ontology

m1 (normalize)
p3
p4
SemType m1: Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty ->
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
55
Extended Type System (here: OWL Semantic Types)
SemType m1: Observation
itemMeasured.AbundanceCount
hasContext.appliesTo.LifeStageProperty ->
DerivedObservation itemMeasured.MortalityRate
hasContext.appliesTo.LifeStageProperty
Substructure association: XML raw-data
=(X)Query=> object model =link=> OWL ontology
56
Semantic Types for Scientific Workflows
57
Deriving Data Transformations from Semantic
Service Registration
[Bowers-Ludaescher, DILS'04]
58
Structural and Semantic Mappings
[Bowers-Ludaescher, DILS'04]
59
Workflow Planning as Planning Queries with
Limited Access Patterns
  • User query Q: answer(ISBN, Author, Title) ←
      • book(ISBN, Author, Title),
      • catalog(ISBN, Author),
      • not library(ISBN).
  • Limited (web service) Access Patterns (API)
      • Src1.books: in ISBN, out Author, Title
      • Src1.books: in Author, out ISBN, Title
      • Src2.catalog: in (none), out ISBN, Author
      • Src3.library: in (none), out ISBN
  • Q is not executable, but feasible (equivalent to
    executable Q': catalog, book, not library)
  • ⇒ ICDE (poster), EDBT, PODS (papers),
    [Nash-Ludaescher, 2004]
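
The step from the non-executable Q to the executable ordering (catalog, book, not library) can be sketched as a greedy reordering: call a source only once one of its access patterns has all inputs bound. This covers only the reordering case and is not the feasibility test from the cited papers. The data layout and function name are assumptions for illustration.

```python
def order_literals(literals, patterns):
    """Greedily order query literals so every call has its inputs bound.

    literals: list of (relation, variables, negated)
    patterns: relation -> list of tuples of input positions
    Returns an executable order, or None if the greedy pass gets stuck.
    """
    bound, plan, pending = set(), [], list(literals)
    while pending:
        for literal in pending:
            relation, variables, negated = literal
            if negated:
                # A negated literal only filters, so all its variables
                # must already be bound by earlier calls.
                ready = set(variables) <= bound
            else:
                ready = any(all(variables[i] in bound for i in inputs)
                            for inputs in patterns[relation])
            if ready:
                plan.append(f"not {relation}" if negated else relation)
                bound |= set(variables)
                pending.remove(literal)
                break
        else:
            return None  # no remaining literal can be called
    return plan


# The query and access patterns from the slide, positions counted from 0.
query = [
    ("book", ("ISBN", "Author", "Title"), False),
    ("catalog", ("ISBN", "Author"), False),
    ("library", ("ISBN",), True),
]
patterns = {
    "book": [(0,), (1,)],  # in ISBN, or in Author
    "catalog": [()],       # no inputs required
    "library": [()],       # no inputs required
}
executable_order = order_literals(query, patterns)
```

As written, Q starts with book, which needs ISBN or Author as input and so cannot be called first. Calling catalog first binds both, after which book and the library filter can run.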

60
Conclusions
  • Summary
      • Kepler Scientific Workflow System
      • Open source, cross-project collaboration (SEEK,
        GEON, SDM, ...)
      • Actor- & Dataflow-oriented Modeling, Design,
        Execution (Ptolemy II heritage)
      • Prototyping, static analysis, web services, data
        transformations
  • Next Steps
      • First official release (Kepler-to-Go): April/May
        '04
      • e-Science meeting: NeSC, Edinburgh
      • Grid-enabling: 3rd party transfer, planning,
        optimization, ...
      • Semantic Typing [DILS'04]
      • Provenance, fault tolerance, ...
      • Link-up w/ e.g. Taverna, Pegasus, ...
      • Become a member or co-developer (You!)