On Accelerating Scientific Discovery using Scientific Workflows

Slides: 74
Provided by: shann84
Learn more at: http://www.geongrid.org
Transcript and Presenter's Notes

Title: On Accelerating Scientific Discovery using Scientific Workflows


1
On Accelerating Scientific Discovery using
Scientific Workflows -- Kepler Scientific
Workflow System --
  • Ilkay ALTINTAS
  • Lab Director, Scientific Workflow Automation
    Technologies (SWAT)
  • Deputy Coordinator, Research, San Diego
    Supercomputer Center (SDSC)
  • University of California, San Diego
  • and
  • University of Amsterdam

2
Scientific Workflows
  • SCIENTIFIC -> regulated by or conforming to the
    principles of exact science
  • Science is best defined as a careful,
    disciplined, logical search for knowledge about
    any and all aspects of the universe, obtained by
    examination of the best available evidence and
    always subject to correction and improvement upon
    discovery of better evidence. What's left is
    magic. And it doesn't work.
  • -- James Randi, magician and scientific
    skeptic

3
Traditional Scientific Method
  • Scientific Method

Wikipedia definition: Scientific method refers
to the body of techniques for investigating
phenomena, acquiring new knowledge, or correcting
and integrating previous knowledge. It is based
on gathering observable, empirical and measurable
evidence subject to specific principles of
reasoning. A scientific method consists of the
collection of data through observation and
experimentation, and the formulation and testing
of hypotheses. Among other facets shared by the
various fields of inquiry is the conviction that
the process be objective to reduce a biased
interpretation of the results. Another basic
expectation is to document, archive and share all
data and methodology so they are available for
careful scrutiny by other scientists, thereby
allowing other researchers the opportunity to
verify results by attempting to reproduce them.
This practice, called full disclosure, also
allows statistical measures of the reliability of
these data to be established.
Scientific method is the most generic scientific
workflow to solve a problem end-to-end!
Source: http://www.sciencebuddies.org/mentoring/project_scientific_method.shtml
  • Underlies the scientific revolution
  • Has been used for a thousand years

4
Is Applied Scientific Computing the Second
Scientific Revolution?
Microscopes, telescopes, particle accelerators,
X-rays, MRIs, microarrays, satellite-based
sensors, sensor networks, field studies
Collect Data, Observe
Potentially large computation
Data, modeling and computational challenges at
each step mostly automatable!
Simulation, Prediction Model execution
Visualization, Post-processing
Analysis, Prediction Model execution
Papers, Related Data and Workflows Information
on Data and Process Products
Publication
!!! Scientific Workflows to model and glue one or
more steps in the Scientific Method !!!
5
Scientific workflows emerged as an answer to the
need to combine multiple Cyberinfrastructure
components in automated process networks.
So, what is a scientific workflow?
6
The Big Picture: Supporting the Scientist
From Napkin Drawings
to Executable
Workflows
Source: Mladen Vouk (NCSU)
Conceptual SWF
Executable SWF
Here: John Blondin, NC State Astrophysics, Terascale
Supernova Initiative (SciDAC, DOE)
7
Phylogeny Analysis Workflows
8
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
9
SWF Systems Requirements
  • Design tools-- especially for non-expert users
  • Ease of use-- fairly simple user interface having
    more complex features hidden in the background
  • Reusable generic features
  • Generic enough to serve to different communities
    but specific enough to serve one domain (e.g.
    geosciences) -> customizable
  • Extensibility for the expert user
  • Registration, publication and provenance of data
    products and process products (workflows)
  • Dynamic plug-in of data and processes from
    registries/repositories
  • Distributed WF execution (e.g. Web and Grid
    awareness)
  • Semantics awareness
  • WF Deployment
  • as a web site, as a web service, Power apps (a
    la SciRUN II)
  • Interoperability with other SWF systems

10
Scientific Workflow Systems
  • Combination of
  • data integration, analysis, and visualization
    steps
  • automated "scientific process"
  • Mission of scientific workflow systems
  • Promote scientific discovery by providing tools
    and methods to generate scientific workflows
  • Create an extensible and customizable graphical
    user interface for scientists from different
    scientific domains
  • Support computational experiment creation,
    execution, sharing, reuse and provenance
  • Design frameworks which define efficient ways to
    connect to the existing data and integrate
    heterogeneous data from multiple resources
  • Make technology useful through the user's monitor!!!

11
Kepler is a Scientific Workflow System
www.kepler-project.org
  • and a cross-project collaboration
  • initiated August 2003
  • 1st release May 13, 2008
  • Builds upon the open-source Ptolemy II framework

12
Kepler is a Team Effort
  • Some CI projects using Kepler
  • SEEK (ecology)
  • SciDAC (molecular bio, astrophysics, ...)
  • CPES (plasma simulation, combustion)
  • GEON (geosciences)
  • CiPRes (phylogenetics)
  • ROADnet (real-time data)
  • Processing Phylodata (pPOD)
  • REAP (streaming data)
  • Digital preservation (DIGARCH)
  • COMET (environmental science)
  • OOI CI - ORION (ocean observing CI)
  • LOOKING (oceanography)
  • CAMERA (metagenomics)
  • Resurgence (computational chemistry)
  • ChIP-chip (genomics)
  • Cheshire Digital Library (archival)
  • Cell Biology (Scripps)
  • DART (X-Ray crystallography)
  • Source code access
  • 154 people accessed source code
  • 30 members have write permission

Kepler downloads: Total 9204, Beta 6675
(chart legend: red = Windows, blue = Macintosh)
Source: Matt Jones, NCEAS
13
Kepler Software Development Practice
  • How does this all work?
  • Joint CVS -- special rules!
  • Projects like SDM, Cipres, Resurgence have their
    specialized releases out of a common
    infrastructure!
  • Open-source (BSD)
  • Website and Wiki -- http://kepler-project.org
  • Communications
  • Busy IRC channel
  • Mailing lists: Kepler-dev, Kepler-users,
    Kepler-members
  • Telecons for design discussions
  • 6-monthly hackathons
  • Focus group meetings, workshops and conference
    calls
  • How will it all persist?

14
How will it all persist?
  • Development of Kepler C.O.R.E. -- A
    Comprehensive, Open, Robust, and Extensible
    Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips,
    Schildhauer
  • Extensibility, Governance, Sustainability
  • Goals
  • Reliable
  • refactored build
  • more modular design
  • improved engineering practices
  • Independently extensible
  • Open architecture, open project, with improved
    governance!

15
First Kepler Stakeholders Meeting
  • Organized by Kepler/CORE
  • May 13-15, 2008
  • 35 stakeholders from 5 countries and 25 projects
  • Big move towards web execution environments
    through virtual laboratories
  • Different software projects building upon Kepler
  • Hydrant (Australia), KFlex (Germany), Nimrod/K
    (Australia)
  • Short-term goals in addition to a growing set of
    tools for data processing components
  • Enable extension points for
  • A customizable workflow authoring user interface
    (Application Web)
  • Moving beyond the desktop environment for the
    full scientific process
  • Provenance tracking for workflow design and
    execution
  • Extensible access to multiple data repositories
  • Distributed execution of workflows and
    computational experiments
  • Social networks to share and build workflows

16
So, what is in Kepler?
17
Actors are the Processing Components
  • Actor
  • Encapsulation of parameterized actions
  • Interface defined by ports and parameters
  • Port
  • Communication between input and output data
  • Without call-return semantics
  • Model of computation
  • Communication semantics among ports
  • Flow of control
  • Implementation is a framework
  • Examples
  • Simulink (The MathWorks)
  • LabVIEW (from National Instruments)
  • Easy 5x (from Boeing)
  • ROOM (Real-time object-oriented modeling)
  • ADL (Wright)

Actor-Oriented Design
Source: Edward A. Lee, UC Berkeley
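The actor idea on this slide (parameterized actions, an interface made of ports and parameters, communication without call-return semantics) can be illustrated with a minimal Python sketch. This is not Kepler or Ptolemy II code; the `Port` and `Scale` classes and the `factor` parameter are hypothetical names chosen only for illustration.

```python
from collections import deque


class Port:
    """A FIFO channel endpoint; senders never wait for a return value."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def send(self, token):
        self.queue.append(token)

    def has_token(self):
        return bool(self.queue)

    def get(self):
        return self.queue.popleft()


class Scale:
    """Hypothetical actor: multiplies each input token by a `factor` parameter."""

    def __init__(self, factor):
        self.factor = factor          # parameter
        self.input = Port("input")    # input port
        self.output = Port("output")  # output port

    def fire(self):
        # Consume one token and produce one token; nothing is returned to a caller.
        self.output.send(self.input.get() * self.factor)


scale = Scale(factor=3)
scale.input.send(2)
scale.input.send(5)
while scale.input.has_token():
    scale.fire()

results = list(scale.output.queue)
print(results)
```

The point of the sketch is that the actor only sees its ports and parameters; who decides when `fire()` is called is left to a separate scheduling component.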
18
Some actors in place for
  • Generic Web Service Client and Web Service
    Harvester
  • Customizable RDBMS query and update
  • Command Line wrapper tools (local, ssh, scp,
    ftp, etc.)
  • Some Grid actors: Globus Job Runner,
    GridFTP-based file access, Proxy Certificate
    Generator
  • SRB support
  • Native R and Matlab support
  • Interaction with Nimrod and APST
  • Communication with ORBs through actors and
    services
  • Imaging, Gridding, Vis Support
  • Textual and Graphical Output
  • more generic and domain-oriented actors

19
Directors are the WF Engines that
  • Implement different computational models
  • Define the semantics of
  • execution of actors and workflows
  • interactions between actors
  • Ptolemy and Kepler are unique in combining
    different execution models in heterogeneous
    models!
  • Kepler is extending Ptolemy directors with
    specialized ones for web service based workflows
    and distributed workflows.
  • Process Networks
  • Rendezvous
  • Publish and Subscribe
  • Continuous Time
  • Finite State Machines
  • Dataflow
  • Time Triggered
  • Synchronous/reactive model
  • Discrete Event
  • Wireless
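As a rough illustration of how a director, rather than the actors themselves, defines execution semantics, here is a small Python sketch of a dataflow-style scheduler. It is an assumption-laden toy and not the Ptolemy II or Kepler director API; `Actor`, `DataflowDirector`, and the firing rule ("fire any actor that has input") are hypothetical.

```python
from collections import deque


class Actor:
    """Hypothetical actor: applies `func` to one input token per firing."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inbox = deque()
        self.downstream = []  # actors fed by this actor's output

    def connect(self, other):
        self.downstream.append(other)

    def fire(self):
        token = self.func(self.inbox.popleft())
        for actor in self.downstream:
            actor.inbox.append(token)
        return token


class DataflowDirector:
    """Fires any actor that has input available until no tokens remain."""

    def __init__(self, actors):
        self.actors = actors
        self.trace = []  # (actor name, produced token) in firing order

    def run(self):
        progress = True
        while progress:
            progress = False
            for actor in self.actors:
                if actor.inbox:
                    self.trace.append((actor.name, actor.fire()))
                    progress = True
        return self.trace


double = Actor("double", lambda x: 2 * x)
increment = Actor("increment", lambda x: x + 1)
double.connect(increment)
double.inbox.extend([1, 2])

trace = DataflowDirector([double, increment]).run()
print(trace)
```

Swapping in a different director class (for example one that fires actors on a time schedule) would change the order of execution without touching the actors, which is the separation this slide describes.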

20
Vergil is the GUI for Kepler
Data Search
Actor Search
  • Actor ontology and semantic search for actors
  • Search -> Drag and drop -> Link via ports
  • Metadata-based search for datasets

21
Actor Search
  • Kepler Actor Ontology
  • Used in searching actors and creating conceptual
    views (folders)
  • Currently more than 200 Kepler actors added!

22
Data Search and Usage of Results
  • EarthGrid
  • Discovery of data resources through local and
    remote services
  • SRB,
  • Grid and Web Services,
  • Db connections
  • Registry of datasets on the fly using workflows

23
Current Advances and Users
  • Data and Actor search
  • EarthGrid data access system
  • Kepler Component Library
  • Kepler Archive (KAR) format
  • Integrated support for LSID identifiers for all
    objects
  • Object Manager and cache
  • Web service execution
  • RExpression and MatlabExpression
  • Redesigned user interface
  • Authentication subsystem
  • Null-value handling
  • Documentation
  • Semantics support
  • annotation, search, workflow validation,
    integration
  • Collection-oriented workflows
  • Domain-specific actors for case studies
  • Provenance framework
  • Grid computing support
  • NIMROD, Globus, ssh, ...
  • Kepler Users
  • User interface users
  • Workflow developers
  • Scientists
  • Software Developers
  • Engineers
  • Researchers
  • Batch users
  • Portals
  • Other workflow systems as an engine

24
Kepler System Architecture
Authentication
GUI extensions
Vergil
Documentation
Provenance Framework
Kepler Object Manager
Smart Re-run / Failure Recovery
SMS
Type System Ext
Actor and Data SEARCH
Kepler Core Extensions
Ptolemy
25
Kepler can be used as a batch execution engine
  • Configuration phase
  • Subset: DB2 query on DataStar

Portal
Monitoring/ Translation
Subset
  • Interpolate: Grass RST, Grass IDW, GMT
  • Visualize: Global Mapper, FlederMaus, ArcIMS

Scheduling/ Output Processing
Grid
26
So, show me an example CI project that uses
Kepler?
27
CI Project: REAP
  • Management and Analysis of Observatory Data using
    Kepler Scientific Workflows
  • The vision
  • An integrated environment for analyzing data from
    observatories

reap.ecoinformatics.org
  • Funded 2006-2009
  • NSF CEOP
  • Jones(PI), Altintas, Baru, Ludaescher,
    Schildhauer
  • Partners
  • NCEAS/UCSB (Lead), SDSC/UCSD, UCDavis, CENS/UCLA,
    OpenDAP, OSU
  • Two scientific use cases
  • Terrestrial ecology
  • Oceanography

28
REAP Views -- End-to-End CI for Observatories --
  • For scientists
  • capabilities for designing and executing complex
    analytical models over near real-time and
    archived data sources
  • For data-grid engineers
  • monitoring and management capabilities of
    underlying sensor networks
  • For outside users
  • access to observatory data and results of models,
    approachable to non-scientists.

29
REAP Terrestrial Ecology Usecase
Workflows to develop and test models exploring
the impacts of abiotic factors (real-time light,
temperature, and rainfall measurements) on the
dynamics of plant host populations and their
susceptibility to viral pathogens.
30
REAP RBNB Streaming Data Actor
Example data from the Terrestrial Use Case. Hardware:
a Campbell Scientific CR800 datalogger with
eight attached sensors, operating on a workbench.
31
REAP Oceanographic Usecase
Facilitate the quantitative evaluation of SST
data sets.
32
Sea Surface Temperature Workflow
33
(No Transcript)
34
(No Transcript)
35
CI Project: GEON -- Some Workflow Features --
  • Support for High Performance Computations
  • Job submission and monitoring
  • Logging of execution trace and registering
    intermediate products
  • Data provenance and failure recovery
  • Portal accessibility
  • Deployment of workflows to the GEON portal
  • Harvesting data and tools from repositories
  • Direct access to data and tools registered to the
    GEON portal
  • Access to GEON Web Services
  • Storage Resource Broker (SRB)

36
GEON Workflow Examples
37
GEON Mineral Classification Workflow
An early example: classification for naming
igneous rocks.
38
Integration Scenario: A-type query
  • Classifying A-types from an Igneous rock database
  • Integrating between Relational and Spatial
    (shapefiles) databases to query and interactively
    display GIS results
  • Reusing existing and generic Kepler components
    (Classifier, JDBC)

Ghulam Memon, Ashraf Memon
39
Beach Balls Workflow
  • GOAL: Integrate seismic focal mechanisms with
    image services

40
Beach Balls Workflow Output
41
Gravity Modeling Workflow
Observed Gravity
Topography
Pluton map
Sediments
Moho
Output
Residual Map
Difference calculator
Densities
Source (GEON): Dogan Seber, Randy Keller
Interactive 3D model: defining possible depth
distribution of plutons
42
Kepler as a Modeling Tool: Gravity Modeling
Workflow
  • Comparing between synthetic and observed gravity
    models of heterogeneous data sources. Creating a
    residual map of the difference using ESRI
    services and displaying it on a web browser
  • Portrays Kepler as a prototyping tool (ToDo)
  • Adjustable parameter-wise

Joint work between SDSC and UTEP.
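The core of the comparison described on this slide, a residual map formed as the difference between observed and synthetic gravity, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grids are small hypothetical integer arrays on the same cell layout, and the real workflow uses ESRI services rather than this function.

```python
def residual_map(observed, synthetic):
    """Return observed minus synthetic, cell by cell, for two equal-shaped grids."""
    if len(observed) != len(synthetic):
        raise ValueError("grids must have the same number of rows")
    residual = []
    for obs_row, syn_row in zip(observed, synthetic):
        if len(obs_row) != len(syn_row):
            raise ValueError("grid rows must have the same length")
        residual.append([o - s for o, s in zip(obs_row, syn_row)])
    return residual


# Hypothetical gravity values on a 2 x 2 grid (units are illustrative only).
observed = [[10, 12], [11, 15]]
synthetic = [[9, 12], [13, 14]]

residual = residual_map(observed, synthetic)
print(residual)
```

Changing the density parameters of the synthetic model and recomputing the residual is the kind of parameter adjustment the slide refers to.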
43
Gravity Modeling Workflow
44
R. Haugerud, U.S.G.S
LiDAR Introduction
Survey
Interpolate / Grid
Process Classify
D. Harding, NASA
Point Cloud x, y, zn,
Analyze / Do Science
45
A Three-Tier Architecture
Portal
  • GOAL: Efficient LiDAR interpolation and analysis
    using GEON infrastructure and tools
  • GEON Portal
  • Kepler Scientific Workflow System
  • GEON Grid
  • Use scientific workflows to glue/combine
    different tools and the infrastructure

Grid
46
Kepler can be used as a batch execution engine
Portal
  • Configuration phase
  • Subset: DB2 query on DataStar

Monitoring/ Translation
Subset
  • Interpolate: Grass RST, Grass IDW, GMT
  • Visualize: Global Mapper, FlederMaus, ArcIMS

Scheduling/ Output Processing
Grid
47
Lidar Processing Workflow (using Fledermaus)
Subset
d2
d2 (grid file)
d1
d1
d2
NFS Mounted Disk
48
Lidar Processing Workflow (using Global Mapper)
Subset
d2
d2 (grid file)
d1
d1
d2
NFS Mounted Disk
49
Lidar Processing Workflow (using ArcIMS)
Subset
d2 (grid file)
d1
d1
d2
NFS Mounted Disk
50
Lidar Workflow Portlet
  • User selections from GUI
  • Translated into a query and a parameter file
  • Uploaded to remote machine
  • Workflow description created on the fly
  • Workflow response redirected back to portlet
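The portlet steps above (GUI selections translated into a query, then a workflow description created on the fly from a template) might look roughly like the following Python sketch. Everything concrete here is an assumption for illustration: the XML-like template text, the `lidar_points` table, and the selection field names are hypothetical and are not taken from the GEON portlet.

```python
from string import Template

# Hypothetical template text; the real Kepler workflow template is not shown here.
WORKFLOW_TEMPLATE = Template(
    '<workflow name="$name">\n'
    '  <subset query="$query"/>\n'
    '  <interpolate algorithm="$algorithm" resolution="$resolution"/>\n'
    "</workflow>"
)


def build_query(selection):
    """Turn a bounding-box selection into a subset query string."""
    return (
        "SELECT x, y, z FROM lidar_points "
        f"WHERE x BETWEEN {selection['xmin']} AND {selection['xmax']} "
        f"AND y BETWEEN {selection['ymin']} AND {selection['ymax']}"
    )


def build_workflow(selection):
    """Fill the template with values derived from the user's GUI selections."""
    return WORKFLOW_TEMPLATE.substitute(
        name=selection["job_name"],
        query=build_query(selection),
        algorithm=selection["algorithm"],
        resolution=selection["resolution"],
    )


# Hypothetical selections as they might arrive from the portlet form.
selection = {
    "job_name": "demo_job",
    "xmin": 0, "xmax": 10, "ymin": 5, "ymax": 15,
    "algorithm": "idw",
    "resolution": 2,
}
workflow = build_workflow(selection)
print(workflow)
```

A production version would validate and escape user input before placing it in a query or a document; the sketch skips that to keep the template-filling step visible.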

51
LIDAR POST-PROCESSING WORKFLOW PORTLET
52
Behind the Scenes Workflow Template
53
Filled Template
54
Example Outputs
55
With Additional Algorithms
56
GLW Monitoring
  • Job management
  • A unified interface to follow up on the status of
    submitted jobs The system
  • View job metadata
  • Zoom to a specific bounding box location
  • Track errors
  • Modify a job and re-submit
  • View the processing results
  • In the future, register desired workflow products
  • Useful for publication
  • GLW is exposed to a high risk of component
    failures
  • Long running process
  • Distributed computational resources under diverse
    controlling authorities
  • Provides transparent/background error handling
    using provenance data and smart reruns
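One way to picture "smart reruns" driven by provenance data is a runner that records each step's output and, after a failure, re-executes only the steps that have no recorded result. The Python sketch below is a simplified assumption of that idea and not the Kepler mechanism; the step names, the in-memory `provenance` dictionary, and the simulated failure are all hypothetical.

```python
def run_with_smart_rerun(steps, provenance, inputs):
    """Run steps in order, reusing recorded outputs of steps that already succeeded.

    Simplification: results are keyed by step name only, so the sketch assumes
    the inputs are unchanged between the failed run and the rerun.
    """
    executed = []
    data = inputs
    for name, func in steps:
        if name in provenance:
            data = provenance[name]   # reuse the recorded output
        else:
            data = func(data)         # may raise; nothing is recorded then
            provenance[name] = data
            executed.append(name)
    return data, executed


attempts = {"interpolate": 0}


def subset(points):
    return [p for p in points if p >= 0]


def interpolate(points):
    attempts["interpolate"] += 1
    if attempts["interpolate"] == 1:
        raise RuntimeError("simulated transient failure")
    return sum(points) / len(points)


steps = [("subset", subset), ("interpolate", interpolate)]
provenance = {}

try:
    run_with_smart_rerun(steps, provenance, [-1, 2, 4])
except RuntimeError:
    pass  # the first run fails after "subset" has been recorded

result, executed = run_with_smart_rerun(steps, provenance, [-1, 2, 4])
print(result, executed)
```

For long-running jobs on distributed resources, skipping the already completed steps is where most of the time saving would come from.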

57
Rising Trends
58
Currently scientific workflows help with
  • mostly testing and experimentation steps
  • Formalization of the scientific process
  • Interaction with multiple tools and resources at
    once
  • Sharing, adaptation and reuse
  • Deployable, customizable, extensible
  • Management of complexity and usability
  • Support for hierarchical composition
  • Interfaces to different technologies from a
    unified interface
  • Can be annotated with domain-knowledge
  • Tracking provenance of the data, processes and
    workflow evolution
  • Keep the association of results to processes
  • Make it easier to validate/regenerate results and
    processes
  • Enable comparison between different workflow
    versions
  • Execution monitoring and fault tolerance

59
Evolving Challenges For Scientific Workflows - 1
  • Access to heterogeneous data and computational
    resources and link to different domain knowledge
  • Interface to multiple analysis tools and workflow
    systems
  • One size doesn't fit all!
  • Support computational experiment creation,
    execution, sharing, reuse and provenance
  • Manage complexity, user and process interactivity
  • Extensions for adaptive and dynamic workflows

60
Evolving Challenges For Scientific Workflows - II
  • Track provenance of workflow evolution,
    execution, and intermediate and final results
  • Efficient failure recovery and smart re-runs
  • Support various file and process transport
    mechanisms
  • Main memory, Java shared file system,
  • Come up with efficient and intuitive workflow
    deployment methods
  • Methodologies for workflow design
  • Do all these in a secure and easy-to-use way!!!
  • What are these helpful for?

61
End-to-End Support for the Full Scientific Process
  • Requires all the evolving challenges
  • End-to-end support from observing to publication
  • Use and control instruments, networks and
    observatories in observing steps
  • Scientifically and statistically analyze and
    manage data collected by steps in the scientific
    process,
  • Set up simulations as testbeds for possible
    observatories
  • Provenance tracking
  • within and across experiments modeled as
    workflows
  • beyond the active lifetime of a workflow
    including citations to it
  • How about engineering for scientific
    applications?
  • Engineering Workflows
  • How much of today's science depends on
    engineering?
  • Is engineering yet another applied science today?

62
  • "Why does this magnificent applied science,
    which saves work and makes life easier, bring us
    so little happiness? The simple answer runs:
    Because we have not yet learned to make sensible
    use of it." -- Albert Einstein, in an
    address at Cal Tech, 1931. (Harper)

63
Scientific Workflows with Provenance Tracking
Could Be an Answer!
  • Workflow modeling and provenance tracking for
    scientific process end-to-end
  • Involves
  • multiple workflows
  • multiple runs
  • multiple users
  • even multiple workflow engines
  • Customizable to conform to data models within
    different Cyberinfrastructure system
    architectures
  • Reporting and analytical tools to process
    collected data
  • What is the lifecycle of such provenance data?

64
Lifecycle of provenance information from
collection to usage
  • Who uses it?
  • Workflow developer
  • Workflow user
  • Scientific dashboards
  • execute and monitor a workflow
  • Other workflow systems
  • For what?
  • Data collection
  • Data usage
  • Feedback on usage

65
Kepler Provenance Framework
  • What provenance is recorded
  • Workflow Specification: actors, ports,
    connections, parameters, etc.
  • Workflow Evolution: parameter values that change
    over time.
  • Workflow Execution
  • Start/stop of workflow, individual actor
    executions
  • Data exchanged between actors
  • Where provenance is recorded
  • Modular interface supports saving to different
    output types.
  • Currently implemented: Text file, SQL

66
Text output
  • Write to file or console

67
SQL output
  • MySQL database

68
SQL Schema
Specification
Evolution
Execution
69
Example Provenance Queries
  • How long did workflow run n take?
  • SELECT e.end_time - e.start_time
  • FROM workflow_exec e
  • WHERE e.id = n
  • What actors does workflow m have?
  • SELECT e.*, a.*
  • FROM actor a, entity e
  • WHERE e.wf_id = m AND e.id = a.id
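The two provenance queries on this slide can be tried against a toy database. The sketch below uses Python's built-in `sqlite3` with an in-memory database instead of the MySQL backend mentioned earlier. The table names (`workflow_exec`, `entity`, `actor`) and the columns `id`, `wf_id`, `start_time`, and `end_time` come from the queries; the remaining columns, the integer time representation, and the sample rows are assumptions, since the actual Kepler schema is not shown in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed minimal schema; times are stored as integer seconds for simplicity.
    CREATE TABLE workflow_exec (id INTEGER PRIMARY KEY, start_time INTEGER, end_time INTEGER);
    CREATE TABLE entity (id INTEGER PRIMARY KEY, wf_id INTEGER, name TEXT);
    CREATE TABLE actor (id INTEGER PRIMARY KEY, class_name TEXT);

    -- Hypothetical sample rows.
    INSERT INTO workflow_exec VALUES (1, 100, 160);
    INSERT INTO entity VALUES (10, 7, 'Reader'), (11, 7, 'Plotter'), (12, 8, 'Other');
    INSERT INTO actor VALUES (10, 'FileReader'), (11, 'XYPlotter'), (12, 'Misc');
    """
)

# How long did workflow run n take? (here n = 1)
duration = conn.execute(
    "SELECT e.end_time - e.start_time FROM workflow_exec e WHERE e.id = ?",
    (1,),
).fetchone()[0]

# What actors does workflow m have? (here m = 7)
actors = conn.execute(
    "SELECT e.name, a.class_name FROM actor a, entity e "
    "WHERE e.wf_id = ? AND e.id = a.id ORDER BY e.id",
    (7,),
).fetchall()

print(duration, actors)
conn.close()
```

Passing `n` and `m` as bound parameters rather than formatting them into the SQL string is a design choice of the sketch, not something the slide specifies.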

70
Ocean SST Analyze Workflow
71
Configured Recorder
72
To Sum Up
  • Scientific workflows are maturing
  • with maturing technology requirements of
    scientific research
  • Modeling the full scientific process as a set of
    workflows and its provenance
  • opens the way to more efficient and documentable
    scientific research
  • Kepler is an open-source system and
    collaboration
  • was initiated in August, 2003
  • actively grows by application pull from
    contributors
  • released 1.0.0 on May 13th, 2008
  • GEON Workflows and prototypes of some of the
    mentioned topics available in Kepler

73
Thanks! Questions?
Ilkay Altintas, altintas_at_sdsc.edu, 1 (858)
822-5453, http://www.sdsc.edu/altintas
  • More information: http://kepler-project.org
  • Acknowledgements
  • DOE SciDac Award No. DE-FC02-07ER25811 for SDM
    Center
  • NSF Award No. DBI 0619060 for REAP
  • NSF Award OCI-0722079 for Kepler CORE