Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis

1
Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis
Chad Berkley, NCEAS
National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
Long Term Ecological Research Network Office, University of New Mexico
University of Kansas
San Diego Supercomputer Center
http://seek.ecoinformatics.org
December 4, 2003
Edinburgh, Scotland
2
Outline
  • Quick history
  • SEEK overview
  • Ecological Metadata Language
  • Using workflows in Ecology
  • Workflow editing with Kepler
  • Future visions

3
History
  • Late 1990s: patterns noticed in the problems
    surrounding data synthesis at NCEAS
  • 1999: Michener et al. paper on ecological
    metadata
  • 2000: Knowledge Network for Biocomplexity (KNB)
  • Morpho, Metacat, Ecological Metadata Language
  • Some footholds into workflow creation and
    execution
  • 2003: Science Environment for Ecological
    Knowledge (SEEK) grant
  • Continues the work done on the KNB grant
  • Emphasis on using metadata for advanced data
    processing

4
SEEK approach
  • General approach to specific ecological problems
  • Data described with adequate metadata in a
    grid-accessible repository
  • Reasoning engine (ontology based) to locate and
    extract data and processes
  • Modeling system to put it all together and
    control execution flow

5
SEEK Components
  • Ecogrid
  • Analysis Library
  • Metadata and data repository
  • Semantic Mediation System
  • Controlled semantic vocabulary
  • Ontological discovery system
  • Analysis and Modeling System (Kepler)
  • Workflow control system
  • Utilizes resources from other components

6
SEEK Architecture
7
Ecological Metadata Language
  • Common language for archiving and transport of
    datasets
  • XML based
  • Designed for/by the ecological community
  • Describes physical and logical structure of data
  • Also includes project, literature and software
    information
  • SEEK will add semantic information

8
Workflows in SEEK
  • In the SEEK model, data ingestion/cleaning is
    metadata driven (specifically with EML)
  • Output generation includes creating appropriate
    metadata
  • The analysis pipeline itself becomes metadata

9
Metadata driven data ingestion
  • Key information needed to read and machine
    process a data file is in the metadata
  • File descriptors (CSV, Excel, RDBMS, etc.)
  • Entity (table) and Attribute (column)
    descriptions
      • Name
      • Type (integer, float, string, etc.)
      • Codes (missing values, nulls, etc.)
  • In the future, this will include semantic typing
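
As a rough illustration of this slide, the sketch below reads a delimited text file using nothing but a metadata description. The `metadata` dictionary is a hypothetical, simplified stand-in for the physical and attribute sections of an EML document (it is not real EML), and `read_with_metadata` is an invented helper name.

```python
import csv
import io

# Hypothetical, simplified stand-in for the file and attribute
# descriptions an EML document would carry (this is not real EML).
metadata = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area", "type": "float", "missing": ["NA"]},
    ],
}

# Map the declared attribute types onto Python constructors.
CASTS = {"string": str, "integer": int, "float": float}


def read_with_metadata(text, meta):
    """Parse delimited text into typed records, guided only by metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=meta["delimiter"])
    rows = list(reader)[meta["header_lines"]:]  # skip declared header lines
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(meta["attributes"], row):
            if raw in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](raw)
        records.append(record)
    return records


raw_data = "site,count,area\nA,12,2.5\nB,-999,4.0\nC,7,NA\n"
records = read_with_metadata(raw_data, metadata)
```

Nothing in the parsing function is specific to this file: changing the delimiter, the column types, or the missing-value codes only requires changing the metadata.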

10
Metadata revision
  • Metadata is revised following any transformation
  • Versioning of metadata and data is very important
  • This process results in a lineage of the data
    file as it has been transformed
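
One way to picture the lineage idea is a helper that never modifies a dataset in place: each transformation returns new data together with revised metadata that records what was done. This is a minimal sketch with invented names (`apply_step`, the `version` and `lineage` fields); it does not reflect how SEEK or EML actually store versions.

```python
import copy


def apply_step(dataset, description, transform):
    """Apply a transformation; return new rows plus revised metadata."""
    old_meta = dataset["metadata"]
    new_meta = copy.deepcopy(old_meta)
    new_meta["version"] = old_meta["version"] + 1
    # The lineage grows by one entry per transformation.
    new_meta["lineage"] = old_meta["lineage"] + [description]
    return {"metadata": new_meta, "rows": transform(dataset["rows"])}


original = {
    "metadata": {"id": "plots", "version": 1, "lineage": []},
    "rows": [{"count": 12}, {"count": None}, {"count": 7}],
}

cleaned = apply_step(
    original,
    "dropped rows with missing count",
    lambda rows: [r for r in rows if r["count"] is not None],
)
doubled = apply_step(
    cleaned,
    "multiplied count by 2",
    lambda rows: [{"count": r["count"] * 2} for r in rows],
)
```

After both steps, `doubled["metadata"]["lineage"]` lists the two transformations in order, while `original` keeps its version 1 metadata untouched.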

11
Typical ecological workflow example
  • Workflows can automate the integration process if
    data is described with adequate structured
    metadata

12
Homogeneous data integration
  • Integration of homogeneous or mostly homogeneous
    data via EML metadata is relatively
    straightforward

13
Heterogeneous Data integration
  • Integration of heterogeneous data requires much
    more advanced metadata and processing
  • Attributes must be semantically typed
  • Collection protocols must be known
  • Units and measurement scale must be known
  • Measurement mechanics must be known (e.g. that
    Density = Count/Area)
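
To make the units point concrete, the sketch below computes density as count divided by area for two sites that recorded area in different units. The site values are made up for illustration, and `density_per_m2` is a hypothetical helper; only the conversion factors are standard.

```python
# Conversion factors to square metres (exact by definition).
AREA_TO_M2 = {"m2": 1.0, "hectare": 10_000.0, "km2": 1_000_000.0}


def density_per_m2(count, area, area_unit):
    """Density = Count / Area, with the area converted to square metres."""
    return count / (area * AREA_TO_M2[area_unit])


# Illustrative (made-up) observations from two sites using different units.
site_a = {"count": 50, "area": 100.0, "area_unit": "m2"}
site_b = {"count": 20000, "area": 2.0, "area_unit": "hectare"}

densities = [
    density_per_m2(s["count"], s["area"], s["area_unit"])
    for s in (site_a, site_b)
]
```

Without the unit recorded in metadata, the raw areas 100.0 and 2.0 could not be compared, and the two densities could not be placed on the same scale.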

14
Semantic typing and ontologies
  • Label data with semantic types
  • Label inputs and outputs of analytical components
    with semantic types
  • Use Semantic Mediation System (SMS) to generate
    transformation steps
  • Beware analytical constraints
  • Use SMS to discover relevant components
  • Ontology: specification of a conceptualization
    (a knowledge map)

[Figure: Data, Ontology, Workflow Components]
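
Component discovery by semantic type can be pictured as a lookup against a small hierarchy of terms. The toy ontology, the component names, and the `discover` function below are all invented for illustration; this is not the Semantic Mediation System's actual algorithm or vocabulary.

```python
# Hypothetical toy ontology: each term maps to its more general parent.
PARENT = {
    "PopulationDensity": "Density",
    "Density": "Measurement",
    "Abundance": "Measurement",
}

# Hypothetical components, each labelled with the semantic type it accepts.
components = {"density_summary": "Density", "abundance_plot": "Abundance"}


def is_a(term, ancestor):
    """True if term equals ancestor or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def discover(data_type):
    """List components whose accepted type covers the data's semantic type."""
    return sorted(
        name for name, accepts in components.items() if is_a(data_type, accepts)
    )


matches = discover("PopulationDensity")
```

Here data labelled `PopulationDensity` is matched to the component that accepts `Density`, because the ontology records that one is a kind of the other.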
15
Measurement Ontology
  • Density is part of a larger measurement ontology
  • SEEK's intent is to create one or more
    community-created ecological ontologies
  • Creates a controlled vocabulary for ecological
    metadata
  • More about this in Bertram's talk

16
About Kepler
  • Kepler is the name of the SEEK/SDM additions to
    the Ptolemy modeling system
  • Ptolemy was designed by the UC Berkeley EECS
    department
  • Primary use is modeling EE circuits
  • Free, open source, pure Java
  • Flexible design GUI for building workflows

17
Kepler
  • A Kepler model consists of linked actors (which
    correspond to workflow steps)
  • Timing is controlled by a director
  • All actors are written in Java but can call other
    applications (such as SAS and MATLAB) or native
    language code via JNI
  • Actors can call arbitrary Web (or Grid) Services
  • Ptolemy already has a very large inventory of
    actors
  • Easy to use, drag-and-drop interface
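
The actor and director split can be sketched in a few lines. The classes below are a deliberately simplified illustration in Python with invented names. They are not the Ptolemy or Kepler API: real actors are Java classes, and real directors implement specific models of computation.

```python
class Actor:
    """A workflow step: a named function plus links to downstream actors."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []

    def link(self, other):
        self.downstream.append(other)
        return other  # returning the target allows chained linking


class SequentialDirector:
    """Decides when actors fire: here, simply in link order."""

    def run(self, source, value):
        fired = []
        queue = [(source, value)]
        while queue:
            actor, token = queue.pop(0)
            result = actor.func(token)
            fired.append((actor.name, result))
            # Pass the result on to every linked actor.
            queue.extend((nxt, result) for nxt in actor.downstream)
        return fired


read = Actor("read", lambda _: [3, None, 4])
clean = Actor("clean", lambda xs: [x for x in xs if x is not None])
total = Actor("total", sum)
read.link(clean).link(total)

trace = SequentialDirector().run(read, None)
```

The actors know nothing about scheduling. Swapping in a different director would change when and how they fire without touching the actors themselves.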

18
SEEK Contributions to Kepler (so far)
  • EML data ingestion actor
  • Actor design tool

19
EML data ingestion actor
  • Ingests any data format described by EML metadata
  • Converts raw data to Kepler format
  • Data can then be operated on with other actors
  • Produces one output port for each attribute in
    the dataset
  • Individual attributes can then be mapped to other
    actors
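
The "one output port per attribute" behaviour amounts to turning row-oriented records into one stream per column. The sketch below models a port as a plain list; `make_output_ports` and the sample records are hypothetical and do not come from the actual ingestion actor.

```python
def make_output_ports(attribute_names, records):
    """Return one 'port' (a list of tokens) per attribute in the dataset."""
    ports = {name: [] for name in attribute_names}
    for record in records:
        for name in attribute_names:
            ports[name].append(record[name])
    return ports


# Illustrative records, as they might look after ingestion.
records = [{"site": "A", "count": 12}, {"site": "B", "count": 7}]
ports = make_output_ports(["site", "count"], records)

# A downstream step can now be wired to a single attribute.
mean_count = sum(ports["count"]) / len(ports["count"])
```

Because each attribute has its own port, a downstream step that only needs `count` never has to know about the other columns.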

20
Ptolemy model with EML ingestion actor
21
SEEK Contributions to Kepler (so far)
  • EML data ingestion actor
  • Actor design tool

22
Actor design tool
  • Allows place-holder actors to be defined on the
    fly by non-programmers during workflow creation
  • Domain scientists can thereby create workflows
    without programming knowledge
  • Workflows created with these actors can be
    executed once their functionality is implemented
    by a programmer
  • Allows quick prototyping of workflows by domain
    scientists
  • Place-holder actors can still be linked to
    other working actors
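
A place-holder actor can be thought of as a step with declared inputs and outputs but no behaviour yet. The class below is a minimal sketch with invented names (`PlaceholderActor`, `implement`, `fire`), not the actor design tool's real output.

```python
class PlaceholderActor:
    """A workflow step with declared ports whose logic is supplied later."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.implementation = None  # filled in later by a programmer

    def implement(self, func):
        self.implementation = func

    def fire(self, **kwargs):
        if self.implementation is None:
            raise NotImplementedError(f"{self.name} has no implementation yet")
        return self.implementation(**kwargs)


# A domain scientist declares the step while designing the workflow.
richness = PlaceholderActor(
    "species_richness", inputs=["species"], outputs=["richness"]
)

try:
    richness.fire(species=["a", "b", "a"])
    ran_before_implementation = True
except NotImplementedError:
    ran_before_implementation = False

# Later, a programmer supplies the behaviour.
richness.implement(lambda species: {"richness": len(set(species))})
result = richness.fire(species=["a", "b", "a"])
```

The declared ports let the step be linked into a workflow right away. Only execution has to wait for the implementation.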

23
Ptolemy and dynamically created actor
24
How domain scientists will benefit
  • More fully automated integration systems
  • A library of pre-defined analytical processes
    which can be executed on heterogeneous data
  • Semantic data discovery and processing
  • Automated unit and measurement scale conversions
  • A fuller understanding of cross site research
    implications

25
Acknowledgements
More info: http://seek.ecoinformatics.org
Questions? IRC: irc.ecoinformatics.org seek
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676 to NCEAS and its collaborators.
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
Primary Collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)