Yolanda Gil, PhD - PowerPoint PPT Presentation

About This Presentation
Title:

Yolanda Gil, PhD

Description:

Semantic Workflows and Shared Provenance Representations Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern ... – PowerPoint PPT presentation

Slides: 17

Transcript and Presenter's Notes

1
Scientific Reproducibility through Semantic
Workflows and Shared Provenance Representations
  • Yolanda Gil, PhD
  • Information Sciences Institute and
  • Department of Computer Science
  • University of Southern California
  • gil@isi.edu
  • http://www.isi.edu/gil

2
NSF Workshop on Challenges of Scientific
Workflows [Gil et al., IEEE Computer 2007]
  • Despite investments in CyberInfrastructure as an
    enabler of a significant paradigm change in
    science
  • Reproducibility, key to scientific method, is
    threatened
  • Exponential growth in compute, sensors, data
    storage, and networks, BUT the growth of science
    is not similarly exponential
  • What is missing:
  • Perceived importance of capturing and sharing
    process in accelerating pace of scientific
    advances
  • Process (method/protocol) is increasingly complex
    and highly distributed
  • Workflows are emerging as a paradigm for
    process-model driven science that captures the
    analysis itself
  • Workflows need to be first class citizens in
    science CyberInfrastructure
  • Enable reproducibility
  • Accelerate scientific progress by automating
    processes
  • Interdisciplinary and intradisciplinary research
    challenges
  • Report available at http://www.isi.edu/nsf-workflows06

3
Benefits of Workflow Systems [Taylor et al. 07]
  • Managing execution
  • Remote job submission
  • Dependencies among steps
  • Failure recovery
  • Managing distributed computation
  • Move data when needed
  • Managing large data sets
  • Efficiency, reliability
  • Security and access control
  • Access to shared resources
  • Provenance recording
  • Low-cost high-fidelity reproducibility
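
Two of the benefits listed above, dependencies among steps and failure recovery, can be illustrated with a minimal sketch. This is not how Wings or Pegasus implement them; the function `run_workflow`, the step names, and the retry policy are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies, max_retries=2):
    """Run each step after its dependencies, retrying a failed step.

    steps: dict mapping step name -> zero-argument callable.
    dependencies: dict mapping step name -> set of steps it depends on.
    Returns the step names in the order they completed.
    """
    completed = []
    # static_order() yields every step after all of its dependencies.
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                completed.append(name)
                break
            except RuntimeError:
                # Give up only after the last allowed attempt.
                if attempt == max_retries:
                    raise
    return completed


# Hypothetical three-step chain in which the middle step fails once.
attempts = {"simulate": 0}


def flaky_simulate():
    attempts["simulate"] += 1
    if attempts["simulate"] == 1:
        raise RuntimeError("transient failure")


steps = {
    "stage_in": lambda: None,
    "simulate": flaky_simulate,
    "stage_out": lambda: None,
}
dependencies = {
    "stage_in": set(),
    "simulate": {"stage_in"},
    "stage_out": {"simulate"},
}
order = run_workflow(steps, dependencies)
```

In this sketch the failed step is simply rerun in place; a production system would typically also checkpoint completed steps so that a restarted workflow does not repeat them.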

4
Capabilities Available Today: Wings/Pegasus
Workflows for Seismic Hazard Analysis [Gil et
al. 07] (see also [Maechlin et al. 05], [Deelman et
al. 06])
  • Input data: a site and an earthquake forecast
    model
  • thousands of possible fault ruptures and rupture
    variations, each a file, unevenly distributed
  • 110,000 rupture variations to be simulated for
    that site
  • High-level template combines 11 application codes
  • 8048 application nodes in the workflow instance
    generated by Wings
  • Provenance records kept for 100,000 workflow data
    products
  • Generated more than 2M triples of metadata
  • 24,135 nodes in the executable workflow generated
    by Pegasus, including
  • data stage-in jobs, data stage-out jobs, data
    registration jobs
  • Executed on the USC HPCC cluster (1820 nodes w/ dual
    processors), but only < 144 available
  • Including MPI jobs, each runs on hundreds of
    processors for 25-33 hours
  • Runtime was 1.9 CPU years
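
To put the figures above in perspective, the following back-of-the-envelope calculation estimates a lower bound on wall-clock time and the growth from application nodes to executable nodes. It assumes a 365-day year, perfect parallel efficiency, and all 144 processors busy for the whole run; none of these assumptions come from the slides, which state only that fewer than 144 were available, so the real elapsed time would have been longer.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

cpu_years = 1.9        # total runtime reported on the slide
max_processors = 144   # upper bound on processors available

cpu_hours = cpu_years * HOURS_PER_YEAR
# Lower bound on elapsed time: every processor busy the whole time.
min_wall_clock_days = cpu_hours / max_processors / 24

application_nodes = 8048   # workflow instance generated by Wings
executable_nodes = 24135   # executable workflow generated by Pegasus
# Factor by which the workflow grows once stage-in, stage-out, and
# registration jobs are added.
expansion = executable_nodes / application_nodes

print(f"at least {min_wall_clock_days:.1f} days of wall-clock time")
print(f"about {expansion:.1f} executable nodes per application node")
```

Under these assumptions the run needs at least roughly 4.8 days, and the executable workflow has about three nodes for every application node.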

5
The Wings/Pegasus Workflow System [Gil et al. 07]
[Deelman et al. 03] [Deelman et al. 05] [Kim et al. 08]
[Gil et al. forthcoming]
WINGS: Semantic workflow environment (wings.isi.edu)
  • Knowledge-based reasoning on workflows and data
    (W3C's OWL)
  • Semantic workflow catalogs
  • Automation and assistance
  • Execution-independent workflows

Pegasus: Automated workflow refinement and
execution (pegasus.isi.edu)
  • Optimize for performance, cost, reliability
  • Assign execution resources
  • Manage execution through DAGMan
  • Daily operational use in many domains

Grid services: condor.uwisc.edu, www.globus.org
  • Secure and controlled sharing of distributed
    services, computing, data
  • Scalable service-oriented architecture
  • Commercial quality, open source

6
Semantic Workflows in WINGS [Gil et al., IEEE IS
2010] [Gil et al., JETAI 2010] [Gil et al., eScience
2009] [Kim et al., JCCPE 2008] [Gil et al. 2007]
  • Semantic workflows
  • More than a dataflow graph
  • Workflow variables: each constituent (node, link,
    component, dataset) has a corresponding variable
  • Semantic constraints on workflow variables, both
    within and across variables
  • Semantic descriptions of collections of data
    and components are concisely represented

(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)

[modelerInput_not_equal_to_classifierInput:
  (modelerInput wflow:hasDataBinding ?ds1)
  (classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]

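
The rule above marks a workflow template as invalid when the modeler and the classifier are bound to the same dataset, i.e. when a model would be tested on its own training data. A minimal Python sketch of the same check follows. WINGS expresses this as a semantic rule over an OWL/RDF representation; the `WorkflowTemplate` class and the function name here are hypothetical stand-ins, not part of WINGS.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowTemplate:
    # Maps a workflow variable name to the id of the dataset bound to it.
    data_bindings: dict = field(default_factory=dict)
    is_invalid: bool = False


def apply_modeler_classifier_rule(template):
    """Flag the template if both inputs are bound to the same dataset."""
    ds1 = template.data_bindings.get("modelerInput")
    ds2 = template.data_bindings.get("classifierInput")
    # The rule fires only when both variables are bound and equal.
    if ds1 is not None and ds1 == ds2:
        template.is_invalid = True
    return template


# Hypothetical dataset ids, for illustration only.
good = apply_modeler_classifier_rule(
    WorkflowTemplate({"modelerInput": "train.csv", "classifierInput": "test.csv"})
)
bad = apply_modeler_classifier_rule(
    WorkflowTemplate({"modelerInput": "data.csv", "classifierInput": "data.csv"})
)
```

Here `good.is_invalid` stays `False` and `bad.is_invalid` becomes `True`, mirroring the conclusion of the rule.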
7
Workflow Portal for Genetic Studies of Mental
Disorders (with E. Deelman and C. Mason)
  • Existing repository of genotypic and phenotypic
    information
  • Goal: develop workflows useful for data in the
    repository

8
Designing a Workflow Collection for Population
Genomics
  • Designed workflows for common analysis types
  • Association tests
  • CNV detection
  • Variant discovery
  • Family-based association analysis (TDT)
  • Developed workflow components by encapsulating
    widely-used heterogeneous open software
  • Plink (Purcell, Harvard)
  • R (Chambers et al)
  • PennCNV (Penn) -- Hidden Markov Models
  • Gnosis (State, Yale) -- sliding windows
  • Allegro (Decode, Iceland) -- Multiterminal Binary
    Decision Diagrams
  • Structure (Pritchard, Chicago) -- structured
    association
  • FastLink (Schaffer, NCBI)
  • Burrows-Wheeler Aligner (BWA) (Li & Durbin)
  • SAMTools

9
Wings Workflows for Genetic Studies of Mental
Disorders [Gil et al., forthcoming]
  • Transmission Disequilibrium Test (TDT)
  • Association Tests
  • CNV Detection
  • Variant Discovery from Resequencing

10
Major Features
  • Workflow system manages setup and execution
  • Wings: setup
  • Pegasus: execution
  • Initial collection of workflows captures common
    genomic analyses
  • Users can upload their own datasets
  • Including collections of datasets
  • User data is secure
  • Not accessible by others

11
Wings Replication of Crohn's Disease Association
Study from Duerr et al., Science 06

12
Wings Replication of Early-Onset Parkinson's
Disease Study from Bayrakli et al., Human
Mutation 07

13
Observations about Reproducibility with Workflows
[Gil et al., forthcoming]
  • Effort involved in reproducing results is minor
  • 30 seconds to set up a workflow
  • A catalog of carefully crafted workflows of
    select state-of-the-art methods will cover a wide
    range of genomic analyses
  • Our workflows were independently developed and
    used as is
  • Semantic representations abstract the analysis
    method from the software that implements it
  • Our workflows used different analytic tools than
    the original studies
  • Many implementations of same algorithm, some
    proprietary
  • Semantic constraints can be added to workflows to
    avoid analysis errors
  • E.g., in the association analysis workflow, added a
    constraint to remove duplicate individuals
    initially to avoid problems downstream
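
The duplicate-removal constraint mentioned above can be pictured as a filtering step that runs before the association analysis. The slides do not say how duplicates are identified; the sketch below assumes each record carries a unique identifier under a hypothetical `individual_id` key and keeps the first record seen for each identifier.

```python
def remove_duplicate_individuals(records, id_key="individual_id"):
    """Return records with later duplicates of an individual dropped.

    records: list of dicts, one per genotyped individual.
    id_key: name of the field identifying an individual (assumed).
    """
    seen = set()
    unique = []
    for record in records:
        ident = record[id_key]
        if ident not in seen:
            seen.add(ident)
            unique.append(record)
    return unique


# Hypothetical input in which individual "A1" appears twice.
cohort = [
    {"individual_id": "A1", "phenotype": "case"},
    {"individual_id": "B2", "phenotype": "control"},
    {"individual_id": "A1", "phenotype": "case"},
]
deduplicated = remove_duplicate_individuals(cohort)
```

Running the filter first means that downstream association statistics are not inflated by counting the same person more than once.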

14
Benefits of Semantic Workflows [Gil, JSP-09]
  • Execution management
  • Automation of workflow execution
  • Managing distributed computation
  • Managing large data sets
  • Security and access control
  • Provenance recording
  • Low-cost high fidelity reproducibility
  • Semantics and reasoning
  • User assistance to correctly explore analysis
    design space
  • Validation of analyses
  • Automated generation of metadata
  • Workflow retrieval and discovery
  • Conceptual reproducibility

15
W3C Provenance Group (Y. Gil, chair): Goals
  • Provide state-of-the-art understanding and
    develop a roadmap for development and possible
    standardization
  • Articulate requirements for accessing and
    reasoning about provenance information
  • Develop use cases
  • Identify issues in provenance that are of direct
    concern to the Semantic Web
  • Articulate relationships with other aspects of
    Web architecture
  • Report on state-of-the-art work on provenance
  • Report on a roadmap for provenance in the
    Semantic Web
  • Identify starting points for provenance
    representations
  • Identifying elements of a provenance architecture
    that would benefit from standardization

16
W3C Provenance Group: Products of the Group to
Date
  • Group formed in September 2009, open to new
    members
  • All information is public: http://www.w3.org/2005/Incubator/prov/wiki/
  • Developed a set of key dimensions for provenance
    (11/09)
  • Grouped into three major categories: content,
    management, use
  • Developed use cases for provenance (12/09)
  • More than 30 use cases, including 10 in science
    but others are relevant
  • Developed requirements for provenance from use
    cases (1/10)
  • User requirements: what is the purpose of the
    provenance information
  • Technical requirements: derived from the user
    requirements
  • Report on Requirements for Provenance on the
    Web
  • Currently developing state-of-the-art report
    (expected 6/10)
  • Started to develop recommendations (expected
    9/10)
  • Mappings across provenance vocabularies (e.g., DC,
    OPM, SWAN)