Paul Fisher - PowerPoint PPT Presentation



1
An Introduction to Web Services and Scientific
Workflows
  • Paul Fisher
  • University of Manchester

2
Overview
  • Current analysis techniques
  • Issues with manual analyses
  • Web Services
  • Workflows as Scientific protocols
  • Workflow sharing, re-use, and repurposing in
    myExperiment
  • Service discovery with Feta and BioCatalogue
  • Later: practical session for hands-on experience

3
Manual analysis techniques
  • Nucleic Acids Research (2009) - over 1170
    databases
  • Specialist software applications
  • Navigating between software resources
  • Cut and Paste of data
  • Screen scraping of Web pages
  • Scripting in Perl / Java / Python / C
  • insert another language here

4
Manual Methods of data analysis
5
Issues in analysis techniques
6
Manual Methods of data analysis
No explicit methods
Tedious and repetitive
Human error
Navigating through hyperlinks
7
Implicit methods
8
Huge amounts of data
Region on chromosome
Microarray
1000 Genes
200 Genes
How do I look at ALL the genes systematically?
9
Hypothesis-Driven Analyses
200 genes
Pick the genes involved in immunological process
Cherry Pick genes
40 genes
Pick the genes that I am most familiar with
2 genes
Biased view
10
Issues with current approaches
  • Scale of analysis task overwhelms researchers:
    lots of data
  • User bias and premature filtering of datasets:
    cherry picking
  • Hypothesis-driven approach to data analysis
  • Constant changes in data - problems with
    re-analysis of data
  • Implicit methodologies (hyper-linking through web
    pages)
  • Error proliferation from any of the listed issues,
    notably human error
  • Solution: Automate

11
Automate using the Two Ws
  • Web Services
  • Technology and standard for exposing code and
    data resources by a means that can be consumed
    by a third party remotely
  • Describes how to interact with it, e.g. service
    parameters
  • Workflows
  • General technique for describing and executing a
    process
  • Describes what you want to do, including the
    services to use

12
Web Services
[Diagram: a Client Application sends an HTTP Request (SOAP) to a Remote
Application, whose interface is described by WSDL, and receives an HTTP
Response]
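
As a rough illustration of the request side of this exchange, the sketch below builds a SOAP 1.1 envelope using only the Python standard library. The operation name `runBlast`, the namespace `http://example.org/blast` and the parameter names are assumptions made up for the example; they do not describe any real service, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/blast"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> bytes:
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameters, for illustration only.
request_body = build_soap_request(
    "runBlast",
    {"sequence": "MKTAYIAKQR", "database": "swissprot", "program": "blastp"},
)

# Headers a client would typically send with the HTTP POST carrying this body.
request_headers = {
    "Content-Type": "text/xml; charset=utf-8",
    "SOAPAction": '"runBlast"',
}

print(request_body.decode("utf-8"))
```

The client would POST `request_body` with `request_headers` to the service endpoint and parse the XML that comes back in the HTTP response.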
13
Web Service Description Language
  • Web Service Description Language (WSDL) is used
    to provide a computer program with enough
    information on how to execute or provide data to
    a remote resource
  • XML based language
  • Can be used for most industry programming
    languages including Perl, C, Java
  • Tells external programs how to call a remote
    service: it exposes the function calls
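
To make the idea of "exposed function calls" concrete, here is a minimal sketch that reads the operations out of a WSDL 1.1 document with the Python standard library. The WSDL text is a heavily trimmed, hypothetical example (service name, namespace and operation names are all invented); a real WSDL also carries types, bindings and service endpoints.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical, trimmed WSDL: only messages and one portType are shown.
EXAMPLE_WSDL = """<?xml version="1.0"?>
<definitions name="BlastService"
    targetNamespace="http://example.org/blast"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="http://example.org/blast">
  <message name="runBlastRequest"/>
  <message name="runBlastResponse"/>
  <message name="getStatusRequest"/>
  <message name="getStatusResponse"/>
  <portType name="BlastPortType">
    <operation name="runBlast">
      <input message="tns:runBlastRequest"/>
      <output message="tns:runBlastResponse"/>
    </operation>
    <operation name="getStatus">
      <input message="tns:getStatusRequest"/>
      <output message="tns:getStatusResponse"/>
    </operation>
  </portType>
</definitions>
"""


def list_operations(wsdl_text: str) -> dict:
    """Map each operation name to its (input message, output message) pair."""
    root = ET.fromstring(wsdl_text)
    operations = {}
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        for op in port_type.findall("wsdl:operation", WSDL_NS):
            inp = op.find("wsdl:input", WSDL_NS)
            out = op.find("wsdl:output", WSDL_NS)
            operations[op.get("name")] = (
                inp.get("message") if inp is not None else None,
                out.get("message") if out is not None else None,
            )
    return operations


for name, (request, response) in list_operations(EXAMPLE_WSDL).items():
    print(f"{name}: {request} -> {response}")
```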

14
Programmatic Interfaces to Services (Web Services,
not Web Sites)
[Diagram: Web Services (SeqFetch Service, GO Service, BLAT Service, BLAST
Service) each publish an Interface Description Document (WSDL, WADL) to a
Service Registry, and are called by Your Workflow, Your Application or
Your Script]
15
What types of service?
  • WSDL Web Services
  • BioMart
  • R-processor
  • BioMoby
  • Soaplab
  • Local Java services
  • Beanshell
  • Workflows

16
Workflows
  • Collection of tasks chained together to perform
    one overall operation e.g. the morning ritual
    workflow
  • Get up
  • Have a wash
  • Get dressed
  • Eat breakfast
  • Clean teeth
  • Go to lectures
  • High level description of your experiment
  • Inputs, programs, outputs (and intermediate
    inputs and outputs)
  • Workflow is the model of experiment
  • Methods section in your publication
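
The morning ritual can be written down as a tiny workflow to show the idea of chained tasks with intermediate inputs and outputs. This is only a sketch in plain Python, not how Taverna represents workflows; the helper names `done` and `run_workflow` are invented for the example.

```python
from typing import Callable, List, Tuple

Task = Callable[[List[str]], List[str]]


def done(activity: str) -> Task:
    """Build a task that records `activity` as completed and passes the log on."""
    def task(log: List[str]) -> List[str]:
        return log + [activity]
    return task


def run_workflow(steps: List[Tuple[str, Task]], data: List[str]):
    """Feed each step the previous step's output, keeping every intermediate result."""
    trace = []
    for name, task in steps:
        data = task(data)
        trace.append((name, list(data)))
    return data, trace


MORNING_RITUAL = [
    (name, done(name))
    for name in [
        "Get up",
        "Have a wash",
        "Get dressed",
        "Eat breakfast",
        "Clean teeth",
        "Go to lectures",
    ]
]

final_log, trace = run_workflow(MORNING_RITUAL, [])
for step_name, intermediate in trace:
    print(f"{step_name}: {len(intermediate)} task(s) completed so far")
```

The list of steps is the explicit, shareable description of the method; the trace is the record of what happened when it ran.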

17
(No Transcript)
18
What is a Workflow?
  • Workflows provide a general technique for
    describing and enacting a process
  • Describes what you want to do, and how you want
    to do it
  • Specifies how bioinformatics processes fit
    together
  • Processes are represented as web services

Remove repeats
Find orthologues
Find genes
19
Taverna
20
The Taverna Workflow Workbench
21
What is Taverna?
  • Taverna enables the interoperation between
    databases and tools by providing a toolkit for
    composing, executing and managing workflow
    experiments. Someone (sometime)
  • OR
  • Allows you to build and run workflows. Paul
    Fisher (2009)
  • Access to local and remote resources and analysis
    tools
  • Automation of data flow between services
  • Iteration over large data sets, and so on

http://www.mygrid.org.uk/
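
A minimal sketch of what "iteration over large data sets" means in practice: a service written for a single item is applied to every item of a list automatically. This does not reproduce Taverna's own iteration strategies; `call_with_iteration` is an invented helper, and `gc_content` is a local stand-in for a remote analysis service.

```python
def call_with_iteration(service, value):
    """Call a single-item service once, or once per element when given a (nested) list."""
    if isinstance(value, list):
        return [call_with_iteration(service, item) for item in value]
    return service(value)


def gc_content(sequence: str) -> float:
    """Hypothetical stand-in for a remote service: fraction of G and C bases."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


single = call_with_iteration(gc_content, "GGCC")
many = call_with_iteration(gc_content, ["ATGC", "AAAA", "GGGC"])
print(single, many)
```

The caller never writes the loop; the same service definition works for one sequence or for thousands.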
22
Taverna Workflow Workbench
23
Who uses Taverna?
  • Over 60,000 downloads
  • Systems biology
  • Medical image analysis
  • Heart simulations
  • High throughput screening
  • Genotype/Phenotype studies
  • Health Informatics
  • Astronomy
  • Chemoinformatics

NOT FOR BIOLOGISTS!!!!
Prof. Andy Brass
Designed for informaticians: computer-savvy
people
24
What do Scientists use Taverna for?
  • Data gathering and annotating
  • Distributed data and knowledge
  • Building models and knowledge management
  • Populating SBML or hypothesis generation
  • Data analysis
  • Distributed analysis tools and high throughput

25
Data Gathering
  • Collecting evidence from lots of places
  • Accessing local and remote databases, extracting
    info and displaying a unified view to the user

Lots of outputs!!
26
Annotation Pipelines
  • Genome annotation pipelines
  • Workflow assembles evidence for predicted genes /
    potential functions
  • Human expert can review this evidence before
    submission to the genome database
  • Data warehouse pipelines
  • e-Fungi model organism warehouse
  • ISPIDER proteomics warehouse
  • Annotating the up/down regulated genes in a
    microarray experiment

27
Systems Biology Model Construction
Automatic reconstruction of genome-scale yeast
metabolism from distributed data in the life
sciences to create and manipulate Systems Biology
Markup Language (SBML) models.
28
Integration of microarray data with SBML
Read enzyme names from SBML
Query maxdLoad2 using enzyme names
Calculate colours based on gene expression level
Create new SBML model with new colour nodes
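
The four steps on this slide can be sketched in a few lines of Python. Everything specific here is an assumption: the SBML fragment is a toy that is not a complete valid model, the `EXPRESSION` dictionary stands in for the maxdLoad2 query, the colour thresholds are arbitrary, and the `colour:fill` annotation element uses an invented namespace rather than whatever the original workflow wrote.

```python
import xml.etree.ElementTree as ET

SBML_NS = "http://www.sbml.org/sbml/level2"
COLOUR_NS = "http://example.org/colour"  # hypothetical annotation namespace

# Toy SBML fragment (not a complete, valid model).
EXAMPLE_SBML = f"""<sbml xmlns="{SBML_NS}" level="2" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="s1" name="hexokinase" compartment="cell"/>
      <species id="s2" name="enolase" compartment="cell"/>
      <species id="s3" name="aldolase" compartment="cell"/>
    </listOfSpecies>
  </model>
</sbml>"""

# Hypothetical log2 fold changes, standing in for a maxdLoad2 query result.
EXPRESSION = {"hexokinase": 1.8, "enolase": -0.9}


def colour_for(level):
    """Map an expression level to a colour (thresholds are arbitrary)."""
    if level is None or abs(level) < 0.5:
        return "grey"  # unchanged or no data
    return "red" if level > 0 else "green"


def colour_model(sbml_text: str, expression: dict) -> str:
    """Read names from the model, look up expression, and write colour annotations."""
    ET.register_namespace("", SBML_NS)
    ET.register_namespace("colour", COLOUR_NS)
    root = ET.fromstring(sbml_text)
    for species in list(root.iter(f"{{{SBML_NS}}}species")):
        level = expression.get(species.get("name"))
        annotation = ET.SubElement(species, f"{{{SBML_NS}}}annotation")
        ET.SubElement(annotation, f"{{{COLOUR_NS}}}fill").text = colour_for(level)
    return ET.tostring(root, encoding="unicode")


print(colour_model(EXAMPLE_SBML, EXPRESSION))
```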
29
Data Analysis
  • Access to local and remote analysis tools
  • You start with your own data / public data of
    interest
  • You need to analyse it to extract biological
    knowledge

30
Trypanosomiasis in Africa
Steve Kemp
Andy Brass
Many others
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
31
A Systematic Strategy for Large-Scale Analysis of
Genotype-Phenotype Correlations: Identification
of candidate genes involved in African
Trypanosomiasis. Fisher et al. (2007) Nucleic
Acids Research, doi:10.1093/nar/gkm623
  • Explicitly discusses the methods we used for the
    Trypanosomiasis use case
  • Discussion of the results for Daxx and shows
    mutation
  • Sharing of workflows for re-use, re-purposing

32
Trichuris muris
  • Mouse whipworm infection - parasite model of the
    human parasite Trichuris trichiura
  • Understanding Phenotype
  • Comparing resistant vs. susceptible strains:
    microarrays
  • Understanding Genotype
  • Mapping quantitative traits: classical genetics,
    QTL (regions of chromosome)

33
Recycling, Reuse, Repurposing
Here's the Science!
  • Identified a candidate gene (Daxx) for
    Trypanosomiasis resistance.
  • Manual analysis on the microarray and QTL data
    failed to identify this gene as a candidate.
  • Unbiased analysis. Confirmed by the wet lab.

Here's the e-Science!
  • Trypanosomiasis mouse workflow reused without
    change in Trichuris muris infection in mice
  • Identified biological pathways involved in sex
    dependence
  • Previous manual two year study of candidate genes
    had failed to do this.

Workflows now being run over Colitis/Inflammatory
Bowel Disease in mice (without change)
34
Was the Workflow Approach Successful?
  • Scale of analysis task overwhelms researchers:
    lots of data
  • Handled by computers
  • User bias and premature filtering of datasets:
    cherry picking
  • All data processed systematically
  • Hypothesis-driven approach to data analysis
  • Computers know nothing of hypotheses and so
    process the data independent of any prior
    judgments
  • Constant changes in data - problems with
    re-analysis of data
  • Saved workflow can be re-run at any point, over
    new data sets
  • Implicit methodologies (hyper-linking through web
    pages)
  • Methodology has been captured in the workflow
    itself

35
Social Networking for Scientists
36
Recycling, Reuse, Repurposing
  • Share
  • Search
  • Re-use
  • Re-purpose
  • Execute
  • Communicate
  • Record

http://www.myexperiment.org/
37
Sharing Experiments
  • myGrid supports the in silico experimental
    process for individual scientists
  • How do you share your
    results/experiments/experiences with your:
  • Research group
  • Collaborators
  • Scientific community
  • How do you compare your results with others
    produced by e.g. Kepler / Triana?

38
(No Transcript)
39
Just Enough Sharing.
  • myExperiment can provide a central location for
    workflows from one community/group
  • myExperiment allows you to say
  • Who can look at your workflow
  • Who can download your workflow
  • Who can modify your workflow
  • Who can run your workflow
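
The four questions above amount to a per-workflow access policy. The sketch below shows one way such a policy could be modelled; it is an illustration only and says nothing about how myExperiment actually stores or enforces permissions. The class, action names and user names are all invented.

```python
from dataclasses import dataclass, field

ACTIONS = ("view", "download", "modify", "run")


@dataclass
class SharingPolicy:
    """Hypothetical per-workflow policy: which users may perform which action."""
    owner: str
    allowed: dict = field(default_factory=lambda: {action: set() for action in ACTIONS})

    def grant(self, action: str, user: str) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.allowed[action].add(user)

    def can(self, user: str, action: str) -> bool:
        # The owner may do everything; others need an explicit grant.
        return user == self.owner or user in self.allowed.get(action, set())


policy = SharingPolicy(owner="paul")
policy.grant("view", "collaborator")
policy.grant("run", "collaborator")

print(policy.can("collaborator", "run"))     # granted explicitly
print(policy.can("collaborator", "modify"))  # never granted
```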

40
Remote Execution of Workflows
41
Service Discovery
42
Finding Services
  • There are over 3500 distributed services. How do
    we find an appropriate one?
  • We need to annotate services by their functions
    (and not their names!)
  • The services might be distributed, but a registry
    of service descriptions can be central and
    queried
  • Annotated with terms from the myGrid ontology
  • Questions we can ask: "Find me all the services
    that perform a multiple sequence alignment and
    accept protein sequences in FASTA format as input"
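
The example question can be expressed as a filter over annotated service descriptions. This is a sketch of the idea only: it is not the Feta or BioCatalogue query interface, the registry entries and service names are invented, and plain strings stand in for myGrid ontology terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical registry entry, annotated by function rather than by name."""
    name: str
    task: str
    input_data: str
    input_format: str


# Invented registry contents for illustration.
REGISTRY = [
    ServiceDescription("align_proteins", "multiple sequence alignment", "protein sequence", "FASTA"),
    ServiceDescription("align_dna", "multiple sequence alignment", "nucleotide sequence", "FASTA"),
    ServiceDescription("search_db", "similarity search", "protein sequence", "FASTA"),
    ServiceDescription("align_embl", "multiple sequence alignment", "protein sequence", "EMBL"),
]


def find_services(registry: List[ServiceDescription], task: str,
                  input_data: str, input_format: str) -> List[ServiceDescription]:
    """Return services whose annotations match the task and the accepted input."""
    return [
        service for service in registry
        if service.task == task
        and service.input_data == input_data
        and service.input_format == input_format
    ]


matches = find_services(REGISTRY, "multiple sequence alignment", "protein sequence", "FASTA")
print([service.name for service in matches])
```

Note that the query never mentions a service name, which is the point of annotating services by function.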

43
myGrid Ontology
  • Logically separated into two parts
  • Service ontology
  • Physical and operational features of web
    services
  • Domain ontology
  • Vocabulary for core bioinformatics data, data
    types and their relationships
  • Ontology developed in OWL

44
myGrid ontology
  • Example: BLAST (from the DDBJ)
  • Performs task: Alignment
  • Uses method: Similarity Search Algorithm
  • Uses resources: DNA/Protein sequence databases
  • Inputs:
  • biological sequence
  • database name
  • blast program
  • Outputs: Blast Report
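
Once a service is described this way, a client can check a call against the description before invoking anything. The sketch below encodes the BLAST annotation from this slide as a plain dictionary (the real ontology is in OWL, which is not reproduced here); the key names and the `missing_inputs` helper are assumptions for illustration.

```python
# The annotation values come from the slide; the dictionary layout is hypothetical.
BLAST_DESCRIPTION = {
    "name": "BLAST",
    "provider": "DDBJ",
    "performs_task": "Alignment",
    "uses_method": "Similarity Search Algorithm",
    "uses_resources": "DNA/Protein sequence databases",
    "inputs": ["biological sequence", "database name", "blast program"],
    "outputs": ["Blast Report"],
}


def missing_inputs(description: dict, supplied: dict) -> list:
    """List the described inputs that the caller has not supplied."""
    return [name for name in description["inputs"] if name not in supplied]


# Hypothetical call that forgets to name the BLAST program.
call = {"biological sequence": "MKTAYIAKQR", "database name": "swissprot"}
print(missing_inputs(BLAST_DESCRIPTION, call))
```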

45
Feta Search Result
46
BioCatalogue: Joint Manchester-EBI
[Diagram: four sources of curation, each with a seed / refine / validate
cycle: Curation by Developers, Curation by Experts, Automated Curation,
Curation by the Community]
47
Summary
  • Taverna workflows
  • Combine local and remote resources and analysis
    tools
  • Automate multi-step processes
  • Iterate over large data sets
  • myExperiment
  • Provides reusable protocols for in silico science
  • Enables sharing of workflows and expertise
  • Provides an alternative way of running workflows
  • Not everyone who runs workflows wants to build
    workflows or see workflows running

48
Acknowledgements
  • Carole Goble, Norman Paton, Robert Stevens, Anil
    Wipat, David De Roure, Steve Pettifer
  • OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele
    Turi, June Finch, Stuart Owen, David Withers,
    Stian Soiland, Franck Tanoh, Matthew Gamble, Alan
    Williams, Ian Dunlop
  • Research: Martin Szomszor, Duncan Hull, Jun Zhao,
    Pinar Alper, Antoon Goderis, Alastair Hampshire,
    Qiuwei Yu, Wang Kaixuan.
  • Current contributors: Matthew Pocock, James Marsh,
    Khalid Belhajjame, PsyGrid project, Bergen
    people, EMBRACE people.
  • User Advocates and their bosses: Simon Pearce,
    Claire Jennings, Hannah Tipney, May Tassabehji,
    Andy Brass, Paul Fisher, Peter Li, Simon Hubbard,
    Tracy Craddock, Doug Kell, Marco Roos, Matthew
    Pocock, Mark Wilkinson
  • Past Contributors: Matthew Addis, Nedim Alpdemir,
    Tim Carver, Rich Cawley, Neil Davis, Alvaro
    Fernandes, Justin Ferris, Robert Gaizaukaus,
    Kevin Glover, Chris Greenhalgh, Mark Greenwood,
    Yikun Guo, Ananth Krishna, Phillip Lord, Darren
    Marvin, Simon Miles, Luc Moreau, Arijit
    Mukherjee, Juri Papay, Savas Parastatidis, Milena
    Radenkovic, Stefan Rennick-Egglestone, Peter
    Rice, Martin Senger, Nick Sharman, Victor Tan,
    Paul Watson, and Chris Wroe.
  • Industrial: Dennis Quan, Sean Martin, Michael
    Niemi (IBM), Chimatica.
  • Funding: EPSRC, Wellcome Trust.

http://www.mygrid.org.uk http://www.myexperiment.org