Title: Paul Fisher
1An Introduction to Web Services and Scientific
Workflows
- Paul Fisher
- University of Manchester
2Overview
- Current analysis techniques
- Issues with manual analyses
- Web Services
- Workflows as Scientific protocols
- Workflow sharing, re-use, and repurposing in myExperiment
- Service discovery with Feta and BioCatalogue
- Later: hands-on practical session
3Manual analysis techniques
- Nucleic Acids Research (2009): over 1170 databases
- Specialist software applications
- Navigating between software resources
- Cut and Paste of data
- Screen scraping of Web pages
- Scripting in Perl / Java / Python / C
- insert another language here
4Manual Methods of data analysis
5Issues in analysis techniques
6Manual Methods of data analysis
No explicit methods
Tedious and repetitive
Human error
Navigating through hyperlinks
7Implicit methods
8Huge amounts of data
Region on chromosome
Microarray
1000 Genes
200 Genes
How do I look at ALL the genes systematically?
9Hypothesis-Driven Analyses
200 genes
Pick the genes involved in immunological process
Cherry Pick genes
40 genes
Pick the genes that I am most familiar with
2 genes
Biased view
10Issues with current approaches
- Scale of analysis task overwhelms researchers: lots of data
- User bias and premature filtering of datasets: cherry picking
- Hypothesis-driven approach to data analysis
- Constant changes in data: problems with re-analysis of data
- Implicit methodologies (hyper-linking through web pages)
- Error proliferation from any of the listed issues, notably human error
- Solution: automate
11Automate using the Two Ws
- Web Services
- Technology and standard for exposing code and data resources by a means that a third party can consume remotely
- Describes how to interact with it, e.g. service parameters
- Workflows
- General technique for describing and executing a process
- Describes what you want to do, including the services to use
12Web Services
[Diagram: a Client Application and a Remote Application exchange SOAP messages as HTTP Requests and HTTP Responses; the WSDL describes the Remote Application's interface.]
13Web Services Description Language
- Web Services Description Language (WSDL) gives a computer program enough information to call, or send data to, a remote resource
- XML-based language
- Can be used from most mainstream programming languages, including Perl, C, and Java
- Tells external programs how to call a remote service: it exposes the function calls
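As a rough illustration of what a client sends once it has read a WSDL, the sketch below builds a SOAP 1.1 request envelope with Python's standard library. The service namespace, the `fetchSequence` operation, and the `accession` parameter are hypothetical and are not taken from any real service; only the SOAP envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "http://example.org/seqfetch"  # hypothetical service namespace


def build_soap_request(operation: str, params: dict[str, str]) -> bytes:
    """Wrap one operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical operation and parameter names, for illustration only.
request_xml = build_soap_request("fetchSequence", {"accession": "P12345"})

# Parse the request back to show that it is well-formed XML.
parsed = ET.fromstring(request_xml)
print(parsed.tag)
```

A real client would POST `request_xml` over HTTP to the endpoint address given in the WSDL and parse the SOAP response in the same way.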
14Programmatic Interfaces to Services (Web Services, not Web Sites)
SeqFetch Service
GO Service
BLAT Service
BLAST Service
SeqFetch Service
Interface Description Document WSDL WADL
Service Registry
Web Service
Your Workflow
Your Application
Your Script
15What types of service?
- WSDL Web Services
- BioMart
- R-processor
- BioMoby
- Soaplab
- Local Java services
- Beanshell
- Workflows
16Workflows
- Collection of tasks chained together to perform one overall operation, e.g. the morning ritual workflow
- Get up
- Have a wash
- Get dressed
- Eat breakfast
- Clean teeth
- Go to lectures
- High-level description of your experiment
- Inputs, programs, outputs (and intermediate inputs and outputs)
- Workflow is the model of the experiment
- Methods section in your publication
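The idea of a workflow as chained tasks with recorded intermediate outputs can be sketched in a few lines of Python. This is a toy model of the morning ritual example, not how Taverna represents workflows; `run_workflow` and `done` are names invented for this illustration.

```python
from typing import Callable

Task = Callable[[list[str]], list[str]]


def run_workflow(steps: list[tuple[str, Task]], data: list[str]) -> dict[str, list[str]]:
    """Run the steps in order and record the output of every step."""
    trace = {"input": list(data)}
    for name, task in steps:
        data = task(data)  # the output of one step is the input of the next
        trace[name] = list(data)
    return trace


def done(action: str) -> Task:
    """Build a toy task that appends its own name to the running state."""
    return lambda state: state + [action]


actions = ["get up", "have a wash", "get dressed",
           "eat breakfast", "clean teeth", "go to lectures"]
morning_ritual = [(action, done(action)) for action in actions]

trace = run_workflow(morning_ritual, [])
print(trace["get dressed"])
```

Because the list of steps is explicit and every intermediate output is kept, the `trace` plays the role of a methods section: it records what was done, and in what order.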
18 What is a Workflow?
- Workflows provide a general technique for describing and enacting a process
- Describes what you want to do, and how you want to do it
- Specifies how bioinformatics processes fit together
- Processes are represented as web services
Remove repeats
Find orthologues
Find genes
19Taverna
20The Taverna Workflow Workbench
21What is Taverna?
- "Taverna enables the interoperation between databases and tools by providing a toolkit for composing, executing and managing workflow experiments." Someone (sometime)
- OR
- "Allows you to build and run workflows." Paul Fisher (2009)
- Access to local and remote resources and analysis tools
- Automation of data flow between services
- Iteration over large data sets, and so on
http://www.mygrid.org.uk/
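Automated data flow and iteration over a data set can be pictured as calling one service per list element and passing the results straight to the next service. The sketch below is plain Python, not Taverna code; `gc_content` and `label` are local stand-ins for remote services and are assumptions made for this example.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def iterate(service: Callable[[A], B], inputs: list[A]) -> list[B]:
    """Call the service once per list element, preserving order."""
    return [service(item) for item in inputs]


def gc_content(sequence: str) -> float:
    """Stand-in for a remote analysis service: fraction of G and C bases."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def label(gc: float) -> str:
    """Stand-in for a second service that consumes the first one's output."""
    return "GC-rich" if gc > 0.5 else "AT-rich or balanced"


sequences = ["GGCC", "ATAT", "ATGC"]
gc_values = iterate(gc_content, sequences)  # one call per sequence
labels = iterate(label, gc_values)          # data flows on without cut and paste
print(list(zip(sequences, gc_values, labels)))
```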
22Taverna Workflow Workbench
23Who uses Taverna?
- Over 60,000 downloads
- Systems biology
- Medical image analysis
- Heart simulations
- High throughput screening
- Genotype/Phenotype studies
- Health Informatics
- Astronomy
- Chemoinformatics
NOT FOR BIOLOGISTS!!!!
Prof. Andy Brass
Designed for informaticians: computer-savvy people
24What do Scientists use Taverna for?
- Data gathering and annotating
- Distributed data and knowledge
- Building models and knowledge management
- Populating SBML or hypothesis generation
- Data analysis
- Distributed analysis tools and high throughput
25Data Gathering
- Collecting evidence from lots of places
- Accessing local and remote databases, extracting
info and displaying a unified view to the user
Lots of outputs!!
26Annotation Pipelines
- Genome annotation pipelines
- Workflow assembles evidence for predicted genes / potential functions
- Human expert can review this evidence before submission to the genome database
- Data warehouse pipelines
- e-Fungi model organism warehouse
- ISPIDER proteomics warehouse
- Annotating the up/down regulated genes in a microarray experiment
27Systems Biology Model Construction
Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences, to create and manipulate Systems Biology Markup Language (SBML) models.
28Integration of microarray data with SBML
Read enzyme names from SBML
Query maxdLoad2 using enzyme names
Calculate colours based on gene expression level
Create new SBML model with new colour nodes
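The four steps above can be sketched with Python's standard XML library. Everything specific here is an assumption made for illustration: the toy model, the choice to read enzyme names from `species` `name` attributes, the made-up expression values that stand in for a maxdLoad2 query, the colour thresholds, and the `colour` annotation element in a hypothetical namespace. It is not the original workflow.

```python
import xml.etree.ElementTree as ET

SBML_NS = "http://www.sbml.org/sbml/level2"
COLOUR_NS = "http://example.org/colour"  # hypothetical annotation namespace

# Toy model for illustration only; not a complete, valid SBML document.
TOY_SBML = f"""<sbml xmlns="{SBML_NS}" level="2" version="1">
  <model id="toy">
    <listOfSpecies>
      <species id="s1" name="hexokinase"/>
      <species id="s2" name="enolase"/>
      <species id="s3" name="aldolase"/>
    </listOfSpecies>
  </model>
</sbml>"""

# Stand-in for the database query: enzyme name -> log2 expression ratio (made-up values).
EXPRESSION = {"hexokinase": 1.8, "enolase": -2.1, "aldolase": 0.1}


def colour_for(log_ratio: float, threshold: float = 1.0) -> str:
    """Map an expression ratio to a colour: up is red, down is green, else grey."""
    if log_ratio >= threshold:
        return "red"
    if log_ratio <= -threshold:
        return "green"
    return "grey"


def colour_model(sbml_text: str, expression: dict[str, float]) -> ET.Element:
    """Read enzyme names, look up expression, and attach a colour to each node."""
    root = ET.fromstring(sbml_text)
    for species in root.iter(f"{{{SBML_NS}}}species"):
        name = species.get("name")
        if name in expression:
            annotation = ET.SubElement(species, f"{{{SBML_NS}}}annotation")
            colour = ET.SubElement(annotation, f"{{{COLOUR_NS}}}colour")
            colour.text = colour_for(expression[name])
    return root


coloured = colour_model(TOY_SBML, EXPRESSION)
colours = {
    s.get("name"): s.find(f".//{{{COLOUR_NS}}}colour").text
    for s in coloured.iter(f"{{{SBML_NS}}}species")
}
print(colours)
```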
29Data Analysis
- Access to local and remote analysis tools
- You start with your own data / public data of interest
- You need to analyse it to extract biological knowledge
30Trypanosomiasis in Africa
Steve Kemp
Andy Brass
many Others
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
31A Systematic Strategy for Large-Scale Analysis of Genotype-Phenotype Correlations: Identification of Candidate Genes Involved in African Trypanosomiasis. Fisher et al. (2007) Nucleic Acids Research, doi:10.1093/nar/gkm623
- Explicitly discusses the methods we used for the Trypanosomiasis use case
- Discusses the results for Daxx and shows the mutation
- Sharing of workflows for re-use, re-purposing
32Trichuris muris
- Mouse whipworm infection: parasite model of the human parasite Trichuris trichiura
- Understanding phenotype
- Comparing resistant vs. susceptible strains: microarrays
- Understanding genotype
- Mapping quantitative traits: classical genetics, QTL (regions of chromosome)
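One simple way to combine the two kinds of evidence systematically is to take every gene inside the QTL region and every differentially expressed gene from the microarray, then intersect the two sets. The sketch below is an illustration under stated assumptions, not the published workflow: the gene names, coordinates, expression ratios, and the threshold are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chromosome: str
    start: int
    end: int


def genes_in_region(genes: list[Gene], chromosome: str, start: int, end: int) -> set[str]:
    """Genotype side: every gene that overlaps the QTL region."""
    return {
        g.name for g in genes
        if g.chromosome == chromosome and g.start <= end and g.end >= start
    }


def differentially_expressed(log_ratios: dict[str, float], threshold: float = 1.0) -> set[str]:
    """Phenotype side: every gene whose expression changes by at least the threshold."""
    return {name for name, ratio in log_ratios.items() if abs(ratio) >= threshold}


# Made-up genes, coordinates, and log2 ratios, for illustration only.
genes = [
    Gene("gene_a", "17", 100, 200),
    Gene("gene_b", "17", 500, 900),
    Gene("gene_c", "17", 5000, 5200),
    Gene("gene_d", "1", 150, 300),
]
log_ratios = {"gene_a": 0.2, "gene_b": -1.7, "gene_c": 2.3, "gene_d": 1.4}

in_qtl = genes_in_region(genes, "17", 0, 1000)
changed = differentially_expressed(log_ratios)
candidates = in_qtl & changed  # no gene is dropped by hand before this point
print(sorted(candidates))
```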
33Recycling, Reuse, Repurposing
Here's the Science!
- Identified a candidate gene (Daxx) for Trypanosomiasis resistance.
- Manual analysis on the microarray and QTL data failed to identify this gene as a candidate.
- Unbiased analysis. Confirmed by the wet lab.
Here's the e-Science!
- Trypanosomiasis mouse workflow reused without change in Trichuris muris infection in mice
- Identified biological pathways involved in sex dependence
- Previous manual two-year study of candidate genes had failed to do this.
Workflows now being run over Colitis / Inflammatory Bowel Disease in mice (without change)
34Was the Workflow Approach Successful?
- Scale of analysis task overwhelms researchers: lots of data
- Handled by computers
- User bias and premature filtering of datasets: cherry picking
- All data processed systematically
- Hypothesis-driven approach to data analysis
- Computers know nothing of hypotheses and so process the data independent of any prior judgments
- Constant changes in data: problems with re-analysis of data
- Saved workflow can be re-run at any point, over new data sets
- Implicit methodologies (hyper-linking through web pages)
- Methodology has been captured in the workflow itself
35Social Networking for Scientists
36Recycling, Reuse, Repurposing
- Share
- Search
- Re-use
- Re-purpose
- Execute
- Communicate
- Record
http://www.myexperiment.org/
37Sharing Experiments
- myGrid supports the in silico experimental process for individual scientists
- How do you share your results/experiments/experiences with your:
- Research group
- Collaborators
- Scientific community
- How do you compare your results with others
produced by e.g. Kepler / Triana?
39Just Enough Sharing.
- myExperiment can provide a central location for workflows from one community/group
- myExperiment allows you to say:
- Who can look at your workflow
- Who can download your workflow
- Who can modify your workflow
- Who can run your workflow
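The four questions above amount to a per-workflow access policy. The sketch below models such a policy in plain Python; it is not the myExperiment API, and the class, method, user, and group names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    VIEW = "view"
    DOWNLOAD = "download"
    MODIFY = "modify"
    RUN = "run"


@dataclass
class SharingPolicy:
    """Hypothetical per-workflow policy: which users or groups may do what."""
    owner: str
    grants: dict[Action, set[str]] = field(default_factory=dict)

    def allow(self, action: Action, *names: str) -> None:
        """Grant an action to the named users or groups."""
        self.grants.setdefault(action, set()).update(names)

    def permits(self, user: str, groups: set[str], action: Action) -> bool:
        """The owner may do anything; others need a grant by name or by group."""
        if user == self.owner:
            return True
        allowed = self.grants.get(action, set())
        return user in allowed or bool(groups & allowed)


policy = SharingPolicy(owner="alice")
policy.allow(Action.VIEW, "everyone")
policy.allow(Action.DOWNLOAD, "research-group")
policy.allow(Action.RUN, "research-group", "bob")

print(policy.permits("carol", {"everyone"}, Action.VIEW))
print(policy.permits("carol", {"everyone"}, Action.DOWNLOAD))
```

Keeping the four actions separate is what makes "just enough" sharing possible: a workflow can be visible to everyone while remaining downloadable or runnable only by a chosen group.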
40Remote Execution of Workflows
41Service Discovery
42Finding Services
- There are over 3500 distributed services. How do we find an appropriate one?
- We need to annotate services by their functions (and not their names!)
- The services might be distributed, but a registry of service descriptions can be central and queried
- Annotated with terms from the myGrid ontology
- Questions we can ask: "Find me all the services that perform a multiple sequence alignment and accept protein sequences in FASTA format as input"
43myGrid Ontology
- Logically separated into two parts
- Service ontology
- Physical and operational features of web services
- Domain ontology
- Vocabulary for core bioinformatics data, data types and their relationships
- Ontology developed in OWL
44myGrid ontology
- Example: BLAST (from the DDBJ)
- Performs task: Alignment
- Uses method: Similarity Search Algorithm
- Uses resources: DNA/Protein sequence databases
- Inputs:
- biological sequence
- database name
- blast program
- Outputs: Blast Report
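A registry of annotated service descriptions can be queried by function rather than by name. The sketch below stores the BLAST annotation from this slide as a plain record and runs a function-based query of the kind described earlier (multiple sequence alignment, protein FASTA input). It is not the Feta or BioCatalogue API: the record layout, the `find_services` helper, and the second registry entry are hypothetical, and plain strings stand in for ontology terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    task: str
    method: str
    inputs: frozenset[str]
    outputs: frozenset[str]


REGISTRY = [
    # Annotation taken from the BLAST example on this slide.
    ServiceDescription(
        name="BLAST (DDBJ)",
        task="alignment",
        method="similarity search algorithm",
        inputs=frozenset({"biological sequence", "database name", "blast program"}),
        outputs=frozenset({"blast report"}),
    ),
    # Hypothetical entry, added only so the query below has something to find.
    ServiceDescription(
        name="example_msa_service",
        task="multiple sequence alignment",
        method="progressive alignment",
        inputs=frozenset({"protein sequences in FASTA format"}),
        outputs=frozenset({"alignment report"}),
    ),
]


def find_services(registry: list[ServiceDescription], task: str, required_input: str) -> list[str]:
    """Return the names of services that perform the task and accept the input."""
    return [
        s.name for s in registry
        if s.task == task and required_input in s.inputs
    ]


matches = find_services(REGISTRY, "multiple sequence alignment",
                        "protein sequences in FASTA format")
print(matches)
```

Exact string matching is the weakest part of this sketch; an ontology lets a registry also return services annotated with more specific terms than the one asked for.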
45Feta Search Result
46BioCatalogue: Joint Manchester-EBI
[Diagram: four curation routes (Curation by Developers, Curation by Experts, Automated Curation, Curation by the Community), each labelled seed, refine, validate.]
47Summary
- Taverna workflows
- Combine local and remote resources and analysis tools
- Automate multi-step processes
- Iterate over large data sets
- myExperiment
- Provides reusable protocols for in silico science
- Enables sharing of workflows and expertise
- Provides an alternative way of running workflows
- Not everyone who runs workflows wants to build
workflows or see workflows running
48Acknowledgements
- Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
- OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
- Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
- Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
- User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
- Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
- Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
- Funding: EPSRC, Wellcome Trust.
http://www.mygrid.org.uk http://www.myexperiment.org