10'30 11'00 Introduction to myGrid - PowerPoint PPT Presentation

1 / 44
About This Presentation
Title:

10'30 11'00 Introduction to myGrid

Description:

2.40 3.00 sharing my workflows. Taverna training day in Sheffield. An Introduction to Taverna Workflows ... Steve Kemp. Paul Fisher. Trypanosomiasis Study ... – PowerPoint PPT presentation

Number of Views:34
Avg rating:3.0/5.0
Slides: 45
Provided by: Kat121
Category:

less

Transcript and Presenter's Notes

Title: 10'30 11'00 Introduction to myGrid


1
Taverna training day in Sheffield
10.30 - 11.00 Introduction to myGrid 11.00
12.30 tutorial (part 1) 12.30 1.15 Break 1.15
2.30 Tutorial (part2) 2.30 2.40 break 2.40
3.00 sharing my workflows
2
An Introduction to Taverna Workflows
  • Franck Tanoh
  • myGrid
  • University of Manchester

3
What is myGrid?
  • myGrid is a suite components to support in silico
    experiments in biology
  • Taverna workbench myGrid user interface
  • Originally designed to support bioinformatics
  • Expanded into new areas
  • Chemoinformatics
  • Health Informatics
  • Medical Imaging
  • Integrative Biology
  • Open source and always will be

4
History
EPSRC funded UK eScience Program Pilot Project
5
OMII-UK
  • University of Manchester (myGrid) joined with the
    Universities of Edinburgh (OGSA-DAI) and
    Southampton (OMII phase 1) in March 2006
  • OMII-UK aims to provide software and support to
    enable a sustained future for the UK e-Science
    community and its international collaborators.
  • A guarantee of development and support

6
The Life Science Community
  • In silico Biology is an open Community
  • Open access to data
  • Open access to resources
  • Open access to tools
  • Open access to applications
  • Global in silico biological research

7
The Community Problems
  • Everything is Distributed
  • Data, Resources and Scientists
  • Heterogeneous data
  • Very few standards
  • I/O formats, data representation, annotation
  • Everything is a string!
  • Integration of data and interoperability of
    resources is difficult

8
Lots of Resources
NAR 2007 968 databases
9
Traditional Bioinformatics
10
Cutting and Pasting
  • Advantages
  • Low Technology on both server and client side
  • Very Robust Hard to break.
  • Data Integration happens along the way
  • Disadvantages
  • Time Consuming (and painful!)
  • Can be repeated rarely
  • Limited to small data sets.
  • Error Prone
  • Poor repeatability
  • How do you do this for a genome/proteome/metabolom
    e of information!

11
Pipeline Programming
  • Advantages
  • Repeatable
  • Allows automation
  • Quick, reliable, efficient
  • Disadvantages
  • Requires programming skills
  • Difficult to modify
  • Requires local tool and database installation
  • Requires tool and database maintenance!!!

12
What we want as a solution
  • A system that is
  • Allows automation
  • Allows easy repetition, verification and sharing
    of experiments
  • Works on distributed resource
  • Requires few programming skills
  • Runs on a local desktop / laptop

13
myGrid as a solution
  • myGrid allows the automated orchestration of in
    silico experiments over distributed resources
    from the scientists desktop
  • Built on computer science technologies of
  • Web services
  • Workflows
  • Semantic web technologies

14
Web Services
  • Web services support machine-to-machine
    interaction over a network.
  • Web services are a
  • technology and standard for exposing code /
    databases with an API that can be consumed by a
    third party remotely.
  • describes how to interact with it.
  • They are
  • Self-contained
  • Self-describing
  • Modular
  • Platform independent

15
Workflows
  • General technique for describing and enacting a
    process
  • Describes what you want to do, not how you want
    to do it
  • High level description of the experiment

16
Workflows
  • Workflow language specifies how bioinformatics
    processes fit together.
  • High level workflow diagram separated from any
    lower level coding you dont have to be a coder
    to build workflows.
  • Workflow is a kind of script or protocol that you
    configure when you run it.
  • Easier to explain, share, relocate, reuse and
    repurpose.
  • Workflow ltgt Model
  • Workflow is the integrator of knowledge
  • The METHODS section of a scientific publication

17
Workflow Advantages
  • Automation
  • Capturing processes in an explicit manner
  • Tedium! Computers dont get bored/distracted/hungr
    y/impatient!
  • Saves repeated time and effort
  • Modification, maintenance, substitution and
    personalisation
  • Easy to share, explain, relocate, reuse and build
  • Releases Scientists/Bioinformaticians to do other
    work
  • Record
  • Provenance what the data is like, where it came
    from, its quality
  • Management of data (LSID - Life Science
    Identifiers)

18
Different Workflow Systems
  • Kepler
  • Triana
  • DiscoveryNet
  • Taverna
  • Geodise
  • Pegasus
  • Pipeline Pilot
  • Each has differences in action, language, access
    restrictions, subject areas

19
Taverna Workflow Components
Scufl Simple Conceptual Unified Flow
Language Taverna Writing, running workflows
examining results SOAPLAB Makes applications
available
20
An Open World
  • Open domain services and resources.
  • Taverna accesses 3000 services
  • Third party we dont own them we didnt build
    them
  • All the major providers
  • NCBI, DDBJ, EBI
  • Enforce NO common data model.
  • Quality Web Services considered desirable

21
Services Landscape
SoapLab
BioMoby
Standard soap services
Taverna web services
API consumer
Biomart
Local Java
Beanshell
Nested workflow
22
Shield the Scientist Bury the Complexity
Workflow enactor
23
What can you do with myGrid?
  • gt40000 downloads
  • Users worldwide
  • US, Singapore, UK, Europe, Australia
  • Systems biology
  • Proteomics
  • Gene/protein annotation
  • Microarray data analysis
  • Medical image analysis
  • Heart simulations
  • High throughput screening
  • Genotype/Phenotype studies
  • Health Informatics
  • Astronomy
  • Chemoinformatics
  • Data integration

24
Trypanosomiasis in Africa
Steve Kemp
Andy Brass
Paul Fisher
http//www.genomics.liv.ac.uk/tryps/trypsindex.htm
l
25
Trypanosomiasis Study
  • A form of Sleeping sickness in cattle Known as
    ngana
  • Caused by Trypanosoma brucei
  • Can we breed cattle resistant to ngana
    infection?
  • What are the causes of the differences between
    resistant and susceptible strains?

26
Trypanosomiasis Study
  • Understanding Phenotype
  • Comparing resistant vs susceptible strains
    Microarrays
  • Understanding Genotype
  • Mapping quantitative traits Classical genetics
    QTL

Need to access microarray data, genomic sequence
information, pathway databases AND integrate the
results
27
Key A Retrieve genes in QTL region B
Annotate genes with external database Ids C
Cross-reference Ids with KEGG gene ids D
Retrieve microarray data from MaxD database E
For each KEGG gene get the pathways its involved
in F For each pathway get a description of what
it does G For each KEGG gene get a description
of what it does
28
Results
  • Identified a pathway for which its correlating
    gene (Daxx) is believed to play a role in
    trypanosomiasis resistance.
  • Manual analysis on the microarray and QTL data
    had failed to identify this gene as a candidate.

29
Why was the Workflow Approach Successful?
  • Workflow analysed each piece of data
    systematically
  • Eliminated user bias and premature filtering of
    datasets and results leading to single sided,
    expert-driven hypotheses
  • The size of the QTL and amount of the microarray
    data made a manual approach impractical
  • Workflows capture exactly where data came from
    and how it was analysed
  • Workflow output produced a manageable amount of
    data for the biologists to interpret and verify
  • make sense of this data -gt does this make
    sense?

30
Trichuris muris (mouse whipworm) infection
parasite model of the human parasite - Trichuris
trichuria)
  • Identified the biological pathways involved in
    sex dependence in the mouse model, previously
    believed to be involved in the ability of mice to
    expel the parasite.
  • Manual experimentation Two year study of
    candidate genes, processes unidentified
  • Workflows trypanosomiasis cattle experiment was
    reused without change.
  • Analysis of the resulting data by a biologist
    found the processes in a couple of days.

Joanne Pennock, Richard Grencis University of
manchester
31
Workflow Reuse Workflows are Scientific
Protocols Share them!
32
A workflow marketplace
33
A Practical Guide to Building and Managing in
silico Experiments
34
Semantic Web Technologies
  • myGrid built on Web Services, Workflows AND
    semantic web technologies
  • Semantic web technologies are used to
  • Find appropriate services during workflow design
  • Find similar workflows for reuse and repurposing
  • Record the process and outcome of an experiment,
    in context
  • -gtgtgtgt the experimental provenance

35
Finding Services
  • There are over 3000 distributed services. How do
    we find an appropriate one?
  • Find services by their function instead of their
    name
  • We need to annotate services by their functions.
  • The services might be distributed, but a registry
    of service descriptions can be central and
    queried

36
Feta Semantic Discovery
  • Feta is the myGrid component that can query the
    service annotations and find services
  • Questions we can ask
  • Find me all the services that perform a multiple
    sequence alignment And accept protein sequences
    in FASTA format as input

37
myGrid Ontology
Specialises
38
Annotations
  • Feta has been available for over a year
  • Only just been included in the release
  • Need critical mass of service annotations before
    release
  • By demonstrating the use of service annotation,
    we aim to encourage service providers to provide
    the annotations in the future
  • Annotation experiments with users and domain
    experts
  • Domain expert annotations much better
  • We now have a domain expert for full-time service
    annotation

39
Data Management
  • Workflows can generate vast amount of data - how
    can we manage and track it?
  • We need to manage
  • data AND
  • metadata AND
  • experiment provenance
  • Workflow experiments may consist of many
    workflows of the same, or different experiments.
  • Scientists need to check back over past results,
    compare workflow runs and share workflow runs
    with colleagues

40
(No Transcript)
41
Provenance the myGrid logbook
Smart Tea
  • Who, What, Where, When, Why?, How?
  • Context
  • Interpretation
  • Logging Debugging
  • Reproducibility and repeatability
  • Evidence Audit
  • Non-repudiation
  • Credit and Attribution
  • Credibility
  • Accurate reuse and interpretation
  • Just good scientific practice

BioMOBY
42
Conclusions
  • Web services and workflows are powerful
    technologies for in silico science
  • automation
  • high throughput experiments
  • systematic analysis
  • Interoperability of distributed resources

43
Contact Us
  • Taverna development is user-driven
  • Please tell us what you would like to see via the
    mailing lists
  • Taverna-Users and Taverna-Hackers
  • Download software and find out more at
  • http//www.mygrid.org.uk
  • http//taverna.sourceforge.net

44
myGrid acknowledgements
  • Carole Goble, Norman Paton, Robert Stevens, Anil
    Wipat, David De Roure, Steve Pettifer
  • OMII-UK Tom Oinn, Katy Wolstencroft, Daniele
    Turi, June Finch, Stuart Owen, David Withers,
    Stian Soiland, Franck Tanoh, Matthew Gamble, Alan
    Williams, Ian Dunlop
  • Research Martin Szomszor, Duncan Hull, Jun Zhao,
    Pinar Alper, Antoon Goderis, Alastair Hampshire,
    Qiuwei Yu, Wang Kaixuan.
  • Current contributors Matthew Pocock, James Marsh,
    Khalid Belhajjame, PsyGrid project, Bergen
    people, EMBRACE people.
  • User Advocates and their bosses Simon Pearce,
    Claire Jennings, Hannah Tipney, May Tassabehji,
    Andy Brass, Paul Fisher, Peter Li, Simon Hubbard,
    Tracy Craddock, Doug Kell, Marco Roos, Matthew
    Pocock, Mark Wilkinson
  • Past Contributors Matthew Addis, Nedim Alpdemir,
    Tim Carver, Rich Cawley, Neil Davis, Alvaro
    Fernandes, Justin Ferris, Robert Gaizaukaus,
    Kevin Glover, Chris Greenhalgh, Mark Greenwood,
    Yikun Guo, Ananth Krishna, Phillip Lord, Darren
    Marvin, Simon Miles, Luc Moreau, Arijit
    Mukherjee, Juri Papay, Savas Parastatidis, Milena
    Radenkovic, Stefan Rennick-Egglestone, Peter
    Rice, Martin Senger, Nick Sharman, Victor Tan,
    Paul Watson, and Chris Wroe.
  • Industrial Dennis Quan, Sean Martin, Michael
    Niemi (IBM), Chimatica.
  • Funding EPSRC, Wellcome Trust.
Write a Comment
User Comments (0)
About PowerShow.com