1
Knowledge-based Middleware for BioGrid services
from the myGrid Project.
  • Professor Carole Goble and the
  • myGrid consortium
  • http://www.mygrid.org.uk

2
The Grid Problem
(Foster, Kesselman, Tuecke)
  • "flexible, secure, coordinated resource sharing
    among dynamic collections of individuals,
    institutions, and resources - what we refer to as
    virtual organizations."

Resources: computers, databases, archives,
people, instruments, workflow repositories,
personal notes in online lab books, web pages,
institutions
3
Data-intensive bioinformatics
Source: GlaxoSmithKline
ID   MURA_BACSU STANDARD PRT 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA FIRMICUTES BACILLUS/CLOSTRIDIUM GROUP BACILLACEAE
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS CELL WALL TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA 46016 MW 02018C5C CRC32
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
4
Obstacles to integration
  • Access to and understanding of distributed,
    heterogeneous information resources and
    applications
  • 100s of relevant information sources,
  • An explosion in availability of experimental
    data, scientists' annotations, text documents:
    abstracts, eJournal articles, monthly reports,
    patents, ...
  • Rapidly changing domain concepts and terminology
    and analysis approaches
  • Constantly evolving data structures and data
  • Continuous creation of new data sources
  • Highly heterogeneous sources and applications
  • Different policies for access, security,
  • Data and results of uneven quality, depth, scope

5
Obstacles to integration
A MIDDLEWARE SOLUTION!
6
Collaboration, collaboration
  • e-Collaborations
  • Virtual Organisations
  • Collaboration for understanding the
    data/information and consensus is essential
  • Within the Organisation
  • across the organisation functionally and
    geographically (world-wide)
  • along the pipeline and up the hierarchy
  • Externally With Others
  • Sharing knowledge and expertise
  • Personalised Workspace
  • Leverage resources of the entire organisation and
    external partners, but target the needs/interests
    of individual scientist
  • Find the right information for the current
    investigation
  • Discovery of information/expertise that was not
    explicitly sought
  • Visualisation of data/information
  • Capture work flow and analysis processes of
    investigators

7
Building the IT Environment
  • Eliminate redundant application development and
    use best of breed
  • Build components/services, not one-off
    applications
  • Components/services must be visible to the
    organisation (not hidden in libraries)
  • Ease of use of components
  • Standard interfaces and objects promote a
    component/service marketplace - aids the build vs
    buy decision
  • Standard service and object descriptions through
    industry and community consortia

8
myGrid
  • EPSRC UK e-Science pilot project
  • Open Source Upper Middleware for Bioinformatics
  • (Web) Service-based architecture -> OGSA Grid
    services
  • Prototype v1 Release Oct 2003, some services
    available now.
  • Targeted at Tool Developers, Bioinformaticians
    and Service Providers

Newcastle
Sheffield
Manchester
Nottingham
Hinxton
Southampton
9
Graves disease
Application Drivers
  • Autoimmune disease of the thyroid in which the
    immune system of an individual attacks cells in
    the thyroid gland resulting in hyperthyroidism
  • Weight loss, trembling, muscle weakness,
    increased pulse rate, increased sweating and heat
    intolerance, goitre, exophthalmos

10
Biology working together with
Autoimmune Antibodies attach to TSH receptors,
competing with TSH
  • Graves Disease caused by the stimulation of the
    thyrotrophin receptor by thyroid-stimulating
    autoantibodies secreted by lymphocytes of the
    immune system.
  • What is the molecular basis for this autoimmune
    response?

What genes might be associated with Graves
Disease? Affymetrix microarray studies
11
Bioinformatics
Peter Li (1), Claire Jennings (2), Simon Pearce (2) and
Anil Wipat (1) (2003).
(1) School of Computing Science and (2) Institute of Human
Genetics, University of Newcastle-upon-Tyne.
Candidate gene pool
Annotation Pipeline
Genotype Assay Design System
3D Protein Structure
What is known about my candidate gene?
What is the structure of the protein product
encoded by my candidate gene?
Select a SNP from candidate gene. Is this SNP
associated with Disease?
Medline
Gene ID
Primer Design
GO
EMBL
Emboss Eprimer application in SoapLab
Use primers designed by myGrid to amplify region
flanking SNP on the gene
SNP
Query
Restriction Fragment Length Polymorphism
experiment
OMIM
BLAST
Selection of restriction enzyme
Talisman
Emboss Restrict in SoapLab
DQP
SNP
12
Workflows are in silico experiments
http://cvs.mygrid.org.uk/scufl/NucleotideSeqAnnotationPipelineWithGoTerms/
Experimental orchestration
  • Exploratory
  • Hypothesis driven
  • Not prescriptive
  • Methodology free
  • Ad hoc
13
Experiment life cycle
Forming experiments
  • Resource and service discovery
  • Repository creation
  • Workflow creation
  • Database query formation
Personalisation
  • Personalised registries
  • Personalised workflows
  • Info repository views
  • Personalised annotations
  • Personalised metadata
  • Security
Discovering and reusing experiments and resources
  • Workflow discovery and refinement
  • Resource and service discovery
  • Repository creation
  • Provenance
Executing experiments
  • Workflow enactment
  • Distributed Query processing
  • Job execution
  • Provenance generation
  • Single sign-on authentication
  • Event notification
Providing services and experiments
  • Service registration
  • Workflow deposition
  • Metadata Annotation
  • Third party registration
Managing experiments
  • Information repository
  • Metadata management
  • Provenance management
  • Workflow evolution
  • Event notification
14
Bio in silico experiments service types
  • Making in silico experiments
  • workflow
  • distributed database query processing.
  • Managing experimental outcomes
  • information management
  • managing metadata
  • Scientific method
  • provenance management
  • change notification
  • personalisation
  • Sharing experiments
  • semantic services for discovering services and
    workflows, and managing metadata
  • third party service registries and federated
    personalised views over those registries,
  • ontologies and ontology management.
  • Base services and tools that will constitute the
    experiments
  • third party services such as databases,
    computational analyses, simulations.
  • specialised services such as AMBIT text
    extraction.

15
Investigation / Study: a set of experiments plus
metadata
  • Experimental design components
  • Workflow specs, queries, notes, data
  • Experimental instances, records of enacted
    experiments
  • Parameter settings, result data, workflow runs
  • Experimental glue that groups and links design
    and instance components
  • Life Science IDs (LSIDs)
  • RDF
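
The "experimental glue" idea can be sketched as follows. This is a minimal illustration, assuming the LSID URN layout urn:lsid:authority:namespace:object[:revision] and using plain (subject, predicate, object) tuples as a stand-in for RDF triples. The authority example.org, the namespaces, and the predicate name instanceOf are hypothetical and are not myGrid identifiers.

```python
from typing import List, NamedTuple, Optional, Tuple


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> LSID:
    """Split an LSID URN into its components, rejecting anything malformed."""
    parts = urn.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifiers: one design component and one enacted run.
workflow_spec = "urn:lsid:example.org:workflow:annotation-pipeline:1"
workflow_run = "urn:lsid:example.org:run:42"

# Triples link instance components back to design components.
triples: List[Tuple[str, str, str]] = [(workflow_run, "instanceOf", workflow_spec)]


def runs_of(spec: str) -> List[str]:
    """Return the runs linked to a given workflow specification."""
    return [s for (s, p, o) in triples if p == "instanceOf" and o == spec]
```

Because both the specification and the run carry their own identifier, the link between them can be stored and queried independently of where either object lives.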

16
myGrid Service Stack
External Applications
Applications
e
d
e-Science experimental management
Semantic Grid capabilities
c
Core services
Data management
b
High level services for data intensive integration
Web Service Grid communication fabric, OGSA,
OGSI
External (Web/Grid) Services
External services
a
17
myGrid Service Stack Confusagram
Work bench
Taverna workflow environment
Talisman application
Web Portal
Applications
e
Gateway
d
Personalisation
Service and Workflow Discovery
Registries
Provenance mgt
Event Notification
Ontology Mgt
Ontologies
Metadata Mgt
c
Core services
myGrid Information Repository
FreeFluo Workflow enactment engine
OGSA Distributed Query Processor
b
Web Service Grid communication fabric OGSI
External services
AMBIT Text Extraction Service
Bio Services
Soaplab
SRS
a
EMBOSS
18
Interaction Architecture
Knowledge Services
Knowledge Service
Semantic registration
Registry
Registry
Ontology Server
FaCT Reasoner
Structural registration
UDDI
Matcher
Service
Registry View
Notification Service
Notification Service
UDDI-RDF
Service Discovery
JMS
Provenance service
FreeFluo Workflow enactment engine
Workflow wizard
mIR
WSFL/Scufl
mIR browser
Information Extraction
Distributed Query Processor
Job Execution
mIR
AMBIT
DB2
Service
Service
Service
DB2
Soaplab
19
A work bench for demonstrating services
myView on the mIR
Workflow
Metadata about workflow
note about workflow
NetBeans
20
Notification service
  • A new gene with changed expression in Graves
    Disease added to mIR
  • User registers interest in notification topics
  • Informs the user via a notification client in the
    workbench that new data has been added to the
    mIR.
  • Notifications presented to the user with a client
    in the workbench environment.
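
The register-then-notify pattern described above can be sketched as a topic-based publish/subscribe service. This is an in-process illustration only: the class name, topic strings, and message text are assumptions, and it does not use JMS or any myGrid interface.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class NotificationService:
    """Topic-based publish/subscribe: clients register interest, then get callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a callback to be invoked for every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to each subscriber; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# A list stands in for the notification client in the workbench.
inbox: List[str] = []
service = NotificationService()
service.subscribe("graves-disease/genes", inbox.append)
delivered = service.publish("graves-disease/genes", "new gene added to mIR")
ignored = service.publish("unrelated/topic", "no one is listening")
```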

21
Semantic discovery of services and workflows
  • Services and workflows described using semantic
    web technologies and ontologies
  • Selection by the types of inputs they use,
    outputs they produce, the bioinformatics tasks
    they perform
  • DAML+OIL -> OWL
  • RDF-based UDDI registry
  • Multiple 3rd party registries
  • Multiple 3rd party metadata
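
Selecting services by the semantic type of their inputs can be sketched as below. This is a toy illustration: a small child-to-parent dictionary stands in for an ontology and a loop stands in for a reasoner, so it is not how DAML+OIL, OWL, or the registry actually work. All concept and service names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Toy is-a hierarchy standing in for an ontology (child -> parent).
PARENT: Dict[str, Optional[str]] = {
    "sequence": None,
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
}


def is_a(concept: str, ancestor: str) -> bool:
    """True if concept equals ancestor or is a descendant of it."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    input_type: str
    output_type: str
    task: str


REGISTRY: List[ServiceDescription] = [
    ServiceDescription("similarity_search", "sequence", "alignment_report", "search"),
    ServiceDescription("primer_designer", "nucleotide_sequence", "primer_pair", "design"),
]


def find_services(data_type: str) -> List[str]:
    """Names of services whose declared input type subsumes the data we hold."""
    return [s.name for s in REGISTRY if is_a(data_type, s.input_type)]
```

A service that accepts any sequence is offered for nucleotide data as well, which is the benefit of matching on concepts rather than on exact type strings.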

A registry browser
A workflow wizard
22
The mIR holds the experimental components
  • We need to discover which workflows have been
    published that can operate on data of this
    specific semantic type (an Affymetrix probe set
    identifier)
  • Some might be in mIR, some might be in global
    registry
  • mIR holds all experimental components
  • Multiple mIRs
  • Built on RDBMS and OGSA-DAI
  • Plans: federated architecture, LSIDs and RDF

23
Create and run a workflow
  • If an appropriate workflow does not exist, a new
    one can be created in the Taverna editor
  • Workflow outputs stored in mIR
  • Freefluo workflow enactment engine
  • WSFL Scufl
  • Joint development with HGMP and EBI
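
What "enacting" a workflow involves can be sketched as running steps in dependency order and passing each step's output downstream. This is a minimal illustration using Python's standard graphlib module (Python 3.9 or later); it is not Scufl syntax and does not reflect how the Freefluo engine is implemented. The step names and the enact function are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple

# Each step: (function, names of upstream steps whose outputs it consumes).
Step = Tuple[Callable[..., str], List[str]]


def enact(workflow: Dict[str, Step], inputs: Dict[str, str]) -> Dict[str, str]:
    """Run steps in dependency order, passing upstream outputs as arguments."""
    results = dict(inputs)
    graph = {name: deps for name, (_, deps) in workflow.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # a workflow input, nothing to run
            continue
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results


# Two-step example: complement a DNA string, then reverse it.
workflow: Dict[str, Step] = {
    "complement": (lambda s: s.translate(str.maketrans("ACGT", "TGCA")), ["seq"]),
    "reverse_complement": (lambda s: s[::-1], ["complement"]),
}
outputs = enact(workflow, {"seq": "ATGC"})
```

Keeping every intermediate result in the returned dictionary mirrors the slide's point that workflow outputs are stored rather than discarded.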

http://www.mygrid.org.uk/myGrid/web/components/Workflow/
24
Provenance logging and reusing
  • FreeFluo provides a detailed provenance record
    stored in the mIR describing what was done, with
    what services and when
  • Can be viewed within the workbench
  • XML document
  • Every mIR object has (Dublin Core) provenance
    properties
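
A provenance record of the kind described (what was done, with which services, and when) might look like the sketch below. It assumes the Dublin Core element namespace for the creator and title properties; the provenance and invocation element names, and all values, are illustrative and are not the format FreeFluo actually emits.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List, Tuple

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)


def provenance_record(creator: str, title: str,
                      invocations: List[Tuple[str, datetime]]) -> ET.Element:
    """Build an XML record: Dublin Core properties plus one entry per service call."""
    record = ET.Element("provenance")  # element name is illustrative
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{DC}}}title").text = title
    for service, when in invocations:
        call = ET.SubElement(record, "invocation", service=service)
        call.set("when", when.isoformat())
    return record


# Made-up example values, used only to show the shape of the record.
record = provenance_record(
    "a.scientist",
    "annotation pipeline run",
    [("blast", datetime(2003, 10, 1, 12, 0, tzinfo=timezone.utc))],
)
xml_text = ET.tostring(record, encoding="unicode")
```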

Provenance is not just workflow
  • Derivation paths: workflows, queries
  • Annotations: notes
  • Evolution paths: workflow -> workflow
25
Legacy Bio Services publication
  • Wrap CORBA, Perl etc to look like web services,
    to become Grid services (eventually)
  • Soaplab
  • A SOAP-based programmatic interface to
    command-line applications
  • 300 different classes of services
  • Swiss-Prot, EMBOSS, Medline
  • 3rd parties
  • JEMBOSS, PathPort, bioMoby
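
The wrapping idea, giving a command-line program a programmatic interface with named inputs and outputs, can be sketched as follows. This is a local illustration only: CommandLineService is a hypothetical class, no SOAP is involved, and it is not Soaplab's API. A Python one-liner stands in for the legacy tool.

```python
import subprocess
import sys
from typing import Dict, List


class CommandLineService:
    """Expose a command-line program as a call taking and returning named values."""

    def __init__(self, argv: List[str]) -> None:
        self.argv = argv

    def run(self, inputs: Dict[str, str]) -> Dict[str, str]:
        # Named inputs become "--name value" options; "stdin" carries the payload.
        options: List[str] = []
        for name, value in inputs.items():
            if name != "stdin":
                options += [f"--{name}", value]
        completed = subprocess.run(
            self.argv + options,
            input=inputs.get("stdin", ""),
            capture_output=True,
            text=True,
            check=False,
        )
        return {
            "stdout": completed.stdout,
            "stderr": completed.stderr,
            "exit_code": str(completed.returncode),
        }


# Stand-in "legacy tool": upper-cases whatever arrives on standard input.
legacy_tool = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper())"]
result = CommandLineService(legacy_tool).run({"stdin": "atgc"})
```

Once every tool answers to the same run call, a workflow engine can treat them uniformly regardless of how each was originally written.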

26
Talisman application using individual services
http://www.ebi.ac.uk/collab/mygrid/service1/talisman/index.html
27
An in silico experiment: a web of interconnected
investigation holdings
People to notify of the workflow status
Provenance of the workflow template. Related
workflows.
Ontologies describing workflows
28
Data at the centre
Workflows that could use or generate this data
People who have registered an interest in this
data
Related Data holding
Provenance of the data holdings
Ontologies describing data
29
Put the scientist at the centre
Workflows they wrote or used
People they collaborate with
30
Semantic Glue
Workflows
Provenance record of workflow runs
Notes
People
Data holdings
Services
31
The my in myGrid
  • my services
  • my favourite services
  • my opinion of those services
  • my workflow templates
  • my workflow runs
  • my data
  • my notes
  • my queries
  • my logs of what I did
  • the events I care about

32
The Grid in myGrid
  • Service based architecture
  • mIR and the DQP are OGSA-DAI compliant
  • Migrated event notification and workflow
    enactment engine to OGSA
  • Volatility of services and virtual organisations
  • Graceful management of failure
  • Role based views over registries
  • Scale of data
  • Dataflow through workflow engine and distributed
    query processor
  • Services that are large computational services

33
Status and plans
  • Reflecting on what we have
  • All the components have an implementation in
    various states of maturity and functionality,
    some of which are downloadable already: Freefluo,
    Taverna, Soaplab.
  • Field evaluations with Graves Disease geneticists
    and GSK with seeded data
  • Expanding the user base
  • Use cases in Sleeping Cow
  • Each component has plans, e.g.
  • More sophisticated model of provenance and other
    experimental data holdings, to store much more
    heavily linked metadata about provenance that
    will enable us to create views of the mIR along
    many axes.
  • The myGrid Information Repository to be
    significantly revised.
  • Review systematisation of type management
  • Migration strategy to OGSA

34
Remarks 1: Service Providers
  • It's hard to get Service Providers' buy-in
  • lower the barriers of entry
  • make it reliable.
  • security and intellectual property management
  • programmatic interfaces
  • How do we migrate legacy applications?
  • Whole bunch of apps and databases on the web
  • Accounting matters
  • Who is going to pay for all this?

35
Remarks 2: Hotchpotch
  • Heterogeneity sucks
  • Multi-policy of everything (security, access,
    accounting) really matters in EU
  • Getting a UK Grid to work is non-trivial
  • Huge investment in system admin.
  • Doing more than you could do before.
  • Not just another predictable BLAST service over a
    bunch of machines
  • Non-predictable analysis.

36
Remarks 3: Not a silver bullet!
  • It's just middleware, not magic
  • Data quality
  • Content management of databases (controlled
    vocabularies)
  • Provenance and versioning policies
  • Appropriate use of tools
  • Computational inaccessibility of free text
    annotation
  • Database accessibility through means other than
    point and click web interfaces.
  • Independent of the Grid!

37
Pitfalls of Grid
  • The Technology is emerging
  • Building middleware, Advancing Standards,
    Developing, Dependability
  • Building demonstrators.
  • The computational grid is in advance of the data
    intensive middleware
  • Integration and curation are probably the
    obstacles
  • But!! It doesn't have to be all there to be
    useful.
  • We know how we will use grid services
  • No! Disruptive technology
  • Lower the barriers of entry.
  • It's only for big science
  • No! Small science collaborates too!

38
Life Sciences Grid (LSG)
http://people.cs.uchicago.edu/dangulo/LSG/
39
Summary
  • Information management matters! Accelerating
    scientific process is not just accelerating
    compute intensive processes.
  • myGrid offers service based middleware components
  • Open source and freely downloadable
  • Open Grid Service Architecture-compliant
  • Allows the scientist to be at the centre of the
    Grid -- Personalisation
  • Generic middleware that suits the creation of
    bioinformatics applications
  • Inclusion of rich semantics to facilitate the
    scientific process
  • Available from http://www.mygrid.org.uk

40
Our Biology colleagues
Claire Jennings
  • Institute of Human Genetics, School of Clinical
    Medical Sciences
  • University of Newcastle
  • UK

41
The team
  • Current
  • Matthew Addis, Nedim Alpdemir, Rich Cawley,
  • Neil Davis, Alvaro Fernandes, Justin Ferris,
  • Rob Gaizauskas, Kevin Glover, Chris Greenhalgh,
  • Mark Greenwood, Ananth Krishna, Peter Li,
  • Darren Marvin, Karon Mee, Simon Miles, Luc
    Moreau,
  • Arijit Mukherjee, Juri Papay, Norman Paton,
  • Steve Pettifer, Milena Radenkovic, Peter Rice,
  • Martin Senger, Nick Sharman, Paul Watson,
  • Anil Wipat, Chris Wroe.
  • Past
  • Vijay Dialani, Xiaojian Liu, Angus Roberts,
  • Alan Robinson.

42
  • Thank you
  • Questions?