Title: The myGrid project: towards a semantic grid for bioinformatics
1The myGrid project: towards a semantic grid for bioinformatics
- Professor Carole Goble
- http://www.mygrid.org.uk
- Contact: mygrid_at_cs.man.ac.uk
- Genes, Proteins and Computing VII
- 12th September 2002
- University of Southampton, UK
2Roadmap
- The information integration bottleneck
- The myGrid project
- Semantics-driven middleware
- A taste of myGrid 0.0 and 0.1
- Remarks
3Obstacles Everywhere to Information Weaving
- Large amounts of different kinds of data and many applications.
- Highly heterogeneous.
- Different types, algorithms, forms, implementations, communities, service providers.
- High autonomy.
- Highly complex and inter-related, volatile.
4Circadian Rhythms
- Has anyone else studied the effect of neurotransmitters on the circadian rhythms in Drosophila?
- I've got a cluster of proteins from my experiment. How do their functions interrelate? And what are the proteins with a particular function?
- Is a structure known for my protein? What other proteins have a similar structure?
- Can I build a homology 3D model?
- What is known about a homologous protein?
5E-Science Q&A
- Who else has asked this question? Can I use/adapt their approach?
- Workflow.
- What were the results at each stage?
- Dynamic Data Repositories.
- When was P12345 last updated?
- Which BLAST did I use?
- Provenance.
- Has PDB changed since I last ran this?
- Notification.
Personalisation.
6myGrid: personalised extensible environments for data-intensive in silico experiments in biology
- EPSRC eScience pilot project
- official start 01/10/01
- actual 01/01/02
- end 30/03/05
- 16 RAs, 9 studentships (start 09/03)
7myGrid partners
8myGrid upper middleware
- An extensible open platform for data and tools interoperability
- Service based
- Web services, Grid, Open Grid Services Architecture (OGSA)
- XML, SOAP, WSDL, WSIL, UDDI
- WebSphere
Grid
Web Services
9Web Services
- Loosely coupled, stateless, message based distributed computing over the internet
- Description separate from implementation
10myGrid integration and coordination services
- Databases
- Access.
- Semantic database integration.
- Distributed query processing.
- Workflow
- Dynamic workflow enactment.
- A resource in their own right.
- Workflow replay / repeat / reuse.
- User interactive.
- Linking
- Workflows and derived data
Technology labels on the slide, in order: DB2, mySQL, XML; WSFL; XML, RDF
11myGrid specialist services
- Information extraction from abstracts
- PASTA from Sheffield
- Grid-enabled BioServices by the EBI
- myGrid 0.1
- EMBOSS
- SRS
- Open BQS
- BLAST
- XEmbl and EmblFetch
- Flybase, Gadfly
12myGrid science services
- Data provenance and resource change management
- Workflow logs.
- Event notification service.
- Incremental view management.
- Personalisation
- Management of views over repositories.
- Personalisation of process flows.
- Annotation of existing data sets.
- Dynamic creation of personal data sets.
XML, RDF, DBMS
13myGrid metadata services
- Metadata
- Service discovery, publication, composition, management.
- Database and data integration.
- Workflow control.
- Portal driving.
- Etc
- The Semantics
- Information models, data types and ontologies.
- Object identity.
Web Services
Semantic Web
Grid
XML, XSD, RDF(S), DAML+OIL, OWL
14Metadata
- Metadata: computationally accessible data about the services
- Ontologies: the shared and common understanding of a domain
- A vocabulary of terms
- Definition of what those terms mean.
- A shared understanding for people and machines
- Usually organised into a taxonomy.
15myGrid Clients of Services
- e-Scientists
- Configurable portal for service access, personalisation and community management.
- Wizards for workflow.
- Reference implementation services and applications.
- Gene function expression analysis: fly and yeast.
- Annotation workbench for the PRINTS pattern database.
- Developers
- Protocols and service descriptions.
- myGrid-in-a-Box developers kit of core services.
- Grid-enabled Bio services.
16[Architecture diagram; component labels in slide order]
- User Agent, Custom Application, Presentation Services, Collaboration Support, Management Tools, Portal, Client Framework
- Semantic Data Integration, Semantic Aspect, Information Extraction, Semantic Workflow Design, Provenance Validation Assessment, Semantic Discovery, Ontology Service, Preferences
- Metadata Aspect, Availability, Preferences, Versioning, Third-party Metadata, QoS (label repeated on several components), Provenance
- Coordination Services, Distributed Query, Workflow Enactment, Syntactic Discovery, Event Notification
- Networked Services, White Pages and Yellow Pages Discovery, Personal Repository, Database Access, Job Execution, Device Access, Security (Authentication, Authorization)
- Distributed Resources, Database, resources: data and tools
17myGrid Challenge
- Identifying the most important services.
- Agreeing consistent interfaces.
- Integrating with other Grid services.
- Implementing core services
- Describing services
- Use case scenarios (see Peter Li's poster)
- Rolling programme of prototyping
- April: myGrid 0.0; October: myGrid 0.1
- Not reinventing the wheel.
Here comes the semantics!!
18Where's the Grid?
- Resource sharing and coordinated problem solving in dynamic, multi-institutional virtual organizations
- On-demand, ubiquitous access to computing, data, and services
- New capabilities constructed dynamically and transparently from distributed services
- Ian Foster
19Courtesy of Mark Wilkinson (BioMOBY)
20Courtesy of Mark Wilkinson (BioMOBY)
21Service Discovery
- Find appropriate type of services
- sequence alignment
- Find appropriate instances of that service
- BLAST (an algorithm for sequence alignment), as delivered by NCBI
- Assist in forming an appropriate assembly of discovered services.
- Find, select and execute instances of services while the workflow is being enacted.
- Knowledge in the head of expert bioinformatician
22Finding a Service
- Words
- Syntactic signature: type of inputs, outputs, number
- Semantics: what it does, what it's for, who wrote it -> ontologies
- All of the above
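The three lookup styles on this slide can be contrasted with a minimal sketch. It assumes a tiny in-memory registry; the class ServiceDescription, the registry entries, the type names and the concept names are all hypothetical and are not part of myGrid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    text: str        # free-text description, searched by "words"
    inputs: tuple    # syntactic signature: input types
    outputs: tuple   # syntactic signature: output types
    concept: str     # ontology term stating what the service does


# Hypothetical registry entries, for illustration only.
REGISTRY = [
    ServiceDescription("blastn", "compare nucleotide query against nucleotide database",
                       ("string", "string"), ("string",), "pairwise_alignment"),
    ServiceDescription("clustalw", "multiple sequence alignment",
                       ("string",), ("string",), "multiple_alignment"),
    ServiceDescription("emblfetch", "fetch an EMBL entry by accession",
                       ("string",), ("string",), "database_retrieval"),
]


def by_words(keyword):
    """Keyword search over the free-text description."""
    return [s.name for s in REGISTRY if keyword.lower() in s.text.lower()]


def by_signature(inputs, outputs):
    """Match on the types and number of inputs and outputs only."""
    return [s.name for s in REGISTRY
            if s.inputs == tuple(inputs) and s.outputs == tuple(outputs)]


def by_concept(concept):
    """Match on the ontology term attached to the service."""
    return [s.name for s in REGISTRY if s.concept == concept]


def by_all(keyword, inputs, outputs, concept):
    """'All of the above': intersect the three result sets."""
    hits = (set(by_words(keyword))
            & set(by_signature(inputs, outputs))
            & set(by_concept(concept)))
    return sorted(hits)
```

In this toy registry the signature (string in, string out) matches both clustalw and emblfetch, so syntax alone cannot tell an alignment tool from a database fetch. The semantic term is what separates them.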
23Why have ontologies for services?
- A shared vocabulary for describing a service
- that can evolve and say as little or as much as necessary.
- Service classifications
- Service discovery, organisation and indexing
- Service matching and substitution
- BLAST: finds tblastx, tblastn, psi-blast, and marks_super_blast.
- Alignment: finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch
- Expanded selection of services presented based on expansion of in-hand object
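The matching behaviour described above (a query for BLAST also finds its variants, a query for alignment finds every alignment method) amounts to expanding a term to its descendants in a classification. A minimal sketch follows, assuming a hand-written is-a table. The table is illustrative and is not the myGrid ontology.

```python
# Hypothetical is-a taxonomy (child -> parent), for illustration only.
PARENT = {
    "alignment": "service",
    "pairwise_alignment": "alignment",
    "multiple_alignment": "alignment",
    "BLAST": "pairwise_alignment",
    "Smith-Waterman": "pairwise_alignment",
    "Needleman-Wunsch": "pairwise_alignment",
    "ClustalW": "multiple_alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
}


def is_a(term, ancestor):
    """Walk up the parent links until the ancestor or the root is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def expand(query):
    """Return every term classified below the query term."""
    return sorted(t for t in PARENT if t != query and is_a(t, query))


print(expand("BLAST"))  # the BLAST variants listed in the table
```

A hand-written table is used only to keep the sketch short. In the setting the slides describe, the classification would be inferred by a reasoner from the service descriptions.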
24Why have ontologies for services?
- Controlling service composition
- Outputs of service A semantically compatible with inputs of service B.
- A service description is plausible.
- Blastn compares a nucleotide query sequence against a nucleotide sequence database
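The composition check (outputs of service A semantically compatible with inputs of service B) can be sketched as a subsumption test between data types. The type taxonomy and the service signatures below are assumptions made for illustration. Only the blastn description follows the slide.

```python
# Hypothetical data-type taxonomy (child -> parent).
TYPE_PARENT = {
    "sequence": "data",
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "alignment_report": "data",
}

# Hypothetical signatures: service name -> (input type, output type).
SERVICES = {
    "fetch_embl_entry": ("accession", "nucleotide_sequence"),
    "blastn": ("nucleotide_sequence", "alignment_report"),
    "blastp": ("protein_sequence", "alignment_report"),
    "sequence_length": ("sequence", "number"),
}


def subsumes(general, specific):
    """True if `specific` equals `general` or is classified below it."""
    while specific is not None:
        if specific == general:
            return True
        specific = TYPE_PARENT.get(specific)
    return False


def can_feed(producer, consumer):
    """True if the producer's output type is acceptable as the consumer's input."""
    produced = SERVICES[producer][1]
    required = SERVICES[consumer][0]
    return subsumes(required, produced)
```

Under these assumptions a nucleotide sequence can be passed to blastn, or to any service that accepts a general sequence, but not to blastp.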
25Metadata Classification
- Domain metadata
- the domain coverage of the service, or its function.
- BLASTn is a tool for computing sequence homology that uses the BLAST algorithm over nucleotides
- Business metadata
- data quality, quality of service, cost, geographical location, authorisation, provenance of data and so on.
- BLASTn service offered by the NCBI is 80% reliable.
26Four tiered service descriptions
Domain semantic
- Class of service
- a protein sequence alignment, a protein sequence database.
- Specific example of an abstract service
- BLAST, SWISS-PROT.
- Instance service description of a specific service
- BLAST, SWISS-PROT as offered by the EBI.
- Invoked instance service description
- BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.
Business operational
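One way to picture the four tiers is as nested records, each tier adding operational detail to the one above it. This is only a sketch: the class names, field names, date and parameter values are hypothetical and do not reproduce a myGrid schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ServiceClass:        # tier 1: class of service
    name: str


@dataclass(frozen=True)
class AbstractService:     # tier 2: specific example of an abstract service
    name: str
    service_class: ServiceClass


@dataclass(frozen=True)
class ServiceInstance:     # tier 3: a specific service as offered by a provider
    service: AbstractService
    provider: str


@dataclass(frozen=True)
class InvokedInstance:     # tier 4: one actual enactment
    instance: ServiceInstance
    invoked_on: date
    parameters: tuple      # (name, value) pairs used for this run


def describe(invocation):
    """List the four tiers from most domain-oriented to most operational."""
    inst = invocation.instance
    return [
        inst.service.service_class.name,
        inst.service.name,
        f"{inst.service.name} as offered by {inst.provider}",
        f"{inst.service.name} as offered by {inst.provider} "
        f"on {invocation.invoked_on.isoformat()} with {dict(invocation.parameters)}",
    ]


# Illustrative values only; the date and parameters are made up.
alignment = ServiceClass("protein sequence alignment")
blast = AbstractService("BLAST", alignment)
blast_at_ebi = ServiceInstance(blast, "the EBI")
run = InvokedInstance(blast_at_ebi, date(2002, 1, 1), (("example_parameter", "example_value"),))
```

Calling describe(run) walks from the domain-semantic end of the slide to the business-operational end.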
27DAML+OIL/OWL
- DAML+OIL designed to describe ontologies
- Ontologies incorporate information about classes, properties, and individuals, each of which can have an ID which is a URI reference.
- Ontologies can reference XML Schema datatypes by a name for the datatype.
- Automated reasoning for inferring classification lattice and checking concepts are consistent
- OWL Web Ontology Language 1.0 Reference
- W3C Working Draft 29 July 2002
- http://www.w3.org/TR/owl-ref/
28Ontology editing: OilEd
http://oiled.man.ac.uk/
29Reasoning in DAML+OIL
- Consistency: check if knowledge is meaningful
- Subsumption: structure knowledge, compute classification
- Equivalence: check if two classes denote same set of instances
- Instantiation: check if individual i is an instance of class C
- Retrieval: retrieve set of individuals that instantiate C
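The five reasoning tasks can be illustrated with a deliberately simplified model in which a class is just a set of required features. This is a toy, not a description logic reasoner such as FaCT, and every class, feature and individual name below is hypothetical.

```python
# Toy model: a class is the set of atomic features its members must have.
CLASSES = {
    "alignment_service": frozenset({"aligns"}),
    "pairwise_alignment_service": frozenset({"aligns", "pairwise"}),
    "two_sequence_alignment": frozenset({"aligns", "pairwise"}),
    "nucleotide_pairwise_alignment": frozenset({"aligns", "pairwise", "over_nucleotides"}),
    "impossible_service": frozenset({"over_nucleotides", "over_proteins_only"}),
}

# Pairs of features that can never hold together.
DISJOINT = [("over_nucleotides", "over_proteins_only")]

# Individuals and the features they are known to have.
INDIVIDUALS = {
    "blastn_at_provider_x": frozenset({"aligns", "pairwise", "over_nucleotides", "local"}),
    "clustalw_at_provider_y": frozenset({"aligns", "multiple"}),
}


def consistent(c):
    """Consistency: the class does not require two disjoint features."""
    return not any(a in CLASSES[c] and b in CLASSES[c] for a, b in DISJOINT)


def subsumes(general, specific):
    """Subsumption: everything required by `general` is required by `specific`."""
    return CLASSES[general] <= CLASSES[specific]


def subsumers(c):
    """Classification: all other classes that subsume c."""
    return sorted(d for d in CLASSES if d != c and subsumes(d, c))


def equivalent(c, d):
    """Equivalence: the two classes have identical requirements."""
    return CLASSES[c] == CLASSES[d]


def instance_of(i, c):
    """Instantiation: the individual has every feature the class requires."""
    return CLASSES[c] <= INDIVIDUALS[i]


def retrieve(c):
    """Retrieval: all individuals that instantiate c."""
    return sorted(i for i in INDIVIDUALS if instance_of(i, c))
```

Reducing classes to feature sets turns each task into a set comparison, which keeps the sketch short but leaves out property restrictions and nested class expressions.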
30Why isn't a tree enough?
- BLASTn description implicitly describes a service that operates over nucleotides (and not proteins).
- "Protein pairwise alignment" doesn't exist, despite the fact that this is a likely service description.
31Why isn't a tree enough?
- Classify descriptions by operation (alignment,
pairwise, multiple), by data source (protein,
nucleotide, sequence), or by algorithm
(SmithWaterman, BLAST).
32Why isn't a tree enough?
- Constraints are missing for the descriptions. BLASTp only operates over proteins.
- tBLASTn compares a protein sequence query against a nucleotide sequence database dynamically translated in 6 reading frames.
- Alignment is an operation that only applies to sequences and not pathways, and at least two inputs are required.
33class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
34class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
35Suite
Specialises. All concepts are subclassed from
those in the more general ontology.
Contributes concepts to form definitions.
Upper level ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Task ontology
Bioinformatics ontology
Web service ontology
36Suite
Specialises. All concepts are subclassed from
those in the more general ontology.
Contributes concepts to form definitions.
Upper level ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Task ontology
parameters: input, output, precondition, effect; performs_task; uses-resource; is_function_of
Bioinformatics ontology
Web service ontology
37Suites Coverage
38Client framework
myGrid 0.0
Portal
Repository Client
Ontology Client
Workflow Client
Personal Repository
Workflow Repository
(Metadata) Ontology Server
DAML+OIL Reasoner (FaCT)
(Metadata) Service Type Directory
Workflow enactment
Matcher and Ranker
Service instance directory
Bioinformatics services
39How do the functions of a cluster of proteins
interrelate?
- Some proteins in my personal repository
40 Find services that take proteins and give their functions, and pick the best match.
41 Find another that displays the proteins based on their function. Ontology restricts inputs and outputs.
42Build a workflow of composed services linked
together
43 See if a workflow that is appropriate already exists. It could have been made by anyone who will share with you.
44Pick one and enact it.
45While it's running, it picks the best service instance that can run the service at that time.
46While it's running, it picks the best service instance that can run the service at that time. Or you choose.
47The workflow finishes with the final display
service
48Results are put into your personal repository,
with a concept from the ontology to tell you and
myGrid what they mean.
49And a full provenance record is kept, and linked with the results. We could redo or reuse the workflow.
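The walk-through above (compose a workflow, bind each step to the best available instance at run time, keep provenance with the results) can be sketched in a few lines. Everything here is an assumption made for illustration: the Instance record, the directory entries, the ranking score and the stub services, which return placeholder values rather than real annotations.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Instance:
    name: str
    service_type: str
    available: bool
    score: float    # hypothetical ranking score; higher is better
    run: Callable   # stub standing in for a remote service call


# Hypothetical service instance directory with stub implementations.
DIRECTORY = [
    Instance("function_lookup@siteA", "protein_function_lookup", False, 0.9,
             lambda proteins: {p: "placeholder_function" for p in proteins}),
    Instance("function_lookup@siteB", "protein_function_lookup", True, 0.7,
             lambda proteins: {p: "placeholder_function" for p in proteins}),
    Instance("group_display@siteA", "display_by_function", True, 0.8,
             lambda functions: sorted(functions.items(), key=lambda kv: (kv[1], kv[0]))),
]


def pick_instance(service_type):
    """Choose the best-ranked instance that is available right now."""
    candidates = [i for i in DIRECTORY if i.service_type == service_type and i.available]
    if not candidates:
        raise LookupError(f"no available instance for {service_type}")
    return max(candidates, key=lambda i: i.score)


def enact(workflow, data):
    """Run each step in order, binding to an instance only when the step starts."""
    provenance = []
    for step in workflow:
        instance = pick_instance(step)
        data = instance.run(data)
        provenance.append({
            "step": step,
            "instance": instance.name,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return data, provenance


WORKFLOW = ["protein_function_lookup", "display_by_function"]
result, log = enact(WORKFLOW, ["protein_1", "protein_2"])
```

Because the workflow names service types rather than instances, the same workflow can be replayed later and may bind to different instances. The provenance log records which ones were actually used.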
50Other uses of the ontology
- Labelling data items in databases
- Semantic typing for controlling inputs and outputs
- Use by distributed query processing
- Labelling and hence indexing, linking and browsing any myGrid information component
- Workflow descriptions, people, provenance
- Link with the Life Science Identifier (I3C)
- Generate BioMOBY Central service classification and link with BioMOBY objects.
51What about other efforts?
- Integration
- DAS: "Distributed Annotation System"
- ISYS: Integration of Desktop Tool
- DiscoveryLink: wrapper and distributed query environment
- GO: Gene Ontology etc
- Service discovery and common typing
- BioMOBY: integration of online biological databases and analysis services
- Tackling parts of the problem.
- myGrid is a framework for a platform.
52Thoughts to hold on to
- Application driven by use cases
- Open Source
- Data object types, APIs, protocols, ontologies have a longer life span than s/w
- Components are useful: you don't have to buy into the whole shooting match.
- Don't reinvent the wheel
- Get everyone else to build services.
- Keep it simple.
- It's distributed, global and a continuum.
53myGrid Summary
- myGrid aims to develop infrastructure middleware for an e-Biologist's workbench.
- The setting is bioinformatics but the results are intended to be generally applicable to e-Science.
- A mix of standard, vanguard and bleeding edge technologies, advanced development and (some) research.
- Academic and commercial partnership.
54The myGrid team
- Carole Goble
- Norman Paton
- Brian Warboys
- Stephen Pettifer
- Alvaro Fernandes
- Luc Moreau
- Dave De Roure
- Chris Greenhalgh
- Tom Rodden
- John Brooke
- Paul Watson
- Alan Robinson
- Rob Gaizauskas
- Robert Stevens
- Ian Horrocks
- Neil Wipat
- Matthew Addis
- Nick Sharman
- Rich Cawley
- Simon Harper
- Karon Mee
- Simon Miles
- Vijay Dailani
- Xiaojian Liu
- Tom Oinn
- Martin Senger
- Milena Radenkovic
- Kevin Glover
- Angus Roberts
- Chris Wroe
- Mark Greenwood
- Phil Lord
- Neil Davis
- Darren Marvin
- Justin Ferris
- Peter Li
- Nedim Alpdemir
- Luca Toldo
- Robin McEntire
- Anne Westcott
- Tony Storey
- Bernard Horan
- Paul Smart
- Robert Haynes
55Downloads
- All tools and ontology available from
- http://www.mygrid.org.uk
- Forthcoming publication
- A suite of DAML+OIL Ontologies to Describe Bioinformatics Web Services and Data. Chris Wroe, Robert Stevens, Carole Goble, Angus Roberts, Mark Greenwood
- To appear in International Journal of Cooperative Information Systems.