The myGrid project: towards a semantic grid for bioinformatics


1
The myGrid project: towards a semantic grid
for bioinformatics
  • Professor Carole Goble
  • http://www.mygrid.org.uk
  • Contact: mygrid@cs.man.ac.uk
  • Genes, Proteins and Computing VII
  • 12th September 2002
  • University of Southampton, UK

2
Roadmap
  • The information integration bottleneck
  • The myGrid project
  • Semantics-driven middleware
  • A taste of myGrid 0.0 and 0.1
  • Remarks

3
Obstacles Everywhere to Information Weaving
  • Large amounts of different kinds of data and many
    applications.
  • Highly heterogeneous.
  • Different types, algorithms, forms,
    implementations, communities, service providers
  • High autonomy.
  • Highly complex and inter-related, volatile.

4
Circadian Rhythms
  • Has anyone else studied the effect of
    neurotransmitters on the circadian rhythms in
    Drosophila?
  • I've got a cluster of proteins from my
    experiment. How do their functions interrelate?
    And what are the proteins with a particular
    function?
  • Is a structure known for my protein? What other
    proteins have a similar structure?
  • Can I build a homology 3D model?
  • What is known about a homologous protein?

5
E-Science Q&A
  • Who else has asked this question? Can I
    use/adapt their approach?
  • Workflow.
  • What were the results at each stage?
  • Dynamic Data Repositories.
  • When was P12345 last updated?
  • Which BLAST did I use?
  • Provenance.
  • Has PDB changed since I last ran this?
  • Notification.

Personalisation.
6
  • myGrid: personalised extensible environments
    for data-intensive in silico experiments in
    biology
  • EPSRC eScience pilot project
  • official start 01/10/01
  • actual 01/01/02
  • end 30/03/05
  • 16 RAs, 9 studentships (start 09/03)

7
myGrid partners
8
myGrid upper middleware
  • An extensible open platform for data and tools
    interoperability
  • Service based
  • Web services and Grid: Open Grid Services
    Architecture (OGSA)
  • XML, SOAP, WSDL,
  • WSIL, UDDI
  • WebSphere

Grid
Web Services
9
Web Services
  • Loosely coupled, stateless, message based
    distributed computing over the internet
  • Description separate from implementation

10
myGrid integration and coordination services
  • Databases
  • Access.
  • Semantic database integration.
  • Distributed query processing.
  • Workflow
  • Dynamic workflow enactment.
  • A resource in their own right.
  • Workflow replay / repeat / reuse.
  • User interactive.
  • Linking
  • Workflows and derived data

DB2, mySQL XML
WSFL
XML, RDF
11
myGrid specialist services
  • Information extraction from abstracts
  • PASTA from Sheffield
  • Grid-enabled BioServices by the EBI
  • myGrid 0.1
  • EMBOSS
  • SRS
  • Open BQS
  • BLAST
  • XEmbl and EmblFetch
  • Flybase, Gadfly

12
myGrid science services
  • Data provenance and resource change management
  • Workflow logs.
  • Event notification service.
  • Incremental view management.
  • Personalisation
  • Management of views over repositories.
  • Personalisation of process flows.
  • Annotation of existing data sets.
  • Dynamic creation of personal data sets.

XML, RDF, DBMS
13
myGrid metadata services
  • Metadata
  • Service discovery, publication, composition,
    management.
  • Database and data integration.
  • Workflow control.
  • Portal driving.
  • Etc
  • The Semantics
  • Information models, data types and ontologies.
  • Object identity.

Web Services
Semantic Web
Grid
XML, XSD, RDF(S), DAML+OIL, OWL
14
Metadata
  • Metadata: computationally accessible data about
    the services
  • Ontologies: the shared and common understanding
    of a domain
  • A vocabulary of terms
  • Definition of what those terms mean.
  • A shared understanding for people and machines
  • Usually organised into a taxonomy.

15
myGrid Clients of Services
  • e-Scientists
  • Configurable portal for service access,
    personalisation and community management.
  • Wizards for workflow.
  • Reference implementation services and applications.
  • Gene function expression analysis: fly and yeast.
  • Annotation workbench for the PRINTS pattern
    database.
  • Developers
  • Protocols and service descriptions.
  • myGrid-in-a-Box: developer's kit of core services.
  • Grid-enabled Bio services.

16
[Figure: myGrid architecture diagram. Labels, in order of
appearance: User Agent; Custom Application; Presentation
Services; Collaboration Support; Management Tools; Portal;
Client Framework; Semantic Data Integration; Semantic Aspect;
Information Extraction; Semantic Workflow Design; Provenance
Validation Assessment; Semantic Discovery; Ontology Service;
Preferences; Metadata Aspect; Availability; Versioning;
Third-party Metadata; QoS; Provenance; Coordination Services;
Distributed Query; Workflow Enactment; Syntactic Discovery;
Event Notification; Networked Services; White Pages and Yellow
Pages Discovery; Personal Repository; Database Access; Job
Execution; Device Access; Security, Authentication,
Authorization; Distributed Resources; Database; resources:
data and tools]
17
myGrid Challenge
  • Identifying the most important services.
  • Agreeing consistent interfaces.
  • Integrating with other Grid services.
  • Implementing core services
  • Describing services
  • Use case scenarios (see Peter Li's poster)
  • Rolling programme of prototyping
  • April: myGrid 0.0; October: myGrid 0.1
  • Not reinventing the wheel.

Here comes the semantics!!
18
Where's the Grid?
  • Resource sharing and coordinated problem solving in
    dynamic, multi-institutional virtual
    organizations
  • On-demand, ubiquitous access to computing, data,
    and services
  • New capabilities constructed dynamically and
    transparently from distributed services
  • Ian Foster

19
Courtesy of Mark Wilkinson (BioMOBY)
20
Courtesy of Mark Wilkinson (BioMOBY)
21
Service Discovery
  • Find appropriate type of services
  • sequence alignment
  • Find appropriate instances of that service
  • BLAST (an algorithm for sequence alignment), as
    delivered by NCBI
  • Assist in forming an appropriate assembly of
    discovered services.
  • Find, select and execute instances of services
    while the workflow is being enacted.
  • Knowledge in the head of the expert bioinformatician
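
The two-step lookup on this slide (first a type of service, then instances of that type) can be sketched as follows. This is a minimal illustration, not the myGrid directory API: the `ServiceInstance` class, the `REGISTRY` contents and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInstance:
    name: str          # the tool, e.g. "BLAST"
    provider: str      # who delivers it, e.g. "NCBI"
    service_type: str  # the abstract kind of task it performs


# Illustrative registry contents only; not a real myGrid directory.
REGISTRY = [
    ServiceInstance("BLAST", "NCBI", "sequence alignment"),
    ServiceInstance("BLAST", "EBI", "sequence alignment"),
    ServiceInstance("SRS", "EBI", "database retrieval"),
]


def find_instances(service_type, registry=REGISTRY):
    """Return every registered instance of the requested type of service."""
    return [s for s in registry if s.service_type == service_type]


def assemble(service_types, registry=REGISTRY):
    """Pick one instance per workflow step, or None where nothing matches."""
    plan = []
    for service_type in service_types:
        matches = find_instances(service_type, registry)
        plan.append(matches[0] if matches else None)
    return plan


plan = assemble(["database retrieval", "sequence alignment"])
print([(s.name, s.provider) for s in plan])
```

Taking the first match in `assemble` is a placeholder for the "knowledge in the head" of the expert; a real selection step would rank the candidates.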

22
Finding a Service
  • Words
  • Syntactic signature: type and number of inputs
    and outputs
  • Semantics: what it does, what it's for, who wrote
    it -> ontologies
  • All of the above

23
Why have ontologies for services?
  • A shared vocabulary for describing a service
  • that can evolve and say as little or as much as
    necessary.
  • Service classifications
  • Service discovery, organisation and indexing
  • Service matching and substitution
  • BLAST: finds tblastx, tblastn, psi-blast, and
    marks_super_blast.
  • Alignment: finds ClustalW, BLAST,
    Smith-Waterman, Needleman-Wunsch
  • Expanded selection of services presented based on
    expansion of in-hand object
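
One way to read "BLAST finds tblastx ..." is as a query expanded down an is-a hierarchy. The sketch below assumes a hand-written child-to-parent table (`PARENT`) built from the names on this slide; the table and the `is_a` and `find` helpers are illustrative, not part of any myGrid ontology or tool.

```python
# Hypothetical is-a hierarchy (child -> parent), using names from the slide.
PARENT = {
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
}


def is_a(term, ancestor):
    """Walk up the hierarchy from term and report whether ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def find(query):
    """Return every term classified (directly or indirectly) under query."""
    return sorted(t for t in PARENT if t != query and is_a(t, query))


print(find("BLAST"))
print(find("Alignment"))
```

Because the walk is transitive, a query for "Alignment" in this sketch also returns the BLAST variants, which is slightly more than the four names the slide lists.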

24
Why have ontologies for services?
  • Controlling service composition
  • Outputs of service A semantically compatible with
    inputs of service B.
  • A service description is plausible.
  • Blastn compares a nucleotide query sequence
    against a nucleotide sequence database
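
The composition rule on this slide (outputs of service A must be semantically compatible with inputs of service B) can be sketched as a subsumption check over data types. The type hierarchy, the `ServiceDescription` class and the `fetch_nucleotide` service are assumptions made for the example; only the BLASTn and BLASTp input types follow the slides.

```python
from dataclasses import dataclass

# Hypothetical data-type hierarchy (child -> parent).
TYPE_PARENT = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "alignment_report": "data",
}


def subsumes(general, specific):
    """True if specific is the same type as general or a descendant of it."""
    while specific is not None:
        if specific == general:
            return True
        specific = TYPE_PARENT.get(specific)
    return False


@dataclass
class ServiceDescription:
    name: str
    input_type: str
    output_type: str


def can_compose(a, b):
    """A can feed B when A's output type falls under B's input type."""
    return subsumes(b.input_type, a.output_type)


fetch = ServiceDescription("fetch_nucleotide", "data", "nucleotide_sequence")
blastn = ServiceDescription("BLASTn", "nucleotide_sequence", "alignment_report")
blastp = ServiceDescription("BLASTp", "protein_sequence", "alignment_report")

print(can_compose(fetch, blastn))  # nucleotide output into a nucleotide input
print(can_compose(fetch, blastp))  # nucleotide output into a protein input
```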

25
Metadata Classification
  • Domain metadata
  • the domain coverage of the service, or its
    function.
  • BLASTn is a tool for computing sequence homology
    that uses the BLAST algorithm over nucleotides
  • Business metadata
  • data quality, quality of service, cost,
    geographical location, authorisation, provenance
    of data and so on.
  • BLASTn service offered by the NCBI is 80%
    reliable.

26
Four tiered service descriptions
Domain semantic
  • Class of service
  • a protein sequence alignment, a protein sequence
    database.
  • Specific example of an abstract service
  • BLAST, SWISS-PROT.
  • Instance service description of a specific
    service
  • BLAST, SWISS-PROT as offered by the EBI.
  • Invoked instance service description
  • BLAST as offered by the EBI on a particular date,
    with particular parameters when a service was
    actually enacted.
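
The four tiers can be pictured as four linked record types, each pointing at the tier above it. This is only a sketch: the class names, the invocation date and the parameter dictionary are assumptions chosen for illustration, not recorded myGrid data.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ServiceClass:
    """Tier 1: class of service, e.g. a protein sequence alignment."""
    name: str


@dataclass
class AbstractService:
    """Tier 2: specific example of an abstract service, e.g. BLAST."""
    name: str
    service_class: ServiceClass


@dataclass
class ServiceInstance:
    """Tier 3: a specific service as offered by a provider."""
    service: AbstractService
    provider: str


@dataclass
class InvokedInstance:
    """Tier 4: one enactment, with its date and parameters."""
    instance: ServiceInstance
    invoked_on: date
    parameters: dict = field(default_factory=dict)


alignment = ServiceClass("protein sequence alignment")
blast = AbstractService("BLAST", alignment)
blast_at_ebi = ServiceInstance(blast, "EBI")
# Hypothetical invocation details, for illustration only.
run = InvokedInstance(blast_at_ebi, date(2002, 9, 12), {"database": "SWISS-PROT"})
print(run.instance.service.service_class.name)
```

Linking each tier to the one above means a provenance query about a single run can be answered at any level of generality.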

Business operational
27
DAML+OIL/OWL
  • DAML+OIL is designed to describe ontologies
  • Ontologies incorporate information about classes,
    properties, and individuals, each of which can
    have an ID which is a URI reference.
  • Ontologies can reference XML Schema datatypes by
    a name for the datatype.
  • Automated reasoning for inferring classification
    lattice and checking concepts are consistent
  • OWL Web Ontology Language 1.0 Reference
  • W3C Working Draft 29 July 2002
  • http://www.w3.org/TR/owl-ref/

28
Ontology editing: OilEd
http://oiled.man.ac.uk/
29
Reasoning in DAML+OIL
  • Consistency: check if knowledge is meaningful
  • Subsumption: structure knowledge, compute
    classification
  • Equivalence: check if two classes denote the same
    set of instances
  • Instantiation: check if individual i is an instance
    of class C
  • Retrieval: retrieve the set of individuals that
    instantiate C
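
The five reasoning tasks can be illustrated with a deliberately tiny model in which a class is just a conjunction of atomic features. This is a toy, not how a description logic reasoner such as FaCT works: the class names, features, individuals and the disjointness axiom are all assumptions made for the example.

```python
# Toy model: each class is defined by the set of features it requires.
CLASSES = {
    "alignment_service": frozenset({"aligning"}),
    "pairwise_alignment_service": frozenset({"aligning", "pairwise"}),
    "two_sequence_alignment_service": frozenset({"pairwise", "aligning"}),
    "impossible_service": frozenset({"over_protein", "over_nucleotide"}),
}
# Assumed axiom: nothing is both a protein-only and a nucleotide-only service.
DISJOINT = [frozenset({"over_protein", "over_nucleotide"})]
INDIVIDUALS = {
    "blastn_at_ncbi": frozenset({"aligning", "pairwise", "over_nucleotide"}),
    "clustalw_at_ebi": frozenset({"aligning", "multiple"}),
}


def consistent(c):
    """Consistency: the definition requires no pair of disjoint features."""
    return not any(pair <= CLASSES[c] for pair in DISJOINT)


def subsumes(c, d):
    """Subsumption: everything required by c is also required by d."""
    return CLASSES[c] <= CLASSES[d]


def equivalent(c, d):
    """Equivalence: the two definitions require exactly the same features."""
    return CLASSES[c] == CLASSES[d]


def instance_of(i, c):
    """Instantiation: individual i has every feature class c requires."""
    return CLASSES[c] <= INDIVIDUALS[i]


def retrieve(c):
    """Retrieval: all individuals that instantiate class c."""
    return sorted(i for i in INDIVIDUALS if instance_of(i, c))


print(consistent("impossible_service"))
print(subsumes("alignment_service", "pairwise_alignment_service"))
print(retrieve("alignment_service"))
```

In this toy, subsumption is a subset test on required features; a real reasoner derives the same kind of classification from much richer class expressions.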

30
Why isn't a tree enough?
  • The BLASTn description implicitly describes a service
    that operates over nucleotides (and not
    proteins). "Protein pairwise alignment" doesn't
    exist despite the fact that this is a likely
    service description.

31
Why isn't a tree enough?
  • Classify descriptions by operation (alignment,
    pairwise, multiple), by data source (protein,
    nucleotide, sequence), or by algorithm
    (Smith-Waterman, BLAST).

32
Why isn't a tree enough?
  • Constraints are missing for the descriptions.
    BLASTp only operates over proteins.
  • tBLASTn compares a protein sequence query against
    a nucleotide sequence database dynamically
    translated in 6 reading frames.
  • Alignment is an operation that only applies to
    sequences and not pathways, and at least two
    inputs are required.

33
class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

34
class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

35
Suite
Specialises. All concepts are subclassed from
those in the more general ontology.
Contributes concepts to form definitions.
Upper level ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Task ontology
Bioinformatics ontology
Web service ontology
36
Suite
Specialises. All concepts are subclassed from
those in the more general ontology.
Contributes concepts to form definitions.
Upper level ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Task ontology
parameters: input, output, precondition, effect;
performs_task, uses-resource, is_function_of
Bioinformatics ontology
Web service ontology
37
Suite's Coverage
38
Client framework
myGrid 0.0
Portal
Repository Client
Ontology Client
Workflow Client
Personal Repository
Workflow Repository
(Metadata) Ontology Server
DAML+OIL Reasoner (FaCT)
(Metadata) Service Type Directory
Workflow enactment
Matcher and Ranker
Service instance directory
Bioinformatics services
39
How do the functions of a cluster of proteins
interrelate?
  • Some proteins in my personal repository

40
Find services that take a protein and give
its functions, and pick the best match.
41
Find another that displays the proteins based on
their function. The ontology restricts inputs
and outputs.
42
Build a workflow of composed services linked
together
43
See if a workflow that is appropriate already
exists. It could have been made by anyone who will
share with you.
44
Pick one and enact it.
45
While it's running it picks the best service
instance that can run the service at that time.
46
While it's running it picks the best service
instance that can run the service at that
time. Or you choose.
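
Late binding of a service instance during enactment, with an optional user override, might look like the sketch below. The `Candidate` class, the provider labels and the reliability figures are made up for illustration; ranking by a single reliability score is an assumption, not the myGrid matcher and ranker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    provider: str       # hypothetical provider label
    available: bool     # is the instance reachable right now?
    reliability: float  # business metadata, 0.0 to 1.0 (made-up values)


def pick_instance(candidates: List[Candidate],
                  user_choice: Optional[str] = None) -> Optional[Candidate]:
    """Return the user's pick if it is available, else the most reliable one."""
    live = [c for c in candidates if c.available]
    if user_choice is not None:
        for c in live:
            if c.provider == user_choice:
                return c
    # No usable user preference: fall back to automatic ranking.
    return max(live, key=lambda c: c.reliability, default=None)


candidates = [
    Candidate("provider_a", True, 0.80),
    Candidate("provider_b", True, 0.90),
    Candidate("provider_c", False, 0.99),
]
automatic = pick_instance(candidates)
manual = pick_instance(candidates, user_choice="provider_a")
print(automatic.provider, manual.provider)
```

The unavailable instance is filtered out before ranking, so the highest reliability score on paper does not win if the instance cannot run the service at that time.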
47
The workflow finishes with the final display
service
48
Results are put into your personal repository,
with a concept from the ontology to tell you and
myGrid what they mean.
49
And a full provenance record is kept, and linked with
the results. We could redo or reuse the workflow.
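
A provenance record that is linked to the results, and that is complete enough to rerun the workflow, can be sketched as follows. The `ProvenanceRecord` fields, the `enact` function and the two string-processing steps are stand-ins invented for the example; real steps would be calls to bioinformatics services.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, List, Tuple


@dataclass
class ProvenanceRecord:
    workflow_id: str
    inputs: Any
    steps: List[dict] = field(default_factory=list)
    result: Any = None


def enact(workflow_id: str,
          steps: List[Tuple[str, Callable[[Any], Any]]],
          data: Any) -> Tuple[Any, ProvenanceRecord]:
    """Run each step in order, logging what ran, when, and what it produced."""
    record = ProvenanceRecord(workflow_id=workflow_id, inputs=data)
    for name, func in steps:
        started = datetime.now(timezone.utc).isoformat()
        data = func(data)
        record.steps.append({"service": name, "started": started, "output": data})
    record.result = data
    return data, record


# Stand-in steps on a toy sequence string.
steps = [("uppercase", str.upper), ("reverse", lambda s: s[::-1])]
result, record = enact("wf-1", steps, "acgt")

# Redo: the record holds the original inputs, so the run can be repeated.
replayed, _ = enact(record.workflow_id, steps, record.inputs)
print(result, replayed)
```

Storing the inputs and per-step outputs alongside the final result is what makes the "redo or reuse" on this slide possible without asking the user to remember what they ran.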
50
Other uses of the ontology
  • Labelling data items in databases
  • Semantic typing for controlling inputs and
    outputs
  • Use by distributed query processing
  • Labelling, and hence indexing, linking and browsing,
    any myGrid information component
  • Workflow descriptions, people, provenance
  • Link with the Life Science Identifier (I3C)
  • Generate BioMOBY Central service classification
    and link with BioMOBY objects.

51
What about other efforts?
  • Integration
  • DAS: Distributed Annotation System
  • ISYS: integration of desktop tools
  • DiscoveryLink: wrapper and distributed query
    environment
  • GO: Gene Ontology, etc.
  • Service discovery and common typing
  • BioMOBY: integration of online biological
    databases and analysis services
  • Tackling parts of the problem.
  • myGrid is a framework for a platform.

52
Thoughts to hold on to
  • Application driven by use cases
  • Open Source
  • Data object types, APIs, protocols, ontologies
    have a longer life span than software
  • Components are useful: you don't have to buy into
    the whole shooting match.
  • Don't reinvent the wheel
  • Get everyone else to build services.
  • Keep it simple.
  • It's distributed, global and a continuum.

53
myGrid Summary
  • myGrid aims to develop infrastructure middleware
    for an e-Biologist's workbench.
  • The setting is bioinformatics but the results are
    intended to be generally applicable to e-Science.
  • A mix of standard, vanguard and bleeding edge
    technologies, advanced development and (some)
    research.
  • Academic and commercial partnership.

54
The myGrid team
  • Carole Goble
  • Norman Paton
  • Brian Warboys
  • Stephen Pettifer
  • Alvaro Fernandes
  • Luc Moreau
  • Dave De Roure
  • Chris Greenhalgh
  • Tom Rodden
  • John Brooke
  • Paul Watson
  • Alan Robinson
  • Rob Gaizauskas
  • Robert Stevens
  • Ian Horrocks
  • Neil Wipat
  • Matthew Addis
  • Nick Sharman
  • Rich Cawley
  • Simon Harper
  • Karon Mee
  • Simon Miles
  • Vijay Dailani
  • Xiaojian Liu
  • Tom Oinn
  • Martin Senger
  • Milena Radenkovic
  • Kevin Glover
  • Angus Roberts
  • Chris Wroe
  • Mark Greenwood
  • Phil Lord
  • Neil Davis
  • Darren Marvin
  • Justin Ferris
  • Peter Li
  • Nedim Alpdemir
  • Luca Toldo
  • Robin McEntire
  • Anne Westcott
  • Tony Storey
  • Bernard Horan
  • Paul Smart
  • Robert Haynes

55
Downloads
  • All tools and ontologies available from
  • http://www.mygrid.org.uk
  • Forthcoming publication
  • A suite of DAML+OIL Ontologies to Describe
    Bioinformatics Web Services and Data. Chris Wroe,
    Robert Stevens, Carole Goble, Angus Roberts, Mark
    Greenwood
  • To appear in International Journal of
    Cooperative Information Systems.