Title: Nobody said it was easy: Semantically Discovering BioGrid Services is tricky
1 Nobody said it was easy: Semantically Discovering BioGrid Services is tricky
- Professor Carole Goble
- University of Manchester, UK
- myGrid project
http://www.mygrid.org.uk
2
- Environmental requirements of bioinformatics in silico experimentation
- The services
- Workflow execution
- And the impact on describing services: how you describe things, what to describe, and how and when to use the descriptions
- different levels of descriptions
- different views on services
- depending on whether you are middleware or a user
- implications for registration
3 Road map
- Why are we describing bio-services?
- myGrid project requirements and architecture
- A little teeny-weeny contextualising demo
- The user perspective and the implementation perspective
- Thoughts, lessons and design decisions
- Describing different executable objects
- Workflows and Services
- Stratification of metadata
- Classes and Instances
- Service execution
- State based invocation models
- Parametric polymorphism of services
- Multiple descriptions, multiple interfaces
4 The Grid Problem
(Foster, Kesselman, Tuecke)
- "flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations."
A low-level framework to allow inter-operation of resources, mainly for the benefit of application developers, who can then deploy standard tasks on the Grid in a straightforward manner.
5 Open Grid Services Architecture
- Present Grid Architecture is a services architecture
- Implemented using Web Services Technology
- OGSA will provide
- Naming / Authorization / Security / Privacy
- Higher level services: Workflow, Transactions, Data Mining, Knowledge Discovery, ...
- Exploiting synergy: Commercial Internet with Grid Services
- OGSI extends Web Services
- Transient Service Instances
- Service State
- Lifetime management
- Defines fundamental (WSDL) interfaces and behaviors that define a Grid Service
- Required and optional interfaces: WS profile
- Defines WSDL extensibility elements
- E.g., serviceType (a group of portTypes)
6 myGrid
- EPSRC UK e-Science pilot project
- Open Source Upper Middleware for Bioinformatics
- Data intensive not compute intensive
- Sharing knowledge and sharing components
7 myGrid in a nutshell
- An example of a second generation open service-based Grid project, specifically a test bed for the OGSI, OGSA and OGSA-DAI base services
- myGrid Information Repository that is OGSA-DAI compliant
- Developing high level services for data intensive integration, rather than computationally intensive problems
- Workflow and distributed query processing
- Developing high level services for e-Science experimental management
- Provenance, change notification and personalisation
- Developing Semantic Grid capabilities and knowledge-based technologies, such as semantic-based resource discovery and matching
- Metadata descriptions and ontologies for service discovery, component discovery and linking components
9 Experiment life cycle
- Service discovery
- Workflow discovery and refinement
- Workflow creation
- Personalised service registries
- Personalised workflows
Forming experiments
Personalisation
Discovering and reusing experiments and resources
Executing experiments
- Service discovery
- Workflow discovery and refinement
- Provenance logs
- Workflow enactment
- Service invocation
- Provenance logs
Providing services and experiments
Managing experiments
- Service registration
- Workflow deposition
- Metadata annotation
- Third party registration
- Provenance records
- Workflow evolution
- Service monitoring
10 Provenance
- Experiment is repeatable, if not reproducible, and explained by provenance records
- Who, what, where, why, when, (w)how?
- The traceability of knowledge as it evolves and as it is derived
- Implications for recording which services were invoked on what data, when, and with what parameters
- Immutable and persistent
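The recording requirement above can be sketched as an immutable record type. This is a minimal illustration, not the myGrid provenance service: the field names, the `record_invocation` helper, the identifier scheme and the in-memory `log` (standing in for a persistent store) are all assumptions.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ProvenanceRecord:
    """One immutable entry: which service ran on what data, when, with what parameters."""
    who: str                       # user or agent that triggered the invocation
    service: str                   # service instance that was invoked
    input_ref: str                 # identifier of the input data (hypothetical scheme)
    parameters: Mapping[str, str]  # parameters passed to the service
    when: datetime                 # invocation time (UTC)
    why: str = ""                  # free-text rationale


def record_invocation(who, service, input_ref, parameters, why=""):
    # Copy the parameters into a read-only mapping so later changes to the
    # caller's dict cannot alter the stored record.
    frozen_params = MappingProxyType(dict(parameters))
    return ProvenanceRecord(who, service, input_ref, frozen_params,
                            datetime.now(timezone.utc), why)


log = []  # append-only list standing in for a persistent store
log.append(record_invocation("user1", "BLASTn@EBI", "seq:0001",
                             {"evalue": "1e-5"}, why="characterise sequence"))
```

Freezing the dataclass makes accidental modification raise an error, which is one simple way to approximate "immutable"; persistence would need a real store.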
11 Architectural Overview
Architecture diagram. Labels shown: Knowledge Services, Knowledge Service, Semantic registration, Structural registration, Registry, Registry View, Ontology Server, Reasoner, Matcher, KB Store, UDDI, RDF-based UDDI, Service Discovery, Notification Service, JMS, Provenance service, Workflow enactment engine, Discover Workflow or Service, Scufl, WSFL, Workflow templates, Workflow instances, mIR (myGrid Information Repository), mG Object Discovery, Information Extraction, Distributed Query Processor, Job Execution, Test Data, PESTO, Service, Metadata, Concepts, Data, Provenance, SoapLab, DB2
12 Workflows
- Workflow discovery
- Finding workflows that others have done, and that I have done myself
- Workflow specification
- Finding classes of services
- Guiding service composition
- We don't do automated composition
- Dynamic workflow enactment: service discovery and invocation
- Choose service instances when running workflow
- User involvement
13 myGrid Find Service
Diagram of the Find Service. Labels shown: Discovery Client, Find Service, Word-based discovery, Semantic discovery, Syntactic discovery, Views, Ontology Server, Reasoner, FaCT, Matcher, Description Store, KAON, UDDI-M, UDDI, RDF, WSDL, Service, Third party description, Third Party, Gather service descriptions, publishes, Org. registry, Public registry
14 myGrid Components Demo
- portal operation.
- semantics to define type system.
- mIR, to store, and retrieve data.
- registry to describe and record services
Uncharacterised DNA sequence
Select an open reading frame
Translate to protein
BLAST search
Characterised DNA sequence
15 myGrid Components Demo
- Pre-existing third party application
- Service invocation
- Workflow enactment
DNA sequence
getOrf
transeq
prophet
plotorf
Proteins from a family
emma
prophecy
Classical bioinformatics: detecting whether an uncharacterised protein domain is conserved across a group of proteins
16 Bio Services Landscape
- Wrap CORBA, Perl etc. to look like web services, to become Grid services (eventually)
- Multiple services
- Many hundreds of different services in the public domain and privately owned
- Multiple registries
- 3rd party public registries, private registries, personal registries
- 3rd parties
- JEMBOSS, PathPort, bioMoby
- Wrap our own
- Soaplab
- A SOAP-based programmatic interface to command-line applications
- 300 different classes of services
- Swiss-Prot, EMBOSS, Medline, blah, blah
- http://industry.ebi.ac.uk/soap/soaplab
17 Bio Services Problem Space
- Multiple service providers of same service (not just similar service)
- Many implementations of Swiss-Prot version 40
- "What" and "which": discovery based on
- What the service does from a domain perspective.
- Which service instance has the appropriate capabilities from an operational perspective.
- Users don't care if the service is a service or a workflow.
- Same "what" description from their perspective
- Different "how" description from middleware perspective.
SWISS-PROT
SWISS-PROT@local
SWISS-PROT@ncbi
SWISS-PROT@ebi
18 Consequences
- We support (at least) two types of semantic service discovery
- Domain
- requiring access to common application domain ontologies
- Biology and bioinformatics
- Service
- using cross-domain knowledge independent of application
- Quality of service, ownership, location, organisations
- We describe the profile of workflows as if they were services (of course a workflow could be deployed as a service)
- Should workflow descriptions be in the same registry as service descriptions, or elsewhere?
- A find service must transcend the location.
19 Tiers of service description
Abstract tasks: Uncharacterised DNA sequence > Select an open reading frame > Translate to protein > Sequence alignment > Characterised DNA sequence
Specific services: CATTACCC > EMBOSS GetORF > EMBOSS TransSeq > BLAST-p > Characterised DNA sequence
Service instances: CATTACCC > EMBOSS GetORF@http://img.cs.man.ac.uk > EMBOSS TransSeq@http://ed.ac.uk > BLASTp@ncbi.nih.gov > Characterised DNA sequence
20 Summary: Tiered levels of descriptions
Classes of services (domain semantic, unexecutable, "potentials")
- Abstract Service: Sequence alignment (described by: Ontology)
- Specific Service: Blastn (described by: Ontology)
Instances of services (business operational, executable, "actuals")
- Service Instance: Blastn@EBI (described by: Ontology, Data model)
- Invoked Service: Blastn@EBI invoked proxy (described by: Service Data Element)
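The four tiers can be sketched as a chain of record types, each pointing at the tier above it. This is an illustrative data structure only; the class and field names and the `example_db` parameter value are assumptions, not part of the myGrid data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractService:       # class of service, e.g. "Sequence alignment"
    task: str


@dataclass(frozen=True)
class SpecificService:       # still a class, not executable, e.g. "Blastn"
    name: str
    kind: AbstractService


@dataclass(frozen=True)
class ServiceInstance:       # executable: a specific service at a provider
    service: SpecificService
    provider: str


@dataclass(frozen=True)
class InvokedService:        # one invocation of an instance
    instance: ServiceInstance
    parameters: tuple        # (name, value) pairs used for this invocation


def describe(invoked: InvokedService) -> str:
    """Walk from the invocation back up to the abstract task."""
    inst = invoked.instance
    return (f"{inst.service.kind.task} > {inst.service.name} > "
            f"{inst.service.name}@{inst.provider} > "
            f"invoked with {dict(invoked.parameters)}")


alignment = AbstractService("Sequence alignment")
blastn = SpecificService("Blastn", alignment)
blastn_ebi = ServiceInstance(blastn, "EBI")
call = InvokedService(blastn_ebi, (("database", "example_db"),))
print(describe(call))
```

The first two types describe potentials and carry only domain semantics; the last two describe actuals and are where operational metadata would attach.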
21 What are you discovering? Classes (Users)
Workflow specifications
Discovery
Classes of Service
- Finding a service that will fulfil some task, e.g. aligning of biological sequences.
- What services perform a specific kind of task, for example, what services can I use to perform a biological sequence similarity search?
- Finding a service that will accept or produce some kind of data.
- What services produce this kind of data, for example, from where can I find sequence data for a protein?
- What services consume this kind of data, for example, if I have protein sequence data, what can I do with it?
- Class of service
- a protein sequence alignment, a protein sequence database.
- Specific example of an abstract service
- BLAST, BLASTn, SWISS-PROT, ...
- Applies to classes of services and workflow specifications
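The three kinds of question above (by task, by data produced, by data consumed) can be sketched as filters over a small registry. The registry entries, task names and data-type names below are made up for illustration and are not taken from a myGrid registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    task: str
    consumes: frozenset
    produces: frozenset


# Hypothetical registry entries, for illustration only.
REGISTRY = [
    ServiceClass("BLASTp", "similarity_search",
                 frozenset({"protein_sequence"}), frozenset({"alignment_report"})),
    ServiceClass("SWISS-PROT", "retrieval",
                 frozenset({"accession"}), frozenset({"protein_sequence"})),
    ServiceClass("transeq", "translation",
                 frozenset({"nucleotide_sequence"}), frozenset({"protein_sequence"})),
]


def performing(task):
    """What services perform this kind of task?"""
    return [s.name for s in REGISTRY if s.task == task]


def producing(data):
    """From where can I get this kind of data?"""
    return [s.name for s in REGISTRY if data in s.produces]


def consuming(data):
    """If I have this kind of data, what can I do with it?"""
    return [s.name for s in REGISTRY if data in s.consumes]
```

For example, `consuming("protein_sequence")` answers "if I have protein sequence data, what can I do with it?" against this toy registry.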
22 Originally Based on DAML-S
- US DARPA Agent Markup Language Services, http://www.daml.org
- An Upper Ontology for Services
23 Suite
Specialises: all concepts are subclassed from those in the more general ontology.
Contributes: concepts to form definitions.
Upper level ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Task ontology
Properties: parameters (input, output, precondition, effect), performs_task, uses_resource, is_function_of
Bioinformatics ontology
Web service ontology
25 Pedro interface to Service Discovery
26 Classification and matchmaking of services
- Classification of services/workflows
- Imprecise (best effort) substitutions of services/workflows
- Service/workflow organisation and indexing
- Service/workflow matchmaking and substitution
- BLAST finds tblastx, tblastn, psi-blast, marks_super_blast.
- Alignment finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch
- Expanded selection of services based on expansion of in-hand object
- A vocabulary for expressing service descriptions without pre-determining every description
- A reasoning process to manage
- coherency of the classifications and the descriptions when they are created,
- the service discovery, matching and composition when they are deployed.
- Ontologies in DAML+OIL/OWL based on the DAML-S ontology
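The "BLAST finds ..." and "Alignment finds ..." examples amount to subsumption: a query for a class also returns everything classified beneath it. The sketch below uses a hand-written is-a table built from the names on this slide. It is an assumption-laden stand-in, since a description logic reasoner would infer the hierarchy from the service descriptions rather than read it from a table.

```python
# Hand-written is-a links using the names from the slide (child -> parent).
PARENT = {
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "marks_super_blast": "BLAST",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or sits anywhere below it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find(query):
    """All classified services subsumed by `query`, excluding the query itself."""
    return sorted(c for c in PARENT if c != query and is_a(c, query))


print(find("BLAST"))
```

Because subsumption is transitive, `find("Alignment")` also returns the BLAST variants, which is the "expanded selection" behaviour described above.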
27 What are you discovering? Instances (Machines)
Workflow specifications
Discovery
Classes of Service
registry
Instantiate
Select instances
28 Discovering services based on their operational properties
- What resources does a specific organisation provide?
- Who authored this resource?
- What services offering x currently give the best quality of service?
- Which service would the local bioinformatics expert suggest we use?
- Data quality, quality of service, cost, geographical location, authorisation, provenance of data and so on.
- Third party metadata
- Instance service description of a specific service
- BLAST, SWISS-PROT as offered by the EBI is 80% reliable.
- Invoked instance service description
- BLAST as offered by the EBI on a particular date, with particular parameters when a service is invoked.
Applies to instances of services and workflows
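Instance-level discovery of this kind can be sketched as filtering and ranking instance descriptions by their operational metadata. Everything below is illustrative: the field names, the annotator names and the reliability figures are invented for the example, with only the 0.80 figures echoing the EBI example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceDescription:
    service: str
    provider: str
    reliability: float   # fraction of successful invocations, 0.0 to 1.0
    annotated_by: str    # who supplied this (possibly third party) metadata


# Invented figures, for illustration only.
INSTANCES = [
    InstanceDescription("BLAST", "EBI", 0.80, "third_party_a"),
    InstanceDescription("BLAST", "NCBI", 0.90, "third_party_b"),
    InstanceDescription("SWISS-PROT", "EBI", 0.80, "third_party_a"),
]


def best_instance(service, min_reliability=0.0):
    """Which instance offering `service` currently has the best reliability?"""
    candidates = [i for i in INSTANCES
                  if i.service == service and i.reliability >= min_reliability]
    return max(candidates, key=lambda i: i.reliability, default=None)


def offered_by(provider):
    """What resources does a specific organisation provide?"""
    return sorted(i.service for i in INSTANCES if i.provider == provider)
```

The same class of service can thus resolve to different instances depending on whose annotations and which thresholds are applied.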
29 RDF-based UDDI metadata for service instances
30 User engagement
Workflow specifications
Discovery
Classes of Service
registry
Instantiate
Select instances
Support for the user to find a service that fulfils their task. The ontology should be fairly simple, couched in concepts the user is familiar with, e.g. protein sequence. Analogous to the DAML-S profile.
31 EMBOSS seqret
- Function that reads and writes (returns) sequences
- But it's so much more than that!
- EMBOSS programs can take a wide range of qualifiers that slightly change the behaviour of the program when reading or writing a sequence
- seqret can read a sequence or many sequences from databases, files, files of sequence names, the command-line or the output of other programs and then can write them to files, the screen or pass them to other programs.
- Because it can read in a sequence from a database and write it to a file, it's a program for extracting sequences from databases
- Because it can write the sequence to the screen, seqret is a program for displaying sequences.
32 And more.
- seqret can read sequences in any of a wide range of standard sequence formats. You can specify the input and output formats being used. If you don't specify the input format, it will try a set of possible formats until it reads it in successfully. Because you can specify the output sequence format, it's a program to reformat a sequence.
- seqret can read in the reverse complement of a nucleic acid sequence. So it's a program for producing the reverse complement of a sequence.
- seqret can read in a sequence whose begin and end positions you have specified and write out that fragment. So it's a utility for doing simple extraction of a region of a sequence.
- seqret can change the case of the sequence being read in to upper or to lower case. So it's a simple sequence beautification utility.
- seqret can do any combination of the above functions. ......
33 EMBOSS
- EMBOSS sequence alignment service matcher: a simple way to describe the task it fulfils is "matcher has_input sequence, performs_task aligning"
- some verb acting on some object to produce a result, and it fits most descriptions.
- Quickly gets more complicated.
- EMBOSS degap removes gap characters from a sequence.
- Where should the gap character concept be included? It is neither an input nor an output.
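The "verb acting on an object" pattern can be sketched as subject-property-value triples. `has_input` and `performs_task` come from the description above. `has_output` and `acts_on` are hypothetical properties added here only to show one possible home for the gap character concept; they are not a decision recorded in the slides.

```python
# Service descriptions as (subject, property, value) triples.
DESCRIPTIONS = {
    ("matcher", "has_input", "sequence"),
    ("matcher", "performs_task", "aligning"),
    ("degap", "has_input", "sequence"),
    ("degap", "has_output", "sequence"),     # assumed property
    ("degap", "performs_task", "removing"),
    # Assumption: attach the gap character as the object the task acts on,
    # since it is neither an input nor an output of the service.
    ("degap", "acts_on", "gap_character"),   # hypothetical property
}


def query(subject=None, prop=None, value=None):
    """Return triples matching every argument that is not None."""
    return sorted(
        t for t in DESCRIPTIONS
        if (subject is None or t[0] == subject)
        and (prop is None or t[1] == prop)
        and (value is None or t[2] == value)
    )


# Which services align things?
print(query(prop="performs_task", value="aligning"))
```

The simple pattern covers matcher with two triples, while degap already needs an extra property, which is the "quickly gets more complicated" point.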
34
- Several properties added over the DAML-S profile for bioinformatics
- e.g. uses_resource and uses_application.
- These could be simplified away either just as one additional property or a precondition as used in DAML-S.
- More obtuse to the user.
- Makes the model more complex or redundant for the benefit of the user.
- Reduces interoperability with service descriptions in other domains.
- Perhaps this redundancy should be encoded within the applications delivering the ontology and a more complex precondition description used under the hood?
35 EMBOSS matcher
- "protein sequence" is an ambiguous term and relies on implicit information held in the head of the bioinformatician.
- to reason over or organise concepts we need a more precise definition
- a data structure conforming to some schema that encodes the sequence of amino acids in a protein molecule.
- We can now start to infer the relationship between protein sequences and nucleotide sequences.
- But a user cannot be expected to interact with such a complex model.
36 Outcome: Views
- Multiple descriptions over same services and workflows held in registries
- Third party descriptions; subsets of services
- publication of descriptions must be supported both for the author of the service and third parties
- third party annotations are a view of a service and discovery should offer a variety of views based upon third party annotations
- there is a need for control over who may add and alter third party annotations
- Generic services supporting a wide variety of multiple tasks
- Middleware must have some way of going beyond a generic description and stating, given these inputs, what the outputs are going to be.
- Rather than author very complex descriptions that cater for all possibilities, it is better to author many simpler descriptions, one for each case.
- It may in fact be necessary to ask the service itself for specific answers, such as: given these inputs, what would you perform?
37 myGrid Find Service
Find Service diagram, repeated from slide 13.
38 Bio Services Problem Space
- Wrap CORBA, Perl etc. to look like web services, to become Grid services (eventually)
- Dialogue oriented (e.g. Soaplab) and function oriented (e.g. bioMOBY)
- Often highly parameterised
- Mixture of synchronous and asynchronous
- Simulations and feedback loops
- Streaming large scale data
- Mixture of binary and text
39 EMBOSS
- Suite of 200 command line programs, which uses a command definition language, ACD (AJAX Command Definition)
- How do we present these services?
- As 200 different services, one for each EMBOSS program, with a single method, with as many parameters as the EMBOSS program requires.
- As 200 different services, one for each EMBOSS program, with a number of overloaded methods where the program takes optional parameters.
- As a single service with 200 different methods, one for each EMBOSS program.
- As a single, highly parametric service, with a single method, called "invoke", the first parameter of which names the EMBOSS program to run.
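Two of the presentation options above, one service per program and a single parametric "invoke", can be contrasted in a short sketch. The program functions here are plain Python stand-ins that only echo their arguments; they do not run EMBOSS, and the class names and parameter names are illustrative assumptions.

```python
# Stand-ins for EMBOSS programs; the real ones are command-line tools.
def _transeq(sequence, frame="1"):
    return f"transeq(sequence={sequence}, frame={frame})"


def _getorf(sequence, minsize="30"):
    return f"getorf(sequence={sequence}, minsize={minsize})"


PROGRAMS = {"transeq": _transeq, "getorf": _getorf}


class ProgramService:
    """Option: one service per program, each with a single method."""

    def __init__(self, program):
        self.program = program

    def run(self, **params):
        return PROGRAMS[self.program](**params)


class EmbossService:
    """Option: one highly parametric service; the first parameter of
    invoke names the program to run."""

    def invoke(self, program, **params):
        if program not in PROGRAMS:
            raise KeyError(f"unknown program: {program}")
        return PROGRAMS[program](**params)


per_program = ProgramService("transeq").run(sequence="CATTACCC")
parametric = EmbossService().invoke("transeq", sequence="CATTACCC")
print(per_program == parametric)
```

Both options execute the same work. The difference is what a registry sees: many narrowly described services that are easy to discover, or one generic service whose description says little about any particular task.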
40 Workflow specifications
Discovery
Classes of Service
Instantiate
Select instances
Execution
Invoked instance
Workflow enactment
41 Invocation
Workflow specifications
Discovery
Classes of Service
Registry
Discovery Instantiate
Select instances
Registry?
Execution
Invoked instance
Workflow enactment
Monitor
Terminate
42 Phases
Support for middleware to perform tasks such as substitution, data transformation between services, and automatic invocation of services where the invocation model is not simple. A complex model to explicitly describe every implementation detail of the service or a binding to it. Analogous to the DAML-S process model and grounding.
Workflow specifications
Discovery
Classes of Service
Discovery Instantiate
Select instances
Execution
Invoked instance
Workflow enactment
Monitor
Terminate
43 Invocation models
- bioMoby forces services to have a single operation that completely encompasses the single task the service supports.
- Each task may in turn be supported by a single operation
- Soaplab: there is no one-to-one mapping between a single task and a single operation.
- Can repurpose a service to be presented multiple times, with a different wrapper for every view
- Proliferation of views
- Makes discovery easier
- Reasoning that it's the same service as one running
44 Soaplab version of matcher: alignment_localmatcherderived (wsdl)
- createEmptyJob
- get_detailed_status
- get_report
- get_outfile
- set_gappenalty
- set_sbegin1
- set_sbegin2
- set_send1
- set_send2
- set_sformat1
- set_sformat2
- set_slower1
- set_slower2
- set_snucleotide1
- set_snucleotide2
- set_sprotein1
- set_sprotein2
- set_sreverse1
- set_sreverse2
- set_sequenceb_usa
- set_gaplength
- set_alternatives
- run
- destroy
- getStatus
- describe
- getInputSpec
- getResultSpec
- getAnalysisType
- createJob
- runNotifiable
- createAndRun
- createAndRunNotifiable
- waitFor
- runAndWaitFor
- getResults
- terminate
- getLastEvent
45 Coordinating EMBOSS through Soaplab - WSFL
- for each task
- createJob(inputsMap)
- run(...)
- waitFor(...)
- getResults(...)
- destroy(...)
Workflow Engine
WSFL
46 Coordinating EMBOSS through Soaplab - Scufl
- for each task
- run(operation, inputs)
Soaplab plugin
Workflow Engine
Scufl
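The difference between the two coordination styles can be sketched with a mock service. The method names createJob, run, waitFor, getResults and destroy follow the list on the WSFL slide; `MockAnalysisService` and `plugin_run` are hypothetical, in-memory stand-ins rather than a real Soaplab client or the actual Soaplab plugin.

```python
class MockAnalysisService:
    """In-memory stand-in for a Soaplab-style stateful service."""

    def __init__(self, name):
        self.name = name
        self._jobs = {}
        self._next_id = 0

    def createJob(self, inputs):
        self._next_id += 1
        job_id = f"{self.name}-{self._next_id}"
        self._jobs[job_id] = {"inputs": dict(inputs), "status": "CREATED",
                              "results": None}
        return job_id

    def run(self, job_id):
        self._jobs[job_id]["status"] = "RUNNING"

    def waitFor(self, job_id):
        # A real service would block until the remote job finishes; here the
        # "analysis" completes immediately and reports its input names.
        job = self._jobs[job_id]
        job["results"] = {"report": f"{self.name} ran on {sorted(job['inputs'])}"}
        job["status"] = "COMPLETED"

    def getResults(self, job_id):
        return self._jobs[job_id]["results"]

    def destroy(self, job_id):
        del self._jobs[job_id]


def plugin_run(operation, inputs):
    """Plugin-style wrapper: one call per task, lifecycle handled inside."""
    job_id = operation.createJob(inputs)
    try:
        operation.run(job_id)
        operation.waitFor(job_id)
        return operation.getResults(job_id)
    finally:
        operation.destroy(job_id)   # always release the job state


matcher = MockAnalysisService("matcher")
result = plugin_run(matcher, {"sequence_a": "CATTACCC", "sequence_b": "CATTAGCC"})
print(result["report"])
```

In the first style the workflow itself spells out all five calls for every task. In the second, the workflow states only the task and the wrapper owns the state-based invocation model.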
47 Does the user ever see this?
- If the user never has to deal with the invocation model
- The DAML-S approach of splitting the information between two descriptions seems plausible.
- Once the user has used the simpler profile, the middleware gets to work on the more complex process model and binding, or a myGrid workflow, to actually translate the task into concrete service operation calls.
- If the user does want to know what is going to happen
- A more unified model with views for user and middleware seems more appropriate.
- The downside is the cost of implementing the infrastructure to deliver the views.
48 Summary: Views
- Two parallel but slightly redundant descriptions of the service
- one for human discovery and one for middleware.
- what DAML-S does.
- OR
- One common model which is complex and supports multiple tasks, but with an extra layer that provides a view to support each specific task
- intermediate representations, reasonables, perspectives, language generation.
- The user sees the term "protein sequence" even though the underlying concept is far more explicit.
- Transformed into the more complex pattern, the user may be prompted for attributes associated with the parent concept "data" even though the user never explicitly stated this was a kind of data.
- The view approach is used in GALEN and GONG.
- The DAML-S profile is probably too complex to present to bioinformatics users.
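The view idea, a familiar term on top of a more explicit underlying concept, can be sketched as a lookup plus inheritance of attributes from parent concepts. The concept names and attributes below are invented for illustration and are not taken from the myGrid, GALEN or GONG ontologies.

```python
# Hypothetical underlying model: each concept has a parent and its own attributes.
MODEL = {
    "data": {"parent": None, "attributes": ["format", "source"]},
    "sequence_data": {"parent": "data", "attributes": ["alphabet"]},
    "protein_sequence_data": {"parent": "sequence_data", "attributes": ["organism"]},
}

# User-facing view: familiar term -> underlying concept.
USER_VIEW = {"protein sequence": "protein_sequence_data"}


def prompts_for(user_term):
    """Attributes the user may be prompted for, most general concept first."""
    concept = USER_VIEW[user_term]
    attributes = []
    while concept is not None:
        attributes = MODEL[concept]["attributes"] + attributes
        concept = MODEL[concept]["parent"]
    return attributes


print(prompts_for("protein sequence"))
```

The user only ever typed "protein sequence", yet the prompts include attributes inherited from "data", which is the behaviour described in the bullets above.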
49 Summary 2: human vs machine views
Human
Machine
Service User
Weak semantic descriptions Rewriting views
UDDI style advertisements
Human
Syntactic descriptions Semantic mining
Elaborate semantic descriptions, Simplification views
Machine
Service provider
50 Discovery space
Classes and instances
Abstractions over a single description of a
service
Third party multiple viewpoints
People and machines
Multiple descriptions over a single service
Multiple tasks
51 Acknowledgements
Luc Moreau, Simon Miles, Keith Decker, Terry Payne, Phil Lord, Chris Wroe, Robert Stevens, Kevin Garwood
http://www.mygrid.org.uk/