Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature

About This Presentation
Title:

Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature

Slides: 38
Provided by: mcas4
Learn more at: http://carbon.videolectures.net
Transcript and Presenter's Notes

Title: Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature


1
Mining Semantic Descriptions of Bioinformatics
Web Resources from the Literature
  • Hammad Afzal, Robert Stevens, Goran Nenadic
  • School of Computer Science
  • University of Manchester

G.Nenadic_at_manchester.ac.uk
2
Motivation
  • A number of bioinformatics tools and resources
    available for service use and composition
  • a guesstimate is 3000 publicly available Web
    Services
  • how to find a service, what is out there to use?
  • provenance?
  • Semantic annotation of bioinformatics services
  • annotate functional capabilities
  • e.g. Taverna, myGrid, myExperiment, EBI, BioMOBY
  • Not only services and tools
  • databases, repositories, corpora

3
Motivation
  • Manual curation
  • e.g. myGrid, BioCatalogue etc.
  • e.g. Taverna/Feta only 15-20 functionally
    described
  • backlog and the number of services is growing
  • Annotations combine
  • textual descriptions
  • ontological mappings

4
Example
text
  • multiple local align.
  • Soaplab

ontological descriptions
5
BioCatalogue
  • Single registration point for Web Service
    providers
  • Single search site for scientists and developers
  • Place where the community can find contacts and
    meet the experts and maintainers of these
    services
  • Community-sourced annotation, expert oversight
  • Mixed annotations: free text, tags, controlled
    vocabularies, community ontologies

6
BioCatalogue
Beta version at http://beta.biocatalogue.org/
Launch June 2009 at ISMB
7
Our approach
  • Collect service semantic descriptions by
    extracting and integrating information from text
    resources
  • full text bioinformatics journal publications
  • Approach
  • identify descriptors that are used for service
    and resource annotations
  • locate them in text
  • infer the annotations
  • textual evidence and mappings to an ontology

8
The rest of the talk
  • Methodology
  • mining bioinformatics terminology
  • extraction of service description profiles
  • Experiments and results
  • semi-automated curation
  • What next?

9
Methodology
10
Bioinformatics terminology
Learn bioinformatics terms from literature
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples
11
Bioinformatics terminology
  • Use seed terms to bootstrap
  • e.g. known descriptors used in existing service
    descriptions, either in literature or service
    repositories
  • 250 terms identified, manual pruning after
    automatic term recognition
  • examples of lexical constituents and textual
    behaviour (pragmatics)
  • lexical profiling
  • contextual profiling

12
Bioinformatics terminology
  • Lexical profiling
  • what is in the name
  • Contextual profiling
  • characterise sentences in which terms appear
    (nouns, verbs and context-patterns)
  • Comparing candidate term profiles to
  • average seed term
  • best-match
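
The two comparison strategies above (average seed term and best-match) could be sketched as follows. This is only an illustration under stated assumptions: the talk does not say how profiles are represented or compared, so the sketch assumes a profile is a bag of context-word counts and uses cosine similarity. The function names and the toy profiles are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def average_seed_score(candidate: Counter, seeds: list) -> float:
    """Compare the candidate to a single pooled seed profile.

    Summing the seed profiles gives the same cosine as averaging them,
    because cosine similarity ignores vector length.
    """
    pooled = Counter()
    for seed in seeds:
        pooled.update(seed)
    return cosine(candidate, pooled)

def best_match_score(candidate: Counter, seeds: list) -> float:
    """Compare the candidate to its most similar individual seed profile."""
    return max((cosine(candidate, seed) for seed in seeds), default=0.0)

# Toy profiles (assumed data, not taken from the talk).
seed_profiles = [
    Counter({"align": 3, "sequence": 2, "use": 1}),
    Counter({"search": 2, "database": 3, "use": 1}),
]
candidate = Counter({"align": 1, "sequence": 1})

avg_score = average_seed_score(candidate, seed_profiles)
best_score = best_match_score(candidate, seed_profiles)
```

In this toy case the best-match score is higher than the average-seed score, because the candidate resembles one seed closely and the other not at all. Which strategy ranks candidates better in practice is an empirical question the sketch does not answer.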

13
Bioinformatics terminology
  • Two domain experts evaluated the top 300 terms

14
Semantic classes: myGrid
  • Informatics concepts
  • general concepts of data, data structures,
    databases, metadata
  • Bioinformatics concepts
  • domain-specific data sources and algorithms for
    searching and analysing data
  • e.g. Smith-Waterman algorithm

15
Semantic classes: myGrid
  • Molecular biology concepts
  • higher level concepts used to describe
    bioinformatics data types, used as inputs and
    outputs in services
  • e.g. protein sequence, nucleic acid sequence
  • Task concepts
  • generic tasks a service operation can perform
  • e.g. retrieving, displaying, aligning

16
Semantic classes
  • Engineered from the myGrid bioinformatics sub-ontology

Class: examples
Algorithm: SigCalc algorithm, CHAOS local alignment, SNP analysis, KEGG Genome-based approach, GeneMark method, K-fold cross validation procedure
Application: PreBIND Searcher program, Apollo2Go Web Service, FLIP application, Apollo Genome Annotation curation tool, GenePix software, Pegasys system
Data: GeneBank record, Genome Microbial CoDing sequences, Drug Data report
Data resource: PIR Protein Information Resource, BIND database, TIGR dataset, BioMOBY Public Code repository
17
Semantic classes and instances
18
Semantic classes and instances
19
Service mentions
  • Named-entity recognition (NER) task
  • Recognition of service mentions using
  • terminological (semantic) heads of automatically
    recognised terms
  • Apollo2Go Web Service is an Application
  • BIND database is a Data resource
  • assign the corresponding semantic class
  • Hearst patterns (co-ordinations, appositions,
    enumerations, etc.)
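
Assigning a semantic class from a term's head could look roughly like the sketch below. It assumes the head of an English noun phrase is its last word, which is a simplification, and it uses a small hypothetical `HEAD_TO_CLASS` table loosely based on the class examples in the talk. The actual head inventory and the Hearst-pattern step are not reproduced here.

```python
from typing import Optional

# Hypothetical head-word table; a real one would be derived from the
# recognised terminology and would be much larger.
HEAD_TO_CLASS = {
    "algorithm": "Algorithm",
    "method": "Algorithm",
    "procedure": "Algorithm",
    "program": "Application",
    "service": "Application",
    "tool": "Application",
    "software": "Application",
    "application": "Application",
    "system": "Application",
    "record": "Data",
    "report": "Data",
    "database": "Data resource",
    "dataset": "Data resource",
    "repository": "Data resource",
    "resource": "Data resource",
}

def classify_mention(term: str) -> Optional[str]:
    """Return the semantic class implied by the term's head word, or None."""
    words = term.strip().split()
    if not words:
        return None
    head = words[-1].lower()  # assumption: the rightmost word is the head
    return HEAD_TO_CLASS.get(head)

apollo_class = classify_mention("Apollo2Go Web Service")
bind_class = classify_mention("BIND database")
```

A mention whose head is not in the table, such as a bare name like "Kalign", gets no class from this step and would need another source of evidence, for example the Hearst patterns mentioned above.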

20
Semantic descriptors
  • Recognition of phrases depicting semantic roles
  • used to describe services
  • Flexible dictionary look-up
  • terms from myGrid ontology
  • terms/noun phrases from existing descriptions of
    bioinformatics resources (collected from Taverna
    and other Web service providers).
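
One way to read "flexible dictionary look-up" is matching after light normalisation. The sketch below assumes that flexibility means ignoring case, hyphenation and a simple plural "s", and that the longest matching descriptor wins; the talk does not define these details. All function names are hypothetical.

```python
import re

def normalise(phrase: str) -> str:
    """Lower-case, split on non-alphanumerics, and strip a naive plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    stemmed = [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]
    return " ".join(stemmed)

def build_index(descriptors):
    """Map each normalised descriptor to its original dictionary form."""
    return {normalise(d): d for d in descriptors}

def find_descriptors(sentence: str, index: dict, max_len: int = 4):
    """Greedy longest-match scan of the sentence against the index."""
    tokens = normalise(sentence).split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])
            if key in index:
                found.append(index[key])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no descriptor starts at this token
    return found

index = build_index(["protein sequence", "nucleic acid sequence", "multiple alignment"])
hits = find_descriptors(
    "The service aligns Protein-Sequences and nucleic acid sequences.", index
)
```

The plural stripping is deliberately crude (it turns "analysis" into "analysi"), but because dictionary entries and text pass through the same function, both sides still agree.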

21
Mining service descriptions
22
Extraction/functional rules
  • Predicate-driven rules: each verb is associated
    with the type of information content it provides

23
Extraction/functional rules
  • Manually designed predicate-driven rules: Subject
    (Arg) Verb (Predicate) Object (Arg)
  • Applied on dependency-parsed sentences
  • Stanford parser
  • no phrase structures
  • complex sentences
  • information in sub-clause
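
A Subject-Verb-Object rule of this kind could be sketched as below. The sketch does not call a parser: it assumes the sentence has already been parsed into (relation, head, dependent) triples using the `nsubj` and `dobj` relation names from Stanford typed dependencies. The `PREDICATE_TYPES` table and its four entries are hypothetical stand-ins for the manually designed predicate list.

```python
# Hypothetical predicate table: verb lemma -> kind of information it signals.
PREDICATE_TYPES = {
    "predict": "function",
    "align": "function",
    "take": "input",
    "return": "output",
}

def apply_rules(dependencies, lemmas):
    """Extract (service, type, predicate, value) facts from one sentence.

    dependencies: list of (relation, head, dependent) triples.
    lemmas: dict mapping verb tokens to their lemmas.
    """
    subjects, objects = {}, {}
    for relation, head, dependent in dependencies:
        if relation == "nsubj":
            subjects[head] = dependent
        elif relation == "dobj":
            objects[head] = dependent

    facts = []
    for verb, subject in subjects.items():
        lemma = lemmas.get(verb, verb)
        info_type = PREDICATE_TYPES.get(lemma)
        # Fire the rule only when the verb is a known predicate with an object.
        if info_type is not None and verb in objects:
            facts.append({
                "service": subject,
                "type": info_type,
                "predicate": lemma,
                "value": objects[verb],
            })
    return facts

# Hand-written parse of "GeneClass predicts expression" (assumed input).
deps = [("nsubj", "predicts", "GeneClass"), ("dobj", "predicts", "expression")]
facts = apply_rules(deps, {"predicts": "predict"})
```

Because the rule sees only single head words, it returns "expression" rather than a full phrase such as "differential gene expression". This is the kind of gap that the limitation "no phrase structures" refers to.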

24
Extraction/functional rules
  • Phrase structures identified and integrated with
    the dependency
  • Predicate-dependent rules applied to extract
    specific content and profile the services
  • Profiles collated for all mentions
  • service name variation

25
Semantic service profiles
  • For a given service, collection of
  • descriptors, including parameters
  • links to other related instances
  • related myGrid ontology semantic labels
  • informative sentences
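
A service profile with these four parts, collated across mentions, could be represented as in the sketch below. The field names, the `canonical` name normaliser and the mention dictionaries are all hypothetical. In particular, handling name variation by dropping a trailing class word is only one simple assumption about how variants might be merged.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceProfile:
    name: str
    descriptors: set = field(default_factory=set)      # incl. parameters
    related: set = field(default_factory=set)          # links to other instances
    ontology_labels: set = field(default_factory=set)  # myGrid semantic labels
    sentences: list = field(default_factory=list)      # informative sentences

CLASS_WORDS = {"algorithm", "program", "tool", "database", "service"}

def canonical(name: str) -> str:
    """Naive variant merging: case-fold and drop a trailing class word."""
    words = name.lower().split()
    if len(words) > 1 and words[-1] in CLASS_WORDS:
        words = words[:-1]
    return " ".join(words)

def collate(mentions):
    """Merge per-mention extractions into one profile per service."""
    profiles = {}
    for mention in mentions:
        key = canonical(mention["name"])
        profile = profiles.setdefault(key, ServiceProfile(name=key))
        profile.descriptors.update(mention.get("descriptors", ()))
        profile.related.update(mention.get("related", ()))
        profile.ontology_labels.update(mention.get("labels", ()))
        sentence = mention.get("sentence")
        if sentence and sentence not in profile.sentences:
            profile.sentences.append(sentence)
    return profiles

mentions = [
    {"name": "GeneClass algorithm", "descriptors": ["motif"], "sentence": "s1"},
    {"name": "GeneClass", "descriptors": ["expression data"], "sentence": "s2"},
]
profiles = collate(mentions)
```

Both mentions end up in a single profile keyed `geneclass`, with the union of their descriptors and both sentences kept in order of appearance.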

26
Example: GeneClass
  • Descriptors

27
Example: GeneClass
  • Functions, parameters

28
Example: GeneClass
  • Sentences
  • We extend the original GeneClass algorithm to
    use all target genes for which both motif and
    expression data is available.
  • In order to study different aspects of target
    gene regulation we use different sets of motifs
    and parents with the GeneClass algorithm.
  • The GeneClass algorithm for predicting
    differential gene expression starts with a
    candidate set of motifs representing known or
    putative regulatory element sequence patterns and
    a candidate set of regulators or parents.

29
Experiments
  • 2120 BMC Bioinformatics articles
  • full-text articles before March 2008
  • Service descriptors dictionary
  • 471 descriptors from myGrid/Feta
  • 450 descriptors collected from other
    bioinformatics service/tools providers
  • 108 predicates used

30
Experiments
  • Number of candidate resources

31
Experiments
  • Number of descriptions collected using rules

32
Evaluation of semantic profiles
  • Evaluated for their capability to be used for
    semantic description of a given bioinformatics
    resource
  • irrelevant
  • partially useful
  • useful

HeatMapper: The HeatMapper tool has already proven to be very useful in several studies.
Kalign: To compare Kalign to other MSA programs, the following test sets were used.
Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program.
33
Evaluation of semantic profiles
  • Two experiments
  • 5 well-known resources with descriptions already
    available
  • excellent rating for sentences
  • average rating for semantic descriptors
  • predicate functions
  • 5 new, unknown resources
  • excellent rating for sentences
  • average rating for semantic descriptors
  • predicate functions

34
What next?
  • Good recall, poor precision
  • context needs a better model
  • Mining parameter values
  • sub-language of parameters
  • Candidate service/resource mentions
  • an entity whose profile looks like a service
  • comparison of semantic profiles
  • network of services ISMB 2009
  • Do we have good service ontologies?

http://gnode1.mib.man.ac.uk/bioinf
35
Conclusion
  • Literature mining approach to service description
    and annotation
  • Aims
  • reduce curation efforts
  • provide semantic synopses of services for the
    Semantic Web
  • Potential of text mining
  • integration with other annotation approaches
  • extracting the entire service context is still
    challenging

36
Acknowledgements
  • gnTEAM (text extraction, analytics, mining): H.
    Yang, I. Spasic, H. Afzal, A. Gledson, J. Eales,
    M. Greenwood, F. Sarafraz
  • myGrid team: Franck Tanoh
  • BBSRC
  • Mining term associations from literature to
    support knowledge discovery in biology
    (2005-2008)
  • pubmed2ensembl (2009-2010)
  • BioCatalogue (2008-2011)

37
Announcement
  • Journal of BioMedical Semantics
  • published by BioMed Central
  • launched at ISMB 2009
  • Topics include
  • Infrastructure for biomedical semantics
  • semantic resources and repositories
  • meta-data management and resource description
  • knowledge representation and semantic frameworks
  • Biomedical Semantic Web
  • life-long management of semantic resources
  • Semantic mining, annotation and analysis