GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

1
GO Tag: Assigning Gene Ontology Labels to Medline
Abstracts
  • Robert Gaizauskas
  • Natural Language Processing Group
  • Department of Computer Science

2
GO Tag: Assigning Gene Ontology Labels to Medline
Abstracts
  • Robert Gaizauskas
  • N. Davis, Y.K. Guo, H. Harkema
  • Natural Language Processing Group
  • Department of Computer Science

J. Ratcliffe
3
Outline
  • Context
  • Project Background
  • The Gene Ontology
  • GO Annotation in Model Organism Databases
  • Medline
  • GO Tagging Tasks
  • User types/scenarios
  • Possible tasks
  • Related Work
  • Data sets/Gold Standards
  • Approaches and Results to Date
  • Lexical lookup
  • Vector Space Similarity
  • Machine Learning
  • Exploiting the Results in Search Tools

4
Project Background
  • Work is funded by the EPSRC as a Best Practice
    Project for collaboration between DiscoveryNet
    and myGrid -- E-Science Pilot Projects (2001-5)
  • Both projects
  • have developed text mining and data analysis
    components -- complementary approaches: NLP vs.
    data mining/statistical analysis
  • have developed workflow models for co-ordinating
    distributed services
  • are working on life science applications
  • Aim to develop a unified real-time e-Science
    text-mining infrastructure that builds upon and
    extends the technologies and methods developed by
    both Discovery Net and myGrid
  • Software engineering challenge: integrate
    complementary service-based text mining
    capabilities with different metadata models into
    a single framework
  • Application challenge: annotate biomedical
    abstracts with semantic categories from the Gene
    Ontology

5
The Gene Ontology
  • The Gene Ontology project provides a controlled
    vocabulary to describe gene and gene product
    attributes in any organism
    (http://www.geneontology.org/)
  • Consists of three structured, controlled
    vocabularies (ontologies) that describe gene
    products in terms of their associated
  • biological processes
  • cellular components
  • molecular functions
  • in a species-independent manner
  • E.g. gene product cytochrome c can be described
    by
  • the molecular function term electron transporter
    activity
  • the biological process terms oxidative
    phosphorylation and induction of cell death
  • the cellular component terms mitochondrial matrix
    and mitochondrial inner membrane

6
Gene Ontology (cont)
(Figure from: Gene Ontology: tool for the unification of
biology. The Gene Ontology Consortium (2000)
Nature Genet. 25: 25-29.)
7
The Gene Ontology (cont)
  • Started as a joint effort between three model
    organism databases (FlyBase (Drosophila), the
    Saccharomyces Genome Database (SGD) and the Mouse
    Genome Database (MGD))
  • GO now (08/11/05) contains 19,022 terms
  • GO Slim(s) are reduced versions of GO ontologies
    containing a subset of GO terms
  • Aim to give a broad overview of ontology content
  • GO Slim Generic currently contains 127 terms
  • A typical GO Term

Term name: isotropic cell growth
Accession: GO:0051210
Ontology: biological_process
Synonyms: related: uniform cell growth
Definition: The process by which a cell irreversibly increases in
size uniformly in all directions. In general, a rounded cell
morphology reflects isotropic cell growth.
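GO is distributed as an OBO-format flat file in which each term is a [Term] stanza carrying the fields shown above. A minimal reading sketch in Python (the file name is illustrative; only the standard id/name/namespace/def/synonym tags are handled):

```python
def read_go_terms(path="gene_ontology.obo"):
    """Parse the [Term] stanzas of a GO OBO file into simple dicts."""
    terms, current = [], None
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if line.startswith("["):
                # New stanza; only [Term] stanzas are of interest here.
                current = {"synonyms": []} if line == "[Term]" else None
                if current is not None:
                    terms.append(current)
            elif current is not None and ": " in line:
                tag, value = line.split(": ", 1)
                if tag == "synonym":
                    # e.g. synonym: "uniform cell growth" RELATED []
                    current["synonyms"].append(value.split('"')[1])
                elif tag in ("id", "name", "namespace", "def"):
                    current[tag] = value
    return terms
```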
8
GO Annotation in Model Organism DBs
  • Model organism DBs typically record for each
    entry (gene) one or more GO codes plus links to
    the literature supporting the assignment of the
    GO code
  • E.g. from the Saccharomyces Genome Database (SGD)

IC: Inferred by Curator
IDA: Inferred from Direct Assay
IEA: Inferred from Electronic Annotation
IEP: Inferred from Expression Pattern
IGI: Inferred from Genetic Interaction
IMP: Inferred from Mutant Phenotype
IPI: Inferred from Physical Interaction
ISS: Inferred from Sequence or Structural Similarity
NAS: Non-traceable Author Statement
ND: No biological Data available
RCA: Inferred from Reviewed Computational Analysis
TAS: Traceable Author Statement
NR: Not Recorded
9
PubMed
  • PubMed
  • on-line bibliographic database designed to
    provide access to citations from biomedical
    literature
  • developed by the US NCBI at the NLM
  • Contains Medline, OldMedline, various other
    sources
  • Medline
  • Over 12 million citations dating back to the 1960s
  • Author abstracts and citations from > 4,800
    biomedical journals

10
PubMed
  • Entrez is NCBI's integrated, text-based search
    and retrieval system for the major databases it
    maintains

11
Outline
  • Context
  • Project Background
  • The Gene Ontology
  • GO Annotation in Model Organism Databases
  • Medline
  • GO Tagging Tasks
  • User types/scenarios
  • Possible tasks
  • Related Work
  • Data sets/Gold Standards
  • Approaches and Results to Date
  • Lexical lookup
  • Vector Space Similarity
  • Machine Learning
  • Exploiting the Results in Search Tools

12
User Types
  • Research Geneticists
  • Narrow information interest
  • Particular gene
  • Particular activity/functionality
  • Model Organism Genome DB Curators
  • Broader information interest
  • Typically track a number of publications, seeking
    to enhance information stored in the model
    organism genome DB at the locus level

13
User Scenarios: Research Geneticist
  • Possible scenarios using GO tagging to support a
    research geneticist include
  • Search result presentation
  • Tag abstracts returned from a PubMed search with
    GO codes
  • Use GO codes to cluster/structure search results
    to support more effective information access
  • Structuring of related literature as workflow
    side-effect
  • Many typical researcher workflows involve BLAST
    searches yielding BLAST/Swissprot reports
  • Workflow can automatically assemble a set of
    related papers by extracting PMIDs of
    homologous genes/proteins from reports and
    collecting these abstracts plus, optionally,
    others closely related by text similarity
  • Resulting abstract set can be clustered/structured
    by GO terms and presented to researcher
  • (Integrating Text Mining Services into
    Distributed Bioinformatics Workflows: A Web
    Services Implementation. Gaizauskas, Davis,
    Demetriou, Guo and Roberts. In Proceedings of the
    IEEE International Conference on Services
    Computing (SCC 2004), 2004.)

15
Search Result Presentation: Motivating Example
  • One of the genes involved in the cognitive/social
    elements of Williams-Beuren syndrome is LIM
    Kinase 1 (LIMK1/LIMK-1)
  • Putting LIM Kinase into Entrez gives 146 possible
    papers of interest.
  • However, searching the model organism corpus for
    LIM Kinase yields only 5 papers but a high number
    of associated GO codes (and this is from only
    partially annotated papers)
  • Suggests that even a single gene may be involved
    in numerous roles, and that clustering according
    to GO codes (see the grouping sketch after the
    code list below) may give a more focused method
    of searching than simply supplying more and more
    keywords, which may remove useful and important
    papers from the result set.

GO:0006468  biological_process  protein amino acid phosphorylation
GO:0004674  molecular_function  protein serine/threonine kinase activity
GO:0004672  molecular_function  protein kinase activity
GO:0007283  biological_process  spermatogenesis
GO:0008064  biological_process  regulation of actin polymerization and/or depolymerization
GO:0005515  molecular_function  protein binding
GO:0005634  cellular_component  nucleus
GO:0005925  cellular_component  focal adhesion
GO:0005515  molecular_function  protein binding
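A minimal sketch of the grouping idea referred to above: once each abstract in a result set carries GO codes, the results can be presented per code rather than pruned with extra keywords (the PMIDs and assignments below are purely illustrative):

```python
from collections import defaultdict

# Illustrative search results: PMID -> GO codes assigned to that abstract.
results = {
    "15123456": ["GO:0004674", "GO:0005925"],
    "15234567": ["GO:0007283"],
    "15345678": ["GO:0004674"],
}

# Invert the mapping so the user can browse one role (GO code) at a time.
by_code = defaultdict(list)
for pmid, codes in results.items():
    for code in codes:
        by_code[code].append(pmid)

for code, pmids in sorted(by_code.items()):
    print(code, pmids)
```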
17
User Scenarios: Model Organism DB Curator
  • Possible scenarios using GO tagging/text mining
    to support DB curators include
  • Help assemble texts that may support GO code
    assignment
  • GO tag texts in the curator's watching brief
  • Automated tagging could act as a prompt for/check
    on the curator's judgement
  • Help to determine gene-GO term pairs for
    annotation
  • Perform GO tagging/gene name identification at
    text level and suggest all pairs as candidates
  • Perform GO tagging/gene name identification at
    sentence level and suggest candidates
  • Attempt to assign GO evidence codes
  • To text segments providing evidence for GO code
    assignment, without identifying the GO code/gene
    pair to which the evidence pertains
  • To text segments providing evidence, plus the GO
    code/gene pair to which the evidence pertains

18
Possible Tasks (1)
  • Assigning GO codes to abstracts/full papers
  • Given a set of texts (PubMed abstracts/full
    papers) and the GO/GO Slim ontology
  • Task: assign 0 or more GO codes to a text iff
    the text is about the function/process/component
    identified by the code (assume only the most
    specific code is assigned)
  • Note in this task there is no association of GO
    code with any specific gene/gene product

19
Possible Tasks (2)
  • Assigning GO codes to genes/gene products in
    abstracts/full papers
  • Given a set of texts (PubMed abstracts/full
    papers) and the GO/GO Slim ontology
  • Task: if the text supports the assignment of one
    or more GO codes to a gene/gene product, identify
    gene/gene product-GO code pairs and the text
    supporting the assignments
  • This capability would support additional tasks
  • Given a particular gene/gene product and a text
    collection, find all GO codes for the gene/gene
    product across the collection
  • Given a GO code and a text collection, find all
    genes/gene products tagged with the code across
    the collection

20
Possible Tasks (3)
  • Assigning evidence codes to gene/gene
    product-GO code pairings in abstracts/full
    papers
  • Given a set of texts (PubMed abstracts/full
    papers) and the GO/GO Slim ontology
  • Task: as in Task 2, but additionally supply the
    evidence codes
  • A weaker variant of this is just to suggest
    evidence text that may assist in the assignment
    of a GO code

21
Related Work
  • Raychaudhuri, Chang, Sutphin & Altman (2002)
  • Task: associate GO codes with genes by
  • Associating GO codes with papers
  • Associating a specific GO code with a gene if a
    sufficient number of papers mentioning the gene
    have the GO code associated with them
  • Method: treat 1. as a document classification
    task and evaluate maximum entropy, Naïve Bayes
    and Nearest Neighbours approaches
  • Evaluation: corpus of 20,000 Medline abstracts
    assigned one or more of 21 GO terms/categories
  • Results: maximum entropy best -- 72.8%
    classification accuracy over 21 categories

22
Related Work
  • GO-KDS (Smith & Cleary, 2003)
  • Product of Reel Two
  • Task: assign arbitrary GO terms to PubMed
    articles
  • Method
  • Proprietary Weighted Confidence learner (similar
    to Naïve Bayes), using only words as features
  • trained on gene/protein DBs which use GO codes
    AND have links to Medline
  • Evaluated on approx. the same data/task as
    Raychaudhuri et al. -- 70.5% accuracy

23
Related Work
  • GoPubMed
  • On-going work at Dresden University
    (www.gopubmed.org)
  • Task: annotate PubMed abstracts with GO terms
  • Method: use a local sequence alignment algorithm
    with weighted term matching (to overcome limits
    of strict matching) between GO terms and strings
    in texts
  • Evaluation: none reported
  • Kiritchenko et al. (U. of Ottawa)
  • Task: assign arbitrary GO terms to biomedical
    texts
  • Method
  • Treat the task as hierarchical text classification
  • use AdaBoost.MH
  • Evaluation
  • introduce a hierarchical evaluation measure
  • Results: unclear

24
Related Work (cont)
  • Biocreative challenge -- task 2 contained three
    related subtasks
  • Given an article, a protein and a GO code, where
    the article justifies the assignment of the GO
    code to the protein, find evidence text in the
    article supporting the assignment
  • Given protein-article pairs plus the number of GO
    code assignments supported by the article, find
    the GO code(s) that should be assigned to the
    protein based on the article
  • Given a set of proteins, retrieve a set of papers
    relevant to assigning codes to the proteins plus
    the GO code annotations and the supporting
    passages (not evaluated)
  • Results indicated no systems are ready for
    practical use
  • Issues: lack of training data; complexity of
    tasks

25
Related Work (cont)
  • TREC Genomics Track 2004 -- three tasks related
    to GO code assignment
  • Triage -- given a set of articles, find those that
    contain some evidence for the assignment of a GO
    code, i.e. warrant being curated
  • Given an article and the names of genes occurring
    in the article, assign one or more of the three
    top-level GO ontologies from which human curators
    had assigned codes
  • As Task 2, plus provide the evidence code
    supporting each gene-GO hierarchy label association
  • Results for all three tasks were poor

26
Outline
  • Context
  • Project Background
  • The Gene Ontology
  • GO Annotation in Model Organism Databases
  • Medline
  • GO Tagging Tasks
  • User types/scenarios
  • Possible tasks
  • Related Work
  • Data sets/Gold Standards
  • Approaches and Results to Date
  • Lexical lookup
  • Vector Space Similarity
  • Machine Learning
  • Exploiting the Results in Search Tools

27
Data Sets and Evaluation
  • In order to assess performance of GO tag
    assignment, a gold standard manually
    annotated/verified corpus is needed
  • However, no such corpus exists

28
Data Sets and Evaluation
  • Solution 1: SGD Gold Standard
  • Derive a corpus from the SGD model organism
    database (yeast)
  • Assemble all Medline abstracts cited as evidence
    supporting assignment of GO terms
  • Associate with each abstract the GO term whose
    assignment it is cited as supporting
  • I.e. given the annotated genes in SGD, assign a
    GO term T to a paper P if the paper P is
    referenced in support of a Gene-GO term
    association involving T (a derivation sketch
    follows below)
  • SGD Gold Standard
  • 4,922 PMIDs
  • 2,455 GO terms
  • 10,485 PMID-GO term pairs
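A sketch of that derivation, assuming a GAF-style tab-separated SGD annotation file (GO ID in column 5, literature references in column 6); the file name and exact column layout should be checked against the actual SGD download:

```python
from collections import defaultdict

def sgd_gold_standard(path="gene_association.sgd"):
    """Collect PMID -> GO term pairs from a GAF-style annotation file."""
    pairs = defaultdict(set)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("!"):            # skip comment/header lines
                continue
            cols = line.rstrip("\n").split("\t")
            go_id, refs = cols[4], cols[5]      # GO ID and DB:Reference field
            for ref in refs.split("|"):
                if ref.startswith("PMID:"):
                    # The cited abstract inherits the GO term it supports.
                    pairs[ref[len("PMID:"):]].add(go_id)
    return pairs                                # {pmid: {GO ids}}
```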

29
Data Sets and Evaluation: SGD Gold Standard
  • Advantages
  • Data already exists -- no extra annotation work
    required
  • Can assemble similar corpora for each model
    organism DB
  • Disadvantages
  • Each abstract has associated with it GO terms
    whose assignment to specific genes it supports,
    but may be missing other GO terms which can also
    be legitimately attached to it
  • Not every paper supporting a GO term assignment
    will be cited
  • Consequence
  • SGD gold standard is GO term incomplete
  • Weak measure of recall
  • Precision figures difficult to interpret

30
Data Sets and Evaluation: SGD Gold Standard
  • Further issue
  • SGD Gene-GO term assignments are based on full
    papers, whereas system only has access to
    abstracts
  • Consequence
  • Limit on maximum Recall obtainable by system

31
Data Sets and Evaluation (cont)
  • Solution 2: IC Gold Standard
  • Manually extend the GO annotation of abstracts
    derivable from the SGD
  • Goal: a GO-term-complete gold standard
  • Selected a subset (800) for which support for
    all the assigned GO codes is found in the
    abstract (rather than the full paper)
  • Manually added additional GO annotations using a
    combination of fuzzy matching against GO (a
    fuzzy-matching sketch follows below) and some
    manual addition of synonyms during checking
  • For included terms, include the lowest term
    within each ontology
  • e.g. prefer cell wall biosynthesis over the more
    general cell wall
  • Also applied same methodology to evidence
    paragraphs -- brief summaries written by curators
    deliberately using GO vocabulary
  • IC Gold Standard
  • 785 PMIDs
  • 1,006 GO terms
  • 5,170 PMID-GO term pairs
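The fuzzy matching step is not described in detail; as a stand-in, the sketch below uses Python's difflib to suggest GO term names that approximately match word n-grams of an abstract (the cutoff and window sizes are illustrative):

```python
import difflib

def fuzzy_go_matches(abstract, go_names, cutoff=0.85):
    """Suggest GO term names approximately matching the abstract text."""
    words = abstract.lower().split()
    suggestions = set()
    for size in (1, 2, 3, 4):                   # n-gram window sizes
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size])
            for hit in difflib.get_close_matches(candidate, go_names,
                                                 n=1, cutoff=cutoff):
                suggestions.add(hit)
    return suggestions

# Illustrative example
names = ["cell wall biosynthesis", "cell wall"]
print(fuzzy_go_matches("Genes required for cell-wall biosynthesis ...", names))
```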

32
Data Sets and Evaluation (cont)
  • Advantages
  • Much closer to a GO-term complete gold standard
  • Disadvantages
  • Still not GO-term complete
  • Method of creation suggests there may still be
    many unannotated GO terms that ought to be marked
    (direct mentions of GO terms vs. semantically
    entailed GO terms)
  • Gold Standard creation method favors lexical
    look-up approach to GO-tagging
  • Dataset is small

33
Outline
  • Context
  • Project Background
  • The Gene Ontology
  • GO Annotation in Model Organism Databases
  • Medline
  • GO Tagging Tasks
  • User types/scenarios
  • Possible tasks
  • Related Work
  • Data sets/Gold Standards
  • Approaches and Results to Date
  • Lexical lookup
  • Vector Space Similarity
  • Machine Learning
  • Exploiting the Results in Search Tools

34
The GO Tagging Task Addressed
  • The approaches we investigated all considered
    Task 1, as defined earlier
  • Given a set of texts (PubMed abstracts/full
    papers) and the GO/GO Slim ontology
  • Task: assign 0 or more GO codes to a text iff
    the text is about the function/process/component
    identified by the code (assume only the most
    specific code is assigned)

35
Approach 1: Lexical Lookup Using Termino
  • Termino: a large-scale terminological resource to
    support term processing for information
    extraction, retrieval, and navigation
  • Termino contains a database holding large numbers
    of terms imported from various existing
    terminological resources, e.g., UMLS, GO
  • Efficient recognition of terms in text is
    achieved through use of finite state recognizers
    compiled from contents of database
  • The results of lexical look-up in Termino can
    feed into further term processing components,
    e.g., term parser
  • Available as a Web Service (see
    http://nlp.shef.ac.uk)

36
Termino Terminology Engine
37
Lexical Look-Up for GO Tag
  • Termino
  • Imported names of all terms in GO, plus their GO
    IDs and namespace attributes (18,270 names in
    total)
  • GO term synonyms
  • SGD yeast gene names
  • Recognition of terms in text
  • Case-insensitive
  • Simple morphological variants are recognized
  • Cells mapped onto cell
  • Mitochondrial, mitochondria not mapped onto
    mitochondrion

38
Lexical Look-Up for GO Tag (cont)
  • GO code assignment
  • GO term T is assigned to a text iff the name of T
    is recognized in the text
  • Extensions
  • GO term T is assigned to a paper if a synonym
    of term T occurs in the abstract of the paper
  • GO term T is assigned to a paper if a yeast gene
    name associated with term T occurs in the
    abstract of the paper (a minimal look-up sketch
    follows below)
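Not the Termino implementation itself, but a minimal sketch of the look-up rule just stated: a GO term is assigned when its name (or a synonym) is found in the abstract, case-insensitively, with a crude plural stripping standing in for Termino's morphological normalisation:

```python
import re

def normalise(text):
    # Lower-case and strip a trailing plural 's' (cells -> cell) -- a very
    # crude stand-in for Termino's morphological normalisation.
    return re.sub(r"\b(\w+?)s\b", r"\1", text.lower())

def lookup_go_terms(abstract, go_terms):
    """go_terms: iterable of dicts with 'id', 'name' and optional 'synonyms'."""
    text = normalise(abstract)
    assigned = set()
    for term in go_terms:
        names = [term["name"]] + term.get("synonyms", [])
        if any(normalise(name) in text for name in names):
            assigned.add(term["id"])
    return assigned

# Illustrative example
terms = [{"id": "GO:0051210", "name": "isotropic cell growth",
          "synonyms": ["uniform cell growth"]}]
print(lookup_go_terms("Uniform cell growth was observed in ...", terms))
```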

39
Lexical Lookup Results for GO Slim
SGD Dataset SGD Dataset SGD Dataset
P R F
GO Term 35.27 52.36 42.15
Yeast Term 22.87 91.76 36.62
GO Synonyms 37.86 34.20 35.94
GO Yeast 21.15 93.59 34.50
GO Synonyms 32.94 64.65 43.65
GO Synonyms Yeast 20.53 94.17 33.72
IC Dataset IC Dataset IC Dataset
P R F
GO Term 98.62 79.33 87.93
Yeast Term 37.94 75.13 50.42
GO Synonyms 70.49 33.31 45.24
GO Yeast 43.36 94.76 59.50
GO Synonyms 85.52 88.35 86.91
GO Synonyms Yeast 42.42 95.95 58.83
40
Lexical Lookup Results for GO Full
SGD Dataset SGD Dataset SGD Dataset
P R F
GO Term 7.33 15.95 10.05
Yeast Term 7.97 84.42 14.57
GO Synonyms 6.46 7.63 7.00
GO Yeast 6.93 85.66 12.82
GO Synonyms 6.87 22.55 10.54
GO Synonyms Yeast 6.49 86.14 12.08
IC Dataset IC Dataset IC Dataset
P R F
GO Term 90.52 71.30 79.77
Yeast Term 9.26 31.43 14.30
GO Synonyms 29.65 11.53 16.60
GO Yeast 21.00 83.73 33.58
GO Synonyms 69.93 80.04 74.65
GO Synonyms Yeast 20.70 88.38 33.54
41
Lexical Lookup Approach Discussion
  • Recall
  • Effect of curators using full text (SGD) vs.
    abstracts only (IC)
  • Inherent drawbacks of lexical look-up: term
    variation, literal mentions
  • Effects of Gold Standard creation method (IC)
  • Precision
  • Effects of Gold Standard creation method (IC)
  • GO vs. GO Slim
  • Recognizing GO Slim terms is easier than
    recognizing GO terms
  • Effects of extensions (synonyms/gene names) on
    performance
  • Adding synonyms: variable decrease in Precision,
    substantial increase in Recall
  • Adding yeast terms: substantial decrease in
    Precision, substantial increase in Recall

42
Error Analysis
  • False negatives for abstracts
  • Abbreviation: mismatch repair (GO name) vs. MMR
    (in text)
  • Permutation, derivation: regulation of
    translation vs. regulated translation,
    sporulation vs. sporulate
  • Truncation: galactokinase activity vs.
    galactokinase
  • Alternative descriptions: protein catabolism vs.
    proteins for degradation, autophagic vacuole vs.
    autophagosomal

43
Approach 2: IR-based Vector Space Similarity
  • Document Collection
  • Build a collection of GO documents where each
    GO document consists of the GO term name, its
    synonyms and its definition sentence
  • Query
  • Treat each abstract to which GO codes are to be
    assigned as a query against the GO document
    collection
  • Retrieval
  • Given a query (i.e. abstract) retrieve relevant
    GO documents (i.e. GO terms)
  • assign to an abstract the top 1, 5 or 10 GO terms
    that are most similar, as measured by the Vector
    Space Model (VSM)

44
IR-based Approach
  • Indexed the GO documents using the Lucene search
    engine
  • Standard IR preprocessing: tokenization, stop
    word removal, case normalization, stemming
  • 4 indices were built, varying according to
    whether they used
  • Standard GO or GO Slim
  • A GO document consisting of the GO term text
    (name + definition) alone, or together with that
    of its ancestor GO terms
  • Used the standard weighting scheme included in
    Lucene
  • Postprocessing
  • Re-weighting: give credit to duplicated GO
    documents (found on more than one path back to
    the root)
  • Thresholding: limit the number of relevant GO IDs
    to return (a VSM sketch follows below)
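The implementation described above uses Lucene; the sketch below reproduces the same vector-space idea with scikit-learn's TF-IDF vectors and cosine similarity, treating the abstract as a query over the GO "documents" and returning the top-k term IDs (Lucene's default weighting differs in detail, but the ranking idea is the same):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_go_terms(abstract, go_docs, top_k=5):
    """go_docs: {GO id: 'name synonyms definition'} pseudo-documents."""
    ids = list(go_docs)
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform([go_docs[i] for i in ids])
    query = vectorizer.transform([abstract])
    sims = cosine_similarity(query, doc_matrix).ravel()
    ranked = sorted(zip(ids, sims), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]                       # [(GO id, similarity), ...]
```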

45
IR-Based Results
  • Better performance on IC abstracts than on SGD
    abstracts
  • Hierarchical documents do slightly worse than
    flat documents
  • Discriminatory effect of specific GO terms may be
    reduced by the occurrence of general terms such
    as cell and protein

46
Approach 3: Machine Learning
  • Variety of text classification algorithms: Naïve
    Bayes, Decision Tree, SVM classifier, ...
  • Naïve Bayes predicts only one GO term per
    abstract
  • SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO
    terms/abstract
  • Features: words, frequent phrases
  • Preprocessing steps: tokenization, removal of
    stop words, stemming
  • Training on 66% of the annotated data, evaluation
    on the remainder of the data
  • GO term assignments made vis-à-vis the generic GO
    Slim to mitigate data sparsity problems (a
    classification sketch follows below)
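A minimal sketch of this set-up (Naïve Bayes over bag-of-words features, 66/34 train/test split); the abstracts and GO Slim labels below are placeholders for the real annotated data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Placeholder data: one GO Slim label per abstract (as with Naive Bayes above).
abstracts = ["... abstract text 1 ...", "... abstract text 2 ...",
             "... abstract text 3 ..."]
labels = ["GO:0006468", "GO:0007283", "GO:0006468"]

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, train_size=0.66, random_state=0)

vectorizer = CountVectorizer(stop_words="english")   # word features
classifier = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

predictions = classifier.predict(vectorizer.transform(X_test))
print(classification_report(y_test, predictions, zero_division=0))
```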

47
Machine Learning Results
  • One GO term vs. multiple GO terms per abstract
    makes a difference
  • Higher precision scores than lexical look-up
    (SGD); GO terms directly mentioned in the text
    may not be assigned if they are not present in
    the training set
  • Oracle Text Decision Tree (IC): the classifier
    learns a systematic, strong correlation between
    words in the text and words in GO terms

48
Comparison of Approaches
  • Best F scores for GO Slim

SGD Gold Standard:
        R     P     F
LLU   64.6  32.9  43.6
IR    51.5  26.2  34.7
ML    36.8  51.6  43.0

IC Gold Standard:
        R     P     F
LLU   79.3  98.6  87.9
IR    59.5  37.6  46.1
ML    76.5  83.0  79.6
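For reference, one way to compute the Precision, Recall and F figures reported above is over the sets of predicted and gold-standard PMID-GO term pairs, as in this sketch (the exact averaging used in the slides may differ):

```python
def precision_recall_f(predicted, gold):
    """predicted, gold: sets of (pmid, go_id) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Illustrative example
gold = {("15123456", "GO:0004674"), ("15123456", "GO:0005925")}
predicted = {("15123456", "GO:0004674")}
print(precision_recall_f(predicted, gold))   # (1.0, 0.5, 0.666...)
```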
49
Outline
  • Context
  • Project Background
  • The Gene Ontology
  • GO Annotation in Model Organism Databases
  • Medline
  • GO Tagging Tasks
  • User types/scenarios
  • Possible tasks
  • Related Work
  • Data sets/Gold Standards
  • Approaches and Results to Date
  • Lexical lookup
  • Vector Space Similarity
  • Machine Learning
  • Exploiting the Results in Search Tools

50
[Screenshots of the GO tagging demo interface, with callouts:]
  • Input keywords here
  • Upload a file containing a list of Medline abstracts
  • Type/paste free text to get results
51
[Screenshots of the results view, with callouts:]
  • Click for the abstract details
  • Click for the GO definition
  • Search for the abstracts with similar GO annotations
57
Exploiting the Results in Search Tools
58
[Further screenshots of the search tool, with the same callouts
as the demo interface above: keyword input, Medline abstract
upload, free-text entry]
62
Conclusions
  • GO tagging is an interesting task that offers
    significant potential benefits to research
    biologists and bioinformaticians
  • Several increasingly complex/valuable variants of
    the task can be identified
  • Simple techniques
  • Direct term matching
  • IR-type text matching
  • Machine Learning text classification methods
  • have been assessed for their level of
    performance on the simplest task -- assigning GO
    terms to texts at the whole text level
  • Evaluation methods/resources are critical issues
  • Effectively utilising imperfect text mining
    results in end user applications is challenging

63
Future Work
  • Enhancements to each of the 3 simple approaches
  • Combining 3 simple approaches into a hybrid
    system
  • Look at other tasks
  • Extracting GO term-gene/gene product pairs
  • Assigning evidence codes
  • Improving resources and methodology for
    evaluating the technology
  • End-user evaluation of search tools employing
    this technology

64
Reference: Davis, N., Harkema, H., Gaizauskas, R., Guo, Y.K.,
Ghanem, M., Barnwell, T., Guo, Y. and Ratcliffe, J. (2006)
Three Approaches to GO-Tagging Biomedical Abstracts. In
Proceedings of the Second International Symposium on Semantic
Mining in Biomedicine (SMBM 2006), Jena, April 2006.
Available from http://www.dcs.shef.ac.uk/robertg/publications/
  • END