Real-time Text Mining for the Biomedical Literature presentation

Title: Real-time Text Mining for the Biomedical Literature

1
Real-time Text Mining for the Biomedical
Literature: a collaboration between Discovery Net
and myGrid
Rob Gaizauskas, Department of Computer
Science, University of Sheffield
Moustafa M. Ghanem, Department of
Computing, Imperial College London
2
Outline

Context
Workflows, Services and Text Mining
Discovery Net and myGrid
Aims and Objectives of New Project
Architecture of New System
Integration of Existing Components
Approach to Text Mining
Data Resources and Evaluation
Techniques for GO Tagging
Interface and Results Presentation
Lessons Learnt So Far, Conclusions and Broader
Applicability of Work

3
Workflows, Web Services and Text Mining for
Bioinformatics

Workflows
useful computational models for processes that
require repeated execution of a series of complex
analytical tasks
e.g. biologist researching genetic basis of a
disease repeatedly
maps reactive spot in microarray data to gene
sequence
uses a sequence alignment tool to find
proteins/DNA of similar structure
mines info about these homologues from remote DBs
annotates unknown gene sequence with this
discovered info
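
The repeated task sequence above can be pictured as a chain of steps, each consuming the output of the previous one. The following is a minimal Python sketch under stated assumptions: every function name and every data value is a hypothetical stand-in for a real tool or remote database, not a myGrid or Discovery Net API.

```python
from functools import reduce
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]

def map_spot_to_sequence(state: Dict) -> Dict:
    # Hypothetical lookup: microarray spot id -> gene sequence.
    spot_to_seq = {"spot_17": "ATGGCC"}
    return {**state, "sequence": spot_to_seq[state["spot"]]}

def find_homologues(state: Dict) -> Dict:
    # Stand-in for a sequence alignment tool; returns made-up names.
    return {**state, "homologues": ["protein_A", "protein_B"]}

def mine_homologue_info(state: Dict) -> Dict:
    # Stand-in for queries against remote databases.
    info = {h: f"notes on {h}" for h in state["homologues"]}
    return {**state, "info": info}

def annotate_sequence(state: Dict) -> Dict:
    # Attach the discovered information to the unknown sequence.
    return {**state, "annotation": sorted(state["info"].values())}

def run_workflow(steps: List[Step], initial: Dict) -> Dict:
    # Feed the output of each step into the next one.
    return reduce(lambda state, step: step(state), steps, initial)

WORKFLOW = [map_spot_to_sequence, find_homologues,
            mine_homologue_info, annotate_sequence]
result = run_workflow(WORKFLOW, {"spot": "spot_17"})
print(result["annotation"])
```

Because the workflow is just an ordered list of steps, it can be re-run on a new spot id without repeating the manual work.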

4
Workflows, Web Services and Text Mining for
Bioinformatics

Web services
Processing resources that are
available via the Internet
use standardised messaging formats, such as XML
enable communication between applications without
being tied to a particular operating
system/programming language
Useful for bioinformatics where data used in
research is
heterogeneous in nature: DB records, numerical
results, NL texts
distributed across the internet in research
institutions around the world
available on a variety of platforms and via
non-uniform interfaces

5
Workflows, Web Services and Text Mining for
Bioinformatics

Text mining
any process of revealing information
(regularities, patterns or trends) in textual
data
includes more established research areas such as
information extraction (IE), information
retrieval (IR), natural language processing
(NLP), knowledge discovery from databases (KDD)
and traditional data mining (DM)
relevant to bioinformatics because of
explosive growth of biomedical literature
availability of some information in textual form
only, e.g. clinical records

6
Workflows, Web Services and Text Mining for
Bioinformatics
7
Discovery Net and myGrid

Discovery Net: An e-Science testbed for High
Throughput Informatics
£2.2M EPSRC Pilot Project
Started Oct 2001, ended March 2005
Service-based infrastructure/workflow model for
Life Sciences, Environmental Modelling and
Geo-hazard Modelling
Infrastructure for mixed data mining / text
mining
Machine learning methods for text mining
myGrid: Directly Supporting the e-Scientist
£3.5M EPSRC Pilot Project
Started Oct 2001, ends June 2005
Service-based infrastructure/workflow model for
Life Sciences
Infrastructure for Text Collection Server, Text
Services Workflow Server and Interface/Browsing
Client
Service-based Terminology Servers

8
myGrid

Overall aim: develop an e-biologist's workbench
a platform allowing biologists to execute,
analyze, repeat multi-stage in silico experiments
involving distributed data, code and processing
resources
Workflow model for composing/executing processing
components
Web services for distribution
Problem: how to integrate text mining into a
biological workflow?
Most text mining runs off-line and supports
interactive browsing of results
Most workflows run end to end with no user
intervention
What are the inputs to text mining to be?
Solution: tap off the result of a workflow step
and treat it as an implicit query

9
A myGrid example: studying the Genetic Basis of
Disease

Graves' Disease
an autoimmune condition affecting tissues in the
thyroid and orbit
being investigated using micro-array methods
micro-array shows which genes are differentially
expressed in normal patients vs patients with
the disease: candidate genes
sequence alignment search (e.g. BLAST) finds
genes/proteins with similar structure
function of these homologues may suggest
function of candidate gene
key step for text mining follows BLAST search
for homologous proteins: BLAST report contains
references to proteins in the SWISSPROT protein
database
SWISSPROT records contain ids of abstracts
describing the protein in the Medline abstract
database
abstracts can be mined directly or used as
"seed" documents to assemble a set of related
abstracts
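
The chain from BLAST report to seed abstracts can be sketched as two lookups. This is an illustration only: the ids, records and abstract texts below are invented, and the in-memory dictionaries stand in for the remote SWISSPROT and Medline services, whose real interfaces are not described in these slides.

```python
from typing import Dict, List

# Hypothetical stand-ins for remote resources (all ids and texts are made up).
BLAST_REPORT: List[str] = ["SP_0001", "SP_0002"]  # protein ids of homologues
SWISSPROT: Dict[str, dict] = {
    "SP_0001": {"medline_ids": ["M100", "M101"]},
    "SP_0002": {"medline_ids": ["M101", "M102"]},
}
MEDLINE: Dict[str, str] = {
    "M100": "Abstract text about protein one.",
    "M101": "Abstract text about both proteins.",
    "M102": "Abstract text about protein two.",
}

def seed_abstract_ids(blast_hits: List[str], swissprot: Dict[str, dict]) -> List[str]:
    """Collect the abstract ids cited by each hit, de-duplicated, in first-seen order."""
    seen, ordered = set(), []
    for protein_id in blast_hits:
        for medline_id in swissprot.get(protein_id, {}).get("medline_ids", []):
            if medline_id not in seen:
                seen.add(medline_id)
                ordered.append(medline_id)
    return ordered

def fetch_abstracts(ids: List[str], medline: Dict[str, str]) -> Dict[str, str]:
    """Look up the abstract text for each id the archive knows about."""
    return {i: medline[i] for i in ids if i in medline}

seeds = seed_abstract_ids(BLAST_REPORT, SWISSPROT)
abstracts = fetch_abstracts(seeds, MEDLINE)
print(seeds)
```

The output of the BLAST step acts as the implicit query here: no user types a search, yet a document set is assembled for mining.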

10
myGrid Text Services Architecture
11
myGrid Text Services Architecture

3-way division of labour: a sensible way to
deliver distributed text mining services
Providers of e-archives, such as Medline, will
make archives available via web-services
interface
Cannot offer tailored services for every
application
Will provide core, common services
Specialist workflow designers will add value to
basic services from archive to meet their
organization's needs
Users will prefer to execute predefined workflows
via standard light clients such as a browser
Architecture appropriate for many research areas,
not just bioinformatics

12
myGrid Interface/Browsing Client
13
Discovery Net: Adding text mining to e-Science
workflows

DNet Workflow server executes DPML workflow and
uses Discovery Net's InfoGrid data access and
integration wrappers and web services

14
Text Mining in e-Science workflows

Problem: how to develop new distributed text
mining applications using a workflow?
Most text mining applications require the
integration of a mixture of components (Services)
for text processing tasks (e.g. parsing and
cleaning), natural language processing (e.g.
named entity recognition), statistics and data
mining (e.g. classification, clustering, etc).
There are many design alternatives and end users
may want to prototype and compare alternative
implementations.
Once an application is developed, most workflows
run end to end with no user intervention
Solution: extend service infrastructure to allow
composition of text mining services.

15
Building text mining applications from workflows
Using workflow technologies to build text mining
applications and services using finer grain
components/services
Text Mining Pipelines
Features are summarized into vector forms which
are suitable for data mining
Results can be document characterization or
hidden relationship extraction
Pre-process documents to enhance the ease of
feature extraction
Retrieve and organize relevant documents
16
Simplified Document Classification Workflow
Predictive accuracy of relevance prediction,
using Support Vector Machine classification:
overall accuracy 84.5%, precision 78.11%, recall
73.40%
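
For reference, the three measures quoted on this slide are conventionally derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the data behind the figures above.

```python
from typing import Dict

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard accuracy, precision and recall for a binary classifier.

    tp/fp: documents predicted relevant that are / are not relevant.
    fn/tn: documents predicted irrelevant that are / are not relevant.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```

Accuracy counts every correct decision, so it can stay high even when precision and recall on the relevant class are noticeably lower.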
17
Text Meta Data Model
Build Classifier (training phase) using workflow
co-ordinating distributed services
Build Prediction phase using workflow
co-ordinating distributed services
Metadata Model: service interfaces only tell you
how to invoke a remote service, but it is up to
you to decide what information flows between
services!
18
Aims and Objectives of New Project

Aim: to develop a unified real-time e-Science
text-mining infrastructure that leverages the
technologies and methods developed by both
Discovery Net and myGrid
Software engineering challenge: integrate
complementary service-based text mining
capabilities with different metadata models into
a single framework
Application challenge: annotate biomedical
abstracts with semantic categories from the Gene
Ontology
Deliverables
D1 A GO Annotation Service
D2 A Generic Shared Infrastructure for
Grid-enabled Biomedical Document Categorization
D3 Infrastructure for Semantic Document
Annotation
D4 A Detailed Case Study (analysing/evaluating
the GO annotator)
D5 Developing a common framework for
representing and exchanging information about
1. Data: biomedical documents/doc collections
metadata, biomedical dictionaries
2. Intermediate data: document indexes and
document feature vectors
3. Text Analysis Results

19
GO TAG: A Novel Application

The GO TAG Application: Automatic Assignment of
GO (Gene Ontology) Codes to Medline Documents

20
A Machine Learning Approach
21
Run-time System
22
GO Annotator Version 1

Version 1a
Direct search for GO Annotation descriptions and
synonyms in document text
If description is found, document is labelled
with this GO Annotation
Description is also marked-up in document
Version 1b
1a plus search for gene names extracted from
yeast genome DB
If gene name found, document labelled with GO
annotation(s) associated with gene in DB
Gene name also marked up in document
Termino web-service, hosted at Sheffield,
provides lookup capability
This is wrapped in a DiscoveryNet workflow to
include PubMed query, results visualization and
performance calculations
Workflow is deployed as a web application for end
users which includes applet to interactively
browse results
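
The lookup logic of Versions 1a and 1b can be sketched in a few lines. This is not the Termino service, whose interface is not described here; it is a plain Python illustration, and the GO ids, term phrases and gene name below are hypothetical placeholders.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical miniature lexicons (placeholder ids, phrases and gene names).
GO_TERMS: Dict[str, List[str]] = {
    "GO:A": ["dna repair"],
    "GO:B": ["protein folding", "chaperone activity"],  # description + synonym
}
GENE_TO_GO: Dict[str, List[str]] = {"GENE1": ["GO:A"]}

def _find(phrase: str, text: str) -> List[Tuple[int, int]]:
    """Case-insensitive whole-word matches of phrase, as (start, end) spans."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]

def annotate(text: str, use_gene_names: bool = False) -> Tuple[Set[str], List[Tuple[int, int]]]:
    labels: Set[str] = set()
    spans: List[Tuple[int, int]] = []
    # Version 1a: search for GO descriptions and synonyms in the text.
    for go_id, phrases in GO_TERMS.items():
        for phrase in phrases:
            hits = _find(phrase, text)
            if hits:
                labels.add(go_id)
                spans.extend(hits)
    # Version 1b: additionally search for gene names and inherit their GO ids.
    if use_gene_names:
        for gene, go_ids in GENE_TO_GO.items():
            hits = _find(gene, text)
            if hits:
                labels.update(go_ids)
                spans.extend(hits)
    return labels, sorted(spans)

def mark_up(text: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each matched span in square brackets; overlapping spans are skipped."""
    out, last = [], 0
    for start, end in spans:
        if start < last:
            continue
        out.append(text[last:start])
        out.append("[" + text[start:end] + "]")
        last = end
    out.append(text[last:])
    return "".join(out)

doc = "GENE1 mutants show defects in protein folding."
labels_1a, spans_1a = annotate(doc)
labels_1b, spans_1b = annotate(doc, use_gene_names=True)
print(mark_up(doc, spans_1b))
```

Exact matching of this kind only fires when a term appears verbatim, which is one plausible reason a lookup-only annotator misses many annotations.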

23
GO Annotator Version 1: Underlying Discovery Net
Workflow
24
GO Annotator Version 1: Underlying Discovery Net
Workflow
Enter query and retrieve abstracts from PubMed.
25
GO Annotator Version 1: Underlying Discovery Net
Workflow
Use Termino to mark up abstracts with GO
Annotations when a match for a GO Annotation
description is found.
26
GO Annotator Version 1: Underlying Discovery Net
Workflow
Tabulate GO Annotations by PMID.
27
GO Annotator Version 1: Underlying Discovery Net
Workflow
Join PMIDs and matching GO Annotations with
abstracts and titles.
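
The tabulate and join steps of this workflow amount to a group-by followed by an inner join on PMID. A minimal sketch, assuming the lookup step emits (PMID, GO id) pairs; the PMIDs, GO ids, titles and abstracts are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical (PMID, GO id) matches, including one duplicate hit.
matches: List[Tuple[str, str]] = [
    ("111", "GO:A"), ("111", "GO:B"), ("222", "GO:A"), ("111", "GO:A"),
]
# Hypothetical retrieved records keyed by PMID.
records: Dict[str, Dict[str, str]] = {
    "111": {"title": "Title one", "abstract": "Abstract one."},
    "222": {"title": "Title two", "abstract": "Abstract two."},
    "333": {"title": "Title three", "abstract": "Abstract three."},
}

def tabulate_by_pmid(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group GO ids under their PMID, dropping duplicate matches."""
    table = defaultdict(set)
    for pmid, go_id in pairs:
        table[pmid].add(go_id)
    return {pmid: sorted(ids) for pmid, ids in table.items()}

def join_with_records(table: Dict[str, List[str]],
                      recs: Dict[str, Dict[str, str]]) -> List[dict]:
    """Inner join on PMID: only documents with at least one annotation survive."""
    return [
        {"pmid": pmid, "go": go_ids, **recs[pmid]}
        for pmid, go_ids in sorted(table.items())
        if pmid in recs
    ]

rows = join_with_records(tabulate_by_pmid(matches), records)
for row in rows:
    print(row["pmid"], row["go"], row["title"])
```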
28
Workflow Deployment
29
GO Annotator Version 2

Use Saccharomyces (Yeast) Genome Database as
source of papers expertly curated with GO
Annotations
Train classifier using these papers
Hierarchical classification
Training data sufficient to classify over 2000 GO
Annotations
Classifier is then applied to assign GO
Annotations to unseen papers
Main Issues
Choice of features to be extracted from the
training documents
Choice of feature reduction methods to produce
accurate classification
Choice of classification algorithm to be used

30
GO Annotator Version 2: Underlying Discovery Net
Workflow
31
GO Annotator Version 2: Underlying Discovery Net
Workflow
Papers expertly curated with GO Annotations from
SGD database.
32
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate vector of features (frequent phrases)
for each paper. This is used to train classifier.
33
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate a Naïve Bayesian classification model.
34
GO Annotator Version 2: Underlying Discovery Net
Workflow
Generate vector of features (frequent phrases)
for each paper in test data set. This is used to
test the classifier.
35
GO Annotator Version 2: Underlying Discovery Net
Workflow
Apply classification model to test data to
evaluate classification accuracy.
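
The train, test and evaluate steps of the Version 2 workflow can be illustrated with a small multinomial Naive Bayes classifier. This is a sketch under stated assumptions, not the Discovery Net component: plain word counts stand in for the frequent-phrase features, each paper carries a single label (real GO annotation is multi-label), Laplace smoothing is assumed, and the training texts and GO ids are invented.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def features(text: str) -> Counter:
    # Assumption: lowercase word counts stand in for frequent-phrase features.
    return Counter(text.lower().split())

class NaiveBayes:
    def fit(self, docs: List[Tuple[str, str]]) -> "NaiveBayes":
        self.class_docs = Counter()              # number of documents per label
        self.word_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()
        for text, label in docs:
            counts = features(text)
            self.class_docs[label] += 1
            self.word_counts[label].update(counts)
            self.vocab.update(counts)
        self.n_docs = sum(self.class_docs.values())
        return self

    def log_scores(self, text: str) -> Dict[str, float]:
        scores = {}
        vocab_size = len(self.vocab)
        for label, n in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / self.n_docs)    # log prior
            for word, count in features(text).items():
                if word not in self.vocab:
                    continue                     # ignore unseen features
                # Laplace (add-one) smoothed likelihood.
                p = (self.word_counts[label][word] + 1) / (total + vocab_size)
                score += count * math.log(p)
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Hypothetical curated training papers and held-out test papers.
train = [
    ("dna repair damage response", "GO:A"),
    ("repair of damaged dna strands", "GO:A"),
    ("protein folding chaperone", "GO:B"),
    ("chaperone assisted folding of protein", "GO:B"),
]
test = [("dna damage repair", "GO:A"), ("folding of a chaperone protein", "GO:B")]

model = NaiveBayes().fit(train)
correct = sum(model.predict(text) == label for text, label in test)
accuracy = correct / len(test)
print(accuracy)
```

Keeping feature extraction in one function makes it easy to swap in a different feature scheme and compare classifiers on the same test set.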
36
Interface and Results Presentation
37
Achievements to date

Infrastructure Interoperability
More than just remote web service invocation:
interoperable metadata models
Mark 1 System Implemented
Annotation based on terminology lookups
15% Recall, 5% Precision (exact matches for
18,000 GO terms)
Measures inadequate due to incompleteness of gold
standard
In process of Finalising Training Data Sets and
Evaluation Metrics
4,922 papers referencing 2,455 GO Terms
Mark 2 Systems in Progress
Naïve Bayesian Approach
41% Recall and 27% Precision
User Interfaces
Mark 3, 4, Systems and Evaluation
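
Because each paper can carry several GO terms, recall and precision for an annotator are typically computed over sets of terms per document. The slides do not say which averaging scheme was used; the sketch below assumes micro-averaging, and the document ids and GO ids are hypothetical.

```python
from typing import Dict, Set, Tuple

def micro_precision_recall(predicted: Dict[str, Set[str]],
                           gold: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Pool true/false positives and false negatives over all documents."""
    tp = fp = fn = 0
    for doc_id in set(predicted) | set(gold):
        p = predicted.get(doc_id, set())
        g = gold.get(doc_id, set())
        tp += len(p & g)   # predicted and in the gold standard
        fp += len(p - g)   # predicted but absent from the gold standard
        fn += len(g - p)   # in the gold standard but not predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical gold standard and system output.
gold = {"d1": {"GO:A"}, "d2": {"GO:B", "GO:C"}}
predicted = {"d1": {"GO:A", "GO:D"}, "d2": {"GO:B"}}
precision, recall = micro_precision_recall(predicted, gold)
print(round(precision, 3), round(recall, 3))
```

Note that "GO:D" counts against precision even if it is in fact a valid annotation that the curators never recorded, which is how an incomplete gold standard depresses the measured scores.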

38
Implementation Options

Feature Vector Options
Bag of words
Frequent Phrases
Key Phrases (Gene Names, Protein Names, MeSH
terms, etc).
Classifier Options
Bayesian Classifiers
Support Vector Machines
Drag Push (a novel centroid based method)
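
Two of the feature vector options listed above can be sketched briefly. The slides do not define "frequent phrases"; the sketch assumes word n-grams that occur in at least a minimum number of documents, and the example documents are invented.

```python
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens: List[str], n: int) -> List[str]:
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text: str) -> Counter:
    """Option 1: unordered word counts."""
    return Counter(tokenize(text))

def frequent_phrases(docs: List[str], n: int = 2, min_docs: int = 2) -> List[str]:
    """Option 2 (assumed definition): n-grams found in at least min_docs documents."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(ngrams(tokenize(text), n)))  # count each doc once
    return sorted(g for g, c in doc_freq.items() if c >= min_docs)

def phrase_vector(text: str, phrases: List[str], n: int = 2) -> List[int]:
    """Count how often each selected phrase occurs in one document."""
    counts = Counter(ngrams(tokenize(text), n))
    return [counts[p] for p in phrases]

# Hypothetical documents.
docs = [
    "DNA repair after DNA damage",
    "Defects in DNA repair",
    "Protein folding defects",
]
phrases = frequent_phrases(docs)
vector = phrase_vector(docs[0], phrases)
print(phrases, vector, bag_of_words(docs[0])["dna"])
```

A document-frequency threshold acts as a simple form of feature reduction: phrases seen in only one paper are dropped before any classifier is trained.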

39
Lessons Learnt and Challenges to Face

Infrastructure
Interoperability Issues
Performance Issues
Communication vs Persistence of remote server
Off-line vs on-line feature extraction
Text Mining
Usability Issues
Evaluation Issues
