Title: Transductive Support Vector Classification for RNA-Related Biological Abstracts
1. Transductive Support Vector Classification for RNA-Related Biological Abstracts
- Blake Adams
- Graduate Student
- Department of Computer Science
- Advisor: Dr. Muhammad A. Rahman
2. Overview
- Statistical Learning Theory
- Support Vector Machines
- Linear Separability
- Project Motivation / Concept
- Expectations
- Program Design / Algorithm Implementation
- Results
- Demonstration
- Acknowledgements
- Questions and Answers
3. Statistical Learning Theory
- Introduced by Vladimir Vapnik and Alexey Chervonenkis.
- Four major areas:
  - Theory of consistency of learning processes: what are the necessary conditions for consistency of a learning process?
  - Nonasymptotic theory of the rate of convergence of learning processes: how fast is the rate of convergence of the learning process?
  - Theory of controlling the generalization ability of learning processes: how can one control the rate of convergence (generalization) of the learning process?
  - Theory of constructing learning machines: how can one construct algorithms that can control the generalization ability?
- This framework introduced the support vector machine.
4. Support Vector Machines
- The Support Vector Machine is a classification technique that is receiving heavy attention due to its classification precision.
- It has been especially successful in text categorization.
- It is also showing good results in image recognition, such as face and fingerprint identification.
5. Precision Through SVM
- Support Vector Machines apply supervised learning grounded in statistical learning theory. The technique handles high-dimensional data well and avoids the pitfalls of local minima. It represents decision boundaries using a subset of the training examples known as support vectors.
6. Structural Risk Minimization
- Support vector machines are based on the Structural Risk Minimization principle. The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error.
- The true error of h is the probability that h will make an error on an unseen and randomly selected test example.
7. How Does SVM Work?
- Uses training data to create a set of plot points that can be mapped out and used to predict the status of future information.
- Finds a hyperplane H that separates positive and negative examples with the maximum margin.
8. Linearly Separable Data
- Linearly separable datasets (two classes)
- Hyperplane of separation
- Decision boundaries
9. Consequences of Hyperplane Selection
- Maximize the decision-boundary margin for the best chance of success.
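The separating-hyperplane idea on the preceding slides can be made concrete with the decision rule sign(w · x + b). This is a minimal sketch with made-up weights and bias, not values learned by an SVM:

```java
// Minimal sketch of an SVM-style decision rule. The weight vector w and
// bias b used below are illustrative values, not weights learned from data.
public class DecisionBoundary {

    // Returns +1 or -1 depending on which side of the hyperplane
    // w . x + b = 0 the point x falls on.
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0}; // hypothetical weights
        double b = 0.0;           // hypothetical bias
        System.out.println(classify(w, b, new double[]{2.0, 1.0})); // 1
        System.out.println(classify(w, b, new double[]{1.0, 2.0})); // -1
    }
}
```

Maximizing the margin amounts to choosing, among all (w, b) that separate the training data this way, the pair whose boundary is farthest from the closest training points (the support vectors).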
10. Research Motivation
- Keyword searches have become the norm for finding information in large bodies of documents, but such searches often prove to be highly imprecise.
- Example:
  - PubMed is a service of the National Library of Medicine that includes over 15 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubMed includes links to full-text articles and other related resources. The site adds new citations on a daily basis.
  - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed&term=mRNA
  - A search on the expressions "RNA" and "Ribosomal Nucleic Acid" at PubMed earlier in the semester (geared toward finding articles SPECIFICALLY about RNA research) yielded a success rate of 38% based on a review of the first 50 abstracts.
- What can be done to improve the precision of searches in such large bodies of documents?
11. Expectations
- It is reasonable to presume that Support Vector Machines can yield statistically significant improvements over this success rate at minimal cost to the user.
- If a user were able to request a body of documents on a keyword from such a database, and then read a small subset of abstracts, identifying positive and negative examples, a Support Vector Machine could be used to key on the articles the user is interested in reading.
12. Learning Technique
- One of the shortcomings of traditional SVM is its reliance on inductive learning (the process of reaching a general conclusion from specific examples). While this method is highly effective when applied properly, it often requires MANY examples (at least hundreds, preferably thousands) in order to return good results.
13. Learning Technique
- Transductive Learning
  - Implemented by Thorsten Joachims, author of SVM-Light.
  - Joachims is well known for his work on Support Vector Machines and text categorization, and is highly regarded as one of the foremost authorities on the subject.
  - Transductive learning takes into account a specific set of data and attempts to minimize error for that specific set based on a minimal number of training examples.
14. Implementing Support Vector Machines
- Fortunately, Joachims' SVM-Light already implements transductive learning successfully. Thus, the task in this project was not to reinvent the wheel of Support Vector Machines, but to develop a set of software tools that convert sets of articles into training and testing data that SVM-Light can read and learn from.
- http://svmlight.joachims.org/
15. Using SVM-Light
- Requires specifically formatted input.
- Things the SVM-Light user must supply:
  - Feature set
  - Scoring method
  - Training data file
  - Testing data file
- Things SVM-Light generates:
  - Model file
  - Predictions file
16. Feature Set
- Features can be defined in many ways:
  - Single words: ANY word that appears in more than one document relating to a particular subject (the "bag of words" approach).
  - Terms: topic-specific expressions such as "ribosomal nucleic acid", "translation", "interference", "genetic".
  - Combination: any method including both concepts, such as a weighting scheme that gives more weight to words that also qualify as terms.
17. Scoring Method
- Every feature needs a corresponding value that represents that particular feature's impact in the given example.
- A popular method of feature scoring (and the one implemented in this project) is Term Frequency / Inverse Document Frequency:
  - TF x log(N / DF)
  - TF: total number of times the term occurs in the document
  - N: total number of documents in the corpus
  - DF: total number of documents in the corpus that contain the term
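The TF x log(N/DF) score can be computed in a few lines. A sketch (class name hypothetical; the slide does not state the logarithm base, so natural log is assumed here):

```java
// TF-IDF scoring as described on this slide: TF * log(N / DF).
public class TfIdf {

    // tf: occurrences of the term in the document
    // n:  total documents in the corpus
    // df: documents in the corpus containing the term
    static double score(int tf, int n, int df) {
        // Natural logarithm assumed; the slide leaves the base unspecified.
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in an abstract and appearing in 40 of
        // the 400 abstracts in the corpus:
        System.out.println(score(3, 400, 40)); // 3 * ln(10), about 6.9078
    }
}
```

Note the effect: rare terms (low DF) score high, while a term appearing in every one of the 400 documents scores 0, since log(400/400) = 0.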
18. Generating Training Sets for Transductive Learning
- File format:
  - <expected outcome> <feature>:<value> <feature>:<value> <feature>:<value> ...
- Within each line, features must be listed in ascending order of feature id.
- Thus a completed training file might look something like:
  - 1 1:2.8473 2:3.8324 9:5.423 19:1.003
  - 1 4:1.11 9:5.423 11:0.012
  - 1 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - -1 1:8.8473 2:3.8324 9:5.423 19:1.003
  - -1 45.135.423 11:0.012 28:19.6548
  - -1 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - 0 1:0.8473 9:3.84 19:5.423 29:1.00
  - 0 4:1.11 9:5.423 11:0.012
  - 0 2:2.84734 15:10.9213 21:9.343 22:7.7231
  - 0 1:4.5473 2:6.7324 9:5.42
  - 0 4:1.11 9:5.423 11:0.012
  - 0 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - 0 19:5.1864 24:7.215 28:12.123
- Examples labeled 0 are the unlabeled examples that transductive SVM-Light classifies itself.
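An SVM-Light input line has the shape `<label> id:value id:value ...` with feature ids in ascending order. A sketch of a helper that emits one such line (class and method names hypothetical); a TreeMap keeps the ids sorted automatically:

```java
import java.util.Map;
import java.util.TreeMap;

// Builds one SVM-Light formatted line: "<label> id:value id:value ...".
public class SvmLightLine {

    // A TreeMap iterates its keys in ascending order, which yields the
    // ascending feature ids SVM-Light expects.
    static String format(int label, TreeMap<Integer, Double> features) {
        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> f = new TreeMap<>();
        f.put(95, 0.423);  // inserted out of order on purpose
        f.put(1, 2.8473);
        f.put(23, 0.8324);
        System.out.println(format(1, f)); // 1 1:2.8473 23:0.8324 95:0.423
    }
}
```

Using a sorted map here means no separate reordering pass is needed when the training and test files are written.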
19. Generating Test Sets for Transductive Learning
- Since transductive learning works with a pre-established set, the test file has exactly the same format. The only difference is that the expected outcomes originally set to 0 are now set to their actual values, so that the system can test how well it predicted the outcomes.
20. Key Experiments
- How well can SVM-Light, working transductively:
  - distinguish between abstracts that ARE and ARE NOT about RNA?
  - distinguish between abstracts that ARE and ARE NOT about each of the following types of RNA?
    - messenger RNA
    - ribosomal RNA
    - transfer RNA
    - small nuclear RNA
  - glean abstracts about a specific type of RNA from a body of abstracts that are all RNA-centric?
21. Implementation
- Two key elements were collected in order to implement this project.
- A corpus of abstracts was collected and categorized manually from the PubMed database. These articles fell into the following categories:
  - 40 abstracts specific to RNA research AND containing the term RNA
  - 40 abstracts not specific to RNA research AND containing the term RNA
  - 40 abstracts specific to messenger RNA research AND containing the term mRNA
  - 40 abstracts not specific to messenger RNA research AND containing the term mRNA
  - 40 abstracts specific to transfer RNA research AND containing the term tRNA
  - 40 abstracts not specific to transfer RNA research AND containing the term tRNA
  - 40 abstracts specific to ribosomal RNA research AND containing the term rRNA
  - 40 abstracts not specific to ribosomal RNA research AND containing the term rRNA
  - 40 abstracts specific to small nuclear RNA research AND containing the term snRNA
  - 40 abstracts not specific to small nuclear RNA research AND containing the term snRNA
- This resulted in a GRAND TOTAL corpus of 400 abstracts.
22. Implementation
- Term dictionary
  - A dictionary of terms specific to each topic was developed based on the terms' relevance to the particular set of abstracts (e.g., the dictionary for messenger RNA abstracts would contain the term "messenger", but the one for small nuclear RNA abstracts would not).
23. Implementation
- Pre-processing
  - Prior to conversion from abstract to feature/value sets, a limited amount of pre-processing was applied:
    - All special characters were removed from the abstracts, including parentheses, commas, periods, apostrophes, colons, semicolons, dashes, etc.
    - Articles were sorted so that all positive abstracts preceded all negative abstracts. This was done to make it easy for the software tools to identify positive and negative abstracts while generating training and testing sets. It should be noted that the order of feature sets has no bearing on learning.
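The special-character removal step can be sketched with a regular expression. The class name and the exact rules here (keep letters, digits, and whitespace; lower-case the result) are illustrative assumptions, not the project's actual tool:

```java
// Sketch of the pre-processing step: strip special characters
// (parentheses, commas, periods, apostrophes, colons, semicolons,
// dashes, etc.) before tokenization.
public class Preprocess {

    static String clean(String abstractText) {
        return abstractText
                .replaceAll("[^A-Za-z0-9\\s]", " ") // special chars -> space
                .replaceAll("\\s+", " ")            // collapse runs of spaces
                .trim()
                .toLowerCase();                     // assumption: case-fold too
    }

    public static void main(String[] args) {
        System.out.println(clean("Long regarded as a mere ribosome-producing 'factory', the nucleolus"));
        // long regarded as a mere ribosome producing factory the nucleolus
    }
}
```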
24. Implementation
- Software development package: Java
- Two major data structures:
  - A TreeMap containing Term objects that represent every term feature in every document in the corpus.
    - A Term object holds the actual term, a unique term id, a document id, a term frequency score, and a document frequency score. Objects are keyed by an id that combines the term id with the document id.
  - A TreeMap containing TermDF objects that represent individual features as they enter the program.
    - A TermDF object holds the actual term, the document frequency score, and, as a placeholder, the id of the last document that contained the term.
- Once each map is constructed, the TermDF map is cross-referenced with the Term map to set the DF score to the appropriate value in each term.
25. Algorithm / Implementation
- The set of tools for this project was developed in Java. The programmatic implementation included the following steps:
  - Read a set of abstracts and a term dictionary from file.
  - Tokenize the abstracts and test every tokenized word against the term dictionary.
  - If the token exists in the term dictionary, it must be searched for in the current corpus set of terms to see whether it has already been added.
    - If the feature is not found in the current set, it is added, assigned an id, and its TF and DF are initialized to 1.
    - If the feature is found in the current set, the current document count is checked to determine whether the term is an initial occurrence in a new document or a subsequent occurrence in the current document.
      - If it is an initial occurrence in a new document, DF is incremented and the TF for the term in the current article is set to one.
      - If it is a subsequent occurrence, DF remains unchanged and the TF for the current article is incremented by one.
  - Once EVERY term for EVERY abstract is accounted for, the TF/IDF for each term can be calculated.
  - Finally, the data structure is reorganized so that the feature ids in each feature set run from lowest to highest.
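The TF/DF bookkeeping in the steps above can be sketched with nested maps. This is a simplified stand-in for the project's Term/TermDF structures (names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Tracks per-document term frequency (TF) and corpus-wide document
// frequency (DF) using the update rules described on this slide.
public class TermCounter {

    final Map<String, Map<Integer, Integer>> tf = new HashMap<>(); // term -> (docId -> TF)
    final Map<String, Integer> df = new HashMap<>();               // term -> DF

    void observe(String term, int docId) {
        Map<Integer, Integer> perDoc = tf.computeIfAbsent(term, t -> new HashMap<>());
        if (!perDoc.containsKey(docId)) {
            // Initial occurrence in a new document: DF incremented, TF set to 1.
            perDoc.put(docId, 1);
            df.merge(term, 1, Integer::sum);
        } else {
            // Subsequent occurrence in the same document: DF unchanged, TF incremented.
            perDoc.merge(docId, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        TermCounter c = new TermCounter();
        c.observe("ribosomal", 0);
        c.observe("ribosomal", 0); // same document again
        c.observe("ribosomal", 1); // new document
        System.out.println(c.tf.get("ribosomal").get(0)); // 2
        System.out.println(c.df.get("ribosomal"));        // 2
    }
}
```

Once every abstract has been observed, each (term, document) pair's TF and the term's DF are available for the TF x log(N/DF) calculation.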
26. Mapping a Feature
- [Flowchart] For each tokenized word:
  - Not in the keyword list: discard.
  - In the keyword list, already in the Term Map, and already seen in the current document: increment termFreq.
  - In the keyword list and in the Term Map, but first occurrence in this document: increment docFreq.
  - In the keyword list but not yet in the Term Map: assign id and docId, set termFreq to 1; then check the TermDocFreqMap: if absent, assign featureId, set docFreq to 1, and record lastDocId; if present but new to this document, increment docFreq; otherwise do nothing.
27. Implementation
- So THIS:
  - All organisms sense and respond to conditions that stress their homeostatic mechanisms Here we review current studies showing that the nucleolus long regarded as a mere ribosome producing factory plays a key role in monitoring and responding to cellular stress After exposure to extra or intracellular stress cells rapidly down regulate the synthesis of ribosomal RNA
- Becomes THIS:
  - 1 2:5.619650483261998 5:0.17733401528291545 6:1.869439496743316 12:0.7184649885442351 13:3.0441924140622225 25:2.772588722239781 50:3.283414346005772 51:6.566828692011544
28. Implementation
- All experiments were conducted in sets of 80 abstracts, with 5 positive and 5 negative training examples. Additionally, 35 positive and 35 negative examples were included but left unlabeled during the training phase, allowing the program to decide where each feature set should fall in order to minimize error.
29. Results and Conclusions
- The outcome of every experiment far outpaced the researchers' highest expectations.
- On examination of misclassified abstracts from the first set of results, some abstracts were found to have been misclassified by the manual classifiers. Correcting these errors led to even better results.
- The researchers believe that incorporating such a system into a database like PubMed would allow users to query on a keyword and then use the support vector machine to narrow the results down to the specific information they are looking for, making the most of their research time.
30. Demonstration
31. Future Work
- Future work on this project will address:
  - Precision in feature selection
  - A web interface to tie the application to real results
32. Acknowledgements
- Dr. Muhammad A. Rahman, Assistant Professor, University of West Georgia
- Dr. Goran Nenadic, Assistant Professor, University of Manchester
- Thorsten Joachims, Assistant Professor, Cornell University
- The Department of Computer Science, University of West Georgia