Title: Transductive Support Vector Classification for RNA-Related Biological Abstracts
1. Transductive Support Vector Classification for RNA-Related Biological Abstracts
- Blake Adams
- Graduate Student
- Department of Computer Science
- Advisor: Dr. Muhammad A. Rahman
2. Overview
- Statistical Learning Theory
- Support Vector Machines
- Linear Separability
- Project Motivation / Concept
- Expectations
- Program Design / Algorithm Implementation
- Results
- Demonstration
- Acknowledgements
- Questions and Answers
3. Statistical Learning Theory
- Introduced by Vladimir Vapnik and Alexey Chervonenkis.
- Four major areas:
  - Theory of consistency of learning processes: what are the necessary conditions for consistency of a learning process?
  - Nonasymptotic theory of the rate of convergence of learning processes: how fast is the rate of convergence of the learning process?
  - Theory of controlling the generalization ability of learning processes: how can one control the rate of convergence (generalization) of the learning process?
  - Theory of constructing learning machines: how can one construct algorithms that can control the generalization ability?
- This framework introduced the support vector machine.
4. Support Vector Machines
- The Support Vector Machine is a classification technique that is receiving heavy attention due to its classification precision.
- It has been especially successful in text categorization.
- It is also showing good results in image recognition, such as face and fingerprint identification.
5. Precision Through SVM
- Support Vector Machines apply supervised learning grounded in statistical learning theory. The technique handles high-dimensional data well and avoids the pitfalls of local minima. It represents decision boundaries using a subset of the training examples known as support vectors.
6. Structural Risk Minimization
- Support vector machines are based on the Structural Risk Minimization principle. The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error.
- The true error of h is the probability that h will make an error on an unseen and randomly selected test example.
7. How Does SVM Work?
- Uses training data to create a set of plot points that can be mapped out and used to predict the status of future information.
- Finds a hyperplane H that separates positive and negative examples with the maximum margin.
8. Linearly Separable Data
- Linearly separable datasets (two classes)
- Hyperplane of separation
- Decision boundaries
9. Consequences of Hyperplane Selection
- Maximize the decision-boundary margin for the best chance of success.
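The separating-hyperplane idea on the preceding slides can be made concrete with the decision rule sign(w · x + b). This is a minimal sketch with made-up weights and bias, not values learned by an SVM:

```java
// Minimal sketch of an SVM-style decision rule. The weight vector w and
// bias b used below are illustrative values, not weights learned from data.
public class DecisionBoundary {

    // Returns +1 or -1 depending on which side of the hyperplane
    // w . x + b = 0 the point x falls on.
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0}; // hypothetical weights
        double b = 0.0;           // hypothetical bias
        System.out.println(classify(w, b, new double[]{2.0, 1.0})); // 1
        System.out.println(classify(w, b, new double[]{1.0, 2.0})); // -1
    }
}
```

Maximizing the margin amounts to choosing, among all (w, b) that separate the training data this way, the pair whose boundary is farthest from the closest training points (the support vectors).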
10. Research Motivation
- Keyword searches have become the norm for finding information in large bodies of documents, but such searches often prove to be highly imprecise.
- Example:
  - PubMed is a service of the National Library of Medicine that includes over 15 million citations from MEDLINE and other life science journals for biomedical articles back to the 1950s. PubMed includes links to full-text articles and other related resources. The site adds new citations on a daily basis.
  - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed&term=mRNA
  - A search on the expressions "RNA" and "Ribosomal Nucleic Acid" at PubMed earlier in the semester (geared toward finding articles SPECIFICALLY about RNA research) yielded a success rate of 38% based on a review of the first 50 abstracts.
- What can be done to improve the precision of searches in such large bodies of documents?
11. Expectations
- It is reasonable to presume that Support Vector Machines can yield statistically significant improvements over this success rate at minimal cost to the user.
- If a user were able to request a body of documents on a keyword from such a database, and then read a small subset of abstracts, identifying positive and negative examples, a Support Vector Machine could be used to key on the articles the user is interested in reading.
12. Learning Technique
- One of the shortcomings of traditional SVM is its reliance on inductive learning (the process of reaching a general conclusion from specific examples). While this method is highly effective when applied properly, it often requires MANY examples (at least hundreds, preferably thousands) in order to return good results.
13. Learning Technique
- Transductive Learning
  - Implemented by Thorsten Joachims, author of SVM-Light.
  - Joachims is well known for his work on Support Vector Machines and text categorization, and is highly regarded as one of the foremost authorities on the subject.
  - Transductive learning takes into account a specific set of data and attempts to minimize error for that specific set based on a minimal number of training examples.
14. Implementing Support Vector Machines
- Fortunately, Joachims' SVM-Light already implements transductive learning successfully. Thus, the task in this project was not to reinvent the wheel of Support Vector Machines, but to develop a set of software tools that convert sets of articles into training and testing data that SVM-Light can read and learn from.
- http://svmlight.joachims.org/
15. Using SVM-Light
- Requires specifically formatted input.
- Things the SVM-Light user must supply:
  - Feature set
  - Scoring method
  - Training data file
  - Testing data file
- Things SVM-Light generates:
  - Model file
  - Predictions file
16. Feature Set
- Features can be defined in many ways:
  - Single words: ANY word that appears in more than one document relating to a particular subject (the "bag of words" approach).
  - Terms: topic-specific expressions such as "ribosomal nucleic acid", "translation", "interference", "genetic".
  - Combination: any method including both concepts, such as a weighting scheme that gives more weight to words that also qualify as terms.
17. Scoring Method
- Every feature needs a corresponding value that represents that particular feature's impact in the given example.
- A popular method of feature scoring (and the one implemented in this project) is Term Frequency / Inverse Document Frequency:
  - TF x log(N / DF)
  - TF: total number of times the term occurs in the document
  - N: total number of documents in the corpus
  - DF: total number of documents in the corpus that contain the term
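The TF x log(N/DF) score can be computed in a few lines. A sketch (class name hypothetical; the slide does not state the logarithm base, so natural log is assumed here):

```java
// TF-IDF scoring as described on this slide: TF * log(N / DF).
public class TfIdf {

    // tf: occurrences of the term in the document
    // n:  total documents in the corpus
    // df: documents in the corpus containing the term
    static double score(int tf, int n, int df) {
        // Natural logarithm assumed; the slide leaves the base unspecified.
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in an abstract and appearing in 40 of
        // the 400 abstracts in the corpus:
        System.out.println(score(3, 400, 40)); // 3 * ln(10), about 6.9078
    }
}
```

Note the effect: rare terms (low DF) score high, while a term appearing in every one of the 400 documents scores 0, since log(400/400) = 0.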
18. Generating Training Sets for Transductive Learning
- File format:
  - <expected outcome> <feature>:<value> <feature>:<value> <feature>:<value> ...
- Within each line, features must be listed in ascending order of feature id.
- Thus a completed training file might look something like:
  - 1 1:2.8473 2:3.8324 9:5.423 19:1.003
  - 1 4:1.11 9:5.423 11:0.012
  - 1 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - -1 1:8.8473 2:3.8324 9:5.423 19:1.003
  - -1 45.135.423 11:0.012 28:19.6548
  - -1 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - 0 1:0.8473 9:3.84 19:5.423 29:1.00
  - 0 4:1.11 9:5.423 11:0.012
  - 0 2:2.84734 15:10.9213 21:9.343 22:7.7231
  - 0 1:4.5473 2:6.7324 9:5.42
  - 0 4:1.11 9:5.423 11:0.012
  - 0 1:2.84734 15:10.9213 21:9.343 44:7.7231
  - 0 19:5.1864 24:7.215 28:12.123
- Examples labeled 0 are the unlabeled examples that transductive SVM-Light classifies itself.
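An SVM-Light input line has the shape `<label> id:value id:value ...` with feature ids in ascending order. A sketch of a helper that emits one such line (class and method names hypothetical); a TreeMap keeps the ids sorted automatically:

```java
import java.util.Map;
import java.util.TreeMap;

// Builds one SVM-Light formatted line: "<label> id:value id:value ...".
public class SvmLightLine {

    // A TreeMap iterates its keys in ascending order, which yields the
    // ascending feature ids SVM-Light expects.
    static String format(int label, TreeMap<Integer, Double> features) {
        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> f = new TreeMap<>();
        f.put(95, 0.423);  // inserted out of order on purpose
        f.put(1, 2.8473);
        f.put(23, 0.8324);
        System.out.println(format(1, f)); // 1 1:2.8473 23:0.8324 95:0.423
    }
}
```

Using a sorted map here means no separate reordering pass is needed when the training and test files are written.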
19. Generating Test Sets for Transductive Learning
- Since transductive learning works with a pre-established set, the test file has exactly the same format. The only difference is that the expected outcomes originally set to 0 are now set to their actual values, so that the system can test how well it predicted the outcomes.
20. Key Experiments
- How well can SVM-Light, working transductively:
  - distinguish between abstracts that ARE and ARE NOT about RNA?
  - distinguish between abstracts that ARE and ARE NOT about each of the following types of RNA?
    - messenger RNA
    - ribosomal RNA
    - transfer RNA
    - small nuclear RNA
  - glean abstracts about a specific type of RNA from a body of abstracts that are all RNA-centric?
21. Implementation
- Two key elements were collected in order to implement this project.
- A corpus of abstracts was collected and categorized manually from the PubMed database. These articles fell into the following categories:
  - 40 abstracts specific to RNA research AND containing the term RNA
  - 40 abstracts not specific to RNA research AND containing the term RNA
  - 40 abstracts specific to messenger RNA research AND containing the term mRNA
  - 40 abstracts not specific to messenger RNA research AND containing the term mRNA
  - 40 abstracts specific to transfer RNA research AND containing the term tRNA
  - 40 abstracts not specific to transfer RNA research AND containing the term tRNA
  - 40 abstracts specific to ribosomal RNA research AND containing the term rRNA
  - 40 abstracts not specific to ribosomal RNA research AND containing the term rRNA
  - 40 abstracts specific to small nuclear RNA research AND containing the term snRNA
  - 40 abstracts not specific to small nuclear RNA research AND containing the term snRNA
- This resulted in a GRAND TOTAL corpus of 400 abstracts.
22. Implementation
- Term dictionary
  - A dictionary of terms specific to each topic was developed based on the terms' relevance to the particular set of abstracts (e.g., the dictionary for messenger RNA abstracts would contain the term "messenger", but the one for small nuclear RNA abstracts would not).
23. Implementation
- Pre-processing
  - Prior to conversion from abstract to feature/value sets, a limited amount of pre-processing was applied:
    - All special characters were removed from the abstracts, including parentheses, commas, periods, apostrophes, colons, semicolons, dashes, etc.
    - Articles were sorted so that all positive abstracts preceded all negative abstracts. This was done to make it easy for the software tools to identify positive and negative abstracts while generating training and testing sets. It should be noted that the order of feature sets has no bearing on learning.
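The special-character removal step can be sketched with a regular expression. The class name and the exact rules here (keep letters, digits, and whitespace; lower-case the result) are illustrative assumptions, not the project's actual tool:

```java
// Sketch of the pre-processing step: strip special characters
// (parentheses, commas, periods, apostrophes, colons, semicolons,
// dashes, etc.) before tokenization.
public class Preprocess {

    static String clean(String abstractText) {
        return abstractText
                .replaceAll("[^A-Za-z0-9\\s]", " ") // special chars -> space
                .replaceAll("\\s+", " ")            // collapse runs of spaces
                .trim()
                .toLowerCase();                     // assumption: case-fold too
    }

    public static void main(String[] args) {
        System.out.println(clean("Long regarded as a mere ribosome-producing 'factory', the nucleolus"));
        // long regarded as a mere ribosome producing factory the nucleolus
    }
}
```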
24. Implementation
- Software development package: Java
- Two major data structures:
  - A TreeMap containing Term objects that represent every term feature in every document in the corpus.
    - A Term object holds the actual term, a unique term id, a document id, a term frequency score, and a document frequency score. Objects are keyed by an id that combines the term id with the document id.
  - A TreeMap containing TermDF objects that represent individual features as they enter the program.
    - A TermDF object holds the actual term, the document frequency score, and, as a placeholder, the id of the last document that contained the term.
- Once each map is constructed, the TermDF map is cross-referenced with the Term map to set the DF score to the appropriate value in each term.
25. Algorithm / Implementation
- The set of tools for this project was developed in Java. The programmatic implementation included the following steps:
  - Read a set of abstracts and a term dictionary from file.
  - Tokenize the abstracts and test every tokenized word against the term dictionary.
  - If the token exists in the term dictionary, it must be searched for in the current corpus set of terms to see whether it has already been added.
    - If the feature is not found in the current set, it is added, assigned an id, and its TF and DF are initialized to 1.
    - If the feature is found in the current set, the current document count is checked to determine whether the term is an initial occurrence in a new document or a subsequent occurrence in the current document.
      - If it is an initial occurrence in a new document, DF is incremented and the TF for the term in the current article is set to one.
      - If it is a subsequent occurrence, DF remains unchanged and the TF for the current article is incremented by one.
  - Once EVERY term for EVERY abstract is accounted for, the TF/IDF for each term can be calculated.
  - Finally, the data structure is reorganized so that the feature ids in each feature set run from lowest to highest.
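The TF/DF bookkeeping in the steps above can be sketched with nested maps. This is a simplified stand-in for the project's Term/TermDF structures (names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Tracks per-document term frequency (TF) and corpus-wide document
// frequency (DF) using the update rules described on this slide.
public class TermCounter {

    final Map<String, Map<Integer, Integer>> tf = new HashMap<>(); // term -> (docId -> TF)
    final Map<String, Integer> df = new HashMap<>();               // term -> DF

    void observe(String term, int docId) {
        Map<Integer, Integer> perDoc = tf.computeIfAbsent(term, t -> new HashMap<>());
        if (!perDoc.containsKey(docId)) {
            // Initial occurrence in a new document: DF incremented, TF set to 1.
            perDoc.put(docId, 1);
            df.merge(term, 1, Integer::sum);
        } else {
            // Subsequent occurrence in the same document: DF unchanged, TF incremented.
            perDoc.merge(docId, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        TermCounter c = new TermCounter();
        c.observe("ribosomal", 0);
        c.observe("ribosomal", 0); // same document again
        c.observe("ribosomal", 1); // new document
        System.out.println(c.tf.get("ribosomal").get(0)); // 2
        System.out.println(c.df.get("ribosomal"));        // 2
    }
}
```

Once every abstract has been observed, each (term, document) pair's TF and the term's DF are available for the TF x log(N/DF) calculation.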
26. Mapping a Feature
- [Flowchart] For each tokenized word:
  - Not in the keyword list: discard.
  - In the keyword list, already in the Term Map, and already seen in the current document: increment termFreq.
  - In the keyword list and in the Term Map, but first occurrence in this document: increment docFreq.
  - In the keyword list but not yet in the Term Map: assign id and docId, set termFreq to 1; then check the TermDocFreqMap: if absent, assign featureId, set docFreq to 1, and record lastDocId; if present but new to this document, increment docFreq; otherwise do nothing.
27. Implementation
- So THIS:
  - All organisms sense and respond to conditions that stress their homeostatic mechanisms Here we review current studies showing that the nucleolus long regarded as a mere ribosome producing factory plays a key role in monitoring and responding to cellular stress After exposure to extra or intracellular stress cells rapidly down regulate the synthesis of ribosomal RNA
- Becomes THIS:
  - 1 2:5.619650483261998 5:0.17733401528291545 6:1.869439496743316 12:0.7184649885442351 13:3.0441924140622225 25:2.772588722239781 50:3.283414346005772 51:6.566828692011544
28. Implementation
- All experiments were conducted in sets of 80 abstracts, with 5 positive and 5 negative training examples. Additionally, 35 positive and 35 negative examples were included but left unlabeled during the training phase, allowing the program to decide where each feature set should fall in order to minimize error.
29. Results and Conclusions
- The outcome of every experiment far outpaced the researchers' highest expectations.
- On examination of misclassified abstracts from the first set of results, some abstracts were found to have been misclassified by the manual classifiers. Correcting these errors led to even better results.
- The researchers believe that incorporating such a system into a database like PubMed would allow users to query on a keyword and then use the support vector machine to narrow the results down to the specific information they are looking for, making the most of their research time.
30. Demonstration
31. Future Work
- Future work on this project will address:
  - Precision in feature selection
  - A web interface to tie the application to real results
32. Acknowledgements
- Dr. Muhammad A. Rahman, Assistant Professor, University of West Georgia
- Dr. Goran Nenadic, Assistant Professor, University of Manchester
- Thorsten Joachims, Assistant Professor, Cornell University
- The Department of Computer Science, University of West Georgia