Transductive Support Vector Classification for RNA Related Biological Abstracts

Transcript and Presenter's Notes

1
Transductive Support Vector Classification for
RNA Related Biological Abstracts
  • Blake Adams
  • Graduate Student
  • Department of Computer Science
  • Advisor: Dr. Muhammad A. Rahman

2
Overview
  • Statistical Learning Theory
  • Support Vector Machines
  • Linear Separability
  • Project Motivation/Concept
  • Expectations
  • Program Design / Algorithm Implementation
  • Results
  • Demonstration
  • Acknowledgements
  • Questions & Answers

3
Statistical Learning Theory
  • Introduced by Vladimir Vapnik and Alexey
    Chervonenkis
  • Four major areas
  • Theory of consistency of learning processes
  • What are the (necessary and sufficient) conditions
    for consistency of a learning process?
  • Nonasymptotic theory of the rate of convergence
    of learning processes
  • How fast is the rate of convergence of the
    learning process?
  • Theory of controlling the generalization ability
    of learning processes
  • How can one control the rate of convergence
    (generalization) of the learning process?
  • Theory of constructing learning machines
  • How can one construct algorithms that can control
    the generalization ability?
  • This concept introduced the support vector
    machine.

4
Support Vector Machines
  • The Support Vector Machine is a classification
    technique that is receiving heavy attention due
    to its classification precision.
  • It has been especially successful in text
    categorization.
  • It is also showing good results in image
    recognition tasks such as face and fingerprint
    identification.

5
Precision Through SVM
  • Support Vector Machines apply supervised
    statistical learning theory. The technique works
    with high-dimensional data and avoids the
    pitfalls of local minima. It represents decision
    boundaries using a subset of training examples
    known as support vectors.

6
Structural Risk Minimization
  • Support vector machines are based on the
    Structural Risk Minimization principle. The idea
    of structural risk minimization is to find a
    hypothesis h for which we can guarantee the
    lowest true error.
  • The true error of h is the probability that h
    will make an error on an unseen and randomly
    selected test example.
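The guarantee behind this principle is usually expressed as a VC-style bound. A standard form (added here for reference; it does not appear on the original slide) states that, with probability at least 1 - \eta over a training sample of size n, every hypothesis h from a class of VC dimension d satisfies

    R(h) \le R_{\mathrm{emp}}(h) + \sqrt{ \frac{ d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4} }{ n } }

where R(h) is the true error and R_{emp}(h) is the training error. Structural risk minimization chooses the hypothesis class, and the hypothesis within it, that minimizes the right-hand side rather than the training error alone.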

7
How does SVM work?
  • Uses training data to create a set of labeled
    points that can be mapped into a feature space
    and used to predict the class of future examples.
  • Finds a separating hyperplane H that divides
    positive and negative examples with the maximum
    margin.
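In symbols (added for clarity; this is the standard hard-margin formulation, not specific to this project): for training examples x_i with labels y_i in {+1, -1}, the hyperplane H: w \cdot x + b = 0 is found by solving

    \min_{w,\,b} \; \tfrac{1}{2}\|w\|^2
    \quad \text{subject to} \quad
    y_i \, (w \cdot x_i + b) \ge 1 \;\; \text{for all } i.

The constraints keep every example on the correct side of H, and minimizing \|w\| maximizes the margin 2/\|w\|.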

8
Linearly Separable Data
  • Linearly separable datasets (e.g. a positive and
    a negative class)
  • Hyper-plane of separation
  • Decision boundaries

9
Consequences of hyper-plane selection
  • Maximizing the margin around the decision
    boundary gives the best chance of correctly
    classifying unseen examples.

10
Research Motivation
  • Keyword searches have become the norm for finding
    information in large bodies of documents, but such
    searches often prove to be highly imprecise.
  • Example
  • PubMed is a service of the National Library of
    Medicine that includes over 15 million citations
    from MEDLINE and other life science journals for
    biomedical articles back to the 1950s. PubMed
    includes links to full text articles and other
    related resources. This site adds new citations
    on a daily basis.
  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed&term=mRNA
  • A search on the expressions RNA and Ribonucleic
    Acid at PubMed earlier in the semester (geared
    towards finding articles SPECIFICALLY about RNA
    research) yielded a success rate of 38% based on
    a review of the first 50 abstracts.
  • What can be done to improve the precision of
    searches in such large bodies of documents?

11
Expectations
  • It is reasonable to presume that Support Vector
    Machines can yield statistically significant
    improvements over this success rate at a minimal
    cost to the user.
  • If a user were able to request a body of documents
    on a keyword from such a database and then read
    a small subset of abstracts, identifying positive
    and negative examples, a Support Vector
    Machine could be used to pick out the articles
    the user is interested in reading.

12
Learning Technique
  • One of the shortcomings of traditional SVM is
    its reliance on inductive learning (the
    process of reaching a general conclusion from
    specific examples). While this method is highly
    effective when applied properly, it often
    requires MANY examples (at least hundreds,
    preferably thousands) in order to return
    good results.

13
Learning Technique
  • Transductive Learning
  • Implemented by Thorsten Joachims, author of
    SVM-Light
  • Well known for his work on Support Vector
    Machines and text categorization, and highly
    regarded as one of the foremost authorities on
    the subject.
  • Transductive Learning takes into account a
    specific set of data and attempts to minimize
    error for that specific set based on a minimal
    number of training examples.

14
Implementing Support Vector Machines
  • Fortunately, Joachims' SVM-Light already
    implements Transductive Learning successfully,
    so in this project the task was not to
    reinvent the wheel of Support Vector Machines,
    but to develop a set of software tools to convert
    sets of articles into training and testing data
    that can be read and learned by SVM-Light.
  • http://svmlight.joachims.org/

15
Using SVM Light
  • Requires input in a specific format.
  • Things needed by the SVM-Light user
  • Feature set
  • Scoring method
  • Training data file
  • Testing data file
  • Things SVM-Light generates
  • Model file
  • Predictions file
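A minimal sketch of how this tool chain might be driven from Java, assuming the svm_learn and svm_classify executables that ship with SVM-Light are on the system PATH; the file names train.dat, test.dat, model.dat, and predictions.dat are illustrative, not taken from the project:

  import java.io.IOException;

  public class RunSvmLight {
      public static void main(String[] args) throws IOException, InterruptedException {
          // Learn a model from the formatted training file (unlabeled examples
          // marked 0 switch SVM-Light into transductive mode).
          run("svm_learn", "train.dat", "model.dat");
          // Classify the test file with the learned model and write predictions.
          run("svm_classify", "test.dat", "model.dat", "predictions.dat");
      }

      private static void run(String... command) throws IOException, InterruptedException {
          Process p = new ProcessBuilder(command).inheritIO().start();
          if (p.waitFor() != 0) {
              throw new RuntimeException("Command failed: " + String.join(" ", command));
          }
      }
  }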

16
Feature Set
  • Features can be defined in many ways
  • Single words
  • ANY word that appears in more than one document
    relating to a particular subject (Bag of words
    approach).
  • Terms
  • Topic specific ribosomal nucleic acid,
    translation, interference, genetic.
  • Combination
  • Any method including both concepts such as a
    weighing scheme that allows more weight to
    words that also classify as terms.

17
Scoring Method
  • Every feature needs a corresponding value that
    represents that particular features impact in
    the given example.
  • A popular method of feature scoring (and the one
    implemented in this project) is Term Frequency
    Inverse Document Frequency
  • TF X Log(N/ DF)
  • TF Total number of times the term occurs in the
    document
  • N Total number of documents in the corpus
  • DF Total number of documents in the corpus that
    contain the term.
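A small illustration of the scoring formula above; the class and method names here are invented for this sketch and are not from the project code, and the base of the logarithm is an implementation choice (the natural log is used below):

  public class TfIdf {
      /**
       * Computes TF x log(N / DF) for one term in one document.
       *
       * @param tf number of times the term occurs in the document
       * @param n  total number of documents in the corpus
       * @param df number of documents in the corpus containing the term
       */
      public static double score(int tf, int n, int df) {
          return tf * Math.log((double) n / df);
      }

      public static void main(String[] args) {
          // Example: a term occurring 3 times in an abstract and appearing
          // in 40 of 400 abstracts in the corpus.
          System.out.println(score(3, 400, 40)); // about 6.91
      }
  }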

18
Generating Training Sets for Transductive Learning
  • File Format
  • ltexpected outcomegt ltfeaturegtltvaluegt
    ltfeaturegtltvaluegt ltfeaturegtltvaluegt
  • Feature values must also be organized from lowest
    to highest
  • Thus a completed train file might look something
    like
  • 1 12.8473 23.8324 95.423 191.003
  • 1 41.11 95.423 110.012
  • 1 12.84734 1510.9213 219.343 447.7231
  • -1 18.8473 23.8324 95.423 191.003
  • -1 45.135.423 110.012 2819.6548
  • -1 12.84734 1510.9213 219.343 447.7231
  • 0 10.8473 93.84 195.423 291.00
  • 0 41.11 95.423 110.012
  • 0 22.84734 1510.9213 219.343 227.7231
  • 0 14.5473 26.7324 95.42
  • 0 41.11 95.423 110.012
  • 0 12.84734 1510.9213 219.343 447.7231
  • 0 195.1864 247.215 2812.123
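A sketch of how one such line could be assembled in Java. The TreeMap keeps feature ids in increasing order, as the format requires; the class and method names are illustrative, not from the project code:

  import java.util.Map;
  import java.util.TreeMap;

  public class SvmLightLine {
      /** Builds one example line: "<label> <feature>:<value> <feature>:<value> ...". */
      public static String format(int label, Map<Integer, Double> features) {
          StringBuilder sb = new StringBuilder(String.valueOf(label));
          for (Map.Entry<Integer, Double> e : new TreeMap<>(features).entrySet()) {
              sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
          }
          return sb.toString();
      }

      public static void main(String[] args) {
          Map<Integer, Double> features = new TreeMap<>();
          features.put(9, 5.423);
          features.put(2, 3.8324);
          features.put(1, 2.8473);
          System.out.println(format(1, features)); // 1 1:2.8473 2:3.8324 9:5.423
      }
  }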

19
Generating Test Sets for Transductive Learning
  • Since Transductive Learning works with a
    pre-established set, the test file has exactly
    the same format. The only difference is that the
    expected outcomes originally set to 0 are
    now set to their actual values so that the system
    can test how well it predicted the outcomes.

20
Key Experiments
  • How well can SVM Light transductively
  • Distinguish between abstracts that ARE and ARE
    NOT about RNA
  • Distinguish between abstracts that ARE and ARE
    NOT about each of the following types of RNA
  • messenger RNA
  • ribosomal RNA
  • transfer RNA
  • small nuclear RNA
  • Glean abstracts about a specific type of RNA from
    a body of abstracts that are all RNA-centric.

21
Implementation
  • In order to implement this project, two key
    elements were assembled:
  • A corpus of abstracts was collected and
    categorized manually from the PubMed database.
    These abstracts fell into the following
    categories:
  • 40 abstracts specific to RNA research AND
    containing the term RNA.
  • 40 abstracts not specific to RNA research AND
    containing the term RNA.
  • 40 abstracts specific to messenger RNA research
    AND containing the term mRNA.
  • 40 abstracts not specific to messenger RNA
    research AND containing the term mRNA.
  • 40 abstracts specific to transfer RNA research
    AND containing the term tRNA.
  • 40 abstracts not specific to transfer RNA
    research AND containing the term tRNA.
  • 40 abstracts specific to ribosomal RNA research
    AND containing the term rRNA.
  • 40 abstracts not specific to ribosomal RNA
    research AND containing the term rRNA.
  • 40 abstracts specific to small nuclear RNA
    research AND containing the term snRNA.
  • 40 abstracts not specific to small nuclear RNA
    research AND containing the term snRNA.
  • This resulted in a GRAND TOTAL corpus of 400
    abstracts.

22
Implementation
  • Term Dictionary
  • A dictionary of terms specific to each topic was
    developed based on these terms' relevance to the
    particular set of abstracts (e.g. messenger RNA
    abstracts would correspond to a dictionary that
    contained the term messenger, but small nuclear
    RNA abstracts would not).

23
Implementation
  • Pre-processing
  • Prior to conversion from abstract to
    feature/value sets, a limited amount of
    pre-processing was implemented
  • All special characters were removed from the
    abstracts, including parentheses, commas, periods,
    apostrophes, colons, semicolons, dashes, etc.
    (a sketch of this step follows this list).
  • Articles were sorted into an arrangement of all
    positive abstracts followed by all negative
    abstracts. This was done to facilitate
    identification of positive and negative abstracts
    by the software tools during generation of
    training and testing sets. It should be
    mentioned that the order of feature sets has no
    bearing on learning.
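A minimal sketch of the special-character stripping step, assuming plain-text ASCII abstracts; the regular expression shown is one reasonable choice, not necessarily the one used in the project:

  public class Preprocess {
      /** Removes parentheses, commas, periods, colons, dashes, and other punctuation. */
      public static String clean(String abstractText) {
          return abstractText
                  .replaceAll("[^A-Za-z0-9\\s]", " ")  // keep letters, digits, whitespace
                  .replaceAll("\\s+", " ")             // collapse runs of whitespace
                  .trim();
      }

      public static void main(String[] args) {
          System.out.println(clean("cells rapidly down-regulate the synthesis of ribosomal RNA."));
          // -> cells rapidly down regulate the synthesis of ribosomal RNA
      }
  }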

24
Implementation
  • Software development package: Java
  • Two major data structures (see the sketch after
    this list)
  • TreeMap containing Term objects that represent
    every term feature in every document in the
    corpus.
  • Term object: actual term, unique term id,
    document id, term frequency score, document
    frequency score. Objects are keyed by an id that
    combines the term id with the document id.
  • TreeMap containing TermDF objects that represent
    individual features as they enter the program.
  • TermDF objects contain the actual term, the
    document frequency score, and the last document
    id that contained the term as a placeholder.
  • Once each map is constructed, the TermDF map is
    cross-referenced with the Term map to set the DF
    score to the appropriate value in each Term.
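A sketch of what these two structures might look like in Java. The field names follow the wording on this slide and on the flowchart two slides ahead; the exact definitions in the project code may differ:

  import java.util.TreeMap;

  // One entry per (term, document) pair, keyed by a string combining term id and document id.
  class Term {
      String term;    // the actual term
      int termId;     // unique term id
      int docId;      // document this occurrence belongs to
      int termFreq;   // TF within this document
      int docFreq;    // DF, filled in later by cross-referencing the TermDF map
  }

  // One entry per distinct term across the whole corpus.
  class TermDF {
      String term;      // the actual term
      int featureId;    // feature id assigned on first occurrence
      int docFreq;      // number of documents containing the term
      int lastDocId;    // last document seen containing the term (placeholder)
  }

  class CorpusMaps {
      TreeMap<String, Term> termMap = new TreeMap<>();          // keyed by termId + "-" + docId
      TreeMap<String, TermDF> termDocFreqMap = new TreeMap<>(); // keyed by the term itself
  }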

25
Algorithm/Implementation
  • The set of tools for this project was developed
    in Java. The programmatic implementation included
    the following steps:
  • Read a set of abstracts and a term dictionary
    from file.
  • Tokenize abstracts and test every tokenized word
    against the term dictionary.
  • If the token exists in the term dictionary, it
    must be searched for in the current corpus set of
    terms to see if it has already been added.
  • If the feature is not found in the current set,
    it is added, assigned an id, and its TF and DF
    are initialized to 1.
  • If the feature is found in the current set, the
    current document count is checked to determine
    whether the term is an initial occurrence in a
    new document or a subsequent occurrence in the
    current document
  • If it is an initial occurrence in a new document,
    DF is incremented, and TF for the term in the
    current article is set to one.
  • If it is a subsequent occurrence, the DF remains
    unchanged and the TF for the current article is
    incremented by one.
  • Once EVERY term for EVERY abstract is accounted
    for, the TF-IDF for each term can be calculated.
  • Finally, the data structure is reorganized to
    list the features of each feature set in
    increasing order of feature id (see the sketch
    after this list).
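A condensed sketch of the per-token update logic described in the list above, using the illustrative Term and TermDF classes from the earlier sketch; method and variable names are assumptions, not the project's actual code:

  import java.util.Set;
  import java.util.TreeMap;

  class FeatureMapper {
      TreeMap<String, Term> termMap = new TreeMap<>();          // keyed by featureId + "-" + docId
      TreeMap<String, TermDF> termDocFreqMap = new TreeMap<>(); // keyed by the term itself
      int nextFeatureId = 1;

      /** Processes one tokenized word from the document with the given id. */
      void map(String token, int docId, Set<String> termDictionary) {
          if (!termDictionary.contains(token)) {
              return;                            // not in the term dictionary: discard
          }
          TermDF df = termDocFreqMap.get(token);
          if (df == null) {                      // first occurrence anywhere in the corpus
              df = new TermDF();
              df.term = token;
              df.featureId = nextFeatureId++;
              df.docFreq = 1;
              df.lastDocId = docId;
              termDocFreqMap.put(token, df);
          } else if (df.lastDocId != docId) {    // first occurrence in a new document
              df.docFreq++;
              df.lastDocId = docId;
          }
          String key = df.featureId + "-" + docId;
          Term t = termMap.get(key);
          if (t == null) {                       // initial occurrence in this document: TF = 1
              t = new Term();
              t.term = token;
              t.termId = df.featureId;
              t.docId = docId;
              t.termFreq = 1;
              termMap.put(key, t);
          } else {                               // subsequent occurrence: increment TF
              t.termFreq++;
          }
      }
  }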

26
Mapping a Feature
(Flowchart: a tokenized word is checked against the
keyword list and discarded if it is not a keyword. A
keyword is then looked up in the Term Map: a repeat
occurrence in the current document increments termFreq,
while a new occurrence is assigned an id and docID with
termFreq set to 1. The word is also looked up in the
TermDocFreqMap: if absent, it is assigned a featureId,
a docFreq of 1, and a lastDocId; if present but last
seen in a different document, docFreq is incremented;
otherwise nothing is done.)
27
Implementation
  • So THIS
  • All organisms sense and respond to conditions
    that stress their homeostatic mechanisms Here we
    review current studies showing that the nucleolus
    long regarded as a mere ribosome producing
    factory plays a key role in monitoring and
    responding to cellular stress After exposure to
    extra or intracellular stress cells rapidly down
    regulate the synthesis of ribosomal RNA
  • Becomes THIS
  • 1 2:5.619650483261998 5:0.17733401528291545
    6:1.869439496743316 12:0.7184649885442351
    13:3.0441924140622225 25:2.772588722239781
    50:3.283414346005772 51:6.566828692011544

28
Implementation
  • All experiments were conducted on sets of 80
    abstracts, with 5 positive and 5 negative
    training examples. Additionally, 35 positive and
    35 negative examples were included but left
    unlabeled during the training phase, allowing the
    program to decide where each feature set should
    fall in order to minimize error.

29
Results and Conclusions
  • The outcome of every experiment far exceeded the
    researchers' highest expectations.
  • Following examination of misclassified abstracts
    from the first set of results, some abstracts
    were found to have been misclassified by the
    manual classifiers. Correcting these errors led
    to even better results.
  • It is the belief of the researchers that
    incorporating such a system into a database like
    PubMed would allow users to query on a
    keyword and then use the support vector machine
    to narrow the results down to the specific
    information they are looking for, maximizing
    the use of research time.

30
Demonstration
31
Future Work
  • Future work in this project will address
  • Precision in feature selection
  • Web Interface to tie application to real results

32
Acknowledgements
  • Dr. Muhammad A. Rahman, Assistant Professor,
    University of West Georgia
  • Dr. Goran Nenadic, Assistant Professor,
    University of Manchester
  • Thorsten Joachims, Assistant Professor, Cornell
    University
  • The Department of Computer Science, University
    of West Georgia

33
  • Questions?