Protein Fold Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Protein Fold Classification
  • Huzefa Rangwala
  • (rangwala@cs.umn.edu)
  • http://www.cs.umn.edu/~rangwala

Some Slides are from a talk given by Prof. Karypis
2
Need to Know About Proteins
  • Primary Structure: Sequence
  • 20 Amino Acids (A, R, N, ...)
  • Secondary Structure
  • Alpha, Beta, Coil
  • Tertiary Structure
  • Three-dimensional atom coordinates
  • Fold Classification
  • Databases such as SCOP and CATH classify folds

3
Protein Family Prediction
  • The goal is to assign a newly identified protein
    to an existing family.
  • The complexity of this problem depends on how
    similar the new protein is to previously
    identified/classified proteins.
  • Popular Approaches
  • BLAST
  • Profile HMMs
  • Discriminative models

4
Classification Approaches
  • Nearest-neighbor classification
  • Local/global sequence alignment, BLAST, etc.
  • Markov Model-based approaches
  • Markov chains, Hidden Markov Models, Profile
    HMMs, etc.
  • Discriminative models
  • Neural networks, Support Vector Machines, etc.
  • Arbitrary combination of the above basic
    approaches.

5
Discriminative Models
  • Within the machine learning community,
    classification algorithms based on discriminative
    models, especially kernel-based support vector
    machines, have become very popular as they
    produce very accurate classifiers.
  • The key idea of these discriminative models is to
    build a model that separates the various classes
  • discriminate between classes as opposed to
    describing/generating each class.
  • In the last 5 years, similar models have been
    developed for sequence classification.
  • The work focused on developing string-based
    kernel functions that can be used within the
    context of support vector machines.

6
Key Difference
  • Question
  • So what is the key difference between a
    discriminative model like SVM and a generative
    model like HMM ?

7
Linear Separators
  • Which of the linear separators is optimal?

8
Classification Margin
  • Distance from example x to the separator is
    r = y(wᵀx + b) / ‖w‖
  • Examples closest to the hyperplane are support
    vectors.
  • Margin ρ of the separator is the width of
    separation between classes.

9
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors: K(xi, xj) = xiᵀxj
  • If every datapoint is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
  • K(xi, xj) = φ(xi)ᵀφ(xj)
  • A kernel function is some function that
    corresponds to an inner product into some feature
    space.
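A minimal sketch of the kernel trick on toy 2-D vectors: the quadratic kernel K(x, z) = (x·z)² computed directly in input space equals an ordinary inner product after the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²). The vectors and function names are illustrative, not from the slides.

```python
import math

def quad_kernel(x, z):
    # Kernel computed directly in input space: (x . z)^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # Explicit feature map corresponding to the quadratic kernel.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 0.5)
# Same value either way -- the kernel never builds phi(x) explicitly.
assert abs(quad_kernel(x, z) - dot(phi(x), phi(z))) < 1e-9
```

The point of the trick is the left-hand side: the kernel gives the feature-space inner product without ever materializing the (possibly huge) feature vectors.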

10
Feature representation
  • We apply a transformation function that converts
    sequences or profiles into feature vectors, which
    are fed into an SVM black box for training and
    classification.

11
Pairwise Kernel
  • Each dimension corresponds to a sequence in the
    training set
  • The value of a particular dimension is the
    similarity score against that training sequence

(Diagram: N training sequences → N vectors, each of
dimension N → SVM Learner → model learnt → SVM
Classifier; an unknown test sequence becomes a test
vector and yields the classification result.)
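A sketch of the pairwise feature representation: each sequence becomes an N-dimensional vector of similarity scores against the N training sequences. A toy positional-identity count stands in for the BLAST/alignment score a real system would use; the sequences are made up.

```python
def similarity(a, b):
    # Hypothetical stand-in for an alignment score:
    # number of matching positions (truncated to the shorter sequence).
    return sum(x == y for x, y in zip(a, b))

train = ["ACDEF", "ACDAA", "GGGGG"]  # N = 3 training sequences

def pairwise_vector(seq):
    # One dimension per training sequence.
    return [similarity(seq, t) for t in train]

vec = pairwise_vector("ACDEA")
assert len(vec) == len(train)
```

Training then produces N such vectors (one per training sequence), each of dimension N, exactly as in the diagram above.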
12
How do we classify ?
  • We build a binary classifier for each of the
    classes and then get a predicted output from each
    of the models.
  • The decision can be based on the maximum score,
    or on some weighting of the scores.
  • We can also build a class-against-class
    (one-vs-one) classifier. How?

13
k-mer or Spectrum Kernel
  • k-mer kernel
  • each dimension corresponds to a distinct k-mer
  • each sequence is represented as a frequency
    vector of the various k-mers that it contains.
  • k = 4
  • E.g. ACDAAA… →
  • hash ACDA
  • hash CDAA
  • hash DAAA
  • …
  • We have our vector of _______ features.
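A minimal sketch of the k-mer (spectrum) feature map: count every length-k substring, and take the kernel between two sequences as the dot product of their count vectors. Over the 20-letter amino-acid alphabet the full vector has 20^k dimensions, but only the observed k-mers need to be stored.

```python
from collections import Counter

def kmer_counts(seq, k):
    # Sparse frequency vector: k-mer -> number of occurrences.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k):
    # Dot product of the two (sparse) count vectors.
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[w] * cb[w] for w in ca)

counts = kmer_counts("ACDAAA", 4)  # the slide's example
assert set(counts) == {"ACDA", "CDAA", "DAAA"}
```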

14
Mismatch kernel (k,m)
  • Adds more information to the k-mer kernel
  • extends the k-mer kernel to allow for up to m
    mismatches
  • E.g. (k, m) = (4, 1)
  • ACDAAA… →
  • hash ?CDA, hash A?DA, hash AC?A, hash ACD?
  • hash ?DAA, …
  • hash ?AAA, …
  • …
  • We have our vector of _______ features.
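A sketch of the (k, m) mismatch idea for m = 1: every k-mer contributes not just its own dimension but also the patterns reachable by changing at most one position. Here "?" marks the mismatched position, as on the slide; a real implementation would expand "?" over all 20 amino acids rather than keep a wildcard symbol.

```python
def mismatch_patterns(seq, k):
    # All patterns each k-mer of `seq` contributes with <= 1 mismatch.
    patterns = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        patterns.add(kmer)  # exact match (0 mismatches)
        for j in range(k):
            # One mismatch allowed at position j.
            patterns.add(kmer[:j] + "?" + kmer[j + 1:])
    return patterns

pats = mismatch_patterns("ACDAAA", 4)  # the slide's (4, 1) example
assert {"?CDA", "A?DA", "AC?A", "ACD?"} <= pats
```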

15
HMM kernel / Fisher kernel
  • It is derived from a profile HMM
  • We train an HMM and learn the emission and
    transition probabilities
  • Each of these probabilities become a dimension in
    the feature vector.
  • How will this method compare to the others?
    Any predictions?
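A crude sketch of the slide's simplified description: flatten a trained HMM's transition and emission probabilities into one fixed-order feature vector. (The actual Fisher kernel uses gradients of the log-likelihood with respect to these parameters, not the parameters themselves.) The two-state toy HMM below is made up for illustration.

```python
# Hypothetical parameters of a tiny trained profile HMM.
transitions = {("M1", "M2"): 0.9, ("M1", "I1"): 0.1}
emissions = {("M1", "A"): 0.6, ("M1", "C"): 0.4}

def hmm_feature_vector():
    # Fixed (sorted) parameter order so every sequence maps to the
    # same dimensions: each probability becomes one feature.
    return [p for _, p in sorted(transitions.items())] + \
           [p for _, p in sorted(emissions.items())]

vec = hmm_feature_vector()
assert len(vec) == len(transitions) + len(emissions)
```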

16
Profile Kernel
  • A simple extension of the k-mer kernel: instead
    of the sequence itself, use k-mer scores from the
    PSSM.
  • Motivation?
  • The PSSM captures a protein's signature, as it is
    built from a multiple alignment and takes into
    account which regions are conserved.
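A sketch of scoring a k-mer against a PSSM (one row of per-residue scores for each sequence position): sum the row scores of the k-mer's residues at its placement. The tiny two-letter PSSM and the default score are made up for illustration; a real profile kernel keeps a k-mer dimension only when this score passes a threshold.

```python
# Hypothetical PSSM: one dict of residue -> score per position.
pssm = [
    {"A": 2.0, "C": -1.0},   # position 0
    {"A": -1.0, "C": 3.0},   # position 1
    {"A": 1.0, "C": 0.0},    # position 2
]

def pssm_score(kmer, start):
    # Sum the PSSM scores of the k-mer placed at `start`;
    # unseen residues get an illustrative default penalty.
    return sum(pssm[start + j].get(aa, -2.0) for j, aa in enumerate(kmer))

# A k-mer matching the conserved residues scores highly.
assert pssm_score("ACA", 0) > pssm_score("CCA", 0)
```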

17
I-sites Kernel
  • Here a library of motifs (commonly occurring
    three-dimensional structure fragments) is used to
    score a sequence.
  • It gives a probability score of how similar each
    segment (k-mer) of the sequence looks to each of
    the motifs.
  • These motifs come from the I-sites library.

18
Comparisons/Performance
19
Questions ?
  • Feel free to email me with questions
  • rangwala@cs.umn.edu
  • Please go through the reading list.