Protein Fold Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Protein Fold Classification
  • Huzefa Rangwala
  • (rangwala@cs.umn.edu)
  • http://www.cs.umn.edu/~rangwala

Some Slides are from a talk given by Prof. Karypis
2
Need to Know About Proteins
  • Primary Structure: Sequence
  • 20 Amino Acids (A, R, N, ...)
  • Secondary Structure
  • Alpha, Beta, Coil
  • Tertiary Structure
  • Three-dimensional atom coordinates
  • Fold Classification
  • Databases such as SCOP and CATH classify folds

3
Protein Family Prediction
  • The goal is to assign a newly identified protein
    to an existing family.
  • The complexity of this problem depends on how
    similar the new protein is to previously
    identified/classified proteins.
  • Popular Approaches
  • BLAST
  • Profile HMMs
  • Discriminative models

4
Classification Approaches
  • Nearest-neighbor classification
  • Local/global sequence alignment, BLAST, etc.
  • Markov Model-based approaches
  • Markov chains, Hidden Markov Models, Profile
    HMMs, etc.
  • Discriminative models
  • Neural networks, Support Vector Machines, etc.
  • Arbitrary combination of the above basic
    approaches.

5
Discriminative Models
  • Within the machine learning community,
    classification algorithms based on discriminative
    models, especially kernel-based support vector
    machines, have become very popular as they
    produce very accurate classifiers.
  • The key idea of these discriminative models is to
    build a model that separates the various classes
  • discriminate between classes as opposed to
    describing/generating each class.
  • In the last 5 years, similar models have been
    developed for sequence classification.
  • The work focused on developing string-based
    kernel functions that can be used within the
    context of support vector machines.

6
Key Difference
  • Question
  • So what is the key difference between a
    discriminative model like SVM and a generative
    model like HMM ?

7
Linear Separators
  • Which of the linear separators is optimal?

8
Classification Margin
  • Distance from example x to the separator is
    r = y(wᵀx + b) / ‖w‖
  • Examples closest to the hyperplane are support
    vectors.
  • Margin ρ of the separator is the width of
    separation between classes.

9
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors: K(xi, xj) = xiᵀxj
  • If every datapoint is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
  • K(xi, xj) = φ(xi)ᵀφ(xj)
  • A kernel function is some function that
    corresponds to an inner product into some feature
    space.
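A minimal sketch of the kernel trick on toy 2-D vectors: the quadratic kernel K(x, z) = (x·z)² computed directly in input space equals an ordinary inner product after the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²). The vectors and function names are illustrative, not from the slides.

```python
import math

def quad_kernel(x, z):
    # Kernel computed directly in input space: (x . z)^2
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # Explicit feature map corresponding to the quadratic kernel.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 0.5)
# Same value either way -- the kernel never builds phi(x) explicitly.
assert abs(quad_kernel(x, z) - dot(phi(x), phi(z))) < 1e-9
```

The point of the trick is the left-hand side: the kernel gives the feature-space inner product without ever materializing the (possibly huge) feature vectors.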

10
Feature representation
  • We apply a transformation function that converts
    sequences or profiles into feature vectors, which
    are fed into an SVM black box for training and
    classification.

11
Pairwise Kernel
  • Each dimension corresponds to a sequence in the
    training set
  • The value of a particular dimension is the
    similarity score against that training sequence

(Diagram: N training sequences → N vectors, each of
dimension N → SVM Learner → model learnt → SVM
Classifier; an unknown test sequence becomes a test
vector and yields the classification result.)
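A sketch of the pairwise feature representation: each sequence becomes an N-dimensional vector of similarity scores against the N training sequences. A toy positional-identity count stands in for the BLAST/alignment score a real system would use; the sequences are made up.

```python
def similarity(a, b):
    # Hypothetical stand-in for an alignment score:
    # number of matching positions (truncated to the shorter sequence).
    return sum(x == y for x, y in zip(a, b))

train = ["ACDEF", "ACDAA", "GGGGG"]  # N = 3 training sequences

def pairwise_vector(seq):
    # One dimension per training sequence.
    return [similarity(seq, t) for t in train]

vec = pairwise_vector("ACDEA")
assert len(vec) == len(train)
```

Training then produces N such vectors (one per training sequence), each of dimension N, exactly as in the diagram above.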
12
How do we classify ?
  • We build a binary classifier for each of the
    classes and then get a predicted output from each
    of the models.
  • The decision can be based on the maximum score,
    or on some weighting of the scores.
  • We can also build a class-against-class
    (one-vs-one) classifier. How?

13
k-mer or Spectrum Kernel
  • k-mer kernel
  • each dimension corresponds to a distinct k-mer
  • each sequence is represented as a frequency
    vector of the various k-mers that it contains.
  • k = 4
  • E.g. ACDAAA… →
  • hash ACDA
  • hash CDAA
  • hash DAAA
  • …
  • We have our vector of _______ features.
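A minimal sketch of the k-mer (spectrum) feature map: count every length-k substring, and take the kernel between two sequences as the dot product of their count vectors. Over the 20-letter amino-acid alphabet the full vector has 20^k dimensions, but only the observed k-mers need to be stored.

```python
from collections import Counter

def kmer_counts(seq, k):
    # Sparse frequency vector: k-mer -> number of occurrences.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k):
    # Dot product of the two (sparse) count vectors.
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[w] * cb[w] for w in ca)

counts = kmer_counts("ACDAAA", 4)  # the slide's example
assert set(counts) == {"ACDA", "CDAA", "DAAA"}
```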

14
Mismatch kernel (k,m)
  • Adds more information to the k-mer kernel
  • extends the k-mer kernel to allow for up to m
    mismatches
  • E.g. (k, m) = (4, 1)
  • ACDAAA… →
  • hash ?CDA, hash A?DA, hash AC?A, hash ACD?
  • hash ?DAA, …
  • hash ?AAA, …
  • …
  • We have our vector of _______ features.
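A sketch of the (k, m) mismatch idea for m = 1: every k-mer contributes not just its own dimension but also the patterns reachable by changing at most one position. Here "?" marks the mismatched position, as on the slide; a real implementation would expand "?" over all 20 amino acids rather than keep a wildcard symbol.

```python
def mismatch_patterns(seq, k):
    # All patterns each k-mer of `seq` contributes with <= 1 mismatch.
    patterns = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        patterns.add(kmer)  # exact match (0 mismatches)
        for j in range(k):
            # One mismatch allowed at position j.
            patterns.add(kmer[:j] + "?" + kmer[j + 1:])
    return patterns

pats = mismatch_patterns("ACDAAA", 4)  # the slide's (4, 1) example
assert {"?CDA", "A?DA", "AC?A", "ACD?"} <= pats
```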

15
HMM kernel / Fisher kernel
  • It is derived from a profile HMM
  • We train an HMM and learn the emission and
    transition probabilities
  • Each of these probabilities become a dimension in
    the feature vector.
  • How will this method compare to the others?
    Any predictions?
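A crude sketch of the slide's simplified description: flatten a trained HMM's transition and emission probabilities into one fixed-order feature vector. (The actual Fisher kernel uses gradients of the log-likelihood with respect to these parameters, not the parameters themselves.) The two-state toy HMM below is made up for illustration.

```python
# Hypothetical parameters of a tiny trained profile HMM.
transitions = {("M1", "M2"): 0.9, ("M1", "I1"): 0.1}
emissions = {("M1", "A"): 0.6, ("M1", "C"): 0.4}

def hmm_feature_vector():
    # Fixed (sorted) parameter order so every sequence maps to the
    # same dimensions: each probability becomes one feature.
    return [p for _, p in sorted(transitions.items())] + \
           [p for _, p in sorted(emissions.items())]

vec = hmm_feature_vector()
assert len(vec) == len(transitions) + len(emissions)
```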

16
Profile Kernel
  • A simple extension of the k-mer kernel: instead
    of the sequence itself, use k-mer scores from the
    PSSM.
  • Motivation?
  • The PSSM captures a protein's signature, as it is
    built from a multiple alignment and takes into
    account which regions are conserved.
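A sketch of scoring a k-mer against a PSSM (one row of per-residue scores for each sequence position): sum the row scores of the k-mer's residues at its placement. The tiny two-letter PSSM and the default score are made up for illustration; a real profile kernel keeps a k-mer dimension only when this score passes a threshold.

```python
# Hypothetical PSSM: one dict of residue -> score per position.
pssm = [
    {"A": 2.0, "C": -1.0},   # position 0
    {"A": -1.0, "C": 3.0},   # position 1
    {"A": 1.0, "C": 0.0},    # position 2
]

def pssm_score(kmer, start):
    # Sum the PSSM scores of the k-mer placed at `start`;
    # unseen residues get an illustrative default penalty.
    return sum(pssm[start + j].get(aa, -2.0) for j, aa in enumerate(kmer))

# A k-mer matching the conserved residues scores highly.
assert pssm_score("ACA", 0) > pssm_score("CCA", 0)
```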

17
I-sites Kernel
  • Here a library of motifs (commonly occurring
    three-dimensional structure fragments) is used to
    score a sequence.
  • It gives a probability score of how similar each
    segment (k-mer) of the sequence looks to each of
    the motifs.
  • These motifs come from the I-sites library.

18
Comparisons/Performance
19
Questions ?
  • Feel free to email me with questions
  • rangwala@cs.umn.edu
  • Please go through the reading list.