Support Vector Machine and String Kernels for Protein Classification - PowerPoint PPT Presentation

1 / 36

About This Presentation

Title:

Support Vector Machine and String Kernels for Protein Classification

Description:

Support Vector Machine and String Kernels for Protein Classification Christina Leslie Department of Computer Science Columbia University Learning Sequence-based ... – PowerPoint PPT presentation

Number of Views:156

Avg rating:3.0/5.0

Slides: 37

Provided by: Christina252

Category:

more less

Transcript and Presenter's Notes

Title: Support Vector Machine and String Kernels for Protein Classification

1
Support Vector Machine and String Kernels for
Protein Classification

Christina Leslie

Department of Computer Science Columbia University
2
Learning Sequence-based Protein Classification

Problem classification of protein sequence data
into families and superfamilies
Motivation Many proteins have been sequenced,
but often structure/function remains unknown
Motivation infer structure/function from
sequence-based classification

3
Sequence Data versus Structure and Function
Sequences for four chains of human hemoglobin
Tertiary Structure
gt1A3NA HEMOGLOBIN VLSPADKTNVKAAWGKVGAHAGEYGAEALE
RMFLSFPTTKTYFPHFDLSHGSAQVKGHGK KVADALTNAVAHVDDMPNA
LSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA VHASLDKF
LASVSTVLTSKYR gt1A3NB HEMOGLOBIN VHLTPEEKSAVTALWG
KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV KAHGK
KVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLA
HHFGK EFTPPVQAAYQKVVAGVANALAHKYH gt1A3NC
HEMOGLOBIN VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
KTYFPHFDLSHGSAQVKGHGK KVADALTNAVAHVDDMPNALSALSDLHA
HKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA VHASLDKFLASVSTVLT
SKYR gt1A3ND HEMOGLOBIN VHLTPEEKSAVTALWGKVNVDEVGGE
ALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV KAHGKKVLGAFSDGL
AHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK EFTP
PVQAAYQKVVAGVANALAHKYH
Function oxygen transport
4
Structural Hierarchy

SCOP Structural Classification of Proteins
Interested in superfamily-level homology remote
evolutionary relationship

5
Learning Problem

Reduce to binary classification problem positive
() if example belongs to a family (e.g. G
proteins) or superfamily (e.g. nucleoside
triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection
Use supervised learning approach to train a
classifier

Labeled Training Sequences
Classification Rule
Learning Algorithm
6
Two supervised learning approaches to
classification

Generative model approach
Build a generative model for a single protein
family classify each candidate sequence based on
its fit to the model
Only uses positive training sequences
Discriminative approach
Learning algorithm tries to learn decision
boundary between positive and negative examples
Uses both positive and negative training
sequences

7
Hidden Markov Models for Protein Families

Standard generative model profile HMM
Training data multiple alignment of examples
from family
Columns of alignment determine model topology

7LES_DROME LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKS
LRK....... ABL1_CAEEL IIMHNKLGGGQYGDVYEGYWK.....
...RHDCTIAVKALK........ BFR2_HUMAN
LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD.....A
TRKA_HUMAN IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVA
VKALK........
??
??
8
Profile HMMs for Protein Families

Match, insert and delete states
Observed variables symbol sequence, x1 .. xL
Hidden variables state sequence, ?1 .. ?L
Parameters transition and emission probabilities
Joint probability P(x, ? ?)

9
HMMs Pros and Cons

Ladies and gentlemen, boys and girls

Let us leave something for next week

10
Discriminative Learning

Discriminative approach
Train on both positive and negative
examples to learn classifier

Modern computational learning theory
Goal learn a classifier that generalizes well
to new examples
Do not use training data to estimate
parameters of probability distribution
curse of dimensionality

11
Learning Theoretic Formalism for Classification
Problem

Training and test data drawn i.i.d. from fixed
but unknown probability distribution D on
X ? -1,1
Labeled training set
S (x1, y1), , (xm, ym)

12
Support Vector Machines (SVMs)

We use SVM as discriminative learning algorithm

Training examples mapped to
(usually high-dimensional)
feature space by a feature
map F(x) (F1(x), , Fd(x))

_
_

_

Learn linear decision boundary
Trade-off between maximizing
geometric margin of the training
data and minimizing margin violations

_
_
_
13
SVM Classifiers

Linear classifier defined in feature space by
f(x) ltw,xgt b
SVM solution gives
w ? ?i xi
as a linear combination of support vectors, a
subset of the training vectors

w
_
_
b

_
_
_
_
14
Advantages of SVMs

Large margin classifier leads to good
generalization (performance on test sets)
Sparse classifier depends only on support
vectors, leads to fast classification, good
generalization
Kernel method as well see, we can introduce
sequence-based kernel functions for use with SVMs

15
Hard Margin SVM

Assume training data linearly separable in
feature space
Space of linear classifiers
fw,b(x) ?w, x? b
giving decision rule
hw,b(x) sign(fw,b(x))
If w 1, geometric margin of training data for
hw,b
?S MinS yi (?w, xi? b)

w

_
b
_
_
_
_
_
16
Hard Margin Optimization

Hard margin SVM optimization given training data
S, find linear classifier hw,b with maximal
geometric margin ?S
Convex quadratic dual optimization problem
Sparse classifier in term of support vectors

_
_
_
_
_
_
17
Hard Margin Generalization Error Bounds

Theorem Cristianini, Shawe-Taylor Fix a real
value M gt 0. For any probability distribution D
on X ? -1,1 with support in a ball of radius R
around the origin, with probability 1-? over m
random samples S, any linear hypothesis h with
geometric margin
?S ? M on S
has error no more than
ErrD(h) ? ?(m, ?, M, R)
provided that m is big enough

18
SVMs for Protein Classification

Want to define feature map from space of protein
sequences to vector space
Goals
Computational efficiency
Competitive performance with known methods
No reliance on generative model general method
for sequence-based classification problems

19
Spectrum Feature Map for SVM Protein
Classification

New feature map based on
spectrum of a sequence

C. Leslie, E. Eskin, and W. Noble, The Spectrum
Kernel
A String Kernel for SVM Protein Classification.
Pacific Symposium on Biocomputing, 2002.
C. Leslie, E. Eskin, J. Weston and W. Noble,
Mismatch String Kernels for SVM Protein
Classification.
NIPS 2002.

20
The k-Spectrum of a Sequence
AKQDYYYYEI

Feature map for SVM based on spectrum of a
sequence
The k-spectrum of a sequence is the set of all
k-length contiguous subsequences that it contains
Feature map is indexed by all possible k-length
subsequences
(k-mers) from the alphabet of
amino acids
Dimension of feature space 20k
Generalizes to any sequence data

AKQ KQD QDY DYY YYY YYY YYE
YEI
21
k-Spectrum Feature Map

Feature map for k-spectrum with no mismatches
For sequence x, F(k)(x) (Ft (x))k-mers t,
where Ft (x) occurrences of t in x

AKQDYYYYEI
( 0 , 0 , , 1 , , 1 , , 2 ) AAA AAC
AKQ DYY YYY
22
(k,m)-Mismatch Feature Map

Feature map for k-spectrum, allowing m
mismatches
if s is a k-mer, F(k,m)(s) (Ft(s))k-mers t,
where Ft(s) 1 if s is within m mismatches from
t, 0 otherwise
extend additively to longer sequences x by
summing over all k-mers s in x

AKQ

DKQ
AKY

EKQ
AAQ
23
The Kernel Trick

To train an SVM, can use kernel rather than
explicit feature map
For sequences x, y, feature map F, kernel value
is inner product in feature space
K(x, y) ? F(x), F(y) ?
Gives sequence similarity score
Example of a string kernel
Can be efficiently computed via traversal of trie
data structure

24
Computing the (k,m)-Spectrum Kernel

Use trie (retrieval tree) to organize lexical
traversal of all instances of k-length patterns
(with mismatches) in the training data
Each path down to a leaf in the trie corresponds
to a coordinate in feature map
Kernel values for all training sequences updated
at each leaf node
If m0, traversal time for trie is linear in size
of training data
Traversal time grows exponentially with m, but
usually small values of m are useful
Depth-first traversal makes efficient use of
memory

25
Example Traversing the Mismatch Tree

Traversal for input sequence AVLALKAVLL, k8, m1

26
Example Traversing the Mismatch Tree

Traversal for input sequence AVLALKAVLL, k8, m1

27
Example Traversing the Mismatch Tree

Traversal for input sequence AVLALKAVLL, k8, m1

28
Example Computing the Kernel for Pair of
Sequences

Traversal of trie for k3 (m0)

A
EADLALGKAVF
S1
S2
ADLALGADQVFNG
29
Example Computing the Kernel for Pair of
Sequences

Traversal of trie for k3 (m0)

A
EADLALGKAVF
S1
D
S2
ADLALGADQVFNG
30
Example Computing the Kernel for Pair of
Sequences

Traversal of trie for k3 (m0)

A
EADLALGKAVF
s1
D
s2
ADLALGADQVFNG
L
Update kernel value for K(s1,s2) by adding
contribution for feature ADL
31
Fast prediction

SVM training determines subset of training
sequences corresponding to support vectors and
their weights
(xi, ?i), i 1 .. r
Prediction with no mismatches
Represent SVM classifier by hash table mapping
support k-mers to weights
Test sequences can be classified in linear time
via look-up of k-mers
Prediction with mismatches
Represent classifier as sparse trie traverse
k-mer paths occurring with mismatches in test
sequence

32
Experimental Design

Tested with set of experiments on SCOP dataset
Experiments designed to ask Could the method
discover a new family of a known superfamily?

Diagram from Jaakkola et al.
33
Experiments

160 experiments for 33 target families from 16
superfamilies
Compared results against
SVM-Fisher
SAM-T98 (HMM-based method)
PSI-BLAST (heuristic alignment-based method)

34
Conclusions for SCOP Experiments

Spectrum Kernel with SVM performs as well as the
best-known method for remote homology detection
problem
Efficient computation of string kernel
Fast prediction
Can precompute per k-mer scores and represent
classifier as a lookup table
Gives linear time prdiction for both spectrum
kernel, (unnormalized) mismatch kernel
General approach to classification problems for
sequence data

35
Feature Selection Strategies

Explicit feature filtering
Compute score for each k-mer, based on training
data statistics, during trie traversal and filter
as we compute kernel
Feature elimination as a wrapper for SVM training
Eliminate features corresponding to small
components wi in vector w defining SVM classifier
Kernel principal component analysis
Project to principal components prior to training

36
Ongoing and Future Work

New families of string kernels, mismatching
schemes
Applications to other sequence-based
classification problems, e.g. splice site
prediction
Feature selection
Explicit and implicit dimension reduction
Other machine learning approaches to using sparse
string-based models for classification
Boosting with string-based classifiers