Protein Homology Detection Using String Alignment Kernels - PowerPoint PPT Presentation

Slides: 26
Provided by: christin268
1
Protein Homology Detection Using String
Alignment Kernels
  • Jean-Philippe Vert, Tatsuya Akutsu

2
Learning Sequence Based Protein Classification
  • Problem: classification of protein sequence data
    into families and superfamilies
  • Motivation: many proteins have been sequenced,
    but often structure/function remains unknown
  • Motivation: infer structure/function from
    sequence-based classification

3
Sequence Data Versus Structure and Function
Sequences for four chains of human hemoglobin:
>1A3NA HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3NB HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3NC HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3ND HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Tertiary Structure
Function: oxygen transport
4
Structural Hierarchy
  • SCOP: Structural Classification of Proteins
  • Interested in superfamily-level homology: remote
    evolutionary relationship

Difficult!!
5
Learning Problem
  • Reduce to binary classification problem: positive
    (+) if example belongs to a family (e.g. G
    proteins) or superfamily (e.g. nucleoside
    triphosphate hydrolases), negative (-) otherwise
  • Focus on remote homology detection
  • Use supervised learning approach to train a
    classifier

Labeled Training Sequences → Learning Algorithm → Classification Rule
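The reduction to a binary problem described on this slide can be sketched as follows. This is a minimal illustration, not part of the original presentation: the function name `binary_labels` and the SCOP-style identifiers are hypothetical, and a sequence is assumed to belong to a superfamily when its identifier extends the superfamily identifier.

```python
def binary_labels(class_ids, target):
    """Return +1 for sequences inside the target class and -1 for all others."""
    return [
        1 if cid == target or cid.startswith(target + ".") else -1
        for cid in class_ids
    ]


# Illustrative identifiers only (assumed, not taken from the slides):
# three sequences inside one superfamily and one outside it.
class_ids = ["c.37.1.8", "c.37.1.10", "a.1.1.2", "c.37.1.8"]

# Superfamily-level task: everything under "c.37.1" is positive.
labels = binary_labels(class_ids, "c.37.1")

# Family-level task: only the exact family is positive.
family_labels = binary_labels(class_ids, "c.37.1.8")
```

The same labeled list can then be passed to any supervised learner together with the training sequences.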
6
Two supervised learning approaches to
classification
  • Generative model approach
      • Build a generative model for a single protein
        family; classify each candidate sequence based on
        its fit to the model
      • Only uses positive training sequences
  • Discriminative approach
      • Learning algorithm tries to learn decision
        boundary between positive and negative examples
      • Uses both positive and negative training
        sequences

7
Targets of the current methods
8
Discriminative Learning
  • Discriminative approach
      • Train on both positive and negative
        examples to learn classifier
  • Modern computational learning theory
      • Goal: learn a classifier that generalizes well
        to new examples
      • Do not use training data to estimate
        parameters of probability distribution
        ("curse of dimensionality")

9
SVM for protein classification
  • Want to define feature map from space of protein
    sequences to vector space
  • Goals
      • Computational efficiency
      • Competitive performance with known methods
      • No reliance on generative model: general method
        for sequence-based classification problems

10
Summary of the current kernel methods
  • Feature vector from HMM
      • Fisher kernel (Jaakkola et al., 2000)
      • Marginalized kernel (Tsuda et al., 2002)
  • Feature vector from sequence
      • Spectrum kernel (Leslie et al., 2002)
      • Mismatch kernel (Leslie et al., 2003)
  • Feature vector from other score
      • SVM pairwise (Liao & Noble, 2002)
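As an illustration of the "feature vector from sequence" family, a spectrum-style kernel can be sketched as an inner product of k-mer count vectors. This is a minimal sketch of the general idea only: the function names, the default `k=3`, and the short example strings are assumptions, and the published spectrum kernel implementation may differ in details such as normalization.

```python
from collections import Counter


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in the sequence (sparse feature vector)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the two k-mer count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Short prefixes of two hemoglobin chains, used here only as toy input.
value = spectrum_kernel("VLSPADKTNVKAAWGKV", "VHLTPEEKSAVTALWGKV", k=3)
```

Because the kernel is defined as an explicit inner product, it is positive definite by construction, which is the property the SW score lacks.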

11
String Alignment Kernels
  • Observation: the Smith-Waterman (SW) alignment
    score provides a measure of similarity that
    incorporates biological knowledge about protein
    evolution.
  • It cannot be used as a kernel because it is not
    positive definite.
  • A family of local alignment (LA) kernels that
    mimic the SW score is presented.

12
LA Kernels
  • Other kernels: choose feature vector representation,
    then get kernel by inner product of vectors
  • LA kernel: measure similarity, then get valid kernel
13
LA Kernels
  • Pair score kernel K_a^(β)(x, y)
  • Gap kernel K_g^(β)(x, y) for the affine gap penalty model

where β > 0, s is a symmetric similarity score,
d is the gap opening cost and e is the gap extension cost.
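The formulas for this slide did not survive in the transcript. A plausible reconstruction, based on the symbols named above, is given below. The sign convention is an assumption: d and e are taken as non-negative costs, so the gap kernel decreases as gaps grow.

```latex
K_a^{(\beta)}(x, y) =
\begin{cases}
0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\
\exp\bigl(\beta\, s(x, y)\bigr) & \text{otherwise,}
\end{cases}
\qquad
K_g^{(\beta)}(x, y) = \exp\Bigl(-\beta \bigl[\, g(|x|) + g(|y|) \,\bigr]\Bigr),

g(0) = 0, \qquad g(n) = d + e\,(n - 1) \quad \text{for } n \geq 1 .
```

The pair score kernel is non-zero only when both arguments are single residues, and the gap kernel depends only on the lengths of the two skipped substrings.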
14
LA Kernels
  • Kernel convolution
  • For n ≥ 1, the string kernel can be expressed as
    an initial part K_0, a succession of n aligned
    residues K_a^(β) separated by n-1 possible gaps
    K_g^(β), and a terminal part K_0, where K_0 = 1.
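The convolution formula itself is missing from the transcript. Assuming the standard definition of the convolution of two string kernels (a sum over all ways of splitting each string into two consecutive parts), the description above corresponds to:

```latex
(K_1 \star K_2)(x, y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1, y_1)\, K_2(x_2, y_2),

K_{(n)}^{(\beta)} = K_0 \star \bigl( K_a^{(\beta)} \star K_g^{(\beta)} \bigr)^{(n-1)} \star K_a^{(\beta)} \star K_0 ,
\qquad K_0(x, y) = 1 .
```

Here the exponent (n-1) denotes an (n-1)-fold convolution, so each of the first n-1 aligned residue pairs is followed by one possible gap.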
15
LA Kernels
The LA kernel, the sum of these kernels over all n,
converges for any x and y because it has only a
finite number of non-null terms. It is a
point-wise limit of Mercer kernels.
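The sum referred to on this slide is missing from the transcript. A reconstruction under the notation of the previous slides, with the zero-length term taken to be K_0 (an assumption), is:

```latex
K_{LA}^{(\beta)}(x, y) = \sum_{n=0}^{\infty} K_{(n)}^{(\beta)}(x, y),
\qquad K_{(0)}^{(\beta)} = K_0 ,

K_{(n)}^{(\beta)}(x, y) = 0 \quad \text{whenever } n > \min(|x|, |y|) .
```

The second line is why only finitely many terms are non-null: an alignment cannot contain more aligned pairs than the shorter sequence has residues.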
16
LA with SW score
  • π: local alignment
  • p(x, y, π): score of local alignment π over x, y
  • Π: set of all possible local alignments over x, y
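The relations between the LA kernel and the SW score are not preserved in the transcript. With the symbols defined on this slide they can plausibly be written as follows (a reconstruction, not a copy of the slide):

```latex
SW(x, y) = \max_{\pi \in \Pi} \; p(x, y, \pi),
\qquad
K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi} \exp\bigl( \beta\, p(x, y, \pi) \bigr),

\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x, y) = SW(x, y) .
```

In this form the SW score appears as the large-β limit of a log-transformed LA kernel: the sum over all alignments is dominated by the best one.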

17
Why SW cannot be a kernel
  • 1. SW only keeps the best alignment instead of
    summing over all alignments of x, y.
  • 2. The logarithm can destroy the property of being
    positive definite.
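The first point can be illustrated numerically. The sketch below assumes a hypothetical list of alignment scores for one pair of sequences and compares the maximum (what SW keeps) with the log of the sum of exponentials (what a sum over alignments gives after taking the logarithm). The function name and the scores are made up for illustration.

```python
import math


def log_sum_score(scores, beta):
    """(1/beta) * log(sum(exp(beta * s))), computed in a numerically stable way."""
    top = max(scores)
    total = sum(math.exp(beta * (s - top)) for s in scores)
    return top + math.log(total) / beta


# Hypothetical scores of three candidate local alignments of the same pair.
alignment_scores = [5.0, 3.0, 1.0]

best_only = max(alignment_scores)                        # what SW reports
soft_small_beta = log_sum_score(alignment_scores, 0.1)   # all alignments contribute
soft_large_beta = log_sum_score(alignment_scores, 50.0)  # dominated by the best one
```

For small β every alignment contributes noticeably, while for large β the value approaches the best alignment score.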

18
Example
LA Kernel
SW score
19
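SVM-pairwise vs. LA kernel
  • SVM-pairwise: x and y are each mapped to a vector
    of SW scores, (0.9, 0.05, 0.3, 0.2) and
    (0.2, 0.3, 0.1, 0.01); the kernel is their
    inner product, 0.227
  • LA kernel: x and y are compared directly through
    a pair HMM, giving 0.253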
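The SVM-pairwise value on this slide can be checked directly. The sketch below assumes the two four-component vectors are the score vectors of x and y; the helper name `inner_product` is introduced here for illustration.

```python
def inner_product(u, v):
    """Plain dot product of two equally long score vectors."""
    return sum(a * b for a, b in zip(u, v))


# Score vectors as given on the slide.
scores_a = (0.9, 0.05, 0.3, 0.2)
scores_b = (0.2, 0.3, 0.1, 0.01)

print(round(inner_product(scores_a, scores_b), 3))  # 0.227
```

Each component of such a vector is a similarity score against one reference sequence, so the kernel compares how two proteins relate to the same set of references rather than comparing them directly.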
20
Diagonal Dominance Issue
  • K(x,x) is easily orders of magnitude larger than
    K(x,y), even for similar sequences x and y,
    which degrades the performance of the SVM.

21
Diagonal Dominance Issue
(1) The eigen kernel LA-eig
  a. Obtained by subtracting from the diagonal the
     smallest negative eigenvalue of the training
     Gram matrix, if there are negative eigenvalues.
  b. LA-eig is equal to the original kernel except
     possibly on the diagonal.
(2) The empirical kernel map LA-ekm
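Both repairs can be sketched on a small Gram matrix. This is an illustrative sketch in plain Python, not the authors' code: the function names are hypothetical, the smallest eigenvalue is estimated with a simple power iteration on a shifted matrix (a numerical library would normally be used instead), and the empirical kernel map is taken to be the inner product of rows of the training Gram matrix.

```python
import math


def smallest_eigenvalue(K, iters=1000):
    """Estimate the smallest eigenvalue of a symmetric matrix by power iteration."""
    n = len(K)
    # Gershgorin bound: no eigenvalue of K exceeds c, so c*I - K is
    # positive semidefinite and its top eigenvector is K's bottom eigenvector.
    c = max(sum(abs(v) for v in row) for row in K)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary, non-symmetric start
    for _ in range(iters):
        w = [c * v[i] - sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:  # degenerate start vector; stop iterating
            break
        v = [t / norm for t in w]
    norm_v = math.sqrt(sum(t * t for t in v))
    v = [t / norm_v for t in v]
    Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * Kv[i] for i in range(n))  # Rayleigh quotient


def shift_diagonal(K):
    """LA-eig style fix: subtract the smallest negative eigenvalue from the diagonal."""
    lam = smallest_eigenvalue(K)
    shift = lam if lam < 0 else 0.0
    return [[K[i][j] - (shift if i == j else 0.0) for j in range(len(K))]
            for i in range(len(K))]


def empirical_kernel_map(K):
    """LA-ekm style fix: entry (i, j) is the inner product of rows i and j of K."""
    n = len(K)
    return [[sum(K[i][t] * K[j][t] for t in range(n)) for j in range(n)]
            for i in range(n)]


# Toy symmetric matrix with eigenvalues 3 and -1, so it is not positive definite.
gram = [[1.0, 2.0], [2.0, 1.0]]
repaired = shift_diagonal(gram)
mapped = empirical_kernel_map(gram)
```

The diagonal shift leaves all off-diagonal entries untouched, whereas the empirical kernel map changes every entry but is positive semidefinite by construction.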
22
Methods
  • Implementation
      • The kernel (and therefore its logarithm) can be
        computed with complexity O(|x| |y|) by dynamic
        programming, using a slight modification of
        the SW algorithm.
  • Normalization
  • Dataset
      • 4352 sequences extracted from the Astral database
        (www.cs.columbia.edu/compbio/svmpairwise),
        grouped into families and superfamilies.
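The dynamic programming computation mentioned under Implementation can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `la_kernel` and `toy_score`, the match/mismatch values, and the default parameters are illustrative, and gap costs are entered as non-negative numbers. `M` sums over alignments ending in an aligned pair, `X` over those followed by a gap in x, and `Y` over those followed by a gap in y (possibly after a gap in x).

```python
import math


def toy_score(a, b):
    """Illustrative substitution score; a real run would use a matrix such as BLOSUM."""
    return 2.0 if a == b else -1.0


def la_kernel(x, y, score=toy_score, beta=0.5, gap_open=3.0, gap_extend=1.0):
    """Sum of exp(beta * score) over all local alignments of x and y, in O(|x||y|)."""
    n, m = len(x), len(y)
    open_w = math.exp(-beta * gap_open)     # weight of a gap of length 1
    ext_w = math.exp(-beta * gap_extend)    # weight of each extra gap position
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y = [[0.0] * (m + 1) for _ in range(n + 1)]
    total = 1.0  # contribution of the empty alignment
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair_w = math.exp(beta * score(x[i - 1], y[j - 1]))
            # Align x[i-1] with y[j-1], either starting a new alignment (the 1.0)
            # or extending one that ended at or before position (i-1, j-1).
            M[i][j] = pair_w * (1.0 + X[i - 1][j - 1] + Y[i - 1][j - 1] + M[i - 1][j - 1])
            # Open or extend a gap that skips residues of x.
            X[i][j] = open_w * M[i - 1][j] + ext_w * X[i - 1][j]
            # Open or extend a gap that skips residues of y.
            Y[i][j] = open_w * (M[i][j - 1] + X[i][j - 1]) + ext_w * Y[i][j - 1]
            total += M[i][j]  # the alignment may end at any aligned pair
    return total


value = la_kernel("VLSPADKTNVKAAWGKV", "VHLTPEEKSAVTALWGKV")
```

Compared with the SW recursion, the only structural change is that sums replace maxima, which is why the cost stays proportional to the product of the two sequence lengths.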

23
ROC Curve
24
ROC Curve
25
Summary for the kernels