Title: Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization
- Man-Wai MAK and Wei WANG
- The Hong Kong Polytechnic University
- Sun-Yuan KUNG
- Princeton University
2Contents
- Introduction
- Cell Organelles and Protein Subcellular Localization
- Signal-Based vs. Homology-Based Methods
- Speeding Up the Prediction Process
- Predicting Cleavage Site Location
- Truncating Profiles vs. Truncating Sequences
- Perturbational Discriminant Analysis
- Experiments and Results
- Conclusions
3Organelles
- Cells have a set of organelles that are specialized for carrying out one or more vital functions.
- Proteins must be transported to the correct organelles of a cell to properly perform their functions.
- Therefore, knowing the subcellular localization is one step towards understanding the functions of proteins.
4Proteins and Their Subcellular Location
5Subcellular Localization Prediction
- Two key methods
- Signal-based
- Homology-based
6Signal-Based Method
[Figure: cleavage site. Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]
- The amino acid sequence of a protein contains information about its organelle destination.
- Typically, the information can be found within a short segment of 20 to 100 amino acids preceding the cleavage site.
- Signal-based methods (e.g. TargetP) can determine the cleavage site location.
7Homology-Based Method
[Figure: a full-length query sequence is aligned with each of the N full-length training sequences S(1), S(2), ..., S(N) to form an N-dim alignment vector, which an SVM classifier maps to a subcellular location.]
- Advantage: Can predict the localization of sequences that do not have cleavage sites.
- Drawback: A query sequence must be aligned with every training sequence in the training set, which takes a long computation time.
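The alignment-vector idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scores raw sequences with a simple Smith-Waterman local alignment instead of PSI-BLAST profiles, and the function names, the scoring parameters (`match`, `mismatch`, `gap`), and the toy sequences are all assumptions made for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are hypothetical placeholders, not a real
    substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def alignment_vector(query, training_seqs):
    """Align the query with each training sequence: one score per sequence."""
    return [local_alignment_score(query, s) for s in training_seqs]


# Toy training set (hypothetical sequences).
training = ["MKNKA", "MKAKN", "MKGLL"]
vec = alignment_vector("MKNKA", training)  # N = 3 scores, one per training sequence
```

Each entry of `vec` costs on the order of `len(query) * len(training_sequence)` operations, and there are N entries per query, which is the drawback named above. In the full method the vector would be passed to a trained classifier such as an SVM.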
8Sequences Length Distribution
[Figure: length distributions (occurrences vs. sequence length) of Ext, Mit, and Chl sequences, with the cleavage sites of SP, mTP, and cTP marked.]
- Many sequences are fairly long, so aligning whole sequences takes a long computation time.
- cTP, mTP, and SP are all shorter than 100 amino acids (AAs) and contain the most relevant segment.
- Computation can be saved by aligning the signal segments only.
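A rough way to see why aligning only the signal segments saves computation: pairwise alignment by dynamic programming fills one table cell per residue pair, so the cost grows with the product of the two sequence lengths. The sketch below counts those cells; the sequence lengths and training-set size are hypothetical numbers chosen only to show the scaling, not values from the experiments.

```python
def dp_cells(query_len, training_lens):
    """Number of dynamic-programming cells filled when one query is aligned
    against every training sequence (one cell per residue pair)."""
    return sum(query_len * n for n in training_lens)


# Hypothetical numbers, chosen only to illustrate the scaling.
n_train = 1000
full_len, segment_len = 400, 40

full_cost = dp_cells(full_len, [full_len] * n_train)
truncated_cost = dp_cells(segment_len, [segment_len] * n_train)
print(full_cost // truncated_cost)  # 100
```

Shortening both the query and the training sequences by a factor of 10 cuts the cell count by a factor of 100. The count ignores profile creation and other fixed overheads, so measured speed-ups will differ.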
9Proposed Method: Aligning the Segments that Contain the Most Relevant Info.
[Figure: an amino acid sequence (N-terminus to C-terminus) is passed to a signal-based cleavage site predictor (e.g. TargetP); the sequence is truncated at the cleavage site, and the truncated sequence is fed to the homology-based method to predict the subcellular location.]
10Aligning Profiles Vs. Aligning Sequences
- Scheme I: Truncate the profiles
- Scheme II: Truncate the sequences
[Figure: both schemes start from the query sequence.]
11Perturbational Discriminant Analysis
Input and Hilbert Spaces
[Figure: input space, Hilbert space, and empirical space.]
12Perturbational Discriminant Analysis
- The objective of PDA is to find an optimal discriminant function in the Hilbert space or the empirical space.
- The optimal solution in the empirical space (see the derivation in the paper) is
[Equation: optimal solution in the empirical space.]
- ρ represents the noise (uncertainty) level in the measurement. It also ensures numerical stability of the matrix inverse.
- ρ = 1 in this work.
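The stabilizing role of the regularization parameter can be illustrated with a small sketch. The formula on the slide is not available in this text, so the code assumes a ridge-style linear system, (K + rho*I) a = y, with no bias term; this is one common way such a parameter enters a kernel-matrix inverse and may differ from the paper's exact solution. The function names and the toy matrix are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def regularized_weights(K, y, rho=1.0):
    """Assumed ridge-style solution: solve (K + rho * I) a = y."""
    n = len(K)
    A = [[K[i][j] + (rho if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, y)


# Toy kernel matrix with two identical rows, so K itself is singular.
K = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 1.0, -1.0]

a = regularized_weights(K, y, rho=1.0)  # solvable once rho is added to the diagonal
```

With `rho=0.0` the same call raises `ValueError` because the two identical rows make K singular; adding a positive value to the diagonal removes the problem.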
13Perturbational Discriminant Analysis
Example on 2-D Data
3 classes of 2-dim data in the input space
RBF kernel matrix K
Decision boundaries in the input space
Projection onto the 2-dim PDA space
14Perturbational Discriminant Analysis
Application to Sequence Classification
[Figure, training: Training sequences -> PSI-BLAST -> Training profiles -> Pairwise alignment -> K -> Compute PDA parameters]
[Figure, testing: Test sequence -> PSI-BLAST -> Test profile -> Align with training profiles -> Compute PDA score]
15Perturbational Discriminant Analysis
Application to Multi-Class Problems
1-vs-Rest PDA Classifier
MAXNET
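A one-vs-rest classifier followed by a MAXNET can be read as "compute one discriminant score per class, then pick the class with the largest score". The sketch below shows only that selection step; the function name and the score values are hypothetical, and the class labels are borrowed from the categories used elsewhere in these slides.

```python
def maxnet(scores):
    """Return the class whose one-vs-rest discriminant score is largest."""
    if not scores:
        raise ValueError("no class scores given")
    return max(scores, key=scores.get)


# Hypothetical scores from four one-vs-rest discriminant functions.
scores = {"Cyt/Nuc": -0.4, "Ext": 1.3, "Mit": 0.2, "Chl": -1.1}
label = maxnet(scores)
```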
16Perturbational Discriminant Analysis
Application to Multi-Class Problems
Cascaded PDA-SVM Classifier
[Figure: Test sequence -> Project onto (C-1)-dim PDA space -> 1-vs-rest SVM classifier -> Class label]
17Experiments
Materials
- Eukaryotic sequences extracted from Swiss-Prot 57.5
- Ext, Mit, and Chl contain experimentally determined cleavage sites
- 25% sequence identity (based on BLASTclust)
Performance Evaluation
- 5-Fold cross validation
- Prediction accuracy and Matthews correlation
coefficient (MCC)
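The two evaluation measures can be sketched as below. This is a minimal illustration under stated assumptions: the MCC is computed per class from one-vs-rest counts, a zero denominator is mapped to 0.0 (a common convention, not something stated in the slides), and the label lists are made-up toy data.

```python
import math


def one_vs_rest_counts(y_true, y_pred, label):
    """Confusion counts (tp, tn, fp, fn) for `label` against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (hypothetical).
y_true = ["Ext", "Ext", "Mit", "Mit", "Chl", "Chl"]
y_pred = ["Ext", "Mit", "Mit", "Mit", "Chl", "Ext"]

acc = accuracy(y_true, y_pred)
mcc_mit = mcc(*one_vs_rest_counts(y_true, y_pred, "Mit"))
```

In a 5-fold cross-validation these measures would be computed on the held-out fold of each split.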
18Comparing Kernel Matrices
Kernel matrix (Scheme I)
Query Sequence
Kernel matrix (Scheme II)
19Sensitivity Analysis
[Figure: Seq. -> Cut seq. at p + x (p: ground-truth cleavage site) -> Subcellular localization (PairProSVM) -> Subcellular location]
- The localization performance degrades when the cut-off position drifts away from the ground-truth cleavage site.
- mTP and cTP are more sensitive to errors in cleavage site prediction than Ext.
[Figure: subcellular localization accuracy (%) vs. cut-off position (e.g. p-16, p-8, p-2, p, p+2, p+16, p+32, p+64, where p is the ground-truth cleavage site) for Cyt/Nuc, Ext, Mit, Chl, and Overall.]
20Performance of Cleavage Site Prediction
- Conditional Random Field (CRF) is better than TargetP(Plant) at predicting the cleavage sites of signal peptides (Ext) but worse than TargetP(NonPlant).
- CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondria, but it is significantly better than TargetP in predicting the cleavage sites of chloroplasts.
[Figure: cleavage site prediction accuracy (%) by category for TargetP(NonPlant), TargetP(Plant), and CRF.]
21Comparing Profile Creation Time
[Figure, Scheme I: Query sequence -> PSI-BLAST -> Long profile -> Cut -> Short profile -> Pairwise alignment -> Score vector -> SVM or KPDA -> Subcellular location]
[Figure, Scheme II: Query sequence -> Cut -> Short sequence -> PSI-BLAST -> Short profile -> Pairwise alignment -> Score vector -> SVM or KPDA -> Subcellular location]
Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.
22Training and Classification Time
1-vs-rest SVM Classifier
Project onto (C-1)-dim PDA space
Findings: The training times of 1-vs-rest PDA and cascaded PDA-SVM are substantially shorter than that of SVM.
23Compare with State-of-the-Art Localization
Predictors
[Table labels: MCC; Localization Accuracy (%); Conditional Random Fields]
Findings: In terms of localization accuracy, the proposed Signal+Homology method performs slightly better than the signal-based TargetP and substantially better than the homology-based SubLoc.
24Conclusion
- Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods.
- As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or truncate the profiles. However, truncating the sequences reduces the profile creation time sixfold.
25Compare with State-of-the-Art Localization
Predictors
26Performance of Cascaded Fusion
- The computation time for full-length profile alignment is a striking 116 hours.
- Our method not only leads to a nearly 20-fold reduction in computation time but also boosts the prediction performance.
[Figure: computation time (hr.) and subcellular localization accuracy (%) for full-length sequences and for sequences with cleavage sites predicted by TargetP(P), TargetP(N), and CRF.]
27Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.
2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting from the N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.
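The three steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not the authors' code: the cleavage site is taken to be the 1-based index of the last pre-sequence residue, a query without a detected site is kept at full length, and the detector, alignment score, and classifier are toy stand-ins passed in as functions (all names here are hypothetical).

```python
from typing import Callable, List, Optional, Sequence


def select_presequence(sequence: str, cleavage_site: Optional[int]) -> str:
    """Step 2: keep residues 1..cleavage_site, counted from the N-terminus.

    Assumption: if no cleavage site was detected (None), keep the full sequence.
    """
    if cleavage_site is None:
        return sequence
    if not 1 <= cleavage_site <= len(sequence):
        raise ValueError("cleavage site lies outside the sequence")
    return sequence[:cleavage_site]


def predict_location(
    query: str,
    training_presequences: Sequence[str],
    detect_site: Callable[[str], Optional[int]],
    align: Callable[[str, str], float],
    classify: Callable[[List[float]], int],
) -> int:
    site = detect_site(query)                                 # step 1
    pre = select_presequence(query, site)                     # step 2
    vector = [align(pre, t) for t in training_presequences]   # step 3: N-dim vector
    return classify(vector)


# Toy stand-ins (hypothetical): a fixed cleavage site, a position-wise match
# count instead of a real alignment, and "index of the best match" as classifier.
def toy_detect_site(seq: str) -> Optional[int]:
    return 4


def toy_align(a: str, b: str) -> float:
    return float(sum(x == y for x, y in zip(a, b)))


def toy_classify(vector: List[float]) -> int:
    return vector.index(max(vector))


training = ["MKTA", "MLSR", "MASS"]
result = predict_location("MLSRQQSQR", training, toy_detect_site, toy_align, toy_classify)
```

In a real system `detect_site` would wrap a signal-based predictor, `align` a profile alignment score, and `classify` a trained one-vs-rest SVM.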
28Perturbational Discriminant Analysis
Spectral Space
Define the kernel matrix
K can be factorized via spectral decomposition
into
Empirical Space
Spectral Space
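The formulas for this slide are not present in the text, so the block below restates the standard definitions only. The notation is an assumption: training samples x_1, ..., x_N, a kernel function K(., .), and a symmetric positive semidefinite kernel matrix, so that the eigenvalues in Λ are non-negative. The paper's own symbols and ordering of factors may differ.

```latex
% Kernel matrix over the N training samples
\mathbf{K} = \bigl[ K(\mathbf{x}_i, \mathbf{x}_j) \bigr]_{i,j=1}^{N}

% Empirical-space representation of a sample x
\mathbf{k}(\mathbf{x}) =
  \bigl[ K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_N, \mathbf{x}) \bigr]^{\mathsf{T}}

% Spectral decomposition (U orthogonal, \Lambda diagonal with eigenvalues >= 0)
\mathbf{K} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\mathsf{T}}
           = \bigl( \boldsymbol{\Lambda}^{1/2} \mathbf{U}^{\mathsf{T}} \bigr)^{\mathsf{T}}
             \bigl( \boldsymbol{\Lambda}^{1/2} \mathbf{U}^{\mathsf{T}} \bigr)
```

Under these assumptions, the columns of Λ^{1/2} U^T act as spectral-space coordinates of the training samples, because their inner products reproduce the entries of K.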