Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization

1
Truncation of Protein Sequences for Fast Profile
Alignment with Application to Subcellular
Localization

Man-Wai MAK and Wei WANG
The Hong Kong Polytechnic University
Sun-Yuan KUNG
Princeton University

2
Contents

Introduction
Cell Organelles and Protein Subcellular Localization
Signal-Based vs. Homology-Based Methods
Speeding Up the Prediction Process
Predicting Cleavage Site Location
Truncating Profiles vs. Truncating Sequences
Perturbational Discriminant Analysis
Experiments and Results
Conclusions

3
Organelles

Cells have a set of organelles that are
specialized for carrying out one or more vital
functions.
Proteins must be transported to the correct
organelles of a cell to properly perform their
functions.
Therefore, knowing the subcellular localization
is one step towards understanding the functions
of proteins.

4
Proteins and Their Subcellular Location
5
Subcellular Localization Prediction

Two key methods
Signal-based
Homology-based

6
Signal-Based Method
[Figure label: Cleavage site. Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]

The amino acid sequence of a protein contains
information about its organelle destination.
Typically, the information can be found within a
short segment of 20 to 100 amino acids preceding
the cleavage site.
Signal-based methods (e.g. TargetP) can determine
the cleavage site location.

7
Homology-Based Method
[Diagram: Full-length Query Sequence -> Align with each of the N full-length training sequences S(1) = KNKA..., S(2) = KAKN..., ..., S(N) = KGLL... -> N-dim alignment vector -> SVM classifier -> Subcellular Location]

Advantage
Can predict sequences that do not have cleavage
sites.
Drawback
Given a query sequence, we need to align it with
every training sequence in the training set,
causing long computation time.
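
The N-dim alignment vector described on this slide can be sketched as follows. This is a minimal illustration only: it uses a plain Smith-Waterman local alignment with a toy match/mismatch/gap scheme on raw sequences, whereas the slides refer to profile alignment. The function names, the scoring values, and the short strings standing in for S(1), S(2), S(N) are all assumptions made for this example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a toy scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def alignment_vector(query, training_seqs):
    """Align the query with every training sequence: one score per sequence."""
    return [local_alignment_score(query, t) for t in training_seqs]


# Toy stand-ins for the training sequences S(1), S(2), ..., S(N).
training = ["KNKA", "KAKN", "KGLL"]
vector = alignment_vector("KNKA", training)
print(vector)  # one entry per training sequence, so N entries in total
```

The cost of building the vector grows with both N and the sequence lengths, which is the drawback the slide points out.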

8
Sequence Length Distribution
[Figure: length distribution of sequences (Occurrences of Seq. vs. Sequence Length) in three panels, SP (Ext), mTP (Mit) and cTP (Chl), each with the cleavage site marked]

Many sequences are fairly long, so aligning
whole sequences takes a long computation
time.
cTP, mTP and SP are all under 100 AAs long and
contain the most relevant segment.
Computation can be saved by aligning
the signal segments only.

9
Proposed Method: Aligning the Segments that
Contain the Most Relevant Info.
[Diagram: Amino Acid Sequence (N-terminus to C-terminus) -> Signal-based Cleavage Site Predictor (e.g. TargetP) -> Cleavage Site -> truncate -> Truncated sequence -> Homology-based Method -> Subcellular Location]
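
The truncation step can be sketched as below. The function name, the 1-based position convention, and the toy sequence are assumptions for illustration. Keeping the full sequence when no cleavage site is reported is also an assumption of this sketch; the slides do not say how that case is handled.

```python
def truncate_at_cleavage_site(sequence, cleavage_site=None):
    """Return the N-terminal segment up to the predicted cleavage site.

    cleavage_site is a 1-based residue count (hypothetical convention).
    None means the predictor reported no cleavage site; this sketch then
    keeps the full sequence, which is an assumption, not the slides' rule.
    """
    if cleavage_site is None:
        return sequence
    if not 0 < cleavage_site <= len(sequence):
        raise ValueError("cleavage site lies outside the sequence")
    return sequence[:cleavage_site]


# Toy sequence and a made-up predicted cleavage site.
toy_sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHK"
pre_sequence = truncate_at_cleavage_site(toy_sequence, cleavage_site=18)
print(len(toy_sequence), len(pre_sequence))
```

Only the shorter `pre_sequence` would then be passed on to the homology-based stage.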
10
Aligning Profiles Vs. Aligning Sequences
Scheme I Truncate the profiles Scheme II
Truncate the sequences
Query Sequence
11
Perturbational Discriminant Analysis
Input and Hilbert Spaces
[Diagram labels: Input Space, Hilbert Space, Empirical Space]
12
Perturbational Discriminant Analysis

The objective of PDA is to find an optimal
discriminant function in the Hilbert space or
empirical space

The optimal solution (see derivation in paper)
in the empirical space is

ρ represents the noise (uncertainty) level in the
measurement. It also ensures numerical stability
of the matrix inverse.
ρ = 1 in this work.
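
The stabilizing role of the noise-level parameter can be illustrated with a small sketch. The slide's own solution formula is not reproduced in this transcript, so the code assumes a generic perturbed system (K + rho * I) a = y, with rho added to the diagonal of the kernel matrix. This is an assumption for illustration and may differ from the paper's exact expression (for example, it ignores any bias term). All names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def perturbed_solution(kernel, targets, rho=1.0):
    """Solve (K + rho * I) a = y, i.e. add rho to the diagonal of K first."""
    n = len(kernel)
    perturbed = [[kernel[i][j] + (rho if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    return solve_linear_system(perturbed, targets)


# Two identical training points give a singular K (both rows are equal).
K = [[1.0, 1.0], [1.0, 1.0]]
y = [1.0, 1.0]
a = perturbed_solution(K, y, rho=1.0)  # solvable once rho is added
print(a)
```

With `rho=0.0` the same call raises `ValueError`, because the unperturbed matrix cannot be inverted.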

13
Perturbational Discriminant Analysis
Example on 2-D Data
3 classes of 2-dim data in the input space
RBF kernel matrix K
Decision boundaries in the input space
Projection onto the 2-dim PDA space
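
For readers unfamiliar with the RBF kernel matrix mentioned on this slide, here is a minimal sketch of how such a matrix is computed for 2-dim data. The points and the width `sigma` are made-up values; the slide does not give the data or the kernel width used in its example.

```python
import math


def rbf_kernel_matrix(points, sigma=1.0):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[math.exp(-squared_distance(p, q) / (2.0 * sigma ** 2))
             for q in points]
            for p in points]


# Hypothetical 2-dim points: two close together, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
K = rbf_kernel_matrix(points)
for row in K:
    print(["%.3f" % value for value in row])
```

Nearby points get kernel values close to 1 and distant points get values close to 0, so the matrix acts as a pairwise similarity table.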
14
Perturbational Discriminant Analysis
Application to Sequence Classification
[Diagram, training: Training sequences -> PSI-BLAST -> Training Profiles -> Pairwise Alignment -> K -> Compute PDA parameters]
[Diagram, testing: Test sequence -> PSI-BLAST -> Test Profile -> Align with Training Profiles -> Compute PDA Score]
15
Perturbational Discriminant Analysis
Application to Multi-Class Problems
1-vs-Rest PDA Classifier
MAXNET
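
The MAXNET stage of a 1-vs-rest classifier simply picks the class whose discriminant score is largest. A minimal sketch, assuming the per-class scores have already been computed (the values below are made up, and the function name is hypothetical):

```python
def maxnet_decision(class_scores):
    """Return the label whose one-vs-rest score is largest."""
    return max(class_scores, key=class_scores.get)


# Made-up one-vs-rest scores for the four location classes.
scores = {"Ext": -0.4, "Mit": 0.7, "Chl": 0.1, "Cyt/Nuc": -0.2}
predicted = maxnet_decision(scores)
print(predicted)
```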
16
Perturbational Discriminant Analysis
Application to Multi-Class Problems
Cascaded PDA-SVM Classifier
[Diagram: Test sequence -> Project onto (C-1)-dim PDA space -> 1-vs-rest SVM Classifier -> Class label]
17
Experiments
Materials

Eukaryotic sequences extracted from Swiss-Prot 57.5
Ext, Mit, and Chl contain experimentally
determined cleavage sites
25% sequence identity (based on BLASTclust)

Performance Evaluation

5-Fold cross validation
Prediction accuracy and Matthews correlation
coefficient (MCC)
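
The two evaluation measures can be computed as in the sketch below. The MCC formula is the standard one for a binary confusion matrix; treating each class in a one-vs-rest fashion, the helper names, and the toy labels are assumptions for illustration, not results from the experiments.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of sequences whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion counts; returns 0.0 if it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def class_mcc(true_labels, predicted_labels, positive):
    """MCC of one class against the rest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return matthews_corrcoef(tp, tn, fp, fn)


# Toy labels, not data from the experiments.
truth = ["Ext", "Ext", "Mit", "Mit", "Chl", "Chl"]
preds = ["Ext", "Mit", "Mit", "Mit", "Chl", "Ext"]
print(accuracy(truth, preds), class_mcc(truth, preds, "Mit"))
```

In a 5-fold cross validation, these measures would be computed on the held-out fold of each split and then combined.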

18
Comparing Kernel Matrices
[Figure labels: Query Sequence; Kernel matrix (Scheme I); Kernel matrix (Scheme II)]
19
Sensitivity Analysis
[Diagram: Seq. -> Cut Seq. at p+x (p: ground-truth cleavage site) -> Subcellular localization (PairProSVM) -> Subcellular location]

The localization performance degrades when the
cut-off position drifts away from the
ground-truth cleavage site.
mTP and cTP are more sensitive to the error of
cleavage site prediction than Ext.

[Plot: Subcellular Localization Accuracy (%) vs. Cut-off Position (ticks include p-16, p-8, p-2, p, p+2, p+16, p+32, p+64, where p is the ground-truth cleavage site). Curves: Cyt/Nuc, Ext, Overall, Mit, Chl]
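
The cut-off positions used in this sensitivity analysis can be generated as in the following sketch. The offsets are the tick values that survive in the transcript; the function name, the toy sequence, and the clamping of out-of-range cuts to the sequence ends are assumptions for illustration.

```python
def cut_at_offsets(sequence, true_site, offsets):
    """Map each offset x to the N-terminal segment cut at position p + x.

    Cuts falling outside the sequence are clamped to its ends, which is an
    assumption of this sketch.
    """
    segments = {}
    for x in offsets:
        cut = min(max(true_site + x, 1), len(sequence))
        segments[x] = sequence[:cut]
    return segments


offsets = [-16, -8, -2, 0, 2, 16, 32, 64]
toy_sequence = "A" * 40 + "C" * 60  # hypothetical 100-residue sequence
segments = cut_at_offsets(toy_sequence, true_site=40, offsets=offsets)
for x in offsets:
    print(x, len(segments[x]))
```

Each truncated segment would then be classified, and the accuracy plotted against the offset.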
20
Performance of Cleavage Site Prediction
Conditional Random Field (CRF) is better than
TargetP(Plant) at predicting the cleavage sites
of signal peptides (Ext) but is worse than
TargetP(NonPlant).
CRF is slightly inferior to TargetP in predicting
the cleavage sites of mitochondrial sequences (Mit),
but it is significantly better than TargetP in
predicting the cleavage sites of chloroplast
sequences (Chl).

[Figure: Csite Prediction ACC (%) by Category for TargetP(NonPlant), TargetP(Plant) and CRF]
21
Comparing Profile Creation Time
[Diagram, Scheme I: Query Sequence -> PSI-BLAST -> Long profile -> Cut -> short profile -> Pairwise Alignment -> Score Vector -> SVM or KPDA -> Subcellular Location]
[Diagram, Scheme II: Query Sequence -> Cut -> short sequence -> PSI-BLAST -> short profile -> Pairwise Alignment -> Score Vector -> SVM or KPDA -> Subcellular Location]
Findings: Profile creation time can be
substantially reduced by truncating the protein
sequences at the cleavage sites.
22
Training and Classification Time
[Diagram labels: Project onto (C-1)-dim PDA space; 1-vs-rest SVM Classifier]
Findings: The training times of 1-vs-rest PDA and
Cascaded PDA-SVM are substantially shorter than
that of SVM.
23
Compare with State-of-the-Art Localization
Predictors
[Table labels: MCC, Localization Accuracy (%), Conditional Random Fields]
Findings: In terms of localization accuracy, the
proposed Signal+Homology method performs
slightly better than the signal-based TargetP and
is substantially better than the homology-based
SubLoc.
24
Conclusion

Fast subcellular localization prediction can be
achieved by a cascaded fusion of signal-based and
homology-based methods.
As far as localization accuracy is concerned, it
does not matter whether we truncate the sequences
or truncate the profiles. However, truncating the
sequences reduces the profile creation time
sixfold.

25
Compare with State-of-the-Art Localization
Predictors
26
Performance of Cascaded Fusion
The computation time for full-length profile
alignment is a striking 116 hours.
Our method not only leads to a nearly 20-fold
reduction in computation time but also boosts the
prediction performance.

[Table: Time (hr.) and subcellular localization accuracy, Acc (%), for four inputs: Full-length Seq.; Seq. with Csite predicted by TargetP(P); Seq. with Csite predicted by TargetP(N); Seq. with Csite predicted by CRF]
27
Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site
(if any) of a query sequence is determined by a
signal-based method.
2) Pre-sequence selection. The pre-sequence of
the query is obtained by selecting from the
N-terminal up to the cleavage site.
3) Pairwise alignment. The pre-sequence is
aligned with each of the training pre-sequences
to form an N-dim vector, which is fed to a
one-vs-rest SVM classifier for prediction.
28
Perturbational Discriminant Analysis
Spectral Space
Define the kernel matrix
K can be factorized via spectral decomposition
into
Empirical Space
Spectral Space
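
The factorization formula from this slide is not preserved in the transcript. For reference, the standard spectral decomposition of a symmetric positive semi-definite N x N kernel matrix is given below in generic notation, which may differ from the symbols used on the original slide:

```latex
K = U \Lambda U^{\mathsf{T}}, \qquad
U^{\mathsf{T}} U = I, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N), \quad \lambda_i \ge 0
```

Here the columns of U are the eigenvectors of K and the diagonal entries of Λ are the corresponding eigenvalues.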
