Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization

Man-Wai MAK and Wei WANG, The Hong Kong Polytechnic University

Slides: 29
Transcript and Presenter's Notes

1
Truncation of Protein Sequences for Fast Profile
Alignment with Application to Subcellular
Localization
  • Man-Wai MAK and Wei WANG
  • The Hong Kong Polytechnic University
  • Sun-Yuan KUNG
  • Princeton University

2
Contents
  • Introduction
  • Cell Organelles and Protein Subcellular
    Localization
  • Signal-Based vs. Homology-Based Methods
  • Speeding Up the Prediction Process
  • Predicting Cleavage Site Location
  • Truncating Profiles vs. Truncating Sequences
  • Perturbational Discriminant Analysis
  • Experiments and Results
  • Conclusions

3
Organelles
  • Cells have a set of organelles that are
    specialized for carrying out one or more vital
    functions.
  • Proteins must be transported to the correct
    organelles of a cell to properly perform their
    functions.
  • Therefore, knowing the subcellular localization
    is one step towards understanding the functions
    of proteins.

4
Proteins and Their Subcellular Location
5
Subcellular Localization Prediction
  • Two key methods
  • Signal-based
  • Homology-based

6
Signal-Based Method
[Figure: signal segment and cleavage site. Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]
  • The amino acid sequence of a protein contains
    information about its organelle destination.
  • Typically, the information can be found within a
    short segment of 20 to 100 amino acids preceding
    the cleavage site.
  • Signal-based methods (e.g. TargetP) can determine
    the cleavage site location.

7
Homology-Based Method
[Diagram: the full-length query sequence is aligned with each of the N full-length training sequences (S(1) = KNKA, S(2) = KAKN, ..., S(N) = KGLL) to give an N-dim alignment vector, which an SVM classifier maps to a subcellular location.]
  • Advantage: Can predict the localization of
    sequences that do not have cleavage sites.
  • Drawback: Given a query sequence, we need to
    align it with every training sequence in the
    training set, which takes a long time.
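
The N-dim alignment vector described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain Smith-Waterman local alignment with toy match/mismatch/gap scores as a stand-in for the profile alignment used in the talk, and the function names are hypothetical. The three training sequences are the ones shown in the slide's diagram.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are toy assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def alignment_vector(query, training_seqs):
    """Align the query with every training sequence: one score per sequence."""
    return [local_alignment_score(query, t) for t in training_seqs]


training = ["KNKA", "KAKN", "KGLL"]  # S(1), S(2), S(N) from the slide
vector = alignment_vector("KNKA", training)
print(vector)
```

The vector has one entry per training sequence, so its length (and the number of alignments per query) grows with the size of the training set, which is the drawback noted above.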

8
Sequence Length Distribution
[Figure: length distributions of Ext (SP), Mit (mTP), and Chl (cTP) sequences, with the cleavage site marked in each panel. Axes: sequence length vs. occurrences of sequences.]
  • Many sequences are fairly long, so aligning
    whole sequences takes a long time.
  • The signal segments (cTP, mTP, and SP) are under
    100 amino acids long and contain the most
    relevant information.
  • Computation can be saved by aligning the signal
    segments only.
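
A rough way to see the saving: dynamic-programming pairwise alignment fills a table whose size is the product of the two sequence lengths, so shortening both sequences shrinks the work quadratically. The sketch below counts table cells only; all lengths are hypothetical and are not figures from the slides.

```python
def alignment_cells(query_len, training_lens):
    """Number of DP table cells filled when aligning one query with each training sequence."""
    return sum(query_len * t for t in training_lens)


# Hypothetical lengths, chosen only for illustration.
full_training_lens = [400, 650, 900]   # full-length training sequences
signal_training_lens = [25, 40, 60]    # the same sequences cut at their cleavage sites

full_cost = alignment_cells(500, full_training_lens)    # full-length query
short_cost = alignment_cells(30, signal_training_lens)  # truncated query
speedup = full_cost / short_cost
print(full_cost, short_cost, speedup)
```

With these made-up lengths the truncated alignments touch far fewer cells; the actual saving depends on the real length distributions and on the alignment tool.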

9
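Proposed Method: Aligning the Segments that
Contain the Most Relevant Info
[Diagram: an amino acid sequence (N-terminus to C-terminus) is passed to a signal-based cleavage site predictor (e.g. TargetP); the sequence is truncated at the predicted cleavage site, and the truncated sequence is passed to the homology-based method, which outputs the subcellular location.]
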
10
Aligning Profiles vs. Aligning Sequences
  • Scheme I: Truncate the profiles
  • Scheme II: Truncate the sequences
[Diagram: processing of the query sequence under each scheme.]
11
Perturbational Discriminant Analysis
Input and Hilbert Spaces
[Diagram: relationship among the input space, the Hilbert space, and the empirical space.]
12
Perturbational Discriminant Analysis
  • The objective of PDA is to find an optimal
    discriminant function in the Hilbert space or
    the empirical space.
  • The optimal solution in the empirical space is
    derived in the paper.
  • ρ represents the noise (uncertainty) level in the
    measurement. It also ensures numerical stability
    of the matrix inverse.
  • ρ = 1 in this work.
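
The role of the perturbation parameter (written rho below) in stabilizing the matrix inverse can be illustrated with a small sketch. The exact PDA solution is not reproduced in this transcript, so the code assumes a ridge-regularized form: solve (K + rho I) a + b 1 = y subject to 1^T a = 0. That form, the RBF kernel width, the toy data, and all function names are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma=0.5):
    """RBF kernel matrix for the rows of X (gamma is an assumed width)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def fit_perturbed_discriminant(K, y, rho=1.0):
    """Solve (K + rho*I) a + b*1 = y with sum(a) = 0 (assumed form, see lead-in)."""
    n = K.shape[0]
    A = K + rho * np.eye(n)          # rho > 0 keeps A well conditioned
    ones = np.ones(n)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_1 = np.linalg.solve(A, ones)
    b = (ones @ Ainv_y) / (ones @ Ainv_1)
    a = Ainv_y - b * Ainv_1
    return a, b


# Toy two-class data in a 2-dim input space.
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

K = rbf_kernel(X)
a, b = fit_perturbed_discriminant(K, y, rho=1.0)
scores = K @ a + b                   # discriminant scores of the training points
print(np.sign(scores))
```

Because the kernel matrix is positive semidefinite, adding rho I with rho > 0 makes the system solvable even when K itself is singular or nearly so.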

13
Perturbational Discriminant Analysis
Example on 2-D Data
[Figure panels: 3 classes of 2-dim data in the input space; RBF kernel matrix K; decision boundaries in the input space; projection onto the 2-dim PDA space.]
14
Perturbational Discriminant Analysis
Application to Sequence Classification
[Diagram, training: training sequences → PSI-BLAST → training profiles → pairwise alignment → K → compute PDA parameters.]
[Diagram, testing: test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.]
15
Perturbational Discriminant Analysis
Application to Multi-Class Problems
[Diagram: 1-vs-rest PDA classifier followed by a MAXNET.]
16
Perturbational Discriminant Analysis
Application to Multi-Class Problems
Cascaded PDA-SVM Classifier
[Diagram: test sequence → project onto (C-1)-dim PDA space → 1-vs-rest SVM classifier → class label.]
17
Experiments
Materials
  • Eukaryotic sequences extracted from Swiss-Prot
    57.5
  • Ext, Mit, and Chl contain experimentally
    determined cleavage sites
  • 25% sequence identity (based on BLASTclust)

Performance Evaluation
  • 5-Fold cross validation
  • Prediction accuracy and Matthews correlation
    coefficient (MCC)
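
For reference, the Matthews correlation coefficient for one class can be computed from its confusion counts as sketched below. Treating each class one-vs-rest and the simple interleaved fold split are assumptions; the slides do not say how the folds were formed.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k interleaved folds (one simple way to do it)."""
    return [list(range(i, n, k)) for i in range(k)]


folds = k_fold_indices(10, k=5)
print(folds)
print(mcc(50, 50, 0, 0), mcc(25, 25, 25, 25), mcc(0, 0, 50, 50))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it more informative than accuracy when class sizes are unbalanced.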

18
Comparing Kernel Matrices
[Figure: kernel matrix under Scheme I and kernel matrix under Scheme II.]
19
Sensitivity Analysis
[Diagram: each sequence is cut at p+x, where p is the ground-truth cleavage site, and then passed to subcellular localization (PairProSVM), which outputs the subcellular location. Plot y-axis: subcellular localization accuracy (%).]
  • The localization performance degrades when the
    cut-off position drifts away from the
    ground-truth cleavage site.
  • mTP and cTP are more sensitive to the error of
    cleavage site prediction than Ext.
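
The cut-off drift used in this analysis can be sketched as follows. The offsets are the ones that appear as axis labels on the slide, reading the unsigned labels as positive offsets; the sequence, the value of p, the clamping rule, and the function name are hypothetical.

```python
def cut_at_offset(seq, p, x):
    """Keep residues from the N-terminus up to position p + x, clamped to the sequence."""
    end = max(1, min(len(seq), p + x))
    return seq[:end]


offsets = [-16, -8, -2, 0, 2, 16, 32, 64]  # drift x relative to the cleavage site
seq = "M" + "A" * 99                        # hypothetical 100-residue sequence
p = 30                                      # hypothetical ground-truth cleavage site

lengths = [len(cut_at_offset(seq, p, x)) for x in offsets]
print(lengths)
```

Negative offsets discard part of the signal segment, while positive offsets append residues from the mature protein; each truncated sequence would then be classified to measure how accuracy changes with x.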

[Plot x-axis: cut-off position (p-16, p-8, p-2, p, p+2, p+16, p+32, p+64, where p is the ground-truth cleavage site). Curves: Cyt/Nuc, Ext, Mit, Chl, and Overall.]
20
Performance of Cleavage Site Prediction
  • Conditional Random Field (CRF) is better than
    TargetP(Plant) in terms of predicting the
    cleavage sites of signal peptide (Ext) but is
    worse than TargetP(Nonplant).
  • CRF is slightly inferior to TargetP in predicting
    the cleavage sites of mitochondria, but it is
    significantly better than TargetP in predicting
    the cleavage site of chloroplasts.

[Figure: cleavage site prediction accuracy (%) by category for TargetP(NonPlant), TargetP(Plant), and CRF.]
21
Comparing Profile Creation Time
[Diagram, Scheme I: query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]
[Diagram, Scheme II: query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location.]
Findings: Profile creation time can be
substantially reduced by truncating the protein
sequences at the cleavage sites.
22
Training and Classification Time
[Diagram: projection onto (C-1)-dim PDA space followed by a 1-vs-rest SVM classifier.]
Findings: The training times of 1-vs-rest PDA and
cascaded PDA-SVM are substantially shorter than
that of SVM.
23
Compare with State-of-the-Art Localization
Predictors
[Table: MCC and localization accuracy (%) of the compared predictors. Table label: Conditional Random Fields.]
Findings: In terms of localization accuracy, the
proposed Signal+Homology method performs
slightly better than the signal-based TargetP and
substantially better than the homology-based
SubLoc.
24
Conclusion
  • Fast subcellular localization prediction can be
    achieved by a cascaded fusion of signal-based and
    homology-based methods.
  • As far as localization accuracy is concerned, it
    does not matter whether we truncate the sequences
    or truncate the profiles. However, truncating the
    sequences reduces the profile creation time by a
    factor of 6.

25
Compare with State-of-the-Art Localization
Predictors
26
Performance of Cascaded Fusion
  • The computation time for full-length profile
    alignment is a striking 116 hours.
  • Our method not only leads to a nearly 20-fold
    reduction in computation time but also boosts the
    prediction performance.

[Figure: computation time (hr.) and subcellular localization accuracy (%) for full-length sequences and for sequences with the cleavage site predicted by TargetP(P), TargetP(N), and CRF.]
27
Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site
   (if any) of a query sequence is determined by a
   signal-based method.
2) Pre-sequence selection. The pre-sequence of the
   query is obtained by selecting from the
   N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned
   with each of the training pre-sequences to form
   an N-dim vector, which is fed to a one-vs-rest
   SVM classifier for prediction.
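
The three steps can be wired together roughly as below. This is only a sketch of the control flow: the cleavage site predictor, the alignment scorer, and the classifier are passed in as callables, and the stand-ins used in the example (a fixed cleavage site, a position-match count, and an arg-max over scores) are hypothetical placeholders for TargetP or CRF, profile alignment, and the one-vs-rest SVM. Falling back to the full sequence when no cleavage site is found is also an assumption.

```python
def localize(query, training_pre_seqs, predict_cleavage_site, align, classify):
    """Signal + homology fusion: detect site, take pre-sequence, align, classify."""
    site = predict_cleavage_site(query)                     # step 1
    pre_seq = query[:site] if site is not None else query   # step 2 (fallback assumed)
    vector = [align(pre_seq, t) for t in training_pre_seqs] # step 3: N-dim vector
    return classify(vector)


# Hypothetical stand-ins, for illustration only.
training_pre_seqs = ["MKTLL", "MAAPR", "MSSLQ"]
labels = ["Ext", "Mit", "Chl"]


def fixed_site(seq):
    return 5  # pretend the predictor always reports a site after residue 5


def match_count(a, b):
    return sum(1 for ca, cb in zip(a, b) if ca == cb)


def argmax_label(vector):
    return labels[vector.index(max(vector))]


result = localize("MKTLLVVAGG", training_pre_seqs, fixed_site, match_count, argmax_label)
print(result)
```

Keeping the three components as parameters mirrors the cascade on the slide: any signal-based predictor can supply the cut point, and the homology-based stage only ever sees the short pre-sequence.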
28
Perturbational Discriminant Analysis
Spectral Space
  • Define the kernel matrix K.
  • K can be factorized via spectral decomposition.
[Diagram: empirical space and spectral space.]
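
The factorization mentioned on this slide can be checked numerically. A symmetric kernel matrix has the spectral decomposition K = U Λ U^T with orthonormal U; the sketch below verifies this with NumPy on a small made-up matrix. Taking E = Λ^(1/2) U^T as the spectral-space representation (so that K = E^T E) is an assumption about the slide's missing equations, not something the transcript states.

```python
import numpy as np

# A small symmetric positive-definite matrix standing in for a kernel matrix.
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, U = np.linalg.eigh(K)         # eigh is meant for symmetric matrices
Lam = np.diag(eigvals)
reconstructed = U @ Lam @ U.T          # K = U Lam U^T

# Assumed spectral-space coordinates: column i of E represents training sample i.
E = np.diag(np.sqrt(eigvals)) @ U.T
print(np.allclose(reconstructed, K), np.allclose(E.T @ E, K))
```

In this reading, inner products between the columns of E reproduce the kernel values, so a linear method applied in the spectral space behaves like a kernel method in the input space.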