Title: Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites
1Conditional Random Fields for the Prediction of
Signal Peptide Cleavage Sites
- M.W. Mak
- The Hong Kong Polytechnic University
- S.Y. Kung
- Princeton University
2Contents
- Introduction
- Proteins and Their Subcellular Locations
- Importance of Protein Cleavage-Site Prediction
- Information in Amino Acid Sequences
- Existing Approaches to Cleavage Site Prediction
- Conditional Random Field (CRF)
- CRF for Cleavage Site Prediction
- Experiments and Results
- Effectiveness of Different Feature Functions
- Effect of Varying Window Size
- Fusion with SignalP
3Proteins and Their Destination
- A protein consists of a sequence of amino acids.
- Newly synthesized proteins need to pass across an
intracellular membrane to reach their destination.
http://redpoll.pharmacy.ualberta.ca
4Signal Peptide
- A short segment of 20 to 100 amino acids (known
as the signal peptide) contains information about
the destination (address) of the protein.
- The signal peptide is cleaved off from the
resulting mature protein when it passes across
the membrane.
http://nobelprize.org
Mature protein
Source: S. R. Goodman, Medical Cell Biology,
Elsevier, 2008.
Signal Peptide
Cleavage Site
5Importance of Cleavage Site Prediction
- Defects in the protein sorting process can cause
serious diseases, e.g., kidney stones.
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
6Importance of Cleavage Site Prediction
- Many proteins (e.g., insulin) are produced in
living cells. To make the cells secrete these
proteins, the proteins are given a signal peptide.
Bioreactor
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
7Information in Sequences
- Signal peptides contain some regular patterns.
- Although the patterns exhibit substantial
variation, they can be detected by machine
learning tools.
Rich in hydrophobic AA
Cleavage Site
8Existing Methods
- Weight matrices (PrediSi)
- Neural Networks (SignalP 1.1)
- HMMs (SignalP 3.0)
9Weight Matrices
[Figure: a weight matrix of 15 positions x 20 AA is slid along the sequence
M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
at positions t-1, t, t+1, giving a score at each position t.]
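The weight-matrix idea can be sketched as follows: a matrix with one row per window position and one column per amino acid is slid along the sequence, and the score at position t is taken as the sum of the entries selected by the residues in the window. This is only an illustration. The weights below are random placeholders (not PrediSi's trained values), and the helper names and the choice to start the window at t are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
WINDOW = 15                            # 15 positions, as on the slide


def make_placeholder_matrix(seed=0):
    """Build a 15 x 20 matrix of random weights (placeholder values only)."""
    rng = random.Random(seed)
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
            for _ in range(WINDOW)]


def score_at(matrix, seq, t):
    """Sum the matrix entries picked out by the window seq[t : t + 15]."""
    window = seq[t:t + len(matrix)]
    if len(window) < len(matrix):
        raise ValueError("window runs past the end of the sequence")
    return sum(row[aa] for row, aa in zip(matrix, window))


def scan(matrix, seq):
    """Score every position at which a full window fits."""
    return [score_at(matrix, seq, t)
            for t in range(len(seq) - len(matrix) + 1)]


seq = "MARSSLFTFLCLAVFINGCLSQIEQQ"  # the example sequence from the slide
matrix = make_placeholder_matrix()
scores = scan(matrix, seq)
# The highest-scoring window start would be the predicted position.
best = max(range(len(scores)), key=scores.__getitem__)
```

With a trained matrix, the position with the highest score would be reported as the predicted cleavage site.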
10SignalP-HMM
Source: Nielsen and Krogh
Mature protein
Signal Peptide
11Contents
- Introduction
- Proteins and Their Subcellular Locations
- Importance of Protein Cleavage-Site Prediction
- Information in Amino Acid Sequences
- Existing Approaches to Cleavage Site Prediction
- Conditional Random Field (CRF)
- CRF for Cleavage Site Prediction
- Experiments and Results
- Effectiveness of Amino Acid Properties
- Effectiveness of Different Feature Functions
- Fusion with SignalP
12Conditional Random Fields
- Conditional Random Fields (CRFs) were originally
designed for sequence labeling tasks such as
Part-of-Speech (POS) tagging
- Given a sequence of observations (e.g., words), a
CRF attempts to find the most likely label
sequence, i.e., it gives a label for each of the
observations.
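Finding the most likely label sequence in a linear-chain model is commonly done with Viterbi decoding. The sketch below assumes per-position state scores and label-to-label transition scores are already available. The labels "S" (signal peptide) and "M" (mature protein) and all numeric scores are made up for illustration.

```python
def viterbi(state_scores, trans_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: list (one entry per position) of dict label -> score
    trans_scores: dict (previous_label, label) -> score
    """
    best = [{y: state_scores[0][y] for y in labels}]
    back = []
    for t in range(1, len(state_scores)):
        best_t, back_t = {}, {}
        for y in labels:
            # Best previous label for ending in y at position t.
            prev = max(labels, key=lambda p: best[-1][p] + trans_scores[(p, y)])
            best_t[y] = best[-1][prev] + trans_scores[(prev, y)] + state_scores[t][y]
            back_t[y] = prev
        best.append(best_t)
        back.append(back_t)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for back_t in reversed(back):
        path.append(back_t[path[-1]])
    return path[::-1]


labels = ["S", "M"]
# Hypothetical scores: returning from mature protein to signal peptide is penalized.
trans_scores = {("S", "S"): 0.5, ("S", "M"): 0.0,
                ("M", "M"): 0.5, ("M", "S"): -5.0}
state_scores = [
    {"S": 1.0, "M": 0.0},
    {"S": 1.0, "M": 0.0},
    {"S": 0.2, "M": 0.4},
    {"S": 0.0, "M": 1.0},
    {"S": 0.0, "M": 1.0},
]
path = viterbi(state_scores, trans_scores, labels)
```

In this toy setting the switch from "S" to "M" in the decoded path marks the predicted cleavage site.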
13HMM Vs. CRF
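- Hidden Markov Models learn the likelihood p(observation | label)
- Conditional Random Fields learn the posterior p(label | observation),
which is more direct
[Figure: label chain y1, y2, ..., yT above the observation sequence]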
14Advantages of CRF
- Avoids computing the likelihood p(observation | label).
Instead, the posterior p(label | observation) is
computed directly.
- Able to model long-range dependencies without
making the inference problem intractable.
- Guarantees a global optimum.
Depends on
M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
15CRF for Cleavage Site Prediction
Cleavage site
Weights
Length of Sequence
Transition features
n-grams of amino acids
State features
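The labels above (weights, length of sequence, transition features, state features) correspond to the terms of a linear-chain CRF. As a reference sketch, the textbook form is given below. The symbol names are assumptions and not necessarily the authors' notation: y is the label sequence marking the cleavage site, x is the amino acid sequence (from which the n-grams are taken), T is the length of the sequence, f_i are transition features with weights λ_i, g_j are state features with weights μ_j, and Z(x) is the normalizing constant.

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\left( \sum_{t=1}^{T} \left[
  \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t)
  + \sum_{j} \mu_j \, g_j(y_t, \mathbf{x}, t)
\right] \right)
```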
16CRF for Cleavage Site Prediction
e.g., bi-grams and query sequence T Q T W A G S H S . . .
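To illustrate what bi-gram features on the query sequence might look like, the sketch below collects the n-grams that start within a small window around a position t. The function name, the "offset:n-gram" feature naming, and the window convention are hypothetical choices for illustration, not the authors' feature templates.

```python
def ngram_features(seq, t, n, window):
    """Collect n-grams starting at positions t-window .. t+window.

    Each feature is named "<offset>:<n-gram>" so that the same n-gram at
    different offsets from t counts as a different feature.
    """
    feats = []
    for offset in range(-window, window + 1):
        start = t + offset
        # Skip n-grams that would run off either end of the sequence.
        if 0 <= start and start + n <= len(seq):
            feats.append(f"{offset:+d}:{seq[start:start + n]}")
    return feats


query = "TQTWAGSHS"  # start of the query sequence on the slide
bigrams_at_A = ngram_features(query, t=4, n=2, window=1)
# bigrams_at_A == ["-1:WA", "+0:AG", "+1:GS"]
```

Setting n=1 gives single amino acids, and larger n gives longer n-grams over the same window.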
17CRF for Cleavage Site Prediction
Position
18Contents
- Introduction
- Proteins and Their Subcellular Locations
- Importance of Protein Cleavage-Site Prediction
- Information in Amino Acid Sequences
- Existing Approaches to Cleavage Site Prediction
- Conditional Random Field (CRF)
- CRF for Cleavage Site Prediction
- Experiments and Results
- Effectiveness of Different Feature Functions
- Effect of Varying Window Size
- Fusion with SignalP
19Experiments
- Data: 1937 protein sequences extracted from
Swissprot 56.5. The cleavage site locations of
these sequences were biologically determined.
- Ten-fold cross validation
- For 1st-order state features, up to 5-grams of
amino acids
- For 2nd-order state features, up to bi-grams of
amino acids
- Use CRF software
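As a rough illustration of the ten-fold protocol, the sketch below shuffles the indices of the 1937 sequences and yields ten train/test splits, so that every sequence is used for testing exactly once. The function name, the seed, and the shuffling are assumptions; how the authors actually partitioned the data is not stated.

```python
import random


def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(ten_fold_splits(1937))
```

Each split trains on nine folds and tests on the remaining one, and the reported accuracy would be aggregated over the ten test folds.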
20Results
Effectiveness of using AA Properties
- Observations
- (1) Amino acids provide the most relevant information.
- (2) Hydrophobicity and charge/polarity can help.
21Results
Effectiveness of Different Feature Functions
- Observations
- Transition features by themselves perform poorly.
- But once combined with state features,
performance improves.
(Transition only)
(Transition + State)
22Results
Effect of Varying the Window Size
e.g. query sequence T Q T W A G S H S . . .
23Results
Compared with Other Predictors
- Observations
- (1) CRF is slightly better than SignalP.
- (2) CRF is complementary to SignalP.
24Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
25Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
Available in May 2009
27Conditional Random Fields
- Conditional Random Fields (CRFs) were originally
designed for sequence labeling tasks such as
Part-of-Speech (POS) tagging
Observations
x
x
y
Labels
- Given a sequence of observations, a CRF attempts
to find the most likely label sequence, i.e., it
gives a label for each of the observations.