Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites - PowerPoint PPT Presentation

Slides: 25
Provided by: hkpu
Transcript and Presenter's Notes

Title: Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites


1
Conditional Random Fields for the Prediction of
Signal Peptide Cleavage Sites
  • M.W. Mak, The Hong Kong Polytechnic University
  • S.Y. Kung, Princeton University
2
Contents
  • Introduction
  • Proteins and Their Subcellular Locations
  • Importance of Protein Cleavage-Site Prediction
  • Information in Amino Acid Sequences
  • Existing Approaches to Cleavage Site Prediction
  • Conditional Random Field (CRF)
  • CRF for Cleavage Site Prediction
  • Experiments and Results
  • Effectiveness of Different Feature Functions
  • Effect of Varying Window Size
  • Fusion with SignalP

3
Proteins and Their Destination
  • A protein consists of a sequence of amino acids.
  • Newly synthesized proteins need to pass across
    intracellular membranes to reach their destinations.

http://redpoll.pharmacy.ualberta.ca
4
Signal Peptide
  • A short segment of 20 to 100 amino acids (known
    as a signal peptide) contains information about
    the destination (address) of the protein.
  • The signal peptide is cleaved off when the protein
    passes across the membrane, leaving the mature
    protein.

http://nobelprize.org
Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.
[Figure labels: signal peptide, cleavage site, mature protein]
5
Importance of Cleavage Site Prediction
  • Defects in the protein sorting process can cause
    serious diseases, e.g., kidney stones.

Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
6
Importance of Cleavage Site Prediction
  • Many proteins (e.g., insulin) are produced in
    living cells. To have the proteins secreted from
    the cell, they are provided with a signal peptide.

[Figure label: Bioreactor]
Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html
7
Information in Sequences
  • Signal peptides contain some regular patterns.
  • Although the patterns exhibit substantial
    variation, they can be detected by machine
    learning tools.

[Figure labels: region rich in hydrophobic amino acids, cleavage site]
8
Existing Methods
  • Weight matrices (PrediSi)
  • Neural Networks (SignalP 1.1)
  • HMMs (SignalP 3.0)

9
Weight Matrices
[Figure: weight matrix of 20 AA by 15 positions, applied to the sequence at positions t-1, t, t+1]
M A R S S L F T F L C L A V F I N G C L S Q I E Q Q
Score at position t: 1608678771310686067178
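The weight-matrix idea can be sketched in Python as follows. This is a minimal illustration, not the PrediSi implementation: it assumes the score at position t is the sum of the matrix weights selected by the amino acids in a 15-position window. The names `score_position` and `best_site`, the `left` offset, and the random toy weights are all hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
WINDOW = 15                           # number of matrix positions, as on the slide


def score_position(seq, t, matrix, left=12):
    """Sum the matrix weights for the window that puts position t at column `left`.

    matrix[k][aa] is the weight of amino acid `aa` at window column k.
    `left` (hypothetical) is the number of window columns before position t.
    Columns that fall outside the sequence contribute nothing.
    """
    total = 0.0
    for k in range(WINDOW):
        i = t - left + k
        if 0 <= i < len(seq):
            total += matrix[k][seq[i]]
    return total


def best_site(seq, matrix, left=12):
    """Return the position with the highest window score."""
    return max(range(len(seq)), key=lambda t: score_position(seq, t, matrix, left))


# Toy weights for illustration only; real weights are estimated from training data.
rng = random.Random(0)
toy_matrix = [{aa: rng.uniform(-1, 1) for aa in AMINO_ACIDS} for _ in range(WINDOW)]
seq = "MARSSLFTFLCLAVFINGCLSQIEQQ"
scores = [score_position(seq, t, toy_matrix) for t in range(len(seq))]
```

The predicted cleavage site in this sketch is simply the position with the largest score, i.e. `best_site(seq, toy_matrix)`.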
10
SignalP-HMM
Source: Nielsen and Krogh
[Figure labels: signal peptide, mature protein]
11
Contents
  • Introduction
  • Proteins and Their Subcellular Locations
  • Importance of Protein Cleavage-Site Prediction
  • Information in Amino Acid Sequences
  • Existing Approaches to Cleavage Site Prediction
  • Conditional Random Field (CRF)
  • CRF for Cleavage Site Prediction
  • Experiments and Results
  • Effectiveness of Amino Acid Properties
  • Effectiveness of Different Feature Functions
  • Fusion with SignalP

12
Conditional Random Fields
  • Conditional Random Fields (CRFs) were originally
    designed for sequence labeling tasks such as
    Part-of-Speech (POS) tagging
  • Given a sequence of observations (e.g., words), a
    CRF attempts to find the most likely label
    sequence, i.e., it gives a label for each of the
    observations.
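Finding the most likely label sequence in a chain-structured model is usually done with the Viterbi algorithm. The sketch below is a generic version that works on precomputed scores. The two-label setup (0 for signal peptide, 1 for mature protein) and the example scores are assumptions made for illustration, not values from the presentation.

```python
def viterbi(state_scores, trans_scores):
    """Return the highest-scoring label sequence for a linear chain.

    state_scores[t][y]  : score of label y at position t
    trans_scores[yp][y] : score of moving from label yp to label y
    """
    n_labels = len(trans_scores)
    best = [list(state_scores[0])]  # best[t][y]: best score of any path ending in y at t
    back = []                       # back-pointers for positions 1..T-1
    for t in range(1, len(state_scores)):
        row, ptr = [], []
        for y in range(n_labels):
            prev = max(range(n_labels),
                       key=lambda yp: best[-1][yp] + trans_scores[yp][y])
            row.append(best[-1][prev] + trans_scores[prev][y] + state_scores[t][y])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda k: best[-1][k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Hypothetical scores: label 0 = signal peptide, label 1 = mature protein.
state = [[2, 0], [1, 0], [0, 1], [0, 2]]
trans = [[0.5, 0.0], [-5.0, 0.5]]  # returning from mature protein to signal peptide is penalized
labels = viterbi(state, trans)     # [0, 0, 1, 1]
```

In this toy example the cleavage site would be read off as the boundary between the last 0 and the first 1.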

13
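HMM vs. CRF
  • Hidden Markov Models learn the joint probability
    p(labels, observations), which requires the
    likelihood p(observation|label).
  • Conditional Random Fields learn the posterior
    p(labels|observations), which is more direct.

[Figure: labels y1, y2, ..., yT and observations x1, x2, ..., xT]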
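The contrast can be written out explicitly. The block below uses the standard textbook form of a first-order HMM and of a conditional model. It is not a formula recovered from the slide. Here x_t denotes the observation (amino acid) at position t, y_t its label, and p(y_1 | y_0) stands for the initial-state distribution.

```latex
% HMM: models the joint probability of labels and observations
p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

% CRF: models the posterior of the labels given the observations directly
p(\mathbf{y} \mid \mathbf{x}), \qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})
```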
14
Advantages of CRF
  • Avoid computing the likelihood p(observation|label).
    Instead, the posterior p(label|observation) is
    computed directly.
  • Able to model long-range dependencies without
    making the inference problem intractable.
  • Guarantee a globally optimal solution.

[Figure: dependency arrows ("Depends on") drawn over the sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q]
15
CRF for Cleavage Site Prediction
[Equation with annotated terms: cleavage site, weights, length of sequence, transition features, state features, n-grams of amino acids]
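The annotated terms are consistent with the standard linear-chain CRF. A sketch of that form is given below, on the assumption that the slide follows it. The symbols λ_k and μ_k (weights), f_k (transition features), and g_k (state features, e.g. indicators of n-grams of amino acids) are notation chosen here rather than taken from the slide. T is the length of the sequence.

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\left( \sum_{t=1}^{T} \Big[ \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)
  + \sum_{k} \mu_k\, g_k(y_t, \mathbf{x}, t) \Big] \right)

Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\!\left( \sum_{t=1}^{T} \Big[ \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)
  + \sum_{k} \mu_k\, g_k(y'_t, \mathbf{x}, t) \Big] \right)
```

The normalizer Z(x) sums over all label sequences, so the model defines a distribution over whole labelings and not over individual positions.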
16
CRF for Cleavage Site Prediction
e.g., bi-grams and query sequence T Q T W A G S H S . . .
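How n-grams of amino acids might be turned into state features can be sketched as follows. The feature naming scheme (offset, colon, n-gram) and the parameters `max_n` and `window` are hypothetical choices for illustration. The feature templates actually used in the presentation are not given in the transcript.

```python
def ngram_features(seq, t, max_n=2, window=2):
    """Collect n-grams (n = 1..max_n) that start within `window` positions of t.

    Each feature is keyed by its offset from t, e.g. "-1:TW" means the
    bi-gram "TW" starts one position before t.
    """
    feats = []
    for n in range(1, max_n + 1):
        for start in range(t - window, t + window + 1):
            end = start + n
            if start >= 0 and end <= len(seq):
                feats.append(f"{start - t:+d}:{seq[start:end]}")
    return feats


seq = "TQTWAGSHS"  # query sequence from the slide
feats = ngram_features(seq, t=3, max_n=2, window=1)
# ['-1:T', '+0:W', '+1:A', '-1:TW', '+0:WA', '+1:AG']
```

Each string would act as an indicator feature paired with the label at position t.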
17
CRF for Cleavage Site Prediction
[Figure: axis label "Position"]
18
Contents
  • Introduction
  • Proteins and Their Subcellular Locations
  • Importance of Protein Cleavage-Site Prediction
  • Information in Amino Acid Sequences
  • Existing Approaches to Cleavage Site Prediction
  • Conditional Random Field (CRF)
  • CRF for Cleavage Site Prediction
  • Experiments and Results
  • Effectiveness of Different Feature Functions
  • Effect of Varying Window Size
  • Fusion with SignalP

19
Experiments
  • Data: 1937 protein sequences extracted from
    Swissprot 56.5. The cleavage-site locations of
    these sequences were biologically determined.
  • Ten-fold cross validation
  • For 1st-order state features, up to 5-grams of
    amino acids
  • For 2nd-order state features, up to bi-grams of
    amino acids.
  • Use CRF software
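The ten-fold cross-validation protocol can be sketched generically as below. The shuffling seed and the round-robin fold assignment are assumptions. The transcript does not say how the 1937 sequences were partitioned.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# 1937 sequences, as in the experiments; each sequence is tested exactly once.
splits = list(k_fold_splits(1937, k=10))
```

A model would be trained on each `train` set and evaluated on the matching `test` set, with the ten results then averaged.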

20
Results
Effectiveness of using AA Properties
Observations: (1) Amino acids provide the most relevant information. (2) Hydrophobicity and charge/polarity can help.
21
Results
Effectiveness of Different Feature Functions
  • Observations
  • Transition features alone perform poorly.
  • Once combined with state features,
    performance improves.

(Transition only)
(Transition + State)
22
Results
Effect of Varying the Window Size
e.g. query sequence T Q T W A G S H S . . .
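One way to picture the window size is as the number of residues around a position that the feature functions are allowed to see. The sketch below assumes a symmetric window centered on the position being labeled, with a padding character at the sequence ends. Both are illustrative assumptions, as are the names `window_at`, `half_width`, and `pad`.

```python
def window_at(seq, t, half_width, pad="-"):
    """Return the 2*half_width + 1 residues centered on position t, padded at the ends."""
    left = max(0, t - half_width)
    right = min(len(seq), t + half_width + 1)
    prefix = pad * (half_width - (t - left))       # positions missing before the start
    suffix = pad * (half_width - (right - 1 - t))  # positions missing after the end
    return prefix + seq[left:right] + suffix


seq = "TQTWAGSHS"
w = window_at(seq, 3, 2)  # 'QTWAG'
```

A larger `half_width` exposes more context per position but also increases the number of features.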
23
Results
Compared with Other Predictors
Observations: (1) CRF is slightly better than SignalP. (2) CRF is complementary to SignalP.
24
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
25
Web Server
http://158.132.148.85:8080/CSitePred/faces/Page1.jsp
Available in May 2009
27
Conditional Random Fields
  • Conditional Random Fields (CRFs) were originally
    designed for sequence labeling tasks such as
    Part-of-Speech (POS) tagging

[Figure labels: observations x, labels y]
  • Given a sequence of observations, a CRF attempts
    to find the most likely label sequence, i.e., it
    gives a label for each of the observations.