Application of Hidden Markov Model for Sequence Analysis and Use for Predicting Protein Localization

1
Application of Hidden Markov Model for Sequence
Analysis and Use for Predicting Protein
Localization
  • By Manchikalapati
  • Myerow
  • Shivananda
  • Monday, April 14, 2003

2
Mathematical Modeling
  • Mathematical Modeling in biology and chemistry
  • Using probabilistic models
  • Bayes' theorem and maximum likelihood estimation
  • Example: HMM

3
What is a Markov Chain?
  • A directed graph with a collection of states and
    transition probabilities between them.
  • Models a random process with a finite number of
    states.
  • Markov assumption: the chain is memoryless, so the
    probability of the current state depends only on
    the previous state. This allows us to predict
    behavior.
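
As a minimal sketch of the Markov assumption, the probability of a state path is the initial probability times one transition probability per step. The two residue classes ("H" for hydrophobic, "P" for polar) and all numbers below are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state chain over residue classes; all numbers are illustrative.
initial = {"H": 0.6, "P": 0.4}
transition = {
    "H": {"H": 0.8, "P": 0.2},
    "P": {"H": 0.5, "P": 0.5},
}

def chain_probability(path):
    """Probability of a state path under the first-order Markov assumption."""
    prob = initial[path[0]]
    # Each step depends only on the state immediately before it.
    for prev, curr in zip(path, path[1:]):
        prob *= transition[prev][curr]
    return prob

p = chain_probability(["H", "H", "P"])  # 0.6 * 0.8 * 0.2
print(p)
```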

4
Hidden Markov Model
  • A probabilistic model whose states are not
    directly observable.
  • A statistical model that describes a probability
    distribution over a number of possible sequences.
  • An HMM has the following components: states,
    symbol emission probabilities, and state
    transition probabilities.
  • Why hidden? Only the symbol sequence that the
    hidden states emit is observable.
  • Protein Modeling using HMM.

5
What is Hidden in the Markov Model?
  • The observed sequence is a probabilistic function
    of the underlying Markov chain.
  • In HMMs the state sequence is not uniquely
    determined by the observed symbol sequence, but
    must be inferred probabilistically from it.
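
The point that the state path is hidden can be sketched by sampling from a small HMM: the sampler produces both a state path and a symbol sequence, but only the symbols would be visible to an observer. The two states ("M", "I"), the two-letter alphabet, and every probability below are assumptions made up for this illustration.

```python
import random

# Hypothetical two-state HMM over a two-letter alphabet; numbers are illustrative.
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.5, "C": 0.5}}

def draw(dist, rng):
    """Sample one key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def sample(length, rng):
    """Return (hidden state path, observed symbols), both of the given length."""
    path, symbols = [], []
    state = draw(start, rng)
    for _ in range(length):
        path.append(state)
        symbols.append(draw(emit[state], rng))  # emission depends on the state
        state = draw(trans[state], rng)         # next state depends on this one
    return path, symbols

rng = random.Random(0)
path, symbols = sample(8, rng)
print("observed:", "".join(symbols))
print("hidden:  ", "".join(path))
```

Several different state paths can emit the same symbol string, which is why the path has to be inferred rather than read off.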

6
Definition of Profile
  • A profile is a description of the consensus of a
    multiple sequence alignment.
  • Alignment methods use either a position-specific
    scoring system or a position-independent
    (pairwise alignment) scoring system. Examples of
    the latter: BLAST, FASTA.

7
Profile HMM
  • Is a linear state machine consisting of a series
    of nodes, each of which corresponds roughly to a
    position (column) in the alignment from which it
    was built.
  • The HMM has a set of positions that correspond to
    the columns of a multiple alignment, and each
    position has three states: Match, Insert, and
    Delete.
  • Profile HMMs can be used to do sensitive
    database searching using statistical descriptions
    of a sequence family's consensus.
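
One way to picture how alignment columns map to profile positions is sketched below. It assumes a simple rule of thumb that is not stated in the slides: a column becomes a match position when at most half of its entries are gaps, and the remaining columns are treated as insertions. The toy alignment and the `max_gap_fraction` parameter are hypothetical.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of alignment columns that would become match positions.

    Assumption for illustration: a column is a match position when its
    gap fraction does not exceed max_gap_fraction.
    """
    n_seqs = len(alignment)
    columns = zip(*alignment)  # transpose rows into columns
    return [
        i for i, col in enumerate(columns)
        if col.count("-") / n_seqs <= max_gap_fraction
    ]

# Toy alignment; "-" marks a gap.
alignment = ["AC-GT", "AC-GA", "ACTGT", "A--GT"]
matches = match_columns(alignment)
print(matches)  # column 2 is mostly gaps, so it is left out
```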

8
Profile HMM vs. Standard Profiles
  • Profile HMMs have a formal probabilistic basis
    and have a consistent theory behind gap and
    insertion scores.
  • Profile HMMs apply a statistical method to
    estimate the true frequency of a residue at a
    given position in the alignment from its observed
    frequency.
  • In general, producing good profile HMMs requires
    less skill and manual intervention than producing
    good standard profiles.
  • Standard profile methods rely on heuristics.
  • Standard profiles use the observed frequency
    itself to assign the score for that residue.
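
The difference between using the observed frequency and estimating the true frequency can be sketched with add-one (Laplace) pseudocounts. This is an assumed stand-in: the slides do not say which estimator is used, and real profile HMM packages may use more elaborate priors. The column and alphabet below are toy values.

```python
from collections import Counter

def column_probabilities(column, alphabet, pseudocount=1.0):
    """Residue probabilities for one alignment column.

    pseudocount=0 reproduces the raw observed frequencies; a positive
    pseudocount keeps unseen residues from getting probability zero.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

column = "AAAAC"    # toy column from five aligned sequences
alphabet = "ACGT"   # DNA alphabet keeps the example short

observed = column_probabilities(column, alphabet, pseudocount=0.0)
smoothed = column_probabilities(column, alphabet, pseudocount=1.0)
print(observed["G"], smoothed["G"])
```

With raw frequencies, a residue never seen in the column scores as impossible; the smoothed estimate leaves it unlikely but allowed.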

9
Three Algorithms of HMM
  • The Viterbi algorithm: find the most probable
    state sequence.
  • The Forward/Backward algorithm: score an
    observation sequence against a model.
  • Expectation/Maximization: estimate the parameters
    of the model from the data.
  • For all HMM applications, the algorithms are
    fairly standard. Only the design of the model
    differs.

10
Application of HMM
  • Gene finding
  • Chromosome identification
  • Protein applications include database searching
    and homology detection.
  • Example: one could take a single sequence of
    interest and query it against the model to
    determine whether it contains certain domains of
    interest.

11
HMM and its basic elements
  • 1) Match states (M1, M2, ...)
  • 2) Delete states (D1, D2)
  • 3) Insert states (I0, I1)
  • 4) Begin state
  • 5) End state
  • 6) Emission probabilities
  • 7) Transition probabilities
  • 8) Parameters

12
Problems Define HMM Architecture
  • The problem at hand (given below) defines the
    architecture (to the left)
  • Finding ungapped motifs → BLOCKS
  • Finding multiple motifs → META-MEME
  • Finding protein families → Profile HMMs (Krogh)
  • HMMER2 architecture is used in SAM, HMMER.

13
HMM Profile alignment flow chart in Pfam
14
Three Important Questions that HMM should answer
  • Scoring: Q1) How likely is it that a given
    sequence came from the model?
  • Alignment: Q2) What is the optimal path for
    generating a given sequence?
  • Training: Q3) Given a set of sequences, how can
    you learn the HMM parameters?

15
Q1) How likely is it that the given sequence (ACCY)
came from the model?
  • Answer: Forward Algorithm
  • Prob(A in state I0) = 0.4 × 0.3 = 0.12
  • Prob(C in state I1) = 0.05 × 0.06 × 0.5 = 0.0015
  • Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
  • Prob(C in state M2) = (0.005 × 0.97) + (0.0015 × 0.46)
    ≈ 0.0055
  • Prob(Y in state I3) = 0.0055 × 0.015 × 0.73 × 0.01
    ≈ 6.0 × 10^-7
  • Prob(Y in state M3) = 0.0055 × 0.97 × 0.2 ≈ 0.0011
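
The forward recursion can be sketched on a much smaller model than the one on this slide. The two-state HMM below is hypothetical (its states, alphabet, and probabilities are not the slide's profile HMM), and it has no explicit end state, so the likelihood is just the sum of the final forward values.

```python
def forward(obs, states, start, trans, emit):
    """Likelihood of obs under the HMM, summed over all state paths."""
    # alpha[s]: probability of the symbols seen so far with the path ending in s.
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model; numbers are illustrative only.
states = ["M", "I"]
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.5, "C": 0.5}}

likelihood = forward("AC", states, start, trans, emit)
print(likelihood)  # about 0.1588
```

Unlike Viterbi, each cell sums over all predecessor states instead of keeping only the best one.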

16
Q2) What is the optimal path for generating a
given sequence (ACCY)?
  • Answer: Viterbi Algorithm
  • 1. The probability that the amino acid A was
    generated by state I0 is computed and entered as
    the first element of the matrix.
  • 2. The probabilities that C is emitted in state
    M1 (multiplied by the probability of the most
    likely transition to state M1 from state I0) and
    in state I1 (multiplied by the most likely
    transition to state I1 from state I0) are entered
    into the matrix element indexed by C and I1/M1.
  • 3. The maximum probability, max(I1, M1), is
    calculated.
  • 4. A pointer is set from the winner back to state
    I0.
  • 5. Steps 2-4 are repeated until the matrix is
    filled.
  • Prob(A in state I0) = 0.4 × 0.3 = 0.12
  • Prob(C in state I1) = 0.05 × 0.06 × 0.5 = 0.0015
  • Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
  • Prob(C in state M2) = 0.46 × 0.5 = 0.23
  • Prob(Y in state I3) = 0.015 × 0.73 × 0.01 ≈ 0.0001
  • Prob(Y in state M3) = 0.97 × 0.23 ≈ 0.22
  • The most likely path through the model can now be
    found by following the back-pointers.
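
The matrix fill and back-pointer traceback described on this slide can be sketched as follows. The two-state model is a hypothetical toy, not the slide's profile HMM, and the code works with raw probabilities rather than the log values a practical implementation would normally use.

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs, and that path's probability."""
    score = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        pointers, new_score = {}, {}
        for s in states:
            # Keep only the single best way of reaching state s.
            best = max(states, key=lambda r: score[r] * trans[r][s])
            pointers[s] = best
            new_score[s] = score[best] * trans[best][s] * emit[s][symbol]
        back.append(pointers)
        score = new_score
    # Trace back from the best final state by following the pointers.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

# Hypothetical toy model; numbers are illustrative only.
states = ["M", "I"]
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit = {"M": {"A": 0.9, "C": 0.1}, "I": {"A": 0.5, "C": 0.5}}

path, prob = viterbi("AC", states, start, trans, emit)
print(path, prob)  # ['M', 'I'] with probability about 0.081
```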

17
Q3) Given a set of sequences, how do you learn
the HMM parameters?
  • The learning task
  • Given: a model and a set of sequences (the
    training set)
  • Do: find the most likely parameters to explain
    the training sequences
  • The goal is to find a model that generalizes well
    to sequences we haven't seen before
  • Answer: Baum-Welch (Forward-Backward) Algorithm
  • Initialize the parameters of the model
  • Iterate until convergence:
  • Calculate the expected number of times each
    transition or emission is used
  • Adjust the parameters to maximize the
    likelihood of these expected values
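
The iterate-until-convergence loop can be sketched as below, under several assumptions: a hypothetical two-state model with no explicit end state, three made-up training sequences, unscaled probabilities (adequate only for short sequences), and a fixed number of iterations in place of a real convergence test.

```python
import math

def forward(obs, states, start, trans, emit):
    """alpha[t][s]: probability of obs[:t+1] with the path ending in state s."""
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][x] * sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    return alpha

def backward(obs, states, trans, emit):
    """beta[t][s]: probability of obs[t+1:] given state s at position t."""
    beta = [{s: 1.0 for s in states}]
    for x in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {r: sum(trans[r][s] * emit[s][x] * nxt[s] for s in states)
                        for r in states})
    return beta

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def baum_welch_step(seqs, states, alphabet, start, trans, emit):
    """One EM iteration. Returns new parameters and the log-likelihood
    of the training set under the parameters passed in."""
    start_n = {s: 0.0 for s in states}
    trans_n = {r: {s: 0.0 for s in states} for r in states}
    emit_n = {s: {a: 0.0 for a in alphabet} for s in states}
    log_likelihood = 0.0
    for obs in seqs:
        alpha = forward(obs, states, start, trans, emit)
        beta = backward(obs, states, trans, emit)
        p_obs = sum(alpha[-1].values())
        log_likelihood += math.log(p_obs)
        # E-step: expected usage counts of each start, emission, transition.
        for t, x in enumerate(obs):
            for s in states:
                gamma = alpha[t][s] * beta[t][s] / p_obs
                emit_n[s][x] += gamma
                if t == 0:
                    start_n[s] += gamma
                if t + 1 < len(obs):
                    for q in states:
                        trans_n[s][q] += (alpha[t][s] * trans[s][q]
                                          * emit[q][obs[t + 1]]
                                          * beta[t + 1][q] / p_obs)
    # M-step: renormalize expected counts into probabilities.
    return (normalize(start_n),
            {r: normalize(trans_n[r]) for r in states},
            {s: normalize(emit_n[s]) for s in states},
            log_likelihood)

# Hypothetical starting parameters and training set.
states, alphabet = ["M", "I"], "AC"
start = {"M": 0.6, "I": 0.4}
trans = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
emit = {"M": {"A": 0.7, "C": 0.3}, "I": {"A": 0.4, "C": 0.6}}
training = ["AACA", "ACAA", "CCAC"]

log_likelihoods = []
for _ in range(10):
    start, trans, emit, ll = baum_welch_step(
        training, states, alphabet, start, trans, emit)
    log_likelihoods.append(ll)
print(log_likelihoods[0], log_likelihoods[-1])
```

Each iteration should leave the training log-likelihood equal or higher, which is the usual check that the update is implemented correctly. The result is a local optimum that depends on the starting parameters.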

18
HMMER in the Workflow
19
Tripartite structure of signal peptide
20
Translocation of Signal Peptide and Signal Anchor

  • Signal peptide: after translocation, the signal
    peptide is cleaved off and the mature protein is
    released.
  • Signal anchor: the signal anchor is not cleaved
    off, and the protein is anchored to the membrane.

21
Two HMM Models for Signal Peptides: First Model
  • (Nielsen, H and Krogh A. Prediction of signal
    peptides and signal anchors by a hidden Markov
    model. Proc. Sixth Int. Conf on Intelligent
    Systems for Molecular Biology, 122-130. AAAI
    Press, 1998.)
  • Model not based on Multiple sequence alignment
    (profile)
  • Compare model to neural network in eukaryotes and
    prokaryotes

22
The model used for signal peptides. The states in
a shaded box are tied to each other.
23
Combined Model
  • The model of signal anchors has only two types of
    states (grouped by the shaded boxes) apart from
    the Met state.
  • The final states shown in the shaded box are tied
    to each other and model all residues not in a
    signal peptide or an anchor.

24
Hidden Markov model (HMM) vs. neural network
(NN)
  • Cleavage site location: percentage of signal
    peptide sequences where the cleavage site was
    placed correctly.
  • Discrimination values: correlation coefficients
    (Matthews 1975).
  • Protein types: signal peptides (sig), cytoplasmic
    or nuclear proteins (non-sec), and signal anchors
    (anc).
  • NN simple: S-score. NN combined: Y-score.
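
The correlation coefficient used as the discrimination value can be computed from a two-class confusion table, as in the sketch below. The counts in the example are hypothetical and are not results from the comparison on this slide.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-table counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction). Returns 0.0 when undefined.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: signal peptides are the positive class.
mcc = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
print(round(mcc, 3))
```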

25
Second model for Signal Peptide
  • Barash S, Wang W, and Shi Y. Human secretory
    signal peptide description by hidden Markov model
    and generation of a strong artificial signal
    peptide for secreted protein expression. Biochem
    and Biophys Res Com 294, 835-842, 2002.
  • Profile HMM method using HMMER software

26
Steps for Model Building with HMMER
  • N-terminal region of 416 non-redundant human
    secreted proteins
  • Training alignment in hmmalign: all start Met
    aligned in first column, 406/416 cleavage sites
    aligned
  • Build model with MLL estimation (random model:
    Swiss Prot 34)
  • Evaluate alignment model: 416/416 start Met,
    406/416 cleavage site, 416/416 h-region
  • Re-estimate HMM with maximum discrimination method

27
Model Validation
  • Used the hmmemit program to generate artificial
    sequences of variable bit scores
  • In vitro validation using secretion test plasmid
    constructs: in secretory alkP with the native
    signal peptide replaced by HMM signal peptides,
    the signal strengths correlate with the bit
    scores (transcription or translation effect?)
  • Ranked signal strengths of known natural human
    secretory proteins: above-average serum proteins
    such as albumin were found to have high bit
    scores
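
A bit score is a base-2 log-odds ratio: how much more probable a sequence is under the model than under a background (null) model. The sketch below simplifies heavily and assumes an ungapped, position-specific model with a two-letter alphabet; the probabilities are invented for illustration and this is not how any particular program computes its scores.

```python
import math

def bit_score(seq, model_probs, null_probs):
    """Log-odds score in bits for an ungapped position-specific model.

    model_probs[i][aa]: probability of residue aa at position i (assumed).
    null_probs[aa]: background probability of residue aa (assumed).
    """
    return sum(
        math.log2(model_probs[i][aa] / null_probs[aa])
        for i, aa in enumerate(seq)
    )

# Hypothetical two-position model over a two-letter alphabet.
null_probs = {"A": 0.5, "L": 0.5}
model_probs = [
    {"A": 0.25, "L": 0.75},
    {"A": 0.75, "L": 0.25},
]

good = bit_score("LA", model_probs, null_probs)  # matches the model's preferences
poor = bit_score("AL", model_probs, null_probs)  # goes against them
print(good, poor)
```

A positive score means the model explains the sequence better than the background does, and a negative score means the opposite.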

28
Conclusion
  • HMMs and their applicability to sequence analysis
    have been discussed
  • Two different HMM architectures for modeling the
    signal peptide have been shown
  • Both are able to perform the task of separating
    secreted proteins from cytoplasmic and nuclear
    proteins with excellent discrimination
  • Discrimination of signal peptides from signal
    anchors is a little less clean
  • Multiple modeling strategies may be beneficial
    depending on the nature of the query and
    available data for training