HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY - PowerPoint PPT Presentation

Provided by: AimeeMc8
Learn more at: http://www.evl.uic.edu
1
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
  • CS 594 An Introduction to Computational
    Molecular Biology
  • BY
  • Shalini Venkataraman
  • Vidhya Gunaseelan

2
Relationship Between DNA, RNA And Proteins
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
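The three lines on this slide can be reproduced with a short sketch. It assumes the DNA shown is the coding strand (so transcription is just T to U) and that the standard genetic code applies; the function names `transcribe` and `translate` are illustrative, not from the slides.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U (an assumption)."""
    return dna.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon; '*' marks a stop codon."""
    dna = mrna.replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
peptide = translate(mrna)
print(mrna)
print(peptide)
```

Each amino acid is written with its one-letter code, which is why the slide's example sequence spells a word.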
3
Protein Structure
Primary Structure of Proteins
The primary structure of peptides and proteins
refers to the linear number and order of the
amino acids present.
4
Protein Structure
Secondary Structure
Protein secondary structure refers to regular,
repeated patterns of folding of the protein
backbone. How a protein folds is largely dictated
by the primary sequence of amino acids.
Beta Sheet
Alpha Helix
5
Multiple Alignment Process
  • Process of aligning three or more sequences with
    each other
  • Generalization of the algorithm to align two
    sequences
  • Local multiple alignment uses Sum of pairs
    scoring scheme
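As a rough illustration of sum-of-pairs scoring, the sketch below scores each alignment column by summing a pairwise score over every pair of rows. The match, mismatch and gap values are arbitrary assumptions chosen for the example; the slides do not specify a scoring matrix.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (values are illustrative)."""
    if a == "-" and b == "-":
        return 0  # two gaps aligned with each other cost nothing here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair_score over all row pairs in every column."""
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )


score = sum_of_pairs(["AC-", "ACG", "A-G"])
print(score)
```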

6
HMM Architecture
  • Markov Chains
  • What is a Hidden Markov Model(HMM)?
  • Components of HMM
  • Problems of HMMs

7
Markov Chains
Rain
Sunny
Cloudy
States: three states (sunny, cloudy, rainy).
State transition matrix: the probability of the
weather given the previous day's weather.
Initial distribution: the probability of the
system being in each of the states at time 0.
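A minimal sketch of the weather chain: the probability of a run of days is the initial probability of the first state times one transition probability per subsequent day. All numbers below are made up for illustration; the slide's actual matrix is not in the transcript.

```python
# Hypothetical initial distribution and transition matrix (each row sums to 1).
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
TRANSITION = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.375, "rainy": 0.125},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,   "rainy": 0.25},
    "rainy":  {"sunny": 0.25, "cloudy": 0.375, "rainy": 0.375},
}


def sequence_probability(states):
    """P(s1, ..., sn) = P(s1) * product of P(s_t | s_{t-1})."""
    p = INITIAL[states[0]]
    for previous, current in zip(states, states[1:]):
        p *= TRANSITION[previous][current]
    return p


p = sequence_probability(["sunny", "sunny", "rainy"])
print(p)
```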
8
Hidden Markov Models
Hidden states: the (true) states of a system
that may be described by a Markov process (e.g.,
the weather).
Observable states: the states of the process that
are "visible" (e.g., seaweed dampness).
9
Components Of HMM
Output matrix: contains the probability of
observing a particular observable state given
that the hidden model is in a particular hidden
state.
Initial distribution: contains the probability of
the (hidden) model being in a particular hidden
state at time t = 1.
State transition matrix: holds the probability of
a hidden state given the previous hidden state.
10
Example-HMM

Transition Prob.
Output Prob.
Scoring a Sequence with an HMM: the probability
of ACCY along this path is
.4 × .3 × .46 × .6 × .97 × .5 × .015 × .73 × .01 × 1
= 1.76 × 10^-6.
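The arithmetic on this slide can be checked directly. The sketch multiplies the ten factors listed for the ACCY path and also sums their logarithms, which is how such products are usually computed for longer sequences to avoid underflow. The transcript does not say which factors are transition and which are output probabilities, so they are treated as an unlabeled list.

```python
import math

# The ten factors quoted on the slide for the path that generates ACCY.
path_probabilities = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

p = math.prod(path_probabilities)                      # direct product
log_p = sum(math.log(x) for x in path_probabilities)   # same quantity in log space

print(f"{p:.3g}")
print(log_p)
```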
11
Problems With HMM
Scoring problem: given an existing HMM and an
observed sequence, what is the probability that
the HMM generates the sequence?
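The scoring problem is usually solved with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below uses a hypothetical two-state weather HMM with seaweed observations; every number in it is an assumption for illustration, not a value from the slides.

```python
STATES = ("sunny", "rainy")
# Hypothetical parameters for a toy weather/seaweed HMM.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.5, "damp": 0.5},
    "rainy": {"dry": 0.1, "damp": 0.9},
}


def forward_probability(observations):
    """Return P(observations | model), summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[r] * TRANSITION[r][s] for r in STATES) * EMISSION[s][obs]
            for s in STATES
        }
    return sum(alpha.values())


p = forward_probability(["dry", "damp"])
print(p)
```

The cost grows linearly with sequence length and quadratically with the number of states, whereas enumerating paths grows exponentially.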
12
Problems With HMM
  • Alignment Problem
  • Given a sequence, what is the optimal state
    sequence that the HMM would use to generate it?
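The alignment problem is commonly solved with the Viterbi algorithm, which keeps only the best path into each state at each step. This sketch uses a hypothetical two-state weather HMM with made-up parameters; none of the numbers come from the slides.

```python
STATES = ("sunny", "rainy")
# Hypothetical parameters for a toy weather/seaweed HMM.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.5, "damp": 0.5},
    "rainy": {"dry": 0.1, "damp": 0.9},
}


def viterbi(observations):
    """Return (most probable hidden path, probability of that path)."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: delta[r] * TRANSITION[r][s])
            pointer[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANSITION[best_prev][s] * EMISSION[s][obs]
        delta = new_delta
        backpointers.append(pointer)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]


path, probability = viterbi(["dry", "damp", "damp"])
print(path, probability)
```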

13
Problems With HMM
  • Training Problem
  • Given a large amount of data, how can we estimate
    the structure and the parameters of the HMM that
    best account for the data?

14
HMMs in Biology
  • Gene finding and prediction
  • Protein-Profile Analysis
  • Secondary Structure prediction
  • Advantages
  • Limitations

15
Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
16
What is a (protein-coding) gene?
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
17
In more detail (color state)
(Left)
(Removed)
18
Gene Finding HMMs
  • Our Objective
  • To find the coding and non-coding regions of an
    unlabeled string of DNA nucleotides
  • Our Motivation
  • Assist in the annotation of genomic data produced
    by genome sequencing methods
  • Gain insight into the mechanisms involved in
    transcription, splicing and other processes

19
Why HMMs
  • Classification: classifying observations within a
    sequence
  • Order: a DNA sequence is a set of ordered
    observations
  • Grammar: our grammatical structure (and the
    beginnings of our architecture) is right here
  • Success measure: complete exons correctly
    labeled
  • Training data: available from various genome
    annotation projects

20
HMMs for gene finding
  • Training - Expectation Maximization (EM)
  • Parsing - Viterbi algorithm

An HMM for unspliced genes. x: non-coding DNA;
c: coding state.
21
Genefinders- a comparison
Sn: Sensitivity; Sp: Specificity; Ac: Approximate
Correlation; ME: Missing Exons; WE: Wrong Exons
GENSCAN Performance Data,
http://genes.mit.edu/Accuracy.html
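For readers unfamiliar with these abbreviations, here is a hedged sketch of nucleotide-level Sn, Sp and Ac. It assumes the convention common in gene-finder evaluation, where Sp is the fraction of predicted coding bases that are truly coding (TP / (TP + FP)); whether the comparison table on this slide used exactly these definitions is an assumption. Labels follow the x (non-coding) and c (coding) notation used earlier, and the function name is illustrative.

```python
def nucleotide_accuracy(true_labels, predicted_labels, coding="c"):
    """Return (Sn, Sp, Ac) for two equal-length label strings.

    Assumes both classes occur in the truth and in the prediction;
    otherwise a denominator is zero.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == coding and p == coding for t, p in pairs)
    fn = sum(t == coding and p != coding for t, p in pairs)
    fp = sum(t != coding and p == coding for t, p in pairs)
    tn = sum(t != coding and p != coding for t, p in pairs)
    sn = tp / (tp + fn)  # fraction of true coding bases that were found
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are correct
    # Approximate correlation: average of four conditional probabilities,
    # rescaled to the range [-1, 1].
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    return sn, sp, ac


sn, sp, ac = nucleotide_accuracy("xxccccxx", "xxcccxxx")
print(sn, sp, ac)
```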
22
Protein Profile HMMs
  • Motivation
  • Given a single amino acid target sequence of
    unknown structure, we want to infer the structure
    of the resulting protein. Use Profile Similarity
  • What is a Profile?
  • Protein families of related sequences and
    structures
  • Same function
  • Clear evolutionary relationship
  • Patterns of conservation, some positions are more
    conserved than the others

23
An Overview
Aligned Sequences Build a Profile HMM (Training)
Database search
Multiple alignments (Viterbi)
Query against Profile HMM database (Forward)
24
Building from an existing alignment
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
insertion
Transition probabilities
Output probabilities
An HMM for a DNA motif alignment. The transitions
are shown with arrows whose thickness indicates
their probability. In each state, the histogram
shows the probabilities of the four bases.
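One way to obtain the output probabilities in such a model is to count residues in each conserved column of the alignment. The sketch below does this for the five motif sequences on this slide (as reconstructed from the transcript), treating the three gapped middle columns as the insertion region. It uses plain frequencies with no pseudocounts, which is a simplifying assumption.

```python
from collections import Counter

# The five aligned sequences from the slide, one string per row.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]


def column_frequencies(alignment, column):
    """Relative frequency of each base in one column, ignoring gaps."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    counts = Counter(residues)
    return {base: counts[base] / len(residues) for base in "ACGT"}


# Columns 3-5 hold the insertion; the rest become the main (match) states.
match_columns = [0, 1, 2, 6, 7, 8]
emissions = [column_frequencies(ALIGNMENT, c) for c in match_columns]

for column, freqs in zip(match_columns, emissions):
    print(column, freqs)
```

In practice pseudocounts are usually added so that a base never seen in a column does not get probability zero.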
25
Building Final Topology
Deletion states
Matching states
Insertion states
Number of matching states = average sequence
length in the family
PFAM: database of protein families
(http://pfam.wustl.edu)
26
Database Searching
  • Given an HMM, M, for a sequence family, find all
    members of the family in a database.
  • LL score: LL(x) = log P(x|M)
  • (The LL score is length dependent, so it must be
    normalized or replaced by a Z-score)
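The two normalizations mentioned above can be sketched as follows. The per-residue division and the plain Z-score against a set of background scores are simplified assumptions; the slides do not say which normalization was used, and real tools typically compare against sequences of similar length. Function names are illustrative.

```python
import math
from statistics import mean, stdev


def log_likelihood(step_probabilities):
    """LL(x) = log P(x|M), given the factors of P(x|M)."""
    return sum(math.log(p) for p in step_probabilities)


def per_residue_score(ll, length):
    """Simple length normalization: average log-likelihood per residue."""
    return ll / length


def z_score(score, background_scores):
    """How many standard deviations a score lies above the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


ll = log_likelihood([0.5, 0.5])
normalized = per_residue_score(ll, 2)
z = z_score(-10.0, [-30.0, -20.0, -40.0])  # hypothetical background scores
print(ll, normalized, z)
```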

27
Query a new sequence
Suppose I have a query protein sequence and want
to know which family it belongs to. There can be
many paths leading to the generation of this
sequence. We need to find all these paths and sum
their probabilities.
Consensus sequence: ACAC--ATC
P(ACACATC) = 0.8×1 × 0.8×1 × 0.8×0.6 × 0.4×0.6 ×
1×1 × 0.8×1 × 0.8 = 4.7 × 10^-2
28
Multiple Alignments
  • Try every possible path through the model that
    would produce the target sequences
  • Keep the best one and its probability.
  • Output: sequence of match, insert and delete
    states
  • Viterbi algorithm: dynamic programming

29
Building from unaligned sequences
  • Baum-Welch: expectation-maximization method
  • Start with a model whose length matches the
    average length of the sequences and with random
    output and transition probabilities.
  • Align all the sequences to the model.
  • Use the alignment to alter the output and
    transition probabilities
  • Repeat. Continue until the model stops changing.
  • By-product: it produces a multiple alignment
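The loop described above can be sketched for a small generic HMM. This is not a profile HMM and not the authors' implementation: it runs Baum-Welch re-estimation on a single integer-coded sequence with two hidden states, without the scaling that longer sequences would need. The starting parameters and the observation sequence are arbitrary assumptions.

```python
def forward(obs, init, trans, emit):
    """alpha[t][i] = P(obs[0..t], state at time t = i)."""
    n = len(init)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    """beta[t][i] = P(obs[t+1..] | state at time t = i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, init, trans, emit):
    """One EM update; also returns the likelihood under the old parameters."""
    n, m, T = len(init), len(emit[0]), len(obs)
    alpha = forward(obs, init, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[i][j]: expected number of i -> j transitions
    xi = [[sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
               for t in range(T - 1)) / likelihood
           for j in range(n)] for i in range(n)]
    new_init = gamma[0]
    new_trans = [[xi[i][j] / sum(xi[i]) for j in range(n)] for i in range(n)]
    occupancy = [sum(gamma[t][i] for t in range(T)) for i in range(n)]
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) / occupancy[i]
                 for k in range(m)] for i in range(n)]
    return new_init, new_trans, new_emit, likelihood


# Arbitrary toy data and starting parameters (assumptions for illustration).
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.6, 0.4], [0.3, 0.7]]

history = []
for _ in range(10):
    init, trans, emit, likelihood = baum_welch_step(obs, init, trans, emit)
    history.append(likelihood)
print(history)
```

Each update cannot decrease the likelihood of the training sequence, which is why the procedure can stop once the model stops changing. It may still settle on a local rather than a global maximum.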

30
PHMM Example
An alignment of 30 short amino acid sequences
chopped out of an alignment of the SH3 domain. The
shaded areas are the most conserved and were
represented by the main states in the HMM. The
unshaded area was represented by an insert state.
31
Prediction of Protein Secondary structures
  • Prediction of secondary structures is needed for
    the prediction of protein function.
  • Analyze the amino-acid sequences of proteins
  • Learn secondary structures
  • helix, sheet and turn
  • Predict the secondary structures of sequences

32
Advantages
  • Characterize an entire family of sequences.
  • Position-dependent character distributions and
    position-dependent insertion and deletion gap
    penalties.
  • Built on a formal probabilistic basis
  • Can make libraries of hundreds of profile HMMs
    and apply them on a large scale (whole genome)

33
Limitations
  • Markov Chains
  • Probabilities of states are supposed to be
    independent
  • P(y) must be independent of P(x), and vice versa
  • This usually isn't true

P(x)
P(y)

34
Limitations - contd
  • Standard Machine Learning Problems
  • Watch out for local maxima
  • Model may not converge to a truly optimal
    parameter set for a given training set
  • Avoid over-fitting
  • You're only as good as your training set
  • More training is not always good

35
CONCLUSION
  • For links and slides
  • www.evl.uic.edu/shalini/hmm/