HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY - PowerPoint PPT Presentation

About This Presentation

Title:

HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY


Provided by: AimeeMc8

Learn more at: http://www.evl.uic.edu

Transcript and Presenter's Notes

1
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY

CS 594 An Introduction to Computational
Molecular Biology
BY
Shalini Venkataraman
Vidhya Gunaseelan

2
Relationship Between DNA, RNA And Proteins
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
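The DNA to RNA to protein relationship on this slide can be checked with a short sketch. The codon table below is deliberately partial (only the seven codons that occur in the slide's sequence), and the function names are illustrative assumptions rather than part of the presentation.

```python
# Partial codon table: only the codons that occur in the slide's example.
# (Assumption for brevity; the full genetic code has 64 codons.)
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Return the mRNA matching the coding strand (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time and map each codon to an amino acid."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
peptide = translate(rna)
print(rna)      # CCUGAGCCAACUAUUGAUGAA
print(peptide)  # PEPTIDE
```

The slide's sequence was evidently chosen so that its one-letter amino acid codes spell the word PEPTIDE.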
3
Protein Structure
Primary Structure of Proteins
The primary structure of peptides and proteins
refers to the linear number and order of the
amino acids present.
4
Protein Structure
Secondary Structure
Protein secondary structure refers to regular,
repeated patterns of folding of the protein
backbone. How a protein folds is largely dictated
by the primary sequence of amino acids.
Beta Sheet
Alpha Helix
5
Multiple Alignment Process

Process of aligning three or more sequences with
each other
Generalization of the algorithm to align two
sequences
Local multiple alignment uses Sum of pairs
scoring scheme
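As a rough illustration of sum-of-pairs scoring, the sketch below scores every pair of characters in each alignment column and adds the results. The match, mismatch and gap values and the toy alignment are assumptions chosen for the example; real tools typically use a substitution matrix instead.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (scoring values are assumptions)."""
    if a == "-" and b == "-":
        return 0  # two gaps aligned to each other are not penalized here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add the pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


toy_alignment = ["ACG", "A-G", "TCG"]  # hypothetical three-sequence alignment
print(sum_of_pairs(toy_alignment))  # -1
```

Column by column the toy alignment scores -1, -3 and 3, which sums to -1.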

6
HMM Architecture

Markov Chains
What is a Hidden Markov Model(HMM)?
Components of HMM
Problems of HMMs

7
Markov Chains
States: three states (sunny, cloudy, rainy).
State transition matrix: the probability of the
weather given the previous day's weather.
Initial distribution: the probability of the
system being in each of the states at time 0.
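The three ingredients of a Markov chain (states, transition matrix, initial distribution) can be written down directly. This is a minimal sketch; all probabilities are hypothetical, since the slide's figure with the actual numbers is not part of the transcript.

```python
STATES = ["sunny", "cloudy", "rainy"]

# Hypothetical numbers, chosen only for illustration.
INITIAL = [1.0, 0.0, 0.0]  # the system starts in "sunny" at time 0
TRANSITION = [             # rows: today's weather, columns: tomorrow's
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.25, 0.25, 0.5],
]


def step(distribution, transition):
    """Advance a distribution over states by one day."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


def sequence_probability(sequence, initial, transition):
    """Probability of one specific sequence of weather states."""
    idx = [STATES.index(s) for s in sequence]
    p = initial[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= transition[prev][cur]
    return p


day1 = step(INITIAL, TRANSITION)
print(day1)  # [0.5, 0.25, 0.25]
print(sequence_probability(["sunny", "sunny", "rainy"], INITIAL, TRANSITION))  # 0.125
```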
8
Hidden Markov Models
Hidden states: the (true) states of a system
that may be described by a Markov process (e.g.,
the weather).
Observable states: the states of the process
that are visible (e.g., seaweed dampness).
9
Components Of HMM
Output matrix: contains the probability of
observing a particular observable state given
that the hidden model is in a particular hidden
state.
Initial distribution: contains the probability
of the (hidden) model being in a particular
hidden state at time t = 1.
State transition matrix: holds the probability
of a hidden state given the previous hidden
state.
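One way to see how the three components fit together is to use them generatively: draw a hidden state, emit an observation from it, then move to the next hidden state. The sketch below follows the weather and seaweed example; every probability and the observation names ("dry", "damp", "soggy") are assumptions made for illustration.

```python
import random

# All numbers below are hypothetical.
INITIAL = {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.25}
TRANSITION = {  # P(next hidden state | current hidden state)
    "sunny": {"sunny": 0.5, "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5, "rainy": 0.25},
    "rainy": {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},
}
OUTPUT = {  # P(observed seaweed state | hidden weather state)
    "sunny": {"dry": 0.5, "damp": 0.25, "soggy": 0.25},
    "cloudy": {"dry": 0.25, "damp": 0.5, "soggy": 0.25},
    "rainy": {"dry": 0.25, "damp": 0.25, "soggy": 0.5},
}


def draw(distribution, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, rng):
    """Generate a hidden path and the observations it emits."""
    hidden, observed = [], []
    state = draw(INITIAL, rng)
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(OUTPUT[state], rng))
        state = draw(TRANSITION[state], rng)
    return hidden, observed


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden_path, observations = sample(5, rng)
print(hidden_path)
print(observations)
```

In practice only the second list is available to an observer; the first is what the HMM algorithms try to reason about.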
10
Example-HMM

Transition Prob.
Output Prob.
Scoring a sequence with an HMM: the probability
of ACCY along this path is
.4 × .3 × .46 × .6 × .97 × .5 × .015 × .73 × .01 × 1 = 1.76 × 10^-6.
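The arithmetic on this slide is a plain product of the transition and output probabilities along one path. A quick check, using the factors exactly as they are listed on the slide:

```python
import math

# Factors copied from the slide, in the order given.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)
print(f"{path_probability:.3g}")  # 1.76e-06

# Summing logarithms gives the same result and avoids numerical underflow
# when sequences get long.
log_probability = sum(math.log(f) for f in factors)
print(math.isclose(math.exp(log_probability), path_probability))  # True
```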
11
Problems With HMM
Scoring problem: given an existing HMM and an
observed sequence, what is the probability that
the HMM generates the sequence?
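The scoring problem is usually solved with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below uses a hypothetical two-state weather model (all numbers are assumptions) and includes a brute-force sum over every path purely as a cross-check.

```python
from itertools import product

# Hypothetical two-state model; every number here is an assumption.
STATES = ("sunny", "rainy")
INITIAL = {"sunny": 0.5, "rainy": 0.5}
TRANSITION = {
    "sunny": {"sunny": 0.75, "rainy": 0.25},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}
OUTPUT = {
    "sunny": {"dry": 0.75, "damp": 0.25},
    "rainy": {"dry": 0.25, "damp": 0.75},
}


def forward(obs):
    """P(obs | model), computed in O(len(obs) * states^2) time."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: INITIAL[s] * OUTPUT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {
            s: OUTPUT[s][o] * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def brute_force(obs):
    """Same quantity by explicitly summing over every hidden path."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INITIAL[path[0]] * OUTPUT[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= TRANSITION[prev][cur] * OUTPUT[cur][o]
        total += p
    return total


observations = ["dry", "damp", "damp", "dry"]
print(forward(observations))
print(abs(forward(observations) - brute_force(observations)) < 1e-12)  # True
```

The brute-force version visits a number of paths that grows exponentially with sequence length, which is why the dynamic-programming form is the one used in practice.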
12
Problems With HMM

Alignment Problem
Given a sequence, what is the optimal state
sequence that the HMM would use to generate it?
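The alignment problem is typically solved with the Viterbi algorithm, which keeps only the best path into each state at every step. This is a minimal sketch on a hypothetical two-state model; the states, observations and probabilities are all assumptions.

```python
# Hypothetical two-state model; every number here is an assumption.
STATES = ("sunny", "rainy")
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
OUTPUT = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.3, "damp": 0.7},
}


def viterbi(obs):
    """Return the most probable hidden path and its joint probability."""
    # delta[s] = probability of the best path that ends in state s
    delta = {s: INITIAL[s] * OUTPUT[s][obs[0]] for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: delta[r] * TRANSITION[r][s])
            pointers[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANSITION[best_prev][s] * OUTPUT[s][o]
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


best_path, best_probability = viterbi(["dry", "damp", "damp"])
print(best_path)  # ['sunny', 'rainy', 'rainy']
print(round(best_probability, 6))  # 0.042336
```

For long sequences the products become very small, so implementations usually work with sums of log-probabilities instead.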

13
Problems With HMM

Training Problem
Given a large amount of data, how can we estimate
the structure and the parameters of the HMM that
best account for the data?

14
HMMs in Biology

Gene finding and prediction
Protein-Profile Analysis
Secondary Structure prediction
Advantages
Limitations

15
Finding genes in DNA sequence
This is one of the most challenging and
interesting problems in computational biology at
the moment. With so many genomes being sequenced
so rapidly, it remains important to begin by
identifying genes computationally.
16
What is a (protein-coding) gene?
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
17
In more detail (color state)
(Left)
(Removed)
18
Gene Finding HMMs

Our Objective
To find the coding and non-coding regions of an
unlabeled string of DNA nucleotides
Our Motivation
Assist in the annotation of genomic data produced
by genome sequencing methods
Gain insight into the mechanisms involved in
transcription, splicing and other processes

19
Why HMMs

Classification: classifying observations within a
sequence
Order: a DNA sequence is a set of ordered
observations
Grammar: our grammatical structure (and the
beginnings of our architecture) is right here
Success measure: proportion of complete exons
correctly labeled
Training data: available from various genome
annotation projects

20
HMMs for gene finding

Training: Expectation Maximization (EM)
Parsing: Viterbi algorithm

An HMM for unspliced genes. x: non-coding DNA; c:
coding state
21
Genefinders- a comparison
Sn = Sensitivity, Sp = Specificity, Ac =
Approximate Correlation, ME = Missing Exons, WE =
Wrong Exons
Source: GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
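The comparison table itself is not in the transcript, but the measures can be sketched. The code below assumes the nucleotide-level definitions commonly used in gene-finder evaluations, where Sp is the fraction of predicted coding nucleotides that are truly coding (not the statistician's true-negative rate). The counts are hypothetical and are not GENSCAN results.

```python
def gene_finder_accuracy(tp, fp, tn, fn):
    """Sensitivity, specificity and approximate correlation from
    nucleotide-level counts (definitions assumed, see lead-in)."""
    sn = tp / (tp + fn)  # share of true coding bases that were predicted
    sp = tp / (tp + fp)  # share of predicted coding bases that are correct
    # Approximate correlation: average of four conditional probabilities,
    # rescaled to the range [-1, 1].
    ac = 0.5 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    ) - 1
    return sn, sp, ac


# Hypothetical counts for a 1000-base test sequence.
sn, sp, ac = gene_finder_accuracy(tp=80, fp=20, tn=880, fn=20)
print(f"Sn={sn:.2f} Sp={sp:.2f} AC={ac:.2f}")
```

A perfect prediction (no false positives or false negatives) gives Sn = Sp = AC = 1 under these definitions.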
22
Protein Profile HMMs

Motivation
Given a single amino acid target sequence of
unknown structure, we want to infer the structure
of the resulting protein using profile similarity.
What is a Profile?
Protein families: groups of related sequences and
structures
Same function
Clear evolutionary relationship
Patterns of conservation: some positions are more
conserved than others

23
An Overview
Aligned Sequences Build a Profile HMM (Training)
Database search
Multiple alignments (Viterbi)
Query against Profile HMM database (Forward)
24
Building from an existing alignment
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
insertion
Transition probabilities
Output probabilities
An HMM for a DNA motif alignment. The
transitions are shown with arrows whose thickness
indicates their probability. In each state, the
histogram shows the probabilities of the four
bases.
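The output probabilities in such a model can be estimated by counting bases in the alignment columns. The sketch below rebuilds the five aligned sequences from the slide and treats the first three and last three columns as match states and the gapped middle columns as a single insert state; that column assignment is an assumption, since the figure is not part of the transcript. No pseudocounts are added, which a real profile builder normally would do.

```python
from collections import Counter

ALIGNMENT = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]
MATCH_COLUMNS = [0, 1, 2, 6, 7, 8]  # assumed match states
INSERT_COLUMNS = range(3, 6)        # assumed insert region


def frequencies(symbols):
    """Relative frequency of each base, ignoring gap characters."""
    counts = Counter(s for s in symbols if s != "-")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


match_emissions = [
    frequencies(row[col] for row in ALIGNMENT) for col in MATCH_COLUMNS
]
insert_emissions = frequencies(
    row[col] for row in ALIGNMENT for col in INSERT_COLUMNS
)
# Fraction of sequences that enter the insert state after the third match.
to_insert = sum(
    any(row[col] != "-" for col in INSERT_COLUMNS) for row in ALIGNMENT
) / len(ALIGNMENT)

print(match_emissions[0])  # {'A': 0.8, 'C': 0.0, 'G': 0.0, 'T': 0.2}
print(insert_emissions)    # {'A': 0.2, 'C': 0.4, 'G': 0.2, 'T': 0.2}
print(to_insert)           # 0.6
```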
25
Building Final Topology
Deletion states
Matching states
Insertion states
Number of matching states = average sequence
length in the family
PFAM: database of protein families
(http://pfam.wustl.edu)
26
Database Searching

Given an HMM, M, for a sequence family, find all
members of the family in a database.
LL score: LL(x) = log P(x | M)
(The LL score is length dependent, so it must be
normalized or converted to a Z-score.)

27
Query a new sequence
Suppose I have a query protein sequence and I am
interested in which family it belongs to. There
can be many paths leading to the generation of
this sequence. We need to find all these paths
and sum their probabilities.
Consensus sequence:
P(ACACATC) = 0.8×1 × 0.8×1 × 0.8×0.6 × 0.4×0.6 × 1×1 × 0.8×1 × 0.8 ≈ 4.7 × 10^-2
ACAC--ATC
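The consensus probability above can be reproduced by multiplying the listed factors, and the same number gives the log-likelihood (LL) score. Dividing by the sequence length, as done at the end, is only one simple way to reduce length dependence and is an assumption here, not necessarily the normalization the presenters had in mind.

```python
import math

# Factors in the order given on the slide: each output probability is
# followed by the transition probability to the next state.
factors = [0.8, 1.0, 0.8, 1.0, 0.8, 0.6, 0.4, 0.6, 1.0, 1.0, 0.8, 1.0, 0.8]
sequence = "ACACATC"

probability = math.prod(factors)
ll_score = math.log(probability)
ll_per_residue = ll_score / len(sequence)  # assumed normalization

print(f"P = {probability:.3g}")  # P = 0.0472
print(f"LL = {ll_score:.3f}")
print(f"LL per residue = {ll_per_residue:.3f}")
```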
28
Multiple Alignments

Try every possible path through the model that
would produce the target sequences
Keep the best one and its probability.
Output: sequence of match, insert and delete
states
Viterbi algorithm: dynamic programming

29
Building from unaligned sequences

Baum-Welch: an expectation-maximization method
Start with a model whose length matches the
average length of the sequences and with random
output and transition probabilities.
Align all the sequences to the model.
Use the alignment to alter the output and
transition probabilities.
Repeat. Continue until the model stops changing.
By-product: it produces a multiple alignment.
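The re-estimation loop can be sketched for a small, generic discrete HMM. This is not a profile HMM with match, insert and delete states; it only illustrates the Baum-Welch update itself. The training sequences and starting parameters are hypothetical, and the start is fixed rather than random so the run is repeatable.

```python
import math


def forward_backward(obs, init, trans, emit):
    """Forward and backward tables plus P(obs | model)."""
    n, length = len(init), len(obs)
    alpha = [[0.0] * n for _ in range(length)]
    beta = [[1.0] * n for _ in range(length)]
    for i in range(n):
        alpha[0][i] = init[i] * emit[i][obs[0]]
    for t in range(1, length):
        for j in range(n):
            alpha[t][j] = emit[j][obs[t]] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(n))
    for t in range(length - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                for j in range(n))
    return alpha, beta, sum(alpha[-1])


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def baum_welch_step(seqs, init, trans, emit):
    """One EM update. Also returns the log-likelihood of the OLD parameters."""
    n, m = len(init), len(emit[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    loglik = 0.0
    for obs in seqs:
        alpha, beta, px = forward_backward(obs, init, trans, emit)
        loglik += math.log(px)
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / px  # P(state i at t | obs)
                emit_acc[i][obs[t]] += gamma
                if t == 0:
                    init_acc[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):  # expected i -> j transitions
                        trans_acc[i][j] += (
                            alpha[t][i] * trans[i][j]
                            * emit[j][obs[t + 1]] * beta[t + 1][j] / px)
    return (normalize(init_acc), [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc], loglik)


# Hypothetical data: symbols are encoded as integers 0 and 1.
sequences = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0], [0, 1, 1, 1, 1]]
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.8, 0.2], [0.3, 0.7]]

history = []
for _ in range(10):
    init, trans, emit, loglik = baum_welch_step(sequences, init, trans, emit)
    history.append(loglik)
print([round(x, 4) for x in history])
```

Each update should leave the log-likelihood unchanged or higher, which is the sense in which the model "stops changing"; the procedure can still settle in a local maximum.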

30
PHMM Example
An alignment of 30 short amino acid sequences
chopped out of an alignment of the SH3 domain. The
shaded areas are the most conserved and were
represented by the main states in the HMM. The
unshaded area was represented by an insert state.
31
Prediction of Protein Secondary structures

Prediction of secondary structures is needed for
the prediction of protein function.
Analyze the amino-acid sequences of proteins
Learn secondary structures
helix, sheet and turn
Predict the secondary structures of sequences

32
Advantages

Characterize an entire family of sequences.
Position-dependent character distributions and
position-dependent insertion and deletion gap
penalties.
Built on a formal probabilistic basis
Can make libraries of hundreds of profile HMMs
and apply them on a large scale (whole genome)

33
Limitations

Markov Chains
Probabilities of states are supposed to be
independent
P(y) must be independent of P(x), and vice versa
This usually isn't true

P(x)
P(y)

34
Limitations - contd

Standard Machine Learning Problems
Watch out for local maxima
Model may not converge to a truly optimal
parameter set for a given training set
Avoid over-fitting
You're only as good as your training set
More training is not always good

35
CONCLUSION