1
CZ5226 Advanced Bioinformatics
Lecture 6: HMM Method for Generating Motifs
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, Level 7, SOC1, National University of Singapore
2
Problem in biology
  • Data and patterns are often not clear cut
  • When we want to make a method to recognise a
    pattern (e.g. a sequence motif), we have to learn
    from the data (e.g. maybe there are other
    differences between sequences that have the
    pattern and those that do not)
  • This leads to Data mining and Machine learning

3
A widely used machine learning approach: Markov
models
  • Contents:
  • Markov chain models (1st order, higher order and
    inhomogeneous models; parameter estimation;
    classification)
  • Interpolated Markov models (and back-off
    models)
  • Hidden Markov models (forward, backward and
    Baum-Welch algorithms; model topologies;
    applications to gene finding and protein family
    modeling)

5
Markov Chain Models
  • a Markov chain model is defined by
  • a set of states
  • some states emit symbols
  • other states (e.g. the begin state) are silent
  • a set of transitions with associated
    probabilities
  • the transitions emanating from a given state
    define a distribution over the possible next
    states

6
Markov Chain Models
  • Given some sequence x of length L, we can ask how
    probable the sequence is given our model
  • For any probabilistic model of sequences, we can
    write this probability as
    Pr(x) = Pr(x_L, x_{L-1}, ..., x_1)
          = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)
  • Key property of a (1st order) Markov chain: the
    probability of each x_i depends only on x_{i-1}, so
    Pr(x) = Pr(x_L | x_{L-1}) Pr(x_{L-1} | x_{L-2}) ... Pr(x_2 | x_1) Pr(x_1)

7
Markov Chain Models
Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
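Below is a minimal Python sketch (added here, not from the original slides) of this first-order chain calculation; the initial and transition probabilities are made-up placeholder values:

```python
# Minimal sketch: probability of a DNA sequence under a 1st-order Markov chain.
# The numbers below are illustrative placeholders, not parameters from the lecture.

initial = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}   # Pr(x1)
transition = {                                            # Pr(x_i | x_{i-1})
    "a": {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2},
    "c": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
    "g": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
    "t": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3},
}

def markov_chain_prob(seq):
    """Pr(seq) = Pr(x1) * product over i of Pr(x_i | x_{i-1})."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(markov_chain_prob("cggt"))   # Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g)
```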
8
Markov Chain Models
  • Can also have an end state, allowing the model to
    represent
  • Sequences of different lengths
  • Preferences for sequences ending with particular
    symbols

9
Markov Chain Models
The transition parameters can be denoted by a_{x_{i-1}, x_i}, where
  a_{x_{i-1}, x_i} = Pr(x_i | x_{i-1})
Similarly, we can denote the probability of a sequence x as
  Pr(x) = a_{B, x_1} * Π_{i=2..L} a_{x_{i-1}, x_i}
where a_{B, x_1} represents the transition from the begin state.
10
Example Application
  • CpG islands
  • CG dinucleotides are rarer in eukaryotic genomes
    than expected given the independent probabilities
    of C and G
  • but the regions upstream of genes are richer in
    CG dinucleotides than elsewhere; these are CpG islands
  • useful evidence for finding genes
  • Could predict CpG islands with two Markov chains
    (see the sketch after this list)
  • one to represent CpG islands
  • one to represent the rest of the genome
  • The example uses maximum likelihood and Bayesian
    statistics on the data, and the resulting
    estimates are fed into the Markov models
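As a rough illustration of the two-chain idea (an addition, not code from the lecture), the sketch below scores a sequence by the log-odds of its probability under a "CpG island" chain versus a "background" chain; the transition tables are placeholder values, not the genome-derived counts referred to later in the deck:

```python
import math

# Placeholder transition tables (row: previous base, column: next base).
# Real values would be estimated from annotated CpG-island and background sequences.
island = {
    "a": {"a": 0.18, "c": 0.27, "g": 0.43, "t": 0.12},
    "c": {"a": 0.17, "c": 0.37, "g": 0.27, "t": 0.19},
    "g": {"a": 0.16, "c": 0.34, "g": 0.38, "t": 0.12},
    "t": {"a": 0.08, "c": 0.36, "g": 0.38, "t": 0.18},
}
background = {
    "a": {"a": 0.30, "c": 0.21, "g": 0.28, "t": 0.21},
    "c": {"a": 0.32, "c": 0.30, "g": 0.08, "t": 0.30},
    "g": {"a": 0.25, "c": 0.24, "g": 0.30, "t": 0.21},
    "t": {"a": 0.18, "c": 0.24, "g": 0.29, "t": 0.29},
}

def log_odds(seq, plus=island, minus=background):
    """Sum of log2 ratios of transition probabilities; > 0 suggests a CpG island."""
    return sum(math.log2(plus[p][c] / minus[p][c]) for p, c in zip(seq, seq[1:]))

print(log_odds("cgcgcgatcg"))   # positive score: window looks more island-like
```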

11
Estimating the Model Parameters
  • Given some data (e.g. a set of sequences from CpG
    islands), how can we determine the probability
    parameters of our model?
  • One approach: maximum likelihood estimation
  • given a set of data D
  • set the parameters θ to maximize Pr(D | θ)
  • i.e. make the data D look likely under the model

12
Maximum Likelihood Estimation
  • Suppose we want to estimate the parameters Pr(a),
    Pr(c), Pr(g), Pr(t)
  • And we're given the sequences
  • accgcgctta
  • gcttagtgac
  • tagccgttac
  • Then the maximum likelihood estimates are
  • Pr(a) = 6/30 = 0.2    Pr(g) = 7/30 = 0.233
  • Pr(c) = 9/30 = 0.3    Pr(t) = 8/30 = 0.267
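A quick sketch (added here, not part of the original deck) that reproduces these maximum likelihood estimates by simple counting:

```python
from collections import Counter

# Training sequences from the slide
sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]

counts = Counter("".join(sequences))    # symbol counts over all sequences
total = sum(counts.values())            # 30 symbols in total

# Maximum likelihood estimate = relative frequency of each symbol
mle = {base: counts[base] / total for base in "acgt"}
print(mle)   # {'a': 0.2, 'c': 0.3, 'g': 0.2333..., 't': 0.2666...}
```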

17
These data are derived from genome sequences
21
Higher Order Markov Chains
  • An nth order Markov chain over some alphabet is
    equivalent to a first order Markov chain over the
    alphabet of n-tuples
  • Example a 2nd order Markov model for DNA can be
    treated as a 1st order Markov model over
    alphabet
  • AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT,
    TA, TC, TG, and TT (i.e. all possible dinucleotides)
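As a small added illustration of this equivalence (an assumption about one straightforward encoding, not code from the lecture), a sequence can be re-encoded as overlapping n-tuples so that an nth-order chain becomes a 1st-order chain over the tuple alphabet:

```python
def to_tuple_alphabet(seq, n=2):
    """Re-encode a sequence as overlapping n-tuples, i.e. the symbols of the
    equivalent 1st-order chain. 'cggt' with n=2 -> ['cg', 'gg', 'gt']."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(to_tuple_alphabet("cggtacg", n=2))
# ['cg', 'gg', 'gt', 'ta', 'ac', 'cg']
# A transition between consecutive tuples (e.g. 'cg' -> 'gg') carries the same
# information as the 2nd-order transition Pr(x_i | x_{i-2}, x_{i-1}).
```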

22
A Fifth Order Markov Chain
23
Inhomogeneous Markov Chains
  • In the Markov chain models we have considered so
    far, the probabilities do not depend on where we
    are in a given sequence
  • In an inhomogeneous Markov model, we can have
    different distributions at different positions in
    the sequence
  • Consider modeling codons in protein coding
    regions
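A minimal sketch of one common way to implement this (an assumption, not the lecture's own code): keep a separate transition table for each of the three codon positions and cycle through them along the sequence:

```python
# One transition table per codon position; the values here are placeholders.
# position_tables[p][prev][cur] approximates Pr(cur | prev) at codon position p.
uniform = {b: {c: 0.25 for c in "acgt"} for b in "acgt"}
position_tables = [uniform, uniform, uniform]    # replace with per-position estimates

def inhomogeneous_prob(seq, start_prob=0.25):
    """Probability of a sequence when the transition distribution depends on
    the position within the codon (symbol i uses table i % 3)."""
    p = start_prob
    for i in range(1, len(seq)):
        p *= position_tables[i % 3][seq[i - 1]][seq[i]]
    return p

print(inhomogeneous_prob("atggcc"))
```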

24
Inhomogeneous Markov Chains
25
A Fifth Order Inhomogeneous Markov Chain
26
Selecting the Order of a Markov Chain Model
  • Higher order models remember more history
  • Additional history can have predictive value
  • Example:
  • predict the next word in this sentence
    fragment: "finish __" (up, it, first, last, ...?)
  • now predict it given more history:
  • "Fast guys finish __"

27
Hidden Markov models (HMMs)
Given, say, a T in our input sequence, which state
emitted it?
28
Hidden Markov models (HMMs)
  • Hidden State
  • We will distinguish between the observed parts
    of a problem and the hidden parts
  • In the Markov models we have considered
    previously, it is clear which state accounts for
    each part of the observed sequence
  • In the model above (preceding slide), there are
    multiple states that could account for each part
    of the observed sequence
  • this is the hidden part of the problem
  • states are decoupled from sequence symbols

29
HMM-based homology searching
Transition probabilities and emission probabilities.
Gapped HMMs also have insertion and deletion states.
30
Profile HMM: m = match state, i = insert state,
d = delete state. The model runs from left to right;
i and m states emit amino acids, d states are silent.
31
HMM-based homology searching
  • The most widely used HMM-based profile searching
    tools are currently SAM-T99 (Karplus et al.,
    1998) and HMMER2 (Eddy, 1998)
  • formal probabilistic basis and consistent theory
    behind gap and insertion scores
  • HMMs good for profile searches, bad for alignment
    (due to parametrisation of the models)
  • HMMs are slow

32
Homology-derived Secondary Structure of Proteins
(Sander and Schneider, 1991)
33
The Parameters of an HMM
34
HMM for Eukaryotic Gene Finding
Figure from A. Krogh, An Introduction to Hidden
Markov Models for Biological Sequences
35
A Simple HMM
36
Three Important Questions
  • How likely is a given sequence?
  • the Forward algorithm
  • What is the most probable path for generating a
    given sequence?
  • the Viterbi algorithm
  • How can we learn the HMM parameters given a set
    of sequences?
  • the Forward-Backward
  • (Baum-Welch) algorithm

37
How Likely is a Given Sequence?
  • The probability that the path is taken and the
    sequence is generated
  • (assuming begin/end are the only silent states on
    the path)
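The formula itself appears as an image on the slide; assuming the deck follows the standard notation (with a_kl the transition probabilities, e_k(b) the emission probabilities, and π_0 and π_{L+1} the begin and end states), the joint probability would be written as:

```latex
% Joint probability of sequence x and state path \pi (standard HMM notation;
% this reconstruction is an assumption, since the slide shows it as an image).
\Pr(x, \pi) = a_{\pi_0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}
```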

38
How Likely is a Given Sequence?
39
How Likely is a Given Sequence?
The probability over all paths is
  Pr(x) = Σ_π Pr(x, π)
but the number of paths can be exponential in the length
of the sequence... the Forward algorithm
enables us to compute this efficiently
40
How Likely is a Given Sequence? The Forward
Algorithm
  • Define f_k(i) to be the probability of being in
    state k, having observed the first i characters of x
  • We want to compute f_N(L), the probability of being
    in the end state having observed all of x
  • We can define this recursively

41
How Likely is a Given Sequence
42
The forward algorithm
probability that we're in the start state and have
observed 0 characters from the sequence
  • Initialisation:
  • f_0(0) = 1 (start)
  • f_k(0) = 0 (other silent states k)
  • Recursion:
  • f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl   (emitting states)
  • f_l(i) = Σ_k f_k(i) a_kl   (silent states)
  • Termination:
  • Pr(x) = Pr(x_1 ... x_L) = f_N(L) = Σ_k f_k(L) a_kN

probability that we are in the end state and have
observed the entire sequence
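A compact Python sketch of the forward recursion (an illustration added here, not the lecture's code); it assumes no silent states other than begin and end and uses a toy two-state model with made-up parameters:

```python
# Forward algorithm for a small HMM with emitting states plus silent begin/end.
# All parameters below are illustrative placeholders.
states = ["s1", "s2"]
a_begin = {"s1": 0.5, "s2": 0.5}                          # transitions out of begin
a = {"s1": {"s1": 0.7, "s2": 0.2}, "s2": {"s1": 0.2, "s2": 0.7}}
a_end = {"s1": 0.1, "s2": 0.1}                            # transitions into end
e = {"s1": {"H": 0.7, "T": 0.3}, "s2": {"H": 0.3, "T": 0.7}}

def forward(x):
    """Return Pr(x), summed over all state paths."""
    # Initialisation: f_k(1) = a_{begin,k} * e_k(x_1)
    f = {k: a_begin[k] * e[k][x[0]] for k in states}
    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for symbol in x[1:]:
        f = {l: e[l][symbol] * sum(f[k] * a[k][l] for k in states) for l in states}
    # Termination: Pr(x) = sum_k f_k(L) * a_{k,end}
    return sum(f[k] * a_end[k] for k in states)

print(forward("HHTH"))
```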
43
Forward algorithm example
44
Three Important Questions
  • How likely is a given sequence?
  • What is the most probable path for generating a
    given sequence?
  • How can we learn the HMM parameters given a set
    of sequences?

45
Finding the Most Probable Path: The Viterbi
Algorithm
  • Define v_k(i) to be the probability of the most
    probable path accounting for the first i
    characters of x and ending in state k
  • We want to compute v_N(L), the probability of the
    most probable path accounting for all of the
    sequence and ending in the end state
  • Can be defined recursively
  • Can use dynamic programming (DP) to find v_N(L)
    efficiently

46
Finding the Most Probable Path: The Viterbi
Algorithm
  • Initialisation:
  • v_0(0) = 1 (start), v_k(0) = 0 (non-silent states)
  • Recursion for emitting states (i = 1...L):
    v_l(i) = e_l(x_i) max_k [ v_k(i-1) a_kl ]
  • Recursion for silent states:
    v_l(i) = max_k [ v_k(i) a_kl ]
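A matching Viterbi sketch (again illustrative only, reusing the toy parameters from the forward sketch above); it keeps back-pointers so the most probable path can be recovered:

```python
# Viterbi algorithm: most probable state path for an observed sequence.
# Reuses the placeholder parameters (states, a_begin, a, a_end, e) from the
# forward-algorithm sketch above.

def viterbi(x):
    """Return (probability of the best path, the best path itself)."""
    # Initialisation: v_k(1) = a_{begin,k} * e_k(x_1)
    v = {k: a_begin[k] * e[k][x[0]] for k in states}
    back = []                                  # one dict of back-pointers per position
    # Recursion: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl
    for symbol in x[1:]:
        ptr, nxt = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[k] * a[k][l])
            ptr[l] = best_k
            nxt[l] = e[l][symbol] * v[best_k] * a[best_k][l]
        back.append(ptr)
        v = nxt
    # Termination: include the transition into the end state, then trace back
    last = max(states, key=lambda k: v[k] * a_end[k])
    best_prob = v[last] * a_end[last]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return best_prob, list(reversed(path))

print(viterbi("HHTH"))
```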

47
Finding the Most Probable Path: The Viterbi
Algorithm
48
Three Important Questions
  • How likely is a given sequence? (clustering)
  • What is the most probable path for generating a
    given sequence? (alignment)
  • How can we learn the HMM parameters given a set
    of sequences?

49
The Learning Task
  • Given
  • a model
  • a set of sequences (the training set)
  • Do
  • find the most likely parameters to explain the
    training sequences
  • The goal is to find a model that generalizes well
    to sequences we haven't seen before

50
Learning Parameters
  • If we know the state path for each training
    sequence, learning the model parameters is simple
  • no hidden state during training
  • count how often each parameter is used
  • normalize/smooth to get probabilities
  • the process is just like it was for Markov chain
    models
  • If we don't know the path for each training
    sequence, how can we determine the counts?
  • key insight: estimate the counts by
    considering every path, weighted by its
    probability

51
Learning Parameters: The Baum-Welch Algorithm
  • An EM (expectation maximization) approach, also
    known as the forward-backward algorithm
  • Algorithm sketch
  • initialize parameters of model
  • iterate until convergence
  • Calculate the expected number of times each
    transition or emission is used
  • Adjust the parameters to maximize the likelihood
    of these expected values

52
The Expectation step
53
The Expectation step
54
The Expectation step
55
The Expectation step
56
The Expectation step
  • First, we need to know the probability of the
    i-th symbol being produced by state k, given the
    sequence x:
  • Pr(π_i = k | x)
  • Given this we can compute our expected counts for
    state transitions and character emissions

57
The Expectation step
58
The Backward Algorithm
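The slide itself is a figure; as an added companion to the forward sketch above, here is an illustrative backward recursion (an assumption that the deck uses the standard formulation, with the same toy parameters):

```python
# Backward algorithm: b_k(i) = Pr(x_{i+1} ... x_L | state k at position i).
# Reuses the placeholder parameters (states, a, a_begin, a_end, e) and forward()
# from the sketches above.

def backward(x):
    """Return the table b[i][k] for positions i = 1..L and states k."""
    L = len(x)
    b = [dict() for _ in range(L + 1)]        # index 0 unused; positions are 1-based
    # Initialisation: b_k(L) = a_{k,end}
    b[L] = {k: a_end[k] for k in states}
    # Recursion: b_k(i) = sum_l a_kl * e_l(x_{i+1}) * b_l(i+1)
    for i in range(L - 1, 0, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i]] * b[i + 1][l] for l in states)
                for k in states}
    return b

b = backward("HHTH")
# Consistency check: Pr(x) computed from the backward table matches forward(x)
print(sum(a_begin[k] * e[k]["H"] * b[1][k] for k in states), forward("HHTH"))
```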
59
The Expectation step
60
The Expectation step
61
The Expectation step
62
The Maximization step
63
The Maximization step
64
The Baum-Welch Algorithm
  • Initialize parameters of model
  • Iterate until convergence
  • calculate the expected number of times each
    transition or emission is used
  • adjust the parameters to maximize the
    likelihood of these expected values
  • This algorithm will converge to a local maximum
    (in the likelihood of the data given the model)
  • Usually in a fairly small number of iterations
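To make the expectation and maximization steps concrete, here is a rough single-iteration sketch (an added illustration using the toy model from the earlier sketches, a single training sequence, and, for brevity, no re-estimation of the begin/end transitions; a real implementation would handle those and iterate until the likelihood stops improving):

```python
# One Baum-Welch iteration (sketch): expected counts -> re-normalised parameters.
# Reuses states, a, a_begin, a_end, e, forward() and backward() from the sketches above.

def baum_welch_step(x):
    L = len(x)
    px = forward(x)                            # Pr(x) under the current parameters
    b = backward(x)
    # Full forward table f[i][k] for i = 1..L (forward() above only keeps one column)
    f = [dict() for _ in range(L + 1)]
    f[1] = {k: a_begin[k] * e[k][x[0]] for k in states}
    for i in range(2, L + 1):
        f[i] = {l: e[l][x[i - 1]] * sum(f[i - 1][k] * a[k][l] for k in states)
                for l in states}

    # E-step: expected transition counts A_kl and emission counts E_k(s)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][x[i]] * b[i + 1][l]
                    for i in range(1, L)) / px
             for l in states} for k in states}
    E = {k: {s: sum(f[i][k] * b[i][k] for i in range(1, L + 1) if x[i - 1] == s) / px
             for s in "HT"} for k in states}

    # M-step: re-normalise the expected counts into new probabilities
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in "HT"} for k in states}
    return new_a, new_e

print(baum_welch_step("HHTH"))
```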