1
Variants of HMMs
2
Higher-order HMMs
  • How do we model memory longer than one time
    point?
  • First order: P(πi+1 = l | πi = k) = akl
  • Second order: P(πi+1 = l | πi = k, πi-1 = j) = ajkl
  • A second order HMM with K states is equivalent to
    a first order HMM with K² states

[Diagram: the 2-state (H/T) second-order model rewritten as a first-order
HMM over the four pair-states HH, HT, TH, TT; e.g. aHHT = aHT(prev H),
aTHT = aHT(prev T), and similarly aHTH, aHTT, aTHH, aTTH]
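The expansion can be sketched directly: hypothetical second-order probabilities a_jkl for the coin model (numbers chosen only for illustration) become a first-order chain over pair-states, where pair-state (j, k) can only move to (k, l).

```python
# Hypothetical second-order transition probabilities for the 2-state
# coin model: a2[(j, k)][l] = P(next = l | state before last = j, last = k).
a2 = {
    ("H", "H"): {"H": 0.7, "T": 0.3},
    ("H", "T"): {"H": 0.4, "T": 0.6},
    ("T", "H"): {"H": 0.5, "T": 0.5},
    ("T", "T"): {"H": 0.2, "T": 0.8},
}

# Equivalent first-order HMM on K^2 pair-states: (j, k) -> (k, l) has
# probability a2[(j, k)][l]; every other transition is impossible.
first_order = {}
for (j, k), dist in a2.items():
    for l, p in dist.items():
        first_order[((j, k), (k, l))] = p

# Each pair-state's outgoing probabilities still sum to 1.
for jk in a2:
    total = sum(p for (src, dst), p in first_order.items() if src == jk)
    assert abs(total - 1.0) < 1e-12
```

With K = 2 this yields 4 = K² pair-states and 8 allowed transitions; the same construction works for any K.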
3
Modeling the Duration of States
  • Length distribution of region X:
  • E[lX] = 1/(1 - p)
  • Geometric distribution, with mean 1/(1 - p)
  • This is a significant disadvantage of HMMs
  • Several solutions exist for modeling different
    length distributions

[Diagram: two states X and Y; X loops to itself with probability p and
moves to Y with probability 1 - p; Y loops with q and leaves with 1 - q]
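The geometric claim is easy to verify numerically; a minimal sketch, with p an arbitrary self-loop probability:

```python
# Self-loop probability p gives a geometric duration:
# P(l_X = k) = p**(k - 1) * (1 - p), so E[l_X] = 1/(1 - p).
p = 0.9
ks = range(1, 2000)  # truncation; the tail beyond this is negligible
probs = [p ** (k - 1) * (1 - p) for k in ks]
mean = sum(k * pk for k, pk in zip(ks, probs))

assert abs(sum(probs) - 1.0) < 1e-6   # distribution sums to 1
assert abs(mean - 1 / (1 - p)) < 1e-6 # mean = 10 for p = 0.9
```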
4
Soln 1: Chain several states

[Diagram: several copies of state X chained before Y; the last copy loops
with probability p and exits with 1 - p]

Disadvantage: Still very inflexible
lX = C + geometric with mean 1/(1 - p)
5
Soln 2: Negative binomial distribution

[Diagram: n copies of state X chained before Y; each copy loops with
probability p and moves on with probability 1 - p]

  • Duration in X: m turns, where
  • During the first m - 1 turns, exactly n - 1 arrows to
    the next state are followed
  • During the m-th turn, an arrow to the next state is
    followed
  • P(lX = m) = (m-1 choose n-1) (1 - p)^(n-1) p^((m-1)-(n-1)) (1 - p)
             = (m-1 choose n-1) (1 - p)^n p^(m-n)
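The duration formula can be checked against its normalization and mean; a small sketch with illustrative values of n and p:

```python
from math import comb

# P(l_X = m) = C(m-1, n-1) * (1 - p)**n * p**(m - n), for m >= n:
# n copies of X, each looping with probability p.
def neg_binom_pmf(m, n, p):
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.8
ms = range(n, 5000)  # truncation; the tail is negligible
total = sum(neg_binom_pmf(m, n, p) for m in ms)
mean = sum(m * neg_binom_pmf(m, n, p) for m in ms)

assert abs(total - 1.0) < 1e-9        # probabilities sum to 1
assert abs(mean - n / (1 - p)) < 1e-6 # mean duration n/(1-p) = 15 here
```

The shortest possible duration is m = n (one exit arrow per copy, no loops), with probability (1 - p)^n.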

6
Example: genes in prokaryotes
  • EasyGene:
    a prokaryotic gene-finder
    (Larsen TS, Krogh A)
  • Negative binomial with n = 3

7
Solution 3: Duration modeling
  • Upon entering a state:
  • Choose duration d, according to a probability
    distribution
  • Generate d letters according to emission probs
  • Take a transition to the next state according to
    transition probs
  • Disadvantage: Increase in complexity
  • Time: O(D²) -- Why?
  • Space: O(D)
  • where D = maximum duration of a state
8
Connection Between Alignment and HMMs
9
A state model for alignment

[Diagram: states M (1,1), I (1,0), J (0,1); M emits an aligned pair,
I a letter of x against a gap, J a letter of y against a gap]

Alignments correspond 1-to-1 with sequences of
states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
10
Let's score the transitions

[Diagram: states M (1,1), I (1,0), J (0,1); transitions into M score
s(xi, yj); gap-open transitions M → I and M → J score -d; gap-extend
transitions involving I and J score -e]

Alignments correspond 1-to-1 with sequences of
states M, I, J

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
11
How do we find optimal alignment according to
this model?
  • Dynamic Programming
  • M(i, j): Optimal alignment of x1…xi to y1…yj
    ending in M
  • I(i, j): Optimal alignment of x1…xi to y1…yj
    ending in I
  • J(i, j): Optimal alignment of x1…xi to y1…yj
    ending in J
  • The score is additive, therefore we can apply DP
    recurrence formulas

12
Needleman-Wunsch with affine gaps: state version
  • Initialization:
    M(0, 0) = 0;  M(i, 0) = M(0, j) = -∞, for i, j > 0
    I(i, 0) = -d - i×e;  J(0, j) = -d - j×e
  • Iteration:
    M(i, j) = s(xi, yj) + max { M(i-1, j-1),
                                I(i-1, j-1),
                                J(i-1, j-1) }
    I(i, j) = max { -e + I(i-1, j),
                    -d + M(i-1, j) }
    J(i, j) = max { -e + J(i, j-1),
                    -d + M(i, j-1) }
  • Termination:
    Optimal alignment score = max { M(m, n), I(m, n), J(m, n) }
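The recurrences translate into a short DP. A minimal sketch with toy scores; here the gap open -d is charged on the first gap character (so the borders are I(i, 0) = -d - (i-1)·e), one of the common conventions:

```python
# Needleman-Wunsch with affine gaps, M/I/J state version.
NEG = float("-inf")

def nw_affine(x, y, match=1.0, mismatch=-1.0, d=3.0, e=1.0):
    s = lambda a, b: match if a == b else mismatch
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    J = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        I[i][0] = -d - (i - 1) * e   # one gap of length i in y
    for j in range(1, m + 1):
        J[0][j] = -d - (j - 1) * e   # one gap of length j in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # M: align xi with yj, coming from any state
            M[i][j] = s(x[i-1], y[j-1]) + max(M[i-1][j-1],
                                              I[i-1][j-1],
                                              J[i-1][j-1])
            # I: xi against a gap (extend from I, or open from M)
            I[i][j] = max(-e + I[i-1][j], -d + M[i-1][j])
            # J: yj against a gap
            J[i][j] = max(-e + J[i][j-1], -d + M[i][j-1])
    return max(M[n][m], I[n][m], J[n][m])
```

For example, `nw_affine("AGC", "AGC")` scores three matches with no gap, and a single mismatched pair scores the mismatch value.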

13
Probabilistic interpretation of an alignment
  • An alignment is a hypothesis that the two
    sequences are related by evolution
  • Goals:
  • Produce the most likely alignment
  • Assess the likelihood that the sequences are
    indeed related

14
A Pair HMM for alignments
[Diagram: BEGIN → M, I, J → END. M has self-transition 1 - 2δ - τ and
moves to I or J with probability δ each; I and J have self-transitions ε;
every state reaches END with probability τ]
15
A Pair HMM for unaligned sequences
Model R

[Diagram: two independent chains; I emits x-letters with P(xi), J emits
y-letters with P(yj); each loops with probability 1 - η and reaches END
with probability η]

  • P(x, y | R) = η(1 - η)^m P(x1)…P(xm) × η(1 - η)^n P(y1)…P(yn)
                = η²(1 - η)^(m+n) Πi P(xi) Πj P(yj)

16
To compare ALIGNMENT vs. RANDOM hypothesis
  • Every pair of letters contributes:
  • (1 - 2δ - τ) P(xi, yj) when matched
  • δ P(xi) P(yj) when gapped
  • (1 - η)² P(xi) P(yj) in the random model
  • Focus on comparison of
    P(xi, yj) vs. P(xi) P(yj)

[Diagram: the alignment pair HMM (M emitting P(xi, yj) with
self-transition 1 - 2δ - τ, I emitting P(xi) and J emitting P(yj) with
self-transitions ε, entered with probability δ) shown above the random
model R (independent I and J chains with continue probability 1 - η and
stop probability η)]
17
To compare ALIGNMENT vs. RANDOM hypothesis
  • Idea:
  • We will divide the alignment score by the random
    score, and take logarithms
  • Let
    s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ]
              + log [ (1 - 2δ - τ) / (1 - η)² ]
    d = -log [ δ(1 - ε - τ) P(xi) / ((1 - η)(1 - 2δ - τ) P(xi)) ]
    e = -log [ ε P(xi) / ((1 - η) P(xi)) ]
  • Every letter b in the random model contributes
    (1 - η) P(b)
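These log-odds penalties can be evaluated numerically. A sketch with hypothetical parameter values (delta, eps, tau, eta chosen only for illustration; the P(xi) factors cancel):

```python
from math import log

# Hypothetical pair-HMM parameters: δ=delta, ε=eps, τ=tau, η=eta.
delta, eps, tau, eta = 0.1, 0.3, 1e-3, 1e-3

# Gap penalties from the alignment-vs-random log-odds ratio:
d = -log(delta * (1 - eps - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
e = -log(eps / (1 - eta))

# With τ, η ≈ 0 these approach the simplified forms -log[δ/(1-2δ)]
# and -log ε used on the next slide.
d_approx = -log(delta / (1 - 2 * delta))
e_approx = -log(eps)
assert abs(e - e_approx) < 0.01
```

Since ε is not tiny in this toy setting, d keeps a visible (1 - ε - τ) correction relative to its approximation, while e is already very close to -log ε.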
18
The meaning of alignment scores
  • Because δ, ε are small, and τ, η are very small,
    s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ] + log [ (1 - 2δ - τ) / (1 - η)² ]
              ≈ log [ P(xi, yj) / (P(xi) P(yj)) ] + log(1 - 2δ)
    d = -log [ δ(1 - ε - τ) / ((1 - η)(1 - 2δ - τ)) ]
      ≈ -log [ δ / (1 - 2δ) ]
    e = -log [ ε / (1 - η) ]
      ≈ -log ε

19
The meaning of alignment scores
  • The Viterbi algorithm for Pair HMMs corresponds
    exactly to the Needleman-Wunsch algorithm with
    affine gaps
  • However, now we need to score the alignment with
    parameters that add up to probability
    distributions
  • ε ~ 1/(mean length of next gap)
  • δ ~ 1/(mean arrival time of next gap)
  • affine gaps decouple arrival time from length
  • τ ~ 1/(mean length of aligned sequences) (set to 0)
  • η ~ 1/(mean length of unaligned sequences) (set to 0)

20
The meaning of alignment scores
  • Match/mismatch scores:
    s(a, b) ≈ log [ P(a, b) / (P(a) P(b)) ]
    (let's ignore log(1 - 2δ) for the moment, and assume no gaps)
  • Example:
  • Say DNA regions between human and mouse have
    average conservation of 50%
  • Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8
    (so they sum to ½)
  • P(A,C) = P(A,G) = … = P(T,G) = 1/24
    (12 mismatch pairs, summing to ½)
  • Say P(A) = P(C) = P(G) = P(T) = ¼
  • Then s(a, a) = log [ (1/8) / (1/4 × 1/4) ] = log 2 = 1, for a match
  • s(a, b) = log [ (1/24) / (1/4 × 1/4) ] = log(16/24) ≈ -0.585, for a mismatch
  • Cutoff similarity that scores 0: s×1 - (1 - s)×0.585 = 0, giving s ≈ 0.37
  • According to this model, a ≈37%-conserved
    sequence with no gaps would score 0 on average
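The arithmetic in this example checks out directly with base-2 logarithms:

```python
from math import log2

# 50%-conserved DNA: 4 match pairs at 1/8 each, 12 mismatch pairs at 1/24.
p_match, p_mismatch, p_bg = 1 / 8, 1 / 24, 1 / 4

s_match = log2(p_match / (p_bg * p_bg))        # log2(2) = 1
s_mismatch = log2(p_mismatch / (p_bg * p_bg))  # log2(2/3) ≈ -0.585

# Conservation level s that breaks even: s*1 - (1 - s)*0.585 = 0.
s0 = -s_mismatch / (s_match - s_mismatch)
```

Solving the break-even equation gives s0 ≈ 0.369, i.e. roughly 37% conservation scores 0 on average under this model.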

21
Substitution matrices
  • A more meaningful way to assign match/mismatch
    scores
  • For protein sequences, different substitutions
    have dramatically different frequencies!
  • PAM Matrices
  • Start from a curated set of very similar protein
    sequences
  • Construct ancestral sequences (using parsimony)
  • Calculate Aab: frequency of letters a and b
    interchanging
  • Calculate Bab = P(b|a) = Aab/(Σc Aac)
  • Adjust matrix B so that Σa,b qa qb Bab = 0.01
    → PAM(1)
  • Let PAM(N) = PAM(1)^N -- Common: PAM(250)
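PAM(N) = PAM(1)^N is just matrix powering. A toy sketch with a hypothetical 2-letter "PAM(1)" (illustrative numbers, not real PAM data):

```python
# Plain-Python matrix multiply and power for small square matrices.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(A, n):
    result = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        result = mat_mul(result, A)
    return result

# Toy 2-letter "PAM(1)": rows are conditional distributions B[a][b] = P(b|a),
# scaled so the expected change per unit is about 1%.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)

# Rows remain probability distributions, and off-diagonal mass grows with N.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in pam250)
assert pam250[0][1] > pam1[0][1]
```

At N = 250 the toy matrix is close to uniform, mirroring how highly diverged sequences look nearly random letter-by-letter.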

22
Substitution Matrices
  • BLOSUM matrices
  • Start from BLOCKS database (curated, gap-free
    alignments)
  • Cluster sequences according to > X% identity
  • Calculate Aab: # of aligned a-b pairs in distinct
    clusters, correcting by 1/(mn), where m, n are the
    two cluster sizes
  • Estimate
    P(a) = (Σb Aab)/(Σc,d Acd);  P(a, b) = Aab/(Σc,d Acd)
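The estimation step can be sketched with hypothetical pair counts Aab over a 2-letter alphabet (made-up numbers, not BLOCKS data):

```python
from collections import defaultdict

# Hypothetical symmetric pair counts A[a][b] after cluster correction.
A = {("a", "a"): 30, ("a", "b"): 10, ("b", "a"): 10, ("b", "b"): 50}
total = sum(A.values())

# P(a, b) = A_ab / ΣA;  P(a) = Σb A_ab / ΣA (marginal of the pair model).
P_pair = {k: v / total for k, v in A.items()}
P_single = defaultdict(float)
for (a, b), p in P_pair.items():
    P_single[a] += p

# Both estimates are proper probability distributions.
assert abs(sum(P_pair.values()) - 1.0) < 1e-12
assert abs(sum(P_single.values()) - 1.0) < 1e-12
```

The final BLOSUM entry for (a, b) would then be a scaled log-odds of P(a, b) against P(a)P(b), as in the earlier scoring slides.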

23
BLOSUM matrices
BLOSUM 50
BLOSUM 62
(The two are scaled differently)