Title: CS5263 Bioinformatics
1. CS5263 Bioinformatics
- Lecture 13: Hidden Markov Models and applications
2. Project ideas
- Implement an HMM
- Iterative refinement MSA
- Motif finder
- Implement an algorithm in a paper
- Write a survey paper
3. NAR Web Server Issue
- Every December, Nucleic Acids Research has a special issue on web servers
- Not necessary to invent original methods
- Biologists appreciate web tools
- You get a nice publication
- And potentially many citations if your tool is useful (think about BLAST!)
- Talk to me if you want to work on this project
5. Problems in HMM
- Decoding
- Predict the state of each symbol
- Viterbi algorithm
- Forward-backward algorithm
- Evaluation
- The probability that a sequence is generated by a model
- Forward-backward algorithm
- Learning
- unsupervised
- Baum-Welch
- Viterbi
6. A general HMM
[Figure: K states labeled 1, 2, 3, ..., K, completely connected]
- K states, completely connected (possibly with 0 transition probabilities)
- Each state has a set of emission probabilities (emission probabilities may be 0 for some symbols in some states)
7. The Viterbi algorithm
- V(j, i+1) = max { V(1,i) w(1,j) r(j, x_{i+1}),
                    V(2,i) w(2,j) r(j, x_{i+1}),
                    V(3,i) w(3,j) r(j, x_{i+1}),
                    ...,
                    V(K,i) w(K,j) r(j, x_{i+1}) }
- Or simply
- V(j, i+1) = max_l V(l,i) w(l,j) r(j, x_{i+1})
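To make the recurrence concrete, here is a minimal Python sketch (assuming NumPy; w and r follow the slide's notation for transition and emission probabilities, everything else is illustrative):

```python
import numpy as np

def viterbi(x, w, r, pi):
    """Most probable state path for an observation sequence x.

    w[l, j]: transition probability from state l to state j
    r[j, b]: probability that state j emits symbol b
    pi[j]:   initial probability of state j
    x:       sequence of symbol indices
    """
    K, N = w.shape[0], len(x)
    V = np.full((K, N), -np.inf)       # V[j, i]: log-prob of best path ending in state j at position i
    ptr = np.zeros((K, N), dtype=int)  # back-pointers
    V[:, 0] = np.log(pi) + np.log(r[:, x[0]])
    for i in range(1, N):
        for j in range(K):
            scores = V[:, i - 1] + np.log(w[:, j])   # V(l, i-1) + log w(l, j), for all l
            ptr[j, i] = int(np.argmax(scores))
            V[j, i] = scores[ptr[j, i]] + np.log(r[j, x[i]])
    path = [int(np.argmax(V[:, -1]))]  # best final state
    for i in range(N - 1, 0, -1):      # trace back-pointers to recover the path
        path.append(int(ptr[path[-1], i]))
    return path[::-1]
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on long sequences.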
8.
- Viterbi finds the single best path
- In many cases it is also interesting to know what the other possible paths are
- Viterbi assigns a state to each symbol
- In many cases it is also interesting to know the probability that the assignment is correct
- Posterior decoding using the forward-backward algorithm
9. The forward algorithm
- f_k(i) = r(k, x_i) * sum_l f_l(i-1) w(l, k)
10. The backward algorithm
- b_k(i) = sum_l w(k, l) r(l, x_{i+1}) b_l(i+1)
- This does not include the emission probability of x_i
11. Forward-backward algorithm
- f_k(i) = prob to get to position i in state k and emit x_1..x_i
- b_k(i) = prob to emit x_{i+1}..x_N, given position i is in state k
- f_k(i) b_k(i) = P(π_i = k, x)
12.
[Figure: trellis of sequence positions vs. states, with forward probabilities filled in from the left and backward probabilities from the right]
- Space: O(KN); Time: O(K^2 N)
- P(π_i = k | x) = f_k(i) b_k(i) / P(x)
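A minimal sketch of posterior decoding in Python, in the same notation as the Viterbi sketch above (all variable names are illustrative):

```python
import numpy as np

def posterior_decode(x, w, r, pi):
    """Posterior state probabilities P(pi_i = k | x) via forward-backward.

    For long sequences you would scale each column or work in log
    space to avoid underflow; this plain version shows the recurrences.
    """
    K, N = w.shape[0], len(x)
    f = np.zeros((K, N))                  # forward:  f[k, i]
    b = np.zeros((K, N))                  # backward: b[k, i]
    f[:, 0] = pi * r[:, x[0]]
    for i in range(1, N):
        f[:, i] = r[:, x[i]] * (w.T @ f[:, i - 1])    # f_k(i) = r(k, x_i) sum_l f_l(i-1) w(l, k)
    b[:, N - 1] = 1.0                     # backward excludes the emission at its own position
    for i in range(N - 2, -1, -1):
        b[:, i] = w @ (r[:, x[i + 1]] * b[:, i + 1])  # b_k(i) = sum_l w(k, l) r(l, x_{i+1}) b_l(i+1)
    px = f[:, -1].sum()                   # P(x), from the forward pass
    return f * b / px                     # entry [k, i] is P(pi_i = k | x)
```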
13. Learning
- When the states are known
- supervised learning
- When the states are unknown
- Estimate parameters from unlabeled data
- unsupervised learning
How to find CpG islands in the porcupine genome?
14. Basic idea
1. Estimate our best guess on the model parameters θ
2. Use θ to predict the unknown labels
3. Re-estimate a new set of θ
4. Repeat steps 2-3 until convergence
Two ways to do this:
15. Viterbi training
1. Estimate our best guess on the model parameters θ
2. Find the Viterbi path using the current θ
3. Re-estimate a new set of θ based on the Viterbi path
4. Repeat steps 2-3 until convergence
16. Baum-Welch training
1. Estimate our best guess on the model parameters θ
2. Find P(π_i = k | x, θ) using the forward-backward algorithm (E-step)
3. Re-estimate a new set of θ based on all possible paths (M-step)
4. Repeat steps 2-3 until convergence
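Spelled out, the re-estimation in step 3 uses expected counts accumulated in step 2; the textbook form of the updates, rendered in the slide's notation (my rendering, not copied from the slides), is:

```latex
% E-step: expected counts, summed over positions i
A(k, l) = \sum_i P(\pi_i = k,\ \pi_{i+1} = l \mid x, \theta), \qquad
E_k(b)  = \sum_{i \,:\, x_i = b} P(\pi_i = k \mid x, \theta)

% M-step: re-estimated parameters
w(k, l) = \frac{A(k, l)}{\sum_{l'} A(k, l')}, \qquad
r(k, b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}
```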
17. Viterbi vs Baum-Welch training
- Viterbi training
- Returns a single path
- Each position is labeled with a fixed state
- Each transition counts as one
- Each emission also counts as one
- Baum-Welch training
- Does not return a single path
- Considers the probability that each transition is used
- and the probability that a symbol is generated by a certain state
- These only contribute partial counts
18. Probability that a transition is used
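The slide's figure is not reproduced in this text; the standard forward-backward expression for this quantity, in the notation above, is:

```latex
P(\pi_i = k,\ \pi_{i+1} = l \mid x) =
  \frac{f_k(i)\; w(k, l)\; r(l, x_{i+1})\; b_l(i+1)}{P(x)}
```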
19. Viterbi vs Baum-Welch training
- Both are guaranteed to converge
- Baum-Welch improves the likelihood of the data in each iteration
- True EM (expectation-maximization)
- Viterbi improves the probability of the most probable path in each iteration
- EM-like
20. Today
- Some practical issues in HMM modeling
- HMMs for sequence alignment
21. Duration modeling
- For any sub-path, the probability consists of two components
- The product of emission probabilities
- Depends on the symbols and the state path
- The product of transition probabilities
- Depends on the state path
22. Duration modeling
- Duration: the number of steps that a state is used consecutively without visiting other states
- Model a stretch of DNA for which the distribution does not change for a certain length
- The simplest model (a single state s with self-transition probability p and exit probability 1-p) implies that
- P(length = L) = p^{L-1} (1-p)
- i.e., the length follows a geometric distribution
- Not always appropriate
[Figure: state s with self-loop p and exit 1-p; plot of P(length = L) vs. L]
23. Duration models
[Figure: a chain of states s -> s -> s -> s in which only the last state has a self-loop p and exit 1-p]
- Minimum length, then geometric
[Figure: a chain of states s -> s -> s -> s in which every state has a self-loop p and exit 1-p]
- Negative binomial
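For the second chain, where each of n copies of s has self-transition probability p, the total duration is a sum of n geometric durations; written out (a standard derivation consistent with the slide's geometric case, not copied from it):

```latex
P(\text{length} = L) = \binom{L-1}{n-1}\, p^{\,L-n}\, (1-p)^{n}, \qquad L \ge n
```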
24. Explicit duration modeling
[Figure: generalized HMM with Intergenic, Exon, and Intron states; the intron state emits with P(A|I) = 0.3, P(C|I) = 0.2, P(G|I) = 0.2, P(T|I) = 0.3, and its duration is drawn from an empirical intron length distribution P(L)]
- Generalized HMM. Often used in gene finders.
25. Silent states
- Silent states are states that do not emit symbols (e.g., the state 0 in our previous examples)
- Silent states can be introduced in HMMs to reduce the number of transitions
26. Silent states
- Suppose we want to model a sequence in which arbitrary deletions are allowed (later this lecture)
- In that case we need a completely forward-connected HMM (O(m^2) edges)
27. Silent states
- If we use silent states, we use only O(m) edges
- Nothing comes for free, though
- Algorithms can be modified easily to deal with silent states, so long as there are no silent-state loops
- Suppose we want to assign high probability to 1->5 and 2->4; there is then no way to also have low probability on 1->4 and 2->5
28. Modular design of HMMs
- HMMs can be designed modularly
- Each module has its own (silent) begin / end states
- Each module communicates with other modules only through its begin/end states
29.
[Figure: an HMM module with emitting states A, C, G, T between silent begin (B) and end (E) states, shown alongside a second such module]
- HMM modules and non-HMM modules can be mixed
30. Explicit duration modeling (repeat of slide 24)
31. HMM applications
- Pair-wise sequence alignment
- Multiple sequence alignment
- Gene finding
- Speech recognition (a good tutorial is on the course website)
- Machine translation
- Many others
32. FSA for global alignment
[Figure: three-state finite-state automaton for affine-gap global alignment: a middle state "X_i and Y_j aligned", an upper state "X_i aligned to a gap", and a lower state "Y_j aligned to a gap"; gap-open transitions cost d and gap-extension self-loops cost e]
33. HMM for global alignment
[Figure: pair HMM with a match state ("X_i and Y_j aligned", emission p(x_i, y_j), 16 emission probabilities) and two gap states ("X_i aligned to a gap" and "Y_j aligned to a gap", emissions q(x_i) and q(y_j), 4 emission probabilities each); transitions into the gap states have probability δ, staying in the match state has probability 1-2δ, staying in a gap state has probability ε, and returning to the match state has probability 1-ε]
- A pair-wise HMM emits two sequences simultaneously
- The algorithms are similar to those for a regular HMM, but need an additional dimension; e.g., in Viterbi we need V_k(i, j) instead of V_k(i)
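As an illustration of that extra dimension, the match-state recurrence of this pair HMM can be written as follows (the textbook form from Durbin et al., chapter 4, adapted to the state names above; not reproduced from the slides):

```latex
V^{M}(i, j) = p(x_i, y_j)\,\max
\begin{cases}
(1 - 2\delta)\, V^{M}(i-1, j-1) \\
(1 - \epsilon)\, V^{X}(i-1, j-1) \\
(1 - \epsilon)\, V^{Y}(i-1, j-1)
\end{cases}
```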
34. HMM and FSA for alignment
- After a proper transformation between the probabilities and the substitution scores, the two are identical
- σ(a, b) <-> log [ p(a, b) / (q(a) q(b)) ]
- d <-> log δ
- e <-> log ε
- Details in the Durbin book, chapter 4
- Finding an optimal FSA alignment is equivalent to finding the most probable path with Viterbi
35. HMM for pair-wise alignment
- Theoretical advantages
- Full probabilistic interpretation of alignment scores
- Probability of all alignments instead of only the best alignment (forward-backward)
- Posterior probability that A_i is aligned to B_j
- Sampling of sub-optimal alignments
- Not commonly used in practice
- The Needleman-Wunsch and Smith-Waterman algorithms work pretty well, and are more intuitive to biologists
- Other reasons?
36. Other applications
- HMMs for multiple alignment
- Widely used
- HMMs for gene finding
- Foundation for most gene finders
- Include many knowledge-based fine-tunings and GHMM extensions
- We'll only discuss the basic ideas
37. HMMs for multiple sequence alignment
- Proteins form families both across and within species
- E.g., globins, zinc fingers
- Descended from a common ancestor
- Typically have similar three-dimensional structures, functions, and significant sequence similarity
- Identifying families is very useful: it suggests functions
- So search and alignment are both useful
- Multiple alignment is hard
- One very useful approach: profile HMMs
38.
Say we already have a database of reliable multiple alignments of protein families. When a new protein comes, how do we align it to the existing alignments and find the family that the protein may belong to?
39. Solution 1
- Use regular expressions to represent the consensus sequences
- Implemented in the Prosite database
- For example:
- C - x(2) - P - F - x - [FYWIV] - x(7) - C - x(8,10) - W - C - x(4) - [DNSR] - [FYW] - x(3,5) - [FYW] - x - [FYWI] - C
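To make the pattern concrete, here is a sketch of how such a Prosite pattern could be checked against a sequence in Python (the translation to a regular expression is mine, not from the slides):

```python
import re

# The Prosite pattern from the slide, translated to Python regex syntax:
# x -> ".", x(n) -> ".{n}", x(n,m) -> ".{n,m}", [ABC] stays a character class.
pattern = re.compile(
    r"C.{2}PF.[FYWIV].{7}C.{8,10}WC.{4}[DNSR][FYW].{3,5}[FYW].[FYWI]C"
)

def matches_family(seq: str) -> bool:
    """Return True if the protein sequence contains the consensus pattern."""
    return pattern.search(seq) is not None
```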
40. Multiple alignments represented by consensus
- Consensus sequences are very intuitive
- Gaps can be represented by don't-cares
- Aligning sequences with regular expressions can be done easily with DP
- Possible to allow mismatches in searching
- Problems
- Limited power in expressing more divergent positions
- E.g., among 100 seqs, 60 have alanine, 20 have glycine, and 20 have other residues
- Specifying the boundaries of indels is not so easy
- Unaligned regions have more flexibility to evolve
- May have to change the regular expression when a new member of a protein family is identified
41. Solution 2
- For a non-gapped alignment, we can create a position-specific weight matrix (PWM)
- Also called a profile
- May use pseudocounts
42. Scoring by PWMs
x = KCIDNTHIKR
- P(x | M) = Π_i e_i(x_i)
- Random model: each amino acid x_i can be generated with probability q(x_i), so P(x | random) = Π_i q(x_i)
- Log-odds ratio: S = log [ P(x | M) / P(x | random) ] = Σ_i log [ e_i(x_i) / q(x_i) ]
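A minimal sketch of this log-odds scoring in Python (the PWM and background representations here are placeholders, not values from the lecture):

```python
import math

# e[i] maps amino acid -> emission probability at PWM position i;
# q maps amino acid -> frequency under the random background model.
def log_odds_score(x: str, e: list, q: dict) -> float:
    """S = sum_i log( e_i(x_i) / q(x_i) ), the PWM log-odds score of x."""
    assert len(x) == len(e), "sequence must match PWM length (ungapped model)"
    return sum(math.log(e[i][aa] / q[aa]) for i, aa in enumerate(x))
```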
43. PWMs
- Advantages
- Can be used to identify both strong and weak homologies
- Easy to implement and use
- Probabilistic interpretation
- PWMs are used in PSI-BLAST
- Given a sequence, get k similar seqs by BLAST
- Construct a PWM from these sequences
- Search the database for seqs matching the PWM
- Improved sensitivity
- Problem
- Not intuitive to deal with gaps
44. PWMs are HMMs
[Figure: a linear chain of match states M1, ..., Mk between begin (B) and end (E) states; every transition probability is 1, and each state has 20 emission probabilities]
- This can only be used to search for sequences without insertions / deletions (indels)
- We can add additional states for indels!
45. Dealing with insertions
[Figure: the match-state chain B, M1, ..., Mj, ..., Mk, E with a single self-looping insertion state Ij attached after Mj]
- This would allow an arbitrary number of insertions after the j-th position
- I.e., the sequence being compared can be longer than the PWM
46. Dealing with insertions
[Figure: the same chain with an insertion state at every position: I1, ..., Ij, ..., Ik]
- This allows insertions at any position
47. Dealing with deletions
[Figure: the match-state chain with a direct jump transition from Mi to Mj]
- This would allow the subsequence (i, j) to be deleted
- I.e., the sequence being compared can be shorter than the PWM
48. Dealing with deletions
[Figure: the match-state chain with jump transitions between all pairs of match states]
- This would allow an arbitrary length of deletion
- Completely forward connected
- Too many transitions
49. Deletion with silent states
[Figure: a chain of silent deletion states Dj running above the match states Mj, between B and E]
- Still allows any length of deletions
- Fewer parameters
- Less detailed control
50. Full model
[Figure: the profile HMM with deletion states Dj, insertion states Ij, and match states Mj between B and E]
- D: deletion state; I: insertion state; M: matching state
- The algorithms are basically the same as for a regular HMM
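For concreteness, the match-state recurrence for aligning a sequence to this model can be written as below (the standard profile-HMM form from the Durbin book, using a(·,·) for transition probabilities; not reproduced from the slides):

```latex
V^{M}_{j}(i) = e_{M_j}(x_i)\,\max
\begin{cases}
V^{M}_{j-1}(i-1)\; a(M_{j-1}, M_j) \\
V^{I}_{j-1}(i-1)\; a(I_{j-1}, M_j) \\
V^{D}_{j-1}(i-1)\; a(D_{j-1}, M_j)
\end{cases}
```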
51. Using profile HMMs
- Alignment
- Align a sequence to a profile HMM
- Viterbi
- Searching
- Given a sequence and HMMs for different protein families, which family may the sequence belong to?
- Given an HMM for a protein family and many proteins, which proteins may belong to the family?
- Viterbi or forward
- How to score?
- Training / learning
- Given a multiple alignment, how do we construct an HMM?
- Given an unaligned protein family, how do we construct an HMM?
52. Pfam
- A database of protein families
- Developed by Sean Eddy and colleagues while working in Durbin's lab
- Hand-curated seed multiple alignments
- Train an HMM from each seed alignment
- Hand-chosen score thresholds
- Automatic classification of all other protein sequences
- 7973 families in Pfam 18.0 (August 2005)
- (covers 75% of proteins)
53. Model building from aligned sequences
- Matching states for columns with no gaps
- When gaps exist, how do we decide whether they are insertions or matches?
- Heuristic rule: if >50% of a column is gaps, treat it as an insertion; otherwise, as a match
- How to add pseudocounts?
- Simply add one
- According to the background distribution
- Use a mixture of priors (Dirichlet mixtures)
- Sequence weighting
- Very similar sequences should each get less weight
54. Model building from unaligned sequences
- Choose a model length and initial parameters
- Commonly use the average sequence length as the model length
- Baum-Welch or Viterbi training
- Usually necessary to use multiple starting points or other heuristics to escape from local optima
- Align all sequences to the final model using Viterbi
55. Alignment to profile HMMs
- Viterbi or forward-backward
56. Searching
[Figure: each protein in a protein database is scored against the profile HMM]
- Scoring
- Log likelihood: log P(x | M)
- Log odds: log [ P(x | M) / P(x | random) ]
- Length normalization
- Is the match real?
- How does the score compare with those for sequences already in the family?
- How does the score compare with those for random sequences?
57. Example: modeling and searching for globins
- 300 randomly picked globin sequences
- Build a profile HMM from scratch (without a pre-alignment)
- Align 60,000 proteins to the HMM
58.
- Even after length normalization, the log likelihood is still length-dependent
- The log-odds score provides better separation
- Takes amino acid composition into account
- Real globins could still have scores less than 0
59.
- Estimate the mean score and standard deviation for non-globin sequences at each length
- Z-score = (raw score - mean) / (standard deviation)
- The Z-score is length-invariant
- Real globins have positive Z-scores
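A sketch of this length-binned Z-score normalization in Python (the binning-by-length scheme is my reading of the slide; all names are illustrative):

```python
from collections import defaultdict
from statistics import mean, stdev

def z_scores(scores, lengths):
    """Z-normalize raw scores within groups of equal sequence length.

    scores[i] is the raw HMM score of sequence i; lengths[i] its length.
    Groups with fewer than two sequences are skipped.
    """
    by_len = defaultdict(list)
    for s, n in zip(scores, lengths):
        by_len[n].append(s)
    stats = {n: (mean(v), stdev(v)) for n, v in by_len.items() if len(v) > 1}
    return [(s - stats[n][0]) / stats[n][1]
            for s, n in zip(scores, lengths) if n in stats]
```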
60. Next lecture