1
CS5263 Bioinformatics
  • Lecture 13: Hidden Markov Models and applications

2
Project ideas
  • Implement an HMM
  • Iterative refinement MSA
  • Motif finder
  • Implement an algorithm in a paper
  • Write a survey paper

3
NAR Web Server Issue
  • Every December, Nucleic Acids Research has a
    special issue on web servers
  • Not necessary to invent original methods
  • Biologists appreciate web tools
  • You get a nice publication
  • And potentially many citations if your tool is
    useful (think about BLAST!)
  • Talk to me if you want to work on this project

4
  • Review of last lectures

5
Problems in HMM
  • Decoding
  • Predict the state of each symbol
  • Viterbi algorithm
  • Forward-backward algorithm
  • Evaluation
  • The probability that a sequence is generated by a
    model
  • Forward-backward algorithm
  • Learning
  • unsupervised
  • Baum-Welch
  • Viterbi

6
A general HMM
[Diagram: K states labeled 1, 2, 3, …, K, all pairwise connected]
K states, completely connected (possibly with 0
transition probabilities). Each state has a set of
emission probabilities (emission probabilities
may be 0 for some symbols in some states).


7
The Viterbi algorithm
  • V(j, i+1) = max { V(1,i) w(1,j) r(j, xi+1),
                      V(2,i) w(2,j) r(j, xi+1),
                      V(3,i) w(3,j) r(j, xi+1),
                      …,
                      V(k,i) w(k,j) r(j, xi+1) }
  • Or simply
  • V(j, i+1) = maxl V(l,i) w(l,j) r(j, xi+1)
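Below is a minimal Python sketch of this recurrence in the slide's notation (w for transition probabilities, r for emission probabilities); the initial distribution pi and the toy parameters are illustrative assumptions, not part of the lecture.

  import numpy as np

  def viterbi(x, pi, w, r):
      """Most probable state path for symbol sequence x.
      pi[j] = initial prob., w[l, j] = transition l->j, r[j, s] = emission."""
      K, N = len(pi), len(x)
      V = np.zeros((K, N))               # V[j, i]: best score ending in state j at i
      ptr = np.zeros((K, N), dtype=int)  # back-pointers for the traceback
      V[:, 0] = pi * r[:, x[0]]
      for i in range(1, N):
          for j in range(K):
              scores = V[:, i - 1] * w[:, j]
              ptr[j, i] = np.argmax(scores)
              V[j, i] = scores[ptr[j, i]] * r[j, x[i]]
      path = [int(np.argmax(V[:, N - 1]))]
      for i in range(N - 1, 0, -1):      # trace back the best path
          path.append(int(ptr[path[-1], i]))
      return path[::-1]

  # Toy two-state example with made-up parameters
  pi = np.array([0.5, 0.5])
  w = np.array([[0.9, 0.1], [0.1, 0.9]])
  r = np.array([[0.5, 0.5], [0.9, 0.1]])
  print(viterbi([0, 0, 1, 0, 0], pi, w, r))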

8
  • Viterbi finds the single best path
  • In many cases it is also interesting to know what
    are the other possible paths
  • Viterbi assigns a state to each symbol
  • In many cases it is also interesting to know the
    probability that the assignment is correct
  • Posterior decoding using Forward-backward
    algorithm

9
The forward algorithm
10
[Diagram: the recursion over positions; note that the backward quantity shown does not include the emission probability of xi]
11
Forward-backward algorithm
  • fk(i) = prob. of getting to position i in state k,
    emitting x1…xi
  • bk(i) = prob. of emitting xi+1…xN, given state k at
    position i
  • fk(i) bk(i) = P(πi = k, x)

12
[Diagram: sequence/state trellis with forward probabilities computed from the left and backward probabilities from the right]
P(πi = k | x) = fk(i) bk(i) / P(x)
Space: O(KN)   Time: O(K^2 N)
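A matching sketch of posterior decoding, under the same assumed parameter layout as the Viterbi sketch above; real-length sequences need scaling or log-space arithmetic to avoid underflow.

  import numpy as np

  def posterior(x, pi, w, r):
      """P(state k at position i | x) for all k, i, via forward-backward."""
      K, N = len(pi), len(x)
      f = np.zeros((K, N))   # f[k, i] = P(x1..xi, state_i = k)
      b = np.zeros((K, N))   # b[k, i] = P(xi+1..xN | state_i = k)
      f[:, 0] = pi * r[:, x[0]]
      for i in range(1, N):
          f[:, i] = r[:, x[i]] * (w.T @ f[:, i - 1])
      b[:, N - 1] = 1.0
      for i in range(N - 2, -1, -1):
          b[:, i] = w @ (r[:, x[i + 1]] * b[:, i + 1])
      px = f[:, N - 1].sum()   # P(x): total probability of the sequence
      return f * b / px        # P(state_i = k | x) = f_k(i) b_k(i) / P(x)

  pi = np.array([0.5, 0.5])
  w = np.array([[0.9, 0.1], [0.1, 0.9]])
  r = np.array([[0.5, 0.5], [0.9, 0.1]])
  print(posterior([0, 0, 1, 0, 0], pi, w, r).round(3))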
13
Learning
  • When the states are known
  • supervised learning
  • When the states are unknown
  • Estimate parameters from unlabeled data
  • unsupervised learning

How to find CpG islands in the porcupine genome?
14
Basic idea
  • Estimate our best guess on the model parameters θ
  • Use θ to predict the unknown labels
  • Re-estimate a new set of θ
  • Repeat steps 2-3 until convergence

Two ways
15
Viterbi training
  • Estimate our best guess on the model parameters θ
  • Find the Viterbi path using the current θ
  • Re-estimate a new set of θ based on the Viterbi
    path
  • Repeat steps 2-3 until convergence

16
Baum-Welch training
  • Estimate our best guess on the model parameters θ
  • Find P(πi = k | x, θ) using the forward-backward
    algorithm (the E-step)
  • Re-estimate a new set of θ based on all possible
    paths (the M-step)
  • Repeat steps 2-3 until convergence
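A sketch of one such iteration for a single sequence, reusing the forward/backward recursions from the posterior sketch above; it omits scaling and pseudocounts, both of which a real implementation needs.

  import numpy as np

  def forward_backward(x, pi, w, r):
      K, N = len(pi), len(x)
      f = np.zeros((K, N)); b = np.zeros((K, N))
      f[:, 0] = pi * r[:, x[0]]
      for i in range(1, N):
          f[:, i] = r[:, x[i]] * (w.T @ f[:, i - 1])
      b[:, N - 1] = 1.0
      for i in range(N - 2, -1, -1):
          b[:, i] = w @ (r[:, x[i + 1]] * b[:, i + 1])
      return f, b, f[:, N - 1].sum()

  def baum_welch_step(x, pi, w, r):
      """One EM iteration; returns re-estimated (pi, w, r)."""
      K, S, N = len(pi), r.shape[1], len(x)
      f, b, px = forward_backward(x, pi, w, r)   # E-step quantities
      gamma = f * b / px                         # gamma[k, i] = P(state_i = k | x)
      A = np.zeros((K, K)); E = np.zeros((K, S))
      for i in range(N - 1):
          # expected count of each k -> l transition being used at position i
          A += np.outer(f[:, i], r[:, x[i + 1]] * b[:, i + 1]) * w / px
      for i in range(N):
          E[:, x[i]] += gamma[:, i]              # expected emission counts
      # M-step: normalize expected counts into probabilities
      return gamma[:, 0], A / A.sum(1, keepdims=True), E / E.sum(1, keepdims=True)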
17
Viterbi vs Baum-Welch training
  • Viterbi training
  • Returns a single path
  • Each position labeled with a fixed state
  • Each transition counts as one
  • Each emission also counts as one
  • Baum-Welch training
  • Does not return a single path
  • Considers the probability that each transition is
    used
  • and the probability that a symbol is generated by
    a certain state
  • They only contribute partial counts

18
Probability that a transition is used
P(πi = k, πi+1 = l | x) = fk(i) w(k, l) r(l, xi+1) bl(i+1) / P(x)
19
Viterbi vs Baum-Welch training
  • Both are guaranteed to converge
  • Baum-Welch improves the likelihood of the data in
    each iteration
  • True EM (expectation-maximization)
  • Viterbi improves the probability of the most
    probable path in each iteration
  • EM-like

20
Today
  • Some practical issues in HMM modeling
  • HMMs for sequence alignment

21
Duration modeling
  • For any sub-path, the probability consists of two
    components
  • The product of emission probabilities
  • Depends on the symbols and the state path
  • The product of transition probabilities
  • Depends on the state path

22
Duration modeling
  • Model a stretch of DNA for which the distribution
    does not change for a certain length
  • The simplest model implies that
  • P(length = L) = p^(L-1) (1 - p)
  • i.e., the length follows a geometric distribution
    (a small numeric check appears after this slide's
    diagram)
  • Not always appropriate

Duration: the number of steps that a state is
used consecutively without visiting other states
[Diagram: a single state s with self-loop probability p and exit probability 1-p; plot of P(length = L) against L]
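A tiny numeric check of the geometric duration formula above (p is the self-transition probability of state s; the value 0.9 is made up):

  p = 0.9
  probs = [p ** (L - 1) * (1 - p) for L in range(1, 6)]
  print([round(q, 4) for q in probs])  # [0.1, 0.09, 0.081, 0.0729, 0.0656]
  print(1 / (1 - p))                   # expected duration: 10 steps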
23
Duration models
[Diagram: a chain of copies of state s in which only the last copy has a self-loop p and exit 1-p]
Min, then geometric (a guaranteed minimum duration
followed by a geometric tail)
[Diagram: a chain of copies of state s, each with self-loop p and forward transition 1-p]
Negative binomial (the duration is a sum of
geometric durations)
24
Explicit duration modeling
[Diagram: generalized HMM for gene finding with Intergenic, Exon, and Intron states; e.g. P(A|I) = 0.3, P(C|I) = 0.2, P(G|I) = 0.2, P(T|I) = 0.3; each state carries an explicit length distribution, e.g. the empirical intron length distribution P(L) vs. L]
Generalized HMM. Often used in gene finders.
25
Silent states
  • Silent states are states that do not emit symbols
    (e.g., the state 0 in our previous examples)
  • Silent states can be introduced in HMMs to reduce
    the number of transitions

26
Silent states
  • Suppose we want to model a sequence in which
    arbitrary deletions are allowed (later this
    lecture)
  • In that case we need a completely forward-
    connected HMM (O(m^2) edges)

27
Silent states
  • If we use silent states, we need only O(m) edges
  • But nothing comes for free

Algorithms can be modified easily to deal with
silent states, as long as there are no loops made
entirely of silent states.
However, suppose we want to assign high probability
to 1→5 and 2→4; there is then no way to also have
low probability on 1→4 and 2→5.
28
Modular design of HMM
  • HMMs can be designed modularly
  • Each module has its own (silent) begin / end states
  • Each module communicates with other modules only
    through its begin/end states

29
[Diagram: two modules, each with silent begin (B) and end (E) states and emitting states A, C, G, T; the modules connect to each other only through their begin/end states]
HMM modules and non-HMM modules can be mixed
30
Explicit duration modeling
(Recap of slide 24: explicit duration modeling with a
generalized HMM, as used in gene finders.)
31
HMM applications
  • Pair-wise sequence alignment
  • Multiple sequence alignment
  • Gene finding
  • Speech recognition (a good tutorial is on the
    course website)
  • Machine translation
  • Many others

32
FSA for global alignment
[Diagram: three-state finite-state automaton for global alignment; a middle state "Xi and Yj aligned", an upper state "Xi aligned to a gap", and a lower state "Yj aligned to a gap"; entering a gap state costs d (gap open) and staying in one costs e (gap extension)]
33
HMM for global alignment
[Diagram: pair HMM for global alignment; a match state "Xi and Yj aligned" emits pairs with p(xi, yj) (16 emission probabilities), a state "Xi aligned to a gap" emits with q(xi) (4 emission probabilities), and a state "Yj aligned to a gap" emits with q(yj) (4 emission probabilities); transitions into a gap state have probability δ, staying in the match state 1-2δ, staying in a gap state ε, and returning to the match state 1-ε]
Pair-wise HMMs emit two sequences
simultaneously. The algorithms are similar to those
of a regular HMM but need an additional dimension:
e.g. in Viterbi, we need Vk(i, j) instead of Vk(i).
34
HMM and FSA for alignment
  • After proper transformation between the
    probabilities and substitution scores, the two
    are identical
  • s(a, b) ↔ log [ p(a, b) / (q(a) q(b)) ]
    (illustrated in the sketch after this list)
  • d ↔ -log δ
  • e ↔ -log ε
  • Details in Durbin book chap 4
  • Finding an optimal FSA alignment is equivalent to
    finding the most probable path with Viterbi
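The sketch below illustrates the score/probability correspondence above with invented numbers (not a real substitution matrix): scores are log-odds of joint match probabilities against the background.

  import numpy as np

  q = np.array([0.25, 0.25, 0.25, 0.25])        # background frequencies (A, C, G, T)
  p = np.full((4, 4), 0.02) + np.eye(4) * 0.15  # invented joint probabilities, match-heavy
  p /= p.sum()                                  # normalize to a proper distribution
  s = np.log2(p / np.outer(q, q))               # log-odds scores in bits
  print(s.round(2))                             # positive on the diagonal, negative off it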

35
HMM for pair-wise alignment
  • Theoretical advantages
  • Full probabilistic interpretation of alignment
    scores
  • Probability of all alignments instead of the best
    alignment (forward-backward alignment)
  • Posterior probability that Ai is aligned to Bj
  • Sampling sub-optimal alignments
  • Not commonly used in practice
  • Needleman-Wunsch and Smith-Waterman algorithms
    work pretty well, and are more intuitive to
    biologists
  • Other reasons?

36
Other applications
  • HMM for multiple alignment
  • Widely used
  • HMM for gene finding
  • Foundation for most gene finders
  • Include much knowledge-based fine-tuning and GHMM
    extensions
  • We'll only discuss the basic ideas

37
HMM for multiple seq alignment
  • Proteins form families both across and within
    species
  • e.g. globins, zinc fingers
  • Descended from a common ancestor
  • Typically have similar three-dimensional
    structures, functions, and significant sequence
    similarity
  • Identifying families is very useful: it suggests
    functions
  • So search and alignment are both useful
  • Multiple alignment is hard
  • One very useful approach profile-HMM

38
Say we already have a database of reliable
multiple alignments of protein families. When a new
protein comes, how do we align it to the existing
alignments and find the family that the protein
may belong to?
39
Solution 1
  • Use regular expressions to represent the consensus
    sequences
  • Implemented in the Prosite database
  • for example
  • C - x(2) - P - F - x - [FYWIV] - x(7) - C -
    x(8,10) - W - C - x(4) - [DNSR] - [FYW] - x(3,5)
    - [FYW] - x - [FYWI] - C
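Such a pattern can be matched by translating it into a standard regular expression; the sketch below uses the usual Prosite conventions (x = any residue, parenthesized counts = repetition) and an invented test sequence.

  import re

  def prosite_to_regex(pattern):
      """x -> '.', x(2) -> '.{2}', x(8,10) -> '.{8,10}'; [ABC] stays a class."""
      rx = pattern.replace("-", "").replace("x", ".")
      return re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", rx)

  pat = ("C-x(2)-P-F-x-[FYWIV]-x(7)-C-x(8,10)-W-C-x(4)"
         "-[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C")
  rx = prosite_to_regex(pat)
  print(rx)
  # An invented sequence built to match the pattern:
  print(re.search(rx, "CAAPFAFAAAAAAACAAAAAAAAWCAAAADFAAAFAFC") is not None)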

40
Multi-alignments represented by consensus
  • Consensus sequences are very intuitive
  • Gaps can be represented by don't-cares
  • Aligning sequences with regular expressions can
    be done easily with DP
  • Possible to allow mismatches in searching
  • Problems
  • Limited power in expressing more divergent
    positions
  • E.g. among 100 seqs, 60 have Alanine, 20 have
    Glycine, 20 others
  • Specifying indel boundaries is not so easy
  • unaligned regions have more flexibility to
    evolve
  • May have to change the regular expression when a
    new member of a protein family is identified

41
Solution 2
  • For a non-gapped alignment, we can create a
    position-specific weight matrix (PWM)
  • Also called a profile
  • May use pseudocounts

42
Scoring by PWMs
x = KCIDNTHIKR
P(x | M) = ∏i ei(xi)
Random model: each amino acid xi is generated with
probability q(xi), so P(x | random) = ∏i q(xi)
Log-odds ratio:
S = log [ P(x | M) / P(x | random) ] = Σi log [ ei(xi) / q(xi) ]
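A minimal scorer for the log-odds formula above; the 3-column matrix over a 4-letter alphabet is invented for brevity (a protein PWM would have 20 rows per column).

  import math

  def pwm_score(x, e, q):
      """S = sum_i log2( e_i(x_i) / q(x_i) ), in bits."""
      return sum(math.log2(e[i][a] / q[a]) for i, a in enumerate(x))

  q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # background
  e = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},    # column-wise emissions
       {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
  print(round(pwm_score("AGT", e, q), 2))  # a strong match scores ~4.46 bits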
43
PWMs
  • Advantage
  • Can be used to identify both strong and weak
    homologies
  • Easy to implement and use
  • Probabilistic interpretation
  • PWMs are used in PSI-BLAST
  • Given a sequence, get k similar seqs by BLAST
  • Construct a PWM with these sequences
  • Search the database for seqs matching the PWM
  • Improved sensitivity
  • Problem
  • Not intuitive to deal with gaps

44
PWMs are HMMs
[Diagram: linear chain B → M1 → … → Mk → E; every transition probability is 1; each matching state has 20 emission probabilities]
  • This can only be used to search for sequences
    without insertions / deletions (indels)
  • We can add additional states for indels!

45
Dealing with insertions
[Diagram: chain B → M1 → … → Mj → … → Mk → E with a self-looping insertion state Ij attached after Mj]
  • This would allow an arbitrary number of
    insertions after the j-th position
  • i.e. the sequence being compared can be longer
    than the PWM

46
Dealing with insertions
[Diagram: chain B → M1 → … → Mj → … → Mk → E with a self-looping insertion state (I1 … Ij … Ik) at every position]
  • This allows insertions at any position

47
Dealing with Deletions
[Diagram: chain B → … → Mi → … → Mj → … → E with a direct jump edge from Mi to Mj]
  • This would allow the subsequence (i, j) to be
    deleted
  • i.e. the sequence being compared can be shorter
    than the PWM

48
Dealing with Deletions
[Diagram: chain B → M1 → … → Mk → E with forward jump edges between all pairs of matching states]
  • This would allow an arbitrary length of deletion
  • Completely forward connected
  • Too many transitions

49
Deletion with silent states
[Diagram: a silent deletion state Dj sits above each matching state Mj in the chain B → Mj → E; paths through Dj skip Mj]
  • Still allows any length of deletions
  • Fewer parameters
  • Less detailed control

50
Full model
  • Called profile-HMM

[Diagram: for each position j, a matching state Mj, an insertion state Ij, and a silent deletion state Dj, between begin (B) and end (E) states]
D: deletion state; I: insertion state; M: matching state
The algorithms are basically the same as for a regular HMM.
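A sketch of the state layout just described, enumerating a simplified transition set (an assumption; real profile HMMs such as HMMER's add further states and details):

  def profile_hmm_transitions(k):
      """Edges of a length-k profile HMM with match (M), insert (I),
      and silent delete (D) states between begin (B) and end (E)."""
      edges = [("B", "M1"), ("B", "D1"), ("B", "I0"),
               ("I0", "M1"), ("I0", "I0"), ("I0", "D1")]
      for j in range(1, k + 1):
          nxt_m = f"M{j + 1}" if j < k else "E"
          nxt_d = f"D{j + 1}" if j < k else "E"
          edges += [(f"M{j}", nxt_m), (f"M{j}", f"I{j}"), (f"M{j}", nxt_d),
                    (f"I{j}", nxt_m), (f"I{j}", f"I{j}"), (f"I{j}", nxt_d),
                    (f"D{j}", nxt_m), (f"D{j}", f"I{j}"), (f"D{j}", nxt_d)]
      return sorted(set(edges))

  print(len(profile_hmm_transitions(3)))  # 30 edges: O(k), not O(k^2)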
51
Using profile HMM
  • Alignment
  • Align a sequence to a profile HMM
  • Viterbi
  • Searching
  • Given a sequence and HMMs for different protein
    families, which family might the sequence belong
    to?
  • Given an HMM for a protein family and many
    proteins, which proteins may belong to the family?
  • Viterbi or forward
  • How to score?
  • Training / Learning
  • Given a multiple alignment, how to construct an
    HMM?
  • Given an unaligned protein family, how to
    construct an HMM?

52
Pfam
  • A database of protein families
  • Developed by Sean Eddy and colleagues while
    working in Durbin's lab
  • Hand-curated seed multiple alignments
  • HMMs trained from the seed alignments
  • Hand-chosen score thresholds
  • Automatic classification of all other protein
    sequences
  • 7973 families in Pfam 18.0 (8/2005)
  • (covers 75% of proteins)

53
Model building from aligned sequences
  • Matching states for columns with no gaps
  • When gaps exist, how do we decide whether they are
    insertions or matches?
  • Heuristic rule: if >50% of a column is gaps, make
    it an insertion; otherwise, a matching state
  • How to add pseudocounts?
  • Simply add one
  • According to the background distribution
  • Use a mixture of priors (Dirichlet mixtures)
  • Sequence weighting
  • Very similar sequences should each get less weight

54
Model building from unaligned sequences
  • Choose a model length and initial parameters
  • Commonly, the average sequence length is used as
    the model length
  • Baum-Welch or Viterbi training
  • Usually necessary to use multiple starting points
    or other heuristics to escape from local optima
  • Align all sequences to the final model using
    Viterbi

55
Alignment to profile HMMs
Viterbi or forward-backward
56
Searching
[Diagram: a protein database is scanned against the HMM, producing a score for each protein]
  • Scoring
  • Log likelihood: log P(X | M)
  • Log odds: log [ P(X | M) / P(X | random) ]
  • Length normalization
  • Is the match real?
  • How does the score compare with those for
    sequences already in the family?
  • How does the score compare with those for random
    sequences?
57
Example: modeling and searching for globins
  • 300 randomly picked globin sequences
  • Build a profile HMM from scratch (without
    pre-alignment)
  • Align 60,000 proteins to the HMM

58
  • Even after length normalization, the log-likelihood
    score is still length-dependent
  • The log-odds score provides better separation
  • It takes amino acid composition into account
  • Real globins can still have scores less than 0

59
  • Estimate the mean score and standard deviation of
    non-globin sequences at each length
  • Z-score = (raw score - mean) / (standard
    deviation)
  • The Z-score is length-invariant
  • Real globins have positive scores
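A small numeric sketch of this calibration for one length bin (the background scores are invented):

  import statistics

  def z_score(raw, background):
      """Z = (raw score - mean) / (standard deviation), per length bin."""
      return (raw - statistics.mean(background)) / statistics.stdev(background)

  bg = [-8.1, -7.4, -9.0, -6.8, -7.7]  # non-globin scores at this length
  print(round(z_score(5.2, bg), 2))    # a real globin stands far above 0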

60
Next lecture
  • Gene finding
  • HMM wrap up