Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models
1
Introduction to Bioinformatics: Lecture XIII
Profile and Other Hidden Markov Models
  • Jarek Meller
  • Division of Biomedical Informatics,
  • Children's Hospital Research Foundation
  • Department of Biomedical Engineering, UC

2
Outline of the lecture
  • Multiple alignments, family profiles and
    probabilistic models of biological sequences
  • From simple Markov models to Hidden Markov Models
    (HMMs)
  • Profile HMMs: topology and parameter optimization
  • Finding optimal alignments: the Viterbi algorithm
  • Other applications of HMMs

3
Web watch: personalized predictive medicine
Targeting a crucial signal transduction pathway in
lung cancer: an inhibitor of the Epidermal Growth
Factor Receptor (EGFR) catalytic activity that
binds EGFRs carrying specific mutations. Genotyping
the EGFR gene appears to be sufficient to
predict the outcome of the therapy. Paez JG et
al., Science 304.
4
Hidden Markov Models for biological sequences
  • Problems with grammatical structure, such as gene
    finding, family profiles and protein function
    prediction, and transmembrane domain prediction
  • In general, one may think of different biases in
    different fragments of the sequence (due to
    functional role for example) or of different
    states emitting these fragments using different
    probability distributions
  • Durbin et al., Chapters 3 to 6

5
Example: Markov chain model for CpG islands
Motivation: CpG dinucleotides (and not the C-G
base pairs across the two strands) are frequently
methylated at C, with methyl-C mutating at a
higher rate into T; however, the methylation
process is suppressed around regulatory
sequences (e.g. promoters), where CpG islands
occur more often.
Transition probabilities: t_{T,G} = P(a_i = G | a_{i-1} = T), etc.
[Diagram: Markov chain with the four states A, C, G, T and transitions between every pair of states]
The overall probability of a sequence is defined as
the product of transition probabilities along the sequence.
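A minimal sketch (not from the original slides; the transition probabilities below are hypothetical placeholders, not estimated CpG-island parameters) of how such a product is computed:

```python
# Hypothetical first-order Markov chain over DNA.
transitions = {
    ('C', 'G'): 0.27, ('G', 'C'): 0.34,
    ('C', 'C'): 0.37, ('G', 'G'): 0.38,
}

def markov_chain_probability(seq, t, p_start=0.25):
    """P(a1 a2 ... an) = P(a1) * product over i of t[(a_{i-1}, a_i)]."""
    p = p_start  # a uniform start probability is assumed for simplicity
    for prev, curr in zip(seq, seq[1:]):
        p *= t[(prev, curr)]
    return p

# Only C/G transitions are defined above, so the example sequence uses C and G.
print(markov_chain_probability("CGCG", transitions))  # 0.25 * 0.27 * 0.34 * 0.27
```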
6
Example: Hidden Markov model for CpG islands
[Diagram: two copies of the four nucleotide states A, C, G, T, one set for the island model and one for the non-island model, with transitions within and between the two sets]
Adding four more states (a second copy of A, C, G,
T) to represent the island model, as opposed to the
non-island model, with unlikely transitions
between the two models, one obtains a hidden MM for
CpG islands. There is no longer a one-to-one
correspondence between the states and the
symbols: knowing the sequence, we cannot tell which
state the model was in when generating subsequent
letters in the sequence.
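To make the "hidden" part concrete, here is a small generative sketch (with hypothetical parameters, not taken from the slide): the same nucleotide can be emitted by either an island or a non-island state, so the state path cannot be read off the sequence.

```python
import random

# Hypothetical two-model HMM: staying in the current model is likely,
# switching between models is rare; the island model favours C and G.
states = ["island", "non-island"]
trans = {"island":     {"island": 0.9, "non-island": 0.1},
         "non-island": {"island": 0.1, "non-island": 0.9}}
emit = {"island":     {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "non-island": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def generate(length, state="non-island"):
    path, seq = [], []
    for _ in range(length):
        path.append(state)
        letters, probs = zip(*emit[state].items())
        seq.append(random.choices(letters, weights=probs)[0])
        nxt, probs = zip(*trans[state].items())
        state = random.choices(nxt, weights=probs)[0]
    return "".join(seq), path

sequence, hidden_path = generate(20)
print(sequence)     # the observable symbols
print(hidden_path)  # the hidden state path cannot be recovered from the sequence alone
```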
7
Probabilistic models of biological sequences
  • For any probabilistic model the total
    probability of observing a sequence a1 a2 ... an may
    be written as
  • P(a1 a2 ... an) = P(an | an-1 ... a1) P(an-1 | an-2
    ... a1) ... P(a1)
  • In Markov chain models we simply have
  • P(a1 a2 ... an) = P(an | an-1) P(an-1 | an-2) ...
    P(a1)
  • HMMs are a generalization of Markov chain
    models, with some hidden states that emit
    sequence symbols according to certain probability
    distributions and (Markov) transitions between
    pairs of hidden states
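For completeness (a standard formula following Durbin et al., not spelled out on the slide), the joint probability of an observed sequence x1...xn and a hidden state path pi1...pin factorizes as:

```latex
% a_{kl}: transition probabilities, e_k(b): emission probabilities,
% a_{0\pi_1}: transition out of the begin state
P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{n} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}
```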

8
HMMs as probabilistic linguistic models
  • HMMs may in fact be regarded as probabilistic
    finite automata that generate certain
    languages: sets of words (sentences, etc.) with a
    specific grammatical structure.
  • For example, promoter, start, exon, splice
    junction, intron and stop states will appear in a
    linguistic model of a gene, whereas column
    (sequence position), insert and delete states
    will be employed in a linguistic model of a
    (protein) family profile.

9
HMMs for gene prediction: an exon model
10
HMMs and the supervised learning approach
  • Given a training set of aligned sequences, find
    optimal transition and emission probabilities
    that maximize the probability of observing the
    training sequences: the Baum-Welch (Expectation
    Maximization) or Viterbi training algorithms
  • In the recognition phase, with the optimized
    probabilities in hand, we ask how likely it is
    that a new sequence belongs to the family, i.e.
    that it is generated by the HMM with sufficiently
    high probability. The Viterbi algorithm, which is
    in fact dynamic programming in a suitable
    formulation, is used to find an optimal path
    through the states, which defines the optimal
    alignment

11
Ungapped profiles and the corresponding HMMs
Each blue square represents a match state that
emits each letter a with a certain probability
e_j(a), which is defined by the frequency of a at
position j.
[Diagram: Beg -> match states Mj -> End, one match state per alignment column]
Example: training sequences AGAAACT, AGGAATT, TGAATCT.
P(AGAAACT) = 16/81, P(TGGATTT) = 1/81.

Position-specific emission probabilities:
     1    2    3    4    5    6    7
A   2/3   0   2/3   1   2/3   0    0
T   1/3   0    0    0   1/3  1/3   1
C    0    0    0    0    0   2/3   0
G    0    1   1/3   0    0    0    0
Typically, pseudo-counts are added in HMMs to
avoid zero probabilities.
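A short sketch (not from the slides) reproducing the example numbers: the probability of a sequence under the ungapped profile is the product of the position-specific emission probabilities e_j(a) estimated from the three training sequences.

```python
from fractions import Fraction

training = ["AGAAACT", "AGGAATT", "TGAATCT"]

# e_j(a): frequency of letter a in column j of the training alignment
profile = [{a: Fraction(column.count(a), len(column)) for a in "ATCG"}
           for column in zip(*training)]

def profile_probability(seq, profile):
    p = Fraction(1)
    for j, a in enumerate(seq):
        p *= profile[j][a]
    return p

print(profile_probability("AGAAACT", profile))  # 16/81
print(profile_probability("TGGATTT", profile))  # 1/81
```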
12
HMMs and likelihood optimization
13
Likelihood optimization
14
Insertions and deletions in profile HMMs
[Diagram: profile HMM with insert states Ij added between the match states Mj, Beg and End]
Insert states emit symbols just like the match
states; however, their emission probabilities are
typically assumed to follow the
background distribution and thus do not
contribute to log-odds scores. Transitions Ij ->
Ij are allowed and account for an arbitrary
number of inserted residues that are effectively
unaligned (their order within an inserted region
is arbitrary).
15
Insertions and deletions in profile HMMs
[Diagram: profile HMM with delete states Dj added above the match states Mj, between Beg and End]
Deletions are represented by silent states which
do not emit any letters. A sequence of deletions
(with D -> D transitions) may be used to
connect any two match states, accounting for
segments of the multiple alignment that are not
aligned to any symbol in a query sequence
(string). The total cost of a deletion is the
sum of the costs of the individual transitions (M->D,
D->D, D->M) that define this deletion. As in the
case of insertions, both linear and affine gap
penalties can easily be incorporated in this
scheme.
16
Gap penalties: evolutionary and computational
considerations
  • Linear gap penalties:
  • g(k) = -k d
  • for a gap of length k and a constant d
  • Affine gap penalties:
  • g(k) = -d - (k - 1) e
  • where d is the gap opening penalty and e the
    extension gap penalty.
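A small illustration (with hypothetical values d = 8 and e = 1, not from the slides) of how the two penalty schemes behave as a gap grows:

```python
def linear_gap(k, d=8):
    return -k * d             # g(k) = -k d

def affine_gap(k, d=8, e=1):
    return -d - (k - 1) * e   # g(k) = -d - (k - 1) e

for k in (1, 3, 10):
    print(k, linear_gap(k), affine_gap(k))
# The affine form charges a large cost for opening a gap (cf. the M->D
# transition) but a small cost for extending it (cf. the D->D transition).
```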

17
Profile HMMs as a model for multiple alignments
[Diagram: full profile HMM with match states Mj, insert states Ij and delete states Dj between Beg and End]
Example multiple alignment:
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
18
Observed emission and transition counts
Example alignment with match columns 1, 2 and 6 (columns 3-5 treated as an insert region):
AG...C
A-AG.C
AGAA.-
--AAAC
AG...C

[Diagram: profile HMM with states Beg, Mj, Ij, Dj and End, annotated with the observed transition counts along each arrow]

Match emissions
   C0  C1  C2  C3
A   -   4   0   0
C   -   0   0   4
G   -   0   3   0
T   -   0   0   0

Insert emissions
   C0  C1  C2  C3
A   0   0   6   0
C   0   0   0   0
G   0   0   1   0
T   0   0   0   0
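A small sketch that reproduces the emission-count tables above, under the stated assumption that columns 1, 2 and 6 of the example alignment are match states and columns 3-5 form a single insert region:

```python
from collections import Counter

alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
match_columns = [0, 1, 5]   # 0-based indices of the match columns

match_emissions = [Counter() for _ in match_columns]
insert_emissions = Counter()
for row in alignment:
    for j, col in enumerate(match_columns):
        if row[col] != "-":
            match_emissions[j][row[col]] += 1
    for col, symbol in enumerate(row):
        if col not in match_columns and symbol != "-":
            insert_emissions[symbol] += 1

print(match_emissions)   # A=4 in C1, G=3 in C2, C=4 in C3 (as in the table)
print(insert_emissions)  # A=6, G=1 emitted by the insert state
```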
19
Computing emission and transition probabilities
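The slide's exact formulas are not reproduced in this transcript; a common choice (Laplace pseudocounts, as in Durbin et al.) converts the counts above into probabilities as e_j(a) = (c_j(a) + 1) / (sum over a' of c_j(a') + 4), and analogously for transition probabilities:

```python
def counts_to_probabilities(counts, alphabet="ACGT", pseudocount=1.0):
    """Add a pseudocount to every symbol and normalize."""
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

print(counts_to_probabilities({"A": 4}))          # A: 5/8, C, G, T: 1/8 each
print(counts_to_probabilities({"A": 6, "G": 1}))  # A: 7/11, G: 2/11, C, T: 1/11
```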
20
Optimal alignment corresponds to a path with the
highest probability (or log-odds score)
[Diagram: profile HMM with states Dj, Ij, Mj between Beg and End]
Problem: Given the above model, with the emission and
transition probabilities obtained previously,
find the optimal path (alignment) for the query
sequence AGAC.
Problem: Find the emission and
transition counts assuming that the 4th column in
the multiple alignment example shown earlier
corresponds to another match state (and not an
insert state).
21
Outline of the Viterbi algorithm
[Diagram: profile HMM with states Dj, Ij, Mj between Beg and End, over which the Viterbi recursion is carried out]
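A generic Viterbi decoder in its standard dynamic-programming form (a sketch for an arbitrary HMM, not the profile-specific recursion shown on the slide; the parameters reuse the hypothetical island / non-island model from the earlier sketch):

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    # v[i][k]: log-probability of the best state path ending in state k at position i
    v = [{k: math.log(start_p[k]) + math.log(emit_p[k][seq[0]]) for k in states}]
    back = []
    for i in range(1, len(seq)):
        v.append({})
        back.append({})
        for k in states:
            best = max(states, key=lambda j: v[i - 1][j] + math.log(trans_p[j][k]))
            back[-1][k] = best
            v[i][k] = (v[i - 1][best] + math.log(trans_p[best][k])
                       + math.log(emit_p[k][seq[i]]))
    # Traceback from the best final state.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path)), max(v[-1].values())

states = ["island", "non-island"]
start = {"island": 0.5, "non-island": 0.5}
trans = {"island": {"island": 0.9, "non-island": 0.1},
         "non-island": {"island": 0.1, "non-island": 0.9}}
emit = {"island": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "non-island": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

path, log_p = viterbi("TTACGCGCGTTT", states, start, trans, emit)
print(path)   # most probable hidden state path
print(log_p)  # its log-probability
```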
22
Profile HMMs for local alignments
The trick consists of adding two extra insert
states Q that model the flanking unaligned sequence
using background frequencies qa and a large
self-transition probability tQ,Q.
[Diagram: profile HMM core (Dj, Ij, Mj) preceded and followed by a flanking insert state Q: Beg -> Q -> core -> Q -> End]
23
Summary
  • In general, when the states generating the training
    sequences (alignments) are not known, an iterative
    procedure such as Baum-Welch or Viterbi training is required
  • Problems with local minima and with the choice of
    topology (length of the profile)
  • Excellent results in family assignment (SAM,
    PFAM), gene prediction, transmembrane domain
    recognition, etc.

24
Outline of the lecture