1
Protein Analysis 4
  • Hidden Markov models
  • http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi

2
Alignment of sequences with a structure: hidden
Markov models
Hidden Markov Models
  • HMMs are well suited to describing correlations
    among neighbouring sites
  • the probabilistic framework allows for realistic
    modelling
  • well-developed mathematical methods provide:
  • the best solution (most probable path through the
    model)
  • a confidence score (posterior probability of any
    single solution)
  • inference of structure (posterior probability of
    using model states)
  • we can align multiple sequences using complex
    models and simultaneously predict their internal
    structure

3
Coin game
  • Fair coin: p(1) = p(0) = 0.5
  • 01110001101011100
  • Biased coin: p(1) = 0.8, p(0) = 0.2
  • 111011111100100111
  • Observed series: 010111001010011
  • P(fair coin) = 0.5^15 ≈ 3.1 × 10^-5
  • P(biased coin) = 0.8^8 × 0.2^7 ≈ 2.1 × 10^-9
  • P(biased coin)/P(fair coin) ≈ 7 × 10^-5
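As a sanity check, here is a minimal Python sketch of the arithmetic above; the observed series and the two coin models are taken from the slide:

```python
# Likelihood of the observed series under the fair and biased coin models.
observed = "010111001010011"
ones = observed.count("1")    # 8
zeros = observed.count("0")   # 7

p_fair = 0.5 ** len(observed)          # 0.5^15 ≈ 3.1e-05
p_biased = 0.8 ** ones * 0.2 ** zeros  # 0.8^8 × 0.2^7 ≈ 2.1e-09

print(p_biased / p_fair)  # ≈ 7e-05: the series favours the fair coin
```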

4
Multiple sequence alignment (msa)
  • A) define characters for phylogenetic analysis
  • B) search for additional family members

SeqA  N F L S
SeqB  N F S
SeqC  N K Y L S
SeqD  N Y L S
[Figure: the four sequences (NYLS, NKYLS, NFS, NFLS) at the
leaves of a phylogenetic tree, with the changes K, -L and Y→F
marked on its branches]
5
Markov process
  • State depends only on previous state
  • Markov models for sequences
  • States have emission probabilities
  • State transition probabilities

6
Hidden Markov model (HMM)
  • a probabilistic model for time series or linear
    sequences
  • used in speech recognition
  • modelling of protein families
  • describes a probability distribution over
    sequence space
  • the probabilities sum to 1
  • a generative model (see the sketch below)
  • a sequence can be aligned and scored against the
    model
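To illustrate the generative view, a minimal Python sketch that samples a sequence from a toy two-state HMM; all transition and emission numbers here are invented for the example:

```python
import random

# Invented toy parameters: two emitting states plus an explicit end state.
transitions = {
    "begin": {"1": 1.0},
    "1": {"1": 0.6, "2": 0.4},
    "2": {"2": 0.5, "end": 0.5},
}
emissions = {
    "1": {"a": 0.8, "b": 0.2},
    "2": {"a": 0.3, "b": 0.7},
}

def draw(table):
    """Sample a key from a {key: probability} table."""
    r, acc = random.random(), 0.0
    for key, p in table.items():
        acc += p
        if r < acc:
            break
    return key

def generate():
    """Walk from 'begin' to 'end', emitting one symbol per visited state."""
    state, symbols = draw(transitions["begin"]), []
    while state != "end":
        symbols.append(draw(emissions[state]))
        state = draw(transitions[state])
    return "".join(symbols)

print(generate())  # e.g. 'aaba'
```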

7
Hidden Markov model for a two-state variable
[Figure: a two-state HMM with transition probabilities t(1,1),
t(1,2), t(2,2), t(2,end) into an end state, and emission
probabilities p1(a), p1(b), p2(a), p2(b);
t = transition probability, p = emission probability]
State path π: 1 → 1 → 2 → end
Observed symbol sequence x: a b a
P(x, π | HMM) = p1(a) · t(1,1) · p1(b) · t(1,2) · p2(a) · t(2,end)
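To make the product concrete, a small Python sketch scoring exactly this path and sequence; the slide leaves t and p symbolic, so the numeric values below are invented for illustration:

```python
# Invented values for the transition (t) and emission (p) probabilities.
t = {("1", "1"): 0.6, ("1", "2"): 0.4, ("2", "end"): 0.5}
p = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}

path = ["1", "1", "2"]  # state path pi: 1 -> 1 -> 2 -> end
x = ["a", "b", "a"]     # observed symbol sequence

# P(x, pi | HMM): each state's emission times the transition that leaves it.
prob = 1.0
for i, (state, symbol) in enumerate(zip(path, x)):
    prob *= p[state][symbol]
    nxt = path[i + 1] if i + 1 < len(path) else "end"
    prob *= t[(state, nxt)]

print(prob)  # = p1(a)·t(1,1) · p1(b)·t(1,2) · p2(a)·t(2,end)
```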
8
profile HMM
[Figure: a profile HMM with begin and end states and columns 1-4
of match, insert and delete states]
match state: emits one of the 20 amino acids
insert state: emits one of the 20 amino acids
delete, begin and end states: silent (emit nothing)
9
profile HMM
  • a linear model
  • scoring corresponds to a log-odds score
  • even the gap penalty formally has the affine form
    a + b(x-1)
  • uses:
  • multiple sequence alignment
  • recognition of homologues: with what probability
    does the HMM generate the test sequence?
  • alignment of a sequence against the model

10
A possible hidden Markov model for the protein
ACCY.
11
HMM with multiple paths through the model for
ACCY. The highlighted path is only one of several
possibilities.
12
Viterbi algorithm
  • computes the sequence score over the most likely
    path rather than over the sum of all paths (see
    the sketch below)
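A minimal Viterbi implementation in Python, applied to the coin game of slide 3; the 0.9/0.1 switching probabilities between the fair and biased coins are assumptions made for the example:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the log-probability and states of the single most likely path."""
    # v[s]: best log-probability of any path that ends in state s
    v = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back-pointers for the traceback
    for symbol in obs[1:]:
        ptr, nv = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] + math.log(trans_p[r][s]))
            ptr[s] = best
            nv[s] = v[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
        back.append(ptr)
        v = nv
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[last], path[::-1]

# Coin game from slide 3; the switching probabilities are assumed.
states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"0": 0.5, "1": 0.5},
          "biased": {"0": 0.2, "1": 0.8}}
score, path = viterbi("010111001010011", states, start_p, trans_p, emit_p)
print(score, "".join("F" if s == "fair" else "B" for s in path))
```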

13
Forward algorithm
  • similar to Viterbi, except that a sum rather than
    a maximum is computed (see the sketch below)
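A matching forward-algorithm sketch, using the same toy coin HMM as the Viterbi example (parameters assumed, not from the slides):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of obs, summed over all state paths."""
    # f[s]: probability of emitting the prefix so far and being in state s
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: sum(f[r] * trans_p[r][s] for r in states) * emit_p[s][symbol]
             for s in states}
    return sum(f.values())

# Same toy coin HMM as in the Viterbi sketch above (parameters assumed).
states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"0": 0.5, "1": 0.5},
          "biased": {"0": 0.2, "1": 0.8}}
print(forward("010111001010011", states, start_p, trans_p, emit_p))
```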

14
What the Score Means
  • Once the probability of a sequence has been
    determined, its score can be computed. Because
    the model is a generalization of how amino acids
    are distributed in a related group (or class) of
    sequences, a score measures the probability that
    a sequence belongs to the class. A high score
    implies that the sequence of interest is probably
    a member of the class, and a low score implies it
    is probably not a member.

15
Optimisation
  • The Baum-Welch algorithm is a variation of the
    forward algorithm described earlier. It begins
    with a reasonable guess for an initial model and
    then calculates a score for each sequence in the
    training set over all possible paths through this
    model. During the next iteration, a new set of
    expected emission and transition probabilities is
    calculated. The updated parameters replace those
    in the initial model, and the training sequences
    are scored against the new model. The process is
    repeated until model convergence, meaning there
    is very little change in parameters between
    iterations.
  • The Viterbi algorithm is less computationally
    expensive than Baum-Welch.

16
Heuristics
  • There is no guarantee that a model built with
    either the Baum-Welch or Viterbi algorithm has
    parameters which maximize the probability of the
    training set. As in many iterative methods,
    convergence indicates only that a local maximum
    has been found. Several heuristic methods have
    been developed to deal with this problem.

17
Heuristics parallel trials
  • start with several initial models and proceed to
    build several models in parallel. When the models
    converge to several different local optima, the
    probability of each model given the training set
    is computed, and the model with the highest
    probability wins.

18
Heuristics add noise
  • add noise, or random data, into the mix at each
    iteration of the model building process.
    Typically, an annealing schedule is used. The
    schedule controls the amount of noise added
    during each iteration. Less and less noise is
    added as iterations proceed. The decrease is
    either linear or exponential. The effect is to
    delay the convergence of the model. When the
    model finally does converge, it is more likely to
    have found a good approximation to the global
    maximum.

19
Sequence weighting
20
Overfitting and regularization
CGGSLLNAN--TVLTAAHC
CGGSLIDNK-GWILTAAHC
CGGSLIRQG--WVMTAAHC
CGGSLIREDSSFVLTAAHC
21
Dirichlet mixtures
  • A sophisticated application of this method is
    known as Dirichlet mixtures. The mixtures are
    created by statistical analysis of the
    distribution of amino acids at particular
    positions in a large number of proteins. The
    mixtures are built from smaller components known
    as Dirichlet densities.
  • A Dirichlet density is a probability density over
    all possible combinations of amino acids
    appearing in a given position. It gives high
    probability to certain distributions and low
    probability to others. For example, a particular
    Dirichlet density may give high probability to
    conserved distributions where a single amino acid
    predominates over all others. Another possibility
    is a density where high probability is given to
    amino acids with a common identifying feature,
    such as the subgroup of hydrophobic amino acids.
  • When an HMM is built using a Dirichlet mixture, a
    wealth of information about protein structure is
    factored into the parameter estimation process.
    The pseudocounts for each amino acid are
    calculated from a weighted sum of Dirichlet
    densities and added to the observed amino acid
    counts from the training set. The parameters of
    the model are calculated as described above for
    simple pseudocounts.
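As a contrast to the full Dirichlet-mixture machinery, here is a sketch of the "simple pseudocounts" baseline the slide refers to; the flat pseudocount of 1 per amino acid is an arbitrary choice for illustration:

```python
# Simple pseudocounts: add a flat prior count to each amino acid before
# normalizing; a Dirichlet mixture would replace this flat count with a
# weighted sum of Dirichlet densities.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=1.0):
    """Estimate match-state emission probabilities for one alignment column."""
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for aa in column:
        counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# First column of the serine-protease alignment on slide 20: all C.
probs = emission_probs("CCCC")
print(round(probs["C"], 3), round(probs["A"], 3))  # 0.208 0.042
```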

22
Log-odds ratio
  • This number is the log of the ratio between two
    probabilities: the probability that the sequence
    was generated by the HMM and the probability that
    the sequence was generated by a null model, whose
    parameters reflect the general amino acid
    distribution in the training sequences.
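Written out (the notation P(x | null) is ours, matching the description above):

score(x) = log2 [ P(x | HMM) / P(x | null) ]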

23
Limitations of profile-HMM
  • The HMM is a linear model and is unable to
    capture higher order correlations among amino
    acids in a protein molecule.
  • In reality, amino acids which are far apart in
    the linear chain may be physically close to each
    other when a protein folds. Chemical and
    electrical interactions between them cannot be
    predicted with a linear model.
  • Another flaw of HMMs lies at the very heart of
    the mathematical theory behind these models: the
    probability of a protein sequence can be found by
    multiplying the probabilities of the amino acids
    in the sequence. This claim is only valid if the
    probability of any amino acid in the sequence is
    independent of the probabilities of its
    neighbors.
  • In biology, this is not the case. There are, in
    fact, strong dependencies between these
    probabilities. For example, hydrophobic amino
    acids are highly likely to appear in proximity to
    each other. Because such molecules fear water,
    they cluster at the inside of a protein, rather
    than at the surface where they would be forced to
    encounter water molecules.

24
A simplified hidden Markov model (HMM)
[Figure: an HMM with BEGIN and END states, three match states and
four insert states. Each insert state emits A, C, G and T with
probability 0.25; the three match states emit
  A 0.1  C 0.4  G 0.4  T 0.1
  A 0.1  C 0.2  G 0.2  T 0.5
  A 0.4  C 0.3  G 0.1  T 0.2
and the transition probabilities shown on the arrows are 1.0,
0.7, 0.2, 0.1 and 0.9]
25
(a) Calculate the probability of the sequence TAG
by following a path through the model's three
match states.
[Same model as on the previous slide]
P = 0.7 × 0.5 × 0.7 × 0.4 × 0.7 × 0.4 × 0.9 ≈ 0.0247
26
(b) Repeat (a) for a path that goes first to the
insert state, then to a match state, then to a
delete state, then to a match state and End.
[Same model as on slide 24]
P = 0.1 × 0.25 × 1 × 0.1 × 0.2 × 1 × 1 × 0.4 × 0.9 = 0.00018
(the factors of 1 correspond to the silent delete states)
27
(c) Which of the two paths is the more probable
one, and what is the ratio of the probability of
the higher to the lower one?
  • p(path 1)/p(path 2) = 0.0247 / 0.00018 ≈ 137.2
  • The highest-scoring path is the best alignment of
    the sequence with the model. The Viterbi
    algorithm, a dynamic-programming method, finds
    the highest-scoring path.

28
Improving the model
  • Adjust the scores for the states and transition
    probabilities by aligning additional sequences
    with the model using an HMM adaptation of the
    expectation maximization algorithm.
  • In the expectation step, calculate all of the
    possible paths through the model, sum the scores,
    and then calculate the probability of each path.
    Each state and transition probability is then
    updated by the maximization step of the algorithm
    to make the model better predict the new sequence.

29
Information theory primer
  • Information, uncertainty
  • Suppose we have a device that can produce 3
    symbols, A, B, or C → the uncertainty is that of
    3 equally likely symbols
  • Uncertainty = log2(M), with M being the number of
    symbols
  • Logarithm base 2 gives units in bits
  • Example: in reading mRNA, if the ribosome
    encounters any one of 4 equally likely bases,
    then the uncertainty is 2 bits

30
Surprisal u
  • Let symbols have probabilities Pi
  • Surprisal: ui = -log2(Pi)
  • For an infinite string of symbols, the average
    surprisal is
  • H = Σ Pi ui = -Σ Pi log2 Pi (bits per symbol)
  • summing over all symbols i
  • H is Shannon's entropy (see the sketch below)
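A one-function Python sketch of Shannon's entropy, checked against two distributions used elsewhere in these slides:

```python
import math

def entropy(probs):
    """Shannon entropy H = -Σ Pi log2 Pi, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))                 # 2.0 bits: 4 equally likely bases
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the coding example later
```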

31
H function in the case of two symbols
32
If all symbols are equally likely?
  • H(equiprobable) = -Σ (1/M) log2(1/M) = log2 M

33
Coding
  • Shorter codes: use few bits for common symbols
    and many bits for rare symbols.
  • M = 4: A C G T with probabilities
  • P(A) = 1/2, P(C) = 1/4, P(G) = 1/8, P(T) = 1/8
  • Surprisals: u(A) = 1 bit, u(C) = 2 bits,
    u(G) = u(T) = 3 bits
  • Uncertainty: H = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3
    = 1.75 (bits per symbol)

34
Recode symbols so that the number of binary
digits equals the surprisal
  • A 1
  • C 01
  • G 000
  • T 001
  • The string ACATGAAC is coded as
    1.01.1.001.000.1.1.01
  • 14 bits / 8 symbols = 1.75 bits per symbol
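The same encoding, reproduced in a short Python sketch using the code table above:

```python
# Prefix code from the slide: code length equals each symbol's surprisal.
code = {"A": "1", "C": "01", "G": "000", "T": "001"}

message = "ACATGAAC"
encoded = ".".join(code[s] for s in message)
bits = sum(len(code[s]) for s in message)

print(encoded)              # 1.01.1.001.000.1.1.01
print(bits / len(message))  # 14 / 8 = 1.75 bits per symbol
```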

35
Noisy communication channels
  • Sender ----noise----> receiver
  • Two equally likely symbols sent at a rate of 1
    bit per second
  • if x=0 is sent, the probability of receiving y=0
    is 0.99 and the probability of receiving y=1 is
    0.01, and vice versa
  • Then the uncertainty remaining after receiving a
    symbol is
    Hy(x) = -0.99 log2 0.99 - 0.01 log2 0.01 = 0.081
  • So the actual rate of transmission is
    R = 1 - 0.081 = 0.919 bits per second
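The channel calculation in Python, with the 1% error rate from the slide:

```python
import math

# Equivocation Hy(x) of a binary symmetric channel with 1% error rate.
p_err = 0.01
h_noise = -(1 - p_err) * math.log2(1 - p_err) - p_err * math.log2(p_err)
rate = 1 - h_noise  # one equally likely bit sent per second

print(round(h_noise, 3), round(rate, 3))  # 0.081 0.919
```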

36
Exercise: information content
37
Information content of a scoring matrix by the
relative entropy method (ignores background
frequencies)
  • (a) calculate the entropy or uncertainty (Hc) for
    each column and for the entire matrix:
  • Hc = -Σ pic log2(pic), where pic is the frequency
    of amino acid type i in column c
  • Column 1: Hc = -(0.6 log2(0.6) + 2 × 0.1
    log2(0.1) + 0.2 log2(0.2)) = 1.571
  • Column 2: Hc = -(0.7 log2(0.7) + 3 × 0.1
    log2(0.1)) = 1.357
  • Column 3: Hc = -(0.6 log2(0.6) + 2 × 0.1
    log2(0.1) + 0.2 log2(0.2)) = 1.571
  • Column 4: Hc = -(0.7 log2(0.7) + 3 × 0.1
    log2(0.1)) = 1.357
  • log2(x) = log(x)/log(2)

38
  • (b) calculate the decrease in uncertainty, or
    amount of information (Rc), for column 1 due to
    these data (for DNA, Rc = 2 - Hc and for
    proteins, Rc = 4.32 - Hc).
  • If it's DNA: Rc = 2 - Hc = 2 - 1.57 = 0.43
  • If it's protein: Rc = 4.32 - Hc = 4.32 - 1.57
    = 2.75

39
  • (c) calculate the amount that the uncertainty is
    reduced (or the amount of information
    contributed) by each base in column 1.
  • f(A) = 0.6: H(A,1) = -0.6 log2(0.6) = 0.442
  • f(G) = f(C) = 0.1: H(G,1) = H(C,1) =
    -0.1 log2(0.1) = 0.332
  • f(T) = 0.2: H(T,1) = -0.2 log2(0.2) = 0.464
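A sketch that reproduces the column-wise numbers from parts (a)-(c), treating the matrix as the 4-letter (DNA) case:

```python
import math

def column_entropy(freqs):
    """Hc = -Σ pic log2(pic) for one column of the matrix."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

columns = [[0.6, 0.1, 0.1, 0.2],   # column 1
           [0.7, 0.1, 0.1, 0.1],   # column 2
           [0.6, 0.1, 0.1, 0.2],   # column 3
           [0.7, 0.1, 0.1, 0.1]]   # column 4
for i, col in enumerate(columns, 1):
    h = column_entropy(col)
    print(f"column {i}: Hc = {h:.3f}, Rc(DNA) = {2 - h:.2f}")

# Per-base contributions in column 1, as in part (c):
for base, f in zip("AGCT", [0.6, 0.1, 0.1, 0.2]):
    print(base, round(-f * math.log2(f), 3))
```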

40
Database searching
  • The first and most common operation in protein
    informatics...and the only way to access the
    information in large databases
  • Primary tool for inference of homologous
    structure and function
  • Improved algorithms to handle large databases
    quickly
  • Provides an estimate of statistical
    significance
  • Generates alignments
  • Definitions of similarity can be tuned using
    different scoring matrices and algorithm-specific
    parameters

41
Rules of thumb
  • 45% identity (over the whole domain): nearly
    identical structure
  • 25% identity: similar structure
  • Twilight zone (R.F. Doolittle): 18-25% identity,
    homology uncertain

42
Examples
  • myoglobin / leghemoglobin: 15% identical,
    homologous
  • N- and C-terminal domains of rhodanese: 11%
    identical, gene duplication
  • chymotrypsin / subtilisin: 12% identical, similar
    active site, convergent evolution

43
Types of alignment
  • Sequence-sequence
  • Target distribution: generic substitution matrix
  • Sequence-profile
  • Position-specific target distributions
  • Profile-profile
  • Observed frequencies from multiple alignment
  • Position-specific target distributions
  • Average both ways
  • Pair HMM
  • Probability that two HMMs generate the same
    sequence

44
PSI-Blast
  • Position-specific-iterated Blast

45
Steps in a PSI-Blast search
  1. Construct a multiple alignment from a Gapped
     Blast search and generate a profile from any
     significant local alignments found
  2. Compare the profile to the protein database;
     PSI-BLAST estimates the statistical significance
     of the local alignments found, using
     "significant" hits to extend the profile for the
     next round
  3. Iterate step 2 an arbitrary number of times or
     until convergence

47
PHI-Blast
  • Pattern-hit-initiated Blast