1
Protein Analysis 4
  • Hidden Markov models
  • http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi

2
Alignment of sequences with a structure: hidden
Markov models
Hidden Markov Models
  • HMMs are well suited to describing correlations
    among neighbouring sites
  • the probabilistic framework allows for realistic
    modelling
  • well-developed mathematical methods provide:
  • the best solution (most probable path through the
    model)
  • a confidence score (posterior probability of any
    single solution)
  • inference of structure (posterior probability of
    using model states)
  • we can align multiple sequences using complex
    models and simultaneously predict their internal
    structure

3
Coin game
  • Fair coin: p(1) = p(0) = 0.5
  • 01110001101011100
  • Biased coin: p(1) = 0.8, p(0) = 0.2
  • 111011111100100111
  • Observed series: 010111001010011
  • P(fair coin) = 0.5^15 ≈ 3.1 × 10^-5
  • P(biased coin) = 0.8^8 × 0.2^7 ≈ 2.1 × 10^-9
  • P(biased coin)/P(fair coin) ≈ 7 × 10^-5
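As a sanity check, here is a minimal Python sketch of the arithmetic above; the observed series and the two coin models are taken from the slide:

```python
# Likelihood of the observed series under the fair and biased coin models.
observed = "010111001010011"
ones = observed.count("1")    # 8
zeros = observed.count("0")   # 7

p_fair = 0.5 ** len(observed)          # 0.5^15 ≈ 3.1e-05
p_biased = 0.8 ** ones * 0.2 ** zeros  # 0.8^8 × 0.2^7 ≈ 2.1e-09

print(p_biased / p_fair)  # ≈ 7e-05: the series favours the fair coin
```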

4
Multiple sequence alignment (msa)
  • A) define characters for phylogenetic analysis
  • B) search for additional family members

SeqA  N F L S
SeqB  N F S
SeqC  N K Y L S
SeqD  N Y L S
[Figure: the four sequences (NYLS, NKYLS, NFS, NFLS) at the
leaves of a phylogenetic tree, with the changes K, -L and Y→F
marked on its branches]
5
Markov process
  • State depends only on previous state
  • Markov models for sequences
  • States have emission probabilities
  • State transition probabilities

6
Hidden Markov model (HMM)
  • a probabilistic model for time series or linear
    sequences
  • used in speech recognition
  • modelling of protein families
  • describes a probability distribution over
    sequence space
  • the probabilities sum to 1
  • a generative model (see the sketch below)
  • a sequence can be aligned and scored against the
    model
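To illustrate the generative view, a minimal Python sketch that samples a sequence from a toy two-state HMM; all transition and emission numbers here are invented for the example:

```python
import random

# Invented toy parameters: two emitting states plus an explicit end state.
transitions = {
    "begin": {"1": 1.0},
    "1": {"1": 0.6, "2": 0.4},
    "2": {"2": 0.5, "end": 0.5},
}
emissions = {
    "1": {"a": 0.8, "b": 0.2},
    "2": {"a": 0.3, "b": 0.7},
}

def draw(table):
    """Sample a key from a {key: probability} table."""
    r, acc = random.random(), 0.0
    for key, p in table.items():
        acc += p
        if r < acc:
            break
    return key

def generate():
    """Walk from 'begin' to 'end', emitting one symbol per visited state."""
    state, symbols = draw(transitions["begin"]), []
    while state != "end":
        symbols.append(draw(emissions[state]))
        state = draw(transitions[state])
    return "".join(symbols)

print(generate())  # e.g. 'aaba'
```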

7
Hidden Markov model for a two-state variable
[Figure: a two-state HMM with transition probabilities t(1,1),
t(1,2), t(2,2), t(2,end) into an end state, and emission
probabilities p1(a), p1(b), p2(a), p2(b);
t = transition probability, p = emission probability]
State path π: 1 → 1 → 2 → end
Observed symbol sequence x: a b a
P(x, π | HMM) = p1(a) · t(1,1) · p1(b) · t(1,2) · p2(a) · t(2,end)
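To make the product concrete, a small Python sketch scoring exactly this path and sequence; the slide leaves t and p symbolic, so the numeric values below are invented for illustration:

```python
# Invented values for the transition (t) and emission (p) probabilities.
t = {("1", "1"): 0.6, ("1", "2"): 0.4, ("2", "end"): 0.5}
p = {"1": {"a": 0.8, "b": 0.2}, "2": {"a": 0.3, "b": 0.7}}

path = ["1", "1", "2"]  # state path pi: 1 -> 1 -> 2 -> end
x = ["a", "b", "a"]     # observed symbol sequence

# P(x, pi | HMM): each state's emission times the transition that leaves it.
prob = 1.0
for i, (state, symbol) in enumerate(zip(path, x)):
    prob *= p[state][symbol]
    nxt = path[i + 1] if i + 1 < len(path) else "end"
    prob *= t[(state, nxt)]

print(prob)  # = p1(a)·t(1,1) · p1(b)·t(1,2) · p2(a)·t(2,end)
```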
8
profile HMM
[Figure: a profile HMM with begin and end states and columns 1-4
of match, insert and delete states]
match state: emits one of the 20 amino acids
insert state: emits one of the 20 amino acids
delete, begin and end states: silent (emit nothing)
9
profile HMM
  • a linear model
  • scoring corresponds to a log-odds score
  • even the gap penalty formally has the affine form
    a + b(x-1)
  • uses:
  • multiple sequence alignment
  • recognition of homologues: with what probability
    does the HMM generate the test sequence?
  • alignment of a sequence against the model

10
A possible hidden Markov model for the protein
ACCY.
11
HMM with multiple paths through the model for
ACCY. The highlighted path is only one of several
possibilities.
12
Viterbi algorithm
  • computes the sequence score over the most likely
    path rather than over the sum of all paths (see
    the sketch below)
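A minimal Viterbi implementation in Python, applied to the coin game of slide 3; the 0.9/0.1 switching probabilities between the fair and biased coins are assumptions made for the example:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the log-probability and states of the single most likely path."""
    # v[s]: best log-probability of any path that ends in state s
    v = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back-pointers for the traceback
    for symbol in obs[1:]:
        ptr, nv = {}, {}
        for s in states:
            best = max(states, key=lambda r: v[r] + math.log(trans_p[r][s]))
            ptr[s] = best
            nv[s] = v[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
        back.append(ptr)
        v = nv
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return v[last], path[::-1]

# Coin game from slide 3; the switching probabilities are assumed.
states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"0": 0.5, "1": 0.5},
          "biased": {"0": 0.2, "1": 0.8}}
score, path = viterbi("010111001010011", states, start_p, trans_p, emit_p)
print(score, "".join("F" if s == "fair" else "B" for s in path))
```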

13
Forward algorithm
  • similar to Viterbi, except that a sum rather than
    a maximum is computed (see the sketch below)
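A matching forward-algorithm sketch, using the same toy coin HMM as the Viterbi example (parameters assumed, not from the slides):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of obs, summed over all state paths."""
    # f[s]: probability of emitting the prefix so far and being in state s
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: sum(f[r] * trans_p[r][s] for r in states) * emit_p[s][symbol]
             for s in states}
    return sum(f.values())

# Same toy coin HMM as in the Viterbi sketch above (parameters assumed).
states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"0": 0.5, "1": 0.5},
          "biased": {"0": 0.2, "1": 0.8}}
print(forward("010111001010011", states, start_p, trans_p, emit_p))
```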

14
What the Score Means
  • Once the probability of a sequence has been
    determined, its score can be computed. Because
    the model is a generalization of how amino acids
    are distributed in a related group (or class) of
    sequences, a score measures the probability that
    a sequence belongs to the class. A high score
    implies that the sequence of interest is probably
    a member of the class, and a low score implies it
    is probably not a member.

15
Optimisation
  • The Baum-Welch algorithm is a variation of the
    forward algorithm described earlier. It begins
    with a reasonable guess for an initial model and
    then calculates a score for each sequence in the
    training set over all possible paths through this
    model. During the next iteration, a new set of
    expected emission and transition probabilities is
    calculated. The updated parameters replace those
    in the initial model, and the training sequences
    are scored against the new model. The process is
    repeated until model convergence, meaning there
    is very little change in parameters between
    iterations.
  • The Viterbi algorithm is less computationally
    expensive than Baum-Welch.

16
Heuristics
  • There is no guarantee that a model built with
    either the Baum-Welch or Viterbi algorithm has
    parameters which maximize the probability of the
    training set. As in many iterative methods,
    convergence indicates only that a local maximum
    has been found. Several heuristic methods have
    been developed to deal with this problem.

17
Heuristics parallel trials
  • start with several initial models and proceed to
    build several models in parallel. When the models
    converge to several different local optima, the
    probability of each model given the training set
    is computed, and the model with the highest
    probability wins.

18
Heuristics add noise
  • add noise, or random data, into the mix at each
    iteration of the model building process.
    Typically, an annealing schedule is used. The
    schedule controls the amount of noise added
    during each iteration. Less and less noise is
    added as iterations proceed. The decrease is
    either linear or exponential. The effect is to
    delay the convergence of the model. When the
    model finally does converge, it is more likely to
    have found a good approximation to the global
    maximum.

19
Sequence weighting
20
Overfitting and regularization
CGGSLLNAN--TVLTAAHC
CGGSLIDNK-GWILTAAHC
CGGSLIRQG--WVMTAAHC
CGGSLIREDSSFVLTAAHC
21
Dirichlet mixtures
  • A sophisticated application of this method is
    known as Dirichlet mixtures. The mixtures are
    created by statistical analysis of the
    distribution of amino acids at particular
    positions in a large number of proteins. The
    mixtures are built from smaller components known
    as Dirichlet densities.
  • A Dirichlet density is a probability density over
    all possible combinations of amino acids
    appearing in a given position. It gives high
    probability to certain distributions and low
    probability to others. For example, a particular
    Dirichlet density may give high probability to
    conserved distributions where a single amino acid
    predominates over all others. Another possibility
    is a density where high probability is given to
    amino acids with a common identifying feature,
    such as the subgroup of hydrophobic amino acids.
  • When an HMM is built using a Dirichlet mixture, a
    wealth of information about protein structure is
    factored into the parameter estimation process.
    The pseudocounts for each amino acid are
    calculated from a weighted sum of Dirichlet
    densities and added to the observed amino acid
    counts from the training set. The parameters of
    the model are calculated as described above for
    simple pseudocounts.
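As a contrast to the full Dirichlet-mixture machinery, here is a sketch of the "simple pseudocounts" baseline the slide refers to; the flat pseudocount of 1 per amino acid is an arbitrary choice for illustration:

```python
# Simple pseudocounts: add a flat prior count to each amino acid before
# normalizing; a Dirichlet mixture would replace this flat count with a
# weighted sum of Dirichlet densities.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=1.0):
    """Estimate match-state emission probabilities for one alignment column."""
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for aa in column:
        counts[aa] += 1
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# First column of the serine-protease alignment on slide 20: all C.
probs = emission_probs("CCCC")
print(round(probs["C"], 3), round(probs["A"], 3))  # 0.208 0.042
```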

22
Log-odds ratio
  • This number is the log of the ratio between two
    probabilities: the probability that the sequence
    was generated by the HMM and the probability that
    the sequence was generated by a null model, whose
    parameters reflect the general amino acid
    distribution in the training sequences.
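Written out (the notation P(x | null) is ours, matching the description above):

score(x) = log2 [ P(x | HMM) / P(x | null) ]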

23
Limitations of profile-HMM
  • The HMM is a linear model and is unable to
    capture higher order correlations among amino
    acids in a protein molecule.
  • In reality, amino acids which are far apart in
    the linear chain may be physically close to each
    other when a protein folds. Chemical and
    electrical interactions between them cannot be
    predicted with a linear model.
  • Another flaw of HMMs lies at the very heart of
    the mathematical theory behind these models: the
    probability of a protein sequence can be found by
    multiplying the probabilities of the amino acids
    in the sequence. This claim is only valid if the
    probability of any amino acid in the sequence is
    independent of the probabilities of its
    neighbors.
  • In biology, this is not the case. There are, in
    fact, strong dependencies between these
    probabilities. For example, hydrophobic amino
    acids are highly likely to appear in proximity to
    each other. Because such molecules fear water,
    they cluster at the inside of a protein, rather
    than at the surface where they would be forced to
    encounter water molecules.

24
A simplified hidden Markov model (HMM)
[Figure: an HMM with BEGIN and END states, three match states and
four insert states. Each insert state emits A, C, G and T with
probability 0.25; the three match states emit
  A 0.1  C 0.4  G 0.4  T 0.1
  A 0.1  C 0.2  G 0.2  T 0.5
  A 0.4  C 0.3  G 0.1  T 0.2
and the transition probabilities shown on the arrows are 1.0,
0.7, 0.2, 0.1 and 0.9]
25
(a) Calculate the probability of the sequence TAG
by following a path through the model's three
match states.
[Same model as on the previous slide]
P = 0.7 × 0.5 × 0.7 × 0.4 × 0.7 × 0.4 × 0.9 ≈ 0.0247
26
(b) Repeat (a) for a path that goes first to the
insert state, then to a match state, then to a
delete state, then to a match state and End.
[Same model as on slide 24]
P = 0.1 × 0.25 × 1 × 0.1 × 0.2 × 1 × 1 × 0.4 × 0.9 = 0.00018
(the factors of 1 correspond to the silent delete states)
27
(c) Which of the two paths is the more probable
one, and what is the ratio of the probability of
the higher to the lower one?
  • p(path 1)/p(path 2) = 0.0247 / 0.00018 ≈ 137.2
  • The highest-scoring path is the best alignment of
    the sequence with the model. The Viterbi
    algorithm, a dynamic-programming method, finds
    the highest-scoring path.

28
Improving the model
  • Adjust the scores for the states and transition
    probabilities by aligning additional sequences
    with the model using an HMM adaptation of the
    expectation maximization algorithm.
  • In the expectation step, calculate all of the
    possible paths through the model, sum the scores,
    and then calculate the probability of each path.
    Each state and transition probability is then
    updated by the maximization step of the algorithm
    to make the model better predict the new sequence.

29
Information theory primer
  • Information, uncertainty
  • Suppose we have a device that can produce 3
    symbols, A, B, or C → the uncertainty is that of
    3 equally likely symbols
  • Uncertainty = log2(M), with M being the number of
    symbols
  • Logarithm base 2 gives units in bits
  • Example: in reading mRNA, if the ribosome
    encounters any one of 4 equally likely bases,
    then the uncertainty is 2 bits

30
Surprisal u
  • Let symbols have probabilities Pi
  • Surprisal: ui = -log2(Pi)
  • For an infinite string of symbols, the average
    surprisal is
  • H = Σ Pi ui = -Σ Pi log2 Pi (bits per symbol)
  • summing over all symbols i
  • H is Shannon's entropy (see the sketch below)
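A one-function Python sketch of Shannon's entropy, checked against two distributions used elsewhere in these slides:

```python
import math

def entropy(probs):
    """Shannon entropy H = -Σ Pi log2 Pi, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25] * 4))                 # 2.0 bits: 4 equally likely bases
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: the coding example later
```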

31
H function in the case of two symbols
32
If all symbols are equally likely?
  • H(equiprobable) = -Σ (1/M) log2(1/M) = log2 M

33
Coding
  • Shorter codes: use few bits for common symbols
    and many bits for rare symbols.
  • M = 4: A C G T with probabilities
  • P(A) = 1/2, P(C) = 1/4, P(G) = 1/8, P(T) = 1/8
  • Surprisals: u(A) = 1 bit, u(C) = 2 bits,
    u(G) = u(T) = 3 bits
  • Uncertainty: H = 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3
    = 1.75 (bits per symbol)

34
Recode symbols so that the number of binary
digits equals the surprisal
  • A 1
  • C 01
  • G 000
  • T 001
  • The string ACATGAAC is coded as
    1.01.1.001.000.1.1.01
  • 14 bits / 8 symbols = 1.75 bits per symbol
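The same encoding, reproduced in a short Python sketch using the code table above:

```python
# Prefix code from the slide: code length equals each symbol's surprisal.
code = {"A": "1", "C": "01", "G": "000", "T": "001"}

message = "ACATGAAC"
encoded = ".".join(code[s] for s in message)
bits = sum(len(code[s]) for s in message)

print(encoded)              # 1.01.1.001.000.1.1.01
print(bits / len(message))  # 14 / 8 = 1.75 bits per symbol
```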

35
Noisy communication channels
  • Sender ----noise----> receiver
  • Two equally likely symbols sent at a rate of 1
    bit per second
  • if x=0 is sent, the probability of receiving y=0
    is 0.99 and the probability of receiving y=1 is
    0.01, and vice versa
  • Then the uncertainty remaining after receiving a
    symbol is
    Hy(x) = -0.99 log2 0.99 - 0.01 log2 0.01 = 0.081
  • So the actual rate of transmission is
    R = 1 - 0.081 = 0.919 bits per second
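The channel calculation in Python, with the 1% error rate from the slide:

```python
import math

# Equivocation Hy(x) of a binary symmetric channel with 1% error rate.
p_err = 0.01
h_noise = -(1 - p_err) * math.log2(1 - p_err) - p_err * math.log2(p_err)
rate = 1 - h_noise  # one equally likely bit sent per second

print(round(h_noise, 3), round(rate, 3))  # 0.081 0.919
```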

36
Exercise: information content
37
Information content of a scoring matrix by the
relative entropy method (ignores background
frequencies)
  • (a) calculate the entropy or uncertainty (Hc) for
    each column and for the entire matrix:
  • Hc = -Σ pic log2(pic), where pic is the frequency
    of amino acid type i in column c
  • Column 1: Hc = -(0.6 log2(0.6) + 2 × 0.1
    log2(0.1) + 0.2 log2(0.2)) = 1.571
  • Column 2: Hc = -(0.7 log2(0.7) + 3 × 0.1
    log2(0.1)) = 1.357
  • Column 3: Hc = -(0.6 log2(0.6) + 2 × 0.1
    log2(0.1) + 0.2 log2(0.2)) = 1.571
  • Column 4: Hc = -(0.7 log2(0.7) + 3 × 0.1
    log2(0.1)) = 1.357
  • log2(x) = log(x)/log(2)

38
  • (b) calculate the decrease in uncertainty, or
    amount of information (Rc), for column 1 due to
    these data (for DNA, Rc = 2 - Hc and for
    proteins, Rc = 4.32 - Hc).
  • If it's DNA: Rc = 2 - Hc = 2 - 1.57 = 0.43
  • If it's protein: Rc = 4.32 - Hc = 4.32 - 1.57
    = 2.75

39
  • (c) calculate the amount that the uncertainty is
    reduced (or the amount of information
    contributed) by each base in column 1.
  • f(A) = 0.6: H(A,1) = -0.6 log2(0.6) = 0.442
  • f(G) = f(C) = 0.1: H(G,1) = H(C,1) =
    -0.1 log2(0.1) = 0.332
  • f(T) = 0.2: H(T,1) = -0.2 log2(0.2) = 0.464
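A sketch that reproduces the column-wise numbers from parts (a)-(c), treating the matrix as the 4-letter (DNA) case:

```python
import math

def column_entropy(freqs):
    """Hc = -Σ pic log2(pic) for one column of the matrix."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

columns = [[0.6, 0.1, 0.1, 0.2],   # column 1
           [0.7, 0.1, 0.1, 0.1],   # column 2
           [0.6, 0.1, 0.1, 0.2],   # column 3
           [0.7, 0.1, 0.1, 0.1]]   # column 4
for i, col in enumerate(columns, 1):
    h = column_entropy(col)
    print(f"column {i}: Hc = {h:.3f}, Rc(DNA) = {2 - h:.2f}")

# Per-base contributions in column 1, as in part (c):
for base, f in zip("AGCT", [0.6, 0.1, 0.1, 0.2]):
    print(base, round(-f * math.log2(f), 3))
```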

40
Database searching
  • The first and most common operation in protein
    informatics...and the only way to access the
    information in large databases
  • Primary tool for inference of homologous
    structure and function
  • Improved algorithms to handle large databases
    quickly
  • Provides an estimate of statistical
    significance
  • Generates alignments
  • Definitions of similarity can be tuned using
    different scoring matrices and algorithm-specific
    parameters

41
Rules of thumb
  • 45% identity (over the whole domain): nearly
    identical structure
  • 25% identity: similar structure
  • Twilight zone (R.F. Doolittle): 18-25% identity,
    homology uncertain

42
Examples
  • myoglobin / leghemoglobin: 15% identical,
    homologous
  • N- and C-terminal domains of rhodanese: 11%
    identical, gene duplication
  • chymotrypsin / subtilisin: 12% identical, similar
    active site, convergent evolution

43
Types of alignment
  • Sequence-sequence
  • Target distribution: generic substitution matrix
  • Sequence-profile
  • Position-specific target distributions
  • Profile-profile
  • Observed frequencies from multiple alignment
  • Position-specific target distributions
  • Average both ways
  • Pair HMM
  • Probability that two HMMs generate the same
    sequence

44
PSI-Blast
  • Position-specific-iterated Blast

45
Steps in a PSI-Blast search
  1. Construct a multiple alignment from a Gapped
     Blast search and generate a profile from any
     significant local alignments found
  2. Compare the profile to the protein database;
     PSI-BLAST estimates the statistical significance
     of the local alignments found, using
     "significant" hits to extend the profile for the
     next round
  3. Iterate step 2 an arbitrary number of times or
     until convergence

47
PHI-Blast
  • Pattern-hit-initiated Blast