1
Analysis of biological sequences using Markov Chains and Hidden Markov Models
T.A.G. Colloquium, LAPTH, Annecy, 8-10 November 2006
  • Bernard PRUM
La Génopole, Évry, France
prum_at_genopole.cnrs.fr
2
Why Markov Models?
A biological sequence is X = (X_1, X_2, ..., X_n), where X_k ∈ A = {t, c, a, g} or A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
A very common tool for analyzing these sequences is the Markov Model (MM):
P(X_k = v | X_j, j < k) = P(X_k = v | X_{k-1}), u, v ∈ A,
denoted by π(u, v) if X_{k-1} = u.
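A minimal sketch (editor's illustration, not the speaker's code) of how the transition probabilities π(u, v) of such an MM can be estimated from a sequence; the toy sequence below is made up:

```python
from collections import Counter

def fit_mm1(seq, alphabet="tcag"):
    """MLE of pi(u, v) = P(X_k = v | X_{k-1} = u) from letter-pair counts."""
    pairs = Counter(zip(seq, seq[1:]))                      # N(uv)
    totals = {u: sum(pairs[(u, v)] for v in alphabet) for u in alphabet}
    return {(u, v): pairs[(u, v)] / totals[u]
            for u in alphabet for v in alphabet if totals[u] > 0}

pi = fit_mm1("gctggtggatcgattacagctggtgg")   # toy sequence
print(pi[("g", "t")])                        # one entry of the estimated pi
```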
3
Why MM? (2)
Example: a complex, called RecBCD, protects the cell against viruses.
[Figure: RecBCD degrades viral DNA but spares E. coli's own bacterial genome, thanks to the chi site.]
To avoid the destruction of the genome of the cell itself, a password, gctggtgg (called chi), occurs along the genome: when RecBCD bumps into a chi, it stops its degradation. For this protection to be efficient, the number of occurrences of chi must be much higher than the number predicted by a Markov model.
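To make "much higher than predicted" concrete, here is a hedged sketch of the Markov prediction: under an M1 model with stationary letter law mu and transitions pi (assumed fitted, e.g. as above), the expected count of a word is the product of its transition probabilities:

```python
def expected_count(word, mu, pi, n):
    """E[N(word)] ~ (n - len(word) + 1) * mu(w_1) * prod_i pi(w_i, w_{i+1})."""
    p = mu[word[0]]
    for u, v in zip(word, word[1:]):
        p *= pi[(u, v)]
    return (n - len(word) + 1) * p

# e.g. expected_count("gctggtgg", mu, pi, n) for the E. coli genome:
# the observed count of chi is much higher than this prediction.
```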
4
Results MM
5
Parsimonious Markov Models
When we model a sequence in order to find exceptional motifs or for annotation, we have to estimate the parameters of the model, and the more parameters we have, the worse the estimation.
In a Markov Model of order m, there are 4^m predictors (the m-words), hence 3 × 4^m parameters:
in the M2 model, there are 16 predictors and 48 parameters; in M5, there are 1024 predictors and 3072 parameters.
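A one-line check of this arithmetic (editor's illustration):

```python
# An order-m MM on a 4-letter alphabet has 4**m predictors and 3 * 4**m
# free parameters (each transition row sums to 1, so 3 free entries).
for m in (2, 5):
    print(f"M{m}: {4**m} predictors, {3 * 4**m} parameters")
# M2: 16 predictors, 48 parameters
# M5: 1024 predictors, 3072 parameters
```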
6
PMM (2)
A first restriction consists in taking the past into account only up to the needed point: we use a long past when the sequence shows that this is necessary, and a short past when the sequence allows the economy. These models are called VLMC, Variable Length Markov Chains.
In this VLMC, there are 12 predictors: aa, ca, ga, ta, ac, cc, gc, tc, g, at, tt, gct. There are 36 parameters.
Notation: gc denotes "g or c"; act denotes "a or c or t".
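A minimal VLMC sketch (illustrative predictors and values chosen by the editor, not the slide's model): the predictors {a, c, g, at, ct, gt, tt} partition all pasts, giving 7 predictors and 21 parameters; pasts ending in a, c or g need one letter of memory, pasts ending in t need two:

```python
tree = {
    "a":  {"a": .30, "c": .20, "g": .30, "t": .20},
    "c":  {"a": .20, "c": .30, "g": .30, "t": .20},
    "g":  {"a": .30, "c": .30, "g": .20, "t": .20},
    "at": {"a": .25, "c": .25, "g": .25, "t": .25},
    "ct": {"a": .10, "c": .40, "g": .40, "t": .10},
    "gt": {"a": .40, "c": .10, "g": .10, "t": .40},
    "tt": {"a": .25, "c": .25, "g": .25, "t": .25},
}

def predictor(past, tree, max_depth=2):
    """Return the longest suffix of `past` stored in the context tree."""
    for k in range(min(max_depth, len(past)), 0, -1):
        if past[-k:] in tree:
            return past[-k:]
    raise KeyError("the predictors should cover every past")

print(predictor("gact", tree))   # -> 'ct': two letters of memory needed
print(predictor("gta", tree))    # -> 'a' : one letter suffices
```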
7
PMM (3)
But it is not obvious that, for the prediction of X_k, the letter X_{k-p} becomes less and less informative as p increases.
  • As an example (*), let us consider the "jumper" model:
  • P(X_t = v | past) = P(X_t = v | X_{t-2})
  • (the dependence jumps over X_{t-1});
  • it corresponds to this tree
  • (4 predictors, 12 parameters).

(*) This model is not as artificial as it seems: for example, in a coding region (periodic model depending on the phase), the 2nd position in a codon strongly depends on the 2nd position in the previous codon (cf. hydrophobicity).
8
PMM (4)
These models are called PMM, Parsimonious Markov Models.
A more general (?) example: in this PMM there are 8 predictors (24 parameters): aac, cac, gac, tac, g, at, cgt, tt.
9
PMM (5)
More precisely: in the tree of predictors (*), below any node all the partitions of A = {t, c, a, g} may appear. Hence, there are 15 possibilities below each node.
(*) The different predictors appear in this tree as the paths from the leaves to the root.
10
PMM (6)
A Parsimonious Markov Model (PMM) is defined by:
  • such a dependence tree t;
  • for each leaf (= for each predictor), a law on A, e.g.
P(X_t = u | X_{t-1} ∈ {t, c}), P(X_t = u | X_{t-1} = a, X_{t-2} = c),
P(X_t = u | X_{t-1} = a, X_{t-2} ∈ {t, a, g}), P(X_t = u | X_{t-1} = g).

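One possible encoding of this PMM (an assumed representation, not the speaker's code): each predictor is a tuple of letter groups, one group per step back in time, and since the predictors partition the pasts, the first match is the only match:

```python
PREDICTORS = [
    (frozenset("tc"),),                   # X_{t-1} in {t, c}
    (frozenset("a"), frozenset("c")),     # X_{t-1} = a, X_{t-2} = c
    (frozenset("a"), frozenset("tag")),   # X_{t-1} = a, X_{t-2} in {t, a, g}
    (frozenset("g"),),                    # X_{t-1} = g
]

def match(past):
    """Index of the predictor covering this past (past[-1] is X_{t-1})."""
    for i, groups in enumerate(PREDICTORS):
        if len(past) >= len(groups) and \
           all(past[-1 - j] in g for j, g in enumerate(groups)):
            return i
    raise ValueError("the predictors should partition all pasts")

print(match("gca"))   # X_{t-1} = a, X_{t-2} = c -> predictor 1
print(match("gat"))   # X_{t-1} = t              -> predictor 0
```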
11
PMM (7)
We will only work with PMM of finite order: the longest predictor contains, say, m letters (the depth of the tree is m). Obviously a PMM of order m is an MM of order m.
Note: the number of PMM increases very quickly with m; on the 4-letter alphabet and for m = 5, there are some 10^85 trees.

Notations
  • t denotes a tree of predictors;
  • W its set of predictors (= the leaves);
  • for w ∈ W, θ_{w,u} = P(X_t = u | w).

12
Statistics on PMM
For a fixed tree t, the likelihood is obviously
L(θ) = ∏_{w,u} θ_{w,u}^{N(wu)} · (...)
(the dots correspond to the first letters of the sequence; we will not care about them today),
which leads to the classical MLE
θ̂_{w,u} = N(wu) / N(w), where N(w) = Σ_v N(wv).
The difficulty arises when we want to choose the tree: a problem of choice of model
(among, for example, 10^85 models).
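A sketch of this MLE (editor's illustration, reusing the `match` lookup sketched above; as on the slide, the first letters, whose past is too short, are simply skipped):

```python
from collections import Counter

def mle(seq, min_past=2):
    """theta_hat(w, u) = N(wu) / N(w) on a fixed tree of predictors."""
    counts, totals = Counter(), Counter()
    for k in range(min_past, len(seq)):
        w = match(seq[:k])           # predictor w of position k
        counts[(w, seq[k])] += 1     # N(wu)
        totals[w] += 1               # N(w) = sum_v N(wv)
    return {(w, u): n / totals[w] for (w, u), n in counts.items()}
```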
13
Statistics (2)
Therefore we adopt a Bayesian approach.
A priori law on the trees: let us choose the uniform law (it can be changed).
On the transition parameters, it is natural to choose a Dirichlet law, which is conjugate: if, for w ∈ W, a priori P(θ_{w,·}) ∝ ∏_u θ_{w,u}^{a(w,u)}, then, a posteriori, P(θ_{w,·}) ∝ ∏_u θ_{w,u}^{a(w,u) + N(wu)}.
The MAP estimator of θ_{w,u} remains the same as before, except that N(wu) has to be changed into Ñ(wu) = N(wu) + a(w,u).
14
Statistics (3)
The use of the Bayes formula then gives, as a posterior law on the trees,
ln P(t | X) = Σ_w S(w),
where the sum is taken over all the predictors w in the tree t and
S(w) = Σ_u ln Γ(Ñ(wu)) − ln Γ(Σ_u Ñ(wu))
(Γ is such that Γ(k+1) = k!, k ∈ N).
Writing the posterior law in this way shows that P(t | X) may be maximized in a recursive way.
Application to real genomes
We fitted MM and PMM for the orders m 3 , 4 and
5 on the set of the 224 complete bacterial
genomes published today on their coding
regions (CDS) To compare the adequacy of this
modelizations, we computed the BIC criterion for
each model M BIC(M) 2 L(M) - nb_param(M) . ln
n
The higher BIC, the better the model
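The slide's criterion, written out (here loglik is the maximized log-likelihood L(M) and n the sequence length):

```python
from math import log

def bic(loglik, nb_param, n):
    """BIC(M) = 2 L(M) - nb_param(M) * ln n; higher is better here."""
    return 2 * loglik - nb_param * log(n)
```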
16
Application (2)
This picture plots BIC(PMM) − BIC(MM) against the size of the bacterial genome. For all the bacteria, the PMM fits better than the classical MM.
17
Approach using FDA
Recent results (Grégory Nuel) concern the use of Finite (Deterministic) Automata in the statistics of words or patterns.
To a word, we may associate an FDA.
Example 1: on {a, b}, w = aaab. The states are the prefixes of w: ε, a, aa, aaa, aaab.
[Figure: the automaton, with its transitions labelled by a and b.]
This can be generalized if the single word w is replaced by a motif (a finite family of words) or even a language.
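A sketch of this construction (editor's illustration): the states are the prefixes of w, and delta(q, x) is the longest prefix of w that is a suffix of qx; running a sequence through delta counts the (possibly overlapping) occurrences of w:

```python
def build_dfa(w, alphabet="ab"):
    prefixes = [w[:i] for i in range(len(w) + 1)]        # '', 'a', ..., w
    return {(q, x): next(p for p in reversed(prefixes) if (q + x).endswith(p))
            for q in prefixes for x in alphabet}

def count_occurrences(seq, w, alphabet="ab"):
    delta, q, hits = build_dfa(w, alphabet), "", 0
    for x in seq:
        q = delta[(q, x)]
        hits += (q == w)
    return hits

print(count_occurrences("aaaabaaab", "aaab"))            # -> 2
```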
18
Approach using FDA (2)
This automaton is especially dedicated to the study of the word w (the motif, ...): if we run a sequence on this graph, the automaton counts the occurrences of w (the motif, ...). It turns out to be VERY efficient: wordcount, a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli; Nuel's program achieves this task in 9.86 seconds.
The Prosite motif LIVMFGE.GASLIVM.(5-11)RSTAQA.LIVAM.STACV (some 10^12 words) is treated by an FDA of 329 (30) states in M0 and 1393 (78) states in M1.
19
Approach using FDA (3)
If the sequence X is a Markov chain, we then have another Markov chain running on this graph. Even for rather complicated motifs, this allows one to obtain the law of all word statistics:
- the exact law of the first occurrence of a motif (taking into account the starting point),
- the exact law of the number of occurrences of the motif,
- in particular the expectation and variance of these laws, opening the possibility of Gaussian, Poissonian, ... approximations (and an exhaustive study of the quality of these approximations),
- the law of a motif M conditional on the number of occurrences of another one, M'.
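A hedged sketch of the idea behind these results: the pair (automaton state, number of hits) itself evolves as a Markov chain, so the exact law of the count follows by dynamic programming (mu is the initial letter law and pi the letter transition matrix, both assumed given; `build_dfa` is the sketch from the previous slide):

```python
from collections import defaultdict

def count_law(n, w, mu, pi, alphabet="ab"):
    """Exact law of the number of occurrences of w in X_1 ... X_n."""
    delta = build_dfa(w, alphabet)
    dist = defaultdict(float)        # (state, last letter, hits) -> proba
    for x in alphabet:
        q = delta[("", x)]
        dist[(q, x, int(q == w))] = mu[x]
    for _ in range(n - 1):
        new = defaultdict(float)
        for (q, u, c), p in dist.items():
            for v in alphabet:
                q2 = delta[(q, v)]
                new[(q2, v, c + (q2 == w))] += p * pi[(u, v)]
        dist = new
    law = defaultdict(float)
    for (_, _, c), p in dist.items():
        law[c] += p
    return dict(law)                 # c -> P(N(w) = c)
```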
20
Hidden Markov Models
  • 2nd Part

21
Hidden Markov Models
An important criticism against Markov modelization is its stationarity: a well-known theorem says that, under weak conditions, P(X_k = u) → μ(u) (as k → ∞), and the rate of convergence is exponential. But biological sequences are not homogeneous:
- there are gc-rich segments / gc-poor segments (isochores);
- one may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts.
Is it possible to take advantage of this heterogeneity and to develop a tool for its analysis?
→ annotation
22
HMM (2)
Suppose that d states alternate along the sequence:
S_k = 1, S_k = 2, S_k = 1, S_k = 2, S_k = 1, ...
and that in each state we have a MC:
if S_k = 1, then P(X_k = v | X_{k-1} = u) = π_1(u, v);
if S_k = 2, then P(X_k = v | X_{k-1} = u) = π_2(u, v);
and (more technical than biological; see HSMM)
P(S_k = y | S_{k-1} = x) = π_0(x, y).
Our objectives:
- estimate the parameters π_1, π_2, π_0;
- allocate a state 1, 2, ... to each position.
HMM 3
Use the likelihood !!
L(q) ? µ0(S1) µS (X1) ....
1
...? p0(Sk-1,Sk) pk (Xk-1,Xk)
n terms (length of the sequence)
over all possibilities S1S2...Sn there are sn
terms
Désespoir !!!
210 000 103 000
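The despair is only apparent: the same sum can be computed in O(n·d²) operations by the classical forward recursion F_k(s) = P(X_1 ... X_k, S_k = s), the standard HMM tool (not spelled out on the slide; sketch by the editor):

```python
def hmm_likelihood(seq, states, mu0, mu, pi0, pi):
    """mu0: initial state law; mu[s]: initial letter law in state s;
    pi0[(x, y)]: state transitions; pi[s][(u, v)]: letter transitions."""
    F = {s: mu0[s] * mu[s][seq[0]] for s in states}      # F_1
    for u, v in zip(seq, seq[1:]):                       # F_k -> F_{k+1}
        F = {y: sum(F[x] * pi0[(x, y)] for x in states) * pi[y][(u, v)]
             for y in states}
    return sum(F.values())   # rescale or use logs in practice (underflow)
```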
24
Annotation
25
Continuous H.M.M.
26
Searching nucleosome positions
In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds around beads, forming a collar.
[Figure: the chromatin fiber, about 10 nm across.]
Each bead is called a nucleosome. Its core is a complex involving 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is equal to the weight of the DNA.
27
Curvature within curvature
The DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made of about 10 nucleotides. Total: about 146 nt within each nucleosome.
Depending on the position (in vs. out), the curvature satisfies different constraints.
28
Nuc and no-nuc states
Trifonov (99) as well as Rando (05) underline that there are no nucleosomes in the gene promoters (accessibility). They introduce, before the nucleosome, a no-nucleosome state.
[Figure: the hidden-state architecture: nucleosome-core positions 1, 2, ..., 70, a no-nuc state, and a spacer.]
Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999); Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005).
29
Bendability
Following an idea (Baldi, Lavery, ...), we introduce an index of bendability: it depends on the succession of 2, 3, 4, ... nucleotides (di-nucleotides, tri-nucleotides, ...).
[Figure: a sequence (g a t t a g c t a c t a) and its bendability index q along the positions.]
30
PNUC table
There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...). We used PNUC-3 (*). Its values for each trinucleotide (rows: 1st letter; columns: 2nd letter; blocks: 3rd letter):

3rd letter a:
1st\2nd   a     t     g     c
a        0.0   2.8   3.3   5.2
t        7.3   7.3  10.0  10.0
g        3.0   6.4   6.2   7.5
c        3.3   2.2   8.3   5.4

3rd letter t:
1st\2nd   a     t     g     c
a        0.7   0.7   5.8   5.8
t        2.8   0.0   5.2   3.3
g        5.3   3.7   5.4   7.5
c        6.7   5.2   5.4   5.4

3rd letter g:
1st\2nd   a     t     g     c
a        5.2   6.7   5.4   5.4
t        2.2   3.3   5.4   8.3
g        5.4   6.5   6.0   7.5
c        4.2   4.2   4.7   4.7

3rd letter c:
1st\2nd   a     t     g     c
a        3.7   5.3   7.5   5.4
t        6.4   3.0   7.5   6.2
g        5.6   5.6   8.2   8.2
c        6.5   5.4   7.5   6.0

For example, PNUC(cga) = 8.3 and PNUC(tcg) = 8.3.
(*) Goodsell, Dickerson, NAR 22 (1994)
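A sketch of how such a table yields a bendability profile (editor's illustration; the window size is an arbitrary choice, and `table` maps each trinucleotide to its PNUC-3 value, e.g. table["cga"] == 8.3):

```python
def pnuc_profile(seq, table, window=30):
    """Sliding-window average of the PNUC-3 scores of all 3-words of seq."""
    scores = [table[seq[i:i + 3]] for i in range(len(seq) - 2)]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]
```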
31
Scan of K3 of yeast
Sometimes, it works.
32
What about positions?
We represent (*) parts of the chromosome K3 of yeast.
The green curve (probability of the no-nuc state) increases between genes (promoters).
The red curve (probability of the nucleosome state) appears periodically in genes.
(*) using the software MuGeN, by Mark Hoebeke
33
Acknowledgements
Labo Statistique et Génome; Labo MIG INRA:
Christophe AMBROISE, Philippe BESSIÈRE, Maurice BAUDRY, François RODOLPHE, Etienne BIRMELE, Sophie SCHBATH, Cécile COT, Élisabeth de TURCKHEIM, Emmanuelle DELLA-CHIESA, Mark HOEBEKE, Mickael GUEDJ, François KÉPÈS
Labo AGRO:
Sophie LEBRE, Jean-Noël BACRO, Catherine MATIAS, Jean-Jacques DAUDIN, Vincent MIELE, Stéphane ROBIN, Florence MURI-MAJOUBE, Grégory NUEL, Franck PICARD
Lab Rouen:
Hugues RICHARD, Dominique CELLIER, Anne-Sophie TOCQUET, Sabine MERCIER, Nicolas VERGNE
Secretariat: Michèle ILBERT