Title: Sequence classification
1. Sequence classification with hidden Markov models
- Bioinformatics, Models and algorithms
- 8th November 2005
- Patrik Johansson
- Dept. of Cell and Molecular Biology, Uppsala University
2. A family of proteins shares a similar structure, but not necessarily a similar sequence
3. Classification of an unknown sequence s into family A or B using HMMs
[Figure: an unknown sequence s positioned between a cluster of family A sequences and a cluster of family B sequences]
4. Hidden Markov models, introduction
- A general method for pattern recognition, cf. neural networks
- An HMM generates sequences / sequence distributions
- Markov chain of events
Three coins A, B and C give a Markov chain → CAABA..
The outcome, e.g. Heads Heads Tails, is generated by the hidden Markov chain
5. Hidden Markov models, introduction..
- The model M emits a symbol (T, H) in each state i according to some emission probability ei
- The next state j is chosen according to some transition probability ai,j
e.g. the sequence s = Tails Heads Tails generated over the path π = BCC
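A minimal sketch of such a generator in Python. The three-coin emission and transition probabilities below are invented for illustration; only the structure (emit by ei, move by ai,j) comes from the slides:

```python
import random

# Toy three-coin HMM: states A, B, C each emit Heads (H) or Tails (T)
# with their own emission probabilities, then move to the next state
# according to transition probabilities. All numbers are made up.
emissions = {"A": {"H": 0.5, "T": 0.5},
             "B": {"H": 0.8, "T": 0.2},
             "C": {"H": 0.1, "T": 0.9}}
transitions = {"A": {"A": 0.6, "B": 0.3, "C": 0.1},
               "B": {"A": 0.2, "B": 0.5, "C": 0.3},
               "C": {"A": 0.3, "B": 0.3, "C": 0.4}}

def sample(dist):
    """Draw one key from a {key: probability} dictionary."""
    r, total = random.random(), 0.0
    for key, p in dist.items():
        total += p
        if r < total:
            return key
    return key  # guard against floating-point rounding

def generate(n, state="A"):
    """Generate n symbols and the hidden path that produced them."""
    path, seq = [], []
    for _ in range(n):
        path.append(state)
        seq.append(sample(emissions[state]))
        state = sample(transitions[state])
    return "".join(path), "".join(seq)

path, seq = generate(5)
print(path, seq)  # e.g. a path like 'ABBCC' and an outcome like 'HHTTT'
```

Only the outcome (the H/T sequence) is observed; the state path stays hidden, which is the point of the "hidden" in HMM.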
6. Profile hidden Markov model architecture
- A first approach to sequence distribution modelling
7. Profile hidden Markov model architecture..
Insertions are modelled as random: ejI(a) = q(a), i.e. insert states emit with the background distribution q
8. Profile hidden Markov model architecture..
An alternative architecture
9. Profile hidden Markov model architecture..
Insert and delete states are generalized to all positions. Starting from the begin state B, the model M can generate sequences by successive emissions and transitions until the end state E is reached.
[Figure: profile HMM with match states Mj, insert states Ij and delete states Dj between the begin state B and the end state E]
10. Probabilistic sequence modelling
We seek the probability that the model M has generated the sequence s,

    P(M|s)    ( 1 )

Bayes' theorem gives

    P(M|s) = P(s|M) P(M) / P(s)    ( 2 )

..but what are P(M) and P(s)..? Comparing M against an alternative model N makes P(s) cancel:

    P(M|s) / P(N|s) = P(s|M) P(M) / ( P(s|N) P(N) )    ( 3 )
11. Probabilistic sequence modelling..
If N models the whole sequence space, i.e. the background distribution (N = q),

    P(M|s) / P(q|s) = P(s|M) P(M) / ( P(s|q) P(q) )    ( 4 )

Since the probabilities are products of many small factors, logarithmic probabilities are more convenient. Define the log-odds score V:

    V = log P(s|M) - log P(s|q)    ( 5 )
12. Probabilistic sequence modelling..
Eq. ( 4 ) and ( 5 ) give a new classification criterion:

    score = log_z P(s|M) - log_z P(s|q) > d    ( 6 )

..for a certain significance level α (i.e. the number of incorrect classifications accepted in a database of n sequences), a threshold d is required:

    d = log_z ( n / α )    ( 7 )
13. Probabilistic sequence modelling..
Example: if the significance level is chosen as one incorrect classification (false positive) per 1000 trials in a database of n = 10000 sequences,

    d = log_2 ( 10^4 / 10^-3 ) ≈ 23.3 bits    (z = 2)
    d = log_e ( 10^4 / 10^-3 ) ≈ 16.1 nits    (z = e)
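The example can be checked numerically. This sketch assumes the threshold in eq. ( 7 ) has the form d = log_z(n / α), with n database sequences and significance level α:

```python
import math

# Threshold for a given database size n and significance level alpha,
# assuming d = log_z(n / alpha) as in eq. (7). The base z selects the
# unit: z = 2 gives bits, z = e gives nits.
def threshold(n, alpha, base):
    return math.log(n / alpha, base)

d_bits = threshold(10_000, 1e-3, 2)        # z = 2 -> bits
d_nits = threshold(10_000, 1e-3, math.e)   # z = e -> nits
print(f"{d_bits:.1f} bits, {d_nits:.1f} nits")  # 23.3 bits, 16.1 nits
```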
14. Large vs. small threshold d
[Figure: with a high threshold d, only a few family A sequences score above d; with a low d, more family A sequences are found (true positives), but a family B sequence also scores above d (false positive)]
15. Model characteristics
One can define the sensitivity, how many of the true family members are found,
..and the selectivity, how many of the classifications are correct
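Assuming the standard confusion-matrix definitions (sensitivity = TP / (TP + FN), selectivity = TP / (TP + FP), with TP true positives, FN false negatives and FP false positives), these model characteristics can be computed as:

```python
# Sensitivity: fraction of true family members that are found.
def sensitivity(tp, fn):
    return tp / (tp + fn)

# Selectivity: fraction of reported hits that are correct.
def selectivity(tp, fp):
    return tp / (tp + fp)

# Toy numbers: 8 of 10 family members found, 2 false positives.
print(sensitivity(8, 2))  # 0.8
print(selectivity(8, 2))  # 0.8
```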
16. Model construction
- From an initial alignment: the most common method. Start from an initial multiple alignment of e.g. a protein family
- Iteratively: by successive database searches, incorporating new similar sequences into the model
- Neural-inspired: the model is trained using some continuous minimization algorithm, e.g. Baum-Welch, steepest descent etc.
17. Model construction..
A short family alignment gives a simple model M, with potential match states marked with an asterisk (*)
[Figure: alignment columns mapped to model states, starting from the begin state B]
18. Model construction..
A more generalized model. Example: evaluate the sequence s = AIEH
19. Sequence evaluation
The optimal alignment, i.e. the path with the greatest probability of generating the sequence s, can be determined through dynamic programming.
The maximum log-odds score VjM(si) for match state j emitting si is calculated from the emission score plus the previous maximum score and the transition score.
20. Sequence evaluation..
Viterbi's algorithm:

    VjM(i) = log( eMj(si) / q(si) ) + max { V(j-1)M(i-1) + log a(Mj-1, Mj),
                                            V(j-1)I(i-1) + log a(Ij-1, Mj),
                                            V(j-1)D(i-1) + log a(Dj-1, Mj) }    ( 8 )

    VjI(i) = log( eIj(si) / q(si) ) + max { VjM(i-1) + log a(Mj, Ij),
                                            VjI(i-1) + log a(Ij, Ij),
                                            VjD(i-1) + log a(Dj, Ij) }    ( 9 )

    VjD(i) = max { V(j-1)M(i) + log a(Mj-1, Dj),
                   V(j-1)I(i) + log a(Ij-1, Dj),
                   V(j-1)D(i) + log a(Dj-1, Dj) }    ( 10 )
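The recursions can be sketched in Python as follows. The data structures, state naming and the toy two-state model are illustrative assumptions, not a fixed API:

```python
import math

NEG = float("-inf")

def viterbi_log_odds(seq, eM, eI, a, q):
    """Log-odds Viterbi score of seq against a profile HMM.
    eM[j][c] / eI[j][c]: match/insert emission probabilities,
    a[(s, t)]: transition probabilities between states named
    'M0' (begin B), 'M1'.., 'I0'.., 'D1'.., 'E'; q[c]: background."""
    L, n = max(eM), len(seq)

    def tr(s, t):  # log transition score; missing transitions -> -inf
        p = a.get((s, t), 0.0)
        return math.log(p) if p > 0 else NEG

    VM = [[NEG] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0                      # begin state B, nothing emitted

    for i in range(1, n + 1):           # inserts before the first match
        c = seq[i - 1]
        VI[0][i] = math.log(eI[0][c] / q[c]) + max(
            VM[0][i - 1] + tr("M0", "I0"), VI[0][i - 1] + tr("I0", "I0"))

    for j in range(1, L + 1):
        VD[j][0] = max(VM[j - 1][0] + tr(f"M{j-1}", f"D{j}"),
                       VI[j - 1][0] + tr(f"I{j-1}", f"D{j}"),
                       VD[j - 1][0] + tr(f"D{j-1}", f"D{j}"))
        for i in range(1, n + 1):
            c = seq[i - 1]
            VM[j][i] = math.log(eM[j][c] / q[c]) + max(   # eq. (8)
                VM[j - 1][i - 1] + tr(f"M{j-1}", f"M{j}"),
                VI[j - 1][i - 1] + tr(f"I{j-1}", f"M{j}"),
                VD[j - 1][i - 1] + tr(f"D{j-1}", f"M{j}"))
            VD[j][i] = max(                               # eq. (10)
                VM[j - 1][i] + tr(f"M{j-1}", f"D{j}"),
                VI[j - 1][i] + tr(f"I{j-1}", f"D{j}"),
                VD[j - 1][i] + tr(f"D{j-1}", f"D{j}"))
            VI[j][i] = math.log(eI[j][c] / q[c]) + max(   # eq. (9)
                VM[j][i - 1] + tr(f"M{j}", f"I{j}"),
                VI[j][i - 1] + tr(f"I{j}", f"I{j}"),
                VD[j][i - 1] + tr(f"D{j}", f"I{j}"))

    return max(VM[L][n] + tr(f"M{L}", "E"),
               VI[L][n] + tr(f"I{L}", "E"),
               VD[L][n] + tr(f"D{L}", "E"))

# Toy two-match-state model over a two-letter alphabet.
q = {"A": 0.5, "C": 0.5}
eM = {1: {"A": 0.9, "C": 0.1}, 2: {"A": 0.1, "C": 0.9}}
eI = {0: q, 1: q, 2: q}
a = {("M0", "M1"): 0.9, ("M0", "I0"): 0.05, ("M0", "D1"): 0.05,
     ("I0", "M1"): 0.5, ("I0", "I0"): 0.5,
     ("M1", "M2"): 0.9, ("M1", "I1"): 0.05, ("M1", "D2"): 0.05,
     ("I1", "M2"): 0.5, ("I1", "I1"): 0.5,
     ("D1", "M2"): 0.5, ("D1", "D2"): 0.5,
     ("M2", "E"): 0.9, ("M2", "I2"): 0.1,
     ("I2", "E"): 0.5, ("I2", "I2"): 0.5,
     ("D2", "E"): 1.0}
score = viterbi_log_odds("AC", eM, eI, a, q)
print(round(score, 4))  # 0.8595 (nits) for the all-match path
```

For "AC" the best path is B → M1 (emit A) → M2 (emit C) → E, whose score 3 ln 0.9 + 2 ln 1.8 ≈ 0.8595 nits matches the program's output.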
21. Parameter estimation, background
- Proteins with similar structures can have very different sequences
- Classical sequence alignment, based only on heuristic rules and parameters, cannot deal with sequence identities below 50-60%
- Substitution matrices add static a priori information about amino acids and protein sequences → good alignments down to 25-30% sequence identity, e.g. CLUSTAL
- How to get further down into the twilight zone..?
- More, and dynamic, a priori information..!
22. Parameter estimation
What is the probability of emitting an alanine in the first match state, eM1(A)..?
- Maximum likelihood estimation: eM1(A) = n_A / Σa n_a, the observed count of alanines divided by the total number of observations in the column
23. Parameter estimation..
- Add-one pseudocount estimation: ej(a) = (n_a + 1) / (Σa' n_a' + 20)
- Background pseudocount estimation: ej(a) = (n_a + A q(a)) / (Σa' n_a' + A), where A is the total pseudocount weight
24. Parameter estimation..
- Substitution mixture estimation
- The scores of a substitution matrix define conditional probabilities P(a|b) between amino acids
- The maximum likelihood estimate f of the column gives pseudocounts α_a ∝ Σb f_b P(a|b)
- Total estimation: ej(a) = (n_a + α_a) / Σa' (n_a' + α_a')
25. Parameter estimation..
- All the above methods are, in spite of their dynamic implementation, still based on heuristic parameters
- We want a method that compensates for and complements lack of data in a statistically correct way: Dirichlet mixture estimation
Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q. Assume that there are k probability densities that generate these.
26. Parameter estimation, Dirichlet mixture style..
Given the data, a count vector n, this method forms a linear combination of k individual estimates, weighted with the probability that n was generated by each component:

    e(a) = Σj P(component j | n) ej(a)

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density (a Dirichlet density), an explicit expression for the probability that n has been generated by the jth component can be derived.
27. Parameter estimation, Dirichlet mixture style..
The k components describe peaks of amino acid distributions in some kind of multidimensional space. Depending on where in sequence space our count vector n lies, i.e. depending on which components can be assumed to have generated n, distribution information is incorporated into the probability estimate e.
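A sketch of the whole procedure, assuming Dirichlet-density components so that the marginal likelihood P(n | component j) is Dirichlet-multinomial. The component parameters and the tiny three-letter alphabet below are invented toy values, not parameters from a curated database:

```python
import math

def log_marginal(n, alpha):
    """log P(n | alpha) for a Dirichlet-multinomial, up to the
    multinomial coefficient (which cancels in the posterior weights)."""
    A, N = sum(alpha), sum(n)
    s = math.lgamma(A) - math.lgamma(A + N)
    for ni, ai in zip(n, alpha):
        s += math.lgamma(ai + ni) - math.lgamma(ai)
    return s

def dirichlet_mixture_estimate(n, components, priors):
    """Mix the component posterior-mean estimates, weighted by P(j | n)."""
    # Posterior weights P(j | n) ∝ prior_j * P(n | alpha_j)
    logw = [math.log(p) + log_marginal(n, a)
            for a, p in zip(components, priors)]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]     # subtract max for stability
    Z = sum(w)
    w = [x / Z for x in w]
    # Posterior mean of component j is (n_i + alpha_ji) / (|n| + |alpha_j|)
    N = sum(n)
    est = [0.0] * len(n)
    for wj, alpha in zip(w, components):
        A = sum(alpha)
        for i, (ni, ai) in enumerate(zip(n, alpha)):
            est[i] += wj * (ni + ai) / (N + A)
    return est

# Toy 3-letter alphabet, two components: one peaked on letter 0,
# one close to uniform.
components = [[5.0, 0.5, 0.5], [1.0, 1.0, 1.0]]
priors = [0.5, 0.5]
n = [4, 0, 0]                    # four observations of letter 0
e = dirichlet_mixture_estimate(n, components, priors)
print([round(x, 3) for x in e])
```

Because the count vector n lies near the peaked component, that component dominates the posterior weights and the estimate e is pulled toward its distribution, exactly the "incorporated distribution information" described above.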
28. Classification example
Alignment of some known glycoside hydrolase family 16 sequences
- Define which columns are to be regarded as match states (*)
- Build the corresponding model M (HMM graph)
- Estimate all emission and transition probabilities, ej and ajk
- Evaluate the log-odds score / probability that an unknown sequence s has been generated by M, using Viterbi's algorithm
- If score(s|M) > d, the sequence can be classified as a GH16 family member
29. Classification example..
A certain sequence s1 = WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small.
Another sequence s2 = SDGSYT.. gets a score of 27.49 nits and can, with good significance, be classified as a family member.
30. Summary
- Hidden Markov models are used mainly for classification / searching (Pfam), but also for sequence mapping / alignment
- Compared to normal alignment, a position-specific approach is used for sequence distributions, insertions and deletions
- Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down