Title: Sequence classification
1. Sequence classification with hidden Markov models
- Bioinformatics, Models and algorithms
- 8th November 2005
- Patrik Johansson
- Dept. of Cell and Molecular Biology, Uppsala University
2. A family of proteins shares a similar structure, but not necessarily a similar sequence
3. Classification of an unknown sequence s into family A or B using HMMs
[Figure: an unknown sequence s positioned between a cluster of family A sequences and a cluster of family B sequences]
4. Hidden Markov models, introduction
- A general method for pattern recognition, cf. neural networks
- An HMM generates sequences / sequence distributions
- Markov chain of events
Three coins A, B and C give a Markov chain → CAABA..
The outcome, e.g. Heads Heads Tails, is generated by the hidden Markov chain
5. Hidden Markov models, introduction..
- The model M emits a symbol (T, H) in each state i according to some emission probability ei
- The next state j is chosen according to some transition probability ai,j
e.g. the sequence s = Tails Heads Tails generated over the path π = BCC
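A minimal sketch of such a generator in Python. The three-coin emission and transition probabilities below are invented for illustration; only the structure (emit by ei, move by ai,j) comes from the slides:

```python
import random

# Toy three-coin HMM: states A, B, C each emit Heads (H) or Tails (T)
# with their own emission probabilities, then move to the next state
# according to transition probabilities. All numbers are made up.
emissions = {"A": {"H": 0.5, "T": 0.5},
             "B": {"H": 0.8, "T": 0.2},
             "C": {"H": 0.1, "T": 0.9}}
transitions = {"A": {"A": 0.6, "B": 0.3, "C": 0.1},
               "B": {"A": 0.2, "B": 0.5, "C": 0.3},
               "C": {"A": 0.3, "B": 0.3, "C": 0.4}}

def sample(dist):
    """Draw one key from a {key: probability} dictionary."""
    r, total = random.random(), 0.0
    for key, p in dist.items():
        total += p
        if r < total:
            return key
    return key  # guard against floating-point rounding

def generate(n, state="A"):
    """Generate n symbols and the hidden path that produced them."""
    path, seq = [], []
    for _ in range(n):
        path.append(state)
        seq.append(sample(emissions[state]))
        state = sample(transitions[state])
    return "".join(path), "".join(seq)

path, seq = generate(5)
print(path, seq)  # e.g. a path like 'ABBCC' and an outcome like 'HHTTT'
```

Only the outcome (the H/T sequence) is observed; the state path stays hidden, which is the point of the "hidden" in HMM.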
6. Profile hidden Markov model architecture
- A first approach to sequence distribution modelling
7. Profile hidden Markov model architecture..
Insertions are modelled as random: ejI(a) = q(a), i.e. insert states emit with the background distribution q
8. Profile hidden Markov model architecture..
An alternative architecture
9. Profile hidden Markov model architecture..
Insert and delete states are generalized to all positions. Starting from the begin state B, the model M can generate sequences by successive emissions and transitions until the end state E is reached.
[Figure: profile HMM with match states Mj, insert states Ij and delete states Dj between the begin state B and the end state E]
10. Probabilistic sequence modelling
We seek the probability that the model M has generated the sequence s,

    P(M|s)    ( 1 )

Bayes' theorem gives

    P(M|s) = P(s|M) P(M) / P(s)    ( 2 )

..but what are P(M) and P(s)..? Comparing M against an alternative model N makes P(s) cancel:

    P(M|s) / P(N|s) = P(s|M) P(M) / ( P(s|N) P(N) )    ( 3 )
11. Probabilistic sequence modelling..
If N models the whole sequence space, i.e. the background distribution (N = q),

    P(M|s) / P(q|s) = P(s|M) P(M) / ( P(s|q) P(q) )    ( 4 )

Since the probabilities are products of many small factors, logarithmic probabilities are more convenient. Define the log-odds score V:

    V = log P(s|M) - log P(s|q)    ( 5 )
12. Probabilistic sequence modelling..
Eq. ( 4 ) and ( 5 ) give a new classification criterion:

    score = log_z P(s|M) - log_z P(s|q) > d    ( 6 )

..for a certain significance level α (i.e. the number of incorrect classifications accepted in a database of n sequences), a threshold d is required:

    d = log_z ( n / α )    ( 7 )
13. Probabilistic sequence modelling..
Example: if the significance level is chosen as one incorrect classification (false positive) per 1000 trials in a database of n = 10000 sequences,

    d = log_2 ( 10^4 / 10^-3 ) ≈ 23.3 bits    (z = 2)
    d = log_e ( 10^4 / 10^-3 ) ≈ 16.1 nits    (z = e)
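The example can be checked numerically. This sketch assumes the threshold in eq. ( 7 ) has the form d = log_z(n / α), with n database sequences and significance level α:

```python
import math

# Threshold for a given database size n and significance level alpha,
# assuming d = log_z(n / alpha) as in eq. (7). The base z selects the
# unit: z = 2 gives bits, z = e gives nits.
def threshold(n, alpha, base):
    return math.log(n / alpha, base)

d_bits = threshold(10_000, 1e-3, 2)        # z = 2 -> bits
d_nits = threshold(10_000, 1e-3, math.e)   # z = e -> nits
print(f"{d_bits:.1f} bits, {d_nits:.1f} nits")  # 23.3 bits, 16.1 nits
```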
14. Large vs. small threshold d
[Figure: with a high threshold d, only a few family A sequences score above d; with a low d, more family A sequences are found (true positives), but a family B sequence also scores above d (false positive)]
15. Model characteristics
One can define the sensitivity, how many of the true family members are found,
..and the selectivity, how many of the classifications are correct
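Assuming the standard confusion-matrix definitions (sensitivity = TP / (TP + FN), selectivity = TP / (TP + FP), with TP true positives, FN false negatives and FP false positives), these model characteristics can be computed as:

```python
# Sensitivity: fraction of true family members that are found.
def sensitivity(tp, fn):
    return tp / (tp + fn)

# Selectivity: fraction of reported hits that are correct.
def selectivity(tp, fp):
    return tp / (tp + fp)

# Toy numbers: 8 of 10 family members found, 2 false positives.
print(sensitivity(8, 2))  # 0.8
print(selectivity(8, 2))  # 0.8
```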
16. Model construction
- From an initial alignment: the most common method. Start from an initial multiple alignment of e.g. a protein family
- Iteratively: by successive database searches, incorporating new similar sequences into the model
- Neural-inspired: the model is trained using some continuous minimization algorithm, e.g. Baum-Welch, steepest descent etc.
17. Model construction..
A short family alignment gives a simple model M, with potential match states marked with an asterisk (*)
[Figure: alignment columns mapped to model states, starting from the begin state B]
18. Model construction..
A more generalized model. Example: evaluate the sequence s = AIEH
19. Sequence evaluation
The optimal alignment, i.e. the path with the greatest probability of generating the sequence s, can be determined through dynamic programming.
The maximum log-odds score VjM(si) for match state j emitting si is calculated from the emission score plus the previous maximum score and the transition score.
20. Sequence evaluation..
Viterbi's algorithm:

    VjM(i) = log( eMj(si) / q(si) ) + max { V(j-1)M(i-1) + log a(Mj-1, Mj),
                                            V(j-1)I(i-1) + log a(Ij-1, Mj),
                                            V(j-1)D(i-1) + log a(Dj-1, Mj) }    ( 8 )

    VjI(i) = log( eIj(si) / q(si) ) + max { VjM(i-1) + log a(Mj, Ij),
                                            VjI(i-1) + log a(Ij, Ij),
                                            VjD(i-1) + log a(Dj, Ij) }    ( 9 )

    VjD(i) = max { V(j-1)M(i) + log a(Mj-1, Dj),
                   V(j-1)I(i) + log a(Ij-1, Dj),
                   V(j-1)D(i) + log a(Dj-1, Dj) }    ( 10 )
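The recursions can be sketched in Python as follows. The data structures, state naming and the toy two-state model are illustrative assumptions, not a fixed API:

```python
import math

NEG = float("-inf")

def viterbi_log_odds(seq, eM, eI, a, q):
    """Log-odds Viterbi score of seq against a profile HMM.
    eM[j][c] / eI[j][c]: match/insert emission probabilities,
    a[(s, t)]: transition probabilities between states named
    'M0' (begin B), 'M1'.., 'I0'.., 'D1'.., 'E'; q[c]: background."""
    L, n = max(eM), len(seq)

    def tr(s, t):  # log transition score; missing transitions -> -inf
        p = a.get((s, t), 0.0)
        return math.log(p) if p > 0 else NEG

    VM = [[NEG] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0                      # begin state B, nothing emitted

    for i in range(1, n + 1):           # inserts before the first match
        c = seq[i - 1]
        VI[0][i] = math.log(eI[0][c] / q[c]) + max(
            VM[0][i - 1] + tr("M0", "I0"), VI[0][i - 1] + tr("I0", "I0"))

    for j in range(1, L + 1):
        VD[j][0] = max(VM[j - 1][0] + tr(f"M{j-1}", f"D{j}"),
                       VI[j - 1][0] + tr(f"I{j-1}", f"D{j}"),
                       VD[j - 1][0] + tr(f"D{j-1}", f"D{j}"))
        for i in range(1, n + 1):
            c = seq[i - 1]
            VM[j][i] = math.log(eM[j][c] / q[c]) + max(   # eq. (8)
                VM[j - 1][i - 1] + tr(f"M{j-1}", f"M{j}"),
                VI[j - 1][i - 1] + tr(f"I{j-1}", f"M{j}"),
                VD[j - 1][i - 1] + tr(f"D{j-1}", f"M{j}"))
            VD[j][i] = max(                               # eq. (10)
                VM[j - 1][i] + tr(f"M{j-1}", f"D{j}"),
                VI[j - 1][i] + tr(f"I{j-1}", f"D{j}"),
                VD[j - 1][i] + tr(f"D{j-1}", f"D{j}"))
            VI[j][i] = math.log(eI[j][c] / q[c]) + max(   # eq. (9)
                VM[j][i - 1] + tr(f"M{j}", f"I{j}"),
                VI[j][i - 1] + tr(f"I{j}", f"I{j}"),
                VD[j][i - 1] + tr(f"D{j}", f"I{j}"))

    return max(VM[L][n] + tr(f"M{L}", "E"),
               VI[L][n] + tr(f"I{L}", "E"),
               VD[L][n] + tr(f"D{L}", "E"))

# Toy two-match-state model over a two-letter alphabet.
q = {"A": 0.5, "C": 0.5}
eM = {1: {"A": 0.9, "C": 0.1}, 2: {"A": 0.1, "C": 0.9}}
eI = {0: q, 1: q, 2: q}
a = {("M0", "M1"): 0.9, ("M0", "I0"): 0.05, ("M0", "D1"): 0.05,
     ("I0", "M1"): 0.5, ("I0", "I0"): 0.5,
     ("M1", "M2"): 0.9, ("M1", "I1"): 0.05, ("M1", "D2"): 0.05,
     ("I1", "M2"): 0.5, ("I1", "I1"): 0.5,
     ("D1", "M2"): 0.5, ("D1", "D2"): 0.5,
     ("M2", "E"): 0.9, ("M2", "I2"): 0.1,
     ("I2", "E"): 0.5, ("I2", "I2"): 0.5,
     ("D2", "E"): 1.0}
score = viterbi_log_odds("AC", eM, eI, a, q)
print(round(score, 4))  # 0.8595 (nits) for the all-match path
```

For "AC" the best path is B → M1 (emit A) → M2 (emit C) → E, whose score 3 ln 0.9 + 2 ln 1.8 ≈ 0.8595 nits matches the program's output.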
21. Parameter estimation, background
- Proteins with similar structures can have very different sequences
- Classical sequence alignment, based only on heuristic rules and parameters, cannot deal with sequence identities below 50-60%
- Substitution matrices add static a priori information about amino acids and protein sequences → good alignments down to 25-30% sequence identity, e.g. CLUSTAL
- How to get further down into the twilight zone..?
- More, and dynamic, a priori information..!
22. Parameter estimation
What is the probability of emitting an alanine in the first match state, eM1(A)..?
- Maximum likelihood estimation: eM1(A) = n_A / Σa n_a, the observed count of alanines divided by the total number of observations in the column
23. Parameter estimation..
- Add-one pseudocount estimation: ej(a) = (n_a + 1) / (Σa' n_a' + 20)
- Background pseudocount estimation: ej(a) = (n_a + A q(a)) / (Σa' n_a' + A), where A is the total pseudocount weight
24. Parameter estimation..
- Substitution mixture estimation
- The scores of a substitution matrix define conditional probabilities P(a|b) between amino acids
- The maximum likelihood estimate f of the column gives pseudocounts α_a ∝ Σb f_b P(a|b)
- Total estimation: ej(a) = (n_a + α_a) / Σa' (n_a' + α_a')
25. Parameter estimation..
- All the above methods are, in spite of their dynamic implementation, still based on heuristic parameters
- We want a method that compensates for and complements lack of data in a statistically correct way: Dirichlet mixture estimation
Looking at sequence alignments, several different amino acid distributions seem to be recurring, not just the background distribution q. Assume that there are k probability densities that generate these.
26. Parameter estimation, Dirichlet mixture style..
Given the data, a count vector n, this method forms a linear combination of k individual estimates, weighted with the probability that n was generated by each component:

    e(a) = Σj P(component j | n) ej(a)

The k components can be modelled from a curated database of alignments. Using some parametric form of the probability density (a Dirichlet density), an explicit expression for the probability that n has been generated by the jth component can be derived.
27. Parameter estimation, Dirichlet mixture style..
The k components describe peaks of amino acid distributions in some kind of multidimensional space. Depending on where in sequence space our count vector n lies, i.e. depending on which components can be assumed to have generated n, distribution information is incorporated into the probability estimate e.
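A sketch of the whole procedure, assuming Dirichlet-density components so that the marginal likelihood P(n | component j) is Dirichlet-multinomial. The component parameters and the tiny three-letter alphabet below are invented toy values, not parameters from a curated database:

```python
import math

def log_marginal(n, alpha):
    """log P(n | alpha) for a Dirichlet-multinomial, up to the
    multinomial coefficient (which cancels in the posterior weights)."""
    A, N = sum(alpha), sum(n)
    s = math.lgamma(A) - math.lgamma(A + N)
    for ni, ai in zip(n, alpha):
        s += math.lgamma(ai + ni) - math.lgamma(ai)
    return s

def dirichlet_mixture_estimate(n, components, priors):
    """Mix the component posterior-mean estimates, weighted by P(j | n)."""
    # Posterior weights P(j | n) ∝ prior_j * P(n | alpha_j)
    logw = [math.log(p) + log_marginal(n, a)
            for a, p in zip(components, priors)]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]     # subtract max for stability
    Z = sum(w)
    w = [x / Z for x in w]
    # Posterior mean of component j is (n_i + alpha_ji) / (|n| + |alpha_j|)
    N = sum(n)
    est = [0.0] * len(n)
    for wj, alpha in zip(w, components):
        A = sum(alpha)
        for i, (ni, ai) in enumerate(zip(n, alpha)):
            est[i] += wj * (ni + ai) / (N + A)
    return est

# Toy 3-letter alphabet, two components: one peaked on letter 0,
# one close to uniform.
components = [[5.0, 0.5, 0.5], [1.0, 1.0, 1.0]]
priors = [0.5, 0.5]
n = [4, 0, 0]                    # four observations of letter 0
e = dirichlet_mixture_estimate(n, components, priors)
print([round(x, 3) for x in e])
```

Because the count vector n lies near the peaked component, that component dominates the posterior weights and the estimate e is pulled toward its distribution, exactly the "incorporated distribution information" described above.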
28. Classification example
Alignment of some known glycoside hydrolase family 16 sequences
- Define which columns are to be regarded as match states (*)
- Build the corresponding model M (HMM graph)
- Estimate all emission and transition probabilities, ej and ajk
- Evaluate the log-odds score / probability that an unknown sequence s has been generated by M, using Viterbi's algorithm
- If score(s|M) > d, the sequence can be classified as a GH16 family member
29. Classification example..
A certain sequence s1 = WHKLRQ.. is evaluated and gets a score of -17.63 nits, i.e. the probability that M has generated s1 is very small.
Another sequence s2 = SDGSYT.. gets a score of 27.49 nits and can, with good significance, be classified as a family member.
30. Summary
- Hidden Markov models are used mainly for classification / searching (Pfam), but also for sequence mapping / alignment
- Compared to normal alignment, a position-specific approach is used for sequence distributions, insertions and deletions
- Model building is usually a compromise between sensitivity and selectivity. If more a priori information is incorporated, the sensitivity goes up whereas the selectivity goes down