# Alignment and Algorithms - PowerPoint PPT Presentation

Title:

## Alignment and Algorithms

Description:

### Alignment and Algorithms Scoring Matrices: models of evolution Dynamic Program: speed Pairwise Sequence Alignment, variants Significance Pattern recognition, Hidden ... – PowerPoint PPT presentation

Number of Views:152
Avg rating:3.0/5.0
Slides: 29
Provided by: umichEdu7
Category:
Tags:
Transcript and Presenter's Notes

Title: Alignment and Algorithms

1
Alignment and Algorithms
• Scoring Matrices models of evolution
• Dynamic Program speed
• Pairwise Sequence Alignment, variants
• Significance
• Pattern recognition, Hidden Markov Models (HMM)
• Multiple Sequence Alignment (MSA)
• Protein families
• Structure features

2
Scoring Models and Evolution
• Several systems of scoring matrices.
• Statistically based, curated data I.e., based on
confident alignments of multiple sequences.
• Two best known
• PAM matrices Dayhoff et al., 1967
• BLOSUM matrices Henikoff Henikoff, 1992

3
Scoring matrices, cont.
• Recall using p(a,b) prob. that a (amino acid
residue) mutated into b.
• PAM (Point Accepted Mutation) uses sequence data
for proteins which can be aligned with 99
identical sequence. In such aligned sequences,
one counts the frequency of each residue, and of
each pair of residues in a column.

4
Scoring matrices, cont.
• Basically, the score s(a,b) from this data is a
LOD score
• s(a,b) log(p(a,b)/p(a)p(b)).
• Comparison of two models significant
association vs. independent randomness.
• Pruned scaled to make matrices integer-valued.

5
Scoring matrices, cont.
• Interpretation Dayhoff, et al. define a short
unit of evolutionary time, a PAM unit. Only very
conservative changes will be seen in the
composition of evolving proteins in this scale.
• To extend to more distant similarities, used time
independent Markov assumption.

6
Scoring matrices, cont.
• p(a,b) p(a--gtbt1) gives the transition
probability from a --gt b (or b --gt a we cannot
keep track of directionality) in one unit of PAM
time.
• The time independent Markov model assumes
p(a--gtbt2) in two units is simply the sum over
all pairs of intermediate one time unit
transitions
• a --gt c followed by c --gt b

7
Scoring matrices, cont.
• If P(1) is the original 20x20 matrix of amino
acid transition probabilities, then the above
says simply
• P(n) P(1) x P(1)xx P(1) (n times)
• (matrix multiplication)

8
Scoring matrices, cont.
• So, PAM(n) matrix of LOD scores from transitions
in PAM time n.
• Available n1 to n500 smaller n for more
closely related sequences.
• Most commonly used 50 to 250.
• Critique accurate for small time, goes off at
larger times. Compare nucleic acid changes to
amino acid changes.

9
PAM 250 MATRIX
10
Scoring matrices, cont.
• BLOSUM Matrices
• From the BLOCKS database.
• BLOSUM BLOCKS Substitution Matrix
• DB of aligned sequence. Conserved regions are
marked. The BLOSUM(L)
• transition probabilities are figured from
• sequences with L identity of sequence. Score is
a LOD score again. So, large BLOSUM L lt--gt small
PAM n.
• Warning corrections for oversampling of closely
related sequence!

11
Scoring matrices, cont.
• If the BLOSUM62 matrix is compared to PAM160
(it's closest equivalent) then it is found that
the BLOSUM matrix is less tolerant of
substitutions to or from hydrophilic amino acids,
while more tolerant of hydrophobic changes and of
cysteine and tryptophan mismatches.
• PAM is less sensitive to the effect of amino acid
return substitutions, over long time periods.

12
BLOSUM62 Matrix
13
Dynamic Programming
• Problem find (1) best score and (2) best
alignment
• Naïve evaluate each alignment and compare
• Too many alignments!
• Need more speed DP

14
Dynamic Programming, cont.
• No. of alignments is exponentially large.
• Use problem has a linear (left to right)
structure to it, from the linearity of proteins,
nucleic acids.
• Analogous to the problem of distance originally
arose in CS, edit distance.

15
Dynamic Programming, cont.
• The basic algorithms here are
• Needleman-Wunsch (global alignment)
• (2) Smith-Waterman (local alignment)
• BLAST is a heuristic version of SW.
• (3) Many variants.

16
Dynamic Programming, cont.
• NW algorithm can be understood pictorially.
• It has three basic steps
• Initialization
• Iteration
• Traceback
• Basic insight you only have to optimize the
score one step at a time, and can discard looking
at most alignments.

17
Dynamic Programming, cont.
• Missing scoring feature gap penalties.
• Not as well understood as substitutions. Use
geometric random variable models.
• s(a,-) is independent of a, say s(a,-) -d
gap penalty
• For longer gaps, two methods
• linear k gaps in a row gives penalty k x (-d).
• affine k gaps in a row gives penalty
• e (k-1) x (-d),
• Where e is the gap initiation penalty.

18
Dynamic Programming, cont.
• Notice affine gap does not just depend on the
position, since you have to know whether you are
continuing a gap from the left. First example of
dependency in scoring.
• Key to the iterative step is to realize there are
only three ways to extend an alignment
• R1 R1 -
• r1 - r1
• We have to update the score by taking the max of
three possibilities max(s0 s(R1,r1), s0
s(R1,-), s0 s(-,r1)).
• Then to get the scores of all possible
subalignments takes just
exponential number.

19
Dynamic Programming, cont.
• To derive the actual alignment requires storage
one stores a traceback arrow from each position
in the comparison array.
• This information can be translated into the
correct alignment.
• Variants
• Smith-Waterman local alignment. Adjustment is to
0-out and start over again any time a NW score
goes below zero, and procede. One chooses the
largest scoreof a segment, and traces back as
earlier.
• Linear storage saves storage.
• Repeats allows for duplication of segments.

20
Dynamic Programming, cont.
• Variants (cont.)
• Non-overlapping Repeats SW, avoiding previous
best alignments.
• Sampling of highest hits the best alignment
score maybe an artifact of our scoring system, so
many alignments may be high scoring.
• Affine gap penalties actually have to have more
states carried along, a precursor to Hidden
Markov Models.

21
Significance
• E-value in BLAST discussed last time. Equivalent
to statistical concept of a p-value.
• S score for best alignment is a random variable.
• The p-value for an alignment is the probability
that such a high score could have been found
randomly.

22
Significance, cont.
• This can be estimated two ways
• Ungapped local alignment Karlin-Altschul
statistics.
• Generally, KA not available, rely instead on
simulations take two sequences A and B, and
randomly shuffle one and recompare. This
eliminates composition bias in the teest samples,
but randomizes the rest of the biological
information in the sequence, namely the order of
the residues (or nts). SW computes this for
100s of shuffles.

23
HMMs
• These are a special case of a general CS problem
pattern recognition, or machine learning. Can one
make the machine a master at recognizing or
classifying certain things. There are a large
family of such graphical models, this is one of
the simplest.
• Basic problem in MB can one recognize different
functional or structural regions in a protein
from its a.a. sequence? Its fold? Or less
specifically, what are the common features of a
group of proteins.

24
HMMs, cont.
• For example, there exist very many G-protein
coupled receptors, which fall into seven classes.
Can one classify a new protein into (a) being a
GPCR, and (b) assigning its class to it from
sequence data alone?
• Simplest example problem dishonest casino. You
are given a sequence of observed random variables
or states of the system (faces of a die), but you
dont see hidden states (e.g., whether a casino
is using a fair or biased die). Notice that this
kind of example has a linear structure to it,
like our a.a. sequences.

25
HMMs, cont.
• The problem is given an observed path of
(observable states) determine the underlying path
through the hidden states. I.e., given the record
of the coin flips, say when the biased coin was
used and when the fair. Notice that in this model
we know how many visible states there are and we
are assuming we know how many hidden states there
are.
• Of course, we cannot know with cerrtainty when
the fair/biased coin is used, we must seek a
probable answer, and the algorithms show how to
choose a most probable path through the model.
• The harder problem (biological problem) is when
we do not know what the hidden states are,
except where we can define them, e.g.,
structurally. For example, you could run through
protein a.a. sequence and ask a HMM to decide
which residues are situated in alpha helices of
the protein, or when they lie in the cell
membrane.

26
HMMs, cont.
• In the first problem, if we know all the
parameters, meaning mainly the transition
problems for the Hidden Markov states, then it is
easy to find a dynamic programming algorithm for
finding the best path. This is the Viterbi
algorithm.
• Sometimes called parsing the hidden markov
model, because the method originated in speech
recognition work.
• The harder problem is to figure out what the
parameters are for the hidden states. This is
called training the model, and is done by the
Baum-Welch algorithm. This is a version of the
so-called EM method, or algorithm (Expectation
Maximization algorithm).
• BW, or EM, is a heuristic optimization algorithm.
That is, it starts from a given initial guess for
the parameters involved and performs a
calculation which is guaranteed to improve (make
no worse, actually) the previous guess. One runs
for a certain time and turns things off by a
suitable threshold.

27
HMMs, cont.
• Finally, a harder problem still is to learn the
correct architecture of the model. There are
packages for this in the case of protein
families/ multiple sequence alignments.
• In the case of structure, for example, one sees a
classical problem in statistics one can let the
data tell directly what the patterns should be
(unsupervised learning) or you could use
expertise to guide the data. This allows fainter
structural signals to be heard. Problem is
this fittingthe data too tightly, making the
model generated harder to be generally
applicable? Example of TMHMM and unusual, but
necessary faint signals.

28
Further Notes are available at
• http//www.math.lsa.umich.edu/dburns/547/547syll.
html
• You can link from there to lectures on sequence
alignment and hidden Markov models.