Alignment and Algorithms

- Scoring matrices: models of evolution
- Dynamic programming: speed
- Pairwise Sequence Alignment, variants
- Significance
- Pattern recognition, Hidden Markov Models (HMM)
- Multiple Sequence Alignment (MSA)
- Protein families
- Structure features

Scoring Models and Evolution

- Several systems of scoring matrices.
- Statistically based on curated data, i.e., based on confident alignments of multiple sequences.
- Two best known:
- PAM matrices (Dayhoff et al., 1967)
- BLOSUM matrices (Henikoff & Henikoff, 1992)

Scoring matrices, cont.

- Recall: we use p(a,b), the probability that a (an amino acid residue) mutated into b.
- PAM (Point Accepted Mutation) uses sequence data for proteins which can be aligned with 99% identical sequence. In such aligned sequences, one counts the frequency of each residue, and of each pair of residues in a column.

Scoring matrices, cont.

- Basically, the score s(a,b) from this data is a LOD score: s(a,b) = log(p(a,b)/(p(a)p(b))).
- Comparison of two models: significant association vs. independent randomness.
- Scores are scaled and rounded to make the matrices integer-valued.
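As an illustration of the LOD formula above, here is a minimal sketch; the frequencies are made-up numbers, not real PAM counts, and the base and scaling of the logarithm vary by matrix family.

```python
import math

# Illustrative (made-up) frequencies -- real PAM data comes from
# counted alignments of 99%-identical sequences.
p_a = 0.074        # background frequency of residue a
p_b = 0.052        # background frequency of residue b
p_ab = 0.0060      # observed frequency of the aligned pair (a, b)

# LOD score: log of the observed pair frequency over the frequency
# expected if the residues paired independently at random.
s_ab = math.log2(p_ab / (p_a * p_b))   # base 2 here; real matrices scale and round

print(f"s(a,b) = {s_ab:.2f}")          # > 0: pair seen more often than chance
```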

Scoring matrices, cont.

- Interpretation: Dayhoff et al. define a short unit of evolutionary time, a PAM unit. Only very conservative changes will be seen in the composition of evolving proteins on this scale.
- To extend to more distant similarities, they used a time-independent Markov assumption.

Scoring matrices, cont.

- p(a,b) = p(a-->b; t=1) gives the transition probability from a --> b (or b --> a; we cannot keep track of directionality) in one unit of PAM time.
- The time-independent Markov model assumes p(a-->b; t=2) in two units is simply the sum, over all intermediate residues c, of the one-time-unit transitions a --> c followed by c --> b.

Scoring matrices, cont.

- If P(1) is the original 20x20 matrix of amino acid transition probabilities, then the above says simply:
- P(n) = P(1) x P(1) x ... x P(1) (n times)
- (matrix multiplication)
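A minimal sketch of this extension, using a toy 2-state matrix in place of the real 20x20 P(1); it checks that the explicit sum over intermediates is exactly matrix multiplication.

```python
import numpy as np

# Toy 2-state transition matrix standing in for the real 20x20 P(1).
# Rows sum to 1: P1[a][c] = prob. of a --> c in one PAM unit.
P1 = np.array([[0.99, 0.01],
               [0.02, 0.98]])

# Two-unit transitions by the explicit sum over intermediates c:
# p(a-->b; t=2) = sum_c p(a-->c; t=1) * p(c-->b; t=1)
P2_sum = np.array([[sum(P1[a, c] * P1[c, b] for c in range(2))
                    for b in range(2)] for a in range(2)])

# ... which is one matrix multiplication, so P(n) = P(1)^n:
P2 = P1 @ P1
P250 = np.linalg.matrix_power(P1, 250)

assert np.allclose(P2, P2_sum)
print(P250)   # transition probabilities after 250 PAM units
```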

Scoring matrices, cont.

- So, PAM(n) is the matrix of LOD scores from transitions in PAM time n.
- Available from n = 1 to n = 500; smaller n for more closely related sequences.
- Most commonly used: 50 to 250.
- Critique: accurate for small times, but goes off at larger times. Compare nucleic acid changes to amino acid changes.

PAM 250 MATRIX

Scoring matrices, cont.

- BLOSUM Matrices
- From the BLOCKS database.
- BLOSUM = BLOcks SUbstitution Matrix
- A DB of aligned sequences, with conserved regions marked. For BLOSUM(L), transition probabilities are figured from sequences with L% sequence identity. The score is a LOD score again. So, large BLOSUM L <--> small PAM n.
- Warning: corrections for oversampling of closely related sequences!

Scoring matrices, cont.

- If the BLOSUM62 matrix is compared to PAM160 (its closest equivalent), it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.
- PAM is less sensitive to the effect of amino acid return substitutions over long time periods.

BLOSUM62 Matrix

Dynamic Programming

- Problem: find (1) the best score and (2) the best alignment.
- Naïve approach: evaluate each alignment and compare.
- Too many alignments!
- Need more speed: DP.

Dynamic Programming, cont.

- The no. of alignments is exponentially large (see the sketch after this list).
- Use the fact that the problem has a linear (left-to-right) structure to it, from the linearity of proteins and nucleic acids.
- Analogous to a distance problem that originally arose in CS: edit distance.
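To make "exponentially large" concrete, here is a sketch counting alignments via the standard three-way recurrence (pair column, gap in one sequence, gap in the other); this count is added for illustration, not part of the original slides.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_alignments(m: int, n: int) -> int:
    """Count alignments of sequences of lengths m and n: each alignment
    ends in a paired column, a gap in one sequence, or a gap in the
    other (the Delannoy-number recurrence)."""
    if m == 0 or n == 0:
        return 1
    return (num_alignments(m - 1, n - 1)   # last column: pair
            + num_alignments(m - 1, n)     # last column: gap in B
            + num_alignments(m, n - 1))    # last column: gap in A

print(num_alignments(10, 10))   # 8,097,453 -- millions already at length 10
print(num_alignments(20, 20))   # grows roughly like 5.83**n: enumeration is hopeless
```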

Dynamic Programming, cont.

- The basic algorithms here are:
- (1) Needleman-Wunsch (global alignment)
- (2) Smith-Waterman (local alignment)
- BLAST is a heuristic version of SW.
- (3) Many variants.

Dynamic Programming, cont.

- The NW algorithm can be understood pictorially.
- It has three basic steps:
- Initialization
- Iteration
- Traceback
- Basic insight: you only have to optimize the score one step at a time, and can avoid looking at most alignments.

Dynamic Programming, cont.

- Missing scoring feature: gap penalties.
- Not as well understood as substitutions. Use geometric random variable models.
- s(a,-) is independent of a; say s(a,-) = -d, the gap penalty.
- For longer gaps, two methods (sketched below):
- linear: k gaps in a row gives penalty k x (-d).
- affine: k gaps in a row gives penalty -e + (k-1) x (-d), where e is the gap initiation penalty.
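A sketch of the two gap models just listed; the values of d and e are illustrative examples, not canonical constants.

```python
# Illustrative gap-model parameters (example values only).
d = 2    # per-position gap penalty
e = 11   # gap initiation (opening) penalty for the affine model

def linear_gap(k: int) -> int:
    """k gap positions in a row: penalty k * (-d)."""
    return k * (-d)

def affine_gap(k: int) -> int:
    """k gap positions in a row: -e to open, then (k-1) * (-d) to extend."""
    return -e + (k - 1) * (-d)

for k in (1, 5, 10):
    print(k, linear_gap(k), affine_gap(k))   # affine penalizes opening, not length
```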

Dynamic Programming, cont.

- Notice the affine gap does not just depend on the position, since you have to know whether you are continuing a gap from the left. This is a first example of dependency in scoring.
- Key to the iterative step is to realize there are only three ways to extend an alignment:

      R1    R1    -
      r1    -     r1

- We have to update the score by taking the max of three possibilities: max(s0 + s(R1,r1), s0 + s(R1,-), s0 + s(-,r1)).
- Then getting the scores of all possible subalignments takes just about 5MN computations, instead of an exponential number (see the sketch after this list).
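A compact sketch of the initialization and iteration steps, assuming a linear gap penalty -d and toy match/mismatch scores in place of a real PAM or BLOSUM matrix; the sequences are arbitrary examples.

```python
def nw_score(A: str, B: str, match=1, mismatch=-1, d=2):
    """Needleman-Wunsch fill: F[i][j] = best score aligning A[:i] with B[:j].
    Each cell takes the max over the three ways to extend an alignment."""
    M, N = len(A), len(B)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):            # initialization: all-gap prefixes
        F[i][0] = -d * i
    for j in range(1, N + 1):
        F[0][j] = -d * j
    for i in range(1, M + 1):            # iteration: ~M*N cell updates
        for j in range(1, N + 1):
            s = match if A[i-1] == B[j-1] else mismatch
            F[i][j] = max(F[i-1][j-1] + s,   # align A[i-1] with B[j-1]
                          F[i-1][j] - d,     # gap in B
                          F[i][j-1] - d)     # gap in A
    return F[M][N]

print(nw_score("HEAGAWGHEE", "PAWHEAE"))
```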

Dynamic Programming, cont.

- To derive the actual alignment requires storage: one stores a traceback arrow from each position in the comparison array.
- This information can be translated into the correct alignment.
- Variants:
- Smith-Waterman: local alignment. The adjustment is to zero out and start over again any time a NW score goes below zero, and proceed. One chooses the largest score of a segment, and traces back as earlier (see the sketch after this list).
- Linear storage: saves storage.
- Repeats: allows for duplication of segments.
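A sketch of the Smith-Waterman adjustment described above: the same fill as NW, but each cell is floored at 0 (start over), and the answer is the best cell anywhere; scoring values are the same illustrative ones as in the NW sketch.

```python
def sw_score(A: str, B: str, match=1, mismatch=-1, d=2):
    """Smith-Waterman local score: NW fill with cells floored at 0;
    one would trace back from the best cell until hitting a 0."""
    M, N = len(A), len(B)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # row/column 0 stay 0
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if A[i-1] == B[j-1] else mismatch
            F[i][j] = max(0,                    # zero out: restart here
                          F[i-1][j-1] + s,
                          F[i-1][j] - d,
                          F[i][j-1] - d)
            best = max(best, F[i][j])
    return best

print(sw_score("HEAGAWGHEE", "PAWHEAE"))
```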

Dynamic Programming, cont.

- Variants (cont.)
- Non-overlapping repeats: SW, avoiding previous best alignments.
- Sampling of highest hits: the best alignment score may be an artifact of our scoring system, so many alignments may be high scoring.
- Affine gap penalties actually have to have more states carried along, a precursor to Hidden Markov Models.

Significance

- The E-value in BLAST was discussed last time. It is equivalent to the statistical concept of a p-value.
- S, the score of the best alignment, is a random variable.
- The p-value for an alignment is the probability that such a high score could have been found randomly.

Significance, cont.

- This can be estimated two ways:
- Ungapped local alignment: Karlin-Altschul statistics.
- Generally, KA is not available; rely instead on simulations: take two sequences A and B, randomly shuffle one, and recompare. This eliminates composition bias in the test samples, but randomizes the rest of the biological information in the sequence, namely the order of the residues (or nts). SW computes this for 100s of shuffles (see the sketch below).
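A sketch of that shuffle test, reusing the sw_score sketch above; the pseudocount in the estimate is a common convention to avoid reporting p = 0, not something from the slides.

```python
import random

def shuffle_pvalue(A: str, B: str, score_fn, n_shuffles=100, seed=0):
    """Empirical p-value: the fraction of shuffles of B whose score
    against A meets or beats the real score. Shuffling preserves
    composition but destroys residue order."""
    rng = random.Random(seed)
    real = score_fn(A, B)
    hits = 0
    for _ in range(n_shuffles):
        b = list(B)
        rng.shuffle(b)               # composition kept, order randomized
        if score_fn(A, "".join(b)) >= real:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)   # pseudocount avoids p = 0

print(shuffle_pvalue("HEAGAWGHEE", "PAWHEAE", sw_score))
```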

HMMs

- These are a special case of a general CS problem: pattern recognition, or machine learning. Can one make the machine a master at recognizing or classifying certain things? There is a large family of such graphical models; this is one of the simplest.
- Basic problem in MB: can one recognize different functional or structural regions in a protein from its a.a. sequence? Its fold? Or, less specifically, what are the common features of a group of proteins?

HMMs, cont.

- For example, there exist very many G-protein coupled receptors, which fall into seven classes. Can one classify a new protein by (a) deciding that it is a GPCR, and (b) assigning its class to it, from sequence data alone?
- Simplest example problem: the dishonest casino. You are given a sequence of observed random variables or states of the system (faces of a die), but you don't see the hidden states (e.g., whether the casino is using a fair or biased die). Notice that this kind of example has a linear structure to it, like our a.a. sequences.

HMMs, cont.

- The problem is: given an observed path of (observable) states, determine the underlying path through the hidden states. I.e., given the record of the coin flips, say when the biased coin was used and when the fair one. Notice that in this model we know how many visible states there are, and we are assuming we know how many hidden states there are.
- Of course, we cannot know with certainty when the fair/biased coin is used; we must seek a probable answer, and the algorithms show how to choose a most probable path through the model.
- The harder problem (the biological problem) is when we do not know what the hidden states are, except where we can define them, e.g., structurally. For example, you could run through a protein's a.a. sequence and ask an HMM to decide which residues are situated in alpha helices of the protein, or when they lie in the cell membrane.

HMMs, cont.

- In the first problem, if we know all the parameters, meaning mainly the transition probabilities for the hidden Markov states, then it is easy to find a dynamic programming algorithm for finding the best path. This is the Viterbi algorithm (see the sketch after this list).
- This is sometimes called parsing the hidden Markov model, because the method originated in speech recognition work.
- The harder problem is to figure out what the parameters are for the hidden states. This is called training the model, and is done by the Baum-Welch algorithm. This is a version of the so-called EM method, or algorithm (Expectation Maximization).
- BW, or EM, is a heuristic optimization algorithm. That is, it starts from a given initial guess for the parameters involved and performs a calculation which is guaranteed to improve (make no worse, actually) the previous guess. One runs for a certain time and stops by a suitable threshold.
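A sketch of the Viterbi algorithm on a dishonest-casino-style model; all of the probabilities below are illustrative numbers, not fitted parameters. Note it is the same dynamic programming pattern as sequence alignment: optimize one step at a time and trace back.

```python
import math

# Dishonest-casino-style HMM (illustrative numbers): hidden states are
# Fair/Loaded dice; observations are die faces "1".."6".
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {x: 1/6 for x in "123456"},
        "L": {**{x: 0.1 for x in "12345"}, "6": 0.5}}

def viterbi(obs: str) -> str:
    """Most probable hidden path, by DP over log-probabilities:
    V[k] = best log-prob of any path ending in state k at this position."""
    V = {k: math.log(start[k]) + math.log(emit[k][obs[0]]) for k in states}
    back = []
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for k in states:
            # best previous state from which to enter k
            prev = max(states, key=lambda j: V[j] + math.log(trans[j][k]))
            ptr[k] = prev
            nxt[k] = V[prev] + math.log(trans[prev][k]) + math.log(emit[k][x])
        back.append(ptr)
        V = nxt
    # traceback from the best final state
    path = [max(states, key=V.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("31245366666646126"))   # runs of "L" should flag the loaded die
```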

HMMs, cont.

- Finally, a harder problem still is to learn the correct architecture of the model. There are packages for this in the case of protein families / multiple sequence alignments.
- In the case of structure, for example, one sees a classical problem in statistics: one can let the data tell directly what the patterns should be (unsupervised learning), or one can use expertise to guide the data. This allows fainter structural signals to be heard. The problem is: does this fit the data too tightly, making the generated model harder to apply generally? Example: TMHMM and unusual, but necessary, faint signals.

Further Notes are available at

- http://www.math.lsa.umich.edu/dburns/547/547syll.html
- You can link from there to lectures on sequence alignment and hidden Markov models.