Chapter 6. Multiple sequence alignment methods

Outline

- What a multiple alignment means
- Scoring a multiple alignment
- Multidimensional Dynamic Programming
- Progressive alignment methods
- Multiple alignment by profile HMM training

Multiple alignment

- Biologists produce high-quality multiple alignments by hand, using expert knowledge of protein sequence evolution:
- Highly conserved regions
- Buried hydrophobic residues
- Influence of protein structure
- Expected patterns of insertions and deletions

Multiple alignment

- Manual multiple sequence alignment is tedious.
- Automatic MSA methods are needed.
- In general, an automatic method must have a way to assign a score so that better MSAs get better scores.
- Scoring a multiple alignment and searching over possible alignments should be distinguished.
- In probabilistic modelling, the scoring function is the primary concern.
- One of the goals of probabilistic modelling is to incorporate as many of an expert's evaluation criteria as possible into the scoring procedure.

What a multiple alignment means

- In a multiple sequence alignment, homologous residues among a set of sequences are aligned together in columns.
- Homologous is meant in both the structural and the evolutionary sense.
- Ideally, the residues in a column occupy similar three-dimensional structural positions and all diverge from a common ancestral residue.

What a multiple alignment means

- Manually aligned example: ten immunoglobulin superfamily sequences
- A crystal structure of 1tlk (telokin) is known.
- The telokin structure and alignments to other related sequences reveal conserved characteristics of the I-set immunoglobulin superfamily fold, including eight conserved β-strands and certain key residues in the sequences, such as two completely conserved cysteines in the b and f strands which form a disulfide bond in the core of the folded structure.

What a multiple alignment means

- Except for trivial cases, it is not possible to create a single correct multiple alignment.
- Given a pair of divergent but clearly homologous protein sequences, usually only about 50% of the individual residues are structurally superposable.
- The globin family, often used as a typical problem in computational work, is in fact exceptional: almost the entire structure is conserved among divergent sequences.
- Even the definition of structurally superposable is subjective and can be expected to vary among experts.

What a multiple alignment means

- Our ability to define a single correct alignment will vary with the relatedness of the sequences being aligned.
- An alignment of very similar sequences will generally be unambiguous, but these alignments are not of great interest to us.
- For cases of interest, there is no objective way to define an unambiguously correct alignment.
- Usually a small subset of key residues will be identifiable which can be aligned unambiguously for all the sequences in a family, almost regardless of sequence divergence.
- Core structural elements will also tend to be conserved and meaningfully alignable.

Scoring a multiple alignment

- Two important features of multiple alignments
- Some positions are more conserved than others.
- The sequences are not independent, but instead are related by a phylogenetic tree.

Scoring a multiple alignment

- An idealised way
- Specify a complete probabilistic model of molecular sequence evolution.
- The probability of a multiple alignment can then be calculated using the evolutionary model.
- We don't have enough data to build such a model.
- Workable approximation: partly or entirely ignore the phylogenetic tree while doing some sort of position-specific scoring.

Scoring a multiple alignment

- Simplifying assumption
- Individual columns of an alignment are statistically independent.
- Then the scoring function can be written as
$$S(m) = G + \sum_i S(m_i)$$
- $m_i$: column $i$ of the multiple alignment $m$
- $S(m_i)$: the score for column $i$
- $G$: a function for scoring the gaps that occur in the alignment
- $G$ is left unspecified here; an affine gap scoring function can be used.

Scoring a multiple alignment-Minimum Entropy

- Minimum entropy
- More variability in an alignment is described by a higher entropy. Exactly matching sequences have entropy 0 (completely organised).
- To find the best alignment we look for the one with minimum entropy.

Scoring a multiple alignment-Minimum Entropy

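- Minimum entropy
- Count the residues in each column: $c_{ia}$ is the number of times residue $a$ occurs in column $i$.
- Probability of residue $a$ in column $i$ (ML estimate): $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$
- Probability of a column (residues assumed independent): $P(m_i) = \prod_a p_{ia}^{c_{ia}}$
- The entropy score is the negative log of the probability of the column: $S(m_i) = -\sum_a c_{ia} \log p_{ia}$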
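
The minimum entropy column score can be sketched in a few lines: count the residues in a column, use the relative frequencies as ML estimates, and sum the negative log probabilities. This is a minimal illustration, not a reference implementation. The function names are hypothetical, natural logarithms are assumed, and gap characters are simply skipped on the assumption that gaps are scored separately.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy score of one alignment column: -sum_a c_a * log(p_a)."""
    # Assumption: '-' is a gap and is ignored here (gaps are scored elsewhere).
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p_a = c_a / total is the ML estimate of the residue probability.
    return -sum(c * math.log(c / total) for c in counts.values())

def alignment_entropy(rows):
    """Total entropy of an alignment given as equal-length gapped strings."""
    return sum(column_entropy(col) for col in zip(*rows))
```

A fully conserved column scores 0, and the score grows as the column becomes more variable, so a search procedure would try to minimise `alignment_entropy`.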

Scoring a multiple alignment-Minimum Entropy

- Columns are treated as statistically independent, and knowledge of the phylogeny is left out.
- This is actually very similar to an HMM without gap information.
- The assumption that the sequences are independent can be reasonable if a representative set of sequences from the family is carefully chosen.
- A variety of tree-based weighting schemes have been proposed to partially compensate for the defects of the sequence independence assumption.

Scoring a multiple alignment-Sum of Pairs

- Sum of pairs
- Standard method of scoring multiple alignments
- Has similarities to the HMM formulation:
- Does not use a phylogenetic tree
- Assumes statistical independence of the columns
- It is not an HMM formulation, though.

Scoring a multiple alignment-Sum of pairs

- Sum of pairs
- Columns are scored by an SP function using a substitution scoring matrix such as a PAM or BLOSUM matrix:
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
- Use a linear gap function, or score affine gaps separately.
- Each column sums $N(N-1)/2$ pairwise scores.

Scoring a multiple alignment-Sum of pairs

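- Problem with sum of pairs
- A sum of pairwise scores is not a probabilistically correct extension of the log-odds score.
- Correct log-odds extension, for three sequences: $\log \frac{p_{abc}}{q_a q_b q_c}$
- SP score: $\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$
- Evolutionary events are over-counted, a problem which increases as the number of sequences increases.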

Scoring a multiple alignment-Sum of pairs

- Example
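- Consider an alignment of $N$ sequences which all have leucine (L) at a certain position.
- BLOSUM50: $s(L,L) = 5$
- The SP score of the column is $5N(N-1)/2$.
- Suppose instead there were one glycine (G) and $N-1$ Ls.
- BLOSUM50: $s(G,L) = -4$
- Each of the $N-1$ pairs involving G now scores $-4$ instead of $5$, so the column score drops by $9(N-1)$.
- The SP score of the column is worse than the score for a column of all Ls by a fraction of
$$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$$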
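
The arithmetic in this example can be checked with a short sketch. Only the two BLOSUM50 entries quoted in the example are used; the function names are hypothetical and the code is an illustration rather than a general SP scorer.

```python
from itertools import combinations

# The two BLOSUM50 entries quoted in the example; nothing else is assumed.
SCORES = {("L", "L"): 5, ("G", "L"): -4}

def pair_score(a, b):
    """Symmetric lookup of s(a, b)."""
    return SCORES[tuple(sorted((a, b)))]

def sp_column_score(column):
    """Sum-of-pairs score: sum of s(a, b) over all N(N-1)/2 residue pairs."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def relative_drop(n):
    """Fractional loss in SP score when one of n leucines becomes a glycine."""
    all_l = sp_column_score("L" * n)
    one_g = sp_column_score("G" + "L" * (n - 1))
    return (all_l - one_g) / all_l
```

For `n = 4` the drop is 0.9 and for `n = 10` it is 0.36, matching 18/(5N) and showing how the penalty for the outlier shrinks as more sequences are added.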

Scoring a multiple alignment-Sum of pairs

- The relative difference is $18/(5N)$.
- The relative difference in score between the correct and incorrect alignments decreases with the number of sequences.
- Yet the more evidence we have that L is conserved, the more an outlier ought to decrease the score.

Multidimensional Dynamic Programming

- It is possible to generalise pairwise DP alignment to the alignment of N sequences.

Multidimensional Dynamic Programming

- Assumptions
- The columns of an alignment are statistically independent.
- The gaps are scored with a linear gap cost.
- Then the overall score S(m) for an alignment can be calculated as a sum of the scores for each column.

Multidimensional Dynamic Programming

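- Simplifying the notation: write the cell index as a vector $\mathbf{i} = (i_1, \ldots, i_N)$ and let $\Delta \in \{0,1\}^N$, $\Delta \neq \mathbf{0}$, mark which sequences contribute a residue to the next column (the others contribute a gap). The recursion for the best score $\alpha_{\mathbf{i}}$ of an alignment ending at $\mathbf{i}$ is then
$$\alpha_{\mathbf{i}} = \max_{\Delta \neq \mathbf{0}} \left\{ \alpha_{\mathbf{i} - \Delta} + S(\Delta \cdot x_{\mathbf{i}}) \right\}$$
where $(\Delta \cdot x_{\mathbf{i}})_k$ is $x^k_{i_k}$ if $\Delta_k = 1$ and a gap otherwise.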

Multidimensional Dynamic Programming

- Straightforward multidimensional DP
- Pros
- It can find the optimal solution.
- An arbitrary column scoring function can be used.
- The only assumption is that column scores are independent.
- Cons
- There are $2^N - 1$ gap combinations for each entry.
- Huge computational complexity: $O(2^N L^N)$ for $N$ sequences of length $L$
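
The cost of the straightforward approach is easy to see in code. The sketch below fills the full N-dimensional matrix and, for every cell, tries all $2^N - 1$ combinations of residue versus gap. It is a minimal illustration under stated assumptions: the column score is a toy sum-of-pairs function with made-up match, mismatch and linear gap values, the function names are hypothetical, and only the optimal score (no traceback) is returned.

```python
from itertools import combinations, product

def sp_column(column, match=1, mismatch=-1, gap=2):
    """Toy sum-of-pairs column score (scores are assumed for illustration)."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                # s(-,-) = 0
        if a == "-" or b == "-":
            score -= gap            # linear gap cost
        else:
            score += match if a == b else mismatch
    return score

def multidim_dp_score(seqs, column_score=sp_column):
    """Optimal alignment score of N sequences by full N-dimensional DP."""
    n = len(seqs)
    origin = (0,) * n
    # Each non-zero 0/1 vector says which sequences emit a residue
    # in the last column; there are 2**n - 1 of them.
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {origin: 0}
    # Lexicographic order guarantees every predecessor is filled first.
    for idx in product(*(range(len(s) + 1) for s in seqs)):
        if idx == origin:
            continue
        best = None
        for d in deltas:
            prev = tuple(i - k for i, k in zip(idx, d))
            if min(prev) < 0:
                continue            # would step outside the matrix
            column = [seqs[j][idx[j] - 1] if d[j] else "-" for j in range(n)]
            cand = alpha[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        alpha[idx] = best
    return alpha[tuple(len(s) for s in seqs)]
```

Any column scoring function can be passed as `column_score`, which reflects the point that only column independence is assumed. The nested loops make the exponential growth in both time and memory explicit.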

Multidimensional Dynamic Programming-MSA

- The MSA program can reduce the volume of the multidimensional dynamic programming matrix that needs to be examined.
- It can optimally align up to 5-7 protein sequences of reasonable length (200-300 residues).

Multidimensional Dynamic Programming-MSA

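- Assumptions
- SP scoring system
- The score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment.
- Then the score of the complete alignment $a$ is given by
$$S(a) = \sum_{k<l} S(a^{kl})$$
where $a^{kl}$ is the pairwise alignment of sequences $k$ and $l$ induced by $a$.
- Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k$ and $l$, so that $S(a^{kl}) \le S(\hat{a}^{kl})$.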

Multidimensional Dynamic Programming-MSA

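- We can obtain a lower bound on the score of any pairwise alignment that can occur in the optimal multiple alignment.
- Assume that we have a lower bound $\sigma(a)$ on the score of the optimal multiple alignment $a$, so $\sigma(a) \le S(a)$. Writing $a^{kl}$ for the pairwise alignment of $k$ and $l$ induced by $a$, and $\hat{a}^{kl}$ for their optimal pairwise alignment,
$$\sigma(a) \le S(a) = \sum_{k'<l'} S(a^{k'l'}) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$$
- We only need to consider pairwise alignments of $k$ and $l$ that score at least
$$\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$$
- A good bound $\sigma(a)$ can be obtained by any fast heuristic multiple alignment algorithm.
- The optimal pairwise alignments can be found using dynamic programming.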

Multidimensional Dynamic Programming-MSA

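- Now find the complete set $B^{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than $\beta^{kl}$, the lower bound on the score of any pairwise alignment of $k$ and $l$ that can occur in the optimal multiple alignment.
- The costly multidimensional dynamic programming algorithm can be restricted to evaluate only cells in the intersection of all these sets, i.e. cells $(i_1, i_2, \ldots, i_N)$ for which $(i_k, i_l)$ is in $B^{kl}$ for all $k, l$.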
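
The pruning step can be sketched for a single pair of sequences. The code below computes the pairwise lower bound (called `beta` here) from a heuristic multiple alignment score and the optimal pairwise scores, then keeps the DP cells whose best path reaches that bound. Several things are assumptions made for illustration: the toy match, mismatch and linear gap scores, the function names, the reading of "through $(i_k, i_l)$" as "the alignment path passes through that DP cell", and the use of `>=` so that alignments scoring exactly the bound are not discarded.

```python
def pairwise_bound(sigma, opt_scores, pair):
    """beta = sigma + S(optimal pair) - sum of all optimal pairwise scores."""
    return sigma + opt_scores[pair] - sum(opt_scores.values())

def nw_matrix(x, y, match=1, mismatch=-1, gap=2):
    """Needleman-Wunsch score matrix with a linear gap cost (toy scores)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
    return F

def cells_through(x, y, beta):
    """Cells (i, j) whose best alignment path through them scores >= beta."""
    n, m = len(x), len(y)
    fwd = nw_matrix(x, y)                  # best score of prefixes x[:i], y[:j]
    bwd = nw_matrix(x[::-1], y[::-1])      # best score of the remaining suffixes
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if fwd[i][j] + bwd[n - i][m - j] >= beta
    }
```

Intersecting the sets returned by `cells_through` over all sequence pairs gives the region that the multidimensional DP still has to evaluate; the tighter the heuristic bound `sigma`, the smaller that region.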

Progressive alignment methods

- Most commonly used approach
- Works by constructing a succession of pairwise alignments.
- Initially, two sequences are chosen and aligned by standard pairwise alignment; this alignment is fixed.
- Then, a third sequence is chosen and aligned to the first alignment.
- This process is iterated until all sequences have been aligned.

Progressive alignment methods

- Basically heuristic
- It does not separate scoring from optimising.
- It does not directly optimise any global scoring function.
- Fast and efficient; generates reasonable results

Progressive alignment methods

- Differences between progressive alignment (PA) algorithms
- The way that they choose the order in which to do the alignments
- Whether the progression involves only alignment of sequences to a single growing alignment, or whether subfamilies are built up on a tree structure and, at certain points, alignments are aligned to alignments
- The procedure used to align and score sequences or alignments against existing alignments

Progressive alignment methods-Feng-Doolittle progressive multiple alignment

- Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting the alignment scores S to distances D = -log(S).
- Construct a guide tree from the distance matrix using a clustering algorithm.
- Starting from the first node added to the tree, align the child nodes. Repeat for all other nodes in the order that they were added to the tree.

Progressive alignment methods-Feng-Doolittle progressive multiple alignment

- Converting alignment scores to distances
- The conversion doesn't need to be accurate: the goal is only to create an approximate guide tree, not an evolutionary tree.
- In phylogenetic tree construction, more care must be taken.
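
A minimal sketch of the score-to-distance conversion follows. The slides only give D = -log(S); the normalisation used here, an "effective" score that rescales the observed score between a random-alignment baseline and a maximum score, is the one usually attributed to Feng and Doolittle and should be read as an assumption. The function and parameter names are hypothetical.

```python
import math

def effective_score(s_obs, s_max, s_rand):
    """Rescale an observed score to (roughly) the range 0..1.

    s_obs  : observed pairwise alignment score
    s_max  : maximum score, e.g. the mean of the two self-alignment scores
    s_rand : expected score for aligning shuffled versions of the sequences
    """
    return (s_obs - s_rand) / (s_max - s_rand)

def score_to_distance(s_obs, s_max, s_rand):
    """Approximate distance D = -log(S_eff); infinite if S_eff <= 0."""
    s_eff = effective_score(s_obs, s_max, s_rand)
    if s_eff <= 0:
        return math.inf
    return -math.log(s_eff)
```

Identical sequences get distance 0 and the distance grows as the observed score approaches the random baseline, which is all the guide tree needs.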

Progressive alignment methods-Feng-Doolittle progressive multiple alignment

- Clustering
- Done with the Fitch-Margoliash algorithm
- Sequence-sequence alignments
- Done with the usual pairwise dynamic programming
- A sequence is added to an existing group by aligning it pairwise to each sequence in the group in turn.
- The highest scoring pairwise alignment determines how the sequence will be aligned to the group.

Progressive alignment methods-Feng-Doolittle progressive multiple alignment

- "Once a gap, always a gap" rule
- After an alignment is completed, gap symbols are replaced with a neutral X character.
- This rule allows pairwise sequence alignments to be used to guide the alignment of sequences to groups or groups to groups; otherwise, any given pairwise sequence alignment would not necessarily be consistent with the pre-existing alignment of a group.
- Desirable side effect: it encourages gaps to occur in the same columns in subsequent pairwise alignments.
- Not needed in profile-based progressive alignment algorithms

Progressive alignment methods

- A problem with the Feng-Doolittle approach
- All alignments are determined by pairwise sequence alignments.
- It is advantageous to use position-specific information from the group's multiple alignment (e.g. the degree of sequence conservation) to align a new sequence to it.
- Many progressive alignment methods use pairwise alignment of sequences to profiles, or of profiles to profiles, as a subroutine which is used many times in the process.

Progressive alignment methods

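- Linear gap scoring case
- $s(-,a) = s(a,-) = -g$ and $s(-,-) = 0$
- Two profiles: one containing sequences $1, \ldots, n$ and the other sequences $n+1, \ldots, N$
- The score of a global alignment of the two profiles is
$$\sum_i S(m_i) = \sum_i \sum_{k<l \le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l \le N} s(m_i^k, m_i^l) + \sum_i \sum_{k \le n,\; n<l \le N} s(m_i^k, m_i^l)$$
- The first two sums are unaffected by the global alignment, because columns of gaps inserted into a profile score $s(-,-) = 0$.
- Therefore the optimal alignment of the two profiles can be obtained by optimising only the last sum, with the cross terms, which can be done exactly like a standard pairwise alignment.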
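
Optimising only the cross terms can be sketched as an ordinary pairwise DP in which the "residues" are whole profile columns. The code below is an illustration under stated assumptions: the match, mismatch and gap values are made up, the function names are hypothetical, and each profile is given as a list of equal-length gapped strings.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=2):
    """Toy s(a, b) with s(-,-) = 0 and s(a,-) = s(-,a) = -gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -gap
    return match if a == b else mismatch

def cross_score(col_a, col_b):
    """Cross terms only: pairs with one residue from each profile."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)

def align_profiles(prof_a, prof_b):
    """Globally align two profiles; returns (cross-term score, merged rows)."""
    cols_a, cols_b = list(zip(*prof_a)), list(zip(*prof_b))
    gap_a, gap_b = ("-",) * len(prof_a), ("-",) * len(prof_b)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + cross_score(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + cross_score(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]),
                F[i - 1][j] + cross_score(cols_a[i - 1], gap_b),
                F[i][j - 1] + cross_score(gap_a, cols_b[j - 1]),
            )
    # Traceback: emit merged columns from right to left.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + cross_score(cols_a[i - 1], cols_b[j - 1]):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + cross_score(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    n_rows = len(prof_a) + len(prof_b)
    return F[n][m], ["".join(col[k] for col in merged) for k in range(n_rows)]
```

The within-profile alignments are never changed: whole columns are kept intact and only gap columns are inserted, which is why the two within-profile sums drop out of the optimisation.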

Progressive alignment methods-CLUSTALW

- Profile-based progressive multiple alignment
- Works in much the same way as the Feng-Doolittle method, except for its carefully tuned use of profile alignment methods.
- Uses various heuristics.

Progressive alignment methods-CLUSTALW

- Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming.
- Construct a guide tree by a neighbour-joining clustering algorithm.
- Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
- Scoring is basically SP.

Progressive alignment methods-CLUSTALW

- Heuristics used
- Sequences are weighted to compensate for biased representation in large subfamilies.
- The substitution matrix is chosen on the basis of the similarity expected of the alignment.
- Position-specific gap-open penalties are used.
- Gap penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment.

Progressive alignment methods-Iterative refinement methods

- Problem with progressive alignment
- Subalignments are frozen.
- Once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage as more data arrive.
- Iterative refinement methods attempt to circumvent this problem.

Progressive alignment methods-Iterative refinement methods

- Iterative refinement method
- An initial alignment is generated.
- Then one sequence (or a set of sequences) is taken out and realigned to a profile of the remaining aligned sequences.
- If a meaningful score is being optimised, this either increases the overall score or leaves it unchanged.
- Another sequence is chosen and realigned, and so on, until the alignment does not change.
- Guaranteed to converge to a local maximum

Progressive alignment methods-Iterative refinement methods

- Barton-Sternberg multiple alignment
- Find the two sequences with the highest pairwise similarity and align them using standard pairwise DP alignment.
- Find the sequence that is most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.
- Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
- Repeat the previous realignment step a fixed number of times, or until the alignment score converges.

Multiple alignment by profile HMM training

- Sequence profiles can be recast in probabilistic form as profile HMMs.
- Profile HMMs can simply be used in place of standard profiles in progressive or iterative alignment methods.
- The ad hoc SP scoring scheme can be replaced by the more explicit profile HMM assumptions.
- Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch EM algorithm.

Multiple alignment by profile HMM training-Multiple alignment with a known profile HMM

- Before we estimate a model and a multiple alignment simultaneously, we consider the simpler problem of obtaining a multiple alignment from a known model.
- This situation arises when we have a multiple alignment and a model of a small representative set of sequences in a family, and we wish to use that model to align a large number of other family members.

Multiple alignment by profile HMM training-Multiple alignment with a known profile HMM

- We know how to align a sequence to a profile HMM: the Viterbi algorithm.
- Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence.
- Residues aligned to the same profile HMM match state are aligned in columns.
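
Once each sequence has a Viterbi state path, building the multiple alignment is bookkeeping. The sketch below assumes the paths have already been computed and are given as lists of `(state type, index)` pairs, with `("I", j)` meaning the insert state between match states `j` and `j + 1`. That representation, the function name, and the output convention (upper case for match columns, lower case for inserts, `.` as padding) are assumptions chosen for illustration.

```python
def alignment_from_paths(seqs, paths, model_len):
    """Build alignment rows from per-sequence profile HMM state paths."""
    match_rows, insert_rows = [], []
    for seq, path in zip(seqs, paths):
        matches = ["-"] * model_len            # '-' marks a delete state
        inserts = [""] * (model_len + 1)       # residues emitted by I_0..I_M
        pos = 0
        for kind, j in path:
            if kind == "M":
                matches[j - 1] = seq[pos].upper()
                pos += 1
            elif kind == "I":
                inserts[j] += seq[pos].lower()
                pos += 1
            elif kind != "D":
                raise ValueError(f"unknown state type: {kind}")
        if pos != len(seq):
            raise ValueError("path does not emit the whole sequence")
        match_rows.append(matches)
        insert_rows.append(inserts)
    # Insert regions are only padded to a common width, never aligned.
    widths = [max(len(ins[j]) for ins in insert_rows) for j in range(model_len + 1)]
    rows = []
    for matches, inserts in zip(match_rows, insert_rows):
        row = inserts[0].ljust(widths[0], ".")
        for j in range(1, model_len + 1):
            row += matches[j - 1] + inserts[j].ljust(widths[j], ".")
        rows.append(row)
    return rows
```

For example, with a model of length 3, the paths M1 M2 M3, M1 D2 M3 and M1 I1 I1 M2 M3 for the sequences `ACD`, `AD` and `AXXCD` give the rows `A..CD`, `A..-D` and `AxxCD`.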

Multiple alignment by profile HMM training-Multiple alignment with a known profile HMM

- Given a preliminary alignment, an HMM can align further sequences.

Multiple alignment by profile HMM training-Multiple alignment with a known profile HMM

- Important difference from other MSA programs
- The Viterbi path through the HMM identifies inserts.
- A profile HMM does not align inserts.
- Other multiple alignment algorithms align the whole sequences.

Multiple alignment by profile HMM training-Multiple alignment with a known profile HMM

- An HMM doesn't attempt to align residues assigned to insert states.
- The insert state residues usually represent parts of the sequences which are atypical, unconserved, and not meaningfully alignable.
- This is a biologically realistic view of multiple alignment.

Multiple alignment by profile HMM training-Profile HMM training from unaligned sequences

- Harder problem: estimating both a model and a multiple alignment from initially unaligned sequences.
- Initialization: choose the length of the profile HMM and initialize the parameters.
- Training: estimate the model using the Baum-Welch algorithm or the Viterbi alternative.
- Multiple alignment: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment as described in the previous section.

Multiple alignment by profile HMM training-Profile HMM training from unaligned sequences

- Initial model
- The only decision that must be made in choosing an initial structure for Baum-Welch estimation is the length of the model, M.
- A commonly used rule is to set M to the average length of the training sequences.
- We need some randomness in the initial parameters to avoid local maxima.

Multiple alignment by profile HMM training

- Avoiding local maxima
- The Baum-Welch algorithm is only guaranteed to find a local maximum.
- Models are usually quite long, and there are many opportunities to get stuck in a wrong solution.
- Multidimensional dynamic programming finds the global optimum, but is not practical.
- Solutions
- Start again many times from different initial models.
- Use some form of stochastic search algorithm, e.g. simulated annealing.

Multiple alignment by profile HMM training-Simulated annealing

- Theoretical basis
- Some compounds only crystallise if they are slowly annealed from high temperature to low temperature.
- One can introduce an artificial temperature $T$; by the laws of statistical physics, the probability of a configuration $x$ with energy $E(x)$ is given by the Gibbs distribution
$$P(x) = \frac{1}{Z} \exp\left(-\frac{E(x)}{T}\right), \qquad Z = \int \exp\left(-\frac{E(x)}{T}\right) dx$$
- In the limit $T \to 0$ the system is frozen; in the limit $T \to \infty$ the system is molten.
- The minimum of the energy can be found by sampling this probability distribution at a high temperature first, and then at gradually decreasing temperatures.
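
The effect of the temperature is easy to see on a small discrete example. The sketch below evaluates the Gibbs distribution over a finite list of configuration energies; restricting to a finite set (so the normalising integral becomes a sum) and the function name are assumptions made for illustration.

```python
import math

def gibbs_distribution(energies, temperature):
    """P(x) = exp(-E(x)/T) / Z over a finite set of configurations."""
    # Subtracting the minimum energy avoids overflow and cancels in the ratio.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]
```

At a very high temperature the three configurations with energies 0, 1 and 2 are almost equally likely (molten), while at a low temperature nearly all of the probability sits on the minimum-energy configuration (frozen). With the energy taken as a negative log-likelihood, the weight of a configuration is its likelihood raised to the power 1/T.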

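Multiple alignment by profile HMM training-Simulated annealing

- For an HMM, a natural energy function is the negative log-likelihood, $E = -\log P(\text{data} \mid \lambda)$, where $\lambda$ denotes the model parameters.
- Approximations
- Noise injection during Baum-Welch re-estimation
- Simulated annealing Viterbi estimation of HMMs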

Multiple alignment by profile HMM training-Simulated annealing

- Noise injection during Baum-Welch re-estimation
- Add noise to the counts estimated in the forward-backward procedure.
- Let the size of this noise decrease slowly.
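
One simple way to picture noise injection is shown below. The slides do not say how the noise is generated, so everything here is an assumption for illustration: uniform non-negative noise is added to a vector of expected counts before renormalising, and the noise size follows an exponentially decreasing schedule. The function names are hypothetical.

```python
import random

def noisy_probabilities(counts, noise_scale, rng):
    """Add non-negative random noise to expected counts, then renormalise."""
    noisy = [c + noise_scale * rng.random() for c in counts]
    total = sum(noisy)
    if total == 0:
        raise ValueError("all counts are zero; cannot normalise")
    return [v / total for v in noisy]

def noise_schedule(initial_scale, decay, n_iterations):
    """Exponentially decreasing noise size, one value per training iteration."""
    return [initial_scale * decay ** t for t in range(n_iterations)]
```

In a training loop one would call `noisy_probabilities` on the forward-backward counts at every iteration, using the next value from `noise_schedule`, so that early iterations explore widely and late iterations behave like plain Baum-Welch re-estimation.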

Multiple alignment by profile HMM training-Simulated annealing

- Simulated annealing Viterbi estimation of HMMs
- The model is trained by a simulated annealing variant of the Viterbi approximation to Baum-Welch estimation.
- Viterbi estimation selects the highest-probability path $\pi$ for each sequence $x$.
- Simulated annealing instead samples each path $\pi$ according to the likelihood of the path given the current model, as modified by a temperature $T$:
$$P(\pi) = \frac{P(\pi, x \mid \lambda)^{1/T}}{\sum_{\pi'} P(\pi', x \mid \lambda)^{1/T}}$$

Multiple alignment by profile HMM training-Simulated annealing

- Scheduling the temperature
- A whole science (or art) in itself
- There are theoretical results for simulated annealing saying that if the temperature is lowered slowly enough, finding the optimum is guaranteed.
- In practice, a simple exponentially or linearly decreasing schedule is often used.

Multiple alignment by profile HMM training-Comparison to Gibbs sampling

- The Gibbs sampler algorithm described by Lawrence et al. (1993) has substantial similarities.
- The problem was to simultaneously find the motif positions and to estimate the parameters for a consensus statistical model of them.
- The statistical model used is essentially a profile HMM with no insert or delete states.
- In the HMM framework, both the simulated annealing algorithm and the Gibbs sampler are stochastic variants of the Viterbi approximation to EM.
- The Gibbs sampler is like running the simulated annealing Viterbi algorithm at a constant $T = 1$, where alignments are sampled from a probability distribution unmodified by any effect of a temperature factor.

Multiple alignment by profile HMM training-Model surgery

- After (or during) training a model, we can look at the alignment it produces and decide that the model needs some modification.
- Some of the match states are redundant.
- Some insert states absorb too much sequence.
- Model surgery
- If a match state is used by less than ½ of the training sequences, delete its module (match-insert-delete states).
- If more than ½ of the training sequences use a certain insert state, expand it into n new modules, where n is the average length of the insertions.
- Ad hoc, but works well
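
The two surgery rules can be written as a small planning function. This sketch only decides which operations to apply; it does not modify a real model. The input representation (usage fractions per state, mean insertion lengths), the operation names, and rounding the average insertion length to the nearest integer (with a minimum of 1) are all assumptions made for illustration.

```python
def plan_model_surgery(match_usage, insert_usage, mean_insert_len, threshold=0.5):
    """Return the list of surgery operations suggested by the two rules.

    match_usage[j]     : fraction of training sequences using match state j + 1
    insert_usage[j]    : fraction of training sequences using insert state j
    mean_insert_len[j] : average insertion length at insert state j
    """
    ops = []
    for j, frac in enumerate(match_usage):
        if frac < threshold:
            # Rarely used match state: drop its whole module.
            ops.append(("delete_module", j + 1))
    for j, frac in enumerate(insert_usage):
        if frac > threshold:
            # Heavily used insert state: replace it by n new modules.
            n_new = max(1, round(mean_insert_len[j]))
            ops.append(("expand_insert", j, n_new))
    return ops
```

For a model with three match states used by 90%, 30% and 80% of the sequences, and an insert state after the second match state used by 70% of the sequences with average insertion length 3.2, the plan is to delete module 2 and expand that insert state into 3 new modules.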