1
Chapter 6. Multiple sequence alignment methods
2
Outline
  • What a multiple alignment means
  • Scoring a multiple alignment
  • Multidimensional Dynamic Programming
  • Progressive alignment methods
  • Multiple alignment by profile HMM training

3
Multiple alignment
  • Biologists produce high-quality multiple
    alignments by hand using expert knowledge of
    protein sequence evolution, including:
  • Highly conserved regions
  • Buried hydrophobic residues
  • Influence of protein structure
  • Expected patterns of insertions and deletions

4
Multiple alignment
  • Manual multiple sequence alignment is tedious.
  • Automatic MSA methods are needed.
  • In general, an automatic method must have a way
    to assign a score so that better MSAs get better
    scores.
  • Scoring a multiple alignment and searching over
    possible alignments should be distinguished.
  • In probabilistic modelling, the scoring function is
    the primary concern.
  • One goal of probabilistic modelling is to
    incorporate as many of an expert's evaluation
    criteria as possible into the scoring procedure.

5
What a multiple alignment means
  • In a multiple sequence alignment, homologous
    residues among a set of sequences are aligned
    together in columns.
  • Homologous is meant in both the structural and
    evolutionary sense.
  • Ideally, the residues in a column of the alignment
    occupy similar three-dimensional structural positions
    and all diverge from a common ancestral residue.

6
What a multiple alignment means
  • Manually aligned example: 10 immunoglobulin
    superfamily sequences
  • A crystal structure of 1tlk (telokin) is known.
  • The telokin structure and alignments to other
    related sequences reveal conserved
    characteristics of the I-set immunoglobulin
    superfamily fold, including eight conserved
    β-strands and certain key residues in the
    sequences, such as two completely conserved
    cysteines in the b and f strands which form a
    disulfide bond in the core of the folded
    structure.

7
What a multiple alignment means
8
What a multiple alignment means
  • Except for trivial cases, it is not possible to
    create a single correct multiple alignment.
  • Given a pair of divergent but clearly homologous
    protein sequences, usually only about 50% of the
    individual residues were superposable.
  • The globin family, often used as a typical
    problem in computational work, is in fact
    exceptional: almost the entire structure is
    conserved among divergent sequences.
  • Even the definition of "structurally
    superposable" is subjective and can be expected
    to vary among experts.

9
What a multiple alignment means
  • Our ability to define a single correct
    alignment will vary with the relatedness of the
    sequences being aligned.
  • An alignment of very similar sequences will
    generally be unambiguous, but these alignments
    are not of great interest to us.
  • For cases of interest, there is no objective way
    to define an unambiguously correct alignment.
  • Usually a small subset of key residues will be
    identifiable which can be aligned unambiguously
    for all the sequences in a family, almost
    regardless of sequence divergence.
  • Core structural elements will also tend to be
    conserved and meaningfully alignable.

10
Scoring a multiple alignment
  • Two important features of multiple alignments
  • Some positions are more conserved than others.
  • The sequences are not independent, but instead
    are related by a phylogenetic tree.

11
Scoring a multiple alignment
  • An idealised way
  • Specify a complete probabilistic model of
    molecular sequence evolution.
  • The probability of a multiple alignment can then be
    calculated using the evolutionary model.
  • We don't have enough data to build such a model.
  • Workable approximation: partly or entirely ignore
    the phylogenetic tree while doing some sort of
    position-specific scoring.

12
Scoring a multiple alignment
  • Simplifying assumption
  • Individual columns of an alignment are
    statistically independent.
  • Then the scoring function can be written as
    S(m) = G + Σ_i S(m_i)
  • m_i: column i of the multiple alignment m
  • S(m_i): the score for column i
  • G: a function for scoring the gaps that occur in
    the alignment
  • G is left unspecified here; for example, an affine
    gap scoring function can be used.

13
Scoring a multiple alignment-Minimum Entropy
  • Minimum entropy
  • More variability in an alignment is described by a
    higher entropy. A column of exactly matching
    residues has zero entropy (completely
    conserved).
  • To find the best alignment we want to minimise
    the entropy.

14
Scoring a multiple alignment-Minimum Entropy
  • Minimum entropy
  • Count the residues in each column: c_ia is the
    number of times residue a occurs in column i.
  • Probability of residue a in column i (ML
    estimate): p_ia = c_ia / Σ_a' c_ia'
  • Probability of a column (independence of the
    sequences assumed): P(m_i) = Π_a p_ia^c_ia
  • The entropy score is the negative log of the
    probability of the column:
    S(m_i) = -Σ_a c_ia log p_ia
    (see the sketch below)
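
To make the column score concrete, here is a minimal Python sketch of the minimum-entropy score (the function name and the choice to skip gap symbols are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy_column_score(column):
    """Minimum-entropy score of one alignment column:
    S(m_i) = -sum_a c_ia * log p_ia, with the ML estimate
    p_ia = c_ia / sum_a' c_ia'."""
    counts = Counter(r for r in column if r != '-')  # ignore gap symbols
    total = sum(counts.values())
    return sum(-c * math.log(c / total) for c in counts.values())

print(entropy_column_score("LLLLL"))  # 0.0: fully conserved column
print(entropy_column_score("LLLLG"))  # > 0: more variable, worse score
```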

15
Scoring a multiple alignment-Minimum Entropy
  • Treating columns as statistically independent
    leaves out knowledge of phylogeny.
  • Actually very similar to an HMM without gap
    information.
  • The assumption that the sequences are independent
    can be reasonable if the representative sequences
    of a sequence family are carefully chosen.
  • A variety of tree-based weighting schemes have
    been proposed to partially compensate for the
    defects of the sequence independence assumption.

16
Scoring a multiple alignment-Sum of Pairs
  • Sum of pairs
  • The standard method of scoring multiple alignments
  • Similar to the HMM formulation in that it does not
    use a phylogenetic tree and assumes statistical
    independence of the columns.
  • Not an HMM formulation, though.

17
Scoring a multiple alignment-Sum of pairs
  • Sum of pairs
  • Columns are scored by the SP function using a
    substitution scoring matrix such as a PAM or
    BLOSUM matrix:
    S(m_i) = Σ_{k<l} s(m_i^k, m_i^l)
  • Use a linear gap function or score affine gaps
    separately.
  • Each column sums N(N-1)/2 pairwise scores (a
    sketch follows below).
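
A minimal sketch of the SP column score in Python, with a tiny hand-rolled score table standing in for a full BLOSUM matrix (the G-G entry and the gap cost are illustrative assumptions; s(L,L)=5 and s(G,L)=-4 match BLOSUM50):

```python
from itertools import combinations

# Toy substitution scores; a real implementation would use a full
# BLOSUM or PAM matrix.
S = {('L', 'L'): 5, ('G', 'L'): -4, ('L', 'G'): -4, ('G', 'G'): 8}
GAP = -8  # linear gap cost (assumed value)

def sp_column_score(column):
    """Sum-of-pairs score of one column: sum over all N(N-1)/2 pairs;
    gap-gap pairs score 0, residue-gap pairs score the gap cost."""
    total = 0
    for a, b in combinations(column, 2):
        if a == '-' and b == '-':
            continue
        elif a == '-' or b == '-':
            total += GAP
        else:
            total += S[(a, b)]
    return total

print(sp_column_score("LLLL"))  # 6 pairs x 5 = 30
print(sp_column_score("LLLG"))  # 3x5 + 3x(-4) = 3
```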

18
Scoring a multiple alignment-Sum of pairs
  • Problem with sum of pairs
  • The sum of pairwise scores is not a
    probabilistically correct extension of the
    pairwise log-odds score.
  • Correct log-odds extension, e.g. for a column of
    three residues a, b, c:
    log [ p_abc / (q_a q_b q_c) ]
  • SP score:
    log(p_ab / q_a q_b) + log(p_ac / q_a q_c) + log(p_bc / q_b q_c)
  • Evolutionary events are over-counted, a problem
    which increases as the number of sequences
    increases.

19
Scoring a multiple alignment-Sum of pairs
  • Example
  • An alignment of N sequences which all have
    leucine (L) at a certain position.
  • BLOSUM50: s(L,L) = 5
  • The SP score of the column is 5 · N(N-1)/2.
  • Suppose instead there were one glycine (G) and
    N-1 Ls.
  • BLOSUM50: s(G,L) = -4
  • The N-1 pairs involving G each lose 5 - (-4) = 9,
    so the SP score of the column is worse than the
    score for a column of all Ls by 9(N-1), a relative
    difference of 9(N-1) / [5N(N-1)/2] = 18/(5N).

20
Scoring a multiple alignment-Sum of pairs
  • The relative difference is 18/(5N).
  • The relative score difference between the correct
    and incorrect alignment therefore decreases with
    the number of sequences (see the check below).
  • Yet if we have MORE evidence that L is conserved,
    an outlier ought to DECREASE the score more, not
    less.
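
A quick numerical check of this behaviour (pure arithmetic, using the BLOSUM50 values above):

```python
# Relative score difference between an all-L column and a column with
# one G outlier, as a function of N (BLOSUM50: s(L,L)=5, s(G,L)=-4).
for N in (5, 10, 50, 100):
    all_L = 5 * N * (N - 1) / 2
    one_G = all_L - 9 * (N - 1)        # N-1 pairs change by 5-(-4) = 9
    print(N, (all_L - one_G) / all_L)  # = 18/(5N): shrinks as N grows
```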

21
Multidimensional Dynamic Programming
  • It is possible to generalise pairwise DP
    alignment to the alignment of N sequences.

22
Multidimensional Dynamic Programming
  • Assumptions
  • The columns of an alignment are statistically
    independent.
  • The gaps are scored with a linear gap cost.
  • Then the overall score for an alignment can be
    calculated as a sum of the scores for each
    column: S(m) = Σ_i S(m_i)

23
Multidimensional Dynamic Programming
  • The pairwise recurrence generalises directly: with
    α(i_1,...,i_N) the score of the best alignment up
    to residue i_k in each sequence x^k,
    α(i_1,...,i_N) = max over all nonzero Δ in {0,1}^N of
    α(i_1-Δ_1,...,i_N-Δ_N) + S(column whose k-th entry
    is x^k_{i_k} if Δ_k = 1 and a gap otherwise)
24
Multidimensional Dynamic Programming
  • Simplifying the notation

25
Multidimensional Dynamic Programming
  • Straightforward multidimensional DP
  • Pros
  • It finds the optimal solution.
  • An arbitrary column scoring function can be used.
  • The only assumption is that column scores are
    independent.
  • Cons
  • There are 2^N - 1 gap combinations for each entry.
  • Huge computational complexity: O(2^N L^N) for N
    sequences of length L (a toy sketch follows below).
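
A toy Python sketch of the straightforward algorithm (scores only; traceback omitted), illustrating why the cost explodes: every cell tries all 2^N - 1 gap combinations. Function names and the toy column score are illustrative choices.

```python
from itertools import combinations, product

def multi_align_score(seqs, column_score):
    """Exact multidimensional DP over N sequences, with the linear gap
    model folded into column_score. O(2^N * L^N): toy sizes only."""
    N = len(seqs)
    dims = tuple(len(s) + 1 for s in seqs)
    moves = [eps for eps in product((0, 1), repeat=N) if any(eps)]
    alpha = {(0,) * N: 0.0}
    # lexicographic order guarantees predecessors are filled first
    for cell in product(*(range(d) for d in dims)):
        if cell == (0,) * N:
            continue
        best = float('-inf')
        for eps in moves:
            prev = tuple(i - e for i, e in zip(cell, eps))
            if min(prev) < 0:
                continue
            column = tuple(s[i - 1] if e else '-'
                           for s, i, e in zip(seqs, cell, eps))
            best = max(best, alpha[prev] + column_score(column))
        alpha[cell] = best
    return alpha[tuple(d - 1 for d in dims)]

def toy_column_score(col):  # match +2, mismatch -1, residue-gap -2
    return sum(0 if a == '-' and b == '-' else
               -2 if a == '-' or b == '-' else
               (2 if a == b else -1)
               for a, b in combinations(col, 2))

print(multi_align_score(["ACG", "AG", "ACG"], toy_column_score))  # 10.0
```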

26
Multidimensional Dynamic Programming-MSA
  • MSA can reduce the volume of the multidimensional
    dynamic programming matrix that needs to be
    examined.
  • It can optimally align up to 5-7 protein sequences
    of reasonable length (200-300 residues).

27
Multidimensional Dynamic Programming-MSA
  • Assumptions
  • SP scoring system
  • The score of a multiple alignment is the sum of
    the scores of all pairwise alignments defined by
    the multiple alignment.
  • Then the score of the complete alignment a is
    given by S(a) = Σ_{k<l} S(a^kl), where a^kl is
    the pairwise alignment of sequences k and l
    induced by a.
  • Let â^kl be the optimal pairwise alignment of
    k, l.

28
Multidimensional Dynamic Programming-MSA
  • We can obtain a lower bound on the score of any
    pairwise alignment that can occur in the optimal
    multiple alignment.
  • Assume that we have a lower bound σ(a) on the
    score of the optimal multiple alignment a*. Since
    σ(a) ≤ S(a*) and S(a*^k'l') ≤ S(â^k'l') for every
    pair, it follows that
    S(a*^kl) ≥ β^kl = σ(a) - Σ_{(k',l') ≠ (k,l)} S(â^k'l')
  • We only need to consider pairwise alignments of k
    and l that score better than β^kl (see the sketch
    below).
  • A good bound σ(a) can be obtained by any fast
    heuristic algorithm.
  • Optimal pairwise alignments can be found using
    dynamic programming.
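
A minimal sketch of computing the bounds β^kl from precomputed optimal pairwise scores and a heuristic multiple-alignment score (the function name, dictionary layout, and all numbers are assumptions for illustration):

```python
def pairwise_bounds(pair_opt, sigma):
    """beta[kl] = sigma - sum of optimal pairwise scores over all other
    pairs. Pairwise alignments of k,l scoring below beta[kl] cannot
    occur in the optimal multiple alignment."""
    total = sum(pair_opt.values())
    return {pair: sigma - (total - s) for pair, s in pair_opt.items()}

# toy usage, 3 sequences (all scores invented for illustration):
pair_opt = {(0, 1): 40, (0, 2): 35, (1, 2): 30}   # S(a-hat kl)
print(pairwise_bounds(pair_opt, sigma=100))
# {(0, 1): 35, (0, 2): 30, (1, 2): 25}
```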

29
Multidimensional Dynamic Programming-MSA
  • Now find the complete set B^kl of coordinate
    pairs (i_k, i_l) such that the best alignment of
    x^k to x^l through (i_k, i_l) scores more than
    β^kl.
  • The costly multidimensional dynamic programming
    algorithm can then be restricted to evaluate only
    cells in the intersection of all these sets,
    i.e. cells (i_1, ..., i_N) for which (i_k, i_l) is
    in B^kl for all k, l.

30
Progressive alignment methods
  • The most commonly used approach
  • Works by constructing a succession of pairwise
    alignments.
  • Initially, two sequences are chosen and aligned
    by standard pairwise alignment; this alignment is
    fixed.
  • Then a third sequence is chosen and aligned to
    the first alignment.
  • This process is iterated until all sequences have
    been aligned.

31
Progressive alignment methods
  • Basically heuristic
  • It does not separate scoring from optimisation.
  • It does not directly optimise any global scoring
    function.
  • Fast and efficient; generates reasonable results.

32
Progressive alignment methods
  • Differences between progressive alignment
    algorithms
  • The way that they choose the order in which to do
    the alignments
  • Whether the progression involves only alignment
    of sequences to a single growing alignment, or
    whether subfamilies are built up on a tree
    structure and, at certain points, alignments are
    aligned to alignments.
  • The procedure used to align and score sequences or
    alignments against existing alignments.

33
Progressive alignment methods- Feng-Doolittle
progressive multiple alignment
  • Calculate a diagonal matrix of N(N-1)/2 distances
    between all pairs of N sequences by standard
    pairwise alignment, converting alignment scores S
    to distances D = -log(S_eff).
  • Construct a guide tree from the distance matrix
    using a clustering algorithm.
  • Starting from the first node added to the tree,
    align the child nodes. Repeat for all other nodes
    in the order that they were added to the tree.

34
Progressive alignment methods-Feng-Doolittle
progressive multiple alignment
  • Converting alignment scores to distances
  • This doesn't need to be accurate: the goal is only
    to create an approximate guide tree, not an
    evolutionary tree (a sketch follows below).
  • In phylogenetic tree construction, more care must
    be taken.
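
A sketch of the score-to-distance conversion, using the normalisation S_eff = (S_obs - S_rand)/(S_max - S_rand), where S_max can be taken from the self-alignment scores and S_rand is the expected score of a random alignment; the numbers below are invented for illustration:

```python
import math

def fd_distance(s_obs, s_max, s_rand):
    """D = -log(S_eff): an approximate distance for guide-tree
    building, near 0 for near-identical sequences and large for
    divergent ones."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    return -math.log(s_eff)

print(fd_distance(s_obs=95.0, s_max=100.0, s_rand=10.0))  # ~0.057
print(fd_distance(s_obs=30.0, s_max=100.0, s_rand=10.0))  # ~1.504
```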

35
Progressive alignment methods-Feng-Doolittle
progressive multiple alignment
  • Clustering
  • Done with the Fitch-Margoliash algorithm
  • Sequence-Sequence alignments
  • Done with usual pairwise dynamic programming.
  • A sequence is added to an existing group by
    aligning it pairwise to each sequence in the
    group in turn.
  • The highest scoring pairwise alignment determines
    how the sequence will be aligned to the group.

36
Progressive alignment methods-Feng-Doolittle
progressive multiple alignment
  • "Once a gap, always a gap" rule
  • After an alignment is completed, gap symbols are
    replaced with a neutral X character (see the
    snippet below).
  • This rule allows pairwise sequence alignments to
    be used to guide the alignment of sequences to
    groups or groups to groups; otherwise, any given
    pairwise sequence alignment would not necessarily
    be consistent with the pre-existing alignment of
    a group.
  • Desirable side effect: encouraging gaps to occur
    in the same columns in subsequent pairwise
    alignments.
  • Not needed in profile-based progressive alignment
    algorithms.
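
The replacement itself is trivial; here X is assumed to score zero against any residue or gap in the later pairwise alignments:

```python
# 'Once a gap, always a gap': freeze a finished group alignment by
# replacing its gap symbols with a neutral X before further alignment.
aligned_group = ["AC-GT", "A--GT"]
frozen = [row.replace('-', 'X') for row in aligned_group]
print(frozen)  # ['ACXGT', 'AXXGT']
```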

37
Progressive alignment methods
  • A problem with the Feng-Doolittle approach:
    all alignments are determined by pairwise
    sequence alignments.
  • It is advantageous to use position-specific
    information from the group's multiple alignment
    to align a new sequence to it (e.g. the degree of
    sequence conservation).
  • Many progressive alignment methods use pairwise
    alignment of sequences to profiles or of profiles
    to profiles as a subroutine which is used many
    times in the process.

38
Progressive alignment methods
  • Linear gap scoring case:
    s(-, a) = s(a, -) = -g and s(-, -) = 0
  • Two profiles: sequences 1..n and n+1..N
  • The SP score of a global alignment of the two
    profiles is
    Σ_i Σ_{k<l} s(m_i^k, m_i^l)
    = Σ_i [ Σ_{k<l≤n} s(m_i^k, m_i^l)
          + Σ_{n<k<l} s(m_i^k, m_i^l)
          + Σ_{k≤n<l} s(m_i^k, m_i^l) ]
  • The first two sums (pairs within each profile)
    are unaffected by the global alignment, since
    s(-, -) = 0.
  • Therefore the optimal alignment of the two
    profiles can be obtained by optimising only the
    last sum with the cross terms, which can be done
    exactly like a standard pairwise alignment (see
    the sketch below).
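
A sketch of the cross-term column score that the profile-profile DP optimises; the within-profile pairs are omitted because they are constant. The substitution function and gap cost are illustrative assumptions:

```python
def cross_term_score(col1, col2, s, g=8):
    """Score of pairing one column of profile 1 with one column of
    profile 2: only the pairs with k <= n < l depend on the alignment."""
    total = 0
    for a in col1:
        for b in col2:
            if a == '-' and b == '-':
                continue             # s(-,-) = 0
            elif a == '-' or b == '-':
                total += -g          # s(a,-) = s(-,a) = -g
            else:
                total += s(a, b)
    return total

# toy usage with a trivial match/mismatch substitution function:
print(cross_term_score("LL-", "LG", lambda a, b: 5 if a == b else -2))
```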

39
Progressive alignment methods-CLUSTALW
  • Profile-based progressive multiple alignment
  • Works in much the same way as the Feng-Doolittle
    method except for its carefully tuned use of
    profile alignment methods.
  • Uses various heuristics.

40
Progressive alignment methods-CLUSTALW
  • Construct a distance matrix of all N(N-1)/2 pairs
    by pairwise dynamic programming.
  • Construct a guide tree by a neighbour-joining
    clustering algorithm.
  • Progressively align at nodes in order of
    decreasing similarity, using sequence-sequence,
    sequence-profile, and profile-profile alignment.
  • Scoring is basically SP.

41
Progressive alignment methods-CLUSTALW
  • Heuristics used
  • Sequences are weighted to compensate for biased
    representation in large subfamilies.
  • The substitution matrix is chosen on the basis of
    the similarity expected of the alignment.
  • Position-specific gap-open penalties are used.
  • Gap penalties are increased if there are no gaps
    in a column but gaps occur nearby in the
    alignment.

42
Progressive alignment methods-Iterative
refinement methods
  • Problem with progressive alignment:
    subalignments are "frozen".
  • Once a group of sequences has been aligned, their
    alignment to each other cannot be changed at a
    later stage as more data arrive.
  • Iterative refinement methods attempt to
    circumvent this problem.

43
Progressive alignment methods-Iterative
refinement methods
  • Iterative refinement method
  • An initial alignment is generated.
  • Then one sequence (or a set of sequences) is
    taken out and realigned to a profile of the
    remaining aligned sequences.
  • If a meaningful score is being optimised, this
    either increases the overall score or leaves it
    unchanged.
  • Another sequence is chosen and realigned, and so
    on, until the alignment does not change.
  • Guaranteed to converge to a local maximum.

44
Progressive alignment methods-Iterative
refinement methods
  • Barton-Sternberg multiple alignment (a skeleton
    follows below)
  • Find the two sequences with the highest pairwise
    similarity and align them using standard pairwise
    DP alignment.
  • Find the sequence that is most similar to a
    profile of the alignment of the first two, and
    align it to the first two by profile-sequence
    alignment. Repeat until all sequences have been
    included in the multiple alignment.
  • Remove sequence x^1 and realign it to a profile of
    the other aligned sequences x^2, ..., x^N by
    profile-sequence alignment. Repeat for sequences
    x^2, ..., x^N.
  • Repeat the previous realignment step a fixed
    number of times, or until the alignment score
    converges.
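
A structural skeleton of the procedure in Python. Every helper passed in (pairwise_align, profile_align, score, remove_row) is a placeholder for a real alignment routine: their signatures are assumptions for illustration, not a library API, and score is overloaded here to compare both sequence pairs and profile-sequence pairs.

```python
from itertools import combinations

def barton_sternberg(seqs, pairwise_align, profile_align,
                     score, remove_row, n_rounds=3):
    """Skeleton of Barton-Sternberg iterative refinement."""
    # 1. seed: align the two most similar sequences by pairwise DP
    i, j = max(combinations(range(len(seqs)), 2),
               key=lambda p: score(seqs[p[0]], seqs[p[1]]))
    alignment = pairwise_align(seqs[i], seqs[j])
    pool = [s for k, s in enumerate(seqs) if k not in (i, j)]
    # 2. grow: repeatedly add the sequence most similar to the profile
    while pool:
        best = max(pool, key=lambda s: score(alignment, s))
        pool.remove(best)
        alignment = profile_align(alignment, best)
    # 3. refine: pull each sequence out and realign it to the rest
    for _ in range(n_rounds):  # or: until the score converges
        for k in range(len(seqs)):
            alignment = profile_align(remove_row(alignment, k), seqs[k])
    return alignment
```

A real implementation would also track which alignment row corresponds to which input sequence as rows are removed and re-added.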

45
Multiple alignment by profile HMM training
  • Sequence profiles can be recast in
    probabilistic form as profile HMMs.
  • Profile HMMs can simply be used in place of
    standard profiles in progressive or iterative
    alignment methods.
  • The ad hoc SP scoring scheme can be replaced by
    the more explicit probabilistic assumptions of a
    profile HMM.
  • Profile HMMs can also be trained from initially
    unaligned sequences using the Baum-Welch EM
    algorithm.

46
Multiple alignment by profile HMM training-
Multiple alignment with a known profile HMM
  • Before we estimate a model and a multiple
    alignment simultaneously, we consider the simpler
    problem of obtaining a multiple alignment from a
    known model.
  • This applies when we have a multiple alignment and
    a model of a small representative set of sequences
    in a family, and we wish to use that model to
    align a large number of other family members.

47
Multiple alignment by profile HMM training-
Multiple alignment with a known profile HMM
  • We know how to align a sequence to a profile
    HMM: the Viterbi algorithm.
  • Constructing a multiple alignment just requires
    calculating a Viterbi alignment for each
    individual sequence.
  • Residues aligned to the same profile HMM match
    state are aligned in columns (see the sketch
    below).
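
A sketch of the column-building step, assuming we already have a Viterbi state path for each sequence. The path encoding used here, pairs of (state, residue) with states ('M', k), ('I', k), ('D', k), is an illustrative convention, not the output format of any particular HMM package:

```python
def paths_to_alignment(paths, M):
    """Build alignment rows from per-sequence Viterbi paths through a
    length-M profile HMM. Residues in the same match state share a
    column; insert-state residues are left unaligned (lower-case,
    padded with '.'), and delete states leave '-' in the column."""
    rows = []
    for path in paths:
        match = {k: '-' for k in range(1, M + 1)}
        insert = {k: '' for k in range(0, M + 1)}
        for (kind, k), res in path:
            if kind == 'M':
                match[k] = res
            elif kind == 'I':
                insert[k] += res.lower()
    # (delete states emit nothing: match[k] stays '-')
        rows.append((match, insert))
    lines = []
    for match, insert in rows:
        line = ''
        for k in range(0, M + 1):
            pad = max(len(r[1][k]) for r in rows)  # widest insert here
            line += insert[k].ljust(pad, '.')
            if k < M:
                line += match[k + 1]
        lines.append(line)
    return lines

# toy usage: model length 3; second sequence has an insert and a delete
p1 = [(('M', 1), 'A'), (('M', 2), 'C'), (('M', 3), 'G')]
p2 = [(('M', 1), 'A'), (('I', 1), 'T'), (('M', 2), 'C'), (('D', 3), '')]
print(*paths_to_alignment([p1, p2], 3), sep='\n')  # A.CG / ATC-
```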

48
Multiple alignment by profile HMM
training-Multiple alignment with a known profile
HMM
  • Given a preliminary alignment, the HMM can be used
    to align further sequences.

49
Multiple alignment by profile HMM training-
Multiple alignment with a known profile HMM
50
Multiple alignment by profile HMM training-
Multiple alignment with a known profile HMM
  • Important difference from other MSA programs:
  • The Viterbi path through the HMM identifies
    inserts.
  • A profile HMM does not align inserts.
  • Other multiple alignment algorithms align the
    whole sequences.

51
Multiple alignment by profile HMM training-
Multiple alignment with a known profile HMM
  • The HMM doesn't attempt to align residues assigned
    to insert states.
  • The insert state residues usually represent parts
    of the sequences which are atypical, unconserved,
    and not meaningfully alignable.
  • This is a biologically realistic view of multiple
    alignment.

52
Multiple alignment by profile HMM training-
Profile HMM training from unaligned sequences
  • A harder problem: estimating both a model and a
    multiple alignment from initially unaligned
    sequences.
  • Initialization: Choose the length of the profile
    HMM and initialize parameters.
  • Training: Estimate the model using the Baum-Welch
    algorithm or the Viterbi alternative.
  • Multiple alignment: Align all sequences to the
    final model using the Viterbi algorithm and build
    a multiple alignment as described in the previous
    section.

53
Multiple alignment by profile HMM training-
Profile HMM training from unaligned sequences
  • Initial model
  • The only decision that must be made in choosing
    an initial structure for Baum-Welch estimation is
    the length of the model M.
  • A commonly used rule is to set M to the average
    length of the training sequences.
  • We need some randomness in the initial parameters
    to avoid local maxima.

54
Multiple alignment by profile HMM training
  • Avoiding local maxima
  • The Baum-Welch algorithm is only guaranteed to
    find a LOCAL maximum.
  • Models are usually quite long and there are many
    opportunities to get stuck in a wrong solution.
  • Multidimensional dynamic programming finds the
    global optimum, but is not practical.
  • Solutions
  • Start again many times from different initial
    models (a sketch follows below).
  • Use some form of stochastic search algorithm,
    e.g. simulated annealing.
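
A sketch of the multiple-restart strategy; train and loglik are placeholders for a real Baum-Welch implementation and a log-likelihood evaluation (assumed signatures, for illustration only):

```python
import random

def train_with_restarts(train, loglik, seqs, n_starts=10, seed=0):
    """Run Baum-Welch from several random initial models and keep the
    model with the best log likelihood on the training sequences."""
    rng = random.Random(seed)
    best_model, best_ll = None, float('-inf')
    for _ in range(n_starts):
        model = train(seqs, rng)      # random initialisation inside
        ll = loglik(model, seqs)
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model, best_ll
```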

55
Multiple alignment by profile HMM
training-Simulated annealing
  • Theoretical basis
  • Some compounds only crystallise if they are
    slowly annealed from high temperature to low
    temperature.
  • One can introduce an artificial temperature T;
    by the laws of statistical physics the
    probability of a configuration x is given by the
    Gibbs distribution
    P(x) = (1/Z) exp(-E(x)/T)
  • In the limit T → 0 the system is frozen; in the
    limit T → ∞ the system is molten.
  • The minimum can be found by sampling this
    probability distribution at a high temperature
    first, and then at gradually decreasing
    temperatures.

56
Multiple alignment by profile HMM
training-Simulated annealing
  • For an HMM, a natural energy function is the
    negative log likelihood of the data:
    E = -log P(x^1, ..., x^n | θ)
  • Approximations
  • Noise injection during Baum-Welch re-estimation
  • Simulated annealing Viterbi estimation of HMMs

57
Multiple alignment by profile HMM
training-Simulated annealing
  • Noise injection during Baum-Welch re-estimation
  • Add noise to the counts estimated in the
    forward-backward procedure (see the sketch below).
  • Let the size of this noise decrease slowly.
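
A sketch of noise injection; the amplitude, decay rate, and the uniform noise source are illustrative choices:

```python
import random

def noisy_counts(counts, iteration, noise0=1.0, decay=0.9, rng=random):
    """Perturb the expected counts from the forward-backward step with
    noise whose amplitude shrinks each iteration; the perturbed counts
    are then renormalised into probabilities as usual."""
    amp = noise0 * decay ** iteration
    return [c + amp * rng.random() for c in counts]

counts = noisy_counts([12.0, 3.5, 0.2], iteration=5)
print([round(c / sum(counts), 3) for c in counts])  # perturbed p's
```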

58
Multiple alignment by profile HMM
training-Simulated annealing
  • Simulated annealing Viterbi estimation of HMMs
  • The model is trained by a simulated annealing
    variant of the Viterbi approximation to Baum-Welch
    estimation.
  • Viterbi estimation selects the highest
    probability path π for each sequence x.
  • Simulated annealing instead samples each path π
    according to the likelihood of the path given the
    current model, as modified by a temperature T.

59
Multiple alignment by profile HMM
training-Simulated annealing
  • Scheduling the temperature
  • A whole science (or art) in itself
  • There are theoretical results for simulated
    annealing saying that if the temperature is
    lowered slowly enough, finding the optimum is
    guaranteed.
  • In practice a simple exponentially or linearly
    decreasing schedule is often used (see the snippet
    below).
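
The two common schedules side by side (starting temperature, cooling factor, and step count are arbitrary illustrative values):

```python
T0, alpha, n_steps = 10.0, 0.9, 20
exponential = [T0 * alpha ** k for k in range(n_steps)]
linear = [T0 * (1 - k / n_steps) for k in range(n_steps)]
print(exponential[:3], linear[:3])
```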

60
Multiple alignment by profile HMM training-Comparison to
Gibbs sampling
  • The Gibbs sampler algorithm described by
    Lawrence et al. (1993) has substantial
    similarities.
  • The problem was to simultaneously find the motif
    positions and to estimate the parameters for a
    consensus statistical model of them.
  • The statistical model used is essentially a
    profile HMM with no insert or delete states.
  • In the HMM framework, both the simulated annealing
    algorithm and the Gibbs sampler are stochastic
    variants of the Viterbi approximation to EM.
  • The Gibbs sampler is like running the simulated
    annealing Viterbi algorithm at a constant T = 1,
    where alignments are sampled from a probability
    distribution unmodified by any temperature factor.

61
Multiple alignment by profile HMM training-Model
surgery
  • After (or during) training a model, we can look at
    the alignment it produces and decide that the
    model needs some modification.
  • Some of the match states are redundant.
  • Some insert states absorb too much sequence.
  • Model surgery (a sketch follows below):
  • If a match state is used by less than half of the
    training sequences, delete its module
    (match-insert-delete states).
  • If more than half of the training sequences use a
    certain insert state, expand it into n new
    modules, where n is the average length of the
    insertions there.
  • Ad hoc, but works well.
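
A sketch of the first surgery rule, deciding from match-state usage which modules to delete. The alignment encoding and the exact threshold handling are illustrative; the insert-state expansion pass is analogous and not shown:

```python
def match_state_surgery(columns, threshold=0.5):
    """columns[k]: the residues assigned to match state k+1, one
    character per training sequence ('-' where the sequence used a
    delete). Flag match states used by fewer than half the sequences."""
    decisions = {}
    for k, col in enumerate(columns, start=1):
        usage = sum(r != '-' for r in col) / len(col)
        decisions[k] = 'delete module' if usage < threshold else 'keep'
    return decisions

# toy usage: state 2 is used by only 1 of 4 sequences -> delete it
print(match_state_surgery(["ACGA", "-A--", "GGGG"]))
```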