Provided by: t808
Transcript and Presenter's Notes

Title: Sequence Alignment


1
Sequence Alignment
2
Sequence Alignment
  • Why
  • To match a new sequence to others with known
    functions
  • To search for ESTs and other signs of gene
    expression
  • To understand population dynamics and
    evolutionary relationships between genes and
    species
  • To find important regions within proteins
  • Issues
  • Alignment should mimic evolutionary descent: the
    actual history of mutation and selection that led
    to this gene
  • But that history is too complicated to get
    perfectly correct
  • Protein alignments work over larger evolutionary
    distances than nucleotide
  • How to treat substitutions, insertions and
    deletions (gaps)
  • How to score possible alignments
  • Global vs. local alignment
  • Multiple alignment (as an extension of pairwise
    alignment)
  • Hidden Markov Models and other ways of
    abstracting multiple alignment information
  • Homology: related by evolutionary descent, as
    opposed to similarity, which is not necessarily
    based on descent from a common ancestor
  • But in practice, long aligned sequences seem to
    arise only by evolution
  • Short alignments can be due to chance or
    convergent evolution.

3
Example Alignments
  • THISSEQUENCE vs. THATSEQUENCE
  • Same length, just 2 mismatches
  • THISISASEQUENCE vs. THATSEQUENCE
  • Length is different, need to introduce gaps to
    maximize identities.

4
Scoring by Identity
  • One simple way to score an alignment is by
    counting the number of perfect matches.
  • Get percentage of identities by dividing number
    of matches by total positions (including gap
    positions). This is a measure of relatedness
    between 2 proteins.
  • For the previous example, 11 matches in 16
    positions = 68.75% (69%) identity
  • Length matters: it is harder to get a high
    percentage of identities in a long sequence than
    in a short one.
  • Problem of random matches. For nucleotides, 25%
    of all positions in random sequences match, and
    it's 5% for proteins.
  • General rule, based on proteins with known
    structural similarity:
  • Two proteins are probably structurally similar
    (and thus probably homologous) if they have 30%
    or more identical amino acids over their whole
    length when aligned.
  • Less than 20% amino acid identity means probably
    not homologous
  • Between 20% and 30% is a gray zone
  • My personal happiness with matches increases when
    it's above 35%
  • Except for very unusual proteins, 100% identity
    doesn't occur between homologous proteins in
    different species
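
The identity calculation above can be sketched in a few lines of Python. The function name `percent_identity` and the particular gapped alignment used for THISISASEQUENCE vs. THATSEQUENCE are assumptions made for illustration (the slide image with the original alignment is not preserved); the alignment was chosen because it gives the 11 matches in 16 positions quoted above.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment positions (gap positions included) that are identical."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A position counts as a match only if both residues are equal and not gaps.
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Same length, 2 mismatches, no gaps needed.
ungapped = percent_identity("THISSEQUENCE", "THATSEQUENCE")

# One possible gapped alignment with 11 matches over 16 positions (assumed).
gapped = percent_identity("THISISA-SEQUENCE", "TH----ATSEQUENCE")
print(round(ungapped, 2), gapped)
```

Note that gap positions are counted in the denominator, as the slide describes; some tools instead divide by the length of the shorter sequence, which gives a higher percentage.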

5
Dotplots
  • Dotplots are a simple way of seeing alignments
  • We really like to see good visual demonstrations,
    not just tables of numbers
  • It's a grid: put one sequence along the top and
    the other down the side, and put a dot wherever
    they match.
  • You see the alignment as a diagonal

6
Dotplot Noise
  • A big problem is noise: there are lots of random
    matches (roughly 5% for proteins) that confuse
    the image.
  • Standard solution: create a sliding window (say
    10 residues) and only mark a dot if a minimum
    number of matches occur in that window (say 3).
  • A lot of noise goes away
  • This is a sequence compared to itself, so there
    is a perfect diagonal.
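
A minimal sketch of a dotplot with the sliding-window filter described above. The names `dotplot` and `render`, and the choice to mark the dot at the window's starting position, are assumptions for illustration, not the behavior of any particular dotplot program.

```python
def dotplot(top_seq, side_seq, window=10, min_matches=3):
    """Return the set of (column, row) positions that get a dot.

    A dot is placed where a window starting at that position contains
    at least min_matches identical residues along the diagonal.
    """
    dots = set()
    for col in range(len(top_seq) - window + 1):
        for row in range(len(side_seq) - window + 1):
            matches = sum(
                top_seq[col + k] == side_seq[row + k] for k in range(window)
            )
            if matches >= min_matches:
                dots.add((col, row))
    return dots


def render(top_seq, side_seq, dots):
    """Draw the grid as text: one sequence along the top, the other down the side."""
    lines = ["  " + top_seq]
    for row, residue in enumerate(side_seq):
        cells = "".join("*" if (col, row) in dots else "." for col in range(len(top_seq)))
        lines.append(residue + " " + cells)
    return "\n".join(lines)


seq = "THISSEQUENCE"
unfiltered = dotplot(seq, seq, window=1, min_matches=1)  # every single match
filtered = dotplot(seq, seq, window=3, min_matches=3)    # exact 3-residue windows only
print(render(seq, seq, filtered))
```

With `window=1` every chance match is drawn; requiring several matches per window removes most isolated dots while the true diagonal survives.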

7
A Real Dotplot
  • Two haptoglobin sequences. (Haptoglobin is a
    blood protein that binds to hemoglobin that has
    gotten out of the red blood cells).
  • You can see a gap in one sequence, a region of
    poor similarity just before it, and a simple
    sequence repeat near the beginning.

8
Similarity Matching
  • In proteins, many substitutions occur that have
    little effect on structure or function
  • or, they alter the protein to make it more
    adapted for the lifestyles of the different
    species
  • This depends on where in the protein they occur
    and on the chemical and physical properties of
    the amino acids.
  • Substitution matrices: scores for the probability
    of changing one amino acid into another.
  • Amino acids are similar if they can frequently be
    substituted for each other.
  • These are just overall numbers compiled over many
    sequences, not adapted to specific cases.
  • Early attempts were based on amino acid
    properties, or on the number of nucleotide
    substitutions needed to change from one amino
    acid to the other.
  • Now they are based on actual comparison between
    sequences.
  • The two most popular types: PAM and BLOSUM
  • There are other, more specialized substitution
    matrices, for comparing transmembrane regions,
    for example.

9
BLOSUM62 Matrix
10
Similarity Matrix Theory
  • Think about aligning 2 proteins from similar
    species that are orthologs: same function and
    syntenic. At some point back in evolutionary
    time, there was a single DNA sequence that is the
    common ancestor of both proteins.
  • Most paired amino acids are identical, but a few
    are different.
  • Reduce the problem: consider a single aligned
    pair of amino acids that are not identical, T-S.
  • We are comparing 2 theories of how these amino
    acids were derived from a common ancestor.
  • Random mutation followed by natural selection:
    some substitutions will happen more frequently
    than others because they lead to functional
    proteins more often.
  • The frequency with which T and S are substituted
    for each other by evolution is derived from
    counting them in well-aligned sequences:
    freq(T-S)
  • Completely random changes: every possible
    substitution happens in proportion to the
    relative frequencies of the different amino
    acids; the two amino acids are unrelated to each
    other.
  • In this case, the frequency of a T and an S is
    just the product of the frequency of Ts and the
    frequency of Ss in the entire protein (or
    proteome): freq(T) × freq(S)
  • The odds ratio is the evolutionary theory
    (observed data) frequency divided by the random
    theory frequency:
    OR = freq(T-S) / (freq(T) × freq(S))

11
More Theory
  • We want to get the odds that a given alignment
    fits the evolutionary model better than a random
    model.
  • Good alignments give high odds ratios
  • Need to multiply the ORs for all amino acids in
    the alignment
  • It is easier (and doesn't overflow the computer's
    floating-point arithmetic) to take the logarithm
    of the odds ratio for each amino acid, and then
    add the logarithms.
  • This is the lod score (log of odds).
  • A negative score means that the given
    substitution is less likely than chance, and a
    positive score means it is more likely than
    chance.
  • You can score each possible alignment by adding
    up over the whole protein
  • Some fooling with constants (which don't distort
    the results but are either more pleasing to the
    human eye or make further calculations easier):
    multiply the lod score by 10, or add a constant to
    make all values 0 or greater
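
The relationship between multiplying odds ratios and adding lod scores can be checked with a short Python sketch. The frequencies below are made-up numbers chosen only to illustrate the arithmetic, and the function names are hypothetical.

```python
import math


def odds_ratio(pair_freq, freq_a, freq_b):
    """Observed pair frequency divided by the frequency expected by chance."""
    return pair_freq / (freq_a * freq_b)


def lod_score(pair_freq, freq_a, freq_b):
    """Log (base 10) of the odds ratio for one aligned pair."""
    return math.log10(odds_ratio(pair_freq, freq_a, freq_b))


# Hypothetical (pair_freq, freq_a, freq_b) values for three aligned positions.
aligned_pairs = [
    (0.005, 0.05, 0.06),  # pair seen more often than chance -> positive lod
    (0.001, 0.05, 0.06),  # pair seen less often than chance -> negative lod
    (0.020, 0.08, 0.08),
]

# Multiplying the odds ratios and summing the lod scores carry the same information.
product_of_odds = math.prod(odds_ratio(*pair) for pair in aligned_pairs)
sum_of_lods = sum(lod_score(*pair) for pair in aligned_pairs)
print(product_of_odds, sum_of_lods)
```

For a real protein with hundreds of positions the product can become extremely large or small, which is why the summed logarithms are the practical form.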

12
PAM
  • PAM: Point Accepted Mutations, meaning single
    amino acid substitutions (point mutations) that
    have been accepted by natural selection: they
    are functional in different species.
  • Derived by Dayhoff and colleagues in the 1960s
    and 1970s (although there are some newer
    versions around)
  • They give a measure of the frequency of changing
    from one amino acid to another, as compared to
    the frequency of random change
  • Derived from global alignments of homologous
    sequences from different, but closely related,
    species. The sequences had an average of 1 amino
    acid change per hundred residues. Thus we assume
    at most 1 mutation has occurred at each position.
  • Do a phylogenetic analysis of the sequences to
    determine which mutations have occurred
  • Calculate the lod scores. Then multiply all of
    them by 10 and round to integers.
  • This set of scores derived from sequence
    alignments is the PAM1 matrix.
  • Since most sequences being aligned are not
    from such closely related species, the PAM1
    matrix is multiplied by itself many times to
    mimic lots of small changes.
  • This concept is a serious weakness: multiplying
    errors magnifies them.
  • The number after PAM is the number of times the
    matrix has been multiplied by itself.
  • Common ones: PAM30, PAM70, PAM120, PAM250.
    A bigger number is better for more distant
    relationships
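
The "multiply the matrix by itself" step can be illustrated with a toy example. This sketch assumes the multiplication is done on a matrix of mutation probabilities (each row summing to 1) rather than on the rounded lod scores, and it uses a made-up 3-letter alphabet; `TOY_PAM1` is not real PAM data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(matrix, power):
    """Multiply a matrix by itself `power` times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical probabilities for a 3-letter alphabet: each residue has a
# 1% chance of changing per step, mimicking 1 change per 100 residues.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the chance of a residue remaining unchanged drops well below 0.99, which is the behavior the higher-numbered matrices are meant to capture. Any error in the starting values is carried through every multiplication, which is the weakness noted above.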

13
BLOSUM
  • BLOcks SUbstitution Matrix. Derived in the
    1990s by Henikoff and Henikoff.
  • Based on local alignments of Blocks, which are
    short, highly homologous regions, with no gaps
  • Sequences were grouped together if they were very
    similar, and then comparisons were made between
    the groups as in the PAM matrices.
  • No attempt at phylogenetic trees
  • The different BLOSUM matrices have specific
    cutoffs for amino acid identities. For example,
    for the BLOSUM62 matrix, sequences in a block
    with at least 62% identity were grouped together
    before the comparisons were made.
  • The odds ratio for each substitution is
    calculated, but instead of taking the base 10 log
    and multiplying the result by 10 as in PAM,
    BLOSUM takes the base 2 log and multiplies by 2.
    This gives scores in half-bits.
  • Bigger numbers imply closer evolutionary
    distance, so BLOSUM80 is better for closely
    related species than BLOSUM45.
  • BLOSUM seems to work better than PAM
  • BLOSUM62 is the default used in BLAST searches.
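
The two scaling conventions described above (10 × log10 for PAM, 2 × log2 for BLOSUM) can be compared on the same odds ratio. The function names are hypothetical and the odds ratios are illustrative values, not entries from either matrix.

```python
import math


def pam_style_score(odds):
    """Base-10 log of the odds ratio, multiplied by 10 and rounded."""
    return round(10 * math.log10(odds))


def blosum_style_score(odds):
    """Base-2 log of the odds ratio, multiplied by 2 and rounded (half-bits)."""
    return round(2 * math.log2(odds))


for odds in (0.25, 1.0, 2.0):
    print(odds, pam_style_score(odds), blosum_style_score(odds))
```

An odds ratio of 2 (twice as frequent as chance) is 1 bit, so it scores 2 in half-bits; an odds ratio of 1 scores 0 under either convention.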

14
BLOSUM62 and PAM120 Matrices
The colors represent different physicochemical
properties. Note that some substitutions
are positive, which indicates that they occur
more frequently than chance. The average value
is negative: it is more likely that an amino acid
will stay the same than change. The diagonal
values are unchanged amino acids, all of which
have positive values. Some are less
changeable than others, tryptophan and
cysteine especially.
15
Gaps
  • Gaps occur with roughly 1/10 the frequency of
    base substitutions, so they are common in most
    alignments.
  • Symbolized by hyphens ( --- ) paired with
    residues, like a mismatch with a blank space.
  • You can assign a penalty for each gap position.
  • This is called a linear gap penalty: the total
    penalty is proportional to the gap length.
  • The problem is, once you start putting them in,
    you can get almost anything aligned.
  • Alignment programs usually distinguish between
    creating a gap and extending a gap. Thus, there
    is a gap opening penalty and a (smaller) gap
    extension penalty.
  • This is called an affine gap penalty.
  • Although substitutions have a lot of theory
    behind them, gap penalties are generally
    determined by heuristic means.
  • Heuristic: a method or value determined by
    trial-and-error experiments, without a strong
    guiding theory.
  • In this case, gap opening and extension penalties
    are the result of trying many possibilities and
    seeing which ones give the most pleasing
    alignments.
  • The BLAST default is a -11 penalty for opening
    the gap and -1 for each additional base of gap.
    (11/1)
  • Other options on BLAST at NCBI are 7/2, 8/2, 9/2,
    10/1, and 12/1
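
A small sketch comparing the linear and affine gap penalties described above. It follows the slide's wording (the opening penalty covers the first gap position and each additional position costs the extension penalty); some programs instead charge the extension penalty for every position including the first, so check the documentation of the tool you use. The function names and defaults are assumptions for illustration.

```python
def linear_gap_score(length, per_position=-6):
    """Linear penalty: every gap position costs the same."""
    return per_position * length


def affine_gap_score(length, gap_open=-11, gap_extend=-1):
    """Affine penalty: opening covers the first position, extension each further one."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for gap_length in (1, 2, 5, 10):
    print(gap_length, linear_gap_score(gap_length), affine_gap_score(gap_length))
```

With these values a single long gap is much cheaper under the affine scheme than under the linear one, while several short gaps each pay the full opening penalty. This discourages the "lots of little gaps" alignments.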

16
  • Comparing 2 distantly related sequences with
    different gap penalties
  • Top sequence has fewer gaps and longer matches.
  • Bottom sequence has more identities and
    similarities overall, but lots of little gaps.
    The matches near the C-terminal are absurd.
  • Look at the short segment after the first gap in
    the lower sequence: it gained 3 identities

17
How Do We Make Alignments?
  • We have been working on scoring an alignment:
    identities and similarities, and gap penalties.
  • But, how do you get an alignment to score in the
    first place?
  • Trying all possibilities is one of those "more
    possibilities than there are atoms in the
    Universe" problems.
  • The general solution: dynamic programming, a
    technique first applied to protein sequences by
    Needleman and Wunsch (1970)
  • Their original method gave global alignments.
  • Smith and Waterman (1981) provided a slight (but
    critical) modification that produced local
    alignments, which work better than global for
    most genes.
  • These methods provide an optimal alignment, for a
    given substitution matrix and set of gap
    penalties.
  • They are much faster than trying all
    possibilities, but still not quick enough.
    Various refinements and heuristic methods improve
    the speed.
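
To get a feel for how fast the number of possible alignments grows, the sketch below counts the paths through an alignment grid that use only diagonal, vertical, and horizontal steps. This is one way of counting (it treats alignments that differ only in the order of adjacent gaps as distinct), and the function name is hypothetical. The counting itself is a small dynamic programming calculation: each cell is filled from its three neighbors.

```python
def count_alignment_paths(len_a, len_b):
    """Count grid paths from the top-left to the bottom-right corner.

    Each path uses diagonal (aligned pair), vertical, or horizontal (gap) steps.
    """
    # Along the top row and left column there is exactly one path (all gaps).
    paths = [[1] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            paths[i][j] = paths[i - 1][j - 1] + paths[i - 1][j] + paths[i][j - 1]
    return paths[len_a][len_b]


for length in (2, 3, 10, 300):
    digits = len(str(count_alignment_paths(length, length)))
    print(length, "residues each:", digits, "digit count of paths")
```

Two 3-residue sequences already have 63 possible paths, while filling the grid for them takes only 9 cell calculations; the number of cells grows with the product of the lengths, not exponentially.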

18
Smith-Waterman Algorithm
  • Start with a 2-dimensional matrix with one
    sequence along the top and the other sequence
    down the left side. All possible pairs of
    nucleotides or amino acids are represented by the
    cells of the matrix.
  • Add an edge row along the top and an edge column
    down the left side.
  • All possible alignments are represented by the
    paths through the matrix.
  • a diagonal step is an alignment between the query
    and the subject sequences at that position
  • a vertical step is a gap in the query sequence
  • a horizontal step is a gap in the subject
    sequence.
  • Have a match reward and penalties for mismatches,
    gap openings, and gap extensions. For our
    example, we will use the BLOSUM62 matrix, with a
    linear gap penalty of -6
  • Initialize the edge rows to scores of 0.

19
BLOSUM62 With positive scores marked
20
Calculating Cell Scores
      T   A
  T   5   7
  G   2   ?
  • The cell at row i and column j has a score
    S(i, j)
  • Starting at the top left cell, proceed row by row,
    calculating each cell's score S(i, j). S(i, j)
    is the maximum of:
  • 0 (i.e. set to 0 if the calculated score is less
    than 0)
  • S(i-1, j-1) + match/mismatch score for cell (i,
    j)
  • S(i, j-1) + match/mismatch score for cell (i, j)
    + gap penalty
  • S(i-1, j) + match/mismatch score for cell (i, j)
    + gap penalty

For the cell in question, the bases don't match,
so it starts with a match/mismatch score of -1.
The gap penalty in this small example is -4.
There are 3 possible alignment paths to this cell:
1. diagonal (query/subject alignment).
   Score: 5 - 1 = 4.
2. vertical (query gap). Score: 7 - 4 - 1 = 2.
3. horizontal (subject gap).
   Score: 2 - 4 - 1 = -3 (set to 0).
Since 4 is the maximum, the cell's value is set to 4.
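
The cell calculation above can be written as a small function. This sketch follows the slide's arithmetic, in which the gap moves add the match/mismatch score as well as the gap penalty; many textbook statements of the algorithm add only the gap penalty on gap moves. The function and parameter names are hypothetical.

```python
def cell_score(diagonal, above, left, sub_score, gap_penalty):
    """Score one cell from its three neighbors, never going below 0."""
    candidates = (
        0,                                # local alignment floor
        diagonal + sub_score,             # diagonal step: residues aligned
        above + sub_score + gap_penalty,  # vertical step: gap in the query
        left + sub_score + gap_penalty,   # horizontal step: gap in the subject
    )
    return max(candidates)


# Values from the example: diagonal 5, above 7, left 2, mismatch -1, gap -4.
print(cell_score(diagonal=5, above=7, left=2, sub_score=-1, gap_penalty=-4))
```

The three non-zero candidates are 4, 2, and -3, so the function returns 4, matching the worked example.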
21
Smith-Waterman Details
  • Start at the first row: T doesn't match anything,
    and looking at BLOSUM62, the only positive score
    for a mismatch is 1 with S.
  • We keep track of the 0 -> 1 diagonal
  • Second row: H matches N (+1), but nothing else.
  • The diagonal starting with the 1 in the previous
    row is an H-A mismatch (-2), so 1 - 2 = -1, which
    is scored as 0.
  • Third row: I gives positive scores with M, L, and
    V. But nothing builds on the previous row.

22
More S-W
  • Fourth row: S has positive scores with N, A, and
    T.
  • S-S = 4 (match), added to 4 from the diagonal = 8
  • S-A = 1. For a horizontal move (subject gap),
    8 + 1 - 6 = 3.
  • S-I is a -2 mismatch, added to 2 from the diagonal
    = 0.
  • S-G is a 0 mismatch, added to 4 from the diagonal
    = 4.

23
More S-W
24
Still More!
25
Traceback
  • Then, start at the highest score in the matrix
    and trace back the path leading through the
    highest previous scores to 0. Go left and up
    only, preferring the diagonal path if a choice
    needs to be made.
  • High score is 16, in the bottom row (but it could
    have been elsewhere).
  • Write the alignment starting at the top.
  • It doesn't cover the entire sequence: it is a
    local alignment, not global.
  • It isn't perfect: the strong diagonal from LI and
    the 0 mismatch score from a G-N pairing overcame
    the gap penalty needed to put a gap where the G
    is.
  • Nevertheless, given the BLOSUM62 matrix and the
    -6 linear gap penalty, this is an optimal
    alignment.

ISALIGNE
IS-LIN-E
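
The fill and traceback steps can be put together in a compact sketch. To stay short it makes two simplifications that differ from the worked example: it uses a plain match/mismatch score instead of BLOSUM62, and it uses the common textbook recurrence in which a gap step adds only the gap penalty. Scores will therefore not reproduce the numbers on the slides. The function name and default values are assumptions for illustration.

```python
def smith_waterman(top_seq, side_seq, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns (score, top, side)."""
    rows, cols = len(side_seq) + 1, len(top_seq) + 1
    score = [[0] * cols for _ in range(rows)]  # edge row and column stay 0
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if side_seq[i - 1] == top_seq[j - 1] else mismatch

    # Fill the matrix row by row, remembering the highest-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),  # diagonal: aligned pair
                score[i - 1][j] + gap,            # vertical: gap in top sequence
                score[i][j - 1] + gap,            # horizontal: gap in side sequence
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a 0 is reached, preferring the diagonal.
    i, j = best_pos
    top_aln, side_aln = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top_aln.append(top_seq[j - 1])
            side_aln.append(side_seq[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top_aln.append("-")
            side_aln.append(side_seq[i - 1])
            i -= 1
        else:
            top_aln.append(top_seq[j - 1])
            side_aln.append("-")
            j -= 1
    return best, "".join(reversed(top_aln)), "".join(reversed(side_aln))


print(smith_waterman("TTACGTTT", "ACGT"))
```

Changing `gap` and rerunning is a quick way to see how the gap penalty alters the alignment that comes out of the traceback.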
26
Changing the Gap Penalty
  • The top one has a -4 gap penalty and the bottom
    one has a -8 gap penalty (both linear). They
    give somewhat different alignments.

27
A Needleman-Wunsch Alignment
28
Speeding Things Up