Pairwise Sequence Alignment - PowerPoint PPT Presentation

Provided by: leahHa2

1
Pairwise Sequence Alignment
  • What is an alignment, and why might it be
    significant?
  • An alignment is a mapping from one sequence to
    another, identifying elements that are likely to
    have arisen from a common ancestor
  • A good alignment is an indication of homology
  • Alignments are NOT exact matches. We will need a
    method to find good alignments in a database...

2
Similarity vs. Homology; Paralogs vs. Orthologs
  • Homology is an evolutionary relationship that
    either exists or does not. It cannot be partial.
  • An ortholog is a homolog that arose through a
    speciation event. Orthologs often share function.
  • A paralog is a homolog that arose through a gene
    duplication event. Paralogs often have divergent
    function.
  • Similarity is a measure of the quality of
    alignment between two sequences. High similarity
    is evidence for homology. Similar sequences may
    be orthologs or paralogs.

3
How do we compute similarity?
  • Similarity can be defined by counting positions
    that are identical between two sequences
  • Gaps (insertions/deletions) can be important
    abcdef    abcdef    abcdef
    abceef    acdef     a-cdef
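
A minimal sketch of identity counting on the three example pairs above. The function name count_identities and the use of "-" as the gap character are assumptions made for illustration.

```python
def count_identities(row_a: str, row_b: str) -> int:
    """Count columns where both rows hold the same non-gap character."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# One substitution: five of six positions are identical.
print(count_identities("abcdef", "abceef"))  # 5
# Without a gap, "acdef" can only be slid along "abcdef"
# (padding with "-" here just marks the unpaired end).
print(count_identities("abcdef", "acdef-"))  # 1
print(count_identities("abcdef", "-acdef"))  # 4
# With one internal gap, all five letters of "acdef" line up.
print(count_identities("abcdef", "a-cdef"))  # 5
```

The gapped version recovers every shared letter, which is why gaps matter when computing similarity.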

4
Not all mismatches are the same
  • Some amino acids are more substitutable for each
    other than others. Serine and threonine are
    more alike than tryptophan and alanine.
  • We can introduce "mismatch costs" for handling
    different substitutions.
  • We don't usually use mismatch costs in aligning
    nucleotide sequences, since no substitution is
    per se better than any other.

5
Many possible alignments to consider
  • Without gaps, there are NxM possible
    alignments between sequences of length N and M
  • Once we start allowing gaps, there are many
    possible arrangements to consider:
    abcbcd    abcbcd    abcbcd
    abc--d    a--bcd    ab--cd
  • This becomes a very large number when we allow
    mismatches, since we then need to look at every
    possible pairing between elements: there are
    roughly N^M possible alignments.

6
Exponential computations get big fast
  • If n = m = 100, there are 100^100 = 10^200 =
    100,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000,000,000,000,000,000,000,000,000,
    000,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000,000,000,000,000,000,000,000,000,
    000,000,000,000,000,000,000,000,000,000,000,000,00
    0,000,000,000,000 different alignments.
  • And 100 amino acids is a small protein!
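
The arithmetic on this slide can be checked directly, since Python integers have arbitrary precision. The rate of one billion alignments per second is a made-up figure, used only to show the scale.

```python
n = m = 100
alignments = n ** m  # roughly N^M candidate alignments

# 100^100 is exactly 10^200: a 1 followed by 200 zeros.
print(alignments == 10 ** 200)    # True
print(len(str(alignments)) - 1)   # 200

# Hypothetical machine scoring 10^9 alignments per second.
seconds_per_year = 60 * 60 * 24 * 365
years = alignments // (10 ** 9 * seconds_per_year)
print(len(str(years)))            # number of digits in the year count
```

Even at that rate, the number of years needed has well over a hundred digits, so brute-force enumeration is not an option.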

7
Avoiding random alignments with a score function
  • Not only are there many possible gapped
    alignments, but introducing too many gaps makes
    nonsense alignments possible
    s--e-----qu---en--ce
    sometimesquipsentice
  • Need to distinguish between alignments that occur
    due to homology, and those that could be expected
    to be seen just by chance.
  • Define a score function that accounts for both
    element mismatches and a gap penalty

8
Match scores
  • Match scores are often calculated on the basis
    of the frequency of particular mutations in
    very similar sequences.
  • We can transform substitution frequencies into
    log odds scores, which can then be added
    together.
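
A small sketch of the log-odds transformation, assuming the common form log2(observed pair frequency / frequency expected by chance). The frequencies below are invented round numbers, not values from any real substitution matrix.

```python
import math


def log_odds(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """Score in bits: how much more often a pair is seen than chance predicts."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Hypothetical background frequency for both residues: 0.125.
# By chance the pair would be seen with frequency 0.125 * 0.125.
common_pair = log_odds(0.03125, 0.125, 0.125)    # seen twice as often as chance
rare_pair = log_odds(0.0078125, 0.125, 0.125)    # seen half as often as chance

print(common_pair)  # positive score
print(rare_pair)    # negative score

# Adding log-odds scores is the same as multiplying the odds ratios.
print(common_pair + rare_pair)
```

Because logs turn products into sums, the score of a whole alignment can be built by adding the per-position scores.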

9
Local vs. Global alignments
  • A global alignment includes all elements of a
    sequence, and includes gaps
  • A global alignment may or may not include "end
    gap" penalties.
  • A local alignment includes only subsequences,
    and is sometimes computed without gaps.
  • Local alignments can find shared domains in
    divergent proteins and are fast to compute
  • Global alignments are better indicators of
    homology and take longer to compute.

10
An alignment score
  • An alignment score is the sum of all the match
    scores of an alignment, with a penalty subtracted
    for each gap.
  • Gap penalties are usually "affine", meaning that
    the penalty for one long gap is smaller than the
    penalty for many smaller gaps that add up to the
    same size.

    a  b  c  -  -  d
    a  c  c  e  f  d
    9  2  7        6    =>  24 - (10 + 2) = 12

    Match score: 9 + 2 + 7 + 6 = 24
    Gap start + continuation penalty: 10 + 2
    Alignment score: 12

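The slide's example can be reproduced with a short scoring routine. This is a sketch: the function name, the dictionary of pair scores, and the handling of gaps are assumptions, with the four match scores (9, 2, 7, 6) and the penalties (10 to start, 2 to continue) taken from the slide.

```python
def affine_alignment_score(row_a, row_b, pair_scores, gap_open, gap_extend):
    """Sum the match scores and subtract an affine penalty for each gap run."""
    total = 0
    gap_row = None  # which row the current gap run sits in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        current = "a" if x == "-" else "b" if y == "-" else None
        if current is None:
            total += pair_scores[(x, y)]
        elif current == gap_row:
            total -= gap_extend   # continuing an existing gap
        else:
            total -= gap_open     # starting a new gap
        gap_row = current
    return total


# Match scores read off the slide for the four aligned pairs.
slide_scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}

print(affine_alignment_score("abc--d", "accefd", slide_scores, 10, 2))  # 12
```

Two separate one-position gaps would each pay the start penalty (10 + 10), which is how the affine form favors one long gap.
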
11
Finding the optimal alignment
  • Given a pair of sequences and a score function,
    identify the best scoring (optimal) alignment
    between the sequences.
  • Remember, exponential number of possible
    alignments (most with terrible scores).
  • Computer science to the rescue: dynamic
    programming identifies optimal alignments in time
    proportional to the product of the lengths of the
    sequences

12
Dynamic programming
  • The name comes from an operations research task,
    and has nothing to do with writing programs.
  • The key idea is to start aligning the sequences
    left to right: once a prefix is optimally
    aligned, nothing about the remainder of the
    alignment changes the alignment of the prefix.
  • We construct a matrix of possible alignment
    scores (NxM^2 calculations worst case) and then
    "traceback" to find the optimal alignment.
  • Called Needleman-Wunsch (global alignment) or
    Smith-Waterman (local alignment)
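
A minimal sketch of global alignment by dynamic programming, in the style of Needleman-Wunsch. To stay short it assumes a simple per-position gap penalty rather than an affine one, and the match, mismatch and gap values are arbitrary illustrative defaults.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning prefix a[:i] with prefix b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("abcdef", "acdef"))  # (3, 'abcdef', 'a-cdef')
```

Each cell is filled once from three neighbours, so with this simple gap penalty the work grows with the product of the two lengths.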

13
Alignment matrix
  • Create a matrix with each sequence to be aligned
    along one edge and the score of the alignment of
    each pair of elements in a cell.
  • The best ungapped local alignment is just the
    highest scoring run along a diagonal
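
A sketch of finding the highest scoring diagonal run without building the whole matrix. It assumes no gaps and simple match/mismatch scores (illustrative values), and scans each diagonal with a running maximum.

```python
def best_ungapped_local(a, b, match=1, mismatch=-1):
    """Return (score, segment_of_a, segment_of_b) for the best diagonal run."""
    n, m = len(a), len(b)
    best = (0, 0, 0, 0)  # score, start in a, start in b, length
    for offset in range(-(m - 1), n):  # offset = i - j picks one diagonal
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start = 0, (i, j)
        while i < n and j < m:
            if run_score <= 0:
                # A run that has dropped to zero or below is not worth extending.
                run_score, run_start = 0, (i, j)
            run_score += match if a[i] == b[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start[0], run_start[1],
                        i - run_start[0] + 1)
            i += 1
            j += 1
    score, start_a, start_b, length = best
    return score, a[start_a:start_a + length], b[start_b:start_b + length]


print(best_ungapped_local("xxabcdyy", "zabcdz"))  # (4, 'abcd', 'abcd')
```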

14
Dynamic programming matrix
  • Each cell has the score for the best aligned
    sequence prefix up to that position.
  • Number in ( )s is the alignment score for the
    pair of amino acids at that position.
  • Gap penalty here is -12 to start and -4 to
    continue.
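
A sketch of how such a matrix can be filled with affine gap penalties, using three score tables (one per way an alignment of two prefixes can end). The -12 and -4 penalties come from the slide; the pair scores of +5 and -3 are invented for the example, and end gaps are penalized here.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, pair_score, gap_open=-12, gap_extend=-4):
    """Best global alignment score when a gap of length k costs open + (k-1)*extend."""
    n, m = len(a), len(b)
    # match_end: a[i-1] paired with b[j-1]; gap_in_b: a[i-1] paired with a gap;
    # gap_in_a: b[j-1] paired with a gap.
    match_end = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_in_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_in_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match_end[0][0] = 0
    for i in range(1, n + 1):
        gap_in_b[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_in_a[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match_end[i][j] = pair_score(a[i - 1], b[j - 1]) + max(
                match_end[i - 1][j - 1], gap_in_b[i - 1][j - 1], gap_in_a[i - 1][j - 1])
            gap_in_b[i][j] = max(match_end[i - 1][j] + gap_open,
                                 gap_in_a[i - 1][j] + gap_open,
                                 gap_in_b[i - 1][j] + gap_extend)
            gap_in_a[i][j] = max(match_end[i][j - 1] + gap_open,
                                 gap_in_b[i][j - 1] + gap_open,
                                 gap_in_a[i][j - 1] + gap_extend)
    return max(match_end[n][m], gap_in_b[n][m], gap_in_a[n][m])


def toy_score(x, y):
    """Hypothetical pair scores: +5 for identity, -3 otherwise."""
    return 5 if x == y else -3


# Best alignment is abcd / a--d: 5 - 12 - 4 + 5 = -6
print(affine_global_score("abcd", "ad", toy_score))  # -6
```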

15
Optimal alignment by traceback
  • We trace back a path that gets us the highest
    score. If we don't have end gap penalties,
    then take any path from the last row or column to
    the first.
  • Otherwise we need to include the top and bottom
    corners

16
Study guide....
  • Dynamic programming alignments are a key
    technology in bioinformatics, and you should
    understand how they work.
  • The method is counterintuitive
  • Work some examples by hand. The textbook has a
    very good explanation, and there is more detail
    and supplementary material on the textbook web
    site, www.bioinformaticsonline.org

17
How do we pick match scores?
  • For match scores, two main options
  • PAM: based on global alignments of closely related
    sequences. Normalized to changes per 100 sites,
    then exponentiated for more distant relatives.
  • BLOSUM: based on local alignments in much more
    diverse sequences
  • Picking the right distance is important, and may
    be hard to do. BLOSUM seems to work better for
    more evolutionarily distant sequences. BLOSUM62
    is a good default.
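
A toy illustration of the "exponentiated" step for PAM, assuming it means raising a one-step substitution probability matrix to a power. The two-letter alphabet and the 1% change rate are invented for the sketch; real PAM matrices cover 20 amino acids.

```python
def mat_mul(p, q):
    """Multiply two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(p, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Hypothetical one-step matrix: each letter changes with probability 0.01,
# i.e. about one change per 100 sites.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_power(pam1_toy, 250)
print(pam250_toy[0])  # probabilities after 250 steps; each row still sums to 1
```

After many steps the rows drift toward the background frequencies, which is why high-numbered matrices suit more distant relatives.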

18
Picking gap penalties
  • Many different possible forms
  • Most common is affine (gap open + gap continue
    penalties)
  • More complex penalties have been proposed.
  • Penalties must be commensurate with match scores.
    Therefore, the match scoring scheme influences
    the gap penalty
  • Most alignment programs suggest appropriate
    penalties for each match score option.

19
Searching for optimal scores
  • One possibility is to try several different match
    scores and gap penalties, and choose the best
    result.
  • In general, this is called parameter space search
    and it is important in many areas.
  • Problems:
      • it requires a lot of computation
      • we need some principled way to compare the
        results.
  • Use significance testing to compare...

20
The significance of an alignment
  • Significance testing is the branch of statistics
    that is concerned with assessing the probability
    that a particular result could have occurred by
    chance.
  • How do we calculate the probability that an
    alignment occurred by chance?
  • Either with a model of evolution, or
  • Empirically, by scrambling our sequences and
    calculating scores on many randomized sequences.
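
A sketch of the empirical approach: score the real pair, then score many shuffled versions of one sequence and see how often chance does as well. The scoring function, its parameters, the number of trials and the fixed seed are all assumptions made for illustration.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def empirical_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled copy of b scores at least as well as b."""
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        if global_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


seq = "acdefghiklmnpqrstvwy"
observed, p = empirical_p_value(seq, seq)
print(observed)  # 20
print(p)         # small: shuffled copies almost never score this well
```

Shuffling keeps the composition of the sequence while removing any signal due to order, so the shuffled scores approximate what chance alone would produce.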

21
For next week
  • Read Mount, Chapter 3 on pairwise sequence
    alignment.
  • Finish Assignment 1. Start Assignment 2.