Sequence Alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence Alignment

Description:

Sequence Alignment. – PowerPoint PPT presentation

Number of Views:185
Avg rating:3.0/5.0
Slides: 44
Provided by: NirF79
Category:

less

Transcript and Presenter's Notes

Title: Sequence Alignment


1
Sequence Alignment
2
Sequences
  • Much of bioinformatics involves sequences
  • DNA sequences
  • RNA sequences
  • Protein sequences
  • We can think of these sequences as strings of
    letters
  • DNA RNA alphabet of 4 letters
  • Protein alphabet of 20 letters

3
20 Amino Acids
  • Glycine (G, GLY)
  • Alanine (A, ALA)
  • Valine (V, VAL)
  • Leucine (L, LEU)
  • Isoleucine (I, ILE)
  • Phenylalanine (F, PHE)
  • Proline (P, PRO)
  • Serine (S, SER)
  • Threonine (T, THR)
  • Cysteine (C, CYS)
  • Methionine (M, MET)
  • Tryptophan (W, TRP)
  • Tyrosine (T, TYR)
  • Asparagine (N, ASN)
  • Glutamine (Q, GLN)
  • Aspartic acid (D, ASP)
  • Glutamic Acid (E, GLU)
  • Lysine (K, LYS)
  • Arginine (R, ARG)

4
Sequence Comparison
  • Finding similarity between sequences is important
    for many biological questions
  • For example
  • Find similar proteins
  • Allows to predict function structure
  • Locate similar subsequences in DNA
  • Allows to identify (e.g) regulatory elements
  • Locate DNA sequences that might overlap
  • Helps in sequence assembly

5
Sequence Alignment
  • Input two sequences over the same alphabet
  • Output an alignment of the two sequences
  • Example
  • GCGCATGGATTGAGCGA
  • TGCGCCATTGATGACCA
  • A possible alignment
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A

6
Alignments
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Three elements
  • Perfect matches
  • Mismatches
  • Insertions deletions (indel)

7
Choosing Alignments
  • There are many possible alignments
  • For example, compare
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • to
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Which one is better?

8
Scoring Alignments
  • Rough intuition
  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this
    ancestral sequence by mutations
  • Replacements one letter replaced by another
  • Deletion deletion of a letter
  • Insertion insertion of a letter
  • Scoring of sequence similarity should examine how
    many operations took place

9
Simple Scoring Rule
  • Score each position independently
  • Match 1
  • Mismatch -1
  • Indel -2
  • Score of an alignment is sum of positional scores

10
Example
  • Example
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Score (1x13) (-1x2) (-2x4) 3
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Score (1x5) (-1x6) (-2x11) -23

11
More General Scores
  • The choice of 1,-1, and -2 scores was quite
    arbitrary
  • Depending on the context, some changes are more
    plausible than others
  • Exchange of an amino-acid by one with similar
    properties (size, charge, etc.)
  • vs.
  • Exchange of an amino-acid by one with opposite
    properties

12
Additive Scoring Rules
  • We define a scoring function by specifying a
    function
  • ?(x,y) is the score of replacing x by y
  • ?(x,-) is the score of deleting x
  • ?(-,x) is the score of inserting x
  • The score of an alignment is the sum of position
    scores

13
Edit Distance
  • The edit distance between two sequences is the
    cost of the cheapest set of edit operations
    needed to transform one sequence into the other
  • Computing edit distance between two sequences
    almost equivalent to finding the alignment that
    minimizes the distance

14
Computing Edit Distance
  • How can we compute the edit distance??
  • If s n and t m, there are more than
    alignments
  • The additive form of the score allows to perform
    dynamic programming to compute edit distance
    efficiently

15
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

16
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

17
Recursive Argument
  • Suppose we have two sequencess1..n1 and
    t1..m1
  • The best alignment must be in one of three cases
  • 1. Last position is (sn1,tm 1 )
  • 2. Last position is (sn 1,-)
  • 3. Last position is (-, tm 1 )

18
Recursive Argument
  • Define the notation
  • Using our recursive argument, we get the
    following recurrence for V

Vi,j Vi1,j
Vi,j1 Vi1,j1
19
Recursive Argument
  • Of course, we also need to handle the base cases
    in the recursion

AA - -
versus
We fill the matrix using the recurrence rule
20
Dynamic Programming Algorithm
We continue to fill the matrix using the
recurrence rule
21
Dynamic Programming Algorithm
V0,0 V0,1
V1,0 V1,1
versus
22
Dynamic Programming Algorithm
23
Dynamic Programming Algorithm
24
Reconstructing the Best Alignment
  • To reconstruct the best alignment, we record
    which case(s) in the recursive rule maximized the
    score

25
Reconstructing the Best Alignment
  • We now trace back a path that corresponds to the
    best alignment

26
Reconstructing the Best Alignment
  • Sometimes, more than one alignment has the best
    score

AAAC A-GC
27
Complexity
  • Space O(mn)
  • Time O(mn)
  • Filling the matrix O(mn)
  • Backtrace O(mn)

28
Space Complexity
  • In real-life applications, n and m can be very
    large
  • The space requirements of O(mn) can be too
    demanding
  • If m n 1000 we need 1MB space
  • If m n 10000, we need 100MB space
  • We can afford to perform extra computation to
    save space
  • Looping over million operations takes less than
    seconds on modern workstations
  • Can we trade off space with time?

29
Why Do We Need So Much Space?
  • To compute d(s,t), we only need O(n) space
  • Need to compute Vn,m
  • Can fill in V, column by column, only storing the
    last two columns in memory
  • Note however
  • This trick fails when we need to reconstruct
    the sequence
  • Trace back information eats up all the memory

30
Why Do We Need So Much Space?
  • To find d(s,t), need O(n) space
  • Need to compute Vn,m
  • Can fill in V, column by column, storing only
    two columns in memory
  • Note however
  • This trick fails when we need to reconstruct
    the sequence
  • Trace back information eats up all the memory

31
Space Efficient Version Outline
Input Sequences s1,n and t1,m to be
aligned. Idea perform divide and conquer
  • Find position (n/2, j) at which the best
    alignment crosses a midpoint

32
Finding the Midpoint
  • The score of the best alignment that goes through
    j equals d(s1,n/2,t1,j)
    d(sn/21,n,tj1,m)

33
Finding the Midpoint (Algorithm)
  • Define
  • Vi,j d(s1..i,t1..j) (As
    before)
  • Bi,j d(si1..n,tj1..m)
    (Symmetrically)
  • Fi,j Bi,j score of best alignment through
    (i,j)

34
Computing Vi,j
Vi,j d(s1..i,t1..j)
As before
Vi,j Vi1,j
Vi,j1 Vi1,j1
Requires linear space complexity
35
Computing Bi,j
Bi,j d(si1..n,tj1..m)
Symmetrically (replacing i with i1, and j with
j1)
Bi,j Bi1,j
Bi,j1 Bi1,j1
Requires linear space complexity
36
Local Alignment
  • Consider now a different question
  • Can we find similar substring of s and t
  • Formally, given s1..n and t1..m find i,j,k,
    and l such that d(si..j,tk..l) is maximal

37
Local Alignment
  • As before, we use dynamic programming
  • We now want to setVi,j to record the best
    alignment of a suffix of s1..i and a suffix of
    t1..j
  • How should we change the recurrence rule?

38
Local Alignment
  • New option
  • We can start a new match instead of extend
    previous alignment

Alignment of empty suffixes
39
Local Alignment Example
s TAATA t ATCTAA
40
Local Alignment Example
s TAATA t TACTAA
41
Local Alignment Example
s TAATA t TACTAA
42
Local Alignment Example
s TAATA t TACTAA
43
Sequence Alignment
  • We seen two variants of sequence alignment
  • Global alignment
  • Local alignment
  • Other variants
  • Finding best overlap (exercise)
  • All are based on the same basic idea of dynamic
    programming
Write a Comment
User Comments (0)
About PowerShow.com