Class 2: Basic Sequence Alignment - PowerPoint PPT Presentation

About This Presentation
Slides: 39
Provided by: NirFri

Transcript and Presenter's Notes

1
Class 2: Basic Sequence Alignment
2
Sequence Comparison
  • Much of bioinformatics involves sequences
  • DNA sequences
  • RNA sequences
  • Protein sequences
  • We can think of these sequences as strings of
    letters
  • DNA and RNA: alphabet of 4 letters
  • Protein: alphabet of 20 letters

3
Sequence Comparison (cont)
  • Finding similarity between sequences is important
    for many biological questions
  • For example
  • Find genes/proteins with common origin
  • Allows us to predict function and structure
  • Locate common subsequences in genes/proteins
  • Identify common motifs
  • Locate sequences that might overlap
  • Help in sequence assembly

4
Sequence Alignment
  • Input: two sequences over the same alphabet
  • Output: an alignment of the two sequences
  • Example:
  • GCGCATGGATTGAGCGA
  • TGCGCCATTGATGACCA
  • A possible alignment:
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A

5
Alignments
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Three elements:
  • Perfect matches
  • Mismatches
  • Insertions and deletions (indels)

6
Choosing Alignments
  • There are many possible alignments
  • For example, compare
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • to
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Which one is better?

7
Scoring Alignments
  • Rough intuition
  • Similar sequences evolved from a common ancestor
  • Evolution changed the sequences from this
    ancestral sequence by mutations
  • Replacement: one letter is replaced by another
  • Deletion: a letter is deleted
  • Insertion: a letter is inserted
  • Scoring of sequence similarity should examine how
    many operations took place

8
Simple Scoring Rule
  • Score each position independently
  • Match: +1
  • Mismatch: -1
  • Indel: -2
  • Score of an alignment is sum of positional scores

9
Example
  • Example
  • -GCGC-ATGGATTGAGCGA
  • TGCGCCATTGAT-GACC-A
  • Score = (+1 x 13) + (-1 x 2) + (-2 x 4) = 3
  • ------GCGCATGGATTGAGCGA
  • TGCGCC----ATTGATGACCA--
  • Score = (+1 x 5) + (-1 x 6) + (-2 x 12) = -25
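
The positional scoring above can be checked mechanically. The following is a minimal sketch, assuming the simple rule from the previous slide (match +1, mismatch -1, indel -2); the function name `score_alignment` is made up for illustration.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the positional scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel       # a letter facing a gap: insertion or deletion
        elif a == b:
            total += match       # identical letters
        else:
            total += mismatch    # two different letters
    return total


# First alignment from the example slide.
print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # prints 3
```

Because each column is scored independently, the same function works for any pair of gapped strings of equal length.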

10
More General Scores
  • The choice of the scores +1, -1, and -2 was quite
    arbitrary
  • Depending on the context, some changes are more
    plausible than others
  • Exchange of an amino-acid by one with similar
    properties (size, charge, etc.)
  • vs.
  • Exchange of an amino-acid by one with opposite
    properties

11
Additive Scoring Rules
  • We define a scoring function by specifying a
    function
  • σ(x,y) is the score of replacing x by y
  • σ(x,-) is the score of deleting x
  • σ(-,x) is the score of inserting x
  • The score of an alignment is the sum of position
    scores
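
As a rough illustration of an additive rule with a pluggable scoring function, the sketch below passes the function in as a parameter. The function `sigma` and all of its numeric values are hypothetical, chosen only to show that some replacements can cost less than others; they do not come from the slides.

```python
def sigma(x, y):
    """Hypothetical DNA scoring function; the numbers are illustrative only."""
    cheaper_swaps = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    if x == "-" or y == "-":
        return -2.0              # score of an insertion or a deletion
    if x == y:
        return 1.0               # score of keeping a letter
    if (x, y) in cheaper_swaps:
        return -0.5              # assumed "more plausible" replacement
    return -1.0                  # any other replacement


def additive_score(top, bottom, score_fn):
    """Score of an alignment = sum of the scores of its positions."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    return sum(score_fn(a, b) for a, b in zip(top, bottom))


print(additive_score("AAAC", "AG-C", sigma))  # prints -0.5
```

Swapping in a different `score_fn` changes the scores without touching the summation, which is the property the dynamic programming approach relies on.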

12
Edit Distance
  • The edit distance between two sequences is the
    cost of the cheapest set of edit operations
    needed to transform one sequence into the other
  • Computing the edit distance between two sequences
    is almost equivalent to finding the alignment that
    minimizes the distance

13
Computing Edit Distance
  • How can we compute the edit distance?
  • If |s| = n and |t| = m, the number of possible
    alignments grows exponentially with n and m
  • The additive form of the score allows us to use
    dynamic programming to compute the edit distance
    efficiently
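
To get a feel for why enumerating alignments is hopeless, one can count them. The sketch below assumes that two alignments are different whenever their sequences of columns differ; under that convention the last column is a pair of letters, a letter over a gap, or a gap over a letter, which gives a three-way recurrence. The name `count_alignments` is made up for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of alignments of a length-n sequence with a length-m sequence."""
    if n == 0 or m == 0:
        return 1                              # only gaps remain: one way
    return (count_alignments(n - 1, m - 1)    # last column pairs two letters
            + count_alignments(n - 1, m)      # last column is (letter, -)
            + count_alignments(n, m - 1))     # last column is (-, letter)


for k in (1, 2, 3, 10, 17):
    print(k, count_alignments(k, k))
```

The first three lines printed are `1 3`, `2 13` and `3 63`; the counts for larger lengths grow far too quickly for brute-force enumeration.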

14
Recursive Argument
  • Suppose we have two sequences s[1..i+1] and
    t[1..j+1]
  • The best alignment must be in one of three cases:
  • 1. Last position is (s[i+1], t[j+1])
  • 2. Last position is (s[i+1], -)
  • 3. Last position is (-, t[j+1])

17
Recursive Argument
  • Define the notation
  • Using the recursive argument, we get the
    following recurrence for V
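
The notation and the recurrence themselves are not spelled out in the text. A formulation consistent with the three cases of the recursive argument and with the d(·,·) notation used on later slides might read as follows; writing the scoring function of the Additive Scoring Rules slide as σ is an assumption.

```latex
\[
V[i,j] = d\bigl(s[1..i],\, t[1..j]\bigr)
\]
\[
V[i+1,j+1] = \max
\begin{cases}
V[i,j]   + \sigma\bigl(s[i+1],\, t[j+1]\bigr) & \text{(case 1)}\\
V[i,j+1] + \sigma\bigl(s[i+1],\, -\bigr)      & \text{(case 2)}\\
V[i+1,j] + \sigma\bigl(-,\, t[j+1]\bigr)      & \text{(case 3)}
\end{cases}
\]
```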

18
Recursive Argument
  • Of course, we also need to handle the base cases
    in the recursion
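
The base cases are not written out in the text either. Under the same assumed notation (V[i,j] for the best score of aligning s[1..i] with t[1..j], and σ for the scoring function), aligning a prefix against the empty sequence can use only indels, which suggests:

```latex
\[
V[0,0] = 0, \qquad
V[i+1,0] = V[i,0] + \sigma\bigl(s[i+1],\, -\bigr), \qquad
V[0,j+1] = V[0,j] + \sigma\bigl(-,\, t[j+1]\bigr)
\]
```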

19
Dynamic Programming Algorithm
We fill the matrix using the recurrence rule
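
Filling the matrix can be sketched in a few lines. This is a minimal version, assuming the simple scoring rule (match +1, mismatch -1, indel -2) and a matrix V in which V[i][j] holds the best score for the first i letters of s and the first j letters of t; the function name is made up for illustration.

```python
def fill_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Return the full score matrix V for globally aligning s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned against the empty sequence is all indels.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Recurrence: best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # letter over letter
                          V[i - 1][j] + indel,     # letter of s over a gap
                          V[i][j - 1] + indel)     # gap over a letter of t
    return V


V = fill_matrix("AAAC", "AGC")
print(V[-1][-1])  # prints -1, the best global score for these two strings
```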
21
Reconstructing the Best Alignment
  • To reconstruct the best alignment, we record
    which case in the recursive rule maximized the
    score

22
Reconstructing the Best Alignment
  • We now trace back the path that corresponds to the
    best alignment

AAAC
AG-C
23
Reconstructing the Best Alignment
  • Sometimes, more than one alignment has the best
    score

AAAC
A-GC
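
Recording the maximizing case and tracing back can be sketched as follows. This assumes the simple scoring rule (match +1, mismatch -1, indel -2); the function name `align` and the pointer labels are made up for illustration. When several cases tie, this sketch keeps the first one it tries, so it returns just one of the equally good alignments.

```python
def align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, gapped_s, gapped_t) for one best global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0], back[i][0] = V[i - 1][0] + indel, "up"
    for j in range(1, m + 1):
        V[0][j], back[0][j] = V[0][j - 1] + indel, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cases = [(V[i - 1][j - 1] + sub, "diag"),
                     (V[i - 1][j] + indel, "up"),
                     (V[i][j - 1] + indel, "left")]
            # Record which case maximized the score (first one on ties).
            V[i][j], back[i][j] = max(cases, key=lambda c: c[0])
    # Follow the recorded cases from the bottom-right corner to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = align("AAAC", "AGC")
print(score)
print(a)
print(b)
```

A different tie-breaking order would return a different alignment with the same score, which is the situation the slide describes.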
24
Complexity
  • Space: O(mn)
  • Time: O(mn)
  • Filling the matrix: O(mn)
  • Backtrace: O(m+n)

25
Space Complexity
  • In real-life applications, n and m can be very
    large
  • The space requirements of O(mn) can be too
    demanding
  • If m = n = 1000, we need 1MB of space
  • If m = n = 10000, we need 100MB of space
  • We can afford to perform extra computation to
    save space
  • Looping over a million operations takes less than
    a second on modern workstations
  • Can we trade off space with time?

26
Why Do We Need So Much Space?
  • To find d(s,t), we need only O(n) space
  • Need to compute V[i,m]
  • Can fill in V column by column, storing only
    two columns in memory
  • Note, however:
  • This trick fails when we need to reconstruct
    the alignment
  • Trace-back information eats up all the memory
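
The two-column trick can be sketched as below. It assumes the simple scoring rule (match +1, mismatch -1, indel -2) and returns only the score, not the alignment; the function name is made up for illustration.

```python
def score_two_columns(s, t, match=1, mismatch=-1, indel=-2):
    """Best global alignment score using O(len(s)) memory."""
    n = len(s)
    # prev holds column j-1 of V: prev[i] is the score for s[:i] and t[:j-1].
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, len(t) + 1):
        cur = [j * indel] + [0] * n          # cur[0]: empty prefix of s
        for i in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[i] = max(prev[i - 1] + sub,  # letter over letter
                         cur[i - 1] + indel, # letter of s over a gap
                         prev[i] + indel)    # gap over a letter of t
        prev = cur                           # the older column is discarded
    return prev[n]


print(score_two_columns("AAAC", "AGC"))  # prints -1
```

Since each column is thrown away once the next one is complete, nothing remains to trace back through, which is why this version cannot recover the alignment on its own.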

27
Space Efficient Version Outline
  • Idea: perform divide and conquer
  • Find the position (n/2, j) at which the best
    alignment crosses the midpoint of s

28
Finding the Midpoint
  • Suppose s[1..n] and t[1..m] are given
  • We can write the score of the best alignment that
    goes through j as
  • d(s[1..n/2], t[1..j]) + d(s[n/2+1..n], t[j+1..m])
  • Thus, we need to compute these two quantities for
    all values of j

29
Finding the Midpoint (cont)
  • Define
  • F[i,j] = d(s[1..i], t[1..j])
  • B[i,j] = d(s[i+1..n], t[j+1..m])
  • F[i,j] + B[i,j] = score of the best alignment
    through (i,j)
  • We compute F[i,j] as we did before
  • We compute B[i,j] in exactly the same manner,
    going backward from B[n,m]
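
The midpoint search can be sketched with two linear-space passes. This assumes the simple scoring rule (match +1, mismatch -1, indel -2), which is symmetric, so the backward scores can be obtained by running the forward computation on the reversed strings. Function names are made up for illustration.

```python
def last_row(s, t, match=1, mismatch=-1, indel=-2):
    """Scores of aligning all of s with every prefix of t, in O(len(t)) memory."""
    prev = [j * indel for j in range(len(t) + 1)]
    for a in s:
        cur = [prev[0] + indel]
        for j in range(1, len(t) + 1):
            sub = match if a == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev


def find_midpoint(s, t):
    """Return (j, score): where a best alignment crosses the middle row of s."""
    mid, m = len(s) // 2, len(t)
    forward = last_row(s[:mid], t)                # F[mid][j] for j = 0..m
    backward = last_row(s[mid:][::-1], t[::-1])   # backward[k] pairs s[mid:] with t[m-k:]
    totals = [forward[j] + backward[m - j] for j in range(m + 1)]
    best_j = max(range(m + 1), key=lambda j: totals[j])
    return best_j, totals[best_j]


print(find_midpoint("AAAC", "AGC"))  # prints (1, -1)
```

The score returned equals the best global score, and the two halves on either side of the crossing point can then be solved recursively in the same way.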

30
Time Complexity Analysis
  • Finding the midpoint: cmn (c is a constant)
  • Recursive sub-problems of sizes (n/2, j) and
    (n/2, m-j-1)
  • T(m,n) = cmn + T(j, n/2) + T(m-j-1, n/2)
  • Lemma: T(m,n) ≤ 2cmn
  • Time complexity is linear in the size of the problem
  • At worst, twice the time of the regular solution
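
One way to see why the lemma holds is induction on n: assume the bound for the two half-size sub-problems and substitute it into the recurrence. The derivation below is a sketch of that step, not taken from the slides.

```latex
\begin{align*}
T(m,n) &= cmn + T(j, n/2) + T(m-j-1, n/2) \\
       &\le cmn + 2c\,j\,\tfrac{n}{2} + 2c\,(m-j-1)\,\tfrac{n}{2} \\
       &= cmn + c\,(m-1)\,n \\
       &\le 2cmn
\end{align*}
```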

31
Local Alignment
  • Consider now a different question
  • Can we find similar substrings of s and t?
  • Formally, given s[1..n] and t[1..m], find i, j, k,
    and l such that d(s[i..j], t[k..l]) is maximal

32
Local Alignment
  • As before, we use dynamic programming
  • We now want to set V[i,j] to record the best
    alignment of a suffix of s[1..i] and a suffix of
    t[1..j]
  • How should we change the recurrence rule?

33
Local Alignment
  • New option:
  • We can start a new match instead of extending a
    previous alignment
  • Alignment of empty suffixes
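
The modified recurrence is not written out in the text. Assuming V[i,j] is the best score over suffixes of s[1..i] and t[1..j], and writing the scoring function as σ, the new option adds a fourth case with value 0 (both suffixes empty), which would give:

```latex
\[
V[i+1,j+1] = \max
\begin{cases}
0 & \text{(start a new match)}\\
V[i,j]   + \sigma\bigl(s[i+1],\, t[j+1]\bigr) \\
V[i,j+1] + \sigma\bigl(s[i+1],\, -\bigr) \\
V[i+1,j] + \sigma\bigl(-,\, t[j+1]\bigr)
\end{cases}
\qquad V[i,0] = V[0,j] = 0
\]
```

The best local score is then the largest entry anywhere in V, not necessarily the bottom-right one.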
34
Local Alignment Example
s = TAATA, t = TACTAA
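
The example can be reproduced with a short sketch. The scores used on the original slides are not given in the text, so this assumes the simple rule from earlier (match +1, mismatch -1, indel -2); the function name is made up for illustration.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, indel=-2):
    """Return (best_score, (i, j)): best local score and where it ends in V."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: empty suffixes align with score 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                      # start a new match here
                          V[i - 1][j - 1] + sub,  # extend with letter over letter
                          V[i - 1][j] + indel,    # extend with a gap in t
                          V[i][j - 1] + indel)    # extend with a gap in s
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)
    return best, best_pos


best, end = local_alignment_score("TAATA", "TACTAA")
print(best)  # prints 3 under the assumed scores
```

Tracing back from the recorded end position until a cell with value 0 is reached would recover the matching substrings themselves.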
38
Sequence Alignment
  • We saw two variants of sequence alignment:
  • Global alignment
  • Local alignment
  • Other variants:
  • Finding the best overlap (exercise)
  • All are based on the same basic idea of dynamic
    programming