Class 2: Basic Sequence Alignment - PowerPoint PPT Presentation

About This Presentation

Slides: 39

Provided by: NirFri

Transcript and Presenter's Notes

1
Class 2: Basic Sequence Alignment
2
Sequence Comparison

Much of bioinformatics involves sequences:
DNA sequences
RNA sequences
Protein sequences
We can think of these sequences as strings of
letters:
DNA and RNA: alphabet of 4 letters
Protein: alphabet of 20 letters

3
Sequence Comparison (cont)

Finding similarity between sequences is important
for many biological questions
For example:
Find genes/proteins with common origin
Allows us to predict function and structure
Locate common subsequences in genes/proteins
Identify common motifs
Locate sequences that might overlap
Help in sequence assembly

4
Sequence Alignment

Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

5
Alignments

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
Perfect matches
Mismatches
Insertions and deletions (indels)

6
Choosing Alignments

There are many possible alignments
For example, compare
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?

7
Scoring Alignments

Rough intuition:
Similar sequences evolved from a common ancestor
Evolution changed the sequences from this
ancestral sequence by mutations:
Replacement: one letter replaced by another
Deletion: deletion of a letter
Insertion: insertion of a letter
Scoring of sequence similarity should examine how
many such operations took place

8
Simple Scoring Rule

Score each position independently:
Match: +1
Mismatch: -1
Indel: -2
The score of an alignment is the sum of its positional scores
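
The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the constants are made up for this sketch, and the alignment is assumed to be given as two gapped rows of equal length with `-` marking a gap.

```python
MATCH, MISMATCH, INDEL = 1, -1, -2  # the simple rule from the slide


def score_alignment(top: str, bottom: str) -> int:
    """Sum the positional scores of two gapped rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += INDEL      # insertion or deletion
        elif a == b:
            total += MATCH      # perfect match
        else:
            total += MISMATCH   # replacement
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # prints 3
```

Because every column is scored independently, the total does not depend on the order in which the columns are visited.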

9
Example

Example:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25

10
More General Scores

The choice of the +1, -1, and -2 scores was quite
arbitrary
Depending on the context, some changes are more
plausible than others:
Exchange of an amino acid by one with similar
properties (size, charge, etc.)
vs.
Exchange of an amino acid by one with opposite
properties

11
Additive Scoring Rules

We define a scoring function by specifying a
function $\sigma$:
$\sigma(x, y)$ is the score of replacing x by y
$\sigma(x, -)$ is the score of deleting x
$\sigma(-, x)$ is the score of inserting x
The score of an alignment is the sum of position
scores
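
As a rough sketch, an additive rule can be coded by passing the scoring function in as a parameter. The names `simple_sigma`, `cheap_gap_sigma` and `alignment_score` are hypothetical, and the second function uses an arbitrary gap score of -1 chosen only to show that the scorer is interchangeable.

```python
from typing import Callable

GAP = "-"


def simple_sigma(x: str, y: str) -> int:
    """The +1 / -1 / -2 rule written as a scoring function."""
    if x == GAP or y == GAP:
        return -2               # deleting or inserting a letter
    return 1 if x == y else -1  # keeping or replacing a letter


def cheap_gap_sigma(x: str, y: str) -> int:
    """Hypothetical variant (an assumption for illustration): gaps cost -1."""
    if x == GAP or y == GAP:
        return -1
    return 1 if x == y else -1


def alignment_score(top: str, bottom: str,
                    sigma: Callable[[str, str], int]) -> int:
    """Additive score: the sum of sigma over the alignment columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(sigma(x, y) for x, y in zip(top, bottom))


print(alignment_score("AAAC", "AG-C", simple_sigma))
print(alignment_score("AAAC", "AG-C", cheap_gap_sigma))
```

Keeping the scoring function separate from the summation means a more realistic, context-dependent score can be swapped in without touching the rest of the code.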

12
Edit Distance

The edit distance between two sequences is the
cost of the cheapest set of edit operations
needed to transform one sequence into the other
Computing the edit distance between two sequences
is almost equivalent to finding the alignment that
minimizes the distance

13
Computing Edit Distance

How can we compute the edit distance?
If $|s| = n$ and $|t| = m$, there are more than
$\binom{m+n}{m}$ alignments
The additive form of the score allows us to use
dynamic programming to compute the edit distance
efficiently
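
To get a feel for how quickly the number of alignments grows, here is a small counting sketch. It assumes that two alignments are different whenever their column sequences differ (so the order of adjacent insert and delete columns matters). The function name `count_alignments` is made up for this example. Under that assumption, alignments built from gap columns alone already number `comb(n + m, n)`.

```python
from math import comb


def count_alignments(n: int, m: int) -> int:
    """Count alignments of two sequences of lengths n and m.

    Every alignment ends in one of three column types (pair, deletion,
    insertion), so D(i, j) = D(i-1, j-1) + D(i-1, j) + D(i, j-1),
    with D(i, 0) = D(0, j) = 1.
    """
    prev = [1] * (m + 1)          # row for i = 0
    for _ in range(n):
        cur = [1]                 # D(i, 0) = 1
        for j in range(1, m + 1):
            cur.append(prev[j - 1] + prev[j] + cur[j - 1])
        prev = cur
    return prev[m]


for size in (2, 5, 10, 20):
    print(size, comb(2 * size, size), count_alignments(size, size))
```

Both columns of the printout grow exponentially with the sequence length, which is why enumerating alignments is not an option.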

14
Recursive Argument

Suppose we have two sequences: $s[1..i+1]$ and
$t[1..j+1]$
The best alignment must be in one of three cases:
1. Last position is $(s[i+1], t[j+1])$
2. Last position is $(s[i+1], -)$
3. Last position is $(-, t[j+1])$

17
Recursive Argument

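Define the notation
$$V[i, j] = d(s[1..i], t[1..j])$$
Using the recursive argument, we get the
following recurrence for V:
$$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$$

18
Recursive Argument

Of course, we also need to handle the base cases
in the recursion:
$$V[0, 0] = 0, \quad V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -), \quad V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$$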
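
The recursive argument and its base cases translate almost literally into a memoized function. The sketch below assumes the simple +1 / -1 / -2 rule as the scoring function. The names `sigma` and `best_score` are chosen for illustration, and Python's 0-based indexing means the i-th letter of s is `s[i - 1]`.

```python
from functools import lru_cache


def sigma(x: str, y: str) -> int:
    """Simple rule: match +1, mismatch -1, indel -2 ('-' marks a gap)."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def best_score(s: str, t: str) -> int:
    """Score of the best global alignment of s and t, by direct recursion."""

    @lru_cache(maxsize=None)
    def V(i: int, j: int) -> int:
        # V(i, j) is the best score for the prefixes s[1..i] and t[1..j].
        if i == 0 and j == 0:
            return 0
        if j == 0:                      # only deletions remain
            return V(i - 1, 0) + sigma(s[i - 1], "-")
        if i == 0:                      # only insertions remain
            return V(0, j - 1) + sigma("-", t[j - 1])
        return max(
            V(i - 1, j - 1) + sigma(s[i - 1], t[j - 1]),  # case 1
            V(i - 1, j) + sigma(s[i - 1], "-"),           # case 2
            V(i, j - 1) + sigma("-", t[j - 1]),           # case 3
        )

    return V(len(s), len(t))


print(best_score("AAAC", "AGC"))
```

Memoization is what keeps this from recomputing the same prefix pair many times. For long sequences the recursion depth grows with n + m, so an iterative table fill is usually the safer choice.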

19
Dynamic Programming Algorithm
We fill the matrix using the recurrence rule
20
Dynamic Programming Algorithm
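
A minimal sketch of the matrix fill, assuming the simple +1 / -1 / -2 scores as defaults. The function name `fill_matrix` and its parameters are illustrative, not taken from the slides.

```python
def fill_matrix(s, t, match=1, mismatch=-1, indel=-2):
    """Return the (len(s)+1) x (len(t)+1) table of best prefix scores."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: aligning a prefix against the empty string.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Recurrence: each cell depends on its upper-left, upper and left
    # neighbours, so a row-by-row sweep always finds them already filled.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
    return V


for row in fill_matrix("AAAC", "AGC"):
    print(row)
```

The bottom-right cell holds the score of the best alignment of the two full sequences.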
21
Reconstructing the Best Alignment

To reconstruct the best alignment, we record
which case in the recursive rule maximized the
score

22
Reconstructing the Best Alignment

We now trace back the path that corresponds to the
best alignment

AAAC
AG-C
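
One way to sketch the bookkeeping: store, for every cell, which of the three cases produced the maximum, then walk those pointers back from the bottom-right corner. The names `align` and `back` are made up here, and the simple +1 / -1 / -2 scores are assumed. Ties are broken in favour of the diagonal move, so the returned alignment may be a different one with the same best score.

```python
def align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, gapped_s, gapped_t) for one best global alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0], back[i][0] = V[i - 1][0] + indel, "up"
    for j in range(1, m + 1):
        V[0][j], back[0][j] = V[0][j - 1] + indel, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            options = [(V[i - 1][j - 1] + pair, "diag"),
                       (V[i - 1][j] + indel, "up"),
                       (V[i][j - 1] + indel, "left")]
            # max() keeps the first maximal option, i.e. prefers "diag".
            V[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from (n, m) to (0, 0), collecting columns in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = align("AAAC", "AGC")
print(score)
print(row_s)
print(row_t)
```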
23
Reconstructing the Best Alignment

Sometimes, more than one alignment has the best
score

AAAC
A-GC
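
When several cases tie for the maximum, every tied case leads to an equally good alignment. The sketch below enumerates all of them by following every tied move during the traceback. The function name `all_best_alignments` is hypothetical and the simple +1 / -1 / -2 scores are assumed. The number of tied alignments can grow exponentially, so this is meant for short sequences only.

```python
def all_best_alignments(s, t, match=1, mismatch=-1, indel=-2):
    """Return (best_score, list of (gapped_s, gapped_t)) for all optimal alignments."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + pair,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    results = []

    def walk(i, j, top, bottom):
        # top/bottom hold the columns found so far, in reverse order.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            pair = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + pair:
                walk(i - 1, j - 1, top + s[i - 1], bottom + t[j - 1])
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            walk(i - 1, j, top + s[i - 1], bottom + "-")
        if j > 0 and V[i][j] == V[i][j - 1] + indel:
            walk(i, j - 1, top + "-", bottom + t[j - 1])

    walk(n, m, "", "")
    return V[n][m], results


score, alignments = all_best_alignments("AAAC", "AGC")
print(score)
for row_s, row_t in alignments:
    print(row_s, row_t)
```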
24
Complexity

Space: O(mn)
Time: O(mn)
Filling the matrix: O(mn)
Backtrace: O(m+n)

25
Space Complexity

In real-life applications, n and m can be very
large
The space requirement of O(mn) can be too
demanding
If m = n = 1000, we need 1MB of space
If m = n = 10000, we need 100MB of space
We can afford to perform extra computation to
save space
Looping over a million operations takes less than
a second on modern workstations
Can we trade off space with time?

26
Why Do We Need So Much Space?

To find d(s,t), we need only O(n) space
We need to compute $V[i, m]$
We can fill in V column by column, storing only
two columns in memory

Note, however:
This trick fails when we need to reconstruct
the alignment
Traceback information eats up all the memory
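
A sketch of the two-column idea, assuming the simple +1 / -1 / -2 scores. The function name `best_score_two_columns` is made up for illustration. Only the previous and the current column of V are kept, so memory is proportional to the length of s, and no traceback information is stored.

```python
def best_score_two_columns(s, t, match=1, mismatch=-1, indel=-2):
    """Score of the best global alignment using O(len(s)) memory."""
    n = len(s)
    # Column j = 0: the prefix s[1..i] aligned against the empty string.
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, len(t) + 1):
        cur = [prev[0] + indel]                      # V[0][j]
        for i in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[i - 1] + pair,       # V[i-1][j-1]
                           cur[i - 1] + indel,       # V[i-1][j]
                           prev[i] + indel))         # V[i][j-1]
        prev = cur                                   # drop the older column
    return prev[n]


print(best_score_two_columns("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```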

27
Space Efficient Version Outline

Idea: perform divide and conquer
Find the position (n/2, j) at which the best
alignment crosses the midpoint of s

28
Finding the Midpoint

Suppose $s[1..n]$ and $t[1..m]$ are given
We can write the score of the best alignment that
goes through $(n/2, j)$ as
$d(s[1..n/2], t[1..j]) + d(s[n/2+1..n], t[j+1..m])$
Thus, we need to compute these two quantities for
all values of j

29
Finding the Midpoint (cont)

Define
$F[i, j] = d(s[1..i], t[1..j])$
$B[i, j] = d(s[i+1..n], t[j+1..m])$
$F[i, j] + B[i, j]$ = score of the best alignment through
$(i, j)$
We compute $F[i, j]$ as we did before
We compute $B[i, j]$ in exactly the same manner,
going backward from $B[n, m]$
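
Putting the pieces together, the divide-and-conquer scheme might look like the sketch below. It is an illustration under stated assumptions, not the slides' own code. The names `last_row`, `align_single` and `linear_space_align` are invented, the scores are fixed to the simple +1 / -1 / -2 rule, and the single-letter base case relies on that rule (pairing a letter is never worse than two indels). The backward quantity B is obtained by running the forward computation on the reversed strings.

```python
MATCH, MISMATCH, INDEL = 1, -1, -2


def last_row(s, t):
    """Best scores of s against every prefix t[:j], for j = 0..len(t), keeping two rows."""
    prev = [j * INDEL for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [prev[0] + INDEL]
        for j in range(1, len(t) + 1):
            pair = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + pair, prev[j] + INDEL, cur[j - 1] + INDEL))
        prev = cur
    return prev


def align_single(c, t):
    """Align one letter c against a non-empty t (valid for the scores above)."""
    k = t.find(c)
    if k < 0:
        k = 0  # no match available: a mismatch (-1) still beats two indels (-4)
    return "-" * k + c + "-" * (len(t) - k - 1), t


def linear_space_align(s, t):
    """One best global alignment of s and t as a pair of gapped strings."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        return align_single(s, t)
    mid, m = len(s) // 2, len(t)
    forward = last_row(s[:mid], t)               # F[mid, j] for all j
    backward = last_row(s[mid:][::-1], t[::-1])  # B[mid, j] is backward[m - j]
    j = max(range(m + 1), key=lambda col: forward[col] + backward[m - col])
    left = linear_space_align(s[:mid], t[:j])
    right = linear_space_align(s[mid:], t[j:])
    return left[0] + right[0], left[1] + right[1]


row_s, row_t = linear_space_align("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA")
print(row_s)
print(row_t)
```

Each call keeps only a few score rows in memory and then recurses on the two halves, which is where the trade of extra time for less space comes from.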

30
Time Complexity Analysis

Finding the midpoint: $cmn$ ($c$ is a constant)
Recursive subproblems of sizes $(n/2, j)$ and
$(n/2, m-j-1)$
$T(m,n) = cmn + T(j, n/2) + T(m-j-1, n/2)$
Lemma: $T(m,n) \le 2cmn$
Time complexity is linear in the size of the problem
At worst, twice the time of the regular solution.
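
The lemma can be checked by induction. The short derivation below is a sketch that assumes the bound already holds for both smaller subproblems (and for the trivial base cases), then substitutes it into the recurrence:

```latex
\begin{align*}
T(m,n) &= cmn + T(j, n/2) + T(m-j-1, n/2) \\
       &\le cmn + 2c\,j\,\frac{n}{2} + 2c\,(m-j-1)\,\frac{n}{2} \\
       &= cmn + c\,(m-1)\,n \\
       &\le 2cmn .
\end{align*}
```

The key point is that the two subproblems together cover at most half of the original matrix, whatever the value of j.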

31
Local Alignment

Consider now a different question
Can we find similar substrings of s and t?
Formally, given $s[1..n]$ and $t[1..m]$, find i, j, k,
and l such that $d(s[i..j], t[k..l])$ is maximal

32
Local Alignment

As before, we use dynamic programming
We now want to set $V[i, j]$ to record the score of the best
alignment of a suffix of $s[1..i]$ and a suffix of
$t[1..j]$
How should we change the recurrence rule?

33
Local Alignment

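New option:
We can start a new match instead of extending a
previous alignment

$$V[i+1, j+1] = \max \begin{cases} 0 & \text{(alignment of empty suffixes)} \\ V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$$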
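
A sketch of local alignment with the extra "start a new match" option, assuming the simple +1 / -1 / -2 scores and a first row and column of zeros (empty suffixes). This dynamic program is commonly known as the Smith-Waterman algorithm. The function name `local_align` is chosen for illustration. The traceback starts at the highest-scoring cell and stops at the first cell whose value is 0.

```python
def local_align(s, t, match=1, mismatch=-1, indel=-2):
    """Return (score, gapped_substring_of_s, gapped_substring_of_t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                        # start a new match here
                          V[i - 1][j - 1] + pair,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_i, best_j = V[i][j], i, j
    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_i, best_j
    while V[i][j] > 0:
        pair = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + pair:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("TAATA", "TACTAA"))
```

Because the best cell can be anywhere in the matrix, the whole table has to be scanned for the maximum rather than reading only the bottom-right corner.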
34
Local Alignment Example
s = TAATA, t = TACTAA
38
Sequence Alignment