Sequence Alignment

Sequences

- Much of bioinformatics involves sequences
- DNA sequences
- RNA sequences
- Protein sequences
- We can think of these sequences as strings of

letters - DNA RNA alphabet of 4 letters
- Protein alphabet of 20 letters

20 Amino Acids

- Glycine (G, GLY)
- Alanine (A, ALA)
- Valine (V, VAL)
- Leucine (L, LEU)
- Isoleucine (I, ILE)
- Phenylalanine (F, PHE)
- Proline (P, PRO)
- Serine (S, SER)
- Threonine (T, THR)
- Cysteine (C, CYS)
- Methionine (M, MET)
- Tryptophan (W, TRP)
- Tyrosine (T, TYR)
- Asparagine (N, ASN)
- Glutamine (Q, GLN)
- Aspartic acid (D, ASP)
- Glutamic Acid (E, GLU)
- Lysine (K, LYS)
- Arginine (R, ARG)

Sequence Comparison

- Finding similarity between sequences is important

for many biological questions - For example
- Find similar proteins
- Allows to predict function structure
- Locate similar subsequences in DNA
- Allows to identify (e.g) regulatory elements
- Locate DNA sequences that might overlap
- Helps in sequence assembly

Sequence Alignment

- Input two sequences over the same alphabet
- Output an alignment of the two sequences
- Example
- GCGCATGGATTGAGCGA
- TGCGCCATTGATGACCA
- A possible alignment
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A

Alignments

- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
- Three elements
- Perfect matches
- Mismatches
- Insertions deletions (indel)

Choosing Alignments

- There are many possible alignments
- For example, compare
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
- to
- ------GCGCATGGATTGAGCGA
- TGCGCC----ATTGATGACCA--
- Which one is better?

Scoring Alignments

- Rough intuition
- Similar sequences evolved from a common ancestor
- Evolution changed the sequences from this

ancestral sequence by mutations - Replacements one letter replaced by another
- Deletion deletion of a letter
- Insertion insertion of a letter
- Scoring of sequence similarity should examine how

many operations took place

Simple Scoring Rule

- Score each position independently
- Match 1
- Mismatch -1
- Indel -2
- Score of an alignment is sum of positional scores

Example

- Example
- -GCGC-ATGGATTGAGCGA
- TGCGCCATTGAT-GACC-A
- Score (1x13) (-1x2) (-2x4) 3
- ------GCGCATGGATTGAGCGA
- TGCGCC----ATTGATGACCA--
- Score (1x5) (-1x6) (-2x11) -23

More General Scores

- The choice of 1,-1, and -2 scores was quite

arbitrary - Depending on the context, some changes are more

plausible than others - Exchange of an amino-acid by one with similar

properties (size, charge, etc.) - vs.
- Exchange of an amino-acid by one with opposite

properties

Additive Scoring Rules

- We define a scoring function by specifying a

function - ?(x,y) is the score of replacing x by y
- ?(x,-) is the score of deleting x
- ?(-,x) is the score of inserting x
- The score of an alignment is the sum of position

scores

Edit Distance

- The edit distance between two sequences is the

cost of the cheapest set of edit operations

needed to transform one sequence into the other - Computing edit distance between two sequences

almost equivalent to finding the alignment that

minimizes the distance

Computing Edit Distance

- How can we compute the edit distance??
- If s n and t m, there are more than

alignments - The additive form of the score allows to perform

dynamic programming to compute edit distance

efficiently

Recursive Argument

- Suppose we have two sequencess1..n1 and

t1..m1 - The best alignment must be in one of three cases
- 1. Last position is (sn1,tm 1 )
- 2. Last position is (sn 1,-)
- 3. Last position is (-, tm 1 )

Recursive Argument

- Suppose we have two sequencess1..n1 and

t1..m1 - The best alignment must be in one of three cases
- 1. Last position is (sn1,tm 1 )
- 2. Last position is (sn 1,-)
- 3. Last position is (-, tm 1 )

Recursive Argument

- Suppose we have two sequencess1..n1 and

t1..m1 - The best alignment must be in one of three cases
- 1. Last position is (sn1,tm 1 )
- 2. Last position is (sn 1,-)
- 3. Last position is (-, tm 1 )

Recursive Argument

- Define the notation
- Using our recursive argument, we get the

following recurrence for V

Vi,j Vi1,j

Vi,j1 Vi1,j1

Recursive Argument

- Of course, we also need to handle the base cases

in the recursion

AA - -

versus

We fill the matrix using the recurrence rule

Dynamic Programming Algorithm

We continue to fill the matrix using the

recurrence rule

Dynamic Programming Algorithm

V0,0 V0,1

V1,0 V1,1

versus

Dynamic Programming Algorithm

Dynamic Programming Algorithm

Reconstructing the Best Alignment

- To reconstruct the best alignment, we record

which case(s) in the recursive rule maximized the

score

Reconstructing the Best Alignment

- We now trace back a path that corresponds to the

best alignment

Reconstructing the Best Alignment

- Sometimes, more than one alignment has the best

score

AAAC A-GC

Complexity

- Space O(mn)
- Time O(mn)
- Filling the matrix O(mn)
- Backtrace O(mn)

Space Complexity

- In real-life applications, n and m can be very

large - The space requirements of O(mn) can be too

demanding - If m n 1000 we need 1MB space
- If m n 10000, we need 100MB space
- We can afford to perform extra computation to

save space - Looping over million operations takes less than

seconds on modern workstations - Can we trade off space with time?

Why Do We Need So Much Space?

- To compute d(s,t), we only need O(n) space
- Need to compute Vn,m
- Can fill in V, column by column, only storing the

last two columns in memory - Note however
- This trick fails when we need to reconstruct

the sequence - Trace back information eats up all the memory

Why Do We Need So Much Space?

- To find d(s,t), need O(n) space
- Need to compute Vn,m
- Can fill in V, column by column, storing only

two columns in memory

- Note however
- This trick fails when we need to reconstruct

the sequence - Trace back information eats up all the memory

Space Efficient Version Outline

Input Sequences s1,n and t1,m to be

aligned. Idea perform divide and conquer

- Find position (n/2, j) at which the best

alignment crosses a midpoint

Finding the Midpoint

- The score of the best alignment that goes through

j equals d(s1,n/2,t1,j)

d(sn/21,n,tj1,m)

Finding the Midpoint (Algorithm)

- Define
- Vi,j d(s1..i,t1..j) (As

before) - Bi,j d(si1..n,tj1..m)

(Symmetrically) - Fi,j Bi,j score of best alignment through

(i,j)

Computing Vi,j

Vi,j d(s1..i,t1..j)

As before

Vi,j Vi1,j

Vi,j1 Vi1,j1

Requires linear space complexity

Computing Bi,j

Bi,j d(si1..n,tj1..m)

Symmetrically (replacing i with i1, and j with

j1)

Bi,j Bi1,j

Bi,j1 Bi1,j1

Requires linear space complexity

Local Alignment

- Consider now a different question
- Can we find similar substring of s and t
- Formally, given s1..n and t1..m find i,j,k,

and l such that d(si..j,tk..l) is maximal

Local Alignment

- As before, we use dynamic programming
- We now want to setVi,j to record the best

alignment of a suffix of s1..i and a suffix of

t1..j - How should we change the recurrence rule?

Local Alignment

- New option
- We can start a new match instead of extend

previous alignment

Alignment of empty suffixes

Local Alignment Example

s TAATA t ATCTAA

Local Alignment Example

s TAATA t TACTAA

Local Alignment Example

s TAATA t TACTAA

Local Alignment Example

s TAATA t TACTAA

Sequence Alignment

- We seen two variants of sequence alignment
- Global alignment
- Local alignment
- Other variants
- Finding best overlap (exercise)
- All are based on the same basic idea of dynamic

programming