Sequence alignment algorithms

Presented by Cary Miller, Sastry Akella, Daisuke Yasuda

Overview

- Biological background / motivation / applications
- Dot matrix / dynamic programming
- FASTA / BLAST

Biology

- Biomolecules are strings from a restricted alphabet
- Length 4: DNA
- Length 20: protein
- Proteins are the working part

Proteins

- A protein is a linear sequence of 20 characters (amino acids)
- Proteins do not maintain linearity
- Folding happens
- Folding determines overall 3-D shape
- Shape determines function

Sequence Structure Function

- Sequence does not reveal structure
- Much less function
- A sequence: ARTUVEDYERRWWUHUK

Structure

- Pic 1
- Pic 2

Structure

Function

- Protein A is a constituent of muscle, skin, cartilage, or
- Protein B catalyzes the transformation of glucose to fructose, or
- How do we find proteins with similar function?

Nature does not solve the same problem twice (usually)

- A short sequence with a specific function (or shape) is called a domain
- The same domain appears in multiple proteins
- If we find the same domain in multiple proteins, that provides a clue to function and/or structure

Amino acids

- Each has the same basic chemical configuration but has a functional group that makes it chemically unique
- They occur in families
- Some functional groups are similar

How biologists study proteins

- Expensive (NMR, x-ray crystallography)
- Discovery of function is difficult
- Few proteins are understood in detail
- Many are known by sequence
- Sequence is easier to get than structure or function

A biological scenario

- Biologist discovers the sequence of a new protein with unknown function
- She has no idea of function
- If the sequence can be associated with a known protein sequence, we have a clue about structure and/or function
- Most proteins have unknown function

Public databases

- Vast quantities of sequence, structure, function info are deposited into public databases
- A new sequence should be compared to the database

Comparing sequences

- Alignment with exact match
- Sequences: ABCTUVAB and UV
- Alignment:
ABCTUVAB
----UV

Alignment with inexact match

- Inexact
- Sequences: GARUIPPRS and TGARVVBUIEEYST
- Alignment:
GAR------UIPPRS
TGARVVBUIEEYST

Global vs. local alignment

- ABQRTASGGBV
- ABRRRASGVBB
- ABQRTASGGBV
- ABQ------SGGBV

A real alignment

- Myoglobin
PDLRKY FKG-A ENFTA DDVQ KSDR
PDTKAY FPKFG DLSTA AALK SSPK
- Homology: common ancestry

Real alignment

- Pic 3

Real alignment

Scoring pairs of amino acids

- For amino acid pairs, assign a score based on frequency of substitution
ATRGUVXQ
ATRCVVXT
ATRGV VEQ
AT-----VVEQ

A substitution matrix

- Pic 4

A substitution matrix

Substitution matrices

- PAM and BLOSUM are standard substitution matrices
- Also include scores for
- Gap opening
- Gap extension

Scoring amino acid strings

- Sum the individual pair scores
- Database is huge
- Spurious match to random sequence is likely
- Try your name
- E-value is the expected number of matches with a given score (or higher) against random sequences
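The scoring rule above (sum the pair scores, with separate gap opening and gap extension scores) can be sketched as follows. This is a minimal illustration, not PAM or BLOSUM: the table `TOY_SCORES` and the values of `GAP_OPEN` and `GAP_EXTEND` are made-up assumptions, and consecutive gap columns are treated as one gap regardless of which row they are in.

```python
# Hypothetical toy substitution scores; real work would use PAM or BLOSUM.
TOY_SCORES = {
    ("A", "A"): 4, ("G", "G"): 6, ("V", "V"): 4,
    ("A", "G"): 0, ("A", "V"): 0, ("G", "V"): -3,
}
GAP_OPEN = -5    # assumed score for the first column of a gap
GAP_EXTEND = -1  # assumed score for each further column of the same gap


def pair_score(a, b):
    """Look up a symmetric pair score; unknown pairs get -1 (an assumption)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))


def alignment_score(row1, row2):
    """Sum pair scores over the columns of an alignment given as two rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


print(alignment_score("AG--V", "AGAAV"))
```

Separating the opening and extension scores makes one long gap cheaper than several short ones, which is the usual reason both appear alongside a substitution matrix.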

Alignment algorithms

- Dot matrix
- Dynamic programming
- FASTA
- BLAST

Dot Matrix and DP

Dot Matrix

- Locating regions of similarity between two DNA or protein sequences provides a great deal of information about the function and structure of the query sequence.
- Similar structure indicates homology, or similar evolution, which provides critical information about the functions of these sequences.

Dot Matrix Contd..

- A dot matrix plot is a method of aligning two sequences to provide a picture of the homology between them.
- The dot matrix plot is created by designating one sequence to be the subject and placing it on the horizontal axis, and designating the second sequence to be the query and placing it on the vertical axis of the matrix.

Dot Matrix Contd..

- At each position within the matrix, a point is plotted if the horizontal and vertical elements are identical.
- Diagonal lines within the resulting matrix indicate regions of similarity. A simple dot matrix plot is shown in Figure A.

Dot Matrix with noise reduction

- A certain percentage of the matches between sequence elements can be expected to be the result of the random nature of their evolution. These random matches are considered "noise" and are filtered out to enhance the diagonal lines.

Dot Matrix

- Noise Reduction
- a) Noise reduction in a dot matrix can be done by centering a substring of elements of the query sequence over each element in the subject sequence and determining the number of corresponding elements within this window.

Dot Matrix

- b) If the number of corresponding elements exceeds a specified threshold, then a point is plotted for the center element. This is demonstrated in Figure B.
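The plot and its window filter can be sketched in a few lines. This is an illustrative version only: the function names and the `window` and `threshold` parameters are assumptions, and the window is simply clipped at the sequence ends.

```python
def dot_matrix(subject, query, window=1, threshold=0):
    """Return rows of booleans: rows follow the query, columns the subject.

    A point is plotted at (i, j) when the number of matching elements in a
    window centered on query[i] and subject[j] exceeds `threshold`.
    With window=1 and threshold=0 this is the plain, unfiltered plot.
    """
    half = window // 2
    rows = []
    for i in range(len(query)):
        row = []
        for j in range(len(subject)):
            matches = 0
            for k in range(-half, half + 1):
                qi, sj = i + k, j + k
                inside = 0 <= qi < len(query) and 0 <= sj < len(subject)
                if inside and query[qi] == subject[sj]:
                    matches += 1
            row.append(matches > threshold)
        rows.append(row)
    return rows


def render(rows):
    """Draw the matrix as text, '*' for a point and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in rows)


print(render(dot_matrix("ABCA", "ABCA")))
print()
print(render(dot_matrix("ABCA", "ABCA", window=3, threshold=1)))
```

The second call keeps the main diagonal but drops the isolated corner matches, which is the noise the filter is meant to remove.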

Dot Matrix (Figure B)

Dot Matrix

- Advantages: Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods.
- Disadvantages: Most dot matrix computer programs do not show an actual alignment. Does not return a score to indicate how optimal a given alignment is.

Dynamic Programming

- Dynamic programming (DP) algorithms are a general class of algorithms typically applied to optimization problems.
- For DP to be applicable, an optimization problem must have two key ingredients:
- a) Optimal substructure: an optimal solution to the problem contains within it optimal solutions to sub-problems.
- b) Overlapping sub-problems: the pieces of the larger problem have a sequential dependency.

Dynamic Programming

- DP solves every sub-sub-problem just once and saves its answer in a table, thereby avoiding the work of recomputing the answer every time the sub-sub-problem is encountered. Each intermediate answer is stored with a score, and DP finally chooses the sequence of solutions that yields the highest score.

Dynamic Programming

- Path Matrix

Dynamic Programming

- Both global and local types of alignments may be made by simple changes in the basic DP algorithm.
- Alignments depend on the choice of a scoring system for comparing character pairs and on penalty scores (e.g. the PAM and BLOSUM matrices covered before)
- Scoring functions, example:
- w(match) = 2 or substitution matrix
- w(mismatch) = -1 or substitution matrix
- w(gap) = -3

Dynamic Programming

- Global Alignment (Needleman-Wunsch)
- a) The general goal is to obtain the optimal global alignment between two sequences, allowing gaps.
- b) We construct a matrix F indexed by i and j, one index for each sequence, where the value F(i,j) is the score of the best alignment between the initial segment x_{1..i} of x up to x_i and the initial segment y_{1..j} of y up to y_j. We begin by initializing F(0,0) = 0. We then proceed to fill the matrix from top left to bottom right. If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, it is possible to calculate F(i,j).

Dynamic Programming

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

- where s(a,b) is the likelihood score that residues a and b occur as an aligned pair, and d is the gap penalty.
- Once you construct the matrix, you trace back the path that leads to F(n,m), which is by definition the best score for an alignment of x_{1..n} to y_{1..m}.
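A compact sketch of the global alignment fill and trace-back follows. It assumes the example scores from the earlier slide (match 2, mismatch -1, gap penalty d = 3) instead of a substitution matrix, and it initializes the first row and column to multiples of -d, which the slides do not spell out. The function name is illustrative.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning a prefix against nothing costs one gap each.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    # Fill from top left to bottom right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Trace back from F(n, m) to F(0, 0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("GATTACA", "GATCA"))
```

When several paths give the same score, this trace-back prefers the diagonal move; other tie-breaking rules return different alignments with the same score.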

Dynamic Programming

- Global Dynamic programming matrix

Dynamic Programming

- Local alignment (Smith-Waterman)
- Two changes from global alignment:
- 1. Possibility of taking the value 0 if all other options have value less than 0. This corresponds to starting a new alignment.
- 2. Alignments can end anywhere in the matrix, so instead of taking the value in the bottom right corner, F(n,m), for the best score, we look for the highest value of F(i,j) over the whole matrix and start the trace-back from there.

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
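The two changes for local alignment can be seen in a short sketch. As an assumption it uses the simple example scores (match 2, mismatch -1, gap penalty d = 3) instead of a substitution matrix; the function name is illustrative.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for the best local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Change 1: the value 0 is allowed, which starts a new alignment.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            # Change 2: remember the highest cell anywhere in the matrix.
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j
    # Trace back from the highest cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXGATTACAYY", "ZZGATTACAWW"))
```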

Dynamic Programming

- Local Dynamic programming matrix

Dynamic Programming

- Advantages: Guaranteed in a mathematical sense to provide the optimal (very best or highest-scoring) alignment for a given set of scoring functions.
- Disadvantages:
- a) Slow due to the very large number of computational steps: O(n^2).
- b) Computer memory requirement also increases as the square of the sequence lengths.
- Therefore, it is difficult to use the method for very long sequences.

FASTA and BLAST

FASTA - Idea -

- Problem of Dynamic Programming
- D.P. computes scores over a large area of the matrix that does not contribute to the optimal alignment
- FASTA focuses on the diagonal area

FASTA - Heuristic -

- Heuristic
- A good local alignment should contain some exactly matching subsequence.
- FASTA focuses on this area

FASTA - High Level Algorithm -

- High level algorithm
- Let q be a query
- max ← 0
- For each sequence s in DB
- compare q with s and compute a score y
- if max < y
- max ← y
- bestSequence ← s
- Return bestSequence
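The database loop above translates almost line by line into code. In this sketch the scoring function is passed in as a parameter; `shared_letters` is a hypothetical stand-in for the real FASTA comparison, used only so the example runs.

```python
def best_match(query, database, score):
    """Return the database sequence with the highest score against the query."""
    best_score = 0
    best_sequence = None
    for s in database:
        y = score(query, s)
        if best_score < y:
            best_score = y
            best_sequence = s
    return best_sequence


def shared_letters(q, s):
    """Hypothetical stand-in score: count of positions with identical letters."""
    return sum(1 for a, b in zip(q, s) if a == b)


print(best_match("GATTACA", ["CCCCCCC", "GATTTCA", "GGGGGGG"], shared_letters))
```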

FASTA - Algorithm -

- Step 1
- Find all hot spots
- // Hot spots are pairs of words of length k that match exactly

(Figure: hot spots between Sequence 1 and Sequence 2)

FASTA - Algorithm -

- Step 1 in detail
- Use a look-up table
- Query: G A A T T C A G T T A
- Sequence: G G A T C G A

(Figure: dot matrix and look-up table)
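One way to picture the look-up table is as a dictionary from each word of length k in the query to the positions where it occurs. The sketch below uses the query and sequence from this slide; the word length k = 2 and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_lookup(query, k):
    """Map every word of length k in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def find_hot_spots(query, sequence, k):
    """Return (query_pos, sequence_pos) pairs where words of length k match."""
    table = build_lookup(query, k)
    hits = []
    for j in range(len(sequence) - k + 1):
        for i in table.get(sequence[j:j + k], []):
            hits.append((i, j))
    return hits


print(find_hot_spots("GAATTCAGTTA", "GGATCGA", k=2))
```

The table is built once from the query, so each database sequence is scanned in a single pass with one dictionary lookup per position.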

FASTA - Algorithm -

- Step 2
- Score the hot spots and locate the ten best diagonal runs.
- // A scoring system such as PAM250 is used
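Hot spots that lie on the same diagonal share the same offset between query position and sequence position, so grouping by that offset finds candidate diagonal runs. The sketch below is a deliberate simplification: it scores a diagonal by its number of hot spots, whereas the slide refers to a scoring system such as PAM250. The function name is illustrative.

```python
from collections import Counter


def best_diagonals(hot_spots, top=10):
    """Return up to `top` (diagonal, hit_count) pairs, best first.

    The diagonal of a hot spot (i, j) is taken to be i - j.
    """
    counts = Counter(i - j for i, j in hot_spots)
    return counts.most_common(top)


hits = [(0, 0), (1, 1), (2, 2), (5, 1), (6, 2), (0, 4)]
print(best_diagonals(hits, top=2))
```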

FASTA - Algorithm -

- Step 3
- Combine sub-alignments into one alignment with gaps

(Figure: sub-alignments joined by a gap into one local alignment)

FASTA - Algorithm -

- Step 4
- Consider a weighted directed graph.
- Let each node be a sub-alignment found in step 1
- Let u and v be nodes
- Edge (u,v) exists if alignment u comes before alignment v in the sequence.
- Each edge has a gap penalty (negative)
- Find the maximum weight path

(Figure: sub-sequences as nodes, connected by edges along one sequence)
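The graph in Step 4 has no cycles, because an edge only goes from an earlier sub-alignment to a later one, so the maximum weight path can be found by processing nodes in order. The sketch below makes several assumptions: each sub-alignment is a tuple `(q_start, q_end, s_start, s_end, score)` with exclusive ends, "before" means ending before the other starts in both sequences, and every edge carries the same hypothetical `join_penalty`.

```python
def max_weight_path(runs, join_penalty=-3):
    """Return (best_total_score, indices_of_runs_on_the_path)."""
    if not runs:
        return 0, []
    # Sorting by query start is a valid processing order for this graph.
    order = sorted(range(len(runs)), key=lambda idx: runs[idx][0])
    best, prev = {}, {}
    for pos, v in enumerate(order):
        q_start, _, s_start, _, score = runs[v]
        best[v], prev[v] = score, None
        for u in order[:pos]:
            _, u_q_end, _, u_s_end, _ = runs[u]
            # Edge (u, v): u ends before v starts in both sequences.
            if u_q_end <= q_start and u_s_end <= s_start:
                candidate = best[u] + join_penalty + score
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u
    # Walk back from the best end node to recover the path.
    node = max(best, key=best.get)
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return best[path[-1]], path


runs = [(0, 5, 0, 5, 10), (7, 12, 8, 13, 8), (6, 9, 1, 4, 4), (14, 18, 15, 19, 6)]
print(max_weight_path(runs))
```

With at most ten sub-alignments the double loop is tiny, which matches the O(r^2) cost discussed in the complexity slides.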

FASTA - Algorithm -

- Step 4 in detail

(Figure: sub-alignments connected by gap edges with weights -5, -3, -3, and the max weight path)

FASTA - Algorithm -

- Step 5
- Use dynamic programming in a restricted area around the best-scoring alignment to find an alignment with a higher score

Width of this band is a parameter
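Restricting dynamic programming to a band means only filling cells whose diagonal lies within a fixed distance of a chosen center diagonal. The sketch below is a simplified global-alignment version: the parameters `center` and `width`, the example scores (match 2, mismatch -1, gap penalty d = 3), and the function name are all assumptions. It still allocates the full matrix for clarity, so only the time is reduced, not the memory.

```python
def banded_global_score(x, y, center=0, width=2, match=2, mismatch=-1, d=3):
    """Best global alignment score using only cells with |(j - i) - center| <= width.

    Returns float('-inf') when the band does not connect the two corners.
    """
    NEG = float("-inf")
    n, m = len(x), len(y)
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        lo = max(0, i + center - width)
        hi = min(m, i + center + width)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                F[0][0] = 0
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, F[i - 1][j] - d)
            if j > 0:
                best = max(best, F[i][j - 1] - d)
            F[i][j] = best
    return F[n][m]


print(banded_global_score("GATTACA", "GATCACA", width=1))
```

Cells outside the band stay at minus infinity, so they can never be chosen by the maximum.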

FASTA - Algorithm -

- Summary of Algorithm
- 1. Find all hot spots
- // Hot spots are pairs of words of length k that match exactly
- 2. Score the hot spots and locate the ten best diagonal runs.
- 3. Combine sub-alignments into one alignment
- 4. Score each alignment with gap penalties and pick the best-scoring alignment
- 5. Use dynamic programming in a restricted area around the best-scoring alignment to find an alignment with a higher score.

FASTA - Complexity -

- Complexity
- Step 1 and 2 // select the best 10 diagonal runs
- Let n be the length of a sequence from the DB
- O(n), because Step 1 only looks words up in the table
- O(n)

FASTA - Complexity -

- Step 3 and 4 // compute the max weight path
- Let r be the number of sub-alignments (r = 10)
- Let s be the number of edges
- O(r^2)
- → 1% of D.P., because r^2 = 10^2 and mn = 10^4

(Figure: sub-alignments n1, n2, n3 with positive weights, gap edges with weights -5, -3, -3, and the max weight path)

FASTA - Complexity -

- Step 5 // compute partial D.P.
- Depends on the restricted area
- Therefore, FASTA is faster than D.P.

Width of this band is a parameter

BLAST - Heuristic -

- Another heuristic algorithm
- Heuristic, but it evaluates the result statistically.
- Homologous sequences are likely to contain a short high-scoring word pair, a hit.
- BLAST tries to extend it on both sides to get the optimal alignment.

(Figure: a short high-scoring word within the sequence A T T A G ...)

BLAST - Algorithm -

- Step 1: preprocessing the query
- Compile the list of short high-scoring words from the query.
- The length of a query word, w, is 3 for BLOSUM scoring
- Threshold T is 13

(Figure: neighborhood words)

BLAST - Algorithm -

- Step 1-2
- Create neighborhood words for each query word

(Figure: a query word and its neighborhood words)
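A neighborhood can be generated by scoring every possible word of the same length against the query word and keeping those that reach the threshold T. The sketch below uses a made-up four-letter alphabet and a toy scoring function so that it stays small; with real proteins the alphabet has 20 letters and the score comes from a matrix such as BLOSUM. All names here are illustrative.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """Return every word of len(word) whose score against `word` is >= threshold."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


def toy_score(a, b):
    """Hypothetical pair score: 5 for identical letters, -1 otherwise."""
    return 5 if a == b else -1


print(neighborhood("ACD", "ACDE", toy_score, threshold=9))
```

With these toy values the neighborhood is the query word itself plus every word that differs from it in exactly one position.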

BLAST - Algorithm -

- Step 2: scanning the DB
- For each word list, identify all exact matches with DB sequences

(Figure: Step 1 builds the neighborhood word list from each query word; Step 2 matches the list against Sequence 1 and Sequence 2 in the DB)

The purpose of Steps 1 and 2 is the same as in FASTA

BLAST - Algorithm -

- Step 2-2
- Method 1: Hash table
- Query: LAALLNKCKTPQGQRLVNQWIKQPLMD

(Figure: hash table built from the word list)

BLAST - Algorithm -

- Step 2-3
- Method 2: Finite automata

(Figure: finite automaton with transitions labeled by residues such as A, G, L, I)

BLAST - Algorithm -

- Step 3 (Search optimal alignment)
- Let S be the score of a hit word
- For each hit word, extend the ungapped alignment in both directions.
- Step 4 (Evaluate the alignment statistically)
- Stop the extension when the E-value (which depends on the score S) becomes less than the threshold. The extended hit word is called a High-scoring Segment Pair (HSP). BLAST returns it.
- E-value: the number of HSPs having score S (or higher) expected to occur only by chance.
- → Smaller E-value: more statistically significant
- → Bigger E-value: more likely by chance

(Figure: a hit word in the sequence A T T A G ...)
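The ungapped extension in Step 3 can be sketched as below. Note the assumption: instead of the E-value test described on the slide, this sketch stops extending in a direction once the running score has fallen more than `x_drop` below the best score seen, and it keeps the best-scoring extent. The simple match and mismatch scores and the function name are also illustrative.

```python
def extend_hit(q, s, qi, si, w, match=2, mismatch=-1, x_drop=3):
    """Extend a word hit of length w at q[qi:], s[si:] without gaps.

    Returns (score, q_start, q_end) with q_end exclusive.
    """
    def sc(a, b):
        return match if a == b else mismatch

    seed = sum(sc(q[qi + k], s[si + k]) for k in range(w))

    # Extend to the right, remembering the best-scoring extent.
    best_right, right, run, k = 0, 0, 0, 0
    while qi + w + k < len(q) and si + w + k < len(s):
        run += sc(q[qi + w + k], s[si + w + k])
        k += 1
        if run > best_right:
            best_right, right = run, k
        elif best_right - run > x_drop:
            break

    # Extend to the left in the same way.
    best_left, left, run, k = 0, 0, 0, 1
    while qi - k >= 0 and si - k >= 0:
        run += sc(q[qi - k], s[si - k])
        if run > best_left:
            best_left, left = run, k
        elif best_left - run > x_drop:
            break
        k += 1

    return seed + best_left + best_right, qi - left, qi + w + right


score, start, end = extend_hit("TTGATTACATT", "CCGATTACACC", 4, 4, 3)
print(score, "TTGATTACATT"[start:end])
```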

BLAST - Algorithm -

- Step 3-2
- Definition of E-value
- The expected number of HSPs with score at least S is

$$E = K n m e^{-\lambda S}$$

- K and λ are constants depending on the model
- n, m are the lengths of the query and the sequence
- The probability of finding at least one such HSP is

$$P = 1 - e^{-E}$$

- → If a word is hit by chance (E-value is bigger), P becomes larger.
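The two formulas are easy to evaluate directly. In the sketch below, the values passed for `K` and `lam` are placeholders chosen only to make the example run; in practice they depend on the scoring model, as the slide notes. The function names are illustrative.

```python
import math


def e_value(score, n, m, K, lam):
    """Expected number of HSPs with score at least `score`: K * n * m * exp(-lam * score)."""
    return K * n * m * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such HSP: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Placeholder constants (assumptions, not values from the slides).
K, lam = 0.1, 0.3
for S in (10, 30, 50):
    E = e_value(S, n=153, m=300, K=K, lam=lam)
    print(S, E, p_value(E))
```

Higher scores give smaller E-values, and for small E the probability P is approximately equal to E.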

BLAST - Running Time -

- Running Time
- Length of query: 153
- DB size: 5997 sequences
- PC: Pentium 4
- By Dr. Takeshi Kawabata
- Nara Sentan Gijyutu University

Comparison of Algorithm

- Dynamic Programming
- 1. Most sensitive result
- → D.P. uses all the information in the two sequences
- 2. Running time is slow
- → D.P. computes areas of the matrix that do not contribute to the optimal alignment.

Comparison of Algorithm

- FASTA
- 1. Less sensitive than D.P. and BLAST
- → FASTA uses partial information to speed up the computation.
- → FASTA does not evaluate the result statistically.
- 2. Running time is faster than D.P.
- → For the same reason as above.

Comparison of Algorithms

- BLAST
- 1. More sensitive than FASTA
- → BLAST evaluates the result statistically.
- 2. Faster than FASTA
- → Because BLAST evaluates the entire DB with the same threshold based on statistics. BLAST eliminates noise and reduces the running time.

FASTA vs BLAST

- BLAST
- Compares the query with the sequences in the DB using the same threshold.
- FASTA
- Compares the query with one sequence at a time
- and then compares the results.

(Figure: the query compared against the DB)

Conclusion