Loading...

PPT – Sequence Comparison PowerPoint presentation | free to view - id: 85261-ZDc1Z

The Adobe Flash plugin is needed to view this content

Sequence Comparison

MA5232 Lecture Notes

- LX Zhang
- Department of Mathematics
- National University of Singapore
- matzlx_at_nus.edu.sg

See what everyone else has seen, but think them

mathematically.

- Reference Books
- K.-M. Chao and L.X. Zhang.
- Sequence Comparison Theory and Methods,

Springer - 2. R.C. Deonier, S. Tavare, and M.S. Waterman,
- Computational Genome Analysis, Springer.
- 3. J. Felsenstein, Inferring Phylogenies,

Springer.

Motivation Sequence comparison problems arise

everywhere

- Web search tools are designed based on sequence

comparison solutions. - Text processing tools are designed based on

efficient solutions to text storage and word

pattern matching. - Finally, bio-molecular sequence comparison is

extremely important in modern biological and

medical sciences.

Motivation

Motivation

- All genomes are evolved from a common ancestral

genome. - Less important regions of the genome mutate

or change with time more frequently than the

functional important parts. - Hence, sequence conservation implies function.

In other words, sequence conservation serves as a

signal of the functionally important regions in a

genome.

An Example of Sequence Function Analysis

- To determine whether the SARS is caused by a new

virus or not, one may sequence - the SARS virus and search against a databases

of known bacterial and viral sequences.

Sequence database

Output

List of similar matches

Part I Alignment Model for Bio-Seq

Comparison

- An alignment among k sequences is a k-row matrix
- such that
- The row j contains the j-th sequence and

between - two consecutive residues there may be one or

more -s - Each column contains at least one residue.

More examples

a g a a - t g c a a c a a g t g - a

a g a a t g - c a a c a a - g t g a

Each alignment is an evolutionary hypothesis,

which accounts for three types of mutations

-- Substitution ( a g c t ? a g t t ) --

Insertion ( a g c t ? a g a c t ) --

Deletion ( a g c t ? a t )

- Because of point mutation (sub, indel) and
- large scale mutation (reversal, fission, fusion,

etc), - a gene has different sequences in different

species. - different genomes have different size and

number of chromosomes. - Indels may not affect the structure of a protein.

Alignment of globin protein sequences

Part II Scoring an Alignment

a g t c t c c a t c c c

a g t c t c c a t c c c

The quality of an alignment is measured by the

sum of the scores of pairs of aligned residues

plus the scores for gaps.

The left alignment is scored by s(a, a)

3s(c, c) s(t, t) s(g, -) s(t, -)

- Scores for aligned residues and gaps form a basic

scoring scheme

A G C T - A 4

-2 -1 -2 -1 G 3

-2 -1 -1 C 4

-2 -1 T 3

-1

a g t c t c c a t c c c

has score 4 (-1) 3 4 (-1) 4 4 18

- Under a meaningful scoring scheme, the score of

an alignment is essentially - the logarithm of likelihood ratio of the

alignment between two random sequences. - Matches score a positive value
- Mismatches and indels are penalized
- by scoring a negative value.

Indels and Mismatch are introduced to bring up

matches that appear later.

A sequence is a word or string on alphabet

?. For DNA sequences, ? a, g, c, t For

protein sequences, ? has 20 letters For

English words, ? a, b, , y, z.

- (Global) Pairwise Sequence Alignment
- Instance Two sequences S1 and S2, and
- a scoring scheme
- Question Find an alignment of S1 and S2 that
- has the highest score.

Such an alignment that has the highest alignment

score is called an optimal alignment.

Part III Generality of Alignment as a Model

- Many different problems such as
- Maximum subsequence problem
- Minimum supersequence problem
- Best fitting problem
- Finding Levenshtein distance
- in sequence comparison can be solved as a

special case of alignment by using a specific

scoring scheme.

Maximum Subsequence Problem

- A sequence is a word or string on alphabet ?.
- For DNA sequences, ? a, g, c, t
- For protein sequences, ? has 20 letters
- For English words, ? a, b, , y, z.
- Let S and T be two sequences on ?.
- S is a subsequence of T
- or T is a supersequence of S if
- all the characters in S appear in T in the same

- order.

- Maximum Subsequence Problem
- Instance Two sequences S1 and S2
- Question Find the longest common
- subsequence of S1 and S2.
- a g g c c a a t a g g c c a a t
- a c g g c t c a a c g g c t c a

One-to-one correspondence

a g g c c a a t a c g g c t c a

a g g c - - c a a t - - a - - c g g c - - t

c a

s(x,x)1, s(x, y)0

The length of lcs the alignment

score

The LCS problem is a special case of the

alignment problem.

Levenshtein distance

- In computer science and information theory,
- the Levenshtein distance between two strings is

defined to be - the minimum number of edit operations needed to

transform one - string to the other. It is also called the edit

distance. - kitten ? sitten ? sittin ? sitting

Finding the edit distance is equivalent

to aligning sequences with match scores 0

and mismatch and indel score -1

k i t t e n - s i t t i n g

Part IV Key Issues of Alignment

- 1. Algorithmic aspect
- Given two or more sequences and a scoring

scheme, how to find the optimal alignment that

has the maximum score? - 2. Scoring function
- Which score schemes are good at ranking

alignments? - 3. Evaluation of output alignments
- Are the output alignments statistically

significant?

The number of possible alignments

- Theorem There are
- possible alignments for two m-character

sequences. - Proof. An alignment of two m-character sequences

has at least m columns and at most 2m columns.

So we only need to prove the following fact.

Fact There are

possible alignments with k columns for two

m-character sequences S and T.

A

C

A

A

G

C

C

Sequence S appears in the first row in

ways. After the configuration of the first row

is fixed, T appears in the second row in

ways.

Alignment GraphCompact representation of all

alignments

- The length of a sequence is the number of
- characters contained in it.

For example, sequence TATAGC has length 6.

Let S1 and S2 be two sequence of length m and n,

respectively. The alignment graph A(S1, S2) of S1

and S2 are defined as follows

There is one-to-one correspondence from the

alignment s between S and T to the paths from

left-top vertex to the right-bottom vertex.

-- Vertices are lattice points (i, j), 0 i

m, 0 j n. In total, (m1)(n1) vertices.

-- There is an arc from (i, j) to (i, j) if

and only if 0 i i 1 and 0 j

j 1. -- Three types of edges

horizontal edges vertical edges

diagonal edges

(0, 0)

(1, 0)

(1, 1)

(2, 2)

(3, 3)

(4, 4)

(5, 6)

(4, 5)

(6, 6)

(7,7)

S2

In the alignment graph, each edge

corresponds uniquely to a possible column in an

alignment. -- diagonal edges for

match/mismatches -- horizontal and vertical

edges for indels.

S1

A G

T T

A -

C C

- G

T T

G G

G -

- C

S2

S1

Part IV Alignment Algorithms

- Pairwise Alignment Problem
- Input Two sequence S and T and a score

scheme - Question Find an alignment of S and T that has

the - largest alignment score, called optimal

alignment. - Local Alignment Problem
- Input Two sequence S and T and a score

scheme - Question Find the best alignment of segments in

S and T respectively.

Part IV.1 Global Sequence Alignment

G T A C

G

Each column of an alignment corresponds to an

edge of the corresponding path.

A

T

A

Assign the column score to an edge as

weight. Find an optimal alignment is to find a

path with the largest weight.

C

A

- Consider two sequences
- Xx1x2xj, Yy1y2yk
- Write their t-char prefixes as
- X1,, tx1x2xt Y1, , ty1y2yt
- Let S(j, k) be the score of optimally aligning

X1, , j and Y1,, k. - Consider one of the best alignments of X1,,j

and Y1,,k.

Case 1 The alignment corresponds to a path

whose last edge is diagonal.

xj

yk

A best alignment of X1,, j-1 and Y1, , k-1

S(j, k) S(j-1, k-1) s(xj, yk)

- S(j, k) optimal score of aligning X1,, j

and Y1,,k.

Case 2 Correspond to a path whose last edge is

vertical

S(j, k) S(j-1, k) s(xj, -)

An optimal alignment of X1,, j-1 and Y1, , k

Case 3 Correspond to a path whose last edge is

horizontal

S(j, k) S(j, k-1) s(-, yk)

An optimal alignment of X1,, j and Y1, , k-1

- S(j, k) the score of optimally aligning x1, ,

j - and y1, , k.

If we know the optimal alignments for X1,..,

j-1 and Y1, ..,k, X1, .., j and Y1, ,

k-1, X1, , j-1 and Y1, , k-1, we can

find an optimal alignments for X1,, j and Y1,

, k.

S(0, k) the score of optimally aligning an

empty X to Y s(-, y1) s(-, y2)

s(- y3) s(-, yk)

S(j, 0) the score of optimally aligning X to

the empty Y s(x1 , -) s(x2, -)

s(x3, -) s(xj , -)

Tabular Computation of Pairwise Alignment

To align an j-character sequence X and a

k-character sequence Y, we form an (j1) by

(k1) matrix F in which S(j, k) is the score

of optimally aligning X1,..,j and Y1,,k

and computed by the formula one by one from left

to right, top to bottom.

A G G C G T

The score of the best align. of AGT and AG

A

G

The score of the best align. of X and Y

T

G

Alignment Example

G T A C

G

S(0, 0) S(0, 1) S(0, 2) S(0, 3)

S(0, 4) S(0, 5)

0

A

1

S(1, 0) S(1, 1) S(1, 2) S(1, 3)

S(1, 4) S(1, 5)

T

S(2, 0) S(2, 1) S(2, 2) S(2, 3)

S(2, 4) S(2, 5)

2

A

3

S(3, 0) S(3, 1) S(3, 2) S(3, 3)

S(3, 4) S(3, 5)

C

4

S(4, 0) S(4, 1) S(4, 2) S(4, 3)

S(4, 4) S(4, 5)

A

S(5, 0) S(5, 1) S(5, 2) S(5, 3)

S(5, 4) S(5, 5)

5

Alignment Example with Scoring Scheme Match 8,

mismatch -5, indel column -3

G T A C

G

0 -3 -6

-9 -12 -15

0

A

1

-3 -5 -8

2 -1

-4

T

-6 -8 3

2

0

6

-3

A

3

-9 -11 0 11

8 5

C

4

-12 -14 -3 8

19 16

A

-15 -17 -8 5

16 14

5

An optimal alignment can be found using

information on how S(j, k) are calculated.

A G G C G T

A

G

T

G

a ? b indicates that b can be calculated from

a. Any non-decreasing path from the first cell

to the last cell gives an optimal alignment.

G T A C

G

0

0 -3 -6

-9 -12 -15

A

1

-3 -5 -8 2

-1

-4

T

-6 -8 3

2

0

-6

-3

A

3

-9 -11 0 11

8 5

C

4

-12 -14 -3 8

19 16

A

-15 -17 -6 5

16 14

5

Backtracking from the right-most cell at the

bottom to the the left-most cell on the top

obtain the optimal path. It is not unique unlike

this example in general.

G T A C

G

0

0 -3 -6

-9 -12 -15

A

1

-3 -5 -8 2

-1

-4

T

-6 -8 3

2

0

-6

-3

A

3

-9 -11 0 11

8 5

C

4

-12 -14 -3 8

19 16

A

-15 -17 -6 5

16 14

5

An example where there are more than one optimal

alignment

G T A C

G

0

0 -3 -6

-9 -12 -15

A

-3 -5 -8 2

-1

A G

T T

A A

C --

A --

T

-6 -8 3

2

0

A

3

-9 -11 0 11

8 5

C

-12 -14 -3 8

19 16

A

-15 -17 -6 5

16 14

5

Summary

- To align a j-char sequence and a k-char
- sequence, we use a (j1)x(k1) table in which
- each entry is an alignment score. We compute
- the entry at (j, k) from three entries at

(j-1, k), - (j, k-1) , (j-1, k) in 5 operations using a
- recurrence formula. Hence, the alignment score

can - be computed in 5(j1)(k1) operations.

- To output an optimal alignment, we
- trackback the score computation from the
- the last cell to the first cell. For this

purpose, - we need to know which of entries
- (j-1, k), (j, k-1) and (j-1,k-1) gives the

entry at (j, k) using three extra pointers.

Overall, this also takes about 3(j1)(k1)

operations.

Theorem Aligning two sequences of length j

and k respectively takes quadratic time O(jk).

Remark Our algorithm uses O(jk) cells. But it

is possible to use only O(jk) cells

Dynamic Programming Technique

- Solve an instance of a problem by taking
- advantage of the computed solutions to
- the small parts of the instance.

The term was first used by Bellman in 1940s

Richard Ernest Bellman (19201984)

Part IV.2 Locally Aligning Sequences

Given two sequences S and T, find best

alignment of a segments of X and a segment of Y.

A G G C A A

G A

A G T G

G T A

Locally Aligning Sequences

Input Two sequences of length m and n, and a

scoring scheme s( , ).

Step 1 F(0, k)0 F(j, 0)0 and for jgt0

and kgt0, compute F(j,k) by

Step 2 Identify the largest entry F(m,n) in

the table and backtrack from the cell (m,

n) to a cell (m, n) such that F(m, n)0

to obtain an optimal local alignment.

local alignment

Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T

CTTAACA

Example of a local alignment

Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T

CTTA

A

C

C

Part IV.3 Multiple Sequence Alignment

- Generalization of dynamic programming method
- Given three sequences S1x1,..n, S21,..,n,

S31,..,n, - let F(i, j, k) denote the score of optimally

aligning the i-character prefix of S1, the

j-character prefix of S2 and the k-character

prefix of S3. Then, any alignment of these

prefixes has 7 possibilities in the last column

xi xj xk

-- -- --

xi -- --

-- xj --

-- -- xk

xi xj --

xi -- xk

-- xj xk

Basis values F(0, 0, k) ?s(-, -, zt) ,

F(i, 0, 0) ?s( xt , -, -) ,

F(0, j,0) ?s( -, yt , -) .

(i, j, k)

- One popular scoring approach is the
- Sum-of-Pairs score (SP-score), which is
- defined as follows
- where s(,) is given by a scoring scheme as in

pairwise alignment.

Aligning k Sequences

- In general, for aligning k m-char sequences, we
- need a k-dimensional array of (m1)k cells. At

the last column of an alignment, there are 2k -1

possibilities (1 is subtracted because all gaps

are not allowed). - Hence, we must
- fill out all (m1)k cells
- for each cell, consider all 2k -1 possibilities.

- for each possibility, we compute the SP score

using O(k2) operations.

Efficiency of the DP method

- For k m-character sequences, the computation

involves a matrix with (m1)k entries each

entry is computed by 2k arithmetic operations.

Hence, the method takes about O(k22k(m1)k)

operations. - When mgt100, kgt20 or mgt300 and k5, aligning by

this is impossible even with Carrillio-Lipman

bound.

Complexity of MSA

- Theorem MSA is an NP-hard problem.
- This means that MSA is likely not solved
- essentially faster than guessing and checking

its - solutions.
- It also seems true that MSA takes exponential

memory space.

Application one Genome Sequencing

- Genome sequencing is to determine the exact order

of the millions of chemical building blocks (A,

T, C, and G) that make up the DNA of an

organism.

The genome of a complex organism like human is

huge. However, the existing technology can only

be used for fairly short (say, 500-700bp) stands.

Therefore, a feasible approach to genome

sequencing is to cut the entire genome at a

number of points randomly to generate a large

number of small segments, which are sequenced,

called READs.

Computer programs use overlapping information on

different READs to assemble them into a

contiguous sequence. Multiple overlapping READS

are obtained for the target genome by performing

several rounds of fragmentation.

Target XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

XX !st round NNNNNAGCATGCTGCANNNNNNNNNNNNNNNNN

NN NNNNNNNNNNNNNNNNGTCATGCTTAGGC

TANNNNN 2nd round NNNNNAGCATGNNNNNNNNNNNNNNNNNN

NNNNNNN NNNNNNNNNNNCTGCAGTCATGCT

TAGGCANNNNNNN Assembly

AGCATGCTGCAGTCATGCTTAGGCA

The above sequencing process is called Shotgun

sequencing. Applying the shotgun sequencing to

human genome, we will get about 50 million short

segments of about 600 base pairs long. To have a

highly accurate reconstruction of human genome,

the fragmentation-and-sequencing process needs to

be performed in 7 to 10 rounds to get enough

overlapping information.

Target DNA

1st round

2nd round

3rd round

4th round