Sequence Comparison - PowerPoint PPT Presentation

Loading...

PPT – Sequence Comparison PowerPoint presentation | free to view - id: 85261-ZDc1Z



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

Sequence Comparison

Description:

Reference Books: K.-M. Chao and L.X. Zhang. Sequence Comparison: Theory and Methods, Springer ... extremely important in modern biological and medical sciences. ... – PowerPoint PPT presentation

Number of Views:32
Avg rating:3.0/5.0
Slides: 56
Provided by: mat91
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Sequence Comparison


1
Sequence Comparison
MA5232 Lecture Notes
  • LX Zhang
  • Department of Mathematics
  • National University of Singapore
  • matzlx_at_nus.edu.sg

2
See what everyone else has seen, but think them
mathematically.
  • Reference Books
  • K.-M. Chao and L.X. Zhang.
  • Sequence Comparison Theory and Methods,
    Springer
  • 2. R.C. Deonier, S. Tavare, and M.S. Waterman,
  • Computational Genome Analysis, Springer.
  • 3. J. Felsenstein, Inferring Phylogenies,
    Springer.

3
Motivation Sequence comparison problems arise
everywhere
  • Web search tools are designed based on sequence
    comparison solutions.
  • Text processing tools are designed based on
    efficient solutions to text storage and word
    pattern matching.
  • Finally, bio-molecular sequence comparison is
    extremely important in modern biological and
    medical sciences.

4
Motivation
5
Motivation
  • All genomes are evolved from a common ancestral
    genome.
  • Less important regions of the genome mutate
    or change with time more frequently than the
    functional important parts.
  • Hence, sequence conservation implies function.
    In other words, sequence conservation serves as a
    signal of the functionally important regions in a
    genome.

6
An Example of Sequence Function Analysis
  • To determine whether the SARS is caused by a new
    virus or not, one may sequence
  • the SARS virus and search against a databases
    of known bacterial and viral sequences.

Sequence database
Output
List of similar matches
7
Part I Alignment Model for Bio-Seq
Comparison
  • An alignment among k sequences is a k-row matrix
  • such that
  • The row j contains the j-th sequence and
    between
  • two consecutive residues there may be one or
    more -s
  • Each column contains at least one residue.

8
More examples

a g a a - t g c a a c a a g t g - a
a g a a t g - c a a c a a - g t g a
Each alignment is an evolutionary hypothesis,
which accounts for three types of mutations
-- Substitution ( a g c t ? a g t t ) --
Insertion ( a g c t ? a g a c t ) --
Deletion ( a g c t ? a t )
9
  • Because of point mutation (sub, indel) and
  • large scale mutation (reversal, fission, fusion,
    etc),
  • a gene has different sequences in different
    species.
  • different genomes have different size and
    number of chromosomes.
  • Indels may not affect the structure of a protein.

10
Alignment of globin protein sequences
11
Part II Scoring an Alignment
a g t c t c c a t c c c
a g t c t c c a t c c c
The quality of an alignment is measured by the
sum of the scores of pairs of aligned residues
plus the scores for gaps.
The left alignment is scored by s(a, a)
3s(c, c) s(t, t) s(g, -) s(t, -)
12
  • Scores for aligned residues and gaps form a basic
    scoring scheme

A G C T - A 4
-2 -1 -2 -1 G 3
-2 -1 -1 C 4
-2 -1 T 3
-1
a g t c t c c a t c c c
has score 4 (-1) 3 4 (-1) 4 4 18
13
  • Under a meaningful scoring scheme, the score of
    an alignment is essentially
  • the logarithm of likelihood ratio of the
    alignment between two random sequences.
  • Matches score a positive value
  • Mismatches and indels are penalized
  • by scoring a negative value.

Indels and Mismatch are introduced to bring up
matches that appear later.
14
A sequence is a word or string on alphabet
?. For DNA sequences, ? a, g, c, t For
protein sequences, ? has 20 letters For
English words, ? a, b, , y, z.
  • (Global) Pairwise Sequence Alignment
  • Instance Two sequences S1 and S2, and
  • a scoring scheme
  • Question Find an alignment of S1 and S2 that
  • has the highest score.

Such an alignment that has the highest alignment
score is called an optimal alignment.
15
Part III Generality of Alignment as a Model
  • Many different problems such as
  • Maximum subsequence problem
  • Minimum supersequence problem
  • Best fitting problem
  • Finding Levenshtein distance
  • in sequence comparison can be solved as a
    special case of alignment by using a specific
    scoring scheme.

16
Maximum Subsequence Problem
  • A sequence is a word or string on alphabet ?.
  • For DNA sequences, ? a, g, c, t
  • For protein sequences, ? has 20 letters
  • For English words, ? a, b, , y, z.
  • Let S and T be two sequences on ?.
  • S is a subsequence of T
  • or T is a supersequence of S if
  • all the characters in S appear in T in the same
  • order.

17
  • Maximum Subsequence Problem
  • Instance Two sequences S1 and S2
  • Question Find the longest common
  • subsequence of S1 and S2.
  • a g g c c a a t a g g c c a a t
  • a c g g c t c a a c g g c t c a

18
One-to-one correspondence
a g g c c a a t a c g g c t c a
a g g c - - c a a t - - a - - c g g c - - t
c a
s(x,x)1, s(x, y)0
The length of lcs the alignment
score
The LCS problem is a special case of the
alignment problem.
19
Levenshtein distance
  • In computer science and information theory,
  • the Levenshtein distance between two strings is
    defined to be
  • the minimum number of edit operations needed to
    transform one
  • string to the other. It is also called the edit
    distance.
  • kitten ? sitten ? sittin ? sitting

Finding the edit distance is equivalent
to aligning sequences with match scores 0
and mismatch and indel score -1
k i t t e n - s i t t i n g
20
Part IV Key Issues of Alignment
  • 1. Algorithmic aspect
  • Given two or more sequences and a scoring
    scheme, how to find the optimal alignment that
    has the maximum score?
  • 2. Scoring function
  • Which score schemes are good at ranking
    alignments?
  • 3. Evaluation of output alignments
  • Are the output alignments statistically
    significant?

21
The number of possible alignments
  • Theorem There are
  • possible alignments for two m-character
    sequences.
  • Proof. An alignment of two m-character sequences
    has at least m columns and at most 2m columns.
    So we only need to prove the following fact.

22
Fact There are
possible alignments with k columns for two
m-character sequences S and T.
A
C
A
A
G
C
C
Sequence S appears in the first row in
ways. After the configuration of the first row
is fixed, T appears in the second row in
ways.
23
Alignment GraphCompact representation of all
alignments
  • The length of a sequence is the number of
  • characters contained in it.

For example, sequence TATAGC has length 6.
Let S1 and S2 be two sequence of length m and n,
respectively. The alignment graph A(S1, S2) of S1
and S2 are defined as follows
24
There is one-to-one correspondence from the
alignment s between S and T to the paths from
left-top vertex to the right-bottom vertex.
-- Vertices are lattice points (i, j), 0 i
m, 0 j n. In total, (m1)(n1) vertices.
-- There is an arc from (i, j) to (i, j) if
and only if 0 i i 1 and 0 j
j 1. -- Three types of edges
horizontal edges vertical edges
diagonal edges
25
(0, 0)
(1, 0)
(1, 1)
(2, 2)
(3, 3)
(4, 4)
(5, 6)
(4, 5)
(6, 6)
(7,7)
S2
In the alignment graph, each edge
corresponds uniquely to a possible column in an
alignment. -- diagonal edges for
match/mismatches -- horizontal and vertical
edges for indels.
S1
26
A G
T T
A -
C C
- G
T T
G G
G -
- C
S2
S1
27
Part IV Alignment Algorithms
  • Pairwise Alignment Problem
  • Input Two sequence S and T and a score
    scheme
  • Question Find an alignment of S and T that has
    the
  • largest alignment score, called optimal
    alignment.
  • Local Alignment Problem
  • Input Two sequence S and T and a score
    scheme
  • Question Find the best alignment of segments in
    S and T respectively.

28
Part IV.1 Global Sequence Alignment
G T A C
G
Each column of an alignment corresponds to an
edge of the corresponding path.
A
T
A
Assign the column score to an edge as
weight. Find an optimal alignment is to find a
path with the largest weight.
C
A
29
  • Consider two sequences
  • Xx1x2xj, Yy1y2yk
  • Write their t-char prefixes as
  • X1,, tx1x2xt Y1, , ty1y2yt
  • Let S(j, k) be the score of optimally aligning
    X1, , j and Y1,, k.
  • Consider one of the best alignments of X1,,j
    and Y1,,k.

Case 1 The alignment corresponds to a path
whose last edge is diagonal.
xj
yk
A best alignment of X1,, j-1 and Y1, , k-1
S(j, k) S(j-1, k-1) s(xj, yk)
30
  • S(j, k) optimal score of aligning X1,, j
    and Y1,,k.

Case 2 Correspond to a path whose last edge is
vertical
S(j, k) S(j-1, k) s(xj, -)
An optimal alignment of X1,, j-1 and Y1, , k
Case 3 Correspond to a path whose last edge is
horizontal
S(j, k) S(j, k-1) s(-, yk)
An optimal alignment of X1,, j and Y1, , k-1
31
  • S(j, k) the score of optimally aligning x1, ,
    j
  • and y1, , k.

If we know the optimal alignments for X1,..,
j-1 and Y1, ..,k, X1, .., j and Y1, ,
k-1, X1, , j-1 and Y1, , k-1, we can
find an optimal alignments for X1,, j and Y1,
, k.
S(0, k) the score of optimally aligning an
empty X to Y s(-, y1) s(-, y2)
s(- y3) s(-, yk)
S(j, 0) the score of optimally aligning X to
the empty Y s(x1 , -) s(x2, -)
s(x3, -) s(xj , -)
32
Tabular Computation of Pairwise Alignment
To align an j-character sequence X and a
k-character sequence Y, we form an (j1) by
(k1) matrix F in which S(j, k) is the score
of optimally aligning X1,..,j and Y1,,k
and computed by the formula one by one from left
to right, top to bottom.
A G G C G T
The score of the best align. of AGT and AG
A
G
The score of the best align. of X and Y
T
G
33
Alignment Example
G T A C
G
S(0, 0) S(0, 1) S(0, 2) S(0, 3)
S(0, 4) S(0, 5)
0
A
1
S(1, 0) S(1, 1) S(1, 2) S(1, 3)
S(1, 4) S(1, 5)
T
S(2, 0) S(2, 1) S(2, 2) S(2, 3)
S(2, 4) S(2, 5)
2
A
3
S(3, 0) S(3, 1) S(3, 2) S(3, 3)
S(3, 4) S(3, 5)
C
4
S(4, 0) S(4, 1) S(4, 2) S(4, 3)
S(4, 4) S(4, 5)
A
S(5, 0) S(5, 1) S(5, 2) S(5, 3)
S(5, 4) S(5, 5)
5
34
Alignment Example with Scoring Scheme Match 8,
mismatch -5, indel column -3
G T A C
G
0 -3 -6
-9 -12 -15
0
A
1
-3 -5 -8
2 -1
-4
T
-6 -8 3
2
0
6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -8 5
16 14
5
35
An optimal alignment can be found using
information on how S(j, k) are calculated.
A G G C G T
A
G
T
G
36

a ? b indicates that b can be calculated from
a. Any non-decreasing path from the first cell
to the last cell gives an optimal alignment.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
37

Backtracking from the right-most cell at the
bottom to the the left-most cell on the top
obtain the optimal path. It is not unique unlike
this example in general.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
38

An example where there are more than one optimal
alignment
G T A C
G
0
0 -3 -6
-9 -12 -15
A
-3 -5 -8 2
-1
A G
T T
A A
C --
A --
T
-6 -8 3
2
0
A
3
-9 -11 0 11
8 5
C
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
39
Summary
  • To align a j-char sequence and a k-char
  • sequence, we use a (j1)x(k1) table in which
  • each entry is an alignment score. We compute
  • the entry at (j, k) from three entries at
    (j-1, k),
  • (j, k-1) , (j-1, k) in 5 operations using a
  • recurrence formula. Hence, the alignment score
    can
  • be computed in 5(j1)(k1) operations.

40
  • To output an optimal alignment, we
  • trackback the score computation from the
  • the last cell to the first cell. For this
    purpose,
  • we need to know which of entries
  • (j-1, k), (j, k-1) and (j-1,k-1) gives the
    entry at (j, k) using three extra pointers.
    Overall, this also takes about 3(j1)(k1)
    operations.

Theorem Aligning two sequences of length j
and k respectively takes quadratic time O(jk).
Remark Our algorithm uses O(jk) cells. But it
is possible to use only O(jk) cells
41
Dynamic Programming Technique
  • Solve an instance of a problem by taking
  • advantage of the computed solutions to
  • the small parts of the instance.

The term was first used by Bellman in 1940s
Richard Ernest Bellman (19201984)
42
Part IV.2 Locally Aligning Sequences
Given two sequences S and T, find best
alignment of a segments of X and a segment of Y.
43
A G G C A A
G A
A G T G
G T A
44
Locally Aligning Sequences
Input Two sequences of length m and n, and a
scoring scheme s( , ).
Step 1 F(0, k)0 F(j, 0)0 and for jgt0
and kgt0, compute F(j,k) by
Step 2 Identify the largest entry F(m,n) in
the table and backtrack from the cell (m,
n) to a cell (m, n) such that F(m, n)0
to obtain an optimal local alignment.
45
local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T

CTTAACA
46
Example of a local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
CTTA
A
C
C
47
Part IV.3 Multiple Sequence Alignment
  • Generalization of dynamic programming method
  • Given three sequences S1x1,..n, S21,..,n,
    S31,..,n,
  • let F(i, j, k) denote the score of optimally
    aligning the i-character prefix of S1, the
    j-character prefix of S2 and the k-character
    prefix of S3. Then, any alignment of these
    prefixes has 7 possibilities in the last column

xi xj xk
-- -- --
xi -- --
-- xj --
-- -- xk
xi xj --
xi -- xk
-- xj xk
48
Basis values F(0, 0, k) ?s(-, -, zt) ,
F(i, 0, 0) ?s( xt , -, -) ,
F(0, j,0) ?s( -, yt , -) .

(i, j, k)
49
  • One popular scoring approach is the
  • Sum-of-Pairs score (SP-score), which is
  • defined as follows
  • where s(,) is given by a scoring scheme as in
    pairwise alignment.

50
Aligning k Sequences
  • In general, for aligning k m-char sequences, we
  • need a k-dimensional array of (m1)k cells. At
    the last column of an alignment, there are 2k -1
    possibilities (1 is subtracted because all gaps
    are not allowed).
  • Hence, we must
  • fill out all (m1)k cells
  • for each cell, consider all 2k -1 possibilities.
  • for each possibility, we compute the SP score
    using O(k2) operations.

51
Efficiency of the DP method
  • For k m-character sequences, the computation
    involves a matrix with (m1)k entries each
    entry is computed by 2k arithmetic operations.
    Hence, the method takes about O(k22k(m1)k)
    operations.
  • When mgt100, kgt20 or mgt300 and k5, aligning by
    this is impossible even with Carrillio-Lipman
    bound.

52
Complexity of MSA
  • Theorem MSA is an NP-hard problem.
  • This means that MSA is likely not solved
  • essentially faster than guessing and checking
    its
  • solutions.
  • It also seems true that MSA takes exponential
    memory space.

53
Application one Genome Sequencing
  • Genome sequencing is to determine the exact order
    of the millions of chemical building blocks (A,
    T, C, and G) that make up the DNA of an
    organism.

The genome of a complex organism like human is
huge. However, the existing technology can only
be used for fairly short (say, 500-700bp) stands.
Therefore, a feasible approach to genome
sequencing is to cut the entire genome at a
number of points randomly to generate a large
number of small segments, which are sequenced,
called READs.
54
Computer programs use overlapping information on
different READs to assemble them into a
contiguous sequence. Multiple overlapping READS
are obtained for the target genome by performing
several rounds of fragmentation.
Target XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XX !st round NNNNNAGCATGCTGCANNNNNNNNNNNNNNNNN
NN NNNNNNNNNNNNNNNNGTCATGCTTAGGC
TANNNNN 2nd round NNNNNAGCATGNNNNNNNNNNNNNNNNNN
NNNNNNN NNNNNNNNNNNCTGCAGTCATGCT
TAGGCANNNNNNN Assembly
AGCATGCTGCAGTCATGCTTAGGCA
55
The above sequencing process is called Shotgun
sequencing. Applying the shotgun sequencing to
human genome, we will get about 50 million short
segments of about 600 base pairs long. To have a
highly accurate reconstruction of human genome,
the fragmentation-and-sequencing process needs to
be performed in 7 to 10 rounds to get enough
overlapping information.
Target DNA
1st round
2nd round
3rd round
4th round
About PowerShow.com