Title: Sparse Normalized Local Alignment, Nadav Efraty and Gad M. Landau
1Sparse Normalized Local Alignment
Nadav Efraty, Gad M. Landau
2Goal
- To find the substrings X and Y whose normalized alignment value, $LCS(X,Y)/(|X|+|Y|)$, is the highest,
- or higher than a predefined similarity level.
3- Introduction
- The $O(rL \log\log n)$ normalized local LCS algorithm
- The $O(rM \log\log n)$ normalized local LCS algorithm
- Conclusions and open problems
4Introduction
5Background - Global similarity
- LCS (longest common subsequence):
- computed with a dynamic programming table of size $(n+1) \times (m+1)$
- $T(i,0) = T(0,j) = 0$
- for all $i,j$ ($1 \le i \le m$, $1 \le j \le n$):
- if $X_j = Y_i$ then $T(i,j) = T(i-1,j-1) + 1$,
- else $T(i,j) = \max\{T(i-1,j),\ T(i,j-1)\}$
6Background - Global similarity
The naive LCS algorithm
$X_j = Y_i$: $T(i,j) = T(i-1,j-1) + 1$
$X_j \ne Y_i$: $T(i,j) = \max\{T(i,j-1),\ T(i-1,j)\}$
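The recurrence above can be sketched in a few lines of Python. This is only an illustration: the function names `lcs_table` and `lcs_length` are made up for this example, and rows are indexed by Y and columns by X, as on the slide.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Fill the table T, where T[i][j] = LCS of y[:i] and x[:j]."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay 0: T(i,0) = T(0,j) = 0.
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[j - 1] == y[i - 1]:          # X_j = Y_i
                t[i][j] = t[i - 1][j - 1] + 1
            else:                              # X_j != Y_i
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def lcs_length(x: str, y: str) -> int:
    """Global LCS value: the bottom-right entry of the table."""
    return lcs_table(x, y)[len(y)][len(x)]
```

The table has (m+1)(n+1) entries regardless of how many matches the two strings share, which is the cost the sparse approach tries to avoid.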
7Background - Global similarity
The typical staircase shape of the layers in the
matrix
8Background - Global similarity
- Edit distance measures the minimal number of operations required to transform one string into another.
- Operations:
- substitution
- deletion
- insertion
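As a small illustration of the definition above, here is a sketch of the classic dynamic program for edit distance. It assumes unit cost for each of the three operations (the slide does not state costs), and the name `edit_distance` is chosen for this example only.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimal number of substitutions, deletions and insertions turning x into y."""
    # prev[j] = distance between the empty prefix of y and x[:j]
    prev = list(range(len(x) + 1))
    for i, cy in enumerate(y, 1):
        cur = [i]  # distance between y[:i] and the empty prefix of x
        for j, cx in enumerate(x, 1):
            cur.append(min(
                prev[j - 1] + (cx != cy),  # substitution (free when symbols agree)
                prev[j] + 1,               # one symbol of y has no partner in x
                cur[j - 1] + 1,            # one symbol of x has no partner in y
            ))
        prev = cur
    return prev[-1]
```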
9Background - Local similarity
- The Smith-Waterman algorithm (1981)
- $T(i,0) = T(0,j) = 0$,
- for all $i,j$ ($1 \le i \le m$, $1 \le j \le n$):
- $T(i,j) = \max\{T(i-1,j-1) + S(Y_i,X_j),\ T(i-1,j) + D(Y_i),\ T(i,j-1) + I(X_j),\ 0\}$
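A minimal sketch of this recurrence follows. The scoring is an assumption made for the example: the slide's functions S, D and I are replaced by three constants (`match`, `mismatch`, `gap`), and the function returns only the best local score, not the alignment itself.

```python
def smith_waterman(x: str, y: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score under a constant (hypothetical) scoring scheme."""
    n, m = len(x), len(y)
    t = [[0] * (n + 1) for _ in range(m + 1)]   # T(i,0) = T(0,j) = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch   # S(Y_i, X_j)
            t[i][j] = max(
                t[i - 1][j - 1] + s,   # align Y_i with X_j
                t[i - 1][j] + gap,     # D(Y_i)
                t[i][j - 1] + gap,     # I(X_j)
                0,                     # start a fresh local alignment
            )
            best = max(best, t[i][j])
    return best
```

The `0` term is what makes the alignment local: a prefix with negative score is dropped instead of being carried along.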
10Background - Local similarity
The weaknesses of the Smith-Waterman algorithm
- Mosaic effect - Inability to discard poorly conserved intermediate segments.
- Shadow effect - Short but biologically more important alignments may go undetected because they are overlapped by longer (and less important) alignments.
- The sparsity of the essential data is not exploited.
11- The solution: Normalization
- The statistical significance of a local alignment depends on both its score and its length.
- Instead of searching for an alignment that maximizes the score $S(X,Y)$, search for the alignment that maximizes $S(X,Y)/(|X|+|Y|)$.
12- Arslan, Egecioglu and Pevzner (2001) use a mathematical technique that converges to the optimal alignment value through iterations of the Smith-Waterman algorithm.
- The objective is $SCORE(X,Y)/(|X|+|Y|+L)$, where $L$ is a constant that controls the amount of normalization.
- Running time: $O(n^2 \log n)$.
13Our approach
- The degree of similarity is defined as $LCS(X,Y)/(|X|+|Y|)$.
- M - a minimal length constraint.
- Similarity level.
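To make the objective concrete, here is a brute-force reference that tries every pair of substrings. It is not the algorithm of the talk, only a yardstick for very short inputs. One assumption is made: the constraint M is read as a minimum number of matches, LCS(X,Y) >= M, which is how the later slides use it. The names `best_normalized_lcs` and `min_matches` are invented for this sketch.

```python
from fractions import Fraction


def lcs_length(a: str, b: str) -> int:
    """Plain LCS length with a rolling row."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def best_normalized_lcs(x: str, y: str, min_matches: int = 1):
    """Return (value, X, Y) maximizing LCS(X,Y)/(|X|+|Y|) with LCS(X,Y) >= min_matches."""
    best = (Fraction(0), "", "")
    for i in range(len(x)):
        for i2 in range(i + 1, len(x) + 1):
            for j in range(len(y)):
                for j2 in range(j + 1, len(y) + 1):
                    sx, sy = x[i:i2], y[j:j2]
                    k = lcs_length(sx, sy)
                    if k >= min_matches:
                        value = Fraction(k, len(sx) + len(sy))
                        if value > best[0]:
                            best = (value, sx, sy)
    return best
```

Without the constraint, any single matching symbol already reaches the maximum value ½, which is why a minimal length constraint is needed for the result to be meaningful.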
14The $O(rL \log\log n)$ normalized local LCS algorithm
15Definitions
- A chain is a sequence of matches that is strictly
increasing in both components.
- The length of a chain from match (i,j) to match (i',j') is $(i'-i) + (j'-j)$, that is, the length of the substrings which create the chain.
- A k-chain(i,j) is the shortest chain of k
matches starting from (i,j).
- The normalized value of k-chain(i,j) is k
divided by its length.
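The definitions above can be written down as three small helpers. One convention is an assumption of this sketch: the length counts both end positions, so it equals the combined number of characters of the two substrings, (i'-i+1) + (j'-j+1). It is chosen here so that a chain over two identical strings gets the value ½. All function names are illustrative.

```python
from fractions import Fraction

Match = tuple[int, int]   # (row i in Y, column j in X)


def is_chain(matches: list[Match]) -> bool:
    """A chain is strictly increasing in both components."""
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(matches, matches[1:]))


def chain_length(matches: list[Match]) -> int:
    """Combined length of the two substrings spanned by the chain (inclusive count)."""
    (i, j), (i2, j2) = matches[0], matches[-1]
    return (i2 - i + 1) + (j2 - j + 1)


def normalized_value(matches: list[Match]) -> Fraction:
    """Number of matches divided by the chain length."""
    return Fraction(len(matches), chain_length(matches))
```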
16The algorithm
- For each match (a,b), construct k-chain(a,b) for $1 \le k \le L$ ($L = LCS(X,Y)$).
- Examine all the k-chains with $k \ge M$, starting from each match, and report either
- the k-chains with the highest normalized value, or
- k-chains whose normalized value exceeds a predefined threshold.
17- Problem: k-chain(a,b) is not the prefix of (k+1)-chain(a,b).
18Solution: to build (k+1)-chain(a,b), (a,b) is concatenated to a k-chain(i,j) below and to the right of (a,b).
19Question: How can we find the proper match (i,j), the head of the k-chain that should be concatenated to (a,b) in order to construct (k+1)-chain(a,b)?
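Before any clever data structure, the concatenation rule can be checked with a quadratic reference: for every match (a,b), try every k-chain head below and to the right and keep the shortest result. This is a sketch for testing purposes only, not the sparse algorithm of the talk. It assumes lengths that count both end positions (so a single match has length 2), and the name `k_chains_reference` is hypothetical.

```python
def k_chains_reference(x: str, y: str, max_k: int) -> list[dict]:
    """length[k][(a, b)] = length of k-chain(a, b), for 1 <= k <= max_k.

    Matches are (i, j) with y[i] == x[j]; index 0 of the result is unused.
    """
    matches = [(i, j) for i in range(len(y)) for j in range(len(x)) if y[i] == x[j]]
    length = [dict(), {m: 2 for m in matches}]     # a 1-chain spans one symbol of each string
    for k in range(1, max_k):
        nxt = {}
        for (a, b) in matches:
            # (k+1)-chain(a,b) = (a,b) followed by some k-chain(i,j) with i > a, j > b.
            candidates = [l + (i - a) + (j - b)
                          for (i, j), l in length[k].items() if i > a and j > b]
            if candidates:
                nxt[(a, b)] = min(candidates)
        length.append(nxt)
    return length
```

Each level costs time quadratic in the number of matches; the ranges and owners introduced next exist to replace the inner search with a single lookup.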
20- Definitions
- Range - The range of a match (i,j) is $(0 \ldots i-1,\ 0 \ldots j-1)$.
- Mutual range - An area of the table which is overlapped by at least two ranges of distinct matches.
- Owner - (i,j) is the owner of the range where k-chain(i,j) is the suffix of (k+1)-chain(a,b) for any match (a,b) in the range.
- L separate lists of ranges and their owners are maintained by the algorithm.
21- If (a,b) is in the range of a single match (i,j) (it is not in a mutual range), k-chain(i,j) is the suffix of (k+1)-chain(a,b).
- If (a,b) is in the mutual range of two matches, how can we determine which of them should be concatenated to (a,b)?
- Lemma: A mutual range of two matches is owned completely by one of them.
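22- Lemma: A mutual range of two matches, p (= (i,j)) and q (= (i',j')), is owned completely by one of them.
- Proof: There are two distinct cases.
- Case 1: $i \le i'$ and $j \le j'$
[Figure: the table with X along the columns (0 to n) and Y along the rows (0 to m), showing the matches (i,j) and (i',j').]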
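23Case 2: $i < i'$ and $j > j'$. The mutual range of p and q is $(0 \ldots i-1,\ 0 \ldots j'-1)$. Entry $(i-1,\ j'-1)$ is the mutual point (MP) of p and q. With $L_p$ and $L_q$ the lengths of the k-chains of p and q, p will be the owner of the mutual range if $L_p + (j-j') \le L_q + (i'-i)$.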
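The ownership test of Case 2 fits in one comparison. The sketch below also includes a helper that computes the length obtained from an arbitrary point of the mutual range, which makes it easy to see that the winner does not depend on the point. Function names are illustrative, and resolving ties in favor of p is an assumption of this example.

```python
Match = tuple[int, int]


def length_through(point: Match, head: Match, head_len: int) -> int:
    """Length of the chain that starts at `point` and continues with the k-chain of `head`."""
    (a, b), (i, j) = point, head
    return head_len + (i - a) + (j - b)


def owner_of_mutual_range(p: Match, len_p: int, q: Match, len_q: int) -> Match:
    """Case 2: p = (i, j) lies above and to the right of q = (i2, j2)."""
    (i, j), (i2, j2) = p, q
    if not (i < i2 and j > j2):
        raise ValueError("Case 2 expects i < i' and j > j'")
    # Ties go to p here (an assumption of this sketch).
    return p if len_p + (j - j2) <= len_q + (i2 - i) else q
```

For any point (a,b) of the mutual range, the difference between the two candidate lengths is (L_p + i + j) - (L_q + i' + j'), which does not involve a or b. That is why one comparison at the mutual point settles the whole range.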
24The algorithm
- Preprocessing.
- Process the matches row by row, from bottom up. For the matches of row i:
- Stage 1: Construct k-chains, $1 \le k \le L$, for all the matches in row i, using the L lists of ranges and owners.
- Stage 2: Update the lists of ranges and owners with the matches of row i and their k-chains.
- Examine the k-chains of all matches and report the ones with the highest normalized value.
25Stage 2
- Let LROk be the list of ranges and owners that are the heads of k-chains.
- Insert each match (i,j) of row i which is the head of a k-chain into LROk.
- If there is already another match with column coordinate j, extract it from LROk.
[Figure: the table with Row 0, Row i and Row i+1 marked.]
26Stage 2 (cont.)
- While, for (i',j'), the left neighbor of (i,j) in LROk,
- $\text{length}(k\text{-chain}(i',j')) + (i'-i) \ge \text{length}(k\text{-chain}(i,j)) + (j-j')$,
- (i',j') should be extracted from LROk.
[Figure: the table with Row 0 and Row i marked.]
27Stage 1
- Constructing (k+1)-chain(i,j): concatenate (i,j) to the match in LROk which is the owner of the range containing (i,j).
- Record the value of (k+1)-chain(i,j) with the match (i,j).
[Figure: the table with Row 0 and Row i marked.]
28Reporting the best alignments
- The best alignment is either the alignment with the highest normalized value or the alignments whose similarity exceeds a predefined value.
- Check all the k-chains, $k \ge M$, starting from each match and report the best alignments.
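The two stages and the reporting step can be sketched together as follows. Several things here are assumptions of the sketch rather than details taken from the slides: each LROk is kept as a plain sorted Python list searched with `bisect` (a stand-in for the faster search structure the analysis relies on, so the running time is worse), chain lengths count both end positions, and the owner of the range containing (i,j) is taken to be the entry of LROk with the smallest column greater than j. All names are illustrative.

```python
from bisect import bisect_left, bisect_right
from fractions import Fraction


def sparse_k_chains(x: str, y: str, max_k: int) -> list[dict]:
    """length[k][(i, j)] = length of k-chain(i, j) for 1 <= k <= max_k (index 0 unused)."""
    positions: dict[str, list[int]] = {}        # symbol -> increasing columns of x
    for j, c in enumerate(x):
        positions.setdefault(c, []).append(j)
    length = [dict() for _ in range(max_k + 1)]
    cols = [[] for _ in range(max_k + 1)]        # cols[k]: sorted owner columns of LROk
    heads = [dict() for _ in range(max_k + 1)]   # heads[k][column] = (row, chain length)

    for i in range(len(y) - 1, -1, -1):          # rows from the bottom up
        row = positions.get(y[i], [])
        # Stage 1: build the chains of row i from lists that hold lower rows only.
        for j in row:
            length[1][(i, j)] = 2
            for k in range(1, max_k):
                pos = bisect_right(cols[k], j)   # first owner strictly right of column j
                if pos == len(cols[k]):
                    break                        # no k-chain below-right, so no longer one either
                c = cols[k][pos]
                r, l = heads[k][c]
                length[k + 1][(i, j)] = l + (r - i) + (c - j)
        # Stage 2: insert the matches of row i and extract dominated left neighbors.
        for j in row:
            for k in range(1, max_k + 1):
                if (i, j) not in length[k]:
                    break
                l = length[k][(i, j)]
                pos = bisect_left(cols[k], j)
                if pos == len(cols[k]) or cols[k][pos] != j:
                    cols[k].insert(pos, j)
                heads[k][j] = (i, l)             # also replaces an older match in column j
                while pos > 0:
                    c2 = cols[k][pos - 1]
                    r2, l2 = heads[k][c2]
                    if l2 + (r2 - i) >= l + (j - c2):
                        del cols[k][pos - 1]
                        del heads[k][c2]
                        pos -= 1
                    else:
                        break
    return length


def best_chain(length: list[dict], min_matches: int):
    """Return (value, k, head) with the highest normalized value among k >= min_matches."""
    best = None
    for k in range(min_matches, len(length)):
        for head, l in length[k].items():
            if best is None or Fraction(k, l) > best[0]:
                best = (Fraction(k, l), k, head)
    return best
```

A call such as `best_chain(sparse_k_chains(x, y, max_k), M)` then scans the recorded chains once, which corresponds to the reporting step.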
29Complexity analysis
- Preprocessing: $O(n \log \Sigma_Y)$
- Stage 1:
- For each of the r matches we construct at most L k-chains.
- Using a Johnson Tree, stage 1 is computed in $O(rL \log\log n)$ time.
- Stage 2: Each of the r matches is inserted into and extracted from each of the LROks at most once. Total: $O(rL \log\log n)$ time.
30Complexity analysis
- Reporting the best alignments is done in $O(rL)$ time.
- Total time complexity of this algorithm is $O(n \log \Sigma_Y + rL \log\log n)$.
- Space complexity is $O(rL + nL)$.
31The $O(rM \log\log n)$ normalized local LCS algorithm
32The $O(rM \log\log n)$ normalized local LCS algorithm
- Reports
- The normalized alignment value of the best possible local alignment (value and substrings).
33Computing the highest normalized value
- Definition: A sub-chain of a k-chain is a path that contains a sequence of $x \le k$ consecutive matches of the k-chain.
- Claim: When a k-chain is split into a number of non-overlapping consecutive sub-chains, the normalized value of the k-chain is smaller than or equal to that of its best sub-chain.
- Result: The normalized value of any k-chain ($k \ge M$) is smaller than or equal to the value of its best sub-chain with M to 2M-1 matches.
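A short derivation of the claim, under notation introduced only for this note: the k-chain has length $\ell$, and it is split into $t$ sub-chains with $k_1,\dots,k_t$ matches and lengths $\ell_1,\dots,\ell_t$.

```latex
\begin{align*}
  k &= \sum_{s=1}^{t} k_s, \qquad
  \ell \;\ge\; \sum_{s=1}^{t} \ell_s
  && \text{(positions between sub-chains can only add length)} \\
  v &:= \max_{s} \frac{k_s}{\ell_s}
  \;\Longrightarrow\; k_s \le v\,\ell_s \ \text{for every } s \\
  k &= \sum_{s} k_s \;\le\; v \sum_{s} \ell_s \;\le\; v\,\ell
  \;\Longrightarrow\; \frac{k}{\ell} \le v .
\end{align*}
```

So the whole chain can never beat its best piece, whatever the split.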
34Computing the highest normalized value
- Sub-chains of fewer than M matches may not be reported.
- Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches.
- Is it sufficient to construct all the sub-chains of exactly M matches?
- No - Sub-chains of M+1 to 2M-1 matches cannot be split into sub-chains of M matches.
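The splitting argument can be illustrated with a tiny helper that returns the number of matches in each piece. The name `split_sizes` and the choice of letting the last piece absorb the remainder are specific to this sketch.

```python
def split_sizes(k: int, m: int) -> list[int]:
    """Split a chain of k >= m matches into consecutive pieces of m to 2m-1 matches."""
    if m < 1 or k < m:
        raise ValueError("need k >= m >= 1")
    pieces = [m] * (k // m)   # as many pieces of exactly m matches as fit
    pieces[-1] += k % m       # the remainder (< m) joins the last piece
    return pieces
```

For example, `split_sizes(7, 3)` gives `[3, 4]`, while `split_sizes(5, 3)` gives `[5]`: a chain of 5 matches cannot be cut into pieces of exactly 3.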
35Computing the highest normalized value
- The algorithm: For each match construct all the k-chains, for $k \le 2M-1$.
- The algorithm constructs all these chains, which are, in fact, the sub-chains of all the longer k-chains.
- A longer chain cannot be better than its best sub-chain.
- This algorithm is able to report the highest normalized value of a sub-chain (of at least M matches), which is equal to the highest normalized value of a chain of at least M matches.
36Constructing the longest optimal alignment
Definition: A perfect alignment is an alignment of two identical strings. Its normalized value is ½.
Unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.
37Constructing the longest optimal alignment
- Assume there is a chain with more than 2M-1 matches whose normalized value is optimal, denoted by LB.
- LB may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches.
- The normalized value of each such sub-chain must be equal to that of LB; otherwise LB is not optimal.
- Each such sub-chain must start and end at a match; otherwise the normalized value of the chain comprised of the same matches will be higher than that of LB.
38Constructing the longest optimal alignment
- Note that if we concatenate two optimal sub-chains where the head of the second is next to the tail of the first, the concatenated chain is optimal: 10/30 and 10/30 give 20/60.
- When the head of the second is not next to the tail of the first, the concatenated chain is not optimal: 10/30 and 10/30 separated by a gap of 0/2 give 20/62.
- The tails and heads of the sub-chains of which LB is comprised must be next to each other.
39Constructing the longest optimal alignment
- If the tails and heads of the optimal sub-chains of which LB is comprised are next to each other, then their concatenation (i.e. LB) is optimal. Let's examine the first two sub-chains: each has value M/L, and together they have value 2M/2L.
- But what happens if we examine the following sub-chain?
[Figure: two adjacent sub-chains, each labelled M/L.]
- Its number of matches is M+1 and its length is L+2.
- Since $M/L < \tfrac{1}{2}$, $(M+1)/(L+2) > M/L$. Thus, we found a chain of M+1 matches whose normalized value is higher than that of LB, in contradiction to the optimality of LB.
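The inequality used in the last step follows from cross-multiplication, since $L$ and $L+2$ are positive:

```latex
\begin{align*}
  \frac{M+1}{L+2} > \frac{M}{L}
  &\iff L\,(M+1) > M\,(L+2) \\
  &\iff LM + L > LM + 2M \\
  &\iff L > 2M
  \iff \frac{M}{L} < \frac{1}{2}.
\end{align*}
```

The condition fails exactly when $M/L = \tfrac{1}{2}$, which is the perfect alignment excluded by the statement.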
40Closing remarks
41The advantages of the new algorithm
- The first algorithm to combine the normalized local and the sparse approaches.
- Ideal for textual local comparison (where the sparsity is typically dramatic) as well as for screening biological sequences.
- As a normalized alignment algorithm, it does not suffer from the weaknesses of non-normalized algorithms.
- A straightforward approach to the minimal constraint, which is easy to control and understand and, at the same time, does not require reformulation of the original problem.
- The minimal constraint is problem related rather than input related.
42The end