Sparse Normalized Local Alignment, by Nadav Efraty and Gad M. Landau


1
Sparse Normalized Local Alignment
Nadav Efraty, Gad M. Landau
2
Goal

To find the substrings, X and Y, whose
normalized alignment value
LCS(X,Y) / (|X| + |Y|)
is the highest,
or higher than a predefined similarity level.

Introduction
The O(rL log log n) normalized local LCS algorithm
The O(rM log log n) normalized local LCS algorithm
Conclusions and open problems

4
Introduction
5
Background - Global similarity

LCS: the longest common subsequence, found by
computing a dynamic programming
table of size (n+1) x (m+1):
T(i,0) = T(0,j) = 0
for all i,j (1 ≤ i ≤ m, 1 ≤ j ≤ n):
if Xj = Yi then T(i,j) = T(i-1,j-1) + 1,
else T(i,j) = max{ T(i-1,j), T(i,j-1) }
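
The recurrence above can be written out directly. The following is a minimal sketch, assuming 0-based Python strings with rows indexed by Y and columns by X; the function names `lcs_table` and `lcs_length` are illustrative and not taken from the talk.

```python
def lcs_table(x: str, y: str) -> list[list[int]]:
    """Fill the (m+1) x (n+1) table T with T[i][j] = LCS(y[:i], x[:j])."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay 0: T(i,0) = T(0,j) = 0.
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[j - 1] == y[i - 1]:
                # A match extends the diagonal neighbor by one.
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    return lcs_table(x, y)[len(y)][len(x)]
```

The table is filled in O(nm) time regardless of how many cells are matches, which is the cost the sparse approach later in the talk tries to avoid.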

6
Background - Global similarity
The naive LCS algorithm
Xj = Yi: T(i,j) = T(i-1,j-1) + 1
Xj ≠ Yi: T(i,j) = max{ T(i,j-1), T(i-1,j) }
7
Background - Global similarity
The typical staircase shape of the layers in the
matrix
8
Background - Global similarity

Edit distance measures the minimal number of
operations that are required to transform one
string into another one.
Operations:
substitution,
deletion,
insertion.

9
Background - Local similarity

The Smith-Waterman algorithm (1981)
T(i,0) = T(0,j) = 0,
for all i,j (1 ≤ i ≤ m, 1 ≤ j ≤ n):
T(i,j) = max{ T(i-1,j-1) + S(Yi,Xj), T(i-1,j) + D(Yi),
T(i,j-1) + I(Xj), 0 }
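
As an illustration of this recurrence, here is a small sketch that returns only the best local score. The scoring functions `unit_sub` and `unit_gap` (+1 for a match, -1 for a mismatch or a gap) are assumptions chosen for the example; the slide leaves S, D and I unspecified.

```python
def unit_sub(a: str, b: str) -> int:
    """Hypothetical substitution score S: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def unit_gap(_c: str) -> int:
    """Hypothetical deletion/insertion score (D and I): -1 per symbol."""
    return -1


def smith_waterman(x, y, sub=unit_sub, dele=unit_gap, ins=unit_gap) -> int:
    """Best local alignment score of x and y under the given scores."""
    n, m = len(x), len(y)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t[i][j] = max(
                t[i - 1][j - 1] + sub(y[i - 1], x[j - 1]),  # align Yi with Xj
                t[i - 1][j] + dele(y[i - 1]),               # delete Yi
                t[i][j - 1] + ins(x[j - 1]),                # insert Xj
                0,                                          # start a new alignment
            )
            best = max(best, t[i][j])
    return best
```

The `0` option is what makes the alignment local: a prefix with a negative score is dropped rather than carried along.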

10
Background - Local similarity
The weaknesses of the Smith-Waterman algorithm

Mosaic effect: lack of ability to discard poorly
conserved intermediate segments.

Shadow effect: short, but more biologically
important, alignments may not be detected because
they are overlapped by longer (and less
important) alignments.

The sparsity of the essential data is not
exploited.

The solution: Normalization
The statistical significance of the local
alignment depends on both its score and
length. Instead of searching for an alignment
that maximizes the score S(X,Y), search for
the alignment that maximizes
S(X,Y) / (|X| + |Y|).

Arslan, Egecioglu, and Pevzner (2001) use a
mathematical technique that allows convergence to
the optimal alignment value through iterations of
the Smith-Waterman algorithm.
They maximize SCORE(X,Y) / (|X| + |Y| + L), where L is a
constant that controls the amount of
normalization.
Time complexity: O(n^2 log n).

13
Our approach

The degree of similarity is defined as
LCS(X,Y) / (|X| + |Y|).
M - a minimal length constraint.
Similarity level.

14
The O(rL log log n) normalized local LCS algorithm
15
Definitions

A chain is a sequence of matches that is strictly
increasing in both components.

The length of a chain from match (i,j) to match
(i',j') is i'-i + j'-j, that is, the length of the
substrings which create the chain.

A k-chain(i,j) is the shortest chain of k
matches starting from (i,j).

The normalized value of k-chain(i,j) is k
divided by its length.
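
These definitions can be made concrete with a few helper functions. This is a sketch under one stated assumption: the length counts both end positions of each substring, so a chain from (i,j) to (i',j') has length (i'-i+1) + (j'-j+1). That reading makes an alignment of two identical strings score exactly 1/2, as the talk states later; the slide's own formula is written as a plain coordinate difference. The function names are illustrative.

```python
from fractions import Fraction


def is_chain(matches) -> bool:
    """True if the matches are strictly increasing in both components."""
    return all(i1 < i2 and j1 < j2
               for (i1, j1), (i2, j2) in zip(matches, matches[1:]))


def chain_length(matches) -> int:
    """Combined length of the two substrings spanned by the chain.

    Assumption: both end positions are counted (inclusive spans).
    """
    (i, j), (i2, j2) = matches[0], matches[-1]
    return (i2 - i + 1) + (j2 - j + 1)


def normalized_value(matches) -> Fraction:
    """Number of matches divided by the chain length, as an exact fraction."""
    if not matches or not is_chain(matches):
        raise ValueError("not a chain")
    return Fraction(len(matches), chain_length(matches))
```

For example, the chain (1,1), (2,2), (3,3) has 3 matches over substrings of total length 6, so its normalized value is 1/2, while (1,1), (3,2) has value 2/5.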

16
The algorithm

For each match (a,b), construct k-chain(a,b) for
1 ≤ k ≤ L (L = LCS(X,Y)).
Examine all the k-chains with k ≥ M, starting from
each match, and report either
the k-chains with the highest normalized value, or
the k-chains whose normalized value exceeds a
predefined threshold.

Problem: k-chain(a,b) is not necessarily the prefix of
(k+1)-chain(a,b).

18
Solution: (k+1)-chain(a,b) is built by concatenating
(a,b) to a k-chain(i,j) whose head (i,j) is below and to
the right of (a,b).
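
Both the problem and the solution can be seen in a brute-force reference. It is not the sparse algorithm of the talk: it tries every match below and to the right, which costs roughly O(r^2 L) for r matches. The function name `shortest_chains` and the choice to compare chains by the row-plus-column sum of their last match are assumptions of this sketch (for a fixed head, a smaller sum means a shorter chain).

```python
def shortest_chains(matches):
    """Map each match to {k: shortest chain of k matches starting there}.

    `matches` is a list of (row, column) pairs. A (k+1)-chain from (a, b)
    is (a, b) followed by the k-chain of some match strictly below and to
    the right of it.
    """
    chains = {m: {1: [m]} for m in matches}
    # Bottom row first, so every candidate successor is already finished.
    for head in sorted(matches, reverse=True):
        a, b = head
        for nxt in matches:
            if nxt[0] > a and nxt[1] > b:
                for k, tail in chains[nxt].items():
                    cand = [head] + tail
                    cur = chains[head].get(k + 1)
                    # For a fixed head, the chain ending at the smaller
                    # row+column sum is the shorter one.
                    if cur is None or sum(cand[-1]) < sum(cur[-1]):
                        chains[head][k + 1] = cand
    return chains


example = [(0, 0), (1, 3), (3, 2), (4, 3)]
by_head = shortest_chains(example)
two = by_head[(0, 0)][2]    # [(0, 0), (1, 3)]
three = by_head[(0, 0)][3]  # [(0, 0), (3, 2), (4, 3)]
```

In this example the shortest 2-chain from (0,0) goes through (1,3), but no match lies below and to the right of (1,3), so the shortest 3-chain has to go through (3,2) instead. The 2-chain is therefore not a prefix of the 3-chain, while the 3-chain is still (0,0) followed by the 2-chain of (3,2).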
19
Question: How can we find the proper match
(i,j), the head of the k-chain that
should be concatenated to (a,b) in order to
construct (k+1)-chain(a,b)?
20

Definitions
Range: The range of a match (i,j) is
(0…i-1, 0…j-1).
Mutual range: An area of the table which is
overlapped by at least two ranges of distinct
matches.
Owner: (i,j) is the owner of the range where
k-chain(i,j) is the suffix of (k+1)-chain(a,b)
for any match (a,b) in the range.
L separate lists of ranges and their owners are
maintained by the algorithm.

If (a,b) is in the range of a single match
(i,j) (it is not in a mutual range),
k-chain(i,j) would be the suffix of
(k+1)-chain(a,b).
If (a,b) is in the mutual range of two matches,
how can we determine which of them should be
concatenated to (a,b)?
Lemma: A mutual range of two matches is owned
completely by one of them.

Lemma: A mutual range of two matches, p (= (i,j))
and q (= (i',j')), is owned completely by one of
them.
Proof: There are two distinct cases.
Case 1: i ≤ i' and j ≤ j'

[Figure: the table, with X along the columns (0 to n) and Y along the rows (0 to m), showing the matches (i,j) and (i',j').]
23
Case 2: i < i' and j > j'. The mutual range of p and
q is (0…i-1, 0…j'-1). Entry (i-1,j'-1) is
the mutual point (MP) of p and q. p will be the
owner of the mutual range if Lp + (j-j') ≤
Lq + (i'-i), where Lp and Lq are the lengths of
k-chain(p) and k-chain(q).
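
The test in Case 2 does not mention the position (a,b) inside the mutual range: reaching p costs (i-a) rows and (j-b) columns, reaching q costs (i'-a) and (j'-b), and the terms in a and b cancel when the two totals are compared. That is why one match owns the whole mutual range. A minimal sketch of the comparison follows; the function name is hypothetical, and ties are given to p, which is one possible reading of the slide.

```python
def owner_of_mutual_range(p, len_p, q, len_q):
    """Return whichever of p, q owns their mutual range (Case 2).

    p = (i, j) and q = (i2, j2) with i < i2 and j > j2; len_p and len_q
    are the lengths of k-chain(p) and k-chain(q).
    """
    (i, j), (i2, j2) = p, q
    if not (i < i2 and j > j2):
        raise ValueError("expected p above and to the right of q")
    # From the mutual point, p is (j - j2) columns further away and
    # q is (i2 - i) rows further away.
    return p if len_p + (j - j2) <= len_q + (i2 - i) else q
```

For instance, with p = (2,7), q = (5,3), Lp = 4 and Lq = 6, the two sides are 8 and 9, so p owns the mutual range; if Lq were 4 the sides would be 8 and 7 and q would own it.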
24
The algorithm

Preprocessing.
Process the matches row by row, from the bottom up.
For the matches of row i:
Stage 1: Construct the k-chains, 1 ≤ k ≤ L, for all the
matches in row i, using the L lists of ranges
and owners.
Stage 2: Update the lists of ranges and owners
with the matches of row i and their k-chains.
Examine the k-chains of all matches and report
the ones with the highest normalized value.

25
Stage 2

Let LROk be the list of ranges and owners that
are the heads of k-chains.
Insert each match (i,j) of row i which is the
head of a k-chain to LROk.
If there is already another match with column
coordinate j, extract it from LROk.

26
Stage 2 cont

While (i',j'), the left neighbor of
(i,j) in LROk, satisfies
(length of k-chain(i',j') + i'-i) ≥ (length of
k-chain(i,j) + j-j'),
(i',j') should be extracted from LROk.

27
Stage 1

Constructing (k+1)-chain(i,j): concatenate
(i,j) to the match in LROk which is the owner of
the range containing (i,j).
Record the value of (k+1)-chain(i,j) with the
match (i,j).

28
Reporting the best alignments

The best alignment is either the alignment with
the highest normalized value or the alignments
whose similarity exceeds a predefined value.
Check all the k-chains, k ≥ M, starting from each
match and report the best alignments.
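
Stage 1, Stage 2 and the reporting step can be put together in one short sketch. It departs from the talk in three labeled ways. First, each LROk is a pair of sorted Python lists searched with `bisect` rather than a Johnson tree, so the O(log log n) bound per operation is not achieved. Second, each entry stores the row-plus-column sum of the last match of its k-chain instead of the chain length; comparing these sums appears to be equivalent to the length comparison on the Stage 2 slide, since the row and column offsets cancel. Third, chain length counts both end positions, so a single match has value 1/2. All names are illustrative.

```python
from bisect import bisect_left, bisect_right
from fractions import Fraction


def sparse_normalized_lcs(x: str, y: str, min_matches: int = 1) -> Fraction:
    """Highest k/length over shortest k-chains with k >= min_matches."""
    columns_of = {}                      # symbol -> sorted columns in x
    for j, ch in enumerate(x):
        columns_of.setdefault(ch, []).append(j)
    cols, ends = [], []                  # cols[k-1], ends[k-1] represent LRO_k
    best = Fraction(0)
    for a in range(len(y) - 1, -1, -1):  # rows from the bottom up
        row = columns_of.get(y[a], [])
        # Stage 1: build the chains of every match in the row from the lists.
        chains = []
        for b in row:
            end_sums = [a + b]           # 1-chain: the match alone
            for level in range(len(cols)):
                pos = bisect_right(cols[level], b)   # first column > b
                if pos == len(cols[level]):
                    break                # no k-chain head below and right
                end_sums.append(ends[level][pos])
            chains.append(end_sums)
        # Reporting and Stage 2: record values, then update the lists.
        for b, end_sums in zip(row, chains):
            for level, e in enumerate(end_sums):
                k = level + 1
                if k >= min_matches:
                    best = max(best, Fraction(k, e - (a + b) + 2))
                if level == len(cols):
                    cols.append([])
                    ends.append([])
                c, en = cols[level], ends[level]
                pos = bisect_left(c, b)
                if pos < len(c) and c[pos] == b:     # same column: replace
                    del c[pos], en[pos]
                while pos > 0 and en[pos - 1] >= e:  # dominated left neighbors
                    pos -= 1
                    del c[pos], en[pos]
                c.insert(pos, b)
                en.insert(pos, e)
    return best
```

With `sparse_normalized_lcs("abc", "abc", 2)` the result is 1/2, the value of a perfect alignment. Only the value is returned here; recovering the substrings would need the head and last match to be stored alongside each value.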

29
Complexity analysis

Preprocessing: O(n log ΣY).
Stage 1:
For each of the r matches we construct at most L
k-chains.
Using a Johnson tree, stage 1 is computed in
O(rL log log n) time.
Stage 2: Each of the r matches is inserted into and
extracted from each LROk at most once.
Total: O(rL log log n) time.

30
Complexity analysis

Reporting the best alignments is done in O(rL)
time.
Total time complexity of this algorithm is
O(n log ΣY + rL log log n).
Space complexity is O(rL + nL).

31
The O(rM log log n) normalized local LCS algorithm
32
The O(rM log log n) normalized local LCS algorithm

Reports:
The normalized alignment value of the best
possible local alignment (value and
substrings).

33
Computing the highest normalized value

Definition: A sub-chain of a k-chain is a path
that contains a
sequence of x ≤ k consecutive matches of the
k-chain.
Claim: When a k-chain is split into a number of
non-overlapping consecutive sub-chains, the
normalized value of
the k-chain is smaller than or equal to that of its
best sub-chain.
Result: The normalized value of any k-chain (k ≥ M)
is smaller than or
equal to the value of its best sub-chain with M
to 2M-1
matches.
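
The claim can be checked with a short calculation. The symbols below are introduced only for this sketch: the k-chain has length ℓ and is split into sub-chains with k_t matches and lengths ℓ_t. It assumes that the sub-chain lengths add up to at most ℓ, because the sub-chains span disjoint parts of the two substrings and any gap between consecutive sub-chains only adds to ℓ.

```latex
\[
\frac{k}{\ell}
\;\le\; \frac{\sum_t k_t}{\sum_t \ell_t}
\;=\; \frac{\sum_t \ell_t \cdot \frac{k_t}{\ell_t}}{\sum_t \ell_t}
\;\le\; \max_t \frac{k_t}{\ell_t},
\qquad\text{using } \sum_t k_t = k \text{ and } \sum_t \ell_t \le \ell .
\]
```

The middle term is a weighted average of the sub-chain values, and an average never exceeds its largest term.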

34
Computing the highest normalized value

Sub-chains of fewer than M matches may not be
reported.
Sub-chains of 2M matches or more can be split
into shorter sub-chains of M to 2M-1 matches.
Is it sufficient to construct all the sub-chains
of exactly M matches?

No: sub-chains of M+1 to 2M-1 matches cannot be
split into sub-chains of M matches.
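
The splitting argument is easy to make explicit. The sketch below returns one valid split of k matches into pieces of M to 2M-1 matches: as many pieces of exactly M as possible, with the remainder absorbed by the last piece. The function name `split_sizes` is illustrative.

```python
def split_sizes(k: int, m: int) -> list[int]:
    """Split k >= m matches into pieces of between m and 2m-1 matches."""
    if m < 1 or k < m:
        raise ValueError("need k >= m >= 1")
    full, rest = divmod(k, m)
    # (full - 1) pieces of exactly m, then one piece of m + rest <= 2m - 1.
    return [m] * (full - 1) + [m + rest]
```

For M = 3, a chain of 7 matches splits into pieces of 3 and 4 matches, while chains of 4 or 5 matches stay in one piece, which is why sub-chains of exactly M matches are not enough.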

35
Computing the highest normalized value

The algorithm: For each match, construct all the
k-chains for k ≤ 2M-1.
The algorithm constructs all these chains, which
are, in fact, the sub-chains of all the longer
k-chains.
A longer chain cannot be better than its best
sub-chain.

This algorithm is able to report the highest
normalized value of a sub-chain (of at least M
matches) which is equal to the highest normalized
value of a chain of at least M matches.

36
Constructing the longest optimal alignment
Definition: A perfect alignment is an alignment
of two identical strings. Its normalized value is
½.
Unless the optimal alignment is perfect, the
longest optimal alignment has no more than 2M-1
matches.
37

Constructing the longest optimal alignment

Assume there is a chain with more than
2M-1 matches whose normalized value is
optimal, denoted by LB.

LB may be split into a number of sub-chains of M
matches, followed by a single sub-chain of
between M and 2M-1 matches.

The normalized value of each such sub-chain must
be equal to that of LB, otherwise, LB is not
optimal.

Each such sub-chain must start and end at a
match, otherwise, the normalized value of the
chain comprised of the same matches will be
higher than that of LB.

38
Constructing the longest optimal alignment

Note that if we concatenate two optimal
sub-chains where the head of the second is next
to the tail of the first, the concatenated chain
is optimal:

10/30 followed directly by 10/30 gives 20/60.

When the head of the second is not next to the
tail of the first, the concatenated chain is not
optimal:

10/30, then a gap of 0/2, then 10/30 gives 20/62.

The tails and heads of the sub-chains of which
LB is comprised must be next to each other.

39
Constructing the longest optimal alignment

If the tails and heads of the optimal sub-chains
from which LB is comprised are next to each other
then their concatenation (i.e. LB) is optimal.
Let's examine the first two sub-chains:

M/L followed directly by M/L gives 2M/2L.

But what happens if we examine the following
sub-chain: the first sub-chain together with the
head of the second?

Its number of matches is M+1 and its length is
L+2.
Since M/L < ½, (M+1)/(L+2) > M/L. Thus, we found a
chain of M+1 matches whose normalized value is
higher than that of LB, in contradiction to the
optimality of LB.
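
The inequality used in the last step follows from a one-line computation, written here with the slide's own symbols M and L (L positive):

```latex
\[
\frac{M+1}{L+2} - \frac{M}{L}
= \frac{L(M+1) - M(L+2)}{L(L+2)}
= \frac{L - 2M}{L(L+2)} > 0
\iff L > 2M
\iff \frac{M}{L} < \frac{1}{2}.
\]
```

So the extended sub-chain is strictly better exactly when the optimal value is below ½, that is, when the optimal alignment is not perfect.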

40
Closing remarks
41
The advantages of the new algorithm

The first algorithm to combine normalized
local alignment with sparse techniques.
Ideal for textual local comparison (where the
sparsity is typically dramatic) as well as for
screening bio sequences.
As a normalized alignment algorithm, it does not
suffer from the weaknesses from which
non-normalized algorithms suffer.
A straightforward approach to the minimal
constraint, which is easy to control and
understand and, at the same time, does not
require reformulation of the original problem.
The minimal constraint is problem related rather
than input related.