Sparse Normalized Local Alignment - Nadav Efraty, Gad M. Landau

1
Sparse Normalized Local Alignment
Nadav Efraty, Gad M. Landau
2
Goal
  • To find the substrings, X and Y, whose normalized alignment value
  • LCS(X,Y)/(|X|+|Y|)
  • is the highest,
  • or higher than a predefined similarity level.

3
  • Introduction
  • The O(rL log log n) normalized local LCS algorithm
  • The O(rM log log n) normalized local LCS algorithm
  • Conclusions and open problems

4
Introduction
5
Background - Global similarity
  • LCS -
  • computing a dynamic programming table of size (n+1)x(m+1)
  • T(i,0) = T(0,j) = 0
  • for all i,j (1 ≤ i ≤ m, 1 ≤ j ≤ n):
  • if Xj = Yi then T(i,j) = T(i-1,j-1) + 1,
  • else T(i,j) = max{T(i-1,j), T(i,j-1)}

6
Background - Global similarity
The naive LCS algorithm
Xj = Yi : T(i,j) = T(i-1,j-1) + 1
Xj ≠ Yi : T(i,j) = max{T(i,j-1), T(i-1,j)}
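A minimal Python sketch of this recurrence (the helper name lcs_table is hypothetical, not from the presentation):

```python
def lcs_table(X, Y):
    """Naive LCS dynamic programming table; a sketch of the recurrence above.

    Rows correspond to Y (length m), columns to X (length n);
    T[i][j] holds the LCS length of Y[:i] and X[:j].
    """
    m, n = len(Y), len(X)
    T = [[0] * (n + 1) for _ in range(m + 1)]      # T(i,0) = T(0,j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[j - 1] == Y[i - 1]:               # Xj = Yi
                T[i][j] = T[i - 1][j - 1] + 1
            else:                                  # Xj != Yi
                T[i][j] = max(T[i][j - 1], T[i - 1][j])
    return T

# Example: the LCS of "BDCABA" and "ABCBDAB" has length 4.
print(lcs_table("BDCABA", "ABCBDAB")[-1][-1])
```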
7
Background - Global similarity
The typical staircase shape of the layers in the
matrix
8
Background - Global similarity
  • Edit distance measures the minimal number of
    operations that are required to transform one
    string into another one.
  • Operations:
  • substitution,
  • deletion,
  • insertion.

9
Background - Local similarity
  • The Smith-Waterman algorithm (1981)
  • T(i,0) = T(0,j) = 0,
  • for all i,j (1 ≤ i ≤ m, 1 ≤ j ≤ n):
  • T(i,j) = max{T(i-1,j-1) + S(Yi,Xj), T(i-1,j) + D(Yi), T(i,j-1) + I(Xj), 0}
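A minimal Python sketch of this recurrence (the scoring values below, +1 for a match and -1 for a mismatch, deletion, or insertion, are illustrative assumptions rather than the scores used in the presentation):

```python
def smith_waterman(X, Y, match=1, mismatch=-1, indel=-1):
    """Smith-Waterman local alignment score; a sketch of the recurrence above.

    The extra 0 in the max lets an alignment restart anywhere, which is what
    makes the algorithm local rather than global.
    """
    m, n = len(Y), len(X)
    T = [[0] * (n + 1) for _ in range(m + 1)]      # T(i,0) = T(0,j) = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if Y[i - 1] == X[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s,     # substitution / match
                          T[i - 1][j] + indel,     # deletion
                          T[i][j - 1] + indel,     # insertion
                          0)                       # restart the alignment
            best = max(best, T[i][j])
    return best

print(smith_waterman("AXXXBC", "ABCYYY"))          # prints 2 (the "BC" vs "BC" block)
```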

10
Background - Local similarity
The weaknesses of the Smith-Waterman algorithm
  • Mosaic effect - Lack of ability to discard poorly conserved intermediate segments.
  • Shadow effect - Short but more biologically important alignments may not be detected because they are overlapped by longer (and less important) alignments.
  • The sparsity of the essential data is not
    exploited.



11
  • The solution - Normalization
  • The statistical significance of the local alignment depends on both its score and length. Instead of searching for an alignment that maximizes the score S(X,Y), search for the alignment that maximizes S(X,Y)/(|X|+|Y|).

12
  • Arslan, Egecioglu and Pevzner (2001) use a mathematical technique that allows convergence to the optimal alignment value through iterations of the Smith-Waterman algorithm.
  • SCORE(X,Y)/(|X|+|Y|+L), where L is a constant that controls the amount of normalization.
  • O(n² log n).

13
Our approach
  • The degree of similarity is defined as LCS(X,Y)/(|X|+|Y|).
  • M - a minimal length constraint.
  • Similarity level.

14
The O(rL log log n) normalized local LCS algorithm
15
Definitions
  • A chain is a sequence of matches that is strictly
    increasing in both components.
  • The length of a chain from match (i,j) to match (i',j') is (i'-i+1)+(j'-j+1), that is, the total length of the two substrings which create the chain.
  • A k-chain(i,j) is the shortest chain of k
    matches starting from (i,j).
  • The normalized value of k-chain(i,j) is k
    divided by its length.
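As a concrete illustration of these definitions, a small sketch (Python; chain_stats is a hypothetical helper, and the length is taken as the total length of the two substrings spanned by the chain, consistent with the examples later in the presentation):

```python
def chain_stats(chain):
    """Length and normalized value of a chain of matches (a sketch).

    chain: a list of (i, j) match coordinates, strictly increasing in both
    components. The length is the total length of the two substrings spanned
    by the chain; the normalized value is k divided by that length.
    """
    (i0, j0), (ik, jk) = chain[0], chain[-1]
    k = len(chain)
    length = (ik - i0 + 1) + (jk - j0 + 1)
    return length, k / length

# A perfect 3-chain of consecutive matches: length 6, normalized value 0.5.
print(chain_stats([(2, 3), (3, 4), (4, 5)]))
# A sparser 3-chain: length 8, normalized value 0.375.
print(chain_stats([(2, 3), (4, 4), (6, 5)]))
```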

16
The algorithm
  • For each match (a,b), construct k-chain(a,b) for 1 ≤ k ≤ L (L = LCS(X,Y)).
  • Examine all the k-chains with k ≥ M, starting from each match, and report either
  • the k-chains with the highest normalized value, or
  • the k-chains whose normalized value exceeds a predefined threshold.
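A high-level outline of this procedure (a Python-style sketch with hypothetical names; the two stages are sketched in code on the later slides):

```python
def normalized_local_lcs(matches_by_row, L, M):
    """Outline sketch of the algorithm; not the presentation's own pseudocode.

    matches_by_row: the r matches of X and Y, grouped by row index.
    chain_len records, for every match and every k, the length of k-chain(match).
    """
    LRO = {k: [] for k in range(1, L + 1)}        # L lists of ranges and owners
    chain_len = {}                                # (match, k) -> length of its k-chain
    for i in sorted(matches_by_row, reverse=True):    # process the rows bottom up
        # Stage 1: construct k-chains, 1 <= k <= L, for the matches of row i,
        #          using the L lists of ranges and owners.
        # Stage 2: update the lists with the matches of row i and the k-chains
        #          they head.
        pass
    # Report the k-chains with k >= M and the highest normalized value k / length.
    candidates = [(k / length, match, k)
                  for (match, k), length in chain_len.items() if k >= M]
    return max(candidates, default=None)
```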

17
  • Problem - k-chain(a,b) is not necessarily the prefix of (k+1)-chain(a,b).

18
Solution - (k+1)-chain(a,b) is constructed by concatenating (a,b) to a k-chain(i,j) whose head (i,j) is below and to the right of (a,b).
19
Question - How can we find the proper match (i,j) which is the head of the k-chain that should be concatenated to (a,b) in order to construct (k+1)-chain(a,b)?
20
  • Definitions
  • Range - The range of a match (i,j) is (0…i-1, 0…j-1).
  • Mutual range - An area of the table which is overlapped by at least two ranges of distinct matches.
  • Owner - (i,j) is the owner of a range if k-chain(i,j) is the suffix of (k+1)-chain(a,b) for every match (a,b) in that range.
  • L separate lists of ranges and their owners are maintained by the algorithm.

21
  • If (a,b) is in the range of a single match (i,j) (it is not in a mutual range), k-chain(i,j) is the suffix of (k+1)-chain(a,b).
  • If (a,b) is in the mutual range of two matches, how can we determine which of them should be concatenated to (a,b)?
  • Lemma - A mutual range of two matches is owned completely by one of them.

22
  • Lemma - A mutual range of two matches, p = (i,j) and q = (i',j'), is owned completely by one of them.
  • Proof - There are two distinct cases.
  • Case 1: i ≤ i' and j ≤ j'. The range of p is then contained in the range of q, so their mutual range is the entire range of p.

(Diagram: the ranges of p = (i,j) and q = (i',j') in the dynamic programming table for Case 1.)
23
Case 2: i < i' and j > j'. The mutual range of p and q is (0…i-1, 0…j'-1). Entry (i-1, j'-1) is the mutual point (MP) of p and q. p will be the owner of the mutual range if Lp + (j-j') ≤ Lq + (i'-i), where Lp and Lq are the lengths of k-chain(p) and k-chain(q).
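A sketch of this ownership test (Python; the function and variable names are hypothetical, and the example values are made up):

```python
def owner_of_mutual_range(p, q, Lp, Lq):
    """Decide which of two matches owns their mutual range (Case 2, a sketch).

    p = (i, j) and q = (ip, jp) with i < ip and j > jp; Lp and Lq are the
    lengths of k-chain(p) and k-chain(q). For a match (a, b) in the mutual
    range, the (k+1)-chain through p has length Lp + (i - a) + (j - b) and the
    one through q has length Lq + (ip - a) + (jp - b); the comparison does not
    depend on (a, b), which is why one match owns the whole range.
    """
    i, j = p
    ip, jp = q
    return p if Lp + (j - jp) <= Lq + (ip - i) else q

# Hypothetical example: p = (5, 9), q = (7, 4), k-chain lengths 10 and 14.
print(owner_of_mutual_range((5, 9), (7, 4), 10, 14))   # -> (5, 9)
```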
24
The algorithm
  • Preprocessing.
  • Process the matches row by row, from bottom up. For the matches of row i:
  • Stage 1 - Construct k-chains, 1 ≤ k ≤ L, for all the matches of row i, using the L lists of ranges and owners.
  • Stage 2 - Update the lists of ranges and owners with the matches of row i and their k-chains.
  • Examine the k-chains of all matches and report the ones with the highest normalized value.

25
Stage 2
  • Let LROk be the list of ranges whose owners are the heads of k-chains.
  • Insert into LROk each match (i,j) of row i which is the head of a k-chain.
  • If there is already another match with column coordinate j, extract it from LROk.

26
Stage 2 (cont.)
  • While, for (i',j'), the left neighbor of (i,j) in LROk,
  • (length of k-chain(i',j') + i'-i) ≥ (length of k-chain(i,j) + j-j'),
  • (i',j') should be extracted from LROk.

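A sketch of the Stage 2 update for a single list LROk (Python; LROk is kept here as a plain list of (column, match) pairs sorted by column, whereas the paper's Johnson tree over the columns is what yields the O(log log n) factor; chain_len is a hypothetical map from a match to the length of its k-chain):

```python
import bisect

def stage2_update(LROk, row_matches, chain_len):
    """Stage 2 sketch: insert the matches of row i that head k-chains into LROk.

    All entries already in LROk come from rows below row i.
    """
    for (i, j) in row_matches:
        cols = [c for c, _ in LROk]
        pos = bisect.bisect_left(cols, j)
        # If another match with column coordinate j is already present, extract it.
        if pos < len(LROk) and LROk[pos][0] == j:
            LROk.pop(pos)
        LROk.insert(pos, (j, (i, j)))
        # Extract left neighbours (i', j') while
        #   length of k-chain(i', j') + (i' - i) >= length of k-chain(i, j) + (j - j').
        while pos > 0:
            jp, (ip, _) = LROk[pos - 1]
            if chain_len[(ip, jp)] + (ip - i) >= chain_len[(i, j)] + (j - jp):
                LROk.pop(pos - 1)
                pos -= 1
            else:
                break
    return LROk
```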
27
Stage 1
  • Constructing (k+1)-chain(i,j): concatenate (i,j) to the match in LROk which is the owner of the range containing (i,j).
  • Record the value of (k+1)-chain(i,j) with the match (i,j).

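A sketch of the Stage 1 lookup for a single match (i,j) and a single list LROk (Python; it assumes that, after the Stage 2 pruning, the owner of the range containing (i,j) is found with a successor query on the column coordinate, which is the query the paper answers with a Johnson tree):

```python
import bisect

def stage1_extend(i, j, LROk, chain_len):
    """Stage 1 sketch: build (k+1)-chain(i, j) from the list LROk.

    LROk holds (column, match) pairs, sorted by column, for the matches below
    row i that head k-chains. Returns the owner and the length of
    (k+1)-chain(i, j), or None if no k-chain starts below and to the right
    of (i, j).
    """
    cols = [c for c, _ in LROk]
    pos = bisect.bisect_right(cols, j)             # smallest column strictly > j
    if pos == len(LROk):
        return None
    jo, owner = LROk[pos]
    io = owner[0]
    # Chain lengths are total substring lengths, so putting (i, j) at the head
    # adds (io - i) + (jo - j) to the owner's k-chain length.
    return owner, chain_len[owner] + (io - i) + (jo - j)
```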
28
Reporting the best alignments
  • The best alignment is either the alignment with the highest normalized value or an alignment whose similarity exceeds a predefined value.
  • Check all the k-chains, k ≥ M, starting from each match and report the best alignments.

29
Complexity analysis
  • Preprocessing - O(n log|ΣY|).
  • Stage 1 -
  • For each of the r matches we construct at most L k-chains.
  • Using a Johnson tree, stage 1 is computed in O(rL log log n) time.
  • Stage 2 - Each of the r matches is inserted into and extracted from each of the LROk lists at most once. In total, O(rL log log n) time.

30
Complexity analysis
  • Reporting the best alignments is done in O(rL) time.
  • The total time complexity of this algorithm is O(n log|ΣY| + rL log log n).
  • The space complexity is O(rL + nL).

31
The O(rM log log n) normalized local LCS algorithm
32
The O(rM log log n) normalized local LCS algorithm
  • Reports the normalized alignment value of the best possible local alignment (both the value and the substrings).

33
Computing the highest normalized value
  • Definition - A sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain.
  • Claim - When a k-chain is split into a number of non-overlapping consecutive sub-chains, the normalized value of the k-chain is smaller than or equal to that of its best sub-chain.
  • Result - The normalized value of any k-chain (k ≥ M) is smaller than or equal to the value of its best sub-chain with M to 2M-1 matches.
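The claim follows from a mediant-style inequality. Writing kt and lt for the number of matches and the length of the t-th sub-chain, and noting that the gaps between consecutive sub-chains only add to the length l of the whole chain (so l ≥ Σt lt):

```latex
\[
\frac{k}{l} \;=\; \frac{\sum_t k_t}{l}
\;\le\; \frac{\sum_t k_t}{\sum_t l_t}
\;\le\; \max_t \frac{k_t}{l_t}.
\]
```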

34
Computing the highest normalized value
  • Sub-chains of fewer than M matches may not be reported.
  • Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches.
  • Is it sufficient to construct all the sub-chains of exactly M matches?
  • No - sub-chains of M+1 to 2M-1 matches cannot be split into sub-chains of exactly M matches.

35
Computing the highest normalized value
  • The algorithm - For each match, construct all the k-chains for k ≤ 2M-1.
  • The algorithm constructs all these chains, which are, in fact, the sub-chains of all the longer k-chains.
  • A longer chain cannot be better than its best sub-chain.
  • This algorithm is able to report the highest normalized value of a sub-chain (of at least M matches), which is equal to the highest normalized value of a chain of at least M matches.

36
Constructing the longest optimal alignment
Definition - A perfect alignment is an alignment of two identical strings. Its normalized value is ½. Unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.
37
  • Constructing the longest optimal alignment

Assume there is a chain with more than 2M-1 matches whose normalized value is optimal; denote it LB.
  • LB may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches.
  • The normalized value of each such sub-chain must be equal to that of LB; otherwise, LB is not optimal.
  • Each such sub-chain must start and end at a match; otherwise, the normalized value of the chain comprised of the same matches would be higher than that of LB.

38
Constructing the longest optimal alignment
  • Note that if we concatenate two optimal sub-chains where the head of the second is next to the tail of the first, the concatenated chain is optimal.

(Example: two adjacent sub-chains of value 10/30 each concatenate into a chain of value 20/60 = 10/30.)
  • When the head of the second is not next to the tail of the first, the concatenated chain is not optimal.

(Example: with a gap of length 2 between the sub-chains, the concatenation has 20 matches and length 62, and 20/62 < 10/30.)
  • The tails and heads of the sub-chains from which
    LB is comprised must be next to each other.

39
Constructing the longest optimal alignment
  • If the tails and heads of the optimal sub-chains from which LB is comprised are next to each other, then their concatenation (i.e., LB) is optimal. Let's examine the first two sub-chains:

(Diagram: two adjacent optimal sub-chains of value M/L; their concatenation has value 2M/2L = M/L.)
  • But what happens if we examine the following sub-chain, which consists of one of the optimal sub-chains together with the adjacent head match of the next one?

(Diagram: a sub-chain of M+1 matches spanning the boundary between two adjacent optimal sub-chains of value M/L.)

  • Its number of matches is M+1 and its length is L+2.
  • Since M/L < ½, (M+1)/(L+2) > M/L. Thus, we found a chain of M+1 matches whose normalized value is higher than that of LB, in contradiction to the optimality of LB.
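The arithmetic step used above, spelled out:

```latex
\[
\frac{M+1}{L+2} > \frac{M}{L}
\;\Longleftrightarrow\; (M+1)\,L > M\,(L+2)
\;\Longleftrightarrow\; L > 2M
\;\Longleftrightarrow\; \frac{M}{L} < \tfrac{1}{2}.
\]
```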

40
Closing remarks
41
The advantages of the new algorithm
  • The first algorithm to combine normalized local alignment with sparse computation.
  • Ideal for textual local comparison (where the sparsity is typically dramatic) as well as for screening biological sequences.
  • As a normalized alignment algorithm, it does not suffer from the weaknesses of non-normalized algorithms.
  • A straightforward approach to the minimal length constraint, which is easy to control and understand and, at the same time, does not require reformulation of the original problem.
  • The minimal length constraint is problem-related rather than input-related.

42
The end