1
An Optimal Algorithm for MAX-SUM SEGMENT and Its
Applications in Bioinformatics
  • Hsueh-I Lu
  • Academia Sinica, Taiwan

2
This is joint work with
  • Institute of Statistics, National Central University
    • Tsai-Hung Fan
    • Tsung-Shan Tsou
  • Institute of Biomedical Sciences, Academia Sinica
    • Adam Yao
    • Shufen Lee
    • Tsai-Cheng Wang

3
Outline
  • The MAX-SUM SEGMENT problem
  • Previous work and our algorithm
  • A feature of our algorithm
    • Processing the input sequence in an on-line manner
  • An application in bioinformatics
    • Finding new repeats in genomic sequences

4
The MAX-SUM SEGMENT Problem
  • Input
    • a sequence S of n numbers
    • two length bounds L and U
  • Output
    • a segment S[i, j] with maximum sum over all
      segments of S with length at least L and at most U

5
Example
  • S = -3, 2, -2, 5, -4, 1, -2, 3, 1
  • L = 2, U = 2
  • L = 2, U = 8

(Figure: feasible segments of S marked with their sums: 2, 4, 5, and 3.)
6
Simple Observations
  1. There are only O(n^2) segments S[i, j] in S:
    O(n) values for i and O(n) values for j.
  2. The sum of each segment S[i, j] can be computed
    in O(1) time.
7
Simple Observations
  1. There are only O(n^2) segments S[i, j] in S.
  2. The sum of each segment S[i, j] can be computed
    in O(1) time, with appropriate linear-time preprocessing.
8
Prefix sums of S
  • Define prefix-sum(i) = sum of S[1, i], with prefix-sum(0) = 0.
  • O(n) time to pre-compute prefix-sum(i) for all
    indices i.
  • Sum of S[i, j] = prefix-sum(j) - prefix-sum(i - 1)

(Figure: the positions 1, i, and j marked on S.)
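The prefix-sum idea on this slide can be sketched in Python as follows. The function names `prefix_sums` and `segment_sum` are illustrative and not from the slides, and the convention prefix-sum(0) = 0 is assumed so that segments starting at index 1 need no special case.

```python
from itertools import accumulate


def prefix_sums(s):
    """Return p with p[0] = 0 and p[i] = S[1] + ... + S[i] (1-based, as on the slides)."""
    return [0] + list(accumulate(s))


def segment_sum(p, i, j):
    """Sum of S[i, j] for 1 <= i <= j <= n, computed in O(1) time from the prefix sums."""
    return p[j] - p[i - 1]


# The sequence from the example slide.
S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
P = prefix_sums(S)         # O(n) preprocessing
print(segment_sum(P, 2, 4))  # 2 + (-2) + 5 = 5
```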
9
As a result
  • The problem has a naïve O(n^2)-time algorithm
  • Go through all O(n^2) feasible segments and output
    one segment with maximum sum.
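A minimal sketch of the naïve algorithm, assuming 1 <= L <= U and 1-based inclusive segment indices as on the slides. The function name and the `(sum, i, j)` return shape are choices made for this illustration only.

```python
from itertools import accumulate


def max_sum_segment_naive(s, L, U):
    """Try every feasible segment; return (best_sum, i, j) or None if none is feasible."""
    n = len(s)
    p = [0] + list(accumulate(s))  # p[k] = prefix-sum(k), p[0] = 0
    best = None
    for i in range(1, n + 1):
        # Feasible ending indices j satisfy L <= j - i + 1 <= U and j <= n.
        for j in range(i + L - 1, min(n, i + U - 1) + 1):
            total = p[j] - p[i - 1]
            if best is None or total > best[0]:
                best = (total, i, j)
    return best


S = [-3, 2, -2, 5, -4, 1, -2, 3, 1]
print(max_sum_segment_naive(S, 2, 2))  # (4, 8, 9)
print(max_sum_segment_naive(S, 2, 8))  # (5, 2, 4)
```

On the example sequence this gives sum 4 for L = U = 2 and sum 5 for L = 2, U = 8.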

10
Previous work
  • Lin, Jiang, and Chao [JCSS 2002] gave the first
    known O(n)-time algorithm for the MAX-SUM SEGMENT
    problem.

11
Lin-Jiang-Chao
  • Based upon a clever but somewhat complicated
    technique called left-negative decomposition of
    the input sequence.

12
Our result
  • An alternative linear-time algorithm
    • Bypasses the pre-processing step of
      left-negative segment decomposition.
    • Can process the input sequence in an on-line manner.

13
On-line processing
  • Suppose the input sequence S is given in
    iterations.
  • For each i = 1, 2, ..., n, the number S[i] is given
    in the i-th iteration.
  • When S[i] is received, our algorithm is capable
    of iteratively outputting a max-sum segment of
    S[1, i].
  • The required working space is O(U - L + 1).

14
Our algorithm
15
Geometric representation
height of j-th point = prefix-sum(j)
S = 2, -1, 1, 1, -2, 3, -2
(Figure: the prefix sums plotted as points against the index j = 1, 2, ...)
16
Sum of S[i, j] = prefix-sum(j) - prefix-sum(i - 1)
  • For each ending index j, maximizing the sum of
    S[i, j] is equivalent to minimizing
    prefix-sum(i - 1).

17
valley(j) = the lowest point in [j - U, j - L]
Feasible index set F(j) = [j - U, j - L] for index j.
(Figure: the points from j - U to j - L, to the left of j, with valley(j) marked.)
18
valley(j)
  • Clearly, for each ending index j, 1 + valley(j) has
    to be the best starting index.
  • In other words, 1 + valley(j) maximizes the sum of
    S[i, j] over all feasible starting indices i.

19
As a result
  • The MAX-SUM SEGMENT problem is linear-time
    reducible to computing valley(j) for all indices
    j.
  • More specifically, the solution to MAX-SUM
    SEGMENT is simply the segment S[1 + valley(j), j]
    with maximum sum over all indices j.

20
Computing valley(j) for all indices j in O(n) time
  • Case 1: U is ineffective (e.g., U = n)
  • Case 2: U is effective

21
Case 1: U is ineffective
(Figure: the feasible index sets F(j), F(j + 1), and F(j + 2) for the ending indices j, j + 1, and j + 2.)
22
F(j + 1) = F(j) ∪ {j - L + 1}
  • if prefix-sum(j - L + 1) < prefix-sum(valley(j))
    • let valley(j + 1) = j - L + 1
  • otherwise,
    • let valley(j + 1) = valley(j)

O(1) time to compute valley(j + 1) from valley(j)
O(n) time to compute valley(j) for all j
O(n) time to compute a max-sum segment.
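A sketch of Case 1 in Python, assuming the upper bound U never excludes a starting index, so only L matters. The function name and the `(sum, i, j)` return value are illustrative choices. The single variable `valley` is updated in O(1) time per ending index, following the rule on this slide.

```python
from itertools import accumulate


def max_sum_segment_lower_bound_only(s, L):
    """Max-sum segment of length at least L; returns (best_sum, i, j), 1-based inclusive."""
    n = len(s)
    if not 1 <= L <= n:
        return None
    p = [0] + list(accumulate(s))  # p[k] = prefix-sum(k), p[0] = 0
    valley = 0                     # valley(L): F(L) contains only index 0
    best = (p[L] - p[0], 1, L)
    for j in range(L, n):
        # F(j + 1) = F(j) plus the single new index j - L + 1.
        candidate = j - L + 1
        if p[candidate] < p[valley]:
            valley = candidate     # the new point is the new lowest point
        total = p[j + 1] - p[valley]
        if total > best[0]:
            best = (total, valley + 1, j + 1)
    return best


print(max_sum_segment_lower_bound_only([-3, 2, -2, 5, -4, 1, -2, 3, 1], 2))  # (5, 2, 4)
```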
23
Case 2: U is effective
(Figure: the feasible index sets F(j), F(j + 1), and F(j + 2) for the ending indices j, j + 1, and j + 2.)
24
We need a data structure D(j) for the indices in F(j)
  • The specification
    • valley(j) can be obtained from D(j) in O(1)
      time.
    • D(j + 1) can be obtained from D(j) in amortized
      O(1) time.

25
A solution
  • Let D(j) be a list of indices X(1), X(2), ... such
    that
    • X(1) is the lowest point in F(j) = [j - U, j - L].
      That is, X(1) = valley(j).
    • X(2) is the lowest point in [X(1) + 1, j - L].
    • X(3) is the lowest point in [X(2) + 1, j - L].
    • ...
26
(Figure: axis labels j - U and j - L.)
27
Spec. 1
  • valley(j) can be obtained from D(j) in O(1)
    time?
  • Yes! Just read the first number in D(j).

28
Spec. 2
  • D(j + 1) can be obtained from D(j) in amortized O(1)
    time?
  • Yes! Two steps
  • (1) If valley(j) is not in F(j + 1), we just delete
    the first number from the list.
  • (2) Scan the list of indices in D(j) from right
    to left to see where j - L + 1 fits in, deleting
    every index whose point is not lower than that of
    j - L + 1.

29
(No Transcript)
30
Time complexity
  • The overall time complexity is O(n), although
    some iterations may take more than constant time:
    each index is inserted into the list once and
    deleted from it at most once.

31
On-line processing
  • In the j-th iteration, when S[j] is received,
    • prefix-sum(j) can be computed on the fly
    • valley(j) can be computed iteratively.
  • So, our algorithm can process the input sequence
    in an online manner.
  • The required working space is O(U - L + 1).
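A possible on-line variant is sketched below. The class name `OnlineMaxSumSegment` and its `push` method are hypothetical. The candidate list holds at most U - L + 1 entries, but this sketch also buffers the most recent L prefix sums until their indices become feasible, so it uses O(U) space in total and does not attempt to reproduce the O(U - L + 1) bound stated on the slide.

```python
from collections import deque


class OnlineMaxSumSegment:
    """Consume S one number at a time; push() returns the best (sum, i, j) seen so far."""

    def __init__(self, L, U):
        if not 1 <= L <= U:
            raise ValueError("need 1 <= L <= U")
        self.L, self.U = L, U
        self.j = 0                       # number of items received
        self.prefix = 0                  # prefix-sum(j), maintained on the fly
        self.pending = deque([(0, 0)])   # (index, prefix sum) pairs not yet feasible
        self.window = deque()            # candidate list; window[0] is valley(j)
        self.best = None

    def push(self, x):
        self.j += 1
        self.prefix += x
        j = self.j
        if j >= self.L:
            new_idx, new_p = self.pending.popleft()   # index j - L becomes feasible
            while self.window and self.window[-1][1] >= new_p:
                self.window.pop()
            self.window.append((new_idx, new_p))
            if self.window[0][0] < j - self.U:        # first index left the feasible set
                self.window.popleft()
            v_idx, v_p = self.window[0]
            total = self.prefix - v_p
            if self.best is None or total > self.best[0]:
                self.best = (total, v_idx + 1, j)
        self.pending.append((j, self.prefix))
        return self.best


finder = OnlineMaxSumSegment(L=2, U=8)
for value in [-3, 2, -2, 5, -4, 1, -2, 3, 1]:
    current_best = finder.push(value)
print(current_best)  # (5, 2, 4)
```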

32
An application in bioinformatics
  • Identifying repeats in genomic sequences.

33
Repeats
  • DNA repeats usually contain important biological
    information.
  • TIGR (The Institute for Genomic Research)
    maintains databases for repeats in various
    genomes.

34
Finding repeats via MAX-SUM SEGMENT
  • Input: a DNA sequence R
  • Output: a segment R[i, j] of R that is likely to
    be a previously unknown repeat in R.
  • Step 1. Filter out known repeats.
  • Step 2. Self-align R and reduce the 2D
    alignment scores to a sequence S of numbers.
  • Step 3. Run the MAX-SUM SEGMENT algorithm on S.

35
1. Masking repeats of Chromosome 1 listed in the
TIGR Rice Repeat Database
(Figure: unmasked Chromosome 1 and masked Chromosome 1, with the masked regions marked.)
36
2. Filtering simple regions of Chromosome 1
(Figure: masked Chromosome 1 and filtered and masked Chromosome 1, with the masked and filtered regions marked.)
37
Aligning the filtered and masked rice Chromosome 1 against
itself
Negative scores to penalize unaligned areas.
E < 1e-10
38
Max-Sum
Adding the scores in the same column into a
single score.
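As a toy illustration of this reduction, the sketch below sums a rectangular grid of scores column by column. The slides do not say how the alignment scores are stored, so the list-of-rows layout, the function name `column_scores`, and the numbers are all assumptions made for the example.

```python
def column_scores(score_rows):
    """Collapse a rectangular grid of scores into one score per column."""
    return [sum(column) for column in zip(*score_rows)]


# Hypothetical 2D scores: one row per alignment, one column per sequence position.
grid = [
    [1, -1, 2],
    [0, -1, 3],
]
print(column_scores(grid))  # [1, -2, 5]
```

The resulting one-dimensional sequence is the kind of input S that the MAX-SUM SEGMENT algorithm expects.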
39
left36345576 right36345588 Score253
Length44744612 4 1 30 144 56 2 9 2 1 1 9
-3 36345580
ataaaaaatattagaagaaaagtatagagtgcatatagaaatataattaa
gaaataatagaaattcggaattagaaaacaacagatattagaagaagagt
atagagtccatataagaatttagaatgaactaaaattcggaataaaaatt
aaaattaaagatagaatttagagtctata
Over 80 homologues (e-value < 1e-10) in rice
chromosome 1, but none in the TIGR Rice Repeat
Database released on July 29, 2002.
40
Conclusion
  • We give a new O(n)-time algorithm for the MAX-SUM
    SEGMENT problem.
  • Our algorithm can handle the input sequence in an
    online manner, which is an important feature for
    handling genome-sized input. The working space
    required is only O(U - L + 1).
  • Our algorithm can be an important subroutine in
    finding new repeats in genomic sequences.