The Longest Common Subsequence Problem for Arc-annotated Sequences (Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang)


Provided by: kevi96


1
The Longest Common Subsequence Problem for Arc-annotated Sequences
Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang

B89902003 ???
B89902005 ???
B89902027 ???

2
Overview

Arc-annotated sequence usage
The secondary and tertiary structure of RNA
Protein sequence
Solves the open questions posed in
P.A. Evans, Algorithms and Complexity for Annotated Sequence Analysis, Ph.D. Thesis, University of Victoria, 1999.
P.A. Evans, Finding common subsequences with pseudoknots, in Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM99), LNCS 1645, pp. 270-280.

3
Definitions (I)

Symbol definitions
S: a sequence
P: an arc set, a set of pairs of positions in S
Arc definition
The pair (S, P) is called an arc-annotated sequence

4
Definitions (II)

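Arc-preserving
The arc mapping is kept when performing LCS: two matched positions are joined by an arc in one sequence exactly when their images are joined by an arc in the other
Cutwidth
The number of arcs crossing a given position
Arc-cutwidth
The maximum cutwidth over all positions of the sequence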
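
The two cutwidth definitions can be made concrete with a small Python sketch. It assumes 1-based positions and measures the cutwidth at the gap between positions p and p+1, so that an arc (l, r) crosses every gap l, ..., r-1. That convention, and the function names `cutwidths` and `arc_cutwidth`, are assumptions made for illustration and do not come from the slides.

```python
def cutwidths(n, arcs):
    """Cutwidth at each of the n-1 gaps of a length-n sequence.

    Assumed convention: arcs are 1-based pairs (l, r) and an arc crosses
    the gap between p and p+1 whenever l <= p < r.
    """
    diff = [0] * (n + 1)
    for left, right in arcs:
        if left > right:
            left, right = right, left
        diff[left] += 1    # arc starts covering gaps at `left`
        diff[right] -= 1   # and stops covering before gap `right`
    widths, running = [], 0
    for gap in range(1, n):
        running += diff[gap]
        widths.append(running)
    return widths


def arc_cutwidth(n, arcs):
    """Maximum cutwidth over all gaps (0 for an empty arc set)."""
    return max(cutwidths(n, arcs), default=0)
```

For a sequence of length 8 with arcs (2,8) and (4,7), the gaps between positions 4 and 7 are covered by both arcs, so the arc-cutwidth under this convention is 2.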

5
Restrictions (I)

The problems are NP-hard if there are no restrictions on arc annotations
Fortunately, RNA and protein sequences satisfy some constraints

6
Restrictions (II)

1. No sharing of endpoints
2. No crossing
3. No nesting
4. No arcs

7
Restrictions (III)

Five levels
Unlimited: no restrictions
Crossing: restriction 1
Nested: restrictions 1 and 2
Chain: restrictions 1, 2, and 3
Plain: restriction 4
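
The five levels can be checked mechanically. The following Python sketch returns the most restrictive level that an arc set satisfies, using restrictions 1 to 4 as listed on the previous slide. The function name `restriction_level` and the representation of arcs as pairs of integer positions are assumptions made for illustration.

```python
def restriction_level(arcs):
    """Most restrictive of the five levels satisfied by `arcs`.

    Arcs are pairs of integer positions (an assumed representation).
    """
    arcs = [tuple(sorted(a)) for a in arcs]
    if not arcs:
        return "plain"                      # restriction 4: no arcs
    endpoints = [p for a in arcs for p in a]
    if len(endpoints) != len(set(endpoints)):
        return "unlimited"                  # restriction 1 is violated
    crossing = nesting = False
    for x in range(len(arcs)):
        for y in range(x + 1, len(arcs)):
            (a, b), (c, d) = sorted((arcs[x], arcs[y]))
            if a < c < b < d:
                crossing = True             # restriction 2 is violated
            elif a < c and d < b:
                nesting = True              # restriction 3 is violated
    if crossing:
        return "crossing"
    if nesting:
        return "nested"
    return "chain"
```

The pairwise loop is quadratic in the number of arcs, which is enough for a sketch; a stack-based scan would be the usual choice for long sequences.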

8
Result (I)
n = |S1|, m = |S2|

9
Result (II)

LCS (crossing, crossing)
2-approximation algorithm
LCS (crossing, plain)
MAX SNP-hard
LCS (nested, plain)
Dynamic programming algorithm

10
LCS (crossing, crossing)-def (I)

(S1, P1), (S2, P2)
Arc-annotated sequences
Y
The result of the classical LCS algorithm on S1 and S2
|Y| = L
M
The mapping between S1 and S2 induced by Y
M = {(i1, j1), (i2, j2), ...}

11
LCS (crossing, crossing)-def (II)

Graph GM
Vertices: the pairs (ik, jk), (il, jl), ... of M
Max degree of GM <= 2

12
LCS (crossing, crossing)- Algo
13
LCS (crossing, crossing)- Result

LCS (crossing, crossing) has a 2-approximation algorithm with time complexity O(nm)
LCS (crossing, nested), LCS (crossing, chain), and LCS (crossing, plain) also have 2-approximation algorithms with time complexity O(nm)
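
The algorithm slide itself did not survive in this transcript, so the following Python sketch is only one plausible reading of the definitions above, not the authors' exact procedure. It runs a classical LCS, builds a conflict graph on the matched pairs, and keeps the larger colour class of every component. The edge rule (two matched pairs conflict when an arc joins them in one sequence but not in the other) is an assumption, as are the names `lcs_mapping` and `two_approx` and the 0-based arc positions. The sketch relies on restriction 1 holding for both arc sets, which gives every vertex degree at most 2 and makes the graph 2-colourable.

```python
from collections import deque


def lcs_mapping(s1, s2):
    """Classical LCS; returns the matched 0-based pairs (i, j) in order."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                       # trace one optimal solution
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def two_approx(s1, p1, s2, p2):
    """Arc-preserving mapping with at least half the classical LCS length."""
    pairs = lcs_mapping(s1, s2)
    by_i = {i: k for k, (i, _) in enumerate(pairs)}
    by_j = {j: k for k, (_, j) in enumerate(pairs)}
    arcs1 = {frozenset(a) for a in p1}
    arcs2 = {frozenset(a) for a in p2}
    adj = [[] for _ in pairs]

    def add_conflicts(arcs, index_of, other_arcs, other_coord):
        # Assumed edge rule: an arc on one side with no matching arc on the other.
        for a, b in arcs:
            if a in index_of and b in index_of:
                k, l = index_of[a], index_of[b]
                mate = frozenset((pairs[k][other_coord], pairs[l][other_coord]))
                if mate not in other_arcs:
                    adj[k].append(l)
                    adj[l].append(k)

    add_conflicts(p1, by_i, arcs2, 1)
    add_conflicts(p2, by_j, arcs1, 0)

    color, keep = {}, []
    for start in range(len(pairs)):              # 2-colour each component
        if start in color:
            continue
        color[start] = 0
        component, queue = [start], deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]
                    component.append(w)
                    queue.append(w)
        side0 = [v for v in component if color[v] == 0]
        side1 = [v for v in component if color[v] == 1]
        keep.extend(side0 if len(side0) >= len(side1) else side1)
    return [pairs[k] for k in sorted(keep)]
```

Under the assumed edge rule, each vertex has at most one conflict coming from P1 and one from P2, so every cycle alternates between the two kinds and has even length. That is why keeping the larger colour class loses at most half of the classical LCS, which in turn bounds the optimum from above.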

14
LCS (unlimited, plain)

Prove that it can't be approximated within ratio
Lemma 1
MaxIS-B is MAX SNP-complete when B >= 3
Lemma 2
MaxIS-Cubic is MAX SNP-complete

15
Proof of Lemma 2 (I)

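L-reduction from MaxIS-3 to MaxIS-Cubic
G = (V, E): an instance of MaxIS-3 with n = |V|
i: number of vertices of degree 1
j: number of vertices of degree 2
n - i - j: number of vertices of degree 3
V': a maximum independent set (IS) of G
opt(G) = |V'|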

16
Proof of Lemma 2 (II)

Trivially, opt(G) >= n/4
i + j <= 4 opt(G)
G': an instance of MaxIS-Cubic
opt(G'): the size of a maximum IS of G'
Goal: construct G' from G and a special graph H

17
Proof of Lemma 2 (III)

Graph H is like this
Number of triangles: 2i + j
Cycle size: 2(2i + j)

18
Proof of Lemma 2 (IV)

H has a maximum IS of size 2(2i + j)
Construct G'
Connect each degree-1 vertex of G to two free vertices in H
Connect each degree-2 vertex of G to one free vertex in H
G' is a cubic graph

19
Proof of Lemma 2 (V)

k' = opt(G) + 2(2i + j)
k': the size of one IS of G', built from a maximum IS of G and a maximum IS of H
opt(G') >= opt(G) + 2(2i + j) (1)

20
Proof of Lemma 2 (VI)

Another thought
V'': any IS of G', |V''| = k'
Deleting the vertices of V'' that are in H gives an IS of G of size k
At most 2(2i + j) vertices of V'' are in H
k >= k' - 2(2i + j) (2)

21
Proof of Lemma 2 (VII)

From (1)
From (2)
L-reduction o.k.
MaxIS-Cubic is Max SNP-complete
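
The formulas that followed "From (1)" and "From (2)" are missing here. One way to complete the argument is sketched below, reading (1) as opt(G') >= opt(G) + 2(2i+j) and (2) as k >= k' - 2(2i+j), where G' is the constructed cubic graph, k' is the size of any independent set of G', and k is the size of the independent set of G obtained from it. The constants 17 and 1 are derived in this sketch and are not necessarily the ones shown on the original slide.

```latex
\begin{align*}
\mathrm{opt}(G') &= \mathrm{opt}(G) + 2(2i+j)
  && \text{(1), and (2) applied to a maximum IS of } G' \\
 &\le \mathrm{opt}(G) + 4(i+j) \le 17\,\mathrm{opt}(G)
  && \text{since } i+j \le 4\,\mathrm{opt}(G) \\
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) + 2(2i+j) - k'
  = \mathrm{opt}(G') - k'
  && \text{by (2)}
\end{align*}
```

The first chain bounds the optimum of the constructed instance by a constant multiple of the original optimum, and the last line bounds the error of the recovered solution by the error of the given one, which are the two conditions of an L-reduction.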

22
Proof of LCS(unlimited, plain)(I)

Show that MaxIS can be L-reduced to LCS(unlimited, plain)
MaxIS can't be approximated

23
Proof of LCS(unlimited, plain) (II)

G = (V, E): an instance of MaxIS, n = |V|
I: an instance of LCS consisting of
S1 = a^n with P1 = E
S2 = a^n with P2 = Ø
An independent set V' = {v_i1, ..., v_ik} corresponds one-to-one to the arc-preserving common subsequence consisting of the i1-th, ..., ik-th a's from S1
So LCS(unlimited, plain) includes MaxIS as a subproblem.
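
The correspondence can be checked by brute force on a small graph. The Python sketch below builds the instance (S1 = a^n with the edges as arcs, S2 = a^n without arcs) and compares an exhaustive maximum independent set with an exhaustive arc-preserving LCS. All function names are hypothetical, vertices are numbered from 0, and both searches are exponential, so this is for illustration only.

```python
from itertools import combinations


def maxis_to_lcs_instance(n, edges):
    """S1 = a^n with P1 = E (vertex v is position v), S2 = a^n with no arcs."""
    s1 = "a" * n
    p1 = {frozenset(e) for e in edges}
    s2 = "a" * n
    return s1, p1, s2


def max_independent_set_size(n, edges):
    """Exhaustive search over vertex subsets, largest first."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, -1, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(p) not in edge_set for p in combinations(subset, 2)):
                return size
    return 0


def is_subsequence(word, text):
    it = iter(text)
    return all(ch in it for ch in word)


def lcs_unlimited_plain_bruteforce(s1, p1, s2):
    """Longest choice of S1 positions that spells a subsequence of S2 and
    contains no arc of P1 (S2 has no arcs, so no arc may be preserved)."""
    n = len(s1)
    for size in range(n, -1, -1):
        for chosen in combinations(range(n), size):
            arc_free = all(frozenset(p) not in p1 for p in combinations(chosen, 2))
            word = "".join(s1[i] for i in chosen)
            if arc_free and is_subsequence(word, s2):
                return size
    return 0
```

On a 5-cycle both searches return 2, as the one-to-one correspondence predicts.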

24
Corollary

LCS(unlimited, chain), LCS(unlimited, nested), and LCS(unlimited, unlimited) can't be approximated within ratio

25
LCS(crossing, plain) is MAX SNP-hard

Use an L-reduction from MaxIS-Cubic to LCS(crossing, plain)
G = (V, E) is a cubic graph, n = |V|
For S1: construct a segment Tu of letters aaaabbccc for each vertex u in V
For each edge (u, v), introduce an arc between a c from Tu and a c from Tv; each letter c can be used only once

26
Instance I constructed from cubic graph G

S2 is obtained by concatenating n identical
segments of aaaacccbb
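
The construction of instance I can be written out as a short Python sketch. It assumes that S1 is the concatenation of the segments Tu in vertex order, that vertices are numbered 0 to n-1, and that each edge takes the next unused c of each endpoint in the order the edges are given. Those choices and the name `build_instance` are assumptions made for illustration.

```python
SEG1 = "aaaabbccc"   # segment Tu placed in S1 for every vertex u
SEG2 = "aaaacccbb"   # segment repeated n times to form S2


def build_instance(n, edges):
    """Return (S1, P1, S2) for a cubic graph on vertices 0..n-1.

    Positions in the returned arcs are 0-based indices into S1.
    """
    s1 = SEG1 * n
    used = [0] * n                      # how many c's of Tu already carry an arc

    def next_c(u):
        if used[u] == 3:
            raise ValueError("vertex %d has degree above 3" % u)
        pos = len(SEG1) * u + 6 + used[u]   # the c's sit at offsets 6, 7, 8
        used[u] += 1
        return pos

    arcs = [tuple(sorted((next_c(u), next_c(v)))) for u, v in edges]
    s2 = SEG2 * n
    return s1, arcs, s2
```

Because every c carries at most one arc, P1 satisfies restriction 1 and is therefore at the crossing level, while S2 has no arcs and is plain.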

27
Proof(1)

opt(I) = opt(G) + 6n
Assume Y is an arc-preserving common subsequence of length k for (S1, P1) and (S2, P2)
(1) within a segment, all four a's should be matched
(2) within a segment, if a b is matched then no c is matched, and vice versa

28
Proof(2)

Define a subset V' of the vertices of G: for every segment Tu in sequence S1, if all three of its c's are matched, we put u in V'
V' is an independent set of G; let k' = |V'|
k' >= k - 6n, n/4 <= opt(G) <= n/2
opt(I) = opt(G) + 6n <= 25 opt(G) (a)
|k' - opt(G)| <= |k - opt(I)| (b)

29
Proof(3)

Inequalities (a) and (b) show that the reduction is an L-reduction, thus LCS(crossing, plain) is MAX SNP-hard
LCS(crossing, chain), LCS(crossing, nested), and LCS(crossing, crossing) are all MAX SNP-hard

30
Notes

With an additional constraint on the mapping:
for any (i1, j1) in the mapping, if (i1, i2) is in P1 then, for some j2, (i2, j2) is in the mapping, and if (j1, j2) is in P2 then, for some i2, (i2, j2) is in the mapping.
Under this definition, LCS(crossing, crossing) is NP-hard and LCS(crossing, nested) is solvable in polynomial time

31
LCS(nested, plain)

Input
Given a pair (S1, P1) and (S2, Ø) of arc-annotated sequences with P1 being nested
Output
The length of a longest arc-preserving common subsequence for the pair (no arc appears on the subsequence)

32
Denote
n = |S1|
m = |S2|
u(i) denotes the arc in P1 incident on position i of sequence S1
u(i)l and u(i)r denote the left and right endpoints of u(i)
If u(i) does not exist, we call i free
x(S1[i], S2[j]) = 1 if S1[i] = S2[j], and 0 otherwise

33
Dynamic Programming Algorithm
-Alas, I know little about dynamic programming
-but I know divide-and-conquer
-pang feng says DP is bottom up, DC is top-down
34
Divide and Conquer algorithm

Two functions
?DP(i1, i2; j, j') gives the length of a longest arc-preserving common subsequence for the pair S1[i1, i2] and (S2[j, j'], Ø); it is used if and only if i1 = u(i2)l
?DP(i, i'; j, j') gives the length of a longest arc-preserving common subsequence for the pair S1[i, i'] and (S2[j, j'], Ø); it is used if and only if i <= u(i')l or i' is free

-how?
35
Divide and Conquer algorithm

?DP(i, i'; j, j')
If i' is free
?DP(i, i'; j, j') = max {
?DP(i, i'-1; j, j'-1) + x(S1[i'], S2[j']),
?DP(i, i'-1; j, j'),
?DP(i, i'; j, j'-1) }
-simple LCS algorithm
36
?DP(i, i'; j, j')

Else if i' = u(i')r and i < u(i')l
?DP(i, i'; j, j') = max over j <= j'' <= j'+1 of { ?DP(i, u(i')l - 1; j, j''-1) + ?DP(u(i')l, i'; j'', j') }
37
?DP(i, i'; j, j')

Else (i = u(i')l)
Just call ?DP(i1, i2; j, j') with i1 = i and i2 = i'
38
?DP(i1, i2; j, j') = max {
?DP(i1+1, i2-1; j+1, j') + x(S1[i1], S2[j]),
?DP(i1+1, i2-1; j, j'-1) + x(S1[i2], S2[j']),
?DP(i1+1, i2-1; j, j'),
?DP(i1, i2; j, j'-1),
?DP(i1, i2; j+1, j') }
-merge ?DP and ?DP into DP
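
The merged function can be sketched in Python as a top-down, memoised recursion. This is a reading of the three cases on these slides (last position free, last position closing an arc that starts after i, and the interval being exactly one arc), not a verified transcription of the authors' algorithm. It assumes 1-based inclusive indices, a nested arc set P1, and an arc-free S2; the names `lcs_nested_plain`, `left_of`, `x` and `dp` are chosen for illustration.

```python
from functools import lru_cache


def lcs_nested_plain(s1, arcs, s2):
    """Length of a longest arc-preserving common subsequence of
    (s1, arcs) and (s2, no arcs), assuming `arcs` is nested.
    Arc positions are 1-based."""
    n, m = len(s1), len(s2)
    left_of = {max(a): min(a) for a in arcs}   # right endpoint -> left endpoint

    def x(i, j):
        # 1 if S1[i] = S2[j] (1-based), else 0
        return 1 if s1[i - 1] == s2[j - 1] else 0

    @lru_cache(maxsize=None)
    def dp(i, i2, j, j2):
        # Value for S1[i..i2] against S2[j..j2]; empty ranges give 0.
        if i > i2 or j > j2:
            return 0
        if i2 not in left_of:
            # i2 is free: the classical LCS step on the last characters.
            return max(dp(i, i2 - 1, j, j2 - 1) + x(i2, j2),
                       dp(i, i2 - 1, j, j2),
                       dp(i, i2, j, j2 - 1))
        left = left_of[i2]
        if i < left:
            # i2 closes an arc that starts after i: split S1 in front of
            # the arc and try every split point k of S2.
            return max(dp(i, left - 1, j, k - 1) + dp(left, i2, k, j2)
                       for k in range(j, j2 + 2))
        # The interval is exactly one arc (i, i2). S2 has no arcs, so at
        # most one endpoint of the arc may be matched.
        return max(dp(i + 1, i2 - 1, j + 1, j2) + x(i, j),
                   dp(i + 1, i2 - 1, j, j2 - 1) + x(i2, j2),
                   dp(i + 1, i2 - 1, j, j2),
                   dp(i, i2, j, j2 - 1),
                   dp(i, i2, j + 1, j2))

    return dp(1, n, 1, m)
```

For example, `lcs_nested_plain("ATGCTACG", [(2, 8), (4, 7)], "AT")` evaluates to 2, and `lcs_nested_plain("AB", [(1, 2)], "AB")` evaluates to 1 because the two endpoints of an arc cannot both be matched into an arc-free sequence. The recursion only ever visits intervals of S1 that no arc leaves, which is why the last position is never a left endpoint. Being recursive, the sketch is suited to short inputs; a bottom-up table over the same intervals avoids Python's recursion limit.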
39
Example
S1 = A T G C T A C G (positions 1 to 8), with arcs (2,8) and (4,7)
S2 = A T

Top down approach

(1,8) splits into (1,1) and (2,8)
(2,8) reduces to (3,7)
(3,7) splits into (3,3) and (4,7)
(4,7) reduces to (5,6)
(5,6) reduces to (5,5)

40
Example bottom up
(5,5)
T[1,1] = ?DP(5,5; 1,1), T[1,2] = ?DP(5,5; 1,2), T[2,2] = ?DP(5,5; 2,2)
(5,6): 6 is free
DP(5,6; 1,2) = max { DP(5,5; 1,1) + x(S1[6], S2[2]), DP(5,5; 1,2), DP(5,6; 1,1) }

41
Example bottom up
(4,7) is an arc
?DP(4,7; 1,2) = max { DP(5,6; 2,2) + x(S1[4], S2[1]), DP(5,6; 1,1) + x(S1[7], S2[2]), DP(5,6; 1,2), DP(4,7; 1,1), DP(4,7; 2,2) }

42
Example bottom up
(3,7): 7 = u(7)r and 3 < u(7)l
DP(3,7; 1,2) = max { DP(3,3; 1,0) + DP(4,7; 1,2), DP(3,3; 1,1) + DP(4,7; 2,2), DP(3,3; 1,2) + DP(4,7; 3,2) }

43
Example bottom up
(2,8) is an arc
?DP(2,8; 1,2) = max { DP(3,7; 2,2) + x(S1[2], S2[1]), DP(3,7; 1,1) + x(S1[8], S2[2]), DP(3,7; 1,2), DP(2,8; 1,1), DP(2,8; 2,2) }

44
Example bottom up
(1,8): 8 = u(8)r and 1 < u(8)l
ANS
DP(1,8; 1,2) = max { DP(1,1; 1,0) + DP(2,8; 1,2), DP(1,1; 1,1) + DP(2,8; 2,2), DP(1,1; 1,2) + DP(2,8; 3,2) }

45
Time Complexity
(1,8)
(1,1)
(2,8)
(3,7)