Title: The Longest Common Subsequence Problem for Arc-annotated Sequences. Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang
1The Longest Common Subsequence Problem for Arc-annotated Sequences
Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang
- B89902003
- B89902005
- B89902027
2Overview
- Arc-annotated sequence usage
- The secondary and tertiary structure of RNA
- Protein sequence
- Solve the open questions in
- P.A. Evans, Algorithms and Complexity for Annotated Sequence Analysis, Ph.D. Thesis, University of Victoria, 1999.
- P.A. Evans, Finding common subsequences with pseudoknots, in Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM'99), LNCS 1645, pp. 270-280.
3Definitions (I)
- Symbol definition
- S: a sequence
- P: an arc set
- Arc definition: an arc is a pair of positions of S
- The pair (S, P) is called an arc-annotated sequence
4Definitions (II)
- Arc-preserving
- Arc mapping is kept when performing LCS
- Cutwidth
- The number of arcs crossing a given position
- Arc-cutwidth
- The maximum cutwidth over all positions of the sequence
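The two cutwidth notions above can be sketched in a few lines. This is only an illustration: the slide does not say exactly when an arc "crosses" a position, so the sketch assumes an arc (l, r) is counted at position p when l <= p < r, and the function names are hypothetical.

```python
def cutwidth_at(arcs, p):
    """Number of arcs (l, r) crossing position p.

    Assumption (not from the slides): an arc crosses p when l <= p < r.
    """
    return sum(1 for l, r in arcs if l <= p < r)


def arc_cutwidth(arcs, n):
    """Maximum cutwidth over the 1-based positions 1..n of the sequence."""
    return max((cutwidth_at(arcs, p) for p in range(1, n + 1)), default=0)


if __name__ == "__main__":
    # Two nested arcs on a sequence of length 8.
    print(arc_cutwidth([(2, 8), (4, 7)], 8))
```

Under this convention the two nested arcs overlap on positions 4 to 6, so those positions carry the maximum.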
5Restrictions (I)
- The problems are NP-hard if there are no restrictions on arc annotations
- Fortunately, RNA and protein sequences satisfy some constraints
6Restrictions (II)
- 1. No sharing of endpoints
- 2. No crossing
- 3. No nesting
- 4. No arcs
7Restrictions (III)
- Five levels
- Unlimited
- No restrictions
- Crossing
- Restriction 1
- Nested
- Restriction 1, 2
- Chain
- Restriction 1, 2, 3
- Plain
- Restriction 4
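As a rough illustration of the five levels, the sketch below classifies an arc set by checking restrictions 1 to 4 in turn. The function name and the returned labels are hypothetical, and arcs are assumed to be given as pairs of integer positions.

```python
from itertools import combinations


def restriction_level(arcs):
    """Return the most restrictive of the five levels that the arc set meets."""
    arcs = sorted(tuple(sorted(a)) for a in arcs)
    if not arcs:
        return "plain"                      # restriction 4: no arcs
    endpoints = [p for a in arcs for p in a]
    if len(endpoints) != len(set(endpoints)):
        return "unlimited"                  # restriction 1 fails: shared endpoint
    crossing = nesting = False
    for (a, b), (c, d) in combinations(arcs, 2):
        # arcs are sorted and endpoints are distinct, so a < c here
        if c < b < d:
            crossing = True                 # a < c < b < d
        elif d < b:
            nesting = True                  # a < c < d < b
    if crossing:
        return "crossing"                   # only restriction 1 holds
    if nesting:
        return "nested"                     # restrictions 1 and 2 hold
    return "chain"                          # restrictions 1, 2 and 3 hold


if __name__ == "__main__":
    print(restriction_level([(2, 8), (4, 7)]))
```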
8Result (I)
|S1| = n, |S2| = m
9Result (II)
- LCS (crossing, crossing)
- 2-approximation algorithm
- LCS (crossing, plain)
- MAX SNP-hard
- LCS (nested, plain)
- Dynamic programming algorithm
10LCS (crossing, crossing)-def (I)
- (S1, P1), (S2, P2)
- Arc-annotated sequences
- Y
- An arc-preserving common subsequence, the result of LCS()
- |Y| = L
- M
- The mapping between S1 and S2 induced by Y
- M = {(i1, j1), (i2, j2), ..., (iL, jL)}
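To make the mapping M concrete, here is a small check of the arc-preserving condition, assuming the usual definition: for any two pairs (i1, j1), (i2, j2) of M, (i1, i2) is an arc of P1 exactly when (j1, j2) is an arc of P2. The function name is hypothetical.

```python
def is_arc_preserving(mapping, p1, p2):
    """mapping: list of (i, j) position pairs; p1, p2: collections of arcs."""
    p1 = {tuple(sorted(a)) for a in p1}
    p2 = {tuple(sorted(a)) for a in p2}
    for i1, j1 in mapping:
        for i2, j2 in mapping:
            if i1 < i2:
                in_p1 = (i1, i2) in p1
                in_p2 = (min(j1, j2), max(j1, j2)) in p2
                if in_p1 != in_p2:
                    return False            # an arc is kept on one side only
    return True


if __name__ == "__main__":
    print(is_arc_preserving([(1, 1), (2, 2)], [(1, 2)], [(1, 2)]))
```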
11LCS (crossing, crossing)-def (II)
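- Graph G_M
- Vertices: the pairs (ik, jk), (il, jl), ... of M
- Edge between (ik, jk) and (il, jl): exactly one of (ik, il) ∈ P1 and (jk, jl) ∈ P2 holds
- max deg(vertex of G_M) ≤ 2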
12LCS (crossing, crossing)- Algo
13LCS (crossing, crossing)- Result
- LCS(crossing, crossing) has a 2-approximation algorithm with time complexity O(nm)
- LCS(crossing, nested), LCS(crossing, chain), and LCS(crossing, plain) have a 2-approximation algorithm with time complexity O(nm)
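One way to read the 2-approximation is sketched below; it is a reconstruction, not necessarily the authors' exact procedure. It assumes that two matched pairs are joined in G_M when exactly one of the two corresponding arcs exists, and that no endpoint is shared, so every vertex has degree at most 2 and the P1-edges and P2-edges alternate along any cycle (making G_M bipartite). Keeping the larger colour class of each component then retains at least half of a classical LCS. Positions are 0-based and all function names are hypothetical.

```python
from collections import deque


def lcs_pairs(s1, s2):
    """Classical LCS ignoring arcs; returns the matched (i, j) index pairs."""
    n, m = len(s1), len(s2)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            t[i + 1][j + 1] = t[i][j] + 1 if s1[i] == s2[j] else max(t[i][j + 1], t[i + 1][j])
    pairs, i, j = [], n, m
    while i and j:                          # trace one optimal alignment back
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def approx_lapcs(s1, p1, s2, p2):
    """Arc-preserving mapping with at least half the pairs of a classical LCS."""
    pairs = lcs_pairs(s1, s2)
    at_i = {i: k for k, (i, _) in enumerate(pairs)}
    at_j = {j: k for k, (_, j) in enumerate(pairs)}
    e1 = {frozenset((at_i[a], at_i[b])) for a, b in p1 if a in at_i and b in at_i}
    e2 = {frozenset((at_j[a], at_j[b])) for a, b in p2 if a in at_j and b in at_j}
    adj = [[] for _ in pairs]
    for edge in e1 ^ e2:                    # arc present on exactly one side
        u, v = tuple(edge)
        adj[u].append(v)
        adj[v].append(u)
    colour, keep = [None] * len(pairs), []
    for start in range(len(pairs)):
        if colour[start] is not None:
            continue
        colour[start] = 0
        sides, queue = ([start], []), deque([start])
        while queue:                        # 2-colour the component by BFS
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] is None:
                    colour[v] = 1 - colour[u]
                    sides[colour[v]].append(v)
                    queue.append(v)
        keep += max(sides, key=len)         # larger class is an independent set
    return sorted(pairs[k] for k in keep)


if __name__ == "__main__":
    print(approx_lapcs("ab", [(0, 1)], "ab", []))
```

The dynamic-programming table dominates the running time, which matches the O(nm) bound quoted on the slide.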
14LCS (unlimited, plain)
- Prove that it can't be approximated within ratio
- Lemma 1
- MaxIS-B is Max SNP-complete when B ≥ 3
- Lemma 2
- MaxIS-Cubic is Max SNP-complete
15Proof of Lemma 2 (I)
- L-reduction from MaxIS-3 to MaxIS-Cubic
- G = (V, E): instance of MaxIS-3, n = |V|
- i vertices of degree 1
- j vertices of degree 2
- n-i-j vertices of degree 3
- V': a maximum IS of G
- opt(G) = |V'|
16Proof of Lemma 2 (II)
- trivially, opt(G) ≥ n/4
- i+j ≤ 4 opt(G)
- G': instance of MaxIS-Cubic
- opt(G'): the size of a maximum IS of G'
- Goal: construct G' from G and a special graph H
17Proof of Lemma 2 (III)
- Graph H (figure):
- number of triangles
- 2i+j
- cycle size
- 2(2i+j)
18Proof of Lemma 2 (IV)
- H has a maximum IS of size 2(2i+j)
- Construct G'
- Connect each vertex of degree 1 of G to two free vertices in H
- Connect each vertex of degree 2 of G to one free vertex in H
- G' is a cubic graph
19Proof of Lemma 2 (V)
- k' = opt(G) + 2(2i+j)
- k': the size of one maximal IS of G'
- opt(G') ≥ opt(G) + 2(2i+j) (1)
20Proof of Lemma 2 (VI)
- The other direction
- V'': an IS of G', |V''| = k'
- Deleting the vertices of V'' which are in H gives an IS of G with size k
- At most 2(2i+j) vertices of V'' are in H
- k ≥ k' - 2(2i+j) (2)
21Proof of Lemma 2 (VII)
- From (1)
- From (2)
- L-reduction o.k.
- MaxIS-Cubic is Max SNP-complete
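The formulas that followed "From (1)" and "From (2)" are not legible in this transcript. One way to fill them in, assuming (1) reads opt(G') ≥ opt(G) + 2(2i+j), (2) reads k ≥ k' - 2(2i+j), and i+j ≤ 4 opt(G) as stated earlier, is the derivation below; the constants 17 and 1 are what this reading gives, not values quoted from the slides.

```latex
\begin{align*}
\mathrm{opt}(G') &= \mathrm{opt}(G) + 2(2i+j)
  && \text{(1), and (2) applied with } k' = \mathrm{opt}(G') \\
 &\le \mathrm{opt}(G) + 4(i+j)
  \;\le\; \mathrm{opt}(G) + 16\,\mathrm{opt}(G)
  \;=\; 17\,\mathrm{opt}(G), \\
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) - k' + 2(2i+j)
  \;=\; \mathrm{opt}(G') - k'
  && \text{by (2).}
\end{align*}
```

Both conditions of an L-reduction then hold with constants 17 and 1.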
22Proof of LCS(unlimited, plain)(I)
- Show that MaxIS can be L-reduced to LCS(unlimited, plain)
- MaxIS can't be approximated
23Proof of LCS(unlimited, plain) (II)
- G = (V, E): instance of MaxIS, n = |V|
- I: instance of LCS(unlimited, plain), consisting of
- S1 = a^n with P1 = E
- S2 = a^n with P2 = Ø
- An independent set V' = {v_i1, ..., v_ik} corresponds 1-1 to an arc-preserving common subsequence consisting of the i1-th, ..., ik-th a's from S1
- So LCS(unlimited, plain) includes MaxIS as a subproblem.
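The reduction is small enough to try by brute force on toy graphs. The sketch below builds the instance and searches for the longest choice of positions of S1 containing no arc of P1, which is the arc-preserving condition here because P2 is empty. Function names are hypothetical and the search is exponential, so it is only meant for tiny inputs.

```python
from itertools import combinations


def reduce_maxis(n, edges):
    """Vertices are 1..n; returns ((S1, P1), (S2, P2)) as on the slide."""
    return ("a" * n, set(edges)), ("a" * n, set())


def longest_arc_free_choice(instance):
    """Largest set of positions of S1 holding no complete arc of P1.

    Since S2 = a^n is plain, any such set of size k matches the first k
    letters of S2 and is an arc-preserving common subsequence.
    """
    (s1, p1), _ = instance
    n = len(s1)
    for k in range(n, 0, -1):
        for positions in combinations(range(1, n + 1), k):
            chosen = set(positions)
            if not any(u in chosen and v in chosen for u, v in p1):
                return k
    return 0


if __name__ == "__main__":
    # A triangle 1-2-3 with a pendant vertex 4 attached to 3.
    inst = reduce_maxis(4, [(1, 2), (2, 3), (1, 3), (3, 4)])
    print(longest_arc_free_choice(inst))
```

The chosen positions are exactly an independent set of the graph, which is the 1-1 correspondence stated above.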
24Corollary
- LCS(unlimited, chain), LCS(unlimited, nested), and LCS(unlimited, unlimited) can't be approximated within ratio
25LCS(crossing, plain) is MAX SNP-hard
- Use an L-reduction to reduce MaxIS-Cubic to LCS(crossing, plain)
- G = (V, E) is a cubic graph, n = |V|
- For S1: construct a segment Tu of letters aaaabbccc for each vertex u ∈ V
- For each edge (u, v), introduce an arc between a c from Tu and a c from Tv; each letter c can be used only once
26Instance I constructed from cubic graph G
- S2 is obtained by concatenating n identical
segments of aaaacccbb
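The construction of instance I can be sketched as follows. The assignment of edges to the three c letters of a segment is not specified on the slide, so this sketch simply uses them in the order the edges are listed; vertices and positions are 0-based, and the function name is hypothetical.

```python
def build_instance(n, edges):
    """Build (S1, P1, S2) from a cubic graph on vertices 0..n-1."""
    segment = "aaaabbccc"                   # T_u: the c's sit at offsets 6, 7, 8
    s1 = segment * n
    next_c = [6] * n                        # next unused c inside each segment
    p1 = set()
    for u, v in edges:
        cu = 9 * u + next_c[u]
        next_c[u] += 1
        cv = 9 * v + next_c[v]
        next_c[v] += 1
        p1.add((min(cu, cv), max(cu, cv)))  # arc between a c of T_u and a c of T_v
    s2 = "aaaacccbb" * n                    # n identical segments, no arcs
    return s1, p1, s2


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    s1, p1, s2 = build_instance(4, k4)
    print(s1, sorted(p1), s2, sep="\n")
```

Because every vertex of a cubic graph has exactly three edges, each c is used by exactly one arc, so no endpoint is shared.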
27Proof(1)
- opt(I) = opt(G) + 6n
- Assume Y is an arc-preserving common subsequence of length k for (S1, P1) and (S2, P2)
- (1) All four a's of each segment should be matched
- (2) Within a segment, if a b is matched then no c is matched, and vice versa
28Proof(2)
- Define a subset V' of the vertices of G: for every segment Tu in sequence S1, if all three of its c's are matched, we put u in V'
- V' is an independent set for G; let k' = |V'|
- k' ≥ k - 6n, and n/4 ≤ opt(G) ≤ n/2
- opt(I) = opt(G) + 6n ≤ 25 opt(G) (a)
- |k' - opt(G)| ≤ |k - opt(I)| (b)
29Proof(3)
- Inequalities (a) and (b) show the reduction is an L-reduction, thus LCS(crossing, plain) is MAX SNP-hard
- LCS(crossing, chain), LCS(crossing, nested), and LCS(crossing, crossing) are all MAX SNP-hard
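For reference, the two inequalities can be checked as follows, assuming opt(I) = opt(G) + 6n, k' ≥ k - 6n and n ≤ 4 opt(G) as read from the preceding slides. The constant 25 is what this reading yields.

```latex
\begin{align*}
\mathrm{opt}(I) &= \mathrm{opt}(G) + 6n
  \;\le\; \mathrm{opt}(G) + 24\,\mathrm{opt}(G)
  \;=\; 25\,\mathrm{opt}(G), \\
\mathrm{opt}(G) - k' &\le \mathrm{opt}(G) - (k - 6n)
  \;=\; \mathrm{opt}(I) - k .
\end{align*}
```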
30Notes
- With an additional constraint:
- for any (i1, j1) in the mapping, if (i1, i2) ∈ P1 then, for some j2, (i2, j2) is in the mapping, and if (j1, j2) ∈ P2 then, for some i2, (i2, j2) is in the mapping.
- Under this definition, LCS(crossing, crossing) is NP-hard and LCS(crossing, nested) is solvable in polynomial time
31LCS(nested, plain)
- Input
- Given a pair (S1, P1) and (S2, Ø) of arc-annotated sequences with P1 being nested
- Output
- The length of a longest arc-preserving common (LAPC) subsequence for the pair (the LAPC subsequence carries no arcs)
32Denote
- n = |S1|
- m = |S2|
- u(i) denotes the arc in P1 incident on position i of sequence S1
- If u(i) does not exist, we call i free
- u(i)l and u(i)r denote the left and right endpoints of the arc u(i)
- χ(S1[i], S2[j]) = 1 if S1[i] = S2[j], or 0 otherwise
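The notation on this slide maps directly onto two small helpers. This is a sketch with hypothetical names; positions are 1-based as on the slides, and P1 is assumed to have no shared endpoints.

```python
def arc_table(n, p1):
    """u[i] = (u(i)l, u(i)r) for the arc at position i, or None if i is free."""
    u = [None] * (n + 1)                    # index 0 is unused
    for l, r in p1:
        l, r = min(l, r), max(l, r)
        u[l] = u[r] = (l, r)
    return u


def chi(s1, i, s2, j):
    """1 if S1[i] = S2[j] (1-based positions), else 0."""
    return 1 if s1[i - 1] == s2[j - 1] else 0


if __name__ == "__main__":
    u = arc_table(8, [(2, 8), (4, 7)])
    print([i for i in range(1, 9) if u[i] is None])
```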
33Dynamic Programming Algorithm
- Alas, I know little about dynamic programming
- but I know divide-and-conquer
- pang feng says DP is bottom-up, DC is top-down
34Divide and Conquer algorithm
- Two functions (written here as DP_arc and DP_seq)
- DP_arc(i1, i2; j, j') gives the length of a longest arc-preserving common subsequence for the pair (S1[i1, i2], P1[i1, i2]) and (S2[j, j'], Ø), and is used if and only if i1 = u(i2)l
- DP_seq(i, i'; j, j') gives the length of a longest arc-preserving common subsequence for the pair (S1[i, i'], P1[i, i']) and (S2[j, j'], Ø), and is used if and only if i < u(i')l or i' is free
- how?
35Divide and Conquer algorithm
- DP_seq(i, i'; j, j')
- If i' is free (simple LCS algorithm):
- DP_seq(i, i'; j, j') = max {
DP_seq(i, i'-1; j, j'-1) + χ(S1[i'], S2[j']),
DP_seq(i, i'-1; j, j'),
DP_seq(i, i'; j, j'-1) }
36DP_seq(i, i'; j, j')
- Else if i' = u(i')r and i < u(i')l:
- DP_seq(i, i'; j, j') = max over j ≤ j'' ≤ j'+1 of { DP_seq(i, u(i')l-1; j, j''-1) + DP_arc(u(i')l, i'; j'', j') }
37DP_seq(i, i'; j, j')
- Else (i = u(i')l):
- Just call DP_arc(i, i'; j, j')
38DP_arc(i1, i2; j, j')
- DP_arc(i1, i2; j, j') = max {
DP_seq(i1+1, i2-1; j+1, j') + χ(S1[i1], S2[j]),
DP_seq(i1+1, i2-1; j, j'-1) + χ(S1[i2], S2[j']),
DP_seq(i1+1, i2-1; j, j'),
DP_arc(i1, i2; j, j'-1),
DP_arc(i1, i2; j+1, j') }
- merge DP_arc and DP_seq into DP
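The divide-and-conquer idea can be tried out with the memoised sketch below. It is a simplified variant rather than the slides' exact formulation: when an arc spans the whole interval it drops one endpoint at a time instead of using the five-way maximum, and it makes no attempt to restrict the visited intervals to reach the O(nm³) bound. Positions are 0-based, P1 is assumed nested, and the function name is hypothetical.

```python
from functools import lru_cache


def lapcs_nested_plain(s1, arcs, s2):
    """Length of a longest arc-preserving common subsequence of (s1, arcs)
    and the plain sequence s2. With no arcs on s2, at most one endpoint of
    each arc of s1 may be matched."""
    left_of = {max(a): min(a) for a in arcs}    # right endpoint -> left endpoint

    @lru_cache(maxsize=None)
    def dp(i1, i2, j1, j2):
        # Best length for s1[i1..i2] against s2[j1..j2], bounds inclusive.
        if i1 > i2 or j1 > j2:
            return 0
        l = left_of.get(i2)
        if l is None or l < i1:
            # i2 is free inside this interval: ordinary LCS step.
            hit = 1 if s1[i2] == s2[j2] else 0
            return max(dp(i1, i2 - 1, j1, j2 - 1) + hit,
                       dp(i1, i2 - 1, j1, j2),
                       dp(i1, i2, j1, j2 - 1))
        if l > i1:
            # Split s1 just before the arc (l, i2) and try every cut of s2.
            return max(dp(i1, l - 1, j1, j) + dp(l, i2, j + 1, j2)
                       for j in range(j1 - 1, j2 + 1))
        # The arc (i1, i2) spans the interval: give up one of its endpoints.
        return max(dp(i1 + 1, i2, j1, j2), dp(i1, i2 - 1, j1, j2))

    return dp(0, len(s1) - 1, 0, len(s2) - 1)


if __name__ == "__main__":
    # S1 = ATGCTACG with arcs (2,8) and (4,7) in 1-based terms, S2 = AT.
    print(lapcs_nested_plain("ATGCTACG", [(1, 7), (3, 6)], "AT"))
```

Splitting before an arc is safe because, with nested arcs and no shared endpoints, no arc can join the part left of the arc to the part under it.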
39Example
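- S1 = ATGCTACG (positions 1 2 3 4 5 6 7 8)
- S2 = AT
- Decomposition of the interval (1,8) of S1:
- (1,8) splits into (1,1) and (2,8)
- (2,8) is an arc; its inside is (3,7)
- (3,7) splits into (3,3) and (4,7)
- (4,7) is an arc; its inside is (5,6)
- (5,6): position 6 is free, leaving (5,5)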
40Example bottom up
- Table for (5,5): T[1,1] = DP(5,5; 1,1), T[1,2] = DP(5,5; 1,2), T[2,2] = DP(5,5; 2,2)
- (5,6): 6 is free, so the simple LCS step applies
- DP(i, i'; j, j') = max { DP(i, i'-1; j, j'-1) + χ(S1[i'], S2[j']), DP(i, i'-1; j, j'), DP(i, i'; j, j'-1) }
- DP(5,6; 1,2) = max { DP(5,5; 1,1) + χ(S1[6], S2[2]), DP(5,5; 1,2), DP(5,6; 1,1) }
41Example bottom up
- (4,7) is an arc, so the arc case applies
- DP(i1, i2; j, j') = max { DP(i1+1, i2-1; j+1, j') + χ(S1[i1], S2[j]), DP(i1+1, i2-1; j, j'-1) + χ(S1[i2], S2[j']), DP(i1+1, i2-1; j, j'), DP(i1, i2; j, j'-1), DP(i1, i2; j+1, j') }
- DP(4,7; 1,2) = max { DP(5,6; 2,2) + χ(S1[4], S2[1]), DP(5,6; 1,1) + χ(S1[7], S2[2]), DP(5,6; 1,2), DP(4,7; 1,1), DP(4,7; 2,2) }
42Example bottom up
- (3,7): 7 = u(7)r and 3 < u(7)l, so the interval splits into (3,3) and (4,7)
- DP(i, i'; j, j') = max over j ≤ j'' ≤ j'+1 of { DP(i, u(i')l-1; j, j''-1) + DP(u(i')l, i'; j'', j') }
- DP(3,7; 1,2) = max { DP(3,3; 1,0) + DP(4,7; 1,2), DP(3,3; 1,1) + DP(4,7; 2,2), DP(3,3; 1,2) + DP(4,7; 3,2) }
43Example bottom up
- (2,8) is an arc, so the arc case applies
- DP(i1, i2; j, j') = max { DP(i1+1, i2-1; j+1, j') + χ(S1[i1], S2[j]), DP(i1+1, i2-1; j, j'-1) + χ(S1[i2], S2[j']), DP(i1+1, i2-1; j, j'), DP(i1, i2; j, j'-1), DP(i1, i2; j+1, j') }
- DP(2,8; 1,2) = max { DP(3,7; 2,2) + χ(S1[2], S2[1]), DP(3,7; 1,1) + χ(S1[8], S2[2]), DP(3,7; 1,2), DP(2,8; 1,1), DP(2,8; 2,2) }
44Example bottom up
- (1,8): 8 = u(8)r and 1 < u(8)l, so the interval splits into (1,1) and (2,8)
- DP(i, i'; j, j') = max over j ≤ j'' ≤ j'+1 of { DP(i, u(i')l-1; j, j''-1) + DP(u(i')l, i'; j'', j') }
- ANS = DP(1,8; 1,2) = max { DP(1,1; 1,0) + DP(2,8; 1,2), DP(1,1; 1,1) + DP(2,8; 2,2), DP(1,1; 1,2) + DP(2,8; 3,2) }
45Time Complexity
- Table size: m(m-1)/2 = O(m²)
- Number of tables
- Possible (i, i')
- Arc: at most n/2 = O(n)
- Inside arc: at most as many as arcs
- Free: at most O(n)
- Table entries
- O(n) · O(m²) = O(nm²)
46Time Complexity
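- Computing an entry costs at most
- O(m)
- Time complexity
- O(m) · O(nm²) = O(nm³)
- The O(m) cost per entry comes from the split case: DP(i, i'; j, j') = max over j ≤ j'' ≤ j'+1 of { DP(i, u(i')l-1; j, j''-1) + DP(u(i')l, i'; j'', j') }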
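Put together, the count on these two slides amounts to the product below; the labels under the braces are only a reading aid.

```latex
\[
\underbrace{O(n)}_{\text{intervals } (i,\,i')}
\;\times\;
\underbrace{O(m^{2})}_{\text{pairs } (j,\,j')}
\;\times\;
\underbrace{O(m)}_{\text{split points } j''}
\;=\; O(nm^{3})
\]
```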
47Extend LCS(nested, plain) Algorithm
- Extend to LCS(nested, chain)
- Add two new values α, β to DP(i, i'; j, j')
- DP(i, i'; j, j'; α, β)
- Extend to LCS(crossing, nested)
- Restrict the cut-width to a constant k
- Add k pairs (αi, βi) to DP(i, i'; j, j')
48LCS(nested, chain)Notation
- "-" denotes nothing
- j*: the rightmost position of [j, j'-1] other than α, β
49Modification (I)
- If i' is free and j' = u(j')l,
- DP(i, i'; j, j'; α, -) = max {
- DP(i, i'-1; j, j*; α', -) + χ(S1[i'], S2[j']),
- DP(i, i'-1; j, j'; α, -),
- DP(i, i'; j, j*; α', -) }
- DP(i, i'; j, j'; α, j') = DP(i, i'; j, j'; α, -)
- where α' = α if α < j*, else α' = -
50Modification (II)
- If i' is free and j' = u(j')r (≠ α),
- DP(i, i'; j, j'; α, -) = max {
- DP(i, i'-1; j, j*; α, β') + χ(S1[i'], S2[j']),
- DP(i, i'-1; j, j'; α, -),
- DP(i, i'; j, j'-1; α, -) }
- DP(i, i'; j, j'; α, j') = DP(i, i'; j, j'-1; α, -)
- where β' = u(j')l if j ≤ u(j')l ≤ j*, else β' = -
51Modification (III)
- If i' is free, in the remaining cases,
- DP(i, i'; j, j'; α, β) = max {
- DP(i, i'-1; j, j*; α', β') + χ(S1[i'], S2[j']),
- DP(i, i'-1; j, j'; α, β),
- DP(i, i'; j, j*; α', β') }
- where α' = α if α < j*, else α' = -
- and β' = β if β < j*, else β' = -
52Modification (IV)
- If i' isn't free, modify Phase 2 and the second step with similar actions.
- One entry DP(i, i'; j, j') extends to at most 4 entries.
- Every entry can be computed in O(m) time from its preceding entries.
- So the time complexity of LCS(nested, chain) is O(nm³)
53LCS(crossing, nested)
- By similar modifications, (α1, β1), ..., (αk, βk) are used with cut-width k.
- So the time complexity is O(4^k nm³)