The Longest Common Subsequence Problem for Arc-annotated Sequences, Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang

Provided by: kevi96
1
The Longest Common Subsequence Problem for Arc-annotated Sequences
Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang
  • B89902003 ???
  • B89902005 ???
  • B89902027 ???

2
Overview
  • Arc-annotated sequence usage
  • The secondary and tertiary structure of RNA
  • Protein sequence
  • Solve the open questions in
  • P.A. Evans, Algorithms and Complexity for
    Annotated Sequence Analysis, Ph.D. Thesis,
    University of Victoria, 1999.
  • P.A. Evans, Finding common subsequences with
    pseudoknots, in Proceedings of the 10th Annual
    Symposium on Combinatorial Pattern Matching
    (CPM'99), LNCS 1645, pp. 270-280.

3
Definitions (I)
  • Symbol definition
  • $S$: the sequence
  • $P$: the arc set, a set of pairs of positions in $S$
  • Arc definition
  • The pair $(S, P)$ is called an arc-annotated sequence

4
Definitions (II)
  • Arc-preserving
  • The arc mapping is kept when performing LCS
  • Cutwidth
  • The number of arcs crossing a position
  • Arc-cutwidth
  • The maximum cutwidth over all positions of the sequence
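The two cutwidth definitions can be sketched in a few lines of Python. The slide does not say exactly when an arc "crosses" a position, so this sketch assumes an arc (l, r) crosses position p when l <= p < r; the function names and the example arcs are illustrative only.

```python
def cutwidth_at(arcs, p):
    """Number of arcs (l, r) spanning the gap right after position p.

    Assumption for this sketch: an arc "crosses" p when l <= p < r.
    """
    return sum(1 for l, r in arcs if l <= p < r)


def arc_cutwidth(n, arcs):
    """Maximum cutwidth over positions 1..n of a length-n sequence."""
    return max((cutwidth_at(arcs, p) for p in range(1, n + 1)), default=0)


# Hypothetical example: a length-8 sequence with two nested arcs.
example_arcs = [(2, 8), (4, 7)]
print(arc_cutwidth(8, example_arcs))  # 2: both arcs pass over positions 4 to 6
```

Under a different reading of "crossing" (for example, counting arcs that end at p as well), only the comparison inside `cutwidth_at` would change.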

5
Restrictions (I)
  • The problem is NP-hard if there are no restrictions on
    arc annotations
  • Fortunately, RNA and protein sequences satisfy
    some constraints

6
Restrictions (II)
  • 1. No sharing of endpoints
  • 2. No crossing
  • 3. No nesting
  • 4. No arcs

7
Restrictions (III)
  • Five levels
  • Unlimited: no restrictions
  • Crossing: restriction 1
  • Nested: restrictions 1, 2
  • Chain: restrictions 1, 2, 3
  • Plain: restriction 4
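As a rough illustration of the five levels, the following Python sketch returns the most restrictive level that a given arc set satisfies. The function name `restriction_level` and the representation of arcs as pairs of positions are assumptions made for this example.

```python
from itertools import combinations


def restriction_level(arcs):
    """Return the most restrictive of the five levels that `arcs` satisfies."""
    arcs = [tuple(sorted(a)) for a in arcs]
    if not arcs:
        return "plain"              # restriction 4: no arcs at all
    endpoints = [p for a in arcs for p in a]
    if len(endpoints) != len(set(endpoints)):
        return "unlimited"          # restriction 1 (no shared endpoints) fails
    pairs = list(combinations(arcs, 2))
    if any(a < c < b < d or c < a < d < b for (a, b), (c, d) in pairs):
        return "crossing"           # restriction 2 (no crossing) fails
    if any(a < c < d < b or c < a < b < d for (a, b), (c, d) in pairs):
        return "nested"             # restriction 3 (no nesting) fails
    return "chain"


print(restriction_level([(1, 4), (2, 3)]))  # nested
print(restriction_level([(1, 3), (2, 4)]))  # crossing
```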

8
Result (I)
$|S_1| = n$, $|S_2| = m$
9
Result (II)
  • LCS (crossing, crossing)
  • 2-approximation algorithm
  • LCS (crossing, plain)
  • MAX SNP-hard
  • LCS (nested, plain)
  • Dynamic programming algorithm

10
LCS (crossing, crossing)-def (I)
  • $(S_1, P_1)$, $(S_2, P_2)$
  • Arc-annotated sequences
  • $Y$
  • The result of the classical LCS of $S_1$ and $S_2$
  • $|Y| = L$
  • $M$
  • The mapping between $S_1$ and $S_2$ induced by $Y$
  • $M = \{(i_1, j_1), (i_2, j_2), \ldots\}$

11
LCS (crossing, crossing)-def (II)
  • Graph $G_M$
  • Each pair $(i_k, j_k)$, $(i_l, j_l)$ of $M$ is a vertex
  • The maximum vertex degree of $G_M$ is at most 2

12
LCS (crossing, crossing)- Algo
13
LCS (crossing, crossing)- Result
  • LCS (crossing, crossing) has a 2-approximation
    algorithm with time complexity O(nm)
  • LCS (crossing, nested), LCS (crossing, chain),
    and LCS (crossing, plain) also have a 2-approximation
    algorithm with time complexity O(nm)
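The algorithm slide itself did not survive in the transcript, so the following Python sketch is only one plausible reading of the idea: compute a classical LCS, build the conflict graph on its matched pairs, and keep the larger colour class of each component. The edge rule (two pairs conflict when exactly one of the two sequences has an arc between them) and the 2-colouring step are assumptions of this sketch, as are all function names. The sketch also does not aim for the O(nm) bound, since it compares all pairs of matches.

```python
def lcs_mapping(s1, s2):
    """Classical LCS; returns the matched index pairs (0-based)."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:              # walk back through the table
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def approx_lapcs(s1, arcs1, s2, arcs2):
    """Arc-preserving mapping with at least half the classical LCS length.

    Assumes no position of s1 or s2 is an endpoint of more than one arc.
    """
    mapping = lcs_mapping(s1, s2)
    p1 = {frozenset(a) for a in arcs1}
    p2 = {frozenset(a) for a in arcs2}
    # Conflict graph: join two matches when exactly one side has an arc.
    adj = {v: [] for v in mapping}
    for x in range(len(mapping)):
        for y in range(x + 1, len(mapping)):
            (i1, j1), (i2, j2) = mapping[x], mapping[y]
            if (frozenset((i1, i2)) in p1) != (frozenset((j1, j2)) in p2):
                adj[mapping[x]].append(mapping[y])
                adj[mapping[y]].append(mapping[x])
    # Degree is at most 2 and edges alternate between the two arc sets,
    # so each component is a path or an even cycle and can be 2-coloured.
    colour, keep = {}, []
    for start in mapping:
        if start in colour:
            continue
        colour[start] = 0
        stack, comp = [start], [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in colour:
                    colour[w] = 1 - colour[v]
                    stack.append(w)
                    comp.append(w)
        side0 = [v for v in comp if colour[v] == 0]
        side1 = [v for v in comp if colour[v] == 1]
        keep.extend(side0 if len(side0) >= len(side1) else side1)
    return sorted(keep)


print(approx_lapcs("ACGU", [(0, 3)], "ACGU", []))
```

In the usage line, the arc between the first and last positions of the first sequence has no counterpart in the second, so one of the two end matches has to be dropped.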

14
LCS (unlimited, plain)
  • Prove that it cannot be approximated within ratio
  • Lemma 1
  • MaxIS-B is MAX SNP-complete when $B \ge 3$
  • Lemma 2
  • MaxIS-Cubic is MAX SNP-complete

15
Proof of Lemma 2 (I)
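  • L-reduction from MaxIS-3 to MaxIS-Cubic
  • $G = (V, E)$: an instance of MaxIS-3 with $n$ vertices
  • $i$ vertices of degree 1
  • $j$ vertices of degree 2
  • $n - i - j$ vertices of degree 3
  • $\mathrm{opt}(G)$: the size of a maximum independent set (IS) of $G$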

16
Proof of Lemma 2 (II)
  • Trivially, $\mathrm{opt}(G) \ge n/4$
  • $i + j \le 4\,\mathrm{opt}(G)$
  • $G'$: an instance of MaxIS-Cubic
  • $\mathrm{opt}(G')$: the size of a maximum IS of $G'$
  • Goal: construct $G'$ from $G$ and a special graph $H$

17
Proof of Lemma 2 (III)
  • Graph $H$ is built as follows
  • Number of triangles: $2i + j$
  • Cycle size: $2(2i + j)$

18
Proof of Lemma 2 (IV)
  • $H$ has a maximum IS of size $2(2i + j)$
  • Construct $G'$
  • Connect each degree-1 vertex of $G$ to two free vertices
    in $H$
  • Connect each degree-2 vertex of $G$ to one free vertex
    in $H$
  • $G'$ is a cubic graph

19
Proof of Lemma 2 (V)
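  • $k' = \mathrm{opt}(G) + 2(2i + j)$
  • $k'$: the size of the IS of $G'$ obtained from one maximum IS of $G$
  • $\mathrm{opt}(G') \ge \mathrm{opt}(G) + 2(2i + j)$ ... (1)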

20
Proof of Lemma 2 (VI)
  • The other direction
  • $V'$: an IS of $G'$, $|V'| = k'$
  • Deleting the vertices of $V'$ that are in $H$
    gives an IS of $G$ with size $k$
  • At most $2(2i + j)$ vertices of $V'$ are in $H$
  • $k \ge k' - 2(2i + j)$ ... (2)

21
Proof of Lemma 2 (VII)
  • From (1)
  • From (2)
  • L-reduction o.k.
  • MaxIS-Cubic is Max SNP-complete
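The formulas for the two "From" steps are missing from the transcript. One way to fill them in, assuming (1) is $\mathrm{opt}(G') \ge \mathrm{opt}(G) + 2(2i+j)$, (2) is $k \ge k' - 2(2i+j)$, and using $i + j \le 4\,\mathrm{opt}(G)$, is sketched below. Applying (2) to a maximum IS of $G'$ gives the reverse of (1), so $\mathrm{opt}(G') = \mathrm{opt}(G) + 2(2i+j)$. The constants 17 and 1 come from this sketch, not from the slides.

```latex
\begin{align*}
\mathrm{opt}(G') &= \mathrm{opt}(G) + 2(2i+j)
  \le \mathrm{opt}(G) + 4(i+j)
  \le 17\,\mathrm{opt}(G), \\
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) - k' + 2(2i+j)
  = \mathrm{opt}(G') - k'.
\end{align*}
```

The first line bounds the optimum of the new instance by a constant times the old optimum, and the second bounds the error on $G$ by the error on $G'$; these are the two conditions an L-reduction requires.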

22
Proof of LCS(unlimited, plain)(I)
  • Show that MaxIS can be L-reduced to LCS(unlimited,
    plain)
  • MaxIS cannot be approximated

23
Proof of LCS(unlimited, plain) (II)
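  • $G = (V, E)$: an instance of MaxIS, $n = |V|$
  • $I$: the instance of LCS consists of
  • $S_1 = a^n$ with $P_1 = E$
  • $S_2 = a^n$ with $P_2 = \emptyset$
  • An independent set $V' = \{v_{i_1}, \ldots, v_{i_k}\}$ corresponds
    one-to-one to the arc-preserving common subsequence consisting of
    the $i_1$-th, ..., $i_k$-th a's from $S_1$
  • So LCS(unlimited, plain) includes MaxIS as a subproblem.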
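The reduction can be made concrete with a small Python sketch. It builds the two all-a sequences from a graph and checks, by brute force on tiny inputs, that the best arc-preserving choice has the size of a maximum independent set. The function names and the 1-based vertex numbering are assumptions of this example; the brute force is exponential and only meant as an illustration.

```python
from itertools import combinations


def maxis_to_lcs_instance(n, edges):
    """Build ((S1, P1), (S2, P2)) from a graph on vertices 1..n."""
    s1 = "a" * n
    s2 = "a" * n
    return (s1, {frozenset(e) for e in edges}), (s2, set())


def is_arc_preserving_choice(positions, p1):
    """With P2 empty, chosen S1 positions are arc-preserving exactly
    when no two of them are joined by an arc of P1."""
    return all(frozenset(pair) not in p1
               for pair in combinations(positions, 2))


def brute_force_lapcs(n, p1):
    """Size of the largest arc-preserving choice (tiny inputs only)."""
    for size in range(n, -1, -1):
        for chosen in combinations(range(1, n + 1), size):
            if is_arc_preserving_choice(chosen, p1):
                return size
    return 0


# Hypothetical example: the path 1-2-3-4, whose maximum IS has size 2.
(s1, p1), (s2, p2) = maxis_to_lcs_instance(4, [(1, 2), (2, 3), (3, 4)])
print(s1, s2, brute_force_lapcs(4, p1))
```

Because both sequences consist of the same letter, any k chosen positions of S1 can be matched to the first k letters of S2, so only the arc condition matters.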

24
Corollary
  • LCS(unlimited, chain), LCS(unlimited, nested),
    and LCS(unlimited, unlimited) cannot be
    approximated within ratio

25
LCS(crossing, plain) is MAX SNP-hard
  • Use an L-reduction from MaxIS-Cubic to
    LCS(crossing, plain)
  • $G = (V, E)$ is a cubic graph, $n = |V|$
  • For $S_1$: construct a segment $T_u$ of letters
    aaaabbccc for each vertex $u \in V$
  • For each edge $(u, v)$, introduce an arc between a c
    from $T_u$ and a c from $T_v$; each letter c can be used
    only once

26
Instance I constructed from cubic graph G
  • S2 is obtained by concatenating n identical
    segments of aaaacccbb

27
Proof(1)
  • $\mathrm{opt}(I) = \mathrm{opt}(G) + 6n$
  • Assume $Y$ is an arc-preserving common subsequence
    of length $k$ for $(S_1, P_1)$ and $(S_2, P_2)$
  • (1) the four a's of each segment should be matched
  • (2) within a segment, if a b is matched then no c is
    matched, and vice versa

28
Proof(2)
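  • Define a subset $V'$ of vertices of $G$: for every
    segment $T_u$ in sequence $S_1$, if all three of its c's
    are matched, we put $u$ in $V'$
  • $V'$ is an independent set of $G$; let $k' = |V'|$
  • $k' \ge k - 6n$, and $n/4 \le \mathrm{opt}(G) \le n/2$
  • $\mathrm{opt}(I) = \mathrm{opt}(G) + 6n \le 25\,\mathrm{opt}(G)$ ... (a)
  • $|k' - \mathrm{opt}(G)| \le |k - \mathrm{opt}(I)|$ ... (b)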

29
Proof(3)
  • Inequalities (a) and (b) show that the reduction is an
    L-reduction; thus LCS(crossing, plain) is
    MAX SNP-hard
  • LCS(crossing, chain), LCS(crossing, nested),
    LCS(crossing, crossing) are all MAX SNP-hard

30
Notes
  • With an additional constraint:
  • for any $(i_1, j_1)$ in the mapping, if $(i_1, i_2) \in P_1$
    then, for some $j_2$, $(i_2, j_2)$ is in the
    mapping, and if $(j_1, j_2) \in P_2$ then, for some
    $i_2$, $(i_2, j_2)$ is in the mapping.
  • Under this definition, LCS(crossing, crossing) is
    NP-hard and LCS(crossing, nested) is solvable in
    polynomial time

31
LCS(nested, plain)
  • Input
  • Given a pair (S1, P1) and (S2, Ø) of
    arc-annotated sequences with P1 being nested
  • Output
  • The length of a longest arc-preserving common (LAPC)
    subsequence for the pair (there is no arc on the LAPC
    subsequence)

32
Denote
  • $n = |S_1|$
  • $m = |S_2|$
  • $u(i)$ denotes the arc in $P_1$ incident on position $i$
    of sequence $S_1$; $u(i)_l$ and $u(i)_r$ are its left and
    right endpoints
  • If $u(i)$ does not exist, we call $i$ free
  • $\chi(S_1[i], S_2[j]) = 1$ if $S_1[i] = S_2[j]$, and 0
    otherwise

33
Dynamic Programming Algorithm
-Alas, I know little about dynamic programming
-but I know divide-and-conquer
-pang feng says DP is bottom-up, DC is top-down
34
Divide and Conquer algorithm
  • Two functions
  • The arc-case function $\mathrm{DP}(i_1, i_2; j, j')$ gives the length
    of a LAPC subsequence for the pair $(S_1[i_1, i_2], P_1[i_1, i_2])$ and
    $(S_2[j, j'], \emptyset)$, if and only if $i_1 = u(i_2)_l$
  • The general function $\mathrm{DP}(i, i'; j, j')$ gives the length
    of a LAPC subsequence for the pair $(S_1[i, i'], P_1[i, i'])$ and
    $(S_2[j, j'], \emptyset)$, if and only if $i < u(i')_l$ or $i'$ is
    free
-how?
35
Divide and Conquer algorithm
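  • General function $\mathrm{DP}(i, i'; j, j')$
  • If $i'$ is free
  • $\mathrm{DP}(i, i'; j, j') = \max\{\mathrm{DP}(i, i'-1; j, j'-1) + \chi(S_1[i'], S_2[j']),\ \mathrm{DP}(i, i'-1; j, j'),\ \mathrm{DP}(i, i'; j, j'-1)\}$
  • -simple LCS algorithm
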
36
General function $\mathrm{DP}(i, i'; j, j')$
  • Else if $i' = u(i')_r$ and $i < u(i')_l$
  • $\mathrm{DP}(i, i'; j, j') = \max_{j \le j'' \le j'+1} \{\mathrm{DP}(i, u(i')_l - 1; j, j''-1) + \mathrm{DP}(u(i')_l, i'; j'', j')\}$

37
General function $\mathrm{DP}(i, i'; j, j')$
  • Else ($i = u(i')_l$)
  • Just use the arc-case function $\mathrm{DP}(i, i'; j, j')$

38
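Arc-case function $\mathrm{DP}(i_1, i_2; j, j')$
  • $\mathrm{DP}(i_1, i_2; j, j') = \max$ of
  • $\mathrm{DP}(i_1+1, i_2-1; j+1, j') + \chi(S_1[i_1], S_2[j])$
  • $\mathrm{DP}(i_1+1, i_2-1; j, j'-1) + \chi(S_1[i_2], S_2[j'])$
  • $\mathrm{DP}(i_1+1, i_2-1; j, j')$
  • $\mathrm{DP}(i_1, i_2; j, j'-1)$
  • $\mathrm{DP}(i_1, i_2; j+1, j')$
-merge the two functions into a single DP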
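The merged recurrence can be sketched as a memoized top-down function in Python. This is an illustrative reading of the slides, not the authors' code: indices are 0-based, windows are inclusive, an arc whose left end lies outside the current window is treated as absent, and the name `lapcs_nested_plain` is made up for this example.

```python
from functools import lru_cache


def lapcs_nested_plain(s1, arcs, s2):
    """Length of a longest arc-preserving common subsequence when only
    s1 carries arcs (nested, no shared endpoints) and s2 has none."""
    left_of = {r: l for l, r in arcs}   # right endpoint -> left endpoint

    def chi(a, b):
        return 1 if s1[a] == s2[b] else 0

    @lru_cache(maxsize=None)
    def dp(i, i2, j, j2):
        # Best length for s1[i..i2] against s2[j..j2], both inclusive.
        if i > i2 or j > j2:
            return 0
        l = left_of.get(i2)
        if l is None or l < i:
            # i2 is free inside this window: plain LCS step.
            return max(dp(i, i2 - 1, j, j2 - 1) + chi(i2, j2),
                       dp(i, i2 - 1, j, j2),
                       dp(i, i2, j, j2 - 1))
        if l > i:
            # Split s1 in front of the arc (l, i2); try every cut of s2.
            return max(dp(i, l - 1, j, k - 1) + dp(l, i2, k, j2)
                       for k in range(j, j2 + 2))
        # l == i: the window is exactly one arc, so at most one of its
        # two endpoints may be matched.
        return max(dp(i + 1, i2 - 1, j + 1, j2) + chi(i, j),
                   dp(i + 1, i2 - 1, j, j2 - 1) + chi(i2, j2),
                   dp(i + 1, i2 - 1, j, j2),
                   dp(i, i2, j, j2 - 1),
                   dp(i, i2, j + 1, j2))

    return dp(0, len(s1) - 1, 0, len(s2) - 1)


# Sequences and arcs as read from the example slides: arcs (2,8) and
# (4,7) in 1-based positions become (1,7) and (3,6) here.
print(lapcs_nested_plain("ATGCTACG", [(1, 7), (3, 6)], "AT"))  # 2
```

A bottom-up version would fill one table per window of S1 in the order used by the example slides; the memoized form is shorter and visits the same subproblems.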
39
Example
  • $S_1$ = ATGCTACG (positions 1 to 8), $S_2$ = AT
  • Top down approach
  • (1,8) splits into (1,1) and (2,8)
  • (2,8) reduces to (3,7)
  • (3,7) splits into (3,3) and (4,7)
  • (4,7) reduces to (5,6)
  • (5,6) reduces to (5,5)

40
Example bottom up
  • Innermost interval (5,5): $T[1,1] = \mathrm{DP}(5,5;1,1)$, $T[1,2] = \mathrm{DP}(5,5;1,2)$, $T[2,2] = \mathrm{DP}(5,5;2,2)$
  • (5,6): 6 is free
  • $\mathrm{DP}(i, i'; j, j') = \max\{\mathrm{DP}(i, i'-1; j, j'-1) + \chi(S_1[i'], S_2[j']),\ \mathrm{DP}(i, i'-1; j, j'),\ \mathrm{DP}(i, i'; j, j'-1)\}$
  • $\mathrm{DP}(5,6;1,2) = \max\{\mathrm{DP}(5,5;1,1) + \chi(S_1[6], S_2[2]),\ \mathrm{DP}(5,5;1,2),\ \mathrm{DP}(5,6;1,1)\}$

41
Example bottom up
  • (4,7) is an arc, so the arc-case recurrence applies
  • $\mathrm{DP}(i_1, i_2; j, j') = \max\{\mathrm{DP}(i_1+1, i_2-1; j+1, j') + \chi(S_1[i_1], S_2[j]),\ \mathrm{DP}(i_1+1, i_2-1; j, j'-1) + \chi(S_1[i_2], S_2[j']),\ \mathrm{DP}(i_1+1, i_2-1; j, j'),\ \mathrm{DP}(i_1, i_2; j, j'-1),\ \mathrm{DP}(i_1, i_2; j+1, j')\}$
  • $\mathrm{DP}(4,7;1,2) = \max\{\mathrm{DP}(5,6;2,2) + \chi(S_1[4], S_2[1]),\ \mathrm{DP}(5,6;1,1) + \chi(S_1[7], S_2[2]),\ \mathrm{DP}(5,6;1,2),\ \mathrm{DP}(4,7;1,1),\ \mathrm{DP}(4,7;2,2)\}$

42
Example bottom up
  • (3,7): $7 = u(7)_r$ and $3 < u(7)_l$, so the interval splits into (3,3) and (4,7)
  • $\mathrm{DP}(i, i'; j, j') = \max_{j \le j'' \le j'+1} \{\mathrm{DP}(i, u(i')_l - 1; j, j''-1) + \mathrm{DP}(u(i')_l, i'; j'', j')\}$
  • $\mathrm{DP}(3,7;1,2) = \max\{\mathrm{DP}(3,3;1,0) + \mathrm{DP}(4,7;1,2),\ \mathrm{DP}(3,3;1,1) + \mathrm{DP}(4,7;2,2),\ \mathrm{DP}(3,3;1,2) + \mathrm{DP}(4,7;3,2)\}$

43
Example bottom up
  • (2,8) is an arc, so the arc-case recurrence applies
  • $\mathrm{DP}(i_1, i_2; j, j') = \max\{\mathrm{DP}(i_1+1, i_2-1; j+1, j') + \chi(S_1[i_1], S_2[j]),\ \mathrm{DP}(i_1+1, i_2-1; j, j'-1) + \chi(S_1[i_2], S_2[j']),\ \mathrm{DP}(i_1+1, i_2-1; j, j'),\ \mathrm{DP}(i_1, i_2; j, j'-1),\ \mathrm{DP}(i_1, i_2; j+1, j')\}$
  • $\mathrm{DP}(2,8;1,2) = \max\{\mathrm{DP}(3,7;2,2) + \chi(S_1[2], S_2[1]),\ \mathrm{DP}(3,7;1,1) + \chi(S_1[8], S_2[2]),\ \mathrm{DP}(3,7;1,2),\ \mathrm{DP}(2,8;1,1),\ \mathrm{DP}(2,8;2,2)\}$

44
Example bottom up
  • (1,8): $8 = u(8)_r$ and $1 < u(8)_l$, so the interval splits into (1,1) and (2,8)
  • $\mathrm{DP}(i, i'; j, j') = \max_{j \le j'' \le j'+1} \{\mathrm{DP}(i, u(i')_l - 1; j, j''-1) + \mathrm{DP}(u(i')_l, i'; j'', j')\}$
  • Answer: $\mathrm{DP}(1,8;1,2) = \max\{\mathrm{DP}(1,1;1,0) + \mathrm{DP}(2,8;1,2),\ \mathrm{DP}(1,1;1,1) + \mathrm{DP}(2,8;2,2),\ \mathrm{DP}(1,1;1,2) + \mathrm{DP}(2,8;3,2)\}$

45
Time Complexity
  • Table size: $m(m-1)/2 = O(m^2)$
  • Number of tables
  • Possible $(i, i')$
  • Arc: at most $n/2 = O(n)$
  • Inside arc: at most as many as arcs
  • Free: at most $O(n)$
  • Table entries
  • $O(n) \cdot O(m^2) = O(nm^2)$

46
Time Complexity
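  • Computing an entry costs at most
  • $O(m)$
  • Time complexity
  • $O(m) \cdot O(nm^2) = O(nm^3)$
  • The costliest case is the split recurrence $\mathrm{DP}(i, i'; j, j') = \max_{j''} \{\mathrm{DP}(i, u(i')_l - 1; j, j''-1) + \mathrm{DP}(u(i')_l, i'; j'', j')\}$
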
47
Extend LCS(nested, plain) Algorithm
  • Extend to LCS(nested, chain)
  • Add two new values $\alpha, \beta$ to $\mathrm{DP}(i, i'; j, j')$
  • $\mathrm{DP}(i, i'; j, j'; \alpha, \beta)$
  • Extend to LCS(crossing, nested)
  • Restrict the cut-width to a constant $k$
  • Add $k$ pairs $(\alpha_i, \beta_i)$ to $\mathrm{DP}(i, i'; j, j')$

48
LCS(nested, chain) Notation
  • "-" denotes nothing
  • ?: the rightmost position of $[j, j'-1]$ except $\alpha, \beta$

49
Modification (I)
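  • If $i'$ is free and $j' = u(j')_l$,
  • $\mathrm{DP}(i, i'; j, j'; \alpha, -) = \max$ of
  • $\mathrm{DP}(i, i'-1; j, ?; \alpha, -) + \chi(S_1[i'], S_2[j'])$
  • $\mathrm{DP}(i, i'-1; j, j'; \alpha, -)$
  • $\mathrm{DP}(i, i'; j, ?; \alpha, -)$
  • $\mathrm{DP}(i, i'; j, j'; \alpha, j') = \mathrm{DP}(i, i'; j, j'; \alpha, -)$
  • If $\alpha <$ ?,
  • $\alpha$ is kept
  • else $\alpha$ becomes -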

50
Modification (II)
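  • If $i'$ is free and $j' = u(j')_r$ ($\ne \alpha$),
  • $\mathrm{DP}(i, i'; j, j'; \alpha, -) = \max$ of
  • $\mathrm{DP}(i, i'-1; j, ?; \alpha, \beta) + \chi(S_1[i'], S_2[j'])$
  • $\mathrm{DP}(i, i'-1; j, j'; \alpha, -)$
  • $\mathrm{DP}(i, i'; j, j'-1; \alpha, -)$
  • $\mathrm{DP}(i, i'; j, j'; \alpha, j') = \mathrm{DP}(i, i'; j, j'-1; \alpha, -)$
  • If $j \le u(j')_l <$ ?,
  • $\beta = u(j')_l$
  • else $\beta$ = -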

51
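Modification (III)
  • If $i'$ is free, in the remaining cases
  • $\mathrm{DP}(i, i'; j, j'; \alpha, \beta) = \max$ of
  • $\mathrm{DP}(i, i'-1; j, ?; \alpha, \beta) + \chi(S_1[i'], S_2[j'])$
  • $\mathrm{DP}(i, i'-1; j, j'; \alpha, \beta)$
  • $\mathrm{DP}(i, i'; j, ?; \alpha, \beta)$
  • If $\alpha <$ ?,
  • $\alpha$ is kept, else $\alpha$ becomes -
  • If $\beta <$ ?,
  • $\beta$ is kept, else $\beta$ becomes -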

52
Modification (IV)
  • If $i'$ isn't free, modify Phase 2 and the second
    step with similar actions.
  • One entry $\mathrm{DP}(i, i'; j, j')$ extends to at most 4
    entries.
  • Every entry can be computed in $O(m)$ time from its
    preceding entries.
  • So the time complexity of LCS(nested, chain) is
    $O(nm^3)$

53
LCS(crossing, nested)
  • By similar modifications, $(\alpha_1, \beta_1), \ldots, (\alpha_k, \beta_k)$
    are used with cut-width $k$.
  • So, the time complexity is $O(4^k nm^3)$