The Longest Common Subsequence Problem and Its Variants

- http://www.nsysu.edu.tw

Outline

- Introduction to Bioinformatics
- Traditional LCS Algorithms
- Our Works
- Block Edit Problems
- LCS of Run-Length Encoded Strings
- Merged LCS Problem
- Mosaic LCS Problem
- Conclusions

Introduction to Bioinformatics

DNA and RNA

- Nucleotides
- Adenine (A)
- Guanine (G)
- Cytosine (C)
- Thymine (T)
- Uracil (U)
- DNA (deoxyribonucleic acid)
- A, G, C, T (base pairs: G-C, A-T)
- RNA (ribonucleic acid)
- A, G, C, U (base pairs: G-C, A-U, G-U)

DNA Double Helix

DNA Length

- The total length of the human DNA is about $3 \times 10^9$ (3 billion) base pairs.
- About 1% to 1.5% of the DNA sequence is useful.
- Number of human genes: 30,000 to 40,000
- Conclusion from the Human Genome Project (1990-2003)
- The number originally expected was 100,000.

From DNA via RNA to Protein

DNA, Genes and Proteins

- DNA: the program for cell processes
- Proteins: execute cell processes

Promoter and Gene

Amino Acids

Amino acids are the building blocks of proteins; there are 20 kinds.

Protein Structure

Traditional Dynamic Programming (DP) for the Longest Common Subsequence (LCS) Problem

The Longest Common Subsequence (LCS) Problem

- A string: S1 = TAGTCACG
- A subsequence of S1: obtained by deleting 0 or more symbols from S1 (not necessarily consecutive).
- e.g. G, AGC, TATC, AGACG
- Common subsequences of S1 = TAGTCACG and S2 = AGACTGTC:
- GG, AGC, AGACG
- Longest common subsequence (LCS):
- S1 = TAGTCACG
- S2 = AGACTGTC
- LCS = AGACG

Applications of LCS

- The edit distance of two strings or files (number of deletions and insertions).
- S1: TAGTCAC-G--
- S2: -AG--ACTGTC
- Operation: DMMDDMMIMII (D = delete, M = match, I = insert)
- Spoken word recognition
- Similarity of two biological sequences (DNA or protein)
- Sequence alignment

The Traditional LCS Algorithm

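- $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$
- $A_{i,j}$ denotes the length of the longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.
- Dynamic programming:
- $A_{i,j} = A_{i-1,j-1} + 1$ if $a_i = b_j$
- $A_{i,j} = \max\{A_{i-1,j}, A_{i,j-1}\}$ if $a_i \neq b_j$
- $A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$.
- Time complexity: $O(mn)$

[Figure: the prefixes $a_1 a_2 \cdots a_{i-1} a_i$ and $b_1 b_2 \cdots b_{j-1} b_j$ compared at their last symbols.]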
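The recurrence for the table A can be sketched directly in Python. This is a minimal illustration, not code from the slides: the function name `lcs` and the traceback that recovers one LCS are assumptions added here. When several LCSs of the same length exist, the traceback returns just one of them.

```python
def lcs(s1, s2):
    """Return (length, one longest common subsequence) of s1 and s2."""
    m, n = len(s1), len(s2)
    # A[i][j] = LCS length of s1[:i] and s2[:j]; row 0 and column 0 stay 0.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])

    # Trace back from A[m][n] to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return A[m][n], "".join(reversed(out))


length, subsequence = lcs("TAGTCACG", "AGACTGTC")
print(length, subsequence)
```

Both loops together fill (m + 1)(n + 1) cells with constant work per cell, which is where the O(mn) bound comes from.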

LCS and Edit Distance

- Edit distance $= |S_1| + |S_2| - 2\,|LCS(S_1, S_2)|$ (insertions and deletions only)
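When only insertions and deletions are allowed, the edit distance equals |S1| + |S2| minus twice the LCS length. A small sketch, assuming that setting, checks this on the running example. The names `lcs_length` and `indel_distance` are illustrative, not taken from the slides.

```python
def lcs_length(s1, s2):
    """LCS length using one rolling row of the DP table."""
    row = [0] * (len(s2) + 1)
    for a in s1:
        prev_diag = 0  # value of the previous row at column j - 1
        for j, b in enumerate(s2, 1):
            saved = row[j]
            row[j] = prev_diag + 1 if a == b else max(row[j], row[j - 1])
            prev_diag = saved
    return row[-1]


def indel_distance(s1, s2):
    """Edit distance with insertions and deletions only."""
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


# 8 + 8 - 2 * 5 = 6, matching the 3 deletions and 3 insertions
# in the operation string DMMDDMMIMII.
print(indel_distance("TAGTCACG", "AGACTGTC"))
```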

Sequence Alignment

- S1 = TAGTCACG
- S2 = AGACTGTC
- Two possible alignments:
- ----TAGTCACG        TAGTCAC-G--
- AGACT-GTC---        -AG--ACTGTC
- Which one is better?
- We can set different gap penalties as parameters for different purposes.
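To compare the two alignments, one can score each column and add the scores up. The sketch below uses a hypothetical scoring scheme (match +1, mismatch -1, gap -1 by default). These values are assumptions for illustration only, since the slide's own gap penalty definition is not reproduced in this transcript.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Score a given gapped alignment column by column ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


first = alignment_score("----TAGTCACG", "AGACT-GTC---")   # 4 matches, 8 gaps
second = alignment_score("TAGTCAC-G--", "-AG--ACTGTC")    # 5 matches, 6 gaps
print(first, second)
```

Changing the `gap` parameter changes how strongly gapped columns are punished, which is the sense in which the gap penalty acts as a tunable parameter.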

Gap Penalty for Sequence Alignment

- is the gap penalty.
- Suppose

Example for Sequence Alignment

TAGTCAC-G--

-AG--ACTGTC

PAM250 Score Matrix for Protein Alignment

MSA, ET and LCS

- Multiple sequence alignment
- LCS
- Phylogeny (evolutionary tree)

Hunt-Szymanski LCS Algorithm

- By extending the idea of the RSK (Robinson-Schensted-Knuth) algorithm for the longest increasing subsequence problem, the LCS problem can be solved in $O(r \log n)$ time, where $r$ denotes the number of matches.
- This algorithm is faster than the traditional dynamic programming if $r$ is small.

The Pairs of Matching in Hunt-Szymanski Algorithm

- Input sequences: TAGTCACG (rows) and AGACTGTC (columns)
- Pairs of matching (row, column):

Row 1 (T): (1,5) (1,7)
Row 2 (A): (2,1) (2,3)
Row 3 (G): (3,2) (3,6)
Row 4 (T): (4,5) (4,7)
Row 5 (C): (5,4) (5,8)
Row 6 (A): (6,1) (6,3)
Row 7 (C): (7,4) (7,8)
Row 8 (G): (8,2) (8,6)
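The matching pairs listed for this example can be enumerated with a few lines of Python. This is only an illustrative sketch: `matching_pairs` is a hypothetical helper name, and the quadratic scan is used for clarity rather than speed.

```python
def matching_pairs(s1, s2):
    """All 1-indexed (i, j) with s1[i] == s2[j], in row-major order."""
    return [
        (i, j)
        for i, a in enumerate(s1, 1)
        for j, b in enumerate(s2, 1)
        if a == b
    ]


pairs = matching_pairs("TAGTCACG", "AGACTGTC")
print(len(pairs))   # r, the number of matches
print(pairs[:4])
```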

Example for Hunt-Szymanski Algorithm

- The insertion order is row major and column backward.
- Time complexity: $O(r \log n)$, where $r$ is the number of matches. Each match needs $O(\log n)$ time for a binary search.

Inserted  (1,7) (1,5) (2,3) (2,1) (3,6) (3,2) (4,7) (4,5) (5,8) (5,4)
L = 1     (1,7) (1,5) (2,3) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1)
L = 2                             (3,6) (3,2) (3,2) (3,2) (3,2) (3,2)
L = 3                                         (4,7) (4,5) (4,5) (5,4)
L = 4                                                     (5,8) (5,8)
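The insertion process in this example can be sketched as follows. This is a length-only illustration of the Hunt-Szymanski idea, not the authors' code. The function name is an assumption, and `thresholds` stores only the column index of each table row rather than the full (row, column) pair.

```python
from bisect import bisect_left


def hunt_szymanski_lcs_length(s1, s2):
    """LCS length by processing matches row by row, columns backward."""
    # For each symbol, the positions where it occurs in s2, largest first.
    positions = {}
    for j in range(len(s2) - 1, -1, -1):
        positions.setdefault(s2[j], []).append(j)

    # thresholds[k] = smallest column that ends a common subsequence
    # of length k + 1 found so far (kept strictly increasing).
    thresholds = []
    for ch in s1:
        for j in positions.get(ch, ()):
            k = bisect_left(thresholds, j)  # O(log n) binary search
            if k == len(thresholds):
                thresholds.append(j)
            else:
                thresholds[k] = j
    return len(thresholds)


print(hunt_szymanski_lcs_length("TAGTCACG", "AGACTGTC"))
```

Visiting the columns of one row in decreasing order prevents two matches from the same row from extending each other, which is why the insertion order is column backward.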

Time and Space Complexities for LCS

Block Edit Problems

Motivation: Finding Similar Codes

Block Edit Problems

- Operations: block copy, block deletion and block move.
- Shapira and Storer (2002) proved that it is NP-hard when recursive block-move operations are allowed.
- Various approximations were proposed.
- Our assumptions: restricted edit sequence
- A series of edit operations are performed from left to right on the source string X.
- Any two block-edit operations would not be performed on overlapping regions of X.

A Series of Block Edit Operations

Restricted Edit Sequence

- (a) General (recursive) edit operations
- (b) Restricted edit sequence

Definitions of the Problems (1/2)

- Let P(o, c) denote a block edit problem
- o: a composition of block-edit operations
- c: the class of cost measures
- The block-copy operations
- External copy: copy a substring of X to Wi
- Internal copy: copy a valid substring of Wi-1 to Wi
- Shifted copy: copy a shifted substring

Definitions of the Problems (2/2)

- The cost measures that can be chosen
- Constant cost: $p_{copy}$
- Linear cost: $p_s + k \, p_e$
- Nested cost: $p_{copy} + d_c(A, B)$
- Three problems are defined in our work
- P(EIS,C)
- P(EI,L)
- P(EI,N)

Problem 1 -- P(EIS,C): External, Internal, Shifted, Constant

- External and internal copies are allowed in constant cost.
- Shifted copies are allowed in constant cost.
- It can be solved by a straightforward DP algorithm in O(nm2 (n m) S) time.
- We propose an $O(nm)$ time DP algorithm with
- $O(nm^2)$ preprocessing time in the worst case
- $O(nm \log m)$ preprocessing time in the average case

Recurrence DP Formula for P(EIS,C)

- Straightforward implementation: O(nm2 (n m) S) time.

Functions and Operations (1)

- Character operations
- Block deletions

Functions and Operations (2)

- External copies
- Internal copies

Functions and Operations (3)

- Shifted copies

Preprocessing for P(EIS,C)

- For external copies
- Build a suffix tree T(XRYR) to find the common substrings between X and Y.
- For internal copies
- Build a suffix tree T(YR) to find the valid common substrings to be copied from working string Wi to Wi+1.
- For shifted copies
- Compute the differential strings X' and Y' of X and Y.
- Find the valid common substrings for external / internal copies.

Preprocessing - Suffix Trees

Preprocessing: Longest Common Prefixes (LCP) and Suffix Trees

Finding and Maintaining the Range Minimum in Constant Time

Problem 2 -- P(EI,L): External, Internal, Linear

- The cost of each copy or deletion is an initial penalty plus a linear extension penalty.

Problem 3 -- P(EI,N): External, Internal, Nested

- The copied strings can be further edited with character-edit operations.

Summary of Block Edit Problems

LCS of Run-Length Encoded Strings

LCS of Run-Length Encoded Strings

- Run-length encoding (RLE) compression: aaaaabbbccccdd → a5b3c4d2
- Input:
- RLE string X: length n, k runs
- RLE string Y: length m, l runs
- Output:
- LCS between X and Y.
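Run-length encoding itself is easy to sketch. The helper names below are illustrative assumptions. The example only shows how a string of length n collapses into k runs and does not implement the LCS algorithm for RLE strings.

```python
from itertools import groupby


def rle_encode(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def rle_to_string(runs):
    """Write runs in the compact form used on the slide, e.g. a5b3c4d2."""
    return "".join(f"{symbol}{count}" for symbol, count in runs)


runs = rle_encode("aaaaabbbccccdd")
print(rle_to_string(runs))   # length n = 14, k = 4 runs
```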

Dark and Light Blocks

- Divide the DP lattice into $k \times l$ blocks.
- Dark blocks: matched blocks. Light blocks: mismatched blocks.

Results of Bunke and Csirik (1995)

- Lemma 1 (Dark block)
- Lemma 2 (Light block)
- Only the boundaries of the blocks are needed.

Results of Liu et al. (2008)

- A complex modified DP formula which computes the DP lattice row by row.
- Only the bottom boundaries of the blocks are needed.

Additional Lemmas

- Lemma 3 (Monotonicity)
- Lemma 4 (Merged light blocks) if

,

Proof of Lemma 4

Basic Idea

- C(v) denotes the number of occurrences of the matched symbol on the right side of v.
- ni denotes the length of the current run of X.

Dummy Nodes and Candidate Paths

- Some dummy nodes are considered, too.
- Divide the candidate paths into two sets.

Range Minimum / Maximum Query (RMQ)

- Given an array A and a range [i, j], find the maximum in the range [i, j].
- Can be solved in $O(n)$ preprocessing time and $O(1)$ query time.
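A constant-time range maximum query can be sketched with a sparse table. Note the swap: a sparse table needs O(n log n) preprocessing, so it is a simpler stand-in for the O(n)-preprocessing structure cited on the slide, chosen here because it fits in a few lines. The class name is an assumption.

```python
class SparseTableMax:
    """Range maximum queries: O(n log n) preprocessing, O(1) per query."""

    def __init__(self, values):
        n = len(values)
        # table[level][i] = max of values[i : i + 2**level]
        self.table = [list(values)]
        half = 1
        while 2 * half <= n:
            prev = self.table[-1]
            self.table.append(
                [max(prev[i], prev[i + half]) for i in range(n - 2 * half + 1)]
            )
            half *= 2

    def query(self, i, j):
        """Maximum of values[i..j], both ends inclusive, 0-indexed."""
        level = (j - i + 1).bit_length() - 1
        row = self.table[level]
        # Two overlapping windows of length 2**level cover [i, j].
        return max(row[i], row[j - (1 << level) + 1])


rmq = SparseTableMax([3, 1, 4, 1, 5, 9, 2, 6])
print(rmq.query(1, 4))
```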

Finding the Maximum from the Candidate Paths

- The value of u0 can be computed by Lemma 4.
- The maximum of the second set can be found by precomputing an array Li and then applying RMQ (Range Maximum Query) on it.

How Fast Is It?

- The elements needed to be computed
- Right bottom corners of all blocks.
- Bottom boundaries of the dark blocks.
- Let p1 and p2 denote the numbers of elements in the bottom and right boundaries of the dark blocks. The time complexity of our algorithm is

.

The Merged LCS Problem

Motivation -- Riffle Shuffle

Riffle Shuffle

[Figure: decks A and B are riffle-shuffled into the merged deck E(A, B); steps numbered 1 to 4.]

Relationship among Decks (1)

[Figure: relationship among deck T, the merged deck E(A, B) and LCS(T, E(A, B)), for the cases T = E(A, B) and T != E(A, B).]

Relationship among Decks (2)

[Figure: relationship among T, E(A, B), LCS(T, E(A, B)) and LCS(T, A, B), given the decks A and B.]

Nested Genes

- Fruit fly -- Drosophila melanogaster
- Gene dcp-1 (Dmel_CG5370)
- Gene pita (Dmel_CG3941)
- (LOCUS AE003461)

Laundrie et al., Genetics 165, 2003

Whole Genome Duplication

2R

Kellis et al., Nature 428(6983), 2004

Doubly Conserved Synteny Block

- Two yeast species
- Kluyveromyces waltii
- Saccharomyces cerevisiae

Kellis et al., Nature 428(6983), 2004

Block ?

Merged Sequence

- An interleaving sequence obtained by merging sequences A and B, denoted as E(A, B)
- The merged sequence is not unique.
- A = cgatacc, B = aattcgc
- E1(A, B) = cgataaacgc
- E2(A, B) = aattcgcgcatacc
- E3(A, B) = cgaaatactcgc

Merged-LCS Problem

- To find the relationship among sequences T, A, and B, denoted as LCS(T, E(A, B))
- T = atacgcgctt
- A = cgatacc
- B = aattcgc
- A: -----cg---at-acc
- T: ata--cgcgc-tt---
- B: a-att--cgc------
- LCS(T, E(A, B)) = aacgcgct

Algorithm MergedLCS

- Dynamic programming formula
- Time complexity: $O(nm^2)$, where $n = |T|$ and $m = \max\{|A|, |B|\}$
- Space complexity: $O(nm)$
- Hirschberg 1975, divide-and-conquer
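The DP formula itself is not reproduced in this transcript, so the sketch below uses one natural three-index recurrence as an assumption: L(i, j, k) is the best LCS length between T[1..i] and any merge of A[1..j] and B[1..k], and the last symbol of the merge is either A[j] or B[k]. It computes the length only and keeps two layers of the table; the function name is illustrative.

```python
def merged_lcs_length(t, a, b):
    """Max over all merges E(a, b) of the LCS length between t and E(a, b)."""
    p, q = len(a), len(b)
    prev = [[0] * (q + 1) for _ in range(p + 1)]  # layer for t[:i-1]
    for ch in t:
        cur = [[0] * (q + 1) for _ in range(p + 1)]  # layer for t[:i]
        for j in range(p + 1):
            for k in range(q + 1):
                best = prev[j][k]                      # skip t[i]
                if j > 0:
                    best = max(best, cur[j - 1][k])    # skip a[j]
                    if ch == a[j - 1]:
                        best = max(best, prev[j - 1][k] + 1)
                if k > 0:
                    best = max(best, cur[j][k - 1])    # skip b[k]
                    if ch == b[k - 1]:
                        best = max(best, prev[j][k - 1] + 1)
                cur[j][k] = best
        prev = cur
    return prev[p][q]


print(merged_lcs_length("atacgcgctt", "cgatacc", "aattcgc"))
```

The three nested loops visit n(|A| + 1)(|B| + 1) cells with constant work each, which agrees with the O(nm^2) bound stated for MergedLCS.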

Blocked Merged Sequence

- An interleaving block sequence obtained by merging block sequences Ab and Bb, denoted as Eb(Ab, Bb)
- The blocked merged sequence is not unique.
- Ab = cgat acc (blocks Ab1 Ab2), Bb = aat tc gc (blocks Bb1 Bb2 Bb3)
- Eb4(Ab, Bb) = Ab1 Bb1 Bb2 Ab2 = cgataattcacc
- Eb5(Ab, Bb) = Bb1 Ab1 Ab2 Bb2 = aatcgatacctc
- Eb6(Ab, Bb) = Bb1 Bb2 Ab1 Bb3 Ab2 = aattccgatgcacc

Blocked Merged LCS Problem

- To find the relationship among block sequences T, Ab, and Bb, denoted as bLCS(T, Eb(Ab, Bb))
- T = atacgcgctt
- Ab = cgat acc, Bb = aat tc gc
- Eb5(Ab, Bb) = Bb1 Ab1 Ab2 Bb2 = aat cgat acc tc
- T:           a-ta cg-- -cgc t-t
- Eb5(Ab, Bb): aat- cgat ac-c tc-
- bLCS(T, Eb(Ab, Bb)) = atcgcct

Algorithm for Block Merged LCS

- Consider the symbol EOB (end of block)
- Complexity: O(n m mb)
- $n = |T|$, $m = \max\{|A^b|, |B^b|\}$, $m_b$ = max. number of blocks in Ab and Bb

Improved Algorithm BMergedLCS

- Step 1. Compute the S-tables St(T, Abi) and St(T, Bbj). $O(nm)$
- Step 2. Initialize Lb(i, 0, 0) = 0. $O(n)$
- Step 3. Compute Vb(j, k) as the maximum of Vb(j-1, k) combined with St(T, Abi) and Vb(j-1, k) combined with St(T, Bbj). $O(n m_b^2)$
- Step 4. Return Lb(T, ?, ?). $O(1)$ or $O(n)$
- Complexity: $O(nm + n m_b^2)$
- $n = |T|$, $m = \max\{|A^b|, |B^b|\}$
- $m_b$ = max. number of blocks in Ab and Bb

Experimental Results (1)

Data Set      Sequence Length (bp)     Number of Blocks   Running Time (sec.)
              T      A      B          A     B            MergedLCS   BMergedLCS
dodA          1629   687    942        6     7            52.69       0.70
pita, dcp-1   6000   2480   1756       3     3            1312.29     13.25

Experimental Results (2)

[Figure: results of MergedLCS, BMergedLCS and Clustal W.]

Summary: Merged LCS

- The merged-LCS problem: LCS(T, E(A, B))
- MergedLCS: $O(nm^2)$
- The blocked merged-LCS problem: bLCS(T, Eb(Ab, Bb))
- BMergedLCS: O(n m mb)
- BMergedLCS: $O(nm + n m_b^2)$
- $n = |T|$, $m = \max\{|A^b|, |B^b|\}$
- $m_b$ = max. number of blocks in Ab and Bb

The Mosaic LCS Problem

Chimera

Chimera of Arezzo: an Etruscan bronze

Chimeric Alignment

- Komatsoulis and Waterman, 1997
- For detecting chimeric sequences

[Figure: target sequence T aligned against pieces of the sequences S1, S2, S3 and S4.]

k-mosaic LCS Problem

Input: target sequence T, mosaic number k, sequence set S.

[Figure: T aligned against a mosaic of k = 4 sequences chosen from S.]

Output: maximal LCS(T, C), where $C = C_1 C_2 \cdots C_k$ and $C_i \in S$.

e.g. max LCS(T, C1C2C3C4), $C_i \in S$

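Algorithm for k-mosaic LCS (1)

Compute $LCS(T_{p,q}, S_j)$ for $0 \le p \le q \le n$, $S_j \in S$, $|S_j| \le m$: $O(n^2 m |S|)$ time.

[Figure: the substring $T_{p,q}$ of T, between positions p and q, compared with each $S_j$.]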

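Algorithm for k-mosaic LCS (2)

Recursive doubling scheme: $LCS(T_{p,r}, C_1 C_2) = \max_{p \le q \le r} \{ LCS(T_{p,q}, C_1) + LCS(T_{q,r}, C_2) \}$, for $0 \le p \le q \le r \le n$, $C_i \in S$: $O(n^3)$ time.

[Figure: T split at q into $T_{p,q}$ and $T_{q,r}$, matched against C1 and C2.]

Doubling the number of pieces: (1, 1) → 2, (2, 2) → 4, (4, 4) → 8, (8, 8) → 16, so $O(n^3 \log k)$ time in total.

(C1, C2) → C1C2, (C1C2, C3C4) → C1C2C3C4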
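The break-point idea can be sketched as below. For brevity this illustration appends one piece at a time instead of using the recursive doubling scheme, so after the $O(n^2 m |S|)$ table of substring LCS values it spends $O(n^2)$ per piece. The function names and the parameter `k` for the mosaic number are assumptions.

```python
def lcs_from(t, p, s):
    """Return r with r[q - p] = LCS(t[p:q], s) for q = p .. len(t)."""
    row = [0] * (len(s) + 1)
    out = [0]
    for q in range(p + 1, len(t) + 1):
        prev_diag = 0
        for j in range(1, len(s) + 1):
            saved = row[j]
            if t[q - 1] == s[j - 1]:
                row[j] = prev_diag + 1
            else:
                row[j] = max(row[j], row[j - 1])
            prev_diag = saved
        out.append(row[-1])
    return out


def mosaic_lcs_length(t, sources, k):
    """Max LCS(t, C1 C2 ... Ck) with every Ci chosen from sources."""
    if k < 1:
        raise ValueError("mosaic number must be at least 1")
    n = len(t)
    # best[p][q] = max over S_j of LCS(t[p:q], S_j)
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for p in range(n + 1):
        for s in sources:
            r = lcs_from(t, p, s)
            for q in range(p, n + 1):
                best[p][q] = max(best[p][q], r[q - p])
    # f[q] = best value for t[:q] using the pieces placed so far
    f = best[0][:]
    for _ in range(k - 1):
        f = [max(f[p] + best[p][q] for p in range(q + 1)) for q in range(n + 1)]
    return f[n]


print(mosaic_lcs_length("abcd", ["ab", "cd"], 2))
```

Replacing the piece-by-piece loop with repeated doubling of the piece count would give the $O(n^3 \log k)$ combination step described above.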

Summary: Mosaic LCS

- Mosaic LCS problem: LCS(T, C<1,k>)
- Straightforward DP: $O(n^2 m |S| + n^3 \log k)$
- Improved algorithm with S-table:
- $O(n(m + k)|S|)$

Conclusions

- Other related problems
- Constrained LCS problem
- Longest Increasing Subsequence Problem
- Longest Common Increasing Subsequence Problem of Two Sequences
- Near Optimal Alignment
- Alignment with Multiple Scoring Functions
- Multiple Sequence Alignment
- Fast LCS of Multiple Sequences

References (1)

- Block Edit Distance
- [Ukkonen, 1985] "Algorithms for approximate string matching," Information and Control, Vol. 64, pp. 100-118, 1985.
- [Shapira and Storer, 2007] "Edit distance with move operations," Journal of Discrete Algorithms, Vol. 5, No. 2, pp. 380-392, 2007.
- [Ann 2007] Hsing-Yen Ann, Chang-Biau Yang, Yung-Hsing Peng and Bern-Cherng Liaw, "Efficient Algorithms for the Block Edit Problems," Proc. of the 24th Workshop on Combinatorial Mathematics and Computation Theory, pp. 201-208, Nantou, Taiwan, April 27-28, 2007.
- LCS of Run-Length Encoded Strings
- [Bunke and Csirik, 1995] "An improved algorithm for computing the edit distance of run-length coded strings," Information Processing Letters, Vol. 54, No. 2, pp. 93-96, 1995.
- [Liu et al., 2008] "Finding a longest common subsequence between a run-length-encoded string and an uncompressed string," Journal of Complexity, Vol. 24, No. 2, pp. 173-184, 2008.
- [Ann 2008] Hsing-Yen Ann, Chang-Biau Yang, Chiou-Ting Tseng and Chiou-Yi Hor, "A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings," Information Processing Letters, Vol. 108, pp. 360-364, 2008.

References (2)

- Merged LCS and Mosaic LCS
- [Huang et al. 2007] Kuo-Si Huang, Chang-Biau Yang, Kuo-Tsung Tseng, Yung-Hsing Peng and Hsing-Yen Ann, "Dynamic Programming Algorithms for the Mosaic Longest Common Subsequence Problem," Information Processing Letters, Vol. 102, pp. 99-103, 2007.
- [Huang et al. 2008] Kuo-Si Huang, Chang-Biau Yang, Kuo-Tsung Tseng, Hsing-Yen Ann and Yung-Hsing Peng, "Efficient Algorithms for Finding Interleaving Relationship between Sequences," Information Processing Letters, Vol. 105, No. 5, pp. 188-193, 2008.

References (3)

- Suffix Tree and Range Minimum Query
- [Bender and Farach-Colton, 2000] "The LCA problem revisited," in LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, 2000, pp. 88-94.
- [Weiner, 1973] "Linear pattern matching algorithm," in Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1-11, 1973.
- Genome
- [Laundrie et al., 2003] "Germline cell death is inhibited by P-element insertions disrupting the dcp-1/pita nested gene pair in Drosophila," Genetics, Vol. 165, No. 4, pp. 1881-1888, 2003.
- [Kellis et al., 2004] "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae," Nature, Vol. 428, pp. 617-624, 2004.

UAA UAG UGA: The End

Finding and Maintaining the Range Minimum with Linear Penalties

Finding the Substring Edit Distance in Constant Time

Diagram for Blocked Merged LCS (1/2)

[Figure: cases relating T to the current blocks Abi and Bbj and the previous blocks Abi-1 and Bbj-1.]

Diagram for Blocked Merged LCS (2/2)

[Figure: T compared with the blocks Abi-1, Abi, Bbj-1 and Bbj.]

S-table

Improved Algorithm

- $V_l$ is obtained with minE from $V_{l-1}$ combined with $St(T, S_i)$, for each $S_i \in S$ and $1 \le l \le k$.

Computing $St(T, S_i)$: $O(nm|S|)$. Computing $V_l$ with minE: $O(nk|S|)$. Time complexity: $O(n(m + k)|S|)$.

Example of Algorithm Formosa2

- T = agactagtc
- S = {S1 = agc, S2 = act, S3 = aatg, S4 = ttcg}
- T:  agactagtc
- S1: 12-3----- (0,1,2,4)
- S2: 1--23---- (0,1,4,5)
- S3: 12--3-4-- (0,1,2,5,7)
- S4: -1----2-3 (0,2,7,9)
- V1: 12-3--4-- (0,1,2,4,7)
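The tuples in this example are consistent with reading entry c of an S-table as the smallest position q in T such that LCS(T[1..q], Si) is at least c, and with V1 being the element-wise minimum over all Si. The sketch below works under that reading, which is an interpretation of the example rather than a definition quoted from the slides. The helper names are assumptions.

```python
def s_table(t, s):
    """table[c] = smallest q with LCS(t[:q], s) >= c, and table[0] = 0."""
    row = [0] * (len(s) + 1)   # row[j] = LCS(t[:q], s[:j]) for the current q
    table = [0]
    for q in range(1, len(t) + 1):
        prev_diag = 0
        for j in range(1, len(s) + 1):
            saved = row[j]
            if t[q - 1] == s[j - 1]:
                row[j] = prev_diag + 1
            else:
                row[j] = max(row[j], row[j - 1])
            prev_diag = saved
        if row[-1] == len(table):   # the LCS length just grew by one
            table.append(q)
    return table


def elementwise_min(tables):
    """Entry-wise minimum of tables that may have different lengths."""
    width = max(len(tb) for tb in tables)
    return [min(tb[c] for tb in tables if c < len(tb)) for c in range(width)]


T = "agactagtc"
tables = [s_table(T, s) for s in ["agc", "act", "aatg", "ttcg"]]
print(tables)
print(elementwise_min(tables))
```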

Example of Algorithm Formosa2