The Longest Common Subsequence Problem and Its Variants - PowerPoint PPT Presentation

About This Presentation
Title:

The Longest Common Subsequence Problem and Its Variants

Description:

Title: Longest Common Subsequences and Its Variants Author: Last modified by: ynag Created Date: 11/11/2008 7:29:26 AM Document presentation format – PowerPoint PPT presentation

Slides: 99
Transcript and Presenter's Notes

Title: The Longest Common Subsequence Problem and Its Variants


1
The Longest Common Subsequence Problem and Its
Variants
  • ???
  • ??????????
  • http://www.nsysu.edu.tw

2
Outline
  • Introduction to Bioinformatics
  • Traditional LCS Algorithms
  • Our Works
  • Block Edit Problems
  • LCS of Run-Length Encoded Strings
  • Merged LCS Problem
  • Mosaic LCS Problem
  • Conclusions

3
Introduction to Bioinformatics
4
????(???????????)
  • DNA???????????

5
DNA and RNA
  • Nucleotides
  • adenine (A)
  • guanine (G)
  • cytosine (C)
  • thymine (T)
  • uracil (U)
  • DNA (deoxyribonucleic acid)
  • A, G, C, T (base pairs G-C, A-T)
  • RNA (ribonucleic acid)
  • A, G, C, U (base pairs G-C, A-U, G-U)

6
DNA Double Helix
7
DNA Length
  • The total length of human DNA is about 3 × 10^9
    (3 billion) base pairs.
  • Only about 1% to 1.5% of the DNA sequence is useful.
  • Number of human genes: 30,000 to 40,000
  • Conclusion from the Human Genome Project
    (1990-2003)
  • The original estimate was 100,000.

8
From DNA via RNA to Protein
9
DNA, Genes and Proteins
  • DNA program for cell processes
  • Proteins execute cell processes

10
Promoter and Gene
11
Amino Acids
The basic units that make up a protein; there are 20 of them.
12
Protein Structure
13
Traditional Dynamic Programming (DP) for the
Longest Common Subsequence (LCS) Problem
14
The Longest Common Subsequence (LCS) Problem
  • A string: S1 = TAGTCACG
  • A subsequence of S1: obtained by deleting 0 or more symbols
    from S1 (not necessarily consecutive).
  • e.g. G, AGC, TATC, AGACG
  • Common subsequences of S1 = TAGTCACG and S2 =
    AGACTGTC
  • GG, AGC, AGACG
  • Longest common subsequence (LCS): S1 = TAGTCACG
  • S2 = AGACTGTC
  • LCS = AGACG

15
Applications of LCS
  • The edit distance of two strings or files
  • (number of deletions and insertions)
  • S1:        TAGTCAC-G--
  • S2:        -AG--ACTGTC
  • Operation: DMMDDMMIMII (D = delete, M = match, I = insert)
  • Spoken word recognition
  • Similarity of two biological sequences (DNA or
    protein)
  • Sequence alignment

16
The Traditional LCS Algorithm
  • $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$
  • $A_{i,j}$ denotes the length of the longest common
    subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$.
  • Dynamic programming
  • $A_{i,j} = A_{i-1,j-1} + 1$ if $a_i = b_j$
  • $A_{i,j} = \max\{A_{i-1,j}, A_{i,j-1}\}$ if $a_i \neq b_j$
  • $A_{0,0} = A_{0,j} = A_{i,0} = 0$ for $1 \le i \le m$, $1 \le j \le n$.
  • Time complexity: $O(mn)$

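The recurrence for A(i, j) can be sketched in Python as follows. This is a minimal illustration rather than code from the slides; the function names `lcs_table` and `lcs_string` are assumptions introduced here.

```python
def lcs_table(s1, s2):
    """Fill A[i][j] = LCS length of s1[:i] and s2[:j] in O(mn) time."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])
    return A


def lcs_string(s1, s2):
    """Trace back through the table to recover one LCS (ties broken arbitrarily)."""
    A = lcs_table(s1, s2)
    i, j, out = len(s1), len(s2), []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs_table("TAGTCACG", "AGACTGTC")[8][8])
    print(lcs_string("TAGTCACG", "AGACTGTC"))
```

For the running example TAGTCACG and AGACTGTC the table's last entry is 5, the length of AGACG; the traceback may return a different LCS of the same length.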
17
LCS and Edit Distance
  • Edit distance $= |S_1| + |S_2| - 2\,|LCS(S_1, S_2)|$
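The relation between the insertion/deletion edit distance and the LCS length can be checked numerically. The sketch below computes both quantities with independent dynamic programs; the helper names `lcs_len` and `indel_distance` are hypothetical, and substitutions are deliberately not allowed, matching the "deletions and insertions" definition used earlier in the slides.

```python
def lcs_len(a, b):
    """LCS length using a rolling row (O(len(b)) memory)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of single-symbol insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # turning "" into b[:j] needs j insertions
    for i, x in enumerate(a, 1):
        cur = [i]                   # turning a[:i] into "" needs i deletions
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    s1, s2 = "TAGTCACG", "AGACTGTC"
    print(indel_distance(s1, s2), len(s1) + len(s2) - 2 * lcs_len(s1, s2))
```

For the running example both expressions give 6, which agrees with the three deletions and three insertions in the operation string DMMDDMMIMII.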

18
Sequence Alignment
  • S1 = TAGTCACG
  • S2 = AGACTGTC
  • Two possible alignments:
  • ----TAGTCACG        TAGTCAC-G--
  • AGACT-GTC---        -AG--ACTGTC
  • Which one is better?
  • We can set different gap penalties as parameters
    for different purposes.
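To see how the choice of gap penalty changes which alignment is preferred, the two alignments above can be scored with a simple column-by-column function. This is only a sketch under an assumed linear scoring scheme (a fixed score per match, mismatch, and gap column); the parameter names and default values are illustrative and are not taken from the slides.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            score += gap        # one symbol aligned against a gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    first = ("----TAGTCACG", "AGACT-GTC---")
    second = ("TAGTCAC-G--", "-AG--ACTGTC")
    for g in (-1, -3):
        print(g, alignment_score(*first, gap=g), alignment_score(*second, gap=g))
```

The first alignment has 4 matches and 8 gap columns, the second has 5 matches and 6 gap columns, so under this scheme the second scores higher for any negative gap penalty; other scoring schemes (for example affine gaps) could rank alignments differently.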

19
Gap Penalty for Sequence Alignment
  • is the gap penalty.
  • Suppose

20
Example for Sequence Alignment
TAGTCAC-G-- -AG--ACTGTC
21
PAM250 Score Matrix for Protein Alignment
22
MSA, ET and LCS
  • Multiple sequence alignment
  • LCS
  • Phylogeny (evolutionary tree)

23
Hunt-Szymanski LCS Algorithm
  • By extending the idea of the RSK (Robinson-Schensted-Knuth)
    algorithm for solving the longest
    increasing subsequence problem, the LCS problem can be
    solved in O(r log n) time, where r denotes the
    number of matches.
  • This algorithm is faster than the traditional
    dynamic programming if r is small.

24
The Pairs of Matching in Hunt-Szymanski Algorithm
  • Input sequences: TAGTCACG (rows) and AGACTGTC (columns)
  • Pairs of matching (row, column):

Row  Symbol  Matches
1    T       (1,5) (1,7)
2    A       (2,1) (2,3)
3    G       (3,2) (3,6)
4    T       (4,5) (4,7)
5    C       (5,4) (5,8)
6    A       (6,1) (6,3)
7    C       (7,4) (7,8)
8    G       (8,2) (8,6)
25
Example for Hunt-Szymanski Algorithm
  • The insertion order is row-major, with the columns of
    each row taken backward (right to left).
  • Time complexity: O(r log n), where r is the number of
    matches; each match needs O(log n) time for a binary search.

Inserted  (1,7) (1,5) (2,3) (2,1) (3,6) (3,2) (4,7) (4,5) (5,8) (5,4)
L = 1     (1,7) (1,5) (2,3) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1)
L = 2                             (3,6) (3,2) (3,2) (3,2) (3,2) (3,2)
L = 3                                         (4,7) (4,5) (4,5) (5,4)
L = 4                                                     (5,8) (5,8)
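The insertion process in the example can be sketched with a threshold array and binary search. This is a minimal length-only version written for illustration; the function name `hunt_szymanski_length` and the list `thresh` are assumptions, and the sketch does not reconstruct the subsequence itself.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_length(s1, s2):
    """LCS length via matches inserted row-major, columns backward."""
    # positions[ch] = increasing list of 1-based columns of s2 holding ch
    positions = defaultdict(list)
    for j, ch in enumerate(s2, start=1):
        positions[ch].append(j)

    # thresh[k] = smallest column that ends a common subsequence of length k + 1
    thresh = []
    for ch in s1:
        # Visiting columns right to left keeps two matches from the
        # same row out of one chain.
        for j in reversed(positions.get(ch, [])):
            k = bisect_left(thresh, j)   # O(log n) binary search per match
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh)


if __name__ == "__main__":
    print(hunt_szymanski_length("TAGTCACG", "AGACTGTC"))
```

After the first ten insertions of the example the array holds columns 1, 2, 4, 8, which matches the last column of the table above; processing the remaining rows extends it to length 5.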
26
Time and Space Complexities for LCS
27
Block Edit Problems
28
Motivation: Finding Similar Codes
29
Block Edit Problems
  • Operations: block copy, block deletion, and block
    move.
  • Shapira and Storer (2002) proved that the problem is
    NP-hard when recursive block-move operations are
    allowed.
  • Various approximations have been proposed.
  • Our assumption: a restricted edit sequence
  • A series of edit operations is performed from
    left to right on the source string X.
  • No two block-edit operations are
    performed on overlapping regions of X.

30
A Series of Block Edit Operations
31
Restricted Edit Sequence
  • (a) General (recursive) edit operations
  • (b) Restricted edit sequence

32
Definitions of the Problems (1/2)
  • Let P(o, c) denote a block edit problem
  • o: a composition of block-edit operations
  • c: the class of cost measures
  • The block-copy operations
  • External copy: copy a substring of X to W_i
  • Internal copy: copy a valid substring of W_{i-1} to
    W_i
  • Shifted copy: copy a shifted substring

33
Definitions of the Problems (2/2)
  • The cost measures that can be chosen
  • Constant cost: p_copy
  • Linear cost: p_s + k p_e
  • Nested cost: p_copy + d_c(A, B)
  • Three problems are defined in our work
  • P(EIS,C)
  • P(EI,L)
  • P(EI,N)

34
Problem 1 -- P(EIS,C) External, Internal,
Shifted, Constant
  • External and internal copies are allowed in
    constant cost.
  • Shifted copies are allowed in constant cost.
  • It can be solved by a straightforward DP
    algorithm in O(nm2 (n m) S) time.
  • We propose an O(nm)-time DP algorithm with
  • O(nm^2) preprocessing time in the worst case
  • O(nm log m) preprocessing time in the average case

35
Recurrence DP Formula for P(EIS,C)
  • Straightforward implementation O(nm2 (n m)
    S) time.

36
Functions and Operations (1)
  • Character operations
  • Block deletions

37
Functions and Operations (2)
  • External copies
  • Internal copies

38
Functions and Operations (3)
  • Shifted copies

39
Preprocessing for P(EIS,C)
  • For external copies
  • Build a suffix tree T(X^R Y^R) to find the common
    substrings between X and Y.
  • For internal copies
  • Build a suffix tree T(Y^R) to find the valid
    common substrings to be copied from working
    string W_i to W_{i+1}.
  • For shifted copies
  • Compute the differential strings X' and Y' of X
    and Y.
  • Find the valid common substrings for external /
    internal copies.

40
Preprocessing - Suffix Trees
41
Preprocessing Longest Common Prefixes (LCP) and
Suffix trees
42
Finding and Maintaining the Range Minimum in
Constant Time
43
Problem 2 -- P(EI,L) External, Internal, Linear
  • The cost of each copy or deletion is with an
    initial penalty plus a linear extended penalty.

44
Problem 3 -- P(EI,N) External, Internal, Nested
  • The copied strings can be further edited with
    character-edit operations.

45
Summary of Block Edit Problems
46
LCS of Run-Length Encoded Strings
47
LCS of Run-Length Encoded Strings
  • Run-length encoding (RLE) compression: aaaaabbbccccdd
    → a^5 b^3 c^4 d^2
  • Input
  • RLE string X: length n, k runs
  • RLE string Y: length m, l runs
  • Output
  • LCS between X and Y.
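Run-length encoding itself is straightforward to sketch. The helper names `rle_encode` and `rle_decode` below are hypothetical, and the list-of-pairs representation is just one convenient choice for illustration.

```python
from itertools import groupby


def rle_encode(s):
    """Return the runs of s as (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Expand (symbol, run length) pairs back into a plain string."""
    return "".join(ch * count for ch, count in runs)


if __name__ == "__main__":
    runs = rle_encode("aaaaabbbccccdd")
    print(runs)                 # four runs: a x5, b x3, c x4, d x2
    print(rle_decode(runs))
```

In the notation of the slides, a string of length n with k runs becomes a list of k pairs.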

48
Dark and Light Blocks
  • Divide the DP lattice into k × l blocks.
  • Dark blocks: matched blocks. Light blocks:
    mismatched blocks.
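As a small illustration of the block decomposition, the sketch below marks each block of the lattice as dark (the two runs carry the same symbol) or light. The function name `dark_block_grid` and the boolean-matrix output are assumptions made for this example.

```python
from itertools import groupby


def runs(s):
    """Run-length encode s as (symbol, run length) pairs."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]


def dark_block_grid(x, y):
    """grid[i][j] is True when run i of x and run j of y share a symbol."""
    runs_x, runs_y = runs(x), runs(y)
    return [[cx == cy for cy, _ in runs_y] for cx, _ in runs_x]


if __name__ == "__main__":
    for row in dark_block_grid("aaaaabbb", "bba"):
        print(["dark" if cell else "light" for cell in row])
```

With k runs in X and l runs in Y the grid has k rows and l columns, one entry per block.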

49
Results of Bunke and Csirik (1995)
  • Lemma 1 (Dark block)
  • Lemma 2 (Light block)
  • Only the boundaries of the blocks are needed.

50
Results of Liu et al. (2008)
  • A complex modified DP formula which computes the
    DP lattice row by row.
  • Only the bottom boundaries of the blocks are
    needed.

51
Additional Lemmas
  • Lemma 3 (Monotonicity)
  • Lemma 4 (Merged light blocks) if
    ,

52
Proof of Lemma 4
53
Basic Idea
  • C(v) denotes the number of occurrences of the
    matched symbol in the right side of v.
  • ni denotes the length of current run of X.

54
Dummy Nodes and Candidate Paths
  • Some dummy nodes are considered, too.
  • Divide the candidate paths into two sets.

55
Range Minimum / Maximum Query (RMQ)
  • Given an array A and a range [i, j], find the
    maximum in the range [i, j].
  • Can be solved in O(n) preprocessing time and O(1)
    query time.
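A constant-time range maximum query can be illustrated with a sparse table. Note that this is a simpler, swapped-in technique: its preprocessing takes O(n log n) time rather than the O(n) bound quoted above, which requires a more intricate structure. The function names `build_sparse_table` and `range_max` are hypothetical.

```python
def build_sparse_table(a):
    """table[k][i] = max of a[i : i + 2**k]; O(n log n) time and space."""
    n = len(a)
    table = [list(a)]
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        table.append(
            [max(prev[i], prev[i + half]) for i in range(n - (1 << k) + 1)]
        )
        k += 1
    return table


def range_max(table, i, j):
    """Maximum of a[i..j] (inclusive, 0-based) in O(1) time."""
    k = (j - i + 1).bit_length() - 1      # largest power of two fitting the range
    return max(table[k][i], table[k][j - (1 << k) + 1])


if __name__ == "__main__":
    t = build_sparse_table([3, 1, 4, 1, 5, 9, 2, 6])
    print(range_max(t, 0, 3), range_max(t, 2, 6), range_max(t, 6, 7))
```

Each query combines two overlapping power-of-two windows, which is valid because taking the maximum is idempotent; replacing `max` with `min` gives the range minimum version.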

56
Finding the Maximum from the Candidate Paths
  • The value of u0 can be computed by Lemma 4.
  • The maximum of the second set can be found by
    precomputing an array Li and then applying
    RMQ (Range Maximum Query) on it.

57
How Fast Is It?
  • The elements needed to be computed
  • Right bottom corners of all blocks.
  • Bottom boundaries of the dark blocks.
  • Let p1 and p2 denote the numbers of elements in
    the bottom and right boundaries of the dark
    blocks. The time complexity of our algorithm is
    .

58
The Merged LCS Problem
59
Motivation -- Riffle Shuffle
60
Riffle Shuffle
[Figure: decks A and B are interleaved into the merged deck E(A, B)]
61
Relationship among Decks (1)
[Figure: decks E(A, B) and T, related through LCS(T, E(A, B))]
62
Relationship among Decks (2)
[Figure: decks A, B, E(A, B), and T, related through LCS(T, E(A, B)) and LCS(T, A, B)]
63
Nested Genes
  • Fruit fly -- Drosophila melanogaster
  • Gene dcp-1 (Dmel_CG5370)
  • Gene pita (Dmel_CG3941)
  • (LOCUS AE003461)

Laundrie et al., Genetics 165, 2003
64
Whole Genome Duplication
2R
Kellis et al., Nature 428(6983), 2004
65
Doubly Conserved Synteny Block
  • Two yeast species
  • Kluyveromyces waltii
  • Saccharomyces cerevisiae

Kellis et al., Nature 428(6983), 2004
Block ?
66
Merged Sequence
  • An interleaving sequence obtained by merging sequences A
    and B, denoted as E(A, B)
  • The merged sequence is not unique.
  • A = cgatacc, B = aattcgc
  • E1(A, B) = cgataaacgc
  • E2(A, B) = aattcgcgcatacc
  • E3(A, B) = cgaaatactcgc
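Whether a given string is a merged (interleaved) sequence of A and B can be tested with a small dynamic program. The sketch below is illustrative only; the function name `is_merged_sequence` is an assumption, and the check requires every symbol of A and B to be used exactly once and in order.

```python
def is_merged_sequence(e, a, b):
    """True if e interleaves a and b while keeping each one's symbol order."""
    if len(e) != len(a) + len(b):
        return False
    # ok[i][j]: e[:i + j] is an interleaving of a[:i] and b[:j]
    ok = [[False] * (len(b) + 1) for _ in range(len(a) + 1)]
    ok[0][0] = True
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i > 0 and ok[i - 1][j] and a[i - 1] == e[i + j - 1]:
                ok[i][j] = True
            if j > 0 and ok[i][j - 1] and b[j - 1] == e[i + j - 1]:
                ok[i][j] = True
    return ok[len(a)][len(b)]


if __name__ == "__main__":
    print(is_merged_sequence("aattcgcgcatacc", "cgatacc", "aattcgc"))
    print(is_merged_sequence("cgataccaattcgc", "cgatacc", "aattcgc"))
```

Both calls print True: the first string is the E2 example, and the second is the plain concatenation of A and B, which is also a valid merge.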

67
Merged-LCS Problem
  • To find the relationship among sequences T, A,
    and B, denoted as LCS(T, E(A, B))
  • T = atacgcgctt
  • A = cgatacc
  • B = aattcgc
  • A: -----cg---at-acc
  • T: ata--cgcgc-tt---
  • B: a-att--cgc------
  • LCS(T, E(A, B)) = a a cgcgc t

68
Algorithm MergedLCS
  • Dynamic programming formula
  • Time complexity: O(nm^2), where n = |T| and m = max{|A|, |B|}
  • Space complexity: O(nm)
  • Hirschberg, 1975: divide-and-conquer
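Since the dynamic programming formula itself did not survive in this transcript, the following is a hedged reconstruction of a natural three-index recurrence for the merged LCS, not necessarily the exact formula on the slide. Here `L[i][j][k]` (an assumed name) is the best length for T[:i] against merges of A[:j] and B[:k]; only two layers in i are kept, so the sketch returns the length but not the alignment.

```python
def merged_lcs(t, a, b):
    """Length of the longest LCS of t with any interleaving of a and b.

    Runs in O(|t| * |a| * |b|) time, i.e. O(n m^2) in the slides' notation.
    """
    la, lb = len(a), len(b)
    prev = [[0] * (lb + 1) for _ in range(la + 1)]      # layer i - 1
    for i in range(1, len(t) + 1):
        cur = [[0] * (lb + 1) for _ in range(la + 1)]   # layer i
        for j in range(la + 1):
            for k in range(lb + 1):
                best = prev[j][k]                        # t[i-1] left unmatched
                if j > 0:
                    best = max(best, cur[j - 1][k])      # a[j-1] left unmatched
                    if t[i - 1] == a[j - 1]:
                        best = max(best, prev[j - 1][k] + 1)
                if k > 0:
                    best = max(best, cur[j][k - 1])      # b[k-1] left unmatched
                    if t[i - 1] == b[k - 1]:
                        best = max(best, prev[j][k - 1] + 1)
                cur[j][k] = best
        prev = cur
    return prev[la][lb]


if __name__ == "__main__":
    print(merged_lcs("atacgcgctt", "cgatacc", "aattcgc"))
```

The alignment shown on the slide for T = atacgcgctt matches 8 symbols, so the value printed for that input is at least 8.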

69
Blocked Merged Sequence
  • An interleaving block sequence obtained by merging block
    sequences A and B, denoted as E^b(A, B)
  • The blocked merged sequence is not unique.
  • A^b = cgat | acc (blocks A^b_1, A^b_2); B^b = aat | tc | gc (blocks B^b_1, B^b_2, B^b_3)
  • E^b_4(A^b, B^b) = A^b_1 B^b_1 B^b_2 A^b_2 = cgat aat tc acc
  • E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aat cgat acc tc
  • E^b_6(A^b, B^b) = B^b_1 B^b_2 A^b_1 B^b_3 A^b_2 = aat tc cgat gc acc

70
Blocked Merged LCS Problem
  • To find the relationship among block sequences T,
    A^b, and B^b, denoted as bLCS(T, E^b(A^b, B^b))
  • T = atacgcgctt
  • A^b = cgat | acc; B^b = aat | tc | gc
  • E^b_5(A^b, B^b) = B^b_1 A^b_1 A^b_2 B^b_2 = aat cgat acc tc
  • T:                a-ta cg-- -cgc t-t
  • E^b_5(A^b, B^b):  aat- cgat ac-c tc-
  • bLCS(T, E^b(A^b, B^b)) = a t cg c c t

71
Algorithm for Block Merged LCS
  • Consider the symbol EOB (end of block)
  • Complexity: O(n m m_b)
  • n = |T|, m = max{|A^b|, |B^b|}, m_b = max. number of
    blocks in A^b and B^b

72
Improved Algorithm BMergedLCS
  • Step 1. Compute S-table St(T, Abi) and St(T,
    Bbj). O(nm)
  • Step 2. Initialize Lb(i, 0, 0) 0. O(n)
  • Step 3. Vb(j, k) maxVb(j?1, k) ? St(T, Abi),
    Vb(j?1, k) ? St(T, Bbj). O(nmb2)
  • Step 4. Return Lb(T, ?, ?). O(1) or O(n)
  • Complexity: O(nm + n m_b^2)
  • n = |T|, m = max{|A^b|, |B^b|}
  • m_b = max. number of blocks in A^b and B^b
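The blocked variant can be illustrated with a direct dynamic program over the number of blocks consumed from each sequence. This is not the S-table algorithm outlined in the steps above and it is much slower; it is a plain sketch meant only to make the problem definition concrete. The names `window_lcs`, `blocked_merged_lcs`, and the table `f` are assumptions.

```python
def window_lcs(t, block):
    """w[p][i] = LCS length of t[p:i] and block, for 0 <= p <= i <= len(t)."""
    n, b = len(t), len(block)
    w = [[0] * (n + 1) for _ in range(n + 1)]
    for p in range(n + 1):
        row = [0] * (b + 1)
        for i in range(p + 1, n + 1):
            new = [0] * (b + 1)
            for c in range(1, b + 1):
                if t[i - 1] == block[c - 1]:
                    new[c] = row[c - 1] + 1
                else:
                    new[c] = max(row[c], new[c - 1])
            row = new
            w[p][i] = row[b]
    return w


def blocked_merged_lcs(t, a_blocks, b_blocks):
    """Best LCS of t with any block-level interleaving of the two block lists."""
    n = len(t)
    wa = [window_lcs(t, blk) for blk in a_blocks]
    wb = [window_lcs(t, blk) for blk in b_blocks]
    # f[(j, k)][i]: best value for t[:i] using the first j A-blocks and k B-blocks
    f = {(0, 0): [0] * (n + 1)}
    for j in range(len(a_blocks) + 1):
        for k in range(len(b_blocks) + 1):
            if j == 0 and k == 0:
                continue
            options = []
            if j > 0:
                options.append((f[(j - 1, k)], wa[j - 1]))  # last block is A_j
            if k > 0:
                options.append((f[(j, k - 1)], wb[k - 1]))  # last block is B_k
            best = [0] * (n + 1)
            for prev, w in options:
                for i in range(n + 1):
                    cand = max(prev[p] + w[p][i] for p in range(i + 1))
                    best[i] = max(best[i], cand)
            f[(j, k)] = best
    return f[(len(a_blocks), len(b_blocks))][n]


if __name__ == "__main__":
    print(blocked_merged_lcs("atacgcgctt", ["cgat", "acc"], ["aat", "tc", "gc"]))
    print(blocked_merged_lcs("aabb", ["ab"], ["ab"]))
```

The second call shows the effect of the block restriction: the only blocked merge of two "ab" blocks is abab, so the result is 3, whereas an unrestricted merge could produce aabb and match all four symbols.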

73
Experimental Results (1)
Data Set     Sequence Length (bp)     Number of Blocks   Running Time (sec.)
             T      A      B          A      B           MergedLCS   BMergedLCS
dodA         1629   687    942        6      7           52.69       0.70
pita dcp-1   6000   2480   1756       3      3           1312.29     13.25
74
Experimental Results (2)
MergedLCS
BMergedLCS
Clustal W
75
Summary: Merged LCS
  • The merged-LCS problem: LCS(T, E(A, B))
  • MergedLCS: O(nm^2)
  • The blocked merged-LCS problem: bLCS(T, E^b(A^b,
    B^b))
  • BMergedLCS: O(n m m_b)
  • BMergedLCS: O(nm + n m_b^2)
  • n = |T|, m = max{|A^b|, |B^b|}
  • m_b = max. number of blocks in A^b and B^b

76
The Mosaic LCS Problem
77
Chimera
Chimera of Arezzo: an Etruscan bronze
78
Chimeric Alignment
  • Komatsoulis and Waterman, 1997
  • For detecting chimeric sequences

S1
S2
S3
S4
T
79
k-mosaic LCS Problem
Input: target sequence T, mosaic number k,
sequence set S.
Output: maximal LCS(T, C), where C = C1 C2 … Ck and each Ci ∈ S.
e.g. for k = 4: max LCS(T, C1C2C3C4), Ci ∈ S
[Figure: T matched against four sequences (1 to 4) chosen from S]
80
Algorithm for k-mosaic LCS (1)
Compute LCS(T_{p,q}, S_j) for 0 ≤ p ≤ q ≤ n, S_j ∈ S, |S_j| ≤ m:
O(n^2 m |S|) time
[Figure: substring T_{p,q} of T compared with each S_j]
81
Algorithm for k-mosaic LCS (2)
Recursive doubling scheme: LCS(T_{p,r}, C1C2) =
max{LCS(T_{p,q}, C1) + LCS(T_{q,r}, C2)}, where
0 ≤ p ≤ q ≤ r ≤ n and Ci ∈ S: O(n^3) time
[Figure: T_{p,q} matched with C1 and T_{q,r} matched with C2]
Doubling: (1, 1) → 2, (2, 2) → 4, (4, 4) → 8, (8, 8) → 16: O(n^3 log k) time in total
(C1, C2) → C1C2, (C1C2, C3C4) → C1C2C3C4
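The window table and the combining step can be sketched as follows. This is an illustrative version under stated assumptions: the mosaic number is passed as the parameter `k`, the names `mosaic_lcs`, `base`, and `combine` are hypothetical, and the window table is filled by calling a plain LCS routine on every substring, which is slower than the bound quoted above. Repeated squaring is used so that any k of at least 1 is handled, not only powers of two.

```python
def lcs_len(a, b):
    """LCS length using a rolling row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def mosaic_lcs(t, sources, k):
    """Max LCS of t with a concatenation of k sequences drawn (with repetition) from sources."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(t)
    NEG = float("-inf")

    # base[p][q] = best LCS of the window t[p:q] with a single source sequence
    base = [[NEG] * (n + 1) for _ in range(n + 1)]
    for p in range(n + 1):
        for q in range(p, n + 1):
            base[p][q] = max(lcs_len(t[p:q], s) for s in sources)

    def combine(x, y):
        """z[p][r] = max over break points q of x[p][q] + y[q][r]; O(n^3)."""
        z = [[NEG] * (n + 1) for _ in range(n + 1)]
        for p in range(n + 1):
            for r in range(p, n + 1):
                z[p][r] = max(x[p][q] + y[q][r] for q in range(p, r + 1))
        return z

    result, power = None, base          # power covers 1, 2, 4, ... sources
    while k > 0:
        if k & 1:
            result = power if result is None else combine(result, power)
        k >>= 1
        if k:
            power = combine(power, power)
    return result[0][n]


if __name__ == "__main__":
    print(mosaic_lcs("abcd", ["ab", "cd"], 1), mosaic_lcs("abcd", ["ab", "cd"], 2))
```

With a single source the best match for abcd has length 2, while two sources (ab followed by cd) cover the whole target. The combining step is associative, which is what makes the doubling order valid.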
82
Summary: Mosaic LCS
  • Mosaic LCS problem: LCS(T, C<1,k>)
  • Straightforward DP: O(n^2 m|S| + n^3 log k)
  • Improved algorithm with S-table
  • O(n(m + k)|S|)

83
Conclusions
  • Other related problems
  • Constrained LCS problem
  • Longest Increasing Subsequence Problem
  • Longest Common Increasing Subsequence Problem of
    Two Sequences
  • Near Optimal Alignment
  • Alignment with Multiple Scoring Functions
  • Multiple Sequence Alignment
  • Fast LCS of Multiple Sequences

84
References (1)
  • Block Edit Distance
  • Ukkonen, 1985: "Algorithms for approximate string
    matching," Information and Control, Vol. 64, pp.
    100-118, 1985.
  • Shapira and Storer, 2007: "Edit distance with
    move operations," Journal of Discrete Algorithms,
    Vol. 5, No. 2, pp. 380-392, 2007.
  • Ann 2007: Hsing-Yen Ann, Chang-Biau Yang,
    Yung-Hsing Peng and Bern-Cherng Liaw, "Efficient
    Algorithms for the Block Edit Problems," Proc. of
    the 24th Workshop on Combinatorial Mathematics
    and Computation Theory, pp. 201-208, Nantou,
    Taiwan, April 27-28, 2007.
  • LCS of Run-Length Encoded Strings
  • Bunke and Csirik, 1995: "An improved algorithm
    for computing the edit distance of run-length
    coded strings," Information Processing Letters,
    Vol. 54, No. 2, pp. 93-96, 1995.
  • Liu et al., 2008: "Finding a longest common
    subsequence between a run-length-encoded string
    and an uncompressed string," Journal of
    Complexity, Vol. 24, No. 2, pp. 173-184, 2008.
  • Ann 2008: Hsing-Yen Ann, Chang-Biau Yang,
    Chiou-Ting Tseng, Chiou-Yi Hor, "A fast and simple
    algorithm for computing the longest common
    subsequence of run-length encoded strings,"
    Information Processing Letters, Vol. 108, pp.
    360-364, 2008.

85
References (2)
  • Merged LCS and Mosaic LCS
  • Huang et al. 2007: Kuo-Si Huang, Chang-Biau
    Yang, Kuo-Tsung Tseng, Yung-Hsing Peng and
    Hsing-Yen Ann, "Dynamic Programming Algorithms
    for the Mosaic Longest Common Subsequence
    Problem," Information Processing Letters, Vol.
    102, pp. 99-103, 2007.
  • Huang et al. 2008: Kuo-Si Huang, Chang-Biau
    Yang, Kuo-Tsung Tseng, Hsing-Yen Ann and
    Yung-Hsing Peng, "Efficient Algorithms for
    Finding Interleaving Relationship between
    Sequences," Information Processing Letters, Vol.
    105 (5), pp. 188-193, 2008.

86
References (3)
  • Suffix Tree and Range Minimum Query
  • Bender and Farach-Colton, 2000: "The LCA problem
    revisited," in LATIN 2000: Theoretical
    Informatics, 4th Latin American Symposium, Punta
    del Este, Uruguay, 2000, pp. 88-94.
  • Weiner, 1973: "Linear pattern matching algorithm,"
    in Proceedings of the 14th Annual IEEE Symposium
    on Switching and Automata Theory, pp. 1-11, 1973.
  • Genome
  • Laundrie et al., 2003: "Germline cell death is
    inhibited by P-element insertions disrupting the
    dcp-1/pita nested gene pair in Drosophila,"
    Genetics, Vol. 165, No. 4, pp. 1881-1888, 2003.
  • Kellis et al., 2004: "Proof and evolutionary
    analysis of ancient genome duplication in the
    yeast Saccharomyces cerevisiae," Nature, Vol. 428,
    pp. 617-624, 2004.

87
UAA UAG UGA (stop codons): The End
88
Finding and Maintaining the Range Minimum with
Linear Penalties
89
Finding the Substring Edit Distance in Constant
Time
90
Diagram for Blocked Merged LCS (1/2)
[Figure: T compared against blocks A^b_i, A^b_{i-1}, B^b_j, and B^b_{j-1}]
91
Diagram for Blocked Merged LCS (2/2)
[Figure: T compared against blocks A^b_i, A^b_{i-1}, B^b_j, and B^b_{j-1}]
92
S-table
96
Improved Algorithm
  • Vl minEVl-1 ? St(T, Si) for each Si ? S and 1
    l ?

St(T, Si) O(nmS) Vl minEVl-1 ? St(T, Si)
O(n?S) Time Complexity O(n(m?)S)
97
Example of Algorithm Formosa2
  • T = agactagtc
  • S = {S1 = agc, S2 = act, S3 = aatg, S4 = ttcg}
  • T:  agactagtc
  • S1: 12-3-----  (0,1,2,4)
  • S2: 1--23----  (0,1,4,5)
  • S3: 12--3-4--  (0,1,2,5,7)
  • S4: -1----2-3  (0,2,7,9)
  • V1: 12-3--4--  (0,1,2,4,7)
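The tuples in this example can be reproduced by reading each one as "the shortest prefix of T whose LCS with Si reaches length 1, 2, 3, ...", and V1 as the element-wise minimum over all Si. That reading is an interpretation inferred from the numbers shown here, and the function names `s_table_row` and `elementwise_min` are hypothetical.

```python
def s_table_row(t, s):
    """row[l] = smallest q such that the LCS of t[:q] and s has length >= l."""
    row = [0]                              # length 0 needs an empty prefix
    prev = [0] * (len(s) + 1)
    for q, ch in enumerate(t, start=1):
        cur = [0] * (len(s) + 1)
        for c in range(1, len(s) + 1):
            if ch == s[c - 1]:
                cur[c] = prev[c - 1] + 1
            else:
                cur[c] = max(prev[c], cur[c - 1])
        if cur[-1] == len(row):            # the LCS just grew by one
            row.append(q)
        prev = cur
    return row


def elementwise_min(rows):
    """Element-wise minimum of rows that may have different lengths."""
    width = max(len(r) for r in rows)
    return [min(r[l] for r in rows if l < len(r)) for l in range(width)]


if __name__ == "__main__":
    t = "agactagtc"
    rows = [s_table_row(t, s) for s in ("agc", "act", "aatg", "ttcg")]
    for r in rows:
        print(r)
    print(elementwise_min(rows))
```

Run on the example, the four rows and their element-wise minimum agree with the tuples listed for S1 to S4 and V1.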

98
Example of Algorithm Formosa2