Inexact Matching - PowerPoint PPT Presentation

About This Presentation
Title:

Inexact Matching

Description:

Inexact Matching Charles Yan 2008 Longest Common Subsequence Given two strings, find a longest subsequence that they share substring vs. subsequence of a string ... – PowerPoint PPT presentation

Slides: 79
Provided by: digitalC8
Transcript and Presenter's Notes

Title: Inexact Matching


1
Inexact Matching
  • Charles Yan
  • 2008

2
Longest Common Subsequence
  • Given two strings, find a longest subsequence
    that they share.
  • Substring vs. subsequence of a string:
  • Substring: the characters in a substring of S
    must occur contiguously in S.
  • Subsequence: the characters can be interspersed
    with gaps.
  • Consider ababc and abdcb.
  • Alignment 1:
  • ababc.
  • abd.cb
  • The longest common subsequence is ab..c, with
    length 3.
  • Alignment 2:
  • aba.bc
  • abdcb.
  • The longest common subsequence is ab..b, with
    length 3.

3
Longest Common Subsequence
  • Let's give an alignment a score M in this way:
  • $M = \sum_i s(x_i, y_i)$, where $x_i$ is the i-th character in
    the first aligned sequence and
  • $y_i$ is the i-th character in the second aligned
    sequence.
  • $s(x_i, y_i) = 1$ if $x_i = y_i$
  • $s(x_i, y_i) = 0$ if $x_i \neq y_i$ or either of them is a gap
  • The score for the alignment
  • ababc.
  • abd.cb
  • is $M = s(a,a) + s(b,b) + s(a,d) + s(b,.) + s(c,c) + s(.,b) = 3$.
  • Finding the longest common subsequence of
    sequences S1 and S2 amounts to finding the alignment that
    maximizes the score M.

4
Longest Common Subsequence
  • Subproblem optimality
  • Consider two sequences
  • $S_1 = a_1 a_2 a_3 \ldots a_i$, $S_2 = b_1 b_2 b_3 \ldots b_j$
  • Let the optimal alignment be
  • $x_1 x_2 x_3 \ldots x_{n-1} x_n$
  • $y_1 y_2 y_3 \ldots y_{n-1} y_n$
  • There are three possible cases
  • for the last pair $(x_n, y_n)$.

5
Longest Common Subsequence
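There are three cases for the $(x_n, y_n)$ pair.
$S_1 = a_1 a_2 a_3 \ldots a_i$, $S_2 = b_1 b_2 b_3 \ldots b_j$
$x_1 x_2 x_3 \ldots x_{n-1} x_n$, $y_1 y_2 y_3 \ldots y_{n-1} y_n$
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$, where
  • $M_{i-1,j-1} + s(a_i, b_j)$ is the match/mismatch case,
  • $M_{i,j-1} + 0$ is a gap in sequence 1,
  • $M_{i-1,j} + 0$ is a gap in sequence 2.
  • $M_{i,j}$ is the score for the optimal alignment between
    strings $a_{1..i}$ (substring of a from index 1 to i)
    and $b_{1..j}$.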

6
Longest Common Subsequence
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$

$s(a_i, b_j) = 1$ if $a_i = b_j$; $s(a_i, b_j) = 0$ if $a_i \neq b_j$ or either
of them is a gap.
  • Examples
  • G A A T T C A G T T A (sequence 1)
  • G G A T C G A (sequence 2)

7
Longest Common Subsequence
Fill the score matrix M and trace back table B
$M_{1,1} = \max\{M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0\} = \max\{1, 0, 0\} = 1$
Score matrix M
Trace back table B
8
Longest Common Subsequence
Score matrix M
Trace back table B
$M_{7,11} = 6$ (lower right corner of the score matrix). This
tells us that the best alignment has a score of
6. What is the best alignment?
9
Longest Common Subsequence
We need to use trace back table to find out the
best alignment, which has a score of 6
  • Find the path from lower right
  • corner to upper left corner

10
Longest Common Subsequence
(2) At the same time, write down the alignment
backward
S1
Take one character from each sequence
Take one character from sequence S1 (columns)
S2
Take one character from sequence S2 (rows)
11
Longest Common Subsequence
Take one character from each sequence
Take one character from sequence S1 (columns)
Take one character from sequence S2 (rows)
12
Longest Common Subsequence
Thus, the optimal alignment is
The longest common subsequence is
G.A.T.C.G..A
There might be multiple longest
common subsequences (LCSs) between two given
sequences. These LCSs have the same number of
characters (not counting gaps).
13
Longest Common Subsequence
  • Algorithm LCS (string A, string B)
  • Input: strings A and B
  • Output: the longest common subsequence of A and B
  • M: score matrix
  • B: trace back table (use letters a, b, c for
    the three trace-back moves)
  • n = A.length()
  • m = B.length()
  • // fill in M and B
  • for (i = 0; i < n+1; i++)
  • for (j = 0; j < m+1; j++)
  • if (i == 0) || (j == 0)
  • then M(i,j) = 0
  • else if (A_i == B_j)
  • M(i,j) = max{M(i-1,j-1) + 1, M(i-1,j),
    M(i,j-1)}
  • update the entry in trace table B
  • else
  • M(i,j) = max{M(i-1,j-1), M(i-1,j), M(i,j-1)}
  • update the entry in trace table B
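
The matrix fill and trace back described in these slides can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function name `lcs` is an assumption, and instead of storing a separate trace back table it re-derives each move from the score matrix.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # M[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j], M[i][j - 1])

    # Trace back from the lower-right corner to the upper-left corner.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and M[i][j] == M[i - 1][j - 1] + 1:
            out.append(a[i - 1])      # take one character from each sequence
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j]:
            i -= 1                    # take one character from a only
        else:
            j -= 1                    # take one character from b only
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("GAATTCAGTTA", "GGATCGA"))
```

Because several LCSs of the same length may exist, the string returned depends on the tie-breaking order in the trace back; only its length is guaranteed.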

14
Global Alignment
  • Global Alignment: find the overall similarity
    between two sequences
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$

$s(a_i, b_j) = 1$ if $a_i = b_j$; $s(a_i, b_j) = 0$ if $a_i \neq b_j$ or either
of them is a gap.
$s(a_i, b_j)$ can be replaced by the similarity score
between $a_i$ and $b_j$.
15
Sequence Alignment
16
Local Alignment
  • Local Alignment: find all pairs of substrings
    that have similarity scores higher than a threshold.

[Figure: Global vs. Local]
17
Global Alignment
  • >seq1
  • ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDV
    DAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAA
    VLIAIIVAVITWCILLWSHFG
  • >seq2
  • MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDAL
    SYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWV
    FYGLAAILTVVTLIGVVTIIKAGYSAWKG
  • Global alignment score: -2
  • 10 20 30 40 50
  • Seq1 ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWL---DVDAITRVL
  • . . . .. . .. . ...... . .
  • Seq2 MINPNP-KRSDEPVFWGLFGAGGMW---SAIIAPVMILLVGILLPLGLFPGDALSYERVL
  • 10 20 30 40 50
  • 60 70 80 90 100
  • Seq1 LISSVML--VMIVEILNSAIEAVVDRIGSEYHEL-----SGRAKDMGSAAVLIAI-IVAV
  • ... .. .. .. . . . .. .. . . .. ...
  • Seq2 AFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWVFYGLAAILTVVTLIGV
  • 60 70 80 90 100 110
  • 110 120

18
Local Alignment
  • >seq1
  • ANNTTGFTRIIKAAGYSWKGLRAAWINEAAFRQEGVAVLLAVVIACWLDV
    DAITRVLLISSVMLVMIVEILNSAIEAVVDRIGSEYHELSGRAKDMGSAA
    VLIAIIVAVITWCILLWSHFG
  • >seq2
  • MINPNPKRSDEPVFWGLFGAGGMWSAIIAPVMILLVGILLPLGLFPGDAL
    SYERVLAFAQSFIGRVFLFLMIVLPLWCGLHRMHHAMHDLKIHVPAGKWV
    FYGLAAILTVVTLIGVVTIIKAGYSAWKG
  • Score 19.6 bits
  • Seq 1 6 GFTRIIKAAGYSWKG 20
  • G IIKA WKG
  • Seq 2 115 GVVTIIKAGYSAWKG 129
  • Score 13.1 bits
  • Seq 1 103 IAIIVAVIT 111
  • A I VT
  • Seq 2 104 LAAILTVVT 112

19
Local Alignment
  • Local:
  • $M_{i,j} = \max\{0,\ M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$
  • Global:
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$

20
Gap Penalty
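There are three cases for the $(x_n, y_n)$ pair.
$S_1 = a_1 a_2 a_3 \ldots a_i$, $S_2 = b_1 b_2 b_3 \ldots b_j$
$x_1 x_2 x_3 \ldots x_{n-1} x_n$, $y_1 y_2 y_3 \ldots y_{n-1} y_n$
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + 0,\ M_{i-1,j} + 0\}$, where
  • $M_{i-1,j-1} + s(a_i, b_j)$ is the match/mismatch case,
  • $M_{i,j-1} + 0$ is a gap in sequence 1,
  • $M_{i-1,j} + 0$ is a gap in sequence 2.
  • $M_{i,j}$ is the score for the optimal alignment between
    strings $a_{1..i}$ (substring of a from index 1 to i)
    and $b_{1..j}$.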

21
Gap Penalty
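There are three cases for the $(x_n, y_n)$ pair.
$S_1 = a_1 a_2 a_3 \ldots a_i$, $S_2 = b_1 b_2 b_3 \ldots b_j$
$x_1 x_2 x_3 \ldots x_{n-1} x_n$, $y_1 y_2 y_3 \ldots y_{n-1} y_n$
  • $M_{i,j} = \max\{M_{i-1,j-1} + s(a_i, b_j),\ M_{i,j-1} + G(-, b_j),\ M_{i-1,j} + G(a_i, -)\}$, where
  • $M_{i-1,j-1} + s(a_i, b_j)$ is the match/mismatch case,
  • $M_{i,j-1} + G(-, b_j)$ applies a gap penalty for a gap in sequence 1,
  • $M_{i-1,j} + G(a_i, -)$ applies a gap penalty for a gap in sequence 2.
  • $M_{i,j}$ is the score for the optimal alignment between
    strings $a_{1..i}$ (substring of a from index 1 to i)
    and $b_{1..j}$.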
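
The recurrence with a gap penalty can be sketched as below, covering both the global form and the local form (which adds 0 as a fourth option). This is an illustration under stated assumptions: the gap penalty G is simplified to one constant `gap` per space, and the function name and default scores are hypothetical, not taken from the slides.

```python
def align_score(a, b, match=1, mismatch=0, gap=-1, local=False):
    """Score of the best global (or local) alignment of a and b."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: a leading run of spaces is penalized too.
        for i in range(1, n + 1):
            M[i][0] = i * gap
        for j in range(1, m + 1):
            M[0][j] = j * gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            v = max(M[i - 1][j - 1] + s,   # match/mismatch
                    M[i][j - 1] + gap,     # gap in sequence 1
                    M[i - 1][j] + gap)     # gap in sequence 2
            if local:
                v = max(0, v)              # a local alignment may restart anywhere
            M[i][j] = v
            best = max(best, v)
    # Global: lower-right corner. Local: best cell anywhere in the matrix.
    return best if local else M[n][m]


if __name__ == "__main__":
    print(align_score("ababc", "abdcb", gap=0))
    print(align_score("xxabcxx", "yyabcyy", mismatch=-1, gap=-1, local=True))
```

With `match=1`, `mismatch=0` and `gap=0` the global score reduces to the LCS length, which is the special case used on the earlier slides.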

22
Gap Penalty
Seq1 NSAIEAVVDRIGSEYHEL-----SGRWVFYGLAA
Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA

Seq1 NSAIEAVVDRIGSEYHE--L-S--GRWVFYGLAA
Seq2 VLPLWCGLHRMHHAMHDLKLHSPAGKWVFYGLAA
  • 1. Constant gap weight: equal penalty for each
    gap, regardless of the gap length.
  • A gap of length q has a penalty of W, the same as a
    gap of length 1.
  • 2. Affine gap weight: a constant weight for each
    additional space.
  • Gap opening penalty: $W_g$ (-10)
  • Gap extension penalty: $W_s$ (-1)
  • A gap of length q has a penalty of
    $W_g + (q-1) W_s$.
  • How to modify the recursion function?
  • 3. Convex gap weight: each additional space
    contributes less than the preceding space.
  • $W_g + \log(q)$
  • 4. Alphabet-weighted gap penalty: the penalty
    also depends on the letter.
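
To make the affine gap weight concrete, the sketch below scores the gaps in the two Seq1 rows shown on this slide, using the slide's values of -10 for opening and -1 for extension. The helper names are hypothetical and only the gap contribution is computed, not the full alignment score.

```python
import re


def gap_lengths(aligned_row: str) -> list[int]:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def affine_penalty(q: int, w_open: int = -10, w_ext: int = -1) -> int:
    """Penalty of a single gap of length q: Wg + (q - 1) * Ws."""
    return w_open + (q - 1) * w_ext


def total_gap_penalty(aligned_row: str) -> int:
    """Sum of the affine penalties over every gap in the row."""
    return sum(affine_penalty(q) for q in gap_lengths(aligned_row))


if __name__ == "__main__":
    one_long_gap = "NSAIEAVVDRIGSEYHEL-----SGRWVFYGLAA"
    three_short_gaps = "NSAIEAVVDRIGSEYHE--L-S--GRWVFYGLAA"
    print(total_gap_penalty(one_long_gap))
    print(total_gap_penalty(three_short_gaps))
```

Both rows contain five spaces, but the affine weight charges the opening penalty three times in the second row, so the single long gap is preferred.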

23
BLAST
  • http://www.ncbi.nlm.nih.gov/BLAST/
  • Local alignment
  • Database search
  • Efficiency is vital

26
(No Transcript)
27
(No Transcript)
28
Raw Score S
  • The raw score S for an alignment is calculated by
    summing the scores for each aligned position and
    the scores for gaps

29
Bit Score S'
  • Raw scores have little meaning without detailed
    knowledge of the scoring system used, or more
    simply its statistical parameters K and lambda.
    Unless the scoring system is understood, citing a
    raw score alone is like citing a distance without
    specifying feet, meters, or light years. By
    normalizing a raw score using the formula
    $S' = \frac{\lambda S - \ln K}{\ln 2}$
    one attains a "bit score" S', which has a
    standard set of units.

30
Bit Score S'
  • The value S' is derived from the raw alignment
    score S in which the statistical properties of
    the scoring system used have been taken into
    account. Because bit scores have been normalized
    with respect to the scoring system, they can be
    used to compare alignment scores from different
    searches.

31
Significance
  • The significance of each alignment is computed as
    a P value or an E value
  • E value: Expectation value. The number of
    different alignments with scores equivalent to or
    better than S that are expected to occur in a
    database search by chance. The lower the E value,
    the more significant the score.
  • P value: The probability of an alignment
    occurring with the score in question or better.
    The p value is calculated by relating the
    observed alignment score, S, to the expected
    distribution of HSP scores from comparisons of
    random sequences of the same length and
    composition as the query to the database. The
    most highly significant P values will be those
    close to 0. P values and E values are different
    ways of representing the significance of the
    alignment.

32
E-value
  • In the limit of sufficiently large sequence
    lengths m and n, the statistics of HSP scores are
    characterized by two parameters, K and lambda.
    Most simply, the expected number of HSPs with
    score at least S is given by the formula
    $E = K m n e^{-\lambda S}$ (1)
    We call this the E-value for the score S. This
    formula makes eminently intuitive sense. Doubling
    the length of either sequence should double the
    number of HSPs attaining a given score. Also, for
    an HSP to attain the score 2x it must attain the
    score x twice in a row, so one expects E to
    decrease exponentially with score. The parameters
    K and lambda can be thought of simply as natural
    scales for the search space size and the scoring
    system respectively.

33
P-value
  • The number of random HSPs with score ≥ S is
    described by a Poisson distribution. This means
    that the probability of finding exactly a HSPs
    with score ≥ S is given by $e^{-E} \frac{E^a}{a!}$, where E is the
    E-value of S given by equation (1) above.
    Specifically, the chance of finding zero HSPs with
    score ≥ S is $e^{-E}$, so the probability of finding
    at least one such HSP is $P = 1 - e^{-E}$. This is the P-value
    associated with the score S. For example, if one
    expects to find three HSPs with score ≥ S, the
    probability of finding at least one is 0.95. The
    BLAST programs report E-values rather than
    P-values because it is easier to understand the
    difference between, for example, E-values of 5 and
    10 than P-values of 0.993 and 0.99995.
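
The relationships among raw score, bit score, E-value and P-value can be checked numerically. The sketch below assumes the standard Karlin-Altschul forms $E = K m n e^{-\lambda S}$, $S' = (\lambda S - \ln K)/\ln 2$ and $P = 1 - e^{-E}$; the function names are hypothetical, and the values of K and lambda in the usage lines are purely illustrative, not taken from any real scoring system.

```python
import math


def e_value(S, m, n, K, lam):
    """Expected number of HSPs with score at least S: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, K, lam):
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def prob_exactly(a, E):
    """Poisson probability of exactly a HSPs when E are expected."""
    return math.exp(-E) * E ** a / math.factorial(a)


def p_value(E):
    """Probability of at least one HSP: 1 - exp(-E)."""
    return 1.0 - math.exp(-E)


if __name__ == "__main__":
    K, lam = 0.13, 0.32          # illustrative parameters only
    S, m, n = 60, 250, 1_000_000
    E = e_value(S, m, n, K, lam)
    print(E, bit_score(S, K, lam), p_value(E))
    # Reproduces the figures quoted in the slide text.
    print(round(p_value(3), 2), round(p_value(5), 3), round(p_value(10), 5))
```

A useful consequence of the normalization is that $E = m n 2^{-S'}$, so a bit score can be converted to an E-value without knowing K and lambda.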

34
BLAST
  • The BLAST programs (Basic Local Alignment Search
    Tools) are a set of sequence comparison
    algorithms introduced in 1990 that are used to
    search sequence databases for optimal local
    alignments to a query.
  • Break the query and database sequences into
    fragments ("words"), and initially seek matches
    between fragments. The initial search is done for
    a word of length "W" that scores at least "T"
    when compared to the query using a given
    substitution matrix.
  • Word hits are then extended in either direction
    in an attempt to generate an alignment with a
    score exceeding the threshold of "S". The "T"
    parameter dictates the speed and sensitivity of
    the search.

35
(No Transcript)
36
BLAST
  • Use keyword trees to find HSPs in a subject
    sequence.
  • For each HSP, extend the local alignment at both
    ends as long as the alignment score is higher
    than threshold T.
  • If a local alignment has a score higher than C,
    then there is significant similarity
    between the query and subject sequences. Report a
    hit.

37
(No Transcript)
38
(No Transcript)
39
Inexact matching
  • Alignment
  • Motif/profile searching

40
PROSITE
  • A profile or weight matrix is a table of
    position-specific alphabet weights and gap costs.
    These numbers (also referred to as scores) are
    used to calculate a similarity score for any
    alignment between a profile and a sequence, or
    parts of a profile and a sequence. An alignment
    with a similarity score higher than or equal to a
    given cut-off value constitutes a motif
    occurrence.

41
Motifs and Matching
  • Motif Finding
  • Given a set of protein sequences, find the
    motif(s) that are shared by these proteins.
  • Motif Scanning
  • Given a motif and a protein sequence, find
    the occurrences (not necessarily identical) of the
    motif in the protein sequence.

42
Motifs
  • How significant is a motif? How different is it
    from a uniform distribution? Does it represent a
    biologically significant region?
  • Information Content (IC)
  • Probability of observing character j at
    position i

43
Motifs
  • Usually, the background is not a uniform
    distribution. Thus, it is more useful to take
    into account the background distribution
  • KL divergence score
  • Probability of observing character j at
    position i
  • Probability that character j occurs at position
    i based on background distribution.
  • The higher the score, the more unlikely it is to obtain
    the motif by chance, i.e., the more significant the motif is.
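
The scoring formula itself did not survive in this transcript, so the following sketch assumes the standard definition of relative entropy (KL divergence), summed over motif columns: $\sum_i \sum_j p_i(j) \log_2 \frac{p_i(j)}{b(j)}$. The function names and the column representation (one dict of counts per column) are hypothetical. With a uniform background this reduces to the usual information content in bits.

```python
import math


def column_kl(counts, background):
    """KL divergence (bits) of one motif column from the background."""
    total = sum(counts.values())
    score = 0.0
    for ch, c in counts.items():
        if c == 0:
            continue                  # 0 * log(0) is taken as 0
        p = c / total
        score += p * math.log2(p / background[ch])
    return score


def motif_kl(columns, background):
    """Sum of the per-column divergences over the whole motif."""
    return sum(column_kl(col, background) for col in columns)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    conserved = {"A": 4, "C": 0, "G": 0, "T": 0}
    flat = {"A": 1, "C": 1, "G": 1, "T": 1}
    print(column_kl(conserved, uniform), column_kl(flat, uniform))
    print(motif_kl([conserved, flat, conserved], uniform))
```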

44
Motifs
  • How to find potential occurrences of a motif in a
    given string?
  • Motif/profile
  • Counts: the entry in the (j,i)th cell is the number of times
    character j occurs in the ith column, $N_{i,j}$
  • Frequency (likelihood)
  • Log-likelihood
  • Log-odds
45
Motifs
  • For a given motif of length K, to scan an input
    sequence to find the occurrences of the motif.
  • Assume that each entry in the motif is Pi(j)

ATCGCTAGCTAAGTAGTGGGCTAAGCTAAGCTAAGTGTGTAGCGTA
A
C
T
G
$X = x_1 x_2 x_3 \ldots x_k$ is the substring within the sliding
window
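
Scanning with a sliding window can be sketched as follows. The code converts a count matrix to log-odds scores and scores every window of length K. It is only a sketch: the pseudocount, the toy count matrix and the function names are assumptions introduced here, and positions are 0-based.

```python
import math


def log_odds_matrix(counts, background, pseudocount=1.0):
    """counts maps each character to its per-column counts.
    Returns one dict per column: character -> log2(P_i(j) / background(j))."""
    alphabet = list(counts)
    k = len(counts[alphabet[0]])
    matrix = []
    for i in range(k):
        total = sum(counts[ch][i] for ch in alphabet) + pseudocount * len(alphabet)
        matrix.append({
            ch: math.log2(((counts[ch][i] + pseudocount) / total) / background[ch])
            for ch in alphabet
        })
    return matrix


def scan(sequence, matrix):
    """Score every window of length K; returns a list of (position, score)."""
    k = len(matrix)
    return [
        (pos, sum(matrix[i][sequence[pos + i]] for i in range(k)))
        for pos in range(len(sequence) - k + 1)
    ]


if __name__ == "__main__":
    # Toy counts for a 3-column motif that strongly prefers "TAG".
    counts = {"A": [0, 8, 0], "C": [0, 0, 0], "G": [0, 0, 8], "T": [8, 0, 0]}
    background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    hits = scan("CCTAGCC", log_odds_matrix(counts, background))
    print(max(hits, key=lambda h: h[1]))
```

Summing per-column log-odds treats the positions as independent, which corresponds to the iid model; a Markov chain background would need a different denominator.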
46
Motifs
  • Independent identical distribution (iid)

Markov chain model (1st order or higher order)
47
Motifs
48
MotifScanner
  • http://homes.esat.kuleuven.be/thijs/Work/MotifScanner.html

49
JASPAR Motifs
  • JASPAR
  • MA0112
  • 1 1 7 2 0 0 0 6 1 2 3 1 1 5 0 0 2 3
  • 5 5 1 0 0 0 7 0 7 5 2 1 0 1 8 9 4 4
  • 1 1 1 7 9 0 2 2 1 1 4 1 8 3 0 0 0 1
  • 2 2 0 0 0 9 0 1 0 1 0 6 0 0 1 0 3 1

50
Suffix Trees for Inexact Matching
  • In a rooted tree T, a node u is an ancestor of a
    node v if u is on the unique path from the root
    to v.
  • A proper ancestor of v is an ancestor that is not
    v.
  • In a rooted tree T, the lowest common ancestor
    (lca) of two nodes x and y is the deepest node in
    T that is an ancestor of both x and y.
  • lca retrieval problem Given two nodes, x and y,
    of a rooted tree, to find their lca.
  • Let n be the number of nodes in a rooted tree T,
    then after a O(n)-time preprocessing, lca
    retrieval problem can be solved in constant
    time!!! (Independent of n)

51
lca Retrieval
52
Longest Common Extension
  • Longest common extension (lce) problem: given
    strings S1 and S2, with a total length of n, for
    each specified index pair (i, j), find the
    length of the longest substring of S1 starting at i
    that matches a substring of S2 starting at
    position j. That is, find the length of the
    longest prefix of suffix i of S1 that matches a
    prefix of suffix j of S2.
  • Given an index pair (i, j), it is easy to solve the
    lce problem in linear time.
  • But we are challenged to solve it in constant
    time for each given index pair (after a linear
    time preprocessing)!

53
Longest Common Extension
  • Build a generalized suffix tree for S1 and S2.
    O(n)
  • Compute the string-depth of each node. O(n)
  • For a given pair (i,j), lce problem is reduced to
    lca problem, which can be solved in constant
    time.
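
For reference, the easy linear-time version of lce is a direct character comparison, sketched below with 0-based indices. This is not the constant-time method of the slides, which requires a generalized suffix tree plus lca preprocessing; the function signature is an assumption made for illustration.

```python
def lce(s1: str, i: int, s2: str, j: int) -> int:
    """Length of the longest common prefix of s1[i:] and s2[j:].

    Naive scan: time proportional to the answer, not constant.
    """
    k = 0
    while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
        k += 1
    return k


if __name__ == "__main__":
    s = "xababababy"
    print(lce(s, 1, s, 3))
```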

54
K-Mismatch Problem
  • Given a pattern P, a text T, and a fixed number k
    that is independent of the lengths of P and T, a
    k-mismatch of P is a |P|-length substring of T
    that matches at least |P|-k characters of P. That
    is, it matches P with at most k mismatches.
  • Does not allow insertions or deletions
  • Matches or mismatches only
  • P = bend, T = abentbananaend, k = 2

55
K-Mismatch Problem
  • |P| = n, |T| = m
  • O(km) vs. O(nm), k << n
  • Algorithm
  • INPUT: (T, P, i, K)
  • OUTPUT: whether a K-mismatch of P occurs in T
    starting at i
  • j = 1; i' = i; count = 0
  • While (count ≤ K)
  • l = lce(j, i')
  • if j + l = n + 1, then a K-mismatch of P occurs in T
    starting at i; stop
  • count++
  • if count > K, then a K-mismatch of P does NOT
    occur in T starting at i; stop
  • j = j + l + 1; i' = i' + l + 1
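
The jump-over-mismatches idea can be sketched in Python as below, with 0-based indices. The `lce` helper here is a naive scan standing in for the constant-time suffix-tree query, so this version illustrates the control flow rather than the O(km) bound; all function names are hypothetical.

```python
def lce(s1, i, s2, j):
    """Naive longest common extension of s1[i:] and s2[j:]."""
    k = 0
    while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
        k += 1
    return k


def k_mismatch_at(T, P, i, k):
    """True if P matches T[i:i+len(P)] with at most k mismatches."""
    n = len(P)
    if i + n > len(T):
        return False
    j, t, count = 0, i, 0
    while True:
        l = lce(P, j, T, t)
        if j + l >= n:            # reached the end of P
            return True
        count += 1                # P[j + l] and T[t + l] differ
        if count > k:
            return False
        j += l + 1                # skip past the mismatch
        t += l + 1


def k_mismatch_positions(T, P, k):
    """All 0-based start positions of k-mismatches of P in T."""
    return [i for i in range(len(T) - len(P) + 1) if k_mismatch_at(T, P, i, k)]


if __name__ == "__main__":
    print(k_mismatch_positions("abentbananaend", "bend", 2))
```

For the slide's example the matching windows are "bent", "bana" and "aend".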

56
K-Mismatch Problem
  • When the alphabet size is small, e.g. 4, and
    k is small, e.g. k = 2,
  • A practical approach in biological database
    search:
  • Build a suffix tree of T: O(m)
  • Enumerate all k-mismatches (p) of P (16 of them)
  • Find the occurrences of every p: O(16n)
  • Total time: O(m + 16n) << O(km), since n << m
57
Maximal Palindromes
  • An even-length substring S' of S is a maximal
    palindrome of radius k if, starting in the middle
    of S', S' reads the same in both directions for k
    characters but not for any k' > k characters.
  • aabactgaaccaat
  • An odd-length maximal palindrome S' is similarly
    defined after excluding the middle character of
    S'.
  • aabactgaaccaat

58
Palindrome Problem
  • Given a string S of length n, the palindrome
    problem is to locate all maximal palindromes in
    S.
  • Sr is the reverse of S. O(n)
  • For q from 1 to n-1:
  • Find the lce for index pair (q+1, n-q+1) in S and
    Sr, respectively. Let k be the length of
    lce(q+1, n-q+1).
  • If k > 0, then report a maximal palindrome of
    radius k centered at q.
  • All the maximal even-length palindromes in a
    string can be identified in linear time.
  • How about odd-length palindromes?
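
The reduction from even-length maximal palindromes to lce queries between S and its reverse can be sketched as below. A naive `lce` scan stands in for the constant-time suffix-tree query, so the linear-time bound is not achieved here; the function names are hypothetical. The center q means the boundary between the q-th and (q+1)-th characters (1-based), as on the slide.

```python
def lce(s1, i, s2, j):
    """Naive longest common extension of s1[i:] and s2[j:] (0-based)."""
    k = 0
    while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
        k += 1
    return k


def even_palindromes(S):
    """Return (q, k) for each maximal even-length palindrome of radius k > 0."""
    n = len(S)
    Sr = S[::-1]
    result = []
    for q in range(1, n):
        # S[q:] reads rightward from the center;
        # Sr[n - q:] is S[:q] reversed, i.e. it reads leftward from the center.
        k = lce(S, q, Sr, n - q)
        if k > 0:
            result.append((q, k))
    return result


if __name__ == "__main__":
    print(even_palindromes("aabactgaaccaat"))
```

In the slide's example string, the palindrome aaccaa shows up as center 10 with radius 3.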

59
Complemented Palindromes
  • In DNA, two halves of the substring form a
    palindrome only if the characters in one half are
    converted to their complement characters. (A and
    T are complements, and C and G are complements.)
  • ATTAGCTAAT
  • TAATCGATTA
  • Finding all complemented palindromes in a string
    S can be solved in linear time.
  • Let Sr be the complemented reverse of S.

60
K-mismatch Palindromes
  • A k-mismatch palindrome is a substring that
    becomes a palindrome after k or fewer mutations.
  • axabbcca: a 2-mismatch palindrome.
  • Finding all k-mismatch palindromes in a string S
    can be solved in O(kn) time.

61
Tandem Repeats
  • A tandem repeat α is a string that can be written
    as ββ, where β is a substring.
  • Does not need to be maximal
  • xababababy: two tandem repeats starting at position 2.
  • Find all tandem repeats in string S in O(n^2)
    time.
  • For every pair of indices (i, j) (i < j):
  • Find lce(i, j)
  • If lce(i, j) ≥ j-i, then there is a tandem repeat starting
    at i with length 2(j-i)
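
The pairwise test can be sketched as follows, using the condition that the common extension must cover at least j - i characters. The naive `lce` makes this slower than the O(n^2) bound, which relies on constant-time lce queries; names are hypothetical and positions are 0-based.

```python
def lce(s1, i, s2, j):
    """Naive longest common extension of s1[i:] and s2[j:]."""
    k = 0
    while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
        k += 1
    return k


def tandem_repeats(S):
    """Return (start, length) for every tandem repeat in S."""
    out = []
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n):
            # S[i:j] must equal S[j:j + (j - i)].
            if lce(S, i, S, j) >= j - i:
                out.append((i, 2 * (j - i)))
    return out


if __name__ == "__main__":
    print([r for r in tandem_repeats("xababababy") if r[0] == 1])
```

For xababababy this finds abab and abababab starting at 0-based position 1, which is position 2 on the slide.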

62
K-mismatch Tandem Repeats
  • A substring that becomes a tandem repeat after k
    or fewer mutations.
  • axabaybb: a 2-mismatch tandem repeat
  • Find all k-mismatch tandem repeats in string S in
    O(kn^2) time.
  • Tandem repeats (or k-mismatch tandem repeats) can
    be found faster.

63
Repetitive Structures
  • A maximal pair in a string S is a pair of
    identical substrings α and β in S such that the
    character to the immediate left (right) of α is
    different from the character to the immediate
    left (right) of β.
  • Represented by (p1, p2, n), where p1 and p2 are
    the starting positions of α and β, and n is their
    length.
  • xabcyiiizabcqabcyrxar
  • (2,10,3) (10,14,3)
  • But (2,14,3) is not a maximal pair. Instead,
    (2,14,4) is a maximal pair.
  • For an input string S, R(S) is the set of maximal
    pairs. R(S) is too large to be useful.

64
Repetitive Structures
  • A maximal repeat α is a substring of S that
    occurs in a maximal pair of S.
  • xabcyiiizabcqabcyrxar
  • (2,10,3) (10,14,3): α = abc
  • (2,14,4): α = abcy
  • For an input string S, R'(S) is the set of
    maximal repeats.
  • |R'(S)| ≤ |R(S)|
  • A supermaximal repeat is a maximal repeat that
    never occurs as a substring of any other maximal
    repeat.
  • abcy is a supermaximal repeat; abc is not.
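
The definitions of maximal pair, maximal repeat and supermaximal repeat can be checked by brute force on the example string. The sketch below is for illustration only: it is not the linear-time suffix-tree method, it treats a string boundary as a character that differs from everything, and its function names are hypothetical. Positions in the output are 1-based to match the slides.

```python
def maximal_pairs(S):
    """Return all maximal pairs as (p1, p2, length), with 1-based positions."""
    n = len(S)
    pairs = []
    for p1 in range(n):
        for p2 in range(p1 + 1, n):
            # The left characters must differ (a boundary counts as different).
            if p1 > 0 and S[p1 - 1] == S[p2 - 1]:
                continue
            # Extend right as far as possible, so the right characters differ.
            length = 0
            while p2 + length < n and S[p1 + length] == S[p2 + length]:
                length += 1
            if length > 0:
                pairs.append((p1 + 1, p2 + 1, length))
    return pairs


def maximal_repeats(S):
    """Substrings that occur in at least one maximal pair."""
    return {S[p1 - 1:p1 - 1 + length] for p1, _, length in maximal_pairs(S)}


def supermaximal_repeats(S):
    """Maximal repeats that are not substrings of another maximal repeat."""
    reps = maximal_repeats(S)
    return {r for r in reps if not any(r != o and r in o for o in reps)}


if __name__ == "__main__":
    S = "xabcyiiizabcqabcyrxar"
    print(sorted(maximal_pairs(S)))
    print(sorted(supermaximal_repeats(S)))
```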

65
Maximal Repeats
  • Goal: to find all maximal repeats in linear time.
  • Lemma 7.12.1: Let T be the suffix tree for string
    S. If a string α is a maximal repeat in S, then α
    must be the path-label of an internal node v in
    T.
  • Theorem 7.12.1: There can be at most n maximal
    repeats in any string of length n.
  • Is it true that the path-label of any internal
    node must be a maximal repeat?

66
Maximal Repeats
  • For each position i in string S, character S(i-1)
    is called the left character of i.
  • Let T be a suffix tree. The left character of a
    leaf (i) in T is the left character of position
    i.
  • A node v is left diverse if at least two leaves
    in v's subtree have different left characters.
  • The path-label of a node v in T is a maximal repeat
    if and only if v is left diverse.
  • A ⇒ B is easy
  • B ⇒ A
67
Maximal Repeats
  • Let node v with path-label α be left diverse. Let
    p and q be two leaves below v that have
    different left characters, b ≠ x. (Here b and x are the
    left characters of p and q, and c and y are the characters
    that follow α at p and q.)
  • If c ≠ y (leaves p and q diverge at v), then α is a
    maximal repeat, because (p, q, α) is a maximal
    pair.

68
Maximal Repeats
  • Else if c = y (leaves p and q diverge at a point
    below v), then there is another branch at v
    (because it is an internal node). Let the leaf
    be k.
  • Then n ≠ y/c.
  • If b ≠ m, then (p, k, α) is a maximal pair, so α is a
    maximal repeat.
  • Else if b = m, then m ≠ x (because b ≠ x), so (q, k, α) is a
    maximal pair and α is a maximal repeat.

69
Maximal Repeats
  • The property of left diverse propagates upward in
    T.
  • If a node is left diverse then its parent node is
    also left diverse.
  • A node is a frontier node if it is left diverse
    and none of its children are left diverse.
  • Trimming off all leaves and nodes under frontier
    nodes, we obtain a subtree (frontier tree) in
    which every path from the root to a node is a
    maximal repeat.
  • The frontier tree is a compact representation of
    all maximal repeats.

70
Maximal Repeats
71
Maximal Repeats
  • Finding left diverse node
  • Algorithm
  • Start from leaves
  • At a node v,
  • if some children are left diverse, then label v
    as left diverse
  • else if all of v's children have the same left
    character or v is a leaf, then record the left
    character.
  • else label v as a frontier node.
  • Depth-first
  • At each node the processing time is proportional
    to the number of children. Return whether the node is
    left diverse, or the left character of the node.
  • O(n) in total.
  • Maximal repeats in S can be found in O(n) time.
  • To be precise, a compact representation of
    maximal repeats in S can be found in O(n) time.

72
Supermaximal Repeats
  • A supermaximal repeat is a maximal repeat that
    never occurs as a substring of any other maximal
    repeat.
  • A supermaximal repeat must be the path-label from
    the root to a frontier node.

73
Supermaximal Repeats
  • Is it true that every path from the root to a
    frontier node is a supermaximal repeat?

Node s is NOT left diverse; c ≠ y.
d ≠ e, c ≠ y. The parent of leaves i and j (node u) is
left diverse. The path-label of u is αβγ, which is
a maximal repeat.
74
Supermaximal Repeats
  • If a frontier node v has a child that is an
    internal node, then the path-label of v is NOT a
    supermaximal repeat.

75
Supermaximal Repeats
  • Is it true that every path from the root to a
    frontier node that has no internal-node children
    is a supermaximal repeat?

d ≠ e, c ≠ y. The parent of leaves i and j (node u) is
left diverse. The path-label of u is αβ, which is
a maximal repeat.
76
Supermaximal Repeats
  • If a frontier node v has a child that is an
    internal node, then the path-label of v is NOT a
    supermaximal repeat.
  • If a frontier node v has two children that have
    the same left character, then the path-label of v
    is NOT a supermaximal repeat.
  • A frontier node v represents a supermaximal repeat
    if and only if all of its children are leaves and
    each leaf has a distinct left character.
  • To check whether all children are leaves: O(k)
  • To check whether all children have distinct left
    characters: O(k) (assuming constant alphabet size)
  • Supermaximal repeats can be found in linear time.
  • The nodes corresponding to supermaximal repeats
    can be found in linear time.

77
Maximal Pairs
  • How to output maximal pairs?

At a left diverse node v, if i and j are rooted at
different children of v and i and j have
different left characters, then (i, j, α) defines a
maximal pair.
78
Maximal Pairs
  • Starting from leaves
  • For each node v, we will maintain one linked
    list per alphabet character.
  • The list for character x consists of the leaves in v's
    subtree that have x as their left character.
  • x → j, q
  • b → i
  • At each left diverse node, output the maximal
    pairs using the linked lists of its children.
  • Link (not copy) the children's linked lists to get
    v's linked lists: O(c) (assuming constant
    alphabet size; c is the number of children)
  • O(n + k) in total, where k is the number of maximal
    pairs.