Efficient Computation of Substring Equivalence Classes with Suffix Arrays

1
Efficient Computation of Substring
Equivalence Classes with Suffix Arrays
  • Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
  • Kyushu University, Japan

2
Contents
  • Introduction
  • Problem definition
  • Suffix tree based algorithm
  • Simulation by suffix array
  • Computational experiment
  • Application
  • Summary

3
Main contribution
Time- and space-efficient computation of substring
equivalence classes [Blumer et al. 1987] with
suffix arrays
  • Linear time and space
  • Faster and requires less memory than suffix
    tree and CDAWG based methods

4
Equivalence relation and classes
Substrings with essentially identical occurrences
in w are equivalent (≡).
Example:
Bettyboughtabitofbetterbutterandmadeabetterbatterafterbreakfast.
bet ≡ betterb
5
Problem
  • Input: string w of length n
  • Output: the equivalence classes on w
  • Difficulty
  • The total number of elements in the equivalence
    classes (ECs for short) is O(n^2).
  • Solution
  • The number of ECs is O(n).
  • Each EC can be succinctly represented in O(1)
    space.

6
Succinct representation of the ECs
  • Representative: the longest element (maximal
    extension)
  • Minimal strings: the elements that fall into
    another EC when their leftmost or rightmost
    character is deleted

The elements of the EC of x can be enumerated from
its representative and minimal strings.
example
representative
Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
minimal strings
7
Problem
  • Input: string w of length n
  • Output: succinct representations of the
    equivalence classes on w
  • Additionally, for each EC we output
  • size (the number of elements in the EC)
  • frequency (the number of occurrences of the
    elements of the EC)
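
The requested output can be illustrated with a brute-force sketch. It is quadratic or worse and is not the algorithm of this talk; the function names (`occurrences`, `maximal_extension`, `succinct_classes`) are hypothetical, and a minimal string is taken here to be an element whose left-deleted and right-deleted forms both fall outside the EC.

```python
from collections import defaultdict

def occurrences(w, x):
    """Start positions (0-based) of every occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]

def maximal_extension(w, x):
    """Extend x to the left and right while all occurrences agree on the next character.

    Assumes x is a non-empty substring of w.
    """
    starts, length = occurrences(w, x), len(x)
    while min(starts) > 0 and len({w[s - 1] for s in starts}) == 1:
        starts = [s - 1 for s in starts]
        length += 1
    while max(starts) + length < len(w) and len({w[s + length] for s in starts}) == 1:
        length += 1
    return w[starts[0]:starts[0] + length]

def succinct_classes(w):
    """Map representative -> (minimal strings, size, frequency)."""
    members = defaultdict(set)
    substrings = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    for x in substrings:
        members[maximal_extension(w, x)].add(x)
    result = {}
    for rep, ec in members.items():
        # Assumed reading of "minimal": no shorter element of the EC remains
        # after deleting the leftmost or the rightmost character.
        minimal = {x for x in ec if x[1:] not in ec and x[:-1] not in ec}
        result[rep] = (minimal, len(ec), len(occurrences(w, rep)))
    return result

classes = succinct_classes("ababbbabbc")
minimal, size, frequency = classes["babb"]
print(sorted(minimal), size, frequency)
```

On ababbbabbc the class represented by babb also contains abb, because both occurrences of abb are preceded by b.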

8
Possible solutions
  • Suffix tree [Weiner 1973]
  • Compact Directed Acyclic Word Graph (CDAWG)
    [Blumer et al. 1985]
  • ECs can be computed with either of the data
    structures in linear time and space.

9
Suffix tree (with suffix link)
ababbbabbc
Ignore leaves here because they form a trivial EC.
10
Equivalence classes on suffix tree
ababbbabbc
[Figure: suffix tree of ababbbabbc with its nodes grouped into equivalence classes. One highlighted EC contains babb, bab and ba. EC definition: substrings with essentially the same occurrences.]
11
Suffix tree algorithm
  • foreach node v in the suffix tree
      • if (node v is the representative of EC v)
          • follow the suffix link
          • while (node is in EC v)
              • follow the suffix link
          • compute size and minimal strings
          • output succinct representation of EC v
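
The loop above can be sketched without building a real suffix tree, by treating every substring that is followed by at least two distinct characters as an internal node. This is only an illustration under stated assumptions: the name `suffix_tree_ecs` is hypothetical, w is assumed to end with a character that occurs nowhere else, the suffix link of a node is simulated by dropping its first character, and the brute-force tables make it far from linear time. The size is accumulated as the sum of the edge-label lengths along the suffix-link chain.

```python
from collections import Counter, defaultdict

def suffix_tree_ecs(w):
    """Return {representative: (frequency, size)} for the non-leaf ECs of w."""
    n = len(w)
    freq = Counter(w[i:j] for i in range(n) for j in range(i + 1, n + 1))
    followers = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            followers[w[i:j]].add(w[j])
    # Internal nodes of the suffix tree: right-branching substrings.
    nodes = {x for x, chars in followers.items() if len(chars) >= 2}

    def parent_len(v):
        # Label length of the parent: longest proper prefix that is a node (0 = root).
        return max((k for k in range(1, len(v)) if v[:k] in nodes), default=0)

    result = {}
    for v in nodes:
        # v is a representative only if no single character extends it to the left.
        if any(freq[a + v] == freq[v] for a in set(w)):
            continue
        size, u = 0, v
        # Follow (simulated) suffix links while the node stays in the EC of v.
        while u in nodes and freq[u] == freq[v]:
            size += len(u) - parent_len(u)
            u = u[1:]
        result[v] = (freq[v], size)
    return result

print(suffix_tree_ecs("ababbbabbc"))
```

For the representative babb the chain is babb, then abb, with edge-label lengths 3 and 1, so the size is 4.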

12
Algorithm with suffix tree

[Figure: suffix tree of ababbbabbc traversed by the algorithm.]
Suffix tree requires large memory space.
13
Suffix array [Manber and Myers 1993]
  • Can simulate traversal on suffix tree
      • using lcp and rank arrays [Kasai et al. 2003]
  • Can simulate traversal on suffix links
      • using an additional data structure, the suffix
        link table [Abouelhoda et al. 2004]

14
Suffix array
Sort the suffixes lexicographically
ababbbabbc
Suffix Array
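
A minimal sketch of the suffix array of the running example, built by directly sorting the suffixes. This naive construction is only for illustration (it is not linear time), the function name `suffix_array` is hypothetical, and positions are 0-based here, whereas the slides may number them from 1.

```python
def suffix_array(w):
    """Start positions of the suffixes of w in lexicographic order (naive sort)."""
    return sorted(range(len(w)), key=lambda i: w[i:])

w = "ababbbabbc"
sa = suffix_array(w)
for i, p in enumerate(sa):
    print(i, p, w[p:])
```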
15
Lcp array
ababbbabbc
lcp[i]: the length of the longest common
prefix of the i-th and (i−1)-th suffixes
Lcp Array
Suffix Array
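
One well-known way to fill the lcp array in linear time, given the suffix array, is the procedure usually attributed to Kasai et al. The sketch below assumes 0-based indices and the convention lcp[0] = 0; the function name `lcp_array` is hypothetical, and the suffix array is built naively only to keep the example self-contained.

```python
def lcp_array(w, sa):
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    n = len(w)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0  # common prefix length carried over from the previous text position
    for p in range(n):
        if rank[p] == 0:
            h = 0
            continue
        q = sa[rank[p] - 1]  # suffix that precedes suffix p in sorted order
        while p + h < n and q + h < n and w[p + h] == w[q + h]:
            h += 1
        lcp[rank[p]] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp

w = "ababbbabbc"
sa = sorted(range(len(w)), key=lambda i: w[i:])
print(lcp_array(w, sa))
```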
16
Rank array
ababbbabbc
rank[SA[i]] = i
Rank Array
Suffix Array
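
The rank array is simply the inverse permutation of the suffix array. A small sketch with 0-based indices (the function name `rank_array` is hypothetical):

```python
def rank_array(sa):
    """Inverse of the suffix array: rank[sa[i]] == i for every i."""
    rank = [0] * len(sa)
    for i, p in enumerate(sa):
        rank[p] = i
    return rank

w = "ababbbabbc"
sa = sorted(range(len(w)), key=lambda i: w[i:])
rank = rank_array(sa)
print(rank)
```

With rank available, the suffix that starts one text position later than SA[i] can be located in the suffix array as rank[SA[i] + 1], which is what makes the simulation of suffix links possible.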
17
Suffix array has less information
Information available during traversal for each
data structure, when visiting node v

Suffix Tree
  1. label from root to each node
  2. label from parent to each node
  3. number of leaves in each subtree
  4. parent of each node
  5. children of each node
  6. suffix link of each node

Suffix Array
  • length of label from root to v
  • length of label from root to the parent of v
  • leftmost leaf ID in subtree rooted at v
  • rightmost leaf ID in subtree rooted at v
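
The four quantities listed for the suffix array can be obtained with the standard stack-based enumeration of lcp intervals. The sketch below assumes 0-based indices and the convention that lcp[i] compares the (i−1)-th and i-th sorted suffixes with lcp[0] = 0; the names `lcp_intervals` and `parent_depth` are hypothetical, and the lcp array of the running example is written out by hand.

```python
def lcp_intervals(lcp):
    """Return (depth, L, R) for every internal node of the simulated suffix tree.

    depth is the label length from the root; L and R are the leftmost and
    rightmost leaf IDs (suffix array indices) below the node.
    """
    n = len(lcp)
    stack = [(0, 0)]  # (depth, left boundary); the root is opened first
    found = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # -1 closes everything at the end
        left = i - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()
            found.append((depth, left, i - 1))
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))
    return found

def parent_depth(lcp, L, R):
    """Label length of the parent of the node with interval [L, R]."""
    right = lcp[R + 1] if R + 1 < len(lcp) else 0
    return max(lcp[L], right)

# lcp array of ababbbabbc (suffix array [0, 2, 6, 1, 5, 4, 3, 7, 8, 9]).
lcp = [0, 2, 3, 0, 4, 1, 2, 2, 1, 0]
for depth, L, R in lcp_intervals(lcp):
    print(depth, L, R, parent_depth(lcp, L, R))
```

The interval (4, 3, 4) is the node labelled babb; its parent has label length 1.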

18
Suffix array has less information
[Figure: suffix tree of ababbbabbc with leaf IDs; for the highlighted node, the length of the parent label from the root is 1 and the label length from the root is 4.]
19
Suffix array algorithm
  • foreach v in suffix tree (simulated by suffix
    array)
      • if (node v is representative of EC v)   (difficulty 1)
          • follow suffix link
          • while (node is in EC v)              (difficulty 2)
              • follow suffix link
          • compute size and minimal strings     (difficulty 3)
          • output succinct representation of EC v

These steps are difficult because the suffix array
has less information.
20
Solving difficulty 1 (representative judge)
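[Figure: node v with suffix array interval [L, R], where L = rank(l) and R = rank(r) for text positions l and r. The positions l − 1 and r − 1 are mapped back to the suffix array indices rank(l − 1) and rank(r − 1).]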
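
One way to phrase the representative test with these quantities is sketched below. It is a reconstruction, not necessarily the exact test on the slide: the function name `is_representative` is hypothetical, indices are 0-based, and in addition to the rank comparison it compares the two characters w[l − 1] and w[r − 1], since the rank difference alone can coincide by accident. A node is left-extensible (hence not a representative) exactly when every occurrence is preceded by the same character.

```python
def is_representative(w, sa, rank, L, R):
    """True if the node with suffix array interval [L, R] cannot be extended to the left."""
    l, r = sa[L], sa[R]
    if l == 0 or r == 0:
        return True  # an occurrence at the start of w has no preceding character
    # If both ends are preceded by the same character a, the suffixes between
    # a+suffix(l) and a+suffix(r) are exactly the occurrences preceded by a.
    left_extensible = w[l - 1] == w[r - 1] and rank[r - 1] - rank[l - 1] == R - L
    return not left_extensible

w = "ababbbabbc"
sa = sorted(range(len(w)), key=lambda i: w[i:])
rank = [0] * len(w)
for i, p in enumerate(sa):
    rank[p] = i

print(is_representative(w, sa, rank, 3, 4))  # node babb
print(is_representative(w, sa, rank, 1, 2))  # node abb
```

Here abb is rejected because both of its occurrences are preceded by b, so its maximal extension is babb.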
21
Solving difficulty 2 (equivalence relation judge)
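[Figure: node v with label ax from the root and suffix array interval [L, R], where L = rank(l) and R = rank(r). The node with label x, reached by the suffix link, is located through rank(l + 1) and rank(r + 1).]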
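
A sketch of how the suffix-link step and the equivalence test might be simulated with rank and lcp only. This is an assumption-laden reconstruction rather than the slide's exact procedure: the name `suffix_link_in_same_class` is hypothetical, indices are 0-based, lcp[i] compares the (i−1)-th and i-th sorted suffixes, and the arrays are built naively for self-containment. The node x is in the same EC as ax when it occurs exactly as often, that is, when the shifted interval has the same width and no neighbouring suffix also starts with x.

```python
import os

def build_arrays(w):
    """Naive suffix array, rank array and lcp array (lcp[0] = 0)."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] + [len(os.path.commonprefix([w[sa[i - 1]:], w[sa[i]:]]))
                 for i in range(1, n)]
    return sa, rank, lcp

def suffix_link_in_same_class(sa, rank, lcp, L, R, depth):
    """Interval of x if x (label ax minus its first character) is in the EC of ax, else None."""
    if depth < 2:
        return None  # the suffix link would lead to the root
    n = len(sa)
    L2, R2 = rank[sa[L] + 1], rank[sa[R] + 1]
    same_width = (R2 - L2 == R - L)
    closed_left = L2 == 0 or lcp[L2] < depth - 1
    closed_right = R2 == n - 1 or lcp[R2 + 1] < depth - 1
    return (L2, R2) if same_width and closed_left and closed_right else None

sa, rank, lcp = build_arrays("ababbbabbc")
print(suffix_link_in_same_class(sa, rank, lcp, 3, 4, 4))  # babb -> abb
print(suffix_link_in_same_class(sa, rank, lcp, 1, 2, 3))  # abb -> bb
```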
22
Solving difficulty 3 (size computation)
[Figure: three cases (case 1, case 2, case 3) for an interval [L, R] in the suffix array. The label length of the parent is read from lcp(L) and lcp(R + 1), and the size is the sum of the resulting edge label lengths.]
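
Putting the three steps together, the whole loop can be sketched on top of the suffix, rank and lcp arrays. This is a reconstruction under assumptions, not the authors' implementation: all function names are hypothetical, w is assumed to end with a character that occurs nowhere else, the arrays are built naively (so the sketch is not linear time), minimal strings are omitted, and the trivial EC formed by the leaves is ignored. The parent's label length is taken as max(lcp[L], lcp[R + 1]).

```python
import os

def build_arrays(w):
    """Naive suffix array, rank array and lcp array (lcp[0] = 0)."""
    n = len(w)
    sa = sorted(range(n), key=lambda i: w[i:])
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] + [len(os.path.commonprefix([w[sa[i - 1]:], w[sa[i]:]]))
                 for i in range(1, n)]
    return sa, rank, lcp

def lcp_intervals(lcp):
    """(depth, L, R) for every internal node, via the usual stack scan."""
    n, stack, found = len(lcp), [(0, 0)], []
    for i in range(1, n + 1):
        cur, left = (lcp[i] if i < n else -1), i - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()
            found.append((depth, left, i - 1))
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))
    return found

def equivalence_classes(w):
    """Return (representative, frequency, size) for every non-leaf EC of w."""
    sa, rank, lcp = build_arrays(w)
    n = len(w)

    def parent_depth(L, R):
        return max(lcp[L], lcp[R + 1] if R + 1 < n else 0)

    classes = []
    for depth, L, R in lcp_intervals(lcp):
        if depth == 0:
            continue  # root
        l, r = sa[L], sa[R]
        # Difficulty 1: skip nodes that can be extended to the left.
        if l > 0 and r > 0 and w[l - 1] == w[r - 1] and rank[r - 1] - rank[l - 1] == R - L:
            continue
        size = depth - parent_depth(L, R)
        d, lo, hi = depth, L, R
        while d > 1:
            # Difficulty 2: simulated suffix link and same-EC test.
            lo2, hi2 = rank[sa[lo] + 1], rank[sa[hi] + 1]
            if hi2 - lo2 != hi - lo:
                break
            if (lo2 > 0 and lcp[lo2] >= d - 1) or (hi2 + 1 < n and lcp[hi2 + 1] >= d - 1):
                break
            d, lo, hi = d - 1, lo2, hi2
            # Difficulty 3: add the edge-label length of this node.
            size += d - parent_depth(lo, hi)
        classes.append((w[l:l + depth], R - L + 1, size))
    return classes

print(sorted(equivalence_classes("ababbbabbc")))
```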
23
Computational experiment
  • Comparison of algorithms
  • suffix tree
  • CDAWG
  • suffix array
  • Data
  • two English and two Genome corpora
  • Canterbury corpus, Protein corpus
  • Machine spec.
  • Red Hat Linux
  • CPU 2.8GHz, 1 GB memory

24
Experimental result
25
Application: spam detection
The equivalence classes formed by spam messages are
larger than those formed by non-spam messages.
  • This is Japanese sushi made with Spam,
  • but this Spam is not related to this study.

26
Application: spam detection
  • Unsupervised Spam Detection based on String
    Alienness Measures
  • by Kazuyuki Narisawa, Hideo Bannai, Kohei
    Hatano and Masayuki Takeda

If you are interested in our study and want to
come to the conference, search not for "DS 07"
but for "Discovery Science 2007".
27
Summary
  • Presented an algorithm for computing the
    equivalence classes with a suffix array
  • simulating traversal on the suffix tree and its
    suffix links
  • using only lcp and rank arrays
  • running in linear time and space
  • Compared with other data structures
  • less memory
  • faster computation
  • Can be applied to spam detection [DS 07]

28
Thank You
30
Compute size of the EC
Size of the EC = sum of the lengths of the labels
from the parent to each node in the EC
1 + 3 = 4
31
Compute minimal strings of the EC
[Figure: nodes labelled x, y and z with associated strings x1, x2, ..., xk, y1, ..., yk and z1, z2, ..., zm.]
32
suffix tree
  • each node has
  • parent
  • leftmost child
  • right sibling
  • suffix link
  • label of the incoming edge
