# String Data Structures and Algorithms: Suffix Trees and Suffix Arrays


BBSI Summer School - Iowa State University, 4/4/09.


1
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
• David Fernández-Baca
• UNAM (Mexico)
• (based on notes by Srinivas Aluru)
• slightly modified by Benny Chor

2
Why Strings?
• Biological sequences can be viewed as strings,
or finite series of characters, over an alphabet Σ.
• There is a wealth of algorithmic theory developed
for general strings that we can apply to specific
biological problems.

3
Look-up Tables
• Strings of length k over Σ can be represented by an integer index i, 0 ≤ i ≤ |Σ|^k − 1.
• DNA is composed of four characters.
• Σ = {A, G, C, T}, |Σ| = 4
• We can preprocess a database into a lookup table
to locate all occurrences of a query index.

4
Example
Let A = 0 (00), C = 1 (01), G = 2 (10), T = 3 (11).
Strings are converted based on the binary string they represent:

| String | Binary | Integer |
|--------|--------|---------|
| AAA    | 000000 | 0       |
| ATA    | 001100 | 12      |
| AAC    | 000001 | 1       |

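The encoding above can be sketched in Python as follows. The names `CODE`, `kmer_index` and `build_lookup` are illustrative assumptions, not part of the slides, and a dictionary stands in for a table of size |Σ|^k so that only the k-mers actually present are stored.

```python
# Two-bit code per nucleotide, as in the example: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_index(kmer):
    """Map a DNA string of length k to an integer in [0, 4**k - 1]."""
    idx = 0
    for ch in kmer:
        idx = idx * 4 + CODE[ch]  # shift by two bits, append the next code
    return idx

def build_lookup(database, k):
    """Map each k-mer index to the 1-based positions where it occurs."""
    table = {}
    for pos in range(len(database) - k + 1):
        key = kmer_index(database[pos:pos + k])
        table.setdefault(key, []).append(pos + 1)
    return table

table = build_lookup("ATATA", 3)
print(kmer_index("ATA"), table[kmer_index("ATA")])
```

A query of length k is answered with one dictionary access; longer or shorter queries are not supported.
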
5
Search using an Index
[Figure: a query is looked up in an index of size |Σ|^k.]
6
Applications of Indexing
• Seeds for searching sequence databases
• BLAST
• Pair generation for fragment assembly in
sequencing projects
• CAP3 sequence assembly program

7
Indexing
• Using a sparse representation, a database can be
preprocessed in linear time to allow locating all
instances of a short string.
• Major limitation: search is restricted to fixed-length strings.

8
Suffix Tree
M A L A Y A L A M $
1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of MALAYALAM$; each leaf is labeled with the starting position (1-10) of its suffix.]
9
Suffix Trees
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10
Paths from root to leaves represent all suffixes of S.

[Figure: the suffix tree of S, as on the previous slide.]
10
Suffix tree properties
• For a string S of length n (followed by the terminator $), there are n + 1 leaves and at most n internal nodes.
• The tree therefore requires only linear space, provided each edge label takes O(1) space.
• Each leaf represents a unique suffix.
• Concatenation of edge labels from root to a leaf
spells out the suffix.
• Each internal node represents a distinct common
prefix to at least two suffixes.

11
Edge Encoding
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: the same suffix tree with each edge label replaced by a pair (i, j) of start and end positions in S, e.g. (5, 10) for YALAM$, (3, 4) for LA and (10, 10) for $; leaves keep their suffix numbers.]
12
Naïve Suffix Tree Construction
1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $
Before starting: why exactly do we need this $, which is not part of the alphabet?
13
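Naïve Suffix Tree Construction
1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $

[Figure: the partial tree after inserting suffixes 1 to 4. MALAYALAM$ (leaf 1) and LAYALAM$ (leaf 3) hang off the root, and an internal node A has children LAYALAM$ (leaf 2) and YALAM$ (leaf 4). The remaining suffixes are inserted in the same way.]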
14
Application: Finding a Short Pattern in a Long String
• Build a suffix tree of the string.
• Starting from the root, traverse a path matching
characters of the pattern.
• If stuck, pattern not present in string.
• Otherwise, each leaf below gives a position of
the pattern in the string.

15
Finding a Pattern in a String
Find ALA
[Figure: the suffix tree of MALAYALAM$ with the path spelling ALA marked; the leaves below it are 6 and 2.]
Two matches: at positions 6 and 2.
16
Finding Common Substrings
• Construct a generalized suffix tree for two
strings (each suffix of each string is
represented).
• Label each leaf with the suffix number and string
label.
• Each internal node with a leaf from both strings
in its subtree gives a common substring.

17
Generalized Suffix Tree
W I N D O W $      I N D I G O $
1 2 3 4 5 6 7      1 2 3 4 5 6 7

[Figure: generalized suffix tree of WINDOW$ and INDIGO$; each leaf is labeled (string number, suffix position), from (1, 1) to (1, 7) and from (2, 1) to (2, 7).]
18
Lowest Common Ancestors
• The lowest common ancestor (lca) of two nodes x
and y in a rooted tree is the deepest node
(farthest away from root) that is an ancestor of
both x and y
• Concatenation of edge labels from root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes
• lca(x, y) can be found in constant time after linear preprocessing [Bender00]

19
A Useful Property
String depth(lca(i, j)) = |lcp(suffix i, suffix j)|
(lcp = longest common prefix)

[Figure: in the suffix tree of MALAYALAM$, the lca of leaves 6 and 2 is reached by the edges A and LA, so its string depth is 3.]
20
Longest Common Extension
R A I L W A Y $      G R A I N Y $
1 2 3 4 5 6 7 8      1 2 3 4 5 6 7

lce(1, 1) = 0
lce(2, 1) = 3 (suffix 2 of GRAINY and suffix 1 of RAILWAY share RAI)
We'll soon find lces useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.
21
lces and lcas
To compute lces for two strings S1 and S2
1. Build generalized suffix tree, T, of S1 and S2
2. Compute string depth for each node in T
3. Preprocess T for lca queries
4. lce(i, j) = string depth of the lca of suffix i of S1 and suffix j of S2
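
As a rough illustration of these steps, the sketch below swaps the generalized suffix tree for a sorted list of the suffixes of S1#S2$ plus the lcp values of neighbouring suffixes; the string depth of the lca of two leaves then equals the minimum lcp value between them in sorted order. This is a different, simpler technique than the one on the slide: the minimum is taken by scanning, so a query is not constant time, and the separator `#` and the name `make_lce` are assumptions of this sketch.

```python
def make_lce(s1, s2):
    """Return a function lce(i, j) for 1-based suffix i of s1 and j of s2."""
    text = s1 + "#" + s2 + "$"  # '#' and '$' must not occur in s1 or s2
    order = sorted(range(len(text)), key=lambda p: text[p:])
    rank = {pos: k for k, pos in enumerate(order)}

    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    # lcps[k] = common prefix length of the k-th and (k+1)-th sorted suffixes
    lcps = [common_prefix(order[k], order[k + 1]) for k in range(len(order) - 1)]

    def lce(i, j):
        a = rank[i - 1]        # suffix i of s1
        b = rank[len(s1) + j]  # suffix j of s2, shifted past s1 and '#'
        lo, hi = min(a, b), max(a, b)
        return min(lcps[lo:hi])  # plays the role of string depth of the lca

    return lce

lce = make_lce("GRAINY", "RAILWAY")
print(lce(1, 1), lce(2, 1))
```

Replacing the linear scan with a range-minimum structure would bring each query down to constant time, which is what the lca preprocessing achieves on the tree.
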

22
Example
W I N D O W $      I N D I G O $
1 2 3 4 5 6 7      1 2 3 4 5 6 7

[Figure: the generalized suffix tree of WINDOW$ and INDIGO$ shown earlier, with leaves labeled (string number, suffix position).]
23
lces, revisited
Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal.
• What is the meaning of this task?
• How do we accomplish it efficiently?
• Notice that computing the values lce(i, j) for all j is very inefficient!

24
Palindromes
• A palindrome is a string that reads the same in
both directions
• E.g., CATGTAC
• red rum, sir, is murder
• Palindrome problem: find all maximal palindromes in a string S

25
Finding Palindromes in S
• Construct the reverse S^r of S
• Build a generalized suffix tree T of S and S^r
• Preprocess T for lce queries
• Now what?
• Left as homework
• Requirement: linear time (constant time per query)

[Figure: the string S with a position q marked.]
26
Palindromes in DNA sequences
• We sometimes need to deal with Crick-Watson complemented palindromes (A ↔ T, C ↔ G)
• E.g., ATCATGAT is a complemented palindrome
• All complemented palindromes in S can be found using a GST of S and the reverse complement of S
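
To make the definition concrete, here is a small check that a string equals its own reverse complement. It only tests a single string and is not the GST-based method for finding all complemented palindromes; the function names are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s):
    """Complement every base and read the result backwards."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))

def is_complemented_palindrome(s):
    """True if s reads the same as its reverse complement."""
    return s == reverse_complement(s)

print(is_complemented_palindrome("ATCATGAT"))
```
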

27
Suffix Array: Reducing Space
M A L A Y A L A M $
1 2 3 4 5 6 7 8 9 10

Suffixes in lexicographic order (here $ sorts after the letters):
6 ALAM$
2 ALAYALAM$
8 AM$
4 AYALAM$
7 LAM$
3 LAYALAM$
1 MALAYALAM$
9 M$
5 YALAM$
10 $

Suffix array (lexicographic ordering of suffixes): 6 2 8 4 7 3 1 9 5 10
Derived longest common prefix (lcp) array: 3 1 1 0 2 0 1 0 0 -
Suffixes 6 and 2 share ALA; suffixes 2 and 8 share just A. The lcp is recorded for successive pairs.
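
The two arrays can be reproduced with a direct, non-optimized sketch: sort the suffixes and compare neighbours character by character. To match the ordering on this slide, `$` is mapped to a character that sorts after the letters; that mapping and the function names are assumptions of this sketch.

```python
def suffix_array(text):
    """1-based suffix array by plain sorting (not linear time)."""
    # Treat '$' as larger than every letter, as on the slide.
    key = lambda i: text[i:].replace("$", "\x7f")
    return [i + 1 for i in sorted(range(len(text)), key=key)]

def lcp_array(text, sa):
    """lcp[k] = common prefix length of the suffixes sa[k] and sa[k+1]."""
    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [common_prefix(sa[k] - 1, sa[k + 1] - 1) for k in range(len(sa) - 1)]

sa = suffix_array("MALAYALAM$")
print(sa)
print(lcp_array("MALAYALAM$", sa))
```
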
28
Pattern Search in Suffix Array
• All suffixes that share a common prefix appear in
consecutive positions in the array.
• Pattern P can be located in the string using a
binary search on the suffix array.
• Naïve run-time: O(|P| · log n).
• Improved to O(|P| + log n) [ManberMyers93], and to O(|P|) [Abouelhoda et al. 02].
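
The naïve O(|P| · log n) search can be sketched as two binary searches that bracket the block of suffixes starting with P. The suffix array is built by plain sorting for brevity, `$` is again mapped to sort after the letters, and the names `build_sa` and `find_occurrences` are assumptions of this sketch.

```python
HIGH = "\x7f"  # stands in for '$' so that it sorts after the letters

def build_sa(text):
    """0-based suffix array by plain sorting."""
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", HIGH))

def find_occurrences(text, sa, pattern):
    """Sorted 1-based positions of pattern, via binary search on sa."""
    m = len(pattern)

    def prefix(k):  # first m characters of the k-th smallest suffix
        return text[sa[k]:sa[k] + m].replace("$", HIGH)

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[k] + 1 for k in range(first, lo))

text = "MALAYALAM$"
print(find_occurrences(text, build_sa(text), "ALA"))
```

Each comparison costs up to |P| character comparisons and there are O(log n) of them, which gives the naïve bound.
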

29
Computing Longest Common Prefix Values
• Find where suffix 1 is in the suffix array.
• Compute the lcp value of suffix 1.
• Find suffix 2 in the suffix array.
• Compute the lcp value of suffix 2.
• Repeat for all suffixes, in text order.
• Run-time is amortized O(1) per suffix (why?), so linear overall.
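
A sketch of this procedure, in the style of the linear-time algorithm of Kasai et al., is given below. The key observation is that when moving from suffix i to suffix i + 1, the lcp value can drop by at most one, so the matched length `h` is carried over instead of restarting from zero. The suffix array is built by plain sorting only to keep the example self-contained; the function names and the `$` mapping are assumptions.

```python
def build_sa(text):
    """0-based suffix array by plain sorting; '$' sorts after the letters."""
    return sorted(range(len(text)), key=lambda i: text[i:].replace("$", "\x7f"))

def lcp_in_text_order(text, sa):
    """lcp[k] = common prefix length of the suffixes sa[k] and sa[k+1]."""
    n = len(text)
    rank = [0] * n
    for k, pos in enumerate(sa):
        rank[pos] = k
    lcp = [0] * (n - 1)
    h = 0  # matched length carried over from the previous suffix
    for i in range(n):  # visit suffixes in text order
        if rank[i] == n - 1:  # last in sorted order: no successor
            h = 0
            continue
        j = sa[rank[i] + 1]  # successor of suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp

text = "MALAYALAM$"
print(lcp_in_text_order(text, build_sa(text)))
```

Since `h` decreases by at most one per suffix and never exceeds n, the inner loop performs O(n) character comparisons in total.
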

30
Example
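| Suffix array (text position) | Suffix     | lcp with next suffix |
|------------------------------|------------|----------------------|
| 6                            | ALAM$      | 3                    |
| 2                            | ALAYALAM$  | 1                    |
| 8                            | AM$        | 1                    |
| 4                            | AYALAM$    | 0                    |
| 7                            | LAM$       | 2                    |
| 3                            | LAYALAM$   | 0                    |
| 1                            | MALAYALAM$ | 1                    |
| 9                            | M$         | 0                    |
| 5                            | YALAM$     | 0                    |
| 10                           | $          | -                    |
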
31
Suffix Trees vs. Suffix Arrays
• Suffix Array = lexicographic order of the leaves of the Suffix Tree
• Suffix Tree ⇔ Suffix Array + lcp Array
• (why? Wait for next slide)

32
Building a ST from a SA and a lcp
SA:  6 2 8 4 7 3 1 9 5 10
lcp: 3 1 1 0 2 0 1 0 0 -

[Figure: the suffix tree being rebuilt from the sorted suffixes (6 ALAM$, 2 ALAYALAM$, 8 AM$, ...), with string depths D = 0, 1, 2 and 3 marked on its nodes.]
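
One way to see why the two arrays determine the tree is to recover its internal nodes with a single stack-based pass over the lcp array. The sketch below reports, for each internal node, its string depth and the leaves beneath it, and does not build edge labels or child pointers. The function name and the output format are assumptions of this sketch.

```python
def internal_nodes(sa, lcp):
    """Return (string depth, leaves below) for the root and internal nodes.

    sa holds 1-based suffix positions in sorted order; lcp[k] is the common
    prefix length of the suffixes sa[k] and sa[k+1].
    """
    nodes = []
    stack = [(0, 0)]  # (string depth, index in sa of the leftmost leaf)
    n = len(sa)
    for k in range(1, n + 1):
        cur = lcp[k - 1] if k < n else -1  # -1 flushes the stack at the end
        left = k - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()  # this node's subtree ends at leaf k-1
            nodes.append((depth, sa[left:k]))
        if cur >= 0 and stack[-1][0] < cur:
            stack.append((cur, left))  # a deeper node opens at leaf 'left'
    return nodes

sa = [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
lcp = [3, 1, 1, 0, 2, 0, 1, 0, 0]
for depth, leaves in internal_nodes(sa, lcp):
    print(depth, leaves)
```

For MALAYALAM$ this yields nodes of string depth 3 (ALA), 1 (A), 2 (LA) and 1 (M), followed by the root, matching the internal nodes of the suffix tree shown earlier.
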
33
Known (amazing) Results
• Suffix tree can be constructed in O(n) time and O(n ? ?) space [Weiner73, McCreight76, Ukkonen92].
• Suffix arrays can be constructed without using suffix trees in O(n) time [KoAluru03].

34
More Applications
• Suffix-prefix overlaps in fragment assembly
• Maximal and tandem repeats
• Shortest unique substrings
• Maximal unique matches (MUMmer)
• Approximate matching
• Phylogenies based on complete genomes

35
Dealing with errors
• The basic string data structures can only extract
information in the absence of errors.
• To deal with errors, decompose into parts that do
not involve errors.

36
The k-mismatch problem
• Given a pattern P, a text T, and a number k, find
all occurrences of P in T with at most k
mismatches
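• Example: P = bend, T = abentbananaend, k = 2

Match 1: bent (positions 2-5 of T, 1 mismatch)
Match 2: bana (positions 6-9 of T, 2 mismatches)
Match 3: aend (positions 11-14 of T, 1 mismatch)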
37
Solution
1. Build GST of P and T and preprocess it for lce
queries
2. For each starting index i in T, do at most k lce
queries to determine if there is a k-mismatch
beginning at i

[Figure: pattern P aligned under text T at starting index i.]
Time: O(k |T|)
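
The jumping scheme can be sketched as follows. For brevity, `lce` here compares characters directly, so each query is not constant time; with a GST preprocessed for lca queries it would be, which gives the O(k |T|) bound. The function names are assumptions of this sketch.

```python
def lce(pattern, text, i, j):
    """Length of the longest common extension of pattern[i:] and text[j:]."""
    n = 0
    while i + n < len(pattern) and j + n < len(text) and pattern[i + n] == text[j + n]:
        n += 1
    return n

def k_mismatch(pattern, text, k):
    """1-based start positions where pattern matches with <= k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        pos, mismatches = 0, 0
        while True:
            pos += lce(pattern, text, pos, start + pos)  # jump over a match
            if pos >= m:
                hits.append(start + 1)
                break
            mismatches += 1  # pattern[pos] differs from text[start + pos]
            if mismatches > k:
                break
            pos += 1  # step over the mismatch and continue
    return hits

print(k_mismatch("bend", "abentbananaend", 2))
```

Each starting index uses at most k + 1 lce queries, since every query after the first follows one counted mismatch.
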
38
References
• M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The
enhanced suffix array and its applications to
genome analysis, 2nd Workshop on Algorithms in
Bioinformatics, pp. 449-463, 2002.
• M. A. Bender and M. Farach-Colton, The LCA
Problem Revisited, LATIN, pages 88-94, 2000.
• P. Ko and S. Aluru, Linear time suffix sorting,
CPM, pages 200-210, 2003.
• U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22:935-948, 1993.
• E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2):262-272, 1976.
• E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, pp. 484-492, 1992. Also in Algorithmica, 14(3):249-260, 1995.
• P. Weiner, Linear pattern matching algorithms,
Proc. of the 14th IEEE Annual Symp. on Switching
and Automata Theory, pp. 1-11, 1973.