String Data Structures and Algorithms: Suffix Trees and Suffix Arrays

- David Fernández-Baca
- UNAM (Mexico)
- (based on notes by Srinivas Aluru)
- slightly modified by Benny Chor

Why Strings?

- Biological sequences can be viewed as strings, or finite series of characters, over an alphabet Σ.
- There is a wealth of algorithmic theory developed for general strings that we can apply to specific biological problems.

Look-up Tables

- Strings of length k over Σ can be represented by an integer index i, 0 ≤ i ≤ |Σ|^k − 1.
- DNA is composed of four characters.
- Σ = {A, G, C, T}, |Σ| = 4
- We can preprocess a database into a lookup table to locate all occurrences of a query index.

Example

Let A = 0 (00), C = 1 (01), G = 2 (10), T = 3 (11). Strings are converted based on the binary string they represent.

String  Binary  Integer
AAA     000000  0
ATA     001100  12
AAC     000001  1
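The conversion in the example can be sketched in a few lines of Python. This is only an illustration of the two-bit encoding; the names `CODE`, `kmer_to_index`, and `index_to_kmer` are hypothetical and not part of the original slides.

```python
# Two-bit codes taken from the example: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_index(kmer: str) -> int:
    """Read the k-mer as a base-4 number, two bits per character."""
    index = 0
    for ch in kmer:
        index = (index << 2) | CODE[ch]
    return index

def index_to_kmer(index: int, k: int) -> str:
    """Inverse mapping: recover the length-k string from its index."""
    letters = "ACGT"
    chars = []
    for _ in range(k):
        chars.append(letters[index & 3])   # lowest two bits = last character
        index >>= 2
    return "".join(reversed(chars))
```

For instance, `kmer_to_index("ATA")` reproduces the value 12 from the table, and `index_to_kmer(12, 3)` maps it back to `ATA`.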

Search using an Index

[Figure: the query is converted to its integer index into a table of size |Σ|^k; each table entry holds a linked list of occurrences.]

Applications of Indexing

- Seeds for searching sequence databases
- BLAST
- Pair generation for fragment assembly in sequencing projects
- CAP3 sequence assembly program

Indexing

- Using a sparse representation, a database can be preprocessed in linear time to allow locating all instances of a short string.
- Major limitation: search is restricted to fixed-length strings.
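One possible sparse representation is a hash map that stores only the k-mers that actually occur, rather than a full table of size |Σ|^k. The sketch below assumes this choice; `build_kmer_index` and `lookup` are illustrative names, and plain Python lists stand in for the linked lists of occurrences.

```python
from collections import defaultdict

def build_kmer_index(text: str, k: int) -> dict:
    """Map each length-k substring of text to its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(text) - k + 1):
        index[text[start:start + k]].append(start + 1)
    return index

def lookup(index: dict, query: str) -> list:
    """Return all positions of the query, or an empty list if it is absent."""
    return index.get(query, [])
```

For a fixed k, building the index takes one pass over the text. The limitation noted above is visible in the code: a query whose length differs from k simply has no entry.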

Suffix Tree

S = MALAYALAM$ (positions 1 to 10)

[Figure: the suffix tree of MALAYALAM$, with edges labeled by substrings and each leaf labeled by the starting position of its suffix.]

Suffix Trees

S = MALAYALAM$ (positions 1 to 10)

Paths from root to leaves represent all suffixes of S.

[Figure: the suffix tree of S; for example, the path A, LA, M$ ends at leaf 6 and spells the suffix ALAM$.]

Suffix tree properties

- For a string S of length n, there are n + 1 leaves (one per suffix, including the terminator alone) and at most n internal nodes.
- The tree therefore requires only linear space, provided edge labels are O(1) space.
- Each leaf represents a unique suffix.
- Concatenation of edge labels from root to a leaf spells out the suffix.
- Each internal node represents a distinct common prefix to at least two suffixes.

Edge Encoding

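S = MALAYALAM$ (positions 1 to 10)

[Figure: the suffix tree of S with every edge label replaced by a pair (i, j) of positions denoting the substring S[i..j]; for example, (5, 10) encodes YALAM$, (3, 4) encodes LA, and (10, 10) encodes $. Leaves are labeled with suffix numbers as before.]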
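To make the encoding concrete, here is a tiny Python sketch that turns a position pair back into its edge label. The helper name `edge_label` is hypothetical; positions are 1-based and inclusive, as on the slide.

```python
S = "MALAYALAM$"

def edge_label(s: str, start: int, end: int) -> str:
    """Return s[start..end] for 1-based, inclusive positions."""
    return s[start - 1:end]
```

Each pair occupies two integers regardless of how long the label is, which is what keeps the whole tree within linear space.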

Naïve Suffix Tree Construction

1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $

Before starting: why exactly do we need this $, which is not part of the alphabet?

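Naïve Suffix Tree Construction

[Figure: the suffixes are inserted into the tree one at a time, in order. After suffixes 1 to 4, the root has edges MALAYALAM$ (leaf 1), LAYALAM$ (leaf 3), and A, whose node has edges LAYALAM$ (leaf 2) and YALAM$ (leaf 4). The remaining suffixes are inserted in the same way.]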
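The insertion procedure can be sketched as follows. This is a minimal quadratic-time illustration, assuming the text ends with a terminator that occurs nowhere else; the class `Node` and the functions `build_suffix_tree` and `leaves` are hypothetical names. Edge labels are stored as index pairs into the text (0-based, end exclusive) rather than as strings.

```python
class Node:
    def __init__(self, start=0, end=0, suffix=None):
        self.start = start      # incoming edge label is text[start:end]
        self.end = end
        self.children = {}      # first character of edge label -> child Node
        self.suffix = suffix    # 1-based suffix number for leaves, else None

def build_suffix_tree(text):
    """Insert the suffixes one at a time; O(n^2) time in the worst case."""
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: hang a new leaf here.
                node.children[text[pos]] = Node(pos, n, suffix=i + 1)
                break
            # Walk along the edge while the characters agree.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child    # edge fully matched; continue below it
                continue
            # Mismatch inside the edge: split the edge at position k.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n, suffix=i + 1)
            break
    return root

def leaves(node, text, prefix=""):
    """Yield (suffix number, spelled-out path label) for each leaf below node."""
    label = prefix + text[node.start:node.end]
    if node.suffix is not None:
        yield node.suffix, label
    for ch in sorted(node.children):
        yield from leaves(node.children[ch], text, label)
```

The unique terminator guarantees that no suffix is a prefix of another, so every insertion ends by creating a new leaf; without it, the inner loop could run off the end of the text.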

Application: Finding a Short Pattern in a Long String

- Build a suffix tree of the string.
- Starting from the root, traverse a path matching characters of the pattern.
- If stuck, pattern not present in string.
- Otherwise, each leaf below gives a position of the pattern in the string.

Finding a Pattern in a String

Find ALA.

[Figure: the suffix tree of MALAYALAM$ with the path spelling ALA highlighted; the leaves below it are 6 and 2.]

Two matches: at 6 and 2.
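The search can be sketched in Python as below. To keep the example short it uses an uncompressed suffix trie (one character per edge, quadratic space) instead of a true suffix tree; this is a simplification, and `build_suffix_trie` and `find_occurrences` are hypothetical names. The traversal logic is the same: follow the pattern from the root, then report every leaf below.

```python
def build_suffix_trie(text):
    """Nested dicts, one character per edge; key None holds the suffix number."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1              # 1-based start position of this suffix
    return root

def find_occurrences(trie, pattern):
    node = trie
    for ch in pattern:
        if ch not in node:
            return []                   # stuck: the pattern does not occur
        node = node[ch]
    # Every leaf below the reached node is one occurrence.
    positions, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)
```

With `trie = build_suffix_trie("MALAYALAM$")`, the call `find_occurrences(trie, "ALA")` reports positions 2 and 6, matching the figure.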

Finding Common Substrings

- Construct a generalized suffix tree for two strings (each suffix of each string is represented).
- Label each leaf with the suffix number and string label.
- Each internal node with a leaf from both strings in its subtree gives a common substring.

Generalized Suffix Tree

String 1: WINDOW$ (positions 1 to 7)
String 2: INDIGO$ (positions 1 to 7)

[Figure: the generalized suffix tree of WINDOW$ and INDIGO$; each leaf is labeled with a pair (string number, suffix number), from (1, 1) to (1, 7) and from (2, 1) to (2, 7).]
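The idea of marking nodes that have leaves from both strings below them can be sketched as follows. For brevity this uses an uncompressed generalized suffix trie, which takes quadratic space, rather than a compressed tree; the function name `common_substrings` is hypothetical. Each node records which strings have a suffix passing through it.

```python
def common_substrings(s1, s2):
    """Return the set of non-empty substrings shared by s1 and s2."""
    # Each trie node is [children dict, set of string ids passing through it].
    root = [{}, set()]
    for sid, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node[0].setdefault(ch, [{}, set()])
                node[1].add(sid)
    # A node reached by suffixes of both strings spells a common substring.
    found, stack = set(), [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node[0].items():
            if child[1] == {1, 2}:
                found.add(label + ch)
                stack.append((child, label + ch))
    return found
```

For WINDOW and INDIGO, the longest element of the result is IND, i.e. `max(common_substrings("WINDOW", "INDIGO"), key=len)`.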

Lowest Common Ancestors

- The lowest common ancestor (lca) of two nodes x and y in a rooted tree is the deepest node (farthest away from root) that is an ancestor of both x and y
- Concatenation of edge labels from root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes
- lca(x, y) can be found in constant time after linear preprocessing [Bender00]

A Useful Property

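String depth(lca(i, j)) = lcp(suffix i, suffix j)

lcp = longest common prefix

[Figure: the suffix tree of MALAYALAM$ with an lca node marked; the path from the root to it spells ALA, so its string depth is 3, which is the lcp of the suffixes at the leaves below it (6 and 2).]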

Longest Common Extension

RAILWAY$ (positions 1 to 8)
GRAINY$ (positions 1 to 7)

lce(1, 1) = 0
lce(2, 1) = 3 (suffix 2 of GRAINY and suffix 1 of RAILWAY share RAI)

We'll soon find lces useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.

lces and lcas

To compute lces for two strings S1 and S2:

- Build generalized suffix tree, T, of S1 and S2
- Compute string depth for each node in T
- Preprocess T for lca queries
- lce(i, j) = string depth of lca of suffix i of S1 and suffix j of S2
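The four steps can be sketched in Python under two simplifying assumptions, both of which trade efficiency for brevity: the tree is an uncompressed generalized suffix trie (so string depth equals node depth), and the lca is found by naive parent climbing instead of the constant-time method. The names `TrieNode`, `build_leaves`, `lca`, and `lce` are hypothetical, and the terminators `#` and `$` are assumed not to occur in either input.

```python
class TrieNode:
    def __init__(self, parent=None, depth=0):
        self.parent = parent
        self.depth = depth          # string depth: characters from the root
        self.children = {}

def build_leaves(s1, s2):
    """Insert every suffix of s1+'#' and s2+'$'; return {(string id, start): leaf}."""
    root, leaf = TrieNode(), {}
    for sid, s in ((1, s1 + "#"), (2, s2 + "$")):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                if ch not in node.children:
                    node.children[ch] = TrieNode(node, node.depth + 1)
                node = node.children[ch]
            leaf[(sid, i + 1)] = node
    return leaf

def lca(x, y):
    """Naive lca: lift the deeper node, then climb both until they meet."""
    while x.depth > y.depth:
        x = x.parent
    while y.depth > x.depth:
        y = y.parent
    while x is not y:
        x, y = x.parent, y.parent
    return x

def lce(leaf, i, j):
    """String depth of the lca of suffix i of S1 and suffix j of S2."""
    return lca(leaf[(1, i)], leaf[(2, j)]).depth
```

Because the two terminators differ, the lca of two leaves from different strings always sits exactly where their suffixes first disagree. Taking S1 = GRAINY and S2 = RAILWAY, `lce(leaf, 2, 1)` gives 3.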

Example

String 1: WINDOW$ (positions 1 to 7)
String 2: INDIGO$ (positions 1 to 7)

[Figure: the generalized suffix tree of WINDOW$ and INDIGO$, with leaves labeled (string number, suffix number), used as the example for lce queries.]

lces, revisited

Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal.

- What is the meaning of this task?
- How do we accomplish it efficiently?
- Notice that computing the values lce(i, j) for all j is very inefficient!

Palindromes

- A palindrome is a string that reads the same in both directions
- E.g., CATGTAC
- red rum, sir, is murder
- Palindrome problem: find all maximal palindromes in a string S

Finding Palindromes in S

- Construct the reverse S^R of S
- Build generalized suffix tree T of S and S^R
- Preprocess T for lce queries
- Now what?
- Left as homework
- Requirement: linear time (const. per query)

[Figure: the string S with a position q marked.]

Palindromes in DNA sequences

- We sometimes need to deal with Crick-Watson complemented palindromes (A ↔ T, C ↔ G)
- E.g., ATCATGAT is a complemented palindrome
- All complemented palindromes in S can be found using a GST of S and the complement of S
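The definition itself is easy to check in code. The sketch below only tests whether a given string is a complemented palindrome (it equals its own reverse complement); it does not implement the GST-based search for all of them. The names `COMPLEMENT`, `reverse_complement`, and `is_complemented_palindrome` are hypothetical.

```python
# Pairing A <-> T and C <-> G, as on the slide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s: str) -> str:
    """Complement every base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]

def is_complemented_palindrome(s: str) -> bool:
    return s == reverse_complement(s)
```

ATCATGAT passes this test, while the ordinary palindrome CATGTAC does not.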

Suffix Array: Reducing Space

Text:     M A L A Y A L A M $
Position: 1 2 3 4 5 6 7 8 9 10

Suffix array: lexicographic ordering of suffixes (here $ sorts after every letter). Derive the longest common prefix (lcp) array from it.

SA  Suffix      lcp
6   ALAM$       3
2   ALAYALAM$   1
8   AM$         1
4   AYALAM$     0
7   LAM$        2
3   LAYALAM$    0
1   MALAYALAM$  1
9   M$          0
5   YALAM$      0
10  $           -

Suffixes 6 and 2 share ALA; suffixes 2 and 8 share just A. Each lcp value is for a pair of successive suffixes in the array.
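Both arrays can be reproduced with a direct, non-optimized Python sketch. It sorts the suffixes explicitly, which is far from linear time, and it treats the terminator as larger than every other character to match the ordering shown on the slide; that ordering rule and the names `suffix_array` and `lcp_array` are assumptions of this example.

```python
def suffix_array(text, terminator="$"):
    """1-based suffix array by direct sorting; the terminator sorts last."""
    def key(i):
        # (1, "") ranks above any (0, ch), so the terminator beats all letters.
        return [(1, "") if ch == terminator else (0, ch) for ch in text[i:]]
    order = sorted(range(len(text)), key=key)
    return [i + 1 for i in order]

def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of the suffixes at ranks r and r+1."""
    def lcp(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [lcp(sa[r] - 1, sa[r + 1] - 1) for r in range(len(sa) - 1)]
```

Running both on MALAYALAM$ yields the suffix array 6 2 8 4 7 3 1 9 5 10 and the lcp values 3 1 1 0 2 0 1 0 0.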

Pattern Search in Suffix Array

- All suffixes that share a common prefix appear in consecutive positions in the array.
- Pattern P can be located in the string using a binary search on the suffix array.
- Naïve run-time: O(|P| · log n).
- Improved to O(|P| + log n) [ManberMyers93], and to O(|P|) [Abouelhoda et al. 02].
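The naïve binary search can be sketched as two bound searches that delimit the block of suffixes starting with P. This example takes the suffix array of MALAYALAM$ as given and, to stay consistent with an ordering in which $ sorts last, replaces the terminator by the largest Unicode character before comparing. That substitution, and the name `find_in_suffix_array`, are assumptions of the sketch; the pattern is assumed not to contain the terminator.

```python
HIGH = "\U0010ffff"   # stand-in that compares greater than any ordinary character

def find_in_suffix_array(text, sa, pattern, terminator="$"):
    """Return sorted 1-based positions of pattern; O(|P| log n) comparisons."""
    t = text.replace(terminator, HIGH)
    m = len(pattern)

    def prefix(r):                      # first m characters of the suffix at rank r
        start = sa[r] - 1
        return t[start:start + m]

    # Lower bound: first rank whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first rank whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

Each of the O(log n) steps compares at most |P| characters, which is where the naïve bound comes from.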

Computing longest common prefix Values

- Find where suffix 1 is in the suffix array.
- Compute the lcp value of suffix 1.
- Find suffix 2 in the suffix array.
- Compute the lcp value of suffix 2.
- Repeat for all suffixes.
- Run-time is O(1) per suffix (why?), so linear overall.
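A sketch of this text-order traversal is given below. The key observation it relies on is that if suffix i shares h characters with its successor in the array, then suffix i + 1 shares at least h − 1 characters with its own successor, so the matched length never has to restart from zero; the total work is therefore linear (constant amortized per suffix). The function name `lcp_linear` is hypothetical, and lcp[r] is taken between the suffixes at ranks r and r + 1 as in the earlier table.

```python
def lcp_linear(text, sa):
    """lcp[r] for ranks r and r+1, visiting suffixes in text order."""
    n = len(text)
    rank = [0] * (n + 1)
    for r, start in enumerate(sa):
        rank[start] = r                  # where each suffix sits in the array
    lcp = [0] * (n - 1)
    h = 0                                # matched length carried from the previous suffix
    for i in range(1, n + 1):            # suffix 1, suffix 2, ... in text order
        r = rank[i]
        if r == n - 1:                   # last in the array: nothing follows it
            h = 0
            continue
        j = sa[r + 1]                    # suffix that follows suffix i in the array
        while i + h <= n and j + h <= n and text[i + h - 1] == text[j + h - 1]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                       # the next suffix keeps at least h - 1 matches
    return lcp
```

The variable h increases by at most n in total and decreases by at most one per suffix, which bounds the number of character comparisons.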

Example

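Text:         M A L A Y A L A M $
Position:     1 2 3 4 5 6 7 8 9 10
Suffix array: 6 2 8 4 7 3 1 9 5 10
lcp array:    3 1 1 0 2 0 1 0 0 -

Suffixes in array order: 6 ALAM$, 2 ALAYALAM$, 8 AM$, 4 AYALAM$, 7 LAM$, 3 LAYALAM$, 1 MALAYALAM$, 9 M$, 5 YALAM$, 10 $.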

Suffix Trees vs. Suffix Arrays

- Suffix array = lexicographic order of the leaves of the suffix tree
- Suffix tree ⇔ suffix array + lcp array
- (why? Wait for next slide)

Building a ST from a SA and a lcp

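[Figure: the suffix tree is rebuilt by adding the leaves in suffix-array order (6, 2, 8, 4, 7, 3, ...); each lcp value gives the string depth D at which the next leaf branches off the previous path. The internal nodes shown have string depths D = 0 (root), D = 1 (A), D = 3 (ALA), and D = 2 (LA).]

SA:  6 2 8 4 7 3 1 9 5 10
lcp: 3 1 1 0 2 0 1 0 0 -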
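One way to carry out this reconstruction is to keep the rightmost root-to-leaf path on a stack and attach each new leaf at the string depth given by the preceding lcp value. The sketch below recovers only the tree shape and string depths (edge labels follow from the depths and the text); it assumes a unique terminator, and the names `Node`, `tree_from_sa_lcp`, and `internal_depths` are hypothetical.

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth        # string depth D of the node
        self.suffix = suffix      # suffix number for leaves, None otherwise
        self.children = []

def tree_from_sa_lcp(sa, lcp, n):
    """Rebuild the tree shape; n is the text length including the terminator."""
    root = Node(0)
    stack = [root]                            # rightmost path, root first
    for r, start in enumerate(sa):
        d = lcp[r - 1] if r > 0 else 0        # depth at which this leaf branches off
        last = None
        while stack[-1].depth > d:            # climb until string depth <= d
            last = stack.pop()
        if stack[-1].depth < d:
            # The branch point lies inside the edge to `last`: split that edge.
            mid = Node(d)
            stack[-1].children.pop()          # detach `last` from its parent
            stack[-1].children.append(mid)
            mid.children.append(last)         # and re-hang it below the new node
            stack.append(mid)
        leaf = Node(n - start + 1, suffix=start)
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root

def internal_depths(node):
    """String depths of the internal nodes, in preorder."""
    if node.suffix is not None:
        return []
    out = [node.depth]
    for child in node.children:
        out += internal_depths(child)
    return out
```

Every node is pushed and popped at most once, so the reconstruction runs in linear time. For the arrays above, the internal nodes come out with string depths 0, 1, 3, 2, 1, corresponding to the root, A, ALA, LA, and M.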

Known (amazing) Results

- Suffix tree can be constructed in O(n) time and O(n · |Σ|) space [Weiner73, McCreight76, Ukkonen92].
- Suffix arrays can be constructed without using suffix trees in O(n) time [KoAluru03].

More Applications

- Suffix-prefix overlaps in fragment assembly
- Maximal and tandem repeats
- Shortest unique substrings
- Maximal unique matches (MUMmer)
- Approximate matching
- Phylogenies based on complete genomes

Dealing with errors

- The basic string data structures can only extract information in the absence of errors.
- To deal with errors, decompose into parts that do not involve errors.

The k-mismatch problem

- Given a pattern P, a text T, and a number k, find all occurrences of P in T with at most k mismatches
- Example: P = bend, T = abentbananaend, k = 2

Match 1: bent
Match 2: bana
Match 3: aend

Solution

- Build GST of P and T and preprocess it for lce queries
- For each starting index i in T, do at most k lce queries to determine if there is a k-mismatch beginning at i

[Figure: P aligned against T at starting index i.]

Time: O(k |T|)
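The jumping scheme can be sketched as follows. To keep the example self-contained, the lce query is answered by direct character comparison (`lce_naive`), which is only a stand-in for the constant-time GST-based query and does not achieve the stated time bound; the names `lce_naive` and `k_mismatch` are hypothetical.

```python
def lce_naive(p, i, t, j):
    """Stand-in for a constant-time lce query: match p[i:] against t[j:] directly."""
    n = 0
    while i + n < len(p) and j + n < len(t) and p[i + n] == t[j + n]:
        n += 1
    return n

def k_mismatch(pattern, text, k):
    """Return 1-based start positions where pattern occurs with at most k mismatches."""
    m, hits = len(pattern), []
    for start in range(len(text) - m + 1):
        pos, mismatches = 0, 0            # pos = pattern characters consumed so far
        while True:
            # Jump over the next run of matching characters with one lce query.
            pos += lce_naive(pattern, pos, text, start + pos)
            if pos == m:
                hits.append(start + 1)
                break
            mismatches += 1               # the character at pos is a mismatch
            if mismatches > k:
                break
            pos += 1                      # step over it and continue
    return hits
```

On the slide's example, `k_mismatch("bend", "abentbananaend", 2)` returns the start positions 2, 6, and 11, i.e. the matches bent, bana, and aend.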

References

- M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, pp. 449-463, 2002.
- M. A. Bender and M. Farach-Colton, The LCA Problem Revisited, LATIN, pp. 88-94, 2000.
- P. Ko and S. Aluru, Linear time suffix sorting, CPM, pp. 200-210, 2003.
- U. Manber and G. Myers, Suffix arrays: a new method for on-line search, SIAM J. Comput., 22(5):935-948, 1993.
- E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2):262-272, 1976.
- E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, pp. 484-492, 1992. Also in Algorithmica, 14(3):249-260, 1995.
- P. Weiner, Linear pattern matching algorithms, Proc. of the 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1-11, 1973.