String Data Structures and Algorithms: Suffix Trees and Suffix Arrays

4/4/09. BBSI Summer School - Iowa State University.

Learn more at: http://www.cs.tau.ac.il

1
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
  • David Fernández-Baca
  • UNAM (Mexico)
  • (based on notes by Srinivas Aluru)
  • slightly modified by Benny Chor

2
Why Strings?
  • Biological sequences can be viewed as strings,
    or finite series of characters, over an alphabet
    Σ.
  • There is a wealth of algorithmic theory developed
    for general strings that we can apply to specific
    biological problems.

3
Look-up Tables
  • Strings of length k over Σ can be represented by
    an integer index i, 0 ≤ i ≤ |Σ|^k − 1.
  • DNA is composed of four characters.
  • Σ = {A, G, C, T}, |Σ| = 4
  • We can preprocess a database into a lookup table
    to locate all occurrences of a query index.

4
Example
Let A = 0 (00), C = 1 (01), G = 2 (10), T = 3 (11).
Strings are converted based on the binary string they represent:

String   Binary   Integer
AAA      000000   0
ATA      001100   12
AAC      000001   1
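
The conversion in the example above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two-bit codes from this slide; the names `CODE` and `encode` are hypothetical and not part of the original slides.

```python
# Two-bit codes from the slide: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Return the integer whose base-4 digits are the codes of kmer's characters."""
    index = 0
    for ch in kmer:
        index = index * 4 + CODE[ch]  # shift left by two bits, then append the code
    return index

print(encode("AAA"), encode("ATA"), encode("AAC"))  # 0 12 1
```

A string of length k always maps to an index between 0 and 4^k − 1, so the index can address a table directly.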
5
Search using an Index
[Figure: a query is converted to an index into a table of size |Σ|^k; each table entry points to a linked list of occurrences.]
6
Applications of Indexing
  • Seeds for searching sequence databases
  • BLAST
  • Pair generation for fragment assembly in
    sequencing projects
  • CAP3 sequence assembly program

7
Indexing
  • Using a sparse representation, a database can be
    preprocessed in linear time to allow locating all
    instances of a short string.
  • Major limitation: search is restricted to
    fixed-length strings.
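
As a rough sketch of such an index, the snippet below uses a Python dictionary as the sparse representation, so only the length-k strings that actually occur are stored. The function name `build_index` and the use of MALAYALAM as the database are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(text, k):
    """Map each length-k substring of text to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i + 1)
    return index

index = build_index("MALAYALAM", 3)
print(index["ALA"])           # [2, 6]
print(index.get("XYZ", []))   # []
```

The limitation named above shows up directly: a query of any length other than k cannot be answered from this table.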

8
Suffix Tree
M A L A Y A L A M $
1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of MALAYALAM$, with one leaf per suffix, labeled by the suffix's starting position (1 to 10).]
9
Suffix Trees
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

Paths from root to leaves represent all suffixes
of S

[Figure: suffix tree of MALAYALAM$, with one leaf per suffix, labeled by the suffix's starting position (1 to 10).]
10
Suffix tree properties
  • For a string S of length n, there are n + 1 leaves
    and at most n internal nodes.
  • The tree therefore requires only linear space,
    provided each edge label takes O(1) space.
  • Each leaf represents a unique suffix.
  • Concatenation of edge labels from root to a leaf
    spells out the suffix.
  • Each internal node represents a distinct common
    prefix to at least two suffixes.

11
Edge Encoding
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: the same suffix tree with each edge label replaced by a pair (i, j) of start and end positions in S, for example (5, 10) for YALAM$, (3, 4) for LA, and (10, 10) for $.]
12
Naïve Suffix Tree Construction
 1 MALAYALAM$
 2 ALAYALAM$
 3 LAYALAM$
 4 AYALAM$
 5 YALAM$
 6 ALAM$
 7 LAM$
 8 AM$
 9 M$
10 $
Before starting: why exactly do we need this $,
which is not part of the alphabet?
13
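Naïve Suffix Tree Construction
 1 MALAYALAM$
 2 ALAYALAM$
 3 LAYALAM$
 4 AYALAM$
 5 YALAM$
 6 ALAM$
 7 LAM$
 8 AM$
 9 M$
10 $
[Figure: the tree after inserting suffixes 1 to 4 one at a time. Suffixes 1 (MALAYALAM$) and 3 (LAYALAM$) hang directly off the root; suffixes 2 and 4 share an edge labeled A, which then branches into LAYALAM$ and YALAM$.]
etc.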
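
The insertion process can be sketched as follows. This is a minimal quadratic-time illustration, not one of the linear-time algorithms cited later; the names `Node`, `build_suffix_tree`, and `leaves` are hypothetical. It assumes the string ends with a unique terminator, and it stores each edge label as a pair of indices into the string (0-based and half-open here, where the Edge Encoding slide uses 1-based inclusive pairs).

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (start, end, child)
        self.leaf = None    # 1-based suffix number if this node is a leaf

def build_suffix_tree(s):
    """Insert the suffixes of s one at a time; s must end with a unique '$'."""
    root, n = Node(), len(s)
    for i in range(n):
        node, pos = root, i
        while True:
            ch = s[pos]
            if ch not in node.children:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = i + 1
                node.children[ch] = (pos, n, leaf)
                break
            start, end, child = node.children[ch]
            k = 0
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                # The whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and add a new leaf.
            mid = Node()
            node.children[ch] = (start, start + k, mid)
            mid.children[s[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.leaf = i + 1
            mid.children[s[pos + k]] = (pos + k, n, leaf)
            break
    return root

def leaves(node):
    """Collect the suffix numbers of all leaves below node."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, _, child in node.children.values():
        found.extend(leaves(child))
    return found

s = "MALAYALAM$"
root = build_suffix_tree(s)
print(sorted(leaves(root)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The terminator matters in this sketch: because $ occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating a new leaf.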
14
Application: Finding a Short Pattern in a Long String
  • Build a suffix tree of the string.
  • Starting from the root, traverse a path matching
    characters of the pattern.
  • If stuck, pattern not present in string.
  • Otherwise, each leaf below gives a position of
    the pattern in the string.
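
The search steps above can be sketched as below. To keep the example short it assumes an uncompressed suffix trie (one character per edge, quadratic space) built from nested dictionaries rather than a true suffix tree; the traversal logic is the same. The names `build_suffix_trie`, `leaves_below`, and `find` are hypothetical.

```python
def build_suffix_trie(s):
    """Nested-dict suffix trie; the key None stores the leaf's 1-based suffix number."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1
    return root

def leaves_below(node):
    """Collect the suffix numbers of all leaves in the subtree of node."""
    found = []
    for key, child in node.items():
        if key is None:
            found.append(child)
        else:
            found.extend(leaves_below(child))
    return found

def find(trie, pattern):
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern is not present
        node = node[ch]
    return sorted(leaves_below(node))

trie = build_suffix_trie("MALAYALAM$")
print(find(trie, "ALA"))  # [2, 6]
print(find(trie, "LAY"))  # [3]
print(find(trie, "MAY"))  # []
```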

15
Finding a Pattern in a String
Find ALA

[Figure: suffix tree of MALAYALAM$ with the path spelling ALA traced from the root; the leaves below that path are 6 and 2.]

Two matches, at positions 6 and 2
16
Finding Common Substrings
  • Construct a generalized suffix tree for two
    strings (each suffix of each string is
    represented).
  • Label each leaf with the suffix number and string
    label.
  • Each internal node with a leaf from both strings
    in its subtree gives a common substring.
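
One way to see the idea in code is the sketch below, which finds a longest common substring. It assumes an uncompressed generalized suffix trie (quadratic space) instead of a suffix tree, and records at each node which of the two strings has a suffix passing through it; the function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(s1, s2):
    """Return a longest string that occurs in both s1 and s2."""
    def new_node():
        return {"owners": set(), "children": {}}

    root = new_node()
    for owner, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
                node["owners"].add(owner)

    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            if child["owners"] == {1, 2}:  # reached by suffixes of both strings
                label = prefix + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best

print(longest_common_substring("WINDOW", "INDIGO"))  # IND
```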

17
Generalized Suffix Tree
W I N D O W $      I N D I G O $
1 2 3 4 5 6 7      1 2 3 4 5 6 7

[Figure: generalized suffix tree of WINDOW$ and INDIGO$. Each leaf is labeled (string number, suffix number), for example (1, 2) for suffix 2 of WINDOW$ and (2, 1) for suffix 1 of INDIGO$.]
18
Lowest Common Ancestors
  • The lowest common ancestor (lca) of two nodes x
    and y in a rooted tree is the deepest node
    (farthest away from root) that is an ancestor of
    both x and y
  • Concatenation of edge labels from root to the lca
    of two leaves spells out the longest common
    prefix (lcp) of the two corresponding suffixes
  • lca(x, y) can be found in constant time after
    linear preprocessing [Bender00]

19
A Useful Property
String depth(lca(i, j)) = |lcp(suffix i, suffix j)|

lcp = longest common prefix

[Figure: suffix tree of MALAYALAM$ with the lca of two leaves highlighted; the highlighted node has string depth 3.]
20
Longest Common Extension
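R A I L W A Y $
1 2 3 4 5 6 7 8

G R A I N Y $
1 2 3 4 5 6 7

Both strings contain RAI.
lce(1,1) = 0
lce(2,1) = 3 (suffix 2 of GRAINY and suffix 1 of RAILWAY share RAI)
We'll soon find lces useful in reconstructing
phylogenetic trees based on whole genome/proteome
sequences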
21
lces and lcas
To compute lces for two strings S1 and S2
  1. Build generalized suffix tree, T, of S1 and S2
  2. Compute string depth for each node in T
  3. Preprocess T for lca queries
  4. lce(i, j) = string depth of the lca of suffix i of S1
    and suffix j of S2
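
For checking results, the quantity computed in step 4 can also be obtained by direct character comparison. The sketch below is that brute-force reference only: it takes time proportional to the answer, not the constant time per query that the suffix tree with lca preprocessing provides. The function name `lce` and its argument order are assumptions.

```python
def lce(s1, s2, i, j):
    """Length of the longest common prefix of suffix i of s1 and suffix j of s2 (1-based)."""
    k = 0
    while (i - 1 + k < len(s1) and j - 1 + k < len(s2)
           and s1[i - 1 + k] == s2[j - 1 + k]):
        k += 1
    return k

print(lce("GRAINY", "RAILWAY", 1, 1))  # 0
print(lce("GRAINY", "RAILWAY", 2, 1))  # 3
```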

22
Example
W I N D O W $      I N D I G O $
1 2 3 4 5 6 7      1 2 3 4 5 6 7

[Figure: generalized suffix tree of WINDOW$ and INDIGO$, with each leaf labeled (string number, suffix number).]
23
lces, revisited
Given two strings S1 and S2, we are
now interested in finding, for each i, the index
j such that lce(i, j) is maximal.
  • What is the meaning of this task?
  • How do we accomplish it efficiently?
  • Notice that computing the values
    lce(i, j) for all j is very inefficient!

24
Palindromes
  • A palindrome is a string that reads the same in
    both directions
  • E.g., CATGTAC
  • red rum, sir, is murder
  • Palindrome problem: find all maximal palindromes
    in a string S

25
Finding Palindromes in S
  • Construct the reverse S^r of S
  • Build the generalized suffix tree T of S and S^r
  • Preprocess T for lce queries
  • Now what?
  • Left as homework
  • Requirement: linear time (const. per query)

26
Palindromes in DNA sequences
  • We sometimes need to deal with
    Crick-Watson complemented palindromes (A ↔ T,
    C ↔ G)
  • E.g., ATCATGAT is a complemented palindrome
  • All complemented palindromes in S can be found
    using a GST of S and the complement of S
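
The definition can be checked directly: a string is a complemented palindrome when it equals the reverse of its own complement. The snippet below is a small illustration of that definition only, not the GST-based method for finding all such palindromes; the names `COMPLEMENT` and `is_complemented_palindrome` are hypothetical.

```python
# Translation table for the pairing A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_complemented_palindrome(s):
    """True if s reads the same as its complement read backwards."""
    return s == s.translate(COMPLEMENT)[::-1]

print(is_complemented_palindrome("ATCATGAT"))  # True
print(is_complemented_palindrome("CATGTAC"))   # False
```

CATGTAC is an ordinary palindrome but not a complemented one, which is why the two problems are treated separately.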

27
Suffix Array: Reducing Space

M A L A Y A L A M $
1 2 3 4 5 6 7 8 9 10

Suffix array: lexicographic ordering of suffixes.
Longest common prefix (lcp) array: derived from the suffix array.

Suffix array   Suffix        lcp
6              ALAM$         3
2              ALAYALAM$     1
8              AM$           1
4              AYALAM$       0
7              LAM$          2
3              LAYALAM$      0
1              MALAYALAM$    1
9              M$            0
5              YALAM$        0
10             $             -

Suffixes 6 and 2 share ALA; suffixes 2 and 8 share just
A. Each lcp value is taken between successive pairs.
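
The two arrays above can be reproduced with a direct sort. This is a naive sketch for small inputs, not a linear-time construction. It assumes, as the listing on this slide implies, that $ is ranked after every letter, which is emulated here by substituting `~` in the sort key; the names `suffix_array` and `lcp_array` are hypothetical.

```python
def suffix_array(s):
    """1-based suffix array of s by direct sorting, with '$' ranked after all letters."""
    def key(i):
        return s[i - 1:].replace("$", "~")  # '~' follows 'A'..'Z' in ASCII
    return sorted(range(1, len(s) + 1), key=key)

def lcp_array(s, sa):
    """lcp[r] = length of the common prefix of the suffixes at ranks r and r + 1."""
    def common(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [common(s[sa[r] - 1:], s[sa[r + 1] - 1:]) for r in range(len(sa) - 1)]

s = "MALAYALAM$"
sa = suffix_array(s)
print(sa)                # [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
print(lcp_array(s, sa))  # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```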
28
Pattern Search in Suffix Array
  • All suffixes that share a common prefix appear in
    consecutive positions in the array.
  • Pattern P can be located in the string using a
    binary search on the suffix array.
  • Naïve run-time: O(|P| · log n).
  • Improved to O(|P| + log n) [ManberMyers93], and
    to O(|P|) [Abouelhoda et al. 02].
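
The naïve binary search can be sketched as below. It is an illustration of the O(|P| · log n) variant only, not the improved algorithms cited above. For brevity it assumes the text has no terminator and uses Python's default string ordering (a shorter suffix sorts before its extensions), which differs from the slide's listing but does not affect the search; `suffix_array` and `find` are hypothetical names.

```python
def suffix_array(s):
    """1-based suffix array of s by direct sorting."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def find(s, sa, p):
    """Return the sorted 1-based positions of p in s using two binary searches."""
    def prefix(r):
        start = sa[r] - 1
        return s[start:start + len(p)]  # first |p| characters of the suffix at rank r

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first rank whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

s = "MALAYALAM"
sa = suffix_array(s)
print(find(s, sa, "ALA"))  # [2, 6]
print(find(s, sa, "M"))    # [1, 9]
print(find(s, sa, "X"))    # []
```

All matching suffixes occupy the consecutive ranks between the two boundaries, which is the first bullet of this slide in action.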

29
Computing longest common prefix Values
  • Find where suffix 1 (S1) is in the suffix array.
  • Compute the lcp value of S1.
  • Find suffix 2 (S2) in the suffix array.
  • Compute the lcp value of S2.
  • Repeat for all suffixes.
  • Run-time is amortized O(1) per suffix (why?),
    so linear overall.
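
A sketch of this procedure follows, in the style of the algorithm usually attributed to Kasai et al. It processes suffixes in text order and reuses the previous match length minus one, which is what keeps the total work linear. The function name `lcp_linear` is hypothetical, and the suffix array is assumed to be given (here, the one from the MALAYALAM$ example).

```python
def lcp_linear(s, sa):
    """lcp[r] = common prefix length of the suffixes at ranks r and r + 1 (sa is 1-based)."""
    n = len(s)
    rank = [0] * (n + 1)  # rank[i] = 0-based rank of suffix i in sa
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * (n - 1)
    h = 0  # match length carried over from the previous suffix
    for i in range(1, n + 1):  # visit suffixes in text order
        r = rank[i]
        if r == n - 1:  # last rank has no successor to compare with
            h = 0
            continue
        j = sa[r + 1]
        while (i - 1 + h < n and j - 1 + h < n
               and s[i - 1 + h] == s[j - 1 + h]):
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # suffix i + 1 shares at least h - 1 characters with its successor
    return lcp

s = "MALAYALAM$"
sa = [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
print(lcp_linear(s, sa))  # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```

The counter h drops by at most one per suffix and never exceeds n, so the inner loop runs O(n) times in total, which answers the "why?" above.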

30
Example
Suffix Array       lcp Array   Suffix
(text position)
6                  3           ALAM$
2                  1           ALAYALAM$
8                  1           AM$
4                  0           AYALAM$
7                  2           LAM$
3                  0           LAYALAM$
1                  1           MALAYALAM$
9                  0           M$
5                  0           YALAM$
10                 -           $
31
Suffix Trees vs. Suffix Arrays
  • Suffix Array = lexicographic order of the
    leaves of the Suffix Tree
  • Suffix Tree ⇔ Suffix Array + lcp Array
  • (why? Wait for next slide)

32
Building a ST from a SA and a lcp
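SA:  6 2 8 4 7 3 1 9 5 10
lcp: 3 1 1 0 2 0 1 0 0 -

 6 ALAM$
 2 ALAYALAM$
 8 AM$
 4 AYALAM$
 7 LAM$
 3 LAYALAM$
 1 MALAYALAM$
 9 M$
 5 YALAM$
10 $

[Figure: the suffix tree partially rebuilt from the two arrays; nodes are annotated with depth values D = 0, 1, 2, 3, and leaves 6, 2, 8, 4, 7, and 3 are shown.]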
33
Known (amazing) Results
  • Suffix tree can be constructed in O(n) time and
    O(n ? ?) space [Weiner73, McCreight76,
    Ukkonen92].
  • Suffix arrays can be constructed without using
    suffix trees in O(n) time [PangAluru03].

34
More Applications
  • Suffix-prefix overlaps in fragment assembly
  • Maximal and tandem repeats
  • Shortest unique substrings
  • Maximal unique matches (MUMmer)
  • Approximate matching
  • Phylogenies based on complete genomes

35
Dealing with errors
  • The basic string data structures can only extract
    information in the absence of errors.
  • To deal with errors, decompose into parts that do
    not involve errors.

36
The k-mismatch problem
  • Given a pattern P, a text T, and a number k, find
    all occurrences of P in T with at most k
    mismatches
  • Example:
    P = bend, T = abentbananaend, k = 2

Match 1: bent
Match 2: bana
Match 3: aend
37
Solution
  1. Build GST of P and T and preprocess it for lce
    queries
  2. For each starting index i in T, do at most k lce
    queries to determine if there is a k-mismatch
    beginning at i

[Figure: text T and pattern P.]
Time: O(k |T|)
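
The two steps can be sketched as follows. To stay self-contained, this illustration assumes a brute-force `lce` by character comparison in place of the constant-time lce queries on the GST, so it shows the jumping logic but not the O(k |T|) bound. The names `lce` and `k_mismatch` are hypothetical.

```python
def lce(t, p, i, j):
    """Length of the longest common prefix of t[i:] and p[j:] (0-based, brute force)."""
    k = 0
    while i + k < len(t) and j + k < len(p) and t[i + k] == p[j + k]:
        k += 1
    return k

def k_mismatch(p, t, k):
    """Return 1-based start positions in t where p occurs with at most k mismatches."""
    hits = []
    m = len(p)
    for i in range(len(t) - m + 1):
        pos, mismatches = 0, 0
        while True:
            pos += lce(t, p, i + pos, pos)  # jump over the matching stretch
            if pos >= m:
                hits.append(i + 1)          # reached the end of the pattern
                break
            mismatches += 1                 # t[i + pos] differs from p[pos]
            if mismatches > k:
                break
            pos += 1                        # step past the mismatching character
    return hits

print(k_mismatch("bend", "abentbananaend", 2))  # [2, 6, 11]
```

The three reported positions correspond to bent, bana, and aend in the example on the previous slide.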
38
References
  • M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The
    enhanced suffix array and its applications to
    genome analysis, 2nd Workshop on Algorithms in
    Bioinformatics, pp. 449-463, 2002.
  • M. A. Bender and M. Farach-Colton, The LCA
    Problem Revisited, LATIN, pages 88-94, 2000.
  • P. Ko and S. Aluru, Linear time suffix sorting,
    CPM, pages 200-210, 2003.
  • U. Manber and G. Myers, Suffix arrays: a new
    method for on-line search, SIAM J. Comput.,
    22:935-948, 1993.
  • E. M. McCreight, A space-economical suffix tree
    construction algorithm, J. ACM, 23(2):262-272,
    1976.
  • E. Ukkonen, Constructing suffix trees on-line in
    linear time, Intern. Federation of Information
    Processing, pp. 484-492, 1992. Also in
    Algorithmica, 14(3):249-260, 1995.
  • P. Weiner, Linear pattern matching algorithms,
    Proc. of the 14th IEEE Annual Symp. on Switching
    and Automata Theory, pp. 1-11, 1973.