BBSI Summer School - Iowa State University

1
String Data Structures and Algorithms

David Fernández-Baca
UNAM (Mexico)
(based on notes by Srinivas Aluru)
slightly modified by Benny Chor

2
Why Strings?

Biological sequences can be viewed as strings,
or finite series of characters, over an alphabet
Σ.
There is a wealth of algorithmic theory developed
for general strings that we can apply to specific
biological problems.

3
Look-up Tables

Strings of size k over Σ can be represented by an
integer index i, 0 ≤ i ≤ |Σ|^k − 1.
DNA is composed of four characters:
Σ = {A, G, C, T}, |Σ| = 4.
We can preprocess a database into a lookup table
to locate all occurrences of a query index.

4
Example
Let A = 0 (00), C = 1 (01), G = 2 (10), T = 3 (11).
Strings are converted based on the binary string
they represent.

String   Binary   Integer
AAA      000000   0
ATA      001100   12
AAC      000001   1
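The conversion in the table above can be sketched in a few lines of Python. The function names `kmer_to_index` and `index_to_kmer` are illustrative and not part of the slides; the 2-bit code is the one given in the example.

```python
# 2-bit code from the example: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def kmer_to_index(kmer: str) -> int:
    """Read a DNA string as a base-4 number, two bits per character."""
    index = 0
    for ch in kmer:
        index = (index << 2) | CODE[ch]
    return index


def index_to_kmer(index: int, k: int) -> str:
    """Invert kmer_to_index for a string of known length k."""
    chars = []
    for _ in range(k):
        chars.append(LETTERS[index & 3])  # lowest two bits = last character
        index >>= 2
    return "".join(reversed(chars))


print(kmer_to_index("ATA"))   # 12, as in the table
print(index_to_kmer(12, 3))   # ATA
```

For strings of size k the index always falls in the range 0 to 4^k − 1, so it can address a table of exactly that size.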
5
Search using an Index
[Figure: the query is converted to an index into a table of size |Σ|^k; each table entry points to a linked list of occurrences.]
6
Applications of Indexing

Seeds for searching sequence databases: BLAST
Pair generation for fragment assembly: CAP3
sequence assembly program

7
Indexing discussion

Using a sparse representation, a database can be
preprocessed in linear time to allow locating all
instances of a short string.
Major limitation: search is restricted to
fixed-length strings.
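A minimal sketch of such an index, assuming a Python dictionary as the sparse representation: only k-mers that actually occur get an entry, and hashing the k-mer string stands in for the integer index. The name `build_kmer_index` is hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database: str, k: int) -> dict:
    """Map every k-mer occurring in the database to its 1-based positions.

    One pass over the database, so the work is linear for a fixed k.
    """
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos + 1)
    return index


index = build_kmer_index("ATATAC", 3)
print(index.get("ATA", []))  # [1, 3]
print(index.get("GGG", []))  # [] because this k-mer never occurs
```

The limitation noted above shows up directly: the index answers queries of length exactly k and nothing else.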

8
Suffix Tree
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S; edges are labeled with substrings of S and leaves with the starting positions 1 to 10 of the suffixes.]
9
Suffix Trees
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S; edges are labeled with substrings of S and leaves with the starting positions 1 to 10 of the suffixes.]
10
Suffix tree properties

For a string S of length n, there are n leaves
and at most n internal nodes.
The tree therefore requires only linear space,
provided each edge label takes O(1) space.
Each leaf represents a unique suffix.
Concatenation of edge labels from root to a leaf
spells out the suffix.
Each internal node represents a distinct common
prefix of at least two suffixes.

11
Edge Encoding
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: the same suffix tree with each edge label replaced by a pair (i, j) denoting the substring S[i..j]; for example, (5, 10) stands for YALAM$, (3, 4) for LA, and (10, 10) for $.]
12
Naïve Suffix Tree Construction
1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $
Before starting: why exactly do we need this $,
which is not part of the alphabet?
13
Naïve Suffix Tree Construction
[Figure: the suffixes are inserted one at a time. After suffixes 1 to 4, the root has edges MALAYALAM$ (leaf 1), LAYALAM$ (leaf 3), and A; the node below A has edges LAYALAM$ (leaf 2) and YALAM$ (leaf 4).]
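The naïve construction can be sketched as follows: insert the suffixes one by one, walking down from the root and splitting an edge at the first mismatch. This is a quadratic-time illustration under the assumption that the text ends in a unique terminator `$`; the class and function names are hypothetical. Each edge is stored as a pair of positions into the text, so every label takes O(1) space.

```python
class Node:
    def __init__(self, start=0, end=0, suffix=None):
        self.start = start      # incoming edge label is text[start:end] (0-based)
        self.end = end
        self.children = {}      # first character of an edge -> child Node
        self.suffix = suffix    # 1-based suffix number, set for leaves only


def build_suffix_tree(text):
    assert text.count("$") == 1 and text.endswith("$"), "unique terminator needed"
    n = len(text)
    root = Node()
    for i in range(n):                      # insert suffix i + 1
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:               # no edge starts with this character
                node.children[text[pos]] = Node(pos, n, i + 1)
                break
            k = child.start                 # match along the edge label
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:              # whole edge matched: keep descending
                node = child
                continue
            mid = Node(child.start, k)      # mismatch inside the edge: split it
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n, i + 1)
            break
    return root


def leaves(node):
    """Suffix numbers of all leaves below a node."""
    if not node.children:
        return [node.suffix]
    found = []
    for child in node.children.values():
        found.extend(leaves(child))
    return found


root = build_suffix_tree("MALAYALAM$")
print(sorted(leaves(root)))                  # [1, 2, ..., 10]
print(sorted(leaves(root.children["A"])))    # [2, 4, 6, 8]
```

The terminator guarantees that no suffix is a prefix of another, which is why the walk always ends in a mismatch before running off the end of the text.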
14
Finding a (short) Pattern in a (long) String

Build a suffix tree of the string.
Starting from the root, traverse a path matching
characters of the pattern.
If stuck, pattern not present in string.
Otherwise, each leaf below gives a position of
the pattern in the string.
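The search procedure above can be illustrated with a small sketch. To keep it short it uses an uncompressed suffix trie (one character per edge, quadratic space) rather than a true suffix tree; the traversal logic is the same. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Nested dicts, one character per edge; key None marks a leaf position."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1          # 1-based start of this suffix
    return root


def find_occurrences(trie, pattern):
    node = trie
    for ch in pattern:              # follow the path spelled by the pattern
        if ch not in node:
            return []               # stuck: pattern not present
        node = node[ch]
    positions, stack = [], [node]
    while stack:                    # every leaf below is an occurrence
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("MALAYALAM$")
print(find_occurrences(trie, "ALA"))  # [2, 6]
print(find_occurrences(trie, "XYZ"))  # []
```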

15
Finding a Pattern in a String
Find ALA

[Figure: the suffix tree of MALAYALAM$ with the path spelling ALA highlighted; the two leaves below it are 6 and 2.]
Two matches: at positions 6 and 2
16
Finding Common Substrings

Construct a generalized suffix tree for two
strings (each suffix of each string is
represented).
Label each leaf with the suffix number and string
label.
Each internal node with a leaf from both strings
in its subtree gives a common substring.
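A rough sketch of the idea, assuming an uncompressed generalized suffix trie in place of a generalized suffix tree (simpler to write, but quadratic in space). Each node records which strings have a suffix passing through it; a node reached by both strings spells a common substring. The function name is hypothetical.

```python
def common_substrings(s1, s2):
    root = {"ids": set(), "kids": {}}
    for sid, s in ((1, s1), (2, s2)):       # insert every suffix of both strings
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"ids": set(), "kids": {}})
                node["ids"].add(sid)        # a suffix of string sid passes here
    found, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, kid in node["kids"].items():
            if kid["ids"] == {1, 2}:        # both strings below: common substring
                found.append(prefix + ch)
                stack.append((kid, prefix + ch))
    return sorted(found)


shared = common_substrings("WINDOW", "INDIGO")
print(shared)                 # ['D', 'I', 'IN', 'IND', 'N', 'ND', 'O']
print(max(shared, key=len))   # IND
```

In a real generalized suffix tree the same test is applied to internal nodes only, and the deepest qualifying node gives the longest common substring.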

17
Generalized Suffix Tree
String 1: W I N D O W $    String 2: I N D I G O $
          1 2 3 4 5 6 7              1 2 3 4 5 6 7

[Figure: generalized suffix tree of the two strings; each leaf is labeled (string number, suffix position), e.g. (1, 2) for suffix 2 of WINDOW$ and (2, 1) for suffix 1 of INDIGO$.]
18
Lowest Common Ancestors

The lowest common ancestor (lca) of two nodes x
and y in a rooted tree is the deepest node
(farthest away from root) that is an ancestor of
both x and y.
Concatenation of edge labels from root to the lca
of two leaves spells out the longest common
prefix (lcp) of the two suffixes they represent.
lca(x, y) can be found in constant time after
linear preprocessing [Bender00].

19
A Useful Property
String depth(lca(i, j)) = |lcp(suffix i, suffix j)|
lcp = longest common prefix

[Figure: in the suffix tree of MALAYALAM$, the lca of leaves 6 and 2 is the node whose path from the root spells ALA; its string depth is 3.]
20
Longest Common Extension
RAILWAY$   (positions 1 to 8)
GRAINY$    (positions 1 to 7)
The common extension RAI is highlighted in both strings.
lce(1, 1) = 0
lce(2, 1) = 3
We'll soon find lces useful in reconstructing
phylogenetic trees based on whole genome/proteome
sequences
21
lces and lcas
To compute lces for two strings S1 and S2:

Build generalized suffix tree, T, of S1 and S2
Compute string depth for each node in T
Preprocess T for lca queries
lce(i, j) = string depth of the lca of suffix i of S1
and suffix j of S2
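As a reference point for what an lce query returns, here is a naïve version that compares characters directly instead of using a generalized suffix tree with lca preprocessing; it takes time proportional to the answer rather than constant time. The example takes S1 = GRAINY and S2 = RAILWAY, which is an assumption chosen so that lce(1, 1) = 0 and lce(2, 1) = 3 as in the RAILWAY/GRAINY example.

```python
def lce(s1: str, i: int, s2: str, j: int) -> int:
    """Length of the longest common prefix of suffix i of s1 and suffix j of s2.

    i and j are 1-based, matching the numbering used on the slides.
    """
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s1) and b + length < len(s2)
           and s1[a + length] == s2[b + length]):
        length += 1
    return length


print(lce("GRAINY", 1, "RAILWAY", 1))  # 0
print(lce("GRAINY", 2, "RAILWAY", 1))  # 3, the shared RAI
```

Walking both suffixes down from the root of the generalized suffix tree until they diverge visits exactly the characters this loop compares; the lca machinery replaces that walk with a constant-time lookup.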

22
Example
String 1: W I N D O W $    String 2: I N D I G O $
          1 2 3 4 5 6 7              1 2 3 4 5 6 7

[Figure: generalized suffix tree of the two strings; each leaf is labeled (string number, suffix position), e.g. (1, 2) for suffix 2 of WINDOW$ and (2, 1) for suffix 1 of INDIGO$.]
23
lces, revisited
Given two strings S1 and S2, we are
now interested in finding, for each i, the index
j such that lce(i, j) is maximal.

What is the meaning of this task?
How do we accomplish it efficiently?
Notice that computing the values
lce (i, j) for all j is very inefficient!

24
Palindromes

A palindrome is a string that reads the same in
both directions
E.g., CATGTAC
or "red rum, sir, is murder"
Palindrome problem: find all maximal palindromes
in a string S

25
Finding Palindromes in S

Construct the reverse S^r of S
Build generalized suffix tree T of S and S^r
Preprocess T for lce queries
Now what?
Left as homework
Requirement: linear time (constant time per query)

[Figure: the string S with a position q marked.]
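One possible way to use lce queries here, offered as a sketch rather than as the intended homework answer: for each center position q, the longest palindrome around q is found by one lce query between the part of S to the right of q and the part of the reversed string that reads leftward from q. The version below uses a naïve character-by-character lce, so it is quadratic in the worst case; with constant-time lce queries the same loop would be linear. All names are hypothetical.

```python
def lce(a, i, b, j):
    """Longest common extension of a[i:] and b[j:] (0-based, naive)."""
    h = 0
    while i + h < len(a) and j + h < len(b) and a[i + h] == b[j + h]:
        h += 1
    return h


def maximal_palindromes(s):
    n = len(s)
    rev = s[::-1]                   # rev[n - q] is s[q - 1], read leftward
    found = []
    for q in range(n):
        r = lce(s, q + 1, rev, n - q)      # odd length, centered on s[q]
        found.append(s[q - r:q + r + 1])
        r = lce(s, q, rev, n - q)          # even length, centered before s[q]
        if r > 0:
            found.append(s[q - r:q + r])
    return found


print(max(maximal_palindromes("XCATGTACY"), key=len))  # CATGTAC
```

Each palindrome returned is maximal in the sense that it cannot be extended further around its center.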
26
Palindromes in DNA sequences

We sometimes need to deal with
Crick-Watson complemented palindromes:
A ↔ T, C ↔ G
E.g., ATCATGAT is a complemented palindrome
All complemented palindromes in S can be found
using a GST of S and the complement of S
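A short check of the definition, assuming a complemented palindrome is a string equal to its own reverse complement. The helper names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s: str) -> str:
    """Complement each base and read the result backwards."""
    return "".join(COMPLEMENT[ch] for ch in reversed(s))


def is_complemented_palindrome(s: str) -> bool:
    return s == reverse_complement(s)


print(is_complemented_palindrome("ATCATGAT"))  # True
print(is_complemented_palindrome("CATGTAC"))   # False: an ordinary palindrome only
```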

27
Suffix Array: Reducing Space
S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

Suffix array: lexicographic ordering of the suffixes
(here $ sorts after every letter).

Position  Suffix      lcp with next suffix
6         ALAM$       3
2         ALAYALAM$   1
8         AM$         1
4         AYALAM$     0
7         LAM$        2
3         LAYALAM$    0
1         MALAYALAM$  1
9         M$          0
5         YALAM$      0
10        $           -

Suffix array: 6 2 8 4 7 3 1 9 5 10
lcp array:    3 1 1 0 2 0 1 0 0 -

Derive the longest common prefix (lcp) array:
suffixes 6 and 2 share ALA; suffixes 2 and 8 share
just A. The lcp is recorded for successive pairs.
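The two arrays can be reproduced with a naïve sketch that sorts the suffixes directly (far from linear time, but easy to check by hand). One assumption is made explicit in the code: the slide's ordering places `$` after every letter, whereas Python's default string order places it before, so the sort key substitutes a high character for `$`.

```python
def suffix_array(text):
    """0-based start positions of the suffixes in sorted order.

    Assumption: '$' must sort after all letters to match the slide, so it is
    replaced by '\x7f' in the sort key only.
    """
    return sorted(range(len(text)),
                  key=lambda i: text[i:].replace("$", "\x7f"))


def lcp_naive(text, sa):
    """lcp[r] = common prefix length of the suffixes at ranks r and r + 1."""
    result = []
    for a, b in zip(sa, sa[1:]):
        h = 0
        while a + h < len(text) and b + h < len(text) and text[a + h] == text[b + h]:
            h += 1
        result.append(h)
    return result


text = "MALAYALAM$"
sa = suffix_array(text)
print([i + 1 for i in sa])   # [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
print(lcp_naive(text, sa))   # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```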
28
Pattern Search in Suffix Array

All suffixes that share a common prefix appear in
consecutive positions in the array.
Pattern P can be located in the string using a
binary search on the suffix array.
Naïve run-time: O(|P| · log n).
Improved to O(|P| + log n) [ManberMyers93], and
to O(|P|) [Abouelhoda et al. 02].
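The naïve binary search can be sketched as below. It runs two binary searches, one for each end of the block of suffixes that start with P, and each comparison looks at no more than |P| characters. For brevity the suffix array is built by plain sorting with Python's default character order, in which `$` sorts before the letters; the order of the array therefore differs from the slide's, but the positions reported are the same. Function names are hypothetical.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                          # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(p + 1 for p in sa[first:lo])   # 1-based positions


text = "MALAYALAM$"
sa = suffix_array(text)
print(find_occurrences(text, sa, "ALA"))  # [2, 6]
print(find_occurrences(text, sa, "LA"))   # [3, 7]
```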

29
Computing longest common prefix Values

Find where suffix S1 is in the suffix array.
Compute lcp value of S1.
Find suffix S2 in the suffix array.
Compute lcp value of S2.
Repeat for all suffixes.
Run-time is O(1) amortized per suffix (why?),
so linear overall.
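A sketch of this procedure, in the style commonly attributed to Kasai et al. (the slides do not name a source, so that attribution is an assumption). Suffixes are processed in text order; when moving from suffix i to suffix i + 1, the previous lcp value minus one is a valid starting point, so the match length drops by at most one per step and the total work is linear. The suffix array is built naïvely here, with `$` made to sort last to match the slides.

```python
def suffix_array(text):
    # Assumption: '$' sorts after every letter, as in the slides.
    return sorted(range(len(text)),
                  key=lambda i: text[i:].replace("$", "\x7f"))


def lcp_array(text, sa):
    """lcp[r] = lcp of the suffixes at ranks r and r + 1 in the suffix array."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r                 # where each suffix sits in the array
    lcp = [0] * (n - 1)
    h = 0
    for i in range(n):                  # suffixes in text order: S1, S2, ...
        if rank[i] == n - 1:            # last in the array: no successor
            h = 0
            continue
        j = sa[rank[i] + 1]             # the suffix that follows suffix i
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                      # next suffix keeps at least h - 1
    return lcp


text = "MALAYALAM$"
print(lcp_array(text, suffix_array(text)))  # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```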

30
Example
Text position
(suffix array)  Suffix      lcp array
6               ALAM$       3
2               ALAYALAM$   1
8               AM$         1
4               AYALAM$     0
7               LAM$        2
3               LAYALAM$    0
1               MALAYALAM$  1
9               M$          0
5               YALAM$      0
10              $
31
Suffix Trees vs. Suffix Arrays

Suffix Array = lexicographic order of the
leaves of the Suffix Tree
Suffix Tree ⇔ Suffix Array + lcp Array
(why? Wait for next slide)

32
Building a ST from a SA and a lcp
[Figure: the suffix tree of MALAYALAM$ built left to right from the sorted suffixes, with string depths D = 0, D = 1, D = 2, and D = 3 marked along the path to leaves 6 and 2.]
SA:  6 2 8 4 7 3 1 9 5 10
lcp: 3 1 1 0 2 0 1 0 0 -
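One way to see why the two arrays determine the tree: a single left-to-right scan of the lcp array with a stack recovers every internal node, identified by its string depth and the range of suffix-array ranks (leaves) below it. This is a sketch of that scan only; it reports the nodes rather than building linked tree objects, and the function name is hypothetical.

```python
def internal_nodes(lcp):
    """Return (string depth, leftmost leaf rank, rightmost leaf rank) triples.

    lcp[r] is the lcp of the suffixes at ranks r and r + 1; ranks are 0-based.
    """
    n_leaves = len(lcp) + 1
    nodes = []
    stack = [(0, 0)]                        # the root: depth 0, starts at rank 0
    for i, h in enumerate(lcp):
        left = i
        while stack[-1][0] > h:             # deeper nodes end at leaf i
            depth, left = stack.pop()
            nodes.append((depth, left, i))
        if stack[-1][0] < h:                # a new, deeper node opens here
            stack.append((h, left))
    while stack:                            # whatever remains spans to the end
        depth, left = stack.pop()
        nodes.append((depth, left, n_leaves - 1))
    return nodes


sa = [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
lcp = [3, 1, 1, 0, 2, 0, 1, 0, 0]
for depth, left, right in internal_nodes(lcp):
    print(depth, sa[left:right + 1])
# 3 [6, 2]
# 1 [6, 2, 8, 4]
# 2 [7, 3]
# 1 [1, 9]
# 0 [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
```

The four non-root entries correspond to the internal nodes spelling ALA, A, LA, and M in the suffix tree of MALAYALAM$.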
33
Known Results

Suffix trees can be constructed in O(n) time and
O(n ? ?) space [Weiner73, McCreight76,
Ukkonen92].
Suffix arrays can be constructed without using
suffix trees in O(n) time [KoAluru03].

34
More Applications

Suffix-prefix overlaps in fragment assembly
Maximal and tandem repeats
Shortest unique substrings
Maximal unique matches (MUMmer)
Approximate matching
Phylogenies based on complete genomes

35
Dealing with errors

The basic string data structures can only extract
information in the absence of errors.
To deal with errors, decompose into parts that do
not involve errors.

36
The k-mismatch problem

Given a pattern P, a text T, and a number k, find
all occurrences of P in T with at most k
mismatches.
Example:
P = bend, T = abentbananaend, k = 2

Match 1: bent
Match 2: bana
Match 3: aend
37
Solution

Build GST of P and T and preprocess it for lce
queries
For each starting index i in T, do at most k + 1 lce
queries to determine if there is a k-mismatch
occurrence beginning at i

[Figure: P aligned against T at starting index i.]
Time: O(k |T|)
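The lce-jumping idea can be sketched as follows: at each starting index, one lce query skips to the next mismatch, the mismatch is counted and stepped over, and the process repeats until the pattern is exhausted or more than k mismatches have been seen. The lce below is a naïve character comparison standing in for the constant-time GST-based query, so this sketch does not achieve the O(k |T|) bound; the names are hypothetical.

```python
def lce(a, i, b, j):
    """Naive longest common extension of a[i:] and b[j:] (0-based)."""
    h = 0
    while i + h < len(a) and j + h < len(b) and a[i + h] == b[j + h]:
        h += 1
    return h


def k_mismatch(pattern, text, k):
    """1-based starting positions in text where pattern occurs with <= k mismatches."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        p = 0                    # how much of the pattern is accounted for
        mismatches = 0
        while True:
            p += lce(pattern, p, text, i + p)   # jump over a matching run
            if p >= m:
                break
            mismatches += 1                     # position p is a mismatch
            if mismatches > k:
                break
            p += 1                              # step over it and continue
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(k_mismatch("bend", "abentbananaend", 2))  # [2, 6, 11]: bent, bana, aend
```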
38
References

M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The
enhanced suffix array and its applications to
genome analysis, 2nd Workshop on Algorithms in
Bioinformatics, pp. 449-463, 2002.
M. A. Bender and M. Farach-Colton, The LCA
problem revisited, LATIN, pp. 88-94, 2000.
P. Ko and S. Aluru, Linear time suffix sorting,
CPM, pp. 200-210, 2003.
U. Manber and G. Myers, Suffix arrays: a new
method for on-line search, SIAM J. Comput.,
22(5):935-948, 1993.
E. M. McCreight, A space-economical suffix tree
construction algorithm, J. ACM, 23(2):262-272,
1976.
E. Ukkonen, Constructing suffix trees on-line in
linear time, Intern. Federation of Information
Processing, pp. 484-492, 1992. Also in
Algorithmica, 14(3):249-260, 1995.
P. Weiner, Linear pattern matching algorithms,
Proc. of the 14th IEEE Annual Symp. on Switching
and Automata Theory, pp. 1-11, 1973.