Suffix Trees Come of Age in Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Suffix Trees Come of Age in Bioinformatics

Description:

Bioinformatics Applications pre 1997. A small number of applications ... 20- 40 new publications with substantial connection to bioinformatics, post 1997 ... – PowerPoint PPT presentation

Number of Views:68

Avg rating:3.0/5.0

Slides: 42

Provided by: DanGus8

Learn more at: https://csiflabs.cs.ucdavis.edu

Category:

more less

Transcript and Presenter's Notes

Title: Suffix Trees Come of Age in Bioinformatics

1
Suffix Trees Come of Age in Bioinformatics

Algorithms, Applications and Implementations
Dan Gusfield, U.C. Davis

2
AABAAB
1 2 3 4 5 6
S
Suffix Tree
B
A
A
A
B

B
A
3

B
A
5
6
A

4
B
A

A
2
B
Every suffix of S is encoded by a root to leaf
walk. Every substring encoded by a walk from the
root.

1
3
AABAAB
1 2 3 4 5 6
SUFFIX ARRAY
B
A
A
A
B

B
A
3

B
A
5
6
A

4
B
A
4 1 5 2 3 6

A
2
The suffixes listed in lexicographic order. Found
by lexicographic DFS
B

1
4
Basic Facts about Suffix Trees

Suffix tree for a string of length n can be built
in O(n) time.
Basic implementations need about 25 bytes per
character.
Numerous applications in string algorithms
achieving significant (sometimes amazing)
speedups over naïve algorithms O(m) time
substring search for a string of length m in a
string of length n, regardless of n.

5
Suffix Tree/Array Construction

Weiner 1973
McCreight 1974
Ukkonen 1996
Farach-Colton 1999
Manber-Myers 1991 (Suffix-Arrays)
O(n log n time) with small space. O(n) time
possible as on prior slide.

6
aabbaa
B
A
B
A
A

B
B
6
A
3
A
B

A
A
5
B

A

B
4
A
2
A

1
7
Bioinformatics Applications pre 1997

A small number of applications
Limited by space requirements of the tree, small
computer memories, complication of the
algorithms, poor locality of reference

8
Post 1997

Much has changed, and Suffix trees and their
relatives (particularly suffix arrays) are more
widely applied in bioinformatics
20- 40 new publications with substantial
connection to bioinformatics, post 1997

9
New Results since 1997

Fundamental Algorithms Farach-Colton
construction simpler algorithm for
least-common-ancestor.
Implementation improvements (space) for suffix
trees and arrays
New variants of suffix trees affix trees,
virtual suffix trees
New algorithms for tandem repeats

10
New Results since 1997

Approximate repeats and large-scale repeat
structure
Fast lookups in databases
Substring frequencies
Motif and pattern discovery
Hybrid dynamic programming
Oligo and probe construction

11
New Results since 1997

Genome fragment assembly and resequencing
(multiple) whole genome comparison
Large-scale sequence comparison in resequencing
projects
Misc. clever applications
Related less-used data structures

12
Suffix trees collect together repeated substring
prefixes
Example Speeding up the use of Position Specific
Scoring Matrices. Accelerating protein
classification ISMB 2000 position 1 2 3 4 5
A 3 1 4 2 1 T 2 4 4 3 2
C 1 3 5 6 2 G 4 2 4 6 2
AACTGAACTG.AACTG
PSSM
AACTG
Walk around the tree to depth 5
Starting locations of AACTG
13
Whole Genome Alignment with MUMs

Delcher Salzberg NAR 1999
A MUM is a Maximal Unique Matching substring in
two strings S and S
MUMMER finds all MUMs selects and aligns
non-overlapping pairs of MUMs and then recurses
in the regions between adjacent selected MUMs.
Key issue finding MUMs efficiently

14
Spotting MUMs with a suffix tree
7
Example S AATCCGTG. S
GATCCGTA
20 So ATCCGT is a MUM if it only occurs in
these two places
root
Path spelling ATCCGT
Exactly two sibling leaves
S7
S20
15
Finding MUMs in a suffix tree

Build a suffix tree for the concatenation of S
and S, noting at each leaf whether the suffix is
from the S part or the S part. Also note the
prior character in S or S for that position.
Do a DFS traversal of the S.T. to note at each
node how many leaves below it are from S and how
many from S.
A node corresponds to a MUM if and only if the
count there is 1 and 1, and the two prior
characters are different.
For total string length n, the time used is O(n).

16
Reducing the space needed for MUM findingMUMMER
II - NAR spring 2002
root
Path to a node v, spelling xB, where B is a string
Path to a node v, spelling B
v
v
Suffix link
17
Space reduction when finding MUMs

Given strings S and S, build a suffix tree T,
with suffix links , but for only the smaller of S
and S, say S.
Using T, find for each position i in S, the
longest substring starting at i which occurs in
S exactly once. Mark the end location in T of
that match. Each such match is a candidate MUM.

18
Finding MUMs in less space

To find the matches, walk S through T, using
suffix links to reduce search time.
After S has been completely processed, every
marked position in T that is visited only once
and has no marked position below it, corresponds
to a MUM.
So all MUMs can be found in O(n) time, but in
much reduced space.

19
Finding Tandem Repeats
Stoye and Gusfield, Theoretical Computer Science,
2001
TATAACTAACTAAGATT..
For a string of length n, a naïve algorithm might
use (on the order of) n3 operations to find all
T.R.s
20
Example Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem
Repeat. To extend a chain, examine the first
character of the T.R. and the character after the
end of the T.R. If they are the same, a new T.R.
exists on place to the right.
21
Example Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeats
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem Repeat.
22
Example Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeats
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem Repeat.
23
Example Finding Tandem Repeats in Strings
TATAACTAACTAAGATT.
Branching Tandem Repeat
Every Tandem Repeat is part of a chain of tandem
repeats ending with a Branching Tandem
Repeat. So to find all T.R.s find all
Branching T.Rs Use a suffix tree to efficiently
find Branching T.R.s
24
Finding Branching T.R.s

Every Branching T.R. ends at a node of the suffix
tree so examine each node.

root
String to here is length 8
leaves 5 , 13 and 21 are below v
v
5
13 5 8
21
25
Basic Algorithm

Build the S.T. for string S process it to find
depth of each node, and do a DFS numbering of the
nodes so that ancestry of two nodes can be
determined in constant time. (Classic property of
DFS)
Check each internal node to see if it is at the
first or second part of a tandem repeat.

26
How to check

From a node v of depth d(v), walk the subtree to
collect the leaf numbers below v.
If leaf k is collected, check if v is an ancestor
of leaf kd(v), and the next characters k1 and
kd(v) 1 differ. If so, then the string to v is
the first part of a Branching T.R.
Alternatively, focus on k-d(v) to see if the
string to v is the second part of a Branching
T.R.

27
AABAAB
1 2 3 4 5 6
S
Suffix Tree
B
A
A
A
B

B
A
3

B
A
5
6
A

4
B
A

A
2
B

1
28
Finding Tandem Repeats

This approach gives O(n2) time to find all
branching tandem repeats.The time can be reduced
to O(n log n) with a classic trick.
Related more complex result In O(n) time, one
can mark in the suffix tree, the endpoints of all
tandem repeat species. Gusfield and Stoye 2000
UCD techreport
Uses the fact (Frankel et al 1999) that there
are only 2n tandem repeat species complex
proof.
Example aaaaaaaaaa has only five tandem repeat
species, but 25 tandem repeat occurrences

29
Space issues again

Use virtual suffix tree or extended suffix
array to reduce space
Def for positions i and j in string S, LCP(i,j)
is the length of the longest matching substring
starting at positions i and j in S.

30
AABAAB
1 2 3 4 5 6
SUFFIX ARRAY With LCP or string depth information
B
A
A
A
B

B
A
3

B
A
5
6
A

4
B
A

4 1 5 2 3 6 3 1 2 0 1
A
2
Node depths at time of descent after a backup
B

LCP of neighbors
1
31
Extended Suffix Array

String S, the suffix array A for S, and the LCP
array for A, completely determine the suffix tree
T for S. This is the extended suffix array for
S.
The suffix array and the LCP array can be found
in O(n log n) time and small space, without an
explicit suffix tree.
For many efficient algorithms using an explicit
suffix tree, it is possible to simulate the
algorithm using only the extended suffix array,
greatly reducing space usage without increasing
time.
S. Kurtz WABI 2002

32
String Barcoding Oligo Construction

Uncovering Optimal Virus Signatures

Sam Rash and Dan Gusfield RECOMB April 2002
33
Motivation

Need for rapid virus detection
Given
unknown virus
database known viruses
Problem
identify unknown virus quickly
Ideal solution
have sequence of
viruses in database
unknown virus
Solution
use BLAST (or any sequence similarity
program/algorithm)

34
Motivation

Real World
only have sequence for pathogens in database
not possible to quickly sequence an unknown virus
can test for presence small (lt 50 bp) strings in
unknown virus
substring tests
Another Idea
String Barcoding
use substring tests to uniquely identify each
virus in the database
acquire unique barcode for each virus in database

35
Implementation

Basic Idea Formulate problem as an Integer
Linear Program (ILP)
Enumerate some useful set of substrings from S
variable in ILP for each substring
Constraint for each pair of strings in S
means that at least one substring will be chosen
to distinguish each pair
Objective Function
Minimize sum of variables in ILP

36
Problem Definition

Formal Definition
given
set of strings S
goal
find set of strings S, the testing set
wlog, for each s1,s2 in S, there exists at least
one u in S where u is a substring of only s1
u is a signature substring
minimize S
result
barcode for each element on S

37
Idea

strings
1. cagtgc
2. cagttc
3. catgga
Each node in the suffix tree has a corresponding
set of string IDs below it

Figure 1.1 - suffix tree for set of strings
cagtgc, cagttc, and catgga
38
Idea

strings
1. cagtgc
2. cagttc
3. catgga
Each node in the suffix tree has a corresponding
set of string IDs below it

Figure 1.1 - suffix tree for set of strings
cagtgc, cagttc, and catgga
39
Implementation Suffix Trees

root-edge walk
Creates string
appears in exactly the strings that label the
node at which it ends
2 root-edge walks ending on the same edge
Both strings created by the walk
occur in exactly the same set of original
strings
Can use ether string

example - a root edge walk
40
Practical Implementation

If two substrings occur in exactly the same set
of original strings, only one need be considered
Use strings from suffix tree for each uniquely
labeled node
Build ILP as discussed
Solve ILP using CPLEX
Acquire barcode and signatures for each original
string
signature is the set of substring tests occurring
in a string

41
Implementation Example
minimize V18 V22 V11 V17 V8 objective
function st V18 V22 V11 V17 V8 gt 2 this
is the theoretical minimum V18 V17 V8 gt
1 constraint to cover pair 1,2 V22 V11 V8
gt 1 constraint to cover pair 1,3 V18 V22
V11 V17 gt 1 constraint to cover pair
2,3 binaries all variables are 0/1 V18 V22
V11 V17 V8 end
Figure 1.4 - barcodes
Figure 1.5 - signatures
42
Implementation Extensions

minimum and maximum lengths on signature
substrings
acquire barcodes/signatures for only a subset of
input strings (wrt to whole set)
minimum string edit distance between chosen
signature substrings
redundancy
require r signature substrings to differentiate
each pair
adds a higher level of confidence that signatures
remain valid even with mutations

43
Results

Works quickly on most moderately sized datasets
(especially when redundancy gt 2)
dataset properties
50k virus genomes taken from NCBI (Genbank)
50-150 virus genomes
average length of each sequence 1000 characters
total input size ranged from approximately 50,000
150,000 characters
increasing dataset size scaled approximately
linearly
reach 25 gap (at most 1/3 more than optimum) in
just a few minutes
reach small gap (often lt 1) in 4 hours

44
Summary