Genome-scale disk-based suffix tree indexing


Provided by: chaitr5

Learn more at: https://www.cise.ufl.edu

1
Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee
Mohammed J. Zaki

Compiled by Amit Mahajan Chaitra Venus
2
Introduction

Growth in biological sequences database
Need for effective and efficient structure
Suffix Tree
Exact/approx. matching
Database querying
Longest common substrings etc.

3
Introduction

In-memory construction algorithms
Naive construction: O(n^2)
Can achieve linear time and space using:
suffix links
edge encoding
skip and count
Problem: they do not scale to large input sequences

4
Disk based Suffix trees

A Database Index to Large Biological Sequences
Abandons suffix links (for better locality of reference)
Partitions the input based on fixed-length prefixes
Faces problems with partition size because of data skew
Use of bin packing for partitions: expensive to count frequencies for long prefixes
Practical Suffix Tree Construction
TDD: similar to the above, drops suffix links
Reported to scale to human genome level
Random I/Os when input string size > memory

5
Disk based Suffix trees

ST-Merge (Improvement to TDD)
Input string is split into smaller contiguous substrings
Apply TDD on each substring and then merge all trees
Does not have suffix links
TOP-Q and DynaCluster
Only known algorithms that maintain suffix links
and do not have data skew problem
Experiments show that they do not scale to human
genome level

6
Issue

Problems with disk based algorithms
Data skew
No Suffix Links
No scalability
Authors propose a novel disk based suffix tree
algorithm called TRELLIS

7
TRELLIS

O(n^2) time, O(n) space
Idea
construct by partitioning and merging
use variable length prefixes
Recover suffix links in a separate post-construction phase
Effectively scales up to human genome level
Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours

8
TRELLIS

Has 4 different phases
Prefix Creation
Partitioning
Merging
Suffix Link Recovery

9
Prefix Creation Phase

Problems with fixed-length prefixes
Cannot handle data skew
No well-defined way to compute an appropriate length
TRELLIS makes use of variable-length prefixes
P = {P0, P1, P2, ..., Pm-1}
Use a threshold t to determine P such that freq(Pi) ≤ t

10
Prefix Creation Phase

Multi-scan approach to compute P
In the ith scan:
Process prefixes up to a certain length Li
(See formula below to calculate Li)
EPi = set of prefixes that need further extension in the next scan (as their frequency > t)
Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions
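The prefix selection rule above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it extends prefixes one character per pass instead of using the multi-scan schedule with lengths Li, and the function name, the `max_len` cap and the `$` sentinel are assumptions made for the example.

```python
from collections import Counter

def variable_length_prefixes(text, t, max_len=16, sentinel="$"):
    """Return (accepted, too_frequent).

    accepted maps each chosen prefix to its frequency (<= t).
    too_frequent holds prefixes that still exceed t at max_len.
    Assumes `sentinel` does not occur in `text` (hypothetical convention).
    """
    s = text + sentinel
    accepted = {}
    to_extend = {""}  # plays the role of EP: prefixes needing extension
    for length in range(1, max_len + 1):
        # Count only candidates whose (length - 1)-prefix was too frequent,
        # so extensions of already accepted prefixes are never considered.
        counts = Counter(
            s[i:i + length]
            for i in range(len(s) - length + 1)
            if s[i:i + length - 1] in to_extend
        )
        to_extend = set()
        for prefix, freq in counts.items():
            if freq <= t:
                accepted[prefix] = freq   # shortest prefix meeting t
            else:
                to_extend.add(prefix)     # try a longer prefix next pass
        if not to_extend:
            break
    return accepted, to_extend

if __name__ == "__main__":
    chosen, leftover = variable_length_prefixes("ACGTACGTAAAA", t=2)
    print(sorted(chosen.items()), leftover)
```

Because a longer prefix is only counted when its parent exceeded t, the accepted set is prefix-free, and every suffix of the sentinel-terminated string falls under exactly one accepted prefix.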

11
Prefix Creation Phase

Example:
With t = 10^6, only two stages were required for the human genome, with L1 = 8 and L2 = 16
The resulting set P contained about 6400 prefixes of lengths in the range 4 to 16

12
Partitioning Phase

Divide the input string into r consecutive partitions, where r = ⌈(n+1) / t⌉
Suffix Subtree TRi
Contains the suffixes that start in partition Ri
Use Ukkonen's algorithm to build it
Prefixed Suffix Subtree TRi,Pk
Split TRi into subtrees that contain only suffixes that have prefix Pk
At most m such subtrees
Store these prefixed suffix subtrees on disk
Ukkonen's algorithm was proposed in the paper "On-line construction of suffix trees" by E. Ukkonen
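As a rough illustration of how suffixes are distributed over partitions and prefixes, the sketch below groups suffix start positions by (partition index, prefix). It does not build any tree: each list of positions stands in for one prefixed suffix subtree TRi,Pk. The function name, the `$` sentinel, and rounding r up to a whole number of partitions are assumptions for the example.

```python
import math

def partition_suffixes(text, t, prefixes, sentinel="$"):
    """Group suffix start positions by (partition index, prefix).

    `prefixes` is assumed to be a prefix-free set covering every suffix of
    text + sentinel, such as the set P from the prefix creation phase.
    """
    s = text + sentinel
    n = len(text)
    r = math.ceil((n + 1) / t)            # number of partitions
    buckets = {}
    for i in range(r):
        start, end = i * t, min((i + 1) * t, n + 1)
        for pos in range(start, end):     # suffixes starting in partition i
            for p in prefixes:
                if s.startswith(p, pos):  # at most one prefix can match
                    buckets.setdefault((i, p), []).append(pos)
                    break
    return r, buckets

if __name__ == "__main__":
    P = ["C", "G", "T", "$", "AC", "A$", "AAA", "AA$"]
    r, buckets = partition_suffixes("ACGTACGTAAAA", t=5, prefixes=P)
    for key in sorted(buckets):
        print(key, buckets[key])
```

Each partition contributes at most one bucket per prefix, which matches the bound of at most m prefixed suffix subtrees per partition.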

13
Partitioning Phase

The TRi obtained are implicit suffix trees (i.e., some suffixes are part of internal edges)
To guarantee that TRi explicitly contains all suffixes from the ith partition:
Continue to read some characters from the next partition Ri+1 until t leaves are obtained in TRi
Cannot append a special character, as it would incur additional overhead during the merging phase

14
Merging Phase

For each prefix Pk in the set P
Merge all Prefixed Suffix Subtrees TRi,Pk to get the Prefixed Suffix Tree TPk
We get m Prefixed Suffix Trees
Store the resulting trees back to disk
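The merge step for a single prefix can be illustrated with plain character tries. This is only a stand-in under stated assumptions: TRELLIS merges edge-compressed suffix subtrees read from disk, whereas the sketch keeps uncompressed nested dictionaries in memory, and the helper names are hypothetical.

```python
LEAF = "leaf"  # multi-character key, so it cannot clash with a 1-char edge label

def build_trie(s, positions):
    """Uncompressed trie of the suffixes s[pos:] for the given positions."""
    root = {}
    for pos in positions:
        node = root
        for ch in s[pos:]:
            node = node.setdefault(ch, {})
        node[LEAF] = pos                  # remember where the suffix starts
    return root

def merge_tries(a, b):
    """Merge trie b into trie a (a is modified in place) and return a."""
    for label, child in b.items():
        if label in a and isinstance(child, dict):
            merge_tries(a[label], child)  # shared path: merge recursively
        else:
            a[label] = child              # new branch or leaf marker
    return a

def leaf_positions(trie):
    """Sorted start positions of all suffixes stored in the trie."""
    out = []
    for label, child in trie.items():
        if label == LEAF:
            out.append(child)
        else:
            out.extend(leaf_positions(child))
    return sorted(out)

if __name__ == "__main__":
    s = "ACGTACGTAAAA$"
    # Suffixes with prefix "C": position 1 (first partition), 5 (second one).
    merged = merge_tries(build_trie(s, [1]), build_trie(s, [5]))
    print(leaf_positions(merged))
```

The recursion descends only along paths the two tries share, which mirrors why the cost of a merge in the slides depends on the length of the common prefix.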

15
Suffix Link Recovery Phase

Why?
Suffix links are crucial for efficiency in many
fast string processing algorithms
Why in a separate phase?
TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted
It is useful to discard suffix link information after partitioning, as it reduces the amount of data per node
Recovering links from scratch takes the same time as keeping the original link information

16
Suffix Link Recovery Phase

TRELLIS recovers the suffix links of one Prefixed Suffix Tree at a time
Start with the children of the root
Proceeding in a depth-first fashion, do the following for each internal node x (p(x) = parent of x, sl(x) = suffix link target of x):
Locate p(x) and sl(p(x))
Count from sl(p(x)) to locate sl(x); when found, add the link
Do this recursively for all children of x
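The recovery procedure can be tried out on a small in-memory suffix tree. The sketch below is an assumption-laden illustration, not TRELLIS itself: it builds one complete tree with a naive O(n^2) insertion (instead of per-prefix trees on disk), requires a unique sentinel at the end of the string, and uses hypothetical names throughout.

```python
class Node:
    def __init__(self, start=0, end=0, parent=None):
        self.start, self.end = start, end  # incoming edge label is s[start:end]
        self.parent = parent
        self.children = {}                 # first edge character -> Node
        self.link = None                   # suffix link, filled in later

def build_suffix_tree(s):
    """Naive O(n^2) construction; s must end with a unique sentinel."""
    root = Node()
    for i in range(len(s)):
        node, pos = root, i
        while True:
            child = node.children.get(s[pos])
            if child is None:              # no edge: hang a new leaf here
                node.children[s[pos]] = Node(pos, len(s), node)
                break
            k = child.start
            while k < child.end and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:             # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k, node)   # mismatch: split the edge
            node.children[s[child.start]] = mid
            child.start, child.parent = k, mid
            mid.children[s[k]] = child
            mid.children[s[pos]] = Node(pos, len(s), mid)
            break
    return root

def recover_suffix_links(root, s):
    """Depth-first recovery: reach sl(x) by walking down from sl(p(x))."""
    root.link = root
    stack = [c for c in root.children.values() if c.children]
    while stack:
        x = stack.pop()
        start, end = x.start, x.end
        if x.parent is root:
            cur, start = root, start + 1   # drop the first character
        else:
            cur = x.parent.link            # sl(p(x)) is already known
        while start < end:                 # skip/count over whole edges
            cur = cur.children[s[start]]
            start += cur.end - cur.start
        x.link = cur
        stack.extend(c for c in x.children.values() if c.children)

def path_label(node, s):
    """String spelled out from the root down to `node`."""
    parts = []
    while node.parent is not None:
        parts.append(s[node.start:node.end])
        node = node.parent
    return "".join(reversed(parts))

if __name__ == "__main__":
    text = "banana$"
    tree = build_suffix_tree(text)
    recover_suffix_links(tree, text)
    ana = tree.children["a"].children["n"]
    print(path_label(ana, text), "->", path_label(ana.link, text))
```

The inner loop never compares characters inside an edge; it only uses edge lengths, which is the "count" step from the slide. This relies on sl(x) existing as an internal node, which holds for a complete suffix tree of a sentinel-terminated string.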

17
Choosing t

Note: t is also the threshold for partition size
M > n/4 + ((0.7 x 40) + 16)t + (0.7 x 40)t
M = available main memory
n/4 = memory for input (in compressed form)
internal nodes = 0.7 x (external nodes)
40, 16 are sizes of internal and external nodes
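The memory constraint can be turned around to find the largest admissible t for a given M and n. The sketch below reads the inequality as M > n/4 + (0.7 x 40 + 16)t + (0.7 x 40)t and assumes M and the node sizes are in the same unit (presumably bytes); the function name and defaults are illustrative.

```python
import math
from fractions import Fraction

def max_threshold(memory, n, internal_size=40, leaf_size=16,
                  internal_ratio=Fraction(7, 10)):
    """Largest integer t with memory > n/4 + cost_per_t * t (strict)."""
    # Per unit of t: 0.7 internal nodes plus 1 leaf, plus 0.7 internal nodes.
    cost_per_t = (internal_ratio * internal_size + leaf_size
                  + internal_ratio * internal_size)   # 72 with the defaults
    budget = Fraction(memory) - Fraction(n, 4)        # left after the input
    if budget <= 0:
        raise ValueError("memory cannot even hold the compressed input")
    # Strict inequality: t < budget / cost_per_t (may give 0 if memory is tight).
    return math.ceil(budget / cost_per_t) - 1

if __name__ == "__main__":
    print(max_threshold(memory=1000, n=400))
```

Exact fractions are used instead of floats so that the strict inequality is handled correctly when the budget is an exact multiple of the per-t cost.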

18
Computational Complexity

Prefix Creation Phase
O(nL) time, where L = longest prefix length
O(n?L1) space
Partitioning Phase
Input is broken into r partitions, each of size t
O(t) time/space for each => r x O(t) = O(n)
Disk I/Os: O(r x m), since at most m prefixed suffix subtrees can be created for each partition

19
Computational Complexity

Merging Phase
Each merge operation can be O(p), where p = length of the longest common prefix
Across all prefixes, merging is O(p x n), since the number of tree nodes in a suffix tree is O(n)
In the worst case p can be O(n), therefore merging is O(n^2)
Disk I/Os: O(r x m)

20
Computational Complexity

Suffix Link Recovery Phase
Internal nodes in the final suffix trees are O(n)
Constant set of operations for each suffix link recovery
Putting it all together:
O(n^2) time, since the most expensive phase is the merge phase
O(n) space

21
Experimental Setup

Compared to
TOP-Q and DynaCluster (maintain suffix links)
TDD (no suffix links)
Performed on Linux with
2 GB RAM for human genome and 512 MB for others
288 GB disk space
TRELLIS written in C++ and compiled with g++
Other algorithms obtained from their authors

22
Experimental Results

TRELLIS vs. TOP-Q and DynaCluster
For 200 Mbp, DynaCluster did not terminate even
after 8 hours, TRELLIS took 13 min

23
Experimental Results

TRELLIS vs. TDD
TDD uses four different buffers (string, suffix,
temp and tree)
200 Mbp requires only last 2 buffers
Saves additional I/O incurred in other cases

24
Experimental Results

TRELLIS vs. TDD
TDD is built using memory optimized suffix-tree
method
Difference is not significant for human genome as
TDD needs to be run in 64 bit mode

25
Experimental Results

TRELLIS vs. TDD Query time
TDD does not store edge lengths; they are determined by examining children
An internal node has a pointer to only one child, so all children are scanned linearly for every query

26
Conclusions

TRELLIS
Solves the data skew problem using variable-length prefixes
Scales gracefully for very large sequences
No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
It exhibits faster construction and query times when compared to other disk-based algorithms

27
Future Work

Plan to make TRELLIS applicable to a wider range of alphabets (e.g., the English alphabet)
No buffering strategy is required for the human genome, but start building one for use with a generalized suffix tree composed of many large genomes
Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited
