Genome-scale disk-based suffix tree indexing - PowerPoint PPT Presentation

Provided by: chaitr5
Learn more at: https://www.cise.ufl.edu

1
Genome-scale disk-based suffix tree indexing
  • Benjarath Phoophakdee
  • Mohammed J. Zaki

Compiled by Amit Mahajan and Chaitra Venus
2
Introduction
  • Growth in biological sequences database
  • Need for effective and efficient structure
  • Suffix Tree
  • Exact/approx. matching
  • Database querying
  • Longest common substrings etc.

3
Introduction
  • In-memory construction algorithms
  • Naive construction takes O(n^2) time
  • Can achieve linear time and space using suffix
    links, edge encoding, and skip and count
  • Problem: they do not scale to large input sequences

4
Disk based Suffix trees
  • A Database Index to Large Biological Sequences
  • Abandon suffix links (for better locality of
    reference)
  • Partition input based on fixed length prefixes
  • Partition sizes become uneven because of data
    skew
  • Using bin packing for partitions is expensive,
    since it requires counting frequencies of long
    prefixes
  • Practical Suffix Tree Construction
  • TDD: similar to the above, drops suffix links
  • Reported to scale to human genome level
  • Random I/Os when the input string is larger than
    memory

5
Disk based Suffix trees
  • ST-Merge (Improvement to TDD)
  • Splits the input string into smaller contiguous
    substrings
  • Applies TDD on each substring and then merges all
    trees
  • Does not have suffix links
  • TOP-Q and DynaCluster
  • Only known algorithms that maintain suffix links
    and do not have data skew problem
  • Experiments show that they do not scale to human
    genome level

6
Issue
  • Problems with disk based algorithms
  • Data skew
  • No Suffix Links
  • No scalability
  • Authors propose a novel disk based suffix tree
    algorithm called TRELLIS

7
TRELLIS
  • O(n^2) time, O(n) space
  • Idea
  • construct by partitioning and merging
  • use variable length prefixes
  • Recover suffix links in a separate
    post-construction phase
  • Effectively scales up to human genome level
  • Can index the entire human genome using 2 GB of
    memory in 4 hours, and recover suffix links in
    2 hours

8
TRELLIS
  • Has 4 different phases
  • Prefix Creation
  • Partitioning
  • Merging
  • Suffix Link Recovery

9
Prefix Creation Phase
  • Problems with fixed-length prefix
  • Cannot handle data skew
  • No well-defined way to compute an appropriate
    length
  • TRELLIS makes use of variable-length prefixes
  • P = {P0, P1, P2, ..., Pm-1}
  • Use some threshold t to determine P such that
    freq(Pi) <= t

10
Prefix Creation Phase
  • Multi-scan approach to compute P
  • ith scan
  • Process prefixes up to certain length Li
  • (See formula below to calculate Li)
  • EPi = set of prefixes that need further extension
    in the next scan (as their frequency > t)
  • Add to P only the shortest prefixes that meet
    the frequency threshold t, and reject their
    extensions
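
The selection rule above can be sketched in Python. This is a simplified illustration, not the authors' implementation: it extends prefixes by one character per pass over the string, whereas the slides process prefixes up to length Li in the i-th scan. The function name, the default DNA alphabet, and the in-memory Counter are assumptions made for the example. Suffixes near the end of the string that are shorter than every chosen prefix are not handled here.

```python
from collections import Counter

def variable_length_prefixes(s, t, alphabet="ACGT"):
    """Return the shortest prefixes whose frequency in s is at most t.

    Hypothetical helper: one pass over s per prefix length, which is a
    simplification of the multi-scan scheme described on the slide.
    """
    chosen = []        # the set P
    to_extend = [""]   # EP: prefixes whose frequency still exceeds t
    length = 0
    while to_extend:
        length += 1
        # One scan of the input: count every substring of this length.
        counts = Counter(s[i:i + length] for i in range(len(s) - length + 1))
        next_round = []
        for p in to_extend:
            for c in alphabet:
                q = p + c
                f = counts.get(q, 0)
                if f == 0:
                    continue           # q never occurs, nothing to index
                if f <= t:
                    chosen.append(q)   # meets the threshold, stop extending
                else:
                    next_round.append(q)
        to_extend = next_round
    return sorted(chosen)

if __name__ == "__main__":
    # A skewed input: the run of A's forces longer prefixes in that region.
    print(variable_length_prefixes("AAAAACGT", 2))
```

Because an extension is only considered when its parent exceeded t, no chosen prefix is a prefix of another, so each suffix falls under at most one Pi.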

11
Prefix Creation Phase
  • Example
  • With t = 10^6, only two scans were required for
    the human genome, with L1 = 8 and L2 = 16
  • Resulting set P contained about 6400 prefixes of
    lengths in the range 4 to 16

12
Partitioning Phase
  • Divide input string into r consecutive partitions
    where r = ⌈(n+1) / t⌉
  • Suffix Subtree TRi
  • Contains suffixes that start in partition Ri
  • Use Ukkonen's algorithm to build it
  • Prefixed Suffix Subtree TRi, Pk
  • Split TRi into subtrees that contain only
    suffixes that have prefix Pk
  • At most m such subtrees
  • Store these prefixed suffix subtrees on disk
  • Ukkonen's algorithm was proposed in the paper
    "On-line construction of suffix trees" by E. Ukkonen
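
As a rough sketch of how suffixes are distributed in this phase, the Python below splits a string into partitions of size t and, within each partition, groups suffix start positions by the prefix Pk they begin with. Lists of start positions stand in for the prefixed suffix subtrees; the slides build real subtrees with Ukkonen's algorithm and write them to disk. The function name and the assumption that the string already includes its terminal symbol (so its length plays the role of n+1) are choices made for this example.

```python
import math

def prefixed_suffix_groups(s, t, prefixes):
    """Group suffix start positions by partition and by prefix.

    Returns a list with one dict per partition Ri; each dict maps a prefix
    Pk to the start positions in Ri of suffixes that begin with Pk.
    This is a stand-in for the subtrees T(Ri, Pk), not a suffix tree.
    """
    r = math.ceil(len(s) / t)          # number of partitions
    groups = []
    for i in range(r):
        by_prefix = {}
        for start in range(i * t, min((i + 1) * t, len(s))):
            for p in prefixes:
                if s.startswith(p, start):
                    by_prefix.setdefault(p, []).append(start)
                    break              # prefixes are assumed prefix-free
        groups.append(by_prefix)       # at most len(prefixes) groups each
    return groups

if __name__ == "__main__":
    P = ["AAAA", "AAAC", "AAC", "AC", "C", "G", "T"]
    for i, g in enumerate(prefixed_suffix_groups("AAAAACGT", 4, P)):
        print(i, g)
```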

13
Partitioning Phase
  • The TRi obtained are implicit suffix trees (i.e.
    some suffixes are part of internal edges)
  • To guarantee that TRi explicitly contains all
    suffixes from the ith partition
  • Continue to read some characters from the next
    partition Ri+1 until t leaves are obtained in TRi
  • Cannot append a special terminal character, as it
    would incur additional overhead during the merging
    phase

14
Merging Phase
  • For each prefix Pk in the set P
  • Merge all Prefixed Suffix Subtrees TRi,Pk to
    get Prefixed Suffix Tree TPk
  • We get m Prefixed Suffix trees
  • Store the resulting trees back to disk
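
The merge step can be illustrated with a small Python sketch. It is a stand-in under stated assumptions: the subtrees here are uncompressed tries stored as nested dicts, and the string ends with a unique terminal character, whereas TRELLIS merges compressed suffix subtrees read from disk and avoids per-partition terminal characters. The names build_trie and merge_tries are hypothetical.

```python
def build_trie(s, starts):
    """Build an uncompressed trie of the suffixes s[start:] (nested dicts).

    The key None marks a leaf and stores the suffix start position.
    """
    root = {}
    for start in starts:
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start
    return root

def merge_tries(a, b):
    """Merge trie b into trie a in place and return a.

    Shared paths are walked together; branches that exist only in b are
    attached to a without copying.
    """
    for key, child in b.items():
        if key is None:
            a[None] = child            # leaf marker
        elif key in a:
            merge_tries(a[key], child)
        else:
            a[key] = child
    return a

if __name__ == "__main__":
    s = "banana$"
    # Suffixes starting with "a": positions 1 and 3 fall in the first
    # partition (size 4), position 5 in the second.
    merged = merge_tries(build_trie(s, [1, 3]), build_trie(s, [5]))
    print(merged == build_trie(s, [1, 3, 5]))
```

The walk along a shared path is where the O(p) cost per merge operation discussed later in the deck comes from.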

15
Suffix Link Recovery Phase
  • Why?
  • Suffix links are crucial for efficiency in many
    fast string processing algorithms
  • Why in a separate phase?
  • TRELLIS may discard all suffix link information
    during the merge phase as new internal nodes are
    created and some old ones are deleted
  • It is useful to discard suffix link information
    after partitioning as it reduces the amount of
    data per node
  • Recovering links from scratch takes the same time
    as keeping the original link information

16
Suffix Link Recovery Phase
  • TRELLIS recovers the suffix links of one Prefixed
    Suffix Tree at a time
  • Start with children of root
  • Proceeding in a depth-first fashion, do the
    following for each internal node x
  • Locate p(x), the parent of x, and sl(p(x)), the
    suffix link of p(x)
  • Count down from sl(p(x)) to locate sl(x); when
    found, add the link
  • Do this recursively for all children of x
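
The recovery procedure above can be sketched in Python on a single in-memory tree. The assumptions are significant: the tree is built by naive O(n^2) suffix insertion rather than by TRELLIS, it covers the whole string instead of one prefixed suffix tree, and the string must end with a unique terminal character. The Node class and function names are made up for the example.

```python
class Node:
    def __init__(self, start=0, end=0, parent=None):
        self.start, self.end = start, end   # edge label is s[start:end]
        self.parent = parent
        self.children = {}                  # first character -> child
        self.link = None                    # suffix link, filled in later

def build_suffix_tree(s):
    """Naive insertion of every suffix; s must end with a unique terminal."""
    root = Node()
    for i in range(len(s)):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, len(s), node)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:              # whole edge matched, descend
                node = child
                continue
            mid = Node(child.start, k, node)  # split the edge at position k
            node.children[s[child.start]] = mid
            child.start, child.parent = k, mid
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, len(s), mid)
            break
    return root

def recover_suffix_links(s, root):
    """Set x.link for every internal node x, depth first from the root."""
    root.link = root
    stack = [c for c in root.children.values() if c.children]
    while stack:
        x = stack.pop()
        a, b = x.start, x.end
        if x.parent is root:
            a += 1                          # drop the first character
        node = x.parent.link                # sl(p(x)) is already known
        while a < b:                        # skip/count: jump whole edges
            node = node.children[s[a]]
            a += node.end - node.start
        x.link = node
        stack.extend(c for c in x.children.values() if c.children)

def label(s, node):
    """Path label from the root to node, used only for checking."""
    parts = []
    while node.parent is not None:
        parts.append(s[node.start:node.end])
        node = node.parent
    return "".join(reversed(parts))

if __name__ == "__main__":
    s = "banana$"
    root = build_suffix_tree(s)
    recover_suffix_links(s, root)
    ana = root.children["a"].children["n"]
    print(label(s, ana), "->", label(s, ana.link))
```

The walk only compares the first character of each edge and then jumps by the edge length, which is valid because the target of a suffix link is known to exist as an explicit node.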

17
Choosing t
  • Note: t is also the threshold for partition size
  • M > n/4 + ((0.7 x 40) + 16) t + (0.7 x 40) t
  • M = available main memory
  • n/4 = memory for input (in compressed form)
  • number of internal nodes = 0.7 x (number of
    external nodes)
  • 40 and 16 are the sizes (in bytes) of internal
    and external nodes
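
Solving the inequality above for t gives t < (M - n/4) / 72, since (0.7 x 40 + 16) + (0.7 x 40) = 72. The Python sketch below does that arithmetic. The function name and the example figures (2 GB of memory, a genome of roughly 3 x 10^9 characters) are assumptions chosen for illustration, and the node sizes are treated as bytes.

```python
import math

def max_threshold(memory_bytes, n, internal_size=40, leaf_size=16, ratio=0.7):
    """Largest integer t with memory_bytes > n/4 + per_leaf * t.

    per_leaf combines both t-terms of the slide's inequality:
    (ratio * internal_size + leaf_size) + ratio * internal_size.
    """
    per_leaf = (ratio * internal_size + leaf_size) + ratio * internal_size
    budget = memory_bytes - n / 4          # memory left after the input
    if budget <= 0:
        return 0
    # Largest integer strictly below budget / per_leaf.
    return max(0, math.ceil(budget / per_leaf) - 1)

if __name__ == "__main__":
    M = 2 * 2**30          # 2 GB, assumed
    n = 3_000_000_000      # approximate genome length, assumed
    print(max_threshold(M, n))
```

Any t at or below the returned value satisfies the constraint, so a smaller value such as the t = 10^6 quoted earlier in the deck is also admissible under these assumed figures.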

18
Computational Complexity
  • Prefix Creation Phase
  • O(nL) time, where L = longest prefix length
  • O(n?L1)space
  • Partitioning Phase
  • Input is broken into r partitions and each
    partition is of size t
  • O(t) time/space for each partition, so
    r x O(t) = O(n) in total
  • Disk I/Os O(r x m) since at most m prefixed
    suffix subtrees can be created for each partition

19
Computational Complexity
  • Merging Phase
  • Each merge operation can be O(p), where
  • p = length of the longest common prefix
  • Across all prefixes, merging is O(p x n) since the
    number of tree nodes in a suffix tree is bounded
    by n
  • In the worst case p can be O(n), therefore merging
    is O(n^2)
  • Disk I/Os O(r x m)

20
Computational Complexity
  • Suffix Link Recovery Phase
  • Internal nodes in final suffix trees are O(n)
  • Constant set of operations for each suffix link
    recovery
  • Putting all together
  • O(n^2) time, since the most expensive phase is
    the merge phase
  • O(n) space

21
Experimental Setup
  • Compared to
  • TOP-Q and DynaCluster (maintain suffix links)
  • TDD (no suffix links)
  • Performed on Linux with
  • 2 GB RAM for human genome and 512 MB for others
  • 288 GB disk space
  • TRELLIS written in C++ and compiled with g++
  • Other algorithms obtained from their authors

22
Experimental Results
  • TRELLIS vs. TOP-Q and DynaCluster
  • For 200 Mbp, DynaCluster did not terminate even
    after 8 hours, TRELLIS took 13 min

23
Experimental Results
  • TRELLIS vs. TDD
  • TDD uses four different buffers (string, suffix,
    temp and tree)
  • 200 Mbp requires only last 2 buffers
  • Saves additional I/O incurred in other cases

24
Experimental Results
  • TRELLIS vs. TDD
  • TDD is built using memory optimized suffix-tree
    method
  • Difference is not significant for human genome as
    TDD needs to be run in 64 bit mode

25
Experimental Results
  • TRELLIS vs. TDD Query time
  • TDD does not store edge lengths; they are
    determined by examining children
  • An internal node has a pointer to only one child,
    so all children are scanned linearly for every
    query

26
Conclusions
  • TRELLIS
  • Solves the data skew problem using variable-length
    prefixes
  • Scales gracefully to very large sequences
  • No Disk I/O overhead as it works with suffix
    trees that are guaranteed to fit in memory
  • It exhibits faster construction and query times
    when compared to other disk based algorithms

27
Future Work
  • Plan to make TRELLIS applicable to a wider range
    of alphabets (e.g. the English alphabet)
  • No buffering strategy is required for the human
    genome, but one is planned for building a
    generalized suffix tree composed of many large
    genomes
  • Parallelize TRELLIS, since its partitioning and
    merging steps seem ideally suited