Suffix Array for Large Alphabet
Radovan Šesták, Jan Lánský, Michal Žemlička
radofan, zizelevak _at_ gmail.com, michal.zemlicka_at_mff.cuni.cz
Charles University, Faculty of Mathematics and Physics,
Malostranské nám., 118 00 Praha 1, Czech Republic
  • Motivation
  • The Burrows-Wheeler Transform (BWT) is the main
    part of block compression, which has a good
    balance of speed and compression ratio.
  • Suffix arrays are used in the coding phase of BWT;
    we focus on creating them for an alphabet
    larger than 256 symbols.
  • The motivation for this work has been the software
    project XBW [7], an application for compression
    of large XML files.
  • The role of BWT is to reorder the input before
    other algorithms are applied.
  • We assume unlimited memory is available; the block
    size for BWT is set to 100 MB.
  • Algorithms for Suffix Array Sorting
  • Sadakane, further improved by Larsson [8]
  • In the first iteration, we sort suffixes
    according to their first symbols and divide them
    into groups. Two suffixes are in the same group
    if they begin with the same symbol. Since the
    suffixes in any given group start with the same
    symbol, we can compare them starting with the
    second symbol; to do this we use the group
    numbers assigned to the suffixes in the first
    iteration. In each step we double the number of
    symbols by which the suffixes are sorted.
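
The doubling step can be sketched in a few lines of Python. This is only a minimal illustration of the idea described above, not the Sadakane or Larsson implementation: it re-sorts the whole array in every round with the built-in sort, whereas the real algorithms skip groups that are already fully sorted. The function name `suffix_array_doubling` and the use of -1 as the value "past the end of the text" are assumptions made for this sketch. The input may be a string or a list of words or syllables, which is what makes the alphabet large.

```python
def suffix_array_doubling(text):
    """Return the suffix array of `text` (a str or a list of comparable symbols)."""
    n = len(text)
    if n == 0:
        return []
    # First iteration: the group number of a suffix is the rank of its first symbol.
    code = {s: r for r, s in enumerate(sorted(set(text)))}
    rank = [code[s] for s in text]
    sa = list(range(n))
    k = 1  # number of leading symbols the current group numbers account for
    while True:
        # A suffix is compared by the group of its first k symbols, then by the
        # group of the following k symbols (-1 if that part lies past the end).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Assign new group numbers: equal keys stay in the same group.
        new_rank = [0] * n
        for pos in range(1, n):
            differs = key(sa[pos]) != key(sa[pos - 1])
            new_rank[sa[pos]] = new_rank[sa[pos - 1]] + (1 if differs else 0)
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every suffix is alone in its group
            return sa
        k *= 2  # double the number of symbols the order is based on


print(suffix_array_doubling("banana"))
print(suffix_array_doubling("to be or not to be".split()))
```

Because the symbols are only ever compared and used as dictionary keys, the same function works for bytes, syllables, or words.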
  • Seward [10]
  • The main idea of Seward's algorithm, called
    "copy", is to use almost sorted 1-level buckets
    to avoid sorting some 2-level buckets. The
    k-level buckets are groups of suffixes that share
    the same prefix of length k. If we have sorted
    bucket c2, we can determine the order of buckets
    c1c2, c1 ≠ c2, by passing bucket c2. If suffixes
    Si and Sj start with the same symbol, then their
    relative order is given by the relative order of
    Si+1 and Sj+1.
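
The property that "copy" relies on can be shown with a small Python sketch. It is not Seward's implementation: the sorted bucket c2 is obtained here by a naive comparison sort, and the helper names `sorted_bucket` and `copy_from_bucket` are assumptions made for this illustration. The sketch only demonstrates that a single ascending pass over a sorted bucket c2 yields every 2-level bucket c1c2 (c1 ≠ c2) in sorted order without any further comparisons.

```python
from collections import defaultdict


def sorted_bucket(text, c):
    """Naively sort the 1-level bucket of `c` (all suffixes starting with c)."""
    members = [i for i in range(len(text)) if text[i] == c]
    return sorted(members, key=lambda i: text[i:])


def copy_from_bucket(text, bucket_c2, c2):
    """Derive the sorted 2-level buckets c1c2 (c1 != c2) from sorted bucket c2."""
    derived = defaultdict(list)
    for j in bucket_c2:  # ascending pass over bucket c2
        if j > 0 and text[j - 1] != c2:
            # Suffix j-1 starts with c1c2; among suffixes sharing c1 its order
            # is decided by the order of suffix j, which the pass already follows.
            derived[text[j - 1]].append(j - 1)
    return dict(derived)


text = "abracadabra"
bucket_a = sorted_bucket(text, "a")
print(bucket_a)
print(copy_from_bucket(text, bucket_a, "a"))
```

For the bucket of "a" in "abracadabra", the pass produces the bucket "ra" as [9, 2], which matches the lexicographic order of the suffixes "ra" and "racadabra".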
  • Itoh, further improved by Kao [4]
  • The main idea of Itoh's algorithm is to
    divide the suffixes into two groups so that we
    only have to sort one group using a
    comparison-based algorithm. The order in the
    other group, and of all suffixes, can then be
    constructed in linear time. Group A contains
    suffix Si if xi > xi+1, and group B contains all
    other suffixes. Consider suffixes divided into
    1-level buckets. Suffixes from group A are
    smaller than suffixes from group B if they start
    with the same symbol. Now if we pass group B in
    ascending order, we can determine the order of
    suffixes from group A.
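
A minimal two-stage sketch in Python, under several assumptions: group B is sorted with a naive comparison sort on whole suffixes, the last suffix is treated as a group A member (as if the text were followed by an end marker smaller than every symbol), and the ascending pass runs over the whole array, so group A suffixes placed earlier are visited as well. Kao's modification is not included. The name `suffix_array_two_stage` is hypothetical.

```python
def suffix_array_two_stage(text):
    """Suffix array of `text` (str or list of comparable, hashable symbols)."""
    n = len(text)
    if n == 0:
        return []
    # Group A: x[i] > x[i+1]. The last suffix is counted as A (assumption:
    # an implicit end marker smaller than every symbol follows the text).
    is_a = [i == n - 1 or text[i] > text[i + 1] for i in range(n)]

    # Boundaries of the 1-level buckets: [head[c], tail[c]) for each symbol c.
    symbols = sorted(set(text))
    size = {c: 0 for c in symbols}
    for c in text:
        size[c] += 1
    head, tail, pos = {}, {}, 0
    for c in symbols:
        head[c] = pos
        pos += size[c]
        tail[c] = pos

    sa = [-1] * n
    # Stage 1: comparison-sort group B only and place it at the end of each
    # bucket, since group A suffixes are smaller within the same bucket.
    group_b = {c: [] for c in symbols}
    for i in range(n):
        if not is_a[i]:
            group_b[text[i]].append(i)
    for c in symbols:
        ordered = sorted(group_b[c], key=lambda i: text[i:])
        sa[tail[c] - len(ordered):tail[c]] = ordered

    # Stage 2: the last suffix is the smallest in its bucket; then one
    # ascending pass places every other group A suffix without comparisons.
    c_last = text[n - 1]
    sa[head[c_last]] = n - 1
    head[c_last] += 1
    for p in range(n):
        i = sa[p] - 1  # predecessor of the suffix at position p
        if i >= 0 and is_a[i]:
            c = text[i]
            sa[head[c]] = i  # next free slot at the front of bucket c
            head[c] += 1
    return sa


print(suffix_array_two_stage("banana"))
```

The pass works because a group A suffix Si has xi > xi+1, so Si+1 lies in an earlier bucket and has already been placed by the time the pass reaches it.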
  • Kärkkäinen and Sanders [5]

References
RESULTS
  • BWT algorithm speed
  • Alphabet of bytes: Seward is best.
  • Alphabet of words, syllables: Kao's modification
    of Itoh's algorithm is best.
  • Compression speed
  • BWT + MTF + RLE + AC has the best results on the
    alphabet of words. Syllables are 1.5 times slower
    and bytes 2.5 times slower.
  • Compression ratio
  • Large files: syllables and words are best, bytes
    and 2-bytes are worse.
  • Small files: syllables are best.
  • CONCLUSION
  • For compressing text files with BWT + MTF + RLE +
    AC, when unlimited memory is available, we
    recommend Kao's modification of Itoh's algorithm
    over the alphabet of words.

[Chart: Influence of run time of BWT on alphabet (sec)]
[Chart: Influence of alphabet on run time (sec)]
[Chart: Influence of alphabet on compression ratio (bpc)]
[Chart: Run time of BWT algorithms for words (sec)]