Suffix Array for Large Alphabet
Radovan Šesták, Jan Lánský, Michal Žemlička
radofan, zizelevak _at_ gmail.com, michal.zemlicka_at_mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, Malostranské nám., 118 00 Praha 1, Czech Republic
- Motivation
  - The Burrows-Wheeler Transform (BWT) is the main part of block compression, which offers a good balance of speed and compression ratio.
  - Suffix arrays are used in the coding phase of BWT; we focus on creating them for alphabets larger than 256 symbols.
  - The motivation for this work was the software project XBW [7], an application for compressing large XML files.
  - The role of BWT is to reorder the input before other algorithms are applied.
  - Unlimited memory is assumed to be available; the block size for BWT is set to 100 MB.
- Algorithms for Suffix Array Sorting
  - Sadakane, further improved by Larsson [8]
    - In the first iteration, we sort the suffixes by their first symbol and divide them into groups: two suffixes are in the same group if they begin with the same character. Since the suffixes in any given group start with the same character, we can compare them starting from the second character; to do this we use the group values assigned in the first iteration. Each step doubles the number of characters by which the suffixes are sorted.
  - Seward [10]
    - The main idea of Seward's algorithm, called "copy", is to use the almost sorted 1-level buckets to avoid sorting some of the 2-level buckets. A k-level bucket is the group of suffixes that share the same prefix of length k. Once bucket c2 is sorted, we can determine the order within the buckets c1c2, for every c1 ≠ c2, by passing through bucket c2. If suffixes S_i and S_j start with the same letter, then their relative order is given by the relative order of S_{i+1} and S_{j+1}.
  - Itoh, further improved by Kao [4]
    - The main idea of Itoh's algorithm is to divide the suffixes into two groups so that only one group has to be sorted by a comparison-based algorithm. The order within the other group, and of all suffixes, can then be constructed in linear time. Group A contains suffix S_i if x_i > x_{i+1}; group B contains all other suffixes. Consider the suffixes divided into 1-level buckets. Among suffixes that start with the same symbol, those from group A are smaller than those from group B. If we now pass through group B in ascending order, we can determine the order of the suffixes in group A.
  - Kärkkäinen and Sanders [5]
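As a rough illustration of two of the ideas listed above, the following Python sketch builds a suffix array over any alphabet of comparable symbols (bytes, syllables, or words passed as a list). The function names are hypothetical and not taken from the cited papers. `prefix_doubling_suffix_array` follows the doubling scheme described for Sadakane and Larsson, but it simply re-sorts with a general-purpose sort in every round rather than refining groups as they do. `two_stage_suffix_array` follows Itoh's A/B split without Kao's modification; it sorts group B with naive slice comparisons and treats the last suffix as a member of group A, which is an assumption about end-of-text handling.

```python
from collections import Counter


def prefix_doubling_suffix_array(text):
    """Suffix array by prefix doubling (simplified: plain sort in each round)."""
    n = len(text)
    if n == 0:
        return []
    # Round 0: rank every suffix by its first symbol only.
    symbol_rank = {s: r for r, s in enumerate(sorted(set(text)))}
    rank = [symbol_rank[s] for s in text]
    sa = list(range(n))
    k = 1
    while True:
        # Compare 2k symbols by pairing the ranks of two k-symbol halves.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every suffix is alone in its group
            return sa
        k *= 2


def two_stage_suffix_array(text):
    """Suffix array via an A/B split: sort group B, then induce group A."""
    n = len(text)
    if n == 0:
        return []
    # Group A: x[i] > x[i+1]. Assumption: the last suffix counts as group A,
    # because the end of the text is treated as smaller than every symbol.
    in_a = [i == n - 1 or text[i] > text[i + 1] for i in range(n)]
    # 1-level buckets: positions front[c] .. back[c]-1 hold suffixes starting with c.
    counts = Counter(text)
    front, back, pos = {}, {}, 0
    for c in sorted(counts):
        front[c] = pos
        pos += counts[c]
        back[c] = pos
    sa = [None] * n
    # Stage 1: comparison sort of group B only (naive slice keys, for clarity).
    group_b = sorted((i for i in range(n) if not in_a[i]), key=lambda i: text[i:])
    for i in reversed(group_b):  # group B fills the back of each bucket
        back[text[i]] -= 1
        sa[back[text[i]]] = i
    # Stage 2: the last suffix is the smallest in its bucket; then one ascending
    # pass places each group A suffix i-1 when its successor i is visited.
    sa[front[text[-1]]] = n - 1
    front[text[-1]] += 1
    for j in range(n):
        i = sa[j]
        if i > 0 and in_a[i - 1]:
            c = text[i - 1]
            sa[front[c]] = i - 1
            front[c] += 1
    return sa
```

For the word sequence `"to be or not to be".split()`, both functions return `[5, 1, 3, 2, 4, 0]`. Neither sketch is tuned for 100 MB blocks; the slice-based sort in stage 1 in particular is only there to keep the example short.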
References
- RESULTS
  - BWT algorithm speed
    - Alphabet of bytes: Seward is best.
    - Alphabet of words or syllables: Kao's modification of Itoh's algorithm is best.
  - Compression speed
    - BWT + MTF + RLE + AC has the best results on the alphabet of words. Syllables are 1.5 times slower and bytes 2.5 times slower.
  - Compression ratio
    - Large files: syllables and words are best; bytes and 2-bytes are worse.
    - Small files: syllables are best.
- CONCLUSION
  - For compressing text files with BWT + MTF + RLE + AC when unlimited memory is available, we recommend Kao's modification of Itoh's algorithm over the alphabet of words.
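To make the word-alphabet pipeline more concrete, here is a minimal Python sketch of its first two stages, BWT followed by MTF. Several things in it are assumptions made for illustration: words are obtained by splitting on whitespace, each word is mapped to an integer id with 0 reserved as an end-of-block marker, and a naive suffix sort stands in for whichever suffix array algorithm is actually used. The function names are hypothetical, and the RLE and arithmetic coding stages are left out.

```python
def words_to_ids(text):
    """Map whitespace-separated words to ids 1..m and append end marker 0."""
    words = text.split()  # assumption: simplistic word splitting
    alphabet = sorted(set(words))
    index = {w: i + 1 for i, w in enumerate(alphabet)}
    return [index[w] for w in words] + [0], alphabet


def bwt_from_suffix_array(ids):
    """BWT of an id sequence that ends with the unique smallest marker 0."""
    # Naive suffix sort; any suffix array construction could be used instead.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    # The BWT symbol is the one preceding each sorted suffix;
    # for i == 0, ids[-1] wraps around to the end marker.
    return [ids[i - 1] for i in sa]


def move_to_front(ids, alphabet_size):
    """Replace each id by its current position in a move-to-front table."""
    table = list(range(alphabet_size))
    out = []
    for s in ids:
        r = table.index(s)
        out.append(r)
        table.insert(0, table.pop(r))
    return out


def bwt_mtf_over_words(text):
    """Run the two stages over the word alphabet of the given text."""
    ids, alphabet = words_to_ids(text)
    bwt = bwt_from_suffix_array(ids)
    return bwt, move_to_front(bwt, len(alphabet) + 1), alphabet
```

For the input `"to be or not to be"` the word alphabet is `['be', 'not', 'or', 'to']`, the BWT over ids is `[1, 4, 4, 3, 1, 2, 0]`, and the MTF output is `[1, 4, 0, 4, 2, 4, 4]`. With a word alphabet the MTF table can hold far more than 256 entries, which is why the linear `table.index` lookup here is only suitable for a sketch.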
[Figure: Influence of alphabet on run time of BWT (sec)]
[Figure: Influence of alphabet on run time (sec)]
[Figure: Influence of alphabet on compression ratio (bpc)]
[Figure: Run time of BWT algorithms for words (sec)]