A Simpler Analysis of Burrows-Wheeler Based Compression

1
A Simpler Analysis of Burrows-Wheeler Based
Compression
  • Haim Kaplan, Shir Landau, Elad Verbin

2
Our Results
  • Improved bounds for one of the main BWT-based
    compression algorithms
  • A new technique for worst-case analysis of
    BWT-based compression algorithms using the Local
    Entropy
  • Interesting results concerning compression of
    integer strings

3
The Burrows-Wheeler Transform (1994)
  • Given a string S, the Burrows-Wheeler Transform
    creates a permutation of S that is locally
    homogeneous.

S → BWT → S'
S' is locally homogeneous
4
Empirical Entropy - Intuition
  • H0(s): the maximum compression we can get without
    context information, where a fixed codeword is
    assigned to each alphabet character (e.g.,
    Huffman code)
  • Hk(s): a lower bound for compression with order-k
    contexts, where the codeword representing each
    symbol depends on the k symbols preceding it
  • Traditionally, the compression ratio of compression
    algorithms is measured using Hk(s)
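
The two statistics above can be computed directly from symbol counts. The following is a minimal sketch, assuming the usual definitions (H0 from symbol frequencies; Hk as the length-weighted average of H0 over the symbols that follow each length-k context). The function names `h0` and `hk` are illustrative and do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Order-0 empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    counts = Counter(s)
    return sum(c / n * math.log2(n / c) for c in counts.values())


def hk(s, k):
    """Order-k empirical entropy of s, in bits per symbol.

    For every length-k context w, collect the symbols that follow w in s,
    then average the order-0 entropies of these collections, weighted by size.
    """
    n = len(s)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(k, n):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n


if __name__ == "__main__":
    text = "abababab"
    print(h0(text), hk(text, 1))
```

On a string such as `abababab`, the order-0 entropy is 1 bit per symbol, while the order-1 entropy drops to 0 because each symbol is fully determined by its predecessor.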

5
History
  • The Main Burrows-Wheeler Compression Algorithm
    (Burrows, Wheeler 1994)

6
MTF
Given a string S = baacb over the alphabet {a, b, c, d}:
S      = b a a c b
MTF(S) = 1 1 0 2 2
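
The example on this slide can be reproduced with a short move-to-front routine. This is a minimal sketch, assuming the recency list starts in alphabet order and ranks are counted from 0 (which is what the slide's output implies); `mtf_encode` and `mtf_decode` are illustrative names.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its current rank in a recency list,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for ch in s:
        rank = table.index(ch)
        out.append(rank)
        table.insert(0, table.pop(rank))
    return out


def mtf_decode(ranks, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for rank in ranks:
        ch = table.pop(rank)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    print(mtf_encode("baacb", "abcd"))  # [1, 1, 0, 2, 2]
```

A linear scan with `list.index` is enough for small alphabets; it costs time proportional to the alphabet size per symbol.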
7
Main Bounds (Manzini 1999)
  • gk is a constant dependent on the context length k
    and the size of the alphabet
  • These are worst-case bounds

8
Now we are ready to begin
9
Some Intuition
  • The more the contexts are similar in the original
    string, the more its BWT will exhibit local
    similarity
  • The more local similarity found in the BWT of the
    string the smaller the numbers we get in MTF
  • ⇒ We want a statistic that measures local
    similarity in a string, and specifically in the
    BWT of the string
  • ⇒ The solution: Local Entropy

10
The Local Entropy- Definition
  • We define, given a string s = s1 s2 ... sn,
  • the local entropy of s (Bentley, Sleator,
    Tarjan, Wei, 86)

11
The Local Entropy - Definition
  • Note: LE(s) is the number of bits needed to write
    the MTF sequence in binary.
  • Example:
  • MTF(s) = 3 1 1
  • ⇒ LE(s) = 4
  • ⇒ MTF(s) in binary: 11 1 1

In a dream world, we would like to compress S to
LE(S)
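
The example can be checked numerically. The sketch below assumes the definition LE(s) = sum over i of log2(MTF(s)[i] + 1); the formula itself did not survive in this transcript, but this form is consistent with the slide's example (3, 1, 1 gives 4). The name `local_entropy` is illustrative.

```python
import math


def local_entropy(mtf_ranks):
    """Sum of log2(rank + 1) over a move-to-front rank sequence.

    Assumed definition: it matches the slide's example, where the
    ranks 3, 1, 1 need 2 + 1 + 1 = 4 bits in binary.
    """
    return sum(math.log2(rank + 1) for rank in mtf_ranks)


if __name__ == "__main__":
    print(local_entropy([3, 1, 1]))  # 4.0
```

A rank of 0 contributes nothing under this definition, so long runs of a repeated symbol are free.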
12
The Local Entropy Properties
  • We use two properties of LE
  • The entropy hierarchy
  • Convexity

13
The Local Entropy - Property 1
  • The entropy hierarchy
  • We prove: for each k,
  • LE(BWT(s)) ≤ nHk(s) + O(1)
  • ⇒ Any upper bound that we get for BWT with LE
    holds for Hk(s) as well.

14
The Local Entropy - Property 2
  • Convexity
  • ⇒ This means that a partition of a string s does
    not improve the Local Entropy of s.

15
Convexity
  • Cutting the input string into parts doesn't
    influence much: only a limited number of positions
    per part are affected
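
The effect of a cut can be measured on a small example. The sketch below is an illustration, not the authors' proof: it assumes LE(s) is the sum of log2(rank + 1) over move-to-front ranks, and that each part restarts from the same initial list. Under those assumptions only the first occurrence of each symbol after the cut can change rank, so the gap is at most h·log2(h) for an alphabet of size h. All function names are hypothetical.

```python
import math


def mtf_encode(s, alphabet):
    """Move-to-front ranks of s, starting from the list `alphabet`."""
    table = list(alphabet)
    out = []
    for ch in s:
        rank = table.index(ch)
        out.append(rank)
        table.insert(0, table.pop(rank))
    return out


def local_entropy(s, alphabet):
    """Assumed definition: sum of log2(rank + 1) over the MTF ranks of s."""
    return sum(math.log2(r + 1) for r in mtf_encode(s, alphabet))


def split_gap(s, cut, alphabet):
    """LE of the whole string minus the summed LE of its two parts."""
    whole = local_entropy(s, alphabet)
    parts = local_entropy(s[:cut], alphabet) + local_entropy(s[cut:], alphabet)
    return whole - parts


if __name__ == "__main__":
    s, alphabet = "abracadabraabracadabra", "abcdr"
    bound = len(alphabet) * math.log2(len(alphabet))
    worst = max(abs(split_gap(s, c, alphabet)) for c in range(len(s) + 1))
    print(worst, "<=", bound)
```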

16
Convexity Why do we need it?
  • Ferragina, Giancarlo, Manzini and Sciortino, JACM
    2005

17
Using LE and its properties we get our bounds
  • Theorem For every where

Our LE bound
Our Hk bound
18
Our bounds
  • We get an improvement of the known bounds
  • As opposed to the known bounds (Manzini, 1999)

19
Our Test Results
The files are non-binary files from the
Canterbury corpus. gzip results are also taken
from the corpus. The size is indicated in bytes.
20
How is LE related to compression of integer
sequences?
  • We mentioned a dream world, but what about
    reality?
  • How close can we come to ?
  • Problem:
  • Compress an integer sequence S close to its sum
    of logs
  • Notice: for any s

21
Compressing Integer Sequences
  • Universal encodings of integers: prefix-free
    encodings for integers (e.g., Fibonacci encoding,
    Elias encoding).
  • Doing some math, it turns out that order-0
    encoding is good.
  • Not only good: it is best!
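
As an illustration of a universal prefix-free code, here is a minimal sketch of Elias gamma coding, one member of the Elias family mentioned above (the slide does not say which Elias code is meant, so this choice is an assumption). A positive integer n costs 2·floor(log2 n) + 1 bits, roughly twice its "log" cost.

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'.

    The binary form of n is preceded by (length - 1) zeros, so a decoder
    can tell where the codeword ends.
    """
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codewords into integers."""
    out = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out


if __name__ == "__main__":
    stream = "".join(elias_gamma_encode(n) for n in [1, 2, 5])
    print(stream, elias_gamma_decode(stream))
```

Because no codeword is a prefix of another, the codewords can be concatenated without separators and still be decoded unambiguously.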

22
The order-0 math
  • Theorem: For any string s of length n over the
    integer alphabet {1, 2, ..., h} and for any ,
  • Strange conclusion: we get an upper bound on the
    order-0 algorithm with a term dependent on the
    values of the integers.
  • This is true for all strings but is especially
    interesting for strings with smaller integers.

23
A lower bound for SL
  • Theorem: For any algorithm A and for any ,
    and any C such that C < log(ζ(µ)), there exists a
    string S of length n for which
  • |A(S)| > µ·SL(S) + C·n

24
Our Results - Summary
  • New improved bounds for BWMTF
  • Local Entropy (LE)
  • New bounds for compression of integer strings

25
Open Issues
  • We question the effectiveness of .
  • Is there a better statistic?

26
Any Questions?
27
Thank You!
Anybody want to guess??
28
Creating a Huffman encoding
  • For each encoding unit (letter, in this example),
    associate a frequency (number of times it occurs)
  • Create a binary tree whose children are the
    encoding units with the smallest frequencies
  • The frequency of the root is the sum of the
    frequencies of the leaves
  • Repeat this procedure until all the encoding
    units are in the binary tree

29
Example
Assume that the relative frequencies are
A: 40, B: 20, C: 10, D: 10, R: 20
30
Example, cont.
31
Example, cont.
  • Assign 0 to left branches, 1 to right branches
  • Each encoding is a path from the root

A = 0, B = 100, C = 1010, D = 1011, R = 11
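
The merging procedure from the previous slides can be sketched with a priority queue. This is a minimal illustration that tracks only codeword lengths; `huffman_code_lengths` is a hypothetical helper, and the insertion-order tie-breaking rule is an assumption (the slides do not specify one).

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code over freqs.

    Repeatedly merges the two lowest-frequency subtrees; every symbol in a
    merged subtree moves one level deeper. Ties are broken by insertion order.
    """
    tiebreak = count()
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, depths1 = heapq.heappop(heap)
        f2, _, depths2 = heapq.heappop(heap)
        merged = {sym: d + 1 for sym, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


if __name__ == "__main__":
    freqs = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
    lengths = huffman_code_lengths(freqs)
    print(lengths, sum(freqs[s] * lengths[s] for s in freqs))
```

Because several subtrees have equal frequency in this example, different tie-breaking rules give different codeword lengths than the code shown on the slide, but the weighted total length is the same.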
32
The Burrows-Wheeler Transform (1994)
Given a string S = banana
Rotations of S: banana, ananab, nanaba, anaban, nabana, abanan
[Figure: rotations shown with their first and last characters set apart]
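
The rotation-based construction can be sketched in a few lines. This is a naive illustration that sorts all rotations explicitly and reads off the last column; it assumes no end-of-string marker, so the output alone is not enough to invert the transform. The name `bwt` is illustrative.

```python
def bwt(s):
    """Burrows-Wheeler Transform via explicit sorting of all rotations.

    Builds every rotation of s, sorts them lexicographically, and returns
    the last character of each sorted rotation.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("banana"))  # nnbaaa
```

The result groups equal characters together, which is the local homogeneity the earlier slides refer to. Sorting full rotations takes quadratic space; practical implementations use suffix arrays instead.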
33
Suffix Arrays and the BWT
The Suffix Array: 7 6 4 2 1 5 3
Index of BWT:     6 5 3 1 7 4 2
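
The two rows above can be reproduced from the string itself. The sketch below assumes the input is `banana` followed by an end marker `$` that sorts before every letter (the slide shows seven entries for a six-letter string, which suggests a marker, but the marker symbol is an assumption). Positions are 1-based to match the slide; the function names are illustrative.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt_indices(sa, n):
    """Position of the BWT character for each suffix: the one just before
    the suffix start, wrapping around to n for the suffix starting at 1."""
    return [i - 1 if i > 1 else n for i in sa]


if __name__ == "__main__":
    s = "banana$"
    sa = suffix_array(s)
    idx = bwt_indices(sa, len(s))
    print(sa)   # [7, 6, 4, 2, 1, 5, 3]
    print(idx)  # [6, 5, 3, 1, 7, 4, 2]
    print("".join(s[j - 1] for j in idx))  # annb$aa
```

Sorting suffixes by slicing is quadratic in the worst case; it is used here only to keep the illustration short.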