Advanced Seminar in Data Structures: Transcript and Presenter's Notes

1
Advanced Seminar in Data Structures 
  • 28/12/2004
  • An Analysis of the Burrows-Wheeler Transform
    (Giovanni Manzini)

Presented by Assaf Oren
2
Topics
  • Introduction
  • Burrows-Wheeler Transform
  • Move-to-Front
  • Empirical Entropy
  • Order-0 coder
  • Analysis of the BW0 algorithm
  • Run-Length encoding
  • Analysis of the BW0_RL algorithm

3
Introduction
  • A BWT-based algorithm:
  • Takes the input string s
  • Transforms it to bwt(s)
  • |bwt(s)| = |s|
  • Compresses bwt(s) with a compressor A
  • The compressed string is A(bwt(s))

4
Introduction (cont)
  • Notation
  • Recoding scheme tra()
  • A transformation that performs no compression
  • Coding scheme A()
  • An algorithm designed to reduce the size of the input

5
BWT-based Alg Properties
  • Even when using a simple compression algorithm, A(bwt(s)) achieves a good
    compression ratio
  • The very simple and clean algorithm from [Nelson 1996] outperforms the
    PkZip package
  • Other, more advanced BWT compressors are Bzip [Seward 1997] and Szip
    [Schindler 1997]
  • BWT-based compressors achieve a very good compression ratio using
    relatively small resources
  • [Arnold and Bell 2000, Fenwick 1996a]

6
Output of "man bzip2" (as shown on the slide):

NAME
       bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
       bunzip2 [ -fkvsVL ] [ filenames ... ]
       bzcat [ -s ] [ filenames ... ]
       bzip2recover filename

DESCRIPTION
       bzip2 compresses files using the Burrows-Wheeler block sorting
       text compression algorithm, and Huffman coding.  Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.
7
BWT-based Alg Properties (cont)
  • BWT-based algorithms work very well in practice, but no satisfactory
    analysis of their compression ratio had been given
  • Previous analyses were done:
  • Assuming the input string is produced by a finite-order Markov source
  • [Sadakane 1997, 1998]
  • To get bounds on the speed at which the average compression ratio
    approaches the entropy
  • [Effros 1999]

8
The Burrows-Wheeler Transform
  • Background
  • Part of research at DIGITAL, released in 1994
  • Based on a previously unpublished transformation discovered by Wheeler
    in 1983
  • Technical
  • The resulting output block contains exactly the
    same data elements that it started with
  • Performed on an entire block of data at once
  • Reversible

9
The Burrows-Wheeler Transform (cont)
  • Append $ to the end of s
  • $ is unique and smaller than any other character
  • Form a matrix M whose rows are the cyclic shifts of s$
  • Sort the rows right to left

10
The Burrows-Wheeler Transform (cont)
  • The output of BWT is the column F, msspipissii, together with the
    number 3 (the position of $ in F)
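For concreteness, a minimal Python sketch of the construction just described (the function name bwt and the use of '$' as the end marker are our choices): it appends the marker, forms the cyclic shifts, sorts the rows right to left, and reads off column F together with the marker's position.

```python
def bwt(s: str) -> tuple[str, int]:
    """Right-to-left-sorted BWT variant described on the previous slides."""
    t = s + "$"                                     # unique end marker, smaller than every character
    rows = [t[i:] + t[:i] for i in range(len(t))]   # all cyclic shifts of s$
    rows.sort(key=lambda r: r[::-1])                # sort the rows right to left
    f = "".join(r[0] for r in rows)                 # read off column F
    pos = f.index("$") + 1                          # 1-based position of the marker in F
    return f.replace("$", ""), pos

print(bwt("mississippi"))   # -> ('msspipissii', 3)
```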

11
The Burrows-Wheeler Transform (cont)
  • Observations:
  • 1. Every column of M is a permutation of s
  • 2. Each character in L is followed in s by the corresponding character
    in F
  • 3. For any character c, the ith occurrence of c in F corresponds to the
    ith occurrence of c in L
  • How to reconstruct s:
  • Sort bwt(s) to get column L (column F is bwt(s))
  • F[1] is the first character of s
  • By applying observation 3 we get that this m is the same m as L[6], and
    observation 2 then tells us that F[6] is the next character of s
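A matching sketch of the reconstruction (again, the name ibwt and the re-insertion of '$' are ours) follows the observations directly: L is F sorted, the ith occurrence of a character in F matches the ith occurrence in L, and each character of L is followed in s by the F character of the same row.

```python
from collections import defaultdict

def ibwt(f: str, pos: int) -> str:
    """Invert the right-to-left BWT variant, given column F (without '$') and the marker position."""
    f_col = f[:pos - 1] + "$" + f[pos - 1:]         # re-insert the end marker into F
    l_col = sorted(f_col)                           # column L is just F sorted
    where_in_l = defaultdict(list)                  # index in L of each occurrence of each character
    for j, c in enumerate(l_col):
        where_in_l[c].append(j)
    rank_in_f, seen = [], defaultdict(int)          # occurrence rank of every position of F
    for c in f_col:
        rank_in_f.append(seen[c])
        seen[c] += 1
    i, out = l_col.index("$"), []                   # the row ending with '$' starts with s[1]
    while f_col[i] != "$":
        out.append(f_col[i])                        # emit the current character of s
        i = where_in_l[f_col[i]][rank_in_f[i]]      # observation 3, then observation 2: next char is F of that row
    return "".join(out)

print(ibwt("msspipissii", 3))   # -> 'mississippi'
```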

12
The Burrows-Wheeler Transform (cont)
13
The Burrows-Wheeler Transform (cont)
  • Why is this transform so helpful?
  • BWT collects together the symbols following a
    given context.
  • Formally
  • For each substring w of s, the characters
    following w in s are grouped together inside
    bwt(s)
  • More formally!!!

14
Move-to-Front (mtf)
  • Another recoding scheme
  • Suggested by BW to be used after applying BWT on the string s
  • s' = mtf(bwt(s))
  • |mtf(bwt(s))| = |bwt(s)| = |s|
  • If s is over {a1, a2, ..., ah} then s' is over {0, 1, ..., h-1}

15
Move-to-Front (cont)
  • For each letter (left to right):
  • Write the number of distinct other letters seen since the last time the
    current letter appeared (a code sketch follows the example below)
  • Example

s       a a b a c a a c c b a
mtf(s)  0 0 1 1 2 1 0 1 0 2 2
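The sketch referred to above, in Python (the name mtf is ours; the recency list is assumed to start in alphabetical order, which is what reproduces the example):

```python
def mtf(s: str) -> list[int]:
    """Emit each symbol's position in a recency list, then move it to the front."""
    order = sorted(set(s))          # recency list, assumed to start in alphabetical order
    out = []
    for ch in s:
        i = order.index(ch)         # distinct symbols seen since ch last appeared
        out.append(i)
        order.pop(i)
        order.insert(0, ch)         # move ch to the front
    return out

print(mtf("aabacaaccba"))   # -> [0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2]
```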
16
Move-to-Front (cont)
  • Why is this transform helpful?
  • It turns the local homogeneity of bwt(s) into global homogeneity
  • Formally: if we had two locally homogeneous stretches dominated by
    different symbols, after mtf both strings will probably consist of the
    same small numbers

17
Huffman coding
  • Assigns binary codewords to letters according to their frequency
  • For example:
  • A = {a, b, c}
  • In our string the frequencies are: a: 300, b: 150, c: 150
  • The coding will be: a: 0, b: 10, c: 11
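A minimal Python sketch of the construction (the helper name huffman_code is ours); on the frequencies above it reproduces the code a -> 0, b -> 10, c -> 11 by repeatedly merging the two least frequent subtrees:

```python
import heapq
from collections import Counter

def huffman_code(s):
    """Build a Huffman code: frequent symbols get short codewords."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(s).items())]
    heapq.heapify(heap)
    tie = len(heap)                                     # tie-breaker so tuples never compare the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)                 # the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {sym: "0" + w for sym, w in c1.items()}
        merged.update({sym: "1" + w for sym, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_code("a" * 300 + "b" * 150 + "c" * 150))  # -> {'a': '0', 'b': '10', 'c': '11'}
```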
18
Arithmetic coding
19
The Empirical Entropy of a string
  • s: our string
  • n = |s|
  • A: our alphabet
  • h = |A|
  • n_i: the number of occurrences of the symbol a_i inside s
  • H_0(s): the zeroth order empirical entropy of s
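In the notation above, the standard definition is

    H_0(s) = - \sum_{i=1}^{h} (n_i / n) \log_2 (n_i / n)

so n * H_0(s) is a lower bound on the output size of any compressor that encodes each symbol of s with a fixed codeword.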

20
Intuition for the Empirical Entropy
n * H_0(s) sums, for each symbol, and for each appearance of this symbol in
the text, the number of bits needed to represent that appearance with an
ideal uniquely decodable code.
21
The kth order Empirical Entropy
  • We can achieve greater compression if the codeword depends on the k
    symbols that precede the coded symbol
  • For example, for s = abcabcabd the symbols that follow the context ab
    form the string ab_s = ccd, so their codewords can be chosen based on
    that context
  • And formally we can define:
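With w_s denoting the string of symbols that follow the occurrences of w in s (so ab_s = ccd in the example above), the kth order empirical entropy is defined as

    H_k(s) = (1/|s|) \sum_{w \in A^k} |w_s| H_0(w_s)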

22
Examples of Hk(s)
  • Example 1
  • k = 1, s = mississippi
  • m_s = i, i_s = ssp, s_s = sisi, p_s = pi
  • Example 2
  • k = 1, s = cc(ab)^n
  • a_s = b^n, b_s = a^(n-1), c_s = ca
  • H_0(a_s) = 0, H_0(b_s) = 0, H_0(c_s) = 1
23
The modified Empirical Entropy
  • Modified in order to avoid cases in which H_k(s) = 0 (e.g. s = a^n)
    even though any encoding of s must still use some bits

24
Empirical Entropy and BWT
  • We saw that
  • We know that
  • If we had an ideal algorithm A
  • ⇒ We get
  • ⇒ We have reduced the problem of compressing up to the kth order entropy
    to the problem of compressing distinct portions of the input string up
    to their zeroth order entropy
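Spelled out, the reduction rests on the following identity: since bwt(s) groups together, for every length-k context w, exactly the symbols w_s that follow w in s, the definition of H_k gives

    \sum_{w \in A^k} |w_s| H_0(w_s) = |s| H_k(s)

so compressing each group down to its zeroth order entropy compresses s down to its kth order entropy overall.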

25
An Order-0 coder
  • A coder whose compression ratio is close to the zeroth order empirical
    entropy
  • Formally:
  • For static Huffman coding, μ = 1
  • For a simple arithmetic coder, μ ≈ 10^-2
  • [Howard and Vitter 1992a]
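One standard way to state this formally (our phrasing): an order-0 coder is one for which there is a constant μ such that, for every string x,

    |Order0(x)| <= |x| H_0(x) + μ |x|   bits,

with the values of μ quoted above holding for static Huffman coding and for a simple arithmetic coder.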

26
Analysis of the BW0 algorithm
  • BW0(s) = Order0(mtf(bwt(s)))
  • We would like to achieve
  • For now, let's assume Theorem 4.1 on mtf(s)
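As a rough end-to-end illustration (not Manzini's implementation: it simply strings together the hypothetical bwt, mtf and huffman_code sketches from the earlier slides, with Huffman standing in for the order-0 coder):

```python
def bw0(s: str):
    """BW0(s) = Order0(mtf(bwt(s))), with a Huffman table as the order-0 stage."""
    f, pos = bwt(s)                              # BWT sketch from slide 10
    ranks = mtf(f)                               # move-to-front sketch from slide 15
    code = huffman_code(ranks)                   # order-0 coder stand-in from slide 17
    return "".join(code[r] for r in ranks), pos

bits, pos = bw0("mississippi")                   # the analysis below bounds len(bits)
```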

27
Proof of BW0
  • We saw that if ... then for t ≤ h^k
  • For ... combined with Theorem 4.1
  • With our knowledge of Order0
  • ⇒ We get

28
Proof of Theorem 4.1
  • Lemma 4.3
  • Lemma 4.4

29
Proof of Theorem 4.1 (cont)
  • Lemma 4.5
  • Lemma 4.6
  • Lemma 4.7

30
Proof of Theorem 4.1 (cont)
  • Lemma 4.8
  • It is sufficient to prove that

31
Proof of Theorem 4.1 (cont)
  • By applying Lemmas 4.3 and 4.5 we get
  • And
  • And

32
Analysis of the BW0_RL algorithm
  • BW0_RL(s) = Order0(RLE(mtf(bwt(s))))
  • RLE(s):
  • Let 0 and 1 be two new symbols that do not belong to the alphabet
  • For m ≥ 1, B(m) = m+1 written in binary with 0 and 1, discarding the
    most significant bit
  • B(1) = 0, B(2) = 1, B(3) = 00, B(4) = 01, B(5) = 10
  • RLE(s) replaces each maximal run of m zeros in s with B(m)
  • Given s = 110022013000, RLE(s) = 1112201300
  • |RLE(s)| ≤ |s|, since ⌊log(m+1)⌋ ≤ m
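A small Python sketch of this encoding (the name rle is ours; the two new run-length symbols are written as '0' and '1', exactly as in the example above):

```python
import re

def rle(s: str) -> str:
    """Replace each maximal run of m zeros with B(m) = (m+1) in binary, MSB dropped."""
    def b(m: int) -> str:
        return bin(m + 1)[3:]       # bin(m+1) is '0b1...'; strip '0b' and the leading 1
    return re.sub("0+", lambda run: b(len(run.group(0))), s)

print(rle("110022013000"))   # -> '1112201300'
```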

33
Analysis of BW0RL (cont)
  • Theorem 5.1
  • Theorem 5.8

34
Analysis of BW0RL (cont)
  • Locally λ-optimal algorithm:
  • For all t > 0, there exists a constant c_t such that for any partition
    s1, s2, ..., st of the string s we have
  • A locally λ-optimal algorithm combined with BWT is bounded by

35
A bit of practicality
  • A nice article by Mark Nelson
  • http://www.dogma.net/markn/articles/bwt/bwt.htm
  • Includes source code and measurements
  • Usage:
  • RLE input-file | BWT | MTF | RLE | ARI > output-file
  • UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file

36
A bit of practicality (cont)
BWT Bits/Byte BWT Size PKZIP Bits/Byte PKZIP Size Raw Size File Name
2.13 29,567 2.58 35,821 111,261 bib
2.87 275,831 3.29 315,999 768,771 book1
2.44 186,592 2.74 209,061 610,856 book2
4.85 62,120 5.38 68,917 102,400 geo
2.85 134,174 3.10 146,010 377,109 news
4.04 10,857 3.84 10,311 21,504 obj1
2.66 81,948 2.65 81,846 246,814 obj2
2.67 17,724 2.80 18,624 53,161 paper1
2.62 26,956 2.90 29,795 82,199 paper2
2.92 16,995 3.11 18,106 46,526 paper3
3.33 5,529 3.32 5,509 13,286 paper4
3.44 5,136 3.32 4,962 11,954 paper5
2.76 13,159 2.80 13,331 38,105 paper6
0.79 50,829 0.84 54,188 513,216 pic
2.69 13,312 2.69 13,340 39,611 progc
1.86 16,688 1.81 16,227 71,646 progl
1.85 11,404 1.82 11,248 49,379 progp
1.65 19,301 1.68 19,691 93,695 trans
2.41 978,122 2.64 1,072,986 3,251,493 total
37
A bit of practicality (cont)
The End