Transcript and Presenter's Notes

Title: Canonical Huffman trees:


1
  • Canonical Huffman trees
  • Goals: a scheme for large alphabets with
  • Efficient decoding
  • Efficient coding
  • Economic use of main memory

2
  • A non-Huffman tree of the same cost
  • In Code 1, lca(e,b) is the node reached by 0; in Code 2, lca(e,b) is the root
  • Code 2: successive integers
    (going down from the longest codes)

decimal   Code 2   Code 1 (Huffman)   frequency   symbol
0         000      000                10          a
1         001      001                11          b
2         010      100                12          c
3         011      101                13          d
4         10       01                 22          e
5         11       11                 23          f
3
  • The tree for Code 2:
  • Lemma: the # of nodes in each level of a Huffman tree is even
  • Proof: a parent with a single child is impossible

(Figure: the tree for Code 2, with leaves a=0, b=1, c=2, d=3 at depth 3 and e=4, f=5 at depth 2.)
4
  • General approach

5
  • Canonical Huffman Algorithm
  • compute the lengths of the codes and the number of symbols
    of each length (as for regular Huffman)
  • L = max length
  • first(L) = 0
  • for i = L-1 downto 1:
  • assign to the symbols of length i codes of this
    length, starting at first(i)
  • Q: What happens when there are no symbols of
    length i?
  • Does first(L) = 0 < first(L-1) < … < first(1) always
    hold? (a sketch of the assignment follows)
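A minimal Python sketch of the code assignment, assuming the code lengths are already known. The recurrence first(i) = ceil((first(i+1) + count(i+1)) / 2) used below is the standard one for canonical codes (it is not spelled out above), and the ordering of symbols within a length is illustrative.

    def canonical_codes(lengths):
        """lengths: dict symbol -> code length; returns dict symbol -> code string."""
        L = max(lengths.values())
        count = [0] * (L + 2)                 # count[i] = number of symbols of length i
        for l in lengths.values():
            count[l] += 1
        first = [0] * (L + 2)                 # first[L] = 0
        for i in range(L - 1, 0, -1):         # i = L-1 downto 1
            first[i] = (first[i + 1] + count[i + 1] + 1) // 2   # ceiling division
        next_code = first[:]                  # next free code value for each length
        codes = {}
        for s in sorted(lengths, key=lambda x: (lengths[x], x)):  # same-length symbols
            i = lengths[s]                                        # get successive integers
            codes[s] = format(next_code[i], '0{}b'.format(i))
            next_code[i] += 1
        return codes

    # The example lengths (a-d: 3, e-f: 2) reproduce Code 2 of the earlier table:
    # a=000 b=001 c=010 d=011 e=10 f=11
    print(canonical_codes({'a': 3, 'b': 3, 'c': 3, 'd': 3, 'e': 2, 'f': 2}))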

6
  • Decoding (assume we start now on a new symbol)
  • i = 1
  • v = nextbit()           // we have read the first bit
  • while v < first(i)      // short codes start at large
    numbers!
  • i = i + 1
  • v = 2v + nextbit()
  • /* now, v is the code, of length i, of a symbol s;
    s is in position v - first(i) in the block of
    symbols with code length i (positions from 0) */
  • Decoding can be implemented by shift/compare
  • (very efficient; a sketch follows)
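A minimal Python sketch of this loop, assuming the first(i) array and the per-length symbol blocks S(i) described on the next slide; bit_iter is any iterator yielding the bits (0/1) of the compressed stream.

    def decode_symbol(bit_iter, first, S):
        i = 1
        v = next(bit_iter)                  # read the first bit
        while v < first[i]:                 # short codes start at large numbers
            i += 1
            v = 2 * v + next(bit_iter)      # shift the next bit in
        return S[i][v - first[i]]           # index into the block of length-i symbols

    # With Code 2 of the earlier example: first(1)=2, first(2)=2, first(3)=0
    first = [None, 2, 2, 0]
    S = {2: ['e', 'f'], 3: ['a', 'b', 'c', 'd']}
    print(decode_symbol(iter([0, 0, 1]), first, S))    # the bits 001 decode to 'b'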

7
  • Data structures for the decoder:
  • The array first(i)
  • Arrays S(i) of the symbols with code length i,
    ordered by their code
  • (v - first(i) is the index used to get the
    symbol for code v)
  • Thus, decoding uses efficient arithmetic
    operations + array look-up; more efficient than
    storing a tree and traversing pointers
  • What about coding (for large alphabets, where
    symbols = words or blocks)?
  • The problem: millions of symbols → a large Huffman
    tree

8
  • Construction of canonical Huffman (sketch)
  • Assumption: we have the symbol frequencies
  • Input: a sequence of (symbol, freq)
  • Output: a sequence of (symbol, length)
  • Idea: use one array to represent a heap for
    creating the tree, and then the resulting tree and
    the lengths
  • We illustrate by example (a heap-based sketch follows)
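A hedged sketch of the length computation: it uses Python's heapq and an explicit parent map rather than the single 2n-cell array that the next slides illustrate, but it yields the same (symbol, length) output.

    import heapq
    from itertools import count

    def code_lengths(freqs):
        """freqs: dict symbol -> frequency; returns dict symbol -> code length."""
        tie = count()                        # tiebreaker: the heap never compares nodes
        parent = {}                          # child node -> parent node
        heap = [(f, next(tie), s) for s, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                 # repeatedly merge the two smallest weights
            f1, _, n1 = heapq.heappop(heap)
            f2, _, n2 = heapq.heappop(heap)
            node = ('internal', next(tie))
            parent[n1] = parent[n2] = node
            heapq.heappush(heap, (f1 + f2, next(tie), node))
        lengths = {}
        for s in freqs:                      # depth of a leaf = its code length
            d, n = 0, s
            while n in parent:
                d, n = d + 1, parent[n]
            lengths[s] = d
        return lengths

    print(code_lengths({'a': 10, 'b': 11, 'c': 12, 'd': 13, 'e': 22, 'f': 23}))
    # -> lengths 3, 3, 3, 3, 2, 2, matching the earlier table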

9
  • Example: frequencies 2, 8, 11, 12
  • (each cell with a freq. also contains a symbol,
    not shown)
  • Now the reps of 2, 8 (the smallest) go out, the rest
    percolate
  • The sum 10 is put into cell 4, and its rep into
    cell 3
  • Cell 4 is the parent (sum) of cells 5, 8.

Array states shown on the slide:
8 11 12 2 6 7 8 5
8 11 12 2 6 7
4 11 12 4 10 3 6 7
10
  • After one more step …
  • Finally, a representation of the Huffman tree
  • Next, for i = 2 to 8, assign lengths (here shown
    after i = 4)

Array states shown on the slide:
4 3 12 4 3 21 3 6
4 3 2 4 3 2 33 2
4 3 2 4 2 1 0 2
11
  • Summary
  • Insertion of the (symbol, freq) pairs into the array: O(n)
  • Creation of the heap: O(n)
  • Creating the tree from the heap: each step is
    O(log n), the total is O(n log n)
  • Computing the lengths: O(n)
  • Storage requirement: 2n cells (compare to a pointer-based tree!)

12
  • Entropy H: a lower bound on compression
  • How can one still improve?
  • Huffman works for given frequencies, e.g., for
    the English language: static modeling
  • Plus: no need to store the model in the coder/decoder
  • But one can construct a frequency table for each file:
  • semi-static
    modeling
  • Minus:
  • Need to store the model in the compressed file
    (negligible for large files)
  • Takes more time to compress
  • Plus: may provide better compression

13
  • 3rd option: start compressing with default freqs
  • As coding proceeds, update the frequencies
  • After reading a symbol:
  • compress it
  • update the freq table
  • Adaptive modeling
  • Decoding must use precisely the same algorithm for
    updating the freqs → it can follow the coding
  • Plus:
  • The model need not be stored
  • May provide compression that adapts to the file,
    including local changes of freqs
  • Minus: less efficient than the previous models
  • May use a sliding window to better reflect
    local changes

14
  • Adaptive Huffman
  • Construction of Huffman after each symbol: O(n)
  • Incremental adaptation in O(log n) is possible
  • Both are too expensive for practical use (large
    alphabets)
  • We illustrate adaptivity with arithmetic coding (soon)

15
  • Higher-order modeling: use of context
  • E.g., for each block of 2 letters, construct a
    freq. table for the next letter (2nd-order
    compression)
  • (uses conditional probabilities, hence the
    improvement)
  • This too can be static / semi-static / adaptive

16
  • Arithmetic coding
  • Can be static, semi-static, adaptive
  • Basic idea:
  • Coder: start with the interval [0,1)
  • The 1st symbol selects a sub-interval, based on its
    probability
  • The ith symbol selects a sub-interval of the (i-1)th
    interval, based on its probability
  • When the file ends, store a number from the final
    interval
  • Decoder: reads the number and reconstructs the
    sequence of intervals, i.e., the symbols
  • Important: the length of the file is stored at the beginning of
    the compressed file
  • (otherwise, the decoder does not know when
    to stop)

17
  • Example (static): a: 3/4, b: 1/4
  • The file to be compressed: aaaba
  • The sequence of intervals (+ the symbols creating
    them):
  • [0,1), a: [0,3/4), a: [0,9/16), a: [0,27/64),
  • b: [81/256, 108/256), a: [324/1024, 405/1024)
  • Assuming this is the end, we store:
  • 5 = the length of the file
  • any number in the final interval, say 0.011 (3
    binary digits)
  • (after the first 3 a's, one digit suffices!)
  • (for a large file, the length will be negligible;
    a sketch of the interval narrowing follows)
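A minimal sketch of the interval narrowing for this static example, using exact fractions; it is not a full bit-level coder, and the model table is just the a: 3/4, b: 1/4 model above.

    from fractions import Fraction

    # each symbol owns a slice of [0, 1): a -> [0, 3/4), b -> [3/4, 1)
    model = {'a': (Fraction(0), Fraction(3, 4)), 'b': (Fraction(3, 4), Fraction(1))}

    def intervals(text):
        lo, hi = Fraction(0), Fraction(1)
        for s in text:
            width = hi - lo
            s_lo, s_hi = model[s]
            lo, hi = lo + width * s_lo, lo + width * s_hi   # shrink to the symbol's slice
            yield s, lo, hi

    for s, lo, hi in intervals('aaaba'):
        print(s, lo, hi)
    # ends with [324/1024, 405/1024); storing the length 5 and any number in the
    # interval (e.g. binary 0.011 = 3/8) identifies the input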

18
  • Why is this a good approach in general?
  • For a symbol with a large probability, the # of binary
    digits needed to represent an occurrence is
    smaller than 1 → poor compression with Huffman
  • But arithmetic coding represents such a symbol by a
    small shrinkage of the interval, hence the extra
    number of digits is smaller than 1!
  • Consider the example above, after aaa

19
  • Arithmetic coding, adaptive: an example
  • The symbols: a, b, c
  • Initial frequencies: 1, 1, 1 (= the initial accumulated
    freqs)
  • (0 is illegal; one cannot code a symbol with
    probability 0!)
  • b: the model passes to the coder the triple (1, 2, 3)
  • 1 = the accumulated freqs up to, not including, b
  • 2 = the accumulated freqs up to, and including, b
  • 3 = the sum of the freqs
  • Coder notes the new interval [1/3, 2/3)
  • Model updates the freqs to 1, 2, 1
  • c: the model passes (3, 4, 4) (the upper quarter)
  • Coder updates the interval to [7/12, 8/12)
  • Model updates the freqs to (1, 2, 2)
  • And so on … (a sketch follows)
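A hedged sketch of the model/coder interaction traced above; the class and function names are illustrative. The model passes the triple (cum. freq before the symbol, cum. freq including it, total) and then bumps the symbol's count; the coder narrows its interval by that triple.

    from fractions import Fraction

    class AdaptiveModel:
        def __init__(self, symbols):
            self.symbols = list(symbols)
            self.freq = {s: 1 for s in symbols}       # every symbol starts at 1, never 0

        def triple(self, sym):                        # (cum_before, cum_including, total)
            cum = 0
            for s in self.symbols:
                if s == sym:
                    return cum, cum + self.freq[s], sum(self.freq.values())
                cum += self.freq[s]

        def update(self, sym):
            self.freq[sym] += 1

    def intervals(text, model):
        lo, hi = Fraction(0), Fraction(1)
        for s in text:
            before, including, total = model.triple(s)
            width = hi - lo
            lo, hi = lo + width * Fraction(before, total), lo + width * Fraction(including, total)
            model.update(s)
            yield s, lo, hi

    for s, lo, hi in intervals('bc', AdaptiveModel('abc')):
        print(s, lo, hi)        # b: [1/3, 2/3), then c: [7/12, 2/3), as traced above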

20
  • Practical considerations
  • The interval ends are held as binary numbers
  • The # of bits in the number to be stored is proportional to
    the size of the file; it is impractical to compute it all
    before storing
  • Solution: as the interval gets small, the first bit of any
    number in it is determined. This bit is written
    by the coder into the compressed file, and removed from
    the interval ends (= multiply by 2)
  • Example: in the 1st example, when the interval becomes
    [0, 27/64) = [0.000000, 0.011011) (after 3 a's), output
    0, and update to [0.00000, 0.11011)
  • The decoder, seeing the 1st 0, knows the first three are
    a's,
  • computes the interval, and throws away the 0

21
  • Practically, the (de)coder maintains one machine word for each
    interval end, so the computations are approximate
  • Some (very small) loss of compression
  • Both sides must perform the same approximations at
    the same time
  • Initial assignment of freq. 1: too generous to
    low-freq. symbols?
  • Solution: assign a total of 1 to all symbols not seen so far
  • If k symbols were not seen yet and one now occurs, give it
    1/k
  • Since the coder does not know when to stop, the file
    length must be stored in the compressed file

22
  • Frequencies data structure: needs to allow both
    updates and cumulative sums of the form f_1 + … + f_k
  • (expensive for large alphabets)
  • Solution: a tree-like structure with
    O(log n) accesses!

sum              binary   cell
f1               1        1
f1+f2            10       2
f3               11       3
f1+f2+f3+f4      100      4
f5               101      5
f5+f6            110      6
f7               111      7
f1+…+f8          1000     8

If the binary representation of cell number k ends with i zeros, the
cell contains f_k + f_(k-1) + … + f_(k-2^i+1)
What is the algorithm to compute f_1 + … + f_k? (see the sketch below)
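A minimal sketch of this cumulative-frequency structure (commonly known as a Fenwick tree, or binary indexed tree); both operations touch O(log n) cells by adding or stripping the lowest set bit of the cell number.

    class CumulativeFreqs:
        def __init__(self, n):
            self.tree = [0] * (n + 1)          # cells 1..n, as in the table above

        def update(self, k, delta=1):          # f_k += delta
            while k < len(self.tree):
                self.tree[k] += delta
                k += k & (-k)                  # jump to the next covering cell

        def cum(self, k):                      # f_1 + ... + f_k
            s = 0
            while k > 0:
                s += self.tree[k]
                k -= k & (-k)                  # strip the lowest set bit
            return s

    cf = CumulativeFreqs(8)
    for sym in [1, 2, 2, 5]:                   # a few symbol occurrences
        cf.update(sym)
    print(cf.cum(4), cf.cum(8))                # 3 4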
23
Dictionary-based methods
  • Huffman is a dictionary-based method
  • Each symbol in the dictionary has an associated code
  • But adaptive Huffman is not practical
  • Famous adaptive methods:
  • LZ77, LZ78 (Lempel-Ziv)
  • We describe LZ77 (the basis of gzip in Unix)

24
  • Basic idea: the dictionary is the set of sequences of
    symbols in a window before the current position
  • (typical window size: …)
  • When the coder is at position p, the window is the symbols
    in positions p-w .. p-1
  • The coder searches the window for the longest seq. that matches the
    one starting at position p
  • If one of length l is found, put (n, l) into the file (n =
    offset, l = length), and move forward l positions;
  • else output the current symbol

25
  • Example
  • The input is a b a a b a b b … (11 b's at the end)
  • The code is: a b (2,1) (1,1) (3,2) (2,1) (1,10)
  • Decoding: a → a, b → b, (2,1) → a, (1,1) → a,
  • current known string: a b a a
  • (3,2) → b a, (2,1) → b
  • current known string: a b a a b a b
  • (1,10) → go back one step, to the b;
  • do 10 times: output the scanned
    symbol, advance one
  • (note: run-length encoding hides here)
  • Note: decoding is extremely fast! (a decoder sketch follows)
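A minimal LZ77 decoder sketch for a token stream like the one above; a token is either a literal symbol or an (offset, length) pair, and copying one symbol at a time makes overlapping copies such as (1,10) come out right.

    def lz77_decode(tokens):
        out = []
        for t in tokens:
            if isinstance(t, tuple):
                offset, length = t
                for _ in range(length):        # copy symbol by symbol, so the source
                    out.append(out[-offset])   # region may overlap what is being written
            else:
                out.append(t)
        return ''.join(out)

    print(lz77_decode(['a', 'b', (2, 1), (1, 1), (3, 2), (2, 1), (1, 10)]))
    # -> 'abaaba' + 'b' * 11  (the run of 11 b's comes from the single pair (1,10))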

26
  • Practical issues:
  • 1. Maintenance of the window: use a cyclic buffer
  • 2. Searching for the longest matching word
    → expensive coding
  • 3. How to distinguish a pair (n,l) from a symbol?
  • 4. Can we save on the space for (n,l)?
  • The gzip solution for 2-4:
  • 2: a hash table of 3-sequences, with lists of the
    positions where a sequence starting with them
    occurs (what about short matches?)
  • An option: limit the search in the list (saves
    time)
  • Does not always find the longest match, but the loss
    is very small (a sketch of the search follows)
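A hedged sketch of the match search only. It assumes a table mapping each 3-symbol sequence to the list of earlier positions where it starts; gzip's real search (hash chains, lazy matching, limited chain length) is more elaborate.

    def longest_match(data, pos, window, table):
        best_off, best_len = 0, 0
        for cand in table.get(data[pos:pos + 3], []):
            if cand >= pos or pos - cand > window:      # only earlier, in-window starts
                continue
            l = 0
            while pos + l < len(data) and data[cand + l] == data[pos + l]:
                l += 1                                  # extend the candidate match
            if l > best_len:
                best_off, best_len = pos - cand, l
        return best_off, best_len

    data = 'abaababbbb'
    table = {}
    for i in range(len(data) - 2):                      # index every 3-gram by position
        table.setdefault(data[i:i + 3], []).append(i)
    print(longest_match(data, 7, 32, table))            # -> (1, 3)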

27
  • 3: one bit suffices (but see below)
  • 4: offsets are integers in the range 1..2^k; often the
    smaller values are more frequent
  • Semi-static solution (gzip):
  • Divide the file into segments of 64K; for each:
  • find the offsets used and their frequencies
  • code them using canonical Huffman
  • do the same for the lengths
  • Actually, add the plain symbols (issue 3) to the set of
    lengths, code them together using one code, and put this
    code in the file before the offset code (why?)

28
  • One last issue (for all methods): synchronization
  • Assume you want to start decoding in mid-file
  • E.g., a db of files, coded using one code
  • Bit-based addresses for the files --- these
    addresses occur in many ILs, which are loaded to
    MM (main memory)
  • 32 bits/address is OK; 64 bits/address may be costly
  • Byte/word-based addresses allow for much larger
    dbs. It may even pay to use addresses based on
    k-word blocks
  • But then how does one synchronize?

29
  • Solution: fill the last block with 011…
  • if the code exactly fills the last block, add a
    block
  • Since the file addresses/lengths are known, the filling
    can be removed
  • Does this work for Huffman? Arithmetic? LZ77?
  • What is the cost?

30
  • Summary of file compression
  • Large dbs → compression helps reduce storage
  • Fast query processing requires synchronization
    and fast decoding
  • The db is often given, so statistics can be collected:
  • semi-static is a viable option
  • (plus regular re-organization)
  • Context-based methods give good compression, but
    expensive decoding
  • Word-based Huffman is recommended (semi-static)
  • Construct two models: one for words, another for
    non-words

31
Compression of inverted lists
  • Introduction
  • Global, non-parametric methods
  • Global parametric methods
  • Local parametric methods

32
  • Introduction
  • Important parameters:
  • N - # of documents in the db
  • n - # of (distinct) words
  • F - # of word occurrences
  • f - # of inverted list entries
  • The index contains:
  • the lexicon (in MM, if possible), the ILs (on disk)
  • IL compression helps to reduce the size of the index and the
    cost of I/O

(In TREC, '99: N = 741,856; n = 535,346; F = 333,338,738; f = 134,994,414; total size 2 GB)
33
  • The IL for a term t contains entries
  • An entry:
  • d (= doc. id), the in-doc freq. f_d,t,
    the in-doc positions, …
  • For ranked answers, the entry is usually (d, f_d,t)
  • We consider each component separately: independent
    compressions can be composed

34
  • Compression of doc numbers
  • A sequence of increasing numbers in 1..N; how can it be
    compressed?
  • Most methods use gaps:
  • g_1 = d_1, g_2 = d_2 - d_1, …
  • We know that the gaps of a list sum to at most N
  • For long lists, most gaps are small
  • These facts can be used for compression
  • (Each method has an associated probability
    distribution on the gaps, defined by its code
    lengths; a small illustration of gaps follows)
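A two-line illustration of the gap transformation (the doc numbers are those of the interpolative-coding example near the end):

    docs = [3, 8, 9, 11, 12, 13, 18]
    gaps = [d - p for d, p in zip(docs, [0] + docs[:-1])]    # [3, 5, 1, 2, 1, 1, 5]
    back = [sum(gaps[:i + 1]) for i in range(len(gaps))]     # cumulative sums recover docs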

35
  • Global, non-parametric methods
  • Binary coding:
  • represent each gap by a fixed-length binary
    number
  • Code length for g: ⌈log N⌉ bits
  • Probability model: the uniform distribution, p(g) = 1/N

36
  • Unary coding:
  • represent each g > 0 by g-1 one-digits, then a 0
  • 1 -> 0, 2 -> 10, 3 -> 110, 4 -> 1110, …
  • Code length for g: g bits
  • → worst case for the sum over one list: N (hence for all ILs: nN)
  • is this a nice bound?
  • P(g) = 2^(-g)
  • Exponential decay; if this does not hold in practice
    → compression penalty

37
  • Gamma (γ) code
  • a number g ≥ 1 is represented by:
  • Prefix: the unary code for 1 + ⌊log g⌋
  • Suffix: the binary code, with ⌊log g⌋
    digits, for g - 2^⌊log g⌋
  • Examples: 1 → 0;  2 → 10 0;  3 → 10 1;  4 → 110 00;  5 → 110 01; …
  • (Why not … ?)

38
  • Delta (δ) code
  • a number g is represented by the gamma code of 1 + ⌊log g⌋,
    followed by the same ⌊log g⌋-digit binary suffix as in the gamma code
  • (a sketch of both codes follows)
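A minimal sketch of the gamma and delta codes for g ≥ 1, returning the bit strings as text; the delta definition used is the standard one stated above.

    def unary(n):                                   # 1 -> 0, 2 -> 10, 3 -> 110, ...
        return '1' * (n - 1) + '0'

    def gamma(g):                                   # g >= 1
        k = g.bit_length() - 1                      # = floor(log2 g)
        suffix = format(g - (1 << k), '0{}b'.format(k)) if k else ''
        return unary(k + 1) + suffix

    def delta(g):                                   # gamma-code the prefix instead
        k = g.bit_length() - 1
        suffix = format(g - (1 << k), '0{}b'.format(k)) if k else ''
        return gamma(k + 1) + suffix

    print([gamma(g) for g in range(1, 6)])   # ['0', '100', '101', '11000', '11001']
    print([delta(g) for g in range(1, 6)])   # ['0', '1000', '1001', '10100', '10101']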

39
  • Interim summary
  • We have codes, each with an implied probability distribution
  • Q: can you prove that the (exact) formulas for the
    probabilities of gamma, delta sum to 1?

40
  • Golomb code
  • Semi-static, uses db statistics:
  • a global, parametric code
  • Select a basis b (based on db statistics; discussed later)
  • g > 0 → we represent g-1
  • Prefix: let q = (g-1) div b (integer
    division);
  • represent q+1 in unary
  • Suffix: the remainder is (g-1) - qb (in 0..b-1);
  • represent it by a binary tree code:
  • - some leaves at distance ⌊log b⌋
  • - the others at distance ⌈log b⌉

41
  • The binary tree code:
  • cut 2j leaves from the full binary tree of depth
    k = ⌈log b⌉, so that b = 2^k - j leaves remain
  • assign the leaves, in order, to the values in
    0..b-1
  • Example: b = 6

(Figure: the truncated tree for b = 6; the leaves, in order 0..5, two at depth 2 and four at depth 3.)
42
  • Summary of the Golomb code
  • Exponential decay, like unary, but at a slower rate,
    controlled by b
  • Q: what is the underlying theory?
  • Q: how is b chosen? (a coding sketch follows)
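A hedged sketch of the Golomb code, using the standard truncated binary ("binary tree") code for the remainder, together with the parameter choice b = ⌈log(2-p) / (-log(1-p))⌉ that the later slides derive; the function names are illustrative.

    from math import ceil, log

    def golomb(g, b):
        """Gap g >= 1, parameter b >= 2; as on the earlier slide, we code g - 1."""
        q, r = divmod(g - 1, b)
        prefix = '1' * q + '0'                      # q + 1 in unary (q ones, then a zero)
        k = (b - 1).bit_length()                    # = ceil(log2 b)
        u = (1 << k) - b                            # this many remainders get k-1 bits
        if r < u:
            return prefix + format(r, '0{}b'.format(k - 1))
        return prefix + format(r + u, '0{}b'.format(k))

    def golomb_parameter(p):                        # p = prob. that a doc contains the term
        return max(1, ceil(log(2 - p) / -log(1 - p)))

    print(golomb(8, 6))             # '1001': q = 1, remainder 1 coded in 2 bits
    print(golomb_parameter(0.1))    # 7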

43
  • Infinite Huffman trees
  • Example: consider the probabilities 1/2, 1/4, 1/8, …
  • The code (*): 0, 10, 110, 1110, …
  • seems natural, but the Huffman algorithm is not
    applicable! (why?)
  • For each m, consider the (finite)
    m-approximation
  • each has a Huffman tree; its code is 0, 10, …, 11…10, 11…1
  • the code for m+1 refines that of m
  • The sequence of codes converges to (*)

44
(Figure: Huffman trees for the approximations, with leaf probabilities 1/2, 1/4, 1/8, 1/8, …
Approximation 1: code words 0, 1.  Approximation 2: code words 0, 10, 11.
A later approximation: code words 0, 10, 110, 1110, 1111.)
45
  • A more general approximation scheme
  • Given the sequence p_1, p_2, p_3, …
  • An m-approximation, with skip b, is the finite
    sequence in which the tail beyond some point is replaced
    by its sum (the "approximated tail" in the figure)
  • for example, b = 3
46
  • Fact: refining the m-approx. by splitting the
    approximated tail into its first part and the remaining tail gives the
    (m+1)-approx.
  • A sequence of m-approximations is good if
  • (*) the last two elements are the smallest in the
    sequence,
  • so they are the 1st pair merged by Huffman
    (why is this important?)
  • (*) depends on the p_i and on b

47
  • Let P(g) = (1-p)^(g-1) p -- the Bernoulli (geometric)
    distribution
  • This is a decreasing sequence
  • → to prove (*), we need to show …
  • For which b do they hold?

48
(No Transcript)
49
  • We select ≤ on the right (useful later)
  • (1-p)^b + (1-p)^(b+1) ≤ 1 has a unique smallest integer solution
  • To solve: from the left side we obtain b ≥ log(2-p) / (-log(1-p))
  • Hence the solution is b = ⌈log(2-p) / (-log(1-p))⌉ (b is an integer)

50
  • Next: how do these Huffman trees look?
  • Start with the 0-approximation
  • Facts:
  • 1. A decreasing sequence (so the last two are the smallest)
  • 2. … (when b > 3);
  •    follows from … and (*)
  • 3. The previous two properties are preserved when the
    last two elements are replaced by their sum
  • 4. The Huffman tree for the sequence assigns to its elements
    codes of lengths ⌊log b⌋ / ⌈log b⌉,
    of the same cost as the Golomb code for the remainders
  • Proof: induction on b

51
  • Now, expand the approximations to obtain the
    infinite tree
  • This is the Golomb code (with the places of the
    prefix/suffix exchanged)!!

(Figure: the infinite code tree, edges labeled 0/1.)
52
  • Last question:
  • where do we get p, and why Bernoulli?
  • Assume an equal probability p for a term t to appear in a doc d
  • For a given t, the probability of a gap g from
    one doc to the next is then P(g) = (1-p)^(g-1) p
  • To estimate p: there are f (term, doc) pairs in the index, so estimate p by
    f / (nN)
  • Since N is large, this is a reasonable estimate

53
  • For TREC:
  • To estimate b for a small p:
  • log(2-p) ≈ log 2,  log(1-p) ≈ -p
  • b ≈ (log 2)/p ≈ 0.69 nN/f ≈ 1917
  • end of (global) Golomb

54
  • Global observed frequency (a global method)
  • Construct all the ILs; collect statistics on the
    frequencies of the gaps
  • Construct a canonical Huffman tree for the gaps
  • The model/tree needs to be stored
  • (gaps are in 1..N; for TREC this is up to ~3/4M gap
    values → the storage overhead may not be so large)
  • Practically, not far from gamma, delta
  • But local methods are better

55
  • Local (parametric) methods
  • Coding of IL(t) is based on the statistics of IL(t)
  • Local observed frequency:
  • construct a canonical Huffman code for IL(t) based on
    its own gap frequencies
  • Problem: in small ILs the # of distinct gaps is
    close to the # of gaps
  • The size of the model is close to the size of the compressed data
  • Example: 25 entries, 15 gap values
  • Model: 15 gaps, 15 lengths (or freqs)
  • Way out: construct a model for a group of ILs
  • (see the book for details)

56
  • Local Bernoulli/Golomb
  • Assumption: f_t, the # of entries of IL(t), is known
    (to coder & decoder)
  • From f_t and N, estimate p and hence b; construct the
    Golomb code
  • Note:
  • Large f_t → larger p → smaller b → the code gets
    close to unary (reasonable: many small gaps)
  • Small f_t → large b → most of the coding is the ≈ log b suffix bits
  • For example, f_t = 2 (one gap) → b ≈ 0.69N
  • for a gap < 0.69N, code it in ≈ log(0.69N) bits
  • for a larger gap, one more bit

57
  • Interpolative coding
  • Uses the original d's, not the gaps g's
  • Let f = f_t; assume the d's are stored in L[0..f-1]
  • (each entry is at most N)
  • Standard binary code for the middle d, with the # of bits
    determined by its range
  • Continue as in binary search: each d in binary,
    with the # of bits determined by its narrowed range

58
  • Example: L = 3, 8, 9, 11, 12, 13, 18 (f = 7), N = 20
  • h = 7 div 2 = 3; L[3] = 11 (the 4th d)
  • the smallest possible d is 1, and there are 3 d's to the left of
    L[3]
  • the largest possible d is 20, and there are 3 d's to the right of L[3]
  • the size of the interval is (20-3) - (1+3) = 17 - 4 = 13
    → code 11 in 4 bits
  • For the sub-list to the left of 11: 3, 8, 9
  • h = 3 div 2 = 1; L[1] = 8
  • bounds: lower 1+1 = 2, upper 10-1 = 9
  • code 8 using 3 bits
  • For L[2] = 9, the range is 9..10; use 2 bits
  • For the sub-list to the right of 11: do on the board
  • (note the element that is coded in 0 bits!
    a recursive sketch follows)
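A hedged recursive sketch of interpolative coding that only reports, for each document number, its allowed range and the number of bits that range needs; the exact bit counts depend on the rounding convention, so one or two values may differ from the slide's numbers (e.g. for L[2] = 9).

    from math import ceil, log2

    def interpolative(L, lo, hi):
        """L: sorted doc numbers, all known to lie in lo..hi."""
        if not L:
            return
        h = len(L) // 2
        d = L[h]
        low = lo + h                        # h smaller elements must fit below d
        high = hi - (len(L) - h - 1)        # the rest must fit above d
        bits = ceil(log2(high - low + 1)) if high > low else 0   # 0 bits when forced
        print(d, 'in', (low, high), '->', bits, 'bits')
        interpolative(L[:h], lo, d - 1)     # recurse on the two sub-lists
        interpolative(L[h + 1:], d + 1, hi)

    interpolative([3, 8, 9, 11, 12, 13, 18], 1, 20)
    # 11 gets 4 bits, 8 gets 3, 13 gets 3, ... and 12 is coded in 0 bits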

59
  • Advantages:
  • Relatively easy to code and decode
  • Very efficient for clusters (a word that occurs
    in many documents close to each other)
  • Disadvantage: more complex to implement,
    requires a stack
  • And the cost of decoding is a bit higher than for Golomb
  • Summary of methods: show Table 3.8

60
  • An entry in IL(t) also contains f_d,t - the freq. of t
    in d
  • Compression of f_d,t:
  • In TREC, F/f ≈ 2.7 → these are small numbers
  • Unary: the total overhead is the sum of the f_d,t, i.e., F bits;
  • the cost per entry is F/f (for TREC: 2.7 bits)
  • Gamma: shorter than unary, except for the values 2, 4
  • (for TREC: 2.13 bits/entry)
  • It does not pay, given the added complexity, to choose another code
  • Total cost of compression of an IL: 8-9 bits/entry