Arithmetic Coding for Data Compression

Transcript
1
Arithmetic Coding for Data Compression
  • Aatif Awan

Combinatorics, Complexity and Algorithms (CoCoA)
Group at LUMS
2
On Information Codes
  • We use codes to represent information for storage and transmission.
  • We shall consider messages (chunks of information) consisting of symbols from a finite alphabet, e.g. the English alphabet, numerals, or a combination of the two.
  • The coding process typically assigns a unique code word to each symbol, and the decoding process reverses the assignment.
  • Sometimes we do not code symbol by symbol but instead convert the entire message into a number, and the decoding process reproduces the message from that number.

3
Terminology
  • Consider an alphabet A = {a1, a2, ..., am}. We represent a code by C = {c1, c2, ..., cm}, where c1 is the code word for a1, c2 for a2, and so on.
  • For this talk, we take our code words to be built from the binary alphabet {0, 1}.
  • The length of each code word ci is given by L(ci).
  • We assume that we know the probabilities of occurrence of the symbols of the alphabet, and we use pi to denote the probability of ai.
  • Our measure of compression is the average number of bits per symbol, Σ pi L(ci).

4
Fixed and Variable Length Codes
  • A Fixed-Length Code uses code words of the same length for all symbols of the alphabet.
  • The ASCII code uses 8 bits to represent every symbol (English-language characters plus special characters).
  • A Variable-Length Code gives code words of varying lengths to different symbols.
  • The famous Morse code uses sequences of dots and dashes of different lengths to transmit symbols.

5
A possibility of Data Compression
  • Intuitively, if you buy more doughnuts than pizzas or brownies, bargaining on the unit price of doughnuts saves you the most.
  • Naturally, assigning shorter codes to more frequent symbols and longer codes to less probable ones has a strong possibility of achieving compression.
  • If the alphabet is uniformly distributed, we can NOT achieve compression by this scheme alone.

6
Example of Fixed-Length Coding
  • Alphabet A = {a, b, c, d}, probabilities pa = 0.1, pb = 0.2, pc = 0.3, pd = 0.4.
  • We employ the fixed-length code a = 00, b = 01, c = 10, d = 11.
  • The message addbcd is then coded as 001111011011, which takes 12 bits.
  • Decoding is straightforward: break the coded message into chunks of 2 bits and look up the symbol each chunk represents (see the sketch below).
  • The average number of bits per symbol is 2.
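To make the scheme concrete, here is a minimal Python sketch of this example; the helper names (encode_fixed, decode_fixed) are ours, not taken from the slides.

    # Minimal sketch of the fixed-length example (hypothetical helper names).
    FIXED_CODE = {"a": "00", "b": "01", "c": "10", "d": "11"}
    FIXED_DECODE = {v: k for k, v in FIXED_CODE.items()}

    def encode_fixed(message):
        # Concatenate the 2-bit code word of every symbol.
        return "".join(FIXED_CODE[s] for s in message)

    def decode_fixed(bits):
        # Break the bit string into chunks of 2 and look each chunk up.
        return "".join(FIXED_DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

    assert encode_fixed("addbcd") == "001111011011"   # 12 bits, as on the slide
    assert decode_fixed("001111011011") == "addbcd"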

7
Example of Variable-Length Coding
  • Alphabet A = {a, b, c, d}, probabilities pa = 0.1, pb = 0.2, pc = 0.3, pd = 0.4.
  • We employ the variable-length code a = 011, b = 010, c = 00, d = 1.
  • The message addbcd is then coded as 01111010001, which takes 11 bits.
  • Decoding: read the message until you are sure you have recognized a symbol, then replace it and continue with the rest of the message (see the sketch below).
  • The average number of bits per symbol is 0.1×3 + 0.2×3 + 0.3×2 + 0.4×1 = 1.9.
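A matching Python sketch for this variable-length code, again with hypothetical helper names; decoding works because no code word is a prefix of another, as the next slide explains.

    # Minimal sketch of the variable-length example (hypothetical helper names).
    VAR_CODE = {"a": "011", "b": "010", "c": "00", "d": "1"}
    VAR_DECODE = {v: k for k, v in VAR_CODE.items()}

    def encode_var(message):
        return "".join(VAR_CODE[s] for s in message)

    def decode_var(bits):
        # Read bits until the buffer matches a code word; the prefix property
        # guarantees that the first match is the right one.
        out, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in VAR_DECODE:
                out.append(VAR_DECODE[buffer])
                buffer = ""
        return "".join(out)

    assert encode_var("addbcd") == "01111010001"   # 11 bits, as on the slide
    assert decode_var("01111010001") == "addbcd"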

8
Uniquely Decodable Codes and Prefix Codes
  • Uniquely Decodable Code: no two different messages generate the same coded sequence.
  • Prefix Code: no code word for any symbol is a prefix of the code word for another symbol.
  • Any prefix code is also uniquely decodable, but the converse is not true.
  • Prefix codes are desirable because they allow ready identification of the symbol at hand.

9
Huffman Coding
10
Huffman doesn't want to take the Final Exam
  • In 1951, David Huffman, in a term paper written as a substitute for the final exam, discovered the optimum prefix code (the Minimum Redundancy Code).
  • Huffman, in his 1952 paper, observes that for an optimum code:
  • If pi ≥ pk then L(ci) ≤ L(ck).
  • The code words for the two least frequent symbols should differ only in their last digit.

11
Huffman Coding Algorithm
  • Start with a forest of trees, where each tree is a single node corresponding to a symbol of the alphabet. The probability of a tree is defined as the probability of its root.
  • Until you have a single tree, do the following:
  • Choose the two trees whose roots have the minimum frequency (probability).
  • Construct a new node and make the two selected trees its children. The probability of the root of the new tree is the sum of the probabilities of the two trees just merged.
  • Your code tree is now ready; assign 0 to each left edge and 1 to each right edge.

12
Huffman Coding Algorithm
  • All the symbols are at the leaves of the code tree, and hence the prefix property holds.
  • The code word for a symbol is formed by traversing from the root to the corresponding leaf, recording the 0s and 1s along the way.
  • Decoding is done by reading the coded message and traversing the tree. Once you reach a leaf, you replace the bits read so far with the symbol. Repeat the procedure with the remaining message until you are done. (A code sketch of the construction follows.)
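A Python sketch of this construction, using a heap for the forest; this is a teaching sketch under our own naming (huffman_code), not the presenter's implementation.

    import heapq

    def huffman_code(probs):
        # Build a Huffman code for a {symbol: probability} dict.
        # The forest is a heap of (probability, tiebreaker, tree) entries,
        # where a tree is either a symbol (leaf) or a (left, right) pair.
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            # Merge the two trees whose roots have minimum probability.
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, counter, (left, right)))
            counter += 1
        # Walk the final tree: 0 for each left edge, 1 for each right edge.
        code = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                code[node] = prefix or "0"
        walk(heap[0][2], "")
        return code

    # The five-symbol alphabet of the example on the next slide.
    print(huffman_code({"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.3, "e": 0.3}))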

13
Huffman Code Example
[Huffman code tree for the alphabet {a, b, c, d, e} with probabilities 0.1, 0.1, 0.2, 0.3, 0.3: a and b (0.1 each) merge into {a, b} (0.2); {a, b} and c (0.2) merge into {a, b, c} (0.4); d and e (0.3 each) merge into {d, e} (0.6); finally {a, b, c} and {d, e} merge into the root (1.0). Each left edge is labelled 0 and each right edge 1.]

Average code length = Σ P(i) L(i) = 0.1×3 + 0.1×3 + 0.2×2 + 0.3×2 + 0.3×2 = 2.2 bits per symbol
14
How good can we get at it?
15
How much can we compress?
  • Assuming all possible messages are valid, for
    each message an algorithm compresses, some other
    must expand.
  • We take advantage of the fact that some messages
    are more likely than others. Thus on the average,
    we achieve compression.
  • There is a connection between the probability of
    messages and the best average compression we can
    achieve.

16
1948: Shannon's Limit on Data Compression
  • Self-information is the number of bits required to specify an event so that, on average, the total number of bits is minimum.
  • Shannon showed in 1948 that the self-information is given by S(i) = -log2(pi) = log2(1/pi), with the logarithm taken base 2.
  • Thus, knowing only the probabilities of the symbols in our alphabet, we achieve the best average compression when each symbol is encoded using exactly as many bits as its self-information (a short sketch follows).
  • We cannot do any better.
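A short sketch illustrating the formula, using the four-symbol alphabet from the earlier examples (the base-2 logarithm is assumed, since we measure in bits).

    import math

    def self_information(p):
        # S(i) = -log2(p_i) = log2(1/p_i), measured in bits.
        return -math.log2(p)

    # The entropy of the source is the best achievable average bits per symbol.
    probs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    entropy = sum(p * self_information(p) for p in probs.values())
    print(entropy)   # about 1.846 bits/symbol, just below the 1.9 of the earlier prefix code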

17
Why Huffman Coding Cannot Reach Shannon's Limit
  • Huffman coding reaches Shannon's limit when the probabilities of the symbols are negative powers of 2.
  • For example, if two symbols have probabilities 1/16 and 1/2, the Huffman coding algorithm generates code words of log2(16) = 4 bits and log2(2) = 1 bit for them, respectively.
  • Now consider a symbol with probability 0.9. The ideal code word length would be log2(1/0.9) ≈ 0.152 bits, but a Huffman code must give it a code word of at least 1 bit and thus cannot approach Shannon's limit.

18
Arithmetic Coding
19
The idea
  • Arithmetic Coding is different from Huffman and most other coding techniques in that it does not replace symbols with codes.
  • Instead, we produce a single fractional number corresponding to the message. Each possible message encodes to a unique number and can be recovered uniquely.
  • The longer the message, the greater the precision required to represent the coded number.

20
Encoding Procedure
  • We use an interval called the Current Range, initialized to [0, 1).
  • While the message has not been completely read:
  • Read the next symbol from the message as the current symbol.
  • Divide the range into m sub-intervals, each corresponding to one of the m possible symbols in the alphabet. The length of each sub-interval is proportional to the probability of the corresponding symbol.
  • The sub-interval corresponding to the current symbol becomes the new current range.
  • Any number within the final interval is output, and it uniquely describes the message.

21
Example
[Interval-subdivision diagram for an example message; the final interval contains 0.37419.]
Output: 0.374 ≈ 0.0101111111 in binary (10 bits)
22
Encoding Algorithm
  • Low = 0
  • High = 1
  • While there are symbols to encode:
  • Current Symbol = next symbol read from the message
  • Current Range = High - Low
  • High = Low + Current Range × High_End(Current Symbol)
  • Low = Low + Current Range × Low_End(Current Symbol) (High is computed first so that both updates use the old value of Low)
  • Output any number in [Low, High). (A code sketch follows.)
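A minimal floating-point Python sketch of this encoder, assuming the four-symbol probability table from the earlier examples (a real coder would use the scaled integer arithmetic mentioned on the last slide); the names PROBS, symbol_ranges and arithmetic_encode are ours.

    PROBS = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}

    def symbol_ranges(probs):
        # [Low_End, High_End) of each symbol's sub-interval of [0, 1).
        ranges, cum = {}, 0.0
        for sym, p in probs.items():
            ranges[sym] = (cum, cum + p)
            cum += p
        return ranges

    def arithmetic_encode(message, probs=PROBS):
        ranges = symbol_ranges(probs)
        low, high = 0.0, 1.0
        for sym in message:
            current_range = high - low
            low_end, high_end = ranges[sym]
            high = low + current_range * high_end   # both updates use the old low
            low = low + current_range * low_end
        return (low + high) / 2                     # any number in [low, high) will do

    print(arithmetic_encode("addbcd"))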

23
Encoding Algorithm at Work
[Step-by-step trace of the encoder on the example message.]
Output: 0.374125 ≈ 0.010111111100011010101 in binary (21 bits)
24
Decoding Algorithm
  • We apply the inverse operations while decoding (sketched in code below):
  • Initialize num to the encoded number
  • While decoding is not complete:
  • Find the symbol within whose range num lies, and output that symbol
  • range = High_End(symbol) - Low_End(symbol)
  • num = num - Low_End(symbol)
  • num = num / range
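Continuing the encoder sketch from slide 22 (the assumed PROBS table, symbol_ranges and arithmetic_encode defined there are reused), a matching decoder might look like this; it has to be told how many symbols to emit, as the next slide explains.

    def arithmetic_decode(num, length, probs=PROBS):
        ranges = symbol_ranges(probs)
        out = []
        for _ in range(length):
            # Find the symbol within whose sub-interval num lies.
            for sym, (low_end, high_end) in ranges.items():
                if low_end <= num < high_end:
                    out.append(sym)
                    num = (num - low_end) / (high_end - low_end)
                    break
        return "".join(out)

    code = arithmetic_encode("addbcd")
    print(arithmetic_decode(code, 6))   # "addbcd", up to floating-point precision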

25
Decoding at work
You need to explicitly tell the decoder when to stop. This can be done either by sending an EOF character or by sending the file length in the header.
26
Analysis
  • Compression is achieved because:
  • A high-probability symbol narrows the interval only slightly, while a low-probability symbol narrows it much more, requiring a larger number of digits.
  • A larger interval needs fewer bits to be specified.
  • The number of bits required is -log2(size of interval).
  • The size of the final interval is the product of the probabilities of the symbols encoded.
  • Thus each symbol i with probability pi contributes -log2(pi) bits to the output, which is exactly the self-information of the symbol.
  • Arithmetic coding thus achieves theoretically optimal compression. (A numerical check follows.)
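A quick numerical check of this argument, using the same assumed four-symbol distribution as the earlier sketches.

    import math

    PROBS = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    message = "addbcd"

    interval_size = math.prod(PROBS[s] for s in message)
    bits_needed = -math.log2(interval_size)                   # -log2 of the final interval
    self_info = sum(-math.log2(PROBS[s]) for s in message)    # sum of self-informations

    print(bits_needed, self_info)   # the two quantities coincide (about 11.35 bits here)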

27
Some Comments
  • The main advantage of Arithmetic coding is its optimality. Another advantage is that it can be used for adaptive coding without much increase in complexity.
  • The main disadvantage is its slow speed: it requires a large number of multiplications during encoding and divisions during decoding.
  • Today's computers have limited precision, so in practice we cannot represent entire files as fractions. The problem is resolved by scaling the interval so that integer operations can be used instead.

28
Questions!