Arithmetic Coding for Data Compression

Transcript
1
Arithmetic Coding for Data Compression
  • Aatif Awan

Combinatorics, Complexity and Algorithms (CoCoA)
Group at LUMS
2
On Information Codes
  • We use codes to represent information for storage and transmission.
  • We shall consider messages (chunks of information) consisting of symbols from a finite alphabet, e.g. the English alphabet, numerals, or a combination of the two.
  • The coding process typically assigns a unique code word to each symbol, and the decoding process reverses the assignment.
  • Sometimes we do not code symbol by symbol but instead convert the entire message into a number, and the decoding process reproduces the message from that number.

3
Terminology
  • Consider an alphabet A = {a1, a2, ..., am}. We represent a code by C = {c1, c2, ..., cm}, where c1 is the code word for a1, c2 for a2, and so on.
  • For this talk, we take our code words to be built from the binary alphabet {0, 1}.
  • The length of each code word ci is given by L(ci).
  • We assume that we know the probabilities of occurrence of the symbols of the alphabet, and we use pi to denote the probability of ai.
  • Our measure of compression is the average number of bits per symbol, Σ pi L(ci).

4
Fixed and Variable Length Codes
  • A Fixed-Length Code uses code words of the same length for all symbols of the alphabet.
  • The ASCII code uses 8 bits to represent every symbol (English-language characters plus special characters).
  • A Variable-Length Code gives code words of varying lengths to different symbols.
  • The famous Morse code uses sequences of dots and dashes of different lengths to transmit symbols.

5
A possibility of Data Compression
  • Intuitively, if you buy more doughnuts than pizzas or brownies, bargaining on the unit price of doughnuts saves you the most.
  • Naturally, assigning shorter codes to more frequent symbols and longer codes to less probable ones has a strong possibility of achieving compression.
  • If the alphabet is uniformly distributed, we can NOT achieve compression by this scheme alone.

6
Example of Fixed-Length Coding
  • Alphabet A = {a, b, c, d}, probabilities pa = 0.1, pb = 0.2, pc = 0.3, pd = 0.4.
  • We employ the fixed-length code a = 00, b = 01, c = 10, d = 11.
  • The message addbcd is then coded as 001111011011, which takes 12 bits.
  • Decoding is straightforward: break the coded message into chunks of 2 bits and look up the symbol each chunk represents (see the sketch below).
  • The average number of bits per symbol is 2.
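To make the scheme concrete, here is a minimal Python sketch of this example; the helper names (encode_fixed, decode_fixed) are ours, not taken from the slides.

    # Minimal sketch of the fixed-length example (hypothetical helper names).
    FIXED_CODE = {"a": "00", "b": "01", "c": "10", "d": "11"}
    FIXED_DECODE = {v: k for k, v in FIXED_CODE.items()}

    def encode_fixed(message):
        # Concatenate the 2-bit code word of every symbol.
        return "".join(FIXED_CODE[s] for s in message)

    def decode_fixed(bits):
        # Break the bit string into chunks of 2 and look each chunk up.
        return "".join(FIXED_DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

    assert encode_fixed("addbcd") == "001111011011"   # 12 bits, as on the slide
    assert decode_fixed("001111011011") == "addbcd"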

7
Example of Variable-Length Coding
  • Alphabet A = {a, b, c, d}, probabilities pa = 0.1, pb = 0.2, pc = 0.3, pd = 0.4.
  • We employ the variable-length code a = 011, b = 010, c = 00, d = 1.
  • The message addbcd is then coded as 01111010001, which takes 11 bits.
  • Decoding: read the message until you are sure you have recognized a symbol, then replace it and continue with the rest of the message (see the sketch below).
  • The average number of bits per symbol is 0.1×3 + 0.2×3 + 0.3×2 + 0.4×1 = 1.9.
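A matching Python sketch for this variable-length code, again with hypothetical helper names; decoding works because no code word is a prefix of another, as the next slide explains.

    # Minimal sketch of the variable-length example (hypothetical helper names).
    VAR_CODE = {"a": "011", "b": "010", "c": "00", "d": "1"}
    VAR_DECODE = {v: k for k, v in VAR_CODE.items()}

    def encode_var(message):
        return "".join(VAR_CODE[s] for s in message)

    def decode_var(bits):
        # Read bits until the buffer matches a code word; the prefix property
        # guarantees that the first match is the right one.
        out, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in VAR_DECODE:
                out.append(VAR_DECODE[buffer])
                buffer = ""
        return "".join(out)

    assert encode_var("addbcd") == "01111010001"   # 11 bits, as on the slide
    assert decode_var("01111010001") == "addbcd"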

8
Uniquely Decodable Codes and Prefix Codes
  • Uniquely Decodable Code: no two different messages generate the same coded sequence.
  • Prefix Code: no code word for any symbol is a prefix of the code word for another symbol.
  • Any prefix code is also uniquely decodable, but the converse is not true.
  • Prefix codes are desirable because they allow ready identification of the symbol at hand.

9
Huffman Coding
10
Huffman doesn't want to take the Final Exam
  • In 1951, David Huffman, in a term paper written as a substitute for the final exam, discovered the optimum prefix code (the Minimum Redundancy Code).
  • Huffman, in his 1952 paper, observes that for an optimum code:
  • If pi ≥ pk then L(ci) ≤ L(ck).
  • The code words for the two least frequent symbols should differ only in their last digit.

11
Huffman Coding Algorithm
  • Start with a forest of trees, where each tree is a single node corresponding to a symbol of the alphabet. The probability of a tree is defined as the probability of its root.
  • Until you have a single tree, do the following:
  • Choose the two trees whose roots have the minimum frequency (probability).
  • Construct a new node and make the two selected trees its children. The probability of the root of the new tree is the sum of the probabilities of the two trees just merged.
  • Your code tree is now ready; assign 0 to each left edge and 1 to each right edge.

12
Huffman Coding Algorithm
  • All the symbols are at the leaves of the code tree, and hence the prefix property holds.
  • The code word for a symbol is formed by traversing from the root to the corresponding leaf, recording the 0s and 1s along the way.
  • Decoding is done by reading the coded message and traversing the tree. Once you reach a leaf, you replace the bits read so far with the symbol. Repeat the procedure with the remaining message until you are done. (A code sketch of the construction follows.)
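A Python sketch of this construction, using a heap for the forest; this is a teaching sketch under our own naming (huffman_code), not the presenter's implementation.

    import heapq

    def huffman_code(probs):
        # Build a Huffman code for a {symbol: probability} dict.
        # The forest is a heap of (probability, tiebreaker, tree) entries,
        # where a tree is either a symbol (leaf) or a (left, right) pair.
        heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            # Merge the two trees whose roots have minimum probability.
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, counter, (left, right)))
            counter += 1
        # Walk the final tree: 0 for each left edge, 1 for each right edge.
        code = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                code[node] = prefix or "0"
        walk(heap[0][2], "")
        return code

    # The five-symbol alphabet of the example on the next slide.
    print(huffman_code({"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.3, "e": 0.3}))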

13
Huffman Code Example
[Huffman code tree for the alphabet {a, b, c, d, e} with probabilities 0.1, 0.1, 0.2, 0.3, 0.3: a and b (0.1 each) merge into {a, b} (0.2); {a, b} and c (0.2) merge into {a, b, c} (0.4); d and e (0.3 each) merge into {d, e} (0.6); finally {a, b, c} and {d, e} merge into the root (1.0). Each left edge is labelled 0 and each right edge 1.]

Average code length = Σ P(i) L(i) = 0.1×3 + 0.1×3 + 0.2×2 + 0.3×2 + 0.3×2 = 2.2 bits per symbol
14
How good can we get at it?
15
How much can we compress?
  • Assuming all possible messages are valid, for
    each message an algorithm compresses, some other
    must expand.
  • We take advantage of the fact that some messages
    are more likely than others. Thus on the average,
    we achieve compression.
  • There is a connection between the probability of
    messages and the best average compression we can
    achieve.

16
1948: Shannon's Limit on Data Compression
  • Self-information is the number of bits required to specify an event so that, on average, the total number of bits is minimum.
  • Shannon showed in 1948 that the self-information is given by S(i) = -log2(pi) = log2(1/pi), with the logarithm taken base 2.
  • Thus, knowing only the probabilities of the symbols in our alphabet, we achieve the best average compression when each symbol is encoded using exactly as many bits as its self-information (a short sketch follows).
  • We cannot do any better.
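A short sketch illustrating the formula, using the four-symbol alphabet from the earlier examples (the base-2 logarithm is assumed, since we measure in bits).

    import math

    def self_information(p):
        # S(i) = -log2(p_i) = log2(1/p_i), measured in bits.
        return -math.log2(p)

    # The entropy of the source is the best achievable average bits per symbol.
    probs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    entropy = sum(p * self_information(p) for p in probs.values())
    print(entropy)   # about 1.846 bits/symbol, just below the 1.9 of the earlier prefix code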

17
Why Huffman Coding Cannot Reach Shannon's Limit
  • Huffman coding reaches Shannon's limit when the probabilities of the symbols are negative powers of 2.
  • For example, if two symbols have probabilities 1/16 and 1/2, the Huffman coding algorithm generates code words of log2(16) = 4 bits and log2(2) = 1 bit for them, respectively.
  • Now consider a symbol with probability 0.9. The ideal code word length would be log2(1/0.9) ≈ 0.152 bits, but a Huffman code must give it a code word of at least 1 bit and thus cannot approach Shannon's limit.

18
Arithmetic Coding
19
The idea
  • Arithmetic Coding is different from Huffman and most other coding techniques in that it does not replace symbols with codes.
  • Instead, we produce a single fractional number corresponding to the message. Each possible message encodes to a unique number and can be recovered uniquely.
  • The longer the message, the greater the precision required to represent the coded number.

20
Encoding Procedure
  • We use an interval called the Current Range, initialized to [0, 1).
  • While the message has not been completely read:
  • Read the next symbol from the message as the current symbol.
  • Divide the range into m sub-intervals, each corresponding to one of the m possible symbols in the alphabet. The length of each sub-interval is proportional to the probability of the corresponding symbol.
  • The sub-interval corresponding to the current symbol becomes the new current range.
  • Any number within the final interval is output, and it uniquely describes the message.

21
Example
[Interval-subdivision diagram for an example message; the final interval contains 0.37419.]
Output: 0.374 ≈ 0.0101111111 in binary (10 bits)
22
Encoding Algorithm
  • Low = 0
  • High = 1
  • While there are symbols to encode:
  • Current Symbol = next symbol read from the message
  • Current Range = High - Low
  • High = Low + Current Range × High_End(Current Symbol)
  • Low = Low + Current Range × Low_End(Current Symbol) (High is computed first so that both updates use the old value of Low)
  • Output any number in [Low, High). (A code sketch follows.)
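A minimal floating-point Python sketch of this encoder, assuming the four-symbol probability table from the earlier examples (a real coder would use the scaled integer arithmetic mentioned on the last slide); the names PROBS, symbol_ranges and arithmetic_encode are ours.

    PROBS = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}

    def symbol_ranges(probs):
        # [Low_End, High_End) of each symbol's sub-interval of [0, 1).
        ranges, cum = {}, 0.0
        for sym, p in probs.items():
            ranges[sym] = (cum, cum + p)
            cum += p
        return ranges

    def arithmetic_encode(message, probs=PROBS):
        ranges = symbol_ranges(probs)
        low, high = 0.0, 1.0
        for sym in message:
            current_range = high - low
            low_end, high_end = ranges[sym]
            high = low + current_range * high_end   # both updates use the old low
            low = low + current_range * low_end
        return (low + high) / 2                     # any number in [low, high) will do

    print(arithmetic_encode("addbcd"))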

23
Encoding Algorithm at Work
[Step-by-step trace of the encoder on the example message.]
Output: 0.374125 ≈ 0.010111111100011010101 in binary (21 bits)
24
Decoding Algorithm
  • We apply the inverse operations while decoding (sketched in code below):
  • Initialize num to the encoded number
  • While decoding is not complete:
  • Find the symbol within whose range num lies, and output that symbol
  • range = High_End(symbol) - Low_End(symbol)
  • num = num - Low_End(symbol)
  • num = num / range
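Continuing the encoder sketch from slide 22 (the assumed PROBS table, symbol_ranges and arithmetic_encode defined there are reused), a matching decoder might look like this; it has to be told how many symbols to emit, as the next slide explains.

    def arithmetic_decode(num, length, probs=PROBS):
        ranges = symbol_ranges(probs)
        out = []
        for _ in range(length):
            # Find the symbol within whose sub-interval num lies.
            for sym, (low_end, high_end) in ranges.items():
                if low_end <= num < high_end:
                    out.append(sym)
                    num = (num - low_end) / (high_end - low_end)
                    break
        return "".join(out)

    code = arithmetic_encode("addbcd")
    print(arithmetic_decode(code, 6))   # "addbcd", up to floating-point precision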

25
Decoding at work
You need to explicitly tell the decoder when to stop. This can be done either by sending an EOF character or by sending the file length in the header.
26
Analysis
  • Compression is achieved because:
  • A high-probability symbol narrows the interval only slightly, while a low-probability symbol narrows it much more, requiring a larger number of digits.
  • A larger interval needs fewer bits to be specified.
  • The number of bits required is -log2(size of interval).
  • The size of the final interval is the product of the probabilities of the symbols encoded.
  • Thus each symbol i with probability pi contributes -log2(pi) bits to the output, which is exactly the self-information of the symbol.
  • Arithmetic coding thus achieves theoretically optimal compression. (A numerical check follows.)
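A quick numerical check of this argument, using the same assumed four-symbol distribution as the earlier sketches.

    import math

    PROBS = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
    message = "addbcd"

    interval_size = math.prod(PROBS[s] for s in message)
    bits_needed = -math.log2(interval_size)                   # -log2 of the final interval
    self_info = sum(-math.log2(PROBS[s]) for s in message)    # sum of self-informations

    print(bits_needed, self_info)   # the two quantities coincide (about 11.35 bits here)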

27
Some Comments
  • The main advantage of Arithmetic coding is its optimality. Another advantage is that it can be used for adaptive coding without much increase in complexity.
  • The main disadvantage is its slow speed: it requires a large number of multiplications during encoding and divisions during decoding.
  • Today's computers have limited precision, so in practice we cannot represent entire files as fractions. The problem is resolved by scaling the interval so that integer operations can be used instead.

28
Questions!