1
Set No. 2
2
Compression Programs
  • File Compression: Gzip, Bzip
  • Archivers: Arc, Pkzip, Winrar, ...
  • File Systems: NTFS

3
Multimedia
  • HDTV (MPEG-4)
  • Sound (MP3)
  • Images (JPEG)

4
Compression Outline
  • Introduction: Lossy vs. Lossless
  • Information Theory: Entropy, etc.
  • Probability Coding: Huffman and Arithmetic Coding

5
Encoding/Decoding
  • We will use "message" in a generic sense to mean
    the data to be compressed

[Diagram: Input Message → Encoder → Compressed Message → Decoder → Output Message; the encoder/decoder pair is the CODEC]
The encoder and decoder need to agree on a common
compressed format.
6
Lossless vs. Lossy
  • Lossless: Input message = Output message
  • Lossy: Input message ≠ Output message
  • Lossy does not necessarily mean loss of quality.
    In fact the output could be better than the
    input.
  • Drop random noise in images (dust on lens)
  • Drop background in music
  • Fix spelling errors in text. Put into better
    form.
  • Writing is the art of lossy text compression.

7
Lossless Compression Techniques
  • LZW (Lempel-Ziv-Welch) compression
  • Build dictionary
  • Replace patterns with index of dict.
  • Burrows-Wheeler transform
  • Block sort data to improve compression
  • Run length encoding
  • Find compress repetitive sequences
  • Huffman code
  • Use variable length codes based on frequency
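A minimal run-length encoding sketch in Python (not from the original slides; the (symbol, count) pairing shown is just one common RLE variant):

    def rle_encode(data):
        """Collapse runs of repeated symbols into (symbol, count) pairs."""
        runs = []
        for symbol in data:
            if runs and runs[-1][0] == symbol:
                runs[-1][1] += 1
            else:
                runs.append([symbol, 1])
        return [(s, c) for s, c in runs]

    def rle_decode(runs):
        """Expand (symbol, count) pairs back into the original sequence."""
        return "".join(s * c for s, c in runs)

    print(rle_encode("aaabccccd"))              # [('a', 3), ('b', 1), ('c', 4), ('d', 1)]
    print(rle_decode(rle_encode("aaabccccd")))  # aaabccccd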

8
How much can we compress?
  • For lossless compression, assuming all input
    messages are valid, if even one string is
    compressed, some other string must expand.

9
Model vs. Coder
  • To compress we need a bias on the probability of
    messages. The model determines this bias.
  • Example models:
  • Simple: character counts, repeated strings
  • Complex: models of a human face

[Diagram: Messages → Model → Probabilities → Coder → Bits; the Model plus the Coder make up the Encoder]
10
Quality of Compression
  • Runtime vs. Compression vs. Generality
  • Several standard corpora are used to compare algorithms
  • Calgary Corpus
  • 2 books, 5 papers, 1 bibliography, 1 collection
    of news articles, 3 programs, 1 terminal
    session, 2 object files, 1 geophysical data set, 1
    black-and-white bitmap image
  • The Archive Comparison Test maintains a
    comparison of just about all publicly available
    algorithms

11
Comparison of Algorithms

12
Entropy
  • Entropy is a measure of the average
    uncertainty of information
  • H = entropy
  • P = probability
  • X = random variable with a discrete set of
    possible outcomes
  • (X0, X1, X2, ..., Xn-1) where n is the total number
    of possibilities
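The formula itself appears only as an image on the original slide; the standard definition is H(X) = -Σ P(Xi) lg P(Xi), summed over the n outcomes. A minimal Python sketch (the function name is my own):

    import math

    def entropy(probs):
        """Average information content, in bits, of a discrete distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0  (fair coin)
    print(entropy([0.98, 0.02]))  # ~0.1414  (biased coin, as on the next slide)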

13
Entropy
  • Entropy is greatest when the probabilities of the
    outcomes are equal
  • Let's consider our fair coin experiment again
  • The entropy H = ½ lg 2 + ½ lg 2 = 1
  • Since each outcome has self-information of 1, the
    average over the 2 outcomes is (1+1)/2 = 1
  • Consider a biased coin, P(H) = 0.98, P(T) = 0.02
  • H = 0.98 lg(1/0.98) + 0.02 lg(1/0.02)
    = 0.98 × 0.029 + 0.02 × 5.643 = 0.0285 + 0.1129
    = 0.1414

14
Entropy
  • The estimate depends on our assumptions about
    the structure (read: pattern) of the source
    of information
  • Consider the following sequence:
  • 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
  • Obtaining the probabilities from the sequence:
  • 16 digits; 1, 6, 7, 10 each appear once, the rest
    appear twice
  • The entropy H = 3.25 bits
  • Since there are 16 symbols, we theoretically
    would need 16 × 3.25 = 52 bits to transmit the
    information

15
A Brief Introduction to Information Theory
  • Consider the following sequence:
  • 1 2 1 2 4 4 1 2 4 4 4 4 4 4 1 2 4 4 4 4 4 4
  • Obtaining the probabilities from the sequence:
  • 1 and 2 appear four times each (4/22, 4/22)
  • 4 appears fourteen times (14/22)
  • The entropy H = 0.447 + 0.447 + 0.415 = 1.309
    bits
  • Since there are 22 symbols, we theoretically
    would need 22 × 1.309 = 28.798 (29) bits to
    transmit the information
  • However, consider the symbol pairs 12 and 44
  • 12 appears 4/11 of the time and 44 appears 7/11
  • H = 0.530 + 0.415 = 0.945 bits
  • 11 × 0.945 = 10.395 (11) bits to transmit the info
    (only about 38% of the bits!)
  • We might possibly be able to find patterns with
    less entropy
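A short Python check of the two estimates above (a sketch; the Counter-based tallying and the names are my own):

    from collections import Counter
    import math

    def entropy(counts):
        """Entropy, in bits, of the empirical distribution given by the counts."""
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    seq = "1212441244444412444444"                         # the 22-symbol sequence
    pairs = [seq[i:i + 2] for i in range(0, len(seq), 2)]   # the blocks "12" and "44"

    h1 = entropy(Counter(seq))    # ~1.309 bits per symbol
    h2 = entropy(Counter(pairs))  # ~0.945 bits per pair
    print(len(seq) * h1, len(pairs) * h2)  # ~28.8 bits vs ~10.4 bits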

16
Revisiting the Entropy
  • Entropy
  • A measure of information content
  • Entropy of the English Language
  • How much information does each character in
    typical English text contain?

17
Entropy of the English Language
  • How can we measure the information per character?
  • ASCII code = 7 bits
  • Entropy = 4.5 bits (based on character probabilities)
  • Huffman codes (average) = 4.7 bits
  • Unix Compress = 3.5 bits
  • Gzip = 2.5 bits
  • BOA = 1.9 bits (currently close to the best text compressor)
  • The true entropy must be less than 1.9 bits.

18
Shannon's experiment
  • Asked humans to predict the next character given
    the whole previous text. He used these as
    conditional probabilities to estimate the entropy
    of the English language.
  • The number of guesses required for the right answer:
  • From the experiment we can estimate the entropy
    of English.

19
Data compression model
[Diagram: Input data → Reduce Data Redundancy → Reduction of Entropy → Entropy Encoding → Compressed Data]
20
Coding
  • How do we use the probabilities to code messages?
  • Prefix codes and relationship to Entropy
  • Huffman codes
  • Arithmetic codes
  • Implicit probability codes

21
Assumptions
  • Communication (or a file) is broken up into pieces
    called messages.
  • Adjacent messages might be of different types
    and come from different probability
    distributions
  • We will consider two types of coding:
  • Discrete: each message is coded as a fixed set of bits
  • Huffman coding, Shannon-Fano coding
  • Blended: bits can be shared among messages
  • Arithmetic coding

22
Uniquely Decodable Codes
  • A variable-length code assigns a bit string
    (codeword) of variable length to every message
    value
  • e.g. a = 1, b = 01, c = 101, d = 011
  • What if you get the sequence of bits 1011?
  • Is it aba, ca, or ad?
  • A uniquely decodable code is a variable-length
    code in which bit strings can always be uniquely
    decomposed into codewords.
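A small brute-force parse of 1011 under this code makes the ambiguity concrete (a sketch; the recursive helper is my own):

    CODE = {"a": "1", "b": "01", "c": "101", "d": "011"}

    def parses(bits):
        """Return every way the bit string can be split into codewords."""
        if not bits:
            return [[]]
        results = []
        for symbol, word in CODE.items():
            if bits.startswith(word):
                for rest in parses(bits[len(word):]):
                    results.append([symbol] + rest)
        return results

    print(parses("1011"))  # [['a', 'b', 'a'], ['a', 'd'], ['c', 'a']] -- three readings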

23
Prefix Codes
  • A prefix code is a variable-length code in which
    no codeword is a prefix of another codeword
  • e.g. a = 0, b = 110, c = 111, d = 10
  • Can be viewed as a binary tree with message
    values at the leaves and 0s or 1s on the edges.

[Binary tree for this code: a = 0, d = 10, b = 110, c = 111]
24
Huffman Coding
  • Binary trees for compression

25
Huffman Code
  • Approach
  • Variable length encoding of symbols
  • Exploit statistical frequency of symbols
  • Efficient when symbol probabilities vary widely
  • Principle
  • Use fewer bits to represent frequent symbols
  • Use more bits to represent infrequent symbols

[Example sequence: A A B A A A A B — A is frequent, B is infrequent]
26
Huffman Codes
  • Invented by Huffman as a class assignment in
    1950.
  • Used in many, if not most, compression algorithms:
  • gzip, bzip, jpeg (as option), fax compression, ...
  • Properties:
  • Generates optimal prefix codes
  • Cheap to generate codes
  • Cheap to encode and decode
  • Average length la = H if probabilities are powers of 2

27
Huffman Code Example
  • Expected size
  • Original: 1/8×2 + 1/4×2 + 1/2×2 + 1/8×2 = 2
    bits / symbol
  • Huffman: 1/8×3 + 1/4×2 + 1/2×1 + 1/8×3 = 1.75
    bits / symbol

28
Huffman Codes
  • Huffman Algorithm
  • Start with a forest of trees, each consisting of a
    single vertex corresponding to a message s and
    with weight p(s)
  • Repeat:
  • Select the two trees with minimum weight roots p1 and
    p2
  • Join them into a single tree by adding a root with weight
    p1 + p2

29
Example
  • p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Tree construction:
Step 1: join a(.1) and b(.2) into (.3)
Step 2: join (.3) and c(.2) into (.5)
Step 3: join (.5) and d(.5) into (1.0)]
Resulting codes: a = 000, b = 001, c = 01, d = 1
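A compact Python version of the algorithm on the previous slide, run on this example (a sketch; the heapq-based merging and tie-breaking are my own choices, and the 0/1 label given to each merged pair is arbitrary, so the exact bits may differ from the slide while the codeword lengths match):

    import heapq

    def huffman_codes(probs):
        """Repeatedly merge the two lightest trees; prefix one side with 0, the other with 1."""
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)                  # tie-breaker so dicts are never compared
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)  # lightest tree
            w2, _, c2 = heapq.heappop(heap)  # second-lightest tree
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (w1 + w2, counter, merged))
            counter += 1
        return heap[0][2]

    print(huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))
    # {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- same lengths as a=000, b=001, c=01, d=1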
30
Encoding and Decoding
  • Encoding: Start at the leaf of the Huffman tree and
    follow the path to the root. Reverse the order of the
    bits and send.
  • Decoding: Start at the root of the Huffman tree and take
    the branch for each bit received. When a leaf is reached,
    output the message and return to the root.

[Huffman tree from the previous example: root (1.0) → 0: (.5), 1: d(.5); (.5) → 0: (.3), 1: c(.2); (.3) → 0: a(.1), 1: b(.2)]
31
Adaptive Huffman Codes
  • Huffman codes can be made adaptive without
    completely recalculating the tree on each step.
  • Can account for changing probabilities
  • Small changes in probability typically make only
    small changes to the Huffman tree
  • Used frequently in practice

32
Huffman Coding Disadvantages
  • An integral number of bits must be used for each code.
  • If the entropy of a given character is 2.2
    bits, the Huffman code for that character must be
    either 2 or 3 bits, not 2.2.

33
Arithmetic Coding
  • Huffman codes have to be an integral number of
    bits long, while the entropy value of a symbol is
    almost always a fractional number, so the theoretically
    possible compressed size cannot be achieved.
  • For example, if a statistical method assigns a 90%
    probability to a given character, the optimal
    code size would be 0.15 bits.

34
Arithmetic Coding
  • Arithmetic coding bypasses the idea of replacing
    an input symbol with a specific code. It replaces
    a stream of input symbols with a single
    floating-point output number.
  • Arithmetic coding is especially useful when
    dealing with sources with small alphabets, such
    as binary sources, and alphabets with highly
    skewed probabilities.

35
Arithmetic Coding Example (1)
Character   Probability   Range
(space)     1/10          [0.0, 0.1)
A           1/10          [0.1, 0.2)
B           1/10          [0.2, 0.3)
E           1/10          [0.3, 0.4)
G           1/10          [0.4, 0.5)
I           1/10          [0.5, 0.6)
L           2/10          [0.6, 0.8)
S           1/10          [0.8, 0.9)
T           1/10          [0.9, 1.0)

Suppose that we want to encode the message BILL GATES
36
Arithmetic Coding Example (1)
[Diagram: successive interval narrowing while encoding BILL GATES — [0, 1) → B [0.2, 0.3) → I [0.25, 0.26) → L [0.256, 0.258) → L [0.2572, 0.2576) → ... (see the table on the next slide)]
37
Arithmetic Coding Example (1)
New character   Low value      High value
B               0.2            0.3
I               0.25           0.26
L               0.256          0.258
L               0.2572         0.2576
(space)         0.25720        0.25724
G               0.257216       0.257220
A               0.2572164      0.2572168
T               0.25721676     0.2572168
E               0.257216772    0.257216776
S               0.2572167752   0.2572167756

38
Arithmetic Coding Example (1)
  • The final value, called a tag, 0.2572167752,
    uniquely encodes the message BILL GATES.
  • Any value between 0.2572167752 and 0.2572167756
    can serve as a tag for the encoded message and can be
    uniquely decoded.

39
Arithmetic Coding
  • Encoding algorithm for arithmetic coding:
  • low = 0.0; high = 1.0
  • while not EOF do
  •   range = high - low; read(c)
  •   high = low + range × high_range(c)
  •   low = low + range × low_range(c)
  • end do
  • output(low)
40
Arithmetic Coding
  • Decoding is the inverse process.
  • Since 0.2572167752 falls between 0.2 and 0.3, the
    first character must be B.
  • Removing the effect of B from 0.2572167752 by
    first subtracting the low value of B, 0.2, giving
    0.0572167752.
  • Then divided by the width of the range of B,
    0.1. This gives a value of 0.572167752.

41
Arithmetic Coding
  • Then find the range in which that value lands, which
    is the range of the next letter, I.
  • The process repeats until the value reaches 0 or the
    known length of the message is reached.

42
c          Low    High   Range   r (after removing c)
(start)                          0.2572167752
B          0.2    0.3    0.1     0.572167752
I          0.5    0.6    0.1     0.72167752
L          0.6    0.8    0.2     0.6083876
L          0.6    0.8    0.2     0.041938
(space)    0.0    0.1    0.1     0.41938
G          0.4    0.5    0.1     0.1938
A          0.1    0.2    0.1     0.938
T          0.9    1.0    0.1     0.38
E          0.3    0.4    0.1     0.8
S          0.8    0.9    0.1     0.0
43
Arithmetic Coding
  • Decoding algorithm:
  • r = input_code
  • repeat
  •   search for c such that r falls in its range
  •   output(c)
  •   r = r - low_range(c)
  •   r = r / (high_range(c) - low_range(c))
  • until r equals 0
44
Arithmetic Coding Example (2)
Suppose that we want to encode the message 1 3 2 1
45
Arithmetic Coding Example (2)
[Diagram: interval narrowing while encoding 1 3 2 1 — [0.00, 1.00) → 1 [0.0, 0.8) → 3 [0.656, 0.80) → 2 [0.7712, 0.77408) → 1 [0.7712, 0.773504)]
46
Arithmetic Coding Example (2)
Encoding
New character   Low value   High value
(start)         0.0         1.0
1               0.0         0.8
3               0.656       0.800
2               0.7712      0.77408
1               0.7712      0.773504
47
Arithmetic Coding Example (2)
Decoding
48
Arithmetic Coding
  • In summary, the encoding process is simply one of
    narrowing the range of possible numbers with
    every new symbol.
  • The new range is proportional to the predefined
    probability attached to that symbol.
  • Decoding is the inverse procedure, in which the
    range is expanded in proportion to the
    probability of each symbol as it is extracted.

49
Arithmetic Coding
  • The coding rate theoretically approaches the
    high-order entropy.
  • Not as popular as Huffman coding because
    multiplications and divisions are needed.
  • Average bits/byte on 14 files (program, object,
    text, etc.):
  • Huffman 4.99, LZW 4.71, LZ77/LZ78 2.95, Arithmetic 2.48

50
Generating a Binary Code for Arithmetic Coding
  • Problem
  • The binary representation of some of the
    generated floating point values (tags) would be
    infinitely long.
  • We need increasing precision as the length of the
    sequence increases.
  • Solution
  • Synchronized rescaling and incremental encoding.

51
Generating a Binary Code for Arithmetic Coding
  • If the upper bound and the lower bound of the
    interval are both less than 0.5, rescale
    the interval and transmit a 0 bit.
  • If the upper bound and the lower bound of the
    interval are both greater than 0.5, rescale
    the interval and transmit a 1 bit.
  • Mapping rules (see the sketch below)
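A minimal sketch of the rescaling step in Python (my own formulation, since the slide's mapping-rule figure is not in the transcript): while the interval lies entirely in [0, 0.5) emit 0 and double it; while it lies entirely in [0.5, 1) emit 1 and double it about 0.5.

    def rescale(low, high, bits):
        """Double the interval while it sits entirely in one half of [0, 1), emitting one bit per doubling."""
        while True:
            if high <= 0.5:                       # entirely in the lower half
                bits.append(0)
                low, high = 2 * low, 2 * high
            elif low >= 0.5:                      # entirely in the upper half
                bits.append(1)
                low, high = 2 * (low - 0.5), 2 * (high - 0.5)
            else:
                return low, high

    bits = []
    print(rescale(0.656, 0.8, bits), bits)  # ~(0.312, 0.6) [1], as in the 1 3 2 1 example below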

52
Arithmetic Coding Example (2)
[Diagram: encoding 1 3 2 1 with rescaling —
[0.00, 1.00) → 1 → [0.0, 0.8)
→ 3 → [0.656, 0.80): emit 1, rescale → [0.312, 0.6)
→ 2 → [0.5424, 0.54816): emit 1 → [0.0848, 0.09632); emit 0 → [0.1696, 0.19264); emit 0 → [0.3392, 0.38528); emit 0 → [0.6784, 0.77056); emit 1 → [0.3568, 0.54112)
→ 1 → [0.3568, 0.504256)]
53
Encoding
Any binary value between the lower and the upper bound can be transmitted.
54
  • Decoding: the bit stream starts with 1100011
  • The number of bits needed to distinguish the
    different symbols is ... bits.

55
Revisiting Arithmetic coding
  • An example: Consider sending a sequence of 1000
    messages, each having probability .999
  • Self-information of each message:
  • -log2(.999) = .00144 bits
  • Sum of self-information = 1.4 bits.
  • Huffman coding will take at least 1k bits.
  • Arithmetic coding: 3 bits!

56
Arithmetic Coding Introduction
  • Allows blending of bits in a message sequence.
  • Can bound total bits required based on sum of
    self information
  • Used in PPM, JPEG/MPEG (as option), DMM
  • More expensive than Huffman coding, but integer
    implementation is not too bad.

57
Arithmetic Coding (message intervals)
  • Assign each probability distribution to an
    interval range from 0 (inclusive) to 1
    (exclusive).
  • e.g.

f(a) = .0, f(b) = .2, f(c) = .7
The interval for a particular message will be called
the message interval (e.g. for b the interval is [.2, .7))
58
Arithmetic Coding (sequence intervals)
  • To code a message, use the recurrences given below.
  • Each message narrows the interval by a factor of
    p_i.
  • Final interval size is the product of the p_i.
  • The interval for a message sequence will be
    called the sequence interval.
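The update formulas (shown only as an image on the original slide) are, in standard arithmetic-coding notation, with f(m) the cumulative probability below message m and p(m) its probability:

    l_0 = 0,  s_0 = 1
    l_i = l_{i-1} + s_{i-1} × f(m_i)        (lower end of the sequence interval)
    s_i = s_{i-1} × p(m_i)                  (size of the sequence interval)
    s_n = p(m_1) × p(m_2) × ... × p(m_n)    (final interval size)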

59
Arithmetic Coding Encoding Example
  • Coding the message sequence bac
  • The final interval is [.27, .3)

[Diagram: [0.0, 1.0) → b [0.2, 0.7) → a [0.2, 0.3) → c [0.27, 0.3), using a = .2, b = .5, c = .3]
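A tiny Python sketch of the sequence-interval computation (the function name is my own; p and f hold the probabilities and cumulative values from the slides above):

    p = {"a": 0.2, "b": 0.5, "c": 0.3}
    f = {"a": 0.0, "b": 0.2, "c": 0.7}   # cumulative probability below each message

    def sequence_interval(message):
        """Apply l_i = l_{i-1} + s_{i-1}*f(m_i) and s_i = s_{i-1}*p(m_i)."""
        low, size = 0.0, 1.0
        for m in message:
            low, size = low + size * f[m], size * p[m]
        return low, low + size

    print(sequence_interval("bac"))  # ~(0.27, 0.3)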
60
Vector Quantization
  • How do we compress a color image (r,g,b)?
  • Find k representative points for all colors
  • For every pixel, output the nearest
    representative
  • If the points are clustered around the
    representatives, the residuals are small and
    hence probability coding will work well.
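A minimal vector-quantization sketch in Python (illustrative only; the representative palette and the pixel values are made up):

    # Map each (r, g, b) pixel to the index of its nearest representative color.
    palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]

    def quantize(pixel):
        """Return the palette index minimizing squared Euclidean distance."""
        return min(range(len(palette)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])))

    pixels = [(250, 10, 5), (3, 240, 20), (200, 200, 210)]
    print([quantize(px) for px in pixels])  # [1, 2, 4] -- indices of red, green, white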

61
Transform coding
  • Transform the input into another space.
  • One form of transform is to choose a set of basis
    functions.
  • JPEG/MPEG both use this idea.

62
Other Transform codes
  • Wavelets
  • Fractal-based compression
  • Based on the idea of fixed points of functions.