Title: Review of Information Theory
1 Review of Information Theory
2 Information
- Suppose that we have a source alphabet of q symbols s1, s2, ..., sq, each with its probability p(si) = pi. How much information do we get when we receive one of these symbols?
- For example, if p1 = 1, then there is no surprise, and no information, since you know what the message must be.
3 Information
- On the other hand, if the probabilities are all very different, then when a symbol with a low probability arrives you feel more surprised, and get more information, than when a symbol with a higher probability arrives.
- I(p): a function that measures the amount of information (surprise, or uncertainty) in the occurrence of an event of probability p.
4 Information
- Properties of I(p):
- 1) I(p) >= 0 (a real, nonnegative measure).
- 2) I(p1 p2) = I(p1) + I(p2) for independent events.
- 3) I(p) is a continuous function of p.
- Define I(p) = log(1/p) = -log p.
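These properties pin down I(p) = log(1/p) up to the base of the logarithm. As a minimal illustration (a sketch added here, not part of the original slides), the following Python computes I(p) and checks additivity for independent events:

```python
import math

def information(p: float, r: float = 2.0) -> float:
    """Information I(p) = log_r(1/p) gained from an event of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log(1.0 / p, r)

# I(1) = 0: a certain event carries no surprise.
print(information(1.0))                    # 0.0
# Additivity for independent events: I(p1*p2) = I(p1) + I(p2).
p1, p2 = 0.5, 0.25
print(information(p1 * p2))                # 3.0
print(information(p1) + information(p2))   # 3.0
```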
5 Entropy
- If we get I(si) units of information when we receive the symbol si, how much do we get on the average?
- Since pi is the probability of getting the information I(si), on the average we get, for each symbol si,
  $$p_i I(s_i) = p_i \log(1/p_i)$$
6 Entropy
- Thus, on the average, over the whole alphabet of symbols si, we will get
  $$\sum_{i=1}^{q} p_i \log(1/p_i)$$
- Conventionally, we write
  $$H_r(S) = \sum_{i=1}^{q} p_i \log_r(1/p_i)$$
  as the entropy of the signaling system S having symbols si with probabilities pi.
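A direct transcription of this definition into Python (a sketch assuming a list of probabilities, not from the original slides):

```python
import math

def entropy(probs, r: float = 2.0) -> float:
    """Entropy H_r(S) = sum_i p_i * log_r(1/p_i) of a probability distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * math.log(1.0 / p, r) for p in probs if p > 0)
```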
7 Entropy
- Note that this is the entropy function for a distribution in which all that is considered are the probabilities pi of the symbols si.
- Inter-pixel correlation is not considered here. If it is to be considered, we should compute the entropy function via a Markov process.
8 Entropy
- Example: S = {s1, s2, s3, s4}, p1 = 1/2, p2 = 1/4, p3 = p4 = 1/8.
- Then H2(S) = (1/2) log2 2 + (1/4) log2 4 + (1/8) log2 8 + (1/8) log2 8
             = 1/2 + (1/4)(2) + (1/8)(3) + (1/8)(3)
             = 1 3/4 bits of information.
- In this example, the Lavg of the Huffman code can reach Hr(S). But this is not the general case.
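Plugging the example into the entropy helper sketched above confirms the value (hypothetical usage):

```python
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75, i.e. 1 3/4 bits
```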
9 Entropy
- Consider the distribution consisting of just two events. Let p be the probability of the first symbol (event). Then the entropy function is
  $$H_2(P) = p \log_2(1/p) + (1-p) \log_2\!\big(1/(1-p)\big)$$
- The maximum of H2(P) occurs when p = 1/2.
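The slide states the location of the maximum without derivation; as a quick check (added here), set the derivative to zero:

$$\frac{d}{dp} H_2(p) = \frac{d}{dp}\Big[-p\log_2 p - (1-p)\log_2(1-p)\Big] = \log_2\frac{1-p}{p} = 0 \;\Longrightarrow\; p = \tfrac{1}{2},$$

giving the maximum H2(1/2) = 1 bit.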
- (Figure: H2(P) and I(p) as functions of p.)
10 Entropy
(Figure: plot of the binary entropy function H2(P).)
11 Entropy and Coding
- Instantaneous codes: a code is said to be instantaneous if, when a complete code word is received, the receiver immediately knows this, and does not have to look further before deciding what message symbol was received.
- No encoded symbol of such a code is a prefix of any other symbol.
12 Entropy and Coding
- Example:
    instantaneous    non-instantaneous
    s1  0            s1  0
    s2  10           s2  01
    s3  110          s3  011
    s4  111          s4  111
- 101100111 decodes as s2 s3 s1 s4 (instantaneous: each code word is recognized as soon as it is complete).
- 011111111 decodes as s3 s4 s4 (non-instantaneous: the decoder must look ahead).
- It is necessary and sufficient that an instantaneous code have no code word si which is a prefix of another code word sj.
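A small Python check of the prefix condition (an illustrative sketch, not from the slides):

```python
def is_instantaneous(code_words) -> bool:
    """True iff no code word is a proper prefix of another (prefix condition)."""
    for i, a in enumerate(code_words):
        for j, b in enumerate(code_words):
            if i != j and b.startswith(a):
                return False
    return True

print(is_instantaneous(["0", "10", "110", "111"]))  # True
print(is_instantaneous(["0", "01", "011", "111"]))  # False: "0" prefixes "01"
```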
13 Entropy and Coding
- Theorem (the Kraft inequality). A necessary and sufficient condition for the existence of an instantaneous code S of q symbols si with encoded words of lengths l1 <= l2 <= ... <= lq is
  $$\sum_{i=1}^{q} \frac{1}{r^{l_i}} \le 1$$
  where r is the radix (number of symbols) of the alphabet of the encoded symbols.
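Checking the inequality is a one-liner; a sketch (names are mine) applied to the instantaneous code from slide 12:

```python
def kraft_sum(lengths, r: int = 2) -> float:
    """Left-hand side of the Kraft inequality: sum_i r**(-l_i)."""
    return sum(r ** -l for l in lengths)

# Lengths of the instantaneous code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))  # 1.0 <= 1, so an instantaneous code exists
```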
14 Entropy and Coding
- Theorem. The entropy supplies a lower bound on the average code length Lavg for any instantaneously decodable system.
15 Shannon-Fano Coding
- Shannon-Fano coding is less efficient than Huffman coding, but has the advantage that you can go directly from the probability pi to the code word length li.
16 Shannon-Fano Coding
- Given the source symbols s1, s2, ..., sq and their corresponding probabilities p1, p2, ..., pq, then for each pi there is an integer li such that
  $$\log_r(1/p_i) \le l_i < \log_r(1/p_i) + 1$$
- Since r^(-li) <= pi, these lengths satisfy
  $$\sum_{i=1}^{q} \frac{1}{r^{l_i}} \le \sum_{i=1}^{q} p_i = 1 \quad\text{(Kraft inequality)}$$
- Therefore, there is an instantaneously decodable code having these Shannon-Fano lengths.
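Computing the Shannon-Fano lengths directly from the probabilities (a sketch; the epsilon guard against floating-point error is my addition):

```python
import math

def shannon_fano_lengths(probs, r: int = 2):
    """Code word lengths l_i = ceil(log_r(1/p_i)) for the given probabilities."""
    return [math.ceil(math.log(1.0 / p, r) - 1e-12) for p in probs]

# Example from slide 18: p1 = p2 = 1/4, p3 = ... = p6 = 1/8.
print(shannon_fano_lengths([1/4, 1/4, 1/8, 1/8, 1/8, 1/8]))  # [2, 2, 3, 3, 3, 3]
```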
17 Shannon-Fano Coding
- Starting from the defining inequality for the lengths,
  $$\log_r(1/p_i) \le l_i < \log_r(1/p_i) + 1$$
  multiply through by pi and sum over i:
  $$H_r(S) = \sum_i p_i \log_r(1/p_i) \;\le\; \sum_i p_i l_i \;<\; H_r(S) + 1$$
- Thus,
  $$H_r(S) \le L_{avg} < H_r(S) + 1$$
- The entropy serves as a lower bound, and as part of the upper bound, on the average length for Shannon-Fano coding.
18 Shannon-Fano Coding
- Example: p1 = p2 = 1/4, p3 = p4 = p5 = p6 = 1/8.
- Shannon-Fano lengths: l1 = l2 = 2, l3 = l4 = l5 = l6 = 3.
- We then assign
    s1  00      s3  100
    s2  01      s4  101
                s5  110
                s6  111
19 Shannon-Fano Coding
- Since Shannon-Fano coding obeys the Kraft inequality, there are always enough symbols to assign for an instantaneous code.
- Thus, orderly assignment leads readily to the decoding tree, and the prefix condition is met.
20 How Bad Is Shannon-Fano Coding?
- Consider the source alphabet s1, s2 with probabilities
  p1 = 1 - 1/2^K, p2 = 1/2^K (K >= 2).
- We get
  log2(1/p1) <= log2 2 = 1
- Thus l1 = 1. But for l2, we have
  log2(1/p2) = log2 2^K = K = l2
- Hence, where Huffman encoding gives both code words one binary digit, Shannon-Fano gives s1 a 1-bit word and s2 a K-bit word.
21 How Bad Is Shannon-Fano Coding?
- Consider Lavg for Huffman encoding and Shannon-Fano encoding.
- Clearly, Lavg(Huffman) = 1.
22 How Bad Is Shannon-Fano Coding?
- For Shannon-Fano, we have
  Lavg = 1(1 - 1/2^K) + K(1/2^K) = 1 + (K-1)/2^K

    K    1 + (K-1)/2^K
    2    1 + 1/4   = 1.25
    3    1 + 1/4   = 1.25
    4    1 + 3/16  = 1.1875
    5    1 + 1/8   = 1.125
    6    1 + 5/64  = 1.078125
    etc.
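The table is easy to regenerate (a small sketch added here, not from the slides):

```python
def sf_average_length(K: int) -> float:
    """Shannon-Fano Lavg = 1 + (K-1)/2**K for p1 = 1 - 2**-K, p2 = 2**-K."""
    return 1 + (K - 1) / 2 ** K

for K in range(2, 7):
    print(K, sf_average_length(K))  # 1.25, 1.25, 1.1875, 1.125, 1.078125
```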
23 Extensions of a Code
- If we encode not one symbol at a time, but instead make up a code for blocks of symbols of the source code at a time, then we can expect to get a better approximation to the lower bound Hr(S), because the probabilities of the extension are more variable than the original probabilities.
- Definition: the n-th extension of a source alphabet s1, s2, ..., sq consists of the symbols of the form si1 si2 ... sin, having the probabilities Qi = pi1 pi2 ... pin. Each block of n original symbols now becomes a single symbol ti with probability Qi. We label this alphabet T = S^n.
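Enumerating the extension probabilities in Python (a sketch, not from the slides; requires Python 3.8+ for math.prod):

```python
import math
from itertools import product

def extension(probs, n):
    """Probabilities Q_i of the n-th extension: products over all length-n blocks."""
    return [math.prod(block) for block in product(probs, repeat=n)]

# Second extension of {2/3, 1/3}, as in the next slide:
print(extension([2/3, 1/3], 2))  # [4/9, 2/9, 2/9, 1/9]
```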
24 Extensions of a Code
- Example encoding:
    s1  p1 = 2/3    code: 0
    s2  p2 = 1/3    code: 1       Lavg = 1
- The second extension takes two symbols at a time. Encoding:
    s1s1  p11 = 4/9    code: 1
    s1s2  p12 = 2/9    code: 01
    s2s1  p21 = 2/9    code: 000
    s2s2  p22 = 1/9    code: 001
  Lavg = 17/18 = 0.9444 per source symbol.
- Lavg for the third extension is 76/81 = 0.93827.
- Lavg for the fourth extension is also about 0.93827.
25 Extensions of a Code
- Theorem. The entropy of the n-th extension of an alphabet is n times the entropy of the original alphabet, i.e., Hr(T) = n Hr(S).
- Proof.
  $$H_r(T) = H_r(S^n) = \sum_i Q_i \log_r(1/Q_i), \qquad Q_i = p_{i_1} p_{i_2} \cdots p_{i_n}$$
  $$\log_r(1/Q_i) = \sum_{k=1}^{n} \log_r(1/p_{i_k})$$
  Summing over all blocks, each of the n terms contributes one copy of Hr(S), so Hr(T) = n Hr(S).
27 Extensions of a Code
- $$H_r(S^n) \le L_{avg}(T) < H_r(S^n) + 1$$
  $$\Rightarrow\; n H_r(S) \le L_{avg}(T) < n H_r(S) + 1$$
  $$\Rightarrow\; H_r(S) \le L_{avg}(T)/n < H_r(S) + 1/n \qquad (A)$$
- Thus, for a sufficiently large n-th extension of the code, we can bring the average code word length as close as we please to the entropy Hr(S).
- Shannon's noiseless coding theorem: the n-th extension of a code satisfies inequality (A).
28 Extensions of a Code
- Since Huffman coding is optimal, we have
  $$H_r(S) \;\le\; L_{avg}^{Huffman}(T)/n \;\le\; L_{avg}^{SF}(T)/n \;<\; H_r(S) + 1/n.$$
- Note: a proof that the Huffman code is optimal can be found in many texts, including Fundamentals of Computer Algorithms by Horowitz and Sahni (greedy method).
29 Extensions of a Code
- Example: s1, p1 = 2/3; s2, p2 = 1/3.
- The entropy is H2(S) = (2/3) log2(3/2) + (1/3) log2 3 = 0.9182958 (lower bound).

    n   Huffman Lavg(T)/n   Shannon-Fano Lavg(T)/n
    1   1.00000             1.33333
    2   0.94444             1.33333
    3   0.93827             1.00000
    4   0.93827             1.08333
    5   0.92263             0.93333
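The Huffman column can be reproduced using the standard fact that Lavg equals the sum of the probabilities of all merged (internal) nodes of the Huffman tree. A sketch in Python (function names are mine, not from the slides):

```python
import heapq
import math
from itertools import product

def huffman_avg_length(probs) -> float:
    """Binary-Huffman average code length = sum of all merged-node probabilities."""
    heap = sorted(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b                 # each merge adds one bit to every leaf below
        heapq.heappush(heap, a + b)
    return total

for n in range(1, 6):
    Q = [math.prod(block) for block in product([2/3, 1/3], repeat=n)]
    print(n, round(huffman_avg_length(Q) / n, 5))
# Expected, matching the table: 1.0, 0.94444, 0.93827, 0.93827, 0.92263
```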
30 Entropy of a Markov Process
- For an input stream of data, a Markov process is used to handle the correlation between successive symbols.
- $$p(s_i \mid s_{j_1}, s_{j_2}, \ldots, s_{j_m})$$
  Given that we have already seen the previous m symbols sj1, ..., sjm, this is the probability that the next symbol will be si.
31 Entropy of a Markov Process
- Given that we have already seen (are in the state) sj1, ..., sjm, define the amount of information we get when we receive a symbol si as
  $$I(s_i \mid s_{j_1}, \ldots, s_{j_m}) = \log \frac{1}{p(s_i \mid s_{j_1}, \ldots, s_{j_m})}$$
- For example, in English the letter q is almost always followed by the letter u. Thus I(u|q) is nearly 0, while I(p|q) is very large.
32 Entropy of a Markov Process
- In computer-language terms, a zero-memory source uses each source symbol independently of what has preceded it. An m-memory source uses the previous m symbols, and this idea corresponds exactly to an m-th order Markov process.
- In general, for an m-memory source there are q^m states in the Markov process.
33 Entropy of a Markov Process
- Example (transition probabilities):
    p(a|a) = 1/3   p(b|a) = 1/3   p(c|a) = 1/3
    p(a|b) = 1/4   p(b|b) = 1/2   p(c|b) = 1/4
    p(a|c) = 1/4   p(b|c) = 1/4   p(c|c) = 1/2
34 Entropy of a Markov Process
- What is the probability distribution of the states a, b, and c?
    pa p(a|a) + pb p(a|b) + pc p(a|c) = pa
    pa p(b|a) + pb p(b|b) + pc p(b|c) = pb
    pa p(c|a) + pb p(c|b) + pc p(c|c) = pc
    pa + pb + pc = 1
- Hence pa = 3/11, pb = pc = 4/11 (the equilibrium solution).
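A minimal numpy sketch (not from the slides) that solves the same equilibrium system, replacing one redundant balance equation with the normalization constraint:

```python
import numpy as np

# Transition matrix P[j][i] = p(next = i | state = j) for states a, b, c.
P = np.array([[1/3, 1/3, 1/3],
              [1/4, 1/2, 1/4],
              [1/4, 1/4, 1/2]])

# Solve pi = pi @ P together with sum(pi) = 1.
A = np.vstack([(P.T - np.eye(3))[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)  # [3/11, 4/11, 4/11] ~ [0.2727, 0.3636, 0.3636]
```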
35 Entropy of a Markov Process
- The conditional entropy of the source alphabet S, given that we have seen the sequence of symbols sj1, ..., sjm, is defined as
  $$H(S \mid s_{j_1}, \ldots, s_{j_m}) = \sum_{i=1}^{q} p(s_i \mid s_{j_1}, \ldots, s_{j_m}) \log \frac{1}{p(s_i \mid s_{j_1}, \ldots, s_{j_m})}$$
36 Entropy of a Markov Process
- Define the entropy of the Markov system as the probability of being in each state times the conditional entropy of that state, summed over the states:
  $$H(S) = \sum_{\text{states}} p(s_{j_1}, \ldots, s_{j_m})\, H(S \mid s_{j_1}, \ldots, s_{j_m})$$
37 Entropy of a Markov Process
- Example (a 2-memory binary source):
    p(0|0,0) = 0.8 = p(1|1,1)
    p(1|0,0) = 0.2 = p(0|1,1)
    p(0|0,1) = 0.5 = p(1|1,0)
    p(1|0,1) = 0.5 = p(0|1,0)
- Using the obvious symmetry, we see that p(0,0) = p(1,1) and p(0,1) = p(1,0). Balancing the flow into state (0,0), p(0,0) = 0.8 p(0,0) + 0.5 p(1,0), so we must have the equation
    0.4 p(0,0) - p(0,1) = 0
38 Entropy of a Markov Process
- Since we must be in some state, we also have
    p(0,0) + p(0,1) + p(1,0) + p(1,1) = 1, or 2p(0,0) + 2p(0,1) = 1
  so p(0,0) = 5/14 = p(1,1) and p(0,1) = 2/14 = p(1,0).

    s1 s2 s3   p(s3|s1,s2)   p(s1,s2)   p(s1,s2,s3)
    0  0  0    0.8           5/14       4/14
    0  0  1    0.2           5/14       1/14
    0  1  0    0.5           2/14       1/14
    0  1  1    0.5           2/14       1/14
    1  0  0    0.5           2/14       1/14
    1  0  1    0.5           2/14       1/14
    1  1  0    0.2           5/14       1/14
    1  1  1    0.8           5/14       4/14
39 Entropy of a Markov Process
- From the table we can now compute
  $$H = \sum_{s_1,s_2,s_3} p(s_1,s_2,s_3) \log_2 \frac{1}{p(s_3 \mid s_1,s_2)} = \frac{8}{14}\log_2\frac{1}{0.8} + \frac{2}{14}\log_2\frac{1}{0.2} + \frac{4}{14}\log_2\frac{1}{0.5} \approx 0.801 \text{ bits per symbol.}$$
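The same sum, computed directly from the table rows (an illustrative sketch, not from the slides):

```python
import math

# (p(s3|s1,s2), p(s1,s2,s3)) pairs from the table on slide 38.
rows = [(0.8, 4/14), (0.2, 1/14), (0.5, 1/14), (0.5, 1/14),
        (0.5, 1/14), (0.5, 1/14), (0.2, 1/14), (0.8, 4/14)]

H = sum(joint * math.log2(1.0 / cond) for cond, joint in rows)
print(H)  # ~0.8014 bits per symbol
```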
40 Entropy of a Markov Process
- Since the inter-sample correlations are very high for natural digital signals, we must use a Markov process to measure the entropy of such signals, letting the memory m grow as needed (m approaching infinity in the limit).