Transcript and Presenter's Notes

Title: A Bit of Information Theory


1
A Bit of Information Theory
  • Unsupervised Learning Working Group
  • Assaf Oron, Oct. 15 2003

Based mostly upon Cover & Thomas, Elements of
Information Theory, 1991
2
Contents
  • Coding and Transmitting Information
  • Entropy etc.
  • Information Theory and Statistics
  • Information Theory and Machine Learning

3
What is Coding? (1)
  • We keep coding all the time
  • Crucial requirement for coding: source and
    receiver agree on the key.
  • Modern coding: telegraph -> radio -> ...
  • Practical problem: how efficient can we make it?
    Tackled from the 1920s on.
  • 1940s: Claude Shannon

4
What is Coding? (2)
  • Shannon's greatness: finding a solution to the
    specific problem by working on the general
    problem.
  • Namely: how does one quantify information, its
    coding, and its transmission?
  • ANY type of information

5
Some Day-to-Day Codes
Code             | Channel                                 | Unique? Instant?
Spoken language  | Sounds via air                          | Well...
Written language | Signs on paper/screen                   | Well...
Numbers and math | Signs on paper/screen, electronic, etc. | Usually (decimal point, operation signs, etc.)
DNA protein code | Nucleotide pairs                        | Yes (start, end, 3-somes)
6
Information Complexity of Some Coded Messages
  • Let's think about written numbers
  • k digits -> 10^k possible messages
  • How about written English?
  • k letters -> 26^k possible messages
  • k words -> D^k possible messages, where D is the
    English dictionary size
  • -> Length is proportional to log(complexity)
    (see the sketch below)

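A minimal Python sketch of this count-vs-length relationship (the alphabet sizes and message counts are just illustrative):

  import math

  # k symbols from an alphabet of size D can express D**k distinct messages,
  # so the length needed to distinguish N messages grows like log_D(N).
  print(10 ** 3)            # 1000 messages from 3 decimal digits
  print(26 ** 3)            # 17576 messages from 3 letters
  print(math.log10(1000))   # 3.0: digits needed for 1000 distinct numbers
  print(math.log2(1000))    # ~9.97: bits needed for the same 1000 messages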
7
Information Entropy
  • The expected length (in bits) of a binary message
    conveying x-type information (see the definition below)
  • Other common descriptions: code complexity,
    uncertainty, missing/required information,
    expected surprise, information content (BAD),
    etc.

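For reference, the standard definition behind the "expected length in bits" reading (base-2 logarithm, with the convention 0 log 0 = 0):

  H(X) = -\sum_{x} p(x) \log_2 p(x) = \mathbb{E}\left[ \log_2 \frac{1}{p(X)} \right]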
8
Why Entropy?
  • Thermodynamics (mid 19th century): amount of unusable
    heat in a system
  • Statistical physics (end of 19th century): log(complexity
    of the current system state)
  • -> amount of "mess" in the system
  • The two were proven to be equivalent
  • Statistical entropy is proportional to
    information entropy if p(x) is uniform
  • 2nd Law of Thermodynamics:
  • Entropy never decreases (more later)

9
Entropy Properties, Examples

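A minimal Python sketch of the basic properties (H is nonnegative, zero for a certain outcome, and maximal for the uniform distribution); the distributions are just illustrative:

  import math

  def H(p):
      # Shannon entropy (bits) of a discrete distribution given as a list of probabilities
      return -sum(pi * math.log2(pi) for pi in p if pi > 0)

  print(H([1.0]))        # 0.0  : a certain outcome carries no surprise
  print(H([0.5, 0.5]))   # 1.0  : a fair coin needs one bit on average
  print(H([0.9, 0.1]))   # ~0.47: a biased coin needs less, on average
  print(H([0.25] * 4))   # 2.0  : the uniform distribution maximizes H over 4 outcomes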
10
Kullback-Leibler Divergence (Relative Entropy)
  • In words: the excess message length incurred by using
    a p(x)-optimized code for messages that actually follow q(x)
  • Properties, relation to H (see the sketch below)

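A minimal Python sketch; in the usual convention D(p||q) = sum_x p(x) log(p(x)/q(x)) is the per-symbol overhead of a code optimized for q applied to data that actually follow p (swap the arguments to match the slide's wording):

  import math

  def D(p, q):
      # Kullback-Leibler divergence D(p||q) in bits; assumes q > 0 wherever p > 0
      return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

  p = [0.5, 0.5]
  q = [0.9, 0.1]
  print(D(p, q))            # ~0.74 bits of overhead per symbol from assuming the wrong distribution
  print(D(p, p))            # 0.0: no penalty when the assumed distribution is correct
  print(D(p, q) == D(q, p)) # False: D is not symmetric, so it is not a true distance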
11
Mutual Information
  • Relationship to D, H (hint: conditional probability)
  • Properties, examples (see the sketch below)

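A minimal Python sketch computing I(X;Y) directly from a joint distribution (the dict-of-pairs representation and the example distributions are just illustrative):

  import math

  def I(joint):
      # Mutual information I(X;Y) in bits from a joint distribution {(x, y): prob}
      px, py = {}, {}
      for (x, y), p in joint.items():
          px[x] = px.get(x, 0.0) + p
          py[y] = py.get(y, 0.0) + p
      return sum(p * math.log2(p / (px[x] * py[y]))
                 for (x, y), p in joint.items() if p > 0)

  independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
  copied      = {(0, 0): 0.50, (1, 1): 0.50}
  print(I(independent))   # 0.0: independent variables share no information
  print(I(copied))        # 1.0: Y is a copy of X, so they share H(X) = 1 bit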
12
Entropy for Continuous RVs
  • Little h, defined in the natural way
  • However, it is not the same measure
  • h of a discrete RV is minus infinity, and H of a
    continuous RV is infinite (measure theory)
  • For many continuous distributions, h is (1/2) log
    (variance) plus some constant (see the sketch below)
  • Why?

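As an illustration of the "(1/2) log(variance) plus a constant" point: the differential entropy of a normal distribution is h = (1/2) ln(2*pi*e*variance) nats. A minimal Python check (the variances are just illustrative):

  import math

  def h_normal(variance):
      # Differential entropy (nats) of a normal distribution with the given variance
      return 0.5 * math.log(2 * math.pi * math.e * variance)

  for variance in (0.01, 1.0, 100.0):
      print(variance, h_normal(variance))
  # Each 100-fold increase in variance adds (1/2) * ln(100) ~ 2.3 nats,
  # and h can be negative (variance = 0.01), unlike discrete H.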
13
The Statistical Connection (1)
  • K-L D <-> likelihood ratio (see the sketch below)
  • The law of large numbers can be rephrased as a limit
    on D
  • Among distributions with the same variance, the normal
    is the one with maximum h
  • (2nd law of thermodynamics revisited)
  • h is an average quantity. Is the CLT, then, a
    law of nature? (I think YES!)

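A minimal Monte Carlo sketch of the K-L / likelihood-ratio connection: under the true distribution p, the average log-likelihood ratio log(p(x)/q(x)) converges, by the law of large numbers, to D(p||q). The distributions and sample size are just illustrative:

  import math, random

  p = [0.7, 0.3]   # true distribution
  q = [0.5, 0.5]   # alternative / model distribution

  random.seed(0)
  n = 100_000
  samples = random.choices([0, 1], weights=p, k=n)
  avg_llr = sum(math.log2(p[x] / q[x]) for x in samples) / n

  d_pq = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
  print(avg_llr, d_pq)   # both ~ 0.119 bits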
14
The Statistical Connection (2)
  • Mutual information is very useful
  • Certainly for discrete RVs
  • Also for continuous ones (no distributional assumptions!)
  • A lot of implications for stochastic processes,
    as well
  • I just don't quite understand them
  • English?

15
Machine Learning? (1)
  • So far, we haven't mentioned noise
  • In inf. theory, noise exists in the channel
  • Channel capacity: max (mutual information) between
    source and receiver (see the sketch below)
  • Noise directly decreases the capacity
  • Shannon's biggest result: this can be (almost)
    achieved with (almost) zero error
  • Known as the Channel Coding Theorem

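A standard worked example of capacity, the binary symmetric channel (chosen here only for illustration): a channel that flips each bit with probability p_err has capacity C = 1 - H(p_err) bits per use, so noise eats directly into capacity:

  import math

  def bsc_capacity(p_err):
      # Capacity (bits per use) of a binary symmetric channel with flip probability p_err
      if p_err in (0.0, 1.0):
          return 1.0
      h_noise = -p_err * math.log2(p_err) - (1 - p_err) * math.log2(1 - p_err)
      return 1.0 - h_noise   # the max over input distributions of I(source; receiver)

  for p_err in (0.0, 0.01, 0.1, 0.5):
      print(p_err, bsc_capacity(p_err))   # 1.0, ~0.92, ~0.53, 0.0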
16
Machine Learning? (2)
  • The CCT inspired practical developments
  • Now it all depends on code and channel!
  • Smarter, error-correcting codes
  • Tech developments focus on channel capacity

17
Machine Learning? (3)
  • Can you find an analogy between coding and
    classification/clustering? (Can it be useful?)

Coding              | M. Learning
Source entropy      | Variability of interest
Choice of channel   | Parameterization
Choice of code      | Classification rules
Channel noise       | Noise, random errors
Channel capacity    | Maximum accuracy
I(source, receiver) | Actual accuracy
18
Machine Learning? (4)
  • Inf. theory tells us that:
  • We CAN find a nearly optimal classification or
    clustering rule (coding)
  • We CAN find a nearly optimal parameterization +
    classification combo
  • Perhaps the newer wave of successful but
    statistically intractable methods (boosting,
    etc.) works by increasing channel capacity (i.e.,
    high-dimensional parameterization)?