EEL 4930-6 / 5930-5, Spring '06: Physical Limits of Computing

1
EEL 4930-6 / 5930-5, Spring '06
Physical Limits of Computing
http://www.eng.fsu.edu/mpf
  • Slides for a course taught by Michael P. Frank in
    the Department of Electrical & Computer
    Engineering

2
Physical Limits of Computing: Course Outline
Currently I am working on writing up a set of
course notes based on this outline, intended to
someday evolve into a textbook
  • I. Course Introduction
  • Moore's Law vs. Modern Physics
  • II. Foundations
  • Required Background Material in Computing &
    Physics
  • III. Fundamentals
  • The Deep Relationships between Physics and
    Computation
  • IV. Core Principles
  • The Two Revolutionary Paradigms of Physical
    Computation
  • V. Technologies
  • Present and Future Physical Mechanisms for the
    Practical Realization of Information Processing
  • VI. Conclusion

3
Part II. Foundations
  • This first part of the course quickly reviews
    some key background knowledge that you will need
    to be familiar with in order to follow the later
    material.
  • You may have seen some of this material before.
  • Part II is divided into two chapters:
  • Chapter II.A. The Theory of Information and
    Computation
  • Chapter II.B. Required Physics Background

4
Chapter II.A. The Theory of Information and
Computation
  • In this chapter of the course, we review a few
    important things that you need to know about:
  • II.A.1. Combinatorics, Probability, Statistics
  • II.A.2. Information & Communication Theory
  • II.A.3. The Theory of Computation

5
Section II.A.2. The Theory of Information and
Communication
  • This section is a gentle introduction to some of
    the basic concepts of information theory
  • also known as communication theory.
  • Sections:
  • (a) Basic Concepts of Information
  • (b) Quantifying Information
  • (c) Information vs. Entropy
  • (d) Communication Channels
  • Later in the course, we will describe Shannon's
    famous theorems concerning the fundamental limits
    of channel capacity.
  • As well as some newer, more general quantum
    limits on classical and quantum communication.

6
Subsection II.A.2.a Basic Concepts of
Information
  • Etymology of Information
  • Various Senses of Information
  • Information-Related Concepts

7
Etymology of Information
  • Earliest historical usage in English (from Oxford
    English Dictionary):
  • The act of informing,
  • As in education, instruction, training.
  • "Five books come down from Heaven for information
    of mankind." (1387)
  • Or a particular item of training, i.e., a
    particular instruction.
  • "Melibee had heard the great skills and reasons
    of Dame Prudence, and her wise informations and
    techniques." (1386)
  • Derived by adding the action-noun ending -ation
    (descended from Latin's -tio) to the
    pre-existing verb "to inform,"
  • Meaning to give form (shape) to the mind;
  • to discipline, instruct, teach.
  • "Men so wise should go and inform their kings."
    (1330)
  • And "inform" comes from Latin informare, derived
    from the noun forma (form),
  • Informare means "to give form to," or "to form an
    idea of."
  • Latin also even already contained the derived
    word informatio,
  • meaning concept or idea.
  • Note: The Greek words εἶδος (eídos) and μορφή
    (morphé),
  • Meaning "form" or "shape,"
  • were famously used by Plato (& later Aristotle)
    in a technical philosophical sense, to denote the
    true identity or ideal essence of something.
  • We'll see that our modern concept of physical
    information is not too dissimilar!

8
Information: Our Definition
  • Information is that which distinguishes one thing
    (entity) from another.
  • It is (all or part of) an identification or
    description of the thing.
  • It is a specification of (some or all of) the
    thing's properties or characteristics.
  • We can consider that every thing carries or
    embodies a complete description of itself.
  • It does this simply in virtue of its own being,
    its own existence.
  • In philosophy, this inherent description is
    called the entity's form or constitutive essence.

9
Specific Senses of Information
  • But, let us also take care to distinguish between
    the following uses of information:
  • A form or pattern of information
  • An abstract configuration of information, as
    opposed to a specific instantiation.
  • Many separate instances of information contained
    in separate objects may have identical patterns,
    or content.
  • We may say that those instances are copies of
    each other.
  • An instance or copy of information
  • A specific instantiation (i.e., as found in a
    specific entity) of some general form.
  • A holder or slot or location for storing
    information
  • An indefinite or variable (mutable) instance of
    information (or place where instances may be)
    that may take on different forms at different
    times or in different situations.
  • A wraith (pulse? cloud?) of information
  • A physical state or set of states, dynamically
    changing over time.
  • A moving, constantly-mutating instance of
    information, where the container of that
    information may even transition from one physical
    system to another.
  • A stream of information
  • An indefinitely large instance of information,
    extended over time
  • A piece of information
  • All or a portion of a pattern, instance, wraith,
    or stream of information.
  • A nugget of information
  • A piece of information, together with an
    associated semantic interpretation of that
    information.
  • A nugget is often implicitly a valuable,
    important fact (the meaning of a fact is a true
    statement).

10
Information-related concepts
  • It will also be convenient to discuss the
    following:
  • A container or embodiment of information
  • A physical system that contains some particular
    instance of, placeholder for, or pulse of
    information. (An embodiment contains nothing but
    that.)
  • A symbol or message
  • A form or instance of information or its
    embodiment produced with the intent that it
    should convey some specific meaning, or semantic
    content.
  • A message is typically a compound object
    containing a number of symbols.
  • An interpretation or meaning of information
  • A particular semantic interpretation of a form
    (pattern of information), tying it to potentially
    useful facts of interest.
  • May or may not be the originally intended
    meaning!
  • A representation of information
  • An encoding of one pattern of information within
    some other (frequently larger) pattern.
  • The representation goes according to some
    particular language or code.
  • A subject of information
  • An entity that is identified or described by a
    given pattern of information.
  • May be abstract or concrete, mathematical or
    physical

11
Information Concept Map
[Concept-map diagram: a Thing (subject or embodiment) has a
Form (pattern of information); a Form is instantiated by/in
an Instance (copy of a form), is represented by other forms,
and is interpreted to get a Meaning (interpretation of
information), which in turn describes or identifies the
Thing; a piece of information together with its meaning is a
Nugget (valuable piece of information); a Physical entity or
system contains, carries, or embodies Instances and has a
changing Wraith (dynamic body of information, a changing
cloud of states) that can move from one physical system to
another; a Quantity of information measures the size of a
Form, an Instance, or a Wraith.]
12
Example: A Byte in a Register
  • The bit-sequence 01000001 is a particular form.
  • Suppose there is an instance of this particular
    pattern of bits in a certain machine register in
    the computer on my desk.
  • The register hardware is a physical system that
    is a container of this particular instance.
  • The physical subsystem delimited by the high and
    low ranges of voltage levels on the register's
    storage nodes embodies this sequence of bits.
  • The register could hold other forms as well; it
    provides a holder that can contain an instance of
    any form that is a sequence of 8 bits.
  • When the register is erased later, the specific
    wraith of information that it contained will not
    be destroyed, but will only be released into the
    environment.
  • Although in an altered form which is scrambled
    beyond all hope of recognition.
  • The instance of 01000001 contained in the
    register at this moment happens to be intended to
    represent the letter "A" (which is another form,
    and is a symbol).
  • The meaning of this particular instance of the
    letter "A" is that a particular student's grade in
    this class (say, Joe Smith's) is an A.
  • The valuable nugget of information which is the
    fact that Joe has an A is also represented in
    this register.
  • The subject of this particular piece of
    information is Joe's grade in the class.
  • The quantity of information contained in the
    machine register is Log(256) = 1 byte = 8 bits,
    because the slot could hold any of 256 different
    forms (bit patterns).
  • But in the context of my grading application, the
    quantity of information contained in the message
    "A" is only Log(5) ≈ 2.32 bits, since only 5
    grades (A, B, C, D, F) are allowed.
  • The size of the form "A" is 8 bits in the context
    of the encoding being used.
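
As a quick numerical check of the two quantities above (a
sketch of ours, not from the slides; Python's math.log2
corresponds to choosing the bit as the logarithmic unit):

    import math

    # Capacity of the 8-bit register: Log(256), expressed in bits.
    print(math.log2(256))   # 8.0 bits

    # Information in the message "A" when only 5 grades (A,B,C,D,F) are possible:
    print(math.log2(5))     # ~2.32 bits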

13
Subsection II.A.2.b Quantifying Information
  • Capacity of Compound Holders
  • Logarithmic Information Measures
  • Indefinite Logarithms
  • Logarithmic Units

14
Quantifying Information
  • One way to quantify forms of information is to
    try to count how many distinct ones there are.
  • Unfortunately, the number of all conceivable
    forms is infinite.
  • However, we can count the forms in particular
    finite subsets
  • Consider a situation defined in such a way that a
    given information holder (in the context of that
    situation) can only take on forms that are chosen
    from among some definite, finite number N of
    possible distinct forms.
  • One way to try to characterize the informational
    size or capacity of the holder (the amount of
    information in it) would then be to simply
    specify the value of N, the number of forms it
    could have.
  • This would describe the amount of variability of
    its form.
  • However, the raw number N by itself does not seem
    to have the right mathematical properties to
    characterize the size of the holder in an
    intuitive way
  • Intuition tells us that the capacity of holders
    should be additive
  • E.g., it is intuitively clear that two pages of a
    book should be able to hold twice as much
    information as one page.

15
Compound Holders
  • Consider a holder of information C that is
    composed by taking two separate and independent
    holders of information A, B, and considering them
    together as constituting a single compound holder
    of information.
  • Suppose now also that A has N possible forms, and
    that B has M possible forms.
  • Clearly then, due to the product rule of
    combinatorics, C as a whole has N·M possible
    distinct forms.
  • Each is obtained by assigning a form to A and a
    form to B independently.
  • But should the size of the holder C be the
    product of the sizes of A and B?
  • It would seem more natural to say sum, so that
    the whole is the sum of the parts.
  • How can we arrange for this to be true?

[Diagram: Holder C, composed of Holder A (N possible forms)
and Holder B (M possible forms), has N·M possible forms.]
16
Measuring Information with Logarithms
  • Fortunately, we can convert the product to a sum
    by using logarithmic units for measuring
    information.
  • Due to the rule about logarithms that log(N·M) =
    log(N) + log(M).
  • So, if we declare that the size or capacity or
    amount of information in a holder of information
    is defined to be the logarithm of the number of
    different forms it can have,
  • Then we can say, the size of the compound holder
    C is the sum of the sizes of the sub-holders A
    and B that it comprises.
  • Only problem: What base do we use for the
    logarithm here?
  • Different bases would give different numeric
    answers.
  • Any base could be chosen by convention, but would
    be arbitrary.
  • A choice of a particular base amounts to choosing
    an information unit.
  • Arguably, the most elegant answer is
  • Leave the base unspecified, and declare that an
    amount of information is not a number, but rather
    is a dimensioned indefinite logarithm quantity.

17
Indefinite Logarithms
  • Definition. Indefinite logarithm. For any real
    number x > 0, let the indefinite logarithm of x,
    written Log x, be defined as
  • Log x ≡ λb.log_b x  (using lambda-calculus notation)
  • In other words, the value of Log x is a function
    object with one argument (b), where this function
    takes any value of the base b (> 1) and returns
    the resulting value of log_b x.
  • E.g., Log 256 ≡ λb.log_b 256 (a function of 1
    argument),
  • So for example, (Log 256)(2) = 8 and (Log 256)(16)
    = 2.
  • Sums, negations, and scalar multiples of
    indefinite logarithm objects can be defined quite
    naturally,
  • by simply operating on the value of the
    lambda-function in the corresponding way.
  • See the paper "The Indefinite Logarithm,
    Logarithmic Units, and the Nature of Entropy" in
    the readings for details.

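The lambda-calculus definition above translates almost
literally into code. The sketch below is our own illustration
(the class name IndefiniteLog and its operations are not from
the readings); it supports only evaluation at a base,
addition, and scalar multiplication:

    import math

    class IndefiniteLog:
        """Log x, represented as the function b -> log_b(x)."""
        def __init__(self, x):
            self.x = x
        def __call__(self, b):
            # Evaluate at a particular base b > 1.
            return math.log(self.x, b)
        def __add__(self, other):
            # Log x + Log y = Log(x*y), by the product rule for logarithms.
            return IndefiniteLog(self.x * other.x)
        def __rmul__(self, c):
            # c * Log x = Log(x**c).
            return IndefiniteLog(self.x ** c)

    L256 = IndefiniteLog(256)
    print(L256(2), L256(16))                          # 8.0 2.0, as on the slide
    print((IndefiniteLog(2) + IndefiniteLog(5))(10))  # Log 10 evaluated at base 10 -> 1.0
    print((4 * IndefiniteLog(2))(2))                  # 4*Log 2 = Log 16 -> 4.0 bits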
18
Indefinite Logarithms as Curves
  • The object Log N can also be identified with the
    entire curve or graph (point set) {(b, log_b N) :
    b > 1}.

Note: The Log 1 curve is 0 everywhere, the
Log 4 curve is twice as high as the Log 2 curve,
Log 10 = Log 2 + Log 5 and is
also equal to (log_2 10)·Log 2 ≈ 3.322·Log 2. In
general, each curve is just some constant
multiple of each other curve!
[Plot: the curves Log 10, Log 4, Log 3, Log 2, and Log 1 as
functions of the base b; larger values of the argument N
give higher curves.]
19
Indefinite Exponential
  • The inverse of the indefinite logarithm function
    could be called the indefinite exponential.
  • Definition. Indefinite exponential. Given any
    indefinite logarithm object L, the indefinite
    exponential of L, written Exp L, is defined
    by: Exp L ≡ b^L(b),
  • where b > 0 may be any real number.
  • This definition is meaningful because all values
    of b will give the same result x, since for any
    b, we have that b^L(b) = b^(log_b x) = x,
  • where x is the unique real number such that
    L = Log x.
  • Thus, Exp Log x = x and Log Exp L = L always.

20
Logarithmic Quantities Units
  • Theorem. Any given indefinite-logarithm quantity
    Log x is equal to a scalar multiple of any fixed
    indefinite-logarithm quantity, called a
    logarithmic unit L_u ≡ Log u, where u can be any
    standard chosen base > 0, and where the scalar
    coefficient is log_u x. Or, in symbols,
  • Log x = (log_u x)·L_u.
  • Example: Let the logarithmic unit b ≡ L_2 ≡
    Log 2 be called the "bit" (binary digit).
  • Then Log 16 = 4·Log 2 = 4 b (4 bits).

21
Some Common Logarithmic Units
  • Decibel: dB ≡ L_1.2589 ≡ Log 10^0.1
  • Binary digit or bit: b ≡ L_2 ≡ Log 2
  • A.k.a. the octave when used to measure tonal
    intervals in music.
  • Neper or nat: n ≡ L_e ≡ Log e
  • A.k.a. Boltzmann's constant k in physics, as we
    will see!
  • Octal digit: o ≡ L_8 = 3L_2 ≡ Log 8
  • Bel or decimal digit: d ≡ L_10 = 10 dB ≡ Log 10
  • A.k.a. order of magnitude, power of ten, decade,
    Richter-scale point.
  • Nibble or hex digit: h ≡ L_16 = 4L_2 ≡ Log 16
  • Byte or octet: B ≡ L_256 = 8L_2 ≡ Log 256
  • Kilobit, or really kibibit: kb ≡ Log 2^1,024
  • Joule per Kelvin: J/K ≈ Log 10^(3.14558e22)
    (roughly)
  • Units of physical entropy are equivalent to
    indefinite-logarithm units!

22
Conversions Between Different Logarithmic Units
  • Suppose we are given a logarithmic quantity Q
    expressed as a multiple of some logarithmic unit
    L_a (that is, Q = c_a·L_a, where c_a is a number),
  • and suppose that we wish to re-express Q in terms
    of some other logarithmic unit L_b, i.e., as Q =
    c_b·L_b.
  • The ratio between the two logarithmic units
    L_a ≡ Log a and L_b ≡ Log b is L_a/L_b = log_b a.
  • So, c_b = Q/L_b = c_a·L_a/L_b = c_a·log_b a.
  • And so Q = (c_a·log_b a)·L_b.
  • Example. How many nats are there in a byte?
  • The quantity to convert is 8 bits, Q = 1 B = 8 b
    = 8·L_2.
  • The answer should be in units of nats, n ≡ L_e.
  • Thus, Q = (8·log_e 2)·L_e = (8·0.693) n ≈ 5.55 nats.
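
The conversion rule Q = (c_a·log_b a)·L_b is easy to
mechanize; the helper below is a sketch (the function name
is ours):

    import math

    def convert(c_a, base_a, base_b):
        """Re-express c_a units of L_a = Log(base_a) as a multiple of L_b = Log(base_b)."""
        return c_a * math.log(base_a, base_b)

    # How many nats in a byte?  1 B = 8 b = 8*Log 2, target unit n = Log e.
    print(convert(8, 2, math.e))    # ~5.545 nats (the slide rounds to 5.55)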

23
Capacity of a Holder of Information
  • After all of that, we can now make the following
    definition
  • Definition. The (information) capacity C of (or
    the amount of information I contained in) a
    holder of information H that has N possible forms
    is defined to be the indefinite logarithm of N,
    that is, C_H ≡ Log N.
  • Like any indefinite logarithm quantity, C_H can be
    expressed in any of the units previously
    discussed: bits, nats, etc.
  • Example. A memory chip manufacturer develops a
    memory technology in which each memory cell
    (capacitor) can reliably hold any of 16
    distinguishable logic voltage levels. (For
    example, 0 V, 0.1 V, 0.2 V, ..., 1.5 V.) Therefore,
    what is the information capacity of each cell
    (that is, of its logical macrostate, as defined
    by this set of levels), when expressed both in
    bits, and in k_B units (nats)?
  • Answer. N = 16, so C_cell = Log 16
  • Log 16 = (log_2 16)·Log 2 = 4·Log 2 = 4 bits
  • Log 16 = (log_e 16)·Log e ≈ 2.77·Log e ≈ 2.8 k_B

24
Subsection II.A.2.c Information and Entropy
  • Complexity of a Form
  • Optimal Encoding of a Form
  • Entropy of a Probability Distribution
  • Marginal & Conditional Entropy
  • Mutual Information

25
Quantifying The Size of a Form
  • In the previous subsection, we managed to
    quantify the size or capacity of a holder of
    information, as the indefinite logarithm of the
    number of different forms that it could have.
  • But, that doesn't tell us how we might quantify
    the size or complexity of a specific form, by
    itself.
  • What about saying "The size of a given form is
    the capacity of the smallest holder that could
    have that form"?
  • The problem with that definition is:
  • We can always imagine a holder constrained to
    only have that form and no other.
  • Its capacity would be Log 1 = 0.
  • So, with this definition the size of all forms
    would be 0.
  • Another idea: Let's measure the size of a form
    in terms of the number of small, unit-size pieces
    of information that it takes to describe it.
  • To do this, we need some language that we can use
    to represent forms, e.g., an encoding of forms in
    terms of sequences of symbols.
  • Given a language, we can then say that the size
    or informational complexity K of a given form is
    the length of the shortest symbol string that
    represents it in our chosen language.
  • Its maximally compressed description.
  • At first, this definition seems pretty ambiguous,
    but

26
Why Complexity is Meaningful
  • In their algorithmic information theory,
    Kolmogorov and Chaitin showed that informational
    complexity is almost language-independent, up to
    a fixed (language-dependent) additive constant.
  • In the case of universal (Turing-complete)
    languages.
  • Also, whenever we have a probability distribution
    over a set of possible forms, Shannon showed us
    how to choose an encoding of the forms that
    minimizes the expected size of the codeword
    instance that is needed.
  • This choice of encoding then minimizes the
    expected complexity of the forms under the given
    distribution.
  • If such a probability distribution is available,
    we can then assume that the language has been
    chosen appropriately so as to minimize the
    expected length of the form's shortest
    description.
  • We define this minimized expected complexity to
    be the entropy of the system under the given
    distribution.

27
Definition of Entropy
  • The definition of entropy which we just stated is
    very important, and well worth repeating.
  • Definition. Entropy. Given a probability
    distribution over a set of forms, the entropy of
    that distribution is the expected form
    complexity, according to whatever encoding of
    forms yields the smallest expected complexity.
  • The definition can be applied to also define the
    entropy of information holders, wraiths, etc.
    that have given probability distributions over
    their possible forms.

28
The Optimal Encoding
  • Suppose a specific form F has probability p.
  • Thus, it has improbability i ≡ 1/p.
  • Note that this is the same probability that F
    would have if it were one of i equally-likely
    forms.
  • We saw earlier that a holder of information
    having i possible forms is characterized as
    containing a quantity of information Log i.
  • So, it seems reasonable to declare that the
    complexity K of the form F itself is, in fact,
    K(F) = Log i = Log(1/p) = −Log p.
  • This suggests that in the optimal encoding
    language, the description of the form F could be
    held in a holder of that capacity.
  • In his "Mathematical Theory of Communication"
    (1949), Claude Shannon showed that in fact this is
    exactly correct,
  • In an asymptotic limit where we permit ourselves
    to consider encodings in which many similar
    systems (whose forms are chosen from the same
    distribution) are described together.
  • Modern block-coding schemes (turbo codes, etc.)
    in fact closely approach Shannon's ideal encoding
    efficiency.

29
Optimal Encoding Example
  • Suppose a system has four forms A, B, C, D with
    the following probabilities:
  • p(A) = ½, p(B) = ¼, p(C) = p(D) = 1/8.
  • Note that the probabilities sum to 1, as they
    must.
  • Then the corresponding improbabilities are:
  • i(A) = 2, i(B) = 4, i(C) = i(D) = 8.
  • And the form sizes (log-improbabilities) are:
  • K(A) = Log 2 = 1 bit, K(B) = Log 4 = 2·Log 2 =
    2 bits, K(C) = K(D) = Log 8 = 3·Log 2 = 3
    bits.
  • Indeed, in this example, we can encode the forms
    using bit-strings of exactly these lengths, as
    follows:
  • A = 0, B = 10, C = 110, D = 111.
  • Note that this code is self-delimiting:
  • the symbols can be concatenated together without
    ambiguity (see the decoding sketch below).

[Diagram: binary code tree; branch 0 → A; 1,0 → B;
1,1,0 → C; 1,1,1 → D.]
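The self-delimiting property is easy to demonstrate
mechanically; the sketch below (our own, using the codeword
table from the slide) decodes a concatenated bit stream
greedily, which works precisely because no codeword is a
prefix of another:

    CODE = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
    DECODE = {v: k for k, v in CODE.items()}

    def encode(symbols):
        return ''.join(CODE[s] for s in symbols)

    def decode(bits):
        out, buf = [], ''
        for bit in bits:
            buf += bit
            if buf in DECODE:          # a complete codeword has been read
                out.append(DECODE[buf])
                buf = ''
        return ''.join(out)

    msg = 'ABACDDBA'
    print(encode(msg))                  # 0100110111111100
    assert decode(encode(msg)) == msg   # round-trips without any separators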
30
The Entropy Formula
  • Naturally, if we have a probability distribution
    over the possible forms F of some system (holder
    of information),
  • We can easily calculate the expected complexity K
    of the system's form, which is the entropy S of
    the system.
  • This is possible since K itself is a random
    variable,
  • a function of the event that the system has a
    specific form F.
  • The entropy S of the system is then
    S = ⟨K⟩ = Σ_F p(F)·Log(1/p(F)).
  • We can also view this formula as a simple
    additive sum of the entropy contributions s = pK =
    p·Log p⁻¹ arising from the individual forms.
  • The largest single contribution to entropy comes
    from individual forms that have probability p =
    1/e, in which case s = (Log e)/e ≈ 0.531 bits.
  • The entropy formula is often credited to Shannon,
    but it was already known & was being used by
    Boltzmann in the 1800s.
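
A direct transcription of the formula into code, measuring
in bits (a sketch; the function name is ours):

    import math

    def entropy_bits(probs):
        """S = sum over forms of p * log2(1/p), i.e. the expected complexity in bits."""
        return sum(p * math.log2(1/p) for p in probs if p > 0)

    # The four-form distribution from the optimal-encoding example:
    print(entropy_bits([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits

    # A single form's contribution s = p*log2(1/p) peaks at p = 1/e:
    p = 1/math.e
    print(p * math.log2(1/p))                   # ~0.531 bits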

31
Visualizing the Contributions toEntropy in a
Probability Distribution
[Plot: the contribution s = p·Log(1/p) to the entropy, as a
function of the form's probability p.]
32
Known vs. Unknown Information
  • We can consider the informational capacity I ≡
    Log N of a holder that is defined as having N
    possible forms as telling us the total amount of
    information that the holder contains.
  • Meanwhile, we can consider its entropy S =
    ⟨Log i(f)⟩ as telling us how much of the total
    information that it contains is unknown to us.
  • How much unknown information the holder contains,
    in the perspective specified by the distribution
    p().
  • Since S ≤ I, we can also define the amount of
    known information (hereby dubbed "extropy")
    contained in the holder as X ≡ I − S.
  • Note that our probability distribution p() over
    the holder's form could change (if we gain or
    lose knowledge about it),
  • Thus, the holder's entropy S and extropy X may
    also change.
  • However, note that the total informational size
    of a given holder, I = Log N = X + S, always
    still remains a constant.
  • Entropy and extropy can be viewed as two forms of
    information, which can be converted to each
    other, but whose total amount is conserved.

33
Information/Entropy Example
  • Consider a tetrahedral die which may lie on any
    of its 4 faces labeled 1, 2, 3, 4.
  • We say that the answer to the question "Which
    side is down?" is a holder of information having
    4 possible forms.
  • Thus, the total amount of information contained
    in this holder, and in the orientation of the
    physical die itself, is Log 4 = 2 bits.
  • Now, suppose the die is weighted so that p(1) = ½,
    p(2) = ¼, and p(3) = p(4) = 1/8 for its post-throw
    state.
  • Then K(1) = 1 b, K(2) = 2 b, and K(3) = K(4) = 3 b.
  • The entropy of the holder is then S = 1.75 bits.
  • This much information remains unknown to us
    before we have taken a look at the thrown
    die.
  • The extropy (known information) is already X =
    0.25 bits.
  • Exactly one-fourth of a bit's worth of knowledge
    about the outcome is already expressed by this
    specific probability distribution p().
  • This much information about the die's state is
    already known to us even before we have looked at
    it.
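
The die's numbers can be reproduced in a few lines (a sketch
of ours, reusing the same entropy formula as above):

    import math

    probs = [1/2, 1/4, 1/8, 1/8]                # weighted tetrahedral die
    I = math.log2(len(probs))                   # total information: Log 4 = 2 bits
    S = sum(p * math.log2(1/p) for p in probs)  # entropy: 1.75 bits (unknown info)
    X = I - S                                   # extropy: 0.25 bits (known info)
    print(I, S, X)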

34
Holder = Variable, Form = Value, and Types of Events
  • A holder corresponds to a variable V.
  • Also associated with a set of possible values
    v1, v2, ....
  • Meanwhile, a form corresponds to a value v of
    that variable.
  • A primitive event is a proposition that assigns a
    specific form v to a specific holder, V = v.
  • I.e., a specific value to a specific variable.
  • A compound event is a conjunctive proposition
    that assigns forms to multiple holders,
  • E.g., V = v, U = u, W = w.
  • A general event is a disjunctive set of primitive
    and/or compound events.
  • Essentially equivalent to a Boolean combination
    of assignment propositions.

35
Four Concepts to Distinguish
  • A set corresponds to (is a)
  • System, state space, sample space, situation
    space, outcome space.
  • A partitioning of the set is a
  • Subsystem, state variable, mutex/ex set of
    events
  • A section of the partitioning, or a subset of the
    set, is a
  • Subsystem state, macrostate, value of variable,
    event, abstract proposition
  • An individual element is
  • System configuration, microstate, primitive
    event, complete outcome.

36
Entropy of a Binary Variable
Below, little s of an individual form or
probability denotes the contribution to the total
entropy of a form with that probability.
[Plot: s(p) = p·Log(1/p) for a binary variable.
Maximum s(p) = (1/e) nat = (lg e)/e bits ≈ 0.531 bits,
at p = 1/e ≈ 0.368.]
37
Proof that a form with improbability e contributes
the most to the entropy
  • Let's find the slope of the s curve:
  • Take the derivative, using standard calculus:
    ds/dp = d/dp (p·Log p⁻¹) = Log p⁻¹ + p·(d/dp Log p⁻¹).
  • But now, what's the derivative of an indefinite
    logarithm quantity like Log p⁻¹?
  • Let's rewrite Log p⁻¹ as k·ln p⁻¹ (where the
    constant k ≡ L_e ≡ Log e is the indefinite log of
    e), so then d/dp (Log p⁻¹) = −k/p.
  • Plugging this in to the earlier equation, we get
    ds/dp = Log p⁻¹ − k = Log p⁻¹ − Log e.
  • Now just set this to 0 and solve for p:
    Log p⁻¹ = Log e, so p⁻¹ = e, i.e. p = 1/e.

38
Joint Distributions over Two Holders
  • Let X, Y be two holders, each with many forms:
    x1, x2, ... and y1, y2, ....
  • Let xy represent the compound event X = x, Y = y.
  • Note the set of all xy's is a mutually exclusive
    and exhaustive set.
  • Suppose we have available a joint probability
    distribution p(xy) over the compound holder XY.
  • This then implies the reduced or marginal
    distributions p(x) = Σ_y p(xy) and p(y) = Σ_x p(xy).
  • We also thus have conditional probabilities
    p(x|y) and p(y|x), according to the usual
    definitions.
  • And we have mutual probability ratios r(xy) ≡
    p(xy)/(p(x)·p(y)).

39
Joint, Marginal, & Conditional Entropy, and Mutual
Information
  • The joint entropy S(XY) ≡ ⟨Log i(xy)⟩.
  • The (prior, marginal, or reduced) entropy S(X) ≡
    S(p(x)) = ⟨Log i(x)⟩. Likewise for S(Y).
  • The entropy of each subsystem, taken by itself.
  • Entropy is subadditive: S(XY) ≤ S(X) + S(Y).
  • The conditional entropy S(X|Y) ≡ E_y[S(p(x|y))].
  • The expected entropy after Y is observed.
  • Theorem: S(X|Y) = S(XY) − S(Y). Joint entropy
    minus that of Y.
  • The mutual information I(X;Y) ≡ ⟨Log r(xy)⟩.
  • We will also prove: Theorem: I(X;Y) = S(X) −
    S(X|Y).
  • Thus the mutual information is the expected
    reduction of entropy in either subsystem as a
    result of observing the other.

40
Conditional Entropy Theorem
The conditional entropy of X given Y is the joint
entropy of XY minus the entropy of Y.
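A compact derivation of this theorem, in the notation used
above (this particular write-up is ours), is:

\begin{aligned}
S(X|Y) &= \mathrm{E}_y\!\big[S(p(x|y))\big]
        = \sum_{x,y} p(xy)\,\mathrm{Log}\,\frac{1}{p(x|y)} \\
       &= \sum_{x,y} p(xy)\left[\mathrm{Log}\,\frac{1}{p(xy)}
          - \mathrm{Log}\,\frac{1}{p(y)}\right]
        = S(XY) - S(Y).
\end{aligned}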
41
Mutual Information is Mutual Reduction in Entropy
And likewise, we also have I(X;Y) = S(Y) −
S(Y|X), since the definition of I is symmetric.
Also, I(X;Y) = S(X) + S(Y) − S(XY).
42
Visualization of Mutual Information
  • Let the total length of the bar below represent
    the total amount of entropy in the system XY.

[Bar diagram: the full bar represents S(XY), the joint
entropy of X and Y. It splits as S(X) followed by S(Y|X)
(the conditional entropy of Y given X), and equally as
S(X|Y) (the conditional entropy of X given Y) followed by
S(Y); the overlap of S(X) and S(Y) is the mutual
information.]
43
Example 1
  • Suppose the sample space of primitive events
    consists of 5-bit strings B = b1b2b3b4b5.
  • Chosen at random with equal probability (1/32).
  • Let variable X = b1b2b3b4, and Y = b3b4b5.
  • Then S(X) = 4 bits, and S(Y) = 3 b.
  • Meanwhile S(XY) = 5 b.
  • Thus S(X|Y) = 2 b, and S(Y|X) = 1 b.
  • And so I(X;Y) = 2 b.
44
Example 2
  • Let the sample space A consist of the 8 letters
    {a, b, c, d, e, f, g, h}. (All equally likely.)
  • Let X partition A into x1 = {a,b,c,d} and
    x2 = {e,f,g,h}.
  • Y partitions A into y1 = {a,b,e}, y2 = {c,f},
    y3 = {d,g,h}.
  • Then we have:
  • S(X) = 1 bit.
  • S(Y) = 2·(3/8 · log 8/3) + (1/4 · log 4) ≈ 1.561278
    bits
  • S(Y|X) = (1/2 · log 2) + 2·(1/4 · log 4) = 1.5 bits.
  • I(X;Y) = 1.561278 b − 1.5 b ≈ 0.061278 b.
  • S(XY) = 1 b + 1.5 b = 2.5 b.
  • S(X|Y) = 1 b − 0.061278 b ≈ 0.938722 b.

[Diagram: the 8 letters in a 2×4 grid; the top row a,b,c,d
is x1 and the bottom row e,f,g,h is x2, with the Y partition
{a,b,e}, {c,f}, {d,g,h} overlaid.]
(Meanwhile, the total information content of the
sample space is Log 8 = 3 bits.)
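
The figures in Example 2 can be verified mechanically; the
sketch below (our own helper code, not part of the course
materials) rebuilds the two partitions over the eight
equiprobable letters and evaluates the entropies:

    import math
    from collections import Counter

    letters = list('abcdefgh')                                  # 8 equally likely outcomes
    X = {c: 'x1' if c in 'abcd' else 'x2' for c in letters}
    Y = {c: 'y1' if c in 'abe' else ('y2' if c in 'cf' else 'y3') for c in letters}

    def entropy(labels):
        """Entropy, in bits, of a labeling of the equiprobable sample space."""
        n = len(labels)
        return sum((k/n) * math.log2(n/k) for k in Counter(labels).values())

    S_X  = entropy([X[c] for c in letters])              # 1.0 bit
    S_Y  = entropy([Y[c] for c in letters])              # ~1.561278 bits
    S_XY = entropy([(X[c], Y[c]) for c in letters])      # 2.5 bits
    I_XY = S_X + S_Y - S_XY                              # ~0.061278 bits
    print(S_X, S_Y, S_XY, I_XY, S_XY - S_Y, S_XY - S_X)  # last two: S(X|Y), S(Y|X)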
45
Effective Entropy?
  • In many situations, using the ideal Shannon
    compression may not be feasible in practice.
  • E.g., too few instances, block coding not
    available, no source model
  • Or, a short algorithmic description of a given
    form might exist, but it might be infeasible to
    compute it
  • However, given the following:
  • A holder with an associated set of forms
  • A probability distribution over the forms
  • A particular encoding strategy
  • E.g., an effective (short run-time) compression
    algorithm
  • we can define the effective entropy of the holder
    in this situation to be the expected compressed
    size of its encoded form, as compressed by the
    available algorithm.
  • This then is the definition of what the entropy
    can be considered to be for practical purposes
    given the capabilities in that situation.

46
Subsection II.A.2.d Communication Channels
  • Shannon's Paradigm
  • Channel Capacity
  • Shannon's Theorems

47
Communication Theory
  • Shannon's paper "A Mathematical Theory of
    Communication" (1948) is the seminal work that
    established the field of Communication Theory,
  • a.k.a. Information Theory.
  • It deals with the theory of noiseless and noisy
    communication channels for transmitting messages
    consisting of sequences of symbols chosen from a
    probability distribution,
  • Where the channel can be any medium or process
    for communicating information through space
    and/or time.
  • Shannon proves (among other things) that every
    channel has a certain capacity for transmitting
    information, and that this capacity is related to
    the entropy of the source and channel probability
    distributions.
  • At rates less than the channel's capacity, coding
    schemes exist that can transmit information with
    an arbitrarily small probability of error.

48
Shannons Paradigm
  • A communication system is any system intended for
    communicating messages (nuggets of information)
  • Selected from among some set of possible
    messages.
  • Often, the set of possible messages must be
    astronomically large.
  • In general, such a system will include the
    following six basic components:
  • (1) Information Source   (4) Noise Source
  • (2) Transmitter          (5) Receiver
  • (3) Channel              (6) Destination

[Block diagram: Information Source → (Message) →
Transmitter → (Signal) → Channel → (Received Signal) →
Receiver → (Message) → Destination, with the Noise Source
injecting Noise into the Channel.]
49
Discrete Noiseless Channels
  • A channel is simply any medium for the
    communication of signals, which carry messages,
  • Meaningful instances of information.
  • A discrete channel supports the communication of
    discrete signals consisting of sequences (or
    other kinds of patterns) made up of discrete
    (distinguishable) symbols.
  • There may be constraints on what sequences are
    allowed.
  • If the channel is noiseless, we can assume that
    the signals are communicated exactly from
    transmitter to receiver.
  • Noisy channels will be addressed in a later part
    of the theory.
  • The information transmission capacity C of a
    discrete noiseless channel can be defined as
    C ≡ lim_{t→∞} Log N(t) / t,
  • where t is the duration of the signal (in time)
    and N(t) is the number of mutually
    distinguishable signals of duration t.
  • This is just the asymptotic information capacity
    of the channel (considered as a container of
    information) per unit time.

50
Ergodic Information Sources
  • In general, we can consider the information
    source to be producing a stream of information of
    unbounded length.
  • Even if the individual messages are short, we can
    always consider situations where there are
    unbounded sequences of such messages.
  • For the theory to apply, we must consider the
    source to be produced by an ergodic process.
  • This is a process for which all streams look
    statistically similar in the long run
  • In the limit of sufficiently long streams
  • A discrete ergodic process can be modeled by a
    Hidden Markov Model (HMM) with a unique
    stationary distribution.
  • An HMM is essentially just a Finite State Machine
    with nondeterministic transitions between states,
    and no input
  • But with output, which may be nondeterministic
    also
  • A stationary distribution is just a probability
    distribution over states that is an eigenvector
    (with eigenvalue 1) of the HMM's transition
    probability matrix.

51
Example Ergodic Source
  • This Markov model qualifies as an ergodic source:
  • The equilibrium distribution for this particular
    Markov process is pA = 1/6, pB = 5/6.
  • A typical long output string will be 1/6 A's, 5/6
    B's.
  • Also, a B is more likely immediately following
    another B.
  • Since the machine has memory (more than 1 state),
    note that successive symbols are in general not
    independent of each other!

[State diagram and transition matrix for the two-symbol
source; reading off the labels: P(A|A) = 0.5, P(B|A) = 0.5,
P(A|B) = 0.1, P(B|B) = 0.9.]
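
With the transition probabilities read off the diagram (as
reconstructed above), a short simulation confirms the stated
equilibrium; this is a sketch of ours, not part of the
slides:

    import random

    TRANS = {'A': {'A': 0.5, 'B': 0.5},      # P(next | current), per the diagram above
             'B': {'A': 0.1, 'B': 0.9}}

    def sample_stream(n, start='B', seed=0):
        random.seed(seed)
        state, out = start, []
        for _ in range(n):
            state = 'A' if random.random() < TRANS[state]['A'] else 'B'
            out.append(state)
        return out

    stream = sample_stream(100_000)
    print(stream.count('A') / len(stream))   # ~0.167, i.e. about pA = 1/6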
52
Noiseless Coding Theorem
  • For a given discrete ergodic process, let pi be
    the distribution of symbol probabilities, in the
    stationary (equilibrium) distribution. [this is
    not quite right]
  • And let H be the entropy of this distribution.
  • With a probability that approaches 1 as n → ∞, a
    typical sequence of n symbols produced by the
    process will have improbability i ≈ Exp nH,
  • And therefore, we can represent this block of
    symbols by a codeword with size S = Log i ≈
    nH.
  • As n gets large, the relative inefficiency that
    results from using codewords that are integer
    multiples of some fixed unit (e.g. a bit) in size
    approaches 0.

53
Formal Version of Theorem
  • Theorem. Shannon's Noiseless Coding Theorem. For
    any ε, δ > 0, there is an n₀ such that for any
    length n > n₀, the sequences of length n all fall
    into two sets:
  • Atypical sequences, whose total probability is
    less than ε.
  • Typical sequences, each of whose probability
    satisfies |K/n − H| < δ,
  • where K = log i = log p⁻¹ is the particular
    sequence's complexity.
  • Due to the theorem, codewords of size S ≈ nH will
    be within nδ of the optimal code length (the
    sequence complexity) K, for all of the typical
    sequences.
  • The atypical sequences will require longer
    codewords (or cause errors), but they can be made
    as rare as desired.
  • The efficiency of such a code approaches optimal.

54
Existence of Near-Optimal Code
  • Let N_n(ε) be the number of sequences of length n,
    except for as many of the least-probable
    sequences as we can gather but still have total
    probability ≤ ε.
  • Theorem: For all values of ε in the range 0 < ε < 1,
    lim_{n→∞} Log N_n(ε) / n = H.
  • Implication: For large values of n, the number of
    "likely" sequences N_n(ε) roughly matches the
    number Exp nH of different codewords of size nH.
  • Thus, there are not too many such sequences for
    us to be able to code for each one with a unique
    codeword.

55
Noiseless Coding Example
  • In the previous example, the entropy turns out to
    be H = 0.650022422 bits.
  • Thus, for example, an average sequence of, say, n =
    20 symbols will be encoded with 20H ≈
    13.00044843 bits (almost exactly 13 bits) in an
    ideal code.
  • Here is a fairly typical sequence of 20 symbols
    (here there are 3/20 ≈ 1/6 A's):
  • BBBBBABBBABBBBBBABBB
  • Probability ≈ 2.6478×10⁻⁵; Improbability (1 in)
    37,767; Complexity ≈ 15.205 bits
  • Here is an atypical sequence of 20 symbols
    (17/20 ≫ 1/6 A's):
  • AAAABAAAAAAABAAAAAAB
  • Probability ≈ 1.2716×10⁻⁸; Improbability (1 in)
    78,643,200; Complexity ≈ 26.229 bits
  • So, e.g., one fairly efficient encoding for this
    particular ergodic source would be:
  • Chop the data stream into 20-symbol blocks.
  • Sort all 2²⁰ of the possible 20-symbol sequences
    by order of their probability.
  • Assign the 2¹³ = 8,192 most-likely sequences (the
    typical sequences) unique binary codewords
    consisting of a 1, followed by a 13-bit sequence
    number.
  • Assign the other 2²⁰ − 2¹³ = 1,040,384 sequences
    (the atypical sequences) longer codewords
    consisting of, say, a 0 followed by all 20 bits
    of the full symbol string.
  • Shannon's theorem tells us that all of the
    atypical sequences put together have some
    relatively small total probability ε.
  • And if ε is not small enough to be ignored, we
    can simply choose a larger block size n.
  • Assuming ε ≪ 1, the expected encoding size of each
    block will be ≈ 14 bits.
  • Note this is only one bit larger than the ideal
    size of ≈ 13 bits.
  • The extra bit needed to mark the sequence as
    typical becomes negligible for large n.
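
The per-sequence numbers quoted above can be reproduced
directly. The sketch below is ours; it uses the
reconstructed transition probabilities, and (matching the
slide's H, with the caveat noted on the earlier slide) the
entropy of the stationary symbol distribution rather than
the chain's true entropy rate:

    import math

    P0    = {'A': 1/6, 'B': 5/6}                        # stationary symbol distribution
    TRANS = {'A': {'A': 0.5, 'B': 0.5},                 # P(next | current)
             'B': {'A': 0.1, 'B': 0.9}}

    def complexity_bits(seq):
        """K = log2(1/p) of a sequence under the Markov source model."""
        logp = math.log2(P0[seq[0]])
        for prev, cur in zip(seq, seq[1:]):
            logp += math.log2(TRANS[prev][cur])
        return -logp

    H = sum(p * math.log2(1/p) for p in P0.values())
    print(H, 20 * H)                                    # ~0.650022 bits, ~13.0004 bits/block
    print(complexity_bits('BBBBBABBBABBBBBBABBB'))      # ~15.205 bits (typical sequence)
    print(complexity_bits('AAAABAAAAAAABAAAAAAB'))      # ~26.229 bits (atypical sequence)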

56
Actual Data Relating to Example
57
Roulette Wheel Coding
  • An algorithm for coding continuous data streams
    from known ergodic information sources.
  • Doesn't require data blocking!
  • We imagine that a single spin of a very large
    roulette wheel generates the entire infinite
    input data sequence:
  • As the wheel slowly comes to a halt, infinitely
    many symbols are generated.
  • Each possible initial subsequence maps to a
    region of the wheel orientations having an
    angular size that is proportional to the
    subsequence's probability.
  • Simultaneously, the sequence's encoding is
    generated in the same way,
  • But with a different symbol distribution that
    lends itself toward efficient transmission (e.g.,
    equiprobable 0s and 1s).

58
Roulette Wheel Example
  • Here is how the roulette wheel is marked for the
    particular ergodic process described earlier:
  • This is the mapping from input code sequences to
    wheel positions.
  • The wheel region for each sequence is
    proportional to its probability.

[Diagram: the wheel's 0° to 360° circumference is subdivided
recursively. The 1st symbol splits it into A (1/6) and B
(5/6); within each region the 2nd symbol splits again, e.g.
the B region into A (5/6 × 1/10) and B (5/6 × 9/10); and so
on for the 3rd symbol, etc.]
59
Roulette Wheel Output Code
  • The output codespace divides up the wheel
    similarly to (but more regularly than) the input
    one:
  • Note: Output sequence lengths are exactly
    their Log i values!
  • This is why the code is so efficient.

[Diagram: the output code divides the 0° to 360° wheel into
successive binary halves, quarters, eighths, etc., each
labeled 0 or 1 at every level.]
60
Roulette Wheel Algorithm
  • Main variable in the algorithm:
  • A current range R of possible roulette wheel
    positions, initially [0, 1].
  • Representing wheel pointer angles in the range
    from 0° to 360°.
  • Places on the wheel where the pointer might end
    up pointing (or ball bouncing).
  • Encoding algorithm (see the sketch below):
  • 1. When the first input symbol is received,
    initialize R to a sub-region of [0, 1]
    corresponding to the symbol, of size equal to its
    probability.
  • E.g., for our example, initially A → [0, 1/6], B
    → [1/6, 1].
  • 2. Whenever R becomes a sub-range of [0, ½] or
    [½, 1],
  • output 0 or 1 respectively,
  • And re-map R from its respective size-1/2
    sub-range back up into the whole [0, 1] region.
    (I.e., translate it appropriately and double it
    in size.)
  • 3. When the next symbol is received, narrow R
    down to a corresponding subrange of size
    proportional to the probability of that
    transition.
  • E.g., if B was just received, then an A takes the
    present range down to its first 1/10, and a B
    takes it to its last 9/10.
  • 4. Repeat steps 2 and 3 indefinitely (or include
    a special stop symbol).
  • The decoding algorithm is very similar.
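
A minimal sketch of the encoding loop for this particular
two-symbol source, in floating point for brevity (names and
details are ours; as the Cons on the next slide note, a real
implementation needs unbounded-precision arithmetic, and
this toy version does not flush bits left pending in the
final range):

    # Arithmetic-coding-style encoder for the example A/B Markov source.
    TRANS = {None: {'A': 1/6, 'B': 5/6},     # first-symbol probabilities
             'A':  {'A': 0.5, 'B': 0.5},     # P(next | current)
             'B':  {'A': 0.1, 'B': 0.9}}

    def encode(symbols):
        lo, hi, prev, out = 0.0, 1.0, None, []
        for s in symbols:
            # Steps 1/3: narrow the range R = [lo, hi) in proportion to the symbol's probability.
            split = lo + (hi - lo) * TRANS[prev]['A']
            lo, hi = (lo, split) if s == 'A' else (split, hi)
            prev = s
            # Step 2: while R fits entirely in [0, 1/2) or [1/2, 1), emit a bit and rescale.
            while hi <= 0.5 or lo >= 0.5:
                if hi <= 0.5:
                    out.append('0'); lo, hi = 2 * lo, 2 * hi
                else:
                    out.append('1'); lo, hi = 2 * lo - 1, 2 * hi - 1
        return ''.join(out)

    bits = encode('BBBBBABBBABBBBBBABBB')
    print(len(bits), bits)   # on the order of this block's ~15-bit complexity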

61
Pros and Cons of the Algorithm
  • Pros:
  • Handles continuous data streams
  • Doesn't require knowing length of stream in
    advance, or blocking
  • It's an on-line algorithm
  • It doesn't require a preprocessing phase to
    construct the code
  • It only requires tracking the range & the last
    symbol seen
  • Extremely simple to implement; can do it in HW,
    low latency
  • Yields an asymptotically optimal code (prove
    this!)
  • In the case of sources modeled by (non-hidden)
    Markov chains
  • Cons:
  • Not adaptive: requires knowing the source model
  • Although this could be fixed in a more
    sophisticated version
  • It's only optimal for non-hidden Markov models
  • To do more optimally for cases where a full HMM
    is needed, the algorithm would need to remember
    more past symbols
  • Requires unbounded precision in arithmetic!
  • Otherwise, rounding errors accumulate and are
    rapidly magnified

62
Additional Communication Theory Topics to Include
Eventually
  • I don't yet have slides dealing with:
  • Noisy channel coding theorems
  • Error correcting codes
  • Continuous channels
  • Quantum channels
  • To save time, we won't cover these topics right
    now,
  • Though we will hopefully get to some basic
    elements of quantum communication theory later on
    in the course.