1 MACHINE LEARNING
Information Theory and the Neuron - II
Aude Billard
2 Overview
- LECTURE I
- Neuron Biological Inspiration
- Information Theory and the Neuron
- Weight Decay / Anti-Hebbian Learning → PCA
- Anti-Hebbian Learning → ICA
- LECTURE II
- Capacity of the Single Neuron
- Capacity of Associative Memories (Willshaw Net, Extended Hopfield Network)
- LECTURE III
- Continuous Time-Delay NN
- Limit Cycles, Stability and Convergence
3 Neural Processing - The Brain
[Figure: neuron schematic (dendrites, cell body, synapse) and electrical-potential trace over time, showing integration of inputs, depolarization (spike), the refractory period, and decay]
A neuron receives and integrates input from other neurons. Once the integrated input exceeds a critical threshold, the neuron discharges a spike. This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire.
4 Information Theory and the Neuron
[Diagram: a single neuron with inputs x1..x4, weights w1..w4, and output y]
- You can view the neuron as a memory.
- What can you store in this memory?
- What is the maximal capacity?
- How can you find a learning rule that maximizes
the capacity?
5 Information Theory and the Neuron
A fundamental principle of learning systems is their robustness to noise. One way to measure a system's robustness to noise is to determine the mutual information between its inputs and its output.
6 Information Theory and the Neuron
Consider the neuron as a sender-receiver system,
with X being the message sent and y the received
message. Information theory can give you a
measure of the information conveyed by y about X.
If the transmission system is imperfect
(noisy), you must find a way to ensure minimal
disturbance in the transmission.
7 Information Theory and the Neuron
When noise corrupts only the output, y = Σ_i w_i x_i + ν, the signal term alone depends on the weights. In order to maximize the signal-to-noise ratio, one can therefore simply increase the magnitude of the weights.
8 Information Theory and the Neuron
The mutual information between the neuron output y and its inputs X is given by
I(y; X) = H(y) − H(y|X)
When the noise enters through the inputs, y = Σ_i w_i (x_i + ν_i), one cannot simply increase the magnitude of the weights, as this amplifies the noise term H(y|X) as well.
9 Information Theory and the Neuron
10 How to define a learning rule to optimize the
mutual information?
11 Hebbian Learning
Input x, Output y
If the input x_i and the output y fire simultaneously, the weight w_i of the connection between them is strengthened in proportion to their strength of firing.
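As a minimal sketch of this rule for a single linear neuron (the learning rate and the random inputs are illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01                    # learning rate (illustrative value)
x = rng.normal(size=4)        # presynaptic activities x_1..x_4
w = rng.normal(size=4)        # synaptic weights w_1..w_4

y = w @ x                     # postsynaptic activity of a linear neuron
dw = eta * y * x              # Hebbian update: Delta w_i = eta * y * x_i
w = w + dw
```

Each weight change is proportional to the product of pre- and postsynaptic activity, so correlated activity strengthens the connection.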
12Hebbian Learning Limit Cycle
This is true for all i, thus, w_j is an
eigenvector of C, with associated Eigenvalue 0
C is a positive, symmetric and semi-definite
matrix ? all eigenvalues are gt0.
Under a small disturbance
? The weights tend to grow in the direction of
the largest eigenvalue of C.
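This growth can be checked numerically (the 2-D input covariance and learning rate below are illustrative): averaged over the data, pure Hebbian updates behave like Δw ≈ η C w, so w ends up aligned with the leading eigenvector of C:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # illustrative mixing matrix
X = rng.normal(size=(5000, 2)) @ A       # zero-mean inputs
C = X.T @ X / len(X)                     # empirical covariance matrix

w = rng.normal(size=2)                   # small random disturbance of the weights
for x in X:
    w += 0.001 * (w @ x) * x             # pure Hebbian update, no decay

# w should now point along the eigenvector of C with the largest eigenvalue
evals, evecs = np.linalg.eigh(C)
top = evecs[:, np.argmax(evals)]
alignment = abs(w @ top) / np.linalg.norm(w)
```

Note that the norm of w grows without bound here; only its direction stabilizes, which is why the decay terms of the next slides are needed.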
13 Hebbian Learning: Weight Decay
The simple weight decay rule belongs to a class of decay rules called subtractive rules.
The only advantage of subtractive rules over simply clipping the weights is that they eliminate weights that have little importance.
The advantage of multiplicative rules is that, in addition to giving small weights, they also give useful weights.
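The contrast can be seen in a toy comparison (the decay constants and weight values are illustrative): subtractive decay removes a fixed amount per step and drives unimportant weights to exactly zero, while multiplicative decay only shrinks all weights proportionally:

```python
import numpy as np

w0 = np.array([0.02, 0.5, 1.0])     # one unimportant weight, two useful ones
w_sub = w0.copy()
w_mul = w0.copy()
for _ in range(100):
    w_sub = np.maximum(w_sub - 0.001, 0.0)  # subtractive: w <- w - gamma, clipped at 0
    w_mul = w_mul * 0.999                   # multiplicative: w <- (1 - gamma) * w
```

After 100 steps the subtractive rule has eliminated the small weight entirely, while the multiplicative rule keeps all three weights in their original proportions.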
14 Information Theory and the Neuron
Oja's one-neuron model:
Δw_i = η · y · (x_i − y · w_i)
The weights converge toward the first eigenvector of the input covariance matrix and are normalized (‖w‖ = 1).
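Oja's rule can be sketched as follows (the input distribution and learning rate are illustrative); the multiplicative decay term −η y² w is what keeps ‖w‖ near 1:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # illustrative mixing matrix
X = rng.normal(size=(20000, 2)) @ A        # zero-mean inputs

w = rng.normal(size=2)
eta = 0.005
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)             # Oja: Hebbian term + multiplicative decay

# w should be a unit vector along the first eigenvector of the covariance
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
top = evecs[:, np.argmax(evals)]
alignment = abs(w @ top) / np.linalg.norm(w)
```

Unlike the pure Hebbian run, the weight norm now stays bounded near 1 while the direction still converges to the principal component.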
15 Hebbian Learning: Weight Decay
Oja's subspace algorithm is equivalent to minimizing the generalized form of J.
16 Hebbian Learning: Weight Decay
- Why PCA, LDA, ICA with ANN?
- They explain how the brain could derive important properties of the sensory and motor space.
- They allow the discovery of new modes of computation with simple, iterative, local learning rules.
17 Recurrence in Neural Networks
- So far, we have considered only feed-forward neural networks.
- Most biological networks have recurrent connections.
- This change of direction in the flow of information is interesting, as it allows:
- keeping a memory of the activation of the neuron
- propagating information across output neurons
18 Anti-Hebbian Learning
How can one maximize information transmission in a network, i.e., maximize I(x; y)?
19 Anti-Hebbian Learning
Anti-Hebbian learning is also known as lateral inhibition.
(Averages are taken over all training patterns.)
20 Anti-Hebbian Learning
If two outputs are highly correlated, the weight between them grows to a large negative value, and each tends to turn the other off.
There is no need for weight decay or renormalization on anti-Hebbian weights, as they are automatically self-limiting!
21 Anti-Hebbian Learning
Földiák's First Model
22 Anti-Hebbian Learning
Földiák's First Model
One can further show that there is a stable point in the weight space.
23 Anti-Hebbian Learning
Földiák's Second Model
Each neuron also receives its own output with weight 1.
This network converges when:
- the outputs are decorrelated, and
- the expected variance of the outputs is equal to 1.
24 PCA versus ICA
PCA looks at the covariance matrix only. What if the data is not well described by the covariance matrix? The only distribution that is uniquely specified by its covariance (once the mean is subtracted) is the Gaussian distribution. Distributions that deviate from the Gaussian are poorly described by their covariances.
25 PCA versus ICA
Even with non-Gaussian data, variance maximization leads to the most faithful representation in a reconstruction-error sense. The mean-square error measure implicitly assumes Gaussianity, since it penalizes datapoints close to the mean less than those far away. But it does not, in general, lead to the most meaningful representation. → We need to perform gradient descent on some function other than the reconstruction error.
26 Uncorrelated versus Statistically Independent
Independent: p(x, y) = p(x) p(y)
Uncorrelated: E[xy] − E[x] E[y] = 0
Independence implies E[f(x) g(y)] = E[f(x)] E[g(y)], true for any non-linear transformation f, g.
Statistical independence is a stronger constraint than decorrelation.
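A quick numerical illustration (the choice y = x² is a standard textbook example, not from the slides): x and y = x² are uncorrelated, yet a nonlinear transform exposes their dependence:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
y = x**2                                  # fully determined by x, hence dependent

# Uncorrelated: cov(x, y) = E[x^3] = 0 for a symmetric distribution
cov_linear = np.mean(x * y) - np.mean(x) * np.mean(y)

# The dependence shows up under the nonlinear transform f(x) = x^2:
# cov(x^2, y) = E[x^4] - E[x^2]^2 = 2 for a standard Gaussian
cov_nonlinear = np.mean(x**2 * y) - np.mean(x**2) * np.mean(y)
```

So a decorrelating algorithm (PCA) would treat x and y as done, while an independence-seeking one (ICA) would not.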
27 Objective Function of ICA
We want to ensure that the outputs y_i are maximally independent. This is identical to requiring that the mutual information be small, or alternatively that the joint entropy be large.
[Venn diagram: H(x,y) decomposes into H(x|y), I(x,y) and H(y|x); H(x) = H(x|y) + I(x,y); H(y) = H(y|x) + I(x,y)]
28 Anti-Hebbian Learning and ICA
Anti-Hebbian learning can also lead to a decomposition into statistically independent components and, as such, allows an ICA-type decomposition.
29 ICA for Time-Dependent Signals
Original signals.
Adapted from Hyvärinen, 2000.
30 ICA for Time-Dependent Signals
Mixed signals.
Adapted from Hyvärinen, 2000.
31 Anti-Hebbian Learning and ICA
The Jutten and Hérault Model
32 Anti-Hebbian Learning and ICA
Hint: use two odd functions for f and g (f(−x) = −f(x)); their Taylor series expansions then consist solely of odd terms.
Since most (audio) signals have an even distribution, at convergence one has ⟨f(y_i) g(y_j)⟩ = 0.
33 Anti-Hebbian Learning and ICA: Application to Blind Source Separation
Mixed signals.
Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999.
34 Anti-Hebbian Learning and ICA: Application to Blind Source Separation
Unmixed signals obtained through generalized anti-Hebbian learning.
37 Information Maximization
Bell and Sejnowski proposed a network that maximizes the mutual information between the output and the input when these are not subjected to noise (or rather, when the input and the noise can no longer be distinguished; H(Y|X) then tends to negative infinity).
[Diagram: a single neuron with inputs x1..x4, weights w1..w4, bias weight w0, and output y]
Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.
38 Information Maximization
H(Y|X) is independent of the weights W, and so maximizing I(Y; X) = H(Y) − H(Y|X) reduces to maximizing the output entropy H(Y).
39 Information Maximization
The entropy of a distribution is maximized when all outcomes are equally likely. → We must choose an activation function at the output neurons that equalizes each neuron's chances of firing, and so maximizes their collective entropy.
40 Anti-Hebbian Learning and ICA
The sigmoid is the optimal solution to even out a Gaussian distribution so that all outputs are equally probable.
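This can be checked numerically: pushing standard Gaussian samples through their own cumulative distribution function (a sigmoid-shaped squashing function, computed here via the error function) yields an approximately uniform output, so every output level becomes equally probable (a sketch; the sample size and bin count are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)                       # zero-mean, unit-variance inputs

# Gaussian CDF: a monotone, sigmoid-shaped squashing of the input into (0, 1)
gauss_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
y = gauss_cdf(x)

# y should be close to uniform on [0, 1]: roughly equal counts in every bin
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
```

Any other monotone squashing function would leave some output levels more probable than others, and hence give a lower output entropy.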
43 Anti-Hebbian Learning and ICA
The pdf of the output can be written as p_y(y) = p_x(x) / |∂y/∂x|.
The entropy of the output is then given by H(y) = −E[ln p_y(y)] = E[ln |∂y/∂x|] + H(x).
The learning rules that optimize this entropy are obtained by gradient ascent, Δw ∝ ∂H/∂w.
44 Anti-Hebbian Learning and ICA
For a single unit with y = 1/(1 + e^(−u)), u = w x + w0, this gives the rules
Δw ∝ 1/w + x (1 − 2y),  Δw0 ∝ 1 − 2y.
45 Anti-Hebbian Learning and ICA
This can be generalized to a many-inputs, many-outputs network with sigmoid output functions. The learning rule that optimizes the mutual information between input and output is then
ΔW ∝ (W^T)^(−1) + (1 − 2y) x^T
Such a network can linearly decompose up to 10 sources.