CSC2535: Computation in Neural Networks
Lecture 11: Extracting coherent properties by maximizing mutual information across space or time
Slide 2: The aims of unsupervised learning
- We would like to extract a representation of the sensory input that is useful for later processing.
- We want to do this without requiring labeled data.
- Prior ideas about what the internal representation should look like ought to be helpful. So what would we like in a representation?
  - Hidden causes that explain high-order correlations?
  - Constraints that often hold?
  - A low-dimensional manifold that contains all the data?
  - Properties that are invariant across space or time?
Slide 3: Temporally invariant properties
- Consider a rigid object that is moving relative to the retina.
- Its retinal image changes in predictable ways.
- Its true 3-D shape stays exactly the same: it is invariant over time.
- Its angular momentum also stays the same if it is in free fall.
- Properties that are invariant over time are usually interesting.
Slide 4: Spatially invariant properties
- Consider a smooth surface covered in random dots that is viewed from two different directions.
- Each image is just a set of random dots.
- A stereo pair of images has disparity that changes smoothly over space: nearby regions of the image pair have very similar disparities.

[Figure: a surface viewed by the left eye and right eye, with the plane of fixation shown.]
Slide 5: Learning temporal invariances

[Figure: two networks, one viewing the image at time t and one at time t+1. Each passes its image through hidden layers to non-linear features, and the goal is to maximize agreement between the two feature vectors.]
Slide 6: A new way to get a teaching signal
- Each module uses the output of the other module as the teaching signal.
- This does not work if the two modules can see the same data: they just report one component of the data and agree perfectly.
- It also fails if a module always outputs a constant: the modules can just ignore the data and agree on what constant to output.
- We need a sensible definition of the amount of agreement between the outputs.
Slide 7: Mutual information
- Two variables, a and b, have high mutual information if you can predict a lot about one from the other.
- The symmetric definition uses the individual entropies and the joint entropy:

  I(a; b) = H(a) + H(b) - H(a, b)

- There is also an asymmetric way to define mutual information:

  I(a; b) = H(a) - H(a | b)

- Compute derivatives of I w.r.t. the feature activities, then backpropagate to get derivatives for all the weights in the network.
- The network at time t is using the network at time t+1 as its teacher (and vice versa).
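The entropy-based definition above can be checked directly for discrete variables. A minimal sketch (the joint distribution here is made up for illustration): two binary variables that mostly agree have substantial mutual information.

```python
# I(a;b) = H(a) + H(b) - H(a,b) for two discrete variables,
# computed from their joint distribution.
import numpy as np

def entropy(p):
    p = p[p > 0]                      # ignore zero-probability cells
    return -np.sum(p * np.log(p))     # entropy in nats

# joint distribution of two binary variables that agree 90% of the time
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
pa, pb = joint.sum(axis=1), joint.sum(axis=0)   # marginals
I = entropy(pa) + entropy(pb) - entropy(joint.ravel())
print(I)  # about 0.37 nats of shared information
```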
Slide 8: Some advantages of mutual information
- If the modules output constants, the mutual information is zero.
- If the modules each output a vector, the mutual information is maximized by making the components of each vector be as independent as possible.
- Mutual information exactly captures what we mean by agreeing.
Slide 9: A problem
- We can never have more mutual information between the two output vectors than there is between the two input vectors.
- So why not just use the input vector as the output?
- We want to preserve as much mutual information as possible whilst also achieving something else:
  - Dimensionality reduction?
  - A simple form for the prediction of one output from the other?
Slide 10: A simple form for the relationship
- Assume the output of module a equals the output of module b plus noise:

  a = b + noise

- If we assume that a and b are both noisy versions of the same underlying signal, we can use

  I = 0.5 log ( V(a + b) / V(a - b) )
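A minimal sketch of this variance-ratio measure of agreement, assuming the "same signal plus independent noise" model (the noise level 0.5 is an arbitrary choice for illustration):

```python
# Under the model a = signal + noise1, b = signal + noise2, the agreement
# measure 0.5*log( V(a+b) / V(a-b) ) is large when the shared signal
# dominates the independent noise.
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=50000)
a = signal + 0.5 * rng.normal(size=signal.size)   # module a's noisy output
b = signal + 0.5 * rng.normal(size=signal.size)   # module b's noisy output

I = 0.5 * np.log(np.var(a + b) / np.var(a - b))
print(I)  # close to 0.5*log(9) ~ 1.1 nats for this noise level
```

Note that V(a+b) contains four copies of the signal variance plus the noise, while V(a-b) contains only the noise, so the ratio grows with the signal-to-noise ratio.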
Slide 11: Learning temporal invariances

[Figure: the same two-network architecture as slide 5, now maximizing mutual information between the non-linear features at time t and at time t+1, with derivatives backpropagated through the hidden layers of both networks.]
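The whole scheme can be sketched end-to-end in a toy setting. This is an illustrative reconstruction, not the original implementation: the modules are linear rather than multi-layer, the "images" are synthetic noisy views of one underlying invariant property, and the gradient of the variance-ratio objective is taken analytically.

```python
# Two linear "modules" see noisy views of the same underlying signal at
# time t and t+1, and do gradient ascent on I = 0.5*log( V(a+b) / V(a-b) ).
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
signal = rng.normal(size=n)                         # temporally invariant property
mix1, mix2 = rng.normal(size=d), rng.normal(size=d)
x1 = np.outer(signal, mix1) + 0.1 * rng.normal(size=(n, d))  # input at time t
x2 = np.outer(signal, mix2) + 0.1 * rng.normal(size=(n, d))  # input at time t+1

w1 = rng.normal(size=d)
w2 = rng.normal(size=d)

for step in range(300):
    a, b = x1 @ w1, x2 @ w2
    s = (a + b) - (a + b).mean()                    # centered sum
    e = (a - b) - (a - b).mean()                    # centered difference
    # analytic gradient of I = 0.5*log(var(s)/var(e)) w.r.t. the weights
    g1 = (x1 * s[:, None]).mean(0) / s.var() - (x1 * e[:, None]).mean(0) / e.var()
    g2 = (x2 * s[:, None]).mean(0) / s.var() + (x2 * e[:, None]).mean(0) / e.var()
    w1 += 0.1 * g1
    w1 /= np.linalg.norm(w1)                        # keep the scale bounded
    w2 += 0.1 * g2
    w2 /= np.linalg.norm(w2)

a, b = x1 @ w1, x2 @ w2
agreement = abs(np.corrcoef(a, b)[0, 1])
print(agreement)  # the two modules end up agreeing closely
```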
Slide 12: Maximizing mutual information between a local region and a larger context

[Figure: five modules with hidden layers view neighbouring patches of a stereo pair (left eye and right eye images of a surface). The outputs of the four context modules are combined with weights w1, w2, w3, w4 to form a contextual prediction, and the mutual information between this prediction and the output of the middle module is maximized.]
Slide 13: How well does it work?
- If we use weight sharing between modules and plenty of hidden units, it works really well.
- It extracts the depth of the surface fairly accurately.
- It simultaneously learns the optimal weights of -1/6, 4/6, 4/6, -1/6 for interpolating the depths of the context to predict the depth at the middle module.
- If the data is noisy or the modules are unreliable, it learns a more robust interpolator that uses smaller weights in order not to amplify noise.
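The weights -1/6, 4/6, 4/6, -1/6 are the Lagrange interpolation weights for predicting the value at the middle position from four equally spaced neighbours, and they are exact for any cubic depth profile. A quick check (the particular cubic is arbitrary):

```python
# Verify that the learned weights are the Lagrange weights for predicting
# the value at position 0 from neighbours at positions -2, -1, +1, +2.
import numpy as np

nodes = np.array([-2.0, -1.0, 1.0, 2.0])
weights = np.array([-1/6, 4/6, 4/6, -1/6])

def cubic(x):
    return 0.3 * x**3 - 1.2 * x**2 + 0.5 * x + 2.0  # arbitrary cubic depth profile

predicted = weights @ cubic(nodes)
print(predicted, cubic(0.0))  # identical: the interpolator is exact for cubics
```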
Slide 14: But what about discontinuities?
- Real surfaces are mostly smooth, but they also have sharp discontinuities in depth.
- How can we preserve the high mutual information between local depth and contextual depth?
- Discontinuities cause occasional high residual errors, and a single Gaussian model of the residuals requires high variance to accommodate these large errors.
Slide 15: A simple mixture approach
- We assume that there are continuity cases, in which there is high MI, and discontinuity cases, in which there is no MI.
- The variance of the residual is only computed on the continuity cases, so it can stay small.
- The residual can be used to compute the posterior probability of each type of case.
- Aim to maximize the mixing proportion of the continuity cases times the MI in those cases.
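The posterior over the two cases follows from Bayes' rule on the residual. A sketch, with made-up mixing proportion and variances (the function name and parameter values are illustrative, not from the original model):

```python
# Posterior probability of the "continuity" case, given the residual between
# the contextual prediction and the local depth, under a two-component
# Gaussian mixture: a small-variance continuity component and a
# large-variance discontinuity component.
import math

def continuity_posterior(residual, pi_cont=0.8, var_cont=0.01, var_disc=1.0):
    def gauss(r, var):
        return math.exp(-r * r / (2 * var)) / math.sqrt(2 * math.pi * var)
    p_cont = pi_cont * gauss(residual, var_cont)
    p_disc = (1 - pi_cont) * gauss(residual, var_disc)
    return p_cont / (p_cont + p_disc)

print(continuity_posterior(0.05))  # small residual: almost certainly continuous
print(continuity_posterior(1.5))   # large residual: almost certainly a discontinuity
```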
Slide 16: Mixtures of expert interpolators
- Instead of just giving up on discontinuity cases, we can use a different interpolator that ignores the surface beyond the discontinuity.
- To predict the depth at c, use the one-sided extrapolation 2b - a.
- To choose this interpolator, find the location of the discontinuity.

[Figure: five neighbouring depth locations labelled a, b, c, d, e.]
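The extrapolator 2b - a simply continues the line through the two near-side depths, so it is exact whenever the surface is linear on that side of the discontinuity. A minimal check with a made-up linear ramp:

```python
# One-sided linear extrapolation: continue the line through depths a and b
# (at positions -2 and -1) to predict the depth at c (position 0).
def extrapolate(a, b):
    return 2 * b - a

# a linear ramp with slope 0.5 over equally spaced locations a, b, c
depth_a, depth_b, depth_c = 1.0, 1.5, 2.0
print(extrapolate(depth_a, depth_b))  # 2.0, matches the true depth at c
```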
Slide 17: The mixture of interpolators net
- There are five interpolators, each with its own controller.
- Each controller is a neural net that looks at the outputs of all 5 modules and learns to detect a discontinuity at a particular location.
- The exception is the controller for the full interpolator, which instead checks that there is no discontinuity.
- The mixture of expert interpolators trains the controllers, the interpolators, and the local depth modules all together.
Slide 18: Mutual information with multidimensional output
- For a multidimensional Gaussian, the entropy is given by the log determinant of the covariance matrix:

  H = 0.5 log det(C) + const

- If we use the identity model of the relationship between the outputs of two modules, we get

  I = 0.5 log ( det(C_{a+b}) / det(C_{a-b}) )

- If we assume the outputs are jointly Gaussian, we get

  I = 0.5 log ( det(C_a) det(C_b) / det(C_{ab}) )

  where C_{ab} is the joint covariance of the two output vectors.
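A sketch of the jointly Gaussian case, estimated from samples (the dimensionality and noise level are arbitrary choices for illustration):

```python
# Mutual information between two jointly Gaussian output vectors, computed
# from covariance determinants: I = 0.5*log( det(Ca)*det(Cb) / det(Cab) ).
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(10000, 2))             # shared 2-D underlying signal
a = z + 0.3 * rng.normal(size=z.shape)      # module a's output vector
b = z + 0.3 * rng.normal(size=z.shape)      # module b's output vector

Cab = np.cov(np.hstack([a, b]).T)           # joint 4x4 covariance
Ca, Cb = Cab[:2, :2], Cab[2:, 2:]
I = 0.5 * np.log(np.linalg.det(Ca) * np.linalg.det(Cb) / np.linalg.det(Cab))
print(I)  # clearly positive: the two output vectors share information
```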
Slide 19: Relationship to a linear dynamical system

[Figure: two networks extract linear features from the image at time t and at time t+1. A linear model (which could be the identity plus noise) predicts the features at time t+1 from those of the past time t; the prediction is made in the feature domain.]