CSC2535: Computation in Neural Networks
Lecture 11: Extracting coherent properties by maximizing mutual information across space or time
Slide 2: The aims of unsupervised learning
- We would like to extract a representation of the sensory input that is useful for later processing.
- We want to do this without requiring labeled data.
- Prior ideas about what the internal representation should look like ought to be helpful. So what would we like in a representation?
  - Hidden causes that explain high-order correlations?
  - Constraints that often hold?
  - A low-dimensional manifold that contains all the data?
  - Properties that are invariant across space or time?
Slide 3: Temporally invariant properties
- Consider a rigid object that is moving relative to the retina.
- Its retinal image changes in predictable ways.
- Its true 3-D shape stays exactly the same: it is invariant over time.
- Its angular momentum also stays the same if it is in free fall.
- Properties that are invariant over time are usually interesting.
Slide 4: Spatially invariant properties
- Consider a smooth surface covered in random dots that is viewed from two different directions.
- Each image is just a set of random dots.
- A stereo pair of images has disparity that changes smoothly over space: nearby regions of the image pair have very similar disparities.

[Figure: a surface viewed by the left eye and right eye, with the plane of fixation shown.]
Slide 5: Learning temporal invariances

[Figure: two networks, one viewing the image at time t and one at time t+1. Each passes its image through hidden layers to non-linear features, and the goal is to maximize agreement between the two feature vectors.]
Slide 6: A new way to get a teaching signal
- Each module uses the output of the other module as the teaching signal.
- This does not work if the two modules can see the same data: they just report one component of the data and agree perfectly.
- It also fails if a module always outputs a constant: the modules can just ignore the data and agree on what constant to output.
- We need a sensible definition of the amount of agreement between the outputs.
Slide 7: Mutual information
- Two variables, a and b, have high mutual information if you can predict a lot about one from the other.
- The symmetric definition uses the individual entropies and the joint entropy:

  I(a; b) = H(a) + H(b) - H(a, b)

- There is also an asymmetric way to define mutual information:

  I(a; b) = H(a) - H(a | b)

- Compute derivatives of I w.r.t. the feature activities, then backpropagate to get derivatives for all the weights in the network.
- The network at time t is using the network at time t+1 as its teacher (and vice versa).
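The entropy-based definition above can be checked directly for discrete variables. A minimal sketch (the joint distribution here is made up for illustration): two binary variables that mostly agree have substantial mutual information.

```python
# I(a;b) = H(a) + H(b) - H(a,b) for two discrete variables,
# computed from their joint distribution.
import numpy as np

def entropy(p):
    p = p[p > 0]                      # ignore zero-probability cells
    return -np.sum(p * np.log(p))     # entropy in nats

# joint distribution of two binary variables that agree 90% of the time
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
pa, pb = joint.sum(axis=1), joint.sum(axis=0)   # marginals
I = entropy(pa) + entropy(pb) - entropy(joint.ravel())
print(I)  # about 0.37 nats of shared information
```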
Slide 8: Some advantages of mutual information
- If the modules output constants, the mutual information is zero.
- If the modules each output a vector, the mutual information is maximized by making the components of each vector be as independent as possible.
- Mutual information exactly captures what we mean by agreeing.
Slide 9: A problem
- We can never have more mutual information between the two output vectors than there is between the two input vectors.
- So why not just use the input vector as the output?
- We want to preserve as much mutual information as possible whilst also achieving something else:
  - Dimensionality reduction?
  - A simple form for the prediction of one output from the other?
Slide 10: A simple form for the relationship
- Assume the output of module a equals the output of module b plus noise:

  a = b + noise

- If we assume that a and b are both noisy versions of the same underlying signal, we can use

  I = 0.5 log ( V(a + b) / V(a - b) )
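A minimal sketch of this variance-ratio measure of agreement, assuming the "same signal plus independent noise" model (the noise level 0.5 is an arbitrary choice for illustration):

```python
# Under the model a = signal + noise1, b = signal + noise2, the agreement
# measure 0.5*log( V(a+b) / V(a-b) ) is large when the shared signal
# dominates the independent noise.
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=50000)
a = signal + 0.5 * rng.normal(size=signal.size)   # module a's noisy output
b = signal + 0.5 * rng.normal(size=signal.size)   # module b's noisy output

I = 0.5 * np.log(np.var(a + b) / np.var(a - b))
print(I)  # close to 0.5*log(9) ~ 1.1 nats for this noise level
```

Note that V(a+b) contains four copies of the signal variance plus the noise, while V(a-b) contains only the noise, so the ratio grows with the signal-to-noise ratio.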
Slide 11: Learning temporal invariances

[Figure: the same two-network architecture as slide 5, now maximizing mutual information between the non-linear features at time t and at time t+1, with derivatives backpropagated through the hidden layers of both networks.]
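The whole scheme can be sketched end-to-end in a toy setting. This is an illustrative reconstruction, not the original implementation: the modules are linear rather than multi-layer, the "images" are synthetic noisy views of one underlying invariant property, and the gradient of the variance-ratio objective is taken analytically.

```python
# Two linear "modules" see noisy views of the same underlying signal at
# time t and t+1, and do gradient ascent on I = 0.5*log( V(a+b) / V(a-b) ).
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5
signal = rng.normal(size=n)                         # temporally invariant property
mix1, mix2 = rng.normal(size=d), rng.normal(size=d)
x1 = np.outer(signal, mix1) + 0.1 * rng.normal(size=(n, d))  # input at time t
x2 = np.outer(signal, mix2) + 0.1 * rng.normal(size=(n, d))  # input at time t+1

w1 = rng.normal(size=d)
w2 = rng.normal(size=d)

for step in range(300):
    a, b = x1 @ w1, x2 @ w2
    s = (a + b) - (a + b).mean()                    # centered sum
    e = (a - b) - (a - b).mean()                    # centered difference
    # analytic gradient of I = 0.5*log(var(s)/var(e)) w.r.t. the weights
    g1 = (x1 * s[:, None]).mean(0) / s.var() - (x1 * e[:, None]).mean(0) / e.var()
    g2 = (x2 * s[:, None]).mean(0) / s.var() + (x2 * e[:, None]).mean(0) / e.var()
    w1 += 0.1 * g1
    w1 /= np.linalg.norm(w1)                        # keep the scale bounded
    w2 += 0.1 * g2
    w2 /= np.linalg.norm(w2)

a, b = x1 @ w1, x2 @ w2
agreement = abs(np.corrcoef(a, b)[0, 1])
print(agreement)  # the two modules end up agreeing closely
```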
Slide 12: Maximizing mutual information between a local region and a larger context

[Figure: five modules with hidden layers view neighbouring patches of a stereo pair (left eye and right eye images of a surface). The outputs of the four context modules are combined with weights w1, w2, w3, w4 to form a contextual prediction, and the mutual information between this prediction and the output of the middle module is maximized.]
Slide 13: How well does it work?
- If we use weight sharing between modules and plenty of hidden units, it works really well.
- It extracts the depth of the surface fairly accurately.
- It simultaneously learns the optimal weights of -1/6, 4/6, 4/6, -1/6 for interpolating the depths of the context to predict the depth at the middle module.
- If the data is noisy or the modules are unreliable, it learns a more robust interpolator that uses smaller weights in order not to amplify noise.
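The weights -1/6, 4/6, 4/6, -1/6 are the Lagrange interpolation weights for predicting the value at the middle position from four equally spaced neighbours, and they are exact for any cubic depth profile. A quick check (the particular cubic is arbitrary):

```python
# Verify that the learned weights are the Lagrange weights for predicting
# the value at position 0 from neighbours at positions -2, -1, +1, +2.
import numpy as np

nodes = np.array([-2.0, -1.0, 1.0, 2.0])
weights = np.array([-1/6, 4/6, 4/6, -1/6])

def cubic(x):
    return 0.3 * x**3 - 1.2 * x**2 + 0.5 * x + 2.0  # arbitrary cubic depth profile

predicted = weights @ cubic(nodes)
print(predicted, cubic(0.0))  # identical: the interpolator is exact for cubics
```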
Slide 14: But what about discontinuities?
- Real surfaces are mostly smooth, but they also have sharp discontinuities in depth.
- How can we preserve the high mutual information between local depth and contextual depth?
- Discontinuities cause occasional high residual errors, and a single Gaussian model of the residuals requires high variance to accommodate these large errors.
Slide 15: A simple mixture approach
- We assume that there are continuity cases, in which there is high MI, and discontinuity cases, in which there is no MI.
- The variance of the residual is only computed on the continuity cases, so it can stay small.
- The residual can be used to compute the posterior probability of each type of case.
- Aim to maximize the mixing proportion of the continuity cases times the MI in those cases.
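The posterior over the two cases follows from Bayes' rule on the residual. A sketch, with made-up mixing proportion and variances (the function name and parameter values are illustrative, not from the original model):

```python
# Posterior probability of the "continuity" case, given the residual between
# the contextual prediction and the local depth, under a two-component
# Gaussian mixture: a small-variance continuity component and a
# large-variance discontinuity component.
import math

def continuity_posterior(residual, pi_cont=0.8, var_cont=0.01, var_disc=1.0):
    def gauss(r, var):
        return math.exp(-r * r / (2 * var)) / math.sqrt(2 * math.pi * var)
    p_cont = pi_cont * gauss(residual, var_cont)
    p_disc = (1 - pi_cont) * gauss(residual, var_disc)
    return p_cont / (p_cont + p_disc)

print(continuity_posterior(0.05))  # small residual: almost certainly continuous
print(continuity_posterior(1.5))   # large residual: almost certainly a discontinuity
```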
Slide 16: Mixtures of expert interpolators
- Instead of just giving up on discontinuity cases, we can use a different interpolator that ignores the surface beyond the discontinuity.
- To predict the depth at c, use the one-sided extrapolation 2b - a.
- To choose this interpolator, find the location of the discontinuity.

[Figure: five neighbouring depth locations labelled a, b, c, d, e.]
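The extrapolator 2b - a simply continues the line through the two near-side depths, so it is exact whenever the surface is linear on that side of the discontinuity. A minimal check with a made-up linear ramp:

```python
# One-sided linear extrapolation: continue the line through depths a and b
# (at positions -2 and -1) to predict the depth at c (position 0).
def extrapolate(a, b):
    return 2 * b - a

# a linear ramp with slope 0.5 over equally spaced locations a, b, c
depth_a, depth_b, depth_c = 1.0, 1.5, 2.0
print(extrapolate(depth_a, depth_b))  # 2.0, matches the true depth at c
```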
Slide 17: The mixture of interpolators net
- There are five interpolators, each with its own controller.
- Each controller is a neural net that looks at the outputs of all 5 modules and learns to detect a discontinuity at a particular location.
- The exception is the controller for the full interpolator, which instead checks that there is no discontinuity.
- The mixture of expert interpolators trains the controllers, the interpolators, and the local depth modules all together.
Slide 18: Mutual information with multidimensional output
- For a multidimensional Gaussian, the entropy is given by the log determinant of the covariance matrix:

  H = 0.5 log det(C) + const

- If we use the identity model of the relationship between the outputs of two modules, we get

  I = 0.5 log ( det(C_{a+b}) / det(C_{a-b}) )

- If we assume the outputs are jointly Gaussian, we get

  I = 0.5 log ( det(C_a) det(C_b) / det(C_{ab}) )

  where C_{ab} is the joint covariance of the two output vectors.
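A sketch of the jointly Gaussian case, estimated from samples (the dimensionality and noise level are arbitrary choices for illustration):

```python
# Mutual information between two jointly Gaussian output vectors, computed
# from covariance determinants: I = 0.5*log( det(Ca)*det(Cb) / det(Cab) ).
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(10000, 2))             # shared 2-D underlying signal
a = z + 0.3 * rng.normal(size=z.shape)      # module a's output vector
b = z + 0.3 * rng.normal(size=z.shape)      # module b's output vector

Cab = np.cov(np.hstack([a, b]).T)           # joint 4x4 covariance
Ca, Cb = Cab[:2, :2], Cab[2:, 2:]
I = 0.5 * np.log(np.linalg.det(Ca) * np.linalg.det(Cb) / np.linalg.det(Cab))
print(I)  # clearly positive: the two output vectors share information
```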
Slide 19: Relationship to a linear dynamical system

[Figure: two networks extract linear features from the image at time t and at time t+1. A linear model (which could be the identity plus noise) predicts the features at time t+1 from those of the past time t; the prediction is made in the feature domain.]