1
CSC2535: Computation in Neural Networks
Lecture 11: Extracting coherent properties by maximizing mutual information across space or time
  • Geoffrey Hinton

2
The aims of unsupervised learning
  • We would like to extract a representation of the
    sensory input that is useful for later
    processing.
  • We want to do this without requiring labeled
    data.
  • Prior ideas about what the internal
    representation should look like ought to be
    helpful. So what would we like in a
    representation?
  • Hidden causes that explain high-order
    correlations?
  • Constraints that often hold?
  • A low-dimensional manifold that contains all the
    data?
  • Properties that are invariant across space or
    time?

3
Temporally invariant properties
  • Consider a rigid object that is moving relative
    to the retina
  • Its retinal image changes in predictable ways
  • Its true 3-D shape stays exactly the same. It is
    invariant over time.
  • Its angular momentum also stays the same if it is
    in free fall.
  • Properties that are invariant over time are
    usually interesting.

4
Spatially invariant properties
  • Consider a smooth surface covered in random dots
    that is viewed from two different directions
  • Each image is just a set of random dots.
  • A stereo pair of images has disparity that
    changes smoothly over space. Nearby regions of
    the image pair have very similar disparities.

[Figure: stereo viewing geometry showing the left and right eyes, the plane of fixation, and the random-dot surface.]
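To make the data concrete, here is a minimal sketch (not from the slides) that generates a one-dimensional random-dot stereo pair from a made-up smooth disparity profile; either image alone is just random dots, but the pair carries a smoothly varying disparity signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up smooth disparity profile along one image row (in pixels).
width = 100
x = np.linspace(0.0, 1.0, width)
disparity = 3.0 * np.sin(2 * np.pi * x)

# Left image: random dots. Right image: each dot shifted by the local
# (rounded) disparity, so nearby regions of the pair have similar disparities.
left = (rng.random(width) > 0.5).astype(float)
right = np.zeros(width)
for i in range(width):
    j = int(round(i + disparity[i]))
    if 0 <= j < width:
        right[j] = left[i]
```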
5
Learning temporal invariances
[Figure: two networks, one receiving the image at time t and the other the image at time t+1; each passes its input through hidden layers to produce non-linear features, and the goal is to maximize agreement between the two feature vectors.]
6
A new way to get a teaching signal
  • Each module uses the output of the other module
    as the teaching signal.
  • This does not work if the two modules can see the
    same data. They just report one component of the
    data and agree perfectly.
  • It also fails if a module always outputs a
    constant. The modules can just ignore the data
    and agree on what constant to output.
  • We need a sensible definition of the amount of
    agreement between the outputs.

7
Mutual information
  • Two variables, a and b, have high mutual
    information if you can predict a lot about one
    from the other.

  • I(a; b) = H(a) + H(b) - H(a, b), i.e. the sum of the
    individual entropies minus the joint entropy.
  • There is also an asymmetric way to define mutual
    information: I(a; b) = H(a) - H(a | b).
  • Compute derivatives of I w.r.t. the feature
    activities. Then backpropagate to get derivatives
    for all the weights in the network.
  • The network at time t is using the network at
    time t+1 as its teacher (and vice versa).
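As a quick numerical check of the two definitions (this example, with a made-up 2x2 joint distribution, is not from the slides):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability table (zero entries are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution p(a, b) over two binary variables.
p_ab = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)

# Symmetric form: I(a;b) = H(a) + H(b) - H(a,b)
mi_sym = entropy(p_a) + entropy(p_b) - entropy(p_ab)
# Asymmetric form: I(a;b) = H(a) - H(a|b), with H(a|b) = H(a,b) - H(b)
mi_asym = entropy(p_a) - (entropy(p_ab) - entropy(p_b))

print(mi_sym, mi_asym)   # both about 0.278 bits
```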

8
Some advantages of mutual information
  • If the modules output constants the mutual
    information is zero.
  • If the modules each output a vector, the mutual
    information is maximized by making the components
    of each vector as independent as possible.
  • Mutual information exactly captures what we mean
    by agreeing.

9
A problem
  • We can never have more mutual information between
    the two output vectors than there is between the
    two input vectors.
  • So why not just use the input vector as the
    output?
  • We want to preserve as much mutual information as
    possible whilst also achieving something else:
  • Dimensionality reduction?
  • A simple form for the prediction of one output
    from the other?

10
A simple form for the relationship
  • Assume the output of module a equals the output
    of module b plus noise. Then I(a; b) = 0.5 log( V(a) / V(a - b) ).
  • If we assume that a and b are both noisy versions
    of the same underlying signal we can use the
    symmetric form I = 0.5 log( V(a + b) / V(a - b) ),
    as in the training sketch below.
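A minimal training sketch using this objective; the architecture, data, and hyperparameters below are made up for illustration, so this is one way to implement the idea rather than the slides' exact setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_module():
    # Small non-linear feature extractor; the sizes are arbitrary.
    return nn.Sequential(nn.Linear(10, 20), nn.Tanh(), nn.Linear(20, 1))

net_a, net_b = make_module(), make_module()
opt = torch.optim.Adam(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-2)

# Fake paired inputs standing in for the image at time t and at time t+1
# (or two neighbouring patches of a stereo pair): both are noisy views of
# the same underlying signal.
signal = torch.randn(256, 10)
x1 = signal + 0.1 * torch.randn(256, 10)
x2 = signal + 0.1 * torch.randn(256, 10)

for step in range(100):
    a = net_a(x1).squeeze(1)
    b = net_b(x2).squeeze(1)
    # Gaussian agreement objective: I = 0.5 * log( V(a+b) / V(a-b) )
    mi = 0.5 * torch.log(torch.var(a + b) / torch.var(a - b))
    opt.zero_grad()
    (-mi).backward()   # maximize MI by gradient ascent on both networks
    opt.step()
```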

11
Learning temporal invariances
[Figure: the same two-network architecture (images at time t and t+1 feeding hidden layers that produce non-linear features); the mutual information between the two feature outputs is maximized by backpropagating its derivatives through both networks.]
12
Maximizing mutual information between a local
region and a larger context
[Figure: five modules, each with its own hidden layers, look at adjacent patches of a random-dot stereo pair (left eye, right eye, surface); the outputs of the four surrounding modules are combined with weights w1, w2, w3, w4 into a contextual prediction, and the objective is to maximize the MI between this prediction and the output of the middle module.]
13
How well does it work?
  • If we use weight sharing between modules and
    plenty of hidden units, it works really well.
  • It extracts the depth of the surface fairly
    accurately.
  • It simultaneously learns the optimal weights of
    -1/6, 4/6, 4/6, -1/6 for interpolating the
    depths of the context to predict the depth at the
    middle module.
  • If the data is noisy or the modules are
    unreliable it learns a more robust interpolator
    that uses smaller weights in order not to amplify
    noise.
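These weights are just the Lagrange weights for cubic interpolation at the middle of four equally spaced samples. The sketch below (made-up smooth depth data, not from the slides) recovers them by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample random cubic depth profiles at positions -2, -1, 0, 1, 2;
# the middle depth is the prediction target.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
coeffs = rng.normal(size=(1000, 4))                    # 1000 random cubics
depths = np.vander(x, 4, increasing=True) @ coeffs.T   # shape (5, 1000)

context = depths[[0, 1, 3, 4], :].T   # the four surrounding depths
target = depths[2, :]                 # the middle depth

# Least-squares weights for predicting the middle depth from its context.
w, *_ = np.linalg.lstsq(context, target, rcond=None)
print(np.round(w, 3))   # approximately [-0.167, 0.667, 0.667, -0.167]
```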

14
But what about discontinuities?
  • Real surfaces are mostly smooth but also have
    sharp discontinuities in depth.
  • How can we preserve the high mutual information
    between local depth and contextual depth?
  • Discontinuities cause occasional high residual
    errors. The Gaussian model of residuals requires
    high variance to accommodate these large errors.

15
A simple mixture approach
  • We assume that there are continuity cases in
    which there is high MI and discontinuity cases
    in which there is no MI.
  • The variance of the residual is only computed on
    the continuity cases so it can stay small.
  • The residual can be used to compute the posterior
    probability of each type of case.
  • Aim to maximize the mixing proportion of the
    continuity cases times the MI in those cases.
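One plausible way to write the objective down, with the residual variances of the two kinds of case and the prior mixing proportion treated as assumed parameters; this is a sketch, not the slides' exact formulation.

```python
import numpy as np

def mixture_objective(a, b, sigma_cont=0.1, sigma_disc=1.0, prior_cont=0.8):
    """Mixing proportion of the continuity cases times the MI in those cases.
    a, b: arrays of local and contextually predicted depths; the sigmas and
    the prior are assumed parameters."""
    resid = a - b

    def gauss(r, s):
        return np.exp(-0.5 * (r / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    # Posterior probability that each case is a continuity case.
    p_cont = prior_cont * gauss(resid, sigma_cont)
    p_disc = (1.0 - prior_cont) * gauss(resid, sigma_disc)
    resp = p_cont / (p_cont + p_disc)
    pi_cont = resp.mean()   # mixing proportion of the continuity cases

    # Responsibility-weighted variances, so the residual variance is
    # effectively computed on the continuity cases only and stays small.
    def wvar(x, w):
        m = np.average(x, weights=w)
        return np.average((x - m) ** 2, weights=w)

    mi_cont = 0.5 * np.log(wvar(a + b, resp) / wvar(a - b, resp))
    return pi_cont * mi_cont
```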

16
Mixtures of expert interpolators
  • Instead of just giving up on discontinuity cases
    we can use a different interpolator that ignores
    the surface beyond the discontinuity
  • To predict the depth at c, use 2b - a (linear
    extrapolation from the side without the discontinuity).
  • To choose this interpolator, find the location of
    the discontinuity.

[Figure: five adjacent surface locations labelled a, b, c, d, e, with a possible depth discontinuity between two of them.]
17
The mixture of interpolators net
  • There are five interpolators, each with its own
    controller.
  • Each controller is a neural net that looks at the
    outputs of all 5 modules and learns to detect a
    discontinuity at a particular location.
  • The exception is the controller for the full
    interpolator, which checks that there is no
    discontinuity anywhere.
  • The mixture-of-experts objective trains the
    controllers, the interpolators and the local
    depth modules all together, as sketched below.
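A toy sketch of how the pieces could be wired together; apart from the full interpolator and the 2b - a example above, the expert weights and the linear gating controller are assumptions made here for illustration.

```python
import numpy as np

# One expert interpolator per case: no discontinuity, or a discontinuity in
# one of the four gaps between the locations a, b, c, d, e.
EXPERTS = np.array([
    [-1/6, 4/6, 4/6, -1/6],   # no discontinuity: full cubic interpolator
    [ 0.0, 1/3, 1.0, -1/3],   # discontinuity between a and b: use b, d, e
    [ 0.0, 0.0, 2.0, -1.0],   # discontinuity between b and c: 2d - e
    [-1.0, 2.0, 0.0,  0.0],   # discontinuity between c and d: 2b - a
    [-1/3, 1.0, 1/3,  0.0],   # discontinuity between d and e: use a, b, d
])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_c(depths, W, bias):
    """depths: the five modules' depth outputs (a, b, c, d, e).
    W, bias: parameters of a toy linear controller standing in for the
    small neural nets that detect each discontinuity location."""
    responsibilities = softmax(depths @ W + bias)   # one weight per expert
    expert_preds = EXPERTS @ depths[[0, 1, 3, 4]]   # each expert's prediction of c
    return responsibilities @ expert_preds          # mixture of expert interpolators

# Example: a step edge between c and d.
depths = np.array([1.0, 1.0, 1.0, 5.0, 5.0])
W, bias = np.zeros((5, 5)), np.zeros(5)
print(predict_c(depths, W, bias))   # untrained gating: plain average of the experts
```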

18
Mutual Information with multidimensional output
  • For a multidimensional Gaussian, the entropy is
    given by the log-determinant of its covariance matrix.
  • If we use the identity model of the relationship
    between the outputs of two modules we get a ratio
    of determinants of covariance matrices (see below).
  • If we assume the outputs are jointly Gaussian we
    get the standard Gaussian mutual information (see below).
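Under the stated Gaussian assumptions, the standard forms are as follows (a hedged reconstruction; the slides' exact notation may differ):

```latex
% Entropy of a k-dimensional Gaussian with covariance \Sigma_a
H(\mathbf{a}) = \tfrac{1}{2}\log\!\left((2\pi e)^{k}\,\lvert\Sigma_{\mathbf{a}}\rvert\right)

% Identity-plus-noise model (a and b are noisy versions of the same vector signal):
I \approx \tfrac{1}{2}\log\frac{\lvert\Sigma_{\mathbf{a}+\mathbf{b}}\rvert}{\lvert\Sigma_{\mathbf{a}-\mathbf{b}}\rvert}

% Jointly Gaussian outputs, with \Sigma the full joint covariance of (a, b):
I(\mathbf{a};\mathbf{b}) = \tfrac{1}{2}\log\frac{\lvert\Sigma_{\mathbf{a}}\rvert\,\lvert\Sigma_{\mathbf{b}}\rvert}{\lvert\Sigma\rvert}
```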

19
Relationship to linear dynamical system
[Figure: linear features are extracted from the image at time t and from the image at time t+1; a linear model (which could be the identity plus noise) maps the past features to the current ones, and the prediction is made in this feature domain, as in a linear dynamical system.]