CSC321: Neural Networks Lecture 10: Learning ensembles of Networks


1
CSC321: Neural Networks Lecture 10: Learning ensembles of Networks
  • Geoffrey Hinton

2
Combining networks
  • When the amount of training data is limited, we
    need to avoid overfitting.
  • Averaging the predictions of many different
    networks is a good way to do this.
  • It works best if the networks are as different as
    possible.
  • If the data is really a mixture of several
    different regimes it is helpful to identify
    these regimes and use a separate, simple model
    for each regime.
  • We want to use the desired outputs to help
    cluster cases into regimes. Just clustering the
    inputs is not as efficient.

3
Combining networks reduces variance
  • We want to compare two expected squared errors.
  • Method 1: pick one of the predictors at random.
  • Method 2: use the average of the predictors, ȳ
    (the comparison is reconstructed below).

The cross term in the expansion vanishes.
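A reconstruction of the comparison (the slide showed it as a formula, so the exact notation is my assumption; d is the target, y_i the output of predictor i, ȳ their average, and the expectation is over picking a predictor uniformly at random):
\[
\mathbb{E}_i\big[(d - y_i)^2\big]
= (d - \bar{y})^2
\;-\; 2\,(d - \bar{y})\,\underbrace{\mathbb{E}_i\big[y_i - \bar{y}\big]}_{=\,0}
\;+\; \mathbb{E}_i\big[(y_i - \bar{y})^2\big]
= (d - \bar{y})^2 + \mathbb{E}_i\big[(y_i - \bar{y})^2\big].
\]
So Method 1 incurs the error of Method 2 plus the variance of the predictors around their average.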
4
A picture
[Figure: the target d and the average prediction y, with individual predictors scattered around them; a "good guy" predictor lies nearer to d than average, a "bad guy" lies further away.]
  • The predictors that are further than average from
    d make bigger than average squared errors.
  • The predictors that are nearer than average to d
    make smaller than average squared errors.
  • The first effect dominates because squares work
    like that (a one-line check follows this slide).
  • Don't try averaging if you want to synchronize a
    bunch of clocks!

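A one-line check of why the first effect dominates (not on the original slide, just the algebra behind it): if two predictors sit symmetrically at ȳ + ε and ȳ - ε, their average squared error is
\[
\tfrac{1}{2}\Big[(d - \bar{y} - \varepsilon)^2 + (d - \bar{y} + \varepsilon)^2\Big]
= (d - \bar{y})^2 + \varepsilon^2,
\]
so the predictor on the far side of ȳ loses more than the one on the near side gains.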
5
How the combined predictor compares with the
individual predictors
  • On any one test case, some individual predictors
    will be better than the combined predictor.
  • But different individuals will be better on
    different cases.
  • If the individual predictors disagree a lot, the
    combined predictor is typically better than all
    of the individual predictors when we average over
    test cases.
  • So how do we make the individual predictors
    disagree, without making them much worse
    individually?

6
Ways to make predictors differ
  • Rely on the learning algorithm getting stuck in a
    different local optimum on each run.
  • A dubious hack unworthy of a true computer
    scientist (but definitely worth a try).
  • Use lots of different kinds of models:
  • Different architectures.
  • Different learning algorithms.
  • Use different training data for each model:
  • Bagging: resample (with replacement) from the
    training set, e.g. a,b,c,d,e -> a c c d d
    (a resampling sketch follows this list).
  • Boosting: fit models one at a time, re-weighting
    each training case by how badly it is predicted
    by the models already fitted.
  • This makes efficient use of computer time because
    it does not bother to back-fit models that were
    fitted earlier.
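A minimal sketch of bagging's resampling step (illustrative only; the helper name `bootstrap_sample` is mine, not from the lecture):

```python
import random

def bootstrap_sample(training_set):
    # Draw a resample of the same size, with replacement (bagging).
    n = len(training_set)
    return [random.choice(training_set) for _ in range(n)]

# Each model in the ensemble is trained on its own resample.
data = ["a", "b", "c", "d", "e"]
resamples = [bootstrap_sample(data) for _ in range(3)]
# One resample might come out as ['a', 'c', 'c', 'd', 'd'].
```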

7
Mixtures of Experts
  • Can we do better than just averaging predictors
    in a way that does not depend on the particular
    training case?
  • Maybe we can look at the input data for a
    particular case to help us decide which model to
    rely on.
  • This may allow particular models to specialize in
    a subset of the training cases. They do not learn
    on cases for which they are not picked. So they
    can ignore stuff they are not good at modeling.
  • The key idea is to make each expert focus on
    predicting the right answer for the cases where
    it is used.
  • This causes specialization.
  • If we always average all the predictors, each
    model is trying to compensate for the combined
    error made by all the other models.

8
Another picture
[Figure: the target value and the average of all the other predictors.]
Do we really want to move the output of predictor i
away from the target value?
9
Making an error function that encourages
specialization instead of cooperation
  • If we want to encourage cooperation, we compare
    the average of all the predictors with the target
    and train to reduce the discrepancy.
  • This can overfit badly. It makes the model much
    more powerful than training each predictor
    separately.
  • If we want to encourage specialization we compare
    each predictor separately with the target and
    train to reduce the average of all these
    discrepancies.
  • It's best to use a weighted average, where the
    weights, p, are the probabilities of picking that
    expert for the particular training case (both
    error functions are written out below).

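A reconstruction of the two error functions (the slide showed them as formulas, so the exact notation is my assumption; d is the target, y_i the output of predictor i, and p_i the probability of picking expert i for this case):

Cooperation: compare the average of all the predictors with the target,
\[
E_{\text{coop}} = \Big(d - \tfrac{1}{N}\textstyle\sum_i y_i\Big)^2.
\]

Specialization: compare each predictor with the target and average the discrepancies, weighted by p_i,
\[
E_{\text{spec}} = \sum_i p_i\,(d - y_i)^2.
\]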
10
The mixture of experts architecture
  • Combined predictor (written out below)
  • Error function for training (written out below)
  • (There is actually a slightly better error
    function)

[Figure: Expert 1, Expert 2, and Expert 3 alongside a softmax gating network, all fed by the same input; the gating network's outputs give the mixing proportions.]
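A reconstruction of the combined predictor and the training error (the slide showed them as formulas; the softmax-gated notation below is my assumption, with x_i the gating network's output for expert i):
\[
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad
y = \sum_i p_i\, y_i, \qquad
E = \sum_i p_i\,(d - y_i)^2.
\]
Note that training uses the specialization error E, which compares each expert separately with the target, rather than the squared error of the combined prediction y.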
11
The derivatives
  • If we differentiate w.r.t. the outputs of the
    experts we get a signal for training each expert.
  • If we differentiate w.r.t. the outputs of the
    gating network we get a signal for training the
    gating net.
  • We want to raise p for all experts that give less
    than the average squared error of all the experts
    (weighted by p); see the reconstruction below.
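A reconstruction of the derivatives (the slide showed them as formulas; assuming the error E = sum_i p_i (d - y_i)^2 from the previous slide, with p_i the softmax of the gating network's logits x_i):
\[
\frac{\partial E}{\partial y_i} = -2\,p_i\,(d - y_i),
\qquad
\frac{\partial E}{\partial x_i} = p_i\Big[(d - y_i)^2 - E\Big].
\]
The second expression is negative exactly when expert i's squared error is below the p-weighted average E, so gradient descent raises that expert's gating logit, and hence its p.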