Title: CSC321: Neural Networks, Lecture 10: Learning Ensembles of Networks
1. CSC321 Neural Networks, Lecture 10: Learning Ensembles of Networks
2. Combining networks
- When the amount of training data is limited, we need to avoid overfitting.
- Averaging the predictions of many different networks is a good way to do this.
- It works best if the networks are as different as possible.
- If the data is really a mixture of several different regimes, it is helpful to identify these regimes and use a separate, simple model for each regime.
  - We want to use the desired outputs to help cluster cases into regimes. Just clustering the inputs is not as efficient.
3. Combining networks reduces variance
- We want to compare two expected squared errors:
  - Method 1: pick one of the predictors at random.
  - Method 2: use the average of the predictors, ȳ.
- The cross term in the expansion of Method 1's error vanishes, so the difference between the two methods is just the variance of the predictors (a sketch of the algebra follows below).
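A reconstruction of the comparison these bullets refer to, using the notation of the next slide (d is the target, y_i the individual predictions, ȳ their mean, and ⟨·⟩_i an average over predictors); the exact symbols are an assumption on my part.

\[
\big\langle (d - y_i)^2 \big\rangle_i
= \big\langle \big((d - \bar{y}) - (y_i - \bar{y})\big)^2 \big\rangle_i
= (d - \bar{y})^2 + \big\langle (y_i - \bar{y})^2 \big\rangle_i
- 2\,(d - \bar{y})\,\underbrace{\big\langle y_i - \bar{y} \big\rangle_i}_{=\,0}
\]

So the expected squared error of a randomly picked predictor (Method 1) exceeds the squared error of the average ȳ (Method 2) by exactly the variance of the predictors.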
4. A picture
[Figure: individual predictions scattered around the target d and their mean ȳ; a "good guy" predictor lies nearer to d than average, a "bad guy" lies farther away.]
- The predictors that are further than average from d make bigger than average squared errors.
- The predictors that are nearer than average to d make smaller than average squared errors.
- The first effect dominates, because squared error grows quadratically with distance: being extra far costs more than being equally extra near saves.
- Don't try averaging if you want to synchronize a bunch of clocks!
5. How the combined predictor compares with the individual predictors
- On any one test case, some individual predictors will be better than the combined predictor.
- But different individuals will be better on different cases.
- If the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases (a small numerical illustration follows after this list).
- So how do we make the individual predictors disagree, without making them much worse individually?
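The following is a minimal numerical sketch of that claim, not code from the lecture: ten hypothetical predictors whose errors are independent noise around the target, compared individually and after averaging. The predictor count, noise scale, and variable names are illustrative assumptions; with correlated or biased predictors the gain from averaging would be smaller.

import numpy as np

rng = np.random.default_rng(0)
n_predictors, n_cases = 10, 1000

d = rng.normal(size=n_cases)                        # target value for each test case
# each predictor = target + its own independent noise, so the predictors disagree
y = d + rng.normal(scale=0.5, size=(n_predictors, n_cases))

individual_mse = ((y - d) ** 2).mean(axis=1)        # squared error of each predictor
combined_mse = ((y.mean(axis=0) - d) ** 2).mean()   # squared error of the average

print("mean individual MSE:", individual_mse.mean())    # about 0.25
print("MSE of the averaged predictor:", combined_mse)   # about 0.25 / 10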
6. Ways to make predictors differ
- Rely on the learning algorithm getting stuck in a different local optimum on each run.
  - A dubious hack unworthy of a true computer scientist (but definitely worth a try).
- Use lots of different kinds of models:
  - Different architectures.
  - Different learning algorithms.
- Use different training data for each model:
  - Bagging: resample (with replacement) from the training set, e.g. a, b, c, d, e -> a, c, c, d, d (a minimal resampling sketch follows after this list).
  - Boosting: fit models one at a time, re-weighting each training case by how badly it is predicted by the models already fitted.
    - This makes efficient use of computer time because it does not bother to back-fit models that were fitted earlier.
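Here is a minimal, hypothetical sketch of the bagging resampling step described above (not code from the lecture): each ensemble member gets its own bootstrap sample drawn with replacement, so duplicates and omissions like the a, c, c, d, d example are expected.

import random

def bootstrap_sample(training_set, seed):
    """Draw len(training_set) cases with replacement, as bagging does."""
    rng = random.Random(seed)
    return [rng.choice(training_set) for _ in training_set]

training_set = ["a", "b", "c", "d", "e"]
# one bootstrap sample per ensemble member; each model trains on its own sample
for seed in range(3):
    print(bootstrap_sample(training_set, seed))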
7. Mixtures of Experts
- Can we do better than just averaging predictors in a way that does not depend on the particular training case?
- Maybe we can look at the input data for a particular case to help us decide which model to rely on.
- This may allow particular models to specialize in a subset of the training cases. They do not learn on cases for which they are not picked, so they can ignore stuff they are not good at modeling.
- The key idea is to make each expert focus on predicting the right answer for the cases where it is used. This causes specialization.
- If we always average all the predictors, each model is trying to compensate for the combined error made by all the other models.
8. Another picture
[Figure: the target value and the average of all the other predictors.]
- Do we really want to move the output of predictor i away from the target value?
9. Making an error function that encourages specialization instead of cooperation
- If we want to encourage cooperation, we compare the average of all the predictors with the target and train to reduce the discrepancy.
  - This can overfit badly. It makes the model much more powerful than training each predictor separately.
- If we want to encourage specialization, we compare each predictor separately with the target and train to reduce the average of all these discrepancies.
  - It's best to use a weighted average, where the weights, p_i, are the probabilities of picking expert i for the particular training case.
(Both error functions are written out below.)
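A hedged reconstruction of the two error functions this slide contrasts, in the notation used above (d is the target, y_i is predictor i's output, N the number of predictors, p_i the probability of picking expert i for this case):

\[
E_{\text{cooperation}} = \Big(d - \tfrac{1}{N}\sum_i y_i\Big)^2,
\qquad
E_{\text{specialization}} = \sum_i p_i\,(d - y_i)^2
\]

Minimizing the first couples all the predictors together; minimizing the second only penalizes expert i on cases where it is likely to be picked.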
10. The mixture of experts architecture
- Combined predictor (written out below).
- Error function for training (written out below).
  - (There is actually a slightly better error function.)
[Figure: Expert 1, Expert 2, and Expert 3 alongside a softmax gating network, all receiving the same input; the gating network's outputs weight the experts' predictions.]
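The formulas these bullets point to are not legible in this transcript, so the following is a reconstruction based on the standard mixture-of-experts setup and the error function of the previous slide; the form of the "slightly better" objective (the negative log probability of the target under a mixture of Gaussians centred on the experts' outputs) is my assumption.

\[
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}},
\qquad
y = \sum_i p_i\, y_i,
\qquad
E = \sum_i p_i\,(d - y_i)^2
\]
\[
E_{\text{better}} = -\log \sum_i p_i\, e^{-\frac{1}{2}(d - y_i)^2}
\]

Here x_i are the outputs (logits) of the gating network and y_i are the experts' predictions.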
11. The derivatives
- If we differentiate w.r.t. the outputs of the experts, we get a signal for training each expert.
- If we differentiate w.r.t. the outputs of the gating network, we get a signal for training the gating net.
- We want to raise p_i for all experts that give less than the average squared error of all the experts (weighted by p). A small sketch of both derivatives follows below.
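A minimal NumPy sketch of those derivatives for the error E = Σ_i p_i (d − y_i)² (my own illustration, not lecture code); the function names are invented, and the analytic gradients are checked against finite differences.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_error(y, x, d):
    """p-weighted squared error of the experts for one training case."""
    p = softmax(x)
    return np.sum(p * (d - y) ** 2)

def moe_gradients(y, x, d):
    p = softmax(x)
    e = (d - y) ** 2                 # each expert's squared error
    E = np.sum(p * e)                # the p-weighted average error
    dE_dy = 2.0 * p * (y - d)        # signal for training each expert
    dE_dx = p * (e - E)              # signal for the gating net: negative (so p_i rises
                                     # under gradient descent) exactly when expert i
                                     # beats the weighted-average error
    return dE_dy, dE_dx

# quick finite-difference check of dE/dy on random values
rng = np.random.default_rng(0)
y, x, d = rng.normal(size=3), rng.normal(size=3), 0.7
dE_dy, dE_dx = moe_gradients(y, x, d)
eps = 1e-6
numeric = np.array([(moe_error(y + eps * np.eye(3)[i], x, d)
                     - moe_error(y - eps * np.eye(3)[i], x, d)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(dE_dy, numeric))   # True: the analytic gradient matches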