Title: CSC321: Neural Networks, Lecture 10: Learning Ensembles of Networks
1. CSC321 Neural Networks, Lecture 10: Learning Ensembles of Networks
2. Combining networks
- When the amount of training data is limited, we need to avoid overfitting.
- Averaging the predictions of many different networks is a good way to do this.
- It works best if the networks are as different as possible.
- If the data is really a mixture of several different regimes, it is helpful to identify these regimes and use a separate, simple model for each regime.
  - We want to use the desired outputs to help cluster cases into regimes. Just clustering the inputs is not as efficient.
3. Combining networks reduces variance
- We want to compare two expected squared errors:
  - Method 1: pick one of the predictors at random.
  - Method 2: use the average of the predictors, ȳ.
- The cross term in the expansion of Method 1's error vanishes, so the difference between the two methods is just the variance of the predictors (a sketch of the algebra follows below).
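A reconstruction of the comparison these bullets refer to, using the notation of the next slide (d is the target, y_i the individual predictions, ȳ their mean, and ⟨·⟩_i an average over predictors); the exact symbols are an assumption on my part.

\[
\big\langle (d - y_i)^2 \big\rangle_i
= \big\langle \big((d - \bar{y}) - (y_i - \bar{y})\big)^2 \big\rangle_i
= (d - \bar{y})^2 + \big\langle (y_i - \bar{y})^2 \big\rangle_i
- 2\,(d - \bar{y})\,\underbrace{\big\langle y_i - \bar{y} \big\rangle_i}_{=\,0}
\]

So the expected squared error of a randomly picked predictor (Method 1) exceeds the squared error of the average ȳ (Method 2) by exactly the variance of the predictors.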
4. A picture
[Figure: individual predictions scattered around the target d and their mean ȳ; a "good guy" predictor lies nearer to d than average, a "bad guy" lies farther away.]
- The predictors that are further than average from d make bigger than average squared errors.
- The predictors that are nearer than average to d make smaller than average squared errors.
- The first effect dominates, because squared error grows quadratically with distance: being extra far costs more than being equally extra near saves.
- Don't try averaging if you want to synchronize a bunch of clocks!
5. How the combined predictor compares with the individual predictors
- On any one test case, some individual predictors will be better than the combined predictor.
- But different individuals will be better on different cases.
- If the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases (a small numerical illustration follows after this list).
- So how do we make the individual predictors disagree, without making them much worse individually?
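The following is a minimal numerical sketch of that claim, not code from the lecture: ten hypothetical predictors whose errors are independent noise around the target, compared individually and after averaging. The predictor count, noise scale, and variable names are illustrative assumptions; with correlated or biased predictors the gain from averaging would be smaller.

import numpy as np

rng = np.random.default_rng(0)
n_predictors, n_cases = 10, 1000

d = rng.normal(size=n_cases)                        # target value for each test case
# each predictor = target + its own independent noise, so the predictors disagree
y = d + rng.normal(scale=0.5, size=(n_predictors, n_cases))

individual_mse = ((y - d) ** 2).mean(axis=1)        # squared error of each predictor
combined_mse = ((y.mean(axis=0) - d) ** 2).mean()   # squared error of the average

print("mean individual MSE:", individual_mse.mean())    # about 0.25
print("MSE of the averaged predictor:", combined_mse)   # about 0.25 / 10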
6. Ways to make predictors differ
- Rely on the learning algorithm getting stuck in a different local optimum on each run.
  - A dubious hack unworthy of a true computer scientist (but definitely worth a try).
- Use lots of different kinds of models:
  - Different architectures.
  - Different learning algorithms.
- Use different training data for each model:
  - Bagging: resample (with replacement) from the training set, e.g. a, b, c, d, e -> a, c, c, d, d (a minimal resampling sketch follows after this list).
  - Boosting: fit models one at a time, re-weighting each training case by how badly it is predicted by the models already fitted.
    - This makes efficient use of computer time because it does not bother to back-fit models that were fitted earlier.
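Here is a minimal, hypothetical sketch of the bagging resampling step described above (not code from the lecture): each ensemble member gets its own bootstrap sample drawn with replacement, so duplicates and omissions like the a, c, c, d, d example are expected.

import random

def bootstrap_sample(training_set, seed):
    """Draw len(training_set) cases with replacement, as bagging does."""
    rng = random.Random(seed)
    return [rng.choice(training_set) for _ in training_set]

training_set = ["a", "b", "c", "d", "e"]
# one bootstrap sample per ensemble member; each model trains on its own sample
for seed in range(3):
    print(bootstrap_sample(training_set, seed))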
7. Mixtures of Experts
- Can we do better than just averaging predictors in a way that does not depend on the particular training case?
- Maybe we can look at the input data for a particular case to help us decide which model to rely on.
- This may allow particular models to specialize in a subset of the training cases. They do not learn on cases for which they are not picked, so they can ignore stuff they are not good at modeling.
- The key idea is to make each expert focus on predicting the right answer for the cases where it is used. This causes specialization.
- If we always average all the predictors, each model is trying to compensate for the combined error made by all the other models.
8. Another picture
[Figure: the target value and the average of all the other predictors.]
- Do we really want to move the output of predictor i away from the target value?
9. Making an error function that encourages specialization instead of cooperation
- If we want to encourage cooperation, we compare the average of all the predictors with the target and train to reduce the discrepancy.
  - This can overfit badly. It makes the model much more powerful than training each predictor separately.
- If we want to encourage specialization, we compare each predictor separately with the target and train to reduce the average of all these discrepancies.
  - It's best to use a weighted average, where the weights, p_i, are the probabilities of picking expert i for the particular training case.
(Both error functions are written out below.)
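A hedged reconstruction of the two error functions this slide contrasts, in the notation used above (d is the target, y_i is predictor i's output, N the number of predictors, p_i the probability of picking expert i for this case):

\[
E_{\text{cooperation}} = \Big(d - \tfrac{1}{N}\sum_i y_i\Big)^2,
\qquad
E_{\text{specialization}} = \sum_i p_i\,(d - y_i)^2
\]

Minimizing the first couples all the predictors together; minimizing the second only penalizes expert i on cases where it is likely to be picked.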
10. The mixture of experts architecture
- Combined predictor (written out below).
- Error function for training (written out below).
  - (There is actually a slightly better error function.)
[Figure: Expert 1, Expert 2, and Expert 3 alongside a softmax gating network, all receiving the same input; the gating network's outputs weight the experts' predictions.]
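The formulas these bullets point to are not legible in this transcript, so the following is a reconstruction based on the standard mixture-of-experts setup and the error function of the previous slide; the form of the "slightly better" objective (the negative log probability of the target under a mixture of Gaussians centred on the experts' outputs) is my assumption.

\[
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}},
\qquad
y = \sum_i p_i\, y_i,
\qquad
E = \sum_i p_i\,(d - y_i)^2
\]
\[
E_{\text{better}} = -\log \sum_i p_i\, e^{-\frac{1}{2}(d - y_i)^2}
\]

Here x_i are the outputs (logits) of the gating network and y_i are the experts' predictions.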
11. The derivatives
- If we differentiate w.r.t. the outputs of the experts, we get a signal for training each expert.
- If we differentiate w.r.t. the outputs of the gating network, we get a signal for training the gating net.
- We want to raise p_i for all experts that give less than the average squared error of all the experts (weighted by p). A small sketch of both derivatives follows below.
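A minimal NumPy sketch of those derivatives for the error E = Σ_i p_i (d − y_i)² (my own illustration, not lecture code); the function names are invented, and the analytic gradients are checked against finite differences.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_error(y, x, d):
    """p-weighted squared error of the experts for one training case."""
    p = softmax(x)
    return np.sum(p * (d - y) ** 2)

def moe_gradients(y, x, d):
    p = softmax(x)
    e = (d - y) ** 2                 # each expert's squared error
    E = np.sum(p * e)                # the p-weighted average error
    dE_dy = 2.0 * p * (y - d)        # signal for training each expert
    dE_dx = p * (e - E)              # signal for the gating net: negative (so p_i rises
                                     # under gradient descent) exactly when expert i
                                     # beats the weighted-average error
    return dE_dy, dE_dx

# quick finite-difference check of dE/dy on random values
rng = np.random.default_rng(0)
y, x, d = rng.normal(size=3), rng.normal(size=3), 0.7
dE_dy, dE_dx = moe_gradients(y, x, d)
eps = 1e-6
numeric = np.array([(moe_error(y + eps * np.eye(3)[i], x, d)
                     - moe_error(y - eps * np.eye(3)[i], x, d)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(dE_dy, numeric))   # True: the analytic gradient matches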