Title: Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
- Shinichi Nakajima, Sumio Watanabe
- Tokyo Institute of Technology
- Nikon Corporation
Contents
- Backgrounds
- Regular models
- Unidentifiable models
- Superiority of Bayes to ML
- What's the purpose?
- Setting
- Model
- Subspace Bayes (SB) Approach
- Analysis
- (James-Stein estimator)
- Solution
- Generalization error
- Discussion & Conclusions
Regular Models
Conventional learning theory:
K: dimensionality of parameter space
n: number of samples
x: input
y: output
1. Asymptotic normality holds for the distribution of the ML estimator and for the Bayes posterior, giving
generalization error: $G(n) \simeq K/(2n)$
free energy: $F(n) \simeq (K/2)\log n$
These expansions underlie the model selection methods (AIC, BIC, MDL).
2. Asymptotic generalization error: $\lambda(\mathrm{ML}) = \lambda(\mathrm{Bayes}) = K/2$, where $G(n) = \lambda/n + o(1/n)$.
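To make the regular-model rate concrete, here is a minimal numeric sketch (a toy setup of ours, not from the paper): ML estimation of a K-dimensional Gaussian mean, whose expected KL generalization error should match K/(2n).

```python
import numpy as np

# Toy regular model: estimate the mean of N(0, I_K) from n samples by ML
# (the sample mean); G = KL(true || estimated) = ||mu_hat||^2 / 2.
rng = np.random.default_rng(0)
K, n, trials = 8, 400, 2000
G = 0.0
for _ in range(trials):
    mu_hat = rng.standard_normal((n, K)).mean(axis=0)  # ML estimate; true mean is 0
    G += 0.5 * mu_hat @ mu_hat
print(G / trials, K / (2 * n))  # both are approximately 0.01
```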
Unidentifiable Models
H: number of components
1. Asymptotic normality does NOT hold.
No (penalized-likelihood-type) information criterion is available.
Superiority of Bayes to ML
How do singularities work in learning?
When the true distribution lies on the singularities:
- The increased neighborhood of the true parameter accelerates overfitting.
- The increased population of parameters denoting the true distribution suppresses overfitting (only in Bayes).
1. Asymptotic normality does NOT hold.
No (penalized-likelihood-type) information criterion is available.
2. Bayes has an advantage: $G(\mathrm{Bayes}) < G(\mathrm{ML})$.
What's the Purpose?
- Bayes provides good generalization.
- But it is expensive (needs Markov chain Monte Carlo).
Is there any approximation with both good generalization and tractability?
- Variational Bayes (VB) [Hinton & van Camp 93; MacKay 95; Attias 99; Ghahramani & Beal 00]
  Analyzed in another paper [Nakajima & Watanabe 05].
- Subspace Bayes (SB): analyzed in this work.
Linear Neural Networks (LNNs)
An LNN with M input units, N output units, and H hidden units:
$f(x; A, B) = BAx$
A: input parameter matrix (H x M)
B: output parameter matrix (N x H)
Essential parameter dimensionality: $K = H(M + N - H)$
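A minimal sketch of the LNN map and the parameter count above (the sizes M, N, H below are arbitrary examples of ours):

```python
import numpy as np

M, N, H = 50, 30, 5                # input dim, output dim, hidden units
rng = np.random.default_rng(0)
A = rng.standard_normal((H, M))    # input parameter matrix (H x M)
B = rng.standard_normal((N, H))    # output parameter matrix (N x H)

def lnn(x):
    """Forward map of the LNN: y = B A x."""
    return B @ (A @ x)

# Essential dimensionality: the set of N x M matrices of rank H has
# dimension K = H (M + N - H), fewer than the raw count H (M + N).
K = H * (M + N - H)
print(lnn(np.ones(M)).shape, K)    # (30,) 375
```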
Maximum Likelihood Estimator [Baldi & Hornik 95]
The ML estimator is given by
$\hat{B}\hat{A} = \sum_{h=1}^{H} \gamma_h\, \omega_{b h}\, \omega_{a h}^{\top}\, Q^{-1/2},$
where $Q = n^{-1}\sum_{i=1}^{n} x_i x_i^{\top}$ and $R = n^{-1}\sum_{i=1}^{n} y_i x_i^{\top}$.
Here
$\gamma_h$: h-th largest singular value of $RQ^{-1/2}$,
$\omega_{a h}$: corresponding right singular vector,
$\omega_{b h}$: corresponding left singular vector.
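A sketch of this reduced-rank solution under the definitions above; the function name and the use of NumPy are ours:

```python
import numpy as np

def ml_estimator(X, Y, H):
    """X: (n, M) inputs, Y: (n, N) outputs. Returns the rank-H ML map B A."""
    n = X.shape[0]
    Q = X.T @ X / n                    # empirical input moment matrix (M, M)
    R = Y.T @ X / n                    # empirical cross moment matrix (N, M)
    evals, evecs = np.linalg.eigh(Q)
    Q_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric Q^{-1/2}
    U, gamma, Vt = np.linalg.svd(R @ Q_isqrt)            # gamma: singular values
    # Keep the H largest singular components of R Q^{-1/2}.
    return sum(gamma[h] * np.outer(U[:, h], Vt[h]) for h in range(H)) @ Q_isqrt

# Example usage: recover a low-rank map from noisy data.
rng = np.random.default_rng(1)
n, M, N = 1000, 8, 6
true_BA = rng.standard_normal((N, 2)) @ rng.standard_normal((2, M))
X = rng.standard_normal((n, M))
Y = X @ true_BA.T + 0.1 * rng.standard_normal((n, N))
print(np.linalg.norm(ml_estimator(X, Y, 2) - true_BA))  # small estimation error
```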
Bayes Estimation
x: input, y: output, w: parameter
True distribution: $q(y|x)$
Learner: $p(y|x; w)$
Prior: $\varphi(w)$
In ML (or MAP): predict with one model, the point estimate of $w$.
In Bayes: predict with an ensemble of models, the posterior average
$p(y|x; \text{data}) = \int p(y|x; w)\, p(w|\text{data})\, dw$.
Empirical Bayes (EB) Approach [Efron & Morris 73]
True distribution, learner, and prior are as in Bayes estimation, but the prior now depends on a hyperparameter.
The hyperparameter is estimated by maximizing the marginal likelihood.
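A concrete toy illustration of EB (our assumed example in the spirit of Efron & Morris, not the paper's model): observations $y_k \sim \mathcal{N}(\theta_k, 1)$ with prior $\theta_k \sim \mathcal{N}(0, \tau)$, where $\tau$ is chosen to maximize the marginal likelihood.

```python
import numpy as np

def eb_posterior_mean(y):
    # Marginally, y_k ~ N(0, 1 + tau); the marginal-likelihood maximizer of
    # (1 + tau) is mean(y^2), clipped at 1 because tau >= 0.
    tau = max(0.0, float(np.mean(y ** 2)) - 1.0)
    # The posterior mean shrinks each observation toward the prior mean 0.
    return tau / (1.0 + tau) * y

y = np.array([0.3, -0.5, 4.0, 0.1])
print(eb_posterior_mean(y))  # all components shrunk by the same EB factor
```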
Subspace Bayes (SB) Approach
SB is an EB approach in which part of the parameters are regarded as hyperparameters.
a) MIP (Marginalizing in Input Parameter space) version:
A: parameter
B: hyperparameter
b) MOP (Marginalizing in Output Parameter space) version:
A: hyperparameter
B: parameter
Marginalization can be done analytically in LNNs.
Intuitive Explanation
[Figure: Bayes posterior vs. SB posterior. For a redundant component, Bayes marginalizes over the whole parameter space, while SB optimizes the hyperparameter part and marginalizes only the remaining subspace.]
Free Energy (a.k.a. Evidence, Stochastic Complexity)
Free energy: $F = -\log \int \prod_{i=1}^{n} p(y_i|x_i; w)\, \varphi(w)\, dw$
An important quantity used for model selection [Akaike 80; MacKay 92].
We minimize the free energy when optimizing the hyperparameter.
Generalization Error
Generalization error: the Kullback-Leibler divergence between the true distribution q and the predictive distribution p,
$G(n) = \left\langle \int q(x)\, q(y|x) \log \frac{q(y|x)}{p(y|x; \text{data})}\, dx\, dy \right\rangle,$
where $\langle \cdot \rangle$ denotes the expectation over training data drawn from q.
Asymptotic expansion: $G(n) = \lambda/n + o(1/n)$, where $\lambda$ is the generalization coefficient.
In regular models, $\lambda = K/2$.
In unidentifiable models, $\lambda$ depends on the learning method and can differ from $K/2$.
James-Stein (JS) Estimator
Domination of estimator a over estimator b:
$G(a) \le G(b)$ for any true parameter, and
$G(a) < G(b)$ for a certain true parameter.
K-dimensional mean estimation (a regular model) from n samples:
James-Stein estimator [James & Stein 61]:
$\hat{\theta}_{\mathrm{JS}} = \left(1 - \frac{K-2}{n\|\bar{x}\|^2}\right)\bar{x}$, where $\bar{x}$ is the sample mean.
For $K \ge 3$, the JS estimator dominates the ML estimator $\bar{x}$.
A certain relation between EB and JS was discussed in [Efron & Morris 73].
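A minimal sketch of the JS estimator and a Monte Carlo risk comparison against ML (our toy check, with true mean zero, where shrinkage helps most):

```python
import numpy as np

def js_estimator(X):
    """X: (n, K) i.i.d. samples from N(theta, I_K); returns the JS estimate."""
    n, K = X.shape
    xbar = X.mean(axis=0)
    return (1.0 - (K - 2) / (n * float(xbar @ xbar))) * xbar

# Monte Carlo risk comparison against ML (the sample mean), true theta = 0:
rng = np.random.default_rng(0)
K, n, trials = 10, 20, 2000
risk_ml = risk_js = 0.0
for _ in range(trials):
    X = rng.standard_normal((n, K))
    risk_ml += np.sum(X.mean(axis=0) ** 2)
    risk_js += np.sum(js_estimator(X) ** 2)
print(risk_js / trials, risk_ml / trials)  # JS risk is far smaller at theta = 0
```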
Positive-Part JS Estimator
Positive-part JS type (PJS) estimator:
$\hat{\theta}_{\mathrm{PJS}} = \max\!\left(0,\ 1 - \frac{L}{n\|\bar{x}\|^2}\right)\bar{x}$, where L is the degree of shrinkage.
Thresholding: $\hat{\theta}_{\mathrm{PJS}} = 0$ whenever $n\|\bar{x}\|^2 \le L$, which amounts to model selection.
PJS is a model-selecting, shrinkage estimator.
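A sketch of the PJS estimator, assuming the thresholding form stated above (degree L); note how weak components are set exactly to zero:

```python
import numpy as np

def pjs_estimator(xbar, n, L):
    """Positive-part JS with degree L: returns 0 when n * ||xbar||^2 <= L."""
    factor = max(0.0, 1.0 - L / (n * float(xbar @ xbar)))
    return factor * xbar  # the zero vector means the component is selected out

print(pjs_estimator(np.array([0.05, 0.05]), n=100, L=10))  # [0. 0.] - cut off
print(pjs_estimator(np.array([1.0, 1.0]), n=100, L=10))    # shrunk but kept
```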
Hyperparameter Optimization
Assume orthonormality of the hyperparameter matrix ($I_d$: the d x d identity matrix).
The optimization is then analytically solved in LNNs!
The optimum hyperparameter value is obtained in closed form.
SB Solution (Theorem 1, Lemma 1)
L: dimensionality of the marginalized subspace (per component), i.e., L = M in MIP, or L = N in MOP.
Theorem 1: The SB estimator is given by
$\hat{B}\hat{A}_{\mathrm{SB}} = \sum_{h=1}^{H} \hat{\gamma}_h\, \omega_{b h}\, \omega_{a h}^{\top}\, Q^{-1/2},$
where, asymptotically, $\hat{\gamma}_h = \max\!\left(0,\ 1 - \frac{L}{n\gamma_h^2}\right)\gamma_h$.
Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.
Hence SB is asymptotically equivalent to PJS estimation.
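A sketch combining the previous pieces, assuming the asymptotic PJS form of Theorem 1 (the shrinkage expression for $\hat{\gamma}_h$ is our reading of the slide; names are ours, mirroring `ml_estimator` above):

```python
import numpy as np

def sb_estimator(X, Y, H, L):
    """SB estimate of the map B A; L = M for MIP, L = N for MOP."""
    n = X.shape[0]
    Q = X.T @ X / n
    R = Y.T @ X / n
    evals, evecs = np.linalg.eigh(Q)
    Q_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric Q^{-1/2}
    U, gamma, Vt = np.linalg.svd(R @ Q_isqrt)
    # PJS-style shrinkage per component: redundant components are cut to zero.
    g = np.maximum(0.0, 1.0 - L / (n * gamma[:H] ** 2)) * gamma[:H]
    return sum(g[h] * np.outer(U[:, h], Vt[h]) for h in range(H)) @ Q_isqrt
```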
Generalization Error (Theorem 2)
Theorem 2: The SB generalization coefficient is given in closed form as an expectation over a Wishart distribution, in terms of $\alpha_h$, the h-th largest eigenvalue of a random matrix subject to $W_{N-H^*}(M-H^*, I_{N-H^*})$, where $H^*$ denotes the true rank and $\langle \cdot \rangle$ denotes the expectation over that Wishart distribution.
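The Wishart expectation in Theorem 2 can be evaluated by Monte Carlo; the sketch below shows only the sampling part, with `phi` as a hypothetical placeholder for the paper's integrand, which is not reproduced here:

```python
import numpy as np

def wishart_eigvals(N, M, H_true, rng):
    """Eigenvalues (largest first) of one draw from W_{N-H*}(M-H*, I)."""
    d, dof = N - H_true, M - H_true
    Z = rng.standard_normal((dof, d))
    return np.linalg.eigvalsh(Z.T @ Z)[::-1]

def wishart_expectation(phi, N, M, H_true, trials=10000, seed=0):
    """Monte Carlo estimate of < phi(alpha_1, ..., alpha_d) >."""
    rng = np.random.default_rng(seed)
    return np.mean([phi(wishart_eigvals(N, M, H_true, rng))
                    for _ in range(trials)])
```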
Large-Scale Approximation (Theorem 3)
Theorem 3: In the large-scale limit, where M, N, H, and $H^*$ go to infinity with their ratios fixed, the generalization coefficient converges to a deterministic value, obtained by replacing the Wishart expectation of Theorem 2 with an integral over the limiting eigenvalue distribution of the Wishart matrix.
Results 1 (True Rank Dependence)
[Plot: generalization coefficient vs. true rank $H^*$ for ML, Bayes, SB(MIP), and SB(MOP); N = 30, M = 50.]
SB provides good generalization.
Note: this does NOT mean that SB dominates Bayes. A discussion of domination requires consideration of a delicate situation (see the paper).
Results 2 (Redundant Rank Dependence)
[Plot: generalization coefficient vs. model rank H for ML, Bayes, SB(MOP), and SB(MIP); N = 30, M = 50.]
The SB generalization error depends on the redundant rank H similarly to that of ML, i.e., SB also has an ML-like property.
Features of SB
- Provides good generalization.
- In LNNs, asymptotically equivalent to PJS estimation.
- Requires smaller computational costs:
  - the marginalized space is reduced;
  - in some models, marginalization can be done analytically.
- Related to the variational Bayes (VB) approach.
Variational Bayes (VB) Solution [Nakajima & Watanabe 05]
- VB results in the same solution as MIP.
- VB automatically selects the larger dimension to marginalize.
[Figure: Bayes posterior vs. VB posterior; the VB posterior is similar to the SB posterior.]
Conclusions
- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
- Even asymptotically, SB for redundant components converges not to the ML solution but to a smaller value, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
- SB has both Bayes-like and ML-like properties, i.e., shrinkage, and acceleration of overfitting by basis selection.
Future Work
- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.
Thank you!