Transcript and Presenter's Notes

Title: Overfitting, Bias/Variance tradeoff, and Ensemble methods


1
Overfitting, Bias/Variance tradeoff, and Ensemble methods
Pierre Geurts
Stochastic methods (Prof. L. Wehenkel), University of Liège
2
Content of the presentation
  • Bias and variance definitions
  • Parameters that influence bias and variance
  • Decision/regression tree variance
  • Bias and variance reduction techniques

3
Content of the presentation
  • Bias and variance definitions
  • A simple regression problem with no input
  • Generalization to full regression problems
  • A short discussion about classification
  • Parameters that influence bias and variance
  • Decision/regression tree variance
  • Bias and variance reduction techniques

4
Regression problem - no input
  • Goal: predict as well as possible the height of a Belgian male adult
  • More precisely:
  • Choose an error measure, for example the square error.
  • Find an estimation ŷ such that the expectation E_y[(y - ŷ)²] over the whole population of Belgian male adults is minimized.

(Figure: distribution of heights, centered around 180 cm)
5
Regression problem - no input
  • The estimation that minimizes the error can be computed by setting the derivative of E_y[(y - ŷ)²] with respect to ŷ to zero.
  • So, the estimation which minimizes the error is ŷ = E_y[y]. In machine learning, it is called the Bayes model.
  • But in practice, we cannot compute the exact value of E_y[y] (this would imply measuring the height of every Belgian male adult).

6
Learning algorithm
  • As p(y) is unknown, find an estimation ŷ from a sample of individuals, LS = {y1, y2, ..., yN}, drawn from the Belgian male adult population.
  • Examples of learning algorithms:
  • the sample mean ŷ1 = (1/N) Σi yi,
  • an estimator that shrinks the sample mean towards 180 (if we know that the height is close to 180).

7
Good learning algorithm
  • As the LS is randomly drawn, the prediction ŷ will also be a random variable
  • A good learning algorithm should not be good only on one learning sample but on average over all learning samples (of size N) ⇒ we want to minimize E = E_LS[E_y[(y - ŷ)²]]
  • Let us analyse this error in more detail

8
Bias/variance decomposition (1)
9
Bias/variance decomposition (2)
  • E = E_y[(y - E_y[y])²] + E_LS[(E_y[y] - ŷ)²]
  • var_y[y] = E_y[(y - E_y[y])²]: the residual error, i.e. the minimal attainable error
10
Bias/variance decomposition (3)
11
Bias/variance decomposition (4)
  • E = var_y[y] + (E_y[y] - E_LS[ŷ])² + E_LS[(ŷ - E_LS[ŷ])²]
  • E_LS[ŷ]: the average model (over all LS)
  • bias² = (E_y[y] - E_LS[ŷ])²: the error between the Bayes model and the average model

12
Bias/variance decomposition (5)
  • E = var_y[y] + bias² + E_LS[(ŷ - E_LS[ŷ])²]
  • var_LS[ŷ] = E_LS[(ŷ - E_LS[ŷ])²]: the estimation variance, a consequence of over-fitting

13
Bias/variance decomposition (6)
  • E = var_y[y] + bias² + var_LS[ŷ]

14
Our simple example
  • From statistics, ŷ1 (the sample mean) is the best estimate with zero bias
  • So, the first one may not be the best estimator because of variance: there is a bias/variance tradeoff with respect to λ (a small simulation follows)

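This tradeoff can be checked numerically. Below is a minimal simulation sketch (not from the slides), assuming the second estimator shrinks the sample mean towards 180 with a parameter λ, and assuming made-up population values (true mean 178 cm, standard deviation 7 cm):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, N, n_ls = 178.0, 7.0, 10, 10000  # hypothetical population parameters

# Draw many learning samples of size N and apply both estimators to each of them.
samples = rng.normal(mu_true, sigma, size=(n_ls, N))
y1 = samples.mean(axis=1)                 # sample mean: unbiased, but has variance
lam = 0.5
y2 = (1 - lam) * y1 + lam * 180.0         # assumed shrinkage towards the 180 cm guess

for name, est in [("sample mean", y1), ("shrinkage (lambda=0.5)", y2)]:
    bias2 = (est.mean() - mu_true) ** 2   # squared distance between average estimate and truth
    var = est.var()                       # variability of the estimate across learning samples
    print(f"{name:25s} bias^2={bias2:6.3f}  variance={var:6.3f}  sum={bias2 + var:6.3f}")
```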
15
Bayesian approach (1)
  • Hypotheses:
  • The average height is close to 180 cm
  • The height of one individual is Gaussian around the mean
  • What is the most probable value of the mean after having seen the learning sample?

16
Bayesian approach (2)
  • P(µ | LS) ∝ P(LS | µ) P(µ)  (Bayes' theorem; P(LS) is constant; µ denotes the mean height)
  • P(LS | µ) = Πi P(yi | µ)  (independence of the learning cases)
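A minimal numerical sketch of the estimator this derivation leads to, assuming a Gaussian prior on the mean centred at 180 (the slide does not give the exact parametrization; the variance values below are made-up):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(178.0, 7.0, size=10)   # hypothetical small learning sample

sigma2 = 7.0 ** 2                     # assumed within-population variance
sigma0_2 = 5.0 ** 2                   # assumed prior variance around the 180 cm guess

# Conjugate-Gaussian posterior mode: a precision-weighted average of the data and the prior.
mu_map = (y.sum() / sigma2 + 180.0 / sigma0_2) / (len(y) / sigma2 + 1.0 / sigma0_2)
print("sample mean:", y.mean(), "  MAP estimate:", mu_map)
```

As N grows, the weight of the prior vanishes: the prior reduces variance for small samples at the cost of some bias.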
17
Regression problem - full (1)
  • Actually, we want to find a function ŷ(x) of several inputs ⇒ average over the whole input space
  • The error becomes E_{x,y}[(y - ŷ(x))²]
  • Over all learning sets: E = E_LS[E_{x,y}[(y - ŷ(x))²]]

18
Regression problem - full (2)
  • E_LS[E_{y|x}[(y - ŷ(x))²]] = Noise(x) + Bias²(x) + Variance(x)
  • Noise(x) = E_{y|x}[(y - h_B(x))²]
  • Quantifies how much y varies from h_B(x) = E_{y|x}[y], the Bayes model.
  • Bias²(x) = (h_B(x) - E_LS[ŷ(x)])²
  • Measures the error between the Bayes model and the average model.
  • Variance(x) = E_LS[(ŷ(x) - E_LS[ŷ(x)])²]
  • Quantifies how much ŷ(x) varies from one learning sample to another.

19
Illustration (1)
  • Problem definition:
  • One input x, uniform random variable in [0,1]
  • y = h(x) + ε where ε ~ N(0,1)

(Figure: a sample of (x, y) points and the Bayes model h(x) = E_{y|x}[y])
20
Illustration (2)
  • Low variance, high bias method ⇒ underfitting

21
Illustration (3)
  • Low bias, high variance method ⇒ overfitting

22
Illustration (4)
  • No noise doesn't imply no variance (but less variance); a small simulation sketch follows

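A minimal simulation sketch of these illustrations (the slides show plots only). It assumes, for concreteness, h(x) = 10·sin(2πx), which the slides do not give explicitly, and uses k-nearest neighbours with k = N (underfitting) and k = 1 (overfitting) as the low- and high-complexity methods:

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: 10 * np.sin(2 * np.pi * x)   # hypothetical Bayes model; the slides do not specify h
N, n_ls = 50, 300
x_test = np.linspace(0, 1, 200)

def knn_predict(x_tr, y_tr, x_te, k):
    # Plain 1-D k-nearest neighbours: average the outputs of the k closest training points.
    idx = np.argsort(np.abs(x_tr[None, :] - x_te[:, None]), axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

def bias_variance(k):
    preds = np.empty((n_ls, x_test.size))
    for i in range(n_ls):
        x = rng.uniform(0, 1, N)
        y = h(x) + rng.normal(0, 1, N)       # y = h(x) + eps, eps ~ N(0, 1)
        preds[i] = knn_predict(x, y, x_test, k)
    bias2 = np.mean((preds.mean(axis=0) - h(x_test)) ** 2)   # average model vs Bayes model
    var = np.mean(preds.var(axis=0))                         # spread across learning samples
    return bias2, var

for k in (N, 1):   # k = N: high bias, low variance; k = 1: low bias, high variance
    b2, v = bias_variance(k)
    print(f"k = {k:2d}: bias^2 = {b2:5.2f}, variance = {v:5.2f}")
```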
23
Classification problems (1)
  • The mean misclassification error is E_{x,y}[I(y ≠ ŷ(x))]
  • The best possible model is the Bayes model h_B(x) = arg max_c P(y = c | x)
  • The average model predicts, at each x, the class most frequently predicted over all learning samples
  • Unfortunately, there is no such decomposition of the mean misclassification error into bias and variance terms.
  • Nevertheless, we observe the same phenomena

24
Classification problems (2)
(Figure: classifications obtained from different learning samples LS1, LS2, ...)
25
Classification problems (3)
  • Bias: systematic error component (independent of the learning sample)
  • Variance: error due to the variability of the model with respect to the learning sample randomness
  • There are errors due to bias and errors due to variance

(Figure: decision boundaries of a tree with one test node vs. a full decision tree)
26
Content of the presentation
  • Bias and variance definitions
  • Parameters that influence bias and variance
  • Complexity of the model
  • Complexity of the Bayes model
  • Noise
  • Learning sample size
  • Learning algorithm
  • Decision/regression tree variance
  • Bias and variance reduction techniques

27
Illustrative problem
  • Artificial problem with 10 inputs, all uniform random variables in [0,1]
  • The true function depends only on 5 inputs:
  • y(x) = 10·sin(π·x1·x2) + 20·(x3 - 0.5)² + 10·x4 + 5·x5 + ε,
  • where ε is a N(0,1) random variable
  • Experiments:
  • E_LS ⇒ average over 50 learning sets of size 500
  • E_{x,y} ⇒ average over 2000 cases
  • ⇒ estimate variance and bias (+ residual error); a sketch of this protocol follows

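A sketch of this estimation protocol in Python (scikit-learn is assumed; a full regression tree stands in for whichever learning algorithm is being analysed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def friedman(n):
    X = rng.uniform(0, 1, size=(n, 10))          # 10 inputs, only the first 5 are relevant
    f = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    return X, f

X_test, f_test = friedman(2000)                  # 2000 test cases; f_test is the Bayes model value
n_ls, preds = 50, []
for _ in range(n_ls):                            # 50 learning sets of size 500
    X, f = friedman(500)
    y = f + rng.normal(0, 1, 500)
    model = DecisionTreeRegressor().fit(X, y)    # swap in any estimator to compare
    preds.append(model.predict(X_test))

preds = np.array(preds)
bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)   # Bias^2(x) averaged over x
variance = np.mean(preds.var(axis=0))                 # Variance(x) averaged over x
print(f"bias^2 ≈ {bias2:.2f}, variance ≈ {variance:.2f}, noise = 1.0")
```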
28
Complexity of the model
(Figure: E = bias² + variance, bias², and variance as functions of model complexity)
  • Usually, the bias is a decreasing function of the
    complexity, while variance is an increasing
    function of the complexity.

29
Complexity of the model neural networks
  • Error, bias, and variance w.r.t. the number of
    neurons in the hidden layer

30
Complexity of the model regression trees
  • Error, bias, and variance w.r.t. the number of
    test nodes

31
Complexity of the model k-NN
  • Error, bias, and variance w.r.t. k, the number of
    neighbors

32
Learning problem
  • Complexity of the Bayes model
  • At fixed model complexity, bias increases with
    the complexity of the Bayes model. However, the
    effect on variance is difficult to predict.
  • Noise
  • Variance increases with noise and bias is mainly
    unaffected.
  • E.g. with (full) regression trees

33
Learning sample size (1)
  • At fixed model complexity, bias remains constant
    and variance decreases with the learning sample
    size. E.g. linear regression

34
Learning sample size (2)
  • When the complexity of the model is dependent on
    the learning sample size, both bias and variance
    decrease with the learning sample size. E.g.
    regression trees

35
Learning algorithms - linear regression
Method Err² Bias²+Noise Variance
Linear regr. 7.0 6.8 0.2
k-NN (k1) 15.4 5 10.4
k-NN (k10) 8.5 7.2 1.3
MLP (10) 2.0 1.2 0.8
MLP (10 10) 4.6 1.4 3.2
Regr. Tree 10.2 3.5 6.7
  • Very few parameters ⇒ small variance
  • The goal function is not linear ⇒ high bias

36
Learning algorithms - k-NN
Method Err² Bias²+Noise Variance
Linear regr. 7.0 6.8 0.2
k-NN (k1) 15.4 5 10.4
k-NN (k10) 8.5 7.2 1.3
MLP (10) 2.0 1.2 0.8
MLP (10 10) 4.6 1.4 3.2
Regr. Tree 10.2 3.5 6.7
  • Small k ⇒ high variance and moderate bias
  • High k ⇒ smaller variance but higher bias

37
Learning algorithms - MLP
Method Err² Bias²+Noise Variance
Linear regr. 7.0 6.8 0.2
k-NN (k1) 15.4 5 10.4
k-NN (k10) 8.5 7.2 1.3
MLP (10) 2.0 1.2 0.8
MLP (10 10) 4.6 1.4 3.2
Regr. Tree 10.2 3.5 6.7
  • Small bias
  • Variance increases with the model complexity

38
Learning algorithms - regression trees
Method Err² Bias²+Noise Variance
Linear regr. 7.0 6.8 0.2
k-NN (k1) 15.4 5 10.4
k-NN (k10) 8.5 7.2 1.3
MLP (10) 2.0 1.2 0.8
MLP (10 10) 4.6 1.4 3.2
Regr. Tree 10.2 3.5 6.7
  • Small bias: a (complex enough) tree can approximate any non-linear function
  • High variance

39
Content of the presentation
  • Bias and variance definition
  • Parameters that influence bias and variance
  • Decision/regression tree variance
  • Bias and variance reduction techniques

40
Decision/regression tree variance (1)
  • DT/RT are among the machine learning methods that
    present the highest variance. Even a small change
    of the learning sample can result in a very
    different tree.
  • Even small trees have a high variance

Method E Bias Variance
k-NN (k10) 8.5 7.2 1.3
MLP (10 10) 4.6 1.4 3.2
RT, no test 25.5 25.4 0.1
RT, 1 test 19.0 17.7 1.3
RT, 3 tests 14.8 11.1 3.7
RT, full (250 tests) 10.2 3.5 6.7
41
Decision/regression tree variance (2)
  • Possible sources of variance
  • Discretization of numerical attributes
  • The selected threshold has a high variance (see
    next slide).
  • Structure choice
  • Sometimes, attribute scores are very close.
  • Estimation at leaf nodes
  • Because of the recursive partitioning, predictions at leaf nodes are based on very small samples of objects.
  • Consequences
  • sub-optimality in terms of accuracy
  • questionable interpretability, since the parameters cannot be trusted

42
Decision/regression tree variance (3)
  • The discretization thresholds chosen in trees are
    very unstable
  • This variance puts the interpretability of the trees into question

(Figure: distribution of the discretization thresholds selected for attribute A1 over different learning samples)
43
Content of the presentation
  • Bias and variance definition
  • Parameters that influence bias and variance
  • Decision/regression tree variance
  • Bias and variance reduction techniques
  • Introduction
  • Dealing with the bias/variance tradeoff of one
    algorithm
  • Ensemble methods

44
Bias and variance reduction techniques
  • In the context of a given method
  • Adapt the learning algorithm to find the best
    trade-off between bias and variance.
  • Not a panacea but the least we can do.
  • Examples: pruning, weight decay.
  • Ensemble methods
  • Change the bias/variance trade-off.
  • Universal but destroys some features of the
    initial method.
  • Examples: bagging, boosting.

45
Variance reduction 1 model (1)
  • General idea: reduce the ability of the learning algorithm to fit the LS
  • Pruning
  • reduces the model complexity explicitly
  • Early stopping
  • reduces the amount of search
  • Regularization
  • reduces the size of the hypothesis space
  • Weight decay with neural networks consists in penalizing high weight values (a code sketch follows)

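As an illustration of weight decay, a minimal sketch using the alpha parameter (an L2 penalty on the weights) of scikit-learn's MLPRegressor; the dataset and values are illustrative assumptions, not the slide's experiment:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Larger alpha = stronger penalty on the weights = less fitting capacity (more bias, less variance).
for alpha in (1e-5, 1e-2, 1.0):
    mlp = MLPRegressor(hidden_layer_sizes=(10,), alpha=alpha, solver="lbfgs",
                       max_iter=5000, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:g}: test MSE = {mean_squared_error(y_te, mlp.predict(X_te)):.2f}")
```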
46
Variance reduction 1 model (2)
(Figure: E = bias² + variance, bias², and variance as functions of the level of fitting; the optimal fitting minimizes E)
  • Selection of the optimal level of fitting:
  • a priori (not optimal)
  • by cross-validation (less efficient): bias² ≈ error on the learning set, E ≈ error on an independent test set

47
Variance reduction 1 model (3)
  • Examples
  • Post-pruning of regression trees
  • Early stopping of MLP by cross-validation

Method E Bias Variance
Full regr. Tree (250) 10.2 3.5 6.7
Pr. regr. Tree (45) 9.1 4.3 4.8
Full learned MLP 4.6 1.4 3.2
Early stopped MLP 3.8 1.5 2.3
  • As expected, variance decreases but bias increases

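A hedged sketch of both examples with scikit-learn: post-pruning chosen by cross-validation over the cost-complexity parameter ccp_alpha, and an MLP with built-in early stopping on a held-out validation fraction (dataset and settings are illustrative, not those of the slide's experiment):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Post-pruning: pick the cost-complexity parameter by cross-validation on the learning sample.
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"ccp_alpha": np.linspace(0.0, 0.5, 26)}, cv=5)
grid.fit(X_tr, y_tr)
print("pruned tree test MSE:", mean_squared_error(y_te, grid.predict(X_te)))

# Early stopping: hold out part of the LS and stop when the validation error stops improving.
mlp = MLPRegressor(hidden_layer_sizes=(10, 10), early_stopping=True,
                   validation_fraction=0.2, max_iter=5000, random_state=0)
mlp.fit(X_tr, y_tr)
print("early-stopped MLP test MSE:", mean_squared_error(y_te, mlp.predict(X_te)))
```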
48
Ensemble methods
  • Combine the predictions of several models built
    with a learning algorithm in order to improve
    with respect to the use of a single model
  • Two important families
  • Averaging techniques
  • Grow several models independently and simply average their predictions
  • Ex: bagging, random forests
  • Decrease mainly variance
  • Boosting type algorithms
  • Grow several models sequentially
  • Ex: Adaboost, MART
  • Decrease mainly bias

49
Bagging (1)
  • E_LS[Err(x)] = E_{y|x}[(y - h_B(x))²] + (h_B(x) - E_LS[ŷ(x)])² + E_LS[(ŷ(x) - E_LS[ŷ(x)])²]
  • Idea: the average model E_LS[ŷ(x)] has the same bias as the original method but zero variance
  • Bagging (Bootstrap AGGregatING):
  • To compute E_LS[ŷ(x)], we should draw an infinite number of LS (of size N)
  • Since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS
  • Bootstrap sampling: sampling with replacement of N objects from LS (N is the size of LS); a from-scratch sketch follows

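A from-scratch sketch of bagging as described above (bootstrap indices drawn with numpy, scikit-learn regression trees as base models, and a make_friedman1 dataset standing in for the slide's problem):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

T, N = 25, len(X_tr)
trees = []
for _ in range(T):
    idx = rng.integers(0, N, N)                  # bootstrap: sample N objects with replacement
    trees.append(DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx]))

y_single = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)
y_bagged = np.mean([t.predict(X_te) for t in trees], axis=0)   # average the T predictions
print("single tree test MSE :", mean_squared_error(y_te, y_single))
print("bagged (T=25) test MSE:", mean_squared_error(y_te, y_bagged))
```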
50
Bagging (2)
(Figure: several bootstrap samples drawn from LS, one model built on each, predictions averaged)
51
Bagging (3)
  • Usually, bagging reduces the variance a lot without increasing the bias too much.
  • Application to regression trees

Method E Bias Variance
3 Test regr. Tree 14.8 11.1 3.7
Bagged (T25) 11.7 10.7 1.0
Full regr. Tree 10.2 3.5 6.7
Bagged (T25) 5.3 3.8 1.5
  • Strong variance reduction without increasing the
    bias (although the model is much more complex
    than a single tree)

52
Bagging (4)
(Figure: bagging illustrated on the one-dimensional problem, predictions y plotted against x)
53
Other averaging techniques
  • Perturb and Combine paradigm
  • Perturb the data or the learning algorithm to
    obtain several models that are good on the
    learning sample.
  • Combine the predictions of these models
  • Usually, these methods decrease the variance
    (because of averaging) but (slightly) increase
    the bias (because of the perturbation)
  • Examples
  • Bagging perturbs the learning sample.
  • Learn several neural networks with random initial weights (a sketch follows the table below)
  • Random forests.

Method E Bias Variance
MLP (10-10) 4.6 1.4 3.2
Average of 10 MLPs 2.0 1.4 0.6
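A sketch of the "several neural networks with random initial weights" idea, assuming scikit-learn's MLPRegressor on an illustrative Friedman dataset (the slide's exact architecture and data are not reproduced):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ten identical networks that differ only in their random initial weights.
nets = [MLPRegressor(hidden_layer_sizes=(10, 10), solver="lbfgs", max_iter=5000,
                     random_state=s).fit(X_tr, y_tr) for s in range(10)]
single = nets[0].predict(X_te)
averaged = np.mean([n.predict(X_te) for n in nets], axis=0)   # average the 10 predictions
print("single MLP test MSE  :", mean_squared_error(y_te, single))
print("averaged MLPs test MSE:", mean_squared_error(y_te, averaged))
```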
54
Random forests (1)
  • Perturb and combine algorithm specifically
    designed for trees
  • Combines bagging and random attribute subset selection:
  • Build the tree from a bootstrap sample
  • Instead of choosing the best split among all attributes, select the best split among a random subset of k attributes
  • (= bagging when k is equal to the number of attributes)
  • There is a bias/variance tradeoff with k: the smaller k, the greater the reduction of variance but also the higher the increase of bias (a code sketch follows)

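A hedged sketch using scikit-learn's RandomForestRegressor, whose max_features parameter plays the role of k (the dataset is again an illustrative stand-in generated with make_friedman1):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features is the size of the random attribute subset tried at each split;
# k = 10 (all attributes) amounts to plain bagging since bootstrap sampling is on by default.
for k in (10, 7, 5, 3):
    rf = RandomForestRegressor(n_estimators=100, max_features=k, random_state=0).fit(X_tr, y_tr)
    print(f"k={k}: test MSE = {mean_squared_error(y_te, rf.predict(X_te)):.2f}")
```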
55
Random forests (2)
  • Application to our illustrative problem
  • Other advantage: it decreases computing times with respect to bagging, since only a subset of all attributes needs to be considered when splitting a node.

Method E Bias Variance
Full regr. Tree 10.2 3.5 6.7
Bagging (k10) 5.3 3.8 1.5
Random Forests (k7) 4.8 3.8 1.0
Random Forests (k5) 4.9 4.0 0.9
Random Forests (k3) 5.6 4.7 0.8
56
Boosting methods (1)
  • The motivation of boosting is to combine the outputs of many "weak" models to produce a powerful ensemble of models.
  • Weak model: a model that has a high bias (strictly, in classification, a model only slightly better than random guessing)
  • Differences with previous ensemble methods
  • Models are built sequentially on modified
    versions of the data
  • The predictions of the models are combined
    through a weighted sum/vote

57
Boosting methods (2)
(Figure: models built sequentially on reweighted versions of LS and combined through a weighted vote)
58
Adaboost (1)
  • Assume that the learning algorithm accepts
    weighted objects
  • This is the case of many learning algorithms
  • With trees, simply take into account the weights
    when counting objects
  • In neural networks, minimize the weighted squared
    error
  • At each step, adaboost increases the weights of
    cases from the learning sample misclassified by
    the last model
  • Thus, the algorithm focuses on the difficult
    cases from the learning sample
  • In the weighted majority vote, adaboost gives
    higher influence to the more accurate models

59
Adaboost (2)
  • Input: a learning algorithm and a learning sample (xi, yi), i = 1, ..., N
  • Initialize the weights wi = 1/N, i = 1, ..., N
  • For t = 1 to T:
  • Build a model ŷt(x) with the learning algorithm using the weights wi
  • Compute the weighted error errt = Σi wi I(yi ≠ ŷt(xi)) / Σi wi
  • Compute βt = log((1 - errt) / errt)
  • Change the weights: wi ← wi exp(βt I(yi ≠ ŷt(xi)))
  • Final prediction: weighted majority vote of the ŷt, with weight βt for model t
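A from-scratch sketch of this algorithm (using decision stumps as the weak learner and a toy scikit-learn dataset; both are assumptions, the slide does not fix them):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Toy binary problem; labels in {-1, +1} so the weighted vote is the sign of a weighted sum.
X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1

N, T = len(X), 50
w = np.full(N, 1.0 / N)                      # initialize w_i = 1/N
models, betas = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (stump.predict(X) != y).astype(float)
    err = np.sum(w * miss) / np.sum(w)       # weighted error err_t
    beta = np.log((1 - err) / err)           # beta_t = log((1 - err_t) / err_t)
    w = w * np.exp(beta * miss)              # increase the weights of misclassified cases
    w = w / w.sum()                          # renormalize (does not change the algorithm)
    models.append(stump)
    betas.append(beta)

# Weighted majority vote: sign of sum_t beta_t * yhat_t(x).
agg = sum(b * m.predict(X) for b, m in zip(betas, models))
print("training accuracy:", np.mean(np.sign(agg) == y))
```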
60
MART (multiple additive regression trees)
  • MART is a boosting algorithm for regression
  • Input: a learning sample (xi, yi), i = 1, ..., N
  • Initialize:
  • ŷ0(x) = (1/N) Σi yi ; ri = yi, i = 1, ..., N
  • For t = 1 to T:
  • For i = 1 to N, compute the residuals ri ← ri - ŷt-1(xi)
  • Build a regression tree ŷt(x) from the learning sample (xi, ri), i = 1, ..., N
  • Return the model ŷ(x) = ŷ0(x) + ŷ1(x) + ... + ŷT(x)

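A sketch of this residual-fitting scheme (the tree depth and dataset are assumptions; scikit-learn trees serve as the regression tree learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

T = 50
y0 = y_tr.mean()                             # y_0(x) = (1/N) sum_i y_i
r = y_tr - y0                                # residuals after the constant model
trees = []
for t in range(T):
    tree = DecisionTreeRegressor(max_depth=3).fit(X_tr, r)   # small trees as weak models (assumed size)
    r -= tree.predict(X_tr)                  # r_i <- r_i - y_t(x_i)
    trees.append(tree)

def predict(X_new):
    # y(x) = y_0(x) + y_1(x) + ... + y_T(x)
    return y0 + np.sum([t.predict(X_new) for t in trees], axis=0)

print("test MSE, MART-style sum of trees:", mean_squared_error(y_te, predict(X_te)))
print("test MSE, single full tree       :",
      mean_squared_error(y_te, DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)))
```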
61
Boosting methods
  • Adaboost and MART are only two boosting variants. There are many other boosting type algorithms.
  • Boosting decision/regression trees improves their accuracy, often dramatically. However, boosting is more sensitive to noise than averaging techniques (overfitting).
  • For boosting to work, the models must not be perfect on the learning sample. With trees, there are two possible strategies:
  • Use pruned trees (pre-pruned or post-pruned by cross-validation)
  • Limit the number of tree tests (and split the most impure nodes first)
  • ⇒ there is again a bias/variance tradeoff with respect to the tree size.

62
Experiment with MART
  • On our illustrative problem
  • Boosting reduces the bias but increases the
    variance. However, with respect to full trees, it
    decreases both bias and variance.

Method E Bias Variance
Full regr. Tree 10.2 3.5 6.7
Regr. Tree with 1 test 18.9 17.8 1.1
  MART (T50, 1-test trees) 5.0 3.1 1.9
  Bagging (T50, 1-test trees) 17.9 17.3 0.6
Regr. Tree with 5 tests 11.7 8.8 2.9
  MART (T50, 5-test trees) 6.4 1.7 4.7
  Bagging (T50, 5-test trees) 9.1 8.7 0.4
63
Interpretability and efficiency of ensembles
  • Since we average several models, we lose interpretability and efficiency, which are two of the main advantages of decision/regression trees
  • However,
  • We can still use the ensembles to compute variable importances by averaging over all trees. Actually, this even stabilizes the estimates.
  • Averaging techniques can be parallelized and boosting type algorithms use smaller trees. So, the increase in computing times is not too detrimental.

64
Experiments on Golub's microarray data
  • 72 objects, 7129 numerical attributes (gene expressions), 2 classes (ALL and AML)
  • Leave-one-out error with several variants
  • Variable importance with boosting

Method Error
1 decision tree 22.2 (16/72)
Random forests (k85,T500) 9.7 (7/72)
Extra-trees (sth0.5, T500) 5.5 (4/72)
Adaboost (1 test node, T500) 1.4 (1/72)
65
Conclusion (1)
  • The notions of bias and variance are very useful
    to predict how changing the (learning and
    problem) parameters will affect the accuracy.
    E.g. this explains why very simple methods can
    work much better than more complex ones on very
    difficult tasks
  • Variance reduction is a very important topic:
  • Reducing bias is easy, but keeping variance low is not as easy.
  • Especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, text mining, ...
  • Not all learning algorithms are equal in terms of variance. Trees are among the worst methods by this criterion.

66
Conclusion (2)
  • Ensemble methods are very effective techniques to reduce bias and/or variance. They can turn a not-so-good method into a competitive method in terms of accuracy.
  • Adaboost with trees is considered one of the best "off-the-shelf" classification methods.
  • Interpretability of the model and efficiency of the method are difficult to preserve if we want to reduce variance significantly.
  • There are other ways to tackle the variance/overfitting problem, e.g.:
  • Bayesian approaches (related to averaging techniques)
  • Support vector machines (they maintain a low variance by maximizing the classification margin)

67
References
  • About bias and variance:
  • Neural networks and the bias/variance dilemma, S. Geman et al., Neural Computation 4(1), 1992, 1-58
  • Neural Networks for Pattern Recognition, C.M. Bishop, Oxford University Press, 1995
  • The Elements of Statistical Learning, T. Hastie et al., Springer, 2001
  • Contribution to decision tree induction: bias/variance tradeoff and time series classification, P. Geurts, PhD thesis, 2002
  • About ensemble methods:
  • Bagging predictors, L. Breiman, Machine Learning, 24, 1996
  • A decision-theoretic generalization of on-line learning and an application to boosting, Y. Freund and R. Schapire, Journal of Computer and System Sciences, 1995
  • Random Forests, L. Breiman, Machine Learning, 45, 2001
  • Ensemble methods in machine learning, T. Dietterich, First International Workshop on Multiple Classifier Systems, 2000
  • An introduction to boosting and leveraging, R. Meir and G. Ratsch, Advanced Lectures on Machine Learning, Springer, 2003

68
Software
  • Random forests:
  • http://stat-www.berkeley.edu/users/breiman/rf.html
  • R package randomForest
  • Boosting:
  • See www.boosting.org