Dynamics of AdaBoost

Transcript and Presenter's Notes
1
Dynamics of AdaBoost
  • Cynthia Rudin,
  • Ingrid Daubechies and
  • Robert Schapire


2
Say you have a database of news articles,
where articles are labeled +1 if the category
is entertainment, and -1 otherwise.
Your goal is: given a new article,
find its label.
This is not easy: datasets are noisy and
high-dimensional.
3
Examples of Classification Tasks
  • Optical Character Recognition (OCR) (post office,
    banks), object recognition in images.
  • Webpage classification (search engines), email
    filtering, document retrieval
  • Bioinformatics (analysis of gene array data for
    tumor detection, protein classification, etc.)
  • Speech recognition, automatic .mp3 sorting

Huge number of applications, but all involve
high-dimensional data.
4
(No Transcript)
5
How do we construct a classifier?
  • Divide the space X into two sections, based on
    the sign of a function f : X → R.
  • The decision boundary is the zero-level set of f,
    i.e., the set where f(x) = 0.

[Figure: decision boundary f(x) = 0 separating the
+ region from the - region.]

Classifiers divide the space into two pieces for
binary classification. Multiclass classification
can be reduced to binary.
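As a minimal illustrative sketch (not from the slides; the linear score f below is an arbitrary example), classifying by the sign of a real-valued score function:

```python
import numpy as np

def classify(f, x):
    """Label x as +1 or -1 according to the sign of the score f(x).
    Points with f(x) = 0 lie on the decision boundary; we send them to +1."""
    return 1 if f(x) >= 0 else -1

# Purely illustrative score function on 2-D points.
f = lambda x: 0.7 * x[0] - 0.2 * x[1] - 0.1
print(classify(f, np.array([1.0, 0.5])))   # +1: f = 0.7 - 0.1 - 0.1 = 0.5
print(classify(f, np.array([-1.0, 0.5])))  # -1: f = -0.7 - 0.1 - 0.1 = -0.9
```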
6
Overview of Talk
  • The statistical learning problem of
    classification (done)
  • Introduction to boosting
  • Does AdaBoost converge to a maximum margin
    solution?
  • Reduce AdaBoost to a dynamical system to
    understand its convergence!

7
Say we have a weak learning algorithm
  • A weak learning algorithm produces weak
    classifiers.
  • (Think of a weak classifier as a rule of
    thumb)

Examples of weak classifiers for the entertainment
application:
Wouldn't it be nice to combine the weak
classifiers?
8
Boosting algorithms combine weak classifiers in a
meaningful way (Schapire 89).
Example


So if the article contains the term "movie" and
the word "drama", but not the word "actor",
the value of f is sign(0.4 - 0.3 + 0.3) =
sign(0.4) = +1, so we label it +1.
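A minimal sketch of this weighted vote in code. The 0.4/0.3/0.3 coefficients come from the example above; the assignment of 0.4 to "movie" and the helper names are my own illustrative assumptions:

```python
import numpy as np

def term_vote(term):
    """Weak classifier: vote +1 if the term appears in the article, else -1."""
    return lambda article: 1 if term in article else -1

weak_classifiers = [term_vote("movie"), term_vote("actor"), term_vote("drama")]
coeffs = np.array([0.4, 0.3, 0.3])

def combined_classifier(article):
    votes = np.array([h(article) for h in weak_classifiers])
    return int(np.sign(coeffs @ votes))   # sign of the weighted vote

# Contains "movie" and "drama" but not "actor": sign(0.4 - 0.3 + 0.3) = +1
print(combined_classifier({"movie", "drama"}))
```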
9
Boosting algorithms combine weak classifiers in a
meaningful way (Schapire 89).
Example
A boosting algorithm takes as input:
  - the weak learning algorithm, which produces the
    weak classifiers
  - a large training database
and outputs:
  - the coefficients of the weak classifiers that
    make up the combined classifier
10
AdaBoost (Freund and Schapire 96)
  • Start with a uniform distribution
  • (weights) over training examples.
  • (The weights tell the weak learning
  • algorithm which examples are important.)
  • Request a weak classifier from the weak learning
    algorithm, h_{j_t} : X → {-1, +1}.
  • Increase the weights on the training examples
    that were misclassified.
  • (Repeat)

At the end, make (carefully!) a linear
combination of the weak classifiers obtained at
all iterations.
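A hedged sketch of this loop in code. The edge-based step size and the exponential reweighting below are the standard AdaBoost formulas; the slide itself only describes the loop in words, and the `weak_learner` interface is an assumption:

```python
import numpy as np

def adaboost(X, y, weak_learner, n_rounds):
    """Minimal AdaBoost sketch. weak_learner(X, y, d) is assumed to return a
    function h with h(X) in {-1, +1}^m, trained to do well under weights d."""
    y = np.asarray(y)
    m = len(y)
    d = np.full(m, 1.0 / m)              # uniform distribution over training examples
    hs, alphas = [], []
    for _ in range(n_rounds):
        h = weak_learner(X, y, d)        # request a weak classifier (a "rule of thumb")
        pred = h(X)
        r = np.sum(d * y * pred)         # the edge of this weak classifier
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        d = d * np.exp(-alpha * y * pred)  # increase weights on misclassified examples
        d /= d.sum()
        hs.append(h)
        alphas.append(alpha)
    # The final classifier is a (carefully weighted) linear combination.
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hs)))
```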
11
AdaBoost
Define M, the matrix of weak classifiers and data.
Rows i = 1, ..., m index the training examples;
columns j = 1, ..., n enumerate every possible weak
classifier the weak learning algorithm can produce
(e.g., movie, actor, drama). The entry M_ij records
whether weak classifier h_j classifies example i
correctly (+1) or incorrectly (-1).
The matrix M has too many columns to actually be
enumerated. M acts as the only input to AdaBoost.
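A sketch of how such a matrix could be built, assuming the standard convention M_ij = y_i h_j(x_i) (so the entry is +1 exactly when h_j classifies example i correctly); in practice the columns are never enumerated like this:

```python
import numpy as np

def build_M(X, y, weak_classifiers):
    """M[i, j] = y[i] * h_j(x_i): +1 if weak classifier j gets example i right,
    -1 otherwise. Only feasible for a small, explicit set of weak classifiers."""
    m, n = len(y), len(weak_classifiers)
    M = np.empty((m, n), dtype=int)
    for j, h in enumerate(weak_classifiers):
        M[:, j] = np.asarray(y) * np.array([h(x) for x in X])
    return M
```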
12
AdaBoost
Define d_t: the distribution (weights) over the
training examples at time t.
13
AdaBoost
Define λ_t: the coefficients of the weak classifiers
for the linear combination.
14
AdaBoost
[Diagram of the AdaBoost loop, annotating: M, the
matrix of weak classifiers and training examples;
d_t, the weights on the training examples; λ, the
coefficients on the weak classifiers that form the
final combined classifier.]
15
AdaBoost
[Same diagram, now also annotating r_t, the edge,
alongside M, the weights d_t on the training
examples, and the coefficients λ on the weak
classifiers that form the final combined
classifier.]
16
Boosting and Margins
  • We want the boosted classifier (defined via λ) to
    generalize well, i.e., we want it to perform well
    on data that is not in the training set.
  • The margin theory: the margin of a boosted
    classifier indicates whether it will generalize
    well. (Schapire, Freund, Bartlett, and Lee 98)
  • Large margin classifiers work well in practice,
    but there's more to this story!
  • (Story of AdaBoost and Margins coming up!)

Think of the margin as the confidence of a
prediction.
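A sketch of the boosting margin in this notation, assuming the usual l1 normalization of the coefficients (so example i has margin (Mλ)_i / ||λ||_1, and the classifier's margin is the minimum over training examples):

```python
import numpy as np

def example_margins(M, lam):
    """Normalized margin of every training example: (M @ lam) / ||lam||_1."""
    return (M @ lam) / np.sum(np.abs(lam))

def classifier_margin(M, lam):
    """The margin of the combined classifier: the worst-case example margin."""
    return example_margins(M, lam).min()
```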
17
The Story of AdaBoost and Margins
  • AdaBoost often tends not to overfit. (Breiman
    96, Cortes and Drucker 97, etc.)
  • As a result, the margin theory (Schapire,
    Freund, Bartlett and Lee 98) developed, which is
    based on loose generalization bounds.
  • Note: the margin for boosting is not the same as
    the margin for SVM.

18
The Story of AdaBoost and Margins
  • AdaBoost often tends not to overfit. (Breiman
    96, Cortes and Drucker 97, etc.)
  • As a result, the margin theory (Schapire,
    Freund, Bartlett and Lee 98) developed, which is
    based on loose generalization bounds.
  • Note: the margin for boosting is not the same as
    the margin for SVM.

[Figure: illustration of the margin.]
19
The Story of AdaBoost and Margins
  • AdaBoost often tends not to overfit. (Breiman
    96, Cortes and Drucker 97, etc.)
  • As a result, the margin theory (Schapire,
    Freund, Bartlett and Lee 98) developed, which is
    based on loose generalization bounds.
  • Note: the margin for boosting is not the same as
    the margin for SVM.
  • Remember, AdaBoost (Freund and Schapire 97) was
    invented before the margin theory.

The question remained (until recently): Does
AdaBoost maximize the margin?
20
The question remained (until recently): Does
AdaBoost maximize the margin?
  • Empirical results on the convergence of AdaBoost:
  • AdaBoost seemed to maximize the margin in the
    limit (Grove and Schuurmans 98, and others).

Seems very much like "yes".
21
The question remained (until recently): Does
AdaBoost maximize the margin?
  • Theoretical results on the convergence of AdaBoost:
  • 1) AdaBoost generates a margin that is at least ½ρ,
    where ρ is the maximum margin. (Schapire, Freund,
    Bartlett, and Lee 98)
  • seems like "yes"

[Figure: AdaBoost's margin is at least ρ/2
(Schapire et al. 98), plotted against the true
margin ρ.]
22
The question remained (until recently): Does
AdaBoost maximize the margin?
  • Theoretical results on the convergence of AdaBoost:
  • 2) AdaBoost generates a margin that is at least
    Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02)
  • even closer to "yes"

[Figure: the guaranteed margin Υ(ρ) (Rätsch and
Warmuth 02) lies between ρ/2 (Schapire et al. 98)
and the true margin ρ.]
23
The question remained (until recently): Does
AdaBoost maximize the margin?
2) AdaBoost generates a margin that is at least
Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02)
  • Two cases of interest
  • optimal case
  • the weak learning algorithm chooses the best
    weak classifier at each iteration.
  • e.g., BoosTexter
  • non-optimal case
  • the weak learning algorithm is only required to
    choose a sufficiently good weak classifier at
    each iteration, not necessarily the best one.
  • e.g., weak learning algorithm is a decision tree
    or neural network

24
The question remained (until recently): Does
AdaBoost maximize the margin?
2) AdaBoost generates a margin that is at least
Υ(ρ) ≥ ½ρ. (Rätsch and Warmuth 02)
This bound was conjectured to be tight for the
non-optimal case (based on numerical evidence).
(Rätsch and Warmuth 02)
Perhaps "yes" for the optimal case, but "no" for
the non-optimal case.
25
The question remained (until recently): Does
AdaBoost maximize the margin?
The answer is...
Theorem (R, Daubechies, Schapire 04): AdaBoost
may converge to a margin that is significantly
below the maximum.
The answer is no!
Theorem (R, Daubechies, Schapire 04): The bound
of (Rätsch and Warmuth 02) is tight, i.e.,
non-optimal AdaBoost will converge to a margin
of Υ(ρ) whenever lim_t r_t = ρ. (Note: this is a
specific case of a more general theorem.)
26
About the proof
  • AdaBoost is difficult to analyze because the
    margin does not increase at every iteration, so
    the usual tricks don't work!
  • We use a dynamical systems approach to study
    this problem.
  • Reduce AdaBoost to a dynamical system.
  • Analyze the dynamical system in simple cases:
    remarkably, we find stable cycles!
  • Convergence properties can be completely
    understood in these cases.

27
The key to answering this open question:
a set of examples where AdaBoost's convergence
properties can be completely understood.
28
Analyzing AdaBoost using Dynamical Systems
  • Reduced Dynamics

Compare to AdaBoost: an iterated map for directly
updating d_t. The reduction uses the fact that M is
binary.
The existence of this map enables the study of
low-dimensional cases.
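A hedged sketch of such an iterated map (optimal case), derived from the standard AdaBoost update α_t = ½ ln((1 + r_t)/(1 - r_t)) together with the fact that M only contains ±1 entries; treat it as my reconstruction rather than the slide's exact formula:

```python
import numpy as np

def reduced_map(d, M):
    """One step of the iterated map on the example weights d.
    Assumes M[i, j] in {-1, +1}; folding the exponential reweighting and its
    normalization together turns the update into a simple rational map on d."""
    edges = d @ M                   # edge of every weak classifier under d
    j = int(np.argmax(edges))       # optimal case: choose the best weak classifier
    r = edges[j]                    # the edge r_t
    d_next = np.where(M[:, j] == 1, d / (1 + r), d / (1 - r))
    return d_next, j, r
```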
29
Smallest Non-Trivial Case
30
[Plot: weight iterates from t = 1 through t = 50.]
31
Smallest Non-Trivial Case
To solve, simply assume a 3-cycle exists.
Convergence to the 3-cycle is really strong.
38
Two possible stable cycles!
Maximum margin solution is attained!
[Plot: the two stable cycles of the weight iterates,
t = 1 through t = 50.]
To solve, simply assume a 3-cycle exists. AdaBoost
achieves the maximum margin here, so the conjecture
is true in at least one case. The edge, r_t, is the
golden ratio minus 1.
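A hedged numerical check of the golden-ratio claim, iterating the map sketched earlier on what I take to be the smallest non-trivial case: three training examples and three weak classifiers, each misclassifying exactly one example (M is +1 everywhere except -1 on the diagonal):

```python
import numpy as np

M = np.ones((3, 3), dtype=int) - 2 * np.eye(3, dtype=int)  # -1 on the diagonal
d = np.array([0.34, 0.33, 0.33])   # slightly off uniform, to break the initial tie

for _ in range(60):
    edges = d @ M
    j = int(np.argmax(edges))
    r = edges[j]
    d = np.where(M[:, j] == 1, d / (1 + r), d / (1 - r))

print(round(r, 6))                        # ~0.618034: the edge in the 3-cycle
print(round((np.sqrt(5) - 1) / 2, 6))     # golden ratio minus 1
```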
39
Generalization of the smallest non-trivial case
  • Case of m weak classifiers, each of which
    misclassifies one point.
  • Existence of at least (m-1)! stable cycles, each
    of which yields a maximum margin solution.

Can't solve for the cycles exactly, but we can prove
our equation has a unique solution for each cycle.
40
Generalization of the smallest non-trivial case
  • Stable manifolds of 3-cycles.

41
Empirically Observed Cycles
42
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 300.]
43
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 400.]
44
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 400.]
45
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 300.]
46
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 5500
(only every 20th iterate plotted).]
47
Empirically Observed Cycles
[Plot: weight iterates from t = 1 through t = 400.]
48
  • If AdaBoost cycles, we can calculate the margin
    it will asymptotically converge to in terms of
    the edge values (a sketch of this calculation
    follows below).
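A hedged sketch of that calculation (my own derivation, not taken from the slides): if the weights repeat after a cycle of length T, then multiplying the per-round updates d_{t+1,i} = d_{t,i} e^{-α_t M_{i j_t}} / sqrt(1 - r_t²) around the cycle forces Σ_t α_t M_{i j_t} = -½ Σ_t ln(1 - r_t²) for every cycling example, which pins down the asymptotic margin in terms of the edge values:

```python
import numpy as np

def cycle_margin(edges):
    """Asymptotic margin implied by a cycle with edge values r_1, ..., r_T,
    under the assumptions stated in the lead-in (illustrative only)."""
    r = np.asarray(edges, dtype=float)
    alphas = 0.5 * np.log((1 + r) / (1 - r))          # AdaBoost step sizes
    return -0.5 * np.sum(np.log(1 - r ** 2)) / np.sum(alphas)

# The golden-ratio 3-cycle from the smallest non-trivial case:
print(cycle_margin([0.618034] * 3))   # ~1/3, the maximum margin for that case
```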

49
Does AdaBoost produce maximum margin classifiers?
  • AdaBoost does not always produce a maximum
    margin classifier!
  • We constructed an 8x8 matrix M where AdaBoost
    provably converges to a non-maximum margin
    solution.
  • Convergence is to a manifold of strongly
    attracting stable 3-cycles.
  • The margin produced by AdaBoost is 1/3,
    but the maximum margin is 3/8!

50
[Figure: Approximate Coordinate Ascent Boosting
compared with AdaBoost.]
51
Recap of main result
  • AdaBoost does not always produce a maximum margin
    classifier! (Contradicts what everyone thought!)
  • A new algorithm, Approximate Coordinate Ascent
    Boosting, always provably does.
  • (And it has a fast convergence rate!)

52
Summary for Dynamics of AdaBoost
  • Analyzed the AdaBoost algorithm using an unusual
    technique, namely dynamical systems.
  • Found remarkable stable cycles.
  • Answered the question of whether AdaBoost
    converges to a maximum margin solution.
  • Key is a set of examples in which AdaBoost's
    convergence could be completely understood.

53
In the course of experimenting with many
different cycling behaviors, we found the
following examples, exhibiting exquisite
sensitivity to initial conditions:
54
Sensitivity to initial conditions
63
(No Transcript)
64
(No Transcript)
65
Non-Robustness Theorem (R, Daubechies, Schapire,
04)
  • In the non-optimal case, we just choose a
    sufficiently good weak classifier, not
    necessarily the best one.
  • Non-optimal AdaBoost does not necessarily
    converge to a maximum margin solution, even if
    optimal AdaBoost does!

Non-optimal AdaBoost can be forced to alternate
between these 3 weak classifiers. Margin attained:
1/3.
Optimal AdaBoost alternates between these 4 weak
classifiers. Margin attained: 1/2.
(Conjecture of Rätsch and Warmuth.)
66
Theorem (R, Schapire, Daubechies 04)
The change in G is a monotonic function of the edge,
where G is the smooth margin function.
  • AdaBoost makes progress according to the smooth
    margin iff the edge is sufficiently large.
  • Something stronger: the value of the smooth
    margin function increases as the edge value
    increases.
  • This theorem is a key tool for proving other
    results.
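A sketch of the smooth margin function, assuming the definition I associate with this line of work, G(λ) = -ln(Σ_i e^{-(Mλ)_i}) / ||λ||_1 (the slide does not spell it out, so treat the formula as an assumption). With this definition, G never exceeds the true minimum margin and approaches it as ||λ||_1 grows:

```python
import numpy as np

def smooth_margin(M, lam):
    """Smooth margin G(lam) = -ln(sum_i exp(-(M @ lam)_i)) / ||lam||_1
    (assumed definition; see lead-in). Computed with a stable log-sum-exp."""
    s = -(M @ lam)
    lse = s.max() + np.log(np.sum(np.exp(s - s.max())))
    return -lse / np.sum(np.abs(lam))
```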

67
(No Transcript)
68
Results on Cyclic AdaBoost
Theorem (R, Schapire, Daubechies 04)
  • When AdaBoost cycles,
  • either the value of the smooth margin decreases
    an infinite number of times,
  • (that's the opposite of our new algorithms!)
  • or
  • all the edge values are equal

Theorem
When all edge values are equal, all support
vectors must be misclassified by the same number
of weak classifiers.
69
(No Transcript)
70
Other Results: AdaBoost cycles among support
vectors
  • All data points whose weights are non-zero in a
    cycle asymptotically achieve the same margin!
    That is, if AdaBoost cycles, it cycles among
    support vectors.

71
Theorem (R, Schapire, Daubechies 04): The bound
of (Rätsch and Warmuth 02) is tight, i.e.,
AdaBoost may converge to a margin within
[Υ(ρ), Υ(ρ + ε)] in the non-optimal case. Namely, if
the edge stays within [ρ, ρ + ε], AdaBoost will
converge to a margin within [Υ(ρ), Υ(ρ + ε)].
  • We can coerce AdaBoost to converge to any
    margin we'd like, in the non-optimal case!

Theorem (R, Schapire, Daubechies 04): For any
given ε, it is possible to construct a case in
which the edge stays within [ρ, ρ + ε].
72
Future Plans
General goal: to understand the delicate balance
between the margins and the complexity of the
weak learning algorithm.
  • Tools we have developed:
  • analysis of convergence using dynamics,
  • the smooth margin function,
  • new algorithms that maximize the margin
  • sparsity of classifiers, coherence of
    classifiers
  • neurons!

73
Thank you
This research was supported by grants of Ingrid
Daubechies and Robert Schapire, Princeton
University, Program in Applied and Computational
Mathematics and Department of Computer Science.
Thanks to current supervisor Eero Simoncelli,
HHMI at NYU, and to an NSF BIO postdoctoral
fellowship (starting in March).
References: articles on my webpage,
www.cns.nyu.edu/rudin