Title: Online Learning with a Memory Harness using the Forgetron
Slide 1: Online Learning with a Memory Harness using the Forgetron

Shai Shalev-Shwartz, joint work with Ofer Dekel and Yoram Singer
The Hebrew University, Jerusalem, Israel
Large Scale Kernel Machines workshop, NIPS 2005, Whistler
Slide 2: Overview

- Online learning with kernels
- Goal: a strict limit on the number of support vectors
- The Forgetron algorithm
- Analysis
- Experiments
Slide 3: Kernel-based Perceptron for Online Learning

- On round t the online learner receives an instance x_t, predicts sign(f_t(x_t)), and then observes the correct label y_t
- Current classifier: f_t(x) = Σ_{i ∈ I} y_i K(x_i, x)
- The current active set I holds the indices of the rounds on which a prediction mistake was made; each new mistake adds its round to I (e.g. I = {1, 3} becomes I = {1, 3, 4} after a mistake on round 4)
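The protocol above can be sketched as a short program. This is a minimal illustration, not code from the talk; the Gaussian kernel and the toy stream are illustrative choices (note the Gaussian kernel satisfies K(x, x) = 1, which the later analysis assumes).

```python
import math

def gaussian_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2); note K(x, x) = 1."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kernel_perceptron(stream, kernel):
    """Run the kernel Perceptron online; the active set stores one example per mistake."""
    active = []    # the active set I of the slides, stored as (x_i, y_i) pairs
    mistakes = 0
    for x_t, y_t in stream:
        # f_t(x_t) = sum_{i in I} y_i K(x_i, x_t)
        f_t = sum(y_i * kernel(x_i, x_t) for x_i, y_i in active)
        if y_t * f_t <= 0:          # prediction mistake: sign(f_t(x_t)) != y_t
            active.append((x_t, y_t))
            mistakes += 1
    return active, mistakes
```

Every mistake grows the active set by one, so memory and prediction time grow with the number of mistakes, which is exactly the problem slide 5 raises.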
Slide 5: Learning on a Budget

- |I| = number of mistakes made until round t
- Memory- and time-inefficient: |I| might grow unboundedly
- Goal: construct a kernel-based online algorithm for which
  - |I| ≤ B on each round t,
  - and which still performs well, i.e. comes with a performance guarantee
Slide 6: Mistake Bound for the Perceptron

- (x_1, y_1), ..., (x_T, y_T): a sequence of examples
- A kernel K s.t. K(x_t, x_t) ≤ 1
- g: a fixed competitor classifier in the RKHS
- Define the hinge loss ℓ_t(g) = max(0, 1 − y_t g(x_t))
- Then the number of prediction mistakes M satisfies
  M ≤ ‖g‖² + 2 Σ_t ℓ_t(g)
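The bound follows from the standard telescoping argument; a sketch (all norms are RKHS norms, and the update on a mistaken round is f_{t+1} = f_t + y_t K(x_t, ·)):

```latex
\|f_t - g\|^2 - \|f_{t+1} - g\|^2
  = 2\, y_t \bigl( g(x_t) - f_t(x_t) \bigr) - K(x_t, x_t)
  \ge 2 \bigl( 1 - \ell_t(g) \bigr) - 1
  = 1 - 2\,\ell_t(g)
```

using y_t g(x_t) ≥ 1 − ℓ_t(g), y_t f_t(x_t) ≤ 0 on a mistaken round, and K(x_t, x_t) ≤ 1. Summing over the M mistaken rounds, the left-hand side telescopes to at most ‖f_1 − g‖² = ‖g‖², which rearranges to M ≤ ‖g‖² + 2 Σ_t ℓ_t(g).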
Slide 7: Previous Work

- Crammer, Kandola, Singer (2003)
- Kivinen, Smola, Williamson (2004)
- Weston, Bordes, Bottou (2005)

Previous online budget algorithms do not come with a mistake bound. Is our goal attainable?
Slide 8: Mission Impossible

- Input space: {e_1, ..., e_{B+1}}
- Linear kernel: K(e_i, e_j) = ⟨e_i, e_j⟩ = δ_{i,j} for all i, j
- Budget constraint: |I| ≤ B. Therefore, on every round there exists a j s.t. Σ_{i ∈ I} α_i K(e_i, e_j) = 0, so a budget algorithm might err on every round
- But the competitor g = Σ_i e_i (with all labels +1) never errs!
- The unbudgeted Perceptron makes only B+1 mistakes
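The adversarial construction can be simulated directly. The "drop oldest" eviction rule below is just one illustrative budget policy (not specified on the slide); any policy that keeps |I| ≤ B leaves some basis vector unsupported, so the adversary can force a mistake on every round.

```python
def adversary_round(active, B1):
    """Pick an index j (all labels are +1) on which the budget hypothesis outputs 0.
    Since |active| <= B < B1, some basis vector e_j is missing from the support."""
    supported = {j for j, _ in active}
    return next(j for j in range(B1) if j not in supported)

def run_budget_perceptron(B, rounds):
    """Budget perceptron on the orthonormal inputs e_1..e_{B+1},
    with linear kernel K(e_i, e_j) = delta_ij and 'drop oldest' eviction."""
    active = []   # the support set I, stored as (index, weight) pairs
    mistakes = 0
    for _ in range(rounds):
        j = adversary_round(active, B + 1)
        # f(e_j) = sum of weights placed on index j = 0 by the adversary's choice
        f_j = sum(w for i, w in active if i == j)
        if f_j <= 0:              # true label is +1, so this is always a mistake
            mistakes += 1
            active.append((j, 1.0))
            if len(active) > B:
                active.pop(0)     # enforce the budget |I| <= B
    return mistakes
```

For example, `run_budget_perceptron(5, 100)` errs on all 100 rounds, while g = Σ_i e_i classifies every one of those rounds correctly.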
Slide 9: Redefine the Goal

- We must restrict the competitor g somehow; one way is to restrict its norm ‖g‖
- The counter-example implies that we cannot compete with ‖g‖ = (B+1)^{1/2}
- Main result: the Forgetron algorithm can compete with any classifier g s.t. ‖g‖ ≤ (1/4) ((B+1) / ln(B+1))^{1/2}
Slide 10: The Forgetron

f_t(x) = Σ_{i ∈ I} σ_i y_i K(x_i, x)

On each mistaken round, the Forgetron performs three steps:

- Step (1), Perceptron: I ← I ∪ {t}, with new weight σ_t = 1
- Step (2), Shrinking: σ_i ← φ_t σ_i for all i ∈ I
- Step (3), Remove Oldest: if |I| > B, then r = min I and I ← I \ {r}
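The three steps can be sketched as follows. This is a simplified illustration assuming a mistake has already occurred; the shrinking coefficient `phi_t` is left as a fixed placeholder here, whereas the actual algorithm tunes it on every round (slide 20).

```python
def forgetron_update(active, x_t, y_t, B, phi_t=0.9):
    """One Forgetron update on a mistaken round. `active` holds
    [x_i, y_i, sigma_i] entries; phi_t is a fixed illustrative value."""
    # Step (1), Perceptron: add the new example with weight sigma_t = 1
    active.append([x_t, y_t, 1.0])
    # Step (2), Shrinking: sigma_i <- phi_t * sigma_i for all i in I
    for entry in active:
        entry[2] *= phi_t
    # Step (3), Remove Oldest: if the budget is exceeded, drop r = min I
    if len(active) > B:
        active.pop(0)
    return active

def predict(active, x, kernel):
    """f(x) = sum_{i in I} sigma_i y_i K(x_i, x)"""
    return sum(s * y * kernel(xi, x) for xi, y, s in active)
```

Because every round's shrinking multiplies all existing weights by φ_t, the oldest example always carries the most-shrunk weight, which is what makes its removal cheap.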
Slide 11: Shrinking, a Two-edged Sword

- φ_t small ⇒ σ_r small ⇒ the deviation due to removal is negligible
- φ_t small ⇒ the deviation due to shrinking is large
- The Forgetron formalizes deviation and automatically balances this tradeoff
Slide 12: Quantifying Deviation

- Progress measure: Δ_t = ‖f_t − g‖² − ‖f_{t+1} − g‖²
- Write f′ for the hypothesis after the Perceptron step and f″ for the hypothesis after shrinking. The progress decomposes over the three update steps as Δ_t = Λ_t − Φ_t − Ψ_t, where
  - Λ_t = ‖f_t − g‖² − ‖f′ − g‖²  (after the Perceptron step)
  - Φ_t = ‖f″ − g‖² − ‖f′ − g‖²  (after shrinking)
  - Ψ_t = ‖f_{t+1} − g‖² − ‖f″ − g‖²  (after removal)
- Deviation is measured by negative progress
Slide 13: Quantifying Deviation

- Gain from the Perceptron step: Λ_t
- Damage from shrinking: Φ_t
- Damage from removal: Ψ_t
- The Forgetron sets φ_t so as to keep the accumulated damage small relative to the number of mistakes (the precise mechanism is given on slide 20)
Slide 14: Resulting Mistake Bound

- For any g s.t. ‖g‖ ≤ (1/4) ((B+1) / ln(B+1))^{1/2},
- the number of prediction mistakes M that the Forgetron makes is at most
  M ≤ 2 ‖g‖² + 4 Σ_t ℓ_t(g)
Slide 15: Small Deviation ⇒ Mistake Bound

- Assume the total deviation is low
- The Perceptron step makes progress on each mistaken round: Λ_t ≥ 1 − 2 ℓ_t(g)
(figure: the Perceptron update moves f towards g, decreasing ‖f − g‖²)
Slide 16: Small Deviation ⇒ Mistake Bound

- On one hand: positive progress towards good competitors on every mistaken round
- On the other hand: the total possible progress is at most ‖f_1 − g‖² = ‖g‖²
- Corollary: small deviation ⇒ mistake bound
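To make the corollary concrete under my reading of the slides: write Λ_t for the Perceptron step's progress on a mistaken round and D_t for that round's total deviation. If Σ_t D_t ≤ M/2, as the self-tuned Forgetron guarantees (15/32 + 1/32 = 1/2 of M, slide 20), then:

```latex
\sum_{t} \bigl( \Lambda_t - D_t \bigr) \le \|f_1 - g\|^2 = \|g\|^2,
\qquad
\Lambda_t \ge 1 - 2\,\ell_t(g)
\;\Longrightarrow\;
M - 2 \sum_t \ell_t(g) - \frac{M}{2} \le \|g\|^2
\;\Longrightarrow\;
M \le 2\,\|g\|^2 + 4 \sum_t \ell_t(g)
```

This is the bound of slide 14: twice the Perceptron's constant on each term, which is the price paid for the budget.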
Slide 17: Deviation due to Removal

- Assume that on round t we remove example r with weight σ. Writing f″ for the hypothesis after shrinking, a direct expansion gives
  Ψ_t = σ² K(x_r, x_r) − 2σ y_r f″(x_r) + 2σ y_r g(x_r)
- Remarks:
  - φ small ⇒ σ small ⇒ Ψ_t small
  - Ψ_t decreases as y_r f″(x_r) grows: removing a confidently correctly-classified example is safer
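The expression for Ψ_t follows from expanding the squared norm after removing the term σ y_r K(x_r, ·):

```latex
f_{t+1} = f'' - \sigma\, y_r\, K(x_r, \cdot)
\;\Longrightarrow\;
\Psi_t = \|f_{t+1} - g\|^2 - \|f'' - g\|^2
       = \sigma^2 K(x_r, x_r) - 2\sigma\, y_r f''(x_r) + 2\sigma\, y_r g(x_r)
```

Both remarks read off this identity: every term scales with σ, and the term −2σ y_r f″(x_r) makes Ψ_t decrease as y_r f″(x_r) grows.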
Slide 18: Deviation due to Shrinking

- Case I: after shrinking, ‖f_t‖ ≥ ‖g‖. Then Φ_t ≤ 0: shrinking causes no deviation
(figure: ‖f − g‖² before and after shrinking)
Slide 19: Deviation due to Shrinking

- Case II: after shrinking, ‖f_t‖ ≤ ‖g‖ ≤ U. Then Φ_t ≤ U² (1 − φ_t)
(figure: ‖f − g‖² before and after shrinking)
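Both cases follow from expanding Φ_t for the scaled hypothesis φ_t f′ (Cauchy-Schwarz in the second step); this is my reconstruction of the argument the figures illustrated:

```latex
\Phi_t = \|\phi_t f' - g\|^2 - \|f' - g\|^2
       = (\phi_t^2 - 1)\|f'\|^2 + 2(1 - \phi_t)\langle f', g\rangle
       \le (1 - \phi_t)\bigl( 2\|f'\|\,\|g\| - (1 + \phi_t)\|f'\|^2 \bigr)
```

Case I: ‖φ_t f′‖ ≥ ‖g‖ gives 2‖g‖ ≤ 2φ_t‖f′‖ ≤ (1 + φ_t)‖f′‖, so the bracket is nonpositive and Φ_t ≤ 0. Case II: maximizing the bracket over ‖f′‖ (at ‖f′‖ = ‖g‖/(1 + φ_t)) and using ‖g‖ ≤ U gives Φ_t ≤ (1 − φ_t)‖g‖²/(1 + φ_t) ≤ U² (1 − φ_t).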
Slide 20: Self-tuning Shrinking Mechanism

- The Forgetron sets φ_t to the maximal value in (0, 1] for which the cumulative deviation from removal remains small
- This choice has an analytic solution
- By construction, the total deviation caused by removal is at most (15/32) M
- It can be shown (by strong induction) that the total deviation caused by shrinking is at most (1/32) M
Slide 21: Experiments

- Gaussian kernel
- Performance is compared to Crammer, Kandola and Singer (CKS), NIPS 2003
- We measure the number of prediction mistakes as a function of the budget B
- The baseline is the performance of the unbudgeted Perceptron
Slide 22: Experiment I, the MNIST dataset
Slide 23: Experiment II, the Census-Income (Adult) dataset
(the Perceptron makes 16,000 mistakes)
Slide 24: Experiment III, synthetic data with label noise
Slide 25: Summary

- No budget algorithm can compete with arbitrary hypotheses
- The Forgetron can compete with norm-bounded hypotheses
- It works well in practice
- It does not require any tuning parameters
- Future work: the Forgetron for batch learning