Transcript and Presenter's Notes

Title: Online Learning with a Memory Harness using the Forgetron


1
Online Learning with a Memory Harness using the Forgetron
The Hebrew University, Jerusalem, Israel
  • Shai Shalev-Shwartz, joint work with
  • Ofer Dekel and Yoram Singer

Large Scale Kernel Machines workshop, NIPS 2005, Whistler
2
Overview
  • Online learning with kernels
  • Goal: a strict limit on the number of support
    vectors
  • The Forgetron algorithm
  • Analysis
  • Experiments

3
Kernel-based Perceptron for Online Learning
[Figure: the online learner receives x_t, predicts sign(f_t(x_t)), and then
observes the true label y_t; the round counts as a mistake when
y_t \ne sign(f_t(x_t)).]
Current classifier: f_t(x) = \sum_{i \in I} y_i K(x_i, x)
Current active set: I holds the indices of past mistakes, e.g. I = {1, 3}
after mistakes on rounds 1 and 3 of the timeline 1 2 3 4 5 6 7 . . .
4
Kernel-based Perceptron for Online Learning
[Same figure, next animation step: a mistake on round 4 grows the active
set to I = {1, 3, 4}.]
A minimal code sketch of this learner appears below.
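A minimal Python sketch of the learner in this figure; the Gaussian kernel and the list representation of the active set are illustrative choices, not prescribed by the slides:

import numpy as np

def gaussian_kernel(x1, x2, width=1.0):
    # The experiments on slide 21 use a Gaussian kernel; any PSD kernel works.
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * width ** 2))

class KernelPerceptron:
    def __init__(self, kernel=gaussian_kernel):
        self.kernel = kernel
        self.active = []          # active set: (x_i, y_i) for every past mistake

    def score(self, x):
        # f_t(x) = sum_{i in I} y_i * K(x_i, x)
        return sum(y * self.kernel(xi, x) for xi, y in self.active)

    def update(self, x, y):
        # One online round: predict sign(f_t(x)), then learn from a mistake.
        if y * self.score(x) <= 0:          # wrong (or zero) prediction
            self.active.append((x, y))
            return True
        return False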
5
Learning on a Budget
  • |I| = the number of mistakes made until round t
  • Memory- and time-inefficient: |I| might grow unboundedly
  • Goal: construct a kernel-based online algorithm
    for which
  • |I| \le B for each t
  • it still performs well, i.e. comes with a performance
    guarantee

6
Mistake Bound for Perceptron
  • (x_1,y_1),...,(x_T,y_T) is a sequence of examples
  • A kernel K s.t. K(x_t,x_t) \le 1
  • g is a fixed competitor classifier in the RKHS
  • Define \ell_t(g) = \max(0, 1 - y_t g(x_t))
  • Then the number of mistakes M satisfies
    M \le \|g\|^2 + 2 \sum_{t=1}^{T} \ell_t(g)
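For completeness, the one-step derivation behind this bound, reconstructed in LaTeX from the definitions above (the slide states the bound without proof):

\[
\|f_t - g\|^2 - \|f_t + y_t K(x_t, \cdot) - g\|^2
  = 2 y_t g(x_t) - 2 y_t f_t(x_t) - K(x_t, x_t) \;\ge\; 1 - 2\,\ell_t(g)
\]

on every mistake round, using y_t f_t(x_t) \le 0, y_t g(x_t) \ge 1 - \ell_t(g), and K(x_t, x_t) \le 1; summing over the M mistake rounds and telescoping against \|f_1 - g\|^2 = \|g\|^2 gives the stated bound.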

7
Previous Work
  • Crammer, Kandola, Singer (2003)
  • Kivinen, Smola, Williamson (2004)
  • Weston, Bordes, Bottou (2005)

Previous online budget algorithms do not provide
a mistake bound. Is our goal attainable?
8
Mission Impossible
  • Input space: e_1, ..., e_{B+1}
  • Linear kernel: K(e_i, e_j) = \langle e_i, e_j \rangle = \delta_{i,j}
  • Budget constraint: |I| \le B. Therefore, there
    exists j s.t. \sum_{i \in I} \sigma_i K(e_i, e_j) = 0
  • We might always err on such an e_j
  • But the competitor g = \sum_i e_i never errs!
  • The (unbudgeted) Perceptron makes at most B+1 mistakes;
    a tiny code illustration follows
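A tiny Python illustration of the adversary's argument; the budget B, the particular active set, and the variable names are arbitrary choices:

import numpy as np

B = 5
E = np.eye(B + 1)                 # orthonormal inputs e_1, ..., e_{B+1}
active = {0, 2, 3, 4, 5}          # any active set with |I| <= B

# Orthonormality makes f_t(e_j) = sum_{i in I} sigma_i <e_i, e_j> = 0 for
# every j outside the active set, and some such j must exist:
j = next(k for k in range(B + 1) if k not in active)
print("learner's score on e_j:", 0.0)               # forced to guess
print("competitor's score:", E.sum(axis=0) @ E[j])  # g = sum_i e_i, always 1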

9
Redefine the Goal
  • We must restrict the competitor g somehow. One
    way: restrict \|g\|
  • The counterexample implies that we cannot
    compete with \|g\| \ge (B+1)^{1/2}
  • Main result: the Forgetron algorithm can compete
    with any classifier g s.t. \|g\| \le \frac{1}{4}\sqrt{(B+1)/\ln(B+1)}
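To get a feel for the main result's norm bound, a quick numeric check (in Python) of U = (1/4)\sqrt{(B+1)/\ln(B+1)} for a few budgets:

import math

for B in (100, 1000, 10000):
    U = 0.25 * math.sqrt((B + 1) / math.log(B + 1))
    print(B, round(U, 2))   # prints 100 1.17, 1000 3.01, 10000 8.24

So a budget of 1000 support vectors suffices to compete with any competitor of norm roughly 3.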

10
The Forgetron
f_t(x) = \sum_{i \in I} \sigma_i y_i K(x_i, x)
[Figure: timeline 1 2 3 ... t-1 t, showing which indices stay active
after each step.]
Step (1) - Perceptron:
I \leftarrow I \cup \{t\}, with new weight \sigma_t = 1
Step (2) - Shrinking:
\sigma_i \leftarrow \phi_t \sigma_i for all i \in I
Step (3) - Remove Oldest:
r = \min I, then I \leftarrow I \setminus \{r\}
A minimal code sketch of these three steps follows.
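A minimal Python sketch of this three-step update. The class layout and the Gaussian kernel (repeated here so the block stands alone) are implementation choices, and _shrink_coefficient is a PLACEHOLDER heuristic standing in for the self-tuning rule of slide 20:

import numpy as np

def gaussian_kernel(x1, x2, width=1.0):
    # Any kernel with K(x, x) <= 1 fits the analysis.
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * width ** 2))

class Forgetron:
    def __init__(self, budget, kernel=gaussian_kernel):
        self.B = budget
        self.kernel = kernel
        self.active = []              # oldest-first list of (x_i, y_i, sigma_i)

    def score(self, x):
        # f_t(x) = sum_{i in I} sigma_i * y_i * K(x_i, x)
        return sum(s * y * self.kernel(xi, x) for xi, y, s in self.active)

    def update(self, x, y):
        if y * self.score(x) > 0:
            return False              # no mistake, no update
        # Step (1) Perceptron: insert the example with weight sigma_t = 1.
        self.active.append((x, y, 1.0))
        if len(self.active) > self.B:
            # Step (2) Shrinking: scale all weights by phi_t in (0, 1].
            phi = self._shrink_coefficient()
            self.active = [(xi, yi, phi * s) for xi, yi, s in self.active]
            # Step (3) Remove oldest: drop r = min I.
            self.active.pop(0)
        return True

    def _shrink_coefficient(self):
        # ASSUMPTION: crude stand-in, not the paper's analytic rule;
        # shrink just enough that the weight about to leave is small.
        oldest_sigma = self.active[0][2]
        return min(1.0, 0.5 / max(oldest_sigma, 1e-12))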
11
Shrinking: a two-edged sword
  • \phi_t is small \Rightarrow \sigma_r is small \Rightarrow the deviation
    due to removal is negligible
  • \phi_t is small \Rightarrow the deviation due to shrinking is
    large
  • The Forgetron formalizes deviation and
    automatically balances the tradeoff

12
Quantifying Deviation
  • Progress measure: \Delta_t = \|f_t - g\|^2 - \|f_{t+1} - g\|^2
  • Split the progress over the three update steps; writing f' for the
    classifier after the Perceptron step and f'' after shrinking:
    \Delta_t = (\|f_t - g\|^2 - \|f' - g\|^2)        [after Perceptron]
             + (\|f' - g\|^2 - \|f'' - g\|^2)        [after shrinking]
             + (\|f'' - g\|^2 - \|f_{t+1} - g\|^2)   [after removal]
  • Deviation is measured by negative progress
13
Quantifying Deviation
\Delta_t = (gain from the Perceptron step)
         - (damage from shrinking)
         - (damage from removal)
The Forgetron sets \phi_t so that the two damage terms stay a small,
fixed fraction of the gain (the self-tuning rule is made precise on
slide 20).
14
Resulting Mistake Bound
  • For any g s.t. \|g\| \le \frac{1}{4}\sqrt{(B+1)/\ln(B+1)},
  • the number of prediction mistakes the Forgetron
    makes is at most  M \le 2\|g\|^2 + 4 \sum_{t=1}^{T} \ell_t(g)

15
Small deviation \Rightarrow Mistake Bound
  • Assume low deviation
  • Perceptron's progress: each mistake round moves f closer to g,
    \|f_t - g\|^2 - \|f' - g\|^2 \ge 1 - 2\,\ell_t(g)
[Figure: f steps toward g, so \|f - g\|^2 shrinks.]
16
Small deviation \Rightarrow Mistake Bound
  • On one hand: positive progress towards good
    competitors on every mistake round
  • On the other hand: the total possible progress telescopes to at
    most \|f_1 - g\|^2 = \|g\|^2

Corollary: small deviation \Rightarrow mistake bound (derivation
sketched below)
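In LaTeX, the corollary's arithmetic, taking on faith for now (slide 20 makes it precise) that the total deviation is at most M/2 while each mistake round's Perceptron step gains at least 1 - 2\ell_t(g):

\[
M - 2\sum_t \ell_t(g) - \frac{M}{2}
  \;\le\; \sum_{t \in \mathcal{M}} \Delta_t
  \;\le\; \|g\|^2
  \quad\Longrightarrow\quad
  M \;\le\; 2\|g\|^2 + 4 \sum_t \ell_t(g),
\]

which is the mistake bound of slide 14.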
17
Deviation due to Removal
  • Assume that on round t we remove example r with
    weight \sigma. Then the round's progress decomposes as
    (Perceptron's progress) - \Psi_t, where \Psi_t denotes the
    deviation from removal
  • Remarks:
  • \phi_t small \Rightarrow \sigma small \Rightarrow \Psi_t small
  • \Psi_t decreases with the removed example's margin y_r f_t(x_r)
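Expanding the removal step f_{t+1} = f'' - \sigma y_r K(x_r, \cdot) (with \sigma the weight after shrinking) makes both remarks explicit; this algebra is my reconstruction, using \|g\| \le U, of the equation the slide displays:

\[
\Psi_t = \|f_{t+1} - g\|^2 - \|f'' - g\|^2
       = \sigma^2 K(x_r, x_r) + 2\sigma y_r g(x_r) - 2\sigma y_r f''(x_r)
       \;\le\; \sigma^2 + 2\sigma U - 2\sigma y_r f''(x_r).
\]

Every term carries a factor of \sigma, and the last term shows \Psi_t shrinking as the removed example's margin grows.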

18
Deviation due to Shrinking
  • Case I: after shrinking, \|f_t\| \ge \|g\|
  • Then the deviation from shrinking satisfies \Phi_t \le 0:
    shrinking can only bring f closer to g
[Figure: \|f - g\|^2 does not increase.]
19
Deviation due to Shrinking
  • Case II: after shrinking, \|f_t\| < \|g\| \le U
  • Then \Phi_t \le U^2 (1 - \phi_t)
[Figure: \|f - g\|^2 may grow, but only by a controlled amount.]
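Both cases follow from expanding the shrinking step f'' = \phi_t f'; the algebra below is my reconstruction of what the two slides display:

\[
\Phi_t = \|\phi f' - g\|^2 - \|f' - g\|^2
       = (\phi^2 - 1)\|f'\|^2 + 2(1 - \phi)\langle f', g \rangle
       \;\le\; (1 - \phi)\,\|f'\|\,\bigl(2\|g\| - (1 + \phi)\|f'\|\bigr).
\]

When \phi\|f'\| \ge \|g\| (Case I) the bracketed factor is non-positive, so \Phi_t \le 0; otherwise maximizing over \|f'\| at \|g\|/(1+\phi) gives \Phi_t \le \|g\|^2 (1-\phi)/(1+\phi) \le U^2 (1 - \phi_t) (Case II).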
20
Self-tuning Shrinking Mechanism
  • The Forgetron sets \phi_t to the maximal value in
    (0,1] for which the cumulative deviation from removal
    remains small
  • The above has an analytic solution (a numeric stand-in
    is sketched below)
  • By construction, the total deviation caused by
    removal is at most (15/32) M
  • It can be shown (by strong induction) that the total
    deviation caused by shrinking is at most (1/32) M
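The slide does not reproduce the analytic solution, so here is a hypothetical numeric stand-in in Python; the deviation formula Psi(phi) is my reconstruction from slide 17 (shrunk weight phi * sigma_r, shrunk margin phi * margin_r), and the grid search replaces the closed form:

def shrink_coefficient(sigma_r, margin_r, U, dev_so_far, mistakes):
    # Largest phi in (0, 1] keeping the running removal deviation
    # within the (15/32) * M budget quoted above.
    # ASSUMED bound: Psi(phi) = (phi * sigma_r) ** 2
    #                + 2 * phi * sigma_r * U
    #                - 2 * (phi ** 2) * sigma_r * margin_r
    budget = (15.0 / 32.0) * mistakes - dev_so_far
    for k in range(1000, 0, -1):
        phi = k / 1000.0
        psi = ((phi * sigma_r) ** 2 + 2 * phi * sigma_r * U
               - 2 * (phi ** 2) * sigma_r * margin_r)
        if psi <= budget:
            return phi
    return 1.0 / 1000.0        # fall back to the heaviest shrinking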

21
Experiments
  • Gaussian kernel
  • Compare performance to Crammer, Kandola & Singer
    (CKS), NIPS 2003
  • Measure the number of prediction mistakes as a
    function of the budget
  • The baseline is the performance of the Perceptron
    (a toy version of the comparison is sketched below)
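A hypothetical harness for this comparison, reusing the KernelPerceptron and Forgetron sketches above on synthetic data (the MNIST and census loaders and the CKS baseline are omitted):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=2000))   # noisy linear labels

for budget in (50, 100, 200):
    model = Forgetron(budget)
    mistakes = sum(model.update(x, int(t)) for x, t in zip(X, y))
    print("B =", budget, "mistakes =", mistakes)

baseline = KernelPerceptron()
print("Perceptron:", sum(baseline.update(x, int(t)) for x, t in zip(X, y)))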

22
Experiment I: MNIST dataset
23
Experiment II: Census-income (adult)
(Perceptron makes 16,000 mistakes)
24
Experiment III: Synthetic Data with Label Noise
25
Summary
  • No budget algorithm can compete with arbitrary
    hypotheses
  • The Forgetron can compete with norm-bounded
    hypotheses
  • Works well in practice
  • Does not require parameters
  • Future work: the Forgetron for batch learning