Title: PEGASOS: Primal Efficient sub-GrAdient SOlver for SVM

1. PEGASOS: Primal Efficient sub-GrAdient SOlver for SVM
YASSO: Yet Another Svm SOlver
- Shai Shalev-Shwartz 
- Yoram Singer 
- Nati Srebro
The Hebrew University, Jerusalem, Israel
2. Support Vector Machines
- QP form
- More natural form: a regularization term plus an empirical loss (see the sketch below)
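The formulas on this slide are not legible in this transcript; a sketch of the two standard formulations presumably shown here (notation mine):

```latex
% QP form (soft-margin SVM):
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.} \quad y_i\bigl(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\bigr) \ge 1 - \xi_i,\;\; \xi_i \ge 0.

% More natural form: regularization term + empirical (hinge) loss:
\min_{\mathbf{w}} \;
  \underbrace{\frac{\lambda}{2}\|\mathbf{w}\|^2}_{\text{regularization term}}
  + \underbrace{\frac{1}{m} \sum_{i=1}^{m} \max\{0,\, 1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle\}}_{\text{empirical loss}}.
```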
3. Outline
- Previous Work 
- The Pegasos algorithm 
- Analysis: faster convergence rates 
- Experiments: outperforms the state of the art 
- Extensions 
- kernels 
- complex prediction problems 
- bias term
4. Previous Work
- Dual-based methods 
  - Interior Point methods 
    - Memory: m², time: m³·log(log(1/ε)) 
  - Decomposition methods 
    - Memory: m, time: super-linear in m 
- Online learning & stochastic gradient 
  - Memory: O(1), time: 1/ε² (linear kernel) 
  - Memory: 1/ε², time: 1/ε⁴ (non-linear kernel) 
  - Typically, online learning algorithms do not converge to the optimal solution of the SVM
- Better rates for finite-dimensional instances (Murata, Bottou)
5. PEGASOS
- At each iteration: pick A_t ⊆ S, take a subgradient step on the objective estimated from A_t, then project onto the ball of radius 1/√λ (see the sketch below)
- A_t = S recovers the subgradient method
- |A_t| = 1 recovers stochastic gradient descent
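To make the update concrete, here is a minimal sketch of the mini-batch Pegasos step for a linear kernel, assuming NumPy; the function name and signature are illustrative, not taken from the released source code:

```python
import numpy as np

def pegasos(X, y, lam, T, k=1, rng=None):
    """Minimal Pegasos sketch for a linear SVM (no bias term).

    X: (m, n) array of examples; y: (m,) labels in {-1, +1};
    lam: regularization parameter lambda; T: number of iterations; k: |A_t|.
    """
    rng = np.random.default_rng(rng)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                      # step size 1/(lambda * t)
        A = rng.choice(m, size=k, replace=False)   # choose A_t uniformly at random
        margins = y[A] * (X[A] @ w)
        viol = A[margins < 1.0]                    # examples in A_t with positive hinge loss
        # subgradient step on (lam/2)||w||^2 + (1/k) * sum of hinge losses over A_t
        w = (1.0 - eta * lam) * w
        if viol.size > 0:
            w += (eta / k) * (y[viol][:, None] * X[viol]).sum(axis=0)
        # project onto the ball of radius 1/sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > 0:
            w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w
```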
6. Run-Time of Pegasos
- Choosing |A_t| = 1 and a linear kernel over R^n
  ⇒ the run-time required for Pegasos to find an ε-accurate solution w.p. ≥ 1−δ is as sketched below
- The run-time does not depend on the number of examples
- It depends only on the difficulty of the problem (λ and ε)
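The run-time expression shown on this slide is not legible in this transcript; as I recall from the Pegasos paper, for a linear kernel over R^n it is, up to logarithmic factors and the dependence on the confidence δ:

```latex
\tilde{O}\!\left(\frac{n}{\lambda\,\epsilon}\right)
```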
7. Formal Properties
- Definition: w is ε-accurate if f(w) ≤ min_w' f(w') + ε
- Theorem 1: Pegasos finds an ε-accurate solution w.p. ≥ 1−δ after at most Õ(1/(δλε)) iterations
- Theorem 2: Pegasos finds log(1/δ) candidate solutions such that, w.p. ≥ 1−δ, at least one of them is ε-accurate after Õ(1/(λε)) iterations per candidate
8. Proof Sketch
A second look at the update step
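The update-step equations on this slide are not legible here; presumably the point is that the update is a projected subgradient step on the instantaneous objective defined by A_t (my reconstruction):

```latex
% Instantaneous objective on the mini-batch A_t:
f(\mathbf{w}; A_t) = \frac{\lambda}{2}\|\mathbf{w}\|^2
  + \frac{1}{|A_t|} \sum_{(\mathbf{x}, y) \in A_t} \max\{0,\, 1 - y\langle \mathbf{w}, \mathbf{x} \rangle\}.

% The Pegasos update is a projected subgradient step on f(\cdot; A_t):
\nabla_t = \lambda \mathbf{w}_t
  - \frac{1}{|A_t|}\sum_{(\mathbf{x}, y) \in A_t \,:\, y\langle \mathbf{w}_t, \mathbf{x} \rangle < 1} y\,\mathbf{x},
\qquad
\mathbf{w}_{t+1} = \Pi_{B}\bigl(\mathbf{w}_t - \eta_t \nabla_t\bigr),
\quad \eta_t = \tfrac{1}{\lambda t},
\quad B = \{\mathbf{w} : \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}.
```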
9. Proof Sketch
- Lemma (free projection) 
- Logarithmic regret for OCP (Hazan et al. '06) 
- Take expectation 
- Since f(w_r) − f(w*) ≥ 0, Markov's inequality gives that, w.p. ≥ 1−δ, the suboptimality is at most 1/δ times its expectation
- Amplify the confidence
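A compressed version of the argument these bullets outline, as I reconstruct it (the displayed bounds on the slide were lost in this transcript):

```latex
% Logarithmic regret for online convex programming (Hazan et al. '06),
% applied to the strongly convex instantaneous objectives f(\cdot; A_t):
\frac{1}{T}\sum_{t=1}^{T} f(\mathbf{w}_t; A_t)
  - \frac{1}{T}\sum_{t=1}^{T} f(\mathbf{w}^\star; A_t)
  \le O\!\left(\frac{\log T}{\lambda T}\right).

% Taking expectation over the random draws of A_t (and of the reported iterate w_r):
\mathbb{E}\bigl[f(\mathbf{w}_r)\bigr] - f(\mathbf{w}^\star) \le O\!\left(\frac{\log T}{\lambda T}\right).

% Since f(\mathbf{w}_r) - f(\mathbf{w}^\star) \ge 0, Markov's inequality gives,
% with probability at least 1 - \delta:
f(\mathbf{w}_r) - f(\mathbf{w}^\star) \le O\!\left(\frac{\log T}{\delta\,\lambda T}\right).
```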
10. Experiments
- 3 datasets (provided by Joachims) 
- Reuters CCAT (800K examples, 47k features) 
- Physics ArXiv (62k examples, 100k features) 
- Covertype (581k examples, 54 features) 
- 4 competing algorithms 
- SVM-light (Joachims) 
- SVM-Perf (Joachims '06) 
- Norma (Kivinen, Smola, Williamson '02) 
- Zhang '04 (stochastic gradient descent) 
- Source code available online
11. Training Time (in seconds)
12. Comparison to Norma (on Physics)
[Plots: objective value and test error]
13. Comparison to Zhang (on Physics)
[Plot: objective value]
But tuning the parameter is more expensive than the learning itself.
14. Effect of k = |A_t| when T is fixed
[Plot: objective value]
15. Effect of k = |A_t| when kT is fixed
[Plot: objective value]
16. I want my kernels!
- Pegasos can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function
- No need to switch to the dual problem
- The number of support vectors is bounded by the number of iterations (see the sketch below)
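A minimal sketch of how the kernelized variant can run entirely in the primal, assuming |A_t| = 1; the count-based representation follows the standard kernelized Pegasos description, and the function name, signature, and kernel are illustrative, not the authors' code:

```python
import numpy as np

def kernel_pegasos(X, y, lam, T, kernel, rng=None):
    """Kernelized Pegasos sketch with |A_t| = 1.

    The weight vector is never formed explicitly; it is represented as
    w_t = (1 / (lam * t)) * sum_j alpha[j] * y[j] * phi(x_j),
    where alpha[j] counts how often example j triggered an update.
    """
    rng = np.random.default_rng(rng)
    m = X.shape[0]
    alpha = np.zeros(m)
    for t in range(1, T + 1):
        i = rng.integers(m)                            # choose A_t = {i} uniformly at random
        idx = np.nonzero(alpha)[0]                     # current support vectors
        k_vals = np.array([kernel(X[i], X[j]) for j in idx])
        # margin of example i under the implicit w_t
        margin = (y[i] / (lam * t)) * np.sum(alpha[idx] * y[idx] * k_vals)
        if margin < 1.0:                               # positive hinge loss: add i as a support vector
            alpha[i] += 1
    return alpha                                       # support vectors are the examples with alpha > 0

# Illustrative usage with a Gaussian kernel:
# rbf = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)
# alpha = kernel_pegasos(X, y, lam=0.1, T=1000, kernel=rbf)
```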
17. Complex Decision Problems
- Pegasos works whenever we know how to calculate subgradients of the loss function ℓ(w; (x, y))
- Example: structured output prediction (spelled out below)
- A subgradient is φ(x, ŷ) − φ(x, y), where ŷ is the maximizer in the definition of ℓ
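Spelling out the structured-output example under the usual definition of the structured hinge loss (notation mine):

```latex
% Structured hinge loss for a joint feature map \phi(x, y):
\ell(\mathbf{w}; (x, y)) = \max_{y'} \bigl[ \Delta(y, y')
  + \langle \mathbf{w}, \phi(x, y') \rangle \bigr] - \langle \mathbf{w}, \phi(x, y) \rangle.

% With \hat{y} the maximizer above, a subgradient w.r.t. \mathbf{w} is:
\phi(x, \hat{y}) - \phi(x, y) \in \partial_{\mathbf{w}}\, \ell(\mathbf{w}; (x, y)).
```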
18. Bias Term
- Popular approach: increase the dimension of x (illustrated below). Con: we pay for b in the regularization term
- Calculate subgradients w.r.t. w and w.r.t. b. Con: the convergence rate degrades to 1/ε²
- Define the loss with the bias optimized out. Con: A_t needs to be large
- Search for b in an outer loop. Con: each evaluation of the objective costs 1/ε²
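For the first option above, the standard augmentation trick (my illustration, not taken from the slide) and the reason it penalizes b:

```latex
% Append a constant feature so the bias is absorbed into the weight vector:
x' = (x, 1), \qquad w' = (\mathbf{w}, b), \qquad
\langle w', x' \rangle = \langle \mathbf{w}, x \rangle + b,
% but the regularizer becomes
\frac{\lambda}{2}\|w'\|^2 = \frac{\lambda}{2}\bigl(\|\mathbf{w}\|^2 + b^2\bigr),
% i.e., b is also penalized.
```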
19. Discussion
- Pegasos: a simple and efficient solver for SVM
- Sample vs. computational complexity
  - Sample complexity: how many examples do we need as a function of the VC-dim (λ), accuracy (ε), and confidence (δ)?
  - In Pegasos, we aim at analyzing the computational complexity based on λ, ε, and δ (as also in Bottou & Bousquet)
- Finding the argmin vs. calculating the min: it seems that Pegasos finds the argmin more easily than it can calculate the min value