Introduction to Support Vector Machines
(Presentation transcript)
1
Introduction to Support Vector Machines
  • Jie Tang
  • 25 July 2005

2
Introduction
  • Support Vector Machine (SVM) is a learning
    methodology based on Vapnik's statistical
    learning theory
  • Developed in the 1990s
  • It addresses problems of traditional statistical
    learning (overfitting, capacity control, etc.)
  • It has achieved the best performance in practical
    applications such as:
  • Handwritten digit recognition
  • Text categorization

3
Classification Problem
  • Given a training set S = {(x1, y1), (x2, y2), ..., (xl, yl)},
    where xi ∈ X = R^n and yi ∈ Y = {+1, -1}, i = 1, 2, ..., l
  • The goal is to learn a function g(x) such that the decision
    function f(x) = sgn(g(x)) correctly classifies a new input x
  • So this is a supervised, batch learning method

4
Linear classifier
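A minimal sketch of the linear classifier discussed on this slide, assuming the standard parameterization with weight vector w and bias b (both used on later slides):

    g(x) = \langle w, x \rangle + b, \qquad f(x) = \mathrm{sgn}(g(x)),

so g(x) = 0 defines the separating hyperplane.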
5
Maximum Margin Classifier
6
Maximum Margin Classifier
  • Let us select two points, one on each of the two
    hyperplanes.

Distance from the hyperplane to the origin
7
Maximum Margin Classifier
8
Then the margin equals the distance between the two
hyperplanes. Note that we also have constraints requiring
every training point to lie on the correct side of its
hyperplane. By scaling w and b we can set the functional
margin to 1.
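A hedged reconstruction of the resulting optimization problem (the standard maximum-margin primal; cf. the CS229 notes in the references):

    \min_{w, b} \; \frac{1}{2}\|w\|^2
    \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \dots, l,

since, with the functional margin fixed at 1, maximizing the geometric margin 2/\|w\| is equivalent to minimizing \|w\|^2/2.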
9
Lagrange duality
For the above problem
we can write the Lagrangian form.
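A hedged sketch of the missing Lagrangian, assuming the primal problem above with multipliers α_i ≥ 0 for the margin constraints:

    L(w, b, \alpha) = \frac{1}{2}\|w\|^2
      - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right].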
10
Let us review the generalized Lagrangian
Given the Lagrangian,
let us consider the maximum of L over the multipliers.
Note that the constraints must be satisfied;
otherwise, max L will be infinite.
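A brief reconstruction of the standard definitions being reviewed (following the CS229 notes cited in the references; f, g_i, h_j denote a generic objective, inequality constraints, and equality constraints): for a primal problem

    \min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \;\; h_j(w) = 0,

the generalized Lagrangian is

    L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w),

and the quantity being considered is

    \theta_P(w) = \max_{\alpha \ge 0,\, \beta} L(w, \alpha, \beta),

which equals f(w) when w satisfies the constraints and is +∞ otherwise.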
11
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have that
max L takes the same value as the objective f(w) of our
problem. Therefore, we can consider the minimization
problem min_w max L.
Let us define the optimal value of the primal problem as
p*. Then, let us define the dual problem by swapping the
min and the max; the two formulations look similar.
We define the optimal value of the dual problem as d*.
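Written out (a standard reconstruction, again following the cited CS229 notes), the two optimal values are

    p^* = \min_{w} \; \max_{\alpha \ge 0,\, \beta} L(w, \alpha, \beta),
    \qquad
    d^* = \max_{\alpha \ge 0,\, \beta} \; \min_{w} L(w, \alpha, \beta),

i.e. the dual simply swaps the order of the min and the max.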
12
Relationship between Primal and Dual problems
The key fact: d* <= p*. Why? Just remember it.
Then, under certain conditions, d* = p*, and we can solve
the dual problem in lieu of the primal problem.
What are those conditions?
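The fact to remember is presumably weak duality (a max-min is never larger than the corresponding min-max):

    d^* = \max_{\alpha \ge 0,\, \beta} \min_{w} L(w, \alpha, \beta)
        \;\le\; \min_{w} \max_{\alpha \ge 0,\, \beta} L(w, \alpha, \beta) = p^*.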
13
The famous KKT conditions (Karush-Kuhn-Tucker
conditions)
What do they imply?
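A standard statement of the conditions (the slide's formulas are not in the transcript; this follows the CS229 notes cited in the references, with w*, α*, β* the optimal primal and dual variables):

    \frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0
    \frac{\partial}{\partial \beta_j} L(w^*, \alpha^*, \beta^*) = 0
    \alpha_i^* \, g_i(w^*) = 0          (complementary slackness)
    g_i(w^*) \le 0
    \alpha_i^* \ge 0

Under suitable convexity assumptions, these conditions are necessary and sufficient for w*, α*, β* to be primal and dual optimal with d* = p*.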
14
The famous KKT conditions (Karush-Kuhn-Tucker
conditions)
Very important!
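The crucial implication here is presumably complementary slackness, α_i^* g_i(w^*) = 0: a multiplier α_i can be nonzero only for points with g_i(w^*) = 0, i.e. points lying exactly on the margin. These are the support vectors, and they are typically only a small fraction of the training set.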
15
Return to our problem
Let us first minimize L with respect to w and b.
Then substitute the two resulting equations back into L(w, b, α).
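The two equations referred to are presumably the zero-derivative conditions (standard in this derivation):

    \nabla_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i,
    \qquad
    \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0.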
16
We have the simplified Lagrangian.
Then we have the maximization problem with respect to α
(see the sketch below).
Now we have only one set of parameters, α. We can solve for
it, then recover w, and then b, because both are determined
by α.
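A hedged reconstruction of the missing formulas (standard for the separable case; cf. the cited CS229 notes). After substituting w = Σ α_i y_i x_i, the Lagrangian reduces to the dual objective, and the problem becomes

    \max_{\alpha} \; W(\alpha) = \sum_{i=1}^{l} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle
    \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{l} \alpha_i y_i = 0.

Given the optimal α, w = Σ_i α_i y_i x_i, and b can then be computed from any support vector via y_i(⟨w, x_i⟩ + b) = 1.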
17
How to predict
For a new sample x, we can predict its label by
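A hedged sketch of the missing prediction rule, using the dual solution above:

    f(x) = \mathrm{sgn}\!\left( \sum_{i=1}^{l} \alpha_i y_i \langle x_i, x \rangle + b \right),

so only the support vectors (those with α_i > 0) contribute to the prediction.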
18
Non-separable case
What is the non-separable case? I will not give an
example; I assume you already know it.
Then what is the optimization problem?
Next, we form the Lagrangian.
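A hedged reconstruction of the missing soft-margin formulation (standard notation; the slack variables ξ_i and penalty C are the usual symbols, not taken from the slides):

    \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i
    \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

with Lagrangian

    L(w, b, \xi, \alpha, r) = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i
      - \sum_i \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 + \xi_i \right]
      - \sum_i r_i \xi_i, \qquad \alpha_i, r_i \ge 0.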
19
Dual form
What is the difference from the previous form?
Also note the following conditions.
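A hedged sketch of the answer: the dual objective W(α) is unchanged, but each multiplier now has a box constraint,

    0 \le \alpha_i \le C, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0,

and the accompanying conditions (with g(x_i) = ⟨w, x_i⟩ + b) are

    \alpha_i = 0 \;\Rightarrow\; y_i g(x_i) \ge 1, \qquad
    \alpha_i = C \;\Rightarrow\; y_i g(x_i) \le 1, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i g(x_i) = 1.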
20
How to train an SVM: how to solve the optimization
problem
Sequential minimal optimization (SMO) algorithm,
due to John Platt.
First, let us introduce the coordinate ascent
algorithm (a code sketch follows):

    Loop until convergence:
        For i = 1, ..., m:
            αi := argmax over α̂i of L(α1, ..., αi-1, α̂i, αi+1, ..., αm)
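A minimal Python sketch of plain coordinate ascent, for illustration only (the toy objective f, the search grid, and the number of passes are arbitrary choices of mine, not from the slides; a real SVM trainer would maximize the dual W(α) instead):

    import numpy as np

    def coordinate_ascent(objective, alpha0, passes=50,
                          grid=np.linspace(-5.0, 5.0, 201)):
        # Maximize objective(alpha) one coordinate at a time, holding
        # the other coordinates fixed (crude 1-D grid search per step).
        alpha = np.array(alpha0, dtype=float)
        for _ in range(passes):
            for i in range(alpha.size):
                best_val, best_ai = -np.inf, alpha[i]
                for candidate in grid:
                    alpha[i] = candidate
                    val = objective(alpha)
                    if val > best_val:
                        best_val, best_ai = val, candidate
                alpha[i] = best_ai
        return alpha

    # Toy usage: maximize a concave quadratic whose optimum is (1, -2).
    f = lambda a: -(a[0] - 1.0) ** 2 - (a[1] + 2.0) ** 2
    print(coordinate_ascent(f, np.zeros(2)))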
21
Is coordinate ascent OK here?
Not quite: the constraint Σ α_i y_i = 0 means that any single
α_i is determined by the others, so we cannot make progress by
changing just one α_i while holding the rest fixed. We must
update at least two multipliers at a time.
22
SMO
Change the algorithm as follows (this is just SMO):
Repeat until convergence:
  1. Select some pair αi and αj to update next (using a
     heuristic that tries to pick the two that will allow us
     to make the biggest progress towards the global maximum).
  2. Re-optimize L(α) with respect to αi and αj, while
     holding all the other α's fixed.
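Why updating a pair works (a hedged sketch of the standard argument): fixing all multipliers except α1 and α2, the constraint Σ α_i y_i = 0 forces

    \alpha_1 y_1 + \alpha_2 y_2 = \zeta

for some constant ζ, so α_1 = (ζ - α_2 y_2) y_1 and the objective L becomes a function of α_2 alone.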
23
SMO(2)
This is a quadratic function in α2, i.e. it can be written
in the form a·α2² + b·α2 + c for some constants a, b, c.
24
Solving α2
For the quadratic function, we can simply solve it by
setting its derivative to zero. Let us call the resulting
value α2^{new, unclipped}.
Having found α2, we can go back to find the optimal α1.
Please read Platt's paper if you want more details.
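A note on the "unclipped" qualifier (this is the clipping step described in Platt's paper; L and H denote the endpoints of the feasible interval for α2 implied by the box constraints 0 ≤ α1, α2 ≤ C and the linear constraint above):

    \alpha_2^{new} = H                            if \alpha_2^{new,unclipped} > H
    \alpha_2^{new} = \alpha_2^{new,unclipped}     if L \le \alpha_2^{new,unclipped} \le H
    \alpha_2^{new} = L                            if \alpha_2^{new,unclipped} < L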
25
Kernel
1. Why kernels? 2. What is a feature-space mapping?
What is a kernel function?
With kernels, what is more interesting to us?
26
We can compute the kernel without calculating the mapping
Replace every inner product ⟨x, z⟩ by the kernel
K(x, z) = ⟨Φ(x), Φ(z)⟩.
Naively, we would need to compute the mapping Φ(x) first,
which may be expensive. But with a kernel, we can skip that
step. Why? Because both training and testing only use
expressions of the form ⟨x, z⟩.
For example:
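The example itself is missing from the transcript; the standard one (used in the cited CS229 notes, and presumably what the slide showed) is the quadratic kernel: for x, z ∈ R^n,

    K(x, z) = (x^T z)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i x_j)(z_i z_j)
            = \langle \Phi(x), \Phi(z) \rangle,

where Φ(x) is the n²-dimensional vector of products x_i x_j. Computing K(x, z) directly takes O(n) time, while forming Φ(x) explicitly would take O(n²).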
27
References
  • Vladimir N. Vapnik. The Nature of Statistical
    Learning Theory. Springer-Verlag, New York, 1998.
  • Andrew Ng. CS229 Lecture Notes, Part V: Support
    Vector Machines. Lectures from 10/19/03 to 10/26/03.
  • Christopher J. C. Burges. A Tutorial on Support
    Vector Machines for Pattern Recognition. Data Mining
    and Knowledge Discovery, 2, 121-167, 1998. Kluwer
    Academic Publishers, Boston.
  • Nello Cristianini and John Shawe-Taylor. An
    Introduction to Support Vector Machines. Cambridge
    University Press, 2000.

28
People
  • Vladimir Vapnik.
  • J. Platt
  • J. Platt, N. Cristianini, J. Shawe-Taylor
  • Shawe-Taylor, J.
  • Burges, C. J. C.
  • Thorsten Joachims
  • Etc.