Support Vector Machine (SVM) Classification

Transcript and Presenter's Notes

1
Support Vector Machine (SVM) Classification
  • Greg Grudic

2
Today's Lecture Goals
  • Support Vector Machine (SVM) Classification
  • Another algorithm for finding linear separating
    hyperplanes
  • A good text on SVMs: Bernhard Schölkopf and Alex
    Smola, Learning with Kernels, MIT Press,
    Cambridge, MA, 2002

3
Support Vector Machine (SVM) Classification
  • Classification as a problem of finding optimal
    (canonical) linear hyperplanes.
  • Optimal Linear Separating Hyperplanes
  • In Input Space
  • In Kernel Space
  • Can be non-linear

4
Linear Separating Hyper-Planes
How many lines can separate these points?
Which line should we use?
(Figure: several candidate separating lines; a line that fails to
separate the classes is marked "NO!")
5
Initial Assumption: Linearly Separable Data
6
Linear Separating Hyper-Planes
7
Linear Separating Hyper-Planes
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ with labels $y_i \in \{-1, +1\}$
  • Finding a separating hyperplane can be posed as a
    constraint satisfaction problem (CSP): find $\mathbf{w}, b$ such that
    $\mathbf{w}^T \mathbf{x}_i + b > 0$ when $y_i = +1$ and $\mathbf{w}^T \mathbf{x}_i + b < 0$ when $y_i = -1$
  • Or, equivalently: $y_i (\mathbf{w}^T \mathbf{x}_i + b) > 0$ for all $i$
  • If the data is linearly separable, there are
    infinitely many hyperplanes that satisfy this
    CSP (see the sketch below)
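
A minimal sketch of this constraint check in Python (the data and the
candidate hyperplane are illustrative, not from the slides):

```python
# The CSP condition: (w, b) separates the data iff
# y_i (w . x_i + b) > 0 for every training example.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])  # toy inputs
y = np.array([1, 1, -1, -1])                                       # labels in {-1, +1}

def separates(w, b):
    return bool(np.all(y * (X @ w + b) > 0))

print(separates(np.array([1.0, 1.0]), 0.0))  # True: one of infinitely many solutions
```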

8
The Margin of a Classifier
  • Take any hyper-plane (P0) that separates the data
  • Place a parallel hyper-plane (P1) through the point
    in class 1 closest to P0
  • Place a second parallel hyper-plane (P2) through the
    point in class -1 closest to P0
  • The margin (M) is the perpendicular distance
    between P1 and P2

9
Calculating the Margin of a Classifier
(Figure: parallel hyper-planes P1, P0, P2)
  • P0: Any separating hyperplane
  • P1: Parallel to P0, passing through the closest
    point in one class
  • P2: Parallel to P0, passing through the closest
    point in the opposite class

Margin (M): distance measured along a line
perpendicular to P1 and P2
10
SVM Constraints on the Model Parameters
Model parameters $(\mathbf{w}, b)$ must be chosen such
that $\mathbf{w}^T \mathbf{x} + b = +1$ for points on P1 and
$\mathbf{w}^T \mathbf{x} + b = -1$ for points on P2.
For any separating P0, these constraints can always be
satisfied by rescaling $\mathbf{w}$ and $b$.
Given the above, the linear separating
boundary lies halfway between P1 and P2 and is
given by $\mathbf{w}^T \mathbf{x} + b = 0$.
Resulting classifier: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$
11
Remember: the signed distance from a point $\mathbf{x}$ to a
hyperplane defined by $\mathbf{w}^T \mathbf{x} + b = 0$ is
$d(\mathbf{x}) = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|}$
12
Calculating the Margin (1)
13
Calculating the Margin (2)
Signed distance from a point on P1 (where $\mathbf{w}^T \mathbf{x} + b = 1$)
to the boundary is $1/\|\mathbf{w}\|$; from a point on P2 it is $-1/\|\mathbf{w}\|$.
Take absolute values and add to get the unsigned margin:
$M = \frac{2}{\|\mathbf{w}\|}$
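
A small numeric check of these formulas (the weight vector and test
point are illustrative):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weights, ||w|| = 5
b = -1.0

def signed_distance(x):
    # signed distance from x to the hyperplane w.x + b = 0
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([2.0, 0.5])))  # (6 + 2 - 1) / 5 = 1.4
print(2.0 / np.linalg.norm(w))                # margin M = 2 / ||w|| = 0.4
```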
14
Different P0s have Different Margins
(Same diagram and P0/P1/P2 definitions as slide 9, for one
choice of P0 and its margin M)
15
Different P0s have Different Margins
(Same diagram for a second choice of P0, giving a different margin)
16
Different P0s have Different Margins
(Same diagram for a third choice of P0, giving a different margin)
17
How Do SVMs Choose the Optimal Separating
Hyperplane (boundary)?
  • Find the $(\mathbf{w}, b)$ that maximizes the margin!

Margin (M): distance measured along a line
perpendicular to P1 and P2
18
SVM Constraint Optimization Problem
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$, $y_i \in \{-1, +1\}$
  • Minimize $\frac{1}{2}\|\mathbf{w}\|^2$ (equivalent to maximizing
    the margin $M = 2/\|\mathbf{w}\|$) subject to
    $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$

The Lagrange Function Formulation is used to
solve this minimization problem
19
The Lagrange Function Formulation
For every constraint we introduce a Lagrange
multiplier $\alpha_i \ge 0$.
The Lagrangian is then defined by
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$
where the primal variables are $\mathbf{w}, b$ and
the dual variables are $\alpha_1, \dots, \alpha_N$.
Goal: Minimize the Lagrangian w.r.t. the primal
variables, and maximize it w.r.t. the dual variables
20
Derivation of the Dual Problem
  • At the saddle point (extremum w.r.t. primal)
  • This give the conditions
  • Substitute into to get the dual
    problem

21
Using the Lagrange Function Formulation, we get
the Dual Problem
  • Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
  • Subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
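
Since this dual is a quadratic program, a generic QP solver can handle
it. A sketch using the cvxopt package (a solver choice of ours, not the
slides'; the data is a toy set):

```python
# Hard-margin dual as a QP: min (1/2) a'Pa + q'a with P_ij = y_i y_j <x_i, x_j>
# and q = -1 (turning maximization into minimization); G a <= h encodes
# a_i >= 0 and A a = b encodes sum_i a_i y_i = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for stability
q = matrix(-np.ones(N))
G, h = matrix(-np.eye(N)), matrix(np.zeros(N))
A, b = matrix(y.reshape(1, -1)), matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-6                              # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
b0 = np.mean(y[sv] - X[sv] @ w)                # from y_j (w.x_j + b) = 1 at SVs
print(alpha.round(4), w, b0)
```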

22
Properties of the Dual Problem
  • Solving the Dual gives a solution to the original
    constraint optimization problem
  • For SVMs, the Dual problem is a Quadratic
    Optimization Problem which has a globally optimal
    solution
  • Gives insights into the NON-Linear formulation
    for SVMs

23
Support Vector Expansion (1)
$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$ with $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
OR
$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b \right)$
$b$ is also computed from the optimal dual variables
(e.g., $b = y_j - \mathbf{w}^T \mathbf{x}_j$ for any support vector $\mathbf{x}_j$)
24
Support Vector Expansion (2)
Substitute $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$ into $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$
OR
$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b \right)$
25
What are the Support Vectors?
(Figure: maximized margin; the support vectors are the training
points lying on P1 and P2, i.e., those with $\alpha_i > 0$)
26
Why do we want a model with only a few SVs?
  • Leaving out an example that does not become an SV
    gives the same solution!
  • Theorem (Vapnik and Chervonenkis, 1974): Let $\#SV(N)$
    be the number of SVs obtained by training on N
    examples randomly drawn from P(X,Y), and E be the
    expectation. Then
    $E[\text{prob. of test error}] \le \frac{E[\#SV(N)]}{N}$
  • Example: if training on N = 1000 examples typically
    yields 50 SVs, the expected error is bounded by 5%

27
What Happens When Data is Not Separable? Soft
Margin SVM
Add a slack variable $\xi_i \ge 0$ to each example, relaxing the
constraints to $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$
28
Soft Margin SVM Constraint Optimization Problem
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$, $y_i \in \{-1, +1\}$
  • Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to
    $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
    (an equivalent hinge-loss sketch follows below)
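
Eliminating the slacks gives the equivalent hinge-loss objective
$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \max(0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$,
which can be minimized directly. A rough stochastic sub-gradient
(Pegasos-style) sketch, not the QP route the slides take; the data
and step sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = rng.normal(size=(200, 2)) + 1.5 * y[:, None]   # overlapping, non-separable classes

C = 1.0
w, b = np.zeros(2), 0.0
for t in range(1, 10_001):
    i = rng.integers(len(y))
    eta = 1.0 / t                          # decaying step size (illustrative)
    if y[i] * (w @ X[i] + b) < 1:          # margin violated: hinge term active
        w = (1 - eta) * w + eta * C * y[i] * X[i]
        b += eta * C * y[i]
    else:                                  # only the regularizer contributes
        w = (1 - eta) * w
print(w, b, np.mean(np.sign(X @ w + b) == y))  # weights, bias, training accuracy
```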

29
Dual Problem (Non-separable data)
  • Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ (as before)
  • Subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

30
Same Decision Boundary
31
Mapping into Nonlinear Space
Goal: Data is linearly separable (or almost) in
the nonlinear space.
32
Nonlinear SVMs
  • KEY IDEA: Note that both the decision boundary
    and the dual optimization formulation use dot
    products in input space only!

33
Kernel Trick
Replace every inner product $\mathbf{x}_i^T \mathbf{x}_j$
with a kernel evaluation
$k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$
Can use the same algorithms in nonlinear kernel
space!
34
Nonlinear SVMs
Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$
Boundary: $f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b \right)$
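
A sketch of this kernelized decision function (the Gaussian kernel, the
toy support vectors, and the dual values are illustrative; in practice
they come from solving the dual above):

```python
import numpy as np

def gaussian_kernel(a, c, sigma=1.0):
    # k(a, c) = exp(-||a - c||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - c) ** 2) / (2.0 * sigma ** 2))

def decision(x, X, y, alpha, b, kernel=gaussian_kernel):
    # f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b )
    s = sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X))
    return np.sign(s + b)

# Illustrative values standing in for a solved dual:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(decision(np.array([0.8, 0.9]), X, y, alpha, b=0.0))   # 1.0
```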
35
Need Mercer Kernels: $k$ must be symmetric and yield positive
semi-definite Gram matrices, so that it corresponds to an inner
product in some feature space
36
Gram (Kernel) Matrix
Given training data $\mathbf{x}_1, \dots, \mathbf{x}_N$, the Gram matrix has
entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
  • Properties (checked in the sketch below):
  • Positive semi-definite
  • Symmetric
  • Positive on diagonal
  • N by N
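
A sketch that builds a Gram matrix for toy data and checks the listed
properties (Gaussian kernel assumed):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(5, 3))     # N = 5 toy points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K = np.exp(-sq / 2.0)                                # Gaussian kernel, sigma = 1

print(K.shape)                                # N by N
print(np.allclose(K, K.T))                    # symmetric
print(np.diag(K))                             # positive on the diagonal (all 1s here)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues >= 0: PSD
```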

37
Commonly Used Mercer Kernels
  • Polynomial: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^d$
  • Sigmoid: $k(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x}^T \mathbf{z} + \theta)$
  • Gaussian: $k(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2} \right)$
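
Direct translations of the three formulas (parameter values are
illustrative defaults, not from the slides):

```python
import numpy as np

def polynomial(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def sigmoid(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial(x, z), sigmoid(x, z), gaussian(x, z))
```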

38
Why these kernels?
  • There are infinitely many kernels
  • The best kernel is data set dependent
  • We can only know which kernels are good by trying
    them and estimating error rates on future data
  • Definition: a universal approximator is a mapping
    that can model any surface (i.e., any many-to-one
    mapping) arbitrarily well
  • Motivation for the most commonly used kernels:
  • Polynomials (given enough terms) are universal
    approximators
  • However, a polynomial kernel of fixed degree is not a
    universal approximator because it cannot represent
    interactions beyond that degree
  • Sigmoid functions (given enough training
    examples) are universal approximators
  • Gaussian Kernels (given enough training examples)
    are universal approximators
  • These kernels have been shown to produce good
    models in practice

39
Picking a Model (A Kernel for SVMs)?
  • How do you pick the Kernels?
  • Kernel parameters
  • These are called learning parameters or
    hyperparameters
  • Two approaches to choosing learning parameters:
  • Bayesian
  • Learning parameters must maximize probability of
    correct classification on future data based on
    prior biases
  • Frequentist
  • Use the training data to learn the model
    parameters
  • Use validation data to pick the best
    hyperparameters.
  • More on learning parameter selection later
    (a cross-validation sketch follows below)
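
A sketch of the frequentist recipe using scikit-learn's grid search
(a library choice of ours, not the slides'; the parameter grid is
illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},  # hyperparameters
    cv=5,                    # 5-fold cross-validation plays the validation role
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```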

40-42
(Image-only slides; no transcript)
43
Some SVM Software
  • LIBSVM
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • SVM Light
  • http://svmlight.joachims.org/
  • TinySVM
  • http://chasen.org/~taku/software/TinySVM/
  • WEKA
  • http://www.cs.waikato.ac.nz/ml/weka/
  • Has many ML algorithm implementations in Java
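
For a quick start in Python: scikit-learn's SVC is built on LIBSVM
(listed above), so it is an easy way to try these ideas (toy data,
not from the slides):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors found by LIBSVM
print(clf.score(X, y))              # training accuracy
```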

44
MNIST: An SVM Success Story
  • Handwritten character benchmark
  • 60,000 training and 10,000 testing examples
  • Dimension: d = 28 x 28 = 784

45
Results on Test Data
SVM used a polynomial kernel of degree 9.
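
A small-scale sketch in the same spirit, using scikit-learn's 8x8
digits as a stand-in for MNIST (so the score will not match the
slide's 28x28 results):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X / 16.0, y, random_state=0)
clf = SVC(kernel='poly', degree=9, coef0=1).fit(Xtr, ytr)  # degree-9 polynomial kernel
print(clf.score(Xte, yte))                                 # test accuracy
```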
46
SVM (Kernel) Model Structure