Support Vector Machine (SVM) Classification

Transcript and Presenter's Notes

1
Support Vector Machine (SVM) Classification
  • Greg Grudic

2
Today's Lecture Goals
  • Support Vector Machine (SVM) Classification
  • Another algorithm for finding linear separating
    hyperplanes
  • A good text on SVMs: Bernhard Schölkopf and Alex
    Smola, Learning with Kernels, MIT Press,
    Cambridge, MA, 2002

3
Support Vector Machine (SVM) Classification
  • Classification as a problem of finding optimal
    (canonical) linear hyperplanes.
  • Optimal Linear Separating Hyperplanes
  • In Input Space
  • In Kernel Space
  • Can be non-linear

4
Linear Separating Hyper-Planes
How many lines can separate these points?
Which line should we use?
(Figure: several candidate separating lines; a line that fails to
separate the classes is marked "NO!")
5
Initial Assumption: Linearly Separable Data
6
Linear Separating Hyper-Planes
7
Linear Separating Hyper-Planes
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ with labels $y_i \in \{-1, +1\}$
  • Finding a separating hyperplane can be posed as a
    constraint satisfaction problem (CSP): find $\mathbf{w}, b$ such that
    $\mathbf{w}^T \mathbf{x}_i + b > 0$ when $y_i = +1$ and $\mathbf{w}^T \mathbf{x}_i + b < 0$ when $y_i = -1$
  • Or, equivalently: $y_i (\mathbf{w}^T \mathbf{x}_i + b) > 0$ for all $i$
  • If the data is linearly separable, there are
    infinitely many hyperplanes that satisfy this
    CSP (see the sketch below)
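
A minimal sketch of this constraint check in Python (the data and the
candidate hyperplane are illustrative, not from the slides):

```python
# The CSP condition: (w, b) separates the data iff
# y_i (w . x_i + b) > 0 for every training example.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])  # toy inputs
y = np.array([1, 1, -1, -1])                                       # labels in {-1, +1}

def separates(w, b):
    return bool(np.all(y * (X @ w + b) > 0))

print(separates(np.array([1.0, 1.0]), 0.0))  # True: one of infinitely many solutions
```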

8
The Margin of a Classifier
  • Take any hyper-plane (P0) that separates the data
  • Place a parallel hyper-plane (P1) through the point
    in class 1 closest to P0
  • Place a second parallel hyper-plane (P2) through the
    point in class -1 closest to P0
  • The margin (M) is the perpendicular distance
    between P1 and P2

9
Calculating the Margin of a Classifier
(Figure: parallel hyper-planes P1, P0, P2)
  • P0: Any separating hyperplane
  • P1: Parallel to P0, passing through the closest
    point in one class
  • P2: Parallel to P0, passing through the closest
    point in the opposite class

Margin (M): distance measured along a line
perpendicular to P1 and P2
10
SVM Constraints on the Model Parameters
Model parameters $(\mathbf{w}, b)$ must be chosen such
that $\mathbf{w}^T \mathbf{x} + b = +1$ for points on P1 and
$\mathbf{w}^T \mathbf{x} + b = -1$ for points on P2.
For any separating P0, these constraints can always be
satisfied by rescaling $\mathbf{w}$ and $b$.
Given the above, the linear separating
boundary lies halfway between P1 and P2 and is
given by $\mathbf{w}^T \mathbf{x} + b = 0$.
Resulting classifier: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$
11
Remember: the signed distance from a point $\mathbf{x}$ to a
hyperplane defined by $\mathbf{w}^T \mathbf{x} + b = 0$ is
$d(\mathbf{x}) = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|}$
12
Calculating the Margin (1)
13
Calculating the Margin (2)
Signed distance from a point on P1 (where $\mathbf{w}^T \mathbf{x} + b = 1$)
to the boundary is $1/\|\mathbf{w}\|$; from a point on P2 it is $-1/\|\mathbf{w}\|$.
Take absolute values and add to get the unsigned margin:
$M = \frac{2}{\|\mathbf{w}\|}$
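
A small numeric check of these formulas (the weight vector and test
point are illustrative):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weights, ||w|| = 5
b = -1.0

def signed_distance(x):
    # signed distance from x to the hyperplane w.x + b = 0
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([2.0, 0.5])))  # (6 + 2 - 1) / 5 = 1.4
print(2.0 / np.linalg.norm(w))                # margin M = 2 / ||w|| = 0.4
```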
14
Different P0s have Different Margins
(Same diagram and P0/P1/P2 definitions as slide 9, for one
choice of P0 and its margin M)
15
Different P0s have Different Margins
(Same diagram for a second choice of P0, giving a different margin)
16
Different P0s have Different Margins
(Same diagram for a third choice of P0, giving a different margin)
17
How Do SVMs Choose the Optimal Separating
Hyperplane (boundary)?
  • Find the $(\mathbf{w}, b)$ that maximizes the margin!

Margin (M): distance measured along a line
perpendicular to P1 and P2
18
SVM Constraint Optimization Problem
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$, $y_i \in \{-1, +1\}$
  • Minimize $\frac{1}{2}\|\mathbf{w}\|^2$ (equivalent to maximizing
    the margin $M = 2/\|\mathbf{w}\|$) subject to
    $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$ for all $i$

The Lagrange Function Formulation is used to
solve this minimization problem
19
The Lagrange Function Formulation
For every constraint we introduce a Lagrange
multiplier $\alpha_i \ge 0$.
The Lagrangian is then defined by
$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]$
where the primal variables are $\mathbf{w}, b$ and
the dual variables are $\alpha_1, \dots, \alpha_N$.
Goal: Minimize the Lagrangian w.r.t. the primal
variables, and maximize it w.r.t. the dual variables
20
Derivation of the Dual Problem
  • At the saddle point (extremum w.r.t. primal)
  • This give the conditions
  • Substitute into to get the dual
    problem

21
Using the Lagrange Function Formulation, we get
the Dual Problem
  • Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
  • Subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
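
Since this dual is a quadratic program, a generic QP solver can handle
it. A sketch using the cvxopt package (a solver choice of ours, not the
slides'; the data is a toy set):

```python
# Hard-margin dual as a QP: min (1/2) a'Pa + q'a with P_ij = y_i y_j <x_i, x_j>
# and q = -1 (turning maximization into minimization); G a <= h encodes
# a_i >= 0 and A a = b encodes sum_i a_i y_i = 0.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for stability
q = matrix(-np.ones(N))
G, h = matrix(-np.eye(N)), matrix(np.zeros(N))
A, b = matrix(y.reshape(1, -1)), matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-6                              # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
b0 = np.mean(y[sv] - X[sv] @ w)                # from y_j (w.x_j + b) = 1 at SVs
print(alpha.round(4), w, b0)
```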

22
Properties of the Dual Problem
  • Solving the Dual gives a solution to the original
    constraint optimization problem
  • For SVMs, the Dual problem is a Quadratic
    Optimization Problem which has a globally optimal
    solution
  • Gives insights into the NON-Linear formulation
    for SVMs

23
Support Vector Expansion (1)
$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$ with $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
OR
$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b \right)$
$b$ is also computed from the optimal dual variables
(e.g., $b = y_j - \mathbf{w}^T \mathbf{x}_j$ for any support vector $\mathbf{x}_j$)
24
Support Vector Expansion (2)
Substitute $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$ into $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$
OR
$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b \right)$
25
What are the Support Vectors?
(Figure: maximized margin; the support vectors are the training
points lying on P1 and P2, i.e., those with $\alpha_i > 0$)
26
Why do we want a model with only a few SVs?
  • Leaving out an example that does not become an SV
    gives the same solution!
  • Theorem (Vapnik and Chervonenkis, 1974): Let $\#SV(N)$
    be the number of SVs obtained by training on N
    examples randomly drawn from P(X,Y), and E be the
    expectation. Then
    $E[\text{prob. of test error}] \le \frac{E[\#SV(N)]}{N}$
  • Example: if training on N = 1000 examples typically
    yields 50 SVs, the expected error is bounded by 5%

27
What Happens When Data is Not Separable? Soft
Margin SVM
Add a slack variable $\xi_i \ge 0$ to each example, relaxing the
constraints to $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$
28
Soft Margin SVM Constraint Optimization Problem
  • Given data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$, $y_i \in \{-1, +1\}$
  • Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to
    $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
    (an equivalent hinge-loss sketch follows below)
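
Eliminating the slacks gives the equivalent hinge-loss objective
$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \max(0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i + b))$,
which can be minimized directly. A rough stochastic sub-gradient
(Pegasos-style) sketch, not the QP route the slides take; the data
and step sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = rng.normal(size=(200, 2)) + 1.5 * y[:, None]   # overlapping, non-separable classes

C = 1.0
w, b = np.zeros(2), 0.0
for t in range(1, 10_001):
    i = rng.integers(len(y))
    eta = 1.0 / t                          # decaying step size (illustrative)
    if y[i] * (w @ X[i] + b) < 1:          # margin violated: hinge term active
        w = (1 - eta) * w + eta * C * y[i] * X[i]
        b += eta * C * y[i]
    else:                                  # only the regularizer contributes
        w = (1 - eta) * w
print(w, b, np.mean(np.sign(X @ w + b) == y))  # weights, bias, training accuracy
```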

29
Dual Problem (Non-separable data)
  • Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ (as before)
  • Subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

30
Same Decision Boundary
31
Mapping into Nonlinear Space
Goal: Data is linearly separable (or almost) in
the nonlinear space.
32
Nonlinear SVMs
  • KEY IDEA: Note that both the decision boundary
    and the dual optimization formulation use dot
    products in input space only!

33
Kernel Trick
Replace every inner product $\mathbf{x}_i^T \mathbf{x}_j$
with a kernel evaluation
$k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$
Can use the same algorithms in nonlinear kernel
space!
34
Nonlinear SVMs
Maximize $W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$
Boundary: $f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}) + b \right)$
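
A sketch of this kernelized decision function (the Gaussian kernel, the
toy support vectors, and the dual values are illustrative; in practice
they come from solving the dual above):

```python
import numpy as np

def gaussian_kernel(a, c, sigma=1.0):
    # k(a, c) = exp(-||a - c||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - c) ** 2) / (2.0 * sigma ** 2))

def decision(x, X, y, alpha, b, kernel=gaussian_kernel):
    # f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b )
    s = sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X))
    return np.sign(s + b)

# Illustrative values standing in for a solved dual:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(decision(np.array([0.8, 0.9]), X, y, alpha, b=0.0))   # 1.0
```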
35
Need Mercer Kernels: $k$ must be symmetric and yield positive
semi-definite Gram matrices, so that it corresponds to an inner
product in some feature space
36
Gram (Kernel) Matrix
Given training data $\mathbf{x}_1, \dots, \mathbf{x}_N$, the Gram matrix has
entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
  • Properties (checked in the sketch below):
  • Positive semi-definite
  • Symmetric
  • Positive on diagonal
  • N by N
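
A sketch that builds a Gram matrix for toy data and checks the listed
properties (Gaussian kernel assumed):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(5, 3))     # N = 5 toy points in R^3
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K = np.exp(-sq / 2.0)                                # Gaussian kernel, sigma = 1

print(K.shape)                                # N by N
print(np.allclose(K, K.T))                    # symmetric
print(np.diag(K))                             # positive on the diagonal (all 1s here)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues >= 0: PSD
```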

37
Commonly Used Mercer Kernels
  • Polynomial: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z} + c)^d$
  • Sigmoid: $k(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x}^T \mathbf{z} + \theta)$
  • Gaussian: $k(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2} \right)$
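
Direct translations of the three formulas (parameter values are
illustrative defaults, not from the slides):

```python
import numpy as np

def polynomial(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def sigmoid(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial(x, z), sigmoid(x, z), gaussian(x, z))
```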

38
Why these kernels?
  • There are infinitely many kernels
  • The best kernel is data set dependent
  • We can only know which kernels are good by trying
    them and estimating error rates on future data
  • Definition: a universal approximator is a mapping
    that can model any surface (i.e., any many-to-one
    mapping) arbitrarily well
  • Motivation for the most commonly used kernels:
  • Polynomials (given enough terms) are universal
    approximators
  • However, a polynomial kernel of fixed degree is not a
    universal approximator because it cannot represent
    interactions beyond that degree
  • Sigmoid functions (given enough training
    examples) are universal approximators
  • Gaussian Kernels (given enough training examples)
    are universal approximators
  • These kernels have been shown to produce good
    models in practice

39
Picking a Model (A Kernel for SVMs)?
  • How do you pick the Kernels?
  • Kernel parameters
  • These are called learning parameters or
    hyperparameters
  • Two approaches to choosing learning parameters:
  • Bayesian
  • Learning parameters must maximize probability of
    correct classification on future data based on
    prior biases
  • Frequentist
  • Use the training data to learn the model
    parameters
  • Use validation data to pick the best
    hyperparameters.
  • More on learning parameter selection later
    (a cross-validation sketch follows below)
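
A sketch of the frequentist recipe using scikit-learn's grid search
(a library choice of ours, not the slides'; the parameter grid is
illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},  # hyperparameters
    cv=5,                    # 5-fold cross-validation plays the validation role
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```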

40-42
(Image-only slides; no transcript)
43
Some SVM Software
  • LIBSVM
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • SVM Light
  • http://svmlight.joachims.org/
  • TinySVM
  • http://chasen.org/~taku/software/TinySVM/
  • WEKA
  • http://www.cs.waikato.ac.nz/ml/weka/
  • Has many ML algorithm implementations in Java
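
For a quick start in Python: scikit-learn's SVC is built on LIBSVM
(listed above), so it is an easy way to try these ideas (toy data,
not from the slides):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors found by LIBSVM
print(clf.score(X, y))              # training accuracy
```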

44
MNIST: An SVM Success Story
  • Handwritten character benchmark
  • 60,000 training and 10,000 testing examples
  • Dimension: d = 28 x 28 = 784

45
Results on Test Data
SVM used a polynomial kernel of degree 9.
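
A small-scale sketch in the same spirit, using scikit-learn's 8x8
digits as a stand-in for MNIST (so the score will not match the
slide's 28x28 results):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X / 16.0, y, random_state=0)
clf = SVC(kernel='poly', degree=9, coef0=1).fit(Xtr, ytr)  # degree-9 polynomial kernel
print(clf.score(Xte, yte))                                 # test accuracy
```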
46
SVM (Kernel) Model Structure