Support Vector Machines (SVM)
Instructor: Saeed Shiry

1
Support Vector Machines (SVM)
  • Instructor: Saeed Shiry

2
Introduction
  • SVM is a classification method that belongs to the family of kernel methods.
  • SVM was introduced in 1992 by Vapnik and is grounded in statistical learning theory.
  • SVM became well known when, using raw pixel images as input, it reached accuracy comparable to carefully engineered neural networks on a handwriting recognition task (a test error rate of about 1.1%).

3
Introduction
  • The method is widely used for recognition and classification tasks (in areas ranging from text classification to genomic data).
  • The underlying idea is simple.
  • It performs well in many practical applications.
  • It is less prone to overfitting than many other classifiers.

4
The main idea
  • If the training data are linearly separable, find the separating hyperplane with the maximum margin, i.e. the one whose distance to the nearest examples of either class is as large as possible.
  • If the training data are not linearly separable, map them into a higher-dimensional feature space in which they become linearly separable, and find the maximum-margin hyperplane there.

5
Definition
  • Support Vector Machines are a system for
    efficiently training linear learning machines in
    kernel-induced feature spaces, while respecting
    the insights of generalisation theory and
    exploiting optimisation theory.
Cristianini & Shawe-Taylor (2000)

6
Linear Discrimination
  • If the two classes of data can be separated by a linear boundary, a linear discriminant (a separating hyperplane) can serve as the classifier.
  • Many different hyperplanes can separate the same training data, so a criterion is needed for choosing among them.
  • The next slides build the intuition behind that choice.

7
Intuitions
[Figure: training points of two classes, marked O and X, scattered in the plane.]
8
Intuitions
[Figure: the same O and X points with a candidate separating line drawn between the classes.]
9
Intuitions
[Figure: the same O and X points with a different candidate separating line.]
10
Intuitions
[Figure: the same O and X points with yet another candidate separating line.]
11
A Good Separator
[Figure: the O and X points with a separating line that passes well between the two classes.]
12
Noise in the Observations
[Figure: the O and X points, each surrounded by a small region of uncertainty representing observation noise.]
13
Ruling Out Some Separators
[Figure: the noisy O and X points; candidate separators that pass too close to the data points are ruled out.]
14
Lots of Noise
[Figure: the O and X points with large noise regions, leaving little room for a separating line.]
15
Maximizing the Margin
[Figure: the O and X points with the separating line placed so that the margin to the nearest points of both classes is as large as possible.]
16
The separating boundary
  • In two dimensions the boundary that separates the two classes is a line.
  • In an n-dimensional input space the separating boundary is a hyperplane.

17
Which of these separators is best?
  • Each of these lines (separators) classifies the training data correctly, but which one will perform best on unseen data?
  • In an n-dimensional space there are infinitely many such separating hyperplanes.

18
SVM's answer to this question
  • The maximum-margin hyperplane:
  • the hyperplane whose distance to the closest training examples of either class is as large as possible, so that the widest possible band separates the two classes.
  • Such a hyperplane is expected to classify unseen (test) data more reliably than the other separators.

19
Support vectors
  • In this formulation only the training examples that lie closest to the separating hyperplane determine where it is placed; these examples are called the support vectors, and the remaining training points have no influence on the final decision boundary.

20
Why maximize the margin?
  • Intuitively it feels like the safest choice.
  • Theoretical results based on the VC dimension support this choice.
  • Empirically the approach works very well.

21
Margin and support vectors
  • The training examples that lie closest to the decision boundary, on the edge of the margin, are the support vectors.

[Figure: a separating hyperplane in the (X1, X2) plane; the points marked SV on the margin are the support vectors, with Class 1 on one side and Class -1 on the other.]
22
Generalization and SVM
  • From the statistical learning point of view, the strength of SVM lies in its generalization ability.
  • Very high-dimensional models are normally prone to overfitting; in SVM this is kept in check by the optimization criterion being solved.
  • Maximizing the margin yields a classifier that generalizes well to new data.

23
Setting up a two-class linear classifier
  • Training data:
  • x ∈ R^n
  • y ∈ {-1, +1}
  • Decision function:
  • f(x) = sign(<w, x> + b)
  • w ∈ R^n
  • b ∈ R
  • Separating hyperplane:
  • <w, x> + b = 0
  • w1 x1 + w2 x2 + … + wn xn + b = 0
  • The parameters w and b have to be learned from the training data.
  • Each training example should end up on the correct side of the hyperplane.
  • In general many hyperplanes satisfy this, so an additional criterion, the maximum margin, is used to choose among them (a small numeric sketch of the decision rule follows this slide).
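To make the decision rule above concrete, here is a minimal numeric sketch of f(x) = sign(<w, x> + b); the weight vector, bias and test points are made-up values for illustration, not anything computed in these slides.

    import numpy as np

    # Hypothetical parameters of a separating hyperplane <w, x> + b = 0
    w = np.array([2.0, -1.0])
    b = -1.0

    def decide(x):
        """Linear decision rule: sign(<w, x> + b)."""
        return int(np.sign(w @ x + b))

    # A few made-up test points
    for x in [np.array([2.0, 1.0]), np.array([0.0, 0.0]), np.array([1.0, 3.0])]:
        print(x, "->", decide(x))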

24
Linear SVM Mathematically
  • Let the training set {(xi, yi)}, i = 1..n, with xi ∈ R^d and yi ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
  • For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, the distance between each xs and the hyperplane is r = ys (w^T xs + b) / ||w|| = 1 / ||w||.
  • Then the margin can be expressed through the (rescaled) w and b as ρ = 2r = 2 / ||w||.

w^T xi + b ≤ -ρ/2   if yi = -1
w^T xi + b ≥ +ρ/2   if yi = +1

Equivalently:  yi (w^T xi + b) ≥ ρ/2
25
The linear classifier, geometrically
  • The set of points x with f(x) = <w, x> + b = 0 forms the separating hyperplane.
  • The sign of f(x) tells us on which side of the hyperplane a point x lies.

[Figure: the hyperplane f(x) = 0 in the (X1, X2) plane, with the half-space f(x) > 0 on one side and f(x) < 0 on the other; the weight vector w is perpendicular to the separating hyperplane.]
26
Defining the margin planes
  • Plus-plane:  {x : w · x + b = +1}
  • Minus-plane: {x : w · x + b = -1}
  • Classify as:
  • -1 if w · x + b < -1
  • +1 if w · x + b > +1

27
Computing the margin width
  • How wide is the band between the plus-plane and the minus-plane?
  • Plus-plane:  {x : w · x + b = +1}
  • Minus-plane: {x : w · x + b = -1}
  • The vector w is perpendicular to the plus-plane (and to the minus-plane).
  • Let x- be any point on the minus-plane, and let x+ be the point on the plus-plane closest to x-.

28
Computing the margin width
  • The line from x- to x+ is perpendicular to the two planes, so it is parallel to w; to get from x- to x+ we travel some distance in the direction of w.
  • Therefore we can write:
  • x+ = x- + λ w   for some value of λ.

29
Computing the margin width
  • What we know:
  • w · x+ + b = +1
  • w · x- + b = -1
  • x+ = x- + λ w
  • |x+ - x-| = M
  • It is now easy to express M, the margin width, in terms of w and b.

30
Computing the margin width
  • w · x+ + b = +1
  • w · x- + b = -1
  • x+ = x- + λ w
  • |x+ - x-| = M

w · (x- + λ w) + b = 1
w · x- + λ (w · w) + b = 1
-1 + λ (w · w) = 1
λ = 2 / (w · w)
31
Computing the margin width
M = |x+ - x-| = λ |w| = λ √(w · w) = 2 / √(w · w) = 2 / ||w||
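A quick numeric check of this formula with a made-up weight vector: stepping from a point on the minus-plane to the plus-plane along w should cover a distance of exactly 2/||w||.

    import numpy as np

    w = np.array([3.0, 4.0])    # hypothetical weight vector, ||w|| = 5
    b = 0.5

    x_minus = np.array([0.0, (-1.0 - b) / w[1]])   # a point with w.x + b = -1
    lam = 2.0 / (w @ w)
    x_plus = x_minus + lam * w                     # should satisfy w.x + b = +1

    print(w @ x_plus + b)                          # ~ +1.0
    print(np.linalg.norm(x_plus - x_minus))        # the margin width M
    print(2.0 / np.linalg.norm(w))                 # 2/||w||, the same value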
32
The constraints
  • If we require the training points of the two classes to lie on or outside the +1 and -1 planes, the constraints are:
  • <w, xi> + b ≥ +1   for yi = +1
  • <w, xi> + b ≤ -1   for yi = -1
  • These two can be combined into a single constraint:
  • yi (<w, xi> + b) ≥ 1   for all i

33
Learning as an optimization problem
  • In an SVM, training amounts to solving the following optimization problem:
  • given training data (xi, yi), i = 1, 2, …, N, with yi ∈ {1, -1},
  • Minimise ||w||²
  • Subject to yi (<w, xi> + b) ≥ 1 for all i
  • Note that ||w||² = w^T w
  • This is a quadratic programming (QP) problem with linear inequality constraints; well-studied algorithms exist for finding its solution (a sketch with a generic solver follows this slide).
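As a rough illustration of this primal problem, the sketch below hands the objective ||w||² and the constraints yi(<w, xi> + b) ≥ 1 to a general-purpose solver (scipy's minimize, assumed available); the four training points are invented for the example, and this is not the specialised QP machinery discussed on the following slides.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical, linearly separable 2-D training data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    # Pack the variables as theta = [w1, w2, b]
    def objective(theta):
        w = theta[:2]
        return 0.5 * (w @ w)   # minimising ||w||^2 (the factor 1/2 does not change the minimiser)

    constraints = [
        {"type": "ineq", "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0}
        for i in range(len(y))
    ]

    res = minimize(objective, x0=np.zeros(3), constraints=constraints)
    w, b = res.x[:2], res.x[2]
    print("w =", w, " b =", b)
    print("margin = 2/||w|| =", 2 / np.linalg.norm(w))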

34
Quadratic Programming
35
Recap of Constrained Optimization
  • Suppose we want to minimize f(x) subject to g(x) = 0
  • A necessary condition for x0 to be a solution:  ∇f(x0) + α ∇g(x0) = 0
  • α: the Lagrange multiplier
  • For multiple constraints gi(x) = 0, i = 1, …, m, we need a Lagrange multiplier αi for each of the constraints:  ∇f(x0) + Σi αi ∇gi(x0) = 0

36
Recap of Constrained Optimization
  • The case of an inequality constraint gi(x) ≤ 0 is similar, except that the Lagrange multiplier αi should be non-negative
  • If x0 is a solution to the constrained optimization problem
      min f(x)  subject to  gi(x) ≤ 0, i = 1, …, m
  • There must exist αi ≥ 0 for i = 1, …, m such that x0 satisfies  ∇f(x0) + Σi αi ∇gi(x0) = 0
  • The function L(x, α) = f(x) + Σi αi gi(x) is also known as the Lagrangian; we want to set its gradient to 0

37
Solving the optimization problem
  • Construct and minimise the Lagrangian
      L(w, b, α) = ½ ||w||² - Σi αi [ yi (<w, xi> + b) - 1 ],   αi ≥ 0
  • Take derivatives w.r.t. w and b and equate them to 0:
      ∂L/∂w = 0  ⇒  w = Σi αi yi xi
      ∂L/∂b = 0  ⇒  Σi αi yi = 0
  • The Lagrange multipliers αi are called dual variables
  • Each training point has an associated dual variable
  • The parameters w are expressed as a linear combination of training points
  • Only the support vectors (SVs) will have non-zero αi

38
Solving the optimization problem: a geometric view

[Figure: training points of Class 1 and Class 2 with the separating hyperplane; each point is labelled with its dual variable. Most points have αi = 0 (α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0); only the support vectors on the margin have non-zero values (α1 = 0.8, α6 = 1.4, α8 = 0.6).]
39
The Dual Problem
  • If we substitute w = Σi αi yi xi into the Lagrangian L(w, b, α), we have
  • Note that Σi αi yi = 0, so the term involving b vanishes
  • This leaves a function of the αi only

40
The Dual Problem
  • The new objective function is in terms of the αi only
  • It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
  • The original problem is known as the primal problem
  • The objective function of the dual problem needs to be maximized!
  • The dual problem is therefore:
      maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
      subject to  αi ≥ 0        (the property of the Lagrange multipliers we introduced)
                  Σi αi yi = 0  (the result of differentiating the original Lagrangian w.r.t. b)
41
The Dual Problem
  • This is a quadratic programming (QP) problem
  • A global maximum over the αi can always be found
  • w can be recovered by  w = Σi αi yi xi

42
??? ?? ??????
  • So,
  • Plug this back into the Lagrangian to obtain the
    dual formulation
  • The resulting dual that is solved for ? by using
    a QP solver
  • The b does not appear in the dual so it is
    determined separatelyfrom the initial
    constraints

Data enters only in the form of dot products!
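Because the data appear only through dot products, the dual objective is easy to write down directly; the sketch below evaluates W(α) for made-up data and a feasible (not optimal) α, just to show the shape of the computation, and recovers w from α.

    import numpy as np

    def dual_objective(alpha, X, y):
        """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
        G = (X @ X.T) * np.outer(y, y)   # Gram matrix of dot products, weighted by the labels
        return alpha.sum() - 0.5 * alpha @ G @ alpha

    # Hypothetical data and a feasible alpha (alpha >= 0 and sum_i alpha_i y_i = 0)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    alpha = np.array([0.25, 0.0, 0.25, 0.0])
    print(dual_objective(alpha, X, y))

    # Once the optimal alpha is known, w is recovered as w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X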
43
Classifying new examples
  • Once the solution (α, b) of the quadratic programming problem has been found, the SVM decision function can be used to classify previously unseen examples.
  • A new example x is assigned to one of the two classes according to
  • sign f(x, α, b),  where  f(x, α, b) = Σi αi yi <xi, x> + b

Data enters only in the form of dot products!
44
A useful property of the solution
  • The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enter only in the form of dot products!
  • Dot product (notation refresher): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x · y = x1 y1 + x2 y2 + … + xn yn.
  • This is nice because it allows us to make SVMs non-linear without complicating the algorithm

45
The Quadratic Programming Problem
  • Many approaches have been proposed
LOQO, CPLEX, etc.
  • Most are interior-point methods
  • Start with an initial solution that can violate
    the constraints
  • Improve this solution by optimizing the objective
    function and/or reducing the amount of constraint
    violation
  • For SVM, sequential minimal optimization (SMO)
    seems to be the most popular
  • A QP with two variables is trivial to solve
  • Each iteration of SMO picks a pair (αi, αj) and solves the QP with just these two variables; this is repeated until convergence
  • In practice, we can just regard the QP solver as
    a black-box without bothering how it works

46
What about data that are not linearly separable?
  • The discussion so far assumed that the training data can be separated by a hyperplane. In practice this is often not the case, so the basic method has to be extended, as described on the following slides.

47
Using slack variables
  • Instead of requiring every training example to satisfy the margin constraints exactly, we allow some examples to violate them, and even to be misclassified!
  • To do this, a slack variable ξi is introduced for each training example; it measures how far the example falls on the wrong side of the margin defined by w^T x + b.

48
Using slack variables
  • For each training example xi, i = 1, 2, …, N, the constraint used so far was
  • yi (<w, xi> + b) ≥ 1
  • With slack variables it becomes
  • yi (<w, xi> + b) ≥ 1 - ξi ,   ξi ≥ 0
  • In this way a certain amount of error on the training data is tolerated.

49
  • To keep the slack variables small, a penalty term is added to the quantity being minimized, which becomes
      ½ w^T w + C Σi ξi
  • where C > 0 is a constant that controls the trade-off between maximizing the margin and keeping the total slack small.

50
  • The dual of this problem is obtained in the same way as before.
  • The only difference is that the constant C now appears as an upper bound on the αi.

Find αi that maximize
    Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
subject to
    0 ≤ αi ≤ C   and   Σi αi yi = 0
51
Soft Margin Hyperplane
  • If we minimize ½ w^T w + C Σi ξi, each ξi can be computed as
      ξi = max(0, 1 - yi (w^T xi + b))
  • The ξi are slack variables of the optimization
  • Note that ξi = 0 if there is no error for xi
  • Σi ξi is an upper bound on the number of training errors
  • C: trade-off parameter between the training error and the margin
  • The optimization problem becomes
      minimize  ½ w^T w + C Σi ξi
      subject to  yi (w^T xi + b) ≥ 1 - ξi ,   ξi ≥ 0
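The soft-margin objective can also be attacked directly: substituting ξi = max(0, 1 - yi(w^T xi + b)) turns it into an unconstrained hinge-loss problem. The sketch below is a crude subgradient-descent version on made-up data, shown only to put the objective into code; it is not the QP/SMO approach used in these slides.

    import numpy as np

    # Hypothetical training data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.5], [0.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C, lr, epochs = 1.0, 0.01, 2000

    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # examples whose slack xi_i would be positive
        # Subgradient of 1/2 w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b

    print("w =", w, " b =", b)
    print("slacks =", np.maximum(0.0, 1.0 - y * (X @ w + b)))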

52
The Optimization Problem
  • The dual of this new constrained optimization problem is
      maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
      subject to  0 ≤ αi ≤ C,   Σi αi yi = 0
  • w is recovered as  w = Σi αi yi xi
  • This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αi
  • Once again, a QP solver can be used to find the αi

53
Mapping non-separable data to a feature space
  • By mapping the data into another (usually higher-dimensional) space, data that are not linearly separable in the input space may become linearly separable.

54
Mapping the data to a feature space

[Figure: a non-linear map φ(.) takes points from the input space to a feature space; note that in practice the feature space is of higher dimension than the input space.]

  • Computation in the feature space can be costly because it is high dimensional.
  • In general the feature space can even be infinite dimensional.
  • To get around this problem, the kernel trick is used.

55
The problem with a direct mapping
  • Computing the mapping explicitly and working in the high-dimensional feature space can be computationally infeasible.
  • Moreover, with many features and comparatively few training examples, the classifier faces the curse of dimensionality.

56
The solution: kernel functions
  • We will introduce Kernels
  • Solve the computational problem of working with
    many dimensions
  • Can make it possible to use infinite dimensions
  • efficiently in time / space
  • Other advantages, both practical and conceptual

57
????
  • Transform x ? ?(x)
  • The linear algorithm depends only on xxi, hence
    transformed algorithm depends only on ?(x)?(xi)
  • Use kernel function K(xi,xj) such that K(xi,xj)
    ?(x)?(xi)

58
An Example for φ(.) and K(.,.)
  • Suppose φ(.) is given as follows:
      φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)
  • An inner product in the feature space is then
      <φ([x1, x2]), φ([y1, y2])> = (1 + x1 y1 + x2 y2)²
  • So, if we define the kernel function as
      K(x, y) = (1 + x1 y1 + x2 y2)²
    there is no need to carry out φ(.) explicitly
  • This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
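A small check of this identity in code: the explicit feature map below is the assumed form of φ(.) for 2-D inputs, and the two printed values should agree, showing that the kernel computes the feature-space inner product without ever forming φ(.).

    import numpy as np

    def phi(v):
        """Explicit degree-2 polynomial feature map for a 2-D input (assumed form)."""
        x1, x2 = v
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    def K(u, v):
        """Polynomial kernel of degree 2."""
        return (1.0 + u @ v) ** 2

    u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(u) @ phi(v))   # inner product computed in the feature space
    print(K(u, v))           # the same value, without computing phi explicitly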

59
Some common kernel functions
60
Examples of kernel functions
  • Polynomial kernel:  K(x, y) = (x · y + 1)^p
  • Gaussian (RBF) kernel:  K(x, y) = exp(-||x - y||² / (2σ²))
61
Modification Due to Kernel Function
  • Change all inner products to kernel functions
  • For training:

Original:              maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
                       subject to  C ≥ αi ≥ 0,  Σi αi yi = 0
With kernel function:  maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)
                       subject to  C ≥ αi ≥ 0,  Σi αi yi = 0
62
Modification Due to Kernel Function
  • For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0

Original:              w = Σi αi yi xi ,   f = <w, z> + b = Σi αi yi <xi, z> + b
With kernel function:  f = Σi αi yi K(xi, z) + b   (w need not be formed explicitly)
63
Modularity
  • Any kernel-based learning algorithm is composed of two modules:
  • a general-purpose learning machine
  • a problem-specific kernel function
  • Any kernel-based algorithm can be fitted with any kernel
  • Kernels themselves can be constructed in a modular way
  • Great for software engineering (and for analysis)

64
Building new kernels
  • New kernels can be constructed from existing ones:
  • If K, K′ are kernels, then
  • K K′ (the product) is a kernel
  • cK is a kernel, if c > 0
  • aK + bK′ is a kernel, for a, b > 0
  • etc.
  • In this way, complex kernels that are well suited to a particular problem can be built from simple ones.
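These closure rules can be spot-checked numerically: a valid kernel must produce a positive semidefinite Gram matrix, so the smallest eigenvalue of each combined Gram matrix below should be non-negative (up to round-off). The sample points and the two base kernels are arbitrary choices made only for the check.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))                               # hypothetical sample points

    def gram(kernel):
        return np.array([[kernel(a, b) for b in X] for a in X])

    K1 = gram(lambda a, b: (a @ b + 1.0) ** 2)                # polynomial kernel
    K2 = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2)))     # RBF-style kernel
    for name, K in [("K1 + K2", K1 + K2), ("3*K1", 3.0 * K1), ("K1*K2", K1 * K2)]:
        print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())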

65
Example
  • Suppose we have 5 one-dimensional data points:
  • x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with {1, 2, 6} as class 1 and {4, 5} as class 2, i.e. y1 = 1, y2 = 1, y3 = -1, y4 = -1, y5 = 1
  • We use the polynomial kernel of degree 2:
  • K(x, y) = (xy + 1)²
  • C is set to 100
  • We first find αi (i = 1, …, 5) by maximizing
      Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)
      subject to  0 ≤ αi ≤ 100  and  Σi αi yi = 0

66
Example
  • By using a QP solver, we get
  • α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
  • Note that the constraints are indeed satisfied
  • The support vectors are x2 = 2, x4 = 5, x5 = 6
  • The discriminant function is
      f(z) = 2.5 (2z + 1)² - 7.333 (5z + 1)² + 4.833 (6z + 1)² + b
           = 0.6667 z² - 5.333 z + b
  • b is recovered by solving f(2) = 1, or f(5) = -1, or f(6) = 1, since x2 and x5 lie on the margin line f(x) = +1 and x4 lies on the line f(x) = -1
  • All three give b = 9
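The few lines below re-evaluate this discriminant function from the reported αi and b = 9; since the αi above are rounded, the values at the support vectors come out close to, rather than exactly, ±1.

    import numpy as np

    # Data and (rounded) dual solution as reported on the slide
    x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
    alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
    b = 9.0

    K = lambda u, z: (u * z + 1.0) ** 2      # polynomial kernel of degree 2

    def f(z):
        return np.sum(alpha * y * K(x, z)) + b

    for z in x:
        print("f(%g) = %+.3f" % (z, f(z)))
    # Expected: f(1), f(2), f(6) > 0 (class 1) and f(4), f(5) < 0 (class 2),
    # with f(2), f(6) near +1 and f(5) near -1, since these are the support vectors.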

67
Example
[Figure: the value of the discriminant function f(z) plotted along the input axis; the points 1, 2 and 6 fall on the class 1 side and the points 4 and 5 on the class 2 side.]
68
????? ?? ??????
  • ????? ???? ??? ????
  • ?? ????? ??? ?????? ?? ??????? ?? ??? ??? ???????
    ??? ?? ????? ?? ???? 4 ?????.

69
Steps for classification with an SVM
  • Prepare the data matrix
  • Select the kernel function to use
  • Execute the training algorithm, using a QP solver to obtain the αi values
  • Unseen data can then be classified using the αi values
    and the support vectors
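As a rough end-to-end illustration of these four steps, the sketch below uses scikit-learn's SVC (assuming the library is installed) on the 1-D example from the earlier slides; with gamma = 1 and coef0 = 1, its polynomial kernel matches (xy + 1)².

    import numpy as np
    from sklearn.svm import SVC

    # Step 1: prepare the data matrix (one feature per column)
    X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
    y = np.array([1, 1, -1, -1, 1])

    # Steps 2 and 3: select the kernel and train (the QP is solved internally)
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
    clf.fit(X, y)

    print("support vectors:", clf.support_vectors_.ravel())
    print("dual coefficients (alpha_i * y_i):", clf.dual_coef_.ravel())

    # Step 4: classify unseen data
    print(clf.predict(np.array([[1.5], [4.5]])))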

70
Choosing the kernel function
  • Choosing the kernel function is probably the trickiest part of using an SVM.
  • Many principles have been proposed for designing kernels suited to particular kinds of data:
  • diffusion kernel, Fisher kernel, string kernel, ...
  • but there is still no general recipe for picking the best kernel for a given problem.
  • In practice:
  • In practice, a low degree polynomial kernel or
    RBF kernel with a reasonable width is a good
    initial try
  • Note that SVM with RBF kernel is closely related
    to RBF neural networks, with the centers of the
    radial basis functions automatically chosen for
    SVM

71
SVM applications
  • SVMs were originally proposed by Boser, Guyon and
    Vapnik in 1992 and gained increasing popularity
    in late 1990s.
  • SVMs are currently among the best performers for
    a number of classification tasks ranging from
    text to genomic data.
  • SVMs can be applied to complex data types beyond
    feature vectors (e.g. graphs, sequences,
    relational data) by designing kernel functions
    for such data.
  • SVM techniques have been extended to a number of tasks such as regression [Vapnik et al., 1997], principal component analysis [Schölkopf et al., 1999], etc.
  • Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi at a time, e.g. SMO [Platt, 1999] and [Joachims, 1999]
  • Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.

72
Strengths and weaknesses of SVM
  • Strengths
  • Training is relatively easy
  • Good generalization in theory and practice
  • Works well even with few training instances
  • Finds the globally optimal model (no local optima), unlike neural networks
  • It scales relatively well to high dimensional
    data
  • Tradeoff between classifier complexity and error
    can be controlled explicitly
  • Non-traditional data like strings and trees can
    be used as input to SVM, instead of feature
    vectors
  • Weaknesses
  • Need to choose a good kernel function.

73
Summary
  • SVMs find optimal linear separator
  • They pick the hyperplane that maximises the
    margin
  • The optimal hyperplane turns out to be a linear
    combination of support vectors
  • The kernel trick makes SVMs non-linear learning
    algorithms
  • Transform nonlinear problems into a higher-dimensional space using kernel functions; in the transformed space there is a better chance that the classes will be linearly separable.

74
Other issues with SVM
  • How to use SVM for multi-class classification?
  • One can change the QP formulation to become
    multi-class
  • More often, multiple binary classifiers are
    combined
  • One can train multiple one-versus-all
    classifiers, or combine multiple pairwise
    classifiers intelligently
  • How to interpret the SVM discriminant function
    value as probability?
  • By performing logistic regression on the SVM
    output of a set of data (validation set) that is
    not used for training
  • Some SVM software (like libsvm) has these features built in
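A rough sketch of the one-versus-all combination mentioned above, using scikit-learn's binary SVC as the building block; the three-class toy data and the RBF kernel are made up for the illustration.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 2.0], [0.2, 2.1]])
    y = np.array([0, 0, 1, 1, 2, 2])                  # three classes

    classifiers = {}
    for c in np.unique(y):
        yc = np.where(y == c, 1, -1)                  # class c versus the rest
        classifiers[c] = SVC(kernel="rbf", C=1.0).fit(X, yc)

    def predict(x):
        # Pick the class whose binary SVM gives the largest decision value
        scores = {c: clf.decision_function([x])[0] for c, clf in classifiers.items()}
        return max(scores, key=scores.get)

    print([predict(x) for x in X])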

75
Multi-class Classification
  • SVM is basically a two-class classifier
  • One can change the QP formulation to allow
    multi-class classification
  • More commonly, the data set is divided into two
    parts intelligently in different ways and a
    separate SVM is trained for each way of division
  • Multi-class classification is done by combining
    the output of all the SVM classifiers
  • Majority rule
  • Error correcting code
  • Directed acyclic graph

76
Available software
  • A list of SVM implementations can be found at
  • http://www.kernel-machines.org/software.html
  • Some implementations (such as LIBSVM) can also handle multi-class classification.
  • SVMLight is among the earliest and most widely used implementations of SVM.
  • Several Matlab toolboxes for SVM are also available.

77
References
  • [1] B. E. Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, pp. 144-152, Pittsburgh, 1992.
  • [2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
  • [3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.