Support Vector Machines (SVM)
Instructor: Saeed Shiry

1
Support Vector Machines (SVM)
  • Instructor: Saeed Shiry

2
Introduction
  • SVM is a classification method that belongs to the family of kernel methods.
  • SVM was introduced in 1992 by Vapnik and is grounded in statistical learning theory.
  • SVM became well known when, using raw pixel images as input, it reached accuracy comparable to carefully engineered neural networks on a handwriting recognition task (a test error rate of about 1.1%).

3
Introduction
  • The method is widely used for recognition and classification tasks (in areas ranging from text classification to genomic data).
  • The underlying idea is simple.
  • It performs well in many practical applications.
  • It is less prone to overfitting than many other classifiers.

4
The main idea
  • If the training data are linearly separable, find the separating hyperplane with the maximum margin, i.e. the one whose distance to the nearest examples of either class is as large as possible.
  • If the training data are not linearly separable, map them into a higher-dimensional feature space in which they become linearly separable, and find the maximum-margin hyperplane there.

5
Definition
  • Support Vector Machines are a system for
    efficiently training linear learning machines in
    kernel-induced feature spaces, while respecting
    the insights of generalisation theory and
    exploiting optimisation theory.
Cristianini & Shawe-Taylor (2000)

6
Linear Discrimination
  • If the two classes of data can be separated by a linear boundary, a linear discriminant (a separating hyperplane) can serve as the classifier.
  • Many different hyperplanes can separate the same training data, so a criterion is needed for choosing among them.
  • The next slides build the intuition behind that choice.

7
Intuitions
[Figure: training points of two classes, marked O and X, scattered in the plane.]
8
Intuitions
[Figure: the same O and X points with a candidate separating line drawn between the classes.]
9
Intuitions
[Figure: the same O and X points with a different candidate separating line.]
10
Intuitions
[Figure: the same O and X points with yet another candidate separating line.]
11
A Good Separator
[Figure: the O and X points with a separating line that passes well between the two classes.]
12
Noise in the Observations
[Figure: the O and X points, each surrounded by a small region of uncertainty representing observation noise.]
13
Ruling Out Some Separators
[Figure: the noisy O and X points; candidate separators that pass too close to the data points are ruled out.]
14
Lots of Noise
[Figure: the O and X points with large noise regions, leaving little room for a separating line.]
15
Maximizing the Margin
[Figure: the O and X points with the separating line placed so that the margin to the nearest points of both classes is as large as possible.]
16
The separating boundary
  • In two dimensions the boundary that separates the two classes is a line.
  • In an n-dimensional input space the separating boundary is a hyperplane.

17
Which of these separators is best?
  • Each of these lines (separators) classifies the training data correctly, but which one will perform best on unseen data?
  • In an n-dimensional space there are infinitely many such separating hyperplanes.

18
SVM's answer to this question
  • The maximum-margin hyperplane:
  • the hyperplane whose distance to the closest training examples of either class is as large as possible, so that the widest possible band separates the two classes.
  • Such a hyperplane is expected to classify unseen (test) data more reliably than the other separators.

19
Support vectors
  • In this formulation only the training examples that lie closest to the separating hyperplane determine where it is placed; these examples are called the support vectors, and the remaining training points have no influence on the final decision boundary.

20
Why maximize the margin?
  • Intuitively it feels like the safest choice.
  • Theoretical results based on the VC dimension support this choice.
  • Empirically the approach works very well.

21
Margin and support vectors
  • The training examples that lie closest to the decision boundary, on the edge of the margin, are the support vectors.

[Figure: a separating hyperplane in the (X1, X2) plane; the points marked SV on the margin are the support vectors, with Class 1 on one side and Class -1 on the other.]
22
Generalization and SVM
  • From the statistical learning point of view, the strength of SVM lies in its generalization ability.
  • Very high-dimensional models are normally prone to overfitting; in SVM this is kept in check by the optimization criterion being solved.
  • Maximizing the margin yields a classifier that generalizes well to new data.

23
Setting up a two-class linear classifier
  • Training data:
  • x ∈ R^n
  • y ∈ {-1, +1}
  • Decision function:
  • f(x) = sign(<w, x> + b)
  • w ∈ R^n
  • b ∈ R
  • Separating hyperplane:
  • <w, x> + b = 0
  • w1 x1 + w2 x2 + … + wn xn + b = 0
  • The parameters w and b have to be learned from the training data.
  • Each training example should end up on the correct side of the hyperplane.
  • In general many hyperplanes satisfy this, so an additional criterion, the maximum margin, is used to choose among them (a small numeric sketch of the decision rule follows this slide).
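To make the decision rule above concrete, here is a minimal numeric sketch of f(x) = sign(<w, x> + b); the weight vector, bias and test points are made-up values for illustration, not anything computed in these slides.

    import numpy as np

    # Hypothetical parameters of a separating hyperplane <w, x> + b = 0
    w = np.array([2.0, -1.0])
    b = -1.0

    def decide(x):
        """Linear decision rule: sign(<w, x> + b)."""
        return int(np.sign(w @ x + b))

    # A few made-up test points
    for x in [np.array([2.0, 1.0]), np.array([0.0, 0.0]), np.array([1.0, 3.0])]:
        print(x, "->", decide(x))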

24
Linear SVM Mathematically
  • Let the training set {(xi, yi)}, i = 1..n, with xi ∈ R^d and yi ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):
  • For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, the distance between each xs and the hyperplane is r = ys (w^T xs + b) / ||w|| = 1 / ||w||.
  • Then the margin can be expressed through the (rescaled) w and b as ρ = 2r = 2 / ||w||.

w^T xi + b ≤ -ρ/2   if yi = -1
w^T xi + b ≥ +ρ/2   if yi = +1

Equivalently:  yi (w^T xi + b) ≥ ρ/2
25
The linear classifier, geometrically
  • The set of points x with f(x) = <w, x> + b = 0 forms the separating hyperplane.
  • The sign of f(x) tells us on which side of the hyperplane a point x lies.

[Figure: the hyperplane f(x) = 0 in the (X1, X2) plane, with the half-space f(x) > 0 on one side and f(x) < 0 on the other; the weight vector w is perpendicular to the separating hyperplane.]
26
Defining the margin planes
  • Plus-plane:  {x : w · x + b = +1}
  • Minus-plane: {x : w · x + b = -1}
  • Classify as:
  • -1 if w · x + b < -1
  • +1 if w · x + b > +1

27
Computing the margin width
  • How wide is the band between the plus-plane and the minus-plane?
  • Plus-plane:  {x : w · x + b = +1}
  • Minus-plane: {x : w · x + b = -1}
  • The vector w is perpendicular to the plus-plane (and to the minus-plane).
  • Let x- be any point on the minus-plane, and let x+ be the point on the plus-plane closest to x-.

28
Computing the margin width
  • The line from x- to x+ is perpendicular to the two planes, so it is parallel to w; to get from x- to x+ we travel some distance in the direction of w.
  • Therefore we can write:
  • x+ = x- + λ w   for some value of λ.

29
Computing the margin width
  • What we know:
  • w · x+ + b = +1
  • w · x- + b = -1
  • x+ = x- + λ w
  • |x+ - x-| = M
  • It is now easy to express M, the margin width, in terms of w and b.

30
Computing the margin width
  • w · x+ + b = +1
  • w · x- + b = -1
  • x+ = x- + λ w
  • |x+ - x-| = M

w · (x- + λ w) + b = 1
w · x- + λ (w · w) + b = 1
-1 + λ (w · w) = 1
λ = 2 / (w · w)
31
Computing the margin width
M = |x+ - x-| = λ |w| = λ √(w · w) = 2 / √(w · w) = 2 / ||w||
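A quick numeric check of this formula with a made-up weight vector: stepping from a point on the minus-plane to the plus-plane along w should cover a distance of exactly 2/||w||.

    import numpy as np

    w = np.array([3.0, 4.0])    # hypothetical weight vector, ||w|| = 5
    b = 0.5

    x_minus = np.array([0.0, (-1.0 - b) / w[1]])   # a point with w.x + b = -1
    lam = 2.0 / (w @ w)
    x_plus = x_minus + lam * w                     # should satisfy w.x + b = +1

    print(w @ x_plus + b)                          # ~ +1.0
    print(np.linalg.norm(x_plus - x_minus))        # the margin width M
    print(2.0 / np.linalg.norm(w))                 # 2/||w||, the same value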
32
The constraints
  • If we require the training points of the two classes to lie on or outside the +1 and -1 planes, the constraints are:
  • <w, xi> + b ≥ +1   for yi = +1
  • <w, xi> + b ≤ -1   for yi = -1
  • These two can be combined into a single constraint:
  • yi (<w, xi> + b) ≥ 1   for all i

33
Learning as an optimization problem
  • In an SVM, training amounts to solving the following optimization problem:
  • given training data (xi, yi), i = 1, 2, …, N, with yi ∈ {1, -1},
  • Minimise ||w||²
  • Subject to yi (<w, xi> + b) ≥ 1 for all i
  • Note that ||w||² = w^T w
  • This is a quadratic programming (QP) problem with linear inequality constraints; well-studied algorithms exist for finding its solution (a sketch with a generic solver follows this slide).
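As a rough illustration of this primal problem, the sketch below hands the objective ||w||² and the constraints yi(<w, xi> + b) ≥ 1 to a general-purpose solver (scipy's minimize, assumed available); the four training points are invented for the example, and this is not the specialised QP machinery discussed on the following slides.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical, linearly separable 2-D training data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    # Pack the variables as theta = [w1, w2, b]
    def objective(theta):
        w = theta[:2]
        return 0.5 * (w @ w)   # minimising ||w||^2 (the factor 1/2 does not change the minimiser)

    constraints = [
        {"type": "ineq", "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0}
        for i in range(len(y))
    ]

    res = minimize(objective, x0=np.zeros(3), constraints=constraints)
    w, b = res.x[:2], res.x[2]
    print("w =", w, " b =", b)
    print("margin = 2/||w|| =", 2 / np.linalg.norm(w))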

34
Quadratic Programming
35
Recap of Constrained Optimization
  • Suppose we want to minimize f(x) subject to g(x) = 0
  • A necessary condition for x0 to be a solution:  ∇f(x0) + α ∇g(x0) = 0
  • α: the Lagrange multiplier
  • For multiple constraints gi(x) = 0, i = 1, …, m, we need a Lagrange multiplier αi for each of the constraints:  ∇f(x0) + Σi αi ∇gi(x0) = 0

36
Recap of Constrained Optimization
  • The case of an inequality constraint gi(x) ≤ 0 is similar, except that the Lagrange multiplier αi should be non-negative
  • If x0 is a solution to the constrained optimization problem
      min f(x)  subject to  gi(x) ≤ 0, i = 1, …, m
  • There must exist αi ≥ 0 for i = 1, …, m such that x0 satisfies  ∇f(x0) + Σi αi ∇gi(x0) = 0
  • The function L(x, α) = f(x) + Σi αi gi(x) is also known as the Lagrangian; we want to set its gradient to 0

37
Solving the optimization problem
  • Construct and minimise the Lagrangian
      L(w, b, α) = ½ ||w||² - Σi αi [ yi (<w, xi> + b) - 1 ],   αi ≥ 0
  • Take derivatives w.r.t. w and b and equate them to 0:
      ∂L/∂w = 0  ⇒  w = Σi αi yi xi
      ∂L/∂b = 0  ⇒  Σi αi yi = 0
  • The Lagrange multipliers αi are called dual variables
  • Each training point has an associated dual variable
  • The parameters w are expressed as a linear combination of training points
  • Only the support vectors (SVs) will have non-zero αi

38
Solving the optimization problem: a geometric view

[Figure: training points of Class 1 and Class 2 with the separating hyperplane; each point is labelled with its dual variable. Most points have αi = 0 (α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0); only the support vectors on the margin have non-zero values (α1 = 0.8, α6 = 1.4, α8 = 0.6).]
39
The Dual Problem
  • If we substitute w = Σi αi yi xi into the Lagrangian L(w, b, α), we have
  • Note that Σi αi yi = 0, so the term involving b vanishes
  • This leaves a function of the αi only

40
The Dual Problem
  • The new objective function is in terms of the αi only
  • It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
  • The original problem is known as the primal problem
  • The objective function of the dual problem needs to be maximized!
  • The dual problem is therefore:
      maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
      subject to  αi ≥ 0        (the property of the Lagrange multipliers we introduced)
                  Σi αi yi = 0  (the result of differentiating the original Lagrangian w.r.t. b)
41
The Dual Problem
  • This is a quadratic programming (QP) problem
  • A global maximum over the αi can always be found
  • w can be recovered by  w = Σi αi yi xi

42
??? ?? ??????
  • So,
  • Plug this back into the Lagrangian to obtain the
    dual formulation
  • The resulting dual that is solved for ? by using
    a QP solver
  • The b does not appear in the dual so it is
    determined separatelyfrom the initial
    constraints

Data enters only in the form of dot products!
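Because the data appear only through dot products, the dual objective is easy to write down directly; the sketch below evaluates W(α) for made-up data and a feasible (not optimal) α, just to show the shape of the computation, and recovers w from α.

    import numpy as np

    def dual_objective(alpha, X, y):
        """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
        G = (X @ X.T) * np.outer(y, y)   # Gram matrix of dot products, weighted by the labels
        return alpha.sum() - 0.5 * alpha @ G @ alpha

    # Hypothetical data and a feasible alpha (alpha >= 0 and sum_i alpha_i y_i = 0)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    alpha = np.array([0.25, 0.0, 0.25, 0.0])
    print(dual_objective(alpha, X, y))

    # Once the optimal alpha is known, w is recovered as w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X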
43
Classifying new examples
  • Once the solution (α, b) of the quadratic programming problem has been found, the SVM decision function can be used to classify previously unseen examples.
  • A new example x is assigned to one of the two classes according to
  • sign f(x, α, b),  where  f(x, α, b) = Σi αi yi <xi, x> + b

Data enters only in the form of dot products!
44
A useful property of the solution
  • The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enter only in the form of dot products!
  • Dot product (notation refresher): given x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the dot product of x and y is x · y = x1 y1 + x2 y2 + … + xn yn.
  • This is nice because it allows us to make SVMs non-linear without complicating the algorithm

45
The Quadratic Programming Problem
  • Many approaches have been proposed
LOQO, CPLEX, etc.
  • Most are interior-point methods
  • Start with an initial solution that can violate
    the constraints
  • Improve this solution by optimizing the objective
    function and/or reducing the amount of constraint
    violation
  • For SVM, sequential minimal optimization (SMO)
    seems to be the most popular
  • A QP with two variables is trivial to solve
  • Each iteration of SMO picks a pair (αi, αj) and solves the QP with just these two variables; this is repeated until convergence
  • In practice, we can just regard the QP solver as
    a black-box without bothering how it works

46
What about data that are not linearly separable?
  • The discussion so far assumed that the training data can be separated by a hyperplane. In practice this is often not the case, so the basic method has to be extended, as described on the following slides.

47
Using slack variables
  • Instead of requiring every training example to satisfy the margin constraints exactly, we allow some examples to violate them, and even to be misclassified!
  • To do this, a slack variable ξi is introduced for each training example; it measures how far the example falls on the wrong side of the margin defined by w^T x + b.

48
Using slack variables
  • For each training example xi, i = 1, 2, …, N, the constraint used so far was
  • yi (<w, xi> + b) ≥ 1
  • With slack variables it becomes
  • yi (<w, xi> + b) ≥ 1 - ξi ,   ξi ≥ 0
  • In this way a certain amount of error on the training data is tolerated.

49
  • To keep the slack variables small, a penalty term is added to the quantity being minimized, which becomes
      ½ w^T w + C Σi ξi
  • where C > 0 is a constant that controls the trade-off between maximizing the margin and keeping the total slack small.

50
  • The dual of this problem is obtained in the same way as before.
  • The only difference is that the constant C now appears as an upper bound on the αi.

Find αi that maximize
    Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
subject to
    0 ≤ αi ≤ C   and   Σi αi yi = 0
51
Soft Margin Hyperplane
  • If we minimize ½ w^T w + C Σi ξi, each ξi can be computed as
      ξi = max(0, 1 - yi (w^T xi + b))
  • The ξi are slack variables of the optimization
  • Note that ξi = 0 if there is no error for xi
  • Σi ξi is an upper bound on the number of training errors
  • C: trade-off parameter between the training error and the margin
  • The optimization problem becomes
      minimize  ½ w^T w + C Σi ξi
      subject to  yi (w^T xi + b) ≥ 1 - ξi ,   ξi ≥ 0
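The soft-margin objective can also be attacked directly: substituting ξi = max(0, 1 - yi(w^T xi + b)) turns it into an unconstrained hinge-loss problem. The sketch below is a crude subgradient-descent version on made-up data, shown only to put the objective into code; it is not the QP/SMO approach used in these slides.

    import numpy as np

    # Hypothetical training data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.5], [0.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C, lr, epochs = 1.0, 0.01, 2000

    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # examples whose slack xi_i would be positive
        # Subgradient of 1/2 w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b

    print("w =", w, " b =", b)
    print("slacks =", np.maximum(0.0, 1.0 - y * (X @ w + b)))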

52
The Optimization Problem
  • The dual of this new constrained optimization problem is
      maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
      subject to  0 ≤ αi ≤ C,   Σi αi yi = 0
  • w is recovered as  w = Σi αi yi xi
  • This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the αi
  • Once again, a QP solver can be used to find the αi

53
Mapping non-separable data to a feature space
  • By mapping the data into another (usually higher-dimensional) space, data that are not linearly separable in the input space may become linearly separable.

54
Mapping the data to a feature space

[Figure: a non-linear map φ(.) takes points from the input space to a feature space; note that in practice the feature space is of higher dimension than the input space.]

  • Computation in the feature space can be costly because it is high dimensional.
  • In general the feature space can even be infinite dimensional.
  • To get around this problem, the kernel trick is used.

55
The problem with a direct mapping
  • Computing the mapping explicitly and working in the high-dimensional feature space can be computationally infeasible.
  • Moreover, with many features and comparatively few training examples, the classifier faces the curse of dimensionality.

56
The solution: kernel functions
  • We will introduce Kernels
  • Solve the computational problem of working with
    many dimensions
  • Can make it possible to use infinite dimensions
  • efficiently in time / space
  • Other advantages, both practical and conceptual

57
????
  • Transform x ? ?(x)
  • The linear algorithm depends only on xxi, hence
    transformed algorithm depends only on ?(x)?(xi)
  • Use kernel function K(xi,xj) such that K(xi,xj)
    ?(x)?(xi)

58
An Example for φ(.) and K(.,.)
  • Suppose φ(.) is given as follows:
      φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2)
  • An inner product in the feature space is then
      <φ([x1, x2]), φ([y1, y2])> = (1 + x1 y1 + x2 y2)²
  • So, if we define the kernel function as
      K(x, y) = (1 + x1 y1 + x2 y2)²
    there is no need to carry out φ(.) explicitly
  • This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
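A small check of this identity in code: the explicit feature map below is the assumed form of φ(.) for 2-D inputs, and the two printed values should agree, showing that the kernel computes the feature-space inner product without ever forming φ(.).

    import numpy as np

    def phi(v):
        """Explicit degree-2 polynomial feature map for a 2-D input (assumed form)."""
        x1, x2 = v
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    def K(u, v):
        """Polynomial kernel of degree 2."""
        return (1.0 + u @ v) ** 2

    u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(phi(u) @ phi(v))   # inner product computed in the feature space
    print(K(u, v))           # the same value, without computing phi explicitly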

59
Some common kernel functions
60
Examples of kernel functions
  • Polynomial kernel:  K(x, y) = (x · y + 1)^p
  • Gaussian (RBF) kernel:  K(x, y) = exp(-||x - y||² / (2σ²))
61
Modification Due to Kernel Function
  • Change all inner products to kernel functions
  • For training:

Original:              maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj <xi, xj>
                       subject to  C ≥ αi ≥ 0,  Σi αi yi = 0
With kernel function:  maximize  W(α) = Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)
                       subject to  C ≥ αi ≥ 0,  Σi αi yi = 0
62
Modification Due to Kernel Function
  • For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0

Original:              w = Σi αi yi xi ,   f = <w, z> + b = Σi αi yi <xi, z> + b
With kernel function:  f = Σi αi yi K(xi, z) + b   (w need not be formed explicitly)
63
Modularity
  • Any kernel-based learning algorithm is composed of two modules:
  • a general-purpose learning machine
  • a problem-specific kernel function
  • Any kernel-based algorithm can be fitted with any kernel
  • Kernels themselves can be constructed in a modular way
  • Great for software engineering (and for analysis)

64
Building new kernels
  • New kernels can be constructed from existing ones:
  • If K, K′ are kernels, then
  • K K′ (the product) is a kernel
  • cK is a kernel, if c > 0
  • aK + bK′ is a kernel, for a, b > 0
  • etc.
  • In this way, complex kernels that are well suited to a particular problem can be built from simple ones.
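These closure rules can be spot-checked numerically: a valid kernel must produce a positive semidefinite Gram matrix, so the smallest eigenvalue of each combined Gram matrix below should be non-negative (up to round-off). The sample points and the two base kernels are arbitrary choices made only for the check.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))                               # hypothetical sample points

    def gram(kernel):
        return np.array([[kernel(a, b) for b in X] for a in X])

    K1 = gram(lambda a, b: (a @ b + 1.0) ** 2)                # polynomial kernel
    K2 = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2)))     # RBF-style kernel
    for name, K in [("K1 + K2", K1 + K2), ("3*K1", 3.0 * K1), ("K1*K2", K1 * K2)]:
        print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())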

65
Example
  • Suppose we have 5 one-dimensional data points:
  • x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with {1, 2, 6} as class 1 and {4, 5} as class 2, i.e. y1 = 1, y2 = 1, y3 = -1, y4 = -1, y5 = 1
  • We use the polynomial kernel of degree 2:
  • K(x, y) = (xy + 1)²
  • C is set to 100
  • We first find αi (i = 1, …, 5) by maximizing
      Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)
      subject to  0 ≤ αi ≤ 100  and  Σi αi yi = 0

66
Example
  • By using a QP solver, we get
  • α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
  • Note that the constraints are indeed satisfied
  • The support vectors are x2 = 2, x4 = 5, x5 = 6
  • The discriminant function is
      f(z) = 2.5 (2z + 1)² - 7.333 (5z + 1)² + 4.833 (6z + 1)² + b
           = 0.6667 z² - 5.333 z + b
  • b is recovered by solving f(2) = 1, or f(5) = -1, or f(6) = 1, since x2 and x5 lie on the margin line f(x) = +1 and x4 lies on the line f(x) = -1
  • All three give b = 9
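The few lines below re-evaluate this discriminant function from the reported αi and b = 9; since the αi above are rounded, the values at the support vectors come out close to, rather than exactly, ±1.

    import numpy as np

    # Data and (rounded) dual solution as reported on the slide
    x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
    y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
    alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
    b = 9.0

    K = lambda u, z: (u * z + 1.0) ** 2      # polynomial kernel of degree 2

    def f(z):
        return np.sum(alpha * y * K(x, z)) + b

    for z in x:
        print("f(%g) = %+.3f" % (z, f(z)))
    # Expected: f(1), f(2), f(6) > 0 (class 1) and f(4), f(5) < 0 (class 2),
    # with f(2), f(6) near +1 and f(5) near -1, since these are the support vectors.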

67
Example
[Figure: the value of the discriminant function f(z) plotted along the input axis; the points 1, 2 and 6 fall on the class 1 side and the points 4 and 5 on the class 2 side.]
68
????? ?? ??????
  • ????? ???? ??? ????
  • ?? ????? ??? ?????? ?? ??????? ?? ??? ??? ???????
    ??? ?? ????? ?? ???? 4 ?????.

69
Steps for classification with an SVM
  • Prepare the data matrix
  • Select the kernel function to use
  • Execute the training algorithm, using a QP solver to obtain the αi values
  • Unseen data can then be classified using the αi values
    and the support vectors
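As a rough end-to-end illustration of these four steps, the sketch below uses scikit-learn's SVC (assuming the library is installed) on the 1-D example from the earlier slides; with gamma = 1 and coef0 = 1, its polynomial kernel matches (xy + 1)².

    import numpy as np
    from sklearn.svm import SVC

    # Step 1: prepare the data matrix (one feature per column)
    X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
    y = np.array([1, 1, -1, -1, 1])

    # Steps 2 and 3: select the kernel and train (the QP is solved internally)
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
    clf.fit(X, y)

    print("support vectors:", clf.support_vectors_.ravel())
    print("dual coefficients (alpha_i * y_i):", clf.dual_coef_.ravel())

    # Step 4: classify unseen data
    print(clf.predict(np.array([[1.5], [4.5]])))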

70
Choosing the kernel function
  • Choosing the kernel function is probably the trickiest part of using an SVM.
  • Many principles have been proposed for designing kernels suited to particular kinds of data:
  • diffusion kernel, Fisher kernel, string kernel, ...
  • but there is still no general recipe for picking the best kernel for a given problem.
  • In practice:
  • In practice, a low degree polynomial kernel or
    RBF kernel with a reasonable width is a good
    initial try
  • Note that SVM with RBF kernel is closely related
    to RBF neural networks, with the centers of the
    radial basis functions automatically chosen for
    SVM

71
SVM applications
  • SVMs were originally proposed by Boser, Guyon and
    Vapnik in 1992 and gained increasing popularity
    in late 1990s.
  • SVMs are currently among the best performers for
    a number of classification tasks ranging from
    text to genomic data.
  • SVMs can be applied to complex data types beyond
    feature vectors (e.g. graphs, sequences,
    relational data) by designing kernel functions
    for such data.
  • SVM techniques have been extended to a number of tasks such as regression [Vapnik et al., 1997], principal component analysis [Schölkopf et al., 1999], etc.
  • Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi at a time, e.g. SMO [Platt, 1999] and [Joachims, 1999]
  • Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.

72
Strengths and weaknesses of SVM
  • Strengths
  • Training is relatively easy
  • Good generalization in theory and practice
  • Works well even with few training instances
  • Finds the globally optimal model (no local optima), unlike neural networks
  • It scales relatively well to high dimensional
    data
  • Tradeoff between classifier complexity and error
    can be controlled explicitly
  • Non-traditional data like strings and trees can
    be used as input to SVM, instead of feature
    vectors
  • Weaknesses
  • Need to choose a good kernel function.

73
Summary
  • SVMs find optimal linear separator
  • They pick the hyperplane that maximises the
    margin
  • The optimal hyperplane turns out to be a linear
    combination of support vectors
  • The kernel trick makes SVMs non-linear learning
    algorithms
  • Transform nonlinear problems into a higher-dimensional space using kernel functions; in the transformed space there is a better chance that the classes will be linearly separable.

74
Other issues with SVM
  • How to use SVM for multi-class classification?
  • One can change the QP formulation to become
    multi-class
  • More often, multiple binary classifiers are
    combined
  • One can train multiple one-versus-all
    classifiers, or combine multiple pairwise
    classifiers intelligently
  • How to interpret the SVM discriminant function
    value as probability?
  • By performing logistic regression on the SVM
    output of a set of data (validation set) that is
    not used for training
  • Some SVM software (like libsvm) has these features built in
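A rough sketch of the one-versus-all combination mentioned above, using scikit-learn's binary SVC as the building block; the three-class toy data and the RBF kernel are made up for the illustration.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [0.0, 2.0], [0.2, 2.1]])
    y = np.array([0, 0, 1, 1, 2, 2])                  # three classes

    classifiers = {}
    for c in np.unique(y):
        yc = np.where(y == c, 1, -1)                  # class c versus the rest
        classifiers[c] = SVC(kernel="rbf", C=1.0).fit(X, yc)

    def predict(x):
        # Pick the class whose binary SVM gives the largest decision value
        scores = {c: clf.decision_function([x])[0] for c, clf in classifiers.items()}
        return max(scores, key=scores.get)

    print([predict(x) for x in X])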

75
Multi-class Classification
  • SVM is basically a two-class classifier
  • One can change the QP formulation to allow
    multi-class classification
  • More commonly, the data set is divided into two
    parts intelligently in different ways and a
    separate SVM is trained for each way of division
  • Multi-class classification is done by combining
    the output of all the SVM classifiers
  • Majority rule
  • Error correcting code
  • Directed acyclic graph

76
Available software
  • A list of SVM implementations can be found at
  • http://www.kernel-machines.org/software.html
  • Some implementations (such as LIBSVM) can also handle multi-class classification.
  • SVMLight is among the earliest and most widely used implementations of SVM.
  • Several Matlab toolboxes for SVM are also available.

77
References
  • [1] B. E. Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, pp. 144-152, Pittsburgh, 1992.
  • [2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
  • [3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.