1
A Kernel Approach for Learning From Almost
Orthogonal Patterns
CIS 525 Class Presentation
Professor: Slobodan Vucetic
Presenter: Yilian Qin
B. Scholkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.
2
Presentation Outline
  • Introduction
  • Motivation
  • A Brief review of SVM for linearly separable
    patterns
  • Kernel approach for SVM
  • Empirical kernel map
  • Problem: almost orthogonal patterns in the
    feature space
  • An example
  • Situations leading to almost orthogonal patterns
  • Method to reduce large diagonals of Gram matrix
  • Gram matrix transformation
  • An approximate approach based on statistics
  • Experiments
  • Artificial data (String classification,
    Microarray data with noise, Hidden variable
    problem)
  • Real data (Thrombin binding, Lymphoma
    classification, Protein family classification)
  • Conclusions
  • Comments

3
  • Introduction

4
Motivation
  • Support vector machine (SVM)
  • Powerful method for classification (or
    regression), with high accuracy comparable to
    neural networks
  • Exploits kernel functions to separate patterns
    in a high-dimensional feature space
  • The information about the training data is
    stored in the Gram matrix (kernel matrix)
  • The problem
  • SVM doesn't perform well if the Gram matrix has
    large diagonal values

5
A Brief Review of SVM
For linearly separable patterns
To maximize the margin
Minimize ||w||^2 / 2, subject to the constraints
yi (w · xi + b) ≥ 1, i = 1, ..., m
6
Kernel Approach for SVM (1/3)
  • For linearly non-separable patterns
  • Nonlinear mapping function Φ(x) ∈ H, mapping the
    patterns into a new feature space H of higher
    dimension
  • For example, the XOR problem
  • SVM in the new feature space
  • The kernel trick
  • Solving the above minimization problem requires
    1) the explicit form of Φ
  • 2) inner products in the high-dimensional space H
  • Simplify by choosing a kernel function with the
    property k(xi, xj) = Φ(xi) · Φ(xj)
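To make the kernel trick concrete, here is a minimal numpy sketch (not from the slides) using the degree-2 polynomial kernel, for which an explicit map Φ is easy to write down; it checks k(xi, xj) = Φ(xi) · Φ(xj) on the XOR patterns mentioned above.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input [x1, x2]:
    # phi(x) = [x1^2, x2^2, sqrt(2) * x1 * x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    # Degree-2 polynomial kernel evaluated directly in the input space
    return np.dot(x, y) ** 2

# The four XOR patterns; under phi they become linearly separable
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
for xi in X:
    for xj in X:
        assert np.isclose(k(xi, xj), np.dot(phi(xi), phi(xj)))
```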

7
Kernel Approach for SVM (2/3)
  • Transform the problem with the kernel method
  • Expand w in the new feature space:
    w = Σi ai Φ(xi) = Φ(X) a, where
    Φ(X) = [Φ(x1), Φ(x2), ..., Φ(xm)] is the matrix
    of mapped training patterns and
    a = [a1, a2, ..., am]T
  • Gram matrix K = [Kij], where Kij = Φ(xi) · Φ(xj)
    = k(xi, xj) (symmetric!)
  • The (squared) objective function:
    ||w||^2 = aT Φ(X)T Φ(X) a = aT K a (sufficient
    condition for existence of an optimal solution:
    K is positive definite)
  • The constraints: yi (wT Φ(xi) + b) =
    yi (aT Φ(X)T Φ(xi) + b) = yi (aT Ki + b) ≥ 1,
    where Ki is the i-th column of K
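A small numpy sketch (illustrative only; the data and coefficients are made up) of the quantities above: the Gram matrix K, the objective aT K a, and the constraint values yi (aT Ki + b).

```python
import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = k(x_i, x_j); symmetric because the kernel is symmetric
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

# Hypothetical toy data with a linear kernel
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
K = gram_matrix(X, np.dot)

a = np.array([0.5, -0.5, 0.25])   # candidate expansion coefficients
b = 0.0
objective = a @ K @ a             # ||w||^2 = a^T K a
margins = y * (K @ a + b)         # y_i (a^T K_i + b); feasibility needs all >= 1
```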

8
Kernel Approach for SVM (3/3)
  • To predict new data with a trained SVM:
    f(x) = sign( Σi ai k(xi, x) + b )
  • The explicit form of k(xi, xj) is therefore
    required for prediction on new data
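A one-function sketch of this point: evaluating the trained SVM on a new pattern uses the kernel function itself, not the stored Gram matrix. The expansion w = Σi ai Φ(xi) follows the previous slide, and the coefficients a and offset b are assumed to come from training.

```python
import numpy as np

def predict(x_new, X_train, a, b, kernel):
    # f(x) = sign( sum_i a_i k(x_i, x) + b ), using w = sum_i a_i phi(x_i);
    # k must be available in explicit form to evaluate it at the new point
    return np.sign(sum(ai * kernel(xi, x_new) for ai, xi in zip(a, X_train)) + b)
```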

9
Empirical Kernel Mapping
  • Assumption: m (the number of instances) is a
    sufficiently high dimension for the new feature
    space, i.e. the patterns will be linearly
    separable in the m-dimensional space Rm
  • Empirical kernel map: Φm(xi) = [k(xi, x1),
    k(xi, x2), ..., k(xi, xm)]T = Ki
  • The SVM in Rm
  • The new Gram matrix Km associated with Φm(x):
    Km = [Kmij], where Kmij = Φm(xi) · Φm(xj) =
    Ki · Kj = KiT Kj, i.e. Km = KT K = K KT
  • Advantage of the empirical kernel map: Km is
    positive definite
  • Km = K KT = (UT D U)(UT D U)T = UT D^2 U (K is
    symmetric, U is a unitary matrix, D is diagonal)
  • This satisfies the sufficient condition of the
    above minimization problem
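A numpy sketch of the empirical kernel map and of the identity Km = K KT (synthetic data; a linear kernel is used only for brevity).

```python
import numpy as np

def empirical_kernel_map(x, X_train, kernel):
    # phi_m(x) = [k(x, x_1), ..., k(x, x_m)]^T
    return np.array([kernel(x, xi) for xi in X_train])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
kernel = np.dot

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
Phi = np.array([empirical_kernel_map(xi, X, kernel) for xi in X])  # row i is phi_m(x_i)

K_m = Phi @ Phi.T                  # K_m[i, j] = phi_m(x_i) . phi_m(x_j)
assert np.allclose(K_m, K @ K.T)   # K_m = K K^T, positive semi-definite
                                   # (positive definite when K is nonsingular)
```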

10
  • The Problem
  • Almost Orthogonal Patterns in the Feature Space
  • Result in Poor Performance

11
An Example of Almost Orthogonal Patterns
  • The training dataset X with almost orthogonal
    patterns
  • The Gram matrix with the linear kernel
    k(xi, xj) = xi · xj

Large Diagonals
  • w is the solution with the standard SVM
  • Observation: each large entry in w corresponds
    to a column in X with only one large entry, so w
    becomes a lookup table and the SVM won't
    generalize well
  • A better solution
12
Situations Leading to Almost Orthogonal Patterns
  • Sparsity of the patterns in the new feature
    space, e.g.
  • x = [0, 0, 0, 1, 0, 0, 1, 0]T
  • y = [0, 1, 1, 0, 0, 0, 0, 0]T
  • x · x = y · y >> x · y (large diagonals in the
    Gram matrix)
  • Some choices of kernel function may result in
    sparsity in the new feature space
  • String kernel (Watkins 2000, and others)
  • Polynomial kernel, k(xi, xj) = (xi · xj)^d, with
    large order d
  • If xi · xi > xi · xj for i ≠ j, then
    k(xi, xi) >> k(xi, xj) even for moderately large
    d, because raising to the power d amplifies the
    gap
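A short sketch of the last point, with hypothetical sparse vectors chosen so that xi · xi = 3 and xi · xj = 1: raising the linear kernel to the power d multiplies the diagonal-to-off-diagonal ratio by 3^d.

```python
import numpy as np

# Hypothetical sparse patterns with a small overlap: x.x = y.y = 3, x.y = 1
x = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

for d in (1, 2, 4, 8):
    # Polynomial kernel of order d: k(u, v) = (u . v)^d
    print(d, (x @ x) ** d, (x @ y) ** d)   # ratio k(x, x) / k(x, y) = 3^d
```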

13
  • Methods to Reduce the Large Diagonals of Gram
    Matrices

14
Gram Matrix Transformation (1/2)
  • For a symmetric, positive definite Gram matrix K
    (or Km):
  • K = UT D U, where U is a unitary matrix and D is
    a diagonal matrix
  • Define f(K) = UT f(D) U, with f(D)ii = f(Dii),
    i.e. the function f operates on the eigenvalues
    λi of K
  • f(K) should preserve the positive definiteness
    of the Gram matrix
  • A sample procedure for Gram matrix transformation
  • (Optional) Compute the positive definite matrix
    A = sqrt(K)
  • Suppress the large diagonals of A, obtaining a
    symmetric A'
  • i.e. transform the eigenvalues of A:
    [λmin, λmax] → [f(λmin), f(λmax)]
  • Compute the positive definite matrix K' = (A')^2
15
Gram Matrix Transformation (2/2)
  • Effect of the matrix transformation
  • The explicit form of the new kernel function k'
    is not available
  • k' is required when the trained SVM is used to
    predict the testing data
  • A solution: include all test data in K before
    the matrix transformation K → K', i.e. the
    testing data has to be known at training time

16
An Approximate Approach Based on Statistics
  • The empirical kernel map Φm+n(x), defined over
    both the m training and n testing instances,
    should be used to calculate the Gram matrix
  • Assuming the dataset size is large, the map built
    on the training set alone gives nearly the same
    result
  • Therefore, the SVM can simply be trained with
    the empirical map on the training set, Φm(x),
    instead of Φm+n(x)

17
  • Experiment Results

18
Artificial Data (1/3)
  • String classification
  • String kernel function (Watkins, 2000)
  • Sub-polynomial kernel: k'(x, y) = (Φ(x) · Φ(y))^P
    = k(x, y)^P, 0 < P < 1; for sufficiently small P,
    the large diagonals of K can be suppressed
  • 50 strings (25 for training and 25 for testing),
    20 trials
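A sketch of the sub-polynomial transform on a made-up large-diagonal Gram matrix. Note that taking an entrywise power of K does not by itself guarantee positive definiteness; the empirical kernel map described earlier is one way to restore it.

```python
import numpy as np

def subpoly(K, p):
    # Entrywise k'(x, y) = k(x, y)^p with 0 < p < 1; the ratio
    # k(x, x) / k(x, y) shrinks to (k(x, x) / k(x, y))^p
    return np.sign(K) * np.abs(K) ** p

K = np.array([[20.0, 0.4],
              [0.4, 22.0]])      # large-diagonal Gram matrix
print(subpoly(K, 0.2))
# Diagonal drops to roughly 1.8-1.9, off-diagonal to roughly 0.83:
# the diagonal dominance is greatly reduced.
```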

19
Artificial Data (2/3)
  • Microarray data with noise (Alon et al., 1999)
  • 62 instances (22 positive, 40 negative), 2000
    features in the original data
  • 10,000 noise features were added (each non-zero
    with 1% probability)

The error rate for the SVM without noise addition
is 0.18 ± 0.15
20
Artificial Data (3/3)
  • Hidden variable problem
  • 10 hidden variables (attributes), plus 10
    additional attributes that are nonlinear
    functions of the 10 hidden variables
  • The original kernel is a polynomial kernel of
    order 4

21
Real Data (1/3)
  • Thrombin binding problem
  • 1909 instances, 139,351 binary features
  • 0.68% of the entries are non-zero
  • 8-fold cross-validation

22
Real Data (2/3)
  • Lymphoma classification (Alizadeh et al., 2000)
  • 96 samples, 4026 features
  • 10-fold cross-validation
  • Improved results were observed compared with
    previous work (Weston, 2001)

23
Real Data (3/3)
  • Protein family classification (Murzin et al.,
    1995)
  • Small positive set, large negative set

Performance measure: receiver operating
characteristic (ROC) score, 1 = best, 0 = worst,
together with the rate of false positives
24
Conclusions
  • The problem of degraded SVM performance due to
    almost orthogonal patterns was identified and
    analyzed
  • The common situation in which sparse vectors
    lead to large diagonals was identified and
    discussed
  • A Gram matrix transformation method that
    suppresses the large diagonals was proposed to
    improve performance in such cases
  • Experimental results show improved accuracy on
    various artificial and real datasets when the
    large diagonals of the Gram matrices are
    suppressed

25
Comments
  • Strong points
  • The identification of the situations that lead
    to large diagonals in the Gram matrix, and the
    proposed Gram matrix transformation method for
    suppressing the large diagonals
  • The experiments are extensive
  • Weak points
  • The application of the Gram matrix
    transformation may be severely restricted in
    forecasting or other applications in which the
    testing data is not known at training time
  • The proposed Gram matrix transformation method
    was not tested directly in the experiments;
    instead, transformed kernel functions were used
  • Almost orthogonal patterns imply that multiple
    pattern vectors pointing in the same direction
    rarely exist; therefore, the necessary condition
    for the statistical approach to the pattern
    distribution is not satisfied

26
  • End!