Title: A Kernel Approach for Learning From Almost Orthogonal Patterns
1. A Kernel Approach for Learning From Almost Orthogonal Patterns
CIS 525 Class Presentation. Professor: Slobodan Vucetic. Presenter: Yilian Qin.
B. Schölkopf et al., Proc. 13th ECML, Aug. 19-23, 2002, pp. 511-528.
2. Presentation Outline
- Introduction
- Motivation
- A brief review of SVM for linearly separable patterns
- Kernel approach for SVM
- Empirical kernel map
- The problem: almost orthogonal patterns in the feature space
  - An example
  - Situations leading to almost orthogonal patterns
- Methods to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
- Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
- Conclusions
- Comments
4. Motivation
- Support vector machine (SVM)
  - A powerful method for classification (or regression), with accuracy comparable to neural networks
  - Exploits a kernel function to separate patterns in a high-dimensional space
  - The information of the training data for the SVM is stored in the Gram matrix (kernel matrix)
- The problem
  - The SVM doesn't perform well if the Gram matrix has large diagonal values
5. A Brief Review of SVM
- For linearly separable patterns, to maximize the margin:
  - Minimize ||w||²/2
  - Subject to the constraints y_i(w · x_i + b) ≥ 1, i = 1, ..., m
(A minimal training sketch follows.)
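Not from the original slides: a minimal sketch of this hard-margin setup using scikit-learn's SVC, where a very large C approximates the hard margin (the toy data is hypothetical).

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2 (hypothetical).
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2  subject to  y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margins y_i(w . x_i + b):", y * (X @ w + b))  # all >= 1 (up to tolerance)
```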
6. Kernel Approach for SVM (1/3)
- For linearly non-separable patterns
  - A nonlinear mapping function φ: x ↦ φ(x) ∈ H maps the patterns into a new feature space H of higher dimension
  - For example, the XOR problem
- SVM in the new feature space
- The kernel trick
  - Solving the above minimization problem requires 1) the explicit form of φ and 2) inner products in the high-dimensional space H
  - Simplification by a wise selection of kernel functions with the property k(x_i, x_j) = φ(x_i) · φ(x_j) (illustrated below)
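To make the kernel trick concrete, here is a standard textbook identity (not from the slides): for x, y ∈ R², the degree-2 polynomial kernel k(x, y) = (x · y)² equals the inner product under the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²), so the feature-space inner product never has to be formed explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so phi(x) . phi(y) = (x . y)^2.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed in the feature space H: 1.0
print((x @ y) ** 2)      # the same value from the kernel in input space: 1.0
```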
7. Kernel Approach for SVM (2/3)
- Transform the problem with the kernel method (see the numeric check below)
  - Expand w in the new feature space: w = Σ_i a_i φ(x_i) = Φ(x) a, where Φ(x) = [φ(x_1), φ(x_2), ..., φ(x_m)] and a = [a_1, a_2, ..., a_m]^T
  - Gram matrix K = [K_ij], where K_ij = φ(x_i) · φ(x_j) = k(x_i, x_j) (symmetric!)
  - The (squared) objective function: ||w||² = a^T Φ(x)^T Φ(x) a = a^T K a (a sufficient condition for the existence of an optimal solution: K is positive definite)
  - The constraints: y_i(w^T φ(x_i) + b) = y_i(a^T Φ(x)^T φ(x_i) + b) = y_i(a^T K_i + b) ≥ 1, where K_i is the i-th column of K
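A quick numeric check of these identities (my construction: an arbitrary feature matrix and coefficient vector stand in for φ and a):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 4))         # columns stand in for phi(x_1), ..., phi(x_4)
a = rng.normal(size=4)                # expansion coefficients, so w = Phi @ a
K = Phi.T @ Phi                       # Gram matrix K_ij = phi(x_i) . phi(x_j)

w = Phi @ a
print(np.allclose(w @ w, a @ K @ a))  # ||w||^2 == a^T K a -> True

# Constraint value for pattern i: y_i (a^T K_i + b), K_i the i-th column of K.
y, b = np.array([1, -1, 1, -1]), 0.3
print(y * (K @ a + b))                # equals y_i (w^T phi(x_i) + b) for each i
```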
8. Kernel Approach for SVM (3/3)
- To predict new data with a trained SVM
  - With the expansion w = Σ_i a_i φ(x_i), the decision rule is sign(Σ_i a_i k(x_i, x) + b), so the explicit form of k(x_i, x) is required for the prediction of new data (see the sketch below)
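A minimal sketch of that decision rule (the trained quantities a and b here are hypothetical placeholders, not values from the paper):

```python
import numpy as np

def predict(x_new, X_train, a, b, k):
    # SVM decision rule sign(sum_i a_i k(x_i, x_new) + b): the explicit kernel k
    # is needed to compare the new pattern against every training pattern.
    scores = np.array([k(xi, x_new) for xi in X_train])
    return np.sign(a @ scores + b)

X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])
a, b = np.array([0.5, -0.5]), 0.0                  # hypothetical trained values
print(predict(np.array([2.0, 0.5]), X_train, a, b, lambda u, v: u @ v))  # 1.0
```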
9. Empirical Kernel Map
- Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e., the patterns will be linearly separable in the m-dimensional space R^m
- Empirical kernel map: φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), ..., k(x_i, x_m)]^T = K_i
- The SVM in R^m
- The new Gram matrix K_m associated with φ_m(x)
  - K_m = [(K_m)_ij], where (K_m)_ij = φ_m(x_i) · φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e., K_m = K^T K = K K^T
- Advantage of the empirical kernel map: K_m is positive definite (verified numerically below)
  - K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D² U (K is symmetric, U is a unitary matrix, D is diagonal)
  - This satisfies the sufficient condition of the above minimization problem
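A small sketch of the empirical kernel map and the K_m = K K^T identity, on toy data with a linear base kernel (my construction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))           # m = 6 toy training patterns

def k(u, v):                          # linear base kernel, for illustration
    return u @ v

K = np.array([[k(xi, xj) for xj in X] for xi in X])  # original Gram matrix

# Empirical kernel map: phi_m(x_i) = (k(x_i, x_1), ..., k(x_i, x_m))^T = K_i,
# so the new Gram matrix is K_m = K K^T (= K^T K, since K is symmetric).
K_m = K @ K.T
print(np.allclose(K_m, K.T @ K))      # True

eigvals = np.linalg.eigvalsh(K_m)     # eigenvalues of K_m are those of K, squared
print((eigvals >= -1e-10).all())      # True: K_m is positive (semi)definite
```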
10. The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance
11. An Example of Almost Orthogonal Patterns
- The training dataset with almost orthogonal patterns (a small reproduction is sketched below)
- The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j has large diagonals
- w is the solution found by the standard SVM
  - Observation: each large entry in w corresponds to a column in X with only one large entry, so w becomes a lookup table and the SVM won't generalize well
- A better solution (shown on the original slide)
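A small reproduction of the effect (hypothetical data, not the slide's matrix): each pattern has a large private non-zero entry, so the patterns are almost orthogonal and the linear-kernel Gram matrix is dominated by its diagonal.

```python
import numpy as np

# Each row has a large private entry plus one small shared entry, so
# x_i . x_j is tiny for i != j while x_i . x_i is large.
X = np.array([
    [3.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0, 1.0],
])
K = X @ X.T     # linear-kernel Gram matrix
print(K)        # diagonal entries 10, off-diagonal entries 1: "large diagonals"
```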
12. Situations Leading to Almost Orthogonal Patterns
- Sparsity of the patterns in the new feature space, e.g.:
  - x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  - y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  - x · x ≈ y · y >> x · y (large diagonals in the Gram matrix)
- Some selections of kernel functions may result in sparsity in the new feature space
  - String kernel (Watkins et al., 2000)
  - Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d
    - If x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) for even moderately large d, since the diagonal/off-diagonal ratio grows exponentially with d (see the sketch after this list)
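The amplification by the polynomial degree can be seen directly: the diagonal/off-diagonal ratio of k(x, y) = (x · y)^d is the linear-kernel ratio raised to the power d (the numbers below reuse the sparse example above).

```python
self_sim, cross_sim = 10.0, 1.0   # x_i . x_i and x_i . x_j from the sparse example
for d in [1, 2, 4, 8]:
    # Polynomial kernel k(x, y) = (x . y)^d raises the dominance to the power d.
    ratio = (self_sim / cross_sim) ** d
    print(f"d = {d}: k(x_i, x_i) / k(x_i, x_j) = {ratio:.0e}")
# Even a moderate degree d makes the Gram matrix effectively diagonal.
```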
13. Methods to Reduce the Large Diagonals of Gram Matrices
14. Gram Matrix Transformation (1/2)
- For a symmetric, positive definite Gram matrix K (or K_m): K = U^T D U, where U is a unitary matrix and D is a diagonal matrix
- Define f(K) = U^T f(D) U with f(D)_ii = f(D_ii), i.e., the function f operates on the eigenvalues λ_i of K
- f(K) should preserve the positive definiteness of the Gram matrix
- A sample procedure for Gram matrix transformation (sketched in code below):
  - (Optional) Compute the positive definite matrix A = sqrt(K)
  - Suppress the large diagonals of A to obtain a symmetric A', i.e., transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
  - Compute the positive definite matrix K' = (A')²
15. Gram Matrix Transformation (2/2)
- Effect of the matrix transformation
  - The explicit form of the new kernel function k' is not available
  - k' is required when the trained SVM is used to predict the test data
- A solution: include all test data in K before the matrix transformation K → K', i.e., the test data has to be known at training time (see the sketch below)
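A sketch of that transductive workflow, with toy data and the same illustrative spectral shift as above (the block layout is the point, not the particular f):

```python
import numpy as np

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(4, 8)), rng.normal(size=(2, 8))
m = len(X_train)

# Test patterns are included BEFORE the transformation, since the transformed
# kernel k' has no explicit functional form to evaluate on unseen points.
X_all = np.vstack([X_train, X_test])
K_all = X_all @ X_all.T                        # base Gram matrix on all m + n points
lam, U = np.linalg.eigh(K_all)
K_new = U @ np.diag(lam - 0.8 * lam.min()) @ U.T

K_train_block = K_new[:m, :m]                  # used to train the SVM
K_cross_block = K_new[m:, :m]                  # test-vs-train block, used to predict
```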
16. An Approximate Approach Based on Statistics
- In principle, the empirical kernel map φ_{m+n}(x), built from the m training and n test points, should be used to calculate the Gram matrix
- Assuming the dataset size r is large, the n extra coordinates change the map little
- Therefore, the SVM can simply be trained with the empirical map on the training set, φ_m(x), instead of φ_{m+n}(x) (a numeric illustration follows)
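A numeric illustration of why this can work (my construction, not the paper's argument): for a linear base kernel on random data, the contribution of the n test coordinates to the training-block Gram matrix of φ_{m+n} shrinks relative to the training coordinates as m grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                      # number of test points
for m in [20, 100, 500]:                    # growing training set sizes
    X_tr, X_te = rng.normal(size=(m, 5)), rng.normal(size=(n, 5))
    K_tr = X_tr @ X_tr.T                    # k(train, train)
    K_te = X_te @ X_tr.T                    # k(test, train) cross block
    G_m = K_tr @ K_tr.T                     # Gram matrix of phi_m on training data
    G_mn = G_m + K_te.T @ K_te              # Gram matrix of phi_{m+n} on training data
    rel = np.linalg.norm(G_mn - G_m) / np.linalg.norm(G_mn)
    print(f"m = {m}: relative effect of the n extra coordinates = {rel:.3f}")
```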
18. Artificial Data (1/3)
- String classification
  - String kernel function (Watkins et al., 2000)
  - Sub-polynomial kernel k'(x, y) = (φ(x) · φ(y))^P, 0 < P < 1; for sufficiently small P, the large diagonals of K can be suppressed (see the sketch below)
  - 50 strings (25 for training and 25 for testing), 20 trials
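A sketch of the sub-polynomial idea on the sparse example from above; the sign factor is my assumption to keep the formula well-defined if the base kernel can be negative (it is a no-op for non-negative kernels):

```python
import numpy as np

def subpoly(K, P):
    # Sub-polynomial kernel k'(x, y) = sign(k) * |k(x, y)|^P with 0 < P < 1.
    return np.sign(K) * np.abs(K) ** P

X = np.array([[3.0, 0, 0, 1], [0, 3.0, 0, 1], [0, 0, 3.0, 1]])
K = X @ X.T                                # diagonal 10, off-diagonal 1
for P in [1.0, 0.5, 0.1]:
    Kp = subpoly(K, P)
    print(f"P = {P}: diag/off-diag ratio = {Kp[0, 0] / Kp[0, 1]:.2f}")
# 10.00, 3.16, 1.26: as P shrinks, the large diagonals are suppressed.
# Note k' need not stay positive definite, which is where the empirical
# kernel map (slide 9) comes in.
```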
19. Artificial Data (2/3)
- Microarray data with noise (Alon et al., 1999)
  - 62 instances (22 positive, 44 negative), 2000 features in the original data
  - 10,000 noise features were added (each non-zero with a probability of 1%)
  - The error rate for the SVM without noise addition is 0.18 ± 0.15
20. Artificial Data (3/3)
- Hidden variable problem
  - 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
  - The original kernel is a polynomial kernel of order 4
21. Real Data (1/3)
- Thrombin binding problem
  - 1909 instances, 139,351 binary features
  - 0.68% of the entries are non-zero
  - 8-fold cross-validation
22. Real Data (2/3)
- Lymphoma classification (Alizadeh et al., 2000)
  - 96 samples, 4026 features
  - 10-fold cross-validation
  - Improved results observed compared with previous work (Weston, 2001)
23. Real Data (3/3)
- Protein family classification (Murzin et al., 1995)
  - Small positive set, large negative set
(Figure: ROC curves; an ROC score of 1 is the best and 0 the worst; the x-axis is the false-positive rate.)
24. Conclusions
- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A method of Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases
- Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed
25. Comments
- Strong points
  - The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing the large diagonals
  - The experiments are extensive
- Weak points
  - The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the test data is not known at training time
  - The proposed Gram matrix transformation method was not tested directly by experiments; instead, transformed kernel functions were used in the experiments
  - Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist; therefore, the necessary condition for the statistical approach to the pattern distribution is not satisfied