Title: A Kernel Approach for Learning From Almost Orthogonal Patterns
1. A Kernel Approach for Learning From Almost Orthogonal Patterns
CIS 525 Class Presentation. Professor: Slobodan Vucetic. Presenter: Yilian Qin.
B. Schölkopf et al., Proc. 13th ECML, Aug. 19-23, 2002, pp. 511-528.
2. Presentation Outline
- Introduction
- Motivation
- A brief review of SVM for linearly separable patterns
- Kernel approach for SVM
- Empirical kernel map
- The problem: almost orthogonal patterns in the feature space
  - An example
  - Situations leading to almost orthogonal patterns
- Methods to reduce large diagonals of the Gram matrix
  - Gram matrix transformation
  - An approximate approach based on statistics
- Experiments
  - Artificial data (string classification, microarray data with noise, hidden variable problem)
  - Real data (thrombin binding, lymphoma classification, protein family classification)
- Conclusions
- Comments
4. Motivation
- Support vector machine (SVM)
  - A powerful method for classification (or regression), with accuracy comparable to neural networks
  - Exploits a kernel function to separate patterns in a high-dimensional space
  - The information of the training data for the SVM is stored in the Gram matrix (kernel matrix)
- The problem
  - The SVM doesn't perform well if the Gram matrix has large diagonal values
5. A Brief Review of SVM
- For linearly separable patterns, to maximize the margin:
  - Minimize ||w||²/2
  - Subject to the constraints y_i(w · x_i + b) ≥ 1, i = 1, ..., m
(A minimal training sketch follows.)
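Not from the original slides: a minimal sketch of this hard-margin setup using scikit-learn's SVC, where a very large C approximates the hard margin (the toy data is hypothetical).

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2 (hypothetical).
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2  subject to  y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margins y_i(w . x_i + b):", y * (X @ w + b))  # all >= 1 (up to tolerance)
```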
6. Kernel Approach for SVM (1/3)
- For linearly non-separable patterns
  - A nonlinear mapping function φ: x ↦ φ(x) ∈ H maps the patterns into a new feature space H of higher dimension
  - For example, the XOR problem
- SVM in the new feature space
- The kernel trick
  - Solving the above minimization problem requires 1) the explicit form of φ and 2) inner products in the high-dimensional space H
  - Simplification by a wise selection of kernel functions with the property k(x_i, x_j) = φ(x_i) · φ(x_j) (illustrated below)
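To make the kernel trick concrete, here is a standard textbook identity (not from the slides): for x, y ∈ R², the degree-2 polynomial kernel k(x, y) = (x · y)² equals the inner product under the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²), so the feature-space inner product never has to be formed explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so phi(x) . phi(y) = (x . y)^2.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed in the feature space H: 1.0
print((x @ y) ** 2)      # the same value from the kernel in input space: 1.0
```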
7. Kernel Approach for SVM (2/3)
- Transform the problem with the kernel method (see the numeric check below)
  - Expand w in the new feature space: w = Σ_i a_i φ(x_i) = Φ(x) a, where Φ(x) = [φ(x_1), φ(x_2), ..., φ(x_m)] and a = [a_1, a_2, ..., a_m]^T
  - Gram matrix K = [K_ij], where K_ij = φ(x_i) · φ(x_j) = k(x_i, x_j) (symmetric!)
  - The (squared) objective function: ||w||² = a^T Φ(x)^T Φ(x) a = a^T K a (a sufficient condition for the existence of an optimal solution: K is positive definite)
  - The constraints: y_i(w^T φ(x_i) + b) = y_i(a^T Φ(x)^T φ(x_i) + b) = y_i(a^T K_i + b) ≥ 1, where K_i is the i-th column of K
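A quick numeric check of these identities (my construction: an arbitrary feature matrix and coefficient vector stand in for φ and a):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 4))         # columns stand in for phi(x_1), ..., phi(x_4)
a = rng.normal(size=4)                # expansion coefficients, so w = Phi @ a
K = Phi.T @ Phi                       # Gram matrix K_ij = phi(x_i) . phi(x_j)

w = Phi @ a
print(np.allclose(w @ w, a @ K @ a))  # ||w||^2 == a^T K a -> True

# Constraint value for pattern i: y_i (a^T K_i + b), K_i the i-th column of K.
y, b = np.array([1, -1, 1, -1]), 0.3
print(y * (K @ a + b))                # equals y_i (w^T phi(x_i) + b) for each i
```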
8. Kernel Approach for SVM (3/3)
- To predict new data with a trained SVM
  - With the expansion w = Σ_i a_i φ(x_i), the decision rule is sign(Σ_i a_i k(x_i, x) + b), so the explicit form of k(x_i, x) is required for the prediction of new data (see the sketch below)
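A minimal sketch of that decision rule (the trained quantities a and b here are hypothetical placeholders, not values from the paper):

```python
import numpy as np

def predict(x_new, X_train, a, b, k):
    # SVM decision rule sign(sum_i a_i k(x_i, x_new) + b): the explicit kernel k
    # is needed to compare the new pattern against every training pattern.
    scores = np.array([k(xi, x_new) for xi in X_train])
    return np.sign(a @ scores + b)

X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])
a, b = np.array([0.5, -0.5]), 0.0                  # hypothetical trained values
print(predict(np.array([2.0, 0.5]), X_train, a, b, lambda u, v: u @ v))  # 1.0
```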
9. Empirical Kernel Map
- Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e., the patterns will be linearly separable in the m-dimensional space R^m
- Empirical kernel map: φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), ..., k(x_i, x_m)]^T = K_i
- The SVM in R^m
- The new Gram matrix K_m associated with φ_m(x)
  - K_m = [(K_m)_ij], where (K_m)_ij = φ_m(x_i) · φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e., K_m = K^T K = K K^T
- Advantage of the empirical kernel map: K_m is positive definite (verified numerically below)
  - K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D² U (K is symmetric, U is a unitary matrix, D is diagonal)
  - This satisfies the sufficient condition of the above minimization problem
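A small sketch of the empirical kernel map and the K_m = K K^T identity, on toy data with a linear base kernel (my construction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))           # m = 6 toy training patterns

def k(u, v):                          # linear base kernel, for illustration
    return u @ v

K = np.array([[k(xi, xj) for xj in X] for xi in X])  # original Gram matrix

# Empirical kernel map: phi_m(x_i) = (k(x_i, x_1), ..., k(x_i, x_m))^T = K_i,
# so the new Gram matrix is K_m = K K^T (= K^T K, since K is symmetric).
K_m = K @ K.T
print(np.allclose(K_m, K.T @ K))      # True

eigvals = np.linalg.eigvalsh(K_m)     # eigenvalues of K_m are those of K, squared
print((eigvals >= -1e-10).all())      # True: K_m is positive (semi)definite
```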
10. The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance
11. An Example of Almost Orthogonal Patterns
- The training dataset with almost orthogonal patterns (a small reproduction is sketched below)
- The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j has large diagonals
- w is the solution found by the standard SVM
  - Observation: each large entry in w corresponds to a column in X with only one large entry, so w becomes a lookup table and the SVM won't generalize well
- A better solution (shown on the original slide)
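A small reproduction of the effect (hypothetical data, not the slide's matrix): each pattern has a large private non-zero entry, so the patterns are almost orthogonal and the linear-kernel Gram matrix is dominated by its diagonal.

```python
import numpy as np

# Each row has a large private entry plus one small shared entry, so
# x_i . x_j is tiny for i != j while x_i . x_i is large.
X = np.array([
    [3.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0, 1.0],
])
K = X @ X.T     # linear-kernel Gram matrix
print(K)        # diagonal entries 10, off-diagonal entries 1: "large diagonals"
```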
12. Situations Leading to Almost Orthogonal Patterns
- Sparsity of the patterns in the new feature space, e.g.:
  - x = [0, 0, 0, 1, 0, 0, 1, 0]^T
  - y = [0, 1, 1, 0, 0, 0, 0, 0]^T
  - x · x ≈ y · y >> x · y (large diagonals in the Gram matrix)
- Some selections of kernel functions may result in sparsity in the new feature space
  - String kernel (Watkins et al., 2000)
  - Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d
    - If x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) for even moderately large d, since the diagonal/off-diagonal ratio grows exponentially with d (see the sketch after this list)
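The amplification by the polynomial degree can be seen directly: the diagonal/off-diagonal ratio of k(x, y) = (x · y)^d is the linear-kernel ratio raised to the power d (the numbers below reuse the sparse example above).

```python
self_sim, cross_sim = 10.0, 1.0   # x_i . x_i and x_i . x_j from the sparse example
for d in [1, 2, 4, 8]:
    # Polynomial kernel k(x, y) = (x . y)^d raises the dominance to the power d.
    ratio = (self_sim / cross_sim) ** d
    print(f"d = {d}: k(x_i, x_i) / k(x_i, x_j) = {ratio:.0e}")
# Even a moderate degree d makes the Gram matrix effectively diagonal.
```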
13. Methods to Reduce the Large Diagonals of Gram Matrices
14. Gram Matrix Transformation (1/2)
- For a symmetric, positive definite Gram matrix K (or K_m): K = U^T D U, where U is a unitary matrix and D is a diagonal matrix
- Define f(K) = U^T f(D) U with f(D)_ii = f(D_ii), i.e., the function f operates on the eigenvalues λ_i of K
- f(K) should preserve the positive definiteness of the Gram matrix
- A sample procedure for Gram matrix transformation (sketched in code below):
  - (Optional) Compute the positive definite matrix A = sqrt(K)
  - Suppress the large diagonals of A to obtain a symmetric A', i.e., transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
  - Compute the positive definite matrix K' = (A')²
15. Gram Matrix Transformation (2/2)
- Effect of the matrix transformation
  - The explicit form of the new kernel function k' is not available
  - k' is required when the trained SVM is used to predict the test data
- A solution: include all test data in K before the matrix transformation K → K', i.e., the test data has to be known at training time (see the sketch below)
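A sketch of that transductive workflow, with toy data and the same illustrative spectral shift as above (the block layout is the point, not the particular f):

```python
import numpy as np

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(4, 8)), rng.normal(size=(2, 8))
m = len(X_train)

# Test patterns are included BEFORE the transformation, since the transformed
# kernel k' has no explicit functional form to evaluate on unseen points.
X_all = np.vstack([X_train, X_test])
K_all = X_all @ X_all.T                        # base Gram matrix on all m + n points
lam, U = np.linalg.eigh(K_all)
K_new = U @ np.diag(lam - 0.8 * lam.min()) @ U.T

K_train_block = K_new[:m, :m]                  # used to train the SVM
K_cross_block = K_new[m:, :m]                  # test-vs-train block, used to predict
```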
16. An Approximate Approach Based on Statistics
- In principle, the empirical kernel map φ_{m+n}(x), built from the m training and n test points, should be used to calculate the Gram matrix
- Assuming the dataset size r is large, the n extra coordinates change the map little
- Therefore, the SVM can simply be trained with the empirical map on the training set, φ_m(x), instead of φ_{m+n}(x) (a numeric illustration follows)
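A numeric illustration of why this can work (my construction, not the paper's argument): for a linear base kernel on random data, the contribution of the n test coordinates to the training-block Gram matrix of φ_{m+n} shrinks relative to the training coordinates as m grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                      # number of test points
for m in [20, 100, 500]:                    # growing training set sizes
    X_tr, X_te = rng.normal(size=(m, 5)), rng.normal(size=(n, 5))
    K_tr = X_tr @ X_tr.T                    # k(train, train)
    K_te = X_te @ X_tr.T                    # k(test, train) cross block
    G_m = K_tr @ K_tr.T                     # Gram matrix of phi_m on training data
    G_mn = G_m + K_te.T @ K_te              # Gram matrix of phi_{m+n} on training data
    rel = np.linalg.norm(G_mn - G_m) / np.linalg.norm(G_mn)
    print(f"m = {m}: relative effect of the n extra coordinates = {rel:.3f}")
```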
18. Artificial Data (1/3)
- String classification
  - String kernel function (Watkins et al., 2000)
  - Sub-polynomial kernel k'(x, y) = (φ(x) · φ(y))^P, 0 < P < 1; for sufficiently small P, the large diagonals of K can be suppressed (see the sketch below)
  - 50 strings (25 for training and 25 for testing), 20 trials
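A sketch of the sub-polynomial idea on the sparse example from above; the sign factor is my assumption to keep the formula well-defined if the base kernel can be negative (it is a no-op for non-negative kernels):

```python
import numpy as np

def subpoly(K, P):
    # Sub-polynomial kernel k'(x, y) = sign(k) * |k(x, y)|^P with 0 < P < 1.
    return np.sign(K) * np.abs(K) ** P

X = np.array([[3.0, 0, 0, 1], [0, 3.0, 0, 1], [0, 0, 3.0, 1]])
K = X @ X.T                                # diagonal 10, off-diagonal 1
for P in [1.0, 0.5, 0.1]:
    Kp = subpoly(K, P)
    print(f"P = {P}: diag/off-diag ratio = {Kp[0, 0] / Kp[0, 1]:.2f}")
# 10.00, 3.16, 1.26: as P shrinks, the large diagonals are suppressed.
# Note k' need not stay positive definite, which is where the empirical
# kernel map (slide 9) comes in.
```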
19. Artificial Data (2/3)
- Microarray data with noise (Alon et al., 1999)
  - 62 instances (22 positive, 44 negative), 2000 features in the original data
  - 10,000 noise features were added (each non-zero with a probability of 1%)
  - The error rate for the SVM without noise addition is 0.18 ± 0.15
20. Artificial Data (3/3)
- Hidden variable problem
  - 10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
  - The original kernel is a polynomial kernel of order 4
21. Real Data (1/3)
- Thrombin binding problem
  - 1909 instances, 139,351 binary features
  - 0.68% of the entries are non-zero
  - 8-fold cross-validation
22. Real Data (2/3)
- Lymphoma classification (Alizadeh et al., 2000)
  - 96 samples, 4026 features
  - 10-fold cross-validation
  - Improved results observed compared with previous work (Weston, 2001)
23. Real Data (3/3)
- Protein family classification (Murzin et al., 1995)
  - Small positive set, large negative set
(Figure: ROC curves; an ROC score of 1 is the best and 0 the worst; the x-axis is the false-positive rate.)
24. Conclusions
- The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
- The common situation in which sparse vectors lead to large diagonals was identified and discussed
- A method of Gram matrix transformation that suppresses the large diagonals was proposed to improve performance in such cases
- Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed
25. Comments
- Strong points
  - The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing the large diagonals
  - The experiments are extensive
- Weak points
  - The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the test data is not known at training time
  - The proposed Gram matrix transformation method was not tested directly by experiments; instead, transformed kernel functions were used in the experiments
  - Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist; therefore, the necessary condition for the statistical approach to the pattern distribution is not satisfied