Title: Presented at the Albany Chapter of the ASA
1. Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
2. Magnetocardiography at CardioMag Imaging Inc.
With Bolek Szymanski and Karsten Sternickel
3. Left: filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right upper: spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right lower: T3-T4 sub-cycle in one MCG signal trace.
4. Classical (Linear) Regression Analysis: Predict y from X
Can we apply wisdom to data and forecast them right?
Xnm (n = 19, m = 7): 19 data records and 7 attributes; y: one response.
The weights follow from the pseudo-inverse: w = (XT X)^-1 XT y.
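The regression on this slide can be sketched in a few lines of numpy. The data below are synthetic stand-ins with the slide's dimensions (19 records, 7 attributes), not the actual data set:

```python
import numpy as np

# Hypothetical data matching the slide: n = 19 records, m = 7 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(19, 7))
w_true = rng.normal(size=7)
y = X @ w_true                 # synthetic, noiseless response for illustration

# Least-squares weights via the Moore-Penrose pseudo-inverse:
# w = (X^T X)^{-1} X^T y when X has full column rank
w = np.linalg.pinv(X) @ y
y_hat = X @ w                  # recovers y exactly on noiseless synthetic data
```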
5. Fundamental Machine Learning Paradox
- Learning occurs because of redundancy (patterns) in the data
- Machine Learning Paradox: if the data contain redundancies, then (i) we can learn from the data, but (ii) the feature kernel matrix KF is ill-conditioned
- How to resolve the Machine Learning Paradox? (i) fix the rank deficiency of KF with principal components (PCA); (ii) regularization: use KF + λI instead of KF (ridge regression); (iii) local learning
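The paradox can be made concrete with a small numpy experiment (synthetic data, not from the talk): a redundant attribute makes the feature kernel ill-conditioned, and a ridge restores a workable condition number:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(19, 7))
X[:, 6] = X[:, 0] + 1e-8 * rng.normal(size=19)  # near-duplicate (redundant) attribute

KF = X.T @ X                        # feature kernel: now nearly rank-deficient
print(np.linalg.cond(KF))          # huge condition number

lam = 1e-3
KF_ridge = KF + lam * np.eye(7)     # regularization: KF + lambda*I instead of KF
print(np.linalg.cond(KF_ridge))    # far smaller condition number
```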
6. Principal Component Regression (PCR): Replace Xnm by Tnh
Tnh → principal components: projection of the (n) data records on the (h) most important eigenvectors of the feature kernel KF
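A minimal numpy sketch of this projection step (synthetic data with the slide's dimensions; not the authors' code):

```python
import numpy as np

def pcr_fit(X, y, h):
    """Principal component regression: project X onto the h most important
    eigenvectors of the feature kernel KF = X^T X, then regress y on the scores."""
    KF = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(KF)   # eigenvalues in ascending order
    B = eigvecs[:, ::-1][:, :h]             # h leading eigenvectors
    T = X @ B                               # scores Tnh replace the data Xnm
    w = np.linalg.lstsq(T, y, rcond=None)[0]
    return B, w

# Hypothetical data: n = 19 records, m = 7 attributes
rng = np.random.default_rng(2)
X = rng.normal(size=(19, 7))
y = X @ rng.normal(size=7)
B, w = pcr_fit(X, y, h=5)
y_hat = (X @ B) @ w
```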
7. Ridge Regression in Data Space
- Wisdom is now obtained from the right-hand inverse or Penrose inverse
- A ridge term is added to resolve the learning paradox
- Needs kernels only: the data kernel KD
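A minimal numpy sketch of ridge regression in data space (synthetic data; the ridge value is illustrative): the model is expressed entirely through the n x n data kernel, as the slide notes.

```python
import numpy as np

# Hypothetical data with the slide's dimensions (19 records, 7 attributes)
rng = np.random.default_rng(3)
X = rng.normal(size=(19, 7))
y = X @ rng.normal(size=7)

KD = X @ X.T                        # data kernel (n x n): the model needs kernels only
lam = 1e-2                          # ridge term resolves the learning paradox
alpha = np.linalg.solve(KD + lam * np.eye(19), y)
y_hat = KD @ alpha                  # prediction via the right-hand (Penrose) inverse
```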
8. Implementing Direct Kernel Methods
- Linear model
- PCA model
- PLS model
- Ridge regression
- Self-Organizing Map
- ...
9. What have we learned so far?
- There is a learning paradox because of redundancies in the data
- We resolved this paradox by regularization
  - In the case of PCA we used the eigenvectors of the feature kernel
  - In the case of ridge regression we added a ridge to the data kernel
- So far the prediction models involved only linear algebra → strictly linear
- What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records
10. Nonlinear Kernels
- What is a kernel?
  - The data kernel expresses a similarity measure between data records
  - So far, the kernel contains linear similarity measures → linear kernel
- We can actually make up nonlinear similarity measures as well, based on a distance or difference, e.g. the Radial Basis Function kernel
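The RBF kernel can be sketched directly in numpy (the 2-D points below are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel: K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    Similarity decays with the squared distance between data records."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Three hypothetical 2-D records; the third is far from the first two
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(X, X)
# K is symmetric with ones on the diagonal; entries shrink as records differ more
```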
11. Review: What is in a Kernel?
- A kernel can be considered as a (nonlinear) data transformation
  - Many different choices for the kernel are possible
  - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
- The RBF or Gaussian kernel is a symmetric matrix
  - Entries reflect nonlinear similarities amongst data descriptions
  - As defined by Kij = exp(-||xi - xj||^2 / (2σ^2))
12. Direct Kernel Methods for Nonlinear Regression/Classification
- Consider the kernel as a (nonlinear) data transformation
  - This is the so-called kernel trick (Hilbert, early 1900s)
  - The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
- Linear regression models can be tricked into nonlinear models by applying such regression models on kernel-transformed data
  - PCA → DK-PCA
  - PLS → DK-PLS (Partial Least Squares Support Vector Machines)
  - (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
  - Direct Kernel Self-Organizing Maps (DK-SOM)
- These methods work in the same space as SVMs
  - DK models can usually also be derived from an optimization formulation (similar to SVMs)
  - Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
  - Unlike SVMs, there is no patent on direct kernel methods
  - Performance on hundreds of benchmark problems compares favorably with SVMs
- Classification can be considered as a special case of regression
13. Nonlinear PCA in Kernel Space
- Like PCA
- Consider a nonlinear data kernel transformation up front: Data → Kernel
- Derive principal components for that kernel (e.g. with NIPALS)
- Examples: Haykin's spiral; Cherkassky's nonlinear function model
14. PCA Example: Haykin's Spiral (demo haykin1)
15. Linear PCR Example: Haykin's Spiral (demo haykin2)
16. K-PCR Example: Haykin's Spiral, with 3 PCAs and with 12 PCAs (demo haykin3)
17. Scaling, Centering: Making the Test Kernel Centering Consistent
Training side: training data → Mahalanobis-scaled training data → kernel-transformed training data → centered direct kernel (training data). The Mahalanobis scaling factors and the vertical kernel centering factors are stored.
Test side: test data → Mahalanobis-scaled test data (with the stored training factors) → kernel-transformed test data → centered direct kernel (test data), centered with the stored training factors.
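A minimal numpy sketch of consistent kernel centering (standard double-centering; the Mahalanobis scaling step is omitted for brevity). The vertical centering factors computed on the training kernel are stored and reused for the test kernel:

```python
import numpy as np

def center_train_kernel(K):
    """Double-center the training kernel; return the column means
    (vertical centering factors) and total mean for reuse at test time."""
    col_means = K.mean(axis=0)
    total_mean = K.mean()
    Kc = K - col_means[None, :] - K.mean(axis=1)[:, None] + total_mean
    return Kc, col_means, total_mean

def center_test_kernel(Kt, col_means, total_mean):
    """Center a test kernel with the factors stored from training:
    row means come from Kt itself, column and total means from training."""
    return Kt - col_means[None, :] - Kt.mean(axis=1)[:, None] + total_mean

# Hypothetical data; linear kernels so the result can be checked directly
X = np.arange(12.0).reshape(4, 3)        # training data
Xt = np.array([[1.0, 2.0, 3.0]])         # test record
K, Kt = X @ X.T, Xt @ X.T
Kc, col_means, total_mean = center_train_kernel(K)
Ktc = center_test_kernel(Kt, col_means, total_mean)
```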
18. 36 MCG T3-T4 Traces
- Preprocessing:
  - horizontal Mahalanobis scaling
  - D4 wavelet transform
  - vertical Mahalanobis scaling (features and response)
19. [Results figure: SVMLib with linear PCA; SVMLib with Direct Kernel PLS]
20. Direct Kernel PLS with 3 Latent Variables
21. Predictions on Test Cases with K-PLS
22. K-PLS Predictions After Removing 14 Outliers
23. Benchmark Predictions on Test Cases
24. Direct Kernel, with Robert Bress and Thanakorn Naenna
25. www.drugmining.com, Kristin Bennett and Mark Embrechts
26. Docking Ligands is a Nonlinear Problem
DDASSL: Drug Design and Semi-Supervised Learning
27. WORK IN PROGRESS
[Slide shows a raw genomic DNA sequence (GATCAATGAGGTGGACACCAGAGGCGGGGACTTG...) as an example of the sequence data.]
28. (No transcript)
29. (No transcript)
30. Direct Kernel Partial Least Squares (K-PLS)
- Direct Kernel PLS is PLS with the kernel transform as a preprocessing step
- Consider K-PLS as a better nonlinear PLS
- Consider PLS as a better PCA
- K-PLS gives almost identical (but more stable) results as SVMs
  - PLS is the method of choice for chemometrics and QSAR drug design
  - hyper-parameters are easy to tune (5 latent variables)
  - unlike SVMs, there is no patent on K-PLS
31. What have we learned so far?
- There is a learning paradox because of redundancies in the data
- We resolved this paradox by regularization
  - In the case of PCA we used the eigenvectors of the feature kernel
  - In the case of ridge regression we added a ridge to the data kernel
- So far the prediction models involved only linear algebra → strictly linear
- What is in a kernel? The data kernel contains linear similarity measures (correlations) of data records
32. Nonlinear Kernels
- What is a kernel?
  - The data kernel expresses a similarity measure between data records
  - So far, the kernel contains linear similarity measures → linear kernel
- We can actually make up nonlinear similarity measures as well, based on a distance or difference, e.g. the Radial Basis Function kernel
33. PCR in Feature Space
[Network diagram: inputs x1 ... xi ... xm feed a layer of summation nodes, which combine into the prediction.]
- The first layer gives a weighted similarity score with each datapoint: a kind of nearest-neighbor weighted prediction score
- Its weights correspond to the H eigenvectors corresponding to the largest eigenvalues of XTX; the projections on the eigenvectors are divided by the corresponding variance (cf. Mahalanobis scaling)
- The output weights correspond to the dependent variable for the entire training data
- Other weights correspond to the scores (or PCAs) for the entire training set
34. PCR in Feature Space
[Network diagram: inputs x1 ... xi ... xm feed latent nodes t1, t2, ..., th with weights w1, w2, ..., wh, which combine into the output y.]
- Principal components can be thought of as a data pre-processing step
- Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional t vector
- The weights correspond to the H eigenvectors corresponding to the largest eigenvalues of XTX
35. Predictions on Test Cases with DK-SOM
Use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (red patient IDs). The darker hexagons, colored during a separate training phase, represent nodes corresponding with ischemia cases.
36. Outlier/Novelty Detection Methods in Analyze/StripMiner
- One-class SVM with LibSVM, with auto-tuning for regularization
- Outliers flagged on Self-Organizing Maps (SOMs and DK-SOMs)
- Extended pharmaplots
  - PCA-based pharmaplot
  - PLS-based pharmaplot
  - K-PLS-based pharmaplot
  - K-PCA-based pharmaplot
- Will explore outlier detection options with CardioMag data
  - 1152 mixed wavelet descriptors
  - 74 training data and 10 test data
37. Outlier Detection Procedure in Analyze
Start → run one-class SVM on the training data (proprietary regularization mechanism) → determine the number of outliers from the elbow plot → eliminate the outliers from the training set → run K-PLS for the new training/test data → see whether the outliers make sense on pharmaplots (outliers are flagged in the pharmaplots) → inspect outlier clusters on SOMs → end. Output: list of outlier pattern IDs.
38. Tagging Outliers on Pharmaplot with Analyze Code
39. Elbow Plot for Specifying Outliers: elbows suggest 7-14 outliers
40. One-Class SVM Results for MCG Data
41. Outlier/Novelty Detection Methods in Analyze: Hypotheses
- One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
  - used publicly available SVM code (LibSVM)
  - Analyze has user-friendly interface operators for using LibSVM
- Proprietary heuristic tuning for C in SVMs
  - heuristic tuning method explained in previous publications
  - heuristic tuning is essential to make outlier detection work properly
- Elbow curves for indicating outliers
- Pharmaplot justifies/validates detection from different methods
- Pharmaplots extended to PLS, K-PCA, and K-PLS pharmaplots
42. One-Class SVM: Brief Theory
- Well-known method for outlier/novelty detection in the SVM literature (e.g., see Suykens)
- LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin)
- Analyze has operators to interface with LibSVM
- Theory
  - One-class SVM ignores the response (assumes all zeros for responses)
  - Maximizes spread and subtracts a regularization term
  - Suykens, p. 203 has the following formulation
  - The formulation contains a regularization parameter; Analyze has a proprietary way to determine it
- Application
  - Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
  - Analyze has elbow curves to assist the user in determining outliers
43. NIPALS Algorithm for PLS (with just one response variable y)
Do for h latent variables:
- Start for a PLS component
- Calculate the score t
- Calculate c
- Calculate the loading p
- Store t in T, store p in P, store w in W
- Deflate the data matrix and the response variable
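The steps listed above can be sketched in numpy for the single-response case. This is a minimal illustration of the algorithm, not the authors' implementation, and the data are hypothetical:

```python
import numpy as np

def nipals_pls1(X, y, h):
    """NIPALS PLS with a single response y, following the slide's steps.
    Returns scores T, loadings P, weights W, and regression coefficients b."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(h):                    # do for h latent variables
        w = X.T @ y                       # start for a PLS component
        w /= np.linalg.norm(w)
        t = X @ w                         # calculate the score t
        tt = t @ t
        c = (t @ y) / tt                  # calculate c
        p = (X.T @ t) / tt                # calculate the loading p
        T.append(t); P.append(p); W.append(w); C.append(c)   # store t, p, w
        X -= np.outer(t, p)               # deflate the data matrix ...
        y -= t * c                        # ... and the response variable
    W, P, C = np.array(W).T, np.array(P).T, np.array(C)
    b = W @ np.linalg.solve(P.T @ W, C)   # for new data: y_hat = X_new @ b
    return np.array(T).T, P, W, b

# Hypothetical data with the earlier slides' dimensions (19 x 7)
rng = np.random.default_rng(0)
X0 = rng.normal(size=(19, 7))
y0 = X0 @ rng.normal(size=7)
T, P, W, b = nipals_pls1(X0, y0, h=3)
```

With as many latent variables as the rank of X, PLS1 reproduces the ordinary least-squares fit; with fewer, it gives a regularized low-rank model.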
44. Outlier/Novelty Detection Methods in Analyze
- Outlier detection methods were extensively tested
  - on a variety of different UCI data sets
  - models sometimes showed significant improvement after removal of outliers
  - models were rarely worse
  - outliers could be validated on pharmaplots and lead to enhanced insight
- The pharmaplots confirm the validity of outlier detection with one-class SVM
- Prediction on the test set for the albumin data improves the model
- A non-pharmaceutical (medical) data set actually shows two data points in the training set that probably were given wrong labels (Appendix A)
45. (No transcript)
46. Innovations in Analyze for Outlier Detection
- User-friendly procedure with automated processes
- Interface for one-class SVM from LibSVM
- Automated tuning for regularization parameters
- Elbow plots to determine the number of outliers
- Combination of LibSVM outliers with pharmaplots
  - efficient visualization of outliers
  - facilitates interpretation of outliers
- Extended pharmaplots: PCA, K-PCA, PLS, K-PLS
- User-friendly and efficient SOM with outlier identification
- Direct-kernel-based outlier detection as an alternative to LibSVM
47. Kernel PLS (K-PLS)
- Invented by Rosipal and Trejo (J. Machine Learning Research, December 2001)
- Can be considered as the poor man's support vector machine (SVM)
- They first altered linear PLS by dealing with eigenvectors of XXT
- They also made the NIPALS PLS formulation resemble PCA more
- Now a nonlinear kernel-based correlation matrix K(XXT) rather than XXT is used
- The nonlinear correlation matrix contains nonlinear similarities of datapoints rather than linear correlations
- An example is the Gaussian kernel similarity measure

Linear PLS: w1 is an eigenvector of XTYYTX, and t1 is an eigenvector of XXTYYT; subsequent w's and t's follow from deflations. The w's are orthonormal and the t's are orthogonal; the p's are not orthogonal, but each p is orthogonal to the earlier w's.
Kernel PLS: the trick is a different normalization; now the t's rather than the w's are normalized. t1 is an eigenvector of K(XXT)YYT, and subsequent w's and t's follow from deflations of XXT.
48. Principal Component Analysis (PCA)
- We introduce a modest set of the h most important principal components, Tnh
- Replace the data Xnm by the most important principal components Tnh
- The most important T's are the ones corresponding to the largest eigenvalues of XTX
- The B's are the eigenvectors of XTX, ordered from largest to lowest eigenvalue
- In practice the calculation of the B's and T's proceeds iteratively with the NIPALS algorithm
- NIPALS: nonlinear iterative partial least squares (Herman Wold)
[Figure: data axes x1, x2, x3 with response y and principal directions t1, t2.]
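The iterative NIPALS computation of the B's and T's can be sketched as alternating least squares in numpy (a minimal illustration on hypothetical data, not the authors' code):

```python
import numpy as np

def nipals_pca(X, h, n_iter=500):
    """NIPALS-style iteration for the h leading principal components.
    Each loading b converges to an eigenvector of X^T X (largest
    eigenvalue first); X is deflated after each component."""
    X = X.astype(float).copy()
    T, B = [], []
    for _ in range(h):
        t = X[:, 0].copy()                 # initialize the score with a column of X
        for _ in range(n_iter):
            b = X.T @ t
            b /= np.linalg.norm(b)         # loading: unit eigenvector direction
            t = X @ b                      # score: projection of the data on b
        T.append(t); B.append(b)
        X -= np.outer(t, b)                # deflate before the next component
    return np.array(T).T, np.array(B).T

# Hypothetical data with the earlier slides' dimensions (19 x 7)
X0 = np.random.default_rng(0).normal(size=(19, 7))
T, B = nipals_pca(X0, h=2)
```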
49. Partial Least Squares (PLS)
- Similar to PCA
- PLS: Partial Least Squares / Projection to Latent Structures / "Please Listen to Svante"
- The T's are now called scores or latent variables, and the p's are the loading vectors
- The loading vectors are not orthogonal anymore and are influenced by the y vector
- A special version of NIPALS is also used to build up the t's
[Figure: data axes x1, x2, x3 with response y and latent directions t1, t2.]
50. Kernel PLS (K-PLS)
- Invented by Rosipal and Trejo (J. Machine Learning Research, December 2001)
- Consider K-PLS as a better and nonlinear PLS
- K-PLS gives almost identical results to SVMs for the QSAR data we tried
- K-PLS is a lot faster than SVMs
[Figure: data axes x1, x2, x3 with response y and latent directions t1, t2.]
51. (No transcript)
52. (No transcript)
53. Validation Model: 100x leave-10-out validations
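The 100x leave-10-out scheme can be sketched as repeated random splits. The sizes below (74 training records, as mentioned on slide 36) are illustrative, and any model could be fit inside the loop:

```python
import numpy as np

def leave_k_out_splits(n, k, n_rounds, seed=0):
    """Yield (train_idx, test_idx) index pairs for repeated leave-k-out
    validation: each round holds out k randomly chosen records."""
    rng = np.random.default_rng(seed)
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        yield perm[k:], perm[:k]

# 100x leave-10-out, as on the slide (data sizes are hypothetical)
splits = list(leave_k_out_splits(n=74, k=10, n_rounds=100))
```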
54. Feature Selection (data strip mining)
- PLS, K-PLS, SVM, ANN
- Fuzzy expert system rules
- GA or sensitivity analysis to select descriptors