Kernel Methods for Classification: From Theory to Practice - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Kernel Methods for Classification: From Theory to Practice


1
Kernel Methods for Classification: From Theory to Practice
  • 14 Sept 2009: Iris Adä, Michael Berthold,
    Ulrik Brandes, Martin Mader, Uwe Nagel

2
Goals of the Tutorial
  • At lunch time on Tuesday, you will
  • Have learned about Linear Classifiers and SVMs
  • Have improved a kernel-based classifier
  • Know what Finnish looks like
  • Have a hunch what a kernel is
  • Have had a chance at winning a trophy.

3
Outline Monday (13:15 - 23:30)
  • The Theory
  • Motivation: Learning Classifiers from Data
  • Linear Classifiers
  • Delta Learning Rule
  • Kernel Methods: Support Vector Machines
  • Dual Representation
  • Maximal Margin
  • Kernels
  • The Environment
  • KNIME: A Short Intro
  • Practical Stuff
  • How to develop nodes in KNIME
  • Install on your laptop(s)
  • You work, we rest
  • Invent a new (and better) Kernel
  • Dinner
  • (Invent an even better Kernel)

4
Outline Tuesday (9:00 - 12:00)
  • 9:00 - 11:00 refine your kernel
  • 11:00 score test data set
  • 11:13 winning kernel(s) presented
  • 12:00 Lunch and Award Ceremony

5
Learning Models
  • Assumptions
  • no major influence of non-observed inputs

(Diagram: observed inputs and other inputs feed into the System; the observed inputs together with the observed outputs form the Data.)
6
Predicting Outcomes
  • Assumptions
  • static system

(Diagram: new inputs are fed into the Model, which produces predicted outputs.)
7
Learning Classifiers from Data
  • Training data consists of inputs with labels, e.g.
  • Credit card transactions (fraud: no/yes)
  • Handwritten letters (A, ..., Z)
  • Drug candidate classification (toxic / non-toxic)
  • Multi-class classification problems can be
    reduced to binary yes/no classifications
  • Many, many algorithms around. Why?
  • Choice of algorithm influences generalization
    capability
  • There is no best algorithm for all classification
    problems

8
Linear Discriminant
  • Simple linear, binary classifier
  • Class A if f(x) is positive
  • Class B if f(x) is negative
  • e.g. $f(x) = \langle w, x \rangle + b$ is the
    decision function.
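(Illustrative numbers, not from the slides: with $w = (1, -2)$ and $b = 0.5$, the point $x = (2, 0.5)$ gives $f(x) = 1 \cdot 2 - 2 \cdot 0.5 + 0.5 = 1.5 > 0$, so $x$ is assigned to class A.)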

9
Linear Discriminant
  • Linear discriminants represent hyperplanes in
    feature space

10
Primal Perceptron
  • Rosenblatt (1959) introduced a simple learning
    algorithm for linear discriminants (the perceptron)

11
Rosenblatt Algorithm
  • The algorithm is
  • On-line (pattern-by-pattern approach)
  • Mistake driven (updates only in case of a wrong
    classification)
  • Convergence is guaranteed if a hyperplane
    exists which classifies all training data
    correctly (the data are linearly separable)
  • Learning rule: on a mistake on $(x_i, y_i)$, update
    $w \leftarrow w + \eta\, y_i x_i$ and $b \leftarrow b + \eta\, y_i$
    (see the sketch below)
  • One observation: the weight vector (if initialized
    properly) is simply a weighted sum of input vectors
    (b is even more trivial)
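A minimal sketch of this primal perceptron in Python (the function name, the zero initialization, and the learning-rate default are our own choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron. X: (n_samples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # mistake-driven: update only when (xi, yi) is classified wrongly
            if yi * (np.dot(w, xi) + b) <= 0:
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:  # converged: all training data classified correctly
            break
    return w, b
```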

12
Dual Representation
  • The weight vector is a weighted sum of the input
    vectors: $w = \sum_i \alpha_i y_i x_i$
  • difficult training patterns have larger alpha
  • easier ones have smaller or zero alpha

13
Dual Representation
  • Dual Representation of the linear discriminant
    function
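(The slide's formula is not reproduced in the transcript; in the dual representation the discriminant reads $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$.)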

14
Dual Representation
  • Dual Representation of Learning Algorithm
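The dual form of the learning algorithm is also not reproduced in the transcript; a minimal Python sketch consistent with the description on the next slide (one alpha per training example, incremented on every mistake, data entering only through inner products) might look like this:

```python
import numpy as np

def train_dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron. X: (n_samples, n_features), y: NumPy array in {-1, +1}."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    G = X @ X.T  # all pairwise inner products; could be pre-computed once
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * G[:, i]) + b  # dual decision value for x_i
            if y[i] * f_i <= 0:
                alpha[i] += 1.0  # "hard" examples accumulate larger alpha
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```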

15
Dual Representation
  • Learning rule: on a mistake on example $i$,
    increase $\alpha_i$
  • Harder-to-learn examples have larger alpha
    (higher information content)
  • The information about the training examples enters
    the algorithm only through the inner products (which
    we could pre-compute!)

16
Dual Representation in other spaces
  • All we need for training
  • Computation of inner products of all training
    examples
  • If we train in a different space
  • Computation of inner products in the projected
    space

17
Kernel Functions
  • A kernel allows us (via K) to compute the inner
    product of two points x and y in the projected
    space without ever entering that space...
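(Formally: for a projection $\Phi$ into the feature space, a kernel is a function $K$ with $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ for all $x, y$.)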

18
in Kernel Land
  • The discriminant function in our projected space
  • And, using a kernel
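(Neither formula survives in the transcript; written out, the discriminant in the projected space is $f(x) = \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b$, which with a kernel becomes $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$.)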

19
The Gram Matrix
  • All data necessary for
  • The decision function
  • The training of the coefficients
  • can be pre-computed and stored in a Kernel or Gram
    matrix with entries $K_{ij} = K(x_i, x_j)$
  • (If the Gram matrix is symmetric and positive
    semi-definite for every choice of points, then
    K(.,.) is a Kernel.)
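A minimal sketch of pre-computing such a Gram matrix in Python (the Gaussian kernel shown is just one possible choice; all names here are our own):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Pre-compute K[i, j] = kernel(x_i, x_j) for all training examples."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # Gram matrices are symmetric
    return K
```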

20
Kernels
  • A simple kernel is
  • And the corresponding projected space
  • Why?
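(The slide's formulas are not reproduced here; a standard textbook instance is $K(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$ with projected space $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. Expanding $\langle \Phi(x), \Phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = \langle x, y \rangle^2$ answers the "Why?".)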

21
Kernels
  • A few (slightly less) simple kernels are
  • And the corresponding projected spaces are of
    dimension
  • computing the inner products in the projected
    space becomes pretty expensive rather quickly
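(As one concrete case of this growth, assuming the homogeneous polynomial kernel $K(x, y) = \langle x, y \rangle^d$ on $\mathbb{R}^n$: the projected space has one coordinate per monomial of degree $d$, i.e. $\binom{n+d-1}{d}$ dimensions, while evaluating $K$ itself only costs one inner product in $\mathbb{R}^n$.)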

22
Kernels
  • Gaussian Kernel
  • Polynomial Kernel of degree d
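(The exact parametrizations on the slide are not reproduced; in their standard forms, the Gaussian kernel is $K(x, y) = \exp(-\lVert x - y \rVert^2 / (2\sigma^2))$ and the polynomial kernel of degree $d$ is $K(x, y) = (\langle x, y \rangle + c)^d$.)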

23
Why?
  • Great, we can also apply Rosenblatt's algorithm
    to other spaces implicitly.
  • So what?

24
Transformations
25
Polynomial Kernel
26
Gauss Kernel
27
Kernels
  • Note that we do not need to know the projection
    $\Phi$; it is sufficient to prove that K(.) is a
    Kernel.
  • A few notes
  • Kernels are modular and closed: we can compose
    new Kernels from existing ones (see the note
    after this slide)
  • Kernels can be defined over non-numerical objects
  • text, e.g. string-matching kernels
  • images, trees, graphs, ...
  • Note also: a good Kernel is crucial
  • If the Gram matrix is diagonal, classification is
    easy, and useless
  • No Free Kernel: with too many irrelevant attributes
    the Gram matrix becomes (nearly) diagonal.
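(The closure properties referred to above, in their standard form: if $K_1$ and $K_2$ are kernels and $c > 0$, then $c\,K_1$, $K_1 + K_2$ and $K_1 \cdot K_2$ are again kernels, so new kernels can be composed from existing ones.)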

28
Finding Linear Discriminants
  • Finding the hyperplane (in any space) still
    leaves lots of room for variation, doesn't it?
  • We can define margins of individual training
    examples (appropriately normalized, this is a
    geometric margin)
  • The margin of a hyperplane (with respect to a
    training set)
  • And the maximal margin of a training set is the
    maximum margin over all hyperplanes.
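(The margin formulas are missing from the transcript; in standard notation, the functional margin of an example $(x_i, y_i)$ is $\hat{\gamma}_i = y_i (\langle w, x_i \rangle + b)$, its geometric margin is $\gamma_i = \hat{\gamma}_i / \lVert w \rVert$, the margin of a hyperplane is $\min_i \gamma_i$ over the training set, and the maximal margin is the maximum of this quantity over all hyperplanes.)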

29
(maximum) Margin of a Hyperplane
30
Support Vector Machines
  • Dual Representation
  • Classifier as a weighted sum over inner products of
    the training patterns (or only the support vectors)
    and the new pattern.
  • Training works analogously.
  • Kernel-induced feature space
  • Transformation into a higher-dimensional
    space (where we will hopefully be able to find a
    linear separating hyperplane).
  • Representation of the solution through few support
    vectors (alpha > 0).
  • Maximum Margin Classifier
  • Reduction of capacity (bias) via maximization of
    the margin (and not via reduction of degrees of
    freedom).
  • Efficient parameter estimation: see the IDA book.

31
Soft and Hard Margin Classifiers
  • What can we do if no linear separating hyperplane
    exists?
  • Instead of insisting on a hard margin, allow
    minor violations
  • Introduce (positive) slack variables (patterns
    with slack are allowed to lie inside the margin)
  • Misclassifications are allowed if the margin term
    may become negative, i.e. the slack exceeds 1.
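(In the standard soft-margin formulation this reads: minimize $\tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$; an example is misclassified when $\xi_i > 1$.)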

32
Kernel Methods Summary
  • Main idea of Kernel Methods
  • Embed data into a suitable vector space
  • Find a linear classifier (or other linear pattern
    of interest) in the new space
  • Needed: a mapping (implicit or explicit)
  • Key Assumptions
  • Information about relative position is often all
    that is needed by learning methods
  • The inner products between points in the
    projected space can be computed in the original
    space using special functions (kernels).

33
Support Vector Machines
  • Powerful classifier
  • computation of the optimal classifier is possible
  • Choice of kernel is critical

34
KNIME
  • Coffee Break.
  • And then
  • KNIME, the Konstanz Information Miner
  • SVMs (and other classifiers) in KNIME