Kernel Methods for Classification: From Theory to Practice - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Kernel Methods for Classification: From Theory to Practice


1
Kernel Methods for Classification: From Theory to Practice
  • 14 Sept 2009: Iris Adä, Michael Berthold,
    Ulrik Brandes, Martin Mader, Uwe Nagel

2
Goals of the Tutorial
  • At lunch time on Tuesday, you will
  • Have learned about Linear Classifiers and SVMs
  • Have improved a kernel-based classifier
  • Know what Finnish looks like
  • Have a hunch what a kernel is
  • Have had a chance at winning a trophy.

3
Outline Monday (13:15 - 23:30)
  • The Theory
  • Motivation: Learning Classifiers from Data
  • Linear Classifiers
  • Delta Learning Rule
  • Kernel Methods: Support Vector Machines
  • Dual Representation
  • Maximal Margin
  • Kernels
  • The Environment
  • KNIME: A Short Intro
  • Practical Stuff
  • How to develop nodes in KNIME
  • Install on your laptop(s)
  • You work, we rest
  • Invent a new (and better) Kernel
  • Dinner
  • (Invent an even better Kernel)

4
Outline Tuesday (9:00 - 12:00)
  • 9:00 - 11:00 refine your kernel
  • 11:00 score test data set
  • 11:13 winning kernel(s) presented
  • 12:00 Lunch and Award Ceremony

5
Learning Models
  • Assumptions
  • no major influence of non-observed inputs

(Diagram: observed inputs and other inputs feed into the System; the observed inputs together with the observed outputs form the Data.)
6
Predicting Outcomes
  • Assumptions
  • static system

(Diagram: new inputs are fed into the Model, which produces predicted outputs.)
7
Learning Classifiers from Data
  • Training data consists of inputs with labels, e.g.
  • Credit card transactions (fraud: no/yes)
  • Handwritten letters (A, ..., Z)
  • Drug candidate classification (toxic / non-toxic)
  • Multi-class classification problems can be
    reduced to binary yes/no classifications
  • Many, many algorithms around. Why?
  • Choice of algorithm influences generalization
    capability
  • There is no best algorithm for all classification
    problems

8
Linear Discriminant
  • Simple linear, binary classifier
  • Class A if f(x) is positive
  • Class B if f(x) is negative
  • e.g. $f(x) = \langle w, x \rangle + b$ is the
    decision function.
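(Illustrative numbers, not from the slides: with $w = (1, -2)$ and $b = 0.5$, the point $x = (2, 0.5)$ gives $f(x) = 1 \cdot 2 - 2 \cdot 0.5 + 0.5 = 1.5 > 0$, so $x$ is assigned to class A.)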

9
Linear Discriminant
  • Linear discriminants represent hyperplanes in
    feature space

10
Primal Perceptron
  • Rosenblatt (1959) introduced a simple learning
    algorithm for linear discriminants (the perceptron)

11
Rosenblatt Algorithm
  • The algorithm is
  • On-line (pattern-by-pattern approach)
  • Mistake driven (updates only in case of a wrong
    classification)
  • Convergence is guaranteed if a hyperplane
    exists which classifies all training data
    correctly (the data are linearly separable)
  • Learning rule: on a mistake on $(x_i, y_i)$, update
    $w \leftarrow w + \eta\, y_i x_i$ and $b \leftarrow b + \eta\, y_i$
    (see the sketch below)
  • One observation: the weight vector (if initialized
    properly) is simply a weighted sum of input vectors
    (b is even more trivial)
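A minimal sketch of this primal perceptron in Python (the function name, the zero initialization, and the learning-rate default are our own choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Primal perceptron. X: (n_samples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # mistake-driven: update only when (xi, yi) is classified wrongly
            if yi * (np.dot(w, xi) + b) <= 0:
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:  # converged: all training data classified correctly
            break
    return w, b
```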

12
Dual Representation
  • The weight vector is a weighted sum of the input
    vectors: $w = \sum_i \alpha_i y_i x_i$
  • difficult training patterns have larger alpha
  • easier ones have smaller or zero alpha

13
Dual Representation
  • Dual Representation of the linear discriminant
    function
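(The slide's formula is not reproduced in the transcript; in the dual representation the discriminant reads $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$.)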

14
Dual Representation
  • Dual Representation of Learning Algorithm
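The dual form of the learning algorithm is also not reproduced in the transcript; a minimal Python sketch consistent with the description on the next slide (one alpha per training example, incremented on every mistake, data entering only through inner products) might look like this:

```python
import numpy as np

def train_dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron. X: (n_samples, n_features), y: NumPy array in {-1, +1}."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    G = X @ X.T  # all pairwise inner products; could be pre-computed once
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_i = np.sum(alpha * y * G[:, i]) + b  # dual decision value for x_i
            if y[i] * f_i <= 0:
                alpha[i] += 1.0  # "hard" examples accumulate larger alpha
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```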

15
Dual Representation
  • Learning rule: on a mistake on example $i$,
    increase $\alpha_i$
  • Harder-to-learn examples have larger alpha
    (higher information content)
  • The information about the training examples enters
    the algorithm only through the inner products (which
    we could pre-compute!)

16
Dual Representation in other spaces
  • All we need for training
  • Computation of inner products of all training
    examples
  • If we train in a different space
  • Computation of inner products in the projected
    space

17
Kernel Functions
  • A kernel allows us (via K) to compute the inner
    product of two points x and y in the projected
    space without ever entering that space...
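(Formally: for a projection $\Phi$ into the feature space, a kernel is a function $K$ with $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ for all $x, y$.)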

18
in Kernel Land
  • The discriminant function in our projected space
  • And, using a kernel
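(Neither formula survives in the transcript; written out, the discriminant in the projected space is $f(x) = \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b$, which with a kernel becomes $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$.)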

19
The Gram Matrix
  • All data necessary for
  • The decision function
  • The training of the coefficients
  • can be pre-computed and stored in a Kernel or Gram
    matrix with entries $K_{ij} = K(x_i, x_j)$
  • (If the Gram matrix is symmetric and positive
    semi-definite for every choice of points, then
    K(.,.) is a Kernel.)
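A minimal sketch of pre-computing such a Gram matrix in Python (the Gaussian kernel shown is just one possible choice; all names here are our own):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Pre-compute K[i, j] = kernel(x_i, x_j) for all training examples."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])  # Gram matrices are symmetric
    return K
```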

20
Kernels
  • A simple kernel is
  • And the corresponding projected space
  • Why?
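(The slide's formulas are not reproduced here; a standard textbook instance is $K(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$ with projected space $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. Expanding $\langle \Phi(x), \Phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = \langle x, y \rangle^2$ answers the "Why?".)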

21
Kernels
  • A few (slightly less) simple kernels are
  • And the corresponding projected spaces are of
    dimension
  • computing the inner products in the projected
    space becomes pretty expensive rather quickly
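(As one concrete case of this growth, assuming the homogeneous polynomial kernel $K(x, y) = \langle x, y \rangle^d$ on $\mathbb{R}^n$: the projected space has one coordinate per monomial of degree $d$, i.e. $\binom{n+d-1}{d}$ dimensions, while evaluating $K$ itself only costs one inner product in $\mathbb{R}^n$.)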

22
Kernels
  • Gaussian Kernel
  • Polynomial Kernel of degree d
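(The exact parametrizations on the slide are not reproduced; in their standard forms, the Gaussian kernel is $K(x, y) = \exp(-\lVert x - y \rVert^2 / (2\sigma^2))$ and the polynomial kernel of degree $d$ is $K(x, y) = (\langle x, y \rangle + c)^d$.)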

23
Why?
  • Great, we can also apply Rosenblatt's algorithm
    to other spaces implicitly.
  • So what?

24
Transformations
25
Polynomial Kernel
26
Gauss Kernel
27
Kernels
  • Note that we do not need to know the projection
    $\Phi$; it is sufficient to prove that K(.) is a
    Kernel.
  • A few notes
  • Kernels are modular and closed: we can compose
    new Kernels from existing ones (see the note
    after this slide)
  • Kernels can be defined over non-numerical objects
  • text, e.g. string-matching kernels
  • images, trees, graphs, ...
  • Note also: a good Kernel is crucial
  • If the Gram matrix is diagonal, classification is
    easy, and useless
  • No Free Kernel: with too many irrelevant attributes
    the Gram matrix becomes (nearly) diagonal.
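(The closure properties referred to above, in their standard form: if $K_1$ and $K_2$ are kernels and $c > 0$, then $c\,K_1$, $K_1 + K_2$ and $K_1 \cdot K_2$ are again kernels, so new kernels can be composed from existing ones.)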

28
Finding Linear Discriminants
  • Finding the hyperplane (in any space) still
    leaves lots of room for variation, doesn't it?
  • We can define margins of individual training
    examples (appropriately normalized, this is a
    geometric margin)
  • The margin of a hyperplane (with respect to a
    training set)
  • And the maximal margin of a training set is the
    maximum margin over all hyperplanes.
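(The margin formulas are missing from the transcript; in standard notation, the functional margin of an example $(x_i, y_i)$ is $\hat{\gamma}_i = y_i (\langle w, x_i \rangle + b)$, its geometric margin is $\gamma_i = \hat{\gamma}_i / \lVert w \rVert$, the margin of a hyperplane is $\min_i \gamma_i$ over the training set, and the maximal margin is the maximum of this quantity over all hyperplanes.)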

29
(maximum) Margin of a Hyperplane
30
Support Vector Machines
  • Dual Representation
  • Classifier as a weighted sum over inner products of
    the training patterns (or only the support vectors)
    and the new pattern.
  • Training works analogously.
  • Kernel-induced feature space
  • Transformation into a higher-dimensional
    space (where we will hopefully be able to find a
    linear separating hyperplane).
  • Representation of the solution through few support
    vectors (alpha > 0).
  • Maximum Margin Classifier
  • Reduction of capacity (bias) via maximization of
    the margin (and not via reduction of degrees of
    freedom).
  • Efficient parameter estimation: see the IDA book.

31
Soft and Hard Margin Classifiers
  • What can we do if no linear separating hyperplane
    exists?
  • Instead of insisting on a hard margin, allow
    minor violations
  • Introduce (positive) slack variables (patterns
    with slack are allowed to lie inside the margin)
  • Misclassifications are allowed if the margin term
    may become negative, i.e. the slack exceeds 1.
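(In the standard soft-margin formulation this reads: minimize $\tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$; an example is misclassified when $\xi_i > 1$.)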

32
Kernel Methods Summary
  • Main idea of Kernel Methods
  • Embed data into a suitable vector space
  • Find a linear classifier (or other linear pattern
    of interest) in the new space
  • Needed: a mapping (implicit or explicit)
  • Key Assumptions
  • Information about relative position is often all
    that is needed by learning methods
  • The inner products between points in the
    projected space can be computed in the original
    space using special functions (kernels).

33
Support Vector Machines
  • Powerful classifier
  • computation of the optimal classifier is possible
  • Choice of kernel is critical

34
KNIME
  • Coffee Break.
  • And then
  • KNIME, the Konstanz Information Miner
  • SVMs (and other classifiers) in KNIME