Transcript and Presenter's Notes

Title: Machine Learning and Data Mining via Mathematical Programming Based Support Vector Machines


1
Machine Learning and Data Mining via Mathematical
Programming Based Support Vector Machines
  • Glenn M. Fung

May 8, 2003
Ph.D. Dissertation Talk, University of
Wisconsin-Madison
2
Thesis Overview
  • Proximal support vector machines (PSVM)
  • Binary Classification
  • Multiclass Classification
  • Incremental Classification (massive datasets)
  • Knowledge-based SVMs (KSVM)
  • Linear KSVM
  • Extension to Nonlinear KSVM
  • Sparse classifiers
  • Data selection for linear classifiers
  • Minimize number of support vectors
  • Minimal kernel classifiers
  • Feature selection Newton method for SVM.
  • Semi-Supervised SVMs
  • Finite Newton method for Lagrangian SVM
    classifiers

3
Outline of Talk
  • (Standard) Support vector machine (SVM)
  • Classify by halfspaces
  • Proximal support vector machine (PSVM)
  • Classify by proximity to planes
  • Incremental PSVM classifiers
  • Synthetic dataset consisting of 1 billion points
    in 10-dimensional input space
    classified in less than 2 hours and 26 minutes
  • Knowledge based SVMs
  • Incorporate prior knowledge sets into
    classifiers
  • Minimal kernel classifiers
  • Reduce data dependence of nonlinear classifiers

4
Support Vector Machines: Maximizing the Margin
between Bounding Planes
(Figure: two bounding planes separating the point sets A+ and A-.)
5
Standard Support Vector Machine: Algebra of the
2-Category Linearly Separable Case
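The algebra can be sketched as follows, in the notation standard for this line of work (rows of the m-by-n matrix A are the data points, D is the diagonal matrix of +1/-1 labels, e is a vector of ones); these symbols are an assumption, not taken from the slide:

    x'w \ge \gamma + 1 \quad \text{for points in class } A+,
    x'w \le \gamma - 1 \quad \text{for points in class } A-,
    \text{distance between the two bounding planes} = \frac{2}{\lVert w \rVert_2}.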
6
Standard Support Vector Machine Formulation
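A sketch of the standard soft-margin formulation in this notation, with slack vector y and weight \nu > 0 (assumed names):

    \min_{w,\,\gamma,\,y}\;\; \nu\, e'y + \tfrac{1}{2}\, w'w
    \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad y \ge 0.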
7
Proximal Support Vector Machines: Fitting the Data
Using Two Parallel Bounding Planes
(Figure: two parallel proximal planes, one fitted to the points of A+ and one to the points of A-.)
8
PSVM Formulation
We have from the QP SVM formulation
This simple but critical modification changes the
nature of the optimization problem tremendously!
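For concreteness, a sketch of the proximal modification as it appears in the PSVM literature: the inequality constraint becomes an equality, the slack enters as a squared 2-norm, and \gamma joins the regularization term:

    \min_{w,\,\gamma,\,y}\;\; \tfrac{\nu}{2}\, \lVert y \rVert^2 + \tfrac{1}{2}\,(w'w + \gamma^2)
    \quad \text{s.t.} \quad D(Aw - e\gamma) + y = e.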
9
Advantages of New Formulation
  • Objective function remains strongly convex
  • An explicit exact solution can be written in
    terms of the problem data
  • PSVM classifier is obtained by solving a single
    system of linear equations in the usually
    small-dimensional input space
  • Exact leave-one-out correctness can be obtained
    in terms of problem data

10
Linear PSVM
  • Setting the gradient equal to zero gives a
    nonsingular system of linear equations.
  • Solution of the system gives the desired PSVM
    classifier

11
Linear PSVM Solution
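Writing E = [A  -e] and stacking the unknowns as z = (w, \gamma) (shorthand of my own, not necessarily the slide's), the explicit solution of the resulting linear system is:

    z = \begin{pmatrix} w \\ \gamma \end{pmatrix}
      = \left(\frac{I}{\nu} + E'E\right)^{-1} E' D e,
    \qquad E = [\,A \;\; -e\,].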
12
Linear Proximal SVM Algorithm
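A minimal numpy sketch of the resulting algorithm; the function and parameter names (linear_psvm, nu) are illustrative, not from the talk:

    import numpy as np

    def linear_psvm(A, d, nu=1.0):
        # A: m-by-n data matrix, d: length-m vector of +1/-1 labels.
        # Returns (w, gamma) from a single (n+1)-by-(n+1) linear system.
        m, n = A.shape
        E = np.hstack([A, -np.ones((m, 1))])    # E = [A  -e]
        H = np.eye(n + 1) / nu + E.T @ E        # I/nu + E'E
        rhs = E.T @ d                           # E'De, since De = d
        z = np.linalg.solve(H, rhs)
        return z[:n], z[n]

    def classify(X, w, gamma):
        # Assign +1 if x'w - gamma >= 0, else -1.
        return np.sign(X @ w - gamma)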
13
Nonlinear PSVM Formulation
14
The Nonlinear Classifier
  • Where K is a nonlinear kernel, e.g. the
    Gaussian kernel sketched below
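A small numpy sketch of the Gaussian (radial basis) kernel; the bandwidth parameter mu is an assumed name:

    import numpy as np

    def gaussian_kernel(A, B, mu=0.1):
        # K(A, B)_{ij} = exp(-mu * ||A_i - B_j||^2) for rows A_i of A and B_j of B.
        sq = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-mu * np.maximum(sq, 0.0))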

15
Nonlinear PSVM
However, reduced kernel techniques (RSVM) can be
used to reduce dimensionality, as sketched below.
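A minimal sketch of the reduced-kernel idea under the assumption that a small random subset of the rows of A serves as kernel centers; it reuses gaussian_kernel from the sketch above, and the subset fraction is illustrative:

    import numpy as np

    def reduced_kernel(A, mu=0.1, frac=0.01, seed=0):
        # Keep roughly frac of the rows of A as kernel centers A_bar,
        # so the kernel matrix is m-by-m_bar instead of m-by-m.
        rng = np.random.default_rng(seed)
        m = A.shape[0]
        idx = rng.choice(m, size=max(1, int(frac * m)), replace=False)
        return gaussian_kernel(A, A[idx], mu)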
16
Nonlinear Proximal SVM Algorithm

Solve a single system of linear equations, as in
the linear case, with the kernel matrix taking the
place of the data matrix A.
17
Incremental PSVM Classification
18
Linear Incremental Proximal SVM Algorithm
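A minimal numpy sketch of the incremental idea: accumulate E'E and E'De block by block, keeping only an (n+1)-by-(n+1) matrix and an (n+1)-vector in memory, then solve once. Names and block handling are illustrative assumptions:

    import numpy as np

    def incremental_psvm(blocks, n, nu=1.0):
        # blocks: iterable of (A_i, d_i) chunks that need not fit in memory together.
        M = np.zeros((n + 1, n + 1))        # running E'E
        q = np.zeros(n + 1)                 # running E'De
        for A_i, d_i in blocks:
            E_i = np.hstack([A_i, -np.ones((A_i.shape[0], 1))])
            M += E_i.T @ E_i
            q += E_i.T @ d_i
        z = np.linalg.solve(np.eye(n + 1) / nu + M, q)
        return z[:n], z[n]                  # w, gamma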
19
Linear Incremental Proximal SVM: Adding and Retiring Data
  • Capable of modifying an existing linear
    classifier by both adding and retiring data
  • Retiring old data is handled in the same simple
    way as adding new data (e.g. financial data,
    where old data is obsolete); see the update
    sketch below
  • Option of keeping old data and merging it with
    the new data (e.g. medical data, where old data
    does not become obsolete)
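A small sketch of such an update under the stated assumptions: adding a block adds its contribution to the accumulated matrices, retiring a block subtracts it, and the classifier is then recomputed from the small system (the names continue from the sketch above):

    import numpy as np

    def e_matrix(A_i):
        return np.hstack([A_i, -np.ones((A_i.shape[0], 1))])

    def update_and_solve(M, q, A_new, d_new, A_old=None, d_old=None, nu=1.0):
        # Add the new block's contribution; optionally retire an obsolete block.
        E_new = e_matrix(A_new)
        M, q = M + E_new.T @ E_new, q + E_new.T @ d_new
        if A_old is not None:
            E_old = e_matrix(A_old)
            M, q = M - E_old.T @ E_old, q - E_old.T @ d_old
        n = M.shape[0] - 1
        z = np.linalg.solve(np.eye(n + 1) / nu + M, q)
        return M, q, z[:n], z[n]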

20
Numerical Experiments: One-Billion-Point Two-Class
Dataset
  • Synthetic dataset consisting of 1 billion points
    in 10-dimensional input space
  • Generated by NDC (Normally Distributed
    Clustered) dataset generator
  • Dataset divided into 500 blocks of 2 million
    points each.
  • Solution obtained in less than 2 hours and 26
    minutes
  • About 30% of the time was spent reading data
    from disk.
  • Testing set correctness: 90.79%

21
Numerical Experiments: Simulation of a Two-Month,
60-Million-Point Dataset
  • Synthetic dataset consisting of 60 million
    points (1 million per day) in 10-dimensional
    input space
  • Generated using NDC
  • At the beginning, we only have data
    corresponding to the first month
  • Every day:
  • The oldest block of data is retired (1 million points)
  • A new block is added (1 million points)
  • A new linear classifier is calculated daily
  • Only an 11 by 11 matrix is kept in memory at the
    end of each day. All other data is purged.

22
Numerical Experiments: Separator Changing Through
Time
23
Numerical Experiments: Normals to the Separating
Hyperplanes Corresponding to 5-Day Intervals
24
Support Vector Machines: Linear Programming
Formulation
  • Use the 1-norm instead of the 2-norm
  • This is equivalent to a linear program; one
    standard form is sketched below
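A sketch of one standard way to write the 1-norm SVM as a linear program, introducing a vector s that bounds |w| componentwise (this particular form is an assumption, not necessarily the slide's):

    \min_{w,\,\gamma,\,y,\,s}\;\; \nu\, e'y + e's
    \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \quad -s \le w \le s, \quad y \ge 0.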

25
Support Vector Machines: Maximizing the Margin
between Bounding Planes
(Figure: two bounding planes separating the point sets A+ and A-.)
26
Incorporating Knowledge Sets Into an SVM
Classifier
  • We will show that this implication (a polyhedral
    knowledge set contained in the halfspace of one
    class) is equivalent to a set of constraints that
    can be imposed on the classification problem.

27
Knowledge Set Equivalence Theorem
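A sketch of the equivalence in the form used in knowledge-based SVMs, assuming the polyhedral knowledge set {x : Bx \le b} is nonempty (the notation is mine, following the KSVM papers):

    Bx \le b \;\Rightarrow\; x'w \ge \gamma + 1
    \quad \Longleftrightarrow \quad
    \exists\, u \ge 0 :\;\; B'u + w = 0, \;\; b'u + \gamma + 1 \le 0.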
28
Knowledge-Based SVM Classification
29
Knowledge-Based SVM Classification
30
Knowledge-Based LP SVM with Slack Variables
31
Knowledge-Based SVM via Polyhedral Knowledge
Sets
32
Numerical Testing: The Promoter Recognition Dataset
  • Promoter: a short DNA sequence that precedes a
    gene sequence.
  • A promoter consists of 57 consecutive DNA
    nucleotides belonging to {A, G, C, T}.
  • Important to distinguish between promoters and
    nonpromoters
  • This distinction identifies starting locations
    of genes in long uncharacterized DNA sequences.

33
The Promoter Recognition Dataset: Comparative Test
Results
34
Minimal Kernel Classifiers: Model Simplification
  • Goal 1: Minimize number of kernel functions used.
  • Why? Simplifies separating surface.
  • Reduces storage.
  • Goal 2: Minimize number of active constraints.
  • Why? Reduces data dependence.
  • Useful for massive incremental classification.

35
Model Simplification Goal 1: Simplifying the
Separating Surface
36
Model Simplification Goal 2: Minimize Data
Dependence
  • By KKT conditions

Hence
37
Achieving Model Simplification: Minimal Kernel
Classifier Formulation
38
The # (Pound) Loss Function
39
Approximating the Pound Loss Function
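A tiny numpy sketch contrasting the pound (counting) function with a smooth concave exponential approximation of the kind used for such problems; the smoothing parameter alpha is an assumption:

    import numpy as np

    def pound(t):
        # #(t): number of nonzero components of t.
        return np.count_nonzero(t)

    def pound_approx(t, alpha=5.0):
        # Concave approximation sum_i (1 - exp(-alpha * |t_i|)),
        # which approaches #(t) as alpha grows.
        return np.sum(1.0 - np.exp(-alpha * np.abs(t)))

    t = np.array([0.0, 0.01, 0.5, 2.0])
    print(pound(t), pound_approx(t))   # 3 vs. roughly 2 at alpha = 5; closer for larger alpha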
40
Minimal Kernel Classifier as a Concave
Minimization Problem
  • The problem can be solved effectively using the
    finite Successive Linearization Algorithm (SLA)
    (Mangasarian, 1996); a sketch of the iteration
    follows below
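For reference, a sketch of the generic SLA step for minimizing a differentiable concave function f over a polyhedral set S (my statement of the standard scheme, not the slide's exact formulation): each iteration solves the linear program

    v^{k+1} \in \arg\min_{v \in S}\; \nabla f(v^k)'\,(v - v^k),

and the method stops, after finitely many steps, when \nabla f(v^k)'(v^{k+1} - v^k) = 0, i.e. at a point satisfying the minimum-principle necessary optimality condition.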

41
Minimal Kernel Algorithm (SLA)
42
Minimal Kernel Algorithm (SLA)
  • Each iteration of the algorithm solves a
    linear program.
  • The algorithm terminates in a finite number of
    iterations (typically 5 to 7 iterations).
  • Solution obtained satisfies the Minimum
    Principle necessary optimality condition.

43
(No Transcript)
44
Checkerboard Separating Surface: # of Kernel
Functions 27, # of Active Constraints 30
45
Conclusions (PSVM)
  • PSVM is an extremely simple procedure for
    generating linear and nonlinear classifiers by
    solving a single system of linear equations
  • Comparable test set correctness to standard SVM
  • Much faster than standard SVMs: typically an
    order of magnitude less time.
  • We also proposed an extremely simple procedure
    for generating linear classifiers in an
    incremental fashion for huge datasets.
  • The proposed algorithm has the ability to retire
    old data and add new data in a very simple
    manner.
  • Only a matrix of the size of the input space is
    kept in memory at any time.

46
Conclusions (KSVM)
  • Prior knowledge easily incorporated into
    classifiers through polyhedral knowledge sets.
  • Resulting problem is a simple LP.
  • Knowledge sets can be used with or without
    conventional labeled data.
  • In either case KSVM is better than most
    knowledge-based classifiers.

47
Conclusions (Minimal Kernel Classifiers)
  • A finite algorithm that generates a classifier
    depending on a fraction of the input data only.
  • Important for fast online testing of unseen data,
    e.g. fraud or intrusion detection.
  • Useful for incremental training of massive data.
  • Overall algorithm consists of solving 5 to 7
    LPs.
  • Kernel data dependence reduced by up to 98.8% of
    the data used by a standard SVM.
  • Testing time reduction of up to 98.2%.
  • MKC testing set correctness comparable to that of
    the more complex standard SVM.