
An Introduction to Support Vector Machine

Classification

Bioinformatics Lecture 7/2/2003

by Pierre Dönnes

Outline

- What do we mean by classification, and why is it useful?
- Machine learning: basic concepts
- Support Vector Machines (SVM)
- Linear SVM: basic terminology and some formulas
- Non-linear SVM: the kernel trick
- An example: predicting protein subcellular location with SVM
- Performance measurements

Classification

- Every day, all the time, we classify things.
- E.g. crossing the street:
- Is there a car coming?
- At what speed?
- How far is it to the other side?
- Classification: safe to walk or not!

- Decision tree learning (sketched in code below):
- IF (Outlook = Sunny) AND (Humidity = High)
- THEN PlayTennis = No
- IF (Outlook = Sunny) AND (Humidity = Normal)
- THEN PlayTennis = Yes
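As a toy illustration, the two rules above can be written directly as code; a minimal sketch (attribute and value names follow the slide, the rest is hypothetical):

```python
def play_tennis(outlook: str, humidity: str) -> str:
    """Toy decision rules from the slide: sunny days depend on humidity."""
    if outlook == "Sunny" and humidity == "High":
        return "No"
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    return "Unknown"  # the slide only specifies the two sunny-day rules

print(play_tennis("Sunny", "High"))    # -> No
print(play_tennis("Sunny", "Normal"))  # -> Yes
```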

Classification tasks in Bioinformatics

- Learning task:
- Given: expression profiles of leukemia patients and healthy persons.
- Compute: a model that distinguishes whether a person has leukemia from expression data.
- Classification task:
- Given: the expression profile of a new patient and a learned model.
- Determine: whether the patient has leukemia or not.

Problems in classifying biological data

- Often high dimensionality of the data.
- Hard to set up simple rules.
- Large amounts of data.
- Need automated ways to deal with the data.
- Use computers: data processing, statistical analysis, trying to learn patterns from the data (machine learning).

Examples are: Support Vector Machines, Artificial Neural Networks, Boosting, and Hidden Markov Models.

Black box view of Machine Learning

Training data → Magic black box (learning machine) → Model

Training data: expression patterns from cancer patients and expression data from healthy persons.

Model: the model can distinguish between healthy and sick persons, and can be used for prediction.

Tennis example 2

[Scatter plot: Temperature vs. Humidity, with points labeled "play tennis" and "do not play tennis".]

Linear Support Vector Machines

Data: {(x_i, y_i)}, i = 1, …, l, where x_i ∈ R^d and y_i ∈ {-1, +1}.

[Plot: two classes of points (+1 and -1) in the (x_1, x_2) plane.]

Linear SVM 2

Data: {(x_i, y_i)}, i = 1, …, l, where x_i ∈ R^d and y_i ∈ {-1, +1}.

All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (remember the equation for a hyperplane from algebra!).

Our aim is to find such a hyperplane, with decision function f(x) = sign(w·x + b), that correctly classifies our data.
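As a minimal sketch of how this looks in code (scikit-learn is my assumption here, not something the slides use; the toy data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 2-D data: two linearly separable clusters labeled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear")
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]  # the hyperplane parameters w and b
x_new = np.array([1.5, 1.0])
print(np.sign(w @ x_new + b))           # f(x) = sign(w·x + b)
print(clf.predict([x_new]))             # the same decision via the library
```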

Definitions

Define the hyperplane H such that:
x_i·w + b ≥ +1 when y_i = +1
x_i·w + b ≤ -1 when y_i = -1

H1 and H2 are the planes:
H1: x_i·w + b = +1
H2: x_i·w + b = -1
The points on the planes H1 and H2 are the support vectors.

d+ is the shortest distance to the closest positive point.
d- is the shortest distance to the closest negative point.

The margin of a separating hyperplane is d+ + d-.

Maximizing the margin

We want a classifier with as big a margin as possible.

[Plot: separating hyperplane H with the margin planes H1 and H2 on either side.]

Recall that the distance from a point (x_0, y_0) to a line Ax + By + c = 0 is |Ax_0 + By_0 + c| / sqrt(A^2 + B^2).

The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||.

The distance between H1 and H2 is therefore 2 / ||w||.

In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
x_i·w + b ≥ +1 when y_i = +1
x_i·w + b ≤ -1 when y_i = -1
These can be combined into: y_i(x_i·w + b) ≥ 1.
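Written out (a standard reformulation not spelled out on the slide: minimizing ||w|| is equivalent to minimizing ½||w||², which gives a convex quadratic program):

```latex
\min_{w,\,b} \; \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad
y_i\,(x_i \cdot w + b) \ge 1, \qquad i = 1, \dots, l.
```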

The Lagrangian trick

Reformulate the optimization problem: a trick often used in optimization is to make a Lagrangian formulation of the problem. The constraints are then replaced by constraints on the Lagrange multipliers, and the training data occur only as dot products.

This gives us the task:
Max L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j),
subject to: w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.

What we need to see: x_i and x_j (the input vectors) appear only in the form of a dot product. We will soon see why that is important.
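A small numeric sanity check of the condition w = Σ_i α_i y_i x_i, assuming the fitted clf and toy data X, y from the linear-SVM sketch above (scikit-learn stores the products α_i·y_i of the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.svm import SVC

# Continuing with the linearly separable toy data (X, y) from earlier:
clf = SVC(kernel="linear").fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the primal
# weight vector can be rebuilt from the dual solution:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True: w = sum_i alpha_i y_i x_i
```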

Problems with linear SVM

[Plot: points labeled -1 and +1 arranged so that no straight line can separate them.]

What if the decision function is not linear?

Non-linear SVM 1

The kernel trick:

Imagine a function Φ that maps the data into another space: Φ: R^d → H.

[Plot: data that is not linearly separable in R^d becomes linearly separable after the mapping Φ.]

Non-linear SVM 2

The function we end up optimizing is:
Max L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j),
subject to: w = Σ_i α_i y_i Φ(x_i) and Σ_i α_i y_i = 0.

Another kernel example: the polynomial kernel,
K(x_i, x_j) = (x_i·x_j + 1)^p, where p is a tunable parameter. Evaluating K requires only one addition and one exponentiation more than the original dot product.
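As a sketch, the polynomial kernel is a one-liner, and scikit-learn also accepts such a callable directly (the XOR-style data and p = 2 are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(X, Y, p=2):
    """K(x_i, x_j) = (x_i . x_j + 1)^p, evaluated for all pairs at once."""
    return (X @ Y.T + 1) ** p

# The classic XOR pattern: no linear hyperplane can separate these labels.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel=poly_kernel)
clf.fit(X, y)
print(clf.predict(X))  # the degree-2 polynomial kernel separates XOR
```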

Solving the optimization problem

- In many cases, any general-purpose optimization package that solves linearly constrained problems will do:
- Newton's method
- Conjugate gradient descent
- Other methods involve nonlinear programming techniques.

Overtraining/overfitting

A well-known problem with machine learning methods is overtraining. This means that we have learned the training data very well, but cannot classify unseen examples correctly.

An example: a botanist who really knows trees. Every time he sees a new tree, he claims it is not a tree.

Overtraining/overfitting 2

A measure of the risk of overtraining with SVM (there are also other measures):

It can be shown that the portion, n, of unseen data that will be misclassified is bounded by
n ≤ (number of support vectors) / (number of training examples).
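As a sketch of how this bound could be read off a fitted model (again assuming the scikit-learn clf and training data X from the earlier snippets):

```python
# Generalization bound from the slide, for a fitted scikit-learn SVC:
n_sv = len(clf.support_)  # indices of the support vectors
bound = n_sv / len(X)     # (number of support vectors) / (training examples)
print(f"{n_sv} support vectors out of {len(X)} examples: error bound <= {bound:.2f}")
```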

Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane.

Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000.

A practical example, protein localization

- Proteins are synthesized in the cytosol.
- They are transported into different subcellular locations, where they carry out their functions.
- Aim: to predict in which location a certain protein will end up!

Subcellular Locations

Method

- Hypothesis: the amino acid composition of proteins from different compartments should differ.
- Extract proteins with known subcellular location from SWISSPROT.
- Calculate the amino acid composition of the proteins.
- Try to differentiate between cytosolic, extracellular, mitochondrial and nuclear proteins by using SVM.

Input encoding

Prediction of nuclear proteins: label the known nuclear proteins as +1 and all others as -1. The input vector x_i represents the amino acid composition, e.g. x_i = (4.2, 6.7, 12, …, 0.5), corresponding to (A, C, D, …, Y); a code sketch follows below.

[Diagram: Nuclear (+1) and All others (-1) → SVM → Model]
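A minimal sketch of this encoding (the 20-letter alphabet is the standard one; the function name and toy sequence are illustrative, not from the original method):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(sequence: str) -> list[float]:
    """Percentage of each amino acid in the sequence, in alphabet order."""
    seq = sequence.upper()
    return [100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

x_i = aa_composition("MKVLAAGICKDE")  # toy sequence
print(x_i)  # one 20-dimensional input vector for the SVM
```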

Cross-validation

Cross-validation: split the data into n sets, train on n-1 sets, and test on the set left out of training.

[Diagram: the nuclear and "all others" examples are split into three folds (1, 2, 3); each fold in turn is the test set while the remaining two folds form the training set.]
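A sketch of such n-fold cross-validation with scikit-learn, here with n = 3 as in the diagram (the synthetic data stands in for the composition vectors and ±1 labels):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-ins for the composition vectors and +1/-1 labels.
rng = np.random.default_rng(1)
X = rng.random((30, 20))            # 30 "proteins", 20 composition features
y = np.where(X[:, 0] > 0.5, 1, -1)  # toy labels, for illustration only

# 3-fold cross-validation: train on 2 folds, test on the held-out fold,
# rotating so that every fold serves as the test set exactly once.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=3)
print(scores, scores.mean())        # per-fold accuracy and its average
```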

Performance measurements

[Diagram: test data (labels +1 / -1) passed through the model; each prediction falls into one of four categories.]

TP = true positive: a +1 example predicted as +1
FP = false positive: a -1 example predicted as +1
TN = true negative: a -1 example predicted as -1
FN = false negative: a +1 example predicted as -1
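From these four counts, the usual rates follow. A sketch using scikit-learn's confusion_matrix (y_true and y_pred are placeholder values, not results from the slides):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, -1, -1, -1]   # placeholder test labels
y_pred = [1, -1, 1, -1, 1, -1]   # placeholder model predictions

# With labels ordered [-1, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
sensitivity = tp / (tp + fn)  # fraction of true positives found
specificity = tn / (tn + fp)  # fraction of true negatives found
print(tp, fp, tn, fn, sensitivity, specificity)
```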

Results

- We definitely get some predictive power out of our models.
- There seems to be a difference in the composition of proteins from different subcellular locations.
- Another question: what about nuclear proteins? Is there a difference between DNA-binding proteins and others?

Conclusions

- We have (hopefully) learned some basic concepts and terminology of SVM.
- We know about the risk of overtraining and how to put a measure on the risk of bad generalization.
- SVMs can be useful, for example, in predicting the subcellular location of proteins.

You can't input just anything into a learning machine!

Image classification of tanks: autofire when an enemy tank is spotted. Input data: photos of own and enemy tanks. It worked really well with the training set used. In reality it failed completely.

Reason: all enemy tank photos were taken in the morning, all photos of own tanks at dusk. The classifier had learned to recognize dusk from dawn!

References

http://www.kernel-machines.org/

http://www.support-vector.net/

Papers by Vapnik.

C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery 2:121-167, 1998.