1
Max-margin Classification of Data with Absent
Features
  • Gal Chechik, Geremy Heitz, Gal Elidan, Pieter
    Abbeel, and Daphne Koller
  • Journal of Machine Learning Research 2008

2
Outline
  • Introduction
  • Background knowledge
  • Problem Description
  • Algorithms
  • Experiments
  • Conclusions

3
Introduction
  • In traditional supervised learning, data
    instances are viewed as feature vectors in a
    high-dimensional space.
  • Why are features missing?
  • noise
  • undefined parts of objects
  • structural absence
  • etc.

4
Introduction
  • How do we handle classification when features are missing?
  • Fundamental methods for filling in values:
  • expectation maximization (EM)
  • Markov chain Monte Carlo (MCMC)
  • However, some features are genuinely non-existent
    rather than merely having an unknown value.
  • Goal: classify without filling in missing values.

5
Background
  • Support Vector Machines (SVMs)
  • Second-Order Cone Programming (SOCP)

6
Support Vector Machines
  • Support Vector Machines (SVMs): a supervised
    learning method used for classification and
    regression.
  • They simultaneously minimize the empirical
    classification error and maximize the geometric
    margin, hence they are also called maximum-margin
    classifiers.

7
Support Vector Machines
  • Given a set of n labeled samples x1, ..., xn in a
    feature space F of dimension d, each sample xi has a
    binary class label yi ∈ {-1, +1}.
  • We want to find the maximum-margin hyperplane
    that separates the samples with yi = +1 from those
    with yi = -1.

8
Support Vector Machines
  • Any hyperplane can be written as the set of
    points x satisfying w·x + b = 0, where w
    is a normal vector and b determines the offset of the
    hyperplane from the origin along w.
  • The hyperplane separates the samples into two classes,
    so we require w·xi + b > 0 when yi = +1 and
    w·xi + b < 0 when yi = -1, i.e. yi (w·xi + b) > 0.

9
Support Vector Machines
  • Geometric margin: we define the margin as
    ρ = min_i yi (w·xi + b) / ||w||, and learn a classifier w
    by maximizing ρ.
  • This gives the optimization problem
    max_{w,b} min_i yi (w·xi + b) / ||w||,
  • or, equivalently, the quadratic programming (QP)
    problem
    min_{w,b} (1/2) ||w||^2   s.t.  yi (w·xi + b) ≥ 1 for all i.

10
Support Vector Machines
The hyperplane H3 does not separate the two classes.
H1 does, with a small margin, and H2 with the
maximum margin.
11
Support Vector Machines
  • Soft-margin SVMs: when the training samples are
    not linearly separable, we introduce slack
    variables ξi into the SVM.
  • The problem becomes
    min_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξi
    s.t.  yi (w·xi + b) ≥ 1 - ξi,  ξi ≥ 0,
  • where C controls the trade-off between training accuracy and
    model complexity (a small sketch of this QP follows below).
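As a minimal sketch only (not from the slides), the soft-margin primal can be written down directly with the CVXPY modeling library; the toy data X, labels y, and the constant C below are placeholder assumptions.

```python
import cvxpy as cp
import numpy as np

# Toy data (assumption): n samples, d features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=40))

C = 1.0                                  # accuracy / complexity trade-off
w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(40, nonneg=True)        # slack variables

# min (1/2)||w||^2 + C * sum(xi)  s.t.  yi (w·xi + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```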

12
Support Vector Machines
  • The dual problem of SVMs: we use the Lagrangian
    to solve the primal SVM,
    L(w, b, α) = (1/2) ||w||^2 - Σ_i αi [ yi (w·xi + b) - 1 ].
  • Setting the first derivatives of L to 0 gives
    w = Σ_i αi yi xi  and  Σ_i αi yi = 0.

13
Support Vector Machines
  • Substituting these back, we then derive
  • the dual problem of the SVM:
    max_α  Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj (xi·xj)
    s.t.  αi ≥ 0,  Σ_i αi yi = 0.

14
Second Order Cone Programming
  • Second-order cone programming (SOCP): a convex
    optimization problem of the form
    min_x  f^T x   s.t.  ||Ai x + bi|| ≤ ci^T x + di,  i = 1, ..., m,
  • where x is the optimization variable.
  • We can solve SOCPs using off-the-shelf solvers such as MOSEK
    (a small sketch follows below).
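For illustration only, a minimal SOCP sketch using CVXPY's second-order cone constraint; the problem data are arbitrary placeholders, and CVXPY can hand the problem to MOSEK if it is installed.

```python
import cvxpy as cp
import numpy as np

# Placeholder problem data (assumptions for the sketch).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))
b = 0.1 * rng.normal(size=3)
c = rng.normal(size=5)
d = 2.0
f = rng.normal(size=5)

x = cp.Variable(5)
constraints = [
    cp.SOC(c @ x + d, A @ x + b),   # ||A x + b|| <= c^T x + d
    cp.norm(x, 2) <= 5.0,           # extra cone constraint, keeps the toy problem bounded
]
prob = cp.Problem(cp.Minimize(f @ x), constraints)
prob.solve()                         # or prob.solve(solver=cp.MOSEK) if MOSEK is available
print(prob.value, x.value)
```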

15
Problem Description
  • Given a set of n labeled samples x1, ..., xn in a
    feature space F of dimension d, each sample xi has a
    binary class label yi ∈ {-1, +1}.
  • Let Fi denote the set of features that are valid for the
    ith sample. Each sample xi can be viewed as embedded
    in its relevant subspace spanned by the features in Fi.

16
Problem Description
  • In the traditional SVM, we try to maximize the
    geometric margin ρ over all instances.
  • In the case of missing features, the margin, which
    measures the distance to the hyperplane, is no longer well
    defined.
  • We can't use the traditional SVM directly in this case.

17
Problem Description
  • We should measure the margin of each instance in
    its own relevant subspace.
  • We define the instance margin ρi(w) for the
    ith instance as
    ρi(w) = yi (w(i)·xi + b) / ||w(i)||,
  • where w(i) is the vector obtained by taking the
    entries of w that are relevant for xi (a small sketch
    follows below).
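As an illustration (not from the slides), instance margins can be computed with NumPy using a boolean mask that marks which features are valid for each sample; all names below are placeholders.

```python
import numpy as np

def instance_margins(X, y, mask, w, b=0.0):
    """Margin of each instance in its own relevant subspace.

    X    : (n, d) data, arbitrary values where features are absent
    mask : (n, d) boolean, True where a feature is valid
    """
    Xv = np.where(mask, X, 0.0)                  # absent entries contribute nothing
    scores = Xv @ w + b                          # w(i)·xi + b, since masked entries are zero
    norms = np.sqrt((mask * w**2).sum(axis=1))   # ||w(i)||, norm of w on valid features
    return y * scores / norms

# Toy usage with placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
mask = rng.random((5, 3)) > 0.3
y = np.array([1, -1, 1, 1, -1])
w = np.array([0.5, -1.0, 0.2])
print(instance_margins(X, y, mask, w))
```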

18
Problem Description
  • We define the new geometric margin to be the
    minimum over all instance margins, that is
    ρ(w) = min_i ρi(w) = min_i yi (w(i)·xi + b) / ||w(i)||,
  • and arrive at a new optimization problem for the
    missing-features case:
    max_{w,b} min_i yi (w(i)·xi + b) / ||w(i)||.
19
Problem Description
  • However, since different margin terms are
    normalized by different norms ||w(i)||, we can't
    take 1/||w|| out of the minimization as in the standard SVM.
  • Besides, each of the terms yi (w(i)·xi + b) / ||w(i)|| is
    non-convex in w, so the problem is difficult to solve
    directly.

20
Algorithms
  • How can we solve this optimization problem?
  • Three approaches:
  • Linearly separable case: a convex formulation
  • The general case:
  • Average Norm
  • Instance-specific Margins

21
A Convex Formulation
  • In the linearly separable case, we can transform
    the optimization problem into a series of convex
    optimization problems.
  • First step:
    max_{w,b,ρ} ρ   s.t.  min_i yi (w(i)·xi + b) / ||w(i)|| ≥ ρ.

By maximizing a lower bound ρ, we take the
minimization term out of the objective function and
into the constraints. The resulting problem
is equivalent to the original one, since the bound
ρ can always be increased until it is perfectly
tight.
22
A Convex Formulation
  • Second step: replace the single minimum constraint by
    one constraint per instance,
    yi (w(i)·xi + b) / ||w(i)|| ≥ ρ  for i = 1, ..., n.

From a single constraint to multiple constraints.
23
A Convex Formulation
  • Finally, we write
    max_{w,b,ρ} ρ   s.t.  yi (w(i)·xi + b) ≥ ρ ||w(i)||  for all i,

because ρ ≤ yi (w(i)·xi + b) / ||w(i)|| must hold for all instances.
24
A Convex Formulation
  • Assume first that ρ is given. For any fixed value
    of ρ, the problem obeys the general structure of an
    SOCP: each constraint ρ ||w(i)|| ≤ yi (w(i)·xi + b) is a
    second-order cone constraint.

The structures are similar.
25
A Convex Formulation
  • We can solve it by doing a bisection search over
    ρ, where in each iteration we solve one SOCP
    problem.
  • However, one problem is that any scaled version
    of a solution is also a solution.

Each constraint is invariant to a rescaling of w,
and the null case w = 0, b = 0 is always a
solution.
26
A Convex Formulation
  • How can we rule out these degenerate solutions? We can add
    constraints:
  • a non-vanishing norm, e.g. ||w|| ≥ 1, or
  • a constraint on a single entry of w, one for
    each entry.
  • We can solve the SOCP twice for each entry of w,
    once with that entry constrained to be ≥ 1 and once
    constrained to be ≤ -1.
  • It becomes a total of 2d problems (see the sketch below).

No longer convex!
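To make this search concrete, here is a rough sketch of the bisection over ρ combined with the 2d sign-constrained problems; solve_socp is a hypothetical helper standing in for a call to an SOCP solver such as MOSEK, and the initial bracket [lo, hi] is an arbitrary assumption.

```python
def max_margin_bisection(solve_socp, d, lo=0.0, hi=10.0, tol=1e-4):
    """Bisection search over the margin rho.

    solve_socp(rho, entry, sign) is assumed to solve the SOCP for fixed rho
    with the extra constraint sign * w[entry] >= 1, returning a solution
    (w, b) if feasible and None otherwise.
    """
    best = None
    while hi - lo > tol:
        rho = 0.5 * (lo + hi)
        sol = None
        # The 2d sign-constrained problems rule out the vanishing solution w = 0.
        for entry in range(d):
            for sign in (+1.0, -1.0):
                sol = solve_socp(rho, entry, sign)
                if sol is not None:
                    break
            if sol is not None:
                break
        if sol is not None:
            best, lo = sol, rho      # rho is achievable: try a larger margin
        else:
            hi = rho                 # rho is infeasible: try a smaller margin
    return best
```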
27
A Convex Formulation
  • The convex formulation is difficult to extend to the
    non-separable case.

We can't be sure that the problem is jointly convex in w
and the slack variables, because the slack variables are not
normalized by ||w(i)||.
28
A Convex Formulation
  • In this case, the vanishing solution w = 0
    is also encountered.
  • We can no longer guarantee that the modified
    approach discussed above will coincide with the original
    problem.
  • So this formulation isn't likely to
    be of practical use in the non-separable case.

29
Average Norm
  • We consider an alternative solution based on an
    approximation of the margin.
  • We can approximate the different norms ||w(i)||
    by a common term that doesn't depend on the
    instance.

30
Average Norm
  • Replace each low-dimensional norm ||w(i)|| by
    the root-mean-square of these norms over all instances,
    ||w||_avg = sqrt( (1/n) Σ_i ||w(i)||^2 )   (see the sketch below).
  • When all samples have all features, this is the same
    as the original SVM.
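For illustration only, a NumPy sketch of this root-mean-square approximation under the same masking convention as the earlier sketch; the names are placeholders.

```python
import numpy as np

def rms_norm(w, mask):
    """Root-mean-square of the per-instance norms ||w(i)||.

    mask : (n, d) boolean, True where a feature is valid for a sample.
    """
    per_instance_sq = (mask * w**2).sum(axis=1)   # ||w(i)||^2 for each sample
    return np.sqrt(per_instance_sq.mean())

# When every feature is valid, this reduces to the usual ||w||.
w = np.array([3.0, 4.0])
full = np.ones((10, 2), dtype=bool)
print(rms_norm(w, full))   # 5.0, same as np.linalg.norm(w)
```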

31
Average Norm
  • In the case of missing features, the
    approximation will also be good if all
    the norms ||w(i)|| are equal.
  • When the norms are nearly equal, we expect
    to find nearly optimal solutions.

32
Average Norm
  • Substituting the common average norm for each ||w(i)||,
    we can derive the following problems.

33
Average Norm
  • The linearly separable case:
    min_{w,b} (1/2)(1/n) Σ_i ||w(i)||^2   s.t.  yi (w(i)·xi + b) ≥ 1.
  • The non-separable case:
    min_{w,b,ξ} (1/2)(1/n) Σ_i ||w(i)||^2 + C Σ_i ξi
    s.t.  yi (w(i)·xi + b) ≥ 1 - ξi,  ξi ≥ 0.

They can be solved using the same techniques as
standard SVMs: they are quadratic programming problems!
34
Average Norm
  • However, the average norm isn't expected to perform
    well if the norms ||w(i)|| vary considerably.
  • How can we solve the problem in that case?
  • The instance-specific margins approach.

35
Instance-specific Margins
  • We can represent each of the norms ||w(i)|| as a
    scaling of the full norm ||w||.
  • By defining scaling coefficients si = ||w(i)|| / ||w||,
    we can rewrite the instance margin as
    ρi(w) = yi (w(i)·xi + b) / (si ||w||)   (see the sketch below).
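A tiny illustrative sketch (placeholder names) of these scaling coefficients, reusing the masking convention from the earlier sketches:

```python
import numpy as np

def scaling_coefficients(w, mask):
    """si = ||w(i)|| / ||w|| for each instance, given a validity mask."""
    per_instance = np.sqrt((mask * w**2).sum(axis=1))   # ||w(i)||
    return per_instance / np.linalg.norm(w)

w = np.array([0.5, -1.0, 0.2])
mask = np.array([[True, True, True],
                 [True, False, True],
                 [False, True, False]])
print(scaling_coefficients(w, mask))   # 1.0 for the fully observed first instance
```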

36
Instance-specific Margins
  • Following the same steps as above, we can derive

the separable case,
  min_{w,b} (1/2) ||w||^2   s.t.  yi (w(i)·xi + b) / si ≥ 1,
and the non-separable case,
  min_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξi
  s.t.  yi (w(i)·xi + b) / si ≥ 1 - ξi,  ξi ≥ 0.
37
Instance-specific Margins
  • We consider how to solve it:
  • a projected gradient approach.

Not a quadratic programming problem!
It's not even convex in w, since si depends on w.
38
Instance-specific Margins
  • Projected gradient approach: one iterates between
    steps in the direction of the gradient of the
    Lagrangian
  • and projections onto the constraint set.
  • With the right choice of step sizes, it
    converges to a local minimum.
  • Other solutions?

39
Instance-specific Margins
  • If we fix the values of the si, the problem becomes a
    quadratic programming problem.
  • We can use this fact to devise an iterative
    algorithm.

For any fixed values of si, the problem is a QP!
40
Instance-specific Margins
  • For a given tuple of si's, we solve a QP for w,
    and then use the resulting w to calculate new
    si's.
  • To solve the QP, we derive its dual for the given
    si's:
    max_α  Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj ⟨xi, xj⟩ / (si sj)
    s.t.  0 ≤ αi ≤ C,  Σ_i αi yi = 0.

The dual problem of SVMs!
41
Instance-specific Margins
  • The inner product ⟨xi, xj⟩ is taken only over
    features that are valid for both xi and xj.
  • We discuss kernels for the modified SVM later.

42
Instance-specific Margins
  • Iterative optimization/projection algorithm
    (a sketch follows below)

The dual problem of SVMs.
Convergence isn't always guaranteed.
The dual solution is used to find the optimal
classifier by setting w = Σ_i (αi yi / si) xi, where absent
entries of xi are treated as zero.
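An illustrative sketch of this iterate-between-QP-and-si-update idea (my own arrangement, not the authors' reference code): since the dual above is the dual of a standard SVM with the kernel Kij = ⟨xi, xj⟩ / (si sj), the QP step can be delegated to an off-the-shelf SVM solver with a precomputed kernel. The helper names and scikit-learn usage are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def iterative_missing_feature_svm(X, y, mask, C=1.0, n_iter=10):
    """Alternate between the SVM dual (solved as a standard SVM with a
    precomputed kernel) and the update si = ||w(i)|| / ||w||."""
    X0 = np.where(mask, X, 0.0)              # zero-fill absent features
    n = X0.shape[0]
    base_K = X0 @ X0.T                       # <xi, xj> over features valid for both
    s = np.ones(n)                           # start from si = 1 (plain SVM)
    for _ in range(n_iter):
        K = base_K / np.outer(s, s)          # dual kernel for the current si
        clf = SVC(C=C, kernel="precomputed").fit(K, y)
        alpha = np.zeros(n)
        alpha[clf.support_] = np.abs(clf.dual_coef_[0])
        # Recover w from the dual: w = sum_i (alpha_i yi / si) xi (zero-filled).
        w = ((alpha * y / s)[:, None] * X0).sum(axis=0)
        # Projection step: recompute the scaling coefficients from the new w.
        s = np.sqrt((mask * w**2).sum(axis=1)) / np.linalg.norm(w)
        s = np.clip(s, 1e-6, None)           # guard against empty subspaces
    return w, s
```

As the slide notes, convergence of this loop is not always guaranteed, so in practice one would also track the objective and keep the best iterate.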
43
Instance-specific Margins
  • Two other approaches for optimizing this

44
Instance-specific Margins
  • Updating approach: minimize a modified objective over s,
    subject to additional constraints.
  • Hybrid approach: combine gradient ascent over s
    with a QP for w.
  • These approaches didn't perform as well as the
    iterative approach above.

45
Kernels for missing features
  • Why use kernels with SVMs?
  • Some common kernels:
  • Polynomial: k(x, x') = (x·x')^p
  • Polynomial (inhomogeneous): k(x, x') = (x·x' + 1)^p
  • Radial basis function: k(x, x') = exp(-γ ||x - x'||^2)
  • Gaussian radial basis function: k(x, x') = exp(-||x - x'||^2 / (2σ^2))
  • Sigmoid: k(x, x') = tanh(κ x·x' + c)

(Figure: RBF kernel for an SVM.)
46
Kernels for missing features
  • In the dual formulation above, the dependence on
    the instances is only through their inner product.
  • We focus on kernels with this same kind of dependence:
  • polynomial kernels,
  • sigmoid kernels.

47
Kernels for missing features
  • For a polynomial kernel
    k(xi, xj) = (⟨xi, xj⟩ + 1)^p, define the modified kernel as
    k~(xi, xj) = (⟨xi, xj⟩_F + 1)^p,
  • with the inner product calculated over the valid
    features of both instances,
    ⟨xi, xj⟩_F = Σ_{k ∈ Fi ∩ Fj} xik xjk.
  • We define xi^0 as the vector that
    replaces invalid entries (missing
    features) of xi with zeros.

48
Kernels for missing features
  • We then have k~(xi, xj) = k(xi^0, xj^0), simply
    because multiplying by zero is equivalent to
    skipping the missing values.
  • This gives us kernels for missing features
    (a small sketch follows below).
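For illustration, a NumPy sketch of this zero-filling construction for a polynomial kernel; the degree and variable names are placeholder assumptions.

```python
import numpy as np

def missing_feature_poly_kernel(X, mask, degree=2):
    """k~(xi, xj) = (<xi, xj>_F + 1)^degree, where the inner product runs
    only over features valid for both instances; zero-filling the absent
    entries gives exactly the same result."""
    X0 = np.where(mask, X, 0.0)          # replace absent entries with 0
    return (X0 @ X0.T + 1.0) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
mask = rng.random((4, 6)) > 0.25
K = missing_feature_poly_kernel(X, mask)
print(K.shape)   # (4, 4) Gram matrix, usable as a precomputed SVM kernel
```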

49
Experiments
  • Three experiments:
  • Features missing at random.
  • Visual object recognition: features are missing
    because they can't be located in the image.
  • Biological network completion (metabolic pathway
    reconstruction): the missingness pattern of the features
    is determined by the known structure of the
    network.

50
Experiments
  • Five common approaches for filling in missing
    features are used as baselines (a sketch of some of them
    follows below):
  • Zero
  • Mean
  • Flag
  • kNN
  • EM
  • They are compared with the Average Norm and Geometric
    Margins approaches.
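For context only, a sketch of the zero, mean, and kNN fill-in baselines using scikit-learn imputers (the flag and EM baselines are omitted); the data are placeholders, with NaN marking absent features.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 0.5, np.nan],
              [np.nan, 1.5, 2.0]])

X_zero = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```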

51
Missing at Random
  • Features are missing at random
  • Data sets from the UCI repository.
  • MNIST images.

(Figure: samples with missing features.)
52
Missing at Random
  • Experiment results

(Figures: results on data sets from UCI and on MNIST images;
the Geometric Margins approach has good performance.)
53
Visual Object Recognition
  • Visual object recognition: determine whether an
    object from a certain class is present in a given
    input image.
  • For example, the trunk of a car may not be found in a
    picture of a hatchback car.
  • Such features are structurally missing.

54
Visual Object Recognition
  • The object model contains a set of landmarks,
    defining the outline of an object.
  • We find several matches in a given image.

Five matches for the front windshield landmark.
55
Visual Object Recognition
  • In the car model, we located up to 10
    matches (candidates) for each of the 19 landmarks.
  • For each candidate, we compute the first 10
    principal component (PCA) coefficients of the
    image patch.

56
Visual Object Recognition
  • We concatenate these descriptors to form
    1900 (19 × 10 × 10) features per image (see the sketch below).
  • If the number of descriptors for a given landmark
    is less than 10, we consider the rest to be
    structurally absent.
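A small illustrative sketch (placeholder names, not the authors' pipeline) of assembling such a feature vector, with NaN marking the structurally absent descriptor slots:

```python
import numpy as np

N_LANDMARKS, N_CANDIDATES, N_PCA = 19, 10, 10   # 19 x 10 x 10 = 1900 features

def image_feature_vector(candidate_descriptors):
    """candidate_descriptors: list of 19 arrays, each of shape (k, 10) with
    k <= 10 PCA coefficient rows per found candidate; unfilled slots stay NaN."""
    feats = np.full((N_LANDMARKS, N_CANDIDATES, N_PCA), np.nan)
    for lm, desc in enumerate(candidate_descriptors):
        k = min(len(desc), N_CANDIDATES)
        feats[lm, :k, :] = desc[:k]          # fill only the candidates we found
    return feats.reshape(-1)                 # 1900-dimensional feature vector

# Toy usage: landmark 0 has 3 candidates, the other landmarks have none.
toy = [np.zeros((3, N_PCA))] + [np.empty((0, N_PCA))] * (N_LANDMARKS - 1)
x = image_feature_vector(toy)
print(x.shape, np.isnan(x).sum())            # (1900,) 1870
```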

57
Visual Object Recognition
  • Experiment results

58
Visual Object Recognition
  • Examples

59
Metabolic Pathway Reconstruction
  • Metabolic pathway reconstruction: predicting
    missing enzymes in metabolic pathways.
  • Instances in this task have missing features due
    to the structure of the biochemical network.

60
Metabolic Pathway Reconstruction
  • Cells use a complex network of chemical reactions
    to produce their building blocks.

(Figure: molecular compounds connected by reactions; each
enzyme catalyzes a reaction.)
61
Metabolic Pathway Reconstruction
  • For many reactions, the enzyme responsible for
    their catalysis is unknown, making it an
    important computational task to predict the
    identity of such missing enzymes.
  • How can we predict them?
  • Enzymes in local network neighborhoods
    usually participate in related functions.

62
Metabolic Pathway Reconstruction
  • Different types of network neighborhood relations
    between enzyme pairs lead to different relations
    between their properties.
  • Three types:
  • forks (same inputs, different outputs)
  • funnels (same outputs, different inputs)
  • linear chains

63
Metabolic Pathway Reconstruction
(Figure: linear chains, forks (same inputs, different outputs),
and funnels (same outputs, different inputs).)
64
Metabolic Pathway Reconstruction
  • Each enzyme is represented using a vector of
    features that measure its relatedness to each of
    its different neighbors, across different data
    types.
  • A feature vector will have structurally missing
    entries if the enzyme does not have all types of
    neighbors.

65
Metabolic Pathway Reconstruction
  • Three types of data for enzyme attributes:
  • a compendium of gene expression assays,
  • the protein domain content of enzymes,
  • the cellular localization of proteins.
  • We use these data to measure the similarity
    between enzymes.

66
Metabolic Pathway Reconstruction
  • Similarity Measures for Enzyme Predictions

67
Metabolic Pathway Reconstruction
  • Positive examples: from the reactions with known
    enzymes.
  • Negative examples: by plugging a random impostor
    gene into each neighborhood.

68
Metabolic Pathway Reconstruction
  • Experiment results

69
Conclusions
  • A novel method for max-margin training of
    classifiers in the presence of missing features.
  • It classifies instances by skipping the
    non-existent features, rather than filling them in
    with hypothetical values.