A Survey on Distance Metric Learning Part 1 - PowerPoint PPT Presentation
1
A Survey on Distance Metric Learning (Part 1)
  • Gerry Tesauro
  • IBM T.J. Watson Research Center

2
Acknowledgement
  • Lecture material shamelessly adapted/stolen from
    the following sources
  • Kilian Weinberger
  • Survey on Distance Metric Learning slides
  • IBM summer intern talk slides (Aug. 2006)
  • Sam Roweis slides (NIPS 2006 workshop on
    Learning to Compare Examples)
  • Yann LeCun talk slides (NIPS 2006 workshop on
    Learning to Compare Examples)

3
Outline
Part 1
  • Motivation and Basic Concepts
  • ML tasks where it's useful to learn a distance metric
  • Overview of Dimensionality Reduction
  • Mahalanobis Metric Learning for Clustering with
    Side Info (Xing et al.)
  • Pseudo-metric online learning (Shalev-Shwartz et
    al.)
  • Neighbourhood Components Analysis (Goldberger
    et al.), Metric Learning by Collapsing Classes
    (Globerson & Roweis)
  • Metric Learning for Kernel Regression (Weinberger
    & Tesauro)
  • Metric learning for RL basis function
    construction (Keller et al.)
  • Similarity learning for image processing (LeCun
    et al.)

Part 2
4
Motivation
  • Many ML algorithms and tasks require a distance
    metric (equivalently, dissimilarity metric)
  • Clustering (e.g. k-means)
  • Classification / regression
  • Kernel methods
  • Nearest neighbor methods
  • Document/text retrieval
  • Find most similar fingerprints in DB to given
    sample
  • Find most similar web pages to document/keywords
  • Nonlinear dimensionality reduction methods
  • Isomap, Maximum Variance Unfolding, Laplacian
    Eigenmaps, etc.

5
Motivation (2)
  • Many problems may lack a well-defined, relevant
    distance metric
  • Incommensurate features → Euclidean distance not
    meaningful
  • Side information → Euclidean distance not
    relevant
  • Learning distance metrics may thus be desirable
  • A sensible similarity/distance metric may be
    highly task-dependent or semantic-dependent
  • What do these data points mean?
  • What are we using the data for?

6
Which images are most similar?
7
It depends ...
centered
left
right
8
male
female
It depends ...
9
... what you are looking for
student
professor
10
... what you are looking for
nature background
plain background
11
Key DML Concept: Mahalanobis distance metric
  • The simplest mapping is a linear transformation

12
Mahalanobis distance metric
  • The simplest mapping is a linear transformation

Algorithms can learn both matrices
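Since the slide's formulas did not survive the transcript, here is the standard form this slide refers to: a linear map $x \mapsto Lx$ induces the squared Mahalanobis distance

$$ d_M^2(x_i, x_j) \;=\; \|L x_i - L x_j\|^2 \;=\; (x_i - x_j)^\top M (x_i - x_j), \qquad M = L^\top L \succeq 0, $$

so an algorithm can equivalently learn the linear transformation $L$ or the PSD matrix $M$.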
13
>5 Minutes Introduction to Dimensionality
Reduction
14
How can the dimensionality be reduced?
  • eliminate redundant features
  • eliminate irrelevant features
  • extract low dimensional structure

15
Notation
Input: high-dimensional points x_1, ..., x_n ∈ R^d
Output: low-dimensional embeddings y_1, ..., y_n ∈ R^r, with r << d
Embedding principle:
Nearby points remain nearby, distant points
remain distant. Estimate r.
16
Two classes of DR algorithms
Linear
Non-Linear
17
Linear dimensionality reduction
18
Principal Component Analysis
(Jolliffe 1986)
Project data into subspace of maximum variance.
19
Optimization
20
Optimization
Eigenvalue solution
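The optimization on these slides, reconstructed in standard form since the original formulas were images: project onto the $r$ directions of maximum variance of the centered data,

$$ \max_{W \in \mathbb{R}^{d \times r},\; W^\top W = I} \; \operatorname{tr}\!\left(W^\top C W\right), \qquad C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top, $$

whose solution takes the columns of $W$ to be the $r$ leading eigenvectors of the covariance matrix $C$; the embedding is $y_i = W^\top (x_i - \bar{x})$.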
21
Facts about PCA
  • Eigenvectors of covariance matrix C
  • Minimizes sum-of-squared reconstruction error
  • Dimensionality r can be estimated from
    eigenvalues of C
  • PCA requires meaningful scaling of input features

22
Multidimensional Scaling (MDS)
23
Multidimensional Scaling (MDS)
24
Multidimensional Scaling (MDS)
inner product matrix
25
Multidimensional Scaling (MDS)
  • equivalent to PCA
  • use eigenvectors of inner-product matrix
  • requires only pairwise distances
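A minimal numpy sketch of classical MDS (the helper name classical_mds is ours): double-center the squared distance matrix to recover the inner-product matrix, then embed with its top eigenvectors.

```python
import numpy as np

def classical_mds(D, r):
    # D: (n, n) matrix of pairwise Euclidean distances; r: target dimensionality.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # inner-product (Gram) matrix
    w, V = np.linalg.eigh(B)                # eigenvectors of the Gram matrix
    idx = np.argsort(w)[::-1][:r]           # keep the r largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```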

26
Non-linear dimensionality reduction
27
Non-linear dimensionality reduction
28
From subspace to submanifold
We assume the data is sampled from some manifold
with fewer degrees of freedom. How can
we find a faithful embedding?
29
Approximate manifold with neighborhood graph
30
Approximate manifold with neighborhood graph
31
Isomap
Tenenbaum et al. 2000
geodesic distance
  • Compute shortest path between all inputs
  • Create geodesic distance matrix
  • Perform MDS with geodesic distances
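A minimal sketch of the three steps above, reusing the classical_mds helper from the MDS sketch; the scikit-learn and scipy calls are one possible way to build the neighborhood graph and the shortest paths.

```python
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors, r):
    # 1. k-nearest-neighbor graph weighted by Euclidean distances.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. Geodesic distance matrix = all-pairs shortest paths through the graph.
    D_geo = shortest_path(G, directed=False)
    # 3. Classical MDS on the geodesic distances.
    return classical_mds(D_geo, r)
```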

32
Locally Linear Embedding (LLE)
  • Represent each input as a weighted combination
    of its nearest neighbors
  • Find a low-dimensional embedding that preserves
    these local reconstruction weights
  • Reduces to a sparse eigenvalue problem

Roweis and Saul 2000
33
Maximum Variance Unfolding (MVU)
Weinberger and Saul 2004
34
Maximum Variance Unfolding (MVU)
Weinberger and Saul 2004
35
Optimization problem
unfold data by maximizing pairwise distances
Preserve local distances
36
Optimization problem
center output (translation invariance)
37
Optimization problem
Problem: the optimization is non-convex
(multiple local minima)
38
Optimization problem
Solution: change of notation
(single global minimum)
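Reconstructed in standard form (the original formulas were images): writing the problem in terms of the output Gram matrix $K_{ij} = y_i^\top y_j$ turns it into a semidefinite program with a single global optimum,

$$ \max_{K \succeq 0} \; \operatorname{tr}(K) \quad \text{s.t.} \quad \sum_{ij} K_{ij} = 0, \qquad K_{ii} - 2 K_{ij} + K_{jj} = \|x_i - x_j\|^2 \;\; \text{for all neighboring pairs } (i, j). $$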
39
Unfolding the swiss-roll
40
Mahalanobis Metric Learning for Clustering with
Side Information (Xing et al. 2003)
  • Exemplars x_i, i = 1, ..., N, plus two types of
    side information:
  • Similar set S = {(x_i, x_j)} s.t. x_i and x_j
    are similar (e.g. same class)
  • Dissimilar set D = {(x_i, x_j)} s.t. x_i and
    x_j are dissimilar
  • Learn an optimal Mahalanobis matrix M:
  • D^2_ij = (x_i - x_j)^T M (x_i - x_j)
    (global distance function)
  • Goal: keep all pairs of similar points close,
    while separating all dissimilar pairs
  • Formulate as a constrained convex programming
    problem (written out below):
  • minimize the distance between the data pairs in S
  • subject to: data pairs in D are well separated
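Written out, the convex program of Xing et al. is

$$ \min_{M \succeq 0} \; \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_M^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_M \ge 1, $$

where $\|x_i - x_j\|_M = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}$; the dissimilarity constraint uses the unsquared distance to rule out trivial solutions.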

41
MMC-SI (Contd)
  • Objective of learning:
  • M is positive semi-definite
  • Ensures non-negativity and the triangle
    inequality of the metric
  • The number of parameters is quadratic in the
    number of features
  • Difficult to scale to a large number of features
  • Significant danger of overfitting small datasets

42
Mahalanobis Metric for Clustering (MMC-SI)
Xing et al., NIPS 2002
43
MMC-SI
Move similarly labeled inputs together
44
MMC-SI
Move differently labeled inputs apart
45
Convex optimization problem
46
Convex optimization problem
target Mahalanobis matrix
47
Convex optimization problem
pushing differently labeled inputs apart
48
Convex optimization problem
pulling similar points together
49
Convex optimization problem
ensuring positive semi-definiteness
50
Convex optimization problem
51
Two convex sets
Set of all matrices that satisfy constraint 1
Cone of PSD matrices
52
Convex optimization problem
convex objective
convex constraints
53
Gradient Alternating Projection
54
Gradient Alternating Projection
Take step along gradient.
55
Gradient Alternating Projection
Take step along gradient.
Project onto the constraint-satisfying subspace.
56
Gradient Alternating Projection
Take step along gradient.
Project onto the constraint-satisfying subspace.
Project onto PSD cone.
57
Gradient Alternating Projection
Take step along gradient.
Project onto the constraint-satisfying subspace.
Project onto PSD cone.
The algorithm is guaranteed to converge to the
optimal solution.
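A minimal numpy sketch of the gradient-plus-projection idea (a simplified penalty form of the objective, not the authors' exact solver; function names, step size, and iteration count are illustrative).

```python
import numpy as np

def project_psd(M):
    # Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues.
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def learn_metric(X, S, D, steps=200, lr=0.05):
    # X: (N, d) data; S, D: lists of (i, j) index pairs labeled similar / dissimilar.
    d = X.shape[1]
    M = np.eye(d)
    for _ in range(steps):
        grad = np.zeros((d, d))
        for i, j in S:                        # pull similar pairs together
            v = (X[i] - X[j])[:, None]
            grad += v @ v.T                   # gradient of (x_i - x_j)^T M (x_i - x_j)
        for i, j in D:                        # push dissimilar pairs apart
            v = (X[i] - X[j])[:, None]
            dist = float(np.sqrt(v.T @ M @ v)) + 1e-9
            grad -= (v @ v.T) / (2.0 * dist)  # gradient of sqrt((x_i - x_j)^T M (x_i - x_j))
        M = M - lr * grad                     # gradient step
        M = project_psd((M + M.T) / 2.0)      # projection onto the PSD cone
    return M
```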
58
Mahalanobis Metric Learning Example I
[Figure: (a) distribution of the original dataset;
(b) data rescaled by the learned global metric]
  • Keep all the data points within the same classes
    close
  • Separate all the data points from different
    classes

59
Mahalanobis Metric Learning Example II
[Figure: (a) original data; (b) rescaling by the
learned full M; (c) rescaling by the learned
diagonal M]
  • A diagonal metric M can simplify computation, but
    could lead to disastrous results

60
Summary of Xing et al. 2002
  • Learns Mahalanobis metric
  • Well suited for clustering
  • Can be kernelized
  • Optimization problem is convex
  • Algorithm is guaranteed to converge
  • Assumes data to be unimodal

61
POLA (Pseudo-metric Online Learning Algorithm)
Shalev-Shwartz et al., ICML 2004
62
POLA (Pseudo-metric online learning algorithm)
This time the inputs are accessed two at a time.
63
POLA (Pseudo-metric online learning algorithm)
Differently labeled inputs are separated.
64
POLA (Pseudo-metric online learning algorithm)
65
POLA (Pseudo-metric online learning algorithm)
Similarly labeled inputs are moved closer.
66
Margin
67
Convex optimization
Both constraint sets are convex.
68
Alternating Projection
Initialize inside PSD cone
Project onto the constraint-satisfying
hyperplane and back
69
Alternating Projection
Initialize inside PSD cone
Project onto the constraint-satisfying
hyperplane and back
Repeat with new constraints
70
Alternating Projection
Initialize inside PSD cone
Project onto the constraint-satisfying
hyperplane and back
Repeat with new constraints
If a solution exists, the algorithm converges
inside the intersection.
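A minimal numpy sketch of one POLA-style online step under the pair constraint y * (b - d_M^2(x1, x2)) >= margin, which is linear in (M, b); the PSD projection is simplified to full eigenvalue clipping rather than the rank-one correction used in the paper, and names are illustrative.

```python
import numpy as np

def project_psd(M):
    # Clip negative eigenvalues to return to the PSD cone.
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T

def pola_step(M, b, x1, x2, y, margin=1.0):
    # One online step on a pair (x1, x2), y = +1 (same label) or -1 (different labels).
    u = (x1 - x2)[:, None]
    A = u @ u.T                                 # d_M^2(x1, x2) = <M, A>
    loss = max(0.0, margin - y * (b - float(np.sum(M * A))))
    if loss > 0.0:
        alpha = loss / (np.sum(A * A) + 1.0)    # step length of the half-space projection
        M = project_psd(M - y * alpha * A)      # project onto constraint, then PSD cone
        b = max(b + y * alpha, margin)
    return M, b
```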
71
Theoretical Guarantees
Provided a global solution exists:
The online version has an upper bound on the
accumulated violation of the margin threshold.
The batch version converges after a finite number
of passes over the data.
72
Summary of POLA
  • Learns Mahalanobis metric
  • Online algorithm
  • Can also be kernelized
  • Introduces a margin
  • Algorithm converges if solution exists
  • Assumes data to be unimodal

73
Neighborhood Component Analysis
Distance metric for visualization and kNN
(Goldberger et al. 2004)
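(Preview of Part 2.) In standard form, NCA learns a linear map $A$ by maximizing the expected leave-one-out kNN accuracy under soft neighbor assignments:

$$ p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \ne i} \exp(-\|A x_i - A x_k\|^2)}, \quad p_{ii} = 0, \qquad \max_A \; \sum_i \sum_{j : c_j = c_i} p_{ij}. $$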