Title: On a Theory of Similarity Functions for Learning and Clustering
1. On a Theory of Similarity Functions for Learning and Clustering
- Avrim Blum
- Carnegie Mellon University
- This talk is based on joint work with Nina Balcan, Nati Srebro, and Santosh Vempala.
Theory and Practice of Computational Learning, 2009
2. 2-minute version
- Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.
- A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
- But the theory is in terms of implicit mappings.
Q: Can we develop a theory that just views K as a measure of similarity? Develop a more general and intuitive theory of when K is useful for learning?
3. 2-minute version
- Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.
- A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
- But the theory is in terms of implicit mappings.
Q: What if we only have unlabeled data (i.e., clustering)? Can we develop a theory of properties that are sufficient to be able to cluster well?
4. 2-minute version
- Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good.
- A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(·,·).
- But the theory is in terms of implicit mappings.
Develop a kind of PAC model for clustering.
5. Part 1: On similarity functions for learning
6. Theme of this part
- A theory of natural sufficient conditions for similarity functions to be useful for classification learning problems.
- Doesn't require PSD, no implicit spaces, but includes the notion of a large-margin kernel.
- At a formal level, it can even allow you to learn more (one can define classes of functions that have no large-margin kernel, even allowing substantial hinge loss, but that do have a good similarity function under this notion).
7. Kernels
- We have a lot of great algorithms for learning linear separators (Perceptron, SVM, ...). But a lot of the time, data is not linearly separable.
- Old answer: use a multi-layer neural network.
- New answer: use a kernel function!
- Many algorithms only interact with the data via dot-products.
- So, let's just re-define the dot-product.
- E.g., K(x,y) = (1 + x·y)^d.
- K(x,y) = φ(x)·φ(y), where φ(·) is an implicit mapping into an n^d-dimensional space.
- The algorithm acts as if the data is in φ-space. This allows it to produce a non-linear curve in the original space.
8. Kernels
A kernel K is a legal definition of a dot-product: i.e., there exists an implicit mapping φ such that K(x,y) = φ(x)·φ(y).
E.g., K(x,y) = (x·y + 1)^d maps φ: (n-dimensional space) → (n^d-dimensional space).
Why kernels are so useful: many algorithms interact with the data only via dot-products. So, if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional φ-space.
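To make the "kernel = re-defined dot-product" point concrete, here is a minimal sketch (not from the talk) of the polynomial kernel above, together with one explicit choice of feature map φ for n = 2, d = 2 that realizes it; the names poly_kernel and phi are illustrative.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d, the polynomial kernel from the slide
    return (np.dot(x, y) + 1.0) ** d

def phi(x):
    # One explicit feature map for n = 2, d = 2 satisfying K(x, y) = phi(x) . phi(y)
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, 1.0,
                     np.sqrt(2) * x1 * x2, np.sqrt(2) * x1, np.sqrt(2) * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))  # the implicit mapping is real
```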
9. Example
- E.g., for n = 2, d = 2, the kernel K(x,y) = (x·y)^d corresponds to an explicit quadratic feature map.
[Figure: the same data shown in the original space and in φ-space (coordinates z1, z2, ...).]
10. Moreover, they generalize well if there is a good margin
- If the data is linearly separable by a large margin in φ-space, then we get good sample complexity.
- If the margin is γ in φ-space (with |φ(x)| ≤ 1), then a sample of size only Õ(1/γ²) is needed to get confidence in generalization; there is no dependence on the dimension.
- Kernels are useful in practice for dealing with many, many different kinds of data.
11. Limitations of the Current Theory
In practice, kernels are constructed by viewing them as measures of similarity.
The existing theory is in terms of margins in implicit spaces.
- Not best for intuition.
- The kernel requirement rules out many natural similarity functions.
Is there an alternative, perhaps more general, theoretical explanation?
12. A notion of a good similarity function that is:
(Balcan-Blum, ICML 2006; Balcan-Blum-Srebro, MLJ 2008; Balcan-Blum-Srebro, COLT 2008)
1) In terms of natural direct quantities:
- no implicit high-dimensional spaces,
- no requirement that K(x,y) = φ(x)·φ(y).
2) Is broad: includes the usual notion of a good kernel (one that has a large-margin separator in φ-space).
3) Even formally allows you to do more.
[Diagram labels: main notion, first attempt, good kernels ("K can be used to learn well").]
13. A First Attempt
P = distribution over labeled examples (x, l(x)). Goal: output a classification rule that is good for P.
K is good if most x are on average more similar to points y of their own type than to points y of the other type.
K is (ε,γ)-good for P if a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
(the average similarity to points of the same label exceeds the average similarity to points of the opposite label by a gap γ).
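For reference, here is the slide's (ε,γ)-goodness condition written out as a single display formula (my transcription into LaTeX, using ℓ for the labeling function):

```latex
\Pr_{x \sim P}\!\left[
  \mathbb{E}_{y \sim P}\!\left[K(x,y) \mid \ell(y) = \ell(x)\right]
  \;\ge\;
  \mathbb{E}_{y \sim P}\!\left[K(x,y) \mid \ell(y) \neq \ell(x)\right] + \gamma
\right] \;\ge\; 1 - \varepsilon
```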
14. A First Attempt
K is (ε,γ)-good for P if a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
E.g., most images of men are on average γ-more similar to random images of men than to random images of women, and vice-versa.
15. A First Attempt
K is (ε,γ)-good for P if a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm:
- Draw sets S+ and S- of positive and negative examples.
- Classify x based on its average similarity to S+ versus to S-.
[Figure: a new point x compared against the sets S+ and S-.]
16. A First Attempt
K is (ε,γ)-good for P if a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm:
- Draw sets S+ and S- of positive and negative examples.
- Classify x based on its average similarity to S+ versus to S-.
Theorem: if |S+| and |S-| are Ω((1/γ²) ln(1/(εδ))), then with probability at least 1-δ, the error is at most ε + δ.
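As an illustration only (not code from the talk), the classification rule in this first-attempt algorithm is a one-liner; K is any user-supplied similarity function and S_plus, S_minus are the drawn example sets.

```python
import numpy as np

def classify(x, S_plus, S_minus, K):
    # Predict +1 iff x is on average more similar to the positives than to the negatives.
    avg_pos = np.mean([K(x, y) for y in S_plus])
    avg_neg = np.mean([K(x, y) for y in S_minus])
    return 1 if avg_pos >= avg_neg else -1
```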
17. A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: positives and negatives on the unit circle, in clumps spanning 30°; for a positive x, the average similarity to its own label works out to ½·1 + ½·(-½) = ¼, versus ½ to the other label.]
The similarity function K(x,y) = x·y
- has a large-margin separator, but
- does not satisfy our definition.
18. A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same example, with a region R of "reasonable" points highlighted.]
Broaden: ∃ a non-negligible R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label
(even if we do not know R in advance).
19. Broader Definition
- K is (ε,γ,τ)-good if ∃ a set R of "reasonable" y (allowed to be probabilistic) s.t. a 1-ε fraction of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
(technically, the γ hinge-loss version), with at least τ prob. mass of reasonable positives and negatives.
Algorithm:
- Draw S = {y1, ..., yd}, a set of landmarks.
- Re-represent the data: x → F(x) = [K(x,y1), ..., K(x,yd)].
- If there are enough landmarks (d = Ω(1/(γ²τ))), then with high probability there exists a good L1 large-margin linear separator, e.g.
w = [0, 0, 1/n+, 1/n+, 0, 0, 0, -1/n-, 0, 0].
20. Broader Definition
- K is (ε,γ,τ)-good if ∃ a set R of "reasonable" y (allowed to be probabilistic) s.t. a 1-ε fraction of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
(technically, the γ hinge-loss version), with at least τ prob. mass of reasonable positives and negatives.
Algorithm (with d_u = Õ(1/(γ²τ)) unlabeled landmarks and d_l = O((1/(γ²ε²_acc)) ln d_u) labeled examples):
- Draw S = {y1, ..., yd}, a set of landmarks.
- Re-represent the data: x → F(x) = [K(x,y1), ..., K(x,yd)].
[Figure: labeled examples (x's and o's) plotted in the re-represented landmark space.]
- Take a new set of labeled examples, project them into this space, and run a good L1 linear-separator algorithm (e.g., Winnow).
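A minimal sketch of this two-stage algorithm, under the assumption that L1-regularized logistic regression is an acceptable stand-in for "a good L1 linear-separator algorithm"; K, landmarks, and the data arrays are placeholders supplied by the user.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(X, landmarks, K):
    # F(x) = [K(x, y1), ..., K(x, yd)] for each x in X
    return np.array([[K(x, y) for y in landmarks] for x in X])

def train_l1_separator(X_labeled, labels, landmarks, K):
    # Project the labeled sample into the landmark space, then fit an L1 linear separator.
    F = landmark_features(X_labeled, landmarks, K)
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    clf.fit(F, labels)
    return clf
```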
21. Kernels and Similarity Functions
Theorem: if K is a good kernel, then K is also a good similarity function (but γ gets squared).
If K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense.
22. Kernels and Similarity Functions
Theorem: if K is a good kernel, then K is also a good similarity function (but γ gets squared).
Can also show a separation.
Theorem: there exist a class C and a distribution D s.t. ∃ a similarity function with large γ for all f in C, but no large-margin kernel function exists.
23. Kernels and Similarity Functions
Theorem: for any class C of pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
- In principle, one should be able to learn from O(ε⁻¹ log(|C|/δ)) labeled examples.
- Claim 1: can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound (assuming D is not too concentrated).
- Claim 2: there is no (ε,γ)-good kernel in hinge loss, even if ε = 1/2 and γ = 1/|C|^{1/2}. So the margin-based sample complexity is Ω(|C|).
24. Generic Similarity Function
- Partition X into regions R1, ..., R_|C| with P(Ri) > 1/poly(|C|).
- Ri will serve as the set R for target fi.
- For y in Ri, define K(x,y) = fi(x)·fi(y).
- So, for any target fi in C and any x, we get
E_y[ l(x)·l(y)·K(x,y) | y in Ri ] = E[ l(x)²·l(y)² ] = 1.
- So K is (0, 1, 1/poly(|C|))-good.
This gives the bound O(ε⁻¹ log|C|).
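As a toy illustration (purely hypothetical code, with an arbitrary even partition standing in for the regions Ri), the generic construction can be written down directly for a finite domain and a finite class of ±1-valued targets:

```python
import numpy as np

def make_generic_similarity(X, C):
    """X: list of points; C: list of +/-1-valued target functions f_1..f_m."""
    m = len(C)
    # Split the domain into m regions R_1..R_m (here: an arbitrary even split).
    regions = np.array_split(np.arange(len(X)), m)
    region_of = {int(idx): i for i, R in enumerate(regions) for idx in R}

    def K(xi, yi):
        f = C[region_of[yi]]          # y lies in region R_i, so use f_i
        return f(X[xi]) * f(X[yi])    # K(x, y) = f_i(x) * f_i(y)

    return K
```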
25. Similarity Functions for Classification: Algorithmic Implications
- Can use non-PSD similarities; there is no need to transform them into PSD functions and plug them into an SVM. Instead, use the empirical similarity map.
- E.g., Liao and Noble, Journal of Computational Biology.
- Gives theoretical justification to this rule.
- Shows that anything learnable with an SVM is also learnable this way.
26. Learning with Multiple Similarity Functions
- Let K1, ..., Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm:
- Draw S = {y1, ..., yd}, a set of landmarks. Concatenate features:
F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)].
- Run the same L1 optimization algorithm as before in this new feature space.
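A sketch of just the concatenation step (illustrative code, not from the talk); the resulting features can be fed to the same L1 learner sketched earlier.

```python
import numpy as np

def multi_landmark_features(X, landmarks, Ks):
    # F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)]
    return np.array([[K(x, y) for y in landmarks for K in Ks] for x in X])
```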
27. Learning with Multiple Similarity Functions
- Let K1, ..., Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm:
- Draw S = {y1, ..., yd}, a set of landmarks. Concatenate features:
F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)].
Guarantee: whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε at L1 margin at least γ/4.
The sample complexity only increases by a log(r) factor!
28. Learning with Multiple Similarity Functions
- Let K1, ..., Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm:
- Draw S = {y1, ..., yd}, a set of landmarks. Concatenate features:
F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)].
Guarantee: whp the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε at L1 margin at least γ/4.
Proof: imagine the mapping F_o(x) = [K_o(x,y1), ..., K_o(x,yd)] for the good similarity function K_o = α1·K1 + ... + αr·Kr. Consider w_o = (w1, ..., wd) of L1 norm 1 and margin γ/4. The vector w = (α1·w1, α2·w1, ..., αr·w1, ..., α1·wd, α2·wd, ..., αr·wd) also has L1 norm 1 and satisfies w·F(x) = w_o·F_o(x).
29. Learning with Multiple Similarity Functions
- Because the property is defined in terms of L1, there is no change in margin!
- Only a log(r) penalty for concatenating feature spaces.
- If L2, the margin would drop by a factor of r^{1/2}, giving an O(r) penalty in sample complexity.
- The algorithm is also very simple (just concatenate).
- Alternative algorithm: do a joint optimization:
  - solve for K_o = α1·K1 + ... + αr·Kr and a vector w_o s.t. w_o has a good L1 margin in the space defined by F_o(x) = [K_o(x,y1), ..., K_o(x,yd)].
  - The bound also holds here, since the capacity is only lower.
  - But we don't know how to do this efficiently.
30. Learning with Multiple Similarity Functions
- Interesting fact: because the property is defined in terms of L1, there is no change in margin!
- Only a log(r) penalty for concatenating feature spaces.
- If L2, the margin would drop by a factor of r^{1/2}, giving an O(r) penalty in sample complexity.
- Also, since any large-margin kernel is also a good similarity function,
  - the log(r) penalty applies to the "concatenate and optimize L1 margin" algorithm for kernels.
  - But γ is potentially squared in the translation, and we add an extra ε to the hinge loss at a 1/ε cost in unlabeled data.
  - Nonetheless, if r is large, this can be a good tradeoff!
31. Open questions (part I)
- Can we deal (efficiently?) with a general convex class K of similarity functions?
  - Not just K = {α1·K1 + ... + αr·Kr : αi ≥ 0, α1 + ... + αr = 1}.
- Can we efficiently implement the direct joint optimization for the convex-combination case?
  - Alternatively, can we use the concatenation algorithm to extract a good convex combination K_o?
- Two quite different algorithm styles: anything in-between?
- Can we use this approach for transfer learning?
32. Part 2: Can we use this angle to help think about clustering?
33. Clustering comes up in many places
- Given a set of documents or search results, cluster them by topic.
- Given a collection of protein sequences, cluster them by function.
- Given a set of images of people, cluster them by who is in them.
34. Can model clustering like this
- Given a data set S of n objects (e.g., news articles).
- There is some (unknown) ground-truth clustering C1, C2, ..., Ck (e.g., sports, politics, ...).
- Goal: produce a hypothesis clustering C1', C2', ..., Ck' that matches the target as much as possible (minimize mistakes, up to renumbering of the indices).
- Problem: no labeled data!
- But we do have a measure of similarity.
35. Can model clustering like this
What conditions on a similarity measure would be enough to allow one to cluster well?
- Given a data set S of n objects (e.g., news articles).
- There is some (unknown) ground-truth clustering C1, C2, ..., Ck (e.g., sports, politics, ...).
- Goal: produce a hypothesis clustering C1', C2', ..., Ck' that matches the target as much as possible (minimize mistakes, up to renumbering of the indices).
- Problem: no labeled data!
- But we do have a measure of similarity.
36. What conditions on a similarity measure would be enough to allow one to cluster well?
- Contrast with more standard approaches to clustering analysis:
  - View the similarity/distance information as ground truth and analyze the ability of algorithms to achieve different optimization criteria (min-sum, k-means, k-median, ...).
  - Or assume a generative model, like a mixture of Gaussians.
- Here, no generative assumptions. Instead: given the data, how powerful a K do we need to be able to cluster it well?
37. Here is a condition that trivially works
What conditions on a similarity measure would be enough to allow one to cluster well?
- Suppose K has the property that
  - K(x,y) > 0 for all x,y such that C(x) = C(y), and
  - K(x,y) < 0 for all x,y such that C(x) ≠ C(y).
- If we have such a K, then clustering is easy.
- Now, let's try to make this condition a little weaker.
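To see why clustering is easy under this condition, here is a minimal sketch (illustrative code, assuming a precomputed n x n similarity matrix): the target clusters are exactly the connected components of the graph that links pairs with positive similarity.

```python
import numpy as np

def cluster_by_sign(K):
    """K: n x n similarity matrix with K[i,j] > 0 iff i and j share a cluster."""
    n = K.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]                # flood-fill one connected component
        labels[start] = current
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and K[u, v] > 0:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels
```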
38. What conditions on a similarity measure would be enough to allow one to cluster well?
- Suppose K has the property that all x are more similar to all points y in their own cluster than to any y in other clusters.
- Still a very strong condition.
- Problem: the same K can satisfy this for two very different clusterings of the same data!
[Figure: documents about baseball and basketball.]
39. What conditions on a similarity measure would be enough to allow one to cluster well?
- Suppose K has the property that all x are more similar to all points y in their own cluster than to any y in other clusters.
- Still a very strong condition.
- Problem: the same K can satisfy this for two very different clusterings of the same data!
[Figure: documents about baseball, basketball, Math, and Physics; the property can hold both for the four fine-grained topics and for a coarser split.]
40. Let's weaken our goals a bit
- It is OK to produce a hierarchical clustering (tree) such that the target clustering is approximately some pruning of it.
- E.g., in the case from the last slide.
- Can view this as saying: "if any of these clusters is too broad, just click and I will split it for you."
- Or, it is OK to output a small number of clusterings such that at least one has low error (like list-decoding), but we won't talk about this one today.
41. Then you can start getting somewhere.
1. "All x are more similar to all y in their own cluster than to any y from any other cluster"
   is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's algorithm / single-linkage works.)
42. Then you can start getting somewhere.
1. "All x are more similar to all y in their own cluster than to any y from any other cluster"
   is sufficient to get a hierarchical clustering such that the target is some pruning of the tree. (Kruskal's algorithm / single-linkage works.)
2. Weaker condition: the ground truth is "stable":
   For all clusters C, C', and for all A ⊆ C, A' ⊆ C': A and A' are not both more similar on average to each other than to the rest of their own clusters.
   (View K(x,y) as an "attraction" between x and y; plus technical conditions at the boundary.)
   This is sufficient to get a good tree using the average single-linkage algorithm.
43. Analysis for a slightly simpler version
Assume that for all C, C' and all A ⊂ C, A' ⊆ C' we have K(A, C-A) > K(A, A'), and say K is symmetric.
(Here K(A,B) denotes Avg_{x∈A, y∈B} K(x,y).)
- Algorithm: average single-linkage.
  - Like Kruskal, but at each step merge the pair of clusters whose average similarity is highest.
- Analysis: all clusters made are laminar wrt the target.
  - Failure iff we merge C1, C2 s.t. C1 ⊂ C and C2 ∩ C = ∅.
44. Analysis for a slightly simpler version
Assume that for all C, C' and all A ⊂ C, A' ⊆ C' we have K(A, C-A) > K(A, A'), and say K is symmetric.
[Figure: a target cluster C containing C1 and C3, and a cluster C2 disjoint from C.]
- Algorithm: average single-linkage.
  - Like Kruskal, but at each step merge the pair of clusters whose average similarity is highest.
- Analysis: all clusters made are laminar wrt the target.
  - Failure iff we merge C1, C2 s.t. C1 ⊂ C and C2 ∩ C = ∅.
  - But then there must exist a C3 ⊂ C at least as similar to C1 as the average. Contradiction.
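A rough sketch of the average single-linkage procedure described above (illustrative code over a precomputed symmetric similarity matrix; variable names are my own): repeatedly merge the pair of current clusters with the highest average pairwise similarity, recording the merges to form the tree.

```python
import numpy as np

def average_single_linkage(K):
    """K: n x n symmetric similarity matrix. Returns the merge sequence (the tree)."""
    clusters = [[i] for i in range(len(K))]
    merges = []
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average similarity between clusters a and b
                score = K[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        merges.append((list(clusters[a]), list(clusters[b]), best_score))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```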
45. More sufficient properties
3. "All x are more similar to all y in their own cluster than to any y from any other cluster", but with noisy data added.
- Noisy data can ruin bottom-up algorithms, but one can show that a generate-and-test style algorithm works:
  - Create a collection of plausible clusters.
  - Use a series of pairwise tests to remove/shrink clusters until they are consistent with a tree.
46. More sufficient properties
3. "All x are more similar to all y in their own cluster than to any y from any other cluster", but with noisy data added.
4. Implicit assumptions made by the optimization approach:
   Any approximately-optimal (e.g., k-median) solution is close (in terms of how points are clustered) to the target.
   (See Nina Balcan's talk on Saturday.)
47. Can also analyze the inductive setting
- Can use regularity-type results of [AFKK] to argue that whp a reasonable-size sample S will give good estimates of all desired quantities.
- Once S is hierarchically partitioned, new points can be inserted as they arrive.
48. Like a PAC model for clustering
- A "property" is a relation between the target and the similarity information (data). It is like a data-dependent concept class in learning.
- Given data and a similarity function K, a property induces a "concept class" C of all clusterings c such that (c, K) is consistent with the property.
- Tree model: want a tree T s.t. the set of prunings of T forms an ε-cover of C.
- In the inductive model, want this with probability at least 1-δ.
49. Summary (part II)
- Exploring the question: what does an algorithm need in order to cluster well?
- What natural properties allow a similarity measure to be useful for clustering?
- To get a good theory, it helps to relax what we mean by "useful for clustering".
- The user can then decide how specific he wants to be in each part of the domain.
- Analyze a number of natural properties and prove guarantees on algorithms able to use them.
50. Wrap-up
- A tour through learning and clustering by similarity functions.
- A user with some knowledge of the problem domain comes up with a pairwise similarity measure K(x,y) that makes sense for the given problem.
- The algorithm uses this (together with labeled data, in the case of learning) to find a good solution.
- Goals of a theory:
  - Give guidance to the similarity-function designer (what properties to shoot for?).
  - Understand what properties are sufficient for learning/clustering, and by what algorithms.
- For learning: a theory of kernels without the need for implicit spaces.
- For clustering: reverses the usual view; suggests giving the algorithm some slack (tree vs. partitioning).
- A lot of interesting questions are still open in these areas.