A Theory of Learning and Clustering via Similarity Functions - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: A Theory of Learning and Clustering via Similarity Functions


1
A Theory of Learning and Clustering via
Similarity Functions
Maria-Florina Balcan
Carnegie Mellon University
2
Main Research Directions
New Theoretical Frameworks and Algorithms for Key Problems in
  • Machine Learning
  • Algorithmic Game Theory: Algorithms for Pricing Problems (Revenue Maximization)
3
Outline of the Talk
Outline
Background.
  • Important aspects in Machine Learning today.

Learning with Similarity Functions.
  • Supervised Learning.
  • Kernel methods and their limitations.
  • A new framework: General Similarity Functions.
  • Unsupervised Learning.
  • A new framework for Clustering.

Other work and future directions.
4
Machine Learning
Background
Image Classification
Document Categorization
Speech Recognition
Protein Classification
Spam Detection
Branch Prediction
Fraud Detection
5
Machine Learning
Background
Image Classification
Supervised Learning
Document Categorization
Speech Recognition
Protein Classification
Spam Detection
Branch Prediction
Fraud Detection
6
Example: Supervised Classification
Background
Decide which emails are spam and which are
important.
Supervised classification
Not spam
spam
Goal: use emails seen so far to produce a good prediction rule for future data.
7
Example: Supervised Classification
Background
Represent each message by features (e.g., keywords, spelling, etc.); each example comes with a label.
Reasonable RULES:
  • Predict SPAM if unknown AND (money OR pills)
  • Predict SPAM if 2·money + 3·pills − 5·known > 0
Linearly separable. (A small sketch of evaluating the second rule follows below.)
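Below is a minimal, hypothetical sketch (not the presenter's code) of how the second rule could be evaluated; the feature extraction and the "known sender" flag are illustrative assumptions.

```python
# Hypothetical sketch of the linear rule from the slide:
# predict SPAM if 2*money + 3*pills - 5*known > 0.
def extract_features(message: str) -> dict:
    words = message.lower().split()
    return {
        "money": words.count("money"),
        "pills": words.count("pills"),
        # "known" stands in for "message is from a known sender" (illustrative flag)
        "known": 1 if "trusted-sender" in words else 0,
    }

def predict_spam(message: str) -> bool:
    f = extract_features(message)
    score = 2 * f["money"] + 3 * f["pills"] - 5 * f["known"]
    return score > 0

print(predict_spam("cheap pills and easy money"))     # True
print(predict_spam("trusted-sender meeting agenda"))  # False
```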
8
Two Main Aspects in Machine Learning
Background
Algorithm Design. How to optimize?
Automatically generate rules that do well on
observed data.
Our best algorithms for learning linear
separators.
Confidence Bounds, Generalization Guarantees
Confidence for rule effectiveness on future data.
Well understood for supervised learning.
9
What if Not Linearly Separable?
Background
Problem: data not linearly separable in the most natural feature representation.
Example
No good linear separator in pixel representation.
vs
Solutions
  • Classic: learn a more complex class of functions.
  • Modern: use a kernel (the prominent method today).

10
Learning with Similarity Functions
Contributions
Kernels: a special kind of similarity function.
The prominent method today, but with a difficult theory.
My Work
  • Methods for more general similarity functions.
  • More tangible, direct theory.

Helps in the design of good kernels for new
learning tasks.
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, MLJ 2008
Balcan-Blum-Srebro, COLT 2008
Will describe this in a few minutes.
11
One Other Major Aspect in Machine Learning
Background
Where do we get the data?
And what type of data do we use in learning?
Traditional methods: learning from labeled examples only.
Modern applications: lots of unlabeled data; labeled data is rare or expensive.
  • Web page, document classification
  • OCR, Image classification
  • Biology classification problems

12
Incorporating Unlabeled Data in the Learning
Process
Background and Contributions
Areas of significant importance and activity.
  • Semi-Supervised Learning

Using cheap unlabeled data in addition to labeled
data.
  • Active Learning

The algorithm interactively asks for labels of
informative examples.
Unified theoretical understanding was lacking.
My Work
  • Foundational theoretical understanding.
  • Analyze practically successful existing as well
    as new algos.

BB, COLT 2005
BB, book chapter, Semi-Supervised Learning,
2006
BBY, NIPS 2004
BBL, ICML 2006
BBZ, COLT 2007
BHW, COLT 2008
BBL, JCSS 2008
13
Outline of the Talk
Outline
Background.
  • Important aspects in Machine Learning today.

Learning with Similarity Functions.
  • Supervised Learning.
  • Kernel methods and their limitations.
  • A new framework: General Similarity Functions.
  • Unsupervised Learning.
  • A new framework for Clustering.

Other work and future directions.
14
Outline of the Talk
Outline
Background.
  • Important aspects in Machine Learning today.

Learning with Similarity Functions.
  • Supervised Learning.
  • Kernel methods and their limitations.
  • A new framework: General Similarity Functions.
  • Unsupervised Learning.
  • A new framework for Clustering.

Other work and future directions.
15
Kernel Methods
Kernels
Prominent method for supervised classification
today.
What is a Kernel?
A kernel K is a legal definition of a dot-product, i.e., there exists an implicit mapping φ such that K(x,y) = φ(x)·φ(y).
E.g., K(x,y) = (x·y + 1)^d
φ: n-dimensional space → roughly n^d-dimensional space.
Why do kernels matter?
Many algorithms interact with data only via dot-products.
So, if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional φ-space.
16
Example
Kernels
K(x,y) = (x·y)^d corresponds to an explicit feature map.
  • E.g., for n = 2, d = 2, the kernel K(x,y) = (x·y)^2 corresponds to φ(x) = (x1², x2², √2·x1x2).
[Figure: the same data shown in the original space and in the φ-space. A sketch verifying the correspondence follows below.]
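A small check (a standard math fact, illustrative code) that for n = 2, d = 2 the kernel K(x,y) = (x·y)² equals an ordinary dot product after the explicit map φ(x) = (x1², x2², √2·x1x2).

```python
import numpy as np

def K(x, y):
    # polynomial kernel of degree 2 (no constant term)
    return float(np.dot(x, y) ** 2)

def phi(x):
    # explicit feature map whose dot product reproduces K
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([0.3, -1.2])
y = np.array([2.0, 1.0])
print(K(x, y), float(np.dot(phi(x), phi(y))))  # both print 0.36
```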
17
Generalize Well if Good Margin
Kernels
  • If the data is linearly separable by a large margin in φ-space, then good sample complexity.

If the margin is γ in φ-space (with |φ(x)| ≤ 1), then a sample size of only Õ(1/γ²) is needed to get confidence in generalization.
  • (An example of such a generalization bound is given below.)
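One standard form of such a margin-based generalization bound (a textbook statement under the assumption |φ(x)| ≤ 1, not necessarily the exact bound on the slide): with probability at least 1 − δ over m labeled samples, any separator h consistent with the data at margin γ satisfies

\[
\mathrm{err}(h) \;\le\; \tilde{O}\!\left(\sqrt{\frac{1/\gamma^{2} + \ln(1/\delta)}{m}}\right),
\]

so, up to the accuracy and confidence terms, m = Õ(1/γ²) samples suffice.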

18
Kernel Methods
Kernels
Prominent method for supervised classification
today
Lots of Books, Workshops.
Significant percentage of ICML, NIPS, COLT.
ICML 2007, Business meeting
19
Limitations of the Current Theory
Kernels
In practice, kernels are constructed by viewing them as measures of similarity.
Existing theory: in terms of margins in implicit spaces.
Difficult to think about, not great for intuition.
The kernel requirement rules out many natural similarity functions.
Better theoretical explanation?
20
Better Theoretical Framework
Kernels
Yes! We provide a more general and intuitive
theory that formalizes the intuition that a good
kernel is a good measure of similarity.
In practice, kernels are constructed by viewing them as measures of similarity.
Existing theory: in terms of margins in implicit spaces.
Difficult to think about, not great for intuition.
The kernel requirement rules out natural similarity functions.
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, MLJ 2008
Balcan-Blum-Srebro, COLT 2008
Better theoretical explanation?
21
More General Similarity Functions
New Framework
We provide a notion of a good similarity function that is:
  1. Simpler: defined in terms of natural, direct quantities.
     • no implicit high-dimensional spaces
     • no requirement that K(x,y) = φ(x)·φ(y)
  2. Usable: K can be used to learn well.
  3. Broad: includes the usual notion of a good kernel (one with a large-margin separator in φ-space).
[Figure: the main notion contains both the good kernels and the first-attempt definition.]
22
A First Attempt
New Framework
P = distribution over labeled examples (x, l(x)).
Goal: output a classification rule that is good for P.
K is good if most x are on average more similar to points y of their own type than to points y of the other type.
K is (ε,γ)-good for P if at least a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
i.e., the average similarity to points of the same label beats the average similarity to points of the opposite label by a gap of γ.
23
A First Attempt
New Framework
K is (ε,γ)-good for P if at least a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Example: K(x,y) = 0.2 when l(x) = l(y); K(x,y) random in {−1, 1} when l(x) ≠ l(y).
[Figure: illustration of the definition with the two classes labeled −1 and +1.]
24
A First Attempt
New Framework
K is (ε,γ)-good for P if at least a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
  • Draw sets S+, S− of positive and negative examples.
  • Classify x based on its average similarity to S+ versus to S−.
[Figure: a new point x compared against the sets S+ and S−.]
25
A First Attempt
New Framework
K is (ε,γ)-good for P if at least a 1−ε prob. mass of x satisfy
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
  • Draw sets S+, S− of positive and negative examples.
  • Classify x based on its average similarity to S+ versus to S−.
Theorem
If |S+| and |S−| are Ω((1/γ²) ln(1/εδ)), then with probability ≥ 1−δ, the error of this rule is at most ε + δ.
  • For a fixed good x, the probability of error w.r.t. x (over the draw of S+, S−) is small (Hoeffding).
  • So, the expected error rate is small.
(A sketch of this averaging rule appears below.)
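A minimal sketch (illustrative data and similarity function, not the presenter's implementation) of the averaging algorithm above: label x by comparing its mean similarity to the positive sample S+ and the negative sample S−.

```python
import numpy as np

def average_similarity_classifier(K, S_plus, S_minus):
    """Return a classifier that labels x by comparing its mean similarity
    to the positive sample S_plus vs. the negative sample S_minus."""
    def predict(x):
        pos = np.mean([K(x, y) for y in S_plus])
        neg = np.mean([K(x, y) for y in S_minus])
        return +1 if pos >= neg else -1
    return predict

# Toy usage with a cosine-style similarity (an illustrative choice, not from the talk).
def K(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

rng = np.random.default_rng(0)
S_plus = rng.normal(loc=+1.0, size=(20, 5))   # sample of positives
S_minus = rng.normal(loc=-1.0, size=(20, 5))  # sample of negatives
predict = average_similarity_classifier(K, S_plus, S_minus)
print(predict(rng.normal(loc=+1.0, size=5)))  # likely +1
```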

26
A First Attempt: Not Broad Enough
New Framework
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: positives and negatives at 30° angles on the circle; for some points the comparison works out to ½ versus ¼, for others to ½ versus ½·1 + ½·(−½).]
The similarity function K(x,y) = x·y
  • has a large margin separator, yet
  • does not satisfy our definition.
27
A First Attempt: Not Broad Enough
New Framework
E_{y~P}[K(x,y) | l(y)=l(x)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same 30° example, with a highlighted region R.]
Broaden: ∃ a non-negligible R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label,
even if we do not know R in advance.
28
Broader Definition
New Framework
  • Ask that ∃ a set R of reasonable y (allowed to be probabilistic) s.t. almost all x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
with at least τ prob. mass of reasonable positives and negatives.
Property
  • Draw S = {y1, …, yd}, a set of landmarks.
  • Re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)].
  • If there are enough landmarks (d = Θ(1/(γ²τ))), then with high probability there exists a linear separator of small error, e.g. w = [0, 0, 1/n+, 1/n+, 0, 0, 0, −1/n−, 0, 0].
29
Broader Definition
New Framework
  • Ask that ∃ a set R of reasonable y (allowed to be probabilistic) s.t. almost all x satisfy
E_{y~P}[K(x,y) | l(y)=l(x), R(y)]  ≥  E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
with at least τ prob. mass of reasonable positives and negatives.
Algorithm
  • Draw S = {y1, …, yd}, a set of landmarks.
  • Re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)].
[Figure: the re-represented positive (X) and negative (O) points in the landmark space.]
  • Take a new set of labeled examples, project them into this space, and run a linear separator algorithm. (A sketch of this pipeline follows below.)
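A minimal sketch of the landmark pipeline on this slide, under assumed toy data and an illustrative similarity function: re-represent each point by its similarities to d landmarks, then run any linear-separator learner (here, a plain perceptron) in that space.

```python
import numpy as np

def landmark_features(K, landmarks, X):
    """Map each x to F(x) = [K(x, y1), ..., K(x, yd)]."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

def perceptron(F, labels, epochs=20):
    """Train a simple linear separator w on the re-represented data."""
    w = np.zeros(F.shape[1])
    for _ in range(epochs):
        for f, l in zip(F, labels):
            if l * np.dot(w, f) <= 0:
                w += l * f
    return w

# Toy usage with an illustrative RBF-style similarity (an assumption, not the talk's choice).
def K(x, y):
    return float(np.exp(-np.sum((x - y) ** 2)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (50, 3)), rng.normal(-1, 1, (50, 3))])
labels = np.array([+1] * 50 + [-1] * 50)
landmarks = X[rng.choice(len(X), size=10, replace=False)]  # d = 10 landmarks

F = landmark_features(K, landmarks, X)
w = perceptron(F, labels)
print("training accuracy:", np.mean(np.sign(F @ w) == labels))
```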

30
Kernels versus Similarity Functions
New Framework
Main Technical Contributions (technically hardest parts)
Theorem: if K is a good kernel, then K is also a good similarity function (but γ gets squared).
Strictly more general: we can also show a strict separation (using Fourier analysis).
31
Similarity Functions for Classification
Summary, Part I
Conceptual Contributions
Before: difficult theory; implicit spaces; not helpful for intuition; limiting.
After (our work): much more intuitive theory; no implicit spaces; formalizes a common intuition; provably more general.
Algorithmic Implications
  • Can use non-PSD similarities; no need to transform them into PSD functions and plug them into an SVM.
E.g., Liao and Noble, Journal of Computational Biology.
32
Outline of the Talk
Outline
Background.
  • Important aspects in Machine Learning today.

Learning with Similarity Functions.
  • Supervised Learning.
  • Kernel methods and their limitations.
  • A new framework: General Similarity Functions.
  • Unsupervised Learning.

Balcan-Blum-Vempala, STOC 2008
  • A new framework for Clustering.

Other work and future directions.
33
What if only Unlabeled Examples Available?
Clustering
[Figure: documents to be clustered by topic, e.g., sports vs. fashion.]
S = a set of n objects (e.g., documents).
∃ a ground-truth clustering: each x has a label l(x) in {1, …, t} (its topic).
Goal: output h of low error, where err(h) = min_σ Pr_{x∈S}[σ(h(x)) ≠ l(x)].
Problem: unlabeled data only!
But we have a similarity function!
34
What if only Unlabeled Examples Available?
Clustering
[Figure: documents clustered by topic, e.g., sports vs. fashion.]
Protocol
  • ∃ a ground-truth clustering for S, i.e., each x in S has l(x) in {1, …, t}.
  • The similarity function K has to be related to the ground-truth.
  • Input: S, a similarity function K.
  • Output: a clustering of small error.
35
What if only Unlabeled Examples Available?
Clustering
[Figure: documents clustered by topic, e.g., sports vs. fashion.]
Fundamental Question
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
36
Contrast with Standard Approaches
Clustering
Approximation algorithms
  • Input: graph or embedding into R^d
  • Score algorithms based on approximation ratios
  • Analyze algorithms that optimize various criteria over edges

Mixture models
  • Input: embedding into R^d
  • Score algorithms based on error rate
  • Strong probabilistic assumptions

Our Approach: a theoretical framework for clustering that is discriminative, not generative [Balcan-Blum-Vempala, STOC 2008]
  • Input: graph or similarity information
  • Score algorithms based on error rate
  • No strong probabilistic assumptions
  • Much better when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.
37
A Condition that Trivially Works
Clustering
What natural properties on a similarity function would be sufficient to allow one to cluster well?
K(x,y) > 0 for all x,y with l(x) = l(y); K(x,y) < 0 for all x,y with l(x) ≠ l(y).
[Figure: two clusters, e.g., sports vs. fashion.]
38
Clustering
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
All x are more similar to all y in their own cluster than to any z in any other cluster.
Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data!
[Figure: points with pairwise similarities K(x,x') = 1, 0.5, and 0 that admit two equally consistent clusterings.]
39
Relax Our Goals
Clustering
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
40
Relax Our Goals
Clustering
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
[Figure: example hierarchy — root 'all topics', children 'sports' and 'fashion', with leaves 'tennis' and 'soccer' under sports and 'Lacoste' and 'Gucci' under fashion.]
2. Produce a list of clusterings s.t. at least one has low error.
Trade off the strength of the assumption against the size of the list.
This yields a rich, general model.
41
Examples of Properties and Algorithms
Clustering
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
Sufficient for hierarchical clustering (single linkage algorithm).
Stability Property
For all clusters C, C', and all A ⊆ C, A' ⊆ C': neither A nor A' is more attracted to the other one than to the rest of its own cluster.
[Figure: clusters C and C' with subsets A and A'.]
Sufficient for hierarchical clustering (average linkage algorithm; a sketch appears below).
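A minimal sketch (illustrative, not the talk's code) of average-linkage hierarchical clustering driven by a similarity function K: repeatedly merge the two current clusters with the highest average pairwise similarity and record the merge tree; any pruning of that tree is then a candidate clustering.

```python
import numpy as np

def average_linkage(S, K):
    """Greedy average linkage: S is a list of points, K a similarity function.
    Returns the list of merges; together they define the hierarchy."""
    clusters = [[i] for i in range(len(S))]   # start with singletons
    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average similarity between cluster a and cluster b
                sim = np.mean([K(S[i], S[j]) for i in clusters[a] for j in clusters[b]])
                if sim > best:
                    best, best_pair = sim, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```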
42
Examples of Properties and Algorithms
Clustering
Average Attraction Property
E_{x'∈C(x)}[K(x,x')] > E_{x'∈C'}[K(x,x')] + γ   (∀ C' ≠ C(x))
Not sufficient for hierarchical clustering, but can produce a small list of clusterings (sampling-based algorithm; a sketch follows after this slide's text).
Stability of Large Subsets Property
For all clusters C, C', and all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn: neither A nor A' is more attracted to the other one than to the rest of its own cluster.
[Figure: clusters C and C' with large subsets A and A'.]
Sufficient for hierarchical clustering.
Find the hierarchy using a multi-stage learning-based algorithm.
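A rough sketch of one way a sampling-based list algorithm could work under average attraction (the structure is inferred from the slide's one-line description; the parameters and helper names are illustrative): draw a small sample, enumerate its possible cluster assignments, and extend each assignment to all of S by average attraction.

```python
import itertools
import numpy as np

def list_clusterings_by_sampling(S, K, k, sample_size, rng):
    """Sketch: return a list of candidate clusterings of S into k clusters.
    The list size grows like k**sample_size, reflecting the tradeoff between
    the strength of the assumption and the size of the list."""
    idx = rng.choice(len(S), size=sample_size, replace=False)
    sample = [S[i] for i in idx]
    clusterings = []
    for labels in itertools.product(range(k), repeat=sample_size):
        groups = [[y for y, l in zip(sample, labels) if l == c] for c in range(k)]
        if any(len(g) == 0 for g in groups):
            continue
        # assign each point to the sampled group it is most attracted to on average
        assign = [max(range(k), key=lambda c: np.mean([K(x, y) for y in groups[c]]))
                  for x in S]
        clusterings.append(assign)
    return clusterings
```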
43
Stability of Large Subsets Property
Clustering
For all C, C', all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn: K(A, C − A) > K(A, A').
[Figure: clusters C and C' with large subsets A and A'.]
Algorithm
  1. Generate a list L of candidate clusters (average attraction algorithm). Ensure that any ground-truth cluster is f-close to one in L.
  2. For every pair (C, C') in L such that all three parts C ∩ C', C \ C', C' \ C are large:
     if K(C ∩ C', C \ C') ≥ K(C ∩ C', C' \ C), then throw out C'; else throw out C.
     (A sketch of this test follows below.)
  3. Clean up and hook the surviving clusters into a tree.
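A minimal sketch of the test in step 2, assuming candidate clusters are represented as Python sets and K is a pairwise similarity (the helper names are hypothetical): compare the average attraction of the intersection C ∩ C' to each difference and discard the candidate on the losing side.

```python
import itertools
import numpy as np

def avg_sim(K, A, B):
    """Average similarity between two sets of points."""
    return np.mean([K(a, b) for a in A for b in B])

def prune_candidates(K, candidates, min_size):
    """Step 2 (sketch): for every pair of candidate clusters whose intersection
    and both differences are large, keep the candidate the intersection is
    more attracted to and throw out the other."""
    alive = list(candidates)
    for C, C2 in itertools.combinations(list(alive), 2):
        if C not in alive or C2 not in alive:
            continue
        inter, d1, d2 = C & C2, C - C2, C2 - C
        if min(len(inter), len(d1), len(d2)) >= min_size:
            if avg_sim(K, inter, d1) >= avg_sim(K, inter, d2):
                alive.remove(C2)   # intersection is more attracted to C
            else:
                alive.remove(C)
    return alive
```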
44
Stability of Large Subsets
Clustering
For all C, C', all A ⊂ C, A' ⊆ C' with |A| + |A'| ≥ sn:
K(A, C − A) > K(A, A') + γ
[Figure: clusters C and C' with large subsets A and A'.]
Theorem
If s = O(ε²/k²) and f = O(ε²γ/k²), then the algorithm produces a tree such that the ground-truth clustering is ε-close to a pruning of it.
45
Similarity Functions for Clustering, Summary
Summary, Part II
Main Conceptual Contributions
  • Minimal conditions on K to be useful for
    clustering.
  • For a robust theory, relax the objective: output a hierarchy or a list.
  • A general model that parallels PAC, SLT, Learning
    with Kernels and Similarity Functions in
    Supervised Classification.

Technically Most Difficult Aspects
  • Algorithms for stability of large subsets and ν-strict separation.
  • Algorithms and analysis for the inductive setting, e.g., sampling preserves stability (regularity-based arguments).

46
Overall Summary
Similarity Functions, Overall Summary
Supervised Classification: generalize and simplify the existing theory of kernels. [Balcan-Blum, ICML 2006; Balcan-Blum-Srebro, COLT 2008; Balcan-Blum-Srebro, MLJ 2008]
Unsupervised Learning: first clustering model for analyzing accuracy without strong probabilistic assumptions. [Balcan-Blum-Vempala, STOC 2008]
47
Mechanism Design and Pricing Problems
Other Research Directions
48
Mechanism Design and Pricing Problems
Other Research Directions
A generic reduction from standard algorithm design to incentive-compatible auction design. [BBHM, FOCS 2005; BBHM, JCSS 2008]
Approximation and Online Algorithms for Pricing Problems
Revenue maximization in combinatorial auctions:
  • Single-minded customers
  • Customers with general valuation functions
[BB, EC 2006; BB, TCS 2007; BBCH, WINE 2007; BBM, EC 2008]
49
New Frameworks for Machine Learning
Other Research Directions
Kernels, Margins, Random Projections, Feature Selection [BBV, ALT 2004; BBV, MLJ 2006]
Incorporating Unlabeled Data in the Learning Process
  • Semi-supervised Learning: unified theoretical framework [BB, COLT 2005; BB, book chapter, 2006]; Co-training [BBY, NIPS 2004]
  • Active Learning: agnostic active learning [BBL, ICML 2006; BBL, JCSS 2008]; margin-based active learning [BBZ, COLT 2007]
50
Future Directions
Future Directions
Connections between Computer Science and Economics
Use Machine Learning to automate aspects of
Mechanism Design and analyze complex systems.
New Frameworks and Algorithms for Machine Learning
  • Interactive Learning
  • Similarity Functions for Learning and Clustering

Learn a good similarity based on data from
related problems.
Other navigational structures, e.g., a small DAG.
Other notions of "useful"; other types of feedback.
Machine Learning for other areas of Computer
Science
51
Overall Summary
Similarity Functions, Overall Summary
Supervised Classification: generalize and simplify the existing theory of kernels. [Balcan-Blum, ICML 2006; Balcan-Blum-Srebro, COLT 2008; Balcan-Blum-Srebro, MLJ 2008]
Unsupervised Learning: first clustering model for analyzing accuracy without strong probabilistic assumptions. [Balcan-Blum-Vempala, STOC 2008]