New Theoretical Frameworks for Machine Learning presentation

Title: New Theoretical Frameworks for Machine Learning

1
New Theoretical Frameworks for Machine Learning
Maria-Florina Balcan

2
Thanks to My Committee
Avrim Blum
Tom Mitchell
Manuel Blum
Yishay Mansour
Santosh Vempala
3
The Goals of the Thesis
New Frameworks for Important Learning Problems
Models, algorithms, generalization bounds for

Semi-Supervised Learning

Active Learning

Prominent methods in ML today.

Learning with Kernels

Learning with General Similarity Fns

Clustering

Machine Learning and Algorithmic Game Theory

ML both for designing and analyzing auctions
(Revenue Maximization)

4
New Frameworks for Machine Learning
(Not captured by standard learning models)
Important Learning Paradigms
Incorporating Unlabeled Data in the Learning
Process
Kernel based Learning
Qualitative gap between theory and practice
Semi-supervised Learning
Unified theoretical treatment lacking
Active Learning
Our Contributions
A theory of learning with general similarity
functions
Semi-supervised learning
- a unified discriminative framework
Active Learning
New discriminative model for Clustering
- new positive theoretical results
5
New Frameworks for Machine Learning
(Not captured by standard learning models)
Important Learning Paradigms
Incorporating Unlabeled Data in the Learning
Process
Kernel, Similarity based Learning and Clustering
Qualitative gap between theory and practice
Unified theoretical treatment lacking
Our Contributions
A theory of learning with general similarity
functions
Semi-supervised learning
- a unified discriminative framework
Active Learning
New discriminative model for Clustering
- new positive theoretical results
6
Machine Learning and Algorithmic Game Theory
One-slide Overview of Our Results
Machine Learning for Auction Design and Pricing
Generic reduction: incentive compatible auction design to standard algorithm design
Balcan-Blum-Hartline-Mansour, FOCS 2005
Balcan-Blum-Hartline-Mansour, JCSS 2008
Approximation and Online Algorithms for Pricing
Revenue maximization in combinatorial auctions
Other related work
Single minded customers
Customers with general valuations
BB, EC 2006
BB, TCS 2007
BBM, EC 2008
BBCH, WINE 2007
7
The Goals of the Thesis
New Frameworks for Important Learning Problems
Machine Learning and Algorithmic Game Theory
8
New Frameworks for Important Learning Problems
Prominent methods in Machine Learning today.
Lots of Books, Workshops.
Significant percentage of ICML, NIPS, COLT.
ICML 2007, Business meeting
9
Structure of the Talk
New Frameworks for Important Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
Semi-supervised learning (SSL)
- An Augmented PAC model for SSL
Balcan-Blum, COLT 2005; book chapter, Semi-Supervised Learning, 2006
Active Learning (AL)
- Generic agnostic AL procedure
Balcan-Beygelzimer-Langford, ICML 2006; JCSS 2008
- Margin based AL of linear separators
Balcan-Broder-Zhang, COLT 2007
Balcan-Hanneke-Wortman, COLT 2008; MLJ 2008 (best student paper)
Kernels, Similarity based learning and Clustering
- Kernels, margins, feature selection
Balcan-Blum-Vempala, ALT 2004; MLJ 2006
- General theory of learning with similarity functions
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, MLJ 2008
Balcan-Blum-Srebro, COLT 2008
- Discriminative model for Clustering
Balcan-Blum-Vempala, STOC 2008
Balcan-Blum-Gupta, Manuscript 2008
10
Part I, Incorporating Unlabeled Data in the
Learning Process
Semi-Supervised Learning
A general discriminative framework
Balcan-Blum, COLT 2005; book chapter, Semi-Supervised Learning, 2006
11
Standard Supervised Learning

$X$: instance/feature space

$S = \{(x, l)\}$: set of labeled examples, drawn i.i.d. from a distribution $D$ over $X$ and labeled by some target concept $c$

Labels $\in \{-1, 1\}$: binary classification

Do optimization over $S$, find hypothesis $h \in C$.

Goal: $h$ has small error over $D$, where $\mathrm{err}(h) = \Pr_{x \in D}[h(x) \neq c(x)]$.

$c$ in $C$: realizable case
$c$ not in $C$: agnostic case

Classic models for learning from labeled data.

Statistical Learning Theory (Vapnik)

PAC (Valiant)

12
Standard Supervised Learning
Sample Complexity

E.g., Finite Hypothesis Spaces, Realizable Case

In PAC, can also talk about efficient algorithms.
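
The bound for the finite, realizable case did not survive in this transcript. As background only (the standard textbook statement, not necessarily the exact form on the slide): if the number of labeled examples is at least (1/epsilon)(ln|C| + ln(1/delta)), then with probability at least 1 - delta every hypothesis in C that is consistent with the sample has error at most epsilon. A minimal Python sketch of that arithmetic follows; the function name is hypothetical.

```python
import math


def pac_sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Labeled examples sufficient for a finite class in the realizable case.

    Uses the standard bound m >= (1/epsilon) * (ln|C| + ln(1/delta)).
    """
    if num_hypotheses < 1:
        raise ValueError("num_hypotheses must be at least 1")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / epsilon)


# Example: |C| = 1000 hypotheses, target error 0.1, confidence 0.95.
print(pac_sample_size(1000, epsilon=0.1, delta=0.05))
```

The dependence on |C| is only logarithmic, which is why shrinking C to the compatible hypotheses (as in the semi-supervised model later in the talk) reduces the labeled sample size.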

13
Semi-Supervised Learning
$S_u = \{x_i\}$: unlabeled examples drawn i.i.d. from $D$
$S_l = \{(x_i, y_i)\}$: labeled examples drawn i.i.d. from $D$, labeled by target $c$.
Diagram: the data source sends unlabeled examples to the learning algorithm and to the expert / oracle; the expert / oracle returns labeled examples to the learning algorithm.
Algorithm outputs a classifier
14
Semi-Supervised Learning

Variety of methods and experimental results

Transductive SVM [Joachims 98]
Co-training [Blum-Mitchell 98], [Balcan-Blum-Yang 04]
Graph-based methods [Blum-Chawla 01], [Zhu-Lafferty-Ghahramani 03]
Etc.

Scattered and very specific theoretical results

We provide a general discriminative (PAC, SLT style) framework for SSL.
Challenge: capture many of the assumptions typically used.
Different SSL algorithms are based on different assumptions.
15
Example of a typical assumption: Margins
Belief: the target goes through low density regions (large margin).
16
Another Example: Self-consistency
Agreement between two parts: co-training [Blum-Mitchell 98].
- examples contain two sufficient sets of features, $x = \langle x_1, x_2 \rangle$
- belief: the parts are consistent, i.e. $\exists\, c_1, c_2$ s.t. $c_1(x_1) = c_2(x_2) = c(x)$
For example, if we want to classify web pages: $x = \langle x_1, x_2 \rangle$
17
New discriminative model for SSL [BB05]
Problems with thinking about SSL in standard models:

PAC or SLT: learn a class $C$ under (known or unknown) distribution $D$.

Unlabeled data doesn't give any info about which $c \in C$ is the target.

Key Insight
Unlabeled data useful if we have beliefs not only
about the form of the target, but also about its
relationship with the underlying distribution.
18
Proposed Model, Main Ideas
Augment the notion of a concept class $C$ with a notion of compatibility $\chi$ between a concept and the data distribution.
"Learn $C$" becomes "learn $(C, \chi)$" (learn class $C$ under $\chi$).
Express relationships that one hopes the target function and underlying distribution possess.
Idea I: use unlabeled data and the belief that the target is compatible to reduce $C$ down to just the highly compatible functions in $C$.
Diagram: the class of fns $C$ (e.g., linear separators), together with the abstract prior $\chi$ and unlabeled data, gives the compatible fns in $C$ (e.g., large margin linear separators).
Idea II: require that the degree of compatibility can be estimated from a finite sample.
19
Types of Results in the BB05 Model
Fundamental Sample Complexity issues

How much unlabeled data we need:

depends both on the complexity of $C$ and on the complexity of the compatibility notion.

Ability of unlabeled data to reduce the number of labeled examples:

compatibility of the target

(various) measures of the helpfulness of the distribution

$\epsilon$-cover bounds are much better than uniform convergence bounds.
Main Poly-Time Algorithmic Result: improved algorithm for co-training of linear separators (improves over [BM98] substantially)
Subsequent work that used our framework:
P. Bartlett, D. Rosenberg, AISTATS 2007
Kakade et al, COLT 2008
J. Shawe-Taylor et al., Neurocomputing 2007
20
Part II, Incorporating Unlabeled Data in the
Learning Process
Active Learning
Brief Overview of the results
21
Active Learning (AL)
Data Source
Expert / Oracle
Unlabeled examples
Learning Algorithm
Request for the Label of an Example
A Label for that Example
Request for the Label of an Example
A Label for that Example
. . .
Algorithm outputs a classifier

Classic example where AL helps: thresholds on the real line.
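
The thresholds example can be made concrete with a binary search over a pool of unlabeled points. The sketch below is illustrative only and assumes the realizable case (labels are +1 exactly when a point is at or above some unknown threshold); `learn_threshold_actively` and `query_label` are hypothetical names, not from the talk.

```python
def learn_threshold_actively(points, query_label):
    """Find a threshold consistent with all pool points using few label queries.

    Assumes the realizable case: query_label(x) is +1 iff x >= some unknown
    threshold. Returns (threshold, number_of_queries); the learned hypothesis
    is h(x) = +1 iff x >= threshold.
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)  # invariant: xs[:lo] are all -1, xs[hi:] are all +1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == 1:
            hi = mid       # first positive point is at mid or to its left
        else:
            lo = mid + 1   # first positive point is to the right of mid
    threshold = xs[lo] if lo < len(xs) else float("inf")
    return threshold, queries


# Toy pool of 1000 points; the oracle hides a threshold at 0.3715.
pool = [i / 1000 for i in range(1000)]
oracle = lambda x: 1 if x >= 0.3715 else -1
threshold, queries = learn_threshold_actively(pool, oracle)
print(threshold, queries)
```

Passive learning would need labels for on the order of 1/epsilon points to reach error epsilon, while the search above asks for only about log2 of the pool size, which is the exponential improvement the slide refers to.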
22
First Agnostic Active Learning Procedure
We provide $A^2$, the first algorithm which is robust to noise.
Balcan, Beygelzimer, Langford, ICML06
Balcan, Beygelzimer, Langford, JCSS08
Region of disagreement style: pick a few points at random from the current region of uncertainty, query their labels, throw out hypotheses if you are statistically confident they are suboptimal.
(similar to [CAL92], realizable case)
Guarantees for $A^2$:

Fall-back and exponential improvements.

$C$ = thresholds, low noise: exponential improvement.

$C$ = homogeneous linear separators in $R^d$, $D$ = uniform over the unit sphere, low noise: only $d^2 \log(1/\epsilon)$ labels to find $h$ with error $\epsilon$.

A lot of subsequent work:
[Hanneke07, DHM07, BBZ07, BHW08]
23
First Agnostic Active Learning Procedure
We provide $A^2$, the first algorithm which is robust to noise.
Balcan, Beygelzimer, Langford, ICML06
Balcan, Beygelzimer, Langford, JCSS08
Region of disagreement style: pick a few points at random from the current region of uncertainty, query their labels, throw out hypotheses if you are statistically confident they are suboptimal.
(similar to [CAL92], realizable case)
Guarantees for $A^2$:

Fall-back and exponential improvements.

$C$ = thresholds, low noise: exponential improvement.

$C$ = homogeneous linear separators in $R^d$, $D$ = uniform over the unit sphere, low noise: only $d^2 \log(1/\epsilon)$ labels to find $h$ with error $\epsilon$.

Realizable: $d^{3/2} \log(1/\epsilon)$ labels

Improved in subsequent work: $d \log^2(1/\epsilon)$

Balcan-Broder-Zhang, COLT 07
24
Margin Based Active-Learning Algorithm
Realizable case, can get $d \log^2(1/\epsilon)$ labels
Balcan-Broder-Zhang, COLT 07
Use $O(d)$ examples to find $w_1$ of error $1/8$.

iterate $k = 2, \ldots, \log(1/\epsilon)$
  rejection sample $m_k$ samples $x$ from $D$ satisfying $|w_{k-1}^T x| \le \gamma_k$
  label them
  find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples.
end iterate

Other Work: [BHW08], a new perspective on AL.
25
Part III, Learning with Kernels and
More General Similarity Functions
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, MLJ 2008
Balcan-Blum-Srebro, COLT 2008
26
Kernel Methods
Prominent method for supervised classification
today.
The learning alg. interacts with the data via a similarity function.
What is a Kernel?
A kernel $K$ is a legal def of dot-product: i.e. there exists an implicit mapping $\phi$ such that $K(x, y) = \phi(x) \cdot \phi(y)$.
E.g., $K(x,y) = (x \cdot y + 1)^d$
$\phi$: ($n$-dimensional space) $\to$ $n^d$-dimensional space
Why do Kernels matter?
Many algorithms interact with data only via dot-products.
So, if we replace $x \cdot y$ with $K(x,y)$, they act implicitly as if data was in the higher-dimensional $\phi$-space.
27
Example
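$K(x,y) = (x \cdot y)^d$ corresponds to an explicit mapping $\phi$.

E.g., for $n = 2$, $d = 2$, the kernel $K(x,y) = (x \cdot y)^2$ corresponds to $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)$.

Figure: the data in the original space and its image in the $\phi$-space.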
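
For the n = 2, d = 2 case, the identity K(x,y) = phi(x) . phi(y) can be checked numerically. The sketch below uses the standard explicit map for the degree-2 homogeneous polynomial kernel, phi(x) = (x1^2, x2^2, sqrt(2) x1 x2); the function names are illustrative.

```python
import math


def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel on 2-D inputs: (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map into the 3-D phi-space for poly2_kernel."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])


def dot(u, v):
    """Ordinary dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
# Both prints show the same value up to floating-point rounding.
print(poly2_kernel(x, y))
print(dot(phi(x), phi(y)))
```

The kernel is evaluated in the original 2-D space, yet it equals a dot product in the 3-D phi-space, which is what lets dot-product-based algorithms work in the higher-dimensional space implicitly.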
28
Generalize Well if Good Margin

If data is linearly separable by a margin in $\phi$-space, then good sample complexity.

If margin $\gamma$ in $\phi$-space (with $|\phi(x)| \le 1$), then need sample size of only $\tilde{O}(1/\gamma^2)$ to get confidence in generalization.

(another example of a generalization bound)

29
Limitations of the Current Theory
In practice kernels are constructed by viewing
them as measures of similarity.
Existing Theory in terms of margins in implicit
spaces.
Difficult to think about, not great for intuition.
Kernel requirement rules out many natural
similarity functions.
Better theoretical explanation?
30
Better Theoretical Framework
Yes! We provide a more general and intuitive
theory that formalizes the intuition that a good
kernel is a good measure of similarity.
In practice kernels are constructed by viewing
them as measures of similarity.
Existing Theory in terms of margins in implicit
spaces.
Difficult to think about, not great for intuition.
Kernel requirement rules out natural similarity
functions.
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, MLJ 2008
Balcan-Blum-Srebro, COLT 2008
Better theoretical explanation?
31
More General Similarity Functions
We provide a notion of a good similarity function:

1) Simpler, in terms of natural direct quantities.
- no implicit high-dimensional spaces
- no requirement that $K(x,y) = \phi(x) \cdot \phi(y)$
2) Is broad: includes the usual notion of a good kernel (has a large margin sep. in $\phi$-space).
3) Allows one to learn classes that have no good kernels.
Diagram labels: good kernels; first attempt; main notion; $K$ can be used to learn well.
32
A First Attempt
$P$: distribution over labeled examples $(x, l(x))$
Goal: output classification rule good for $P$
$K$ is good if most $x$ are on average more similar to points $y$ of their own type than to points $y$ of the other type.
$K$ is $(\epsilon,\gamma)$-good for $P$ if a $1-\epsilon$ prob. mass of $x$ satisfy:
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$
Left side: average similarity to points of the same label.
Right side: average similarity to points of opposite label, plus the gap $\gamma$.
33
A First Attempt
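$K$ is $(\epsilon,\gamma)$-good for $P$ if a $1-\epsilon$ prob. mass of $x$ satisfy:
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$
Example: e.g., $K(x,y) \ge 0.2$ when $l(x) = l(y)$, and $K(x,y)$ random in $\{-1, 1\}$ when $l(x) \neq l(y)$.
Figure: example points with pairwise similarity values 0.4, 0.3, 0.5, -1, 1, 1.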
34
A First Attempt
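$K$ is $(\epsilon,\gamma)$-good for $P$ if a $1-\epsilon$ prob. mass of $x$ satisfy:
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$
Algorithm

Draw sets $S^+$, $S^-$ of positive and negative examples.

Classify $x$ based on average similarity to $S^+$ versus to $S^-$.

Figure: a point $x$ compared against the sets $S^+$ and $S^-$.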
35
A First Attempt
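$K$ is $(\epsilon,\gamma)$-good for $P$ if a $1-\epsilon$ prob. mass of $x$ satisfy:
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$
Algorithm

Draw sets $S^+$, $S^-$ of positive and negative examples.

Classify $x$ based on average similarity to $S^+$ versus to $S^-$.

Theorem
If $|S^+|$ and $|S^-|$ are $\Omega((1/\gamma^2) \ln(1/(\delta\epsilon')))$, then with probability $\ge 1-\delta$, error $\le \epsilon + \epsilon'$.

For a fixed good $x$, prob. of error w.r.t. $x$ (over draw of $S^+$, $S^-$) is $\le \delta\epsilon'$. [Hoeffding]

At most $\delta$ chance that the error rate over GOOD is $\ge \epsilon'$.

Overall error rate $\le \epsilon + \epsilon'$.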
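
The algorithm above (classify x by its average similarity to the positive sample versus the negative sample) can be sketched in a few lines of Python. The similarity function and data below are toy assumptions chosen for illustration, not taken from the talk.

```python
def average_similarity_classifier(K, positives, negatives):
    """Build h(x): +1 if x is on average more similar to S+ than to S-."""
    def classify(x):
        avg_pos = sum(K(x, y) for y in positives) / len(positives)
        avg_neg = sum(K(x, y) for y in negatives) / len(negatives)
        return 1 if avg_pos >= avg_neg else -1
    return classify


def toy_similarity(x, y):
    """Hypothetical similarity on [0, 1]: larger when points are closer."""
    return 1.0 - abs(x - y)


# Toy samples: positives near 1, negatives near 0.
S_pos = [0.8, 0.9, 1.0]
S_neg = [0.0, 0.1, 0.2]
h = average_similarity_classifier(toy_similarity, S_pos, S_neg)
print(h(0.9), h(0.1))
```

Nothing here requires K to be positive semidefinite; the rule only compares two empirical averages, which is why Hoeffding's inequality is enough for the analysis.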

36
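A First Attempt: Not Broad Enough
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$

Figure: positives split into two groups, each at 30° from the separator, with the negatives in a single group. A positive point has average similarity ½ to the negatives versus ½·1 + ½·(−½) = ¼ to the positives.
Similarity function $K(x,y) = x \cdot y$:

has a large margin separator

does not satisfy our definition.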
37
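A First Attempt: Not Broad Enough
$E_{y \sim P}[K(x,y) \mid l(y) = l(x)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x)] + \gamma$
Figure: the same example (two 30° angles) with a region $R$ marked.
Broaden: $\exists$ a non-negligible $R$ s.t. most $x$ are on average more similar to $y \in R$ of the same label than to $y \in R$ of the other label
(even if we do not know $R$ in advance).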
38
Broader Definition

$K$ is $(\epsilon, \gamma, \tau)$-good if $\exists$ a set $R$ of "reasonable" $y$ (allow probabilistic) s.t. a $1-\epsilon$ fraction of $x$ satisfy:

$E_{y \sim P}[K(x,y) \mid l(y) = l(x), R(y)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x), R(y)] + \gamma$
At least $\tau$ prob. mass of reasonable positives and negatives.
Property

Draw $S = \{y_1, \ldots, y_d\}$, a set of landmarks.

Re-represent data: $x \to F(x) = [K(x,y_1), \ldots, K(x,y_d)]$, so $P$ becomes $F(P)$.

If enough landmarks ($d = \Omega(1/(\gamma^2 \tau))$), then with high prob. there exists a good $L_1$ large margin linear separator:

$w = [0, 0, 1/n_+, 1/n_+, 0, 0, 0, -1/n_-, 0, 0]$
39
Broader Definition

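$K$ is $(\epsilon, \gamma, \tau)$-good if $\exists$ a set $R$ of "reasonable" $y$ (allow probabilistic) s.t. a $1-\epsilon$ fraction of $x$ satisfy:

$E_{y \sim P}[K(x,y) \mid l(y) = l(x), R(y)] \ge E_{y \sim P}[K(x,y) \mid l(y) \neq l(x), R(y)] + \gamma$
At least $\tau$ prob. mass of reasonable positives and negatives.
Algorithm
$d_u = \tilde{O}(1/(\gamma^2 \tau))$
$d_l = O\left(\frac{1}{\gamma^2 \epsilon_{acc}} \ln(d_u)\right)$

Draw $S = \{y_1, \ldots, y_d\}$, a set of landmarks.

Re-represent data: $x \to F(x) = [K(x,y_1), \ldots, K(x,y_d)]$, so $P$ becomes $F(P)$.

Figure: scatter of X and O points, re-represented via $F$.

Take a new set of labeled examples, project to this space, and run a good $L_1$ linear separator alg.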
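
The landmark re-representation can be sketched as follows. This is an illustration under stated assumptions: the toy similarity is hypothetical, and instead of training an L1 linear separator the code only builds the witness weight vector from the "Property" slide (1/n+ on positive landmarks, -1/n- on negative ones), treating every landmark as reasonable.

```python
def landmark_features(K, landmarks):
    """Return F with F(x) = [K(x, y_1), ..., K(x, y_d)]."""
    def F(x):
        return [K(x, y) for y in landmarks]
    return F


def witness_weights(landmark_labels):
    """Weights 1/n_+ on positive landmarks and -1/n_- on negative ones.

    Assumption: every landmark counts as 'reasonable', and both labels occur.
    """
    n_pos = sum(1 for label in landmark_labels if label == 1)
    n_neg = sum(1 for label in landmark_labels if label == -1)
    return [1.0 / n_pos if label == 1 else -1.0 / n_neg
            for label in landmark_labels]


def score(w, features):
    """Linear score w . F(x); its sign is the predicted label."""
    return sum(wi * fi for wi, fi in zip(w, features))


def toy_similarity(x, y):
    """Hypothetical similarity on [0, 1]: larger when points are closer."""
    return 1.0 - abs(x - y)


landmarks = [0.0, 0.1, 0.9, 1.0]
landmark_labels = [-1, -1, 1, 1]
F = landmark_features(toy_similarity, landmarks)
w = witness_weights(landmark_labels)
print(score(w, F(0.8)) > 0, score(w, F(0.2)) < 0)
```

With these weights, w . F(x) is exactly the average similarity to the positive landmarks minus the average similarity to the negative ones. In the algorithm on the slide the learner does not know which landmarks are reasonable, so it learns the weights from labeled data with an L1-margin method instead.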

40
Kernels versus Similarity Functions
Main Technical Contributions
Theorem
$K$ is a good kernel $\Rightarrow$ $K$ is also a good similarity function (but $\gamma$ gets squared).
If $K$ has margin $\gamma$ in implicit space, then for any $\epsilon$, $K$ is $(\epsilon, \gamma^2, \epsilon)$-good in our sense.
41
Kernels versus Similarity Functions
Main Technical Contributions
Strictly more general
Theorem
$K$ is a good kernel $\Rightarrow$ $K$ is also a good similarity function (but $\gamma$ gets squared).
Can also show a Strict Separation.
Theorem
For any class $C$ of $n$ pairwise uncorrelated functions, $\exists$ a similarity function good for all $f$ in $C$, but no such good kernel function exists.
42
Kernels versus Similarity Functions
Can also show a Strict Separation.
Theorem
For any class $C$ of $n$ pairwise uncorrelated functions, $\exists$ a similarity function good for all $f$ in $C$, but no such good kernel function exists.

In principle, should be able to learn from $O(\epsilon^{-1} \log(|C|/\delta))$ labeled examples.

Claim 1: can define a generic $(0, 1, 1/|C|)$-good similarity function achieving this bound. (Assume $D$ not too concentrated.)

Claim 2: There is no $(\epsilon,\gamma)$-good kernel in hinge loss, even if $\epsilon = 1/2$ and $\gamma = 1/|C|^{1/2}$. So, margin based SC is $d = \Omega(|C|)$.

43
Generic Similarity Function

Partition $X$ into regions $R_1, \ldots, R_{|C|}$ with $P(R_i) > 1/\mathrm{poly}(|C|)$.
$R_i$ will be $R$ for target $f_i$.
For $y$ in $R_i$, define $K(x,y) = f_i(x) f_i(y)$.
So, for any target $f_i$ in $C$, any $x$, we get:
$E_y[l(x) l(y) K(x,y) \mid y \in R_i] = E[l(x)^2 l(y)^2] = 1$.
So, $K$ is $(0, 1, 1/\mathrm{poly}(|C|))$-good.

Gives bound $O(\epsilon^{-1} \log(|C|))$
44
Similarity Functions for Classification
Conceptual Contributions
Before: difficult theory
- Implicit spaces
- Not helpful for intuition.
- Limiting.
After, Our Work: much more intuitive theory
- No implicit spaces
- Formalizes a common intuition.
- Provably more general.
Algorithmic Implications

Can use non-PSD similarities, no need to
transform them into PSD functions and plug into
SVM.

E.g., Liao and Noble, Journal of Computational
Biology
45
Similarity Functions for Classification
Algorithmic Implications

Can use non-PSD similarities, no need to
transform them into PSD functions and plug into
SVM.

E.g., Liao and Noble, Journal of Computational
Biology

Give justification to the following rule

Also show that anything learnable with SVM is
learnable this way!

46
Part IV, A Novel View on Clustering
Balcan-Blum-Vempala, STOC 2008
A General Framework for analyzing clustering
accuracy without strong probabilistic assumptions
47
What if only Unlabeled Examples Available?
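Figure: documents grouped by topic (e.g., sports, fashion).
$S$: set of $n$ objects (e.g., documents).
$\exists$ ground truth clustering: each $x$ has a label $l(x)$ in $\{1, \ldots, t\}$ (e.g., its topic).
Goal: $h$ of low error, where
$\mathrm{err}(h) = \min_{\sigma} \Pr_{x \in S}[\sigma(h(x)) \neq l(x)]$
Problem: unlabeled data only!
But have a Similarity Function!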
48
What if only Unlabeled Examples Available?
Figure: documents grouped by topic (e.g., sports, fashion).
Protocol
$\exists$ ground truth clustering for $S$, i.e., each $x$ in $S$ has $l(x)$ in $\{1, \ldots, t\}$.
The similarity function $K$ has to be related to the ground-truth.
Input: $S$, a similarity function $K$.
Output: clustering of small error
($\mathrm{err}(h) = \min_{\sigma} \Pr_{x \in S}[\sigma(h(x)) \neq l(x)]$)
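
The error measure above minimizes over relabelings sigma of the predicted cluster names, since cluster names carry no meaning. A brute-force Python sketch (illustrative; it tries all t! permutations, so it is only practical for small t):

```python
from itertools import permutations


def clustering_error(predicted, truth, t):
    """Fraction of points misclassified under the best relabeling sigma.

    predicted and truth are equal-length lists of labels in 1..t.
    """
    n = len(truth)
    best_mistakes = n
    for sigma in permutations(range(1, t + 1)):
        # sigma[h - 1] is the ground-truth name assigned to predicted cluster h
        mistakes = sum(1 for h, l in zip(predicted, truth) if sigma[h - 1] != l)
        best_mistakes = min(best_mistakes, mistakes)
    return best_mistakes / n


# Predicted clusters use swapped names and misplace one point.
predicted = [2, 2, 1, 1, 1]
truth = [1, 1, 2, 2, 1]
print(clustering_error(predicted, truth, t=2))
```

For larger t the same minimum can be found with a bipartite matching algorithm, but the permutation loop keeps the definition visible.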
49
What if only Unlabeled Examples Available?
sports
fashion
Fundamental Question
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
50
Contrast with Standard Approaches
Clustering Theoretical Frameworks
Approximation algorithms
- Input: graph or embedding into $R^d$
- analyze algs to optimize various criteria over edges
- score algs based on apx ratios
Mixture models
- Input: embedding into $R^d$
- score algs based on error rate
- strong probabilistic assumptions
Our Approach
Balcan-Blum-Vempala, STOC 2008
Discriminative, not generative.
- Input: graph or similarity info
- score algs based on error rate
- no strong probabilistic assumptions
Much better when input graph / similarity is based on heuristics.
E.g., clustering documents by topic, web search results by category

51
Condition that trivially works.
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
sports
fashion
$K(x,y) > 0$ for all $x, y$ with $l(x) = l(y)$.
$K(x,y) < 0$ for all $x, y$ with $l(x) \neq l(y)$.
52
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
All $x$ are more similar to all $y$ in their own cluster than to any $z$ in any other cluster.
Problem: the same $K$ can satisfy it for two very different, equally natural clusterings of the same data!
Figure: similarity values $K(x,x') = 1$, $K(x,x') = 0.5$, $K(x,x') = 0$.
53
Relax Our Goals
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
54
Relax Our Goals
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
Example hierarchy: All topics splits into sports (tennis, soccer) and fashion (Lacoste, Gucci).
2. List of clusterings s.t. at least one has low error.
Tradeoff: strength of assumption versus size of list.
Obtain a rich, general model.
Obtain a rich, general model.
55
Examples of Properties and Algorithms
Strict Separation Property
All $x$ are more similar to all $y$ in their own cluster than to any $z$ in any other cluster.
Sufficient for hierarchical clustering
(single linkage algorithm)
Stability Property
For all clusters $C$, $C'$, for all $A \subseteq C$, $A' \subseteq C'$, neither $A$ nor $A'$ is more attracted to the other one than to the rest of its own cluster.
($K(A,A')$ = average attraction between $A$ and $A'$)
Sufficient for hierarchical clustering
(average linkage algorithm)
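
To illustrate the strict separation case, the sketch below runs a plain single linkage merge (repeatedly joining the two clusters whose most similar pair is highest) and records every cluster that appears in the resulting tree. The data and similarity are toy assumptions; the quadratic pair search is written for clarity, not efficiency.

```python
def single_linkage_tree(points, K):
    """Return all clusters (as frozensets of indices) formed by single linkage."""
    clusters = [frozenset([i]) for i in range(len(points))]
    nodes = list(clusters)  # leaves first, then each merged cluster
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest cross pair.
                sim = max(K(points[i], points[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        nodes.append(merged)
    return nodes


# Toy data satisfying strict separation: two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.3]
similarity = lambda x, y: -abs(x - y)
nodes = single_linkage_tree(points, similarity)
print(frozenset({0, 1, 2}) in nodes, frozenset({3, 4, 5}) in nodes)
```

Both ground-truth clusters show up as nodes of the tree, so the target clustering is a pruning of the hierarchy, as the slide states for the strict separation property.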
56
Examples of Properties and Algorithms
Average Attraction Property
$E_{x' \in C(x)}[K(x,x')] > E_{x' \in C'}[K(x,x')] + \gamma$ ($\forall C' \neq C(x)$)
Not sufficient for hierarchical clustering
Can produce a small list of clusterings.
(sampling based algorithm)
Stability of Large Subsets Property
For all clusters $C$, $C'$, for all $A \subseteq C$, $A' \subseteq C'$ with $|A| + |A'| \ge sn$, neither $A$ nor $A'$ is more attracted to the other one than to the rest of its own cluster.
Sufficient for hierarchical clustering
Find hierarchy using a multi-stage learning-based algorithm.
57
Stability of Large Subsets Property
For all $C$, $C'$, all $A \subset C$, $A' \subseteq C'$ with $|A| + |A'| \ge sn$:
$K(A, C \setminus A) > K(A, A')$
Algorithm

1) Generate list $L$ of candidate clusters (average attraction alg.).

Ensure that any ground-truth cluster is $f$-close to one in $L$.

2) For every $(C, C_0)$ in $L$ s.t. all three parts are large:

If $K(C \cap C_0, C \setminus C_0) \ge K(C \cap C_0, C_0 \setminus C)$, then throw out $C_0$.
Else throw out $C$.
3) Clean and hook up the surviving clusters into a tree.
58
Stability of Large Subsets
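For all $C$, $C'$, all $A \subset C$, $A' \subseteq C'$ with $|A| + |A'| \ge sn$:
$K(A, C \setminus A) > K(A, A') + \gamma$
Theorem
If $s = O(\epsilon^2/k^2)$, $f = O(\epsilon^2 \gamma/k^2)$, then produce a tree s.t. the ground-truth is $\epsilon$-close to a pruning.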
59
Similarity Functions for Clustering, Summary

Minimal conditions on K to be useful for
clustering.

For robust theory, relax the objective: hierarchy, list.

A general model that parallels PAC, SLT, Learning
with Kernels and Similarity Functions in
Supervised Classification.

60
Similarity Functions, Overall Summary
Supervised Classification
Generalize and simplify the existing theory of Kernels.
Balcan-Blum, ICML 2006
Balcan-Blum-Srebro, COLT 2008
Balcan-Blum-Srebro, MLJ 2008
Unsupervised Learning
First Clustering model for analyzing accuracy without strong probabilistic assumptions.
Balcan-Blum-Vempala, STOC 2008
61
Future Directions
Connections between Computer Science and Economics
Active learning and online learning techniques
for better pricing algorithms and auctions.
New Frameworks and Algorithms for Machine Learning

Similarity Functions for Learning and Clustering

Learn a good similarity based on data from
related problems.
Other notions of "useful", other types of feedback.
Other navigational structures, e.g., a small DAG.
