1
Active Learning of Binary Classifiers
  • Presenters Nina Balcan and Steve Hanneke

2
Outline
  • What is Active Learning?
  • Active Learning Linear Separators
  • General Theories of Active Learning
  • Open Problems

3
Supervised Passive Learning
[Diagram: the Data Source provides unlabeled examples to an Expert / Oracle, who returns labeled examples to the Learning Algorithm; the algorithm outputs a classifier.]
4
Incorporating Unlabeled Data in the Learning Process
  • In many settings, unlabeled data is cheap and easy
    to obtain, while labeled data is much more expensive.
  • Web page and document classification
  • OCR, image classification

5
Semi-Supervised Passive Learning
[Diagram: the Data Source provides unlabeled examples; the Learning Algorithm receives the unlabeled examples directly, plus labeled examples from the Expert / Oracle, and outputs a classifier.]
6
Semi-Supervised Passive Learning
  • Several methods have been developed to try to use
    unlabeled data to improve performance, e.g.:
  • Transductive SVM [Joachims 98]
  • Co-training [Blum-Mitchell 98, BBY 04]
  • Graph-based methods [Blum-Chawla 01, ZGL 03]

Workshops: ICML 03, ICML 05
Books: Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds), MIT Press 2006
Theoretical models: [Balcan-Blum 05]
9
Active Learning
[Diagram: the Data Source provides unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends the Expert / Oracle a request for the label of an example and receives a label for that example; finally, the algorithm outputs a classifier.]
10
What Makes a Good Algorithm?
  • Guaranteed to output a relatively good classifier
    for most learning problems.
  • Doesn't make too many label requests.

Choose the label requests carefully, to get
informative labels.
11
Can It Really Do Better Than Passive?
  • YES! (sometimes)
  • We often need far fewer labels for active
    learning than for passive.
  • This is predicted by theory and has been observed
    in practice.

12
Active Learning in Practice
  • Active SVM (Tong Koller, ICML 2000) seems to be
    quite useful in practice.

At any time during the alg., we have a current
guess of the separator the max-margin separator
of all labeled points so far.
E.g., strategy 1 request the label of the
example closest to the current separator.
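Below is a minimal sketch of strategy 1, assuming scikit-learn is available; the pool X_pool, the oracle labeling function, and the seed size are illustrative placeholders, and the random seed set is assumed to contain both classes.

```python
import numpy as np
from sklearn.svm import SVC

def active_svm(X_pool, oracle, n_seed=10, n_queries=50, seed=0):
    """Strategy 1: repeatedly fit a max-margin separator on the labels
    collected so far and request the label of the pool example closest
    to the current separator."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    y = {i: oracle(X_pool[i]) for i in labeled}           # seed labels
    pool = [i for i in range(len(X_pool)) if i not in y]

    clf = SVC(kernel="linear", C=1e6)                     # ~ hard-margin linear SVM
    for _ in range(n_queries):
        clf.fit(X_pool[labeled], [y[i] for i in labeled])
        margins = np.abs(clf.decision_function(X_pool[pool]))
        j = pool.pop(int(np.argmin(margins)))             # closest to the separator
        y[j] = oracle(X_pool[j])                          # query the expert
        labeled.append(j)
    clf.fit(X_pool[labeled], [y[i] for i in labeled])
    return clf
```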
13
When Does it Work? And Why?
  • The algorithms currently used in practice are not
    well understood theoretically.
  • We don't know if/when they output a good
    classifier, nor can we say how many labels they
    will need.
  • So we seek algorithms that we can understand and
    state formal guarantees for.

Rest of this talk surveys recent theoretical
results.
14
Standard Supervised Learning Setting
  • S = {(x, ℓ)}: set of labeled examples
  • drawn i.i.d. from some distribution D over X and
    labeled by some target concept c ∈ C
  • Want to do optimization over S to find some
    hyp. h, but we want h to have small error over D.
  • err(h) = Pr_{x ~ D}[h(x) ≠ c(x)]

Sample Complexity, Finite Hyp. Space, Realizable
case
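The bound on this slide was an image that did not survive the transcript; the standard statement for a finite hypothesis space in the realizable case, which it presumably showed, is:

```latex
% Realizable case, finite C: with probability at least 1-\delta, any h \in C
% consistent with m i.i.d. labeled examples has err(h) <= \epsilon, provided
\[
  m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert C\rvert + \ln\frac{1}{\delta}\right).
\]
```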
15
Sample Complexity: Uniform Convergence Bounds
  • Infinite Hypothesis Case

E.g., if C is the class of linear separators in Rd,
then we need roughly O(d/ε) examples to achieve
generalization error ε.
Non-realizable case: replace ε with ε².
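Again the displayed bound was an image; uniform-convergence statements consistent with the O(d/ε) and ε² claims above (for a class of VC dimension d) are, up to constants:

```latex
\[
  m = O\!\left(\frac{1}{\epsilon}\left(d\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right)
  \ \text{(realizable)},
  \qquad
  m = O\!\left(\frac{1}{\epsilon^{2}}\left(d + \ln\frac{1}{\delta}\right)\right)
  \ \text{(non-realizable)}.
\]
```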
16
Active Learning
  • We get to see unlabeled data first, and there is
    a charge for every label.
  • The learner has the ability to choose specific
    examples to be labeled.
  • The learner works harder, in order to use fewer
    labeled examples.
  • How many labels can we save by querying
    adaptively?

17
Can adaptive querying help? [CAL 92, Dasgupta 04]
  • Consider threshold functions on the real line:

    h_w(x) = 1(x ≥ w),   C = {h_w : w ∈ R}
  • Sample with 1/ε unlabeled examples.
  • Binary search: we need just O(log 1/ε) labels.

Active setting: O(log 1/ε) labels to find an
ε-accurate threshold.
Supervised learning needs O(1/ε) labels.
Exponential improvement in sample complexity!
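A minimal sketch of the binary-search idea (the oracle and the pool are hypothetical stand-ins); it issues O(log n) label requests on a pool of n unlabeled points instead of labeling all of them.

```python
import numpy as np

def learn_threshold(xs, oracle):
    """Actively learn a threshold h_w(x) = 1(x >= w) on the real line.
    xs: pool of ~1/eps unlabeled points; oracle(x): returns the 0/1 label.
    Binary search locates the decision boundary with O(log |xs|) queries."""
    xs = np.sort(np.asarray(xs))
    if oracle(xs[0]) == 1:                    # whole pool is positive
        return xs[0]
    if oracle(xs[-1]) == 0:                   # whole pool is negative
        return xs[-1] + 1.0
    lo, hi = 0, len(xs) - 1                   # invariant: xs[lo] -> 0, xs[hi] -> 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return (xs[lo] + xs[hi]) / 2.0            # estimated threshold w

# Example: ~100 unlabeled points for eps ~ 0.01, but only ~7 label requests.
pool = np.random.default_rng(0).uniform(0, 1, size=100)
w_hat = learn_threshold(pool, oracle=lambda x: int(x >= 0.3))
```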
18
Active Learning might not help [Dasgupta 04]
In general, the number of queries needed depends on C
and also on D.
C = linear separators in R1: active learning
reduces sample complexity substantially.
C = linear separators in R2: there are some
target hypotheses for which no improvement can be
achieved, no matter how benign the input
distribution!
[Figure: example target hypotheses h0, h1, h2, h3 in R2]
In this case learning to accuracy ε requires 1/ε
labels.
19
Examples where Active Learning helps
In general, the number of queries needed depends on C
and also on D.
  • C = linear separators in R1: active learning
    reduces sample complexity substantially, no
    matter what the input distribution is.
  • C = homogeneous linear separators in Rd, D =
    uniform distribution over the unit sphere:
  • we need only d log 1/ε labels to find a hypothesis
    with error rate < ε.
  • [Dasgupta, Kalai, Monteleoni, COLT 2005]
  • [Freund et al., 97]
  • [Balcan-Broder-Zhang, COLT 07]

20
Region of uncertainty [CAL 92]
  • Current version space: the part of C consistent
    with the labels so far.
  • Region of uncertainty: the part of the data space
    about which there is still some uncertainty (i.e.
    disagreement within the version space).
  • Example: data lies on a circle in R2 and the
    hypotheses are homogeneous linear separators.

[Figure: the current version space, and the corresponding
region of uncertainty in the data space]
21
Region of uncertainty [CAL 92]
Algorithm: pick a few points at random from the current
region of uncertainty and query their labels. (A sketch
of this rule follows below.)
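A minimal sketch of the CAL-style rule for a finite version space; the classifier set, oracle, and streaming order are illustrative assumptions (the slide's version samples at random from the region of uncertainty, while this sketch simply streams through the pool).

```python
def cal_active_learning(X_pool, version_space, oracle, budget):
    """CAL-style querying: labels are requested only for points on which
    the surviving classifiers disagree (the region of uncertainty);
    labels of all other points are implied for free."""
    V = list(version_space)                   # finite set of candidate classifiers
    queries = 0
    for x in X_pool:
        preds = {h(x) for h in V}
        if len(preds) == 1:                   # outside the region of uncertainty
            continue
        if queries >= budget:
            break
        y = oracle(x)                         # query the expert
        queries += 1
        V = [h for h in V if h(x) == y]       # shrink the version space
    return V
```

With threshold functions on the line, this rule recovers the binary-search behavior from the earlier slide.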
23
Region of uncertainty [CAL 92]
[Figure: after querying, the version space shrinks to a new
version space, with a correspondingly smaller new region of
uncertainty in the data space]
24
Region of uncertainty [CAL 92]: Guarantees
Algorithm: pick a few points at random from the
current region of uncertainty and query their
labels.
[Balcan, Beygelzimer, Langford, ICML 06]
analyze a version of this algorithm which is robust to
noise.
  • C = linear separators on the line, low noise:
    exponential improvement.
  • C = homogeneous linear separators in Rd, D =
    uniform distribution over the unit sphere.
  • low noise: need only d² log 1/ε labels to find a
    hypothesis with error rate < ε.
  • realizable case: d^{3/2} log 1/ε labels.
  • supervised: d/ε labels.

25
Margin Based Active-Learning Algorithm
[Balcan-Broder-Zhang, COLT 07]
Use O(d) examples to find w_1 of error ≤ 1/8.
  • iterate k = 2, …, log(1/ε)
  • rejection sample m_k samples x from D
    satisfying |w_{k-1} · x| ≤ γ_k
  • label them
  • find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all
    these examples.
  • end iterate
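A schematic sketch of the iteration above; the choices of m_k, γ_k, the constant in the first round, and the helper find_consistent_in_ball are placeholders rather than the paper's actual settings.

```python
import numpy as np

def margin_based_active_learning(sample_from_D, oracle, d, eps,
                                 m_k, gamma_k, find_consistent_in_ball):
    """Schematic margin-based iteration (after BBZ'07).
    sample_from_D(): one unlabeled x in R^d;  oracle(x): its label;
    m_k(k), gamma_k(k): round-dependent sample size and band width;
    find_consistent_in_ball(S, w_prev, r): some w with ||w - w_prev|| <= r
    consistent with the labeled set S (e.g. via a convex program)."""
    # Round 1: plain supervised step with O(d) labels to get w_1 of small error.
    S = []
    for _ in range(8 * d):                      # "8" is an arbitrary placeholder constant
        x = sample_from_D()
        S.append((x, oracle(x)))
    w = find_consistent_in_ball(S, None, None)

    for k in range(2, int(np.ceil(np.log2(1.0 / eps))) + 1):
        S = []
        while len(S) < m_k(k):                  # rejection sampling: keep only
            x = sample_from_D()                 # points inside the margin band
            if abs(np.dot(w, x)) <= gamma_k(k):
                S.append((x, oracle(x)))        # label only these points
        w = find_consistent_in_ball(S, w, 2.0 ** (-k))   # w_k in B(w_{k-1}, 1/2^k)
    return w
```

The rejection-sampling step is where the label savings come from: labels are spent only inside the shrinking band around the current hypothesis.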

26
Margin Based Active-Learning [BBZ 07]
[Figure: the region of uncertainty in data space is the band
around the current separator w_k]
27
BBZ07, Proof Idea
iterate k = 2, …, log(1/ε): rejection sample m_k
samples x from D satisfying |w_{k-1} · x| ≤ γ_k,
ask for their labels, and find w_k ∈ B(w_{k-1}, 1/2^k)
consistent with all these examples; end iterate.
Assume w_k has error α. We are done if there exists a γ_k
such that w_{k+1} has error α/2 and we only need
O(d log(1/ε)) labels in round k.
Key Point
Under the uniform distribution assumption, for an
appropriate choice of the band width γ_k, the error of
w_{k+1} outside the band is at most α/4.
30
BBZ07, Proof Idea
Key Point
So, it's enough to ensure that the error of w_{k+1}
within the band is at most α/4. We can do so by only
using O(d log(1/ε)) labels in round k.
31
Our Algorithm: Extensions
  • A robust version: add a testing step.
  • Deals with certain types of noise and
    a more general class of distributions.
32
General Theories of Active Learning
33
General Concept Spaces
  • In the general learning problem, there is a
    concept space C, and we want to find an ε-optimal
    classifier h ∈ C with high probability 1-δ.

34
How Many Labels Do We Need?
  • In passive learning, we know of an algorithm
    (empirical risk minimization) whose label complexity
    matches the uniform-convergence bounds recalled
    earlier: roughly d/ε labels for realizable learning,
    and d/ε² if there is noise.
  • We also know this is close to the best we can
    expect from any passive algorithm.

35
How Many Labels Do We Need?
  • As before, we want to explore the analogous idea
    for Active Learning (but now for a general concept
    space C).
  • How many label requests are necessary and
    sufficient for Active Learning?
  • What are the relevant complexity measures? (i.e.,
    the Active Learning analogue of VC dimension)

36
What ARE the Interesting Quantities?
  • Generally speaking, we want examples whose labels
    are highly controversial among the set of
    remaining concepts.
  • The likelihood of drawing such an informative
    example is an important quantity to consider.
  • But there are many ways to define "informative"
    in general.

37
What Do You Mean By Informative?
  • Want examples that reduce the version space.
  • But how do we measure progress?
  • A problem-specific measure P on C?
  • The Diameter?
  • Measure of the region of disagreement?
  • Cover size? (see e.g., Hanneke, COLT 2007)

All of these seem to have interesting theories
associated with them. As an example, let's take
a look at Diameter in detail.
38
Diameter (Dasgupta, NIPS 2005)
Imagine each pair of concepts separated by
distance > ε has an edge between them.
We have to rule out at least one of the two
concepts for each edge.
Each unlabeled example X partitions the concepts
into two sets, and guarantees that some fraction of the
edges will have at least one concept contradicted, no
matter which label X turns out to have.
  • Define the distance d(g,h) = Pr(g(X) ≠ h(X)).
  • One way to guarantee our classifier is within ε
    of the target classifier is to (safely) reduce
    the diameter to size ε.
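A small illustration of this distance and of the diameter of a finite version space, estimated on an unlabeled sample; the classifiers and the sample are placeholders.

```python
import itertools
import numpy as np

def disagreement(g, h, X):
    """Empirical estimate of d(g,h) = Pr(g(X) != h(X)) on an unlabeled sample X."""
    return float(np.mean([g(x) != h(x) for x in X]))

def diameter(V, X):
    """Largest pairwise disagreement within the version space V.
    In the realizable case the target stays in V, so once the diameter
    is <= eps, every surviving classifier is within eps of the target."""
    return max((disagreement(g, h, X) for g, h in itertools.combinations(V, 2)),
               default=0.0)
```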

39
Diameter
  • Theorem (Dasgupta, NIPS 2005)
  • If, for any finite subset V ⊆ C,
    Pr_X(X eliminates a ρ fraction of
    the edges) ≥ τ, then (assuming no
    noise) we can reduce the diameter to ε using a
    number of label requests at most
    [bound not preserved in this transcript].
  • Furthermore, there is an algorithm that does
    this, which with high probability requires a
    number of unlabeled examples at most
    [bound not preserved in this transcript].

40
Open Problems in Active Learning
  • Efficient (correct) learning algorithms for
    linear separators provably achieving significant
    improvements on many distributions.
  • What about binary feature spaces?
  • Tight general-purpose sample complexity bounds,
    for both the realizable and agnostic cases.
  • An optimal active learning algorithm?