Models of active learning for classification
1
Models of active learning for classification
  • Sanjoy Dasgupta
  • UC San Diego

2
Supervised learning
  • Given access to labeled data (drawn iid from an
    unknown underlying distribution P), want to learn
    a good classifier chosen from hypothesis class H.

3
Active learning
  • In many situations, such as speech recognition
    and document retrieval, unlabeled data is easy
    to come by, but there is a charge for each label.

Goal: pick a good classifier at low cost.
4
Membership queries
Earliest model of active learning in theory work [Angluin 1992]:
  X = space of possible inputs, like {0,1}^n
  H = class of hypotheses
  Target concept h ∈ H, to be identified exactly.
  You can ask for the label of any point in X; there is no unlabeled data.

  H_0 = H
  For t = 1, 2, ...: pick a point x ∈ X and query its label h(x);
    let H_t = all hypotheses in H_{t-1} consistent with (x, h(x)).

What is the minimum number of membership queries needed to reduce H to just {h}?
5
Membership queries: example
X = {0,1}^n, H = ANDs of positive literals, like x_1 ∧ x_3 ∧ x_10.

S = ∅  (set of AND positions)
For i = 1 to n:
  ask for the label of (1, ..., 1, 0, 1, ..., 1)   [0 at position i]
  if negative: S = S ∪ {i}
Total: n queries.

General idea: synthesize highly informative points. Each query cuts the version space in half.
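The procedure above can be sketched in a few lines of Python (the `label` oracle and all names here are illustrative, not from the talk):

```python
def learn_conjunction(n, label):
    """Exactly identify an AND-of-positive-literals target over {0,1}^n
    using n membership queries. `label` is a hypothetical oracle that
    returns the target's label (True/False) on any point in {0,1}^n."""
    S = set()  # positions whose literals appear in the target
    for i in range(n):
        # All-ones vector with a single 0 at position i.
        x = [1] * n
        x[i] = 0
        if not label(x):  # flipping bit i to 0 made the point negative,
            S.add(i)      # so literal x_i must be in the target
    return S

# Usage: target is x_1 AND x_3 (0-indexed: positions 0 and 2).
target = {0, 2}
oracle = lambda x: all(x[i] == 1 for i in target)
print(learn_conjunction(4, oracle))  # → {0, 2}
```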
6
Problem
Many results exist in this framework, even for complicated hypothesis classes. But:

Baum and Lang, 1991: tried fitting a neural net to handwritten characters. The synthetic instances created were incomprehensible to humans!

Lewis and Gale, 1992: tried training text classifiers. "An artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
7
A better, PAC-like model
Cohn, Atlas, and Ladner, 1992: there is an underlying distribution P on the (x, y) data. The learner has two abilities:
  -- draw an unlabeled sample from the distribution
  -- ask for the label of one of these samples
The error of any classifier h is measured on distribution P:
  err(h) = P(h(x) ≠ y)
Special case, to simplify matters: assume the data is separable, i.e. some concept h ∈ H labels all points perfectly.
8
(1) Uncertainty sampling
Maintain a single hypothesis, based on the labels seen so far. Query the point about which this hypothesis is most uncertain.

Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.
[Figure: labeled positive and negative examples with a single candidate separator.]
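As a concrete illustration, uncertainty sampling for a linear probabilistic classifier might look like the following sketch (the logistic model and all names are assumptions, not part of the talk):

```python
import numpy as np

def most_uncertain(w, X_pool):
    """Uncertainty sampling for a linear probabilistic classifier:
    return the index of the unlabeled point about which the current
    hypothesis w is least confident, i.e. P(y=1|x) closest to 1/2."""
    p = 1.0 / (1.0 + np.exp(-X_pool @ w))   # sigmoid confidence per point
    return int(np.argmin(np.abs(p - 0.5)))  # closest to the decision boundary

# Usage: with w = (1, 0), the point nearest the boundary x_1 = 0 wins.
w = np.array([1.0, 0.0])
X_pool = np.array([[3.0, 0.0], [0.1, 2.0], [-2.0, 1.0]])
print(most_uncertain(w, X_pool))  # → 1
```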
9
(2) Region of uncertainty
Current version space: the portion of H consistent with the labels so far.
Region of uncertainty: the part of data space about which there is still some uncertainty (i.e. disagreement within the version space).

Example: suppose the data lies on a circle in R^2 and the hypotheses are linear separators (spaces X, H superimposed).

[Figure: current version space; region of uncertainty in data space.]
10
(2) Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

[Figure: data and hypothesis spaces, superimposed (both are the surface of the unit sphere in R^d); current version space; region of uncertainty in data space.]
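A minimal sketch of the CAL selection rule over a finite hypothesis class (the explicit version-space list is an illustrative simplification; CAL itself never needs to enumerate H):

```python
def cal(H, stream, oracle):
    """CAL (Cohn-Atlas-Ladner) sketch over a finite hypothesis class H.
    Each h in H is a function x -> label. A point is queried only if the
    current version space disagrees on it; oracle(x) is the true label."""
    V = list(H)       # current version space
    labels_used = 0
    for x in stream:
        preds = {h(x) for h in V}
        if len(preds) > 1:            # x lies in the region of uncertainty
            y = oracle(x)             # spend a label
            labels_used += 1
            V = [h for h in V if h(x) == y]
    return V, labels_used

# Usage: threshold classifiers on the line; the target threshold is 0.5.
H = [lambda x, t=t: x >= t for t in (0.2, 0.4, 0.5, 0.7)]
V, used = cal(H, [0.1, 0.3, 0.45, 0.6, 0.9], lambda x: x >= 0.5)
```

Points on which every remaining hypothesis already agrees (here 0.1 and 0.9) cost nothing, which is exactly where the label savings come from.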
11
(2) Region of uncertainty
The number of labels needed depends on H and also on P.

Special case: H = linear separators in R^d, P = uniform distribution over the unit sphere. Then just ~ d log 1/ε labels are needed to reach a hypothesis with error rate < ε.
  (1) Supervised learning: ~ d/ε labels.
  (2) This is the best we can hope for.
12
(2) Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query. For more general distributions this is suboptimal.

Need to measure the quality of a query, or alternatively the size of the version space.
13
Query-by-committee
[Seung, Opper, Sompolinsky, 1992] [Freund, Seung, Shamir, Tishby, 1997]

First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account.

Which pair of hypotheses in H is closest? It depends on the data distribution P. Distance measure on H:
  d(h, h') = P(h(x) ≠ h'(x))
14
Query-by-committee
First idea: try to rapidly reduce the volume of the version space? Problem: this doesn't take the data distribution into account.

To keep things simple, say d(h, h') ≈ Euclidean distance in this picture.

[Figure: version space H.] The error is likely to remain large!
15
(3) Query-by-committee
An elegant scheme which decreases the volume in a manner which is sensitive to the data distribution.

Bayesian setting: given a prior π on H. H_1 = H.
For t = 1, 2, ...:
  receive an unlabeled point x_t drawn from P
  [informally: is there a lot of disagreement about x_t in H_t?]
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label; set H_{t+1}
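A toy sketch of this loop for a finite hypothesis class with a uniform prior (the explicit version-space list and all names are illustrative simplifications):

```python
import random

def qbc(H, stream, oracle, seed=0):
    """Query-by-committee sketch with a uniform prior over a finite
    hypothesis class. For each unlabeled point, draw two hypotheses at
    random from the current version space and query the label only when
    they disagree, so a query on x happens with probability proportional
    to the disagreement mass on x."""
    rng = random.Random(seed)
    V = list(H)
    for x in stream:
        h1, h2 = rng.choice(V), rng.choice(V)
        if h1(x) != h2(x):                    # committee disagreement
            y = oracle(x)                     # spend a label
            V = [h for h in V if h(x) == y]   # shrink the version space
    return V

# Usage: thresholds on the line, target threshold 0.5.
H = [lambda x, t=t: x >= t for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
V = qbc(H, [i / 20 for i in range(20)], lambda x: x >= 0.5)
```

Note the consistent target hypothesis is never eliminated, since it always agrees with the oracle on queried points.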
16
Query-by-committee
For t = 1, 2, ...:
  receive an unlabeled point x_t drawn from P
  choose two hypotheses h, h' randomly from (π, H_t)
  if h(x_t) ≠ h'(x_t): ask for x_t's label; set H_{t+1}

Observation: the probability of getting the pair (h, h') in the inner loop (when a query is made) is proportional to π(h) π(h') d(h, h').

[Figure: two candidate splits of the version space H_t, compared.]
17
Query-by-committee
Label bound: for H = linear separators in R^d and P = uniform distribution, just ~ d log 1/ε labels are needed to reach a hypothesis with error < ε.

Implementation: need to randomly pick h according to (π, H_t). E.g. H = linear separators in R^d, π = uniform distribution: how do you pick a random point from a convex body?
18
Sampling from convex bodies
  • By random walk!
  • Ball walk
  • Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] studies random walks and also ways to kernelize QBC.
19
Online active learning
  • Online algorithms:
  • see unlabeled data streaming by, one point at a
    time
  • can query the current point's label, at a cost
  • can only maintain the current hypothesis (memory
    bound)
  • [Dasgupta, Kalai, Monteleoni 2005]: an active
    version of the perceptron algorithm.
  • Guarantee: for linear separators under the
    uniform distribution, the label complexity is
    ~ d log 1/ε.

20
Active perceptron?
No matter what selective sampling rule is used, the standard perceptron update needs ~ 1/ε² labels to reach error ε.

Let θ_t be the angle between v_t and u (v_t = current hypothesis, u = target). Then:
  (1) This angle increases unless ||v_t|| ≥ 1/sin θ_t.
  (2) ||v_t|| ≤ t^{1/2}.
Therefore we need sin θ_t ≥ 1/t^{1/2}, and (for the uniform distribution) the error rate is approximately sin θ_t.

[Figure: v_t, u, x_t and the update to v_{t+1}. Graphic taken from C. Monteleoni.]
21
Conservative update
Standard perceptron update: v_{t+1} = v_t + y_t x_t.
Instead, weight the update by the confidence w.r.t. the current hypothesis v_t:
  v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t     (v_1 = y_0 x_0)
(a smaller update for points close to the boundary)

Unlike the perceptron:
  (1) The length remains constant: ||v_t|| = 1.
  (2) The angle θ_t decreases monotonically.
22
A more conservative update
Standard perceptron update: v_{t+1} = v_t + y_t x_t.
Modified update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t.

[Figure: the standard and modified updates compared, relative to u. Graphic taken from C. Monteleoni.]
23
(4) Active perceptron
Input: a stream of data points x_0, x_1, x_2, ...
Set the initial hypothesis v_1 = y_0 x_0.
For t = 1, 2, 3, ...:
  receive unlabeled point x_t
  filtering step: decide whether to ask for x_t's label y_t
  if the label is requested:
    if (x_t, y_t) is misclassified:
      v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t
      adjust the filtering rule
    else: v_{t+1} = v_t

What filtering rule should be used?
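The loop above can be sketched as follows, with a fixed margin threshold `s` standing in for the adaptive filtering rule (all names are illustrative; points are assumed unit-norm, as on the unit sphere):

```python
import numpy as np

def active_perceptron(stream, oracle, s=0.1):
    """DKM-style active perceptron sketch. Points in `stream` are assumed
    to be unit-norm; oracle(x) returns the true +/-1 label; `s` is a
    fixed margin threshold standing in for the adaptive rule s_t."""
    it = iter(stream)
    x0 = next(it)
    v = oracle(x0) * x0               # v_1 = y_0 x_0, unit length
    for x in it:
        margin = v @ x
        if abs(margin) > s:           # filtering: outside labeling region L
            continue                  # do not spend a label
        y = oracle(x)                 # query
        if y * margin <= 0:           # mistake: reflect v across x's plane
            v = v - 2.0 * margin * x  # keeps ||v|| = 1 since ||x|| = 1
    return v
```

The mistake update is written in the equivalent reflection form v - 2(v·x)x, which makes the length-preserving property obvious.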
24
Selective sampling rule
Ideally we would select exactly the points which lie in the error region. But we don't know what this region is!
  • So choose points within a certain margin of v_t
    (the labeling region):
  • L = {x : |x · v_t| ≤ s_t}
  • (threshold s_t).
  • Tradeoff in choosing L:
  • - if too large: wait forever for a misclassified
    point
  • - if too small: each update is minuscule
  • Solution: set s_t adaptively.

[Figure: hypothesis v_t, target u, labeling region L, and the error region. Graphic taken from C. Monteleoni.]
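One simple way to set s_t adaptively, in the spirit of the talk's "adjust filtering rule" step (the trigger and constants here are assumptions, not the exact schedule of the paper):

```python
def adaptive_threshold(s, consecutive_correct, patience=8):
    """Hedged sketch of an adaptive margin rule: if `patience` queried
    points in a row were already classified correctly, the labeling
    region L is probably too wide, so halve the threshold s and reset
    the counter. Returns the new (s, consecutive_correct) pair."""
    if consecutive_correct >= patience:
        return s / 2.0, 0
    return s, consecutive_correct
```

The caller would increment `consecutive_correct` after each queried-but-correct point and reset it to zero after each mistake.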
25
Some challenges
(1) For linear separators, analyze the label complexity for some distribution other than uniform!
(2) How to handle nonseparable data? We need a robust base learner.

[Figure: a true boundary with points labeled on the wrong side.]
26
Thanks
For many helpful discussions: Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni.