1
Co-Training and Expansion: Towards Bridging Theory and Practice
  • Maria-Florina Balcan, Avrim Blum, Ke Yang
  • Carnegie Mellon University, Computer Science Department

2
Combining Labeled and Unlabeled Data (a.k.a.
Semi-supervised Learning)
  • Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  • Web page and document classification
  • OCR, image classification
  • Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  • Transductive SVM
  • Co-training
  • Graph-based methods

3
Co-training: a method for combining labeled and unlabeled data
  • Works in scenarios where each example has two distinct, yet individually sufficient, feature sets (views).
  • An example is a pair x = (x1, x2), one part per view.
  • The belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) on every example that occurs.
  • Each view is sufficient for correct classification.
  • Works by using unlabeled data to propagate learned information.

4
Co-training: a method for combining labeled and unlabeled data
  • For example, if we want to classify web pages: one view is the words on the page itself, the other is the words in hyperlinks pointing to the page (a small sketch follows).
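To make the two-view picture concrete, here is a minimal Python sketch of a web-page example split into a page-text view and a link-text view; the class name WebPageExample and the toy rules c1, c2 are illustrative assumptions, not the paper's construction.

```python
# A minimal illustration of the two-view setting (the feature choices and
# classifier rules below are made up for illustration, not from the paper).
from dataclasses import dataclass

@dataclass
class WebPageExample:
    page_words: set   # view x1: words appearing on the page itself
    link_words: set   # view x2: words in hyperlinks pointing to the page

def c1(page_words):
    """Hypothetical view-1 concept: positive if the page mentions 'professor'."""
    return 1 if "professor" in page_words else 0

def c2(link_words):
    """Hypothetical view-2 concept: positive if an inbound link says 'advisor'."""
    return 1 if "advisor" in link_words else 0

# Consistency: on every example that actually occurs, the two views agree,
# i.e. c1(x1) == c2(x2) == the true label c(x).
x = WebPageExample(page_words={"professor", "research", "teaching"},
                   link_words={"my", "advisor"})
assert c1(x.page_words) == c2(x.link_words) == 1
```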

5
Iterative Co-Training
  • Have learning algorithms A1, A2 on each of the
    two views.
  • Use labeled data to learn two initial hypotheses
    h1, h2.
  • Look through unlabeled data to find examples where one of the hi is confident but the other is not.
  • Have the confident hi label the example for the other view's algorithm A3-i.

Repeat (a code sketch of this loop follows).
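The loop above can be written out as the following sketch; the function name cotrain, the scikit-learn-style fit / predict_proba interface, the 0.95 confidence threshold, and the retrain-from-scratch strategy are illustrative assumptions, not details fixed by the slides.

```python
# A sketch of the iterative co-training loop described above.
# A1, A2 are assumed to be classifiers exposing fit / predict_proba
# (scikit-learn style); the confidence threshold and retraining strategy
# are illustrative choices.

def cotrain(A, labeled, unlabeled, rounds=10, threshold=0.95):
    """A: pair of learners (A1, A2); labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    pools = [list(labeled), list(labeled)]    # per-view training pools
    h = [None, None]
    for _ in range(rounds):
        # Use the (growing) labeled pools to learn hypotheses h1, h2 on each view.
        for i in (0, 1):
            X = [x[i] for (x, _) in pools[i]]
            y = [label for (_, label) in pools[i]]
            h[i] = A[i].fit(X, y)
        # Look through unlabeled data for examples where one hi is confident;
        # the confident hi labels the example for the other view's learner.
        remaining = []
        for x in unlabeled:
            handed_over = False
            for i in (0, 1):
                probs = h[i].predict_proba([x[i]])[0]
                if probs.max() >= threshold:
                    pools[1 - i].append((x, h[i].classes_[probs.argmax()]))
                    handed_over = True
                    break
            if not handed_over:
                remaining.append(x)
        unlabeled = remaining                 # repeat with the rest
    return h
```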
6
Iterative Co-Training: A Simple Example (Learning Intervals)
(Figure: the initial interval hypotheses, one per view)
Use labeled data to learn the initial hypotheses h1 and h2.
Use unlabeled data to bootstrap.
7
Theoretical/Conceptual Question
  • What properties do we need for co-training to
    work well?
  • Need assumptions about
  • the underlying data distribution
  • the learning algorithms on the two sides

8
Theoretical/Conceptual Question
  • What property of the data do we need for
    co-training to work well?
  • Previous work
  • Independence given the label
  • Weak rule dependence
  • Our work: a much weaker assumption about how the data should behave
  • the expansion property of the underlying distribution
  • though we will need a stronger assumption on the learning algorithms than under independence given the label.

9
Co-Training, Formal Setting
  • Assume that examples are drawn from a distribution D over the instance space X = X1 × X2.
  • Let c be the target function; assume that each view is sufficient for correct classification:
  • c can be decomposed into c1, c2 over the two views s.t. D has no probability mass on examples x with c1(x1) ≠ c2(x2).
  • Let X+ and X− denote the positive and negative regions of X.
  • Let D+ and D− be the marginal distributions of D over X+ and X− respectively.
  • Let S1 ⊆ X1+ and S2 ⊆ X2+ denote confident sets in the two views; think of Si as the region that the hypothesis in view i confidently labels positive.

(Figure: the positive region, with marginal distribution D+, and the negative region, with marginal distribution D−)
10
Expansion (Formalization)
  • We assume that D is expanding.
  • Expansion: for any confident sets S1 ⊆ X1+ and S2 ⊆ X2+, the probability mass where exactly one view is confident is not too small (the precise condition is written out below).
  • This is a natural analog of the graph-theoretic notions of conductance and expansion.
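A minimal statement of the expansion condition, following the definition in the accompanying paper (the expansion parameter is written here as ε, and probabilities are taken over positive examples, i.e. under D+):

```latex
% Expansion condition (probabilities under D+; S_1, S_2 are confident sets in the two views).
% \oplus denotes symmetric difference ("exactly one view is confident").
\[
\Pr\!\left(S_1 \oplus S_2\right)
\;\ge\;
\epsilon \cdot \min\!\Big[
  \Pr\!\left(S_1 \wedge S_2\right),\;
  \Pr\!\left(\overline{S_1} \wedge \overline{S_2}\right)
\Big]
\qquad \text{for all } S_1 \subseteq X_1^{+},\ S_2 \subseteq X_2^{+}.
\]
```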

11
Property of the underlying distribution: Expansion
  • Necessary condition for co-training to work well:
  • if S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.
  • We show it is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
  • the data is perfectly separable
  • we have strong learning algorithms on the two sides

12
Expansion, Examples: Learning Intervals
(Figures: a non-expanding distribution and an expanding distribution)
13
Expansion
  • Weaker than both independence given the label and weak rule dependence.

E.g., w.h.p. a random degree-3 bipartite graph is expanding, but does NOT satisfy independence given the label or weak rule dependence.
(Figure: a bipartite graph over the positive region D+ and the negative region D−)
14
Main Result
  • Assume D is ε-expanding.
  • Assume that on each of the two views we have
    algorithms A1 and A2 for learning from positive
    data only.
  • Assume that we have initial confident sets S1^0 and S2^0 with non-negligible probability mass (see the paper for the precise condition).
  • Then the confident sets grow to cover almost all of D+ within a relatively small number of iterations.

15
Main Result, Interpretation
  • The assumption on A1, A2 implies that they never generalize incorrectly.
  • The question is what needs to be true for them to actually generalize to the whole of D+?

16
Main Result, Proof Idea
  • Expansion implies that at each iteration, there
    is reasonable probability mass on "new, useful"
    data.
  • Algorithms generalize to most of this new region.
  • See the paper for the full proof; an informal growth sketch follows.
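One informal way to see why a small number of iterations suffices (a simplified sketch, not the paper's argument verbatim: it assumes the learners absorb essentially all of the newly confident mass each round, and writes p_t for the probability mass, under D+, covered by the confident sets after t rounds):

```latex
% Simplified growth recurrence suggested by epsilon-expansion, assuming the
% learners pick up essentially all of the newly exposed region each round.
\[
p_{t+1} \;\ge\; p_t + \epsilon \cdot \min\!\big(p_t,\, 1 - p_t\big)
\]
% While p_t <= 1/2, the covered mass grows by a factor of at least (1 + epsilon);
% once p_t > 1/2, the uncovered mass 1 - p_t shrinks by a factor of at most
% (1 - epsilon), so the number of rounds needed is only about (1/epsilon) times
% a logarithm in the initial mass and in the target coverage.
```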

17
What if assumptions are violated?
  • What if our algorithms can make incorrect
    generalizations and/or there is no perfect
    separability?

18
What if assumptions are violated?
  • Expect "leakage" into the negative region.
  • If the negative region is expanding too, then incorrect generalizations will grow at an exponential rate.
  • Correct generalizations grow at an exponential rate too, but they will slow down first.
  • Expect overall accuracy to go up and then come down.

19
Synthetic Experiments
  • Create a 2n-by-2n bipartite graph:
  • nodes 1 to n on each side represent positive clusters
  • nodes n+1 to 2n on each side represent negative clusters
  • Connect each node on the left to 3 nodes on the right:
  • each neighbor is chosen with prob. 1 − p to be a random node of the same class, and with prob. p to be a random node of the opposite class
  • Begin with an initial confident set and then propagate confidence through rounds of co-training:
  • monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy (a code sketch of this setup follows)
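A rough Python sketch of this construction and of the confidence-propagation loop. The function names make_graph and cotrain_rounds, the symbol p for the cross-class probability, the seed choice, and the balanced-classes accuracy formula are all illustrative assumptions.

```python
# Sketch of the synthetic experiment: a 2n-by-2n bipartite graph in which
# nodes 1..n on each side are positive clusters and nodes n+1..2n are negative.
import random

def make_graph(n=5000, d=3, p=0.01, seed=0):
    """Give each left node d right-neighbors: same class w.p. 1-p, opposite class w.p. p."""
    rng = random.Random(seed)
    edges = {}
    for u in range(2 * n):                    # left nodes 0..n-1 positive, n..2n-1 negative
        u_positive = u < n
        nbrs = []
        for _ in range(d):
            same_class = rng.random() > p
            pick_positive = (u_positive and same_class) or (not u_positive and not same_class)
            nbrs.append(rng.randrange(0, n) if pick_positive else rng.randrange(n, 2 * n))
        edges[u] = nbrs
    return edges

def cotrain_rounds(edges, n, seeds, rounds=20):
    """Propagate confidence: a node becomes confident once one of its neighbors is."""
    left_conf, right_conf = set(seeds), set()
    for t in range(rounds):
        right_conf |= {v for u in left_conf for v in edges[u]}
        left_conf |= {u for u, nbrs in edges.items() if right_conf.intersection(nbrs)}
        pos_covered = sum(1 for u in left_conf if u < n) / n
        neg_covered = sum(1 for u in left_conf if u >= n) / n
        accuracy = (pos_covered + (1 - neg_covered)) / 2   # assumes balanced classes
        print(f"round {t}: positives covered {pos_covered:.3f}, "
              f"negatives mistakenly covered {neg_covered:.3f}, accuracy {accuracy:.3f}")

edges = make_graph(n=500, d=3, p=0.01)        # smaller n than the slides, for a quick run
cotrain_rounds(edges, n=500, seeds=[0])       # start from one confident positive node
```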

20
Synthetic Experiments
(Figures: results for p = 0.01, n = 5000, d = 3 and for p = 0.001, n = 5000, d = 3)
  • the solid line indicates overall accuracy
  • the green curve is accuracy on positives
  • the red curve is accuracy on negatives

21
Conclusions
  • We propose expansion, a much weaker assumption on the underlying data distribution.
  • It seems to be the right condition on the
    distribution for co-training to work well.
  • It directly motivates the iterative nature of
    many of the practical co-training based
    algorithms.