Semi-Supervised learning - PowerPoint PPT Presentation

1
Semi-Supervised learning
2
Need for an intermediate approach
  • Unsupervised and Supervised learning
  • Two extreme learning paradigms
  • Unsupervised learning
  • collection of documents without any labels
  • easy to collect
  • Supervised learning
  • each object tagged with a class.
  • laborious job
  • Semi-supervised learning
  • Real-life applications are somewhere in between.

3
Motivation
  • Document collection D
  • A subset DK of D (with |DK| << |D|)
    has known labels
  • Goal to label the rest of the collection,
    DU = D - DK
  • Approach
  • Train a supervised learner using DK, the
    labeled subset.
  • Apply the trained learner on the remaining
    documents.
  • Idea
  • Harness information in DU to enable
    better learning.

4
The Challenge
  • Unsupervised portion of the corpus, DU,
    adds to
  • Vocabulary
  • Knowledge about the joint distribution of terms
  • Unsupervised measures of inter-document
    similarity.
  • E.g. site name, directory path, hyperlinks
  • Put together multiple sources of evidence of
    similarity and class membership into a
    label-learning system.
  • combine different features with partial
    supervision

5
Hard Classification
  • Train a supervised learner on the available labeled
    data DK
  • Label all documents in DU
  • Retrain the classifier using the new labels for
    documents where the classifier was most confident
  • Continue until labels do not change any more
    (see the sketch below).
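A minimal Python sketch of this hard self-training loop, assuming dense term-count matrices and a scikit-learn-style multinomial naive Bayes classifier; the confidence threshold and the function name are illustrative, not from the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_lab, y_lab, X_unlab, confidence=0.95, max_rounds=20):
    """Hard self-training: label DU, retrain on confident labels, repeat."""
    clf = MultinomialNB()
    X_train, y_train = X_lab, y_lab
    prev = None
    for _ in range(max_rounds):
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_unlab)
        labels = proba.argmax(axis=1)
        # Stop once the labels assigned to DU no longer change.
        if prev is not None and np.array_equal(labels, prev):
            break
        prev = labels
        confident = proba.max(axis=1) >= confidence
        # Retrain on DK plus the documents the classifier is most confident about.
        X_train = np.vstack([X_lab, X_unlab[confident]])
        y_train = np.concatenate([y_lab, labels[confident]])
    return clf
```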

6
Expectation maximization
  • Softer variant of previous algorithm
  • Steps
  • Set up some fixed number of clusters with some
    arbitrary initial distributions,
  • Alternate between the following steps
  • E-step re-estimate Pr(c|d) for each cluster c and
    each document d, based on the current parameters of
    the distribution that characterizes c.
  • M-step re-estimate the parameters of the
    distribution for each cluster.

7
Experiment EM
  • Set up one cluster for each class label
  • Estimate a class-conditional distribution which
    includes information from D
  • Simultaneously estimate the cluster memberships
    of the unlabeled documents.

8
Experiment EM (contd..)
  • Example
  • EM procedure with a multinomial naive Bayes text
    classifier
  • Laplace's law for parameter smoothing
  • For EM, unlabeled documents belong to clusters
    probabilistically
  • Term counts weighted by the probabilities
  • Likewise, modify class priors
    (see the sketch below)
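A sketch of this EM loop in Python, assuming dense term-count matrices; variable and function names are illustrative. Labeled documents keep their known class with probability 1, unlabeled documents contribute fractional, probability-weighted term counts and class-prior counts, and Laplace smoothing is applied in the M-step.

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iters=30, laplace=1.0):
    n_terms = X_lab.shape[1]
    # Responsibilities Pr(c|d): labeled documents are fixed to their known
    # class, unlabeled documents start out uniform.
    R_lab = np.eye(n_classes)[y_lab]
    R_unlab = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iters):
        R_all = np.vstack([R_lab, R_unlab])
        # M-step: probability-weighted counts with Laplace smoothing.
        class_counts = R_all.sum(axis=0)                 # soft document counts per class
        prior = (class_counts + laplace) / (class_counts.sum() + laplace * n_classes)
        term_counts = R_all.T @ X_all                    # (C, n_terms) weighted term counts
        theta = (term_counts + laplace) / (
            term_counts.sum(axis=1, keepdims=True) + laplace * n_terms)
        # E-step: recompute Pr(c|d) for the unlabeled documents only.
        log_post = np.log(prior) + X_unlab @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        R_unlab = np.exp(log_post)
        R_unlab /= R_unlab.sum(axis=1, keepdims=True)
    return prior, theta, R_unlab
```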

9
EM Issues
  • For documents in DK, we know the class label cd
  • Question how to use this information?
  • Will be dealt with later
  • Using Laplace estimate instead of ML estimate
  • Not strictly EM
  • Convergence takes place in practice

10
EM Experiments
  • Take a completely labeled corpus D, and randomly
    select a subset as DK.
  • Also use the remaining unlabeled documents,
    DU = D - DK, in the EM procedure.
  • Correct classification of a document
  • concealed class label = class with largest
    probability
  • Accuracy with unlabeled documents > accuracy
    without unlabeled documents
  • Keeping labeled set of same size
  • EM beats naïve Bayes with same size of labeled
    document set
  • Largest boost for small size of labeled set
  • Comparable or poorer performance of EM for large
    labeled sets

11
Belief in labeled documents
  • Depending on one's faith in the initial labeling
  • Set Pr(c|d) for labeled documents before the 1st
    iteration
  • With each iteration
  • Let the class probabilities of the labeled
    documents 'smear'

12
EM Reducing belief in unlabeled documents
  • Problems due to
  • Noise in the term distribution of documents in DU
  • Mistakes in the E-step
  • Solution
  • Attenuate the contribution from documents in DU
  • Add a damping factor in the E-step for the
    contribution from DU (see the sketch below)
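A small sketch of the damped contribution, assuming the EM setup from the earlier sketch; the name alpha for the damping factor is illustrative (the slides only call it a damping factor).

```python
import numpy as np

def damped_responsibilities(R_lab, R_unlab, alpha=0.5):
    # R_lab: one-hot Pr(c|d) rows for documents in DK (kept at full weight).
    # R_unlab: soft Pr(c|d) rows for documents in DU, scaled down by alpha
    # so that noisy unlabeled documents influence the M-step less.
    return np.vstack([R_lab, alpha * R_unlab])
```

This stacked matrix can stand in for R_all in the M-step of the earlier EM sketch.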

13
Increasing DU while holding DK fixed also shows
the advantage of using large unlabeled sets in
the EM-like algorithm.
14
EM Reducing belief in unlabeled documents
(contd..)
  • No theoretical justification
  • Accuracy is indeed influenced by the choice of the
    damping factor
  • What value of the damping factor to choose?
  • An intuitive recipe (to be tried)

15
EM Modeling labels using many mixture components
  • Need not be a one-to-one correspondence between
    EM clusters and class labels.
  • Mixture modeling of the term distributions of some
    classes
  • Especially the negative class
  • E.g. for the two-class case football vs. not
    football
  • Documents not about football are actually about
    a variety of other things (see the sketch below)
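A sketch of the many-to-one mapping from EM clusters to class labels, e.g. several clusters assigned to the "not football" label; the mapping and the names are illustrative.

```python
import numpy as np

def predict_labels(cluster_proba, cluster_to_label, n_labels):
    # cluster_proba: (n_docs, n_clusters) posteriors Pr(cluster|d) from EM.
    # cluster_to_label: e.g. {0: FOOTBALL, 1: NOT_FOOTBALL, 2: NOT_FOOTBALL}.
    label_proba = np.zeros((cluster_proba.shape[0], n_labels))
    for cluster, label in cluster_to_label.items():
        # A label's score is the total posterior mass of its clusters.
        label_proba[:, label] += cluster_proba[:, cluster]
    return label_proba.argmax(axis=1)
```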

16
EM Modeling labels using many mixture components
  • Experiments comparison with Naïve Bayes
  • Lower accuracy with one mixture component per
    label
  • Higher accuracy with more mixture components per
    label
  • Overfitting and degradation with too large a
    number of clusters

17
Allowing more clusters in the EM-like algorithm
than there are class labels often helps to capture
term distributions for composite or complex
topics, and boosts the accuracy of the
semi-supervised learner beyond that of a naive
Bayes classifier.
18
Labeling hypertext graphs
  • More complex features than exploited by EM
  • Test document is cited directly by a training
    document, or vice versa
  • Short path between the test document and one or
    more training documents.
  • Test document is cited by a named category in a
    Web directory
  • Target category system could be somewhat
    different
  • Some category of a Web directory co-cites one or
    more training document along with the test
    document.

19
Labeling hypertext graphs Scenario
  • Snapshot of the Web graph G(V, E)
  • Set of topics (class labels)
  • Small subset of nodes VK labeled
  • Use the supervision to label some or all nodes in
    V - VK

20
Hypertext models for classification
  • c = class, t = text, N = neighbors
  • Text-only model Pr(t|c)
  • Using neighbors' text to judge my topic
    Pr(t, t(N) | c)
  • Better model Pr(t, c(N) | c)
  • Non-linear relaxation

21
Absorbing features from neighboring pages
  • Page u may have little text on it to train or
    apply a text classifier
  • u cites some second level pages
  • Often second-level pages have usable quantities
    of text
  • Question How to use these features ?

22
Absorbing features indiscriminate absorption of
neighborhood text
  • Does not help. At times deteriorates accuracy
  • Reason Implicit assumption-
  • Topic of a page u is likely to be the same as the
    topic of a page cited by u.
  • Not always true
  • Topic may be related but not same
  • Distribution of topics of the pages cited could
    be quite distorted compared to the totality of
    contents available from the page itself
  • E.g. university page with little textual content
  • Points to "how to get to our campus" or "recent
    sports prowess"

23
Absorbing link-derived features
  • Key insight 1
  • The classes of hyperlinked neighbors are a better
    representation of hyperlinks.
  • E.g.
  • use the fact that u points to a page about
    athletics to raise our belief that u is a
    university homepage,
  • learn to systematically reduce the attention we
    pay to the fact that a page links to the Netscape
    download site.
  • Key insight 2
  • class labels are from an is-a hierarchy.
  • evidence at the detailed topic level may be too
    noisy
  • coarsening the topic helps collect more reliable
    data on the dependence between the class of the
    homepage and the link-derived feature.

24
Absorbing link-derived features
  • Add all prefixes of the class path to the feature
    pool (see the sketch below)
  • Do feature selection to get rid of noise features
  • Experiment
  • Corpus of US patents
  • Two level topic hierarchy
  • three first-level classes,
  • each has four children.
  • Each leaf topic has 800 documents,
  • Experiment with
  • Text
  • Link
  • Prefix
  • Text+Prefix
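A sketch of the prefix trick: every prefix of a cited page's class path is emitted as a link-derived feature. The path strings and the separator are illustrative, not taken from the patent corpus.

```python
def prefix_features(class_path, sep="/"):
    # "/c1/c12/c123" -> ["/c1", "/c1/c12", "/c1/c12/c123"]
    parts = [p for p in class_path.split(sep) if p]
    return [sep + sep.join(parts[:i + 1]) for i in range(len(parts))]

def link_features(cited_class_paths):
    # Pool prefix features over all cited (neighboring) documents;
    # feature selection then prunes the noisy ones.
    feats = set()
    for path in cited_class_paths:
        feats.update(prefix_features(path))
    return feats
```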

25
The prefix trick
A two-level topic hierarchy of US patents.
Using prefix-encoded link features in conjunction
with text can significantly reduce
classification error.
26
Absorbing link-derived features Observations
Absorbing text from neighboring pages in an
indiscriminate manner does not help
classify hyper-linked patent documents any better
than a purely text-based naive Bayes classifier.
27
Absorbing link-derived features Limitation
  • |VK| << |V|
  • Hardly any neighbors of a node to be classified are
    linked to any pre-labeled node
  • Proposal
  • Start with a labeling of reasonable quality
  • Maybe using a text classifier
  • Do
  • Refine the labeling using a coupled distribution
    of text and labels of neighbors,
  • Until the labeling stabilizes.

28
A relaxation labeling algorithm
  • Given
  • Hypertext graph G(V, E)
  • Each vertex u is associated with text uT
  • Desired
  • A labeling f of all (unlabeled) vertices so as to
    maximize Pr(f(V - VK) | E, f(VK), {uT})

29
Preferential attachment
  • Simplifying assumption undirected graph
  • Web graph starts with m0 nodes
  • Time proceeds in discrete steps
  • Every step, one new node v is added
  • v is attached with m edges to old nodes
  • Suppose old node w has current degree d(w)
  • Multinomial distribution, probability of
    attachment to w is proportional to d(w)
  • Old nodes keep accumulating degree
  • The rich get richer, or winner takes all
    (see the sketch below)
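A minimal sketch of this growth process; the values of m0 and m and the chain initialization are illustrative choices.

```python
import random

def preferential_attachment(n_nodes, m=2, m0=3):
    # Start with m0 nodes connected in a chain so every node has degree > 0.
    edges = [(i, i + 1) for i in range(m0 - 1)]
    # Listing each edge endpoint twice makes degree-proportional sampling easy.
    endpoints = [u for e in edges for u in e]
    for v in range(m0, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))   # Pr(attach to w) is proportional to d(w)
        for w in targets:
            edges.append((v, w))
            endpoints.extend([v, w])
    return edges
```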

30
Heuristic Assumption
  • E the event that edges were generated as per the
    edge list E
  • Difficult to obtain a known functional form for the
    probability of this event
  • Approximate it using heuristic assumptions.
  • Assume that
  • Term occurrences are independent
  • link-derived features are independent
  • no dependence between a term and a link-derived
    feature.
  • Assumption decision boundaries will remain
    relatively immune to errors in the probability
    estimates.

31
Heuristic Assumption (contd.)
  • Approximate the joint probability of neighboring
    classes by the product of the marginals
    (see the expression below)
  • Couple class probabilities of neighboring nodes
  • Optimization concerns
  • Kleinberg and Tardos global optimization
  • A unique f for all nodes in VU
  • Greedy labeling followed by iterative correction
    of neighborhoods
  • Greedy labeling using a text classifier
  • Reevaluate class probability of each page using
    latest estimates of class probabilities of
    neighbors.
  • EM-like soft classification of nodes
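The approximation in the first bullet, written out; a hedged rendering using the deck's notation (N(v) for the neighbors of v, c(w) for the class of node w), not a formula copied from the slides.

```latex
\Pr\bigl(c(N(v)) = (c_{w_1},\ldots,c_{w_k})\bigr)
  \;\approx\; \prod_{w \in N(v)} \Pr\bigl(c(w) = c_w\bigr)
```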

32
Inducing a Markov Random field
  • Induction on time-step
  • to break the circular definition.
  • Converges if seed values are reasonably accurate
  • Further assumptions
  • limited range of influence
  • text of nodes other than v contains no information
    about f(v)
  • Already accounted for in the graph structure

33
Overview of the algorithm
  • Desired the class (probabilities) of v given
  • The text vT on that page
  • Classes of the neighbors N(v) of v.
  • Use Bayes rule to invert that goal
  • Build distributions Pr(vT | f(v)) and
    Pr(f(N(v)) | f(v))
  • The algorithm HyperClass (see the sketch below)
  • Input Test node v
  • construct a suitably large vicinity graph around
    and containing v
  • for each w in the vicinity graph do
  • assign Pr(0)(c|w) using a text classifier
  • end for
  • while label probabilities do not stabilize
    (r = 1, 2, ...) do
  • for each node w in the vicinity graph do
  • update Pr(r-1)(c|w) to Pr(r)(c|w) using the coupled
    distribution of text and neighbor labels
  • end for
  • end while
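A Python sketch of the HyperClass-style relaxation loop above; text_log_lik and neighbor_log_lik are illustrative stand-ins for the learned distributions Pr(vT | f(v)) and Pr(f(N(v)) | f(v)), and the product-of-marginals coupling from slide 31 appears as a sum of expected log-likelihoods.

```python
import numpy as np

def relaxation_labeling(nodes, neighbors, text_log_lik, neighbor_log_lik,
                        init_proba, n_rounds=10):
    # init_proba[v]: text-classifier estimate of Pr(c|v), one vector per node.
    # text_log_lik(v): log-likelihood of v's text under each candidate class.
    # neighbor_log_lik(v, w)[j, i]: log Pr(f(w) = j | f(v) = i).
    proba = dict(init_proba)
    for _ in range(n_rounds):
        new_proba = {}
        for v in nodes:
            # Text evidence plus the expected log-likelihood of the neighbors'
            # current soft labels under each candidate class of v.
            scores = text_log_lik(v) + sum(
                proba[w] @ neighbor_log_lik(v, w) for w in neighbors[v])
            scores -= scores.max()
            p = np.exp(scores)
            new_proba[v] = p / p.sum()
        proba = new_proba
    return proba
```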

34
Exploiting link features
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and cite other patents
  • Expand test patent to include neighborhood
  • Forget a fraction of the neighbors' classes

35
Relaxation labeling Observations
  • When the test neighborhood is completely
    unlabeled
  • Link performs better than the text-based
    classifier
  • Reason Model bias
  • Pages tend to link to pages with a related class
    label.
  • Relaxation labeling
  • An approximate procedure to optimize a global
    objective function on the hypertext graph being
    labeled.
  • A metric graph labeling problem

36
A metric graph-labeling problem
  • Inference about the topic of page u depends
    possibly on the entire Web.
  • Computationally infeasible
  • Unclear if capturing such dependencies is useful.
  • Phenomenon of losing one's way with clicks
  • significant clues about a page expected to be in
    a neighborhood of limited radius
  • Example
  • A hypertext graph
  • Nodes can belong to exactly one of two topics
    (red and blue)
  • Given a graph with a small subset of nodes with
    known colors

37
A metric graph-labeling problem (contd..)
  • Goal find a labeling f(u) (u unlabeled) to
    minimize a two-term cost (see the expression below)
  • 2 terms
  • affinity A(c1,c2) cost between all pairs of
    colors.
  • L(u,f(u)) = -Pr(f(u)|u) cost of assigning label
    f(u) to node u
  • Parameters
  • Marginal distribution of topics,
  • 2 x 2 topic citation matrix probability of
    differently colored nodes linking to each other
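A hedged rendering of the two-term objective; the exact expression is not in the transcript, so this follows the standard metric-labeling form using the costs named above (V^U denotes the unlabeled nodes).

```latex
\min_{f} \;\; \sum_{u \in V^U} L\bigl(u, f(u)\bigr)
  \; + \; \sum_{(u,v) \in E} A\bigl(f(u), f(v)\bigr)
```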

38
Semi-supervised hypertext classification
represented as a problem of completing a
partially colored graph subject to a given set of
cost constraints.
39
A metric graph-labeling problem NP-Completeness
  • NP-complete Kleinberg and Tardos
  • approximation algorithms
  • Within an O(log k log log k) multiplicative factor
    of the minimal cost,
  • k = number of distinct class labels.

40
Problems with approaches so far
  • Metric or relaxation labeling
  • Representing accurate joint distributions over
    thousands of terms
  • High space and time complexity
  • Naïve Models
  • Fast assume class-conditional attribute
    independence,
  • Dimensionality of the textual sub-problem >>
    dimensionality of the link sub-problem,
  • Pr(vT|f(v)) tends to be lower in magnitude than
    Pr(f(N(v))|f(v)).
  • Hacky workaround aggressive pruning of textual
    features

41
Co-Training Blum and Mitchell
  • Classifiers with disjoint feature spaces.
  • Co-training of classifiers
  • Scores used by each classifier to train the other
  • Semi-supervised EM-like training with two
    classifiers
  • Assumptions
  • Two sets of features per document, giving two
    views dA and dB (used by learners LA and LB).
  • Must be no instance d for which the two views
    imply different labels
  • Given the label c, dA is conditionally independent
    of dB (and vice versa)

42
Co-training
  • Divide features into two class-conditionally
    independent sets
  • Use labeled data to induce two separate
    classifiers
  • Repeat
  • Each classifier is most confident about some
    unlabeled instances
  • These are labeled and added to the training set
    of the other classifier
  • Improvements for text + hyperlinks
    (see the sketch below)
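A Python sketch of the co-training loop, assuming dense count matrices for the two views XA (page text) and XB (anchor text) and scikit-learn-style classifiers; round counts, batch sizes, and function names are illustrative. At prediction time a class c can then be picked by maximizing Pr(c|dA) Pr(c|dB), as the next slide notes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA_lab, XB_lab, y_lab, XA_unlab, XB_unlab, n_rounds=10, per_round=5):
    XA, XB, y = XA_lab, XB_lab, y_lab
    pool = np.arange(XA_unlab.shape[0])      # indices of still-unlabeled documents
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        clf_a.fit(XA, y)
        clf_b.fit(XB, y)
        # Each classifier labels the unlabeled documents it is most confident
        # about; those documents (with both views) join the shared training
        # set, so they also help train the other classifier.
        for clf, X_view in ((clf_a, XA_unlab), (clf_b, XB_unlab)):
            if len(pool) == 0:
                break
            proba = clf.predict_proba(X_view[pool])
            top = np.argsort(proba.max(axis=1))[-per_round:]
            pick, labels = pool[top], proba.argmax(axis=1)[top]
            XA = np.vstack([XA, XA_unlab[pick]])
            XB = np.vstack([XB, XB_unlab[pick]])
            y = np.concatenate([y, labels])
            pool = np.setdiff1d(pool, pick)
    return clf_a, clf_b
```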

43
Co-Training Performance
  • dA = bag of words
  • dB = bag of anchor texts from HREF tags
  • Reduces the error below the levels of both LA and
    LB individually
  • Pick a class c by maximizing
  • Pr(c|dA) Pr(c|dB).

44
Co-training reduces classification error
Reduction in error against the number of mutual
training rounds.