Semi-Supervised learning - PowerPoint PPT Presentation

1
Semi-Supervised learning
2
Need for an intermediate approach
  • Unsupervised and Supervised learning
  • Two extreme learning paradigms
  • Unsupervised learning
  • collection of documents without any labels
  • easy to collect
  • Supervised learning
  • each object tagged with a class.
  • laborious job
  • Semi-supervised learning
  • Real-life applications are somewhere in between.

3
Motivation
  • Document collection D
  • A subset DK of D (with |DK| << |D|)
    has known labels
  • Goal to label the rest of the collection,
    DU = D - DK
  • Approach
  • Train a supervised learner using DK, the
    labeled subset.
  • Apply the trained learner on the remaining
    documents.
  • Idea
  • Harness information in DU to enable
    better learning.

4
The Challenge
  • Unsupervised portion of the corpus, DU,
    adds to
  • Vocabulary
  • Knowledge about the joint distribution of terms
  • Unsupervised measures of inter-document
    similarity.
  • E.g. site name, directory path, hyperlinks
  • Put together multiple sources of evidence of
    similarity and class membership into a
    label-learning system.
  • combine different features with partial
    supervision

5
Hard Classification
  • Train a supervised learner on the available labeled
    data DK
  • Label all documents in DU
  • Retrain the classifier using the new labels for
    documents where the classifier was most confident
  • Continue until labels do not change any more
    (see the sketch below).
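A minimal Python sketch of this hard self-training loop, assuming dense term-count matrices and a scikit-learn-style multinomial naive Bayes classifier; the confidence threshold and the function name are illustrative, not from the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_lab, y_lab, X_unlab, confidence=0.95, max_rounds=20):
    """Hard self-training: label DU, retrain on confident labels, repeat."""
    clf = MultinomialNB()
    X_train, y_train = X_lab, y_lab
    prev = None
    for _ in range(max_rounds):
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(X_unlab)
        labels = proba.argmax(axis=1)
        # Stop once the labels assigned to DU no longer change.
        if prev is not None and np.array_equal(labels, prev):
            break
        prev = labels
        confident = proba.max(axis=1) >= confidence
        # Retrain on DK plus the documents the classifier is most confident about.
        X_train = np.vstack([X_lab, X_unlab[confident]])
        y_train = np.concatenate([y_lab, labels[confident]])
    return clf
```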

6
Expectation maximization
  • Softer variant of previous algorithm
  • Steps
  • Set up some fixed number of clusters with some
    arbitrary initial distributions,
  • Alternate between the following steps
  • E-step re-estimate Pr(c|d) for each cluster c and
    each document d, based on the current parameters of
    the distribution that characterizes c.
  • M-step re-estimate the parameters of the
    distribution for each cluster.

7
Experiment EM
  • Set up one cluster for each class label
  • Estimate a class-conditional distribution which
    includes information from D
  • Simultaneously estimate the cluster memberships
    of the unlabeled documents.

8
Experiment EM (contd..)
  • Example
  • EM procedure with a multinomial naive Bayes text
    classifier
  • Laplace's law for parameter smoothing
  • For EM, unlabeled documents belong to clusters
    probabilistically
  • Term counts weighted by the probabilities
  • Likewise, modify class priors
    (see the sketch below)
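A sketch of this EM loop in Python, assuming dense term-count matrices; variable and function names are illustrative. Labeled documents keep their known class with probability 1, unlabeled documents contribute fractional, probability-weighted term counts and class-prior counts, and Laplace smoothing is applied in the M-step.

```python
import numpy as np

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iters=30, laplace=1.0):
    n_terms = X_lab.shape[1]
    # Responsibilities Pr(c|d): labeled documents are fixed to their known
    # class, unlabeled documents start out uniform.
    R_lab = np.eye(n_classes)[y_lab]
    R_unlab = np.full((X_unlab.shape[0], n_classes), 1.0 / n_classes)
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iters):
        R_all = np.vstack([R_lab, R_unlab])
        # M-step: probability-weighted counts with Laplace smoothing.
        class_counts = R_all.sum(axis=0)                 # soft document counts per class
        prior = (class_counts + laplace) / (class_counts.sum() + laplace * n_classes)
        term_counts = R_all.T @ X_all                    # (C, n_terms) weighted term counts
        theta = (term_counts + laplace) / (
            term_counts.sum(axis=1, keepdims=True) + laplace * n_terms)
        # E-step: recompute Pr(c|d) for the unlabeled documents only.
        log_post = np.log(prior) + X_unlab @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        R_unlab = np.exp(log_post)
        R_unlab /= R_unlab.sum(axis=1, keepdims=True)
    return prior, theta, R_unlab
```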

9
EM Issues
  • For documents in DK, we know the class label cd
  • Question how to use this information?
  • Will be dealt with later
  • Using Laplace estimate instead of ML estimate
  • Not strictly EM
  • Convergence takes place in practice

10
EM Experiments
  • Take a completely labeled corpus D, and randomly
    select a subset as DK.
  • Also use the remaining unlabeled documents,
    DU = D - DK, in the EM procedure.
  • Correct classification of a document
  • concealed class label = class with largest
    probability
  • Accuracy with unlabeled documents > accuracy
    without unlabeled documents
  • Keeping labeled set of same size
  • EM beats naïve Bayes with same size of labeled
    document set
  • Largest boost for small size of labeled set
  • Comparable or poorer performance of EM for large
    labeled sets

11
Belief in labeled documents
  • Depending on one's faith in the initial labeling
  • Set Pr(c|d) for labeled documents before the 1st
    iteration
  • With each iteration
  • Let the class probabilities of the labeled
    documents 'smear'

12
EM Reducing belief in unlabeled documents
  • Problems due to
  • Noise in the term distribution of documents in DU
  • Mistakes in the E-step
  • Solution
  • Attenuate the contribution from documents in DU
  • Add a damping factor in the E-step for the
    contribution from DU (see the sketch below)
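A small sketch of the damped contribution, assuming the EM setup from the earlier sketch; the name alpha for the damping factor is illustrative (the slides only call it a damping factor).

```python
import numpy as np

def damped_responsibilities(R_lab, R_unlab, alpha=0.5):
    # R_lab: one-hot Pr(c|d) rows for documents in DK (kept at full weight).
    # R_unlab: soft Pr(c|d) rows for documents in DU, scaled down by alpha
    # so that noisy unlabeled documents influence the M-step less.
    return np.vstack([R_lab, alpha * R_unlab])
```

This stacked matrix can stand in for R_all in the M-step of the earlier EM sketch.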

13
Increasing DU while holding DK fixed also shows
the advantage of using large unlabeled sets in
the EM-like algorithm.
14
EM Reducing belief in unlabeled documents
(contd..)
  • No theoretical justification
  • Accuracy is indeed influenced by the choice of the
    damping factor
  • What value of the damping factor to choose?
  • An intuitive recipe (to be tried)

15
EM Modeling labels using many mixture components
  • Need not be a one-to-one correspondence between
    EM clusters and class labels.
  • Mixture modeling of the term distributions of some
    classes
  • Especially the negative class
  • E.g. for the two-class case football vs. not
    football
  • Documents not about football are actually about
    a variety of other things (see the sketch below)
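A sketch of the many-to-one mapping from EM clusters to class labels, e.g. several clusters assigned to the "not football" label; the mapping and the names are illustrative.

```python
import numpy as np

def predict_labels(cluster_proba, cluster_to_label, n_labels):
    # cluster_proba: (n_docs, n_clusters) posteriors Pr(cluster|d) from EM.
    # cluster_to_label: e.g. {0: FOOTBALL, 1: NOT_FOOTBALL, 2: NOT_FOOTBALL}.
    label_proba = np.zeros((cluster_proba.shape[0], n_labels))
    for cluster, label in cluster_to_label.items():
        # A label's score is the total posterior mass of its clusters.
        label_proba[:, label] += cluster_proba[:, cluster]
    return label_proba.argmax(axis=1)
```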

16
EM Modeling labels using many mixture components
  • Experiments comparison with Naïve Bayes
  • Lower accuracy with one mixture component per
    label
  • Higher accuracy with more mixture components per
    label
  • Overfitting and degradation with too large a
    number of clusters

17
Allowing more clusters in the EM-like algorithm
than there are class labels often helps to capture
term distributions for composite or complex
topics, and boosts the accuracy of the
semi-supervised learner beyond that of a naive
Bayes classifier.
18
Labeling hypertext graphs
  • More complex features than exploited by EM
  • Test document is cited directly by a training
    document, or vice versa
  • Short path between the test document and one or
    more training documents.
  • Test document is cited by a named category in a
    Web directory
  • Target category system could be somewhat
    different
  • Some category of a Web directory co-cites one or
    more training document along with the test
    document.

19
Labeling hypertext graphs Scenario
  • Snapshot of the Web graph G(V, E)
  • Set of topics (class labels)
  • Small subset of nodes VK labeled
  • Use the supervision to label some or all nodes in
    V - VK

20
Hypertext models for classification
  • c = class, t = text, N = neighbors
  • Text-only model Pr(t|c)
  • Using neighbors' text to judge my topic
    Pr(t, t(N) | c)
  • Better model Pr(t, c(N) | c)
  • Non-linear relaxation

21
Absorbing features from neighboring pages
  • Page u may have little text on it to train or
    apply a text classifier
  • u cites some second level pages
  • Often second-level pages have usable quantities
    of text
  • Question How to use these features ?

22
Absorbing features indiscriminate absorption of
neighborhood text
  • Does not help. At times deteriorates accuracy
  • Reason Implicit assumption-
  • Topic of a page u is likely to be the same as the
    topic of a page cited by u.
  • Not always true
  • Topic may be related but not same
  • Distribution of topics of the pages cited could
    be quite distorted compared to the totality of
    contents available from the page itself
  • E.g. university page with little textual content
  • Points to "how to get to our campus" or "recent
    sports prowess"

23
Absorbing link-derived features
  • Key insight 1
  • The classes of hyperlinked neighbors are a better
    representation of hyperlinks.
  • E.g.
  • use the fact that u points to a page about
    athletics to raise our belief that u is a
    university homepage,
  • learn to systematically reduce the attention we
    pay to the fact that a page links to the Netscape
    download site.
  • Key insight 2
  • class labels are from an is-a hierarchy.
  • evidence at the detailed topic level may be too
    noisy
  • coarsening the topic helps collect more reliable
    data on the dependence between the class of the
    homepage and the link-derived feature.

24
Absorbing link-derived features
  • Add all prefixes of the class path to the feature
    pool (see the sketch below)
  • Do feature selection to get rid of noise features
  • Experiment
  • Corpus of US patents
  • Two level topic hierarchy
  • three first-level classes,
  • each has four children.
  • Each leaf topic has 800 documents,
  • Experiment with
  • Text
  • Link
  • Prefix
  • Text+Prefix
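A sketch of the prefix trick: every prefix of a cited page's class path is emitted as a link-derived feature. The path strings and the separator are illustrative, not taken from the patent corpus.

```python
def prefix_features(class_path, sep="/"):
    # "/c1/c12/c123" -> ["/c1", "/c1/c12", "/c1/c12/c123"]
    parts = [p for p in class_path.split(sep) if p]
    return [sep + sep.join(parts[:i + 1]) for i in range(len(parts))]

def link_features(cited_class_paths):
    # Pool prefix features over all cited (neighboring) documents;
    # feature selection then prunes the noisy ones.
    feats = set()
    for path in cited_class_paths:
        feats.update(prefix_features(path))
    return feats
```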

25
The prefix trick
A two-level topic hierarchy of US patents.
Using prefix-encoded link features in conjunction
with text can significantly reduce
classification error.
26
Absorbing link-derived features Observations
Absorbing text from neighboring pages in an
indiscriminate manner does not help
classify hyper-linked patent documents any better
than a purely text-based naive Bayes classifier.
27
Absorbing link-derived features Limitation
  • |VK| << |V|
  • Hardly any neighbors of a node to be classified are
    linked to any pre-labeled node
  • Proposal
  • Start with a labeling of reasonable quality
  • Maybe using a text classifier
  • Do
  • Refine the labeling using a coupled distribution
    of text and labels of neighbors,
  • Until the labeling stabilizes.

28
A relaxation labeling algorithm
  • Given
  • Hypertext graph G(V, E)
  • Each vertex u is associated with text uT
  • Desired
  • A labeling f of all (unlabeled) vertices so as to
    maximize Pr(f(V - VK) | E, f(VK), {uT})

29
Preferential attachment
  • Simplifying assumption undirected graph
  • Web graph starts with m0 nodes
  • Time proceeds in discrete steps
  • Every step, one new node v is added
  • v is attached with m edges to old nodes
  • Suppose old node w has current degree d(w)
  • Multinomial distribution, probability of
    attachment to w is proportional to d(w)
  • Old nodes keep accumulating degree
  • The rich get richer, or winner takes all
    (see the sketch below)
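A minimal sketch of this growth process; the values of m0 and m and the chain initialization are illustrative choices.

```python
import random

def preferential_attachment(n_nodes, m=2, m0=3):
    # Start with m0 nodes connected in a chain so every node has degree > 0.
    edges = [(i, i + 1) for i in range(m0 - 1)]
    # Listing each edge endpoint twice makes degree-proportional sampling easy.
    endpoints = [u for e in edges for u in e]
    for v in range(m0, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))   # Pr(attach to w) is proportional to d(w)
        for w in targets:
            edges.append((v, w))
            endpoints.extend([v, w])
    return edges
```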

30
Heuristic Assumption
  • E the event that edges were generated as per the
    edge list E
  • Difficult to obtain a known functional form for the
    probability of this event
  • Approximate it using heuristic assumptions.
  • Assume that
  • Term occurrences are independent
  • link-derived features are independent
  • no dependence between a term and a link-derived
    feature.
  • Assumption decision boundaries will remain
    relatively immune to errors in the probability
    estimates.

31
Heuristic Assumption (contd.)
  • Approximate the joint probability of neighboring
    classes by the product of the marginals
    (see the expression below)
  • Couple class probabilities of neighboring nodes
  • Optimization concerns
  • Kleinberg and Tardos global optimization
  • A unique f for all nodes in VU
  • Greedy labeling followed by iterative correction
    of neighborhoods
  • Greedy labeling using a text classifier
  • Reevaluate class probability of each page using
    latest estimates of class probabilities of
    neighbors.
  • EM-like soft classification of nodes
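The approximation in the first bullet, written out; a hedged rendering using the deck's notation (N(v) for the neighbors of v, c(w) for the class of node w), not a formula copied from the slides.

```latex
\Pr\bigl(c(N(v)) = (c_{w_1},\ldots,c_{w_k})\bigr)
  \;\approx\; \prod_{w \in N(v)} \Pr\bigl(c(w) = c_w\bigr)
```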

32
Inducing a Markov Random field
  • Induction on time-step
  • to break the circular definition.
  • Converges if seed values are reasonably accurate
  • Further assumptions
  • limited range of influence
  • text of nodes other than v contains no information
    about f(v)
  • Already accounted for in the graph structure

33
Overview of the algorithm
  • Desired the class (probabilities) of v given
  • The text vT on that page
  • Classes of the neighbors N(v) of v.
  • Use Bayes rule to invert that goal
  • Build distributions Pr(vT | f(v)) and
    Pr(f(N(v)) | f(v))
  • The algorithm HyperClass (see the sketch below)
  • Input Test node v
  • construct a suitably large vicinity graph around
    and containing v
  • for each w in the vicinity graph do
  • assign Pr(0)(c|w) using a text classifier
  • end for
  • while label probabilities do not stabilize
    (r = 1, 2, ...) do
  • for each node w in the vicinity graph do
  • update Pr(r-1)(c|w) to Pr(r)(c|w) using the coupled
    distribution of text and neighbor labels
  • end for
  • end while
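A Python sketch of the HyperClass-style relaxation loop above; text_log_lik and neighbor_log_lik are illustrative stand-ins for the learned distributions Pr(vT | f(v)) and Pr(f(N(v)) | f(v)), and the product-of-marginals coupling from slide 31 appears as a sum of expected log-likelihoods.

```python
import numpy as np

def relaxation_labeling(nodes, neighbors, text_log_lik, neighbor_log_lik,
                        init_proba, n_rounds=10):
    # init_proba[v]: text-classifier estimate of Pr(c|v), one vector per node.
    # text_log_lik(v): log-likelihood of v's text under each candidate class.
    # neighbor_log_lik(v, w)[j, i]: log Pr(f(w) = j | f(v) = i).
    proba = dict(init_proba)
    for _ in range(n_rounds):
        new_proba = {}
        for v in nodes:
            # Text evidence plus the expected log-likelihood of the neighbors'
            # current soft labels under each candidate class of v.
            scores = text_log_lik(v) + sum(
                proba[w] @ neighbor_log_lik(v, w) for w in neighbors[v])
            scores -= scores.max()
            p = np.exp(scores)
            new_proba[v] = p / p.sum()
        proba = new_proba
    return proba
```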

34
Exploiting link features
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and cite other patents
  • Expand test patent to include neighborhood
  • Forget a fraction of the neighbors' classes

35
Relaxation labeling Observations
  • When the test neighborhood is completely
    unlabeled
  • Link performs better than the text-based
    classifier
  • Reason Model bias
  • Pages tend to link to pages with a related class
    label.
  • Relaxation labeling
  • An approximate procedure to optimize a global
    objective function on the hypertext graph being
    labeled.
  • A metric graph labeling problem

36
A metric graph-labeling problem
  • Inference about the topic of page u depends
    possibly on the entire Web.
  • Computationally infeasible
  • Unclear if capturing such dependencies is useful.
  • Phenomenon of losing one's way with clicks
  • significant clues about a page expected to be in
    a neighborhood of limited radius
  • Example
  • A hypertext graph
  • Nodes can belong to exactly one of two topics
    (red and blue)
  • Given a graph with a small subset of nodes with
    known colors

37
A metric graph-labeling problem (contd..)
  • Goal find a labeling f(u) (u unlabeled) to
    minimize a two-term cost (see the expression below)
  • 2 terms
  • affinity A(c1,c2) cost between all pairs of
    colors.
  • L(u,f(u)) = -Pr(f(u)|u) cost of assigning label
    f(u) to node u
  • Parameters
  • Marginal distribution of topics,
  • 2 x 2 topic citation matrix probability of
    differently colored nodes linking to each other
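A hedged rendering of the two-term objective; the exact expression is not in the transcript, so this follows the standard metric-labeling form using the costs named above (V^U denotes the unlabeled nodes).

```latex
\min_{f} \;\; \sum_{u \in V^U} L\bigl(u, f(u)\bigr)
  \; + \; \sum_{(u,v) \in E} A\bigl(f(u), f(v)\bigr)
```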

38
Semi-supervised hypertext classification
represented as a problem of completing a
partially colored graph subject to a given set of
cost constraints.
39
A metric graph-labeling problem NP-Completeness
  • NP-complete Kleinberg and Tardos
  • approximation algorithms
  • Within an O(log k log log k) multiplicative factor
    of the minimal cost,
  • k = number of distinct class labels.

40
Problems with approaches so far
  • Metric or relaxation labeling
  • Representing accurate joint distributions over
    thousands of terms
  • High space and time complexity
  • Naïve Models
  • Fast assume class-conditional attribute
    independence,
  • Dimensionality of the textual sub-problem >>
    dimensionality of the link sub-problem,
  • Pr(vT|f(v)) tends to be lower in magnitude than
    Pr(f(N(v))|f(v)).
  • Hacky workaround aggressive pruning of textual
    features

41
Co-Training Blum and Mitchell
  • Classifiers with disjoint feature spaces.
  • Co-training of classifiers
  • Scores used by each classifier to train the other
  • Semi-supervised EM-like training with two
    classifiers
  • Assumptions
  • Two sets of features per document, giving two
    views dA and dB (used by learners LA and LB).
  • Must be no instance d for which the two views
    imply different labels
  • Given the label c, dA is conditionally independent
    of dB (and vice versa)

42
Co-training
  • Divide features into two class-conditionally
    independent sets
  • Use labeled data to induce two separate
    classifiers
  • Repeat
  • Each classifier is most confident about some
    unlabeled instances
  • These are labeled and added to the training set
    of the other classifier
  • Improvements for text + hyperlinks
    (see the sketch below)
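A Python sketch of the co-training loop, assuming dense count matrices for the two views XA (page text) and XB (anchor text) and scikit-learn-style classifiers; round counts, batch sizes, and function names are illustrative. At prediction time a class c can then be picked by maximizing Pr(c|dA) Pr(c|dB), as the next slide notes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA_lab, XB_lab, y_lab, XA_unlab, XB_unlab, n_rounds=10, per_round=5):
    XA, XB, y = XA_lab, XB_lab, y_lab
    pool = np.arange(XA_unlab.shape[0])      # indices of still-unlabeled documents
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        clf_a.fit(XA, y)
        clf_b.fit(XB, y)
        # Each classifier labels the unlabeled documents it is most confident
        # about; those documents (with both views) join the shared training
        # set, so they also help train the other classifier.
        for clf, X_view in ((clf_a, XA_unlab), (clf_b, XB_unlab)):
            if len(pool) == 0:
                break
            proba = clf.predict_proba(X_view[pool])
            top = np.argsort(proba.max(axis=1))[-per_round:]
            pick, labels = pool[top], proba.argmax(axis=1)[top]
            XA = np.vstack([XA, XA_unlab[pick]])
            XB = np.vstack([XB, XB_unlab[pick]])
            y = np.concatenate([y, labels])
            pool = np.setdiff1d(pool, pick)
    return clf_a, clf_b
```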

43
Co-Training Performance
  • dA = bag of words
  • dB = bag of anchor texts from HREF tags
  • Reduces the error below the levels of both LA and
    LB individually
  • Pick a class c by maximizing
  • Pr(c|dA) Pr(c|dB).

44
Co-training reduces classification error
Reduction in error against the number of mutual
training rounds.