Text Classification from Labeled and Unlabeled Documents using EM - PowerPoint PPT Presentation


Transcript and Presenter's Notes


1
Text Classification from Labeled and Unlabeled
Documents using EM
Machine Learning (2000)
  • Kamal Nigam
  • Andrew K. McCallum
  • Sebastian Thrun
  • Tom Mitchell

Presented by Andrew Smith, May 12, 2003
2
Presentation Outline
  • Motivation and Background
  • The Naive Bayes classifier
  • Incorporating unlabeled data with EM (basic
    algorithm)
  • Enhancement 1: Modulating the influence of the
    unlabeled data
  • Enhancement 2: A different probabilistic model
  • Conclusions

3
Motivation
  • The task
  • - Given a set of news articles, automatically
    find documents on the same topic.
  • - We would like to require as few labeled
    documents as possible, since labeling documents
    by hand is expensive.

4
Previous work
  • The problem
  • - Existing statistical text learning algorithms
    require many training examples.
  • - (Lang 1995) A classifier trained with 1000 labeled
    documents was used to rank unlabeled documents. Of
    the top 10%, only about 50% were correct.

5
Motivation
  • Can we somehow use unlabeled documents?
  • - Yes! Unlabeled data provide information about
    the joint probability distribution.

6
Algorithm Outline
  1. Train a classifier with only the labeled
    documents.
  2. Use it to probabilistically classify the
    unlabeled documents.
  3. Use ALL the documents to train a new classifier.
  4. Iterate steps 2 and 3 to convergence.
  • This is reminiscent of K-Means and EM.

7
Presentation Outline
  • Motivation and Background
  • The Naive Bayes classifier
  • Incorporating unlabeled data with EM (basic
    algorithm)
  • Enhancement 1: Modulating the influence of the
    unlabeled data
  • Enhancement 2: A different probabilistic model
  • Conclusions

8
Probabilistic Framework
  • Assumptions
  • - The data are produced by a mixture model.
  • Mixture components c_j and class labels: there is a
    one-to-one correspondence between mixture components
    and document classes.
  • Documents d_i are generated by the mixture components.
  • Indicator variables z_ij: the statement z_ij = 1 means
    the i-th document belongs to class j.
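A compact summary of this notation (a sketch; y_i denotes the class label of document d_i, following the paper's convention):

$$c_j \in C = \{c_1,\ldots,c_{|C|}\}, \qquad d_i \in D = \{d_1,\ldots,d_{|D|}\}, \qquad z_{ij} = 1 \;\Leftrightarrow\; y_i = c_j$$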

9
Probabilistic Framework (2)
Mixture weights: P(c_j | θ), the prior probability of
mixture component c_j.
P(d_i | c_j; θ): the probability of class j generating
document i.
V: the vocabulary (indexed over t); w_t indicates a word
in the vocabulary.
Documents are ordered word lists; w_{d_i,k} indicates the
word at position k in document i.
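As a formula, a document is the ordered list of its words:

$$d_i = \langle w_{d_i,1},\, w_{d_i,2},\, \ldots,\, w_{d_i,|d_i|} \rangle, \qquad w_{d_i,k} \in V$$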
10
Probabilistic Framework (3)
The probability of document d_i, and the probability of
mixture component c_j generating document d_i, are shown
below.
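A sketch of these two quantities under the mixture-model assumptions above:

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$

$$P(d_i \mid c_j; \theta) = P\big(\langle w_{d_i,1},\ldots,w_{d_i,|d_i|}\rangle \mid c_j; \theta\big)$$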
11
Probabilistic Framework (4)
  • The Naive Bayes assumption The words of a
    document are generated independently of their
    order in the document, given the class.

12
Probabilistic Framework (5)
  • Now the probability of a document given its class
    becomes
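A sketch of the resulting factorization over word positions:

$$P(d_i \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$$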

We can use Bayes Rule to classify documents: find the
class with the highest probability given a novel
document.
13
Probabilistic Framework (6)
  • To learn the parameters θ of the classifier, use ML:
    find the most likely set of parameters given the data
    set.

The two parameters we need to find are the word
probability estimates and the mixture weights, written as
shown below.
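Writing the parameter symbols out explicitly (a sketch following the paper's convention):

$$\theta = \{\theta_{w_t \mid c_j},\ \theta_{c_j}\}, \qquad \theta_{w_t \mid c_j} \equiv P(w_t \mid c_j; \theta), \qquad \theta_{c_j} \equiv P(c_j \mid \theta), \qquad \hat\theta = \arg\max_{\theta} P(D \mid \theta)$$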
14
Probabilistic Framework (7)
  • The maximization yields parameters that are word
    frequency counts

$$\hat\theta_{w_t \mid c_j} = \frac{1 + (\text{no. of occurrences of } w_t \text{ in class } j)}{|V| + (\text{no. of words in class } j)}$$

$$\hat\theta_{c_j} = \frac{1 + (\text{no. of documents in class } j)}{|C| + |D|}$$

Laplace smoothing gives each word a prior
probability.
15
Probabilistic Framework (8)
N(w_t, d_i): the number of occurrences of word w_t in
document d_i.
z_ij: this is 1 if document i is in class j, or 0
otherwise.
  • Formally
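Written out with the quantities just defined, a sketch of the Laplace-smoothed estimates from the previous slide:

$$\hat\theta_{w_t \mid c_j} = P(w_t \mid c_j; \hat\theta) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, z_{ij}}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, z_{ij}}$$

$$\hat\theta_{c_j} = P(c_j \mid \hat\theta) = \frac{1 + \sum_{i=1}^{|D|} z_{ij}}{|C| + |D|}$$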

16
Probabilistic Framework (9)
  • Using Bayes Rule
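A sketch of the resulting classification rule under the naive Bayes model:

$$P(y_i = c_j \mid d_i; \hat\theta) = \frac{P(c_j \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat\theta)}{\sum_{r=1}^{|C|} P(c_r \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat\theta)}$$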

17
Presentation Outline
  • Motivation and Background
  • The Naive Bayes classifier
  • Incorporating unlabeled data with EM (basic
    algorithm)
  • Enhancement 1: Modulating the influence of the
    unlabeled data
  • Enhancement 2: A different probabilistic model

18
Application of EM to NB
  1. Estimate the parameters θ with only the labeled data.
  2. Assign probabilistically weighted class labels to the
    unlabeled data.
  3. Use all class labels (given and estimated) to find
    new parameters θ.
  4. Repeat steps 2 and 3 until θ does not change (a
    minimal code sketch follows below).
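A minimal sketch of this loop in Python/NumPy, assuming documents are represented as bag-of-words count matrices; the names train_em_nb, X_l, y_l, and X_u are illustrative and not from the paper, and a fixed iteration count stands in for the convergence test in step 4.

import numpy as np

def train_em_nb(X_l, y_l, X_u, n_classes, n_iter=30, alpha=1.0):
    """EM with multinomial naive Bayes over labeled + unlabeled documents.

    X_l : (n_labeled, |V|) word-count matrix for labeled documents
    y_l : (n_labeled,) integer class labels
    X_u : (n_unlabeled, |V|) word-count matrix for unlabeled documents
    alpha : 1.0 gives the Laplace (add-one) smoothing used on the slides
    """
    n_vocab = X_l.shape[1]
    # Fixed 0/1 class indicators z_ij for the labeled documents.
    Z_l = np.zeros((X_l.shape[0], n_classes))
    Z_l[np.arange(X_l.shape[0]), y_l] = 1.0

    def m_step(Z_all, X_all):
        # Laplace-smoothed word probabilities and mixture weights.
        counts = Z_all.T @ X_all                                  # (n_classes, |V|)
        theta_w = (alpha + counts) / (alpha * n_vocab + counts.sum(axis=1, keepdims=True))
        theta_c = (alpha + Z_all.sum(axis=0)) / (alpha * n_classes + Z_all.shape[0])
        return np.log(theta_w), np.log(theta_c)

    def e_step(X, log_theta_w, log_theta_c):
        # Posterior P(c_j | d_i) under the current naive Bayes model.
        log_post = X @ log_theta_w.T + log_theta_c                # (n_docs, n_classes)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    # Step 1: train with only the labeled documents.
    log_theta_w, log_theta_c = m_step(Z_l, X_l)
    for _ in range(n_iter):
        # Step 2 (E-step): probabilistically label the unlabeled documents.
        Z_u = e_step(X_u, log_theta_w, log_theta_c)
        # Step 3 (M-step): re-estimate parameters from ALL documents.
        log_theta_w, log_theta_c = m_step(np.vstack([Z_l, Z_u]),
                                          np.vstack([X_l, X_u]))
    return log_theta_w, log_theta_c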

19
More Notation
D^u: the set of unlabeled documents
D^l: the set of labeled documents
20
Deriving the basic Algorithm (1)
  • The probability of all the data is
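A sketch of this probability, splitting the labeled set D^l (where the generating component is the known class y_i) from the unlabeled set D^u (where we sum over all components):

$$P(D \mid \theta) = \prod_{d_i \in D^u} \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \;\times\; \prod_{d_i \in D^l} P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta)$$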

For unlabeled data, the component of the
probability is a sum across all mixture
components.
21
Deriving the basic Algorithm (2)
  • Easier to maximize the log-likelihood
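A sketch of the corresponding log-likelihood; note that the unlabeled term contains the log of a sum over mixture components:

$$l(\theta \mid D) = \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \;+\; \sum_{d_i \in D^l} \log\Big(P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta)\Big)$$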

This contains a log of sums, which makes
maximization intractable.
22
Deriving the basic Algorithm (3)
  • Suppose we had access to the labels of the
    unlabeled documents, expressed as a matrix of
    indicator variables z, where z_ij = 1 if
    document i is in class j, and 0 otherwise (so
    rows are documents and columns are classes).
  • Then the class sums in the log-likelihood have
    nonzero terms only where z_ij = 1, and we can
    treat the labeled and unlabeled documents the
    same way.

23
Deriving the basic Algorithm (4)
  • The complete log-likelihood becomes
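A sketch of the complete log-likelihood with the indicator matrix z; the sum over classes has moved outside the logarithm:

$$l_c(\theta \mid D, z) = \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij}\, \log\Big(P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)\Big)$$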

If we replace z with its expected value according
to the current classifier, then this equation
bounds from below the exact log-likelihood, so
iteratively increasing this equation will
increase the log-likelihood.
24
Deriving the basic Algorithm (5)
  • This leads to the basic algorithm
  • E-step
  • M-step
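A sketch of the two steps with the quantities defined earlier (hats denote the current estimates):

E-step:
$$\hat z_{ij} = P(c_j \mid d_i; \hat\theta) = \frac{P(c_j \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \hat\theta)}{\sum_{r=1}^{|C|} P(c_r \mid \hat\theta)\, \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r; \hat\theta)}$$

M-step:
$$\hat\theta_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, \hat z_{ij}}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, \hat z_{ij}}, \qquad \hat\theta_{c_j} = \frac{1 + \sum_{i=1}^{|D|} \hat z_{ij}}{|C| + |D|}$$

For labeled documents, \hat z_{ij} stays fixed at the given 0/1 labels.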

25
Data sets
  • 20 Newsgroups data set
  • 20017 articles drawn evenly from
  • 20 newsgroups
  • Many categories fall into confusable clusters.
  • Words from a stoplist of common short words are
    removed.
  • 62258 unique words occurring more than once
  • Word counts of documents are scaled so each
    document has the same length.

26
Data sets
  • WebKB data set
  • 4199 web pages from university CS departments
  • Divided into four categories (student, faculty,
    course, project).
  • No stoplist or stemming used.
  • Only 300 most informative words used (mutual
    information with class variable).
  • Validation with a leave-one-university-out
    approach to prevent idiosyncrasies of particular
    universities from inflating success measures.

27
Classification accuracy of 20 NewsGroups
28
Classification accuracy of 20 NewsGroups
29
Classification accuracy of WebKB
30
Predictive words found with EM
  Iteration 0      Iteration 1    Iteration 2
  Intelligence     DD             D
  DD               D              DD
  artificial       lecture        lecture
  understanding    cc             cc
  DDw              D              DDDD
  dist             DDDD           due
  identical        handout        D
  rus              due            homework
  arrange          problem        assignment
  games            set            handout
  dartmouth        tay            set
  natural          DDam           hw
  cognitive        yurttas        exam
  logic            homework       problem
  proving          kkfoury        DDam
  prolog           sec            postscript
  knowledge        postscript     solution
  human            exam           quiz
  (Here D stands for an arbitrary digit.)

31
Presentation Outline
  • Motivation and Background
  • The Naive Bayes classifier
  • Incorporating unlabeled data with EM (basic
    algorithm)
  • Enhancement 1: Modulating the influence of the
    unlabeled data
  • Enhancement 2: A different probabilistic model
  • Conclusions

32
The problem
  • Suppose you have a few labeled documents and many
    more unlabeled documents.
  • Then the algorithm almost becomes unsupervised
    clustering! The only function of the labeled
    data is to assign class labels to the mixture
    components.
  • When the mixture-model assumptions are not true,
    the basic algorithm will find components that
    don't correspond to different class labels.

33
The solution: EM-λ
  • Modulate the influence of the unlabeled data with a
    parameter λ (0 ≤ λ ≤ 1) and maximize a weighted
    log-likelihood in which the term for the unlabeled
    documents is scaled by λ while the term for the
    labeled documents is not (see the formula below).
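A sketch of the weighted log-likelihood, assuming (as in the paper's EM-λ formulation) that λ scales only the contribution of the unlabeled documents:

$$l_\lambda(\theta \mid D) = \lambda \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \;+\; \sum_{d_i \in D^l} \log\Big(P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta)\Big)$$

When λ = 1 this reduces to the basic algorithm; when λ = 0 the unlabeled documents are ignored.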
34
EM-λ
  • The E-step is exactly as before: assign
    probabilistic class labels.
  • The M-step is modified to reflect λ.
  • Define a weighting factor Λ(i), equal to λ for
    unlabeled documents and 1 for labeled documents,
    to modify the frequency counts.

35
EM-λ
The new NB word-probability estimates weight each
document's counts by Λ(i):

$$\hat\theta_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i)\, N(w_s, d_i)\, P(c_j \mid d_i)}$$

Here P(c_j | d_i) is the probabilistic class assignment,
Λ(i) the weight, N(w_t, d_i) the word count, and the
denominator sums over all words and documents.
36
Classification accuracy of WebKB
37
Classification accuracy of WebKB
38
Presentation Outline
  • Motivation and Background
  • The Naive Bayes classifier
  • Incorporating unlabeled data with EM (basic
    algorithm)
  • Enhancement 1: Modulating the influence of the
    unlabeled data
  • Enhancement 2: A different probabilistic model
  • Conclusions

39
The idea
  • EM-λ reduced the effects of violated assumptions
    with the λ parameter.
  • Alternatively, we can change our assumptions.
    Specifically, change the requirement of a
    one-to-one correspondence between classes and
    mixture components to a many-to-one
    correspondence.
  • For textual data, this corresponds to saying that
    a class may consist of several different
    sub-topics, each best characterized by a
    different word distribution.

40
More Notation
c_j now represents only mixture components, not
classes.
t_a represents the a-th class (topic).
An assignment of mixture components to classes is also
specified. This assignment is pre-determined,
deterministic, and permanent: once assigned to a
particular class, mixture components do not change
assignment.
41
The Algorithm
  • M-step: same as before, find estimates for the
    mixture components using Laplace priors (MAP).
  • E-step:
  • - For unlabeled documents, calculate the
    probabilistic mixture-component memberships
    exactly as before.
  • - For labeled documents, we previously considered
    z_ij to be a fixed indicator (0 or 1) of class
    membership. Now we allow it to vary between 0 and
    1 for mixture components in the same class as
    d_i. We set it to zero for mixture components
    belonging to classes other than the one
    containing d_i.

42
Algorithm details
  • Initialize the mixture components for each class by
    randomly setting z_ij for components in the correct
    class (and zero for components of other classes).
  • Documents are classified by summing up the mixture
    component probabilities of one class to form a class
    probability, as in the formula below.
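As a formula, using t_a for the a-th class from the previous slide (a sketch):

$$P(t_a \mid d_i; \hat\theta) = \sum_{j:\, c_j \in t_a} P(c_j \mid d_i; \hat\theta)$$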

43
Another data set
  • Reuters (21578 Distribution 1.0) data set
  • 12902 news articles in 90 topics from the Reuters
    newswire; only the ten most populous classes are
    used.
  • No stemming used.
  • Documents are split into early and late
    categories (by date). The task is to predict the
    topics of the later articles with classifiers
    trained on the early ones.
  • For all experiments on Reuters, 10 binary
    classifiers are trained, one per topic.

44
Performance Metrics
  • To evaluate the performance, define the two
    quantities

Recall = True Pos. / (True Pos. + False Neg.)
Precision = True Pos. / (True Pos. + False Pos.)

                        Actual value
                        Pos.         Neg.
Prediction   Pos.    True Pos.    False Pos.
             Neg.    False Neg.   True Neg.
The recall-precision breakeven point is the value at
which the two quantities are equal. The breakeven point
is used instead of accuracy (the fraction correctly
classified) because the data sets have a much higher
frequency of negative examples, so a classifier could
achieve high accuracy by always predicting negative.
45
Classification of Reuters
(breakeven points)
(Table of breakeven points comparing Naive Bayes, EM,
Naive Bayes with multiple mixture components, and EM
with multiple mixture components.)
46
Classification accuracy of Reuters
47
Classification of Reuters
(breakeven points)
Using different numbers of mixture components
48
Classification of Reuters
(breakeven points)
Naive Bayes with different numbers of mixture
components
49
Classification of Reuters
(breakeven points)
Using cross-validation or best-EM to select the
number of mixture components
50
Conclusions
  • Cross-validation tends to underestimate the best
    number of mixture components.
  • Incorporating unlabeled data into any classifier
    is important because of the high cost of
    hand-labeling documents.
  • Classifiers based on generative models that make
    incorrect assumptions can still achieve high
    accuracy.
  • The new algorithm does not produce binary
    classifiers that are much better than NB.