Title: BayesANIL: A Bayesian Model for Handling Approximate, Noisy or Incomplete Labeling in Text Classification
1. BayesANIL: A Bayesian Model for Handling Approximate, Noisy or Incomplete Labeling in Text Classification
- Ganesh Ramakrishnan (ganramkr_at_in.ibm.com)
- Krishna Prasad Chitrapura (kchitrap_at_in.ibm.com)
- Raghu Krishnapuram (kraghura_at_in.ibm.com)
- Pushpak Bhattacharyya (pb_at_cse.iitb.ac.in)
2. Outline
- Motivation
- Related work
- Role of BayesANIL in text classification setting
- The BayesANIL model for learning
- Use of BayesANIL parameters in classifiers
- Experiments
- Conclusions
3. Motivation: hurdles in supervised learning of text classifiers
- Approximations involved in manual labeling of documents.
- Noise in the labeling.
  - In many scenarios, it is easy to generate a labeled data set with some amount of noise in the labeling (e.g., by querying the Web).
- Learning from unlabeled documents.
  - Can be looked upon as learning with incomplete labeling.
4. Related work
- Learning from a mixture of positive and unlabeled examples (Lee and Liu, 2003).
  - Our proposed method outperforms this technique.
- Countering class noise by iterative removal of training instances that can be potentially misclassified under many models (Brodley and Friedl, 1996).
  - Does not handle approximations in the labeling process.
- Cost-sensitive learning algorithm (Domingos, 1999), e.g., for data sets with imbalanced classes.
  - The proposed method is complementary to this work.
5. Related work (contd.)
- Generalization from few labeled examples.
- Learning with labeled and unlabeled data (Nigam et al., 2000; Ando and Zhang, 2004).
- Feature smoothing techniques such as Laplace, Lidstone and Jeffreys-Perks smoothing (Griffiths and Tenenbaum, 2001).
  - These techniques do not account for the empirical distribution of features in unlabeled documents.
- Probabilistic latent semantic analysis (Hofmann, 1999).
  - More suited for information retrieval.
6. What we propose
- A model that estimates the degree to which each document d belongs to (or fits into) each class z: Pr(d, z).
- Use this measure Pr(d, z) to aid traditional text classifiers (NB, SVM) in handling Approximate, Noisy or Incomplete labeling of text documents.
- Pr(d|z) can be used as a measure of support, while Pr(z|d) can be used as a measure of confidence.
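As a sketch, both measures, and the overall support Pr(d), follow from the usual decomposition of the joint Pr(d, z):

```latex
\Pr(z \mid d) = \frac{\Pr(d, z)}{\Pr(d)}, \qquad
\Pr(d \mid z) = \frac{\Pr(d, z)}{\Pr(z)}, \qquad
\Pr(d) = \sum_{z} \Pr(d, z).
```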
7. Role of BayesANIL in text classification
8. The BayesANIL model: notations
- w is independent of d given z.
- A class generates document instances, each of which is a bag of words.
- Pr(w|d) is computed as the fraction of times word w occurs across all words in document d.
- Observables: the word-document counts n(w, d).
- Parameters: Pr(z), Pr(d|z) and Pr(w|z).
9. The BayesANIL model: notations (contd.)
- Scale each document to a common length to avoid modeling document length.
- Observations n(w, d) become Pr(w|d) when scaled to unit length.
- Use the empirical distribution q(w, z) in place of n(w, z).
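The unit-length scaling above can be sketched in a few lines (function and variable names here are illustrative, not from the slides):

```python
from collections import Counter

def empirical_distribution(tokens):
    """Scale a document's word counts n(w, d) to unit length,
    yielding Pr(w|d): the fraction of times each word occurs
    among all words in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# e.g. Pr("bayes"|d) = 2/4 for a 4-word document containing "bayes" twice
p = empirical_distribution(["bayes", "model", "bayes", "noise"])
```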
10. The BayesANIL model: Objective function
- A more general form of the log-likelihood objective (Amari, 1995).
11. The BayesANIL model: E and M steps
- The condition for the maximum value of the objective function yields the E and M step updates.
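As an illustrative sketch, a generic multinomial-mixture objective and its EM updates consistent with the notation used here (the exact BayesANIL equations may differ):

```latex
% Generalized log-likelihood over the empirical distribution:
\mathcal{L} \;=\; \sum_{d}\sum_{w} \hat{q}(w,d)\,
  \log \sum_{z} \Pr(z)\,\Pr(d \mid z)\,\Pr(w \mid z)

% E step: responsibility of class z for the pair (w, d)
\Pr(z \mid w, d) \;=\;
  \frac{\Pr(z)\,\Pr(d \mid z)\,\Pr(w \mid z)}
       {\sum_{z'} \Pr(z')\,\Pr(d \mid z')\,\Pr(w \mid z')}

% M step: re-estimate parameters from responsibility-weighted counts
\Pr(w \mid z) \;\propto\; \sum_{d} \hat{q}(w,d)\,\Pr(z \mid w, d)
```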
12. The Algorithm
- An EM iteration restructured for efficient storage and computation.
13. Re-estimating the empirical distribution
- An optional E step, with a smoothing parameter k.
- In the case of learning in the presence of classification noise, k serves as an estimate of the proportion of noise in the training data.
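One plausible form of this re-estimation, assuming k interpolates between the original empirical distribution and the model's expectation (an assumption, not necessarily the slide's exact update):

```latex
\hat{q}_{\text{new}}(w, d) \;=\; (1 - k)\,\hat{q}(w, d)
  \;+\; k \sum_{z} \Pr(z \mid d)\,\Pr(w \mid z)
```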
14. Utilizing parameters of BayesANIL in NB
- Improved estimation of the NB parameter Pr(w|z), based on the degree to which the training documents belong to each class.
- We call this WeightedNB.
- No explicit feature smoothing is performed.
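A minimal sketch of the WeightedNB idea, assuming Pr(z|d) weights from BayesANIL are available (all names are illustrative): each document contributes to Pr(w|z) in proportion to its class membership, with no explicit feature smoothing.

```python
from collections import defaultdict

def weighted_nb_word_probs(docs, class_weights, classes):
    """Estimate Pr(w|z) with each document's counts weighted by Pr(z|d).

    docs          -- list of {word: Pr(w|d)} empirical distributions
    class_weights -- list of {z: Pr(z|d)} dicts, one per document
    """
    totals = {z: defaultdict(float) for z in classes}
    for q_wd, p_zd in zip(docs, class_weights):
        for z in classes:
            for w, q in q_wd.items():
                totals[z][w] += p_zd.get(z, 0.0) * q
    # normalize each class's weighted counts into a word distribution
    probs = {}
    for z in classes:
        s = sum(totals[z].values())
        probs[z] = {w: v / s for w, v in totals[z].items()} if s else {}
    return probs
```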
15. Utilizing parameters of BayesANIL in SVM
- Pr(d), computed from Pr(d, z), is a measure of support for how well d is labeled.
- Cost-based SVM learners allow setting the cost of misclassification for each document d.
- Used a Matlab-based SVM learner: http://www.igi.tugraz.at/aschwaig/software.html
- Error-correcting output codes for handling multiple classes.
- We call the resultant classifier WeightedSVM.
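The mapping from support Pr(d) to a per-document misclassification cost can be sketched as follows; the linear scaling around the mean is a hypothetical choice, since the slides do not give the exact mapping.

```python
def misclassification_costs(support, base_cost=1.0):
    """Map per-document support scores Pr(d) to misclassification
    costs for a cost-based SVM learner: well-supported labels get
    proportionally higher cost (hypothetical linear scaling)."""
    mean = sum(support) / len(support)
    return [base_cost * s / mean for s in support]
```

In place of the Matlab learner mentioned above, such costs could be passed to any learner that accepts per-sample weights, e.g. scikit-learn's `SVC.fit(X, y, sample_weight=costs)`.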
16. Experiments and Results
- Four types of experimental setups:
  - Supervised learning
  - Access to unlabeled examples
  - Learning in presence of noisy labels
  - Pr(d) as a measure of support
- Two data sets:
  - 20 Newsgroups
  - WebKB
- Data preparation:
  - Rainbow to parse, tokenize and index the documents
  - Stop words were not removed
  - No stemming was performed
17. Experiments and Results: Supervised
- Accuracies on 2 data sets with and without Pr(d, z) estimates from BayesANIL.
- We stop the EM iterations when the change in log-likelihood between two successive iterations is less than 0.01.
- The smoothing parameter was set to k = 0.001.
- Train-to-test ratio was 60:40.
- Results reported on 20 random train-test splits.
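The stopping rule above can be sketched as a small EM driver; the actual E and M computations are stubbed out and all names are illustrative.

```python
def run_em(em_step, tol=0.01, max_iter=1000):
    """Iterate until the log-likelihood changes by less than `tol`
    between successive iterations, the stopping rule used in these
    experiments. `em_step` performs one E+M pass and returns the
    new log-likelihood."""
    prev = float("-inf")
    for it in range(1, max_iter + 1):
        ll = em_step()
        if ll - prev < tol:  # EM log-likelihood is non-decreasing
            return ll, it
        prev = ll
    return prev, max_iter
```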
18. Experiments and Results: Labeled-unlabeled
- Setup similar to (Nigam et al., 2000).
- We set aside 1% for training and 10% for test; the unlabeled collection is built from the remaining documents.
- We report accuracies on test data by varying the number of unlabeled documents, across two values of k.
19. Experiments and Results: Access to unlabeled for WebKB
20. Experiments and Results: Access to unlabeled for 20 Newsgroups
21. Experiments and Results: Noisy Labels
- Experimental setup as in (Lee and Liu, 2003): 50% training, 20% validation (stopping criterion) and 30% testing.
- When the classification noise is a, we set k = a to counter the noisy labels.
- The results tabulated are for WeightedSVM.
22. Comparison with results as reported by (Bing Liu et al., 2003)
Our results on F1
F1 Results reported by Bing Liu et al.
23. Experiments and Results: Notion of Support
- 10% labeled, rest unlabeled.
- 30% classification noise in the labeled set.
- Mean and standard deviation of Pr(d) for categories of training documents, based on the original label z:
  - Labeled Correct: d is in the labeled set and argmax_z' Pr(z', d) = z.
  - Labeled Wrong: d is in the labeled set and argmax_z' Pr(z', d) ≠ z.
  - Unlabeled Correct: d is unlabeled and argmax_z' Pr(z', d) = z.
  - Unlabeled Wrong: d is unlabeled and argmax_z' Pr(z', d) ≠ z.
24. Summary
- EM-based algorithm for estimating Pr(d, z):
  - provides measures of support and confidence
  - an effective way to assist (re)labeling of documents.
- An intuitive modification to the E step to re-estimate the empirical distribution, an effective way to:
  - reinforce feature values in the unlabeled data, and
  - reduce the influence of the noisily labeled examples.
- BayesANIL provides measures of confidence Pr(z|d) and support Pr(d).
- Parameters of BayesANIL shown to improve the classification accuracy of NB and SVM:
  - in presence/absence of noise
  - with and without unlabeled documents.
25. Future work
- Handling multi-labeled documents
- Extending to information retrieval.
- Extending the implementation to handle multiple
feature types such as links, titles, etc.
26. Thank you for your attention