1
Classifying with limited training data: Active and
semi-supervised learning
  • Sunita Sarawagi
  • sunita@it.iitb.ac.in
  • http://www.it.iitb.ac.in/sunita

2
Motivation
  • Several learning methods are critically dependent on
    the quality of labeled training data
  • Labeled data is often expensive to collect, while
    unlabeled data is abundant
  • Two techniques to reduce labeling effort
  • Active learning
  • Iteratively select small sets of unlabeled data
    to be labeled by a human
  • Semi-supervised learning
  • Use unlabeled data itself to help train the classifier

3
Outline
  • Active learning
  • Definition
  • Application
  • Algorithms
  • Case studies
  • Duplicate elimination
  • Information Extraction
  • Semi-supervised learning
  • Definition
  • Some methods

4
Application areas
  • Text classification
  • Duplicate elimination
  • Information Extraction
  • HTML wrappers
  • Free text
  • Speech recognition
  • Reducing the need for transcribed data
  • Semantic parsing of natural language
  • Reducing the need for complex annotated data

5
Example: active learning
  • Assume points from two classes (red and green) on a
    real line, perfectly separable by a single-point
    separator
[Figure: labeled and unlabeled points on the line; the
uncertainty region lies between the closest labeled points
of opposite classes]
  • Goal: query the point giving the greatest expected
    reduction in the size of the uncertainty region; here
    that is the midpoint, so labeling reduces to binary
    search (a sketch follows)
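A minimal Python sketch of this 1-D case (not from the
slides; the oracle function stands in for the human
labeler):

# Binary-search active learning for a 1-D threshold classifier.
# Each query halves the uncertainty region, so n labels locate
# the separator to precision (hi - lo) / 2**n.
def active_learn_threshold(oracle, lo, hi, n_queries):
    """oracle(x) returns the true class: 0 left of the boundary, 1 right."""
    for _ in range(n_queries):
        mid = (lo + hi) / 2.0          # the most uncertain point
        if oracle(mid) == 0:
            lo = mid                   # boundary lies to the right
        else:
            hi = mid                   # boundary lies to the left
    return (lo + hi) / 2.0             # estimated separator

# Example: true boundary at 0.37; 10 queries pin it down to ~0.001.
print(active_learn_threshold(lambda x: int(x >= 0.37), 0.0, 1.0, 10))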
6
Active learning
  • Explicit measure (a sketch follows this list)
  • For each unlabeled instance
  • For each class label
  • Add the instance with that label to the training data
  • Train the classifier
  • Measure classifier confusion
  • Compute the expected confusion over the labels
  • Choose the instance that yields the lowest expected
    confusion
  • Implicit measure
  • Train the classifier
  • For each unlabeled instance
  • Measure prediction uncertainty
  • Choose the instance with the highest uncertainty
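A hedged sketch of the explicit measure, assuming
scikit-learn-style classifiers built by a caller-supplied
fit function; "confusion" is measured here as mean
prediction entropy over the pool, one plausible choice
among several:

import numpy as np

def expected_confusion_selection(fit, X_lab, y_lab, X_pool, labels):
    """fit(X, y) -> trained classifier with predict_proba.
    Assumes predict_proba columns follow the order of `labels`."""
    P = fit(X_lab, y_lab).predict_proba(X_pool)   # P(y|x) to weight outcomes
    scores = []
    for i, x in enumerate(X_pool):
        expected = 0.0
        for j, y in enumerate(labels):
            # Tentatively add (x, y), retrain, and measure confusion
            clf = fit(np.vstack([X_lab, [x]]), np.append(y_lab, y))
            probs = clf.predict_proba(X_pool)
            confusion = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
            expected += P[i, j] * confusion
        scores.append(expected)
    return int(np.argmin(scores))                 # lowest expected confusion

Note the cost: one retraining per (instance, label) pair,
which is why the cheaper implicit measure is common.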

7
Measuring prediction certainty
  • Classifier-specific methods (sketched below)
  • Support vector machines: distance from the separator
  • Naïve Bayes classifier: posterior probability of the
    winning class
  • Decision tree classifier: weighted sum of the distance
    from the different boundaries, the error of the leaf,
    the depth of the leaf, etc.
  • Committee-based approach
  • (Seung, Opper, and Sompolinsky 1992)
  • Disagreement amongst the members of a committee
  • The most successfully used method
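A sketch of the first two classifier-specific measures,
assuming scikit-learn (LinearSVC and GaussianNB are
stand-ins; the slides do not prescribe implementations),
shown for the binary case:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

def svm_uncertainty(X_train, y_train, X_pool):
    svm = LinearSVC().fit(X_train, y_train)
    margin = np.abs(svm.decision_function(X_pool))  # distance from separator
    return -margin                  # small margin -> high uncertainty

def nb_uncertainty(X_train, y_train, X_pool):
    nb = GaussianNB().fit(X_train, y_train)
    p_win = nb.predict_proba(X_pool).max(axis=1)    # winning-class posterior
    return 1.0 - p_win              # low posterior -> high uncertainty

# Query the pool instance the current classifier is least sure about:
# next_idx = np.argmax(svm_uncertainty(X_train, y_train, X_pool))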

8
Forming a classifier committee
  • Randomly perturb learnt parameters
  • Probabilistic classifiers: sample from the posterior
    distribution on the parameters given the training data
  • Example: a binomial parameter p has a Beta posterior
    whose mean is the estimate of p (see the sketch below)
  • Discriminative classifiers: random boundary in the
    uncertainty region
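For instance, with k successes in n trials and a uniform
prior, the posterior is Beta(k + 1, n - k + 1), and each
committee member draws its own copy of p from it. A minimal
NumPy sketch:

import numpy as np

rng = np.random.default_rng(0)
k, n = 30, 100                              # 30 successes in 100 trials
committee_p = rng.beta(k + 1, n - k + 1, size=5)
print(committee_p)                          # five perturbed copies of p ~ 0.3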

9
Committee-based algorithm
  • Train k classifiers C1, C2, ..., Ck on the training data
  • For each unlabeled instance x
  • Find the predictions y1, ..., yk of the k classifiers
  • Compute the uncertainty U(x) as the entropy of these
    predictions
  • Pick the instance with the highest uncertainty (a
    sketch follows)
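A sketch of this loop in Python, assuming
scikit-learn-style classifiers with a predict method:

import numpy as np
from collections import Counter

def vote_entropy(votes):
    """Entropy of the label distribution y1..yk for one instance."""
    p = np.array(list(Counter(votes).values())) / len(votes)
    return -np.sum(p * np.log(p))

def select_query(committee, X_pool):
    preds = [clf.predict(X_pool) for clf in committee]     # k x |pool|
    U = [vote_entropy([p[i] for p in preds]) for i in range(len(X_pool))]
    return int(np.argmax(U))       # instance the committee disagrees on most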

10
Case study: Duplicate elimination
  • Given a list of semi-structured records, find all
    records that refer to the same entity
  • Example applications
  • Data warehousing: merging name/address lists
  • Entity: person or household
  • Automatic citation databases (Citeseer): references
  • Entity: paper
  • Challenges
  • Errors and inconsistencies in large datasets
  • Domain-specific

11
Motivating example: Citations
  • Our prior: duplicate when author, title, booktitle,
    and year match...
  • Author match could be hard
  • L. Breiman, L. Friedman, and P. Stone, (1984).
  • Leo Breiman, Jerome H. Friedman, Richard A. Olshen,
    and Charles J. Stone.
  • Conference match could be harder
  • In VLDB-94
  • In Proc. of the 20th Int'l Conference on Very Large
    Databases, Santiago, Chile, September 1994.

12
  • Fields may not be segmented
  • Word overlap could be misleading
  • Non-duplicates with lots of word overlap
  • H. Balakrishnan, S. Seshan, and R. H. Katz.,
    Improving Reliable Transport and Handoff Performance
    in Cellular Wireless Networks, ACM Wireless Networks,
    1(4), December 1995.
  • H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz,
    "Improving TCP/IP Performance over Wireless Networks,"
    Proc. 1st ACM Conf. on Mobile Computing and
    Networking, November 1995.
  • Duplicates with little overlap even in the title
  • Johnson Laird, Philip N. (1983). Mental models.
    Cambridge, Mass.: Harvard University Press.
  • P. N. Johnson-Laird. Mental Models: Towards a
    Cognitive Science of Language, Inference, and
    Consciousness. Cambridge University Press, 1983.

13
Experiences with the learning approach
  • Too much manual search in preparing training data
  • Hard to spot challenging and covering sets of
    duplicates in large lists
  • Even harder to find close non-duplicates that
    will capture the nuances

Active learning is a generalization of this!
14
Learning to identify duplicates
[Figure: example labeled pairs of records; each pair is
mapped through similarity functions f1, f2, ..., fn to a
feature vector labeled D (duplicate) or N (non-duplicate),
which is fed to the classifier]
15
Forming a committee of trees
  • Selecting the split attribute
  • Normally: the attribute with the lowest entropy
  • Perturbed: a random attribute with entropy within a
    close range of the lowest
  • Selecting a split point
  • Normally: the midpoint of the range with the lowest
    entropy
  • Perturbed: a random point anywhere in the range with
    the lowest entropy (a sketch of the attribute case
    follows)
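A hedged sketch of the attribute perturbation; the exact
tolerance rule in the paper may differ, and tol here is an
assumed parameter:

import random

def perturbed_split_attribute(attr_entropies, tol=0.05):
    """attr_entropies: {attribute_name: entropy}. Each committee tree
    picks a random attribute whose entropy is within tol of the best."""
    best = min(attr_entropies.values())
    close = [a for a, h in attr_entropies.items() if h <= best + tol]
    return random.choice(close)

# Example: 'title_sim' and 'year_sim' are nearly tied, so trees differ.
print(perturbed_split_attribute({'title_sim': 0.31, 'year_sim': 0.33,
                                 'author_sim': 0.58}))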

16
Experimental analysis
  • 250 references from Citeseer → 32,000 pairs, of which
    only 150 are duplicates
  • Citeseer's script used to segment into author, title,
    year, page, and rest
  • 20 text and integer similarity functions
  • Average of 20 runs
  • Default classifier: decision tree
  • Initial labeled set: just two pairs

17
Methods of creating a committee
  • Data partition: bad when data is limited
  • Attribute partition: bad when data is sufficient
  • Parameter perturbation: best overall

18
Importance of randomization
[Plots: accuracy of committee selection for naïve Bayes and
decision tree classifiers]
  • Important to randomize the selection for generative
    classifiers like naïve Bayes

19
Choosing the right classifier
  • SVMs are good initially but not effective at choosing
    instances
  • Decision trees are best overall

20
Benefits of active learning
  • Active learning is much better than random selection
  • With only 100 actively selected instances: 97%
    accuracy, versus only 30% with random selection
  • Committee-based selection is close to optimal

21
Analyzing selected instances
  • Fraction of duplicates among the selected instances:
    44%, starting from only 0.5% in the full pool
  • Is the gain due to the increased fraction of
    duplicates?
  • Replaced the non-duplicates in the selected set with
    random non-duplicates
  • Result → only 40% accuracy!

22
Case study: Information Extraction (IE)
  • The IE task: given
  • E: a set of structured elements (the target schema)
  • S: an unstructured source
  • extract all instances of E from S
  • Varying levels of difficulty depending on the input
    and the kind of extracted patterns
  • Text segmentation: extraction by segmenting text
  • HTML wrapper: extraction from formatted text
  • Classical IE: extraction from free-format text

23
IE by text segmentation
  • Source: a concatenation of structured elements with
    limited reordering and some missing fields
  • Examples: addresses, bibliographic records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S.
Dordick (1993) Protein and Solvent Engineering of
Subtilisin BPN' in Nearly Anhydrous Organic Media J. Amer.
Chem. Soc. 115, 12231-12237.
24
IE with Hidden Markov Models
  • Probabilistic models for IE

[Figure: an HMM for bibliographic records, with states such
as Title, Author, Journal, and Year, each carrying its own
emission probabilities, e.g.:

  Author state           Journal state
    Letter   0.3           journal  0.4
    Et. al   0.1           ACM      0.2
    Word     0.5           IEEE     0.3
]
25
A model for Indian Addresses
[Figure: HMM state diagram for segmenting Indian addresses]
26
Active learning in IE with HMMs
  • Form a committee of HMMs by random perturbation
  • Emission and transition probabilities are independent
    multinomial distributions
  • Posterior distribution for multinomial parameters:
    Dirichlet, with its mean estimated using maximum
    likelihood (a sketch follows)
  • Results on part-of-speech tagging (Dagan 1999)
  • 92.6% accuracy using active learning with 20,000
    instances, as against 100,000 randomly selected
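A minimal NumPy sketch: each emission (or transition) row
is a multinomial, so a committee member draws its row from
a Dirichlet posterior centred on the maximum-likelihood
counts (the add-one prior here is an assumption):

import numpy as np

rng = np.random.default_rng(1)
counts = np.array([40, 10, 50])        # e.g. symbol counts in one HMM state
mle = counts / counts.sum()            # maximum-likelihood row (0.4, 0.1, 0.5)
committee_rows = rng.dirichlet(counts + 1, size=5)   # five perturbed rows
print(mle, committee_rows, sep="\n")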

27
Semi-supervised learning
  • Unlabeled data can improve classifier accuracy by
    providing correlation information between features
  • Three methods
  • Probabilistic classifiers like naïve Bayes and HMMs:
    the Expectation Maximization (EM) method
  • Distance-based classifiers like k-nearest neighbor:
    the graph min-cut method
  • Paired independent classifiers: co-training

28
The EM approach
  • Dl: labeled data, Du: unlabeled data
  • Train the classifier parameters using Dl
  • While the likelihood of Dl ∪ Du improves
  • E step: for each d in Du, find its fractional
    membership in each class using the current classifier
    parameters
  • M step: use the fractional memberships of Du and the
    labels of Dl to re-estimate the maximum-likelihood
    parameters of the classifier
  • Output the classifier (a sketch follows)
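A minimal sketch of this loop, assuming scikit-learn's
MultinomialNB (which accepts sample_weight) and classes
labeled 0..n_classes-1, all present in yl; for simplicity
it runs a fixed number of iterations instead of monitoring
the likelihood of Dl ∪ Du:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_semi_supervised(Xl, yl, Xu, n_classes, iters=10):
    clf = MultinomialNB().fit(Xl, yl)          # train on Dl alone
    for _ in range(iters):
        # E step: fractional class membership of each d in Du
        resp = clf.predict_proba(Xu)           # shape |Du| x n_classes
        # M step: refit on Dl plus Du replicated once per class,
        # weighting each copy by its fractional membership
        X = np.vstack([Xl] + [Xu] * n_classes)
        y = np.concatenate([yl] + [np.full(len(Xu), c)
                                   for c in range(n_classes)])
        w = np.concatenate([np.ones(len(Xl))] + [resp[:, c]
                                                 for c in range(n_classes)])
        clf = MultinomialNB().fit(X, y, sample_weight=w)
    return clf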

29
Results with EM
  • Practical considerations
  • When the unlabeled data is very large and the class
    labels don't correspond to natural data clusters, the
    contribution of the unlabeled data to the parameters
    needs to be down-weighted
  • Experiments on text classification with naïve Bayes
  • 20 Newsgroups: the 70% accuracy obtained with 10,000
    labeled documents is matched with only 600 labeled
    plus 20,000 unlabeled
  • Experiments on IE with HMMs
  • No improvement in accuracy

30
The Graph min-cut method
  • Construct a weighted graph over Dl ∪ Du, where
    Dl = Dl+ ∪ Dl-
  • wij: similarity between instances i and j
  • Connect Dl+ to a source and Dl- to a sink; the minimum
    s-t cut assigns each unlabeled instance the label of
    the side it falls on (a sketch follows)
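A hedged sketch of the construction using networkx (an
assumed library choice, not one the slides name): pairwise
capacities are the similarities, and labeled nodes attach
to the source or sink with infinite capacity:

import networkx as nx

def mincut_labels(sim, pos, neg):
    """sim: {(i, j): w_ij} similarities; pos/neg: labeled node sets.
    Returns {node: True if on the positive side of the min cut}."""
    G = nx.Graph()
    for (i, j), w in sim.items():
        G.add_edge(i, j, capacity=w)
    for p in pos:
        G.add_edge('s', p, capacity=float('inf'))   # hard positive labels
    for n in neg:
        G.add_edge(n, 't', capacity=float('inf'))   # hard negative labels
    _, (s_side, _) = nx.minimum_cut(G, 's', 't')
    return {v: (v in s_side) for v in G if v not in ('s', 't')}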
31
Conclusion
  • Active learning
  • Successfully used in several applications to reduce
    the need for training data
  • Semi-supervised learning
  • Limited improvement observed in text classification
    with naïve Bayes
  • Most proposed methods are classifier-specific
  • Still open to further research

32
References
  • Shlomo Argamon-Engelson and Ido Dagan. Committee-based
    sample selection for probabilistic classifiers. J. of
    Artificial Intelligence Research, 11:335-360, 1999.
  • Yoav Freund, H. Sebastian Seung, Eli Shamir, and
    Naftali Tishby. Selective sampling using the query by
    committee algorithm. Machine Learning, 28(2-3):133-168,
    1997.
  • S. Sarawagi and Anuradha Bhamidipaty. Interactive
    deduplication using active learning. ACM SIGKDD, 2002.
  • H. S. Seung, M. Opper, and H. Sompolinsky. Query by
    committee. In Computational Learning Theory, pages
    287-294, 1992.
  • T. Zhang and F. J. Oles. A probability analysis on the
    value of unlabeled data for classification problems.
    ICML, 2000.
  • Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita
    Sarawagi. Automatic text segmentation for extracting
    structured records. SIGMOD, 2001.
  • D. Freitag and A. McCallum. Information extraction with
    HMM structures learned by stochastic optimization.
    AAAI, 2000.