Supervised Learning for Text Classification (transcript)
1
Supervised Learning for Text Classification
2
Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure  2. A score function  3. An optimization strategy
Categorical y ∈ {c1, …, cm}: classification
Real-valued y: regression
Note: usually assume c1, …, cm are mutually exclusive and exhaustive
3
Probabilistic Classification
Let p(ck) = prob. that a randomly chosen object comes from ck
Objects from ck have density p(x | ck, θk) (e.g., MVN)
Then p(ck | x) ∝ p(x | ck, θk) p(ck)
Bayes Error Rate
  • Lower bound on the best possible error rate

4
Bayes error rate: about 6%
5
Classifier Types
Discriminative: model p(ck | x), e.g. linear regression, logistic regression, CART
Generative: model p(x | ck, θk), e.g. Bayesian classifiers, LDA
6
Regression for Binary Classification
  • Can fit a linear regression model to a 0/1 response
  • Predicted values are not necessarily between zero and one
  • With p > 1, the decision boundary is linear
  • e.g. 0.5 = b0 + b1 x1 + b2 x2

zeroOneR.txt
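A minimal sketch of this idea (not the zeroOneR.txt example; synthetic data and scikit-learn assumed): fit ordinary least squares to a 0/1 response and classify by thresholding the fitted value at 0.5, which gives a linear decision boundary.

```python
# Minimal sketch: least squares on a 0/1 response, classify by thresholding at 0.5.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])  # two classes, p = 2
y = np.array([0] * 50 + [1] * 50)                                      # 0/1 response

ols = LinearRegression().fit(X, y)
yhat = ols.predict(X)                     # fitted values, not confined to [0, 1]
pred = (yhat > 0.5).astype(int)           # decision boundary: 0.5 = b0 + b1*x1 + b2*x2

print("fitted value range:", yhat.min(), yhat.max())
print("training error rate:", np.mean(pred != y))
```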
7
(No Transcript)
8
Naïve Bayes via a Toy Spam Filter Example
  • Naïve Bayes is a generative model that makes
    drastic simplifying assumptions
  • Consider a small training data set for spam along
    with a bag of words representation

9
(No Transcript)
10
(No Transcript)
11
Naïve Bayes Machinery
  • We need a way to estimate p(spam | x)
  • Via Bayes theorem we have

    p(spam | x) = p(x | spam) p(spam) / [ p(x | spam) p(spam) + p(x | ham) p(ham) ]

or, on the log-odds scale,

    log [ p(spam | x) / p(ham | x) ] = log [ p(x | spam) / p(x | ham) ] + log [ p(spam) / p(ham) ]
12
Naïve Bayes Machinery
  • Naïve Bayes assumes the words are conditionally independent given the class:

    p(x | spam) = ∏j p(xj | spam)   and   p(x | ham) = ∏j p(xj | ham)

leading to

    log [ p(spam | x) / p(ham | x) ] = log [ p(spam) / p(ham) ] + Σj log [ p(xj | spam) / p(xj | ham) ]
13
Maximum Likelihood Estimation
weights of evidence
14
Naïve Bayes Prediction
  • Usually add a small constant (e.g. 0.5) to the counts to avoid divide-by-zero problems and to reduce bias
  • New message: "the quick rabbit rests"

15
  • New message: "the quick rabbit rests"
  • Predicted log odds (log prior odds plus the weight of evidence for each word):
  • 0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
  • Corresponds to a spam probability of about 0.95
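A minimal sketch of the machinery above, with made-up toy documents rather than the slides' training set: counts are smoothed by adding 0.5, and a new message is scored as the log prior odds plus a weight of evidence for each word.

```python
# Minimal naive Bayes spam sketch: bag of words, 0.5 smoothing, log-odds scoring.
import math
from collections import Counter

spam_docs = [["cheap", "meds", "quick"], ["quick", "cash", "quick"]]   # hypothetical
ham_docs  = [["the", "quick", "rabbit"], ["the", "rabbit", "rests"]]   # hypothetical

def word_probs(docs, vocab, smooth=0.5):
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values()) + smooth * len(vocab)
    return {w: (counts[w] + smooth) / total for w in vocab}

vocab = {w for d in spam_docs + ham_docs for w in d}
p_spam_w = word_probs(spam_docs, vocab)
p_ham_w  = word_probs(ham_docs, vocab)
log_prior_odds = math.log(len(spam_docs) / len(ham_docs))

def log_odds(message):
    # log prior odds plus a "weight of evidence" for each known word
    return log_prior_odds + sum(
        math.log(p_spam_w[w] / p_ham_w[w]) for w in message if w in vocab)

score = log_odds(["the", "quick", "rabbit", "rests"])
print(score, 1 / (1 + math.exp(-score)))   # log odds and the implied spam probability
```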

16
Linear Discriminant Analysis
K classes, X is an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate normal.
LDA assumes a common covariance, Σk = Σ for all k. Then

log [ p(ck | x) / p(cl | x) ] = log (πk / πl) − ½ (μk + μl)ᵀ Σ⁻¹ (μk − μl) + xᵀ Σ⁻¹ (μk − μl)

This is linear in x.
17
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class with the largest linear discriminant function

δk(x) = xᵀ Σ⁻¹ μk − ½ μkᵀ Σ⁻¹ μk + log πk

If we don't assume the Σk's are identical, we get Quadratic DA.
18
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum
likelihood
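A minimal sketch, assuming scikit-learn and synthetic Gaussian data: plug-in maximum likelihood fits for LDA (shared covariance, linear boundary) and QDA (per-class covariances, quadratic boundary).

```python
# Minimal LDA vs. QDA sketch on synthetic two-class Gaussian data.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], 200),
               rng.multivariate_normal([2, 1], [[1, .5], [.5, 1]], 200)])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA coefficients (linear in x):", lda.coef_)
print("LDA training error:", np.mean(lda.predict(X) != y))
print("QDA training error:", np.mean(qda.predict(X) != y))
```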
19
LDA vs. QDA decision boundaries (figure).
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
20
Logistic Regression
Note that LDA is linear in x.
Linear logistic regression looks the same.
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood of (y, X); logistic regression maximizes the conditional likelihood of y | X. Usually similar predictions.
21
Logistic Regression MLE
For the two-class case, the log-likelihood is

ℓ(β) = Σi [ yi βᵀxi − log(1 + exp(βᵀxi)) ]

To maximize, we need to solve the (non-linear) score equations

Σi xi ( yi − p(xi; β) ) = 0
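A minimal sketch, not from the slides: solving the score equations by Newton-Raphson (equivalently, iteratively reweighted least squares) on synthetic data.

```python
# Newton-Raphson / IRLS for the two-class score equations  sum_i x_i (y_i - p_i) = 0.
import numpy as np

def fit_logistic(X, y, iters=25):
    X1 = np.hstack([np.ones((len(y), 1)), X])     # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X1 @ beta))           # fitted probabilities
        W = p * (1 - p)                            # IRLS weights
        grad = X1.T @ (y - p)                      # score vector
        hess = X1.T @ (X1 * W[:, None])            # observed information
        beta += np.linalg.solve(hess, grad)        # Newton step
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
true_beta = np.array([1.0, -2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + X @ true_beta)))).astype(float)
print(fit_logistic(X, y))   # should land near [0.5, 1.0, -2.0]
```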
22
Logistic Regression Modeling
South African Heart Disease Example (y = MI)
(Table of fitted coefficients with Wald statistics.)
23
Simple Two-Class Perceptron
Define h(x) = wᵀx. Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: number of misclassification errors on training data.
For training, replace class 2 xj's by −xj; now need h(x) > 0 for every training point.
Initialize the weight vector. Repeat one or more times:
  For each training data point xi:
    If the point is correctly classified, do nothing
    Else update the weight vector: w ← w + xi
Guaranteed to converge to a separating hyperplane (if one exists).
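A minimal sketch of the perceptron loop described above, assuming synthetic separable data; the update shown (add the misclassified point to the weight vector) is the standard rule.

```python
# Two-class perceptron with the sign-flip trick: after negating class-2 points,
# every point should satisfy w @ x > 0.
import numpy as np

def perceptron(X, y, passes=50):
    Xs = np.where((y == 1)[:, None], X, -X)   # flip class-2 points
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for x in Xs:
            if w @ x <= 0:       # misclassified (or on the boundary)
                w = w + x        # perceptron update
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])        # bias term
y = np.array([1] * 50 + [2] * 50)
w = perceptron(X, y)
print("training errors:", int(np.sum((X @ w > 0) != (y == 1))))
```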
24
Orange: least squares decision boundary (figure).
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
25
Optimal Hyperplane
The optimal hyperplane separates the two
classes and maximizes the distance to the closest
point from either class. Finding this hyperplane
is a convex optimization problem. This notion
plays an important role in support vector machines
26
Blue: optimal separating hyperplane; Red: logistic regression (figure).
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag.
27
Measuring the Performance of a Binary Classifier
28
Suppose we use a cutoff of 0.5
(2 × 2 table: rows = predicted outcome (1, 0), columns = actual outcome (1, 0))
Test Data
29
More generally:

                          actual outcome
                            1       0
predicted outcome     1     a       b
                      0     c       d

misclassification rate = (b + c) / (a + b + c + d)
sensitivity (aka recall) = a / (a + c)
specificity = d / (b + d)
predictive value positive (aka precision) = a / (a + b)
30
Suppose we use a cutoff of 0.5

                          actual outcome
                            1       0
predicted outcome     1     7       3
                      0     0      10

sensitivity = 7 / 7 = 100%
specificity = 10 / 13 = 77%
31
Suppose we use a cutoff of 0.8

                          actual outcome
                            1       0
predicted outcome     1     5       2
                      0     2      11

sensitivity = 5 / 7 = 71%
specificity = 11 / 13 = 85%
32
  • Note: there are 20 possible thresholds
  • ROC: compute sensitivity and specificity for all possible thresholds and plot them
  • Note: if threshold = minimum, c = d = 0, so sensitivity = 1, specificity = 0
  • If threshold = maximum, a = b = 0, so sensitivity = 0, specificity = 1
33
(No Transcript)
34
(No Transcript)
35
  • Area under the curve is a common measure of
    predictive performance
  • So is squared error Σi (yi − ŷi)²
  • also known as the Brier Score
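A minimal sketch, assuming scikit-learn; the labels and predicted probabilities below are hypothetical (7 positives and 13 negatives, echoing the counts above): sweep all thresholds for the ROC curve, then report AUC and the Brier score.

```python
# ROC curve over all thresholds, plus AUC and Brier score.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, brier_score_loss

y_true = np.array([1] * 7 + [0] * 13)         # hypothetical test labels
p_hat = np.linspace(0.95, 0.05, 20)           # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # tpr = sensitivity, fpr = 1 - specificity
print("AUC:", roc_auc_score(y_true, p_hat))
print("Brier score:", brier_score_loss(y_true, p_hat))
```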

36
A Close Look at Logistic Regression for Text
Classification
37
Logistic Regression in One Slide
Example: Predict the gender (y = M/F) of a person given their height (x = a number).
38
Logistic Regression Model
  • Linear model for log odds of class membership:
    log [ p(y = +1 | x) / p(y = −1 | x) ] = β0 + Σj βj xj
  • We will call any model with these semantics a logistic regression model

39
0-
  • It's arbitrary whether we write
  • or
  • Notationally convenient in different cases

40
Equivalent Forms of LR Model (1)
  • Exponential model for the odds:
    p(y = +1 | x) / p(y = −1 | x) = exp(β0 + Σj βj xj)
  • "I'll give you 3 to 1 against this document being about Sports."

41
Equivalent Forms of LR Model (2)
  • Logistic model for the probability of class membership:
    p(y = +1 | x) = exp(β0 + Σj βj xj) / (1 + exp(β0 + Σj βj xj))
  • "I think there's a probability of 0.25 that this document is about Sports."
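A minimal numeric sketch of the three equivalent forms; the coefficients and feature value are hypothetical, chosen so the odds come out near "3 to 1 against" (probability 0.25).

```python
# The same linear score as log odds, as odds, and as a class probability.
import math

beta0, beta1 = -1.5, 0.02        # hypothetical coefficients
x = 20                           # hypothetical feature value

log_odds = beta0 + beta1 * x                 # linear form
odds = math.exp(log_odds)                    # exponential form (about 1/3, i.e. 3 to 1 against)
prob = odds / (1 + odds)                     # logistic form (about 0.25)
print(log_odds, odds, prob)
```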

42
One Beta or Two?
  • We could also write the logistic form with a separate coefficient vector for each class
  • with β−1 = 0
  • Suggests a natural generalization...

43
Polytomous Logistic Regression (PLR)
  • Elegant approach to multiclass problems
  • Also known as polychotomous LR, multinomial LR,
    and, ambiguously, multiple LR and multivariate LR

44
Why LR is Interesting (1)
  • Usual advantages of linear models
  • Computationally efficient
  • Take advantage of model and data sparsity
  • Numeric or discrete inputs
  • Can use kernels
  • Natural loss function easy to optimize (more
    later)

45
Why LR is Interesting (2)
  • Probabilistic predictions!
  • Optimize expected value of effectiveness measure
    w/o changing model
  • Including utilities, rankings, batch measures
  • Highlight uncertain test cases
  • Estimate number of class members
  • Truth is rarely deterministic

46
Why LR is Interesting (3)
  • Parameters have a meaning
  • How log odds increases w/ feature values
  • Lets you
  • Look at model and see if sensible
  • Use domain knowledge to guide parameter fitting
    (more later)
  • Build some parts of model by hand
  • Caveat: realistically, a lot can (and does) complicate this interpretation

47
Conditional Maximum Likelihood Fitting
  • Find parameters (βj's)
  • that give the largest possible likelihood (probability)
  • of the set of class labels (yi's)
  • given the corresponding vectors (xi's)

48
Loglikelihood
  • For optimization, we equivalently maximize the logarithm of the conditional likelihood, i.e. find
    arg maxβ Σi log p(yi | xi, β)

49
(Negated) Loglikelihood as Loss Function
  • Negative of loglikelihood measures loss (degree
    of error) in predictions
  • Sum over training examples
  • Continuous, differentiable, convex
  • No local minima
  • But global minimum may not be unique
  • Amenable to off-the-shelf optimization approaches

50
(Negated) Loglikelihood as Loss Function
Hastie, Tibshirani & Friedman (figure)
51
Problems with Likelihood
  • Linear separation of classes leads to infinite
    parameter values

Likelihood is maximized by assigning probability exactly 1.0 to the points on one side of the boundary...
...and exactly 0.0 to those on the other, which requires infinite parameter values.
52
P(y | x)
53
Problems with Likelihood
  • Usually too many parameters, too little data
  • Result: overfitting (predictions worse on test data than on training)
  • Algorithmic kludge: stop before convergence
  • More principled kludge: feature selection
  • But let's take the high road instead...

54
Bayesian Statistics
  • Parameters viewed as drawn from a prior distribution, p(θ)
  • p(θ) summarizes what we know about θ before seeing data
  • After seeing data, D, we have a posterior distribution, p(θ | D)
  • Summarizes what we now know about θ

55
Bayes Rule
  • The two are connected by Bayes Rule:
    p(θ | D) ∝ p(D | θ) p(θ)
  • Gives a convenient way to favor some parameter values over others
  • True believers say it is the only legitimate way

56
Bayesian MAP Training
  • Find parameters (βj's) that maximize the log posterior probability of the class labels (yi's) given the documents (xi's) and a prior p(β) on the βj's
  • MAP selection is theoretically inferior to making use of the entire posterior distribution
  • But only if the model is exactly correct, etc.
  • In practice MAP can be just as good

57
Priors
  • Any multivariate probability distribution p(θ) can be used
  • Most lead to intractable, multimodal posteriors
  • Let's look at a simple case
  • Independent univariate prior for each parameter
  • Joint prior is just the product of these

58
Earth's Favorite Distribution
  • Suppose our prior beliefs about the βj's are independent Gaussians

59
Diagonal Gaussian Prior
  • If gaussians are independent, joint distribution
    is multivariate gaussian with diagonal covariance
    matrix

60
Penalized Likelihood
  • A diagonal Gaussian prior gives this intuitive function to maximize:
    Σi log p(yi | xi, β) − Σj βj² / (2σj²)
  • Can't overfit if parameter values are small enough
  • Variance, σ², trades off fit and penalty
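A minimal sketch, assuming scikit-learn and synthetic data: the Gaussian (L2) prior corresponds to ridge-penalized logistic regression; the parameter C here stands in for the prior variance (larger C = less penalization), and the resulting fit stays dense.

```python
# Ridge (Gaussian-prior) logistic regression: small but dense coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))
true_beta = np.zeros(50)
true_beta[:3] = [2.0, -2.0, 1.5]                # only 3 relevant features
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_beta))).astype(int)

ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # C plays the role of sigma^2
print("nonzero coefficients:", np.sum(ridge.coef_ != 0))    # essentially all 50
```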

61
Gaussian gives dense model (figure; x-axis: variance)
62
Laplace Distribution
  • Of course, we could pick any univariate
    distribution for our prior
  • How about Laplace

63
Multivariate Laplace
  • Again we can define a multivariate distribution
    as the product of independent Laplace
    distributions

64
Penalized Likelihood
  • Independent Laplace priors give this not-so-intuitive function to maximize:
    Σi log p(yi | xi, β) − λ Σj |βj|
  • Again, favors small parameter values
  • So what's the difference from the Gaussian prior?
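A minimal sketch, assuming scikit-learn and the same kind of synthetic data as above: the Laplace (L1) prior corresponds to lasso-penalized logistic regression and drives most coefficients exactly to zero, in contrast to the dense Gaussian/ridge fit.

```python
# Lasso (Laplace-prior) vs. ridge (Gaussian-prior) logistic regression: sparse vs. dense.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 50))
true_beta = np.zeros(50)
true_beta[:3] = [2.0, -2.0, 1.5]
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_beta))).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print("L1 nonzero coefficients:", np.sum(lasso.coef_ != 0))   # few
print("L2 nonzero coefficients:", np.sum(ridge.coef_ != 0))   # essentially all 50
```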

65
Gaussian gives dense model (figure; x-axis: variance)
66
Laplace gives sparse model (figure; x-axis: w)
67
Text Classification Example
  • ModApte subset of Reuters-21578
  • 90 categories; 9,603 training docs; 18,978 features
  • Reuters RCV1-v2
  • 103 categories; 23,149 training docs; 47,152 features
  • OHSUMED heart disease categories
  • 77 categories; 83,944 training docs; 122,076 features
  • Cosine-normalized TFxIDF weights

68
Dense vs. Sparse Models (Macroaveraged F1)
69
(No Transcript)
70
An Example Model (category grain)
71
Bayesian Use of Domain Knowledge
  • Suppose we know (or have resources that suggest)
  • Certain words are positively or negatively
    associated with category
  • Certain words are mostly unrelated to content
  • Prior mean can encode positive or negative
    association
  • Prior variance can encode how confident we are

72
DK-Based Prior Variance
  • Words we believe to be good content indicators should be allowed to get larger parameter values
  • Higher prior variance = less penalization
  • We used:
  • C is a tuning constant
  • significance is based on TFxIDF in training data, or in prior knowledge texts (more later)
  • σ² is the baseline variance for words with no prior knowledge (chosen by cross-validation or heuristics)

73
DK-based Prior Mode
  • Idea: the prior mode (most likely parameter value) should be > 0 for words we believe to be positively associated with the category
  • σ is the standard deviation, found by cross-validation
  • Analogous to the combining of queries and documents in relevance feedback

74
Experiments
  • Data sets
  • TREC 2004 Genomics data
  • Categories: 32 MeSH categories under the Cells hierarchy
  • Documents: 3,742 training and 4,175 test
  • Prior knowledge: MeSH category descriptions
  • ModApte subset of Reuters-21578
  • Categories: 10 most frequent categories
  • Documents: 9,603 training and 3,299 test
  • Prior knowledge: keywords selected by hand (Wu & Srihari, 2004)
  • Study different training set sizes

75
Sources of Prior Knowledge
  • Text that is strongly associated with a category
  • But which doesn't have the same statistical properties as training examples
  • e.g. category descriptions
  • Human intuition

76
Text as Prior Knowledge MeSH Category
Description
  • MeSH Heading Neurons
  • Scope Note The basic cellular units of nervous
    tissue. Each neuron consists of a body, an axon,
    and dendrites. Their purpose is to receive,
    conduct, and transmit impulses in the nervous
    system.
  • Entry Term Nerve Cells
  • See Also Neural Conduction

IDF on domain texts gives low significance
IDF gives high significance
77
Priors on βs (Laplace, mode 0, domain IDF-based variance)
78
Priors on βs (Laplace, domain IDF-based mode, fixed variance)
79
MeSH Results (training 3742 random examples)
80
MeSH Results (training 500 random examples)
81
MeSH Results (training: 5 positive + 5 random examples/category)
82
Prior Knowledge from Human Intuition (Wu & Srihari)
83
ModApte Results (training 100 random examples)
84
ModApte Results (training: 5 positive + 5 random examples/category)
85
Advertisement
  • Joint work with Aynur Dayanik, Alex Genkin, Michael Hollander, and Vladimir Menkov
  • Bayesian logistic regression software available (binary and polytomous)
  • http://www.stat.rutgers.edu/madigan/BBR/
  • http://www.stat.rutgers.edu/madigan/BMR/
  • Optimizer not the best, but fast enough to use (especially BBR)
  • Long tradition of early slow optimizers ;-)

86
Off the MAP
  • Bayesian MAP training only uses the single most probable θ from the posterior distribution
  • Weighted combination of θs better?
  • Maybe, but not guaranteed (despite optimality)
  • Computationally expensive
  • High-dimensional integrals, Monte Carlo algorithms, etc.

87
Application Authorship Attribution
88
Some Background
  • Identification technologies important for
    homeland security and in the legal system
  • Authorship attribution for textual artifacts
    using topic independent stylometric features
    has a long history
  • Historical focus on small numbers of authors and
    low-dimensional representations via function words


91
  • Used Naïve Bayes with Poisson and Negative
    Binomial model
  • Out-of-sample predictive performance

92
Different Attribution Problems
1 of K: assign each test document to one of the K authors seen in training (diagram)
93
odd man out (aka novelty detection): test documents may come from a new author not seen in training; which ones? (diagram)
94
document pairs: classify a pair of test documents into one of several same-author / different-author configurations (diagram)
95
anti-aliasing

in this example, the red author and the grey
author are the same real person
96
Other Related Problems
  • Author gender
  • Author nationality
  • Sentiment (positive/negative feeling)
  • Rhetorical style
  • Multi-authored documents

97
1-of-K Authorship Attribution
  • Represent documents in a topic-free fashion
  • Function words: "and", "of", "the", etc.
  • "upon"?
  • Sentence lengths, word lengths, deep
    linguistics, stylometric features
  • Parts-of-speech? Word endings? Word prefixes?
  • Combinations of the above
  • High-dimensional document representations

98
Polytomous Logistic Regression
  • Sparse Bayesian (aka lasso) logistic regression trivially generalizes to 1-of-K problems
  • Laplace prior particularly appealing here
  • Suppose 100 classes and a word that predicts class 17
  • Word gets used 100 times if we build 100 binary models, or if we use polytomous with a Gaussian prior
  • With a Laplace prior and polytomous it's used only once (see the sketch below)
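A minimal sketch of the idea, not the BMR software from the slides: scikit-learn's multinomial logistic regression with an L1 penalty on a small stand-in corpus (three 20 Newsgroups categories, downloaded at run time); every dataset and parameter choice here is an assumption.

```python
# Sparse multinomial (polytomous) logistic regression: each class gets its own
# coefficient row, and the L1 penalty zeroes out words irrelevant to that class.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

cats = ["sci.space", "rec.autos", "talk.politics.misc"]   # stand-in categories
data = fetch_20newsgroups(subset="train", categories=cats)
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)

clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=200)
clf.fit(X, data.target)
print("coefficient matrix shape:", clf.coef_.shape)            # (3 classes, 5000 features)
print("nonzero weights per class:", (clf.coef_ != 0).sum(axis=1))
```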

99
1-of-K Sample Results brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
100
(No Transcript)
101
(No Transcript)
102
(No Transcript)
103
Cross-Topic Mini-Experiment
104
Cross-Topic Mini-Experiment
105
odd-man-out Sample Results RCV-1
  • K = 2
  • K-L-M approach: K primary authors, L decoy authors, M test authors (plus 50% of the K authors)

114 RCV-1 journalists with 200 articles. Argamon function words. Average of 10 replications. BBR with threshold tuning for F1.
107
KDD Challenge
  • 150,000 scientific abstracts
  • Task 1 cluster documents written by T. Suzuki
    into real people
  • Task 2 find documents that have a single author
    deleted and/or replaced
  • Features words, co-authors, institutions, MeSH
    headings

108
Gamon (2004)
  • Discriminate between Anne, Charlotte, and Emily
    Brontë

109
Koppel et al (2004)
  • Unstable words can be replaced without
    changing meaning
  • Use machine translation algorithms to generate
    multiple document versions with the same meaning
  • Function words are unstable

110
Software
  • BMR software: sparse and non-sparse Bayesian multinomial regression software for large numbers of classes and features
  • Featex software: tool for creating high-dimensional document representations for authorship attribution

111
Conclusions
  • Lots of interesting open problems
  • How real are the non-literary applications?

112
Some More Case Studies
113
Example 1 Classifying Customer Comments
  • Comments logged by customer service at large
    telecom co.
  • Free form text, formatted account info
  • Goal was classification, to support
  • Informal analysis by account managers
  • Possible automated response
  • (Joint work with Bill Gale)

114
Example (Simulated) Records
  • 11-Oct-1999, 17, 7735555555, CST PRIMARY LANG OF
    CHINESE AND TOLHIM WE WLD CALL BACK BY CHINEES
    SPEAKER
  • 12-Oct-1999, 75, 9085555555, MRS RICHARDS WANTS
    TV OFR FREE MILES. SNT IT.

115
Needs and Techniques
  • Non-technical managers wanted to define classes
    of customers
  • Supervised learning from examples
  • Active learning: reduce the amount of data to label
  • Data accessed via a menu-based interface with oddly limited boolean querying
  • Learned rule-based classifiers obeying syntactic restrictions using Cohen's Grendel system

116
Example Classifier
  • HEAR or CUT or STATIC or NOISE
  • or (LINE and NOISY)
  • or (TALKING and BAD)
  • or (LINES and DIRECT)
  • Accuracy 88% (90% if negation allowed)

117
Example 2 Counting Types of Device Failure
  • Trouble tickets for service calls on PBXes
  • Repair person enters failure type, attributes
  • Plus textual notes (sometimes)
  • As part of process improvement
  • Reorganized taxonomy of failure types
  • Want to know number of failures in each class
  • (Joint work with Mark Jones)

118
Trouble Ticket (Simulated)
  • Customer Giant Foods, Inc. Key West, FL
  • Model RX1837
  • Date 5-Oct-1999
  • Last Service 18-Nov-1997
  • Problem Code OverHeat
  • Resolution Code ReplacePart
  • Notes rpl fan, vac dust rec maint pln

119
Goals and Techniques
  • Leverage similarities between old and new
    taxonomies
  • Classifiers use old class labels as predictors
  • (along with words and attributes)
  • Old class labels used to guide selection of data
    to label
  • First attempt: Naive Bayes classifier tuned to minimize error rate

120
(No Transcript)
121
What's the Problem?
  • Built classifier to minimize # of errors
  • But the goal here is counting class members
  • Better approach (see the sketch below):
  • Predict probability of class membership
  • Add up probabilities to estimate the count
  • Used logistic regression to rescale Naive Bayes outputs to be probabilities
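A minimal sketch with simulated probabilities (all values hypothetical): summing calibrated probabilities estimates the class count well even when almost every probability is below 0.5, while thresholding at 0.5 badly undercounts.

```python
# Counting class members: sum of probabilities vs. hard 0/1 predictions.
import numpy as np

rng = np.random.default_rng(5)
p = rng.beta(2, 8, size=1000)            # hypothetical calibrated probabilities, mostly < 0.5
y = (rng.random(1000) < p).astype(int)   # true labels drawn with those probabilities

print("true count:                ", y.sum())
print("sum-of-probabilities est.: ", p.sum())
print("threshold-at-0.5 count:    ", int((p > 0.5).sum()))   # badly undercounts
```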

122
(No Transcript)
123
(No Transcript)
124
Why Are Results So Different?
  • Probability estimates fairly well calibrated...
  • About 20% of docs with p near 0.2 are class members
  • ...but almost all are less than 0.5
  • If minimizing error rate, the classifier almost always says the document is not a class member
  • Knowledge of the mining goal is critical

125
Example 3 Categorizing Nonprofit Activities
  • Class labels here are meant to be attributes for
    mining and analysis
  • Joint work with Thomas H. Pollak and Sheryl Romeo
    of The Urban Institute
  • In progress

126
Background
  • Tax-exempt groups report finances and programs
    (activities) to IRS
  • Urban Institute and Guidestar digitize reports
    for access and analysis
  • Groups categorize themselves using NTEE taxonomy
  • Groups don't categorize their programs

127
Program Record (Portions)
  • Name WEST CHESTER FIRE TRAINING CTR.
  • Group Purpose FIREFIGHTER EDUCATION TRAINING
  • City W CHESTER, PA
  • NTEE Code M24
  • Program Achievements COMPLETED CONSTRUCTION AND
    DEDICATED A NEW FIRE AND SMOKE TRAINING BUILDING.
  • Exp1 209403, Exp2 218554, Grants 0

128
Program-Level Categorization
  • NCCS wants a category label associated with each
    program
  • Support manual and automated data mining
  • e.g., Is there a correlation between lack of food
    bank and health problems in city?
  • New taxonomy (NPC)
  • Finer-grained, different emphases, than NTEE

129
NPC Taxonomy (Portion)
  • B Education
  • B01 Education, General/Other
  • B02 Education Policy Programs
  • B04 Educational Programs
  • B04.01 Educational Programs, General/Other
  • B04.02 Adult Education Programs
  • B04.02.01 Adult Education Programs, General/Other
  • B04.02.02 Adult Basic Education Programs

130
Manual Categorization w/ NPC
Elite coder blind agreement on random sample of
200 program records
131
Scale
  • 300,000 program records/year
  • Short texts
  • Batches arrive monthly
  • Classification allowed to take several days
  • Even w/ 797 classes, speed not big issue
  • Potential savings 3,000 person-hours/yr

132
Data
  • Labeled examples (from previous social science
    studies)
  • 12,531 labeled by summer interns
  • 11,879 labeled by NCCS personnel
  • 390,000 unlabeled examples
  • Textual and nontextual attributes
  • NPC and NTEE taxonomies
  • Existing manually engineered classifier

133
Data Difficulties
  • Multiple coders
  • Intern-labeled data less consistent: invalid categories, variable format
  • Engineered classifier and intern data use an old version of NPC
  • Missing values
  • Labeled data: a geographic subset, not random
  • Variable # of programs per organization

134
Explored Many Techniques
  • Which text fields, and whether to merge
  • Phrase formation
  • Nontextual attributes
  • Choice/tuning of learning algorithm
  • How/how much to use intern-labeled data
  • How/how much to use prior knowledge
  • Balancing of data by source

135
What Mattered
  • Program and organization-level text fields
  • Avoiding overfitting to organizations
  • Using NTEE class as predictor
  • Discriminative learning (vs. naïve Bayes)
  • Efficient software
  • Granularity of desired classification

136
What Didn't Matter (Much)
  • Multiword phrasal attributes
  • Financial attributes
  • Ordering of data
  • Which of several discriminative algorithms
  • Used SNoW w/ Winnow and perceptron
  • Also BoosTexter, but too slow
  • Intern-labeled data

137
Accuracy: Engineered vs. Learned
Train: 8,910; Validation: 593; Test: 2,376
Elite-coded, balanced by coder
138
Adjusting Results for Plausibility
  • However, some errors worse than others
  • Had NCCS judge (blindly) category assignments for
    plausibility
  • Human, engineered, learned, hybrid, random
  • Sample of 705 programs, stratified on human level
    1 category
  • Can't specify all plausible in advance

139
Plausibility-Adjusted (Tent.)
Mistakes made by manually engineered classifier
more likely to be plausible
140
Best of Both Worlds?
  • The engineered classifier had a similar form to the SNoW classifiers
  • Linear model gives a score for each category
  • Choose the highest-scoring category
  • SNoW algorithms are incremental
  • Used the engineered classifier as the starting point
  • After a couple of days' translation work

141
Plausibility-Adjusted (Tent.)
Helps, though still below goal
142
Next Steps Toward Goal
  • More labeled data becoming available
  • Explicitly capture notion of plausibility
  • Does not correspond to closeness in taxonomy
  • Better combination of engineered + learned
  • Kicking out difficult cases to human