
1
Text Classification
  • Slides by
  • Tom Mitchell (NB),
  • William Cohen (kNN),
  • Ray Mooney and others at UT-Austin,
  • me

2
Outline
  • Problem definition and applications
  • Very Quick Intro to Machine Learning and
    Classification
  • Learning bounds
  • Bias-variance tradeoff, No free lunch theorem
  • Maximum Entropy Models
  • Other Classification Techniques
  • Representations
  • Vector Space Model (and variations)
  • Feature Selection
  • Dimensionality Reduction
  • Representations and independence assumptions
  • Sparsity and smoothing

3
Spam or not Spam?
  • Most people who've ever used email have developed
    a hatred of spam
  • In the days before Gmail (and still today), you
    could get hundreds of spam messages per day.
  • Spam filters were developed to automatically
    classify, with high (but not perfect) accuracy,
    which messages are spam and which aren't.

4
Text Classification Problem
  • Let D be the space of all possible documents
  • Let C be the space of possible classes
  • Let H be the space of all possible hypotheses (or
    classifiers)
  • Input: a labeled sample X = {⟨d,c⟩ : d in D and
    c in C}
  • Output: a hypothesis h in H, h : D → C, for
    predicting, with high accuracy, the class of
    previously unseen documents

5
Example Applications
  • News topic classification (e.g., Google News)
  • C = {politics, sports, business, health, tech, …}
  • SafeSearch filtering
  • C = {pornography, not pornography}
  • Language classification
  • C = {English, Spanish, Chinese, …}
  • Sentiment classification
  • C = {positive review, negative review}
  • Email sorting
  • C = {spam, meeting reminders, invitations, …,
    user-defined!}

6
Outline
  • Problem definition and applications
  • Very Quick Intro to Machine Learning and
    Classification
  • Learning bounds
  • Bias-variance tradeoff, No free lunch theorem
  • Maximum Entropy Models
  • Other Classification Techniques
  • Representations
  • Vector Space Model (and variations)
  • Feature Selection
  • Dimensionality Reduction
  • Representations and independence assumptions
  • Sparsity and smoothing

7
Machine Learning
  • A learning machine is an algorithm that
    searches for an accurate classifier.
  • Remember
  • Let D be the space of all possible documents
  • Let C be the space of possible classes
  • Let H be the space of all possible hypotheses (or
    classifiers)
  • Input: a labeled sample X = {⟨d,c⟩ : d in D and
    c in C}
  • Output: a hypothesis h in H, h : D → C, for
    predicting, with high accuracy, the class of
    previously unseen documents

8
Concrete Example
  • Let C = {Spam, Not Spam}, or {S, N}
  • Let H be the set of conjunctive rules, like
  • if document d contains
  • "free credit score" AND
  • "click here"
  • → Spam

9
A Simple Learning Algorithm
  1. Pick a class c (S or N)
  2. Find the term t that correlates best with c
  3. Construct a rule r: "If d contains t → predict c"
  4. Repeatedly find more terms that correlate with c
  5. Add the new terms to r, until the accuracy stops
    improving on the training data (see the sketch
    below).
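
A minimal sketch of this greedy procedure (not part of the original
slides), assuming documents are represented as sets of terms and the
classes are "S" and "N"; all names are illustrative:

```python
def train_rule(docs, labels, target_class="S", other_class="N"):
    """Greedily build a conjunctive rule: 'if d contains all rule_terms -> target_class'."""
    def accuracy(terms):
        correct = 0
        for doc, label in zip(docs, labels):
            pred = target_class if all(t in doc for t in terms) else other_class
            correct += (pred == label)
        return correct / len(docs)

    vocabulary = set().union(*docs)
    rule_terms, best_acc = [], accuracy([])
    while vocabulary - set(rule_terms):
        # find the term whose addition most improves training accuracy
        acc, term = max((accuracy(rule_terms + [t]), t) for t in vocabulary - set(rule_terms))
        if acc <= best_acc:          # stop when accuracy stops improving
            break
        rule_terms.append(term)
        best_acc = acc
    return rule_terms, best_acc

# Tiny example: two spam and two non-spam documents, as sets of terms.
docs = [{"free", "credit", "score"}, {"click", "here", "free"},
        {"meeting", "tomorrow"}, {"lunch", "plans"}]
labels = ["S", "S", "N", "N"]
print(train_rule(docs, labels))      # e.g. (['free'], 1.0)
```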

10
Loss Function: Measuring Accuracy
  • A loss function is a function L : H × D × C →
    [0,1]
  • Given a hypothesis h, document d, and class c,
    L(h,d,c) returns the error or loss of h when
    making a prediction on d.
  • Simple Example
  • L(h,d,c) = 0 if h(d) = c, and 1 otherwise.
  • This is called 0-1 loss (sketched in code below).
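
In code, 0-1 loss and the corresponding training-set error are simply
(a sketch; h is any callable hypothesis):

```python
def zero_one_loss(h, d, c):
    """0-1 loss: 0 if the hypothesis predicts the correct class, 1 otherwise."""
    return 0 if h(d) == c else 1

def empirical_error(h, sample):
    """Average 0-1 loss of hypothesis h over a labeled sample of (d, c) pairs."""
    return sum(zero_one_loss(h, d, c) for d, c in sample) / len(sample)
```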

11
4 Things Everyone Should Know About Machine
Learning
  1. Assumptions
  2. Generalization Bounds and Occam's Razor
  3. Bias-Variance Tradeoff
  4. No Free Lunch

12
1. Assumptions
  • Machine learning traditionally makes two
    important (and often unrealistic) assumptions.
  • There is a probability distribution P (not
    necessarily known, but it's assumed to exist)
    from which all examples d are drawn (training and
    test examples).
  • Each example is drawn independently from this
    distribution.
  • Together, these are known as i.i.d.:
    independent and identically distributed.

13
Why are the assumptions important?
  • Basically, it's hard to make a prediction about a
    document if all of your training examples are
    totally different.
  • With these assumptions, you're saying it's very
    unlikely (with enough training data) that you'll
    see a test example that's totally different from
    all of your training data.

14
2. Generalization Bounds
  • Given the assumptions above, it's possible to
    prove theoretically that an algorithm can learn
    something useful.

Generalization bound (Vapnik-Chervonenkis): with
probability 1-δ over the choice of training data,
the bound below holds. Here, h is the VC-dimension
of the learning machine. If the learning machine is
complex, h is big. If it's simple, h is small.
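
(The bound itself appeared as an equation on the slide; a standard
form of the VC generalization bound, presumably what was shown, is

  test error ≤ training error + sqrt( [ h (ln(2n/h) + 1) + ln(4/δ) ] / n )

where n is the number of training examples: the gap between training
and test error shrinks as n grows and widens as the VC-dimension h
grows.)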
15
2. Bounds and Occam's Razor
  • Occam's Razor: all other things being equal, the
    simplest explanation is the best.
  • Generalization bounds lend some theoretical
    credence to this old rule-of-thumb.

16
3. Bias and Variance
  • Bias: the built-in tendency of a learning
    machine or hypothesis class to find a hypothesis
    in a pre-determined region of the space of all
    possible classifiers.
  • e.g., our rule hypotheses are biased towards
    axis-parallel decision boundaries
  • Variance: the degree to which a learning
    algorithm is sensitive to small changes in the
    training data.
  • If a small change in training data causes a large
    change in the resulting classifier, then the
    learning algorithm has high variance.

17
3. Bias-Variance Tradeoff
  • As a general rule,
  • the more biased a learning machine,
  • the less variance it has,
  • and the more variance it has,
  • the less biased it is.

18
4. No Free Lunch Theorem
  • Simply put, this famous theorem says:
  • If your learning machine has no bias at all,
  • then it's impossible to learn anything.
  • The proof is simple, but out of the scope of this
    lecture. You should check it out.

19
Outline
  • Problem definition and applications
  • Very Quick Intro to Machine Learning and
    Classification
  • Bias-variance tradeoff
  • No free lunch theorem
  • Maximum Entropy Models
  • Other Classification Techniques
  • Representations
  • Vector Space Model (and variations)
  • Feature Selection
  • Dimensionality Reduction
  • Representations and independence assumptions
  • Sparsity and smoothing

20
Machine Learning Techniques for NLP
  • NLP people tend to favor certain kinds of
    learning machines
  • Maximum entropy (or log-linear, or logistic
    regression, or logit) models
  • (gaining in popularity lately)
  • Bayesian networks
  • (directed graphical models, like Naïve Bayes)
  • Support vector machines
  • (but only for certain things, like text
    classification and information extraction)

21
Hypothesis Class
  • A maximum entropy/log-linear model (ME) is any
    function with this form

Log-linear: if you take the log, it's a linear
function.
The denominator Z(d) is the normalization function.
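
(The model form itself appeared as an equation on the slide; the
standard log-linear form being described is

  P(c | d) = exp( Σ_i λ_i f_i(c,d) ) / Z(d),   Z(d) = Σ_{c'} exp( Σ_i λ_i f_i(c',d) )

where Z(d) is the normalization function mentioned above.)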
22
Feature Functions
  • The functions f_i are called feature functions (or
    sometimes just features).
  • These must be defined by the person designing the
    learning machine.
  • Example (sketched in code below):
  • f_i(c,d) = if c = S, the count of how often "free"
    appears in d. Otherwise, 0.
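
A direct transcription of that example feature function, assuming
documents are lists of tokens (a sketch, not from the slides):

```python
def f_free(c, d):
    """Feature function: if c == 'S', the count of 'free' in document d; else 0."""
    return d.count("free") if c == "S" else 0

print(f_free("S", ["free", "credit", "score", "free"]))   # 2
print(f_free("N", ["free", "credit", "score", "free"]))   # 0
```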

23
Parameters
  • The λ_i are called the parameters of the model.
  • During training, the learning algorithm tries to
    find the best values for the λ_i.

24
Example ME Hypothesis
25
Why is it Maximum Entropy?
  • Before we get into how to train one of these,
    let's get an idea of why people use it.
  • The basic intuition is from Occam's Razor: we
    want to find the "simplest" probability
    distribution P(c | d) that explains the training
    data.
  • Note that this also introduces bias: we're
    biasing our search towards "simple"
    distributions.
  • But what makes a distribution "simple"?

26
Entropy
  • Entropy is a measure of how much uncertainty is
    in a probability distribution.

Example: entropy of a deterministic event
H(1,0) = -1 log 1 - 0 log 0 = -(0) - (0) = 0
27
Entropy
  • Entropy is a measure of how much uncertainty is
    in a probability distribution.

Example: entropy of flipping a fair coin
H(1/2, 1/2) = -(1/2) log (1/2) - (1/2) log (1/2)
            = -(1/2)(-1) - (1/2)(-1) = 1
28
Entropy
  • Entropy is a measure of how much uncertainty is
    in a probability distribution.

Example: entropy of rolling a fair six-sided die
H(1/6, …, 1/6) = -(1/6) log (1/6) - … - (1/6) log (1/6)
               = -6 × (1/6)(-2.58) ≈ 2.58
29
Entropy
  • Entropy of a biased coin flip
  • Let P(Heads) represent the probability that the
    biased coin lands on Heads.

Maximum entropy setting for P(Heads):
P(Heads) = P(not Heads) = 1/2.
If event X has N possible outcomes, the maximum
entropy setting for p(x1), p(x2), …, p(xN)
is p(x1) = p(x2) = … = p(xN) = 1/N.
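
These values are easy to check numerically; a small sketch using
base-2 logarithms:

```python
import math

def entropy(probs):
    """Entropy in bits: H(p) = -sum_i p_i log2 p_i (0 log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))     # 0.0   deterministic event
print(entropy([0.5, 0.5]))     # 1.0   fair coin
print(entropy([1/6] * 6))      # ~2.58 fair six-sided die
print(entropy([0.9, 0.1]))     # ~0.47 biased coin: less than 1 bit
```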
30
Occam's Razor for Distributions
  • Given a set of empirical expectations of the form
    E_{⟨c,d⟩ in Train}[f_i(c,d)]
  • Find a distribution P(c | d) such that
  • - it provides the same expectations
  • (matches the training data)
  • E_{⟨c,d⟩ ~ P(c|d)}[f_i(c,d)] = E_{⟨c,d⟩ in Train}[f_i(c,d)]
  • - maximizes the entropy H(P)
  • (Occam's Razor bias)

31
Theorem
  • The maximum entropy distribution for P(c|d),
  • subject to the constraints
  • E_{⟨c,d⟩ ~ P(c|d)}[f_i(c,d)] = E_{⟨c,d⟩ in Train}[f_i(c,d)],
  • must have log-linear form.
  • Thus, max-ent models have to be log-linear models.

32
Training a ME model
  • Training is an optimization problem:
  • find the value of λ that maximizes the
    conditional log-likelihood of the training data
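
(Written out, the objective shown on the slide is presumably

  CLL(λ) = Σ_{⟨d,c⟩ in Train} log P(c | d; λ)

which for log-linear models is concave in λ, so gradient methods
reach the global maximum.)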

33
Training a ME model
  • Optimization is normally performed using some
    form of gradient ascent (equivalently, gradient
    descent on the negative log-likelihood):
  • 0) Initialize λ⁰ to 0
  • 1) Compute the gradient ∇CLL
  • 2) Take a step in the direction of the gradient:
  • λ^(i+1) = λ^i + α ∇CLL
  • 3) Repeat until CLL doesn't improve:
  • stop when CLL(λ^(i+1)) - CLL(λ^i) < ε
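
A minimal sketch of this training loop (not from the original slides),
assuming data is a list of (document, class) pairs and
feature_vector(c, d) returns a NumPy array of the f_i(c,d) values; the
step size alpha plays the role of α above:

```python
import numpy as np

def train_maxent(data, classes, feature_vector, n_features, alpha=0.1, eps=1e-6):
    """Gradient ascent on the conditional log-likelihood (CLL) of a log-linear model."""
    lam = np.zeros(n_features)                      # 0) initialize lambda to 0

    def cll_and_gradient(lam):
        cll, grad = 0.0, np.zeros(n_features)
        for d, c in data:
            feats = {k: feature_vector(k, d) for k in classes}
            scores = np.array([lam @ feats[k] for k in classes])
            scores -= scores.max()                  # subtract max for numerical stability
            probs = np.exp(scores) / np.exp(scores).sum()
            cll += np.log(probs[classes.index(c)])
            # gradient = observed feature values minus expected values under the model
            grad += feats[c] - sum(p * feats[k] for p, k in zip(probs, classes))
        return cll, grad

    old_cll, grad = cll_and_gradient(lam)
    while True:
        lam = lam + alpha * grad                    # 2) step in the gradient direction
        new_cll, grad = cll_and_gradient(lam)
        if new_cll - old_cll < eps:                 # 3) stop when CLL stops improving
            return lam
        old_cll = new_cll
```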

34
Training a ME model
  • Computing the gradient
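
(The gradient formula was an image; for a log-linear model it has the
standard "observed minus expected" form

  ∂CLL/∂λ_i = Σ_{⟨d,c⟩ in Train} [ f_i(c,d) - Σ_{c'} P(c' | d; λ) f_i(c',d) ]

i.e., the empirical value of feature i minus its expected value under
the current model; this is what the sketch above computes inside
cll_and_gradient.)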

35
Outline
  • Problem definition and applications
  • Very Quick Intro to Machine Learning and
    Classification
  • Bias-variance tradeoff
  • No free lunch theorem
  • Maximum Entropy Models
  • Other Classification Techniques
  • Representations
  • Vector Space Model (and variations)
  • Feature Selection
  • Dimensionality Reduction
  • Representations and independence assumptions
  • Sparsity and smoothing

36
Classification Techniques
  • The book mentions three:
  • Naïve Bayes
  • k-Nearest Neighbor
  • Support Vector Machines
  • Others (besides ME)
  • Rule-based systems
  • Decision lists (e.g., Ripper)
  • Decision trees (e.g. C4.5)
  • Perceptron and Neural Networks

37
Bayes Rule
Which is shorthand for
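
(The equations on this slide were images; presumably the rule shown is

  P(c | d) = P(d | c) P(c) / P(d)

with the "shorthand" remark meaning that this abbreviates the statement
about events P(C = c | D = d) = P(D = d | C = c) P(C = c) / P(D = d).)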
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
For code, see www.cs.cmu.edu/~tom/mlbook.html and
click on "Software and Data".
45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
How can we implement this if the a_i are
continuous-valued attributes?
50
Also called the Gaussian (normal) distribution
51
Gaussian
Assume P(a_i | v_j) follows a Gaussian distribution;
use training data to estimate its mean and
variance.
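
A minimal sketch of this idea (Gaussian naive Bayes), assuming examples
are NumPy arrays of continuous attribute values; class and variable
names are illustrative:

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with P(a_i | v_j) modeled as a Gaussian estimated from training data."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.params = {}
        for v in self.classes:
            Xv = X[np.array(y) == v]
            # per-class prior, per-attribute mean and variance
            self.params[v] = (len(Xv) / len(X), Xv.mean(axis=0), Xv.var(axis=0) + 1e-9)
        return self

    def predict(self, x):
        def log_posterior(v):
            prior, mean, var = self.params[v]
            # log of the Gaussian density for each attribute, summed (naive independence)
            log_likelihoods = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
            return np.log(prior) + log_likelihoods.sum()
        return max(self.classes, key=log_posterior)

# Toy usage with two continuous attributes.
X = np.array([[180.0, 80.0], [175.0, 75.0], [160.0, 55.0], [165.0, 60.0]])
y = ["tall", "tall", "short", "short"]
print(GaussianNaiveBayes().fit(X, y).predict(np.array([178.0, 77.0])))   # 'tall'
```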
52
K-nearest neighbor methods
  • William Cohen
  • 10-601 April 2008

53
BellCore's MovieRecommender
  • Participants sent email to videos@bellcore.com
  • System replied with a list of 500 movies to rate
    on a 1-10 scale (250 random, 250 popular)
  • Only a subset needed to be rated
  • New participant P sends in rated movies via email
  • System compares ratings for P to ratings of (a
    random sample of) previous users
  • Most similar users are used to predict scores for
    unrated movies (more later)
  • System returns recommendations in an email
    message.

54
  • Suggested Videos for John A. Jamus.
  • Your must-see list with predicted ratings
  • 7.0 "Alien (1979)"
  • 6.5 "Blade Runner"
  • 6.2 "Close Encounters Of The Third Kind (1977)"
  • Your video categories with average ratings
  • 6.7 "Action/Adventure"
  • 6.5 "Science Fiction/Fantasy"
  • 6.3 "Children/Family"
  • 6.0 "Mystery/Suspense"
  • 5.9 "Comedy"
  • 5.8 "Drama"

55
  • The viewing patterns of 243 viewers were
    consulted. Patterns of 7 viewers were found to be
    most similar. Correlation with target viewer
  • 0.59 viewer-130 (unlisted@merl.com)
  • 0.55 bullert, jane r (bullert@cc.bellcore.com)
  • 0.51 jan_arst (jan_arst@khdld.decnet.philips.nl)
  • 0.46 Ken Cross (moose@denali.EE.CORNELL.EDU)
  • 0.42 rskt (rskt@cc.bellcore.com)
  • 0.41 kkgg (kkgg@Athena.MIT.EDU)
  • 0.41 bnn (bnn@cc.bellcore.com)
  • By category, their joint ratings recommend
  • Action/Adventure
  • "Excalibur" 8.0, 4 viewers
  • "Apocalypse Now" 7.2, 4 viewers
  • "Platoon" 8.3, 3 viewers
  • Science Fiction/Fantasy
  • "Total Recall" 7.2, 5 viewers
  • Children/Family
  • "Wizard Of Oz, The" 8.5, 4 viewers
  • "Mary Poppins" 7.7, 3 viewers
  • Mystery/Suspense
  • "Silence Of The Lambs, The" 9.3, 3 viewers
  • Comedy
  • "National Lampoon's Animal House" 7.5, 4 viewers
  • "Driving Miss Daisy" 7.5, 4 viewers
  • "Hannah and Her Sisters" 8.0, 3 viewers
  • Drama
  • "It's A Wonderful Life" 8.0, 5 viewers
  • "Dead Poets Society" 7.0, 5 viewers
  • "Rain Man" 7.5, 4 viewers
  • Correlation of predicted ratings with your actual
    ratings is 0.64. This number measures ability to
    evaluate movies accurately for you. 0.15 means
    low ability. 0.85 means very good ability. 0.50
    means fair ability.

56
Algorithms for Collaborative Filtering 1:
Memory-Based Algorithms (Breese et al., UAI '98)
  • v_{i,j} = vote of user i on item j
  • I_i = items for which user i has voted
  • Mean vote for i is the average of v_{i,j} over
    the items in I_i
  • Predicted vote for the active user a is a weighted
    sum over the n most similar users, divided by a
    normalizer
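
(The prediction formula itself was an image on the slide; in Breese et
al.'s notation it is, approximately,

  p_{a,j} = v̄_a + κ Σ_i w(a,i) (v_{i,j} - v̄_i)

where w(a,i) is the similarity, e.g. correlation, between the active
user a and user i (the weights of the n similar users), and κ is the
normalizer, e.g. 1 / Σ_i |w(a,i)|.)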
57
Basic k-nearest neighbor classification
  • Training method:
  • Save the training examples
  • At prediction time:
  • Find the k training examples (x1,y1), …, (xk,yk)
    that are closest to the test example x
  • Predict the most frequent class among those y_i's.
  • Example: http://cgm.cs.mcgill.ca/soss/cs644/projects/simard/
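
A minimal sketch of this procedure (not from the slides); the distance
function and the toy data are illustrative:

```python
from collections import Counter

def knn_predict(train, x, k, distance):
    """Predict the most frequent class among the k training examples closest to x."""
    neighbors = sorted(train, key=lambda pair: distance(pair[0], x))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

# Example with Euclidean distance on 2-D points.
euclidean = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
train = [((0, 0), "o"), ((0, 1), "o"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_predict(train, (4, 4), k=3, distance=euclidean))   # '+'
```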

58
What is the decision boundary?
Voronoi diagram
59
Convergence of 1-NN
(Figure: a test example x, its nearest neighbor x1, and the label
distributions P(Y|x) and P(Y|x1), assumed to be approximately equal;
let y* = argmax_y Pr(y|x).)
60
Basic k-nearest neighbor classification
  • Training method:
  • Save the training examples
  • At prediction time:
  • Find the k training examples (x1,y1), …, (xk,yk)
    that are closest to the test example x
  • Predict the most frequent class among those y_i's.
  • Improvements:
  • Weighting examples from the neighborhood
  • Measuring "closeness"
  • Finding "close" examples in a large training set
    quickly

61
K-NN and irrelevant features
(Figure: training points from two classes, "+" and "o", separate
cleanly along one informative feature; a test point "?" lies near the
"+" cluster.)
62
K-NN and irrelevant features
(Figure: the same points after adding an irrelevant feature: the "+"
and "o" points are now interleaved, so the nearest neighbors of "?"
are no longer reliable.)
63
K-NN and irrelevant features
64
Ways of rescaling for KNN
Normalized L1 distance
Scale by IG
Modified value distance metric
65
Ways of rescaling for KNN
Dot product
Cosine distance
TF-IDF weights for text: for doc j, feature i,
x_i = tf_{i,j} × idf_i, where
tf_{i,j} = # occurrences of term i in doc j and
idf_i = log(# docs in corpus / # docs in corpus
that contain term i)
66
Combining distances to neighbors
Standard KNN
Distance-weighted KNN
67
(No Transcript)
68
(No Transcript)
69
(No Transcript)
70
Computing KNN pros and cons
  • Storage: all training examples are saved in
    memory
  • A decision tree or linear classifier is much
    smaller
  • Time: to classify x, you need to loop over all
    training examples (x', y') to compute the distance
    between x and x'.
  • However, you get predictions for every class y
  • KNN is nice when there are many, many classes
  • Actually, there are some tricks to speed this
    up, especially when data is sparse (e.g., text)

71
Efficiently implementing KNN (for text)
IDF is nice computationally
72
Tricks with fast KNN
  • K-means using r-NN
  • 1. Pick k points c1 = x1, …, ck = xk as centers
  • 2. For each xi, find Di = Neighborhood(xi)
  • 3. For each xi, let ci = mean(Di)
  • 4. Go to step 2.

73
Efficiently implementing KNN
Selective classification: given a training set
and a test set, find the N test cases that you can
most confidently classify.
(Figure: a query document and its neighbors d_{j2},
d_{j3}, d_{j4}.)
74
Support Vector Machines
  • Slides by Ray Mooney et al.
  • U. Texas at Austin machine learning group

75
Perceptron Revisited: Linear Separators
  • Binary classification can be viewed as the task
    of separating classes in feature space:

wᵀx + b = 0
wᵀx + b > 0
wᵀx + b < 0
f(x) = sign(wᵀx + b)
76
Linear Separators
  • Which of the linear separators is optimal?

77
Classification Margin
  • The distance from an example xi to the separator is
    r = yi (wᵀxi + b) / ||w||
  • Examples closest to the hyperplane are support
    vectors.
  • The margin ρ of the separator is the distance between
    support vectors.

(Figure: separator with margin ρ and distance r from a
point to the hyperplane.)
78
Maximum Margin Classification
  • Maximizing the margin is good according to
    intuition and PAC theory.
  • Implies that only support vectors matter; other
    training examples are ignorable.

79
Linear SVM Mathematically
  • Let the training set {(xi, yi)}, i = 1..n, with
    xi ∈ Rᵈ, yi ∈ {-1, 1}, be separated by a hyperplane
    with margin ρ. Then for each training example (xi, yi):
  • For every support vector xs the above inequality
    is an equality. After rescaling w and b by ρ/2
    in the equality, we obtain that the distance between
    each xs and the hyperplane is 1 / ||w||.
  • Then the margin can be expressed through
    (rescaled) w and b as ρ = 2 / ||w||.

wᵀxi + b ≤ -ρ/2  if yi = -1;   wᵀxi + b ≥ ρ/2
if yi = 1;
i.e., yi (wᵀxi + b) ≥ ρ/2
80
Linear SVMs Mathematically (cont.)
  • Then we can formulate the quadratic optimization
    problem
  • Which can be reformulated as

Find w and b such that ρ = 2 / ||w|| is
maximized; and for all (xi, yi), i = 1..n:
yi (wᵀxi + b) ≥ 1

Find w and b such that Φ(w) = ||w||² = wᵀw is
minimized; and for all (xi, yi), i = 1..n:
yi (wᵀxi + b) ≥ 1
81
Solving the Optimization Problem
  • Need to optimize a quadratic function subject to
    linear constraints.
  • Quadratic optimization problems are a well-known
    class of mathematical programming problems for
    which several (non-trivial) algorithms exist.
  • The solution involves constructing a dual problem
    where a Lagrange multiplier αi is associated with
    every inequality constraint in the primal
    (original) problem:

Primal: Find w and b such that Φ(w) = wᵀw is minimized;
and for all (xi, yi), i = 1..n: yi (wᵀxi + b) ≥ 1

Dual: Find α1…αn such that Q(α) = Σαi -
½ΣΣ αi αj yi yj xiᵀxj is maximized, and (1) Σαi yi =
0, (2) αi ≥ 0 for all αi
82
The Optimization Problem Solution
  • Given a solution α1…αn to the dual problem, the
    solution to the primal is:
  • Each non-zero αi indicates that the corresponding xi
    is a support vector.
  • Then the classifying function is (note that we
    don't need w explicitly):
  • Notice that it relies on an inner product between
    the test point x and the support vectors xi; we
    will return to this later.
  • Also keep in mind that solving the optimization
    problem involved computing the inner products
    xiᵀxj between all training points.

w = Σ αi yi xi;    b = yk - Σ αi yi xiᵀxk
for any αk > 0
f(x) = Σ αi yi xiᵀx + b
83
Soft Margin Classification
  • What if the training set is not linearly
    separable?
  • Slack variables ξi can be added to allow
    misclassification of difficult or noisy examples;
    the resulting margin is called "soft".

(Figure: two misclassified points with slack values ξi.)
84
Soft Margin Classification Mathematically
  • The old formulation:
  • The modified formulation incorporates slack
    variables:
  • Parameter C can be viewed as a way to control
    overfitting: it trades off the relative
    importance of maximizing the margin and fitting
    the training data.

Old: Find w and b such that Φ(w) = wᵀw is minimized;
and for all (xi, yi), i = 1..n: yi (wᵀxi + b) ≥ 1

Modified: Find w and b such that Φ(w) = wᵀw + CΣξi is
minimized; and for all (xi, yi), i = 1..n:
yi (wᵀxi + b) ≥ 1 - ξi,  ξi ≥ 0
85
Soft Margin Classification: Solution
  • The dual problem is identical to the separable case
    (it would not be identical if the 2-norm penalty for
    slack variables, CΣξi², were used in the primal
    objective; we would need additional Lagrange
    multipliers for the slack variables)
  • Again, xi with non-zero αi will be support
    vectors.
  • Solution to the dual problem is:

Find α1…αN such that Q(α) = Σαi -
½ΣΣ αi αj yi yj xiᵀxj is maximized, and (1) Σαi yi =
0, (2) 0 ≤ αi ≤ C for all αi
Again, we don't need to compute w explicitly for
classification:
w = Σ αi yi xi;    b = yk (1 - ξk) - Σ αi yi xiᵀxk
for any k s.t. αk > 0
f(x) = Σ αi yi xiᵀx + b
86
Theoretical Justification for Maximum Margins
  • Vapnik has proved the following:
  • The class of optimal linear separators has VC
    dimension h bounded from above as
  • where ρ is the margin, D is the diameter of the
    smallest sphere that can enclose all of the
    training examples, and m0 is the dimensionality.
  • Intuitively, this implies that regardless of the
    dimensionality m0 we can minimize the VC
    dimension by maximizing the margin ρ.
  • Thus, the complexity of the classifier is kept small
    regardless of dimensionality.

87
Linear SVMs: Overview
  • The classifier is a separating hyperplane.
  • The most important training points are the support
    vectors; they define the hyperplane.
  • Quadratic optimization algorithms can identify
    which training points xi are support vectors with
    non-zero Lagrange multipliers αi.
  • Both in the dual formulation of the problem and
    in the solution, training points appear only
    inside inner products:

f(x) = Σ αi yi xiᵀx + b
Find α1…αN such that Q(α) = Σαi -
½ΣΣ αi αj yi yj xiᵀxj is maximized, and (1) Σαi yi =
0, (2) 0 ≤ αi ≤ C for all αi
88
Non-linear SVMs
  • Datasets that are linearly separable with some
    noise work out great
  • But what are we going to do if the dataset is
    just too hard?
  • How about mapping data to a higher-dimensional
    space

(Figure: 1-D data points on a line x that are not linearly separable
become separable after mapping each x to (x, x²).)
89
Non-linear SVMs Feature spaces
  • General idea the original feature space can
    always be mapped to some higher-dimensional
    feature space where the training set is separable

Φ: x → φ(x)
90
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors, K(xi,xj) = xiᵀxj
  • If every datapoint is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
  • K(xi,xj) = φ(xi)ᵀφ(xj)
  • A kernel function is a function that is
    equivalent to an inner product in some feature
    space.
  • Example (verified in code below):
  • 2-dimensional vectors x = [x1  x2]; let
    K(xi,xj) = (1 + xiᵀxj)²
  • Need to show that K(xi,xj) = φ(xi)ᵀφ(xj):
  • K(xi,xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2 xi1xj1
    xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
  • = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1
    √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
  • = φ(xi)ᵀφ(xj), where φ(x) = [1  x1²
    √2 x1x2  x2²  √2 x1  √2 x2]
  • Thus, a kernel function implicitly maps data to a
    high-dimensional space (without the need to
    compute each φ(x) explicitly).
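
A quick numerical check of the identity above (a sketch; the explicit
feature map φ is written out only to verify that it matches the
kernel):

```python
import math

def K(x, y):
    """Polynomial kernel of degree 2: (1 + x.y)^2 for 2-dimensional x, y."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map: [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

x, y = (1.0, 2.0), (3.0, -1.0)
dot = sum(a * b for a, b in zip(phi(x), phi(y)))
print(K(x, y), dot)   # both 4.0: the kernel equals the inner product in feature space
```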

91
What Functions are Kernels?
  • For some functions K(xi,xj), checking that
    K(xi,xj) = φ(xi)ᵀφ(xj) can be cumbersome.
  • Mercer's theorem:
  • Every positive semi-definite symmetric function
    is a kernel
  • Positive semi-definite symmetric functions
    correspond to a positive semi-definite symmetric
    Gram matrix:

K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
    | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
    | …                                         |
    | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |
92
Examples of Kernel Functions
  • Linear: K(xi,xj) = xiᵀxj
  • Mapping Φ: x → φ(x), where φ(x) is x itself
  • Polynomial of power p: K(xi,xj) = (1 + xiᵀxj)ᵖ
  • Mapping Φ: x → φ(x), where φ(x) has a number of
    dimensions that grows with d and p
  • Gaussian (radial-basis function):
    K(xi,xj) = exp(-||xi - xj||² / (2σ²))
  • Mapping Φ: x → φ(x), where φ(x) is
    infinite-dimensional: every point is mapped to a
    function (a Gaussian); a combination of functions
    for support vectors is the separator.
  • The higher-dimensional space still has intrinsic
    dimensionality d (the mapping is not onto), but
    linear separators in it correspond to non-linear
    separators in the original space.

93
Non-linear SVMs Mathematically
  • Dual problem formulation:
  • The solution is:
  • Optimization techniques for finding the αi remain
    the same!

Find α1…αn such that Q(α) = Σαi -
½ΣΣ αi αj yi yj K(xi, xj) is maximized, and (1) Σαi yi =
0, (2) αi ≥ 0 for all αi
f(x) = Σ αi yi K(xi, x) + b
94
SVM applications
  • SVMs were originally proposed by Boser, Guyon and
    Vapnik in 1992 and gained increasing popularity
    in the late 1990s.
  • SVMs are currently among the best performers for
    a number of classification tasks ranging from
    text to genomic data.
  • SVMs can be applied to complex data types beyond
    feature vectors (e.g. graphs, sequences,
    relational data) by designing kernel functions
    for such data.
  • SVM techniques have been extended to a number of
    tasks such as regression [Vapnik et al. '97],
    principal component analysis [Schölkopf et al.
    '99], etc.
  • The most popular optimization algorithms for SVMs
    use decomposition to hill-climb over a subset of
    the αi's at a time, e.g. SMO [Platt '99] and
    [Joachims '99].
  • Tuning SVMs remains a black art: selecting a
    specific kernel and parameters is usually done in
    a try-and-see manner (see the sketch below).
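
As an illustration of the try-and-see tuning (not from the original
slides), a small sketch using scikit-learn, assuming it is installed;
the toy documents and the parameter grid are made up for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["free credit score click here", "meeting tomorrow at noon",
        "win a free prize now", "lunch plans this week"]
labels = ["spam", "ham", "spam", "ham"]

# Grid-search over kernel and C: in practice this is how the kernel and
# parameters are usually chosen, by trying settings and measuring accuracy.
pipeline = make_pipeline(TfidfVectorizer(), SVC())
grid = GridSearchCV(pipeline,
                    {"svc__kernel": ["linear", "rbf"], "svc__C": [0.5, 1, 10]},
                    cv=2)
grid.fit(docs, labels)
print(grid.best_params_, grid.best_score_)
```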

95
Performance Comparison (?)
            NB    Rocchio  Dec.Trees  kNN   lin.SVM  lin.SVM  rbf-SVM
                                            (C=0.5)  (C=1.0)
earn        96.0  96.1     96.1       97.8  98.0     98.2     98.1
acq         90.7  92.1     85.3       91.8  95.5     95.6     94.7
money-fx    59.6  67.6     69.4       75.4  78.8     78.5     74.3
grain       69.8  79.5     89.1       82.6  91.9     93.1     93.4
crude       81.2  81.5     75.5       85.8  89.4     89.4     88.7
trade       52.2  77.4     59.2       77.9  79.2     79.2     76.6
interest    57.6  72.5     49.1       76.7  75.6     74.8     69.1
ship        80.9  83.1     80.9       79.8  87.4     86.5     85.8
wheat       63.4  79.4     85.5       72.9  86.6     86.8     82.4
corn        45.2  62.2     87.7       71.4  87.5     87.8     84.6
microavg.   72.3  79.9     79.4       82.6  86.7     87.5     86.4
SVM classifier break-even F from (Joachims,
2002a, p. 114). Results are shown for the 10
largest categories and for microaveraged
performance over all 90 categories on the
Reuters-21578 data set.
96
Choosing a classifier
Technique    Train time     Test time  Accuracy    Interpret-  Bias-Variance        Data
                                                   ability                          Complexity
Naïve Bayes  |W| + |C||V|   |C||Vd|    Medium-low  Medium      High-bias            Low
k-NN         |W|            |V||Vd|    Medium      Low         ?                    High
SVM          |C||D|³|V|ave  |C||Vd|    High        Low         Mixed                Medium-low
Neural Nets  ?              |C||Vd|    High        Low         High-variance        High
Log-linear   ?              |C||Vd|    High        Medium      High-variance/mixed  Medium
Ripper       ?              ?          Medium      High        High-bias            ?

Accuracy = reputation for accuracy in
experimental settings. Note that it is
impossible to say beforehand which classifier
will be most accurate on any given problem. C =
set of classes. W = bag of training tokens. V =
set of training types. D = set of training docs. Vd =
types in test document d. Vave = average
number of types per doc in training.
97
Outline
  • Problem definition and applications
  • Very Quick Intro to Machine Learning and
    Classification
  • Bias-variance tradeoff
  • No free lunch theorem
  • Maximum Entropy Models
  • Other Classification Techniques
  • Representations
  • Vector Space Model (and variations)
  • Feature Selection
  • Dimensionality Reduction
  • Representations and independence assumptions
  • Sparsity and smoothing

98
Vector Space Model
  • Idea: represent each document as a vector.
  • Why bother?
  • How can we make a document into a vector?

99
Documents as Vectors
  • Example
  • Document D1 yes we got no bananas
  • Document D2 what you got
  • Document D3 yes I like what you got

            yes  we  got  no  bananas  what  you  I  like
Vector V1    1    1    1   1     1       0     0  0    0
Vector V2    0    0    1   0     0       1     1  0    0
Vector V3    1    0    1   0     0       1     1  1    1
100
Documents as Vectors
  • Generically, we convert a document into a vector
    by
  • Determine the vocabulary V, or set of all terms
    in the collection of documents
  • For each document d, compute a score sv(d) for
    every term v in V.
  • For instance, sv(d) could be the number of times
    v appears in d.
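
A minimal sketch of this procedure, using raw term frequency as the
score s_v(d) (the documents are the running example from the previous
slide):

```python
def build_vocabulary(documents):
    """V = all terms that appear anywhere in the collection, in a fixed order."""
    return sorted({term for doc in documents for term in doc.split()})

def to_vector(doc, vocabulary):
    """Score s_v(d) = number of times term v appears in d (term frequency)."""
    tokens = doc.split()
    return [tokens.count(v) for v in vocabulary]

docs = ["yes we got no bananas", "what you got", "yes I like what you got"]
V = build_vocabulary(docs)
print(V)
print([to_vector(d, V) for d in docs])
```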

101
Why Bother?
  • The vector space model has a number of
    limitations (discussed later).
  • But it has two major benefits:
  • Convenience (notational and mathematical)
  • It's well-understood
  • That is, there are a lot of side benefits, like
    similarity and distance metrics, that you get for
    free.

102
Handy Tools
  • Euclidean distance and norm
  • Cosine similarity
  • Dot product

103
Measuring Similarity
  • Similarity metric:
  • the size of the angle between document vectors.
  • Cosine Similarity:
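
(The formula was an image on the slide; cosine similarity between
document vectors d1 and d2 is

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

which is 1 when the vectors point in the same direction and 0 when
they are orthogonal.)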

104
Variations of the VSM
  • What should we include in V?
  • Stoplists
  • Phrases and ngrams
  • Feature selection
  • How should we compute sv(d)?
  • Binary (Bernoulli)
  • Term frequency (TF) (multinomial)
  • Inverse Document Frequency (IDF)
  • TF-IDF
  • Length normalization
  • Other

105
What should we include in the Vocabulary?
  • Example
  • Document D1 yes we got no bananas
  • Document D2 what you got
  • Document D3 yes I like what you got
  • All three documents include "got"
  • → it's not very informative for discriminating
    between the documents.
  • In general, we'd like to include all and only the
    "informative" features.

106
Zipf Distribution of Language
  • Languages contain
  • a few high-frequency words,
  • a large number of medium-frequency words,
  • and a ton of low-frequency words.
  • High-frequency words are generally not indicative
    of one class or another, so they are not useful.
  • Low-frequency words are often very indicative of
    one class or another, but we may never (or rarely)
    see them during training.
  • → data sparsity

107
Stop words and stop lists
  • A simple way to get rid of uninteresting features
    is to eliminate the high-frequency ones
  • These are often called "stop words"
  • - e.g., "the", "of", "you", "got", "was", etc.
  • Systems often contain a list (stop list) of
    around 100 stop words, which are pruned from the
    vocabulary

108
Beyond terms
  • It would be great to include multi-word features
    like "New York", rather than just "New" and
    "York"
  • But including all pairs of words, or all
    consecutive pairs of words, as features creates
    WAY too many to deal with, and many are very
    sparse.
  • In order to include such features, we need to
    know more about feature selection (upcoming)

109
Variations of the VSM
  • What should we include in V?
  • Stoplists
  • Phrases and ngrams
  • Feature selection
  • How should we compute sv(d)?
  • Binary (Bernoulli)
  • Term frequency (TF) (multinomial)
  • Inverse Document Frequency (IDF)
  • TF-IDF
  • Length normalization
  • Other

110
Score for a feature in a document
Example document: "yes we got no bananas no
bananas we got we got"

                yes  we  got  no  bananas  what  you  I  like
Binary           1    1    1   1     1       0     0  0    0
Term Frequency   1    3    3   2     2       0     0  0    0
111
Inverse Document Frequency
  • An alternative method of scoring a feature
  • Intuition: words that are common to many
    documents are less informative, so give them less
    weight.
  • IDF(v) =
  • log (# documents / # documents containing v)

112
TF-IDF
  • Term Frequency × Inverse Document Frequency
  • TF-IDF_v(d) = TF_v(d) × IDF(v)
  • TF-IDF_v(d) = (# times v occurs in d) ×
  •  log (# documents / # documents containing v)
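
A minimal sketch of this computation (not from the slides), using
base-10 logarithms so the numbers match the table on the next slide;
the choice of log base is an assumption:

```python
import math

def tf_idf_vectors(documents):
    """Compute TF-IDF_v(d) = TF_v(d) * log10(#documents / #documents containing v)."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(documents)
    df = {v: sum(v in doc for doc in tokenized) for v in vocabulary}    # document frequency
    idf = {v: math.log10(n_docs / df[v]) for v in vocabulary}
    return vocabulary, [[doc.count(v) * idf[v] for v in vocabulary] for doc in tokenized]

docs = ["yes we got no bananas", "what you got", "yes I like what you got"]
vocab, vectors = tf_idf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(x, 2) for x in vec])
```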

113
TF-IDF weighted vectors
  • Example
  • Document D1 yes we got no bananas
  • Document D2 what you got
  • Document D3 yes I like what you got

(TF = 1 for each present term; IDF computed with base-10 logs)
           yes   we   got  no   bananas  what  you   I    like
Vector V1  .18   .48   0   .48    .48      0     0    0     0
Vector V2   0     0    0    0      0      .18   .18   0     0
Vector V3  .18    0    0    0      0      .18   .18  .48   .48
114
Limitations
  • The vector space model has the following
    limitations
  • Long documents are poorly represented because
    they have poor similarity values.
  • Search keywords must precisely match document
    terms word substrings might result in a false
    positive match.
  • Semantic sensitivity documents with similar
    context but different term vocabulary won't be
    associated, resulting in a false negative match.
  • The order in which the terms appear in the
    document is lost in the vector space
    representation.

115
Curse of Dimensionality
  • If the data x lies in a high-dimensional space,
    then an enormous amount of data is required to
    learn distributions, decision rules, or clusters.
  • Example:
  • 50 dimensions.
  • Each dimension has 2 possible values.
  • This gives a total of 2⁵⁰ ≈ 10¹⁵ cells.
  • But the number of data samples will be far less.
    There will not be enough data samples to learn.

116
Dimensionality Reduction
  • Goal: reduce the dimensionality of the space,
    while preserving distances
  • Basic idea: find the dimensions that have the
    most variation in the data, and eliminate the
    others.
  • Many techniques (SVD, MDS)
  • May or may not help

117
Feature Selection and Dimensionality Reduction in
NLP
  • TF-IDF (reduces the weight of some features,
    increases weight of others)
  • Mutual Information (MI)
  • Pointwise Mutual Information (PMI)
  • Latent Semantic Analysis (next week)
  • Information Gain (IG)
  • Chi-square or other independence tests
  • Pure frequency

118
Mutual Information
  • What makes Shanghai a good feature for
    classifying a document as being about China?
  • Intuition: four cases

              China        -China
  Shanghai    How common?  How common?
 -Shanghai    How common?  How common?
119
Mutual Information
  • What makes Shanghai a good feature for
    classifying a document as being about China?
  • Intuition: four cases
  • If all four cases are equally common, MI = 0.

              China   -China
  Shanghai      X       X
 -Shanghai      X       X
120
Mutual Information
  • What makes Shanghai a good feature for
    classifying a document as being about China?
  • Intuition: four cases
  • MI grows when one (or two) case(s) becomes much
    more common than the others.

              China   -China
  Shanghai     10X      0
 -Shanghai      X       X
121
Mutual Information
  • What makes Shanghai a good feature for
    classifying a document as being about China?
  • Intuition: four cases
  • That's also the case where the feature is useful!

              China   -China
  Shanghai     10X      0
 -Shanghai      X       X
122
Mutual Information
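
(The formula on this slide was an image; the mutual information
between the term indicator T (Shanghai present or not) and the class
indicator C (China or not) is the standard

  MI(T, C) = Σ_{t in {+,-}} Σ_{c in {+,-}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]

which sums over the four cases in the tables above and equals 0
exactly when term and class are independent.)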
123
Pointwise Mutual Information
  • What makes Shanghai a good feature for
    classifying a document as being about China?
  • PMI focuses on just the (+Shanghai, +China) case:
  • How much more likely than chance is it for
    Shanghai to appear in a document about China?

              China   -China
  Shanghai     10X      0
 -Shanghai      X       X
124
Pointwise Mutual Information
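
(Again the formula was an image; pointwise mutual information for the
(+Shanghai, +China) case is

  PMI(Shanghai, China) = log [ P(+Shanghai, +China) / (P(+Shanghai) P(+China)) ]

i.e., how much more likely than chance the term and the class are to
co-occur.)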
125
Feature Engineering
  • This is where domain experts and human judgement
    come into play.
  • Not much to say, except that it matters a lot,
    often more than the choice of classifier.