Transcript: David Newman, UC Irvine, Lecture 4: Classification 1

1
CS 277 Data Mining, Lecture 4: Classification Algorithms
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 1 due today
  • Project proposal due next Tuesday (Oct 16)
  • Homework 2 (text classification) available soon

3
Today
  • Review project suggestions and data sets
  • Review instructions for project proposal
  • Lecture: Classification

4
Notation
  • Variables X, Y, ... with values x, y (lower case)
  • Vectors indicated by boldface X
  • Components of X indicated by Xj, with values xj
  • Matrix data set with D rows and W columns
  • jth column contains the values for variable Xj (word j)
  • ith row contains a vector of measurements on object i, indicated by x(i) (document i)
  • The jth measurement value for the ith object is xj(i)
  • Unknown parameter for a model: θ
  • Can also use other Greek letters, like α, β, δ, γ
  • Vector of parameters: θ

5
Classification
  • Predictive modeling: predict Y given X
  • Y is real-valued → regression
  • Y is categorical → classification
  • Classification has many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.

6
Classification v. Regression
  • Similar in many ways
  • Both learn a mapping from X to Y
  • Both are sensitive to the dimensionality of X
  • Generalization to new data is important in both
  • Test error versus model complexity
  • Many models can be used for either classification or regression
  • e.g., trees, neural networks
  • Most important differences
  • Categorical Y versus real-valued Y
  • Different score functions
  • e.g., classification error versus squared error

7
Decision Region Terminology
8
A simple classification algorithm
  • Linear separability
  • The perceptron algorithm
  • Matlab example (a Python sketch of the perceptron update follows below)
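
The Matlab demo itself is not reproduced in this transcript. As a stand-in, here is a minimal perceptron sketch in Python (the names are illustrative, not the lecture's code); it assumes labels y in {-1, +1} and stops early once the data are linearly separated:

    import numpy as np

    def perceptron(X, y, epochs=100):
        """Learn a linear boundary w.x + b = 0 by simple error correction."""
        n, p = X.shape
        w, b = np.zeros(p), 0.0
        for _ in range(epochs):
            errors = 0
            for i in range(n):
                if y[i] * (X[i] @ w + b) <= 0:          # example i is misclassified
                    w, b = w + y[i] * X[i], b + y[i]    # nudge the boundary toward it
                    errors += 1
            if errors == 0:                             # all examples separated: converged
                break
        return w, b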

9
Probabilistic view of Classification
  • Notation: let there be K classes c1, ..., cK
  • Class marginals: p(ck) = probability of class k
  • Class-conditional probabilities: p( x | ck ) = probability of x given ck, k = 1, ..., K
  • Posterior class probabilities (by Bayes rule): p( ck | x ) = p( x | ck ) p(ck) / p(x), k = 1, ..., K
  • where p(x) = Σj p( x | cj ) p(cj)
  • In theory this is all we need... in practice this may not be the best approach (a numeric sketch follows below)
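
To make the Bayes-rule computation concrete, here is a minimal numeric sketch in Python; the priors and likelihoods are invented numbers for illustration only:

    import numpy as np

    # Invented numbers: two classes, likelihoods p(x | ck) evaluated at a single x
    prior = np.array([0.7, 0.3])          # p(c1), p(c2)
    likelihood = np.array([0.05, 0.20])   # p(x | c1), p(x | c2)

    joint = likelihood * prior            # p(x | ck) p(ck)
    posterior = joint / joint.sum()       # divide by p(x) = Σj p(x | cj) p(cj)
    print(posterior)                      # [0.368 0.632] -> x is more likely class c2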

10
Example of Probabilistic Classification
[Figure: class-conditional densities p( x | c1 ) and p( x | c2 ) plotted against x]
11
Example of Probabilistic Classification
[Figure: the same densities p( x | c1 ) and p( x | c2 ), with the posterior p( c1 | x ) plotted on a 0-to-1 scale, crossing 0.5 where the two classes are equally likely]
12
Example of Probabilistic Classification
[Figure: densities p( x | c1 ) and p( x | c2 ) and the posterior p( c1 | x ) on a 0-to-1 scale, continued]
13
Decision Regions and Bayes Error Rate
[Figure: densities p( x | c1 ) and p( x | c2 ), with the x-axis partitioned into alternating decision regions labeled class c1 and class c2]
Optimal decision regions = regions where one class is more likely.
Optimal decision regions → optimal decision boundaries.
14
Decision Regions and Bayes Error Rate
[Figure: the same densities and decision regions, with the Bayes error shown as the shaded area under the smaller density in each region]
Optimal decision regions = regions where one class is more likely.
Optimal decision regions → optimal decision boundaries.
Bayes error rate = fraction of examples misclassified by the optimal classifier (the shaded area in the figure).
15
Procedure for optimal Bayes classifier
  • For each class, learn a model p( x | ck )
  • e.g., each class is multivariate Gaussian
  • Use Bayes rule to obtain p( ck | x )
  • → this yields the optimal decision regions/boundaries
  • → use these decision regions/boundaries for classification
  • Correct in theory... but practical problems include:
  • How do we model p( x | ck )?
  • Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
  • Alternative approach: model the decision boundaries directly

16
3 categories of classifiers in general
  • Generative (or class-conditional) classifiers
  • Learn models for p( x | ck ), use Bayes rule to find decision boundaries
  • Examples: naïve Bayes models, Gaussian classifiers
  • Regression (or posterior class probability) classifiers
  • Learn a model for p( ck | x ) directly
  • Examples: logistic regression, neural networks
  • Discriminative classifiers
  • No probabilities
  • Learn the decision boundaries directly
  • Examples:
  • Linear boundaries: perceptrons, linear SVMs
  • Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
  • Non-linear boundaries: non-linear SVMs
  • Note: one can usually post-fit class probability estimates p( ck | x ) to a discriminative classifier

17
Which type of classifier is appropriate?
  • Let's look at the score functions (a small sketch follows below):
  • c(i) = true class, c(x(i); θ) = class predicted by the classifier
  • Class-mismatch loss functions:
  • S(θ) = (1/n) Σi cost[ c(i), c(x(i); θ) ]
  • where cost(i, j) = cost of misclassifying true class i as predicted class j
  • e.g., cost(i, j) = 0 if i = j, 1 otherwise (misclassification error or 0-1 loss)
  • and more generally, cost(i, j) is a matrix of K x K losses (e.g., surgery, spam email)
  • Class-probability loss functions:
  • S(θ) = (1/n) Σi log p( c(i) | x(i); θ ) (log probability score)
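
Here is a minimal sketch of both score functions in Python (the cost matrix, labels, and predicted probabilities are invented for illustration; rows of the cost matrix index the true class, columns the predicted class):

    import numpy as np

    def mismatch_loss(y_true, y_pred, cost):
        """Average cost[true class, predicted class]; with 0-1 costs this is the error rate."""
        return np.mean([cost[t, p] for t, p in zip(y_true, y_pred)])

    def log_prob_score(y_true, probs):
        """Average log p(true class | x); higher is better."""
        return np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

    cost01 = 1 - np.eye(2)                # 0-1 loss matrix for K = 2 classes
    y_true = np.array([0, 1, 1, 0])
    y_pred = np.array([0, 1, 0, 0])       # one mistake, on the third example
    probs  = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
    print(mismatch_loss(y_true, y_pred, cost01))   # 0.25
    print(log_prob_score(y_true, probs))           # about -0.40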

18
Example classifying spam email
  • 0-1 loss function
  • Appropriate if we just want to maximize accuracy
  • Asymmetric cost matrix
  • Appropriate if missing non-spam emails is more costly than failing to detect spam emails
  • Probability loss
  • Appropriate if we want to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list
  • In general, don't solve a harder problem than you need to, or don't model aspects of the problem you don't need to (e.g., modeling p(x | c)) - Vapnik, 1996

19
Examples of classifiers
  • Generative/class-conditional/probabilistic, based on p( x | ck )
  • Naïve Bayes (simple, but often effective in high dimensions)
  • Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems; leads to quadratic boundaries in general)
  • Regression-based, modeling p( ck | x ) directly
  • Logistic regression: simple, linear in odds space
  • Neural network: non-linear extension of logistic regression, can be difficult to work with
  • Discriminative models, focused on locating optimal decision boundaries
  • Linear discriminants, perceptrons: simple, sometimes effective
  • Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
  • Nearest neighbor: simple, can scale poorly in high dimensions
  • Decision trees: Swiss army knife, often effective in high dimensions

20
Naïve Bayes Classifiers
  • Generative probabilistic model with a conditional independence assumption on p( x | ck ):
  • p( x | ck ) = Πj p( xj | ck )
  • Comments
  • Simple to train (just estimate conditional probabilities for each feature-class pair; see the sketch below)
  • Often works surprisingly well in practice
  • Feature selection can be helpful, e.g., information gain
  • Note that even if the CI assumptions are not met, it may still be able to approximate the optimal decision boundaries (this seems to happen in practice)
  • However... on most problems it can usually be beaten with a more complex model (plus more work)
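
As an illustration (not the lecture's code), here is a minimal Bernoulli naïve Bayes sketch in Python for binary word features, assuming class labels 0..K-1; Laplace smoothing is added (not mentioned on the slide) so no estimated probability is exactly 0 or 1:

    import numpy as np

    class BernoulliNaiveBayes:
        def fit(self, X, y, alpha=1.0):
            """X: (n, W) binary word indicators; y: (n,) class labels 0..K-1."""
            self.classes = np.unique(y)
            self.log_prior = np.log([np.mean(y == c) for c in self.classes])
            # Estimate p(xj = 1 | ck) per feature-class pair, with Laplace smoothing
            self.theta = np.array([(X[y == c].sum(axis=0) + alpha) /
                                   ((y == c).sum() + 2 * alpha)
                                   for c in self.classes])
            return self

        def predict(self, X):
            # log p(x | ck) = Σj [ xj log θjk + (1 - xj) log(1 - θjk) ]
            ll = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
            return self.classes[np.argmax(ll + self.log_prior, axis=1)]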

21
Link between Logistic Regression and Naïve Bayes
[Equations (not transcribed) comparing the naïve Bayes and logistic regression forms of p( ck | x )]
22
Imbalanced Class Distributions
  • Common in data mining to have one class be much less likely than the others
  • e.g., 0.1% of examples are fraudulent or have a disease
  • If we train a standard classifier on a random sample of data, it is very difficult to beat the majority classifier in terms of accuracy
  • Approaches
  • Stratified sampling: artificially create training data with 50% of each class present, and then correct for this in prediction
  • E.g., learn p(x | c) on stratified data and use the true p(c) when predicting with a probabilistic model (see the sketch below)
  • Use a different score function
  • We are often interested in scoring/screening/ranking cases when using the model
  • Thus, scores such as how many of the class of interest are ranked in the top 1% of predictions may be more relevant than overall accuracy (e.g., in document retrieval)
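
A minimal sketch of that prior-correction step in Python (the numbers are invented for illustration): posteriors learned on stratified 50/50 data are reweighted by the ratio of true to stratified priors and renormalized.

    import numpy as np

    def correct_priors(post_strat, prior_strat, prior_true):
        """Reweight posteriors: p(c | x) is proportional to p_strat(c | x) * p_true(c) / p_strat(c)."""
        w = post_strat * (prior_true / prior_strat)
        return w / w.sum(axis=-1, keepdims=True)

    # Model trained on 50/50 data says p(fraud | x) = 0.8 for some case,
    # but the true fraud rate is 0.1% rather than 50%
    post = correct_priors(np.array([0.2, 0.8]),       # [p(ok|x), p(fraud|x)] on stratified data
                          np.array([0.5, 0.5]),       # stratified priors
                          np.array([0.999, 0.001]))   # true priors
    print(post)   # the fraud posterior drops to about 0.4%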

23
Calibration
  • In addition to ranking, we may be interested in how accurate our estimates of p(c | x) are
  • i.e., if the model says p(c | x) = 0.9, how accurate is this number?
  • Calibration
  • A model is well-calibrated if its probabilistic predictions match real-world empirical frequencies
  • If a classifier predicts p(c | x) = 0.9 for 100 examples, then on average we would expect about 90 of these examples to belong to class c, and 10 not
  • We can estimate calibration curves by binning a classifier's probabilistic predictions and measuring how many examples in each bin actually belong to the predicted class (see the sketch below)
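
A minimal sketch of that binning procedure in Python (the bin layout is illustrative; p_pred holds the predicted p(c | x) values and y_true holds 1 where the example actually belongs to class c):

    import numpy as np

    def calibration_curve(p_pred, y_true, n_bins=10):
        """Bin predicted probabilities; compare each bin's mean prediction
        with the empirical fraction of examples actually in the class."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
        mean_pred = np.array([p_pred[idx == b].mean() for b in range(n_bins) if np.any(idx == b)])
        frac_true = np.array([y_true[idx == b].mean() for b in range(n_bins) if np.any(idx == b)])
        return mean_pred, frac_true

    # A well-calibrated model gives points close to the diagonal mean_pred == frac_true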

24
Calibration in Probabilistic Prediction
25
Linear Discriminants
  • Discriminant → a method for computing class decision boundaries
  • Linear discriminant → linear decision boundaries
  • Linear Discriminant Analysis (LDA)
  • Earliest known classifier (1936, R.A. Fisher)
  • Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible, with the variances taken into account appropriately (see the sketch below)
  • Reduces to a special case of the parametric Gaussian classifier in certain situations
  • Many subsequent variations on this basic theme (e.g., regularized LDA)
  • Other linear discriminants
  • Decision boundary = (p-1)-dimensional hyperplane in p dimensions
  • Perceptron learning algorithms (pre-dated neural networks)
  • Simple error-correction based learning algorithms
  • SVMs use a sophisticated margin idea for selecting the hyperplane
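
A minimal sketch of Fisher's two-class projection in Python (illustrative; it assumes the pooled within-class scatter matrix Sw is invertible):

    import numpy as np

    def fisher_direction(X0, X1):
        """Projection vector w maximizing between-class separation
        relative to within-class variance: w = Sw^{-1} (m1 - m0)."""
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + \
             np.cov(X1, rowvar=False) * (len(X1) - 1)   # pooled within-class scatter
        return np.linalg.solve(Sw, m1 - m0)

    # Classify by thresholding the 1-D projection X @ w,
    # e.g., at the midpoint of the two projected class means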

26
Nearest Neighbor Classifiers
  • kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors (see the sketch below)
  • k is a parameter
  • Small k: noisier estimates; large k: smoother estimates
  • Best value of k often chosen by cross-validation
  • Comments
  • Virtually assumption-free
  • Gives piecewise linear boundaries (i.e., non-linear overall)
  • Interesting theoretical properties: Bayes error < error(kNN) < 2 x Bayes error (asymptotically)
  • Disadvantages
  • Can scale poorly with dimensionality; sensitive to the distance metric
  • Requires fast lookup at run-time to do classification with large n
  • Does not provide any interpretable model
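
A minimal brute-force kNN sketch in Python (illustrative; Euclidean distance, with ties broken arbitrarily by Counter):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=5):
        """Majority class among the k training points nearest to x."""
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
        nearest = np.argsort(dists)[:k]               # indices of the k nearest
        return Counter(y_train[nearest]).most_common(1)[0][0]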

27
Decision Tree Classifiers
  • Widely used in practice
  • Can handle both real-valued and nominal inputs (unusual)
  • Good with high-dimensional data
  • Similar algorithms to those used in constructing regression trees
  • Historically, developed in both statistics and computer science
  • Statistics
  • Breiman, Friedman, Olshen and Stone: CART, 1984
  • Computer science
  • Quinlan: ID3, C4.5 (1980s-1990s)

28
Decision Tree Example
[Figure: training data plotted with Income on the x-axis and Debt on the y-axis]
29
Decision Tree Example
[Figure: first split Income > t1 drawn as a vertical line at t1; the region Income > t1 is still unresolved (??)]
30
Decision Tree Example
[Figure: second split Debt > t2 added as a horizontal line at t2 within the region Income > t1; one region remains unresolved (??)]
31
Decision Tree Example
[Figure: third split Income > t3 added as a vertical line at t3, completing the partition of the (Income, Debt) space]
32
Decision Tree Example
[Figure: the final partition of the (Income, Debt) space produced by the splits Income > t1, Debt > t2, and Income > t3]
Note: tree boundaries are piecewise linear and axis-parallel
33
Decision tree example
34
Decision tree example (cont.)
35
Decision tree example (cont.)
Highest information gain. Creates a pure node.
36
Decision tree example (cont.)
Lowest information gain. All child nodes have near-equal yes/no proportions. (A sketch of the information gain computation follows below.)
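
A minimal sketch of the information gain computation behind these figures, in Python (illustrative; binary 0/1 labels in a NumPy array, binary split):

    import numpy as np

    def entropy(y):
        """Shannon entropy (in bits) of a vector of 0/1 class labels."""
        if len(y) == 0:
            return 0.0
        p = np.mean(y)                     # fraction of positive labels
        if p == 0.0 or p == 1.0:
            return 0.0                     # a pure node has zero entropy
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def information_gain(y, split_mask):
        """Parent entropy minus the size-weighted entropy of the two children."""
        w = np.mean(split_mask)            # fraction of examples going to the left child
        return entropy(y) - w * entropy(y[split_mask]) - (1 - w) * entropy(y[~split_mask])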
37
Decision Tree Pseudocode
node = tree_design(Data X, C)
  for i = 1 to d
    quality_variable(i) = quality_score(Xi, C)
  end
  node = (X_split, threshold) achieving max quality_variable
  (Data_right, Data_left) = split(Data, X_split, threshold)
  if node is a leaf
    return node
  else
    node_right = tree_design(Data_right)
    node_left = tree_design(Data_left)
  end
end
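
A runnable Python rendering of this pseudocode (a sketch, not the lecture's code): it reuses the information_gain helper from the sketch above as the quality score, grows nodes until they are pure or too small, and returns nested dicts rather than node objects.

    import numpy as np

    def tree_design(X, y, min_size=2):
        """Grow a binary classification tree by greedy recursive splitting."""
        if len(np.unique(y)) == 1 or len(y) < min_size:
            return {"leaf": True, "class": int(np.bincount(y).argmax())}
        best = (0.0, None, None)
        for j in range(X.shape[1]):                        # score every variable
            for t in np.unique(X[:, j])[:-1]:              # candidate thresholds
                gain = information_gain(y, X[:, j] <= t)   # helper defined above
                if gain > best[0]:
                    best = (gain, j, t)
        _, j, t = best
        if j is None:                                      # no informative split: make a leaf
            return {"leaf": True, "class": int(np.bincount(y).argmax())}
        mask = X[:, j] <= t
        return {"leaf": False, "var": j, "threshold": t,
                "left": tree_design(X[mask], y[mask], min_size),
                "right": tree_design(X[~mask], y[~mask], min_size)}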
38
Decision Trees are not stable
Moving just one example slightly may lead to a quite different tree and space partition! Trees lack stability against small perturbations of the data.
Figure from Duda, Hart & Stork, Chap. 8
39
How to Choose the Right-Sized Tree?
[Figure: predictive error versus size of decision tree; error on training data decreases with tree size while error on test data is U-shaped, and the ideal range for tree size lies around the test-error minimum]
40
Choosing a Good Tree for Prediction
  • General idea:
  • grow a large tree
  • prune it back to create a family of subtrees ("weakest link" pruning)
  • score the subtrees and pick the best one
  • Massive data sizes (e.g., n = 100k data points)
  • use the training data set to fit a set of trees
  • use a validation data set to score the subtrees
  • Smaller data sizes (e.g., n = 1k or less)
  • use cross-validation
  • use explicit penalty terms (e.g., Bayesian methods)

41
Example Spam Email Classification
  • Data set (from the UCI Machine Learning Archive):
  • 4601 email messages from 1999
  • Manually labeled as spam (60%) or non-spam (40%)
  • 54 features: percentage of words matching a specific word/character
  • business, address, internet, free, george, !, etc.
  • Average/longest/sum lengths of uninterrupted sequences of CAPS
  • Error rates (Hastie, Tibshirani, Friedman, 2001):
  • Training: 3056 emails; testing: 1536 emails
  • Decision tree: 8.7%
  • Logistic regression: 7.6%
  • Naïve Bayes: 10% (typically)

44
Treating Missing Data in Trees
  • Missing values are common in practice
  • Approaches to handling missing values:
  • During training
  • Ignore rows with missing values (inefficient)
  • During testing
  • Send the example being classified down both branches and average the predictions
  • Replace missing values with an imputed value (can be suboptimal)
  • Other approaches
  • Treat "missing" as a unique value (useful if missing values are correlated with the class)
  • Surrogate splits method
  • Search for and store surrogate variables/splits during training

45
Other Issues with Classification Trees
  • Why use binary splits?
  • Multiway splits can be used, but cause fragmentation
  • Linear combination splits?
  • can produce small improvements
  • optimization is much more difficult (need weights and a split point)
  • trees become much less interpretable
  • Model instability
  • A small change in the data can lead to a completely different tree
  • Model averaging techniques (like bagging) can be useful
  • Tree bias
  • Poor at approximating non-axis-parallel boundaries
  • Producing rule sets from tree models (e.g., C5.0)

46
Why Trees are widely used in Practice
  • Can handle high-dimensional data
  • builds a model using 1 dimension at a time
  • Can handle any type of input variables
  • categorical, real-valued, etc.
  • most other methods require data of a single type (e.g., only real-valued)
  • Trees are (somewhat) interpretable
  • a domain expert can read off the tree's logic
  • Tree algorithms are relatively easy to code and test

47
Limitations of Trees
  • Representational bias
  • classification: piecewise linear boundaries, parallel to the axes
  • regression: piecewise constant surfaces
  • High variance
  • trees can be unstable as a function of the sample
  • e.g., a small change in the data → a completely different tree
  • this causes two problems:
  • 1. High variance contributes to prediction error
  • 2. High variance reduces interpretability
  • Trees are good candidates for model combining
  • Often used with boosting and bagging
  • Trees do not scale well to massive data sets (e.g., N in the millions)
  • repeated random access of subsets of the data

48
Evaluating Classification Results
  • Summary statistics
  • empirical estimate of the score function on test data, e.g., error rate
  • More detailed breakdown
  • Confusion matrix
  • Can be quite useful in detecting systematic errors
  • Detection v. false-alarm plots (2 classes); see the sketch below
  • Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
  • For each possible threshold, calculate:
  • Detection rate = fraction of class 1 detected
  • False alarm rate = fraction of class 2 detected
  • Plot y (detection rate) versus x (false alarm rate)
  • Also known as ROC, precision-recall, specificity/sensitivity
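
A minimal sketch of the detection/false-alarm computation in Python (illustrative; labels are 1 for class 1 and 0 for class 2, and higher scores mean more likely class 1):

    import numpy as np

    def detection_false_alarm(scores, labels):
        """Sweep every observed score as a threshold; return (false alarm, detection) pairs."""
        det, fa = [], []
        for t in np.sort(np.unique(scores)):
            pred = scores >= t                        # predict class 1 above the threshold
            det.append(np.mean(pred[labels == 1]))    # fraction of class 1 detected
            fa.append(np.mean(pred[labels == 0]))     # fraction of class 2 detected
        return np.array(fa), np.array(det)

    # Plotting detection rate (y) against false alarm rate (x) traces out the ROC curve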