ICS 278: Data Mining Lectures 7 and 8: Classification Algorithms

Slides: 60

Provided by: Informatio367

Transcript and Presenter's Notes

1
ICS 278: Data Mining, Lectures 7 and 8
Classification Algorithms

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

2
Notation

Variables $X, Y, \ldots$ with values $x, y$ (lower case)
Vectors indicated by bold $\mathbf{X}$
Components of $\mathbf{X}$ indicated by $X_j$ with values $x_j$
Matrix data set $D$ with $n$ rows and $p$ columns
$j$th column contains values for variable $X_j$
$i$th row contains a vector of measurements on object $i$, indicated by $\mathbf{x}(i)$
The $j$th measurement value for the $i$th object is $x_j(i)$
Unknown parameter for a model: $\theta$
Can also use other Greek letters, like $\alpha, \beta, \delta, \gamma, \epsilon, \omega$
Vector of parameters: $\boldsymbol{\theta}$

3
Classification

Predictive modeling: predict Y given X
Y is real-valued => regression
Y is categorical => classification
Classification
Many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.

4
Classification v. Regression

Similar in many ways:
Both learn a mapping from X to Y
Both are sensitive to the dimensionality of X
Generalization to new data is important in both
Test error versus model complexity
Many models can be used for either classification or regression, e.g., trees, neural networks
Most important differences:
Categorical Y versus real-valued Y
Different score functions
E.g., classification error versus squared error

5
Decision Region Terminology
6
Probabilistic view of Classification

Notation: let there be $K$ classes $c_1, \ldots, c_K$
Class marginals: $p(c_k)$ = probability of class $k$
Class-conditional probabilities: $p(\mathbf{x} \mid c_k)$ = probability of $\mathbf{x}$ given $c_k$, $k = 1, \ldots, K$
Posterior class probabilities (by Bayes rule):
$p(c_k \mid \mathbf{x}) = p(\mathbf{x} \mid c_k)\, p(c_k) / p(\mathbf{x})$, $k = 1, \ldots, K$
where $p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid c_j)\, p(c_j)$
In theory this is all we need; in practice this may not be the best approach.
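
The Bayes rule computation above can be sketched in a few lines. This is only an illustration: the function name `posterior` and the numbers for the two classes are made up, not taken from the lecture.

```python
def posterior(likelihoods, priors):
    """Return p(c_k | x) for every class, given p(x | c_k) and p(c_k)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_j p(x | c_j) p(c_j)
    return [j / evidence for j in joint]

# Hypothetical values at a single point x, for K = 2 classes.
post = posterior(likelihoods=[0.30, 0.10], priors=[0.25, 0.75])
print(post)
```

Here the rarer class has the larger likelihood, and the two effects cancel: both joint terms are 0.075, so the posterior is split evenly.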

7
Example of Probabilistic Classification
[Figure: class-conditional densities $p(\mathbf{x} \mid c_1)$ and $p(\mathbf{x} \mid c_2)$]
8
Example of Probabilistic Classification
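[Figure: class-conditional densities $p(\mathbf{x} \mid c_1)$ and $p(\mathbf{x} \mid c_2)$, with the posterior $p(c_1 \mid \mathbf{x})$ plotted on a 0 to 1 axis]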
9
Example of Probabilistic Classification
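[Figure: class-conditional densities $p(\mathbf{x} \mid c_1)$ and $p(\mathbf{x} \mid c_2)$, with the posterior $p(c_1 \mid \mathbf{x})$ plotted on a 0 to 1 axis]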
10
Decision Regions and Bayes Error Rate
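[Figure: class-conditional densities $p(\mathbf{x} \mid c_1)$ and $p(\mathbf{x} \mid c_2)$; the x-axis is divided into alternating decision regions: class c2, class c1, class c2, class c1, class c2]
Optimal decision regions = regions where one class is more likely
Optimal decision regions => optimal decision boundaries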
11
Decision Regions and Bayes Error Rate
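[Figure: class-conditional densities $p(\mathbf{x} \mid c_1)$ and $p(\mathbf{x} \mid c_2)$; the x-axis is divided into alternating decision regions: class c2, class c1, class c2, class c1, class c2]
Optimal decision regions = regions where one class is more likely
Optimal decision regions => optimal decision boundaries
Bayes error rate = fraction of examples misclassified by the optimal classifier
= shaded area in the figure (see equation 10.3 in text)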
12
Procedure for optimal Bayes classifier

For each class, learn a model $p(\mathbf{x} \mid c_k)$
E.g., each class is multivariate Gaussian with its own mean and covariance
Use Bayes rule to obtain $p(c_k \mid \mathbf{x})$
=> this yields the optimal decision regions/boundaries
=> use these decision regions/boundaries for classification
Correct in theory, but practical problems include:
How do we model $p(\mathbf{x} \mid c_k)$?
Even if we know the model for $p(\mathbf{x} \mid c_k)$, modeling a distribution or density will be very difficult in high dimensions (e.g., $p = 100$)
Alternative approach: model the decision boundaries directly
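
A minimal sketch of this procedure, simplified to one dimension so that each class model is a univariate Gaussian rather than the multivariate case named on the slide. The function names and the toy data are assumptions made for illustration.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(data_by_class):
    """Map {label: [x values]} to {label: (prior, mean, var)}."""
    n = sum(len(v) for v in data_by_class.values())
    return {c: (len(v) / n, *fit_gaussian(v)) for c, v in data_by_class.items()}

def classify(x, model):
    """Pick the class maximizing p(x | c_k) p(c_k); p(x) is common to all classes."""
    return max(model, key=lambda c: model[c][0] * gaussian_pdf(x, *model[c][1:]))

# Hypothetical training data for two classes.
toy = {"c1": [1.0, 1.5, 2.0, 2.5], "c2": [4.0, 4.5, 5.0, 5.5]}
model = fit(toy)
print(classify(1.8, model), classify(4.9, model))
```

With equal priors and equal variances, as in this toy data, the decision boundary sits midway between the two class means.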

13
Three types of classifiers

Generative (or class-conditional) classifiers
Learn models for $p(\mathbf{x} \mid c_k)$, use Bayes rule to find decision boundaries
Examples: naïve Bayes models, Gaussian classifiers
Regression (or posterior class probabilities)
Learn a model for $p(c_k \mid \mathbf{x})$ directly
Examples: logistic regression (see lecture 5/6), neural networks
Discriminative classifiers
No probabilities
Learn the decision boundaries directly
Examples:
Linear boundaries: perceptrons, linear SVMs
Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
Non-linear boundaries: non-linear SVMs
Note: one can usually post-fit class probability estimates $p(c_k \mid \mathbf{x})$ to a discriminative classifier

14
Which type of classifier is appropriate?

Let's look at the score functions
$c(i)$ = true class, $c(\mathbf{x}(i); \theta)$ = class predicted by the classifier
Class-mismatch loss functions:
$S(\theta) = \frac{1}{n} \sum_i \mathrm{cost}\big[c(i),\, c(\mathbf{x}(i); \theta)\big]$
where $\mathrm{cost}(i, j)$ = cost of misclassifying true class $i$ as predicted class $j$
e.g., $\mathrm{cost}(i, j) = 0$ if $i = j$, 1 otherwise (misclassification error or 0-1 loss)
and more generally $\mathrm{cost}(i, j)$ is a matrix of $K \times K$ losses (e.g., surgery, spam email, etc.)
Class-probability loss functions:
$S(\theta) = \frac{1}{n} \sum_i \log p(c(i) \mid \mathbf{x}(i); \theta)$ (log probability score)
or $S(\theta) = \frac{1}{n} \sum_i \big[c(i) - p(c(i) \mid \mathbf{x}(i); \theta)\big]^2$ (Brier score)
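
The score functions above can be compared on a small example. This is a sketch under assumptions: classes are coded 0 and 1, the Brier score is computed as the squared difference between the 0/1 class label and the predicted probability of class 1, and the cost matrices, probabilities and function names are all made up.

```python
import math

def mean_cost(true, pred, cost):
    """Class-mismatch score: average of cost[true class][predicted class]."""
    return sum(cost[t][p] for t, p in zip(true, pred)) / len(true)

def log_score(true, probs):
    """Average log probability assigned to the true class."""
    return sum(math.log(p[t]) for t, p in zip(true, probs)) / len(true)

def brier_score(true, probs):
    """Average squared gap between the 0/1 label and p(class 1 | x)."""
    return sum((t - p[1]) ** 2 for t, p in zip(true, probs)) / len(true)

true = [1, 0, 1, 0]                                  # 1 = spam, 0 = non-spam
probs = [[0.2, 0.8], [0.7, 0.3], [0.6, 0.4], [0.4, 0.6]]
pred = [max(range(len(p)), key=p.__getitem__) for p in probs]

zero_one = [[0, 1], [1, 0]]                          # 0-1 loss
asym = [[0, 10], [1, 0]]                             # flagging non-spam as spam costs 10

print(mean_cost(true, pred, zero_one), mean_cost(true, pred, asym))
print(log_score(true, probs), brier_score(true, probs))
```

The same two mistakes give an error rate of 0.5 under 0-1 loss but an average cost of 2.75 under the asymmetric matrix, because one of them is the expensive kind.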

15
Example classifying spam email

0-1 loss function
Appropriate if we just want to maximize accuracy
Asymmetric cost matrix
Appropriate if missing non-spam emails is more
costly than failing to detect spam emails
Probability loss
Appropriate if we wanted to rank all emails by $p(\text{spam} \mid \text{email features})$, e.g., to allow the user to look at emails via a ranked list.
In general: don't solve a harder problem than you need to, or don't model aspects of the problem you don't need to (e.g., modeling $p(\mathbf{x} \mid c)$) - Vapnik, 1996.

16
Classes of classifiers

Class-conditional/probabilistic, based on $p(\mathbf{x} \mid c_k)$
Naïve Bayes (simple, but often effective in high dimensions)
Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems; leads to quadratic boundaries in general)
Regression-based, $p(c_k \mid \mathbf{x})$ directly
Logistic regression: simple, linear in log-odds space
Neural network: non-linear extension of logistic, can be difficult to work with
Discriminative models, focus on locating optimal decision boundaries
Linear discriminants, perceptrons: simple, sometimes effective
Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
Nearest neighbor: simple, can scale poorly in high dimensions
Decision trees: "swiss army knife", often effective in high dimensions

17
Naïve Bayes Classifiers

Generative probabilistic model with conditional independence assumption on $p(\mathbf{x} \mid c_k)$, i.e.
$p(\mathbf{x} \mid c_k) = \prod_j p(x_j \mid c_k)$
Typically used with nominal variables
Real-valued variables discretized to create nominal versions
(alternative is to model each $p(x_j \mid c_k)$ with a parametric model; less widely used)
Comments
Simple to train (just estimate conditional probabilities for each feature-class pair)
Often works surprisingly well in practice
e.g., state of the art for text classification, basis of many widely used spam filters
Feature selection can be helpful, e.g., information gain
Note that even if the conditional independence (CI) assumptions are not met, it may still be able to approximate the optimal decision boundaries (seems to happen in practice)
However, on most problems it can usually be beaten with a more complex model (plus more work)
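
Training by counting can be sketched as follows for nominal features. The add-alpha smoothing is an assumption added here to avoid zero probabilities (the slide does not mention it), and the function names and toy emails are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> observed values
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(j, c)][v] += 1
            values[j].add(v)
    return class_counts, feat_counts, values, alpha

def predict_nb(row, model):
    """Return the class maximizing log p(c_k) + sum_j log p(x_j | c_k)."""
    class_counts, feat_counts, values, alpha = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for j, v in enumerate(row):
            # Smoothed estimate of p(x_j = v | c_k).
            num = feat_counts[(j, c)][v] + alpha
            score += math.log(num / (nc + alpha * len(values[j])))
        if score > best_score:
            best, best_score = c, score
    return best

rows = [("free", "caps"), ("free", "lower"), ("meeting", "lower"), ("meeting", "lower")]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(rows, labels)
print(predict_nb(("free", "caps"), model), predict_nb(("meeting", "lower"), model))
```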

18
Announcements

Homework 2 now online on the Web page
Due next Thursday in class
Homework 1 still being graded
Projects
Interim report due 2 weeks from now (more details
later)
More traffic data now online
Locations of VDS stations now known (contact Ram
Hariharan)
Schedule
Today: more on classification
Next: clustering, pattern-finding, dimension reduction
After that: specific topics such as text, Web, credit scoring, etc.

19
Link between Logistic Regression and Naïve Bayes
Naïve Bayes
Logistic Regression
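
One standard way to see the link, offered as a sketch rather than as the slide's own derivation, is to compare the two-class log-odds under each model. The weights $w_j$ and the parameters $\theta_{jk} = p(x_j = 1 \mid c_k)$ are notation introduced here for illustration, and the last step assumes binary features.

```latex
% Naive Bayes: Bayes rule plus conditional independence gives additive log-odds
\log \frac{p(c_1 \mid \mathbf{x})}{p(c_2 \mid \mathbf{x})}
  = \log \frac{p(c_1)}{p(c_2)} + \sum_j \log \frac{p(x_j \mid c_1)}{p(x_j \mid c_2)}

% Logistic regression: log-odds modeled directly as a linear function
\log \frac{p(c_1 \mid \mathbf{x})}{p(c_2 \mid \mathbf{x})} = w_0 + \sum_j w_j x_j

% For binary x_j, each naive Bayes term is itself linear in x_j:
\log \frac{p(x_j \mid c_1)}{p(x_j \mid c_2)}
  = x_j \log \frac{\theta_{j1}}{\theta_{j2}}
  + (1 - x_j) \log \frac{1 - \theta_{j1}}{1 - \theta_{j2}}
```

Under these assumptions naïve Bayes produces log-odds of the same linear form as logistic regression, with weights fixed by the per-feature estimates instead of being fitted jointly.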
20
Linear Discriminant Classifiers

Linear Discriminant Analysis (LDA)
Earliest known classifier (1936, R.A. Fisher)
See section 10.4 for math details
Find a projection onto a vector such that means
for each class (2 classes) are separated as much
as possible (with variances taken into account
appropriately)
Reduces to a special case of parametric Gaussian
classifier in certain situations
Many subsequent variations on this basic theme
(e.g., regularized LDA)
Other linear discriminants
Decision boundary = (p-1)-dimensional hyperplane in p dimensions
Perceptron learning algorithms (pre-dated neural networks)
Simple "error correction" based learning algorithms
SVMs use a sophisticated "margin" idea for selecting the hyperplane
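
As an illustration of error-correction learning, here is a sketch of the classic perceptron update rule. The learning rate, epoch limit, function names and toy points are assumptions; labels are coded as +1 and -1.

```python
def train_perceptron(points, labels, epochs=100, rate=1.0):
    """Adjust the hyperplane (w, b) only when a training point is misclassified."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * activation <= 0:      # wrong side of (or on) the boundary
                w = [wj + rate * y * xj for wj, xj in zip(w, x)]
                b += rate * y
                mistakes += 1
        if mistakes == 0:                # every training point is classified correctly
            break
    return w, b

def predict(x, w, b):
    """Classify by which side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Hypothetical linearly separable data in p = 2 dimensions.
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.5), (5.0, 4.0)]
labels = [-1, -1, 1, 1]
w, b = train_perceptron(points, labels)
print(w, b)
```

The loop stops on its own only when the data are linearly separable; otherwise it runs until the epoch limit.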

21
Nearest Neighbor Classifiers

kNN: select the k nearest neighbors to x from the training data and select the majority class from these neighbors
k is a parameter
Small k: noisier estimates; large k: smoother estimates
Best value of k often chosen by cross-validation
Comments
Virtually assumption free
Interesting theoretical properties:
Bayes error < error(kNN) < 2 x Bayes error (asymptotically)
Disadvantages
Can scale poorly with dimensionality; sensitive to distance metric
Requires fast lookup at run-time to do classification with large n
Does not provide any interpretable model
22
Local Decision Boundaries
Boundary? Points that are equidistant between points of class 1 and 2
Note: locally the boundary is
(1) linear (because of Euclidean distance)
(2) halfway between the 2 class points
(3) at right angles to the connector
[Figure: points labeled class 1 and class 2 plotted on axes Feature 1 and Feature 2, with a query point marked "?"]
23
Finding the Decision Boundaries
[Figure: points labeled class 1 and class 2 plotted on axes Feature 1 and Feature 2, with a query point marked "?"]
24
Finding the Decision Boundaries
[Figure: points labeled class 1 and class 2 plotted on axes Feature 1 and Feature 2, with a query point marked "?"]
25
Finding the Decision Boundaries
[Figure: points labeled class 1 and class 2 plotted on axes Feature 1 and Feature 2, with a query point marked "?"]
26
Overall Boundary Piecewise Linear
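[Figure: points labeled class 1 and class 2 plotted on axes Feature 1 and Feature 2, with a query point marked "?"; the plane is divided into a decision region for class 1 and a decision region for class 2]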
27
Decision Tree Classifiers

Widely used in practice
Can handle both real-valued and nominal inputs
(unusual)
Good with high-dimensional data
Similar algorithms to those used in constructing regression trees
Historically, developed both in statistics and computer science
Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984
Computer science: Quinlan, ID3, C4.5 (1980s-1990s)

28
Decision Tree Example
[Figure: data plotted on axes Debt and Income]
29
Decision Tree Example
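[Figure: data plotted on axes Debt and Income, with a first split "Income > t1" at threshold t1 on the Income axis; the next split is marked "??"]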
30
Decision Tree Example
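[Figure: data plotted on axes Debt and Income, with splits "Income > t1" and "Debt > t2" at thresholds t1 on the Income axis and t2 on the Debt axis; the next split is marked "??"]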
31
Decision Tree Example
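[Figure: data plotted on axes Debt and Income, with splits "Income > t1", "Debt > t2" and "Income > t3" at thresholds t1 and t3 on the Income axis and t2 on the Debt axis]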
32
Decision Tree Example
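[Figure: data plotted on axes Debt and Income, with splits "Income > t1", "Debt > t2" and "Income > t3" at thresholds t1 and t3 on the Income axis and t2 on the Debt axis]
Note: tree boundaries are piecewise linear and axis-parallel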
33
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partition!
Lack of stability against small perturbation of data.
Figure from Duda, Hart & Stork, Chap. 8
34
Decision Tree Pseudocode
node = tree-design(Data = {X, C})
    For i = 1 to d
        quality_variable(i) = quality_score(Xi, C)
    end
    node = {X_split, Threshold} for max{quality_variable}
    {Data_right, Data_left} = split(Data, X_split, Threshold)
    if node == leaf?
        return(node)
    else
        node_right = tree-design(Data_right)
        node_left = tree-design(Data_left)
    end
end
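
A runnable sketch of the recursive procedure in the pseudocode. It is one possible reading, not the lecture's implementation: it assumes the Gini index as the quality score, thresholds at observed values, a depth limit as the stopping rule, and made-up (income, debt) data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold), or None if no split exists."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[:-1]:   # candidate thresholds
            left = [c for r, c in zip(rows, labels) if r[j] <= t]
            right = [c for r, c in zip(rows, labels) if r[j] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def tree_design(rows, labels, max_depth=3):
    """Grow a tree greedily; leaves hold the majority class."""
    split = None
    if len(set(labels)) > 1 and max_depth > 0:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, j, t = split
    left = [(r, c) for r, c in zip(rows, labels) if r[j] <= t]
    right = [(r, c) for r, c in zip(rows, labels) if r[j] > t]
    return (j, t,
            tree_design(*zip(*left), max_depth - 1),
            tree_design(*zip(*right), max_depth - 1))

def classify(tree, row):
    """Follow the splits from the root down to a leaf label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

rows = [(20, 5), (25, 30), (40, 10), (60, 8), (70, 35), (80, 12)]   # (income, debt)
labels = ["bad", "bad", "good", "good", "bad", "good"]
tree = tree_design(rows, labels)
print(tree)
```

On this data the tree first splits on income and then on debt, giving axis-parallel boundaries.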
35
Binary split selection criteria

$Q(t) = N_1 Q_1(t) + N_2 Q_2(t)$, where $t$ is the threshold
Let $p_{1k}$ be the proportion of class $k$ points in region 1
Error criterion for a branch: $Q_1(t) = 1 - \max_k p_{1k}$
Gini index: $Q_1(t) = \sum_k p_{1k} (1 - p_{1k})$
Cross-entropy: $Q_1(t) = -\sum_k p_{1k} \log p_{1k}$
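Cross-entropy and Gini work better in general
Tend to give higher rank to splits with more extreme class distributions
Consider a (300, 100) / (100, 300) split versus a (200, 400) / (200, 0) split: both misclassify 200 points, but the second has a pure region and scores better on Gini and cross-entropy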
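
The comparison can be checked numerically. This sketch uses class counts (300, 100) / (100, 300) versus (200, 400) / (200, 0), the example from Hastie, Tibshirani and Friedman in which both candidate splits partition the same 400 + 400 points. The function names are made up.

```python
import math

def misclassification(counts):
    """Error criterion: 1 minus the proportion of the majority class."""
    return 1 - max(counts) / sum(counts)

def gini(counts):
    """Gini index: sum_k p_k (1 - p_k)."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def cross_entropy(counts):
    """Cross-entropy: -sum_k p_k log p_k, skipping empty classes."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def split_quality(q, region1, region2):
    """Q(t) = N1 Q1(t) + N2 Q2(t) for two tuples of class counts."""
    return sum(region1) * q(region1) + sum(region2) * q(region2)

split_a = ((300, 100), (100, 300))
split_b = ((200, 400), (200, 0))
for q in (misclassification, gini, cross_entropy):
    print(q.__name__, split_quality(q, *split_a), split_quality(q, *split_b))
```

Lower is better for all three criteria. The error criterion cannot tell the two splits apart, while Gini and cross-entropy both prefer the split that produces a pure region.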

36
Computational Complexity for a Binary Tree

At the root node, for each of $p$ variables:
Sort all values, compute quality for each split
$O(pN \log N)$ time for real-valued or ordinal variables
Subsequent internal node operations each take $O(pN' \log N')$, where $N'$ is the number of data points at that node
e.g., balanced tree of depth $K$ requires
$pN \log N + 2\,(p \tfrac{N}{2} \log \tfrac{N}{2}) + 4\,(p \tfrac{N}{4} \log \tfrac{N}{4}) + \ldots + 2^K (p \tfrac{N}{2^K} \log \tfrac{N}{2^K})$
$= pN \big(\log N + \log(N/2) + \log(N/4) + \ldots + \log(N/2^K)\big)$
This assumes data are in main memory
If data are on disk then repeated access of subsets at different nodes may be very slow (impossible to pre-index)

37
Splitting on a nominal attribute

Nominal attribute with $m$ values
e.g., the name of a state or a city in marketing data
$2^{m-1}$ possible subsets => exhaustive search is $O(2^{m-1})$
For small $m$, a simple approach is to branch on specific values
But for large $m$ this may not work well
Neat trick for the 2-class problem:
For each predictor value, calculate the proportion of class 1's
Order the $m$ values according to these proportions
Now treat as an ordinal variable and select the best split (linear in $m$)
This gives the optimal split for the Gini index, among all possible $2^{m-1}$ splits (Breiman et al., 1984).
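
The ordering trick can be sketched as follows for 0/1 class labels. The function names and the toy state data are hypothetical.

```python
from collections import defaultdict

def gini2(n1, n):
    """Two-class Gini index for n1 class-1 points out of n."""
    p = n1 / n
    return 2 * p * (1 - p)

def best_nominal_split(values, labels):
    """Return (set of values sent left, weighted Gini) using the ordering trick."""
    stats = defaultdict(lambda: [0, 0])          # value -> [class-1 count, total count]
    for v, y in zip(values, labels):
        stats[v][0] += y
        stats[v][1] += 1
    # Order the m values by their proportion of class 1.
    ordered = sorted(stats, key=lambda v: stats[v][0] / stats[v][1])
    total1 = sum(s[0] for s in stats.values())
    total = len(labels)
    best, left1, left_n = None, 0, 0
    for i, v in enumerate(ordered[:-1]):         # only m - 1 cut points to try
        left1 += stats[v][0]
        left_n += stats[v][1]
        score = (left_n * gini2(left1, left_n)
                 + (total - left_n) * gini2(total1 - left1, total - left_n))
        if best is None or score < best[1]:
            best = (set(ordered[:i + 1]), score)
    return best

values = ["CA", "CA", "NY", "NY", "TX", "TX", "WA", "WA"]
labels = [1, 1, 0, 0, 1, 0, 0, 0]
print(best_nominal_split(values, labels))
```

Only m - 1 cut points along the ordering are scored, instead of every possible subset of values.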

38
How to Choose the Right-Sized Tree?
[Figure: predictive error versus size of decision tree, with one curve for error on training data and one for error on test data; the ideal range for tree size is marked]
39
Choosing a Good Tree for Prediction

General idea:
grow a large tree
prune it back to create a family of subtrees ("weakest link" pruning)
score the subtrees and pick the best one
Massive data sizes (e.g., n ~ 100k data points):
use training data set to fit a set of trees
use a validation data set to score the subtrees
Smaller data sizes (e.g., n ~ 1k or less):
use cross-validation
use explicit penalty terms (e.g., Bayesian methods)

40
Example Spam Email Classification

Data Set (from the UCI Machine Learning Archive)
4601 email messages from 1999
Manually labelled as spam (40%), non-spam (60%)
54 features: percentage of words matching a specific word/character
Business, address, internet, free, george, !, $, etc.
Average/longest/sum lengths of uninterrupted sequences of CAPS
Error Rates (Hastie, Tibshirani, Friedman, 2001)
Training: 3065 emails, Testing: 1536 emails
Decision tree: 8.7%
Logistic regression: 7.6%
Naïve Bayes: 10% (typically)

43
Treating Missing Data in Trees

Missing values are common in practice
Approaches to handling missing values
During training
Ignore rows with missing values (inefficient)
During testing
Send the example being classified down both
branches and average predictions
Replace missing values with an imputed value
(can be suboptimal)
Other approaches
Treat missing as a unique value (useful if
missing values are correlated with the class)
Surrogate splits method
Search for and store surrogate variables/splits
during training

44
Other Issues with Classification Trees

Why use binary splits?
Multiway splits can be used, but cause
fragmentation
Linear combination splits?
can produce small improvements
optimization is much more difficult (need weights
and split point)
Trees are much less interpretable
Model instability
A small change in the data can lead to a
completely different tree
Model averaging techniques (like bagging) can be
useful
Tree bias
Poor at approximating non-axis-parallel
boundaries
Producing rule sets from tree models (e.g., C5.0)

45
Why Trees are widely used in Practice

Can handle high dimensional data
builds a model using 1 dimension at a time
Can handle any type of input variables
categorical, real-valued, etc
most other methods require data of a single type
(e.g., only real-valued)
Trees are (somewhat) interpretable
domain expert can read off the tree's logic
Tree algorithms are relatively easy to code and
test

46
Limitations of Trees

Representational Bias
classification: piecewise linear boundaries, parallel to axes
regression: piecewise constant surfaces
High Variance
trees can be unstable as a function of the sample
e.g., small change in the data -> completely different tree
causes two problems:
1. High variance contributes to prediction error
2. High variance reduces interpretability
Trees are good candidates for model combining
Often used with boosting and bagging
Trees do not scale well to massive data sets (e.g., N in millions)
repeated random access of subsets of the data

47
Evaluating Classification Results (in general)

Summary statistics:
empirical estimate of score function on test data, e.g., error rate
More detailed breakdown
E.g., confusion matrices
Can be quite useful in detecting systematic errors
Detection v. false-alarm plots (2 classes)
Binary classifier with real-valued output for each example, where higher means more likely to be class 1
For each possible threshold, calculate
Detection rate = fraction of class 1 examples detected
False alarm rate = fraction of class 2 examples incorrectly flagged as class 1
Plot y (detection rate) versus x (false alarm rate)
Also known as an ROC curve; closely related to precision-recall and specificity/sensitivity analyses
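
The threshold sweep behind a detection v. false-alarm plot can be sketched as follows. Class 1 is coded as label 1 and class 2 as label 0; the function name and the scores are made up.

```python
def roc_points(scores, labels):
    """Return (false alarm rate, detection rate) pairs, one per threshold.

    labels: 1 for class 1, 0 for class 2; a higher score means more likely class 1.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]                        # threshold above every score
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        detections = sum(flagged)                # class 1 examples flagged
        false_alarms = len(flagged) - detections # class 2 examples flagged
        points.append((false_alarms / n_neg, detections / n_pos))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
for x, y in roc_points(scores, labels):
    print(x, y)
```

Lowering the threshold can only add flagged examples, so both rates are non-decreasing and the curve runs from (0, 0) to (1, 1).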

48
Bagging for Combining Classifiers

Training data set of size N
Generate B bootstrap sampled data sets of size N
Bootstrap sample = sample with replacement
e.g., B = 100
Build B models (e.g., trees), one for each bootstrap sample
Intuition is that the bootstrapping "perturbs" the data enough to make the models more resistant to true variability
For prediction, combine the predictions from the B models
E.g., for classification, $p(c \mid \mathbf{x})$ = fraction of B models that predict $c$
Plus: generally improves accuracy on models such as trees
Negative: lose interpretability
49
[Figure] Green: majority vote; purple: averaging the probabilities
From Hastie, Tibshirani, and Friedman, 2001
50
Illustration of Boosting
Color of points = class label
Diameter of points = weight at each iteration
Dashed line: single stage classifier
Green line: combined, boosted classifier
Dotted blue (in last two panels): bagging
(from G. Rätsch, PhD thesis, 2001)
51
Support Vector Machines (will be discussed again later)

Support vector machines
Use a different loss function, the "margin"
Results in convex optimization problem, solvable by quadratic programming
Decision boundary represented by examples in training data
Linear version: clever placement of the hyperplane
Non-linear version: "kernel trick" for high-dimensional problems
Computational complexity can be $O(N^3)$ without speedups

52
Summary on Classifiers

Simple models (but can be effective)
Logistic regression
Naïve Bayes
K nearest-neighbors
Decision trees
Good for high-dimensional problems with different
data types
State of the art
Support vector machines
Boosted trees
Many tradeoffs in interpretability, score
functions, etc

53
Decision Tree Classifiers
Task: Classification
Representation: Hierarchy of axis-parallel decision boundaries
Score Function: Cross-validated error
Search/Optimization: Greedy search in tree space
Data Management: None specified
Models, Parameters: Tree
54
Naïve Bayes Classifier
Task: Classification
Representation: Conditional independence probability model
Score Function: Likelihood
Search/Optimization: Closed form probability estimates
Data Management: None specified
Models, Parameters: Conditional probability tables
55
Logistic Regression
Task: Classification
Representation: Log-odds(C) = linear function of X's
Score Function: Log-likelihood
Search/Optimization: Iterative (Newton) method
Data Management: None specified
Models, Parameters: Logistic weights
56
Nearest Neighbor Classifier
Task: Classification
Representation: Memory-based
Score Function: Cross-validated error (for selecting k)
Search/Optimization: None
Data Management: None specified
Models, Parameters: None
57
Support Vector Machines
Task: Classification
Representation: Hyperplanes
Score Function: Margin
Search/Optimization: Convex optimization (quadratic programming)
Data Management: None specified
Models, Parameters: None
58
Software (same as for Regression)

MATLAB
Many free "toolboxes" on the Web for regression and prediction
e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
R
General purpose statistical computing environment (successor to S)
Free (!)
Widely used by statisticians, has a huge library of functions and visualization tools
Commercial tools
SAS, other statistical packages
Data mining packages
Often are not programmable: offer a fixed menu of items