ICS 278: Data Mining, Lectures 7 and 8

Classification Algorithms

- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine

Notation

- Variables X, Y, ... with values x, y (lower case)
- Vectors indicated by boldface X
- Components of X indicated by Xj, with values xj
- Matrix data set D with n rows and p columns
- The jth column contains values for variable Xj
- The ith row contains a vector of measurements on object i, indicated by x(i)
- The jth measurement value for the ith object is xj(i)
- Unknown parameter for a model: θ
- Can also use other Greek letters, like α, β, δ, γ
- Vector of parameters: θ

Classification

- Predictive modeling: predict Y given X
- Y is real-valued ⇒ regression
- Y is categorical ⇒ classification
- Classification has many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.

Classification v. Regression

- Similar in many ways
- Both learn a mapping from X to Y
- Both are sensitive to the dimensionality of X
- Generalization to new data is important in both
- Test error versus model complexity
- Many models can be used for either classification or regression, e.g., trees, neural networks
- Most important differences:
- Categorical Y versus real-valued Y
- Different score functions, e.g., classification error versus squared error

Decision Region Terminology

Probabilistic view of Classification

- Notation: let there be K classes c1, ..., cK
- Class marginals: p(ck) = probability of class k
- Class-conditional probabilities: p(x | ck) = probability of x given ck, for k = 1, ..., K
- Posterior class probabilities (by Bayes rule): p(ck | x) = p(x | ck) p(ck) / p(x), for k = 1, ..., K, where p(x) = Σj p(x | cj) p(cj)
- In theory this is all we need... in practice this may not be the best approach.
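A minimal sketch of this Bayes-rule calculation, assuming two classes with hypothetical Gaussian class-conditional densities and priors (none of these numbers come from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x | ck) and class marginals p(ck)
priors = np.array([0.6, 0.4])                    # p(c1), p(c2)
densities = [norm(loc=0.0, scale=1.0),           # p(x | c1)
             norm(loc=2.0, scale=1.5)]           # p(x | c2)

def posterior(x):
    """Posterior class probabilities p(ck | x) via Bayes rule."""
    likelihoods = np.array([d.pdf(x) for d in densities])  # p(x | ck)
    joint = likelihoods * priors                            # p(x | ck) p(ck)
    return joint / joint.sum()                              # normalize by p(x) = sum_j p(x | cj) p(cj)

print(posterior(1.0))   # roughly [0.63, 0.37] for these made-up densities
```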

Example of Probabilistic Classification

[Figure, built up over several slides: two class-conditional densities p(x | c1) and p(x | c2) plotted against x, together with the resulting posterior probability p(c1 | x) on a 0-to-1 scale (0.5 marked).]

Decision Regions and Bayes Error Rate

[Figure: the class-conditional densities p(x | c1) and p(x | c2), with the x-axis partitioned into alternating decision regions labelled Class c2, Class c1, Class c2, Class c1, Class c2.]

- Optimal decision regions: regions where one class is more likely
- Optimal decision regions ⇒ optimal decision boundaries
- Bayes error rate: the fraction of examples misclassified by the optimal classifier = the shaded area above (see equation 10.3 in the text)

Procedure for optimal Bayes classifier

- For each class, learn a model p(x | ck)
- E.g., each class is multivariate Gaussian with its own mean and covariance
- Use Bayes rule to obtain p(ck | x)
- ⇒ this yields the optimal decision regions/boundaries
- ⇒ use these decision regions/boundaries for classification
- Correct in theory... but practical problems include:
- How do we model p(x | ck)?
- Even if we know the model for p(x | ck), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
- Alternative approach: model the decision boundaries directly
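A compact sketch of this procedure, assuming each class-conditional density is modelled as a multivariate Gaussian fit by maximum likelihood (the function names and data layout below are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    """Fit p(x | ck) as a Gaussian per class, plus class marginals p(ck)."""
    classes = np.unique(y)
    models, priors = {}, {}
    for c in classes:
        Xc = X[y == c]
        models[c] = multivariate_normal(mean=Xc.mean(axis=0),
                                        cov=np.cov(Xc, rowvar=False))
        priors[c] = len(Xc) / len(X)
    return classes, models, priors

def predict(x, classes, models, priors):
    """Assign x to the class with the largest posterior p(ck | x)."""
    scores = [models[c].pdf(x) * priors[c] for c in classes]  # proportional to p(x | ck) p(ck)
    return classes[int(np.argmax(scores))]
```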

Three types of classifiers

- Generative (or class-conditional) classifiers
- Learn models for p(x | ck), use Bayes rule to find decision boundaries
- Examples: naïve Bayes models, Gaussian classifiers
- Regression (or posterior class probability) classifiers
- Learn a model for p(ck | x) directly
- Examples: logistic regression (see lectures 5/6), neural networks
- Discriminative classifiers
- No probabilities
- Learn the decision boundaries directly
- Examples:
- Linear boundaries: perceptrons, linear SVMs
- Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
- Non-linear boundaries: non-linear SVMs
- Note: one can usually post-fit class probability estimates p(ck | x) to a discriminative classifier

Which type of classifier is appropriate?

- Let's look at the score functions:
- c(i) = true class, c(x(i); θ) = class predicted by the classifier
- Class-mismatch loss functions:
- S(θ) = (1/n) Σi Cost[c(i), c(x(i); θ)]
- where cost(i, j) = cost of misclassifying true class i as predicted class j
- e.g., cost(i, j) = 0 if i = j, 1 otherwise (misclassification error or 0-1 loss)
- more generally, cost(i, j) is a K x K matrix of losses (e.g., surgery, spam email, etc.)
- Class-probability loss functions:
- S(θ) = (1/n) Σi log p(c(i) | x(i); θ) (log probability score)
- or S(θ) = (1/n) Σi [c(i) − p(c(i) | x(i); θ)]² (Brier score)
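These score functions are easy to compute directly; a small illustration with made-up labels, predictions, and predicted probabilities (not from the lecture):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])             # true classes c(i)
y_pred = np.array([1, 0, 0, 1, 0])             # classes predicted by the classifier
p_true = np.array([0.9, 0.8, 0.4, 0.7, 0.6])   # predicted probability of the true class, p(c(i) | x(i))

zero_one = np.mean(y_true != y_pred)        # 0-1 loss (misclassification error)
log_score = np.mean(np.log(p_true))         # log probability score (higher is better)
brier = np.mean((1.0 - p_true) ** 2)        # Brier score: squared error of the probability
                                            # assigned to the true class
print(zero_one, log_score, brier)
```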

Example classifying spam email

- 0-1 loss function
- Appropriate if we just want to maximize accuracy
- Asymmetric cost matrix
- Appropriate if missing non-spam emails is more costly than failing to detect spam emails
- Probability loss
- Appropriate if we want to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list
- In general: don't solve a harder problem than you need to, and don't model aspects of the problem you don't need to (e.g., modeling p(x | c)) - Vapnik, 1996

Classes of classifiers

- Class-conditional/probabilistic, based on p(x | ck)
- Naïve Bayes (simple, but often effective in high dimensions)
- Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems; leads to quadratic boundaries in general)
- Regression-based, modeling p(ck | x) directly
- Logistic regression: simple, linear in "odds" space
- Neural networks: non-linear extension of logistic regression, can be difficult to work with
- Discriminative models, focused on locating optimal decision boundaries
- Linear discriminants, perceptrons: simple, sometimes effective
- Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
- Nearest neighbor: simple, can scale poorly in high dimensions
- Decision trees: "Swiss army knife", often effective in high dimensions

Naïve Bayes Classifiers

- Generative probabilistic model with a conditional independence assumption on p(x | ck), i.e.,
  p(x | ck) = Πj p(xj | ck)
- Typically used with nominal variables
- Real-valued variables are discretized to create nominal versions
- (an alternative is to model each p(xj | ck) with a parametric model; less widely used)
- Comments:
- Simple to train (just estimate conditional probabilities for each feature-class pair)
- Often works surprisingly well in practice
- e.g., state of the art for text classification, basis of many widely used spam filters
- Feature selection can be helpful, e.g., information gain
- Note: even if the CI assumptions are not met, it may still be able to approximate the optimal decision boundaries (this seems to happen in practice)
- However... on most problems it can usually be beaten with a more complex model (plus more work)
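A minimal naïve Bayes sketch for nominal features, estimating each conditional probability table by counting; the add-one (Laplace) smoothing and the fallback for unseen values are assumptions made for the sake of a runnable example, not details from the slide:

```python
import numpy as np
from collections import defaultdict

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate p(ck) and p(xj = v | ck) from nominal data by counting."""
    priors, cond = {}, defaultdict(dict)
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        for j in range(X.shape[1]):
            values, counts = np.unique(Xc[:, j], return_counts=True)
            n_vals = len(np.unique(X[:, j]))
            # add-one smoothed conditional probability table for feature j, class c
            cond[c][j] = {v: (k + alpha) / (len(Xc) + alpha * n_vals)
                          for v, k in zip(values, counts)}
    return priors, cond

def predict_naive_bayes(x, priors, cond):
    """Pick the class maximizing log p(ck) + sum_j log p(xj | ck)."""
    def score(c):
        logp = np.log(priors[c])
        for j, v in enumerate(x):
            logp += np.log(cond[c][j].get(v, 1e-6))   # tiny probability for unseen values
        return logp
    return max(priors, key=score)
```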

Announcements

- Homework 2 is now online on the Web page
- Due next Thursday in class
- Homework 1 is still being graded
- Projects
- Interim report due 2 weeks from now (more details later)
- More traffic data now online
- Locations of VDS stations now known (contact Ram Hariharan)
- Schedule
- Today: more on classification
- Next: clustering, pattern-finding, dimension reduction
- After that: specific topics such as text, Web, credit scoring, etc.

Link between Logistic Regression and Naïve Bayes

[Figure: side-by-side comparison of the Naïve Bayes and Logistic Regression model forms.]
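The equations on this slide did not survive the transcript; a brief sketch of the standard link, assuming two classes and binary features xj for concreteness:

```latex
% Naive Bayes posterior log-odds, with binary features x_j and two classes:
\log \frac{p(c_1 \mid \mathbf{x})}{p(c_2 \mid \mathbf{x})}
  = \log \frac{p(c_1)}{p(c_2)} + \sum_j \log \frac{p(x_j \mid c_1)}{p(x_j \mid c_2)}
  = \alpha_0 + \sum_j \alpha_j x_j
% i.e., a linear function of x -- the same functional form logistic regression
% assumes for the log-odds, but with the weights estimated from class-conditional
% counts rather than by maximizing the conditional log-likelihood.
```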

Linear Discriminant Classifiers

- Linear Discriminant Analysis (LDA)
- Earliest known classifier (1936, R.A. Fisher)
- See section 10.4 for the mathematical details
- Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible (with variances taken into account appropriately)
- Reduces to a special case of the parametric Gaussian classifier in certain situations
- Many subsequent variations on this basic theme (e.g., regularized LDA)
- Other linear discriminants
- Decision boundary = (p-1)-dimensional hyperplane in p dimensions
- Perceptron learning algorithms (pre-dated neural networks)
- Simple error-correction based learning algorithms
- SVMs use a sophisticated "margin" idea for selecting the hyperplane
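A sketch of Fisher's two-class projection direction; the pooled within-class scatter used here is the textbook construction (see section 10.4), and the variable names are illustrative:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's LDA direction: w proportional to S_w^{-1} (mean1 - mean2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class scatter matrix
    S_w = (np.cov(X1, rowvar=False) * (len(X1) - 1) +
           np.cov(X2, rowvar=False) * (len(X2) - 1))
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)

# Projecting the data onto w separates the two class means as much as possible
# relative to the within-class variance along that direction.
```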

Nearest Neighbor Classifiers

- kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors
- k is a parameter
- Small k: noisier estimates; large k: smoother estimates
- The best value of k is often chosen by cross-validation
- Comments:
- Virtually assumption-free
- Interesting theoretical properties: Bayes error ≤ error(kNN) ≤ 2 × Bayes error (asymptotically)
- Disadvantages:
- Can scale poorly with dimensionality; sensitive to the distance metric
- Requires fast lookup at run time to do classification with large n
- Does not provide any interpretable model
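A brute-force kNN sketch using Euclidean distance, purely for illustration (the caveats above about distance metrics and fast lookup still apply):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Predict the majority class among the k nearest training points to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```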

Local Decision Boundaries

Boundary? Points that are equidistant between points of class 1 and class 2. Note that locally the boundary is (1) linear (because of Euclidean distance), (2) halfway between the two class points, and (3) at right angles to the connector.

[Figure: class 1 and class 2 training points in the Feature 1 vs. Feature 2 plane, with a "?" marking a point whose class is to be decided.]

Finding the Decision Boundaries

[Figure, built up over several slides: the same class 1 and class 2 points in the Feature 1 vs. Feature 2 plane, with local boundary segments added one at a time between neighboring points of different classes.]

Overall Boundary = Piecewise Linear

[Figure: the local segments joined into a piecewise linear boundary separating the decision region for class 1 from the decision region for class 2.]

Decision Tree Classifiers

- Widely used in practice
- Can handle both real-valued and nominal inputs (unusual)
- Good with high-dimensional data
- Similar algorithms to those used in constructing regression trees
- Historically, developed both in statistics and in computer science
- Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984
- Computer science: Quinlan, ID3, C4.5 (1980s-1990s)

Decision Tree Example

[Figure, built up over several slides: training points plotted in the Income vs. Debt plane. The tree splits first on Income > t1, then on Debt > t2, and then on Income > t3, carving the plane into axis-parallel rectangles.]

Note: tree boundaries are piecewise linear and axis-parallel.

Decision Trees are not stable

Moving just one example slightly may lead to quite different trees and space partitions! This illustrates the lack of stability against small perturbations of the data. (Figure from Duda, Hart & Stork, Chap. 8.)

Decision Tree Pseudocode

```
node = tree-design(Data = {X, C})

    for i = 1 to d
        quality_variable(i) = quality_score(X_i, C)
    end
    node = {X_split, Threshold} for the variable with max quality
    {Data_right, Data_left} = split(Data, X_split, Threshold)
    if node == leaf
        return(node)
    else
        node_right = tree-design(Data_right)
        node_left  = tree-design(Data_left)
    end
end
```
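A runnable Python version of the same greedy recursion, shown after the pseudocode for concreteness: it uses misclassification error as the quality score and stops when a node is pure or small. This is an illustrative sketch, not the CART/C4.5 implementations cited earlier.

```python
import numpy as np
from collections import Counter

def _error(y_part):
    """Misclassification error of labelling a branch with its majority class."""
    return 0.0 if len(y_part) == 0 else 1.0 - Counter(y_part).most_common(1)[0][1] / len(y_part)

def tree_design(X, y, min_size=5):
    """Greedy top-down construction: pick the best (variable, threshold), then recurse."""
    if len(set(y)) == 1 or len(y) <= min_size:
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:            # candidate thresholds for variable j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            q = (len(left) * _error(left) + len(right) * _error(right)) / len(y)
            if best is None or q < best[0]:
                best = (q, j, t)
    if best is None:                                  # no usable split: make a leaf
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    _, j, t = best
    mask = X[:, j] <= t
    return {"leaf": False, "variable": j, "threshold": t,
            "left": tree_design(X[mask], y[mask], min_size),
            "right": tree_design(X[~mask], y[~mask], min_size)}

def tree_predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["variable"]] <= node["threshold"] else node["right"]
    return node["label"]
```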

Binary split selection criteria

- Q(t) = N1 Q1(t) + N2 Q2(t), where t is the threshold
- Let p1k be the proportion of class k points in region 1
- Error criterion for a branch: Q1(t) = 1 − max_k p1k
- Gini index: Q1(t) = Σk p1k (1 − p1k)
- Cross-entropy: Q1(t) = − Σk p1k log p1k
- Cross-entropy and Gini work better in general
- They tend to give higher rank to splits with more extreme class distributions
- Consider the (300, 100) / (100, 300) split versus the (400, 0) / (200, 200) split
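The three branch criteria written out in code, to make the comparison at the end of the slide concrete:

```python
import numpy as np

def branch_criteria(counts):
    """Misclassification error, Gini index, and cross-entropy for one branch."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    error = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return error, gini, entropy

# The (300,100)/(100,300) split and the (400,0)/(200,200) split misclassify the same
# number of points, but Gini and cross-entropy prefer the second, which has a pure branch.
print(branch_criteria([300, 100]), branch_criteria([400, 0]))
```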

Computational Complexity for a Binary Tree

- At the root node, for each of the p variables:
- Sort all values, compute the quality of each split
- O(pN log N) time for real-valued or ordinal variables
- Subsequent internal node operations each take O(N log N)
- e.g., a balanced tree of depth K requires
  pN log N + 2 (p (N/2) log(N/2)) + 4 (p (N/4) log(N/4)) + ... + 2^K (p (N/2^K) log(N/2^K))
  = pN (log N + log(N/2) + log(N/4) + ... + log(N/2^K))
- This assumes the data are in main memory
- If the data are on disk, then repeated access of subsets at different nodes may be very slow (impossible to pre-index)

Splitting on a nominal attribute

- Nominal attribute with m values
- e.g., the name of a state or a city in marketing data
- 2^(m-1) possible subsets ⇒ exhaustive search is O(2^(m-1))
- For small m, a simple approach is to branch on specific values
- But for large m this may not work well
- Neat trick for the 2-class problem:
- For each predictor value, calculate the proportion of class-1 points
- Order the m values according to these proportions
- Now treat as an ordinal variable and select the best split (linear in m)
- This gives the optimal split for the Gini index, among all possible 2^(m-1) splits (Breiman et al., 1984)
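A sketch of this ordering trick for two classes; the Gini computation and category encoding here are illustrative assumptions:

```python
import numpy as np

def _gini(y_part):
    p = np.mean(y_part == 1)            # proportion of class 1 in this branch
    return 2.0 * p * (1.0 - p)

def best_nominal_split(x, y):
    """Order the m categories by class-1 proportion, then scan the m-1 ordered
    cut points as if the attribute were ordinal (optimal for Gini with 2 classes)."""
    cats = np.unique(x)
    prop1 = np.array([np.mean(y[x == c] == 1) for c in cats])
    order = cats[np.argsort(prop1)]     # categories sorted by proportion of class 1
    best = None
    for i in range(1, len(order)):
        left = list(order[:i])          # candidate subset of categories sent left
        mask = np.isin(x, left)
        q = mask.mean() * _gini(y[mask]) + (1 - mask.mean()) * _gini(y[~mask])
        if best is None or q < best[0]:
            best = (q, set(left))
    return best                          # (weighted Gini, subset of categories sent left)
```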

How to Choose the Right-Sized Tree?

[Figure: predictive error (y-axis) versus size of decision tree (x-axis), showing the error on training data and the error on test data, with the ideal range for tree size marked where the test error is lowest.]

Choosing a Good Tree for Prediction

- General idea:
- Grow a large tree
- Prune it back to create a family of subtrees ("weakest link" pruning)
- Score the subtrees and pick the best one
- Massive data sizes (e.g., n = 100k data points):
- Use a training data set to fit a set of trees
- Use a validation data set to score the subtrees
- Smaller data sizes (e.g., n = 1k or less):
- Use cross-validation
- Use explicit penalty terms (e.g., Bayesian methods)

Example Spam Email Classification

- Data set (from the UCI Machine Learning Archive)
- 4601 email messages from 1999
- Manually labelled as spam (60), non-spam (40)
- 54 features: percentage of words matching a specific word/character
- business, address, internet, free, george, !, etc.
- Average/longest/sum lengths of uninterrupted sequences of CAPS
- Error rates (Hastie, Tibshirani, Friedman, 2001):
- Training: 3056 emails; testing: 1536 emails
- Decision tree: 8.7%
- Logistic regression: 7.6%
- Naïve Bayes: ~10% (typically)


Treating Missing Data in Trees

- Missing values are common in practice
- Approaches to handling missing values:
- During training:
- Ignore rows with missing values (inefficient)
- During testing:
- Send the example being classified down both branches and average the predictions
- Replace missing values with an imputed value (can be suboptimal)
- Other approaches:
- Treat "missing" as a unique value (useful if missing values are correlated with the class)
- Surrogate splits method:
- Search for and store surrogate variables/splits during training

Other Issues with Classification Trees

- Why use binary splits?
- Multiway splits can be used, but cause fragmentation
- Linear combination splits?
- Can produce small improvements
- Optimization is much more difficult (need weights and a split point)
- The trees are much less interpretable
- Model instability
- A small change in the data can lead to a completely different tree
- Model averaging techniques (like bagging) can be useful
- Tree bias
- Poor at approximating non-axis-parallel boundaries
- Producing rule sets from tree models (e.g., C5.0)

Why Trees are widely used in Practice

- Can handle high-dimensional data
- Builds a model using one dimension at a time
- Can handle any type of input variable
- categorical, real-valued, etc.
- most other methods require data of a single type (e.g., only real-valued)
- Trees are (somewhat) interpretable
- a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test

Limitations of Trees

- Representational bias
- Classification: piecewise linear boundaries, parallel to the axes
- Regression: piecewise constant surfaces
- High variance
- Trees can be unstable as a function of the sample
- e.g., a small change in the data -> a completely different tree
- This causes two problems:
- 1. High variance contributes to prediction error
- 2. High variance reduces interpretability
- Trees are good candidates for model combining
- Often used with boosting and bagging
- Trees do not scale well to massive data sets (e.g., N in the millions)
- Repeated random access of subsets of the data

Evaluating Classification Results (in general)

- Summary statistics:
- Empirical estimate of the score function on test data, e.g., error rate
- More detailed breakdowns:
- E.g., confusion matrices
- Can be quite useful in detecting systematic errors
- Detection vs. false-alarm plots (2 classes):
- Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
- For each possible threshold, calculate:
- Detection rate = fraction of class 1 detected
- False alarm rate = fraction of class 2 detected
- Plot y (detection rate) versus x (false alarm rate)
- Also known as ROC, precision-recall, or specificity/sensitivity plots
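A short sketch of the detection / false-alarm calculation, sweeping a threshold over hypothetical real-valued classifier scores:

```python
import numpy as np

def detection_false_alarm(scores, labels):
    """Return (false alarm rate, detection rate) pairs, one per threshold.
    labels: 1 for class 1, 0 for class 2; higher score means 'more likely class 1'."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:           # sweep thresholds from high to low
        predicted_1 = scores >= t
        detection = np.mean(predicted_1[labels == 1])     # fraction of class 1 detected
        false_alarm = np.mean(predicted_1[labels == 0])   # fraction of class 2 detected
        points.append((false_alarm, detection))
    return points   # plotting detection (y) against false alarm (x) gives the ROC curve
```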

Bagging for Combining Classifiers

- Training data set of size N
- Generate B "bootstrap" sampled data sets of size N
- Bootstrap sample = sample with replacement
- e.g., B = 100
- Build B models (e.g., trees), one for each bootstrap sample
- Intuition: the bootstrapping "perturbs" the data enough to make the models more resistant to true variability
- For prediction, combine the predictions from the B models
- E.g., for classification, p(c | x) = fraction of the B models that predict c
- Plus: generally improves accuracy for models such as trees
- Negative: lose interpretability

[Figure: green = majority vote, purple = averaging the probabilities. From Hastie, Tibshirani, and Friedman, 2001.]
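A compact bagging sketch; the fit_model / predict_model callables are placeholders for any base learner (for instance, the tree_design / tree_predict pair sketched earlier):

```python
import numpy as np
from collections import Counter

def bagging_fit(X, y, fit_model, B=100, seed=0):
    """Fit B models, each on a bootstrap sample (drawn with replacement) of size N."""
    rng = np.random.default_rng(seed)
    N = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)       # bootstrap sample indices
        models.append(fit_model(X[idx], y[idx]))
    return models

def bagging_predict(models, predict_model, x):
    """p(c | x) = fraction of the B models predicting c; return the majority class."""
    votes = [predict_model(m, x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```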

Illustration of Boosting

[Figure: color of points = class label; diameter of points = weight at each iteration; dashed line = single-stage classifier; green line = combined, boosted classifier; dotted blue line in the last two panels = bagging. From G. Rätsch, PhD thesis, 2001.]

Support Vector Machines (will be discussed again later)

- Support vector machines
- Use a different loss function, the "margin"
- Result in a convex optimization problem, solvable by quadratic programming
- The decision boundary is represented by examples in the training data
- Linear version: clever placement of the hyperplane
- Non-linear version: "kernel trick" for high-dimensional problems
- Computational complexity can be O(N^3) without speedups
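A rough illustration of the margin-based loss: a linear SVM trained with plain sub-gradient descent on the hinge loss. This is a simplification for intuition only; as noted above, real SVM training is a quadratic program, and the step size and iteration count here are arbitrary:

```python
import numpy as np

def linear_svm_sgd(X, y, lam=0.01, lr=0.01, epochs=200):
    """Minimize  lam*||w||^2 + (1/N) * sum_i max(0, 1 - y_i (w.x_i + b)),  with y_i in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * (x_i @ w + b)
            grad_w = 2 * lam * w - (y_i * x_i if margin < 1 else 0.0)
            grad_b = -y_i if margin < 1 else 0.0
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b   # classify a new point x by the sign of  w.x + b
```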

Summary on Classifiers

- Simple models (but can be effective):
- Logistic regression
- Naïve Bayes
- k nearest-neighbors
- Decision trees
- Good for high-dimensional problems with different data types
- State of the art:
- Support vector machines
- Boosted trees
- Many tradeoffs in interpretability, score functions, etc.

Decision Tree Classifiers

- Task: Classification
- Representation: Decision boundaries = hierarchy of axis-parallel boundaries
- Score Function: Cross-validated error
- Search/Optimization: Greedy search in tree space
- Data Management: None specified
- Models, Parameters: Tree

Naïve Bayes Classifier

- Task: Classification
- Representation: Conditional independence probability model
- Score Function: Likelihood
- Search/Optimization: Closed-form probability estimates
- Data Management: None specified
- Models, Parameters: Conditional probability tables

Logistic Regression

- Task: Classification
- Representation: Log-odds(C) as a linear function of the X's
- Score Function: Log-likelihood
- Search/Optimization: Iterative (Newton) method
- Data Management: None specified
- Models, Parameters: Logistic weights

Nearest Neighbor Classifier

- Task: Classification
- Representation: Memory-based
- Score Function: Cross-validated error (for selecting k)
- Search/Optimization: None
- Data Management: None specified
- Models, Parameters: None

Support Vector Machines

- Task: Classification
- Representation: Hyperplanes
- Score Function: Margin
- Search/Optimization: Convex optimization (quadratic programming)
- Data Management: None specified
- Models, Parameters: None

Software (same as for Regression)

- MATLAB
- Many free toolboxes on the Web for regression and prediction
- e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
- R
- General-purpose statistical computing environment (successor to S)
- Free (!)
- Widely used by statisticians; has a huge library of functions and visualization tools
- Commercial tools
- SAS, other statistical packages
- Data mining packages
- Often are not programmable: they offer a fixed menu of items

Reading

- For this class: Chapter 10
- Covers both general concepts in classification and a broad range of classifiers
- Suggested background reading for further information:
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer-Verlag, 2001
- Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002
- Classification and Regression Trees, L. Breiman, J. Friedman, R. Olshen, and C. Stone, Wadsworth, 1984