1
Classification and Supervised Learning
  • Credits
  • Hand, Mannila and Smyth
  • Cook and Swayne
  • Padhraic Smyth's notes
  • Shawndra Hill's notes

2
Outline
  • Supervised Learning Overview
  • Linear Discriminant Analysis
  • Tree models
  • Probability-based and Bayes models

3
Classification
  • Classification, or supervised learning
  • prediction for a categorical response
  • for binary (T/F) responses, can be used as an alternative to logistic regression
  • the response is often a quantized real value or unscaled numeric
  • can be used with categorical predictors
  • great for missing data - missingness can be a response in itself!
  • methods for fitting can be
  • parametric
  • algorithmic

4
  • Because labels are known, you can build
    parametric models for the classes
  • can also define decision regions and decision
    boundaries

5
Examples of classifiers
  • Generative/class-conditional/probabilistic, based on p(x | ck)
  • Naïve Bayes (simple, but often effective in high dimensions)
  • Parametric generative models, e.g., Gaussian - linear discriminant analysis
  • Regression-based, based on p(ck | x)
  • Logistic regression - simple, linear in log-odds space
  • Neural network - non-linear extension of logistic regression
  • Discriminative models - focus on locating optimal decision boundaries
  • Decision trees - a Swiss army knife, often effective in high dimensions
  • Linear discriminants
  • Support vector machines (SVM) - generalization of linear discriminants; can be quite effective, but computational complexity is an issue
  • Nearest neighbor - simple, but can scale poorly in high dimensions

6
Evaluation of Classifiers
  • Already seen some of this
  • Assume the output is a probability vector over the classes
  • Classification error
  • P(true Y ≠ predicted Y)
  • ROC area
  • area under the ROC plot
  • top-k analysis
  • sometimes all you care about is how well you can do at the top of the list
  • plan A: top 50 candidates have 44 sales, top 500 have 300 sales
  • plan B: top 50 have 48 sales, top 500 have 270 sales
  • which do you choose?
  • often used with imbalanced class distributions - good classification error is easy!
  • fraud, etc.
  • calibration is sometimes important
  • if you say something has a 90% chance, does it? (a small R sketch of the first two measures follows)
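A minimal R sketch of classification error and ROC area, with made-up labels y and predicted probabilities p (both hypothetical):

y <- c(0, 0, 1, 1, 1, 0, 1, 0)                  # true 0/1 labels (made up)
p <- c(0.1, 0.6, 0.8, 0.4, 0.9, 0.3, 0.6, 0.2)  # predicted P(Y = 1) (made up)
mean((p > 0.5) != y)                  # classification error at a 0.5 cutoff
mean(outer(p[y == 1], p[y == 0], ">"))  # ROC area: P(random positive outranks random negative)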

7
Linear Discriminant Analysis
  • LDA - parametric classification
  • Fisher (1936); Rao (1948)
  • a linear combination of variables separating two classes, found by comparing the difference between the class means with the variance within each class
  • assumes a multivariate normal distribution for each class (cluster)
  • pros
  • easy to define likelihood
  • easy to define boundary
  • easy to measure goodness of fit
  • easy interpretation
  • cons
  • very rare for data to come close to multivariate normal!
  • works only on numeric predictors

8
  • painters data: 54 painters rated on a scale of 0 to 20 for composition, drawing, colour and expression, and classified into 8 schools

                  Composition Drawing Colour Expression School
  Da Udine                 10       8     16          3      A
  Da Vinci                 15      16      4         14      A
  Del Piombo                8      13     16          7      A
  Del Sarto                12      16      9          8      A
  Fr. Penni                 0      15      8          0      A
  Guilio Romano            15      16      4         14      A
  Michelangelo              8      17      4          8      A
  Perino del Vaga          15      16      7          6      A
  Perugino                  4      12     10          4      A
  Raphael                  17      18     12         18      A

library(MASS)
lda1 <- lda(School ~ ., data = painters)
9
(No Transcript)
10
(No Transcript)
11
LDA - predictions
  • to check how good the model is, you can see how
    well it predicts what actually happened

> predict(lda1)$class
 [1] D H D A A H A C A A A A A C A B B E C C B E D D D D G D D D D D E D G H E E E F G A F D G A G G E
[50] G C H H H
Levels: A B C D E F G H
> predict(lda1)$posterior
                      A            B            C            D           E            F
Da Udine   0.0153311094 0.0059952857 0.0105980288 6.717937e-01 0.124938731 2.913817e-03
Da Vinci   0.1023448947 0.1963312180 0.1155149000 4.444461e-05 0.016182391 1.942920e-02
Del Piombo 0.1763906259 0.0142589568 0.0064792116 6.351212e-01 0.102924883 9.080713e-03
Del Sarto  0.4549047647 0.2079127774 0.1459033415 2.166203e-02 0.146171796 3.716302e-03
...
> table(predict(lda1)$class, painters$Sch)

    A B C D E F G H
  A 5 4 0 0 0 1 1 0
  B 0 1 2 0 0 0 0 0
  C 1 1 2 0 0 0 0 1
  D 2 0 0 9 1 0 1 0
  E 0 0 2 0 4 0 1 0
  F 0 0 0 0 0 2 0 0
  G 0 0 0 1 1 1 4 0
  H 2 0 0 0 1 0 0 3
12
Classification (Decision) Trees
  • Trees are one of the most popular and useful of
    all data mining models
  • Algorithmic version of classification
  • no distributional assumptions
  • Competing algorithms: CART, C4.5, DBMiner
  • Pros
  • no distributional assumptions
  • can handle real and nominal inputs
  • speed and scalability
  • robustness to outliers and missing values
  • interpretability
  • compactness of classification rules
  • Cons
  • interpretability?
  • several tuning parameters to set with little guidance
  • decision boundary is discontinuous

13
Decision Tree Example
[Figure: cases plotted in the (Income, Debt) plane]
14
Decision Tree Example
[Figure: first split, Income > t1, drawn as a vertical boundary at Income = t1; the remaining region is still mixed (??)]
15
Decision Tree Example
[Figure: second split, Debt > t2, adds a horizontal boundary at Debt = t2; one region is still mixed (??)]
16
Decision Tree Example
[Figure: third split, Income > t3, adds a vertical boundary at Income = t3]
17
Decision Tree Example
[Figure: the final partition of the (Income, Debt) plane by the splits Income > t1, Debt > t2, Income > t3]
Note: tree boundaries are piecewise linear and axis-parallel
18
Example: Titanic Data
  • On the Titanic
  • 1313 passengers
  • 34% survived
  • was it a random sample?
  • or did survival depend on features of the individual?
  • sex
  • age
  • class

  pclass survived                                            name     age    embarked    sex
1    1st        1                    Allen, Miss Elisabeth Walton 29.0000 Southampton female
2    1st        0                     Allison, Miss Helen Loraine  2.0000 Southampton female
3    1st        0             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton   male
4    1st        0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton female
5    1st        1                   Allison, Master Hudson Trevor  0.9167 Southampton   male
6    2nd        1                              Anderson, Mr Harry 47.0000 Southampton   male
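A minimal sketch of fitting a classification tree to data like these with rpart, assuming a data frame titanic with the columns shown above (the name is a placeholder):

library(rpart)
# grow a classification tree for survival from sex, age and class
fit <- rpart(factor(survived) ~ sex + age + pclass,
             data = titanic, method = "class")
plot(fit); text(fit)         # draw the tree with split labels
predict(fit, type = "prob")  # class probability estimates per leaf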
19
Decision trees
  • At the first split, decide which variable creates the best separation between the survivor and non-survivor cases

[Figure: candidate first split on sex, with the Female branch shown]
Goodness of a split is determined by the purity of the leaves
20
Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Examples are partitioned recursively to create
    pure subgroups
  • Purity measured by information gain, Gini index,
    entropy, etc
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • All leaf nodes are smaller than a specified
    threshold
  • BUT building too big a tree will overfit the data, and it will predict poorly
  • Predictions
  • each leaf has class probability estimates (CPE), based on the training data that ended up in that leaf
  • majority voting is employed for classifying all members of the leaf (a sketch of one greedy split search follows)
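A minimal R sketch of the greedy split search over a single numeric attribute, assuming vectors x (the attribute) and y (class labels), with the Gini index as the purity measure; an illustration, not any particular package's algorithm:

best_split <- function(x, y) {
  gini <- function(lab) 1 - sum((table(lab) / length(lab))^2)
  cuts <- head(sort(unique(x)), -1)   # candidate thresholds
  impurity <- sapply(cuts, function(t) {
    left <- x <= t                    # partition at threshold t
    mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
  })
  cuts[which.min(impurity)]           # threshold giving the purest children
}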

21
Purity in tree building
  • Why do we care about pure subgroups?
  • purity of the subgroup gives us confidence that
    new cases that fall into this leaf have a given
    label

22
Purity measures
  • If a data set T contains examples from n classes, the Gini index gini(T) is defined as
    gini(T) = 1 - Σ_j p_j^2
    where p_j is the relative frequency of class j in T
  • If T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, the Gini index of the split data is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
  • For the Titanic split on sex: (850/1313)(1 - 0.16^2 - 0.84^2) + (463/1313)(1 - 0.66^2 - 0.34^2) ≈ 0.33
  • The attribute providing the smallest gini_split(T) is chosen to split the node (this needs an enumeration of all possible splitting points for each attribute)
  • Another often-used measure: entropy (a numeric check follows)
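A quick R check of the computation above, using the survival proportions quoted on the slide (16% of the 850 males and 66% of the 463 females survive); a minimal sketch:

gini <- function(p) 1 - sum(p^2)   # gini(T) = 1 - sum_j p_j^2
g_male   <- gini(c(0.16, 0.84))    # ~ 0.269
g_female <- gini(c(0.66, 0.34))    # ~ 0.449
(850/1313) * g_male + (463/1313) * g_female  # weighted split index ~ 0.33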

23
Calculating Information Gain
Information Gain = Impurity(parent) - Impurity(children)

[Figure: entire population (30 instances) split on Balance - 17 instances with Balance > 50K, 13 instances with Balance < 50K]

(Weighted) average impurity of children = (17/30)(0.787) + (13/30)(0.39) = 0.615
Information Gain = Entropy(parent) - Entropy(children) = 0.996 - 0.615 = 0.38
24
Information Gain
Information Gain = Impurity(parent) - Impurity(children)

[Figure: the same 30-instance population A, now grown two levels deep; classes are bad risk (default) and good risk (not default)]
Root A: Impurity(A) = 0.996
Split A on Balance: Balance > 50K -> B, Balance < 50K -> C
  Impurity(B) = 0.787, Impurity(C) = 0.39, weighted Impurity(B,C) = 0.61; Gain = 0.996 - 0.61 = 0.38
Split B on Age: Age > 45 -> D, Age < 45 -> E
  Impurity(D) = -(1) log2(1) - (0) log2(0) = 0 (a pure node)
  Impurity(E) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
  weighted Impurity(D,E) = 0.405; Gain = 0.61 - 0.405 = 0.205
25
Information Gain
  • At each node, choose the attribute that obtains the maximum information gain, i.e., provides the most information

[Figure: the same tree as on the previous slide - the entire population A is split on Balance (> 50K -> B, < 50K -> C), and B is split on Age (> 45 -> D, < 45 -> E); Impurity(A) = 0.996, weighted Impurity(B,C) = 0.61 (Gain = 0.38), weighted Impurity(D,E) = 0.405 (Gain = 0.205); classes are bad risk (default) and good risk (not default)]
26
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • Result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which pruned tree is best

27
Which attribute to split over?
  • Brute-force search
  • At each node examine splits over each of the
    attributes
  • Select the attribute for which the maximum
    information gain is obtained

[Figure: the node's population split on Balance, with branches Balance > 50K and Balance < 50K]
28
Finding the right size
  • Use a hold-out sample (n-fold cross-validation)
  • Overfit a tree - with many leaves
  • snip the tree back and use the hold out sample
    for prediction, calculate predictive error
  • record error rate for each tree size
  • repeat for n folds
  • plot average error rate as a function of tree
    size
  • fit optimal tree size to the entire data set

R note: can use cv.tree() from the tree package (a sketch follows)
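A minimal sketch of this procedure with the tree package (function names are real; the painters data from the earlier slides stands in for whatever model is being fit):

library(tree)
library(MASS)                               # for the painters data
big <- tree(School ~ ., data = painters)    # deliberately overgrown tree
cv  <- cv.tree(big, FUN = prune.misclass)   # CV error for each tree size
plot(cv$size, cv$dev, type = "b")           # error rate vs. tree size
best <- cv$size[which.min(cv$dev)]
pruned <- prune.misclass(big, best = best)  # refit at the chosen size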
29
Olive oil data
               X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic
1 1.North-Apulia      1    1     1075          75     226  7823      672        36        60
2 2.North-Apulia      1    1     1088          73     224  7709      781        31        61
3 3.North-Apulia      1    1      911          54     246  8113      549        31        63
4 4.North-Apulia      1    1      966          57     240  7952      619        50        78
5 5.North-Apulia      1    1     1051          67     259  7771      672        50        80
6 6.North-Apulia      1    1      911          49     268  7924      678        51        70
  • classification of Italian olive oils by their
    components
  • 9 areas, from 3 regions

30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
Regression Trees
  • Trees can also be used for regression, when the response is real-valued
  • the leaf prediction is the mean value, instead of class probability estimates (CPE)
  • helpful with categorical predictors (a short sketch follows)
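A minimal regression-tree sketch with rpart, assuming the tips data of the next slide (e.g., reshape2::tips; the column names total_bill, size and tip are assumptions):

library(rpart)
data(tips, package = "reshape2")
rt <- rpart(tip ~ total_bill + size, data = tips,
            method = "anova")  # "anova" gives a regression tree
predict(rt)                    # each leaf predicts its mean tip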

34
Tips data
35
Treating Missing Data in Trees
  • Missing values are common in practice
  • Approaches to handling missing values
  • During training
  • Ignore rows with missing values (inefficient)
  • During testing
  • Send the example being classified down both branches and average the predictions
  • Replace missing values with an imputed value
  • Other approaches
  • Treat missing as a unique value (useful if missing values are correlated with the class)
  • Surrogate splits method
  • Search for and store surrogate variables/splits during training (see the sketch below)
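For example, rpart searches for surrogate splits by default; a minimal sketch of the relevant controls (the control parameters are real; titanic is the hypothetical data frame from earlier):

library(rpart)
fit <- rpart(factor(survived) ~ sex + age + pclass, data = titanic,
             method = "class",
             control = rpart.control(usesurrogate = 2,  # use surrogates, then majority branch
                                     maxsurrogate = 5)) # surrogates stored per split
summary(fit)  # lists the surrogate splits found at each node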

36
Other Issues with Classification Trees
  • Can use non-binary splits
  • Multi-way
  • Linear combinations
  • Tend to increase complexity substantially, and don't improve performance
  • Binary splits are interpretable, even by
    non-experts
  • Easy to compute, visualize
  • Model instability
  • A small change in the data can lead to a
    completely different tree
  • Model averaging techniques (like bagging) can be
    useful
  • Restricted to splits along coordinate axes
  • Discontinuities in prediction space

37
Why Trees are widely used in Practice
  • Can handle high dimensional data
  • builds a model using one dimension at a time
  • Can handle any type of input variables
  • categorical, real-valued, etc.
  • Invariant to monotonic transformations of input variables
  • E.g., using x, 10x + 2, log(x), 2x, etc., will not change the tree
  • So, scaling is not a factor - the user can be sloppy!
  • Trees are (somewhat) interpretable
  • domain expert can read off the tree's logic
  • Tree algorithms are relatively easy to code and
    test

38
Limitations of Trees
  • Representational bias
  • classification: piecewise linear boundaries, parallel to the axes
  • regression: piecewise constant surfaces
  • High variance
  • trees can be unstable as a function of the sample
  • e.g., a small change in the data -> a completely different tree
  • this causes two problems
  • 1. High variance contributes to prediction error
  • 2. High variance reduces interpretability
  • Trees are good candidates for model combining
  • Often used with boosting and bagging

39
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! This is a lack of stability against small perturbations of the data. Figure from Duda, Hart & Stork, Chap. 8.
40
Random Forests
  • Another con for trees
  • trees are sensitive to the primary split, which can lead the tree in inappropriate directions
  • one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
  • Solution
  • random forests: an ensemble of unpruned decision trees
  • each tree is built on a random subset of the training data
  • at each split point, only a random subset of predictors is considered
  • many parameters to fiddle with!
  • prediction is simply the majority vote of the trees (or the mean prediction of the trees)
  • Has the advantages of trees, with more robustness and a smoother decision rule
  • Also, they are trendy! (a minimal sketch follows)
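A minimal sketch with the randomForest package (package and arguments are real; titanic is the hypothetical data frame from earlier):

library(randomForest)
rf <- randomForest(factor(survived) ~ sex + age + pclass,
                   data = titanic, na.action = na.omit,
                   ntree = 500,  # number of bootstrapped trees
                   mtry = 2)     # predictors sampled at each split
predict(rf, type = "prob")       # vote shares across the trees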

41
Other Models: k-NN
  • k-Nearest Neighbors (kNN)
  • to classify a new point
  • look at its k nearest neighbors from the training set
  • i.e., look at the circle of radius r that includes these points
  • what is the class distribution within this circle?
  • Advantages
  • simple to understand
  • simple to implement
  • Disadvantages
  • what is k?
  • k = 1: high variance, sensitive to the data
  • k large: robust, reduces variance, but blends everything together - includes far-away points
  • what is "near"?
  • Euclidean distance assumes all inputs are equally important
  • how do you deal with categorical data?
  • no interpretable model
  • Best to use cross-validation and visualization techniques to pick k (a sketch follows)
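A minimal sketch with the class package's knn(), using the built-in iris data as a stand-in:

library(class)
idx   <- c(1:40, 51:90, 101:140)      # an arbitrary train/test split
train <- iris[idx, 1:4]               # training features
test  <- iris[-idx, 1:4]              # held-out features
cl    <- iris$Species[idx]            # training labels
pred  <- knn(train, test, cl, k = 5)  # majority vote of 5 neighbors
mean(pred != iris$Species[-idx])      # test error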

42
Probabilistic (Bayesian) Models for Classification
If you belong to class k, you have a distribution p(x | ck) over input vectors.
Then, given priors p(ck), we can get the posterior distribution on classes:
p(ck | x) = p(x | ck) p(ck) / Σ_j p(x | cj) p(cj)
At each point in x space we then have a predicted class vector, allowing for decision boundaries.
43
Example of Probabilistic Classification
[Figure: class-conditional densities p(x | c1) and p(x | c2), with the posterior p(c1 | x) plotted on a 0-1 scale]
44
Example of Probabilistic Classification
[Figure: as above - class-conditional densities p(x | c1) and p(x | c2), and the resulting posterior p(c1 | x) on a 0-1 scale]
45
Decision Regions and Bayes Error Rate
[Figure: densities p(x | c1) and p(x | c2) over x, with the x-axis partitioned into alternating decision regions labelled c2, c1, c2, c1, c2]
Optimal decision regions: regions where one class is more likely
Optimal decision regions -> optimal decision boundaries
46
Decision Regions and Bayes Error Rate
[Figure: the same densities and decision regions, with the misclassified probability mass shaded]
Optimal decision regions: regions where one class is more likely
Optimal decision regions -> optimal decision boundaries
Bayes error rate: the fraction of examples misclassified by the optimal classifier (the shaded area above),
p(error) = ∫ [1 - max_k p(ck | x)] p(x) dx
If max_k p(ck | x) = 1 everywhere, then there is no error.
47
Procedure for optimal Bayes classifier
  • For each class, learn a model p(x | ck)
  • E.g., each class is multivariate Gaussian with its own mean and covariance
  • Use Bayes' rule to obtain p(ck | x)
  • => this yields the optimal decision regions/boundaries
  • => use these decision regions/boundaries for classification
  • Correct in theory, but practical problems include:
  • How do we model p(x | ck)?
  • Even if we know the model for p(x | ck), modeling a distribution or density is very difficult in high dimensions (e.g., p = 100)
  • Alternative approach: model the decision boundaries directly (a sketch of the generative route follows)
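A minimal sketch of the generative procedure in R with one-dimensional Gaussian class models (all numbers invented for illustration):

# class-conditional models p(x | ck): one Gaussian per class
mu    <- c(c1 = 0, c2 = 3)      # class means (assumed already learned)
sigma <- c(c1 = 1, c2 = 1.5)    # class sds (assumed already learned)
prior <- c(c1 = 0.7, c2 = 0.3)  # priors p(ck)

posterior <- function(x) {
  lik <- dnorm(x, mean = mu, sd = sigma)  # p(x | ck) for each class
  un  <- lik * prior                      # p(x | ck) p(ck)
  un / sum(un)                            # Bayes' rule: p(ck | x)
}
posterior(1.5)  # predicted class vector at x = 1.5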

48
Bayesian Classification: Why?
  • Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
  • Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
  • Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

49
Naïve Bayes Classifiers
  • Generative probabilistic model with a conditional independence assumption on p(x | ck), i.e.
    p(x | ck) = Π_j p(xj | ck)
  • Typically used with nominal variables
  • Real-valued variables discretized to create nominal versions
  • Comments
  • Simple to train (just estimate the conditional probabilities for each feature-class pair)
  • Often works surprisingly well in practice
  • e.g., state of the art for text classification, and the basis of many widely used spam filters (a sketch follows)
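A minimal sketch with the e1071 package's naiveBayes(), shown on the built-in iris data rather than the slides' examples:

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)  # estimates p(xj | ck) per feature
predict(nb, iris[1:5, ], type = "raw")      # posterior class probabilities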

50
Naïve Bayes
  • When all variables are categorical, classification should be easy (since all x's can be enumerated)

But, remember the curse of dimensionality!
51
Naïve Bayes Classification
Recall p(ck | x) ∝ p(x | ck) p(ck). Now assume the variables are conditionally independent given the class:
p(x | ck) = Π_j p(xj | ck)
  • is this a valid assumption? Probably not, but it may still be useful
  • example: symptoms and diseases

52
Naïve Bayes
Estimate of the probability that a point x will belong to ck:
p(ck | x) ∝ p(ck) Π_j p(xj | ck)
If there are two classes, the log-odds decompose into "weights of evidence":
log[ p(c1 | x) / p(c2 | x) ] = log[ p(c1) / p(c2) ] + Σ_j log[ p(xj | c1) / p(xj | c2) ]
53
Play-tennis example: estimating P(xi | C)
outlook
  P(sunny | y)    = 2/9   P(sunny | n)    = 3/5
  P(overcast | y) = 4/9   P(overcast | n) = 0
  P(rain | y)     = 3/9   P(rain | n)     = 2/5
temperature
  P(hot | y)  = 2/9   P(hot | n)  = 2/5
  P(mild | y) = 4/9   P(mild | n) = 2/5
  P(cool | y) = 3/9   P(cool | n) = 1/5
humidity
  P(high | y)   = 3/9   P(high | n)   = 4/5
  P(normal | y) = 6/9   P(normal | n) = 2/5
windy
  P(true | y)  = 3/9   P(true | n)  = 3/5
  P(false | y) = 6/9   P(false | n) = 2/5
P(y) = 9/14
P(n) = 5/14
54
Play-tennis example: classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X | y) P(y) = P(rain | y) P(hot | y) P(high | y) P(false | y) P(y) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
  • P(X | n) P(n) = P(rain | n) P(hot | n) P(high | n) P(false | n) P(n) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
  • Sample X is classified in class n (you'll lose!) - a quick check follows
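A one-line R check of the two products above (numbers straight from the slide):

c(yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14),  # 0.010582
  no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14))  # 0.018286 -> classify as n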

55
The independence hypothesis
  • makes computation possible
  • yields optimal classifiers when satisfied
  • but is seldom satisfied in practice, as attributes (variables) are often correlated
  • Yet, empirically, naïve Bayes performs really well in practice

56
Lab 5
  • Olive Oil Data
  • from the Cook and Swayne book
  • consists of the composition of fatty acids found in the lipid fraction of Italian olive oils; the study was done to determine the authenticity of olive oils
  • region (North, South, and Sardinia)
  • area (nine areas)
  • the fatty acids, as percentages

57
Lab 5
  • Spam Data
  • collected at Iowa State University in 2003 (Cook and Swayne)
  • 2171 cases
  • 21 variables
  • be careful - 3 variables (spampct, category, and spam) were determined by spam models - do not use these for fitting!
  • Goal: determine spam from valid mail