1
Classification and Supervised Learning
  • Credits
  • Hand, Mannila and Smyth
  • Cook and Swayne
  • Padhraic Smyth's notes
  • Shawndra Hill's notes

2
Outline
  • Supervised Learning Overview
  • Linear Discriminant Analysis
  • Tree models
  • Probability-based and Bayes models

3
Classification
  • Classification, or supervised learning
  • prediction for a categorical response
  • for binary (T/F) responses, can be used as an alternative to logistic regression
  • the response is often a quantized real value or unscaled numeric
  • can be used with categorical predictors
  • great for missing data - missingness can be a response in itself!
  • methods for fitting can be
  • parametric
  • algorithmic

4
  • Because labels are known, you can build
    parametric models for the classes
  • can also define decision regions and decision
    boundaries

5
Examples of classifiers
  • Generative/class-conditional/probabilistic, based on p(x | ck)
  • Naïve Bayes (simple, but often effective in high dimensions)
  • Parametric generative models, e.g., Gaussian - linear discriminant analysis
  • Regression-based, based on p(ck | x)
  • Logistic regression - simple, linear in log-odds space
  • Neural network - non-linear extension of logistic regression
  • Discriminative models - focus on locating optimal decision boundaries
  • Decision trees - a Swiss army knife, often effective in high dimensions
  • Linear discriminants
  • Support vector machines (SVM) - generalization of linear discriminants; can be quite effective, but computational complexity is an issue
  • Nearest neighbor - simple, but can scale poorly in high dimensions

6
Evaluation of Classifiers
  • Already seen some of this
  • Assume the output is a probability vector over the classes
  • Classification error
  • P(true Y ≠ predicted Y)
  • ROC area
  • area under the ROC plot
  • top-k analysis
  • sometimes all you care about is how well you can do at the top of the list
  • plan A: top 50 candidates have 44 sales, top 500 have 300 sales
  • plan B: top 50 have 48 sales, top 500 have 270 sales
  • which do you choose?
  • often used with imbalanced class distributions - good classification error is easy!
  • fraud, etc.
  • calibration is sometimes important
  • if you say something has a 90% chance, does it? (a small R sketch of the first two measures follows)
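A minimal R sketch of classification error and ROC area, with made-up labels y and predicted probabilities p (both hypothetical):

y <- c(0, 0, 1, 1, 1, 0, 1, 0)                  # true 0/1 labels (made up)
p <- c(0.1, 0.6, 0.8, 0.4, 0.9, 0.3, 0.6, 0.2)  # predicted P(Y = 1) (made up)
mean((p > 0.5) != y)                  # classification error at a 0.5 cutoff
mean(outer(p[y == 1], p[y == 0], ">"))  # ROC area: P(random positive outranks random negative)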

7
Linear Discriminant Analysis
  • LDA - parametric classification
  • Fisher (1936); Rao (1948)
  • a linear combination of variables separating two classes, found by comparing the difference between the class means with the variance within each class
  • assumes a multivariate normal distribution for each class (cluster)
  • pros
  • easy to define likelihood
  • easy to define boundary
  • easy to measure goodness of fit
  • easy interpretation
  • cons
  • very rare for data to come close to multivariate normal!
  • works only on numeric predictors

8
  • painters data: 54 painters rated on a scale of 0 to 20 for composition, drawing, colour and expression, and classified into 8 schools

                  Composition Drawing Colour Expression School
  Da Udine                 10       8     16          3      A
  Da Vinci                 15      16      4         14      A
  Del Piombo                8      13     16          7      A
  Del Sarto                12      16      9          8      A
  Fr. Penni                 0      15      8          0      A
  Guilio Romano            15      16      4         14      A
  Michelangelo              8      17      4          8      A
  Perino del Vaga          15      16      7          6      A
  Perugino                  4      12     10          4      A
  Raphael                  17      18     12         18      A

library(MASS)
lda1 <- lda(School ~ ., data = painters)
9
(No Transcript)
10
(No Transcript)
11
LDA - predictions
  • to check how good the model is, you can see how
    well it predicts what actually happened

> predict(lda1)$class
 [1] D H D A A H A C A A A A A C A B B E C C B E D D D D G D D D D D E D G H E E E F G A F D G A G G E
[50] G C H H H
Levels: A B C D E F G H
> predict(lda1)$posterior
                      A            B            C            D           E            F
Da Udine   0.0153311094 0.0059952857 0.0105980288 6.717937e-01 0.124938731 2.913817e-03
Da Vinci   0.1023448947 0.1963312180 0.1155149000 4.444461e-05 0.016182391 1.942920e-02
Del Piombo 0.1763906259 0.0142589568 0.0064792116 6.351212e-01 0.102924883 9.080713e-03
Del Sarto  0.4549047647 0.2079127774 0.1459033415 2.166203e-02 0.146171796 3.716302e-03
...
> table(predict(lda1)$class, painters$Sch)

    A B C D E F G H
  A 5 4 0 0 0 1 1 0
  B 0 1 2 0 0 0 0 0
  C 1 1 2 0 0 0 0 1
  D 2 0 0 9 1 0 1 0
  E 0 0 2 0 4 0 1 0
  F 0 0 0 0 0 2 0 0
  G 0 0 0 1 1 1 4 0
  H 2 0 0 0 1 0 0 3
12
Classification (Decision) Trees
  • Trees are one of the most popular and useful of
    all data mining models
  • Algorithmic version of classification
  • no distributional assumptions
  • Competing algorithms: CART, C4.5, DBMiner
  • Pros
  • no distributional assumptions
  • can handle real and nominal inputs
  • speed and scalability
  • robustness to outliers and missing values
  • interpretability
  • compactness of classification rules
  • Cons
  • interpretability?
  • several tuning parameters to set with little guidance
  • decision boundary is discontinuous

13
Decision Tree Example
[Figure: cases plotted in the (Income, Debt) plane]
14
Decision Tree Example
[Figure: first split, Income > t1, drawn as a vertical boundary at Income = t1; the remaining region is still mixed (??)]
15
Decision Tree Example
[Figure: second split, Debt > t2, adds a horizontal boundary at Debt = t2; one region is still mixed (??)]
16
Decision Tree Example
[Figure: third split, Income > t3, adds a vertical boundary at Income = t3]
17
Decision Tree Example
[Figure: the final partition of the (Income, Debt) plane by the splits Income > t1, Debt > t2, Income > t3]
Note: tree boundaries are piecewise linear and axis-parallel
18
Example: Titanic Data
  • On the Titanic
  • 1313 passengers
  • 34% survived
  • was it a random sample?
  • or did survival depend on features of the individual?
  • sex
  • age
  • class

  pclass survived                                            name     age    embarked    sex
1    1st        1                    Allen, Miss Elisabeth Walton 29.0000 Southampton female
2    1st        0                     Allison, Miss Helen Loraine  2.0000 Southampton female
3    1st        0             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton   male
4    1st        0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton female
5    1st        1                   Allison, Master Hudson Trevor  0.9167 Southampton   male
6    2nd        1                              Anderson, Mr Harry 47.0000 Southampton   male
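A minimal sketch of fitting a classification tree to data like these with rpart, assuming a data frame titanic with the columns shown above (the name is a placeholder):

library(rpart)
# grow a classification tree for survival from sex, age and class
fit <- rpart(factor(survived) ~ sex + age + pclass,
             data = titanic, method = "class")
plot(fit); text(fit)         # draw the tree with split labels
predict(fit, type = "prob")  # class probability estimates per leaf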
19
Decision trees
  • At the first split, decide which variable creates the best separation between the survivor and non-survivor cases

[Figure: candidate first split on sex, with the Female branch shown]
Goodness of a split is determined by the purity of the leaves
20
Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Examples are partitioned recursively to create
    pure subgroups
  • Purity measured by information gain, Gini index,
    entropy, etc
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • All leaf nodes are smaller than a specified
    threshold
  • BUT building too big a tree will overfit the data, and it will predict poorly
  • Predictions
  • each leaf has class probability estimates (CPE), based on the training data that ended up in that leaf
  • majority voting is employed for classifying all members of the leaf (a sketch of one greedy split search follows)
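A minimal R sketch of the greedy split search over a single numeric attribute, assuming vectors x (the attribute) and y (class labels), with the Gini index as the purity measure; an illustration, not any particular package's algorithm:

best_split <- function(x, y) {
  gini <- function(lab) 1 - sum((table(lab) / length(lab))^2)
  cuts <- head(sort(unique(x)), -1)   # candidate thresholds
  impurity <- sapply(cuts, function(t) {
    left <- x <= t                    # partition at threshold t
    mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
  })
  cuts[which.min(impurity)]           # threshold giving the purest children
}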

21
Purity in tree building
  • Why do we care about pure subgroups?
  • purity of the subgroup gives us confidence that
    new cases that fall into this leaf have a given
    label

22
Purity measures
  • If a data set T contains examples from n classes, the Gini index gini(T) is defined as
    gini(T) = 1 - Σ_j p_j^2
    where p_j is the relative frequency of class j in T
  • If T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, the Gini index of the split data is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
  • For the Titanic split on sex: (850/1313)(1 - 0.16^2 - 0.84^2) + (463/1313)(1 - 0.66^2 - 0.34^2) ≈ 0.33
  • The attribute providing the smallest gini_split(T) is chosen to split the node (this needs an enumeration of all possible splitting points for each attribute)
  • Another often-used measure: entropy (a numeric check follows)
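A quick R check of the computation above, using the survival proportions quoted on the slide (16% of the 850 males and 66% of the 463 females survive); a minimal sketch:

gini <- function(p) 1 - sum(p^2)   # gini(T) = 1 - sum_j p_j^2
g_male   <- gini(c(0.16, 0.84))    # ~ 0.269
g_female <- gini(c(0.66, 0.34))    # ~ 0.449
(850/1313) * g_male + (463/1313) * g_female  # weighted split index ~ 0.33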

23
Calculating Information Gain
Information Gain = Impurity(parent) - Impurity(children)

[Figure: entire population (30 instances) split on Balance - 17 instances with Balance > 50K, 13 instances with Balance < 50K]

(Weighted) average impurity of children = (17/30)(0.787) + (13/30)(0.39) = 0.615
Information Gain = Entropy(parent) - Entropy(children) = 0.996 - 0.615 = 0.38
24
Information Gain
Information Gain = Impurity(parent) - Impurity(children)

[Figure: the same 30-instance population A, now grown two levels deep; classes are bad risk (default) and good risk (not default)]
Root A: Impurity(A) = 0.996
Split A on Balance: Balance > 50K -> B, Balance < 50K -> C
  Impurity(B) = 0.787, Impurity(C) = 0.39, weighted Impurity(B,C) = 0.61; Gain = 0.996 - 0.61 = 0.38
Split B on Age: Age > 45 -> D, Age < 45 -> E
  Impurity(D) = -(1) log2(1) - (0) log2(0) = 0 (a pure node)
  Impurity(E) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985
  weighted Impurity(D,E) = 0.405; Gain = 0.61 - 0.405 = 0.205
25
Information Gain
  • At each node, choose the attribute that obtains the maximum information gain, i.e., provides the most information

[Figure: the same tree as on the previous slide - the entire population A is split on Balance (> 50K -> B, < 50K -> C), and B is split on Age (> 45 -> D, < 45 -> E); Impurity(A) = 0.996, weighted Impurity(B,C) = 0.61 (Gain = 0.38), weighted Impurity(D,E) = 0.405 (Gain = 0.205); classes are bad risk (default) and good risk (not default)]
26
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • Result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which pruned tree is best

27
Which attribute to split over?
  • Brute-force search
  • At each node examine splits over each of the
    attributes
  • Select the attribute for which the maximum
    information gain is obtained

[Figure: the node's population split on Balance, with branches Balance > 50K and Balance < 50K]
28
Finding the right size
  • Use a hold-out sample (n-fold cross-validation)
  • Overfit a tree - with many leaves
  • snip the tree back and use the hold out sample
    for prediction, calculate predictive error
  • record error rate for each tree size
  • repeat for n folds
  • plot average error rate as a function of tree
    size
  • fit optimal tree size to the entire data set

R note: can use cv.tree() from the tree package (a sketch follows)
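A minimal sketch of this procedure with the tree package (function names are real; the painters data from the earlier slides stands in for whatever model is being fit):

library(tree)
library(MASS)                               # for the painters data
big <- tree(School ~ ., data = painters)    # deliberately overgrown tree
cv  <- cv.tree(big, FUN = prune.misclass)   # CV error for each tree size
plot(cv$size, cv$dev, type = "b")           # error rate vs. tree size
best <- cv$size[which.min(cv$dev)]
pruned <- prune.misclass(big, best = best)  # refit at the chosen size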
29
Olive oil data
               X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic
1 1.North-Apulia      1    1     1075          75     226  7823      672        36        60
2 2.North-Apulia      1    1     1088          73     224  7709      781        31        61
3 3.North-Apulia      1    1      911          54     246  8113      549        31        63
4 4.North-Apulia      1    1      966          57     240  7952      619        50        78
5 5.North-Apulia      1    1     1051          67     259  7771      672        50        80
6 6.North-Apulia      1    1      911          49     268  7924      678        51        70
  • classification of Italian olive oils by their
    components
  • 9 areas, from 3 regions

30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
Regression Trees
  • Trees can also be used for regression, when the response is real-valued
  • the leaf prediction is the mean value, instead of class probability estimates (CPE)
  • helpful with categorical predictors (a short sketch follows)
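A minimal regression-tree sketch with rpart, assuming the tips data of the next slide (e.g., reshape2::tips; the column names total_bill, size and tip are assumptions):

library(rpart)
data(tips, package = "reshape2")
rt <- rpart(tip ~ total_bill + size, data = tips,
            method = "anova")  # "anova" gives a regression tree
predict(rt)                    # each leaf predicts its mean tip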

34
Tips data
35
Treating Missing Data in Trees
  • Missing values are common in practice
  • Approaches to handling missing values
  • During training
  • Ignore rows with missing values (inefficient)
  • During testing
  • Send the example being classified down both branches and average the predictions
  • Replace missing values with an imputed value
  • Other approaches
  • Treat missing as a unique value (useful if missing values are correlated with the class)
  • Surrogate splits method
  • Search for and store surrogate variables/splits during training (see the sketch below)
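For example, rpart searches for surrogate splits by default; a minimal sketch of the relevant controls (the control parameters are real; titanic is the hypothetical data frame from earlier):

library(rpart)
fit <- rpart(factor(survived) ~ sex + age + pclass, data = titanic,
             method = "class",
             control = rpart.control(usesurrogate = 2,  # use surrogates, then majority branch
                                     maxsurrogate = 5)) # surrogates stored per split
summary(fit)  # lists the surrogate splits found at each node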

36
Other Issues with Classification Trees
  • Can use non-binary splits
  • Multi-way
  • Linear combinations
  • Tend to increase complexity substantially, and don't improve performance
  • Binary splits are interpretable, even by
    non-experts
  • Easy to compute, visualize
  • Model instability
  • A small change in the data can lead to a
    completely different tree
  • Model averaging techniques (like bagging) can be
    useful
  • Restricted to splits along coordinate axes
  • Discontinuities in prediction space

37
Why Trees are widely used in Practice
  • Can handle high dimensional data
  • builds a model using one dimension at a time
  • Can handle any type of input variables
  • categorical, real-valued, etc.
  • Invariant to monotonic transformations of input variables
  • E.g., using x, 10x + 2, log(x), 2x, etc., will not change the tree
  • So, scaling is not a factor - the user can be sloppy!
  • Trees are (somewhat) interpretable
  • domain expert can read off the tree's logic
  • Tree algorithms are relatively easy to code and
    test

38
Limitations of Trees
  • Representational bias
  • classification: piecewise linear boundaries, parallel to the axes
  • regression: piecewise constant surfaces
  • High variance
  • trees can be unstable as a function of the sample
  • e.g., a small change in the data -> a completely different tree
  • this causes two problems
  • 1. High variance contributes to prediction error
  • 2. High variance reduces interpretability
  • Trees are good candidates for model combining
  • Often used with boosting and bagging

39
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! This is a lack of stability against small perturbations of the data. Figure from Duda, Hart & Stork, Chap. 8.
40
Random Forests
  • Another con for trees
  • trees are sensitive to the primary split, which can lead the tree in inappropriate directions
  • one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
  • Solution
  • random forests: an ensemble of unpruned decision trees
  • each tree is built on a random subset of the training data
  • at each split point, only a random subset of predictors is considered
  • many parameters to fiddle with!
  • prediction is simply the majority vote of the trees (or the mean prediction of the trees)
  • Has the advantages of trees, with more robustness and a smoother decision rule
  • Also, they are trendy! (a minimal sketch follows)
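A minimal sketch with the randomForest package (package and arguments are real; titanic is the hypothetical data frame from earlier):

library(randomForest)
rf <- randomForest(factor(survived) ~ sex + age + pclass,
                   data = titanic, na.action = na.omit,
                   ntree = 500,  # number of bootstrapped trees
                   mtry = 2)     # predictors sampled at each split
predict(rf, type = "prob")       # vote shares across the trees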

41
Other Models: k-NN
  • k-Nearest Neighbors (kNN)
  • to classify a new point
  • look at its k nearest neighbors from the training set
  • i.e., look at the circle of radius r that includes these points
  • what is the class distribution within this circle?
  • Advantages
  • simple to understand
  • simple to implement
  • Disadvantages
  • what is k?
  • k = 1: high variance, sensitive to the data
  • k large: robust, reduces variance, but blends everything together - includes far-away points
  • what is "near"?
  • Euclidean distance assumes all inputs are equally important
  • how do you deal with categorical data?
  • no interpretable model
  • Best to use cross-validation and visualization techniques to pick k (a sketch follows)
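A minimal sketch with the class package's knn(), using the built-in iris data as a stand-in:

library(class)
idx   <- c(1:40, 51:90, 101:140)      # an arbitrary train/test split
train <- iris[idx, 1:4]               # training features
test  <- iris[-idx, 1:4]              # held-out features
cl    <- iris$Species[idx]            # training labels
pred  <- knn(train, test, cl, k = 5)  # majority vote of 5 neighbors
mean(pred != iris$Species[-idx])      # test error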

42
Probabilistic (Bayesian) Models for Classification
If you belong to class k, you have a distribution p(x | ck) over input vectors.
Then, given priors p(ck), we can get the posterior distribution on classes:
p(ck | x) = p(x | ck) p(ck) / Σ_j p(x | cj) p(cj)
At each point in x space we then have a predicted class vector, allowing for decision boundaries.
43
Example of Probabilistic Classification
[Figure: class-conditional densities p(x | c1) and p(x | c2), with the posterior p(c1 | x) plotted on a 0-1 scale]
44
Example of Probabilistic Classification
[Figure: as above - class-conditional densities p(x | c1) and p(x | c2), and the resulting posterior p(c1 | x) on a 0-1 scale]
45
Decision Regions and Bayes Error Rate
[Figure: densities p(x | c1) and p(x | c2) over x, with the x-axis partitioned into alternating decision regions labelled c2, c1, c2, c1, c2]
Optimal decision regions: regions where one class is more likely
Optimal decision regions -> optimal decision boundaries
46
Decision Regions and Bayes Error Rate
[Figure: the same densities and decision regions, with the misclassified probability mass shaded]
Optimal decision regions: regions where one class is more likely
Optimal decision regions -> optimal decision boundaries
Bayes error rate: the fraction of examples misclassified by the optimal classifier (the shaded area above),
p(error) = ∫ [1 - max_k p(ck | x)] p(x) dx
If max_k p(ck | x) = 1 everywhere, then there is no error.
47
Procedure for optimal Bayes classifier
  • For each class, learn a model p(x | ck)
  • E.g., each class is multivariate Gaussian with its own mean and covariance
  • Use Bayes' rule to obtain p(ck | x)
  • => this yields the optimal decision regions/boundaries
  • => use these decision regions/boundaries for classification
  • Correct in theory, but practical problems include:
  • How do we model p(x | ck)?
  • Even if we know the model for p(x | ck), modeling a distribution or density is very difficult in high dimensions (e.g., p = 100)
  • Alternative approach: model the decision boundaries directly (a sketch of the generative route follows)
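A minimal sketch of the generative procedure in R with one-dimensional Gaussian class models (all numbers invented for illustration):

# class-conditional models p(x | ck): one Gaussian per class
mu    <- c(c1 = 0, c2 = 3)      # class means (assumed already learned)
sigma <- c(c1 = 1, c2 = 1.5)    # class sds (assumed already learned)
prior <- c(c1 = 0.7, c2 = 0.3)  # priors p(ck)

posterior <- function(x) {
  lik <- dnorm(x, mean = mu, sd = sigma)  # p(x | ck) for each class
  un  <- lik * prior                      # p(x | ck) p(ck)
  un / sum(un)                            # Bayes' rule: p(ck | x)
}
posterior(1.5)  # predicted class vector at x = 1.5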

48
Bayesian Classification: Why?
  • Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
  • Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
  • Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

49
Naïve Bayes Classifiers
  • Generative probabilistic model with a conditional independence assumption on p(x | ck), i.e.
    p(x | ck) = Π_j p(xj | ck)
  • Typically used with nominal variables
  • Real-valued variables discretized to create nominal versions
  • Comments
  • Simple to train (just estimate the conditional probabilities for each feature-class pair)
  • Often works surprisingly well in practice
  • e.g., state of the art for text classification, and the basis of many widely used spam filters (a sketch follows)
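A minimal sketch with the e1071 package's naiveBayes(), shown on the built-in iris data rather than the slides' examples:

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)  # estimates p(xj | ck) per feature
predict(nb, iris[1:5, ], type = "raw")      # posterior class probabilities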

50
Naïve Bayes
  • When all variables are categorical, classification should be easy (since all x's can be enumerated)

But, remember the curse of dimensionality!
51
Naïve Bayes Classification
Recall p(ck | x) ∝ p(x | ck) p(ck). Now assume the variables are conditionally independent given the class:
p(x | ck) = Π_j p(xj | ck)
  • is this a valid assumption? Probably not, but it may still be useful
  • example: symptoms and diseases

52
Naïve Bayes
Estimate of the probability that a point x will belong to ck:
p(ck | x) ∝ p(ck) Π_j p(xj | ck)
If there are two classes, the log-odds decompose into "weights of evidence":
log[ p(c1 | x) / p(c2 | x) ] = log[ p(c1) / p(c2) ] + Σ_j log[ p(xj | c1) / p(xj | c2) ]
53
Play-tennis example: estimating P(xi | C)
outlook
  P(sunny | y)    = 2/9   P(sunny | n)    = 3/5
  P(overcast | y) = 4/9   P(overcast | n) = 0
  P(rain | y)     = 3/9   P(rain | n)     = 2/5
temperature
  P(hot | y)  = 2/9   P(hot | n)  = 2/5
  P(mild | y) = 4/9   P(mild | n) = 2/5
  P(cool | y) = 3/9   P(cool | n) = 1/5
humidity
  P(high | y)   = 3/9   P(high | n)   = 4/5
  P(normal | y) = 6/9   P(normal | n) = 2/5
windy
  P(true | y)  = 3/9   P(true | n)  = 3/5
  P(false | y) = 6/9   P(false | n) = 2/5
P(y) = 9/14
P(n) = 5/14
54
Play-tennis example: classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X | y) P(y) = P(rain | y) P(hot | y) P(high | y) P(false | y) P(y) = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
  • P(X | n) P(n) = P(rain | n) P(hot | n) P(high | n) P(false | n) P(n) = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
  • Sample X is classified in class n (you'll lose!) - a quick check follows
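A one-line R check of the two products above (numbers straight from the slide):

c(yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14),  # 0.010582
  no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14))  # 0.018286 -> classify as n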

55
The independence hypothesis
  • makes computation possible
  • yields optimal classifiers when satisfied
  • but is seldom satisfied in practice, as attributes (variables) are often correlated
  • Yet, empirically, naïve Bayes performs really well in practice

56
Lab 5
  • Olive Oil Data
  • from the Cook and Swayne book
  • consists of the composition of fatty acids found in the lipid fraction of Italian olive oils; the study was done to determine the authenticity of olive oils
  • region (North, South, and Sardinia)
  • area (nine areas)
  • the fatty acids, as percentages

57
Lab 5
  • Spam Data
  • collected at Iowa State University in 2003 (Cook and Swayne)
  • 2171 cases
  • 21 variables
  • be careful - 3 variables (spampct, category, and spam) were determined by spam models - do not use these for fitting!
  • Goal: determine spam from valid mail