1
Chapter 7 Classification and Regression Trees
2
Introduction
  • What is a classification tree?
  • The figure on the next slide describes a tree for
    classifying bank customers who receive a loan
    offer as either acceptors or non-acceptors, as a
    function of information such as their income,
    education level, and average credit card
    expenditure.
  • Consider the tree in the example.
  • The square "terminal nodes" are marked with 0 or
    1, corresponding to a non-acceptor (0) or acceptor
    (1).
  • The values in the circle nodes give the splitting
    value on a predictor.
  • This tree can easily be translated into a set of
    rules for classifying a bank customer.
  • For example, the middle-left square node in this
    tree gives us the rule
  • IF (Income > 92.5) AND (Education < 1.5) AND
    (Family ≤ 2.5) THEN Class = 0 (non-acceptor).

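Expressed as code, that rule is just an explicit
conditional. Below is a minimal Python sketch; the
function name is made up, and only the single branch
quoted above is filled in, since the rest of the tree
appears in the figure rather than in the text.

def classify_customer(income, education, family):
    # Only the rule quoted above is reproduced; the remaining
    # branches of the tree are placeholders, as the text does
    # not spell them out.
    if income > 92.5 and education < 1.5 and family <= 2.5:
        return 0  # non-acceptor
    raise NotImplementedError("other branches of the tree not shown")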
3
(Figure: the classification tree for the loan-offer
example)
4
Classification Trees
  • There are two key ideas underlying classification
    trees.
  • The first is the idea of recursive partitioning
    of the space of the independent variables.
  • The second is pruning the tree using validation
    data.
  • Because the validation data are thus used in
    building the model, a third partition (a test set)
    is needed to evaluate performance.
  • In the following we describe recursive
    partitioning, and subsequent sections explain the
    pruning methodology.

5
Recursive Partitioning
  • Recursive partitioning divides the p-dimensional
    space of the x variables into non-overlapping
    multi-dimensional rectangles.
  • The x variables here are considered to be
    continuous, binary, or ordinal.
  • This division is accomplished recursively (i.e.,
    operating on the results of prior divisions).
  • First, one of the variables is selected, say x_i,
    and a value of x_i, say s_i, is chosen to split the
    p-dimensional space into two parts: one part that
    contains all the points with x_i ≤ s_i and the
    other with all the points with x_i > s_i.
  • Then one of these two parts is divided in a
    similar manner by choosing a variable again (it
    could be xi or another variable) and a split
    value for the variable. This results in three
    (multi-dimensional) rectangular regions.

6
Recursive Partitioning
  • This process is continued so that we get smaller
    and smaller rectangular regions.
  • The idea is to divide the entire x-space up into
    rectangles such that each rectangle is as
    homogeneous, or "pure," as possible.
  • By "pure" we mean containing points that belong
    to just one class.
  • (Of course, this is not always possible, as there
    may be points that belong to different classes
    but have exactly the same values for every one of
    the independent variables.)
  • Let us illustrate recursive partitioning with an
    example.

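Before turning to the example, here is a minimal
sketch of a single partitioning step, assuming NumPy
and integer class labels. The function name and the
use of misclassification rate as the homogeneity
score are illustrative; formal impurity measures are
defined on the following slides.

import numpy as np

def best_split(X, y):
    # Scan every variable x_i and candidate value s_i; return the
    # pair whose two half-spaces are most homogeneous, scored by
    # the weighted misclassification rate (1 - majority fraction).
    n, p = X.shape
    best = (None, None, np.inf)
    for i in range(p):                  # candidate splitting variable x_i
        for s in np.unique(X[:, i]):    # candidate split value s_i
            left, right = y[X[:, i] <= s], y[X[:, i] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            err = lambda part: 1 - np.bincount(part).max() / len(part)
            score = (len(left) * err(left) + len(right) * err(right)) / n
            if score < best[2]:
                best = (i, s, score)
    return best                         # (variable index, split value, score)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # -> (0, 2.0, 0.0): splitting x_0 at 2.0 is pure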
7
Riding Mowers
(Figure: splitting the 24 observations by Lot Size,
with a split value of 19, chosen to reduce impurity
within each rectangle)
8
Measures of Impurity
  • There are a number of ways to measure impurity.
    The two most popular measures are
  • the Gini index and
  • the entropy measure
  • Denote the m classes of the response variable by
    k = 1, 2, ..., m.
  • Both the Gini index and the entropy measure are
    functions of p_k.
  • For a rectangle A, p_k is the proportion of
    observations in rectangle A that belong to class
    k.

9
Gini Index
Gini(A) = 1 - Σ p_k²  (summing over the m classes)
(Figure: values of the Gini index for a two-class
case, as a function of the proportion of
observations in class 1, p_1)
10
Entropy Index
Entropy(A) = -Σ p_k log2(p_k)  (summing over the m
classes). This measure ranges between 0 (most pure,
all observations belong to the same class) and
log2(m) (when all m classes are equally
represented). In the two-class case, the entropy
measure is maximized (like the Gini index) at
p_1 = 0.5.
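Both measures are easy to compute from the class
proportions p_k. A minimal Python sketch (function
names are illustrative):

import numpy as np

def gini(p):
    # Gini index: 1 - sum(p_k^2); 0 when pure, maximal at equal shares.
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy: -sum(p_k * log2(p_k)); 0 when pure, log2(m) at equal shares.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # both at their two-class maximum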
15
Evaluating the Performance of a Classification
Tree
  • Avoiding Overfitting
  • Too many rectangles implies too many splits
    (branches)
  • Solutions
  • Stopping tree growth: CHAID
  • Pruning the Tree

16
Stopping Tree Growth: CHAID
  • CHAID (Chi-Squared Automatic Interaction
    Detection) is a recursive partitioning method
    that predates classification and regression tree
    (CART) procedures
  • It uses a well-known statistical test (the
    chi-square test for independence) to assess whether
    splitting a node improves purity by a
    statistically significant amount.
  • In particular, at each node we split on the
    predictor that has the strongest association with
    the response variable.
  • The strength of association is measured by the
    p-value of a chi-squared test of independence.
  • If, for the best predictor, the test does not show
    a significant improvement, the split is not
    carried out and the tree growth is terminated.
  • This method is more suitable for categorical
    predictors, but it can be adapted to continuous
    predictors by binning the continuous values into
    categories.

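A minimal sketch of the node-splitting test, assuming
SciPy is available. The contingency-table counts are
made up for illustration, and a full CHAID
implementation also merges similar predictor
categories and adjusts the p-values for multiple
testing.

from scipy.stats import chi2_contingency

# Rows: categories of one candidate predictor at this node.
# Columns: response classes. Counts are hypothetical.
table = [[30, 10],
         [12, 28],
         [20, 20]]

chi2, p_value, dof, expected = chi2_contingency(table)

# Split only if the association is statistically significant;
# otherwise the node becomes a leaf and this branch stops growing.
if p_value < 0.05:
    print(f"split on this predictor (p = {p_value:.4f})")
else:
    print(f"stop: no significant improvement (p = {p_value:.4f})")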
17
Pruning the Tree
  • Grow the full tree (overfitting the data)
  • Convert decision nodes to leaf nodes using the
    CART algorithm
  • The CART algorithm uses a cost complexity
    criterion, which is equal to the misclassification
    error of a tree (based on the training data) plus
    a penalty factor for the size of the tree.
  • For a tree T that has L(T) leaf nodes, the cost
    complexity can be written as
  • CC(T) = Err(T) + α L(T)
  • where Err(T) is the fraction of training data
    observations that are misclassified by tree T and
    α is a "penalty factor" for tree size.
  • When α = 0 there is no penalty for having too
    many nodes in a tree, and the best tree under the
    cost complexity criterion is the full-grown
    unpruned tree.

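scikit-learn exposes this criterion through its
ccp_alpha parameter and cost_complexity_pruning_path
method. A minimal sketch on synthetic data; choosing
α by validation performance is our assumption about
how to use the path, not part of the API.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Grow the full tree, then get the alphas at which leaves collapse.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Fit one pruned tree per alpha; keep the best on the validation data.
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in alphas]
best = max(trees, key=lambda t: t.score(X_valid, y_valid))
print(best.get_n_leaves(), "leaves after pruning")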
18
Classification Rules from Trees
  • Each leaf is equivalent to a classification rule.
  • Returning to the example on slide 3, the
    middle-left leaf in the best pruned tree gives us
    the rule
  • IF (Income > 92.5) AND (Education < 1.5) AND
    (Family ≤ 2.5) THEN Class = 0.
  • The number of rules can be reduced by removing
    redundancies.
  • IF (Income > 92.5) AND (Education > 1.5) AND
    (Income > 114.5) THEN Class = 1 can be simplified
    to
  • IF (Income > 114.5) AND (Education > 1.5) THEN
    Class = 1.

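In scikit-learn, the rule set implied by a fitted
tree can be printed with export_text; a minimal
sketch on the built-in iris data, where each printed
root-to-leaf path is one IF ... THEN rule.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each root-to-leaf path below reads as one classification rule.
print(export_text(tree, feature_names=["sep_len", "sep_wid",
                                       "pet_len", "pet_wid"]))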
19
Regression Trees
  • The CART method can also be used for continuous
    response variables.
  • Regression trees for prediction operate in much
    the same fashion as classification trees.
  • The output variable, Y, is a continuous variable
    in this case, but both the principle and the
    procedure are the same: many splits are attempted
    and, for each,
  • we measure "impurity" in each branch of the
    resulting tree.
  • The tree procedure then selects the split that
    minimizes the sum of these impurity measures.

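A minimal regression-tree sketch on synthetic data,
assuming scikit-learn; its squared_error criterion
corresponds to the sum-of-squared-deviations impurity
discussed two slides ahead.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Each split minimizes the summed squared deviation from the leaf means.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
print(reg.predict([[5.0]]))  # prediction = mean of the training points in the leaf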
20
Prediction
  • Predicting the value of the response Y for an
    observation is performed in a similar fashion to
    the classification case
  • The predictor information is used for "dropping"
    down the tree until reaching a leaf node.
  • For instance, to predict the price of a Toyota
    Corolla with Age = 55 and Horsepower = 86, we drop
    it down the tree and reach the node that has the
    value 8842.65.
  • This is the price prediction for this car
    according to the tree.
  • In classification trees the value of the leaf
    node (which is one of the categories) is
    determined by the "voting" of the training data
    that were in that leaf.
  • In regression trees the value of the leaf node is
    determined by the average of the training data in
    that leaf.
  • In the above example, the value 8842.65 is the
    average of the 56 cars in the training set that
    fall in the category of Age > 52.5 AND Horsepower
    < 93.5.

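Dropping an observation down the quoted branch can be
sketched as a plain conditional; only this one path
is given in the text, so the remaining branches are
placeholders.

def predict_corolla_price(age, horsepower):
    # The single root-to-leaf path given above; 8842.65 is the
    # average price of the 56 training cars in this leaf.
    if age > 52.5 and horsepower < 93.5:
        return 8842.65
    raise NotImplementedError("remaining branches of the tree not shown")

print(predict_corolla_price(55, 86))  # -> 8842.65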
21
Measuring Impurity
  • Two types of impurity measures for nodes in
    classification trees
  • the Gini index and
  • the entropy-based measure.
  • In both cases the index is a function of the
    proportions of the categories among the
    observations in that node.
  • In regression trees a typical impurity measure is
    the sum of the squared deviations from the mean
    of the leaf.
  • This is equivalent to the sum of squared errors,
    because the mean of the leaf is exactly the
    prediction.
  • In the example above, the impurity of the node
    with the value 8842.65 is computed by subtracting
    8842.65 from the price of each of the 56 cars in
    the training set that fell in that leaf, then
    squaring these deviations, and summing them up.
  • The lowest impurity possible is zero, when all
    values in the node are equal.

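The computation just described, as a short sketch
(the function name is made up):

import numpy as np

def leaf_impurity(values):
    # Sum of squared deviations from the leaf mean; 0 when all are equal.
    values = np.asarray(values, dtype=float)
    return float(np.sum((values - values.mean()) ** 2))

print(leaf_impurity([3, 3, 3]))     # 0.0 -> perfectly pure leaf
print(leaf_impurity([8000, 9000]))  # 500000.0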
22
Evaluating Performance
  • The predictive performance of regression trees
    can be measured in the same way that other
    predictive methods are evaluated:
  • using summary measures such as RMSE, and
  • charts such as lift charts.

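For reference, RMSE as a short sketch:

import numpy as np

def rmse(actual, predicted):
    # Root mean squared error over a set of predictions.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

print(rmse([10, 20, 30], [12, 18, 33]))  # about 2.38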
23
Advantages, Weaknesses, and Extensions
24
Problems
  • Competitive Auctions on eBay.com
  • Predicting Delayed Flights
  • Predicting Prices of Used Cars Using Regression
    Trees