1
Tree-Based Methods (V&R 9.1)
STAT 6601 Project
  • Demeke Kasaw, Andreas Nguyen, Mariana Alvaro

2
Overview of Tree-based Methods
  • What are they?
  • How do they work?
  • Examples
  • Tree pictorials are common and offer a simple way
    to depict relationships in data.
  • Tree-based methods use this pictorial form to
    represent relationships between random variables.

3
Trees can be used for both Classification and
Regression
Time to Next Eruption vs. Length of Last Eruption
Presence of Surgery Complications vs. Patient Age
and Treatment Start Date

[Figure: the surgery-complications classification tree.
  Start > 8.5 months?
    No (Start < 8.5): Present
    Yes: Start > 14.5?
      Yes: Absent
      No: Age < 12 yrs?
        Yes: Absent
        No: Sex M -> Absent; Sex F -> Present]
4
General Computation Issues and Unique Solutions
  • Over-fitting: when do we stop splitting? Stop
    generating new nodes when subsequent splits yield
    only a small improvement.
  • Evaluate the quality of the prediction: prune the
    tree so as to select the simplest, most accurate
    solution.
  • Methods:
  • Cross-validation: apply the tree computed from one
    set of observations (the learning sample) to a
    completely independent set of observations
    (the testing sample). See the sketch below.
  • V-fold cross-validation: repeat the analysis with
    different randomly drawn samples from the data,
    and use the tree that shows the best average accuracy
    for cross-validated predicted classifications or
    predicted values.
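A minimal R sketch of the test-sample idea, using the iris data that appears later in this talk; the 70/30 split, seed, and object names are our own illustrative choices:
    library(rpart)
    data(iris)
    set.seed(1)                                    # reproducible split (illustrative)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # learning-sample indices
    learn <- iris[idx, ]                           # learning sample
    test  <- iris[-idx, ]                          # independent testing sample
    fit   <- rpart(Species ~ ., data = learn)      # tree grown on the learning sample
    pred  <- predict(fit, newdata = test, type = "class")
    mean(pred != test$Species)                     # misclassification rate on the testing sample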

5
Computational Details
  • Specify the criteria for predictive accuracy:
  • Minimum costs: lowest misclassification rate
  • Case weights
  • Selecting splits:
  • Define a measure of impurity for a node. A node
    is pure if it contains observations of a
    single class.
  • Determine when to stop splitting (sketched below):
  • All nodes are pure, or contain no more than n
    cases
  • All nodes contain no more than a specified
    fraction of objects
  • Selecting the right-size tree:
  • Test-sample cross-validation
  • V-fold cross-validation
  • Tree selection after pruning: if there are
    several trees with costs close to the minimum, select
    the smallest (least complex)
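These stopping and sizing rules correspond to rpart's control parameters; a hedged sketch, with illustrative values:
    library(rpart)
    ctrl <- rpart.control(
      minsplit  = 20,    # do not split nodes with fewer than 20 cases
      minbucket = 7,     # every terminal node keeps at least 7 cases
      cp        = 0.01,  # skip splits that improve the fit by less than cp
      xval      = 10     # 10-fold cross-validation for later pruning
    )
    fit <- rpart(Species ~ ., data = iris, control = ctrl)
    printcp(fit)         # inspect cross-validated error to choose the tree size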

6
Computational Formulas
  • Estimation of accuracy in classification trees
  • Resubstitution estimate:
    R(d) = (1/N) Σ_{n=1..N} X(d(x_n) ≠ j_n)
    where d(x) is the classifier, and
    X = 1 if d(x_n) ≠ j_n is true,
    X = 0 if d(x_n) ≠ j_n is false.
  • Estimation of accuracy in regression trees
  • Resubstitution estimate:
    R(d) = (1/N) Σ_{n=1..N} (y_n − d(x_n))²
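A sketch of both resubstitution estimates in R; the regression target Sepal.Length is our own illustrative choice:
    library(rpart)
    data(iris)
    # Classification: proportion of learning-sample cases with d(x_n) != j_n
    cfit <- rpart(Species ~ ., data = iris)
    mean(predict(cfit, iris, type = "class") != iris$Species)
    # Regression: mean squared error (1/N) * sum (y_n - d(x_n))^2 on the same data
    rfit <- rpart(Sepal.Length ~ ., data = iris)
    mean((iris$Sepal.Length - predict(rfit, iris))^2)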

7
Computational Formulas: Estimation of Node
Impurity
  • Gini index: i(t) = 1 − Σ_j p(j|t)²
  • Reaches zero when only one class is present at a
    node
  • p(j|t) is the probability of category j at node t
  • Entropy or information: i(t) = −Σ_j p(j|t) log p(j|t)
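A sketch of both impurity measures for a node's class counts; the function names are our own:
    gini <- function(counts) {
      p <- counts / sum(counts)   # p(j|t): class proportions at node t
      1 - sum(p^2)                # zero when a single class has p = 1
    }
    entropy <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]               # drop empty classes to avoid log(0)
      -sum(p * log(p))            # also zero for a pure node
    }
    gini(c(50, 0, 0))      # pure node     -> 0
    gini(c(25, 25, 0))     # 50/50 mixture -> 0.5
    entropy(c(25, 25, 0))  # 50/50 mixture -> log(2), about 0.693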

8
Classification Tree Example: What species are
these flowers?
[Figure: photos of the three iris species (setosa,
versicolor, virginica), annotated with petal
length/width and sepal length/width.]
9
Iris Classification Data
  • The iris dataset relates species to petal and sepal
    dimensions reported in centimeters. It was originally
    used by R. A. Fisher and E. Anderson for a
    discriminant analysis example.
  • The data is pre-packaged in R's datasets library and
    is available on DASL.

    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    6.7           3.0          5.0           1.7          versicolor
    5.8           2.7          3.9           1.2          versicolor
    7.3           2.9          6.3           1.8          virginica
    5.2           4.1          1.5           0.1          setosa
    4.4           3.2          1.3           0.2          setosa

10
Iris Classification: Method and Code
  • Load the tree-fitting package and the iris data:
    library(rpart)
    data(iris)
  • Let x be a tree object fitting Species vs. all
    other variables in iris, with 10-fold cross-validation:
    x <- rpart(Species ~ ., iris, xval = 10)
  • Plot the tree diagram with uniform spacing,
    diagonal branches, a 10% margin, and a title:
    plot(x, uniform = TRUE, branch = 0, margin = 0.1,
         main = "Classification Tree\nIris Species by Petal
         and Sepal Length")
  • Add labels to the tree with final counts, fancy
    shapes, and blue text color:
    text(x, use.n = TRUE, fancy = TRUE, col = "blue")

11
Results
12
The tree-based approach is much simpler than the
alternative
Classification Tree: Iris Species by Petal and
Sepal Length
  • Classification with cross-validation:

                      True Group
    Put into Group    setosa  versicolor  virginica
    setosa                50           0          0
    versicolor             0          48          1
    virginica              0           2         49
    Total N               50          50         50
    N correct             50          48         49
    Proportion         1.000       0.960      0.980

    N = 150, N correct = 147

  • Linear discriminant function for groups:

                   setosa  versicolor  virginica
    Constant       -85.21      -71.75    -103.27
    Sepal.Length    23.54       15.70      12.45
    Sepal.Width     23.59        7.07       3.69
    Petal.Length   -16.43        5.21      12.77
    Petal.Width    -17.40        6.43      21.08

[Figure: the fitted classification tree alongside the
discriminant scores computed for one example flower.
Tree: Petal.Length < 2.45 -> setosa (50/0/0);
otherwise Petal.Width < 1.75 -> versicolor (0/49/5),
else virginica (0/1/45).]
Since versicolor has the highest discriminant score, we
classify this flower as an Iris versicolor.
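For comparison, a sketch of the discriminant-analysis alternative in R using MASS::lda; note the slide's output is Minitab-style, while lda reports canonical coefficients rather than per-group classification functions:
    library(MASS)
    data(iris)
    ld <- lda(Species ~ ., data = iris, CV = TRUE)   # leave-one-out cross-validation
    table(Put.into = ld$class, True = iris$Species)  # confusion matrix as above
    ld2 <- lda(Species ~ ., data = iris)             # refit without CV
    ld2$scaling                                      # discriminant coefficients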
13
Regression Tree Example
  • Software used: R, rpart package
  • Goal: apply the regression tree method to CPU data
    and predict the response variable, performance.

14
CPU Data
  • CPU performance of 209 different processors.
         name            syct  mmin   mmax  cach  chmin  chmax  perf
    1    ADVISOR 32/60    125   256   6000   256     16    128   198
    2    AMDAHL 470V/7     29  8000  32000    32      8     32   269
    3    AMDAHL 470/7A     29  8000  32000    32      8     32   220
    4    AMDAHL 470V/7B    29  8000  32000    32      8     32   172
    5    AMDAHL 470V/7C    29  8000  16000    32      8     16   132
    6    AMDAHL 470V/8     26  8000  32000    64      8     32   318
    ...

  • Column annotations from the slide: perf = performance
    benchmark; syct = system speed (MHz); mmin, mmax =
    memory (KB); cach = cache (KB); chmin, chmax = channels.
15
R Code
  • Load packages and data:
    library(MASS); library(rpart)
    data(cpus); attach(cpus)
  • Fit a regression tree to the data:
    cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
  • Print and plot the complexity parameter (cp) table:
    printcp(cpus.rp); plotcp(cpus.rp)
  • Prune and display the tree:
    cpus.rp <- prune(cpus.rp, cp = 0.0055)
    plot(cpus.rp, uniform = TRUE, main = "Regression Tree")
    text(cpus.rp, digits = 3)
  • Plot residuals vs. predicted values:
    plot(predict(cpus.rp), resid(cpus.rp))
    abline(h = 0)

16
Determine the Best Complexity Parameter (cp)
Value for the Model
  • Column guide: CP = complexity parameter; nsplit =
    number of splits; rel error = 1 − R²; xerror =
    cross-validated error; xstd = cross-validated error SD.

         CP         nsplit  rel error  xerror   xstd
    1    0.5492697       0    1.00000  1.00864  0.096838
    2    0.0893390       1    0.45073  0.47473  0.048229
    3    0.0876332       2    0.36139  0.46518  0.046758
    4    0.0328159       3    0.27376  0.33734  0.032876
    5    0.0269220       4    0.24094  0.32043  0.031560
    6    0.0185561       5    0.21402  0.30858  0.030180
    7    0.0167992       6    0.19546  0.28526  0.028031
    8    0.0157908       7    0.17866  0.27781  0.027608
    9    0.0094604       9    0.14708  0.27231  0.028788
    10   0.0054766      10    0.13762  0.25849  0.026970
    11   0.0052307      11    0.13215  0.24654  0.026298
    12   0.0043985      12    0.12692  0.24298  0.027173
    13   0.0022883      13    0.12252  0.24396  0.027023
    14   0.0022704      14    0.12023  0.24256  0.027062
    15   0.0014131      15    0.11796  0.24351  0.027246
    16   0.0010000      16    0.11655  0.24040  0.026926
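A sketch of choosing cp from this table programmatically, assuming cpus.rp from the R Code slide; the one-SE rule shown is a common convention for "smallest tree with cost close to the minimum", not something the slides prescribe:
    cpt    <- cpus.rp$cptable
    best   <- which.min(cpt[, "xerror"])                # row with the smallest xerror
    thresh <- cpt[best, "xerror"] + cpt[best, "xstd"]   # one SE above the minimum
    pick   <- which(cpt[, "xerror"] <= thresh)[1]       # smallest tree within one SE
    cpus.pruned <- prune(cpus.rp, cp = cpt[pick, "CP"]) # prune to the selected size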

17
Regression Tree
Regression Tree: Before Pruning
18
How well does it fit?
  • Plot of residuals
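A sketch reproducing the residual plot and adding an apparent R² check; it assumes cpus and the pruned cpus.rp from the R Code slide:
    y    <- log(cpus$perf)
    yhat <- predict(cpus.rp)
    plot(yhat, y - yhat); abline(h = 0)           # residuals vs. fitted values
    1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # apparent R^2 = 1 - rel error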

19
Summary
  • Advantages of C&RT:
  • Simplicity of results
  • The interpretation of results summarized in a
    tree is very simple.
  • This simplicity is useful for rapid
    classification of new observations.
  • It is much easier to evaluate just one or two
    logical conditions.
  • Tree methods are nonparametric and nonlinear.
  • There is no implicit assumption that the
    underlying relationships between the predictor
    variables and the dependent variable are linear
    or follow some specific non-linear link function.

20
References
  • Venables, W. N. and Ripley, B. D. (2002) Modern
    Applied Statistics with S, pp. 251-266.
  • StatSoft (2003) Classification and Regression
    Trees, Electronic Textbook, StatSoft; retrieved on
    11/8/2004 from
    http://www.statsoft.com/textbook/stcart.html
  • Fisher, R. A. (1936) The use of multiple
    measurements in taxonomic problems. Annals of
    Eugenics, 7, Part II, 179-188.

21
  • Using Trees in R (the 30-second version)
  • Load the rpart library:
    library(rpart)
  • For classification trees, make sure the response
    is of type factor. If you don't know how to do this,
    look up help(as.factor) or consult a general R
    reference.
    y <- as.factor(y)
  • Fit the tree model:
    f <- rpart(y ~ x1 + x2, data, cp = 0.001)
    If using an unattached dataframe, you must specify
    data. If using global variables, then data can be
    omitted. A good starting point for cp, which controls
    the complexity of the tree, is given.
  • Plot and check the model:
    plot(f, uniform = TRUE, margin = 0.1)
    text(f, use.n = TRUE)
    plotcp(f); printcp(f)
    Look at the xerror column in the summary and choose
    the smallest number of splits that achieves the
    smallest xerror. Consider the tradeoff between model
    fit and complexity (i.e., overfitting). Based on your
    judgement, repeat step 3 with the cp value of your
    choice.
  • Predict results:
    predict(f, newdata, type = "class")
    where newdata is a dataframe containing the
    independent variables.