1
Tree-Based Methods (V&R 9.1)
STAT 6601 Project
  • Demeke Kasaw, Andreas Nguyen, Mariana Alvaro

2
Overview of Tree-based Methods
  • What are they?
  • How do they work?
  • Examples
  • Tree pictorials are common and offer a simple way
    to depict relationships in data.
  • Tree-based methods use this pictorial form to
    represent relationships between random variables.

3
Trees can be used for both Classification and
Regression
Time to Next Eruption vs. Length of Last Eruption
Presence of Surgery Complications vs. Patient Age
and Treatment Start Date

[Figure: the surgery-complications classification tree.
  Start > 8.5 months?
    No (Start < 8.5): Present
    Yes: Start > 14.5?
      Yes: Absent
      No: Age < 12 yrs?
        Yes: Absent
        No: Sex M -> Absent; Sex F -> Present]
4
General Computation Issues and Unique Solutions
  • Over-fitting: when do we stop splitting? Stop
    generating new nodes when subsequent splits yield
    only a small improvement.
  • Evaluate the quality of the prediction: prune the
    tree so as to select the simplest, most accurate
    solution.
  • Methods:
  • Cross-validation: apply the tree computed from one
    set of observations (the learning sample) to a
    completely independent set of observations
    (the testing sample). See the sketch below.
  • V-fold cross-validation: repeat the analysis with
    different randomly drawn samples from the data,
    and use the tree that shows the best average accuracy
    for cross-validated predicted classifications or
    predicted values.
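A minimal R sketch of the test-sample idea, using the iris data that appears later in this talk; the 70/30 split, seed, and object names are our own illustrative choices:
    library(rpart)
    data(iris)
    set.seed(1)                                    # reproducible split (illustrative)
    idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # learning-sample indices
    learn <- iris[idx, ]                           # learning sample
    test  <- iris[-idx, ]                          # independent testing sample
    fit   <- rpart(Species ~ ., data = learn)      # tree grown on the learning sample
    pred  <- predict(fit, newdata = test, type = "class")
    mean(pred != test$Species)                     # misclassification rate on the testing sample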

5
Computational Details
  • Specify the criteria for predictive accuracy:
  • Minimum costs: lowest misclassification rate
  • Case weights
  • Selecting splits:
  • Define a measure of impurity for a node. A node
    is pure if it contains observations of a
    single class.
  • Determine when to stop splitting (sketched below):
  • All nodes are pure, or contain no more than n
    cases
  • All nodes contain no more than a specified
    fraction of objects
  • Selecting the right-size tree:
  • Test-sample cross-validation
  • V-fold cross-validation
  • Tree selection after pruning: if there are
    several trees with costs close to the minimum, select
    the smallest (least complex)
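These stopping and sizing rules correspond to rpart's control parameters; a hedged sketch, with illustrative values:
    library(rpart)
    ctrl <- rpart.control(
      minsplit  = 20,    # do not split nodes with fewer than 20 cases
      minbucket = 7,     # every terminal node keeps at least 7 cases
      cp        = 0.01,  # skip splits that improve the fit by less than cp
      xval      = 10     # 10-fold cross-validation for later pruning
    )
    fit <- rpart(Species ~ ., data = iris, control = ctrl)
    printcp(fit)         # inspect cross-validated error to choose the tree size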

6
Computational Formulas
  • Estimation of accuracy in classification trees
  • Resubstitution estimate:
    R(d) = (1/N) Σ_{n=1..N} X(d(x_n) ≠ j_n)
    where d(x) is the classifier, and
    X = 1 if d(x_n) ≠ j_n is true,
    X = 0 if d(x_n) ≠ j_n is false.
  • Estimation of accuracy in regression trees
  • Resubstitution estimate:
    R(d) = (1/N) Σ_{n=1..N} (y_n − d(x_n))²
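A sketch of both resubstitution estimates in R; the regression target Sepal.Length is our own illustrative choice:
    library(rpart)
    data(iris)
    # Classification: proportion of learning-sample cases with d(x_n) != j_n
    cfit <- rpart(Species ~ ., data = iris)
    mean(predict(cfit, iris, type = "class") != iris$Species)
    # Regression: mean squared error (1/N) * sum (y_n - d(x_n))^2 on the same data
    rfit <- rpart(Sepal.Length ~ ., data = iris)
    mean((iris$Sepal.Length - predict(rfit, iris))^2)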

7
Computational Formulas: Estimation of Node
Impurity
  • Gini index: i(t) = 1 − Σ_j p(j|t)²
  • Reaches zero when only one class is present at a
    node
  • p(j|t) is the probability of category j at node t
  • Entropy or information: i(t) = −Σ_j p(j|t) log p(j|t)
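A sketch of both impurity measures for a node's class counts; the function names are our own:
    gini <- function(counts) {
      p <- counts / sum(counts)   # p(j|t): class proportions at node t
      1 - sum(p^2)                # zero when a single class has p = 1
    }
    entropy <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]               # drop empty classes to avoid log(0)
      -sum(p * log(p))            # also zero for a pure node
    }
    gini(c(50, 0, 0))      # pure node     -> 0
    gini(c(25, 25, 0))     # 50/50 mixture -> 0.5
    entropy(c(25, 25, 0))  # 50/50 mixture -> log(2), about 0.693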

8
Classification Tree Example: What species are
these flowers?
[Figure: photos of the three iris species (setosa,
versicolor, virginica), annotated with petal
length/width and sepal length/width.]
9
Iris Classification Data
  • The iris dataset relates species to petal and sepal
    dimensions reported in centimeters. It was originally
    used by R. A. Fisher and E. Anderson for a
    discriminant analysis example.
  • The data is pre-packaged in R's datasets library and
    is available on DASL.

    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    6.7           3.0          5.0           1.7          versicolor
    5.8           2.7          3.9           1.2          versicolor
    7.3           2.9          6.3           1.8          virginica
    5.2           4.1          1.5           0.1          setosa
    4.4           3.2          1.3           0.2          setosa

10
Iris Classification: Method and Code
  • Load the tree-fitting package and the iris data:
    library(rpart)
    data(iris)
  • Let x be a tree object fitting Species vs. all
    other variables in iris, with 10-fold cross-validation:
    x <- rpart(Species ~ ., iris, xval = 10)
  • Plot the tree diagram with uniform spacing,
    diagonal branches, a 10% margin, and a title:
    plot(x, uniform = TRUE, branch = 0, margin = 0.1,
         main = "Classification Tree\nIris Species by Petal
         and Sepal Length")
  • Add labels to the tree with final counts, fancy
    shapes, and blue text color:
    text(x, use.n = TRUE, fancy = TRUE, col = "blue")

11
Results
12
The tree-based approach is much simpler than the
alternative
Classification Tree: Iris Species by Petal and
Sepal Length
  • Classification with cross-validation:

                      True Group
    Put into Group    setosa  versicolor  virginica
    setosa                50           0          0
    versicolor             0          48          1
    virginica              0           2         49
    Total N               50          50         50
    N correct             50          48         49
    Proportion         1.000       0.960      0.980

    N = 150, N correct = 147

  • Linear discriminant function for groups:

                   setosa  versicolor  virginica
    Constant       -85.21      -71.75    -103.27
    Sepal.Length    23.54       15.70      12.45
    Sepal.Width     23.59        7.07       3.69
    Petal.Length   -16.43        5.21      12.77
    Petal.Width    -17.40        6.43      21.08

[Figure: the fitted classification tree alongside the
discriminant scores computed for one example flower.
Tree: Petal.Length < 2.45 -> setosa (50/0/0);
otherwise Petal.Width < 1.75 -> versicolor (0/49/5),
else virginica (0/1/45).]
Since versicolor has the highest discriminant score, we
classify this flower as an Iris versicolor.
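For comparison, a sketch of the discriminant-analysis alternative in R using MASS::lda; note the slide's output is Minitab-style, while lda reports canonical coefficients rather than per-group classification functions:
    library(MASS)
    data(iris)
    ld <- lda(Species ~ ., data = iris, CV = TRUE)   # leave-one-out cross-validation
    table(Put.into = ld$class, True = iris$Species)  # confusion matrix as above
    ld2 <- lda(Species ~ ., data = iris)             # refit without CV
    ld2$scaling                                      # discriminant coefficients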
13
Regression Tree Example
  • Software used: R, rpart package
  • Goal: apply the regression tree method to CPU data
    and predict the response variable, performance.

14
CPU Data
  • CPU performance of 209 different processors.
         name            syct  mmin   mmax  cach  chmin  chmax  perf
    1    ADVISOR 32/60    125   256   6000   256     16    128   198
    2    AMDAHL 470V/7     29  8000  32000    32      8     32   269
    3    AMDAHL 470/7A     29  8000  32000    32      8     32   220
    4    AMDAHL 470V/7B    29  8000  32000    32      8     32   172
    5    AMDAHL 470V/7C    29  8000  16000    32      8     16   132
    6    AMDAHL 470V/8     26  8000  32000    64      8     32   318
    ...

  • Column annotations from the slide: perf = performance
    benchmark; syct = system speed (MHz); mmin, mmax =
    memory (KB); cach = cache (KB); chmin, chmax = channels.
15
R Code
  • Load packages and data:
    library(MASS); library(rpart)
    data(cpus); attach(cpus)
  • Fit a regression tree to the data:
    cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
  • Print and plot the complexity parameter (cp) table:
    printcp(cpus.rp); plotcp(cpus.rp)
  • Prune and display the tree:
    cpus.rp <- prune(cpus.rp, cp = 0.0055)
    plot(cpus.rp, uniform = TRUE, main = "Regression Tree")
    text(cpus.rp, digits = 3)
  • Plot residuals vs. predicted values:
    plot(predict(cpus.rp), resid(cpus.rp))
    abline(h = 0)

16
Determine the Best Complexity Parameter (cp)
Value for the Model
  • Column guide: CP = complexity parameter; nsplit =
    number of splits; rel error = 1 − R²; xerror =
    cross-validated error; xstd = cross-validated error SD.

         CP         nsplit  rel error  xerror   xstd
    1    0.5492697       0    1.00000  1.00864  0.096838
    2    0.0893390       1    0.45073  0.47473  0.048229
    3    0.0876332       2    0.36139  0.46518  0.046758
    4    0.0328159       3    0.27376  0.33734  0.032876
    5    0.0269220       4    0.24094  0.32043  0.031560
    6    0.0185561       5    0.21402  0.30858  0.030180
    7    0.0167992       6    0.19546  0.28526  0.028031
    8    0.0157908       7    0.17866  0.27781  0.027608
    9    0.0094604       9    0.14708  0.27231  0.028788
    10   0.0054766      10    0.13762  0.25849  0.026970
    11   0.0052307      11    0.13215  0.24654  0.026298
    12   0.0043985      12    0.12692  0.24298  0.027173
    13   0.0022883      13    0.12252  0.24396  0.027023
    14   0.0022704      14    0.12023  0.24256  0.027062
    15   0.0014131      15    0.11796  0.24351  0.027246
    16   0.0010000      16    0.11655  0.24040  0.026926
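A sketch of choosing cp from this table programmatically, assuming cpus.rp from the R Code slide; the one-SE rule shown is a common convention for "smallest tree with cost close to the minimum", not something the slides prescribe:
    cpt    <- cpus.rp$cptable
    best   <- which.min(cpt[, "xerror"])                # row with the smallest xerror
    thresh <- cpt[best, "xerror"] + cpt[best, "xstd"]   # one SE above the minimum
    pick   <- which(cpt[, "xerror"] <= thresh)[1]       # smallest tree within one SE
    cpus.pruned <- prune(cpus.rp, cp = cpt[pick, "CP"]) # prune to the selected size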

17
Regression Tree
Regression Tree: Before Pruning
18
How well does it fit?
  • Plot of residuals
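A sketch reproducing the residual plot and adding an apparent R² check; it assumes cpus and the pruned cpus.rp from the R Code slide:
    y    <- log(cpus$perf)
    yhat <- predict(cpus.rp)
    plot(yhat, y - yhat); abline(h = 0)           # residuals vs. fitted values
    1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # apparent R^2 = 1 - rel error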

19
Summary
  • Advantages of C&RT:
  • Simplicity of results
  • The interpretation of results summarized in a
    tree is very simple.
  • This simplicity is useful for rapid
    classification of new observations.
  • It is much easier to evaluate just one or two
    logical conditions.
  • Tree methods are nonparametric and nonlinear.
  • There is no implicit assumption that the
    underlying relationships between the predictor
    variables and the dependent variable are linear
    or follow some specific non-linear link function.

20
References
  • Venables, W. N. and Ripley, B. D. (2002) Modern
    Applied Statistics with S, pp. 251-266.
  • StatSoft (2003) Classification and Regression
    Trees, Electronic Textbook, StatSoft; retrieved on
    11/8/2004 from
    http://www.statsoft.com/textbook/stcart.html
  • Fisher, R. A. (1936) The use of multiple
    measurements in taxonomic problems. Annals of
    Eugenics, 7, Part II, 179-188.

21
  • Using Trees in R (the 30-second version)
  • Load the rpart library:
    library(rpart)
  • For classification trees, make sure the response
    is of type factor. If you don't know how to do this,
    look up help(as.factor) or consult a general R
    reference.
    y <- as.factor(y)
  • Fit the tree model:
    f <- rpart(y ~ x1 + x2, data, cp = 0.001)
    If using an unattached dataframe, you must specify
    data. If using global variables, then data can be
    omitted. A good starting point for cp, which controls
    the complexity of the tree, is given.
  • Plot and check the model:
    plot(f, uniform = TRUE, margin = 0.1)
    text(f, use.n = TRUE)
    plotcp(f); printcp(f)
    Look at the xerror column in the summary and choose
    the smallest number of splits that achieves the
    smallest xerror. Consider the tradeoff between model
    fit and complexity (i.e., overfitting). Based on your
    judgement, repeat step 3 with the cp value of your
    choice.
  • Predict results:
    predict(f, newdata, type = "class")
    where newdata is a dataframe containing the
    independent variables.