1
Decision Tree Models in Data Mining
  • Matthew J. Liberatore
  • Thomas Coghlan

2
Decision Trees in Data Mining
  • Decision trees can be used to predict a
    categorical or a continuous target (trees that
    predict a continuous target are called
    regression trees)
  • Like logistic regression and neural networks,
    decision trees can be applied for classification
    and prediction
  • Unlike those methods, no equations are estimated
  • Instead, a tree structure of rules over the input
    variables is used to classify or predict the
    cases according to the target variable
  • The rules have an IF-THEN form, for example:
  • If Risk = Low, then predict on-time payment of a
    loan
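
Such a rule translates directly into code. A minimal Python sketch of the
rule above (the function and argument names are ours, for illustration):

    def predict_payment(risk: str) -> str:
        """If Risk = Low, then predict on-time payment of a loan."""
        if risk == "Low":
            return "on-time"
        return "not on-time"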

3
Decision Tree Approach
  • A decision tree represents a hierarchical
    segmentation of the data
  • The original segment is called the root node and
    is the entire data set
  • The root node is partitioned into two or more
    segments by applying a series of simple rules
    over an input variable
  • For example: risk = low vs. risk not low
  • Each rule assigns an observation to a segment
    based on its input value
  • Each resulting segment can be further partitioned
    into sub-segments, and so on
  • For example, the risk = low segment can be
    partitioned into income = low and income not low
  • The segments are also called nodes, and the final
    segments are called leaf nodes or leaves

4
Decision Tree Example: Loan Payment
  Income
  ├─ < 30k: Age
  │   ├─ < 25: not on-time
  │   └─ > 25: on-time
  └─ > 30k: Credit Score
      ├─ < 600: not on-time
      └─ > 600: on-time

5
Growing the Decision Tree
  • Growing the tree involves successively
    (recursively) partitioning the data
  • If an input variable is binary, then its two
    categories can be used to split the data
  • If an input variable is interval, a splitting
    value is used to divide the data into two
    segments
  • For example, if household income is interval and
    there are 100 possible incomes in the data set,
    then there are 100 possible splitting values
  • For example, income < 30k and income > 30k

6
Evaluating the Partitions
  • When the target is categorical, a chi-square
    statistic is computed for each partition of an
    input variable
  • A contingency table is formed that maps
    responders and non-responders against the
    partitioned input variable
  • For example, the null hypothesis might be that
    there is no difference between people with income
    < 30k and those with income > 30k in making an
    on-time loan payment
  • The lower the p-value, the more likely we are to
    reject this hypothesis, meaning that this income
    split is a discriminating factor

7
Contingency Table
                         < 30k   > 30k   Total
  Payment on-time
  Payment not on-time
  Total
8
Chi-Square Statistic
  • The chi-square statistic measures how different
    the observed number of observations in each of
    the four cells is from the expected number
  • The p-value associated with the null hypothesis
    is computed
  • Enterprise Miner then computes the logworth of
    the p-value: logworth = -log10(p-value)
  • The split that generates the highest logworth for
    a given input variable is selected
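
A minimal sketch of this computation in Python using scipy (the cell
counts are hypothetical, for illustration only):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: payment on-time / not on-time; columns: income < 30k / > 30k
    table = np.array([[10, 40],
                      [25, 25]])

    chi2, p_value, dof, expected = chi2_contingency(table)
    logworth = -np.log10(p_value)  # logworth = -log10(p-value)
    print(chi2, p_value, logworth)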

9
Growing the Tree
  • In our loan payment example, we have three
    interval-valued input variables: income, age, and
    credit score
  • We compute the logworth of the best split for
    each of these variables
  • We then select the variable that has the highest
    logworth and use its split (suppose it is income)
  • Under each of the two income nodes, we then find
    the logworth of the best split of age and credit
    score, and continue the process --
  • subject to meeting the threshold on the
    significance of the chi-square value for
    splitting and other stopping criteria (described
    later)
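
A sketch of this search for one interval input in Python (it assumes y is
a 0/1 numpy array and that every candidate table has nonzero row and
column totals; the helper name is ours):

    import numpy as np
    from scipy.stats import chi2_contingency

    def best_split(x, y):
        """Return (threshold, logworth) of the best chi-square split of x."""
        best_t, best_lw = None, -np.inf
        for t in np.unique(x)[:-1]:        # candidate splitting values
            left, right = y[x <= t], y[x > t]
            table = [[(left == 1).sum(), (right == 1).sum()],
                     [(left == 0).sum(), (right == 0).sum()]]
            _, p, _, _ = chi2_contingency(table)
            lw = -np.log10(p)
            if lw > best_lw:
                best_t, best_lw = t, lw
        return best_t, best_lw

Running best_split for income, age, and credit score and keeping the
variable with the highest logworth reproduces the selection step above.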

10
Other Splitting Criteria for a Categorical Target
  • The Gini and entropy measures are based on how
    heterogeneous the observations are at a given
    node
  • This relates to the mix of responders and
    non-responders at the node
  • Let p1 and p0 represent the proportions of
    responders and non-responders at a node,
    respectively
  • If two observations are chosen (with replacement)
    from a node, the probability that they are either
    both responders or both non-responders is
    (p1)² + (p0)²
  • The Gini index = 1 − (p1)² − (p0)², the
    probability that the two observations are
    different
  • The best case is a Gini index of 0 (all
    observations are the same)
  • An index of ½ means both groups are equally
    represented

11
Other Splitting Criteria for a Categorical Target
  • The rarity of an event is defined as -log2(pi),
    where pi is its proportion at the node
  • Entropy sums the rarity of response and
    non-response over all observations at the node:
    Entropy = -p1 log2(p1) - p0 log2(p0)
  • Entropy ranges from the best case of 0 (all
    responders or all non-responders) to 1 (an equal
    mix of responders and non-responders)
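
Both measures are one-liners in Python; a quick sketch:

    import math

    def gini(p1):
        """Gini index = 1 - p1^2 - p0^2: 0 when pure, 1/2 at an equal mix."""
        p0 = 1.0 - p1
        return 1.0 - p1**2 - p0**2

    def entropy(p1):
        """Entropy = -p1*log2(p1) - p0*log2(p0): 0 when pure, 1 at an equal mix."""
        p0 = 1.0 - p1
        return sum(-p * math.log2(p) for p in (p1, p0) if p > 0)

    print(gini(0.5), entropy(0.5))   # 0.5 1.0 (equal mix)
    print(gini(1.0), entropy(1.0))   # 0.0 0.0 (pure node)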

12
Splitting Criteria for a Continuous (Interval)
Target
  • An F-statistic is used to measure the degree of
    separation of a split for an interval target,
    such as revenue
  • Similar to the sum of squares discussion under
    multiple regression, the F-statistic is based on
    the ratio of the sum of squares between the
    groups and the sum of squares within groups, both
    adjusted for the number of degrees of freedom
  • The null hypothesis is that there is no
    difference in the target mean between the two
    groups
  • As before, the logworth of the p-value is computed
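
A minimal sketch in Python using scipy's one-way ANOVA, which computes
exactly this ratio (the revenue values are hypothetical):

    import numpy as np
    from scipy.stats import f_oneway

    left = np.array([12.0, 15.0, 11.0, 14.0])    # revenue, segment 1
    right = np.array([22.0, 25.0, 21.0, 24.0])   # revenue, segment 2

    f_stat, p_value = f_oneway(left, right)
    logworth = -np.log10(p_value)
    print(f_stat, p_value, logworth)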

13
Some Adjustments
  • The more possible splits of an input variable,
    the less accurate the p-value (the bigger the
    chance of falsely rejecting the null hypothesis)
  • If there are m possible splits, the Bonferroni
    adjustment penalizes the best split's p-value by
    subtracting log10(m) from its logworth
  • If the Time of Kass Adjustment property is set to
    Before, then the p-values of the candidate splits
    are compared with the Bonferroni adjustment
    applied
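
The adjustment itself is a single subtraction; a quick sketch:

    import math

    def adjusted_logworth(p_value, m):
        """Bonferroni-adjusted logworth: -log10(p) - log10(m) for m splits."""
        return -math.log10(p_value) - math.log10(m)

    # With 100 candidate splits, a raw p-value of 0.0004 keeps a
    # logworth of about 3.40 - 2.00 = 1.40
    print(adjusted_logworth(0.0004, 100))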

14
Some Adjustments
  • Setting the Split Adjustment property to Yes
    means that the significance of the p-value is
    adjusted for the depth of the tree
  • For example, at the fourth split, a calculated
    p-value of 0.04 becomes 0.04 × 2⁴ = 0.64, making
    the split statistically insignificant
  • This leads to rejecting more splits, limiting the
    size of the tree
  • Tree growth can also be controlled by setting
  • Leaf Size property (minimum number of
    observations in a leaf)
  • Split Size property (minimum number of
    observations to allow a node to be split)
  • Maximum Depth property (maximum number of
    generations of nodes)
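
These stopping properties have rough equivalents in scikit-learn's
DecisionTreeClassifier, for readers without Enterprise Miner (the values
below are illustrative; sklearn offers gini and entropy but not the
chi-square criterion):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        criterion="entropy",     # splitting criterion: "gini" or "entropy"
        min_samples_leaf=5,      # ~ Leaf Size property
        min_samples_split=10,    # ~ Split Size property
        max_depth=6,             # ~ Maximum Depth property
    )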

15
Some Results
  • The posterior probabilities are the proportions
    of responders and non-responders at each node
  • A node is classified as a responder or
    non-responder depending on which posterior
    probability is larger
  • In selecting the best tree, one can use
    Misclassification, Lift, or Average Squared Error
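
A sketch of two of these assessment measures in Python (the targets and
posterior probabilities are made up):

    import numpy as np

    y = np.array([1, 0, 1, 1, 0])                    # actual responses
    posterior = np.array([0.8, 0.3, 0.6, 0.4, 0.2])  # leaf posteriors

    predicted = (posterior >= 0.5).astype(int)       # classify by larger posterior
    misclassification = np.mean(predicted != y)      # 0.2
    avg_squared_error = np.mean((y - posterior) ** 2)  # 0.138
    print(misclassification, avg_squared_error)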

16
Creating a Decision Tree Model in Enterprise
Miner
  • Open the bankrupt project, and create a new
    diagram called Bankrupt_DecTree
  • Drag and drop the bankrupt data node and the
    Decision Tree node (from the Model tab) onto the
    diagram
  • Connect the nodes

17
Select ProbChisq for the Criterion under Splitting
Rule. Change Use Input Once to Yes (otherwise, the
same variable can appear more than once in the
tree).
18
Under Subtree, select Misclassification for
Assessment Measure. Keep the defaults under P-Value
Adjustment and Output Variables. Under Score, set
Variable Selection to No (otherwise, variables with
importance values less than 0.05 are set to
rejected and not considered by the tree).
19
The Decision Tree has only one split, on RE/TA.
The misclassification rate is 0.15 (3/20), with 2
false negatives and 1 false positive. The
cumulative lift is somewhat lower than the best
cumulative lift, and starts out at 1.777 vs. the
best value of 2.000.
20
Under Subtree, set Method to Largest and rerun.
The results show that another split is added,
using EBIT/TA. However, the misclassification
rate is unchanged at 0.15. This result shows that
setting Method to Assessment with
Misclassification as the Assessment Measure finds
the smallest tree having the lowest
misclassification rate.
21
Model Comparison
  • The Model Comparison node under the Assess tab
    can be used to compare several different models
  • Create a diagram called Full Model that includes
    the bankrupt data node connected to the
    regression, decision tree, and neural network
    nodes
  • Connect the three model nodes to the Model
    Comparison node, and connect it and the
    bankrupt_score data node to a Score node

22
For Regression, set Selection Model to None. For
Neural Network, set Model Selection Criterion to
Average Error, and the Network properties as
before. For Decision Tree, set Assessment Measure
to Average Squared Error, and the other
properties as before. This puts each of the
models on a similar basis for fit. For Model
Comparison, set Selection Criterion to Average
Squared Error.
23
Neural Network is selected, although Regression
is nearly identical in average squared error.
The Receiver Operating Characteristic (ROC) curve
shows sensitivity (true positives) vs.
1-specificity (false positives) for various
cutoff probabilities of a response. The chart
shows that, no matter what the cutoff probability
is, regression and neural network classify 100%
of responders as responders (sensitivity) and 0%
of non-responders as responders (1-specificity).
The decision tree
performs reasonably well, as indicated by the
area above the diagonal line.
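
The ROC points can be reproduced in Python with scikit-learn (the targets
and scores below are hypothetical; a perfectly separating model gives an
area under the curve of 1.0):

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 0])
    y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

    fpr, tpr, cutoffs = roc_curve(y_true, y_score)  # 1-specificity, sensitivity
    print(roc_auc_score(y_true, y_score))           # 1.0 here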