Supervised Learning Methods Used to Classify or Predict Physical Abuse

Provided by: Instr175
1
Supervised Learning Methods Used to Classify or
Predict Physical Abuse
  • Marit Gilkey
  • Statistics Senior Seminar
  • Spring 2004

2
Methods
  • CART: Classification and Regression Trees
  • CHAID: Chi-Square Automatic Interaction
    Detector
  • Logistic Regression

3
CART Algorithm
  • Step 1
  • The best possible predictor is found to split
    the root node into two child nodes.
  • Best = the predictor that maximizes the
    reduction in impurity
  • Gini index: $\sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'}$
  • where $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
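
The Gini index and the reduction in impurity can be illustrated with a minimal sketch. The class labels and the candidate split below are hypothetical, chosen only to show the arithmetic; the function names are not taken from any particular software. The code uses the equivalent form sum_k p_k (1 - p_k) of the Gini index.

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: sum over classes k of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical root node and one candidate split of it.
parent = ["yes"] * 4 + ["no"] * 6
left = ["yes"] * 3 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 5

reduction = impurity_reduction(parent, left, right)
```

CART would evaluate this reduction for every candidate split and keep the largest.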

4
CART Algorithm
  • Step 2
  • Each node is assigned a predicted outcome
    class. (yes, no)
  • Step 3
  • The process is repeated until the terminal
    nodes are too small or too few to split.
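
Steps 1 to 3 can be sketched as a small recursive procedure. This is an illustration under stated assumptions, not the software used for the analysis: predictors are numeric, splits are of the form "value <= threshold", impurity is the Gini index, and `min_size` is a hypothetical stopping parameter. The toy data are invented.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Step 1: return (reduction, feature index, threshold) of the best split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        # Skip the largest value so that both child nodes are non-empty.
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def grow(rows, labels, min_size=2):
    # Step 2: every node is assigned its majority class as the predicted outcome.
    majority = Counter(labels).most_common(1)[0][0]
    # Step 3: stop when the node is too small or no split reduces impurity.
    split = best_split(rows, labels) if len(rows) >= min_size else None
    if split is None or split[0] <= 0:
        return {"predict": majority}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j, "threshold": t, "predict": majority,
        "left": grow([r for r, _ in left], [y for _, y in left], min_size),
        "right": grow([r for r, _ in right], [y for _, y in right], min_size),
    }

def predict(tree, row):
    """Follow the splits down to a terminal node and return its class."""
    while "feature" in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["predict"]

# Hypothetical data: one numeric predictor, binary outcome.
rows = [(1,), (2,), (3,), (4,), (5,), (6,)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = grow(rows, labels)
```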

5
CHAID Algorithm
  • Step 1
  • The best predictor is chosen from all of the
    predictors, by using a chi-squared test of
    independence.
  • Best = the predictor with the most significant
    (smallest) p-value.
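
The selection in Step 1 can be sketched as follows. To stay within the standard library, the sketch assumes every candidate predictor is binary, so each test has 1 degree of freedom and the p-value has the closed form erfc(sqrt(X2 / 2)). The predictor names and counts are hypothetical.

```python
import math

def chi2_stat(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail chi-squared probability for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical 2x2 tables: rows are predictor categories, columns are (yes, no).
tables = {
    "predictor_a": [[30, 70], [10, 90]],
    "predictor_b": [[20, 80], [20, 80]],
}
p_values = {name: p_value_df1(chi2_stat(t)) for name, t in tables.items()}
best = min(p_values, key=p_values.get)   # predictor with the smallest p-value
```

Predictors with more than two categories need the chi-squared distribution with the matching degrees of freedom, which this sketch does not implement.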

6
CHAID Algorithm
  • Step 2
  • With this predictor, the dataset is
    partitioned into 2 or more subsets based on the
    number of categories of this predictor.
  • Subsets that do not differ significantly from
    one another are combined.

7
CHAID Algorithm
  • Step 3
  • Each subset is then partitioned further based
    on the same criterion in Step 1.
  • This is repeated for each subset until it can no
    longer be split into statistically significant
    subsets.

8
Logistic Regression
  • The logistic regression model of a response, y,
    and one predictor variable, x, is
  • $P(y = 1) = \dfrac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$
  • Uses maximum-likelihood estimation
  • Estimates the probability of a given event
    occurring.
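
As a small illustration of the model formula, the sketch below evaluates P(y = 1) for given coefficients. The coefficient values are hypothetical and are not estimates from the survey data; fitting them by maximum likelihood is not shown.

```python
import math

def logistic_probability(x, beta0, beta1):
    """P(y = 1) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -2.0, 0.5
p_at_4 = logistic_probability(4, beta0, beta1)   # linear predictor is 0 here
p_at_0 = logistic_probability(0, beta0, beta1)
```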

9
Variables
  • CART and CHAID: Response is continuous or
    categorical. Predictors are continuous or
    categorical.
  • Logistic: Response is binary. Predictors are
    continuous or categorical.

10
Data Description
  • There are 2 sets of data.
  • Training set: 999 observations. Validation set:
    336 observations.
  • The data were obtained from respondents who
    completed a women's health survey.

11
Description of Variables Used
12
Decision Trees
  • Advantages
  • Understandable rules
  • All variable types can be used
  • Identifies variables that are important
  • Disadvantages
  • Trees can become large and complex (CHAID)

13
Logistic Regression
  • Advantages
  • Model is clear
  • All variable types can be used
  • Disadvantages
  • The analyst must determine whether there are
    variable interactions

14
CART Initial Tree
  • Predicted
  • 954 No
  • 45 Yes
  • Actual
  • 936 No
  • 63 Yes

15
Deviance
  • $-\sum_{k} \hat{p}_{mk} \log(\hat{p}_{mk})$
  • where $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
  • $\hat{p}_{mk}$ = proportion of observations in node m which
    belong to class k.
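
A minimal sketch of this node-level deviance is shown below. It uses the actual class counts of the training data (936 No, 63 Yes) as an example of a single node; the function name is an assumption, and classes with zero count are skipped because their term is taken as 0.

```python
import math

def deviance(class_counts):
    """-sum over classes k of p_k * log(p_k), with p_k the class proportions in one node."""
    n = sum(class_counts)
    return -sum((c / n) * math.log(c / n) for c in class_counts if c > 0)

root = deviance([936, 63])     # all 999 training observations in one node
balanced = deviance([50, 50])  # most impure two-class node: log(2)
pure = deviance([10, 0])       # pure node: zero
```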

16
CART - Pruning
  • Visually, after the 8th terminal node, adding
    another split does not look like it will benefit
    the classification much.

17
CART - Misclassifications
  • Training data
  • Validation data

18
CART - Final Pick
  • 11 terminal nodes

19
CART - Classification
20
CHAID
21
CHAID - Classifications
  • Training set: 50/999 = 0.05005
  • Validation set: 29/336 = 0.0863

22
Donut Chart
23
Logistic Regression
  • If P(Y = 1) > 0.46, we classify the observation
    as an event, i.e., a woman having been
    physically abused.
  • The misclassification rate is 52/999 = 0.05205
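
The cutoff rule and the misclassification rate can be sketched as follows. The predicted probabilities and actual outcomes are hypothetical; only the 0.46 cutoff and the 52/999 ratio come from the slide.

```python
def classify(probabilities, cutoff=0.46):
    """Label an observation 1 (event) when its probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]

def misclassification_rate(predicted, actual):
    """Share of observations whose predicted class differs from the actual class."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical probabilities and outcomes, for illustration only.
probs = [0.10, 0.47, 0.46, 0.90]
actual = [0, 1, 1, 1]
predicted = classify(probs)          # 0.46 itself is not above the cutoff
rate = misclassification_rate(predicted, actual)

reported = 52 / 999                  # the ratio quoted on the slide
```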

24
Lift Chart
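  • Lift measures how well the model predicts. It is
    calculated as a ratio: the response rate among
    the cases the model ranks highest, divided by
    the overall response rate.
  • Cumulative Response and Lift Charts are visual
    ways to assess how well the model performs.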
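
One common way to compute lift is sketched below, assuming the usual definition: the event rate among the top-scoring fraction of cases divided by the event rate in the whole sample. The scores and outcomes are hypothetical, and the function name is illustrative.

```python
def lift_at(scores, outcomes, fraction):
    """Event rate in the top `fraction` of cases (ranked by score) over the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate

# Hypothetical model scores and observed outcomes (1 = event).
scores = [0.95, 0.90, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at(scores, outcomes, 0.2)
lift_all = lift_at(scores, outcomes, 1.0)   # using every case gives no lift
```

A lift chart plots this ratio against the fraction of cases selected.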

25
Cumulative Response
26
Lift Chart
27
ROC Curve
  • c = 0.917 (area under the ROC curve)
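
The c statistic can be read as the proportion of (event, non-event) pairs in which the event receives the higher predicted probability, with ties counted as one half. A minimal sketch with hypothetical scores follows; it does not reproduce the 0.917 reported above.

```python
def c_statistic(scores, outcomes):
    """Share of (event, non-event) pairs ranked correctly; ties count one half."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return concordant / (len(events) * len(nonevents))

# Hypothetical predicted probabilities and outcomes (1 = event).
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
outcomes = [1, 1, 1, 0, 0]
c = c_statistic(scores, outcomes)
```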

28
Comparisons
29
Set up Breslow Day
  • There are three 2x2 tables, one for each method.
  • We want to test whether the association differs
    significantly across the 3 methods.

30
Breslow Day Test
  • H0: OR1 = OR2 = OR3
  • Ha: The ORs are not all equal
  • Test statistic: 0.3839
  • P-value: 0.8253
  • Conclusion: There is no evidence that the ORs
    differ across the methods.

31
References
  • 1. Berry, Michael J.A. and Linoff, Gordon. Data
    Mining Techniques: For Marketing, Sales, and
    Customer Support. John Wiley & Sons, Inc., New
    York, 1997.
  • 2. Breiman, L., Friedman, J.H., Olshen, R.A.,
    and Stone, C.J. Classification and Regression
    Trees. Wadsworth Inc., California, 1984.
  • 3. Fernandez, George. Data Mining Using SAS
    Applications. Chapman & Hall/CRC, Boca Raton,
    FL, 2003.
  • 4. Johnson, Dallas E. Applied Multivariate
    Methods for Data Analysts. Duxbury Press,
    Pacific Grove, CA, 1998.
  • 5. Kass, G.V. An Exploratory Technique for
    Investigating Large Quantities of Categorical
    Data. Applied Statistics, Vol. 29, No. 2
    (1980), 119-127.
  • 6. http://marketing.byu.edu/download/measurementanalysis/DataSets/chd/
  • 7. Logistic Regression Examples Using the SAS
    System. SAS Institute Inc., North Carolina,
    1995.
  • 8. Hastie, T., Tibshirani, R., and Friedman, J.
    The Elements of Statistical Learning. Springer,
    New York, 2001.