Supervised Learning Methods Used to Classify or Predict Physical Abuse

1
Supervised Learning Methods Used to Classify or
Predict Physical Abuse

Marit Gilkey
Statistics Senior Seminar
Spring 2004

2
Methods

CART: Classification and Regression Trees
CHAID: Chi-Squared Automatic Interaction Detector
Logistic Regression

3
CART Algorithm

Step 1
The best possible predictor is found to split
the root node into two child nodes.
Best = the predictor that maximizes the reduction in impurity
Gini index: $\sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'}$
where $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
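
As a minimal sketch of the impurity criterion, the Python below computes the Gini index of a node from its class proportions and the reduction in impurity produced by one candidate split. It uses the equivalent form "sum over classes k of p_k (1 - p_k)". The function names and the small node are hypothetical illustrations, not the survey data or the software used in the presentation.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of one node: sum over classes k of p_k * (1 - p_k),
    where p_k is the proportion of the node's observations in class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * (1.0 - p) for p in proportions)

def impurity_reduction(parent, left, right):
    """Drop in Gini index when `parent` is split into `left` and `right`,
    with each child weighted by its share of the parent's observations."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_index(left) \
        + (len(right) / n) * gini_index(right)
    return gini_index(parent) - weighted_children

# Hypothetical node: 6 "no" and 4 "yes" responses, and one candidate split.
parent = ["no"] * 6 + ["yes"] * 4
left = ["no"] * 5 + ["yes"] * 1
right = ["no"] * 1 + ["yes"] * 3

print(gini_index(parent))
print(impurity_reduction(parent, left, right))
```

Step 1 would evaluate this reduction for every candidate split and keep the largest.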

4
CART Algorithm

Step 2
Each node is assigned a predicted outcome
class (yes or no).
Step 3
The process is repeated until the terminal
nodes are too small or too few to split.
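
Steps 2 and 3 can be sketched as a short recursive routine. This is a simplified illustration under stated assumptions: a single numeric predictor, the Gini index as the impurity measure, majority vote for the predicted class, and a hypothetical `min_node_size` stopping rule. It is not the implementation used for the trees in this presentation.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node from its class proportions."""
    n = len(labels)
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())

def grow_tree(xs, ys, min_node_size=5):
    """Grow a binary tree on one numeric predictor; returns nested dicts."""
    # Step 2: the predicted class of a node is its most common class.
    node = {"n": len(ys), "predicted": Counter(ys).most_common(1)[0][0]}
    # Step 3: stop when the node is too small or already pure.
    if len(ys) < min_node_size or gini(ys) == 0.0:
        return node
    best_impurity, best_threshold = gini(ys), None
    for threshold in sorted(set(xs))[:-1]:  # keep both children non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_impurity, best_threshold = impurity, threshold
    if best_threshold is None:  # no split reduces the impurity
        return node
    left_x = [x for x in xs if x <= best_threshold]
    left_y = [y for x, y in zip(xs, ys) if x <= best_threshold]
    right_x = [x for x in xs if x > best_threshold]
    right_y = [y for x, y in zip(xs, ys) if x > best_threshold]
    node["threshold"] = best_threshold
    node["left"] = grow_tree(left_x, left_y, min_node_size)
    node["right"] = grow_tree(right_x, right_y, min_node_size)
    return node

# Hypothetical data: low values are "no", high values are "yes".
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = ["no"] * 5 + ["yes"] * 5
tree = grow_tree(xs, ys)
print(tree)
```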

5
CHAID Algorithm

Step 1
The best predictor is chosen from all of the
predictors, by using a chi-squared test of
independence.
Best = the predictor with the most significant (smallest) p-value.
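
A simplified sketch of this selection step: cross-tabulate each predictor against the response, compute the Pearson chi-squared statistic, and keep the predictor with the smallest p-value. The p-value uses the closed-form tail series for integer degrees of freedom so that only the standard library is needed. The predictor names and counts are hypothetical, and the sketch omits any refinements a full CHAID implementation may apply.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_p_value(stat, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    half = stat / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    term, total = 1.0, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= stat / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) \
        + math.sqrt(2.0 * stat / math.pi) * math.exp(-half) * total

def best_predictor(tables):
    """Name of the predictor whose table has the smallest p-value."""
    p_values = {name: chi_square_p_value(*chi_square_statistic(table))
                for name, table in tables.items()}
    return min(p_values, key=p_values.get), p_values

# Hypothetical tables: rows = predictor categories, columns = (no, yes).
tables = {
    "predictor_a": [[480, 20], [456, 43]],
    "predictor_b": [[310, 20], [320, 22], [306, 21]],
}
best, p_values = best_predictor(tables)
print(best, p_values)
```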

6
CHAID Algorithm

Step 2
With this predictor, the dataset is
partitioned into 2 or more subsets based on the
number of categories of this predictor.
Subsets that do not differ significantly from
one another are combined.
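
The combining step can be sketched as follows, assuming a binary response so that each pair of categories forms a 2x2 table with 1 degree of freedom. The sketch repeatedly merges the pair with the largest p-value until every remaining pair differs at level `alpha`. The category names, counts, and the value of `alpha` are hypothetical, and any pair of categories is allowed to merge, which is a simplification.

```python
import math
from itertools import combinations

def pair_p_value(row_a, row_b):
    """p-value of the chi-squared test on the 2x2 table formed by two
    predictor categories (rows) and a binary response (columns)."""
    table = [row_a, row_b]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(stat / 2.0))  # chi-squared tail, df = 1

def merge_categories(counts, alpha=0.05):
    """Merge categories until all remaining pairs differ at level alpha."""
    groups = {(name,): list(row) for name, row in counts.items()}
    while len(groups) > 1:
        # Find the least significantly different pair of groups.
        pair, p = max(
            (((a, b), pair_p_value(groups[a], groups[b]))
             for a, b in combinations(list(groups), 2)),
            key=lambda item: item[1],
        )
        if p <= alpha:
            break  # every remaining pair is significantly different
        a, b = pair
        merged = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
        groups[a + b] = merged
    return groups

# Hypothetical counts per category: [no, yes].
counts = {"low": [300, 5], "medium": [310, 6], "high": [326, 52]}
groups = merge_categories(counts)
print(groups)
```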

7
CHAID Algorithm

Step 3
Each subset is then partitioned further using
the same criterion as in Step 1.
This is repeated for each subset until it can no
longer be split into statistically significant
subsets.

8
Logistic Regression

The logistic regression model of a response, y,
and one predictor variable, x, is
$P(y = 1) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$
Uses maximum-likelihood estimation
Estimates the probability of a given event
occurring.
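
To make the model concrete, the sketch below evaluates the logistic probability for one predictor and estimates the two coefficients by maximizing the log-likelihood with plain gradient ascent. The toy data, the learning rate, and the iteration count are hypothetical choices; statistical packages typically use a different optimizer to reach the same maximum-likelihood estimates.

```python
import math

def logistic_probability(x, beta0, beta1):
    """P(y = 1 | x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_likelihood(xs, ys, beta0, beta1):
    """Bernoulli log-likelihood of binary responses ys given xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_probability(x, beta0, beta1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Maximum-likelihood estimates found by gradient ascent."""
    beta0 = beta1 = 0.0
    n = len(xs)
    for _ in range(iterations):
        residuals = [y - logistic_probability(x, beta0, beta1)
                     for x, y in zip(xs, ys)]
        # The gradient of the log-likelihood is the sum of residuals
        # (intercept) and the sum of residual * x (slope).
        beta0 += learning_rate * sum(residuals) / n
        beta1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return beta0, beta1

# Hypothetical toy data (not the survey data).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(xs, ys)
print(b0, b1, logistic_probability(4, b0, b1))
```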

9
Variables

CART and CHAID: Response is continuous or
categorical. Predictors are continuous or
categorical.
Logistic: Response is binary. Predictors are
continuous or categorical.

10
Data Description

There are 2 sets of data.
Training set: 999 observations
Validation set: 336 observations
The data were obtained from respondents who
completed a women's health survey.

11
Description of Variables Used
12
Decision Trees

Advantages
Understandable rules
All variable types can be used
Identifies variables that are important
Disadvantages
Trees can be large and complex (CHAID)

13
Logistic Regression

Advantages
Model is clear
All variable types can be used
Disadvantages
Must determine whether there are variable interactions

14
CART Initial Tree

Predicted: 954 No, 45 Yes
Actual: 936 No, 63 Yes

15
Deviance

Deviance: $-\sum_{k} \hat{p}_{mk} \log \hat{p}_{mk}$
where $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
$\hat{p}_{mk}$ = proportion of observations in node m which
belong to class k.
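
A short sketch of the deviance (cross-entropy) of a single node, computed from the class proportions. For illustration it assumes the root node holds all 999 training observations with the 936 "no" and 63 "yes" actual responses reported on the CART Initial Tree slide. The function name is hypothetical.

```python
import math
from collections import Counter

def node_deviance(labels):
    """Cross-entropy (deviance) of one node: -sum_k p_k * log(p_k),
    where p_k is the proportion of the node's observations in class k."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Assumed root node: 936 "no" and 63 "yes" responses.
root = ["no"] * 936 + ["yes"] * 63
print(node_deviance(root))
```

A pure node has deviance 0, and for two classes the maximum, log(2), occurs when the classes are evenly split.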

16
CART - Pruning

Visually, beyond the 8th terminal node, adding
another split does not appear to improve
the classification much.

17
CART - Misclassifications

Training data

Validation data

18
CART - Final Pick

11 terminal nodes

19
CART - Classification
20
CHAID
21
CHAID - Classifications

Training data: 50/999 = 0.05005
Validation data: 29/336 = 0.0863

22
Donut Chart
23
Logistic Regression

If P(Y = 1) > 0.46, we classify the observation as an
event, i.e., a woman having been physically abused.
The misclassification rate is 52/999 = 0.05205.
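
The cutoff rule and the misclassification rate can be sketched in a few lines. The 0.46 cutoff and the 52/999 ratio come from this slide; the predicted probabilities and actual outcomes in the example are hypothetical.

```python
def classify(probabilities, cutoff=0.46):
    """Label an observation as an event (1) when P(Y = 1) exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]

def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual one."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

# Hypothetical predicted probabilities and actual outcomes.
probabilities = [0.10, 0.46, 0.47, 0.90]
actual = [0, 1, 1, 0]
predicted = classify(probabilities)
print(predicted, misclassification_rate(actual, predicted))

# The rate reported on the slide: 52 errors out of 999 training observations.
print(round(52 / 999, 5))
```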

24
Lift Chart

Lift measures how well the model predicts,
calculated as a ratio.
Cumulative Response and Lift Charts are visual
ways to assess how well the model performs.
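
As a rough sketch of the ratio, cumulative lift for the top fraction of observations (ranked by predicted probability) is the event rate in that group divided by the overall event rate. The function name and the ten scored observations are hypothetical.

```python
def cumulative_lift(actual, scores, fraction):
    """Event rate among the top `fraction` of observations ranked by score,
    divided by the event rate in the whole data set."""
    ranked = [y for _, y in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n_selected = max(1, round(fraction * len(ranked)))
    selected_rate = sum(ranked[:n_selected]) / n_selected
    overall_rate = sum(ranked) / len(ranked)
    return selected_rate / overall_rate

# Hypothetical outcomes (1 = event) and predicted probabilities.
actual = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, cumulative_lift(actual, scores, fraction))
```

A lift above 1 means the model finds events at a higher rate than selecting observations at random; at a fraction of 1.0 the lift is always 1.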

25
Cumulative Response
26
Lift Chart
27
ROC Curve

c = 0.917 (the c statistic, i.e., the area under the ROC curve)
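
The c statistic can be read as the proportion of (event, non-event) pairs in which the event receives the higher predicted probability, with ties counted as one half. The sketch below computes it that way on hypothetical scores; it does not reproduce the 0.917 reported here, which depends on the survey data.

```python
def c_statistic(actual, scores):
    """Concordance probability over all (event, non-event) pairs."""
    event_scores = [s for y, s in zip(actual, scores) if y == 1]
    non_event_scores = [s for y, s in zip(actual, scores) if y == 0]
    concordant = 0.0
    for e in event_scores:
        for n in non_event_scores:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5  # ties count as half
    return concordant / (len(event_scores) * len(non_event_scores))

# Hypothetical outcomes (1 = event) and predicted probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(c_statistic(actual, scores))
```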

28
Comparisons
29
Setup for the Breslow-Day Test

There are three 2x2 tables, one for each method.
We want to test whether the association differs
significantly across the 3 methods.

30
Breslow Day Test

H0: OR1 = OR2 = OR3
Ha: The ORs are not all equal
Test statistic = 0.3839
p-value = 0.8253
Conclusion: There is no evidence that the ORs differ
across the different methods.
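
As a check on the reported numbers, the sketch below assumes the test statistic is referred to a chi-squared distribution with 2 degrees of freedom (number of tables minus 1), for which the upper-tail probability is exp(-x/2). It also shows how the odds ratio of a single 2x2 table is computed; the example table is hypothetical, since the three tables themselves are not reproduced in this transcript.

```python
import math

def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def chi_square_p_value_df2(stat):
    """Upper-tail probability of a chi-squared variable with 2 degrees
    of freedom, which has the closed form exp(-stat / 2)."""
    return math.exp(-stat / 2.0)

# Hypothetical 2x2 table, for illustration only.
print(odds_ratio([[90, 10], [5, 20]]))

# Reported test statistic, assuming 3 - 1 = 2 degrees of freedom.
p_value = chi_square_p_value_df2(0.3839)
print(round(p_value, 4))  # 0.8253, matching the reported p-value
```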

31
References

1. Berry, Michael J.A. and Linoff, Gordon. Data Mining Techniques: For Marketing, Sales, and Customer Support. John Wiley & Sons, Inc., New York, 1997.
2. Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. Classification and Regression Trees. Wadsworth Inc., California, 1984.
3. Fernandez, George. Data Mining Using SAS Applications. Chapman & Hall/CRC, Boca Raton, FL, 2003.
4. Johnson, Dallas E. Applied Multivariate Methods for Data Analysts. Duxbury Press, Pacific Grove, CA, 1998.
5. Kass, G.V. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, Vol. 29, No. 2 (1980), 119-127.
6. http://marketing.byu.edu/download/measurementanalysis/DataSets/chd/
7. Logistic Regression Examples Using the SAS System. SAS Institute Inc., North Carolina, 1995.
8. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer, New York, 2001.