1
Logistic Regression and Discriminant Function
Analysis
2
Logistic Regression vs. Discriminant Function
Analysis
  • Similarities
  • Both predict group membership for each
    observation (classification)
  • Dichotomous DV
  • Both require an estimation and a validation sample to
    assess predictive accuracy
  • If the split between groups is not more extreme
    than 80/20, the two methods yield similar results in practice

3
Logistic Reg vs. Discrim: Differences
  • Discriminant Analysis
  • Assumes MV normality
  • Assumes equality of VCV matrices
  • A large number of predictors tends to violate MV normality
    and can't be accommodated
  • Predictors must be continuous, interval level
  • More powerful when assumptions are met
  • Many assumptions, rarely met in practice
  • Categorical IVs create problems
  • Logistic Regression
  • No assumption of MV normality
  • No assumption of equality of VCV matrices
  • Can accommodate large numbers of predictors more
    easily
  • Categorical predictors OK (e.g., dummy codes)
  • Less powerful when assumptions are met
  • Few assumptions, typically met in practice
  • Categorical IVs can be dummy coded

4
Logistic Regression
  • Outline
  • Categorical Outcomes: Why not OLS Regression?
  • General Logistic Regression Model
  • Maximum Likelihood Estimation
  • Model Fit
  • Simple Logistic Regression

5
Categorical Outcomes: Why not OLS Regression?
  • Dichotomous outcomes
  • Passed / Failed
  • CHD / No CHD
  • Selected / Not Selected
  • Quit/ Did Not Quit
  • Graduated / Did Not Graduate

6
Categorical Outcomes: Why not OLS Regression?
  • Example: Relationship between performance and turnover
  • Line of best fit?!
  • Errors (Y - Ŷ) across
    values of performance (X)?

7
Problems with Dichotomous Outcomes/DVs
  • The regression surface is intrinsically
    non-linear
  • Errors assume one of two possible values, violating the
    assumption of normally distributed errors
  • Violates assumption of homoscedasticity
  • Predicted values of Y greater than 1 and smaller
    than 0 can be obtained
  • The true magnitude of the effects of IVs may be
    greatly underestimated
  • Solution: Model the data using logistic regression,
    NOT OLS regression

8
Logistic Regression vs. Regression
  • Logistic regression predicts a probability that
    an event will occur
  • Range of possible responses between 0 and 1
  • Must use an s-shaped curve to fit data
  • OLS regression assumes linear relationships and can't
    fit an s-shaped curve
  • Violates the assumption of normally distributed errors
  • Creates heteroscedasticity

9
Example: Relationship between Age and CHD (1 = Has
CHD)
10
General Logistic Regression Model
  • Y (the outcome variable) is the probability of
    having one outcome or another, based on a
    nonlinear function of the best linear combination
    of predictors (see the sketch below)
  • Where
  • Y = probability of an event
  • Linear portion of the equation (a + b1x1) is used to
    predict the probability of the event (0, 1); it is not an
    end in itself

11
The logistic (logit) transformation
  • DV is dichotomous, so the purpose is to estimate the
    probability of occurrence (0, 1)
  • Thus, the DV is transformed into a likelihood
  • The logit/logistic transformation accomplishes this
    (the linear regression eq. predicts the log of the odds;
    see the sketch below)
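A sketch of the transformation referred to in the last bullet above, shown for a single predictor X_1 for simplicity:

  logit(P) = ln( P / (1 - P) ) = a + b_1 X_1

That is, the linear regression equation predicts the log of the odds rather than the probability itself.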

12
Probability Calculation
  P = e^{a + bX} / (1 + e^{a + bX})

Where:
  The relation between logit(P) and X is intrinsically linear
  b = expected change in logit(P) given a one-unit change in X
  a = intercept
  e = the exponential function
13
Ordinary Least Squares (OLS) Estimation
  • The purpose is to obtain the estimates that
    minimize the sum of squared errors, Σ(y - ŷ)²
  • The estimates chosen best describe the
    relationships among the observed variables (IVs
    and DV)
  • Estimates chosen maximize the probability of
    obtaining the observed data (i.e., these are the
    population values most likely to produce the data
    at hand)

14
Maximum Likelihood (ML) estimation
  • OLS can't be used in logistic regression because
    of the non-linear nature of the relationships
  • In ML, the purpose is to obtain the parameter
    estimates most likely to produce the data
  • ML estimators are those with the greatest joint
    likelihood of reproducing the data
  • In logistic regression, each model yields a ML
    joint probability (likelihood) value
  • Because this likelihood value tends to be very small (e.g.,
    .00000015), its log is multiplied by -2 (yielding -2LL;
    see the sketch below)
  • The -2 log transformation also yields a statistic
    with a known distribution (the chi-square
    distribution)
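A sketch of the likelihood and -2LL described above, using standard definitions (not shown on the slide itself): for n cases with observed outcomes y_i (0 or 1) and model-predicted probabilities p_i,

  LL = \sum_{i=1}^{n} [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]
  -2LL = -2 · LL

Because each p_i lies between 0 and 1, the joint likelihood e^{LL} is tiny, while -2LL is a convenient positive number that can be used in chi-square-based model comparisons.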

15
Model Fit
  • In logistic regression, R and R² don't make sense
  • Evaluate model fit using the -2log likelihood
    (-2LL) value obtained for each model (through ML
    estimation)
  • The -2LL value reflects the fit of a model and is used to
    compare the fit of nested models
  • The -2LL measures lack of fit, i.e., the extent to which the
    model fits the data poorly
  • When the model fits the data perfectly, -2LL = 0
  • Ideally, the -2LL value for the null model (i.e.,
    the model with no predictors, or intercept-only
    model) would be larger than that of the model with
    predictors

16
Comparing Model Fit
  • The fit of the null model can be tested against
    the fit of the model with predictors using a
    chi-square test (see the sketch below)
  • Where
  • χ² = chi-square for the improvement in model fit
    (with df = the difference in the number of predictors
    between the two models)
  • -2LL_M0 = -2 log-likelihood value for the null model
    (intercept-only model)
  • -2LL_M1 = -2 log-likelihood value for the hypothesized
    model
  • The same test can be used to compare a nested model
    with k predictor(s) to a model with k+1 predictors,
    etc.
  • Same logic as OLS regression, but the models are
    compared using a different fit index (-2LL)
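A sketch of the comparison described in this list (the standard likelihood-ratio chi-square difference test):

  χ² = (-2LL_{M0}) - (-2LL_{M1}),   df = k_{M1} - k_{M0}

where k is the number of predictors in each model. A significant χ² indicates that the added predictors improve fit.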

17
Pseudo R2
  • Assessment of overall model fit
  • Calculation (see the sketch below)
  • Two primary pseudo R² statistics
  • Nagelkerke: less conservative;
    preferred by some because its maximum = 1
  • Cox & Snell: more conservative
  • Interpret like R² in OLS regression
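One common way the two statistics are calculated (a sketch using standard formulas; the slide's own calculation image is not in the transcript), with n = sample size:

  Cox & Snell:  R²_{CS} = 1 - e^{ [(-2LL_{M1}) - (-2LL_{M0})] / n }
  Nagelkerke:   R²_{N}  = R²_{CS} / ( 1 - e^{ -(-2LL_{M0}) / n } )

Nagelkerke rescales Cox & Snell so that its maximum possible value is 1, which is why it is the larger (less conservative) of the two.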

18
Unique Prediction
  • In OLS regression, the significance tests for the
    beta weights indicate whether each IV is a unique
    predictor
  • In logistic regression, the Wald test is used for
    the same purpose

19
Similarities to Regression
  • You can use all of the following procedures you
    learned about in OLS regression in logistic
    regression as well
  • Dummy coding for categorical IVs
  • Hierarchical entry of variables (compare changes
    in classification and the significance of the Wald tests)
  • Stepwise entry (but don't use it; it's atheoretical)
  • Moderation tests

20
Simple Logistic Regression Example
  • Data collected from 50 employees
  • Y = success in training program (1 = pass; 0 =
    fail)
  • X1 = Job aptitude score (5 = very high; 1 = very
    low)
  • X2 = Work-related experience (months)

21
Syntax in SPSS
PASS is the DV; APT and EXPER are the IVs.

LOGISTIC REGRESSION PASS
  /METHOD = ENTER APT EXPER
  /SAVE = PRED PGROUP
  /CLASSPLOT
  /PRINT = GOODFIT
  /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
22
Results
  • Block 0: the null model results
  • Can't do any worse than this
  • Block 1: Method = Enter
  • Tests of the model of interest
  • Interpret data from here

Tests whether the model is significantly better than the
null model. A significant chi-square means yes!
Step, Block, and Model yield the same results because
all IVs are entered in the same block.
23
Results Continued
-2 Log Likelihood: an index of fit - a smaller
number means better fit (perfect fit = 0). Pseudo
R²: interpret like R² in regression. Nagelkerke is
preferred by some because its maximum = 1; Cox &
Snell is a uniformly more conservative estimate.
24
Classification: Null Model vs. Model Tested
Null Model: 52% correct classification
Model Tested: 72% correct classification
25
Variables in Equation
B → the effect of a one-unit change in the IV on the log
odds (hard to interpret). Odds Ratio (OR) →
Exp(B) in SPSS; more interpretable: a one-unit
change in aptitude multiplies the odds of
passing by 1.7. Wald → like a t test, but uses the
chi-square distribution. Significance → used to
determine whether the Wald test is significant.
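A worked sketch of the Exp(B) idea (the coefficient value here is hypothetical, chosen only to be consistent with the 1.7 figure above): if B = 0.53 for aptitude, then

  OR = Exp(B) = e^{0.53} ≈ 1.70

so each one-unit increase in aptitude multiplies the odds of passing by about 1.7, holding the other predictor constant.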
26
Histogram of Predicted Probabilities
27
To Flag Misclassified Cases
  • SPSS syntax
  • COMPUTE PRED_ERR = 0.
  • IF LOW NE PGR_1 PRED_ERR = 1.
  • You can use this for additional analyses to
    explore causes of misclassification

28
Results Continued
An index of model fit. The chi-square compares the
fit of the data (the observed events) with the
model (the predicted events). A non-significant
result means that the observed and expected values
are similar → this is good!
29
Hierarchical Logistic Regression
  • Question: Which of the following variables
    predict whether a woman is hired to be a Hooters
    girl?
  • Age
  • IQ
  • Weight

30
Simultaneous v. Hierarchical
Hierarchical entry                  Simultaneous entry
Block 1. IQ                         Block 1. IQ, Age, Weight
  Cox & Snell = .002; Nagelkerke = .003
Block 2. Age
  Cox & Snell = .264; Nagelkerke = .353
Block 3. Weight
  Cox & Snell = .296; Nagelkerke = .395
31
Simultaneous v. Hierarchical
Block 1. IQ
Block 1. IQ, Age, Weight
Block 2. Age
Block 3. Weight
32
Simultaneous v. Hierarchical
Block 1. IQ
Block 1. IQ, Age, Weight
Block 2. Age
Block 3. Weight
33
Multinomial Logistic Regression
  • A form of logistic regression that allows
    prediction of probability into more than 2 groups
  • Based on a multinomial distribution
  • Sometimes called polytomous logistic regression
  • Conducts an omnibus test first for each predictor
    across 3 groups (like ANOVA)
  • Then conduct pairwise comparisons (like post hoc
    tests in ANOVA)

34
Objectives of Discriminant Analysis
  • Determining whether significant differences exist
    between average scores on a set of variables for
    2+ a priori defined groups
  • Determining which IVs account for most of the
    differences in average score profiles for the 2+
    groups
  • Establishing procedures for classifying objects
    into groups based on scores on a set of IVs
  • Establishing the number and composition of the
    dimensions of discrimination between groups
    formed from the set of IVs

35
Discriminant Analysis
  • Discriminant analysis develops a linear
    combination that can best separate groups.
  • Opposite of MANOVA
  • In MANOVA, groups are usually constructed by the
    researcher and have a clear structure (e.g., a 2 x
    2 factorial design). Groups = IVs
  • In discriminant analysis, the groups usually
    have no particular structure and their formation
    is not under experimental control. Groups = DVs

36
How Discrim Works
  • Linear combinations (discriminant functions) are
    formed that maximize the ratio of between-groups
    variance to within-groups variance for a linear
    combination of predictors.
  • Total # of discriminant functions = # of groups - 1 OR #
    of predictors (whichever is smaller)
  • If more than one discriminant function is
    formed, subsequent discriminant functions are
    independent of prior combinations and account for
    as much remaining group variation as possible.
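A sketch of the ratio being maximized (the standard discriminant criterion; no formula appears on the slide itself): for a weight vector a, between-groups SSCP matrix B, and within-groups SSCP matrix W, each discriminant function chooses a to maximize

  λ = (a' B a) / (a' W a)

and each subsequent function maximizes the same ratio subject to being uncorrelated with the earlier functions.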

37
Assumptions in Discrim
  • Multivariate normality of IVs
  • Violation more problematic if overlap between
    groups
  • Homogeneity of VCV matrices
  • Linear relationships
  • IVs continuous (interval scale)
  • Can accommodate nominal but violates MV normality
  • Single categorical DV
  • Results influenced by
  • Outliers (classification may be wrong)
  • Multicollinearity (interpretation of coefficients
    difficult)

38
Sample Size Considerations
  • Observations : Predictors
  • Suggested: 20 observations per predictor
  • Minimum required: 5 observations per predictor
  • Observations : Groups (in DV)
  • Minimum: the smallest group's size exceeds the # of IVs
  • Practical guide: each group should have 20+
    observations
  • Wide variation in group size impacts results
    (i.e., classification can be incorrect)

39
Example
In this hypothetical example, data from 500
graduate students seeking jobs were examined.
Available for each student were three predictors:
GRE (V+Q), Years to Finish the Degree, and Number
of Publications. The outcome measure was
categorical: Got a job versus Did not get a
job. Half of the sample was used to determine
the best linear combination for discriminating
the job categories. The second half of the sample
was used for cross-validation.
40
DISCRIMINANT
  /GROUPS = job(1 2)
  /VARIABLES = gre pubs years
  /SELECT = sample(1)
  /ANALYSIS ALL
  /SAVE = CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS = MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT = COMBINED SEPARATE MAP
  /PLOT = CASES
  /CLASSIFY = NONMISSING POOLED .
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
Interpreting Output
  • Box's M
  • Eigenvalues
  • Wilks' Lambda
  • Discriminant Weights
  • Discriminant Loadings

48
(No Transcript)
49
Violates Assumption of Homogeneity of VCV
matrices. But this test is sensitive in general
and sensitive to violations of multivariate
normality too. Tests of significance in
discriminant analysis are robust to moderate
violations of the homogeneity assumption.
50
(No Transcript)
51
Discriminant Weights
Data from both these outputs indicate that one of
the predictors best discriminates who did/did not
get a job. Which one is it?
Discriminant Loadings
52
This is the raw canonical discriminant function.
The means for the groups on the raw canonical
discriminant function can be used to establish
cut-off points for classification.
53
Classification can be based on distance from the
group centroids and take into account information
about prior probability of group membership.
54
(No Transcript)
55
Two modes?
56
(No Transcript)
57
Violation of the homogeneity assumption can
affect the classification. To check, the analysis
can be conducted using separate group covariance
matrices.
58
No noticeable change in the accuracy of
classification.
59
Discriminant Analysis Three Groups
The group that did not get a job was actually
composed of two subgroups: those that got
interviews but did not land a job and those that
were never interviewed. This accounts for the
bimodality in the discriminant function scores.
The discriminant analysis of the three groups
allows for the derivation of one more
discriminant function, perhaps indicating the
characteristics that separate those who get
interviews from those who don't, or those who
have successful interviews from those whose
interviews do not produce a job offer.
60
Remember this?
Two modes?
61
(No Transcript)
62
(No Transcript)
63
DISCRIMINANT
  /GROUPS = group(1 3)
  /VARIABLES = gre pubs years
  /SELECT = sample(1)
  /ANALYSIS ALL
  /SAVE = CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS = MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT = COMBINED SEPARATE MAP
  /PLOT = CASES
  /CLASSIFY = NONMISSING POOLED .
64
(No Transcript)
65
Separating the three groups produces better
homogeneity of VCV matrices. Still significant,
but just barely. Not enough to worry about.
66
Two significant linear combinations can be
derived, but they are not of equal importance.
67
Weights
What do the linear combinations mean now?
Loadings
68
(No Transcript)
69
(No Transcript)
70
Loadings
Weights
71
This figure shows that discriminant function 1,
which is made up of number of publications and
years to finish, reliably differentiates between
those who got jobs, had interviews only, and had
no job or interview. Specifically, a high value on
DF1 was associated with not getting a job,
suggesting that having few publications (loading =
-.466) and taking a long time to finish (loading =
.401) were associated with not getting a job.
72
(No Transcript)
73
(No Transcript)
74
Territorial Map
[Territorial map plot: Canonical Discriminant Function 1 on the
horizontal axis vs. Canonical Discriminant Function 2 on the vertical
axis, both spanning -6.0 to 6.0; boundary lines separate the
classification regions of the three groups.]
Symbols used in territorial map:
  1 = Unemployed
  2 = Got a Job
  3 = Interview Only
  * indicates a group centroid
75
(No Transcript)
76
Classification
A classification function is derived for each
group. The original data are used to estimate a
classification score for each person, for each
group. The person is then assigned to the group
that produces the largest classification score.
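A sketch of that rule (standard classification functions; the actual coefficients appear on the output slides, which are image-only in this transcript): for each group g, a classification score is computed from the p predictors as

  C_g = c_{g0} + c_{g1} X_1 + c_{g2} X_2 + ... + c_{gp} X_p

and the case is assigned to the group with the largest C_g. Prior probabilities of group membership can also be folded into these scores.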
77
(No Transcript)
78
Is the classification better than would be
expected by chance? Observed values
79
Expected classification by chance: E = (Row total x
Column total) / Total N
80
Correct classification that would occur by chance
81
The difference between chance expected and actual
classification can be tested with a chi-square as
well.
Cell contributions, (O - E)² / E:
  145.13   13.82   23.47
   14.48   59.25    8.77
   25.50   11.28   29.34
Chi-squared = 331.04
Degrees of freedom = (# of groups - 1)², so df = 4