Title: Logistic regression Results from factor analysis Multilevel analysis Last survey issues online next
1 Logistic regression
- Results from factor analysis
- Multi-level analysis
- Last survey issues (online next week)
2 AMMBR week 6
- logistic regression example case
- We have data on the births of almost 200 babies. Some of these babies have a dangerously low birth weight (1 = low, 0 = not low). The question is which factors are associated with a low birth weight.
3 Logistic regression assignment: what steps?
- 1. Look at the descriptives. Any transformations necessary/useful?
- 2. Is there a problem of multicollinearity between the (transformed) independent variables?
- 3. Sampling adequacy? Are all cells in the multivariate tables of the categorical variables filled with enough cases?
- 4. Were the cases sampled independently?
- 5. Are there any relevant 'third' variables missing?
- 6. Does the chosen model fit the data?
4 Logistic regression assignment: what steps?
- 7. Any interaction effects?
- 8. Any curvilinear relationships?
- 9. Any outliers?
- 10. Homogeneity of the error variance?
- 11. Interpretation of the results of the finally chosen model
5 1. Descriptives/transformations
6 1. Descriptives/transformations
new_weight = 1/weight
7 1. Descriptives/transformations
ln_age = ln(age)
8 1. Descriptives/transformations
RECODE naararts (0=0) (1 thru Highest=1) (ELSE=SYSMIS) INTO new_naararts .
EXECUTE .
9 two dummy variables for race
- RECODE ras (2=1) (1=0) (3=0) INTO black .
- EXECUTE .
- RECODE ras (1=1) (2=0) (3=0) INTO white .
- EXECUTE .
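For readers who want to check the recodes outside SPSS, the transformations of step 1 and the two race dummies can be sketched in Python. This is only an illustration: the function names and the example values (weight 120, age 25, 3 doctor visits) are hypothetical, and the reading of ras as 1 = white, 2 = black, 3 = other is an assumption inferred from the RECODE commands.

```python
import math

def recode_visits(naararts):
    """Mirror RECODE naararts (0=0) (1 thru Highest=1) (ELSE=SYSMIS)."""
    if naararts is None:
        return None          # missing stays missing
    if naararts == 0:
        return 0
    if naararts >= 1:
        return 1
    return None              # anything else becomes system-missing

def race_dummies(ras):
    """Return (black, white); ras assumed coded 1 = white, 2 = black, 3 = other."""
    if ras not in (1, 2, 3):
        return (None, None)  # values not named in the RECODE stay missing
    return (1 if ras == 2 else 0, 1 if ras == 1 else 0)

def transform_case(weight, age, naararts, ras):
    """Apply the step-1 transformations to one (hypothetical) case."""
    black, white = race_dummies(ras)
    return {
        "new_weight": 1.0 / weight,      # new_weight = 1/weight
        "ln_age": math.log(age),         # ln_age = ln(age)
        "new_naararts": recode_visits(naararts),
        "black": black,
        "white": white,
    }

# Hypothetical case, not taken from the course data
example = transform_case(weight=120, age=25, naararts=3, ras=2)
print(example)
```

With two dummies for three categories, the third category (ras = 3) is the reference group: both dummies are 0 for it.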
10 2. multicollinearity?
- no strong bi-variate correlations between the x-variables
- REGRESSION
- /MISSING LISTWISE
- /STATISTICS COLLIN TOL
- /CRITERIA=PIN(.05) POUT(.10)
- /NOORIGIN
- /DEPENDENT babylicht
- /METHOD=ENTER black white rookt new_weight new_naararts hypertensie ln_age .
- no VIF > 10, all tolerances > 0.2
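As a reminder of how the two collinearity statistics relate, the standard definitions are given here. The symbol R_j^2 is not on the slide: it stands for the R-square obtained when predictor j is regressed on all the other predictors.

```latex
\mathrm{Tol}_j = 1 - R_j^2,
\qquad
\mathrm{VIF}_j = \frac{1}{\mathrm{Tol}_j} = \frac{1}{1 - R_j^2}
```

Under these definitions a tolerance above 0.2 corresponds to a VIF below 5, so the tolerance check is the stricter of the two.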
11 3. sampling adequacy
- frequencies and (higher order!) cross-tabs
- only 12 cases with hypertension
- no analysis of interaction effects with hypertension possible
- Check standard error of estimate carefully
12 4. independent sampling
- here we have to assume independent sampling...
- if you are doing 'real' research you have to
check...
13 5. all relevant 'third' variables?
- how do we know? -> theory
- importance of theory guided data collection
- here we have to assume....
14 6. model fit (simple model)
- LOGISTIC REGRESSION babylicht
- /METHOD = ENTER black white ln_age new_weight rookt new_naararts hypertensie
- /SAVE = PRED COOK ZRESID
- /PRINT = GOODFIT
- /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
15 6. model fit (simple model)
Nagelkerke's R-Square = 18.2%
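To make the Nagelkerke figure less of a black box, here is a minimal sketch of how it follows from the log-likelihoods of the empty model and the fitted model, using the standard Cox and Snell and Nagelkerke formulas. The log-likelihood values and the sample size below are hypothetical and are not taken from the SPSS output.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell R-square: 1 - (L0/L1)^(2/n), written with log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox & Snell R-square rescaled by its maximum, so that it can reach 1."""
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# Hypothetical values, for illustration only
ll_null, ll_model, n = -120.0, -105.0, 200
cs = cox_snell_r2(ll_null, ll_model, n)
r2 = nagelkerke_r2(ll_null, ll_model, n)
print(round(cs, 3), round(r2, 3))
```

Because the rescaling divides by a number below 1, the Nagelkerke value is always at least as large as the Cox and Snell value for the same model.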
16 7. interaction effects
- COMPUTE smoke_arts = rookt * new_naararts .
- compare the difference in fit between the simple model and the model with interaction
- Chi-Square = 26.193 - 26.163 = 0.03, df = 1, p > 0.5
- no significant improvement
17 7. interaction effects
- COMPUTE smoke_arts = rookt * new_naararts .
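The chi-square difference test for the interaction can be reproduced by hand. The sketch below uses the two model chi-square values from the slide and the fact that, for 1 degree of freedom, the upper-tail probability of a chi-square value x equals erfc(sqrt(x/2)). The helper name is hypothetical.

```python
import math

def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x), via the standard normal: erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

chi2_with_interaction = 26.193   # model chi-square, model with smoke_arts
chi2_simple = 26.163             # model chi-square, simple model

difference = chi2_with_interaction - chi2_simple   # df = 1 (one extra term)
p_value = chi2_upper_tail_df1(difference)
print(round(difference, 3), round(p_value, 2))
```

This comparison is only valid because the simple model is nested in the model with the interaction term.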
18 8. linear relationship?
- COMPUTE square_age = ln_age * ln_age .
- EXECUTE .
- LOGISTIC REGRESSION babylicht
- /METHOD = ENTER black white new_weight rookt new_naararts hypertensie square_age
- /SAVE = PRED COOK ZRESID
- /PRINT = GOODFIT
- /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
- Note 1: replacement of (transformed) age by age^2
- Note 2: no 'nested' models
19 8. linear relationship?
Model fit is not better than the fit of the simpler (linear) model.
Note: you cannot take the chi-square difference as test statistic, because the models are not nested.
20 8. linear relationship?
COMPUTE weight_square = (new_weight * new_weight) * 10000 .
EXECUTE .
Is the effect of weight linear or quadratic?
21 9. Outliers
- FREQUENCIES
- VARIABLES=ZRE_1
- /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MEDIAN SKEWNESS SESKEW KURTOSIS SEKURT
- /HISTOGRAM
- /ORDER = ANALYSIS .
- LOGISTIC REGRESSION babylicht
- /METHOD = ENTER black white rookt new_weight new_naararts hypertensie ln_age
- /SAVE = PRED COOK LEVER DFBETA ZRESID
- /CLASSPLOT /CASEWISE OUTLIER(2)
- /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
22 9. Outliers
Residuals and influence statistics
23 9. Outliers
Plot of residuals against probabilities (classplot)
24 9. Outliers
[Classification plot, step number 1: Observed Groups and Predicted Probabilities. Vertical axis: frequency. Horizontal axis: predicted probability of membership for group 1, from 0 to 1. The cut value is .50. Symbols: 0 = observed group 0, 1 = observed group 1. Each symbol represents 1 case.]
25 9. Influential Outliers?
- FREQUENCIES
- VARIABLES=COO_1 LEV_1 DFB0_1 DFB1_1 DFB2_1 DFB3_1 DFB4_1 DFB5_1 DFB6_1 DFB7_1
- /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN SKEWNESS SESKEW
- /HISTOGRAM
- /ORDER = ANALYSIS .
Crit val < 1
Crit val = 3(k+1)/n = .127
26 9. Influential Outliers
Crit val = 1
27 9. Influential Outliers
28 9. Influential Outliers
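The leverage threshold 3(k+1)/n on the slides can be checked with a few lines of Python. The sketch assumes k = 7 (the seven predictors in the model) and n = 189; the sample size is not stated on the slides and is inferred here from the reported value of .127, so treat it as an assumption. The leverage values in the usage example are made up.

```python
def leverage_cutoff(k, n):
    """Rule-of-thumb critical value for leverage: 3(k+1)/n."""
    return 3.0 * (k + 1) / n

def flag_high_leverage(leverages, k, n):
    """Return the positions of cases whose leverage exceeds the cutoff."""
    cutoff = leverage_cutoff(k, n)
    return [i for i, h in enumerate(leverages) if h > cutoff]

k = 7     # black, white, rookt, new_weight, new_naararts, hypertensie, ln_age
n = 189   # ASSUMPTION: inferred from the reported critical value, not stated
cutoff = leverage_cutoff(k, n)
print(round(cutoff, 3))

# Hypothetical leverage values for four cases
print(flag_high_leverage([0.05, 0.20, 0.10, 0.13], k, n))
```

In SPSS the same check amounts to inspecting the maximum of LEV_1 in the FREQUENCIES output against this cutoff.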
29 10. Variance homogeneity
- GRAPH
- /SCATTERPLOT(BIVAR)=PRE_1 WITH ZRE_1
- /MISSING=LISTWISE .
30 11. Interpretation of chosen model
31 11. Interpreting the coefficients
- In multiple regression
- y = c0 + c1x1 + c2x2
- the c1 represents the relation between x1 and y, irrespective of the value of x2.
- In logistic regression this is no longer the case
- -> In logistic regression, the size of the effect depends on the values of the other independent variables.
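A small numerical illustration may help here. In the sketch below the coefficients c0, c1 and c2 are hypothetical, not the estimates from the birth weight model. It shows that the change in the predicted probability when x1 goes from 0 to 1 differs depending on x2, while the odds ratio for x1 stays the same (exp(c1)).

```python
import math

def logistic(z):
    """Inverse logit: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def prob(x1, x2, c0=-2.0, c1=1.0, c2=1.5):
    """P(y = 1) for hypothetical coefficients c0, c1, c2."""
    return logistic(c0 + c1 * x1 + c2 * x2)

def effect_of_x1(x2):
    """Change in P(y = 1) when x1 goes from 0 to 1, holding x2 fixed."""
    return prob(1, x2) - prob(0, x2)

def odds(p):
    return p / (1.0 - p)

effect_low = effect_of_x1(x2=0)    # effect of x1 when x2 = 0
effect_high = effect_of_x1(x2=1)   # effect of x1 when x2 = 1
odds_ratio_low = odds(prob(1, 0)) / odds(prob(0, 0))
odds_ratio_high = odds(prob(1, 1)) / odds(prob(0, 1))
print(round(effect_low, 3), round(effect_high, 3))
print(round(odds_ratio_low, 3), round(odds_ratio_high, 3))
```

So the coefficient is constant on the log-odds scale, but its effect on the probability itself depends on where the other variables place the case on the S-shaped curve.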
32
33 Multi-level models
Bayesian hierarchical models, mixed models, hierarchical linear models, random effects models, random coefficient models, subject specific models, variance component models, variance heterogeneity
Dealing with clustered data. One solution: the variance component model
34 Clustered data / multi-level models
- Pupils within schools
- (within regions within countries)
- Firms within regions (or sectors)
- In the ICOP case: more than one observation per person
General issue: observations are not independent
35 On individual vs aggregate data
For instance:
- X = extraversion, Y = school results
- X = innovativeness of building, Y = extent to which building is liked
36 Had we only known that the data are clustered!
- So the effect of X within clusters can be different from the effect between clusters!
37 MAIN MESSAGE
- Make sure that you distinguish between two kinds of effects: those at the "micro-level" vs those at the aggregate level, and ...
- ... that you do not test a micro-hypothesis with aggregate data (or vice versa)
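To see how a within-cluster effect and a between-cluster effect can point in opposite directions, consider the following minimal sketch. The two clusters and their values are invented for illustration and have nothing to do with the course data.

```python
def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical clustered data: within each cluster y falls as x rises
clusters = {
    "A": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
    "B": ([6.0, 7.0, 8.0], [8.0, 7.0, 6.0]),
}

within_slopes = {name: slope(xs, ys) for name, (xs, ys) in clusters.items()}

# Between-cluster effect: regress the cluster means of y on the cluster means of x
mean_x = [mean(xs) for xs, _ in clusters.values()]
mean_y = [mean(ys) for _, ys in clusters.values()]
between_slope = slope(mean_x, mean_y)

# Pooled regression that ignores the clustering
all_x = [x for xs, _ in clusters.values() for x in xs]
all_y = [y for _, ys in clusters.values() for y in ys]
pooled_slope = slope(all_x, all_y)

print(within_slopes, between_slope, round(pooled_slope, 3))
```

Here the within-cluster slopes are negative while the between-cluster and pooled slopes are positive, which is exactly the situation in which testing a micro-hypothesis on aggregate data gives the wrong answer.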
38 A toy example: two schools, two pupils
Two schools, each with two pupils. We first calculate the means.
(taken from Rasbash)
School 1: scores 3 and 2, mean 2.5. School 2: scores -1 and -4, mean -2.5.
Overall mean = (3 + 2 + (-1) + (-4))/4 = 0
39 Now the variance
The total variance is the sum of the squares of the departures of the observations around the mean, divided by the sample size (4):
(9 + 4 + 1 + 16)/4 = 7.5
40 The variance of the school means around the overall mean
(2.5^2 + (-2.5)^2)/2 = 6.25
Total variance = 7.5
41 The variance of the pupils' scores around their school's mean
((3 - 2.5)^2 + (2 - 2.5)^2 + (-1 - (-2.5))^2 + (-4 - (-2.5))^2)/4 = 1.25
The variance of the school means around the overall mean = (2.5^2 + (-2.5)^2)/2 = 6.25
Total variance = 7.5 = 6.25 + 1.25
42 -> So you can partition the variance into an individual level and a school level
How much of the variability in pupil attainment is attributable to school level factors and how much to pupil level factors?
In terms of our toy example we can now say:
6.25/7.5 = 83% of the total variation of pupils' attainment is attributable to school level factors
And this is important: we want to know how to explain (in this example) school attainment
1.25/7.5 = 17% of the total variation of pupils' attainment is attributable to pupil level factors
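The toy example is small enough to recompute in a few lines. The sketch below uses the four pupil scores that appear in the calculations (3 and 2 in the first school, -1 and -4 in the second); the school labels and variable names are only for illustration.

```python
def mean(values):
    return sum(values) / len(values)

# The four pupil scores of the toy example, grouped by school
schools = {"school 1": [3, 2], "school 2": [-1, -4]}

scores = [s for pupils in schools.values() for s in pupils]
overall_mean = mean(scores)

# Total variance: squared departures from the overall mean, divided by n
total_var = mean([(s - overall_mean) ** 2 for s in scores])

# Between-school part: variance of the school means around the overall mean
school_means = {name: mean(pupils) for name, pupils in schools.items()}
between_var = mean([(m - overall_mean) ** 2 for m in school_means.values()])

# Within-school part: variance of the scores around their own school mean
within_var = mean([(s - school_means[name]) ** 2
                   for name, pupils in schools.items() for s in pupils])

school_share = between_var / total_var
pupil_share = within_var / total_var
print(total_var, between_var, within_var)
print(round(school_share, 3), round(pupil_share, 3))
```

The between and within parts add up to the total variance here because both schools have the same number of pupils; with unequal cluster sizes the school means would have to be weighted by cluster size.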
43 Standard multiple regression won't do
So:
- You can use all the data and just run a multiple regression, but then you disregard the clustering effect, which gives incorrect confidence intervals.
- You can aggregate within clusters, and then run a multiple regression on the aggregate data. Two problems: no individual level testing is possible, and you get fewer data points.
- Or you can run a multi-level analysis.
44 Multi-level models
- The usual multiple regression model assumes
- ... with the subscript "i" defined at the case-level.
- ... and the epsilons independently distributed with covariance matrix I.
- And with clustered data, you know these assumptions are not met.
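The formula that belonged to this slide did not survive in the text. A plausible reconstruction, assuming the standard linear model and the coefficient names c0, c1, ... used earlier in these slides, is given below. The number of predictors k and the variance symbol are assumptions; the slide itself only mentions the covariance matrix I, which corresponds to the form below up to the scale factor.

```latex
y_i = c_0 + c_1 x_{1i} + c_2 x_{2i} + \dots + c_k x_{ki} + \varepsilon_i,
\qquad
\varepsilon \sim N(0, \sigma^2 I)
```

With clustered data the off-diagonal elements of the covariance matrix are not zero for cases from the same cluster, which is why the independence assumption fails.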
45 Solution 1: add dummy-variables per person
- So just multiple regression, but with as many dummy variables as you have
- ... where, in this example, there are j+1 persons.
- IF the clustering is (largely) due to differences in the intercept between persons, this might work.
- BUT if there are only a handful of cases per person, this necessitates a huge number of extra variables
46 Solution 2: split your micro-level X-vars
- Say you have
- then create
- and add these as predictors (instead of x1)
Make sure that you understand what is happening
here, and why it is of use.
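The formulas for this step are missing from the text, but the summary slide later describes the split as a "group mean" variable and a "difference from group mean" variable. A minimal sketch of that split follows; the function name, the person labels and the x1 values are hypothetical.

```python
from collections import defaultdict

def split_within_between(groups, x):
    """Split x into its group mean and the deviation from that group mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, value in zip(groups, x):
        sums[g] += value
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    x_mean = [means[g] for g in groups]                     # between part
    x_dev = [value - means[g] for g, value in zip(groups, x)]  # within part
    return x_mean, x_dev

# Hypothetical data: three observations for each of two persons
person = ["a", "a", "a", "b", "b", "b"]
x1 = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]

x1_mean, x1_dev = split_within_between(person, x1)
print(x1_mean)
print(x1_dev)
```

Entering both new variables as predictors, instead of x1, yields one coefficient for the between-person effect and one for the within-person effect, so the two can differ.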
47 Solution 3: the variance component model
In the variance component model, we split the randomness into a "personal part" and a "rest part"
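The model equation itself is missing from the text. A common way to write a variance component model, offered here as an assumption rather than as the formula from the slide, uses i for the observation and j for the person, with delta as the "personal part" and epsilon as the "rest part":

```latex
y_{ij} = c_0 + c_1 x_{1ij} + \delta_j + \varepsilon_{ij},
\qquad
\delta_j \sim N(0, \sigma_\delta^2),
\quad
\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)
```

Under this formulation the share of the residual variance that sits at the person level is

```latex
\rho = \frac{\sigma_\delta^2}{\sigma_\delta^2 + \sigma_\varepsilon^2}
```

which plays the same role as the school-level share in the toy example.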
48 Now how do you do this in SPSS?
- Somewhat problematic: not available for a binary Y-variable.
- <See SPSS demo>
49 What we skipped
- Assumption checking: for instance, is the delta-variable normally distributed?
- Run the analysis with dummies per group, check the set of coefficients, and compare whether these follow a normal distribution
- What do we do when we have more than 2 levels?
- What do we do when Y is binary? (In our case: run MIXED anyway. It's wrong but it will do.)
- Random coefficients?
50 When you have multi-level data (2 levels)
- If applicable: consider whether using separate dummies per group might help
- Run an empty mixed model (i.e., just the constant included) in SPSS. Look at the level on which most of the variance resides.
- If applicable: divide micro-variables into "group mean" variables and "difference from group mean" variables.
- Re-run your mixed model with these variables included (as you would a multiple regression analysis)
51 To Do
- Have a look at your questions one last time. Uwe and I will create the survey in the coming week (there were still too many issues raised to start creating it now).
- Practise multi-level analysis with the data available on the Moodle site: start practising and/or brushing up your SPSS skills (once again)