Correlation and Multiple Regression - PowerPoint PPT Presentation

1
Correlation and Multiple Regression
  • Robert K. Toutkoushian
  • Associate Professor
  • Educational Leadership and Policy Studies
  • Indiana University

2
Objectives of Module
  • Review statistical procedures such as correlation
    and multiple regression analysis
  • Examine ways in which these procedures can be
    applied to institutional research
  • Practice using SPSS to implement these procedures
  • Discuss more involved procedures and applications

3
My Approach
  • Aim for a middle ground in terms of difficulty
    (higher UG/lower G level)
  • Focus more on intuition behind procedures rather
    than proofs and derivations
  • Assume familiarity with descriptive stats and
    hypothesis testing
  • STRONGLY encourage questions at any time!

4
Covariance and Correlation
  • Both measure the extent to which two variables
    move together. They differ only in units of
    measure
  • Positive covariance/correlation: Both variables
    tend to move in the same direction
  • Negative covariance/correlation: The variables
    tend to move in opposite directions

5
(No Transcript)
6
Remember...
  • When looking for correlations, you may have to
    first reorder one of the variables
  • If two variables are related, then knowing the
    value of one variable may help with guesses as to
    the value of the other (e.g., retention and
    SAT/ACT scores)
  • Correlation does not imply causation!

7
Calculating the Covariance
  • Calculate the means for X and Y (denoted x-bar
    and y-bar)
  • Subtract the mean for X from each X value and
    repeat for Y
  • Multiply the differences together for each
    observation, then sum and divide by degrees of
    freedom (n-1)
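
The three steps above can be sketched in Python. This is a minimal illustration; the function name and the data values are hypothetical and are not the figures behind the slide's worked example.

```python
def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over (n - 1)."""
    n = len(x)
    # Step 1: means of X and Y
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Steps 2 and 3: deviations from the means, multiplied pairwise and summed
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Divide by the degrees of freedom
    return cross / (n - 1)


# Hypothetical data (illustration only): Y falls as X rises
x = [1, 2, 3, 4]
y = [8, 6, 4, 2]
print(covariance(x, y))  # negative, since the variables move in opposite directions
```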

8
Covariance = -132,000/(4-1) = -44,000
9
Correlation Coefficient
  • Properties of correlation coefficient
  • A standardized measure of covariance that
    ranges between -1 and 1
  • Positive: 0 < r ≤ 1
  • Negative: -1 ≤ r < 0
  • No correlation: r = 0
  • Cov(x,y) and r will have the same sign
  • Stronger relationship as r moves away from zero

10
Calculating Correlation Coefficient
  • Calculate cov(x,y) as before
  • Calculate st. devs for X and Y
  • Divide cov(x,y) by product of standard deviations
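
A short Python sketch of this calculation follows. The helper names and data are hypothetical; the standard deviations are taken as the square root of each variable's covariance with itself (its sample variance).

```python
import math


def sample_cov(x, y):
    """Sample covariance with n - 1 in the denominator."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """r = cov(x, y) / (sd(x) * sd(y))."""
    sd_x = math.sqrt(sample_cov(x, x))  # variance of X is cov(X, X)
    sd_y = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (sd_x * sd_y)


# Hypothetical data (illustration only)
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]
r = correlation(x, y)
print(round(r, 3))
```

Because the (n - 1) terms cancel, r has the same sign as the covariance and no units.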

11
(No Transcript)
12
Institutional Research Applications
  • Correlations can be useful in IR when one
    variable of interest is unobservable, and a
    correlated variable is observable
  • College performance (correlated with HS
    performance)
  • Faculty experience (correlated with age, years
    since degree)
  • Teaching quality (correlated with student
    evaluations)

13
Limitations
  • Weak correlations are less useful for making
    inferences
  • Correlations vary across factors, so it is
    difficult to compare across factors (e.g., stock
    prices and faculty salaries)
  • May be multiple factors affecting a single factor
    of interest
  • Does not measure non-linear relationships

14
Class Example 1
  • The file TUITION.SAV contains data on average
    public tuition rates, state appropriations, and
    median family income by state in 1994. In SPSS:
  • Calculate the means and standard deviations for
    these three variables.
  • Calculate the covariances and correlations
    between state appropriations and (a) public
    tuition rates, (b) median family income.

15
(No Transcript)
16
Linear Regression (OLS)
  • Objective: find the best linear (straight-line)
    relationship between two or more variables.
  • Ordinary Least Squares (OLS) is the technique
    most often used to choose the best line.
  • This linear relationship is based on the
    covariance between two variables.
  • Regression analysis requires the analyst to
    specify the direction of causation.

17
Advantages of Linear Regression
  • Can predict/forecast one variable (Y) based on
    values of another variable (X)
  • Can perform hypothesis tests to determine if X
    affects Y
  • Can control for differences in Y due to X
  • Very flexible with regard to functional form,
    model specification, etc.

18
Example: Gender Equity in Salaries
  • Your President asks you to examine faculty
    salaries at your institution and determine if
    there is a gender equity problem. Descriptive
    stats show that on average men earn more than
    women.
  • How can you control for salary differences due to
    justifiable factors such as experience,
    productivity?
  • How can you determine if the remaining pay
    difference is large enough to conclude that this
    is a problem?

19
Ordinary Least Squares
20
Three Formulations
  • Slope: β in population, b in sample
  • Error term (ε or e) encompasses effects of all
    omitted factors
  • Parameters in the population model are
    unobservable
  • Sample line is what you estimate with OLS

21
Assumptions in Linear Regression
  • The error term has a mean of zero and constant
    variance
  • The errors are unrelated to each other
  • The errors are unrelated to the independent
    variable(s) in the model
  • The error term is normally distributed (needed
    for hypothesis testing)

22
Ordinary Least Squares
OLS specifies that the best line is the one
that minimizes the sum of squared errors
(minimize $\sum e_i^2$).
Intercept (a)
23
Notes on OLS
  • The slope formula is the covariance between X and
    Y divided by the variance for X
  • The slope and covariance will always have the
    same sign
  • b > 0 indicates a positive relationship
  • b < 0 indicates a negative relationship
  • b = 0 indicates no linear relationship
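
The verbal description of the slope corresponds to the standard OLS formulas for a model with one X. They are written here with b for the sample slope and a for the sample intercept; the formulas shown on the original slide did not survive in the transcript, so this is a textbook restatement rather than a copy of the slide.

```latex
b = \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}
  = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```

Since var(x) is always positive, b takes the sign of cov(x,y).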

24
Example: An IR analyst is asked to help forecast
applications. She believes there is a
relationship between HS grads and resident
applications each year.
25
  • Regression line: Y = -358.28 + 0.29X
  • Interpretation: For each additional HS grad,
    predicted applications will rise by 0.29.
  • The intercept may not have much meaning.
  • Can predict applications given projections of HS
    grads. If HS grads = 36,000, then
  • Y = -358.28 + 0.29(36,000) ≈ 10,082
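
The forecast arithmetic can be checked with a few lines of Python. The coefficients come from the slide; the function name is a hypothetical label for illustration.

```python
def predict_applications(hs_grads, intercept=-358.28, slope=0.29):
    """Point prediction from the fitted line Y = intercept + slope * X."""
    return intercept + slope * hs_grads


prediction = predict_applications(36_000)
print(round(prediction))  # 10082
```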

26
Goodness of Fit
  • Measures the strength of the relationship between
    X and Y
  • R-squared (or coefficient of determination):
    proportion of total deviation in Y that is
    explained by X(s)
  • R-squared is bounded between 0 and 1 (R² = 1 if
    perfect fit, R² = 0 if no fit)
  • R-squared = square of correlation coefficient
    (with only one X variable in the model)
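
A small Python sketch of these properties, using hypothetical data. It assumes the naming convention in which TSS is the total sum of squares and ESS is the error (residual) sum of squares, so that R² = 1 - ESS/TSS; other texts swap the meaning of ESS and RSS.

```python
import math


def simple_ols(x, y):
    """Return (intercept, slope) for a one-variable OLS fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    a, b = simple_ols(x, y)
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                       # total
    ess = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))    # error
    return 1 - ess / tss


def correlation(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
r = correlation(x, y)
print(round(r2, 3), round(r ** 2, 3))  # the two values agree with one X
```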

27
More on R-squared...
  • When there is no covariance, the slope of the
    regression line is zero and R² = 0.
  • Adding variables to the regression model will
    almost always raise R², but this does not mean
    that the resulting model is better
  • Adjusted R² attempts to correct for this, but no
    longer has the same interpretation
  • R² varies depending on the dependent variable.
    Do not use this to compare regression models with
    different Ys.

28
Predicting Resident Applications
Note that HS grads account for 88.5% of the total
deviation in applications.
29
Class Example 2
  • Using TUITION.SAV, in SPSS:
  • Calculate a regression line showing how median
    income affects average tuition
  • Calculate R², TSS, RSS, ESS, and corr(x,y). SPSS
    syntax:
      REGRESSION
        /MISSING LISTWISE
        /STATISTICS COEFF OUTS R ANOVA
        /CRITERIA=PIN(.05) POUT(.10)
        /NOORIGIN
        /DEPENDENT tuition
        /METHOD=ENTER income.

30
(No Transcript)
31
Equation: Tuition = 313.119 + 0.0719 × Income
32
Hypothesis Testing for β
  • In most situations in the social sciences, it is
    rarely known for sure if X affects Y
  • A hypothesis test can be used to determine if the
    data provide sufficient evidence of a
    relationship
  • For most variables, the sample slope b will not
    exactly equal zero. How far from zero must it be
    in order to safely conclude that β ≠ 0?

33
Steps in Hypothesis Testing
  • Specify null (H0) and alternative (HA)
    hypotheses
  • Identify test statistic and find critical
    value(s) based on degrees of freedom and
    significance level
  • Calculate test statistic and compare to critical
    value(s)

34
Common Hypotheses for β
  • β = 0 (X has no effect on Y)
  • β > 0 (X has a positive effect on Y)
  • β < 0 (X has a negative effect on Y)
  • β ≠ 0 (X has some effect on Y, positive or negative)
  • Choose two hypotheses that are mutually exclusive
    and exhaustive.
  • The null hypothesis (H0) should always contain
    some form of equal sign.

35
Test Statistic for β
If ε ~ N(0, σ²), then b ~ N(β, Var(b)).
Therefore the t-ratio, t = (b - β) / se(b),
will follow a Student t-distribution with n-k
degrees of freedom (k parameters to be
estimated).
The t-ratio is defined as the random variable
minus its mean (when H0 is true), divided by its
standard deviation.
36
Notes on Hypothesis Testing
  • The t-ratio simply counts the standard
    deviations the slope is from zero (distance)
  • The greater the distance, the less likely you
    would have found the value of b if β = 0.
  • For significance tests, since β = 0 under H0, the
    t-ratio is the slope divided by its standard
    deviation (or standard error)

37
Example: t-ratio of 2.40
This shows that there is only a 1.3% chance of
finding a t-ratio of 2.40 or greater if in fact
β = 0. Therefore, if you found a t-value this
high, it is unlikely that β = 0.
38
R² = 0.025
TSS = 1.9E10, ESS = 1.9E10, se = √(ESS/826) = 4766
39
Do undergraduate enrollments have a significant
effect on average costs per student?
Null hypothesis: β = 0. Alternative hypothesis: β ≠ 0.
For 826 df and a 1% significance level, reject the
null when the calculated t-ratio exceeds 2.575 in
absolute value.
40
P-value: probability of drawing a more extreme
sample value given that the null hypothesis is
true.
P-value = Pr(b < -0.175) = Pr(t < -4.577) ≈ 0.000
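
The decision rule and p-value from this example can be reproduced approximately in Python. The slope, t-ratio and critical value are taken from the slides; the normal distribution is used here as a large-sample stand-in for the t-distribution with 826 df, which is an assumption of this sketch rather than what SPSS computes.

```python
import math


def normal_cdf(z):
    """Standard normal CDF (large-sample approximation to the t CDF)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


b = -0.175         # sample slope reported on the slide
t_ratio = -4.577   # reported t-ratio
critical = 2.575   # two-tailed 1% critical value quoted on the slide

std_error = b / t_ratio                # implied by t = b / se(b) when beta = 0
reject_null = abs(t_ratio) > critical  # compare |t| with the critical value
p_one_tail = normal_cdf(t_ratio)       # approximate Pr(t < -4.577)

print(round(std_error, 4), reject_null, p_one_tail < 0.0005)
```

The p-value is far below 0.0005, which is why it displays as 0.000 when rounded to three decimals.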
41
Units of Measurement
  • The significance levels of any variable will not
    be influenced by the units of measure used for X
    or Y
  • The coefficient represents the change in Y (in
    units of Y) due to a one-unit change in X
  • When the units of measure change, both the
    coefficients and standard errors change
    proportionately (t-ratio remains the same)
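
A minimal Python sketch of this invariance, with hypothetical data: income is measured first in dollars and then in thousands of dollars. The helper name and values are assumptions for illustration.

```python
import math


def slope_and_se(x, y):
    """OLS slope and its standard error for a one-variable model."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    b = sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y)) / sxx
    a0 = y_bar - b * x_bar
    sse = sum((c - (a0 + b * a)) ** 2 for a, c in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)  # sqrt of (error variance / Sxx)
    return b, se


# Hypothetical data (illustration only)
income_dollars = [30000, 35000, 40000, 45000, 50000]
tuition = [2500, 2900, 3000, 3600, 3700]

b1, se1 = slope_and_se(income_dollars, tuition)
income_thousands = [v / 1000 for v in income_dollars]
b2, se2 = slope_and_se(income_thousands, tuition)

# Coefficient and standard error both scale by 1000; the t-ratio does not move
print(b2 / b1, se2 / se1, b1 / se1, b2 / se2)
```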

42
Out-of-Sample Forecasts/Predictions
  • The regression model can be used to derive
    predictions of Y given values of X(s)
  • Point estimates are found by substituting X into
    the equation and solving for Y ("I predict that
    the grad rate will be 70%")
  • Interval estimates are predictions that Y will
    fall within a certain interval ("I am 95% certain
    that the grad rate will be between 68% and 72%")
  • Interval estimates are more conservative, and
    convey the uncertainty in predictions.

43
Two Types of Intervals
  • C.I. for expected value (mean) of Y
  • For a given X, what is the predicted average value
    of Y?
  • C.I. for a single value of Y
  • For a given X, what is the predicted single value
    of Y? (more uncertainty, so wider interval)
  • The two methods yield very similar intervals.
    Most IR applications use C.I.s for single value.
  • Intervals can be obtained in SPSS using the
    save subcommand.
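
For a model with one X, the two intervals have the following standard textbook forms. The slides do not show these formulas, so the notation is an assumption: x_0 is the X value used for the prediction, s is the standard error of the regression, and t is the critical value for n - 2 degrees of freedom.

```latex
\text{Mean of } Y:\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
\\[6pt]
\text{Single value of } Y:\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}
```

The extra 1 under the square root is what makes the single-value interval wider, and both intervals widen as x_0 moves away from the sample mean of X.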

44
Predict HS Grads in New Hampshire
  • An IR analyst is charged with developing a model
    to help predict changes in HS grads in the state
    through 2006.
  • File AIR1.SAV in SPSS has two variables: HS grads
    in year t (HSGRAD), and 2nd grade enrollments in
    year t-10 (GRADE2).
  • Find correlation between HSGRAD and GRADE2
  • Estimate a regression model
  • Form point and 95% CI estimates of high school
    grads for the next ten years.

45
Under Statistics > Correlate > Bivariate
Note that r = 0.959, cov(x,y) = 701,160 (n = 12)
46
Under Statistics > Regression > Linear
47
In 2006, the model predicts there will be 14,919
high school grads.
95% certain that in 2006 there will be between
14,185 and 15,652 high school grads.
48
(No Transcript)
49
Multiple Regression Analysis
  • In most IR applications, the dependent variable
    may be influenced by multiple factors
  • Grad rate = f(avg. SAT, gender composition, avg.
    HS rank, students on campus,...)
  • Faculty Salary = f(education, experience,
    productivity, field,...)
  • Education Costs = f(enrollments, research
    intensity, student/faculty ratio,...)

50
Assumptions in Multiple Regression
  • Error term has a mean of zero and constant
    variance
  • Error terms are unrelated to each other
  • Error term is unrelated to independent vars
  • Error term is normally distributed
  • Independent variables are not collinear with each
    other (no multicollinearity)

51
Ordinary Least Squares
52
Least Squares Estimates
  • The coefficients are referred to as partial
    effects because they show the effect of one
    variable on Y holding other vars constant.
  • The OLS formula takes into account any
    relationships between the X variables. For this
    reason, the coefficients usually change when
    variables are dropped/added from model.

53
Other Stats in Multiple Regression
  • Hypothesis tests for significance of coefficients
    can be performed as before, except degrees of
    freedom change (n-k-1).
  • Goodness of fit measures are calculated as
    before. R-squared now represents the deviation
    in Y explained by all Xs together. Thus, R2
    usually rises as Xs are added.
  • Confidence intervals and point estimates can be
    calculated as before.

54
Example: Average Public Tuition Rates
  • An IR analyst is asked to help explain why there
    are variations across states in their tuition
    rates at public institutions. She feels that
    factors such as state aid given to students and
    state appropriations help account for these
    differences.
  • Open the file TUITION.SAV in SPSS.

55
Question 1: How do state appropriations affect
average tuition?
State appropriations account for 13.5% of
differences in tuition.
56
Question 2: How do state appropriations and aid
to students affect average tuition?
These two variables account for 40.4% of
differences in tuition.
A $1 increase in appropriations reduces tuition
by 22.6 cents, holding constant state aid per
student.
57
Extensions of Regression Model
  • So far, we have only considered linear models
    where Xs and Ys were continuous. We will now
    examine how to handle:
  • Categorical Xs
  • Interactions among Xs
  • Non-linear relationships between X and Y

58
Categorical Variables
  • There are many examples of independent variables
    that are not numerical (e.g., gender, race,
    institution attended, attitudes/beliefs)
  • Likert scale variables (which assign numbers to
    categorical responses) should not be used in
    regression models in their present form due to
    problems in interpreting changes in units.
  • Slope = unit change in Y due to a one-unit
    change in X (but Likert numbers are artificial)

59
Dummy Variables
  • However, categorical Xs can be used if they are
    first recoded into dummy variables
  • Dummy variable has only two values (0,1)
  • Need to specify an assignment rule. Can be used
    for categorical, Likert, and continuous
    variables.
  • The variable can now be used in regression
  • It does not matter which group is assigned 1
  • The coefficient represents the difference in
    intercepts for the two groups
  • Must omit one of the dummy variables for a
    construct to avoid multicollinearity
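
A minimal Python sketch of recoding a categorical variable into dummies while omitting one category as the reference group. The function name and the sample values are hypothetical.

```python
def make_dummies(values, omit):
    """Return {category: list of 0/1} for every category except the omitted one."""
    categories = sorted(set(values))
    return {
        c: [1 if v == c else 0 for v in values]
        for c in categories
        if c != omit  # the omitted category becomes the reference group
    }


# Hypothetical faculty ranks (illustration only)
ranks = ["Full", "Associate", "Assistant", "Full", "Assistant"]
dummies = make_dummies(ranks, omit="Assistant")
print(dummies)
# {'Associate': [0, 1, 0, 0, 0], 'Full': [1, 0, 0, 1, 0]}
```

Each remaining dummy's coefficient is then read as a difference from the omitted group.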

60
Examples of Assignment Rules
  • Let X = 1 if the condition holds (0 otherwise).
    Example conditions:
  • Teaches in Psychology Department
  • Enrolled in public university
  • Family income exceeds $100,000
  • Student is very satisfied with the quality of
    instruction
  • Student graduated from campus
  • Student dropped out of college

61
Note: Both equations have the same slope for RANK.
Question: Does living on campus matter?
62
Variable Interactions
  • It is possible that the joint occurrence of two
    Xs has an effect on Y separate from each X's
    effect
  • Academic performance of students with high SAT
    scores and HS ranks
  • State appropriations for higher ed in states with
    low incomes and high tax rates
  • The salary increase from promotions for men and
    women may be different

63
Interactions (cont'd)
  • In these examples, there is something special
    about the joint occurrence of two variables.
  • To test these assertions, an interaction
    variable can be created and added to the
    regression model.
  • Interaction variables are created by defining a
    third variable as the product of the two
    variables in question.

64
The interaction variable is then added to the
regression model and treated as any other
variable
To find the effect of x1 on y, you need to
differentiate the equation with respect to x1
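
The equation on this slide did not survive in the transcript. Under the usual two-variable setup, written here with assumed coefficient names b_1, b_2 and b_3, the model and the derivative look like this:

```latex
y = a + b_1 x_1 + b_2 x_2 + b_3 (x_1 x_2) + e
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_1} = b_1 + b_3 x_2
```

The effect of x_1 therefore depends on the level of x_2, and a test of b_3 = 0 is a test of whether the interaction matters.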
65
Non-linear Functional Forms
  • Regression analysis can also be used in
    situations where X has a non-linear relationship
    with Y
  • Linear: The change in Y due to a one-unit change
    in X is constant.
  • Non-linear: The change in Y due to a one-unit
    change in X can vary with the level of X.

66
Graphs of Non-linear Functions
[Figure: two panels plotting Y against X. Exponential: Y = exp(X). Logarithmic: Y = ln(X).]
67
Graphs of Quadratic Functions
[Figure: two panels plotting Y against X. One quadratic curve has a peak (Maximize Y); the other has a trough (Minimize Y).]
68
Possible IR Examples
  • Exponential: Implies that as X increases, Y
    increases at a faster rate.
  • Y = salary, X = years of experience
  • Logarithmic: Implies that as X increases, Y
    increases at a slower rate.
  • Y = college GPA, X = hrs/week studying
  • Y = retention rate, X = avg. student SAT score

69
Possible IR Quadratic Examples
  • Maximize Y: There is some value of X at which Y
    is maximized.
  • Y = tuition revenue, X = tuition rate
  • Y = student gains, X = class size
  • Minimize Y: There is some value of X at which
    Y is minimized.
  • Y = costs/student, X = enrollments

70
Using Non-linear Functions
  • Regression analysis requires a linear
    relationship between X and Y.
  • When there is a non-linear relationship, you can
    transform one or more variables and then use the
    transformed variables in the regression model.
  • As long as there is a linear relationship between
    the transformed variables, regression analysis is
    appropriate.

71
Exponential Transformations
  • The coefficient estimate for β represents the
    approximate percentage change in Y due to a
    one-unit increase in x.
  • The variable x always has the same directional
    effect on Y (positive or negative)
  • The change in Y due to a change in x increases
    at an increasing rate
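
The slide's equation is missing from the transcript. The bullets are consistent with the usual log-linear (semi-log) specification, shown here as an assumption with a single regressor x:

```latex
Y = \exp(\alpha + \beta x + \varepsilon)
\;\Longleftrightarrow\;
\ln Y = \alpha + \beta x + \varepsilon,
\qquad
\frac{d \ln Y}{dx} = \beta \approx \frac{\Delta Y / Y}{\Delta x}
```

In practice this means creating ln(Y) as a new variable and regressing it on x; 100β is then read as the approximate percentage change in Y per one-unit increase in x.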

72
Natural Log Function
The natural log function is the inverse of the
exponential function: ln(exp(X)) = X
73
Logarithmic
This can also be used for a subset of Xs.
74
Double-Log Function: Elasticities
75
Quadratic Transformations
If X is believed to have a quadratic effect on Y,
then create a new variable as the square of X and
add this to the regression model.
The change in Y due to a one-unit change in X1
would be found by differentiating the equation
with respect to this variable.
Hill-shaped if β3 < 0, U-shaped if β3 > 0, linear
if β3 = 0.
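
The model equation itself is not in the transcript. One specification consistent with the symbols used on this slide (X1 entering in levels and squared, with β3 on the squared term) is sketched below; the presence of a second regressor X2 is an assumption made only to account for the coefficient numbering.

```latex
Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \varepsilon
\quad\Longrightarrow\quad
\frac{\partial Y}{\partial X_1} = \beta_1 + 2\beta_3 X_1
```

The marginal effect of X1 changes linearly with X1 itself, which is what produces the hill or U shape.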
76
More on Quadratic Functions
  • The value of X that maximizes or minimizes Y can
    have important implications. This is found by
    solving for X in the first-derivative.
  • Higher-order functions (e.g., cubic) can also be
    used in regression. They can yield better
    representations of relationships, but are harder
    to explain and interpret.
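
Setting the first derivative of a quadratic model to zero gives the turning point directly. The Python sketch below assumes the derivative has the form b1 + 2·b3·X; the function names and coefficient values are hypothetical.

```python
def turning_point(b1, b3):
    """Value of X at which dY/dX = b1 + 2*b3*X equals zero."""
    if b3 == 0:
        raise ValueError("b3 = 0: the relationship is linear, no turning point")
    return -b1 / (2 * b3)


def shape(b3):
    """Hill-shaped curves have a maximum; U-shaped curves have a minimum."""
    return "hill-shaped (maximum)" if b3 < 0 else "U-shaped (minimum)"


# Hypothetical coefficients: cost per student falls, then rises with enrollment
b1, b3 = -4.0, 0.5
x_star = turning_point(b1, b3)
print(x_star, shape(b3))  # 4.0 U-shaped (minimum)
```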

77
SPSS Exercise: Faculty Salaries
  • An IR analyst is asked to investigate if female
    faculty are paid less than comparable males. She
    draws a sample of 432 faculty and creates these
    variables:
  • Salary: monthly base salary (in dollars)
  • Rank: 1 if Full, 2 if Associate, 3 if Assistant
  • Gender: 1 if male, 0 otherwise
  • Prevexp: days of experience before current job
  • Npleave: days of non-professional leave
  • Potenexp: days since highest degree
  • Nine12: 1 if nine-month appointment, 0 otherwise
  • Cite85: citations in 1985 to all publications

78
Tasks
  • Open the SPSS system file FACSAL.SAV
  • Estimate a regression model showing how gender
    affects salary. How do these results compare to
    a two-sample t-test?
  • How do your findings change when potential
    experience and citations are added?
  • An economist argues that salaries rise
    exponentially with potential experience,
    citations, and gender. How can this be addressed?

79
Answer to first task...
Note: the mean difference is $916, which has a
t-value of 6.227 and is significant.
80
(No Transcript)
81
Answer to second task...
82
Answer to third task...
83
More Tasks...
  • The VP for Finance argues that individuals with
    high experience levels often get smaller
    percentage salary increases than others. How
    could this be addressed (use same function as in
    previous example)?
  • A female faculty member claims that women face
    discrimination in part because they are rewarded
    less for each citation they receive. How could
    you test this?

84
Answer to fourth task...
85
Answer to fifth task...
86
Model Selection
  • For most IR problems, there are many alternative
    models from which to choose. How should the
    best model be selected?
  • Begin with published studies that look at the
    same (similar) Ys. What variables and
    functional forms do they use?
  • Is there a theory that can be used to guide
    the choice of model? For example:
  • human capital theory → salary models
  • median voter theory → state funding for HE
  • Tinto's model → student retention

87
More model selection comments
  • Better to include too many factors than to omit
    important variables (omitted variable bias)
  • Can estimate several competing model
    specifications and compare results. Be careful
    not to simply select model with the most
    appealing results!
  • Keep in mind trade-off between simplicity and
    accuracy. A simple model is worth its weight in
    gold when explaining to decisionmakers!

88
Faculty Salary Example
  • Return to FACSAL.SAV and create a dummy variable
    for full professors
  • Estimate a model explaining salary as a function
    of gender, then gender and full professor.
  • Estimate a model explaining salary as a function
    of gender, full professor, and potential
    experience.

89
(No Transcript)
90
Problems in Regression Analysis
  • There are three main problems which may arise in
    multiple regression
  • Autocorrelation
  • Heteroscedasticity
  • Multicollinearity
  • We will briefly discuss what each means, how they
    can be detected, and what can be done about them
    when they occur.

91
Autocorrelation
  • This can occur in time-series data when the error
    in one period is related to the error in the
    next.
  • Violates the assumption E(e_i e_j) = 0 for i ≠ j
  • Causes the computer to calculate incorrect
    standard errors, thereby affecting t-ratios.
    Usually, st. errs are too small, so t-ratios are
    too high (making X appear significant when it
    isn't).
  • Possible IR examples: predicting applications, HS
    grads, state funding for HE.

92
First-order autocorrelation
[Figure: error term e_t plotted against time, moving above and below 0; points t4 and t9 are marked on the time axis.]
93
Durbin-Watson test
Calculates a d-statistic that reflects the
correlation among subsequent error terms
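
The d-statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal Python sketch with hypothetical residual series follows; values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals (illustration only)
persistent = [1.0, 1.2, 0.9, 0.5, -0.3, -0.8, -1.1, -0.7]  # errors carry over
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]            # errors flip sign

print(round(durbin_watson(persistent), 2))   # well below 2
print(round(durbin_watson(alternating), 2))  # well above 2
```

Formal conclusions still require comparing d with the tabulated lower and upper bounds for the sample size and number of regressors.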
94
Correcting Autocorrelation
  • If autocorrelation is detected, it can be
    corrected through transforming the data to yield
    correct standard errors (generalized least
    squares).
  • Cochrane-Orcutt and Prais-Winsten are two
    commonly used methods
  • Standard autocorrelation option in SPSS does
    not do this. Use SPSS Trends or another program.
  • Keep in mind that autocorrelation affects the
    standard errors and not coefficients.

95
Heteroscedasticity
  • May occur in cross-section data when the variance
    of the error term is related to one or more
    independent variables (σ_i² not constant).
  • Affects standard errors, and hence t-ratios (but
    not coefficient estimates)
  • Potential IR examples
  • Effects of enrollments on average costs
  • Effect of tax revenues on state appropriations
  • Effect of program size on expenditures

96
Graph of Heteroscedasticity
[Figure: scatter of the dependent variable against the independent variable, with the regression line drawn through it.]
As X increases, the possible errors become larger.
97
Testing for Heteroscedasticity
  • Visual: Plot residuals against the variable
    thought to be causing the problem.
  • Park-Glejser test: Estimate model and save
    residuals. Regress the log of squared residuals
    against the log of variable thought to cause the
    problem.
  • Other tests: White (1980), Goldfeld-Quandt.
  • SPSS will not do these by default (must do by
    hand or with other software).
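
Since the test has to be done by hand, here is a minimal Python sketch of the residual-based procedure described above for a one-variable model. The function names and data are hypothetical, and only the auxiliary slope is computed; a full test would also examine its t-ratio.

```python
import math


def simple_ols(x, y):
    """Return (intercept, slope) for a one-variable OLS fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
    return y_bar - b * x_bar, b


def park_slope(x, y):
    """Slope from regressing ln(residual^2) on ln(x); x must be positive."""
    a, b = simple_ols(x, y)                            # step 1: estimate model
    resid = [c - (a + b * v) for v, c in zip(x, y)]    # step 2: save residuals
    log_e2 = [math.log(e ** 2) for e in resid]
    log_x = [math.log(v) for v in x]
    return simple_ols(log_x, log_e2)[1]                # step 3: auxiliary regression


# Hypothetical data whose scatter widens as x grows
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.7, 7.4, 8.4, 11.9, 11.9, 16.3, 15.4]
print(park_slope(x, y) > 0)  # a positive slope points to variance rising with x
```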

98
Correcting for Heteroscedasticity
  • Weighted least squares Weight observations by
    the variable causing heteroscedasticity. However,
    you must know the form.
  • For example, if σ_i² = σ²X_1i, then dividing each
    observation by the square root of X1 will yield
    correct standard errors.
  • An option that does not require knowing the form
    of heteroscedasticity is White's
    heteroscedasticity-consistent standard errors.
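
To see why scaling by the square root of X1 works in the example where the error variance is proportional to X1, consider a one-variable model (an assumed simplification) and transform every term:

```latex
\sigma_i^2 = \sigma^2 X_{1i}
\;\Longrightarrow\;
\frac{Y_i}{\sqrt{X_{1i}}}
= \alpha\,\frac{1}{\sqrt{X_{1i}}}
+ \beta_1 \frac{X_{1i}}{\sqrt{X_{1i}}}
+ \frac{\varepsilon_i}{\sqrt{X_{1i}}},
\qquad
\operatorname{Var}\!\left(\frac{\varepsilon_i}{\sqrt{X_{1i}}}\right)
= \frac{\sigma^2 X_{1i}}{X_{1i}} = \sigma^2
```

The transformed error has constant variance, so OLS on the transformed variables satisfies the usual assumptions.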

99
Multicollinearity
  • Multicollinearity arises when there is an
    extremely high correlation between two or more
    independent variables in the model.
  • The coefficient estimates become imprecise and
    unstable (though not biased): the stats program
    cannot tell how to divide the effect among the
    correlated variables
  • Standard errors increase, making t-ratios small

100
Multicollinearity (cont'd)
  • Potential IR examples include (1) effect of
    current and previous experience on faculty
    salaries, (2) effect of SAT score and high school
    rank on academic performance, (3) effect of
    family income and wealth on student demand for
    higher education.
  • A significant correlation between Xs does not
    necessarily lead to multicollinearity. Only when
    the correlation is very high does this occur.

101
Testing/Correcting Multicollinearity
  • There is no universally-accepted test for
    multicollinearity.
  • Variance inflation factors (VIF) estimate how
    much the standard errors increase due to
    correlation with other Xs. No single cutoff
    point for VIFs.
  • Signs of multicollinearity include
  • Two similar variables have widely different
    effects on Y (e.g., only one is signif.)
  • The standard errors are large
  • To test, drop one of the variables from the model
    and compare results. If the coef and st. err.
    change considerably, this may be a problem.

102
Correcting Multicollinearity
  • There is also no uniformly-accepted solution to
    this problem. However, you can drop one of the
    problem variables from the model.
  • Multicollinearity may not be an important issue
    if the collinearity occurs between unimportant
    variables.

103
Making Multicollinearity
  • Return to the faculty salary data and create a
    new variable:
  • newpot = potenexp / 365 (years of experience) and
    add this to the regression model
  • Then, make slight changes to the first two data
    points: change 27.02 to 13 and change 19.01
    to 27.
  • Estimate regression model again, using gender,
    potenexp, newpot

104
Using gender and potenexp
105
Using gender, potenexp and newpot
Variable POTENEXP drops out of the equation
because it is perfectly correlated with NEWPOT.
106
Using gender, potenexp, newpot (after changes)
Gender is significant throughout all three models
Standard errors are about forty-three times
larger than before!
107
Limited Dependent Variables
  • Thus far, we have considered instances where Y
    was continuous and unbounded. However, there are
    many situations where this is violated
  • Individual student data are often dichotomous
    (0,1) variables: 1 if graduate, 1 if return, 1 if
    apply/enroll.
  • Some data are discrete counts: number of journal
    articles or citations, number of times a student
    changes his/her major

108
Problems with OLS when Y is (0,1)
  • Predictions can be > 1 or < 0
  • Coefficients may be biased
  • Heteroscedasticity is present (σ² = P(1-P))
  • Error term is not normally distributed (only two
    possible values), so hypothesis tests are invalid
  • Of these problems, the last is the most severe.

109
Maximum Likelihood Estimation
  • In this instance, there are advantages to using a
    technique (MLE) in place of OLS.
  • MLE Find the coefficients that maximize the
    likelihood of generating the observations on Y in
    the sample.
  • Recall that OLS chooses the coefficients based on
    those that minimize the sum of squared errors.

110
Logit and probit analysis
  • When Y is (0,1), the two most commonly used
    functional forms in MLE are the cumulative
    logistic distribution (logit analysis or
    logistic regression) and the cumulative normal
    distribution (probit analysis).
  • The two choices usually yield similar results
  • Each avoids the four problems noted with OLS

111
Logistic regression
For logistic regression, the following functional
form is used
ln[P/(1-P)] = a + b1X1 + b2X2
where P = probability that Y = 1.
All you have to do, however, is create the dummy
variable for Y and tell SPSS to use logistic
regression to estimate the model. SPSS will
create the log odds for you.
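
The logit equation can be inverted to turn a fitted log odds into a predicted probability. A minimal Python sketch follows; the function name, coefficients and X values are hypothetical.

```python
import math


def predicted_probability(a, b1, x1, b2, x2):
    """Invert ln[P/(1-P)] = a + b1*X1 + b2*X2 to recover P."""
    log_odds = a + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients and X values (illustration only)
p = predicted_probability(a=-2.0, b1=0.5, x1=3.0, b2=1.0, x2=1.0)
print(round(p, 4))
```

Whatever the value of the log odds, the resulting P always lies strictly between 0 and 1, which is one reason this form is preferred to OLS for a (0,1) outcome.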
112
Interpreting results
  • The coefficients from logistic regression are
    hard to interpret and explain.
  • Focus on the signs of the coefficients
  • If the sign is positive and significant, then as
    X increases, the probability that Y = 1 will also
    increase
  • If the sign is negative and significant, then as
    X increases, the probability that Y = 1 will
    decrease
  • If the coefficient is not significant, then there
    is no evidence that X affects the probability
    that Y = 1.

113
Example: Faculty Rank
  • Return to the faculty dataset, and estimate a
    logistic regression model to explain whether a
    faculty member is a Full professor (under
    Regression / Binary Logistic)
  • Xs include gender, potenexp, prevexp, and cite85
  • Need to create a dummy variable for Full
    Professor first
  • SPSS Probit module is different than used here.

114
Wald Chi-square statistic = (coefficient /
standard error)²
Note that these Chi-square values are the square
of the standard t-ratios
Coefficients: Effect of X on log odds
Standard errors
Odds Ratio
P-value
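
Two of the quantities in the output table can be derived from the others. The Python sketch below shows the relationships; the coefficient and standard error are hypothetical values, not taken from the faculty dataset.

```python
import math


def wald_chi_square(coef, std_error):
    """Square of the coefficient-to-standard-error ratio."""
    return (coef / std_error) ** 2


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in X."""
    return math.exp(coef)


# Hypothetical logit output (illustration only)
coef, se = 0.8, 0.25
wald = wald_chi_square(coef, se)
ratio = odds_ratio(coef)
print(round(wald, 2), round(ratio, 3))
```

A positive coefficient gives an odds ratio above 1 and a negative coefficient gives one below 1, so the sign rules for interpreting coefficients carry over directly.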
115
Results from rank analysis
  • Since the coefficient for GENDER is positive and
    significant, it means that men are more likely
    than women to hold the rank of Full professor
    after controlling for experience and citations.
  • The positive and significant coef for CITE85
    means that a faculty member is more likely to be
    a Full Prof as citations rise

116
Final Exam: SATDATA.SAV
File contains data on 1,999 NH high school
seniors in 1996 who have taken the SAT
  • ASSOC: 1 if highest planned degree is AA
  • MA: Master's
  • PHD: Doctorate
  • MALE: 1 if male
  • FIRSTGEN: 1 if 1st generation
  • INCOME: family income
  • INCOME2: income squared
  • PUBHS: 1 if attend public high school
  • SATCOMB: combined SAT score
  • SATCOMB2: SAT squared
  • ANYAP: 1 if taken any AP course
  • GRADEAVG: high school GPA
  • UNH: 1 if sent SAT score to UNH
  • KSC: 1 if sent SAT score to KSC
  • PSC: 1 if sent SAT score to PSC

117
Questions
  • How do family income, student ability, and
    student intentions affect whether a student
    submits SAT scores to UNH, KSC, or PSC?
  • Do SAT takers from poor families and/or first
    generation families do worse on the SAT than
    other students?