Title: Lecture 6: Multiple Regression
1. Lecture 6: Multiple Regression
- Laura McAvinue
- School of Psychology
- Trinity College Dublin
2. Previous Lectures
- Relationship between two variables
- Correlation
  - Measure of the strength of association between two variables
- Simple linear regression
  - Measure of the ability of one variable (X) to predict the other variable (Y)
  - Computes a regression equation that describes the relationship between the response variable (Y) and the predictor variable (X) by expressing Y as a function of X
3. Multiple Regression
- Used when there is more than one predictor variable
- Two purposes
  - To predict Y, given a combination of predictor variables
  - To assess the relative importance of each predictor variable in explaining the response variable Y
4. Regression Equations
Simple Linear Regression
$\hat{Y} = a + bX$
Multiple Regression
$\hat{Y} = a + b_1X_1 + b_2X_2 + \dots + b_kX_k$
- $b_1$ = regression coefficient for the first predictor variable, $X_1$
- $b_2$ = regression coefficient for the second predictor variable, $X_2$
- $a$ = intercept, the predicted value of Y when all predictor variables are 0
5. Statistical Models
- Running a regression analysis is not a simple matter of inputting data, clicking a button and obtaining a fixed model of the data
- You create the model of your data
  - Subjective process in many respects
  - You shape the model you create
- Your job is to create the model that best describes the data
6. Multiple Regression
- Assessing the relative contribution of each predictor variable to the response variable
  - Which variable contributes most?
  - Which is the second biggest predictor?
  - Which variables don't seem to contribute to prediction?
- Problem
  - The order in which you enter the variables into the analysis influences the model
  - The variable entered first is attributed more variance
  - By the time the last variable is entered, there might be very little variance left to explain
7. Multiple Correlation
[Figure: overlapping variance of Y, X1 and X2; image not recovered. Labelled regions:]
- Variance in Y related to X1
- Variance in Y related to X2
- Variance in Y related to the shared variance between X1 and X2
The predictor variables are correlated with each other and with the response variable. Which predictor variable gets credit for this shared variance, X1 or X2?
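The credit problem can be made concrete with a little arithmetic. The sketch below is only an illustration: the three correlations are hypothetical, and it relies on the standard formula for R squared with exactly two predictors. It shows how the variance credited to each predictor changes with the order of entry.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R squared for Y regressed on X1 and X2, from the three pairwise correlations."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical correlations, not taken from any dataset in this lecture
r_y1, r_y2, r_12 = 0.6, 0.5, 0.4

total = r2_two_predictors(r_y1, r_y2, r_12)

# X1 entered first: it is credited with everything it shares with Y
x1_first = r_y1**2
x2_second = total - x1_first

# X2 entered first: the shared variance is credited to X2 instead
x2_first = r_y2**2
x1_second = total - x2_first

print(f"Total R^2: {total:.3f}")
print(f"X1 first: X1 = {x1_first:.3f}, X2 adds {x2_second:.3f}")
print(f"X2 first: X2 = {x2_first:.3f}, X1 adds {x1_second:.3f}")
```

With these made-up values the total stays the same, but X2 is credited with about .08 when entered second and .25 when entered first.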
8. Different Methods of Multiple Regression
- Hierarchical Regression
- Entry / Standard Regression
- Sequential Methods
  - Forward Addition
  - Backward Selection
  - Stepwise
- Combinatorial Approach
9. Hierarchical Regression
- You decide the order in which the variables are entered
  - Based on theory / prior research
- Allows you to assess whether each predictor adds anything to the model, given the predictors that are already in the model
10. Entry / Standard Regression
- Computer package enters all predictor variables into the model simultaneously
- Creates a regression equation including all predictor variables
- Allows us to assess the unique contribution of each predictor variable when all other variables are held constant
- Advantage
  - Easy to see which variables significantly predict the response variable
- Disadvantage
  - May not create the best model for predicting Y, as it will include variables that don't significantly predict Y
11. Sequential Models
- Aim to create the best model
  - The combination of variables that best predicts the response variable
- Build several models in a series of steps, adding or deleting variables at each step, depending on their contribution to predicting the response variable
- Final model includes only variables which significantly and uniquely predict the response variable
12. Sequential Methods
- Forward Addition
  - Begins with only one variable in the model
    - The variable that makes the biggest contribution to predicting the response variable (highest r)
  - Adds the variable with the next highest contribution
  - Continues to add variables until there are no more variables that make a significant contribution to the response variable over and above the variables that are already in the equation
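The forward addition steps can be sketched in a few lines of Python. This is a rough illustration rather than what a statistics package does: the data are made up, the function names are hypothetical, and a fixed minimum gain in R squared (`min_gain`) stands in for the significance test that a real package would use to decide whether a variable enters.

```python
def r_squared(columns, y):
    """R squared from a least-squares fit of y on the given predictor columns."""
    n, p = len(y), len(columns) + 1
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                factor = A[k][c] / A[c][c]
                A[k] = [v - factor * w for v, w in zip(A[k], A[c])]
    b = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def forward_addition(predictors, y, min_gain=0.02):
    """Add, one at a time, the predictor that raises R squared the most."""
    chosen, best_r2 = [], 0.0
    remaining = dict(predictors)
    while remaining:
        gains = {name: r_squared([predictors[c] for c in chosen] + [col], y) - best_r2
                 for name, col in remaining.items()}
        name = max(gains, key=gains.get)
        if gains[name] < min_gain:  # stand-in for "no significant contribution"
            break
        chosen.append(name)
        best_r2 += gains[name]
        del remaining[name]
    return chosen, best_r2


# Made-up data: y depends on x1 and x2 only; x3 is unrelated to y
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [1, 1, -1, -1, -1, -1, 1, 1],
}
y = [5, 1, 9, 5, 13, 9, 17, 13]

selected, r2 = forward_addition(predictors, y)
print(selected, round(r2, 3))
```

Here x1 enters first because it has the highest correlation with y, x2 enters next, and x3 is left out because it adds nothing once the other two are in the model.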
13. Sequential Methods
- Backward Selection
  - Begins with all predictor variables in the model and successively deletes variables until only significant ones remain
- Stepwise Regression
  - Similar to the previous two but more versatile
  - Generally moves forward, adding significant variables, but can move backward to eliminate a variable if it no longer significantly predicts when another variable is added
14. Sequential Methods
- Drawbacks
  - Inclusion in the model depends on a mathematical criterion rather than psychological theory or research
  - Variable selection could depend upon tiny differences in correlation between each predictor variable and the response variable
  - Slight numerical differences could therefore lead to major differences in theoretical interpretation
  - Difficult to replicate results
15. Combinatorial Methods
- Best Subsets Method
  - Computes models with all possible combinations of the predictor variables and chooses the model that explains the most variance in the response variable
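To give a sense of what "all possible combinations" means in practice, the following sketch simply lists the candidate models for three predictors. The predictor names are placeholders and no models are fitted; the point is only that k predictors give 2^k - 1 candidate subsets, so the number of models grows quickly.

```python
from itertools import combinations


def all_subsets(predictors):
    """Return every non-empty combination of the predictor names."""
    subsets = []
    for size in range(1, len(predictors) + 1):
        subsets.extend(combinations(predictors, size))
    return subsets


# Placeholder predictor names for illustration
candidates = all_subsets(["X1", "X2", "X3"])
for subset in candidates:
    print(subset)
print(len(candidates), "candidate models")  # 2**3 - 1 = 7
```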
16. Critical Considerations for MR
- Sample size
- Distribution requirements: residuals
  - Residuals must be normally distributed
- Outliers
- Multi-collinearity
17. Sample Size
- Ratio of cases to predictors should be substantial
  - Stevens (1996) advised about 15 participants per predictor variable
  - Size matters: the more people in your sample, the better the chance of the results being replicated
- However, an even bigger ratio is needed when
  - The response variable has a skewed distribution
  - Measures have poor reliability: substantial measurement error reduces the size of the true relationships between variables
  - Stepwise methods are used (45-50 participants per predictor)
18. Residuals
- Recall the Method of Least Squares
  - Fits the regression line by minimising the prediction error of the line
  - Minimises the sum of squares of the residuals, $\sum (Y - \hat{Y})^2$
  - Fits a line of the form $Y = a + b_1X_1 + b_2X_2 + \dots + b_kX_k + e$
  - Assumes Y = Fit + noise
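A minimal sketch of the least squares idea, assuming made-up data and hypothetical function names: it solves the normal equations for a, b1 and b2 in plain Python, then shows that moving a coefficient away from the fitted value makes the sum of squared residuals larger.

```python
def fit_least_squares(X, y):
    """Return [a, b1, ..., bk] minimising the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in X]  # the leading 1 gives the intercept a
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                factor = A[k][c] / A[c][c]
                A[k] = [v - factor * w for v, w in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def sum_squared_residuals(coefs, X, y):
    """Sum over cases of (observed Y - fitted Y) squared."""
    total = 0.0
    for row, yi in zip(X, y):
        fitted = coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
        total += (yi - fitted) ** 2
    return total


# Made-up data: five cases, two predictors
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [3.1, 6.8, 7.2, 11.1, 11.8]

coefs = fit_least_squares(X, y)
best = sum_squared_residuals(coefs, X, y)
nudged = [coefs[0], coefs[1] + 0.1, coefs[2]]  # move b1 away from the fit
worse = sum_squared_residuals(nudged, X, y)

print("a, b1, b2 =", [round(c, 3) for c in coefs])
print("Sum of squared residuals at the fit:", round(best, 4))
print("After nudging b1 by 0.1:", round(worse, 4))
```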
19. Residuals
- Method of Least Squares models the noise (e) in the data using the normal distribution
  - Assumes the noise is normally distributed with a mean of 0 and variance $\sigma^2$
- If this assumption is violated, the results of your regression analysis may not be valid
- You need to check this by plotting the residuals
  - Standardised Residual Plots
    - Histogram
    - Normal Probability Plot
20. Histogram
[Figure: histogram of the standardised residuals; image not recovered]
21. Normal Probability Plot
- Plots the residual value that was obtained for each data point (observed) against the value you would expect if the residuals were normally distributed (expected)
- The points should fall along a straight diagonal line
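The observed and expected values behind such a plot can be computed by hand. The sketch below is an illustration under stated assumptions: the residuals are hypothetical, and the plotting position (i - 0.5) / n is one common convention that may differ from the one your statistics package uses.

```python
from statistics import NormalDist, mean, stdev


def normal_probability_points(residuals):
    """Pair each standardised residual (observed) with its expected normal value."""
    m, s = mean(residuals), stdev(residuals)
    observed = sorted((e - m) / s for e in residuals)
    n = len(observed)
    # Plotting position (i - 0.5) / n is an assumed convention
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))


# Hypothetical residuals for illustration
residuals = [-2.1, -0.8, -0.3, 0.1, 0.4, 0.9, 1.8]
points = normal_probability_points(residuals)
for expected_z, observed_z in points:
    print(f"expected {expected_z:+.2f}   observed {observed_z:+.2f}")
```

If the residuals are roughly normal, the two columns rise together and the plotted points sit close to the diagonal.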
22. Outliers
- Data points that lie far from the rest of the data and have large residuals
- Big influence on regression analysis
- You can check for outliers
  - Scatterplots examining the relationship between the response variable and each predictor variable separately
  - Casewise diagnostics in SPSS
  - Plots of the standardised residuals
23. Plot of Standardised Residuals
- Plot of Standardized Predicted Values by Studentised Deleted Residuals
  - Studentised deleted residuals: residual scores divided by their standard deviation, which is calculated leaving out any suspiciously outlying data points
- Based on the assumption of normality, about 99.7% of residuals should lie within -3 and +3 standard deviations
- Any point outside this range is an outlier
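The 3 standard deviation rule is easy to apply once the residuals are standardised. The sketch below is a simplification: it uses plain standardised residuals from a hypothetical list, not the studentised deleted residuals described on the slide, which also need information from the fitted model.

```python
from statistics import mean, stdev


def flag_outliers(residuals, cutoff=3.0):
    """Return the positions of residuals more than `cutoff` SDs from the mean."""
    m, s = mean(residuals), stdev(residuals)
    return [i for i, e in enumerate(residuals) if abs((e - m) / s) > cutoff]


# Hypothetical residuals: twenty small values and one very large one
residuals = [0.5, -0.5] * 10 + [10.0]
print(flag_outliers(residuals))  # [20]
```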
24. Multi-Collinearity
- Occurs when predictor variables are highly correlated with one another
  - High bivariate correlations (.7 / .8 or above)
  - High multivariate correlation
- Not a desired feature of the dataset
  - Some predictor variables are redundant
  - Statistically, leads to unstable results
25. Multi-Collinearity
- To assess whether multi-collinearity is present
  - Examine the bivariate correlations between predictor variables
  - Tolerance Statistic
    - Tolerance = $1 - R_j^2$, where $R_j^2$ is the squared multiple correlation between predictor j and all the other predictors
    - If tolerance is low, then the multiple correlation must be high and multi-collinearity is a problem
- Solution
  - Leave out one of the predictor variables
  - Combine two highly correlated predictor variables
26. Let's take an example
- Interested in a theory which suggests that a person's level of optimism (X1) and the social support (X2) that he/she has in his/her life predict how long he/she will survive (Y) after being diagnosed with cancer.
- Three steps to Regression Analysis
  - A. Examine the relationship between the predictor and response variables separately
  - B. Perform and interpret the multiple regression
  - C. Assess the appropriateness of the regression analysis
27. Let's take an example
- Open the following dataset
  - Software / Kevin Thomas / Multiple Regression Dataset
- Run Correlations between
  - Survival and Optimism
  - Survival and Social Support
28. Create Scatterplots and fit regression line
- Graphs / Scatter / Simple Scatter / Y = Survival, X = Predictor Variable
- Fit regression line: double click on the chart, then Elements / Fit line at total
29. Step 2: The Multiple Regression
- Analyse, Regression, Linear
  - Dependent variable: Survival
  - Independent variables: Social, optimism
  - Method: Enter (gives a standard multiple regression)
- Statistics
  - Regression Coefficients
    - Estimates ✓
  - Model fit ✓
  - Descriptives ✓
30. Answer the questions on your worksheet
- 1. Does this model (i.e. the combination of social support and optimism) significantly predict the response variable (survival in months)?
Yes, F(2, 199) = 67.73, p < .001
31. Answer the questions on your worksheet
- 2. What percentage of variance in the response variable, survival in months, is explained by this model?
Adjusted R Square: 40.1%
- Estimate of the population proportion of variation in survival due to optimism and social support
- Penalises for the number of variables in the model
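The penalty for the number of predictors can be seen in the usual adjusted R squared formula. In this sketch the sample size of 202 is inferred from the degrees of freedom reported earlier, F(2, 199), and the unadjusted R squared of 0.50 is purely hypothetical, since the lecture reports only the adjusted value.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Shrink R squared to allow for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# n = 202 follows from F(2, 199): 199 + 2 + 1; R squared = 0.50 is hypothetical
two_predictors = adjusted_r_squared(0.50, n_cases=202, n_predictors=2)
ten_predictors = adjusted_r_squared(0.50, n_cases=202, n_predictors=10)

print(f"2 predictors:  {two_predictors:.3f}")
print(f"10 predictors: {ten_predictors:.3f}")
```

For the same unadjusted R squared, the model with more predictors receives the lower adjusted value.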
32. Answer the questions on your worksheet
- 3. Write the regression equation
Survival in months = 3.67(optimism) + 12.99(social support) + 4.34
33. Answer the questions on your worksheet
- 4. What does this equation tell us about the relationship between months of survival and social support?
As social support increases by one unit, survival increases by almost 13 months, with optimism held constant
34. Answer the questions on your worksheet
- 5. Do both variables significantly predict survival in months?
Yes. For optimism, t = 10, p < .001; for social support, t = 4.026, p < .001
35. Answer the questions on your worksheet
- 6. Which of the predictor variables contributes most to the response variable?
Beta = standardized regression coefficient (B expressed in standard deviation units: B × SD of the predictor / SD of the response)
- Can be used to compare the strength of contribution of the predictor variables
Optimism has a Beta value of .558 and so contributes more than social support, which has a Beta value of .225
36. Answer the questions on your worksheet
- 7. Use the regression equation to make the following prediction: if a person has an optimism score of 10 and a social support score of 2, how long would you expect them to survive?
Survival in months = 3.67(optimism) + 12.99(social support) + 4.34
Survival in months = 3.67(10) + 12.99(2) + 4.34
Survival in months = 36.7 + 25.98 + 4.34
Survival in months = 67.02
67 months!
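The same prediction can be checked with a few lines of Python. The coefficients are the ones from the worksheet answer; the function name is just a label chosen for this sketch.

```python
def predict_survival(optimism, social_support):
    """Predicted survival in months from the fitted regression equation."""
    return 3.67 * optimism + 12.99 * social_support + 4.34


months = predict_survival(optimism=10, social_support=2)
print(round(months, 2))  # 67.02
```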
37. Answer the questions on your worksheet
- 8. What is the standard error of this prediction?
62.43 months
38. Step 3: Assess the appropriateness of the Analysis
- Distribution of Residuals
- Outliers
- Multi-collinearity
- Re-run the regression, but this time request
  - Statistics
    - Collinearity Diagnostics
    - Residuals, casewise diagnostics
      - Outliers outside 3 standard deviations
  - Plots
    - Histogram
    - Normal Probability Plot
    - Plot of Standardized Predicted Values (Y = ZPRED) by Studentized Deleted Residuals (X = SDRESID)
39. Distribution of Residuals
40. Outliers
- All residuals lie within -3 and +3 standard deviations
- Note that you expect roughly 0.3% of cases to lie outside this range, so in a large sample one or two such cases could be OK
41. Outliers
- All residuals lie within -3 and +3 standard deviations
42. Multi-Collinearity
- The bivariate correlation is low (r = .182), even though it is significant (p = .01)
- Tolerance is high, meaning that the multiple correlation is small, meaning that multi-collinearity is not a feature of this dataset
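With only two predictors, the multiple correlation of one predictor with "all the others" is simply their bivariate correlation, so tolerance can be worked out directly from r. The sketch below assumes that the r = .182 reported above is the correlation between optimism and social support; the variance inflation factor (VIF) is added for reference as the reciprocal of tolerance.

```python
def tolerance_two_predictors(r_12):
    """Tolerance when there are only two predictors: 1 minus their squared correlation."""
    return 1 - r_12**2


r_optimism_support = 0.182  # assumed to be the optimism / social support correlation
tolerance = tolerance_two_predictors(r_optimism_support)
vif = 1 / tolerance  # variance inflation factor, the reciprocal of tolerance

print(f"Tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
```

A tolerance this close to 1 agrees with the conclusion that multi-collinearity is not a concern here.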
43. Summary
- Multiple Regression
  - To predict Y given a combination of predictor variables
  - To assess the relative importance of each predictor variable in explaining the response variable
- Statistical modelling
- Different Methods
- Three steps
  - Examine the relationship between the predictor and response variables separately
  - Perform and interpret the multiple regression
  - Assess the appropriateness of the regression analysis
- There are a number of critical considerations