Lecture 6: Multiple Regression

1
Lecture 6: Multiple Regression
  • Laura McAvinue
  • School of Psychology
  • Trinity College Dublin

2
Previous Lectures
  • Relationship between two variables
  • Correlation
  • Measure of strength of association between two
    variables
  • Simple linear regression
  • Measure of the ability of one variable (X) to
    predict the other variable (Y)
  • Computes a regression equation that describes the
    relationship between the response variable (Y)
    and the predictor variable (X) by expressing Y as
    a function of X
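
As a refresher on the two ideas recapped above, here is a minimal sketch that computes the correlation r and the simple regression line Y = a + bX by hand. The function name and the scores are made up for illustration; they are not from the lecture dataset.

```python
from math import sqrt

def simple_regression(x, y):
    """Return (intercept a, slope b, correlation r) for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Hypothetical scores, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r = simple_regression(x, y)
print(f"Y = {a:.2f} + {b:.2f}X, r = {r:.3f}")
```

The correlation describes strength of association only, while the slope and intercept let you predict Y from a given X.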

3
Multiple Regression
  • Used when there is more than one predictor
    variable
  • Two purposes
  • To predict Y, given a combination of predictor
    variables
  • To assess the relative importance of each
    predictor variable in explaining the response
    variable Y

4
Regression Equations
  • Simple Linear Regression
  • Y = a + bX
  • Multiple Regression
  • Y = a + b1X1 + b2X2 + ... + bkXk
  • b1 = Regression coefficient for first predictor
    variable, X1
  • b2 = Regression coefficient for second predictor
    variable, X2
  • a = Intercept, value of Y when all predictor
    variables are 0

5
Statistical Models
  • Running a regression analysis is not a simple
    matter of inputting data, clicking a button and
    obtaining a fixed model of the data
  • You create the model of your data
  • Subjective process in many respects
  • You shape the model you create
  • Your job is to create the model that best
    describes the data

6
Multiple Regression
  • Assessing the relative contribution of each
    predictor variable to the response variable
  • Which variable contributes most?
  • Which is the second biggest predictor?
  • Which variables don't seem to contribute to
    prediction?
  • Problem
  • The order in which you enter the variables into
    the analysis influences the model
  • Variable entered first is attributed more
    variance
  • By the time the last variable is entered, there
    might be very little variance left to explain

7
Multiple Correlation
  • The predictor variables are correlated with each
    other and with the response variable
  • Which predictor variable gets credit for this
    shared variance?
  • [Diagram: variance in Y related to X1, variance
    in Y related to X2, and variance in Y related to
    the shared variance between X1 & X2]
  • Which variable gets credit, X1 or X2?

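
The shared-variance problem can be made concrete with a small sketch. It assumes just two predictors and uses the standard formula for R² from the three pairwise correlations; the correlation values are hypothetical, chosen only for illustration.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 for Y regressed on X1 and X2, from the three pairwise correlations."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical correlations, chosen only for illustration.
r_y1, r_y2, r_12 = 0.6, 0.5, 0.4
total = r_squared_two_predictors(r_y1, r_y2, r_12)

# Variance credited to each predictor when it is entered first vs. second.
x1_first, x2_second = r_y1**2, total - r_y1**2
x2_first, x1_second = r_y2**2, total - r_y2**2

print(f"Total R^2: {total:.3f}")
print(f"X1 entered first: X1 = {x1_first:.3f}, X2 adds {x2_second:.3f}")
print(f"X2 entered first: X2 = {x2_first:.3f}, X1 adds {x1_second:.3f}")
```

The total R² is the same either way, but whichever predictor is entered first is credited with the shared variance, so its apparent contribution is larger.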
8
Different Methods of Multiple Regression
  • Hierarchical Regression
  • Entry / Standard Regression
  • Sequential Methods
  • Forward Addition
  • Backward Selection
  • Stepwise
  • Combinatorial Approach

9
Hierarchical Regression
  • You decide the order in which the variables are
    entered
  • Based on theory / prior research
  • Allows you to assess whether each predictor adds
    anything to the model, given the predictors that
    are already in the model

10
Entry / Standard Regression
  • Computer package enters all predictor variables
    into the model simultaneously
  • Creates a regression equation including all
    predictor variables
  • Allows us to assess the unique contribution of
    each predictor variable when all other variables
    are held constant
  • Advantages & Disadvantages
  • Easy to see which variables significantly predict
    the response variable
  • May not create the best model for predicting Y as
    it will include variables that don't
    significantly predict Y

11
Sequential Models
  • Aim to create the best model
  • The combination of variables that best predicts
    the response variable
  • Build several models in a series of steps, adding
    or deleting variables at each step, depending on
    their contribution to predicting the response
    variable
  • Final model includes only variables which
    significantly and uniquely predict the response
    variable

12
Sequential Methods
  • Forward Addition
  • Begins with only one variable in the model
  • The variable that makes the biggest contribution
    to the response variable (highest r)
  • Adds the variable with the next highest
    contribution
  • Continues to add variables until there are no
    more variables that make a significant
    contribution to the response variable over and
    above the variables that are already in the
    equation
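
Forward addition can be sketched in a few lines. This is a simplified illustration, not what a statistics package actually runs: it stops when the gain in R² falls below a fixed threshold (`min_gain`, an assumed stand-in for the significance test a package would apply), and the data and all names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(columns, y):
    """R^2 of the least-squares fit of y on the given predictor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def forward_selection(predictors, y, min_gain=0.02):
    """Repeatedly add the predictor that raises R^2 most; stop when the gain is small."""
    chosen, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {name: r_squared([predictors[c] for c in chosen + [name]], y)
                  for name in remaining}
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        chosen.append(name)
        remaining.remove(name)
        best_r2 = scores[name]
    return chosen, best_r2

# Hypothetical data: y depends on x1 and x2; x3 is unrelated noise.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 0, 1, 0, 1, 0, 1, 0],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [5.2, 3.9, 9.1, 7.8, 13.1, 11.9, 16.8, 16.2]
chosen, final_r2 = forward_selection(predictors, y)
print(chosen, round(final_r2, 3))
```

With one predictor in the model, the first step picks the variable with the highest squared correlation with Y, which matches the "highest r" rule above.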

13
Sequential Methods
  • Backward Selection
  • Begins with all predictor variables in the model
    and successively deletes variables until only
    significant ones remain
  • Stepwise Regression
  • Similar to previous two but more versatile
  • Generally moves forward, adding significant
    variables, but can move backward to eliminate a
    variable if it no longer significantly predicts
    when another variable is added

14
Sequential Methods
  • Drawbacks
  • Inclusion in the model depends on a mathematical
    criterion rather than psychological theory or
    research
  • Variable selection could depend upon tiny
    differences in correlation between each predictor
    variable and the response variable
  • Slight numerical differences could therefore lead
    to major differences in theoretical
    interpretation
  • Difficult to replicate results

15
Combinatorial Methods
  • Best Subsets Method
  • Computes models with all possible combinations of
    the predictor variables and chooses the model
    that explains most variance in the response
    variable

16
Critical Considerations for MR
  • Sample size
  • Distribution requirements: Residuals
  • Residuals must be normally distributed
  • Outliers
  • Multi-collinearity

17
Sample Size
  • Ratio of cases to predictors should be
    substantial
  • Stevens (1996) advised about 15 participants per
    predictor variable
  • Size matters: the more people in your sample, the
    better the chance of the results being replicated
  • However, an even bigger ratio is needed when
  • Response variable has skewed data distribution
  • Poor reliability in measures - substantial
    measurement error reduces size of true
    relationships of variables
  • Stepwise methods (45-50 participants per
    predictor)
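
The rules of thumb above reduce to simple arithmetic. A minimal sketch, with a hypothetical helper name, using the 15-per-predictor guideline and the lower bound of the 45-50 stepwise guideline:

```python
def minimum_sample_size(n_predictors, cases_per_predictor=15):
    """Rule-of-thumb minimum N for a given number of predictors."""
    return n_predictors * cases_per_predictor

# Two predictors, as in a model with optimism and social support.
print(minimum_sample_size(2))        # standard entry: about 15 per predictor
print(minimum_sample_size(2, 45))    # stepwise: lower end of the 45-50 guideline
```

These are guidelines rather than guarantees; skewed data or unreliable measures call for more cases still.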

18
Residuals
  • Recall the Method of Least Squares
  • Fits the regression line by minimising the
    prediction error of the line
  • Minimises the sum of squares of the residuals,
    Σ(Y − Ŷ)²
  • Fits a line of the form

Y = a + b1X1 + b2X2 + ... + bkXk + e
Assumes Y = Fit + noise

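
The Method of Least Squares can be sketched without any statistics package by solving the normal equations directly. This is an illustration under assumptions: the data are hypothetical, the function names are invented, and real software uses more numerically robust routines.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_least_squares(rows, y):
    """Return [a, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0, *row] for row in rows]  # leading 1.0 carries the intercept a
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def residuals(coefs, rows, y):
    """Observed Y minus fitted Y for every case."""
    return [yi - (coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)))
            for row, yi in zip(rows, y)]

def sum_squared_residuals(coefs, rows, y):
    return sum(e ** 2 for e in residuals(coefs, rows, y))

# Hypothetical data: two predictors (X1, X2) per case and a response Y.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [4.1, 5.4, 9.0, 10.6, 13.9, 15.5]

coefs = fit_least_squares(rows, y)
best = sum_squared_residuals(coefs, rows, y)
# Nudging any coefficient away from the fitted value makes the error larger.
worse = sum_squared_residuals([coefs[0], coefs[1] + 0.1, coefs[2]], rows, y)
print([round(c, 3) for c in coefs], round(best, 4), round(worse, 4))
```

The residuals returned here are the "noise" term e, which is what the distribution checks are later applied to.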
19
Residuals
  • Method of Least Squares models the noise (e) in
    the data using the normal distribution
  • Assumes the noise is normally distributed with
    mean of 0 and variance σ²
  • If this assumption is violated, the results of
    your regression analysis may not be valid
  • You need to check this by plotting the residuals
  • Standardised Residual Plots
  • Histogram
  • Normal Probability Plot

20
Histogram
21
Normal Probability Plot
  • Plots the residual value that was obtained for
    each data point (observed) against the value you
    would expect if the residuals were normally
    distributed (expected)
  • Should be a straight diagonal line

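
The coordinates behind a normal probability plot can be computed as below. This sketch assumes the (i + 0.5) / n plotting position, which is one common convention and not necessarily the one SPSS uses; the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(residuals):
    """Pair each sorted standardised residual with its expected normal quantile."""
    m, s = mean(residuals), stdev(residuals)
    observed = sorted((r - m) / s for r in residuals)
    n = len(observed)
    # Expected z-score for the i-th smallest value if the data were normal.
    expected = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))

# Hypothetical residuals, made up for illustration only.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 0.2, 0.5]
points = normal_probability_points(residuals)
for expected, observed in points:
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

If the residuals are roughly normal, the observed values track the expected values closely, which is what the straight diagonal line shows.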
22
Outliers
  • Data points that lie far from the rest of the
    data and have large residuals
  • Big influence on regression analysis
  • You can check for outliers
  • Scatterplots examining relationship between
    response variable and predictor variables
    separately
  • Casewise diagnostics in SPSS
  • Plots of the standardised residuals

23
Plot of Standardised Residuals
  • Plot of Standardized Predicted Values by
    Studentised Deleted Residuals
  • Studentised deleted residuals are residual scores
    divided by their standard deviation, which is
    calculated leaving out any suspiciously outlying
    data points
  • Based on the assumption of normality, about 99.7%
    of residuals should lie within -3 and +3 standard
    deviations
  • Any point outside this range is an outlier

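
A simplified version of the ±3 check is sketched below. It standardises residuals as plain z-scores, which is an assumption made to keep the example short; it is not the studentised deleted residual that SPSS reports. The residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def flag_outliers(residuals, cutoff=3.0):
    """Return the indices of residuals more than `cutoff` SDs from the mean."""
    m, s = mean(residuals), stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs((r - m) / s) > cutoff]

# Hypothetical residuals: nineteen ordinary cases and one extreme case.
residuals = [0.5, -0.3, 0.1, -0.7, 0.2, 0.4, -0.1, -0.5, 0.3, 0.0,
             -0.2, 0.6, -0.4, 0.1, -0.6, 0.2, -0.1, 0.3, -0.3, 8.0]
flagged = flag_outliers(residuals)

# Share of a normal distribution lying beyond 3 SDs on either side.
share_outside = 2 * (1 - NormalDist().cdf(3.0))
print(flagged, f"{share_outside:.4f}")
```

Note that in very small samples a single case cannot reach a z-score of 3 at all, so this check is only informative with a reasonable number of cases.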
24
Multi-Collinearity
  • Occurs when predictor variables are highly
    correlated with one another
  • High bivariate correlations (.7 / .8 or above)
  • High multivariate correlation
  • Not a desired feature of the dataset
  • Some predictor variables are redundant
  • Statistically, leads to unstable results

25
Multi-Collinearity
  • To assess whether multi-collinearity is present
  • Examine the bivariate correlations between
    predictor variables
  • Tolerance Statistic
  • 1 − R², where R is the multiple correlation
    between each predictor variable and all others
  • If low, then multiple correlation must be high
    and multi-collinearity is a problem
  • Solution
  • Leave out one of the predictor variables
  • Combine two highly correlated predictor variables

26
Let's take an example
  • Interested in a theory which suggests that a
    person's level of optimism (X1) and the social
    support (X2) that he/she has in his/her life
    predict how long he/she will survive (Y) after
    being diagnosed with cancer.
  • Three steps to Regression Analysis
  • A. Examine the relationship between the predictor
    and response variables separately
  • B. Perform and interpret the multiple regression
  • C. Assess the appropriateness of the regression
    analysis

27
Let's take an example
  • Open the following dataset
  • Software / Kevin Thomas / Multiple Regression
    Dataset
  • Run Correlations between
  • Survival & Optimism
  • Survival & Social Support

28
Create Scatterplots & fit regression line
  • Graphs / Scatter / Simple Scatter / Y = Survival,
    X = Predictor Variable
  • Fit regression line: Double click on chart, then
    Elements / Fit line at total

29
Step 2: The Multiple Regression
  • Analyse, Regression, Linear
  • Dependent variable: Survival
  • Independent variables: Social, Optimism
  • Method: Enter (gives a standard multiple
    regression)
  • Statistics
  • Regression Coefficients
  • Estimates ✓
  • Model fit ✓
  • Descriptives ✓

30
Answer the questions on your worksheet
  • 1. Does this model (i.e. combination of social
    support and optimism) significantly predict the
    response variable (survival in months)?

Yes, F(2, 199) = 67.73, p < .001

31
Answer the questions on your worksheet
  • 2. What percentage of variance in the response
    variable, survival in months, is explained by
    this model?

R Square adjusted
  • Estimate of the population proportion of
    variation in survival due to optimism & support
  • Penalises for number of variables in the model

40.1%

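
The penalty for the number of variables can be seen in the usual adjusted R² formula, sketched below. The R² value and sample size are hypothetical, chosen for illustration; they are not taken from the survival dataset.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R^2: shrinks R^2 according to sample size and number of predictors."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# Hypothetical values: the same R^2 of .50 from 100 cases,
# once with 2 predictors and once with 10.
few = adjusted_r_squared(0.50, n_cases=100, n_predictors=2)
many = adjusted_r_squared(0.50, n_cases=100, n_predictors=10)
print(round(few, 3), round(many, 3))
```

For the same raw R², the model with more predictors receives the lower adjusted value.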
32
Answer the questions on your worksheet
  • 3. Write the regression equation

Survival in months = 3.67(optimism) +
12.99(social support) + 4.34

33
Answer the questions on your worksheet
  • 4. What does this equation tell us about the
    relationship between months of survival and
    social support?

As social support increases by one unit, predicted
survival increases by almost 13 months, with
optimism held constant

34
Answer the questions on your worksheet
  • 5. Do both variables significantly predict
    survival in months?

Yes: for optimism, t = 10, p < .001; for social
support, t = 4.026, p < .001

35
Answer the questions on your worksheet
  • 6. Which of the predictor variables contributes
    most to the response variable?

Beta = Standardized Regression Coefficient (B
rescaled into standard deviation units)
Can be used to compare strength of contribution
of predictor variables
Optimism has a Beta value of .558 and so
contributes more than social support, which has a
Beta value of .225

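
Standardised coefficients are conventionally obtained by rescaling each unstandardised B by the ratio of the predictor's standard deviation to the response's standard deviation. A minimal sketch, using hypothetical numbers because the lecture does not report the standard deviations:

```python
def standardised_beta(b, sd_predictor, sd_response):
    """Convert an unstandardised coefficient B into a standardised Beta."""
    return b * sd_predictor / sd_response

# Hypothetical values: B = 2.0, predictor SD = 3.0, response SD = 12.0.
beta = standardised_beta(2.0, sd_predictor=3.0, sd_response=12.0)
print(beta)
```

Because Beta is expressed in standard deviation units, predictors measured on very different scales can be compared directly.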
36
Answer the questions on your worksheet
  • 7. Use the regression equation to make the
    following prediction If a person has an optimism
    score of 10 and a social support score of 2, how
    long would you expect them to survive?

Survival in months = 3.67(optimism) + 12.99(social support) + 4.34
Survival in months = 3.67(10) + 12.99(2) + 4.34
Survival in months = 36.7 + 25.98 + 4.34
Survival in months = 67.02
67 months!

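
The same prediction can be checked in code. The coefficients are the ones given in the regression equation above; the function name is hypothetical.

```python
def predict_survival(optimism, social_support):
    """Predicted survival in months from the fitted regression equation."""
    return 3.67 * optimism + 12.99 * social_support + 4.34

months = predict_survival(optimism=10, social_support=2)
print(round(months, 2))
```

Changing either score and re-running shows how each predictor moves the prediction while the other is held fixed.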
37
Answer the questions on your worksheet
  • 8. What is the standard error of this prediction?

62.43 months
38
Step 3: Assess the appropriateness of the Analysis
  • Distribution of Residuals
  • Outliers
  • Multi-collinearity
  • Re-run regression but this time
  • Statistics
  • Collinearity Diagnostics
  • Residuals, casewise diagnostics
  • Outliers outside 3 standard deviations
  • Plots
  • Histogram
  • Normal Probability Plot
  • Plot of Standardized Predicted Values (Y = ZPRED)
    by Studentized Deleted Residuals (X = SDRESID)

39
Distribution of Residuals
40
Outliers
All residuals lie within -3 and 3 standard
deviations.
Note that you expect about 0.3% of cases to lie
outside this area, so in a large sample, if you
have one or two, that could be ok.

41
Outliers
All residuals lie within -3 and 3 standard
deviations
42
Multi-Collinearity
The bivariate correlation between the predictors
is low (r = .182), even though it is significant
at the .01 level.
Tolerance is high, meaning that the multiple
correlation is small, meaning that
multi-collinearity is not a feature of this dataset.

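
With only two predictors, the multiple correlation between one predictor and "all the others" is just their bivariate correlation, so tolerance can be worked out directly from the r reported above. A minimal sketch, assuming the conventional definition of tolerance as 1 minus the squared multiple correlation; the function name is hypothetical.

```python
def tolerance_two_predictors(r_between_predictors):
    """Tolerance when there are exactly two predictors: 1 - r^2."""
    return 1 - r_between_predictors ** 2

# r = .182 is the correlation between the two predictors reported above.
tolerance = tolerance_two_predictors(0.182)
print(round(tolerance, 3))
```

A value close to 1 means almost none of one predictor's variance is shared with the other, which is why multi-collinearity is not a concern here.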
43
Summary
  • Multiple Regression
  • To predict Y given a combination of predictor
    variables
  • To assess the relative importance of each
    predictor variable in explaining the response
    variable
  • Statistical modelling
  • Different Methods
  • Three steps
  • Examine the relationship between the predictor
    and response variables separately
  • Perform and interpret the multiple regression
  • Assess the appropriateness of the regression
    analysis
  • There are a number of critical considerations