Transcript and Presenter's Notes

Title: Multiple linear regression


1
Multiple linear regression
  • Gordon Prescott

2
Recap
  • Correlation - indicates the strength of linear
    relationship between two variables
  • Simple linear regression will describe the linear
    relationship between two variables
  • For linear regression to be valid, the
    assumptions of linearity, normality of
    residuals and constant variance must all hold

3
Linear regression equation
  • y = a + b × x
  • y = intercept + (slope × x)
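
As a quick recap illustration, a minimal sketch (with made-up numbers) of fitting y = a + b × x by least squares in Python:

```python
import numpy as np

# Minimal sketch with made-up data: fit y = a + b*x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)  # polyfit returns the highest power first
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```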

4
No intercept example
y = b × x
5
Good fit: interpreting R2
[Figure: two scatter plots with identical regression lines; R2 is high
on the left and low on the right]
6
Statistical inference in regression
  • The regression coefficients calculated from a
    sample of observations are estimates of the
    population regression coefficients
  • Hypothesis tests and confidence intervals can be
    constructed using the sample estimates to make
    inferences about the population regression
    coefficients
  • For the valid use of these inferential
    approaches, it is necessary to check the
    underlying distribution of the data (linearity,
    normality, constant variance)

7
Multiple linear regression
  • Situations frequently occur when we are
    interested in the dependency of a variable on
    several explanatory (independent) variables.
  • The joint influence of the variables, taking into
    account possible correlations among them, may be
    investigated using multiple regression
  • Multiple regression can be extended to any number
    of variables, although it is recommended that the
    number be kept reasonably small

8
Partitioning of variation in dependent variable
[Diagram: variation in systolic blood pressure partitioned into
components explained by gender and by age]
9
Situations where multiple linear regression is
appropriate
  • To explore the dependency of one outcome variable
    on two or more explanatory variables
    simultaneously
  • e.g. development of a prognostic index
  • To study the relationship between two variables
    after removing the possible effects of other
    nuisance variables
  • To adjust for differences in confounding factors
    between groups

10
Research questions
  • Data on cystic fibrosis patients was collected.
    The researchers were interested in looking at
    what factors are related to patients' malnutrition
    (as measured by PEmax). Data was available on
    age, sex, height, weight, BMI, lung capacity,
    FEV1, and other lung function variables.
  • Researchers would like to predict a person's
    percentage body fat using measurements of bicep
    circumference, abdomen circumference, height,
    weight and age of the subject.

11
Research questions
  • To investigate the effect of parental birth
    weight on infant birth weight. Strong
    relationship found. Other explanatory variables
    such as maternal height, number of previous
    children, maternal smoking, weight gain during
    pregnancy (all of which are known to be
    associated with infant birth weight) were
    collected.
  • Multiple regression analysis was conducted to
    assess whether the observed association between
    parental birth weight and infant birth weight
    could be explained by inter-relationships between
    parental birth weight and the additional
    variables. It might be that mothers with low
    birth weights were more likely to smoke.

12
Research question
  • Two groups (non-randomised) of patients are
    receiving two different drug treatments for
    hypertension.
  • The effectiveness of the drugs is to be assessed
    by measuring each patient's blood pressure six
    months following treatment.
  • A comparison of the characteristics of the
    patients in the two groups indicates that
    patients on drug A are older than those on Drug
    B.
  • There is a known relationship between age and
    blood pressure.
  • Multiple linear regression is used to adjust for
    (remove) the effect of age on blood pressure
    before carrying out a comparison of the two
    treatments.

13
Model
  • y = a + b1x1 + b2x2 + b3x3 + ... + bkxk
  • y - dependent variable
  • ŷ - predicted value of the dependent variable
  • a - intercept (constant)
  • b1 - regression coefficient for x1
  • x1 - explanatory (independent) variable
  • b2 - regression coefficient for x2
  • x2 - explanatory (independent) variable
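
A minimal sketch of fitting this model by least squares (simulated data; the variable names are hypothetical stand-ins):

```python
import numpy as np

# Sketch: fit y = a + b1*x1 + b2*x2 on simulated data.
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(20, 60, n)             # e.g. age (years)
x2 = rng.uniform(70, 120, n)            # e.g. abdomen circumference (cm)
y = 5 + 0.1 * x1 + 0.4 * x2 + rng.normal(0, 2, n)  # simulated outcome

X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
y_hat = X @ coef                            # predicted values ŷ
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```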

14
Multiple correlation
  • R - coefficient of multiple correlation
  • It is the correlation between Y and the combined
    predictors (x1, x2, ..., xk)
  • R2 - coefficient of multiple determination
  • It is the proportion of variance in Y that can be
    accounted for by the combined predictors (x1, x2,
    ..., xk)
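
A short sketch of how R2 could be computed from any fitted model, assuming the observed values y and fitted values y_hat (as in the previous sketch) are available:

```python
import numpy as np

# Sketch: R2 (coefficient of multiple determination) from observed y
# and fitted values y_hat.
def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# R, the multiple correlation, is the correlation between y and y_hat:
# np.corrcoef(y, y_hat)[0, 1]
```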

15
Collinearity
  • Occurs when the explanatory variables are
    correlated to one another
  • Extreme multicollinearity occurs when one
    explanatory variable is a linear function of some
    of the other explanatory variables

16
Collinearity Cystic Fibrosis example
  • Data on cystic fibrosis patients. What factors
    are related to patients' malnutrition (measured by
    PEmax)?
  • A regression model included height and weight
    (r = 0.92) as explanatory variables and PEmax
    (index of malnutrition) as the dependent variable
  • Both height and weight were highly correlated
    with PEmax
  • The model with these two variables accounted for
    40% of the variation in PEmax
  • In the model, neither the coefficient for weight
    nor the coefficient for height was significant
  • Including both these highly correlated variables
    obscured their relationship with PEmax
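
The slides do not prescribe a diagnostic, but a simple first screen for collinearity is the correlation matrix of the explanatory variables; a minimal sketch:

```python
import numpy as np

# Sketch: screen for collinearity by inspecting pairwise correlations
# between explanatory variables (the columns of a design matrix X).
def pairwise_correlations(X):
    return np.corrcoef(X, rowvar=False)  # p x p correlation matrix

# Off-diagonal values near +/-1 (like r = 0.92 for height and weight
# above) flag pairs of predictors that may obscure each other's effects.
```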

17
Criteria for inclusion in the model
  • The variable should account for a significant
    proportion of the variation in the dependent
    variable
  • This can be assessed by either of the following
    two comparable tests
  • F test from the ANOVA table
  • t-test of the regression coefficient (B)

18
Criteria for variable to be entered
  • F-test
  • H0: The independent (explanatory) variable does
    not account for any of the variability in body
    fat in the population
  • F ratio ≈ 1
  • H1: Abdomen circumference does account for some of
    the variability in body fat in the population
  • F ratio > 1

19
Criteria for variable to be entered
  • t-test
  • H0: The regression coefficient for the
    explanatory variable is equal to zero
  • (b1 = 0)
  • H1: The regression coefficient for the
    explanatory variable is not equal to zero
  • (b1 ≠ 0)
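
This t-test can be sketched in a few lines for a simple (one-predictor) regression; SE(b1) below comes from standard least-squares theory, and numpy/scipy are assumed available:

```python
import numpy as np
from scipy import stats

# Sketch: t-test of H0: b1 = 0 for a simple linear regression.
def coefficient_t_test(x, y):
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # intercept + predictor
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - 2)                  # residual variance
    se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    t = coef[1] / se_b1                           # t = b1 / SE(b1)
    p = 2 * stats.t.sf(abs(t), df=n - 2)          # two-sided p-value
    return coef[1], t, p
```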

20
Selection of explanatory variables
  • Methods by which explanatory variables are
    selected for inclusion in the regression model
  • Enter
  • Forward selection
  • Backward selection
  • Stepwise selection
  • The use of different selection methods on the
    same dataset may result in different regression
    models

21
Enter
  • The explanatory (independent) variables are
    forced into the model
  • Examination of the output from the regression
    model will indicate whether each of the
    explanatory variables is explaining a
    significant proportion of the variation in the
    dependent variable
  • We can test whether the coefficient for each
    explanatory variable differs significantly from 0

22
Automatic selection procedures
  • Be cautious in the use of these procedures
  • These procedures should be used in combination
    with the data analyst's knowledge and common sense
  • Models selected using these automatic methods
    alone are based on mathematical relationships and
    may not make biological/clinical sense

23
Forward selection
  • Simple linear regressions carried out for each
    explanatory variable
  • The one variable which accounts for the most
    variability is selected.
  • Linear regressions with all pairs of explanatory
    variables (including the variable selected first)
    are carried out
  • The regression which accounts for the most
    variability in the dependent variable is selected
  • and so on ...
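
A minimal sketch of this idea, choosing at each step the variable that most reduces the residual sum of squares (real procedures stop when the added variable is no longer significant by an F- or t-test; that stopping rule is omitted here):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

def forward_select(X, y, n_steps):
    """Greedy forward selection: add the variable that lowers RSS most."""
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    for _ in range(n_steps):
        def fit_with(j):
            cols = [np.ones(n)] + [X[:, k] for k in chosen + [j]]
            return rss(np.column_stack(cols), y)
        best = min(remaining, key=fit_with)   # variable giving lowest RSS
        chosen.append(best)
        remaining.remove(best)
    return chosen
```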

24
Backward selection
  • Multiple regression is performed using all the
    explanatory variables
  • Each explanatory variable is dropped in turn, and
    the one that contributes least to the model is
    removed
  • All combinations of this explanatory variable and
    one other are then dropped from the model
  • The next variable which contributes least to the
    model is removed
  • and so on ...

25
Stepwise selection
  • This approach combines both forward and backward
    selection procedures
  • A variable may be added to the model, but at each
    step all variables in the model are considered
    for exclusion
  • Can be forward or backward stepwise selection
  • SPSS adopts a forward stepwise procedure

26
Stepwise selection
  • At each stage in a stepwise selection procedure,
    explanatory variables already entered in the
    model are assessed to see whether they still
    account for a significant proportion of the
    variation in the dependent variable
  • At each stage in a stepwise procedure
  • all explanatory variables not in the model are
    assessed for inclusion
  • all explanatory variables in the model are
    assessed for removal

27
Example
  • Recall the fitness gym example
  • Dependent variable - percentage body fat
  • Explanatory variables
  • age
  • weight
  • height
  • hip, biceps, neck, knee, forearm, abdomen
    circumference measurements

28
Example
  • The aim is to produce an equation which would
    allow us to predict percentage body fat based on
    alternative measurements
  • Selection procedure
  • Stepwise
  • At the SPSS dialogue box, enter all the
    explanatory (independent) variables you wish to
    be considered for inclusion and then select
    stepwise as the method

29
SPSS output: multiple regression
30
SPSS output: multiple linear regression
  • Each model produced is reported
  • The R square for each model indicates the
    proportion of variability in the dependent
    variable accounted for by that model
  • Note the standard error of the estimate is
    reduced with each additional variable entered

31
SPSS output: multiple regression
32
SPSS output: multiple regression
33
Variables not in the model
34
SPSS output: multiple regression
35
Prediction
  • The predicted percentage body fat for a man with
    an abdomen circumference of 100 cm, height of 168
    cm and a thigh circumference of 57 cm
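
The coefficients for this prediction appear in the SPSS output (not reproduced in this transcript), so the sketch below uses made-up values purely to show the arithmetic:

```python
# Hypothetical coefficients -- the real ones are in the SPSS output above.
a, b_abdomen, b_height, b_thigh = -40.0, 0.9, -0.2, 0.3

abdomen, height, thigh = 100, 168, 57  # cm, the values from the slide
body_fat = a + b_abdomen * abdomen + b_height * height + b_thigh * thigh
print(f"Predicted percentage body fat: {body_fat:.1f}")
```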

36
Checking assumptions
  • After a model has been fitted to the data, it is
    essential to check that the assumptions of
    multiple linear regression have not been violated

37
Checking assumptions
  • There should be a linear relationship between the
    dependent variable and ALL continuous/discrete
    explanatory variables.
  • For any value of x, the observed values of y
    should be normally distributed about the fitted
    line (normally distributed residuals)
  • The variability of the residuals is the same for
    all values of x (constant variance)

38
Assumptions Linearity (1a)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Abdomen circumference
  • r = 0.8

39
Assumptions Linearity (1b)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Height
  • r = 0.6

40
Assumptions Linearity (1c)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Thigh circumference
  • r = 0.56

41
Assumptions Linearity (2)
  • Plot the residuals against the predicted values
  • No curvature in the plot should be seen for the
    linearity assumption to hold
  • Assumption satisfied

42
Assumptions Normal residuals (1)
  • Normally distributed residuals can be tested by
    looking at a histogram of the residuals
  • Assumption satisfied

43
Assumptions Normal residuals (2)
  • Normally distributed residuals can be tested by
    looking at a normal probability plot
  • Assumption satisfied

44
Assumption Constant variance
  • Constant variance of the residuals can be
    assessed by plotting the residuals against the
    predicted values
  • There should be an even spread of residuals
    around zero
  • Assumption satisfied
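
The three checks above can be drawn in one figure; a sketch assuming observed values y and fitted values y_hat are available (matplotlib and scipy assumed):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Sketch: the three residual checks described in the slides above.
def residual_checks(y, y_hat):
    resid = y - y_hat
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    axes[0].scatter(y_hat, resid)        # linearity / constant variance:
    axes[0].axhline(0, color="grey")     # no curvature, even spread about 0
    axes[0].set(xlabel="Predicted", ylabel="Residual")

    axes[1].hist(resid, bins=15)         # normality: roughly bell-shaped
    axes[1].set(xlabel="Residual", ylabel="Frequency")

    stats.probplot(resid, plot=axes[2])  # normal probability (Q-Q) plot
    plt.tight_layout()
    plt.show()
```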

45
General issues multiple regression
  • Types of explanatory variables
  • Exploratory and confirmatory analysis
  • Number of explanatory variables
  • Number of observations
  • Interaction terms

46
Explanatory variables in multiple linear
regression
  • Explanatory variables can be continuous or
    categorical
  • If a dichotomous variable (coded 0, 1 or 1, 2) is
    included in the regression equation the
    regression coefficient for this variable
    indicates the average difference in the dependent
    variable between the two groups defined by the
    dichotomous variable
  • This is adjusted for any differences between the
    groups with respect to the other variables in the
    model
  • Dummy variables are required for nominal variables

47
One binary explanatory variable and one
continuous explanatory variable
  • 2 independent variables: one binary, one
    continuous

y = a + b1 × gender + b2 × age
If gender is coded 1 = male, 2 = female, then the
intercepts differ for males and females: the constant
for males is a + b1 × 1 = a + b1, and for females it
is a + b1 × 2 = a + b1 + b1. The slope of the outcome
with age is the same for males and females.
48
Dummy variables (1)
  • Adopted when you have more than two categories
    and the variable is not ordinal
  • e.g. marital status
  • married/living with partner
  • single
  • divorced/widowed
  • As there are three categories, two dummy
    variables need to be defined

49
Dummy variables (2)
  •                                 d1    d2
    Married/living with partner      0     0
    Single                           1     0
    Divorced/widowed                 0     1
  • Reference category Married/Living with partner
  • Both dummy variables must be entered into the
    regression model to assess whether marital status
    can explain a significant proportion of the
    variation in the dependent variable
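
A small sketch of building these two dummy columns in Python (the category labels are made up for illustration):

```python
import numpy as np

# Sketch: dummy coding for a 3-category variable, with married/living
# with partner as the reference category, matching the table above.
marital = np.array(["married", "single", "divorced", "single", "married"])

d1 = (marital == "single").astype(int)    # 1 if single, else 0
d2 = (marital == "divorced").astype(int)  # 1 if divorced/widowed, else 0
# Both columns d1 and d2 must enter the design matrix together.
```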

50
Exploratory vs confirmatory
  • Multiple regression is relatively
    straightforward when we know which variables we
    wish to have in the model
  • Difficulties can occur when we wish to identify
    from a large number of variables those which are
    related to the dependent variable and assess how
    well the model obtained fits the data
  • Exploratory and confirmatory analyses on the same
    data can be a problem

51
Some further comments
  • Number of potential explanatory variables
  • beware of initial screening of variables
  • Multiple testing
  • Number of observations and number of explanatory
    variables
  • (Rule of thumb - 10 observations per explanatory
    variable)
  • Use common sense when automatic procedures for
    model selection are used
  • Automatic selection procedures are advantageous
    when explanatory variables are highly correlated

52
And more ...
  • There is a risk that the model may be
    over-optimistic, so the predictive capability of a
    model should be assessed using an independent
    data set
  • One option is to split the data into two samples
  • One sample (half your data) is used to develop
    the linear model, then the model is tested on the
    other sample (remainder of data)
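
A minimal sketch of such a split (the 50/50 division follows the slide; the seed and helper name are arbitrary):

```python
import numpy as np

# Sketch: split the data in half -- fit on one half, test on the other.
def split_half(n, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return idx[: n // 2], idx[n // 2 :]   # (training rows, test rows)

# Fit the model on X[train], y[train]; then compare predictions on
# X[test] against y[test] to assess predictive capability.
```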

53
Interaction in linear regression
  • Interaction terms
  • The relationship between an explanatory variable
    and the dependent variable may differ for
    different groupings of a variable
  • e.g. the relationship between age and blood
    pressure may be different for males and females
  • An additional explanatory variable would be added
    to the model to test whether there is a
    statistically significant difference in the
    relationship between males and females

54
Interaction
y = a + b1 × gender + b2 × age + b3 × gender × age
b2 is the slope with age for males (coded 1); b2 +
b3 is the slope with age for females (coded 2); b3
is the additional slope with age for females
relative to the slope with age for males
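
A sketch of the corresponding design matrix, with gender coded as in the slides (1 = male, 2 = female):

```python
import numpy as np

# Sketch: design matrix for y = a + b1*gender + b2*age + b3*gender*age.
def design_with_interaction(gender, age):
    n = len(age)
    return np.column_stack([np.ones(n),      # intercept a
                            gender,          # b1: gender effect
                            age,             # b2: age slope
                            gender * age])   # b3: interaction term

# Testing H0: b3 = 0 asks whether the age slope differs by gender.
```
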
55
Multiple linear regression and ANOVA
  • Large overlap between linear regression and ANOVA
  • Multiple regression where all explanatory
    variables are categorical is in fact the same as
    an ANOVA with several factors
  • The two approaches give identical results

56
Comparison of statistical techniques (1)
  • There are similarities between t-test, ANOVA,
    ANCOVA and linear regression
  • A simple example to illustrate this would be to
    examine the effect of gender on weight
  • Option 1 - t-test
  • Option 2 - ANOVA
  • Option 3 - regression

57
T-test
  • Difference in mean weight between males and
    females = 28.2 lbs
  • t-test
  • t = 8.878, df = 215, P < 0.001

58
ANOVA
  • The variability in weight that can be explained
    by differences in gender is significant when
    compared to the amount of variability remaining
    unexplained
  • Note: variability is partitioned into between and
    within groups
  • F = 78.2, P < 0.001

59
Linear regression
  • The ANOVA table is exactly the same as the
    earlier one
  • Note: variation in weight is partitioned into
    regression and residual
  • F-test: F = 78.2, P < 0.001

60
Linear regression
  • Coding
  • Gender (1 = male, 2 = female)
  • Using the regression equation above, estimate the
    average weight for males and the average weight
    for females.

61
Linear regression
  • Recall from the t-test: difference in means = 28.2 lbs
  • Weight (lbs) = 195.9 - 28.2 × gender
  • Mean weight for males (gender = 1) = 167.75 lbs
  • Mean weight for females (gender = 2) = 139.58 lbs
  • t-test: t = 8.878, df = 215, P < 0.001
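
This equivalence can be checked numerically; a sketch on simulated data (the group means and SD below are assumptions, not the course dataset):

```python
import numpy as np
from scipy import stats

# Sketch: a two-sample t-test and a regression on a binary gender code
# give the same t statistic.
rng = np.random.default_rng(1)
weight_m = rng.normal(167.75, 25, size=120)  # assumed male group
weight_f = rng.normal(139.58, 25, size=97)   # assumed female group

t, p = stats.ttest_ind(weight_m, weight_f)

gender = np.r_[np.ones(120), np.full(97, 2.0)]  # 1 = male, 2 = female
y = np.r_[weight_m, weight_f]
b, a = np.polyfit(gender, y, 1)  # slope b = mean(females) - mean(males)
print(f"t = {t:.3f}, regression slope b = {b:.2f}")
```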

62
Comparison of statistical techniques (2)
  • Is there a difference in weight between males and
    females after accounting for any difference due
    to age?
  • Option 1 - ANCOVA
  • Option 2 - Multiple linear regression
  • Both will provide the same answer, that after
    adjusting weight for age, there is still a
    significant gender effect

63
Assignment
  • Due in on Monday at 12 noon
  • Before, in the break of, or immediately after the
    Monday 9-12 lecture
  • Remember to follow the instructions
  • Describe the data, choosing the important summary
    information relevant for each variable
  • Use tables or graphs if they make the message
    clearer
  • When making comparisons (performing tests),
    identify and present only the important
    information
  • Give actual p-values
  • Give the direction of differences, and the size
    if available