Correlation and Simple Linear Regression - PowerPoint PPT Presentation

About This Presentation
Title:

Correlation and Simple Linear Regression

Description:

Is there an association between age and blood pressure? ... E.g. Positive correlation between a stork population and human birth rates in an ... – PowerPoint PPT presentation

Slides: 47
Provided by: jillmo9

Transcript and Presenter's Notes


1
Correlation and Simple Linear Regression
  • Gordon Prescott

2
Correlation: Research questions
  • Is there an association between age and blood
    pressure?
  • To assess whether two variables are associated,
    i.e. if the values of one variable tend to be
    higher (or lower) for higher values of the other
    variable
  • Associations between two continuous variables

3
Correlation
  • Measures the strength of linear association
    between two continuous variables
  • Can be positive or negative
  • Can vary between -1 and 1
  • Does not imply causation (there may be some other
    factor that can explain the association).

4
Correlation and causation
r = 0.61

5
A note on correlation
  • It does not mean that one variable causes the
    other
  • Coffee consumption and road traffic accidents are
    strongly associated but that does not indicate
    that drinking coffee causes road traffic accidents

6
Pearson correlation coefficient
  • Subject   Body weight   Plasma volume
  • 1         58.0          2.75
  • 2         70.0          2.86
  • 3         74.0          3.37
  • 4         63.5          2.76
  • 5         62.0          2.62
  • 6         70.5          3.49
  • 7         71.0          3.05
  • 8         66.0          3.12

7
Correlation coefficient
  • The correlation coefficient is calculated as
  • r = (covariance between X and Y) /
    √(variance of X × variance of Y)
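
The formula can be tried on the eight subjects tabulated on the previous slide. The following is a minimal sketch in plain Python; the function name `pearson_r` and the variable names are illustrative choices, not part of the original slides.

```python
from math import sqrt

# Body weight and plasma volume for the eight subjects on the previous slide.
body_weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(x, y):
    """Covariance of x and y divided by the square root of
    (variance of x times variance of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The (n - 1) divisors of the covariance and the variances cancel,
    # so sums of squares and cross-products are sufficient.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(body_weight, plasma_volume)
print(round(r, 3))  # about 0.759
```

For these eight subjects the coefficient is roughly 0.76, a fairly strong positive linear association.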

8
Pearson correlation coefficient
  • r = -1: Perfect negative linear relationship
  • As the value of X increases the value of Y
    decreases
  • r = 0: No linear relationship between X and Y
  • r = 1: Perfect positive linear relationship
  • As the value of X increases the value of Y
    increases

9
Pearson correlation coefficient
[Figure panels: r approaching -1; r approaching 1]

10
Hypothesis test for correlation coefficient
  • It is possible to test whether a correlation
    coefficient differs significantly from zero
  • The test statistic for the correlation
    coefficient follows a t-distribution when the
    null hypothesis is true

H0: ρ = 0 vs. H1: ρ ≠ 0

11
Hypothesis test for correlation coefficient
  • The significance of the correlation coefficient
    will depend on the size of the correlation
    coefficient and the number of observations in the
    sample
  • The validity of this test requires that the
    variables are observed on a random sample of
    individuals and at least one of the variables
    follows a normal distribution

12
Correlation matrix
Correlation = 0.814, P-value < 0.001, Number = 252
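
The slides do not give the test statistic itself. A commonly used form, assumed here, is t = r√(n - 2) / √(1 - r²) with n - 2 degrees of freedom. The sketch below applies it to the reported values; the function name is a hypothetical choice.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom).

    Assumes the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2),
    which is not stated explicitly on the slides.
    """
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


r, n = 0.814, 252  # values reported on the correlation matrix slide
t = correlation_t_statistic(r, n)
print(f"t = {t:.2f} on {n - 2} degrees of freedom")
```

A t value of roughly 22 on 250 degrees of freedom is far out in the tail of the t-distribution, which is consistent with the reported P-value below 0.001. The same formula shows why significance depends on both the size of r and the number of observations.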

13
Assumptions of correlation
  • Assumptions of distribution
  • Hypothesis test - at least one variable normally
    distributed
  • Confidence interval - both variables must be
    normally distributed

14
Non-parametric correlation
  • When the data are ordinal,
  • or the data are not Normally distributed,
  • a rank correlation method can be applied
    (Spearman's rank correlation)

15
Example: Spearman's rank correlation
  • A study was conducted to investigate the
    relationship between the anxiety score for a
    child as evaluated by the child him/herself and
    by that child's mother.
  • Children's anxiety scores were measured on a
    continuous scale, mothers' anxiety scores were
    measured on an ordinal scale 1-7.
  • The null hypothesis is no relationship between
    children's and mothers' evaluations of children's
    anxiety.

16
Example: Spearman's rank correlation
  • The correlation coefficient is calculated in the
    same way as for Pearson's correlation
    coefficient, except that it is calculated on the
    ranks and not the actual values.
  • It ranges from -1 to 1 and has the same
    interpretation.
  • No requirement for the data to follow a Normal
    distribution.
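
The "Pearson on ranks" idea can be sketched as follows. The scores below are hypothetical illustration data, not the study's data, and the tie handling (tied values share the mean of their ranks) is a common convention assumed here rather than something the slides specify.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical scores for six children (NOT the study data).
child_scores = [12.5, 20.1, 15.3, 30.8, 25.0, 18.2]  # continuous scale
mother_scores = [2, 4, 3, 7, 5, 4]                   # ordinal scale 1-7

rho = spearman_rho(child_scores, mother_scores)
print(round(rho, 3))
```

Because only the ordering matters, any monotonically increasing relationship, even a curved one, gives a rank correlation of 1.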

17
Example: Spearman's rank correlation
Correlation is significant at the 5% level (P <
0.05), so the null hypothesis is rejected,
meaning that there is a relationship between
children's and mothers' evaluations of children's
anxiety
Correlation = 0.638, P-value = 0.035, Number = 11

18
Use and misuse of correlation
  • All observations should be independent
  • only one observation of each variable should come
    from each individual in the study
  • Data dredging
  • 10 variables, 45 possible correlations; 20
    variables, 190 possible correlations!
  • Assessing agreement
  • Relationships between a part and a whole
  • total cholesterol and LDL cholesterol (total
    cholesterol is the sum of 3 types of cholesterol)
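
The counts quoted in the data dredging bullet are the number of distinct pairs among k variables, k(k - 1)/2. A short check, with a hypothetical helper name:

```python
from math import comb


def n_pairwise_correlations(n_variables):
    """Number of distinct pairs of variables, i.e. possible correlations."""
    return comb(n_variables, 2)


print(n_pairwise_correlations(10))  # 45
print(n_pairwise_correlations(20))  # 190
```

If every one of these correlations were tested at the 5% level and no real associations existed, about 5% of them (roughly 2 of 45, or 9 to 10 of 190) would still be expected to appear significant by chance, which is why testing every pair is risky.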

19
When not to use correlation
  • Spurious correlations involving time
  • E.g. Positive correlation between a stork
    population and human birth rates in an area of
    the Netherlands
  • Both variables increasing with time and so appear
    to be highly correlated
  • Should look at many areas rather than one area
    over time
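
The effect of a shared time trend can be illustrated with made-up numbers. In the sketch below, all values are hypothetical: two series each rise steadily over ten years, while their year-to-year fluctuations are constructed to be uncorrelated.

```python
from math import sqrt


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


years = list(range(10))

# Hypothetical fluctuations, chosen so that they are uncorrelated with each other.
stork_wiggle = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
birth_wiggle = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]

# Both series increase with time; nothing else links them.
storks = [10 * t + w for t, w in zip(years, stork_wiggle)]
births = [20 * t + w for t, w in zip(years, birth_wiggle)]

r_trending = pearson_r(storks, births)          # very close to 1
r_wiggles = pearson_r(stork_wiggle, birth_wiggle)  # essentially 0
print(round(r_trending, 3))
```

The near-perfect correlation between the two series comes entirely from the common upward trend, not from any link between the quantities themselves.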

20
Simple linear regression
21
Research questions
  • How does systolic blood pressure change as age
    increases?
  • Can systolic blood pressure be predicted from a
    subject's age?
  • Can body fat be predicted from abdomen
    circumference measurements?

22
Simple Linear Regression
  • Simple linear regression describes the
    relationship between two continuous variables
  • Simple linear regression gives the equation of
    the straight line that best describes the
    association between two continuous variables.
  • It enables the prediction of one variable using
    information from another variable.

23
Types of Variable in Linear Regression
  • The dependent variable is the variable to be
    predicted (i.e. the particular outcome of
    interest).
  • The independent variable or explanatory variable
    is the variable used for predicting the
    particular outcome.

24
Equation of a straight line
  • The equation of a straight line is ŷ = a + bx
  • ŷ is the predicted value (of the dependent
    variable)
  • a is the intercept
  • b is the slope (or gradient) of the line
  • x is the independent (explanatory) variable

25
Least squares
  • The values of a and b are calculated to minimise
    the sum of the squared vertical distances between
    the observed values of the dependent variable and
    the regression line. This is called the least
    squares fit.
  • Each vertical distance is the difference between
    the actual value of the dependent variable and
    the predicted value from the regression line for
    that value of the independent variable

26
Regression coefficient (b)
  • The slope, b, is often called the regression
    coefficient
  • It has the same sign as the correlation
    coefficient
  • When there is no correlation between x and y,
    then the regression coefficient, b, will equal 0
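
Both properties follow from a standard identity linking the least squares slope to the correlation coefficient. The symbols s_x and s_y (the sample standard deviations of x and y) are introduced here for illustration and do not appear on the slides.

```latex
b = r \, \frac{s_y}{s_x}, \qquad s_x > 0, \; s_y > 0
```

Since both standard deviations are positive, b takes the sign of r, and b = 0 exactly when r = 0.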

27
Residuals
  • y = a + bx + ε
  • ε is termed the residual.
  • The residual is the difference between the
    observed value y and the predicted value ŷ (as
    calculated from the regression equation). So
    residual = (y - ŷ)
  • A residual is calculated for each observation.
  • The method of least squares attempts to minimise
    the sum of squared residuals.
  • Mathematical techniques are used to find the
    values of a and b which satisfy the least squares
    fit.
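
For a single explanatory variable, the "mathematical techniques" reduce to a closed-form solution: b is the sum of cross-products divided by the sum of squares of x, and a is chosen so the line passes through the point of means. The sketch below applies this to the body weight and plasma volume data from the earlier Pearson correlation slide; treating plasma volume as the dependent variable is an assumption made for illustration, and the function name is hypothetical.

```python
# Data from the Pearson correlation coefficient slide.
body_weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]    # x
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # y


def least_squares_line(x, y):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: line passes through (mean_x, mean_y)
    return a, b


a, b = least_squares_line(body_weight, plasma_volume)
predicted = [a + b * xi for xi in body_weight]
residuals = [yi - pi for yi, pi in zip(plasma_volume, predicted)]

print(round(a, 4), round(b, 4))  # about 0.0857 and 0.0436
```

One residual is produced per subject, and with an intercept in the model the residuals sum to zero apart from floating-point rounding.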

28
Predicted value (ŷ)
  • The predicted value, ŷ, is subject to sampling
    variation
  • Its precision can be estimated (prediction error)
    by the standard error of the estimate
  • The greater the standard error, the greater the
    dispersion of observed y values around the
    regression line and hence the larger the
    prediction error
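
The slides do not give a formula for the standard error of the estimate. A common definition, assumed here, is the square root of the sum of squared residuals divided by n - 2. The sketch below computes it for the body weight and plasma volume data from the earlier slide, with plasma volume taken as the dependent variable for illustration.

```python
from math import sqrt

body_weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]    # x
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # y


def standard_error_of_estimate(x, y):
    """sqrt(sum of squared residuals / (n - 2)) for the least squares line.

    Assumed definition; the divisor n - 2 reflects the two estimated
    coefficients (intercept and slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sqrt(sse / (n - 2))


s = standard_error_of_estimate(body_weight, plasma_volume)
print(round(s, 3))  # about 0.219
```

The result is in the units of the dependent variable, so it can be read as a typical distance between an observed value and the fitted line.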

29
Example
  • A fitness gym wishes to assess their clients'
    body fat. An accurate method of measuring body
    fat is using an underwater weighing technique.
    This is not a practical method for the fitness
    instructors to carry out on the premises. They
    would like to be able to predict their clients'
    body fat from other measurements, more easily
    obtainable from the client.
  • 252 men had their body fat and abdomen
    circumference measured

30
Testing hypothesis
  • H0: There is no linear relationship between body
    fat and abdomen circumference in the population
  • H1: There is a linear relationship between body
    fat and abdomen circumference in the population
  • Or this can be rephrased as
  • H0: Abdomen circumference does not account for
    any variability in body fat in the population
  • H1: Abdomen circumference does account for some
    of the variability in body fat in the population

31
Assess whether linear relationship exists
  • The scatterplot of body fat versus abdomen
    circumference indicates that there is a strong
    positive relationship between the two variables
  • Recall that the correlation coefficient was 0.814

32
Simple linear regression in SPSS
  • Analyze
  • Regression
  • Linear
  • The dependent variable is body fat
  • The independent variable is abdomen circumference

33
SPSS linear regression
  • R is the correlation between the two variables
    = 0.814
  • R square is the proportion of variability in body
    fat measurements that can be explained by
    differences in abdomen circumference = 0.662 or
    66.2%

34
SPSS linear regression
  • A statistically significant proportion of the
    variability in body fat measurements can be
    attributed to the regression model

35
SPSS regression equation
  • Predicted body fat = constant + B × abdomen
    circum.
  • Predicted body fat = -35.197 + 0.585 × abdomen
    circum.

36
Prediction
  • How do you use linear regression for prediction?
  • The regression equation allows you to predict the
    value of the dependent variable (Y) for a
    particular value of the independent variable (X)
  • Predicted body fat = -35.197 + 0.585 × abdomen
    circum.
  • What is the predicted body fat content for a man
    with an abdomen circumference of 100 cm?
  • Predicted body fat = -35.197 + 0.585 × 100
  • = -35.197 + 58.5
  • = 23.3
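
The same calculation can be wrapped in a small function so that it is easy to repeat for other clients. The coefficients are those from the SPSS output above; the function and constant names are hypothetical. A prediction is only trustworthy for abdomen circumferences within the range observed in the 252 men, which the slides do not report.

```python
INTERCEPT = -35.197  # constant from the SPSS output
SLOPE = 0.585        # B for abdomen circumference (cm)


def predict_body_fat(abdomen_cm):
    """Predicted body fat from abdomen circumference in cm."""
    return INTERCEPT + SLOPE * abdomen_cm


prediction = predict_body_fat(100)
print(round(prediction, 1))  # 23.3
```

Each additional centimetre of abdomen circumference raises the predicted body fat by 0.585, the value of the slope.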

37
Assumptions of linear regression
  • There should be a linear relationship between the
    dependent variable and the independent variable
  • For any value of the independent variable the
    dependent variable values should follow a Normal
    distribution (i.e. Normally distributed residuals)
  • The variance of the dependent variable values
    should be the same for all independent variable
    values.

38
Checking the assumptions
  • After the regression model has been fitted to the
    data it is essential to check that the
    assumptions of linear regression have not been
    violated
  • If any of the assumptions have been violated then
    the regression model is likely to be invalid

39
Assumptions: Linearity (1)
  • Plot the dependent variable against the
    independent variable
  • A linear pattern (sausage shape) should be seen
    for the linearity assumption to hold
  • Assumption satisfied

40
Assumptions: Linearity (2)
  • Plot the residuals against the predicted values
  • No curvature in the plot should be seen for the
    linearity assumption to hold
  • Assumption satisfied

41
Assumptions: Normal residuals (1)
  • Normally distributed residuals can be assessed by
    looking at a histogram of the residuals
  • Assumption satisfied

42
Assumptions: Normal residuals (2)
  • Normally distributed residuals can be assessed by
    looking at a normal probability plot
  • Assumption satisfied

43
Assumption: Constant variance
  • Constant variance of the residuals can be
    assessed by plotting the residuals against the
    predicted values
  • There should be an even spread of residuals
    around zero
  • Assumption satisfied

44
Assumption: Constant variance
  • This assumption would not be satisfied if the
    spread of the residuals increased or decreased as
    the predicted values increase in size
  • The plot should show a random scatter of points
    with no pattern
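
A rough numerical companion to the residual plot is to compare the spread of the residuals for low and high predicted values. The sketch below is an informal check on simulated, hypothetical data, not a formal test and not a replacement for looking at the plot; the function names, the seed, and the simulated models are all assumptions made for illustration.

```python
import random
from math import sqrt


def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b


def sd(values):
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def spread_ratio(x, y):
    """SD of residuals in the upper half of predicted values divided by
    the SD in the lower half. Values near 1 suggest constant variance."""
    a, b = fit_line(x, y)
    # Pair each predicted value with its residual, then sort by predicted value.
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    lower = [res for _, res in pairs[:half]]
    upper = [res for _, res in pairs[half:]]
    return sd(upper) / sd(lower)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
x = [float(i) for i in range(1, 201)]

# Simulated data: same straight line, two different noise patterns.
constant = [2 + 0.5 * xi + rng.gauss(0, 5) for xi in x]         # even spread
fanning = [2 + 0.5 * xi + rng.gauss(0, 0.05 * xi) for xi in x]  # spread grows

ratio_constant = spread_ratio(x, constant)
ratio_fanning = spread_ratio(x, fanning)
print(round(ratio_constant, 2), round(ratio_fanning, 2))
```

A ratio close to 1 corresponds to the even band of residuals described above, while a ratio well above (or below) 1 corresponds to the fanning pattern that violates the assumption.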

45
Summary: correlation
  • Measures the strength of a linear association
    between two variables (usually continuous or
    discrete).
  • High positive or negative correlations suggest
    that two variables are related (but not that one
    causes the other).
  • Looking at scatterplots of the variables is
    always a good idea.
  • Check for common influences such as time or age
    which may affect both of the variables.

46
Summary: simple linear regression
  • Simple linear regression gives the equation of
    the straight line that best describes the
    association between two variables.
  • A linear relationship between the dependent
    variable and the independent variable is
    required.
  • For any value of the independent variable the
    dependent variable values should follow a Normal
    distribution.
  • The variance of the dependent variable values
    should be the same for all independent variable
    values.