Correlation and Simple Linear Regression presentation

1
Correlation and Simple Linear Regression

Gordon Prescott

2
Correlation: Research questions

Is there an association between age and blood
pressure?
To assess whether two variables are associated,
i.e. if the values of one variable tend to be
higher (or lower) for higher values of the other
variable
Associations between two continuous variables

3
Correlation

Measures the strength of linear association
between two continuous variables
Can be positive or negative
Can vary between -1 and 1
Does not imply causation (there may be some other
factor that can explain the association).

4
Correlation and causation
r = 0.61
5
A note on correlation

It does not mean that one variable causes the
other
Coffee consumption and road traffic accidents are
strongly associated but that does not indicate
that drinking coffee causes road traffic accidents

6
Pearson correlation coefficient

Subject  Body weight  Plasma volume
1        58.0         2.75
2        70.0         2.86
3        74.0         3.37
4        63.5         2.76
5        62.0         2.62
6        70.5         3.49
7        71.0         3.05
8        66.0         3.12

7
Correlation coefficient

The correlation coefficient is calculated as
r = (Covariance between X and Y) /
√(Variance of X × Variance of Y)
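
As a rough illustration of this calculation, the following Python sketch computes r for the eight subjects tabulated on slide 6. The function name `pearson_r` is an assumption made for this example and does not come from the original slides. The n - 1 divisors in the covariance and the variances cancel, so the choice of divisor does not affect r.

```python
from math import sqrt

# Body weight and plasma volume for the eight subjects on slide 6.
body_weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(x, y):
    """Pearson correlation: covariance / sqrt(variance of x * variance of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return cov_xy / sqrt(var_x * var_y)


r = pearson_r(body_weight, plasma_volume)
print(f"r = {r:.2f}")  # prints "r = 0.76"
```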

8
Pearson correlation coefficient

r = -1: Perfect negative linear relationship
As the value of X increases the value of Y
decreases
r = 0: No linear relationship between X and Y
r = 1: Perfect positive linear relationship
As the value of X increases the value of Y
increases

9
Pearson correlation coefficient
r approaching -1
r approaching 1
10
Hypothesis test for correlation coefficient

It is possible to test whether a correlation
coefficient differs significantly from zero
The test statistic for the correlation
coefficient follows a t-distribution when the
null hypothesis is true

H0: ρ = 0 vs. H1: ρ ≠ 0
11
Hypothesis test for correlation coefficient

The significance of the correlation coefficient
will depend on the size of the correlation
coefficient and the number of observations in the
sample
The validity of this test requires that the
variables are observed on a random sample of
individuals and at least one of the variables
follows a normal distribution
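
The dependence on both the size of r and the sample size can be seen in the usual t statistic for this test, t = r√(n - 2) / √(1 - r²), which is referred to a t-distribution with n - 2 degrees of freedom. The slides do not spell out the formula, so it is supplied here as an assumption about the test being described. The function name and the example values of r and n are hypothetical.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# The same correlation gives a larger test statistic in a larger sample.
t_small = correlation_t_statistic(0.3, 10)
t_large = correlation_t_statistic(0.3, 100)
print(f"r = 0.3, n = 10:  t = {t_small:.2f}")
print(f"r = 0.3, n = 100: t = {t_large:.2f}")
```

The statistic would then be compared with a t-distribution (for example in statistical tables or software) to obtain the P-value.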

12
Correlation matrix
Correlation = 0.814, P-value < 0.001, Number = 252

13
Assumptions of correlation

Distributional assumptions
Hypothesis test: at least one variable must be
normally distributed
Confidence interval: both variables must be
normally distributed

14
Non-parametric correlation

When the data are ordinal,
or the data are not Normally distributed,
a rank correlation method can be applied
(Spearman's rank correlation)

15
Example: Spearman's rank correlation

A study was conducted to investigate the
relationship between anxiety score for a child
evaluated by the child him/herself and by that
child's mother.
Children's anxiety scores measured on a
continuous scale, mothers' anxiety scores
measured on an ordinal scale 1-7.
The null hypothesis is no relationship between
children's and mothers' evaluations of children's
anxiety.

16
Example: Spearman's rank correlation

The correlation coefficient is calculated in the
same way as for Pearson's correlation
coefficient, except that it is calculated on the
ranks and not the actual values.
It ranges from -1 to 1 and has the same
interpretation.
No requirement for the data to follow a Normal
distribution.
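
The rank-based calculation described above can be sketched in Python as follows. The study data are not reproduced in the transcript, so the scores below are made-up values used only for illustration, and the function names are assumptions. Tied values are given the average of their ranks, which is a common convention rather than something stated in the slides.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores (NOT the study data): child on a continuous scale,
# mother on an ordinal 1-7 scale.
child_scores = [12.5, 20.1, 15.3, 30.2, 18.0]
mother_scores = [2, 5, 2, 7, 4]
rho = spearman(child_scores, mother_scores)
print(f"Spearman's rho = {rho:.3f}")
```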

17
Example: Spearman's rank correlation
Correlation is significant at the 5% level (P <
0.05), so the null hypothesis is rejected,
meaning that there is a relationship between
children's and mothers' evaluations of children's
anxiety
Correlation = 0.638, P-value = 0.035, Number = 11
18
Use and misuse of correlation

All observations should be independent:
only one observation of each variable should come
from each individual in the study
Data dredging:
10 variables give 45 possible correlations; 20
variables give 190 possible correlations!
Assessing agreement
Relationships between a part and the whole, e.g.
total cholesterol and LDL cholesterol (total
cholesterol is the sum of 3 types of cholesterol)
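
The counts quoted under data dredging are the number of distinct pairs among k variables, k(k - 1)/2. A minimal Python sketch follows. The function name is hypothetical, and the expected number of chance findings assumes, purely for illustration, that every null hypothesis is true and each test is run at the 5% level.

```python
from math import comb


def possible_correlations(n_variables):
    """Number of distinct pairs among n variables: n(n - 1) / 2."""
    return comb(n_variables, 2)


for k in (10, 20):
    pairs = possible_correlations(k)
    # Illustrative assumption: all null hypotheses true, 5% level per test,
    # so about 5% of the tests come out "significant" by chance alone.
    expected_by_chance = 0.05 * pairs
    print(f"{k} variables: {pairs} correlations, "
          f"about {expected_by_chance:.1f} significant by chance")
```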

19
When not to use correlation

Spurious correlations involving time
E.g. Positive correlation between a stork
population and human birth rates in an area of
the Netherlands
Both variables increasing with time and so appear
to be highly correlated
Should look at many areas rather than one area
over time

20
Simple linear regression
21
Research questions

How does systolic blood pressure change as age
increases?
Can systolic blood pressure be predicted from a
subjects age?
Can body fat be predicted from abdomen
circumference measurements?

22
Simple Linear Regression

Simple linear regression describes the
relationship between two continuous variables
Simple linear regression gives the equation of
the straight line that best describes the
association between two continuous variables.
It enables the prediction of one variable using
information from another variable.

23
Types of Variable in Linear Regression

The dependent variable is the variable to be
predicted (i.e. the particular outcome of
interest).
The independent variable or explanatory variable
is the variable used for predicting the
particular outcome.

24
Equation of a straight line

The equation of a straight line is y = a + bx
y is the predicted value (of the dependent
variable)
a is the intercept
b is the slope (or gradient) of the line
x is the independent (explanatory) variable

25
Least squares

The values of a and b are calculated to minimise
the sum of the squared vertical distances between
the observed values of the dependent variable and
the regression line. This is called the least
squares fit.
Each vertical distance is the difference between
the actual value of the dependent variable and
the predicted value from the regression line for
that value of the independent variable

26
Regression coefficient (b)

The slope, b, is often called the regression
coefficient
It has the same sign as the correlation
coefficient
When there is no correlation between x and y,
then the regression coefficient, b, will equal 0
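
Both statements follow from the standard relationship between the least squares slope and the correlation coefficient r. Here s_x and s_y denote the sample standard deviations of x and y, notation introduced for this note rather than taken from the slides:

```latex
b = r \, \frac{s_y}{s_x}
```

Because s_x and s_y are positive, b takes the sign of r and equals 0 exactly when r equals 0.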

27
Residuals

y = a + bx + ε
ε is termed the residual.
The residual is the difference between the
observed value y and the predicted value ŷ (as
calculated from the regression equation). So
residual = (y - ŷ)
A residual is calculated for each observation.
The method of least squares attempts to minimise
the sum of squared residuals.
Mathematical techniques are used to find the
values of a and b which satisfy the least squares
fit.
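
For simple linear regression these values have a closed form: b is the sum of cross-products of deviations divided by the sum of squared deviations of x, and a is chosen so that the line passes through the point of means. The sketch below applies this to the body weight and plasma volume data from slide 6, used here only because it is the one dataset reproduced in the transcript. The function names are assumptions.

```python
# Body weight (x) and plasma volume (y) for the eight subjects on slide 6.
x = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
y = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def least_squares(x, y):
    """Return the intercept a and slope b of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x  # line passes through (mean_x, mean_y)
    return a, b


def sum_squared_residuals(a, b, x, y):
    """Sum over observations of (observed y - predicted y) squared."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares(x, y)
best = sum_squared_residuals(a, b, x, y)
# Moving the slope away from the fitted value increases the sum.
worse = sum_squared_residuals(a, b + 0.001, x, y)
print(f"a = {a:.4f}, b = {b:.4f}")  # prints "a = 0.0857, b = 0.0436"
print(best < worse)  # prints "True"
```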

28
Predicted value (ŷ)

The predicted value, ŷ, is subject to sampling
variation
Its precision can be estimated (prediction error)
by the standard error of the estimate
The greater the standard error, the greater the
dispersion of observed y values around the
regression line and hence the larger the
prediction error
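
One common definition of the standard error of the estimate is the square root of the sum of squared residuals divided by n - 2. The slides do not give the formula, so this definition is an assumption. The sketch below computes it for the slide 6 body weight and plasma volume data, chosen only because those values appear in the transcript.

```python
from math import sqrt

# Body weight (x) and plasma volume (y) for the eight subjects on slide 6.
x = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
y = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Residual = observed value minus predicted value.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Two parameters (a and b) were estimated, hence the n - 2 divisor.
standard_error_of_estimate = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(f"standard error of the estimate = {standard_error_of_estimate:.3f}")
```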

29
Example

A fitness gym wishes to assess its clients'
body fat. An accurate method of measuring body
fat is using an underwater weighing technique.
This is not a practical method for the fitness
instructors to carry out on the premises. They
would like to be able to predict their clients'
body fat from other measurements, more easily
obtainable from the client.
252 men had their body fat and abdomen
circumference measured

30
Testing hypothesis

H0: There is no linear relationship between body
fat and abdomen circumference in the population
H1: There is a linear relationship between body
fat and abdomen circumference in the population
Or this can be rephrased as
H0: Abdomen circumference does not account for
any variability in body fat in the population
H1: Abdomen circumference does account for some of
the variability in body fat in the population

31
Assess whether linear relationship exists

The scatterplot of body fat versus abdomen
circumference indicates that there is a strong
positive relationship between the two variables
Recall that the correlation coefficient was 0.814

32
Simple linear regression in SPSS

Analyze
Regression
Linear
The dependent variable is body fat
The independent variable is abdomen circumference

33
SPSS linear regression

R is the correlation between the two variables:
0.814
R square is the proportion of variability in body
fat measurements that can be explained by
differences in abdomen circumference: 0.662 or
66.2%

34
SPSS linear regression

A statistically significant proportion of the
variability in body fat measurements can be
attributed to the regression model

35
SPSS: regression equation

Predicted body fat = constant + B × abdomen circum.
Predicted body fat = -35.197 + 0.585 × abdomen
circum.

36
Prediction

How do you use linear regression for prediction?
The regression equation allows you to predict the
value of the dependent variable (Y) for a
particular value of the independent variable (X)
Predicted body fat = -35.197 + 0.585 × abdomen
circum.
What is the predicted body fat content for a man
with an abdomen circumference of 100 cm?
Predicted body fat = -35.197 + 0.585 × 100
= -35.197 + 58.5
= 23.3
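
The same calculation can be wrapped in a small function so that it can be repeated for any abdomen circumference. This is a minimal sketch using the coefficients reported above; the function and constant names are assumptions made for this example.

```python
# Coefficients from the fitted regression equation reported above.
INTERCEPT = -35.197
SLOPE = 0.585


def predict_body_fat(abdomen_cm):
    """Predicted body fat for an abdomen circumference given in cm."""
    return INTERCEPT + SLOPE * abdomen_cm


print(round(predict_body_fat(100), 1))  # prints 23.3
```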

37
Assumptions of linear regression

There should be a linear relationship between the
dependent variable and the independent variable
For any value of the independent variable the
dependent variable values should follow a Normal
distribution (i.e. normally distributed residuals)
The variance of the dependent variable values
should be the same for all independent variable
values.

38
Checking the assumptions

After the regression model has been fitted to the
data it is essential to check that the
assumptions of linear regression have not been
violated
If any of the assumptions have been violated then
the regression model is likely to be invalid

39
Assumptions: Linearity (1)

Plot the dependent variable against the
independent variable
Linear pattern (sausage shape) expected if the
linearity assumption is to hold
Assumption satisfied

40
Assumptions: Linearity (2)

Plot the residuals against the predicted values
No curvature in the plot should be seen for the
linearity assumption to hold
Assumption satisfied

41
Assumptions: Normal residuals (1)

Normally distributed residuals can be checked by
looking at a histogram of the residuals
Assumption satisfied

42
Assumptions: Normal residuals (2)

Normally distributed residuals can be checked by
looking at a normal probability plot
Assumption satisfied

43
Assumption: Constant variance

Constant variance of the residuals can be
assessed by plotting the residuals against the
predicted values
There should be an even spread of residuals
around zero
Assumption satisfied

44
Assumption: Constant variance

This assumption would not be satisfied if the
spread of the residuals increased or decreased as
the predicted values increase in size
The plot should show a random scatter with no
pattern
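
The quantities behind these diagnostic plots are simply the predicted values and the residuals from the fitted line. The sketch below computes them for the slide 6 body weight and plasma volume data, since the body fat data are not reproduced in the transcript. It then compares the residual spread in the lower and upper halves of the predicted values. That comparison is only a crude numeric stand-in for inspecting the plot, and with eight observations it is far too small to be conclusive.

```python
from statistics import pstdev

# Body weight (x) and plasma volume (y) for the eight subjects on slide 6.
x = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
y = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# (predicted, residual) pairs: the points on a residuals-vs-predicted plot,
# ordered by predicted value.
points = sorted(zip(predicted, residuals))

# Rough check only: similar spreads are consistent with constant variance.
half = len(points) // 2
lower_spread = pstdev([e for _, e in points[:half]])
upper_spread = pstdev([e for _, e in points[half:]])
print(f"residual SD, lower half of predictions: {lower_spread:.3f}")
print(f"residual SD, upper half of predictions: {upper_spread:.3f}")
```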

45
Summary: correlation

Measures the strength of a linear association
between two variables (usually continuous or
discrete).
High positive or negative correlations suggest
that two variables are related (but not that one
causes the other).
Looking at scatterplots of the variables is
always a good idea.
Check for common influences such as time or age
which may affect both of the variables.

46
Summary: simple linear regression

Simple linear regression gives the equation of
the straight line that best describes the
association between two variables.
A linear relationship between the dependent
variable and the independent variable is
required.
For any value of the independent variable the
dependent variable values should follow a Normal
distribution.
The variance of the dependent variable values
should be the same for all independent variable
values.
