Correlation and Simple Linear Regression

- Gordon Prescott

Correlation: Research questions

- Is there an association between age and blood pressure?
- To assess whether two variables are associated, i.e. if the values of one variable tend to be higher (or lower) for higher values of the other variable
- Associations between two continuous variables

Correlation

- Measures the strength of linear association between two continuous variables
- Can be positive or negative
- Can vary between -1 and 1
- Does not imply causation (there may be some other factor that can explain the association).

Correlation and causation

r = 0.61

A note on correlation

- Correlation does not mean that one variable causes the other
- Coffee consumption and road traffic accidents are strongly associated but that does not indicate that drinking coffee causes road traffic accidents

Pearson correlation coefficient

Subject   Body weight   Plasma volume
1         58.0          2.75
2         70.0          2.86
3         74.0          3.37
4         63.5          2.76
5         62.0          2.62
6         70.5          3.49
7         71.0          3.05
8         66.0          3.12

Correlation coefficient

- The correlation coefficient is calculated as
- $r = \dfrac{\text{Covariance between } X \text{ and } Y}{\sqrt{\text{Variance of } X \times \text{Variance of } Y}}$
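
As a minimal sketch of this calculation, the covariance-over-variances formula can be applied to the eight body weight and plasma volume pairs from the earlier slide. The function name `pearson_r` is an illustrative choice, not something from the slides, and sample (n - 1) variances are assumed.

```python
import math

# Body weight and plasma volume for the eight subjects listed in the slides.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def pearson_r(x, y):
    """Pearson correlation: covariance / sqrt(variance of x * variance of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and variances (divisor n - 1; it cancels in the ratio).
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    return cov_xy / math.sqrt(var_x * var_y)


r = pearson_r(weight, plasma)
print(round(r, 3))  # about 0.759
```

For these eight subjects the coefficient comes out at roughly 0.76, a fairly strong positive linear association.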

Pearson correlation coefficient

- r = -1: Perfect negative linear relationship
- As the value of X increases the value of Y decreases
- r = 0: No linear relationship between X and Y
- r = 1: Perfect positive linear relationship
- As the value of X increases the value of Y increases

Pearson correlation coefficient

r approaching -1

r approaching 1

Hypothesis test for correlation coefficient

- It is possible to test whether a correlation coefficient differs significantly from zero
- The test statistic for the correlation coefficient follows a t-distribution when the null hypothesis is true

$H_0: \rho = 0$ vs. $H_1: \rho \neq 0$

Hypothesis test for correlation coefficient

- The significance of the correlation coefficient will depend on the size of the correlation coefficient and the number of observations in the sample
- The validity of this test requires that the variables are observed on a random sample of individuals and at least one of the variables follows a normal distribution

Correlation matrix

Correlation = 0.814, P-value < 0.001, Number = 252
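
The slides do not spell out the test statistic. A sketch using the usual form, t = r * sqrt((n - 2) / (1 - r^2)) on n - 2 degrees of freedom, shows how the result depends on both the size of r and the sample size. The function name and the r = 0.5 examples are illustrative assumptions; only r = 0.814 with n = 252 comes from the output above.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, referred to a t-distribution with n - 2 df."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Values reported in the correlation matrix above.
t_reported = correlation_t_statistic(0.814, 252)   # about 22.2

# Hypothetical illustration: the same r is far more convincing with more data.
t_small_sample = correlation_t_statistic(0.5, 10)    # about 1.63
t_large_sample = correlation_t_statistic(0.5, 100)   # about 5.72

print(round(t_reported, 1), round(t_small_sample, 2), round(t_large_sample, 2))
```

A t value above 22 on 250 degrees of freedom is far out in the tail, which is consistent with the reported P-value below 0.001.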

Assumptions of correlation

- Assumptions of distribution
- Hypothesis test: at least one variable normally distributed
- Confidence interval: both variables must be normally distributed

Non-parametric correlation

- When data is ordinal
- or the data is not Normally distributed,
- a rank correlation method can be applied (Spearman's rank correlation)

Example: Spearman's rank correlation

- A study was conducted to investigate the relationship between anxiety score for a child evaluated by the child him/herself and by that child's mother.
- Children's anxiety scores measured on a continuous scale, mothers' anxiety scores measured on an ordinal scale 1-7.
- The null hypothesis is no relationship between children's and mothers' evaluations of children's anxiety.

Example: Spearman's rank correlation

- The correlation coefficient is calculated in the same way as for Pearson's correlation coefficient, except that it is calculated on the ranks and not the actual values.
- It ranges from -1 to 1 and has the same interpretation.
- No requirement for the data to follow a Normal distribution.
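
The "Pearson on the ranks" idea can be sketched as follows. The helper names and the small data set are made up for illustration (the study data are not given in the slides), and tied values are assumed to share the average of their ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y), pearson_r(x, y))
```

Because the ranks of the two made-up variables agree exactly, the rank correlation is 1 even though the relationship is curved and Pearson's coefficient on the raw values is slightly below 1.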

Example: Spearman's rank correlation

Correlation is significant at the 5% level (P < 0.05), so the null hypothesis is rejected, meaning that there is evidence of a relationship between children's and mothers' evaluations of children's anxiety

Correlation = 0.638, P-value = 0.035, Number = 11

Use and misuse of correlation

- All observations should be independent
- Only one observation of each variable should come from each individual in the study
- Data dredging
- 10 variables, 45 possible correlations; 20 variables, 190 possible correlations!
- Assessing agreement
- Relationships between a part and a whole
- Total cholesterol and LDL cholesterol (total cholesterol is the sum of 3 types of cholesterol)
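
The data dredging counts are the number of distinct pairs among k variables, k(k - 1)/2. The sketch below reproduces them. It also gives a rough idea of why dredging is risky, under the simplifying assumption (not made in the slides) that the tests are independent and each uses a 5% significance level. Function names are illustrative.

```python
import math


def n_pairwise_correlations(k):
    """Number of distinct pairs among k variables: k * (k - 1) / 2."""
    return math.comb(k, 2)


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one spurious 'significant' result.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification for correlations sharing variables.
    """
    return 1 - (1 - alpha) ** n_tests


print(n_pairwise_correlations(10))  # 45
print(n_pairwise_correlations(20))  # 190
print(round(chance_of_any_false_positive(45), 2))  # about 0.9
```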

When not to use correlation

- Spurious correlations involving time
- E.g. Positive correlation between a stork population and human birth rates in an area of the Netherlands
- Both variables increasing with time and so appear to be highly correlated
- Should look at many areas rather than one area over time

Simple linear regression

Research questions

- How does systolic blood pressure change as age increases?
- Can systolic blood pressure be predicted from a subject's age?
- Can body fat be predicted from abdomen circumference measurements?

Simple Linear Regression

- Simple linear regression describes the relationship between two continuous variables
- Simple linear regression gives the equation of the straight line that best describes the association between two continuous variables.
- It enables the prediction of one variable using information from another variable.

Types of Variable in Linear Regression

- The dependent variable is the variable to be predicted (i.e. the particular outcome of interest).
- The independent variable or explanatory variable is the variable used for predicting the particular outcome.

Equation of a straight line

- The equation of a straight line is $\hat{y} = a + bx$
- $\hat{y}$ is the predicted value (of the dependent variable)
- a is the intercept
- b is the slope (or gradient) of the line
- x is the independent (explanatory) variable

Least squares

- The values of a and b are calculated to minimise the sum of the squared vertical distances from the regression line to the observed values of the dependent variable. This is called the least squares fit.
- Each vertical distance is the difference between the actual value of the dependent variable and the predicted value from the regression line for that value of the independent variable

Regression coefficient (b)

- The slope, b, is often called the regression coefficient
- It has the same sign as the correlation coefficient
- When there is no correlation between x and y, then the regression coefficient, b, will equal 0
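
Both points follow from a standard identity linking the slope to the correlation coefficient r. Here $s_x$ and $s_y$ denote the sample standard deviations of x and y (symbols introduced for this note, not used in the slides):

```latex
b = r \, \frac{s_y}{s_x}
```

Since $s_x$ and $s_y$ are positive, b and r always share a sign, and r = 0 gives b = 0.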

Residuals

- $y = a + bx + \varepsilon$
- $\varepsilon$ is termed the residual.
- The residual is the difference between the predicted value $\hat{y}$ (as calculated from the regression equation) and the observed value y. So residual = $(y - \hat{y})$
- A residual is calculated for each observation.
- The method of least squares attempts to minimise the sum of squared residuals.
- Mathematical techniques are used to find the values of a and b which satisfy the least squares fit.
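
For a single independent variable, the least squares values have a closed form: the slope is the sum of cross-products divided by the sum of squares of x, and the intercept makes the line pass through the two means. The sketch below reuses the eight body weight and plasma volume pairs from the correlation slides purely as an illustration. The slides do not fit a regression to these data, and the function names are assumptions.

```python
# Body weight (x) and plasma volume (y) from the Pearson correlation slide.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def least_squares_fit(x, y):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b


def sum_squared_residuals(x, y, a, b):
    """Sum over observations of (observed y - predicted y) squared."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares_fit(weight, plasma)
residuals = [yi - (a + b * xi) for xi, yi in zip(weight, plasma)]
best = sum_squared_residuals(weight, plasma, a, b)

print(round(a, 4), round(b, 4))  # about 0.0857 and 0.0436
```

Moving either a or b away from the fitted values can only increase the sum of squared residuals, which is what "least squares" means in practice.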

Predicted value ($\hat{y}$)

- The predicted value, $\hat{y}$, is subject to sampling variation
- Its precision can be estimated (prediction error) by the standard error of the estimate
- The greater the standard error, the greater the dispersion of observed y values around the regression line and hence the larger the prediction error
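
The slides do not give a formula for the standard error of the estimate. One common definition is the square root of the sum of squared residuals divided by n - 2. The sketch below applies that definition to the body weight and plasma volume pairs from the correlation slides, again only as an illustration with assumed function names.

```python
import math

# Body weight (x) and plasma volume (y) from the Pearson correlation slide.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]


def standard_error_of_estimate(x, y):
    """sqrt(sum of squared residuals / (n - 2)) for the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Two parameters (a and b) were estimated, hence n - 2.
    return math.sqrt(sse / (n - 2))


s = standard_error_of_estimate(weight, plasma)
print(round(s, 3))  # about 0.219
```

The result is in the units of the dependent variable, so a larger value means observed points sit further from the line on average.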

Example

- A fitness gym wishes to assess their clients' body fat. An accurate method of measuring body fat is using an underwater weighing technique. This is not a practical method for the fitness instructors to carry out on the premises. They would like to be able to predict their clients' body fat from other measurements, more easily obtainable from the client.
- 252 men had their body fat and their abdomen circumference measured

Testing hypothesis

- H0: There is no linear relationship between body fat and abdomen circumference in the population
- H1: There is a linear relationship between body fat and abdomen circumference in the population
- Or this can be rephrased as
- H0: Abdomen circumference does not account for any variability in body fat in the population
- H1: Abdomen circumference does account for some of the variability in body fat in the population

Assess whether linear relationship exists

- The scatterplot of body fat versus abdomen circumference indicates that there is a strong positive relationship between the two variables
- Recall that the correlation coefficient was 0.814

Simple linear regression in SPSS

- Analyze
- Regression
- Linear
- The dependent variable is body fat
- The independent variable is abdomen circumference

SPSS linear regression

- R is the correlation between the two variables: 0.814
- R square is the proportion of variability in body fat measurements that can be explained by differences in abdomen circumference: 0.662 or 66.2%
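
With a single independent variable, R square is simply the square of the correlation coefficient. A quick check using the rounded value of r from the output:

```latex
R^2 = r^2 = 0.814^2 \approx 0.663
```

This agrees with the reported 0.662 up to the rounding of r to three decimal places.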

SPSS linear regression

- A statistically significant proportion of the variability in body fat measurements can be attributed to the regression model

SPSS regression equation

- Predicted body fat = constant + B × abdomen circum.
- Predicted body fat = -35.197 + 0.585 × abdomen circum.

Prediction

- How do you use linear regression for prediction?
- The regression equation allows you to predict the value of the dependent variable (Y) for a particular value of the independent variable (X)
- Predicted body fat = -35.197 + 0.585 × abdomen circum.
- What is the predicted body fat content for a man with an abdomen circumference of 100 cm?
- Predicted body fat = -35.197 + 0.585 × 100
- = -35.197 + 58.5
- = 23.3
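
The same arithmetic can be wrapped in a small function so that other circumferences can be tried. The intercept and slope are the SPSS estimates quoted above. The function and constant names are illustrative.

```python
# Coefficients reported in the SPSS output for the body fat example.
INTERCEPT = -35.197
SLOPE = 0.585


def predict_body_fat(abdomen_cm):
    """Predicted body fat for a given abdomen circumference in cm."""
    return INTERCEPT + SLOPE * abdomen_cm


print(round(predict_body_fat(100), 1))  # 23.3
```

Predictions are only trustworthy for circumferences within the range seen in the 252 men used to fit the line. The equation gives negative values for circumferences of about 60 cm or less, for example.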

Assumptions of linear regression

- There should be a linear relationship between the dependent variable and the independent variable
- For any value of the independent variable the dependent variable values should follow a Normal distribution (i.e. normally distributed residuals)
- The variance of the dependent variable values should be the same for all independent variable values.

Checking the assumptions

- After the regression model has been fitted to the data it is essential to check that the assumptions of linear regression have not been violated
- If any of the assumptions have been violated then the regression model is likely to be invalid

Assumptions: Linearity (1)

- Plot the dependent variable against the independent variable
- A linear pattern (sausage shape) should be seen for the linearity assumption to hold
- Assumption satisfied

Assumptions: Linearity (2)

- Plot the residuals against the predicted values
- No curvature in the plot should be seen for the linearity assumption to hold
- Assumption satisfied

Assumptions: Normal residuals (1)

- Normally distributed residuals can be checked by looking at a histogram of the residuals
- Assumption satisfied

Assumptions: Normal residuals (2)

- Normally distributed residuals can be checked by looking at a normal probability plot
- Assumption satisfied

Assumption: Constant variance

- Constant variance of the residuals can be assessed by plotting the residuals against the predicted values
- There should be an even spread of residuals around zero
- Assumption satisfied

Assumption: Constant variance

- This assumption would not be satisfied if the spread of the residuals increased or decreased as the predicted values increase in size
- The plot should show a random scatter of points with no pattern
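
The slides judge this assumption by eye from the plot. As a rough numeric stand-in (not a formal test, and not a method from the slides), the sketch below compares the spread of the residuals for the lower and upper halves of the predicted values. The data are made up so that the scatter widens as x grows.

```python
import statistics


def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residual_spread_by_half(x, y):
    """Standard deviation of residuals for low vs high predicted values."""
    a, b = fit_line(x, y)
    # (predicted, residual) pairs, sorted by predicted value.
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    lower = [res for _, res in pairs[:half]]
    upper = [res for _, res in pairs[half:]]
    return statistics.pstdev(lower), statistics.pstdev(upper)


# Hypothetical data: the scatter around the line grows with x.
x = list(range(1, 11))
y = [2 * xi + (-1) ** xi * 0.2 * xi for xi in x]

low_spread, high_spread = residual_spread_by_half(x, y)
print(round(low_spread, 2), round(high_spread, 2))
```

A clearly larger spread in one half, as in this made-up example, is the numeric counterpart of the funnel shape the slide warns about. Similar values in both halves are what an even spread would give.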

Summary: correlation

- Measures the strength of a linear association between two variables (usually continuous or discrete).
- High positive or negative correlations suggest that two variables are related (but not that one causes the other).
- Looking at scatterplots of the variables is always a good idea.
- Check for common influences such as time or age which may affect both of the variables.

Summary: simple linear regression

- Simple linear regression gives the equation of the straight line that best describes the association between two variables.
- A linear relationship between the dependent variable and the independent variable is required.
- For any value of the independent variable the dependent variable values should follow a Normal distribution.
- The variance of the dependent variable values should be the same for all independent variable values.