1
Correlation and Simple Linear Regression
2
Basics
  • Correlation
  • The linear association between two variables
  • Strength of the relationship is based on how tightly
    points in an X,Y scatterplot cluster about a
    straight line
  • Ranges from -1 to 1; unitless
  • Observations should be quantitative
  • No categorical variables, even if recoded
  • Evaluate a visual scatterplot first
  • Independent samples
  • Correlation does not imply causality
  • Do not assume the linear relationship extends over an
    infinite range
  • H0: there is no linear relationship between the
    2 variables
  • Ha: there is a linear relationship between the 2
    variables

3
Basics
  • Simple Linear Regression
  • Examines the relationship between one predictor
    variable (independent) and a single quantitative
    response variable (dependent)
  • Produces a regression equation used for prediction
  • Assumes normality, equal variances, and independence
  • Fit by the least squares principle
  • Do not extrapolate
  • Analyze the residuals
  • H0: the slope is zero; there is no linear relationship
    between the 2 variables
  • Ha: the slope is not zero; there is a linear relationship
    between the 2 variables

4
Direction of the Correlation Coefficient
  • Positive correlation: indicates that the values
    on the two variables being analyzed move in the
    same direction. That is, as scores on one
    variable go up, scores on the other variable go
    up as well (on average), and vice versa
  • Negative correlation: indicates that the values
    on the two variables being analyzed move in
    opposite directions. That is, as scores on one
    variable go up, scores on the other variable go
    down, and vice versa (on average)

5
Strength or Magnitude of the Relationship
  • Correlation coefficients range in strength from
    -1.00 to 1.00
  • The closer the correlation coefficient is
    to either -1.00 or 1.00, the stronger the
    relationship is between the two variables
  • A perfect positive correlation of 1.00 reveals
    that for every member of the sample or
    population, a higher score on one variable is
    related to a higher score on the other variable
  • A perfect negative correlation of -1.00 indicates
    that for every member of the sample or
    population, a higher score on one variable is
    related to a lower score on the other variable
  • Perfect correlations are never found in actual
    social science research

6
Positive and Negative Correlation
  • Positive and negative correlations are
    represented by scattergrams
  • Scattergrams: graphs that indicate the scores of
    each case in a sample simultaneously on two
    variables
  • r: the symbol for the sample Pearson correlation
    coefficient

The scattergrams presented here represent very
strong positive and negative correlations (r =
0.97 and r = -0.97 for the positive and negative
correlations, respectively)
7
No Correlation
  • No discernible pattern between the scores on the
    two variables
  • We learn that it is virtually impossible to predict an
    individual's test score simply by knowing how
    many hours the person studied for the exam

A scattergram representing virtually no correlation
between the number of hours spent studying and
the scores on the exam is presented
8
Pearson Correlation Coefficients In Depth
  • The first step in understanding how Pearson
    correlation coefficients are calculated is to
    notice that we are concerned with a sample's
    scores on two variables at the same time
  • The data shown are scores on two variables: hours
    spent studying and exam score. These data are
    for a randomly selected sample of five students.
  • To be used in a correlation analysis, it is
    critical that the scores on the two variables are
    paired.

Data for Correlation Coefficient

            Hours Spent Studying (X)   Exam Score (Y)
Student 1   5                          80
Student 2   6                          85
Student 3   7                          70
Student 4   8                          90
Student 5   9                          85
  • Each student's score on the X variable must be
    matched with his or her own score on the Y
    variable
  • Once this is done, we can determine whether,
    on average, hours spent studying is related to
    exam scores

9
Calculating the Correlation Coefficient
Definitional Formula for Pearson Correlation
  • Finding the Pearson correlation coefficient is
    simple when following these steps:
  • Find the z scores on each of the two variables
    being examined for each case in the sample
  • Multiply each individual's z score on one
    variable with that individual's z score on the
    second variable (i.e., find a cross-product)
  • Sum these cross-products across all of the
    individuals in the sample
  • Divide by N

r = Σ(zx zy) / N

where r is the Pearson product-moment correlation
coefficient, zx is a z score for variable X, zy is
the paired z score for variable Y, and N is the
number of pairs of X and Y scores
  • You then have an average standardized cross
    product. If we had not standardized these scores
    we would have produced a covariance.
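Below is a minimal Python sketch of this definitional formula, using the five-student hours/exam data from the table above (NumPy assumed):

  import numpy as np

  hours = np.array([5, 6, 7, 8, 9])        # X: hours spent studying
  scores = np.array([80, 85, 70, 90, 85])  # Y: exam scores

  # Standardize each variable (population SD, since the formula divides by N)
  zx = (hours - hours.mean()) / hours.std()
  zy = (scores - scores.mean()) / scores.std()

  # r is the average of the paired z-score cross-products
  r = (zx * zy).sum() / len(hours)
  print(round(r, 3))  # 0.313, the same value np.corrcoef(hours, scores)[0, 1] gives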

10
Calculating the Correlation Coefficient, Cont.
  • This formula requires that you standardize your
    variables
  • Note: when you standardize a variable, you are
    simply subtracting the mean from each score in
    your sample and dividing by the standard
    deviation
  • What this does is provide a z score for each case
    in the sample
  • Members of the sample with scores below the mean
    will have negative z scores, whereas those
    members of the sample with scores above the mean
    will have positive z scores
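As a quick sketch of what standardization does (Python, using the hours-studied values from the earlier example):

  import numpy as np

  x = np.array([5, 6, 7, 8, 9])
  z = (x - x.mean()) / x.std()  # subtract the mean, divide by the SD
  print(np.round(z, 2))         # [-1.41 -0.71  0.    0.71  1.41]
  # Scores below the mean get negative z scores; scores above it, positive.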

11
What the Correlation Coefficient Does, and Does
Not, Tell Us
  • Correlation coefficients such as the Pearson are
    very powerful statistics. They allow us to
    determine whether, on average, the values on one
    variable are associated with the values on a
    second variable
  • People often confuse the concepts of correlation
    and causation
  • Correlation (co-relation) simply means that
    variation in the scores on one variable
    corresponds with variation in the scores on a
    second variable
  • Causation means that variation in the scores on
    one variable causes or creates variation in the
    scores on a second variable. Correlation does
    not equal causation.

12
Other Important Features of Correlations
  • Simple Pearson correlations are designed to
    examine linear relations among variables. In
    other words, they describe average straight-line
    relations among variables
  • Not all relations between variables are linear
  • As previously mentioned, people often confuse the
    concepts of correlation and causation
  • Example: there is a curvilinear relationship
    between anxiety and performance on a number of
    academic and non-academic behaviors
  • We call this a curvilinear relationship because
    what begins as a positive relationship (between
    performance and anxiety) at lower levels of
    anxiety becomes a negative relationship at
    higher levels of anxiety

13
Caution When Examining Correlation Coefficients
  • The problem of truncated range is another common
    problem that arises when examining correlation
    coefficients. This problem is encountered when
    the scores on one or both of the variables in the
    analysis do not have much variance in the
    distribution of scores, possibly due to a ceiling
    or floor effect
  • The data in the table below show that all of the
    students did well on the test, whether they spent
    many hours studying for it or not
  • The weak correlation that will be produced by the
    data in the table may not reflect the true
    relationship between how much students study and
    how much they learn, because the test was too
    easy. A ceiling effect may have occurred,
    thereby truncating the range of scores on the exam

Data for Studying-Exam Score Correlation

            Hours Spent Studying (X)   Exam Score (Y)
Student 1   0                          95
Student 2   2                          95
Student 3   4                          100
Student 4   7                          95
Student 5   10                         100
14
Statistically Significant Correlations
  • Researchers test whether the correlation
    coefficient is statistically significant
  • To test whether a correlation coefficient is
    statistically significant, the researcher begins
    with the null hypothesis that there is absolutely
    no relationship between the two variables in the
    population, or that the correlation coefficient
    in the population equals zero
  • The alternative hypothesis is that there is, in
    fact, a statistical relationship between the two
    variables in the population, and that the
    population correlation coefficient is not equal
    to zero. So what we are testing here is whether
    our correlation coefficient is statistically
    significantly different from 0
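A sketch of this test in Python, using SciPy's pearsonr, which reports the two-tailed p value for the null hypothesis that the population correlation is zero (data from the five-student example):

  from scipy.stats import pearsonr

  hours = [5, 6, 7, 8, 9]
  scores = [80, 85, 70, 90, 85]

  r, p = pearsonr(hours, scores)
  print(f"r = {r:.3f}, p = {p:.3f}")
  # Reject H0 only if p falls below the chosen alpha (e.g., 0.05);
  # with just five cases, even a moderate r is unlikely to be significant.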

15
The Coefficient of Determination
  • One way to conceptualize explained variance is to
    understand that when two variables are correlated
    with each other, they share a certain percentage
    of their variance
  • See next slide for visual
  • What we want to be able to do with a measure of
    association, like a correlation coefficient, is
    be able to explain some of the variance in the
    scores on one variable with the scores on a
    second variable. The coefficient of
    determination tells us how much of the variance
    in the scores of one variable can be understood,
    or explained, by the scores on a second variable

16
The Coefficient of Determination (continued)
  • Picture the shared variance as two squares, one
    per variable. When the two squares are not touching
    each other, all of the variance in each variable is
    independent of the other variable. There is no overlap
  • The precise percentage of shared, or explained,
    variance can be determined by squaring the
    correlation coefficient. This squared
    correlation coefficient is known as the
    coefficient of determination
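As a one-line illustration (Python), squaring the grade/test-score correlation reported later in this presentation:

  r = 0.4291               # correlation between grades and test scores
  print(round(r ** 2, 4))  # 0.1841: about 18% of the variance is shared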

17
Other Types of Correlation Coefficients
All of these statistics are very similar to the
Pearson correlation and each produces a
correlation coefficient that is similar to the
Pearson r
  • Phi
  • Sometimes researchers want to know whether two
    dichotomous variables are correlated. In this
    case, we would calculate a phi coefficient (φ),
    which is a specialized version of the Pearson r
  • For example, suppose you wanted to know whether
    gender (male, female) was associated with whether
    one smokes cigarettes or not (smoker, non-smoker)
  • In this case, with two dichotomous variables, you
    would calculate a phi coefficient
  • Note: readers familiar with chi-square analysis
    will notice that two dichotomous variables can
    also be analyzed using the chi-square test (see
    Chapter 14)
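A minimal sketch in Python: because phi is just the Pearson r applied to variables coded 0/1, np.corrcoef recovers it. The gender/smoking data below are hypothetical:

  import numpy as np

  gender = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # 0 = male, 1 = female (hypothetical coding)
  smoker = np.array([0, 1, 0, 1, 1, 0, 1, 0])  # 0 = non-smoker, 1 = smoker

  phi = np.corrcoef(gender, smoker)[0, 1]  # Pearson r on two dichotomous variables
  print(round(phi, 3))                     # 0.5 for these made-up data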

18
Other Types of Correlation Coefficients
(continued)
  • Point Biserial
  • When one of our variables is a continuous
    variable (i.e., measured on an interval or ratio
    scale) and the other is a dichotomous variable, we
    need to calculate a point-biserial correlation
    coefficient
  • This coefficient is a specialized version of the
    Pearson correlation coefficient
  • For example, suppose you wanted to know whether
    there is a relationship between whether a person
    owns a car (yes or no) and their score on a
    written test of traffic rule knowledge, such as
    the tests one must pass to get a driver's license
  • In this example, we are examining the relation
    between one categorical variable with two
    categories (whether one owns a car) and one
    continuous variable (one's score on the driver's
    test)
  • Therefore, the point-biserial correlation is the
    appropriate statistic in this instance
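A sketch of the car-ownership example in Python, using SciPy's pointbiserialr; the six cases below are hypothetical:

  from scipy.stats import pointbiserialr

  owns_car = [0, 0, 0, 1, 1, 1]          # dichotomous: 0 = no, 1 = yes
  test_score = [70, 75, 72, 85, 88, 90]  # continuous: written traffic-rules test

  r, p = pointbiserialr(owns_car, test_score)
  print(f"r = {r:.3f}, p = {p:.4f}")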

19
Other Types of Correlation Coefficients
(continued)
  • Spearman Rho
  • Sometimes data are recorded as ranks. Because
    ranks are a form of ordinal data, and the other
    correlation coefficients discussed so far involve
    either continuous (interval, ratio) or
    dichotomous variables, we need a different type
    of statistic to calculate the correlation between
    two variables that use ranked data
  • The Spearman rho is a specialized form of the
    Pearson r that is appropriate for such data
  • For example, many schools use students' grade
    point averages (a continuous scale) to rank
    students (an ordinal scale)
  • In addition, students' scores on standardized
    achievement tests can be ranked
  • To see whether a student's rank in school
    is related to his or her rank on the standardized
    test, a Spearman rho coefficient can be
    calculated.
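A sketch of the school-rank example in Python with SciPy's spearmanr; the six pairs of ranks below are hypothetical:

  from scipy.stats import spearmanr

  class_rank = [1, 2, 3, 4, 5, 6]  # rank by grade point average
  test_rank = [2, 1, 4, 3, 6, 5]   # rank on the standardized test

  rho, p = spearmanr(class_rank, test_rank)
  print(round(rho, 3))  # 0.829 for these ranks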

20
Example The Correlation Between Grades and Test
Scores
  • The correlations on the diagonal show the
    correlation between a single variable and itself.
    Because we always get a correlation of 1.00 when
    we correlate a variable with itself, these
    correlations presented on the diagonal are
    meaningless. That is why there is no p value
    reported for them
  • The numbers in parentheses, just below the
    correlation coefficients, report the sample size.
    There were 314 eleventh-grade students in this
    sample
  • From the correlation coefficient that is off the
    diagonal, we can see that students' grade point
    average (Grade) was moderately correlated with
    their scores on the test (r = 0.4291). This
    correlation is statistically significant, with a
    p value of less than 0.0001 (p < 0.0001)

SPSS Printout of Correlation Analysis

             Grade       Test Score
Grade        1.0000
             (314)
             p = .
Test Score   0.4291      1.0000
             (314)       (314)
             p = 0.000   p = .
21
Example The Correlation Between Grades and Test
Scores, Cont.
  • To gain a clearer understanding of the
    relationship between grades and test scores, we
    can calculate a coefficient of determination. We
    do this by squaring the correlation coefficient.
    When we square this correlation coefficient
    (0.4291 × 0.4291 = 0.1841), we see that grades
    explain a little more than 18% of the
    variance in the test scores

  • Because more than 80 percent of the variance remains
    unexplained, we must conclude that teacher-assigned
    grades reflect something substantially different
    from, and more than, just scores on tests.

(This slide repeats the correlation table from the previous slide.)
22
Regression is Powerful
  • Allows researchers to examine
  • How variables are related to each other
  • The strength of the relations
  • Relative predictive power of several independent
    variables on a dependent variable
  • The unique contribution of one or more
    independent variables when controlling for one or
    more covariates

23
Simple vs. Multiple Regression
  • Simple Regression
  • Simple regression analysis involves a single
    independent, or predictor variable and a single
    dependent, or outcome variable
  • Multiple Regression
  • Multiple regression involves models that have two
    or more predictor variables and a single
    dependent variable

24
Variables Used in Regression
  • The dependent and independent variables need to
    be measured on an interval or ratio scale
  • Dichotomous (i.e., categorical variables with two
    categories) predictor variables can also be used
  • There is a special form of regression analysis,
    logistic (logit) regression, that allows us to examine
    dichotomous dependent variables

25
Benefits of Regression Rather than Correlation
  • Regression analysis yields more information
  • The regression equation allows us to think about
    the relation between the two variables of
    interest in a more intuitive way, using the
    original scales of measurement rather than
    converting to standardized scores
  • Regression analysis yields a formula for
    calculating the predicted value of one variable
    when we know the actual value of the second
    variable

26
Simple Linear Regression
  • Assumes the two variables are linearly related
  • In other words, if the two variables are actually
    related to each other, we assume that every time
    there is an increase of a given size in value on
    the X variable (called the predictor or
    independent variable), there is a corresponding
    increase (if there is a positive correlation) or
    decrease (if there is a negative correlation) of
    a specific size in the Y variable (called the
    dependent, or outcome, or criterion variable)

27
Regression Equation Used to Find the Predicted
Value of Y
Ŷ = a + bX

where Ŷ is the predicted value of the Y variable,
b is the unstandardized regression coefficient, or
the slope, and a is the intercept (i.e., the point
where the regression line intercepts the Y axis;
this is also the predicted value of Y when X is zero)
28
Example of Simple Linear Regression
  • Is there a relationship between the amount of
    education people have and their monthly income?

          Education Level (X), years   Monthly Income (Y), $ thousands
Case 1     6                            1
Case 2     8                            1.5
Case 3    11                            1
Case 4    12                            2
Case 5    12                            4
Case 6    13                            2.5
Case 7    14                            5
Case 8    16                            6
Case 9    16                           10
Case 10   21                            8

Mean                  12.9              4.1
Standard deviation     4.25             3.12
Correlation coefficient r = 0.83
29
Example of Simple Linear Regression (continued)
Scatterplot for education and income
  • With the data provided in the table, we can
    calculate a regression. The regression equation
    allows us to do two things
  • find predicted values for the Y variable for any
    given value of the X variable
  • produce the regression line
  • The regression line is the basis for linear
    regression and can help us understand how
    regression works

30
Ordinary Least Squares Regression (OLS)
  • OLS is the most commonly used regression method
  • It is based on an idea that we have seen before:
    the sum of squares
  • To do OLS, find the line of least squares (i.e.,
    the straight line that produces the smallest sum
    of squared deviations from the line)

Sum of squares = Σ(observed value - predicted value)²
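A small Python sketch of the least-squares idea, using the education/income data from the earlier table: the OLS line produces a smaller sum of squared deviations than any other straight line.

  import numpy as np

  x = np.array([6, 8, 11, 12, 12, 13, 14, 16, 16, 21])  # years of education
  y = np.array([1, 1.5, 1, 2, 4, 2.5, 5, 6, 10, 8])     # income, $ thousands

  def sum_of_squares(b, a):
      # sum of squared deviations of observed y from the line a + b*x
      return float(((y - (a + b * x)) ** 2).sum())

  b_ols, a_ols = np.polyfit(x, y, 1)  # least-squares slope and intercept
  print(round(sum_of_squares(b_ols, a_ols), 2))        # the minimum
  print(round(sum_of_squares(b_ols + 0.2, a_ols), 2))  # any other line does worse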
31
Formula for Calculating Regression Coefficient (b)
b = r (sy / sx)

where b is the regression coefficient, r is the
correlation between the X and Y variables, sy is
the standard deviation of the Y variable, and sx is
the standard deviation of the X variable
32
Formula for Calculating the Intercept (a)

a = Ȳ - bX̄

where Ȳ is the average value of Y, X̄ is the average
value of X, and b is the regression coefficient
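Putting the two formulas together in Python, with the summary statistics from the education/income example (r = 0.83, sx = 4.25, sy = 3.12, means 12.9 and 4.1):

  r, s_y, s_x = 0.83, 3.12, 4.25   # correlation and standard deviations
  mean_y, mean_x = 4.1, 12.9       # sample means of Y and X

  b = r * (s_y / s_x)              # slope: b = r * (sy / sx)
  a = mean_y - b * mean_x          # intercept: a = mean(Y) - b * mean(X)
  print(round(b, 2), round(a, 2))  # 0.61 -3.76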
33
Error in Predictions
  • The regression equation does not calculate the
    actual value of Y. It can only make predictions
    about the value of Y. So error (e) is bound to
    occur.
  • Error is the difference between the actual, or
    observed, value of Y and the predicted value of Y
  • To calculate error, use one of two equations

e = Y - Ŷ    or    e = Y - (a + bX)

where Y is the actual, or observed, value of Y and
Ŷ is the predicted value of Y
34
Two Regression Equations
  • For the predicted value of Y: Ŷ = a + bX
  • For the actual / observed value of Y, which takes
    error (e) into account: Y = a + bX + e
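A short Python sketch of both equations on the education/income data: fit the line, compute predicted values, and recover each case's error (residual):

  import numpy as np

  x = np.array([6, 8, 11, 12, 12, 13, 14, 16, 16, 21])
  y = np.array([1, 1.5, 1, 2, 4, 2.5, 5, 6, 10, 8])

  b, a = np.polyfit(x, y, 1)  # least-squares slope and intercept
  y_hat = a + b * x           # predicted values of Y
  e = y - y_hat               # residuals; note y == y_hat + e exactly
  print(np.round(e, 2))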
35
Wrapping Words Around the Regression Coefficient
  • Example Is there a relationship between the
    amount of education people have and their monthly
    income?
  • For every one-unit increase in X, there is a
    corresponding predicted increase of 0.61 units in
    Y
  • OR
  • For every additional year of education, we would
    predict an increase of 0.61 × $1,000, or $610, in
    monthly income

36
Finding Predicted Values of Y at Given Values of
X
  • Example: what would we predict the monthly income
    to be for a person with 9 years of formal
    education?
  • Ŷ = a + bX = -3.76 + 0.61(9) ≈ 1.73
  • So we would predict that a person with 9 years of
    education would make about $1,730 per month, plus or
    minus our error in prediction (e)
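The same prediction as a two-line Python check, reusing the rounded slope and intercept derived earlier:

  b, a = 0.61, -3.76          # slope and intercept from the example
  print(round(a + b * 9, 2))  # 1.73 -> about $1,730 per month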

37
Drawing the Regression Line
  • To do this we need to calculate two points on the
    line: find the predicted value of Y at two
    different values of X, plot both points, and
    connect them with a straight line

38
The Regression Line is Not Perfect
  • The regression line does not always accurately
    predict the actual Y values
  • In some cases there is a little error, and in
    other cases there is a larger error
  • Residuals: errors in prediction
  • In some cases, our predicted value is greater
    than our observed value.
  • Overprediction: observed values of Y at given
    values of X fall below the predicted values
    of Y. This produces negative residuals.
  • Sometimes our predicted value is less than our
    observed value
  • Underprediction: observed values of Y at given
    values of X fall above the predicted values
    of Y. This produces positive residuals.