Title: Correlation
1Correlation
2Correlation
- Relationship or association between variables
- If two variables are related, knowing something about one of them tells us something about the other
- Ex. The relationship between height and weight
- Correlation coefficient
- A measure of the relationship between variables
- Pearson product-moment correlation coefficient (r)
- The most common correlation coefficient
3Graphing Relationships
- Scatterplot (scatter diagram)
- A figure in which the individual data points are plotted in two-dimensional space (Xi, Yi)
- The coordinates Xi, Yi are the individual's scores on X and Y
[Figure: scatterplot of Y against X]
4Correlation Terms
- The idea with correlation is that we want to be able to predict something about one of the variables by knowing something about the other
- For correlation we call X the predictor variable
- The variable from which a prediction is made
- We call Y the criterion variable
- The variable to be predicted
5Correlation Terms
- Correlations are standardized using the standard deviation
- So they range from -1 to 1
- A correlation of -1 or 1 means perfect prediction or relationship, and a correlation of 0 means no relationship
- The relationships are strongest near the extremes and weakest near zero
- Negative Relationships
- As the values of one variable go up, the other goes down
- Positive Relationships
- As the values of one variable go up, the other goes up
6Strong Positive Relationship
7Hours studying and problems missed
Strong Negative Relationship
[Figure: scatterplot of hours studying and problems missed]
Correlations
                               Hours    Missed
Hours    Pearson Correlation   1.000    -.973
         Sig. (2-tailed)       .        .000
         N                     10       10
Missed   Pearson Correlation   -.973    1.000
         Sig. (2-tailed)       .000     .
         N                     10       10
Correlation is significant at the 0.01 level (2-tailed).
8Hours I studied and problems you missed
No Relation
[Figure: scatterplot of hours studied and problems you missed]
Correlations
                                     Hours    Missed
Hours stud.    Pearson Correlation   1.000    .(a)
               Sig. (2-tailed)       .        .
               N                     10       10
Prob. missed   Pearson Correlation   .(a)     .(a)
               Sig. (2-tailed)       .        .
               N                     10       10
a. Cannot be computed because at least one of the variables is constant.
9Types of Relationships
- The relationship between X and Y can be linear or curvilinear
- We are usually dealing with linear relationships
- Linear relationship
- A situation in which the line that best fits the points (in a scatterplot) is a straight line
- Curvilinear relationship
- A situation in which the line that best fits the points (in a scatterplot) is not a straight line
10Covariance
- Correlations are based on covariation between X and Y
- Covariance is a statistic representing the degree to which 2 variables vary together
- $\mathrm{cov}_{XY} = \dfrac{\sum XY - \dfrac{\sum X \sum Y}{n}}{n - 1}$
- The covariance is based on how far an observation deviates from the mean on EACH variable
- The covariation can be negative
11Example
- Compute the covariance for the following
- Hours Studied   Score on Exam
- 6               90
- 8               95
- 2               70
- 4               80
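One way to work this example is sketched below in Python, using the computational formula for the covariance (sum of XY, minus the product of the sums over n, all divided by n - 1). The function name `covariance` and the variable names are illustrative choices, not part of the slides.

```python
# Covariance of the slide's data via the computational formula.
hours = [6, 8, 2, 4]        # Hours Studied (X)
scores = [90, 95, 70, 80]   # Score on Exam (Y)


def covariance(x, y):
    """Sample covariance: (sum(XY) - sum(X) * sum(Y) / n) / (n - 1)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


cov_xy = covariance(hours, scores)
print(round(cov_xy, 2))  # (1760 - 20 * 335 / 4) / 3 = 85 / 3, about 28.33
```

The positive value says that more hours studied tend to go with higher exam scores, but its size depends on the units of X and Y.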
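12Pearson's r
- Because the covariance depends on the standard deviations of X and Y
- We use the correlation, which is standardized by the standard deviations
- $r = \dfrac{\mathrm{cov}_{XY}}{s_X s_Y}$
- $r = \dfrac{\sum XY - \dfrac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}}$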
13Example
- Compute the correlation using the following data
- Undergrad GPA   GRE Total
- 3.8             2350
- 2.8             1740
- 3.2             2100
- 3.5             2230
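A possible worked solution, sketched in Python with the computational formula for Pearson's r. The helper name `pearson_r` is an assumption made for illustration.

```python
from math import sqrt

gpa = [3.8, 2.8, 3.2, 3.5]      # Undergrad GPA (X)
gre = [2350, 1740, 2100, 2230]  # GRE Total (Y)


def pearson_r(x, y):
    """Pearson's r from raw sums (computational formula)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = sum_xy - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n  # sum of squares for X
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n  # sum of squares for Y
    return numerator / sqrt(ss_x * ss_y)


r = pearson_r(gpa, gre)
print(round(r, 3))  # about 0.977
```

The n - 1 terms in the covariance and the two standard deviations cancel, which is why the formula can be written with sums of squares only.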
14Factors that affect the correlation
- The correlation can be affected by three factors
- Restriction of range in X and Y
- Nonlinearity of the relationship between X and Y
- Heterogeneous sub-samples
15Restriction of range
- Range restrictions
- Cases where the range over which X and Y vary is artificially limited
- Ex. College GPA and SAT scores
- The problem is that usually only people with high SATs are allowed into college, thus restricting the possible values of SAT scores we can use
- We want to know if SAT scores have a relationship with how suitable one is for college, but we are really only answering that question using people who got into college
- May cause r to increase or decrease; normally it reduces r
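The effect can be illustrated with a small simulation. The sketch below is built entirely on made-up, standardized "SAT" and "GPA" scores (not real data), constructed so that the population correlation is about .6. It then keeps only cases with SAT more than 1 standard deviation above the mean, as a stand-in for an admissions cutoff.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical standardized scores; GPA = .6 * SAT + noise.
sat = [rng.gauss(0, 1) for _ in range(5000)]
gpa = [0.6 * s + 0.8 * rng.gauss(0, 1) for s in sat]

# Restrict the range: keep only "admitted" cases with high SAT.
admitted = [(s, g) for s, g in zip(sat, gpa) if s > 1.0]
sat_adm = [s for s, _ in admitted]
gpa_adm = [g for _, g in admitted]

r_full = pearson_r(sat, gpa)
r_restricted = pearson_r(sat_adm, gpa_adm)
print(round(r_full, 2), round(r_restricted, 2))
```

Under these assumptions the restricted correlation should come out noticeably smaller than the full-range one, even though the underlying relationship is the same.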
16Nonlinearity
- A straight line doesn't best fit our data
- Ex. Height with age
- Height goes up with age but only to a certain point; it will level off or decrease thereafter, thus reducing our r (which measures linear relations)
[Figure: plot of height against age]
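A minimal sketch of the height example, using invented numbers: height is assumed to grow by a constant amount per year until age 18 and then stay flat. The growth rate and starting height are hypothetical and chosen only to make the pattern visible.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


ages = list(range(1, 41))
# Hypothetical heights (cm): linear growth up to age 18, then level.
heights = [50 + 7 * min(age, 18) for age in ages]

r_all = pearson_r(ages, heights)                  # ages 1 to 40
r_growth = pearson_r(ages[:18], heights[:18])     # ages 1 to 18 only

print(round(r_growth, 3), round(r_all, 3))
```

Within the growth years the relationship is perfectly linear, so r is 1; over the whole age range the same data give a smaller r because the flat portion cannot be fit by one straight line.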
17Heterogeneous Sub-Samples
- Data from the sample of observations could be subdivided into 2 distinct sets on the basis of some other variable
- Ex. Height and weight of U.S. adults
- A possible sub-group would be males and females
- If we collapse across sex we might get a correlation of .78, but if we were to look at these correlations for males (.60) and females (.49) separately we would find a different pattern
- Be careful when combining data from various sources
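The pattern can be reproduced with a toy data set. The heights and weights below are invented for illustration and do not reproduce the .78, .60, and .49 values quoted above; they only show how pooling two groups with different means can inflate r.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical heights (in) and weights (lb) for two sub-samples.
f_height = [60, 62, 64, 66, 68]
f_weight = [120, 135, 125, 150, 130]
m_height = [68, 70, 72, 74, 76]
m_weight = [170, 190, 175, 200, 180]

r_females = pearson_r(f_height, f_weight)
r_males = pearson_r(m_height, m_weight)
r_pooled = pearson_r(f_height + m_height, f_weight + m_weight)

print(round(r_females, 2), round(r_males, 2), round(r_pooled, 2))
```

With these made-up numbers the within-group correlations are about .48 and .39, while the pooled correlation is about .85, because much of the pooled relationship comes from the difference between the two group means.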
18Not all correlations are meaningful
- Not all results you find are meaningful, even if they are strong
- There is a significant positive correlation between ice cream consumption and the number of deaths due to drowning
- Does this mean ice cream consumption causes drowning? No!
- There is a third variable that is responsible for this relationship: hot weather
- Correlations usually don't explain causation
19Hypothesis testing with r
- Population correlation coefficient rho (ρ)
- The correlation coefficient for the population
- The null hypothesis
- H0: ρ = 0
- The alternative hypothesis
- H1: ρ ≠ 0
- Use Table E.2 to get the critical value
- Use alpha and df (df = n - 2) to get the CV
- If the correlation exceeds the critical value (in absolute value), reject the null
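The decision rule can be sketched in a few lines of Python. The dictionary holds the commonly tabled two-tailed critical values of r at alpha = .05 for df 1 to 10; that Table E.2 prints exactly these digits is an assumption, and the names `CRITICAL_R_05` and `reject_null` are illustrative. The example uses r = -.973 with N = 10 from the hours studying and problems missed output.

```python
# Commonly tabled two-tailed critical values of r at alpha = .05,
# indexed by df = n - 2 (assumed to match Table E.2).
CRITICAL_R_05 = {
    1: 0.997, 2: 0.950, 3: 0.878, 4: 0.811, 5: 0.754,
    6: 0.707, 7: 0.666, 8: 0.632, 9: 0.602, 10: 0.576,
}


def reject_null(r, n, table=CRITICAL_R_05):
    """Reject H0: rho = 0 when |r| exceeds the critical value for df = n - 2."""
    df = n - 2
    return abs(r) > table[df]


decision = reject_null(-0.973, 10)
print(decision)  # True: |r| = .973 exceeds .632 at df = 8
```

The absolute value is what makes the test two-tailed: a strong negative correlation counts as evidence against the null just as a strong positive one does.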
20Example
- We are interested in whether the number of
Friends episodes you have watched is related to
the number of hours you have studied. Test the
hypothesis that these variables are related. Set α = .05
21Intercorrelation Matrix
- A table (matrix) showing the pairwise
correlations between all variables
Correlations
                                 HOURS    ICECREAM   DROWNING   MISSED
HOURS      Pearson Correlation   1.000    1.000      .968       -.973
           Sig. (2-tailed)       .        .000       .000       .000
           N                     10       10         10         10
ICECREAM   Pearson Correlation   1.000    1.000      .968       -.973
           Sig. (2-tailed)       .000     .          .000       .000
           N                     10       10         10         10
DROWNING   Pearson Correlation   .968     .968       1.000      -.908
           Sig. (2-tailed)       .000     .000       .          .000
           N                     10       10         10         10
MISSED     Pearson Correlation   -.973    -.973      -.908      1.000
           Sig. (2-tailed)       .000     .000       .000       .
           N                     10       10         10         10
Correlation is significant at the 0.01 level (2-tailed).
22Correlations with Ranked data
- Data for which the observations have been replaced with their numerical ranks from lowest to highest
- Ex. Rank these applications in terms of acceptability, rank them in terms of resume clarity, and correlate clarity with acceptability
- To correlate ranked data we use Spearman's correlation coefficient for ranked data (rs)
- Also called Spearman's rho
- This is not the best technique for ranked data but it is the most common one
23Computing Spearman's Rho
- To compute the correlation for ranked data, you can use the Pearson formula
- Spearman's rho measures the linearity between the ranks
- Monotonic relationship
- A relationship represented by a line that is continually increasing or decreasing but perhaps not in a straight line
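A short sketch of the idea: replace each score with its rank, then apply the Pearson formula to the ranks. The data are hypothetical (Y is the cube of X, so the relationship is monotonic but not a straight line), and the simple `ranks` helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank from lowest (1) to highest; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # hypothetical: always increasing, but curved

r = pearson_r(x, y)                   # Pearson on the raw scores
rs = pearson_r(ranks(x), ranks(y))    # Spearman: Pearson on the ranks

print(round(r, 3), round(rs, 3))
```

Because the ranks of X and Y line up perfectly, rs is 1 even though the raw scores do not fall on a straight line and r is below 1.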
24Other Correlation Coefficients
- Point biserial correlation (rpb)
- The correlation coefficient when one of the variables is dichotomous and the other is continuous
- Dichotomous variables can only have 2 different values (e.g. Yes/No)
- Compute using the Pearson formula
- Phi (φ)
- The correlation coefficient used when both of the variables are measured as dichotomies
- Compute using the Pearson formula
- See the table on p.164
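As a rough illustration of "compute using the Pearson formula", the sketch below codes two hypothetical Yes/No variables as 1/0 and runs them through an ordinary Pearson calculation to get phi. The variable names and data are invented. The same call would give the point biserial correlation if one of the two lists held continuous scores instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical dichotomous data, coded Yes = 1, No = 0.
passed_test = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
hired = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]

phi = pearson_r(passed_test, hired)
print(round(phi, 2))  # 0.6
```

Which value is coded 1 and which 0 affects only the sign of the coefficient, not its size.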
25Final Example
- The following are the scores of job applicants on a cognitive ability test (CA) and ratings from an interview (IR); both are out of 100
- CA   IR
- 75   80
- 98   96
- 89   87
- 67   72
- Using Pearson's r, test the hypothesis that there is a relationship between cognitive ability and interview rating
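One way to work this example is sketched below. The slide does not state an alpha level, so alpha = .05 (two-tailed) is assumed here, and .950 is the commonly tabled critical value of r for df = 2 at that level.

```python
from math import sqrt

ca = [75, 98, 89, 67]  # cognitive ability test scores
ir = [80, 96, 87, 72]  # interview ratings


def pearson_r(x, y):
    """Pearson's r from deviation scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(ca, ir)
df = len(ca) - 2      # df = n - 2 = 2
critical_r = 0.950    # assumed: two-tailed, alpha = .05, df = 2

reject = abs(r) > critical_r
print(round(r, 3), df, reject)  # r is about .99, so the null is rejected
```

With only four applicants the critical value is very high, so the null is rejected here only because the observed correlation is close to 1.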