Chapter 9: More about Correlation - PowerPoint PPT Presentation

Provided by: University354


1
Chapter 9 More about Correlation
2
Features of the correlation coefficient
  • r is a pure number as it is based on
    standard units
  • it is not scale dependent and is not
    affected by changes of scale
  • multiplying all numbers in a list by a constant

x                             1   3   4   5   7
c                             2   2   2   2   2
x × c                         2   6   8  10  14
squared deviations of x       9   1   0   1   9
squared deviations of x × c  36   4   0   4  36
  • the average is simply multiplied by the constant
  • so are the deviations from the average
  • so is the SD
  • r remains the same

3
Features of the correlation coefficient
  • r is a pure number as it is based on
    standard units
  • it is not scale dependent and is not
    affected by changes of scale
  • adding a constant to all numbers in a list

x                             1   3   4   5   7
c                             2   2   2   2   2
x + c                         3   5   6   7   9
squared deviations of x       9   1   0   1   9
squared deviations of x + c   9   1   0   1   9
  • the constant is simply added to the average
  • so the deviations from the average do not change
  • neither does the SD
  • r remains the same
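
The two scale properties above can be checked numerically. The sketch below is only an illustration: the list x is the one from the slides, but the second list y is a hypothetical set of values invented for the check, and the helper name correlation is an assumption. It computes r as the average product of the two lists in standard units.

```python
from math import sqrt

def correlation(xs, ys):
    """r = average of the products of the values in standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

x = [1, 3, 4, 5, 7]   # the list used on the slides
y = [2, 3, 3, 6, 6]   # hypothetical second list, for illustration only
c = 2                 # the constant used on the slides

r_original = correlation(x, y)
r_scaled = correlation([xi * c for xi in x], y)    # multiply every x by c
r_shifted = correlation([xi + c for xi in x], y)   # add c to every x

# All three values agree (up to floating-point rounding).
print(r_original, r_scaled, r_shifted)
```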

4
Features of the correlation coefficient
  • rXY = rYX
  • r measures clustering in relative terms -
    relative to the SD (OH Fig 3, p. 145)
  • r is particularly useful for these kinds of
    scatter diagrams

r measures linear association only
  • r is not useful for other kinds of scatter
    diagrams

Nonlinear association
r 1.0
outliers
r 0.4
5
Ecological Correlations
  • correlations based on rates/averages tend to
    overstate the strength of an association
  • Example: Current Population Survey data
  • r between income and education for individual
    US men (ages 25 - 54) = 0.44
  • compute an average education level and income
    level for each state
  • compute the correlation between the 51 pairs
    of averages: r = 0.64
  • OH (Fig 6, p 149)
  • Note: r = 0.64 > r = 0.44
  • this r of the state averages overstates the
    r for the individual men
  • within each state there is a lot of
    variability around the averages
  • this spread is ignored/left out by the
    averages
  • so the averages suggest a stronger relationship
    than holds for individuals
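
The effect can be reproduced with made-up data. The sketch below does not use the Current Population Survey figures: the group size, the spreads and the 0.5 link between the two variables are all hypothetical values chosen only to illustrate the point. It compares r computed over individuals with r computed over the 51 group averages.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """r = average of the products of the values in standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

random.seed(0)            # fixed seed so the sketch is repeatable

N_GROUPS = 51             # one group per pair of averages, as on the slide
PEOPLE_PER_GROUP = 200    # hypothetical group size

education, income = [], []
avg_education, avg_income = [], []
for _ in range(N_GROUPS):
    group_level = random.gauss(0, 1)   # hypothetical group-to-group difference
    # Individuals vary a lot around their group's level.
    edu = [group_level + random.gauss(0, 2) for _ in range(PEOPLE_PER_GROUP)]
    inc = [0.5 * e + random.gauss(0, 2) for e in edu]
    education += edu
    income += inc
    avg_education.append(sum(edu) / PEOPLE_PER_GROUP)
    avg_income.append(sum(inc) / PEOPLE_PER_GROUP)

r_individuals = correlation(education, income)
r_averages = correlation(avg_education, avg_income)

# Averaging removes the within-group spread, so r_averages comes out larger.
print(round(r_individuals, 2), round(r_averages, 2))
```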

6
Association versus causation
Association is NOT causation (correlation is not
causation)
[Diagram: Shoe size <--> Reading skill ("is related to"); Age --> Shoe size; Age --> Reading skill]
Note: the double-headed arrow indicates
correlation; there is no causation - neither
one causes the other
The single-headed arrows from age indicate
causation: increased age causes larger shoe
size and better reading skills
7
Chapter 10 Regression
8
Introduction to Regression
Regression describes how one variable depends on
another.
HANES data: average height/weight for
men (18-24; OH, fig 1, p159)
Note:
Average height about 70 inches, SD = 3 inches
Average weight about 162 pounds, SD = 30 pounds
Correlation (r) = 0.47
The horizontal/vertical scales are such that 1 SD
on both covers the same amount of space, so the SD
line (dashed line) rises at 45 degrees
The slope of the SD line = (SD of y)/(SD of x) (a one to
one ratio in SD units)
The vertical strip shows those who were 1
SD above the average on height: most points in
the strip are below the SD line, very few are 1
SD above average on weight - as r = 0.47
So weight does not increase by 1
SD for each SD increase in height: for each SD
increase in height, weight increases by 0.47 of an
SD on average
9
Introduction to Regression
Note:
Average height about 70 inches, SD = 3 inches
Average weight about 162 pounds, SD = 30 pounds
Correlation (r) = 0.47
For each SD increase in height, weight increases by 0.47 SDs on average
1 SD above average on height:
  average height + SD = 70 + 3 = 73 inches
  corresponds to a weight of average weight + r(SD)
  estimate of their average weight = 162 + (0.47)(30) = about 176 pounds
2 SDs above average on height:
  average height + 2 SDs = 70 + 2(3) = 76 inches
  corresponds to a weight of average weight + r(2 SDs)
  estimate of their average weight = 162 + (0.47)(60) = about 190 pounds
1.5 SDs below average on height:
  average height - 1.5 SDs = 70 - 1.5(3) = 65.5 inches
  corresponds to a weight of average weight - r(1.5 SDs)
  estimate of their average weight = 162 - (0.47)(45) = about 141 pounds
All these points fall on the solid line - known
as the regression line
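
The three estimates above follow one recipe: convert the height to standard units, multiply by r, and convert back to pounds. A minimal sketch of that recipe, using the averages, SDs and r quoted on the slide (the function name regression_estimate is only an illustrative choice):

```python
def regression_estimate(x, avg_x, sd_x, avg_y, sd_y, r):
    """Estimate the average y for a given x by the regression method."""
    z_x = (x - avg_x) / sd_x          # how many SDs x is from its average
    return avg_y + r * z_x * sd_y     # move r times as many SDs in y

# Height/weight figures from the slide.
for height in (73, 76, 65.5):
    weight = regression_estimate(height, avg_x=70, sd_x=3,
                                 avg_y=162, sd_y=30, r=0.47)
    print(height, round(weight))
# prints:
# 73 176
# 76 190
# 65.5 141
```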
10
Points to note about the regression line
  • The regression line for y on x estimates the
    average value for y corresponding to
    each value of x: we regress y onto x
  • The regression line goes through the point
    of averages, as men of average height
    should also be of average weight
  • Along the regression line, associated with
    each SD increase in height, weight
    increases only by 0.47 SDs
  • imagine that we group the men by height
  • one group is of average height, the next group is 1
    SD above average, etc.
  • from each group to the next, weight also goes
    up
  • but not by 1 SD, rather by r × SD
  • Using r to estimate the average value of y
    for each x is called the regression
    method
  • The regression method can be stated as
    follows: associated with each SD
    increase in x, there is an increase of
    only r SDs in y, on average

11
Regression Method
as x increases by 1 SD, y only increases r SDs
on average
Why is r the right factor?
Suppose that x and y are not related in any way:
r = 0
A 1 SD increase in x is
associated with NO SYSTEMATIC change in y, thus
a 0 SD increase/decrease in y
12
Regression Method
as x increases by 1 SD, y only increases r SDs
on average
if r = 1.0,
as x increases by 1 SD
y increases by 1 SD
A 1 SD increase in x is associated with a 1 SD
increase in y: regression line = the SD line
[Figure: the line through the point of averages, rising 1 y SD for each 1 x SD]
13
Regression Method
Suppose that x and y are NOT perfectly positively
related, such that 0 < r < 1.0
as x increases by 1 SD
y increases by r times its SD
[Figure: the SD line (which would be the regression line if r = 1.0) and the regression line, both through the point of averages; for each 1 x SD the regression line rises r × (y SD)]
The regression line does not equal the SD
line because r < 1.0
14
Regression Method
when r is negative: as x increases by 1 SD, y decreases by |r| SDs
on average
if r = -1.0,
as x increases by 1 SD
y decreases by 1 SD
A 1 SD increase in x is associated with a 1 SD
decrease in y: regression line = the SD line
[Figure: the line through the point of averages, falling 1 y SD for each 1 x SD]
15
Regression Method
Suppose that x and y are NOT perfectly negatively
related, such that -1.0 < r < 0
as x increases by 1 SD
y decreases by |r| times its SD
[Figure: the SD line (which would be the regression line if r = -1.0) and the regression line, both through the point of averages; for each 1 x SD the regression line changes by r × (y SD)]
The regression line does not equal the SD line
because r > -1.0
16
Chapter 10 Regression The Graph of Averages
HANES data average height/weight for men (18-24
OH, fig 3, p162)
Graph of the average weight for men at each
height
Note: the men were chosen at random for the
sample
the numbers indicate the number of men in
each group
there are larger groups in the middle
of the scattergram: their averages fall close to
the regression line
there are smaller groups at
either end of the scattergram: their averages
fall farther from the regression line; this is due
to chance variability
  • The regression line is a smoothed version of
    the graph of averages
  • if the graph of averages is a straight line it
    equals the regression line

17
The regression line is a smoothed version of the
graph of averages if the graph of averages is a
straight line it equals the regression line
This smoothing effect is useful when studying
linear association
This smoothing fails to accurately
describe the data when there is non-linear
association: here the graph of averages should be
used instead of the regression line
18
The Regression Method for Individuals
HANES data: average height/weight for men
(18-24)
Note:
Average height about 70 inches, SD = 3 inches
Average weight about 162 pounds, SD = 30 pounds
Correlation (r) = 0.47
One man is picked at random from the sample;
without knowing anything about him, you are asked
for his weight. Best guess: the total group
average = 162 pounds
Now you are told his height =
64 inches (i.e., 2 SDs below average), so he is
short and likely to weigh less than the average
So your best guess here is the average for the
64 inch men group; estimate this via the
regression method: average + r(-2 SDs) =
162 + (0.47)(-60) = about 134 pounds
NOTE: this method only applies for linear
association
19
Example
University data about SAT math scores and
first-year GPAs
Average SAT = 550, SD = 80
Average GPA = 2.60, SD = 0.60
r = 0.40
Scattergram indicates linear association
One student is picked at random from the sample;
without knowing anything about her, you are asked
for her first-year GPA. Best guess: the total
group average = 2.60
Now you are told her SAT =
710, so your best guess here is the average for
the 710 SAT group
First need to find out how far
her score is from the average in SD units:
z = (score - average)/SD = (710 - 550)/80 = 2
Now estimate her GPA via the
regression method: average + r(2 SDs) = 2.60 +
(0.40)(2)(0.60) = 2.60 + (0.40)(1.20) = 3.08
20
Another example
University data about SAT math scores and
first-year GPAs
Average SAT = 550, SD = 80
Average GPA = 2.60, SD = 0.60
r = 0.40
Another student is picked at random from the
sample, and you are told his SAT = 500, so your
best guess is the average for the 500 SAT
group
First need to find out how far his score is
from the average in SD units:
z = (score - average)/SD = (500 - 550)/80 = -0.625, or 0.625 SDs
below average
Estimate the student's GPA
via the regression method: average +
r(-0.625 SDs) = 2.60 +
(0.40)(-0.625)(0.60) = 2.60 + (0.40)(-0.375)
= 2.60 + (-0.15) = 2.45
The regression method predicts for an individual the
average score on y for those individuals who are
all in the same group (x value) on x - the
conditional average (fig 3, p 162)
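
The same estimates can be obtained without passing through z scores, by rewriting the regression method in original units: the line has slope r × (SD of y)/(SD of x) and passes through the point of averages. This form does not appear on the slides; it is an algebraically equivalent restatement, and the names regression_line and predict_gpa below are illustrative assumptions.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Slope and intercept of the regression line of y on x."""
    slope = r * sd_y / sd_x                # change in y per unit of x
    intercept = avg_y - slope * avg_x      # line passes through the point of averages
    return slope, intercept

# SAT/GPA figures from the slides.
slope, intercept = regression_line(avg_x=550, sd_x=80,
                                   avg_y=2.60, sd_y=0.60, r=0.40)

def predict_gpa(sat):
    """Estimated average first-year GPA for students with this SAT score."""
    return intercept + slope * sat

print(round(predict_gpa(710), 2))   # 3.08
print(round(predict_gpa(500), 2))   # 2.45
```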
21
Using the Regression Method for Individuals
Data are collected in a study and regression
estimates are calculated; these estimates are
then used to predict the performance, scores etc.
of new people, in other words, of people they
do not yet have data on.
For example, SAT scores
are used to predict college GPA
The college has SAT data on all the applicants,
but GPA data only for admitted students
The r between SAT
and GPA for the admitted students is obtained and
is used to predict how all new applicants will
perform in college
This works fine IF the group
of people upon whom the r is based is
representative of the people to whom you are
extrapolating the prediction
In the college
example, r is based on admitted students, who tend
to have the highest SATs; the issue here is
how valid the obtained r is for
predicting the performance of those students who were
denied admission
22
Using the Regression Method to predict percentile
ranks
Joe's percentile rank on the SAT is 95 among
first-year students; predict Joe's percentile
rank on first-year GPA
We are able to use the regression method to
answer this question because
1. The scattergram is football shaped
2. Both SATs and GPAs are normally distributed
To predict Joe's percentile rank on the GPA using
his percentile rank on the SAT, we need to
1. Convert his SAT percentile rank to a z score
(tells us where his score falls in relation to
the average)
2. Use this to find his z score on
the GPA and predict his percentile rank on the
GPA via the regression method
23
Joe's SAT percentile rank is 95; predict his
GPA percentile rank
He is above average on the SATs - but by how
many SDs? As the SATs follow the normal
curve, his percentile rank has this info:
only 5% of the scores fell to the right of his
score
[Figure: normal curve with 5% in each tail and 90% in the middle]
Recall: table A105 gives the area between 2
scores (negative to positive, -z to +z)
So the area to the left of the other score (-z) = 5%
The area between the two scores here is 100% - 5% -
5% = 90%
Table A105: standard score corresponding to an
area of 90% = 1.65
Joe scored 1.65 SDs above the average on the SAT; the
regression method predicts then that he will be
r × 1.65 = 0.40(1.65) = 0.66 SDs above
average on GPA
Now we use this to find the
percentile rank for Joe on the GPA (Table
A105): Joe's GPA percentile rank is about 75
24
Example 2: Moe's SAT percentile is 22; predict
his GPA percentile
He is below average on the SATs - but by how
many SDs? As the SATs follow the normal
curve, his percentile rank has this info:
only 22% of the scores fell to the left of his
score
[Figure: normal curve with 22% in each tail and 56% in the middle]
Recall: table A105 gives the area between 2
scores (negative to positive, -z to +z)
So the area to the right of the other score (+z) = 22%
The area between the two scores here is 100% - 22%
- 22% = 56%
Table A105: standard score corresponding to an
area of about 56% = about -0.80
Moe scored 0.80 SDs below the average on the SAT. The
regression method predicts then that he will be
r × (-0.80) = 0.40(-0.80) = -0.32, i.e. 0.32 SDs
below average on GPA
Thus Moe's GPA
percentile rank is about 38
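
The two table look-ups can be mimicked in code. The sketch below replaces the printed normal table with Python's statistics.NormalDist, so its results differ slightly from the rounded table values (z of 1.65 and -0.80) used on the slides; the function name predict_percentile is an illustrative assumption.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # normal curve in standard units: mean 0, SD 1

def predict_percentile(percentile_x, r):
    """Predict the percentile rank on y from the percentile rank on x.

    Assumes both variables follow the normal curve and the scatter
    diagram is football shaped, as stated on the slides.
    """
    z_x = STANDARD_NORMAL.inv_cdf(percentile_x / 100)   # percentile -> z score
    z_y = r * z_x                                       # regression method
    return 100 * STANDARD_NORMAL.cdf(z_y)               # z score -> percentile

joe = predict_percentile(95, r=0.40)
moe = predict_percentile(22, r=0.40)

# Both land within one percentile point of the table-based answers (75 and 38).
print(joe, moe)
```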
25
Points to note
  • We did not use the averages and SDs of
    the GPA and SAT scores in solving that problem
    - it was all done in standard units
  • Joe was compared with students in 2 different
    evaluations. SATs: he did extremely
    well (95th percentile). GPA: predicted to do well, but
    not as well (about the 75th percentile)
  • Moe was compared with students in 2 different
    evaluations. SATs: he did
    poorly (22nd percentile). GPA: predicted to do poorly,
    but not as poorly (about the 38th percentile)
  • NOTE: both their predicted scores moved towards
    the mean
  • If the SAT/GPA r = 1.0, we would predict their GPA
    rank to equal their SAT rank; here r = 0.40, so
    they are less than perfectly correlated - we
    cannot do that
  • If r between SAT and GPA = 0.0, then our best
    prediction would be the median

26
The Regression Fallacy
Preschool program to boost kids' IQs
IQs are tested when they enter and leave the program
For both tests: averages = 100, SDs = 15
Appears that the program has no effect
But just like Joe and Moe:
kids above average on the first test,
while still above average on the second test,
scored lower than on the first (lost about
5 pts)
kids below average on the first test,
while still below average on the second test,
scored higher than on the first (gained about
5 pts)
Does this prove anything? No - nothing
much is going on here - people's scores on two
tests will vary to some degree - chance
variability
But the variability causes the
football shaped scatter on the graph, and the
spread around the line brings the lower group up
and the top group down
Called regression to the mean
27
Regression to the mean
  • occurs in nearly all test-retest situations
  • The regression fallacy
  • thinking that this effect is due to something
    important it is simply due to the variability
    around the regression line
  • Why this occurs: OH fig 5, p 171
  • 1078 pairs of heights - fathers and
    sons. Fathers: average = 68 inches, SD = 2.7 inches. Sons:
    average = 69 inches, SD = 2.7 inches. r = 0.5
  • Note: on average sons are 1 inch taller than
    fathers - so we would guess that sons will be 1
    inch taller than their fathers - these pairs are
    plotted on the dashed line, which 1. goes through
    the point of averages 2. rises 1 SD in
    sons' height for each 1 SD increase in fathers'
    height: the SD line
  • Note that there is a lot of spread around the
    line
  • some sons are taller than their fathers, some
    are shorter

28
Regression to the mean
Look at the 72 inch fathers: there is a range
in the sons' heights
some are taller than
73 inches - they are above the SD line
most are
shorter than 73 inches - they are below the SD
line
On average the sons of 72 inch fathers are
only 71 inches tall
Look at the 64 inch fathers:
again there is a range in the sons'
heights
some are shorter than 65 inches - they
are below the SD line
most are taller than 65
inches - they are above the SD line
On average the
sons of 64 inch fathers are 67 inches
tall
Note: the dashed line = the SD
line
The data points form a symmetric
football shaped cloud around it, BUT not in the strips at
72 inches - contains unusually small y
coordinates
64 inches - contains unusually big
y coordinates
This imbalance is always there in
football-shaped clouds
29
Regression to the mean
Note: the dashed SD line goes through the
point of averages, rising by 1 SD in sons' height
for each SD increase in fathers' height
The solid
regression line also goes through
the point of averages, as well as the average
y-value for each vertical strip of
x-values
Recall: fathers' average = 68, SD =
2.7; sons' average = 69, SD = 2.7; r = 0.5
and std score = (score - avg.)/SD
For example, 72 inch tall fathers are 4 inches
taller than the average: 4 inches = 4/2.7 = about 1.5
SDs above the average
The regression line
predicts that the sons will be taller than
average by r(1.5 SDs) = 0.5(4.05) = about 2 inches,
which is 69 + 2 = 71 inches
On the previous OH we
showed that the average height of the sons of 72
inch fathers is only 71 inches - the
prediction is right on
30
Regression to the mean OH Figure 6 , p 172
The regression and SD lines for the data: the
regression line is less steep than the SD line
the SD line rises at a 45 degree angle; its
slope is (SD of y)/(SD of x), a one to one
ratio: 1 SD increase in fathers' height = 1 SD
increase in sons' height
the regression line has half the slope of the SD line
because r = 0.50 (an angle of about 27 degrees,
between the horizontal line and the SD line): 1 SD increase
in fathers' height = 0.50 SD increase in sons'
height
the regression line tracks the graph of
averages quite well
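
The relation between the two lines can be checked with a few lines of arithmetic. The sketch below assumes the plot is drawn in standard units, so that the SD line has slope 1; under that assumption the regression line has slope r, and halving the slope does not halve the angle. It also repeats the prediction for the sons of 72 inch fathers using the averages and SDs quoted on the slides.

```python
import math

r = 0.50

# In standard units the SD line has slope 1 and the regression line slope r.
sd_line_slope = 1.0
regression_slope = r * sd_line_slope

sd_line_angle = math.degrees(math.atan(sd_line_slope))        # 45 degrees
regression_angle = math.degrees(math.atan(regression_slope))  # about 26.6 degrees
print(round(sd_line_angle, 1), round(regression_angle, 1))

# Fathers: average 68, SD 2.7. Sons: average 69, SD 2.7 (from the slides).
father_height = 72
z_father = (father_height - 68) / 2.7          # about 1.5 SDs above average
predicted_son = 69 + r * z_father * 2.7        # regression line
sd_line_son = 69 + z_father * 2.7              # SD line
print(round(predicted_son, 1), round(sd_line_son, 1))   # 71.0 73.0
```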
31
Another perspective on regression to the mean
Example: repeated IQ test scores, normal
distribution, average of 100, SD of 15
Scores are bound to be different, as many factors
influence test performance: amount of
sleep, concentration, motivation, luck etc. Such
differences can be explained in terms of chance
variability
Recall the basic equation: observed score = true
score + chance error
Chance error: unsystematic
(not bias), as likely to be positive as it is
negative; assume it is about 5 points in size
Observed score of 130:
could be a true score of 125 or 135
[Figure: normal curve marking the observed score 130 between true scores 125 and 135]
32
Another perspective on regression to the mean
Example: repeated IQ test scores, normal
distribution, average of 100, SD of 15
Note: if observed score = 130, 2 possible
explanations
1. True score 135 with -5 chance
error
2. True score 125 with +5 chance
error
It is more likely that the true score is
125 with a positive chance error, because more
people have true scores of 125 than 135
This accounts for regression to the mean: score
above average on first test, true score is
probably a bit lower than the observed score,
predict second score will be lower than the first
(i.e., closer to the true score)
[Figure: normal curve marking the observed score 130 between true scores 125 and 135]
33
Another perspective on regression to the mean
Example: repeated IQ test scores, normal
distribution, average of 100, SD of 15
Note: if observed score = 70, 2 possible
explanations
1. True score 75 with -5 chance
error
2. True score 65 with +5 chance
error
It is more likely that the true score is
75 with a negative chance error, because more
people have true scores of 75 than 65
This accounts for regression to the mean: score
below average on first test, true score is
probably a bit higher than the observed score,
predict second score will be higher than the
first (i.e., closer to the true score)
[Figure: normal curve marking the observed score 70 between true scores 65 and 75]
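
The true score plus chance error argument can be tried out in a small simulation. Everything in the sketch below is a hypothetical set-up, not data from the slides: true scores are drawn from a normal curve, each test adds an independent normally distributed chance error with an SD of 5 points, and the true-score SD is chosen so that observed scores have an SD of 15. No program or treatment is applied between the two tests.

```python
import math
import random

random.seed(1)                            # fixed seed so the sketch is repeatable

ERROR_SD = 5                              # chance error of about 5 points
TRUE_SD = math.sqrt(15**2 - ERROR_SD**2)  # so that observed scores have SD 15
N_PEOPLE = 100_000                        # hypothetical number of test takers

first, second = [], []
for _ in range(N_PEOPLE):
    true_score = random.gauss(100, TRUE_SD)
    first.append(true_score + random.gauss(0, ERROR_SD))    # test 1
    second.append(true_score + random.gauss(0, ERROR_SD))   # test 2

def average_scores(low, high):
    """Average first and second score of people whose first score is in [low, high)."""
    pairs = [(a, b) for a, b in zip(first, second) if low <= a < high]
    avg_first = sum(a for a, _ in pairs) / len(pairs)
    avg_second = sum(b for _, b in pairs) / len(pairs)
    return avg_first, avg_second

high_first, high_second = average_scores(125, 135)   # scored around 130 on test 1
low_first, low_second = average_scores(65, 75)       # scored around 70 on test 1

# The high group averages lower on the retest, the low group higher.
print(round(high_first, 1), round(high_second, 1))
print(round(low_first, 1), round(low_second, 1))
```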
34
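Two regression lines, as we can regress
weight on height
OR
height on weight
[Figure: two scatter diagrams, one with height on the horizontal axis and weight on the vertical axis, the other with weight on the horizontal axis and height on the vertical axis]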
35
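Two regression lines, as we can regress
weight on height
OR
height on weight
Assume r between height and weight = .50
Weight on height: estimates avg. weight for each height; dots
mark the avg. weight for each value of height
Height on weight: estimates avg. height for each weight; dots
mark the avg. height for each value of weight
[Figure: two scatter diagrams, one with height on the horizontal axis and weight on the vertical axis, the other with weight on the horizontal axis and height on the vertical axis]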
36
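r between husband's and wife's IQ = .50
regress wife's IQ onto husband's IQ: husbands with IQs
of 140 have wives with IQs of 120 on average
Q: is the avg. IQ of husbands whose wives' IQ
is 120 greater than 120? For this, regress husband's IQ
onto wife's IQ
[Figure: scatter diagram with husband's IQ on the horizontal axis and wife's IQ on the vertical axis, showing the SD line, the regression line for wife's IQ on husband's IQ, the regression line for husband's IQ on wife's IQ, a vertical strip at husband's IQ = 140 and a horizontal strip at wife's IQ = 120]
The average x-coordinate (husband's IQ) of the points in the strip
wife's IQ = 120 is about 110, so the answer is no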
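
Both numbers in this example come out of the regression method once it is applied in each direction. The sketch below assumes that husbands' and wives' IQs both have an average of 100 and an SD of 15; the slide does not state these values, but they are the usual IQ scale and are consistent with the 140 to 120 figure. The function name regress is an illustrative choice.

```python
AVG_IQ = 100   # assumed average for both husbands and wives
SD_IQ = 15     # assumed SD for both husbands and wives
r = 0.50       # correlation from the slide

def regress(known_iq, r, avg=AVG_IQ, sd=SD_IQ):
    """Regression estimate of the spouse's average IQ, given one partner's IQ.

    Because both variables are assumed to share the same average and SD,
    the same function serves for either direction of regression.
    """
    z = (known_iq - avg) / sd
    return avg + r * z * sd

wife_given_husband_140 = regress(140, r)   # regress wife's IQ on husband's IQ
husband_given_wife_120 = regress(120, r)   # regress husband's IQ on wife's IQ

print(round(wife_given_husband_140), round(husband_given_wife_120))   # 120 110
```

The second result is 110, not 140: going back along the other regression line does not return to the starting value, which is why there are two regression lines.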