Title: Prediction: Why is the regression line (Y') better than the mean (Ȳ)?
1 Prediction: Why is the regression line (Y') better than the mean (Ȳ)?
2 Today's plan: review, prediction error, and more review
- Review: describing two data sets with r
- Characteristics of r
- Calculating r
- Interpreting r
- Why is prediction of Y with X (using Y') better than Ȳ?
- Why use Ȳ to start with?
- How much error is there when we use Ȳ to predict Y?
- What is Y'?
- How can we describe the error in prediction when using Y'?
- Review of chapters 1-7.
3 Characteristics of Pearson's r.
- Pearson's r must be between -1 and 1, inclusive (-1 ≤ r ≤ 1).
- When r = 1 or -1, there is a perfect linear relationship between the two variables (usually designated X and Y).
- rXY = 1: perfect positive relationship (as X increases, Y always increases at a constant rate).
- rXY = -1: perfect negative relationship (as X increases, Y always decreases at a constant rate).
- When rXY = 0, there is no linear relationship between X and Y.
- When rXY is not -1, 0, or 1, there is an imperfect linear relationship of some direction (positive or negative) and some magnitude.
- rXY = Σ(zX zY)/(N - 1) (the corrected average of the z score cross-products)
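The z score definition of r can be sketched in a few lines of Python. The data below are hypothetical, invented only to show a perfect positive, a perfect negative, and an imperfect relationship; the function name is likewise an assumption for illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z score cross-products divided by N - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample SDs (N - 1 in the denominator)
    cross_products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(cross_products) / (len(xs) - 1)

# Hypothetical scores, invented for illustration only.
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [3, 5, 7, 9, 11])  # Y rises by 2 for every unit of X
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # Y falls by 2 for every unit of X
r_mid = pearson_r(x, [2, 1, 4, 3, 5])   # imperfect positive relationship

print(r_pos, r_neg, r_mid)
```

The first two data sets fall exactly on a line, so r comes out at the limits of its range; the third does not, so r falls strictly between 0 and 1.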
4 How can you calculate Pearson's r?
rXY = Σ(zX zY)/(N - 1)
or
rXY = [ΣXY - (ΣX)(ΣY)/N] / √(SSx × SSy)
5 What is the correlation between own and denic?
6 What is the correlation between own (X) and denic (Y)?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 - (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 - (45²/32) = 521.7
ΣXY = 693
rXY = .315
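As a check on the arithmetic, the sketch below recomputes r from the summary statistics on this slide. It assumes the standard raw-score (computational) formula, in which the sum of cross-products is divided by the square root of SSx times SSy; the function and variable names are illustrative.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Pearson's r from raw sums, using the computational formula."""
    ss_x = sum_x2 - sum_x ** 2 / n      # sum of squares for X
    ss_y = sum_y2 - sum_y ** 2 / n      # sum of squares for Y
    sp_xy = sum_xy - sum_x * sum_y / n  # sum of cross-products
    return sp_xy / sqrt(ss_x * ss_y)

# Summary statistics for own (X) and denic (Y), copied from the slide.
r_own_denic = r_from_sums(n=32, sum_x=313, sum_x2=4293,
                          sum_y=45, sum_y2=585, sum_xy=693)
print(round(r_own_denic, 3))
```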
7 Examples of various r's.
8 Are X and Y related? What is Pearson's r?
10 Using correlation to predict scores: regression
- If there is a non-zero correlation between X and Y (rXY ≠ 0.0), it means that there is a linear relationship between the two variables, X and Y.
- It also means that if you know an X score, you can use the correlation to help you predict what the Y score paired with that X score would be.
- The closer the correlation is to a perfect one (rXY = 1 or rXY = -1), the better the prediction will be.
- The predicted score (Y' or Y prime) will be based on a straight line (Y' = bX + a) drawn through the scatterplot such that the sum of the squared vertical distances of the points from the line is minimized (linear regression).
- When rXY = 1 or rXY = -1, all the points fall on the line.
- When -1 < rXY < 1, at least some of the points are not on the line.
11 For HR data, what Y' is predicted for X = 19?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 - (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 - (45²/32) = 521.7
ΣXY = 693
Y' = bY(X) + aY
aY = Ȳ - bY(X̄)
12 For HR data, what Y' is predicted for X = 19?
Y' = bY(X) + aY
bY = .2053
13 For HR data, what Y' is predicted for X = 19?
Y' = .2053X + aY
aY = Ȳ - bY(X̄)
aY = (45/32) - .2053(313/32)
aY = 1.406 - .2053(9.781)
aY = -0.602
14 For HR data, what Y' is predicted for X = 19?
Y' = .2053X - 0.602
Y' = .2053(19) - 0.602
Y' = 3.3
15 For HR data, what Y' is predicted for X = 19?
[Plot: the regression line Y' = .2053X - 0.602, passing through the points (0, -0.6) and (19, 3.3)]
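The whole prediction can be reproduced from the summary statistics. This is a minimal sketch that assumes the slope is the sum of cross-products divided by SSx (algebraically the same as r times sy/sx); the slides do not show which slope formula was used, and the names below are illustrative.

```python
# Summary statistics for the HR data, copied from the slides.
n, sum_x, sum_x2, sum_y, sum_xy = 32, 313, 4293, 45, 693

ss_x = sum_x2 - sum_x ** 2 / n      # sum of squares for X
sp_xy = sum_xy - sum_x * sum_y / n  # sum of cross-products

b = sp_xy / ss_x                    # slope (assumed formula: SP / SSx)
a = sum_y / n - b * (sum_x / n)     # intercept: mean of Y minus b times mean of X

def predict(x):
    """Predicted Y (Y prime) for a given X on the regression line."""
    return b * x + a

y_pred_19 = predict(19)
print(round(b, 4), round(a, 3), round(y_pred_19, 1))
```

Up to rounding, the slope, intercept, and predicted score agree with the values worked out on the slides.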
16 Assessing the value of prediction
- In the absence of any other information about Y, Ȳ is the best predictor.
- Consider a sample of Y with N = 1000.
- Y1001 = ?
- "Best predictor" means Σ(Y - Ȳ)² is minimized
- no single value yields a lower number
- one of the neat things about the mean.
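The claim that no single value beats the mean can be checked numerically. The sketch below uses a small hypothetical sample (not data from the lecture) and searches a grid of candidate guesses for the one with the smallest sum of squared errors.

```python
from statistics import mean

def sum_sq_error(scores, guess):
    """Sum of squared errors when one value is used to predict every score."""
    return sum((y - guess) ** 2 for y in scores)

y = [2, 4, 4, 5, 10]  # hypothetical scores, invented for illustration
y_bar = mean(y)
ss_at_mean = sum_sq_error(y, y_bar)

# Try guesses from 3 below the mean to 3 above it, in steps of 0.1.
candidates = [y_bar + d / 10 for d in range(-30, 31)]
best_guess = min(candidates, key=lambda g: sum_sq_error(y, g))

print(y_bar, ss_at_mean, best_guess)
```

The grid search lands on the mean itself; every other candidate gives a larger sum of squared errors.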
17 Error in prediction when Ȳ is used to predict Yi
[Plot: a point Yi and the horizontal line at Ȳ, with the prediction error (Yi - Ȳ) marked. Axes: X, Y]
18 Assessing the value of prediction
- In the absence of any other information about Y, Ȳ is the best predictor.
- "Best predictor" means Σ(Y - Ȳ)² is minimized.
- (Y - Ȳ) is the error in prediction (for any single point) when the mean (Ȳ) is used as the predictor.
- The goal of regression is to reduce error by using Y' as the predictor.
- With regression, (Y - Y') is the error in prediction.
- Regression uses the least squares criterion: Σ(Y - Y')² is minimized.
- The difference between (Y - Ȳ) and (Y - Y') is the advantage of regression.
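A small numerical sketch of this comparison, using hypothetical data rather than the lecture's: it fits the least squares line and then totals the squared errors around the mean and around the line. Variable names are illustrative.

```python
from statistics import mean

# Hypothetical paired scores, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

x_bar, y_bar = mean(x), mean(y)
ss_x = sum((xi - x_bar) ** 2 for xi in x)
sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sp_xy / ss_x       # least squares slope
a = y_bar - b * x_bar  # least squares intercept
y_pred = [b * xi + a for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)                    # error using the mean
ss_residual = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))   # error using the line
ss_saved = sum((yp - y_bar) ** 2 for yp in y_pred)               # error saved by X

print(ss_total, ss_residual, ss_saved)
```

For these scores the error around the line is smaller than the error around the mean, and the two parts (residual error plus error saved) add back up to the total.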
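19 [Plot: the regression line Y' = bX + a and the horizontal line at Ȳ, with the errors (Yi - Y') and (Yi - Ȳ) marked for a point Yi. Axes: X, Y]
20 [Plot: the same figure with the distance (Y' - Ȳ) also marked. Axes: X, Y]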
21 r
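23 Y' = 1.34X - .20
24 Y' = 1.34X - .20
25 Y' = 1.34X - .20
26 [Plot: the regression line Y' = 1.34X - .20 with three distances labeled: total error (Yi - Ȳ), residual error (Yi - Y'), and error saved by X (Y' - Ȳ)]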
27 Total variability of Y (SSY) = variability of Y accounted for by X + variability of prediction error
29 Expressed as percentages
Total variability of Y (SSY): (45.5/45.5) × 100 = 100%
Variability of prediction error: (13.94/45.5) × 100 = 30.64%
Variability of Y accounted for by X: (31.56/45.5) × 100 = 69.36%
30 Proportion of the variability in Y accounted for by X
(31.56/45.5) × 100 = 69.36%
AKA the coefficient of determination, r². Recall that rXY = 0.833.
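The percentages and their link to r can be checked directly. The sums of squares and the value of r below are copied from the slides; the variable names are illustrative.

```python
# Sums of squares from the slides.
ss_total = 45.5       # total variability of Y (SSY)
ss_error = 13.94      # variability of prediction error
ss_accounted = 31.56  # variability of Y accounted for by X

pct_error = ss_error / ss_total * 100
pct_accounted = ss_accounted / ss_total * 100

# Coefficient of determination from the reported correlation.
r = 0.833
r_squared_pct = r ** 2 * 100

print(round(pct_error, 2), round(pct_accounted, 2), round(r_squared_pct, 2))
```

The squared correlation and the percentage accounted for differ only in the second decimal place, which is what you would expect from r having been rounded to three digits.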
31 What percentage of the variability in denic is accounted for by own?
[Plot: the regression line Y' = .2053X - 0.602 through (0, -0.6) and (19, 3.3), with the horizontal line at Ȳ = 1.4]
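The slide leaves the answer open. One way to work it out, sketched here under the assumption that the intended route is the coefficient of determination, is to square the correlation computed from the own/denic summary statistics given earlier. Variable names are illustrative.

```python
# Summary statistics for own (X) and denic (Y), copied from the earlier slides.
n, sum_x, sum_x2, sum_y, sum_y2, sum_xy = 32, 313, 4293, 45, 585, 693

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
sp_xy = sum_xy - sum_x * sum_y / n

r_squared = sp_xy ** 2 / (ss_x * ss_y)  # coefficient of determination
pct_accounted = r_squared * 100         # percent of variability in Y accounted for by X
ss_accounted = r_squared * ss_y         # part of SSy accounted for by X
ss_error = ss_y - ss_accounted          # remaining prediction error

print(round(pct_accounted, 1))
```

With a correlation of about .315, own accounts for roughly 10% of the variability in denic, leaving about 90% as prediction error.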
32 What does the exam cover?
- Chapters 1-7.
- Not pages 117-118.
- Not pages 139-149.
- Everything in lecture.
- Everything in labs.
33 What have you learned so far in this course?
- Describing a single data set
- Displaying data (histograms, stem-and-leaf plots)
- Central tendency (mode, median, mean)
- Variability (variance and standard deviation)
- Using data transformations to describe data: z scores
- Describing the relationship between two data sets
- Scatter plot
- Correlation: Pearson's r
- Prediction using regression (Y' = bX + a)
34 Common challenges
- Understanding
- Why Σ(X - X̄) = 0.
- What the standard deviation measures.
- Why the standard deviation doesn't increase when a constant is added to each of the scores but does increase when each of the scores is multiplied by a constant (greater than 1).
- z score problems (sketch the problem!)
- Explaining
- Why the standard deviation is a good measure of variability.
- Why and how Y' reduces prediction error, relative to the mean, when r ≠ 0 but fails to do so when r = 0 (note: b = r(sy/sx)).
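Two of the "understanding" items can be seen in a few lines of Python. The scores are hypothetical, and the constants 10 and 3 are arbitrary choices for illustration.

```python
from statistics import mean, stdev

scores = [3, 5, 8, 10, 14]  # hypothetical scores, invented for illustration
x_bar = mean(scores)

# Deviations above the mean exactly cancel deviations below it.
sum_of_deviations = sum(x - x_bar for x in scores)

sd_original = stdev(scores)
sd_shifted = stdev([x + 10 for x in scores])  # add a constant: spread unchanged
sd_scaled = stdev([x * 3 for x in scores])    # multiply by 3: spread is 3 times larger

print(sum_of_deviations, sd_original, sd_shifted, sd_scaled)
```

Adding a constant slides every score, and the mean, by the same amount, so the deviations do not change; multiplying stretches every deviation by the constant, so the standard deviation stretches with it.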
35 On an IQ test with normally distributed scores, a mean of 100, and a standard deviation of 16, what percentage of scores fall between 123 and 135?
Notes: Negative z scores are not on the table. Rely on your sketch to tell you if you need to add 0.5 to the tabled result.
1. Sketch the problem by locating on your sketch where the necessary z scores will be in relation to the mean.
2. Shade in the area that you are looking for.
3. Use Table A, Column A to find the correct z score.
4. Use either column B or column C to find the shaded area.
[Sketch: normal curve with μ = 100]
36 On an IQ test with normally distributed scores, a mean of 100, and a standard deviation of 16, what percentage of scores fall between 123 and 135?
[Sketch: normal curve with μ = 100; the scores 123 and 135 are marked]
37 On an IQ test with normally distributed scores, a mean of 100, and a standard deviation of 16, what percentage of scores fall between 123 and 135?
z = (123 - 100)/16 = 1.44
z = (135 - 100)/16 = 2.19
[Sketch: the same curve on the z scale, with μz = 0; z = 1.44 and z = 2.19 are marked]
38 On an IQ test with normally distributed scores, a mean of 100, and a standard deviation of 16, what percentage of scores fall between 123 and 135?
Area between the mean and z = 2.19: 48.57%
Area between the mean and z = 1.44: 42.51%
48.57% - 42.51% = 6.06%
[Sketch: z scale with μz = 0; the area between z = 1.44 and z = 2.19 is shaded]
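The table lookup can be cross-checked with Python's standard library. This sketch uses the z scores as rounded on the slide for the table-style calculation, and also computes the area directly from the IQ distribution without rounding; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z scores as rounded on the slide.
z_low, z_high = 1.44, 2.19

# Areas between the mean and each z score (what the table's column gives).
area_mean_to_low = std_normal.cdf(z_low) - 0.5
area_mean_to_high = std_normal.cdf(z_high) - 0.5
pct_between = (area_mean_to_high - area_mean_to_low) * 100

# Same question answered directly on the IQ scale, with no rounding of z.
iq = NormalDist(mu=100, sigma=16)
pct_exact = (iq.cdf(135) - iq.cdf(123)) * 100

print(round(pct_between, 2), round(pct_exact, 2))
```

Both routes give about 6%; the small gap between them comes from rounding the z scores to two decimals before using the table.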
39 Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. What score do you need?
[Sketch: normal curve with μ = 100]
40 Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. What score do you need?
[Sketch: z scale with μz = 0; the upper-tail area of 0.0202 lies beyond z = 2.05]
41 Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. You need a 133 or higher.
2.05 = (X - 100)/16
(2.05 × 16) + 100 = X
132.8 = X
[Sketch: z scale with μz = 0; the upper-tail area of 0.0202 lies beyond z = 2.05]
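Going from an area back to a score is the inverse lookup, which the standard library also provides. A minimal sketch, with illustrative variable names:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# The upper 2% starts where 98% of scores lie below.
cutoff = iq.inv_cdf(0.98)
z_cutoff = NormalDist().inv_cdf(0.98)

# The slide's version: nearest tabled z score, converted back to an IQ score.
table_cutoff = 2.05 * 16 + 100

print(round(z_cutoff, 2), round(cutoff, 1), round(table_cutoff, 1))
```

The exact cutoff and the table-based one differ by less than a tenth of a point, and both round up to the same whole-number requirement of 133.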
42 The data to the left are from the annual Monitoring the Future survey carried out by the National Institute on Drug Abuse for 21 years (1975-1995). The column entitled "Risk" shows ratings of the perceived risks of cannabis use made by high school seniors on a scale of 0-100. The column entitled "30 day" shows past 30 day cannabis use among high school seniors in that year. Use the data (and accompanying summary statistics) to answer the following questions:
1) What is the correlation between perceived risk of cannabis use and actual cannabis use?
2) If, in the year 2000, you know that the perceived risk of cannabis use is 52, what would you predict will be the 30 day cannabis use for that year?
3) If you had reducing teenage cannabis use as your goal, would you use these data as justification for a multi-million dollar advertising campaign informing teenagers of the health risks of cannabis use?
44 What is the correlation . . .
N = 21
ΣX = 61.71 × 21 = 1295.91, SSx = 15.09² × 20 = 4554.162
ΣY = 24.57 × 21 = 515.97, SSy = 7.95² × 20 = 1264.05
ΣXY = 29715
rXY = [29715 - (1295.91)(515.97)/21] / √(4554.162 × 1264.05)
rXY = -0.89
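The same calculation in Python, starting from the means and standard deviations and rebuilding the sums the way the slide does. The variable names are illustrative; the numbers are the slide's.

```python
from math import sqrt

n = 21
mean_x, sd_x = 61.71, 15.09  # perceived risk (X)
mean_y, sd_y = 24.57, 7.95   # 30 day use (Y)
sum_xy = 29715

# Rebuild the sums and sums of squares from the means and SDs.
sum_x = mean_x * n
sum_y = mean_y * n
ss_x = sd_x ** 2 * (n - 1)
ss_y = sd_y ** 2 * (n - 1)

r = (sum_xy - sum_x * sum_y / n) / sqrt(ss_x * ss_y)
print(round(r, 2))
```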
45 Predict 30 day use if Risk is 52
Y' = bY(X) + aY
bY = r(sy/sx)
bY = -0.89(7.95/15.09)
bY = -0.47
aY = Ȳ - bY(X̄)
aY = 24.57 - (-0.47)(61.71)
aY = 53.57
Y' = -0.47(52) + 53.57
Y' = 29.13
46 [Plot: the regression line Y' = -0.47X + 53.57, with the point (52, 29.13) marked]
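A sketch of the prediction in Python. It rounds the slope and intercept to two decimals at each step, as the slides do, so the result matches the hand calculation; carrying full precision would shift the answer slightly. Function and variable names are illustrative.

```python
r = -0.89
mean_x, sd_x = 61.71, 15.09  # perceived risk (X)
mean_y, sd_y = 24.57, 7.95   # 30 day use (Y)

# Round at each step to mirror the hand calculation on the slides.
b = round(r * (sd_y / sd_x), 2)    # slope
a = round(mean_y - b * mean_x, 2)  # intercept

def predict_use(risk):
    """Predicted 30 day use (Y prime) for a given perceived-risk rating."""
    return b * risk + a

print(b, a, round(predict_use(52), 2))
```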
47 If you had reducing teenage cannabis use as your goal, would you use these data as justification for a multi-million dollar advertising campaign informing teenagers of the health risks of cannabis use?
48 Exam rules
- You may use any inanimate object you want.
- Yes: calculators!
- Yes: books!
- Yes: computers!
- Yes: SPSS or other statistical software!
- No: help from any living human!
- Show your work. When in doubt, more is better than less.
- Write your e-mail address on your exam NOW if you want me to e-mail you your score.
- Honor code is in effect for this exam. Sign the pledge at the bottom of the last page.