Prediction: Why is the regression line (Y′) better than the mean (Ȳ)?

Provided by: has9


Transcript and Presenter's Notes


1
Prediction: Why is the regression line (Y′) better than the mean (Ȳ)?
2
Today's plan: review, prediction error, and more review

Review: describing two data sets with r
Characteristics of r
Calculating r
Interpreting r
Why is prediction of Y with X (using Y′) better than Ȳ?
Why use Ȳ to start with?
How much error is there when we use Ȳ to predict Y?
What is Y′?
How can we describe the error in prediction when using Y′?
Review of chapters 1-7.

3
Characteristics of Pearson's r.

Pearson's r must be between −1 and 1, inclusive (−1 ≤ r ≤ 1).
When r = 1 or −1, there is a perfect linear relationship between the two variables (usually designated X and Y).
rXY = 1: perfect positive relationship (as X increases, Y always increases at a constant rate).
rXY = −1: perfect negative relationship (as X increases, Y always decreases at a constant rate).
When rXY = 0, there is no linear relationship between X and Y.
When rXY is not −1, 0, or 1, there is a linear relationship of some direction (positive or negative) and some magnitude.
r = Σ(zX·zY)/(N − 1) (the corrected average of the z score cross-products)
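The z-score definition of r can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name pearson_r and the small data lists are made up for illustration, and the sample standard deviation (N − 1 in the denominator) is assumed.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score cross-products divided by N - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample SDs (N - 1 in the denominator)
    cross_products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(cross_products) / (n - 1)


# Hypothetical illustration data (not from the slides).
xs = [1, 2, 3, 4, 5]
r_perfect_pos = pearson_r(xs, [2 * x + 1 for x in xs])   # Y rises with X on a line
r_perfect_neg = pearson_r(xs, [10 - 3 * x for x in xs])  # Y falls with X on a line
```

With points that fall exactly on a line, the function returns 1 or −1 (up to floating-point rounding), matching the "perfect relationship" cases above.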

4
How can you calculate Pearson's r?
rXY = Σ(zX·zY)/(N − 1)
or
rXY = [ΣXY − (ΣX)(ΣY)/N] / √(SSx · SSy)
5
What is the correlation between own and denic?
6
What is the correlation between own (X) and denic (Y)?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 − (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 − (45²/32) = 521.7
ΣXY = 693
rXY = .315
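The arithmetic behind rXY = .315 can be reproduced from the summary statistics on this slide. The sketch below assumes the usual computational formula (sum of cross-products over the square root of SSx times SSy); the variable names are illustrative.

```python
from math import sqrt

# Summary statistics from the slide (own = X, denic = Y).
n = 32
sum_x, sum_x2 = 313, 4293
sum_y, sum_y2 = 45, 585
sum_xy = 693

ss_x = sum_x2 - sum_x**2 / n        # sum of squares for X
ss_y = sum_y2 - sum_y**2 / n        # sum of squares for Y
sp_xy = sum_xy - sum_x * sum_y / n  # sum of cross-products

r_xy = sp_xy / sqrt(ss_x * ss_y)
print(round(ss_x, 1), round(ss_y, 1), round(r_xy, 3))
```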
7
Examples of various rs.
8
Are X and Y related? What is Pearson's r?
9
(No Transcript)
10
Using correlation to predict scores: regression

If there is a non-zero correlation between X and Y (rXY ≠ 0.0), it means that there is a linear relationship between the two variables, X and Y.
It also means that if you know an X score, you can use the correlation to help you predict what the Y score paired with that X score would be.
The closer the correlation is to a perfect one (rXY = 1 or rXY = −1), the better the prediction will be.
The predicted score (Y′, or Y prime) will be based on a straight line (Y′ = bX + a) drawn through the scatterplot such that the sum of the squared vertical distances of the points from the line is minimized (linear regression).
When rXY = 1 or rXY = −1, all the points fall on the line.
When −1 < rXY < 1, at least some of the points are not on the line.

11
For HR data, what Y′ is predicted for X = 19?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 − (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 − (45²/32) = 521.7
ΣXY = 693
Y′ = bY·X + aY
aY = Ȳ − bY·X̄
12
For HR data, what Y′ is predicted for X = 19?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 − (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 − (45²/32) = 521.7
ΣXY = 693
Y′ = bY·X + aY
bY = .2053
13
For HR data, what Y′ is predicted for X = 19?
N = 32
ΣX = 313, ΣX² = 4293, SSx = 4293 − (313²/32) = 1231.5
ΣY = 45, ΣY² = 585, SSy = 585 − (45²/32) = 521.7
ΣXY = 693
Y′ = .2053X + aY
aY = Ȳ − bY·X̄
aY = (45/32) − .2053(313/32)
aY = 1.406 − .2053(9.781)
aY = −0.602
14
For HR data, what Y′ is predicted for X = 19?
Y′ = .2053X − 0.602
Y′ = .2053(19) − 0.602
Y′ = 3.3
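The slope, intercept, and prediction can be recomputed from the same summary statistics. This sketch computes the slope as the sum of cross-products divided by SSx, which is algebraically the same as r(sY/sX); that choice, the helper name predict, and the variable names are assumptions made for illustration.

```python
# Summary statistics from the slides.
n = 32
sum_x, sum_x2 = 313, 4293
sum_y = 45
sum_xy = 693

ss_x = sum_x2 - sum_x**2 / n
sp_xy = sum_xy - sum_x * sum_y / n

b_y = sp_xy / ss_x                    # slope
a_y = sum_y / n - b_y * (sum_x / n)   # intercept: mean of Y minus slope times mean of X


def predict(x):
    """Predicted Y (Y prime) for a given X on the fitted line."""
    return b_y * x + a_y


y_pred_19 = predict(19)
print(round(b_y, 4), round(a_y, 3), round(y_pred_19, 1))
```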
15
For HR data, what Y′ is predicted for X = 19?
Y′ = .2053X − 0.602
[Figure: regression line with the points (19, 3.3) and (0, −0.6) marked]
16
Assessing the value of prediction

In the absence of any other information about Y, Ȳ is the best predictor.
Consider sample Y with N = 1000.
Y1001 = ?
Best predictor means Σ(Y − Ȳ)² is minimized: no single value yields a lower number.
This is one of the neat things about the mean.
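The claim that no single value beats the mean can be checked numerically. The sketch below uses made-up scores and a grid of candidate guesses around the mean; it is an illustration, not a proof.

```python
from statistics import mean


def sum_sq_error(ys, guess):
    """Sum of squared errors when one value is used to predict every score."""
    return sum((y - guess) ** 2 for y in ys)


ys = [2, 4, 4, 5, 10]  # hypothetical scores, not from the slides
y_bar = mean(ys)

# Candidate single-value predictors from 3 below the mean to 3 above it.
candidates = [y_bar + d / 10 for d in range(-30, 31)]
best = min(candidates, key=lambda g: sum_sq_error(ys, g))
```

Here best comes out equal to the mean, and every other candidate gives a larger sum of squared errors.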

17
Error in prediction when Ȳ is used to predict Yi
[Figure: scatterplot with axes X and Y; labels mark the point Yi, the mean Ȳ, and the error (Yi − Ȳ)]
18
Assessing the value of prediction

In the absence of any other information about Y, Ȳ is the best predictor.
Best predictor means Σ(Y − Ȳ)² is minimized.
(Y − Ȳ) is the error in prediction (for any single point) when the mean (Ȳ) is used as predictor.
The goal of regression is to reduce error by using Y′ as predictor.
With regression, (Y − Y′) is the error in prediction.
Regression uses the least squares criterion: Σ(Y − Y′)² is minimized.
The difference between (Y − Ȳ) and (Y − Y′) is the advantage of regression.
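A small numerical sketch makes the comparison concrete. It uses made-up data (not from the slides), fits the least squares line, and compares squared error around the mean with squared error around the line. The names ss_total, ss_resid, and ss_saved are illustrative.

```python
from statistics import mean

# Hypothetical paired scores (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

x_bar, y_bar = mean(xs), mean(ys)
ss_x = sum((x - x_bar) ** 2 for x in xs)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sp_xy / ss_x       # least squares slope
a = y_bar - b * x_bar  # least squares intercept
preds = [b * x + a for x in xs]

ss_total = sum((y - y_bar) ** 2 for y in ys)             # error using the mean
ss_resid = sum((y - p) ** 2 for y, p in zip(ys, preds))  # error using the line
ss_saved = sum((p - y_bar) ** 2 for p in preds)          # error saved by X
```

For these numbers the error around the line is smaller than the error around the mean, and the two pieces (residual error plus error saved by X) add back up to the total.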

19
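[Figure: scatterplot with axes X and Y, the regression line Y′ = bX + a, and the mean Ȳ; labels mark the errors (Yi − Ȳ) and (Yi − Y′)]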
20
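[Figure: scatterplot with axes X and Y, the regression line Y′ = bX + a, and the mean Ȳ; labels mark (Yi − Ȳ), (Yi − Y′), and (Y′ − Ȳ)]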
21
r
22
(No Transcript)
23
Y′ = 1.34X − .20
24
Y′ = 1.34X − .20
25
Y′ = 1.34X − .20
26
Residual error
Total error
Error saved by X
Y′ = 1.34X − .20
27
Variability of prediction error
Variability of Y accounted for by X
Total variability of Y (SSY)
28
Variability of prediction error
Variability of Y accounted for by X
Total variability of Y (SSY)

29
Expressed as percentages
Total variability of Y (SSY): (45.5/45.5) × 100 = 100%
Variability of prediction error: (13.94/45.5) × 100 = 30.64%
Variability of Y accounted for by X: (31.56/45.5) × 100 = 69.36%

30
Proportion of the variability in Y accounted for by X
(31.56/45.5) × 100 = 69.36%
AKA the coefficient of determination, r². Recall that rXY = 0.833.
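The percentages and the link to r can be verified directly from the three sums of squares shown on these slides. A minimal sketch; the variable names are illustrative, and the square root only recovers the size of r (its sign comes from the slope, which is positive here).

```python
from math import sqrt

# Sums of squares from the slides.
ss_total = 45.5       # total variability of Y (SSY)
ss_error = 13.94      # variability of prediction error
ss_explained = 31.56  # variability of Y accounted for by X

r_squared = ss_explained / ss_total
pct_explained = 100 * r_squared
pct_error = 100 * ss_error / ss_total
r_magnitude = sqrt(r_squared)

print(round(pct_explained, 2), round(pct_error, 2), round(r_magnitude, 3))
```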
31
What percentage of the variability in Denic is accounted for by Own?
Y′ = .2053X − 0.602
Ȳ = 1.4
[Figure: regression line with the points (19, 3.3) and (0, −0.6) marked, and the mean Ȳ = 1.4]
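One way to answer the question is to square the correlation reported earlier for these two variables (rXY = .315). The sketch below assumes that rounded value, so the result is approximate.

```python
r_xy = 0.315  # correlation between own (X) and denic (Y), as reported earlier

r_squared = r_xy**2
pct_accounted_for = 100 * r_squared  # percent of variability in denic accounted for by own
print(round(pct_accounted_for, 1))
```

That works out to roughly 10 percent, so most of the variability in Denic is left as prediction error.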
32
What does the exam cover?

Chapters 1-7.
Not Pages 117-118.
Not Pages 139-149.
Everything in lecture.
Everything in labs.

33
What have you learned so far in this course?

Describing a single data set
Displaying data (histograms, stem-leaf plot)
Central tendency (mode, median, mean)
Variability (variance and standard deviation)
Using data transformations to describe data: z scores
Describing the relationship between two data sets
Scatter plot
Correlation: Pearson's r
Prediction using regression (Y′ = bX + a)

34
Common challenges

Understanding
Why Σ(X − X̄) = 0.
What the standard deviation measures.
Why the standard deviation doesn't increase when a constant is added to each of the scores but does increase when each of the scores is multiplied by a constant greater than 1.
z score problems (sketch the problem!)
Explaining
Why the standard deviation is a good measure of variability.
Why and how Y′ reduces prediction error, relative to the mean, when r ≠ 0 but fails to do so when r = 0 (note: b = r(sY/sX)).
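Two of these points are easy to confirm numerically. The sketch below uses made-up scores and the sample standard deviation; it shows that deviations from the mean sum to zero, that adding a constant leaves the standard deviation unchanged, and that multiplying by a constant scales it.

```python
from statistics import mean, stdev

xs = [3, 7, 8, 12, 15]  # hypothetical scores, not from the slides
x_bar = mean(xs)

dev_sum = sum(x - x_bar for x in xs)    # deviations from the mean cancel out

sd = stdev(xs)
sd_shifted = stdev([x + 10 for x in xs])  # add a constant to every score
sd_scaled = stdev([x * 3 for x in xs])    # multiply every score by a constant
```

Adding 10 moves every score and the mean by the same amount, so the deviations do not change. Multiplying by 3 makes every deviation three times as large, so the standard deviation triples.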

35
On an IQ test with normally distributed scores
and a mean of 100 and a standard deviation of 16,
what percentage of scores fall between 123 and
135?
Notes: Negative z scores are not on the table. Rely on your sketch to tell you if you need to add 0.5 to the tabled result.
1. Sketch the problem by locating on your sketch where the necessary z scores will be in relation to the mean.
2. Shade in the area that you are looking for.
3. Use Table A, Column A to find the correct z score.
4. Use either column B or column C to find the shaded area.
μ = 100
36
On an IQ test with normally distributed scores
and a mean of 100 and a standard deviation of 16,
what percentage of scores fall between 123 and
135?
[Sketch: normal curve with μ = 100; X = 123 and X = 135 marked]
37
On an IQ test with normally distributed scores
and a mean of 100 and a standard deviation of 16,
what percentage of scores fall between 123 and
135?
[Sketch: normal curve with μz = 0; z = 1.44 (X = 123) and z = 2.19 (X = 135) marked]
38
On an IQ test with normally distributed scores
and a mean of 100 and a standard deviation of 16,
what percentage of scores fall between 123 and
135?
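Area between the mean and z = 2.19: 48.57%
Area between the mean and z = 1.44: 42.51%
Shaded area: 48.57% − 42.51% = 6.06%
[Sketch: normal curve with μz = 0; z = 1.44 and z = 2.19 marked]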
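The table lookup can be cross-checked with Python's standard library. This sketch rounds the z scores to two decimals, as a printed table requires, and uses statistics.NormalDist for the areas. Because the table itself rounds each area to four decimals, the computed result (about 6.07%) differs slightly from the 6.06% obtained above.

```python
from statistics import NormalDist

mu, sigma = 100, 16

# z scores rounded to two decimals, as when reading a printed z table.
z_low = round((123 - mu) / sigma, 2)
z_high = round((135 - mu) / sigma, 2)

std_normal = NormalDist()  # mean 0, standard deviation 1
area = std_normal.cdf(z_high) - std_normal.cdf(z_low)
pct_between = 100 * area
print(z_low, z_high, round(pct_between, 2))
```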
39
Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. What score do you need?
μ = 100
40
Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. What score do you need?
[Sketch: normal curve with μz = 0; z = 2.05 cuts off an upper-tail area of 0.0202]
41
Suppose that you want to join a selective society that bases membership on IQ scores -- you need to score in the upper 2% on the test to qualify. You need a 133 or higher.
2.05 = (X − 100)/16
(2.05 × 16) + 100 = X
132.8 = X
[Sketch: normal curve with μz = 0; z = 2.05 cuts off an upper-tail area of 0.0202]
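The cutoff can also be found without a table. The sketch below uses statistics.NormalDist.inv_cdf to get the z score with 98% of the distribution below it, then converts back to the IQ scale. The exact z is slightly above the tabled 2.05, but the whole-number answer is the same.

```python
from math import ceil
from statistics import NormalDist

mu, sigma = 100, 16

z_cut = NormalDist().inv_cdf(0.98)  # z with 2% of scores above it
x_cut = mu + sigma * z_cut          # convert z back to the IQ scale

x_table = mu + sigma * 2.05         # same conversion using the tabled z
needed_score = ceil(x_cut)          # lowest whole-number score in the top 2%
```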
42
The data to the left are from the annual Monitoring the Future survey carried out by the National Institute on Drug Abuse for 21 years (1975-1995). The column entitled "Risk" shows ratings of the perceived risks of cannabis use made by high school seniors on a scale of 0-100. The column entitled "30 day" shows past 30 day cannabis use among high school seniors in that year. Use the data (and accompanying summary statistics) to answer the following questions:
1) What is the correlation between perceived risk of cannabis use and actual cannabis use?
2) If, in the year 2000, you know that the perceived risk of cannabis use is 52, what would you predict will be the 30 day cannabis use for that year?
3) If you had reducing teenage cannabis use as your goal, would you use these data as justification for a multi-million dollar advertising campaign informing teenagers of the health risks of cannabis use?
43
(No Transcript)
44
What is the correlation . . .
N = 21
ΣX = 61.71 × 21 = 1295.91
SSx = 15.09² × 20 = 4554.162
ΣY = 24.57 × 21 = 515.97
SSy = 7.95² × 20 = 1264.05
ΣXY = 29715
rXY = [29715 − (1295.91)(515.97)/21] / √(4554.162 × 1264.05)
rXY = −0.89
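The same calculation in Python, starting from the means and standard deviations given in the summary statistics (Risk = X, 30 day use = Y). A minimal sketch with illustrative variable names.

```python
from math import sqrt

# Summary statistics from the slide.
n = 21
mean_x, sd_x = 61.71, 15.09  # perceived risk
mean_y, sd_y = 24.57, 7.95   # 30 day use
sum_xy = 29715

sum_x = mean_x * n            # a sum is the mean times N
sum_y = mean_y * n
ss_x = sd_x**2 * (n - 1)      # sum of squares is the variance times (N - 1)
ss_y = sd_y**2 * (n - 1)

r_xy = (sum_xy - sum_x * sum_y / n) / sqrt(ss_x * ss_y)
print(round(r_xy, 2))
```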
45
Predict 30 day use if Risk is 52
Y′ = bY·X + aY
bY = r(sY/sX)
bY = −0.89(7.95/15.09)
bY = −0.47
aY = Ȳ − bY·X̄
aY = 24.57 − (−0.47)(61.71)
aY = 53.57
Y′ = bY·X + aY
Y′ = −0.47(52) + 53.57
Y′ = 29.13
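The prediction can be reproduced step by step. This sketch rounds the slope and intercept to two decimals at each step, as the hand calculation does; carrying full precision instead gives a very slightly different answer (about 29.12).

```python
# Summary statistics and correlation from the slides.
r_xy = -0.89
mean_x, sd_x = 61.71, 15.09  # perceived risk
mean_y, sd_y = 24.57, 7.95   # 30 day use

b_y = round(r_xy * (sd_y / sd_x), 2)   # slope, rounded as in the hand calculation
a_y = round(mean_y - b_y * mean_x, 2)  # intercept, rounded the same way

risk = 52
y_pred = b_y * risk + a_y
print(b_y, a_y, round(y_pred, 2))
```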
46
Y′ = −0.47X + 53.57
[Figure: regression line with the point (52, 29.13) marked]
47
If you had reducing teenage cannabis use as your
goal, would you use these data as justification
for a multi-million dollar advertising campaign
informing teenagers of the health risks of
cannabis use?
48
Exam rules

You may use any inanimate object you want.
Yes calculators!
Yes books!
Yes computers!
Yes SPSS or other statistical software!
No help from any living human!
Show your work. When in doubt, more is better
than less.
Write your e-mail address on your exam NOW if you
want me to e-mail you your score.
Honor code is in effect for this exam. Sign
pledge at the bottom of the last page.