Introduction to Probability and Statistics, Thirteenth Edition - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Probability and Statistics, Thirteenth Edition

Description:

Introduction to Probability and Statistics Thirteenth Edition Chapter 12 Linear Regression and Correlation ... – PowerPoint PPT presentation

Slides: 46

Provided by: ValuedG139

Transcript and Presenter's Notes

1
Introduction to Probability and Statistics,
Thirteenth Edition

Chapter 12
Linear Regression and Correlation

2
Correlation & Regression

Univariate & Bivariate Statistics
U: frequency distribution, mean, mode, range,
standard deviation
B: correlation between two variables
Correlation
a linear pattern of relationship between one
variable (x) and another variable (y); an
association between two variables
graphical representation of the relationship
between two variables
Warning
No proof of causality
Cannot assume x causes y

3
1. Correlation Analysis

Correlation coefficient measures the strength of
the relationship between x and y

Sample Pearson's correlation coefficient
4
Pearson's Correlation Coefficient

r indicates
strength of relationship (strong, weak, or none)
direction of relationship
positive (direct) variables move in same
direction
negative (inverse) variables move in opposite
directions
r ranges in value from -1.0 to 1.0

-1.0               0.0        1.0
Strong Negative    No Rel.    Strong Positive
5
Limitations of Correlation

linearity
can't describe non-linear relationships
e.g., relation between anxiety and performance
no proof of causation
Cannot assume x causes y

6
Some Correlation Patterns
[Figure: scatter plots of Y versus X illustrating linear relationships and curvilinear relationships]
7
Some Correlation Patterns
[Figure: scatter plots of Y versus X illustrating strong relationships and weak relationships]
8
Example

The table shows the heights and weights of n =
10 randomly selected college football players.

Player       1    2    3    4    5    6    7    8    9    10
Height, x    73   71   75   72   72   75   67   69   71   69
Weight, y    185  175  200  210  190  195  150  170  180  175
9
Example scatter plot
r = .8261. Strong positive correlation: as the
player's height increases, so does his weight.
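
The reported correlation can be checked directly from the table. The sketch below assumes the usual sample formula r = Sxy / sqrt(Sxx Syy), where Sxx, Syy and Sxy are sums of squared deviations and cross-products; the function and variable names are illustrative and not taken from the slides.

```python
import math

# Heights (x) and weights (y) of the 10 players from the table.
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]


def pearson_r(x, y):
    """Sample Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(heights, weights)
print(f"r = {r:.4f}")
```

Rounded to four decimals, this agrees with the r = .8261 shown on the slide.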
10
Inference using r

The population coefficient of correlation is
called ρ (rho). We can test for a significant
correlation between x and y using a t test.

11
Example

Is there a significant positive correlation
between weight and height in the population of
all college football players?

Use the t-table with n - 2 = 8 df to bound the
p-value as p-value < .005. There is a significant
positive correlation between weight and height in
the population of all college football players.
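
The test statistic itself is not preserved in the transcript. A minimal sketch, assuming the standard statistic t = r * sqrt((n - 2) / (1 - r^2)) for H0: ρ = 0 against Ha: ρ > 0, and taking the critical value 3.355 (upper-tail area .005, 8 df) from a standard t table:

```python
import math

n = 10       # sample size from the football example
r = 0.8261   # sample correlation reported on the slide

# Test statistic for H0: rho = 0 versus Ha: rho > 0, with n - 2 df.
t = r * math.sqrt((n - 2) / (1 - r ** 2))

# Upper-tail critical value t_.005 for 8 df, copied from a t table
# (an assumption of this sketch, not a value given in the slides).
T_005_DF8 = 3.355

print(f"t = {t:.2f} with {n - 2} df")
if t > T_005_DF8:
    print("t exceeds t_.005, so the one-tailed p-value is below .005")
else:
    print("t does not exceed t_.005")
```

Because t is larger than the tabled value, the one-tailed p-value is bounded above by .005, consistent with the conclusion on the slide.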
12
2. Linear Regression

Regression: Correlation + Prediction
Regression analysis is used to predict the value
of one variable (the dependent variable) on the
basis of other variables (the independent
variables).
Dependent variable denoted y
Independent variables denoted x1, x2, ..., xk

13
Example

Let y be the monthly sales revenue for a company.
This might be a function of several variables
x1 = advertising expenditure
x2 = time of year
x3 = state of economy
x4 = size of inventory
We want to predict y using knowledge of x1, x2,
x3 and x4.

14
Some Questions

Which of the independent variables are useful and
which are not?
How could we create a prediction equation to
allow us to predict y using knowledge of x1, x2,
x3, etc.?
How good is this prediction?

We start with the simplest case, in which the
response y is a function of a single independent
variable, x.
15
Model Building
16
A Simple Linear Regression Model

Explanatory and Response Variables are Numeric
Relationship between the mean of the response
variable and the level of the explanatory
variable assumed to be approximately linear
(straight line)
Model: y = β0 + β1 x + ε

β1 > 0 ⇒ Positive Association
β1 < 0 ⇒ Negative Association
β1 = 0 ⇒ No Association

17
Picturing the Simple Linear Regression Model
[Figure: regression plot of y versus x showing the intercept α, the slope β (rise per unit increase in x), and the error ε]
18
Simple Linear Regression Analysis
y = actual value of a score; ŷ = predicted
value

Variables
x = Independent Variable
y = Dependent Variable
Parameters
α = y-intercept
β = Slope
ε ~ normal distribution with mean 0 and variance
σ²

19
Simple Linear Regression Model
[Figure: line plotted on x and y axes, with β = slope = Δy/Δx and α = intercept]
20
The Method of Least Squares

The equation of the best-fitting line, ŷ = a + bx,
is calculated using a set of n pairs (xi, yi).

We choose our estimates a and b to estimate α and
β so that the squared vertical distances of the
points from the line,
SSE = Σ(y - ŷ)² = Σ(y - a - bx)²,
are minimized.

21
Least Squares Estimators
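
The formulas for this slide are not preserved in the transcript. The standard least squares estimators for simple linear regression, written in the S_xx and S_xy notation (an assumption about the slide's notation), are:

```latex
\hat{y} = a + bx, \qquad
b = \frac{S_{xy}}{S_{xx}}, \qquad
a = \bar{y} - b\,\bar{x}

S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \qquad
S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}
```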
22
Example

The table shows the IQ scores for a random
sample of n = 10 college freshmen, along with
their final calculus grades.

Student             1    2    3    4    5    6    7    8    9    10
IQ Scores, x        39   43   21   64   57   47   28   75   34   52
Calculus grade, y   65   78   52   82   92   89   73   98   56   75
Use your calculator to find the sums and sums of
squares.
23
Example
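
The worked numbers for this slide are missing from the transcript. As a sketch of the calculation the previous slide asks for, the code below finds the sums and sums of squares for the IQ and calculus data and then the least squares estimates b = Sxy/Sxx and a = y-bar - b x-bar. Variable names are illustrative.

```python
# IQ scores (x) and calculus grades (y) from the table.
iq = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

# Raw sums, sums of squares and sum of cross-products.
sum_x = sum(iq)
sum_y = sum(grade)
sum_x2 = sum(x * x for x in iq)
sum_y2 = sum(y * y for y in grade)
sum_xy = sum(x * y for x, y in zip(iq, grade))

# Corrected sums of squares and cross-products.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Least squares estimates of the slope and intercept.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

print(f"Sxx = {s_xx:.0f}, Syy = {s_yy:.0f}, Sxy = {s_xy:.0f}")
print(f"fitted line: y-hat = {a:.4f} + {b:.4f} x")
```

The value of Syy computed here equals the Total SS of 2056 that appears in the ANOVA table for the calculus problem.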
24
The Analysis of Variance

The total variation in the experiment is measured
by the total sum of squares

The Total SS is divided into two parts
SSR (sum of squares for regression) measures the
variation explained by using x in the model.
SSE (sum of squares for error) measures the
leftover variation not explained by x.

25
The Analysis of Variance

We calculate
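
The formulas for this slide are not preserved in the transcript. The usual computing formulas for the sums of squares in simple linear regression, assuming the S_xx, S_yy and S_xy notation, are:

```latex
\text{Total SS} = S_{yy} = \sum (y_i - \bar{y})^2
                = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}

\text{SSR} = \frac{(S_{xy})^2}{S_{xx}}

\text{SSE} = \text{Total SS} - \text{SSR} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}
```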

26
The ANOVA Table

Total df = n - 1
Regression df = 1
Error df = (n - 1) - 1 = n - 2
Mean Squares: MSR = SSR/(1), MSE = SSE/(n - 2)

Source       df      SS         MS            F
Regression   1       SSR        SSR/(1)       MSR/MSE
Error        n - 2   SSE        SSE/(n - 2)
Total        n - 1   Total SS
27
The Calculus Problem
Source       df   SS          MS          F
Regression   1    1449.9741   1449.9741   19.14
Error        8    606.0259    75.7532
Total        9    2056.0000
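
The entries in this table can be reproduced from the raw data. A minimal sketch, assuming the computing formulas Total SS = Syy, SSR = Sxy²/Sxx and SSE = Total SS - SSR; the variable names are illustrative.

```python
# IQ scores (x) and calculus grades (y) from the example.
iq = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

x_bar = sum(iq) / n
y_bar = sum(grade) / n
s_xx = sum((x - x_bar) ** 2 for x in iq)
s_yy = sum((y - y_bar) ** 2 for y in grade)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, grade))

# Partition of the total variation.
total_ss = s_yy
ssr = s_xy ** 2 / s_xx      # explained by the regression on x
sse = total_ss - ssr        # left unexplained

# Mean squares and the F statistic.
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

print(f"{'Source':<12}{'df':>4}{'SS':>12}{'MS':>12}{'F':>8}")
print(f"{'Regression':<12}{1:>4}{ssr:>12.4f}{msr:>12.4f}{f_stat:>8.2f}")
print(f"{'Error':<12}{n - 2:>4}{sse:>12.4f}{mse:>12.4f}")
print(f"{'Total':<12}{n - 1:>4}{total_ss:>12.4f}")
```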
28
Testing the Usefulness of the Model (The F Test)

You can test the overall usefulness of the model
using an F test. If the model is useful, MSR will
be large compared to the unexplained variation,
MSE.

This test is exactly equivalent to the t-test,
with t² = F.
29
Minitab Output
30
Testing the Usefulness of the Model

The first question to ask is whether the
independent variable x is of any use in
predicting y.
If it is not, then the value of y does not
change, regardless of the value of x. This
implies that the slope of the line, β, is zero.

31
Testing the Usefulness of the Model
The test statistic is a function of b, our best
estimate of β. Using MSE as the best estimate of
the random variation σ², we obtain a t statistic.
32
The Calculus Problem

Is there a significant relationship between the
calculus grades and the IQ scores at the 5% level
of significance?

Reject H0 when |t| > 2.306. Since t = 4.38
falls into the rejection region, H0 is rejected.
There is a significant linear relationship
between the calculus grades and the IQ scores for
the population of college freshmen.
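
The calculation behind t = 4.38 is not preserved in the transcript. A sketch, assuming the usual statistic t = (b - 0) / sqrt(MSE / Sxx) with n - 2 = 8 df for H0: β = 0 against a two-sided alternative; the critical value 2.306 is the one quoted on the slide.

```python
import math

# IQ scores (x) and calculus grades (y) from the example.
iq = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

x_bar = sum(iq) / n
y_bar = sum(grade) / n
s_xx = sum((x - x_bar) ** 2 for x in iq)
s_yy = sum((y - y_bar) ** 2 for y in grade)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, grade))

b = s_xy / s_xx                               # estimated slope
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)     # estimate of sigma^2

# t statistic for H0: beta = 0.
t = (b - 0) / math.sqrt(mse / s_xx)

T_025_DF8 = 2.306   # two-sided 5% critical value quoted on the slide
reject_h0 = abs(t) > T_025_DF8

print(f"t = {t:.2f}, reject H0: {reject_h0}")
print(f"t squared = {t ** 2:.2f}")   # should match the F statistic
```

Squaring t gives the F statistic from the ANOVA table, which illustrates the equivalence of the two tests.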
33
Measuring the Strength of the Relationship

If the independent variable x is useful in
predicting y, you will want to know how well the
model fits.
The strength of the relationship between x and y
can be measured using the coefficient of
determination, r² = SSR / Total SS.

34
Measuring the Strength of the Relationship

Since Total SS = SSR + SSE, r² measures
the proportion of the total variation in the
responses that can be explained by using the
independent variable x in the model.
the percent reduction in the total variation by
using the regression equation rather than just
using the sample mean y-bar to estimate y.

For the calculus problem, r² = .705 or 70.5%,
meaning that 70.5% of the variability of calculus
scores can be explained by the model.
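
As a quick check of this figure, the ratio can be computed from the sums of squares in the ANOVA table for the calculus problem. The sketch assumes r² = SSR / Total SS.

```python
# Sums of squares copied from the ANOVA table for the calculus problem.
ssr = 1449.9741
total_ss = 2056.0000

# Proportion of the total variation explained by the regression on x.
r_squared = ssr / total_ss

print(f"r^2 = {r_squared:.3f} ({100 * r_squared:.1f}% of the variation explained)")
```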
35
Estimation and Prediction
Confidence interval
Prediction interval
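
The interval formulas for this slide are not preserved in the transcript. The standard forms for simple linear regression at x = x_0, with t based on n - 2 df, are given here as an assumption about what the slide showed:

```latex
% Confidence interval for the average value E(y) at x = x_0
\hat{y} \pm t_{\alpha/2}
  \sqrt{\text{MSE}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}

% Prediction interval for a particular value of y at x = x_0
\hat{y} \pm t_{\alpha/2}
  \sqrt{\text{MSE}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}
```

The extra 1 under the square root in the second formula is what makes the prediction interval wider.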
36
The Calculus Problem

Estimate the average calculus grade for students
whose IQ score is 50 with a 95% confidence
interval.

37
The Calculus Problem

Estimate the calculus grade for a particular
student whose IQ score is 50 with a 95%
prediction interval.

Notice how much wider this interval is!
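
The computed intervals are missing from the transcript. The sketch below evaluates both at x0 = 50, assuming the standard formulas ŷ ± t sqrt(MSE (1/n + (x0 - x-bar)²/Sxx)) for the mean and ŷ ± t sqrt(MSE (1 + 1/n + (x0 - x-bar)²/Sxx)) for an individual, with t = 2.306 for 8 df.

```python
import math

# IQ scores (x) and calculus grades (y) from the example.
iq = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

x_bar = sum(iq) / n
y_bar = sum(grade) / n
s_xx = sum((x - x_bar) ** 2 for x in iq)
s_yy = sum((y - y_bar) ** 2 for y in grade)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, grade))

b = s_xy / s_xx
a = y_bar - b * x_bar
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)

x0 = 50              # IQ score of interest
T_025_DF8 = 2.306    # t value for 95% intervals with 8 df
y_hat = a + b * x0

# Half-widths: the prediction interval has an extra "1 +" term.
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = T_025_DF8 * math.sqrt(mse * leverage)
pi_half = T_025_DF8 * math.sqrt(mse * (1 + leverage))

ci = (y_hat - ci_half, y_hat + ci_half)
pi = (y_hat - pi_half, y_hat + pi_half)

print(f"y-hat at x0 = {x0}: {y_hat:.2f}")
print(f"95% confidence interval: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% prediction interval: {pi[0]:.2f} to {pi[1]:.2f}")
```

Under these assumptions, the interval for a particular student is roughly three times as wide as the interval for the average grade.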
38
Minitab Output

Green prediction bands are always wider than red
confidence bands.
Both intervals are narrowest when x = x-bar.

39
Estimation and Prediction

Once you have
determined that the regression line is useful, and
used the diagnostic plots to check for violation
of the regression assumptions,
you are ready to use the regression line to

estimate the average value of y for a given value
of x
predict a particular value of y for a given value
of x.

40
Estimation and Prediction

The best estimate of either E(y) or y for
a given value x = x0 is ŷ = a + b x0.
Particular values of y are more difficult to
predict, requiring a wider range of values in the
prediction interval.

41
Regression Assumptions

Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

Assumptions

The relationship between x and y is linear, given
by y = α + βx + ε.
The random error terms ε are independent and, for
any value of x, have a normal distribution with
mean 0 and constant variance, σ².

42
Diagnostic Tools

Normal probability plot or histogram of residuals
Plot of residuals versus fit or residuals versus
variables
Plot of residuals versus order

43
Residuals

The residual error is the leftover variation
in each data point after the variation explained
by the regression model has been removed.
If all assumptions have been met, these residuals
should be normal, with mean 0 and variance σ².
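
A minimal sketch of how the residuals could be computed for the calculus data, assuming the least squares fit ŷ = a + bx with b = Sxy/Sxx and a = y-bar - b x-bar. With an intercept in the model, the residuals sum to zero up to rounding, and their squared sum is the SSE.

```python
# IQ scores (x) and calculus grades (y) from the example.
iq = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(iq)

x_bar = sum(iq) / n
y_bar = sum(grade) / n
s_xx = sum((x - x_bar) ** 2 for x in iq)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(iq, grade))

b = s_xy / s_xx
a = y_bar - b * x_bar

# Fitted values and residuals (observed minus fitted).
fits = [a + b * x for x in iq]
residuals = [y - f for y, f in zip(grade, fits)]

sse = sum(e ** 2 for e in residuals)

for x, f, e in zip(iq, fits, residuals):
    print(f"x = {x:>2}  fit = {f:6.2f}  residual = {e:7.2f}")
print(f"sum of residuals = {sum(residuals):.6f}, SSE = {sse:.4f}")
```

The pairs of fits and residuals are the values that would go into a residuals versus fits plot, and the sorted residuals into a normal probability plot.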

44
Normal Probability Plot

If the normality assumption is valid, the plot
should resemble a straight line, sloping upward
to the right.
If not, you will often see the pattern fail in
the tails of the graph.

45
Residuals versus Fits

If the equal variance assumption is valid, the
plot should appear as a random scatter around the
zero center line.
If not, you will see a pattern in the residuals.
