Title: Introduction to Probability and Statistics, Thirteenth Edition
1 Introduction to Probability and Statistics, Thirteenth Edition
- Chapter 12
- Linear Regression and Correlation
2 Correlation & Regression
- Univariate & Bivariate Statistics
  - U: frequency distribution, mean, mode, range, standard deviation
  - B: correlation between two variables
- Correlation
  - linear pattern of relationship between one variable (x) and another variable (y); an association between two variables
  - graphical representation of the relationship between two variables
- Warning
  - No proof of causality
  - Cannot assume x causes y
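3 1. Correlation Analysis
- Correlation coefficient measures the strength of the relationship between x and y
Sample Pearson's correlation coefficient: $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum (x_i - \bar{x})^2$ and $S_{yy} = \sum (y_i - \bar{y})^2$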
4 Pearson's Correlation Coefficient
- r indicates
  - strength of relationship (strong, weak, or none)
  - direction of relationship
    - positive (direct): variables move in same direction
    - negative (inverse): variables move in opposite directions
- r ranges in value from -1.0 to 1.0
  - -1.0: Strong Negative
  - 0.0: No Relationship
  - 1.0: Strong Positive
5 Limitations of Correlation
- linearity
  - can't describe non-linear relationships
  - e.g., relation between anxiety and performance
- no proof of causation
  - Cannot assume x causes y
6 Some Correlation Patterns
[Figure: scatterplots of Y versus X; panels titled "Linear relationships" and "Curvilinear relationships"]
7 Some Correlation Patterns
[Figure: scatterplots of Y versus X; panels titled "Strong relationships" and "Weak relationships"]
8 Example
- The table shows the heights and weights of n = 10 randomly selected college football players.
Player       1    2    3    4    5    6    7    8    9   10
Height, x   73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175
9 Example: scatter plot
r = .8261. Strong positive correlation: as the player's height increases, so does his weight.
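The value r = .8261 can be checked directly from the table. The sketch below is a minimal illustration in plain Python, assuming the usual definition of the sample correlation as the sum of cross-products divided by the square root of the product of the two sums of squares; the function name `pearson_r` is chosen here for illustration only.

```python
import math

# Data from the football-player table (heights in x, weights in y).
heights = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weights = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]


def pearson_r(x, y):
    """Sample Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(heights, weights)
print(f"r = {r:.4f}")  # prints r = 0.8261
```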
10 Inference using r
- The population coefficient of correlation is called $\rho$ (rho). We can test for a significant correlation between x and y using a t test.
11 Example
- Is there a significant positive correlation between weight and height in the population of all college football players?
Use the t-table with n - 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation between weight and height in the population of all college football players.
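As a rough check of this conclusion, the sketch below computes the usual t statistic for testing that the population correlation is zero, t = r * sqrt((n - 2) / (1 - r^2)), and compares it with the upper-tail t-table value for 8 df. The critical value 3.355 is taken from a standard t-table and the variable names are illustrative assumptions, not part of the slides.

```python
import math

r = 0.8261   # sample correlation reported for the football players
n = 10       # number of players

# t statistic for H0: rho = 0 against Ha: rho > 0, with n - 2 df.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Upper-tail critical value t_.005 with 8 df (standard t-table value).
T_005_8DF = 3.355

print(f"t = {t_stat:.2f} with {n - 2} df")
print("p-value < .005:", t_stat > T_005_8DF)
```

Because the statistic exceeds the .005 table value, the one-tailed p-value is bounded below .005, which agrees with the statement above.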
12 2. Linear Regression
- Regression = Correlation + Prediction
- Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
  - Dependent variable: denoted y
  - Independent variables: denoted x1, x2, ..., xk
13 Example
- Let y be the monthly sales revenue for a company. This might be a function of several variables:
  - x1 = advertising expenditure
  - x2 = time of year
  - x3 = state of economy
  - x4 = size of inventory
- We want to predict y using knowledge of x1, x2, x3 and x4.
14 Some Questions
- Which of the independent variables are useful and which are not?
- How could we create a prediction equation to allow us to predict y using knowledge of x1, x2, x3, etc.?
- How good is this prediction?
We start with the simplest case, in which the response y is a function of a single independent variable, x.
15 Model Building
16 A Simple Linear Regression Model
- Explanatory and response variables are numeric
- Relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line)
- Model: $y = \alpha + \beta x + \varepsilon$
  - $\beta > 0$: Positive Association
  - $\beta < 0$: Negative Association
  - $\beta = 0$: No Association
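17 Picturing the Simple Linear Regression Model
[Figure: Regression Plot of Y versus X, marking the intercept $\alpha$ at x = 0, the slope $\beta$ as the change in y per 1-unit change in x, and the error $\varepsilon$ for an observed point]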
18 Simple Linear Regression Analysis
y = actual value of a score; $\hat{y}$ = predicted value
- Variables
  - x = Independent Variable
  - y = Dependent Variable
- Parameters
  - $\alpha$ = y-intercept
  - $\beta$ = slope
  - $\varepsilon$ has a normal distribution with mean 0 and variance $\sigma^2$
19 Simple Linear Regression Model
[Figure: straight line plotted as y versus x, with $\beta$ = slope = $\Delta y / \Delta x$ and $\alpha$ = intercept]
20 The Method of Least Squares
- The equation of the best-fitting line, $\hat{y} = a + bx$, is calculated using a set of n pairs $(x_i, y_i)$.
- We choose our estimates a and b to estimate $\alpha$ and $\beta$ so that the sum of the squared vertical distances of the points from the line, $SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2$, is minimized.
21 Least Squares Estimators
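For reference, the standard least squares estimators for a simple linear regression can be written as below. The shorthand $S_{xx}$ and $S_{xy}$ is an assumed notation for the corrected sum of squares and the sum of cross-products.

```latex
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}
```

Setting the partial derivatives of SSE with respect to a and b to zero and solving the two resulting equations gives these expressions.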
22 Example
- The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades.
Student             1   2   3   4   5   6   7   8   9  10
IQ Scores, x       39  43  21  64  57  47  28  75  34  52
Calculus grade, y  65  78  52  82  92  89  73  98  56  75
Use your calculator to find the sums and sums of squares.
23 Example
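The calculator work for this example can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual computational formulas for the corrected sums of squares; the variable names (`sum_x`, `s_xx`, and so on) are chosen here and are not from the slides.

```python
# Calculus example data from the table.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus grades
n = len(x)

# Sums and sums of squares, as a calculator would report them.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Corrected sums of squares and cross-products.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Least squares slope and intercept.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n

print(f"S_xx = {s_xx:.0f}, S_xy = {s_xy:.0f}, S_yy = {s_yy:.0f}")
print(f"y-hat = {a:.2f} + {b:.4f} x")
```

The fitted line comes out at roughly y-hat = 40.78 + 0.7656x, and S_yy equals the Total SS of 2056 used in the ANOVA table for this problem.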
24 The Analysis of Variance
- The total variation in the experiment is measured by the total sum of squares: Total SS = $\sum (y_i - \bar{y})^2$
- The Total SS is divided into two parts:
  - SSR (sum of squares for regression) measures the variation explained by using x in the model.
  - SSE (sum of squares for error) measures the leftover variation not explained by x.
25 The Analysis of Variance
26 The ANOVA Table
- Total df = n - 1
- Regression df = 1
- Error df = n - 1 - 1 = n - 2
- Mean Squares
  - MSR = SSR/(1)
  - MSE = SSE/(n - 2)
Source      df     SS        MS         F
Regression  1      SSR       SSR/(1)    MSR/MSE
Error       n - 2  SSE       SSE/(n-2)
Total       n - 1  Total SS
27 The Calculus Problem
Source      df  SS         MS         F
Regression   1  1449.9741  1449.9741  19.14
Error        8   606.0259    75.7532
Total        9  2056.0000
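The entries of this table can be reproduced from the raw data. The sketch below assumes the standard identities SSR = S_xy^2 / S_xx and SSE = Total SS - SSR; variable names are illustrative.

```python
# Calculus example data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

total_ss = sum((yi - y_bar) ** 2 for yi in y)  # n - 1 df
ssr = s_xy ** 2 / s_xx                         # 1 df
sse = total_ss - ssr                           # n - 2 df

msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

print(f"Regression  {1:2d}  {ssr:10.4f}  {msr:10.4f}  {f_stat:6.2f}")
print(f"Error       {n - 2:2d}  {sse:10.4f}  {mse:10.4f}")
print(f"Total       {n - 1:2d}  {total_ss:10.4f}")
```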
28 Testing the Usefulness of the Model (The F Test)
- You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.
This test is exactly equivalent to the t-test, with $t^2 = F$.
29 Minitab Output
30 Testing the Usefulness of the Model
- The first question to ask is whether the independent variable x is of any use in predicting y.
- If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, $\beta$, is zero.
31 Testing the Usefulness of the Model
The test statistic is a function of b, our best estimate of $\beta$. Using MSE as the best estimate of the random variation $\sigma^2$, we obtain a t statistic.
32 The Calculus Problem
- Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance?
Reject $H_0$ when |t| > 2.306. Since t = 4.38 falls into the rejection region, $H_0$ is rejected. There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
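The t statistic quoted here can be reproduced as follows. This is a sketch assuming the usual form t = b / sqrt(MSE / S_xx) for testing a zero slope; the critical value 2.306 is the one stated above, and the variable names are illustrative.

```python
import math

# Calculus example data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx                           # estimated slope
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)  # estimate of the error variance
se_b = math.sqrt(mse / s_xx)              # standard error of the slope

t_stat = b / se_b                         # test of H0: slope = 0
T_CRIT = 2.306                            # two-tailed 5% value with 8 df

print(f"t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
print("reject H0:", abs(t_stat) > T_CRIT)
```

Squaring the statistic gives about 19.14, the F value in the ANOVA table for this problem, which illustrates the equivalence of the two tests.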
33 Measuring the Strength of the Relationship
- If the independent variable x is useful in predicting y, you will want to know how well the model fits.
- The strength of the relationship between x and y can be measured using the correlation coefficient r and the coefficient of determination, $r^2 = \dfrac{SSR}{\text{Total SS}}$.
34 Measuring the Strength of the Relationship
- Since Total SS = SSR + SSE, $r^2$ measures
  - the proportion of the total variation in the responses that can be explained by using the independent variable x in the model.
  - the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean y-bar to estimate y.
For the calculus problem, $r^2$ = .705 or 70.5%, meaning that 70.5% of the variability of the calculus scores can be explained by the model.
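A short check of the 70.5% figure, assuming the coefficient of determination is computed as SSR divided by Total SS (variable names are illustrative):

```python
# Calculus example data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
total_ss = sum((yi - y_bar) ** 2 for yi in y)

ssr = s_xy ** 2 / s_xx        # variation explained by the regression
r_squared = ssr / total_ss    # proportion of total variation explained

print(f"r^2 = {r_squared:.3f}")
```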
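35 Estimation and Prediction
Confidence interval for the average value of y when x = x0: $\hat{y} \pm t_{\alpha/2} \sqrt{MSE \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
Prediction interval for a particular value of y when x = x0: $\hat{y} \pm t_{\alpha/2} \sqrt{MSE \left( 1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
Here $S_{xx} = \sum (x_i - \bar{x})^2$ and $t_{\alpha/2}$ is the t-table value with n - 2 df.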
36 The Calculus Problem
- Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.
37 The Calculus Problem
- Predict the calculus grade for a particular student whose IQ score is 50 with a 95% prediction interval.
Notice how much wider this interval is!
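Both intervals at x0 = 50 can be sketched numerically. The code below assumes the standard simple-regression interval formulas (the prediction interval adds 1 inside the square root) and uses 2.306, the two-tailed 5% t value with 8 df; variable names are illustrative.

```python
import math

# Calculus example data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = y_bar - b * x_bar
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)

x0 = 50
y_hat = a + b * x0     # point estimate for both E(y) and y at x0
T_CRIT = 2.306         # t_.025 with 8 df

leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
half_ci = T_CRIT * math.sqrt(mse * leverage)        # mean response
half_pi = T_CRIT * math.sqrt(mse * (1 + leverage))  # single new observation

print(f"y-hat at x0 = 50: {y_hat:.2f}")
print(f"95% CI: {y_hat - half_ci:.2f} to {y_hat + half_ci:.2f}")
print(f"95% PI: {y_hat - half_pi:.2f} to {y_hat + half_pi:.2f}")
```

The confidence interval is about 79.06 plus or minus 6.55, while the prediction interval is about 79.06 plus or minus 21.11, which shows how much wider the second interval is.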
38 Minitab Output
- Green prediction bands are always wider than red confidence bands.
- Both intervals are narrowest when x = x-bar.
39 Estimation and Prediction
- Once you have
  - determined that the regression line is useful
  - used the diagnostic plots to check for violation of the regression assumptions.
- You are ready to use the regression line to
  - Estimate the average value of y for a given value of x
  - Predict a particular value of y for a given value of x.
40 Estimation and Prediction
- The best estimate of either E(y) or y for a given value x = x0 is $\hat{y} = a + b x_0$.
- Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
41 Regression Assumptions
- Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.
Assumptions
- The relationship between x and y is linear, given by $y = \alpha + \beta x + \varepsilon$.
- The random error terms $\varepsilon$ are independent and, for any value of x, have a normal distribution with mean 0 and constant variance, $\sigma^2$.
42 Diagnostic Tools
- Normal probability plot or histogram of residuals
- Plot of residuals versus fit or residuals versus variables
- Plot of residuals versus order
43 Residuals
- The residual error is the leftover variation in each data point after the variation explained by the regression model has been removed.
- If all assumptions have been met, these residuals should be normal, with mean 0 and variance $\sigma^2$.
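To make the idea concrete, the sketch below computes the residuals for the calculus example as observed minus fitted values. It assumes the least squares line fitted to that data; the names `fitted` and `residuals` are illustrative. The fitted/residual pairs it prints are the values that would be plotted in a residuals-versus-fits diagnostic.

```python
# Calculus example data.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Least squares residuals sum to zero (up to rounding error),
# and their sum of squares is SSE.
sse = sum(e ** 2 for e in residuals)
s2 = sse / (n - 2)  # estimate of the error variance

for fi, e in zip(fitted, residuals):
    print(f"fit = {fi:6.2f}   residual = {e:7.2f}")
print(f"sum of residuals = {sum(residuals):.6f}, SSE = {sse:.4f}, s^2 = {s2:.4f}")
```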
44 Normal Probability Plot
- If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
- If not, you will often see the pattern fail in the tails of the graph.
45 Residuals versus Fits
- If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
- If not, you will see a pattern in the residuals.