Title: Linear regression is a procedure that identifies the relationship between independent variables and a dependent variable.
Chapter 16
- Linear regression is a procedure that identifies the relationship between independent variables and a dependent variable.
- This relationship helps reduce the unexplained variation in the dependent variable, and thus provides better predictions of its future values.
The simple linear regression model
- The model is $y = \beta_0 + \beta_1 x + \varepsilon$
- We try to estimate the deterministic part of it, $\beta_0 + \beta_1 x$, by developing the line with the best fit.
- Best fit is defined as the minimum sum of squared errors.
- An error is the difference between the line value and the actual value for a given x.
- The analysis yields $b_1 = \mathrm{cov}(x,y)/s_x^2$ and $b_0 = \bar{y} - b_1\bar{x}$
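The least-squares idea above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own procedure: the function names `fit_line` and `sse` and the toy data are assumptions made for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance and sample variance (both divide by n - 1).
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors between the line values and the actual values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical), scattered around y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, SSE = {sse(xs, ys, b0, b1):.3f}")
```

Nudging either coefficient away from the fitted values should only increase `sse`, which is what "best fit" means here.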
Problem 6
- Are the breakdown costs of welding machines related to their age?
- From the data, answer the following:
  - Find the sample regression line.
  - What is the coefficient of determination? Interpret it.
  - Are machine age and monthly repair costs linearly related?
  - Is the fit good enough to use the model to predict the monthly repair costs of a 120-month-old machine?
  - Make the prediction.
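Problem 6 - continued
Data
- From Excel we get:
  - Cov(age, cost) = 936.82
  - Mean age: $\bar{x} = 113.35$; variance of age: $s_x^2 = 378.77$
  - Mean cost: $\bar{y} = 395.21$; variance of cost: $s_y^2 = 4094.79$
- $b_1 = \mathrm{cov}(x,y)/s_x^2 = 936.82/378.77 = 2.4733$
- $b_0 = \bar{y} - b_1\bar{x} = 395.21 - 2.4733(113.35) = 114.86$
- The regression line: $\hat{y} = 114.86 + 2.4733x$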
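Problem 6 - continued
- Coefficient of determination: $R^2 = [\mathrm{cov}(x,y)]^2 / (s_x^2 s_y^2)$
- In this case $R^2 = 936.82^2 / (378.77 \times 4094.79) = 0.5659$
- 56.59% of the variation in costs is explained by the model (by the different ages).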
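As a quick check of the arithmetic, the slope, intercept and coefficient of determination can be recomputed from the summary statistics alone. This sketch assumes that 378.77 and 4094.79 are the sample variances of age and cost; that reading reproduces the 56.59% figure.

```python
# Summary statistics reported from Excel.
cov_xy = 936.82   # Cov(age, cost)
var_x = 378.77    # assumed: sample variance of age
var_y = 4094.79   # assumed: sample variance of cost
x_bar = 113.35    # mean age (months)
y_bar = 395.21    # mean monthly repair cost

b1 = cov_xy / var_x                         # slope
b0 = y_bar - b1 * x_bar                     # intercept
r_squared = cov_xy ** 2 / (var_x * var_y)   # coefficient of determination

print(f"cost = {b0:.2f} + {b1:.4f} * age, R^2 = {r_squared:.4f}")
```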
Problem 6 - continued
- Is there a linear relationship between monthly costs and machine age?
- We need to test whether $\beta_1$ is not equal to zero.
- $H_0: \beta_1 = 0$; $H_1: \beta_1 \neq 0$
- In this case $t = (b_1 - \beta_1)/s_{b_1} = (2.47 - 0)/0.5106 = 4.837$
- The rejection region is $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$, with $n-2$ degrees of freedom.
- (Can be calculated separately.)
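The test statistic and the two-tailed decision rule can be written as two small functions. This is a sketch: the function names are illustrative, and the critical value `t_crit` has to be looked up for $n-2$ degrees of freedom, since the sample size is not stated here.

```python
def t_statistic(b1, se_b1, beta1_null=0.0):
    """t = (b1 - beta1) / s_b1 for testing H0: beta1 = beta1_null."""
    return (b1 - beta1_null) / se_b1

def rejects_h0(t, t_crit):
    """Two-tailed rule: reject H0 when t > t_crit or t < -t_crit."""
    return abs(t) > t_crit

t = t_statistic(2.47, 0.5106)
print(f"t = {t:.3f}")
# Supply t_crit = t_{alpha/2, n-2} from a t table before calling rejects_h0.
```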
Problem 6 - continued
Data
- The p-value is less than $\alpha$, so $H_0$ is rejected.
Problem 6 - continued
- We need to forecast the expected cost for a 120-month-old machine.
- The equation provides a point prediction: Cost = 114.86 + 2.4733(120) = 411.65
- The prediction interval (use Data Analysis Plus): LCL = 318.12, UCL = 505.18
- What is the prediction for the average monthly repair cost of all machines that are 120 months old?
- To answer this question, construct the confidence interval (notice, not the prediction interval!).
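The difference between the two intervals shows up in the standard textbook formulas. The symbols are assumptions for this sketch and are not given on the slide: $s_\varepsilon$ is the standard error of estimate, $x_g$ is the given value of x (here 120), and n is the sample size.

```latex
% Prediction interval: cost of one particular machine with x = x_g
\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon
  \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)\,s_x^2}}

% Confidence interval: mean cost of all machines with x = x_g
\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon
  \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)\,s_x^2}}
```

Both intervals are centered on the same point prediction; the confidence interval is narrower because it omits the "1 +" term that accounts for the variation of an individual machine around the mean.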
Chapter 17
- The multiple regression model allows more than one independent variable to explain the values of the dependent variable.
- We assess the model as before, using:
  - t tests for linear relationships between the independent variables and the dependent variable (tested one at a time)
  - F test for the overall usefulness of the model
  - Coefficient of determination for the fit.
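For reference, the model and the three assessment tools can be written out in their usual textbook form. The notation is an assumption of this sketch: k independent variables, n observations, SSR the regression sum of squares and SSE the error sum of squares.

```latex
% Multiple regression model
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon

% t test for one coefficient (H0: beta_i = 0), with n - k - 1 degrees of freedom
t = \frac{b_i - \beta_i}{s_{b_i}}

% F test for overall usefulness (H0: beta_1 = ... = beta_k = 0)
F = \frac{SSR / k}{SSE / (n - k - 1)}

% Coefficient of determination
R^2 = 1 - \frac{SSE}{\sum_i (y_i - \bar{y})^2}
```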
Problem 7
- When a company buys another company, it is not unusual that some workers are terminated.
- A buyout contract between Laurier Comp and Western Comp required that Laurier provide a severance package to fired Western workers equivalent to the packages offered to Laurier workers.
- It is suggested that severance is determined by three factors: age, length of service, and pay.
- Bill Smith, a Western employee, is offered 5 weeks of severance pay when his employment is terminated.
- Based on the data provided by Laurier (Xr19-05.xls) about severance offered to 50 of its employees in the past, answer the following questions:
Problem 7 - continued
- Determine the regression equation. Interpret the coefficients.
- Comment on how well the model fits the data.
- Do all the independent variables belong in the model?
- Does Laurier meet its obligation to Bill Smith?
Problem 8
- A linear regression model for life longevity
- Insurance companies are interested in predicting the longevity of their customers.
- Data for 100 deceased male customers were collected, and a regression model was run.
- The model studied was
  $\text{Longevity} = \beta_0 + \beta_1\,\text{MotherAge} + \beta_2\,\text{FatherAge} + \beta_3\,\text{GrandM} + \beta_4\,\text{GrandF} + \varepsilon$
Problem 8 - continued
[Regression output, annotated with the coefficient of determination and the estimated equation]
Problem 8 - continued
- Overall usefulness: $H_0$: all $\beta_i = 0$; $H_1$: at least one $\beta_i \neq 0$.
- Significance F (p-value) = $4.86 \times 10^{-27}$. Reject $H_0$. The model is useful.
Problem 8 - continued
- Mother's age and father's age at death have strong linear relationships to an individual's age at death.
- Grandparents' ages at death are not good predictors of an individual's age at death.
Chapter 18.2
- Dummy variables help include qualitative data in a regression model.
- If qualitative data can be categorized into n categories, n-1 dummy variables are needed to express all the categories.
- Dummy variables take on the values 0 or 1.
  - $X_i = 0$ if the data point in question does not belong to category i
  - $X_i = 1$ if the data point in question belongs to category i.
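The n-1 rule can be illustrated with a small encoder. This is a sketch only: the function name `dummy_encode`, the `is_` prefix and the color categories are hypothetical. The omitted category acts as the reference, represented by all dummies being 0.

```python
def dummy_encode(value, categories, reference):
    """Encode `value` as n-1 indicator (0/1) variables, omitting `reference`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c}": int(value == c) for c in categories if c != reference}

# Hypothetical qualitative variable with n = 3 categories -> 2 dummy variables.
categories = ["red", "green", "blue"]
print(dummy_encode("green", categories, reference="blue"))
print(dummy_encode("blue", categories, reference="blue"))   # reference: all zeros
```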
Problem 9
- In Problem 6 we studied the relationship between the age of welding machines and breakdown costs.
- This study was expanded. It now also includes lathe machines and stamping machines. See the data file. Code for machine type: 1 = Welding, 2 = Lathe, 3 = Stamping.
- Answer the following:
  - Develop a regression model.
  - Interpret the coefficients.
  - Can we conclude that welding machines cost more to repair than stamping machines?
  - Predict the monthly cost to repair an 85-month-old lathe machine.
Problem 9 - continued
- First we need to prepare the input data.
[Table: original data]
- Run the multiple regression.
Problem 9 - continued
- Cost = 119.25 + 2.538 Age - 11.755 W - 199.37 L, where W = 1 for a welding machine and L = 1 for a lathe machine (0 otherwise).
- Repair costs increase on average by 2.538 for each additional month of age.
- The monthly repair cost for a welding machine is 11.755 lower than for a stamping machine of the same age. However, this result is not significant (p-value = .55). There is insufficient evidence in the sample to support the hypothesis that there is any difference between the repair costs of welding machines and stamping machines.
- The monthly repair cost for a lathe machine is 199.37 lower than for a stamping machine of the same age. This result is significant.
- Note the reference line (for the stamping machine): Cost = 119.25 + 2.538 Age
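The last question of the problem, the cost for an 85-month-old lathe machine, follows by plugging into the estimated equation. A minimal sketch, assuming W and L are the welding and lathe dummy variables and stamping is the reference category; the function name is illustrative.

```python
VALID_TYPES = {"welding", "lathe", "stamping"}

def predicted_cost(age_months, machine_type):
    """Point prediction from Cost = 119.25 + 2.538 Age - 11.755 W - 199.37 L."""
    if machine_type not in VALID_TYPES:
        raise ValueError(f"unknown machine type: {machine_type!r}")
    w = 1 if machine_type == "welding" else 0      # dummy for welding
    lathe = 1 if machine_type == "lathe" else 0    # dummy for lathe
    # Stamping is the reference category: both dummies are 0.
    return 119.25 + 2.538 * age_months - 11.755 * w - 199.37 * lathe

lathe_85 = predicted_cost(85, "lathe")
print(f"Predicted monthly cost, 85-month-old lathe: {lathe_85:.2f}")
```

This is a point prediction only; an interval around it would need the standard error from the regression output.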
Chapter 15
- We test the hypotheses that a set of data belongs to certain distributions:
  - The multinomial distribution
  - The normal distribution
- We also study whether two variables are dependent or not.
- We apply a tool called the Chi-squared test.
The multinomial experiment
- The multinomial experiment is an extension of the binomial experiment.
- Characteristics:
  - There are n independent trials.
  - Each trial can result in one of k possible outcomes.
  - There is a probability $p_i$ of a type-i outcome (i = 1, ..., k) in each trial.
- We test whether the sample gathered supports the hypothesis that $p_1, p_2, \ldots, p_k$ are equal to specified values. The test is called the goodness-of-fit test.
Problem 1
- To determine whether a single die is balanced, or fair, the die was rolled 600 times (see Xr15-09.xls).
- Is there sufficient evidence at the 5% significance level to allow you to conclude that the die is not fair?
Problem 1 - continued
- The hypotheses:
  - $H_0: p_1 = p_2 = \cdots = p_6 = 1/6$
  - $H_1$: at least one $p_i$ is not 1/6.
- Build a rejection region:
  - In our case $\chi^2 > \chi^2_{\alpha,5}$ (k - 1 = 5 degrees of freedom)
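Problem 1 - continued
- We calculate $\chi^2$ as follows: $\chi^2 = \sum_{i=1}^{k} (f_i - e_i)^2 / e_i$
- In our case $e_1 = e_2 = \cdots = e_6 = 600(1/6) = 100$
- From the file we have $f_1 = 114$, $f_2 = 92$, $f_3 = 84$, $f_4 = 101$, $f_5 = 107$, $f_6 = 103$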
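The goodness-of-fit statistic for the die can be computed directly from these frequencies. This sketch uses the frequencies exactly as listed above; they sum to 601 rather than 600, so treat the numeric result as illustrative. The critical value 11.07 is the chi-squared table entry for $\alpha = .05$ and 5 degrees of freedom.

```python
observed = [114, 92, 84, 101, 107, 103]   # f1..f6 as listed above
expected = [600 * (1 / 6)] * 6            # e_i = 600(1/6) = 100

# Chi-squared statistic: sum of (f_i - e_i)^2 / e_i over the k = 6 outcomes.
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

CRITICAL_05_DF5 = 11.07   # table value for alpha = .05, k - 1 = 5 d.f.
reject_h0 = chi2 > CRITICAL_05_DF5

print(f"chi-squared = {chi2:.2f}, reject H0: {reject_h0}")
```

With these numbers the statistic falls well below the critical value, so there is not sufficient evidence to conclude that the die is unfair.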
Contingency table
- Here we test the relationship between two variables. Are they dependent?
- We build a contingency table and a Chi-squared statistic.
[Table layout: the categories of variable 1 form r rows; the categories of variable 2 form c columns]
A sample problem: contingency table
- Type of music vs. geographic location
- A group of 30-year-old people is interviewed to determine whether the type of music they prefer is related to the geographic location of their residence.
- From the data presented, can we infer that music preference is affected by geographic location? Use $\alpha = .10$.
- $H_0$: Type of music and geographic location are independent.
- $H_1$: Type of music and geographic location are dependent.
A sample problem contd.
            Rock   R&B   Country   Classical   Total
Northeast    140    32         5          18     195
South        134    41        52           8     235
West         154    27         8          13     202
Total        428   100        65          39     632
- $e_{11} = (195)(428)/632 = 132.06$, $e_{12} = (195)(100)/632 = 30.85$, ...
- $e_{23} = (235)(65)/632 = 24.17$, ...
- $\chi^2 = (140-132.06)^2/132.06 + \cdots + (52-24.17)^2/24.17 + \cdots = 64.92$
- $\chi^2_{.10,(3-1)(4-1)} = 10.64$; $64.92 > 10.64$.
- Reject the null hypothesis. Type of music and geographic location are not independent.
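The whole contingency-table calculation can be reproduced from the observed counts. A minimal sketch in plain Python; the variable names are illustrative, and the critical value 10.64 is the table entry quoted above for $\alpha = .10$ and 6 degrees of freedom.

```python
observed = [
    [140, 32, 5, 18],   # Northeast: Rock, R&B, Country, Classical
    [134, 41, 52, 8],   # South
    [154, 27, 8, 13],   # West
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: (row total)(column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

CRITICAL_10_DF6 = 10.64
print(f"chi-squared = {chi2:.2f}, d.f. = {df}, reject H0: {chi2 > CRITICAL_10_DF6}")
```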
Chi-squared test for normality
- Hypothesize on $\mu$ and $\sigma$ ($\mu = \mu_0$ and $\sigma = \sigma_0$).
- Divide the Z range into equal-size sub-intervals, e.g. (-2, -1), (-1, 0), (0, 1), (1, 2).
- Determine the probability covered by each sub-interval, e.g. $p_1 = P(Z < -2)$, $p_2 = P(-2 < Z < -1)$.
- Translate the Z scores to the associated X values, e.g. $x_1 = \mu_0 + (-2)\sigma_0$, $x_2 = \mu_0 + (-1)\sigma_0$.
- Find the actual frequency for each sub-interval, e.g. $f_1$ for the interval below $x_1$, $f_2$ for the interval $(x_1, x_2)$.
- Calculate the expected frequency for each interval: $e_1 = np_1$, $e_2 = np_2$, ...
- Build a Chi-squared statistic and perform the test.
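The steps above can be sketched with the standard library only. The function names, the cut points at Z = -2, -1, 0, 1, 2 and the example values of $\mu_0$, $\sigma_0$ and n are assumptions for illustration; the observed frequencies would come from the actual sample.

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_probabilities(cuts=(-2, -1, 0, 1, 2)):
    """Probabilities of the intervals (-inf, c1), (c1, c2), ..., (ck, +inf)."""
    edges = [-math.inf, *cuts, math.inf]
    return [normal_cdf(hi) - normal_cdf(lo) for lo, hi in zip(edges, edges[1:])]

def x_cutoffs(mu0, sigma0, cuts=(-2, -1, 0, 1, 2)):
    """Translate Z cut points to X values: x = mu0 + z * sigma0."""
    return [mu0 + z * sigma0 for z in cuts]

def expected_frequencies(n, cuts=(-2, -1, 0, 1, 2)):
    """e_i = n * p_i for each interval."""
    return [n * p for p in interval_probabilities(cuts)]

def chi_squared(observed, expected):
    """Sum of (f_i - e_i)^2 / e_i over all intervals."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Hypothetical hypothesized parameters and sample size.
print(x_cutoffs(mu0=50, sigma0=10))
print([round(e, 2) for e in expected_frequencies(n=100)])
```

Counting how many sample values fall between consecutive cutoffs gives the observed frequencies to pass to `chi_squared` together with the expected ones.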