Linear regression is a procedure that identifies the relationship between independent variables and a dependent variable

About This Presentation
Slides: 32
Provided by: Zvi58

Transcript and Presenter's Notes


1
Chapter 16
  • Linear regression is a procedure that identifies the
    relationship between independent variables and a
    dependent variable.
  • This relationship helps reduce the unexplained
    variation in the dependent variable, and thus
    provides better predictions of its future values.

3
The Simple linear regression model
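  • The model is $y = \beta_0 + \beta_1 x + \varepsilon$,
    where $\beta_0 + \beta_1 x$ is the deterministic part
    and $\varepsilon$ is the random error.
  • We estimate the deterministic part by finding the
    line with the best fit.
  • Best fit is defined as the minimum sum of squared
    errors.
  • An error is the difference between the actual value
    and the line value for a given x.
  • The analysis yields the estimates
    $b_1 = \mathrm{cov}(x, y) / s_x^2$ and
    $b_0 = \bar{y} - b_1 \bar{x}$.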
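The least-squares idea above can be sketched in a few lines of Python. The data points and the function names `fit_line` and `sse` are hypothetical and serve only as an illustration; they do not come from the presentation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1) for the line that minimizes the sum of squared errors."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors between actual values and line values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(b0, b1, sse(xs, ys, b0, b1))
```

Moving the intercept or slope away from the fitted values should never lower the sum of squared errors, which is what "best fit" means here.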

4
Problem 6
  • Are the breakdown costs of welding machines
    related to their age?
  • From the data, answer the following:
  • Find the sample regression line.
  • What is the coefficient of determination?
    Interpret it.
  • Are machine age and monthly repair costs linearly
    related?
  • Is the fit good enough to use the model to
    predict the monthly repair costs of a
    120-month-old machine?
  • Make the prediction.

5
Problem 6
Data
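  • From Excel we get
  • Cov(age, cost) = 936.82; mean age $\bar{x}$ = 113.35;
    $s_x^2$ = 378.77; mean cost $\bar{y}$ = 395.21;
    $s_y^2$ = 4094.79.
  • $b_1 = \mathrm{cov}(x, y) / s_x^2 = 936.82 / 378.77 = 2.4733$
  • $b_0 = \bar{y} - b_1 \bar{x} = 395.21 - 2.4733(113.35) = 114.86$
  • The regression line: $\hat{y} = 114.86 + 2.4733x$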
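As a quick check, the slope and intercept can be recomputed from the summary statistics on this slide. This is a minimal sketch that assumes 378.77 is the sample variance of age, which is a reading of the slide rather than something it states; the variable names are made up for illustration.

```python
# Summary statistics as read from the slide
cov_xy = 936.82   # Cov(age, cost)
var_x = 378.77    # assumed: sample variance of age
mean_x = 113.35   # mean age
mean_y = 395.21   # mean cost

b1 = cov_xy / var_x          # slope
b0 = mean_y - b1 * mean_x    # intercept

print(round(b1, 4), round(b0, 2))
```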

6
Problem 6
  • Coefficient of determination:
    $R^2 = [\mathrm{cov}(x, y)]^2 / (s_x^2 s_y^2)$
  • In this case
    $R^2 = (936.82)^2 / [(378.77)(4094.79)] = 0.5659$
  • 56.59% of the variation in costs is explained
    by the model (by the different ages).
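The 56.59% figure can be reproduced from the Problem 6 summary statistics. The sketch below assumes that 378.77 and 4094.79 are the sample variances of age and cost respectively, which is an interpretation of the data slide.

```python
cov_xy = 936.82   # Cov(age, cost)
var_x = 378.77    # assumed: sample variance of age
var_y = 4094.79   # assumed: sample variance of cost

# Coefficient of determination for simple linear regression
r_squared = cov_xy ** 2 / (var_x * var_y)
print(round(r_squared, 4))
```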

7
Problem 6
  • Is there a linear relationship between monthly
    costs and machine age?
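  • We need to test whether $\beta_1$ is not equal to zero.
  • $H_0: \beta_1 = 0$; $H_1: \beta_1 \neq 0$
  • In this case
  • $t = (b_1 - \beta_1) / s_{b_1} = (2.47 - 0) / 0.5106 = 4.837$
  • The rejection region is $t > t_{\alpha/2}$ or
    $t < -t_{\alpha/2}$, with n-2 degrees of freedom.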

Can be calculated separately
8
Problem 6
Data
The p-value < alpha
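The t test for the slope can be sketched as follows. The sample size is not given in this transcript, so the critical value is left as an input to be looked up for n-2 degrees of freedom; the value 2.0 used below is only a placeholder, and the function names are hypothetical.

```python
def t_statistic(b1, se_b1, beta1_null=0.0):
    """Test statistic for H0: beta1 = beta1_null."""
    return (b1 - beta1_null) / se_b1

def reject_two_sided(t, t_crit):
    """True if t falls in the two-sided rejection region."""
    return abs(t) > t_crit

t = t_statistic(2.47, 0.5106)   # values from the slide
print(round(t, 3))

# Placeholder critical value (assumption); use t_{alpha/2, n-2} from a table
print(reject_two_sided(t, t_crit=2.0))
```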
9
Problem 6
  • We need to forecast the expected cost for a
    120-month-old machine.
  • The equation provides a point prediction:
    Cost = 114.86 + 2.4733(120) = 411.65.
  • The prediction interval (use Data Analysis Plus):
    LCL = 318.12, UCL = 505.18.
  • What is the prediction for the average monthly
    repair cost of all machines that are 120 months old?
  • To answer this question, construct the confidence
    interval (notice, not the prediction interval!).
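The point prediction is a direct evaluation of the regression line. A minimal sketch, with a hypothetical function name; the interval limits are simply copied from the slide, since computing them needs the standard error, which is not in this transcript.

```python
def predict_cost(age_months, b0=114.86, b1=2.4733):
    """Point prediction of monthly repair cost from the Problem 6 line."""
    return b0 + b1 * age_months

cost_120 = predict_cost(120)
print(round(cost_120, 2))

# Prediction interval limits as reported on the slide
lcl, ucl = 318.12, 505.18
print(lcl < cost_120 < ucl)
```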

10
Chapter 17
  • The multiple regression model allows more than
    one independent variable to explain the values of
    the dependent variable.
  • We assess the model as before, using
  • t tests for linear relationships between the
    independent variables and the dependent variable
    (tested one at a time)
  • An F test for the overall usefulness of the model
  • The coefficient of determination for the fit.

11
Problem 7
  • When a company buys another company it is not
    unusual that some workers are terminated.
  • A buyout contract between Laurier Comp and
    Western Comp required that Laurier provide a
    severance package to the Western workers it fired,
    equivalent to the packages offered to Laurier
    workers.
  • It is suggested that severance is determined by
    three factors: age, length of service, and pay.
  • Bill Smith, a Western employee, is offered 5
    weeks of severance pay when his employment is
    terminated.
  • Based on the data provided by Laurier
    (Xr19-05.xls) about severance offered to 50 of
    its employees in the past, answer the following
    questions.

12
Problem 7 - continued
  • Determine the regression equation. Interpret the
    coefficients.
  • Comment on how well the model fits the data.
  • Do all the independent variables belong in the
    model?
  • Does Laurier meet its obligation to Bill Smith?

13
Problem 8
  • A linear regression model for life longevity
  • Insurance companies are interested in predicting
    life longevity of their customers.
  • Data for 100 deceased male customers were
    collected, and a regression model was run.
  • The model studied was
  • Longevity = $\beta_0 + \beta_1$ MotherAge + $\beta_2$ FatherAge
    + $\beta_3$ GrandM + $\beta_4$ GrandF + $\varepsilon$

14
Problem 8
Coefficient of determination
The equation
15
Problem 8
Overall usefulness: $H_0$: all $\beta_i = 0$; $H_1$: at least
one $\beta_i \neq 0$. Significance F (p-value) =
4.86(10^-27). Reject $H_0$. The model is useful.
16
Problem 8
Mother's age and father's age at death have
strong linear relationships to an individual's
age at death. Grandparents' ages at death are not
good predictors of an individual's age at
death.
17
Chapter 18.2
  • Dummy variables help include qualitative data in
    a regression model.
  • If qualitative data can be categorized by n
    categories, there are n-1 dummy variables needed
    to express all the categories.
  • Dummy variables take on the values 0 or 1.
  • $X_i = 0$ if the data point in question does not
    belong to category i
  • $X_i = 1$ if the data point in question belongs to
    category i.
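The n-1 rule can be sketched in Python as follows. The function name `dummy_encode` is hypothetical, and the machine types are borrowed from Problem 9 purely as an example; the category left out (the reference category) is represented by all dummies being 0.

```python
def dummy_encode(value, categories, reference):
    """Return n-1 dummy variables (0 or 1) for a value among n categories.

    The reference category gets no dummy of its own and is encoded
    as all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1 if value == c else 0 for c in categories if c != reference]

categories = ["Welding", "Lathe", "Stamping"]
for machine in categories:
    # Two dummies (W, L) are enough to express three categories
    print(machine, dummy_encode(machine, categories, reference="Stamping"))
```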

18
Problem 9
  • In problem 6 we studied the relationship between
    age of welding machines and breakdown costs.
  • This study was expanded. It now also includes
    lathe machines and stamping machines. See the data
    file. Code for machine type: 1 = Welding, 2 = Lathe,
    3 = Stamping.
  • Answer the following:
  • Develop a regression model.
  • Interpret the coefficients.
  • Can we conclude that welding machines cost more
    to repair than stamping machines?
  • Predict the monthly cost to repair an 85-month-old
    lathe machine.

19
Problem 9
  • First we need to prepare the input data

Original data
20
Problem 9
  • Run the multiple regression

21
Problem 9
Cost = 119.25 + 2.538 Age - 11.755 W - 199.37 L

Repair cost increases on average by 2.538 for each
additional month of age. The monthly repair cost
for a welding machine is 11.755 lower than for a
stamping machine of the same age. However, this
result is not significant (p-value = .55). There is
insufficient evidence in the sample to support the
hypothesis that there is any difference between
repair costs of welding machines and stamping
machines. The monthly repair cost for a lathe
machine is 199.37 lower than for a stamping machine
of the same age. This result is significant.

Note the reference line (for the stamping
machine): Cost = 119.25 + 2.538 Age
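The estimated equation can be evaluated for any machine type by setting the dummy variables W and L. A minimal sketch with a hypothetical function name, applied to the 85-month-old lathe machine that Problem 9 asks about:

```python
def predict_repair_cost(age_months, machine):
    """Monthly repair cost from the Problem 9 equation.

    Stamping is the reference category (W = 0, L = 0).
    """
    w = 1 if machine == "welding" else 0
    l = 1 if machine == "lathe" else 0
    return 119.25 + 2.538 * age_months - 11.755 * w - 199.37 * l

lathe_85 = predict_repair_cost(85, "lathe")
stamping_85 = predict_repair_cost(85, "stamping")
print(round(lathe_85, 2), round(stamping_85, 2))
```

The difference between the two predictions equals the lathe coefficient, 199.37, regardless of age.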
22
Chapter 15
  • We test the hypotheses that a set of data belongs
    to certain distributions
  • The multinomial distribution
  • The normal distribution
  • We also study whether two variables are dependent
    or not.
  • We apply a tool called a Chi-squared test

23
The multinomial experiment
  • The multinomial experiment is an extension of the
    binomial experiment.
  • Characteristics
  • There are n independent trials.
  • Each trial can result in one of k possible
    outcomes.
  • Outcome i has probability $p_i$ in each trial
    (i = 1, ..., k).
  • We test whether the sample gathered supports the
    hypothesis that $p_1, p_2, \ldots, p_k$ are equal to
    specified values. This test is called the
    goodness-of-fit test.

24
Problem 1
  • To determine whether a single die is balanced, or
    fair, the die was rolled 600 times. (See
    Xr15-09.xls).
  • Is there sufficient evidence at the 5% significance
    level to allow you to conclude that the die is
    not fair?

25
Problem 1
  • The hypotheses
  • $H_0: p_1 = p_2 = \cdots = p_6 = 1/6$;
    $H_1$: at least one $p_i$ is not 1/6.
  • Build a rejection region
  • In our case $\chi^2 > \chi^2_{\alpha, 5}$

26
Problem 1
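  • We calculate $\chi^2$ as follows:
    $\chi^2 = \sum_{i=1}^{k} (f_i - e_i)^2 / e_i$
  • In our case $e_1 = e_2 = \cdots = e_6 = 600(1/6) = 100$

From the file we have $f_1 = 114$, $f_2 = 92$,
$f_3 = 84$, $f_4 = 101$, $f_5 = 107$, $f_6 = 103$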
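The statistic for the die can be computed directly from these frequencies. This sketch takes the frequencies as transcribed (they sum to 601 rather than 600, so one of them may be mis-transcribed) and the expected frequency of 100 from the slide. The critical value 11.07 for a 5% level with 5 degrees of freedom is a standard table value and is not taken from the presentation.

```python
observed = [114, 92, 84, 101, 107, 103]   # as transcribed from the slide
expected = [100] * 6                      # 600 * (1/6) per face

# Chi-squared goodness-of-fit statistic
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

CHI2_CRIT = 11.07   # assumption: table value for alpha = .05, 5 d.f.
reject_h0 = chi2 > CHI2_CRIT
print(chi2, reject_h0)
```

With these numbers the statistic falls below the critical value, so the sketch does not reject the hypothesis that the die is fair.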
27
Contingency table
  • Here we test the relationship between two
    variables. Are they dependent?
  • We build a contingency table and a Chi-Square
    statistic

[Table layout: Variable/Category 1 in r rows; Variable/Category 2 in c columns]
28
A Sample Problem: Contingency table
  • Type of music vs. geographic location
  • A group of 30-year-old people is interviewed to
    determine whether the type of music they prefer is
    somehow related to the geographic location of their
    residence.
  • From the data presented, can we infer that music
    preference is affected by the geographic
    location? Use $\alpha = .10$.
  • $H_0$: Type of music and geographic location are
    independent.
  • $H_1$: Type of music and geographic location are
    dependent.

29
A Sample problem contd.
            Rock   R&B   Country   Classical   Total
Northeast    140    32         5          18     195
South        134    41        52           8     235
West         154    27         8          13     202
Total        428   100        65          39     632

  • $e_{11} = (195)(428)/632 = 132.06$;
    $e_{12} = (195)(100)/632 = 30.85$
  • $e_{23} = (235)(65)/632 = 24.16$
  • $\chi^2 = (140 - 132.06)^2/132.06 + \cdots
    + (52 - 24.16)^2/24.16 + \cdots = 64.92$
  • $\chi^2_{.10, (3-1)(4-1)} = 10.64$; 64.92 > 10.64.
  • Reject the null hypothesis. Type of music and
    geographic location are not independent.
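The contingency-table calculation can be sketched as below, using the observed counts from the slide. Each expected frequency is (row total)(column total)/(grand total); the function name is hypothetical, and the critical value 10.64 is the one quoted on the slide.

```python
observed = [
    [140, 32, 5, 18],    # Northeast: Rock, R&B, Country, Classical
    [134, 41, 52, 8],    # South
    [154, 27, 8, 13],    # West
]

def chi_square_independence(table):
    """Return (chi2, degrees of freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (f - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square_independence(observed)
print(round(chi2, 2), df, chi2 > 10.64)
```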

30
A Sample problem contd.
  • Using data analysis Plus

31
Chi squared test for normality
  • Hypothesize on $\mu$ and $\sigma$ ($\mu = \mu_0$ and
    $\sigma = \sigma_0$).
  • Divide the Z interval into equal-size
    subintervals, e.g. (-2, -1), (-1, 0), (0, 1),
    (1, 2).
  • Determine the corresponding probabilities covered
    by each subinterval, e.g. $p_1 = P(Z < -2)$,
    $p_2 = P(-2 < Z < -1)$.
  • Translate the Z scores to the associated X
    values, e.g. $x_1 = \mu_0 + (-2)\sigma_0$,
    $x_2 = \mu_0 + (-1)\sigma_0$.
  • Find the actual frequency for each subinterval,
    e.g. $f_1$ for the interval below $x_1$, $f_2$ for
    the interval $(x_1, x_2)$.
  • Calculate the expected frequency for each
    interval:
  • $e_1 = n p_1$, $e_2 = n p_2$, ...
  • Build a Chi-squared statistic and perform the
    test.
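The steps above can be sketched with Python's standard library. The sample size, the hypothesized $\mu_0$ and $\sigma_0$, the observed counts, and all function names below are hypothetical and only illustrate the mechanics.

```python
from statistics import NormalDist

Z_CUTS = (-2, -1, 0, 1, 2)

def expected_frequencies(n, cuts=Z_CUTS):
    """Expected counts for (-inf,-2), (-2,-1), ..., (2, inf) under normality."""
    z = NormalDist()
    cdf = [0.0] + [z.cdf(c) for c in cuts] + [1.0]
    probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]   # p1, p2, ...
    return [n * p for p in probs]                       # e_i = n * p_i

def z_to_x(z_cuts, mu0, sigma0):
    """Translate Z cut points into X values: x = mu0 + z * sigma0."""
    return [mu0 + z * sigma0 for z in z_cuts]

def chi_square(observed, expected):
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Hypothetical example: n = 200, mu0 = 50, sigma0 = 10
x_cuts = z_to_x(Z_CUTS, 50, 10)
observed = [6, 25, 70, 66, 28, 5]      # made-up counts per interval
expected = expected_frequencies(200)
print(x_cuts, chi_square(observed, expected))
```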