Title: QMS 6351 Statistics and Research Methods Analyzing the Relationship Between Two and More Variables Chapter 2.4 Chapter 3.5 Chapter 14 (14.1-14.3, 14.6) Chapter 15 (15.1-15.3, 15.7)
1QMS 6351 Statistics and Research Methods
Analyzing the Relationship Between Two and More Variables
Chapter 2.4, Chapter 3.5, Chapter 14 (14.1-14.3, 14.6), Chapter 15 (15.1-15.3, 15.7)
2- Chapter 2
- Section 2.4
- Crosstabulations and Scatter Diagrams
3Crosstabulations
- Crosstabulation is a method that can be used to summarize the data for two variables simultaneously.
- Typically, the table's left and top margin labels define the classes for the two variables.
- Crosstabulation can provide insight about the relationship between the variables.
4Crosstabulations
- Crosstabulation of Enrollment by Gender and Degree Level at a University (column percentages in parentheses)

          Degree Level
Gender    Undergraduate    Graduate        Doctorate      Total
Male      7341 (47.0)      1937 (53.4)     172 (59.1)     9450 (48.3)
Female    8294 (53.0)      1688 (46.6)     119 (40.9)     10101 (51.7)
Total     15635 (100.0)    3625 (100.0)    291 (100.0)    19551 (100.0)
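The percentages in parentheses can be recomputed from the counts. The following is a minimal sketch, assuming they are column percentages (each cell divided by its column total); the function name `column_percentages` is illustrative and not from the slides.

```python
# Counts from the enrollment crosstabulation (rows: gender, columns: degree level).
counts = {
    "Male": {"Undergraduate": 7341, "Graduate": 1937, "Doctorate": 172},
    "Female": {"Undergraduate": 8294, "Graduate": 1688, "Doctorate": 119},
}


def column_percentages(table):
    """Return each cell as a percentage of its column total, rounded to 1 decimal."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: round(100 * row[c] / totals[c], 1) for c in columns}
        for r, row in table.items()
    }


pct = column_percentages(counts)
for gender, row in pct.items():
    print(gender, row)
```

The results can be compared cell by cell with the values in parentheses in the table.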
5Scatter Diagram
- A scatter diagram is a graphical presentation of
the relationship between two quantitative
variables.
6Scatter Diagram
- Scatter Diagram for Engine Size and Gas Mileage of Eight Automobiles
[Figure: scatter diagram; x-axis: Engine Size (number of cylinders), 0 to 10; y-axis: In-City Gas Mileage (mpg), 10 to 30]
7Example Reed Auto Sales
- Reed Auto periodically has a special week-long sale. As part of the advertising campaign, Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales, showing the number of TV ads run and the number of cars sold in each sale, are shown below. Develop a scatter diagram.
8Example (cont.)
Number of TV Ads    Number of Cars Sold
1                   14
3                   24
2                   18
1                   17
3                   27
10- Chapter 3
- Section 3.5
- Measures of Association Between Two Variables
- Covariance
- Correlation Coefficient
11- Covariance is a descriptive measure of the linear association between two variables.
- The value of covariance depends upon units of measurement.
- A measure of the relationship between two variables that avoids this difficulty is the correlation coefficient.
12Covariance
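- If the data sets are samples, the covariance is denoted by $s_{xy}$: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
- If the data sets are populations, the covariance is denoted by $\sigma_{xy}$: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$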
13Example Reed Auto Sales
- Sample covariance:
- $s_{xy} = 20/4 = 5$ (autos × TV ads)
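The value 20/4 = 5 can be checked directly from the five Reed Auto observations. This is a minimal sketch of the sample covariance (divisor n - 1); the function and variable names are illustrative.

```python
# Reed Auto Sales sample: TV ads run and cars sold in 5 previous sales.
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (n - 1)


s_xy = sample_covariance(tv_ads, cars_sold)
print(s_xy)  # 5.0
```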
14Correlation Coefficient
- If the data sets are samples, the correlation coefficient is denoted by $r_{xy}$: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
- If the data sets are populations, the correlation coefficient is denoted by $\rho_{xy}$: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
15Correlation Coefficient
- The coefficient can take on values between -1 and +1.
- If $r$ or $\rho$ is near -1, it indicates a strong negative linear relationship.
- If $r$ or $\rho$ is near +1, it indicates a strong positive linear relationship.
16Example Reed Auto Sales
- $s_x^2 = 4/4 = 1$, so $s_x = 1$.
- $s_y^2 = 114/4 = 28.5$, so $s_y = 5.3385$.
- Correlation coefficient:
- $r_{xy} = 5/(1 \times 5.3385) = 0.936586$.
- A strong positive linear relationship.
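The same arithmetic can be sketched in code: the sample correlation is the sample covariance divided by the product of the two sample standard deviations. Names are illustrative.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def sample_correlation(x, y):
    """Return r_xy = s_xy / (s_x * s_y) using n - 1 divisors throughout."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


r_xy = sample_correlation(tv_ads, cars_sold)
print(round(r_xy, 6))  # 0.936586
```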
17- If $r$ or $\rho$ = +1, it is a case of perfect positive linear correlation (all points are on a positively sloped straight line).
- If $r$ or $\rho$ = -1, it is a case of perfect negative linear correlation (all points are on a negatively sloped straight line).
- If $r$ or $\rho$ = 0, there is no linear correlation between the two variables (the points are scattered all over the diagram).
18- We would like to find an analytical/mathematical expression (a formula) for the relationship between TV ads and auto sales.
- Both the scatter diagram and the correlation coefficient suggest that there is a linear relationship between TV ads and auto sales.
20Chapter 14 Outline
- The simple linear regression model
- The Least Squares Method
- The coefficient of determination
21Regression analysis
- Regression analysis is a description or the study
of the nature of the relationship between
variables (for example, linear regression,
non-linear regression, simple regression,
multiple regression).
22Functional vs. stochastic relationship
- Functional (deterministic) relationship: the variables are perfectly related; the relationship holds for every observation. Examples: the area of a square in mathematics, total revenue in economics.
- Statistical (stochastic) relationship: the variables are not perfectly related; the relationship holds on average, not for each observation. Example: the marginal propensity to consume (MPC) in economics.
23The simple linear regression
- The simple linear regression model is a mathematical way of stating the linear statistical relationship between two variables.
- The variable being predicted is called the dependent variable.
- The variable being used to predict the value of the dependent variable is called the independent variable.
24Regression equation
- Regression equation: the equation that describes how the mean value (that is, the average) of the dependent variable (y) is related to the independent variable(s) (x).
- Simple Linear Regression Equation:
- $E(y) = \beta_0 + \beta_1 x$
- $\beta_0$ and $\beta_1$ are referred to as the parameters of the model.
25Regression model
- Regression model: the equation that describes how the dependent variable is related to the independent variable(s) and an error term.
- Simple Linear Regression Model:
- $y = \beta_0 + \beta_1 x + \varepsilon$
- $\varepsilon$ (the Greek letter epsilon) is a random variable referred to as the error term. It absorbs the impact of all other variables on y.
26Estimated regression equation
- We will use a sample to estimate the population parameters $\beta_0$ and $\beta_1$. Sample statistics (denoted $b_0$ and $b_1$) serve as estimates of $\beta_0$ and $\beta_1$. Substituting the values of $b_0$ and $b_1$ in the regression equation, we obtain the estimated regression equation.
- Estimated Simple Linear Regression Equation:
- $\hat{y} = b_0 + b_1 x$
- $\hat{y}$ is the estimate of the mean value of y for a given value of x.
27The Least Squares Method
- Least Squares Criterion:
- $\min \sum (y_i - \hat{y}_i)^2$
- where
- $y_i$ = observed value of the dependent variable for the i-th observation
- $\hat{y}_i$ = estimated value of the dependent variable for the i-th observation
28The Least Squares Method
- Slope for the Estimated Regression Equation:
- $b_1 = \frac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$
- This formula appears in the footnote on p. 568.
- y-Intercept for the Estimated Regression Equation:
- $b_0 = \bar{y} - b_1 \bar{x}$
29Example Reed Auto Sales
- Slope for the Estimated Regression Equation:
- $b_1 = \frac{220 - (10)(100)/5}{24 - (10)^2/5} = \frac{20}{4} = 5$
- y-Intercept for the Estimated Regression Equation:
- $b_0 = 20 - 5(2) = 10$
- Estimated Regression Equation:
- $\hat{y} = 10 + 5x$
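The slope and intercept above can be reproduced from the raw data. This is a minimal sketch of the computational form of the least squares formulas used in the example (sums of x, y, xy, and x squared); `least_squares_line` is an illustrative name.

```python
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]


def least_squares_line(x, y):
    """Return (b0, b1) for the estimated regression equation y-hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 220 for the Reed data
    sum_x2 = sum(xi * xi for xi in x)              # 24 for the Reed data
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares_line(tv_ads, cars_sold)
print(b0, b1)  # 10.0 5.0
```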
30Interpretation
- $b_0$ is the expected value of y when $x = 0$ (this may be meaningless in practice). In our example, when the number of TV ads is zero, the expected number of cars sold is 10.
- $b_1$ is the change in the expected value of y when x increases by one unit of its measurement, ceteris paribus. In our example, when the number of TV ads increases by 1, the number of cars sold is expected to increase by 5.
32SST, SSR, SSE
- Relationship Among SST, SSR, SSE:
- SST = SSR + SSE
- SST: total variation in y
- SSR: variation in y due to x
- SSE: variation in y due to all other factors
33
x    y    yhat   y-ybar   (y-ybar)^2   yhat-ybar   (yhat-ybar)^2   y-yhat   (y-yhat)^2
1    14   15     -6       36           -5          25              -1       1
3    24   25      4       16            5          25              -1       1
2    18   20     -2        4            0           0              -2       4
1    17   15     -3        9           -5          25               2       4
3    27   25      7       49            5          25               2       4
xbar = 2, ybar = 20; column totals: SST = 114, SSR = 100, SSE = 14
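The three column totals in the table can be recomputed as follows. This is a minimal sketch that takes the estimated equation y-hat = 10 + 5x from the example as given; variable names are illustrative.

```python
tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]

y_bar = sum(cars_sold) / len(cars_sold)
y_hat = [10 + 5 * x for x in tv_ads]  # fitted values from the estimated equation

sst = sum((y - y_bar) ** 2 for y in cars_sold)             # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)               # explained by x
sse = sum((y - yh) ** 2 for y, yh in zip(cars_sold, y_hat))  # left unexplained

print(sst, ssr, sse)  # 114.0 100.0 14.0
```

The identity SST = SSR + SSE holds here because the fitted values come from the least squares line; it is not guaranteed for an arbitrary line.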
34Coefficient of Determination
- The coefficient of determination represents the proportion of SST that is explained by the use of the regression model.
- Coefficient of Determination:
- $r^2 = SSR/SST$
- $0 \le r^2 \le 1$
35Example Reed Auto Sales
- Coefficient of Determination:
- $r^2 = SSR/SST = 100/114 = 0.877193$
- The regression relationship is very strong: 87.7% of the variation in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
36The Correlation Coefficient
- The correlation coefficient measures the strength of the linear association between two variables.
- The sample correlation coefficient is plus or minus the square root of the coefficient of determination; it takes the sign of $b_1$.
- Sample correlation coefficient:
- $r_{xy} = (\text{sign of } b_1)\sqrt{r^2} = +\sqrt{0.877193} = 0.936586$
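A short sketch of this relationship, using the sums of squares and slope from the Reed Auto example. `math.copysign` is used here only as one convenient way to attach the sign of the slope.

```python
import math

ssr, sst = 100, 114  # sums of squares from the Reed Auto example
b1 = 5               # estimated slope

r2 = ssr / sst
r_xy = math.copysign(math.sqrt(r2), b1)  # square root of r^2, signed like b1

print(round(r2, 6), round(r_xy, 6))  # 0.877193 0.936586
```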
37SUMMARY OUTPUT

Regression Statistics
Multiple R           0.936586
R Square             0.877193
Adjusted R Square    0.836257
Standard Error       2.160247
Observations         5

ANOVA
              df    SS     MS          F           Significance F
Regression    1     100    100         21.42857    0.018986
Residual      3     14     4.666667
Total         4     114

               Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept      10             2.366432         4.225771    0.024236    2.468958    17.53104
X Variable 1   5              1.080123         4.6291      0.018986    1.562565    8.437435
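Several entries of this output can be traced back to the sums of squares. The sketch below uses the standard simple-regression formulas for the standard error of the estimate, the F statistic, and the coefficient standard errors; these formulas are not stated on the slides, so treat them as an assumption brought in for illustration. The p-values and confidence limits need the t and F distributions and are not reproduced here.

```python
import math

tv_ads = [1, 3, 2, 1, 3]
cars_sold = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0  # estimated regression equation from the example

n = len(tv_ads)
x_bar = sum(tv_ads) / n
y_bar = sum(cars_sold) / n
s_xx = sum((x - x_bar) ** 2 for x in tv_ads)

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(tv_ads, cars_sold))
ssr = sum(((b0 + b1 * x) - y_bar) ** 2 for x in tv_ads)

mse = sse / (n - 2)          # residual mean square, df = n - 2
std_error = math.sqrt(mse)   # "Standard Error" under Regression Statistics
f_stat = (ssr / 1) / mse     # MSR / MSE with 1 regression degree of freedom
se_b1 = std_error / math.sqrt(s_xx)
se_b0 = std_error * math.sqrt(1 / n + x_bar ** 2 / s_xx)
t_b1 = b1 / se_b1
t_b0 = b0 / se_b0

print(round(std_error, 6), round(f_stat, 5))
print(round(se_b0, 6), round(t_b0, 6))
print(round(se_b1, 6), round(t_b1, 4))
```

In simple regression the F statistic equals the square of the slope's t statistic, which is why the two share the same p-value in the output.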
38Chapter 15 Outline
- The multiple linear regression model
- The Least Squares Method
- The multiple coefficient of determination
- Categorical independent variables
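39- Multiple Regression Equation: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
- Multiple Regression Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$
- Estimated Multiple Regression Equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$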
40Multiple coefficient of determination
- $R^2 = SSR/SST$
- Adjusted multiple coefficient of determination:
- $R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$
- where n is the number of observations and p is the number of independent variables.
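As a quick check of the adjustment, the sketch below applies the standard formula $R_a^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$ to the R Square values reported in the two Excel outputs in this deck. The function name is illustrative.

```python
def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for the number of independent variables p and sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Reed Auto Sales: n = 5 observations, p = 1 independent variable.
print(round(adjusted_r_squared(0.877193, 5, 1), 6))

# Programmer Salary Survey: n = 20 observations, p = 2 independent variables.
print(round(adjusted_r_squared(0.834179, 20, 2), 6))
```

Both results can be compared with the "Adjusted R Square" rows of the corresponding Excel outputs.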
41Example Programmer Salary Survey
- A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine whether salary was related to the years of experience and the score on the firm's programmer aptitude test.
- The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers are shown on the next slide.
42
Exper. (Yrs.)   Test Score   Salary ($000s)
4               78           24.0
7               100          43.0
1               86           23.7
5               82           34.3
8               86           35.8
10              84           38.0
0               75           22.2
1               80           23.1
6               83           30.0
6               91           33.0
9               88           38.0
2               73           26.6
10              75           36.2
5               81           31.6
6               74           29.0
8               87           34.0
4               79           30.1
6               94           33.9
3               70           28.2
3               89           30.0
43- Suppose we believe that salary (y) is related to the years of experience ($x_1$) and the score on the programmer aptitude test ($x_2$) by the following regression model:
- $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
- where
- $y$ = annual salary (in thousands of dollars),
- $x_1$ = years of experience,
- $x_2$ = score on programmer aptitude test.
44Solving for the Estimates of $\beta_0$, $\beta_1$, $\beta_2$
- Excel's Regression Equation Output
Note: Columns F-I are not shown.
45Estimated Regression Equation
SALARY = 3.174 + 1.404(EXPER) + 0.251(SCORE)
Note: Predicted salary will be in thousands of dollars.
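The estimated equation can be reproduced without Excel. The following is a minimal sketch that solves the least squares normal equations in plain Python. It assumes the 20 observations pair up row by row as (experience, score, salary) in the order listed below, which is a reconstruction of the flattened slide table. The helper names `solve` and `least_squares` are illustrative.

```python
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(rows, y):
    """Return [b0, b1, ..., bp] that minimize the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


b = least_squares(list(zip(exper, score)), salary)
print([round(v, 3) for v in b])  # compare with 3.174, 1.404, 0.251 on the slide
```

Solving the normal equations directly is fine for a small, well-behaved data set like this one; numerical libraries generally use more stable decompositions.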
46Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows:
$b_i$ represents an estimate of the change in y corresponding to a 1-unit increase in $x_i$ when all other independent variables are held constant.
47Interpreting the Coefficients
b1 = 1.404
Salary is expected to increase by $1,404 for each additional year of experience (when the variable score on programmer aptitude test is held constant).
48Interpreting the Coefficients
b2 = 0.251
Salary is expected to increase by $251 for each additional point scored on the programmer aptitude test (when the variable years of experience is held constant).
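The "held constant" interpretation can be illustrated by evaluating the estimated equation at two points that differ in only one variable. The programmer profile used here (4 years, score 78) is just the first row of the sample, chosen for illustration.

```python
def predicted_salary(exper, score):
    """Estimated salary in thousands of dollars from the fitted equation."""
    return 3.174 + 1.404 * exper + 0.251 * score


base = predicted_salary(4, 78)
one_more_year = predicted_salary(5, 78) - base   # score held constant
one_more_point = predicted_salary(4, 79) - base  # experience held constant

print(round(base, 3))            # 28.368
print(round(one_more_year, 3))   # 1.404, i.e. $1,404
print(round(one_more_point, 3))  # 0.251, i.e. $251
```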
49Multiple Coefficient of Determination
[Excel ANOVA output not shown; the slide marks the SSR and SST entries]
50Multiple Coefficient of Determination
$R^2 = SSR/SST = 500.3285/599.7855 = 0.83418$
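51Adjusted Multiple Coefficient of Determination
$R_a^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} = 1 - (1 - 0.834179)\frac{20 - 1}{20 - 2 - 1} = 0.814671$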
52- Excel's Regression Statistics
Regression Statistics
Multiple R           0.913334
R Square             0.834179
Adjusted R Square    0.814671
Standard Error       2.418762
Observations         20
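Each line of this block follows from SSR and SST. The sketch below starts from the two sums of squares quoted earlier for this example (500.3285 and 599.7855) and uses the standard formula for the standard error of the estimate, sqrt(SSE / (n - p - 1)), which is an assumption here since the slides do not state it.

```python
import math

sst, ssr = 599.7855, 500.3285  # sums of squares quoted for this example
n, p = 20, 2                   # observations, independent variables

sse = sst - ssr
r2 = ssr / sst
multiple_r = math.sqrt(r2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
std_error = math.sqrt(sse / (n - p - 1))

print(round(multiple_r, 6), round(r2, 6), round(adj_r2, 6), round(std_error, 5))
```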
53Categorical independent variables
- To include categorical independent variables in a regression equation, we use dummy (0, 1) variables. A dummy variable takes the value 1 if a specified characteristic is present and the value 0 otherwise. For example, man = 1 and woman = 0.
54Example (cont.) Programmer Salary Survey
- As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.
- The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($000s) for each of the 20 sampled programmers are shown on the next slide.
55
Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
4               78           No      24.0
7               100          Yes     43.0
1               86           No      23.7
5               82           Yes     34.3
8               86           Yes     35.8
10              84           Yes     38.0
0               75           No      22.2
1               80           No      23.1
6               83           No      30.0
6               91           Yes     33.0
9               88           Yes     38.0
2               73           No      26.6
10              75           Yes     36.2
5               81           No      31.6
6               74           No      29.0
8               87           Yes     34.0
4               79           No      30.1
6               94           Yes     33.9
3               70           No      28.2
3               89           No      30.0
56Multiple Regression Model
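$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$
where
$y$ = annual salary (in thousands of dollars)
$x_1$ = years of experience
$x_2$ = score on programmer aptitude test
$x_3$ = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
$x_3$ is a dummy variable.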
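A sketch of how the dummy variable enters the fit: the Yes/No degree column is recoded to 1/0 and added as a third column. The code assumes the rows pair up as (experience, score, degree, salary) in the order listed below, which is a reconstruction of the flattened slide table, and it reuses an illustrative plain-Python normal-equations solver. No coefficient values are asserted here because the Excel output for this model is not shown in the transcript.

```python
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
degree = ["No", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes",
          "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"]
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]

x3 = [1 if d == "Yes" else 0 for d in degree]  # dummy: 1 = has graduate degree


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(rows, y):
    """Return [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(rows, y, b):
    """Return 1 - SSE/SST for coefficients b."""
    y_bar = sum(y) / len(y)
    fitted = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


two_var = list(zip(exper, score))
three_var = list(zip(exper, score, x3))
b_two = least_squares(two_var, salary)
b_three = least_squares(three_var, salary)
r2_two = r_squared(two_var, salary, b_two)
r2_three = r_squared(three_var, salary, b_three)

print([round(v, 3) for v in b_three])
print(round(r2_two, 4), round(r2_three, 4))
```

The coefficient on `x3` estimates the salary difference (in thousands of dollars) between programmers with and without a graduate degree at the same experience and test score. Adding a variable can never lower R squared, which is one reason the adjusted version is also reported.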
57- Excel's Regression Equation Output
Note: Columns F-I are not shown.
59- Excel's Regression Statistics
60More Complex Categorical Variables
- If a categorical variable has k levels, k - 1 dummy variables are required, with each dummy variable being coded as 0 or 1.
- For example, a variable with levels A, B, and C could be represented by $x_1$ and $x_2$ values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
- Care must be taken in defining and interpreting the dummy variables.
61- For example, a variable indicating level of
education could be represented by x1 and x2
values as follows