Statistics and Data Analysis presentation

Title: Statistics and Data Analysis

1
Statistics and Data Analysis

Professor William Greene
Stern School of Business
IOMS Department
Department of Economics

2
Statistics and Data Analysis
Part 19: Multiple Regression 3
3
Multiple Regression Modeling
1/57

Data Preparation
Examining the data
Transformations
Scaling
Analysis of the Regression
Residuals and outliers
Influential data points
The fit of the regression
R squared and adjusted R squared
Analysis of variance
Individual coefficient estimates and t statistics
Testing for significance of a set of coefficients
Prediction

4
Data Preparation
2/57

Get rid of observations with missing values.
Small numbers of missing values: delete the
observations.
Large numbers of missing values: you may need to
give up on certain variables.
There are theories and methods for filling
missing values. (Advanced techniques. Usually
not useful or appropriate for real world work.)
Be sure that missingness is not directly
related to the values of the dependent variable.
E.g., a regression fit after systematically
removing high values of Y is likely to be
biased if you then try to use the results to
describe the entire population. (E.g., any
sample related to income or consumption that is
drawn at an airport is likely to be biased vis-à-vis
the entire population.)
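
A minimal sketch of the "delete observations with missing values" step, using only the Python standard library. The records and the (y, x1, x2) layout are hypothetical and stand in for whatever data set is being prepared; None marks a missing value.

```python
# Hypothetical records laid out as (y, x1, x2); None marks a missing value.
rows = [
    (72.7, 9.4, 2646.4),
    (65.0, None, 410.0),
    (58.3, 6.1, 120.5),
    (None, 7.7, 903.2),
    (70.1, 8.8, 1500.0),
]


def drop_missing(records):
    """Keep only the records in which every field is present."""
    return [r for r in records if all(v is not None for v in r)]


complete = drop_missing(rows)
share_dropped = 1 - len(complete) / len(rows)
print(len(complete), "complete rows,", f"{share_dropped:.0%} dropped")
```

Reporting the share dropped is a cheap reminder to ask whether the deleted rows differ systematically in Y from the rows that remain.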

5
Transform the Data?
3/57

Just because a variable is skewed does not mean
you should take logs. Take logs if the model you
are fitting calls for taking logs. More later.
Scaling? E.g., per capita data, or scaling by
assets, number of shares, or sales. Scale if it is
appropriate in the context of the model (the
study). More later.
Do not transform variables without a good reason
to do so. Skewness, by itself, is not a good
reason. Don't scale variables just because the
values are large.
Transform data appropriately for the study you
are doing (i.e., for the story you are trying to
tell your reader).

6
Using Logs
4/57

Generally, use logs for size variables
Use logs if you are seeking to estimate
elasticities
Use logs if your data span a very large range of
values and the independent variables do not (a
modeling issue: some art mixed in with the
science).
If the data contain 0s or negative values, then
logs will be inappropriate for the study. Do not
use ad hoc fixes like adding something to Y so it
will be positive.
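
To illustrate the elasticity point, here is a small sketch under stated assumptions: the data are hypothetical and generated exactly from y = 3 * x^0.8, and the helper ols_line is an illustrative one-predictor least squares routine, not something from the slides. Regressing log y on log x recovers the exponent, which is the elasticity.

```python
import math


def ols_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data built from y = 3 * x**0.8, so the true elasticity is 0.8.
x = [10.0, 50.0, 200.0, 1000.0, 5000.0]
y = [3.0 * xi ** 0.8 for xi in x]

log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
intercept, elasticity = ols_line(log_x, log_y)
print("elasticity estimate:", round(elasticity, 6))
```

With real data the fit would not be exact, but the slope in the log-log equation is still read as the percentage change in y for a one percent change in x.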

7
More on Using Logs
5/57

Generally only for continuous variables like
income or variables that are essentially
continuous.
Not for discrete variables like binary variables
or qualitative variables (e.g., stress level
1,2,3,4,5).
Generally be consistent in the equation: don't
mix logs and levels.
Generally DO NOT take the log of time (t) in a
model with a time trend. TIME is discrete and
not a measure.

8
Residuals
8/57

Residual: the difference between the actual
value of Y and the value predicted by the
regression.
E.g., Switzerland
Estimated equation is DALE = 36.900 +
2.9787 EDUC + 0.004601 PCHexp
Swiss values are EDUC = 9.418360, PCHexp = 2646.442
Regression prediction = 77.1307
Actual Swiss DALE = 72.71622
Residual = 72.71622 - 77.1307 = -4.41448
The regression overpredicts Switzerland.
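
The Swiss calculation can be checked with a few lines of arithmetic. The coefficients and data values below are the ones quoted on the slide; the variable names are only illustrative.

```python
# Coefficients of the estimated equation, as quoted on the slide.
a, b_educ, b_pchexp = 36.900, 2.9787, 0.004601

# Swiss values of the predictors and the actual DALE.
educ, pchexp = 9.418360, 2646.442
actual_dale = 72.71622

predicted_dale = a + b_educ * educ + b_pchexp * pchexp
residual = actual_dale - predicted_dale  # actual minus predicted

print("prediction:", round(predicted_dale, 4))
print("residual:  ", round(residual, 4))
```

A negative residual means the fitted value is above the actual value, i.e., the regression overpredicts.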

9
Using Residuals
9/57

As indicators of bad data
As indicators of observations that deserve
attention
As a diagnostic tool to evaluate the regression
model

10
Outliers
13/57

A residual is $e_i = y_i - a - b_1 x_{i1} - \dots - b_K x_{iK}$.
The standard deviation of the residuals is
$s_e = \sqrt{\sum_i e_i^2 / (n-K-1)}$.
Standardized residuals are $e_i/s_e$.
Large residuals have $|e_i/s_e| > 2$.
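
A short sketch of the flagging rule, assuming the usual degrees of freedom correction n-K-1 in the residual standard deviation. The residuals are hypothetical (they sum to zero, as least squares residuals from a model with a constant term do) and K = 2 is an assumed number of predictors.

```python
import math


def standardize(residuals, k):
    """Return s_e and the standardized residuals e_i / s_e."""
    n = len(residuals)
    s_e = math.sqrt(sum(e * e for e in residuals) / (n - k - 1))
    return s_e, [e / s_e for e in residuals]


# Hypothetical residuals from a regression with K = 2 predictors.
residuals = [0.5, -1.2, 0.8, -0.3, 4.9, -0.7, 0.4, -1.1, 0.2, -3.5]
s_e, z = standardize(residuals, k=2)

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2]
print("s_e =", round(s_e, 4), "flagged observations:", flagged)
```

A flag is a prompt to look at the observation, not a reason to delete it.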

11
Strip Mining the Sample: Residuals and Outliers
10/57
12
An Aside About Plotting
11/57
13
Appropriate Plot
14
Appropriate Residual Plot
12/57
15
Strip Mining the Data: Unusual Observations
14/57
16
When to Remove Outliers
15/57

Outliers have very large residuals.
Remove them only if it is ABSOLUTELY necessary:
The data are obviously miscoded.
There is something clearly wrong with the
observation.
Do not remove outliers just because Minitab flags
them. This is not sufficient reason.

17
Units of Measurement
22/57

y = a + b1 x1 + b2 x2 + e
If you multiply every observation of variable x
by the same constant, c, then the regression
coefficient will be divided by c.
E.g., multiply X by .001 to change to thousands
of $, then b is multiplied by 1000. b times x
will be unchanged.
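
The rescaling rule is easy to verify numerically. This sketch uses a one-predictor regression for brevity (the slide's equation has two predictors) and hypothetical data; ols_slope is an illustrative helper.

```python
def ols_slope(x, y):
    """Least squares slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: x measured in dollars.
x_dollars = [12000.0, 18500.0, 23000.0, 31000.0, 40500.0]
y = [3.1, 4.0, 4.4, 5.9, 7.2]

c = 0.001  # rescale x from dollars to thousands of dollars
x_thousands = [c * xi for xi in x_dollars]

b = ols_slope(x_dollars, y)
b_scaled = ols_slope(x_thousands, y)

print("b (dollars):   ", b)
print("b (thousands): ", b_scaled)   # equals b / c = 1000 * b
print("b * x unchanged:", b * x_dollars[0], b_scaled * x_thousands[0])
```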

18
Scaling the Data
23/57

Units of measurement and coefficients
Macro data and per capita figures
Gasoline data
WHO data
Micro data and normalizations
R&D and Profits

19
The Gasoline Market
24/57
Aggregate consumption or expenditure data would
not be interesting. Income data are already per
capita.
20
The WHO Data
25/57
Per Capita GDP and Per Capita Health Expenditure.
Aggregate values would make no sense.
Years
21
Profits and R&D by Industry
26/57
Is there a relationship between R&D and Profits?
This just shows that big industries have larger
profits and R&D than small ones.
Gujarati, D. Basic Econometrics, McGraw Hill,
1995, p. 388.
22
Normalized by Sales
27/57
Profits/Sales = α + β (R&D/Sales) + e
23
More Movie Madness
28/57

McDonald's and Movies (Craig, Douglas, Greene:
International Journal of Marketing)
Log Foreign Box Office(movie,country,year) = α +
β1 LogBox(movie,US,year) + β2 LogPCIncome +
β4 LogMacsPC + GenreEffect +
CountryEffect + e.

24
29/57
We used McDonald's Per Capita
25
Movie Madness Data (n = 2198)
30/57
26
Macs and Movies
31/57
Genres (MPAA): 1=Drama 2=Romance 3=Comedy 4=Action
5=Fantasy 6=Adventure 7=Family 8=Animated 9=Thriller
10=Mystery 11=Science Fiction 12=Horror 13=Crime
Countries and Some of the Data
Code  Country    Pop (mm)  Per cap Income  # of McDonald's  Language
1     Argentina  37        12090           173              Spanish
2     Chile      15        9110            70               Spanish
3     Spain      39        19180           300              Spanish
4     Mexico     98        8810            270              Spanish
5     Germany    82        25010           1152             German
6     Austria    8         26310           159              German
7     Australia  19        25370           680              English
8     UK         60        23550           1152             UK
27
Making the Genre Variables
32/57
Calc ?
28
Movie Genres
33/57
29
34/57
CRIME is the left out GENRE. AUSTRIA is the left
out country. Australia and UK were left out for
other reasons (algebraic problem with only 8
countries).
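
A sketch of how the genre indicator variables could be built outside Minitab, assuming the 13 genre codes listed for the Macs and Movies data and Crime as the left-out category. The function name and the dictionary layout are illustrative choices, not part of the original study.

```python
# Genre codes 1..13 in the order listed for the Macs and Movies data.
GENRES = [
    "Drama", "Romance", "Comedy", "Action", "Fantasy", "Adventure",
    "Family", "Animated", "Thriller", "Mystery", "Science Fiction",
    "Horror", "Crime",
]
LEFT_OUT = "Crime"  # base category: gets no indicator of its own


def genre_indicators(code):
    """Map a genre code (1..13) to 12 zero/one indicator variables."""
    name = GENRES[code - 1]
    return {g: int(g == name) for g in GENRES if g != LEFT_OUT}


fantasy = genre_indicators(5)   # one indicator equals 1
crime = genre_indicators(13)    # all indicators equal 0
print(sum(fantasy.values()), sum(crime.values()), len(crime))
```

Leaving one category out avoids an exact linear relationship between the indicators and the constant term; each genre coefficient is then read relative to Crime.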
30
Model Fit
35/57

How well does the model fit the data?
R^2 measures fit: the larger the better
Time series: expect .9 or better
Cross sections: it depends
Social science data: .1 is good
Industry or market data: .5 is routine

31
OK Fit
36/57
32
Success Measure
37/57

Hypothesis: There is no regression.
Equivalent Hypothesis: R^2 = 0.
How to test: For now, a rough rule. Look for F >
2 for multiple regression.
F = 144.34 for Movie Madness

33
A Formal Test of the Regression Model
38/57

Is there a significant relationship?
Equivalently, is R^2 > 0?
Statistically, not numerically.
Testing:
Compute $F = \dfrac{R^2/K}{(1-R^2)/(n-K-1)}$
Determine if F is large using the appropriate
table
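
As a worked instance of the formula, the sketch below plugs in the Movie Madness figures that appear in this deck (R-Sq of 57.0%, K = 20 predictors, n = 2198). Because R-Sq is used here rounded to three digits, the result lands near, not exactly on, the reported F of 144.34. The function name is illustrative.

```python
def f_from_r2(r2, k, n):
    """Overall F statistic computed from R-squared, K predictors, n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


f_stat = f_from_r2(r2=0.570, k=20, n=2198)
print("F from rounded R-squared:", round(f_stat, 2))
```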

34
The F Test for the Model
39/57

Determine the appropriate critical value from
the table.
Is the F from the computed model larger than the
theoretical F from the table?
Yes: Conclude the relationship is significant
No: Conclude R^2 = 0.

35

40/57
n1 = Number of predictors; n2 = Sample size -
number of predictors - 1
36
Testing .
41/57

Use Minitab's F Calculator

37
Finding the Critical F
42/57
Leave as is
Number of predictors in the model K
n-K-1
Standard: .95
38
Compare Sample F to Critical F
43/57

F = 144.34 for Movie Madness
Critical value from the table is 1.57536.
Reject the hypothesis of no relationship.
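
For readers without Minitab, the critical value can be approximated with the standard library alone. This is a sketch, not Minitab's method: it evaluates the F cumulative distribution through the regularized incomplete beta function (continued fraction, modified Lentz iteration) and inverts it by bisection. All function names are illustrative.

```python
import math


def _beta_cf(a, b, x, max_iter=1000, eps=1e-12):
    """Continued fraction used by the regularized incomplete beta function."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for num in (even, odd):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def f_cdf(x, df1, df2):
    """P(F <= x) for an F distribution with df1 and df2 degrees of freedom."""
    if x <= 0.0:
        return 0.0
    a, b = df1 / 2.0, df2 / 2.0
    z = df1 * x / (df1 * x + df2)
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(z) + b * math.log1p(-z))
    front = math.exp(log_front)
    if z < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, z) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - z) / b


def f_critical(p, df1, df2, lo=0.0, hi=1000.0):
    """Value x with P(F <= x) = p, found by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, df1, df2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


K, n = 20, 2198
critical = f_critical(0.95, K, n - K - 1)
p_value = 1.0 - f_cdf(144.34, K, n - K - 1)
print("critical F:", round(critical, 4), " reject:", 144.34 > critical)
```

The same f_cdf call gives the tail probability for the observed F, which is the quantity a regression printout reports as P.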

39
An Equivalent Approach
44/57

What is the P Value?
We observed an F of 144.34 (or, whatever it is).
If there really were no relationship, how likely
is it that we would have observed an F this large
(or larger)?
Depends on n and K
The probability is reported with the regression
results as the P Value.
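
The question on this slide can also be answered by simulation. The sketch below is an illustration only, not how Minitab obtains P: under the classical assumptions and no relationship, F is a ratio of two independent chi-squared variables, each divided by its degrees of freedom, so we draw many such ratios and see how often they reach the observed 144.34. The seed and number of draws are arbitrary choices.

```python
import random


def simulate_null_f(df1, df2, draws, seed=0):
    """Draw F statistics as they would behave if there were no relationship."""
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        # A chi-squared variable with k degrees of freedom is gamma(k/2, scale 2).
        numerator = rng.gammavariate(df1 / 2.0, 2.0) / df1
        denominator = rng.gammavariate(df2 / 2.0, 2.0) / df2
        sims.append(numerator / denominator)
    return sims


K, n = 20, 2198
sims = simulate_null_f(K, n - K - 1, draws=5000)

observed_f = 144.34
p_hat = sum(f >= observed_f for f in sims) / len(sims)
share_above_critical = sum(f > 1.57536 for f in sims) / len(sims)

print("largest simulated F:", round(max(sims), 2))
print("simulated P value:", p_hat)
print("share above 1.57536:", share_above_critical)  # should be near 0.05
```

None of the simulated values comes anywhere near 144.34, which is what a reported P value of 0.000 is saying.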

40
The F Test
45/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%
Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58
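
Every derived entry in this table follows from the DF and SS columns. The sketch below redoes the arithmetic with the numbers printed above; variable names are illustrative.

```python
import math

# DF and SS columns from the analysis of variance table.
df_regression, ss_regression = 20, 2617.58
df_residual, ss_residual = 2177, 1974.01
ss_total = 4591.58

ms_regression = ss_regression / df_regression   # mean square = SS / DF
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual            # F = MS regression / MS residual
r_squared = ss_regression / ss_total
s = math.sqrt(ms_residual)                      # the "S" line of the printout

print("MS:", round(ms_regression, 2), round(ms_residual, 2))
print("F:", round(f_stat, 2), " R-Sq:", round(100 * r_squared, 1), " S:", round(s, 6))
```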
41
A Huge Theorem
46/57

R^2 always goes up when you add variables to your
model.
Always.

42
Adjusted R Squared
47/57

Adjusted R^2 penalizes your model for obtaining
its fit with lots of variables.
Adjusted R^2 = $1 - \dfrac{n-1}{n-K-1}\,(1 - R^2)$
Adjusted R^2 is denoted $\bar{R}^2$.
Adjusted R^2 is not the mean of anything and it is
not a square. This is just a name.
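
A quick check of the adjustment using the Movie Madness figures from this deck (R-Sq of 57.0%, n = 2198, K = 20). The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - [(n-1)/(n-K-1)] * (1 - R-squared)."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)


r2 = 0.570
adj = adjusted_r2(r2, n=2198, k=20)
print("adjusted R-squared:", round(adj, 3))  # compare with R-Sq(adj) = 56.6%
```

The penalty factor (n-1)/(n-K-1) is only 2197/2177 here, which is why the adjusted value sits so close to the unadjusted one.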

43
The Analysis of Variance
48/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%
Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58
If n is very large, R^2 and Adjusted R^2 will not
differ by very much. 2198 is quite large for this
purpose.
44
Exploring the Relationship
49/57

F statistic examines the entire relationship.
Benchmark: F > 2 is good for a multiple
regression.
What about individual coefficients? (E.g., is
there a significant relationship between the
number of McDonald's and the local box office
result?)

45
50/57
Use individual t statistics. t > 2 or t < -2
suggests the variable is significant. t for
LogPCMacs = 9.66. This is large. Note: the ±2
for t statistics and the 4 = 2^2 for the F
statistic for a simple regression (one predictor)
are not a coincidence.
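
The link between the two benchmarks is that, with one predictor, F equals the square of the slope's t statistic. The sketch below checks this on hypothetical data; simple_regression is an illustrative helper.

```python
import math


def simple_regression(x, y):
    """Return the slope's t statistic and the overall F for a one-predictor regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sse / sst
    se_b = math.sqrt(sse / (n - 2) / sxx)       # standard error of the slope
    t_stat = b / se_b
    f_stat = (r2 / 1) / ((1.0 - r2) / (n - 2))  # K = 1 predictor
    return t_stat, f_stat


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9, 8.4, 8.8]
t_stat, f_stat = simple_regression(x, y)
print("t squared:", t_stat ** 2, " F:", f_stat)
```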
46
What About a Group of Variables?
51/57

Is Genre significant?
There are 12 genre variables
Some are significant (fantasy, mystery, horror);
some are not.
Can we conclude the group as a whole is?
Maybe. We need a test.

47
Theory for the Test
52/57

A larger model has a higher R2 than a smaller
one.
(Larger model means it has all the variables in
the smaller one, plus some additional ones)
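(1) Compute this statistic with a calculator:
$F = \dfrac{(R_1^2 - R_0^2)/J}{(1 - R_1^2)/(n-K-1)}$,
where $R_1^2$ is from the larger model, $R_0^2$ is from
the smaller model, J is the number of coefficients in
the group being tested, and K is the number of
predictors in the larger model.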

48
Is Genre Significant?
56/57
With the 12 Genre indicator variables:     S = 0.952237   R-Sq = 57.0%
Without the 12 Genre indicator variables:  S = 0.967685   R-Sq = 55.4%

        (0.570 - 0.554)/12
F = ----------------------------- = 6.750
     (1 - 0.570)/(2198 - 20 - 1)

Cumulative Distribution Function
F distribution with 12 DF in numerator and 2177 DF in denominator
   x   P( X <= x )
6.75       1.00000
THIS IS LARGER THAN 0.95
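
The genre calculation can be reproduced directly from the two R-Sq values shown above. The function and argument names are illustrative; the numbers are the ones on the slide.

```python
def group_f(r2_larger, r2_smaller, num_tested, n, k_larger):
    """F statistic for a group of coefficients, from the two models' R-squared values."""
    numerator = (r2_larger - r2_smaller) / num_tested
    denominator = (1.0 - r2_larger) / (n - k_larger - 1)
    return numerator / denominator


f_genre = group_f(r2_larger=0.570, r2_smaller=0.554,
                  num_tested=12, n=2198, k_larger=20)
print("F for the 12 genre indicators:", round(f_genre, 3))
```

The result is then compared with an F distribution with 12 and 2177 degrees of freedom, as in the Minitab output above.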
49
Now What?
55/57

If the value that Minitab shows you is greater
than 0.95, then the F statistic is large
I.e., conclude that the group of coefficients is
significant
This means that at least one is nonzero, not that
all necessarily are.

50
Testing .
53/57

Use Minitab's F Calculator

51
F Test
54/57
Leave as is
Number of coefficients in the group
N-K-1 for the larger model
Your F from Step 1
Push the button
52
Summary
57/57

Data preparation: missing values
Residuals and outliers
Scaling the data
Model fit and analysis of variance: R^2
Testing
One variable (coefficient): the t test
A set of variables: the F test
