# Statistics and Data Analysis - PowerPoint PPT Presentation


## Statistics and Data Analysis


Slides: 53
Provided by: William7
Transcript and Presenter's Notes

Title: Statistics and Data Analysis

1
Statistics and Data Analysis
• Professor William Greene
• IOMS Department of
• Department of Economics

2
Statistics and Data Analysis
Part 19 Multiple Regression 3
3
Multiple Regression Modeling
1/57
• Data Preparation
• Examining the data
• Transformations
• Scaling
• Analysis of the Regression
• Residuals and outliers
• Influential data points
• The fit of the regression
• R squared and adjusted R squared
• Analysis of variance
• Individual coefficient estimates and t statistics
• Testing for significance of a set of coefficients
• Prediction

4
Data Preparation
2/57
• Get rid of observations with missing values.
• Small numbers of missing values: delete the
observations.
• Large numbers of missing values: you may need to
give up on certain variables.
• There are theories and methods for filling in
missing values. (Usually not useful or appropriate
for real world work.)
• Be sure that missingness is not directly
related to the values of the dependent variable.
E.g., a regression fit after systematically
removing high values of Y is likely to be
biased if you then try to use the results to
describe the entire population. (E.g., any
sample related to income or consumption that is
drawn at an airport is likely to be biased
vis-à-vis the entire population.)

5
Transform the Data?
3/57
• Just because a variable is skewed does not mean
you should take logs. Take logs if the model you
are fitting calls for taking logs. More later.
• Scaling? E.g., per capita data, or scaling by
assets, number of shares, or sales. Do it if it is
appropriate in the context of the model (the
study). More later.
• Do not transform variables without a good reason
to do so. Skewness, by itself, is not a good
reason. Don't scale variables just because the
values are large.
• Transform data appropriately for the study you
are doing (i.e., for the story you are trying to
tell).
6
Using Logs
4/57
• Generally, use logs for size variables
• Use logs if you are seeking to estimate
elasticities
• Use logs if your data span a very large range of
values and the independent variables do not (a
modeling issue: some art mixed in with the
science).
• If the data contain 0s or negative values, then
logs will be inappropriate for the study. Take logs
only of variables whose values will be positive.
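To make the elasticity point concrete, here is a minimal Python sketch. The data are hypothetical, generated from y = 3·x^0.5 so that the true elasticity is 0.5, and the helper names `log_all` and `fit_line` are illustrative only. Regressing log y on log x returns the elasticity as the slope, and the log helper refuses zeros and negative values.

```python
import math

def log_all(values):
    """Take natural logs, refusing data that contain zeros or negatives."""
    if any(v <= 0 for v in values):
        raise ValueError("logs are undefined for zero or negative values")
    return [math.log(v) for v in values]

def fit_line(x, y):
    """Simple least squares: return (intercept, slope) for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: y = 3 * x**0.5, so the elasticity of y with respect to x is 0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.5 for xi in x]

intercept, elasticity = fit_line(log_all(x), log_all(y))
print(f"estimated elasticity: {elasticity:.4f}")
```

In a log-log regression the slope is read directly as "a 1% change in x goes with a slope% change in y", which is why logs are the natural choice when elasticities are the goal.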

7
More on Using Logs
5/57
• Generally only for continuous variables like
income or variables that are essentially
continuous.
• Not for discrete variables like binary variables
or qualitative variables (e.g., stress level
1,2,3,4,5)
• Generally be consistent in the equation: don't
mix logs and levels.
• Generally DO NOT take the log of time (t) in a
model with a time trend. TIME is discrete and
not a measure.

8
Residuals
8/57
• Residual: the difference between the actual
value of Y and the value predicted by the
regression.
• E.g., Switzerland
• Estimated equation is $\text{DALE} = 36.900 + 2.9787\,\text{EDUC} + 0.004601\,\text{PCHexp}$
• Swiss values are EDUC = 9.418360, PCHexp = 2646.442
• Regression prediction = 77.1307
• Actual Swiss DALE = 72.71622
• Residual = 72.71622 - 77.1307 = -4.41448
• The regression overpredicts Switzerland
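The Swiss example can be checked with a few lines of Python. This is a sketch that assumes the estimated equation is additive (intercept plus each coefficient times its variable); the variable names are illustrative. Because the slide's coefficients are rounded, the result matches the slide only to about three decimal places.

```python
# Coefficients and Swiss values as reported on the slide
intercept = 36.900
b_educ = 2.9787
b_pchexp = 0.004601

educ = 9.418360
pchexp = 2646.442
actual_dale = 72.71622

# Prediction from the fitted equation, then residual = actual - predicted
predicted_dale = intercept + b_educ * educ + b_pchexp * pchexp
residual = actual_dale - predicted_dale

print(f"predicted: {predicted_dale:.4f}")
print(f"residual:  {residual:.4f}")
```

A negative residual means the actual value lies below the fitted value, i.e. the regression overpredicts.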

9
Using Residuals
9/57
• As indicators of bad data
• As indicators of observations that deserve
attention
• As a diagnostic tool to evaluate the regression
model

10
Outliers
13/57
• A residual is $e_i = y_i - a - b_1 x_{i1} - \dots$
• The standard deviation of the residuals is $s_e$.
• Standardized residuals are $e_i / s_e$.
• Large residuals have $|e_i / s_e| > 2$.
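A minimal sketch of the "larger than 2" rule in Python. The slide's formula for the standard deviation of the residuals did not survive extraction, so the code assumes the usual regression definition, s_e = sqrt(sum of squared residuals / (n - K - 1)). The residuals are hypothetical and the function names are illustrative.

```python
import math

def standardized_residuals(residuals, num_predictors):
    """Divide each residual by s_e (assumed: sqrt(SSE / (n - K - 1)))."""
    n = len(residuals)
    dof = n - num_predictors - 1
    s_e = math.sqrt(sum(e * e for e in residuals) / dof)
    return [e / s_e for e in residuals]

def flag_large(residuals, num_predictors, cutoff=2.0):
    """Return the positions of observations whose |e_i / s_e| exceeds the cutoff."""
    z = standardized_residuals(residuals, num_predictors)
    return [i for i, zi in enumerate(z) if abs(zi) > cutoff]

# Hypothetical residuals from a regression with one predictor
resid = [0.5, -0.5, 0.4, -0.4, 0.3, -0.3, 0.2, -0.2, 0.1, -0.1, 3.0, -3.0]
flagged = flag_large(resid, num_predictors=1)
print("observations that deserve attention:", flagged)
```

Flagged observations are candidates for a closer look, not automatic deletions.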

11
Strip Mining the Sample Residuals and Outliers
10/57
12
11/57
13
Appropriate Plot
14
Appropriate Residual Plot
12/57
15
Strip Mining the Data: Unusual Observations
14/57
16
When to Remove Outliers
15/57
• Outliers have very large residuals
• Only if it is ABSOLUTELY necessary
• The data are obviously miscoded
• There is something clearly wrong with the
observation
• Do not remove outliers just because Minitab flags
them. This is not sufficient reason.

17
Units of Measurement
22/57
• $y = a + b_1 x_1 + b_2 x_2 + e$
• If you multiply every observation of variable x
by the same constant, c, then the regression
coefficient will be divided by c.
• E.g., multiply X by .001 to change to thousands
of dollars, then b is multiplied by 1000. b times x
will be unchanged.
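The units-of-measurement rule can be verified numerically. The sketch below uses a simple (one predictor) regression and hypothetical data; `fit_line` is an illustrative helper, not part of any package. Rescaling x by c = .001 multiplies the slope by 1000 and leaves the intercept and the fitted values unchanged.

```python
def fit_line(x, y):
    """Simple least squares: return (intercept, slope) for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data: x in dollars
x_dollars = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

c = 0.001                                   # convert dollars to thousands of dollars
x_thousands = [c * xi for xi in x_dollars]

a1, b1 = fit_line(x_dollars, y)
a2, b2 = fit_line(x_thousands, y)

print(f"slope with x in dollars:   {b1:.6f}")
print(f"slope with x in thousands: {b2:.6f}")   # b1 divided by c
fitted1 = [a1 + b1 * xi for xi in x_dollars]
fitted2 = [a2 + b2 * xi for xi in x_thousands]
```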

18
Scaling the Data
23/57
• Units of measurement and coefficients
• Macro data and per capita figures
• Gasoline data
• WHO data
• Micro data and normalizations
• R&D and Profits

19
The Gasoline Market
24/57
Aggregate consumption or expenditure data would
not be interesting. Income data are already per
capita.
20
The WHO Data
25/57
Per Capita GDP and Per Capita Health Expenditure.
Aggregate values would make no sense.
21
Profits and R&D by Industry
26/57
Is there a relationship between R&D and Profits?
This just shows that big industries have larger
profits and R&D than small ones.
Gujarati, D. Basic Econometrics, McGraw Hill,
1995, p. 388.
22
Normalized by Sales
27/57
$\text{Profits}/\text{Sales} = \alpha + \beta\,(\text{R\&D}/\text{Sales}) + e$
23
28/57
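• McDonalds and Movies (Craig, Douglas, Greene:
International Journal of Marketing)
• $\text{Log Foreign Box Office}(\text{movie},\text{country},\text{year}) = \alpha + \beta_1\,\text{LogBox}(\text{movie},\text{US},\text{year}) + \beta_2\,\text{LogPCIncome} + \beta_4\,\text{LogMacsPC} + \text{GenreEffect} + \text{CountryEffect} + e$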

24
29/57
We used McDonalds Per Capita
25
30/57
26
Macs and Movies
31/57
Genres (MPAA): 1=Drama, 2=Romance, 3=Comedy, 4=Action, ...er, 10=Mystery, 11=Science Fiction, 12=Horror, 13=Crime

Countries and Some of the Data

| Code | Country | Pop (mm) | Per cap Income | Number of McDonalds | Language |
|---|---|---|---|---|---|
| 1 | Argentina | 37 | 12090 | 173 | Spanish |
| 2 | Chile | 15 | 9110 | 70 | Spanish |
| 3 | Spain | 39 | 19180 | 300 | Spanish |
| 4 | Mexico | 98 | 8810 | 270 | Spanish |
| 5 | Germany | 82 | 25010 | 1152 | German |
| 6 | Austria | 8 | 26310 | 159 | German |
| 7 | Australia | 19 | 25370 | 680 | English |
| 8 | UK | 60 | 23550 | 1152 | UK |
27
Making the Genre Variables
32/57
Calc ?
28
Movie Genres
33/57
29
34/57
CRIME is the left out GENRE. AUSTRIA is the left
out country. Australia and UK were left out for
other reasons (algebraic problem with only 8
countries).
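The "left out" category can be illustrated with a short Python sketch. The slides build these variables in Minitab; the function below is only a hypothetical stand-in showing the idea. One indicator (0/1) column is created for every category except the left-out one, so an observation from the left-out category is all zeros and serves as the baseline.

```python
def make_indicators(codes, categories, left_out):
    """Build 0/1 indicator columns for every category except the left-out one."""
    kept = [c for c in categories if c != left_out]
    return [{c: int(code == c) for c in kept} for code in codes]

# Hypothetical sample of three movies; CRIME is the left-out genre, as on the slide
genres = ["Drama", "Comedy", "Crime"]
movies = ["Drama", "Crime", "Comedy"]

rows = make_indicators(movies, genres, left_out="Crime")
for movie, row in zip(movies, rows):
    print(movie, row)
```

Including an indicator for every category along with the constant term would make the columns add up to the constant, which is the kind of algebraic problem that forces one category to be dropped.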
30
Model Fit
35/57
• How well does the model fit the data?
• $R^2$ measures fit: the larger the better
• Time series: expect .9 or better
• Cross sections: it depends
• Social science data: .1 is good
• Industry or market data: .5 is routine

31
OK Fit
36/57
32
Success Measure
37/57
• Hypothesis: There is no regression.
• Equivalent Hypothesis: $R^2 = 0$.
• How to test: For now, a rough rule. Look for F >
2 for multiple regression. F = 144.34 for Movie Madness.

33
A Formal Test of the Regression Model
38/57
• Is there a significant relationship?
• Equivalently, is $R^2 > 0$?
• Statistically, not numerically.
• Testing
• Compute $F = \dfrac{R^2/K}{(1-R^2)/(n-K-1)}$
• Determine if F is large using the appropriate
table
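As a sketch of the computation, the F statistic can be obtained from R squared in a few lines of Python. The function name is illustrative. The inputs are the values reported for the movie regression elsewhere in these slides (R-Sq of 57.0, n = 2198 observations, K = 20 predictors); because R squared is rounded to three digits, the result is close to, but not exactly, the 144.34 that Minitab reports.

```python
def f_from_r_squared(r_squared, n, k):
    """F = (R^2 / K) / ((1 - R^2) / (n - K - 1))."""
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))

# Movie regression: R^2 rounded to 0.570, n = 2198, K = 20
f_stat = f_from_r_squared(0.570, n=2198, k=20)
print(f"F = {f_stat:.2f}")
```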

34
The F Test for the Model
39/57
• Determine the appropriate critical value from
the table.
• Is the F from the computed model larger than the
theoretical F from the table?
• Yes: Conclude the relationship is significant
• No: Conclude $R^2 = 0$.

35

40/57
n1 = Number of predictors; n2 = Sample size -
number of predictors - 1
36
Testing .
41/57
• Use Minitab's F Calculator

37
Finding the Critical F
42/57
Leave as is
Number of predictors in the model K
n-K-1
Standard .95
38
Compare Sample F to Critical F
43/57
• F = 144.34 for Movie Madness
• Critical value from the table is 1.57536.
• Reject the hypothesis of no relationship.

39
An Equivalent Approach
44/57
• What is the P Value?
• We observed an F of 144.34 (or, whatever it is).
• If there really were no relationship, how likely
is it that we would have observed an F this large
(or larger)?
• Depends on n and K
• The probability is reported with the regression
results as the P Value.

40
The F Test
45/57
56.6

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 20 | 2617.58 | 130.88 | 144.34 | 0.000 |
| Residual Error | 2177 | 1974.01 | 0.91 | | |
| Total | 2197 | 4591.58 | | | |
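The entries of the analysis of variance table are linked by simple arithmetic, sketched here in Python with the sums of squares and degrees of freedom reported above. Variable names are illustrative. Each mean square is SS divided by DF, F is the ratio of the two mean squares, and R squared is the regression share of the total sum of squares.

```python
# Sums of squares and degrees of freedom from the Minitab output
ss_regression, df_regression = 2617.58, 20
ss_residual, df_residual = 1974.01, 2177
ss_total = 4591.58

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual error
f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(f"MS regression = {ms_regression:.2f}")
print(f"MS residual   = {ms_residual:.2f}")
print(f"F             = {f_stat:.2f}")
print(f"R squared     = {r_squared:.3f}")
```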
41
A Huge Theorem
46/57
• R2 always goes up when you add variables to your
model.
• Always.

42
47/57
its fit with lots of variables. Adjusted $R^2 = 1 - \dfrac{n-1}{n-K-1}\,(1 - R^2)$
• Adjusted $R^2$ is not the mean of anything and it is
not a square. This is just a name.
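A short Python sketch of the adjustment, assuming the standard form 1 - ((n - 1)/(n - K - 1))(1 - R squared). The function name is illustrative; the inputs are the movie regression values from these slides (R-Sq of 57.0, n = 2198, K = 20). A second call with a hypothetical small sample shows how much harsher the penalty is when n is not large relative to K.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - ((n - 1) / (n - K - 1)) * (1 - R^2)."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r_squared)

# Movie regression values from the slides
adj_large_n = adjusted_r_squared(0.570, n=2198, k=20)

# Hypothetical small sample with the same R^2 and K, for contrast
adj_small_n = adjusted_r_squared(0.570, n=40, k=20)

print(f"n = 2198: adjusted R squared = {adj_large_n:.3f}")
print(f"n = 40:   adjusted R squared = {adj_small_n:.3f}")
```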

43
The Analysis of Variance
48/57
56.6

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 20 | 2617.58 | 130.88 | 144.34 | 0.000 |
| Residual Error | 2177 | 1974.01 | 0.91 | | |
| Total | 2197 | 4591.58 | | | |

If n is very large, $R^2$ and Adjusted $R^2$ will not
differ by very much. 2198 is quite large for this
purpose.
44
Exploring the Relationship
49/57
• F statistic examines the entire relationship.
Benchmark: F > 2 is good for a multiple
regression.
• What about individual coefficients? (E.g., is
there a significant relationship between the
number of McDonalds and the local box office
result?)

45
50/57
Use individual t statistics. T > 2 or T < -2
suggests the variable is significant. T for
LogPCMacs = 9.66. This is large. Note that the 2
for t statistics and the 4 = $2^2$ for the F
statistic for a simple regression (one predictor)
is not a coincidence.
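The link between the t benchmark of 2 and the F benchmark of 4 can be checked numerically: in a simple regression the F statistic equals the square of the slope's t statistic. The sketch below uses hypothetical data and textbook formulas (slope standard error = sqrt(s^2 / Sxx), with s^2 = SSE/(n - 2)); variable names are illustrative.

```python
import math

# Hypothetical data for a one-predictor regression
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression sum of squares
s2 = sse / (n - 2)                                       # residual variance estimate

t_stat = slope / math.sqrt(s2 / s_xx)   # t statistic for the slope
f_stat = (ssr / 1) / s2                 # F statistic with K = 1 predictor

print(f"t squared = {t_stat ** 2:.4f}")
print(f"F         = {f_stat:.4f}")
```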
46
What About a Group of Variables?
51/57
• Is Genre significant?
• There are 12 genre variables
• Some are significant (fantasy, mystery, horror)
some are not.
• Can we conclude the group as a whole is?
• Maybe. We need a test.

47
Theory for the Test
52/57
• A larger model has a higher R2 than a smaller
one.
• (Larger model means it has all the variables in
the smaller one, plus some additional ones)
• (1) Compute this statistic with a calculator

48
Is Genre Significant?
56/57
With the 12 Genre indicator variables: S = 0.952237, R-Sq = 57.0%
Without the 12 Genre indicator variables: S = 0.967685, R-Sq = 55.4%

$F = \dfrac{(0.570 - 0.554)/12}{(1 - 0.570)/(2198 - 20 - 1)} = 6.750$

Cumulative Distribution Function: F distribution
with 12 DF in numerator and 2177 DF in denominator

| x | P( X < x ) |
|---|---|
| 6.75 | 1.00000 |

THIS IS LARGER THAN 0.95
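The group F statistic above can be reproduced with a small Python function. This is a sketch with illustrative names: the numerator is the gain in R squared per added variable, and the denominator is the unexplained share of the larger model per residual degree of freedom. The probability lookup is left to Minitab, as in the slides.

```python
def group_f(r2_large, r2_small, num_in_group, n, k_large):
    """F for a group of coefficients, comparing a larger model to a smaller one."""
    numerator = (r2_large - r2_small) / num_in_group
    denominator = (1.0 - r2_large) / (n - k_large - 1)
    return numerator / denominator

# With and without the 12 genre indicators; the larger model has K = 20, n = 2198
f_genre = group_f(r2_large=0.570, r2_small=0.554, num_in_group=12, n=2198, k_large=20)
print(f"F for the genre group = {f_genre:.3f}")
```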
49
Now What?
55/57
• If the value that Minitab shows you is greater
than 0.95, then the F statistic is large
• I.e., conclude that the group of coefficients is
significant
• This means that at least one is nonzero, not that
all necessarily are.

50
Testing .
53/57
• Use Minitab's F Calculator

51
F Test
54/57
Leave as is
Number of coefficients in the group
N-K-1 for the larger model