Statistics and Data Analysis

- Professor William Greene
- Stern School of Business
- IOMS Department
- Department of Economics

Statistics and Data Analysis

Part 19: Multiple Regression 3

Multiple Regression Modeling

- Data Preparation
  - Examining the data
  - Transformations
  - Scaling
- Analysis of the Regression
  - Residuals and outliers
  - Influential data points
  - The fit of the regression
  - R squared and adjusted R squared
  - Analysis of variance
  - Individual coefficient estimates and t statistics
  - Testing for significance of a set of coefficients
- Prediction

Data Preparation

- Get rid of observations with missing values.
  - Small numbers of missing values: delete the observations.
  - Large numbers of missing values: you may need to give up on certain variables.
  - There are theories and methods for filling missing values. (Advanced techniques. Usually not useful or appropriate for real world work.)
- Be sure that missingness is not directly related to the values of the dependent variable. E.g., a regression that follows systematically removing high values of Y is likely to be biased if you then try to use the results to describe the entire population. (E.g., any sample related to income or consumption that is drawn at an airport is likely to be biased vis-à-vis the entire population.)
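A minimal sketch of the two deletion rules above, using plain Python. The data, the function names, and the use of None as the missing-value marker are all assumptions made for illustration; they are not from the slides.

```python
# Hypothetical toy data: each row is (y, x1, x2); None marks a missing value.
rows = [
    (72.7, 9.4, 2646.4),
    (65.0, None, 1200.0),
    (58.3, 6.1, 310.5),
    (None, 7.7, 890.0),
    (70.1, 8.8, 2100.0),
]

def drop_incomplete(data):
    """Keep only the observations that have no missing values."""
    return [row for row in data if all(value is not None for value in row)]

def missing_share_by_column(data):
    """Fraction of missing values in each column.

    A large share suggests giving up on that variable instead of
    deleting most of the sample.
    """
    n = len(data)
    return [sum(row[j] is None for row in data) / n for j in range(len(data[0]))]

complete = drop_incomplete(rows)
shares = missing_share_by_column(rows)
print(len(complete), shares)
```

Checking the share of missing values per column first helps decide between deleting observations and dropping a variable. It does not check whether missingness is related to Y; that still requires knowing how the sample was drawn.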

Transform the Data?

- Just because a variable is skewed does not mean you should take logs. Take logs if the model you are fitting calls for taking logs. More later.
- Scaling? E.g., per capita data, or scaling by assets, number of shares, or sales, if it is appropriate in the context of the model (the study). More later.
- Do not transform variables without a good reason to do so. Skewness, by itself, is not a good reason. Don't scale variables just because the values are large.
- Transform data appropriately for the study you are doing (i.e., for the story you are trying to tell your reader).

Using Logs

- Generally, use logs for size variables.
- Use logs if you are seeking to estimate elasticities.
- Use logs if your data span a very large range of values and the independent variables do not (a modeling issue: some art mixed in with the science).
- If the data contain 0s or negative values, then logs will be inappropriate for the study. Do not use ad hoc fixes like adding something to Y so it will be positive.
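To illustrate the elasticity point: in a regression of log y on log x, the slope is the elasticity of y with respect to x. The sketch below uses made-up data generated from y = 3 x^0.8 (an assumption, chosen so the true elasticity is known) and a small hand-written least squares helper.

```python
import math

def ols_simple(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Hypothetical data generated from y = 3 * x**0.8, so the true elasticity is 0.8.
x = [1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
y = [3.0 * xi ** 0.8 for xi in x]

# Regress log y on log x; the slope estimates the elasticity.
a, b = ols_simple([math.log(xi) for xi in x], [math.log(yi) for yi in y])
print(round(b, 6), round(math.exp(a), 6))
```

Note that math.log raises an error for zero or negative inputs, which is the practical face of the last bullet above.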

More on Using Logs

- Generally only for continuous variables like income, or variables that are essentially continuous.
- Not for discrete variables like binary variables or qualitative variables (e.g., stress level 1, 2, 3, 4, 5).
- Generally, be consistent in the equation: don't mix logs and levels.
- Generally, DO NOT take the log of time (t) in a model with a time trend. TIME is discrete and not a measure.

Residuals

- Residual: the difference between the actual value of Y and the value predicted by the regression.
- E.g., Switzerland
  - Estimated equation is DALE = 36.900 + 2.9787 EDUC + .004601 PCHexp
  - Swiss values are EDUC = 9.418360, PCHexp = 2646.442
  - Regression prediction = 77.1307
  - Actual Swiss DALE = 72.71622
  - Residual = 72.71622 - 77.1307 = -4.41448
  - The regression overpredicts Switzerland.
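The Swiss calculation can be checked in a few lines. The coefficients and data values are the ones reported on the slide; the variable names are only illustrative. Because the slide's coefficients are rounded, the result may differ from the slide in the last digit or two.

```python
# Coefficients and Swiss values as reported on the slide.
a, b_educ, b_pchexp = 36.900, 2.9787, 0.004601
educ, pchexp = 9.418360, 2646.442
actual_dale = 72.71622

# Prediction from the estimated equation, then residual = actual - predicted.
predicted = a + b_educ * educ + b_pchexp * pchexp
residual = actual_dale - predicted

print(round(predicted, 4), round(residual, 4))
```

A negative residual means the actual value is below the prediction, i.e., the regression overpredicts.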

Using Residuals

- As indicators of bad data
- As indicators of observations that deserve attention
- As a diagnostic tool to evaluate the regression model

Outliers

- A residual is e_i = y_i - a - b1 x_i1 - ...
- The standard deviation of the residuals is s_e = sqrt[Σ e_i² / (n - K - 1)].
- Standardized residuals are e_i/s_e.
- Large residuals have |e_i/s_e| > 2.
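A small sketch of the rule above for a one-predictor regression. The data are invented for illustration, with one deliberately suspicious observation; s_e is computed with n - K - 1 in the denominator, matching the degrees of freedom used elsewhere in these notes.

```python
import math

# Hypothetical data: y is close to 2 + 3x except for one suspicious point (index 4).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.1, 7.9, 11.2, 13.8, 27.0, 20.1, 22.9, 26.2, 28.8, 32.1]

n, K = len(x), 1  # K = number of predictors
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# Residuals, their standard deviation, and the standardized residuals.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - K - 1))
standardized = [e / s_e for e in residuals]

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]
print(flagged)
```

A flag only marks an observation as deserving attention; it is not, by itself, a reason to delete it.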

Strip Mining the Sample: Residuals and Outliers

An Aside About Plotting

Appropriate Plot

Appropriate Residual Plot

Strip Mining the Data: Unusual Observations

When to Remove Outliers

- Outliers have very large residuals.
- Remove them only if it is ABSOLUTELY necessary:
  - The data are obviously miscoded.
  - There is something clearly wrong with the observation.
- Do not remove outliers just because Minitab flags them. This is not sufficient reason.

Units of Measurement

- y = a + b1 x1 + b2 x2 + e
- If you multiply every observation of variable x by the same constant, c, then the regression coefficient will be divided by c.
- E.g., multiply X by .001 to change $ to thousands of $; then b is multiplied by 1000. b times x will be unchanged.
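The scaling rule is easy to verify numerically. The sketch below fits the same one-predictor regression twice, once with x in dollars and once in thousands of dollars. The data and the helper function are assumptions for illustration only.

```python
import math

def ols_simple(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

# Hypothetical data: x measured in dollars.
x_dollars = [12000.0, 25000.0, 31000.0, 47000.0, 52000.0]
y = [3.1, 4.9, 6.2, 8.8, 9.5]

c = 0.001  # rescale dollars to thousands of dollars
x_thousands = [c * xi for xi in x_dollars]

a1, b1 = ols_simple(x_dollars, y)
a2, b2 = ols_simple(x_thousands, y)

# The coefficient is divided by c (here: multiplied by 1000); b times x is unchanged.
print(b1, b2, b2 / b1)
```

The intercept and the fitted values are the same in both versions; only the units of the coefficient change.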

Scaling the Data

- Units of measurement and coefficients
- Macro data and per capita figures
  - Gasoline data
  - WHO data
- Micro data and normalizations
  - R&D and Profits

The Gasoline Market

Aggregate consumption or expenditure data would not be interesting. Income data are already per capita.

The WHO Data

Per Capita GDP and Per Capita Health Expenditure. Aggregate values would make no sense.

Profits and R&D by Industry

Is there a relationship between R&D and Profits? This just shows that big industries have larger profits and R&D than small ones.

Gujarati, D. Basic Econometrics, McGraw Hill, 1995, p. 388.

Normalized by Sales

Profits/Sales = α + β R&D/Sales + ε

More Movie Madness

- McDonald's and Movies (Craig, Douglas, Greene, International Journal of Marketing)
- Log Foreign Box Office(movie, country, year) = α + β1 LogBox(movie, US, year) + β2 LogPCIncome + β4 LogMacsPC + GenreEffect + CountryEffect + ε.

We used McDonald's per capita.

Movie Madness Data (n = 2198)

Macs and Movies

Genres (MPAA): 1 = Drama, 2 = Romance, 3 = Comedy, 4 = Action, 5 = Fantasy, 6 = Adventure, 7 = Family, 8 = Animated, 9 = Thriller, 10 = Mystery, 11 = Science Fiction, 12 = Horror, 13 = Crime

Countries and Some of the Data

Code  Country    Pop (mm)  Per Capita Income  Number of McDonalds  Language
1     Argentina        37              12090                  173  Spanish
2     Chile            15               9110                   70  Spanish
3     Spain            39              19180                  300  Spanish
4     Mexico           98               8810                  270  Spanish
5     Germany          82              25010                 1152  German
6     Austria           8              26310                  159  German
7     Australia        19              25370                  680  English
8     UK               60              23550                 1152  UK

Making the Genre Variables

Calc ?

Movie Genres

CRIME is the left out GENRE. AUSTRIA is the left out country. Australia and UK were left out for other reasons (algebraic problem with only 8 countries).
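A sketch of how indicator (dummy) variables with a left-out category can be built by hand. The function name and the shortened genre list are hypothetical; the slides build these variables in Minitab.

```python
def make_indicators(values, categories, left_out):
    """Return the kept category names and one 0/1 column per kept category.

    The left-out category gets no column, so it becomes the baseline:
    an observation in that category has zeros in every indicator.
    """
    kept = [c for c in categories if c != left_out]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Shortened, hypothetical genre list; the slides use 13 genres with Crime left out.
genres = ["Drama", "Romance", "Comedy", "Crime"]
movies = ["Comedy", "Crime", "Drama", "Comedy"]

names, dummies = make_indicators(movies, genres, left_out="Crime")
print(names)
print(dummies)
```

Leaving one category out avoids the algebraic problem of the indicators summing to the constant term; each coefficient is then read as a difference from the left-out category.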

Model Fit

- How well does the model fit the data?
- R² measures fit: the larger the better
  - Time series: expect .9 or better
  - Cross sections: it depends
    - Social science data: .1 is good
    - Industry or market data: .5 is routine

OK Fit

Success Measure

- Hypothesis: There is no regression.
- Equivalent hypothesis: R² = 0.
- How to test: For now, a rough rule. Look for F > 2 for multiple regression. F = 144.34 for Movie Madness.

A Formal Test of the Regression Model

- Is there a significant relationship?
  - Equivalently, is R² > 0?
  - Statistically, not numerically.
- Testing:
  - Compute F = [R²/K] / [(1 - R²)/(n - K - 1)]
  - Determine if F is large using the appropriate table
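The F formula above as a short function, applied to the Movie Madness values used in these notes (R² = 0.570, n = 2198, K = 20). The function name is only illustrative. Because R² is rounded to three digits here, the result is about 144.3 and not exactly the 144.34 that Minitab reports.

```python
def f_from_r2(r2, n, k):
    """F statistic for the hypothesis that R squared is 0.

    r2: R squared of the regression
    n:  sample size
    k:  number of predictors (not counting the constant)
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

f_stat = f_from_r2(0.570, 2198, 20)
print(round(f_stat, 2))
```

Looking up the critical value still requires an F table or software; the Python standard library has no F distribution.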

The F Test for the Model

- Determine the appropriate critical value from the table.
- Is the F from the computed model larger than the theoretical F from the table?
  - Yes: Conclude the relationship is significant.
  - No: Conclude R² = 0.

n1 = Number of predictors; n2 = Sample size - number of predictors - 1

Testing .

- Use Minitab's F Calculator

Finding the Critical F

Leave as is

Number of predictors in the model K

n-K-1

Standard .95

Compare Sample F to Critical F

- F = 144.34 for Movie Madness
- Critical value from the table is 1.57536.
- Reject the hypothesis of no relationship.

An Equivalent Approach

- What is the P value?
- We observed an F of 144.34 (or whatever it is).
- If there really were no relationship, how likely is it that we would have observed an F this large (or larger)?
  - Depends on n and K
  - The probability is reported with the regression results as the P value.

The F Test

S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58
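The entries in the analysis of variance table are linked by simple arithmetic, which the sketch below reproduces from the sums of squares and degrees of freedom in the Minitab output. Variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the Minitab output.
ss_regression, df_regression = 2617.58, 20
ss_residual, df_residual = 1974.01, 2177
ss_total = 4591.58

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual   # the F in the table
s = math.sqrt(ms_residual)             # the S reported above the table
r2 = ss_regression / ss_total          # R-Sq as a proportion

print(round(ms_regression, 2), round(ms_residual, 2))
print(round(f_stat, 2), round(s, 4), round(r2, 3))
```

Here K = 20 is the regression degrees of freedom and n - K - 1 = 2177 is the residual degrees of freedom.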

A Huge Theorem

- R² always goes up when you add variables to your model (strictly speaking, it can never go down).
- Always.

Adjusted R Squared

- Adjusted R² penalizes your model for obtaining its fit with lots of variables. Adjusted R² = 1 - [(n - 1)/(n - K - 1)](1 - R²)
- Adjusted R² is denoted R̄² (R-bar squared).
- Adjusted R² is not the mean of anything and it is not a square. This is just a name.
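The adjustment formula, applied to the Movie Madness regression (R² = 0.570, n = 2198, K = 20). The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared = 1 - [(n - 1)/(n - k - 1)] * (1 - R squared)."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

adj = adjusted_r2(0.570, 2198, 20)
print(round(100 * adj, 1))  # as a percentage, for comparison with R-Sq(adj)
```

With n this large, the factor (n - 1)/(n - K - 1) is very close to 1, so the penalty for 20 predictors is small.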

The Analysis of Variance

S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58

If n is very large, R² and adjusted R² will not differ by very much. 2198 is quite large for this purpose.

Exploring the Relationship

- The F statistic examines the entire relationship. Benchmark: F > 2 is good for a multiple regression.
- What about individual coefficients? (E.g., is there a significant relationship between the number of McDonald's and the local box office result?)

Use individual t statistics. T > 2 or T < -2 suggests the variable is significant. T for LogPCMacs = 9.66. This is large. Note that the 2 for t statistics and the 4 = 2² for the F statistic for a simple regression (one predictor) is not a coincidence.
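The "not a coincidence" remark can be checked numerically: in a regression with one predictor, the square of the t statistic on the slope equals the F statistic. The data below are invented for illustration.

```python
import math

# Hypothetical data for a simple regression (one predictor).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 2.9, 4.1, 4.2, 5.8, 6.1, 7.4, 7.6]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)                    # total sum of squares
r2 = 1.0 - sse / sst

s_e = math.sqrt(sse / (n - 2))               # n - K - 1 with K = 1
t_stat = b / (s_e / math.sqrt(sxx))          # slope divided by its standard error
f_stat = (r2 / 1) / ((1.0 - r2) / (n - 2))   # F formula with K = 1

print(round(t_stat ** 2, 6), round(f_stat, 6))
```

This equality holds only with a single predictor; with several predictors, the F statistic tests all of the coefficients jointly.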

What About a Group of Variables?

- Is Genre significant?
  - There are 12 genre variables.
  - Some are significant (fantasy, mystery, horror), some are not.
  - Can we conclude the group as a whole is?
- Maybe. We need a test.

Theory for the Test

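- A larger model has a higher R² than a smaller one.
- (Larger model means it has all the variables in the smaller one, plus some additional ones.)
- (1) Compute this statistic with a calculator: F = [(R² of larger model - R² of smaller model)/(number of added variables)] / [(1 - R² of larger model)/(n - K - 1)], where K is the number of predictors in the larger model.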

Is Genre Significant?

With the 12 Genre indicator variables:     S = 0.952237   R-Sq = 57.0%
Without the 12 Genre indicator variables:  S = 0.967685   R-Sq = 55.4%

F = [(0.570 - 0.554)/12] / [(1 - 0.570)/(2198 - 20 - 1)] = 6.750

Cumulative Distribution Function
F distribution with 12 DF in numerator and 2177 DF in denominator
x      P( X < x )
6.75   1.00000

THIS IS LARGER THAN 0.95
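The F statistic for the group of genre variables, computed from the two R² values above. The function and argument names are illustrative; only the statistic is computed here, since the cumulative probability comes from Minitab.

```python
def group_f(r2_large, r2_small, n, k_large, j):
    """F statistic for testing a group of j coefficients.

    r2_large: R squared of the model that includes the group
    r2_small: R squared of the model without the group
    n:        sample size
    k_large:  number of predictors in the larger model
    j:        number of coefficients in the group being tested
    """
    numerator = (r2_large - r2_small) / j
    denominator = (1.0 - r2_large) / (n - k_large - 1)
    return numerator / denominator

f_genre = group_f(0.570, 0.554, 2198, 20, 12)
print(round(f_genre, 2))
```

The numerator degrees of freedom are the 12 genre coefficients and the denominator degrees of freedom are 2198 - 20 - 1 = 2177, matching the Minitab settings above.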

Now What?

- If the value that Minitab shows you is greater than 0.95, then the F statistic is large.
  - I.e., conclude that the group of coefficients is significant.
  - This means that at least one is nonzero, not that all necessarily are.

Testing .

- Use Minitab's F Calculator

F Test

Leave as is

Number of coefficients in the group

N-K-1 for the larger model

Your F from Step 1

Push the button

Summary

- Data preparation: missing values
- Residuals and outliers
- Scaling the data
- Model fit and analysis of variance: R²
- Testing
  - One variable (coefficient): the t test
  - A set of variables: the F test