The Power of Regression

- Previous Research Literature Claim
- Foreign-owned manufacturing plants have greater levels of strike activity than domestic plants
- In Canada, strike rates of 25.5 versus 20.3
- Budd's Claim
- Foreign-owned plants are larger and located in strike-prone industries
- Need multivariate regression analysis!

The Power of Regression

Dependent Variable: Strike Incidence

                                                   (1)              (2)              (3)
U.S. Corporate Parent (Canadian Parent omitted)    0.230 (0.117)    0.201 (0.119)    0.065 (0.132)
Number of Employees (1000s)                        ---              0.177 (0.019)    0.094 (0.020)
Industry Effects?                                  No               No               Yes
Sample Size                                        2,170            2,170            2,170

Significance is reported at the 0.10 level and at the 0.05 level (two-tailed tests).

Important Regression Topics

- Prediction
- Various confidence and prediction intervals
- Diagnostics
- Are assumptions for estimation and testing fulfilled?
- Specifications
- Quadratic terms? Logarithmic dep. vars.?
- Additional hypothesis tests
- Partial F tests
- Dummy dependent variables
- Probit and logit models

Confidence Intervals

- The true population whatever is within the following interval $(1-\alpha)$ of the time
- Estimate $\pm\ t_{\alpha/2} \times$ Standard Error of the Estimate

- Just need
- Estimate
- Standard Error
- Shape / Distribution (including degrees of freedom)

Prediction Interval for New Observation at xp

- 1. Point Estimate

- 2. Standard Error

- 3. Shape
- t distribution with n-k-1 d.f.

- 4. So prediction interval for a new observation is

Siegel, p. 481

Prediction Interval for Mean Observations at xp

- 1. Point Estimate

- 2. Standard Error

- 3. Shape
- t distribution with n-k-1 d.f

- 4. So the interval for the mean of observations at xp is

Siegel, p. 483
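The formulas on the two interval slides above did not survive. The block below is a minimal sketch for the simple-regression case (one x, so k = 1), written in a standard textbook form that is assumed here rather than copied from the slides. The symbols $S_e$ (standard error of the regression) and $S_b$ (standard error of the slope) are notation introduced for this sketch.

```latex
\begin{align*}
\text{Point estimate (both cases):}\quad & \hat{y}_p = b_0 + b_1 x_p \\
\text{Std. error, new observation:}\quad & S_{\text{new}} = \sqrt{S_e^2\left(1 + \frac{1}{n}\right) + S_b^2\,(x_p - \bar{x})^2} \\
\text{Std. error, mean at } x_p\text{:}\quad & S_{\text{mean}} = \sqrt{\frac{S_e^2}{n} + S_b^2\,(x_p - \bar{x})^2} \\
\text{Interval:}\quad & \hat{y}_p \pm t_{\alpha/2,\,n-k-1}\; S
\end{align*}
```

The only difference between the two cases is the extra $S_e^2$ term for a new observation, which is why that interval is always the wider one.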

Earlier Example

Hours of Study (x) and Exam Score (y) Example

- Find 95% CI for Joe's exam score (studies for 20 hours)
- Find 95% CI for mean score for those who studied for 20 hours

Regression Statistics
Multiple R        0.770
R Squared         0.594
Adj. R Squared    0.543
Standard Error    10.710
Obs.              10

ANOVA
              df    SS          MS          F         Significance
Regression    1     1340.452    1340.452    11.686    0.009
Residual      8     917.648     114.706
Total         9     2258.100

              Coeff.    Std. Error    t stat    p value    Lower 95%    Upper 95%
Intercept     39.401    12.153        3.242     0.012      11.375       67.426
hours         2.122     0.621         3.418     0.009      0.691        3.554

$\bar{x} = 18.80$
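As a hedged worked example, the two intervals asked for on this slide can be computed from the output above. The standard-error formulas are the usual simple-regression textbook forms (an assumption, since the slide formulas are not in the text), and the critical value 2.306 for 8 degrees of freedom comes from a standard t table, not from the source.

```python
import math

# Values read from the regression output above.
b0, b1 = 39.401, 2.122      # intercept and slope
s_e = 10.710                # standard error of the regression
s_b1 = 0.621                # standard error of the slope
n, x_bar = 10, 18.80        # sample size and mean hours of study
x_p = 20.0                  # hours of study for the prediction

# Assumption: two-tailed 5% critical value of t with n - 2 = 8 d.f.
T_CRIT = 2.306

# Point estimate, shared by both intervals.
y_hat = b0 + b1 * x_p

# Standard errors (textbook simple-regression forms, assumed here).
se_new = math.sqrt(s_e**2 * (1 + 1 / n) + s_b1**2 * (x_p - x_bar) ** 2)
se_mean = math.sqrt(s_e**2 / n + s_b1**2 * (x_p - x_bar) ** 2)


def interval(center, se, t_crit=T_CRIT):
    """Return (lower, upper) for center +/- t_crit * se."""
    half_width = t_crit * se
    return center - half_width, center + half_width


new_lo, new_hi = interval(y_hat, se_new)
mean_lo, mean_hi = interval(y_hat, se_mean)

print(f"point estimate:            {y_hat:.2f}")
print(f"95% interval, Joe's score: ({new_lo:.1f}, {new_hi:.1f})")
print(f"95% interval, mean score:  ({mean_lo:.1f}, {mean_hi:.1f})")
```

Under these assumptions the interval for Joe's individual score is roughly three times as wide as the interval for the mean score, because it also has to cover the scatter of individual students around the line.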

Diagnostics / Misspecification

- For estimation and testing to be valid
- $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e$ makes sense
- Errors ($e_i$) are independent
- of each other
- of the independent variables
- Homoskedasticity
- Error variance independent
- of the independent variables
- $\sigma_e^2$ is a constant
- $\mathrm{Var}(e_i) \neq x_i \sigma^2$ (i.e., not heteroskedasticity)

Violations render our inferences invalid and misleading!

Common Problems

- Misspecification
- Omitted variable bias
- Nonlinear rather than linear relationship
- Levels, logs, or percent changes?
- Data Problems
- Skewed variables and outliers
- Multicollinearity
- Sample selection (non-random data)
- Missing data
- Problems with residuals (error terms)
- Non-independent errors
- Heteroskedasticity

Omitted Variable Bias

- Question 3 from Sample Exam B
- wage = 9.05 + 1.39 union
- (1.65) (0.66)
- wage = 9.56 + 1.42 union + 3.87 ability
- (1.49) (0.56) (1.56)
- wage = -3.03 + 0.60 union + 0.25 revenue
- (0.70) (0.45) (0.08)
- H. Farber thinks the average union wage is different from average nonunion wage because unionized employers are more selective and hire individuals with higher ability.
- M. Friedman thinks the average union wage is different from the average nonunion wage because unionized employers have different levels of revenue per employee.

Checking the Assumptions

- How to check the validity of the assumptions?
- Cynicism, Realism, and Theory
- Robustness Checks
- Check different specifications
- But don't just choose the best one!
- Automated Variable Selection Methods
- e.g., Stepwise regression (Siegel, p. 547)
- Misspecification and Other Tests
- Examine Diagnostic Plots

Diagnostic Plots

Increasing spread might indicate heteroskedasticity. Try transformations or weighted least squares.

Diagnostic Plots

Tilt from outliers might indicate skewness. Try log transformation.

Problematic Outliers

Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)

Without 7 Outliers

Number of obs = 44      R-squared = 0.1718
--------------------------------------------------
stockrating    Coef.     Std. Err.    t        P>|t|
--------------------------------------------------
handicap       -1.711    .580         -2.95    0.005
_cons          73.234    8.992        8.14     0.000
--------------------------------------------------

With the 7 Outliers

Number of obs = 51      R-squared = 0.0017
--------------------------------------------------
stockrating    Coef.     Std. Err.    t        P>|t|
--------------------------------------------------
handicap       -.173     .593         -0.29    0.771
_cons          55.137    9.790        5.63     0.000
--------------------------------------------------

Are They Really Outliers??

Diagnostic Plot is OK

BE CAREFUL!

Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)

Diagnostic Plots

Curvature might indicate nonlinearity. Try quadratic specification.

Diagnostic Plots

Good diagnostic plot. Lacks obvious indications of other problems.

Adding Squared (Quadratic) Term

Job Performance regression on Salary (in 1,000s) (Egg Data)

Source      SS        df     MS          Number of obs = 576
--------------------------------         F(2, 573)     = 122.42
Model       255.61    2      127.8       Prob > F      = 0.0000
Residual    598.22    573    1.044       R-squared     = 0.2994
--------------------------------         Adj R-squared = 0.2969
Total       853.83    575    1.485       Root MSE      = 1.0218

------------------------------------------------------------
job performance    Coef.        Std. Err.    t        P>|t|
------------------------------------------------------------
salary             .0980844     .0260215     3.77     0.000
salary squared     -.000337     .0001905     -1.77    0.077
_cons              -1.720966    .8720358     -1.97    0.049
------------------------------------------------------------

Salary Squared = Salary^2 (=salary^2 in Excel)

Quadratic Regression

Job perf = -1.72 + 0.098 salary - 0.00034 salary squared

Quadratic regression (nonlinear)

Quadratic Regression

Job perf = -1.72 + 0.098 salary - 0.00034 salary squared

Effect of salary will eventually turn negative

But where?
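One way to answer the question is to set the slope of the fitted quadratic to zero. The sketch below uses the unrounded coefficients from the regression output. The function name `marginal_effect` is introduced only for illustration, and the result inherits the imprecision of the squared-term estimate.

```python
# Coefficients from the quadratic regression output (salary in 1,000s).
B1 = 0.0980844    # salary
B2 = -0.000337    # salary squared


def marginal_effect(salary):
    """Slope of predicted job performance with respect to salary."""
    return B1 + 2 * B2 * salary


# The fitted curve peaks where the slope is zero: salary = -b1 / (2 * b2).
turning_point = -B1 / (2 * B2)

print(f"turning point: salary of about {turning_point:.1f} (in 1,000s)")
print(f"slope at 100:  {marginal_effect(100):+.4f}")
print(f"slope at 200:  {marginal_effect(200):+.4f}")
```

Below the turning point an extra unit of salary is associated with higher predicted performance. Above it the predicted effect is negative.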

Another Specification Possibility

- If data are very skewed, can try a log specification
- Can use logs instead of levels for independent and/or dependent variables
- Note that the interpretation of the coefficients will change
- Re-familiarize yourself with Siegel, pp. 68-69

Quick Note on Logs

- a is the natural logarithm of x if
- $2.71828^a = x$
- or, $e^a = x$
- The natural logarithm is abbreviated ln
- $\ln(x) = a$
- In Excel, use ln function
- We call this the log but don't use the log function!
- Usefulness: spreads out small values and narrows large values which can reduce skewness
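A small illustration of the last point, using a made-up right-skewed sample. The earnings values are hypothetical, not the CPS data, and `skewness` is a simple moment-based measure chosen for this sketch.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical weekly earnings with a long right tail.
weekly_earnings = [200, 250, 300, 350, 400, 500, 650, 900, 1500, 4000]

# math.log is the natural logarithm (ln).
log_earnings = [math.log(v) for v in weekly_earnings]

skew_levels = skewness(weekly_earnings)
skew_logs = skewness(log_earnings)

print(f"skewness in levels: {skew_levels:.2f}")
print(f"skewness in logs:   {skew_logs:.2f}")
```

For this sample the log transformation pulls in the large values and leaves a much less skewed distribution, though not a perfectly symmetric one.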

Earnings Distribution

Skewed to the right

Weekly Earnings from the March 2002 CPS, n = 15,000

Residuals from Levels Regression

Skewed to the right; use of t distribution is suspect

Residuals from a regression of Weekly Earnings on demographic characteristics

Log Earnings Distribution

Not perfectly symmetrical, but better

Natural Logarithm of Weekly Earnings from the March 2002 CPS, i.e., ln(weekly earnings)

Residuals from Log Regression

Almost symmetrical; use of t distribution is probably OK

Residuals from a regression of Log Weekly Earnings on demographic characteristics

Hypothesis Tests

- We've been doing hypothesis tests for single coefficients
- $H_0: \beta = 0$; reject if $|t| > t_{\alpha/2,\,n-k-1}$
- $H_A: \beta \neq 0$
- What about testing more than one coefficient at the same time?
- e.g., want to see if an entire group of 10 dummy variables for 10 industries should be in the model
- Joint tests can be conducted using partial F tests

Partial F Tests

- $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_C = 0$
- $H_A$: at least one $\beta_i \neq 0$
- How to test this?
- Consider two regressions
- One as if $H_0$ is true
- i.e., $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_C = 0$
- This is a restricted (or constrained) model
- Plus a full (or unconstrained) model in which the computer can estimate what it wants for each coefficient

Partial F Tests

- Statistically, need to distinguish between
- Full regression no better than the restricted regression
- versus
- Full regression is significantly better than the restricted regression
- To do this, look at variance of prediction errors
- If this declines significantly, then reject H0
- From ANOVA, we know ratio of two variances has an F distribution
- So use F test

Partial F Tests

- SSresidual = Sum of Squares Residual
- C = number of constraints
- The partial F statistic has C, n-k-1 degrees of freedom
- Reject $H_0$ if $F > F_{\alpha,\,C,\,n-k-1}$
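The statistic itself is not in the text of this slide. A standard form, consistent with the definitions above and with the values 142.406 and 3.950 computed later in the deck, is sketched below. The superscripts "restricted" and "full" are labels introduced here.

```latex
F \;=\; \frac{\left(SS_{\text{residual}}^{\text{restricted}} - SS_{\text{residual}}^{\text{full}}\right)/C}
             {SS_{\text{residual}}^{\text{full}}/(n-k-1)}
```

The numerator is the increase in unexplained variation per constraint imposed. The denominator is the residual variance of the full model.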

Coal Mining Example (Again)

Regression Statistics
R Squared         0.955
Adj. R Squared    0.949
Standard Error    108.052
Obs.              47

ANOVA
              df    SS              MS             F          Significance
Regression    6     9975694.933     1662615.822    142.406    0.000
Residual      40    467007.875      11675.197
Total         46    10442702.809

             Coeff.      Std. Error    t stat    p value    Lower 95%    Upper 95%
Intercept    -168.510    258.819       -0.651    0.519      -691.603     354.583
hours        1.244       0.186         6.565     0.000      0.001        0.002
tons         0.048       0.403         0.119     0.906      -0.001       0.001
unemp        19.618      5.660         3.466     0.001      8.178        31.058
WWII         159.851     78.218        2.044     0.048      1.766        317.935
Act1952      -9.839      100.045       -0.098    0.922      -212.038     192.360
Act1969      -203.010    111.535       -1.820    0.076      -428.431     22.411

Minitab Output

Predictor    Coef      StDev    T        P
Constant     -168.5    258.8    -0.65    0.519
hours        1.2235    0.186    6.56     0.000
tons         0.0478    0.403    0.12     0.906
unemp        19.618    5.660    3.47     0.001
WWII         159.85    78.22    2.04     0.048
Act1952      -9.8      100.0    -0.10    0.922
Act1969      -203.0    111.5    -1.82    0.076

S = 108.1    R-Sq = 95.5%    R-Sq(adj) = 94.9%

Analysis of Variance
Source        DF    SS          MS         F         P
Regression    6     9975695     1662616    142.41    0.000
Error         40    467008      11675
Total         46    10442703

Is the Overall Model Significant?

- $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_6 = 0$
- $H_A$: at least one $\beta_i \neq 0$
- Note: for testing the overall model, C = k
- i.e., testing all coefficients together
- From the previous slides, we have SSresidual for the full (or unconstrained) model
- SSresidual = 467,007.875
- But what about for the restricted (H0 true) regression?
- Estimate a constant only regression

Constant-Only Model

Regression Statistics
R Squared         0
Adj. R Squared    0
Standard Error    476.461
Obs.              47

ANOVA
              df    SS              MS            F    Significance
Regression    0     0               0             .    .
Residual      46    10442702.809    227015.278
Total         46    10442702.809

             Coeff.     Std. Error    t stat    p value    Lower 95%    Upper 95%
Intercept    671.937    69.499        9.668     0.0000     532.042      811.830

Partial F Tests

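F = [(10,442,702.809 - 467,007.875) / 6] / [467,007.875 / 40] = 142.406

- $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_6 = 0$
- $H_A$: at least one $\beta_i \neq 0$
- Reject $H_0$ if $F > F_{\alpha,\,C,\,n-k-1} = F_{0.05,\,6,\,40} = 2.34$
- 142.406 > 2.34 so reject H0. Yes, overall model is significant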

Select F Distribution 5% Critical Values

                         Numerator Degrees of Freedom
Denominator d.f.     1       2       3       4       5       6
1                    161     199     216     225     230     234
2                    18.5    19.0    19.2    19.2    19.3    19.3
3                    10.1    9.55    9.28    9.12    9.01    8.94
8                    5.32    4.46    4.07    3.84    3.69    3.58
10                   4.96    4.10    3.71    3.48    3.33    3.22
11                   4.84    3.98    3.59    3.36    3.20    3.09
12                   4.75    3.89    3.49    3.26    3.11    3.00
18                   4.41    3.55    3.16    2.93    2.77    2.66
40                   4.08    3.23    2.84    2.61    2.45    2.34
1000                 3.85    3.00    2.61    2.38    2.22    2.11

A Small Shortcut


For the constant-only model, SSresidual = 10,442,702.809, which is the same as the Total SS in the full model's ANOVA table.

So to test the overall model, you don't need to run a constant-only model.

An Even Better Shortcut


In fact, the ANOVA table F test is exactly the test for the overall model being significant; recall Unit 8.

Testing Any Subset


Partial F test can be used to test any subset of variables

For example,
$H_0: \beta_{WWII} = \beta_{Act1952} = \beta_{Act1969} = 0$
$H_A$: at least one $\beta_i \neq 0$

Restricted Model

Restricted regression with $\beta_{WWII} = \beta_{Act1952} = \beta_{Act1969} = 0$

Regression Statistics
R Squared         0.942
Adj. R Squared    0.938
Standard Error    118.651
Obs.              47

ANOVA
              df    SS              MS             F          Significance
Regression    3     9837344.76      3279114.920    232.923    0.000
Residual      43    605358.049      14078.094
Total         46    10442702.809

             Coeff.     Std. Error    t stat    p value
Intercept    147.821    166.406       0.888     0.379
hours        0.0015     0.0001        20.522    0.000
tons         -0.0008    0.0003        -2.536    0.015
unemp        7.298      4.386         1.664     0.103

Partial F Tests

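F = [(605,358.049 - 467,007.875) / 3] / [467,007.875 / 40] = 3.950

- $H_0: \beta_{WWII} = \beta_{Act1952} = \beta_{Act1969} = 0$
- $H_A$: at least one $\beta_i \neq 0$
- Reject $H_0$ if $F > F_{\alpha,\,C,\,n-k-1} = F_{0.05,\,3,\,40} = 2.84$
- 3.95 > 2.84 so reject H0. Yes, the subset of three coefficients is jointly significant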
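The two coal mining tests can be reproduced from the residual sums of squares alone. This is a minimal sketch: the helper name `partial_f` is introduced here for illustration, and the formula is the standard partial F form assumed throughout this unit.

```python
def partial_f(ss_res_restricted, ss_res_full, num_constraints, df_full):
    """Partial F statistic for a restricted model nested in a full model.

    df_full is the residual degrees of freedom of the full model (n - k - 1).
    """
    numerator = (ss_res_restricted - ss_res_full) / num_constraints
    denominator = ss_res_full / df_full
    return numerator / denominator


# Full coal mining model: residual SS and residual d.f. from its ANOVA table.
SS_RES_FULL = 467007.875
DF_FULL = 40

# Overall test: the restricted model is the constant-only regression (C = 6).
f_overall = partial_f(10442702.809, SS_RES_FULL, 6, DF_FULL)

# Subset test: restricted model drops WWII, Act1952, and Act1969 (C = 3).
f_subset = partial_f(605358.049, SS_RES_FULL, 3, DF_FULL)

print(f"overall model F: {f_overall:.3f}")
print(f"subset F:        {f_subset:.3f}")
```

Each statistic is then compared with the 5% critical value for (C, 40) degrees of freedom, 2.34 and 2.84 respectively.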

Regression and Two-Way ANOVA

A    B    C    B2    B3    B4    B5    Value
1    0    0    0     0     0     0     10
1    0    0    1     0     0     0     12
1    0    0    0     1     0     0     18
1    0    0    0     0     1     0     20
1    0    0    0     0     0     1     8
0    1    0    0     0     0     0     9
0    1    0    1     0     0     0     6
0    1    0    0     1     0     0     15
0    1    0    0     0     1     0     18
0    1    0    0     0     0     1     7
0    0    1    0     0     0     0     8

             Treatments
Blocks    A     B     C
1         10    9     8
2         12    6     5
3         18    15    14
4         20    18    18
5         8     7     8

Stack data using dummy variables

Recall Two-Way Results

ANOVA: Two-Factor Without Replication

Source of Variation    SS         df    MS        F         P-value    F crit
Blocks                 312.267    4     78.067    38.711    0.000      3.84
Treatment              26.533     2     13.267    6.579     0.020      4.46
Error                  16.133     8     2.017
Total                  354.933    14

Regression and Two-Way ANOVA

Source      SS         df    MS          Number of obs = 15
--------------------------------         F(6, 8)       = 28.00
Model       338.800    6     56.467      Prob > F      = 0.0001
Residual    16.133     8     2.017       R-squared     = 0.9545
--------------------------------         Adj R-squared = 0.9205
Total       354.933    14    25.352      Root MSE      = 1.4201

-------------------------------------------------------------------
treatment    Coef.     Std. Err.    t        P>|t|    [95% Conf. Int.]
-------------------------------------------------------------------
b            -2.600    .898         -2.89    0.020    -4.671    -.529
c            -3.000    .898         -3.34    0.010    -5.071    -.929
b2           -1.333    1.160        -1.15    0.283    -4.007    1.340
b3           6.667     1.160        5.75     0.000    3.993     9.340
b4           9.667     1.160        8.34     0.000    6.993     12.340
b5           -1.333    1.160        -1.15    0.283    -4.007    1.340
_cons        10.867    .970         11.20    0.000    8.630     13.104
-------------------------------------------------------------------

Regression and Two-Way ANOVA

Regression Excerpt for Full Model

Source      SS         df    MS
----------------------------
Model       338.800    6     56.467
Residual    16.133     8     2.017
----------------------------
Total       354.933    14    25.352

Use these SSresidual values to do partial F tests and you will get exactly the same answers as the Two-Way ANOVA tests

Regression Excerpt for $\beta_{b2} = \beta_{b3} = \dots = 0$

Source      SS         df    MS
----------------------------
Model       26.533     2     13.267
Residual    328.40     12    27.367
----------------------------
Total       354.933    14    25.352

Regression Excerpt for $\beta_{b} = \beta_{c} = 0$

Source      SS         df    MS
----------------------------
Model       312.267    4     78.067
Residual    42.667     10    4.267
----------------------------
Total       354.933    14    25.352
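As a check on that claim, the sketch below plugs the three residual sums of squares into the standard partial F form and compares the results with the Two-Way ANOVA F values (38.711 for blocks, 6.579 for treatments). The helper name `partial_f` is introduced here for illustration.

```python
def partial_f(ss_res_restricted, ss_res_full, num_constraints, df_full):
    """Partial F: drop in residual SS per constraint over full-model MSE."""
    numerator = (ss_res_restricted - ss_res_full) / num_constraints
    denominator = ss_res_full / df_full
    return numerator / denominator


# Full model with treatment dummies (b, c) and block dummies (b2..b5).
SS_RES_FULL = 16.133
DF_FULL = 8

# Blocks test: restricted model sets the four block dummies to zero.
f_blocks = partial_f(328.40, SS_RES_FULL, 4, DF_FULL)

# Treatments test: restricted model sets the two treatment dummies to zero.
f_treatments = partial_f(42.667, SS_RES_FULL, 2, DF_FULL)

print(f"blocks F:     {f_blocks:.3f}")
print(f"treatments F: {f_treatments:.3f}")
```

Up to rounding in the printed sums of squares, both values match the ANOVA table, and both exceed their 5% critical values (3.84 and 4.46).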

Select F Distribution 5% Critical Values

                         Numerator Degrees of Freedom
Denominator d.f.     1       2       3       4       5       6       9
1                    161     199     216     225     230     234     241
2                    18.5    19.0    19.2    19.2    19.3    19.3    19.4
3                    10.1    9.55    9.28    9.12    9.01    8.94    8.81
8                    5.32    4.46    4.07    3.84    3.69    3.58    3.39
10                   4.96    4.10    3.71    3.48    3.33    3.22    3.02
11                   4.84    3.98    3.59    3.36    3.20    3.09    2.90
12                   4.75    3.89    3.49    3.26    3.11    3.00    2.80
18                   4.41    3.55    3.16    2.93    2.77    2.66    2.46
40                   4.08    3.23    2.84    2.61    2.45    2.34    2.12
1000                 3.85    3.00    2.61    2.38    2.22    2.11    1.89
∞                    3.84    3.00    2.60    2.37    2.21    2.10    1.88

3 Seconds of Calculus

Regression Coefficients

- $y = b_0 + b_1 x$ (linear form): 1 unit change in x changes y by $b_1$
- $\log(y) = b_0 + b_1 x$ (semi-log form): 1 unit change in x changes y by $b_1$ (x100) percent
- $\log(y) = b_0 + b_1 \log(x)$ (double-log form): 1 percent change in x changes y by $b_1$ percent

Log Regression Coefficients

- wage = 9.05 + 1.39 union
- Predicted wage is 1.39 higher for unionized workers (on average)
- log(wage) = 2.20 + 0.15 union
- Semi-elasticity
- Predicted wage is approximately 15% higher for unionized workers (on average)
- log(wage) = 1.61 + 0.30 log(profits)
- Elasticity
- A one percent increase in profits increases predicted wages by approximately 0.3 percent
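The word "approximately" matters because the percent readings are first-order approximations. A short sketch of the exact conversions for the two log equations above follows. The function names are introduced here for illustration.

```python
import math


def semi_log_percent(b1, delta_x=1.0):
    """Exact percent change in y for a delta_x change in x when log(y) = b0 + b1*x."""
    return 100.0 * (math.exp(b1 * delta_x) - 1.0)


def double_log_percent(b1, pct_change_x=1.0):
    """Exact percent change in y for a pct_change_x percent change in x
    when log(y) = b0 + b1*log(x)."""
    return 100.0 * ((1.0 + pct_change_x / 100.0) ** b1 - 1.0)


# log(wage) = 2.20 + 0.15 union: union goes from 0 to 1.
union_pct = semi_log_percent(0.15)

# log(wage) = 1.61 + 0.30 log(profits): profits rise by one percent.
profits_pct = double_log_percent(0.30)

print(f"union effect:  {union_pct:.2f} percent (approximation: 15)")
print(f"profit effect: {profits_pct:.3f} percent (approximation: 0.3)")
```

The approximation is close for small coefficients and drifts as coefficients grow, so the exact conversion is worth using when a semi-log coefficient is large.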

Multicollinearity

Auto repair records, weight, and engine size

Number of obs = 69      F(2, 66)      = 6.84      Prob > F = 0.0020
R-squared     = 0.1718  Adj R-squared = 0.1467    Root MSE = .91445
----------------------------------------------
repair    Coef.      Std. Err.    t        P>|t|
----------------------------------------------
weight    -.00017    .00038       -0.41    0.685
engine    -.00313    .00328       -0.96    0.342
_cons     4.50161    .61987       7.26     0.000
----------------------------------------------

Multicollinearity

- Two (or more) independent variables are so highly correlated that a multiple regression can't disentangle the unique contributions of each
- Large standard errors and lack of statistical significance for individual coefficients
- But joint significance
- Identifying multicollinearity
- Some say rule of thumb: r > 0.70 (or 0.80)
- But better to look at results
- OK for prediction. Bad for assessing theory.

Prediction With Multicollinearity

- Prediction at the Mean (weight = 3019 and engine = 197)

Model for prediction    Predicted Repair    Lower 95% Limit (Mean)    Upper 95% Limit (Mean)
Multiple Regression     3.411               3.191                     3.631
Weight Only             3.412               3.193                     3.632
Engine Only             3.410               3.192                     3.629

Dummy Dependent Variables

- Dummy dependent variables
- $y = b_0 + b_1 x_1 + \dots + b_k x_k + e$
- Where y is a {0,1} indicator variable
- Examples
- Do you intend to quit? yes/no
- Did the worker receive training? yes/no
- Do you think the President is doing a good job? yes/no
- Was there a strike? yes/no
- Did the company go bankrupt? yes/no

Linear Probability Model

- Mathematically / computationally, can estimate a regression as usual (the monkeys won't know the difference)
- This is called a linear probability model
- Right-hand side is linear
- And is estimating probabilities
- $P(y = 1) = b_0 + b_1 x_1 + \dots + b_k x_k$
- $b_1 = 0.15$ (for example) means that a one unit change in $x_1$ increases probability that y = 1 by 0.15 (fifteen percentage points)

Linear Probability Model

- Excel won't know the difference, but perhaps it should
- Linear probability model problems
- $\sigma_e^2 = P(y=1)\,[1 - P(y=1)]$
- But $P(y = 1) = b_0 + b_1 x_1 + \dots + b_k x_k$
- So $\sigma_e^2$ is not constant (heteroskedasticity)
- Predicted probabilities are not bounded by 0,1
- $R^2$ is not an accurate measure of predictive ability
- Can use a pseudo-$R^2$ measure
- Such as percent correctly predicted
Logit Model and Probit Model

- Solution to these problems is to use nonlinear functional forms that bound P(y=1) between 0,1
- Logit Model (logistic regression)
- Probit Model
- Where $\Phi$ is the normal cumulative distribution function

Recall, ln(x) = a when $e^a = x$
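The model equations are not in the text of this slide. The standard textbook forms, written in the same $b_0, b_1, \dots, b_k$ notation as the linear probability model, are sketched below as an assumption about what the slide showed.

```latex
\begin{align*}
\text{Logit:}\quad & \ln\!\left[\frac{P(y=1)}{1 - P(y=1)}\right] = b_0 + b_1 x_1 + \dots + b_k x_k
  \quad\Longleftrightarrow\quad
  P(y=1) = \frac{e^{\,b_0 + b_1 x_1 + \dots + b_k x_k}}{1 + e^{\,b_0 + b_1 x_1 + \dots + b_k x_k}} \\[4pt]
\text{Probit:}\quad & P(y=1) = \Phi\!\left(b_0 + b_1 x_1 + \dots + b_k x_k\right)
\end{align*}
```

Both right-hand sides lie strictly between 0 and 1 for any value of the linear index, which is what fixes the unbounded predictions of the linear probability model.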

Logit Model and Probit Model

- Nonlinear so need statistical package to do the calculations
- Can do individual (z-tests, not t-tests) and joint statistical testing as with other regressions
- Also confidence intervals
- Need to convert coefficients to marginal effects for interpretation
- Should be aware of these models
- Though in many cases, a linear probability model works just fine

Example

- Dep. Var: 1 if you know of the FMLA, 0 otherwise

Probit estimates                  Number of obs = 1189
                                  LR chi2(14)   = 232.39
                                  Prob > chi2   = 0.0000
Log likelihood = -707.94377       Pseudo R2     = 0.1410
------------------------------------------------------------------
FMLAknow    Coef.     Std. Err.    z        P>|z|    [95% Conf. Int.]
------------------------------------------------------------------
union       .238      .101         2.35     0.019    .039      .436
age         -.002     .018         -0.13    0.897    -.038     .033
agesq       .135      .219         0.62     0.536    -.293     .564
nonwhite    -.571     .098         -5.80    0.000    -.764     -.378
income      1.465     .393         3.73     0.000    .696      2.235
incomesq    -5.854    2.853        -2.05    0.040    -11.45    -.262
(other controls omitted)
_cons       -1.188    .328         -3.62    0.000    -1.831    -.545
------------------------------------------------------------------

Marginal Effects

- For numerical interpretation / prediction, need to convert coefficients to marginal effects
- Example: Logit Model
- So $b_1$ gives effect on the log odds, $\ln[P(y=1)/(1-P(y=1))]$, not P(y=1)
- Probit is similar
- Can re-arrange to find out effect on P(y=1)
- Usually do this at the sample means

Marginal Effects

Probit estimates                  Number of obs = 1189
                                  LR chi2(14)   = 232.39
                                  Prob > chi2   = 0.0000
Log likelihood = -707.94377       Pseudo R2     = 0.1410
------------------------------------------------------------------
FMLAknow    dF/dx     Std. Err.    z        P>|z|    [95% Conf. Int.]
------------------------------------------------------------------
union       .095      .040         2.35     0.019    .017      .173
age         -.001     .007         -0.13    0.897    -.015     .013
agesq       .054      .087         0.62     0.536    -.117     .225
nonwhite    -.222     .036         -5.80    0.000    -.293     -.151
income      .585      .157         3.73     0.000    .278      .891
incomesq    -2.335    1.138        -2.05    0.040    -4.566    -.105
(other controls omitted)
------------------------------------------------------------------

For numerical interpretation / prediction, need to convert coefficients to marginal effects
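A rough sketch of the conversion follows. For a continuous variable the marginal effect is the coefficient times the slope of the probability function at a chosen index value: the normal density for probit, and P(1-P) for logit. The index at the sample means is not reported in the output, so the value 0.0 used below is an assumption picked because it approximately reproduces the dF/dx column. Statistical packages may also treat dummy variables such as union as a discrete 0-to-1 change rather than a derivative.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def probit_marginal_effect(coef, index):
    """dP/dx for probit: coefficient times normal density at the index."""
    return coef * normal_pdf(index)


def logit_marginal_effect(coef, index):
    """dP/dx for logit: coefficient times P * (1 - P) at the index."""
    p = 1.0 / (1.0 + math.exp(-index))
    return coef * p * (1.0 - p)


# Assumption: linear index evaluated at the sample means is about zero.
INDEX_AT_MEANS = 0.0

union_me = probit_marginal_effect(0.238, INDEX_AT_MEANS)
income_me = probit_marginal_effect(1.465, INDEX_AT_MEANS)

print(f"union:  {union_me:.3f}  (reported dF/dx: .095)")
print(f"income: {income_me:.3f}  (reported dF/dx: .585)")
```

Because the density shrinks as the index moves away from zero, the same coefficient implies a smaller marginal effect for observations whose predicted probability is near 0 or 1.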

But Linear Probability Model is OK, Too

                  Probit Coeff.     Probit Marginal    Regression
Union             0.238 (0.101)     0.095 (0.040)      0.084 (0.035)
Nonwhite          -0.571 (0.098)    -0.222 (0.037)     -0.192 (0.033)
Income            1.465 (0.393)     0.585 (0.157)      0.442 (0.091)
Income Squared    -5.854 (2.853)    -2.335 (1.138)     -1.354 (0.316)

So regression is usually OK, but should still be familiar with logit and probit methods