Research Method - PowerPoint PPT Presentation


Slides: 31
Provided by: Rafael184



Research Method
  • Lecture 9 (Ch9)
  • More on specification and Data issues

Using Proxy Variables for Unobserved Explanatory Variables
  • Suppose you are interested in estimating the
    return to education. So you consider the
    following model.
  • $\log(Wage) = \beta_0 + \beta_1 Educ + \beta_2 Exp + (\beta_3 Ability + u)$  (1)
  • Ability is unobserved, so it is included in the
    composite error term. If Ability is correlated
    with years of education, the OLS estimate of $\beta_1$ will be biased.
  • Question: if Ability is correlated with Educ,
    what is the direction of the bias?
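
As a hint for this question, the standard omitted-variable result can be written out as below. The symbols $\tilde{\beta}_1$ (the OLS coefficient on Educ when Ability is left out) and $\gamma_0, \gamma_1, \gamma_2, r$ (a linear projection of Ability on the included regressors) are notation introduced here for illustration; they do not appear on the slides.

```latex
\operatorname{plim}\,\tilde{\beta}_1 \;=\; \beta_1 + \beta_3\,\gamma_1,
\qquad \text{where} \qquad
Ability = \gamma_0 + \gamma_1\,Educ + \gamma_2\,Exp + r .
```

The sign of the bias is therefore the sign of $\beta_3\gamma_1$. For example, if Ability raises wages ($\beta_3 > 0$) and is positively related to Educ once Exp is held fixed ($\gamma_1 > 0$), the return to education is overstated.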

  • One way to eliminate the bias is to use panel
    data and then apply the fixed effects or the first
    differencing method.
  • Another method is to use a proxy variable for
    ability. This is the topic of this section.
  • Suppose that IQ is a proxy variable for ability,
    and that IQ is available in your data.

  • Then, the basic idea is to estimate the
    following regression.
  • Regress Log(Wage) on Educ, Exp, and IQ  (2)
  • This is called the plug-in solution to the
    omitted variables problem.
  • The question is under what conditions (2)
    produces consistent estimates for the original
    regression (1). I will explain these conditions
    using the above example (though the arguments can
    be easily generalized).
  • It turns out that the following two conditions ensure
    that you get consistent estimates by using the
    plug-in solution.

  • Condition 1: u is uncorrelated with IQ. In
    addition, the original equation should satisfy
    the usual conditions (i.e., u is also uncorrelated
    with Educ, Exp, and Ability).
  • Condition 2: $E(Ability \mid Educ, Exp, IQ) = E(Ability \mid IQ)$
  • Condition 2 means that, once IQ is conditioned on,
    Educ and Exp do not explain Ability. A simpler
    way to express condition 2 is that
    Ability can be written as
  • $Ability = \delta_0 + \delta_3 IQ + v_3$  (3)
  • where $v_3$ is a random error that is uncorrelated
    with IQ, Educ, and Exp. What this means is
    that Ability is a function of IQ only.

[Slide callouts: Ability is the omitted variable; Educ and Exp are the initial explanatory variables; IQ is the proxy variable.]
  • Then, it is clear why these two conditions
    guarantee that the plug-in solution produces
    consistent estimates. Just plug (3) into (1).
    Then you have
  • $\log(Wage) = (\beta_0 + \beta_3\delta_0) + \beta_1 Educ + \beta_2 Exp + \beta_3\delta_3 IQ + (u + \beta_3 v_3)$  (4)
  • Note that the coefficient on IQ is $\beta_3\delta_3$, not $\beta_3$.
  • Since u and $v_3$ are uncorrelated with all of the
    explanatory variables under condition 1 and
    condition 2, the slope estimates are consistent.
    The intercept has changed, but usually you are
    not interested in the intercept. Importantly, you
    get consistent estimates of the slope
    parameters $\beta_1$ and $\beta_2$.
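
The argument above can be checked with a small simulation. The sketch below is only an illustration: the parameter values, the standardized IQ variable, and the helper `ols` are assumptions made for this example and are not taken from the lecture or from Wage2.dta. Ability is generated as a function of IQ plus independent noise, so condition 2 holds by construction.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients [intercept, slope_1, ...] by least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200_000

# Illustrative parameter values (assumptions, not estimates from any data set).
b0, b1, b2, b3 = 1.0, 0.08, 0.02, 0.5
d0, d3 = 0.0, 1.0

iq = rng.normal(size=n)                       # proxy, in standardized units
educ = 12 + 2 * iq + rng.normal(size=n)       # Educ is correlated with IQ
exper = rng.uniform(0, 30, size=n)
ability = d0 + d3 * iq + rng.normal(size=n)   # Ability depends on IQ only
logwage = (b0 + b1 * educ + b2 * exper + b3 * ability
           + rng.normal(scale=0.3, size=n))

coef_omit = ols(logwage, educ, exper)         # Ability left in the error term
coef_proxy = ols(logwage, educ, exper, iq)    # plug-in regression with IQ

print("true beta1:             ", b1)
print("Educ coef, no proxy:    ", round(coef_omit[1], 3))
print("Educ coef, IQ plugged in:", round(coef_proxy[1], 3))
print("IQ coef and beta3*delta3:", round(coef_proxy[3], 3), b3 * d3)
```

In this setup the regression without IQ overstates the return to education, while the plug-in regression recovers a coefficient close to the true $\beta_1$, and the coefficient on IQ is close to $\beta_3\delta_3$ rather than $\beta_3$.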

  • It is also clear that, if condition 2 is
    violated, then the plug-in solution will not
    work. If condition 2 is violated, then
    Ability will be a function of not only IQ but
    also Educ and Exp. So you will have
  • $Ability = \delta_0 + \delta_1 Educ + \delta_2 Exp + \delta_3 IQ + v_3$  (5)
  • If you plug (5) into (1), you have
  • $\log(Wage) = (\beta_0 + \beta_3\delta_0) + (\beta_1 + \beta_3\delta_1) Educ + (\beta_2 + \beta_3\delta_2) Exp + \beta_3\delta_3 IQ + (u + \beta_3 v_3)$  (6)
  • Thus, the coefficient on Educ is no longer $\beta_1$
    but $\beta_1 + \beta_3\delta_1$. The plug-in solution therefore
    produces inconsistent estimates when condition 2
    is violated.

If condition 2 is violated, then Ability is a
function of all the variables.
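
A minimal simulation sketch of the violated case follows. All parameter values are assumptions chosen for illustration. Ability is generated from IQ and also from Educ and Exp, and the plug-in regression then converges to $\beta_1 + \beta_3\delta_1$ instead of $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative parameter values (assumptions for this sketch only).
b0, b1, b2, b3 = 1.0, 0.08, 0.02, 0.5
d0, d1, d2, d3 = 0.0, 0.1, 0.05, 1.0   # d1, d2 nonzero: condition 2 fails

iq = rng.normal(size=n)
educ = 12 + 2 * iq + rng.normal(size=n)
exper = rng.uniform(0, 30, size=n)

# Ability now depends on Educ and Exp even after conditioning on IQ.
ability = d0 + d1 * educ + d2 * exper + d3 * iq + rng.normal(size=n)
logwage = (b0 + b1 * educ + b2 * exper + b3 * ability
           + rng.normal(scale=0.3, size=n))

# Plug-in regression: Log(Wage) on Educ, Exp, and IQ.
X = np.column_stack([np.ones(n), educ, exper, iq])
coef, *_ = np.linalg.lstsq(X, logwage, rcond=None)

print("true beta1:          ", b1)
print("beta1 + beta3*delta1:", b1 + b3 * d1)
print("estimated Educ coef: ", round(coef[1], 3))
print("beta2 + beta3*delta2:", b2 + b3 * d2)
print("estimated Exp coef:  ", round(coef[2], 3))
```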
  • Ex.1 Use Wage2.dta to estimate a log wage
    equation to examine the return to education.
    Include in the equation exper, tenure, married,
    south, urban, and black. Do you think that the estimated return
    to education is unbiased? What do you think is
    the direction of the bias?
  • Ex.2 Now, use IQ as a proxy for unobserved
    ability. Did the result change? Was your
    prediction of the direction of the bias correct?

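Answer: OLS without IQ (regression output not included in this transcript)
Answer: OLS with IQ (regression output not included in this transcript)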
Using lagged dependent variables as proxy variables
  • Often the lag of the dependent variable is used
    as a proxy for the unobserved variables.
  • First consider the following model.
  • $(\text{Crime rate}) = \beta_0 + \beta_1 (unemp) + \beta_2 (expenditure) + u$
  • If there are omitted factors that directly affect
    the crime rate and are at the same time correlated with
    the unemployment rate, the estimate of $\beta_1$ will be biased. The omitted
    factors may be pre-existing conditions, such as
    demographic features (age, race, etc.). Crime rates
    could also differ among cities for historical
    reasons.

  • The idea is that the lag of the dependent
    variable may summarize such pre-existing
    conditions.
  • So, estimate the following equation.
  • $(\text{Crime rate})_{it} = \beta_0 + \beta_1 (unemp)_{it} + \beta_2 (expenditure)_{it} + \beta_3 (\text{Crime rate})_{i,t-1} + u_{it}$
  • The following slides estimate this model.

  • We use Crime2.dta to estimate the
    regressions. The results are the following.

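Without the lag of the dependent variable (regression output not included in this transcript).
With the lag of the dependent variable (regression output not included in this transcript).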
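
Since the Crime2.dta results are not reproduced here, the idea can be illustrated with simulated data instead. The sketch below is an assumption-laden toy example, not the Crime2.dta regression: it uses two periods per city, drops the expenditure variable for brevity, and invents an unobserved city factor `a` that affects crime and is correlated with unemployment.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients [intercept, slope_1, ...] by least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 200_000                  # number of simulated cities
b0, b1 = 1.0, 0.5            # illustrative values (assumptions)

a = rng.normal(size=n)       # unobserved pre-existing city conditions
unemp_prev = 0.8 * a + rng.normal(size=n)
unemp_now = 0.8 * a + rng.normal(size=n)
crime_prev = b0 + b1 * unemp_prev + a + rng.normal(size=n)
crime_now = b0 + b1 * unemp_now + a + rng.normal(size=n)

coef_no_lag = ols(crime_now, unemp_now)
coef_lag = ols(crime_now, unemp_now, crime_prev)

print("true beta1:              ", b1)
print("unemp coef, without lag: ", round(coef_no_lag[1], 3))
print("unemp coef, with lag:    ", round(coef_lag[1], 3))
```

In this particular setup the lagged crime rate is only an imperfect summary of the city factor, so including it reduces the bias in the unemployment coefficient but does not remove it completely.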
Measurement error
  • The existence of important omitted variables
    causes an endogeneity problem.
  • Another source of endogeneity is measurement
    error.
  • This section explains under what circumstances
    measurement error causes endogeneity, and
    under what circumstances it does not.

Measurement error in explanatory variable.
  • When the explanatory variables are measured with
    error, this causes an endogeneity problem.
  • This is a common problem. For example, in a
    typical survey, the respondents may report their
    annual incomes with a lot of error. Variables
    such as GPA or IQ may be reported with error as
    well.

  • Now, let us understand the nature of the problem.
  • Suppose that you want to estimate the following
    simple regression.
  • $y = \beta_0 + \beta_1 x_1^* + u$  (1)
  • where $x_1^*$ is the measurement-error-free variable.
    Suppose that this regression satisfies MLR.1
    through MLR.4.
  • Now, suppose that you only observe the
    error-ridden variable $x_1$. That is,
  • $x_1 = x_1^* + e_1$
  • where $e_1$ is a random error uncorrelated with $x_1^*$.

  • To be more precise, the measurement error is such
    that
  • $x_1 = x_1^* + e_1$  (2)
  • and
  • $Cov(x_1^*, e_1) = 0$  (3)
  • (2) and (3) are called the classical
    errors-in-variables (CEV) assumption.
  • Note that the above assumption has nothing to do
    with u. We maintain the assumption that u is
    uncorrelated with both $x_1^*$ and $x_1$. This also
    means that u is uncorrelated with $e_1$.

  • Because we only observe the error-ridden variable
    $x_1$, we can only estimate the following model.
  • $y = \beta_0 + \beta_1 x_1 + v$  (4)
  • Under the CEV assumption, the observed
    (error-ridden) variable in regression (4) is
    endogenous.
  • To see this, plug $x_1^* = x_1 - e_1$ into the original
    regression (1) to get
  • $y = \beta_0 + \beta_1 x_1 + (u - \beta_1 e_1)$  (5)

  • So, we have $v = u - \beta_1 e_1$.
  • Now, notice that
  • $Cov(x_1, v) = Cov(x_1, u - \beta_1 e_1) \neq 0$
  • See the front board for the proof.
  • Therefore, $x_1$ is correlated with the error term;
    that is, $x_1$ is endogenous. Thus, OLS will be
    biased and inconsistent.
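
The board proof is not part of this transcript. One standard way to obtain the result, which may differ in presentation from what was shown in class, uses only the CEV assumption and the assumption that u is uncorrelated with both the true variable and the measurement error. Here $\sigma_{e_1}^2$ denotes $Var(e_1)$, a symbol introduced for this derivation.

```latex
\begin{aligned}
Cov(x_1, v)
  &= Cov(x_1^* + e_1,\; u - \beta_1 e_1) \\
  &= Cov(x_1^*, u) - \beta_1 Cov(x_1^*, e_1) + Cov(e_1, u) - \beta_1 Var(e_1) \\
  &= 0 - 0 + 0 - \beta_1 \sigma_{e_1}^2
   \;=\; -\beta_1 \sigma_{e_1}^2 .
\end{aligned}
```

This is nonzero whenever $\beta_1 \neq 0$ and there is some measurement error ($\sigma_{e_1}^2 > 0$).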

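  • Under the CEV assumption, we can predict the
    direction of the inconsistency (characterization
    of the bias is difficult). Let $\hat{\beta}_1$ be the
    estimated coefficient from the error-ridden
    variable regression (4). Then, we have
  • $\operatorname{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right)$
  • where $\sigma_{x_1^*}^2 = Var(x_1^*)$ and $\sigma_{e_1}^2 = Var(e_1)$.
  • Proof: see the front board.
  • Since the term inside the parentheses is always
    smaller than 1, the estimate is biased towards zero.
    This is called the attenuation bias.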
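
Attenuation bias is easy to see in simulated data. The following sketch uses illustrative variances and coefficients (assumptions, not values from the lecture) and compares the OLS slope from the error-ridden regressor with the value implied by the variance-ratio formula under the CEV assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
b0, b1 = 1.0, 2.0            # illustrative values (assumptions)
sd_x, sd_e = 1.0, 0.5        # std. dev. of the true variable and of the error

x_true = rng.normal(scale=sd_x, size=n)      # error-free variable
e1 = rng.normal(scale=sd_e, size=n)          # classical measurement error
x_obs = x_true + e1                          # what the researcher observes
y = b0 + b1 * x_true + rng.normal(size=n)

# Simple-regression slope of y on the observed (error-ridden) variable.
slope_obs = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Probability limit implied by the attenuation formula.
predicted = b1 * sd_x**2 / (sd_x**2 + sd_e**2)

print("true beta1:         ", b1)
print("slope using x_obs:  ", round(slope_obs, 3))
print("attenuation formula:", round(predicted, 3))
```

Raising `sd_e` relative to `sd_x` in this sketch pushes the estimated slope further towards zero.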

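Errors in variables (more general case)
  • Suppose you want to estimate the following model.
  • $y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \dots + \beta_k x_k + u$
  • where $x_1^*$ is the measurement-error-free variable.
  • However, you only observe the error-ridden variable
    $x_1$. So you can only estimate the above regression
    by replacing $x_1^*$ with $x_1$.

  • Assume that the other variables are free of measurement
    error.
  • Then the probability limit of $\hat{\beta}_1$ is given by
  • $\operatorname{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} \right)$

where $r_1^*$ is the population error from the
following regression:
$x_1^* = \delta_0 + \delta_1 x_2 + \dots + \delta_{k-1} x_k + r_1^*$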
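
The general formula can be illustrated in the same way. In this sketch (with parameter values that are assumptions for illustration), the true variable is correlated with a second regressor `x2`, so the relevant variance in the attenuation factor is that of the part of the true variable not explained by `x2`, rather than the variance of the true variable itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
b0, b1, b2 = 1.0, 2.0, -1.0      # illustrative values (assumptions)
sd_r, sd_e = 0.6, 0.5            # std. dev. of r1* and of the measurement error

x2 = rng.normal(size=n)
# True variable: partly explained by x2, plus the population error r1*.
x1_true = 0.5 + 0.8 * x2 + rng.normal(scale=sd_r, size=n)
x1_obs = x1_true + rng.normal(scale=sd_e, size=n)     # CEV measurement error
y = b0 + b1 * x1_true + b2 * x2 + rng.normal(size=n)

# Regress y on the observed x1 and on x2.
X = np.column_stack([np.ones(n), x1_obs, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

predicted = b1 * sd_r**2 / (sd_r**2 + sd_e**2)

print("true beta1:                ", b1)
print("estimated coef on x1_obs:  ", round(coef[1], 3))
print("general attenuation formula:", round(predicted, 3))
```

Because `x2` absorbs part of the variation in the true variable, the attenuation here is stronger than the simple-regression formula would suggest for the same measurement error.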
Measurement error in the dependent variable
  • When the measurement error is in the dependent
    variable, but the explanatory variables have no
    measurement error, there will be no bias in OLS.
  • Consider the following model.
  • $y^* = \beta_0 + \beta_1 x_1 + u$  (1)
  • where $y^*$ is the measurement-error-free variable.
  • But you only observe the error-ridden variable y.

  • Assume the following:
  • $y = y^* + e$  (2)
  • and
  • $Cov(y^*, e) = 0$  (3)
  • Again, we maintain the assumption that u is
    uncorrelated with $x_1$. We also assume that the
    measurement error e is uncorrelated with $x_1$.
  • By plugging $y^* = y - e$ into (1), we have the
    following regression.
  • $y = \beta_0 + \beta_1 x_1 + (u + e)$  (4)
  • Since e and u are not correlated with the
    explanatory variables, OLS on (4) has no bias in the
    estimated coefficients.
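
A short simulation sketch of this case is given below, with illustrative values that are assumptions for the example. The measurement error in the dependent variable is generated independently of the regressor, and the slope estimate stays close to the true value.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
b0, b1 = 1.0, 2.0                         # illustrative values (assumptions)

x1 = rng.normal(size=n)
y_true = b0 + b1 * x1 + rng.normal(size=n)    # error-free dependent variable
e = rng.normal(scale=2.0, size=n)             # measurement error, independent of x1
y_obs = y_true + e                            # what the researcher observes

slope_true = np.cov(x1, y_true)[0, 1] / np.var(x1, ddof=1)
slope_obs = np.cov(x1, y_obs)[0, 1] / np.var(x1, ddof=1)

print("true beta1:         ", b1)
print("slope using y_true: ", round(slope_true, 3))
print("slope using y_obs:  ", round(slope_obs, 3))
```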

Non-random sampling 1: Exogenous sampling
  • Consider the following regression.
  • $Saving = \beta_0 + \beta_1 (income) + \beta_2 (age) + u$
  • Suppose that the survey is conducted only for people
    over 35 years old. This is non-random sampling,
    but the sampling criterion is based on an
    independent variable. This is called sample
    selection based on the independent variables, and
    it is an example of exogenous sample selection.
  • In this case, OLS regression of the above model
    has no bias.

Non-random sampling 2: Endogenous sampling
  • Consider the following regression.
  • $Wealth = \beta_0 + \beta_1 (Educ) + \beta_2 (Exper) + u$
  • However, suppose that only people with wealth
    below 250,000 are included in the sample. Then
    the sample selection criterion is based on the
    dependent variable. This is called sample
    selection based on the dependent variable, and it is an
    example of endogenous sample selection.
  • In this case, the OLS estimates of the above
    regression are always biased.
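
The contrast between exogenous and endogenous sample selection can be sketched with one simulated data set. The model, cutoffs, and parameter values below are generic assumptions for illustration and are not the saving or wealth regressions themselves: one subsample is selected on the regressor, the other on the dependent variable.

```python
import numpy as np

def slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(5)
n = 400_000
b0, b1 = 1.0, 2.0                  # illustrative values (assumptions)

x = rng.normal(size=n)
y = b0 + b1 * x + rng.normal(size=n)

exog = x > 0.0                     # selection on the independent variable
endog = y < 1.0                    # selection on the dependent variable

slope_full = slope(x, y)
slope_exog = slope(x[exog], y[exog])
slope_endog = slope(x[endog], y[endog])

print("true beta1:                   ", b1)
print("full random sample:           ", round(slope_full, 3))
print("selected on x (exogenous):    ", round(slope_exog, 3))
print("selected on y (endogenous):   ", round(slope_endog, 3))
```

In this sketch the subsample selected on the regressor still gives a slope close to the true value, while the subsample selected on the dependent variable gives a slope that is clearly too small.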

Stratified sampling
  • This is a common survey method, in which the
    population is divided into non-overlapping
    groups, or strata. The sampling is random within
    each group.
  • However, some groups are often oversampled in
    order to increase the number of observations for those groups.
    Whether this causes a bias depends on whether
    the selection is exogenous or endogenous.

  • If females are oversampled, and you are
    interested in gender differences in savings,
    then this is exogenous sample selection.
    Thus, this causes no bias.
  • If people with low wealth are oversampled, and if
    you are interested in the wealth regression, then
    this is endogenous sample selection. This causes
    a bias in the regression.

A more subtle form of sample selection
  • Suppose that you are interested in estimating the
    wage offer regression.
  • $\log(\text{wage offer}) = \beta_0 + \beta_1 (Educ) + \beta_2 (Exper) + u$
  • When the wage offer is too low for a particular
    person, the person may decide not to work. Thus,
    this person will not be included in the sample.
    This is the case where sample selection is caused
    by the person's decision to work or not.

  • When the decision is based on unobserved factors,
    OLS regression on the selected sample is biased. This is
    called the sample selection bias.
  • This is typically a problem for the study of the
    wage offer for women.
  • This course does not cover methods to correct
    for this type of bias. In the fall semester, I
    will cover these issues in a new course,
    Cross Section and Panel Data Analysis.
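
Although the correction methods are outside this course, the problem itself can be illustrated with a simulation. The sketch below is a toy example under stated assumptions: a hypothetical reservation wage, a single regressor (Educ), and made-up parameter values. A person is observed only if the wage offer exceeds the reservation wage, so the selection depends on the unobserved error u.

```python
import numpy as np

def slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(6)
n = 400_000
b0, b1 = 0.5, 0.1                       # illustrative values (assumptions)

educ = rng.normal(12, 2, size=n)
u = rng.normal(scale=0.5, size=n)       # unobserved wage determinants
log_offer = b0 + b1 * educ + u

# Hypothetical reservation wage: the person works only if the offer exceeds it.
reservation = 1.7 + rng.normal(scale=0.5, size=n)
works = log_offer > reservation

# For comparison: dropping half of the sample purely at random.
random_keep = rng.random(n) < 0.5

slope_all = slope(educ, log_offer)
slope_workers = slope(educ[works], log_offer[works])
slope_random = slope(educ[random_keep], log_offer[random_keep])

print("true beta1:                ", b1)
print("all wage offers:           ", round(slope_all, 4))
print("workers only (selected):   ", round(slope_workers, 4))
print("random half of the sample: ", round(slope_random, 4))
```

In this setup, people with little education appear in the sample only when their unobserved wage component is unusually high, which flattens the estimated relationship; dropping observations at random does not have this effect.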