1
2. Linear dependent variables
  • 2.1 The basic idea underlying linear regression
  • 2.2 Single variable OLS
  • 2.3 Correctly interpreting the coefficients
  • 2.4 Examining the residuals
  • 2.5 Multiple regression
  • 2.6 Heteroskedasticity
  • 2.7 Correlated errors
  • 2.8 Multicollinearity
  • 2.9 Outlying observations
  • 2.10 Median regression
  • 2.11 Looping

2
2.1 The basic idea underlying linear regression
  • A simple linear regression aims to characterize
    the relation between a dependent variable and one
    independent variable using a straight line
  • You have already seen how to fit a line between
    two variables using the scatter command
  • Linear regression does the same thing but it can
    be extended to include multiple independent
    variables

3
2.1 The basic idea
  • For example, you predict that larger companies
    usually pay higher fees
  • You can formalize the effect of company size on predicted fees using a simple equation: Predicted fees = a0 + a1 Size
  • The parameter a0 represents what fees are expected to be in the case that Size = 0.
  • The parameter a1 captures the impact of an increase in Size on expected fees.

4
2.1 The basic idea
  • The parameters a0 and a1 are assumed to be the
    same for all observations and they are called
    regression coefficients
  • You may argue that company size is not the only
    variable that affects audit fees. For example,
    the complexity of the audit engagement, or the
    size of the audit firm may also matter.
  • If you do not know all the factors that influence
    fees, the predicted fee that you calculate from
    the above equation will differ from the
    actual fee.

5
2.1 The basic idea
  • The deviation between the predicted fee and the
    actual fee is called the residual. In general,
    you might represent the relation between actual fees and predicted fees in the following way: Actual fees = Predicted fees + e
  • where e represents the residual term (i.e., the difference between actual and predicted fees)

6
2.1 The basic idea
  • Putting the two together we can express actual fees using the following equation: Fees = a0 + a1 Size + e
  • The goal of regression analysis is to estimate the parameters a0 and a1

7
2.1 The basic idea
  • One of the simplest techniques to estimate the
    coefficients is known as ordinary least squares
    (OLS).
  • The objective of OLS is to make the difference
    between the predicted and actual values as small
    as possible
  • In other words, the goal is to minimize the
    magnitude of the residuals

8
2.1 The basic idea
  • Go to http://ihome.ust.hk/~accl/Phd_teaching.htm
  • Download ols.dta to your hard drive and open it in STATA (use "C:\phd\ols.dta", clear)
  • Examine the graphical relation between the two variables: twoway (scatter y x) (lfit y x)

9
2.1 The basic idea
  • This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS)
  • The main assumptions required to obtain these
    coefficients are that
  • The relation between y and x is linear
  • The x variable is uncorrelated with the residuals
    (i.e., x is exogenous)
  • The residuals have a mean value of zero

10
2.1 The basic idea
11
(No Transcript)
12
Class exercise 2a
  • Using the formulas and the data currently in
    STATA, calculate the parameters a1 and a0
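  • One way to compute them (a sketch, not the only route: it uses the standard OLS formulas a1 = cov(x,y)/var(x) and a0 = mean(y) - a1*mean(x), together with the results stored by corr and sum):
  • corr y x, cov
  • scalar a1 = r(cov_12)/r(Var_2)
  • sum y
  • scalar ybar = r(mean)
  • sum x
  • scalar a0 = ybar - a1*r(mean)
  • display "a1 = " a1 ", a0 = " a0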

13
2.2 Single variable OLS (regress)
  • Instead of using these formulas to calculate the
    regression coefficients, we can instead use the
    regress command
  • regress y x
  • The first variable (y) is the dependent variable
    while the second (x) is the independent variable

14
2.2 Single variable OLS (regress)
  • This gives the following output

15
2.2 Single variable OLS (regress)
  • The coefficient estimates are 3.000 for the a0
    parameter and 0.500 for the a1 parameter
  • We can use these to predict the values of Y for any given value of X. For example, when X = 5 we predict that Y will be
  • display 3.000091+0.500909*5

16
2.2 Single variable OLS (_b)
  • Alternatively, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname], where varname is replaced with the name of the independent variable or the constant (_cons)
  • display _b[_cons]+_b[x]*5

17
2.2 Single variable OLS
  • Note that the predicted value of y when x equals
    5 differs from the actual value
  • list y if x==5
  • The actual value is 5.68 compared to the
    predicted value of 5.50. The difference for this
    observation is the residual error that arises
    because x is not a perfect predictor of y.

18
2.2 Single variable OLS
  • If we want to compute the predicted value of y
    for each value of x in our dataset, we can use
    the saved coefficients
  • gen y_hat=_b[_cons]+_b[x]*x
  • The estimated residuals are the difference
    between the observed y values and the predicted y
    values
  • gen y_res=y-y_hat
  • list x y_hat y y_res

19
2.2 Single variable OLS (predict)
  • A quicker way to do this would be to use the
    predict command after regress
  • predict yhat
  • predict yres, resid
  • Checking that this gives the same answer
  • list yhat y_hat yres y_res
  • You should also note that the values of x, yhat
    and yres correspond with those found on the
    scatter graph
  • sort x
  • list x y y_hat y_res

20
2.2 Single variable OLS
21
2.2 Single variable OLS
  • Note that by construction, there is zero
    correlation between the x variable and the
    residuals
  • twoway (scatter y_res x) (lfit y_res x)
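  • You can also verify this numerically (the correlation should be zero up to rounding error)
  • corr y_res x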

22
2.2 Single variable OLS
  • Standard errors
  • Typically our data comprises a sample that is
    taken from a larger population
  • The coefficients are only estimates of the true
    a0 and a1 values that describe the entire
    population
  • If we obtained a second random sample from the
    same population, we would obtain different
    coefficient estimates for a0 and a1

23
2.2 Single variable OLS
  • We therefore need a way to describe the
    variability that would obtain if we were to apply
    OLS to many different samples
  • Equivalently, we want a measure of how
    precisely our coefficients are estimated
  • The solution is to calculate standard errors,
    which are simply the sample standard deviations
    associated with the estimated coefficients
  • Standard errors (SEs) allow us to perform
    statistical tests, e.g., is our estimate of a1
    significantly greater than zero?

24
2.2 Single variable OLS
  • The techniques for estimating standard errors are
    based on additional OLS assumptions
  • Homoscedasticity (i.e., the residuals have a
    constant variance)
  • Non-correlation (i.e., the residuals are not
    correlated with each other)
  • Normality (i.e., the residuals are normally
    distributed)

25
2.2 Single variable OLS
  • The t-statistic is obtained by dividing the
    coefficient estimate by the standard error
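  • STATA stores the standard errors alongside the coefficients, so you can reproduce the t-statistic by hand, e.g. for x
  • display _b[x]/_se[x]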

26
2.2 Single variable OLS
  • The p-values are from the t-distribution and they
    tell you how likely it is that you would have
    observed the estimated coefficient under the
    assumption that the true coefficient in the
    population is zero.
  • The p-value of 0.002 tells you that it is very unlikely (a probability of 0.2%) that the true coefficient on x is zero.
  • The confidence interval means you can be 95% confident that the true coefficient of x lies between 0.233 and 0.767.

27
2.2 Single variable OLS
  • To explain this we need some notation
  • TSS = Σ(yi - ȳ)² captures the variation in y around its mean
  • RSS = Σei² captures the variation that is not explained by x
  • ESS = Σ(ŷi - ȳ)² captures the variation that is explained by x

28
2.2 Single variable OLS
  • The total sum of squares (TSS) = 41.27
  • The explained sum of squares (ESS) = 27.51
  • The residual sum of squares (RSS) = 13.76
  • Note that TSS = ESS + RSS.

29
2.2 Single variable OLS
  • The column labeled df contains the number of
    degrees of freedom
  • For the ESS, df = k - 1, where k = number of regression coefficients (df = 2 - 1)
  • For the RSS, df = n - k, where n = number of observations (= 11 - 2)
  • For the TSS, df = n - 1 (= 11 - 1)
  • The last column (MS) reports the ESS, RSS and TSS
    divided by their respective degrees of freedom

30
2.2 Single variable OLS
  • The first number simply tells us how many
    observations are used to estimate the model
  • The other statistics here tell you how well the
    model explains the variation in Y

31
2.2 Single variable OLS
  • The R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666)
  • So x explains 66% of the variation in y.
  • Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared.
  • This is not only invalid, it is also very dangerous (I will explain why later)

32
2.2 Single variable OLS
  • One problem with the R-squared is that it will
    always increase even when an independent variable
    is added that has very little explanatory power.
  • Adding another variable is not always a good idea
    as you lose one degree of freedom for each
    additional coefficient that needs to be
    estimated. Adding insignificant variables can be
    especially inefficient if you are working with a
    small sample size.
  • The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated
  • Adj R-squared = 1 - (1 - R2)(n - 1)/(n - k) = 1 - (1 - 0.666)(10/9) = 0.629
  • In fact the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero
  • Adj R-squared = 1 - (n - 1)/(n - 2) = (n - 2 - n + 1)/(n - 2) = -1/(n - 2)
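  • As a check, the adjusted R-squared can be reproduced from the results stored by regress (e(r2) is the R-squared, e(N) the number of observations, and e(df_r) the residual degrees of freedom, n - k)
  • display 1-(1-e(r2))*(e(N)-1)/e(df_r)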

33
2.2 Single variable OLS
  • You might think that another way to measure the
    fit of the model is to add up the residuals.
    However, by definition the residuals will sum to
    zero.
  • An alternative is to square the residuals, add
    them up (giving the RSS) and then take the square
    root.
  • Root MSE = square root of RSS/(n - k)
  • = (13.76 / (11 - 2))^0.5 = 1.236
  • One way to interpret the root MSE is that it shows how far away, on average, the model is from explaining y
  • The F-statistic = (ESS/(k - 1))/(RSS/(n - k))
  • = (27.51 / 1)/(13.76/9) = 17.99
  • the F statistic is used to test whether the R-squared is significantly greater than zero (i.e., are the independent variables jointly significant?)
  • Prob > F gives the probability that the R-squared we calculated will be observed if the true R-squared in the population is actually equal to zero
  • This F test is used to test the overall
    statistical significance of the regression model
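  • These can likewise be verified from the stored results of regress (e(mss) and e(rss) are the ESS and RSS; e(df_m) and e(df_r) their degrees of freedom)
  • display sqrt(e(rss)/e(df_r))
  • display (e(mss)/e(df_m))/(e(rss)/e(df_r))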

34
Class exercise 2b
  • Open your Fees.dta file and run the following two
    regressions
  • audit fees on total assets
  • the log of audit fees on the log of total assets
  • What does the output of your regression mean?
  • Which model appears to have the better fit?

35
2.3 Correctly interpreting the coefficients
  • So far we have considered the case where the
    independent variable is continuous.
  • Interpretation of results is even more
    straightforward when the independent variable is
    a dummy.
  • reg auditfees big6
  • ttest auditfees, by(big6)

36
2.3 Correctly interpreting the coefficients
  • Suppose we wish to test whether the Big 6 fee
    premium is significantly different between listed
    and non-listed companies

37
2.3 Correctly interpreting the coefficients
  • gen listed=0
  • replace listed=1 if companytype==2 | companytype==3 | companytype==5
  • reg auditfees big6 if listed==0
  • ttest auditfees if listed==0, by(big6)
  • reg auditfees big6 if listed==1
  • ttest auditfees if listed==1, by(big6)
  • gen listed_big6=listed*big6
  • reg auditfees big6 listed listed_big6

38
2.3 Correctly interpreting the coefficients
  • Some studies report the economic significance
    of the estimated coefficients as well as the
    statistical significance
  • Economic significance refers to the magnitude of
    the impact of x on y
  • There is no single way to evaluate economic
    significance but many studies describe the
    change in the predicted value of y as x increases
    from the 25th percentile to the 75th (or as x
    changes by one standard deviation around its mean)

39
2.3 Correctly interpreting the coefficients
  • For example, we can calculate the expected change
    in audit fees as company size increases from the
    25th to 75th percentiles
  • reg auditfees totalassets
  • sum totalassets if auditfees<., detail
  • gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
  • gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
  • sum fees_low fees_high

40
Class exercise 2c
  • Estimate the audit fee model in logs rather than
    in absolute values
  • Calculate the expected change in audit fees as
    company size increases from the 25th to 75th
    percentiles
  • Compare your results for economic significance to
    those we obtained when the fee model was
    estimated using the absolute values of fees and
    assets.
  • Hint: you will need to take the exponential of the predicted log of fees in order to make this comparison.

41
2.3 Correctly interpreting the coefficients
  • When evaluating the economic significance of a
    dummy variable coefficient, we usually do so
    using the values zero and one rather than
    percentiles
  • For example
  • reg lnaf big6
  • gen fees_nb6=exp(_b[_cons])
  • gen fees_b6=exp(_b[_cons]+_b[big6])
  • sum fees_nb6 fees_b6

42
2.3 Correctly interpreting the coefficients
  • Suppose we believe that the impact of a Big 6
    audit on fees depends upon the size of the
    company
  • Usually, we would quantify this impact using a
    range of values for lnta (e.g., as lnta increases
    from the 25th to the 75th percentile)

43
2.3 Correctly interpreting the coefficients
  • For example
  • gen big6_lnta=big6*lnta
  • reg lnaf big6 lnta big6_lnta
  • sum lnta if lnaf<. & big6<., detail
  • gen big6_low=_b[big6]+_b[big6_lnta]*r(p25)
  • gen big6_high=_b[big6]+_b[big6_lnta]*r(p75)
  • gen big6_mean=_b[big6]+_b[big6_lnta]*r(mean)
  • sum big6_low big6_high big6_mean

44
  • It is amazing how many studies give a misleading
    interpretation of the coefficients when using
    interaction terms. For example, Blackwell et al.

45
(No Transcript)
46
  • Class questions
  • Theoretically, how should auditing affect the
    interest rate that the company has to pay?
  • Empirically, how do we measure the impact of
    auditing on the interest rate using eq. (1)?

47
(No Transcript)
48
  • Class question: At what values of total assets (000s) is the effect of the Audit Dummy on the interest rate
  • negative, zero, positive?

49
(No Transcript)
50
  • Class questions
  • What is the mean value of total assets within
    their sample?
  • How does auditing affect the interest rate for
    the average company in their sample?

51
(No Transcript)
52
  • Verify that the above claim is true.
  • Suppose Blackwell et al. had reported the impact for a firm with $11m in assets and another firm with $15m in assets.
  • How would this have changed the conclusions
    drawn?
  • Do you think the paper would have been published
    if the authors had made this comparison?

53
(No Transcript)
54
2.4 Examining the residuals
  • Go to http://ihome.ust.hk/~accl/Phd_teaching.htm
  • Download anscombe.dta to your hard drive (use "C:\phd\anscombe.dta", clear)
  • Run the following regressions
  • reg y1 x1
  • reg y2 x2
  • reg y3 x3
  • reg y4 x4
  • Note that the output from these regressions is
    virtually identical
  • intercept = 3.0 (t-stat = 2.67)
  • x coefficient = 0.5 (t-stat = 4.24)
  • R-squared = 66%

55
Class exercise 2d
  • If you did not know about regression assumptions
    or regression diagnostics you would probably stop
    your analysis at this point, concluding that you
    have a good fit for all four models.
  • In fact, only one of these four models is well
    specified.
  • Draw scatter graphs for each of these four
    associations (e.g., twoway (scatter y1 x1) (lfit
    y1 x1)).
  • Of the four models, which do you think is the
    well specified one?
  • Draw scatter graphs for the residuals against the x variable for each of the four regressions: is there a pattern?
  • Which of the OLS assumptions are violated in
    these four regressions?

56
2.4 Examining the residuals
  • Unfortunately, it is common among researchers to
    judge whether a model is well-specified solely
    in terms of its explanatory power (i.e., the
    R-squared).
  • Many researchers fail to report other types of
    diagnostic tests
  • is there significant heteroscedasticity?
  • is there any pattern to the residuals?
  • are there any problems of outliers?

57
2.4 Examining the residuals
  • For example, many audit fee studies claim that
    their models are well-specified because they have
    high R2
  • Carson et al. (2003)

58
2.4 Examining the residuals
  • Gu (2007) points out that
  • econometricians consider R2 values to be
    relatively unimportant (accounting researchers
    put far too much emphasis on the magnitude of the
    R2)
  • regression R2s should not be compared across
    different samples
  • in contrast there is a large accounting
    literature that uses R2s to determine whether the
    value relevance of accounting information has
    changed over time

59
  • It is easy to show that the same economic model
    can yield very different R2 depending on how the
    variables are transformed
  • Using either eq. (1) or (2), we will obtain
    exactly the same coefficient estimates because
    the economic model is the same
  • If eq. (1) is well-specified, so also is eq. (2)
  • If eq. (1) is mis-specified, so also is eq. (2)
  • However, the R2 of eq. (1) will be very different
    from the R2 of eq. (2)

60
  • Example
  • use "C\phd\Fees.dta", clear
  • gen lnafln(auditfees)
  • gen lntaln(totalassets)
  • sort companyid yearend
  • by companyid gen lnaf_laglnaf_n-1
  • egen missrmiss(lnaf lnta lnaf_lag)
  • gen chlnaflnaf-lnaf_lag
  • reg lnaf lnta lnaf_lag if miss0
  • reg chlnaf lnta lnaf_lag if miss0
  • The lnta coefficients are exactly the same in the
    two models.
  • The lnaf_lag coefficient in eq. (2) equals the
    lnaf_lag coefficient in eq. (1) minus one.
  • The R2 is much higher in eq. (1) than eq. (2).
  • The high R2 in eq. (1) does not imply that the
    model is well-specified.
  • The low R2 in eq. (2) does not imply that the
    model is mis-specified.
  • Either both equations are well-specified or they
    are both mis-specified.
  • The R2 tells us nothing about whether our
    hypothesis about the determinants of Y is
    correct.

61
2.4 Examining the residuals
  • Instead of relying only on the R2, an examination of the residuals can help us to identify whether the model is well specified. For example, compare the audit fee model which is not logged
  • reg auditfees totalassets
  • predict res1, resid
  • twoway (scatter res1 totalassets, msize(tiny))
    (lfit res1 totalassets)
  • With the logged audit fee model
  • reg lnaf lnta
  • predict res2, resid
  • twoway (scatter res2 lnta, msize(tiny)) (lfit
    res2 lnta)
  • Notice that the residuals are more spherical, displaying less of an obvious pattern, in the logged model.

62
2.4 Examining the residuals
  • In order to obtain unbiased standard errors we
    have to assume that the residuals are normally
    distributed
  • We can test this using a histogram of the
    residuals
  • hist res1
  • this does not give us what we need because there
    are severe outliers
  • sum res1, detail
  • hist res1 if res1>-22 & res1<208, normal xlabel(-25(25)210)
  • hist res2
  • sum res2, detail
  • hist res2 if res2>-2 & res2<1.8, normal xlabel(-2(0.5)2)
  • The residuals are much closer to the assumed
    normal distribution when the variables are
    measured in logs
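  • If you want a formal test to complement the histograms, STATA's sktest performs a skewness-kurtosis test of normality on a variable (a suggestion beyond what the slide shows; any normality test would serve)
  • sktest res1
  • sktest res2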

63
(No Transcript)
64
Class exercise 2e
  • Following Pong and Whittington (1994) estimate
    the raw value of audit fees as a function of raw
    assets and assets squared
  • Examine the residuals
  • Do you think this model is better specified than
    the one in logs?

65
2.5 Multiple regression
  • Researchers use multiple regression when they
    believe that Y is affected by multiple
    independent variables
  • Y = a0 + a1 X1 + a2 X2 + e
  • Why is it important to control for multiple
    factors that influence Y?

66
2.5 Multiple regression
  • Suppose the true model is
  • Y = a0 + a1 X1 + a2 X2 + e
  • where X1 and X2 are uncorrelated with the error, e
  • Suppose the OLS model that we estimate is
  • Y = a0 + a1 X1 + u
  • where u = a2 X2 + e
  • OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
  • Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.

67
2.5 Multiple regression
  • If X1 is correlated with X2 the OLS estimate of
    a1 is biased.
  • The magnitude of the bias depends upon the
    strength of the correlation between X1 and X2.
  • Of course, we often do not know whether the model
    we estimate is the true model
  • In other words, we are unsure whether there is an
    omitted variable (X2) that affects Y and that is
    correlated with our variable of interest (X1)
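  • A minimal simulation makes the bias concrete (hypothetical data; run it in a fresh session because clear drops the fees data). The true a1 is 1, but because x1 and x2 are positively correlated, omitting x2 pushes the estimate of a1 upwards
  • clear
  • set obs 1000
  • set seed 1
  • gen x2=rnormal()
  • gen x1=0.5*x2+rnormal()
  • gen y=1+x1+x2+rnormal()
  • reg y x1 x2
  • reg y x1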

68
2.5 Multiple regression
  • We can judge whether or not there is likely to be
    a correlated omitted variable problem using
  • theory
  • prior empirical studies
  • our understanding of the data generation process

69
2.5 Multiple regression
  • Theory
  • Does theory suggest that X2 affects Y?
  • Unfortunately, theory often fails to give a clear
    guide to empirical researchers as to which
    variables need to be controlled for

70
2.5 Multiple regression
  • The data generation process (DGP)
  • Many researchers go wrong simply because they
    fail to understand the underlying process that
    generates the data (e.g., they fail to understand
    the institutional details).
  • Let me give you an example, from my research on
    the reports issued by the Public Company
    Accounting Oversight Board (PCAOB)
  • The PCAOB has been issuing reports about weaknesses that they found in audit firms' work

71
  • The dependent variable equals the number of weaknesses disclosed in the PCAOB's report about the audit firm
  • Ln(CLIENTS) is a continuous measure of audit
    firm size (the log of the number of companies
    audited by the audit firm)
  • BIG is a dummy variable capturing audit firm size
  • The audit firm size coefficients are positive and
    highly significant
  • The PCAOB has been reporting more weaknesses at
    the larger audit firms

72
2.5 Multiple regression
  • A working paper has also reported this positive
    relation and concluded that the larger audit
    firms have been offering lower quality audits
  • this conclusion contradicts evidence from many
    other studies that find larger audit firms
    provide higher quality audits
  • The researchers made this mistake because they
    failed to understand the data generation
    process for the weaknesses disclosed in PCAOB
    reports.

73
2.5 Multiple regression
  • To understand the data generation process, it is often important to understand how the data originate: http://www.pcaobus.org/Inspections/Public_Reports/index.aspx

74
(No Transcript)
75
  • Lennox and Pittman (2008)

76
(No Transcript)
77
  • In Col. (1) there is a severe omitted variable problem because
  • the PCAOB's sample size is not included as a control variable.
  • the PCAOB's sample size is highly correlated with audit firm size.
  • In Col. (3), we see there is no significant relation between audit firm size and the number of reported weaknesses, after we control for the size of the PCAOB's sample.
  • An understanding of the data generation process is vital if we are to avoid drawing invalid conclusions.
  • A PCAOB report discloses all serious weaknesses found in the inspectors' sample.
  • There is a biased association between audit firm size and the number of reported weaknesses if the size of the PCAOB's sample is not controlled for.

78
2.5 Multiple regression
  • What does it mean to control for the effect of
    a variable?
  • In a multiple regression, the coefficient a1
    captures the effect of a one-unit increase in X1
    on Y, after controlling for (i.e., holding
    constant) X2
  • Y = a0 + a1 X1 + a2 X2 + e
  • I will now explain this concept in more detail
    using an empirical example

79
2.5 Multiple regression
  • We are going to look at the effect of non-audit
    fees on audit fees after controlling for the
    effect of company size
  • lnaf = a0 + a1 lnta + a2 lnnaf + e
  • gen lnnaf=ln(1+nonauditfees)
  • capture drop miss
  • egen miss=rmiss(lnaf lnta lnnaf)
  • First I estimate the following model and calculate the residuals
  • lnaf = a0 + a1 lnta + res1
  • reg lnaf lnta if miss==0
  • predict res1 if miss==0, resid
  • note that res1 reflects the part of lnaf that has nothing to do with lnta (res1 = a2 lnnaf + e)
  • Next I estimate the following model and calculate the residuals
  • lnnaf = b0 + b1 lnta + res2
  • reg lnnaf lnta if miss==0
  • predict res2 if miss==0, resid
  • note that res2 reflects the part of lnnaf that has nothing to do with lnta (by construction res2 is uncorrelated with lnta)

80
2.5 Multiple regression
  • Finally I estimate the following two models
  • lnaf = a0 + a1 lnta + a2 lnnaf + e (1)
  • res1 = a2 res2 + e (2)
  • reg lnaf lnta lnnaf if miss==0
  • reg res1 res2 if miss==0
  • Note that the coefficient and t-statistic on res2 in eq. (2) are exactly the same as the coefficient and t-statistic for lnnaf in eq. (1)
  • What does all this mean?
  • The coefficient a2 in eq. (1) captures the impact of lnnaf on lnaf after controlling for the fact that
  • (1) lnta affects lnaf (a1 > 0 in eq. (1)), and
  • (2) there is a significant positive correlation between lnnaf and lnta (b1 > 0)

81
2.5 Multiple regression
  • Note that if there had been zero correlation between lnnaf and lnta (b1 = 0), the coefficient a1 would be the same in both the simple and multiple regression models
  • lnaf = a0 + a1 lnta + a2 lnnaf + e
  • lnaf = a0 + a1 lnta + res1
  • The reason is that res1 would be uncorrelated with lnta if there is zero correlation between lnnaf and lnta (b1 = 0).
  • In other words the coefficient a1 is estimated
    with bias only if res1 is significantly
    correlated with lnta.

82
2.5 Multiple regression
  • This reinforces the intuition that it is only
    necessary to control for those variables that
  • affect Y, AND
  • are correlated with the independent variable
    whose coefficient we want to estimate
  • For example, if we want to estimate the effect of
    lnnaf on lnaf, we must control for lnta because
  • lnta affects lnaf, and
  • lnta is correlated with lnnaf
  • Note that both of these conditions are necessary
    for there to be an omitted variable problem.

83
2.5 Multiple regression
  • Previously, when we were using simple regression
    with one independent variable, we checked whether
    there was a pattern between the residuals and the
    independent variable
  • lnaf = a0 + a1 lnta + res1
  • twoway (scatter res1 lnta) (lfit res1 lnta)
  • When we are using multiple regression, we want to
    test whether there is a pattern between the
    residuals and the right hand side of the equation
    as a whole
  • The right hand side of the equation as a whole
    is the same thing as the predicted value of the
    dependent variable

84
2.5 Multiple regression
  • So we should examine whether there is a pattern
    between the residuals and the predicted values of
    the dependent variable
  • For example, let's estimate a model where audit fees depend on non-audit fees, company size, audit firm size, and whether the company is listed on a stock market
  • gen listed=0
  • replace listed=1 if companytype==2 | companytype==3 | companytype==5
  • reg lnaf lnnaf lnta big6 listed
  • predict lnaf_hat
  • predict lnaf_res, resid
  • twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res
    lnaf_hat)

85
2.5 Multiple regression (rvfplot)
  • In fact, those nice guys at STATA have given us a
    command which enables us to short-cut having to
    use the predict command for calculating the
    residuals and the fitted values
  • reg lnaf lnnaf lnta big6 listed
  • rvfplot
  • rvf stands for residuals versus fitted

86
2.6 Heteroscedasticity (hettest)
  • The OLS techniques for estimating standard errors
    are based on an assumption that the variance of
    the errors is the same for all values of the
    independent variables (homoscedasticity)
  • In many cases, the homoscedasticity assumption is
    clearly violated. For example
  • reg auditfees nonauditfees totalassets big6
    listed
  • rvfplot
  • the homoscedasticity assumption can be tested
    using the hettest command after we do the
    regression
  • reg auditfees nonauditfees totalassets big6
    listed
  • hettest
  • Heteroscedasticity does not bias the coefficient
    estimates but it does bias the standard errors of
    the coefficients

87
2.6 Heteroscedasticity (robust)
  • Heteroscedasticity is often caused by using a
    dependent variable that is not symmetric
  • for example, the auditfees variable is highly skewed due to the fact that it has a lower bound of zero
  • much of the heteroscedasticity can often be removed by transforming the dependent variable (e.g., use the log of audit fees instead of the raw values)
  • When you find that there is heteroscedasticity,
    you need to adjust the standard errors using the
    Huber/White/sandwich estimator
  • In STATA it is easy to do this adjustment using
    the robust option
  • reg auditfees nonauditfees totalassets big6
    listed, robust
  • Compare the adjusted and unadjusted results
  • reg auditfees nonauditfees totalassets big6
    listed
  • note that the coefficients are exactly the same
  • the t-statistics on the independent variables are
    much smaller when the standard errors are
    adjusted for heteroscedasticity

88
Class exercise 2f
  • Estimate the audit fee model in logs rather than absolute values
  • Using rvfplot, assess whether the variance of the residuals appears to be non-constant
  • Using hettest, provide a formal test for
    heteroscedasticity
  • Compare the coefficients and t-statistics when
    you estimate the standard errors with and without
    adjusting for heteroscedasticity.

89
2.7 Correlated errors
  • The OLS techniques for estimating standard errors
    are based on an assumption that the errors are
    not correlated
  • This assumption is typically violated when we use
    repeated annual observations on the same
    companies
  • The residuals of a given firm are correlated
    across years (time series dependence)

90
Time-series dependence
  • Time-series dependence is nearly always a problem
    when researchers use panel data
  • Panel data = data that are pooled for the same companies across time
  • In panel data, there are likely to be unobserved
    company-specific characteristics that are
    relatively constant over time

91
  • Let's start with a simple regression model where the errors are assumed to be uncorrelated: yit = a0 + a1 xit + eit
  • We now relax the assumption of independent errors by assuming that the error term has an unobserved company-specific component that does not vary over time and an idiosyncratic component that is unique to each company-year observation: eit = ci + uit
  • Similarly, we can assume that the X variable has a company-specific component that does not vary over time and an idiosyncratic component: xit = mi + vit

92
Time-series dependence
  • In this case, the OLS standard errors tend to be
    biased downwards and the magnitude of this bias
    is increasing in the number of years within the
    panel.
  • To understand the intuition, consider the extreme
    case where the residuals and independent
    variables are perfectly correlated across time.
  • In this case, each additional year provides no
    additional information and will have no effect on
    the true standard error
  • However, under the incorrect assumption of
    time-series independence, it is assumed that each
    additional year provides additional observations
    and the estimated standard errors will shrink
    accordingly and incorrectly
  • This problem can be avoided by adjusting the
    standard errors for the clustering of yearly
    observations across a given company

93
Time-series dependence
  • To understand all this, it is helpful to review
    the following example
  • First, I estimate the model using just one
    observation for each company (in the year 1998)
  • gen fye=date(yearend, "mdy")
  • gen year=year(fye)
  • drop if year!=1998
  • sort companyid
  • drop if companyid==companyid[_n-1]
  • reg lnaf lnta big6 listed, robust

94
Time-series dependence
  • Now I create a dataset in which each observation
    is duplicated
  • Each duplicated observation provides no
    additional information and will have no effect on
    the true standard error but it will reduce the
    estimated standard error (i.e., the estimated
    standard error will be biased downwards)
  • save "C\phd\Fees98.dta", replace
  • append using "C\phd\Fees98.dta"
  • reg lnaf lnta big6 listed, robust
  • Notice that the coefficient estimates in the
    duplicated dataset are exactly the same as in the
    dataset that had only observation per company.
  • However, the estimated standard errors are
    smaller and the t-statistics are larger in the
    duplicated dataset because we are using twice as
    many observations.

95
Time-series dependence (robust cluster())
  • We can obtain correct standard errors in the
    duplicate dataset using the robust cluster()
    option which adjusts the standard errors for
    clustering of observations (here they are
    duplicated) for each company
  • reg lnaf lnta big6 listed, robust cluster(companyid)
  • The t-statistics here are exactly the same as when the model is estimated using just one observation per company.

96
Time-series dependence
  • In reality the observations of a given company
    are not exactly the same from one year to the
    next (i.e., they are not exact duplicates).
  • However, the observations of a given company
    often do not change much from one year to the
    next.
  • For example, a company's size and the fees that it pays may not change much over time (i.e., there is a strong unobserved company-specific component to the variables).
  • Failing to account for this in panel data tends
    to overstate the magnitude of the t-statistics.

97
Time-series dependence
  • It is easy to demonstrate that the residuals of a
    given company tend to be very highly correlated
    over time
  • First, start again with the original data
  • use "C\phd\Fees.dta", clear
  • gen fyedate(yearend, "mdy")
  • gen yearyear(fye)
  • gen lnafln(auditfees)
  • gen lntaln(totalassets)
  • save "C\phd\Fees1.dta", replace
  • Estimate the fee model and obtain the residuals
    for each company-year observation
  • reg lnaf lnta
  • predict res, resid

98
Time-series dependence
  • Reshape the data so that we have each company as
    a row and there are separate variables for each
    yearly set of residuals
  • keep companyid year res
  • sort companyid year
  • drop if companyid==companyid[_n-1] & year==year[_n-1]
  • reshape wide res, i(companyid) j(year)
  • browse
  • Examine the correlations between the residuals of
    a given company
  • pwcorr res1998-res2002

99
Time-series dependence
  • We can easily control for this problem of
    time-series dependence using the robust cluster()
    option
  • use "C\phd\Fees1.dta", clear
  • reg lnaf lnta, robust cluster(companyid)
  • Note that if we do not control for time-series
    dependence, the t-statistic is biased upwards
    even though we have controlled for the
    heteroscedasticity
  • reg lnaf lnta, robust
  • If we do not control for heteroscedasticity, the
    upward bias would be even worse
  • reg lnaf lnta
  • TOP TIP: Whenever you use panel data you should get into the habit of using the robust cluster() option, otherwise your significant results from pooled regressions may be spurious.

100
2.8 Multicollinearity
  • Perfect collinearity occurs if there is a perfect
    linear relation between multiple variables of the
    regression model.
  • For example, our dataset covers a sample period
    of five years (1998-2002). Suppose we create a
    dummy for each year and include all five year
    dummies in the fee regression.
  • tabulate year, gen(year_)
  • reg lnaf year_1 year_2 year_3 year_4 year_5
  • Note that STATA excludes one of the year dummies when estimating the model. Why is that?

101
2.8 Multicollinearity
  • The reason is that a linear combination of the
    year dummies equals the constant in the model
  • year_1 + year_2 + year_3 + year_4 + year_5 = 1
  • where 1 is a constant
  • The model can only be estimated if one of the
    year dummies or the constant is excluded
  • reg lnaf year_1 year_2 year_3 year_4 year_5,
    nocons
  • STATA automatically throws away one of the year
    dummies so that the model can be estimated

102
Class exercise 2g
  • Go to http://ihome.ust.hk/~accl/Phd_teaching.htm
  • Download international.dta to your hard drive
    and open in STATA
  • You are interested in testing whether legal
    enforcement affects the importance of equity
    markets to the economy
  • Create dummy variables for each country in your
    dataset
  • Run a regression where importanceofequitymarket
    is the dependent variable and legalenforcement is
    the independent variable
  • How many country dummies can be included in your
    regression? Explain.
  • Are your results for the legalenforcement
    coefficient sensitive to your choice for which
    country dummies to exclude? Explain.

103
2.8 Multicollinearity
  • We have seen that when there is perfect
    collinearity between independent variables, STATA
    will have to exclude one of them.
  • For example, a linear combination of all year
    dummies equals the constant in the model
  • year_1 + year_2 + year_3 + year_4 + year_5 = constant
  • so we cannot estimate a model that includes all
    the year dummies and the constant term
  • Even if the independent variables are not
    perfectly collinear, there can still be a problem
    if they are highly correlated

104
2.8 Multicollinearity
  • Multicollinearity can cause
  • the standard errors of the coefficients to be
    large (i.e., the coefficients are not estimated
    precisely)
  • the coefficient estimates can be highly unstable
  • Example
  • use "C\phd\Fees.dta", clear
  • gen lnafln(auditfees)
  • gen lntaln(totalassets)
  • gen lnta1lnta
  • reg lnaf lnta lnta1
  • Obviously, you must exclude one of these
    variables because lnta and lnta1 are perfectly
    correlated

105
2.8 Multicollinearity
  • Let's see what happens if we change the value of lnta1 for just one observation
  • list lnta if _n==1
  • replace lnta1=8 if _n==1
  • reg lnaf lnta
  • reg lnaf lnta1
  • reg lnaf lnta lnta1
  • Notice that the lnta and lnta1 coefficients are
    highly significant when included separately but
    they are insignificant when included together
  • The reason of course is that, by construction,
    lnta and lnta1 are very highly correlated
  • pwcorr lnta lnta1, sig

106
2.8 Multicollinearity
  • As another example, we can see that the
    coefficients can flip signs as a result of high
    collinearity
  • sort lnaf lnta
  • replace lnta1=10 if _n<100
  • reg lnaf lnta
  • reg lnaf lnta1
  • reg lnaf lnta lnta1
  • pwcorr lnta lnta1, sig

107
2.8 Multicollinearity (vif)
  • Variance-inflation factors (VIF) can be used to
    assess whether multicollinearity is a problem for
    a particular independent variable
  • The VIF takes account of the variable's correlations with all other independent variables on the right hand side
  • The VIF shows the increase in the variance of the coefficient estimate that is attributable to the variable's correlations with other independent variables in the model
  • reg lnaf lnta big6 lnta1
  • vif
  • reg lnaf lnta big6
  • vif
  • Multicollinearity is generally regarded as high
    (very high) if the VIF is greater than 10 (20)
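  • To see where the VIF comes from, regress one independent variable on the others from the same model; the VIF equals 1/(1-R2) from that auxiliary regression. For the lnta example above:
  • reg lnta big6 lnta1
  • display 1/(1-e(r2))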

108
2.9 Outlying observations
  • We have already seen that outlying observations
    heavily influence the results of OLS models

109
2.9 Outlying observations
  • In simple regression (with just one independent
    variable), it is easy to spot outliers from a
    scatterplot of Y on X
  • For example, a company is an outlier if it is
    very small in terms of size and it pays an audit
    fee that is very high
  • In multiple regression (where there are multiple
    X variables), some observations may be outliers
    even though they do not show up on the
    scatterplot
  • Moreover, observations that show up as outliers
    on the scatterplot might actually be normal once
    we control for other factors in the multiple
    regression
  • For example the small company may pay a high
    audit fee because other characteristics of that
    company make it a complex audit.

110
2.9 Outlying observations (cooksd)
  • We can calculate the influence of each observation on the estimated coefficients using Cook's D
  • Values of Cook's D that are higher than 4/N are considered large, where N is the number of observations used in the regression
  • reg lnaf lnta big6
  • predict cook, cooksd
  • sum cook, detail
  • gen max=4/e(N)
  • e(N) is the number of observations in the most
    recent regression model (the estimation sample
    size is stored by STATA as an internal result)
  • count if cook>max & cook<.

111
2.9 Outlying observations (cooksd)
  • We can discard the observations with values of Cook's D larger than this cutoff and re-estimate the model
  • reg lnaf lnta big6 if cook<max
  • For example, Ke and Petroni (2004, p.906) explain that they use Cook's D to exclude outliers and the standard errors are adjusted for heteroscedasticity and time-series dependence (they are using a panel dataset)

112
2.9 Outlying observations (winsor)
  • Rather than drop outlying observations, some
    researchers choose to winsorize the data
  • Winsorizing replaces the extreme values of a variable with the values at certain percentiles (e.g., the top and bottom 1%)
  • You can winsorize variables in STATA using the
    winsor command
  • winsor lnaf, gen(wlnaf) p(0.01)
  • winsor lnta, gen(wlnta) p(0.01)
  • sum lnaf wlnaf lnta wlnta, detail
  • reg wlnaf wlnta big6
  • A disadvantage with winsorizing is that the researcher is assuming that outliers lie only at the extremes of the variable's distribution.

113
2.10 Median regression
  • Median regression is quite similar to OLS but it
    can be more reliable especially when we have
    problems of outlying observations
  • Recall that in OLS, the coefficient estimates are
    chosen to minimize the sum of the squared
    residuals

114
2.10 Median regression
  • In median regression, the coefficient estimates
    are chosen to minimize the sum of the absolute
    residuals
  • Squaring the residuals in OLS means that large
    residuals are more heavily weighted than small
    residuals.
  • Because the residuals are not squared in median
    regression, the coefficient estimates are less
    sensitive to outliers

115
2.10 Median regression
  • Median regression takes its name from its
    predicted values, which are estimates of the
    conditional median of the dependent variable.
  • In OLS, the predicted values are estimates of the
    conditional mean of the dependent variable.
  • The predicted values of both regression
    techniques therefore measure the central tendency
    (i.e., mean or median) of the dependent variable.

116
2.10 Median regression
  • STATA treats median regression as a special case
    of quantile regression.
  • In quantile regression, the coefficients are estimated so that the weighted sum of the absolute residuals, Σ wi |ei|, is minimized
  • where the weights wi differ for positive and negative residuals (for quantile q, positive residuals are weighted by 2q and negative residuals by 2(1 - q), consistent with the examples on the next slide)

117
2.10 Median regression (qreg)
  • Weights can be different for positive and
    negative residuals. If positive and negative
    residuals are weighted equally, you get a median
    regression. If positive residuals are weighted by
    the factor 1.5 and negative residuals are
    weighted by the factor 0.5, you get a 3rd
    quartile regression, etc.
  • In STATA you perform quantile regressions using
    the qreg command
  • qreg lnaf lnta big6
  • reg lnaf lnta big6
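  • Other quantiles are obtained with qreg's quantile() option; for example, a 3rd quartile regression of the same fee model would be
  • qreg lnaf lnta big6, quantile(0.75)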

118
Class exercise 2h
  • Open the anscombe.dta file
  • Do a scatterplot of y3 and x3
  • Do an OLS regression of y3 on x3 for the full
    sample.
  • Calculate Cook's D to test for the presence of outliers.
  • Do an OLS regression of y3 on x3 after dropping
    any outliers.
  • Do a median regression of y3 on x3 for the full
    sample.

119
2.10 Median regression
  • Basu and Markov (2004) compare the results of OLS
    and median regressions to determine whether
    analysts who issue earnings forecasts attempt to
    minimize
  • the sum of squared forecast errors (OLS), or
  • the sum of absolute forecast errors (median)

120
(No Transcript)
121
  • The LAD estimator is simply the median regression
    command that we saw earlier (qreg)

122
  • Basu and Markov (2004) conclude that analysts' forecasts may accurately reflect their rational expectations
  • Their study is a good example of how we can make
    an important contribution to the literature if we
    use an estimation technique that is not widely
    used by accounting researchers

123
2.11 Looping
  • Looping can be very useful when we want to carry
    out the same operations many times
  • Looping significantly reduces the length of our
    do files because it means we do not have to state
    the same commands many times
  • When software designers use the word
    programming they mean they are creating a new
    command
  • Usually we do not need new commands because what
    we need has already been written for us in STATA
  • However, programming is necessary if we want to
    use looping

124
2.11 Looping (program, forvalues)
  • Example
  • program ten
  • forvalues i = 1(1)10 {
  • display `i'
  • }
  • end
  • To run this program we simply type "ten"

125
2.11 Looping (program, forvalues)
  • What's happening?
  • "program ten": we are telling STATA that the name of our program is "ten" and that we are starting to write a program
  • "end": we are telling STATA that we have finished writing the program
  • { }: everything inside these curly brackets is part of a loop
  • "forvalues i": the program will perform the commands inside the brackets for each value of i (i is called a local macro)
  • "1(1)10": i goes from one to ten, increasing by the value one every time
  • "display `i'": this is the command inside the brackets and STATA will execute this command for each value of i from one to ten. Note that ` is at the top left of your keyboard whereas ' is next to the Enter key

126
2.11 Looping (program, forvalues)
  • The macro i has single quotes around it. These quotes tell Stata to replace the macro with the value that it holds before executing the command. So the first time through the loop, i holds the value of 1. Stata first replaces `i' with 1, and then it executes the command
  • display 1
  • The next time through, i holds the value of 2. Stata first replaces `i' with the value 2, and then it executes the command
  • display 2
  • This process continues through the values 3,
    4,...,10.
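  • (A side note, assuming a reasonably recent version of STATA: a forvalues loop can also be run directly from a do file, without wrapping it in a program)
  • forvalues i = 1(1)10 {
  • display `i'
  • }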

127
2.11 Looping (capture)
  • Suppose we make a mistake in the program or we want to modify the program in some way
  • We first need to drop this program from STATA's memory
  • program drop ten
  • we can then go on to write a new program called "ten"
  • It is good practice to drop any program that might exist with the same name before writing a new program
  • capture program drop ten

128
2.11 Looping
  • Our program is now
  • capture program drop ten
  • program ten
  • forvalues i = 1(1)10 {
  • display `i'
  • }
  • end
  • To run this program we simply type "ten"

129
Another example
  • Earnings management studies often estimate
    abnormal accruals using the Jones model
  • ACCRUALSit = a0k (1/ASSETit-1) + a1k (ΔSALESit / ASSETit-1) + a2k (PPEit / ASSETit-1) + uit
  • ACCRUALSit = change in non-cash current assets minus change in non-debt current liabilities, scaled by lagged assets.
  • The k sub-scripts indicate that the model is
    estimated separately for each industry.
  • Industries are identified using Standard
    Industrial Classification (SIC) codes

130
Another example
  • The number of industries
  • 10 using one digit codes,
  • 100 using two digit codes,
  • 1,000 using three digit codes, etc
  • Your do file could be very long if you had
    separate lines for each industry
  • Estimate abnormal accruals for SIC 1
  • Estimate abnormal accruals for SIC 2
  • Estimate abnormal accruals for SIC 3
  • ..
  • Estimate abnormal accruals for SIC 10, etc.

131
Another example
  • Your do file will be much shorter if you use the
    looping technique
  • capture program drop ab_acc
  • program ab_acc
  • forvalues i = 1(1)10 {
  • (insert the commands that you want to execute for each industry SIC code)
  • }
  • end

132
Another example
  • Go to http://ihome.ust.hk/~accl/Phd_teaching.htm
  • open accruals.dta in STATA and generate the variables we need
  • the regressions will be estimated at the one-digit level
  • use "C:\phd\accruals.dta", clear
  • gen one_sic=int(sic/1000)
  • gen ncca=current_assets-cash
  • gen ndcl=current_liabilities-debt_in_current_liabilities
  • sort cik year
  • gen ch_ncca=ncca-ncca[_n-1] if cik==cik[_n-1]
  • gen ch_ndcl=ndcl-ndcl[_n-1] if cik==cik[_n-1]
  • gen accruals=(ch_ncca-ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
  • gen lag_assets=assets[_n-1] if cik==cik[_n-1]
  • gen ppe_scaled=ppe/assets[_n-1] if cik==cik[_n-1]
  • gen chsales_scaled=(sales-sales[_n-1])/assets[_n-1] if cik==cik[_n-1]

133
Another example
  • gen ab_acc=.
  • capture program drop ab_acc
  • program ab_acc
  • forvalues i = 0(1)9 {
  • capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
  • capture predict ab_acc`i' if one_sic==`i', resid
  • replace ab_acc=ab_acc`i' if one_sic==`i'
  • capture drop ab_acc`i'
  • }
  • end
  • ab_acc

134
Explaining this program
  • forvalues i = 0(1)9 {
  • the one_sic variable takes values from 0 to 9
  • capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
  • the regressions are run at the one digit level because some industries have insufficient observations at the two-digit level
  • capture predict ab_acc`i' if one_sic==`i', resid
  • For each industry, I create a separate abnormal accrual variable (ab_acc1 if industry = 1, ab_acc2 if industry = 2, etc.).
  • If this line had been "capture predict ab_acc if one_sic==`i', resid" we would not have been able to go beyond the first industry because ab_acc was already defined
  • replace ab_acc=ab_acc`i' if one_sic==`i'
  • The overall abnormal accrual variable (ab_acc) equals ab_acc1 if industry = 1, equals ab_acc2 if industry = 2, etc.
  • before starting the program I had to "gen ab_acc=." in order for this replace command to work
  • capture drop ab_acc`i'
  • I drop ab_acc1, ab_acc2, etc. because I only need the ab_acc variable.

135
Conclusion
  • You should now have a good understanding of
  • how OLS models work,
  • how to interpret the results of OLS models,
  • how to find out whether the assumptions of OLS
    are violated,
  • how to correct the standard errors for
    heteroscedasticity, time-series dependence and
    cross-sectional dependence,
  • how to handle problems of outliers

136
Conclusion
  • So far, we have been discussing the case where
    our dependent variable is continuous (e.g., lnaf)
  • When the dependent variable is not continuous, we
    cannot use OLS (or quantile) regression.
  • The next topic considers how to estimate models
    where our dependent variable is not continuous.