Title: Quantitative Methods
1- Quantitative Methods Week 5
- Linear Regression Analysis
Roman Studer Nuffield College roman.studer_at_nuffiel
d.ox.ac.uk
2Homework (II)
- Many factors (variables) are potentially
associated with the drop in crime rates in the
US. Where do you find correlations between a
variable and the falling crime rate? Which
variables are positively, which ones negatively
correlated with the crime rate variable? Where do
you not find any correlation? Explanations?
Concept/Variable
Measured Variable
Correlation
Strong economy
Unemployment rate, poverty level
Reliance on prisons
Number of prisoners
Education
Educational Attainment
Legalizing abortion
Number of abortions
- How did you solve the lag problem with the
abortion variable?
3Simple Linear Regression Introduction
- As with correlation, we are still looking at the
relationship between two (ratio level) variables,
but now we do not treat them symmetrically
anymore, but make a distinction between the two
influences
y x
Dependent variable Independent variable
Explained variable Explanatory variable
- As a consequence, the question that can be
anwered with correlation analysis and regression
analysis are different
Correlation Regression
Is there an association between X and Y? How exactly does Y change if X changes?
How strong? Positive or negative? How much of the change in Y is explained?
4Simple Linear Regression Introduction (II)
- Therefore, we move from issues of mere
association to issues of causality - HOWEVER A simple linear regression cannot
establish causation! - We assume that causality runs from x to y
- Sometimes this assumption is questionable, then
more sophisticated models and tests are needed - Simultaneous equation models
- Two-stage least square regressions
- Causality tests
- So we have to be careful when making such
assumptions. Lets look at some pairs of
variables. - Which is the explanatory variable and which the
dependent variable? - In which cases might you expect a mutual
interaction between the two variables? - Height and weight
- Birth rate and marriage rate
- Rainfall and crop yield
- Rate of unemployment and level of relief
expenditure in British parishes - Government spending on welfare programmes and the
voting share of left-liberal parties - CO2 emissions and global warming
5Simple Linear Regression Introduction (III)
- Straight line (linear) relationship between two
variables - The relationship between two variables can take a
nonlinear form - Kuznets (1955) hypothesised that the relationship
between the level of economic development (x) and
income inequality (y) takes the form of a
nonlinear (inverted-U shaped) form -
Income Inequality
Economic Development
- Multiple or multivariate regression includes two
or more explanatory variables
6The Equation of a Straight Line
- The equation of a straight line Y a bX
Example y23x
7The Equation of a Straight Line (II)
- Equation of a straight line YabX
- Y and X are the dependent and the independent
variable respectively and y1, y2, , yn and x1,
x2, , xn their values - a is the intercept. It determines the level at
which the straight line crosses the vertical
axis, i.e. it gives the value of y when x0 - b measures the slope of the line
- Positive relationship bgt0
- No relationship b0
- Negative Relationship blt0
- If x increases by one unit, y will increase by b
units
8Fitting the Regression Line
- How do we find the line, which is the best fit,
i.e. the line that describes the linear
relationship between X and Y the best - This is an issue as in the real world,
relationships between two variables never follow
a completely linear pattern
9Fitting the Regression Line (II)
- The regression line predicts the values of Y
based on the values of X. Thus, the best line
will minimise the deviation between the predicted
and the actual values (the error, e)
10Fitting the Regression Line (III)
- However, to avoid the problem that positive and
negative deviations cancel out, we look at
squared deviations (Yi Yi)2 - Also, as we want to minimise the total errors, we
are interested in minimising the sum of all
squared errors - Therefore, the regression line it the line that.
- minimizes the sum of the squares of the vertical
deviation of all pairs of the values of X and Y
from the regression line
- This estimation procedure is known as ordinary
least squares regression (OLS). Its formal
derivation yields the two formulae needed to
calculate the regression line, i.e. a formula for
the intercept (a) and a formula for the slope (b)
11Fitting the Regression Line (IV)
- With these two formulae, we can actually easily
calculate the regression line for small datasets
by hand. However, for large datasets and when we
have more than one explanatory variable, we use
the Stata to do it for us
12The Goodness of Fit
- Once we get the best fit regression line, we
still want to know how good a fit this line
really is - How much of the variation in Y is explained by
this regression line? - The measure to describe the explanatory power of
a regression is the coefficient of determination
r2, which is equal to the square of the
correlation coefficient - It is a measure of the success with which the
movements in Y are explained by the movements in
X - In particular it measures how much of the
derivation of Y from the mean of Y is explained
by the regression
13The Goodness of Fit (II)
- This is the interactive part of the class.
- Please explain the concept depicted in the
following graph in your own words
Regression line
x
14The Goodness of Fit (III)
Total variation explained variation
unexplained variation TSS ESS USS R²ESS/TSS
Explained Sum of Squares/Total Sum of Squares
15The Goodness of Fit (IV)
- R2 gives the proportion of the sample variation
in y that is explained by x - R2 0.35 means that the explanatory variable
explains 35 of the variation in the dependent
variable - R2 ranges between 0 and 1
- The higher R2 the better the fit of the
regression line to the data - R2 can be used to compare the explanatory power
of different regression models (with the same
dependent variable) - R2 has to be interpreted in the light of what we
would expect - A low R2 does not necessarily mean that an OLS
regression equation is useless
16Computer Class
17Exercises
- Weimar elections Unemployment and votes for the
Nazi - A) Descriptive Statistics
- Get the dataset about the Weimar election of
1932 at http//www.nuff.ox.ac.uk/users/studer/tea
ching.htm - Look at the variables (votes for the Nazi party,
level of unemployment) in turn - Get a first visualisation of the data does it
look normally distributed? - Compute the mean, median, standard deviation,
coefficient of variation, kurtosis and skewness
for the variable - Make a scatter plot Do you think the two
variables are associated? How and how strongly? - B) Regression Analysis
- Which one is the dependent variable?
- Estimate the regression line
- What is the interpretation?
- What is the explanatory power of the regression?
- Draw a scatter plot and add the regression line
18Exercises (II)
2. Weimar elections Nazi votes and the share of
Catholics ? Do the same exercises as with the
unemployment rate A) Descriptive
Statistics B) Regression Analysis
19Appendix STATA Commands
- regress depvar indepvars Linear regression the
dependent and independent variables are
indicated by the order. The dependent variable
depvar comes first, then the independent
variable(s) indepvar follow - predict yhat, xb Calculates the linear
prediction for each observation, i.e. yhati
abindepvari - predict res, resid The option resid behind the
comma let STATA calculate and save the
residuals, i.e. resiyi- yhati - generate newvarexp Creates a new variable.
STATA provides numerous functions, see STATAs
Help on Functions and expressions/math
functions
20Homework
- Readings
- Feinstein Thomas, Ch. 5
- Problem Set 4
- Finish the exercises from todays computer class
if you havent done so already. Include all the
results and answers in the file you send me - On the next slide you see part of the
macroeconomic dataset we used in week 3. Look at
the variables GDP per head and education,
assuming that GDP per head is the dependent
variable and education the explanatory variable.
Calculate by hand - The regression coefficient, b
- The intercept, a
- The total sum of squares
- The explained sum of squares
- The residual sum of squares
- The coefficient of determination
- (hint graph the values first by hand and then
look at tables 4.1 and 4.2) - Download the complete data set at
http//www.nuff.ox.ac.uk/users/studer/teaching.htm
(macro data) and calculate the regression line
and the coefficient of determination - Interpret the results
- Do the results differ from the correlation
results?
21Homework (II)
Country GDP per Head Agriculture Education
Norway 54360 1.60 81
Switzerland 49660 1.40 49
United States 39430 1.20 83
Brazil 3340 10.10 21
Iran 2340 13.70 21