
Regression Analysis with SPSS

- Robert A. Yaffee, Ph.D.
- Statistics, Mapping and Social Science Group
- Academic Computing Services
- Information Technology Services
- New York University
- Office 75 Third Ave Level C3
- Tel 212.998.3402
- E-mail yaffee_at_nyu.edu
- February 04

Outline

- Conceptualization
- Schematic Diagrams of Linear Regression processes
- Using SPSS, we plot and test relationships for linearity
- Nonlinear relationships are transformed to linear ones
- General Linear Model
- Derivation of Sums of Squares and ANOVA
- Derivation of intercept and regression coefficients
- The Prediction Interval and its derivation
- Model Assumptions
- Explanation
- Testing
- Assessment
- Alternatives when assumptions are unfulfilled

Conceptualization of Regression Analysis

- Hypothesis testing
- Path Analytical Decomposition of effects

Hypothesis Testing

- For example, hypothesis 1: X is statistically significantly related to Y.
- The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
- The magnitude of the relationship is small, medium, or large.
- If the magnitude is small, then a unit change in X is associated with a small change in Y.

Regression Analysis: Have a clear notion of what you can and cannot do with regression analysis

- Conceptualization
- A Path Model of a Regression Analysis

In a path analysis, Yi is endogenous: it is the outcome of several paths. Direct effects on Y3: C, E, F. Indirect effects on Y3: BF, BDF. Total effects = direct + indirect effects.

Interaction coefficient C: X1 and X2 must be in the model for the interaction to be properly specified.

A Precursor to Modeling with Regression

- Data Exploration: run a scatterplot matrix and search for linear relationships with the dependent variable.

Click on Graphs and then on Scatter

When the scatterplot dialog box appears, select Matrix

A matrix of scatterplots will appear

Search for distinct linear relationships


Decomposition of the Sums of Squares

Graphical Decomposition of Effects

Decomposition of the sum of squares

Decomposition of the sum of squares

- Total SS = model SS + error SS
- and if we divide by df,
- this yields the variance decomposition: we have total variance = model variance + error variance.

F test for significance and R2 for magnitude of effect

- R2 = model var / total var
- F test for model significance = model var / error var
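The decomposition above is easy to verify numerically. A minimal sketch in Python with numpy (the slides work through the SPSS GUI; the data here are invented):

```python
import numpy as np

# Toy data: y regressed on a single predictor x (hypothetical values).
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 20)

# Fit by least squares and form the fitted values.
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b

# Decomposition: total SS = model SS + error SS.
ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((yhat - y.mean()) ** 2)
ss_error = np.sum((y - yhat) ** 2)

r2 = ss_model / ss_total                       # R2 = model SS / total SS
k, n = 1, len(y)                               # one predictor
f = (ss_model / k) / (ss_error / (n - k - 1))  # F = model MS / error MS

print(round(ss_total, 3), round(ss_model + ss_error, 3), round(r2, 3))
```

Dividing each sum of squares by its degrees of freedom turns the SS identity into the variance decomposition of the previous slide.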

ANOVA tests the significance of the Regression Model

The Multiple Regression Equation

- We proceed to the derivation of its components

- The intercept a
- The regression parameters, b1 and b2

Derivation of the Intercept

Derivation of the Regression Coefficient

- If we recall that the formula for the correlation coefficient can be expressed as follows:

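The equation itself was an image on the original slide; the standard bivariate least-squares expressions it presumably showed are:

```latex
r_{xy} \;=\; \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_i (x_i-\bar{x})^2}\;\sqrt{\sum_i (y_i-\bar{y})^2}},
\qquad
b \;=\; r_{xy}\,\frac{s_y}{s_x},
\qquad
a \;=\; \bar{y}-b\,\bar{x}
```

That is, the slope is the correlation rescaled by the ratio of the standard deviations, and the intercept anchors the line at the means.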

Extending the bivariate case to the multiple linear regression case

It is also easy to extend the bivariate intercept to the multivariate case, as follows.
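A sketch of that extension, assuming the usual normal-equations form b = (X'X)⁻¹X'y (Python with numpy; the data and coefficients are invented):

```python
import numpy as np

# Hypothetical data with two predictors, to mirror the bivariate-to-multiple step.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.1, 50)

# Multiple regression via the normal equations: b = (X'X)^{-1} X'y,
# with a column of ones so the first coefficient is the intercept a.
X = np.column_stack([np.ones(50), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
a = b[0]

# The multivariate intercept generalizes a = ybar - b1*x1bar - b2*x2bar.
a_check = y.mean() - b[1] * x1.mean() - b[2] * x2.mean()
print(np.round(b, 2), round(a - a_check, 6))
```

With an intercept column in X, the fitted plane passes through the point of means, so the two ways of computing a agree exactly.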

Significance Tests for the Regression Coefficients

- We find the significance of the parameter estimates by using the F or t test.
- The R2 is the proportion of variance explained.

F and t tests for significance of the overall model

Significance tests

- If we are using a Type II sum of squares, we are dealing with the Ballantine (the Venn diagram of overlapping variance). DV variance explained = a + b.

Significance tests

- T tests for statistical significance

Significance tests

- Standard Error of intercept

Standard error of regression coefficient
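The formulas on these slides were images; for a model with p predictors and n cases, the standard expressions they presumably showed are, with s² = SSE/(n − p − 1):

```latex
SE(b_k) = \sqrt{s^2\,\bigl[(X'X)^{-1}\bigr]_{kk}},
\qquad
t_k = \frac{b_k}{SE(b_k)} \sim t_{\,n-p-1},
\qquad
SE(a) = s\,\sqrt{\tfrac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i-\bar{x})^2}}
\;\text{(bivariate case)}
```

Each coefficient is declared significant when its t ratio exceeds the critical value at the chosen alpha level.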

Programming Protocol

After invoking SPSS, proceed to File, Open, Data

Select a data set (we choose employee.sav) and click on Open

We open the data set

To inspect the variable formats, click on Variable View on the lower left

Because gender is a string variable, we need to recode gender into a numeric format

We autorecode gender by clicking on Transform and then Autorecode

We select gender and move it into the variable box on the right

Give the variable a new name and click on Add New Name

Click on OK and the numeric variable sex is created

It has values 1 for female and 2 for male, and those value labels are inserted.
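What Autorecode does here can be mimicked in a few lines: assign consecutive integer codes to the sorted distinct string values and keep the originals as value labels. A sketch in plain Python (hypothetical values, not the employee.sav data):

```python
# Toy stand-in for the string variable gender.
gender = ["f", "m", "m", "f", "m"]

# Codes 1, 2, ... follow the sort order of the distinct values,
# so 'f' -> 1 and 'm' -> 2, matching the slide.
labels = {code: value for code, value in enumerate(sorted(set(gender)), start=1)}
codes = {value: code for code, value in labels.items()}
sex = [codes[g] for g in gender]

print(sex)     # [1, 2, 2, 1, 2]
print(labels)  # {1: 'f', 2: 'm'}
```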

To invoke regression analysis, click on Analyze

Click on Regression and then Linear

Select the dependent variable, Current Salary

Enter it in the dependent variable box

Entering independent variables

- These variables are entered in blocks. First, the potentially confounding covariates have to be entered.
- We enter time on job, beginning salary, and previous experience.

After entering the covariates, we click on next

We now enter the hypotheses we wish to test

- We are testing for minority or sex differences in salary after controlling for time on job, previous experience, and beginning salary.
- We enter minority and numeric gender (sex).

After entering these variables, click on Statistics

We select the following statistics from the dialog box and click on Continue

Click on plots to obtain the plots dialog box

We click on OK to run the regression analysis
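The block-entry logic above amounts to comparing nested models with an F-change test: fit the covariates alone, then add the hypothesis variables and ask whether the drop in error SS is significant. A sketch with simulated stand-ins for the employee.sav variables (Python with numpy; the names and effect sizes are invented):

```python
import numpy as np

def ols_sse(X, y):
    """Sum of squared errors from a least-squares fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

# Hypothetical stand-ins for the employee.sav variables.
rng = np.random.default_rng(2)
n = 200
jobtime, salbegin, prevexp = (rng.normal(size=n) for _ in range(3))
minority = rng.integers(0, 2, n).astype(float)
sex = rng.integers(1, 3, n).astype(float)
salary = 3.0 * salbegin + 1.0 * jobtime + 2.0 * minority + rng.normal(0, 1, n)

ones = np.ones(n)
# Block 1: the confounding covariates only.
X1 = np.column_stack([ones, jobtime, salbegin, prevexp])
# Block 2: add the hypothesis variables.
X2 = np.column_stack([X1, minority, sex])

sse1, sse2 = ols_sse(X1, salary), ols_sse(X2, salary)
q = 2                                    # variables added in block 2
df2 = n - X2.shape[1]
f_change = ((sse1 - sse2) / q) / (sse2 / df2)
print(round(f_change, 2))
```

SPSS reports the same quantity as "F Change" in the Model Summary when R squared change is requested.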

Navigation window (left) and output window(right)

This shows that SPSS is reading the variables correctly

Variables Entered and Model Summary

Omnibus ANOVA

Significance Tests for the Model at each stage of the analysis

Full Model Coefficients

We omit insignificant variables and rerun the analysis to obtain trimmed model coefficients

Beta weights

- These are standardized regression coefficients, used to compare the contributions to the explanation of the variance of the dependent variable within the model.
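A beta weight is the raw coefficient rescaled by the ratio of standard deviations, beta = b × (sx/sy), which is what a regression on z-scored variables would return. A sketch (Python with numpy, invented data with predictors on very different scales):

```python
import numpy as np

# Predictors on very different scales make the raw b's incomparable.
rng = np.random.default_rng(8)
x1 = rng.normal(0, 5.0, 200)
x2 = rng.normal(0, 0.1, 200)
y = 0.2 * x1 + 30.0 * x2 + rng.normal(0, 1, 200)

X = np.column_stack([np.ones(200), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Beta weight = b * (sd of predictor / sd of dependent variable).
beta1 = b[1] * x1.std(ddof=1) / y.std(ddof=1)
beta2 = b[2] * x2.std(ddof=1) / y.std(ddof=1)

# Raw b's are about 0.2 vs 30; the betas are on one comparable scale.
print(round(beta1, 2), round(beta2, 2))
```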

T tests and signif.

- These are the tests of significance for each parameter estimate.
- The significance level has to be less than .05 for the parameter to be statistically significant.

Assumptions of the Linear Regression Model

- Linear functional form
- Fixed independent variables
- Independent observations
- Representative sample and proper specification of the model (no omitted variables)
- Normality of the residuals or errors
- Equality of variance of the errors (homogeneity of residual variance)
- No multicollinearity
- No autocorrelation of the errors
- No outlier distortion

Explanation of the Assumptions

- 1. Linear functional form
- Does not detect curvilinear relationships
- Independent observations
- Representative samples
- Autocorrelation inflates the t, r, and F statistics and warps the significance tests
- Normality of the residuals
- Permits proper significance testing
- Equality of variance
- Heteroskedasticity precludes generalization and external validity
- This also warps the significance tests
- Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.
- Outlier distortion may bias the results: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

Diagnostic Tests for the Regression Assumptions

- Linearity tests: regression curve fitting
- No level shifts: one regime
- Independence of observations: runs test
- Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
- Homogeneity of variance of the residuals: White's general specification test
- No autocorrelation of residuals: Durbin-Watson, or ACF or PACF of residuals
- Multicollinearity: correlation matrix of independent variables; condition index or condition number
- No serious outlier influence: tests of additive outliers, pulse dummies
- Plot residuals and look for high-leverage residuals
- Lists of standardized residuals
- Lists of studentized residuals
- Cook's distance or leverage statistics
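Several of these diagnostics reduce to one-line computations on the saved residuals. For instance, the Durbin-Watson statistic can be sketched as follows (Python with numpy; the residual series are simulated, not from the SPSS example):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals.
    Values near 2 suggest no first-order autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# Independent residuals should give DW near 2; a strongly trending
# (highly autocorrelated) series drifts toward 0.
rng = np.random.default_rng(3)
white = rng.normal(size=1000)
trending = np.cumsum(rng.normal(size=1000))

print(round(durbin_watson(white), 2), round(durbin_watson(trending), 2))
```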

Explanation of Diagnostics

- Plots show linearity or nonlinearity of the relationship
- The correlation matrix shows whether the independent variables are collinear and correlated.
- A representative sample is obtained with probability sampling

Explanation of Diagnostics

- Tests for normality of the residuals: the residuals are saved and then subjected to either of
- Kolmogorov-Smirnov test: tests the theoretical cumulative normal distribution against your empirical residual distribution.
- Nonparametric tests
- 1-sample K-S test
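The 1-sample K-S statistic is just the largest gap between the empirical CDF of the standardized residuals and the normal CDF. A sketch (Python with numpy and the stdlib error function; simulated residuals):

```python
import math
import numpy as np

def ks_statistic(resid):
    """One-sample K-S statistic: max gap between the empirical CDF of the
    standardized residuals and the theoretical N(0,1) CDF."""
    z = np.sort((resid - resid.mean()) / resid.std(ddof=1))
    n = len(z)
    theo = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    ecdf_hi = np.arange(1, n + 1) / n   # ECDF just after each point
    ecdf_lo = np.arange(0, n) / n       # ECDF just before each point
    return float(max(np.max(ecdf_hi - theo), np.max(theo - ecdf_lo)))

rng = np.random.default_rng(4)
normal_resid = rng.normal(size=500)      # should give a small D
skewed_resid = rng.exponential(size=500)  # clearly non-normal: a large D

print(round(ks_statistic(normal_resid), 3), round(ks_statistic(skewed_resid), 3))
```

Note that when the mean and standard deviation are estimated from the same residuals, the usual K-S critical values are conservative (the Lilliefors correction addresses this).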

Collinearity Diagnostics

More Collinearity Diagnostics

- Condition number = maximum eigenvalue / minimum eigenvalue
- If the condition number is between 100 and 1000, there is moderate to strong collinearity

If the condition index > 30, then there is strong collinearity
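The condition number above is the eigenvalue ratio of X'X, and the condition index is its square root. A sketch (Python with numpy, made-up predictors; note that SPSS scales the columns first, so its reported numbers differ):

```python
import numpy as np

# Two nearly collinear predictors versus two independent ones (made-up data).
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, 100)   # x2 is almost an exact copy of x1

def condition_index(X):
    """sqrt(max eigenvalue / min eigenvalue) of X'X."""
    eig = np.linalg.eigvalsh(X.T @ X)
    return float(np.sqrt(eig.max() / eig.min()))

collinear_idx = condition_index(np.column_stack([x1, x2]))
independent_idx = condition_index(np.column_stack([x1, rng.normal(size=100)]))

# The near-duplicate pair should blow past the > 30 rule of thumb;
# the independent pair should sit near 1.
print(round(collinear_idx, 1), round(independent_idx, 2))
```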

Outlier Diagnostics

- Residuals
- The actual value minus the predicted value. This is otherwise known as the error.
- Studentized residuals
- The residuals divided by their standard errors computed without the ith observation
- Leverage, called the hat diagonal
- This is the measure of the influence of each observation
- Cook's distance
- The change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.

Outlier detection

- Outlier detection involves determining whether the residual (error = actual − predicted) is an extreme negative or positive value.
- After running the regression, we may plot the residuals versus the fitted values to determine which errors are large.

Create Standardized Residuals

- A standardized residual is one divided by its standard deviation.

Limits of Standardized Residuals

- If the standardized residuals have values in excess of 3.5 or below -3.5, they are outliers.
- If the absolute values are less than 3.5, as these are, then there are no outliers.
- While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
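The ±3.5 screening rule is a one-liner once the residuals are saved. A sketch (Python with numpy; the residuals and the planted outlier are invented):

```python
import numpy as np

# Residuals with one gross outlier planted at a known position.
rng = np.random.default_rng(6)
resid = rng.normal(0, 2.0, 100)
resid[17] = 25.0   # an extreme error

# Standardize: each residual divided by the residual standard deviation,
# then flag anything beyond the +/-3.5 rule of thumb.
z = resid / resid.std(ddof=1)
outliers = np.flatnonzero(np.abs(z) > 3.5)

print(outliers)   # expect only index 17 to be flagged
```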

Outlier Influence

- Suppose we had a different data set with two outliers.
- We tabulate the standardized residuals and obtain the following output

Outlier a does not distort and outlier b does.

Studentized Residuals

- Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n - p - 1, though they are not quite independent. Therefore, we can approximately determine whether they are statistically significant or not.
- Belsley et al. (1980) recommended the use of studentized residuals.

Studentized Residual

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator is formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier. The command (in Stata) to generate studentized residuals, called rstudt, is: predict rstudt, rstudent

Influence of Outliers

- Leverage is measured by the diagonal components of the hat matrix.
- The hat matrix comes from the formula for the regression of Y.

Leverage and the Hat matrix

- The hat matrix transforms Y into the predicted scores.
- The diagonals of the hat matrix indicate which values will be outliers or not.
- The diagonals are therefore measures of leverage.
- Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
- The trace of the hat matrix = the number of variables in the model.
- When the leverage > 2p/n, there is high leverage, according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested that 3p/n is the criterion.

Cook's D

- Another measure of influence.
- This is a popular one. The formula for it is:

Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n - p) are large.
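The hat diagonal and Cook's D can be computed together from the design matrix. A sketch, assuming the common form D_i = e_i² h_i / (p s² (1 − h_i)²) for Cook's D (Python with numpy; the data and the planted high-leverage point are invented):

```python
import numpy as np

# Hypothetical small regression with one high-leverage point.
rng = np.random.default_rng(7)
x = np.append(rng.normal(0, 1, 30), 10.0)   # the last x is far from the rest
y = 1.0 + 2.0 * x + rng.normal(0, 1, 31)
y[-1] += 15.0                               # and its y is badly off the line

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Hat matrix H = X (X'X)^{-1} X'; its diagonal is the leverage.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - p)

# One common form: Cook's D_i = e_i^2 * h_i / (p * s2 * (1 - h_i)^2)
cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)

print(round(h.sum(), 6))                        # trace of H equals p
print(int(np.argmax(h)), int(np.argmax(cooks_d)))
```

The planted point should exceed the 2p/n leverage rule of thumb from the previous slide and dominate Cook's D.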

Using Cooks D in SPSS

- Cook is the option /R
- Finding the influential outliers:
- List cook if cook > 4/n
- Belsley suggests 4/(n - k - 1) as a cutoff

DFbeta

- One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.

Programming Diagnostic Tests: Testing homoskedasticity. Select histogram, normal probability plot, and insert zresid in Y and zpred in X

Then click on continue

Click on Save to obtain the Save dialog box

We select the following

Then we click on Continue, go back to the main Regression menu, and click on OK

Check for linear Functional Form

- Run a matrix plot of the dependent variable against each independent variable to be sure that the relationship is linear.

Move the variables to be graphed into the box on the upper right, and click on OK

Residual Autocorrelation check

See significance tables for this statistic

Run the autocorrelation function from the Trends module for a better analysis

Testing for Homogeneity of variance

Normality of the residuals can be visually inspected from the histogram with the superimposed normal curve. Here we check the skewness for symmetry and the kurtosis for peakedness.

Kolmogorov-Smirnov Test: an objective test of normality


Multicollinearity test with the correlation matrix


Alternatives to Violations of Assumptions

- 1. Nonlinearity: transform to linearity if there is nonlinearity, or run a nonlinear regression.
- 2. Nonnormality: run a least absolute deviations regression or a median regression (available in other packages), or generalized linear models (S-PLUS glm, Stata glm, or SAS PROC MODEL or PROC GENMOD).
- 3. Heteroskedasticity: weighted least squares regression (SPSS) or the White estimator (SAS, Stata, S-PLUS). One can use a robust regression procedure (SAS, Stata, or S-PLUS) to downweight the effect of outliers in the estimation.
- 4. Autocorrelation: run AREG in the SPSS Trends module, or either the Prais or Newey-West procedure in Stata.
- 5. Multicollinearity: components regression, ridge regression, or proxy variables; 2SLS in SPSS, ivreg in Stata, or SAS PROC MODEL or PROC SYSLIN.

Model Building Strategies

- Specific to General: Cohen and Cohen
- General to Specific: Hendry and Richard
- Extreme Bounds Analysis: E. Leamer

Nonparametric Alternatives

- If there is nonlinearity, transform to linearity first.
- If there is heteroskedasticity, use robust standard errors with Stata, SAS, or S-PLUS.
- If there is non-normality, use quantile regression with bootstrapped standard errors in Stata or S-PLUS.
- If there is autocorrelation of residuals, use Newey-West autoregression or first-order autocorrelation correction with AREG. If there is higher-order autocorrelation, use Box-Jenkins ARIMA modeling.