SW318 Social Work Statistics

Regression Analysis

- We have previously studied Pearson's r correlation coefficient and the r² coefficient of determination as measures of association for evaluating the relationship between an interval level independent variable and an interval level dependent variable.
- These statistics are components of a broader set of statistical techniques for evaluating the relationship between two interval level variables, called regression analysis (sometimes referred to in combination as correlation and regression analysis).

Regression Analysis vs. Chi-Square Test of

Independence

- Our purpose now is to use a hypothesis test to conclude that there is a relationship between two interval level variables in the population represented by our sample data.
- We could use a chi-square test of independence to determine whether or not a relationship exists between two variables in the population represented by our data, provided we grouped the values of both variables to create a bivariate table.
- However, it is preferable to test for the presence of a relationship while retaining the variables as interval level data, because this strategy is more effective at detecting the existence of a relationship. We might find a relationship using interval level statistics that we do not find using nominal level statistics because the nominal level statistics are less precise.

Elements of Regression Analysis

- We will first review previous material on regression and correlation:
  - The scatterplot or scattergram
  - The regression equation
- Then, we will examine the statistical evidence to determine whether or not the relationships found in our sample data are applicable to the population represented by the sample, using a hypothesis test.

Purpose of Regression Analysis

- The purpose of regression analysis is to answer the same three questions that have been identified as requirements for understanding the relationships between variables:
  - Is there a relationship between the two variables?
  - How strong is the relationship?
  - What is the direction of the relationship?

Scatterplots - 1

- The relationship between two interval variables can be graphed as a scatterplot or a scatter diagram, which shows the position of all of the cases in an x-y coordinate system.
- The independent variable is plotted on the x-axis, or the horizontal axis.
- The dependent variable is plotted on the y-axis, or the vertical axis.
- A dot in the body of the chart represents the intersection of a case's values on the x-axis and the y-axis.

Scatterplots - 2

- The trendline or regression line is plotted on the chart in a contrasting color.
- The overall pattern of the dots, or data points, succinctly summarizes the nature of the relationship between the two variables.
- The clarity of the pattern formed by the dots can be enhanced by drawing a straight line through the cluster such that the line touches every dot or comes as close to doing so as possible.
- This summarizing line is called the regression line.
- We will see later how this line is obtained, but for now, we will look at how it helps us understand the scatterplot.

Scatterplots - 3

The pattern of the points on the scatterplot

gives us information about the relationship

between the variables. The regression line,

drawn in red, makes it easier for us to

understand the scatterplot.

The Uses of Scatterplots

- Scatterplots give us information about our three questions about the relationship between two interval variables:
  - Is there a relationship between the two variables?
  - How strong is the relationship?
  - What is the direction of the relationship?
- In addition, the regression line on the scatterplot can be used to estimate the value of the dependent variable for any value of the independent variable.

Scatterplots: Evidence of a Relationship

The angle between the regression line and the

horizontal x-axis provides evidence of a

relationship. If there is no relationship, the

regression line will be parallel to the axis.

When there is no relationship between two

variables, the regression line is parallel to the

horizontal axis.

When there is a relationship between two

variables, the regression line lies at an angle

to the horizontal axis, sloping either upward or

downward.

Scatterplots: Strength of a Relationship

The strength of a relationship is indicated by the narrowness of the band of points spread around the regression line: the tighter the band, the stronger the relationship.

The spread of the points around the regression

line is narrow, indicating a stronger

relationship. We should check the scale of the

vertical axis to make sure the narrow band is not

the result of an excessively large scale.

In this scatterplot, the points are very spread

out around the regression line. The relationship

is weak.

Scatterplots: Direction of Relationship

When the regression line slopes upward to the

right, there is a positive, or direct,

relationship between the variables. When the

regression line slopes downward, the relationship

is negative, or inverse.

In this scatterplot, the regression line slopes

downward to the right, indicating a negative or

inverse relationship. The values of the

variables move in opposite directions.

In this scatterplot, the regression line slopes

upward to the right, indicating a positive or

direct relationship. The values of both

variables increase and decrease at the same time.

Scatterplots: Predicting Scores

For any value of the independent variable on the

horizontal x-axis, the predicted value for the

dependent variable will be the corresponding

value on the vertical y-axis.

For a given value of the independent variable on the horizontal axis, e.g. 52, we draw a line perpendicular to the x-axis, upward from that value to the regression line.

The estimate for the dependent variable is

obtained by drawing a line parallel to the x-axis

from the regression line to the vertical y-axis

and reading the value where this line crosses the

y-axis, e.g. 50.

The Effect of Scaling on the Scatterplot

- The scale used for the vertical y-axis can change the appearance of the

scatterplot and alter our interpretation of the

strength of the relationship. The three

scatterplots on this slide all use the same data.

In the original plot, the y-axis is scaled from 0

to 80.

The Assumption of Linearity

- An underlying assumption of regression analysis is that the relationship between the variables is linear, meaning that the points in the scatterplot must form a pattern that can be approximated with a straight line.
- While we could test the assumption of linearity with a test of statistical significance of the correlation coefficient, we will make a visual assessment of scatterplots.
- If the scatterplot indicates that the points do not follow a linear pattern, the techniques of linear correlation and regression should not be applied.

Examples of Linear Relationships

- These two scatterplots are for data on poverty of

nations. The plots below show strong linear

relationships. The points are evenly distributed

on either side of the regression line.

Examples of Non-linear Relationships

- These scatterplots show a non-linear

relationship. The points are not evenly

distributed on either side of the regression

line. We will often see a concentration of

points on one side of the regression line and an

absence of points on the other side.

The Regression Equation

- The regression equation is the algebraic formula for the regression line, which states the mathematical relationship between the independent and the dependent variable.
- We can use the regression line to estimate the value of the dependent variable for any value of the independent variable.
- The stronger the relationship between the independent and dependent variables, the closer these estimates will come to the actual score that each case had on the dependent variable.

Components of the Regression Equation

- The regression equation has two components.
- The first component is a number called the y-intercept that defines where the line crosses the vertical y-axis.
- The second component is called the slope of the line, and is a number that multiplies the value of the independent variable.
- These two elements are combined in the general form for the regression equation:
  the estimated score on the dependent variable = the y-intercept + the slope × the score on the independent variable

The Standard Form of the Regression Equation

- The standard form for the regression equation or formula is:
  Y = a + bX
- where:
  - Y is the estimated score for the dependent variable
  - X is the score for the independent variable
  - b is the slope of the regression line, or the multiplier of X
  - a is the intercept, or the point on the vertical y-axis where the regression line crosses the axis

Depicting the Regression Equation

The regression equation includes both the

y-intercept and the slope of the line. The

y-intercept is 1.0 and the slope is 0.5.

The slope is the multiplier of x. It is the

amount of change in y for a change of one unit in

x. If x changes one unit from 2.0 to 3.0,

depicted by the blue arrow, y will change by 0.5

units, from 2.0 to 2.5 as depicted by the red

arrow.

- The y-intercept is the point on the vertical

y-axis where the regression line crosses the

axis, i.e. 1.0.
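
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `predict` is made up for illustration, and the intercept (1.0) and slope (0.5) are the values quoted above.

```python
def predict(x, intercept=1.0, slope=0.5):
    """Estimated Y for a given X, using the standard form Y = a + bX."""
    return intercept + slope * x

y_at_2 = predict(2.0)      # value of y when x is 2.0
y_at_3 = predict(3.0)      # value of y when x is 3.0
change = y_at_3 - y_at_2   # change in y for a one-unit change in x (the slope)

print(y_at_2, y_at_3, change)  # prints 2.0 2.5 0.5
print(predict(0.0))            # prints 1.0, the y-intercept
```

Setting x to zero returns the intercept, and moving x by one unit moves the estimate by exactly the slope, which matches the blue and red arrows described above.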

Deriving the Regression Equation

- In this plot, none of the points fall on the regression line.
- The difference between the actual value for the dependent variable and the predicted value for each point is shown by the red lines. This difference is called the residual, and represents the error between the actual and predicted values.
- The regression equation is computed to minimize the total amount of error in predicting values for the dependent variable. The method for deriving the equation is called the "method of least squares," meaning that the regression line minimizes the sum of the squared residuals, or errors between actual and predicted values.
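
As a rough illustration of the method of least squares, the sketch below computes the slope and intercept from the usual closed-form formulas and then confirms that nudging the line in any direction increases the sum of squared residuals. The five data points and the function names are hypothetical and are not taken from the slides.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Total error: sum of squared differences between actual and predicted Y."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Any other line (here: small shifts of intercept or slope) has more error.
nudged = [
    sum_squared_residuals(xs, ys, a + da, b + db)
    for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
]

print(round(a, 3), round(b, 3), round(best, 3))
print(all(err > best for err in nudged))
```

For these points the fitted line is approximately Y = 2.2 + 0.6X, and every nudged line produces a larger total of squared residuals.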

Interpreting the Regression Equation: the Intercept

- The intercept is the point on the vertical axis where the regression line crosses the axis. It is the predicted value for the dependent variable when the independent variable has a value of zero.
- This may or may not be useful information depending on the context of the problem.

Interpreting the Regression Equation: the Slope

- The slope is interpreted as the amount of change in the predicted value of the dependent variable associated with a one unit change in the value of the independent variable.
- If the slope has a negative sign, the direction of the relationship is negative or inverse, meaning that the scores on the two variables move in opposite directions.
- If the slope has a positive sign, the direction of the relationship is positive or direct, meaning that the scores on the two variables move in the same direction.

Interpreting the Regression Equation when the

Slope equals 0

- If there is no relationship between two variables, the slope of the regression line is zero and the regression line is parallel to the horizontal axis.
- A slope of zero means that the predicted value of the dependent variable will not change, no matter what value of the independent variable is used.
- If there is no relationship, using the regression equation to predict values of the dependent variable is no improvement over using the mean of the dependent variable.

Assumptions Required for Utilizing a Regression Equation

- The assumptions required for utilizing a regression equation are the same as the assumptions for the test of significance of a correlation coefficient:
  - Both variables are interval level.
  - Both variables are normally distributed.
  - The relationship between the two variables is linear.
  - The variance of the values of the dependent variable is uniform for all values of the independent variable (equality of variance).

Assumption of Normality

- Strictly speaking, the test requires that the two variables be bivariate normal, meaning that the combined distribution of the two variables is normal. It is usually assumed that the variables are bivariate normal if each variable is normally distributed, so this assumption is tested by checking the normality of each variable.
- Each variable will be considered normal if its skewness and kurtosis statistics fall between -1.0 and +1.0, or if the sample size is sufficiently large to apply the Central Limit Theorem.
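
The skewness and kurtosis rule can be sketched in Python as follows. This is an illustration under stated assumptions: it uses the simple moment-based formulas (with excess kurtosis, so a normal distribution scores 0), whereas SPSS reports sample-adjusted versions that differ slightly in small samples. The function names and both data sets are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (not the SPSS-adjusted versions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis

def looks_normal(values, limit=1.0):
    """Apply the rule of thumb: both statistics must fall between -limit and +limit."""
    skew, kurt = skewness_and_kurtosis(values)
    return -limit <= skew <= limit and -limit <= kurt <= limit

# Hypothetical data: a roughly bell-shaped variable and a strongly skewed one.
bell_shaped = [2, 3, 3, 4, 4, 4, 5, 5, 6]
right_skewed = [1, 1, 1, 1, 2, 2, 3, 10]

print(looks_normal(bell_shaped))   # prints True
print(looks_normal(right_skewed))  # prints False
```

The Central Limit Theorem alternative mentioned above is a judgment about sample size and is not encoded here.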

Assumption of Linearity

- Linearity means that the pattern of the points in a scatterplot forms a band, like the pattern in the chart on the right.

- When the pattern of the points follows a curve,

like the scatterplot on the right, the

correlation coefficient will not accurately

measure the relationship.

Test of Linearity

- The test of linearity is a diagnostic statistical test of the null hypothesis that the linear model is an appropriate fit for the data points. The desired outcome for this test is to fail to reject the null hypothesis.
- If the probability for the test of linearity statistic is less than or equal to the level of significance for the problem, we reject the null hypothesis, concluding that the data are not linear and that regression analysis is not appropriate for the relationship between the two variables.
- If the probability for the test of linearity statistic is greater than the level of significance for the problem, we fail to reject the null hypothesis and conclude that we satisfy the assumption of linearity.

Assumption of Homoscedasticity

- Homoscedasticity (equality of variances) means

that the points are evenly dispersed on either

side of the regression line for the linear

relationship.

In this scatterplot, the points extend about the

same distance above and below the regression line

for most of the length of the regression line.

This scatterplot meets the assumption of

homoscedasticity.

In this scatterplot, the spread of the points

around the regression line is narrower at the

left end of the regression line than at the right

end of the regression line. This funnel shape

is typical of a scatterplot showing violations of

the assumption of homoscedasticity.

Test of Homoscedasticity

- When we compared groups, we used the Levene test of population variances to test the assumption that the group variances were equal.
- In order to use this test for the assumption of homoscedasticity, we will convert the interval level independent variable into a dichotomous variable with low scores in one group and high scores in the other group. We can then compare the variances of the two groups derived from the independent variable.

Levene Test of Homogeneity of Variances

- The Levene test of equality of population variances tests whether or not the variances for the two groups are equal. It is a test of the research hypothesis that the variance (dispersion) of the group with low scores is different from the variance of the group with high scores. The null hypothesis is that the variances (dispersion) of both groups are equal.
- If the probability of the test statistic is greater than 0.05, we do not reject the null hypothesis and conclude that the variances are equal. This is the desired outcome.
- If the probability of the test statistic is less than or equal to 0.05, we conclude the variances are different and that regression analysis is not an appropriate test for the relationship between the two variables.
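
The dichotomize-then-compare procedure can be sketched as below. Several choices here are assumptions rather than anything stated in the slides: the split point is the median of the independent variable (cases at the median go to the low group), the Levene statistic is the mean-centered version, and the p-value is obtained by numerically integrating the t density, using the fact that an F statistic with 1 numerator degree of freedom is a squared t statistic. All function names and data are hypothetical.

```python
import math
import statistics

def split_by_median(xs, ys):
    """Group the dependent-variable scores by low vs. high independent-variable scores."""
    cut = statistics.median(xs)
    low = [y for x, y in zip(xs, ys) if x <= cut]
    high = [y for x, y in zip(xs, ys) if x > cut]
    return low, high

def levene_statistic(low, high):
    """Mean-centered Levene statistic for two groups; compare to F(1, N - 2)."""
    abs_devs = []
    for group in (low, high):
        center = sum(group) / len(group)
        abs_devs.append([abs(v - center) for v in group])
    n_total = len(low) + len(high)
    grand_mean = sum(sum(g) for g in abs_devs) / n_total
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in abs_devs)
    within = sum((v - sum(g) / len(g)) ** 2 for g in abs_devs for v in g)
    return (n_total - 2) * between / within

def p_value_f1(f_stat, df2, steps=2000):
    """Upper-tail probability for F(1, df2), via Simpson's rule on the t density."""
    t = math.sqrt(f_stat)
    coef = math.exp(math.lgamma((df2 + 1) / 2) - math.lgamma(df2 / 2))
    coef /= math.sqrt(df2 * math.pi)

    def density(u):
        return coef * (1 + u * u / df2) ** (-(df2 + 1) / 2)

    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1.0 - 2.0 * (total * h / 3)

# Hypothetical data: the same x values with an even spread and a funnel-shaped spread.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_even = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys_funnel = [4.9, 5.0, 5.1, 5.0, 2.0, 12.0, 4.0, 10.0]

for label, ys in [("even", ys_even), ("funnel", ys_funnel)]:
    low, high = split_by_median(xs, ys)
    w = levene_statistic(low, high)
    p = p_value_f1(w, len(xs) - 2)
    verdict = "variances equal" if p > 0.05 else "variances differ"
    print(label, round(w, 3), round(p, 4), verdict)
```

With the evenly spread data the test does not reject equal variances, while the funnel-shaped data fails the assumption. In practice SPSS would report the statistic and its probability directly.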

The hypothesis test of r²

- The purpose of the hypothesis test of r² is to test the applicability of our findings to the population represented by the sample.
- When we studied association between two interval variables, we stated that the Pearson r correlation coefficient and its square, the coefficient of determination, measure the strength of the relationship between two interval variables. When the correlation coefficient and coefficient of determination are zero (0), there is no relationship.
- The hypothesis test of r² is a test of whether or not r² is larger than zero in the population.

The hypothesis test of r²

- The research hypothesis states that r² is larger than zero (a relationship exists).
- The null hypothesis states that r² is equal to zero (no relationship).
- Recall that we interpreted the coefficient of determination r² as the reduction in error associated with the relationship between the variables.
- The test statistic is an ANOVA F-test which tests whether or not the reduction in error associated with using the regression equation is really greater than zero.

How the regression ANOVA test works

We will use the sample data we used for correlation and regression to examine how the hypothesis test for r² works.

We are interested in the relationship between

family size and number of credit cards.

The scatter diagram or scatterplot

The dependent variable is plotted on the Y or

vertical axis.

The independent variable is plotted on the x or

horizontal axis.

The mean as the best guess

Without taking into account the independent

variable, our best guess for the number of credit

cards for any subject is the mean, 7.0.

Errors using the mean as estimate

Errors are measured by computing the difference

between the mean and each Y value, squaring the

differences, and then summing them. When we

compute the answer in SPSS, it will tell us that

the total amount of error is 22.0.

The regression line

The regression line minimizes the error (the best

fitting or least squares line)

The equation for the regression line

SPSS will give us the formula for the regression line in the form Y = a + bX, or for these variables:

Number of Credit Cards = 2.871 + 0.971 × Family Size

PRE: reduction in error

SPSS also tells us the amount of error using only the mean and using the regression line.

Error using mean only (total): 22.000
Error using regression line: 5.486
Reduction in error associated with the regression: 16.514
PRE measure (r²) = (22.0 - 5.486) / 22.0 = 0.751
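
The PRE arithmetic can be reproduced directly from the two error totals reported on this slide. The variable names below are chosen for illustration; only the two input numbers come from the SPSS output quoted above.

```python
import math

total_error = 22.000     # error using the mean only
residual_error = 5.486   # error using the regression line

error_reduced = total_error - residual_error   # reduction due to the regression
r_squared = error_reduced / total_error        # PRE measure
r = math.sqrt(r_squared)                       # size of Pearson's r

print(round(error_reduced, 3))  # prints 16.514
print(round(r_squared, 3))      # prints 0.751
print(round(r, 3))              # prints 0.866
```

The square root gives only the size of r; its sign is taken from the sign of the slope in the regression equation.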

The ANOVA test for the regression

The F statistic is calculated as the ratio of the error reduced by the regression to the error remaining, with each amount divided by its degrees of freedom. If the regression reduced no more error than chance alone would produce, the ratio would be close to 1, there would be no evidence of a relationship, and the p-value would not let us reject the null hypothesis. In this problem, the amount of error reduced by the regression is large relative to the amount remaining, so the F statistic is large and the p-value (0.005) is smaller than the alpha level of significance, so we reject the null hypothesis.
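
The whole chain from raw scores to the F test can be sketched in Python. An important caveat: the slides do not list the raw cases, so the eight (family size, credit cards) pairs below are an assumed data set, chosen because it reproduces the reported mean (7.0), total error (22.0), regression coefficients (2.871 and 0.971) and remaining error (5.486). The p-value helper is also an assumption of this sketch: it integrates the t density numerically, relying on the fact that F with 1 numerator degree of freedom is a squared t.

```python
import math

# ASSUMED raw data (not given in the slides); see the note above.
family_size = [2, 2, 4, 4, 5, 5, 6, 6]
credit_cards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(family_size)
mean_x = sum(family_size) / n
mean_y = sum(credit_cards) / n

# Least squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in family_size)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(family_size, credit_cards))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Error using the mean only, and error remaining around the regression line.
total_error = sum((y - mean_y) ** 2 for y in credit_cards)
residual_error = sum(
    (y - (intercept + slope * x)) ** 2 for x, y in zip(family_size, credit_cards)
)
error_reduced = total_error - residual_error

# F = (error reduced / its df) / (error remaining / its df).
df_regression = 1
df_residual = n - 2
f_stat = (error_reduced / df_regression) / (residual_error / df_residual)

def p_value_f1(f_value, df2, steps=2000):
    """Upper-tail probability for F(1, df2), via Simpson's rule on the t density."""
    t = math.sqrt(f_value)
    coef = math.exp(math.lgamma((df2 + 1) / 2) - math.lgamma(df2 / 2))
    coef /= math.sqrt(df2 * math.pi)

    def density(u):
        return coef * (1 + u * u / df2) ** (-(df2 + 1) / 2)

    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1.0 - 2.0 * (total * h / 3)

p_value = p_value_f1(f_stat, df_residual)

print(round(intercept, 3), round(slope, 3))
print(round(total_error, 3), round(residual_error, 3), round(error_reduced, 3))
print(round(f_stat, 3), round(p_value, 3))
```

Under these assumptions the F statistic is about 18.06 on 1 and 6 degrees of freedom, and the p-value rounds to the 0.005 quoted above.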

Interpreting Pearson's r correlation coefficient

The square root of r² is Pearson's r, the correlation coefficient. If we want to characterize the strength of the relationship, we compare the size of r to the interpretive guidelines for measures of association.

Interpreting the direction of the relationship

To interpret the direction of the relationship

between the variables, we look at the coefficient

for the independent variable. In this example,

the coefficient of 0.971 is positive, so we would

interpret this relationship as: "Families with more members had more credit cards."

Testing Assumptions in Homework Problems

- The process of testing assumptions can easily overwhelm the task of testing the significance of the relationship.
- Since our emphasis here is testing the hypothesis that the relationship is generalizable to the population represented by the sample data, we will assume that our data satisfies the assumptions without explicitly testing assumptions.

Homework Problem Questions

- The question in the homework problems requires us to look at three things:
  - Does the hypothesis test support the existence of a relationship in the population?
  - Is the strength of the relationship characterized correctly?
  - Is the direction of the relationship between the variables correctly stated?

Practice Problem 1

This question asks you to use linear regression

to examine the relationship between marital and

age. Linear regression requires that the

dependent variable and the independent variables

be interval. Ordinal variables may be included as

interval variables if a caution is added to any

true findings. The dependent variable marital

is nominal level which does not satisfy the

requirement for a dependent variable. The

independent variable age is interval level,

satisfying the requirement for an independent

variable.

Practice Problem - 2

This question asks you to use linear regression

to examine the relationship between fund and

attend. The level of measurement requirements for linear regression are satisfied: fund is ordinal level, and attend is ordinal level. A

caution is added because ordinal level variables

are included in the analysis. Given the

assumption that the distributional requirements

for linear regression are satisfied, you can

conduct a linear regression using SPSS without

examining distributional assumptions for the

variables.

Linear Regression Hypothesis Test in SPSS (1)

You can conduct a linear regression

using Analyze > Regression > Linear

Linear Regression Hypothesis Test in SPSS (2)

Move the dependent variable to Dependent and

the independent variable to Independent(s)

boxes and then click OK button.

Linear Regression Hypothesis Test in SPSS (3)

Based on the ANOVA table for the linear regression (F(1, 604) = 70.579, p < 0.001), there was a relationship between the dependent variable "degree of religious fundamentalism" and the independent variable "frequency of attendance at religious services". Since the probability of the F statistic (p < 0.001) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.

Linear Regression Hypothesis Test in SPSS (4)

Given the significant F-test result, the

correlation coefficient (R) can be interpreted.

The correlation coefficient for the relationship between the independent variable and the dependent variable was 0.323, which would be characterized as a weak relationship using the rule of thumb that a correlation between 0.0 and 0.20 is very weak; 0.20 to 0.40 is weak; 0.40 to 0.60 is moderate; 0.60 to 0.80 is strong; and greater than 0.80 is very strong. The problem statement incorrectly characterized the relationship between the independent variable and the dependent variable as a moderate relationship. The relationship should have been characterized as a weak relationship. The answer to the problem is false.
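
The rule of thumb for characterizing strength is easy to express as a small lookup function. This is a sketch: the function name is hypothetical, and because the guideline does not say which category a value exactly on a boundary (such as 0.20) belongs to, the code assumes it falls in the lower category.

```python
def describe_strength(r):
    """Label the strength of a correlation using the rule of thumb above.

    Assumption: a value exactly on a boundary is given the weaker label.
    """
    size = abs(r)  # strength ignores the sign of the correlation
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"

print(describe_strength(0.323))  # prints weak
print(describe_strength(0.122))  # prints very weak
```

The sign of r is ignored on purpose, since direction is interpreted separately from the sign of the slope.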

Practice Problem 3

This question asks you to use linear regression

to examine the relationship between educ and

age. educ and age are interval level,

satisfying the level of measurement requirements

for regression. Given the assumption that the

distributional requirements for linear regression

are satisfied, you can conduct a linear

regression using SPSS without examining

distributional characteristics of variables.

Linear Regression Hypothesis Test in SPSS (5)

You can conduct a linear regression

using Analyze > Regression > Linear

Linear Regression Hypothesis Test in SPSS (6)

Move the dependent variable to Dependent and

the independent variable to Independent(s)

boxes and then click OK button.

Linear Regression Hypothesis Test in SPSS (7)

Based on the ANOVA table for the linear regression (F(1, 659) = 9.983, p = 0.002), there was a relationship between the dependent variable "highest year of school completed" and the independent variable "age". Since the probability of the F statistic (p = 0.002) was less than or equal to the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was rejected. The research hypothesis that there was a relationship between the variables was supported.

Linear Regression Hypothesis Test in SPSS (8)

Given the significant F-test result, the

correlation coefficient (R) can be interpreted.

The correlation coefficient for the

relationship between the independent variable and

the dependent variable was 0.122, which can be

characterized as a very weak relationship.

Linear Regression Hypothesis Test in SPSS (9)

The b coefficient for the independent variable

"age" was -.021, indicating an inverse

relationship with the dependent variable. Higher numeric values for the independent variable "age" (age) are associated with lower numeric values for the dependent variable "highest year of school completed" (educ). The statement in the

problem that "survey respondents who were older

had completed more years of school" is incorrect.

The direction of the relationship is stated

incorrectly.

Practice Problem 4

This question asks you to use linear regression

to examine the relationship between sei and

age. sei and age are interval level,

satisfying the level of measurement requirements

for regression. Given the assumption that the

distributional requirements for linear regression

are satisfied, you can conduct a linear

regression using SPSS without examining

distributional characteristics of variables.

Linear Regression Hypothesis Test in SPSS (10)

You can conduct a linear regression

using Analyze > Regression > Linear

Linear Regression Hypothesis Test in SPSS (11)

Move the dependent variable to Dependent and

the independent variable to Independent(s)

boxes and then click OK button.

Linear Regression Hypothesis Test in SPSS (12)

Based on the ANOVA table for the linear regression (F(1, 629) = 0.266, p = 0.606), there was no relationship between the dependent variable "socioeconomic index" and the independent variable "age". Since the probability of the F statistic (p = 0.606) was greater than the level of significance (0.05), the null hypothesis that the correlation coefficient (R) was equal to 0 was not rejected. The research hypothesis that there was a relationship between the variables was not supported.

Steps in solving Linear Regression Hypothesis

Test Problems - 1

The following is a guide to the decision process for answering homework problems about the Linear Regression Hypothesis Test:

Are the dependent and independent variables ordinal or interval level?
- No: Incorrect application of a statistic.
- Yes: Make sure that the assumption that the distributional requirements for linear regression are satisfied is made. Otherwise, you have to check the assumption first. Our regression problems will assume that the assumptions are met.

Steps in solving Linear Regression Hypothesis

Test Problems - 2

Conduct the linear regression analysis.

Is the p-value in the ANOVA table for the F ratio test < alpha?
- No: False.
- Yes: Go on to the next question.

Is the interpretation of the strength of the correlation coefficient correct?
- No: False.
- Yes: Go on to the next question.

Steps in solving Linear Regression Hypothesis

Test Problems - 3

Is the direction of the relationship correctly stated?
- No: False.
- Yes: Go on to the next question.

Is either of the variables ordinal level?
- No: True.
- Yes: True with caution.
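
The decision process across these three slides can be summarized as one function. This is a sketch with hypothetical names, not part of the course materials; it treats a p-value less than or equal to alpha as significant, following the wording used in the worked practice problems, and takes the correctness of the stated strength and direction as yes/no inputs judged by the student.

```python
def evaluate_statement(dep_level, indep_level, p_value,
                       strength_correct, direction_correct, alpha=0.05):
    """Walk the homework decision steps and return the answer to the problem."""
    allowed = {"ordinal", "interval"}
    if dep_level not in allowed or indep_level not in allowed:
        return "Incorrect application of a statistic"
    if p_value > alpha:            # no support for a relationship
        return "False"
    if not strength_correct:       # strength mischaracterized
        return "False"
    if not direction_correct:      # direction misstated
        return "False"
    if "ordinal" in (dep_level, indep_level):
        return "True with caution"
    return "True"

# Practice Problem 1: the dependent variable is nominal.
print(evaluate_statement("nominal", "interval", 0.01, True, True))
# Practice Problem 2: significant, but "moderate" should have been "weak".
# 0.0005 is a stand-in for the reported p < 0.001.
print(evaluate_statement("ordinal", "ordinal", 0.0005, False, True))
# Practice Problem 3: significant, strength fine, direction stated backwards.
print(evaluate_statement("interval", "interval", 0.002, True, False))
# Practice Problem 4: p = 0.606, so the relationship is not supported.
print(evaluate_statement("interval", "interval", 0.606, True, True))
```

The first call returns "Incorrect application of a statistic" and the other three return "False", matching the conclusions reached in the worked problems.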