Correlation and regression
1
Correlation and regression
  • Friday, February 24th 2006

2
Outline
  • Lines: intercept and gradient
  • Correlation
  • Line fitting
  • The correlation coefficient: Pearson's r
  • Regression
  • What is it?
  • Least squares
  • Testing the model
  • Example: SPSS output
  • Multiple Regression
  • Coefficients
  • Effect Size
  • Assumptions
  • Transforming variables
  • Interactions
  • Outliers

3
Looking at the relationship between two interval-ratio variables
(You can use ordinal variables if you adjust them to represent rank order.)
  • When we want to know how two variables are related to one another, the pattern of the data points on a scatterplot can illustrate various patterns and relationships, including
  • data correlation
  • positive or direct relationships between
    variables
  • negative or inverse relationships between
    variables
  • non-linear patterns

[Example scatterplot showing the relationship between Grip Strength and Arm Strength]
4
Thinking about lines: what can we measure?
  • Gradient: a measure of how the line slopes
  • Intercept: where the line cuts the y-axis
  • Correlation: a measure of how well the line fits the data

Equation for a line: y = a + bx
a is the point at which the line crosses the y-axis (when x = 0). b is a measure of the slope (the amount of change in y that occurs with a 1-unit change in x).
[Graph: the line y = 1.5 + 0.5x, plotted with x and y axes running from 0 to 5]
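As a minimal sketch (the function and variable names are mine, not from the slides), the example line y = 1.5 + 0.5x can be evaluated in Python:

    def line(x, a=1.5, b=0.5):
        # y = a + b*x: a is the intercept, b is the gradient
        return a + b * x

    print(line(0))  # 1.5 -- the intercept, where the line cuts the y-axis
    print(line(1))  # 2.0 -- each 1-unit step in x adds b = 0.5 to y
    print(line(2))  # 2.5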
5
Same Intercept, Different Gradient
For each line y = 35 + bx, where b varies
6
Same Gradient, Different Intercept
For each line y = a + 2.5x, where a varies
7
In Groups
  • Draw pictures of the following lines:
  • y = 2 + 3x
  • y = -2 + x
  • y = 4 - 2x
  • y = 3 - 0.5x

Write equations for the lines shown on the slide [graphs not reproduced in this transcript]
8
Linear relationship
  • The technique of line-fitting, known as regression, is used to measure how well a line fits a scatter of data points.
  • When the data points form a straight line on the
    graph, the linear relationship between the
    variables is stronger and the correlation is
    higher.
  • The following scatterplot shows a strong linear
    relationship between the two variables.
  • We say that these two variables are highly
    correlated.

9
Positive and negative relationships
  • Positive or direct relationships
  • If the points cluster around a line that runs
    from the lower left to upper right of the graph
    area, then the relationship between the two
    variables is positive or direct.
  • An increase in the value of x tends to be associated with an increase in the value of y.
  • The closer the points are to the line, the
    stronger the relationship.
  • Negative or inverse relationships
  • If the points tend to cluster around a line that
    runs from the upper left to lower right of the
    graph, then the relationship between the two
    variables is negative or inverse.
  • An increase in the value of x tends to be associated with a decrease in the value of y.
  • The closer the points are to the line, the
    stronger the relationship.

10
There are lots of online sites where you can
explore this topic
  • Three examples
  • http://argyll.epsb.ca/jreed/math9/strand4/scatterPlot.htm
  • This site lets you produce your own scatter
    plot, produce a line of best fit, practice
    interpolating data points on the line, and look
    at the correlation coefficient.
  • http://www.stat.berkeley.edu/stark/Java/Html/Correlation.htm
  • This site lets you alter a scatter plot and add
    your own points, see the point of averages,
    standard deviation lines, and correlation
    coefficient as well as plot the regression line
    and more.
  • http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
  • This site allows you to guess correlations.
  • You can also take a look at Chapter 8 of
    Statistics for the Terrified.

11
Working out the correlation coefficient
(Pearson's r)
  • Pearson's r tells us how much one variable changes as the values of another change: their covariation.
  • Variation is measured with the standard deviation, which measures the average variation of each variable from the mean of that variable.
  • Covariation is measured by calculating the amount by which each value of x varies from the mean of x and the amount by which each value of y varies from the mean of y, multiplying the differences together, and finding the average (by dividing by n - 1).
  • Pearson's r is calculated by dividing this covariation by (SD of x) × (SD of y) in order to standardize it.

12
Working out the correlation coefficient
(Pearson's r)
  • This can also be calculated as the average of the products of the standardized values of x and y (written as a formula below).
  • r will always fall between -1 and +1.
  • A correlation of either 1 or -1 means perfect
    association between the two variables.
  • A correlation of 0 means that there is no
    association.
  • Note: correlation does not mean causation. We can only determine causation by reference to our theory. However (thinking about it the other way round), there is unlikely to be causation if there is no correlation.
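Putting this and the previous slide together, Pearson's r can be written compactly (standard notation; the slides describe it in words):

    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}
      = \frac{1}{n - 1} \sum_{i=1}^{n} z_{x_i} z_{y_i}

where s_x and s_y are the sample standard deviations and z denotes a standardized score.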

13
Worked Example
Average of x = 4, SD = 2; average of y = 7, SD = 4
[Table: five (x, y) data points, with columns for the standardized scores of x and y]
Note: reminder of how to standardize scores: z = (value - mean) / SD

14
Worked Example
Average of x = 4, SD = 2; average of y = 7, SD = 4
[Table: the same data, with a column for the product of each pair of standardized scores]
Note: reminder of how to standardize scores: z = (value - mean) / SD

15
Worked Example
Average of x = 4, SD = 2; average of y = 7, SD = 4
Products of the standardized scores: 0.75, -0.25, 0, -0.75, 2.25; sum = 2.00
Note: reminder of how to standardize scores: z = (value - mean) / SD

Divide by n - 1: 2.00 / (5 - 1) = 2 / 4 = 0.5, so r = 0.5
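The same recipe in a short Python sketch. The data below are made up for illustration; they are not the five points from the slide's table (which the transcript does not preserve):

    # Pearson's r from first principles: standardize, multiply, average over n - 1.
    x = [1.0, 3.0, 4.0, 5.0, 7.0]   # hypothetical values
    y = [3.0, 9.0, 5.0, 7.0, 11.0]  # hypothetical values

    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n

    # Sample standard deviations (divide by n - 1)
    sd_x = (sum((v - mean_x) ** 2 for v in x) / (n - 1)) ** 0.5
    sd_y = (sum((v - mean_y) ** 2 for v in y) / (n - 1)) ** 0.5

    # Standardize each value: z = (value - mean) / SD
    zx = [(v - mean_x) / sd_x for v in x]
    zy = [(v - mean_y) / sd_y for v in y]

    # r = sum of the products of the standardized scores, divided by n - 1
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    print(round(r, 3))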
16
Explained Variation
  • Pearson's r measures the strength of association between two variables.
  • It does not tell you how much of variable y is explained by variable x. To get this you need to calculate r². This is known as the coefficient of determination.
  • In this example r² = 0.5 × 0.5 = 0.25. Therefore 25% of the variation in y is explained by x.

17
What is Regression?
  • A way of predicting the value of one variable
    from another.
  • It is a hypothetical model of the relationship
    between two variables.
  • The model used is a linear one.
  • Therefore, we describe the relationship using the
    equation of a straight line.

18
How the correlation coefficient describes a
linear relationship
  • The regression line for y on x estimates the
    average value for y corresponding to each value
    of x
  • The regression line always goes through the point
    of averages (the point that contains the average
    y score and the average x score)
  • Associated with each increase of one SD of x
    there is an increase of r SDs in y, on the
    average.

[Diagram: the regression estimate. The regression line passes through the point of averages; associated with each one-SD step along x (SDx) is a rise of r × SDy in y.]
19
Regression and the description of a Straight Line
  • b_i
  • Regression coefficient for the predictor
  • Gradient (slope) of the regression line
  • Direction/Strength of Relationship
  • a
  • Intercept (value of Y when X = 0)
  • Point at which the regression line crosses the Y-axis (ordinate)
  • ε_i
  • Unexplained error.
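Assembled into one equation, these components give the standard simple regression model:

    Y_i = a + b X_i + \varepsilon_i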

20
The Method of Least Squares
Why is this line a better summary of the data than a line which is marginally steeper or shallower, or which is a millimetre or two further up the page? In fact, the line has been chosen in such a way that the sum of the squares of the vertical distances between the points and the line is minimised. As we have seen earlier in the module, squaring differences has the advantage of making positive and negative differences equivalent.
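A minimal least-squares sketch in Python, using the standard closed-form formulas for the slope and intercept (the function name and data are mine, not from the slides):

    # Fit y = a + b*x by least squares:
    # b = sum((x - mean_x)*(y - mean_y)) / sum((x - mean_x)^2), a = mean_y - b*mean_x
    def least_squares(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
        a = mean_y - b * mean_x
        return a, b

    # Hypothetical data: any steeper, shallower, or shifted line would give a
    # larger sum of squared vertical distances than the (a, b) returned here.
    a, b = least_squares([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8])
    print(a, b)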
21
How Good is the Model?
  • The regression line is only a model based on the
    data.
  • This model might not reflect reality.
  • We need some way of testing how well the model
    fits the observed data.
  • How?

22
Sum of Squares
  • SST
  • Total variability (variability between the scores and the mean).
  • SSR
  • Residual/error variability (variability between the regression model and the actual data).
  • SSM
  • Model variability (difference in variability between the model and the mean).
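In symbols, with \hat{y}_i the model's prediction for case i (standard definitions):

    SS_T = \sum_i (y_i - \bar{y})^2 \qquad
    SS_R = \sum_i (y_i - \hat{y}_i)^2 \qquad
    SS_M = \sum_i (\hat{y}_i - \bar{y})^2

and the three are linked by SS_T = SS_M + SS_R.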

23
Testing the Model: ANOVA
  • If the model results in better prediction than
    using the mean, then we expect SSM to be much
    greater than SSR

24
Testing the Model: ANOVA
  • Mean Squared Error
  • Sums of Squares are total values.
  • They can be expressed as averages, obtained by dividing each sum of squares by its degrees of freedom. These are called Mean Squares (MS).
  • If you know F you can check whether the model is significantly better at predicting the dependent variable than chance alone.
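In symbols, with df_M and df_R the model and residual degrees of freedom:

    MS_M = \frac{SS_M}{df_M} \qquad MS_R = \frac{SS_R}{df_R} \qquad F = \frac{MS_M}{MS_R}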

25
Testing the Model: R²
  • R²
  • The proportion of variance accounted for by the regression model (you can transform R² into a percentage).
  • The Pearson correlation coefficient squared.
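Equivalently, in terms of the sums of squares defined earlier:

    R^2 = \frac{SS_M}{SS_T}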

26
Regression: An Example
27
SPSS output showing the F ratio
If the improvement due to fitting the model is much greater than the inaccuracy within the model, then the value of F will be greater than 1. In this instance the value of F is 99.587. SPSS tells us that the probability of obtaining this value of F by chance is very low (p < .001).
Note: Mean Square = Sum of Squares / df; F = MS regression / MS residual
28
SPSS output showing R2
In this instance the model explains 33.5% of the variation in the dependent variable.
29
SPSS Output: Model Parameters
30
Produce your own regression equations at the
following site
  • http://people.hofstra.edu/faculty/Stefan_Waner/newgraph/regressionframes.html
  • Let's discover the equation that relates age to number of jobs ever held (assuming that there is one).
  • y = number of jobs ever held
  • x = age
  • Using the equation that we've got from this site:
  • How many jobs would you predict that someone who is 25 would have ever held?
  • What about someone who is 40?
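In Python, the prediction step is just substitution into the line's equation. The coefficients below are placeholders, not the values fitted on the website:

    # Hypothetical equation y = a + b*x relating age (x) to jobs ever held (y).
    # a and b are made-up placeholders; use the values the site gives you.
    a, b = 0.5, 0.2

    def predict_jobs(age):
        return a + b * age

    print(predict_jobs(25))  # predicted jobs ever held at age 25
    print(predict_jobs(40))  # predicted jobs ever held at age 40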

31
Multiple Regression: when there is more than one independent variable
  • b_1
  • Regression coefficient for the first predictor, controlling for the other predictors
  • Direction/Strength of Relationship
  • b_2
  • Regression coefficient for the second predictor, controlling for the other predictors
  • Direction/Strength of Relationship
  • b_n
  • Regression coefficient for the nth predictor, controlling for the other predictors
  • Direction/Strength of Relationship
  • a
  • Intercept (value of Y when X_1, X_2, ..., X_n are all 0)
  • Point at which the regression line crosses the Y-axis (ordinate)
  • ε_i
  • Unexplained error.
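Written out in full, the model these components describe is:

    Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_n X_{ni} + \varepsilon_i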

32
Multiple regression: an example
33
SPSS Output Example Coefficients
  • This is a regression of usual gross monthly pay on the number of hours a respondent works, their age, and whether or not they have a degree.
  • The coefficients for each independent variable
    show the effect of that variable holding all
    other variables in the model constant.
  • Questions
  • How much would you expect a 30 year old to earn
    if they do not have a degree and work 36 hours a
    week?
  • How much would you expect a 60 year old to earn
    if they have a degree and work 20 hours a week?
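Answering these questions is substitution into the fitted equation. The coefficients below are placeholders, since the SPSS coefficient table itself is not reproduced in this transcript:

    # pay = a + b_hours*hours + b_age*age + b_degree*degree, degree coded 1/0.
    # All four values are made-up placeholders, not the SPSS estimates.
    a, b_hours, b_age, b_degree = 200.0, 25.0, 10.0, 400.0

    def predicted_pay(hours, age, degree):
        # Each coefficient is the effect of its variable holding the others constant.
        return a + b_hours * hours + b_age * age + b_degree * degree

    print(predicted_pay(hours=36, age=30, degree=0))  # 30-year-old, no degree
    print(predicted_pay(hours=20, age=60, degree=1))  # 60-year-old with a degree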

34
SPSS Output Example: effect size
  • Standardized coefficients enable us to measure the different effect sizes of different independent variables, i.e. to answer the question 'Does a person's age or whether they have a degree make more difference to their pay?'
  • We cannot use the unstandardized coefficients to make this comparison because each variable is measured in different units, i.e. a degree is coded 1 or 0 (you have one or not), age ranges from 16 to 90 (relatively evenly spread out), and hours of work from 0 to 100 (but bunched up between 20 and 50).
  • Standardized coefficients measure the number of standard deviations of change in the dependent variable (gross pay) produced by one standard deviation of change in each independent variable.
  • Since we are now comparing like with like, we can determine whether a one-standard-deviation change in respondent's age has more or less effect than a one-standard-deviation change in having a degree.
  • As the standardized coefficient for holding a degree is larger than the standardized coefficient for age, we can say that this variable has a larger effect. However, the number of hours worked has the largest effect.
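For reference, a standardized coefficient can be obtained from the unstandardized one by rescaling with the standard deviations of the predictor and the outcome (a standard formula, described on the slide only in words):

    \beta = b \times \frac{s_x}{s_y}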

35
Checking Assumptions: Checking Residuals
Linearity: this assumption is that there is a straight-line relationship between the independent and dependent variables (n.b. if there is not, it may be possible to make it linear by transforming one or more variables).
Homoscedasticity: this assumption means that the variance around the regression line is the same for all values of the independent variable(s).
36
Example: Transforming variables
  • Some variables have a non-linear effect. If this is the case you may be able to transform them in such a way as to model their effect.
  • A common way of transforming variables is to square them. If you include just a squared term the effect will be exponential (getting rapidly larger). If you include both the squared term and the original item you can explore a curvilinear relationship.
  • Example: the curvilinear effect of age on income.
  • I suspect that age actually has a curvilinear effect on income, i.e. that people initially earn more as they get older (so age has a positive effect) but that eventually this evens out and then perhaps declines.
  • In order to explore this I will do the same regression as above but will include both age and age² as independent variables. (I calculate age² using the compute function in SPSS, computing age2 = age × age.)

37
Example: Transforming variables
As you can see, the coefficient for age is large and positive and the coefficient for age² is small and negative. This means that the combined effect of age for someone who is 25 is 64.596 × age - 0.732 × age² = 64.596 × 25 - 0.732 × 25² = 1,614.90 - 457.50 = 1,157.40.
What is the combined effect of age for someone who is 45? 65? What happens to the size of the squared term in comparison to the original term? (The sketch below checks the arithmetic.)
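A quick check of this arithmetic in Python, using the coefficients quoted on the slide (64.596 for age, -0.732 for age²):

    # Combined effect of age in the curvilinear model: 64.596*age - 0.732*age^2
    def age_effect(age, b_age=64.596, b_age2=-0.732):
        return b_age * age + b_age2 * age ** 2

    for age in (25, 45, 65):
        print(age, round(age_effect(age), 2))
    # At 25 this reproduces the slide's 1,157.40. As age grows, the squared
    # term catches up with the linear term, so the combined effect levels off
    # and then declines.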
38
Example: Interactions
It occurred to me that age may have a different effect on people with higher levels of education than on those with lower educational levels. In order to investigate this I decided to look at the interaction of age and degree: I created a new variable, agedegree = age × degree. This will be equal to age for those with a degree (scored 1) and will equal zero for those without a degree (scored 0). This means that the combined effect of age for someone who is 25 and has a degree is (64.323 + 7.703) × age - 0.735 × age² = 72.026 × 25 - 0.735 × 25² = 1,800.65 - 459.38 = 1,341.28. Substantively this means that age has a stronger (positive) influence on pay when people have a degree.
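The same check for the interaction model, where 7.703 (the agedegree coefficient) contributes only when degree = 1:

    # Combined effect of age with an age-by-degree interaction, degree coded 1/0:
    # (b_age + b_inter*degree)*age + b_age2*age^2
    def age_effect(age, degree, b_age=64.323, b_inter=7.703, b_age2=-0.735):
        return (b_age + b_inter * degree) * age + b_age2 * age ** 2

    print(age_effect(25, degree=1))  # about 1,341.28, matching the slide
    print(age_effect(25, degree=0))  # smaller: no interaction boost without a degree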
39
Note: The effect of outliers
Because the regression line minimizes the squared differences between the points and the line, outliers can have a very large effect (their squared distance to the line makes a big difference). This is why it is sometimes advisable to run the regression analysis omitting outliers.
40
Next Week
  • Regression requires that your dependent variable
    is interval-ratio.
  • Next week we will look at logistic regression,
    which is similar to regression analysis (and
    produces similar looking equations and SPSS
    output), but is used where the dependent variable
    is dichotomous.