1
CORRELATION & REGRESSION
2
Correlation vs. Regression
  • This chapter will speak of both correlations and
    regressions.
  • Both use similar mathematical procedures to
    provide a measure of relation: the degree to
    which two continuous variables vary together . . .
    or covary.
  • The term regression is used when 1) one of the
    variables is a fixed variable, and 2) the end
    goal is to use the measure of relation to predict
    values of the random variable based on values of
    the fixed variable.

3
Correlation vs. Regression
  • Examples
  • In this class, height and ratings of physical
    attractiveness (both random variables) vary
    across individuals. We could ask, "What is the
    correlation between height and these ratings in
    our class?" Essentially, we are asking: "As
    height increases, is there any systematic
    increase (positive correlation) or decrease
    (negative correlation) in one's rating of their
    own attractiveness?"

4
Correlation vs. Regression
  • Examples
  • Alternatively, we could do an experiment in
    which the experimenter compliments a subject on
    their appearance one to eight times prior to
    obtaining a rating (note that the number of
    compliments is a fixed variable). We could now
    ask: "Can we predict a person's rating of their
    attractiveness, based on the number of
    compliments they were given?"

5
Scatterplots
  • The first way to get some idea about a possible
    relation between two variables is to do a
    scatterplot of the variables.
  • Let's consider the first example discussed
    previously, where we were interested in the
    possible relation between height and ratings of
    physical attractiveness.

6
Scatterplots
  • The following is a sample of the data from our
    class as it pertains to this issue

7
Scatterplots
  • We can create a scatterplot of these data by
    simply plotting one variable against the other.
  • For these data, the correlation is 0.146235, or
    about 0.15.

8
Scatterplots
  • Correlations range from -1 (perfect negative
    relation) through 0 (no relation) to 1 (perfect
    positive relation).
  • We'll see exactly how to calculate these in a
    moment, but the scatterplots would look like. . .

9
Scatterplots
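(The example scatterplot images from this slide were lost in the transcript.) As a stand-in, here is a minimal Python sketch, assuming NumPy and matplotlib, that draws scatterplots for several target correlations; generating y as r·x plus √(1 − r²)·noise gives an expected correlation of r:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Generating y = r*x + sqrt(1 - r^2)*noise yields an expected
# correlation of r between x and y.
fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, r in zip(axes, (-1.0, -0.5, 0.0, 0.5, 1.0)):
    x = rng.standard_normal(100)
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(100)
    ax.scatter(x, y, s=10)
    ax.set_title(f"r = {r:+.1f}")
plt.show()
```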
10
Covariance
  • The first step in calculating a correlation
    coefficient is to quantify the covariance
    between two variables.
  • For the sake of an example, consider the height
    and weight variables from our class data set. . .

11
Covariance
  • We'll just focus on the first 12 subjects' data
    for now.

12
Covariance
  • The covariance of these variables is computed as:

    cov_XY = Σ(X − X̄)(Y − Ȳ) / (N − 1)
13
Covariance
  • The covariance formula should look familiar to
    you: if all the Ys were exchanged for Xs, the
    covariance formula would be the variance formula.
  • Note what this formula is doing, however: it is
    capturing the degree to which pairs of points
    systematically vary around their respective means.

14
Covariance
  • If paired X and Y values tend to both be above or
    below their means at the same time, this will
    lead to a high positive covariance.
  • However, if the paired X and Y values tend to be
    on opposite sides of their respective means, this
    will lead to a high negative covariance.
  • If there are no systematic tendencies of the sort
    mentioned above, the covariance will tend toward
    zero.

15
Covariance
  • To make life easier, there is also a
    computationally more workable version of the
    covariance formula:

    cov_XY = [ΣXY − (ΣX)(ΣY)/N] / (N − 1)
16
Covariance
  • For our height versus weight example, then:
  • The covariance itself gives us little info about
    the relation we are interested in, because it is
    sensitive to the standard deviations of X and Y.
    It must be transformed (standardized) before it
    is useful. Hence. . . .

17
The Pearson Product-Moment Correlation
Coefficient (r)
  • The Pearson Product-Moment Correlation
    Coefficient, r, is computed simply by
    standardizing the covariance estimate as follows:

    r = cov_XY / (s_X s_Y)

  • This results in r values ranging from -1.0 to
    +1.0, as discussed earlier.
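A quick numerical check of this standardization, using the same hypothetical stand-in data (NumPy assumed):

```python
import numpy as np

def pearson_r(x, y):
    """r = cov_XY / (s_X * s_Y), every term using N - 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

height = [61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74]
weight = [104, 110, 125, 120, 135, 130, 140, 150, 148, 155, 160, 175]

print(pearson_r(height, weight))          # standardized covariance
print(np.corrcoef(height, weight)[0, 1])  # NumPy agrees
```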

18
The Pearson Product-Moment Correlation
Coefficient (r)
  • So, if we apply this to the example used
    earlier...

19
Adjusted r
  • Unfortunately, the r we measure using our sample
    is not an unbiased estimator of the population
    correlation coefficient ρ (rho).
  • We can correct for this using the adjusted
    correlation coefficient, which is computed as
    follows:

    r_adj = √[1 − (1 − r²)(N − 1) / (N − 2)]
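A small sketch of the adjustment (NumPy assumed). Note that for small |r| and small N the quantity under the square root can go negative; clamping it at zero is a convention adopted here, not something stated on the slide. The r = 0.60, N = 12 inputs are hypothetical:

```python
import numpy as np

def adjusted_r(r, n):
    """r_adj = sqrt(1 - (1 - r^2)(N - 1)/(N - 2)).

    For small |r| and small N the radicand can go negative; it is
    clamped at zero here (a convention, not from the slide), and
    the sign of r is preserved.
    """
    radicand = 1 - (1 - r**2) * (n - 1) / (n - 2)
    return np.copysign(np.sqrt(max(radicand, 0.0)), r)

# Hypothetical values: r = 0.60 from a sample of N = 12.
print(adjusted_r(0.60, 12))  # ~0.544, pulled in toward zero
```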

20
Adjusted r
  • So, for our example

21
The Regression Line
  • Often scatterplots will include a regression
    line that overlays the points in the graph

22
The Regression Line
  • The regression line represents the best
    prediction of the variable on the Y axis (Weight)
    for each point along the X axis (Height).
  • For example, my (Marty's) data is not depicted in
    the graph. But if I tell you that I am about 72
    inches tall, you can use the graph to predict my
    weight.

23
The Regression Line
  • Going back to your high school days, you perhaps
    recall that any straight line can be depicted by
    an equation of the form:

    Ŷ = bX + a

  • where Ŷ is the predicted value of Y,
  • b is the slope of the line, and
  • a is the intercept.

24
The Regression Line
  • Since the regression line is supposed to be the
    line that provides the best prediction of Y,
    given some value of X, we need to find values of
    a and b that produce a line that will be the
    best-fitting linear function (i.e., the predicted
    values of Y will come as close as possible to the
    obtained values of Y).

25
The Regression Line
  • The first thing we need to do when finding this
    function is to define what we mean by best.
  • Typically, the approach we take is to assume that
    the best regression line is the one that minimizes
    errors in prediction, which are mathematically
    defined as the difference between the obtained and
    predicted values of Y:
  • (Y − Ŷ)
  • This difference is typically termed the residual.

26
The Regression Line
  • For reasons similar to those involved in
    computations of variance, we cannot simply
    minimize Σ(Y − Ŷ), because that sum will equal
    zero for any line passing through the point
    (X̄, Ȳ).
  • Instead, we must minimize Σ(Y − Ŷ)².

27
The Regression Line
  • At this point, the textbook goes through a bunch
    of mathematical stuff showing you how to solve
    for a and b by substituting the equation for a
    line in for Ŷ in the expression Σ(Y − Ŷ)², and
    then minimizing the result.
  • You don't have to know any of that, just the
    result, which is:

    b = cov_XY / s²_X        a = Ȳ − bX̄
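A minimal numerical check of these two results against a least-squares fit (NumPy assumed, same hypothetical stand-in data):

```python
import numpy as np

# Hypothetical stand-in data for the class sample.
height = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74], float)
weight = np.array([104, 110, 125, 120, 135, 130, 140, 150, 148, 155, 160, 175], float)

cov_xy = np.sum((height - height.mean()) * (weight - weight.mean())) / (len(height) - 1)
b = cov_xy / height.var(ddof=1)        # slope: b = cov_XY / s_X^2
a = weight.mean() - b * height.mean()  # intercept: a = Ybar - b*Xbar

print(b, a)
print(np.polyfit(height, weight, 1))   # least-squares fit agrees: [b, a]
```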

28
The Regression Line
  • For our height versus weight example, b = 2.36
    and a = -28.26 (you should confirm these for
    yourself -- as a check of your understanding).
  • Thus, the regression line for our data is:
  • Ŷ = 2.36X + (-28.26)
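To make the prediction example from slide 22 concrete: for a height of 72 inches, the line gives Ŷ = 2.36(72) + (−28.26) = 169.92 − 28.26 = 141.66, i.e. a predicted weight of about 142 (presumably pounds, though the transcript never states the units).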

29
Residual (or error) variance
  • Once we have obtained a regression line, the next
    issue concerns how well the regression line
    actually fits the data.
  • Analogous to how we calculated the variance
    around a mean, we can calculate the variance
    around a regression line, termed the residual
    variance or error variance and denoted as
    s²_(Y·X), in the following manner:

    s²_(Y·X) = Σ(Y − Ŷ)² / (N − 2)

30
Residual (or error) variance
  • This equation uses N − 2 in the denominator because
    two degrees of freedom were lost when computing Ŷ
    (calculating a and b).
  • The square root of this term is called the
    standard error of the estimate and is denoted as
    s_(Y·X):

    s_(Y·X) = √[Σ(Y − Ŷ)² / (N − 2)]
31
Residual (or error) variance
  • 1) The hard (but logical) way: compute Ŷ for each
    X, square and sum the residuals (Y − Ŷ), and
    divide by N − 2.
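(The worked computation from this slide and the next was lost in the transcript.) A sketch of the hard way in Python (NumPy assumed, hypothetical stand-in data):

```python
import numpy as np

# Hypothetical stand-in data for the class sample.
height = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74], float)
weight = np.array([104, 110, 125, 120, 135, 130, 140, 150, 148, 155, 160, 175], float)
n = len(weight)

b, a = np.polyfit(height, weight, 1)  # slope and intercept
y_hat = b * height + a                # predicted Y for each X

resid_var = np.sum((weight - y_hat) ** 2) / (n - 2)  # s^2_(Y.X)
print(resid_var)           # residual (error) variance
print(np.sqrt(resid_var))  # standard error of the estimate
```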

32
Residual (or error) variance
33
Residual (or error) variance
  • 2) The easy (but don't ask me why it works) way.
  • In another feat of mathematical wizardry, the
    textbook shows how you can go from the formula
    above to the following (easier to work with)
    formula:

    s²_(Y·X) ≈ s²_Y (1 − r²)

34
Residual (or error) variance
  • 2) The easy (but don't ask me why it works) way.
  • If we use the non-corrected value of r, we
    should get the same answer as when we used the
    hard way:

35
Residual (or error) variance
  • 2) The easy (but don't ask me why it works) way.
  • The difference is due partially to rounding
    errors, but mostly to the fact that this easy
    formula is actually an approximation that assumes
    large N. When N is small, the approximation
    under-estimates the actual value by a factor of
    (N − 1)/(N − 2): multiplying the easy result by
    (N − 1)/(N − 2) recovers the hard-way value.
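The relation between the two formulas can be checked numerically. In this sketch (NumPy assumed, hypothetical stand-in data), the easy value times (N − 1)/(N − 2) reproduces the hard-way value:

```python
import numpy as np

# Hypothetical stand-in data for the class sample.
height = np.array([61, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74], float)
weight = np.array([104, 110, 125, 120, 135, 130, 140, 150, 148, 155, 160, 175], float)
n = len(weight)

b, a = np.polyfit(height, weight, 1)
hard = np.sum((weight - (b * height + a)) ** 2) / (n - 2)  # exact

r = np.corrcoef(height, weight)[0, 1]
easy = weight.var(ddof=1) * (1 - r**2)  # large-N approximation

# The easy value times (N - 1)/(N - 2) matches the hard value
# (up to floating-point error).
print(hard, easy, easy * (n - 1) / (n - 2))
```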

36
Hypothesis Testing
  • The text discusses a number of hypothesis testing
    situations relevant to r and b, and gives the
    test to be performed in each situation.
  • I only expect you to know how to test 1) whether
    a computed correlation coefficient is
    significantly different from zero, and 2) whether
    two correlations are significantly different.

37
Hypothesis Testing
  • One of the most common reasons one examines
    correlations is to see if two variables are
    related.
  • If the computed correlation coefficient is
    significantly different from zero, that suggests
    that there is a relation ... the sign of the
    correlation describes exactly what that relation
    is.

38
Hypothesis Testing
  • To test whether some computed r is significantly
    different from zero, you first compute the
    following t-value:

    t = r √(N − 2) / √(1 − r²)

  • That value is then compared to a critical t with
    N − 2 degrees of freedom.
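A sketch of this test in Python (NumPy and SciPy assumed). The r = 0.15, N = 12 inputs are hypothetical, borrowed from the earlier class example rather than the height/weight data, whose r is not reproduced here; note that stats.t.ppf(0.975, 10) reproduces the tcrit(10) = 2.23 quoted on the next slide:

```python
import numpy as np
from scipy import stats

def r_significance(r, n, alpha=0.05):
    """Two-tailed test of H0: rho = 0 via t = r*sqrt(N-2)/sqrt(1-r^2)."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t, t_crit, abs(t) > t_crit

# Hypothetical inputs: r = 0.15 with N = 12 (so df = 10).
print(r_significance(0.15, 12))  # t ~ 0.48 < t_crit ~ 2.23: retain H0
```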

39
Hypothesis Testing
  • If the obtained t is more extreme than the
    critical t, then you can reject the null
    hypothesis (that the variables are not related).
  • For our height versus weight example:
  • tcrit(10) = 2.23; therefore we cannot reject H0.

40
Hypothesis Testing
  • Testing for a difference between two independent
    rs (i.e., does r₁ − r₂ = 0?) turns out to be a
    trickier issue, because the sampling distribution
    of the difference of two r values is not normally
    distributed.
  • Fisher has shown that this problem can be
    compensated for by first transforming each of the
    r values using the following formula:

    r′ = (1/2) ln[(1 + r) / (1 − r)]

41
Hypothesis Testing
  • This leads to an r′ value that is approximately
    normally distributed and whose standard error is
    given by the following formula:

    s_r′ = 1 / √(N − 3)

42
Hypothesis Testing
  • Given all this, one can test for the difference
    between two independent rs using the following
    z-test:

    z = (r′₁ − r′₂) / √[1/(N₁ − 3) + 1/(N₂ − 3)]

43
Hypothesis Testing
  • So, the steps we have to go through to test the
    difference of two independent rs are:
  • 1) compute both r values.
  • 2) transform both r values.
  • 3) get a z value based on the formula above.
  • 4) find the probability associated with that z
    value.
  • 5) compare the obtained probability to alpha
    divided by two (or alpha if doing a one-tailed
    test).
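A minimal sketch of these five steps in Python (NumPy and SciPy assumed; the two r and N values are hypothetical):

```python
import numpy as np
from scipy import stats

def fisher_z(r):
    """Step 2: r' = 0.5 * ln((1 + r)/(1 - r)) (Fisher's transform)."""
    return 0.5 * np.log((1 + r) / (1 - r))

def compare_correlations(r1, n1, r2, n2, alpha=0.05):
    # Step 3: z for the difference of two independent correlations.
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    # Step 4: probability of a z at least this extreme (one tail).
    p_one_tail = stats.norm.sf(abs(z))
    # Step 5: compare to alpha/2 for a two-tailed test.
    return z, p_one_tail, p_one_tail < alpha / 2

# Hypothetical inputs: r = .60 (N = 40) versus r = .30 (N = 50).
print(compare_correlations(0.60, 40, 0.30, 50))
```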

44
Hypothesis Testing