1
Research Methods of Applied Linguistics and
Statistics (11)
  • Correlation and multiple regression
  • By Qin Xiaoqing

2
Pearson Correlation
  • The Pearson correlation allows us to establish
    the strength of relationships between continuous
    variables.
  • To show the relationship, the first step is to
    draw a scatterplot or scattergram, which can
    help us to obtain a preliminary understanding of
    this relationship.
  • The scatterplot can be described in terms of
    direction, strength and linearity.

3
Correlation and SPSS
  • Pearson product-moment coefficient is designed
    for interval level (continuous) variables. It can
    also be used if you have one continuous variable
    (e.g., scores on a measure of self-esteem) and
    one dichotomous variable (e.g., sex M/F).
  • Spearman rank order correlation is designed for
    use with ordinal level or ranked data.
  • SPSS will calculate two types of correlation.
    First, it will give a simple bivariate
    correlation (which just means between two
    variables), also known as zero-order correlation.
    SPSS will also explore the relationship between
    two variables, while controlling for another
    variable. This is known as partial correlation.

4
Direction
  • Positive relationships represent relationships in
    which an increase in one variable is associated
    with an increase in a second.
  • Negative relationships represent relationships in
    which an increase in one variable is associated
    with a decrease in a second.

5
Strength
  • Strong relationships appear as those in which the
    dots are very close to a straight line.
  • Weak relationships appear as those in which the
    dots are more scattered about a straight line, or
    farther away from that line.

6
Linearity
  • Linear relationships are indicated when the
    pattern of dots on the scatter diagram appears to
    be straight, or if the points could be
    represented by drawing a straight line through
    them.

7
Steps for computation
  1. List the score for each S in parallel columns on
    a data sheet.
  2. Square each score and enter these values in the
    columns labeled X² and Y².
  3. Multiply each S's X and Y scores and enter this
    value in the XY column.
  4. Add the values in each column.
  5. Insert the totals into the formula for the
    correlation coefficient. (A minimal code sketch of
    these steps follows.)
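As an illustration, here is a minimal Python sketch of these five steps, using the data from the example on the next slide (the variable names are mine, not from the slides):

```python
# Step 1: the paired scores for the 10 Ss, listed in parallel columns.
X = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
Y = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]
N = len(X)

# Steps 2-4: square, multiply, and sum the columns.
sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)             # column X²
sum_y2 = sum(y * y for y in Y)             # column Y²
sum_xy = sum(x * y for x, y in zip(X, Y))  # column XY

# Step 5: insert the totals into the raw-score formula for r.
numerator = N * sum_xy - sum_x * sum_y
denominator = ((N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)) ** 0.5
r = numerator / denominator
print(round(r, 2))  # about .25 for these data
```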

8
Example
S       X    Y    X²    Y²    XY
1      12    8   144    64    96
2      10   12   100   144   120
3      11    5   121    25    55
4       9    8    81    64    72
5       8    4    64    16    32
6       7   13    49   169    91
7       7    7    49    49    49
8       5    3    25     9    15
9       4    8    16    64    32
10      3    5     9    25    15
Total  76   73   658   629   577
9
Scatterplot
10
Interpretation of scatterplot
  • Checking for outliers
  • Inspecting the distribution of data points
  • Are the data points spread all over the place?
    This suggests a very low correlation.
  • Are all the points neatly arranged in a narrow
    cigar shape? This suggests quite a strong
    correlation.
  • Could you draw a straight line through the main
    cluster of points, or would a curved line better
    represent the points? If a curved line is evident
    (suggesting a curvilinear relationship), then
    Pearson correlation should not be used.
  • What is the shape of the cluster? Is it even from
    one end to the other? Or does it start off narrow
    and then get fatter? If this is the case, the
    data may be violating the assumption of
    homogeneity of variance.
  • Determining the direction of the relationship
    between the variables

11
Formula of r for raw scores

\[ r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2]\,[N\sum Y^2 - (\sum Y)^2]}} \]

Totals from the example (N = 10): ΣX = 76, ΣY = 73, ΣX² = 658, ΣY² = 629, ΣXY = 577
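Plugging the totals into the formula (a worked check added here, not shown on the original slide):

\[ r = \frac{10(577) - (76)(73)}{\sqrt{[10(658) - 76^2]\,[10(629) - 73^2]}} = \frac{222}{\sqrt{804 \times 961}} \approx \frac{222}{879} \approx .25 \]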
12
Assumptions underlying Pearson correlation
  1. The data are measured as scores or ordinal scales
    that are truly continuous.
  2. The scores on the two variables, X and Y, are
    independent.
  3. The data should be normally distributed through
    their range.
  4. The relationship between X and Y must be linear.

13
Interpreting the correlation coefficient
  1. When r = .60, the variance overlap between the 2
    measures is .36.
  2. The overlap tells us that the 2 measures provide
    similar information. Or, the magnitude of r²
    indicates the amount of variance in X which is
    accounted for by Y, or vice versa.

14
Correlation coefficient
  • If you hope 2 tests measure basically the same
    thing, .71 isn't very strong; .80 or .90 may be
    desirable.
  • A correlation of .30 or lower may appear weak,
    but in educational research such a correlation
    might be very important.
  • Significance level: p < .05 or p < .01, with
    df = N − 2

15
  • r = .10 to .29 (or r = −.10 to −.29): small
  • r = .30 to .49 (or r = −.30 to −.49): medium
  • r = .50 to 1.0 (or r = −.50 to −1.0): large

16
Presenting the results from correlation
17
Comparing the correlation coefficients for two
groups
  • Sometimes when doing correlational research you
    may want to compare the strength of the
    correlation coefficients for two separate groups.

18
Factors affecting correlation
  • If you have a restricted range of scores on
    either of the variables, this will reduce the
    value of r, e.g., age (18–20) and success on an
    exam. (See the simulation sketch after this
    list.)
  • The existence of extreme outliers in the data.
  • The presence of extremely high and extremely low
    scores on a variable with little in the middle.
  • The reliability of the data.
  • A non-linear relationship. Always check the
    scatterplot, particularly if you obtain low
    values of r.
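To see the restricted-range effect concretely, here is a small simulation sketch in Python (illustrative only; the population correlation of about .6 is my assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.6 * x + 0.8 * rng.normal(size=5000)  # population r is about .6

full_r = np.corrcoef(x, y)[0, 1]
# Keep only a narrow band of x, mimicking a restricted age range.
mask = (x > -0.3) & (x < 0.3)
restricted_r = np.corrcoef(x[mask], y[mask])[0, 1]
print(round(full_r, 2), round(restricted_r, 2))  # restricted r is much smaller
```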

19
Correlation versus causality
  • Correlation provides an indication that there is
    a relationship between two variables; it does not,
    however, indicate that one variable causes the
    other. The correlation between two variables (A
    and B) could be due to the fact that A causes B,
    that B causes A, or (just to complicate matters)
    that an additional variable (C) causes both A and
    B. The possibility of a third variable that
    influences both of your observed variables should
    always be considered.

20
Statistical vs practical significance
  • Don't get too excited if your correlation
    coefficients are significant. With large
    samples, even quite small correlation
    coefficients can reach statistical significance.
    Although statistically significant, the practical
    significance of a correlation of .2 is very
    limited. You should focus on the actual size of
    Pearson's r and the amount of shared variance
    between the two variables. To interpret the
    strength of your correlation coefficient you
    should also take into account other research that
    has been conducted in your particular topic area.
    If other researchers in your area have only been
    able to predict 9 per cent of the variance (a
    correlation of .3) in a particular outcome (e.g.,
    anxiety), then your study that explains 25 per
    cent would be impressive in comparison. In other
    topic areas, 25 per cent of the variance
    explained may seem small and irrelevant.

21
Linear regression / Multiple regression
22
Understanding regression
  • Regression is a way of predicting performance on
    the dependent variable via one or more
    independent variables.
  • In simple regression, we predict scores on one
    variable on the basis of scores on a second.
  • In multiple regression, we expand the possible
    sources of prediction and test to see which of
    many variables and which combination of variables
    allow us to make the best prediction.

23
Linear regression
  • Regression and correlation are related
    procedures. The correlation coefficient is
    central to simple linear regression. While we
    can't make causal claims on the basis of
    correlation, we can use correlation to predict
    one variable from another.
  • We can't just throw variables into a multiple
    regression and hope that, magically, answers will
    appear.
  • We should have a sound theoretical or conceptual
    reason for the analysis and, in particular, for
    the order of variables entering the equation.

24
Uses of multiple regression
  • how well a set of variables is able to predict a
    particular outcome;
  • which variable in a set of variables is the best
    predictor of an outcome; and
  • whether a particular predictor variable is still
    able to predict an outcome when the effects of
    another variable are controlled for.

25
Assumptions of multiple regression
  • Sample size
  • Stevens (1996) recommends that for social
    science research, about 15 subjects per predictor
    are needed for a reliable equation.
  • Tabachnick and Fidell (1996, p. 132) give a
    formula for calculating sample size requirements,
    taking into account the number of independent
    variables that you wish to use: N > 50 + 8m
    (where m = number of independent variables). If
    you have five independent variables you would
    need 90 cases. (A quick helper appears after
    this list.)
  • More cases are needed if the dependent variable
    is skewed.
  • For stepwise regression there should be a ratio
    of forty cases for every independent variable.
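A trivial helper for the Tabachnick and Fidell rule of thumb (the function name is mine):

```python
def required_n(m):
    """Minimum N for m independent variables, per N > 50 + 8m."""
    return 50 + 8 * m

print(required_n(5))  # 90 cases, matching the example above
```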

26
  • Multicollinearity. It exists when the independent
    variables are highly correlated (r = .9 and
    above). Multiple regression doesn't like
    multicollinearity, and it certainly doesn't
    contribute to a good regression model, so always
    check for this problem before you start.
  • Outliers. Multiple regression is very sensitive
    to outliers (very high or very low scores).
  • Normality, linearity

27
MLAT and language learning
The closer r is to 1 (or −1), the smaller the error
will be in predicting performance on one variable
from that of the second. The smaller r is, the
greater the error.
28
Predicting scores using regression
  • Four pieces of information are needed. They are:
  • the mean for scores on one variable (X)
  • the mean for scores on the second variable (Y)
  • the S's score on X, and
  • the slope of the best-fitting straight line of
    the joint distribution.
  • With this information, we can predict the S's
    score on Y from X on a mathematical basis. By
    regressing Y on X, predicting Y from X will be
    possible. (The prediction equation is sketched
    below.)
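In symbols, a standard form of the prediction equation built from these four pieces (the TOEFL example on a later slide uses it):

\[ \hat{Y} = \bar{Y} + b(X - \bar{X}) \]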

29
Regression line
  • Lines drawn from each data point to the straight
    line in the scatterplot show the amount of error.
    Suppose we square each of these errors and then
    find the mean of the sum of these squared errors.
    The best-fitting straight line is called the
    regression line and is technically defined as the
    line that results in the smallest mean of the sum
    of the squared errors.
  • We can think of the regression line as being
    that which is closest to all the dots but, more
    precisely, it is the one that results in a mean
    of the squared errors that is less than any other
    line we might produce.
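In symbols, the criterion the slide describes: the regression line is the line that minimizes

\[ \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \]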

30
Determining the slope
  • Convert the MLAT and language learning scores to
    z scores for comparability.
  • Then plot the intersection of each S's z score on
    the MLAT and on the test. As the z scores on the
    MLAT increase, they form a "run", the horizontal
    side of a triangle. At the same time, the z
    scores on the test increase to form a "rise", the
    vertical side.
  • The slope (b) of the regression line is shown as
    we connect these two lines to form the third side
    of the triangle.

31
Regression coefficient with known r and SD
  • In the diagram, an increase of, say, 6 units on
    the run (MLAT) would equal 2 units of increase on
    the rise.
  • The slope is the rise divided by the run. The
    result is a fraction. For z scores, that fraction
    is the correlation coefficient.
  • The correlation coefficient is the same as the
    slope of the best-fitting line in a z-score
    scatterplot. In the triangle, the slope of the
    regression line was 2/6, and so r for the two is
    .33. Suppose the SDs are 8 and 10 respectively
    for Y and X.
  • To obtain the slope in raw-score units, we
    multiply the correlation coefficient by the
    standard deviation of Y over the standard
    deviation of X, as shown below.
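In symbols, with the values from this slide:

\[ b = r \times \frac{SD_Y}{SD_X} = .33 \times \frac{8}{10} \approx .26 \]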

32
Regression coefficient with raw data
  • With r and the SDs, it is very easy to find the
    slope. With raw data, the formula for the slope
    follows.
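The raw-score formula for the slope, reconstructed here in its standard form (it parallels the raw-score formula for r given earlier):

\[ b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \]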

33
Example: using TSE to predict TOEFL
  • Mean on TOEFL = 540, SD = 40. Mean on TSE = 30,
    SD = 4. r = .80, so b = .80 × (40/4) = 8.0.
  • A student achieved 36 on the TSE, 6 points higher
    than the mean. Multiplying that by the slope, we
    get 8 × 6 = 48. So our predicted TOEFL score is
    mean Y (540) + 48 = 588. The prediction formula,
    and another form of the regression equation,
    follow.
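The two equations the slide refers to, reconstructed in standard form:

\[ \hat{Y} = \bar{Y} + b(X - \bar{X}) = 540 + 8(36 - 30) = 588 \]

\[ \hat{Y} = a + bX, \quad \text{where } a = \bar{Y} - b\bar{X} = 540 - 8(30) = 300, \text{ so } \hat{Y} = 300 + 8(36) = 588 \]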

34
Standard error of estimate
  • There is some overlap in the variance of the two
    variables. When we square the value of r, we find
    the degree of shared variance.
  • Of the original 100% of the variance, with an
    r = .50, we have accurately accounted for 25% of
    the variance using the straight line as the basis
    for prediction. The error variance is now reduced
    to 75%.
  • In regression, the standard error of estimate
    (SEE) shows the dispersion of scores away from
    the straight line. If all the data are tightly
    clustered on the line, little error is made in
    prediction.
  • SEE tells us how much error is likely to occur in
    prediction.

35
Error variance
  • To compute the SEE, we need to know the error
    variance, which is the sum of squares of actual
    scores minus predicted scores, divided by N − 2.
  • The square root of this variance is referred to
    as the SEE (here, 1.35).

Mean for X = 8, SD = 4.47; mean for Y = 10.8, SD = 2.96; r = .89
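In symbols, with a consistency check against the values above (the shortcut form is a standard result, not from the original slide):

\[ SEE = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}} \approx SD_Y\sqrt{1 - r^2} = 2.96\sqrt{1 - .89^2} \approx 1.35 \]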
36
Confidence interval
  • 68% confidence interval: ±1 SEE (e.g., 1.35). 68%
    of actual Y scores would fall within ±1.35 of
    the predicted Y score.
  • 95% confidence interval: ±1.96 × SEE
  • 99% confidence interval: ±2.58 × SEE
  • Suppose the estimated score is 11.98. Then the
  • 95% confidence interval is between 9.33
    (11.98 − 1.35 × 1.96) and 14.63 (11.98 + 1.35 × 1.96)
  • 99% confidence interval?
  • 8.50 (11.98 − 3.48) to 15.46 (11.98 + 3.48)

37
Estimated L2 scores predicted from class hours
38
Goodness of fit for the regression model: R²
  • R², also called the coefficient of multiple
    determination (its square root, R, is the
    multiple correlation), is the percent of the
    variance in the dependent explained uniquely or
    jointly by the independents.
  • Adjusted R² is an adjustment for the fact that
    when one has a large number of independents, it
    is possible that R² will become artificially high
    simply because some independents' chance
    variations "explain" small parts of the variance
    of the dependent. (A common form of the
    adjustment appears below.)
  • The greater the number of independents, the more
    the researcher is expected to report the adjusted
    coefficient.
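A common form of the adjustment, where k is the number of independents (standard formula, not from the original slide):

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1} \]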

39
T-test
  • t-tests are used to assess the significance of
    individual b coefficients, specifically testing
    the null hypothesis that the regression
    coefficient is zero (as sketched below).
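In symbols, for each predictor (standard form, with k independents):

\[ t = \frac{b}{SE_b}, \qquad df = N - k - 1 \]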

40
F test
  • The F test is used to test the significance of R,
    which is the same as testing the significance of
    R², which is the same as testing the significance
    of the regression model as a whole.
  • If prob(F) < .05, then the model is considered
    significantly better than would be expected by
    chance, and we reject the null hypothesis of no
    linear relationship of y to the independents.
    (The statistic is sketched below.)
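In symbols, with k predictors (standard form, not from the original slide):

\[ F = \frac{R^2 / k}{(1 - R^2)/(N - k - 1)}, \qquad df = (k,\; N - k - 1) \]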

41
Multicollinearity
  • Multicollinearity is the intercorrelation of the
    independent variables. R²'s near 1 (from
    regressing one independent on the others) violate
    the assumption of no perfect collinearity, while
    high R²'s increase the standard error of the beta
    coefficients and make assessment of the unique
    role of each independent difficult or impossible.

42
Tolerance or VIF
  • To assess multivariate multicollinearity, one
    uses tolerance or the VIF, both of which are
    based on regressing each independent on all the
    others (see the formulas below).
  • As a rule of thumb, if tolerance is less than
    .20, a problem with multicollinearity is
    indicated.
  • When tolerance is close to 0, there is high
    multicollinearity of that variable with the other
    independents, and the b and beta coefficients
    will be unstable.
  • The greater the multicollinearity, the lower the
    tolerance and the larger the standard errors of
    the regression coefficients.

43
Methods for selecting predictor variables
Forward selection
  • This method starts with a model containing none
    of the explanatory variables. In the first step,
    the procedure considers variables one by one for
    inclusion and selects the variable that results
    in the largest increase in R². In the second
    step, the procedure considers variables for
    inclusion in a model that already contains the
    variable selected in the first step. At each
    step, the variable with the largest increase in
    R² is selected until, according to an F-test,
    further additions are judged not to improve the
    model. (A minimal sketch follows.)
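A minimal Python sketch of forward selection (my own illustration; for simplicity the F-test stopping rule is replaced by a minimum R² improvement threshold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection: add the predictor giving the largest
    R² increase until the improvement falls below min_gain."""
    remaining = list(range(X.shape[1]))
    selected, best_r2 = [], 0.0
    while remaining:
        # Try each remaining predictor alongside those already selected.
        r2, j = max(
            (LinearRegression().fit(X[:, selected + [j]], y)
                               .score(X[:, selected + [j]], y), j)
            for j in remaining
        )
        if r2 - best_r2 < min_gain:  # no worthwhile improvement: stop
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return selected, best_r2
```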

44
Backward selection
  • This method starts with a model containing all
    the variables and eliminates variables one by
    one, at each step choosing the variable for
    exclusion as that leading to the smallest
    decrease in R2. Again, the procedure is repeated
    until, according to an F-test, further exclusions
    would represent a deterioration of the model.

45
Stepwise selection
  • This method is, essentially, a combination of the
    previous two approaches. Starting with no
    variables in the model, variables are added as
    with the forward selection method. In addition,
    after each inclusion step, a backward elimination
    process is carried out to remove variables that
    are no longer judged to improve the model.

46
Interpretation of the results from multiple
regression
  • Checking the assumptions
  • Evaluating the model
  • Evaluating each of the independent variables

47
Presenting the results of multiple regression
  • It would be a good idea to look for examples of
    the presentation of different statistical
    analysis in the journals relevant to your topic
    area. Different journals have different
    requirements and expectations. Given the severe
    space limitations in journals these days, often
    only a brief summary of the results is presented
    and readers are encouraged to contact the author
    for a copy of the full results.