Title: Is the Association Statistically Significant Session 16
1. Is the Association Statistically Significant?
Session 16
2. Tests of Statistical Significance
- Nominal
  - Lambda: t test
  - Phi: χ²
  - Contingency Coefficient: χ²
  - Cramér's V: χ²
- Ordinal
  - Gamma: t test
  - Somers' d: t test
  - Tau-b, Tau-c: t test
- Interval
  - Pearson's r: t test
3. Chi-Square Test: Statistical Significance for Nominal and Ordinal Level Variables
- To be used when
  - Variables are not interval level
  - Variables are not normally distributed
4. One-Way Chi-Square: Distribution of Values Across a Single Variable
- Are variations across cell frequencies chance variations, or are they a pattern?
- Null hypothesis is that there is no pattern of distribution. In other words, cases are distributed evenly across cells.
- Alternative hypothesis is that categories vary. Distribution is not the same across all categories.
5.
- We have two sets of cell frequencies
  - The cell frequencies that correspond with the null hypothesis (the expected frequencies)
  - The observed cell frequencies
- How large is the discrepancy between these two sets of values?
6. Formula
- $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$, where $f_o$ is the observed frequency and $f_e$ is the expected frequency in each cell
- The closer $f_o$ is to $f_e$, the smaller the value of the chi-square test
- The larger the discrepancy, the larger the value of the chi-square test
- A larger value means we are more likely to reject the null and say that there is a pattern
7.
- Degrees of freedom are $k - 1$, where $k$ is the number of categories
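The one-way test described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own procedure: the function name `one_way_chi_square` and the four category counts are hypothetical, and the expected frequency is taken to be the same in every cell, as the null hypothesis of an even distribution implies.

```python
def one_way_chi_square(observed):
    """Return (chi_square, degrees_of_freedom) for a one-way test
    against the null hypothesis of an even distribution across cells."""
    k = len(observed)            # number of categories
    n = sum(observed)            # total number of cases
    expected = n / k             # null hypothesis: same frequency in every cell
    chi_square = sum((fo - expected) ** 2 / expected for fo in observed)
    return chi_square, k - 1     # df = k - 1


# Hypothetical counts for four categories (illustrative data only).
observed = [30, 20, 25, 25]
chi_square, df = one_way_chi_square(observed)
print(chi_square, df)
```

With these counts the statistic is 2.0 on 3 degrees of freedom, which is below the conventional .05 critical value of 7.815, so the null hypothesis of no pattern would not be rejected.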
8. Two-Way Chi-Square: Distribution of Values Across Two Variables
- Used to compare two frequency distributions; in other words, a crosstab
- Null hypothesis is that the two variables are unrelated: cases are distributed across cells as the marginal totals alone would predict.
- Alternative hypothesis is that there is variation in the distribution of values of one variable across categories of the other
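9. Formulas and Calculations
- Expected frequency for null hypothesis is based on marginal values: $f_e = (\text{row total} \times \text{column total}) / N$
- Formula for the chi-square test is the same
- Degrees of freedom
  - $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns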
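As a rough sketch of the two-way calculation, the Python below builds each expected frequency from the row and column totals and then applies the same chi-square formula. The function name `two_way_chi_square` and the 2 x 2 crosstab are hypothetical, and no continuity correction is applied.

```python
def two_way_chi_square(table):
    """Return (chi_square, degrees_of_freedom) for a crosstab given
    as a list of rows, each row a list of observed cell frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            # Expected frequency comes from the marginal totals.
            fe = row_totals[i] * col_totals[j] / n
            chi_square += (fo - fe) ** 2 / fe
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df


# Hypothetical 2 x 2 crosstab (illustrative data only).
table = [[30, 20],
         [20, 30]]
chi_square, df = two_way_chi_square(table)
print(chi_square, df)
```

Here every expected frequency is 25, the statistic is 4.0 on 1 degree of freedom, and that exceeds the conventional .05 critical value of 3.841.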
10.
- Median test (p. 302-305): skip this
11. Linear Regression
- Provides a way to evaluate the influence of one
independent variable on the dependent variable,
controlling for the influence of other variables
12. Statistics Produced by Regression
- a: the constant
- y-hat: the predicted values of y given certain values of the independent variables
- e: the error, the discrepancy between the actual observed value of y and the predicted value of y (the slush factor)
- beta coefficients: the influence on y of a one-unit change in x
- standardized beta coefficients: put the independent variables in the same metric
13. Statistics Produced by Regression
- t tests: the statistical significance of the influence of x on y
- p values: the probability of observing a beta coefficient at least this far from 0 if the true influence were 0
- R-squared: how well the model fits the data, or the percentage of variation in y explained by the variation in the independent variables
14.
- The Adjusted R-squared, or coefficient of multiple determination: collectively, urbanization, population growth, and GDP explain 79% of variation in female literacy rate.
15.
- For each one percent increase in the percentage of people living in cities, female literacy increases by .61, or six tenths of one percent.
- For each one percent increase in annual population growth, the female literacy rate decreases by 13.7 percent.
- Gross domestic product per capita does not have a statistically significant influence on female literacy.
16. Why Use Multivariate Analysis - Regression?
- Descriptive Statistics: one variable
- Measures of Association: two variables
- Multivariate Analysis: three variables or more
17. Why Multivariate Analysis - Regression
- To identify spurious relationships
- To correctly specify relationships
- To thoroughly describe a process
- The goal of research is to explain why variables vary
18. The Basic Regression Model
- $Y = a + bX + e$
- Y is the observed value of the dependent variable
- a is the expected value of Y when X = 0 (a baseline value)
- b is the slope: steep when X has a strong influence on Y (in other words, b is larger)
- e is the amount of variation in Y that can't be explained by X
19. The Regression Line
- The regression line is drawn to minimize the distance (the sum of squared vertical distances) between the line and the plotted points, which are the observed values of the dependent variable
- The regression line represents predicted values (predicted by the equation) and the points represent actual observed values
20.
- Using the slope coefficient (b), the actual values of X, and the value of a, we can plot the regression line.
- The error term, e, is the vertical distance between the regression line and the location of the actual observed points, the values of the dependent variable.
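To make the roles of a, b, and e concrete, here is a minimal Python sketch of a one-variable least-squares fit. The function name `fit_line` and the five data points are hypothetical, and the closed-form formulas used are the standard ones for a single independent variable rather than anything specific to these slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation in X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Constant: the line passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


# Hypothetical observations (illustrative data only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]                      # points on the line
errors = [y - y_hat for y, y_hat in zip(ys, predicted)]  # e for each case

print(round(a, 2), round(b, 2))
print([round(e, 2) for e in errors])
```

For these points the fitted line is roughly y-hat = 2.2 + 0.6x, and the errors sum to zero, which is a property of any least-squares line that includes a constant.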
21. Assumptions for Regression
- Both the independent and the dependent variables are measured at the interval level
- The relationship is linear
- Variables must be normally distributed or sample must be large
- Sample must be random for tests of statistical significance
22. The Significance of the Errors
- Back to the proportionate reduction in error
- The errors should be as small as possible
- We use the average value of Y to guess each observation's value of Y
- We compare this to our guess using the value of X for that observation
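The comparison of the two guesses can be sketched as follows. This is an illustrative calculation with hypothetical data and a hypothetical function name; it measures the errors as squared distances, which is the convention under which the proportionate reduction in error equals the R-squared.

```python
def proportionate_reduction_in_error(xs, ys):
    """Compare guessing with the mean of Y against guessing with the
    regression line, using squared errors for both."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    # E1: squared errors when every guess is the average value of Y.
    e1 = sum((y - mean_y) ** 2 for y in ys)
    # E2: squared errors when each guess uses that observation's X.
    e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (e1 - e2) / e1


# Hypothetical observations (illustrative data only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
pre = proportionate_reduction_in_error(xs, ys)
print(round(pre, 3))
```

For these points the errors shrink from 6.0 to 2.4, a reduction of 60 percent.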
23. Pearson's r, Regression, and the Coefficient of Determination
- Coefficient of Determination is also known as the R-squared. And if there is more than one independent variable, it's the adjusted R-squared. Enough names for ya? The adjusted R-squared takes the number of independent variables into consideration. Kind of like degree of difficulty in diving and gymnastics.
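One common way to see how the adjustment works is the standard formula below, sketched in Python. The function name `adjusted_r_squared`, the parameter names `n` (number of cases) and `k` (number of independent variables), and the example values are assumptions for illustration; they are not taken from the female literacy model.

```python
def adjusted_r_squared(r_squared, n, k):
    """Standard adjustment: penalize R-squared for the number of
    independent variables (k) relative to the number of cases (n)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical model: R-squared of .80, 30 cases, 3 independent variables.
adj = adjusted_r_squared(0.80, n=30, k=3)
print(round(adj, 3))
```

The adjusted value is always at or below the unadjusted R-squared, and the gap widens as more independent variables are added to a small sample.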
24.
- The R-squared value is the percent of variation in the dependent variable that is explained by the independent variables, collectively.
25. The Other Statistics
- T-score and p-value: the statistical significance of the individual coefficients. In other words, is the influence of this independent variable on the dependent variable statistically different from 0?
- The beta coefficient: the magnitude of the influence of X on Y. The amount of change in Y for a one-unit change in X.
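For a single independent variable, the t-score of the slope can be sketched as the coefficient divided by its standard error. The Python below is a minimal illustration with hypothetical data and a hypothetical function name; it uses the usual one-variable formulas and stops short of computing a p-value, which would require a t distribution table or a statistics library.

```python
import math


def slope_t_score(xs, ys):
    """Return (b, standard_error, t_score, degrees_of_freedom)
    for the slope of a one-variable least-squares regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Sum of squared errors around the regression line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of b: residual variance (n - 2 df) over variation in X.
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return b, standard_error, b / standard_error, n - 2


# Hypothetical observations (illustrative data only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, se, t, df = slope_t_score(xs, ys)
print(round(b, 2), round(se, 3), round(t, 2), df)
```

With only five cases the t-score is about 2.12 on 3 degrees of freedom, short of the two-tailed .05 critical value of 3.182, so this slope would not be judged statistically different from 0.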