Statistical Inference I: Hypothesis testing; sample size

1
Statistical Inference I: Hypothesis testing;
sample size
2
Statistics Primer

Statistical Inference
Hypothesis testing
P-values
Type I error
Type II error
Statistical power
Sample size calculations

3
What is a statistic?

A statistic is any value that can be calculated
from the sample data.
Sample statistics are calculated to give us an
idea about the larger population.

4
Examples of statistics

Mean:
The average cost of a gallon of gas in the US is
$2.65.
Difference in means:
The difference in the average gas price in Los
Angeles ($2.96) compared with Des Moines, Iowa
($2.32) is 64 cents.
Proportion:
67% of high school students in the U.S. exercise
regularly.
Difference in proportions:
The difference in the proportion of Democrats who
approve of Obama (83%) versus Republicans who do
(14%) is 69 percentage points.

5
What is a statistic?

Sample statistics are estimates of population
parameters.

6
Sample statistics estimate population parameters
7
What is sampling variation?

Statistics vary from sample to sample due to
random chance.
Example
A population of 100,000 people has an average IQ
of 100 (If you actually could measure them all!)
If you sample 5 random people from this
population, what will you get?

8
Sampling Variation
Mean IQ = 100
9
Sampling Variation and Sample Size

Do you expect more or less sampling variability
in samples of 10 people?
Of 50 people?
Of 1000 people?
Of 100,000 people?
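
A minimal simulation sketch of the question above. It assumes IQ is normally distributed with mean 100 and a standard deviation of 15 (the 15 is an assumption; the slides give only the mean), and it approximates sampling from the 100,000-person population by drawing from that normal distribution. The function names are illustrative.

```python
import random
import statistics

def sample_means(n, repeats=1000, mean=100.0, sd=15.0, seed=0):
    """Draw `repeats` random samples of size n and return each sample's mean IQ."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mean, sd) for _ in range(n))
            for _ in range(repeats)]

def spread_by_sample_size(sizes=(5, 10, 50, 1000)):
    """Standard deviation of the sample means, for each sample size."""
    return {n: statistics.stdev(sample_means(n)) for n in sizes}

for n, spread in spread_by_sample_size().items():
    print(f"n={n:5d}  spread of sample means = {spread:.2f}")
```

The printed spread should shrink as n grows: small samples bounce around the true mean of 100 much more than large ones.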

10
Sampling Distributions

Most experiments are one-shot deals. So, how do
we know if an observed effect from a single
experiment is real or is just an artifact of
sampling variability (chance variation)?
Requires a priori knowledge about how sampling
variability works.
Question: Why have I made you learn about
probability distributions and about how to
calculate and manipulate expected value and
variance?
Answer: Because they form the basis of describing
the distribution of a sample statistic.

11
Standard error

Standard Error is a measure of sampling
variability.
Standard error is the standard deviation of a
sample statistic.
It's a theoretical quantity! What would the
distribution of my statistic be if I could repeat
my experiment many times (with fixed sample
size)? How much chance variation is there?
Standard error decreases with increasing sample
size and increases with increasing variability of
the outcome (e.g., IQ).
Standard errors can be predicted by computer
simulation or mathematical theory (formulas).
The formula for standard error is different for
every type of statistic (e.g., mean, difference
in means, odds ratio).
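
To illustrate "simulation or formulas," here is a small sketch using two standard textbook formulas: the standard error of a mean (sd / sqrt(n)) and of a proportion (sqrt(p(1 - p) / n)). The simulation function is a hypothetical helper that checks the proportion formula by brute force; none of these names come from the slides.

```python
import math
import random
import statistics

def se_mean(sd, n):
    """Formula: standard error of a sample mean."""
    return sd / math.sqrt(n)

def se_proportion(p, n):
    """Formula: standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def simulated_se_proportion(p, n, repeats=2000, seed=1):
    """Simulation: spread of the sample proportion across repeated samples."""
    rng = random.Random(seed)
    props = [sum(rng.random() < p for _ in range(n)) / n
             for _ in range(repeats)]
    return statistics.stdev(props)

print("formula:   ", round(se_proportion(0.5, 100), 4))
print("simulation:", round(simulated_se_proportion(0.5, 100), 4))
```

The two numbers should agree closely, which is why either route can be used to predict sampling variability.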

12
What is statistical inference?

The field of statistics provides guidance on how
to make conclusions in the face of chance
variation (sampling variability).

13
Example 1: Difference in proportions

Research question: Are antidepressants a risk
factor for suicide attempts in children and
adolescents?

Example modified from "Antidepressant Drug
Therapy and Suicide in Severely Depressed
Children and Adults." Olfson et al. Arch Gen
Psychiatry. 2006;63:865-872.

14
Example 1

Design: Case-control study
Methods: Researchers used Medicaid records to
compare prescription histories between 263
children and teenagers (6-18 years) who had
attempted suicide and 1241 controls who had never
attempted suicide (all subjects suffered from
depression).
Statistical question: Is a history of use of
antidepressants more common among cases than
controls?

15
Example 1

Statistical question: Is a history of use of
antidepressants more common among
cases than controls?
What will we actually compare?
Proportion of cases who used antidepressants in
the past vs. proportion of controls who did.

16
Results
                               Cases (n = 263)   Controls (n = 1241)
                               No. (%)           No. (%)
Any antidepressant drug ever   120 (46%)         448 (36%)

46% of cases vs. 36% of controls
Difference = 10%
17
What does a 10% difference mean?

Before we perform any formal statistical analysis
on these data, we already have a lot of
information.
Look at the basic numbers first THEN consider
statistical significance as a secondary guide.

18
Is the association statistically significant?

This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
The question: is 10% bigger or smaller than the
expected sampling variability?

19
What is hypothesis testing?

Statisticians try to answer this question with a
formal hypothesis test

20
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between
antidepressant use and suicide attempts in the
target population (i.e., the difference is 0%).
21
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: math theory (formula)
The standard error of the difference in two
proportions is about 3.3%.
Thus, we expect to see differences between the
groups as big as about 6.6% (2 standard errors)
just by chance.
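
The formula on this slide did not survive in the transcript. A sketch under one assumption: that the slide used the usual pooled standard error for a difference in two proportions under the null, sqrt(p(1 - p)/n1 + p(1 - p)/n2), with p the overall proportion of antidepressant users. That choice reproduces the 6.6% figure from the study counts.

```python
import math

def se_diff_proportions_null(x1, n1, x2, n2):
    """Standard error of p1 - p2 when both groups share one pooled proportion
    (assumed form; the slide's own formula is not preserved)."""
    pooled = (x1 + x2) / (n1 + n2)
    return math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# 120 of 263 cases and 448 of 1241 controls had used antidepressants.
se = se_diff_proportions_null(120, 263, 448, 1241)
print(f"standard error = {se:.3f}, 2 standard errors = {2 * se:.3f}")
```

This prints a standard error of about 0.033 and a two-standard-error band of about 0.066.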
22
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: computer simulation

In computer simulation, you simulate taking
repeated samples of the same size from the same
population and observe the sampling variability.
I used computer simulation to take 1000 samples
of 263 cases and 1241 controls assuming the null
hypothesis is true (e.g., no difference in
antidepressant use between the groups).
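
A sketch of what such a simulation could look like; this is not the presenter's code. It assumes that, under the null, every subject has the same probability of prior antidepressant use, set here to the overall sample proportion (568/1504). Function names are illustrative.

```python
import random
import statistics

def simulate_null_differences(n_cases=263, n_controls=1241,
                              p_null=568 / 1504, repeats=1000, seed=2):
    """Simulate the case-minus-control difference in proportions when both
    groups share the same true probability of antidepressant use."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(repeats):
        cases = sum(rng.random() < p_null for _ in range(n_cases)) / n_cases
        controls = sum(rng.random() < p_null
                       for _ in range(n_controls)) / n_controls
        diffs.append(cases - controls)
    return diffs

def share_at_least(diffs, size):
    """Fraction of simulated differences at least `size` in absolute value."""
    return sum(abs(d) >= size for d in diffs) / len(diffs)

diffs = simulate_null_differences()
print("spread of simulated differences:", round(statistics.stdev(diffs), 3))
print("share with |difference| >= 10%:", share_at_least(diffs, 0.10))
```

The spread of the simulated differences plays the role of the standard error, and the share of simulations with a gap of 10% or more shows how rarely chance alone produces a difference that large.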

23
Computer Simulation Results
24
What is standard error?
Standard error: measure of variability of sample
statistics
25
Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 10% between cases and
controls.
26
Hypothesis Testing
Step 4: Calculate a p-value.
P-value = the probability of your data or something
more extreme under the null hypothesis.
27
Hypothesis Testing
Step 4: Calculate a p-value: mathematical theory
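
The calculation on this slide is not in the transcript. A sketch assuming the standard approach: divide the observed difference by its standard error under the null to get a Z score, then take the two-tailed area from the normal distribution. The pooled standard error is an assumed form.

```python
import math
from statistics import NormalDist

def two_sided_p_value(difference, standard_error):
    """Two-tailed p-value for a difference, using the normal approximation."""
    z = abs(difference) / standard_error
    return 2 * NormalDist().cdf(-z)

pooled = (120 + 448) / (263 + 1241)            # overall proportion of users
se = math.sqrt(pooled * (1 - pooled) * (1 / 263 + 1 / 1241))
observed = 120 / 263 - 448 / 1241              # unrounded observed difference
p_value = two_sided_p_value(observed, se)
print(f"Z = {observed / se:.2f}, two-tailed p-value = {p_value:.4f}")
```

The result is a p-value of a few in a thousand, the same order of magnitude as the simulation-based estimate.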
28
The p-value from computer simulation
29
P-value
P-value = the probability of your data or something
more extreme under the null hypothesis. From our
simulation, we estimate the p-value to be 3/1000
or .003.
30
Hypothesis Testing
Step 5: Reject or do not reject the null
hypothesis.
Here we reject the null. Alternative hypothesis:
There is an association between antidepressant
use and suicide attempts in the target population.
31
What does a 10% difference mean?

Is it statistically significant? YES
Is it clinically significant?
Is this a causal association?

32
What does a 10% difference mean?

Is it statistically significant? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
33
What would a lack of statistical significance
mean?

If this study had sampled only 50 cases and 50
controls, the sampling variability would have
been much higher, as shown in this computer
simulation.

35
With only 50 cases and 50 controls
36
Two-tailed p-value
37
What does a 10% difference mean (50 cases/50
controls)?

Is it statistically significant? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

No evidence of an effect ≠ Evidence of no effect.
38
Example 2: Difference in means

Example: Rosenthal, R. and Jacobson, L. (1966).
Teachers' expectancies: Determinants of pupils'
IQ gains. Psychological Reports, 19, 115-118.

39
The Experiment (note: exact numbers have been
altered)

Grade 3 students at Oak School were given an IQ
test at the beginning of the academic year (n = 90).
Classroom teachers were given a list of names of
students in their classes who had supposedly
scored in the top 20 percent; these students were
identified as "academic bloomers" (n = 18).
BUT the children on the teachers' lists had
actually been randomly assigned to the list.
At the end of the year, the same IQ test was
re-administered.

40
Example 2

Statistical question: Do students in the
treatment group have more improvement in IQ than
students in the control group?
What will we actually compare?
One-year change in IQ score in the treatment
group vs. one-year change in IQ score in the
control group.

41
Results
                                 Academic bloomers (n = 18)   Controls (n = 72)
Change in IQ score, mean (SD)    12.2 (2.0)                   8.2 (2.0)

12.2 points vs. 8.2 points
Difference = 4 points
42
What does a 4-point difference mean?

Before we perform any formal statistical analysis
on these data, we already have a lot of
information.
Look at the basic numbers first THEN consider
statistical significance as a secondary guide.

43
Is the association statistically significant?

This 4-point difference could reflect a true
effect or it could be a fluke.
The question: is a 4-point difference bigger or
smaller than the expected sampling variability?

44
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between
"academic bloomers" and normal students (i.e., the
difference is 0).
45
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: math theory
The standard error of the difference in two means
is about 0.5.
We expect to see differences between the groups as
big as about 1.0 (2 standard errors) just by
chance.
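
The formula itself is missing from the transcript. A sketch assuming the usual standard error for a difference in two independent means, sqrt(sd1^2/n1 + sd2^2/n2), applied to the group sizes and standard deviations reported in the results:

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means
    (assumed form; the slide's own formula is not preserved)."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# 18 "academic bloomers" and 72 controls, both with a standard deviation of 2.0.
se = se_diff_means(2.0, 18, 2.0, 72)
print(f"standard error = {se:.2f}, 2 standard errors = {2 * se:.2f}")
```

This gives a standard error of roughly 0.5 points, so about 1 point for two standard errors, in line with the slide.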
46
Hypothesis Testing
Step 2: Predict the sampling variability assuming
the null hypothesis is true: computer simulation

In computer simulation, you simulate taking
repeated samples of the same size from the same
population and observe the sampling variability.
I used computer simulation to take 1000 samples
of 18 treated and 72 controls, assuming the null
hypothesis (that the treatment doesn't affect
IQ).

47
Computer Simulation Results
48
What is the standard error?
Standard error: measure of variability of sample
statistics
49
Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 4 points between
treated and controls.
50
Hypothesis Testing
Step 4: Calculate a p-value.
P-value = the probability of your data or something
more extreme under the null hypothesis.
51
Hypothesis Testing
Step 4: Calculate a p-value: mathematical theory

p-value < .0001
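
A sketch of how a p-value this small can arise, assuming a normal (Z) approximation; the slide's own calculation is not preserved and may have used a t distribution instead. The observed 4-point difference is divided by its standard error and the two-tailed area is taken.

```python
import math
from statistics import NormalDist

def z_test_diff_means(difference, sd1, n1, sd2, n2):
    """Return (Z, two-tailed p-value) for a difference in two means,
    using the normal approximation."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = difference / se
    return z, 2 * NormalDist().cdf(-abs(z))

z, p_value = z_test_diff_means(4.0, 2.0, 18, 2.0, 72)
print(f"Z = {z:.1f}, p-value below .0001: {p_value < 0.0001}")
```

A difference more than seven standard errors from zero is essentially never produced by chance alone, hence the tiny p-value.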

52
Getting the P-value from computer simulation
53
P-value
P-value = the probability of your data or something
more extreme under the null hypothesis. Here,
p-value < .0001
54
Hypothesis Testing
Step 5: Reject or do not reject the null
hypothesis.
Here we reject the null. Alternative hypothesis:
There is an association between being labeled as
gifted and subsequent academic achievement.
55
What does a 4-point difference mean?

Is it statistically significant? YES
Is it clinically significant?
Is this a causal association?

56
What does a 4-point difference mean?

Is it statistically significant? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
57
What if our standard deviation had been higher?

The standard deviation for change scores in both
treatment and control was 2.0. What if change
scores had been much more variable, say a standard
deviation of 10.0?

59
With a std. dev. of 10.0
60
What would a 4.0 difference mean (std. dev. = 10)?

Is it statistically significant? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE

No evidence of an effect ≠ Evidence of no effect.
61
Hypothesis testing summary

Null hypothesis: the hypothesis of no effect
(usually the opposite of what you hope to prove).
The straw man you are trying to shoot down.
Example: antidepressants have no effect on
suicide risk.
P-value: the probability of your observed data (or
something more extreme) if the null hypothesis is
true.
Example: The probability that the study would
have found a difference of 10% or more in
antidepressant use between cases and controls if
antidepressants had no effect (i.e., just by
chance).
If the p-value is low enough (i.e., if our data
are very unlikely given the null hypothesis),
this is evidence that the null hypothesis is
wrong.
If the p-value is low enough (typically < .05), we
reject the null hypothesis and conclude that
there is an association between antidepressants
and suicide attempts.

62
Summary: The Underlying Logic of Hypothesis Tests
Follows this logic: Assume A. If A, then
B. Not B. Therefore, Not A. But throw in a bit
of uncertainty: If A, then probably B.
63
Error and power

Type I error rate (or significance level): the
probability of finding an effect that isn't real
(false positive).
If we require p-value < .05 for statistical
significance, this means that 1 in 20 times we
will find a positive result just by chance when
there is no real effect.
Type II error rate: the probability of missing an
effect (false negative).
Statistical power: the probability of finding an
effect if it is there (the probability of not
making a type II error).
When we design studies, we typically aim for a
power of 80% (allowing a false negative rate, or
type II error rate, of 20%).
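
The "1 in 20" claim can be checked by simulation. The sketch below is illustrative only: it runs many hypothetical two-group studies in which the null is true (both groups share the same proportion, arbitrarily set to 0.4 with 100 subjects per group) and counts how often a two-tailed Z test gives p < .05.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(n_per_group=100, p_true=0.4, alpha=0.05,
                        repeats=2000, seed=3):
    """Share of null-is-true studies that still come out 'significant'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(repeats):
        x1 = sum(rng.random() < p_true for _ in range(n_per_group))
        x2 = sum(rng.random() < p_true for _ in range(n_per_group))
        pooled = (x1 + x2) / (2 * n_per_group)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se == 0:
            continue  # no variation in this sample, so nothing to reject
        z = abs(x1 - x2) / n_per_group / se
        p_value = 2 * NormalDist().cdf(-z)
        rejections += p_value < alpha
    return rejections / repeats

print("false positive rate:", false_positive_rate())
```

The printed rate should land near 0.05, i.e., roughly 1 false positive in every 20 studies of a nonexistent effect.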

64
Type I and Type II Error in a box
65
Reminds me of... Pascal's Wager
66
Type I and Type II Error in a box
67
Review Question 1

If we have a p-value of 0.03 and so decide that
our effect is statistically significant, what is
the probability that we're wrong (i.e., that the
hypothesis test gave us a false positive)?
a. .03
b. .06
c. Cannot tell
d. 1.96
e. 95%

69
Review Question 2

Standard error is:
a. For a given variable, its standard deviation
divided by the square root of n.
b. A measure of the variability of a sample
statistic.
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.

71
Review Question 3

A randomized trial of two treatments for
depression failed to show a statistically
significant difference in improvement from
depressive symptoms (p-value = .50). It follows
that:
a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a
difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null
hypothesis.

73
Review Question 4

Following the introduction of a new treatment
regime in a rehab facility, alcoholism "cure"
rates increased. The proportion of successful
outcomes in the two years following the change
was significantly higher than in the preceding
two years (p-value < .005). It follows that:
a. The improvement in treatment outcome is
clinically important.
b. The new regime cannot be worse than the old
treatment.
c. Assuming that there are no biases in the study
method, the new treatment should be recommended
in preference to the old.
d. All of the above.
e. None of the above.

75
Statistical Power

Statistical power is the probability of finding
an effect if it's real.

76
Can we quantify how much power we have for given
sample sizes?
77
Study 1: 263 cases, 1241 controls
Null distribution: difference = 0.
Clinically relevant alternative: difference = 10%.
78
Study 1: 263 cases, 1241 controls
Power: chance of being in the rejection region if
the alternative is true = area to the right of this
line (in yellow)
Power here > 80%
79
Study 1: 50 cases, 50 controls
Power closer to 20% now.
80
Study 2: 18 treated, 72 controls, std. dev. = 2
Clinically relevant alternative: difference = 4
points
Power is nearly 100%!
81
Study 2: 18 treated, 72 controls, std. dev. = 10
Power is about 40%
82
Study 2: 18 treated, 72 controls, effect size = 1.0
Power is about 50%
Clinically relevant alternative: difference = 1
point
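
The power figures on these slides were read off graphs. They can be approximated with a short calculation. The sketch below makes several simplifying assumptions: a two-tailed Z test at the .05 level, the same standard error under the null and the alternative, and (for Study 1) the pooled proportion 568/1504. Its results are close to the slides' rough figures but need not match them exactly.

```python
import math
from statistics import NormalDist

def approx_power(difference, standard_error, alpha=0.05):
    """Approximate chance of landing in the upper rejection region when the
    true difference is `difference` (the far tail is ignored)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(difference) / standard_error - z_crit)

def se_props(p, n1, n2):
    """Standard error of a difference in proportions with common proportion p."""
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def se_means(sd, n1, n2):
    """Standard error of a difference in means with common standard deviation."""
    return sd * math.sqrt(1 / n1 + 1 / n2)

p = 568 / 1504  # assumed common proportion of antidepressant use
scenarios = {
    "Study 1, 263 cases / 1241 controls": approx_power(0.10, se_props(p, 263, 1241)),
    "Study 1, 50 cases / 50 controls": approx_power(0.10, se_props(p, 50, 50)),
    "Study 2, SD 2, difference 4": approx_power(4.0, se_means(2.0, 18, 72)),
    "Study 2, SD 10, difference 4": approx_power(4.0, se_means(10.0, 18, 72)),
    "Study 2, SD 2, difference 1": approx_power(1.0, se_means(2.0, 18, 72)),
}
for name, power in scenarios.items():
    print(f"{name}: power = {power:.0%}")
```

The pattern matches the slides: power falls when the sample shrinks, when the outcome is more variable, and when the effect to be detected is smaller.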
83
Factors Affecting Power

1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired

84
1. Bigger difference from the null mean
85
2. Bigger standard deviation
86
3. Bigger Sample Size
87
4. Higher significance level
88
Sample size calculations

Based on these elements, you can write a formal
mathematical equation that relates power, sample
size, effect size, standard deviation, and
significance level

89
Simple formula for difference in proportions
90
Simple formula for difference in means
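
The formulas on these two slides are not in the transcript. As a stand-in, here is a sketch of the common textbook versions for two equal-sized groups, which may differ in detail from what the slides showed: n per group = 2 * sd^2 * (Z_alpha/2 + Z_power)^2 / difference^2 for means, with sd^2 replaced by p(1 - p), using the average proportion p, for proportions.

```python
import math
from statistics import NormalDist

def _z_sum(power, alpha):
    """Z for the two-tailed significance level plus Z for the desired power."""
    return NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)

def n_per_group_means(sd, difference, power=0.80, alpha=0.05):
    """Subjects needed in EACH group to detect a difference in means."""
    return math.ceil(2 * sd ** 2 * _z_sum(power, alpha) ** 2 / difference ** 2)

def n_per_group_proportions(p1, p2, power=0.80, alpha=0.05):
    """Subjects needed in EACH group to detect a difference in proportions."""
    p_bar = (p1 + p2) / 2
    return math.ceil(2 * p_bar * (1 - p_bar) * _z_sum(power, alpha) ** 2
                     / (p1 - p2) ** 2)

print(n_per_group_means(sd=10.0, difference=4.0))
print(n_per_group_proportions(0.46, 0.36))
```

The example calls reuse numbers from the two studies (a 4-point difference with a standard deviation of 10; 46% vs. 36%). Halving the difference to be detected roughly quadruples the required sample size.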
91
Sample size calculators on the web

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
http://calculators.stat.ucla.edu
http://hedwig.mgh.harvard.edu/sample_size/size.html

92
These sample size calculations are idealized

They do not account for losses to follow-up
(prospective studies).
They do not account for non-compliance (for
intervention trials or RCTs).
They assume that individuals are independent
observations (not true in clustered designs).
Consult a statistician!

93
Review Question 5

Which of the following elements does not increase
statistical power?
a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size

95
Review Question 6

Most sample size calculators ask you to input a
value for σ. What are they asking for?
a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of deviation
e. The variance

97
Review Question 7

For your RCT, you want 80% power to detect a
reduction of 10 points or more in the treatment
group relative to placebo. What is 10 in your
sample size formula?
a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level
