Statistical Inference I: Hypothesis testing; sample size

Statistics Primer

- Statistical Inference
- Hypothesis testing
- P-values
- Type I error
- Type II error
- Statistical power
- Sample size calculations

What is a statistic?

- A statistic is any value that can be calculated from the sample data.
- Sample statistics are calculated to give us an idea about the larger population.

Examples of statistics

- Mean: The average cost of a gallon of gas in the US is $2.65.
- Difference in means: The difference in the average gas price in Los Angeles ($2.96) compared with Des Moines, Iowa ($2.32) is 64 cents.
- Proportion: 67% of high school students in the U.S. exercise regularly.
- Difference in proportions: The difference in the proportion of Democrats who approve of Obama (83%) versus Republicans who do (14%) is 69%.

What is a statistic?

- Sample statistics are estimates of population parameters.

Sample statistics estimate population parameters

What is sampling variation?

- Statistics vary from sample to sample due to random chance.
- Example:
- A population of 100,000 people has an average IQ of 100 (if you actually could measure them all!).
- If you sample 5 random people from this population, what will you get?

Sampling Variation

Mean IQ = 100

Sampling Variation and Sample Size

- Do you expect more or less sampling variability in samples of 10 people?
- Of 50 people?
- Of 1000 people?
- Of 100,000 people?
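One way to explore these questions is a small simulation. The sketch below is illustrative only: it builds a hypothetical population of 100,000 IQ scores with mean 100 and an assumed standard deviation of 15 (the slides give no standard deviation), then draws repeated samples of each size and reports how much the sample means vary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 IQ scores, mean 100, SD 15 (the SD is an assumption).
population = rng.normal(loc=100, scale=15, size=100_000)
population += 100 - population.mean()  # shift so the population mean is exactly 100

def sample_means(n, reps=200):
    """Means of `reps` random samples of size n, drawn without replacement."""
    return np.array([rng.choice(population, size=n, replace=False).mean()
                     for _ in range(reps)])

# Spread (standard deviation) of the sample means for each sample size
spread = {n: sample_means(n).std() for n in (5, 10, 50, 1000, 100_000)}
for n, sd in spread.items():
    print(f"n = {n:>7,}: SD of sample means = {sd:.3f}")
```

The spread shrinks as the sample grows, and it vanishes when the "sample" is the whole population, because every such sample has the same mean.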

Sampling Distributions

- Most experiments are one-shot deals. So, how do we know if an observed effect from a single experiment is real or is just an artifact of sampling variability (chance variation)?
- Requires a priori knowledge about how sampling variability works.
- Question: Why have I made you learn about probability distributions and about how to calculate and manipulate expected value and variance?
- Answer: Because they form the basis of describing the distribution of a sample statistic.

Standard error

- Standard error is a measure of sampling variability.
- Standard error is the standard deviation of a sample statistic.
- It's a theoretical quantity! What would the distribution of my statistic be if I could repeat my experiment many times (with fixed sample size)? How much chance variation is there?
- Standard error decreases with increasing sample size and increases with increasing variability of the outcome (e.g., IQ).
- Standard errors can be predicted by computer simulation or mathematical theory (formulas).
- The formula for standard error is different for every type of statistic (e.g., mean, difference in means, odds ratio).
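As an illustration of the "mathematical theory" route, the standard error of a sample mean is the outcome's standard deviation divided by the square root of the sample size. The worked numbers below assume an IQ standard deviation of 15, which is not stated in the slides.

```latex
\begin{align*}
SE(\bar{x}) &= \frac{\sigma}{\sqrt{n}} \\
n = 5:\quad    SE &= 15/\sqrt{5}    \approx 6.7 \\
n = 10:\quad   SE &= 15/\sqrt{10}   \approx 4.7 \\
n = 50:\quad   SE &= 15/\sqrt{50}   \approx 2.1 \\
n = 1000:\quad SE &= 15/\sqrt{1000} \approx 0.47
\end{align*}
```

Both properties from the list are visible in the formula: a larger n shrinks the standard error, and a larger sigma inflates it.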

What is statistical inference?

- The field of statistics provides guidance on how to make conclusions in the face of chance variation (sampling variability).

Example 1: Difference in proportions

- Research question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
- Example modified from: Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults. Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.

Example 1

- Design: Case-control study
- Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
- Statistical question: Is a history of use of antidepressants more common among cases than controls?

Example 1

- Statistical question: Is a history of use of antidepressants more common among suicide-attempt cases than controls?
- What will we actually compare?
- Proportion of cases who used antidepressants in the past vs. proportion of controls who did.

Results

| | No. (%) of cases (n = 263) | No. (%) of controls (n = 1241) |
|---|---|---|
| Any antidepressant drug ever | 120 (46%) | 448 (36%) |

46% vs. 36%: Difference = 10%

What does a 10% difference mean?

- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

Is the association statistically significant?

- This 10% difference could reflect a true association or it could be a fluke in this particular sample.
- The question: is 10% bigger or smaller than the expected sampling variability?

What is hypothesis testing?

- Statisticians try to answer this question with a formal hypothesis test.

Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (i.e., the difference is 0%).

Hypothesis Testing

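Step 2: Predict the sampling variability assuming the null hypothesis is true: math theory (formula)

The standard error of the difference in two proportions is

$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}} = \sqrt{\frac{0.378 \times 0.622}{263} + \frac{0.378 \times 0.622}{1241}} \approx 0.033$$

where $\bar{p} = (120 + 448)/(263 + 1241) \approx 0.378$ is the overall proportion of antidepressant users, shared by both groups under the null hypothesis.

Thus, we expect to see differences between the groups as big as about 6.6% (2 standard errors) just by chance.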

Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true: computer simulation

- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 263 cases and 1241 controls assuming the null hypothesis is true (i.e., no difference in antidepressant use between the groups).
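A simulation along these lines might look like the sketch below. It is not the author's code: it assumes that under the null hypothesis both groups share the pooled proportion of antidepressant users, (120 + 448) / (263 + 1241), and it draws each group's count of users from a binomial distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

n_cases, n_controls = 263, 1241
# Assumption: under the null, both groups share the pooled proportion (about 0.378).
p_null = (120 + 448) / (n_cases + n_controls)

n_sims = 1000
# Number of antidepressant users in each simulated group
users_cases = rng.binomial(n_cases, p_null, size=n_sims)
users_controls = rng.binomial(n_controls, p_null, size=n_sims)

# Difference in proportions (cases minus controls) for each simulated study
diffs = users_cases / n_cases - users_controls / n_controls

sim_se = diffs.std()
print(f"mean simulated difference: {diffs.mean():.4f}")
print(f"simulated standard error:  {sim_se:.4f}")
print(f"2 standard errors:         {2 * sim_se:.4f}")
```

The standard deviation of the simulated differences is the simulation's estimate of the standard error, so it can be compared directly with the value from the formula.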

Computer Simulation Results

What is standard error?

Standard error: measure of variability of sample statistics

Hypothesis Testing

Step 3: Do an experiment

We observed a difference of 10% between cases and controls.

Hypothesis Testing

Step 4: Calculate a p-value

P-value: the probability of your data or something more extreme under the null hypothesis.

Hypothesis Testing

Step 4: Calculate a p-value: mathematical theory
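The slide's own calculation did not survive in this transcript. As a rough sketch of what a theory-based p-value could look like, the code below uses a two-proportion z-test with a pooled standard error and a normal approximation; the choice of test is an assumption, not necessarily the author's method.

```python
from math import erfc, sqrt

n_cases, n_controls = 263, 1241
users_cases, users_controls = 120, 448

p_cases = users_cases / n_cases
p_controls = users_controls / n_controls
# Pooled proportion: the common proportion assumed under the null hypothesis
p_pool = (users_cases + users_controls) / (n_cases + n_controls)

# Standard error of the difference in proportions under the null
se = sqrt(p_pool * (1 - p_pool) * (1 / n_cases + 1 / n_controls))

# How many standard errors the observed difference lies from 0
z = (p_cases - p_controls) / se

# Two-sided p-value: area in both tails of the standard normal beyond |z|
p_value = erfc(abs(z) / sqrt(2))

print(f"difference = {p_cases - p_controls:.3f}, SE = {se:.3f}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

The result is of the same order as the simulation-based estimate of .003 quoted in the deck.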

The p-value from computer simulation

P-value: the probability of your data or something more extreme under the null hypothesis. From our simulation, we estimate the p-value to be 3/1000 or .003.

Hypothesis Testing

Step 5: Reject or do not reject the null hypothesis.

Here we reject the null. Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.

What does a 10% difference mean?

- Is it statistically significant? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical significance.

Statistical significance does not necessarily imply a cause-and-effect relationship.

What would a lack of statistical significance mean?

- If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation.

With only 50 cases and 50 controls

Two-tailed p-value
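The figure for this slide is missing from the transcript. A back-of-the-envelope version of the same point, assuming the overall proportion of antidepressant users stays at about 0.378 and using a normal approximation, is:

```latex
\begin{align*}
SE &= \sqrt{0.378 \times 0.622 \times \left(\tfrac{1}{50} + \tfrac{1}{50}\right)} \approx 0.097 \\
Z  &= \frac{0.10}{0.097} \approx 1.03 \\
\text{two-tailed } p &\approx 0.30
\end{align*}
```

The same 10% difference is now only about one standard error from zero, so it is well within what chance alone would produce.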

What does a 10% difference mean (50 cases/50 controls)?

- Is it statistically significant? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

No evidence of an effect ≠ Evidence of no effect.

Example 2: Difference in means

- Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.

The Experiment (note: exact numbers have been altered)

- Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n = 90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n = 18).
- BUT the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same I.Q. test was re-administered.

Example 2

- Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
- What will we actually compare?
- One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

Results

| | Academic bloomers (n = 18) | Controls (n = 72) |
|---|---|---|
| Change in IQ score, mean (std. dev.) | 12.2 (2.0) | 8.2 (2.0) |

12.2 points vs. 8.2 points: Difference = 4 points

What does a 4-point difference mean?

- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

Is the association statistically significant?

- This 4-point difference could reflect a true effect or it could be a fluke.
- The question: is a 4-point difference bigger or smaller than the expected sampling variability?

Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no difference between "academic bloomers" and normal students (i.e., the difference is 0).

Hypothesis Testing

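Step 2: Predict the sampling variability assuming the null hypothesis is true: math theory

The standard error of the difference in two means is

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = \sqrt{\frac{2.0^2}{18} + \frac{2.0^2}{72}} \approx 0.53$$

We expect to see differences between the groups as big as about 1.0 (2 standard errors) just by chance.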

Hypothesis Testing

Step 2: Predict the sampling variability assuming the null hypothesis is true: computer simulation

- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 18 treated and 72 controls, assuming the null hypothesis (that the treatment doesn't affect IQ).
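A minimal sketch of such a simulation follows. It is not the author's code: it assumes IQ change scores are normally distributed with a standard deviation of 2.0 in both groups, and it uses a hypothetical common mean change of 9.0 points, a value that cancels out of the difference.

```python
import numpy as np

rng = np.random.default_rng(2)

n_treated, n_controls = 18, 72
sd_change = 2.0     # standard deviation of change scores, as in the example
mean_change = 9.0   # hypothetical common mean under the null (cancels out)
n_sims = 1000

# Each row is one simulated study; both groups come from the SAME population.
treated = rng.normal(mean_change, sd_change, size=(n_sims, n_treated))
controls = rng.normal(mean_change, sd_change, size=(n_sims, n_controls))

# Difference in mean change (treated minus controls) for each simulated study
diffs = treated.mean(axis=1) - controls.mean(axis=1)

sim_se = diffs.std()
largest = np.abs(diffs).max()
print(f"simulated standard error:       {sim_se:.2f}")
print(f"largest simulated |difference|: {largest:.2f}")
```

Comparing the largest simulated difference with the observed 4 points gives a feel for how unusual the observed result would be if the null hypothesis were true.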

Computer Simulation Results

What is the standard error?

Standard error: measure of variability of sample statistics

Hypothesis Testing

Step 3: Do an experiment

We observed a difference of 4 points between treated and controls.

Hypothesis Testing

Step 4: Calculate a p-value

P-value: the probability of your data or something more extreme under the null hypothesis.

Hypothesis Testing

Step 4: Calculate a p-value: mathematical theory

p-value < .0001
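The calculation behind this slide is not in the transcript. One plausible route, assuming a normal approximation (a t-test would lead to the same conclusion), is to express the observed difference in standard-error units:

```latex
\begin{align*}
SE &= \sqrt{\frac{2.0^2}{18} + \frac{2.0^2}{72}} \approx 0.53 \\
Z  &= \frac{12.2 - 8.2}{0.53} \approx 7.6
\end{align*}
```

A two-tailed p-value of .0001 corresponds to a Z of about 3.9, so a Z near 7.6 puts the p-value far below .0001.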

Getting the p-value from computer simulation

P-value: the probability of your data or something more extreme under the null hypothesis. Here, p-value < .0001.

Hypothesis Testing

Step 5: Reject or do not reject the null hypothesis.

Here we reject the null. Alternative hypothesis: There is an association between being labeled as gifted and subsequent academic achievement.

What does a 4-point difference mean?

- Is it statistically significant? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical significance.

Statistical significance does not necessarily imply a cause-and-effect relationship.

What if our standard deviation had been higher?

- The standard deviation for change scores in both treatment and control was 2.0. What if change scores had been much more variable, say a standard deviation of 10.0?

With a std. dev. of 10.0

What would a 4.0 difference mean (std. dev. = 10)?

- Is it statistically significant? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

No evidence of an effect ≠ Evidence of no effect.

Hypothesis testing summary

- Null hypothesis: the hypothesis of no effect (usually the opposite of what you hope to prove). The straw man you are trying to shoot down.
- Example: antidepressants have no effect on suicide risk.
- P-value: the probability of your observed data (or something more extreme) if the null hypothesis is true.
- Example: The probability that the study would have found a difference of 10% or more in antidepressant use between cases and controls if antidepressants had no effect (i.e., just by chance).
- If the p-value is low enough (i.e., if our data are very unlikely given the null hypothesis), this is evidence that the null hypothesis is wrong.
- If the p-value is low enough (typically < .05), we reject the null hypothesis and conclude that antidepressants do have an effect.

Summary: The underlying logic of hypothesis tests

Follows this logic: Assume A. If A, then B. Not B. Therefore, Not A. But throw in a bit of uncertainty: If A, then probably B.

Error and power

- Type I error rate (or significance level): the probability of finding an effect that isn't real (false positive).
- If we require p-value < .05 for statistical significance, this means that, when there is no real effect, 1 time in 20 we will find a positive result just by chance.
- Type II error rate: the probability of missing an effect (false negative).
- Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error).
- When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).
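The "1 in 20" claim can be checked with a quick simulation. The sketch below is illustrative only: it invents 10,000 studies in which there is truly no effect (both groups come from the same hypothetical normal population), tests each at the .05 level with a normal-approximation z-test, and counts how often the result is "significant".

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)
n_per_group, n_studies, alpha = 100, 10_000, 0.05

# Every simulated study has NO true effect: both groups are drawn from the
# same hypothetical population (mean 0, SD 1).
group_a = rng.normal(0, 1, size=(n_studies, n_per_group))
group_b = rng.normal(0, 1, size=(n_studies, n_per_group))

# Standard error of the difference in means, estimated within each study
se = np.sqrt(group_a.var(axis=1, ddof=1) / n_per_group
             + group_b.var(axis=1, ddof=1) / n_per_group)
z = (group_a.mean(axis=1) - group_b.mean(axis=1)) / se

# Two-sided p-value for each study (normal approximation)
p_values = np.array([erfc(abs(zi) / sqrt(2)) for zi in z])

false_positive_rate = np.mean(p_values < alpha)
print(f"share of null studies with p < {alpha}: {false_positive_rate:.3f}")
```

Every "significant" result here is a type I error, since no study had a real effect to find.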

Type I and Type II Error in a box

Reminds me of Pascal's Wager

Review Question 1

- If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we're wrong (i.e., that the hypothesis test gave us a false positive)?
- .03
- .06
- Cannot tell
- 1.96
- 95%

Review Question 2

- Standard error is:
- For a given variable, its standard deviation divided by the square root of n.
- A measure of the variability of a sample statistic.
- The inverse of sample size.
- A measure of the variability of a characteristic.
- All of the above.

Review Question 3

- A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:
- The treatments are equally effective.
- Neither treatment is effective.
- The study lacked sufficient power to detect a difference.
- The null hypothesis should be rejected.
- There is not enough evidence to reject the null hypothesis.

Review Question 4

- Following the introduction of a new treatment regime in a rehab facility, alcoholism "cure" rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value < .005). It follows that:
- The improvement in treatment outcome is clinically important.
- The new regime cannot be worse than the old treatment.
- Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
- All of the above.
- None of the above.

Statistical Power

- Statistical power is the probability of finding an effect if it's real.

Can we quantify how much power we have for given sample sizes?

Study 1: 263 cases, 1241 controls

Null distribution: difference = 0.

Clinically relevant alternative: difference = 10%.

Study 1: 263 cases, 1241 controls

Power: chance of being in the rejection region if the alternative is true = area to the right of this line (in yellow)

Power here: > 80%

Study 1: 50 cases, 50 controls

Power closer to 20% now.

Study 2: 18 treated, 72 controls, std. dev. = 2

Clinically relevant alternative: difference = 4 points

Power is nearly 100%!

Study 2: 18 treated, 72 controls, std. dev. = 10

Power is about 40%

Study 2: 18 treated, 72 controls, effect size = 1.0

Clinically relevant alternative: difference = 1 point

Power is about 50%
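The power figures above were read off plots that are not in this transcript. They can be roughly approximated with a normal-approximation calculation, sketched below. The helper names are hypothetical, the two-sided significance level of .05 and the null proportion of 0.378 are assumptions, and the tiny chance of rejecting in the wrong direction is ignored.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

def approx_power(diff, se, alpha=0.05):
    """Approximate power of a two-sided test: P(Z > z_crit - diff/se)."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(abs(diff) / se - z_crit)

def se_props(p, n1, n2):
    """Standard error of a difference in proportions with common proportion p."""
    return sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def se_means(sd, n1, n2):
    """Standard error of a difference in means with common standard deviation sd."""
    return sqrt(sd**2 / n1 + sd**2 / n2)

power = {
    "study 1, 263 vs 1241, diff 10%": approx_power(0.10, se_props(0.378, 263, 1241)),
    "study 1, 50 vs 50, diff 10%": approx_power(0.10, se_props(0.378, 50, 50)),
    "study 2, SD 2, diff 4": approx_power(4.0, se_means(2.0, 18, 72)),
    "study 2, SD 10, diff 4": approx_power(4.0, se_means(10.0, 18, 72)),
    "study 2, SD 2, diff 1": approx_power(1.0, se_means(2.0, 18, 72)),
}
for label, value in power.items():
    print(f"{label}: power = {value:.2f}")
```

Because the slide values are rounded readings from figures, small differences between them and this approximation are to be expected.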

Factors Affecting Power

- 1. Size of the effect
- 2. Standard deviation of the characteristic
- 3. Sample size
- 4. Significance level desired

1. Bigger difference from the null mean

2. Bigger standard deviation

3. Bigger Sample Size

4. Higher significance level

Sample size calculations

- Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level.

Simple formula for difference in proportions

Simple formula for difference in means
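The formulas on these two slides were images and are missing from the transcript. The sketch below uses one commonly taught textbook form for equal-sized groups, which may differ in detail from the slides: n per group = 2 × variance × (Z for power + Z for alpha/2)² / difference², with p̄(1 − p̄) standing in for the variance in the proportions case. Function names and default values are illustrative.

```python
from math import ceil
from statistics import NormalDist

def z_sum(power=0.80, alpha=0.05):
    """Z for the desired power plus Z for a two-sided significance level."""
    norm = NormalDist()
    return norm.inv_cdf(power) + norm.inv_cdf(1 - alpha / 2)

def n_per_group_means(sd, diff, power=0.80, alpha=0.05):
    """Per-group n for a difference in means: 2 * sd^2 * z_sum^2 / diff^2."""
    return ceil(2 * sd**2 * z_sum(power, alpha)**2 / diff**2)

def n_per_group_props(p1, p2, power=0.80, alpha=0.05):
    """Per-group n for a difference in proportions, using the average proportion."""
    p_bar = (p1 + p2) / 2
    return ceil(2 * p_bar * (1 - p_bar) * z_sum(power, alpha)**2 / (p1 - p2)**2)

# Example inputs taken from the two studies in this deck
n_props = n_per_group_props(0.46, 0.36)       # detect 46% vs. 36%
n_means = n_per_group_means(sd=2.0, diff=1.0)  # detect a 1-point difference, SD 2
print(f"per-group n, proportions: {n_props}")
print(f"per-group n, means:       {n_means}")
```

Raising the desired power or lowering the significance level increases the Z terms, and therefore the required sample size.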

Sample size calculators on the web

- http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
- http://calculators.stat.ucla.edu
- http://hedwig.mgh.harvard.edu/sample_size/size.html

These sample size calculations are idealized

- They do not account for losses to follow-up (prospective studies).
- They do not account for non-compliance (for intervention trials or RCTs).
- They assume that individuals are independent observations (not true in clustered designs).
- Consult a statistician!

Review Question 5

- Which of the following elements does not increase statistical power?
- Increased sample size
- Measuring the outcome variable more precisely
- A significance level of .01 rather than .05
- A larger effect size

Review Question 6

- Most sample size calculators ask you to input a value for σ. What are they asking for?
- The standard error
- The standard deviation
- The standard error of the difference
- The coefficient of deviation
- The variance

Review Question 7

- For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?
- a. Standard deviation
- b. Mean change
- c. Effect size
- d. Standard error
- e. Significance level

Homework

- Problem Set 3
- Reading: continue reading textbook
- Reading: p-value article
- Journal article/article review sheet