Basic Statistical Principles for the Clinical Research Scientist

Kristin Cobb

October 13 and October 20, 2004

Statistics in Medical Research

- 1. Design phase
- Statistics starts in the planning stages of a clinical trial or laboratory experiment to
- establish the optimal sample size needed
- ensure sound study design
- 2. Analysis phase
- Make inferences about a wider population.

Common problems with statistics in medical research

- Sample size too small to find an effect (design phase problem)
- Sub-optimal choice of measurement for predictors and outcomes (design phase problem)
- Inadequate control for confounders (design or analysis problem)
- Statistical analyses inadequate (analysis problem)
- Incorrect statistical test used (analysis problem)
- Incorrect interpretation of computer output (analysis problem)
- Therefore, it is essential to collaborate with a statistician both during planning and analysis!

Additionally, errors arise when

- The statistical content of the paper is confusing or misleading because the authors do not fully understand the statistical techniques used by the statistician.
- The statistician performs inadequate or inappropriate analyses because she is unclear about the questions the research is designed to answer.
- Therefore, clinical research scientists need to understand the basic principles of biostatistics.

Outline (today and next week)

- 1. Primer on hypothesis testing, p-values, confidence intervals, statistical power.
- 2. Biostatistics in Practice: Applying statistics to clinical research design

Quick review

- Standard deviation
- Histograms (frequency distributions)
- Normal distribution (bell curve)

Review: Standard deviation

Standard deviation tells you how variable a characteristic is in a population. For example, how variable is height in the US? The standard deviation of height represents the average distance that a random person is away from the mean height in the population.

Review: Histograms (1-inch bins)

Review: Normal Distribution

In fact, here, 101/150 (67%) subjects have heights between 62.7 and 67.7 (1 standard deviation below and above the mean).

A perfect, theoretical normal distribution carries 68% of its area within 1 standard deviation of the mean.

Review: Normal Distribution

In fact, here, 146/150 (97%) subjects have heights between 60.2 and 70.2 (2 standard deviations below and above the mean).

A perfect, theoretical normal distribution carries 95% of its area within 2 standard deviations of the mean.

Review: Normal Distribution

In fact, here, 150/150 (100%) subjects have heights between 57.7 and 72.7 (3 standard deviations below and above the mean).

A perfect, theoretical normal distribution carries 99.7% of its area within 3 standard deviations of the mean.

Review: Applying the normal distribution

- If women's heights in the US are normally distributed with a mean of 65 inches and a standard deviation of 2.5 inches, what percentage of women do you expect to have heights above 6 feet (72 inches)?

From a standard normal chart or computer: a Z of 2.8 corresponds to a right-tail area of .0026, so we expect 2-3 women per 1000 to have heights of 6 feet or greater.
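The arithmetic above can be checked with a few lines of code. This is a sketch using only the Python standard library (scipy.stats.norm.sf would give the same answer):

```python
# P(height > 72") for a Normal(mean=65, sd=2.5) distribution,
# using the complementary error function from the standard library.
import math

def normal_sf(x, mean, sd):
    """Right-tail area P(X > x) for a normal distribution."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

z = (72 - 65) / 2.5                     # Z = 2.8
p = normal_sf(72, mean=65, sd=2.5)
print(f"Z = {z:.1f}, right-tail area = {p:.4f}")  # ~.0026, i.e. 2-3 women per 1000
```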

Statistics Primer

- Statistical Inference
- Sample statistics
- Sampling distributions
- Central limit theorem
- Hypothesis testing
- P-values
- Confidence intervals
- Statistical power

Statistical Inference: The process of making guesses about the truth from a sample.

- EXAMPLE: What is the average blood pressure of US post-docs?
- We could go out and measure blood pressure in every US post-doc (thousands).
- Or, we could take a sample and make inferences about the truth from our sample.

Using what we observe: 1. We can test an a priori guess (hypothesis testing). 2. We can estimate the true value (confidence intervals).

Statistical Inference is based on Sampling Variability

- Sample Statistic: we summarize a sample into one number; e.g., it could be a mean, a difference in means or proportions, an odds ratio, or a correlation coefficient
- E.g.: average blood pressure of a sample of 50 American men
- E.g.: the difference in average blood pressure between a sample of 50 men and a sample of 50 women
- Sampling Variability: If we could repeat an experiment many, many times on different samples with the same number of subjects, the resultant sample statistic would not always be the same (because of chance!).
- Standard Error: a measure of the sampling variability

Examples of Sample Statistics

- Single population mean
- Difference in means (t-test)
- Difference in proportions (Z-test)
- Odds ratio/risk ratio
- Correlation coefficient
- Regression coefficient

Variability of a sample mean

Random Postdocs

The Truth (not knowable)

Variability of a sample mean

Random samples of 5 post-docs

The Truth (not knowable)

Variability of a sample mean

Samples of 50 Postdocs

The Truth (not knowable)

129 mmHg

134 mmHg

131 mmHg

130 mmHg

128 mmHg

130 mmHg

Variability of a sample mean

Samples of 150 Postdocs

The Truth (not knowable)

131.2 mmHg

130.2 mmHg

129.7 mmHg

130.9 mmHg

130.4 mmHg

129.5 mmHg

How sample means vary: A computer experiment

- 1. Pick any probability distribution and specify a mean and standard deviation.
- 2. Tell the computer to randomly generate 1000 observations from that probability distribution.
- E.g., the computer is more likely to spit out values with high probabilities
- 3. Plot the observed values in a histogram.
- 4. Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 and take their average) from that probability distribution. Plot the observed averages in histograms.
- 5. Repeat for averages-of-5, and averages-of-100.

Uniform on [0,1]: average of 1 (original distribution)

Uniform: 1000 averages of 2

Uniform: 1000 averages of 5

Uniform: 1000 averages of 100

Exp(1): average of 1 (original distribution)

Exp(1): 1000 averages of 2

Exp(1): 1000 averages of 5

Exp(1): 1000 averages of 100

Bin(40, .05): average of 1 (original distribution)

Bin(40, .05): 1000 averages of 2

Bin(40, .05): 1000 averages of 5

Bin(40, .05): 1000 averages of 100

The Central Limit Theorem

- If all possible random samples, each of size n, are taken from any population with a mean μ and a standard deviation σ, the sampling distribution of the sample means (averages) will:

1. have a mean equal to μ

2. have a standard deviation (the standard error) equal to σ/√n

3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n)

Example 1: Weights of doctors

- Experimental question: Are practicing doctors setting a good example for their patients in their weights?
- Experiment: Take a sample of practicing doctors and measure their weights
- Sample statistic: mean weight for the sample
- IF weight is normally distributed in doctors with a mean of 150 lbs and standard deviation of 15, how much would you expect the sample average to vary if you could repeat the experiment over and over?
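By the central limit theorem, the sample average varies with standard error sd/√n. A quick sketch for the sample sizes used in the slides that follow (2, 10, and 100 doctors):

```python
# Standard error of the mean weight for different sample sizes,
# assuming a population sd of 15 lbs (as stated above).
import math

sd = 15  # lbs

for n in (2, 10, 100):
    se = sd / math.sqrt(n)
    print(f"n = {n:3d}: standard error of the mean = {se:.2f} lbs")
# n=2 -> 10.61, n=10 -> 4.74, n=100 -> 1.50
```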

Relative frequency of 1000 observations of weight: mean = 150 lbs, standard deviation = 15 lbs


Using Sampling Variability

- In reality, we only get to take one sample!!
- But, since we have an idea about how sampling variability works, we can make inferences about the truth based on one sample.

Experimental results

- Let's say we take one sample of 100 doctors and calculate their average weight.

Expected Sampling Variability for n=100 if the true weight is 150 (and SD=15)

P-value associated with this experiment

P-value (the probability of our sample average being 160 lbs or more IF the true average weight is 150) < .0001. This gives us evidence that 150 isn't a good guess.

The P-value

- The p-value is the probability that we would have seen our data (or something more unexpected) just by chance if the null hypothesis (null value) is true.
- Small p-values mean the null value is unlikely given our data.

The P-value

- By convention, p-values of <.05 are often accepted as statistically significant in the medical literature, but this is an arbitrary cut-off.
- A cut-off of p<.05 means that in about 5 of 100 experiments, a result would appear significant just by chance (Type I error).

Hypothesis Testing

- The Steps:
- 1. Define your hypotheses (null, alternative)
- The null hypothesis is the "straw man" that we are trying to shoot down.
- Null here: mean weight of doctors = 150 lbs
- Alternative here: mean weight > 150 lbs (one-sided)
- 2. Specify your sampling distribution (under the null)
- If we repeated this experiment many, many times, the sample average weights would be normally distributed around 150 lbs with a standard error of 1.5
- 3. Do a single experiment (observed sample mean = 160 lbs)
- 4. Calculate the p-value of what you observed (p<.0001)
- 5. Reject or fail to reject the null hypothesis (reject)
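The five steps above can be sketched as a one-sided Z-test with the numbers from this example (null mean 150 lbs, observed sample mean 160 lbs, n = 100, sd = 15):

```python
# One-sided Z-test: how surprising is a sample mean of 160 under the null?
import math

null_mean, observed_mean = 150, 160
se = 15 / math.sqrt(100)                 # standard error = 1.5
z = (observed_mean - null_mean) / se     # ~6.67
p = 0.5 * math.erfc(z / math.sqrt(2))    # one-sided p-value
print(f"z = {z:.2f}, p = {p:.2g}")       # p << .0001 -> reject the null
```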

Errors in Hypothesis Testing

- Type-I Error (false positive):
- Concluding that the observed effect is real when it's just due to chance.
- Type-II Error (false negative):
- Missing a real effect.
- POWER (the complement of type-II error):
- The probability of seeing a real effect (of rejecting the null if the null is false).

Beyond Hypothesis Testing: Estimation (confidence intervals)

We'd estimate based on these data that the average weight is somewhere closer to 160 lbs. And we could state the precision of this estimate (a confidence interval).

Confidence Intervals

- (Sample statistic) ± (measure of how confident we want to be) × (standard error)

Confidence interval (more information!!)

- 95% CI for the mean:
- 160 ± 1.96(1.5) = (157, 163)
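Plugging in the numbers above (mean 160, standard error 1.5, z = 1.96 for 95% confidence):

```python
# 95% confidence interval: sample statistic +/- z * standard error.
mean, se = 160, 1.5
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: ({lo:.0f}, {hi:.0f})")   # (157, 163)
```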

What Confidence Intervals do

- They indicate the degree of un/certainty about the size of a population characteristic or effect. Wider CIs indicate less certainty.
- Confidence intervals can also answer the question of whether or not an association exists or a treatment is beneficial or harmful (analogous to p-values).
- E.g., since the 95% CI of the mean weight does not cross 150 lbs (the null value), we reject the null at p<.05.

Expected Sampling Variability for n=2

Expected Sampling Variability for n=10

Statistical Power

- We found the same sample mean (160 lbs) in our 100-doctor sample, 10-doctor sample, and 2-doctor sample.
- But we only rejected the null based on the 100-doctor and 10-doctor samples.
- Larger samples give us more statistical power. Can we quantify how much power we have for given sample sizes?


Null Distribution: mean = 150, sd = 4.74

Clinically relevant alternative: mean = 160, sd = 4.74


Null Distribution: mean = 150, sd = 1.37

Nearly 100% power!

Clinically relevant alternative: mean = 160, sd = 1.37

Factors Affecting Power

- 1. Size of the difference (e.g., 10 pounds higher)
- 2. Standard deviation of the characteristic (sd = 15)
- 3. Bigger sample size
- 4. Significance level desired
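For the doctors' weights example, one-sided power can be sketched as P(Z > z_alpha − effect/se). The two se values (4.74 and 1.37) are the ones from the null-distribution slides above; the one-sided alpha of .05 is an assumption:

```python
# Power of a one-sided Z-test to detect a true mean of 160 when the null is 150.
import math

def normal_sf(z):
    """Right-tail area of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def power(effect, se, z_alpha=1.645):   # one-sided alpha = .05 (assumed)
    return normal_sf(z_alpha - effect / se)

print(f"se = 4.74: power = {power(10, 4.74):.2f}")   # ~.68
print(f"se = 1.37: power = {power(10, 1.37):.2f}")   # ~1.00 ("nearly 100% power")
```

Shrinking the standard error (a bigger sample) is what pushes power toward 100%.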

1. Bigger difference from the null mean

2. Bigger standard deviation

3. Bigger Sample Size

4. Higher significance level

Examples of Sample Statistics

- Single population mean
- Difference in means (t-test)
- Difference in proportions (Z-test)
- Odds ratio/risk ratio
- Correlation coefficient
- Regression coefficient

Example 2: Difference in means

- Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.

The Experiment (note: exact numbers have been altered)

- Grade 3 at Oak School were given an IQ test at the beginning of the academic year (n=90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
- BUT the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same I.Q. test was re-administered.

The results

- Children who had been randomly assigned to the top-20-percent list had a mean I.Q. increase of 12.2 points (sd=2.0), vs. children in the control group, who had an increase of only 8.2 points (sd=2.0)
- Is this a statistically significant difference? Give a confidence interval for this difference.
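A sketch of the two-sample comparison, assuming a pooled sd of 2.0 and group sizes of 18 "bloomers" vs. 72 controls (18 + 72 = the 90 children above):

```python
# Two-sample difference in means: standard error, t statistic, and 95% CI.
import math

n1, n2, sd = 18, 72, 2.0
diff = 12.2 - 8.2                              # 4.0 IQ points
se = sd * math.sqrt(1/n1 + 1/n2)               # ~0.53 (pooled-sd assumption)
t = diff / se                                  # ~7.6, far beyond any cutoff
lo, hi = diff - 1.99 * se, diff + 1.99 * se    # 1.99 ~ t critical value, 88 df
print(f"diff = {diff}, se = {se:.2f}, t = {t:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```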

Difference in means

- Sample statistic: Difference in mean change in IQ test score.
- Null hypothesis: no difference between academic bloomers and normal students

Explore sampling distribution of difference in means

- Simulate 1000 differences in mean IQ change under the null hypothesis (both academic bloomers and controls improve by, let's say, 8 points, with a standard deviation of 2.0)

Histograms: academic bloomers, normal students, and the difference (academic bloomers - normal students)

Notice that most experiments yielded a difference value between -1.1 and 1.1 (wider than the above sampling distributions!)

Confidence interval (more information!!)

- 95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)

Does not cross 0; therefore, significant at .05.

95% confidence interval for the observed difference: 4 ± 2(.52) = (3, 5)

Clearly lots of power to detect a difference of 4!

- How much power to detect a difference of 1.0? Power is closer to 50% now.

Example 3: Difference in proportions

- Experimental question: Do men tend to prefer Bush more than women?
- Experimental design: Poll representative samples of men and women in the U.S. and ask them the question, "Do you plan to vote for Bush in November, yes or no?"
- Sample statistic: The difference in the proportion of men who are pro-Bush versus women who are pro-Bush
- Null hypothesis: the difference in proportions = 0
- Observed results: women = .36, men = .46
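A sketch of the corresponding Z-test for two proportions (.46 vs. .36). The group size is not stated on the slide; the spread of the simulated null distribution that follows suggests roughly 50 respondents per group, which is assumed here:

```python
# Two-proportion Z-test, pooling under the null hypothesis.
import math

p_men, p_women, n = 0.46, 0.36, 50     # n per group is an assumption
p_pool = (p_men + p_women) / 2         # .41, pooled estimate under the null
se = math.sqrt(2 * p_pool * (1 - p_pool) / n)   # ~.098
z = (p_men - p_women) / se                      # ~1.0 -> not significant
p_value = math.erfc(z / math.sqrt(2))           # two-sided p-value
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

With samples this small, a 10-point gap is well within chance variation; the larger samples explored below change that.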

Explore sampling distribution of difference in proportions

- Simulate 1000 differences in proportion preferring Bush under the null hypothesis (41% overall prefer Bush, with no difference between genders)

Histograms: men, women. Under the null hypothesis, most experiments yielded a mean between .27 and .55.

Difference (men - women): Under the null hypothesis, most experiments yielded difference values between -.20 (women preferring Bush more than men) and .20 (men preferring Bush more).

- What if we had 200 men and 200 women?

Histograms: men, women. Most of 1000 simulated experiments yielded a mean between .34 and .48.

Difference (men - women): Notice that most experiments will yield a difference value between -.10 (women preferring Bush more than men) and .10 (men preferring Bush more).

- What if we had 800 men and 800 women?

Histograms: men, women. Most experiments will yield a mean between .38 and .44.

Difference (men - women): Notice that most experiments will yield a difference value between -.05 (women preferring Bush more than men) and .05 (men preferring Bush more).

If we sampled 1600 per group, a 2.5% difference would be statistically significant at a significance level of .05. If we sampled 3200 per group, a 1.25% difference would be statistically significant at a significance level of .05. If we sampled 6400 per group, a .625% difference would be statistically significant at a significance level of .05. BUT if we found a significant difference of 1% between men and women, would we care if we were Bush or Kerry??

Limits of hypothesis testing: Statistical vs. Clinical Significance

Consider a hypothetical trial comparing death rates in 12,000 patients with multi-organ failure receiving a new inotrope with 12,000 patients receiving usual care. If there was a 1% reduction in mortality in the treatment group (49% deaths versus 50% in the usual care group), this would be statistically significant (p<.05), because of the large sample size. However, such a small difference in death rates may not be clinically important.

Example 4: The odds ratio

- Experimental question: Does smoking increase fracture risk?
- Experiment: Ask 50 patients with fractures and 50 controls if they ever smoked.
- Sample statistic: Odds Ratio (measure of relative risk)
- Null hypothesis: There is no association between smoking and fractures (odds ratio = 1.0).

The Odds Ratio (OR)

Example 4: Sampling Variability of the null Odds Ratio (OR) (50 cases/50 controls/20 exposed)

If the Odds Ratio = 1.0, then with 50 cases and 50 controls, of whom 20 smoke, this is the expected variability of the sample OR. Note the right skew.

The Sampling Variability of the natural log of the OR (lnOR) is more Gaussian.
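Because the lnOR is closer to Gaussian, confidence intervals are usually built on the log scale and then exponentiated. A sketch with a hypothetical 2×2 table (the counts below are illustrative, not from the slides):

```python
# Odds ratio and 95% CI on the log scale, from a hypothetical 2x2 table.
import math

a, b = 14, 36   # cases: smokers, non-smokers (hypothetical counts)
c, d = 6, 44    # controls: smokers, non-smokers (hypothetical counts)

odds_ratio = (a * d) / (b * c)            # (14*44)/(36*6) ~ 2.85
ln_or = math.log(odds_ratio)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Woolf's standard error of ln(OR)
lo, hi = math.exp(ln_or - 1.96 * se), math.exp(ln_or + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```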

Statistical Power

- Statistical power here is the probability of concluding that there is an association between exposure and disease if an association truly exists.
- The stronger the association, the more likely we are to pick it up in our study.
- The more people we sample, the more likely we are to conclude that there is an association if one exists (because the sampling variability is reduced).

Part II. Biostatistics in Practice: Applying statistics to clinical research design

From concept to protocol

- Define your primary hypothesis
- Define your primary predictor and outcome variables
- Decide on study type (cross-sectional, case-control, cohort, RCT)
- Decide how you will measure your predictor and outcome variables, balancing statistical power, ease of measurement, and potential biases
- Decide on the main statistical tests that will be used in analysis
- Calculate sample size needs for your chosen statistical test(s)
- Describe your sample size needs in your written protocol, disclosing your assumptions
- Write a statistical analysis plan
- Briefly describe the descriptive statistics that you plan to present
- Describe which statistical tests you will use to test your primary hypotheses
- Describe which statistical tests you will use to test your secondary hypotheses
- Describe how you will account for confounders and test for interactions
- Describe any exploratory analyses that you might perform

Powering a study: What is the primary hypothesis?

- Before you can calculate sample size, you need to know the primary statistical analysis that you will use in the end.
- What is your main outcome of interest?
- What is your main predictor of interest?
- Which statistical test will you use to test for associations between your outcome and your predictor?
- Do you need to adjust sample size needs upwards to account for loss to follow-up, switching arms of a randomized trial, or accounting for confounders?
- Seek guidance from a statistician

Overview of statistical tests

- The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.

(Example variables from the table: blood pressure, pounds, age, treatment (1/0))


Comparing Groups

- T-test compares two means
- (null hypothesis: difference in means = 0)
- ANOVA compares means between >2 groups
- (null hypothesis: difference in means = 0)
- Non-parametric tests are used when normality assumptions are not met
- (null hypothesis: difference in medians = 0)
- Chi-square test compares proportions between groups
- (null hypothesis: categorical variables are independent)

Simple sample size formulas/calculators available

- Sample size for a difference in means
- Sample size for a difference in proportions
- Can roughly be used if you plan to calculate risk ratios or odds ratios, or to run logistic regression or chi-square tests
- Sample size for a hazard ratio/log-rank test
- If you plan to do survival analysis: Kaplan-Meier methods (log-rank test), Cox regression


The pay-off for sitting through the theoretical part of these lectures!

- Here's where it pays to understand what's behind sample size/power calculations!
- You'll have a much easier time using sample size calculators if you aren't just putting numbers into a black box!


If this looks complicated, don't panic!

- In reality, you're unlikely to have to derive sample size formulas yourself
- but it's critical to understand where they come from if you're going to apply them yourself.

Formula for difference in means

Formula for difference in proportions

Formula for hazard ratio/log-rank test
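For the first case (difference in means), the standard sample size formula can be sketched as follows; whether this matches the slides' exact formula is an assumption, and a two-sided alpha of .05 with 80% power is assumed:

```python
# n per group to detect a mean difference `delta` with a two-sample test:
# n = 2 * (z_alpha + z_beta)^2 * sd^2 / delta^2
import math

def n_per_group(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Sample size per group: two-sided alpha = .05, 80% power (assumed)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# e.g., detect a 10-lb difference in weight when sd = 15 lbs:
print(n_per_group(delta=10, sd=15))   # 36 per group
```

Note how the formula encodes the power factors listed earlier: a smaller delta or a larger sd inflates n, and stricter alpha or higher power raises the z values.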

Recommended sample size calculators!

- http://hedwig.mgh.harvard.edu/sample_size/size.html
- http://vancouver.stanford.edu:8080/clio/index.html
- Traverse protocol wizard

These sample size calculations are idealized

- We have not accounted for losses to follow-up
- We have not accounted for non-compliance (for an intervention trial or RCT)
- We have assumed that individuals are independent observations (not true in clustered designs)
- Consult a statistician for these considerations!

Applying statistics to clinical research design: Example

- You want to study the relationship between smoking and fractures.

Steps

- Define your primary hypothesis
- Define your primary predictor and outcome variables
- Decide on study type

Applying statistics to clinical research design: Example

- Predictor: smoking (yes/no or continuous)
- Outcome: osteoporotic fracture (time-to-event)
- Study design: cohort

From concept to protocol

- Decide how you will measure your predictor and outcome variables
- Decide on the main statistical tests that will be used in analysis
- Calculate sample size needs for your chosen statistical test(s)


Formula for hazard ratio/log-rank test

Example sample size calculation

- Ratio of exposed to unexposed in your sample? 1:1
- Proportion of non-smokers who will fracture in your defined population over your defined study period? 10%
- What is a clinically meaningful hazard ratio? 2.0
- Based on the hazard ratio, how many smokers will fracture? 1 - (.90)^2 = .19, i.e., 19%
- What power are you targeting? 80%
- What significance level? .05

Formula for hazard ratio/log-rank test
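One common approximation for this calculation is Schoenfeld's formula for the log-rank test; whether it matches the slide's exact formula is an assumption. Applied to the numbers above (HR = 2.0, 1:1 allocation, 80% power, two-sided alpha = .05):

```python
# Schoenfeld's approximation: required events, then required subjects.
import math

hr = 2.0                       # clinically meaningful hazard ratio
p_unexp, p_exp = 0.10, 0.19    # fracture proportions (19% = 1 - .90**2)
z_alpha, z_beta = 1.96, 0.84   # two-sided alpha = .05, 80% power

# Events needed, with 1:1 allocation (the 0.5 * 0.5 term):
events = (z_alpha + z_beta) ** 2 / (0.5 * 0.5 * math.log(hr) ** 2)
# Subjects needed = events / average probability of an event:
n_total = events / ((p_unexp + p_exp) / 2)
print(f"~{math.ceil(events)} fracture events, ~{math.ceil(n_total)} subjects total")
```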

You may want to adjust upwards for loss to follow-up. E.g., if you expect to lose 10%, divide the above estimate by .90.

From concept to protocol

- Describe your sample size needs in your written protocol, disclosing your assumptions
- Write a statistical analysis plan


Statistical analysis plan

- Descriptive statistics
- E.g., characteristics of the study population by smoking status
- Kaplan-Meier curves (univariate)
- Describe exploratory analyses that may be used to identify confounders and other predictors of fracture
- Cox regression (multivariate)
- What confounders have you measured, and how will you incorporate them into multivariate analysis?
- How will you explore for possible interactions?
- Describe potential exploratory analyses for other predictors of fracture