Title: Biostatistics in Practice
1Biostatistics in Practice
Session 3 Testing Hypotheses
Peter D. Christenson, Biostatistician
http://research.LABioMed.org/Biostat
2Session 3 Preparation
We have been using a recent study on
hyperactivity for the concepts in this course.
The questions below based on this paper are
intended to prepare you for session 3.
3Session 3 Preparation
1. From Figures 1 and 2, we see that 153/209 = 73% of parents of the younger children and 144/160 = 90% of parents of the older children initially were interested but did not participate. Does it seem logical that the rate is lower for the 3-year-olds? Do you have any intuition on whether the magnitude of the 73% vs. 90% difference is enough to support an age difference, regardless of the logical reason?
4Session 3 Preparation 1
[Figures 1 and 2, consent step: 153/209 = 73% (younger children) vs. 144/160 = 90% (older children).]
5Session 3 Preparation 1
[Figures 1 and 2, consent step: 153/209 = 73% (younger children) vs. 144/160 = 90% (older children).]
It is not intuitive whether 73% vs. 90% is a real difference, i.e., one that is reproducible or extrapolates to other persons.
6Session 3 Preparation 1
[Figures 1 and 2, consent step: 153/209 = 73% (younger children) vs. 144/160 = 90% (older children).]
Hypothesis testing compares 73% and 90%. It does not say how precise the %s are.
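As a rough illustration of what "comparing 73% and 90%" can look like in software, here is a minimal sketch using a large-sample two-proportion z-test. The choice of test is an assumption made for illustration (the slides do not say which test would be applied to these counts), and the function name `two_proportion_z` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p) for the hypothesis that both groups share one rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate if the groups do not differ
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Counts quoted in the question: younger 153/209, older 144/160.
z, p = two_proportion_z(153, 209, 144, 160)
print(f"z = {z:.2f}, two-sided p = {p:.5f}")
```

The result is a Yes/No style comparison of the two percentages; it says nothing about how precisely each percentage is estimated, which is the job of a confidence interval.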
7Session 3 Preparation
2. Look at the left side of the bottom panel of Figure 3 and recall what we have said about confidence intervals. Would you conclude that there is a change in hyperactivity under Mix A?
3. Repeat question 2 for placebo.
8Session 3 Preparation 2 and 3
9Session 3 Preparation 2 and 3
Possible values for real effect.
Zero is ruled out.
10Session 3 Preparation
4. Do you think that the positive conclusion for question 3 has been "proven"?
5. Do you think that the negative conclusion for question 2 has been "proven"?
11Session 3 Preparation
4. Do you think that the positive conclusion for question 3 has been "proven"? Yes, with 95% confidence.
5. Do you think that the negative conclusion for question 2 has been "proven"? No, since more subjects would give a narrower confidence interval.
Hypothesis testing makes a Yes or No conclusion about whether there is an effect, and quantifies the chances of a correct conclusion either way. Confidence intervals give possible magnitudes of effects.
12Session 3 Goals
- Statistical testing concepts
- Three most common tests
- Software
- Equivalence of testing and confidence intervals
- False positive and false negative conclusions
13Session 3 Data
For this session, we will focus on another paper
for which I have the raw data. Paper is posted on
our class website. Subjects were hospitalized for
many days, blood samples taken every 8 hours and
vital signs recorded every hour. A subject is adrenal insufficient (AI) if 2 successive serum cortisols are low.
14Goal Do Groups Differ By More than is Expected
By Chance?
Cohan (2005) Crit Care Med 33:2358-66.
15Goal Do Groups Differ By More than is Expected
By Chance?
First, need to:
- Specify experimental units (Persons? Blood draws?).
- Specify single outcome for each unit (e.g., Yes/No, mean or min of several measurements?).
- Examine raw data, e.g., histogram, for meeting test requirements.
- Specify group summary measure to be used (e.g., % or mean, median over units).
- Choose particular statistical test for the outcome.
16Outcome Type → Statistical Test
Medians → Wilcoxon test
%s → Chi-square test
Means → t-test
Cohan (2005) Crit Care Med 33:2358-66.
17Minimal MAP Group Distributions of Individual
Units
AI Group (N = 42)
Stem Leaf          #
   7 6             1
   7 11334         5
   6 555           3
   6 01112344      8
   5 5566778       7
   5 01222234      8
   4 57788         5
   4 23            2
   3 6             1
   3 13            2
     ----------------
Multiply Stem.Leaf by 10

Non-AI Group (N = 38)
Stem Leaf          #
   7 79            2
   7 00111234      8
   6 5556777888   10
   6 00112234      8
   5 67999         5
   5 3             1
   4 79            2
   4 04            2
     ----------------
Multiply Stem.Leaf by 10

→ Approximately normally distributed → Use means to summarize groups → Use t-test to compare means.
18Goal Do Groups Differ By More than is Expected
By Chance?
Next, need to:
1. Calculate a standardized quantity for the particular test, a "test statistic". Often t = (Diff in Group Means)/SE(Diff).
2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often t is approximately a normal bell curve.
3. Declare groups to differ if the test statistic is too deviant from expectations in (2) above. Often: absolute value of t > 2.
19t-Test for Minimal MAP Step 1
- Calculate a standardized quantity for the
particular test, a test statistic.
AI: N = 42, Mean = 56.1666667, Std Dev = 10.7824634, SE(Mean) = 1.66 = 10.78/√42
Non-AI: N = 38, Mean = 63.4122807, Std Dev = 8.7141575, SE(Mean) = 1.41 = 8.71/√38
Diff in Group Means = 63.4 - 56.2 = 7.2 (Signal)
SE(Diff) ≈ sqrt(SEM1² + SEM2²) = sqrt(1.66² + 1.41²) = 2.2 (Noise)
Signal to Noise Ratio → Test Statistic: t = (7.2 - 0)/2.2 = 3.28
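The signal-to-noise arithmetic above can be reproduced from the quoted summary statistics. The sketch below computes the test statistic two ways: with the simple combination of the two standard errors shown on the slide, and with an equal-variance (pooled) standard error. Treating the pooled version as the source of the reported 3.28 is an assumption, and the variable names are illustrative.

```python
import math

# Summary statistics quoted on the slide.
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

diff = mean_non - mean_ai  # the signal

# Slide's approximation: combine the two standard errors of the means.
se_approx = math.sqrt(sd_ai**2 / n_ai + sd_non**2 / n_non)

# Equal-variance (pooled) standard error, assumed here for comparison.
pooled_var = ((n_ai - 1) * sd_ai**2 + (n_non - 1) * sd_non**2) / (n_ai + n_non - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_ai + 1 / n_non))

t_approx = diff / se_approx
t_pooled = diff / se_pooled
print(f"diff = {diff:.2f}")
print(f"SE (approx) = {se_approx:.2f}, t = {t_approx:.2f}")
print(f"SE (pooled) = {se_pooled:.2f}, t = {t_pooled:.2f}")
```

Both versions give a noise estimate of about 2.2 and a test statistic a little above 3, so the conclusion does not hinge on which formula is used here.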
20t-Test for Minimal MAP Step 2
- Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often t is approximately a normal bell curve.
[Figure: bell curve of expected values for the test statistic if groups do not differ. Area under a section of the curve = probability of values in that interval (e.g., 0.5 for 0 to ∞). Prob(-2 to -1) = area = 0.14. Expected region: 0.95 chance. Observed = 3.28.]
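The areas quoted for the bell curve can be checked with the standard normal distribution. This is a small sketch using Python's standard library; the helper name `area` is illustrative, and using the normal curve in place of the exact t distribution is the same approximation the slide makes.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area(lo, hi):
    """Probability that a standard normal value falls between lo and hi."""
    return z.cdf(hi) - z.cdf(lo)

area_m2_m1 = area(-2, -1)      # the "-2 to -1" slice
area_within_2 = area(-2, 2)    # the central "expected" region
area_upper_half = 1 - z.cdf(0) # 0 to infinity

print(f"P(-2 to -1)     = {area_m2_m1:.2f}")
print(f"P(-2 to 2)      = {area_within_2:.2f}")
print(f"P(0 to infinity) = {area_upper_half:.2f}")
```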
21t-Test for Minimal MAP Step 3
- Declare groups to differ if test statistic is too deviant. How much?
Convention: "Too deviant" is a < 5% chance → |t| > 2. "Two-tailed" = the 5% is allocated equally for either group to be superior.
[Figure: bell curve with the expected region (95% chance) in the center and 2.5% in each tail; observed = 3.28.]
Conclude: Groups differ, since a value as deviant as 3.28 has a < 5% chance of occurring if there is no difference in the entire populations.
22t-Test for Minimal MAP p value
- Declare groups to differ if test statistic is too deviant. How much?
p-value: Probability of a test statistic at least as deviant as observed, if populations really do not differ. Smaller values → more evidence of group differences.
p-value = 2(0.0007) = 0.0014 << 0.05
[Figure: bell curve with the expected region (95% chance) in the center; area = 0.0007 in each tail beyond the observed value; observed = 3.28.]
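To make the tail-area idea concrete, the sketch below computes a two-sided p-value for the observed t = 3.28 by numerically integrating the Student's t density. It assumes the equal-variance t-test with 42 + 38 - 2 = 78 degrees of freedom; Simpson's rule stands in for the routines statistical software would use, and the function names are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_obs, df, steps=2000):
    """Area in both tails beyond |t_obs|, via Simpson's rule (steps must be even)."""
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # area from 0 to |t_obs|
    return 2 * (0.5 - central_half)

p = two_sided_p(3.28, df=42 + 38 - 2)
print(f"two-sided p = {p:.4f}")
```

The result should land close to the 0.0014 quoted above; small differences can arise from rounding of t and from which version of the t-test the software applied.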
23t-Test Technical Note
- There are actually several types of t-tests:
- Equal vs. unequal variance (variance = SD²), depending on whether the SDs are too different between the groups. Yes, there is another statistical test for comparing the SDs.
AI: N = 42, Mean = 56.1666667, Std Dev = 10.7824634, SE(Mean) = 1.66 = 10.78/√42
Non-AI: N = 38, Mean = 63.4122807, Std Dev = 8.7141575, SE(Mean) = 1.41 = 8.71/√38
SE(Diff) = sqrt(SEM1² + SEM2²) = sqrt(1.66² + 1.41²) = 2.2 is approximate. There are more complicated exact formulas that software implements.
24t-Test Another Note
- There are other types of t-tests:
- A two-sided t-test assumes that differences (between groups or pre-to-post) are possible in both directions, e.g., increase or decrease.
- A one-sided t-test assumes that these differences can only be either an increase or decrease, or one group can only have higher or lower responses than the other group. This is very rare, and generally not acceptable.
25Back to Paper Normal Range
Non-AI: SD = 8.7, N = 38. AI: SD = 10.8, N = 42.
What is the "normal range" for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?
26Back to Paper Normal Range
Non-AI: SD = 8.7, N = 38. AI: SD = 10.8, N = 42.
What is the "normal range" for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?
Answer: 56.2 ± 2(10.8) = 35 to 78
27Back to Paper Confidence Intervals
Non-AI: SD = 8.7, N = 38, SE = 1.41. AI: SD = 10.8, N = 42, SE = 1.66. SE(Diff of Means) = 2.2.
SE(Diff) = sqrt of (SEM1² + SEM2²)
Δ = 63.4 - 56.2 = 7.2 is the "best guess" for the MAP diff between the means of all AI and non-AI patients. We are 95% sure that the diff is within 7.2 ± 2 SE(Diff) = 7.2 ± 2(2.2) = 2.8 to 11.6.
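The normal range and the confidence interval use the same "center ± 2 × spread" recipe, but with different spreads: the SD of individuals for the normal range, and the SE of the estimate for the confidence interval. A minimal sketch with the rounded values from the slides (the helper name `plus_minus` is illustrative):

```python
def plus_minus(center, half_width):
    """Return the interval (center - half_width, center + half_width)."""
    return center - half_width, center + half_width

# Normal range for individual AI patients: mean +/- 2 SD.
normal_range_ai = plus_minus(56.2, 2 * 10.8)

# 95% confidence interval for the difference in means: diff +/- 2 SE(diff).
ci_diff = plus_minus(7.2, 2 * 2.2)

print("Normal range, AI patients: %.1f to %.1f" % normal_range_ai)
print("95%% CI for difference:    %.1f to %.1f" % ci_diff)
```

The normal range describes where individual patients fall and does not shrink with more subjects; the confidence interval describes uncertainty in the estimated difference and narrows as N grows.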
28Back to Paper t-test
Δ = 7.2 is statistically significant (p = 0.0014), i.e., only 14 of 1000 sets of 80 patients would differ so much, if AI and non-AI really don't differ in MAP. Is Δ = 7.2 clinically significant?
29Confidence Intervals ↔ Tests
Hyperactivity Paper
p > 0.05    p ≈ 0.05    p < 0.05
30Confidence Intervals ↔ Tests
The Algebra:
|Δ/SE(Δ)| = |t| < 2    [Hypothesis Test]
is equivalent to |Δ| < 2 SE(Δ)
is equivalent to -2 SE(Δ) < Δ < 2 SE(Δ)
is equivalent to Δ - 2 SE(Δ) < 0 < Δ + 2 SE(Δ)    [Confidence Interval]
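The equivalence can also be checked numerically: for any difference and standard error, the test declares significance exactly when the interval excludes zero. The sketch below uses the slide's cutoff of 2; the function names and all test cases other than (7.2, 2.2) are hypothetical.

```python
def is_significant(delta, se, cutoff=2.0):
    """Hypothesis-test view: is |t| = |delta / se| beyond the cutoff?"""
    return abs(delta / se) > cutoff

def ci_excludes_zero(delta, se, cutoff=2.0):
    """Confidence-interval view: does delta +/- cutoff * se miss zero?"""
    lower, upper = delta - cutoff * se, delta + cutoff * se
    return not (lower <= 0 <= upper)

# (difference, SE) pairs; only the first comes from the slides.
cases = [(7.2, 2.2), (3.0, 2.2), (-5.0, 2.2), (-1.0, 0.4), (0.5, 1.0)]
agreement = [is_significant(d, s) == ci_excludes_zero(d, s) for d, s in cases]

for (d, s), same in zip(cases, agreement):
    print(f"diff = {d:5.1f}, SE = {s:.1f}: test and CI agree = {same}")
```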
31Confidence Intervals ↔ Tests
95% Confidence Intervals
Non-overlapping 95% confidence intervals, as here, are sufficient for significant (p < 0.05) group differences. However, non-overlapping is not necessary. They can overlap and the groups can still differ significantly.
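A small made-up example shows how intervals can overlap while the groups still differ significantly. The group means and standard errors below are hypothetical, chosen only to illustrate the point; the cutoff of 2 follows the convention used in this session.

```python
import math

# Hypothetical groups: (mean, SE of the mean).
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.0, 1.0

# Individual 95% confidence intervals: mean +/- 2 SE.
ci_a = (mean_a - 2 * se_a, mean_a + 2 * se_a)
ci_b = (mean_b - 2 * se_b, mean_b + 2 * se_b)
intervals_overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# Test of the difference uses SE(Diff), which is smaller than se_a + se_b.
se_diff = math.sqrt(se_a**2 + se_b**2)
t = (mean_b - mean_a) / se_diff

print(f"CI A = {ci_a}, CI B = {ci_b}, overlap = {intervals_overlap}")
print(f"t = {t:.2f}, significant at |t| > 2: {abs(t) > 2}")
```

The intervals overlap because each one is widened by 2 SE of its own mean, whereas the test compares the difference with 2 × sqrt(SE_A² + SE_B²), which is less than 2 × (SE_A + SE_B).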
32Back to Paper Experimental Units
Cannot use t-test for comparing lab data for multiple blood draws per subject.
(b) At least 100 µg/kg/min of propofol administered at the time of blood draw, or any pentobarbital in the 48 hrs before the blood draw.
Generalization of t-test
33Tests on Percentages
Is 26.3% vs. 61.9% statistically significant (p < 0.05), i.e., is the difference so large that it has a < 5% chance of occurring if the groups do not really differ?
Solution: Same theme as for means. Find a test statistic and compare it to its expected values if groups do not differ. See next slide.
34Tests on Percentages
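Chi-Square Distribution
Here, the signal in the test statistic is a squared quantity, expected to be 1.
Test statistic = 10.2 >> 3.84 (the 5% cutoff for a chi-square with 1 degree of freedom, as for a comparison of two percentages), so p < 0.05. In fact, p = 0.002.
[Figure: chi-square curve with the expected value at 1, the 5% cutoff marking the end of the 95% chance region, and area = 0.002 beyond the observed value; observed = 10.2.]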
35Tests on Percentages Chi-Square
The chi-square test statistic (10.2 in the example) is found by first calculating the expected number of AI patients with MAP < 60, and the same for non-AI patients, if AI and non-AI really do not differ for this. Then, chi-square is found as the sum of standardized (Observed - Expected)². This should be close to 1, as in the graph on the previous slide, if groups do not differ. The value 10.2 seems too big to have happened by chance (probability = 0.002) if there is no difference among all TBI subjects.
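The calculation described above can be sketched as follows. The cell counts are not given on the slides; they are inferred here from the quoted percentages and group sizes (61.9% of 42 is about 26, and 26.3% of 38 is about 10), so treat them as an assumption. The sketch uses the uncorrected Pearson chi-square, with each (Observed - Expected)² standardized by dividing by Expected.

```python
import math

# Counts inferred from the quoted percentages (an assumption, not from the paper).
observed = [[26, 42 - 26],   # AI:     MAP < 60, MAP >= 60
            [10, 38 - 10]]   # non-AI: MAP < 60, MAP >= 60

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if AI and non-AI do not differ.
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
print(f"chi-square = {chi_square:.1f}, p = {p_value:.4f}")
```

With these inferred counts the statistic reproduces the 10.2 quoted above. The p-value comes out in the same range as the reported 0.002, though not necessarily identical, since the paper may have used a corrected or exact version of the test.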
36Back to t-Test
Declare groups to differ if test statistic is too deviant.
How much deviance is enough proof?
Convention: "Too deviant" is a < 5% chance → |t| > 2. Why not choose, say, |t| > 3, so that our chances of being wrong are even less, < 1%?
[Figure: bell curve with the expected region (95% chance) in the center and 2.5% in each tail; observed = 3.28.]
37Back to t-Test
Convention: "Too deviant" is a < 5% chance → |t| > 2. Why not choose, say, |t| > 3, so that our chances of being wrong are even less, < 1%?
[Figure: bell curve with the expected region (> 99% chance) in the center and < 0.5% in each tail; observed = 3.28.]
Answer: Then the chances of missing a real difference, the converse wrong conclusion, are increased. This is analogous to setting the threshold for a diagnostic test of disease.
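The trade-off can be illustrated with a normal approximation. In the sketch below, `true_shift` is a hypothetical value for where the test statistic would be centered if a real difference exists; it is not taken from the paper. The function reports the chance of a false positive (no real difference, but declared) and of a miss (real difference, but not declared) for a given cutoff.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal approximation to the test statistic

def error_rates(cutoff, true_shift):
    """Error rates for the rule 'declare a difference if |t| > cutoff'."""
    false_positive = 2 * (1 - z.cdf(cutoff))  # groups truly do not differ
    detect = (1 - z.cdf(cutoff - true_shift)) + z.cdf(-cutoff - true_shift)
    return false_positive, 1 - detect         # second value: chance of a miss

true_shift = 2.5  # hypothetical real effect, in SE units
fp2, miss2 = error_rates(2, true_shift)
fp3, miss3 = error_rates(3, true_shift)

print(f"cutoff 2: false positive = {fp2:.3f}, miss = {miss2:.3f}")
print(f"cutoff 3: false positive = {fp3:.3f}, miss = {miss3:.3f}")
```

Raising the cutoff from 2 to 3 lowers the false positive chance but raises the chance of a miss, much as a stricter threshold on a diagnostic test trades sensitivity for specificity.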
38Power of a Study
Statistical power is the sensitivity of a study to detect real effects, if they exist. It needs to be balanced with the likelihood of wrongly declaring effects when they are non-existent. Today, we have been keeping that error at < 5%. Power is the topic of the next session, Session 4.