Title: Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy
1. Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy
Chapter 8: Multiple Tests and Multivariable Decision Rules
Chapter 5: Studies of Diagnostic Test Accuracy
Michael A. Kohn, MD, MPP. 10/19/2006
2. Outline of Topics
- Combining results of multiple tests; importance of test non-independence
- Recursive partitioning
- Logistic regression
- Published rules for combining test results; importance of validation separate from derivation
- Biases in studies of diagnostic test accuracy
  - Overfitting bias
  - Incorporation bias
  - Referral bias
  - Double gold standard bias
  - Spectrum bias
3. Warning: Different Example
- Example of combining two tests in this talk:
  - Prenatal sonographic Nuchal Translucency (NT) and Nasal Bone Exam (NBE) as dichotomous tests for Trisomy 21
- Example of combining two tests in the book:
  - Premature birth (GA < 36 weeks) and low birth weight (BW < 2500 grams) as dichotomous tests for neonatal morbidity

Cicero S, Rembouskos G, et al. "Likelihood ratio for trisomy 21 in fetuses with absent nasal bone at the 11-14-week scan." Ultrasound Obstet Gynecol. 2004;23(3):218-23. (Soon to be replaced.)
4. If NT ≥ 3.5 mm, Positive for Trisomy 21
What's wrong with this definition?
6.
- In general, don't make multi-level tests like NT into dichotomous tests by choosing a fixed cutoff
- I did it here to make the discussion of multiple tests easier
- I arbitrarily chose to call ≥ 3.5 mm positive
7. One Dichotomous Test
Nuchal Translucency   D+    D-     LR
≥ 3.5 mm              212   478    7.0
< 3.5 mm              121   4745   0.4
Total                 333   5223

Do you see that this is (212/333)/(478/5223)?
Review of Chapter 3: What are the sensitivity, specificity, PPV, and NPV of this test? (Be careful.)
8. Nuchal Translucency
- Sensitivity = 212/333 = 64%
- Specificity = 4745/5223 = 91%
- Prevalence = 333/(333 + 5223) = 6%
  (Study population: pregnant women about to undergo CVS, so high prevalence of Trisomy 21)
- PPV = 212/(212 + 478) = 31%
- NPV = 4745/(121 + 4745) = 97.5%

The NPV is not that great: prior to the test, P(D-) was already 94%.
9. Clinical Scenario: One Test
Pre-test probability of Down's = 6%; NT positive
- Pre-test prob = 0.06
- Pre-test odds = 0.06/0.94 = 0.064
- LR(+) = 7.0
- Post-test odds = pre-test odds x LR(+) = 0.064 x 7.0 = 0.44
- Post-test prob = 0.44/(0.44 + 1) = 0.31
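The odds-LR-odds conversion on this slide can be sketched as a small helper, using the slide's numbers (pre-test probability 6%, LR(+) = 7.0):

```python
def post_test_prob(pre_prob: float, lr: float) -> float:
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_odds = pre_prob / (1 - pre_prob)   # 0.06/0.94 = 0.064
    post_odds = pre_odds * lr              # 0.064 x 7.0 = 0.447
    return post_odds / (1 + post_odds)

p = post_test_prob(0.06, 7.0)
print(round(p, 2))  # 0.31
```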
10. Clinical Scenario: One Test
Using probabilities: Pre-test probability of Tri21 = 6% → NT positive → post-test probability of Tri21 = 31%
Using odds: Pre-test odds of Tri21 = 0.064 → NT positive (LR = 7.0) → post-test odds of Tri21 = 0.44
11. Clinical Scenario: One Test
Pre-test probability of Tri21 = 6%; NT positive
[Log-odds scale figure: the NT arrow (LR = 7.0) shifts pre-test odds 0.064 (prob 0.06) to post-test odds 0.44 (prob 0.31). The scale shows log(odds) from -2 to 1, odds from 1:100 to 10:1, and probabilities from 0.01 to 0.91.]
12. Nasal Bone Seen: NBE Negative for Trisomy 21
Nasal Bone Absent: NBE Positive for Trisomy 21
13. Second Dichotomous Test
Nasal Bone   Tri21+   Tri21-   LR
Absent       229      129      27.8
Present      104      5094     0.32
Total        333      5223

Do you see that this is (229/333)/(129/5223)?
14. Clinical Scenario: Two Tests
Using probabilities:
Pre-test probability of Trisomy 21 = 6% → NT positive for Trisomy 21 (≥ 3.5 mm) → post-NT probability of Trisomy 21 = 31% → NBE positive for Trisomy 21 (no bone seen) → post-NBE probability of Trisomy 21 = ?
15. Clinical Scenario: Two Tests
Using odds:
Pre-test odds of Tri21 = 0.064 → NT positive (LR = 7.0) → post-test odds of Tri21 = 0.44 → NBE positive (LR = 27.8?) → post-test odds of Tri21 = 0.44 x 27.8? = 12.4? (P = 12.4/(1 + 12.4) = 92.5%?)
16. Clinical Scenario: Two Tests
Pre-test probability of Trisomy 21 = 6%; NT ≥ 3.5 mm AND nasal bone absent
[Log-odds scale figure: the NT arrow (LR = 6.96) and the NBE arrow (LR = 27.8) are laid end to end. Can we do this? Chaining "NT and NBE" would shift odds 0.064 (prob 0.06) through 0.44 (prob 0.31) to 12.4 (prob 0.925).]
17. Question
- Can we use the post-test odds after a positive Nuchal Translucency as the pre-test odds for the positive Nasal Bone Examination?
- i.e., can we combine the positive results by multiplying their LRs?
- LR(NT+, NBE+) = LR(NT+) x LR(NBE+)?
- = 7.0 x 27.8?
- = 194?
18. Answer: No
NT    NBE   Trisomy 21+   Trisomy 21-   LR
Pos   Pos   158 (47%)     36 (0.7%)     69
Pos   Neg   54 (16%)      442 (8.5%)    1.9
Neg   Pos   71 (21%)      93 (1.8%)     12
Neg   Neg   50 (15%)      4652 (89%)    0.2
Total       333 (100%)    5223 (100%)

Not 194!
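A small sketch recomputing the combined LRs directly from the 2x2x2 counts above shows why multiplying the individual LRs (7.0 x 27.8 = 194) fails:

```python
counts = {  # (NT, NBE): (Tri21+, Tri21-) counts from the table
    ("pos", "pos"): (158, 36),
    ("pos", "neg"): (54, 442),
    ("neg", "pos"): (71, 93),
    ("neg", "neg"): (50, 4652),
}
n_dpos = sum(d for d, _ in counts.values())    # 333
n_dneg = sum(nd for _, nd in counts.values())  # 5223

# LR for each result combination: P(result | D+) / P(result | D-)
lrs = {
    combo: (dpos / n_dpos) / (dneg / n_dneg)
    for combo, (dpos, dneg) in counts.items()
}
print({c: round(lr, 1) for c, lr in lrs.items()})
# LR for (pos, pos) comes out near 69, far short of the 194 predicted by independence
```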
19. Non-Independence
- Absence of the nasal bone does not tell you as much if you already know that the nuchal translucency is ≥ 3.5 mm.
20. Clinical Scenario
Using odds:
Pre-test odds of Tri21 = 0.064 → NT+/NBE+ (LR = 68.8) → post-test odds = 0.064 x 68.8 = 4.40 (P = 4.40/(1 + 4.40) = 81%, not 92.5%)
21. Non-Independence
[Log-odds scale figure: the NT and NBE arrows, laid end to end as if the tests were independent, would reach odds 12.4 (prob 0.925); because the tests are dependent, the shorter combined "NT and NBE" arrow reaches only odds 4.40 (prob 0.81).]
22. Non-Independence of NT and NBE
Apparently, even in chromosomally normal fetuses, enlarged NT and absence of the nasal bone are associated. A false positive on the NT makes a false positive on the NBE more likely. Of normal (D-) fetuses with NT < 3.5 mm, only 2.0% had the nasal bone absent. Of normal (D-) fetuses with NT ≥ 3.5 mm, 7.5% had the nasal bone absent.

Some (but not all) of this may have to do with ethnicity. In this London study, chromosomally normal fetuses of Afro-Caribbean ethnicity had both larger NTs and more frequent absence of the nasal bone.

In Trisomy 21 (D+) fetuses, normal NT was associated with presence of the nasal bone, so a false negative on the NT was associated with a false negative on the NBE.
23. Non-Independence
- Instead of looking for the nasal bone, what if the second test were just a repeat measurement of the nuchal translucency?
- A second positive NT would do little to increase your certainty of Trisomy 21. If it was a false positive the first time around, it is likely to be a false positive the second time.
24. Reasons for Non-Independence
Tests measure the same aspect of disease.
- Consider exercise ECG (EECG) and radionuclide scan as tests for coronary artery disease (CAD), with the gold standard being anatomic narrowing of the arteries on angiogram. Both EECG and nuclide scan measure functional narrowing. In a patient without anatomic narrowing (a D- patient), coronary artery spasm could cause false positives on both tests.
25. Reasons for Non-Independence
Spectrum of disease severity.
- In the EECG/nuclide scan example, CAD is defined as ≥ 70% stenosis on angiogram. A D+ patient with 71% stenosis is much more likely to have a false negative on both the EECG and the nuclide scan than a D+ patient with 99% stenosis.
26. Reasons for Non-Independence
Spectrum of non-disease severity.
- In this example, CAD is defined as ≥ 70% stenosis on angiogram. A D- patient with 69% stenosis is much more likely to have a false positive on both the EECG and the nuclide scan than a D- patient with 33% stenosis.
27. Counterexamples: Possibly Independent Tests
For venous thromboembolism:
- CT angiogram of lungs and Doppler ultrasound of leg veins
- Alveolar dead space and D-dimer
- MRA of lungs and MRV of leg veins
28. Unless tests are independent, we can't combine results by multiplying LRs.
29. Ways to Combine Multiple Tests
- On a group of patients (derivation set), perform the multiple tests and determine true disease status (apply the gold standard)
- Measure the LR for each possible combination of results
- Recursive partitioning
- Logistic regression
30. Determine LR for Each Result Combination
NT    NBE   Tri21+       Tri21-         LR    Post-Test Prob
Pos   Pos   158 (47%)    36 (0.7%)      69    81%
Pos   Neg   54 (16%)     442 (8.5%)     1.9   11%
Neg   Pos   71 (21%)     93 (1.8%)      12    43%
Neg   Neg   50 (15%)     4652 (89.1%)   0.2   1%
Total       333 (100%)   5223 (100%)

Assumes pre-test prob = 6%
31. Determine LR for Each Result Combination
2 dichotomous tests: 4 combinations
3 dichotomous tests: 8 combinations
4 dichotomous tests: 16 combinations
Etc.

2 three-level tests: 9 combinations
3 three-level tests: 27 combinations
Etc.
32. Determine LR for Each Result Combination
How do you handle continuous tests?
Not practical for most groups of tests.
33. Recursive Partitioning: Measure NT First
34. Recursive Partitioning: Examine Nasal Bone First
35. Recursive Partitioning: Examine Nasal Bone First; CVS if P(Trisomy 21) > 5%
36. Recursive Partitioning: Examine Nasal Bone First; CVS if P(Trisomy 21) > 5%
37. Recursive Partitioning
- Same as Classification and Regression Trees (CART)
- Don't have to work out probabilities (or LRs) for all possible combinations of tests, because of tree pruning
38. Tree Pruning: Goldman Rule
8 tests for acute MI in the ER chest pain patient:
- ST elevation on ECG
- CP < 48 hours
- ST-T changes on ECG
- Hx of MI
- Radiation of pain to neck/LUE
- Longest pain > 1 hour
- Age > 40 years
- CP not reproduced by palpation

Goldman L, Cook EF, Brand DA, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318(13):797-803.
39. 8 tests → 2^8 = 256 combinations
41. Recursive Partitioning
- Does not deal well with continuous test results when there is a monotonic relationship between the test result and the probability of disease
42. Logistic Regression
Ln(Odds(D+)) = a + b_NT(NT) + b_NBE(NBE) + b_interact(NT)(NBE)
where NT = 1 if positive, 0 if negative, and NBE = 1 if positive, 0 if negative.
More on this later in ATCR!
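As a sketch of the model's form (the coefficients below are made up for illustration, since the slide gives no fitted values; a negative interaction coefficient captures the non-independence, because a positive second test then adds less when the first is already positive):

```python
import math

def prob_dpos(nt: int, nbe: int,
              a: float = -2.8, b_nt: float = 1.9,
              b_nbe: float = 2.5, b_int: float = -0.7) -> float:
    """ln(odds(D+)) = a + b_nt*NT + b_nbe*NBE + b_int*NT*NBE, with tests coded 0/1."""
    log_odds = a + b_nt * nt + b_nbe * nbe + b_int * nt * nbe
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# With b_int < 0, going NBE- to NBE+ raises log-odds by 2.5 when NT is negative,
# but only by 2.5 - 0.7 = 1.8 when NT is already positive.
print(round(prob_dpos(0, 0), 2), round(prob_dpos(1, 1), 2))
```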
43. Logistic Regression Approach to the R/O ACI Patient
                         Coefficient   MV Odds Ratio
Constant                 -3.93
Presence of chest pain    1.23         3.42
Pain major symptom        0.88         2.41
Male sex                  0.71         2.03
Age 40 or less           -1.44         0.24
Age > 50                  0.67         1.95
Male over 50 years       -0.43         0.65
ST elevation              1.314        3.72
New Q waves               0.62         1.86
ST depression             0.99         2.69
T waves elevated          1.095        2.99
T waves inverted          1.13         3.10
T wave + ST changes      -0.314        0.73

Selker HP, Griffith JL, D'Agostino RB. A tool for judging coronary care unit admission appropriateness, valid for both real-time and retrospective use. A time-insensitive predictive instrument (TIPI) for acute cardiac ischemia: a multicenter study. Med Care. 1991;29(7):610-627. For corrected coefficients, see http://medg.lcs.mit.edu/cardiac/cpain.htm
44. Clinical Scenario
- 71 y/o man with 2.5 hours of CP, substernal, non-radiating, described as "bloating." Cannot say if same as prior MI or worse than prior angina.
- Hx of CAD, s/p CABG 10 yrs prior, stenting 3 years and 1 year ago. DM on Avandia.
- ECG: RBBB, Qs inferiorly. No ischemic ST-T changes.

Real patient seen by MAK, 1 am, 10/12/04
46.
                         Coefficient   Result   Contribution
Constant                 -3.93                  -3.93
Presence of chest pain    1.23         1         1.23
Pain major symptom        0.88         1         0.88
Male sex                  0.71         1         0.71
Age 40 or less           -1.44         0         0
Age > 50                  0.67         1         0.67
Male over 50 years       -0.43         1        -0.43
ST elevation              1.314        0         0
New Q waves               0.62         0         0
ST depression             0.99         0         0
T waves elevated          1.095        0         0
T waves inverted          1.13         0         0
T wave + ST changes      -0.314        0         0
Sum (log odds)                                  -0.87

Odds of ACI = e^(-0.87) = 0.419
Probability of ACI = 0.419/(1 + 0.419) = 30%
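The arithmetic on this slide can be sketched directly from the active TIPI terms (value 1) plus the constant:

```python
import math

# Contributions for this patient: constant plus the terms that are "on".
terms = {
    "constant": -3.93,
    "chest pain present": 1.23,
    "pain major symptom": 0.88,
    "male sex": 0.71,
    "age > 50": 0.67,
    "male over 50": -0.43,
}
log_odds = sum(terms.values())   # -0.87
odds = math.exp(log_odds)        # about 0.419
prob = odds / (1 + odds)         # about 0.30
print(round(log_odds, 2), round(odds, 3), round(prob, 2))
```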
47. What Happened to Pre-Test Probability?
- Typically, clinical decision rules report probabilities rather than likelihood ratios for combinations of results.
- Can back out LRs if we know the prevalence, P(D+), in the study dataset.
- With logistic regression models, this backing out is known as a "prevalence offset." (See Chapter 8A.)
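One common way such an offset works (a sketch, not necessarily Chapter 8A's exact formulation) is to subtract the study population's log prior odds from the model's log-odds output and add the log prior odds for the new setting:

```python
import math

def offset_log_odds(model_log_odds: float, p_study: float, p_new: float) -> float:
    """Shift a rule's log-odds from study prevalence p_study to a new pre-test prob p_new."""
    logit = lambda p: math.log(p / (1 - p))
    return model_log_odds - logit(p_study) + logit(p_new)

# A rule probability of 30% derived at 15% prevalence maps to a lower
# probability when applied at 6% pre-test probability.
adjusted = offset_log_odds(math.log(0.30 / 0.70), 0.15, 0.06)
print(round(math.exp(adjusted) / (1 + math.exp(adjusted)), 2))
```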
48. Optimal Cutoff for a Single Continuous Test
Depends on:
- Pre-test probability of disease
- ROC curve (likelihood ratios)
- Relative misclassification costs
Cannot choose an optimal cutoff with just the ROC curve.
49. Optimal Cutoff Line for Two Continuous Tests
50. Choosing Which Tests to Include in the Decision Rule
- Have focused on how to combine results of two or more tests, not on which of several tests to include in a decision rule.
- Options include:
  - Recursive partitioning
  - Automated stepwise logistic regression
Choice of variables in the derivation data set requires confirmation in a separate validation data set.
51. Need for Validation: Example
- Study of clinical predictors of bacterial diarrhea.
- Evaluated 34 historical items and 16 physical examination questions.
- 3 questions (abrupt onset, > 4 stools/day, and absence of vomiting) best predicted a positive stool culture (sensitivity 86%, specificity 60% for all 3).
- Would these 3 be the best predictors in a new dataset? Would they have the same sensitivity and specificity?

DeWitt TG, Humphrey KF, McCarthy P. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics. 1985;76(4):551-556.
52. Need for Validation
- Develop a prediction rule by choosing a few tests and findings from a large number of possibilities.
- This takes advantage of chance variations in the data.
- The predictive ability of the rule will probably disappear when you try to validate it on a new dataset.
- Can be referred to as "overfitting."
53. Validation
- No matter what technique (CART or logistic regression) is used, the rule for combining multiple test results must be tested on a data set different from the one used to derive it.
- Beware of validation sets that are just re-hashes of the derivation set.
- (This begins our discussion of potential problems with studies of diagnostic tests.)
54. Studies of Diagnostic Test Accuracy (Sackett, EBM, pg 68)
- Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?
55. Bias in Studies of Diagnostic Test Accuracy
- Index test: the test being evaluated
- Gold standard: the test used to determine true disease status
56. Studies of Diagnostic Tests (Sackett, EBM, pg 68)
(The four questions from slide 54, repeated.)
57. Studies of Diagnostic Tests: Incorporation Bias
The index test is incorporated into the gold standard.
Consider a study of the usefulness of various findings for diagnosing pancreatitis. If the "gold standard" is a discharge diagnosis of pancreatitis, which in many cases will be based upon the serum amylase, then the study can't quantify the accuracy of the amylase for this diagnosis.
58. Studies of Diagnostic Tests: Incorporation Bias
A study of BNP in dyspnea patients as a diagnostic test for CHF also showed that the CXR performed extremely well in predicting CHF.
The two cardiologists who determined the final diagnosis of CHF were blinded to the BNP level but not to the CXR report, so the assessment of BNP should be unbiased, but not the assessment of the CXR.

Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med. 2002;347(3):161-7.
59. Studies of Diagnostic Tests (Sackett, EBM, pg 68)
(The four questions from slide 54, repeated.)
60. Studies of Diagnostic Tests: Verification Bias
The study population only includes those to whom the gold standard was applied, but patients with positive index tests are more likely to be referred for the gold standard.

Example: V/Q scan as a test for PE. The gold standard is a PA-gram. Patients with negative V/Q scans are less frequently referred for PA-gram than those with positive V/Q scans. Only patients who had PA-grams are included in the study.

AKA work-up bias, referral bias, or ascertainment bias.
61. Studies of Diagnostic Tests: Verification Bias
            PA-gram+   PA-gram-
V/Q Scan+   a          b
V/Q Scan-   c (↓)      d (↓)
(Fewer scan-negative patients are verified, so c and d are undercounted.)

Sensitivity (a/(a+c)) is biased UP.
Specificity (d/(b+d)) is biased DOWN.
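The direction of this bias can be sketched with hypothetical numbers (the counts and the 20% referral rate below are invented for illustration, not from any study):

```python
# Full population: a=TP, b=FP, c=FN, d=TN for the index test vs. gold standard.
true_counts = {"a": 80, "b": 100, "c": 20, "d": 800}

# Only 20% of index-negative patients are referred for the gold standard;
# all index-positive patients are referred.
verified = {
    "a": true_counts["a"],
    "b": true_counts["b"],
    "c": round(true_counts["c"] * 0.2),
    "d": round(true_counts["d"] * 0.2),
}

def sens_spec(t):
    return t["a"] / (t["a"] + t["c"]), t["d"] / (t["b"] + t["d"])

print(sens_spec(true_counts))  # true sensitivity and specificity
print(sens_spec(verified))     # sensitivity biased up, specificity biased down
```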
62. Studies of Diagnostic Tests: Double Gold Standard Bias
One gold standard (e.g., biopsy) is applied in patients with a positive index test; another gold standard (e.g., clinical follow-up) is applied in patients with a negative index test.
63. Studies of Diagnostic Tests: Double Gold Standard
Test: V/Q scan. Disease: PE. Gold standard: PA-gram in patients who had one, clinical follow-up in patients who didn't. Study population: all patients presenting to the ED who received a V/Q scan.
Assume some patients did not get a PA-gram because of normal/low-probability V/Q scans but would have had positive PA-grams. Instead they had negative clinical follow-up and were counted as true negatives. If they had had PA-grams, they would have been counted as false negatives.

PIOPED. JAMA. 1990;263(20):2753-9.
64. Studies of Diagnostic Tests: Double Gold Standard
            PA-gram+   PA-gram-
V/Q Scan+   a          b
V/Q Scan-   c          d

Sensitivity (a/(a+c)) biased UP.
Specificity (d/(b+d)) biased UP.
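A hypothetical numeric sketch (counts invented for illustration): scan-negative patients who truly have PE get negative clinical follow-up instead of a PA-gram, so they move from cell c (false negative) to cell d (true negative), inflating both sensitivity and specificity:

```python
# Counts if a single definitive gold standard were applied to everyone.
definitive = {"a": 80, "b": 100, "c": 30, "d": 790}

# 10 scan-negative PE patients are "cleared" by benign clinical follow-up
# and counted as true negatives rather than false negatives.
shift = 10
observed = dict(definitive, c=definitive["c"] - shift, d=definitive["d"] + shift)

def sens_spec(t):
    return t["a"] / (t["a"] + t["c"]), t["d"] / (t["b"] + t["d"])

print(sens_spec(definitive))
print(sens_spec(observed))  # both sensitivity and specificity biased up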
65. Studies of Diagnostic Tests (Sackett, EBM, pg 68)
(The four questions from slide 54, repeated.)
66. Studies of Diagnostic Tests: Spectrum Bias
So far, we have said that the PPV and NPV of a test depend on the population being tested, specifically on the prevalence of D+ in the population.
We said that sensitivity and specificity are properties of the test, independent of the prevalence and, by implication at least, of the population being tested.
In fact...
67. Studies of Diagnostic Tests: Spectrum Bias
Sensitivity depends on the spectrum of disease in the population being tested.
Specificity depends on the spectrum of non-disease in the population being tested.
68. Studies of Diagnostic Tests: Spectrum Bias
The D+ and D- groups are not homogeneous.
D-/D+ really is D-/(D1+, D2+, or D3+)
D-/D+ really is (D1-, D2-, or D3-)/D+
69. Studies of Diagnostic Tests: Spectrum Bias
Example: Absence of the nasal bone (on 13-week ultrasound) as a test for chromosomal abnormality
70. Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
Nasal Bone   D+    D-     LR
Absent       229   129    27.8
Present      104   5094   0.32
Total        333   5223

Sensitivity = 229/333 = 69%, BUT the D+ group only included fetuses with Trisomy 21.
71. Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
- The D+ group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18)
- If the purpose of the nasal bone exam is to determine on whom to get CVS, these 295 fetuses with chromosomal abnormalities other than Trisomy 21 should be included in the D+ group.
- 95/295 (32%, not 69%) had an absent nasal bone.
72. Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
Nasal Bone   D+                D-
Absent       229 + 95 = 324    129
Present      104 + 200 = 304   5094
Total        333 + 295 = 628   5223

Sensitivity = 324/628 = 52%, NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21.
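The sensitivity shift on this slide can be recomputed directly from the counts (229/333 for Trisomy 21 only; 95/295 for the other chromosomal abnormalities):

```python
# Narrow D+ group: Trisomy 21 only; broad D+ group adds the other abnormalities.
tri21_absent, tri21_total = 229, 333
other_absent, other_total = 95, 295

sens_narrow = tri21_absent / tri21_total
sens_broad = (tri21_absent + other_absent) / (tri21_total + other_total)
print(round(sens_narrow, 2), round(sens_broad, 2))  # 0.69 0.52
```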
73. Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
- By excluding chromosomal abnormalities other than Trisomy 21 from the D+ group, the study exaggerates the sensitivity of the Nasal Bone Exam (NBE) for chromosomal abnormalities.
- True sensitivity of NBE for chromosomal abnormalities: 52%
- Biased estimate due to spectrum bias (excluding other chromosomal problems): 69%
74. Biases in Studies of Tests
- Overfitting bias: data-snooped cutoffs take advantage of chance variations in the derivation set, making the test look falsely good.
- Incorporation bias: the index test is part of the gold standard (Sensitivity Up, Specificity Up)
- Verification/referral bias: a positive index test increases referral to the gold standard (Sensitivity Up, Specificity Down)
- Double gold standard: a positive index test causes application of the definitive gold standard; a negative index test results in clinical follow-up (Sensitivity Up, Specificity Up)
- Spectrum bias:
  - D+ = sickest of the sick (Sensitivity Up)
  - D- = wellest of the well (Specificity Up)
75. Biases in Studies of Tests
- Don't just identify potential biases; figure out how the biases could affect the conclusions.
- Studies concluding a test is worthless are not invalid if biases in the design would have led to the test looking BETTER than it really is.