1
Options for Summarizing Medical Test
Performance in the Absence of a Gold Standard

Prepared for
The Agency for Healthcare Research and Quality
(AHRQ)
Training Modules for Medical Test Reviews Methods
Guide
www.ahrq.gov

2
Learning Objectives

Recognize settings where the reference standard
may be imperfect (i.e., no gold standard)
Describe sources of potential bias resulting from
the use of an imperfect reference standard when
estimating the sensitivity and specificity of a
medical test
Understand the options for analyzing data, their
advantages and justification, and potential
assumptions

                 Truly Diseased        Truly Healthy
Index test (+)   True positive (TP)    False positive (FP)
Index test (-)   False negative (FN)   True negative (TN)
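
The table above maps directly onto the standard formulas for sensitivity and specificity. A minimal sketch, assuming a perfect reference standard; the function name and the counts are hypothetical and serve only as illustration:

```python
def accuracy_from_2x2(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2-by-2 table."""
    sensitivity = tp / (tp + fn)  # share of truly diseased with a positive index test
    specificity = tn / (tn + fp)  # share of truly healthy with a negative index test
    return sensitivity, specificity


# Hypothetical counts, for illustration only
se, sp = accuracy_from_2x2(tp=90, fp=40, fn=10, tn=160)
print(f"sensitivity = {se:.2f}, specificity = {sp:.2f}")
```

When the columns come from an imperfect reference standard rather than from true status, the same arithmetic yields the naïve estimates discussed in the rest of the module.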

Trikalinos TA, Balion CM. Options for summarizing
medical test performance in the absence of a
gold standard. In: Methods guide for medical
test reviews. Available at
www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.
4
Introduction: Reference Standard Issues

In some cases, true status is directly observable (e.g., for tests predicting short-term mortality after a procedure).
More commonly, true status is based on a reference standard (test), which is considered to be a gold standard if it actually reflects the true status.
Reference standard bias arises when the reference test does not mirror the truth well.
The further the reference test deviates from the truth, the less accurate the estimate of the index test's performance.
An imperfect reference standard is a reference
standard test that misclassifies true status at
a rate that cannot be ignored.

The simplest case is an index test and a
reference standard that give dichotomous results
(e.g., positive or negative for disease).
Both the index and reference tests can err by not
reflecting the true status.
The example in the following slide shows true
2-by-2 table probabilities in relation to the
eight combinations of index and reference test
results.
These eight probabilities (α1, β1, γ1, δ1, α2,
β2, γ2, and δ2) need to be estimated from the
accuracy data.
The perfect reference standard is the gold
standard.

                 Truly Diseased        Truly Healthy
                 RS (+)    RS (-)      RS (+)    RS (-)
Index test (+)   α1        β2          α2        β1
Index test (-)   γ1        δ2          γ2        δ1

                 RS (+)                RS (-)
Index test (+)   α = α1 + α2           β = β1 + β2
Index test (-)   γ = γ1 + γ2           δ = δ1 + δ2

RS = reference standard
7
Imperfect Reference Standard Bias (1 of 2)

Naïve estimates are underestimates versus true
values when test results are independent among
those with and without the condition of interest
(conditional independence).

Figure legend:
Abbreviations: Se_index = index test sensitivity; Sp_index = index test specificity; P = disease prevalence
Solid red line = true sensitivity; dashed red line = true specificity
Solid black line = naïve sensitivity; dashed black line = naïve specificity
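
Under conditional independence, the expected naïve estimates can be written in closed form, which shows why they fall below the true values. The sketch below uses Se_index, Sp_index, and P from the legend. Se_ref and Sp_ref (the reference standard's own sensitivity and specificity) are symbols introduced here for illustration and do not appear in the slides.

```latex
\[
\begin{aligned}
\mathrm{Se}_{\text{naive}} &=
  \frac{P\,\mathrm{Se}_{\text{index}}\,\mathrm{Se}_{\text{ref}}
        + (1-P)\,(1-\mathrm{Sp}_{\text{index}})\,(1-\mathrm{Sp}_{\text{ref}})}
       {P\,\mathrm{Se}_{\text{ref}} + (1-P)\,(1-\mathrm{Sp}_{\text{ref}})} \\[4pt]
\mathrm{Sp}_{\text{naive}} &=
  \frac{(1-P)\,\mathrm{Sp}_{\text{index}}\,\mathrm{Sp}_{\text{ref}}
        + P\,(1-\mathrm{Se}_{\text{index}})\,(1-\mathrm{Se}_{\text{ref}})}
       {(1-P)\,\mathrm{Sp}_{\text{ref}} + P\,(1-\mathrm{Se}_{\text{ref}})}
\end{aligned}
\]
```

Each naïve estimate is a weighted average of the true value and a smaller quantity (for sensitivity, 1 - Sp_index), so it is pulled downward whenever Se_index > 1 - Sp_index. With hypothetical values P = 0.2, Se_index = 0.90, Sp_index = 0.80, Se_ref = 0.95, and Sp_ref = 0.90, the naïve sensitivity is 0.187 / 0.27 ≈ 0.69 and the naïve specificity is 0.577 / 0.73 ≈ 0.79.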
9
Reference Standard Validity

Only rarely are we absolutely sure that the
reference standard is a perfect reflection of the
truth.
Often, we are comfortable with overlooking small
or moderate misclassifications by the reference
standard.
Hard-and-fast rules for judging the (in)adequacy
of the reference standard do not exist.
Consult content experts on a case-by-case basis
to make judgments.
There are three settings in which one might question the validity of the reference standard:
The reference method yields different measurements over time or across settings.
The condition of interest is variably defined.
The new method is an improved version of a usually applied test.

Situation: The reference method yields different measurements over time or across settings.
Example: Diagnosis of obstructive sleep apnea typically requires a high Apnea-Hypopnea Index (AHI, an objective measurement) and the presence of suggestive symptoms and signs.
Problem: There is large night-to-night variability in measured AHI and substantial between-rater and between-laboratory variability.

Situation: The condition of interest is variably defined.
Example: The disease, such as psoriatic arthritis, is complex.
Problem: There is no single symptom, sign, or measurement that suffices to make a diagnosis of the disease with certainty. Instead, a set of diagnostic criteria (symptoms, signs, imaging results, and laboratory measures) is used to identify the disease, which will unavoidably be differentially applied across studies.

Situation: The new method is an improved version of a usually applied test.
Example: Measurement of parathyroid hormone (PTH).
Problem: Older measurement methodologies are being replaced by newer, more specific ones. Measurements with the new and old methodologies do not agree very well.
It is incorrect to use the older method as the reference standard.

Analytic options 1 and 2 below are preferred when possible to summarize data from two fallible tests; option 3 is also suitable.
1. Forgo the classical paradigm, which focuses on test accuracy; assess the ability of the index test to predict patient outcomes (using the index test as a predictive instrument).
2. Forgo the classical paradigm; assess agreement of the index and reference test results, that is, treat index and reference tests as two alternative measurement methods.
3. Using the classical paradigm, calculate naïve estimates of the index test's sensitivity and specificity, but qualify study findings to avoid misinterpretation.
4. Mathematically adjust the naïve estimates of the index test's sensitivity and specificity to account for the imperfect reference standard.

Forgo the classical paradigm, which compares the
index test to a reference standard (test
accuracy).
Test accuracy is neither informative nor
interpretable when the reference standard is
imperfect.
Instead, assess the ability of the index test to
predict patient outcomes such as history, future
clinical events, and response to therapy.
This option follows a well-known paradigm in
systematic reviews for evaluating prognostic
tests (more information is available in Module
12).

Forgo the classical paradigm (test accuracy).
Instead, assess agreement (concordance) of the
index and reference test results.
Simply treat the index and reference tests as two
alternative measurement methods.
How to do this depends on whether the results are
categorical or continuous.
For categorical test results:
Cohen's kappa statistic is a measure of categorical agreement that accounts for agreement by chance.
Meta-analyses of kappa statistics are not common in the medical literature; they will need to be explained and interpreted in detail.
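
For two dichotomous tests, Cohen's kappa can be computed directly from the cross-tabulated counts. The sketch below is one way to do this; the cell labels a to d and the counts are hypothetical and are not taken from the module.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for two dichotomous tests.

    a: both tests positive      b: index +, reference -
    c: index -, reference +     d: both tests negative
    """
    n = a + b + c + d
    observed = (a + d) / n  # observed proportion of agreement
    # Agreement expected by chance, from the marginal totals
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts, for illustration only
kappa = cohens_kappa(a=40, b=10, c=5, d=45)
print(f"kappa = {kappa:.2f}")
```

Here the observed agreement is 0.85 and the chance-expected agreement is 0.50, so kappa expresses the agreement achieved beyond chance as a fraction of the maximum possible beyond chance.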

When there are continuous test results and individual data points are available, the researcher can directly compare measurements between tests by pooling data from all available studies and:
(1) Performing a regression of one test on the other, using a method that accounts for measurement error
(2) Conducting a Bland-Altman analysis (difference vs. the average of the two test results)
When there are continuous test results but individual data points are not available, the researcher can summarize study-level information from (1) or (2) above.
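
A Bland-Altman analysis reduces to the mean of the paired differences (the mean bias) and the 95-percent limits of agreement around it. A minimal sketch, assuming paired individual measurements are available; the function name and the data are hypothetical:

```python
import statistics


def bland_altman(x, y):
    """Mean bias and 95% limits of agreement for paired measurements x and y."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired measurements, for illustration only
index_test = [12.0, 18.0, 25.0, 7.0, 40.0, 31.0]
reference = [10.0, 20.0, 22.0, 9.0, 46.0, 28.0]

bias, lower, upper = bland_altman(index_test, reference)
print(f"mean bias = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

The plot itself shows each pair's difference against the pair's average. Wide limits relative to clinically meaningful differences suggest the two methods are not interchangeable.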

Calculate naïve estimates of the index test sensitivity (Se) and specificity (Sp), ignoring imperfection of the reference standard but making qualitative judgments on the direction of bias of these naïve estimates.
If index and reference tests are independent within strata of disease (conditional independence), naïve estimates of index test Se and Sp are biased downward (underestimated).
If index and reference tests are correlated within strata of disease, naïve estimates of Se and Sp can be:
Overestimates if tests agree more than by chance
Underestimates if tests disagree more than by chance
Problem: The researcher cannot assume conditional independence without justification; external data are needed.
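
The direction of bias described above can be checked numerically. The sketch below models dependence with a covariance term within each disease stratum, which is one common parameterization and an assumption here. The function name, parameter names, and all values are hypothetical.

```python
def naive_estimates(prev, se_i, sp_i, se_r, sp_r, cov_d=0.0, cov_h=0.0):
    """Expected naive Se and Sp of an index test versus an imperfect reference.

    cov_d / cov_h: covariance between index and reference results among the
    truly diseased / truly healthy (0 means conditional independence).
    """
    # Probability that both tests are positive, and that both are negative
    both_pos = prev * (se_i * se_r + cov_d) + (1 - prev) * ((1 - sp_i) * (1 - sp_r) + cov_h)
    both_neg = prev * ((1 - se_i) * (1 - se_r) + cov_d) + (1 - prev) * (sp_i * sp_r + cov_h)
    ref_pos = prev * se_r + (1 - prev) * (1 - sp_r)  # probability the reference is positive
    return both_pos / ref_pos, both_neg / (1 - ref_pos)


# Hypothetical true values, for illustration only
true = dict(prev=0.2, se_i=0.8, sp_i=0.9, se_r=0.8, sp_r=0.9)

independent = naive_estimates(**true)
agree_more = naive_estimates(**true, cov_d=0.12, cov_h=0.06)

print("conditional independence:", independent)  # both below the true 0.8 and 0.9
print("tests err together:", agree_more)         # both above the true 0.8 and 0.9
```

Because the covariances are rarely known, the sketch supports only a qualitative judgment about direction, which is what this option asks for.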

The prostate-specific antigen (PSA) test is used to detect prostate cancer.
Numerous methods have been developed to test PSA levels.
These tests are prone to false-negative misclassification: PSA levels are not elevated in up to 15 percent of prostate cancer cases.
Obesity can reduce serum PSA.
Obesity will likely affect all PSA-detection methods, old and new (conditional dependence).
Conditional dependence of PSA tests results in overestimation of the accuracy of a new (index) test.
When a PSA test is compared with a non-PSA reference (e.g., a prostate biopsy), there is no conditional dependence; misclassification results in underestimation.

Mathematically adjust (correct) the naïve estimates of the index test sensitivity and specificity to account for the imperfect reference standard.
Data from 2 × 2 tables are not enough; additional information is needed from the literature.
The task is easiest if conditional independence can be assumed and either:
The sensitivity and specificity of the imperfect reference test are known from other studies, or
The specificities of both the index test and the imperfect reference standard are known from other studies, but the sensitivities are unknown.
Use Bayesian inference to incorporate information from other studies as prior distributions rather than fixed values. It provides estimates of sensitivity, specificity, and disease prevalence.
Alternative sets of assumptions are possible.
Problem: Model misspecification can result in biased estimates.
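
For the first case above (reference test sensitivity and specificity known, conditional independence assumed), the adjustment has a closed form that follows algebraically from those assumptions. The sketch below is illustrative only. The function name, cell labels, and counts are hypothetical, and the counts were constructed from known true values so that the result can be checked.

```python
def corrected_estimates(a, b, c, d, se_ref, sp_ref):
    """Adjust for an imperfect reference with known Se and Sp,
    assuming conditional independence of index and reference tests.

    a: index +, reference +     b: index +, reference -
    c: index -, reference +     d: index -, reference -
    Returns (index sensitivity, index specificity, prevalence).
    """
    n = a + b + c + d
    prevalence = ((a + c) / n + sp_ref - 1) / (se_ref + sp_ref - 1)
    se_index = (a * sp_ref - b * (1 - sp_ref)) / ((a + c) - n * (1 - sp_ref))
    sp_index = (d * se_ref - c * (1 - se_ref)) / (n * se_ref - (a + c))
    return se_index, sp_index, prevalence


# Hypothetical counts generated from prevalence 0.2, index Se 0.9 and Sp 0.8,
# reference Se 0.95 and Sp 0.9
a, b, c, d = 1870, 1530, 830, 5770

naive_se, naive_sp = a / (a + c), d / (b + d)
se, sp, prev = corrected_estimates(a, b, c, d, se_ref=0.95, sp_ref=0.90)
print(f"naive:     Se = {naive_se:.3f}, Sp = {naive_sp:.3f}")
print(f"corrected: Se = {se:.3f}, Sp = {sp:.3f}, prevalence = {prev:.3f}")
```

With real data the corrected values can fall outside the range 0 to 1 when samples are small or the assumed reference accuracy is wrong, which is one reason this option calls for expert statistical help.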

Obstructive sleep apnea (OSA) is characterized by
sleep disturbances secondary to upper airway
obstruction.
OSA has a prevalence of 2 to 4 percent in
middle-aged adults.
It is associated with daytime somnolence,
cardiovascular morbidity, diabetes, and other
adverse outcomes.
Treatment includes continuous positive airway
pressure.
A systematic review on the diagnosis of OSA in
the home setting used
Portable monitors as the index diagnostic test
Facility-based polysomnography as the reference
standard
The reviewers first attempted analysis option 3,
then moved on to analysis option 2.

Trikalinos TA, Ip S, Raman G, et al. Home diagnosis of obstructive sleep apnea-hypopnea syndrome. Technology Assessment. Available at www.cms.gov/Medicare/Coverage/DeterminationProcess/downloads/id48TA.pdf.
21
Systematic Review Example: Choice of Reference
Standard and Cutoff

There is no perfect or accepted reference
standard for obstructive sleep apnea (OSA).
A diagnosis of OSA is based on suggestive signs
and symptoms and objective assessment of
breathing patterns during sleep with
facility-based polysomnography (PSG).
PSG quantifies the Apnea-Hypopnea Index (AHI).
Portable monitors can also measure AHI.
A high AHI (usually 15 or more events per hour of sleep) is suggestive of OSA; alternative cutoffs range from 5 to 40 events/hour.
The main analysis in the systematic review used a cutoff of AHI 15, but cutoffs of 10 and 20 were also analyzed (there were too few data to analyze other cutoffs).
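
Analyzing several cutoffs amounts to dichotomizing both AHI measurements at each cutoff and rebuilding the 2-by-2 table. The sketch below illustrates this with made-up paired measurements. The function name, the data, and the convention that a value at or above the cutoff counts as positive are assumptions, not details from the review.

```python
def cross_tabulate(index_ahi, reference_ahi, cutoff):
    """2-by-2 counts (a, b, c, d) after dichotomizing both AHI series at `cutoff`.

    a: both positive, b: index + only, c: reference + only, d: both negative.
    A value at or above the cutoff counts as positive (an assumed convention).
    """
    a = b = c = d = 0
    for i, r in zip(index_ahi, reference_ahi):
        i_pos, r_pos = i >= cutoff, r >= cutoff
        if i_pos and r_pos:
            a += 1
        elif i_pos:
            b += 1
        elif r_pos:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical paired AHI values (events/hour), for illustration only
portable = [4, 12, 17, 22, 9, 35, 14, 19]
psg = [6, 9, 21, 18, 11, 42, 16, 25]

for cutoff in (10, 15, 20):
    print(cutoff, cross_tabulate(portable, psg, cutoff))
```

The same pairs yield different tables at different cutoffs, which is why the choice of cutoff should be justified or several cutoffs analyzed.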

22
Systematic Review Example: Analysis Option 3
Naïve Estimates

The reviewers calculated naïve estimates of the
sensitivity (Se) and specificity (Sp) of the
Apnea-Hypopnea Index by comparing portable
monitors with polysomnography and qualified
the results.
Naïve estimates of sensitivity and specificity
were displayed in the receiver operating
characteristic space.
High Se and Sp levels were suggested.

However, there was considerable variability in
the measurements.
It was not possible to deduce whether the naïve
estimates overestimate or underestimate the
true Se and Sp.

23
Systematic Review Example: Analysis Option 2
Pooled Data Analysis

Reviewers also described concordance between
Apnea-Hypopnea Index (AHI) measured by portable
monitors (index test) versus polysomnography
(reference test) with Bland-Altman analysis
(continuous data with individual points
available), but are the tests interchangeable?
They found better agreement for lower AHI levels.

Figure legend: Dashed line = line of perfect agreement; broad
limits = suboptimal agreement
24
Systematic Review Example: Analysis Option 2
Study-Specific Results

The reviewers summarized Bland-Altman plots
across studies.
The mean difference in the two measurements of
the Apnea-Hypopnea Index (mean bias) and the
95-percent limits of agreement are shown for each
study.
The 95-percent limits of agreement are very wide
in most studies, suggesting great variability in
the measurements with the two methods.

25
Systematic Review Example: Conclusions and a
Recommendation

Measurements of the Apnea-Hypopnea Index (AHI) with the two methods generally agree on which patients have 15 or fewer events per hour of sleep (low AHI).
The methods disagree on the exact measurement among people who have higher AHIs on average.
The reviewers identified a gap in the literature.
The reviewers recommended undertaking studies that perform clinical validation of portable monitors, i.e., their ability to predict patients' history, risk propensity, or clinical profile (analysis option 1).

When multiple reference standard tests, or multiple cutoffs for the same reference test, are available:
Justify the choice of test and/or cutoff, or
Consider analyzing multiple options.
Decide on the most appropriate analysis options to synthesize test performance.
The four analysis options presented in this
module are largely complementary approaches and
are not mutually exclusive.
Analysis options 1, 2, and 3 are recommended.
Analysis option 4 requires expert statistical
help.
There are no empirical data on the merits and
pitfalls of the mathematical adjustments in
option 4 for an imperfect reference standard.

The validity of the reference standard should be
questioned when the new test being evaluated is
an improved version of the usually applied test.
True
False

28
Practice Question 1 (2 of 2)

Explanation for Question 1
The statement is true. There are several
situations when the validity of the reference
standard should be questioned. These include when
a new method of testing is an improved version of
the usually applied test. Measurements using the
different methods may not agree well.

29
Practice Question 2 (1 of 2)

Which of the following options is considered most
preferable for evaluating information on a
diagnostic test when there is no perfect
reference test (gold standard)?
a. Assess the test's ability to predict patient-relevant outcomes instead of test accuracy.
b. Assess whether the results of the two tests agree or disagree and treat them as two alternative measurement methods.
c. Calculate estimates of the index test's sensitivity and specificity from each study, but qualify the study findings.
d. Adjust the estimates of sensitivity and specificity of the index test to account for the imperfect reference standard.

30
Practice Question 2 (2 of 2)

Explanation for Question 2
The correct answer is a. All of the options
listed are suggested methods for synthesizing
information on medical tests when there is no
gold standard. The preferred method involves
assessing the test's ability to predict
patient-relevant outcomes instead of calculating
test accuracy when compared with an imperfect
standard. This way, the index test is treated as
a predictive instrument.

31
Practice Question 3 (1 of 2)

When considering imperfect reference standard
bias, which of the following applies to naïve
estimates of sensitivity and specificity when
there is conditional independence of the results?
a. They are overestimates compared to the true values.
b. They are underestimates compared to the true values.
c. They are always equal to the true values.
d. They cannot be compared to the true values.

32
Practice Question 3 (2 of 2)

Explanation for Question 3
The correct answer is b. Conditional independence
implies that the results of the index and
reference tests are independent among people with
and without the condition of interest. In this
case, estimates of sensitivity and specificity
from the standard formulas will usually be
smaller than the true values. In other words, the
naïve estimates of sensitivity and specificity
for the index test will be underestimates of the
true values.

33
Practice Question 4 (1 of 2)

When evaluating a medical test with no gold
standard, one can mathematically calculate
accurate sensitivity and specificity of the index
test using standard 2 × 2 cross-tabulation of
test results.
True
False

34
Practice Question 4 (2 of 2)

Explanation for Question 4
The statement is false. The estimates of
sensitivity and specificity will have to be
adjusted to account for the imperfect reference
standard. This may require expert statistical
help.

35
Authors

This presentation was prepared by Brooke Heidenfelder, Andrzej Kosinski, Rachael Posey, Lorraine Sease, Remy Coeytaux, Gillian Sanders, and Alex Vaz, members of the Duke University Evidence-based Practice Center.
The module is based on Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a gold standard. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 9.1-16. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

36
References (1 of 8)

Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004 Jun;60(2):427-35. PMID: 15180668.
Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995 Aug;311(7003):485. PMID: 7647644.
Bablok W, Passing H, Bender R, et al. A general regression procedure for method transformation. Application of linear regression procedures for method comparison studies in clinical chemistry, Part III. J Clin Chem Clin Biochem 1988 Nov;26(11):783-90. PMID: 3235954.
Black MA, Craig BA. Estimating disease prevalence in the absence of a gold standard. Stat Med 2002 Sep 30;21(18):2653-69. PMID: 12228883.
Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999 Jun;8(2):135-60. PMID: 10501650.

37
References (2 of 8)

Bland JM, Altman DG. Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol 2003 Jul;22(1):85-93. PMID: 12858311.
Bossuyt PM. Interpreting diagnostic test accuracy studies. Semin Hematol 2008 Jul;45(3):189-95. PMID: 18582626.
Dendukuri N, Hadgu A, Wang L. Modeling conditional dependence between diagnostic tests: a multiple latent variable model. Stat Med 2009 Feb 1;28(3):441-61. PMID: 19067379.
Dendukuri N, Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics 2001 Mar;57(1):158-67. PMID: 11252592.
Garrett ES, Eaton WW, Zeger S. Methods for evaluating the performance of diagnostic tests in the absence of a gold standard: a latent class model approach. Stat Med 2002 May 15;21(9):1289-307. PMID: 12111879.

38
References (3 of 8)

Gart JJ, Buck AA. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. Am J Epidemiol 1966 May;83(3):593-602. PMID: 5932703.
Goldberg JD, Wittes JT. The estimation of false negatives in medical screening. Biometrics 1978 Mar;34(1):77-86. PMID: 630038.
Gyorkos TW, Genta RM, Viens P, et al. Seroepidemiology of Strongyloides infection in the Southeast Asian refugee population in Canada. Am J Epidemiol 1990 Aug;132(2):257-64. PMID: 2196791.
Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998 Dec;7(4):354-70. PMID: 9871952.
Joseph L, Gyorkos TW. Inferences for likelihood ratios in the absence of a "gold standard". Med Decis Making 1996 Oct-Dec;16(4):412-7. PMID: 8912303.

39
References (4 of 8)

Jonas DE, Wilt TJ, Taylor BC, et al. Chapter 11: challenges in and principles for conducting systematic reviews of genetic tests used as predictive indicators. J Gen Intern Med 2012 Jun;27 Suppl 1:S83-93. PMID: 22648679.
Linnet K. Estimation of the linear relationship between the measurements of two methods with proportional errors. Stat Med 1990 Dec;9(12):1463-73. PMID: 2281234.
Linnet K. Performance of Deming regression analysis in case of misspecified analytical error ratio in method comparison studies. Clin Chem 1998 May;44(5):1024-31. PMID: 9590376.
Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996 Sep;52(3):797-810. PMID: 8805757.

40
References (5 of 8)

Reitsma JB, Rutjes AW, Khan KS, et al. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. J Clin Epidemiol 2009 Aug;62(8):797-806. PMID: 19447581.
Rutjes AW, Reitsma JB, Coomarasamy A, et al. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007 Dec;11(50):iii, ix-51. PMID: 18021577.
Sokal RR, Rohlf EF. Biometry. New York, NY: Freeman; 1981.
Sun S. Meta-analysis of Cohen's kappa. Health Serv Outcomes Res Method 2011;11:145-163.
Thompson IM, Pauler DK, Goodman PJ, et al. Prevalence of prostate cancer among men with a prostate-specific antigen level ≤ 4.0 ng per milliliter. N Engl J Med 2004 May 27;350(22):2239-46. PMID: 15163773.

41
References (6 of 8)

Toft N, Jorgensen E, Hojsgaard S. Diagnosing diagnostic tests: evaluating the assumptions underlying the estimation of sensitivity and specificity in the absence of a gold standard. Prev Vet Med 2005 Apr;68(1):19-33. PMID: 15795013.
Torrance-Rynard VL, Walter SD. Effects of dependent errors in the assessment of diagnostic test performance. Stat Med 1997 Oct 15;16(19):2157-75. PMID: 9330426.
Trikalinos TA, Balion CM. Options for summarizing medical test performance in the absence of a gold standard. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 9.1-16. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

42
References (7 of 8)

Trikalinos TA, Balion CM, Coleman CI, et al. Chapter 8: meta-analysis of medical test performance when there is a gold standard. J Gen Intern Med 2012 Jun;27 Suppl 1:S56-66. PMID: 22648676.
Trikalinos TA, Ip S, Raman G, et al. Home diagnosis of obstructive sleep apnea-hypopnea syndrome. Technology Assessment (Prepared by the Tufts-New England Medical Center Evidence-based Practice Center). Rockville, MD: Agency for Healthcare Research and Quality; August 2007. Available at www.cms.gov/Medicare/Coverage/DeterminationProcess/downloads/id48TA.pdf.
Vacek PM. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985 Dec;41(4):959-68. PMID: 3830260.
Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41(9):923-37. PMID: 3054000.

43
References (8 of 8)

Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. J Clin Epidemiol 1999 Oct;52(10):943-51. PMID: 10513757.
Whiting P, Rutjes AW, Reitsma JB, et al. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004 Feb 3;140(3):189-202. PMID: 14757617.
