1
Meta-analysis of Test Performance When There Is a
Gold Standard

Prepared for
The Agency for Healthcare Research and Quality
(AHRQ)
Training Modules for Medical Test Reviews Methods
Guide
www.ahrq.gov

2
Learning Objectives

Graphically display diagnostic test performance
across multiple studies using a gold standard
reference
Explain the dependence of sensitivity and
specificity over studies and thus the need for a
multivariate (joint) analysis
Describe choices for a meta-analysis to summarize
test performance depending on whether the
sensitivity and specificity estimates from
multiple studies vary (or do not vary) widely

This module focuses on how to conduct a
meta-analysis with a gold standard reference.
Module 9 discusses how to conduct a meta-analysis
when no gold standard reference exists.
There are two goals for a meta-analysis in a
systematic review
Provide summary estimates for key quantities
Explain observed heterogeneity in the results of
studies included in the review
For systematic reviews of medical tests, a
meta-analysis often focuses on synthesis of test
performance data.

Gold standard: A reference standard that is considered adequate in defining the presence or absence of the condition of interest (disease).
Diagnostic test: A test that is potentially less accurate than the gold standard in ascertaining disease.
Data: The main focus is on tests with positive or negative results, defined by a cut-off level (threshold); each study provides a 2 × 2 tabulation.

Test Result | With Disease | Healthy
Positive | True positive | False positive
Negative | False negative | True negative
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
5
Measures Used To Assess Test Performance (1 of 2)

Sensitivity: The proportion of test positives among people with the disease (true-positive rate)
Specificity: The proportion of test negatives among healthy people (true-negative rate)
Positive predictive value: The proportion with disease among people with test-positive results
Negative predictive value: The proportion of healthy people among those with test-negative results
The predictive values can be computed from sensitivity, specificity, and disease prevalence.

Positive likelihood ratio: sensitivity/(1 − specificity) = proportion of test positives among diseased / proportion of test positives among healthy
Negative likelihood ratio: (1 − sensitivity)/specificity = proportion of test negatives among diseased / proportion of test negatives among healthy
Diagnostic odds ratio: (true positives/false negatives)/(false positives/true negatives) = odds of a positive test with disease / odds of a positive test without disease
Diagnostic odds ratios do not allow weighing of
the true-positive and false-positive rates
separately.
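As a quick sketch of the definitions above, the Python below computes each measure from a single study's 2 × 2 counts. The container and function names (`TwoByTwo`, `summarize_table`) and the example counts are hypothetical, and the sketch assumes no zero cells (in practice a continuity correction is often added).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study's 2 x 2 table (hypothetical container)."""
    tp: int  # test positive, with disease
    fp: int  # test positive, healthy
    fn: int  # test negative, with disease
    tn: int  # test negative, healthy


def summarize_table(t: TwoByTwo) -> dict:
    """Compute the test-performance measures for one 2 x 2 table.

    Assumes every cell is nonzero; no continuity correction is applied.
    """
    se = t.tp / (t.tp + t.fn)  # sensitivity: true-positive rate
    sp = t.tn / (t.tn + t.fp)  # specificity: true-negative rate
    return {
        "sensitivity": se,
        "specificity": sp,
        # Predictive values reflect this study's own prevalence.
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_pos": se / (1 - sp),
        "lr_neg": (1 - se) / sp,
        "dor": (t.tp / t.fn) / (t.fp / t.tn),
    }


metrics = summarize_table(TwoByTwo(tp=90, fp=20, fn=10, tn=80))
```

For these counts, sensitivity is 0.90, specificity 0.80, LR+ 4.5, LR− 0.125, and the diagnostic odds ratio 36, which equals LR+/LR−.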

Meta-analysis aims to provide a meaningful
summary of sensitivity and specificity across
studies.
Within each study, sensitivity and specificity are independent: they are estimated from different patients (those with the disease and those who are healthy).
Across studies, sensitivity and specificity are generally negatively correlated: as one increases, the other is expected to decrease.
This negative correlation is most obvious with varying thresholds (known as the threshold effect), varying time from symptom onset to testing, et cetera.
Positive correlations are often due to a missing covariate.

This is an example with 11 studies using D-dimer tests to diagnose acute venous thromboembolism, showing that sensitivity increases as specificity decreases.
Summarizing the two correlated quantities is a
multivariate problem, and multivariate methods
should be used to address it.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Becker DM, Philbrick
JT, Bachhuber TL, et al. Arch Intern Med 1996 May
13;156(9):939-46. PMID 8624174.
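A minimal sketch of how the across-study correlation can be examined before any modeling: transform each study's sensitivity and specificity to the logit scale (the scale used by the bivariate model) and compute their Pearson correlation. The counts below are hypothetical, and the uniform 0.5 continuity correction is an assumption chosen for illustration.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def study_logits(tp: int, fn: int, tn: int, fp: int, cc: float = 0.5):
    """Return (logit sensitivity, logit specificity) for one study.

    A continuity correction `cc` is added to every cell, which keeps the
    logits finite when a cell is zero.
    """
    se = (tp + cc) / (tp + fn + 2 * cc)
    sp = (tn + cc) / (tn + fp + 2 * cc)
    return logit(se), logit(sp)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical studies as (TP, FN, TN, FP).
studies = [(48, 2, 30, 20), (45, 5, 38, 12), (40, 10, 44, 6), (35, 15, 48, 2)]
pairs = [study_logits(*s) for s in studies]
logit_se = [p[0] for p in pairs]
logit_sp = [p[1] for p in pairs]
r = pearson(logit_se, logit_sp)  # negative here: Se falls as Sp rises
```

A naive correlation like this ignores within-study sampling error, which is exactly what the hierarchical models discussed later separate out; it is only a descriptive first look.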
9
Challenges

How to quantitatively summarize medical test performance when:
The sensitivity and specificity estimates of various studies do not vary widely
Can use a summary point to obtain summary test performance if the tests have the same threshold
Summary point: a summary sensitivity and summary specificity pair
The sensitivity and specificity of multiple
studies vary widely
Can use a summary line to describe the
relationship between average sensitivity and
average specificity
May be less important than variations in
thresholds, reference standards, study designs,
et cetera, between the studies

Principle 1: Favor the most informative way to summarize the data.
Choose between a summary point and a summary line.
Use the summary point when sensitivity/specificity do not vary much.
Use the summary line when there are different thresholds for positive tests or estimates vary widely.
Both can also be used, since they convey complementary information.
The choice is subjective; there are no hard-and-fast rules.
Principle 2: Explore the variability in study results with graphs and suitable analyses rather than relying exclusively on grand means (i.e., a single summary statistic).

Problem
Within a study, sensitivity, specificity,
positive/negative predictive values, and
prevalence are all interrelated via simple
formulas.
Meta-analyzing each metric across studies will
create summaries that are inconsistent with these
formulas.
Proposed solution
Obtain summaries for sensitivities and
specificities across studies via meta-analysis,
then back-calculate the rest of the metrics
(using the formulas) over a range of prevalence
values.
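The proposed back-calculation can be sketched in a few lines of Python: given summary sensitivity and specificity, Bayes' theorem gives the predictive values at any prevalence, and the likelihood ratios follow directly. The summary values 0.90 and 0.80 and the prevalence grid are hypothetical.

```python
def predictive_values(se: float, sp: float, prev: float):
    """Back-calculate (PPV, NPV) from sensitivity, specificity, and prevalence."""
    tp = se * prev              # expected fraction: diseased, test positive
    fn = (1 - se) * prev        # diseased, test negative
    fp = (1 - sp) * (1 - prev)  # healthy, test positive
    tn = sp * (1 - prev)        # healthy, test negative
    return tp / (tp + fp), tn / (tn + fn)


def likelihood_ratios(se: float, sp: float):
    """Back-calculate (LR+, LR-) from sensitivity and specificity."""
    return se / (1 - sp), (1 - se) / sp


summary_se, summary_sp = 0.90, 0.80  # hypothetical summary estimates
table = []
for prev in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    ppv, npv = predictive_values(summary_se, summary_sp, prev)
    table.append((prev, ppv, npv))
    print(f"prev={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
lr_pos, lr_neg = likelihood_ratios(summary_se, summary_sp)
```

Because every metric is derived from the same (sensitivity, specificity) pair, the results are internally consistent by construction; PPV rises and NPV falls as prevalence increases.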

A Visual Summary of Sensitivity/Specificity Across Studies With Back-Calculation of the Other Metrics

Abbreviations: NLR = negative likelihood ratio; NPV = negative predictive value; PLR = positive likelihood ratio; PPV = positive predictive value; Prev = prevalence; Se = sensitivity; Sp = specificity
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
13
Deciding Which Metrics To Meta-analyze (3 of 6)

Why does it make sense to directly meta-analyze
sensitivity and specificity?
It aligns well with our understanding of
positivity threshold effects.
Sensitivity and specificity are often considered
independent of prevalence.
Summary sensitivity and specificity obtained by
direct meta-analysis will always be between 0 and
1.
These two metrics are not as easily understood as
predictive values and likelihood ratios, so back
calculation of these other metrics is useful.

Why does it not make sense to directly
meta-analyze predictive values or prevalence?
Predictive values are dependent on prevalence.
Rarely is it meaningful to meta-analyze each
value across studies.
Prevalence is often wide ranging.
Prevalence cannot be estimated from case-control
studies (the main design of many medical test
studies).
It is better to back calculate these values over
a range of plausible prevalence values.

Why can directly meta-analyzing positive and
negative likelihood ratios be problematic?
Combining likelihood ratios across studies does
not guarantee the summary values are internally
consistent.
It is possible to obtain summary likelihood ratios that correspond to impossible summary sensitivities or specificities (i.e., values < 0 or > 1).
Back-calculation avoids this.
This is not common, however; direct meta-analysis often yields the same conclusions as back-calculation.
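To see why summary likelihood ratios can be internally inconsistent, note that a pair (LR+, LR−) implies a unique sensitivity/specificity pair, found by solving Se = LR+ × (1 − Sp) and 1 − Se = LR− × Sp, which gives Sp = (LR+ − 1)/(LR+ − LR−). The sketch below (hypothetical function names and values) performs that inversion; separately pooled values such as LR+ = 1.5 and LR− = 1.2 would imply a specificity above 1.

```python
import math


def se_sp_from_lrs(lr_pos: float, lr_neg: float):
    """Return the (sensitivity, specificity) pair implied by LR+ and LR-.

    Solves se = lr_pos * (1 - sp) and 1 - se = lr_neg * sp.
    """
    if math.isclose(lr_pos, lr_neg):
        raise ValueError("LR+ equal to LR- does not determine a unique pair")
    sp = (lr_pos - 1) / (lr_pos - lr_neg)
    se = lr_pos * (1 - sp)
    return se, sp


def is_possible(se: float, sp: float) -> bool:
    """True if both values are valid proportions."""
    return 0.0 <= se <= 1.0 and 0.0 <= sp <= 1.0


consistent = se_sp_from_lrs(4.5, 0.125)  # about (0.90, 0.80)
inconsistent = se_sp_from_lrs(1.5, 1.2)  # specificity above 1
```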

Directly analyzing diagnostic odds ratios
Is straightforward and follows standard
meta-analytic methods
Characteristics of the diagnostic odds ratio
Closely linked to sensitivity, specificity, and
likelihood ratios
Can easily be included in meta-regression models
for analysis of heterogeneity between studies
Disadvantages
Challenging to interpret
Impossible to weigh the true-positive rate and
the false-positive rate separately
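A short illustration of both points, using hypothetical numbers: the log diagnostic odds ratio and its usual large-sample standard error come straight from the 2 × 2 counts (which is why standard meta-analytic machinery applies), yet very different sensitivity/specificity pairs can share the same diagnostic odds ratio. Adding 0.5 to all cells only when a zero cell occurs is a common convention assumed here, not a rule from this module.

```python
import math


def dor_from_se_sp(se: float, sp: float) -> float:
    """Diagnostic odds ratio: odds of a positive test with vs. without disease."""
    return (se / (1 - se)) / ((1 - sp) / sp)


def log_dor_with_se(tp: int, fn: int, fp: int, tn: int, cc: float = 0.5):
    """Log DOR and its standard error sqrt(1/TP + 1/FN + 1/FP + 1/TN).

    Adds `cc` to every cell only when some cell is zero.
    """
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + cc for x in (tp, fn, fp, tn))
    log_dor = math.log((tp * tn) / (fn * fp))
    se = math.sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return log_dor, se


# Two tests with opposite trade-offs but the same DOR of 9.
high_se = dor_from_se_sp(0.90, 0.50)
high_sp = dor_from_se_sp(0.50, 0.90)
```

The first test is better for ruling out disease and the second for ruling it in, a distinction the single DOR value cannot express.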

Meta-analytic methods should
Respect the multivariate nature of test
performance metrics (i.e., sensitivity and
specificity)
Allow for nonindependence between sensitivity and
specificity across studies (threshold effect)
Allow for between-study heterogeneity (i.e.,
variability not explained by the statistical
distribution of the data in each study)
The most theoretically motivated approaches are
based on multivariate methods (hierarchical
modeling).

Multivariate meta-analysis of sensitivity and
specificity (i.e., joint analysis of both) should
be performed, rather than separate univariate
meta-analyses.
It requires hierarchical modeling.
Bivariate model
Hierarchical summary receiver operator
characteristic model
Both families of models use two levels to model data:
1st level: within-study variability, from 2 × 2 table counts
2nd level: between-study variability (i.e., heterogeneity), allowing for nonindependence of sensitivity and specificity across studies
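For concreteness, one common way to write the bivariate model's two levels is shown below, with a binomial within-study likelihood. The notation is ours rather than the guide's: for study i, y_{i1} is the number of true positives among n_{i1} people with disease, and y_{i0} the number of true negatives among n_{i0} healthy people.

$$
\begin{aligned}
\text{Level 1:}\quad & y_{i1} \sim \operatorname{Binomial}(n_{i1}, \mathrm{Se}_i), \qquad y_{i0} \sim \operatorname{Binomial}(n_{i0}, \mathrm{Sp}_i) \\
\text{Level 2:}\quad & \begin{pmatrix} \operatorname{logit} \mathrm{Se}_i \\ \operatorname{logit} \mathrm{Sp}_i \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix} \right)
\end{aligned}
$$

The summary point is (logit⁻¹ μ_A, logit⁻¹ μ_B); σ_A² and σ_B² capture between-study heterogeneity, and σ_AB (correlation ρ = σ_AB/(σ_A σ_B)) captures the nonindependence of sensitivity and specificity across studies.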

Model families differ in the parameters used for
between-study variability in the 2nd level.
The bivariate model uses parameters that are
transformations of the average sensitivity and
specificity.
The hierarchical summary receiver operator
characteristic (HSROC) model uses a scale
parameter and an accuracy parameter.
Both models are functions of the sensitivity and
specificity.
They also define an underlying HSROC curve.
Both models are mathematically the same in the
absence of covariates.
Both models assume a normal distribution of
parameters, which can be difficult to satisfy.

Researchers need to choose between the bivariate and the hierarchical summary receiver operator characteristic (HSROC) models when covariates are present (i.e., meta-regression analysis). For example:
The bivariate model is more appropriate when there is variation in disease severity.
This affects sensitivity but not specificity.
The bivariate model allows direct evaluation of the difference in sensitivity and/or specificity.
The HSROC model is more appropriate when spectrum effects (the subjects in a study do not represent the patients who will receive the test in practice) are present.
These are more likely to affect test accuracy than threshold.
The HSROC model allows direct evaluation of the difference in accuracy and/or threshold parameters.

Methods Commonly Used To Calculate a Summary
Point

Method | Description or Comment | Does It Have the Desired Characteristics?
Independent meta-analysis of sensitivity and specificity | Separate meta-analyses per metric; within-study variability preferably modeled by the binomial distribution | Ignores the correlation between sensitivity and specificity; underestimates the summary sensitivity and specificity and gives wrong confidence intervals
Joint (multivariate) meta-analysis of sensitivity and specificity based on hierarchical modeling | Based on multivariate (joint) modeling of sensitivity and specificity; two families of models that are equivalent when there are no covariates; modeling preferably using binomial likelihood rather than normal approximations | The generally preferred method
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
22
Preferred Methods for Obtaining a Summary Line (1
of 2)

Hierarchical modeling is recommended.
Hierarchical summary lines can be calculated from
bivariate random-effects model parameters
A range of hierarchical summary receiver operator
characteristic (HSROC) lines can be calculated
from fitted bivariate model parameters.
An example is the Rutter-Gatsonis HSROC model.
Represent alternative characterizations of the
bivariate distribution of sensitivity and
specificity
Show how the summary sensitivity changes with the
summary specificity

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Rutter CM, Gatsonis
CA. Acad Radiol 1995 Mar;2 Suppl 1:S48-56;
discussion S65-7, S70-1 passim. PMID 9419705.
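As a sketch of how summary lines follow from fitted bivariate parameters, the code below evaluates two of them. The first is the Rutter-Gatsonis HSROC curve, using the equivalence β = ln(σ_B/σ_A) given by Harbord et al. (2007), which yields logit(Se) = μ_A − (σ_A/σ_B)(logit(Sp) − μ_B). The second, for contrast, is the regression of logit sensitivity on logit specificity. The parameter values are hypothetical and were chosen with a positive covariance so the two lines slope in opposite directions.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def expit(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def rutter_gatsonis_se(sp, mu_a, mu_b, sd_a, sd_b):
    """Sensitivity on the Rutter-Gatsonis HSROC curve at specificity `sp`.

    mu_a, mu_b: mean logit sensitivity and logit specificity;
    sd_a, sd_b: their between-study standard deviations.
    The slope on the logit scale is always negative (-sd_a / sd_b).
    """
    return expit(mu_a - (sd_a / sd_b) * (logit(sp) - mu_b))


def regression_se(sp, mu_a, mu_b, sd_b, cov_ab):
    """Expected sensitivity given specificity under the bivariate normal model.

    The slope cov_ab / sd_b**2 takes the sign of the estimated correlation.
    """
    return expit(mu_a + (cov_ab / sd_b ** 2) * (logit(sp) - mu_b))


# Hypothetical fitted parameters (positive covariance on purpose).
mu_a, mu_b, sd_a, sd_b, cov_ab = 1.5, 2.0, 0.6, 0.8, 0.2
for sp in (0.80, 0.85, 0.90, 0.95):
    rg = rutter_gatsonis_se(sp, mu_a, mu_b, sd_a, sd_b)
    reg = regression_se(sp, mu_a, mu_b, sd_b, cov_ab)
    print(f"Sp={sp:.2f}  RG Se={rg:.3f}  regression Se={reg:.3f}")
```

Both lines pass through the summary point, but with a positive covariance they disagree in direction, which illustrates why different summary lines can diverge and why a summary line should not automatically be read as a threshold effect.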
23
Preferred Methods for Obtaining a Summary Line (2
of 2)

Methods Commonly Used To Calculate a Summary Line

Method | Description or Comment | Does It Have the Desired Characteristics?
Moses-Littenberg model | Summary line based on a simple regression of the difference of logit-transformed true-positive and false-positive rates versus their average | Ignores unexplained between-study variation (fixed effects); does not account for correlation between sensitivity and specificity; does not account for variability in the independent variable; inability to weight studies optimally yields wrong inferences when covariates are used
Random intercept augmentation of the Moses-Littenberg model | Regression of the difference of logit-transformed true-positive and false-positive rates versus their average, with a random intercept that allows for variability across studies | Does not account for correlation between sensitivity and specificity; does not account for variability in the independent variable
Summary receiver operator characteristic (ROC) based on hierarchical modeling | Same hierarchical modeling as for multivariate meta-analysis to obtain a summary point; many ways to obtain a (hierarchical) summary ROC: Rutter-Gatsonis (most common) and several alternative curves | Most theoretically motivated method; the Rutter-Gatsonis hierarchical summary ROC is recommended in the Cochrane Handbook, as it is the method that has been used most often
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Littenberg B, Moses
LE. Med Decis Making 1993 Oct-Dec;13(4):313-21.
PMID 8246704. Rutter CM, Gatsonis CA. Acad
Radiol 1995 Mar;2 Suppl 1:S48-56; discussion
S65-7, S70-1 passim. PMID 9419705.
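To make the first row of the table concrete, the sketch below fits the Moses-Littenberg line as described there: an unweighted least-squares regression of D = logit(TPR) − logit(FPR) on the average S = [logit(TPR) + logit(FPR)]/2, then back-transformed to ROC space. It is shown only to illustrate the method the table criticizes (it ignores between-study variation, the sensitivity/specificity correlation, and error in S); function names and data are hypothetical.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def expit(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def moses_littenberg_fit(tprs, fprs):
    """Unweighted least-squares fit of D = a + b * S across studies.

    D = logit(TPR) - logit(FPR) and S = (logit(TPR) + logit(FPR)) / 2.
    Every study gets equal weight, one of the method's known weaknesses.
    """
    d = [logit(t) - logit(f) for t, f in zip(tprs, fprs)]
    s = [(logit(t) + logit(f)) / 2 for t, f in zip(tprs, fprs)]
    n = len(d)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((si - mean_s) ** 2 for si in s)
    sxy = sum((si - mean_s) * (di - mean_d) for si, di in zip(s, d))
    b = sxy / sxx
    a = mean_d - b * mean_s
    return a, b


def moses_littenberg_tpr(fpr, a, b):
    """Back-transform the fitted line to ROC space (requires b != 2).

    Solving D = a + b * S for logit(TPR) gives
    logit(TPR) = (2a + (2 + b) * logit(FPR)) / (2 - b).
    """
    return expit((2 * a + (2 + b) * logit(fpr)) / (2 - b))


# Hypothetical points generated from a known line (a = 2.0, b = 0.4).
fprs = [0.05, 0.10, 0.20, 0.40]
tprs = [moses_littenberg_tpr(f, 2.0, 0.4) for f in fprs]
a_hat, b_hat = moses_littenberg_fit(tprs, fprs)  # recovers about (2.0, 0.4)
```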
24
Special Case: Joint Analysis of Sensitivity and
Specificity With Multiple Thresholds (1 of 2)

It is not uncommon for studies to report multiple
sensitivity/specificity pairs at several
thresholds for positive tests.
Option 1: Decide on one threshold from each study
(e.g., the threshold with the highest
sensitivity)
Option 2: Use all thresholds
An extension of the hierarchical summary receiver
operator characteristic model has been developed
for this purpose.
A method combining whole receiver operator
characteristic (ROC) curves can also be used.
It is recommended that data be explored
graphically in ROC space to highlight
similarities and differences among the studies.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
25
Special Case: Joint Analysis of Sensitivity and
Specificity With Multiple Thresholds (2 of 2)

This is an example of an ROC graph for studies
with different thresholds for total serum
bilirubin. Points on the line for each study
represent sensitivity/specificity pairs at
different threshold values.

This is a typical receiver operator
characteristic (ROC) graph for four hypothetical
studies. Studies in the left shaded area have an
LR+ > 10. Studies in the top shaded area have an
LR− < 0.1. Those in the intersection have both.
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Trikalinos TA, Chung
M, Lau J, et al. Pediatrics 2009
Oct;124(4):1162-71. PMID 19786450.
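The shaded regions in such a graph can be expressed directly. LR+ > 10 means Se > 10 × (1 − Sp), the area to the left of a line through the origin; LR− < 0.1 means 1 − Se < 0.1 × Sp, a band along the top. The sketch below classifies hypothetical study points against those cut-offs; the defaults 10 and 0.1 follow the graph described above, and the function name is ours.

```python
def lr_regions(se: float, sp: float, lr_pos_cut: float = 10.0, lr_neg_cut: float = 0.1):
    """Return (in LR+ region, in LR- region) for one study's (Se, Sp) point.

    Written without division so points with Sp = 0 or Sp = 1 are handled.
    """
    in_lr_pos = se > lr_pos_cut * (1 - sp)  # LR+ = Se / (1 - Sp) > cut
    in_lr_neg = (1 - se) < lr_neg_cut * sp  # LR- = (1 - Se) / Sp < cut
    return in_lr_pos, in_lr_neg


# Hypothetical study points as (sensitivity, specificity).
points = [(0.95, 0.99), (0.50, 0.98), (0.98, 0.60), (0.70, 0.70)]
regions = [lr_regions(se, sp) for se, sp in points]
```

The first point falls in the intersection (both criteria met), the second only in the left area, the third only in the top band, and the fourth in neither.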
26
Recommended Algorithm

A three-step algorithm is recommended for
meta-analyzing studies with a gold standard
reference
Start by considering sensitivity and specificity
separately.
Perform a multivariate meta-analysis (when each
study reports a single threshold).
Explore between-study heterogeneity

Reviewers should familiarize themselves with the
pattern of study-level sensitivities and
specificities.
Use graphical displays.
Forest plots of study sensitivities and
specificities with their confidence intervals
give a visual impression of variability of
sensitivity and specificity across studies
A plot of sensitivity (vertical axis) versus 1 − specificity (horizontal axis) gives a visual impression of the relationship between sensitivity and specificity across studies. These plots are also known as receiver operator characteristic graphs.
A shoulder-and-arm pattern is present when there
is a threshold effect.
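Forest plots need a confidence interval for each study's sensitivity and specificity. The module does not prescribe an interval method here; the sketch below assumes the Wilson score interval, which stays within [0, 1] even for proportions near 0 or 1. The counts are hypothetical.

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical study: 45 of 50 diseased test positive; 30 of 40 healthy test negative.
se_ci = wilson_ci(45, 50)
sp_ci = wilson_ci(30, 40)
print(f"Se 0.90 [{se_ci[0]:.3f}, {se_ci[1]:.3f}]")
print(f"Sp 0.75 [{sp_ci[0]:.3f}, {sp_ci[1]:.3f}]")
```

One such row per study, drawn side by side for sensitivity and specificity, is the raw material of the paired forest plots.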

Examples of forest plots
An example of a receiver operator characteristic
graph with the shoulder-and-arm pattern

Increasing the threshold decreases sensitivity
but increases specificity
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Becker DM, Philbrick
JT, Bachhuber TL, et al. Arch Intern Med 1996 May
13;156(9):939-46. PMID 8624174.
29
Step 2: Multivariate Meta-analysis (When Each
Study Reports a Single Threshold)

Obtain a 2-dimensional summary point (sensitivity, specificity) using the bivariate model of meta-analysis, preferably with a binomial likelihood.
Obtain summary lines based on multivariate meta-analytic models.
A summary line should not automatically be interpreted as reflecting a threshold effect, especially if there is a positive correlation between sensitivity and specificity across studies.
If more than one threshold is reported per study,
consider incorporating all of them in the
analysis both qualitatively (via graphs) and
quantitatively (via proper methods).

The hierarchical summary receiver operator
characteristic (HSROC) model allows direct
evaluation of heterogeneity in accuracy and
threshold parameters.
Bivariate models allow direct evaluation of
sensitivity and specificity.
Added covariates that reduce variability across
studies may need to be taken into account when
summarizing the studies.
Some common sources of heterogeneity
Patient population/selection
Methods to verify/interpret results
Clinical setting
Disease severity

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
31
Example 1: D-Dimer Assays for Diagnosing Venous
Thromboembolism (1 of 4)
Forest Plots of Sensitivity, Specificity, and
Likelihood Ratios

D-dimers are fragments specific to fibrin
degradation.
They are measured by using an enzyme-linked
immunosorbent assay (ELISA) to diagnose venous
thromboembolism.

Forest plots show more heterogeneity in
sensitivity/specificity than in likelihood
ratios.
Verified by formal heterogeneity testing
May be a threshold effect
Because of the variety of thresholds used across the studies, it is more informative to summarize test performance with a hierarchical summary receiver operator characteristic plot than by summarizing sensitivities and specificities.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
33
Example 1: D-Dimer Assays for Diagnosing Venous
Thromboembolism (3 of 4)
HSROC Plot of D-Dimer Tests Using the Highest
Thresholds

The shoulder-and-arm pattern indicates the
threshold effect.
The location of points in the upper shaded area
of the receiver operator characteristic space
indicates high sensitivity and low specificity.
The test minimizes false-negative results and is
good for ruling out disease.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Lijmer JG, Bossuyt
PM, Heisterkamp SH. Stat Med 2002 Jun
15;21(11):1525-37. PMID 12111918.
34
Example 1: D-Dimer Assays for Diagnosing Venous
Thromboembolism (4 of 4)
Calculated Negative Predictive Values for the
D-Dimer Test With the Prevalence of Venous
Thromboembolism Between 5 and 50 Percent

It is informative to give a summary of the
negative and positive predictive values for this
test.
Calculate over a range of prevalence values using
the summary sensitivity and specificity values.
A consistently high negative predictive value
line means that a high percentage of people who
test negative actually are negative for the
disease.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Lijmer JG, Bossuyt
PM, Heisterkamp SH. Stat Med 2002 Jun
15;21(11):1525-37. PMID 12111918.
35
Example 2: Serial Measurements of the Creatine
Kinase-Myocardial Band To Diagnose Acute Cardiac
Ischemia (1 of 3)

Serial measurements of the creatine
kinase-myocardial band (CK-MB) are used to
diagnose acute cardiac ischemia in the emergency
room.
Blood levels of CK-MB increase over time from
symptom onset.
14 studies performed CK-MB testing at varying
times after symptom onset.
There was evident heterogeneity in sensitivity
that was not attributable to the threshold
effect.
The sensitivity of the test increased as the time
from symptom onset increased.
The difference in sensitivity may be attributable to the time from symptom onset to testing; to examine this possibility, a bivariate meta-analytic model was used.

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm.
36
Example 2: Serial Measurements of the Creatine
Kinase-Myocardial Band To Diagnose Acute Cardiac
Ischemia (2 of 3)

Sensitivity increases with longer hours from
symptom onset to the last measurement of the
creatine kinase-myocardial band.

Figure panels: Actual Hours; 95-Percent Confidence Regions
Actual hours are indicated next to the points.
Circles: ≤3 hours; Xs: >3 hours
Dashed lines: 95-percent confidence regions (blue: ≤3 hours; red: >3 hours)
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Balk EM, Ioannidis
JP, Salem D, et al. Ann Emerg Med 2001
May;37(5):478-94. PMID 11326184. Lau J,
Ioannidis JP, Balk E, et al. Evid Rep Technol
Assess (Summ) 2000 Sep;(26):1-4. PMID 11079073.
37
Example 2: Serial Measurements of the Creatine
Kinase-Myocardial Band To Diagnose Acute Cardiac
Ischemia (3 of 3)

The hierarchical summary receiver operator
characteristic (HSROC) model (bivariate
meta-regression) was used to compare summary
sensitivity and specificity with a binary
variable to account for timing of the last serial
creatine kinase-myocardial band measurement
(fixed-effects binary covariate).
Note that properly specified bivariate/HSROC
meta-regressions can be used to compare two or
more index tests.

Meta-analysis Metric | ≤3 Hours | >3 Hours | P-Value for the Comparison Across Subgroups
Summary sensitivity (percentage) | 80 (64 to 90) | 96 (85 to 99) | 0.36
Summary specificity (percentage) | 97 (94 to 98) | 97 (95 to 99) | 0.56
Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In Methods guide for medical
test reviews. Available at www.effectivehealthcare
.ahrq.gov/medtestsguide.cfm. Balk EM, Ioannidis
JP, Salem D, et al. Ann Emerg Med 2001
May;37(5):478-94. PMID 11326184. Lau J,
Ioannidis JP, Balk E, et al. Evid Rep Technol
Assess (Summ) 2000 Sep;(26):1-4. PMID 11079073.
38
Overall Recommendations (1 of 3)

Use the bivariate random-effects meta-analytic
models to obtain a summary sensitivity and
specificity.
Back-calculate the overall positive and negative
predictive values (over a range of prevalence
values) from summary estimates of sensitivity and
specificity, rather than meta-analyzing them
directly.
Back-calculate overall positive and negative
likelihood ratios from summary estimates of
sensitivity and specificity, rather than
meta-analyzing them directly.

To obtain a summary line, use multivariate
meta-analysis methods such as the hierarchical
summary receiver operator characteristic (HSROC)
model.
Several summary lines can be obtained based on
multivariate meta-analytic models.
They can differ when the estimated correlation
between sensitivity and specificity is positive
and when there is little between-study
variability.
If there is evidence of a positive correlation,
the variability in the studies cannot be
attributed to a threshold effect.
Explore for missing important covariates.

If more than one threshold is reported per study,
this must be taken into account in the
quantitative analyses.
Qualitative analysis with graphs and quantitative
analyses with proper methods are encouraged.
Explore the impact of study characteristics on
summary results using meta-regression-based
analyses or subgroup analyses in the context of
the primary methodology used to summarize the
studies.

Practice Question 1 (1 of 2)

Within individual studies of a systematic review,
sensitivity and specificity are independent
variables.
True
False

42
Practice Question 1 (2 of 2)

Explanation for Question 1
This statement is true. Sensitivity and
specificity within each study are independent
because they are estimated from different
patients. Across studies they typically are
negatively correlated.

43
Practice Question 2 (1 of 2)

Why does this module recommend directly
meta-analyzing sensitivity and specificity?
a. Sensitivity and specificity are dependent on the prevalence of the condition under study.
b. Other predictive values and likelihood ratios can be back-calculated for a range of prevalence values by using known formulas.
c. Summary sensitivity and specificity obtained by direct meta-analysis will always be greater than 1.
d. Interpretation of sensitivity and specificity is very intuitive.

44
Practice Question 2 (2 of 2)

Explanation for Question 2
The correct answer is b. Once the summary
sensitivity and specificity are calculated by
meta-analysis, there are formulas that allow the
back-calculation of overall predictive values and likelihood ratios. Likelihood ratios and predictive values are more easily interpreted by the reader of the review. Sensitivity and specificity are often considered to be independent of prevalence because they do not depend on it mathematically, and summary estimates obtained by direct meta-analysis will always be between 0 and 1.

45
Practice Question 3 (1 of 2)

What is the preferred method for obtaining a
summary sensitivity and specificity in a
meta-analysis?
a. Multivariate meta-analysis
b. Separate univariate meta-analyses
c. Using a summary line
d. The Kester and Buntinx variant

46
Practice Question 3 (2 of 2)

Explanation for Question 3
The correct answer is a. A multivariate
meta-analysis of sensitivity and specificity is
the recommended method for obtaining a summary
point (summary sensitivity and specificity). This
is a joint analysis of both quantities instead of
separate univariate meta-analyses. Obtaining a
summary line is an alternative to calculating a
summary point. The Kester and Buntinx method is
used to analyze sensitivity and specificity pairs
when there are several thresholds for positive
tests.

47
Practice Question 4 (1 of 2)

In which situation would a summary line be more
helpful in summarizing medical test performance?
a. Sensitivity and specificity estimates of various studies do not vary widely.
b. Sensitivity and specificity of various studies vary over a large range.

48
Practice Question 4 (2 of 2)

Explanation for Question 4
The correct answer is b. Both a summary point and
a summary line are informative and are useful in
synthesizing data. There are no strict rules to
follow in deciding which to use. A summary line
may be more helpful as a summary of test
performance when the sensitivity and specificity
estimates of various studies vary over a large
range.

49
Authors

This presentation was prepared by Brooke
Heidenfelder, Andrzej Kosinski, Rachael Posey,
Lorraine Sease, Remy Coeytaux, Gillian Sanders,
and Alex Vaz, of the Duke University
Evidence-based Practice Center.
The module is based on Trikalinos TA, Coleman CI,
Griffith L, et al. Meta-analysis of test
performance when there is a gold standard. In
Chang SM and Matchar DB, eds. Methods guide for
medical test reviews. Rockville, MD: Agency for
Healthcare Research and Quality; June 2012. p.
8.1-21. AHRQ Publication No. 12-EHC017. Available
at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

50
References (1 of 9)

Arends LR, Hamza TH, van Houwelingen JC, et al. Bivariate random effects meta-analysis of ROC curves. Med Decis Making 2008 Sep-Oct;28(5):621-38. PMID 18591542.
Balk EM, Ioannidis JP, Salem D, et al. Accuracy of biomarkers to diagnose acute cardiac ischemia in the emergency department: a meta-analysis. Ann Emerg Med 2001 May;37(5):478-94. PMID 11326184.
Becker DM, Philbrick JT, Bachhuber TL, et al. D-dimer testing and acute venous thromboembolism. A shortcut to accurate diagnosis? Arch Intern Med 1996 May 13;156(9):939-46. PMID 8624174.
Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003 Jan 7;138(1):W1-12. PMID 12513067.

51
References (2 of 9)

Chappell FM, Raab GM, Wardlaw JM. When are summary ROC curves appropriate for diagnostic meta-analyses? Stat Med 2009 Sep 20;28(21):2653-68. PMID 19591118.
Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004 Jul 17;329(7458):168-9. PMID 15258077.
Dukic V, Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics 2003 Dec;59(4):936-46. PMID 14969472.
Fu R, Gartlehner G, Grant M, et al. Conducting quantitative synthesis when comparing medical interventions: AHRQ and the Effective Health Care Program. J Clin Epidemiol 2011 Nov;64(11):1187-97. PMID 21477993.
Glas AS, Lijmer JG, Prins MH, et al. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol 2003 Nov;56(11):1129-35. PMID 14615004.

52
References (3 of 9)

Harbord RM, Deeks JJ, Egger M, et al. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007 Apr;8(2):239-51. PMID 16698768.
Harbord RM, Whiting P, Sterne JA, et al. An empirical comparison of methods for meta-analysis of diagnostic accuracy showed hierarchical models are necessary. J Clin Epidemiol 2008 Nov;61(11):1095-103. PMID 19208372.
Hartmann KE, Matchar DB, Chang S. Chapter 6: assessing applicability of medical test studies in systematic reviews. J Gen Intern Med 2012 Jun;27 Suppl 1:S39-46. PMID 22648674.
Irwig L, Tosteson AN, Gatsonis C, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994 Apr 15;120(8):667-76. PMID 8135452.

53
References (4 of 9)

Kardaun JW, Kardaun OJ. Comparative diagnostic
performance of three radiological procedures for
the detection of lumbar disk herniation. Methods
Inf Med 1990 Jan;29(1):12-22. PMID 2308524.
Kester AD, Buntinx F. Meta-analysis of ROC
curves. Med Decis Making 2000 Oct-Dec;20(4):430-9.
PMID 11059476.
Lau J, Ioannidis JP, Balk E, et al. Evaluation of
technologies for identifying acute cardiac
ischemia in emergency departments. Evid Rep
Technol Assess (Summ) 2000 Sep;(26):1-4. PMID
11079073.
Lau J, Ioannidis JP, Schmid CH. Summing up
evidence: one answer is not always enough. Lancet
1998 Jan 10;351(9096):123-7. PMID 9439507.
Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test
accuracy may vary with prevalence: implications
for evidence-based diagnosis. J Clin Epidemiol
2009 Jan;62(1):5-12. PMID 18778913.

54
References (5 of 9)

Lijmer JG, Bossuyt PM, Heisterkamp SH. Exploring
sources of heterogeneity in systematic reviews of
diagnostic tests. Stat Med 2002 Jun
15;21(11):1525-37. PMID 12111918.
Lijmer JG, Mol BW, Heisterkamp S, et al.
Empirical evidence of design-related bias in
studies of diagnostic tests. JAMA 1999 Sep
15;282(11):1061-6. PMID 10493205.
Littenberg B, Moses LE. Estimating diagnostic
accuracy from multiple conflicting reports: a new
meta-analytic method. Med Decis Making 1993
Oct-Dec;13(4):313-21. PMID 8246704.
Loong TW. Understanding sensitivity and
specificity with the right side of the brain. BMJ
2003 Sep 27;327(7417):716-9. PMID 14512479.
Moses LE, Shapiro D, Littenberg B. Combining
independent studies of a diagnostic test into a
summary ROC curve: data-analytic approaches and
some additional considerations. Stat Med 1993 Jul
30;12(14):1293-316. PMID 8210827.

55
References (6 of 9)

Mulherin SA, Miller WC. Spectrum bias or spectrum
effect? Subgroup variation in diagnostic test
evaluation. Ann Intern Med 2002 Oct
1;137(7):598-602. PMID 12353947.
Oei EH, Nikken JJ, Verstijnen AC, et al. MR
imaging of the menisci and cruciate ligaments: a
systematic review. Radiology 2003
Mar;226(3):837-48. PMID 12601211.
Reitsma JB, Glas AS, Rutjes AW, et al. Bivariate
analysis of sensitivity and specificity produces
informative summary measures in diagnostic
reviews. J Clin Epidemiol 2005 Oct;58(10):982-90.
PMID 16168343.
Riley RD, Abrams KR, Lambert PC, et al. An
evaluation of bivariate random-effects
meta-analysis for the joint synthesis of two
correlated outcomes. Stat Med 2007 Jan
15;26(1):78-97. PMID 16526010.

56
References (7 of 9)

Riley RD, Abrams KR, Sutton AJ, et al. Bivariate
random-effects meta-analysis and the estimation
of between-study correlation. BMC Med Res
Methodol 2007 Jan 12;7:3. PMID 17222330.
Rutjes AW, Reitsma JB, Di Nisio M, et al.
Evidence of bias and variation in diagnostic
accuracy studies. CMAJ 2006 Feb 14;174(4):469-76.
PMID 16477057.
Rutter CM, Gatsonis CA. Regression methods for
meta-analysis of diagnostic test data. Acad
Radiol 1995 Mar;2 Suppl 1:S48-56; discussion
S65-7, S70-1 passim. PMID 9419705.
Simel DL, Bossuyt PM. Differences between
univariate and bivariate models for summarizing
diagnostic accuracy may not be large. J Clin
Epidemiol 2009 Dec;62(12):1292-300. PMID
19447007.
Thompson SG, Sharp SJ. Explaining heterogeneity
in meta-analysis: a comparison of methods. Stat
Med 1999 Oct 30;18(20):2693-708. PMID 10521860.

57
References (8 of 9)

Trikalinos TA, Coleman CI, Griffith L, et al.
Meta-analysis of test performance when there is a
gold standard. In: Chang SM, Matchar DB,
eds. Methods guide for medical test reviews.
Rockville, MD: Agency for Healthcare Research and
Quality; June 2012. p. 8.1-21. AHRQ Publication
No. 12-EHC017. Available at
www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.
Trikalinos TA, Chung M, Lau J, et al. Systematic
review of screening for bilirubin encephalopathy
in neonates. Pediatrics 2009 Oct;124(4):1162-71.
PMID 19786450.
Visser K, Hunink MG. Peripheral arterial disease:
gadolinium-enhanced MR angiography versus
color-guided duplex US--a meta-analysis.
Radiology 2000 Jul;216(1):67-77. PMID 10887229.
Whiting P, Rutjes AW, Reitsma JB, et al. Sources
of variation and bias in studies of diagnostic
accuracy: a systematic review. Ann Intern Med
2004 Feb 3;140(3):189-202. PMID 14757617.