Advanced Statistical Analysis in Epidemiology: Inter-rater Reliability, Diagnostic Cutpoints, Test Comparison

Transcript and Presenter's Notes
1
Advanced Statistical Analysis in Epidemiology: Inter-rater Reliability, Diagnostic Cutpoints, Test Comparison, Discrepant Analysis, Polychotomous Logistic Regression, and Generalized Estimating Equations
  • Jeffrey J. Kopicko, MSPH
  • Tulane University School of Public Health and
    Tropical Medicine

2
Diagnostic Statistics
Typically assess a 2 x 2 contingency table taking the form:
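Using the conventional cell labels a through d (the same cells referenced in the PABAK slide below), the layout is:

                    Disease present    Disease absent
    Test positive         a                  b
    Test negative         c                  d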
3
Inter-rater Reliability
Suppose that two different tests exist for the
diagnosis of a specific disease. We are
interested in determining if the new test is as
reliable in diagnosing the disease as the old
test (gold standard).
4
Inter-rater Reliability continued
In 1960, Cohen proposed a statistic that would
provide a measure of reliability between the
ratings of two different radiologists in the
interpretation of x-rays. He called it the Kappa
coefficient.
5
Inter-rater Reliability continued
Cohen's Kappa can be used to assess the reliability between two raters or diagnostic tests. Based on the previous contingency table, it has the following form and interpretation:
6
Inter-rater Reliability continued
Cohens Kappa
where
Rosner, 1986
7
Inter-rater Reliability continued
Cohen's Kappa is appropriately used when the prevalence of the disease is not extreme and the marginal totals of the contingency table are distributed evenly. When these conditions do not hold, Cohen's Kappa will be erroneously low.
8
Inter-rater Reliability continued
Byrt et al. proposed a solution to these possible biases in 1994. They called their solution the Prevalence-Adjusted Bias-Adjusted Kappa, or PABAK. It has the same interpretation as Cohen's Kappa and the following form:
9
Inter-rater Reliability continued
1. Take the mean of b and c.
2. Take the mean of a and d.
3. Compute PABAK using these means, the original Cohen's Kappa formula, and this table.
Because the averaged table has balanced margins, the chance-expected agreement is 0.5, so the procedure reduces to the closed form PABAK = 2 p_o - 1, where p_o = (a + d) / n.
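A minimal Python sketch of both statistics, assuming the 2 x 2 cell layout above (the function name is illustrative):

    def kappa_statistics(a, b, c, d):
        """Cohen's Kappa and PABAK for a 2 x 2 agreement table.

        Cell layout: a = both positive, b = rater 1 positive only,
        c = rater 2 positive only, d = both negative.
        """
        n = a + b + c + d
        p_o = (a + d) / n  # observed agreement
        p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
        kappa = (p_o - p_e) / (1 - p_e)
        pabak = 2 * p_o - 1  # prevalence- and bias-adjusted kappa
        return kappa, pabak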
10
Inter-rater Reliability continued
  • PABAK is preferable in all instances, regardless
    of the prevalence or the potential bias between
    raters.
  • More meaningful statistics regarding the
    diagnostic value of a test can be computed,
    however.

11
Diagnostic Measures
  • Prevalence
  • Sensitivity
  • Specificity
  • Predictive Value Positive
  • Predictive Value Negative

12
Diagnostic Measures continued
Prevalence
Definition: Prevalence quantifies the proportion of individuals in a population who have the disease at a specific instant, and provides an estimate of the probability (risk) that an individual will be ill at a point in time.
Formula: Prevalence = (a + c) / n
13
Diagnostic Measures continued
Sensitivity
Definition: Sensitivity is defined as the probability of testing positive if the disease is truly present.
Formula: Sensitivity = a / (a + c)
14
Diagnostic Measures continued
Specificity
Definition: Specificity is defined as the probability of testing negative if the disease is truly absent.
Formula: Specificity = d / (b + d)
15
Diagnostic Measures continued
Predictive Value Positive
Definition: Predictive Value Positive (PV+) is defined as the probability that a person actually has the disease given that he or she tests positive.
Formula: PV+ = a / (a + b)
16
Diagnostic Measures continued
Predictive Value Negative
Definition: Predictive Value Negative (PV-) is defined as the probability that a person is actually disease-free given that he or she tests negative.
Formula: PV- = d / (c + d)
17
Example Cervical Cancer Screening
The standard of care for cervical cancer/dysplasia detection is the Pap smear. We want to assess a new serum DNA detection test for the Human Papillomavirus (HPV).
18
Prevalence  = 55/500  = 0.110
Sensitivity = 50/55   = 0.909
Specificity = 410/445 = 0.921
PV+         = 50/85   = 0.588
PV-         = 410/415 = 0.988
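These ratios imply the following counts for the 2 x 2 table (back-calculated from the figures above):

                    Disease +    Disease -    Total
    Test positive       50           35          85
    Test negative        5          410         415
    Total               55          445         500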
19
Receiver Operating Characteristic (ROC) Curves
Sensitivities and specificities are used to:
1. Determine the diagnostic value of a test.
2. Determine the appropriate cutpoint for continuous data.
3. Compare the diagnostic values of two or more tests.
20
ROC Curves continued
1. For every gap in the continuous data, the mean value is taken as the cutoff. This is where there is a change in the contingency table distribution.
2. At each new cutpoint, the sensitivity and specificity are calculated.
3. The sensitivity is graphed versus 1 - specificity.
21
ROC Curves continued
4. Since the sensitivity and specificity are proportions, the total area of the graph is 1.0 units.
5. The area under the curve is the statistic of interest.
6. The area under a curve produced by chance alone is 0.50 units.
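A minimal Python sketch of steps 1 through 6, assuming a list of continuous test scores and true disease labels (the function names are illustrative):

    def roc_points(scores, labels):
        """ROC points for a continuous test: at the midpoint of each gap
        between adjacent distinct scores, call 'positive' above the cutoff,
        then record (1 - specificity, sensitivity)."""
        cuts = sorted(set(scores))
        mids = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        pts = [(0.0, 0.0), (1.0, 1.0)]
        for t in mids:
            sens = sum(s > t and l for s, l in zip(scores, labels)) / n_pos
            spec = sum(s <= t and not l for s, l in zip(scores, labels)) / n_neg
            pts.append((1 - spec, sens))
        return sorted(pts)

    def auc(points):
        """Trapezoidal area under the ROC curve; chance alone gives 0.50."""
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))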
22
ROC Curves continued
7. If the area under the diagnostic test curve is significantly above 0.50, then the test is a good predictor of disease.
8. If the area under the diagnostic test curve is significantly below 0.50, then the test is an inverse predictor of disease.
9. If the area under the diagnostic test curve is not significantly different from 0.50, then the test is a poor predictor of disease.
23
ROC Curves continued
10. An individual curve can be compared to 0.50 using the N(0, 1) distribution.
11. Two or more diagnostic tests can also be compared using the N(0, 1) distribution.
12. A diagnostic cutpoint can be determined for tests with continuous outcomes in order to maximize the sensitivity and specificity of the test.
24
(No Transcript)
25
ROC Curves continued
Determining Diagnostic Cutpoints
26
ROC Curves continued
Determining Diagnostic Cutpoints
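One common way to operationalize "maximize the sensitivity and specificity" is Youden's index (sensitivity + specificity - 1). The slides do not state their exact selection rule, so this sketch is one reasonable reading rather than the authors' method:

    def best_cutpoint(scores, labels):
        """Midpoint cutoff maximizing Youden's J = sensitivity + specificity - 1."""
        cuts = sorted(set(scores))
        mids = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos

        def youden(t):
            sens = sum(s > t and l for s, l in zip(scores, labels)) / n_pos
            spec = sum(s <= t and not l for s, l in zip(scores, labels)) / n_neg
            return sens + spec - 1

        return max(mids, key=youden)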
27
(No Transcript)
28
ROC Curves continued
Diagnostic Value of a Test

    z = (a_1 - a_0) / sqrt(se(a_1)^2 + se(a_0)^2)

where a_1 = area under the diagnostic test curve, a_0 = 0.50, se(a_1) is the standard error of the area, and se(a_0) = 0.00. The z statistic is referred to the N(0, 1) distribution.
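A minimal sketch of this z-test, assuming the area and its standard error have already been estimated (scipy is used only for the normal tail probability):

    from math import sqrt
    from scipy.stats import norm

    def compare_auc(a1, se1, a2=0.50, se2=0.0):
        """Two-sided z-test comparing two ROC areas against N(0, 1).
        With the defaults, a single curve is tested against chance (0.50)."""
        z = (a1 - a2) / sqrt(se1**2 + se2**2)
        return z, 2 * norm.sf(abs(z))  # z statistic and two-sided p-value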
29
ROC Curves continued
Diagnostic Value of a Test
For the RCP example, the area under the curve is 0.987, with a p-value of <0.001. The optimal cutpoint for this test is 1.1 ng/ml.
30
ROC Curves continued
Comparing the areas under 2 or more curves
In order to compare the areas under two or more ROC curves, use the same formula, substituting the values for the second curve for those previously defined for chance alone:

    z = (a_1 - a_2) / sqrt(se(a_1)^2 + se(a_2)^2)
31
(No Transcript)
32
ROC Curves continued
Comparing the areas under 2 or more curves
For the CMV retinitis example, the Digene test had the largest area (although not significantly greater than antigenemia). The cutpoint was determined to be 1,400 cells/cc. The sensitivity was 0.85 and the specificity was 0.84. Bonferroni adjustments must be made for more than two comparisons.
33
ROC Curves continued
Another Application?
Remember when Cohen's Kappa was unstable at extreme prevalence and/or when there was bias among the raters? What about using ROC curves to assess inter-rater reliability?
34
ROC Curves continued
Another limitation of K is that it provides only a measure of agreement, regardless of whether the raters correctly classify the state of the items. K can be high, indicating excellent reliability, even though both raters incorrectly assess the items.
35
ROC Curves continued
The two areas under the curves may be compared as a measure of overall inter-rater reliability. This comparison is made by applying the following formula:

    droc = 1 - |Area_1 - Area_2|

By subtracting the absolute difference in areas from one, droc is placed on a scale similar to K, ranging from 0 to 1.
36
ROC Curves continued
If both raters correctly classify the objects at
the same rate, their sensitivities and
specificities will be equal, resulting in a droc
of 1. If one rater correctly classifies all the
objects, and the second rater misclassifies all
the objects, droc will equal 0.
37
(No Transcript)
38
Statistics for Figure 1 (N = 20)

Rater One: percent correct = 80%, sensitivity = 0.80, specificity = 0.80, area under ROC = 0.80
Rater Two: percent correct = 55%, sensitivity = 0.60, specificity = 0.533, area under ROC = 0.567

droc = 0.7667
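A minimal sketch reproducing these figures. For a single binary rater, the ROC polygon through (0, 0), (1 - specificity, sensitivity), and (1, 1) has area (sensitivity + specificity) / 2, which matches the areas quoted above:

    def rater_auc(sens, spec):
        """Area under a binary rater's ROC polygon: (sens + spec) / 2."""
        return (sens + spec) / 2

    def d_roc(area1, area2):
        """Inter-rater reliability: droc = 1 - |Area_1 - Area_2|."""
        return 1 - abs(area1 - area2)

    # Figure 1 example: Rater One (0.80, 0.80), Rater Two (0.60, 0.533)
    print(d_roc(rater_auc(0.80, 0.80), rater_auc(0.60, 0.533)))  # ~0.7667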
39
Monte Carlo Simulation
Several different levels of disease prevalence, sample size, and rater error rate were assessed using Monte Carlo methods. Total sample sizes of 20, 50, and 100 were generated, each for disease prevalences of 5, 15, 25, 50, 75, and 90 percent. Two raters were used in this study. Rater One was fixed at a 5 percent probability of misclassifying the true state of the disease, while Rater Two was allowed varying levels of percent misclassification. For each condition of disease prevalence, rater error, and sample size, 1,000 valid samples were generated and analyzed using SAS PROC IML.
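The original analysis used SAS PROC IML; a minimal Python sketch of one simulation condition (parameter values illustrative) might look like:

    import random

    def simulate_droc(n=20, prevalence=0.25, err1=0.05, err2=0.20,
                      reps=1000, seed=1):
        """Mean droc over `reps` valid Monte Carlo samples: two raters
        misclassify a true binary disease state at fixed error rates."""
        rng = random.Random(seed)
        results = []
        while len(results) < reps:
            truth = [rng.random() < prevalence for _ in range(n)]
            if not any(truth) or all(truth):
                continue  # sensitivity/specificity undefined; not a valid sample
            r1 = [t != (rng.random() < err1) for t in truth]  # flip w.p. err1
            r2 = [t != (rng.random() < err2) for t in truth]
            n_pos = sum(truth)
            n_neg = n - n_pos

            def area(r):
                sens = sum(a and t for a, t in zip(r, truth)) / n_pos
                spec = sum(not a and not t for a, t in zip(r, truth)) / n_neg
                return (sens + spec) / 2

            results.append(1 - abs(area(r1) - area(r2)))
        return sum(results) / reps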
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
Based on the above results, it appears that the difference in two ROC curves may be a more stable estimate of inter-rater agreement than K. Based on the metric used to assess K, a similar metric can be formed for the difference in two ROC curves. We propose the following:

    1.0 > droc > 0.95   excellent reliability
    0.8 < droc < 0.95   good reliability
    0   < droc < 0.8    marginal reliability
45
ROC Curves continued
From the example data provided with Figure 1, it can be seen that droc behaves similarly to K. The droc from these data is 0.7667, while K is 0.30. Both result in a decision of marginal inter-rater reliability. However, from the ROC plot and the percent correct for each rater, it is seen that Rater One is much more accurate in his observations than Rater Two, with percent agreements of 80% and 55%, respectively.
46
ROC Curves continued
Without the individual calculation of the sensitivities and specificities, information about the correctness of the raters would have remained obscure. Additionally, with the large differential rater error, K may have been underestimated. The difference in ROC curves offers many advantages over K, but only when the true state of the objects being rated is known. Finally, with very little adaptation, these methods may be extended to more than two raters and to continuous outcome data.
47
So, we now know how to assess whether a test is a good predictor of disease, how to compare two or more tests, and how to determine cutpoints. But what if there is no established gold standard?
48
Discrepant Analysis
Discrepant Analysis (DA) is a commonly used (and commonly misused) technique for estimating the sensitivity and specificity of diagnostic tests when the available reference tests are imperfect gold standards. This technique often results in upwardly biased estimates of the diagnostic statistics.
49
Discrepant Analysis continued
Example Chlamydia trachomatis is a common STI
that has been diagnosed using cervical swab
culture for years. Often, though, patients only
present for screening when they are symptomatic.
Symptomatic screening may be closely associated
with organism load. Therefore, culture diagnosis
may miss carriers and patients with low organism
loads.
50
Discrepant Analysis continued
Example continued GenProbe testing has also been
used to capture some cases that are not captured
by culture. New polymerase chain reaction (PCR)
and ligase chain reaction (LCR) DNA assays may be
better diagnostic tests. But, there is obviously
no good gold-standard.
51
Discrepant Analysis continued
Example continued:
1. Culture vs. PCR
2. Culture + GenProbe vs. PCR
3. Culture vs. LCR
4. Culture + GenProbe vs. LCR
and many other combinations.
52
Discrepant Analysis continued
Example continued: The goal is to maximize the sensitivity and specificity of the new tests, since we think that the new tests are probably more accurate. The major limitation is that this is often seen as a fishing expedition, with a great possibility of Type I error and inflation of the diagnostic statistics.
53
Polychotomous Logistic Regression
Simple logistic regression is useful when the outcome of interest is binomial (i.e., yes/no, male/female, etc.). Linear regression is useful when the outcome of interest is continuous (i.e., age, blood pressure, etc.). But what if the outcome is categorical with more than two levels?
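A minimal polychotomous (multinomial) logistic regression sketch, using simulated, purely illustrative data and assuming the statsmodels package:

    import numpy as np
    import statsmodels.api as sm

    # Illustrative data: a 3-level categorical outcome and two covariates
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = rng.integers(0, 3, size=200)  # outcome categories 0, 1, 2

    # One set of log-odds is estimated per non-reference category;
    # each set is interpreted like an ordinary logistic regression
    fit = sm.MNLogit(y, X).fit(disp=0)
    print(fit.summary())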
54
Generalized Estimating Equations
GEE is used when there are repeated measures on continuous, ordinal, or categorical outcomes, and there are different numbers of measurements on each subject. It is useful in that it accounts for missing data at different time points. The interpretation of the GEE model is the same as that of the corresponding ordinary regression.
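A minimal GEE sketch, again with simulated, illustrative data and assuming statsmodels; the exchangeable working correlation is one common choice, not necessarily the presenter's:

    import numpy as np
    import statsmodels.api as sm

    # Illustrative repeated measures: 50 subjects with 2-5 visits each
    rng = np.random.default_rng(0)
    ids = np.repeat(np.arange(50), rng.integers(2, 6, size=50))
    X = sm.add_constant(rng.normal(size=(len(ids), 1)))
    y = (rng.random(len(ids)) < 0.4).astype(int)  # binary outcome

    # Coefficients are read exactly as in ordinary logistic regression
    fit = sm.GEE(y, X, groups=ids,
                 family=sm.families.Binomial(),
                 cov_struct=sm.cov_struct.Exchangeable()).fit()
    print(fit.summary())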
55
(No Transcript)