Title: Improving Content Validity: A Confidence Interval for Small Sample Expert Agreement
- Jeffrey M. Miller and Randall D. Penfield
- NCME, San Diego
- April 13, 2004
- University of Florida
- millerjm_at_ufl.edu, penfield_at_coe.ufl.edu
INTRODUCING CONTENT VALIDITY
- Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests
(AERA/APA/NCME, 1999)
- Content validity refers to the degree to which
the content of the items reflects the content
domain of interest (APA, 1954)
THE NEED FOR IMPROVED REPORTING
Content is a precursor to drawing a score-based
inference. It is evidence-in-waiting (Shepard,
1993; Yalow & Popham, 1983)
Unfortunately, in many technical manuals,
content representation is dealt with in a
paragraph, indicating that selected panels of
subject matter experts (SMEs) reviewed the test
content, or mapped the items to the content
standards (Crocker, 2003)
QUANTIFYING CONTENT VALIDITY
- Several indices for quantifying expert agreement
have been proposed
- The mean rating across raters is often used in
calculations
- However, the mean alone does not provide
information regarding its proximity to the
unknown population mean.
- We need a usable inferential procedure to gain
insight into the accuracy of the sample mean as
an estimate of the population mean.
THE CONFIDENCE INTERVAL
- A simple method is to calculate the
traditional Wald confidence interval
- However, this interval is inappropriate for
rating scales:
- There are too few raters and response categories to assume
that population normality has not been violated.
- There is no reason to believe the distribution should be
normal.
- The rating scale is bounded with categories that
are discrete.
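The boundedness problem can be seen in a small Python sketch. It assumes the Wald interval takes its usual textbook form, mean ± z·s/√n with s the sample standard deviation; the function name `wald_interval` and the ten ratings are hypothetical and are not taken from the slides.

```python
import math
import statistics


def wald_interval(ratings, z=1.96):
    """Wald interval for the mean: mean +/- z * s / sqrt(n).

    Assumes the textbook form using the sample standard deviation s.
    """
    n = len(ratings)
    mean = statistics.mean(ratings)
    s = statistics.stdev(ratings)  # sample standard deviation (n - 1 denominator)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical ratings on a 1-to-4 scale: nine raters chose 4, one chose 3.
ratings = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]
lower, upper = wald_interval(ratings)
print(f"Wald interval: ({lower:.3f}, {upper:.3f})")
```

With these ratings the upper limit comes out above 4, the highest rating the scale allows, which is the kind of result a bounded scale cannot support.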
AN ALTERNATIVE IS THE SCORE CONFIDENCE INTERVAL FOR RATING SCALES
- Penfield (2003) demonstrated that the Score
method outperformed the Wald interval, especially
when
- The number of raters was small (e.g., 10)
- The number of categories was small (e.g., 5)
- Furthermore, this interval is asymmetric
- It is based on the actual distribution for the
mean rating of concern.
- Further, the limits cannot extend below or above
the actual limits of the categories.
STEPS TO CALCULATING THE SCORE CONFIDENCE INTERVAL
- 1. Obtain values for n, k, and z
- n = the number of raters
- k = the highest possible rating
- z = the standard normal variate associated with
the confidence level (e.g., ±1.96 at 95%
confidence)
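For confidence levels other than 95%, z can be looked up or computed. A minimal Python sketch, assuming a two-sided interval; the helper name `z_for_confidence` is hypothetical:

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided standard normal critical value for a confidence level in (0, 1)."""
    alpha = 1.0 - confidence
    # Put alpha / 2 in each tail, so look up the 1 - alpha / 2 quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```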
- 2. Calculate the mean item rating: the sum of
the ratings for an item divided by the number of
raters
- 3. Calculate p: $p = \frac{\sum X}{nk}$, where $\sum X$ is the sum of the ratings for the item
- Or, if the scale begins with 1, then
- p
- 4. Use p to calculate the upper and lower limits
for a confidence interval for the population
proportion (Wilson, 1927)
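Written out, the Wilson (1927) score limits for a proportion, with the n raters on a scale whose highest rating is k treated as nk trials, can be sketched as follows. The labels $p_L$ and $p_U$ for the lower and upper limits are assumed here; this form reproduces the quantities 65.842, 11.042, and 87.683 that appear in the shorthand example.

```latex
\[
p_L = \frac{2nkp + z^2 - z\sqrt{4nkp(1-p) + z^2}}{2\,(nk + z^2)}, \qquad
p_U = \frac{2nkp + z^2 + z\sqrt{4nkp(1-p) + z^2}}{2\,(nk + z^2)}
\]
```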
- 5. Calculate the upper and lower limits of the
Score confidence interval for the population mean
rating
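A form of these limits that matches the arithmetic in the shorthand example (3.100 minus or plus 1.96 times a square root over n = 10) is sketched below. Here $\bar{X}$ is the mean rating from step 2, and $p_L$ and $p_U$ are assumed labels for the lower and upper proportion limits from step 4.

```latex
\[
\bar{X}_L = \bar{X} - z\sqrt{\frac{k\,p_L(1-p_L)}{n}}, \qquad
\bar{X}_U = \bar{X} + z\sqrt{\frac{k\,p_U(1-p_U)}{n}}
\]
```

When p is computed as the sum of the ratings over nk, so that $\bar{X} = kp$, these limits work out to $k\,p_L$ and $k\,p_U$, which is one way to see that they cannot fall outside the range 0 to k.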
SHORTHAND EXAMPLE
- Item: 3 + 8
- The content of this item represents the ability
to add single-digit numbers.
- 1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree
- Suppose the expert review session includes 10
raters.
- The responses are 3, 3, 3, 3, 3, 3, 3, 3, 3, 4
SHORTHAND EXAMPLE
- n = 10
- k = 4
- z = 1.96
- The sum of the ratings = 31
- Mean rating = 31/10 = 3.10
- p = (sum of the ratings) / (nk), so
p = 31 / (10 × 4) = 0.775
SHORTHAND EXAMPLE (cont.)
- Lower limit for the proportion: $(65.842 - 11.042) / 87.683 = 0.625$
- Upper limit for the proportion: $(65.842 + 11.042) / 87.683 = 0.877$
SHORTHAND EXAMPLE (cont.)
- Lower limit for the mean: $3.100 - 1.96\sqrt{0.938/10} = 2.500$
- Upper limit for the mean: $3.100 + 1.96\sqrt{0.432/10} = 3.507$
We are 95% confident that the population mean
rating falls somewhere between 2.500 and 3.507.
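The five steps can be checked end to end with a short Python sketch. It assumes p is computed as the sum of the ratings over nk, as in the example, and uses the standard Wilson form for the proportion limits; the function name `score_interval` is hypothetical.

```python
import math


def score_interval(ratings, k, z=1.96):
    """Score confidence interval for a mean rating (sketch of steps 1 to 5).

    ratings: one rating per rater; k: highest possible rating;
    z: standard normal variate for the chosen confidence level.
    """
    n = len(ratings)
    total = sum(ratings)
    mean = total / n              # step 2: mean item rating
    nk = n * k
    p = total / nk                # step 3: proportion of the maximum possible sum

    # Step 4: Wilson limits for the proportion, with nk treated as the number of trials.
    center = 2 * nk * p + z ** 2
    spread = z * math.sqrt(4 * nk * p * (1 - p) + z ** 2)
    denom = 2 * (nk + z ** 2)
    p_lower = (center - spread) / denom
    p_upper = (center + spread) / denom

    # Step 5: limits for the population mean rating.
    lower = mean - z * math.sqrt(k * p_lower * (1 - p_lower) / n)
    upper = mean + z * math.sqrt(k * p_upper * (1 - p_upper) / n)
    return lower, upper


ratings = [3, 3, 3, 3, 3, 3, 3, 3, 3, 4]  # the ten responses from the example
lower, upper = score_interval(ratings, k=4)
print(f"{lower:.3f} to {upper:.3f}")  # 2.500 to 3.507
```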
CONTENT VALIDATION
- Method 1: Retain only items with a Score interval
of a particular width, based on
- An a priori determination of appropriateness
- An empirical standard (25th and 75th percentiles
of all widths)
- Method 2: Retain items based on a hypothesis
test that the lower limit is above a particular
value
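Both retention rules reduce to simple comparisons once each item has a Score interval. The Python sketch below is illustrative only: the function names, the cutoffs (a maximum width of 1.0 and a minimum value of 2.4), and two of the three intervals are hypothetical, and Method 1 is assumed to mean "no wider than the chosen width".

```python
def retain_by_width(intervals, max_width):
    """Method 1 (assumed direction): keep items whose interval is no wider than max_width."""
    return [item for item, (lo, hi) in intervals.items() if hi - lo <= max_width]


def retain_by_lower_limit(intervals, min_value):
    """Method 2: keep items whose lower limit is above min_value."""
    return [item for item, (lo, hi) in intervals.items() if lo > min_value]


intervals = {
    "item_a": (2.500, 3.507),  # interval from the shorthand example
    "item_b": (3.100, 3.900),  # hypothetical
    "item_c": (1.800, 3.200),  # hypothetical
}

print(retain_by_width(intervals, max_width=1.0))        # hypothetical cutoff
print(retain_by_lower_limit(intervals, min_value=2.4))  # hypothetical cutoff
```

Under these made-up cutoffs the two methods disagree about item_a: its interval is slightly wider than 1.0, but its lower limit clears 2.4.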
CONCLUSIONS
- The Score method provides a confidence interval that
is not dependent on the normality assumption
- It outperforms the Wald interval when the number of
raters and scale categories is small
- It provides a decision-making method for the fate of
items in expert review sessions.
- Computational complexity can be eased through
simple programming in Excel, SPSS, and SAS
FOR FURTHER READING
- Penfield, R. D. (2003). A score method for
constructing asymmetric confidence intervals for
the mean of a rating scale item. Psychological
Methods, 8, 149-163.
- Penfield, R. D., & Miller, J. M. (in press).
Improving content validation studies using an
asymmetric confidence interval for the mean of
expert ratings. Applied Measurement in Education.