Title: Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias
1 Rating Performance Assessments of Students With and Without Disabilities: A Generalizability Study of Teacher Bias
Jose-Felipe Martinez-Fernandez, Ann M. Mastergeorge
UCLA Graduate School of Education & Information Studies
Center for the Study of Evaluation
National Center for Research on Evaluation, Standards, and Student Testing
American Educational Research Association, New Orleans, April 1-5, 2001
2 Introduction
- Performance assessments are increasingly popular methods for evaluating academic performance.
- A number of studies have shown that well-trained raters can be reliable scorers of performance assessments for the general population of students.
- This study addressed whether trained raters show any bias when scoring performance assessments of students with disabilities.
3 Purpose
- Compare the sources of score variability for students with and without disabilities in Language Arts and Mathematics performance assessments.
- Determine whether important differences in variance components exist across student groups and, if so, whether rater (teacher) bias plays a role.
- Complement the results with raters' perceptions of bias (their own and others').
4 Method
- Student and rater samples come from a larger district-wide validation study involving thousands of performance assessments.
- Teachers from each grade and content area were trained as raters.
- A total of 6 studies (each with different raters and students) were performed for 3rd, 7th, and 9th grade assessments in Language Arts and Mathematics.
5 Method (continued)
- For each study, 60 assessments (30 from regular education students and 30 from students who received some kind of accommodation) were rated by 4 raters on two occasions.
- Raters were aware of each student's disability status only on the 2nd rating occasion. Bias is defined as systematic differences in the scores across occasions.
- No practice or memory effects were expected.
- The score scale ranges from 1 to 4.
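The bias definition above can be illustrated with a minimal Python sketch. It assumes ratings are stored in a hypothetical array indexed as (student, rater, occasion), where the first occasion is rated blind and the second with knowledge of disability status. The function name, the data layout, and the demo values are illustrative assumptions, not part of the study.

```python
import numpy as np

def occasion_shift(scores):
    """Mean change in score from the 1st to the 2nd rating occasion.

    scores: array of shape (n_students, n_raters, 2); hypothetical layout.
    A value near zero suggests no systematic shift across occasions.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim != 3 or scores.shape[2] != 2:
        raise ValueError("expected shape (n_students, n_raters, 2)")
    # Difference per (student, rater) pair, then averaged over all pairs.
    return float((scores[:, :, 1] - scores[:, :, 0]).mean())

# Synthetic example only (not study data): 3 students, 2 raters, 2 occasions.
demo = np.array([
    [[2, 2], [3, 3]],
    [[1, 2], [2, 2]],
    [[4, 4], [3, 4]],
])
shift = occasion_shift(demo)
```

Computing the shift separately for the regular education and accommodated groups would indicate whether any change across occasions is specific to one group.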
6 Method (continued)
- Two kinds of generalizability designs were used. First, a nested-within-disability design with all 60 students: P(D) x R x O.
- Second, separate fully crossed P x R x O designs for each disability group of 30 students.
- Math assessments consisted of two tasks. Both a random P x R x O x T design and a fixed P x R x O design averaging over tasks were used.
- A survey asked about raters' perceptions of bias in rating students with disabilities (their own and other raters').
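As a rough illustration of what the fully crossed P x R x O analysis involves, the sketch below estimates variance components with the standard ANOVA (expected mean squares) estimators for a random design with one score per cell. The slides do not say which software or estimator was used, so this is an assumption; the function name, dictionary keys, and simulated scores are hypothetical.

```python
import numpy as np

def variance_components_pxrxo(x):
    """ANOVA estimates of variance components for a crossed random P x R x O design.

    x: array of shape (n_persons, n_raters, n_occasions), one score per cell.
    Negative estimates are returned as-is; they are often set to zero in practice.
    """
    x = np.asarray(x, dtype=float)
    n_p, n_r, n_o = x.shape
    grand = x.mean()

    # Marginal means for main effects and two-way interactions.
    m_p, m_r, m_o = x.mean(axis=(1, 2)), x.mean(axis=(0, 2)), x.mean(axis=(0, 1))
    m_pr, m_po, m_ro = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    # Mean squares: sums of squares divided by their degrees of freedom.
    ms_p = n_r * n_o * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_r = n_p * n_o * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_o = n_p * n_r * np.sum((m_o - grand) ** 2) / (n_o - 1)
    ms_pr = n_o * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_po = n_r * np.sum((m_po - m_p[:, None] - m_o[None, :] + grand) ** 2) / ((n_p - 1) * (n_o - 1))
    ms_ro = n_p * np.sum((m_ro - m_r[:, None] - m_o[None, :] + grand) ** 2) / ((n_r - 1) * (n_o - 1))
    resid = (x - m_pr[:, :, None] - m_po[:, None, :] - m_ro[None, :, :]
             + m_p[:, None, None] + m_r[None, :, None] + m_o[None, None, :] - grand)
    ms_pro = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1) * (n_o - 1))

    # Solve the expected-mean-square equations for each component.
    return {
        "p": (ms_p - ms_pr - ms_po + ms_pro) / (n_r * n_o),
        "r": (ms_r - ms_pr - ms_ro + ms_pro) / (n_p * n_o),
        "o": (ms_o - ms_po - ms_ro + ms_pro) / (n_p * n_r),
        "pr": (ms_pr - ms_pro) / n_o,
        "po": (ms_po - ms_pro) / n_r,
        "ro": (ms_ro - ms_pro) / n_p,
        "pro_e": ms_pro,  # triple interaction confounded with residual error
    }

# Simulated scores only (not study data): 30 students, 4 raters, 2 occasions.
rng = np.random.default_rng(0)
person_effect = rng.normal(0.0, 0.8, size=(30, 1, 1))
noise = rng.normal(0.0, 0.5, size=(30, 4, 2))
components = variance_components_pxrxo(2.5 + person_effect + noise)
```

Running the function once per group of 30 students mirrors the separate crossed designs; the nested P(D) x R x O design and the four-facet design with tasks would need additional terms not shown here.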
7 Score Distributions
8 Generalizability Results: Nested Design
Language Arts. Score = Rater x Occasion x Person(Disability)
9 Generalizability Results (continued): Nested Design
Mathematics. Score = Task x Rater x Occasion x Person(Disability)
10 Generalizability Results (continued): Crossed Design by Disability
Language Arts. Score = Rater x Occasion x Person
11 Generalizability Results (continued): Crossed Design by Disability
Mathematics. Score = Task x Rater x Occasion x Person
12 Generalizability Results (continued): Crossed Design by Disability
Mathematics with Task facet fixed. Score = Person x Rater x Occasion, averaging over the two tasks
13 Rater Survey: Rater Perceptions (p < .01; N = 40)
14 Rater Survey (continued): Mean Score of Raters on Self and Others Regarding Fairness and Bias in Scoring
15 Discussion
- Variance Components
  - The Person (P) component is always the largest (50% to 70% of variance across designs). However, a good amount of measurement error remains (triple interaction, ignored facets).
  - Some differences exist between the regular education and disability groups in terms of variance components.
16 Discussion (continued)
- Differences between groups
  - The total amount of variance is always smaller in the disability groups (more skewed distribution).
  - Variance due to persons (P), and therefore the dependability coefficients, are lower for the disability group in Language Arts. This is also true in Mathematics with a fixed (averaged) task facet, but not with two random tasks.
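The link between person variance and the coefficients can be made concrete with a short sketch. It uses the textbook formulas for a crossed random P x R x O design, in which the generalizability coefficient divides person variance by person variance plus relative error, and the dependability coefficient (Phi) uses absolute error instead. The dictionary keys and the numeric values are hypothetical and are not the study's estimates.

```python
def dependability(vc, n_r, n_o):
    """Return (generalizability coefficient, dependability coefficient Phi).

    vc: dict with keys "p", "r", "o", "pr", "po", "ro", "pro_e" (hypothetical naming).
    n_r, n_o: numbers of raters and occasions assumed in the decision study.
    """
    # Relative error: interactions that involve persons.
    rel_error = vc["pr"] / n_r + vc["po"] / n_o + vc["pro_e"] / (n_r * n_o)
    # Absolute error: relative error plus rater, occasion, and rater-by-occasion effects.
    abs_error = rel_error + vc["r"] / n_r + vc["o"] / n_o + vc["ro"] / (n_r * n_o)
    g_coef = vc["p"] / (vc["p"] + rel_error)
    phi = vc["p"] / (vc["p"] + abs_error)
    return g_coef, phi

# Illustrative values only (not the study's estimates).
example_vc = {"p": 0.50, "r": 0.00, "o": 0.00,
              "pr": 0.10, "po": 0.00, "ro": 0.00, "pro_e": 0.20}
g_coef, phi = dependability(example_vc, n_r=4, n_o=2)
```

Holding the error components constant, a smaller person component lowers both coefficients, which is the mechanism behind the group difference described on this slide.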
17 Discussion (continued)
- Rater Bias
  - No Rater (R) main effects: no leniency differences across raters.
  - No rating occasion (O) effect: overall, no bias is introduced by rater knowledge of disability status.
  - No rater interactions with tasks or occasions.
18 Discussion (continued)
- However, there is a non-negligible Person by Rater (PxR) interaction, which is considerably larger for students with disabilities.
- This does not necessarily constitute bias, but it can still compromise the validity of scores for accommodated students.
- Are features in papers from students with disabilities differentially salient to different raters?
19 Discussion (continued)
- There is a large Person by Task (PxT) interaction in Math, but it is considerably smaller for students with disabilities.
- Students with disabilities may not be as aware of the different nature of the tasks, which would be needed for this somewhat natural interaction (Miller & Linn, 2000, and others) to show.
- Accommodations may not be having the intended leveling effects.
- With a random task facet, the lower PxT interaction increases reliability for students with disabilities.
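The role of the PxT component under the two treatments of the task facet can be sketched with the usual generalizability-theory expressions. The notation is an assumption, not taken from the slides: \sigma^2_p and \sigma^2_{pt} are the person and person-by-task components, n_t is the number of tasks, and \sigma^2_{\text{rest}} and \sigma^2_{\text{rest}'} stand for the remaining error terms, which are not identical in the two cases.

```latex
% Task facet treated as random: the person-by-task component counts as error.
\Phi_{\text{random}} =
  \frac{\sigma^2_p}
       {\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t} + \sigma^2_{\text{rest}}}

% Task facet treated as fixed (scores averaged over the n_t tasks):
% the person-by-task component is absorbed into universe-score variance.
\Phi_{\text{fixed}} =
  \frac{\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t}}
       {\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t} + \sigma^2_{\text{rest}'}}
```

Under these expressions, a smaller PxT component reduces error when tasks are random but reduces universe-score variance when tasks are fixed, which is consistent with the group differences reported for the two Mathematics designs.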
20 Discussion (continued)
- From Rater Survey
  - Teachers believe that there is a certain bias and unfairness from raters when scoring performance assessments from students with disabilities.
  - Raters see themselves as more fair and unbiased than the general population of raters.
  - Whether this is due to training or to initially high self-perceptions is not clear. A not uncommon "I'm great, but others aren't as much" kind of effect could be the sole reason.
21 Future Directions and Questions
- Are there different patterns for different kinds of disabilities/accommodations?
- Are accommodations being used appropriately and having the intended effects?
- Do patterns hold for raters at the local school sites, who in general receive less training?
- Does rater background influence the size and nature of these effects and interactions?
- How does the testing occasion facet influence variance components and other interactions?