Title: Working Paper: Assessment, Validation, and Benchmarking with Student Evaluation of Instruction Instruments
1. Working Paper: Assessment, Validation, and Benchmarking with Student Evaluation of Instruction Instruments
William B. Clark, University of Akron Wayne College
2. Assessment is Systematic Observation
- The inferences that can be drawn from any observations are ultimately determined by the relationships amongst the observations.
3. Asystematic Observation
- Asystematic observation may result from:
- Inappropriately constructed assessment tools
- Asystematic observational methods
- Political, human, or institutional obstacles
4. Inappropriately Constructed Assessment Tools
- When committees and other personnel are charged with the construction of assessment instruments, a number of errors frequently occur:
- Disregard for variance and overemphasis of the mean.
- Arbitrary imposition of a symmetric response scale, without consideration of the distribution of the event measured.
- The bias to measure only excellence.
- Restriction of the response scale.
- Restriction of the domain of measurement.
- Lack of adequate, meaningful pilot work.
5. Asystematic Observational Methods
- Mainly through errors of omission and lack of forethought, committees and other personnel may make a number of strategic errors that later compromise otherwise meaningful systematic observation:
- Use of technology as a goal rather than a means.
- Failure to master, or else reliably outsource, the technology used.
- Production of a series of disjointed reports.
- Failure to store data in a manner that facilitates analysis.
- No use of a primary or external key.
6. Political or Human Obstacles to Assessment
- Local
- Faculty
- Committees
- Institutional
- Self-Appointed Gatekeepers
- Administrators
- Cultural
7. Local Obstacles to Assessment
- Faculty
- Even the best faculty are often resistant to the notion of being measured.
- There is not necessarily any motivation for faculty to create the most critical instrument possible.
- Committees
- The leadership of insufficiently knowledgeable individuals is not useful, at best.
- The inclusion of individuals on committees solely in order to increase buy-in or representativeness amongst faculty ultimately improves neither.
8. Institutional Obstacles to Assessment
- Self-Appointed Gatekeepers
- Secretaries or technicians may presume the authority to prevent access to data.
- Administrators
- Constant discontinuation and adoption of instruments and methods.
- Failure to persist in assessment efforts for periods sufficient to allow meaningful aggregation and comparison of multiple, concurrent measures.
9. Cultural Obstacles to Assessment
- A historical anecdote.
- An example: Comparative vs. Normative Data on the Student Instructional Report II (SIR II)
- When the SIR II survey is administered, the institution and/or individual instructors control all aspects of the administration. For example, when and how the survey is administered within the classroom is determined by the institution or faculty member. Because of this local control, and because the sample of institutions does not proportionately represent all types of institutions, the national data represent comparative rather than normative data.
- Because students typically use the favorable end of the scale in making evaluations, comparative data provide a context within which instructors and others can interpret individual reports. However, institutions may wish to supplement the SIR II comparative data with their own comparative data developed over time.
10. The Current Instrument
- The current instrument has its flaws.
- These problems serve as the basis of much of the guidance provided in this presentation.
- Despite these problems, the instrument is empirically useful.
11. Obverse Side
- 14 items
- Optical Mark Recognition (OMR) scanned
- Pre-slugged and serialized with textual and binary labels in the upper left quadrant
12. Reverse Side
- Three items, the third with four prompts
- Imaging and empirical analysis under development.
13. A priori Content Domain of Items 1–14
During construction, Item 13 and Item 14 were placed last because they were seen as assessing neither specific observable behavior nor the course itself.
14. Considerations in Initial Order of Items
- Similar items should be presented together, as a set:
- Enables a stem-and-leaf layout
- Reduces the amount of ink and space used
- Reduces the need of the respondent to shift back and forth between topics and mind-sets
- Speed and perceived ease of task are greater considerations than novelty within the layout.
15. Obvious Main Effects Associated with Student Evaluation of Instruction Scores
- Questions that address specific behavior of instructors yield higher means than either questions that address student thought and understanding or global questions assessing overall instructor performance.
- In aggregate, full-time faculty members earn higher mean scores than part-time faculty members.
16. Mean Response to Items 1–14
The mean scores of Item 13 (student thinking and interest) and Item 14 (the global item) are lower.
17. Mean Response to Items 1–14, by Full-Time versus Part-Time Faculty Status
In aggregate, full-time faculty means are consistently higher than those of part-time faculty. The mean Item 13 and Item 14 scores of part-time faculty are lower than would be predicted based on this main effect.
18. Reliability and Scale Properties
- Individual item reliability over time, by instructor
- Scale reliability over time, by instructor
- Internal consistency
- Skewness and kurtosis
- Interval properties of the response scale
- Homogeneity of variance
- Stability of scores over time across the entire population
19. Individual Item Reliability
- Across one term, reliabilities of individual items range from a high of .66 to a low of .40, and from .59 to .52 across one year.
- Item 14, the one global measure, demonstrates the highest reliability: .66 across one term and .59 across one year.
- From two to five years, the scale reliability ranges from .35 to .26, moderate to low, and scores on individual items begin to be unreliable after a period of three years.
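Reliability over time of this kind is typically estimated as a test-retest correlation: the same instructors' item means from one administration are correlated with their means from a later one. A minimal numpy sketch; the instructor means below are made-up illustrative values, not the study's data:

```python
import numpy as np

# Hypothetical mean Item 14 scores for the same eight instructors,
# one term apart (illustrative values only).
term_1 = np.array([4.6, 4.2, 4.8, 3.9, 4.5, 4.1, 4.7, 4.3])
term_2 = np.array([4.5, 4.3, 4.7, 4.0, 4.4, 4.3, 4.6, 4.2])

# Test-retest reliability: the Pearson correlation between the two terms.
r = np.corrcoef(term_1, term_2)[0, 1]
print(round(r, 2))
```

Instructors who rank similarly in both terms produce a high coefficient; as the interval between administrations grows, this coefficient falls, which is the decline the following slides quantify.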
20. Scale Reliability
The best estimates of reliability are obtained from an omnibus analysis of all faculty, which shows that reliability declines over time with regularity. This decline can be described by the equation
r = (-.048 × years) + .524
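The decline equation can be evaluated directly; a minimal sketch using the slide's reported coefficients:

```python
def predicted_reliability(years: float) -> float:
    """Predicted scale reliability from the fitted decline equation
    r = -.048 * years + .524 (coefficients as reported on the slide)."""
    return -0.048 * years + 0.524

# Predicted reliability at the start of the fit window versus five years out.
print(round(predicted_reliability(0), 3))
print(round(predicted_reliability(5), 3))
```

The intercept (.524) and slope (-.048) imply the scale loses roughly five hundredths of a reliability point per year, consistent with the two-to-five-year range of .35 to .26 reported on the previous slide.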
21. Internal Consistency
- Item scores demonstrate a high degree of interrelationship.
- Intercorrelations amongst the items range from a low of .54 (between Item 7 and Item 11) to a high of .77 (between Item 13 and Item 14).
- There is a remarkable degree of internal consistency (α = .96), to some degree a product of restricted variance within the response scale.
- Internal consistency is reduced if any one of the 14 items is removed from the scale.
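An internal-consistency figure of this kind is typically coefficient (Cronbach's) alpha. For reference, alpha can be computed from a respondents-by-items score matrix as follows; the small matrix below is a made-up example, not the instrument's data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a (respondents x items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five hypothetical respondents rating three items on a 5-point scale.
ratings = np.array([
    [5, 5, 4],
    [4, 4, 4],
    [5, 4, 5],
    [3, 3, 3],
    [4, 5, 4],
])
print(round(cronbach_alpha(ratings), 3))
```

Because alpha rises when items covary strongly, the restricted variance noted on the slide (most responses clustered at 4 and 5) inflates the coefficient somewhat, which is why the .96 should be read with that caveat.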
22. Skewness and Kurtosis
- As is typical with such instruments, the response distribution is highly negatively skewed.
- On the 5-point scale, most responses are either 4s or 5s.
- This effect is somewhat less pronounced for Item 13 and Item 14.
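Skewness of a response distribution like this can be checked directly as the third standardized moment; a minimal numpy sketch with an illustrative, ceiling-heavy set of ratings (not the instrument's data):

```python
import numpy as np

def sample_skewness(x: np.ndarray) -> float:
    """Third standardized moment; negative values indicate a long left tail."""
    devs = x - x.mean()
    return (devs ** 3).mean() / (devs ** 2).mean() ** 1.5

# Ceiling-heavy 5-point ratings: mostly 4s and 5s, a few low scores.
ratings = np.array([5] * 10 + [4] * 6 + [3] * 2 + [2, 1])
print(sample_skewness(ratings))
```

The handful of low ratings pulls the left tail out while the bulk of the mass sits at the top of the scale, producing the negative coefficient the slide describes.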
23. Response Scale
- When responses are known to be highly negatively skewed, it is imprudent to expend the response scale on the distinction between Fair and Good.
- Good response scales needn't be symmetric; they need only be incremented in equal intervals.
- Instead, focus the range of the scale over the range in which most of the variance is expected.
- Avoid measurement ceilings by selecting a top increment that should be difficult to attain.
24. Heterogeneity of Variance
- Likely, the single greatest threat to the validity of the scale arises from the negative skewness of the responses.
- Variance decreases with increasing scores in a population where most scores are high.
- The scale is relatively insensitive to excellence.
25. Scores Have Increased over the Baseline of Measurement
- From fall 1999 to fall 2004, the mean scale score was range-bound between 4.44 and 4.50.
- From spring 2005 forward, the mean scale score has ranged between 4.53 and 4.61.
26. Convergent Validity
- Convergence with concurrent measurements made with standardized instruments: compare changes observed in the baseline of mean student evaluation of instruction scores with changes seen in the baseline of the Instructional Effectiveness Scale of the Noel-Levitz Student Satisfaction Inventory.
- Association with total terms of teaching experience (internal and external), derived from ad hoc data, part of an internal audit of part-time faculty salaries.
27. Noel-Levitz Student Satisfaction Inventory Instructional Effectiveness, Baseline Comparison
The observed change in the baseline of evaluation scores coincides with that seen in the Instructional Effectiveness Scale scores.
28. Terms of Service (Teaching Experience) of Part-time Faculty by Evaluation Scores
- No linear relationship exists between terms of service and evaluation scores.
- The best non-linear model could not account for more than 1% of the variance.
29. Terms of Service of Part-time Faculty by Evaluation Scores: Latent Classes
- More experienced instructors with low scores are outliers.
- Exploratory cluster analysis (maximum log-likelihood model) was employed in order to identify latent classes.
- Three distinct groups emerged.
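The analysis here used a maximum log-likelihood model. As a rough illustration of the same idea for readers without that software, even a plain k-means pass over (terms of service, mean score) pairs will separate well-separated groups; this is a stand-in technique, not the slide's method, and the data below are made up:

```python
import numpy as np

def kmeans(points: np.ndarray, centroids: np.ndarray, iters: int = 20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as its cluster's mean. Returns (labels, centroids)."""
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

# Made-up (terms of service, mean evaluation score) pairs forming
# three visibly separated groups.
data = np.array([[2.0, 4.5], [3.0, 4.6], [2.0, 4.4],   # low experience, high scores
                 [20.0, 4.6], [22.0, 4.7], [21.0, 4.5],  # high experience, high scores
                 [18.0, 3.2], [19.0, 3.1]])              # high experience, low scores
init = data[[0, 3, 6]].copy()  # deterministic initial centroids, one per group
labels, _ = kmeans(data, init)
print(labels)
```

With groups this well separated, the three recovered clusters mirror the slide's three latent classes, including the small high-experience, low-score group flagged as outliers.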
30. Confidence Intervals of the Three Instructor Classes
31. Descriptive Statistics of Clusters: Low- and High-Experience Faculty
32. Discriminant Validity
- The student evaluation of instruction instrument measures little of what it was not intended to measure, including:
- Instructor Characteristics
- Instructor Activity
- Student Demographics
- Section Characteristics
33. Instructor Characteristics
There is no practically significant relationship between mean student evaluation of instruction scores of class sections and external demographic characteristics of the instructor; there is no measurable effect of instructor gender.
34. Relationship between Age and Scale Scores, with Confidence Intervals
35. Confidence Intervals of Mean Scale Scores for Caucasian and African American Instructors
Although the magnitude of association is nominal, the direction of the association actually favors the minority group.
36. Instructor Activity
There is no practically significant relationship between mean student evaluation of instruction scores and either instructor load or total number of students taught; there is no discernible relationship between number of sections taught and scores.
37. Relationships between both Billing Hours and Students Taught, and Scale Scores
38. Student Demographics
There is a trivial relationship between student evaluation of instruction scores and mean student age; mean academic level, mean student cumulative credit hours, and cumulative GPA bore smaller relationships. Across all classes, gender ratio was also trivial, but may be noteworthy under certain circumstances.
39. Relationship between Mean Student Age and Scale Scores
- Having older students confers, at most, a trifling advantage in scale scores.
- Mean student age accounts for about 2% of the variance in student evaluation of instruction scores.
40. Relationships between both Academic Level and Cumulative Credit Hours, and Scale Scores
41. Relationship between Mean Student Cumulative GPA and Scale Scores
- Over the observed range of academic achievement, the effect of cumulative GPA is trivial.
- However, a cubic solution suggests that a section with a hypothetically high mean GPA might be somewhat more inclined to award higher scores, and an inverse effect might exist for a section composed of students with failing academic records.
42. Relationship between Gender Ratio (Portion Female) and Scale Scores
- Gender ratio (quantified as portion female) accounts for about 2% of the variance in scale scores.
- A quadratic model outperforms a linear model, and is more parsimonious.
- 0 = all-male enrollment; 1 = all-female enrollment.
43. Relationship between Gender Ratio and Scale Scores, with Linear Transformation
- Portion female can be transformed into sex skewness.
- After transformation, this performs as well as the quadratic model.
- This transformation renders further analyses more easily interpreted.
sex skewness = ((portion female) − .5)²
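If the transformation is taken to be the squared deviation of portion female from a balanced .5 — squaring is exactly what would make a quadratic relationship in portion female linear in the transformed variable, matching the note that it performs as well as the quadratic model — it can be computed as follows (a sketch under that reading):

```python
def sex_skewness(portion_female: float) -> float:
    """Squared deviation of a section's portion female from a balanced .5;
    assumes the slide's transformation is ((portion female) - .5) squared."""
    return (portion_female - 0.5) ** 2

print(sex_skewness(0.5))  # a perfectly balanced section scores 0
print(sex_skewness(0.0), sex_skewness(1.0))  # all-male and all-female are symmetric
```

The transformed variable measures only how lopsided a section's enrollment is, not which sex predominates, which is what makes downstream linear analyses easier to interpret.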
44. The Effect of Gender Ratio is Dependent on Class Size
Because sex is a discrete variable, the effect should be suppressed for the smallest sections, but not for the mid-sized group.
For classes of 40 in size, r = .36 (p < .001); gender ratio accounts for about 13% of the variance.
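The variance figure follows directly from the correlation: the proportion of variance explained is the square of r. For example:

```python
# Variance explained by a correlation is r squared.
r = 0.36
variance_explained = r ** 2
print(round(variance_explained, 2))
```

Squaring .36 gives roughly .13, i.e. the "about 13% of the variance" quoted above; the same conversion underlies the 1–5% figures elsewhere in this deck.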
45. Section Characteristics
- The relationships between mean section student evaluation of instruction scores and both mean student grade awarded by section and section head count are small, accounting for less than 5% and less than 3% of the variance, respectively. However, instructors of very small sections (6 or fewer students) enjoy an advantage.
- The relationships between mean scores and both response rate and number of withdrawals in the section are trivial.
- There is no demonstrable relationship between mean scores and class format.
- There is a significant albeit modest effect for class meeting time, favoring weekend courses.
46. Relationship between Class Size and Scale Scores
- Although a quadratic solution accounts for a scantly greater portion of the variance in mean evaluation scores than does a logarithmic solution (.032 versus .029), the former model is inexplicable, while the latter is consistent with what is seen in similar instruments.
- Classes of six or fewer students typically receive higher scores, close to the ceiling of measurement.
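A quadratic-versus-logarithmic comparison like this reduces to fitting ordinary least squares to class size and to its logarithm, then comparing R². A sketch with made-up section data; the R² values it prints are illustrative, not the slide's .032 and .029:

```python
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Proportion of variance in y explained by the fitted values."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Made-up (class size, mean evaluation score) pairs.
size = np.array([3, 6, 10, 15, 20, 25, 30, 40], dtype=float)
score = np.array([4.8, 4.7, 4.6, 4.55, 4.5, 4.5, 4.45, 4.4])

quad = np.polyfit(size, score, 2)           # quadratic in class size
log_fit = np.polyfit(np.log(size), score, 1)  # linear in log(class size)

r2_quad = r_squared(score, np.polyval(quad, size))
r2_log = r_squared(score, np.polyval(log_fit, np.log(size)))
print(round(r2_quad, 3), round(r2_log, 3))
```

When two models fit nearly equally well, as here, preferring the one with a substantive rationale (diminishing returns of size, hence logarithmic) over the merely better-fitting one is the choice the slide describes.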
47. Relationship between Response Rate and Scale Scores
- The effect of the portion of the head-count enrollment responding is trivial, accounting for about 1% of the variance.
- Some of this is due to the effect of small sections: about 0.6% of the variance is explained when sections of 6 or fewer are removed.
48. Relationship between Student Grades and Scale Scores
Less than 5% of the variance in mean student evaluation of instruction scores can be explained by mean grade awarded by section.
49. Relationship between Number of Student Withdrawals and Scale Scores
- Number of withdrawals by section accounts for less variance than student grades, slightly more than 1%.
- This effect may mirror that of student grades, but may be suppressed due to the discrete nature of withdrawals.
50. Relationship between Class Meeting Time and Scale Scores
- Excluding TBA courses, there is a modest effect for class meeting time: F(3, 3569) = 4.721, p = .003.
- The only significant differences amongst groups were found between each of the morning and day classes and the weekend classes (.006 ≤ p ≤ .008).
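An F statistic of this kind comes from a one-way ANOVA across the meeting-time groups: the ratio of between-group to within-group mean squares. A minimal sketch computed by hand with made-up groups (not the slide's F(3, 3569)):

```python
import numpy as np

def one_way_f(groups: list) -> float:
    """One-way ANOVA F statistic: MS_between / MS_within."""
    all_scores = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_scores.mean()
    k = len(groups)          # number of groups
    n = len(all_scores)      # total observations
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up mean section scores for two meeting-time groups.
day = [4.0, 4.2, 4.4]
weekend = [4.6, 4.8, 5.0]
print(round(one_way_f([day, weekend]), 1))
```

A significant omnibus F, as on the slide, licenses the pairwise follow-up comparisons (morning and day versus weekend) whose p values are quoted above.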
51. Relationship between Class Format and Scale Scores
- There is no association between course format (lecture, lab, or online) and mean evaluation scores.
- The other format components occurred too infrequently for analysis.
52. Benchmarks and Special Benchmarks
- Table listing maximum scores attained by decile, for each:
- Overall
- Full-time vs. part-time faculty teaching all courses
- Department
- Overall
- Full-time vs. part-time
- Course
- Overall
- Full-time vs. part-time
- For each break where at least five sections are represented
- Table listing mean scores for sections of six or fewer students enrolled
53. Benchmarks
A. Scores Attained by Percentile, by Faculty Status, Department, and Course. Displayed are the maximum scores earned by each percentile of sections within each department or course named in each row. For instance, if, for full-time faculty teaching a given course, the 70th percentile is listed as 4.70, then the maximum or best score earned in 70% of these sections is 4.70. Thus each score represents a benchmark: the highest score attained across portions of faculty teaching each department and course listed.
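A percentile-benchmark table of this kind reduces to computing percentiles of the section-score distribution within each break; a minimal numpy sketch with made-up section scores:

```python
import numpy as np

# Made-up mean evaluation scores for ten sections of one course.
section_scores = np.array([4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9])

# Benchmark by decile: e.g. the 70th-percentile value marks the score
# that the best of the lowest-scoring 70% of sections attained.
deciles = {p: round(float(np.percentile(section_scores, p)), 2)
           for p in range(10, 100, 10)}
print(deciles[70])
```

Repeating this within each break (overall, by faculty status, by department, by course) and keeping only breaks with at least five sections yields the table the slide describes.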
54. Special Benchmarks for Small Class Sections of Six or Fewer Students
B. Higher Predicted Means for Smaller Sections. Based on Table 14, the following means can be used to benchmark scores of smaller sections.
55. Thank You
- For further information, contact:
- William Clark (bclark3_at_uakron.edu)
- Paulette Popovich (popovic_at_uakron.edu)