Title: Working Paper: Assessment, Validation, and Benchmarking with Student Evaluation of Instruction Instruments
1. Working Paper: Assessment, Validation, and Benchmarking with Student Evaluation of Instruction Instruments
William B. Clark, University of Akron Wayne College
2. Assessment is Systematic Observation
- The inferences that can be drawn from any observations are ultimately determined by the relationships amongst the observations.
3. Asystematic Observation
- Asystematic observation may result from:
- Inappropriately constructed assessment tools
- Asystematic observational methods
- Political, human, or institutional obstacles
4. Inappropriately Constructed Assessment Tools
- When committees and other personnel are charged with the construction of assessment instruments, a number of errors frequently occur:
- Disregard for variance and overemphasis of the mean.
- Arbitrary imposition of a symmetric response scale, without consideration of the distribution of the event measured.
- The bias to measure only excellence.
- Restriction of the response scale.
- Restriction of the domain of measurement.
- Lack of adequate, meaningful pilot work.
5. Asystematic Observational Methods
- Mainly through errors of omission and lack of forethought, committees and other personnel may make a number of strategic errors that later compromise otherwise meaningful systematic observation:
- Use of technology as a goal rather than a means.
- Failure to master, or else reliably outsource, the technology used.
- Production of a series of disjointed reports.
- Failure to store data in a manner that facilitates analysis.
- No use of a primary or external key.
6. Political or Human Obstacles to Assessment
- Local
- Faculty
- Committees
- Institutional
- Self-Appointed Gatekeepers
- Administrators
- Cultural
7. Local Obstacles to Assessment
- Faculty
- Even the best faculty are often resistant to the notion of being measured.
- There is not necessarily any motivation for faculty to create the most critical instrument possible.
- Committees
- The leadership of insufficiently knowledgeable individuals is not useful, at best.
- The inclusion of individuals on committees solely in order to increase buy-in or representativeness amongst faculty ultimately improves neither.
8. Institutional Obstacles to Assessment
- Self-Appointed Gatekeepers
- Secretaries or technicians may presume the authority to prevent access to data.
- Administrators
- Constant discontinuation and adoption of instruments and methods.
- Failure to persist in assessment efforts for periods sufficient to allow meaningful aggregation and comparison of multiple, concurrent measures.
9. Cultural Obstacles to Assessment
- A historical anecdote.
- An example: Comparative vs. Normative Data on the Student Instructional Report II (SIR II)
- When the SIR II survey is administered, the institution and/or individual instructors control all aspects of the administration. For example, when and how the survey is administered within the classroom is determined by the institution or faculty member. Because of this local control, and because the sample of institutions does not proportionately represent all types of institutions, the national data represent comparative rather than normative data.
- Because students typically use the favorable end of the scale in making evaluations, comparative data provide a context within which instructors and others can interpret individual reports. However, institutions may wish to supplement the SIR II comparative data with their own comparative data developed over time.
10. The Current Instrument
- The current instrument has its flaws.
- These problems serve as the basis of much of the guidance provided in this presentation.
- Despite these problems, the instrument is empirically useful.
11. Obverse Side
- 14 items
- Optical Mark Recognition (OMR) scanned
- Pre-slugged and serialized with textual and binary labels in the upper left quadrant
12. Reverse Side
- Three items, the third with four prompts
- Imaging and empirical analysis under development.
13. A priori Content Domain of Items 1–14
During construction, Item 13 and Item 14 were placed last because they were seen as assessing neither specific observable behavior nor the course itself.
14. Considerations in Initial Order of Items
- Similar items should be presented together, as a set:
- Enables a stem-and-leaf layout
- Reduces the amount of ink and space used
- Reduces the need of the respondent to shift back and forth between topics and mind-sets
- Speed and perceived ease of task are greater considerations than novelty within the layout.
15. Obvious Main Effects Associated with Student Evaluation of Instruction Scores
- Questions that address specific behavior of instructors yield higher means than either questions that address student thought and understanding or global questions assessing overall instructor performance.
- In aggregate, full-time faculty members earn higher mean scores than part-time faculty members.
16. Mean Response to Items 1–14
The mean scores of Item 13 (student thinking and interest) and Item 14 (the global item) are lower.
17. Mean Response to Items 1–14, by Full-Time versus Part-Time Faculty Status
In aggregate, full-time faculty means are consistently higher than those of part-time faculty. The mean Item 13 and Item 14 scores of part-time faculty are lower than would be predicted based on this main effect.
18. Reliability and Scale Properties
- Individual item reliability over time, by instructor
- Scale reliability over time, by instructor
- Internal consistency
- Skewness and kurtosis
- Interval properties of the response scale
- Homogeneity of variance
- Stability of scores over time across the entire population
19. Individual Item Reliability
- Across one term, reliabilities of individual items range from a high of .66 to a low of .40, and from .59 to .52 across one year.
- Item 14, the one global measure, demonstrates the highest reliability: .66 across one term and .59 across one year.
- From two to five years, the scale reliability ranges from .35 to .26, moderate to low, and scores on individual items begin to be unreliable after a period of three years.
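Reliability over time of this kind is typically estimated as a test-retest correlation: the same instructors' item means from one administration are correlated with their means from a later one. A minimal numpy sketch; the instructor means below are made-up illustrative values, not the study's data:

```python
import numpy as np

# Hypothetical mean Item 14 scores for the same eight instructors,
# one term apart (illustrative values only).
term_1 = np.array([4.6, 4.2, 4.8, 3.9, 4.5, 4.1, 4.7, 4.3])
term_2 = np.array([4.5, 4.3, 4.7, 4.0, 4.4, 4.3, 4.6, 4.2])

# Test-retest reliability: the Pearson correlation between the two terms.
r = np.corrcoef(term_1, term_2)[0, 1]
print(round(r, 2))
```

Instructors who rank similarly in both terms produce a high coefficient; as the interval between administrations grows, this coefficient falls, which is the decline the following slides quantify.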
20. Scale Reliability
The best estimates of reliability are obtained from an omnibus analysis of all faculty, which shows that reliability declines over time with regularity. This decline can be described by the equation
r = (-.048 × years) + .524
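The decline equation can be evaluated directly; a minimal sketch using the slide's reported coefficients:

```python
def predicted_reliability(years: float) -> float:
    """Predicted scale reliability from the fitted decline equation
    r = -.048 * years + .524 (coefficients as reported on the slide)."""
    return -0.048 * years + 0.524

# Predicted reliability at the start of the fit window versus five years out.
print(round(predicted_reliability(0), 3))
print(round(predicted_reliability(5), 3))
```

The intercept (.524) and slope (-.048) imply the scale loses roughly five hundredths of a reliability point per year, consistent with the two-to-five-year range of .35 to .26 reported on the previous slide.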
21. Internal Consistency
- Item scores demonstrate a high degree of interrelationship.
- Intercorrelations amongst the items range from a low of .54 (between Item 7 and Item 11) to a high of .77 (between Item 13 and Item 14).
- There is a remarkable degree of internal consistency (α = .96), to some degree a product of restricted variance within the response scale.
- Internal consistency is reduced if any one of the 14 items is removed from the scale.
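An internal-consistency figure of this kind is typically coefficient (Cronbach's) alpha. For reference, alpha can be computed from a respondents-by-items score matrix as follows; the small matrix below is a made-up example, not the instrument's data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a (respondents x items) score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five hypothetical respondents rating three items on a 5-point scale.
ratings = np.array([
    [5, 5, 4],
    [4, 4, 4],
    [5, 4, 5],
    [3, 3, 3],
    [4, 5, 4],
])
print(round(cronbach_alpha(ratings), 3))
```

Because alpha rises when items covary strongly, the restricted variance noted on the slide (most responses clustered at 4 and 5) inflates the coefficient somewhat, which is why the .96 should be read with that caveat.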
22. Skewness and Kurtosis
- As is typical with such instruments, the response distribution is highly negatively skewed.
- On the 5-point scale, most responses are either 4s or 5s.
- This effect is somewhat less pronounced for Item 13 and Item 14.
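Skewness of a response distribution like this can be checked directly as the third standardized moment; a minimal numpy sketch with an illustrative, ceiling-heavy set of ratings (not the instrument's data):

```python
import numpy as np

def sample_skewness(x: np.ndarray) -> float:
    """Third standardized moment; negative values indicate a long left tail."""
    devs = x - x.mean()
    return (devs ** 3).mean() / (devs ** 2).mean() ** 1.5

# Ceiling-heavy 5-point ratings: mostly 4s and 5s, a few low scores.
ratings = np.array([5] * 10 + [4] * 6 + [3] * 2 + [2, 1])
print(sample_skewness(ratings))
```

The handful of low ratings pulls the left tail out while the bulk of the mass sits at the top of the scale, producing the negative coefficient the slide describes.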
23. Response Scale
- When responses are known to be highly negatively skewed, it is imprudent to expend the response scale on the distinction between Fair and Good.
- Good response scales needn't be symmetric; they need only be incremented in equal intervals.
- Instead, focus the range of the scale over the range in which most of the variance is expected.
- Avoid measurement ceilings by selecting a top increment that should be difficult to attain.
24. Heterogeneity of Variance
- Likely, the single greatest threat to the validity of the scale arises from the negative skewness of the responses.
- Variance decreases with increasing scores in a population where most scores are high.
- The scale is relatively insensitive to excellence.
25. Scores Have Increased over the Baseline of Measurement
- From fall 1999 to fall 2004, the mean scale score was range-bound between 4.44 and 4.50.
- From spring 2005 forward, the mean scale score has ranged between 4.53 and 4.61.
26. Convergent Validity
- Convergence with concurrent measurements made with standardized instruments: compare changes observed in the baseline of mean student evaluation of instruction scores with changes seen in the baseline of the Instructional Effectiveness Scale of the Noel-Levitz Student Satisfaction Inventory.
- Association with total terms of teaching experience (internal and external), derived from ad hoc data, part of an internal audit of part-time faculty salaries.
27. Noel-Levitz Student Satisfaction Inventory Instructional Effectiveness, Baseline Comparison
The observed change in the baseline of evaluation scores coincides with that seen in the Instructional Effectiveness Scale scores.
28. Terms of Service (Teaching Experience) of Part-time Faculty by Evaluation Scores
- No linear relationship exists between terms of service and evaluation scores.
- The best non-linear model could not account for more than 1% of the variance.
29. Terms of Service of Part-time Faculty by Evaluation Scores: Latent Classes
- More experienced instructors with low scores are outliers.
- Exploratory cluster analysis (maximum log-likelihood model) was employed in order to identify latent classes.
- Three distinct groups emerged.
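The analysis here used a maximum log-likelihood model. As a rough illustration of the same idea for readers without that software, even a plain k-means pass over (terms of service, mean score) pairs will separate well-separated groups; this is a stand-in technique, not the slide's method, and the data below are made up:

```python
import numpy as np

def kmeans(points: np.ndarray, centroids: np.ndarray, iters: int = 20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as its cluster's mean. Returns (labels, centroids)."""
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

# Made-up (terms of service, mean evaluation score) pairs forming
# three visibly separated groups.
data = np.array([[2.0, 4.5], [3.0, 4.6], [2.0, 4.4],   # low experience, high scores
                 [20.0, 4.6], [22.0, 4.7], [21.0, 4.5],  # high experience, high scores
                 [18.0, 3.2], [19.0, 3.1]])              # high experience, low scores
init = data[[0, 3, 6]].copy()  # deterministic initial centroids, one per group
labels, _ = kmeans(data, init)
print(labels)
```

With groups this well separated, the three recovered clusters mirror the slide's three latent classes, including the small high-experience, low-score group flagged as outliers.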
30. Confidence Intervals of the Three Instructor Classes
31. Descriptive Statistics of Clusters: Low- and High-Experience Faculty
32. Discriminant Validity
- The student evaluation of instruction instrument measures little of what it was not intended to measure, including:
- Instructor Characteristics
- Instructor Activity
- Student Demographics
- Section Characteristics
33. Instructor Characteristics
There is no practically significant relationship between mean student evaluation of instruction scores of class sections and external demographic characteristics of the instructor; there is no measurable effect of instructor gender.
34. Relationship between Age and Scale Scores, with Confidence Intervals
35. Confidence Intervals of Mean Scale Scores for Caucasian and African American Instructors
Although the magnitude of association is nominal, the direction of the association actually favors the minority group.
36. Instructor Activity
There is no practically significant relationship between mean student evaluation of instruction scores and either instructor load or total number of students taught; there is no discernible relationship between number of sections taught and scores.
37. Relationships between both Billing Hours and Students Taught, and Scale Scores
38. Student Demographics
There is a trivial relationship between student evaluation of instruction scores and mean student age; mean academic level, mean student cumulative credit hours, and cumulative GPA bore smaller relationships. Across all classes, gender ratio was also trivial, but may be noteworthy under certain circumstances.
39. Relationship between Mean Student Age and Scale Scores
- Having older students confers, at most, a trifling advantage in scale scores.
- Mean student age accounts for about 2% of the variance in student evaluation of instruction scores.
40. Relationships between both Academic Level and Cumulative Credit Hours, and Scale Scores
41. Relationship between Mean Student Cumulative GPA and Scale Scores
- Over the observed range of academic achievement, the effect of cumulative GPA is trivial.
- However, a cubic solution suggests that a section with a hypothetically high mean GPA might be somewhat more inclined to award higher scores, and an inverse effect might exist for a section composed of students with failing academic records.
42. Relationship between Gender Ratio (Portion Female) and Scale Scores
- Gender ratio (quantified as portion female) accounts for about 2% of the variance in scale scores.
- A quadratic model outperforms a linear model, and is more parsimonious.
- 0 = all-male enrollment; 1 = all-female enrollment.
43. Relationship between Gender Ratio and Scale Scores, with Linear Transformation
- Portion female can be transformed into sex skewness.
- After transformation, this performs as well as the quadratic model.
- This transformation renders further analyses more easily interpreted.
sex skewness = ((portion female) − .5)²
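If the transformation is taken to be the squared deviation of portion female from a balanced .5 — squaring is exactly what would make a quadratic relationship in portion female linear in the transformed variable, matching the note that it performs as well as the quadratic model — it can be computed as follows (a sketch under that reading):

```python
def sex_skewness(portion_female: float) -> float:
    """Squared deviation of a section's portion female from a balanced .5;
    assumes the slide's transformation is ((portion female) - .5) squared."""
    return (portion_female - 0.5) ** 2

print(sex_skewness(0.5))  # a perfectly balanced section scores 0
print(sex_skewness(0.0), sex_skewness(1.0))  # all-male and all-female are symmetric
```

The transformed variable measures only how lopsided a section's enrollment is, not which sex predominates, which is what makes downstream linear analyses easier to interpret.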
44. The Effect of Gender Ratio is Dependent on Class Size
Because sex is a discrete variable, the effect should be suppressed for the smallest sections, but not for the mid-sized group.
For classes of 40 in size, r = .36 (p < .001); gender ratio accounts for about 13% of the variance.
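The variance figure follows directly from the correlation: the proportion of variance explained is the square of r. For example:

```python
# Variance explained by a correlation is r squared.
r = 0.36
variance_explained = r ** 2
print(round(variance_explained, 2))
```

Squaring .36 gives roughly .13, i.e. the "about 13% of the variance" quoted above; the same conversion underlies the 1–5% figures elsewhere in this deck.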
45. Section Characteristics
- The relationships between mean section student evaluation of instruction scores and both mean student grade awarded by section and section head count are small, accounting for less than 5% and less than 3% of the variance, respectively. However, instructors of very small sections (6 or fewer students) enjoy an advantage.
- The relationships between mean scores and both response rate and number of withdrawals in the section are trivial.
- There is no demonstrable relationship between mean scores and class format.
- There is a significant albeit modest effect for class meeting time, favoring weekend courses.
46. Relationship between Class Size and Scale Scores
- Although a quadratic solution accounts for a scantly greater portion of the variance in mean evaluation scores than does a logarithmic solution (.032 versus .029), the former model is inexplicable, while the latter is consistent with what is seen in similar instruments.
- Classes of six or fewer students typically receive higher scores, close to the ceiling of measurement.
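A quadratic-versus-logarithmic comparison like this reduces to fitting ordinary least squares to class size and to its logarithm, then comparing R². A sketch with made-up section data; the R² values it prints are illustrative, not the slide's .032 and .029:

```python
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Proportion of variance in y explained by the fitted values."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Made-up (class size, mean evaluation score) pairs.
size = np.array([3, 6, 10, 15, 20, 25, 30, 40], dtype=float)
score = np.array([4.8, 4.7, 4.6, 4.55, 4.5, 4.5, 4.45, 4.4])

quad = np.polyfit(size, score, 2)           # quadratic in class size
log_fit = np.polyfit(np.log(size), score, 1)  # linear in log(class size)

r2_quad = r_squared(score, np.polyval(quad, size))
r2_log = r_squared(score, np.polyval(log_fit, np.log(size)))
print(round(r2_quad, 3), round(r2_log, 3))
```

When two models fit nearly equally well, as here, preferring the one with a substantive rationale (diminishing returns of size, hence logarithmic) over the merely better-fitting one is the choice the slide describes.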
47. Relationship between Response Rate and Scale Scores
- The effect of the portion of the head-count enrollment responding is trivial, accounting for about 1% of the variance.
- Some of this is due to the effect of small sections: about 0.6% of the variance is explained when sections of 6 or fewer are removed.
48. Relationship between Student Grades and Scale Scores
Less than 5% of the variance in mean student evaluation of instruction scores can be explained by mean grade awarded by section.
49. Relationship between Number of Student Withdrawals and Scale Scores
- Number of withdrawals by section accounts for less variance than student grades, slightly more than 1%.
- This effect may mirror that of student grades, but may be suppressed due to the discrete nature of withdrawals.
50. Relationship between Class Meeting Time and Scale Scores
- Excluding TBA courses, there is a modest effect for class meeting time: F(3, 3569) = 4.721, p = .003.
- The only significant differences amongst groups were found between each of the morning and day classes and the weekend classes (.006 ≤ p ≤ .008).
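An F statistic of this kind comes from a one-way ANOVA across the meeting-time groups: the ratio of between-group to within-group mean squares. A minimal sketch computed by hand with made-up groups (not the slide's F(3, 3569)):

```python
import numpy as np

def one_way_f(groups: list) -> float:
    """One-way ANOVA F statistic: MS_between / MS_within."""
    all_scores = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_scores.mean()
    k = len(groups)          # number of groups
    n = len(all_scores)      # total observations
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up mean section scores for two meeting-time groups.
day = [4.0, 4.2, 4.4]
weekend = [4.6, 4.8, 5.0]
print(round(one_way_f([day, weekend]), 1))
```

A significant omnibus F, as on the slide, licenses the pairwise follow-up comparisons (morning and day versus weekend) whose p values are quoted above.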
51. Relationship between Class Format and Scale Scores
- There is no association between course format (lecture, lab, or online) and mean evaluation scores.
- The other format components occurred too infrequently for analysis.
52. Benchmarks and Special Benchmarks
- Table listing maximum scores attained by decile, for each:
- Overall
- Full-time vs. part-time faculty teaching all courses
- Department
- Overall
- Full-time vs. part-time
- Course
- Overall
- Full-time vs. part-time
- For each break where at least five sections are represented
- Table listing mean scores for sections of six or fewer students enrolled
53. Benchmarks
A. Scores Attained by Percentile, by Faculty Status, Department, and Course. Displayed are the maximum scores earned by each percentile of sections within each department or course named in each row. For instance, if, for full-time faculty teaching a given course, the 70th percentile is listed as 4.70, then the maximum or best score earned in 70% of these sections is 4.70. Thus each score represents a benchmark: the highest score attained across portions of faculty teaching each department and course listed.
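A percentile-benchmark table of this kind reduces to computing percentiles of the section-score distribution within each break; a minimal numpy sketch with made-up section scores:

```python
import numpy as np

# Made-up mean evaluation scores for ten sections of one course.
section_scores = np.array([4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9])

# Benchmark by decile: e.g. the 70th-percentile value marks the score
# that the best of the lowest-scoring 70% of sections attained.
deciles = {p: round(float(np.percentile(section_scores, p)), 2)
           for p in range(10, 100, 10)}
print(deciles[70])
```

Repeating this within each break (overall, by faculty status, by department, by course) and keeping only breaks with at least five sections yields the table the slide describes.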
54. Special Benchmarks for Small Class Sections of Six or Fewer Students
B. Higher Predicted Means for Smaller Sections. Based on Table 14, the following means can be used to benchmark scores of smaller sections.
55. Thank You
- For further information, contact:
- William Clark (bclark3_at_uakron.edu)
- Paulette Popovich (popovic_at_uakron.edu)