1
Working Paper: Assessment, Validation, and
Benchmarking with Student Evaluation of
Instruction Instruments
William B. Clark, University of Akron Wayne College
2
Assessment is systematic observation.
  • Inferences that can be drawn from any
    observations are ultimately determined by the
    relationships amongst the observations.

3
Asystematic Observation
  • Asystematic observation may result from
  • Inappropriately constructed assessment tools
  • Asystematic observational methods
  • Political, human, or institutional obstacles

4
Inappropriately Constructed Assessment Tools
  • When committees and other personnel are charged
    with the construction of assessment instruments,
    a number of errors frequently occur:
  • Disregard for variance and overemphasis of the
    mean.
  • Arbitrary imposition of a symmetric response
    scale, without consideration of distribution of
    the event measured.
  • The bias to measure only excellence
  • Restriction of response scale
  • Restriction of the domain of measurement.
  • Lack of adequate, meaningful pilot work.

5
Asystematic Observational Methods
  • Mainly through errors of omission and lack of
    forethought, committees and other personnel may
    make a number of strategic errors that later
    compromise otherwise meaningful systematic
    observation:
  • Use of technology as a goal rather than a means.
  • Failure to master, or else reliably outsource,
    the technology used.
  • Production of a series of disjointed reports.
  • Failure to store data in a manner that
    facilitates analysis.
  • No use of a primary or external key.

6
Political or Human Obstacles to Assessment
  • Local
  • Faculty
  • Committees
  • Institutional
  • Self-Appointed Gatekeepers
  • Administrators
  • Cultural

7
Local Obstacles to Assessment
  • Faculty
  • Even the best faculty are often resistant to the
    notion of being measured.
  • There is not necessarily any motivation for
    faculty to create the most critical instrument
    possible.
  • Committees
  • At best, the leadership of insufficiently
    knowledgeable individuals is not useful.
  • The inclusion of individuals on committees solely
    in order to increase "buy-in" or
    "representativeness" amongst faculty ultimately
    improves neither.

8
Institutional Obstacles to Assessment
  • Self-Appointed Gatekeepers
  • Secretaries or technicians may presume authority
    to prevent access to data.
  • Administrators
  • Constant discontinuation and adoption of
    instruments and methods.
  • Failure to persist in assessment efforts for
    sufficient periods in order to allow meaningful
    aggregation and comparison of multiple,
    concurrent measures.

9
Cultural Obstacles to Assessment
  • A historical anecdote.
  • An example: Comparative vs. Normative Data on the
    Student Instructional Report II (SIR II)
  • When the SIR II survey is administered, the
    institution and/or individual instructors control
    all aspects of the administration. For example,
    when and how the survey is administered within
    the classroom is determined by the institution or
    faculty member. Because of this local control,
    and because the sample of institutions does not
    proportionately represent all types of
    institutions, the national data represents
    comparative rather than normative data.
  • Because students typically use the favorable end
    of the scale in making evaluations, comparative
    data provide a context within which instructors
    and others can interpret individual reports.
    However, institutions may wish to supplement the
    SIR II comparative data with their own
    comparative data developed over time.

10
The Current Instrument
  • The current instrument has its flaws.
  • These problems serve as the basis of much of the
    guidance provided in this presentation.
  • Despite these problems, this instrument is
    empirically useful.

11
Obverse Side
  • 14 Items
  • Optical Mark Recognition (OMR) Scanned
  • Pre-slugged and serialized with textual and
    binary labels in the upper left quadrant

12
Reverse Side
  • Three items, the third with four prompts
  • Imaging and empirical analysis under development.

13
A priori Content Domain of Items 1 - 14
During construction, Item 13 and Item 14 were
placed last because they were seen as not
assessing either specific observable behavior or
the course itself.
14
Considerations in Initial Order of Items
  • Similar items should be presented together, as a
    set
  • enables stem and leaf layout
  • reduces amount of ink and space used
  • reduces need of respondent to shift back and
    forth between topics and mind-sets
  • Speed and perceived ease of task are greater
    considerations than novelty within the layout.

15
Obvious Main Effects Associated with Student
Evaluation of Instruction Scores
  • Questions that address specific behavior of
    instructors yield higher means than either
    questions that address student thought and
    understanding or global questions assessing
    overall instructor performance
  • In aggregate, full-time faculty members earn
    higher mean scores than part-time faculty members.

16
Mean Response to Items 1 - 14
The mean scores of Item 13 (student thinking
and interest) and Item 14 (the global item)
are lower.
17
Mean Response to Items 1 - 14 by Full-Time versus
Part-Time Faculty Status
In aggregate, full-time faculty means are
consistently higher than those of part-time
faculty. The mean Item 13 and Item 14 scores of
part-time faculty are lower than would be
predicted based on this main effect.
18
Reliability and Scale Properties
  • Individual item reliability over time, by
    instructor
  • Scale reliability over time, by instructor
  • Internal consistency
  • Skewness and kurtosis
  • Interval properties of the response scale
  • Homogeneity of variance
  • Stability of scores over time across the entire
    population

19
Individual Item Reliability
  • Across one term, reliabilities of individual
    items range from a high of .66 to a low of .40,
    and from .59 to .52 across one year.
  • Item 14, the one global measure, demonstrates the
    highest reliability, .66 across one term and .59
    across one year.
  • From two to five years, the scale reliability
    ranges from .35 to .26, moderate to low, and
    scores on individual items begin to be unreliable
    after a period of three years.

20
Scale Reliability
The best estimates of reliability are obtained
from an omnibus analysis of all faculty, which
shows that reliability declines over time with
regularity. This decline can be described by the
equation
r = -.048 × years + .524
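As a minimal illustration (not part of the original analysis), the fitted line above can be evaluated directly; the helper name below is hypothetical.

    # Minimal sketch, assuming only the fitted decline reported above;
    # the function name is hypothetical, not from the original study.
    def predicted_reliability(years: float) -> float:
        """Predicted scale reliability after a given retest interval."""
        return -0.048 * years + 0.524

    # Example: expected reliability one through five years out.
    for y in range(1, 6):
        print(y, round(predicted_reliability(y), 3))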
21
Internal Consistency
  • Item scores demonstrate a high degree of
    interrelationship.
  • Intercorrelations amongst the items range from a
    low of .54 (between item 7 and item 11) to a high
    of .77 (between item 13 and item 14).
  • There is a remarkable degree of internal
    consistency (α = .96), to some degree a product of
    restricted variance within the response scale
    (a computation sketch follows this list).
  • Internal consistency is reduced if any one of the
    14 items is removed from the scale.
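A minimal sketch of how such an internal-consistency check might be computed; Cronbach's alpha is assumed as the coefficient, and the array shapes and names are illustrative rather than the original analysis.

    # Minimal sketch: Cronbach's alpha for a respondents x items matrix,
    # plus alpha recomputed with each item deleted (illustrative only).
    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """scores: rows = respondents, columns = the 14 items."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)        # per-item variances
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of total score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def alpha_if_deleted(scores: np.ndarray) -> list[float]:
        """Alpha after dropping each item in turn."""
        return [cronbach_alpha(np.delete(scores, j, axis=1))
                for j in range(scores.shape[1])]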

22
Skewness and Kurtosis
  • As is typical with such instruments, the response
    distribution is highly negatively skewed.
  • On the 5-point scale, most responses are either
    4s or 5s.
  • This effect is somewhat less pronounced for Item
    13 and Item 14.
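A minimal sketch of quantifying that skew with scipy.stats; the response vector is illustrative, not actual survey data.

    # Minimal sketch: measuring skewness and kurtosis of 5-point responses.
    import numpy as np
    from scipy.stats import skew, kurtosis

    responses = np.array([5, 5, 4, 5, 4, 3, 5, 5, 4, 5])  # illustrative only
    print(skew(responses))      # negative: the long tail is at the low end
    print(kurtosis(responses))  # excess kurtosis relative to normal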

23
Response Scale
  • When responses are known to be highly negatively
    skewed, it is imprudent to expend the response
    scale on the distinction between "Fair" and "Good."
  • Good response scales needn't be symmetric; they
    need only be incremented in equal intervals.
  • Instead, focus the range of the scale over the
    range in which most of the variance is expected.
  • Avoid measurement ceilings by selecting a top
    increment that is difficult to attain.

24
Heterogeneity of Variance
  • Likely, the single greatest threat to the
    validity of the scale arises from the negative
    skewness of the responses.
  • Variance decreases with increasing scores in a
    population where most scores are high.
  • The scale is relatively insensitive to
    excellence.

25
Scores have increased over the baseline of
measurement
  • From fall 1999 to fall 2004, the mean scale score
    has been range-bound between 4.44 and 4.50.
  • From spring 2005 forward, the mean scale score
    has ranged between 4.53 and 4.61.

26
Convergent Validity
  • Convergence with concurrent measurements made
    with standardized instruments. Compare changes
    observed in the baseline of mean student
    evaluation of instruction scores with changes
    seen in the baseline of the Instructional
    Effectiveness Scale of the Noel-Levitz Student
    Satisfaction Inventory.
  • Association with total terms of teaching
    experience (internal and external), derived from
    ad hoc data, part of an internal audit of
    part-time faculty salaries.

27
Noel-Levitz Student Satisfaction Inventory
Instructional Effectiveness, baseline comparison
The observed change in the baseline of evaluation
scores coincides with that seen in the
Instructional Effectiveness scale scores.
28
Terms of Service (Teaching Experience) of
Part-time Faculty by Evaluation Scores
  • No linear relationship exists between terms of
    service and evaluation scores.
  • The best non-linear model could not account for
    more than 1% of the variance.

29
Terms of Service of Part-time Faculty by
Evaluation Scores: Latent Classes
  • More experienced instructors with low scores are
    outliers.
  • Exploratory Cluster Analysis (max log-likelihood
    model) was employed in order to identify latent
    classes.
  • Three distinct groups emerged.
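A minimal sketch of this kind of maximum-likelihood clustering; a Gaussian mixture fit by EM is assumed here because the original tool is not named, and the instructor-level data are illustrative.

    # Minimal sketch: three-class mixture over (terms of service, mean score).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Illustrative rows: [terms of service, mean evaluation score]
    X = np.array([[2, 4.6], [3, 4.5], [20, 4.7],
                  [25, 4.6], [18, 3.9], [22, 3.8]])

    gm = GaussianMixture(n_components=3, random_state=0).fit(X)
    print(gm.predict(X))   # latent-class assignment for each instructor
    print(gm.means_)       # centroid of each class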

30
Confidence Intervals of the Three Instructor
Classes
31
Descriptive Statistics of Clusters: Low and High
Experience Faculty
32
Discriminant Validity
  • The student evaluation of instruction instrument
    measures little of what it was not intended to
    measure, including:
  • Instructor Characteristics
  • Instructor Activity
  • Student Demographics
  • Section Characteristics

33
Instructor Characteristics
There is no practically significant relationship
between mean student evaluation of instruction
scores of class sections and external demographic
characteristics of the instructor; there is no
measurable effect of instructor gender.
34
Relationship between Age and Scale Scores, with
Confidence Intervals
35
Confidence Intervals of Mean Scale Scores for
Caucasian and African American Instructors
Although the magnitude of association is nominal,
the direction of the association actually favors
the minority group.
36
Instructor Activity
There is no practically significant relationship
between mean student evaluation of instruction
scores and either instructor load or total number
of students taught; there is no discernible
relationship between number of sections taught
and scores.
37
Relationships between both Billing Hours and
Students Taught, and Scale Scores.
38
Student Demographics
There is a trivial relationship between student
evaluation of instruction scores and mean student
age; similarly, mean academic level, mean student
cumulative credit hours, and cumulative GPA bore
smaller relationships. Across all classes, gender
ratio was also trivial, but may be noteworthy
under certain circumstances.
39
Relationship between Mean Student Age and Scale
Scores
  • Having older students confers, at most, a
    trifling advantage in scale scores.
  • Mean student age accounts for about 2% of the
    variance in student evaluation of instruction
    scores.

40
Relationships between both Academic Level and
Cumulative Credit Hours, and Scale Scores
41
Relationship between Mean Student Cumulative GPA
and Scale Scores
  • Over the observed range of academic achievement,
    the effect of cumulative GPA is trivial.
  • However, a cubic solution suggests that a section
    with a hypothetically high mean GPA might be
    somewhat more inclined to award higher scores,
    and an inverse effect might exist for a section
    composed of students with failing academic
    records.

42
Relationship between Gender Ratio (portion
female) and Scale Scores.
  • Gender ratio (quantified as portion female)
    accounts for about 2% of the variance in scale
    scores.
  • A quadratic model outperforms a linear model, and
    is more parsimonious.
  • 0 = all-male enrollment
  • 1 = all-female enrollment

43
Relationship between Gender Ratio and Scale
Scores, with linear transformation.
  • Portion female can be transformed into sex
    skewness
  • After transformation, this performs as well as
    the quadratic model
  • This transformation renders further analyses more
    easily interpreted.

sex skewness = ((portion female) - .5)²
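A minimal sketch of the transformation as reconstructed above; the squared form is an assumption inferred from the statement that it linearizes the quadratic effect.

    # Minimal sketch: squared deviation of portion female from a balanced .5.
    # The squared form is assumed from the surrounding text, not confirmed.
    def sex_skewness(portion_female: float) -> float:
        """0 = perfectly balanced section; .25 = entirely single-sex."""
        return (portion_female - 0.5) ** 2

    print(sex_skewness(0.5))   # 0.0   balanced enrollment
    print(sex_skewness(1.0))   # 0.25  all-female enrollment
    print(sex_skewness(0.0))   # 0.25  all-male enrollment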
44
The Effect of Gender Ratio is Dependent on Class
Size
Because sex is a discrete variable, the effect
should be suppressed for the smallest sections,
but not the mid-sized group.
For classes of 40 in size, r = .36 (p < .001);
gender ratio accounts for about 13% of the
variance.
45
Section Characteristics
  • The relationships between mean section student
    evaluation of instruction scores and both mean
    student grade awarded by section and section head
    count are small, accounting for less than 5% and
    less than 3% of the variance, respectively.
    However, instructors of very small sections (6 or
    fewer students) enjoy an advantage.
  • The relationships between mean scores and both
    response rate and number of withdrawals in the
    section are trivial.
  • There is no demonstrable relationship between
    mean scores and class format.
  • There is a significant, albeit modest, effect for
    class meeting time, favoring weekend courses.

46
Relationship between Class Size and Scale Scores
  • Although a quadratic solution accounts for a
    slightly greater portion of variance in mean
    evaluation scores than does a logarithmic
    solution (.032 versus .029), the former model is
    inexplicable, while the latter is consistent with
    what is seen in similar instruments (a
    model-comparison sketch follows this list).
  • Classes of six or fewer students typically
    receive higher scores, close to the ceiling of
    measurement.
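The model comparison referenced above might look like the following minimal sketch; the class-size data are illustrative and numpy's polynomial fitting is assumed rather than the original software.

    # Minimal sketch: compare R² of quadratic vs. logarithmic fits of
    # mean evaluation score on class size (illustrative data only).
    import numpy as np

    size  = np.array([3, 6, 12, 18, 25, 32, 40], dtype=float)
    score = np.array([4.8, 4.7, 4.6, 4.55, 4.5, 4.5, 4.45])

    def r_squared(y, y_hat):
        return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    quad    = np.poly1d(np.polyfit(size, score, 2))            # quadratic model
    log_lin = np.poly1d(np.polyfit(np.log(size), score, 1))    # logarithmic model

    print(r_squared(score, quad(size)))
    print(r_squared(score, log_lin(np.log(size))))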

47
Relationship between Response Rate and Scale
Scores
  • The effect of the portion of the head count
    enrollment responding is trivial, accounting for
    about 1% of the variance.
  • Some of this is due to the effect of small
    sections; about 0.6% of the variance is explained
    when sections of six or fewer are removed.

48
Relationship between Student Grades and Scale
Scores
Less than 5% of the variance in mean student
evaluation of instruction scores can be explained
by mean grade awarded by section.
49
Relationship between Number of Student
withdrawals and Scale Scores
  • Number of withdrawals by section accounts for
    less variance than student grades, slightly more
    than 1%.
  • This effect may mirror that of student grades,
    but may be suppressed due to the discrete nature
    of withdrawals.

50
Relationship between Class Meeting Time and Scale
Scores
  • Excluding TBA courses, there is a modest effect
    for class meeting time, F(3, 3569) = 4.721
    (p = .003); an illustrative test sketch follows
    this list.
  • The only significant difference amongst groups
    was found between each of morning and day classes
    and weekend classes (.006 ≤ p ≤ .008).
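The meeting-time test might be reproduced as in this minimal sketch, using scipy's one-way ANOVA; the score lists are illustrative, not the original section data.

    # Minimal sketch: one-way ANOVA of section mean scores by meeting time.
    from scipy.stats import f_oneway

    morning = [4.5, 4.6, 4.4, 4.5]
    day     = [4.5, 4.4, 4.6, 4.5]
    evening = [4.6, 4.5, 4.7, 4.6]
    weekend = [4.8, 4.7, 4.9, 4.8]

    f_stat, p_value = f_oneway(morning, day, evening, weekend)
    print(f_stat, p_value)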

51
Relationship between Class Format and Scale Scores
  • There is no association between course format
    (lecture, lab, or online) and mean evaluation
    scores.
  • The other components were too few in occurrence
    for analysis.

52
Benchmarks and Special Benchmarks
  • Table listing maximum scores attained by decile,
    for each
  • Overall
  • Full-time v Part-time faculty teaching all
    courses
  • Department
  • Overall
  • Full-time v Part-time
  • Course
  • Overall
  • Full-time v Part-time
  • For each break where at least five sections are
    represented
  • Table listing mean scores for sections of six or
    fewer students enrolled

53
Benchmarks
A. Scores Attained by Percentile, by Faculty
Status, Department, and Course. Displayed are
maximum scores earned by each percentile of
sections within each department or course named
in each row. For instance, if for full-time
faculty teaching a given course, the 70th
percentile is listed as 4.70, then the maximum or
best score earned in 70% of these sections is
4.70. Thus each score represents a benchmark: the
highest score attained across portions of faculty
teaching each department and course listed.
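A minimal sketch of assembling such a percentile benchmark table with pandas; the column names and rows are hypothetical, not the institution's records.

    # Minimal sketch: decile benchmarks of section mean scores by
    # department and faculty status (illustrative data only).
    import pandas as pd

    sections = pd.DataFrame({
        "department": ["ENG", "ENG", "ENG", "MATH", "MATH", "MATH"],
        "status":     ["FT", "PT", "FT", "FT", "PT", "PT"],
        "mean_score": [4.70, 4.55, 4.62, 4.48, 4.35, 4.60],
    })

    deciles = [i / 10 for i in range(1, 10)]
    benchmarks = (sections
                  .groupby(["department", "status"])["mean_score"]
                  .quantile(deciles)
                  .unstack())
    print(benchmarks)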
54
Special Benchmarks for Small Class Sections of
Six or Fewer Students
B. Higher Predicted Means for Smaller
Sections. Based on Table 14, the following means
can be used to benchmark the scores of smaller
sections.
55
Thank you.
  • For further information, contact:
  • William Clark, bclark3@uakron.edu
  • Paulette Popovich, popovic@uakron.edu