How to Assess and Measure Competency

Transcript and Presenter's Notes

1
How to Assess andMeasure Competency
  • Robert C. Shaw, Jr., PhD
  • Program Director

2
Presentation Outline
  • Describe a program's responsibilities
  • Assess appropriate content
  • Measure abilities as precisely as possible
  • Reference each cut score to a criterion

3
The validity claim
  • Our program is confident we can make valid
    inferences from an assessment because
  • we carefully selected and structured the content
  • and
  • observed scores are reasonably precise
  • Weakness in either claim diminishes the validity
    argument

4
Define appropriate content
  • What should we assess?

5
Information sources for content
Certification Boards' Expectations
6
What should we assess?
  • A program should seek multiple opinions about
    program content
  • May mean more than one faculty member in the
    program
  • Could extend to survey results from several
    stakeholders
  • Those who hire your graduates
  • Those who graduated

7
Describe potential content
  • Define potential content by describing job
    behaviors or tasks
  • Interpret ABG results
  • Determine the appropriate time to refer a patient
    for consultation from another service
  • Adjust mechanical ventilation settings to
    optimize oxygenation for a patient while
    minimizing the risk of pulmonary injury

8
Define terminal behaviors
  • Focus terminal assessments on end-product
    behavior you expect students to master
  • Insert a pulmonary artery catheter in a patient
    within a critical care setting using standard
    technique while minimizing risks of infection and
    lung involvement
  • Integrate pulmonary function testing results with
    patient history and other laboratory results to
    produce a diagnosis

9
Measure task criticality
  • Typically expressed by the interaction of a
  • importance/significance/risk measure
  • and a
  • frequency/extent measure

10
Potential survey measurements
  • How important is the task to success?
  • OR
  • How significant is the task to safe and effective
    practice?
  • 4 = Extremely
  • 3 = Very
  • 2 = Moderately
  • 1 = Minimally

11
Potential survey measurements
  • If this task is incorrectly performed, how strong
    is the risk?
  • 3 = Potentially fatal
  • 2 = Likely to increase morbidity
  • 1 = Unlikely to have an adverse effect
  • OR
  • 3 = High
  • 2 = Moderate
  • 1 = Low

12
Potential survey measurements
  • How frequently do you perform the task?
  • 3 = Every week
  • 2 = A few times each year
  • 1 = Less than once a year
  • OR
  • 3 = Very often
  • 2 = Occasionally
  • 1 = Infrequently

13
Potential survey measurements
  • Have you performed the task in the last year?
  • 1 = Yes
  • 0 = No

14
What can we do with task measurements?
  • Norm-referenced approach
  • Rank order tasks from most to least critical
  • Start at the top and work down using available
    time
  • Criterion-referenced approach
  • Identify tasks that are sufficiently critical to
    ensure program coverage and competency assessment
    (see the sketch below)
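To make both approaches concrete, here is a minimal sketch; the example tasks, rating values, threshold, and the criticality-as-product convention are illustrative assumptions, not from the slides:

```python
# Hypothetical mean survey ratings per task (importance 1-4, frequency 1-3)
tasks = {
    "Interpret ABG results":          {"importance": 3.6, "frequency": 2.8},
    "Adjust ventilation settings":    {"importance": 3.9, "frequency": 2.5},
    "Refer patient for consultation": {"importance": 2.7, "frequency": 1.4},
}

# One common convention: criticality = importance x frequency
criticality = {name: r["importance"] * r["frequency"] for name, r in tasks.items()}

# Norm-referenced: rank from most to least critical, cover from the top down
ranked = sorted(criticality, key=criticality.get, reverse=True)

# Criterion-referenced: keep every task at or above a chosen criticality cut
THRESHOLD = 6.0  # illustrative value, set by faculty judgment
covered = [t for t in ranked if criticality[t] >= THRESHOLD]

print(ranked)   # all tasks, most critical first
print(covered)  # tasks the program commits to cover and assess
```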

15
Select item type(s) for each assessment
  • Constructed response (e.g., short answer, essay,
    performance)
  • Short development time
  • Long scoring time
  • Scores have strong subjective characteristics
  • Selected response (e.g., true/false, matching,
    multiple-choice)
  • Long development time
  • Short scoring time
  • Scores have strong objective characteristics

16
High-stakes terminal assessments should be
standardized
  • Specify how the assessment should look before
    writing/selecting items
  • Test specifications ensure each assessment is
    similar, fair, and covers critical content

17
Test specifications are typically two-dimensional
18
Entire test blueprint/matrix
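The original slide showed the blueprint as an image. A hypothetical two-dimensional blueprint (content domains by cognitive process level; all counts illustrative) might look like:

                            Recall   Application   Analysis   Total
  Airway management             4         6            2        12
  Mechanical ventilation        5         8            5        18
  Blood gas interpretation      3         6            7        16
  Total                        12        20           14        46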
19
Test specifications and items
  • Each item should be linked to a task and a
    cognitive process level
  • It helps to store items in a database
  • A sophisticated database will permit additional
    layers of classification
  • Acute/chronic care
  • Age groups

20
Item banking software
  • FastTest
  • www.assess.com/frmSoftCat.htm
  • ExamView
  • www.pearsonncs.com/examview/examview.htm
  • LXRTest
  • www.lxrtest.com/

21
Measure abilities precisely
  • Are we confident an assessment has yielded a
    sufficiently precise ability estimate?

22
Reliability
  • Theoretical premise
  • Observed scores are assumed to express true
    ability plus some measurement error
  • High reliability implies low measurement error
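In classical test theory notation (a standard formulation consistent with, though not spelled out on, the slide):

```latex
X = T + E, \qquad
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_X^2} \;=\; 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

so high reliability means the error variance is small relative to observed-score variance.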

23
Reliability
  • Reliability indices are R² values, which express
    the percentage of observed-score variance that
    can be attributed to true-score variance
  • How high is high enough?
  • A test-score reliability of at least .85 is
    characteristic of large-scale, standardized
    assessments; many exceed .90
  • Sufficiently reliable test scores from a test
    built by a program should show values of at least
    .60

24
Reliability
  • Reliability is an attribute of a set of test
    scores; it is not an attribute of a test
  • Therefore, a program should assess reliability
    for each group
  • KR-20 is appropriate for dichotomously scored
    (0, 1) items
  • Coefficient alpha works for polytomously scored
    (0, 1, …, n) items (a sketch of both follows)
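A minimal sketch of both coefficients, assuming a students-by-items score matrix (NumPy is an assumption; any matrix library works):

```python
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for an (n_students, n_items) score matrix.
    For dichotomously scored (0, 1) items this reduces to KR-20."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Example: five students, four dichotomous items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(coefficient_alpha(scores), 2))  # ≈ 0.79
```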

25
Why are selected response items used for so many
assessments?
  • Assuming the time to assess is constant, more
    responses can be elicited from students using
    selected response items
  • more items
  • broader content coverage
  • increased information
  • enhanced measurement precision
  • stronger validity
  • Scores are more strongly objective

26
Add items or options?
  • A program cannot go wrong by adding more items to
    an assessment
  • A program may only consume space and time by
    adding more options to multiple-choice items
  • There is growing evidence that items with three
    options are optimal, particularly when doing so
    permits inclusion of more items on an assessment
  • Dr. Thomas Haladyna, Arizona State University

27
Up to a point, measurement precision and item
quantity are directly related
[Figure: reliability (y-axis) plotted against item count (x-axis), with separate curves for higher-quality and lower-quality items; reliability rises with item count up to a plateau, and higher-quality items yield higher reliability at any given count.]
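The "up to a point" relationship in the figure is conventionally quantified by the Spearman–Brown prophecy formula (standard psychometrics, not named on the slide): lengthening a test by a factor k from reliability ρ gives

```latex
\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}
```

which rises with k but with diminishing returns, and rises from a higher starting point when the items are of higher quality.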
28
What encourages high item quality?
  • Write well
  • Clear, concise, accurate
  • Remove unnecessary information from the stimulus
  • Present nuanced choices that require a
    sophisticated mastery of material to correctly
    respond
  • Item review is another opportunity to seek
    multiple opinions

29
What encourages high item quality?
  • Avoid formats known to be flawed
  • D. All of the above
  • D. None of the above
  • Negative wording
  • All of the following are true EXCEPT
  • Which of the following is not true?

30
What encourages high item quality?
  • Apply quality improvement principles
  • Analyze item performance
  • Retain items that contribute to test score
    reliability
  • Change or discard items that fail to contribute
    or negatively affect reliability

31
Item analysis properties
  • Difficulty
  • p = proportion of students who responded
    correctly
  • Discrimination
  • r_pb = point-biserial correlation between item
    success and students' total test scores
    (both are sketched below)
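A minimal sketch of both statistics for a 0/1 score matrix (NumPy assumed; the rest-score correction noted in the comment is a common refinement, not from the slide):

```python
import numpy as np

def item_analysis(scores):
    """Difficulty p and point-biserial discrimination r_pb for each item."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    p = scores.mean(axis=0)  # proportion correct per item
    # Correlate each item with the total score; using (totals - item column)
    # instead would give the 'corrected' rest-score variant.
    r_pb = np.array([np.corrcoef(scores[:, j], totals)[0, 1]
                     for j in range(scores.shape[1])])
    return p, r_pb
```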

32
Item difficulty
[Figure: contribution to test-score reliability (y-axis) as a function of item difficulty p (x-axis, 0.0 to 1.0); the contribution peaks when p falls between roughly 0.4 and 0.6 and drops toward zero at the extremes.]
33
Item discrimination
  • Because r_pb values are correlations, values
    reflect one of three possibilities relative to
    reliability
  • Positive contribution
  • No contribution
  • Negative contribution

34
Using item parameters diagnostically
  • Relative to reliability contribution, item
  • p values provide magnitude information
  • r_pb values provide magnitude and direction (+ or
    −) information

35
Using item parameters diagnostically
  • Difficulty and discrimination properties equally
    contribute to reliability
  • The best items show .30 < p < .70 AND r_pb > .20
    (a flagging sketch follows)
  • The worst items exist at the difficulty extremes
    and show zero or negative discrimination
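A one-function sketch of that diagnostic rule, with the thresholds taken straight from the slide (NumPy assumed):

```python
import numpy as np

def flag_items(p, r_pb):
    """Return indices of items outside .30 < p < .70 or with r_pb <= .20."""
    keep = (np.asarray(p) > 0.30) & (np.asarray(p) < 0.70) & (np.asarray(r_pb) > 0.20)
    return np.where(~keep)[0]  # indices of items to review or discard
```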

36
After diagnosing an item that shows a weak or
negative reliability contribution
  • What should we do?
  • Observe option response frequencies and the mean
    total scores of students choosing each option
    (a sketch follows this list)
  • Identify incorrect responses that attracted
    students with test scores equal to or greater
    than the average
  • Replace the offending option with a less
    attractive response
  • Rewrite the stem to clarify ambiguities
  • OR
  • Discard the whole item and use a better one the
    next time
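A minimal sketch of that option-level diagnosis (the function name and data layout are illustrative assumptions):

```python
import numpy as np

def distractor_report(choices, key, totals):
    """For one item: how often each option was chosen, the mean total test
    score of the students choosing it, and whether it is the keyed answer.
    choices: selected options per student (e.g., 'A'..'D'); key: correct
    option; totals: each student's total test score."""
    choices, totals = np.asarray(choices), np.asarray(totals)
    report = {}
    for opt in np.unique(choices):
        mask = choices == opt
        report[opt] = {"n": int(mask.sum()),
                       "mean_total": float(totals[mask].mean()),
                       "keyed": opt == key}
    return report

# An incorrect option whose mean_total is at or above the overall mean is
# the 'offending option' the slide says to replace.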

37
Item analysis software
  • Iteman
  • www.assess.com/Software/iteman.htm
  • examSystem II
  • www.pearsonncs.com/examsystem/index.htm
  • LXRTest
  • www.lxrtest.com/
  • True Score II
  • www.nine-patch.com/TSCDL.htm
  • Excel Templates (free)
  • www.eflclub.com/elvin/publications/2003/itemanalysis.html

38
Internal resources may be available
  • There is a good probability a large university
    with education, psychology, and/or statistics
    departments will have a system available for
    scoring items and providing analyses of test
    scores and items

39
Reference each cut score to a criterion
  • Should we define and assess minimal competence
    for our program?

40
Cut points
  • Highly reliable test scores reveal differences
    between students' abilities and can help
    accurately rank order students, which may be
    important to employers
  • However, the program is likely interested in
    assessing whether each student is sufficiently
    competent to safely and effectively practice
  • Such assessment concerns typically surface as
    students are about to graduate

41
Measuring minimal competence
  • A program should decide whether it wants to
    create one large assessment with a single
    compensatory cut point
  • OR
  • give each content domain its own cut score, a
    conjunctive model (both models are sketched below)
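A minimal sketch of the two decision models (domain names and cut values are illustrative):

```python
def passes_compensatory(domain_scores, total_cut):
    """Single overall cut: strength in one domain can offset weakness in another."""
    return sum(domain_scores.values()) >= total_cut

def passes_conjunctive(domain_scores, domain_cuts):
    """Each content domain must clear its own cut."""
    return all(domain_scores[d] >= cut for d, cut in domain_cuts.items())

scores = {"airway": 78, "ventilation": 62, "diagnostics": 85}
print(passes_compensatory(scores, total_cut=210))   # True: 225 >= 210
print(passes_conjunctive(scores, {"airway": 70,
                                  "ventilation": 70,
                                  "diagnostics": 70}))  # False: 62 < 70
```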

42
Why do so many competency assessments use a
compensatory cut?
  • If a program selects the more rigorous
    conjunctive model, then each component test will
    produce its own set of scores, each with its own
    reliability
  • Each component must have a sufficient number of
    items or data points to be confident each student
    group's test scores will show adequate
    reliability
  • Modules of fewer than 80-100 program-made items
    are unlikely to produce adequate reliability

43
Seek multiple opinions . . . again
  • Program faculty should define the skills
    competent practitioners possess
  • This is a group activity
  • Each cut point should be linked to a definition
    of minimally competent practitioners

44
Performance assessments
  • Pick your spots
  • Ensure a sufficient quantity of information is
    collected
  • Standardize administration
  • Measure agreement between/among evaluators (one
    sketch follows)
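For the agreement measurement, here is a minimal sketch of Cohen's kappa for two raters; kappa is not named on the slide, but it is one standard chance-corrected choice:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters' categorical ratings."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    p_expected = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Example: two evaluators scoring ten performance checklist items pass/fail
print(round(cohens_kappa(["P","P","F","P","F","P","P","F","P","P"],
                         ["P","F","F","P","F","P","P","P","P","P"]), 2))  # ≈ 0.52
```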

45
Summary
  • Collective opinions are closer to the truth than
    any one opinion about
  • appropriate assessment content,
  • item quality, and
  • justifiable cut scores
  • Unreliable scales have no utility

46
Thank you for the opportunity to share some
details about measurement
  • Questions?