Chapter 3. Reliability: presentation

1
Chapter 3. Reliability

2
What CU?

3
Reliability Topics

4
Basic Notions of Reliability

Reliability refers to the reliability of a test
score or set of test scores, not the reliability
of the test.
Reliability questions ask: Are the scores
consistent? Are they stable?
Reliability is a matter of degree; it is NOT
all-or-none.
Reliability is not the same as validity;
validity asks: Does a test measure what it is
supposed to? (Reliability is necessary for, but
not a sufficient condition for, validity.)
Reliability deals with unsystematic error in
assessment. Systematic error (examples: "I test
well because I am test-wise" or "I do not test
well because English is not my first language")
will not be uncovered through tests of reliability.

5
Factors Affecting Reliability: Sources of
Unreliability

Test Scoring
difference between two scorers' judgments
one scorer over time (fatigue) and/or halo effect
Test Content
the sample of test items is too small
the sample of test items is not evenly selected
across material
Test Administration
noise, time limits not consistent, physical
conditions
Personal Conditions
temporary ups and downs
(chronic test anxiety would be a systematic error
and thus undetectable through measures of
reliability)
Note: None of these factors automatically results
in unreliability, but as we build our
assessments, we hope to reduce the impact of
these factors. The extent to which these factors
may be affecting test scores is an empirical
question and we can and will address this as we
continue.

6
A Bit of Theory (True/Observed)

The perfect test would be unaffected by the
sources of unreliability and on this perfect test
each examinee should get his or her true score.
Unfortunately, we know the observed score we get
was likely affected by one or more of the sources
of unreliability.
So, our observed score is likely too high or too
low. The difference between the observed score
and the true score we call the error score, and
this score can be positive or negative.
We can express this mathematically as
True Score = Obtained Score ± Error
T = O ± E (or, looking at it another way,
O = T ± E)
Theory Time: If we could re-administer a test to
one person an infinite number of times, what
would we expect the distribution of their scores
to look like? Answer: The Bell-Shaped Curve. We
will return to this concept when we discuss the
standard error of measurement.
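
The thought experiment above can be tried on a computer. The following is only a sketch: it assumes the error score is drawn from a normal distribution with mean zero (the usual classical assumption, not something shown on the slide), and the true score of 80 and error spread of 4 are made-up values for illustration.

```python
import random
import statistics

def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Return observed scores O = T + E for repeated administrations.

    Assumption: E is normally distributed with mean 0 and
    standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd)
            for _ in range(n_administrations)]

# Hypothetical examinee: true score 80, error spread 4 points.
scores = simulate_observed_scores(true_score=80, error_sd=4,
                                  n_administrations=10_000)

# The observed scores pile up around the true score.
print(round(statistics.mean(scores), 1))   # close to 80
print(round(statistics.stdev(scores), 1))  # close to 4
```

The spread of the simulated scores around the true score is the quantity the standard error of measurement describes.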

7
Determining Reliability by Using the Concept of
Correlation

I can use my understanding of correlation (how
two things are related) to come up with a
mathematical calculation that will suggest the
strength (or lack of strength) regarding one or
more of the sources of unreliability that I have
identified.
I will be calculating what will be called the
reliability coefficient (since it is a
correlation coefficient measuring a type of
reliability). This value can range from -1 to 1.
For example, let's consider rater reliability.
That is, do different scorers rate equally or,
another concern, does one scorer rate differently
over time? We express that as either
Inter-rater reliability: among raters
(as in international, among many nations)
Intra-rater reliability: same rater (as in
intramural sports, within one school)
Note: the hyphen after inter- and intra- may not
be used by some authors.
Compute using Spearman Rank Correlation.
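
As a rough illustration of the inter-rater case, the sketch below computes a Spearman rank correlation by ranking each rater's scores (tied scores share the average rank) and then correlating the ranks. The two raters' scores are invented, and the function names are hypothetical helpers written for this example.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented example: two raters score the same six essays on a 1-5 scale.
rater_a = [4, 3, 5, 2, 1, 4]
rater_b = [5, 3, 4, 2, 1, 3]
rho = spearman(rater_a, rater_b)
print(round(rho, 2))
```

For intra-rater reliability the same calculation applies, with the two lists being one rater's scores for the same papers on two occasions.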

8
Some History . . . Karl Pearson (1857-1936)

Pearson was a Galton protégé and was appointed
the first Galton Professor of Eugenics (1911) at
University College London.
Introduced a new science, "Biometrics," which
integrated statistics with evolutionary theory.
Advocated social imperialism: "superior" races
and countries should produce more offspring than
those considered to be less developed.
In the United States, Indiana was the first to
pass a pioneering statute (1907) allowing state
officials to sterilize those deemed unfit to
breed. California enacted an even stricter
eugenics law. California made it legal for state
officials to asexualize those considered
feeble-minded, prisoners exhibiting sexual or
moral perversions, and anyone with more than
three criminal convictions.

9
More Reliability Approaches to Consider

Test-retest (impractical for you; important in
standardized tests)
Alternate Forms (again, impractical for you but
important in standardized tests)
Internal Consistency (not appropriate for speeded
tests)
Kuder-Richardson (really a series of formulas
based on dichotomously scored items)
Coefficient alpha - Cronbach's (most widely used,
as it can be used with continuous item types)
Split-half: odd-even w/Spearman-Brown
correction to apply to full test (easiest for you
to do and understand)
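
For readers curious about coefficient alpha, here is a minimal sketch using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The slides do not give this formula, so treat it as supplementary; the rating data are invented.

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha for a score table.

    item_scores: one list per examinee, one score per item.
    """
    k = len(item_scores[0])
    item_variances = [
        statistics.variance([row[i] for row in item_scores])
        for i in range(k)
    ]
    total_variance = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Invented example: five examinees, four items scored 1-5.
ratings = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```

Applied to items scored 0/1, the same calculation should give the Kuder-Richardson formula 20 value.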

10
Reliability of Your Classroom Tests

I would recommend doing Split-Half Reliability.
Step 1: Split your test into two parts (odd/even).
Step 2: Use Pearson Product Moment Correlation
(ungrouped data) to determine rxy (rxy
represents the correlation between the two halves
of the scale). By doing the split-half we reduce
the number of items, which we know will
automatically reduce the reliability, SO
Step 3: To estimate the reliability of the whole
test, use the Spearman-Brown correction formula:
rsb = 2rxy / (1 + rxy)
where rsb is the split-half reliability
coefficient.
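
The three steps above can be sketched in a few lines of Python. This is an illustration rather than a prescribed procedure: the function names are hypothetical, and the 0/1 item scores for six students are invented.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation (ungrouped data)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_xy):
    """Step 3: correct the half-test correlation to full-test length."""
    return 2 * r_xy / (1 + r_xy)

def split_half_reliability(item_scores):
    """Return (r_xy, r_sb) for a table with one row per student."""
    # Step 1: odd-numbered items (1, 3, 5, ...) vs. even-numbered items.
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlate the two half-test totals.
    r_xy = pearson_r(odd_totals, even_totals)
    return r_xy, spearman_brown(r_xy)

# Invented example: six students, six items scored right (1) or wrong (0).
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
]
r_xy, r_sb = split_half_reliability(item_scores)
print(round(r_xy, 2), round(r_sb, 2))
```

Whenever rxy is positive and below 1, the corrected value rsb comes out higher than rxy, which reflects the point that the full test is longer than either half.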

11
As a Teacher, What Do I Need to Know Most About
Reliability

For tests I create myself:
Increasing number of items increases reliability.
Moderate difficulty level increases reliability.
Having items measuring similar content increases
reliability.
For standardized tests I use:
Look for each test's published reliability data.
Use the published reliability coefficient to
determine the Standard Error of Measurement
(abbreviated SEM) found in the data.
See the following illustration.

12
Standard Error of Measurement

The SEM is the standard deviation of a
hypothetically infinite number of obtained scores
around a person's true score.

13
SEM and Confidence Bands
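
The original illustration for this slide is not part of the transcript. As a stand-in, here is a sketch that uses the standard formula SEM = SD * sqrt(1 - reliability) and builds a band of plus or minus one SEM around an observed score. The test statistics (SD of 15, reliability of .91, observed score of 100) are invented, and the function names are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the test's standard deviation and reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(observed_score, sem, n_sems=1.0):
    """Band of n_sems standard errors on each side of the observed score."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem

# Invented published data: SD = 15, reliability = .91.
sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = confidence_band(observed_score=100, sem=sem)
print(round(sem, 1), round(low, 1), round(high, 1))
```

Under the bell-curve assumption, a band of one SEM on each side covers roughly 68 percent of the scores a person would obtain, and about two SEMs on each side covers roughly 95 percent.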

14
Final Thoughts and Advice

Use multiple sources of information.
Find and use a published test's SEM to help
interpretation.
Standard Error of Measurement is distinct from:
Standard error of the mean (samples/populations)
Standard error of estimate (prediction)
Reliability for criterion-referenced items may
use techniques already covered but sometimes
requires special treatment.
Worry about scorer reliability when score depends
on judgment.

15
More Final Words . . .

Reliability for subscores is problematic since
small clusters are usually quite unreliable.
For important decisions, get reliability greater
than .90.
Be wary of short tests. To increase reliability,
increase number of items, exercises, or
observations.
Occasionally check reliability of your classroom
tests.
Be able to distinguish between reliability and
validity.
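
The advice about short tests can be made concrete with the general Spearman-Brown formula, of which the split-half correction is the special case of doubling the test length. The sketch below assumes the added items are comparable in quality to the existing ones; the starting reliability of .60 is invented.

```python
def spearman_brown_prophecy(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumption: added items are similar in quality to the existing ones.
    """
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))

# Invented example: a short quiz with reliability .60.
for factor in (1, 2, 3, 4):
    predicted = spearman_brown_prophecy(0.60, factor)
    print(factor, round(predicted, 2))
```

Each increase in length raises the predicted reliability, though by a smaller amount each time.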

16
Terms/Concepts to Review and Study on Your Own (1)

17
Terms/Concepts to Review and Study on Your Own (2)
