1
Testing 05
  • Reliability

2
Errors & Reliability
  • Errors in the test cause unreliability.
  • The fewer the errors, the more reliable the test.
  • Sources of errors:
  • Obvious: poor health, fatigue, lack of interest
  • Less obvious: the facets discussed in Fig. 5.3

3
Reliability & Validity
  • Reliability is a necessary condition for validity.
  • Reliability and validity are complementary aspects of measurement.
  • Reliability: how much of the performance is due to measurement errors, or to factors other than the language ability we want to measure.
  • Validity: how much of the performance is due to the language ability we want to measure.

4
Reliability Measurement
  • Reliability measurement includes logical analysis and empirical research, i.e. identifying the sources of error and estimating the magnitude of their effects on the scores.

5
Logical Analysis
  • Example of identifying a source of error:
  • Topic in an oral interview: business negotiation
  • Source of error: if we want to measure the test taker's ability to discuss general topics.
  • Indicator of the ability: if we want to measure the test taker's ability in business English.

6
Empirical Research
  • Procedures are usually complex.
  • Three kinds of theories:
  • Classical true score theory (CTS)
  • Generalizability theory (G-Theory)
  • Item Response Theory (IRT)

7
Factors Affecting Test Scores
  • Characteristics of factors:
  • general vs. specific
  • lasting vs. temporary
  • systematic vs. unsystematic

8
Factors that affect language test scores
9
Variance & Standard Deviation
  • s: standard deviation of the sample
  • σ: standard deviation of the population
  • s²: variance of the sample
  • σ²: variance of the population
  • s = √(Σ(X - X̄)² / (n - 1))  (a code sketch follows)
  • where:
  • X: individual score
  • X̄: mean score
  • n: number of students
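A minimal Python sketch of this formula (the function name is illustrative, not from the slides):

    from math import sqrt

    def sample_sd(scores):
        """s = sqrt(sum((X - mean)^2) / (n - 1)), the sample standard deviation."""
        n = len(scores)
        mean = sum(scores) / n
        return sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

    print(sample_sd([70, 75, 80, 85, 90]))  # 7.905...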

10
Correlation Coefficient
  • Covariance (COV): the degree to which two variables, X and Y, vary together.
  • COV(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
  • Correlation coefficient (Pearson product-moment correlation coefficient):
  • r(x,y) = COV(x,y) / (sx sy)
  • r(x,y) = Σ(Xi - X̄)(Yi - Ȳ) / ((n - 1) sx sy)

11
Correlation Coefficient
  • where:
  • n: number of paired scores (i.e. the test takers)
  • Xi: individual score on the first half
  • X̄: mean of the scores on the first half
  • Yi: individual score on the second half
  • Ȳ: mean of the scores on the second half
  • sx: standard deviation of the first half
  • sy: standard deviation of the second half

12
Calculation of Correlation Coefficient
  • manually (see the Python sketch below)
  • manually, with Excel doing the arithmetic
  • entirely in Excel
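A minimal Python sketch of the manual computation (names are my own):

    from math import sqrt

    def pearson_r(xs, ys):
        """r = sum((Xi - mean_x)(Yi - mean_y)) / ((n - 1) * sx * sy)."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
        sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
        sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
        return cov / (sx * sy)

    # First-half vs. second-half scores for five test takers:
    print(pearson_r([10, 12, 15, 9, 14], [11, 13, 14, 8, 15]))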

13
Classical True Score Theory
  • also referred to as the classical reliability
    theory because its major task is to estimate the
    reliability of the observed scores of a test.
    That is, it attempts to estimate the strength of
    the relationship between the observed score and
    the true score.
  • sometimes referred to as the true score theory because its theoretical derivations are based on a mathematical model known as the true score model.

14
(No Transcript)
15
Assumptions in CTS
  • Assumption 1: The observed score consists of the true score and the error score, i.e. x = xt + xe
  • Assumption 2: Error scores are unsystematic, random, and uncorrelated with the true score, i.e. sx² = st² + se²

16
Parallel Test
  • Two tests are parallel if:
  • x̄ = x̄' (equal means)
  • sx² = sx'² (equal variances)
  • rxy = rx'y (equal correlations with a third test y)

17
Correlation Between Parallel Tests
  • If the observed scores on two parallel tests are
    highly correlated, the effects of the error
    scores are minimal.
  • Reliability is the correlation between the
    observed scores of two parallel tests.
  • This definition is the basis for all estimates of reliability within CTS theory.
  • Condition: the observed scores on the two tests must be experimentally independent. (A small simulation sketch follows.)
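A toy simulation (my illustration, not from the slides) of this idea: two parallel forms share the same true scores and differ only in independent random error, so their correlation falls as the error grows:

    import random
    from statistics import correlation  # Python 3.10+

    random.seed(1)
    true_scores = [random.gauss(50, 10) for _ in range(500)]

    def parallel_form(error_sd):
        """Observed score = true score + independent random error."""
        return [t + random.gauss(0, error_sd) for t in true_scores]

    for error_sd in (2, 5, 10):
        x, y = parallel_form(error_sd), parallel_form(error_sd)
        print(error_sd, round(correlation(x, y), 3))  # correlation drops as error grows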

18
Error Score Estimation and Measurement
  • Relations between reliability, true score, and error score:
  • The higher the portion of the true score, the
    higher the correlation of the two parallel tests.
    (True scores are systematic)
  • The higher the portion of the error score, the
    lower the correlation of the two parallel tests.
    (Error scores are random)

19
Error Score Estimation and Measurement
  • rxx' = st²/sx²
  • (st² + se²)/sx² = 1
  • se²/sx² = 1 - st²/sx²
  • st²/sx² = rxx'
  • se²/sx² = 1 - rxx'
  • se² = (1 - rxx') sx²

20
Approaches to Estimate Reliability
  • Three approaches, based on different sources of error:
  • Internal consistency: errors arising from within the test and the scoring procedure
  • Stability: how consistent test scores are over time
  • Equivalence: whether scores on alternative forms of the test are equivalent

21
Internal Consistency
  • For dichotomously scored items:
  • Split-half reliability estimates
  • The Spearman-Brown split-half estimate
  • The Guttman split-half estimate
  • Kuder-Richardson reliability coefficients
  • For non-dichotomous scores:
  • Coefficient alpha
  • Rater consistency

22
Split-half Reliability Estimates
  • Split the test into two halves which have equal means and variances (equivalence) and are independent of each other (independence). Three ways to split:
  • 1. divide the test into the first and second halves
  • 2. random halves
  • 3. odd-even method

23
Spearman-Brown Reliability Estimate
  • rxx' = 2rhh' / (1 + rhh')
  • where:
  • rhh': correlation between the two halves of the test
  • Procedure:
  • 1. Divide the test into two equal halves
  • 2. Calculate the correlation coefficient between the two halves
  • 3. Calculate the Spearman-Brown reliability estimate (sketched below)
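A minimal Python sketch of the whole procedure, using an odd-even split (helper names are my own):

    from statistics import correlation  # Python 3.10+

    def spearman_brown(item_matrix):
        """item_matrix[i][j] = score of test taker i on item j."""
        odd = [sum(row[0::2]) for row in item_matrix]   # half 1: items 1, 3, 5, ...
        even = [sum(row[1::2]) for row in item_matrix]  # half 2: items 2, 4, 6, ...
        r_hh = correlation(odd, even)
        return 2 * r_hh / (1 + r_hh)                    # rxx' = 2rhh' / (1 + rhh')

    items = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
    print(round(spearman_brown(items), 3))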

24
Guttman Split-Half Estimate
  • rxx' = 2(1 - (sh1² + sh2²)/sx²)
  • where:
  • sh1²: variance of the first half
  • sh2²: variance of the second half
  • sx²: variance of the total scores
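A minimal Python sketch, assuming the two half-scores are already computed (names are my own):

    from statistics import variance  # sample variance, n - 1 denominator

    def guttman(half1, half2):
        total = [a + b for a, b in zip(half1, half2)]
        return 2 * (1 - (variance(half1) + variance(half2)) / variance(total))

    print(round(guttman([2, 2, 1, 2, 0], [1, 1, 0, 2, 1]), 3))  # 0.556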

25
Kuder-Richardson Formula 20
  • rxx' = (k/(k - 1))(1 - Σpq/sx²)
  • where:
  • k: number of items on the test
  • p: proportion of correct answers to an item, i.e. correct answers/total answers (item difficulty)
  • q: proportion of incorrect answers, i.e. 1 - p
  • sx²: total test score variance
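A minimal Python sketch for dichotomous (0/1) items; I use the sample variance here, while some texts use an n denominator instead:

    from statistics import variance

    def kr20(item_matrix):
        k = len(item_matrix[0])                    # number of items
        n = len(item_matrix)                       # number of test takers
        totals = [sum(row) for row in item_matrix]
        sum_pq = 0.0
        for j in range(k):
            p = sum(row[j] for row in item_matrix) / n  # item difficulty
            sum_pq += p * (1 - p)
        return (k / (k - 1)) * (1 - sum_pq / variance(totals))

    items = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
    print(round(kr20(items), 3))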

26
Kuder-Richardson Formula 21
  • rxx' = (k sx² - x̄(k - x̄)) / ((k - 1) sx²)
  • where:
  • k: number of items on the test
  • sx²: total test score variance
  • x̄: mean score
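KR-21 needs only the number of items, the mean, and the variance of the total scores. A minimal Python sketch (names are my own):

    from statistics import mean, variance

    def kr21(totals, k):
        m, v = mean(totals), variance(totals)
        return (k * v - m * (k - m)) / ((k - 1) * v)

    print(round(kr21([3, 3, 1, 4, 1], k=4), 3))  # 0.622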

27
Coefficient alpha
  • α = (k/(k - 1))(1 - Σsi²/sx²)
  • where:
  • k: number of parts (items) of the test
  • Σsi²: sum of the variances of the different parts of the test
  • sx²: variance of the total test scores
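A minimal Python sketch, treating each column as one part of the test (names are my own):

    from statistics import variance

    def coefficient_alpha(parts):
        """parts[i][j] = score of person i on part j."""
        k = len(parts[0])
        part_vars = [variance([row[j] for row in parts]) for j in range(k)]
        total_var = variance([sum(row) for row in parts])
        return (k / (k - 1)) * (1 - sum(part_vars) / total_var)

    ratings = [[4, 5], [3, 3], [5, 4], [2, 3], [4, 4]]  # e.g. two ratings per essay
    print(round(coefficient_alpha(ratings), 3))  # 0.788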

28
Comparison of Estimates & Assumptions
(No Transcript)
29
Summary: Estimate Procedure
  • Spearman-Brown:
  • 1. split
  • 2. variances of each half
  • 3. correlation coefficient between the two halves
  • 4. reliability coefficient

30
Summary: Estimate Procedure
  • Guttman:
  • 1. split
  • 2. variances of each half
  • 3. variance of the whole test
  • 4. reliability coefficient

31
Summary: Estimate Procedure
  • K-R 20:
  • 1. number of questions
  • 2. proportion of correct answers for each question
  • 3. proportion of incorrect answers for each question
  • 4. sum of the products of p and q
  • 5. variance of the whole test
  • 6. reliability coefficient

32
Summary: Estimate Procedure
  • K-R 21:
  • 1. number of questions
  • 2. mean of the test
  • 3. variance of the test
  • 4. reliability coefficient

33
Summary: Estimate Procedure
  • Coefficient α:
  • 1. number of parts of the test
  • 2. mean of each part
  • 3. variance of each part
  • 4. sum of the variances of all parts
  • 5. mean of the test
  • 6. variance of the test
  • 7. reliability coefficient

34
Rater Consistency
  • Intra-rater
  • Inter-rater

35
Intra-rater Reliability
  • Rate each paper twice. Condition: the two ratings must be independent of each other.
  • Two ways of estimating:
  • Spearman-Brown: take each rating as a split half and compute the reliability coefficient.

36
Intra-rater Reliability
  • Condition: the two ratings must have similar means and variances, to ensure the equivalence of the two ratings.
  • Coefficient alpha: take the two ratings as two parts of a test.
  • α = (k/(k - 1))(1 - (sx1² + sx2²)/s(x1+x2)²)

37
Intra-rater Reliability
  • where:
  • k: number of ratings
  • sx1²: variance of the first rating
  • sx2²: variance of the second rating
  • s(x1+x2)²: variance of the summed ratings
  • Since k = 2, the formula reduces to the Guttman split-half formula.

38
Inter-rater Reliability
  • If there are only two raters, use split-half estimates to obtain the reliability coefficient.
  • Or use the rank (grade) correlation coefficient (Spearman):
  • rxx' = 1 - 6ΣD² / (n(n² - 1))
  • where:
  • D: difference between the ranks assigned by the two raters

39
Inter-rater Reliability
  • n: number of test takers
  • See testing 05-2, sheet 5, for an example.
  • Note: test takers with the same score share the same (averaged) rank.
  • If there are more than two raters, use the coefficient alpha estimate. (A rank-correlation sketch follows.)
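A minimal Python sketch of the rank correlation for two raters; tied scores get the averaged rank, the common classroom treatment of the 6ΣD² formula (names and data are my own):

    def ranks(scores):
        """Assign ranks, averaging the rank of tied scores."""
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rank = [0.0] * len(scores)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            for k in range(i, j + 1):
                rank[order[k]] = (i + j) / 2 + 1  # average of positions i..j, 1-based
            i = j + 1
        return rank

    def rank_correlation(r1, r2):
        a, b = ranks(r1), ranks(r2)
        n = len(a)
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return 1 - 6 * d2 / (n * (n * n - 1))

    print(rank_correlation([88, 75, 75, 60, 92], [85, 70, 78, 65, 90]))  # 0.975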

40
Stability (test-retest reliability)
  • Administer the test twice to a group of individuals and compute the correlation between the two sets of scores. The correlation can then be interpreted as an indicator of how stable the scores are over time.
  • Learning effects and practice effects must be taken into account.

41
Equivalence (parallel forms reliability)
  • Use alternative forms of a given test. Compute and compare the means and standard deviations for each of the two forms to determine their equivalence. The correlation between the two sets of scores can be interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.

42
GENERALIZABILITY THEORY
43
GENERALIZABILITY THEORY
  • Generalizability theory (G-theory) is a framework based on factorial design and the analysis of variance. It constitutes a theory and set of procedures for specifying and estimating the relative effects of different factors on observed test scores, and thus provides a means for relating test uses or interpretations to the way test users specify and interpret different factors as either abilities or sources of error.

44
GENERALIZABILITY THEORY
  • G-theory treats a given measure or score as a
    sample from a hypothetical universe of possible
    measures, i.e. on the basis of an individual's
    performance on a test we generalize to his
    performance in other contexts.
  • Reliability = generalizability
  • The way we define a given universe of measures will depend upon the universe of generalization.

45
Application of G-theory
  • Two stages:
  • G-study
  • D-study

46
G-study
  • The test developer considers the uses that will be made of the test scores and investigates the sources of variance that are of concern or interest. On the basis of this generalizability study, he or she obtains estimates of the relative sizes of the different sources of variance ('variance components').

47
D-study
  • When the results of the G-study are satisfactory,
    the test developer administers the test under
    operational conditions, and uses G-theory
    procedures to estimate the magnitude of the
    variance components. These estimates provide
    information that can inform the interpretation
    and use of the test scores.

48
Significance of G-theory
  • The application of G-Theory thus enables test
    developers and test users to specify the
    different sources of variance that are of concern
    for a given test use, to estimate the relative
    importance of these different sources
    simultaneously, and to employ these estimates in
    the interpretation and use of test scores.

49
Universes Of Generalization And Universe Of
Measures
  • Universe of generalization: a domain of uses or abilities (or both).
  • Universe of possible measures: the types of test scores we would be willing to accept as indicators of the ability to be measured, for the intended purpose.

50
Populations of Persons
  • In addition to defining the universe of possible measures, we must define the group, or population, of persons about whom we are going to make decisions or inferences.

51
Universe Score
  • A universe score xp is thus defined as the mean of a person's scores on all measures from the universe of possible measures. The universe score is thus the G-theory analog of the CTS-theory true score. The variance of a group of persons' scores on all measures would be equal to the universe score variance sp², which is similar to CTS true score variance in the sense that it represents the proportion of observed score variance that remains constant across different measurement facets and conditions.

52
Universe Score
  • The universe score is different from the CTS true
    score, however, in that an individual is likely
    to have different universe scores for different
    universes of measures.

53
Generalizability Coefficients
  • The G-theory analog of the CTS-theory reliability
    coefficient is the generalizability coefficient,
    which is defined as the proportion of observed
    score variance that is universe score variance
  • ρ² = sp² / sx²
  • where sp² is universe score variance and sx² is observed score variance, which includes both universe score variance and error variance.

54
Estimation
  • Variance components sources of variances
  • persons(p), forms(f), raters(r)
  • sx2sp2sf2sr2spf2spr2sfr2spfr2
  • Use ANOVA to compute for the magnitude of the
    variance
  • Analyse those that are significantly large.
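A toy numeric sketch (the component values are made up): given variance-component estimates from an ANOVA, assemble the observed-score variance and the generalizability coefficient:

    components = {
        "p": 16.0,                                     # persons: universe score variance
        "f": 1.0, "r": 2.0,                            # forms, raters
        "pf": 0.5, "pr": 1.5, "fr": 0.2, "pfr": 2.8,   # interactions
    }
    observed_var = sum(components.values())  # sx² = sum of all components
    g = components["p"] / observed_var       # ρ² = sp² / sx²
    print(round(g, 3))                       # 0.667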

55
Standard Error of Measurement (SEM)
  • We need to know the extent to which a test score may vary: the standard error of measurement (SEM).
  • Formula for estimating the SEM:
  • se = sx √(1 - rxx')
  • Derivation:
  • rxx' = st²/sx² (1)
  • st²/sx² + se²/sx² = 1 (2)
  • se²/sx² = 1 - st²/sx² (3)
  • se²/sx² = 1 - rxx'
  • se² = sx²(1 - rxx')
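A minimal Python sketch of the SEM, plus a common use, a band around an observed score (the 1.96 multiplier for a 95% band is my addition, not from the slides):

    from math import sqrt

    def sem(sd_x, reliability):
        return sd_x * sqrt(1 - reliability)  # se = sx * sqrt(1 - rxx')

    se = sem(sd_x=10, reliability=0.91)      # 3.0
    score = 72
    print(score - 1.96 * se, score + 1.96 * se)  # roughly 66.1 to 77.9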

56
Interpretation of Test Scores
  • Difficulty
  • Distinction (discrimination)
  • Z score

57
Difficulty for Dichotomous Scoring
  • p = R/n
  • where:
  • p: difficulty index
  • R: number of right answers
  • n: number of students

58
Difficulty for Dichotomous Scoring (Corrected)
  • Cp = (kp - 1)/(k - 1)
  • where:
  • Cp: corrected difficulty index
  • p: uncorrected difficulty index
  • k: number of choices
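A minimal Python sketch of both difficulty indices (names are my own):

    def difficulty(right, n):
        return right / n                   # p = R / n

    def corrected_difficulty(p, k):
        return (k * p - 1) / (k - 1)       # Cp = (kp - 1) / (k - 1)

    p = difficulty(right=30, n=40)
    print(p, round(corrected_difficulty(p, k=4), 3))  # 0.75, 0.667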

59
Difficulty for Non-dichotomous Scoring
  • p = mean / full score
  • Acceptable range: 0.30–0.85

60
Distinction
  • Label the top 27% of the test takers as the high group and the lowest 27% as the low group.
  • D = PH - PL
  • where:
  • D: distinction (discrimination) index
  • PH: rate of correct answers in the high group
  • PL: rate of correct answers in the low group
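A minimal Python sketch using the top/bottom 27% convention (names and data are my own):

    def distinction(records):
        """records: list of (total_score, 1 if the item was answered correctly else 0)."""
        ranked = sorted(records, key=lambda t: t[0], reverse=True)
        cut = max(1, round(len(ranked) * 0.27))
        p_high = sum(c for _, c in ranked[:cut]) / cut   # PH
        p_low = sum(c for _, c in ranked[-cut:]) / cut   # PL
        return p_high - p_low                            # D = PH - PL

    data = [(95, 1), (88, 1), (80, 1), (74, 0), (66, 1), (60, 0), (52, 0), (45, 0)]
    print(distinction(data))  # 1.0: the item separates high and low scorers well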

61
Z score
  • A way of placing an individual score in the whole distribution of scores on a test: it expresses how many standard deviation units a score lies above or below the mean. Scores above the mean are positive; those below the mean are negative.
  • An advantage of z scores is that they allow scores from different tests to be compared, where the means and standard deviations differ and where score points may not be equal.
  • Z = (X - X̄)/s

62
T-score
  • A transformation of a z score, equivalent to it
    but with the advantage of avoiding negative
    values, and hence often used for reporting
    purposes.
  • T = 10Z + 50

63
Standardized Score
  • A transformation of raw scores which provides a measure of relative standing in a group and allows comparison of raw scores from different distributions, e.g. from tests of different lengths. It does this by converting a raw score into a standard frame of reference expressed in terms of its relative position in the distribution of scores. The z score is the most commonly used standardized score.
  • Standardized score = 100Z + 500
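A minimal Python sketch converting one raw score to all three scales (the mean and standard deviation are assumed inputs):

    def z_score(x, mean, sd):
        return (x - mean) / sd             # Z = (X - mean) / s

    z = z_score(x=72, mean=65, sd=8)       # 0.875
    print(z, 10 * z + 50, 100 * z + 500)   # z, T = 10Z + 50, standardized = 100Z + 500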