Transcript and Presenter's Notes

Title: Class 4 Basic Psychometric Characteristics: Variability, Reliability, Interpretability, October 9, 2008


1
Class 4: Basic Psychometric Characteristics: Variability, Reliability, Interpretability
October 9, 2008
  • Anita L. Stewart
  • Institute for Health & Aging
  • University of California, San Francisco

2
Overview of Class 4
  • Concepts of error
  • Basic psychometric characteristics
  • Variability
  • Reliability
  • Interpretability

3
Components of an Individual's Observed Item Score
  • (Simplistic view)
  • Observed item score = true score + error
4
Components of an Individual's Observed Item Score
  • Observed item score = true score + error
  • True score: the score that would be obtained over repeated testings (Nunnally, 1994, p. 211)
5
Random versus Systematic Error
  • Observed item score = true score + random error + systematic error
6
Random versus Systematic Error
  • Observed item score = true score + random error + systematic error
  • Random error is relevant to reliability
  • Systematic error is relevant to validity
7
Components of Variability in Item Scores of a Group of Individuals
  • Observed score variance = true score variance + error variance
  • Total variance = variance of the sum of all observed item scores
8
Components of Variability in Item Scores of a Group of Individuals
  • Observed score variance = true score variance + (random) error variance
  • Total variance = variance of the sum of all observed item scores
9
Combining Items into Multi-Item Scales
  • When items are combined into a summated scale, random error to some extent cancels out (see the sketch below)
  • Error variance is reduced as the number of items increases
  • Reducing random error increases the proportion of true score variance
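A minimal simulation sketch (not from the slides; the sample size, score mean, and error SD are illustrative) of how averaging items lets random error cancel and raises the share of true-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 1000, 10

true_score = rng.normal(50, 10, size=n_people)            # latent true scores
# each item = true score + independent random error (SD 10)
items = true_score[:, None] + rng.normal(0, 10, size=(n_people, n_items))

single_item = items[:, 0]
scale_score = items.mean(axis=1)                          # summated (averaged) scale

# squared correlation with the true score ~ share of true-score variance
print(np.corrcoef(true_score, single_item)[0, 1] ** 2)    # roughly .5 for one item
print(np.corrcoef(true_score, scale_score)[0, 1] ** 2)    # roughly .9 for 10 items
```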

10
Sources of Error
  • Subjects
  • Observers or interviewers
  • Measure or instrument

11
Example: Measuring Weight of Children
  • Observed score is a linear combination of many
    sources of variation for an individual

12
Measuring Weight in Pounds (Without Shoes) of One Child
  • Observed weight is the sum of:
  • True weight (80 lbs)
  • Weight of clothes
  • Amount of water drunk in the past 30 minutes
  • Scale is miscalibrated
  • Person weighing children is not very precise
13
Measuring Weight in Pounds (Without Shoes) of One Child
  • Observed weight (82.1 lbs) is the sum of:
  • True weight: 80 lbs
  • Weight of clothes: .70 lb
  • Amount of water drunk in the past 30 minutes: .25 lb
  • Scale is miscalibrated: .1 lb
  • Person weighing children is not very precise: 1 lb
  • 82.1 ≈ 80 + .70 + .25 + .1 + 1
14
Sources of Error in Measuring Weight of Children
  • Weight of clothes: subject source of random error
  • Scale is miscalibrated: instrument source of systematic error
  • Person weighing child is not precise: observer source of random error

15
Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man
  • Observed depression score is the sum of:
  • True depression
  • Hard to choose a number on the 1-6 response choice scale
  • Measure misses 2 culturally-bound symptoms
  • Unwillingness to tell interviewer
  • Poor memory of feelings
16
Measuring Depressive Symptoms (past 4 weeks) in an Asian or Latino Man
  • Observed depression score (12) is the sum of:
  • True depression: 16
  • Hard to choose a number on the 1-6 response choice scale: +1
  • Measure misses 2 culturally-bound symptoms: -2
  • Unwilling to tell interviewer: -2
  • Poor memory of feelings: -1
  • 12 = 16 + 1 - 2 - 2 - 1
17
Sources of Error in Measuring Depression
  • Hard to choose one number on the 1-6 response scale: subject source of random error
  • Unwilling to tell interviewer, poor memory of feelings: subject sources of systematic error (underreport true depression)
  • Measure misses culturally-bound symptoms: instrument source of systematic error (underestimates true depression)

18
Four Types of Memory Errors From Cognitive
Psychology
  • Encoding: information inadequately stored in memory
  • Storage: memory erodes over time
  • Retrieval: some events/feelings are harder to recall
  • Reconstruction: errors made in filling in missing pieces

R Tourangeau, Chap 3, in AA Stone et al. (eds), The Science of Self-Report, London: Lawrence Erlbaum, 2000
19
Memory and Time
  • Autobiographical memory: memory of things in time and space
  • Events are not encoded with their calendar dates
  • Thus time is a poor retrieval cue
  • Numerous errors in remembering when and how often something occurred within a particular time frame

N Bradburn, Chap 4, The Science of Self-Report
20
Memory and Emotion
  • Tend to remember
  • positive more than negative experiences
  • more emotionally intense than neutral experiences
  • non-threatening events more than threatening,
    sensitive events

Kihlstrom et al, Chap 6, The Science of
Self-Report
21
Overview
  • Concepts of error
  • Basic psychometric characteristics
  • Variability
  • Reliability
  • Interpretability

22
Variability
  • Good variability
  • All (or nearly all) scale levels are represented
  • Distribution approximates bell-shaped normal
  • Variability is a function of the sample
  • Need to understand the variability of a measure in a sample similar to the one you are studying
  • Review criteria
  • Adequate variability on the latent variable that
    is relevant to your study

23
Indicators of Variability
  • Range of scores
  • Mean, median, mode
  • Standard deviation (or standard error)
  • Skewness statistic
  • Percent at floor (lowest possible score)
  • Percent at ceiling (highest possible score)
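A small sketch (hypothetical 0-10 scores; variable names are illustrative) computing these indicators:

```python
import numpy as np
from scipy import stats

scores = np.array([0, 2, 3, 3, 5, 7, 8, 10, 10, 10])   # hypothetical scale scored 0-10
lo, hi = 0, 10                                          # lowest/highest possible score

print("observed range:", scores.min(), "to", scores.max())
print("mean / median / mode:", scores.mean(), np.median(scores),
      np.bincount(scores).argmax())
print("SD:", scores.std(ddof=1))
print("skewness:", stats.skew(scores))
print("% at floor:", 100 * np.mean(scores == lo))
print("% at ceiling:", 100 * np.mean(scores == hi))
```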

24
Range of Scores Possible and Observed
  • Especially important for multi-item measures
  • Example
  • CES-D possible range is 0-30
  • In the Wong et al. study of mothers of young children, the observed range was 0-23
  • Missing the entire high end of the distribution (none had high levels of depression)

25
Mean, Median, Mode
  • Mean - average
  • Median - midpoint
  • Mode - most frequent score
  • In normally distributed measures, these are all
    the same
  • In non-normal distributions, they will vary

26
Mean and Standard Deviation
  • Most information on variability is from mean and
    standard deviation
  • Can envision how measure is distributed on the
    possible range
  • Mean ± 1 SD covers about 68% of the scores (in a normal distribution)
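A quick check of that rule of thumb on simulated normal scores (the mean and SD here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=100_000)             # roughly normal scores

print(np.mean(np.abs(x - x.mean()) <= x.std()))  # about 0.68 fall within 1 SD of the mean
```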

27
Skewness
  • Positive skew - scores bunched at low end, long
    tail to the right
  • Negative skew - opposite pattern
  • Skewness coefficient ranges from -infinity to +infinity
  • The closer to zero, the more normal the distribution
  • Values greater than 2.0 (in absolute value) are cause for concern

28
Ceiling and Floor Effects Similar to Skewness
Information
  • Ceiling effects: a substantial number of people get the highest possible score
  • Floor effects: the opposite
  • More helpful for single-item measures or coarse
    scales with only a few levels

29
"To what extent did health problems limit you in everyday physical activities (such as walking and climbing stairs)?"
49% answered "not limited at all" (can't improve)

30
SF-36 Variability Information in Patients with Chronic Conditions (N=3,445)
All scales are scored 0-100; higher is better
McHorney C et al. Med Care. 1994;32:40-66.
31
Evidence of Floor and Ceiling Effects in One SF-36 Scale
(Figure: 24% and 37% of patients at the extremes of the scale)
All scales are scored 0-100; higher is better
McHorney C et al. Med Care. 1994;32:40-66.
32
Reasons for Poor Variability
  • Low variability in construct being measured in
    that sample (true low variation)
  • Items not adequately tapping construct
  • If only one item, especially hard
  • Items not detecting variation at one end
  • What to do?
  • If developing measures, add items
  • If selecting measures, find another one

33
Advantages of Multi-item Scales Revisited
  • Using multi-item scales minimizes likelihood of
    ceiling/floor effects
  • Even if items are skewed, multi-item scale
    normalizes the skew

34
Percent with Best Score on 5 Items in the MOS
MHI-5
  • 6-level response scale, from "all of the time" to "none of the time"

Stewart A. et al., Measuring Functioning and
Well-Being, 1992
35
Percent with Best Score on 5 Items in the MOS
MHI-5
  • 6-level response scale, from "all of the time" to "none of the time"
  • On the 5-item scale, 5% had the highest score
Stewart A. et al., Measuring Functioning and
Well-Being, 1992
36
Overview
  • Concepts of error
  • Basic psychometric characteristics
  • Variability
  • Reliability
  • Interpretability

37
Reliability
  • Extent to which an observed score is free of
    random error
  • Produces the same score each time it is
    administered (all else being equal)
  • Population-specific - reliability affected by
  • sample size
  • variability in scores (dispersion)
  • a person's level on the scale

38
Back to Components of Variability in Item Scores
of a Group of Individuals
  • Observed score variance = true score variance + error variance
  • Total variance = variance of the sum of all observed item scores
39
Reliability Depends on True Score Variance
  • Reliability is a group-level statistic
  • Reliability = 1 - (error variance / total variance)
  • OR
  • Reliability = true score variance / total variance (the proportion of variance due to true score)

40
Reliability Depends on True Score Variance
  • Reliability of .70 means 30% of the variance in observed scores is due to error
  • Reliability = total variance - error variance (expressed as proportions of total variance)
  • .70 = 1.00 - .30
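A tiny sketch of the same identity with hypothetical variance proportions:

```python
true_var, error_var = 0.70, 0.30          # hypothetical variance components
total_var = true_var + error_var          # observed (total) variance = 1.00

reliability = true_var / total_var        # proportion of variance due to true score
same_value = 1 - error_var / total_var    # equivalently, 1 minus the error share
print(reliability, same_value)            # 0.7 0.7
```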

41
Reliability Coefficient
  • Typically ranges from .00 - 1.00
  • Higher scores indicate better reliability

42
Importance of Reliability
  • Necessary for validity (but not sufficient)
  • Low reliability (or high measurement error)
    attenuates correlations with other variables
  • May conclude that two variables are not related
    when they are
  • Greater reliability means greater statistical power
  • The more reliable your scales, the smaller the sample size needed to detect an association (see the sketch below)
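A simulation sketch of the attenuation point (every number here is hypothetical): adding measurement error to both variables weakens the observed correlation even though the true association is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x_true = rng.normal(size=n)
y_true = 0.5 * x_true + np.sqrt(0.75) * rng.normal(size=n)   # true r of about .5

for error_sd in (0.0, 1.0):                    # 0 = perfectly reliable measurement
    x_obs = x_true + rng.normal(0, error_sd, n)
    y_obs = y_true + rng.normal(0, error_sd, n)
    print(error_sd, np.corrcoef(x_obs, y_obs)[0, 1])   # about .50, then about .25
```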

43
Reliable Scale?
  • NO!
  • There is no such thing as a reliable scale
  • We accumulate evidence of reliability in a
    variety of populations in which it has been tested

44
How Do You Know if a Scale or Measure Has
Adequate Reliability?
  • Adequacy of reliability judged according to
    standard criteria
  • Criteria depend on type of coefficient

45
Types of Reliability Tests
  • Internal-consistency
  • Test-retest
  • Inter-rater
  • Intra-rater

46
Internal Consistency Reliability: Cronbach's Alpha
  • Requires multiple items supposedly measuring same
    construct to calculate
  • Extent to which all items measure the same
    construct (same latent variable)

47
Internal-Consistency Reliability
  • For multi-item scales
  • Cronbach's alpha
  • for scales using ordinal items (e.g., 1-5)
  • Kuder-Richardson 20 (KR-20)
  • for scales using dichotomous items
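A minimal Cronbach's alpha sketch (the function and data below are illustrative, not from the presentation); it assumes a complete respondents-by-items score matrix:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items (no missing data)."""
    k = items.shape[1]                               # number of items
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# example: 4 respondents x 3 items rated 1-5
print(cronbach_alpha(np.array([[1, 2, 2], [3, 3, 4], [4, 4, 5], [2, 3, 3]])))
```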

48
Minimum Standards for Internal Consistency Reliability
  • For group comparisons (e.g., regression,
    correlational analyses)
  • .70 or above is minimum (Nunnally, 1978)
  • .80 is optimal
  • above .90 is unnecessary
  • For individual assessment (e.g., treatment
    decisions)
  • .90 or above (ideally .95) is preferred (Nunnally, 1978)

49
Internal-Consistency Reliability Can be Spurious
  • Based on only those who answered all questions in
    the measure
  • If a lot of people have trouble with the items and skip some, they are not included in the test of reliability
  • Important to compare sample size in reliability
    calculation to total sample

50
Internal-Consistency Reliability is a Function of
Number of Items in Scale
  • Increases with the number of items
  • Very large scales (20 or more items) can have
    high reliability without other good psychometric
    properties

51
Example: 20-item Beck Depression Inventory (BDI)
  • BDI 1978 version (asks about past week)
  • Internal consistency reliability = .86

Beck AT et al. J Clin Psychol. 1984;40:1365-1367.
52
Example: 20-item Beck Depression Inventory (BDI)
  • BDI 1978 version (asks about past week)
  • Internal consistency reliability = .86
  • BUT 3 items correlated < .30 with the other items in the scale

Beck AT et al. J Clin Psychol. 1984;40:1365-1367.
53
Reliability Varies by Level on Measure
  • Reliability can be poorer for those scoring at
    one end of the scale
  • Example: number of visits to the doctor in the past 12 months
  • More reliable for those with fewer visits

54
Test-Retest Reliability
  • Repeat assessment on individuals not expected to
    change
  • Time between assessments should be
  • Short enough so no change occurs
  • Long enough so subjects don't recall their first response
  • Only reliability test available for single-item measures
  • Coefficient = correlation between the 2 measurements

55
Appropriate Test-Retest Coefficients by Type of
Scale
  • Continuous scales (ratio or interval scales,
    multi-item Likert scales)
  • Pearson
  • Ordinal or non-normally distributed scales
  • Spearman or Kendall's tau
  • Dichotomous (categorical) measures
  • Phi or kappa
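A sketch of those coefficient choices on hypothetical test-retest data (the scipy and scikit-learn calls used here exist; the numbers are made up):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

time1 = np.array([10, 12, 15, 18, 20, 25])   # continuous scale, two occasions
time2 = np.array([11, 12, 14, 19, 21, 24])

r, _ = stats.pearsonr(time1, time2)          # continuous, roughly normal scales
rho, _ = stats.spearmanr(time1, time2)       # ordinal or non-normal scales
tau, _ = stats.kendalltau(time1, time2)

yes1 = [1, 0, 1, 1, 0, 1]                    # dichotomous measure, two occasions
yes2 = [1, 0, 0, 1, 0, 1]
kappa = cohen_kappa_score(yes1, yes2)        # kappa for dichotomous measures

print(r, rho, tau, kappa)
```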

56
Minimum Standards for Test-Retest Reliability
  • Magnitude of a test-retest correlation is
    important, not significance
  • Criterion similar to that for internal
    consistency
  • > .70 is desirable
  • > .80 is optimal

57
Observer or Rater Reliability
  • Inter-rater reliability (across two or more
    raters)
  • Consistency (correlation) between two or more
    observers of the same subjects (one point in
    time)
  • Intra-rater reliability (within one rater)
  • Consistency within one observer
  • Correlation among repeated values obtained by the
    same observer (over time)

58
Observer or Rater Reliability
  • Sometimes Pearson correlations are used: scores on a group of individuals obtained by one observer are correlated with scores obtained by another observer
  • Assesses association only
  • .65 to .95 are typical correlations
  • > .85 is considered acceptable

McDowell I et al. Measuring Health, 2006, p. 45.
59
Association vs. Agreement When Correlating Scores
from Two Times or Ratings
  • Association: degree to which scores of one rater linearly predict scores of the 2nd rater
  • Agreement: extent to which the same score is obtained on the 2nd measurement (retest, 2nd rater)
  • Can have high correlation and poor agreement
  • If the second score is consistently higher for all subjects, a high correlation can still be obtained
  • Need a second test of mean differences

60
Hypothetical Scores on 4 Subjects by 2 Observers
61
Example of Association and Agreement
  • Scores by observer 1 are exactly 2 points above scores by observer 2
  • Correlation (association) would be perfect (r = 1.0)
  • Agreement is poor (no agreement on the score in any case: a difference of 2 between the scores on each subject)
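A tiny sketch of that example (hypothetical scores): a constant 2-point shift gives a perfect correlation but zero exact agreement:

```python
import numpy as np

observer2 = np.array([3, 5, 7, 9])
observer1 = observer2 + 2                        # always exactly 2 points higher

print(np.corrcoef(observer1, observer2)[0, 1])   # association: r = 1.0
print(np.mean(observer1 == observer2))           # exact agreement: 0.0
print((observer1 - observer2).mean())            # mean difference of 2 -> also test the means
```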

62
Intraclass Correlation Coefficient (Kappa) for
Testing Inter-rater Reliability
  • Coefficient indicates level of agreement of two
    or more judges, exceeding that which would be
    expected by chance
  • Appropriate for dichotomous (categorical) scales
    and ordinal scales
  • Several forms of kappa
  • e.g., Cohen's kappa: 2 judges, dichotomous scale
  • Sensitive to the number of observations and the distribution of the data
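A minimal Cohen's kappa sketch for two judges rating a dichotomous scale (the ratings are hypothetical; cohen_kappa_score is an existing scikit-learn function):

```python
from sklearn.metrics import cohen_kappa_score

rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]

# agreement beyond chance: 0 = chance level, 1 = perfect agreement
print(cohen_kappa_score(rater_a, rater_b))
```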

63
Interpreting Magnitude of Kappa: Level of Reliability
  • < 0.00: Poor
  • .00 - .20: Slight
  • .21 - .40: Fair
  • .41 - .60: Moderate
  • .61 - .80: Substantial
  • .81 - 1.00: Almost perfect

.60 or higher is acceptable (Landis, 1977)
64
Reliability Often Poorer in Lower SES or Low
Literacy Groups
  • More random error due to
  • Reading problems
  • Difficulty understanding complex questions
  • Unfamiliarity with questionnaires and surveys

65
Advantages of Multi-item Scales Revisited
  • Using multi-item scales improves reliability
  • Random error is canceled out across multiple
    items

66
Overview
  • Concepts of error
  • Basic psychometric characteristics
  • Variability
  • Reliability
  • Interpretability

67
Interpretability: What Does a Score Mean?
  • What are the endpoints?
  • What does a high score mean? (direction of
    scoring)
  • Compared to norms - is score low or high?
  • Single items are more easily interpretable
  • Multi-item scale scores have no inherent meaning

68
Endpoints
  • What is minimum and maximum possible?
  • Enable interpretation of mean score
  • When scores are added, endpoints depend on the number of items and the number of response choices
  • 5 items, 4 response choices: 5 to 20
  • 3 items, 5 response choices: 3 to 15

69
Compare Results to Norms
  • Comparing your means to published norms helps
    interpret the mean of your sample
  • SF-36 has numerous norms, e.g.
  • General population
  • By age group, gender, and chronic disease

70
SF-36 in MOS versus Norms
JE Ware et al., SF-36 Health Survey Manual and Interpretation Guide, The Health Institute, 1993.
71
Direction of Scoring
  • What does a high score mean?
  • Where in the range does the mean score lie?
  • Toward top, bottom?
  • In the middle?

72
Descriptive Statistics for 3,000 Women
Med Care. 2003;41:1262-1276.
73
Descriptive Statistics for 3,000 Women
Activity: no measure mentioned. Stress: Perceived Stress Scale (Cohen, 1983)
Med Care. 2003;41:1262-1276.
74
Perceived Stress Scale (Cohen 1983) Hard to Find
  • Available in JSTOR
  • Can print one page at a time
  • Searched the article online
  • Could not find scoring information other than to reverse 7 of the 14 items and sum them
  • Possible score range of 0-56
  • Could not find response choices

75
Another Example: Mean Scores in a Sample of Older Adults
                          Mean
Physical functioning      45.0
Sleep problems            28.1
Disability                35.7
76
Making it Easier to Interpret
                          Mean
Physical functioning      45.0
Sleep problems            28.1
Disability                35.7
All scores 0-100
77
Making it Easier to Interpret
                          Mean
Physical functioning (+)  45.0
Sleep problems (-)        28.1
Disability (-)            35.7
All scores 0-100. (+) indicates a higher score is better health; (-) indicates a lower score is better health
78
Confusion Introduced by Labels
  • SF-36 Bodily Pain scale
  • A higher score means no pain or limitations due to pain
  • Rationale: so all 8 subscales are scored in the same direction
  • Social Adjustment Scale (Weissman)
  • Functional Status Index (Jette)

79
Mean Has to be Interpreted Within Possible Range
  • Parents' harsh discipline practices (M, SD)
  • Interviewers' ratings of mother: M = 2.55, SD = .74
  • Husbands' reports of wife: M = 5.32, SD = 3.30
  • Note: a high score indicates more harsh practices

80
Mean Has to be Interpreted Within Possible Range
(Add Range)
  • Parents' harsh discipline practices (M, SD)
  • Interviewers' ratings of mother (range 1-5): M = 2.55, SD = .74
  • Husbands' reports of wife (range 1-7): M = 5.32, SD = 3.30
  • Note: a high score indicates more harsh practices

81
Mean Has to be Interpreted Within Possible Range
  • Parents' harsh discipline practices (M, SD)
  • Interviewers' ratings of mother (range 1-5): M = 2.55, SD = .74
  • Husbands' reports of wife (range 1-7): M = 5.32, SD = 3.30
  • (Figure: each mean plotted on its own number line, 2.55 on the 1-5 interviewer scale and 5.32 on the 1-7 husband scale)
  • Note: a high score indicates more harsh practices
82
Transforming a Summated Scale to a 0-100 Scale
  • Works with any ordinal or summated scale
  • Transforms it so 0 is the lowest possible and 100
    is the highest possible
  • Eases interpretation across numerous scales

Transformed score = 100 x (observed score - minimum possible score) / (maximum possible score - minimum possible score)
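A one-function sketch of that transformation (the function name is illustrative):

```python
def to_0_100(observed, min_possible, max_possible):
    """Rescale a summated score so the possible range runs from 0 to 100."""
    return 100 * (observed - min_possible) / (max_possible - min_possible)

# e.g., a 5-item scale with 4 response choices is scored 5-20:
print(to_0_100(12, 5, 20))   # 46.7: a bit below the midpoint of the possible range
```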
83
Homework
  • Complete rows 9-20 of the matrix for both measures
  • Interpretability, scale characteristics,
    variability, reliability