Title: Dávid Gergely: Building a Case for Euro Examinations or A case study
1 Dávid Gergely: Building a Case for Euro
Examinations or A case study
2 The Mission of the Study
- Piloting the Manual and seeing how good the
methodology of linking is.
- Getting initial measures for items and tasks
calibrated to the CEF
- Establishing a link for Euro examinations with
the CEF.
- In sum, validate the test by following the
methodology outlined in the Manual. Build a case
for the CEF link by collecting validity evidence.
3 Initial decisions by Euro
- Management decision to select GramVoc only
- a question of finances
- North: "The most difficult task you could pick"
- Unpopular kind of test
- The Dutch CEF Construct project focused on
reading and listening
- ALTE produced grids for speaking and listening
- Any CEF scales relevant to the GramVoc paper?
4 Productive orientation of CEF
- General Linguistic Range, B2:
- Has a sufficient range of language to be able to
give clear descriptions, express viewpoints and
develop arguments without much conspicuous
searching for words, using some complex sentence
forms to do so.
5 In retrospect
- Advantages of selecting the GramVoc paper
- The knowledge of language underlies all other
skills in the examination
- GramVoc project as pilot for the rest of the Euro
papers. As part of the efforts of the Hungarian
Accreditation Board, Euro Examinations will carry
out level-setting exercises for all skills-based
Euro papers.
6 Process and Audience for the Case Study
- Four phases of action according to the Manual
- Familiarization
- Specification
- Standardization (of judgements)
- Empirical validation
- Working with the team of full-time item writers
as holders of standards
7 The Familiarization Phase 1
- Survey of familiarity with the CEF scales
- Descriptors from 15 scales, 133 items, as in a
test
- Statistical analyses
- Initial facility value of responses: 0.4
- Low? How low?
- For 16 of the 133 descriptors nobody got the
level right. Significantly more of these were B1
descriptors.
- No descriptor -- same descriptor problem
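The facility value reported on this slide can be read as the proportion of judges who assigned a descriptor its CEF level. The sketch below shows one way to compute it per descriptor and overall, and to list the descriptors nobody placed correctly. The function name and all data are hypothetical and are not the study's responses.

```python
from typing import List


def facility_values(responses: List[List[str]], key: List[str]) -> List[float]:
    """Proportion of judges who gave each descriptor its keyed CEF level.

    responses[j][i] is the level judge j assigned to descriptor i;
    key[i] is the level the CEF assigns to descriptor i.
    """
    n_judges = len(responses)
    return [
        sum(1 for judge in responses if judge[i] == level) / n_judges
        for i, level in enumerate(key)
    ]


# Hypothetical example: 4 judges, 5 descriptors (not the study's data).
key = ["A2", "B1", "B1", "B2", "C1"]
responses = [
    ["A2", "B2", "B1", "B2", "C1"],
    ["A2", "B2", "B2", "B2", "B2"],
    ["B1", "B2", "B1", "C1", "C1"],
    ["A2", "B2", "B1", "B2", "C2"],
]

per_item = facility_values(responses, key)
overall = sum(per_item) / len(per_item)  # mean facility across descriptors
never_right = [i for i, fv in enumerate(per_item) if fv == 0.0]
```

In this toy data the second descriptor (index 1) is the only one that no judge placed at its CEF level, which is the kind of case counted as 16/133 on the slide.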
8 The Familiarization Phase 2
- Insights from categorizing descriptors
- No correct identification of level, spread of
responses: 16
- <50% of team correctly identified level: 55
- ≥50% of team correctly identified level: 62
- In cases of uncertainty, tendency to place level
of descriptor higher than in CEF. Lower Euro
standards? Leniency?
- Chi-squares: leniency not related to any of the
scales, but it is related to level B2.
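A chi-square test of independence is one way to check whether leniency (placing a descriptor higher than the CEF does) depends on descriptor level. The sketch below computes the statistic and degrees of freedom for a contingency table. The counts are hypothetical and chosen only to illustrate the arithmetic; the slides do not report the study's tables.

```python
import math
from typing import List, Tuple


def chi_square_independence(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts (not the study's data).
# Rows: descriptor level B1, B2, C1.
# Columns: [placed higher than CEF, not placed higher].
table = [
    [10, 30],
    [25, 15],
    [10, 30],
]

stat, df = chi_square_independence(table)

# For df == 2 only, the chi-square upper-tail probability is exp(-stat / 2).
p_value = math.exp(-stat / 2) if df == 2 else None
```

For other table sizes the statistic would be compared against a chi-square critical value for the corresponding degrees of freedom.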
9 The Specification Phase: a qualitative content
audit
- Lack of yardsticks for a test like GramVoc
- Van Ek and Trim volumes not useful.
- CEF provides description of 15 categories, but
without level specification (pp. 108-117).
- Euro specifications need attention.
- Two lines of work
- Elucidating item-writers' concepts
- Expert analysis of what (item focuses) actually
goes into the test on the basis of the scope, the
gradation and stability between 2 consecutive
test administrations.
10 Specification Phase 2: Elucidating item-writers'
concepts
- Item-writers' conceptualisations of levels
coherent? In line with CEF? In line with Euro
specifications?
- Item writers select best task for each task
type and level. Answer: "What is it that makes
this task the best for you?"
- Series of workshops to bring item-writers'
conceptualisations to light.
11 Specification Phase 3: Expert analysis of item
focuses
- Evidence of construct under-representation?
Anything else measured, other than the construct?
Items to generate construct-irrelevant variance?
- 2 experts identify item focuses, then jointly
finalize classification of items according to 15
CEF categories.
- Predict problematic items.
12 Results: Specification Phase
- Item-writers' concepts broadly match CEF. Better
overall results than in familiarization phase.
- Statistical analysis of expert classifications
- Distribution of focuses related to task type and
author (text), but not related to level and
administration.
- Results similar when two administrations at the
same level were compared.
- Lack of significant focus differences by level
prompted investigation of item complexity.
Statistical test inconclusive at the 0.05 level.
13 The Standardization of Judgements: Line 1
- Investigating the gap between local Euro
standards and the CEF standards
- On the basis of collations, item-writers
identified descriptors whose content exceeded
local standards
- Tabulation and qualitative analysis of responses.
History of descriptors taken into account.
- The gap does not widen up the CEF scale. Most
conspicuous at B2, but less considerable if
descriptor history is accounted for.
- Why do the uncalibrated descriptors represent a
higher level of requirements than those that went
through calibration?
14 Standardization of Judgements, Line 2: Video
rating conference
- CEF Performance Samples: link to North's rating
conference (1996/2000)
- A second-best option and problems
- How similar were the ratings of the Euro
item-writers to each other's?
- Encouraging results
- Reliability of scale use: Cronbach's alpha 0.96
- Kendall's W 0.85
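The two agreement statistics named on this slide can be computed from a raters-by-samples table of ratings. The sketch below treats raters as the "items" for Cronbach's alpha and uses the basic form of Kendall's W without a correction for tied ranks. Both choices are assumptions, since the slides do not say which variants were used. The function names and ratings are hypothetical.

```python
from statistics import variance
from typing import List


def cronbach_alpha(ratings: List[List[float]]) -> float:
    """ratings[r][s] is the rating rater r gave sample s; raters act as 'items'."""
    k = len(ratings)
    totals = [sum(col) for col in zip(*ratings)]  # summed rating per sample
    rater_variances = sum(variance(r) for r in ratings)
    return k / (k - 1) * (1 - rater_variances / variance(totals))


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def kendalls_w(ratings: List[List[float]]) -> float:
    """Kendall's coefficient of concordance, without tie correction."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(col) for col in zip(*ranked)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical ratings: 3 raters, 4 performance samples (not the study's data).
ratings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]

alpha = cronbach_alpha(ratings)
w = kendalls_w(ratings)
```

With real CEF-level ratings many samples would share a rank, so a tie-corrected W would be the safer choice; the uncorrected form here understates agreement when ties are frequent.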
15 Standardization of Judgements, Line 3: Standard
Setting
- With about 20 scripts per level for both the
2003 and 2004 tests
- An examinee-based method. Scripts carefully
chosen, arranged in decreasing order of ability
- Overfitting candidates
- Info about items
- Rating done twice bearing in mind
- Round 1: conventional Euro standards, Kendall's W
ranged 0.8 - 0.83
- Round 2: CEF standards, Kendall's W ranged 0.75 -
0.79
- Results provided additional info about Line 1
16 Empirical Validation Phase
- Empirical validation started very early
- Internal validation: item analyses
- Independent analyses
- Joint analyses of same level papers
- External validation
- Using standard setting data from the
Standardization phase as ratings
- Calibrate overall test difficulties
- Anchor item means of independent analyses to
calibrated overall test difficulties
- Use a corrected version of North's scale
- Compare cutoffs obtained in this way with
conventional Euro cutoffs.
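One simple reading of the anchoring step is a mean shift: the item difficulties from an independent analysis are moved by a constant so that their mean equals the calibrated overall test difficulty. The sketch below illustrates that reading only. The slides do not describe the actual procedure or software, and the function name and values are hypothetical.

```python
from typing import List


def anchor_to_test_difficulty(
    item_difficulties: List[float], calibrated_test_difficulty: float
) -> List[float]:
    """Shift item difficulties so their mean equals the calibrated test difficulty.

    Assumes both inputs are on an interval scale in the same units (e.g. logits),
    so that a constant shift is a meaningful link.
    """
    mean_difficulty = sum(item_difficulties) / len(item_difficulties)
    shift = calibrated_test_difficulty - mean_difficulty
    return [d + shift for d in item_difficulties]


# Hypothetical values (not the study's data): an independent analysis centred
# on 0, anchored to a calibrated overall test difficulty of 0.5.
independent = [-1.0, 0.0, 1.0]
anchored = anchor_to_test_difficulty(independent, 0.5)
```

After the shift the relative spacing of the items is unchanged; only the origin of the scale moves, which is what allows cutoffs from different papers to be compared on one scale.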