Title: A Preliminary Study of the 2000 Basic English Word List in Taiwan
1A Preliminary Study of the 2000 Basic English
Word List in Taiwan
2Outline
- I. Introduction
- II. Research Questions
- III. Instruments
- IV. Discussion
- V. Conclusion
3I. Introduction
- 1. Importance of Word List
- 2. Basic Criteria for Making a Word List
- 3. Well-known Word Lists
41. Importance of Word List
- Four kinds of word list (Nation, 2001)
- high-frequency words
- academic words
- technical words
- low-frequency words
- High-frequency words
- Common core or start-up vocabulary for beginners.
- This small number of high-frequency words makes up most of the words learners meet.
- With these words, beginners can feel empowered that they can do things.
- Master the high-frequency words: Many studies have suggested that second language learners need first to concentrate on the high-frequency words, and that the return for learning them is very great (Nation, 1993; Meara, 1995; Nation & Waring, 1997; Waring, 2000; Nation, 2001).
52. Basic Criteria for Making a Word List
- When making a list of high-frequency words, both frequency and range must be considered.
- Frequency: As not all words are equally useful, one measure of usefulness is word frequency, that is, how often the word occurs in normal use of the language (Nation, 1997).
- Range: Range is measured by seeing how many different texts or subcorpora each particular word occurs in. A word with wide range occurs in many different texts or subcorpora (Nation, 2001, p. 16).
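The two criteria can be illustrated with a minimal Python sketch. This is not the RANGE program itself: the tokenizer, the function names, and the three tiny sample texts are assumptions made only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_and_range(texts):
    """Return two counters over a list of texts.

    frequency[w]  = total occurrences of w across all texts
    text_range[w] = number of different texts in which w occurs
    """
    frequency = Counter()
    text_range = Counter()
    for text in texts:
        tokens = tokenize(text)
        frequency.update(tokens)
        text_range.update(set(tokens))  # count each word once per text
    return frequency, text_range

# Three made-up texts standing in for subcorpora.
sample_texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a typhoon hit the island",
]
freq, rng = frequency_and_range(sample_texts)
print(freq["the"], rng["the"])          # frequent and wide-ranging
print(freq["typhoon"], rng["typhoon"])  # occurs in one text only
```

A word such as "the" scores high on both measures, while a word that is frequent in only one text would score high on frequency but low on range, which is why both are considered.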
63. Well-known Word Lists
- Table 1. Well-known English Word Lists
7Word Lists in Taiwan
- Table 2. English Word Lists in Taiwan
8II. Research Questions
- 1. Word form and size between the GSL and TBEWL
- What is the size of word families and types?
- What is the distribution of words across the 26 letters of the alphabet?
- What is the distribution of content and function words?
- 2. Overlap: Does the GSL contain most of the TBEWL?
- 3. Coverage: Does the TBEWL provide better coverage than the GSL?
- 4. Range: Do most of the words in the lists occur in a range of texts?
9III. Instruments
- 1. RANGE and FREQUENCY programs
- 2. Taiwan Basic English Word List (TBEWL)
- 3. The General Service List (GSL)
101. RANGE and FREQUENCY programs
- RANGE: Two programs http://www.vuw.ac.nz/lals/staff/paul-nation/RANGE32.zip
- Range32 (Vocabulary profile)
- Here is a sample table from RANGE.
- What common vocabulary is found in all these texts?
- How large a vocabulary is needed to read this text?
- If a learner has a vocabulary of 2,000 words, how much of the vocabulary in the text will be familiar to the learner?
- What are the words in the text which the learner is not likely to know?
- How well does the course book prepare learners for the vocabulary in newspapers?
- How rich a vocabulary do second language learners use in their free writing?
- What is needed to run RANGE?
- base word lists (BASEWRD1.txt, BASEWRD2.txt, BASEWRD3.txt, etc.),
- text files in ASCII (DOS) format.
RANGE: Base Word Lists vs. Texts
111. RANGE and FREQUENCY programs
- Frequency32
- Run on an ASCII text
- Make a frequency list of all the words in a single text.
- Only run one text at a time.
- Output
- Order: An alphabetical list or a frequency-ordered list.
- Rank order of the words, their raw frequency and the cumulative percentage frequency.
- Here is some sample output from FREQUENCY.
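The kind of output described above (rank, raw frequency, cumulative percentage) can be approximated with a short Python sketch. It is an illustration only, not the FREQUENCY program; the function name and the sample sentence are assumptions.

```python
from collections import Counter

def frequency_list(tokens):
    """Return rows of (rank, word, raw frequency, cumulative percentage)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    running = 0
    for rank, (word, count) in enumerate(ordered, start=1):
        running += count
        rows.append((rank, word, count, round(100 * running / total, 2)))
    return rows

# A made-up eight-token text.
tokens = "the cat and the dog and the bird".split()
rows = frequency_list(tokens)
for row in rows:
    print(row)
```

The share of the text covered by one word on its own is the difference between its cumulative percentage and that of the row above it.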
122. Taiwan Basic English Word List (TBEWL)
- Date of announcement
- 21 January, 2003 by Ministry of Education
- The TBEWL was based on many word lists and a corpus
- (1) Word lists from curricular standards used by elementary and junior high schools from Taiwan, Korea, Japan, and Shanghai (??)
- (2) Word lists from Taiwan college entrance center
- (3) The most frequent word lists from the U.S., U.K., South Africa and Japan
- (4) Collins COBUILD corpus
132. Taiwan Basic English Word List (TBEWL)
- The TBEWL is adjusted by
- (1) Life experience of the elementary and junior high school students in Taiwan
- (2) Standards for English language learning
- (3) Environment for foreign language learning
- The TBEWL includes
- (1) The most basic English word list: 1000 words (TBEWL 1000)
- (2) The most useful word list: 2000 words (TBEWL 2000), including the first 1000 words.
14Making the TBEWL 1st and 2nd 1000 Words
- Make an individual headword list
- Delete (derivational, inflectional or synonymous) words in the parentheses
- Separate compounds (ice cream) and phrases (get up) into individual words
- Make word families of the TBEWL
- A common basis for comparison: the word family
- Dictionaries
- Cambridge Advanced Learner's Dictionary
- Collins COBUILD Advanced Learner's English Dictionary
- WordNet 2.1
- Add the letters of the alphabet to the TBEWL 1st 1000 word families
- Test the TBEWL 1st and 2nd 1000 words as the base word files (basewrd1.txt and basewrd2.txt) in RANGE (no overlap between the 1st and 2nd 1000 word types).
15Making the TBEWL 1st and 2nd 1000 Word Families
- Table 3 TBEWL 1st and 2nd 1000
163. The General Service List (GSL)
- Why does this study choose the GSL?
- It has a similar purpose and a similar number of words.
- The GSL still remains the best of the available lists (Nation, 1993, 1997; Nation & Hwang, 1995; Nation & Waring, 1997; Coxhead, 2000) because of:
- (1) its information about frequency of meanings
- (2) West's careful application of criteria other than frequency and range
- (3) its coverage of 78% to 92% (82% mean coverage) of various kinds of written text
- (4) its use as the basis for many series of graded readers
- Criteria to select these words (West, 1953)
- Frequency
- Ease or difficulty of learning (cost)
- Necessity
- Cover
- Stylistic level
- Intensive and emotional words
17Making the GSL 1st and 2nd 1000 Word Families
- Table 4. GSL 1st and 2nd 1000
- The letters of the alphabet, numbers, days of the week and months of the year are added in making the GSL 1st 1000 word families.
- Adapted from the base word files in RANGE.
18IV. Discussion
- 1. Word form and size
- Word family and types
- Word distribution in 26 letters of alphabet
- Content and function words
- 2. Overlap: Does the GSL contain most of the TBEWL?
- 3. Coverage: Does the TBEWL provide better coverage than the GSL?
- 4. Range: Do most of the words in the lists occur in a range of texts?
191. Word form and size
- Table 5. Word family and types in GSL and TBEWL
- The TBEWL has slightly more words in its original word list because it includes some compounds (air conditioner, hair dresser, ice cream, ...) and phrases (a little, a few, ...).
- After extending to word families, the GSL has more word families with more types, because the TBEWL originally includes numbers, months and weeks, and has more function words and more food and animal & insect words, which are not included in the GSL.
201. Comparison of Word Forms
- Table 6. Word distribution across the 26 letters of the alphabet
- The distribution across the letters of the alphabet is similar between the GSL and the TBEWL.
211. Comparison of Word Forms
- Table 7. Distribution of Function and Content Words
- The TBEWL has more function words as it already includes numbers and pronouns.
222. Overlap Does the GSL contain most of the
TBEWL?
- Table 8. TBEWL 2001 words in GSL 1952 Words
232. Overlap Does the GSL contain most of the
TBEWL?
- Table 9. TBEWL 2001 Words in GSL 7827 Word Types
- Overlap: 1506 (75.3%) words
- TBEWL not in GSL: 495 (24.7%) words
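An overlap comparison of this kind can be sketched with Python sets. The function name and the two four-word sample lists are made up for illustration and are not the actual list contents; the last two lines only re-derive the percentages from the counts reported on this slide.

```python
def overlap_report(target_words, reference_words):
    """Split target_words into those found in reference_words and those not found."""
    target = {w.lower() for w in target_words}
    reference = {w.lower() for w in reference_words}
    shared = target & reference
    missing = target - reference
    return {
        "shared": sorted(shared),
        "missing": sorted(missing),
        "shared_pct": round(100 * len(shared) / len(target), 1),
        "missing_pct": round(100 * len(missing) / len(target), 1),
    }

# Made-up sample lists (illustration only).
target_sample = ["apple", "book", "chopsticks", "typhoon"]
reference_sample = ["apple", "book", "cat", "dog"]
report = overlap_report(target_sample, reference_sample)
print(report)

# Re-deriving the reported percentages from the reported counts (2001 words in total).
reported_overlap_pct = round(100 * 1506 / 2001, 1)
reported_missing_pct = round(100 * 495 / 2001, 1)
print(reported_overlap_pct, reported_missing_pct)
```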
24TBEWL not in GSL: 495 (24.7%) words
- High-frequency words: According to the frequency per million in the BNC lemma list, some of these words are highly frequent, such as area (585), affect (133), assume (112), available (272), contact (140), contract (175), couple (152), create (217), and design (266), and may be worth including in the list.
- Low-frequency words: Many of these words are of rather low frequency, and some are not even included in the BNC frequency list. Most of them relate to daily life (for example, the food category) or are culture-specific. They are the special features of the TBEWL, and may lower its coverage when it is compared with the GSL in the next section.
- Low-frequency words (below 10 occurrences per million in the BNC) include alphabet, armchair, badminton, bakery, banana, barbecue, bark, baseball, basement, basketball, blackboard, blouse, bookcase, bookstore, brunch, buffet, bug, bun, burger, cabbage, cafeteria, campus, candy, carrot, cartoon, centimeter, cereal, chess, chopsticks, chubby, clap, classmate, closet, cockroach, coke, comic, conditioner, congratulation, considerate, cookie, couch, cowboy, crab, crayon, cute, dentist, dessert, dial, diligent, dinosaur, dizzy, dodge, doughnut, downtown, dresser, drugstore, dumb, dumpling, earrings, seafood, semester, shrimp, shorts, skate, ski, sneakers, steak, sweater, swimsuit, thanksgiving, pork, wok
25TBEWL not in GSL
- (1) Culture-specific words
- chopsticks, dumpling, typhoon, Halloween, Thanksgiving, cowboy
- (2) Food & drink
- bakery, banana, barbecue, beef, beer, brunch, buffet, bun, burger, cabbage, cafeteria, candy, carrot, cereal, chocolate, coke, cookie, crab, delicious, dessert, doughnut, dumpling, mango, noodle, peach, pear, pork, pumpkin, seafood, shrimp, steak
- (3) Sports & games
- badminton, baseball, basketball, bike, chess, skate, ski, tennis
- (4) School
- biology, blackboard, bookcase, bookstore, campus, chart, chemistry, classmate, classroom, crayon, debate, eraser, quiz, semester, textbook, workbook
26TBEWL not in GSL
- (5) Animals & insects
- ant, bark, bee, bug, butterfly, cockroach, dinosaur, dolphin, dragon, eagle, mosquito, panda, shark, spider, tiger, wolf
- (6) House & apartments
- balcony, bathroom, bench, blanket, carpet, ceiling, closet, conditioner, couch, decorate
- (7) Clothing & accessories
- blouse, earrings, pajamas, pants, scarf, shorts, sneakers, sweater, swimsuit, t-shirt, trousers, vest, wallet
- (8) Countries and proper names
- China, Chinese, America, Taiwan, ROC, USA, MRT
- (9) Computer & tech
- computer, e-mail, Internet
273. Coverage Does the TBEWL provide better
coverage than the GSL?
- Coverage refers to the percentage of tokens in a text which are accounted for (covered by) particular word lists. The corpora used in the comparison are:
- (1) VOA corpus: a 1,300,000-token corpus of VOA written scripts.
- (2) Literature corpus: a 4,290,000-token fiction/story/fairy tale corpus of texts from Project Gutenberg and The Baldwin Online Children's Literature Project.
- (3) Academic corpus: a 632,000-token corpus of English academic texts from thesis abstracts and online journals such as The Journal of Community Informatics, TESL-EJ, The Internet TESL Journal, Language Learning and Technology, ReCALL, Reading in a Foreign Language, and Working Papers in TESOL and Applied Linguistics.
- (4) Examination and textbook corpus: a 30,900-token corpus from (A) English examination texts from sample GEPT elementary level tests and the Basic Competence Test for Junior High School Students from 2001-2005, and (B) worksheets (reading, writing, activities, examinations, grammar, cultural supplements) from the Longman junior high school English textbook (Lessons 1-6 in Books 1-3).
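The coverage definition above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the tokenizer and function name are hypothetical, the sample sentence and word list are made up, and tokens are matched against exact word forms, whereas the base word files used in this study group forms into word families.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def coverage(text, word_list):
    """Percentage of tokens in `text` that appear in `word_list`."""
    known = {w.lower() for w in word_list}
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in known)
    return 100 * covered / len(tokens)

# Made-up sentence and word list (illustration only).
text = "The typhoon hit the island and the school closed"
word_list = ["the", "and", "school", "hit"]
pct = coverage(text, word_list)
print(f"{pct:.1f}% of the tokens are covered")
```

Here 6 of the 9 tokens are in the list, so coverage is about 66.7%; the uncovered tokens (typhoon, island, closed) are the words a learner who knows only the list would not recognize.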
283. Coverage Does the TBEWL provide better
coverage than the GSL?
- Table 10. Percentage coverage of a range of corpora by the lists from the GSL and TBEWL
- The GSL provides slightly better coverage in most of the corpora, except the Exam-Textbook corpus.
- The TBEWL has better coverage in the Examination and Textbook corpus, reflecting its local color.
294. Range Do Most of the words in the lists occur
in a range of texts?
- Using the same corpora, the range analysis checks whether all the words in the lists are actually doing useful work. That is, does every word family (headword) in the lists occur in the various corpora? There could be words in the lists which seem useful but do not occur.
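The check described above can be sketched as follows. This is an illustration only: the function names, the four headwords, and the two one-sentence "corpora" are assumptions, and only headword forms are matched rather than whole word families.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def percent_occurring(headwords, corpora):
    """For each corpus, the percentage of headwords that occur at least once."""
    result = {}
    for name, text in corpora.items():
        seen = set(tokenize(text))
        found = [h for h in headwords if h.lower() in seen]
        result[name] = round(100 * len(found) / len(headwords), 1)
    return result

def never_occurring(headwords, corpora):
    """Headwords that do not occur in any of the corpora."""
    seen = set()
    for text in corpora.values():
        seen.update(tokenize(text))
    return sorted(h for h in headwords if h.lower() not in seen)

# Made-up headwords and corpora (illustration only).
headwords = ["school", "typhoon", "dumpling", "book"]
corpora = {
    "news": "A typhoon closed every school",
    "fiction": "She opened the book at school",
}
print(percent_occurring(headwords, corpora))
print(never_occurring(headwords, corpora))
```

Words returned by the second function are the ones that "seem useful but do not occur" in the corpora at hand.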
304. Range Do Most of the words in the lists occur
in a range of texts?
- Table 11. Percentage of word families in the lists occurring in various corpora
- The GSL consists of 1986 (998 + 988) word families. The TBEWL consists of 1963 (985 + 978) word families.
- The TBEWL lists are fractionally better than the GSL in the VOA and Exam-Textbook corpora, while the GSL is better in the Literature and Academic corpora.
- Overall, the TBEWL lists show a slightly better distribution across the different corpora.
31V. Conclusion
- 1. Comparison of Word Forms
- Although the TBEWL has slightly more words in its list, it consists of slightly fewer word families and types once its words are grouped into word families.
- The word distribution across the 26 letters of the alphabet is similar in both lists. The three most frequent initial letters are S, C, and P, while the two least frequent are X and Z.
- The TBEWL has slightly more function words than the GSL.
- 2. There are 1552 (76%) TBEWL words also found in the GSL, while there are 499 (24%) TBEWL words not in the GSL.
- 3. The GSL provides slightly better coverage in most of the corpora, except the Exam-Textbook corpus.
- 4. Range: Overall, the TBEWL lists are fractionally better than the GSL.
32V. Conclusion
- 1. Answers to the Research Questions
- 2. Findings of the Study
- 3. Limitation of the Study
- 4. Implications for Vocabulary Teaching and
Learning
331. Answers to the Research Questions
- (1) Comparison of Word Forms and Size
- A. Although the TBEWL has slightly more words in its list, it consists of slightly fewer word families and types once its words are grouped into word families.
- B. The word distribution across the 26 letters of the alphabet is similar in both lists. The three most frequent initial letters are S, C, and P, while the two least frequent are X and Z.
- C. The TBEWL has slightly more function words than the GSL.
341. Answers to the Research Questions
- (2) There are 1506 (75.3%) TBEWL words also found in the GSL, while there are 495 (24.7%) TBEWL words not in the GSL.
- (3) The GSL provides slightly better coverage in most of the corpora, except the Exam-Textbook corpus.
- (4) Overall, the TBEWL lists are fractionally better in range across the four corpora than the GSL.
352. Findings of the Study
- The GSL shows better coverage and range in most of the corpora, which indicates that it is still a very good word list in spite of its age.
- Compared with the GSL, the TBEWL has fewer word families (-23) and types (-564), but it demonstrates similar coverage and range across the different corpora. The TBEWL seems to be a workable word list.
- Although 495 TBEWL words are not found in the GSL, these words are related to daily life and culture, such as food & drinks, sports, school, animals & insects, house & apartments, and clothing & accessories. In spite of their low frequencies, they may be useful for beginners and for students at or below junior high school level in Taiwan.
- The coverage and range of the TBEWL are much better in the Exam-Textbook corpus, which indicates that it is a good word list with Taiwan color.
363. Limitation of the Study
- Limitation of the Range software
- A. Compounds and contractions cannot be identified by the Range software.
- B. Homographs were counted under the same word family (e.g., May, a month of the year, and may, an auxiliary verb, were counted under the same headword).
- Limitation of the corpora
- A. Only 4 kinds of corpus are collected in this study.
- B. For each kind of corpus, this study collects only a limited range of texts.
- Future research
- Future studies should try to include larger corpora with a wider range, especially for the textbook corpus, so that the comparison of coverage and range will be more stable and convincing.
374. Implications for Vocabulary Teaching and
Learning
- Teachers can see the differences between the two word lists, know which words are special to the TBEWL, and plan how to teach these words.
- The RANGE software, with the TBEWL 1st and 2nd 1000 as its base word files, can be used to calculate the coverage of any text. This will be a convenient tool for teachers to check whether a text is suitable for their students.
- Besides the headwords, teachers should also introduce the word family of each headword, including prefixes, suffixes, derivations and inflections of the word.
- Teachers and students should not focus only on the word list. Although decontextualized learning of vocabulary is effective, learning from context is still necessary to broaden the breadth of word knowledge. In addition, students also need to learn how to use the words productively to deepen their word knowledge.
38Thanks
Comments and Discussion
su_at_ntjcpa.edu.tw
http://www.opensource.idv.tw/
39Word family
- A word family consists of a headword, its inflected forms, and its closely related derived forms. That is, it includes both closely related inflected and derived forms even if the part of speech is not the same. Here are some examples:
- ADD
- ADDED
- ADDING
- ADDITION
- ADDITIONAL
- ADDITIVE
- ADDITIONS
- ADDS
- Major problem: what should be included in a word family and what should not.
- Bauer, Laurie and Paul Nation (1993). Word Families. International Journal of Lexicography, 6(4), 253-279.
- A word family consists of a base word and all its derived and inflected forms. These are important as it is believed that they can be understood by learners without them having to learn each form separately. For example, in the Vocabprofile programme the word family grouped under the headword ABLE includes abler, ablest, ably and unable. For more information on the classification criteria used, see Bauer & Nation (1993).
ADMIT ADMISSION ADMITTEDLY ADMITS ADMITTED ADMITTING
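Counting by word family rather than by word form can be sketched in Python as below. The two families are taken from the examples on this slide; the table layout, function names, and sample sentence are assumptions for illustration and do not reflect the file format RANGE actually uses.

```python
from collections import Counter

# Mini word-family table built from the ADD and ADMIT examples on this slide.
families = {
    "add": ["added", "adding", "addition", "additional", "additive", "additions", "adds"],
    "admit": ["admission", "admittedly", "admits", "admitted", "admitting"],
}

def build_lookup(families):
    """Map every family member (and the headword itself) to its headword."""
    lookup = {}
    for headword, members in families.items():
        lookup[headword] = headword
        for member in members:
            lookup[member] = headword
    return lookup

def count_by_family(tokens, lookup):
    """Count tokens under their headword; tokens outside the table are skipped."""
    counts = Counter()
    for token in tokens:
        headword = lookup.get(token.lower())
        if headword is not None:
            counts[headword] += 1
    return counts

lookup = build_lookup(families)
tokens = "She admitted that adding an additional step adds cost".split()
family_counts = count_by_family(tokens, lookup)
print(family_counts)
```

In the sample sentence, three different word forms (adding, additional, adds) are all credited to the single family ADD, which is what makes word families a common basis for comparing lists.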
40Sample Output from RANGE
- This shows that 54 of the running words in the text are in base list one and these 54 words make up 72% of the total running words in the text. In the word list column, one, two, three refer to each of the base lists.
41Sample output from FREQUENCY
- In the example, the word type "a" is the third most frequent word. It occurs 108 times in the text, and along with "the" and "of" covers 14.29% of the text. On its own it covers 3.01% (14.29 minus 11.28) of the text. See the beginning of this set of instructions to see how to run FREQUENCY.
42Web Vocabulary Profilers http://132.208.224.131/vp/
43Word Frequency Text Profiler http://www.edict.com.hk/textanalyser/
44GSL & AWL vs. TBEWL
- Table 10. Distribution of TBEWL 2000 in GSL
Headwords and Family Words
45
- Types are different from tokens in that the exact same word form represents only one type. Thus Hamlet's famous "To be or not to be" counts as six tokens but only four types due to the repetitions.
- A token is a string of letters making up an individual word. Thus Hamlet's famous "To be or not to be" counts as six tokens irrespective of the repetitions.
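The token/type distinction can be checked with a few lines of Python. The function name and the simple lowercase tokenizer are assumptions for illustration.

```python
import re

def tokens_and_types(text):
    """Return (number of tokens, number of types) for a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

n_tokens, n_types = tokens_and_types("To be or not to be")
print(n_tokens, n_types)  # six tokens, four types
```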
46
- Nation and Hwang (1995)
- Replacing 452 of the words in the GSL with 250 words of higher frequency across a range of genres results in only about a 1% gain in coverage (from 82.3% to 83.4%).
- Nation (1993)
- This list is rather old, based on work done in the 1930s and 1940s. However, it still remains the most useful one available, as the relative frequency of the various meanings of each word is given.
- Older series of graded readers are based on this list.