Word Association Thesaurus as a Resource for Building Wordnet

Transcript and Presenter's Notes


1
Word Association Thesaurus as a Resource for
Building Wordnet
  • Anna Sinopalnikova
  • Masaryk University, Brno, Czech Republic
  • Saint-Petersburg State University, Russia
  • anna_at_fi.muni.cz

2
Overview
  • Types of LRs used
  • What is Word Association?
  • Information to be extracted from WAT
  • WAT vs. Corpus
  • Conclusions
  • Future plans

3
What kind of language resources are used to build
wordnets?
  • Primary resources
  • e.g. text corpora
  • present (more or less) raw data on the language
    in use
  • information is given implicitly
  • Derived resources
  • e.g. explanatory dictionaries, Roget type
    thesauri
  • present explications of internal knowledge of
    language
  • based on primary resources and intuition
  • information is given explicitly

4
What is better?
  • To build an adequate and reliable lexical
    database (e.g. wordnet) it is not enough to rely
    upon information produced by experts (i.e.
    linguists, lexicographers).
  • One should rather explore the raw data, and
    extract information from language in its actual
    and its potential use.
  • Corpora reign!

5
Word Association
  • Association: a connection or relation between 2
    entities (perceptions, ideas or words) that
    manifests itself in the following way: the
    appearance of one entity entails the appearance
    of the other in the mind
  • Word Association: the appearance of one word
    entails the appearance of the other in the mind

6
Association examples (1)
  • Kill ?

7
Association examples (1)
  • Kill ? -> Bill

8
Association examples (2)
9
Association examples (2)
  • -> Nike

10
Association examples (3)

11
Association examples (3)

  • -> Kill Bill

12
Word Association Test
  • Generally, a list of words (stimuli) is given to
    subjects (either in writing or in oral form). The
    subjects are asked to respond with the first word
    that comes into their mind (responses).
  • Other methods: controlled association test,
    priming, etc.

13
Cat stimulates
  • Dog 49, mouse 8, black 4, animal 2, eyes, gut,
    kitten, tom 2, bit, Cheshire, claw, claws,
    enigma, feline, furry, hearth, house, kin,
    kittens, milk, pet, pussy, todd 1
  • (of 100 people asked)
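
A minimal sketch of how such response counts can be turned into relative association strengths. Only the three unambiguous counts from the slide are used, and the function name association_strength is hypothetical, not part of any published WAT tool.

```python
def association_strength(counts, n_subjects):
    """Return each response's share of all subjects asked."""
    return {response: count / n_subjects for response, count in counts.items()}


# Counts for the stimulus "cat" taken from the slide (100 subjects asked).
cat_counts = {"dog": 49, "mouse": 8, "black": 4}
cat_strength = association_strength(cat_counts, n_subjects=100)

# Responses ordered from strongest to weakest association.
ranked = sorted(cat_strength, key=cat_strength.get, reverse=True)
print(ranked)
```

Dividing by the number of subjects rather than by the number of listed responses keeps strengths comparable across stimuli that were shown to different numbers of people.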

14
Word Association Norms (WAN)
  • WAN represents the data collected through a
    series of WA tests carried out according to the
    standard technique.
  • The body of WAN: a list of responses and their
    absolute frequencies for each stimulus word
  • E.g. Kent & Rosanoff (1910): 100 stimuli,
    1000 subjects
  • Palermo & Jenkins (1964): 200 stimuli, 1000
    subjects

15
Word Association Thesaurus (WAT)
  • WAT is a kind of WAN
  • WAN and WAT differ not only in volume but also in
    the procedure of data collection. WAT collection
    proceeds in cycles: a small set of stimuli is
    used as the starting point of the experiment, the
    responses obtained for them are used as stimuli
    in the next stage, and the cycle is repeated at
    least 3 times.
  • Being a thesaurus WAT is expected to cover all
    the vocabulary (all the words relevant for the
    language) and reflect the basic structure of a
    particular language (all the relations between
    words relevant for this particular language
    system).
  • E.g. Kiss et al. (1972): about 54,000 words,
    Nelson et al. (1973-1990): about 75,000 words,
    Karaulov et al. (1994-1998): 23,000 words
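
The cyclic collection procedure described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure of any particular WAT project: the function collect_wat, the callback ask (standing in for a real experiment with subjects) and the toy responses are all hypothetical.

```python
def collect_wat(seed_stimuli, ask, cycles=3):
    """Run `cycles` rounds; responses of one round become the next stimuli."""
    norms = {}                       # stimulus -> {response: count}
    frontier = list(seed_stimuli)
    for _ in range(cycles):
        next_frontier = []
        for stimulus in frontier:
            if stimulus in norms:
                continue             # already tested in an earlier round
            counts = {}
            for response in ask(stimulus):
                counts[response] = counts.get(response, 0) + 1
            norms[stimulus] = counts
            next_frontier.extend(counts)
        frontier = next_frontier
    return norms


# Toy stand-in for the subjects' answers (invented for illustration).
toy_answers = {
    "cat": ["dog", "dog", "mouse"],
    "dog": ["cat", "bone"],
    "mouse": ["cat", "cheese"],
}
norms = collect_wat(["cat"], lambda s: toy_answers.get(s, []), cycles=3)
print(sorted(norms))
```

Starting from the single seed "cat", three cycles already pull in "dog", "mouse", "bone" and "cheese" as stimuli, which is how a small starting set can grow toward vocabulary-wide coverage.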

16
What kind of linguistic information could be
extracted from WAT?
  1. The core concepts of the language
  2. Syntagmatic and paradigmatic relations between
    words presented explicitly (as opposed to text
    corpora)
  3. Relevance of word senses for native speakers
  4. Relevance of relations for native speakers
  5. Domain information that is shown (as opposed to
    dictionaries)
  6. Semantic classification of words obtained by
    using formal criteria

17
The core concepts of the language
  • In every language there is a finite number of
    words that appear as responses more frequently
    than other words. This set is quite stable:
  • it does not change much as time goes by
  • it does not depend on the starting circumstances,
    e.g. on the words that were chosen as stimulus words
  • Russian: man, house, love, life,
    be/eat, think, live, go, big/large,
    good, bad, no/not...
  • 295 words with more than 100 relations
  • English: man, sex, no (not), love, house, work,
    eat, think, go, live, good, old, small
  • 586 words with more than 100 relations
  • Cf. EuroWordNet Basic Concepts
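
One way to read "words with more than 100 relations" is as words that occur as a response to more than 100 different stimuli. The sketch below takes that reading as an assumption; the function core_concepts, the threshold used in the demo and the toy data are hypothetical.

```python
from collections import Counter


def core_concepts(norms, min_relations):
    """Words given as a response to more than `min_relations` distinct stimuli."""
    relations = Counter()
    for stimulus, responses in norms.items():
        for response in set(responses):   # count each stimulus only once
            relations[response] += 1
    return sorted(word for word, n in relations.items() if n > min_relations)


# Toy norms (invented): stimulus -> list of responses.
toy_norms = {
    "house": ["man", "big", "live"],
    "work": ["man", "good"],
    "love": ["man", "life", "good"],
    "old": ["man", "house"],
}
core = core_concepts(toy_norms, min_relations=1)
print(core)
```

With real WAT data the threshold would be 100 rather than 1, and the resulting list could be compared against the EuroWordNet Basic Concepts.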

18
Syntagmatic relations
  • E.g. Cat -> black, Cheshire, pussy
  • Cat -> mat, nip, purr
  • Law of contiguity: through life we learn what
    goes together and reproduce it together
  • Right and left contexts of a word
  • Help to acquire
  • Selectional preferences, valency frames
  • Semantic relations between words (e.g.
    ROLE/INVOLVED)
  • Distinguishing different senses of a word
  • Establishing relations of synonymy, hyponymy, and
    antonymy
  • Cf. text corpora

19
Paradigmatic relations
  • E.g. Cat -> dog, mouse, animal, pet
  • Cat -> eyes, claw
  • Synonyms, hyponyms/hyperonyms/co-hyponyms,
    meronyms/holonyms, or antonyms
  • Law of contiguity???
  • Help us to acquire
  • This information may be included directly in
    terms of semantic relations between wordnet
    entries
  • Also it helps us to enrich and to check out the
    set of relations encoded earlier

20
Classifying verbs according to the number of
their syntagmatic associations

21
Domain information
  • E.g., hospital -> nurse, doctor, pain, ill,
    injury, load
  • This type of data is not so easily extracted from
    corpora; in explanatory dictionaries it is
    only partly presented
  • It is crucial when we approach wordnet usage in IR.

22
Relevance of word senses for native speakers
  • WAT: for each word, 80% of associations are
    related to 1-3 of its senses.
  • Cf. corpus: 90% of occurrences of a word
  • That allows us
  • to measure the relevance of a particular word
    sense for native speakers.
  • to find an appropriate place for it in the
    hierarchy of senses.
  • to define the necessary level of sense
    granularity: to include in a wordnet no more
    and no fewer senses of each word than native
    speakers actually differentiate.
  • Problem: emotionally coloured senses are thus
    overestimated. E.g. ???? ? ????
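
A rough sketch of how the relevance of senses might be measured, assuming each association has already been tagged by hand with the sense of the stimulus it relates to. The function senses_needed, the sense labels and the toy tags are hypothetical; the 80% share is the figure quoted on the slide.

```python
from collections import Counter


def senses_needed(sense_tags, share=0.8):
    """Most frequent senses that together cover `share` of the associations."""
    counts = Counter(sense_tags)
    total = sum(counts.values())
    covered = 0
    kept = []
    for sense, n in counts.most_common():
        kept.append(sense)
        covered += n
        if covered >= share * total:
            break
    return kept


# Toy sense tags for the associations of one word (invented for illustration).
tags = ["sense_1"] * 6 + ["sense_2"] * 3 + ["sense_3"] * 1
kept = senses_needed(tags, share=0.8)
print(kept)
```

The order of the returned list gives a candidate ranking of senses, and its length suggests how many senses a wordnet entry needs before it covers most of what native speakers associate with the word.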

23
Relevance of relations for native speakers
  • It is clear that in a WN words must have at least
    a hyperonym and desirably a synonym.
  • Other relations???
  • Relations are not the same for different PoS, but
    also they are not the same for different words
    within the same PoS.
  • E.g. buy CONVERSIVE sell, while cry
    INVOLVED_AGENT baby.

24
WAT vs. Corpus
  • Compare a corpus to WAT
  • Wettler & Rapp (1996), Willners (2001):
    correlation between the frequency with which
    word X and word Y co-occur in a corpus and the
    strength of the association X-Y in WAT.
  • Compare WAT to a corpus?
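
The kind of comparison mentioned here can be illustrated with a plain Pearson correlation between corpus co-occurrence counts and WAT association strengths. This is only a sketch: the cited studies may have used different measures, and all numbers below are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data for four word pairs (X, Y):
cooccurrence_counts = [120, 40, 15, 3]           # X and Y together in a corpus
association_strength = [0.49, 0.08, 0.04, 0.01]  # share of subjects answering Y to X

r = pearson(cooccurrence_counts, association_strength)
print(round(r, 2))
```

A value of r close to 1 would mean that pairs which co-occur often in text are also strongly associated in the minds of speakers.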

25
WAT vs. Corpus (2)
  • Coverage: 64% of word associations do not occur in
    the corpus

26
WAT vs. Corpus (3)
N of occurrences    N of occurrences    % of all word
in the corpus       in RWAT             associations missing
0                   2                   48
0                   3                   22
0                   4                   14
0                   5                   8
0                   6-10                5
0                   11-15               <1
0                   15-20               <1
0                   >20                 0
Table 1. Distribution of word associations that
do not occur in the corpus. NB! It is mostly
syntagmatic WA that are missing, not
paradigmatic ones.
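
A coverage figure of this kind can be computed as the share of WAT stimulus-response pairs that never co-occur in the corpus. The sketch below assumes the corpus has already been reduced to a set of co-occurring word pairs; the function missing_share and the toy pairs are hypothetical.

```python
def missing_share(wat_pairs, corpus_pairs):
    """Share of WAT pairs absent from the corpus, plus the missing pairs."""
    missing = [pair for pair in wat_pairs if pair not in corpus_pairs]
    return len(missing) / len(wat_pairs), missing


# Toy data (invented): stimulus-response pairs and corpus co-occurrences.
wat_pairs = [("cat", "dog"), ("cat", "purr"), ("cat", "enigma"), ("hospital", "nurse")]
corpus_pairs = {("cat", "dog"), ("hospital", "nurse")}

share, missing = missing_share(wat_pairs, corpus_pairs)
print(share, missing)
```

Grouping the missing pairs by their WAT frequency would give a distribution of the same shape as Table 1.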
27
Conclusions
  • The advantages of using WAT in wordnet
    construction
  • Great variety of linguistic information
    extracted.
  • WAT is equal to or excels other LRs in
    several respects.
  • Raw data (as opposed to theoretical data, cf.
    conventional dictionaries, which rely on the
    researcher's introspection and intuition and
    hence lead to over- and under-estimation of
    language phenomena).
  • WAT is comparable to a balanced text corpus,
    and could supply all the necessary empirical
    information in the absence of the latter.
  • Probabilistic nature of the data presented (data
    reflect the relative rather than absolute
    relevance of language phenomena).
  • Parallel use of WAT and other LRs is an effective
    way of
  • constant checking of wordnet construction,
  • refining the wordnet and
  • expanding the wordnet

28
Future plans
  • WAT vs. Corpus vs. Wordnet
  • Czech small large middle
  • English large large large
  • Russian large middle - small

29
  • Thank you