Title: Word Association Thesaurus as a Resource for Extending Semantic Networks
1Word Association Thesaurus as a Resource for
Extending Semantic Networks
- Anna Sinopalnikova (1, 2), Pavel Smrz (1)
- (1) Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic
- (2) Saint-Petersburg State University, Universitetskaya 11, Saint-Petersburg, Russia
- anna, smrz_at_fi.muni.cz
2Overview
- Motivation
- Word Association and other notions of psycholinguistics
- WAT vs. Corpus
- Semantic Information from WAT
  - core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information
3Types of Semantic Resources used in NLP
- Corpora: primary resources, presenting (more or less) raw data on the language in use. Information is given implicitly, so special extraction procedures and tools are needed.
- Dictionaries, thesauri, ontologies, taxonomies: derived resources, presenting explications of some internal knowledge. They are based on primary resources and researchers' intuition. Information is given explicitly.
4Motivation
- There is still a need for an empirical basis for semantic network construction.
- Semantic Web initiatives.
- WAT are available for many languages, but nobody knows what they are good for or how to use them.
5Word Association and other notions of
psycholinguistics
- Word Association
- Word Association Test
- Word Association Norms
- Word Association Thesaurus
6Example
- needle stimulates
  - -> thread 41, pin 13, sharp 6, sew 5, cotton 2, dressmaker 1, fix 1, prick 1, sewing 1, sow 1, spring 1, stitch 1, etc.
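An entry like this can be held as a mapping from responses to counts. The following is a minimal sketch, assuming a plain dictionary layout; the names `needle` and `wat` are illustrative, and only the responses listed on the slide are included (the "etc." tail is not known here).

```python
from collections import Counter

# Responses to the stimulus "needle" as listed on the slide
# (the unlisted "etc." tail is omitted).
needle = Counter({
    "thread": 41, "pin": 13, "sharp": 6, "sew": 5, "cotton": 2,
    "dressmaker": 1, "fix": 1, "prick": 1, "sewing": 1, "sow": 1,
    "spring": 1, "stitch": 1,
})

# Assumed layout: a WAT maps each stimulus to its response counts.
wat = {"needle": needle}

# Strongest responses first.
top3 = wat["needle"].most_common(3)
print(top3)
```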
7WATs explored
- RAT - Russian WAT by Karaulov et al. (1994-1998): 8000 stimuli, 23000 words covered, 1000 subjects
- EAT - Edinburgh WAT by Kiss et al. (1972): 8400 stimuli, 54000 words covered, 1000 subjects
- Czech WAN (Novak et al., 1996): 150 stimuli, 4000 words covered, 250 subjects
- Experience gained in projects
  - RussNet (a wordnet-like database for Russian linking lexical semantics with derivational morphology)
  - Czech part of the BalkaNet project (multilingual wordnet-like network for 5 Balkan languages and Czech)
8WAT vs. Corpus
- History: Church & Hanks, 1990; Wettler & Rapp, 1993; Willners, 2001
- Bokrjonok 3.0 - balanced corpus for Russian (16 mln words)
- BNC - British National Corpus (112 mln words)
- CNC - Czech National Corpus (160 mln words) and its unbalanced version (630 mln words)
- Research procedure
  - 5000 pairs (e.g. cheese - mouse, dark - alley) were extracted at random from each WAN and then searched for in the corpora.
  - The window span was fixed at -10 to +10 words.
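The search step can be sketched as follows. This is only an illustration under stated assumptions: the corpus is a flat list of tokens, matching is on exact word forms (no lemmatization), and the function names `cooccurs` and `missing_ratio` are hypothetical, not taken from the study.

```python
def cooccurs(tokens, stimulus, response, span=10):
    """True if `response` occurs within `span` tokens of any `stimulus` occurrence."""
    for i, tok in enumerate(tokens):
        if tok != stimulus:
            continue
        lo = max(0, i - span)
        # Up to `span` tokens to the left and to the right of the stimulus.
        window = tokens[lo:i] + tokens[i + 1:i + 1 + span]
        if response in window:
            return True
    return False


def missing_ratio(pairs, tokens, span=10):
    """Share of (stimulus, response) pairs never found within the window."""
    missing = [p for p in pairs if not cooccurs(tokens, p[0], p[1], span)]
    return len(missing) / len(pairs)


# Toy corpus and pairs, invented for illustration.
tokens = "the mouse ate the cheese in a dark and narrow alley".split()
pairs = [("cheese", "mouse"), ("dark", "alley"), ("needle", "thread")]
print(missing_ratio(pairs, tokens))
```

A real run would iterate over corpus sentences or documents rather than one token list, and would probably match lemmas rather than surface forms.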
9WAN vs. Corpus Russian
- Quantitative analysis (Sinopalnikova, 2004)
  - 64% of word associations do not occur in the corpus,
  - 49% when unique associations (those with absolute frequency 1) are excluded
- Qualitative analysis
  - a high proportion of syntagmatic associations are absent,
  - for verbs this figure was up to 84%.
10WAN vs. Corpus Russian (2)
11WAN vs. Corpus English
- Quantitative analysis
  - 31% of word associations do not occur in the BNC
- Qualitative analysis
  - PARADIGMATIC 57.1%
  - SYNTAGMATIC 8.4%
  - DOMAIN 21.7%
  - OTHER 12.8%
12WAN vs. Corpus English (2)
- acquiring synonymy and hyponymy
  - e.g. sex - fornicate (archaic or humorous), ire (poetic) - anger, cowardly - yellow (slang)
- acquiring information about low-frequency words
  - e.g. perambulate (N_BNC = 3), fornicate (N_BNC = 6)
  - cf. EAT: perambulate - walk 30, pram 17, baby 9, push 8, about 1, dawdle 1, move 1, promenade 1, slowly 1, stroll 1, through 1, wander 1, etc.
- acquiring domain relations: the absent portion of them was surprisingly large for such a corpus as the BNC
  - e.g. ink-pot - pen 24, non-violence - peace 29, offside - soccer 2
13WAN vs. Corpus Czech
- Quantitative analysis
  - 514 associations missing (10.28%)
- Qualitative analysis
  - the proportion of syntagmatic and paradigmatic associations among them was similar to that for English
14Extracting semantic information from WAT
- Associations
  - by form: 10% (e.g. know - no, yellow - mellow)
  - by meaning: 90% (e.g. needle - sew, yellow - sun)
    - core concepts,
    - semantic primitives,
    - syntagmatic and paradigmatic relations,
    - domain information
15Core concepts
- In a WAT one can observe words that have an above-average number of direct links to other words.
  - Russian: ???????, ???, ???, ?????, ????, ??????, ????, ????, ???????, ??????, ?????, ??? (??), ?????, ?????? etc. (295 words with more than 100 relations)
  - English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small etc. (586 words with more than 100 relations)
  - Czech: clovek, dum, strom; jíst, jít, myslet; moc, starý, velký, bílý, hezký etc.
- These words determine the fundamental concepts of a particular language system, and thus should be incorporated into an ontology as its core components (e.g., SUMO upper concepts or EWN Base Concepts).
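Selecting such highly linked words amounts to counting distinct neighbours per word. Below is a minimal sketch, assuming a WAT stored as a mapping from stimulus to response counts and assuming that links are counted in both directions (stimulus to response and back); the function name `core_concepts` is hypothetical, and apart from the needle counts the toy figures are invented.

```python
from collections import defaultdict


def core_concepts(wat, min_links):
    """Return words linked to more than `min_links` distinct other words."""
    neighbours = defaultdict(set)
    for stimulus, responses in wat.items():
        for response in responses:
            if response != stimulus:
                # Assumption: a link counts for both of its endpoints.
                neighbours[stimulus].add(response)
                neighbours[response].add(stimulus)
    return sorted(w for w, linked in neighbours.items() if len(linked) > min_links)


# Toy data: the "pin" and "sew" entries are invented for illustration.
toy_wat = {
    "needle": {"thread": 41, "pin": 13, "sew": 5},
    "pin": {"needle": 30, "sharp": 4},
    "sew": {"needle": 20, "thread": 12},
}
print(core_concepts(toy_wat, min_links=2))
```

With a full WAT the threshold would be set to 100, as on the slide, rather than the toy value used here.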
16Semantic primitives
- WAT can also provide a list of basic concepts associated with each separate word.
- This reveals the semantics of a word (situation) as a list of semantic constituents, i.e. separate pieces of information.
- Abstract words (verbs, adjectives or nouns denoting complex situations or emotional states) are difficult to decompose by means of logic and intuition.
- E.g. depression could be reduced to:
  - its constituents: sad 7, low 5, black 4, manic 4, sadness 3, bored 3, misery 2, tiredness 2, despair 1, gloom 1, grey 1, hopelessness 1, monotony 1, sick 1, mood 1, nerves 1, etc.,
  - its probable causes: rain 3, guilt 1, pain 1, unemployment 1,
  - its probable effects: suicide 1,
  - its antipodes: elation 3, fun 1, happiness 1, etc.
17Syntagmatic and paradigmatic relations
- Linguistic substitutes for reality
  - WA reflect the order of events in reality, the way objects are organized in space, and the way human beings experience them.
- Associations by contiguity: e.g. cry - baby may be treated as a manifestation of the syntagmatic relation between a verb and its subject, and take - hand as a ROLE_INSTRUMENT relation.
- Generalization! E.g. drink - water, beer, milk, ale, Coca-cola, coffee, juice, etc. found in WAT should be generalized as a drink ROLE_OBJECT beverage relation and incorporated into the semantic network in that form.
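The generalization step could be sketched as follows. This is an illustration only: the `HYPERNYM` table is a hypothetical stand-in for a lookup in an existing wordnet, and the function name `generalize` and the 0.8 coverage threshold are assumptions, not part of the original proposal.

```python
from collections import Counter

# Hypothetical hypernym lookup; in practice this could come from a wordnet.
HYPERNYM = {
    "water": "beverage", "beer": "beverage", "milk": "beverage",
    "ale": "beverage", "Coca-cola": "beverage", "coffee": "beverage",
    "juice": "beverage",
}


def generalize(stimulus, responses, relation, min_share=0.8):
    """Propose (stimulus, relation, hypernym) if one hypernym covers enough responses."""
    hypernyms = Counter(HYPERNYM[r] for r in responses if r in HYPERNYM)
    if not hypernyms:
        return None
    best, covered = hypernyms.most_common(1)[0]
    if covered / len(responses) >= min_share:
        return (stimulus, relation, best)
    return None


objects = ["water", "beer", "milk", "ale", "Coca-cola", "coffee", "juice"]
print(generalize("drink", objects, "ROLE_OBJECT"))
```

The coverage threshold keeps a single stray response from blocking the generalization while still rejecting response sets that share no dominant hypernym.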
18Syntagmatic and paradigmatic relations (2)
- The law of contiguity cannot explain all associations.
- Law of similarity, e.g. inanimate - dead 39 (SYNONYMY), seek - find 56 (CAUSE relation), buy - sell 56 (CONVERSIVE relation).
- One of the main benefits of WAT: paradigmatic relations are given explicitly, as opposed to other sources of empirical data (e.g. text corpora).
19Domain information
- WAT explicitly present the way common words are grouped together according to the fragments of reality they describe.
- E.g. hospital -> nurse, doctor, pain, ill, injury, load
- Types of domain relations
  - name of domain (situation) - domain member, e.g. hospital - nurse 8, finance - money 61, football - player 4, marriage - husband 2
  - participant - participant, e.g. pepper - salt 58, tamer - lion 69, needle - thread 41, mouse - cat 22
  - participant - circumstance, e.g. umbrella - rain 58, actor - stage 23
  - participant - pointer to its function/role in the situation, e.g. larder - food 58, envelope - letter 60, actor - play 15, etc.
- Open question: should the types of domain relations be differentiated within the semantic network, or included as a uniform IS_ASSOCIATED_TO relation?
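One way to look at this choice: typed relations can always be collapsed into a uniform one, but not the other way round. The sketch below is illustrative only; the type labels (`DOMAIN_MEMBER` and so on) and the `DomainRelation` record are hypothetical names, while the word pairs and counts come from the examples on this slide.

```python
from typing import NamedTuple


class DomainRelation(NamedTuple):
    source: str
    target: str
    kind: str   # hypothetical type label
    count: int  # number of subjects who gave this response


relations = [
    DomainRelation("hospital", "nurse", "DOMAIN_MEMBER", 8),
    DomainRelation("pepper", "salt", "PARTICIPANT_PARTICIPANT", 58),
    DomainRelation("umbrella", "rain", "PARTICIPANT_CIRCUMSTANCE", 58),
    DomainRelation("larder", "food", "PARTICIPANT_FUNCTION", 58),
]


def as_uniform(rels):
    """Collapse typed relations into a single IS_ASSOCIATED_TO relation."""
    return [(r.source, "IS_ASSOCIATED_TO", r.target, r.count) for r in rels]


uniform = as_uniform(relations)
print(uniform[0])
```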
20Conclusions
- Advantages of using WAT in constructing a semantic network
  - Simplicity of data acquisition.
  - Broad variety of semantic information to acquire.
  - Empirical nature of the data extracted (as opposed to theoretical data, cf. conventional ontologies, taxonomies or classification schemes, which rely on the researcher's introspection and intuition and hence lead to over- and under-estimation of the phenomena under consideration).
  - Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of semantic relations in each particular case).
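The relative relevance of a relation can be read off by normalizing the raw response counts. Here is a minimal sketch, assuming counts are stored per stimulus as a dictionary; the function name `relative_weights` is illustrative, and the shares are relative to the listed responses only, since the "etc." tail of the needle example is not known.

```python
def relative_weights(responses):
    """Convert raw response counts into shares of all recorded responses."""
    total = sum(responses.values())
    return {word: count / total for word, count in responses.items()}


# Top responses to the stimulus "needle" from the example slide.
needle = {"thread": 41, "pin": 13, "sharp": 6, "sew": 5, "cotton": 2}
weights = relative_weights(needle)
print(round(weights["thread"], 3))
```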