Word Association Thesaurus as a Resource for Extending Semantic Networks

1
Word Association Thesaurus as a Resource for
Extending Semantic Networks
  • Anna Sinopalnikova (1, 2), Pavel Smrz (1)
  • (1) Faculty of Informatics, Masaryk University,
    Botanicka 68a, 602 00 Brno, Czech Republic
  • (2) Saint-Petersburg State University,
    Universitetskaya 11, Saint-Petersburg, Russia
  • anna, smrz_at_fi.muni.cz

2
Overview
  • Motivation
  • Word Association and other notions of
    psycholinguistics
  • WAT vs. Corpus
  • Semantic Information from WAT:
    core concepts, semantic primitives, syntagmatic
    and paradigmatic relations, domain information

3
Types of Semantic Resources used in NLP
  • Corpora: primary resources, presenting (more or
    less) raw data on the language in use.
    Information is given implicitly, so special
    extraction procedures and tools are needed.
  • Dictionaries, thesauri, ontologies, taxonomies:
    derived resources, presenting explications of
    some internal knowledge. They are based on
    primary resources and researchers' intuition.
    Information is given explicitly.

4
Motivation
  • There is still a need for an empirical basis for
    semantic network construction.
  • Semantic Web initiatives.
  • WATs are available for many languages, but it is
    still unclear what they are good for and how to
    use them.

5
Word Association and other notions of
psycholinguistics
  • Word Association
  • Word Association Test
  • Word Association Norms
  • Word Association Thesaurus

6
Example
  • Needle stimulates ->
    thread 41, pin 13, sharp 6, sew 5,
    cotton 2, dressmaker 1, fix 1, prick 1,
    sewing 1, sow 1, spring 1, stitch 1, etc.

7
WATs explored
  • RAT - Russian WAT by Karaulov et al. (1994-1998):
    8000 stimuli, 23000 words covered, 1000
    subjects,
  • EAT - Edinburgh WAT by Kiss et al. (1972): 8400
    stimuli, 54000 words covered, 1000 subjects,
  • Czech WAN (Novak et al., 1996): 150 stimuli, 4000
    words covered, 250 subjects.
  • Experience gained in projects:
  • RussNet (a wordnet-like database for Russian
    linking lexical semantics with derivational
    morphology),
  • Czech part of the BalkaNet project (a multilingual
    wordnet-like network for 5 Balkan languages and
    Czech).

8
WAT vs. Corpus
  • History: Church & Hanks, 1990; Wettler & Rapp,
    1993; Willners, 2001
  • Bokrjonok 3.0. - balanced corpus for Russian (16
    mln words),
  • BNC - British National Corpus (112 mln),
  • CNC - Czech National Corpus (160 mln) and its
    unbalanced version (630 mln words)
  • Research procedure:
  • 5000 pairs (e.g. cheese - mouse, dark - alley)
    were extracted from each WAN in random order and
    then searched for in the corpora.
  • The window span was fixed to -10 to +10 words.
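
The search step above can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the corpus is already tokenized into a flat list of word forms and matches tokens exactly (no lemmatization). The names `cooccurs` and `missing_pairs` and the toy corpus are hypothetical.

```python
from typing import List, Tuple


def cooccurs(tokens: List[str], stimulus: str, response: str, span: int = 10) -> bool:
    """Return True if `response` occurs within `span` tokens of `stimulus`."""
    for i, tok in enumerate(tokens):
        if tok != stimulus:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        # Window of up to `span` tokens on each side, excluding the stimulus itself.
        window = tokens[lo:i] + tokens[i + 1:hi]
        if response in window:
            return True
    return False


def missing_pairs(tokens: List[str], pairs: List[Tuple[str, str]],
                  span: int = 10) -> List[Tuple[str, str]]:
    """Return the association pairs that never co-occur within the window."""
    return [(s, r) for s, r in pairs if not cooccurs(tokens, s, r, span)]


# Toy data (assumption): one sentence standing in for a corpus.
toy_corpus = "the mouse ate the cheese in a dark and narrow alley".split()
toy_pairs = [("cheese", "mouse"), ("dark", "alley"), ("needle", "thread")]

absent = missing_pairs(toy_corpus, toy_pairs)
share_absent = len(absent) / len(toy_pairs)
```

The share of absent pairs computed this way corresponds to the kind of figure reported on the following slides; on real corpora an indexed search would be needed rather than a linear scan.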

9
WAN vs. Corpus Russian
  • Quantitative analysis (Sinopalnikova, 2004)
  • - 64% of word associations do not occur in the
    corpus,
  • - 49% when unique associations (those with an
    absolute frequency of 1) are excluded
  • Qualitative analysis
  • - a high proportion of syntagmatic associations
    is absent,
  • - for verbs this number was up to 84%.

10
WAN vs. Corpus Russian (2)
11
WAN vs. Corpus English
  • Quantitative analysis
  • - 31% of word associations do not occur in the BNC
  • Qualitative analysis
  • PARADIGMATIC 57.1%
  • SYNTAGMATIC 8.4%
  • DOMAIN 21.7%
  • OTHER 12.8%

12
WAN vs. Corpus English (2)
  • acquiring synonymy and hyponymy
  • e.g. sex - fornicate (archaic or humorous), ire
    (poetic) - anger, cowardly - yellow (slang)
  • acquiring information about low-frequency words
  • e.g. perambulate (N_BNC = 3), fornicate (N_BNC =
    6)
  • cf. EAT: perambulate - walk 30, pram 17, baby
    9, push 8, about 1, dawdle 1, move 1,
    promenade 1, slowly 1, stroll 1, through 1,
    wander 1, etc.
  • acquiring domain relations: the portion of them
    absent from the corpus was surprisingly large
    for a corpus such as the BNC
  • e.g. ink-pot - pen 24, non-violence - peace 29,
    offside - soccer 2

13
WAN vs. Corpus Czech
  • Quantitative analysis
  • - 514 associations missing (10.28%)
  • Qualitative analysis
  • - the proportion of syntagmatic and paradigmatic
    ones among them was similar to that for English

14
Extracting semantic information from WAT
  • Associations
  • by form: 10% (e.g. know - no, yellow - mellow)
  • by meaning: 90% (e.g. needle - sew, yellow -
    sun)
  • core concepts,
  • semantic primitives,
  • syntagmatic and paradigmatic relations,
  • domain information

15
Core concepts
  • In a WAT, some words have an above-average
    number of direct links to other words.
  • Russian: (Cyrillic examples not recoverable)
    (295 words with more than 100 relations)
  • English: man, sex, no (not), love, house; work,
    eat, think, go, live; good, old, small etc. (586
    words with more than 100 relations)
  • Czech: člověk, dům, strom; jíst, jít, myslet;
    moc, starý, velký, bílý, hezký etc.
  • These words determine the fundamental concepts of
    a particular language system, and thus should be
    incorporated into an ontology as its core
    components (e.g., SUMO upper concepts or EWN
    Base Concepts).
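
One way such highly connected words might be picked out is sketched below, assuming the WAT is available as a list of (stimulus, response) pairs. The function names, the undirected treatment of links, and the toy data are assumptions for illustration; the slides do not state how relations were counted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def link_counts(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct words directly linked to each word (links treated as undirected)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for stimulus, response in pairs:
        if stimulus != response:
            neighbours[stimulus].add(response)
            neighbours[response].add(stimulus)
    return {word: len(linked) for word, linked in neighbours.items()}


def core_concepts(pairs: List[Tuple[str, str]], min_links: int) -> List[str]:
    """Return words with more than `min_links` distinct direct links, sorted alphabetically."""
    counts = link_counts(pairs)
    return sorted(word for word, n in counts.items() if n > min_links)


# Toy data (assumption); the slides use a threshold of 100 relations on full WATs.
toy_pairs = [
    ("woman", "man"), ("boy", "man"), ("work", "man"),
    ("work", "house"), ("home", "house"),
]
toy_core = core_concepts(toy_pairs, min_links=2)
```

With a full WAT the call would be `core_concepts(pairs, min_links=100)`, matching the "more than 100 relations" criterion quoted above.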

16
Semantic primitives
  • A WAT can also provide a list of basic concepts
    associated with each separate word.
  • This reveals the semantics of a word (situation)
    as a list of semantic constituents, i.e. separate
    pieces of information.
  • Abstract words (verbs, adjectives or nouns
    denoting complex situations or emotional states)
    are difficult to decompose by means of logic and
    intuition.
  • E.g. Depression could be reduced to its
    constituents: sad 7, low 5, black 4, manic 4,
    sadness 3, bored 3, misery 2, tiredness 2,
    despair 1, gloom 1, grey 1, hopelessness 1,
    monotony 1, sick 1, mood 1, nerves 1, etc.; its
    probable causes: rain 3, guilt 1, pain 1,
    unemployment 1; its probable effects: suicide 1;
    its opposites: elation 3, fun 1, happiness 1, etc.

17
Syntagmatic and paradigmatic relations
  • Linguistic substitutes for reality
  • WAs reflect the order of events in reality, the
    way objects are organized in space, and the
    way human beings experience them.
  • Associations by contiguity: e.g. cry - baby may
    be treated as a manifestation of a syntagmatic
    relation between a verb and its subject, while
    take - hand as a ROLE_INSTRUMENT relation.
  • Generalization! E.g. drink - water, beer, milk,
    ale, Coca-cola, coffee, juice, etc. found in a WAT
    should be generalized as a drink ROLE_OBJECT
    beverage relation and incorporated in that form
    in the semantic network.
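
The generalization step can be sketched as below. This is only an illustration under stated assumptions: the hypernym table `HYPERNYM` is a hypothetical hand-made stand-in for a real lexical hierarchy, and the `min_share` cut-off is an arbitrary choice, not something given in the slides.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical toy hypernym table standing in for a wordnet-like hierarchy.
HYPERNYM = {
    "water": "beverage", "beer": "beverage", "milk": "beverage",
    "ale": "beverage", "coffee": "beverage", "juice": "beverage",
}


def generalize(stimulus: str, responses: List[str], relation: str,
               min_share: float = 0.6) -> Optional[Tuple[str, str, str]]:
    """Replace individual responses by their most frequent shared hypernym.

    Returns (stimulus, relation, hypernym) if at least `min_share` of all
    responses fall under that hypernym, otherwise None.
    """
    classes = Counter(HYPERNYM[r] for r in responses if r in HYPERNYM)
    if not classes:
        return None
    best, covered = classes.most_common(1)[0]
    if covered / len(responses) >= min_share:
        return (stimulus, relation, best)
    return None


# Responses to "drink" as listed on the slide; "Coca-cola" is not in the toy table.
drink_responses = ["water", "beer", "milk", "ale", "Coca-cola", "coffee", "juice"]
generalized = generalize("drink", drink_responses, "ROLE_OBJECT")
```

Responses missing from the hierarchy, such as brand names, lower the coverage share but do not block the generalization as long as the threshold is met.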

18
Syntagmatic and paradigmatic relations (2)
  • The law of contiguity cannot explain all
    associations.
  • Law of similarity, e.g. inanimate - dead 39
    (SYNONYMY), seek - find 56 (CAUSE relation), buy -
    sell 56 (CONVERSIVE relation).
  • One of the main benefits of WAT: paradigmatic
    relations are given explicitly, as opposed to
    other sources of empirical data (e.g. text
    corpora).

19
Domain information
  • WATs explicitly present the way common words are
    grouped together according to the fragments of
    reality they describe.
  • E.g., hospital -> nurse, doctor, pain,
    ill, injury, load
  • Types of domain relations:
  • name of domain (situation) - domain member, e.g.
    hospital - nurse 8, finance - money 61, football -
    player 4, marriage - husband 2
  • participant - participant, e.g. pepper - salt 58,
    tamer - lion 69, needle - thread 41, mouse -
    cat 22
  • participant - circumstance, e.g. umbrella - rain
    58, actor - stage 23
  • participant - pointer to its function/role in the
    situation, e.g. larder - food 58, envelope -
    letter 60, actor - play 15, etc.
  • Open question: should the types of domain
    relations be differentiated within the semantic
    network, or included as a uniform
    IS_ASSOCIATED_TO relation?
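
The two options in this question need not exclude each other. The sketch below stores an optional subtype on each link and falls back to the uniform IS_ASSOCIATED_TO label when subtypes are ignored. The class `DomainLink` and the subtype names (DOMAIN_MEMBER, CO_PARTICIPANT, CIRCUMSTANCE, FUNCTION) are hypothetical labels chosen for illustration; only IS_ASSOCIATED_TO comes from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DomainLink:
    """A domain relation between two words, with an optional finer subtype."""
    source: str
    target: str
    subtype: Optional[str] = None  # hypothetical subtype label, may be absent

    def label(self, typed: bool = True) -> str:
        """Return the subtype if requested and known, else the uniform label."""
        if typed and self.subtype:
            return self.subtype
        return "IS_ASSOCIATED_TO"


# Word pairs taken from the slide; subtype names are assumptions.
links = [
    DomainLink("hospital", "nurse", "DOMAIN_MEMBER"),
    DomainLink("pepper", "salt", "CO_PARTICIPANT"),
    DomainLink("umbrella", "rain", "CIRCUMSTANCE"),
    DomainLink("larder", "food", "FUNCTION"),
]
typed_labels = [link.label(typed=True) for link in links]
uniform_labels = [link.label(typed=False) for link in links]
```

Keeping the subtype optional means links whose type cannot be decided can still be stored under the uniform relation.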

20
Conclusions
  • Advantages of using WAT in constructing a
    semantic network:
  • Simplicity of data acquisition.
  • Broad variety of semantic information to acquire.
  • Empirical nature of the data extracted (as
    opposed to theoretical data, cf. conventional
    ontologies, taxonomies or classification schemes,
    which involve the researchers' introspection and
    intuition and hence lead to over- and
    under-estimation of the phenomena under
    consideration).
  • Probabilistic nature of the data presented (the
    data reflect the relative rather than absolute
    relevance of semantic relations in each
    particular case).
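
The last point can be illustrated with the needle entry from the Example slide. This is a minimal sketch: it assumes the figures are response counts and normalizes over the listed responses only, since the original entry ends in "etc." and its full total is not given. The function name `relative_strengths` is hypothetical.

```python
from typing import Dict


def relative_strengths(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw response counts into relative association strengths."""
    total = sum(counts.values())
    return {response: n / total for response, n in counts.items()}


# Responses to "needle" as listed on the Example slide (list truncated there).
needle = {
    "thread": 41, "pin": 13, "sharp": 6, "sew": 5, "cotton": 2,
    "dressmaker": 1, "fix": 1, "prick": 1, "sewing": 1, "sow": 1,
    "spring": 1, "stitch": 1,
}
strengths = relative_strengths(needle)
strongest = max(strengths, key=strengths.get)
```

Such weights could be attached to the corresponding links of the semantic network instead of treating every relation as equally relevant.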

21
  • Thank you...