Title: Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information
1 Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information
Sabine Schulte im Walde, Universität Stuttgart
2 Overview
- Motivation / Introduction
- Data-intensive lexical semantics
- Corpus-based descriptions
- Semantic Associations
- Our Work
- Evaluation of data-driven models
- Cross-comparison between resources
- Summary / Conclusions
3 Data-intensive lexical semantics
- Modelling word meaning
- Using meaning aspects
- Automatically obtainable
- Goal: Determine (dis)similarity of words
- Applications
- Word sense discrimination
- Anaphora resolution
- ...
4 Corpus-based Descriptions
- Disadvantage: Corpus co-occurrence does not cover all aspects of word meaning
- Especially world knowledge
- Our question: Can we find complementary information in other resources?
- Dictionaries?
- Encyclopaedias?
5 Dictionary and Encyclopaedia
- Consider other resources
- Dictionaries: contain detailed information about word senses
- Encyclopaedias: written knowledge compendiums
- How to identify meaning aspects?
- In our work, we rely on semantic associations
6 Semantic Associations
- Definition
- We define semantic associations as concepts spontaneously called to mind by other concepts (stimuli)
- Assumption
- Evoked words reflect highly salient linguistic and conceptual features
7 Data Collection: Verb Stimuli
- Associates to verb stimuli
- Web experiment
- 330 verb stimuli
- 30 seconds per verb
Stimulus: klagen (complain, moan, sue)
Gericht     court         19
jammern     moan          18
weinen      cry           13
Anwalt      lawyer        11
Richter     judge          9
Klage       complaint      7
Leid        suffering      6
Trauer      mourning       6
Klagemauer  Wailing Wall   5
laut        noisy          5
8 Data Collection: Noun Stimuli
- Associates to noun stimuli
- Offline experiment
- 409 noun stimuli
- 3 associates per noun
Stimulus: Schloss (castle, lock)
Schlüssel   key        51
Tür         door       15
Prinzessin  princess    8
Burg        castle      8
sicher      safe        7
Fahrrad     bike        7
schließen   close       7
Keller      cellar      7
König       king        7
Turm        tower       6
9 Knowledge Resources
- Corpus data
- German newspaper corpus
- 200 million words
- Dictionary: WDG (Wörterbuch der deutschen Gegenwartssprache)
- Freely available dictionary (130,000 entries)
- Average of 840 words/entry
- Encyclopedia: Wikipedia
- Free online encyclopedia (650,000 articles)
- Average of 1,164 words/article
10 Analysis: Procedure
- Corpus data
- Extract co-occurrence windows of stimuli
- Check windows for associations
- WDG / Wikipedia
- Download stimuli entries
- Check content for associations
- Missing entries
- WDG: 7/0
- Wikipedia: 2/54
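The corpus step above (extract co-occurrence windows of a stimulus, then check them for associations) can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the function names `cooccurrence_window` and `found_associates`, the whitespace tokenisation, and the window size of 5 are assumptions introduced here. The slides do not state the window size or any preprocessing such as lemmatisation.

```python
def cooccurrence_window(tokens, stimulus, size=5):
    """Collect all tokens within `size` positions of each stimulus occurrence."""
    window = set()
    for i, tok in enumerate(tokens):
        if tok == stimulus:
            lo = max(0, i - size)
            window.update(tokens[lo:i])                 # left context
            window.update(tokens[i + 1:i + 1 + size])   # right context
    return window


def found_associates(tokens, stimulus, associates, size=5):
    """Return the associates that occur in the stimulus' co-occurrence windows."""
    window = cooccurrence_window(tokens, stimulus, size)
    return {a for a in associates if a in window}


# Toy example: an invented sentence, not taken from the newspaper corpus.
toy_corpus = "der Anwalt will vor dem Gericht klagen und nicht jammern".split()
hits = found_associates(toy_corpus, "klagen", {"Gericht", "jammern", "weinen"})
print(sorted(hits))
```

For the dictionary and Wikipedia, the same check would presumably run over the text of the stimulus entry instead of corpus windows.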
11 Analysis: Resource Coverage
POS        Types  Tokens  Tokens/Types
corpus     70     84      1.2
WDG        12     28      2.3
Wikipedia  26     46      1.8

POS        Types  Tokens  Tokens/Types
corpus     67     77      1.2
WDG        12     25      2.0
Wikipedia   6     10      1.7
- Resources differ in ...
- coverage per stimulus part-of-speech
- token/type ratio
- proportions per associate part-of-speech
(next slide)
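As a rough illustration of type coverage, token coverage, and the token/type ratio, the sketch below computes them for one stimulus. Several things here are assumptions rather than facts from the slides: the function name `coverage`, reading the numbers next to the "klagen" associates as response counts, expressing coverage in percent, and the set of associates marked as found. The ratio is taken as token coverage divided by type coverage, which is at least consistent with the slide's figures (for example 84 / 70 ≈ 1.2).

```python
def coverage(associate_counts, found):
    """Return (type coverage, token coverage) in percent.

    Type coverage counts each distinct associate once; token coverage
    weights each associate by how often it was given as a response.
    """
    total_types = len(associate_counts)
    total_tokens = sum(associate_counts.values())
    hit_types = sum(1 for a in associate_counts if a in found)
    hit_tokens = sum(n for a, n in associate_counts.items() if a in found)
    return 100.0 * hit_types / total_types, 100.0 * hit_tokens / total_tokens


# Top associates of "klagen" (numbers read as response counts: an assumption).
klagen = {"Gericht": 19, "jammern": 18, "weinen": 13, "Anwalt": 11, "Richter": 9}
# Hypothetical set of associates found in some resource.
found = {"Gericht", "jammern", "Anwalt"}

types_pct, tokens_pct = coverage(klagen, found)
ratio = tokens_pct / types_pct
print(round(types_pct, 1), round(tokens_pct, 1), round(ratio, 2))
```

A ratio above 1 means the resource tends to cover the frequently named associates rather than the rare ones.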
12 Analysis: Resource Coverage (2)
- Proportions per associate part-of-speech
- Noun stimuli
- Corpus: 88 V > 84 N > 83 Adj
- WDG: 43 V > 31 Adj > 26 N
- Wikipedia: 49 N > 39 Adj > 37 V
- Verb stimuli
- Corpus: 91 Adv > 79 V > 77 Adj > 76 N
- WDG: 29 Adv > 28 V > 25 N > 24 Adj
- Wikipedia: 12 N > 9 Adj/Adv > 6 V
13 Analysis: Cross-Comparison
- Noun associate
- World knowledge?
- Only in WDG/Wiki: carrot - orange, cry - tears, ...
- Only in Corpus: igloo - eskimo, teach - school, ...
        Corpus  Dic   Wiki
Corpus  -       55.0  46.0
WDG     0.8     -      5.7
Wiki    3.2     18.1  -

        Corpus  Dic   Wiki
Corpus  -       45.8  22.1
WDG     0.7     -      3.9
Wiki    0.5     3.6   -
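One possible way to build a cross-comparison of this kind is sketched below. The slides do not say how the figures were computed, so the reading used here is an assumption: each cell is the percentage of stimulus-associate pairs found in the row resource but not in the column resource. The function name `only_in` is hypothetical, and the assignment of example pairs to resources is made up purely for illustration.

```python
from itertools import permutations


def only_in(found_by_resource, all_pairs):
    """Percentage of pairs found in resource A but not in resource B,
    for every ordered pair (A, B) of resources."""
    result = {}
    for a, b in permutations(found_by_resource, 2):
        diff = found_by_resource[a] - found_by_resource[b]
        result[(a, b)] = 100.0 * len(diff) / len(all_pairs)
    return result


# Stimulus-associate pairs from the slide's examples; which resource
# finds which pair is invented here for illustration only.
all_pairs = {("carrot", "orange"), ("cry", "tears"),
             ("igloo", "eskimo"), ("teach", "school")}
found_by_resource = {
    "Corpus": {("igloo", "eskimo"), ("teach", "school")},
    "WDG": {("carrot", "orange"), ("cry", "tears")},
    "Wiki": {("carrot", "orange")},
}

table = only_in(found_by_resource, all_pairs)
print(table[("Corpus", "WDG")], table[("WDG", "Wiki")])
```

Using ordered pairs keeps the table asymmetric, as in the slide, where the Corpus row and the Corpus column differ strongly.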
14 Summary / Conclusions
- Analysis of associations across resources
- Results
- Different coverage per stimuli (noun vs. verb)
- Different (predominant) PoS in word descriptions
- Different strength of semantic relatedness
- Resources complement each other
- => A combination of resources should be helpful for modelling word meaning and similarity
16 Questions?