1. An Effective Word Sense Disambiguation Model
Using Automatic Sense Tagging Based on
Dictionary Information
- Yong-Gu Lee
- 2007-08-17
- yonggulee_at_hotmail.com
2. Contents
- Introduction
- Related Works
- Research Goals
- Effective Word Sense Disambiguation Model and Evaluation
- Conclusion
3. Introduction
- Word Sense Disambiguation (WSD)
  - The problem of selecting a sense for a word from a set of predefined possibilities.
  - An intermediate task, which is not an end in itself but is necessary at one level or another.
  - Obviously essential for language understanding applications:
    - Machine translation
    - Information retrieval and hypertext navigation
    - Content and thematic analysis
    - Speech processing and text processing
4. Related Works (1/3)
- Approaches to WSD
  - Knowledge-Based Disambiguation
    - Use of external lexical resources such as dictionaries and thesauri
    - Discourse properties
  - Corpus-Based Disambiguation
  - Hybrid Disambiguation
5. Related Works (2/3)
- Corpus-Based Disambiguation
  - Supervised Disambiguation
    - Based on a labeled training set
    - The learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
  - Unsupervised Disambiguation
    - Based on unlabeled corpora
    - The learning system has a training set of feature-encoded inputs but NOT their appropriate sense labels (categories)
6. Related Works (3/3)
- Lexical Resources for WSD
  - Machine-readable format
    - Machine Readable Dictionaries (MRDs): Longman, Oxford, etc.
    - Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc.
  - Sense-tagged data
    - Senseval-1, 2, 3 (www.senseval.org)
      - Provides sense-annotated data for many languages and several tasks
      - Languages: English, Romanian, Chinese, Basque, Spanish, etc.
      - Tasks: Lexical Sample, All Words, etc.
    - SemCor, Hector, etc.
7. Research Motivation
- Manual sense tagging
  - Labor-intensive and costly
- Limited availability of sense-tagged corpora
  - Apart from English, most languages have few corpora for WSD.
- Coverage of sense-tagged words
  - Some corpora tag the senses of only one or a few words.
    - The "line" corpus, the "interest" corpus, etc.
  - With a supervised disambiguation method, only the words that appear in the sense-tagged corpus can be disambiguated.
8. Research Goals
- Minimize or eliminate the cost of manual labeling
  - Automatic sense tagging using an MRD and heuristic rules
- Improve the performance of word sense disambiguation
  - Using supervised disambiguation
    - Naïve Bayes classifier
9. Effective Word Sense Disambiguation Model
- Automatic Tagging Technique
- Experimental Environment
- Evaluation of Automatic Tagging Technique
- Evaluation of Sense Classification
- Evaluation of Fusion Method
10. An Outline Diagram for the Proposed Research
[Flow diagram; component labels: Dictionary; Auto Sense Tagging Module (Sense Tagging, Collocation Extraction, Key Word Extraction); Automatic Sense Tagging and Training Collection; Training Set; Naïve Bayes Classifier; Test Context; Context Extraction of Target Word; Classify Word Sense; Evaluation]
11. Automatic Tagging Technique
- Dictionary Information-based Method
- Collocation Overlap-based Method
- Data Fusion Method
  - Dictionary Information-based Method + Collocation Overlap-based Method
12. Dictionary Information-based Method (1/2)
- Extract the necessary information from the dictionary.
- Heuristic 1: One Sense per Collocation / One Sense per Discourse
  - Telephone line, ????/Gyeonggi-jeonmang (economic prospect)
- Heuristic 2: Use of corresponding Chinese characters
  - ??/Gamja: ??(Potato) / ??(Reduction of capital)
- Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms
- Heuristic 4: Occurrence of derived words
13. Dictionary Information-based Method (2/2)
- Heuristic 5: Co-occurrence of key features extracted from the definition of the target word's entry, as in Lesk (1986)
- Algorithm
  - Retrieve from the MRD all sense definitions of the word to be disambiguated
  - Determine the overlap between each sense definition and the current context
  - Choose the sense that leads to the highest overlap
14. Collocation Overlap-based Method
- A semantic similarity metric using collocation overlap
- Algorithm
  1. Retrieve keywords from all MRD sense definitions of the word to be disambiguated
  2. Extract collocation words of the keywords from the test collection, up to a threshold
  3. Extract collocation words of the target word from the test collection
  4. Determine the overlap between the collocation word sets from steps 2 and 3
  5. Choose the sense that leads to the highest overlap
15. Feature Selection
- By document frequency
  - Test collection → docDF
  - Definitions treated as documents → dicDF
  - Thresholds: docDF < 5000, dicDF < 300
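A minimal sketch of this frequency-based filter; the function name and the dict-based frequency tables are assumptions:

```python
def select_features(terms, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep terms whose document frequency in the test collection (docDF)
    and in the dictionary definitions (dicDF) fall under the thresholds,
    dropping overly common, weakly discriminative words."""
    return [t for t in terms
            if doc_df.get(t, 0) < max_doc_df and dic_df.get(t, 0) < max_dic_df]
```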
16. Sense Classification: Naïve Bayes Classifier
- Source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
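The classifier follows the Naïve Bayes formulation for WSD in Manning and Schütze (1999): pick the sense s maximizing log P(s) + Σ log P(w | s) over the context words w. A sketch with add-one smoothing (the smoothing scheme is an assumption; the slide does not specify one):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier: argmax_s [log P(s) + sum_w log P(w|s)]."""

    def fit(self, tagged_contexts):
        # tagged_contexts: list of (sense, [context words]) pairs
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sense, words in tagged_contexts:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(words)
            self.vocab.update(words)
        self.total = sum(self.sense_counts.values())
        return self

    def predict(self, words):
        best, best_score = None, float("-inf")
        v = len(self.vocab)
        for sense in self.sense_counts:
            # log prior + smoothed log likelihood of each context word
            score = math.log(self.sense_counts[sense] / self.total)
            n = sum(self.word_counts[sense].values())
            for w in words:
                score += math.log((self.word_counts[sense][w] + 1) / (n + v))
            if score > best_score:
                best, best_score = sense, score
        return best
```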
17. Experimental Environment (1/2)
- Test collection
  - Includes all the articles (127,641) in three Korean daily newspapers for the year 2004
  - A part-of-speech tagger and lexical analysis are applied
- Evaluation
  - Accuracy
18. Target Words for WSD
Word No. of Senses No. of Articles Total Frequency
??/Gamja 2 622 1,115
??/Gyeonggi 4 18,484 37,763
??/Gigan 2 11,255 15,803
??/Sinbyeong 3 360 469
??/Sinjang 4 703 952
??/Yongi 4 3,227 5,147
??/Indo 5 2,022 2,750
??/Jigu 2 4,017 9,372
??/Jiwon 3 12,577 21,320
19. Evaluation of Automatic Sense Tagging
- Dictionary Information-based Method
- By rule (all target words)
No Information Type Total Correct Accuracy
1 Collocation 3,229 2,931 0.9077
2 Chinese characters 74 74 1.0000
3 Synonym 2,107 1,598 0.7584
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 128,520 60,810 0.5091
SUM 136,091 67,470 0.4958
20. Results of Feature Selection: Words
- All information types
Word Total Correct Accuracy
??/Gamja 802 800 0.9975
??/Gyeonggi 6,200 4,833 0.7795
??/Gigan 2,128 1,271 0.5973
??/Sinbyeong 299 265 0.8863
??/Sinjang 653 471 0.7213
??/Yongi 4,732 4,169 0.8810
??/Indo 3,956 2,274 0.5748
??/Jigu 2,207 2,124 0.9624
??/Jiwon 4,826 1,870 0.3875
SUM 25,803 18,077 0.7006
21. Results of Feature Selection: Rule
- All target words
No Information Type Total Correct Accuracy
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 1,650 1,556 0.9430
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 20,315 12,842 0.6321
SUM 25,803 18,077 0.7006
22. (No transcript: chart slide)
23. Evaluation of Automatic Sense Tagging
- Collocation Overlap-based Method
- Performance by threshold
Rank Total Correct Accuracy
Top10 6,155 3,727 0.6055
Top30 9,258 5,215 0.5633
Top50 11,544 6,264 0.5426
Top100 13,432 6,751 0.5026
All 19,436 7,796 0.4009
24. (No transcript: chart slide)
25. Auto Tagging Results of Top 30: Words
Word Main Source Total Correct Accuracy
??/Gamja Definitions 273 251 0.9194
??/Gyeonggi Definitions 3,540 2,951 0.8336
??/Gigan Synonym, Definitions 1,205 365 0.3029
??/Sinbyeong Definitions 112 67 0.5982
??/Sinjang Definitions 101 77 0.7624
??/Yongi Definitions 520 435 0.8365
??/Indo Antonym, Definitions 277 195 0.7040
??/Jigu Definitions 609 546 0.8966
??/Jiwon Related Terms, Definitions 2,621 328 0.1251
Sum 9,258 5,215 0.5633
26. Auto Tagging Results of Top 30: Information Type
- All target words
Information Type Total Correct Accuracy
Synonym 544 402 0.7390
Antonym 129 119 0.9225
Related Terms 230 166 0.7217
Definitions 8,355 4,528 0.5420
SUM 9,258 5,215 0.5633
27. Comparison of the Two Auto Tagging Methods
28. Building a Classifier
- Training set: 600 instances
- Test set: the remaining instances
- Window size: 50-byte length
- Rule for building the training set
  - The automatic sense tagging contains errors.
  - To reduce errors and improve the tagging accuracy of the training set, information types with high accuracy are used first.
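One way to realize this rule is to draw training examples from the information types in descending order of their observed tagging accuracy (the ordering below follows the per-rule accuracies on slide 21; the function and data layout are illustrative):

```python
# Information types ordered by observed tagging accuracy (slide 21):
# Chinese characters (1.0000), derived words (0.9935), collocation (0.9657),
# synonym (0.9430), related terms (0.9350), antonym (0.8228), definitions (0.6321).
PRIORITY = ["chinese", "derived", "collocation", "synonym",
            "related", "antonym", "definitions"]

def build_train_set(tagged, size=600):
    """Assemble the training set from auto-tagged examples, drawing first
    from the most reliably tagged information types.
    `tagged` maps information type -> list of (context, sense) pairs."""
    train = []
    for info_type in PRIORITY:
        for example in tagged.get(info_type, []):
            if len(train) == size:
                return train
            train.append(example)
    return train
```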
29. Sense Classification: Dictionary Information-based Method
- All information types
Word Total Correct Accuracy
??/Gamja 1,270 1,139 0.8966
??/Gyeonggi 37,763 30,897 0.8182
??/Gigan 15,803 9,278 0.5871
??/Sinbyeong 469 386 0.8230
??/Sinjang 953 671 0.7043
??/Yongi 5,147 4,302 0.8359
??/Indo 2,750 1,212 0.4408
??/Jigu 9,375 8,373 0.8932
??/Jiwon 21,321 8,345 0.3914
SUM 94,851 64,604 0.6811
30. Sense Classification: Collocation Overlap-based Method
Rank Total Correct Accuracy
Top10 94,851 57,201 0.6031
Top30 94,851 58,891 0.6209
Top50 94,851 56,871 0.5996
Top100 94,851 56,916 0.6001
All 94,851 53,218 0.5611
31. Sense Classification: Collocation Overlap-based Method (Words)
Word Total Correct Accuracy
??/Gamja 1,270 1,016 0.8000
??/Gyeonggi 37,763 30,787 0.8153
??/Gigan 15,803 10,239 0.6479
??/Sinbyeong 469 290 0.6183
??/Sinjang 953 622 0.6527
??/Yongi 5,147 2,834 0.5507
??/Indo 2,750 1,683 0.6118
??/Jigu 9,375 8,499 0.9065
??/Jiwon 21,321 2,922 0.1371
SUM 94,851 58,891 0.6209
32. Comparison of the Two Sense Classifications
33. Data Fusion of the Two Auto Tagging Methods
- Dictionary Information-based Method: using all information types except definitions
- Collocation Overlap-based Method: using only the Top10 information
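One plausible reading of this fusion scheme, with illustrative names: keep dictionary-based tags from every information type except definitions, then fill the remaining occurrences with Top10 collocation-overlap tags:

```python
def fuse_tags(dict_tags, colloc_tags):
    """Fuse the two auto-tagging methods. Each argument maps an
    occurrence id -> (sense, information type)."""
    # Dictionary-based tags, excluding the low-accuracy definitions type.
    fused = {occ: (sense, info)
             for occ, (sense, info) in dict_tags.items()
             if info != "definitions"}
    # Collocation-overlap (Top10) tags fill in untagged occurrences only.
    for occ, (sense, info) in colloc_tags.items():
        fused.setdefault(occ, (sense, info))
    return fused
```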
34. Results of the Auto Tagging Method in Data Fusion: Words
Word Total Correct Accuracy
??/Gamja 503 485 0.9642
??/Gyeonggi 3,086 2,582 0.8367
??/Gigan 2,189 1,507 0.6884
??/Sinbyeong 96 56 0.5833
??/Sinjang 305 271 0.8885
??/Yongi 950 888 0.9347
??/Indo 367 273 0.7439
??/Jigu 939 918 0.9776
??/Jiwon 2,917 1,657 0.5680
SUM 11,352 8,637 0.7608
35. Results of the Auto Tagging Method in Data Fusion: Information Type
- All target words
No Information Type Total Correct Accuracy
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 2,064 1,856 0.8992
3 Antonym 336 290 0.8631
3 Related Terms 978 907 0.9274
4 Derived Words 1,078 1,071 0.9935
5 Definitions 5,219 2,891 0.5539
SUM 11,352 8,637 0.7608
36. Comparison of the Three Auto Tagging Methods
Auto Tagging Method Total Correct Accuracy
Dictionary Information-based Method 25,803 18,077 0.7006
Collocation Overlap-based Method 9,258 5,215 0.5633
Fusion Method 11,352 8,637 0.7608
37. Sense Classification in Data Fusion: Words
Word Total Correct Accuracy
??/Gamja 1,270 1,087 0.8559
??/Gyeonggi 37,763 32,128 0.8508
??/Gigan 15,803 13,055 0.8261
??/Sinbyeong 469 121 0.2580
??/Sinjang 953 702 0.7366
??/Yongi 5,147 4,437 0.8621
??/Indo 2,750 1,251 0.4547
??/Jigu 9,375 8,205 0.8752
??/Jiwon 21,321 11,251 0.5277
SUM 94,851 72,237 0.7616
38. Comparison of the Three WSD Methods
WSD Method Total Correct Accuracy Improvement (%)
Fusion Method 94,851 72,237 0.7616 -
Dictionary Information-based Method 94,851 64,604 0.6811 11.82
Collocation Overlap-based Method 94,851 58,891 0.6209 22.66
(Improvement: relative gain of the Fusion Method over each baseline)
39. Conclusion (1/2)
- The performance of the automatic tagging technique differed depending on the type of information source in the dictionary.
- For frequently used keywords extracted from the dictionary, a feature selection method needs to be applied.
40. Conclusion (2/2)
- The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to a supervised learning method that uses manual tagging information.
- The WSD model using a data fusion technique combining the two automatic tagging methods outperforms the models that use a single tagging method.
41. Q&amp;A