1. An Effective Word Sense Disambiguation Model
Using Automatic Sense Tagging Based on
Dictionary Information
- Yong-Gu Lee
- 2007-08-17
- yonggulee_at_hotmail.com
2. Contents
- Introduction
- Related Works
- Research Goals
- Effective Word Sense Disambiguation Model and Evaluation
- Conclusion
3. Introduction
- Word Sense Disambiguation (WSD)
  - The problem of selecting a sense for a word from a set of predefined possibilities.
  - An intermediate task, which is not an end in itself but is necessary at one level or another.
  - Obviously essential for language understanding applications:
    - Machine translation
    - Information retrieval and hypertext navigation
    - Content and thematic analysis
    - Speech processing and text processing
4. Related Works (1/3)
- Approaches to WSD
  - Knowledge-Based Disambiguation
    - Use of external lexical resources such as dictionaries and thesauri
    - Discourse properties
  - Corpus-Based Disambiguation
  - Hybrid Disambiguation
5. Related Works (2/3)
- Corpus-Based Disambiguation
  - Supervised Disambiguation
    - Based on a labeled training set
    - The learning system has a training set of feature-encoded inputs AND their appropriate sense labels (categories)
  - Unsupervised Disambiguation
    - Based on unlabeled corpora
    - The learning system has a training set of feature-encoded inputs but NOT their appropriate sense labels (categories)
6. Related Works (3/3)
- Lexical Resources for WSD
  - Machine-readable format
    - Machine Readable Dictionaries (MRDs): Longman, Oxford, etc.
    - Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc.
  - Sense-tagged data
    - Senseval-1, 2, 3 (www.senseval.org)
      - Provides sense-annotated data for many languages and several tasks
      - Languages: English, Romanian, Chinese, Basque, Spanish, etc.
      - Tasks: Lexical Sample, All Words, etc.
    - SemCor, Hector, etc.
7. Research Motivation
- Manual sense tagging
  - Labor-intensive and costly
- Limited availability of sense-tagged corpora
  - Apart from English, most languages have few corpora for WSD.
- Coverage of sense-tagged words
  - Some corpora tag the senses of only one or a few words.
    - The "line" corpus, the "interest" corpus, etc.
  - With a supervised disambiguation method, only the words that appear in the sense-tagged corpus can be disambiguated.
8. Research Goals
- Minimize or eliminate the cost of manual labeling
  - Automatic sense tagging using an MRD and heuristic rules
- Improve the performance of word sense disambiguation
  - Using supervised disambiguation
    - Naïve Bayes classifier
9. Effective Word Sense Disambiguation Model
- Automatic Tagging Technique
- Experimental Environment
- Evaluation of Automatic Tagging Technique
- Evaluation of Sense Classification
- Evaluation of Fusion Method
10. An Outline Diagram for the Proposed Research
[Flow diagram; component labels: Dictionary; Auto Sense Tagging Module (Sense Tagging, Collocation Extraction, Key Word Extraction); Automatic Sense Tagging and Training Collection; Training Set; Naïve Bayes Classifier; Test Context; Context Extraction of Target Word; Classify Word Sense; Evaluation]
11. Automatic Tagging Technique
- Dictionary Information-based Method
- Collocation Overlap-based Method
- Data Fusion Method
  - Dictionary Information-based Method + Collocation Overlap-based Method
12. Dictionary Information-based Method (1/2)
- Extract the necessary information from the dictionary.
- Heuristic 1: One Sense per Collocation / One Sense per Discourse
  - Telephone line, ????/Gyeonggi-jeonmang (economic prospect)
- Heuristic 2: Use of corresponding Chinese characters
  - ??/Gamja: ??(Potato) / ??(Reduction of capital)
- Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms
- Heuristic 4: Occurrence of derived words
13. Dictionary Information-based Method (2/2)
- Heuristic 5: Co-occurrence of key features extracted from the definition of the target word's entry, as in Lesk (1986)
- Algorithm
  - Retrieve from the MRD all sense definitions of the word to be disambiguated
  - Determine the overlap between each sense definition and the current context
  - Choose the sense that leads to the highest overlap
14. Collocation Overlap-based Method
- A semantic similarity metric using collocation overlap
- Algorithm
  1. Retrieve keywords from all MRD sense definitions of the word to be disambiguated
  2. Extract collocation words of the keywords from the test collection, up to a threshold
  3. Extract collocation words of the target word from the test collection
  4. Determine the overlap between the collocation word sets from steps 2 and 3
  5. Choose the sense that leads to the highest overlap
15. Feature Selection
- By document frequency
  - Test collection → docDF
  - Definitions treated as documents → dicDF
  - Thresholds: docDF < 5000, dicDF < 300
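A minimal sketch of this frequency-based filter; the function name and the dict-based frequency tables are assumptions:

```python
def select_features(terms, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep terms whose document frequency in the test collection (docDF)
    and in the dictionary definitions (dicDF) fall under the thresholds,
    dropping overly common, weakly discriminative words."""
    return [t for t in terms
            if doc_df.get(t, 0) < max_doc_df and dic_df.get(t, 0) < max_dic_df]
```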
16. Sense Classification: Naïve Bayes Classifier
- Source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
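The classifier follows the Naïve Bayes formulation for WSD in Manning and Schütze (1999): pick the sense s maximizing log P(s) + Σ log P(w | s) over the context words w. A sketch with add-one smoothing (the smoothing scheme is an assumption; the slide does not specify one):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Naive Bayes sense classifier: argmax_s [log P(s) + sum_w log P(w|s)]."""

    def fit(self, tagged_contexts):
        # tagged_contexts: list of (sense, [context words]) pairs
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sense, words in tagged_contexts:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(words)
            self.vocab.update(words)
        self.total = sum(self.sense_counts.values())
        return self

    def predict(self, words):
        best, best_score = None, float("-inf")
        v = len(self.vocab)
        for sense in self.sense_counts:
            # log prior + smoothed log likelihood of each context word
            score = math.log(self.sense_counts[sense] / self.total)
            n = sum(self.word_counts[sense].values())
            for w in words:
                score += math.log((self.word_counts[sense][w] + 1) / (n + v))
            if score > best_score:
                best, best_score = sense, score
        return best
```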
17. Experimental Environment (1/2)
- Test collection
  - Includes all the articles (127,641) in three Korean daily newspapers for the year 2004
  - A part-of-speech tagger and lexical analysis are applied
- Evaluation
  - Accuracy
18. Target Words for WSD
Word No. of Senses No. of Articles Total Frequency
??/Gamja 2 622 1,115
??/Gyeonggi 4 18,484 37,763
??/Gigan 2 11,255 15,803
??/Sinbyeong 3 360 469
??/Sinjang 4 703 952
??/Yongi 4 3,227 5,147
??/Indo 5 2,022 2,750
??/Jigu 2 4,017 9,372
??/Jiwon 3 12,577 21,320
19. Evaluation of Automatic Sense Tagging
- Dictionary Information-based Method
- By rule (all target words)
No Information Type Total Correct Accuracy
1 Collocation 3,229 2,931 0.9077
2 Chinese characters 74 74 1.0000
3 Synonym 2,107 1,598 0.7584
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 128,520 60,810 0.5091
SUM 136,091 67,470 0.4958
20. Results of Feature Selection: Words
- All information types
Word Total Correct Accuracy
??/Gamja 802 800 0.9975
??/Gyeonggi 6,200 4,833 0.7795
??/Gigan 2,128 1,271 0.5973
??/Sinbyeong 299 265 0.8863
??/Sinjang 653 471 0.7213
??/Yongi 4,732 4,169 0.8810
??/Indo 3,956 2,274 0.5748
??/Jigu 2,207 2,124 0.9624
??/Jiwon 4,826 1,870 0.3875
SUM 25,803 18,077 0.7006
21. Results of Feature Selection: Rule
- All target words
No Information Type Total Correct Accuracy
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 1,650 1,556 0.9430
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 20,315 12,842 0.6321
SUM 25,803 18,077 0.7006
22. (No transcript: chart slide)
23. Evaluation of Automatic Sense Tagging
- Collocation Overlap-based Method
- Performance by threshold
Rank Total Correct Accuracy
Top10 6,155 3,727 0.6055
Top30 9,258 5,215 0.5633
Top50 11,544 6,264 0.5426
Top100 13,432 6,751 0.5026
All 19,436 7,796 0.4009
24. (No transcript: chart slide)
25. Auto Tagging Results of Top 30: Words
Word Main Source Total Correct Accuracy
??/Gamja Definitions 273 251 0.9194
??/Gyeonggi Definitions 3,540 2,951 0.8336
??/Gigan Synonym, Definitions 1,205 365 0.3029
??/Sinbyeong Definitions 112 67 0.5982
??/Sinjang Definitions 101 77 0.7624
??/Yongi Definitions 520 435 0.8365
??/Indo Antonym, Definitions 277 195 0.7040
??/Jigu Definitions 609 546 0.8966
??/Jiwon Related Terms, Definitions 2,621 328 0.1251
Sum 9,258 5,215 0.5633
26. Auto Tagging Results of Top 30: Information Type
- All target words
Information Type Total Correct Accuracy
Synonym 544 402 0.7390
Antonym 129 119 0.9225
Related Terms 230 166 0.7217
Definitions 8,355 4,528 0.5420
SUM 9,258 5,215 0.5633
27. Comparison of the Two Auto Tagging Methods
28. Building a Classifier
- Training set: 600 instances
- Test set: the remaining instances
- Window size: 50-byte length
- Rule for building the training set
  - The automatic sense tagging contains errors.
  - To reduce errors and improve the tagging accuracy of the training set, information types with high accuracy are used first.
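One way to realize this rule is to draw training examples from the information types in descending order of their observed tagging accuracy (the ordering below follows the per-rule accuracies on slide 21; the function and data layout are illustrative):

```python
# Information types ordered by observed tagging accuracy (slide 21):
# Chinese characters (1.0000), derived words (0.9935), collocation (0.9657),
# synonym (0.9430), related terms (0.9350), antonym (0.8228), definitions (0.6321).
PRIORITY = ["chinese", "derived", "collocation", "synonym",
            "related", "antonym", "definitions"]

def build_train_set(tagged, size=600):
    """Assemble the training set from auto-tagged examples, drawing first
    from the most reliably tagged information types.
    `tagged` maps information type -> list of (context, sense) pairs."""
    train = []
    for info_type in PRIORITY:
        for example in tagged.get(info_type, []):
            if len(train) == size:
                return train
            train.append(example)
    return train
```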
29. Sense Classification: Dictionary Information-based Method
- All information types
Word Total Correct Accuracy
??/Gamja 1,270 1,139 0.8966
??/Gyeonggi 37,763 30,897 0.8182
??/Gigan 15,803 9,278 0.5871
??/Sinbyeong 469 386 0.8230
??/Sinjang 953 671 0.7043
??/Yongi 5,147 4,302 0.8359
??/Indo 2,750 1,212 0.4408
??/Jigu 9,375 8,373 0.8932
??/Jiwon 21,321 8,345 0.3914
SUM 94,851 64,604 0.6811
30. Sense Classification: Collocation Overlap-based Method
Rank Total Correct Accuracy
Top10 94,851 57,201 0.6031
Top30 94,851 58,891 0.6209
Top50 94,851 56,871 0.5996
Top100 94,851 56,916 0.6001
All 94,851 53,218 0.5611
31. Sense Classification: Collocation Overlap-based Method (Words)
Word Total Correct Accuracy
??/Gamja 1,270 1,016 0.8000
??/Gyeonggi 37,763 30,787 0.8153
??/Gigan 15,803 10,239 0.6479
??/Sinbyeong 469 290 0.6183
??/Sinjang 953 622 0.6527
??/Yongi 5,147 2,834 0.5507
??/Indo 2,750 1,683 0.6118
??/Jigu 9,375 8,499 0.9065
??/Jiwon 21,321 2,922 0.1371
SUM 94,851 58,891 0.6209
32. Comparison of the Two Sense Classifications
33. Data Fusion of the Two Auto Tagging Methods
- Dictionary Information-based Method: using all information types except definitions
- Collocation Overlap-based Method: using only the Top10 information
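One plausible reading of this fusion scheme, with illustrative names: keep dictionary-based tags from every information type except definitions, then fill the remaining occurrences with Top10 collocation-overlap tags:

```python
def fuse_tags(dict_tags, colloc_tags):
    """Fuse the two auto-tagging methods. Each argument maps an
    occurrence id -> (sense, information type)."""
    # Dictionary-based tags, excluding the low-accuracy definitions type.
    fused = {occ: (sense, info)
             for occ, (sense, info) in dict_tags.items()
             if info != "definitions"}
    # Collocation-overlap (Top10) tags fill in untagged occurrences only.
    for occ, (sense, info) in colloc_tags.items():
        fused.setdefault(occ, (sense, info))
    return fused
```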
34. Results of the Auto Tagging Method in Data Fusion: Words
Word Total Correct Accuracy
??/Gamja 503 485 0.9642
??/Gyeonggi 3,086 2,582 0.8367
??/Gigan 2,189 1,507 0.6884
??/Sinbyeong 96 56 0.5833
??/Sinjang 305 271 0.8885
??/Yongi 950 888 0.9347
??/Indo 367 273 0.7439
??/Jigu 939 918 0.9776
??/Jiwon 2,917 1,657 0.5680
SUM 11,352 8,637 0.7608
35. Results of the Auto Tagging Method in Data Fusion: Information Type
- All target words
No Information Type Total Correct Accuracy
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 2,064 1,856 0.8992
3 Antonym 336 290 0.8631
3 Related Terms 978 907 0.9274
4 Derived Words 1,078 1,071 0.9935
5 Definitions 5,219 2,891 0.5539
SUM 11,352 8,637 0.7608
36. Comparison of the Three Auto Tagging Methods
Auto Tagging Method Total Correct Accuracy
Dictionary Information-based Method 25,803 18,077 0.7006
Collocation Overlap-based Method 9,258 5,215 0.5633
Fusion Method 11,352 8,637 0.7608
37. Sense Classification in Data Fusion: Words
Word Total Correct Accuracy
??/Gamja 1,270 1,087 0.8559
??/Gyeonggi 37,763 32,128 0.8508
??/Gigan 15,803 13,055 0.8261
??/Sinbyeong 469 121 0.2580
??/Sinjang 953 702 0.7366
??/Yongi 5,147 4,437 0.8621
??/Indo 2,750 1,251 0.4547
??/Jigu 9,375 8,205 0.8752
??/Jiwon 21,321 11,251 0.5277
SUM 94,851 72,237 0.7616
38. Comparison of the Three WSD Methods
WSD Method Total Correct Accuracy Improvement (%)
Fusion Method 94,851 72,237 0.7616 -
Dictionary Information-based Method 94,851 64,604 0.6811 11.82
Collocation Overlap-based Method 94,851 58,891 0.6209 22.66
(Improvement: relative gain of the Fusion Method over each baseline)
39. Conclusion (1/2)
- The performance of the automatic tagging technique differed depending on the type of information source in the dictionary.
- For frequently used keywords extracted from the dictionary, a feature selection method needs to be applied.
40. Conclusion (2/2)
- The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to a supervised learning method that uses manual tagging information.
- The WSD model using a data fusion technique combining the two automatic tagging methods outperforms the models that use a single tagging method.
41. Q&amp;A