Transcript and Presenter's Notes

Title: An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information


1
An Effective Word Sense Disambiguation Model
Using Automatic Sense Tagging Based on
Dictionary Information
  • Yong-Gu Lee
  • 2007-08-17
  • yonggulee_at_hotmail.com

2
Contents
  • Introduction
  • Related Works
  • Research Goals
  • Effective Word Sense Disambiguation Model and
    Evaluation
  • Conclusion

3
Introduction
  • Word Sense Disambiguation (WSD)
  • The problem of selecting a sense for a word from
    a set of predefined possibilities.
  • An intermediate task that is not an end in
    itself, but is necessary at one level or
    another.
  • Obviously essential for language understanding
    applications.
  • Machine translation
  • Information retrieval and hypertext navigation
  • Content and thematic analysis
  • Speech processing and text processing

4
Related Works (1/3)
  • Approaches to WSD
  • Knowledge-Based Disambiguation
  • use of external lexical resources such as
    dictionaries and thesauri
  • discourse properties
  • Corpus-based Disambiguation
  • Hybrid Disambiguation

5
Related Works (2/3)
  • Corpus-based Disambiguation
  • Supervised Disambiguation
  • based on a labeled training set
  • the learning system has
  • a training set of feature-encoded inputs AND
  • their appropriate sense label (category)
  • Unsupervised Disambiguation
  • based on unlabeled corpora
  • The learning system has
  • a training set of feature-encoded inputs BUT
  • NOT their appropriate sense label (category)

6
Related Works (3/3)
  • Lexical Resources for WSD
  • Machine-readable formats
  • Machine Readable Dictionaries (MRD): Longman,
    Oxford, etc.
  • Thesauri and semantic networks: Roget's Thesaurus,
    WordNet, etc.
  • Sense-tagged data
  • Senseval-1, 2, 3 (www.senseval.org)
  • Provides sense-annotated data for many languages
    and for several tasks
  • Languages: English, Romanian, Chinese, Basque,
    Spanish, etc.
  • Tasks: Lexical Sample, All Words, etc.
  • SemCor, Hector, etc.

7
Research Motivation
  • Manual sense tagging
  • Labor-intensive and costly
  • Limited availability of sense-tagged corpora
  • Apart from English, most languages have few
    sense-tagged corpora for WSD.
  • Coverage of sense-tagged words
  • Some corpora tag the senses of only one or a few
    words.
  • Line corpus, interest corpus, etc.
  • With a supervised disambiguation method, only the
    words that appear in the sense-tagged corpus can
    be disambiguated.

8
Research Goals
  • Minimize or eliminate the cost of manual
    labeling.
  • Automatic sense tagging using MRD and heuristic
    rules
  • Improve the performance of word sense
    disambiguation.
  • Using supervised disambiguation
  • Naïve Bayes classifier

9
Effective Word Sense Disambiguation Model
  • Automatic Tagging Technique
  • Experimental Environment
  • Evaluation of Automatic Tagging Technique
  • Evaluation of Sense Classification
  • Evaluation of Fusion Method

10
An Outline Diagram for the Proposed Research
[Flow diagram omitted; box labels: Sense Classification; Automatic Sense Tagging and Training; Collection; Collocation Extraction; Test Context; Context Extraction of Target Word; Classify Word Sense; Sense Tagging; Auto Sense Tagging Module; Naïve Bayes Classifier; Training Set; Key Word Extraction; Evaluation; Dictionary]
11
Automatic Tagging Technique
  • Dictionary Information-based Method
  • Collocation Overlap-based Method
  • Data Fusion Method: Dictionary Information-based
    Method + Collocation Overlap-based Method

12
Dictionary Information-based Method (1/2)
  • Extract the necessary information from the
    dictionary and apply the heuristics below (a
    minimal sketch follows this list).
  • Heuristic 1: One Sense per Collocation / One
    Sense per Discourse
  • e.g., telephone line; Gyeonggi-jeonmang (economic
    prospect)
  • Heuristic 2: Use of corresponding Chinese
    characters
  • e.g., Gamja: potato / reduction of capital (減資)
  • Heuristic 3: Co-occurrence of synonyms, antonyms,
    and related terms
  • Heuristic 4: Occurrence of derived words
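A minimal sketch of how Heuristics 1-4 might be applied as automatic tagging rules: each rule is tried in turn on a context of the target word and either returns a sense or passes. The rule tables, entries, and helper names below are illustrative assumptions, not the author's implementation.

```python
# Minimal sketch: apply dictionary-derived heuristic rules to a context string.
# Rule contents are hypothetical examples, not taken from the slides.
def tag_with_heuristics(context, rules):
    """rules: ordered list of callables, each mapping a context string to a sense or None."""
    for rule in rules:
        sense = rule(context)
        if sense is not None:
            return sense              # first rule that fires decides the sense tag
    return None                       # otherwise leave the occurrence untagged

# Heuristic 1: one sense per collocation, as a lookup of known collocations.
COLLOCATION_SENSES = {"telephone line": "line_wire"}       # hypothetical entry

def h1_collocation(context):
    return next((s for c, s in COLLOCATION_SENSES.items() if c in context), None)

# Heuristic 2: corresponding Chinese characters appearing in the context.
HANJA_SENSES = {"減資": "gamja_capital_reduction"}          # hypothetical entry

def h2_chinese_characters(context):
    return next((s for h, s in HANJA_SENSES.items() if h in context), None)

print(tag_with_heuristics("the telephone line was repaired",
                          [h1_collocation, h2_chinese_characters]))
```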

13
Dictionary Information-based Method (2/2)
  • Heuristic 5: Co-occurrence of key features
    extracted from the definitions of the target
    word's entry, as in Lesk (1986).
  • Algorithm (a minimal sketch follows)
  • Retrieve from the MRD all sense definitions of
    the word to be disambiguated
  • Determine the overlap between each sense
    definition and the current context
  • Choose the sense that leads to the highest
    overlap
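The following is a minimal sketch of this Lesk-style overlap algorithm. The function name, whitespace tokenization, and the example senses are assumptions for illustration, not the author's code or data.

```python
# Minimal sketch of Heuristic 5: pick the sense whose MRD definition shares
# the most words with the current context (simplified Lesk, 1986).
def simplified_lesk(sense_definitions, context, stopwords=frozenset()):
    """sense_definitions: {sense label: definition text}; returns the best-overlapping sense."""
    context_words = {w for w in context.lower().split() if w not in stopwords}
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        definition_words = {w for w in definition.lower().split() if w not in stopwords}
        overlap = len(definition_words & context_words)   # count shared word types
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical English stand-in for the Korean target words:
senses = {
    "line/wire": "a telephone wire or cable used for communication",
    "line/mark": "a long narrow mark drawn on a surface",
}
print(simplified_lesk(senses, "the telephone line went dead during the storm"))  # -> line/wire
```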

14
Collocation Overlap-based Method
  • Semantic similarity metric using collocation
    overlap
  • Algorithm (a minimal sketch of the overlap step
    follows)
  • Retrieve keywords from the MRD sense definitions
    of the word to be disambiguated
  • Extract collocation words of those keywords from
    the test collection, cut off by a frequency
    threshold
  • Extract collocation words of the target word from
    the test collection
  • Determine the overlap between the two sets of
    collocation words (from steps 2 and 3)
  • Choose the sense that leads to the highest
    overlap
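A minimal sketch of the overlap and selection steps (4-5 above), assuming the collocation sets from steps 2-3 have already been extracted. The data structures and names are illustrative assumptions.

```python
# Minimal sketch: choose the sense whose definition-keyword collocates
# overlap most with the collocates of the target word itself.
def collocation_overlap_sense(target_collocates, sense_collocates):
    """
    target_collocates: set of collocates of the target word in the test collection.
    sense_collocates: {sense label: set of collocates of that sense's definition keywords}.
    """
    best_sense, best_overlap = None, -1
    for sense, collocates in sense_collocates.items():
        overlap = len(collocates & target_collocates)     # shared collocation words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```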

15
Feature Selection
  • By document frequency
  • Test Collection -> docDF
  • Definitions treated as documents -> dicDF
  • Selection thresholds: docDF < 5000 and dicDF < 300
    (a minimal sketch follows)
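A minimal sketch of this document-frequency filter. The thresholds (docDF < 5000, dicDF < 300) come from the slide; applying them jointly, and all function and variable names, are my assumptions.

```python
# Minimal sketch: keep dictionary keywords whose document frequency is below the
# thresholds in both the test collection (docDF) and the definition "documents" (dicDF).
from collections import Counter

def document_frequency(documents):
    """documents: iterable of token lists -> Counter mapping word -> number of documents containing it."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df

def select_features(doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep keywords that stay under both document-frequency thresholds."""
    return {w for w in dic_df if doc_df.get(w, 0) < max_doc_df and dic_df[w] < max_dic_df}
```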

16
Sense Classification: Naïve Bayes Classifier
  • Algorithm (a minimal sketch follows the source
    note)

Source: Manning and Schütze. 1999. Foundations
of Statistical Natural Language Processing.
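A minimal sketch of a Naïve Bayes sense classifier in the spirit of the Manning and Schütze formulation cited above: choose the sense maximizing log P(sense) plus the sum of log P(word | sense) over the context words. The add-one smoothing and all names here are my assumptions, not necessarily the author's settings.

```python
# Minimal sketch of the Naive Bayes decision rule for WSD:
#   s* = argmax_s [ log P(s) + sum_{w in context} log P(w | s) ]
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    def __init__(self):
        self.sense_counts = Counter()             # training contexts per sense
        self.word_counts = defaultdict(Counter)   # word frequencies per sense
        self.vocab = set()

    def train(self, tagged_contexts):
        """tagged_contexts: iterable of (sense label, list of context words)."""
        for sense, words in tagged_contexts:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(words)
            self.vocab.update(words)

    def classify(self, context_words):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense in self.sense_counts:
            score = math.log(self.sense_counts[sense] / total)      # log prior
            n = sum(self.word_counts[sense].values())
            for w in context_words:
                # add-one (Laplace) smoothed likelihood
                score += math.log((self.word_counts[sense][w] + 1) / (n + len(self.vocab)))
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense
```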
17
Experimental Environment (1/2)
  • Test Collection
  • Includes all the articles (127,641) in three
    Korean daily newspapers for the year 2004
  • Processed with a part-of-speech tagger and
    lexical analysis
  • Evaluation
  • Accuracy
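Here, accuracy is the fraction of instances whose automatically assigned sense is correct, i.e.

\[ \mathrm{Accuracy} = \frac{\text{number of correctly sense-tagged instances}}{\text{total number of instances}} \]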

18
Target Word for WSD
Word No. of Senses No. of Articles Total Frequency
Gamja 2 622 1,115
Gyeonggi 4 18,484 37,763
Gigan 2 11,255 15,803
Sinbyeong 3 360 469
Sinjang 4 703 952
Yongi 4 3,227 5,147
Indo 5 2,022 2,750
Jigu 2 4,017 9,372
Jiwon 3 12,577 21,320
19
Evaluation of Automatic Sense Tagging
  • Dictionary Information-based Method
  • By Rule

No. Information Type Total Correct Accuracy (all target words)
1 Collocation 3,229 2,931 0.9077
2 Chinese characters 74 74 1.0000
3 Synonym 2,107 1,598 0.7584
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 128,520 60,810 0.5091
SUM 136,091 67,470 0.4958
20
Results of Feature Selection - Words
Word Total Correct Accuracy (all information types)
Gamja 802 800 0.9975
Gyeonggi 6,200 4,833 0.7795
Gigan 2,128 1,271 0.5973
Sinbyeong 299 265 0.8863
Sinjang 653 471 0.7213
Yongi 4,732 4,169 0.8810
Indo 3,956 2,274 0.5748
Jigu 2,207 2,124 0.9624
Jiwon 4,826 1,870 0.3875
SUM 25,803 18,077 0.7006
21
Results of Feature Selection - Rule
No. Information Type Total Correct Accuracy (all target words)
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 1,650 1,556 0.9430
3 Antonym 237 195 0.8228
3 Related Terms 846 791 0.9350
4 Derived Words 1,078 1,071 0.9935
5 Definitions 20,315 12,842 0.6321
SUM 25,803 18,077 0.7006
22
(No Transcript)
23
Evaluation of Automatic Sense Tagging
  • Collocation Overlap-based Method
  • Performance by threshold

Rank Total Correct Accuracy
Top10 6,155 3,727 0.6055
Top30 9,258 5,215 0.5633
Top50 11,544 6,264 0.5426
Top100 13,432 6,751 0.5026
All 19,436 7,796 0.4009
24
(No Transcript)
25
Auto Tagging Result of Top 30
  • By Target Words

Word Main Source Total Correct Accuracy
Gamja Definitions 273 251 0.9194
Gyeonggi Definitions 3,540 2,951 0.8336
Gigan Synonym, Definitions 1,205 365 0.3029
Sinbyeong Definitions 112 67 0.5982
Sinjang Definitions 101 77 0.7624
Yongi Definitions 520 435 0.8365
Indo Antonym, Definitions 277 195 0.7040
Jigu Definitions 609 546 0.8966
Jiwon Related Words, Definitions 2,621 328 0.1251
Sum   9,258 5,215 0.5633
26
Auto Tagging Result of Top 30
  • By Information type

Information Type Total Correct Accuracy (all target words)
Synonym 544 402 0.7390
Antonym 129 119 0.9225
Related Terms 230 166 0.7217
Definitions 8,355 4,528 0.5420
SUM 9,258 5,215 0.5633
27
Comparison of Two Auto Tagging Methods
28
Build a Classifier
  • Train set: 600 instances
  • Test set: the remaining instances
  • Window size: 50-byte length
  • Rule for building the train set (a minimal sketch
    follows)
  • The automatic sense tagging contains errors.
  • To reduce errors and improve the tagging accuracy
    of the train set, information types with high
    accuracy are used first.
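A minimal sketch of this training-set construction rule: sort the automatically tagged instances so that information types with higher tagging accuracy come first, take the first 600 as the training set, and extract a roughly 50-byte context window around each target occurrence. The priority order shown and all helper names are assumptions for illustration.

```python
# Minimal sketch: build the training set from automatically tagged instances,
# preferring information types whose tagging accuracy was high.
ASSUMED_PRIORITY = ["Chinese characters", "Derived Words", "Collocation",
                    "Synonym", "Related Terms", "Antonym", "Definitions"]  # assumed order
PRIORITY = {t: i for i, t in enumerate(ASSUMED_PRIORITY)}

def context_window(text, byte_offset, width=50):
    """Return ~`width` bytes of text on each side of the target word (byte-based, per the slide)."""
    raw = text.encode("utf-8")
    start = max(0, byte_offset - width)
    return raw[start:byte_offset + width].decode("utf-8", errors="ignore")

def build_train_set(tagged_instances, train_size=600):
    """tagged_instances: list of dicts with keys 'info_type', 'sense', 'context'."""
    ranked = sorted(tagged_instances, key=lambda x: PRIORITY.get(x["info_type"], len(PRIORITY)))
    return ranked[:train_size], ranked[train_size:]   # train set, remaining instances
```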

29
Sense Classification - Dictionary
Information-based Method
Word Total Correct Accuracy (all information types)
Gamja 1,270 1,139 0.8966
Gyeonggi 37,763 30,897 0.8182
Gigan 15,803 9,278 0.5871
Sinbyeong 469 386 0.8230
Sinjang 953 671 0.7043
Yongi 5,147 4,302 0.8359
Indo 2,750 1,212 0.4408
Jigu 9,375 8,373 0.8932
Jiwon 21,321 8,345 0.3914
SUM 94,851 64,604 0.6811
30
Sense Classification - Collocation Overlap-based
Method
  • By rank

Rank Total Correct Accuracy
Top10 94,851 57,201 0.6031
Top30 94,851 58,891 0.6209
Top50 94,851 56,871 0.5996
Top100 94,851 56,916 0.6001
All 94,851 53,218 0.5611
31
Sense Classification - Collocation Overlap-based
Method
  • By target words

Word Total Correct Accuracy
Gamja 1,270 1,016 0.8000
Gyeonggi 37,763 30,787 0.8153
Gigan 15,803 10,239 0.6479
Sinbyeong 469 290 0.6183
Sinjang 953 622 0.6527
Yongi 5,147 2,834 0.5507
Indo 2,750 1,683 0.6118
Jigu 9,375 8,499 0.9065
Jiwon 21,321 2,922 0.1371
SUM 94,851 58,891 0.6209
32
Comparison of Two Sense Classifications
33
Data Fusion of Two Auto Tagging Methods
  • Dictionary Information-based Method: use all the
    information types except definitions
  • Collocation Overlap-based Method: use only the
    Top 10 results (a minimal sketch of the fusion
    follows)
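A minimal sketch of this fusion rule: keep the dictionary-based taggings from every information type except definitions, and add the collocation overlap-based taggings restricted to the Top 10. The field names are illustrative assumptions.

```python
# Minimal sketch of the data fusion of the two automatic tagging methods.
def fuse_taggings(dict_tagged, colloc_tagged_top10):
    """
    dict_tagged: taggings from the dictionary information-based method,
                 each a dict with an 'info_type' key.
    colloc_tagged_top10: taggings from the collocation overlap-based method,
                         already restricted to the Top 10 results.
    """
    fused = [t for t in dict_tagged if t["info_type"] != "Definitions"]
    fused.extend(colloc_tagged_top10)
    return fused
```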

34
Results of the Auto Tagging Method in Data Fusion
- Words
Word Total Correct Accuracy
Gamja 503 485 0.9642
Gyeonggi 3,086 2,582 0.8367
Gigan 2,189 1,507 0.6884
Sinbyeong 96 56 0.5833
Sinjang 305 271 0.8885
Yongi 950 888 0.9347
Indo 367 273 0.7439
Jigu 939 918 0.9776
Jiwon 2,917 1,657 0.5680
SUM 11,352 8,637 0.7608
35
Results of the Auto Tagging Method in Data Fusion
- Information Type
No. Information Type Total Correct Accuracy (all target words)
1 Collocation 1,603 1,548 0.9657
2 Chinese characters 74 74 1.0000
3 Synonym 2,064 1,856 0.8992
3 Antonym 336 290 0.8631
3 Related Terms 978 907 0.9274
4 Derived Words 1,078 1,071 0.9935
5 Definitions 5,219 2,891 0.5539
SUM 11,352 8,637 0.7608
36
Comparison of the Three Auto Tagging Methods
Auto Tagging Method Total Correct Accuracy
Dictionary Information-based Method 25,803 18,077 0.7006
Collocation Overlap-based Method 9,258 5,215 0.5633
Fusion Method 11,352 8,637 0.7608
37
Sense Classification in Data Fusion - Words
Word Total Correct Accuracy
Gamja 1,270 1,087 0.8559
Gyeonggi 37,763 32,128 0.8508
Gigan 15,803 13,055 0.8261
Sinbyeong 469 121 0.2580
Sinjang 953 702 0.7366
Yongi 5,147 4,437 0.8621
Indo 2,750 1,251 0.4547
Jigu 9,375 8,205 0.8752
Jiwon 21,321 11,251 0.5277
SUM 94,851 72,237 0.7616
38
Comparison of Three WSD Methods
WSD Method Total Correct Accuracy Improvement (%)
Fusion Method 94,851 72,237 0.7616 -
Dictionary Information-based Method 94,851 64,604 0.6811 11.82
Collocation Overlap-based Method 94,851 58,891 0.6209 22.66
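The Improvement column appears to be the relative gain of the fusion method over each baseline accuracy, e.g.

\[ \frac{0.7616 - 0.6811}{0.6811} \approx 0.1182 \;(11.82\%), \qquad \frac{0.7616 - 0.6209}{0.6209} \approx 0.2266 \;(22.66\%) \]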
39
Conclusion(1/2)
  • The performance of the automatic tagging
    technique differed depending on the type of
    information source in the dictionary.
  • For frequently used keywords extracted from the
    dictionary, a feature selection method needs to
    be applied.

40
Conclusion(2/2)
  • The word sense disambiguation model using the
    automatic tagging method based on dictionary
    information showed performance comparable to a
    supervised learning method trained on manually
    tagged data.
  • The WSD model using a data fusion technique that
    combines the two automatic tagging methods
    outperforms the models using a single tagging
    method.

41
Q & A