The Interaction of Knowledge Sources in WSD
(Transcript and Presenter's Notes)
1
The Interaction of Knowledge Sources in WSD
Mark Stevenson and Yorick Wilks
  • Davis Zhou
  • College of Information Science and Technology
    Drexel University

2
Outline
  • Overview
  • Word Sense in LDOCE
  • Algorithm
  • Part of Speech
  • Dictionary Definition Overlap
  • Selectional Preferences
  • Subject Codes
  • Feature Extractor
  • Combining Disambiguation Modules
  • Evaluation

3
Overview
  • WSD (word sense disambiguation)
  • What kind of context information can be used?
  • How can this information be modeled?

4
Overview (Cont.)
  • Architecture
  • Preprocessing
  • Feature Extractor
  • One filter tagger module: removes some unlikely
    senses.
  • Three partial tagger modules: provide evidence for
    the most likely senses.
  • One combination module (sketched below)
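Below is a minimal sketch of how such a pipeline could be wired together. The function names (pos_filter, partial_taggers, combine) are hypothetical; the slides do not give an API.

```python
# Minimal sketch, assuming hypothetical callables: one filter removes
# unlikely senses, the partial taggers each contribute evidence, and a
# combination step picks the final sense.
def disambiguate(word, context, senses, pos_filter, partial_taggers, combine):
    # Filter tagger module: keep only senses the filter judges possible.
    candidates = [s for s in senses if pos_filter(word, context, s)]
    if not candidates:
        candidates = list(senses)  # fall back rather than eliminate everything

    # Partial tagger modules: each returns a piece of evidence per sense.
    evidence = {s: [tagger(word, context, s) for tagger in partial_taggers]
                for s in candidates}

    # Combination module: merge the evidence into one chosen sense.
    return combine(evidence)
```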

5
Overview (Cont.)
  • Results
  • Disambiguate all content words in text
  • Use senses defined in LDOCE rather than WordNet
  • Two corpora: SEMCOR and SENSUS
  • Fine-grained sense level: 90% precision
    (monosemous words excluded)
  • Coarse-grained homograph level: 94% precision
    (monohomographic words excluded)

6
Sense in LDOCE
  • Homograph
  • Sets of senses with related meanings.
  • Monohomographic (88%) and polyhomographic (12%)
  • One homograph may contain more than one part of
    speech (only 2% of words in LDOCE contain a
    homograph with multiple POS)
  • Sense
  • 34% of words are polysemous

7
Sense in LDOCE (Cont.)
  • Example: one homograph of bank (illustrated below)
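The slide's definition table is not reproduced in this transcript. As a rough illustration only (invented glosses, not actual LDOCE content), an entry can be pictured as a list of homographs, each with one part of speech and an ordered list of senses:

```python
# Rough illustration only -- not actual LDOCE content or markup.
bank_entry = [
    {"homograph": 1, "pos": "noun",
     "senses": ["land along the side of a river or lake",
                "earth heaped up in a field or garden"]},
    {"homograph": 2, "pos": "noun",
     "senses": ["a place where money is kept and paid out on demand",
                "a place where something is stored for later use"]},
    {"homograph": 3, "pos": "verb",
     "senses": ["to put or keep money in a bank"]},
]

# 'bank' is polyhomographic because it has more than one homograph.
assert len(bank_entry) > 1
```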

8
Part of Speech
  • Disambiguate Polyhomographic Words
  • 88% of words can be disambiguated to the homograph
    level since they don't have 2 homographs with the
    same POS.
  • An additional 7% can be disambiguated if they are
    assigned a certain POS but not others (see the
    filtering sketch below).
  • Experiment
  • Wall Street Journal, 391 polyhomographic words
  • 87.4% precision, against 78% for the baseline
    approach (most frequently used sense).
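A minimal sketch of the POS filtering idea, assuming the dictionary-style homograph records from the earlier illustration (not the paper's data structures):

```python
def filter_by_pos(homographs, assigned_pos):
    """Keep only the homographs whose part of speech matches the POS
    assigned by the tagger; if none match (a POS error), keep them all."""
    surviving = [h for h in homographs if h["pos"] == assigned_pos]
    return surviving or list(homographs)

# If 'bank' is tagged as a verb, only the verb homograph survives, so the
# word is fully disambiguated at the homograph level.
homographs = [{"homograph": 1, "pos": "noun"},
              {"homograph": 2, "pos": "noun"},
              {"homograph": 3, "pos": "verb"}]
print(filter_by_pos(homographs, "verb"))  # [{'homograph': 3, 'pos': 'verb'}]
```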

9
Part of Speech (Cont.)
  • Further Analysis
  • Full Disambiguation
  • Partial Disambiguation
  • No Disambiguation
  • POS Error

10
Dictionary Definition Overlap
  • Basic idea
  • Using an overlap count of content words in
    dictionary definitions as a measure of semantic
    closeness (Lesk, 1986).
  • Optimization
  • Computing overlap using the simulated annealing
    optimization algorithm (Cowie et al., 1992): 47%
    precision at the sense level and 72% at the
    homograph level.
  • Improvement
  • Normalize the overlap count by dividing by the
    length of the definition (see the sketch below).
  • Don't use pragmatic codes.
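A simplified sketch of the overlap scoring with the normalization mentioned above; the simulated-annealing search over whole sense combinations is omitted, and the exact normalization used in the paper may differ.

```python
def normalized_overlap(candidate_def, context_def):
    """Overlap of words between two definitions, normalized by the
    candidate definition's length (one plausible normalization)."""
    a = set(candidate_def.lower().split())
    b = set(context_def.lower().split())
    return len(a & b) / max(len(a), 1)

def lesk_choose(candidate_defs, context_defs):
    """Pick the candidate sense whose definition overlaps most, summed
    over the definitions of the surrounding words."""
    return max(candidate_defs,
               key=lambda d: sum(normalized_overlap(d, c) for c in context_defs))
```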

11
Selectional Preferences
  • Background
  • LDOCE senses are marked with selectional
    restrictions expressed by 36 semantic codes.
  • Bruce and Guthrie (1992) manually identified the
    relations.

12
Selectional Preference (Cont.)
  • Mapping named entities onto LDOCE semantic codes.
  • Identification of the sentence sites of the
    relationship: subject, direct and indirect
    object, noun modified by an adjective (via a
    shallow syntactic parser)
  • A conservative algorithm returns, for ambiguous
    noun, verb, and adjective occurrences, the set of
    senses that satisfy the preferences imposed on
    them (sketched below).
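A rough sketch of that conservative behaviour. The LDOCE code hierarchy and the real satisfaction test are abstracted into a caller-supplied `satisfies` function, which is an assumption for illustration.

```python
def satisfying_senses(candidate_senses, restriction, satisfies):
    """Return the senses whose semantic code satisfies the restriction
    imposed at this sentence site (subject, object, modified noun, ...).
    Conservatively returns all senses if none satisfy it."""
    kept = [s for s in candidate_senses
            if satisfies(s["semantic_code"], restriction)]
    return kept or list(candidate_senses)

# Toy usage: a flat equality check stands in for the real code hierarchy.
senses = [{"sense": 1, "semantic_code": "Human"},
          {"sense": 2, "semantic_code": "Abstract"}]
print(satisfying_senses(senses, "Human", lambda code, r: code == r))
```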

13
Selectional Preference (Cont.)
14
Subject Codes
  • Basic idea
  • Use training data to find weighted indicative
    words for each sense of a word. Then calculate
    the score of a window around the target word.
    Identify the sense with maximum score as the
    sense of the word.
  • Yarowsky (1992)
  • Collect contexts which are representative of the
    Roget category
  • Grolier's Encyclopedia (1991) as the training corpus.

15
Subject Codes (Cont.)
  • Concordances of 100 surrounding words for each
    occurrence of each member of the Roget's head
    (sense)
  • Identify salient words in the collective context,
    and weight appropriately.

16
Subject Codes (Cont.)
  • Use the resulting weights to predict the
    appropriate category for a word in novel context
  • The paper's implementation
  • Use British National Corpus
  • Consider only words which appeared at least 10
    times in the training corpus.
  • Use LDOCE pragmatic codes as category
  • Return senses marked with the most likely
    pragmatic code (see the scoring sketch below).
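A compact sketch of the scoring step described over the last three slides: per-category word weights learned from training data are summed over a context window, and the highest-scoring category (here, an LDOCE pragmatic code) is returned. The weights and code names below are invented.

```python
from collections import defaultdict

def most_likely_code(window_words, salient_weights):
    """salient_weights: {pragmatic_code: {word: weight}}, learned from
    training data (only words seen at least 10 times are kept).
    Scores each code over the context window and returns the best one."""
    scores = defaultdict(float)
    for code, weights in salient_weights.items():
        for word in window_words:
            scores[code] += weights.get(word, 0.0)
    return max(scores, key=scores.get)

# Toy usage with invented weights for two pragmatic codes.
weights = {"EC": {"money": 2.1, "loan": 1.8},    # economics
           "GO": {"river": 2.3, "water": 1.5}}   # geography
print(most_likely_code("the loan from the money market".split(), weights))  # EC
```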

17
Collocation Extractor
  • A set of 10 collocates is extracted (extraction
    sketch below):
  • First word to the left
  • First word to the right
  • Second word to the left
  • Second word to the right
  • First noun to the left
  • First noun to the right
  • First verb to the left
  • First verb to the right
  • First adjective to the left
  • First adjective to the right
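A sketch of how these ten collocates could be pulled from a POS-tagged sentence; the Penn-style tag prefixes (NN, VB, JJ) and the helper are assumptions, not the paper's implementation.

```python
def extract_collocates(tokens, pos_tags, i):
    """Return the ten collocational features for the word at position i.
    tokens and pos_tags are parallel lists; the tag prefixes are assumed."""
    def nearest(side, pos_prefix):
        positions = range(i - 1, -1, -1) if side == "left" else range(i + 1, len(tokens))
        for j in positions:
            if pos_tags[j].startswith(pos_prefix):
                return tokens[j]
        return None

    return {
        "word-1": tokens[i - 1] if i >= 1 else None,
        "word-2": tokens[i - 2] if i >= 2 else None,
        "word+1": tokens[i + 1] if i + 1 < len(tokens) else None,
        "word+2": tokens[i + 2] if i + 2 < len(tokens) else None,
        "noun-left":  nearest("left",  "NN"),
        "noun-right": nearest("right", "NN"),
        "verb-left":  nearest("left",  "VB"),
        "verb-right": nearest("right", "VB"),
        "adj-left":   nearest("left",  "JJ"),
        "adj-right":  nearest("right", "JJ"),
    }
```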

18
TiMBL
  • Background
  • Daelemans et al., 1999
  • Used for various NLP tasks
  • Classification
  • Classifies new examples by comparing them against
    previously seen cases. The class of the most
    similar examples is assigned.

19
TiMBL (Cont.)
  • Use training data to determine the feature weights
    w_i in the weighted overlap distance
    Δ(X, Y) = Σ_i w_i · δ(x_i, y_i), where δ(x_i, y_i)
    is 0 if the two feature values match and 1
    otherwise.

20
TiMBL (Cont.)
  • Feature Vector
  • Head-word
  • Homograph number
  • Sense number
  • Rank of sense
  • Part of speech
  • Surface form of the head-word
  • Three partial taggers
  • Ten collocates (see the classification sketch below)
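A minimal sketch in the TiMBL spirit, not TiMBL's actual API: stored training cases are compared to a new feature vector with the weighted overlap distance from the previous slide, and the class of the nearest case is assigned. The sense labels and weights below are hypothetical.

```python
def weighted_overlap(x, y, weights):
    """Distance(X, Y) = sum_i w_i * delta(x_i, y_i), with delta = 0 for
    matching feature values and 1 otherwise (basic weighted overlap)."""
    return sum(w * (0 if a == b else 1) for w, a, b in zip(weights, x, y))

def classify(example, training_cases, weights):
    """Assign the class of the most similar stored training case (1-NN)."""
    return min(training_cases,
               key=lambda case: weighted_overlap(example, case[0], weights))[1]

# Toy usage: cases are (feature_vector, sense_label) pairs.
cases = [(("NN", "money"), "bank_2_1"),   # hypothetical sense labels
         (("NN", "river"), "bank_1_1")]
print(classify(("NN", "money"), cases, weights=[1.0, 2.0]))  # -> bank_2_1
```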

21
TiMBL (Cont.)
  • Examples

22
Evaluation
  • Testing Material
  • SEMCOR and SENSUS
  • Converting SEMCOR's WordNet senses to LDOCE senses:
    36,869 words tagged with LDOCE senses.
  • Occurrence of ambiguous words

23
Evaluation (Cont.)
  • Average polysemy: 14.62
  • Testing technique: 10-fold cross-validation
  • Evaluation metric: precision
  • Baseline: first-sense algorithm (evaluation sketch
    below)
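A small sketch of this evaluation setup; `train_and_tag` is a hypothetical stand-in for training the combined tagger on nine folds and tagging the tenth.

```python
def precision(predictions, gold):
    """Precision: correct answers divided by answers attempted."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    return sum(p == g for p, g in answered) / len(answered) if answered else 0.0

def ten_fold_precision(instances, train_and_tag):
    """instances: list of (features, gold_sense) pairs. Train on 9 folds,
    tag the held-out fold, and average precision over the 10 runs."""
    folds = [instances[i::10] for i in range(10)]
    scores = []
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        preds = train_and_tag(train, [feats for feats, _ in test])
        scores.append(precision(preds, [gold for _, gold in test]))
    return sum(scores) / len(scores)
```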

24
Evaluation (Cont.)
  • Contribution of POS
  • Accuracy reduced to 87.87% at the sense level and
    93.36% at the homograph level when the POS filter
    was removed.
  • Interaction of Knowledge Sources.

25
Comments
  • The context information the authors used includes
    only the previous word, next word, synonym, and
    hypernym/hyponym. The limited use of context
    contributes to the low recall and high precision.
  • Using clean topical signatures as the context may
    improve recall while keeping the high precision.
    The problem is that it is highly expensive to
    build a clean signature for each word.