The Interaction of Knowledge Sources in WSD
(Transcript and Presenter's Notes)
1
The Interaction of Knowledge Sources in WSD
Mark Stevenson and Yorick Wilks
  • Davis Zhou
  • College of Information Science and Technology
    Drexel University

2
Outline
  • Overview
  • Word Sense in LDOCE
  • Algorithm
  • Part of Speech
  • Dictionary Definition Overlap
  • Selectional Preferences
  • Subject Codes
  • Feature Extractor
  • Combining Disambiguation Modules
  • Evaluation

3
Overview
  • WSD (word sense disambiguation)
  • What kind of context information can be used?
  • How can this information be modeled?

4
Overview (Cont.)
  • Architecture
  • Preprocessing
  • Feature Extractor
  • One filter tagger module: removes some unlikely
    senses.
  • Three partial tagger modules: provide evidence for
    the most likely senses.
  • One combination module (sketched below)
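Below is a minimal sketch of how such a pipeline could be wired together. The function names (pos_filter, partial_taggers, combine) are hypothetical; the slides do not give an API.

```python
# Minimal sketch, assuming hypothetical callables: one filter removes
# unlikely senses, the partial taggers each contribute evidence, and a
# combination step picks the final sense.
def disambiguate(word, context, senses, pos_filter, partial_taggers, combine):
    # Filter tagger module: keep only senses the filter judges possible.
    candidates = [s for s in senses if pos_filter(word, context, s)]
    if not candidates:
        candidates = list(senses)  # fall back rather than eliminate everything

    # Partial tagger modules: each returns a piece of evidence per sense.
    evidence = {s: [tagger(word, context, s) for tagger in partial_taggers]
                for s in candidates}

    # Combination module: merge the evidence into one chosen sense.
    return combine(evidence)
```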

5
Overview (Cont.)
  • Results
  • Disambiguate all content words in text
  • Use senses defined in LDOCE rather than WordNet
  • Two corpora: SEMCOR and SENSUS
  • Fine-grained sense level: 90% precision
    (monosemous words excluded)
  • Coarse-grained homograph level: 94% precision
    (monohomographic words excluded)

6
Sense in LDOCE
  • Homograph
  • Sets of senses with related meanings.
  • Monohomographic (88%) and polyhomographic (12%)
  • One homograph may contain more than one part of
    speech (only 2% of words in LDOCE contain a
    homograph with multiple POS)
  • Sense
  • 34% of words are polysemous

7
Sense in LDOCE (Cont.)
  • Example: one homograph of bank (illustrated below)
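The slide's definition table is not reproduced in this transcript. As a rough illustration only (invented glosses, not actual LDOCE content), an entry can be pictured as a list of homographs, each with one part of speech and an ordered list of senses:

```python
# Rough illustration only -- not actual LDOCE content or markup.
bank_entry = [
    {"homograph": 1, "pos": "noun",
     "senses": ["land along the side of a river or lake",
                "earth heaped up in a field or garden"]},
    {"homograph": 2, "pos": "noun",
     "senses": ["a place where money is kept and paid out on demand",
                "a place where something is stored for later use"]},
    {"homograph": 3, "pos": "verb",
     "senses": ["to put or keep money in a bank"]},
]

# 'bank' is polyhomographic because it has more than one homograph.
assert len(bank_entry) > 1
```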

8
Part of Speech
  • Disambiguate Polyhomographic Words
  • 88% of words can be disambiguated to the homograph
    level since they don't have 2 homographs with the
    same POS.
  • An additional 7% can be disambiguated if they are
    assigned a certain POS but not others (see the
    filtering sketch below).
  • Experiment
  • Wall Street Journal, 391 polyhomographic words
  • 87.4% precision, against 78% for the baseline
    approach (most frequently used sense).
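A minimal sketch of the POS filtering idea, assuming the dictionary-style homograph records from the earlier illustration (not the paper's data structures):

```python
def filter_by_pos(homographs, assigned_pos):
    """Keep only the homographs whose part of speech matches the POS
    assigned by the tagger; if none match (a POS error), keep them all."""
    surviving = [h for h in homographs if h["pos"] == assigned_pos]
    return surviving or list(homographs)

# If 'bank' is tagged as a verb, only the verb homograph survives, so the
# word is fully disambiguated at the homograph level.
homographs = [{"homograph": 1, "pos": "noun"},
              {"homograph": 2, "pos": "noun"},
              {"homograph": 3, "pos": "verb"}]
print(filter_by_pos(homographs, "verb"))  # [{'homograph': 3, 'pos': 'verb'}]
```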

9
Part of Speech (Cont.)
  • Further Analysis
  • Full Disambiguation
  • Partial Disambiguation
  • No Disambiguation
  • POS Error

10
Dictionary Definition Overlap
  • Basic idea
  • Using an overlap count of content words in
    dictionary definitions as a measure of semantic
    closeness (Lesk, 1986).
  • Optimization
  • Computing overlap using the simulated annealing
    optimization algorithm (Cowie et al., 1992): 47%
    precision at the sense level and 72% at the
    homograph level.
  • Improvement
  • Normalize the overlap count by dividing by the
    length of the definition (see the sketch below).
  • Don't use pragmatic codes.
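A simplified sketch of the overlap scoring with the normalization mentioned above; the simulated-annealing search over whole sense combinations is omitted, and the exact normalization used in the paper may differ.

```python
def normalized_overlap(candidate_def, context_def):
    """Overlap of words between two definitions, normalized by the
    candidate definition's length (one plausible normalization)."""
    a = set(candidate_def.lower().split())
    b = set(context_def.lower().split())
    return len(a & b) / max(len(a), 1)

def lesk_choose(candidate_defs, context_defs):
    """Pick the candidate sense whose definition overlaps most, summed
    over the definitions of the surrounding words."""
    return max(candidate_defs,
               key=lambda d: sum(normalized_overlap(d, c) for c in context_defs))
```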

11
Selectional Preferences
  • Background
  • LDOCE senses are marked with selectional
    restrictions expressed by 36 semantic codes.
  • Bruce and Guthrie (1992) manually identified the
    relations.

12
Selectional Preference (Cont.)
  • Mapping named entities onto LDOCE semantic codes.
  • Identification of the sentence sites of the
    relationship: subject, direct and indirect
    object, noun modified by an adjective (via a
    shallow syntactic parser)
  • A conservative algorithm returns, for ambiguous
    noun, verb, and adjective occurrences, the set of
    senses that satisfy the preferences imposed on
    them (sketched below).
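A rough sketch of that conservative behaviour. The LDOCE code hierarchy and the real satisfaction test are abstracted into a caller-supplied `satisfies` function, which is an assumption for illustration.

```python
def satisfying_senses(candidate_senses, restriction, satisfies):
    """Return the senses whose semantic code satisfies the restriction
    imposed at this sentence site (subject, object, modified noun, ...).
    Conservatively returns all senses if none satisfy it."""
    kept = [s for s in candidate_senses
            if satisfies(s["semantic_code"], restriction)]
    return kept or list(candidate_senses)

# Toy usage: a flat equality check stands in for the real code hierarchy.
senses = [{"sense": 1, "semantic_code": "Human"},
          {"sense": 2, "semantic_code": "Abstract"}]
print(satisfying_senses(senses, "Human", lambda code, r: code == r))
```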

13
Selectional Preference (Cont.)
14
Subject Codes
  • Basic idea
  • Use training data to find weighted indicative
    words for each sense of a word. Then calculate
    the score of a window around the target word.
    Identify the sense with maximum score as the
    sense of the word.
  • Yarowsky (1992)
  • Collect contexts which are representative of the
    Roget category
  • Grolier's Encyclopedia (1991) as the training corpus.

15
Subject Codes (Cont.)
  • Concordances of 100 surrounding words for each
    occurrence of each member of the Roget's head
    (sense)
  • Identify salient words in the collective context,
    and weight appropriately.

16
Subject Codes (Cont.)
  • Use the resulting weights to predict the
    appropriate category for a word in novel context
  • The paper's implementation
  • Use British National Corpus
  • Consider only words which appeared at least 10
    times in the training corpus.
  • Use LDOCE pragmatic codes as category
  • Return senses marked with the most likely
    pragmatic code (see the scoring sketch below).
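A compact sketch of the scoring step described over the last three slides: per-category word weights learned from training data are summed over a context window, and the highest-scoring category (here, an LDOCE pragmatic code) is returned. The weights and code names below are invented.

```python
from collections import defaultdict

def most_likely_code(window_words, salient_weights):
    """salient_weights: {pragmatic_code: {word: weight}}, learned from
    training data (only words seen at least 10 times are kept).
    Scores each code over the context window and returns the best one."""
    scores = defaultdict(float)
    for code, weights in salient_weights.items():
        for word in window_words:
            scores[code] += weights.get(word, 0.0)
    return max(scores, key=scores.get)

# Toy usage with invented weights for two pragmatic codes.
weights = {"EC": {"money": 2.1, "loan": 1.8},    # economics
           "GO": {"river": 2.3, "water": 1.5}}   # geography
print(most_likely_code("the loan from the money market".split(), weights))  # EC
```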

17
Collocation Extractor
  • A set of 10 collocates is extracted (extraction
    sketch below):
  • First word to the left
  • First word to the right
  • Second word to the left
  • Second word to the right
  • First noun to the left
  • First noun to the right
  • First verb to the left
  • First verb to the right
  • First adjective to the left
  • First adjective to the right
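A sketch of how these ten collocates could be pulled from a POS-tagged sentence; the Penn-style tag prefixes (NN, VB, JJ) and the helper are assumptions, not the paper's implementation.

```python
def extract_collocates(tokens, pos_tags, i):
    """Return the ten collocational features for the word at position i.
    tokens and pos_tags are parallel lists; the tag prefixes are assumed."""
    def nearest(side, pos_prefix):
        positions = range(i - 1, -1, -1) if side == "left" else range(i + 1, len(tokens))
        for j in positions:
            if pos_tags[j].startswith(pos_prefix):
                return tokens[j]
        return None

    return {
        "word-1": tokens[i - 1] if i >= 1 else None,
        "word-2": tokens[i - 2] if i >= 2 else None,
        "word+1": tokens[i + 1] if i + 1 < len(tokens) else None,
        "word+2": tokens[i + 2] if i + 2 < len(tokens) else None,
        "noun-left":  nearest("left",  "NN"),
        "noun-right": nearest("right", "NN"),
        "verb-left":  nearest("left",  "VB"),
        "verb-right": nearest("right", "VB"),
        "adj-left":   nearest("left",  "JJ"),
        "adj-right":  nearest("right", "JJ"),
    }
```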

18
TiMBL
  • Background
  • Daelemans et al., 1999
  • Used for various NLP tasks
  • Classification
  • Classifies new examples by comparing them against
    previously seen cases. The class of the most
    similar examples is assigned.

19
TiMBL (Cont.)
  • Use training data to determine the feature weights
    w_i in the weighted overlap distance
    Δ(X, Y) = Σ_i w_i · δ(x_i, y_i), where δ(x_i, y_i)
    is 0 if the two feature values match and 1
    otherwise.

20
TiMBL (Cont.)
  • Feature Vector
  • Head-word
  • Homograph number
  • Sense number
  • Rank of sense
  • Part of speech
  • Surface form of the head-word
  • Three partial taggers
  • Ten collocates (see the classification sketch below)
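A minimal sketch in the TiMBL spirit, not TiMBL's actual API: stored training cases are compared to a new feature vector with the weighted overlap distance from the previous slide, and the class of the nearest case is assigned. The sense labels and weights below are hypothetical.

```python
def weighted_overlap(x, y, weights):
    """Distance(X, Y) = sum_i w_i * delta(x_i, y_i), with delta = 0 for
    matching feature values and 1 otherwise (basic weighted overlap)."""
    return sum(w * (0 if a == b else 1) for w, a, b in zip(weights, x, y))

def classify(example, training_cases, weights):
    """Assign the class of the most similar stored training case (1-NN)."""
    return min(training_cases,
               key=lambda case: weighted_overlap(example, case[0], weights))[1]

# Toy usage: cases are (feature_vector, sense_label) pairs.
cases = [(("NN", "money"), "bank_2_1"),   # hypothetical sense labels
         (("NN", "river"), "bank_1_1")]
print(classify(("NN", "money"), cases, weights=[1.0, 2.0]))  # -> bank_2_1
```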

21
TiMBL (Cont.)
  • Examples

22
Evaluation
  • Testing Material
  • SEMCOR and SENSUS
  • Converting SEMCOR's WordNet senses to LDOCE senses:
    36,869 words tagged with LDOCE senses.
  • Occurrence of ambiguous words

23
Evaluation (Cont.)
  • Average polysemy: 14.62
  • Testing technique: 10-fold cross-validation
  • Evaluation metric: precision
  • Baseline: first-sense algorithm (evaluation sketch
    below)
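A small sketch of this evaluation setup; `train_and_tag` is a hypothetical stand-in for training the combined tagger on nine folds and tagging the tenth.

```python
def precision(predictions, gold):
    """Precision: correct answers divided by answers attempted."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    return sum(p == g for p, g in answered) / len(answered) if answered else 0.0

def ten_fold_precision(instances, train_and_tag):
    """instances: list of (features, gold_sense) pairs. Train on 9 folds,
    tag the held-out fold, and average precision over the 10 runs."""
    folds = [instances[i::10] for i in range(10)]
    scores = []
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        preds = train_and_tag(train, [feats for feats, _ in test])
        scores.append(precision(preds, [gold for _, gold in test]))
    return sum(scores) / len(scores)
```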

24
Evaluation (Cont.)
  • Contribution of POS
  • Accuracy reduced to 87.87% at the sense level and
    93.36% at the homograph level when the POS filter
    was removed.
  • Interaction of Knowledge Sources.

25
Comments
  • The context information the authors used includes
    only the previous word, next word, synonym, and
    hypernym/hyponym. The limited use of context
    contributes to the low recall and high precision.
  • Using clean topical signatures as the context may
    improve recall while keeping the high precision.
    The problem is that it is highly expensive to
    build a clean signature for each word.