Advances in Word Sense Disambiguation - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Advances in Word Sense Disambiguation


1
Advances in Word Sense Disambiguation
  • Tutorial at ACL 2005
  • June 25, 2005
  • Ted Pedersen
  • University of Minnesota, Duluth
  • http://www.d.umn.edu/~tpederse
  • Rada Mihalcea
  • University of North Texas
  • http://www.cs.unt.edu/~rada

2
Goal of the Tutorial
  • Introduce the problem of word sense
    disambiguation (WSD), focusing on the range of
    formulations and approaches currently practiced.
  • Accessible to anyone with an interest in NLP.
  • Persuade you to work on word sense disambiguation
  • It's an interesting problem
  • Lots of good work already done, still more to do
  • There is infrastructure to help you get started
  • Persuade you to use word sense disambiguation in
    your text applications.

3
Outline of Tutorial
  • Introduction (Ted)
  • Methodology (Rada)
  • Knowledge Intensive Methods (Rada)
  • Supervised Approaches (Ted)
  • Minimally Supervised Approaches (Rada) / BREAK
  • Unsupervised Learning (Ted)
  • How to Get Started (Rada)
  • Conclusion (Ted)

4
Part 1: Introduction
5
Outline
  • Definitions
  • Ambiguity for Humans and Computers
  • Very Brief Historical Overview
  • Theoretical Connections
  • Practical Applications

6
Definitions
  • Word sense disambiguation is the problem of
    selecting a sense for a word from a set of
    predefined possibilities.
  • Sense Inventory usually comes from a dictionary
    or thesaurus.
  • Knowledge intensive methods, supervised learning,
    and (sometimes) bootstrapping approaches
  • Word sense discrimination is the problem of
    dividing the usages of a word into different
    meanings, without regard to any particular
    existing sense inventory.
  • Unsupervised techniques

7
Outline
  • Definitions
  • Ambiguity for Humans and Computers
  • Very Brief Historical Overview
  • Theoretical Connections
  • Practical Applications

8
Computers versus Humans
  • Polysemy: most words have many possible
    meanings.
  • A computer program has no basis for knowing which
    one is appropriate, even if it is obvious to a
    human
  • Ambiguity is rarely a problem for humans in their
    day to day communication, except in extreme
    cases

9
Ambiguity for Humans - Newspaper Headlines!
  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • FARMER BILL DIES IN HOUSE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

10
Ambiguity for a Computer
  • The fisherman jumped off the bank and into the
    water.
  • The bank down the street was robbed!
  • Back in the day, we had an entire bank of
    computers devoted to this problem.
  • The bank in that road is entirely too steep and
    is really dangerous.
  • The plane took a bank to the left, and then
    headed off towards the mountains.

11
Outline
  • Definitions
  • Ambiguity for Humans and Computers
  • Very Brief Historical Overview
  • Theoretical Connections
  • Practical Applications

12
Early Days of WSD
  • Noted as problem for Machine Translation (Weaver,
    1949)
  • A word can often only be translated if you know
    the specific sense intended (A bill in English
    could be a pico or a cuenta in Spanish)
  • Bar-Hillel (1960) posed the following
  • Little John was looking for his toy box. Finally,
    he found it. The box was in the pen. John was
    very happy.
  • Is pen a writing instrument or an enclosure
    where children play?
  • declared it unsolvable, left the field of MT!

13
Since then
  • 1970s - 1980s
  • Rule based systems
  • Rely on hand crafted knowledge sources
  • 1990s
  • Corpus based approaches
  • Dependence on sense tagged text
  • (Ide and Véronis, 1998) overview the history from
    the early days to 1998.
  • 2000s
  • Hybrid Systems
  • Minimizing or eliminating use of sense tagged
    text
  • Taking advantage of the Web

14
Outline
  • Definitions
  • Ambiguity for Humans and Computers
  • Very Brief Historical Overview
  • Interdisciplinary Connections
  • Practical Applications

15
Interdisciplinary Connections
  • Cognitive Science Psychology
  • Quillian (1968), Collins and Loftus (1975):
    spreading activation
  • Hirst (1987) developed a marker passing model
  • Linguistics
  • Fodor & Katz (1963): selectional preferences
  • Resnik (1993) pursued these statistically
  • Philosophy of Language
  • Wittgenstein (1958): meaning as use
  • For a large class of cases - though not for all - in
    which we employ the word "meaning" it can be
    defined thus: the meaning of a word is its use in
    the language.

16
Outline
  • Definitions
  • Ambiguity for Humans and Computers
  • Very Brief Historical Overview
  • Theoretical Connections
  • Practical Applications

17
Practical Applications
  • Machine Translation
  • Translate bill from English to Spanish
  • Is it a pico or a cuenta?
  • Is it a bird jaw or an invoice?
  • Information Retrieval
  • Find all Web Pages about cricket
  • The sport or the insect?
  • Question Answering
  • What is George Miller's position on gun control?
  • The psychologist or US congressman?
  • Knowledge Acquisition
  • Add to KB: Herb Bergson is the mayor of Duluth.
  • Minnesota or Georgia?

18
References
  • (Bar-Hillel, 1960) The Present Status of
    Automatic Translation of Languages. In Advances
    in Computers. Volume 1. Alt, F. (editor).
    Academic Press, New York, NY. pp 91-163.
  • (Collins and Loftus, 1975) A Spreading Activation
    Theory of Semantic Memory. Psychological Review,
    (82) pp. 407-428.
  • (Fodor and Katz, 1963) The structure of semantic
    theory. Language (39). pp 170-210.
  • (Hirst, 1987) Semantic Interpretation and the
    Resolution of Ambiguity. Cambridge University
    Press.
  • (Ide and Véronis, 1998) Word Sense Disambiguation:
    The State of the Art. Computational Linguistics
    (24) pp 1-40.
  • (Quillian, 1968) Semantic Memory. In Semantic
    Information Processing. Minsky, M. (editor). The
    MIT Press, Cambridge, MA. pp. 227-270.
  • (Resnik, 1993) Selection and Information: A
    Class-Based Approach to Lexical Relationships.
    Ph.D. Dissertation. University of Pennsylvania.
  • (Weaver, 1949) Translation. In Machine
    Translation of Languages: fourteen essays. Locke,
    W.N. and Booth, A.D. (editors) The MIT Press,
    Cambridge, Mass. pp. 15-23.
  • (Wittgenstein, 1958) Philosophical
    Investigations, 3rd edition. Translated by G.E.M.
    Anscombe. Macmillan Publishing Co., New York.

19
Part 2: Methodology
20
Outline
  • General considerations
  • All-words disambiguation
  • Targeted-words disambiguation
  • Word sense discrimination, sense discovery
  • Evaluation (granularity, scoring)

21
Overview of the Problem
  • Many words have several meanings (homonymy /
    polysemy)
  • Determine which sense of a word is used in a
    specific sentence
  • Note:
  • often, the different senses of a word are closely
    related
  • Ex: title - right of legal ownership
              - document that is evidence of
                the legal ownership
  • sometimes, several senses can be activated in a
    single context (co-activation)
  • Ex: This could bring competition to the trade
        competition - the act of competing
                    - the people who are competing
  • Ex: chair - furniture or person
  • Ex: child - young person or human offspring



22
Word Senses
  • The meaning of a word in a given context
  • Word sense representations
  • With respect to a dictionary
  • chair = a seat for one person, with a
    support for the back: "he put his coat over the
    back of the chair and sat down"
  • chair = the position of professor: "he was
    awarded an endowed chair in economics"
  • With respect to the translation in a second
    language
  • chair = chaise
  • chair = directeur
  • With respect to the context where it occurs
    (discrimination)
  • Sit on a chair / Take a seat on this chair
  • The chair of the Math Department / The chair
    of the meeting

23
Approaches to Word Sense Disambiguation
  • Knowledge-Based Disambiguation
  • use of external lexical resources such as
    dictionaries and thesauri
  • discourse properties
  • Supervised Disambiguation
  • based on a labeled training set
  • the learning system has
  • a training set of feature-encoded inputs AND
  • their appropriate sense label (category)
  • Unsupervised Disambiguation
  • based on unlabeled corpora
  • The learning system has
  • a training set of feature-encoded inputs BUT
  • NOT their appropriate sense label (category)

24
All Words Word Sense Disambiguation
  • Attempt to disambiguate all open-class words in a
    text
  • He put his suit over the back of the chair
  • Knowledge-based approaches
  • Use information from dictionaries
  • Definitions / Examples for each meaning
  • Find similarity between definitions and current
    context
  • Position in a semantic network
  • Find that table is closer to chair/furniture
    than to chair/person
  • Use discourse properties
  • A word exhibits the same sense in a discourse /
    in a collocation

25
All Words Word Sense Disambiguation
  • Minimally supervised approaches
  • Learn to disambiguate words using small annotated
    corpora
  • E.g. SemCor corpus where all open class words
    are disambiguated
  • 200,000 running words
  • Most frequent sense

26
Targeted Word Sense Disambiguation
  • Disambiguate one target word
  • Take a seat on this chair
  • The chair of the Math Department
  • WSD is viewed as a typical classification problem
  • use machine learning techniques to train a system
  • Training
  • Corpus of occurrences of the target word, each
    occurrence annotated with appropriate sense
  • Build feature vectors
  • a vector of relevant linguistic features that
    represents the context (ex a window of words
    around the target word)
  • Disambiguation
  • Disambiguate the target word in new unseen text

27
Targeted Word Sense Disambiguation
  • Take a window of n words around the target word
  • Encode information about the words around the
    target word
  • typical features include words, root forms, POS
    tags, frequency, etc.
  • An electric guitar and bass player stand off to
    one side, not really part of the scene, just as a
    sort of nod to gringo expectations perhaps.
  • Surrounding context (local features)
  • (guitar, NN1), (and, CJC), (player, NN1),
    (stand, VVB)
  • Frequent co-occurring words (topical features)
  • fishing, big, sound, player, fly, rod, pound,
    double, runs, playing, guitar, band
  • 0,0,0,1,0,0,0,0,0,0,1,0
  • Other features
  • followed by "player", contains "show" in the
    sentence,
  • yes, no,
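To make the feature encoding above concrete, here is a minimal Python sketch
(not from the tutorial) of how local and topical features for a target word
could be collected; the function name, window size, and topical word list are
illustrative assumptions.

# Illustrative sketch of feature extraction for targeted WSD.
# Assumes the sentence is already tokenized and POS-tagged.
def extract_features(tagged_tokens, target_index, topical_words, window=2):
    """tagged_tokens: list of (word, pos) pairs; target_index: position of the target."""
    features = {}
    # Local features: words and POS tags in a +/- window around the target
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tagged_tokens):
            word, pos = tagged_tokens[i]
            features["word_%d" % offset] = word.lower()
            features["pos_%d" % offset] = pos
    # Topical features: binary indicators for frequently co-occurring words
    context = {w.lower() for w, _ in tagged_tokens}
    for w in topical_words:
        features["has_" + w] = w in context
    return features

sentence = [("An", "AT0"), ("electric", "AJ0"), ("guitar", "NN1"), ("and", "CJC"),
            ("bass", "NN1"), ("player", "NN1"), ("stand", "VVB"), ("off", "AVP")]
print(extract_features(sentence, target_index=4,
                       topical_words=["fishing", "guitar", "band", "rod"]))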

28
Unsupervised Disambiguation
  • Disambiguate word senses
  • without supporting tools such as dictionaries and
    thesauri
  • without a labeled training text
  • Without such resources, word senses are not
    labeled
  • We cannot say chair/furniture or chair/person
  • We can
  • Cluster/group the contexts of an ambiguous word
    into a number of groups
  • Discriminate between these groups without
    actually labeling them

29
Unsupervised Disambiguation
  • Hypothesis: same senses of words will have
    similar neighboring words
  • Disambiguation algorithm
  • Identify context vectors corresponding to all
    occurrences of a particular word
  • Partition them into regions of high density
  • Assign a sense to each such region
  • Sit on a chair
  • Take a seat on this chair
  • The chair of the Math Department
  • The chair of the meeting

30
Evaluating Word Sense Disambiguation
  • Metrics
  • Precision = percentage of words that are tagged
    correctly, out of the words addressed by the
    system
  • Recall = percentage of words that are tagged
    correctly, out of all words in the test set
  • Example
  • Test set of 100 words
  • System attempts 75 words
  • Words correctly disambiguated: 50
  • Precision = 50 / 75 = 0.66
  • Recall = 50 / 100 = 0.50
  • Special tags are possible
  • Unknown
  • Proper noun
  • Multiple senses
  • Compare to a gold standard
  • SEMCOR corpus, SENSEVAL corpus,
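The example above can be checked with a few lines of Python (the numbers are
the ones on this slide; coverage is an extra, commonly reported figure).

# Worked example: 100 test words, 75 attempted, 50 tagged correctly.
correct, attempted, total = 50, 75, 100
precision = correct / attempted   # 0.66...
recall = correct / total          # 0.50
coverage = attempted / total      # fraction of test words the system attempts
print("precision %.2f, recall %.2f, coverage %.2f" % (precision, recall, coverage))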

31
Evaluating Word Sense Disambiguation
  • Difficulty in evaluation
  • Nature of the senses to distinguish has a huge
    impact on results
  • Coarse versus fine-grained sense distinctions
  • chair = a seat for one person, with a support for
    the back: "he put his coat over the back of the
    chair and sat down"
  • chair = the position of professor: "he was
    awarded an endowed chair in economics"
  • bank = a financial institution that accepts
    deposits and channels the money into lending
    activities: "he cashed a check at the bank";
    "that bank holds the mortgage on my home"
  • bank = a building in which commercial banking is
    transacted: "the bank is on the corner of Nassau
    and Witherspoon"
  • Sense maps
  • Cluster similar senses
  • Allow for both fine-grained and coarse-grained
    evaluation

32
Bounds on Performance
  • Upper and Lower Bounds on Performance
  • Measure of how well an algorithm performs
    relative to the difficulty of the task.
  • Upper Bound
  • Human performance
  • Around 97-99% with few and clearly distinct
    senses
  • Inter-judge agreement
  • With words with clear, distinct senses: 95% and
    up
  • With polysemous words with related senses: 65-70%
  • Lower Bound (or baseline)
  • The assignment of a random sense / the most
    frequent sense
  • 90% is excellent for a word with 2 equiprobable
    senses
  • 90% is trivial for a word with 2 senses with
    probability ratios of 9 to 1

33
References
  • (Gale, Church and Yarowsky 1992) Gale, W.,
    Church, K., and Yarowsky, D. Estimating upper and
    lower bounds on the performance of word-sense
    disambiguation programs ACL 1992.
  • (Miller et. al., 1994) Miller, G., Chodorow, M.,
    Landes, S., Leacock, C., and Thomas, R. Using a
    semantic concordance for sense identification.
    ARPA Workshop 1994.
  • (Miller, 1995) Miller, G. WordNet: A lexical
    database. Communications of the ACM, 38(11) 1995.
  • (Senseval) Senseval evaluation exercises
    http://www.senseval.org

34
Part 3: Knowledge-based Methods for Word Sense
Disambiguation
35
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

36
Task Definition
  • Knowledge-based WSD = class of WSD methods
    relying (mainly) on knowledge drawn from
    dictionaries and/or raw text
  • Resources
  • Yes
  • Machine Readable Dictionaries
  • Raw corpora
  • No
  • Manually annotated corpora
  • Scope
  • All open-class words

37
Machine Readable Dictionaries
  • In recent years, most dictionaries have been made
    available in Machine Readable format (MRD)
  • Oxford English Dictionary
  • Collins
  • Longman Dictionary of Contemporary English
    (LDOCE)
  • Thesauruses add synonymy information
  • Roget's Thesaurus
  • Semantic networks add more semantic relations
  • WordNet
  • EuroWordNet

38
MRD A Resource for Knowledge-based WSD
  • For each word in the language vocabulary, an MRD
    provides
  • A list of meanings
  • Definitions (for all word meanings)
  • Typical usage examples (for most word meanings)

39
MRD A Resource for Knowledge-based WSD
  • A thesaurus adds
  • An explicit synonymy relation between word
    meanings
  • A semantic network adds
  • Hypernymy/hyponymy (IS-A), meronymy/holonymy
    (PART-OF), antonymy, entailment, etc.

WordNet synsets for the noun plant:
  1. plant, works, industrial plant
  2. plant, flora, plant life
WordNet related concepts for the meaning plant
life (plant, flora, plant life):
  hypernym: organism, being
  hyponym: house plant, fungus, ...
  meronym: plant tissue, plant part
  holonym: Plantae, kingdom Plantae, plant kingdom
40
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

41
Lesk Algorithm
  • (Michael Lesk 1986) Identify senses of words in
    context using definition overlap
  • Algorithm
  • Retrieve from MRD all sense definitions of the
    words to be disambiguated
  • Determine the definition overlap for all possible
    sense combinations
  • Choose senses that lead to highest overlap
  • Example: disambiguate PINE CONE
  • PINE
  • 1. kinds of evergreen tree with needle-shaped
    leaves
  • 2. waste away through sorrow or illness
  • CONE
  • 1. solid body which narrows to a point
  • 2. something of this shape whether solid or
    hollow
  • 3. fruit of certain evergreen trees

Pine#1 ∩ Cone#1 = 0     Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1     Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2     Pine#2 ∩ Cone#3 = 0
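A small Python sketch of the pairwise overlap computation above, using the toy
definitions from this slide. The stopword list and tokenization are simplifying
assumptions, so the raw counts can differ slightly from the slide, but the
highest-scoring pair is still (pine#1, cone#3).

import re
from itertools import product

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}
STOPWORDS = {"of", "with", "which", "to", "a", "or", "this", "whether", "through"}

def content_words(text):
    # lowercase, split on non-letters, drop stopwords
    return {w for w in re.split(r"[^a-z]+", text.lower()) if w and w not in STOPWORDS}

def overlap(def1, def2):
    return len(content_words(def1) & content_words(def2))

for p, c in product(PINE, CONE):
    print("pine#%d - cone#%d overlap: %d" % (p, c, overlap(PINE[p], CONE[c])))
best = max(product(PINE, CONE), key=lambda pc: overlap(PINE[pc[0]], CONE[pc[1]]))
print("chosen senses (pine, cone):", best)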
42
Lesk Algorithm for More than Two Words?
  • I saw a man who is 98 years old and can still
    walk and tell jokes
  • nine open class words: see(26), man(11),
    year(4), old(8), can(5), still(4), walk(10),
    tell(8), joke(3)
  • 43,929,600 sense combinations! How to find the
    optimal sense combination?
  • Simulated annealing (Cowie, Guthrie, Guthrie
    1992)
  • Define a function E = combination of word senses
    in a given text.
  • Find the combination of senses that leads to
    highest definition overlap (redundancy)
  • 1. Start with E = the most frequent sense
    for each word
  • 2. At each iteration, replace the sense of a
    random word in the set with a different sense,
    and measure E
  • 3. Stop iterating when there is no change in
    the configuration of senses
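A heavily simplified Python sketch of this search, under stated assumptions:
sense 0 stands in for the most frequent sense, the score E is plain pairwise
definition overlap, and acceptance is greedy (true simulated annealing would
also accept some worse configurations with a temperature-controlled
probability).

import random

def redundancy(combo, definitions):
    """E: total pairwise definition overlap for the current sense assignment."""
    words = list(combo)
    score = 0
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            d1 = set(definitions[words[i]][combo[words[i]]].split())
            d2 = set(definitions[words[j]][combo[words[j]]].split())
            score += len(d1 & d2)
    return score

def search_senses(definitions, iterations=1000):
    """definitions: {word: [definition of sense 0, definition of sense 1, ...]}"""
    combo = {w: 0 for w in definitions}          # start from sense 0 for every word
    best = redundancy(combo, definitions)
    for _ in range(iterations):
        w = random.choice(list(definitions))     # pick a random word
        candidate = dict(combo)
        candidate[w] = random.randrange(len(definitions[w]))   # try another sense
        score = redundancy(candidate, definitions)
        if score >= best:                        # keep changes that do not hurt E
            combo, best = candidate, score
    return combo, best

defs = {"pine": ["evergreen tree with needles", "waste away through sorrow"],
        "cone": ["solid body narrowing to a point", "fruit of an evergreen tree"]}
print(search_senses(defs, iterations=200))       # converges to pine sense 0, cone sense 1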

43
Lesk Algorithm A Simplified Version
  • Original Lesk: measure overlap between
    sense definitions for all words in context
  • Identify simultaneously the correct senses for
    all words in context
  • Simplified Lesk (Kilgarriff & Rosenzweig 2000):
    measure overlap between sense definitions of a
    word and the current context
  • Identify the correct sense for one word at a time
  • Search space significantly reduced

44
Lesk Algorithm A Simplified Version
  • Algorithm for simplified Lesk
  • Retrieve from MRD all sense definitions of the
    word to be disambiguated
  • Determine the overlap between each sense
    definition and the current context
  • Choose the sense that leads to highest overlap
  • Example: disambiguate PINE in
  • Pine cones hanging in a tree
  • PINE
  • 1. kinds of evergreen tree with needle-shaped
    leaves
  • 2. waste away through sorrow or illness

Pine#1 ∩ Sentence = 1     Pine#2 ∩ Sentence = 0
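A compact Python sketch of simplified Lesk on this slide's example; the toy
stopword list is an assumption.

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
STOPWORDS = {"in", "a", "of", "with", "or", "through"}

def simplified_lesk(sense_definitions, context_sentence):
    # overlap between each sense definition and the bag of context words
    context = {w.lower().strip(".,") for w in context_sentence.split()} - STOPWORDS
    def score(sense):
        return len(context & (set(sense_definitions[sense].lower().split()) - STOPWORDS))
    return max(sense_definitions, key=score)

print(simplified_lesk(PINE, "Pine cones hanging in a tree"))   # -> 1 ("tree" overlaps)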
45
Evaluations of Lesk Algorithm
  • Initial evaluation by M. Lesk
  • 50-70% on short samples of text (manually
    annotated set), with respect to the Oxford
    Advanced Learner's Dictionary
  • Simulated annealing
  • 47% on 50 manually annotated sentences
  • Evaluation on Senseval-2 all-words data, with
    back-off to random sense (Mihalcea & Tarau 2004)
  • Original Lesk: 35%
  • Simplified Lesk: 47%
  • Evaluation on Senseval-2 all-words data, with
    back-off to most frequent sense (Vasilescu,
    Langlais, Lapalme 2004)
  • Original Lesk: 42%
  • Simplified Lesk: 58%

46
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Preferences
  • Measures of Semantic Similarity
  • Heuristic-based Methods

47
Selectional Preferences
  • A way to constrain the possible meanings of words
    in a given context
  • E.g. Wash a dish vs. Cook a dish
  • WASH-OBJECT vs. COOK-FOOD
  • Capture information about possible relations
    between semantic classes
  • Common sense knowledge
  • Alternative terminology
  • Selectional Restrictions
  • Selectional Preferences
  • Selectional Constraints

48
Acquiring Selectional Preferences
  • From annotated corpora
  • Circular relationship with the WSD problem
  • Need WSD to build the annotated corpus
  • Need selectional preferences to derive WSD
  • From raw corpora
  • Frequency counts
  • Information theory measures
  • Class-to-class relations

49
Preliminaries Learning Word-to-Word Relations
  • An indication of the semantic fit between two
    words
  • 1. Frequency counts
  • Pairs of words connected by a syntactic relation
  • 2. Conditional probabilities
  • Condition on one of the words

50
Learning Selectional Preferences (1)
  • Word-to-class relations (Resnik 1993)
  • Quantify the contribution of a semantic class
    using all the concepts subsumed by that class
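The association formula itself was a graphic on the slide and is missing from
this transcript. As a reminder (the standard formulation from Resnik 1993, not
copied from the slide), the selectional preference strength of a predicate p
and its association with a class c are:

S(p)   = \sum_{c} P(c \mid p) \, \log \frac{P(c \mid p)}{P(c)}

A(p,c) = \frac{1}{S(p)} \, P(c \mid p) \, \log \frac{P(c \mid p)}{P(c)}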

51
Learning Selectional Preferences (2)
  • Determine the contribution of a word sense based
    on the assumption of equal sense distributions
  • e.g. plant has two senses → 50% of occurrences are
    sense 1, 50% are sense 2
  • Example: learning restrictions for the verb to
    drink
  • Find high-scoring verb-object pairs
  • Find prototypical object classes (high
    association score)

52
Learning Selectional Preferences (3)
  • Other algorithms
  • Learn class-to-class relations (Agirre and
    Martinez, 2002)
  • E.g. ingest food is a class-to-class relation
    for eat chicken
  • Bayesian networks (Ciaramita and Johnson, 2000)
  • Tree cut model (Li and Abe, 1998)

53
Using Selectional Preferences for WSD
  • Algorithm
  • 1. Learn a large set of selectional preferences
    for a given syntactic relation R
  • 2. Given a pair of words W1 - W2 connected by a
    relation R
  • 3. Find all selectional preferences W1 - C
    (word-to-class) or C1 - C2 (class-to-class) that
    apply
  • 4. Select the meanings of W1 and W2 based on the
    selected semantic class
  • Example: disambiguate coffee in drink coffee
  • 1. (beverage) a beverage consisting of an
    infusion of ground coffee beans
  • 2. (tree) any of several small trees native to
    the tropical Old World
  • 3. (color) a medium to dark brown color

Given the selectional preference DRINK - BEVERAGE,
select coffee#1
54
Evaluation of Selectional Preferences for WSD
  • Data set
  • mainly on verb-object, subject-verb relations
    extracted from SemCor
  • Compare against random baseline
  • Results (Agirre and Martinez, 2000)
  • Average results on 8 nouns
  • Similar figures reported in (Resnik 1997)

55
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

56
Semantic Similarity
  • Words in a discourse must be related in meaning,
    for the discourse to be coherent (Halliday and
    Hasan, 1976)
  • Use this property for WSD: identify related
    meanings for words that share a common context
  • Context span
  • 1. Local context: semantic similarity between
    pairs of words
  • 2. Global context: lexical chains

57
Semantic Similarity in a Local Context
  • Similarity determined between pairs of concepts,
    or between a word and its surrounding context
  • Relies on similarity metrics on semantic networks
  • (Rada et al. 1989)

[Figure: fragment of an animal taxonomy used to illustrate path-based
similarity in a semantic network - nodes include carnivore; feline, felid;
canine, canid; fissiped mammal, fissiped; bear; wild dog; wolf; hyena; dog;
hunting dog; hyena dog; dingo; dachshund; terrier]
58
Semantic Similarity Metrics (1)
  • Input: two concepts (same part of speech)
  • Output: similarity measure
  • (Leacock and Chodorow 1998)
  • E.g. Similarity(wolf, dog) = 0.60
    Similarity(wolf, bear) = 0.42
  • (Resnik 1995)
  • Define information content: P(C) = probability of
    seeing a concept of type C in a large corpus
  • Probability of seeing a concept = probability of
    seeing instances of that concept
  • Determine the contribution of a word sense based
    on the assumption of equal sense distributions
  • e.g. plant has two senses → 50% of occurrences are
    sense 1, 50% are sense 2

Similarity(C1, C2) = -log( Path(C1, C2) / 2D ), where D is the taxonomy depth
59
Semantic Similarity Metrics (2)
  • Similarity using information content
  • (Resnik 1995) Define similarity between two
    concepts (LCS = Least Common Subsumer)
  • Alternatives (Jiang and Conrath 1997)
  • Other metrics
  • Similarity using information content (Lin 1998)
  • Similarity using gloss-based paths across
    different hierarchies (Mihalcea and Moldovan
    1999)
  • Conceptual density measure between noun semantic
    hierarchies and current context (Agirre and Rigau
    1995)
  • Adapted Lesk algorithm (Banerjee and Pedersen
    2002)
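The metric formulas on this slide were also slide graphics and do not survive
in the transcript. As a reminder, the standard information-content
formulations from the cited papers (reconstructed here, not copied from the
slides) are:

IC(c) = -\log P(c)

sim_{Resnik}(c_1, c_2) = IC(LCS(c_1, c_2))

dist_{Jiang-Conrath}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \, IC(LCS(c_1, c_2))

sim_{Lin}(c_1, c_2) = \frac{2 \, IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)}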

60
Semantic Similarity Metrics for WSD
  • Disambiguate target words based on similarity
    with one word to the left and one word to the
    right
  • (Patwardhan, Banerjee, Pedersen 2002)
  • Evaluation
  • 1,723 ambiguous nouns from Senseval-2
  • Among 5 similarity metrics, (Jiang and Conrath
    1997) provides the best precision (39%)
  • Example: disambiguate PLANT in plant with
    flowers
  • PLANT
  • plant, works, industrial plant
  • plant, flora, plant life
  • Similarity(plant#1, flower) = 0.2
  • Similarity(plant#2, flower) = 1.5 → select
    plant#2
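A minimal Python sketch of this left/right-neighbor scheme; the similarity
function is passed in as a parameter (any of the metrics above), and the toy
scores reproduce the plant/flower example on this slide.

def disambiguate(target_senses, left_senses, right_senses, similarity):
    """Pick the target sense with the highest combined similarity to the
    best-matching sense of each neighbor (left and right context word)."""
    def score(sense):
        left = max((similarity(sense, s) for s in left_senses), default=0.0)
        right = max((similarity(sense, s) for s in right_senses), default=0.0)
        return left + right
    return max(target_senses, key=score)

# Toy similarity table with the numbers from this slide
toy = {("plant#1", "flower#1"): 0.2, ("plant#2", "flower#1"): 1.5}
sim = lambda a, b: toy.get((a, b), 0.0)
print(disambiguate(["plant#1", "plant#2"], [], ["flower#1"], sim))   # -> plant#2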

61
Semantic Similarity in a Global Context
  • Lexical chains (Hirst and St-Onge 1998), (Halliday
    and Hasan 1976)
  • A lexical chain is a sequence of semantically
    related words, which creates a context and
    contributes to the continuity of meaning and the
    coherence of a discourse
  • Algorithm for finding lexical chains
  • Select the candidate words from the text. These
    are words for which we can compute similarity
    measures, and therefore most of the time they
    have the same part of speech.
  • For each such candidate word, and for each
    meaning for this word, find a chain to receive
    the candidate word sense, based on a semantic
    relatedness measure between the concepts that are
    already in the chain, and the candidate word
    meaning.
  • If such a chain is found, insert the word in this
    chain otherwise, create a new chain.

62
Semantic Similarity of a Global Context
A very long train traveling along the rails with
a constant velocity v in a certain direction
  train:  1 public transport; 2 order set of things;
          3 piece of cloth
  travel: 1 change location; 2 undergo transportation
  rail:   1 a barrier; 2 a bar of steel for trains;
          3 a small bird
63
Lexical Chains for WSD
  • Identify lexical chains in a text
  • Usually target one part of speech at a time
  • Identify the meaning of words based on their
    membership to a lexical chain
  • Evaluation
  • (Galley and McKeown 2003) lexical chains on 74
    SemCor texts give 62.09%
  • (Mihalcea and Moldovan 2000) on five SemCor texts
    give 90% precision with 60% recall
  • lexical chains anchored on monosemous words
  • (Okumura and Honda 1994) lexical chains on five
    Japanese texts give 63.4%

64
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

65
Most Frequent Sense (1)
  • Identify the most often used meaning and use this
    meaning by default
  • Word meanings exhibit a Zipfian distribution
  • E.g. distribution of word senses in SemCor

Example: plant/flora is used more often than
plant/factory → annotate any instance of
PLANT as plant/flora
66
Most Frequent Sense (2)
  • Method 1: find the most frequent sense in an
    annotated corpus
  • Method 2: find the most frequent sense using a
    method based on distributional similarity
    (McCarthy et al. 2004)
  • 1. Given a word w, find the top k
    distributionally similar words
  • Nw = {n1, n2, ..., nk}, with associated
    similarity scores dss(w,n1), dss(w,n2), ...,
    dss(w,nk)
  • 2. For each sense wsi of w, identify the
    similarity with the words nj, using the sense of
    nj that maximizes this score
  • 3. Rank senses wsi of w based on the total
    similarity score
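A simplified Python sketch of this ranking step (illustrative only; the full
method of McCarthy et al. 2004 also normalizes the WordNet similarity over all
senses of the target).

def rank_senses(target_senses, neighbors, dss, sense_sim):
    """target_senses: sense ids of the target word w
    neighbors: distributionally similar words n1..nk
    dss[n]: distributional similarity between w and neighbor n
    sense_sim(ws, n): similarity of sense ws with the best-matching sense of n"""
    scores = {ws: sum(dss[n] * sense_sim(ws, n) for n in neighbors)
              for ws in target_senses}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy illustration in the spirit of the pipe example on the next slide
dss = {"tube": 1.0}
sense_sim = lambda ws, n: {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6}[(ws, n)]
print(rank_senses(["pipe#1", "pipe#2"], ["tube"], dss, sense_sim))   # pipe#2 first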

67
Most Frequent Sense (3)
  • Word senses
  • pipe#1 = tobacco pipe
  • pipe#2 = tube of metal or plastic
  • Distributionally similar words
  • N = {tube, cable, wire, tank, hole, cylinder,
    fitting, tap, ...}
  • For each word in N, find the similarity with pipe#i
    (using the sense that maximizes the similarity)
  • pipe#1 - tube (sense 3) = 0.3
  • pipe#2 - tube (sense 1) = 0.6
  • Compute the score for each sense pipe#i
  • score(pipe#1) = 0.25
  • score(pipe#2) = 0.73
  • Note: results depend on the corpus used to find
    distributionally similar words → can find
    domain-specific predominant senses

68
One Sense Per Discourse
  • A word tends to preserve its meaning across all
    its occurrences in a given discourse (Gale,
    Church, Yarowsky 1992)
  • What does this mean?
  • Evaluation
  • 8 words with two-way ambiguity, e.g. plant,
    crane, etc.
  • 98% of the two-word occurrences in the same
    discourse carry the same meaning
  • The grain of salt: performance depends on
    granularity
  • (Krovetz 1998) experiments with words with more
    than two senses
  • Performance of one sense per discourse measured
    on SemCor is approx. 70%

E.g. the ambiguous word PLANT occurs 10 times in
a discourse → all instances of plant
carry the same meaning
69
One Sense per Collocation
  • A word tends to preserve its meaning when used
    in the same collocation (Yarowsky 1993)
  • Strong for adjacent collocations
  • Weaker as the distance between words increases
  • An example
  • Evaluation
  • 97% precision on words with two-way ambiguity
  • Finer granularity
  • (Martinez and Agirre 2000) tested the one sense
    per collocation hypothesis on text annotated
    with WordNet senses
  • 70% precision on SemCor words

The ambiguous word PLANT preserves its meaning in
all its occurrences within the collocation
industrial plant, regardless of the context
where this collocation occurs
70
References
  • (Agirre and Rigau, 1995) Agirre, E. and Rigau, G.
    A proposal for word sense disambiguation using
    conceptual distance. RANLP 1995.  
  • (Agirre and Martinez 2001) Agirre, E. and
    Martinez, D. Learning class-to-class selectional
    preferences. CONLL 2001.
  •  (Banerjee and Pedersen 2002) Banerjee, S. and
    Pedersen, T. An adapted Lesk algorithm for word
    sense disambiguation using WordNet. CICLING 2002.
  • (Cowie, Guthrie and Guthrie 1992) Cowie, J. and
    Guthrie, J. A. and Guthrie, L. Lexical
    disambiguation using simulated annealing. COLING
    1992.
  • (Gale, Church and Yarowsky 1992) Gale, W.,
    Church, K., and Yarowsky, D. One sense per
    discourse. DARPA workshop 1992.
  • (Halliday and Hasan 1976) Halliday, M. and Hasan,
    R., (1976). Cohesion in English. Longman.
  • (Galley and McKeown 2003) Galley, M. and McKeown,
    K. (2003) Improving word sense disambiguation in
    lexical chaining. IJCAI 2003
  • (Hirst and St-Onge 1998) Hirst, G. and St-Onge,
    D. Lexical chains as representations of context
    in the detection and correction of malapropisms.
    WordNet: An electronic lexical database, MIT
    Press.
  • (Jiang and Conrath 1997) Jiang, J. and Conrath,
    D. Semantic similarity based on corpus statistics
    and lexical taxonomy. COLING 1997.
  • (Krovetz, 1998) Krovetz, R. More than one sense
    per discourse. ACL-SIGLEX 1998.
  • (Lesk, 1986) Lesk, M. Automatic sense
    disambiguation using machine readable
    dictionaries How to tell a pine cone from an ice
    cream cone. SIGDOC 1986.
  • (Lin 1998) Lin, D. An information theoretic
    definition of similarity. ICML 1998.

71
References
  • (Martinez and Agirre 2000) Martinez, D. and
    Agirre, E. One sense per collocation and
    genre/topic variations. EMNLP 2000.
  • (Miller et. al., 1994) Miller, G., Chodorow, M.,
    Landes, S., Leacock, C., and Thomas, R. Using a
    semantic concordance for sense identification.
    ARPA Workshop 1994.
  • (Miller, 1995) Miller, G. WordNet: A lexical
    database. Communications of the ACM, 38(11) 1995.
  • (Mihalcea and Moldovan, 1999) Mihalcea, R. and
    Moldovan, D. A method for word sense
    disambiguation of unrestricted text. ACL 1999.
  • (Mihalcea and Moldovan 2000) Mihalcea, R. and
    Moldovan, D. An iterative approach to word sense
    disambiguation. FLAIRS 2000.
  • (Mihalcea, Tarau, Figa 2004) R. Mihalcea, P.
    Tarau, E. Figa PageRank on Semantic Networks with
    Application to Word Sense Disambiguation, COLING
    2004.
  • (Patwardhan, Banerjee, and Pedersen 2003)
    Patwardhan, S. and Banerjee, S. and Pedersen, T.
    Using Measures of Semantic Relatedness for Word
    Sense Disambiguation. CICLING 2003.
  • (Rada et al 1989) Rada, R. and Mili, H. and
    Bicknell, E. and Blettner, M. Development and
    application of a metric on semantic nets. IEEE
    Transactions on Systems, Man, and Cybernetics,
    19(1) 1989.
  • (Resnik 1993) Resnik, P. Selection and
    Information: A Class-Based Approach to Lexical
    Relationships. University of Pennsylvania 1993.
  • (Resnik 1995) Resnik, P. Using information
    content to evaluate semantic similarity. IJCAI
    1995.
  • (Vasilescu, Langlais, Lapalme 2004) F. Vasilescu,
    P. Langlais, G. Lapalme. Evaluating variants of
    the Lesk approach for disambiguating words. LREC
    2004.
  • (Yarowsky, 1993) Yarowsky, D. One sense per
    collocation. ARPA Workshop 1993.

72
  • Part 4
  • Supervised Methods of Word Sense Disambiguation

73
Outline
  • What is Supervised Learning?
  • Task Definition
  • Single Classifiers
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

74
What is Supervised Learning?
  • Collect a set of examples that illustrate the
    various possible classifications or outcomes of
    an event.
  • Identify patterns in the examples associated with
    each particular class of the event.
  • Generalize those patterns into rules.
  • Apply the rules to classify a new event.

75
Learn from these examples: when do I go to the
store?
Day | CLASS: Go to Store? | F1: Hot Outside? | F2: Slept Well? | F3: Ate Well?
  1 |        YES          |       YES        |       NO        |      NO
  2 |        NO           |       YES        |       NO        |      YES
  3 |        YES          |       NO         |       NO        |      NO
  4 |        NO           |       NO         |       NO        |      YES
76
Learn from these examples: when do I go to the
store?
Day | CLASS: Go to Store? | F1: Hot Outside? | F2: Slept Well? | F3: Ate Well?
  1 |        YES          |       YES        |       NO        |      NO
  2 |        NO           |       YES        |       NO        |      YES
  3 |        YES          |       NO         |       NO        |      NO
  4 |        NO           |       NO         |       NO        |      YES
77
Outline
  • What is Supervised Learning?
  • Task Definition
  • Single Classifiers
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

78
Task Definition
  • Supervised WSD Class of methods that induces a
    classifier from manually sense-tagged text using
    machine learning techniques.
  • Resources
  • Sense Tagged Text
  • Dictionary (implicit source of sense inventory)
  • Syntactic Analysis (POS tagger, Chunker, Parser,
    etc.)
  • Scope
  • Typically one target word per context
  • Part of speech of target word resolved
  • Lends itself to targeted word formulation
  • Reduces WSD to a classification problem where a
    target word is assigned the most appropriate
    sense from a given set of possibilities based on
    the context in which it occurs

79
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
80
Two Bags of Words (Co-occurrences in the window
of context)
FINANCIAL_BANK_BAG a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
81
Simple Supervised Approach
  • Given a sentence S containing bank
  • For each word Wi in S
  •   If Wi is in FINANCIAL_BANK_BAG then
  •     Sense_1 = Sense_1 + 1
  •   If Wi is in RIVER_BANK_BAG then
  •     Sense_2 = Sense_2 + 1
  • If Sense_1 > Sense_2 then print Financial
  • else if Sense_2 > Sense_1 then print River
  • else print Can't Decide

82
Supervised Methodology
  • Create a sample of training data where a given
    target word is manually annotated with a sense
    from a predetermined set of possibilities.
  • One tagged word per instance/lexical sample
    disambiguation
  • Select a set of features with which to represent
    context.
  • co-occurrences, collocations, POS tags, verb-obj
    relations, etc...
  • Convert sense-tagged training instances to
    feature vectors.
  • Apply a machine learning algorithm to induce a
    classifier.
  • Form structure or relation among features
  • Parameters: strength of feature interactions
  • Convert a held out sample of test data into
    feature vectors.
  • correct sense tags are known but not used
  • Apply classifier to test instances to assign a
    sense tag.

83
From Text to Feature Vectors
  • My/pronoun grandfather/noun used/verb to/prep
    fish/verb along/adv the/det banks/SHORE of/prep
    the/det Mississippi/noun River/noun. (S1)
  • The/det bank/FINANCE issued/verb a/det check/noun
    for/prep the/det amount/noun of/prep
    interest/noun. (S2)

      P-2  P-1  P1    P2   fish  check  river  interest  SENSE TAG
S1    adv  det  prep  det  Y     N      Y      N         SHORE
S2    --   det  verb  det  N     Y      N      Y         FINANCE
84
Supervised Learning Algorithms
  • Once data is converted to feature vector form,
    any supervised learning algorithm can be used.
    Many have been applied to WSD with good results
  • Support Vector Machines
  • Nearest Neighbor Classifiers
  • Decision Trees
  • Decision Lists
  • Naïve Bayesian Classifiers
  • Perceptrons
  • Neural Networks
  • Graphical Models
  • Log Linear Models

85
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifier
  • Decision Lists and Trees
  • Ensembles of Classifiers

86
Naïve Bayesian Classifier
  • Naïve Bayesian Classifier well known in Machine
    Learning community for good performance across a
    range of tasks (e.g., Domingos and Pazzani, 1997)
  • Word Sense Disambiguation is no exception
  • Assumes conditional independence among features,
    given the sense of a word.
  • The form of the model is assumed, but parameters
    are estimated from training instances
  • When applied to WSD, features are often a bag of
    words that come from the training data
  • Usually thousands of binary features that
    indicate if a word is present in the context of
    the target word (or not)

87
Bayesian Inference
  • Given observed features, what is most likely
    sense?
  • Estimate probability of observed features given
    sense
  • Estimate unconditional probability of sense
  • Unconditional probability of features is a
    normalizing term, doesn't affect sense
    classification

88
Naïve Bayesian Model
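The model equation on this slide is lost in the transcript. The standard Naïve
Bayes decision rule it refers to, written out as a reminder (reconstructed, not
copied from the slide), is:

\hat{s} = \arg\max_{s} P(s \mid F_1, \dots, F_n)
        = \arg\max_{s} P(s) \prod_{i=1}^{n} P(F_i \mid s)

(the denominator P(F_1, ..., F_n) is constant across senses, and the features
are assumed conditionally independent given the sense)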
89
The Naïve Bayesian Classifier
  • Given 2,000 instances of bank, 1,500 for bank/1
    (financial sense) and 500 for bank/2 (river
    sense)
  • P(S1) = 1,500/2,000 = .75
  • P(S2) = 500/2,000 = .25
  • Given credit occurs 200 times with bank/1 and 4
    times with bank/2.
  • P(F1=credit) = 204/2,000 = .102
  • P(F1=credit|S1) = 200/1,500 = .133
  • P(F1=credit|S2) = 4/500 = .008
  • Given a test instance that has one feature:
    credit
  • P(S1|F1=credit) = .133 * .75 / .102 = .978
  • P(S2|F1=credit) = .008 * .25 / .102 = .020
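The same worked example in a few lines of Python (illustrative; computing with
unrounded fractions gives 0.980 where the slide's rounded intermediate values
give .978).

p_s1, p_s2 = 1500 / 2000, 500 / 2000             # priors for bank/1 and bank/2
p_credit = 204 / 2000                            # P(F1 = credit)
p_credit_s1, p_credit_s2 = 200 / 1500, 4 / 500   # P(F1 = credit | sense)

posterior_s1 = p_credit_s1 * p_s1 / p_credit     # ~0.98
posterior_s2 = p_credit_s2 * p_s2 / p_credit     # ~0.02
print("P(S1|credit) = %.3f, P(S2|credit) = %.3f" % (posterior_s1, posterior_s2))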

90
Comparative Results
  • (Leacock, et. al. 1993) compared Naïve Bayes with
    a Neural Network and a Context Vector approach
    when disambiguating six senses of line
  • (Mooney, 1996) compared Naïve Bayes with a Neural
    Network, Decision Tree/List Learners, Disjunctive
    and Conjunctive Normal Form learners, and a
    perceptron when disambiguating six senses of
    line
  • (Pedersen, 1998) compared Naïve Bayes with
    Decision Tree, Rule Based Learner, Probabilistic
    Model, etc. when disambiguating line and 12 other
    words
  • All found that Naïve Bayesian Classifier
    performed as well as any of the other methods!

91
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

92
Decision Lists and Trees
  • Very widely used in Machine Learning.
  • Decision trees used very early for WSD research
    (e.g., Kelly and Stone, 1975; Black, 1988).
  • Represent disambiguation problem as a series of
    questions (presence of feature) that reveal the
    sense of a word.
  • List decides between two senses after one
    positive answer
  • Tree allows for decision among multiple senses
    after a series of answers
  • Uses a smaller, more refined set of features than
    bag of words and Naïve Bayes.
  • More descriptive and easier to interpret.

93
Decision List for WSD (Yarowsky, 1994)
  • Identify collocational features from sense tagged
    data.
  • Word immediately to the left or right of target
  • I have my bank/1 statement.
  • The river bank/2 is muddy.
  • Pair of words to immediate left or right of
    target
  • The world's richest bank/1 is here in New York.
  • The river bank/2 is muddy.
  • Words found within k positions to left or right
    of target, where k is often 10-50
  • My credit is just horrible because my bank/1 has
    made several mistakes with my account and the
    balance is very low.

94
Building the Decision List
  • Sort order of collocation tests using log of
    conditional probabilities.
  • Words most indicative of one sense (and not the
    other) will be ranked highly.

95
Computing DL score
  • Given 2,000 instances of bank, 1,500 for bank/1
    (financial sense) and 500 for bank/2 (river
    sense)
  • P(S1) = 1,500/2,000 = .75
  • P(S2) = 500/2,000 = .25
  • Given credit occurs 200 times with bank/1 and 4
    times with bank/2.
  • P(F1=credit) = 204/2,000 = .102
  • P(F1=credit|S1) = 200/1,500 = .133
  • P(F1=credit|S2) = 4/500 = .008
  • From Bayes Rule
  • P(S1|F1=credit) = .133 * .75 / .102 = .978
  • P(S2|F1=credit) = .008 * .25 / .102 = .020
  • DL Score = abs(log(.978/.020)) = 3.89

96
Using the Decision List
  • Sort by DL-score, go through the test instance
    looking for a matching feature. The first match
    reveals the sense

DL-score  Feature              Sense
3.89      credit within bank   Bank/1 (financial)
2.20      bank is muddy        Bank/2 (river)
1.09      pole within bank     Bank/2 (river)
0.00      of the bank          N/A
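A small Python sketch of applying such a list: entries are checked in
decreasing score order and the first matching feature decides the sense. The
matching predicates below are illustrative assumptions, not the tutorial's
code.

decision_list = [
    (3.89, lambda ctx: "credit" in ctx, "bank/1 (financial)"),
    (2.20, lambda ctx: "muddy" in ctx, "bank/2 (river)"),
    (1.09, lambda ctx: "pole" in ctx, "bank/2 (river)"),
]

def classify(context_words, default="most frequent sense"):
    for _score, matches, sense in sorted(decision_list, key=lambda e: e[0], reverse=True):
        if matches(context_words):
            return sense
    return default

print(classify({"my", "credit", "is", "just", "horrible"}))   # -> bank/1 (financial)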
97
Using the Decision List
98
Learning a Decision Tree
  • Identify the feature that most cleanly divides
    the training data into the known senses.
  • Cleanly measured by information gain or gain
    ratio.
  • Create subsets of training data according to
    feature values.
  • Find another feature that most cleanly divides a
    subset of the training data.
  • Continue until each subset of training data is
    pure or as clean as possible.
  • Well known decision tree learning algorithms
    include ID3 and C4.5 (Quinlan, 1986, 1993)
  • In Senseval-1, a modified decision list (which
    supported some conditional branching) was most
    accurate for English Lexical Sample task
    (Yarowsky, 2000)

99
Supervised WSD with Individual Classifiers
  • Many supervised Machine Learning algorithms have
    been applied to Word Sense Disambiguation, most
    work reasonably well.
  • (Witten and Frank, 2000) is a great intro. to
    supervised learning.
  • Features tend to differentiate among methods more
    than the learning algorithms.
  • Good sets of features tend to include
  • Co-occurrences or keywords (global)
  • Collocations (local)
  • Bigrams (local and global)
  • Part of speech (local)
  • Predicate-argument relations
  • Verb-object, subject-verb,
  • Heads of Noun and Verb Phrases

100
Convergence of Results
  • Accuracy of different systems applied to the same
    data tends to converge on a particular value, no
    one system shockingly better than another.
  • Senseval-1, a number of systems in range of
    74-78% accuracy for English Lexical Sample task.
  • Senseval-2, a number of systems in range of
    61-64% accuracy for English Lexical Sample task.
  • Senseval-3, a number of systems in range of
    70-73% accuracy for English Lexical Sample task
  • What to do next?

101
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

102
Ensembles of Classifiers
  • Classifier error has two components (Bias and
    Variance)
  • Some algorithms (e.g., decision trees) try to
    build a representation of the training data: Low
    Bias/High Variance
  • Others (e.g., Naïve Bayes) assume a parametric
    form and don't represent the training data: High
    Bias/Low Variance
  • Combining classifiers with different bias/variance
    characteristics can lead to improved
    overall accuracy
  • Bagging a decision tree can smooth out the
    effect of small variations in the training data
    (Breiman, 1996)
  • Sample with replacement from the training data to
    learn multiple decision trees.
  • Outliers in training data will tend to be
    obscured/eliminated.

103
Ensemble Considerations
  • Must choose different learning algorithms with
    significantly different bias/variance
    characteristics.
  • Naïve Bayesian Classifier versus Decision Tree
  • Must choose feature representations that yield
    significantly different (independent?) views of
    the training data.
  • Lexical versus syntactic features
  • Must choose how to combine classifiers.
  • Simple Majority Voting
  • Averaging of probabilities across multiple
    classifier output
  • Maximum Entropy combination (e.g., Klein, et.
    al., 2002)

104
Ensemble Results
  • (Pedersen, 2000) achieved state of the art for
    interest and line data using ensemble of Naïve
    Bayesian Classifiers.
  • Many Naïve Bayesian Classifiers trained on
    varying sized windows of context / bags of words.
  • Classifiers combined by a weighted vote
  • (Florian and Yarowsky, 2002) achieved state of
    the art for Senseval-1 and Senseval-2 data using
    combination of six classifiers.
  • Rich set of collocational and syntactic features.
  • Combined via linear combination of top three
    classifiers.
  • Many Senseval-2 and Senseval-3 systems employed
    ensemble methods.

105
References
  • (Black, 1988) An experiment in computational
    discrimination of English word senses. IBM
    Journal of Research and Development (32) pg.
    185-194.
  • (Breiman, 1996) The heuristics of instability in
    model selection. Annals of Statistics (24) pg.
    2350-2383.
  • (Domingos and Pazzani, 1997) On the Optimality of
    the Simple Bayesian Classifier under Zero-One
    Loss, Machine Learning (29) pg. 103-130.
  • (Domingos, 2000) A Unified Bias Variance
    Decomposition for Zero-One and Squared Loss. In
    Proceedings of AAAI. Pg. 564-569.
  • (Florian and Yarowsky, 2002) Modeling Consensus
    Classifier Combination for Word Sense
    Disambiguation. In Proceedings of EMNLP, pp
    25-32.
  • (Kelly and Stone, 1975). Computer Recognition of
    English Word Senses, North Holland Publishing
    Co., Amsterdam.
  • (Klein, et. al., 2002) Combining Heterogeneous
    Classifiers for Word-Sense Disambiguation,
    Proceedings of Senseval-2. pg. 87-89.
  • (Leacock, et. al. 1993) Corpus based statistical
    sense resolution. In Proceedings of the ARPA
    Workshop on Human Language Technology. pg.
    260-265.
  • (Mooney, 1996) Comparative experiments on
    disambiguating word senses An illustration of
    the role of bias in machine learning. Proceedings
    of EMNLP. pg. 82-91.

106
References
  • (Pedersen, 1998) Learning Probabilistic Models of
    Word Sense Disambiguation. Ph.D. Dissertation.
    Southern Methodist University.
  • (Pedersen, 2000) A simple approach to building
    ensembles of Naive Bayesian classifiers for word
    sense disambiguation. In Proceedings of NAACL.
  • (Quinlan, 1986). Induction of Decision Trees.
    Machine Learning (1). pg. 81-106.
  • (Quinlan, 1993). C4.5: Programs for Machine
    Learning. San Francisco, Morgan Kaufmann.
  • (Witten and Frank, 2000). Data Mining: Practical
    Machine Learning Tools and Techniques with Java
    Implementations. Morgan-Kaufmann. San Francisco.
  • (Yarowsky, 1994) Decision lists for lexical
    ambiguity resolution: Application to accent
    restoration in Spanish and French. In Proceedings
    of ACL. pp. 88-95.
  • (Yarowsky, 2000) Hierarchical decision lists for
    word sense disambiguation. Computers and the
    Humanities, 34.

107
Part 5: Minimally Supervised Methods for Word
Sense Disambiguation
108
Outline
  • Task definition
  • What does minimally supervised mean?
  • Bootstrapping algorithms
  • Co-training
  • Self-training
  • Yarowsky algorithm
  • Using the Web for Word Sense Disambiguation
  • Web as a corpus
  • Web as collective mind

109
Task Definition
  • Supervised WSD = learning sense classifiers
    starting with annotated data
  • Minimally supervised WSD = learning sense
    classifiers from annotated data, with minimal
    human supervision
  • Examples
  • Automatically bootstrap a corpus starting with a
    few human annotated examples
  • Use monosemous relatives / dictionary definitions
    to automatically construct sense tagged data
  • Rely on Web-users active learning for corpus
    annotation

110
Outline
  • Task definition
  • What does minimally supervised mean?
  • Bootstrapping algorithms
  • Co-training
  • Self-training
  • Yarowsky algorithm
  • Using the Web for Word Sense Disambiguation
  • Web as a corpus
  • Web as collective mind

111
Bootstrapping WSD Classifiers
  • Build sense classifiers with little training data
  • Expand applicability of supervised WSD
  • Bootstrapping approaches
  • Co-training
  • Self-training
  • Yarowsky algorithm

112
Bootstrapping Recipe
  • Ingredients
  • (Some) labeled data
  • (Large amounts of) unlabeled data
  • (One or more) basic classifiers
  • Output
  • Classifier that improves over the basic
    classifiers

113
Unlabeled contexts of plant: "building the only atomic plant",
"plant growth is retarded", "a herb or flowering plant",
"a nuclear power plant", "building a new vehicle plant",
"the animal and plant life", "the passion-fruit plant"
Seed-labeled examples: "plants#1 and animals", "industry plant#2"
114
Co-training / Self-training
  • A set L of labeled training examples
  • A set U of unlabeled examples