Discriminating Word Senses Using McQuitty's Similarity Analysis

Provided by: UMD6
Learn more at: https://www.d.umn.edu
1
Discriminating Word Senses Using McQuitty's Similarity Analysis
  • Amruta Purandare
  • University of Minnesota, Duluth
  • Advisor: Dr. Ted Pedersen
  • Research supported by a National Science Foundation (NSF) Faculty Early Career Development Award (0092784)

2
Discriminating line
  • They will begin line formation before ceremony
  • Connect modem to any jack on your line
  • Quit printing after the last line of each file
  • Your line will not get tied while you are connected to net
  • Stand balanced and comfortable during line up
  • Lines that do not fit a page are truncated
  • New line service provides reliable connections
  • Pages are separated by line feed characters
  • They stand far right when in line formation
3
  • They will begin line formation before ceremony
  • Stand balanced and comfortable during line up
  • They stand far right when in line formation

  • Your line will not get tied while you are connected to net
  • Connect modem to any jack on your line
  • New line service provides reliable connections

  • Quit printing after the last line of each file
  • Lines that do not fit a page are truncated
  • Pages are separated by line feed characters
4
Introduction
  • What is Word Sense Discrimination?
  • Unsupervised learning

[Diagram labels: Clusters, Training, Features, Test, Feature Vectors, similarity matrix, evaluate]
5
Representing context
  • Features (from training)
    • Bigrams
    • Unigrams
    • Second Order Co-occurrences/SOCs (Schütze, 1998)
    • Mixture
  • Feature vectors (Binary)
  • Measuring similarity
    • Cosine
    • Match
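
The two similarity measures on binary feature vectors can be sketched as follows. This is a minimal illustration, not the SenseClusters code: it assumes Match is the count of features shared by two contexts and Cosine is that count scaled by the geometric mean of the two vector sizes. The feature list and the third example context are hypothetical.

```python
import math

def binary_vector(context_tokens, features):
    """Return a 0/1 vector marking which features occur in the context."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]

def match(u, v):
    """Matching coefficient (assumed definition): features set in both vectors."""
    return sum(a & b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine for binary vectors: shared features scaled by both vector sizes."""
    norm = math.sqrt(sum(u) * sum(v))
    return match(u, v) / norm if norm else 0.0

# Hypothetical unigram features; real features would come from training data.
features = ["modem", "jack", "service", "page", "feed"]
a = binary_vector("connect modem to any jack on your line".split(), features)
b = binary_vector("new line service provides reliable connections".split(), features)
c = binary_vector("connect the modem to the phone service".split(), features)

print(match(a, c), cosine(a, c))  # one shared feature, scaled by sqrt(2 * 2)
print(match(a, b), cosine(a, b))  # no shared features
```

Match grows with context length, while Cosine normalizes for it, which is one plausible reading of the later remark that scaling by Cosine helps.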

6
Feature examples
7
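McQuitty's method
  • Pedersen & Bruce, 1997
  • Agglomerative
  • UPGMA / Average Link
  • Stopping rules
    • Number of clusters
    • Score cutoff

[Diagram: when two clusters with similarities x and y to a third cluster are merged, the merged cluster's similarity to it is (x + y)/2]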
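
A rough sketch of agglomerative clustering with this kind of averaging update and the two stopping rules named on the slide. It assumes the update is (x + y)/2, where x and y are the similarities of the two merged clusters to a third cluster. The function name `mcquitty`, its parameters, and the example matrix are illustrative and not taken from the toolkit.

```python
def mcquitty(sim, num_clusters=1, score_cutoff=None):
    """Cluster items given a symmetric similarity matrix (list of lists).

    Stops when num_clusters remain, or earlier if the best remaining
    similarity drops below score_cutoff (when one is given).
    """
    clusters = {i: [i] for i in range(len(sim))}
    # Pairwise similarities between current clusters, keyed by (low id, high id).
    s = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}
    while len(clusters) > max(1, num_clusters):
        (a, b), best = max(s.items(), key=lambda kv: kv[1])
        if score_cutoff is not None and best < score_cutoff:
            break
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
        # Keep similarities that do not involve either merged cluster.
        new_s = {k: v for k, v in s.items() if a not in k and b not in k}
        for k in clusters:
            if k == a:
                continue
            x = s[(min(a, k), max(a, k))]
            y = s[(min(b, k), max(b, k))]
            new_s[(min(a, k), max(a, k))] = (x + y) / 2  # averaging update
        s = new_s
    return sorted(sorted(c) for c in clusters.values())

# Made-up similarity matrix for four contexts.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(mcquitty(sim, num_clusters=2))
print(mcquitty(sim, score_cutoff=0.85))
```

The first call stops on the number of clusters; the second stops on the score cutoff, leaving any contexts whose best similarity is below 0.85 as singletons.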
8
Evaluation
[Table of clusters against senses; sense labels: sense1 (Maj), sense2, sense3, sense4; cluster labels: c2, c3, c1, c4; counts not preserved]
9
Evaluation
Accuracy = 38/55 = 0.69
[Table with sense columns reordered: sense3, sense4, sense1, sense2; counts not preserved]
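
One way to compute this kind of accuracy is to try every one-to-one assignment of clusters to senses and keep the one that puts the most instances on the diagonal. The sketch below assumes that is what the column reordering on the slide achieves; the function name and the counts are made up and are not the slide's data.

```python
from itertools import permutations

def best_mapping_accuracy(table):
    """table[c][s] = number of instances of sense s placed in cluster c.

    Returns (hits, total, mapping) for the one-to-one cluster-to-sense
    assignment with the most hits. Assumes a square table; brute force,
    so only suitable for a handful of senses.
    """
    n = len(table)
    total = sum(sum(row) for row in table)
    best_hits, best_perm = -1, None
    for perm in permutations(range(n)):
        hits = sum(table[c][perm[c]] for c in range(n))
        if hits > best_hits:
            best_hits, best_perm = hits, perm
    return best_hits, total, best_perm

# Made-up counts for illustration only.
table = [
    [1, 6, 0],
    [7, 2, 1],
    [0, 1, 5],
]
hits, total, mapping = best_mapping_accuracy(table)
accuracy = hits / total
print(hits, total, mapping, round(accuracy, 2))
```

Here `mapping[c]` is the sense assigned to cluster `c`, and the accuracy is the share of instances that land in the cluster mapped to their own sense.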
10
Majority Sense Classifier
Maj. = 17/55 = 0.31
sense2
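
The majority sense baseline assigns every instance to the most frequent sense. A small sketch: only the 17 and the 55 come from the slide, while the sense names and the split of the remaining 38 instances are hypothetical.

```python
from collections import Counter

def majority_baseline(sense_labels):
    """Return the most frequent sense and the accuracy of always guessing it."""
    counts = Counter(sense_labels)
    sense, freq = counts.most_common(1)[0]
    return sense, freq / len(sense_labels)

# 17 of 55 instances carry the majority sense; the rest is an invented split.
labels = ["major"] * 17 + ["other_a"] * 14 + ["other_b"] * 13 + ["other_c"] * 11
sense, baseline = majority_baseline(labels)
print(sense, round(baseline, 2))
```

An experiment is counted as useful in the later result slides only when its accuracy exceeds this baseline.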
11
Experimental Data
12
Scope of the experiments
  • 584 experiments (73 × 4 × 2)
    • 73 words: 72 from Senseval-2, plus LINE
    • 4 features: bigrams, unigrams, SOCs, mix
    • 2 similarity measures: Match, Cosine
  • Window: 5 (for bigrams and SOCs)
  • Frequency cutoff: 2
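
The window and frequency cutoff settings can be illustrated with a small bigram selector. This is a sketch under two assumptions, not a description of how NSP works internally: a window of 5 spans five tokens (so up to three tokens may sit between the two words), and a cutoff of 2 keeps bigrams seen at least twice in training. The function names and training contexts are hypothetical.

```python
from collections import Counter

def windowed_bigrams(tokens, window=5):
    """Yield ordered pairs (w1, w2) where w2 follows w1 inside the window."""
    for i, w1 in enumerate(tokens):
        # With window=5 this covers the next four tokens after w1.
        for w2 in tokens[i + 1 : i + window]:
            yield (w1, w2)

def select_bigram_features(contexts, window=5, cutoff=2):
    """Count windowed bigrams over all contexts and keep the frequent ones."""
    counts = Counter()
    for tokens in contexts:
        counts.update(windowed_bigrams(tokens, window))
    return {bigram for bigram, n in counts.items() if n >= cutoff}

# Hypothetical training contexts.
contexts = [
    "connect modem to any jack".split(),
    "connect the modem now".split(),
]
print(select_bigram_features(contexts))            # window of 5, cutoff of 2
print(select_bigram_features(contexts, window=2))  # adjacent words only
```

With the wider window, "connect ... modem" is found in both contexts and survives the cutoff; with adjacent-only bigrams it occurs once and is dropped.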

13
Senseval-2 Results POS wise
POS:  29 nouns | 28 verbs | 15 adjectives
Maj:  0.57 | 0.51 | 0.64
Number of words of each POS for which an experiment obtained accuracy higher than the majority baseline
14
Senseval-2 Results Feature wise
SOC   UNI   BI
32    18    38
72 words × 2 measures = 144
15
Senseval-2 Results Measure wise
COS   MAT
49    39
72 words × 3 features = 216
16
Line Results
Maj = 0.16 (on a uniform distribution of 6 senses)
17
Sample Confusion Table (fine.soc.cos)
S0 = elegant, S1 = small grained, S2 = superior, S3 = satisfactory, S4 = thin
Total: 60
Precision = 36/60 = 60.00%
18
Conclusions
  • Small set of SOCs was powerful
    • Half the number of unigrams/bigrams
  • Scaling done by Cosine helps!
  • Need more training data!
  • Need to improve feature
    • selection (tests of association)
    • extraction (stemming)
    • matching (fuzzy matching)
    • strategies for bigrams
  • Explore new features
    • POS
    • Collocations

19
Recent work
  • PDL implementation
  • Cluto - Clustering Toolkit
    • http://www-users.cs.umn.edu/~karypis/cluto
    • 6 clustering methods, 12 merging criteria
  • Plans
    • Comparing clustering in similarity space vs. vector space (Schütze, 1998)
    • Stopping rules

20
Sense labeling
  • They will begin line formation before ceremony
  • Stand balanced and comfortable during line up
  • They stand far right when in line formation
Label: formation

  • Your line will not get tied while you are connected to net
  • Connect modem to any jack on your line
  • New line service provides reliable connections
Label: phone

  • Quit printing after the last line of each file
  • Lines that do not fit a page are truncated
  • Pages are separated by line feed characters
Label: text
21
Software Packages
  • SenseClusters (our discrimination toolkit)
    • http://www.d.umn.edu/~tpederse/senseclusters.html
  • PDL (used to implement clustering algorithms)
    • http://pdl.perl.org/
  • NSP (used for extracting features)
    • http://www.d.umn.edu/~tpederse/nsp.html
  • SenseTools (used for preprocessing, feature matching)
    • http://www.d.umn.edu/~tpederse/sensetools.html
  • Cluto (clustering toolkit)
    • http://www-users.cs.umn.edu/~karypis/cluto