Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka

Provided by: Zhes

Learn more at: https://rakaposhi.eas.asu.edu

Topic
  Semantic similarity measures between two words
Why interesting?
  In information retrieval
    Query expansion
    Automatic annotation of Web pages
    Community mining
  In natural language processing
    Word-sense disambiguation
    Synonym extraction
    Language modeling

WWW 2007 Paper Presentation Zheshen Wang
May 8th, 2007

Solution proposed
  Use the information available on the Web
    Page counts and text snippets
    SVM for an optimal combination

Page counts
  Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI
  Modification: suppress random co-occurrences
    Score = 0 if H(P ∩ Q) < c, where H(x) is the page count for the query x
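
The four page-count measures can be sketched as follows. The slide only names the measures, so the formulas below are their standard definitions applied to page counts; the function name, the index size n_docs, and the default threshold c=5 are illustrative assumptions, not values taken from the slide.

```python
import math

def cooccurrence_scores(h_p, h_q, h_pq, n_docs, c=5):
    """Page-count similarity scores for a word pair (P, Q).

    h_p, h_q : page counts H(P) and H(Q)
    h_pq     : page count H(P AND Q) for the conjunctive query
    n_docs   : assumed number of documents indexed by the search engine
    c        : assumed positive threshold; pairs with H(P AND Q) < c score 0
    """
    # Suppress random co-occurrences, as on the slide.
    if h_pq < c:
        return {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}
    return {
        "jaccard": h_pq / (h_p + h_q - h_pq),
        "overlap": h_pq / min(h_p, h_q),
        "dice": 2 * h_pq / (h_p + h_q),
        # Pointwise mutual information, with probabilities estimated as H(.) / n_docs.
        "pmi": math.log2((h_pq / n_docs) / ((h_p / n_docs) * (h_q / n_docs))),
    }

scores = cooccurrence_scores(h_p=100, h_q=200, h_pq=50, n_docs=10_000)
```

The page counts in the usage line are made-up numbers chosen for easy arithmetic; in practice they would come from a search engine.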

Text snippets (context- and statistics-based): top 200 pattern frequencies
  Lexico-syntactic pattern extraction
    e.g. "Toyota and Nissan are two major Japanese car manufacturers."
  If a pattern appears far more often (>>) in snippets for synonymous words than in snippets for non-synonymous words, it is a reliable indicator of synonymy.
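
A minimal sketch of the pattern extraction step, assuming the two query words are replaced by the placeholders X and Y and that a pattern is any n-gram (n = 2 to 5) containing exactly one X and one Y. The function name and the simple regular-expression tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def extract_patterns(snippet, word_a, word_b, n_values=(2, 3, 4, 5)):
    """Count n-gram patterns in a snippet that contain exactly one X and one Y."""
    tokens = re.findall(r"\w+", snippet)  # naive tokenizer (assumption)
    a, b = word_a.lower(), word_b.lower()
    # Replace the two query words with placeholders; lowercase everything else.
    marked = [
        "X" if t.lower() == a else "Y" if t.lower() == b else t.lower()
        for t in tokens
    ]
    patterns = Counter()
    for n in n_values:
        for i in range(len(marked) - n + 1):
            gram = marked[i:i + n]
            if gram.count("X") == 1 and gram.count("Y") == 1:
                patterns[" ".join(gram)] += 1
    return patterns

snippet = "Toyota and Nissan are two major Japanese car manufacturers."
patterns = extract_patterns(snippet, "Toyota", "Nissan")
# patterns holds "X and Y", "X and Y are" and "X and Y are two", each once
```

Summing these counters over the snippets of many word pairs would give the pattern frequencies from which the top 200 are selected.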

Combination
  204-dimensional feature vector F: 200 pattern frequencies + 4 co-occurrence measures
  Two-class SVM
    Synonymous word pairs (positive), non-synonymous word pairs (negative)
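
The feature vector for one word pair could be assembled roughly as below. This sketch covers only the vector layout; the names are hypothetical, raw pattern counts are used without any normalization (an assumption), and the SVM training itself is left to whichever library is used.

```python
MEASURES = ("jaccard", "overlap", "dice", "pmi")

def build_feature_vector(pattern_counts, top_patterns, cooccurrence):
    """Concatenate pattern frequencies with the four co-occurrence scores.

    pattern_counts : mapping from pattern string to its frequency for this word pair
    top_patterns   : ordered list of selected patterns (200 on the slide)
    cooccurrence   : mapping with one score per name in MEASURES
    """
    pattern_part = [float(pattern_counts.get(p, 0)) for p in top_patterns]
    measure_part = [float(cooccurrence[name]) for name in MEASURES]
    return pattern_part + measure_part

# Toy example with 2 patterns instead of 200, so the vector has 2 + 4 entries.
top_patterns = ["X and Y", "X is a Y"]
vector = build_feature_vector(
    {"X and Y": 3},
    top_patterns,
    {"jaccard": 0.2, "overlap": 0.5, "dice": 0.4, "pmi": 1.5},
)
# vector == [3.0, 0.0, 0.2, 0.5, 0.4, 1.5]
```

Vectors built from synonymous pairs would be labeled positive and those from non-synonymous pairs negative before being passed to a two-class SVM.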

My criticisms of the solution

Statistics- and context-based pattern selection is not reliable (no ontology or syntax templates)
  Sparse distribution
  Noise (meaningless patterns)
  Correlated patterns (e.g. "X and Y", "X and Y are", "X and Y are two")
  Meaningful patterns are missed because of the limited n-gram range (n = 2, 3, 4, 5)
    e.g. when X and Y are far apart, beyond the range of the n-grams: "Rose is a very popular flower in the US."
Feature vector F: 200 pattern frequencies, 4 co-occurrence measures
Error-prone for uncommon words
  e.g. rarely used professional terms
  The base set from the Web is too small to be reliable.
  As in the case of CBioC, user voting would be better.
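
The limited n-gram range criticism can be checked on the slide's own example. The sketch below measures how many tokens the shortest span covering both words contains and compares it with the largest n-gram size; the function names and the naive tokenizer are assumptions for illustration.

```python
import re

def token_span(snippet, word_a, word_b):
    """Length in tokens of the shortest span covering both words, or None if one is absent."""
    tokens = [t.lower() for t in re.findall(r"\w+", snippet)]
    a, b = word_a.lower(), word_b.lower()
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b) + 1

def covered_by_ngrams(snippet, word_a, word_b, max_n=5):
    """True if some n-gram with n <= max_n can contain both words."""
    span = token_span(snippet, word_a, word_b)
    return span is not None and span <= max_n

covered_by_ngrams("Rose is a very popular flower in the US.", "rose", "flower")  # False: span is 6 tokens
covered_by_ngrams("Toyota and Nissan are two major Japanese car manufacturers.", "Toyota", "Nissan")  # True: span is 3 tokens
```

With n limited to 5, the "Rose ... flower" pair yields no pattern at all, even though the sentence states the relation directly.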

How is it related to our course?

Web-based information extraction (knowledge extraction)
  Extracts base-level knowledge (facts) directly from the Web
  Page counts (hits), e.g. KnowItAll
  Inevitable drawback: error-prone for words that are uncommon on the Web, e.g. CBioC
Making use of the "Collective Unconscious" (Big Idea 3)
  Analyzing term co-occurrences to capture semantic information
Co-occurrence measures
  Similarity measures in terms of co-occurrence: Jaccard, Overlap (Simpson), PMI
Making use of context based on statistics
  Patterns come from context rather than from an ontology (SemTag and Seeker).
  Patterns are decided by statistics rather than by templates from a syntax tree (generic extraction patterns, Hearst 92).
  n-grams around a word, somewhat like the 20-word window of spot(l,c) in SemTag and Seeker.
