Title: Measuring Semantic Similarity between Words Using Web Search Engines Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
- Topic
- Semantic similarity measures between two words
- Why interesting?
- In information retrieval
- Query expansion
- Automatic annotation of Web pages
- Community mining
- In natural language processing
- Word-sense disambiguation
- Synonym extraction
- Language modeling
WWW 2007 Paper Presentation Zheshen Wang
May 8th, 2007
Solution proposed
- By using the information available on the Web
- Page counts + text snippets
- SVM for an optimal combination
- Page counts
- Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI
- Modification: suppress random co-occurrences
- Score = 0 if H(P ∩ Q) < c, where H(x) is the page count for the query x
- Text snippets (context- and statistics-based): top 200 pattern frequencies
- Lexico-syntactic pattern extraction
- e.g. "Toyota and Nissan are two major Japanese car manufacturers."
- If a pattern appears far more often in snippets for synonymous words than in snippets for non-synonymous words, it is a reliable indicator of synonymy.
- Combination
- 204-D feature vector F: 200 pattern frequencies + 4 co-occurrence measures
- Two-class SVM
- Synonymous word pairs (positive), non-synonymous word pairs (negative)
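The page-count measures and the 204-D feature vector described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value `c=5`, the index size `n`, and the base-2 logarithm for PMI are all assumptions, since the slide only states that the score is set to 0 when H(P ∩ Q) < c.

```python
import math


def web_measures(h_p, h_q, h_pq, c=5, n=10**10):
    """Page-count co-occurrence scores for a word pair (P, Q).

    h_p, h_q: page counts H(P) and H(Q).
    h_pq:     page count H(P ∩ Q) for the conjunctive query "P AND Q".
    c, n:     assumed threshold and assumed number of indexed pages
              (placeholder values, not taken from the slides).
    """
    zero = {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}
    # Suppress random co-occurrences: score 0 if H(P ∩ Q) < c.
    if h_pq < c or min(h_p, h_q, h_pq) <= 0:
        return zero
    union = h_p + h_q - h_pq
    if union <= 0:  # guard against inconsistent counts from a search engine
        return zero
    return {
        "jaccard": h_pq / union,
        "overlap": h_pq / min(h_p, h_q),
        "dice": 2 * h_pq / (h_p + h_q),
        "pmi": math.log2((h_pq / n) / ((h_p / n) * (h_q / n))),
    }


def feature_vector(pattern_counts, top_patterns, measures):
    """Concatenate pattern frequencies with the 4 co-occurrence measures.

    With 200 entries in top_patterns this gives the 204-D vector F.
    """
    freqs = [float(pattern_counts.get(p, 0)) for p in top_patterns]
    return freqs + [measures[k] for k in ("jaccard", "overlap", "dice", "pmi")]
```

Vectors built this way for synonymous (positive) and non-synonymous (negative) word pairs would then serve as training data for the two-class SVM; the SVM itself is left out here to keep the sketch free of external libraries.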
My criticisms of the solution
- Statistics- and context-based pattern selection is not reliable (no ontology or syntax templates)
- Sparse distribution
- Noise (meaningless patterns)
- Correlations (e.g. "X and Y", "X and Y are", "X and Y are two")
- Missing meaningful patterns due to the limited n-gram range
- (e.g. X and Y are far apart, beyond the range of n-grams, n = 2, 3, 4, 5: "Rose is a very popular flower in the US.")
- Feature vector F: 200 pattern frequencies + 4 co-occurrence measures
- Error-prone for uncommon words
- e.g. rarely used professional terms
- Base set from the web is too small to be reliable.
- As in the case of CBioC, user voting would be better
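The correlation and n-gram-range criticisms above can be made concrete with a small sketch. It assumes a simplified extraction procedure (whitespace tokenization, the two query words replaced by X and Y, and every n-gram with n = 2..5 that contains exactly one X and one Y kept as a pattern); the function name and tokenization are illustrative, not taken from the paper.

```python
def extract_patterns(snippet, word_x, word_y, n_values=(2, 3, 4, 5)):
    """Return all n-grams that contain exactly one X and one Y."""
    tokens = []
    for raw in snippet.split():
        tok = raw.strip(".,;:!?\"'()").lower()
        if not tok:
            continue
        if tok == word_x.lower():
            tokens.append("X")
        elif tok == word_y.lower():
            tokens.append("Y")
        else:
            tokens.append(tok)

    patterns = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram.count("X") == 1 and gram.count("Y") == 1:
                patterns.append(" ".join(gram))
    return patterns


toyota = extract_patterns(
    "Toyota and Nissan are two major Japanese car manufacturers.",
    "Toyota", "Nissan")
# Three overlapping (correlated) patterns: "X and Y", "X and Y are",
# "X and Y are two".
print(toyota)

rose = extract_patterns(
    "Rose is a very popular flower in the US.", "Rose", "flower")
# "X is a very popular Y" spans 6 tokens, so no pattern is found for n <= 5.
print(rose)
```

Under these assumptions the first snippet produces three patterns that are prefixes of one another, and the second produces none until the window is widened to six tokens.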
How is it related to our course?
- Web-based information extraction (knowledge extraction)
- Extract base-level knowledge (facts) directly from the web
- Page counts (hits), e.g. KnowItAll
- Inevitable drawback: error-prone for words that are uncommon on the web, e.g. CBioC
- Making use of the "Collective Unconscious" (Big Idea 3)
- Analyzing term co-occurrences to capture semantic information
- Co-occurrence measures
- Similarity measure in terms of co-occurrence
- Jaccard, Overlap (Simpson), PMI
- Making use of context based on statistics
- Patterns from context rather than from an ontology (SemTag and Seeker).
- Patterns decided by statistics rather than templates from a syntax tree (generic extraction patterns, Hearst 92).
- n-grams for a word, somewhat like the 20-word window of spot(l,c) in SemTag and Seeker.
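As a rough illustration of "patterns decided by statistics", the sketch below ranks candidate patterns by how much more often they occur in snippets for synonymous pairs than in snippets for non-synonymous pairs. The smoothed frequency ratio used here is a deliberately simple stand-in chosen for illustration, not the scoring used in the paper, and the function name and parameters are hypothetical.

```python
from collections import Counter


def rank_patterns(syn_patterns, nonsyn_patterns, top_k=200, smoothing=1.0):
    """Rank patterns by a smoothed synonymous / non-synonymous count ratio.

    syn_patterns:    patterns extracted from snippets of synonymous pairs.
    nonsyn_patterns: patterns extracted from snippets of non-synonymous pairs.
    Only the illustrative ratio score is computed; ties are broken
    alphabetically so the result is deterministic.
    """
    syn = Counter(syn_patterns)
    non = Counter(nonsyn_patterns)
    scores = {p: (syn[p] + smoothing) / (non[p] + smoothing) for p in syn}
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_k]
```

A pattern such as "X and Y" that is equally frequent in both sets gets a ratio near 1 and ranks low, which is one way to filter the noisy, uninformative patterns mentioned in the criticisms.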