Measuring Semantic Similarity between Words Using Web Search Engines
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka
  • Topic: semantic similarity measures between two words
  • Why interesting?
  • In information retrieval: query expansion, automatic annotation of
    Web pages, community mining
  • In natural language processing: word-sense disambiguation, synonym
    extraction, language modeling

WWW 2007 Paper Presentation Zheshen Wang
May 8th, 2007
  • Solution proposed
  • Uses the information available on the Web: page counts and text
    snippets
  • SVM for an optimal combination of the two
  • Page counts
  • Co-occurrence measures: Jaccard, Overlap (Simpson), Dice, PMI
  • Modification: suppress random co-occurrences
  • Score = 0 if H(P ∩ Q) < c, where H(x) is the page count for the
    query x
  • Text snippets (context and statistics based): frequencies of the
    top 200 patterns
  • Lexico-syntactic pattern extraction
  • e.g. "Toyota and Nissan are two major Japanese car manufacturers."
  • If a pattern appears far more often in snippets for synonymous
    words than in snippets for non-synonymous words, it is a reliable
    indicator of synonymy.
  • Combination
  • 204-D feature vector F: 200 pattern frequencies + 4 co-occurrence
    measures
  • Two-class SVM: synonymous word pairs (positive), non-synonymous
    word pairs (negative)
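
The page-count scores listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it uses the standard set-overlap definitions of Jaccard, Overlap (Simpson), Dice and PMI with page counts plugged in, and applies the slide's rule of returning 0 when H(P ∩ Q) < c. The function name, the index size `n_pages`, the threshold value `c=5` and the example counts are all assumptions made for the sketch.

```python
import math


def cooccurrence_scores(h_p, h_q, h_pq, n_pages=10**10, c=5):
    """Return Jaccard, Overlap, Dice and PMI computed from page counts.

    h_p, h_q -- page counts H(P), H(Q) for the single-word queries
    h_pq     -- page count H(P AND Q) for the conjunctive query
    n_pages  -- assumed number of indexed pages (illustrative value)
    c        -- assumed noise threshold (illustrative value)
    """
    # Suppress random co-occurrences: too few joint hits means score 0.
    if h_pq < c:
        return {"jaccard": 0.0, "overlap": 0.0, "dice": 0.0, "pmi": 0.0}
    return {
        "jaccard": h_pq / (h_p + h_q - h_pq),
        "overlap": h_pq / min(h_p, h_q),
        "dice": 2 * h_pq / (h_p + h_q),
        # PMI compares the joint probability with the product of marginals.
        "pmi": math.log2((h_pq / n_pages) / ((h_p / n_pages) * (h_q / n_pages))),
    }


# Made-up page counts, for illustration only.
scores = cooccurrence_scores(h_p=1000, h_q=500, h_pq=200)
```

In the approach summarized on this slide, these four values would fill the last four slots of the 204-D feature vector, next to the 200 pattern frequencies.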

My criticisms of the solution
  • Statistics- and context-based pattern selection is not reliable
    (no ontology or syntax templates)
  • Sparse distribution
  • Noise (meaningless patterns)
  • Correlated patterns (e.g. "X and Y", "X and Y are", "X and Y are
    two")
  • Meaningful patterns are missed because of the limited n-gram range
    (n = 2, 3, 4, 5)
  • e.g. when X and Y are too far apart to fit in one n-gram: "Rose is
    a very popular flower in the US."
  • Feature vector F: 200 pattern frequencies + 4 co-occurrence
    measures
  • Error prone for uncommon words
  • e.g. rarely used professional terms
  • The base set from the Web is too small to be reliable.
  • As in the case of CBioC, user voting would be better
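
The n-gram range criticism can be illustrated with a small sketch. It assumes a simplified extraction scheme: the two query words are replaced by the placeholders X and Y, and every word n-gram (n = 2 to 5) containing both placeholders is kept as a pattern. The function name and the whitespace tokenization are hypothetical choices for this example, and the procedure in the paper may differ in detail.

```python
def extract_patterns(snippet, word_x, word_y, n_values=(2, 3, 4, 5)):
    """Collect word n-grams that contain both placeholders X and Y."""
    tokens = []
    for raw in snippet.split():
        token = raw.strip(".,;:!?\"'").lower()
        if not token:
            continue
        if token == word_x.lower():
            tokens.append("X")
        elif token == word_y.lower():
            tokens.append("Y")
        else:
            tokens.append(token)

    patterns = []
    for n in n_values:
        for start in range(len(tokens) - n + 1):
            gram = tokens[start:start + n]
            # Keep only n-grams in which both query words occur.
            if "X" in gram and "Y" in gram:
                patterns.append(" ".join(gram))
    return patterns


# X and Y are adjacent: several overlapping (correlated) patterns are found.
near = extract_patterns(
    "Toyota and Nissan are two major Japanese car manufacturers.",
    "Toyota", "Nissan")

# X and Y are six tokens apart: no n-gram with n <= 5 covers both words.
far = extract_patterns("Rose is a very popular flower in the US.",
                       "rose", "flower")
```

With these inputs, `near` holds the three correlated patterns named above ("X and Y", "X and Y are", "X and Y are two"), while `far` is empty, so the "X is a very popular Y" pattern is never seen.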

How is it related to our course?
  • Web-based information extraction (Knowledge
    Extraction)
  • Extract base level knowledge (facts) directly
    from the web
  • Page counts (hits), e.g. KnowItAll
  • Inevitable drawback: error prone for words that are uncommon on
    the Web, e.g. CBioC
  • Making use of the Collective Unconscious (Big Idea 3)
  • Analyzing term co-occurrences to capture semantic
    information
  • Co-occurrence measures
  • Similarity measure in terms of co-occurrence
  • Jaccard, Overlap (Simpson), PMI
  • Making use of context based on statistics
  • Patterns from context rather than from an ontology (SemTag and
    Seeker).
  • Patterns decided by statistics rather than by templates from a
    syntax tree (generic extraction patterns, Hearst 92).
  • n-grams around a word, somewhat like the 20-word window of
    spot(l,c) in SemTag and Seeker.
