Mining Context Specific Similarity Relationships Using The World Wide Web - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Mining Context Specific Similarity Relationships Using The World Wide Web

Mining Context Specific Similarity Relationships
Using The World Wide Web
  • Weiguo Fan, Department of Information Systems, Virginia Tech
  • Dmitri Roussinov, Department of Information Systems, W.P. Carey School of Business, Arizona State University
  • Leon J. Zhao, Department of Management Information Systems, University of Arizona
Slides available at http://
Narrative available at http://qa.wpcarey.asu.e
Motivation: Inter-Document Similarity Computation
Involved in:
  • Document retrieval
  • Clustering
  • Filtering
  • Summarization
  • Query by example
  • and many other human language technologies.

Motivation: Vocabulary Mismatch Problem
  • Vector space model (Salton, 1983): documents are
    represented by vectors of terms
  • Synonyms or similar words are treated as entirely
    different features: car and automobile, jacket
    and coat, etc.
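The mismatch is easy to see in code. The sketch below (function and variable names are mine, not from the slides) scores two near-synonymous documents with plain cosine similarity over term vectors: "car" and "automobile" contribute nothing to each other, so only the literal overlap counts.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Near-synonymous documents: only the shared term "engine" counts;
# "car" vs. "automobile" are treated as entirely different features.
d1 = {"car": 1.0, "engine": 1.0}
d2 = {"automobile": 1.0, "engine": 1.0}
print(cosine(d1, d2))  # 0.5, although the documents mean the same thing
```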

Solutions Explored Earlier
  • Thesaurus (e.g. WordNet; Miller et al., 1990;
    Voorhees, 1994): not always effective
  • Possible explanation: incomplete coverage,
    ignores context
  • Co-occurrence similarity mining: earlier works
    with mixed results
  • van Rijsbergen, 1977; Minker et al., 1972; Peat
    and Willett, 1991

Why Earlier Approaches Did Not Always Work
Suggested explanations:
  • Small collections: not enough data for reliable
    similarity mining
  • Tested only with document retrieval tasks: short,
    ambiguous queries
  • Simplistic models: e.g. extending boolean queries
    (jaguar OR auto OR power OR car)

More Recent Approaches: Some Success
  • Grefenstette, 1994; Church et al., 1991; Hearst
    et al., 1992; Schutze and Pedersen, 1997;
    Voorhees, 1994; Roussinov and Zhao, 2003; Kwok et
    al., 2004
  • Questions remain open:
  • Magnitude of the improvement
  • Best expansion/mining models
  • Mine the corpus itself or look for external data
    (e.g. the Web)?
  • Even if it works in document retrieval, what about
    the more general case of similarity computation
    (clustering, summarization, filtering, etc.)?
  • Our study is designed to answer those questions

What is Done: Context Specific Similarity
  • Context Specific: an external corpus is harvested
    from the Web
  • Co-occurrence mining is performed in it
  • Document vector representations (from the target
    collection) are expanded with similar terms
  • Tested on Reuters collections
  • Improvement 50% larger than with
    self-mining (analyzing the target corpus only,
    using LSI or PRF adaptations)

Context Specific Similarity Discovery:
Architecture of the Approach
Context Specific Similarity Discovery: Steps
  • Context Hint string:
  • 100 most frequent terms (minus stopwords) to
    represent the context of the target collection
  • Context Queries:
  • each of the 1000 most frequent terms combined with
    the Context Hint string (next slide)
  • Co-occurrence mining is performed:
  • sim(t1, t2)
  • Document vector representations are expanded with
    similar terms (next slides)
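The query-construction and mining steps above can be sketched as follows. The transcript does not preserve the sim(t1, t2) formula, so the cosine-style normalized co-occurrence count below is an assumption, and `build_context_queries` / `cooccurrence_sim` are my own names.

```python
from collections import defaultdict
from math import sqrt

def build_context_queries(frequent_terms, context_hint):
    """One Context Query per frequent term: the term plus the
    shared Context Hint string."""
    return [f"{term} {context_hint}" for term in frequent_terms]

def cooccurrence_sim(passages):
    """Mine sim(t1, t2) from the harvested passages as a cosine-style
    normalized co-occurrence count (assumed formula, not the slide's)."""
    occurs = defaultdict(set)          # term -> set of passage ids
    for i, passage in enumerate(passages):
        for term in set(passage.lower().split()):
            occurs[term].add(i)

    def sim(t1, t2):
        a, b = occurs[t1], occurs[t2]
        if not a or not b:
            return 0.0
        return len(a & b) / sqrt(len(a) * len(b))

    return sim

# Hypothetical snippets harvested from the Web for a commodities corpus:
sim = cooccurrence_sim([
    "wheat export prices rose on the grain market",
    "wheat prices fell as export demand slowed",
    "jaguar unveiled a new luxury car",
])
print(build_context_queries(["wheat"], "trade market prices"))
# ['wheat trade market prices']
```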

Context Specific Similarity Discovery: Example of
Context Hint Terms (top 108 most frequent)
Some of the 1000 Context Queries for Alta
Vista ("+" means the word is required)
Context Specific Similarity Discovery: Document
Expansion
  • w(t, D) -- the initial, not expanded, weight of
    the term t in the document D
  • w'(t, D) -- the modified weight of the term t in
    the document D
  • t' iterates through all terms in the document D
  • a is the adjustment factor (a control parameter)
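The expansion formula itself was on the slide image and is not recoverable from the transcript; a reconstruction consistent with the definitions above is w'(t, D) = w(t, D) + a · Σ_t' sim(t, t') · w(t', D), with t' ranging over the terms of D. The sketch below implements that assumed form (function names are mine).

```python
def expand_vector(doc, sim, vocabulary, a=0.5):
    """Expand a document's term weights with mined similar terms.

    doc        -- dict of term -> w(t, D), the original weights
    sim        -- sim(t1, t2) from co-occurrence mining
    vocabulary -- candidate terms to (re)weight
    a          -- the adjustment factor (control parameter)

    Assumed form: w'(t, D) = w(t, D) + a * sum_t' sim(t, t') * w(t', D),
    where t' iterates over the terms of D.
    """
    expanded = {}
    for t in vocabulary:
        boost = sum(sim(t, t2) * w for t2, w in doc.items() if t2 != t)
        weight = doc.get(t, 0.0) + a * boost
        if weight > 0.0:
            expanded[t] = weight
    return expanded

# With a mined similarity of 1.0 between "car" and "automobile",
# a document about cars gains a down-weighted "automobile" feature.
toy_sim = lambda t1, t2: 1.0 if {t1, t2} == {"car", "automobile"} else 0.0
print(expand_vector({"car": 1.0}, toy_sim, ["car", "automobile"], a=0.5))
# {'car': 1.0, 'automobile': 0.5}
```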

Experiment Setup: Collection
  • Reuters Collection (Lewis, 1997)
  • Newswires labeled with one or more of 78
    commodity code topics
  • discarded topics that had only 1 document
  • discarded documents not labeled with any of the
    remaining topics
  • 1841 documents left
  • Still plenty of pairs for comparison!
  • Porter stemming for indexing

Experiment Setup: Metric
  • Based on Kruskal-Goodman statistics (Haveliwala
    et al., 2002)
  • Intuition: if documents share topics, they have to
    be more similar than those that do not. Defining:
  • Sa: set of all the document triples (D, D1, D2)
    such that
  • D ≠ D1, D ≠ D2, D1 ≠ D2,
  • D shares at least one common category with D1
  • D shares no common categories with D2
  • total error count (Ec): the number of triples
    in Sa such that sim(D, D1) < sim(D, D2)
  • similarity error = Ec / |Sa|
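The metric follows directly from the definition above; a minimal sketch (names are mine):

```python
def similarity_error(docs, topics, sim):
    """Fraction of triples (D, D1, D2) -- where D shares a topic with D1
    but none with D2 -- for which sim(D, D1) < sim(D, D2)."""
    total = errors = 0
    for d in docs:
        for d1 in docs:
            if d1 == d or not (topics[d] & topics[d1]):
                continue
            for d2 in docs:
                if d2 in (d, d1) or topics[d] & topics[d2]:
                    continue
                total += 1                      # triple belongs to Sa
                if sim(d, d1) < sim(d, d2):
                    errors += 1                 # counted toward Ec
    return errors / total if total else 0.0

# Toy check: doc 3 is off-topic yet "closer" to doc 1 than doc 2 is,
# so the triple (1, 2, 3) counts as an error.
pair_sims = {frozenset({1, 2}): 0.2, frozenset({1, 3}): 0.9,
             frozenset({2, 3}): 0.1}
sim = lambda a, b: pair_sims[frozenset({a, b})]
topics = {1: {"grain"}, 2: {"grain"}, 3: {"metals"}}
print(similarity_error([1, 2, 3], topics, sim))  # 0.5
```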

Experiment Setup: Baseline
Table 4. Similarity Errors: comparison of
different weighting schemes with the original
(not expanded) documents.
40% improvement from boolean vectors (0s and 1s)
to TF-IDF. We believed achieving a comparable
(another 40%) reduction on top of that would be of
practical importance.
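For reference, the TF-IDF weighting the baseline table compares against boolean vectors can be computed as below. This is a standard tf · log(N/df) variant; the slides do not say which TF-IDF flavor was actually used, so treat it as an illustration.

```python
from math import log

def tfidf(docs):
    """TF-IDF weights for a list of token lists: tf * log(N / df).
    Unlike boolean 0/1 vectors, a term occurring in every document
    (df = N) gets zero weight, so only discriminative terms count."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * log(n / df[t]) for t, f in tf.items()})
    return vectors

vecs = tfidf([["wheat", "prices"], ["wheat", "export"]])
print(vecs[0]["wheat"], round(vecs[0]["prices"], 3))  # 0.0 0.693
```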
Experiment Setup: Control Parameters
  • Ca: the average Euclidean distance between the
    original and the modified document vectors
  • Thresh: co-occurrence based similarity threshold
    (0.1-0.5); all values below it were ignored

Experiment Results: Similarity Error Reduction
Experiment Results: Observations
  • Thresh in 0.2-0.4: effect very stable
  • Thresh < 0.1: effect is not that stable
  • Possible explanation: many non-reliable
    associations are involved in the expansion.
  • Thresh > 0.4: effect depends on Ca
  • Possible explanation: only a few associations are
    used, which requires a large adjustment parameter a
    to achieve the same values of Ca

Experiment Results: Ignoring Context Hint
Experiment Results: Not Using External Corpus
(self-mining)
More Self-Mining for Comparison: Latent Semantic
Indexing. Sensitive to the number of semantic axes; best
effect (10), not exceeding self-mining (above)
More Self-Mining for Comparison: Pseudo Relevance
Feedback. Nc: number of documents to use for PRF; best
effect (20), comparable with self-mining (above)
  • Developed and studied an approach that:
  • Uses an external corpus (the Web)
  • While building the external corpus, takes the target
    collection's context into consideration
  • Tested on similarity computation: an important
    general task that is behind retrieval,
    clustering, etc.
  • Practically significant effect: reduces
    similarity errors by up to 50%
  • Effect much larger than if not using an external
    corpus or not using context

Limitations to address in future research
  • We do not claim that CCSD is better than LSI or
    PRF; they can possibly be extended to be
    applicable to an external corpus as well
  • Larger data sets desired
  • Specific applications to be tested: retrieval
    (have preliminary results; see forthcoming HARD
    and ROBUST TREC 2005), clustering,
    categorization, etc.

  • Corresponding Author