Title: Mining Context Specific Similarity Relationships Using The World Wide Web
1. Mining Context Specific Similarity Relationships Using The World Wide Web
- Weiguo Fan, Department of Information Systems, Virginia Tech
- Dmitri Roussinov, Department of Information Systems, W.P. Carey School of Business, Arizona State University
- Leon J. Zhao, Department of Management Information Systems, University of Arizona
Slides available at http://qa.wpcarey.asu.edu/HLT.ppt
Narrative available at http://qa.wpcarey.asu.edu/HLT.doc
2. Motivation: Inter-Document Similarity Computation
Involved in:
- Document retrieval
- Clustering
- Filtering
- Summarization
- Query by example
- ... and many other human language technologies.
3. Motivation: Vocabulary Mismatch Problem
- Vector space model (Salton, 1983): documents are represented by vectors of terms
- Synonyms or similar words are treated as entirely different features: car and automobile, jacket and coat, etc.
4. Solutions Explored Earlier
- Thesaurus (e.g. WordNet; Miller et al., 1990; Voorhees, 1994): not always effective
  - Possible explanation: incomplete coverage, ignores context
- Co-occurrence similarity mining: earlier works with mixed results (van Rijsbergen, 1977; Minker et al., 1972; Peat and Willett, 1991)
5. Why Earlier Approaches Did Not Always Work
Suggested explanations:
- Small collections: not enough data for reliable similarity mining
- Tested only with document retrieval tasks and short, ambiguous queries
- Simplistic models, e.g. extending boolean queries (jaguar OR auto OR power OR car)
6. More Recent Approaches: Some Success
- Grefenstette, 1994; Church et al., 1991; Hearst et al., 1992; Schutze and Pedersen, 1997; Voorhees, 1994; Roussinov and Zhao, 2003; Kwok et al., 2004
- Questions remain open:
  - Magnitude of the improvement
  - Best expansion/mining models
  - Mine the corpus itself or look for external data (e.g. the Web)?
  - Even if it works in document retrieval, what about the more general case of similarity computation (clustering, summarization, filtering, etc.)?
- Our study is designed to answer those questions
7. What is Done: Context Specific Similarity Discovery
- Context specific: an external corpus is harvested from the Web
- Co-occurrence mining is performed in it
- Document vector representations (from the target collection) are expanded with similar terms
- Tested on Reuters collections
- Improvement is 50% larger than with self-mining (analyzing the target corpus only, using LSI or PRF adaptations)
8. Context Specific Similarity Discovery: Architecture of the Approach
9. Context Specific Similarity Discovery: Steps
- Context Hint string
  - 100 most frequent terms (minus stopwords) to represent the context of the target collection (e.g. Reuters)
- Context Queries
  - each of the 1000 most frequent terms combined with the Context Hint string (next slide)
- Co-occurrence mining is performed: sim(t1, t2)
- Document vector representations are expanded with similar terms (next slides)
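The steps above can be sketched in a few lines of Python. The stopword list, tokenization, and the co-occurrence formula (Jaccard over retrieved snippets here) are illustrative assumptions, not the authors' exact implementation:

```python
from collections import Counter

# Minimal stopword list for illustration; the authors' list is not given.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "for", "on", "said"}

def top_terms(docs, n):
    """Most frequent non-stopword terms across the target collection."""
    counts = Counter(t for d in docs for t in d.lower().split()
                     if t not in STOPWORDS)
    return [t for t, _ in counts.most_common(n)]

def context_queries(docs, n_hint=100, n_query=1000):
    """Pair each frequent term with the Context Hint string, as on the slide."""
    hint = " ".join(top_terms(docs, n_hint))
    return [(term, f"{term} {hint}") for term in top_terms(docs, n_query)]

def cooccurrence_sim(t1, t2, snippets):
    """Illustrative co-occurrence similarity: Jaccard over the snippets
    returned for the context queries. The paper's formula may differ."""
    s1 = {i for i, s in enumerate(snippets) if t1 in s}
    s2 = {i for i, s in enumerate(snippets) if t2 in s}
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0
```

Each query string would then be submitted to a Web search engine, and sim(t1, t2) computed over the returned result snippets.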
10. Context Specific Similarity Discovery: Example of Queries
Context Hint Terms: top 108 most frequent
Some of the 1000 Context Queries for AltaVista; "+" means the word is required
11. Context Specific Similarity Discovery: Document Expansion
- w(t, D) -- the initial (not expanded) weight of the term t in the document D
- w'(t, D) -- the modified weight of the term t in the document D
- t' iterates through all the terms in the document D
- a is the adjustment factor (a control parameter)
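The expansion formula itself is not reproduced on the slide; the sketch below assumes the common additive form w'(t, D) = w(t, D) + a · Σ_{t'≠t} sim(t, t') · w(t', D), with the Thresh cutoff from the later slides applied to sim:

```python
def expand_vector(w, vocab, sim, a=0.5, thresh=0.2):
    """Assumed additive expansion of one document vector.

    w: {term: weight} for the document D (original weights).
    vocab: candidate terms to receive weight, possibly outside D.
    sim(t, t2): co-occurrence similarity in [0, 1].
    Pairs with similarity below thresh are ignored.
    """
    w_new = {}
    for t in vocab:
        boost = 0.0
        for t2, wt in w.items():
            if t2 == t:
                continue  # a term does not boost itself
            s = sim(t, t2)
            if s >= thresh:
                boost += s * wt
        new_weight = w.get(t, 0.0) + a * boost
        if new_weight > 0:
            w_new[t] = new_weight
    return w_new
```

With this form, a synonym absent from the document (e.g. "automobile" for a document containing "car") acquires a nonzero weight proportional to the mined similarity.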
12. Experiment Setup: Collection
- Reuters Collection (Lewis, 1997)
- Newswires labeled with one or more of 78 commodity code topics
- Discarded topics that had only 1 document
- Discarded documents not labeled with any of the remaining topics
- 1841 documents left
- Still plenty of pairs for comparison!
- Porter stemming for indexing
13. Experiment Setup: Metric
- Based on Kruskal-Goodman statistics (Haveliwala et al., 2002)
- Intuition: if documents share topics, they have to be more similar than those that do not. Defining formally:
- Sa: the set of all document triples (D, D1, D2) such that
  - D ≠ D1, D ≠ D2, D1 ≠ D2
  - D shares at least one common category with D1
  - D shares no common categories with D2
- Total error count (Ec): the number of triples in Sa such that sim(D, D1) < sim(D, D2)
- Similarity error = Ec / |Sa|
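The metric above is straightforward to compute directly; this sketch enumerates all qualifying triples (names are illustrative):

```python
from itertools import permutations

def similarity_error(doc_ids, topics, sim):
    """Fraction of qualifying triples (D, D1, D2) ordered incorrectly.

    topics: {doc_id: set of category labels}.
    A triple qualifies when D shares a category with D1 but none with D2;
    it counts as an error when sim(D, D1) < sim(D, D2).
    """
    errors = total = 0
    for d, d1, d2 in permutations(doc_ids, 3):  # all distinct ordered triples
        if topics[d] & topics[d1] and not (topics[d] & topics[d2]):
            total += 1
            if sim(d, d1) < sim(d, d2):
                errors += 1
    return errors / total if total else 0.0
```

Note the cubic growth in the number of triples, which is why even 1841 documents yield plenty of pairs for comparison.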
14. Experiment Setup: Baseline
Table 4. Similarity error comparison of different weighting schemes with the original (not expanded) documents.
40% improvement from boolean vectors (0s and 1s) to TF-IDF. We believed that achieving a comparable (another 40%) reduction on top of that would be of practical importance.
15. Experiment Setup: Control Parameters
- Ca: the average Euclidean distance between the original and the modified document vectors
- Thresh: co-occurrence-based similarity threshold (0.1-0.5); all values below it were ignored
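For concreteness, Ca can be computed as below, treating each document as a sparse {term: weight} dict (an assumed representation; missing terms count as zero):

```python
import math

def average_shift(originals, modified):
    """Ca: mean Euclidean distance between each original document vector
    and its expanded counterpart."""
    def dist(u, v):
        terms = set(u) | set(v)
        return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2
                             for t in terms))
    return sum(dist(u, v) for u, v in zip(originals, modified)) / len(originals)
```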
16. Experiment Results: Similarity Error Reduction
17. Experiment Results: Observations
- Thresh in .2-.4: effect very stable, achieves 50%
- Thresh < .1: effect is not that stable
  - Possible explanation: many unreliable associations are involved in the expansion.
- Thresh > .4: effect depends on Ca, not very reliable
  - Possible explanation: only a few associations are used, which requires a large adjustment parameter a to achieve the same values of Ca
18. Experiment Results: Ignoring the Context Hint
19. Experiment Results: Not Using an External Corpus (Self-Mining)
20. More Self-Mining for Comparison: Latent Semantic Indexing
Sensitive to the number of semantic axes. Best effect (10), not exceeding self-mining (above).
21. More Self-Mining for Comparison: Pseudo Relevance Feedback
Nc: the number of documents to use for PRF. Best effect (20), comparable with self-mining (above).
22. Conclusions
- Developed and studied an approach that:
  - Uses an external corpus (the Web)
  - While building the external corpus, takes the target collection's context into consideration
- Tested on similarity computation: an important general task that underlies retrieval, clustering, etc.
- Practically significant effect: reduces similarity errors by up to 50%
- The effect is much larger than when not using an external corpus or not using context
23. Limitations to Address in Future Research
- We do not claim that CCSD is better than LSI or PRF: they could possibly be extended to apply to an external corpus as well
- Larger data sets are desired
- Specific applications to be tested: retrieval (we have preliminary results, see the forthcoming HARD and ROBUST TREC 2005), clustering, categorization, etc.
24. QUESTIONS?
- Corresponding Author: Dmitri Roussinov
- Dmitri.Roussinov_at_asu.edu
- www.public.asu.edu/droussi