Semantic Smoothing of Document Models for Agglomerative Clustering


1
Semantic Smoothing of Document Models for
Agglomerative Clustering
  • Xiaohua Zhou, Xiaodan Zhang, Tony Hu
  • College of Information Science and Technology
    Drexel University, USA

2
Agglomerative Clustering
  • Algorithm Overview
  • Initially, assign each document to its own
    cluster, then repeatedly merge the pair of most
    similar clusters until only one cluster is left.
  • The core of the algorithm is computing pairwise
    document similarities (see the sketch below).
  • Cosine similarity and Euclidean distance are
    frequently used.
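A minimal sketch of this procedure in Python (illustrative only: the term-frequency vectors and the cosine similarity are assumptions, while complete linkage matches the experimental setup on slide 15):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two term-frequency vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def agglomerative(docs, sim=cosine_sim):
    """Repeatedly merge the most similar pair of clusters.

    docs: list of term-frequency vectors (np.ndarray).
    Returns the merge history as (cluster_i, cluster_j) pairs.
    """
    clusters = {i: [i] for i in range(len(docs))}  # each doc in its own cluster
    history = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        ids = list(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                i, j = ids[x], ids[y]
                # Complete linkage: score a cluster pair by its
                # *least* similar pair of member documents.
                s = min(sim(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, best_pair = s, (i, j)
        i, j = best_pair
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
        history.append((i, j))
    return history
```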

3
Where is the problem?
  • Density of Topic-free General Words
  • An extreme example is stop words.
  • Such words are assigned high probability (or a
    high score) but contribute nothing to the
    clustering task.
  • Any document pair could be considered similar for
    clustering because the documents share many
    common words (Steinbach et al., 2000).
  • We need to discount the effect of these general
    words, the same idea as the TF-IDF weighting
    scheme.

4
Where is the problem?
  • Sparsity of Topic-specific Words

I am looking for any information about the space
program. This includes NASA, the shuttles,
history, anything! I would like to know if
anyone could suggest books, periodicals, even ftp
sites for a novice who is interested in the space
program.
The Phobos mission did return some useful data,
including images of Phobos itself. By the way, the
new book entitled "Mars" (Kieffer et al., 1992,
University of Arizona Press) has a great chapter
on spacecraft exploration of the planet. The
chapter is co-authored by V.I. Moroz of the Space
Research Institute in Moscow, and includes
details never before published in the West.
(From 20-Newsgroups)
5
Existing Solutions
  • Density of general words
  • Removing stop words
  • Using TF-IDF scores
  • Term reweighting techniques
  • Sparsity of topic-specific words
  • Ontology-based term similarity
  • Problems of existing solutions
  • All these approaches are heuristic.
  • An ontology is often unavailable, or its
    coverage is very limited.

6
Language Modeling Approach
  • Agglomerative Clustering
  • Assume each document is generated by a language
    model.
  • The pairwise document similarity is then defined
    on the corresponding document models (i.e., by
    their KL-divergence; sketched below).
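A minimal sketch of that model-to-model comparison, assuming both unigram models are already smoothed so that every word has nonzero probability (how the smoothing is done is the subject of the next slides):

```python
import numpy as np

def neg_kl_divergence(p, q):
    """Negative KL-divergence -KL(p || q) between two smoothed unigram
    document models, given as probability vectors over a shared
    vocabulary. Higher (closer to zero) means more similar models."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(p / q))
```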

7
Jelinek-Mercer Smoothing
  • The document model is smoothed by the corpus
    model (a simple language model):
    p(w|d) = (1 − λ) · c(w, d)/|d| + λ · p(w|C)
  • Discounts general words
  • Partially solves the data sparsity problem

c(w, d) is the count of word w in document d, |d|
is the document length, and C denotes the corpus
(Zhai and Lafferty 2001).
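A minimal sketch of this interpolation (the value of λ is a placeholder; in practice it is tuned empirically):

```python
import numpy as np

def jm_smooth(counts, corpus_probs, lam=0.5):
    """Jelinek-Mercer smoothing of a document model.

    counts: term-count vector c(w, d) for one document.
    corpus_probs: background model p(w|C) over the same vocabulary.
    lam: interpolation weight of the corpus model (placeholder value).
    Returns p(w|d) = (1 - lam) * c(w, d)/|d| + lam * p(w|C).
    """
    counts = np.asarray(counts, float)
    p_ml = counts / counts.sum()  # maximum-likelihood estimate c(w, d)/|d|
    return (1.0 - lam) * p_ml + lam * np.asarray(corpus_probs, float)
```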
8
Semantic Smoothing
  • Descriptions
  • Like the statistical translation model (Berger
    and Lafferty 1999), term semantic relationships
    are used for model smoothing.
  • Unlike the statistical translation model,
    contextual and sense information are considered
  • Decompose a document into a set of
    context-sensitive multiword phrases and then
    statistically translate phrases into individual
    words.

9
Semantic Smoothing Model
  • Linearly interpolate the phrase-based translation
    model with a simple language model:
    p(w|d) = (1 − λ) · p_b(w|d) + λ · Σk p(w|tk) · p_ml(tk|d)

where the translation coefficient λ controls the
influence of the translation component in the
mixture model, p(w|tk) is the probability of
translating topic signature tk to word w, and
p_ml(tk|d) = c(tk, d) / Σi c(ti, d), with c(ti, d)
the frequency of topic signature ti in document d.
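A minimal sketch of the mixture, assuming the translation probabilities p(w|tk) have already been estimated (slides 12-13) and a baseline model p_b(w|d) such as the Jelinek-Mercer estimate is available:

```python
import numpy as np

def semantic_smooth(sig_counts, trans_probs, baseline_probs, lam=0.6):
    """Phrase-based semantic smoothing of a document model.

    sig_counts: vector c(t_k, d) of topic-signature (phrase)
                frequencies in document d.
    trans_probs: matrix T with T[k, w] = p(w | t_k).
    baseline_probs: baseline model p_b(w|d), e.g. JM-smoothed.
    lam: translation coefficient (0.6 is a placeholder value).
    """
    c = np.asarray(sig_counts, float)
    p_ml_t = c / c.sum()                    # p_ml(t_k | d)
    p_t = p_ml_t @ np.asarray(trans_probs)  # sum_k p(w|t_k) p_ml(t_k|d)
    return (1.0 - lam) * np.asarray(baseline_probs, float) + lam * p_t
```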
10
Semantic Translation Example
11
Semantic Smoothing Example
  • Doc2
  • The Phobos mission did return some useful data,
    including images of Phobos itself. By the way,
    the new book entitled "Mars" (Kieffer et al.,
    1992, University of Arizona Press) has a great
    chapter on spacecraft exploration of the planet.
    The chapter is co-authored by V.I. Moroz of the
    Space Research Institute in Moscow, and includes
    details never before published in the West.
  • Doc1
  • I am looking for any information about the space
    program. This includes NASA, the shuttles,
    history, anything! I would like to know if
    anyone could suggest books, periodicals, even ftp
    sites for a novice who is interested in the space
    program.
  • Doc3
  • ROCKET LAUNCH OBSERVED! A bright light phenomenon
    was observed in the Eastern Finland on April 21.
    I don't know if there were satellite launches in
    Plesetsk Cosmodrome near Arkhangelsk, but this
    may be a rocket experiment too.

12
Translation Probability Estimate
  • Method
  • Use co-occurrence counts of each multiword phrase
    with individual words.
  • Use a mixture model to remove noise from
    topic-free general words:
    p(w|Dk) = (1 − α) · p(w|θk) + α · p(w|C)

Figure 1. Illustration of document indexing. Vt,
Vd, and Vw are the phrase set, document set, and
word set, respectively.
Dk denotes the set of documents containing the
phrase tk. The parameter α is the coefficient
controlling the influence of the corpus model in
the mixture model.
13
Translation Probability Estimate
  • Log likelihood of generating Dk:
    log p(Dk) = Σw c(w, Dk) · log[(1 − α) · p(w|θk) + α · p(w|C)]
  • EM for estimation:
    E-step: p(n)(zw = θk) = (1 − α) · p(n)(w|θk) /
            [(1 − α) · p(n)(w|θk) + α · p(w|C)]
    M-step: p(n+1)(w|θk) = c(w, Dk) · p(n)(zw = θk) /
            Σw c(w, Dk) · p(n)(zw = θk)

where c(w, Dk) is the document frequency of term w
in Dk, i.e., the co-occurrence count of w and tk in
the whole collection.
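A minimal sketch of this estimation, assuming a fixed background coefficient α and co-occurrence counts c(w, Dk) collected beforehand:

```python
import numpy as np

def estimate_translation_model(cooc_counts, corpus_probs,
                               alpha=0.5, n_iter=30):
    """EM estimate of the translation model p(w | t_k).

    Words in D_k are generated either by the topic model p(w|theta_k)
    or by the corpus model p(w|C); EM separates the two.

    cooc_counts: vector c(w, D_k) of co-occurrence counts of each
                 word w with phrase t_k over the whole collection.
    corpus_probs: background model p(w|C).
    alpha: weight of the corpus model (placeholder value).
    """
    c = np.asarray(cooc_counts, float)
    p_bg = np.asarray(corpus_probs, float)
    p_topic = c / c.sum()  # initialize with the ML estimate
    for _ in range(n_iter):
        # E-step: posterior that each word came from the topic model.
        num = (1.0 - alpha) * p_topic
        post = num / (num + alpha * p_bg)
        # M-step: re-estimate the topic model from expected counts.
        weighted = c * post
        p_topic = weighted / weighted.sum()
    return p_topic  # p(w | t_k) after noise removal
```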
14
Phrase Extraction
  • Phrase Dictionary
  • Use Xtract (Smadja 1993) to learn a phrase
    dictionary.
  • Phrase Extraction
  • Extract phrases from documents using exact string
    matching against the dictionary (sketched below).
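A minimal sketch of the extraction step only (the dictionary itself would come from Xtract; greedy longest-match is an assumption about the matching strategy):

```python
def extract_phrases(tokens, phrase_dict, max_len=4):
    """Greedy longest-match extraction of dictionary phrases.

    tokens: the document as a list of word tokens.
    phrase_dict: set of known multiword phrases (tuples of words).
    """
    phrases, i = [], 0
    while i < len(tokens):
        # Try the longest candidate window starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i:i + n])
            if cand in phrase_dict:
                phrases.append(" ".join(cand))
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts here
    return phrases
```

For example, with phrase_dict = {("space", "program")}, a simple tokenization of Doc1 from slide 11 would yield "space program" twice.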

15
Experiment Settings
  • Agglomerative clustering
  • Complete linkage
  • Evaluation criteria
  • Normalized mutual information (NMI; Banerjee and
    Ghosh, 2002); see the sketch below
  • Entropy (Steinbach et al., 2000)
  • Purity (Zhao and Karypis, 2001)
  • Experiment Design
  • Randomly create testing collections: 100
    documents are randomly selected for each class.
  • Execute 5 runs for each collection and average
    the results.
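For reference, NMI can be computed with scikit-learn (a present-day convenience, not the tooling used in this work; note that its default normalization may differ from the Banerjee and Ghosh variant):

```python
from sklearn.metrics import normalized_mutual_info_score

# Toy example: gold class labels vs. cluster assignments.
true_labels = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 1, 2, 2, 2]

# Prints a value in [0, 1]; 1 means the clustering
# reproduces the class structure exactly.
print(normalized_mutual_info_score(true_labels, clusters))
```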

16
Statistics of Three Datasets
Table 1. Statistics of three datasets
17
Agglomerative Clustering
Table 2. NMI results of the agglomerative
hierarchical clustering with the complete linkage
criterion. JM and Semantic denote
Jelinek-Mercer smoothing and semantic smoothing,
respectively. A mark after a scheme name means
stop words are not removed. The translation
coefficient λ is trained on TDT2.
18
Effect of Document Smoothing
Figure 2. The variation of cluster quality with
the translation coefficient (λ), which controls
the influence of semantic smoothing.
19
Comparison to K-Means
Table 3. A mark after a scheme name means stop
words are not removed. Agglomerative clustering
with semantic smoothing is comparable to standard
K-Means clustering.
20
Summary
  • Proposed a context-sensitive semantic smoothing
    method that statistically translates multiword
    phrases into individual terms.
  • Semantic smoothing not only discounts general
    words, but also addresses the data sparsity
    problem very well.
  • Semantic smoothing is much more effective than
    other schemes for agglomerative clustering, where
    data sparsity is the major problem.
  • Whether stop words are removed or not has no
    effect on TF-IDF, background smoothing, and
    semantic smoothing, but a significant effect on
    other schemes.

21
Future Work
  • How to optimize the translation coefficient (λ)
  • Alternative translation intermediates (e.g., word
    pairs, concept pairs)
  • Apply semantic document smoothing to other
    applications such as text retrieval, text
    summarization, and text classification.

22
Dragon Toolkit
  • Descriptions
  • Text retrieval and mining toolkit
  • Written in Java
  • Used for this work
  • Phrase extraction
  • Phrase-word translation probability estimates
  • Clustering
  • Download
  • http://www.ischool.drexel.edu/dmbio/dragontool
  • Search Google with the keywords "dragon toolkit"

23
Questions/Comments?