A New Suffix Tree Similarity Measure for Document Clustering

Provided by: nlgCsie
1
A New Suffix Tree Similarity Measure for Document
Clustering
  • Hung Chim, Xiaotie Deng
  • City University of Hong Kong
  • WWW2007

2
INTRODUCTION
  • To develop a document clustering algorithm to
    categorize the Web documents in an online
    community
  • The Vector Space Document (VSD) - representation
    of any document as a feature vector of the words
  • Suffix tree document model - identifying phrases
    that are common to groups of documents

3
suffix sub-string
4
Suffix Tree Document Model
  • 1. cat ate cheese
  • 2. mouse ate cheese too
  • 3. cat ate mouse too
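
To make the three example documents concrete, here is a minimal sketch that inserts every word-level suffix of each document into a shared tree. It is an uncompressed suffix trie (one word per edge), not the compact suffix tree of the slides, and the names Node, build_suffix_trie and phrases are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


class Node:
    """One trie node; the path from the root spells a phrase."""

    def __init__(self):
        self.children = {}                  # word -> Node
        self.doc_counts = defaultdict(int)  # doc id -> times traversed


def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into one trie."""
    root = Node()
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.doc_counts[doc_id] += 1
    return root


def phrases(node, prefix=()):
    """Yield (phrase, sorted doc ids) for every node below `node`."""
    for word, child in node.children.items():
        phrase = prefix + (word,)
        yield phrase, sorted(child.doc_counts)
        yield from phrases(child, phrase)


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
root = build_suffix_trie(docs)

# Phrases shared by at least two documents.
shared = {" ".join(p): ids for p, ids in phrases(root) if len(ids) > 1}
for phrase, ids in sorted(shared.items()):
    print(phrase, ids)
```

Because the trie is not compressed, "cat" and "cat ate" show up as two separate entries; a compact suffix tree would merge them into a single node.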

5
STC Algorithm (Suffix Tree Clustering)
  • 1. Generating the common suffix tree
  • 2. Selecting base clusters: each base cluster B is
    assigned a score s(B)
  • |B|: the number of documents in B
  • |P|: the number of words in the phrase P
  • 3. Merging clusters
  • Jaccard coefficient
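
The scoring and merging steps can be sketched as follows. The transcript does not reproduce the score formula, so this sketch assumes s(B) = |B| * f(|P|); the shape of f (phrase_weight), its constants, and the 0.5 merge threshold are all assumptions made for illustration, as are the function names.

```python
def phrase_weight(num_words, cap=6):
    """Hypothetical f(|P|): penalize single words, grow linearly, then flatten."""
    if num_words <= 1:
        return 0.5
    return float(min(num_words, cap))


def base_cluster_score(doc_ids, phrase):
    """Assumed score s(B) = |B| * f(|P|)."""
    return len(doc_ids) * phrase_weight(len(phrase.split()))


def jaccard(a, b):
    """Jaccard coefficient of two document-id collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_base_clusters(clusters, threshold=0.5):
    """Link base clusters whose Jaccard coefficient exceeds the threshold,
    then return the connected components as sorted document lists."""
    names = list(clusters)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(clusters[a], clusters[b]) > threshold:
                parent[find(a)] = find(b)

    merged = {}
    for n in names:
        merged.setdefault(find(n), set()).update(clusters[n])
    return sorted(sorted(ids) for ids in merged.values())


base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
print(base_cluster_score(base["ate cheese"], "ate cheese"))
print(merge_base_clusters(base))
```

With a threshold of 0.5, the base cluster for "ate" overlaps strongly with every other base cluster, so everything collapses into one cluster. This is the kind of large, low-quality cluster that the next slide identifies as a weakness of STC.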

6
The base cluster graph
7
Problem of STC
  • STC algorithm sometimes generates some
    large-sized clusters with poor quality
  • No quality measure like tf-idf
  • No single-link, group-average and complete-link
  • Solution
  • mapping each node of the suffix tree to a unique
    dimension of an M-dimensional space
  • M: the total number of nodes in the suffix tree,
    excluding the root node

8
The New Suffix Tree Similarity Measure
  • Each document d can be represented as a feature
    vector of the weights of M nodes
  • df(n): the number of different documents
    that have traversed node n
  • tf(n, d): the total number of times document
    d traverses node n
  • e.g. df(b) = 3, tf(b, 1) = 1
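
The two counts can be checked on the three example documents with a short sketch. It treats every contiguous word sequence as a node of an uncompressed suffix trie, which is a simplification of the compact suffix tree, and node_counts is a hypothetical helper name. In this example "ate" is the only phrase that occurs in all three documents, so it plays the role of node b.

```python
from collections import Counter


def node_counts(docs):
    """Return (tf, df): tf[d][phrase] counts how often document d traverses
    the node for `phrase`; df[phrase] counts the documents that traverse it."""
    tf = {}
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        counts = Counter()
        # Every prefix of every suffix corresponds to one trie node.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                counts[" ".join(words[start:end])] += 1
        tf[doc_id] = counts

    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
tf, df = node_counts(docs)
print(df["ate"], tf[1]["ate"])
```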

9
The New Suffix Tree Similarity Measure
  • tf-idf formula
  • cosine similarity
  • GAHC algorithm (group-average agglomerative
    hierarchical clustering)
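
A minimal sketch of the weighting and similarity steps follows. The transcript does not reproduce the tf-idf formula, so the weight w(n, d) = (1 + log tf(n, d)) * log(1 + N / df(n)) used here is an assumption (one common tf-idf variant), and the function names and toy counts are hypothetical.

```python
import math


def tfidf_vector(tf_counts, df, num_docs):
    """Sparse weight vector for one document, keyed by node.
    Assumed weight: (1 + log tf) * log(1 + N / df)."""
    return {
        node: (1 + math.log(t)) * math.log(1 + num_docs / df[node])
        for node, t in tf_counts.items()
        if t > 0
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[node] for node, w in u.items() if node in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy counts for two documents (illustrative values only).
df = {"cat": 1, "ate": 2, "mouse": 1}
v1 = tfidf_vector({"cat": 1, "ate": 1}, df, num_docs=2)
v2 = tfidf_vector({"ate": 1, "mouse": 1}, df, num_docs=2)
print(cosine(v1, v2))
```

The resulting pairwise similarities are what a group-average agglomerative clustering would consume when deciding which clusters to merge.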

10
A Closer Look at the Suffix Tree Document Model
  • Efficiency Analysis
  • constructing the suffix tree naively: O(m^2)
  • Ukkonen's paper provided an algorithm to build a
    suffix tree in O(m)
  • Stopword or Stopnode
  • Words in the stoplist affect the score s(B) of a
    base cluster
  • stopnode: a node with a high df can be ignored

11
Document Preparing
  • 1. combine all posts of the same thread into a
    single document
  • 2. all non-word tokens are stripped
  • 3. all stopwords are identified and removed
  • 4. Porter stemming algorithm is applied
  • 6. the posts containing at least 3 distinct words
    are selected
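
The preparation steps above can be sketched as a small pipeline. The stoplist is a tiny hypothetical sample, the stemmer is left as a pluggable function that defaults to the identity (a real Porter stemmer would be passed in), and the distinct-word filter is applied here to the combined thread document, which is an assumption about how that step is used.

```python
import re

# Hypothetical, deliberately tiny stoplist for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def prepare(thread_posts, stem=lambda word: word, min_distinct=3):
    """Turn the posts of one thread into a list of tokens, or None if the
    result has fewer than `min_distinct` distinct words."""
    text = " ".join(thread_posts)                        # combine posts
    tokens = re.findall(r"[a-z]+", text.lower())         # strip non-word tokens
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    tokens = [stem(t) for t in tokens]                   # stemming hook
    return tokens if len(set(tokens)) >= min_distinct else None


print(prepare(["The cat ate the cheese!", "A mouse, too?"]))
print(prepare(["The cat."]))
```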

12
Cluster Topic Summary Generating
  • generating a topic summary involves two important
    information retrieval tasks
  • 1. ranking the documents in a cluster by a
    quality score
  • 2. extracting common phrases as the topic summary

13
Cluster Topic Summary Generating
  • Document quality evaluation
  • Web documents provide some additional human
    assessments for the document quality evaluation
  • view clicks, reply posts and recommend clicks
  • top 10 documents as the representatives of the
    cluster
  • the nodes traversed by the representative
    documents are selected and sorted by their idf in
    ascending order. Finally, the top 5 nodes are
    selected.
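
The selection procedure can be sketched as below. The quality scores are taken as given inputs because the transcript does not say how view clicks, reply posts and recommend clicks are combined; the idf form log(1 + N / df) and the name topic_summary are likewise assumptions for illustration.

```python
import math


def topic_summary(quality, doc_nodes, df, num_docs, top_docs=10, top_nodes=5):
    """Pick the highest-quality documents as representatives, then return
    the nodes they traverse with the lowest idf (ties broken by name)."""
    representatives = sorted(quality, key=quality.get, reverse=True)[:top_docs]
    nodes = set()
    for doc_id in representatives:
        nodes |= doc_nodes[doc_id]

    def idf(node):
        return math.log(1 + num_docs / df[node])  # assumed idf form

    return sorted(nodes, key=lambda n: (idf(n), n))[:top_nodes]


# Toy inputs (illustrative values only).
quality = {1: 5.0, 2: 3.0, 3: 1.0}
doc_nodes = {
    1: {"ate", "cat ate", "cheese"},
    2: {"ate", "mouse", "too"},
    3: {"ate", "too"},
}
df = {"ate": 3, "cat ate": 2, "cheese": 2, "mouse": 2, "too": 2}
print(topic_summary(quality, doc_nodes, df, num_docs=3, top_docs=2, top_nodes=2))
```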

14
EVALUATION
  • cluster set C = {C1, C2, ..., Ck}
  • cluster
  • Recall (i, j)
  • Precision (i, j)
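
A minimal sketch of the two measures for a reference class i and a generated cluster j, using the standard definitions Recall(i, j) = n_ij / n_i and Precision(i, j) = n_ij / n_j, where n_ij is the number of documents shared by both. The F-measure that combines them is added here as an assumption, since the transcript lists only recall and precision.

```python
def precision_recall_f(class_docs, cluster_docs):
    """Return (precision, recall, F) for one class and one cluster,
    each given as a collection of document ids."""
    class_docs, cluster_docs = set(class_docs), set(cluster_docs)
    overlap = len(class_docs & cluster_docs)
    recall = overlap / len(class_docs) if class_docs else 0.0
    precision = overlap / len(cluster_docs) if cluster_docs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


precision, recall, f_measure = precision_recall_f({1, 2, 3, 4}, {3, 4, 5})
print(precision, recall, f_measure)
```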

15
Document Collections
  • OHSUMED Document Collection
  • 8 categories, 800 documents, containing 6,281
    distinct words. The average document length
    is about 110 words
  • RCV1 Document Collection
  • 10 groups of documents, containing 19,229
    distinct words. The average document length
    is about 150 words

16
Results and Discussion
17
Results and Discussion
  • STC algorithm - there is no effective measure to
    evaluate the quality of the clusters during the
    cluster merging
  • Thus the STC algorithm seldom generated large
    clusters of high quality in the experiments

18
Results and Discussion
  • DS3 document

23
CONCLUSIONS AND FUTURE WORK
  • By mapping all nodes of the common suffix tree
    into an M-dimensional space of the VSD model,
    the advantages of both the VSD model and the
    suffix tree model are inherited
  • the suffix tree similarity measure is very simple,
    but the implementation is quite difficult
  • time efficiency and space efficiency
  • Applying the new similarity measure to Chinese
    documents