A New Suffix Tree Similarity Measure for Document Clustering

Provided by: nlgCsie

1
A New Suffix Tree Similarity Measure for Document
Clustering

2
INTRODUCTION

To develop a document clustering algorithm that
categorizes the Web documents in an online
community
The Vector Space Document (VSD) model - represents
any document as a feature vector of its words
Suffix tree document model - identifies phrases
that are common to groups of documents
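As a rough illustration of the VSD idea, the following minimal sketch turns each document into a term-frequency vector over a shared vocabulary. The function name `vsd_vectors`, the whitespace tokenization and the three sample documents are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def vsd_vectors(docs):
    """Return (vocabulary, vectors): one term-frequency vector per document."""
    # Assumption: lowercase, whitespace-separated words are the features.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical sample documents, used only for illustration.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
vocab, vectors = vsd_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, vector)
```

Each dimension here is a single word, so word order is lost. Keeping that phrase information is what the suffix tree document model is meant to address.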

3
suffix sub-string
4
Suffix Tree Document Model

5
STC Algorithm (Suffix Tree Clustering)

6
The base cluster graph
7
Problem of STC

The STC algorithm sometimes generates
large clusters of poor quality
It has no quality measure such as tf-idf
It uses no single-link, group-average or complete-link criterion
Solution
Map each node of the suffix tree to a unique
dimension of an M-dimensional space
M = total number of nodes in the suffix tree,
excluding the root node
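The node-to-dimension mapping can be sketched as follows. This is a simplified illustration, not the presenters' implementation: it builds an uncompressed word-level suffix trie as a stand-in for a true (compacted) suffix tree, so its node count will generally be larger than that of the real structure. The function names and sample documents are hypothetical.

```python
def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into a nested-dict trie."""
    root = {}
    for doc in docs:
        words = doc.lower().split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.setdefault(word, {})
    return root


def index_nodes(root):
    """Assign each non-root node a unique dimension, keyed by its phrase from the root."""
    dims = {}
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        for word, child in sorted(node.items()):
            phrase = path + (word,)
            dims[phrase] = len(dims)
            stack.append((phrase, child))
    return dims


# Hypothetical sample documents, used only for illustration.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
dims = index_nodes(build_suffix_trie(docs))
M = len(dims)  # number of non-root nodes, i.e. the dimensionality of the space
print(M)
```

Keying each node by the phrase that leads to it keeps the sketch short; a production suffix tree would use edge-compressed nodes and a linear-time construction instead.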

8
The New Suffix Tree Similarity Measure

Each document d can be represented as a feature
vector of the weights of the M nodes
df(n) = the number of different documents
that have traversed node n
tf(n, d) = the total number of times document
d traverses node n
e.g. df(b) = 3, tf(b, 1) = 1
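The definitions of df(n) and tf(n, d) can be illustrated with the sketch below. It identifies each node with a word-level phrase, which is a simplification of a real suffix tree. The weighting formula `(1 + log tf) * log(1 + N / df)` and the cosine comparison are a common tf-idf setup assumed for this example, because this transcript does not give the weighting formula. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def phrase_counts(doc):
    """tf(n, d): how often document d passes through each phrase node n."""
    words = doc.lower().split()
    counts = Counter()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            counts[tuple(words[start:end])] += 1
    return counts


def node_weights(docs):
    """Return per-document tf, global df, and assumed tf-idf weights per node."""
    tf = [phrase_counts(doc) for doc in docs]
    df = Counter()
    for counts in tf:
        df.update(counts.keys())  # each document counts once per node
    n_docs = len(docs)
    weights = [
        {n: (1 + math.log(t)) * math.log(1 + n_docs / df[n]) for n, t in counts.items()}
        for counts in tf
    ]
    return tf, df, weights


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[n] for n, w in u.items() if n in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical sample documents, used only for illustration.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
tf, df, weights = node_weights(docs)
print(df[("ate",)], tf[0][("ate",)])
print(cosine(weights[0], weights[1]))
```

Because the vectors are sparse, storing only the nodes a document actually traverses avoids materializing all M dimensions.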

9
The New Suffix Tree Similarity Measure

10
A Closer Look at the Suffix Tree Document Model

11
Document Preparing

12
Cluster Topic Summary Generating

13
Cluster Topic Summary Generating

Document quality evaluation
Web documents provide some additional human
assessments for document quality evaluation:
view clicks, reply posts and recommend clicks
The top 10 documents are chosen as the
representatives of the cluster
The nodes traversed by the representative
documents are selected and sorted by their idf in
ascending order. Finally, the top 5 nodes are
selected.
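The selection steps above can be sketched as follows. The slides do not say how view clicks, reply posts and recommend clicks are combined into a quality score, so the plain sum used here is an assumption, as are the idf form `log(N / df)`, the function name `topic_summary` and the dictionary layout of each document.

```python
import math


def topic_summary(cluster_docs, df, n_docs, top_docs=10, top_nodes=5):
    """Pick representative documents, then the lowest-idf nodes they traverse."""
    # Assumed quality score: unweighted sum of the three human assessments.
    ranked = sorted(
        cluster_docs,
        key=lambda d: d["views"] + d["replies"] + d["recommends"],
        reverse=True,
    )
    representatives = ranked[:top_docs]
    nodes = set()
    for doc in representatives:
        nodes.update(doc["nodes"])
    # Ascending idf; the node label breaks ties so the result is deterministic.
    by_idf = sorted(nodes, key=lambda n: (math.log(n_docs / df[n]), n))
    return by_idf[:top_nodes]


# Hypothetical cluster with made-up click counts and node labels.
cluster = [
    {"views": 10, "replies": 2, "recommends": 1, "nodes": {"a", "b"}},
    {"views": 1, "replies": 0, "recommends": 0, "nodes": {"c"}},
]
df = {"a": 4, "b": 2, "c": 1}
print(topic_summary(cluster, df, n_docs=4, top_docs=2, top_nodes=2))
```

Sorting by ascending idf favors nodes shared by many documents, which is why they can serve as a short topic label for the cluster.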

14
EVALUATION

15
Document Collections

OHSUMED Document Collection
8 categories, 800 documents, containing 6,281
distinct words. The average length of the
documents is about 110 words
RCV1 Document Collection
10 groups of documents, containing 19,229
distinct words. The average length of documents
is about 150

16
Results and Discussion
17
Results and Discussion

STC algorithm - it has no effective measure to
evaluate the quality of the clusters during
cluster merging
Thus the STC algorithm seldom generated large
clusters of high quality in the experiments

18
Results and Discussion

23
CONCLUSIONS AND FUTURE WORK

By mapping all nodes in the common
suffix tree into an M-dimensional space of the VSD
model, the advantages of both the VSD model and the
suffix tree model are inherited
The suffix tree similarity measure is very simple,
but the implementation is quite difficult
Time efficiency and space efficiency
Applying the new similarity measure to Chinese
documents
