Clustering for web documents presentation

1
Clustering for web documents

2
Contents

Cluto
Criterion Functions for Document Clustering: Experiments and Analysis (2002),
by Ying Zhao and George Karypis, Department of Computer Science,
University of Minnesota, Minneapolis, MN 55455
Feature selection for web documents (2004)

3
Cluto

Clustering Toolkit, version 2.1.1
Department of Computer Science, University of Minnesota, Minneapolis
http://www-users.cs.umn.edu/~karypis/
Platforms
Linux 2.4.18
SunOS 5.7
Win32
Programs
CLUTO's user-callable library
vcluster
scluster

4
Cluto

What is Cluto? (1/2)
Clustering algorithms
partitional clustering
agglomerative clustering
graph-partitioning clustering
Clustering criterion functions
provides seven different criterion functions that can be used with both
partitional and agglomerative clustering algorithms
provides some of the more traditional local criteria (e.g., single-link,
complete-link, and UPGMA) for agglomerative clustering

5
Cluto

What is Cluto? (2/2)
Analyzes the discovered clusters
relations between the objects assigned to each cluster
relations between the different clusters
identifies the features that best describe and/or discriminate each cluster
helps in understanding the relationships between the clusters, objects, and
features
Operates on very large datasets, in terms of both
the number of objects
the number of dimensions

6
Cluto

Programs
vcluster
operates in the objects' feature space
scluster
operates in the objects' similarity space
Interface
vcluster [optional parameters] MatrixFile Ncluster
MatrixFile: an n × m matrix whose rows correspond to objects and whose
columns correspond to features
Ncluster: the number of clusters

7
Cluto

Parameters of Algorithms (-clmethod)
rb, rbr
the k clusters are computed by k-1 repeated bisections (rbr also optimizes
the criterion function globally at the end)
direct
finds all k clusters simultaneously
agglo
uses the agglomerative paradigm
graph
uses a nearest-neighbor graph of the objects
bagglo

8
Cluto

Parameters of the similarity function
cos: the cosine function (default)
corr: the correlation coefficient
dist: the Euclidean distance; applicable only when -clmethod=graph
jacc: the extended Jaccard coefficient; applicable only when -clmethod=graph

9
Cluto

Parameters of the criterion function
i1, i2, e1, g1, g1p, h1, h2

10
Cluto

Parameters of the criterion function (continued)
slink: single link
wslink: weighted single link
clink: complete link
wclink: weighted complete link
upgma: UPGMA
cstype
fulltree
rowmodel, colmodel
showfeatures
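
As a rough illustration of how the parameters on the preceding slides fit together, the sketch below assembles a vcluster command line in Python. The helper name build_vcluster_command and the file name docs.mat are hypothetical, and the option spellings (-clmethod, -sim, -crfun in -name=value form) are assumptions that should be checked against the CLUTO manual for the installed version. The function only builds the argument list; it does not run anything.

```python
def build_vcluster_command(matrix_file, n_clusters,
                           clmethod="rb", sim="cos", crfun="i2"):
    """Return an argument list for a vcluster run (hypothetical helper)."""
    methods = {"rb", "rbr", "direct", "agglo", "graph", "bagglo"}
    similarities = {"cos", "corr", "dist", "jacc"}
    if clmethod not in methods:
        raise ValueError(f"unknown clustering method: {clmethod}")
    if sim not in similarities:
        raise ValueError(f"unknown similarity function: {sim}")
    # The slides state that dist and jacc apply only to the graph method.
    if sim in {"dist", "jacc"} and clmethod != "graph":
        raise ValueError(f"-sim={sim} requires -clmethod=graph")
    return [
        "vcluster",
        f"-clmethod={clmethod}",
        f"-sim={sim}",
        f"-crfun={crfun}",
        matrix_file,
        str(n_clusters),
    ]


# Example: 20 clusters by repeated bisections with refinement and h2.
command = build_vcluster_command("docs.mat", 20, clmethod="rbr", crfun="h2")
print(" ".join(command))
```

The resulting list could be passed to subprocess.run once the option names have been confirmed.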

12

Criterion Functions for Document Clustering
Experiments and Analysis (2002)
by Ying Zhao and George Karypis Department of
Computer Science, University of Minnesota,
Minneapolis, MN 55455

13
Data Clustering: A Review
A.K. Jain (Michigan State University), M.N. Murty (Indian Institute of
Science), and P.J. Flynn (The Ohio State University)
ACM Computing Surveys
14
Introduction(1/2)

Clustering algorithms
Agglomerative algorithms
UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
Partitional algorithms
K-means, K-medoids, Autoclass, graph-partitioning-based,
spectral-partitioning-based
well suited to large datasets because they are fast
Seven criterion functions
measure intra-cluster similarity, inter-cluster similarity, or a combination
of the two: i1, i2, e1, g1, g1p, h1, h2

15
Introduction(2/2)

Datasets
15 different data sets

16
Preliminaries(1/3)

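Document Representation
use the vector space model for each document
$d_{tf} = (tf_1, tf_2, \ldots, tf_m)$, where d is a document, tf is term
frequency, and $tf_i$ is the frequency of the i-th term in the doc
use idf or tf-idf weighting
$d_{tfidf} = (tf_1 \log(N/df_1), tf_2 \log(N/df_2), \ldots, tf_m \log(N/df_m))$,
where N is the total number of documents and $df_i$ is the number of docs
that contain the i-th term
Similarity Measures
the similarity between two docs $d_i$, $d_j$
Cosine function
$\cos(d_i, d_j) = \frac{d_i^t d_j}{\|d_i\| \, \|d_j\|}$
normalize the length of each doc vector so that $\|d\| = 1$; then
$\cos(d_i, d_j) = d_i^t d_j$
1 means identical, 0 means nothing in common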
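
The document representation and cosine measure on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the presenters' code: the function names and the toy term-frequency matrix are made up, and the idf weight is taken to be log(N / df_i), where df_i is the number of docs containing term i.

```python
import numpy as np


def tfidf_unit_vectors(tf):
    """Weight a (docs x terms) term-frequency matrix by idf and scale rows to unit length."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[0]
    df = np.count_nonzero(tf, axis=0)            # docs containing each term
    idf = np.log(n_docs / np.maximum(df, 1))     # guard against unused terms
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # leave all-zero docs untouched
    return weighted / norms


def cosine(di, dj):
    """Cosine similarity of two unit-length vectors is just their dot product."""
    return float(di @ dj)


# Toy corpus: docs 0 and 1 use the same terms in the same proportions,
# doc 2 shares no terms with them.
tf = [[2, 1, 0, 0],
      [4, 2, 0, 0],
      [0, 0, 3, 1]]
docs = tfidf_unit_vectors(tf)
print(cosine(docs[0], docs[1]), cosine(docs[0], docs[2]))
```

Docs 0 and 1 point in the same direction, so their similarity is 1 up to rounding, while doc 2 has similarity 0 with both.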

17
Preliminaries(2/3)

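Euclidean function
$dis(d_i, d_j) = \sqrt{(d_i - d_j)^t (d_i - d_j)} = \|d_i - d_j\|$
if dis = 0, the docs are identical; if dis = $\sqrt{2}$, they have nothing in
common (for unit-length vectors)
Definitions
S: the set of documents
$S_1, S_2, \ldots, S_k$: the sets of documents in each of the k clusters
k: the number of clusters
$n_1, n_2, \ldots, n_k$: the sizes (numbers of docs) of the corresponding
clusters
A: a set of docs
composite vector $D_A = \sum_{d \in A} d$: the sum of all doc vectors in A
centroid vector $C_A = D_A / |A|$: the average of the term weights of the
docs in A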
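
For unit-length document vectors, the two ends of the Euclidean scale follow directly from the cosine measure. A short derivation, assuming $\|d_i\| = \|d_j\| = 1$:

```latex
\begin{align*}
dis(d_i, d_j)^2 &= (d_i - d_j)^t (d_i - d_j) \\
                &= \|d_i\|^2 + \|d_j\|^2 - 2\, d_i^t d_j \\
                &= 2 - 2\cos(d_i, d_j)
\end{align*}
```

So a cosine of 1 gives a distance of 0, and a cosine of 0 gives a distance of $\sqrt{2}$.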

18
Preliminaries(3/3)

Vector Properties
$S_i$, $S_j$: two sets of docs containing $n_i$, $n_j$ documents
$D_i$, $D_j$: the composite vectors; $C_i$, $C_j$: the centroid vectors
The sum of the pairwise similarities between the docs in $S_i$ and the docs
in $S_j$ is $D_i^t D_j$
The sum of the pairwise similarities between the docs in $S_i$ is $\|D_i\|^2$
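
The two composite-vector properties are easy to check numerically. The sketch below uses random unit-length vectors as stand-in documents (the data and helper name are illustrative only) and compares the explicit sums of pairwise cosine similarities with the dot products of the composite vectors.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_unit_docs(n_docs, n_terms):
    """Random non-negative document vectors scaled to unit length."""
    x = rng.random((n_docs, n_terms))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


Si = random_unit_docs(4, 6)
Sj = random_unit_docs(3, 6)

# Composite vectors: the sum of the document vectors in each set.
Di = Si.sum(axis=0)
Dj = Sj.sum(axis=0)

# Explicit pairwise sums (the within-set sum includes each doc paired with itself).
pair_sum_between = sum(float(a @ b) for a in Si for b in Sj)
pair_sum_within = sum(float(a @ b) for a in Si for b in Si)

print(np.isclose(pair_sum_between, Di @ Dj))   # sum over Si x Sj equals Di^t Dj
print(np.isclose(pair_sum_within, Di @ Di))    # sum over Si x Si equals ||Di||^2
```

This is why the criterion functions can be evaluated from composite vectors alone, without visiting every pair of documents.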

19
Criterion Functions(1/5)

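Internal Criterion Functions
I1: maximize the sum of the average pairwise similarities between the docs
assigned to each cluster, weighted by cluster size
uses the cosine function:
$I_1 = \sum_{r=1}^{k} n_r \left( \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \cos(d_i, d_j) \right) = \sum_{r=1}^{k} \frac{\|D_r\|^2}{n_r}$
similar to the group-average heuristic that hierarchical agglomerative
clustering uses to decide which clusters to merge
I2: maximize the similarity between each doc and the centroid of its cluster;
this is the vector-space variant of the K-means algorithm
uses the cosine function:
$I_2 = \sum_{r=1}^{k} \sum_{d_i \in S_r} \cos(d_i, C_r) = \sum_{r=1}^{k} \|D_r\|$
$C_r$: the centroid vector of cluster r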
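
A minimal sketch of the two internal criterion functions, assuming the closed forms from the Zhao and Karypis paper: I1 is the sum over clusters of ||D_r||^2 / n_r and I2 is the sum of ||D_r||, where D_r is the composite vector of cluster r. The function names and the four-document toy data are illustrative, the document vectors are assumed to be unit length, and every cluster is assumed to be non-empty.

```python
import numpy as np


def composite_vectors(X, labels, k):
    """Sum the document vectors assigned to each of the k clusters."""
    return np.array([X[labels == r].sum(axis=0) for r in range(k)])


def i1(X, labels, k):
    """I1: sum over clusters of ||D_r||^2 / n_r (larger is better)."""
    D = composite_vectors(X, labels, k)
    sizes = np.bincount(labels, minlength=k)
    return float(np.sum(np.sum(D * D, axis=1) / sizes))


def i2(X, labels, k):
    """I2: sum over clusters of ||D_r|| (larger is better)."""
    D = composite_vectors(X, labels, k)
    return float(np.sum(np.linalg.norm(D, axis=1)))


# Two obvious groups of unit-length documents.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = np.array([0, 0, 1, 1])    # groups kept together
mixed = np.array([0, 1, 0, 1])   # groups split across clusters

print(i1(X, good, 2), i1(X, mixed, 2))
print(i2(X, good, 2), i2(X, mixed, 2))
```

Both functions score the clustering that keeps similar documents together higher than the mixed one.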

20
Criterion Functions (2/5)

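External Criterion Functions. E1, E2
optimize a function based on how different the clusters are from each other
E1 is derived so that the centroid vectors of the different clusters are as
orthogonal as possible to the centroid vector of the entire collection:
minimize $E_1 = \sum_{r=1}^{k} n_r \cos(C_r, C) = \sum_{r=1}^{k} n_r \frac{D_r^t D}{\|D_r\| \, \|D\|}$
C: the centroid vector of the entire collection
D: the composite vector of the entire collection
$1/\|D\|$ is constant, so it can be dropped from the minimization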

21
Criterion Functions (3/5)

An external criterion can also be defined with the Euclidean distance
function.
Hybrid Criterion Functions. H1, H2
maximize the similarity of the docs within each cluster while minimizing the
similarity between each cluster's docs and the entire collection
H1: combines the criterion functions I1 and E1; maximize $H_1 = I_1 / E_1$

22
Criterion Functions (4/5)

H2: combines the criterion functions I2 and E1; maximize $H_2 = I_2 / E_1$
Graph-Based Criterion Functions
an alternative way to view the relations between docs is to use graphs
G1: the graph is built from the pairwise similarities between the docs
G2: the graph is built from the pairwise similarities between the docs and
the terms
S: the given collection of n docs
$G_s$: the similarity graph
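
The external and hybrid criteria can be sketched in the same style. The code assumes the forms given in the Zhao and Karypis paper: E1 is the size-weighted sum of cosines between each cluster centroid and the collection centroid (to be minimized), H1 = I1 / E1 and H2 = I2 / E1 (to be maximized). The constant 1/||D|| is kept in E1 here, which does not change the ranking of solutions. Function names and the toy data are illustrative, document vectors are assumed unit length, and clusters are assumed non-empty.

```python
import numpy as np


def cluster_stats(X, labels, k):
    """Composite vector and size of each cluster, plus the collection composite."""
    D_r = np.array([X[labels == r].sum(axis=0) for r in range(k)])
    n_r = np.bincount(labels, minlength=k)
    return D_r, n_r, X.sum(axis=0)


def e1(X, labels, k):
    """E1: sum of n_r * cos(C_r, C), computed from composite vectors (smaller is better)."""
    D_r, n_r, D = cluster_stats(X, labels, k)
    cosines = (D_r @ D) / (np.linalg.norm(D_r, axis=1) * np.linalg.norm(D))
    return float(np.sum(n_r * cosines))


def h1(X, labels, k):
    """H1 = I1 / E1, with I1 = sum ||D_r||^2 / n_r (larger is better)."""
    D_r, n_r, _ = cluster_stats(X, labels, k)
    i1 = np.sum(np.sum(D_r * D_r, axis=1) / n_r)
    return float(i1 / e1(X, labels, k))


def h2(X, labels, k):
    """H2 = I2 / E1, with I2 = sum ||D_r|| (larger is better)."""
    D_r, _, _ = cluster_stats(X, labels, k)
    i2 = np.sum(np.linalg.norm(D_r, axis=1))
    return float(i2 / e1(X, labels, k))


X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = np.array([0, 0, 1, 1])    # similar docs kept together
mixed = np.array([0, 1, 0, 1])   # similar docs split apart

print(e1(X, good, 2), e1(X, mixed, 2))
print(h1(X, good, 2), h1(X, mixed, 2))
print(h2(X, good, 2), h2(X, mixed, 2))
```

On this toy data the well-separated clustering has the lower E1 and the higher H1 and H2, matching the intent of each criterion.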

23
Criterion Functions (5/5)

26
Experimental Results

Direct k-way Clustering

29
Data Sets

the Natural Science category in the Naver directory
(http://dir.naver.com)
6 subcategories in the corpus
1,215 docs, 17,223 terms, 20 clusters,
5 features per doc, idf weighting

Sub Category     No. of Docs    Sub Category     No. of Docs
Physics          102            Earth science    149
Biology          426            Astronomy        323
Mathematics      102            Chemistry        113
Total            1,215
30
Experimental parameters

Algorithms
rb, rbr
the k clusters are computed by k-1 repeated bisections (rbr also optimizes
the criterion function globally at the end)
direct
finds all k clusters simultaneously
agglo
uses the agglomerative paradigm
graph
uses a nearest-neighbor graph of the objects

31
Experimental parameters

Criterion Functions
i1, i2, e1, g1, g1p, h1, h2, clink, slink
Similarity Functions
cosine measure

32
Experimental results

Entropy

         rb      rbr     direct  agglo   graph
I1       .464    .452    .490    .642    .417
I2       .379    .375    .374    .564
E1       .388    .398    .416    .540
G1       .389    .418    .398    .895
G1p      .326    .366    .391    .562
H1       .386    .392    .386    .541
H2       .348    .352    .367    .559
Clink                            .761
slink                            .895
(the graph column holds a single value, obtained with the cut-based criterion)
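
The slides report entropy without defining it. The sketch below assumes the definition used in the Zhao and Karypis paper: the entropy of a cluster is computed from its class distribution, normalized by log q for q classes, and the overall entropy is the size-weighted average over clusters, so 0 is a perfect solution and lower is better. The function name and the toy labels are illustrative.

```python
import math
from collections import Counter


def clustering_entropy(cluster_ids, class_ids):
    """Size-weighted average of per-cluster class entropy, normalized to [0, 1]."""
    n = len(cluster_ids)
    q = len(set(class_ids))                      # number of classes
    members_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_ids):
        members_by_cluster.setdefault(cluster, []).append(label)

    total = 0.0
    for members in members_by_cluster.values():
        n_r = len(members)
        # Entropy of the class distribution inside this cluster.
        e_r = -sum((count / n_r) * math.log(count / n_r)
                   for count in Counter(members).values())
        if q > 1:
            e_r /= math.log(q)                   # normalize by log q
        total += (n_r / n) * e_r
    return total


classes = ["physics", "physics", "biology", "biology"]
print(clustering_entropy([0, 0, 1, 1], classes))   # each cluster holds one class
print(clustering_entropy([0, 0, 0, 0], classes))   # one cluster mixes both classes evenly
```

A clustering that separates the classes scores 0, and a single cluster that mixes two classes evenly scores 1.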
33
Entropy
34
Experimental results

Purity

         rb      rbr     direct  agglo   graph
I1       .686    .690    .683    .548    .749
I2       .772    .762    .761    .629
E1       .741    .737    .723    .647
G1       .768    .739    .752    .367
G1p      .780    .758    .758    .647
H1       .753    .744    .758    .634
H2       .780    .782    .751    .650
Clink                            .458
slink                            .368
(the graph column holds a single value, obtained with the cut functions)
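
Purity is likewise reported without a definition. A minimal sketch, assuming the usual definition from the Zhao and Karypis paper: the purity of a cluster is the fraction of its docs that belong to its largest class, and the overall purity is the size-weighted average, which reduces to the number of majority-class docs divided by the total. Higher is better, and 1 is a perfect solution. The function name and the toy labels are illustrative.

```python
from collections import Counter


def clustering_purity(cluster_ids, class_ids):
    """Fraction of docs that belong to the majority class of their cluster."""
    members_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_ids):
        members_by_cluster.setdefault(cluster, []).append(label)
    majority_total = sum(max(Counter(members).values())
                         for members in members_by_cluster.values())
    return majority_total / len(cluster_ids)


classes = ["physics", "physics", "biology", "biology"]
print(clustering_purity([0, 0, 1, 1], classes))   # every cluster is a single class
print(clustering_purity([0, 0, 0, 1], classes))   # one biology doc sits with physics
```

In the second call the first cluster holds two physics docs and one biology doc, so three of the four docs are in their cluster's majority class.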
35
Purity
36
Best results
Method      rb      rbr     direct  agglo   graph
Criterion   g1p     h2      h1      h1      cut
Entropy     0.326   0.352   0.386   0.541   0.417
Purity      0.780   0.782   0.758   0.634   0.749
