Clustering for web documents


1
Clustering for web documents

2
Contents
  • Cluto
  • Criterion Functions for Document Clustering:
    Experiments and Analysis (2002)
    by Ying Zhao and George Karypis, Department of Computer Science,
    University of Minnesota, Minneapolis, MN 55455
  • Feature selection for web documents(2004)

3
Cluto
  • Clustering Toolkit, version 2.1.1
  • Department of Computer Science, University of
    Minnesota, Minneapolis
  • http://www-users.cs.umn.edu/~karypis/
  • Platforms
    • Linux 2.4.18
    • Sun OS 5.7
    • Win32
  • Programs
    • CLUTO's user-callable library
    • vcluster
    • scluster

4
Cluto
  • What is CLUTO? (1/2)
  • Clustering algorithms
    • partitional clustering
    • agglomerative clustering
    • graph-partitioning clustering
  • Clustering criterion functions
    • provides seven different criterion functions, usable by both the
      partitional and agglomerative clustering algorithms
    • provides some of the more traditional local criteria (e.g.,
      single-link, complete-link, and UPGMA) for agglomerative clustering

5
Cluto
  • What is CLUTO? (2/2)
  • Analyzes the discovered clusters
    • relations between the objects assigned to each cluster
    • relations between the different clusters
    • identifies the features that best describe and/or discriminate
      each cluster
    • relationships between the clusters, objects, and features
  • Operates on very large datasets, in terms of both
    • the number of objects
    • the number of dimensions

6
Cluto
  • Programs
    • vcluster: operates in the objects' feature space
    • scluster: operates in the objects' similarity space
  • Interface
    • vcluster [optional parameters] MatrixFile NClusters
    • MatrixFile: an n × m matrix; rows correspond to objects, columns
      to features
    • NClusters: the number of clusters

7
Cluto
  • Parameters of Algorithms
    • rb, rbr: k-1 repeated bisections (rbr additionally optimizes the
      criterion function)
    • direct: computed by simultaneously finding all k clusters
    • agglo: the agglomerative paradigm
    • graph: using a nearest-neighbor graph
    • bagglo

8
Cluto
  • Parameters of the similarity function
    • cos: the cosine function (default)
    • corr: the correlation coefficient
    • dist: the Euclidean distance
      (applicable only when -clmethod=graph)
    • jacc: the extended Jaccard coefficient
      (applicable only when -clmethod=graph)

9
Cluto
  • Parameters of the criterion function
  • i1, i2, e1, g1, g1p, h1, h2

10
Cluto
  • Parameters of the criterion function
    • slink: single link
    • wslink: weighted single link
    • clink: complete link
    • wclink: weighted complete link
    • upgma: UPGMA
  • cstype
  • fulltree
  • rowmodel, colmodel
  • showfeatures
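
The parameter values on the preceding slides can be combined into a single vcluster call. The following is a minimal sketch, not part of the original slides: the helper `build_vcluster_command` and the file name `docs.mat` are hypothetical, and the option spellings `-sim=` and `-crfun=` are assumed from the CLUTO manual (only `-clmethod=graph` is spelled out on the slides).

```python
# Values listed on the slides; the option spellings are an assumption.
CLMETHODS = {"rb", "rbr", "direct", "agglo", "graph", "bagglo"}
SIMS = {"cos", "corr", "dist", "jacc"}
CRFUNS = {"i1", "i2", "e1", "g1", "g1p", "h1", "h2",
          "slink", "wslink", "clink", "wclink", "upgma"}


def build_vcluster_command(matrix_file, n_clusters, clmethod, crfun, sim="cos"):
    """Return the argument list for: vcluster [options] MatrixFile NClusters."""
    if clmethod not in CLMETHODS:
        raise ValueError(f"unknown clustering method: {clmethod}")
    if sim not in SIMS:
        raise ValueError(f"unknown similarity function: {sim}")
    if crfun not in CRFUNS:
        raise ValueError(f"unknown criterion function: {crfun}")
    # The slides mark dist and jacc as applicable only with the graph method.
    if sim in {"dist", "jacc"} and clmethod != "graph":
        raise ValueError(f"-sim={sim} requires -clmethod=graph")
    return ["vcluster", f"-clmethod={clmethod}", f"-sim={sim}",
            f"-crfun={crfun}", matrix_file, str(n_clusters)]


# Hypothetical input file and cluster count, for illustration only.
print(" ".join(build_vcluster_command("docs.mat", 20, "rbr", "h2")))
```

The returned list can be handed to `subprocess.run` once CLUTO is installed; building it as a list avoids shell quoting problems.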

12
  • Criterion Functions for Document Clustering:
    Experiments and Analysis (2002)
  • by Ying Zhao and George Karypis, Department of
    Computer Science, University of Minnesota,
    Minneapolis, MN 55455

13
Data Clustering
A.K. Jain (Michigan State University), M.N. Murty (Indian Institute of
Science), and P.J. Flynn (The Ohio State University), ACM Computing Surveys
14
Introduction(1/2)
  • Clustering algorithms
    • Agglomerative algorithms: UPGMA, single-link, complete-link,
      CURE, ROCK, Chameleon
    • Partitional algorithms: K-means, K-medoids, Autoclass,
      graph-partitioning-based, spectral-partitioning-based
      • well suited to large datasets because they are fast
  • Seven criterion functions
    • measure intra-cluster similarity, inter-cluster similarity, or a
      combination of the two: i1, i2, e1, g1, g1p, h1, h2

15
Introduction(2/2)
  • Datasets
  • 15 different data sets

16
Preliminaries(1/3)
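  • Document Representation
    • uses the vector-space model for each document
    • $d_{tf} = (tf_1, tf_2, \ldots, tf_m)$, where $d$ is a document, tf is
      the term frequency, and $tf_i$ is the frequency of the i-th term in
      the doc
    • uses idf or tf-idf weighting:
      $d_{tfidf} = (tf_1 \log(N/df_1), \ldots, tf_m \log(N/df_m))$, where
      $N$ is the total number of documents and $df_i$ is the number of
      documents that contain the i-th term
  • Similarity Measures
    • the similarity between two docs $d_i$, $d_j$
    • cosine function:
      $\cos(d_i, d_j) = \frac{d_i^t d_j}{\|d_i\| \, \|d_j\|}$
    • the length of each doc vector is normalized ($\|d\| = 1$), so
      $\cos(d_i, d_j) = d_i^t d_j$
    • 1: identical; 0: nothing in common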
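
As an illustration of this slide, here is a minimal sketch of tf-idf weighting followed by unit-length normalization and cosine similarity. It assumes the usual weight tf_i · log(N / df_i), where df_i is the number of documents containing term i; the function names and the three toy documents are made up for the example.

```python
import math


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length tf-idf vectors)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # df[t]: number of documents that contain term t
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        vec = [doc.count(t) * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        # Normalize to unit length; leave all-zero vectors untouched.
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (reduces to a dot product)."""
    return sum(a * b for a, b in zip(u, v))


docs = [["web", "cluster", "cluster"], ["web", "cluster"], ["galaxy", "star"]]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared terms: similarity close to 1
print(cosine(vecs[0], vecs[2]))  # no shared terms: similarity 0
```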

17
Preliminaries(2/3)
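  • Euclidean function:
    $dis(d_i, d_j) = \sqrt{(d_i - d_j)^t (d_i - d_j)} = \|d_i - d_j\|$
    • if $dis = 0$, the docs are identical; if $dis = \sqrt{2}$, they
      have nothing in common
  • Definitions
    • $S$: the set of documents
    • $S_1, S_2, \ldots, S_k$: the sets of documents in each of the k
      clusters
    • $k$: number of clusters
    • $n_1, n_2, \ldots, n_k$: sizes (number of docs) of the
      corresponding clusters
    • $A$: a set of docs
    • composite vector $D_A = \sum_{d \in A} d$: the sum of all doc
      vectors in A
    • centroid vector $C_A = D_A / |A|$: the average of the term weights
      of the docs in A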
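
A small sketch of these definitions, assuming unit-length document vectors stored as plain Python lists. The helper names and the three example vectors are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def composite(vectors):
    """Composite vector D_A: component-wise sum of all vectors in A."""
    return [sum(column) for column in zip(*vectors)]


def centroid(vectors):
    """Centroid vector C_A: the composite vector divided by |A|."""
    n = len(vectors)
    return [x / n for x in composite(vectors)]


d1, d2, d3 = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]  # all unit length
print(euclidean(d1, d1))       # identical docs: distance 0
print(euclidean(d1, d2))       # orthogonal unit vectors: distance sqrt(2)
print(composite([d1, d2, d3]))
print(centroid([d1, d2, d3]))
```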

18
Preliminaries(3/3)
  • Vector Properties
    • $S_i$, $S_j$: two sets of docs containing $n_i$, $n_j$ documents
    • $D_i$, $D_j$: their composite vectors; $C_i$, $C_j$: their
      centroid vectors
    • the sum of the pairwise similarities between the docs in $S_i$
      and the docs in $S_j$ is $D_i^t D_j$
    • the sum of the pairwise similarities between the docs in $S_i$
      is $\|D_i\|^2$
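
These two properties can be checked numerically. The sketch below assumes unit-length vectors (so cosine similarity is a dot product) and counts ordered pairs, including each document paired with itself, for the within-set sum; the example sets are made up.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(vectors):
    """Composite vector: component-wise sum of the vectors in a set."""
    return [sum(column) for column in zip(*vectors)]


S_i = [[1.0, 0.0], [0.6, 0.8]]  # unit-length docs
S_j = [[0.0, 1.0], [0.8, 0.6]]
D_i, D_j = composite(S_i), composite(S_j)

# Sum of cosine similarities over every (doc in S_i, doc in S_j) pair.
between = sum(dot(a, b) for a in S_i for b in S_j)
# Sum over all ordered pairs inside S_i, self-pairs included.
within = sum(dot(a, b) for a in S_i for b in S_i)

print(between, dot(D_i, D_j))  # the two values agree up to rounding
print(within, dot(D_i, D_i))   # the two values agree up to rounding
```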

19
Criterion Functions(1/5)
  • Internal Criterion Functions
    • I1: maximizes the sum of the average pairwise similarities
      (measured with the cosine function) between the docs assigned to
      each cluster
      • similar to hierarchical agglomerative clustering that uses the
        group-average heuristic to decide which clusters to merge
    • I2: uses the cosine function to measure the similarity between
      each doc and the centroid of its cluster
      • the criterion used by the vector-space variant of the K-means
        algorithm
    • $C_r$: centroid vector of the r-th cluster
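
A minimal sketch of the two internal criteria, assuming the closed forms given in the cited Zhao and Karypis paper for unit-length documents: I1 = Σ_r ‖D_r‖² / n_r and I2 = Σ_r ‖D_r‖, both to be maximized. The function names and the toy clusterings are illustrative.

```python
import math


def composite(vectors):
    return [sum(column) for column in zip(*vectors)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def i1(clusters):
    """Sum over clusters of ||D_r||^2 / n_r (assumed closed form of I1)."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    """Sum over clusters of ||D_r|| (assumed closed form of I2)."""
    return sum(norm(composite(c)) for c in clusters)


# Three unit-length docs: the first and third point in similar directions.
a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
tight = [[a, c], [b]]   # groups the two more similar docs together
loose = [[a, b], [c]]   # groups two orthogonal docs together

print(i1(tight), i1(loose))  # the tighter clustering scores higher
print(i2(tight), i2(loose))  # the tighter clustering scores higher
```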

20
Criterion Functions (2/5)
  • External Criterion Functions: E1, E2
    • optimize a function based on how the clusters differ from each
      other
    • the external function is derived by requiring the centroid
      vectors of the different clusters to be as orthogonal as possible
    • $C$: the centroid vector of the entire collection of docs
    • $D$: the composite vector of the entire collection of docs;
      $1/\|D\|$ is constant
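
A minimal sketch of E1, assuming the form given in the cited paper once the constant 1/‖D‖ is dropped: E1 = Σ_r n_r · D_r^t D / ‖D_r‖, to be minimized. The function name and the toy clusterings are illustrative, and documents are assumed to be unit length.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(vectors):
    return [sum(column) for column in zip(*vectors)]


def e1(clusters):
    """Sum over clusters of n_r * (D_r . D) / ||D_r|| (assumed form; lower is better)."""
    all_docs = [doc for cluster in clusters for doc in cluster]
    D = composite(all_docs)  # composite vector of the whole collection
    total = 0.0
    for cluster in clusters:
        D_r = composite(cluster)
        total += len(cluster) * dot(D_r, D) / math.sqrt(dot(D_r, D_r))
    return total


a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
tight = [[a, c], [b]]
loose = [[a, b], [c]]
print(e1(tight), e1(loose))  # the tighter clustering gives the smaller value
```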

21
Criterion Functions (3/5)
  • defined using the Euclidean distance function
  • Hybrid Criterion Functions: H1, H2
    • maximize the similarity of the docs within each cluster while
      minimizing the similarity between each cluster's docs and the
      entire collection
    • H1: combines criterion functions I1 and E1
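
A minimal sketch of the hybrid idea, assuming the combination used in the cited paper is a ratio: H1 = I1 / E1, and likewise H2 = I2 / E1, both to be maximized. The closed forms of I1, I2, and E1 used here are the same assumptions, for unit-length documents; all names and data are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(vectors):
    return [sum(column) for column in zip(*vectors)]


def i1(clusters):
    return sum(dot(composite(c), composite(c)) / len(c) for c in clusters)


def i2(clusters):
    return sum(math.sqrt(dot(composite(c), composite(c))) for c in clusters)


def e1(clusters):
    D = composite([doc for c in clusters for doc in c])
    return sum(len(c) * dot(composite(c), D) / math.sqrt(dot(composite(c), composite(c)))
               for c in clusters)


def h1(clusters):
    """Assumed hybrid: internal criterion I1 divided by external criterion E1."""
    return i1(clusters) / e1(clusters)


def h2(clusters):
    """Assumed hybrid: internal criterion I2 divided by external criterion E1."""
    return i2(clusters) / e1(clusters)


a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
tight = [[a, c], [b]]
loose = [[a, b], [c]]
print(h1(tight), h1(loose))  # higher for the tighter clustering
print(h2(tight), h2(loose))  # higher for the tighter clustering
```

Dividing a quantity to be maximized by one to be minimized lets a single score reward both compact and well-separated clusters.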

22
Criterion Functions (4/5)
  • H2: combines criterion functions I2 and E1
  • Graph-Based Criterion Functions
    • an alternative way to view the relations between docs is to use
      graphs
    • G1: computes pairwise similarities between the docs
    • G2: computes pairwise similarities between the docs and the
      terms
    • $S$: the given collection of n docs
    • $G_s$: the similarity graph

23
Criterion Functions (5/5)
  • G1.
  • G2.
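
The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the definition of G1 from the cited paper (for each cluster, the edge cut to the rest of the collection divided by the sum of similarities inside the cluster, summed over clusters and minimized), the pairwise form and its composite-vector form can be compared as follows. Names and data are illustrative; documents are assumed to be unit length.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(vectors):
    return [sum(column) for column in zip(*vectors)]


def g1_pairwise(clusters):
    """Sum over clusters of cut(S_r, S - S_r) / (sum of similarities inside S_r)."""
    total = 0.0
    for r, cluster in enumerate(clusters):
        rest = [d for s, other in enumerate(clusters) if s != r for d in other]
        cut = sum(dot(a, b) for a in cluster for b in rest)
        inside = sum(dot(a, b) for a in cluster for b in cluster)
        total += cut / inside
    return total


def g1_composite(clusters):
    """Same quantity via composite vectors: D_r . (D - D_r) / ||D_r||^2."""
    D = composite([d for c in clusters for d in c])
    total = 0.0
    for cluster in clusters:
        D_r = composite(cluster)
        rest = [x - y for x, y in zip(D, D_r)]
        total += dot(D_r, rest) / dot(D_r, D_r)
    return total


a, b, c = [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]
tight = [[a, c], [b]]
loose = [[a, b], [c]]
print(g1_pairwise(tight), g1_composite(tight))  # both forms agree
print(g1_composite(loose))                      # larger (worse) than for tight
```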

26
Experimental Results
  • Direct k-way Clustering

27
Experimental Results
28
Experimental Results
29
Data Sets
  • the Natural Science category in the Naver directory
    (http://dir.naver.com)
  • 6 subcategories in the corpus
  • 1,215 docs, 17,223 terms, 20 clusters
  • 5 features per doc, idf weighting

Subcategory      No. of docs
Physics          102
Earth science    149
Biology          426
Astrology        323
Mathematics      102
Chemistry        113
Total            1,215
30
Experimental parameters
  • Algorithms
    • rb, rbr: k-1 repeated bisections (rbr additionally optimizes the
      criterion function)
    • direct: computed by simultaneously finding all k clusters
    • agglo: the agglomerative paradigm
    • graph: using a nearest-neighbor graph

31
Experimental parameters
  • Criterion Functions
  • i1, i2, e1, g1, g1p, h1, h2, clink, slink
  • Similarity Functions
  • cosine measure

32
Experimental results
  • Entropy

        rb      rbr     direct  agglo   graph
I1      .464    .452    .490    .642    .417
I2      .379    .375    .374    .564
E1      .388    .398    .416    .540
G1      .389    .418    .398    .895
G1p     .326    .366    .391    .562
H1      .386    .392    .386    .541
H2      .348    .352    .367    .559
Clink                           .761
slink                           .895
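
The slides do not define the entropy measure. The sketch below assumes the definition used in the cited Zhao and Karypis paper: for each cluster, the entropy of its class distribution normalized by log q (q being the number of classes), averaged over clusters with weights n_r / n. Lower is better. The function name and labels are illustrative.

```python
import math
from collections import Counter


def clustering_entropy(clusters):
    """clusters: list of lists of true class labels, one list per cluster."""
    n = sum(len(labels) for labels in clusters)
    q = len({label for labels in clusters for label in labels})
    if q < 2:
        return 0.0  # a single class cannot be mixed
    total = 0.0
    for labels in clusters:
        n_r = len(labels)
        if n_r == 0:
            continue
        # Entropy of this cluster's class distribution, scaled to [0, 1].
        e_r = -sum((cnt / n_r) * math.log(cnt / n_r)
                   for cnt in Counter(labels).values()) / math.log(q)
        total += (n_r / n) * e_r
    return total


pure = [["physics"] * 3, ["biology"] * 4]
mixed = [["physics", "biology"], ["physics", "biology"]]
print(clustering_entropy(pure))   # every cluster holds one class
print(clustering_entropy(mixed))  # every cluster is an even mix
```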
33
Entropy
34
Experimental results
  • Purity

        rb      rbr     direct  agglo   graph
I1      .686    .690    .683    .548    .749
I2      .772    .762    .761    .629
E1      .741    .737    .723    .647
G1      .768    .739    .752    .367
G1p     .780    .758    .758    .647
H1      .753    .744    .758    .634
H2      .780    .782    .751    .650
Clink                           .458
slink                           .368
(graph column: cut functions)
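
Purity is likewise not defined on the slides. A minimal sketch, assuming the definition in the cited paper: each cluster contributes the size of its largest class, and the total is divided by the number of documents. Higher is better. The function name and labels are illustrative.

```python
from collections import Counter


def clustering_purity(clusters):
    """clusters: list of lists of true class labels, one list per cluster."""
    n = sum(len(labels) for labels in clusters)
    # Each non-empty cluster contributes the count of its dominant class.
    dominant = sum(max(Counter(labels).values()) for labels in clusters if labels)
    return dominant / n


pure = [["physics"] * 3, ["biology"] * 4]
partial = [["physics"] * 3, ["biology", "biology", "physics", "biology"]]
print(clustering_purity(pure))     # all docs match their cluster's dominant class
print(clustering_purity(partial))  # one of seven docs is in the wrong cluster
```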
35
Purity
36
Best results
Method  Criterion  Entropy  Purity
rb      g1p        0.326    0.780
rbr     h2         0.352    0.782
direct  h1         0.386    0.758
agglo   h1         0.541    0.634
graph   cut        0.417    0.749