1
Clustering of the SOM using a clustering validity
index based on inter-cluster and intra-cluster
density
Sitao Wu, Tommy W. S. Chow
Pattern Recognition 37 (2004) 175-188
Summarized by Hwang, Seong-Seob
January, 2005
2
Contents
  • Introduction
  • Self-organizing map and clustering
  • Clustering of the SOM using local clustering
    validity index and preprocessing of the SOM for
    filtering
  • Experimental results
  • Conclusions

3
Introduction (1/3)
  • Self-organizing map (SOM)
  • Proposed by Kohonen
  • Many industrial applications
  • pattern recognition, biological modeling, data
    compression, signal processing, data mining
  • Unsupervised, nonparametric neural-network
    approach
  • The success of the SOM algorithm lies in its
    simplicity, which makes it easy to understand,
    simulate, and apply

4
Introduction (2/3)
  • The basic SOM consists of a set of neurons
    usually arranged in a 2D structure
  • Neighborhood relations among the neurons
  • After completion of training, each neuron is
    attached to a feature vector of the same
    dimension as the input space
  • Vector quantization (VQ)
  • By assigning each input vector to the neuron with
    the nearest feature vector, the SOM is able to
    divide the input space into regions with common
    nearest feature vectors
  • Topology preservation
  • if two feature vectors are close to each other
    in the input space, the corresponding neurons
    will also be close in the output space
  • The SOM is suitable for visualization purposes

5
Introduction (3/3)
  • Clustering algorithms
  • To organize unlabeled input vectors into clusters
    or natural groups such that points within a
    cluster are more similar to each other than
    vectors belonging to different clusters
  • Applications
  • exploratory pattern-analysis, grouping,
    decision-making, machine-learning situations,
    data mining, document retrieval, image
    segmentation, pattern classification
  • Five types of clustering
  • hierarchical clustering, partitioning clustering,
    density-based clustering, grid-based clustering,
    model-based clustering

6
Self-organizing map and visualization (1/3)
  • Competitive learning is an adaptive process
  • A division of neural nodes emerges in the network
    to represent different patterns of the inputs
    after training
  • The division is enforced by competition among the
    neurons
  • The SOM consists of M neurons located on a
    regular low-dimensional grid, usually one or two
    dimensional
  • Higher-dimensional grids are possible, but
    their visualization is problematic
  • The lattice of the grid is either hexagonal or
    rectangular

7
Self-organizing map and visualization (2/3)
  • The basic SOM algorithm
  • The winning neuron is the neuron with the feature
    vector closest to x(t)
  • A set of neighboring nodes of the winning node
  • The weight update

c = argmin_i ||x(t) - w_i(t)||
w_i(t+1) = w_i(t) + e(t) h_ic(t) [x(t) - w_i(t)], for i in N_c
h_ic(t) = exp( -||Pos_i - Pos_c||^2 / (2 s(t)^2) )

where i indexes the neurons; w_i is the feature
vector of neuron i; x(t) is the data vector; c is
the winning neuron; N_c is the set of neighboring
nodes; h_ic(t) is the neighborhood kernel function;
Pos_i is the coordinates of neuron i; s(t) is the
kernel width (decreasing monotonically); and e(t)
is the learning rate (decreasing monotonically)
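A minimal Python sketch of this online SOM update, assuming a Gaussian neighborhood kernel and exponential decay for e(t) and s(t) (the slides say only that both decrease monotonically); the decay constants, grid defaults, and function names are illustrative:

```python
import numpy as np

def train_som(data, grid_w=4, grid_h=4, epochs=100,
              lr0=1.0, lr_final=1e-4):
    """Minimal online SOM training sketch (illustrative, not the
    paper's exact schedule). lr decays from 1 to 0.0001 as on the
    experiments slide."""
    n, dim = data.shape
    rng = np.random.default_rng(0)
    # Feature vectors w_i, one per neuron on a grid_h x grid_w grid.
    weights = rng.random((grid_w * grid_h, dim))
    # Grid coordinates Pos_i of each neuron.
    pos = np.array([(r, c) for r in range(grid_h)
                    for c in range(grid_w)], float)

    total_steps = epochs * n
    sigma0 = max(grid_w, grid_h) / 2.0
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(n)]:
            t = step / total_steps
            lr = lr0 * (lr_final / lr0) ** t        # e(t): decreasing
            sigma = sigma0 * (0.5 / sigma0) ** t    # s(t): decreasing
            # Winning neuron c: nearest feature vector to x(t).
            c = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Neighborhood kernel h_ic(t) over grid distances; the
            # Gaussian falls off quickly, so distant neurons barely move.
            d2 = np.sum((pos - pos[c]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            # Weight update pulls the winner and its neighbors toward x(t).
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights, pos
```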
8
Self-organizing map and visualization (3/3)
  • Because of the neighborhood relations,
    neighboring neurons are pulled to the same
    direction
  • Feature vectors of neighboring neurons resemble
    each other
  • The 2D map can be easily visualized and thus give
    people useful information about the input data
  • The usual way to display the cluster structure of
    the data is to use a distance matrix, such as
    U-matrix
  • The U-matrix method shades the SOM grid
    according to the distances between neighboring
    neurons
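A simplified U-matrix sketch (one value per neuron, averaging distances to its direct grid neighbors; full U-matrices also insert cells between neurons), reusing the weights returned by train_som above:

```python
import numpy as np

def u_matrix(weights, grid_w, grid_h):
    """Per-neuron average distance to the feature vectors of its
    4-connected grid neighbors. High values mark cluster borders."""
    w = weights.reshape(grid_h, grid_w, -1)
    u = np.zeros((grid_h, grid_w))
    for r in range(grid_h):
        for c in range(grid_w):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < grid_h and 0 <= cc < grid_w:
                    dists.append(np.linalg.norm(w[r, c] - w[rr, cc]))
            u[r, c] = np.mean(dists)
    return u
```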

9
Clustering algorithm (1/3)
  • Partitioning clustering
  • Given a database of n objects, construct k
    partitions (k <= n)
  • k-means algorithm
  • Each cluster is represented by the mean value of
    the objects in the cluster
  • Advantages
  • The clustering is dynamic
  • Some a priori knowledge, such as cluster shapes,
    can be incorporated in the clustering
  • Drawbacks
  • It has difficulty discovering clusters of
    arbitrary shapes
  • The number of clusters is pre-fixed and the
    optimal number of clusters is hard to determine
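Since the slides describe k-means only in words, here is a minimal self-contained Python sketch of the standard Lloyd iteration (initialization and convergence test are illustrative choices, not from the paper):

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """Each cluster is represented by the mean of its assigned objects;
    assignment and mean update alternate until the means stop moving."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest cluster mean.
        labels = np.argmin(
            np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
        # Recompute means; keep the old center if a cluster went empty.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```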

10
Clustering algorithm (2/3)
  • Hierarchical clustering
  • A hierarchical decomposition of the given
    dataset
  • It can be classified as either agglomerative or
    divisive
  • Advantage
  • It is not affected by initialization and local
    minima
  • Shortcomings
  • It is impractical for large data sets due to the
    high-computational complexity
  • It does not incorporate any a priori knowledge
    such as cluster shapes
  • The clustering is static

11
Clustering algorithm (3/3)
  • Four types of definitions of inter-cluster
    distance: single, complete, centroid, and
    average linkage

12
Clustering of SOM
  • The two-level approach of clustering of the SOM
  • Different symbols on the map represent different
    clusters
  • The clustering algorithms can be used in
    clustering the output neurons of SOM
  • If the clusters have nonspherical shapes,
    partitioning clustering fails (X) while
    hierarchical clustering can handle them (O)
  • The extended SOM (minimum distance variance)

13
Global clustering validity index (1/4)
  • Evaluation criteria are needed to justify the
    correctness of a partition
  • The index is based on two accepted concepts
  • A cluster's compactness and its separation from
    other clusters
  • The implementation of most validity algorithms
    is very computationally intensive (especially
    for very large databases)
  • Dependent on the data and the number of clusters
  • Using the sample mean of each subset vs. all
    points in each subset

14
Global clustering validity index (2/4)
  • Intra-cluster density
  • The intra-cluster density is high for
    well-separated clusters

15
Global clustering validity index (3/4)
  • Inter-cluster density
  • The density in the region between clusters
    should be significantly low

16
Global clustering validity index (4/4)
  • Clusters' separation
  • CDbw
  • Composing Density Between and Within clusters
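The CDbw formulas on slides 14-16 appeared as images and are not reproduced here. The sketch below is a deliberately simplified single-centroid illustration of the idea (intra-cluster density times a separation term penalized by inter-cluster density); the paper's actual CDbw uses several representative points per cluster and different weighting:

```python
import numpy as np

def cdbw_simplified(data, labels):
    """Illustrative CDbw-style index, higher = better partition.
    One centroid per cluster; treat only as a sketch of the concept."""
    ids = np.unique(labels)
    cents = {j: data[labels == j].mean(axis=0) for j in ids}
    # Density radius: average per-cluster standard deviation.
    stdev = np.mean([data[labels == j].std() for j in ids])

    def density(point, pts):
        # Fraction of points within one stdev of `point`.
        return np.mean(np.linalg.norm(pts - point, axis=1) <= stdev)

    # Intra-cluster density: average density around each centroid,
    # high for compact clusters.
    intra = np.mean([density(cents[j], data[labels == j]) for j in ids])

    # Separation: centroid distances shrunk by the density at the
    # midpoint between the two clusters (inter-cluster density),
    # which should be low for well-separated clusters.
    seps = []
    for a in ids:
        for b in ids:
            if a < b:
                mid = (cents[a] + cents[b]) / 2
                both = data[(labels == a) | (labels == b)]
                inter = density(mid, both)
                d = np.linalg.norm(cents[a] - cents[b])
                seps.append(d / (1 + inter))
    sep = np.mean(seps) if seps else 0.0
    return intra * sep
```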

17
Merging criterion using the CDbw (1/2)
  • The inter- and intra-cluster densities are
    incorporated into the merging criterion in
    addition to distance information
  • Compute the CDbw for the data belonging to each
    neighboring pair of clusters
  • The merging mechanism is to find the pair of
    clusters with the minimal value of the CDbw
  • the two clusters have the strongest tendency to
    be merged
  • The advantage of the merging mechanism is that
    the clustering result is more accurate, because
    more information about the individual clusters
    is considered
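A sketch of this merge-pair selection, reusing the illustrative cdbw_simplified above; neighbor_pairs is assumed to be precomputed from the map topology:

```python
import numpy as np

def pick_merge_pair(data, labels, neighbor_pairs):
    """Return the neighboring cluster pair with minimal CDbw, i.e.
    the pair with the strongest tendency to be merged."""
    best, best_val = None, float("inf")
    for a, b in neighbor_pairs:
        # CDbw evaluated on the two clusters' data alone.
        mask = (labels == a) | (labels == b)
        val = cdbw_simplified(data[mask], labels[mask])
        if val < best_val:
            best, best_val = (a, b), val
    return best
```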

18
Merging criterion using the CDbw (2/2)
19
Preprocessing before clustering of the SOM (1/2)
  • Neurons with no input data assigned
  • Not included in the next clustering steps
  • Compute dev_j = ||w_j - m_j||, mean_dev, and
    std_dev
  • w_j: feature vector of neuron j; m_j: mean
    vector of the data assigned to neuron j
  • If dev_j > mean_dev + std_dev
  • Exclude neuron j from the later clustering
  • This mechanism can filter out the input outliers

20
Preprocessing before clustering of the SOM (2/2)
  • Compute dis_j(x_i) = ||x_i - w_j||, mean_dis_j,
    and std_dis_j
  • x_i: input vector
  • If ||x_i - w_j|| > mean_dis_j + std_dis_j
  • Filter out the input vector x_i for the next
    clustering steps
  • This can filter out the input outliers and noise
  • Compute the number of data belonging to the jth
    cluster, num_j
  • Compute the statistical information mean_num and
    std_num over all the num_j's
  • If num_j < mean_num - std_num
  • Exclude neuron j from the later clustering
  • This can filter out the input noise
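Putting the three filters of slides 19-20 together, a Python sketch assuming the mean + std / mean - std thresholds as read from the slides, and an assign array mapping each input vector to its nearest neuron:

```python
import numpy as np

def filter_som(weights, data, assign):
    """Return boolean masks of neurons and data kept for clustering."""
    n_neurons = len(weights)
    keep_neuron = np.zeros(n_neurons, bool)
    keep_data = np.ones(len(data), bool)

    # Filter 1: drop neurons with no data, and neurons whose feature
    # vector deviates too much from the mean of their assigned data
    # (dev_j > mean_dev + std_dev).
    devs = np.full(n_neurons, np.nan)
    for j in range(n_neurons):
        pts = data[assign == j]
        if len(pts):
            devs[j] = np.linalg.norm(weights[j] - pts.mean(axis=0))
    ok = ~np.isnan(devs)
    keep_neuron[ok] = devs[ok] <= np.nanmean(devs) + np.nanstd(devs)

    # Filter 2: drop input vectors too far from their neuron
    # (||x_i - w_j|| > mean_dis_j + std_dis_j).
    for j in np.where(keep_neuron)[0]:
        idx = np.where(assign == j)[0]
        dis = np.linalg.norm(data[idx] - weights[j], axis=1)
        keep_data[idx] = dis <= dis.mean() + dis.std()

    # Filter 3: drop neurons with too few data
    # (num_j < mean_num - std_num).
    nums = np.array([np.sum(assign == j) for j in range(n_neurons)])
    keep_neuron &= nums >= nums.mean() - nums.std()

    # Data assigned to dropped neurons is filtered out as well (an
    # assumption; the slides do not say how such data is handled).
    keep_data &= keep_neuron[assign]
    return keep_neuron, keep_data
```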

21
Clustering of the SOM (1/2)
  • After the preprocessing of the SOM, some
    neurons and some input data are excluded
  • The neurons and input data that are left can be
    hierarchically clustered
  • Rectangular grids are used for the SOM
  • The merging process happens only for neighboring
    clusters, which means that the neurons belonging
    to the pair of clusters are direct neighbors

22
Clustering of the SOM (2/2)
(a) Neuron A has eight direct neighboring
neurons B-I
(b) Multi-neuron represented neighboring
clusters 1 and 2 can be clustered into one
cluster because the two clusters are direct
neighbors
(c) Multi-neuron represented clusters 1 and 2
cannot be clustered into one cluster because the
two clusters are not direct neighbors
23
The algorithm of clustering of the SOM
  • Train input data by the SOM
  • Preprocessing before clustering of the SOM
  • Cluster SOM by using the agglomerative
    hierarchical clustering
  • The merging criterion is made by the CDbw for all
    pairs of directly neighboring clusters
  • Find the optimal partition of the input data
  • According to the CDbw for all the input data as a
    function of the number of clusters
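The whole procedure wired together from the illustrative helpers above (train_som, filter_som, cdbw_simplified, pick_merge_pair); the 8-way neighbor test and the merge loop are sketches of the paper's procedure, not its exact implementation:

```python
import numpy as np

def cluster_som(data, grid_w=4, grid_h=4):
    """Train, filter, then agglomeratively merge neighboring clusters
    by minimal CDbw; keep the partition with maximal global CDbw.
    Returns a cluster label per input vector (-1 = filtered out)."""
    weights, pos = train_som(data, grid_w, grid_h)
    assign = np.argmin(
        np.linalg.norm(data[:, None] - weights[None], axis=2), axis=1)
    keep_neuron, keep_data = filter_som(weights, data, assign)

    # Start with one cluster per kept neuron.
    neuron_lab = np.where(keep_neuron, np.arange(len(weights)), -1)

    def data_labels():
        lab = np.full(len(data), -1)
        lab[keep_data] = neuron_lab[assign[keep_data]]
        return lab

    def neighbor_pairs():
        # Cluster pairs whose neurons are direct (8-way) grid neighbors,
        # as on the "eight direct neighboring neurons" slide.
        pairs = set()
        kept = np.where(keep_neuron)[0]
        for i in kept:
            for j in kept:
                if neuron_lab[i] != neuron_lab[j] and \
                   np.max(np.abs(pos[i] - pos[j])) <= 1:
                    pairs.add(tuple(sorted((neuron_lab[i],
                                            neuron_lab[j]))))
        return pairs

    lab = data_labels()
    mask = lab >= 0
    best, best_val = lab.copy(), -np.inf
    while len(np.unique(neuron_lab[keep_neuron])) > 2:
        pairs = neighbor_pairs()
        if not pairs:
            break
        pair = pick_merge_pair(data[mask], lab[mask], pairs)
        neuron_lab[neuron_lab == pair[1]] = pair[0]   # merge b into a
        lab = data_labels()
        # Global CDbw of the current partition; the maximum over the
        # merge sequence picks the optimal number of clusters.
        val = cdbw_simplified(data[mask], lab[mask])
        if val > best_val:
            best, best_val = lab.copy(), val
    return best
```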

24
Experimental results
  • To demonstrate the effectiveness of the
    proposed clustering algorithm, four data sets
    were used in the experiments
  • The input data are normalized such that the
    value of each datum in each dimension lies in
    [0, 1]
  • For training the SOM, the authors used 100
    training epochs, and the learning rate decreases
    from 1 to 0.0001
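A standard per-dimension min-max normalization sketch matching the description above (the paper does not spell out the exact scaling used):

```python
import numpy as np

def minmax_normalize(data):
    """Scale each dimension into [0, 1]."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant dimensions
    return (data - lo) / span
```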

25
The synthetic data set in the 2D plane
  • 2D data set (200 points)
  • Three shallow, elongated, parallel clusters in
    the 2D plane
  • Some noise and outliers
  • Preprocessing
  • Using the SOM algorithm
  • k-means, four different hierarchical clustering
    algorithms, and the proposed algorithm were used
    to cluster the SOM

26
Partitions of the synthetic data set into three
clusters by six algorithms: (a) the proposed
algorithm; (b) k-means; (c) single-linkage;
(d) complete-linkage; (e) centroid-linkage;
(f) average-linkage
27
2D synthetic data set
Three clusters for the synthetic data set are
displayed on the map by the proposed algorithm or
the single-linkage clustering algorithm on the SOM
(SOM map sizes of 4×4 and 6×6)
The CDbw as a function of the number of clusters
for the synthetic data set
28
Iris data set (1/3)
  • Iris data
  • Widely used in pattern classification
  • 150 data points of four dimensions
  • Three classes with 50 points each
  • The first class is linearly separable from the
    other two
  • The other two classes are overlapped to some
    extent and are not linearly separable

29
Iris data set (2/3)
Two clusters for the iris data set are identified
by (a) the proposed algorithm or (b) the
single-linkage clustering algorithm on the SOM
(SOM map size of 4×4)
For the known three classes, three clusters are
formed (SOM map size of 4×4) for the iris data set
The CDbw as a function of the number of clusters
for the iris data
30
Iris data set (3/3)
  • Performance comparison of different clustering
    algorithms for the iris data set

31
15D synthetic data set (1/2)
  • The statistical information of 20 clusters for
    the 15D synthetic data set

32
15D synthetic data set (2/2)
  • Twenty clusters for the 15D synthetic data set
    are displayed on the map by the proposed
    algorithm on the SOM (SOM map size of 8×8)

33
Wine data set (1/2)
  • 13D data with three known classes (178 points:
    59, 71, 48)

Three clusters for the wine data set are
displayed on the map by the proposed clustering
algorithm on the SOM (SOM map size of 4×4)
The CDbw as a function of the number of clusters
for the wine data set by the proposed algorithm
on the SOM
34
Wine data set (2/2)
  • Performance comparison of different clustering
    algorithms for the wine data set

35
Conclusion (1/2)
  • A new SOM-based clustering algorithm
  • Using the clustering validity index locally to
    determine which pair of clusters to merge
  • The optimal number of clusters can be determined
    by the maximum value of the CDbw, which is the
    clustering validity index globally for all input
    data
  • Compared with classical clustering methods on the
    SOM, the proposed algorithm utilizes more
    information about the data in each cluster in
    addition to inter-cluster distances
  • The proposed algorithm clusters data better than
    the classical clustering algorithms on the SOM

36
Conclusion (2/2)
  • The preprocessing steps for filtering out noise
    and outliers
  • To increase the accuracy and robustness of
    clustering of the SOM
  • The experimental results demonstrate that the
    proposed clustering algorithm outperforms other
    clustering algorithms on the SOM