Title: Clustering of the SOM using a clustering validity index based on inter-cluster and intra-cluster density
Clustering of the SOM using a clustering validity index based on inter-cluster and intra-cluster density
Sitao Wu, Tommy W. S. Chow
Pattern Recognition 37 (2004) 175-188
Summarized by Hwang, Seong-Seob
January 2005
Contents
- Introduction
- Self-organizing map and clustering
- Clustering of the SOM using a local clustering validity index and preprocessing of the SOM for filtering
- Experimental results
- Conclusions
Introduction (1/3)
- Self-organizing map (SOM)
  - Proposed by Kohonen
  - Many industrial applications: pattern recognition, biological modeling, data compression, signal processing, data mining
  - An unsupervised, nonparametric neural-network approach
  - The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate, and use in many applications
Introduction (2/3)
- The basic SOM consists of a set of neurons, usually arranged in a 2D structure
  - Neighborhood relations exist among the neurons
- After completion of training, each neuron is attached to a feature vector of the same dimension as the input space
- Vector quantization (VQ)
  - By assigning each input vector to the neuron with the nearest feature vector, the SOM divides the input space into regions with common nearest feature vectors
- Topology preservation
  - If two feature vectors are near each other in the input space, the corresponding neurons will also be close in the output space
- The SOM is suitable for visualization purposes
Introduction (3/3)
- Clustering algorithms
  - Organize unlabeled input vectors into clusters or natural groups such that points within a cluster are more similar to each other than to vectors belonging to different clusters
- Applications
  - Exploratory pattern analysis, grouping, decision making, machine learning, data mining, document retrieval, image segmentation, pattern classification
- Five types of clustering
  - Hierarchical, partitioning, density-based, grid-based, and model-based clustering
Self-organizing map and visualization (1/3)
- Competitive learning is an adaptive process
  - A division of neural nodes emerges in the network to represent different patterns of the inputs after training
  - The division is enforced by competition among the neurons
- The SOM consists of M neurons located on a regular low-dimensional grid, usually one- or two-dimensional
  - Higher-dimensional grids are possible, but their visualization is problematic
- The lattice of the grid is either hexagonal or rectangular
Self-organizing map and visualization (2/3)
- The basic SOM algorithm
  - The winning neuron c is the neuron with the feature vector closest to the input: $c = \arg\min_i \|x(t) - w_i(t)\|$
  - A set of neighboring nodes of the winning node is updated together with the winner
  - The weight update (see the sketch below): $w_i(t+1) = w_i(t) + \varepsilon(t)\, h_{ic}(t)\, [x(t) - w_i(t)]$
- Notation: $i$ is a neuron index; $w_i$ its feature vector; $x(t)$ the data vector; $c$ the winning neuron; $N_c$ the set of neighboring nodes; $h_{ic}(t)$ the neighborhood kernel function, e.g. $h_{ic}(t) = \exp\left(-\|Pos_i - Pos_c\|^2 / 2\sigma(t)^2\right)$; $Pos_i$ the coordinates of neuron $i$; $\sigma(t)$ the kernel width (decreasing monotonically); $\varepsilon(t)$ the learning rate (decreasing monotonically)
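A minimal NumPy sketch of one SOM training step under the usual assumptions (Gaussian neighborhood kernel, linearly decaying kernel width and learning rate). The grid shape, decay schedules, and initial values here are illustrative choices, not taken from the paper.

```python
import numpy as np

def som_step(W, pos, x, t, T, sigma0=2.0, eps0=0.5):
    """One update of the basic SOM: find the winner c, then pull every
    neuron i toward x(t), weighted by the neighborhood kernel h_ic(t).
    W: (M, d) feature vectors; pos: (M, 2) grid coordinates of the neurons."""
    frac = t / T
    sigma = sigma0 * (1.0 - frac) + 1e-3                 # kernel width, decreasing monotonically
    eps = eps0 * (1.0 - frac) + 1e-4                     # learning rate, decreasing monotonically
    c = np.argmin(np.linalg.norm(W - x, axis=1))         # winning neuron
    h = np.exp(-np.sum((pos - pos[c])**2, axis=1) / (2 * sigma**2))  # h_ic(t)
    W += eps * h[:, None] * (x - W)                      # w_i(t+1) = w_i(t) + eps*h_ic*(x - w_i)
    return W

# toy usage: a 4x4 map trained on random 2D data, 100 epochs as in the paper
rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
W = rng.random((16, 2))
X = rng.random((200, 2))
T = 100 * len(X)
for t in range(T):
    W = som_step(W, grid, X[t % len(X)], t, T)
```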
Self-organizing map and visualization (3/3)
- Because of the neighborhood relations, neighboring neurons are pulled in the same direction
  - Feature vectors of neighboring neurons resemble each other
- The 2D map can be easily visualized and thus gives useful information about the input data
- The usual way to display the cluster structure of the data is a distance matrix, such as the U-matrix
  - The U-matrix method displays the SOM grid according to the distances between neighboring neurons (see the sketch below)
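A sketch of a simplified U-matrix for a rectangular grid: each cell stores the average distance from a neuron's feature vector to those of its 4-connected grid neighbors. This is a common reduced form of the full U-matrix, assumed here for brevity.

```python
import numpy as np

def u_matrix(W, rows, cols):
    """Average distance from each neuron to its 4-connected grid neighbors.
    W is (rows*cols, d), indexed row-major. Plotted as an image, high values
    (large neighbor distances) mark the borders between clusters."""
    W = W.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U
```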
Clustering algorithms (1/3)
- Partitioning clustering
  - Given a database of n objects, construct k partitions (k ≤ n)
- k-means algorithm (see the sketch below)
  - Each cluster is represented by the mean value of the objects in the cluster
- Advantages
  - The clustering is dynamic
  - Some a priori knowledge, such as cluster shapes, can be incorporated into the clustering
- Drawbacks
  - It has difficulty discovering clusters of arbitrary shapes
  - The number of clusters is fixed in advance, and the optimal number of clusters is hard to determine
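To see the arbitrary-shape drawback concretely, a short scikit-learn sketch on data shaped like the paper's 2D synthetic set (three shallow, elongated, parallel clusters); the data generation here is hypothetical, for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# three shallow, elongated, parallel stripes in the 2D plane
X = np.vstack([np.column_stack([rng.uniform(0, 10, 70), rng.normal(y0, 0.2, 70)])
               for y0 in (0.0, 2.0, 4.0)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# k-means tends to cut the stripes crosswise rather than recover them,
# illustrating its difficulty with non-spherical cluster shapes
```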
Clustering algorithms (2/3)
- Hierarchical clustering (see the sketch below)
  - A hierarchical decomposition of the given dataset
  - It can be classified as either agglomerative or divisive
- Advantage
  - It is not affected by initialization and local minima
- Shortcomings
  - It is impractical for large data sets due to its high computational complexity
  - It does not incorporate any a priori knowledge, such as cluster shapes
  - The clustering is static
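A minimal agglomerative example with SciPy; the toy data and the choice of single linkage are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ((0, 0), (3, 0), (0, 3))])
Z = linkage(X, method='single')                    # agglomerative, single-linkage merges
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram at 3 clusters
# The O(n^2) pairwise-distance cost is what makes this impractical for large
# data sets; clustering the SOM's neurons instead keeps n small (two-level approach)
```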
Clustering algorithms (3/3)
- Four definitions of inter-cluster distance: single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), centroid linkage (distance between centroids), and average linkage (average pairwise distance)
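The four definitions, written out for two point sets A and B:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single(A, B):   return cdist(A, B).min()                       # nearest pair
def complete(A, B): return cdist(A, B).max()                       # farthest pair
def average(A, B):  return cdist(A, B).mean()                      # mean over all pairs
def centroid(A, B): return np.linalg.norm(A.mean(0) - B.mean(0))   # centroid distance
```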
Clustering of the SOM
- The two-level approach to clustering of the SOM
  - Different symbols on the map represent different clusters
- Clustering algorithms can be applied to the output neurons of the SOM
- What if the clusters have nonspherical shapes?
  - Partitioning clustering fails (X); hierarchical clustering succeeds (O)
- The extended SOM (minimum distance variance)
Global clustering validity index (1/4)
- Evaluation criteria are needed to justify the correctness of a partition
- The index is based on two widely accepted concepts
  - A cluster's compactness and separation
- The implementation of most validity algorithms is computationally very intensive, especially for very large databases
- Dependent on the data and the number of clusters
- Using the sample mean of each subset vs. all points in each subset
Global clustering validity index (2/4)
- Intra-cluster density
- The intra-cluster density is high for
well-separated clusters
Global clustering validity index (3/4)
- Inter-cluster density
  - The density in the region between clusters should be significantly low
Global clustering validity index (4/4)
- Clusters' separation
- CDbw
  - Composing Density Between and Within clusters (see the simplified sketch below)
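The full CDbw definition in the paper uses multiple representative points per cluster; the sketch below is a deliberately simplified single-representative version that keeps only the structure of the index (intra-cluster density times separation, with separation penalized by inter-cluster density). The radii and counting rules here are assumptions for illustration, not the paper's formulas.

```python
import numpy as np

def density(points, center, radius):
    """Fraction of points within `radius` of `center` (a neighborhood count)."""
    if len(points) == 0:
        return 0.0
    return np.mean(np.linalg.norm(points - center, axis=1) <= radius)

def cdbw_like(X, labels):
    """CDbw-style index: high intra-cluster density and large, sparse gaps
    between clusters both push the value up."""
    ids = np.unique(labels)
    clusters = [X[labels == k] for k in ids]
    centers = [c.mean(axis=0) for c in clusters]
    stdevs = [c.std(axis=0).mean() + 1e-12 for c in clusters]
    # intra-cluster density: high when points pack around their center
    intra = np.mean([density(c, m, s) for c, m, s in zip(clusters, centers, stdevs)])
    # inter-cluster density, measured at midpoints between cluster centers,
    # is intended to be low for a good partition
    sep_terms = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            mid = (centers[i] + centers[j]) / 2
            r = (stdevs[i] + stdevs[j]) / 2
            both = np.vstack([clusters[i], clusters[j]])
            inter = density(both, mid, r)
            d = np.linalg.norm(centers[i] - centers[j])
            sep_terms.append(d / (1.0 + inter))   # separation shrinks if the gap is dense
    sep = np.mean(sep_terms) if sep_terms else 0.0
    return intra * sep                            # compactness x separation
```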
Merging criterion using the CDbw (1/2)
- The inter- and intra-cluster densities are incorporated into the merging criterion, in addition to distance information
- Compute the CDbw for the data belonging to each neighboring pair of clusters
- The merging mechanism finds the pair of clusters with the minimal value of the CDbw (see the sketch below)
  - These two clusters have the strongest tendency to be merged
- The advantage of the merging mechanism is that the clustering result is more accurate, because more information about the individual clusters is considered
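A sketch of the merge selection, reusing the cdbw_like function from the earlier snippet. The neighbor test is abstracted into a caller-supplied predicate, since on the SOM it depends on grid adjacency; both names are illustrative.

```python
import numpy as np

def pick_merge_pair(X, labels, cluster_ids, are_neighbors, cdbw):
    """Among directly neighboring cluster pairs, return the pair whose
    two-cluster CDbw is minimal, i.e. the pair that least resembles two
    well-separated clusters and so should be merged first."""
    best, best_val = None, np.inf
    for i, a in enumerate(cluster_ids):
        for b in cluster_ids[i + 1:]:
            if not are_neighbors(a, b):
                continue                          # only direct neighbors may merge
            mask = np.isin(labels, (a, b))
            val = cdbw(X[mask], labels[mask])     # CDbw on just this pair's data
            if val < best_val:
                best, best_val = (a, b), val
    return best
```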
Merging criterion using the CDbw (2/2)
(figure only)
Preprocessing before clustering of the SOM (1/2)
- Neurons with no input data assigned
  - Not included in the next clustering steps
- Compute $dev_j = \|w_j - m_j\|$, and the mean (mean_dev) and standard deviation (std_dev) of the $dev_j$
  - $w_j$: the feature vector of neuron j; $m_j$: the corresponding mean vector
- If $dev_j$ > mean_dev + std_dev
  - Exclude neuron j from the clustering later on
- This mechanism can filter out the input outliers
Preprocessing before clustering of the SOM (2/2)
- Compute $dis_j(x_i) = \|x_i - w_j\|$, and mean_dis_j and std_dis_j over the inputs assigned to neuron j
  - $x_i$: input vector
- If $\|x_i - w_j\|$ > mean_dis_j + std_dis_j
  - Filter out the input vector $x_i$ for the next clustering steps
- This can filter out the input outliers and noise
- Compute the number of data belonging to the jth cluster, num_j
- Compute the statistics mean_num and std_num over all the num_j
- If num_j < mean_num - std_num
  - Exclude neuron j from the clustering later on
- This can filter out the input noise
(The three filters above are sketched in code below.)
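A sketch of the three filters on the two preprocessing slides. The one-standard-deviation thresholds follow the slides' formulas as reconstructed above; the function signature (in particular passing the mean vectors m_j as an array M) is an assumption.

```python
import numpy as np

def preprocess(W, M, X, bmu):
    """Filter neurons and inputs before clustering the SOM.
    W: (m, d) neuron feature vectors; M: (m, d) the mean vectors m_j;
    X: (n, d) inputs; bmu: (n,) index of the neuron each input is assigned to."""
    # 1) drop deviant neurons: dev_j = ||w_j - m_j|| > mean_dev + std_dev
    dev = np.linalg.norm(W - M, axis=1)
    keep_neuron = dev <= dev.mean() + dev.std()
    # 2) drop outlying inputs: ||x_i - w_j|| > mean_dis_j + std_dis_j
    dis = np.linalg.norm(X - W[bmu], axis=1)
    keep_input = np.ones(len(X), dtype=bool)
    for j in range(len(W)):
        mask = bmu == j
        if mask.any():
            keep_input[mask] = dis[mask] <= dis[mask].mean() + dis[mask].std()
    # 3) drop underpopulated neurons: num_j < mean_num - std_num
    num = np.bincount(bmu, minlength=len(W))
    keep_neuron &= num >= num.mean() - num.std()
    # neurons with no input data assigned are excluded as well
    keep_neuron &= num > 0
    return keep_neuron, keep_input
```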
Clustering of the SOM (1/2)
- After the preprocessing for clustering of the SOM, some neurons and some input data are excluded
- The remaining neurons and input data can be hierarchically clustered
- Rectangular grids are used for the SOM
- The merging process happens only for neighboring clusters, which means the neurons belonging to the pair of clusters are direct neighbors
Clustering of the SOM (2/2)
(a) Neuron A has eight direct neighboring neurons B-I
(b) Multi-neuron neighboring clusters 1 and 2 can be merged into one cluster because the two clusters are direct neighbors
(c) Multi-neuron clusters 1 and 2 cannot be merged into one cluster because the two clusters are not direct neighbors
The algorithm of clustering of the SOM
- Train the SOM on the input data
- Preprocess before clustering of the SOM
- Cluster the SOM using agglomerative hierarchical clustering
  - The merging criterion is based on the CDbw for all pairs of directly neighboring clusters
- Find the optimal partition of the input data
  - According to the CDbw for all the input data as a function of the number of clusters
(A schematic sketch of the merging loop follows.)
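A schematic sketch of the agglomerative loop, assuming the illustrative helpers from the earlier snippets (pick_merge_pair as `merge`, cdbw_like as `cdbw`); none of these names come from the paper.

```python
import numpy as np

def cluster_som(X, labels, are_neighbors, cdbw, merge):
    """Agglomerative clustering of the (preprocessed) SOM neurons' data.
    Start from one cluster per surviving neuron, repeatedly merge the
    neighboring pair with minimal pairwise CDbw, and keep the partition
    whose global CDbw over all input data is maximal."""
    history = []
    while len(np.unique(labels)) > 1:
        ids = list(np.unique(labels))
        pair = merge(X, labels, ids, are_neighbors, cdbw)   # min pairwise CDbw
        if pair is None:
            break                                           # no neighboring pairs remain
        labels = np.where(labels == pair[1], pair[0], labels)
        history.append((cdbw(X, labels), labels.copy()))    # global index per partition
    if not history:
        return labels
    best_score, best_labels = max(history, key=lambda h: h[0])
    return best_labels
```

In a full implementation the neighbor predicate would be updated as clusters merge (two clusters are neighbors when any of their neurons are direct grid neighbors); that bookkeeping is abstracted away here.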
Experimental results
- To demonstrate the effectiveness of the proposed clustering algorithm, four data sets were used in the experiments
- The input data are normalized so that the value of each datum in each dimension lies in [0, 1] (see the snippet below)
- For training the SOM, the authors used 100 training epochs, with the learning rate decreasing from 1 to 0.0001
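The per-dimension [0, 1] normalization is the standard min-max scaling; a minimal sketch:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each dimension of X linearly into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant dimensions
```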
The synthetic data set in the 2D plane
- 2D data set (200 points)
  - Three shallow, elongated, parallel clusters in the 2D plane
  - Some noise and outliers
- Preprocessing
- Using the SOM algorithm
- k-means, four different hierarchical clustering algorithms, and the proposed algorithm were used to cluster the SOM
Three partitions of the synthetic data set
(a) the proposed algorithm
(b) k-means
(c) single-linkage
(d) complete-linkage
(e) centroid-linkage
(f) average-linkage
2D synthetic data set
- Three clusters for the synthetic data set are displayed on the map by the proposed algorithm or the single-linkage clustering algorithm on the SOM (SOM map sizes of 4×4 and 6×6)
- The CDbw as a function of the number of clusters for the synthetic data set
Iris data set (1/3)
- Iris data
  - Widely used in pattern classification
  - 150 data points of four dimensions
  - Three classes with 50 points each
  - The first class is linearly separable from the other two
  - The other two classes overlap to some extent and are not linearly separable
Iris data set (2/3)
- Two clusters for the iris data set are identified by (a) the proposed algorithm and (b) the single-linkage clustering algorithm on the SOM (SOM map size of 4×4)
- For the known three classes, three clusters are formed (SOM map size of 4×4) for the iris data set
- The CDbw as a function of the number of clusters for the iris data
Iris data set (3/3)
- Performance comparison of different clustering
algorithms for the iris data set
15D synthetic data set (1/2)
- The statistical information of 20 clusters for
the 15D synthetic data set
15D synthetic data set (2/2)
- Twenty clusters for the 15D synthetic data set are displayed on the map by the proposed algorithm on the SOM (SOM map size of 8×8)
Wine data set (1/2)
- 13D data with three known classes (178 points: 59, 71, and 48)
- Three clusters for the wine data set are displayed on the map by the proposed clustering algorithm on the SOM (SOM map size of 4×4)
- The CDbw as a function of the number of clusters for the wine data set by the proposed algorithm on the SOM
Wine data set (2/2)
- Performance comparison of different clustering
algorithms for the wine data set
Conclusion (1/2)
- A new SOM-based clustering algorithm
  - Uses the clustering validity index locally to determine which pair of clusters to merge
  - The optimal number of clusters can be determined by the maximum value of the CDbw, the clustering validity index computed globally for all input data
- Compared with classical clustering methods on the SOM, the proposed algorithm utilizes more information about the data in each cluster, in addition to inter-cluster distances
- The proposed algorithm clusters data better than the classical clustering algorithms on the SOM
Conclusion (2/2)
- Preprocessing steps filter out noise and outliers
  - They increase the accuracy and robustness of clustering of the SOM
- The experimental results demonstrate that the proposed clustering algorithm is better than the other clustering algorithms on the SOM