Title: A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets
1A Clustering Framework for Unbalanced
Partitioning and Outlier Filtering on High
Dimensional Datasets
- Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
- (1) Department of Computer Engineering, Maltepe University - ttbilgin_at_maltepe.edu.tr
- (2) Department of Computer and Control Education, Marmara University - camurcu_at_marmara.edu.tr
2Outline
- Introduction
- Relationship based clustering approach / framework
- Visualization using CLUSION (CLUSter visualizatION)
- Problems of the Framework
- Graclus partitioning system
- Our Proposed Framework
- Using Graclus to create Micro-partition Space
- Outlier filtering on micro-partition space
- Using Graclus to cluster ?P Space
- Visualization of the results using CLUSION graphs
- Experiments
- Results
3Introduction
- Mining high dimensional datasets is an important problem for the data mining community
- Well-known problem: the curse of dimensionality
- Graph based methods such as METIS and CHACO perform best on high dimensional space
- However, these methods have two major problems:
- They cannot perform outlier filtering
- They force clusters to be balanced
4Relationship based Clustering Approach
- Strehl A. and Ghosh J. proposed a better approach for mining high dimensional datasets [1].
- They focus on similarity space rather than feature space.
- A graph partitioning tool, METIS, is used to perform balanced clustering (OPOSSUM).
- They also provide a customized matrix visualization tool called CLUSION.
- CLUSION is fast and simple, and it can operate on very high dimensional datasets.
5Relationship based Clustering Framework
[Figure: Data Sources -> Feature Extraction -> Feature Space -> Similarity computation -> Similarity Space -> OPOSSUM (Optimal Partitioning of Sparse Similarities Using Metis) -> Cluster Labels]
6Visualization using CLUSION
- Clusters appear as symmetrical dark squares across the main diagonal
- CLUSION: S is permuted with an n x n permutation matrix P
[Figure: Similarity Matrix and λ index -> CLUSION -> Cluster Visualization]
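The permutation step can be illustrated with a minimal sketch. It assumes the similarity matrix is a plain list of lists and that each point carries an integer cluster label; the function name `permute_by_labels` and the toy data are hypothetical and are not part of CLUSION itself. Applying the permutation matrix P on both sides (P S P^T) amounts to reordering rows and columns with the same index order.

```python
def permute_by_labels(S, labels):
    """Reorder a square similarity matrix so same-label points are contiguous.

    Equivalent to S' = P S P^T, where row i of the permutation matrix P
    selects original point order[i].
    """
    n = len(labels)
    # sorted() is stable, so the original order is kept inside each cluster.
    order = sorted(range(n), key=lambda i: labels[i])
    S_perm = [[S[i][j] for j in order] for i in order]
    return S_perm, order


# Toy data (hypothetical): points 0 and 2 are similar, points 1 and 3 are similar.
S = [
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]
S_perm, order = permute_by_labels(S, labels)
```

After reordering, the high similarities sit in two blocks along the main diagonal, which is what shows up as dark squares when the matrix is drawn as an image.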
7Problems of the Framework
- Produces balanced clusters only
- It forces clusters to be of equal size. In some datasets this could be important, because it avoids trivial clusterings. But in most cases, it can cause undesired results.
- No outlier filtering
- Outliers can reduce the quality and the validity of the clusters depending on the resolution and distribution of the dataset.
8Graclus partitioning system
- Graclus is a fast kernel based multilevel algorithm which involves coarsening, initial partitioning and refinement phases.
- Unlike METIS, it does not force clusters to be of nearly equal size.
- Uses a weighted form of the kernel based k-means approach
- The kernel k-means approach is extremely fast and gives high-quality partitions (*)
- (*) Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering, Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21-24 (2005).
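To make the weighted kernel k-means idea concrete, here is a minimal sketch of the flat (single-level) batch update on a precomputed kernel matrix. It is not Graclus: the multilevel coarsening and refinement phases are omitted, and the function name, the toy kernel and the initial labels are all assumptions made for illustration. The squared distance from point i to cluster c is computed as K_ii - 2 * sum_j(w_j K_ij) / s_c + sum_jl(w_j w_l K_jl) / s_c^2, where s_c is the total weight of cluster c.

```python
def weighted_kernel_kmeans(K, weights, labels, n_clusters, max_iter=100):
    """Batch weighted kernel k-means on a precomputed kernel matrix K.

    K: symmetric n x n list of lists; weights: positive point weights;
    labels: initial assignment, a list of ints in range(n_clusters).
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[j for j in range(n) if labels[j] == c]
                   for c in range(n_clusters)]
        mass = [sum(weights[j] for j in m) for m in members]
        # Third term of the distance: weighted kernel sum inside each cluster.
        within = []
        for c, m in enumerate(members):
            total = sum(weights[j] * weights[l] * K[j][l] for j in m for l in m)
            within.append(total / mass[c] ** 2 if m else 0.0)
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # an empty cluster cannot attract points
                cross = sum(weights[j] * K[i][j] for j in m) / mass[c]
                d = K[i][i] - 2.0 * cross + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Toy data (hypothetical): linear kernel of four 1-D points, unit weights.
x = [0.0, 0.1, 5.0, 5.1]
K = [[a * b for b in x] for a in x]
result = weighted_kernel_kmeans(K, [1.0] * 4, [0, 1, 0, 1], n_clusters=2)
```

Nothing in the update constrains cluster sizes, which is why this family of methods can return unbalanced partitions.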
9Our Proposed Framework
- Three major improvements
- An intermediate space (P): We call it the micro-partition space. Graclus is used for creating unbalanced micro-partitions.
- Outlier filtering on the P space (results in ?P): Graclus creates micro-partitions of different sizes. Singletons in the P space are points that do not have enough neighbors; they can be filtered or marked as outliers.
- Using Graclus for clustering the ?P space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered space ?P, which is denoted by F.
10Our Proposed Framework
[Figure: the micro-partition space (P), which contains unbalanced tiny partitions, is created using Graclus; outlier filtering and (re)clustering using Graclus then result in the ?P space]
11Using Graclus to create Micro-partition Space
- Use Graclus in Similarity Space to create tiny partitions (micro-partitions)
- Notation:
- n: number of samples
- k: number of micro-partitions on P space
- The relation between k and p should be: [Equation (1)]
- Micro-partitions can contain up to 4 objects, therefore: [Equation (2)]
12Outlier filtering on micro-partition space
13Outlier filtering on micro-partition space
- Outliers in P space (Po) are: [Equation]
- where To is the outlier threshold value
- Then, the ?P space is: [Equation]
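A minimal sketch of the filtering idea, under a stated assumption: a point is treated as an outlier when its micro-partition contains at most `threshold` points, so `threshold=1` flags singletons. This rule is inferred from the description of singletons and the threshold To; it is not the slide's exact formula, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def filter_outliers(micro_labels, threshold=1):
    """Split point indices into kept points and outliers.

    Assumed rule: a point is an outlier when its micro-partition holds
    `threshold` or fewer points (threshold=1 flags singletons).
    """
    sizes = Counter(micro_labels)
    kept, outliers = [], []
    for idx, label in enumerate(micro_labels):
        if sizes[label] <= threshold:
            outliers.append(idx)
        else:
            kept.append(idx)
    return kept, outliers


# Toy micro-partition labels (hypothetical): partitions 1 and 3 are singletons.
micro_labels = [0, 0, 1, 2, 2, 2, 3, 0]
kept, outliers = filter_outliers(micro_labels)
```

The `kept` indices would then select the rows and columns of the similarity matrix that are passed to the second clustering step.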
14Using Graclus to cluster ?P Space
- Graclus needs the number of partitions k.
- In formula 1, k refers to the number of micro-partitions.
- Here k refers to the number of clusters we desire.
- We denote the former by k1 and the latter by k2.
- Graclus performs clustering on the ?P space and produces the λ index, which is defined as: [Equation]
15Visualization of the results using CLUSION graphs
- CLUSION looks at the λ index,
- reorders the ?P space so that points with the same cluster label are contiguous,
- then visualizes the resulting permuted ?P'
- There are two λ indices produced during the clustering process.
- λ1 is created while forming micro-partitions
- λ2 is created while clustering the ?P space
- We use λ2 for CLUSION; the first one is only used for forming micro-partitions
16Experiments Datasets
- We evaluated our proposed framework on two different real world datasets.
- 9636 terms from 2225 complete news articles from the BBC News web site. (2225 dimensional dataset, 5 natural clusters)
- Collection of news articles from the Turkish newspaper Milliyet. Contains 6223 terms in Turkish from 1455 news articles. (1455 dimensional dataset, 3 natural clusters)
17Experiments Evaluated Frameworks
- OPOSSUM: Strehl & Ghosh's original METIS based framework
- SG(Graclus): We replaced METIS with Graclus in Strehl & Ghosh's framework to test the quality of the clusters produced by the Graclus algorithm.
- P space + Graclus: Our proposed framework.
18Experiments Comparison Criteria
- Purity
- Entropy
- Mutual Information
- CLUSION graphics (visual identification, visual data mining)
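The three numeric criteria can be sketched with common textbook definitions. These are assumptions: the normalization used in the experiments may differ (for example, a normalized mutual information), and the function names and toy labels are hypothetical. Here purity is the fraction of points that belong to the majority class of their cluster, entropy is the size-weighted average class entropy per cluster in bits, and mutual information is computed between cluster labels and class labels in bits.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of points that fall in the majority class of their cluster."""
    pairs = Counter(zip(clusters, classes))
    best = {}
    for (c, _), count in pairs.items():
        best[c] = max(best.get(c, 0), count)
    return sum(best.values()) / len(clusters)


def cluster_entropy(clusters, classes):
    """Size-weighted average of the class entropy inside each cluster (bits)."""
    n = len(clusters)
    sizes = Counter(clusters)
    pairs = Counter(zip(clusters, classes))
    total = 0.0
    for (c, _), count in pairs.items():
        p = count / sizes[c]
        total -= (sizes[c] / n) * p * math.log2(p)
    return total


def mutual_information(clusters, classes):
    """Mutual information between cluster labels and class labels (bits)."""
    n = len(clusters)
    sizes = Counter(clusters)
    class_sizes = Counter(classes)
    pairs = Counter(zip(clusters, classes))
    mi = 0.0
    for (c, k), count in pairs.items():
        mi += (count / n) * math.log2(count * n / (sizes[c] * class_sizes[k]))
    return mi


# Toy labels (hypothetical): cluster 0 mixes two classes, cluster 1 is pure.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, classes)
e = cluster_entropy(clusters, classes)
mi = mutual_information(clusters, classes)
```

Higher purity and mutual information indicate better agreement with the natural classes, while lower entropy is better.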
19Results BBC Dataset
20Results BBC Dataset OPOSSUM
21Results BBC Dataset SG(Graclus)
22Results BBC Dataset P space + Graclus
23Results Milliyet Dataset
24Results Milliyet Dataset OPOSSUM
25Results Milliyet Dataset SG(Graclus)
26Results Milliyet Dataset P space + Graclus
27Thank You! Presenter: T. Tugay Bilgin