A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets

Learn more at: http://www.adbis.org

1
A Clustering Framework for Unbalanced
Partitioning and Outlier Filtering on High
Dimensional Datasets

Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
(1) Department of Computer Engineering, Maltepe University
ttbilgin_at_maltepe.edu.tr
(2) Department of Computer and Control Education, Marmara University
camurcu_at_marmara.edu.tr

2
Outline

Introduction
Relationship-based clustering approach / framework
Visualization using CLUSION (CLUSter visualizatION)
Problems of the Framework
Graclus partitioning system
Our Proposed Framework
Using Graclus to create Micro-partition Space
Outlier filtering on micro-partition space
Using Graclus to cluster the filtered P Space
Visualization of the results using CLUSION graphs
Experiments
Results

3
Introduction

Mining high-dimensional datasets is an important problem for the data mining community
Well-known problem: the curse of dimensionality
Graph-based methods such as METIS and CHACO perform best in high-dimensional spaces
However, these methods have two major problems:
They cannot perform outlier filtering
They force clusters to be balanced

4
Relationship based Clustering Approach

Strehl A. and Ghosh J. proposed a better approach for mining high-dimensional datasets [1].
They focus on the similarity space rather than the feature space.
A graph partitioning tool, METIS, is used to perform balanced clustering (OPOSSUM).
They also provide a customized matrix visualization tool called CLUSION.
CLUSION is fast and simple, and it can operate on very high-dimensional datasets.

5
Relationship based Clustering Framework
[Figure: Data Sources -> Feature Extraction -> Feature Space -> Similarity computation -> Similarity Space -> OPOSSUM (Optimal partitioning of Similarity space using Metis) -> Cluster Labels]
6
Visualization using CLUSION

Clusters appear as symmetrical dark squares across the main diagonal

CLUSION: the similarity matrix S is permuted with an n x n permutation matrix P
[Figure labels: Similarity Matrix, λ index, Cluster Visualization]
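
The permutation idea on this slide can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the CLUSION tool itself: cosine similarity is chosen here for convenience (the slides do not say which similarity measure is used), and the helper names `cosine_similarity_matrix` and `permute_by_labels` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the n x n cosine similarity matrix for the rows of X (assumed measure)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

def permute_by_labels(S, labels):
    """Reorder rows and columns of S so that equal labels become contiguous."""
    order = np.argsort(labels, kind="stable")  # stable: keeps original order inside a cluster
    return S[np.ix_(order, order)], order

# Toy data: samples 0 and 2 point along one axis, samples 1 and 3 along the other.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 1, 0, 1])

S = cosine_similarity_matrix(X)
S_perm, order = permute_by_labels(S, labels)
```

Indexing with `np.ix_(order, order)` applies the same reordering to rows and columns, which has the same effect as multiplying S by a permutation matrix on the left and its transpose on the right. After the reordering, each cluster shows up as a block of high similarities on the main diagonal, which is what a CLUSION plot renders as a dark square.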
7
Problems of the Framework

Produces balanced clusters only
It forces clusters to be of equal size. In some datasets this can be important, because it avoids trivial clusterings, but in most cases it can cause undesired results.
No outlier filtering
Outliers can reduce the quality and the validity of the clusters, depending on the resolution and distribution of the dataset.

8
Graclus partitioning system

Graclus is a fast kernel-based multilevel algorithm with coarsening, initial partitioning, and refinement phases.
Unlike METIS, it does not force clusters to be of nearly equal size.
It uses a weighted form of the kernel k-means approach.
The kernel k-means approach is extremely fast and gives high-quality partitions (*)
(*) Dhillon, I., Guan, Y., Kulis, B.: A Fast Kernel-based Multilevel Algorithm for Graph Clustering, Proceedings of the 11th ACM SIGKDD, Chicago, IL, August 21-24 (2005).
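
To make the "weighted kernel k-means" step more concrete, here is a minimal sketch of the batch update only. It is not Graclus: the multilevel coarsening and refinement phases are left out, and the function name `weighted_kernel_kmeans`, the initial labels, and the toy linear kernel are assumptions made for illustration. Each point is assigned to the cluster whose weighted mean is closest in kernel space, using only entries of the kernel matrix K.

```python
import numpy as np

def weighted_kernel_kmeans(K, weights, init_labels, k, max_iter=100):
    """Batch weighted kernel k-means on an n x n kernel matrix K.

    Returns an array of cluster labels in range(k).
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(weights, dtype=float)
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            wc = w[members]
            s = wc.sum()
            if s == 0.0:
                continue  # empty cluster: its distance stays at infinity
            # cross term: sum_j w_j K_ij / s, for every point i
            cross = K[:, members] @ wc / s
            # within term: sum_{j,l} w_j w_l K_jl / s^2 (identical for all points)
            within = wc @ K[np.ix_(members, members)] @ wc / (s * s)
            # squared distance from each point to the weighted cluster mean
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels

# Toy example: six 1-D points in two groups, linear kernel K = x x^T, unit weights.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
K = np.outer(x, x)
result = weighted_kernel_kmeans(K, np.ones(6), [0, 1, 0, 1, 0, 1], k=2)
```

With a linear kernel and unit weights this reduces to ordinary k-means, which makes the toy run easy to check by hand. Nothing in the update constrains cluster sizes, which is consistent with the slide's point that Graclus does not force clusters to be balanced.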

9
Our Proposed Framework

Three major improvements:
An intermediate space (P): we call it the micro-partition space. Graclus is used for creating unbalanced micro-partitions.
Outlier filtering on the P space (resulting in the filtered P space): Graclus creates micro-partitions of different sizes. Singletons in the P space are points that do not have enough neighbors, so they can be filtered out or marked as outliers.
Using Graclus for clustering the filtered P space: Graclus has two important roles in our framework. The first role is creating the micro-partition space. The second role is unbalanced clustering of the filtered P space, which is denoted by F.

10
Our Proposed Framework
[Figure: creating micro-partitions (using Graclus) gives the micro-partition space (P), which contains unbalanced tiny partitions; outlier filtering and (re)clustering (using Graclus) result in the filtered P space]
11
Using Graclus to create Micro-partition Space

Use Graclus in the similarity space to create tiny partitions (micro-partitions)
Notation:
n: number of samples
k: number of micro-partitions in the P space
The relation between k and n should be:
[Formula (1) is not preserved in this transcript]
Micro-partitions can contain up to 4 objects, therefore:
[Formula (2) is not preserved in this transcript]

12
Outlier filtering on micro-partition space

[Figure: illustration of outlier filtering on the micro-partition space]

13
Outlier filtering on micro-partition space

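The set of outliers in the P space (Po) is:
[Formula not preserved in this transcript]
where To is the outlier threshold value.
Then the filtered P space is:
[Formula not preserved in this transcript]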
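
The exact formulas for the outlier set and the filtered space are not available in this transcript, so the following is only a guess at the general idea, based on the earlier remark that singleton micro-partitions can be filtered or marked as outliers. The rule used here (a point is an outlier when its micro-partition holds at most `threshold` points) and the name `filter_outliers` are assumptions, not the authors' definition.

```python
import numpy as np

def filter_outliers(micro_labels, threshold=1):
    """Split point indices into kept points and outliers.

    Assumed rule: a point is an outlier when its micro-partition contains
    at most `threshold` points (threshold=1 flags singletons only).
    """
    micro_labels = np.asarray(micro_labels)
    _, inverse, counts = np.unique(
        micro_labels, return_inverse=True, return_counts=True
    )
    sizes = counts[inverse]            # size of the micro-partition of each point
    outlier_mask = sizes <= threshold
    kept = np.flatnonzero(~outlier_mask)
    outliers = np.flatnonzero(outlier_mask)
    return kept, outliers

# Micro-partition label of each of 7 points; partitions 1 and 3 are singletons.
micro_labels = [0, 0, 1, 2, 2, 2, 3]
kept, outliers = filter_outliers(micro_labels, threshold=1)
```

The `kept` indices would then select the rows and columns of the similarity matrix that are passed on to the second Graclus run.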

14
Using Graclus to cluster the filtered P Space

Graclus needs the number of partitions, k.
In formula (1), k refers to the number of micro-partitions.
Here, k refers to the number of clusters we desire.
We denote the former by k1 and the latter by k2.
Graclus performs clustering on the filtered P space and produces the λ index, which is defined as:
[Formula not preserved in this transcript]

15
Visualization of the results using CLUSION graphs

CLUSION looks at the λ index,
reorders the filtered P space so that points with the same cluster label are contiguous,
then visualizes the resulting permuted space.
Two λ indices are produced during the clustering process:
λ1 is created while forming micro-partitions
λ2 is created while clustering the filtered P space
We use λ2 for CLUSION; λ1 is only used for forming micro-partitions.

16
Experiments: Datasets

We evaluated our proposed framework on two different real-world datasets.
BBC: 9636 terms from 2225 complete news articles from the BBC News web site (2225-dimensional dataset, 5 natural clusters).
Milliyet: a collection of news articles from the Turkish newspaper Milliyet, containing 6223 Turkish terms from 1455 news articles (1455-dimensional dataset, 3 natural clusters).

17
Experiments: Evaluated Frameworks

OPOSSUM: Strehl & Ghosh's original METIS-based framework.
SG(Graclus): we replaced METIS with Graclus in Strehl & Ghosh's framework to test the quality of the clusters produced by the Graclus algorithm.
P space + Graclus: our proposed framework.

18
Experiments: Comparison Criteria

Purity
Entropy
Mutual Information
CLUSION graphics (visual identification, visual data mining)
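
For readers unfamiliar with the first two criteria, the sketch below computes purity and entropy from true class labels and cluster labels. The slides do not give the formulas, so commonly used definitions are assumed here (purity as the fraction of points in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy in bits); the authors may use normalized variants. Mutual information is left out of the sketch, and the function names are hypothetical.

```python
import numpy as np

def contingency(true_labels, cluster_labels):
    """Count table with one row per cluster and one column per true class."""
    classes, t = np.unique(true_labels, return_inverse=True)
    clusters, c = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(table, (c, t), 1)  # accumulate counts, including repeated pairs
    return table

def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    table = contingency(true_labels, cluster_labels)
    return table.max(axis=1).sum() / table.sum()

def entropy(true_labels, cluster_labels):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    table = contingency(true_labels, cluster_labels)
    n = table.sum()
    total = 0.0
    for row in table:
        size = row.sum()
        p = row[row > 0] / size          # class proportions, zero counts skipped
        total += (size / n) * -(p * np.log2(p)).sum()
    return total

true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]
pur = purity(true_labels, cluster_labels)
ent = entropy(true_labels, cluster_labels)
```

Higher purity and lower entropy both indicate clusters that agree better with the natural classes; a perfect clustering gives purity 1 and entropy 0 under these definitions.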

19
Results: BBC Dataset
20
Results: BBC Dataset, OPOSSUM
21
Results: BBC Dataset, SG(Graclus)
22
Results: BBC Dataset, P space + Graclus
23
Results: Milliyet Dataset
24
Results: Milliyet Dataset, OPOSSUM
25
Results: Milliyet Dataset, SG(Graclus)
26
Results: Milliyet Dataset, P space + Graclus
27
Thank You! Presenter: T. Tugay Bilgin
