A Clustering Framework for Unbalanced Partitioning and Outlier Filtering on High Dimensional Datasets

Learn more at: http://www.adbis.org
1
A Clustering Framework for Unbalanced
Partitioning and Outlier Filtering on High
Dimensional Datasets
  • Turgay Tugay Bilgin (1) and A. Yilmaz Camurcu (2)
  • (1) Department of Computer Engineering, Maltepe
    University
  • ttbilgin_at_maltepe.edu.tr
  • (2) Department of Computer and Control
    Education, Marmara University
  • camurcu_at_marmara.edu.tr

2
Outline
  • Introduction
  • Relationship-based clustering approach and
    framework
  • Visualization using CLUSION (CLUSter
    visualizatION)
  • Problems of the Framework
  • Graclus partitioning system
  • Our Proposed Framework
  • Using Graclus to create Micro-partition Space
  • Outlier filtering on micro-partition space
  • Using Graclus to cluster ?P Space
  • Visualization of the results using CLUSION graphs
  • Experiments
  • Results

3
Introduction
  • Mining high-dimensional datasets is an important
    problem for the data mining community
  • Well-known problem: the curse of dimensionality
  • Graph-based methods such as METIS and CHACO
    perform best in high-dimensional spaces
  • However, these methods have two major problems:
  • They cannot perform outlier filtering
  • They force clusters to be balanced

4
Relationship based Clustering Approach
  • A. Strehl and J. Ghosh proposed a better approach
    for mining high-dimensional datasets [1].
  • They focus on similarity space rather than
    feature space.
  • The graph partitioning tool METIS is used to
    perform balanced clustering (OPOSSUM).
  • They also provide a customized matrix
    visualization tool called CLUSION.
  • CLUSION is fast and simple, and it can operate on
    very high-dimensional datasets.

5
Relationship based Clustering Framework
  • Pipeline (from the slide diagram): Data Sources →
    Feature Extraction → Feature Space → Similarity
    Computation → Similarity Space → OPOSSUM →
    Cluster Labels
  • OPOSSUM: Optimal partitioning of Similarity space
    using Metis
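
The step from feature space to similarity space can be illustrated with a minimal sketch. It assumes cosine similarity between term-count vectors, which is only one possible measure; the slides do not say which measure was used. The names `cosine` and `similarity_matrix` and the toy counts are illustrative, not taken from the presentation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(X):
    """Map n feature vectors to an n x n symmetric similarity matrix S."""
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            S[i][j] = S[j][i] = cosine(X[i], X[j])
    return S

# Toy feature space: 3 documents over 4 terms (made-up counts).
X = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 3, 1],
]
S = similarity_matrix(X)
```

Whatever the number of terms, S has one row and one column per sample, so every later step works on an n x n matrix.
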
6
Visualization using CLUSION
  • Clusters appear as symmetrical dark squares
    along the main diagonal
  • Diagram: Similarity Matrix → CLUSION → Cluster
    Visualization
  • CLUSION: S is permuted with an n × n permutation
    matrix P, guided by the λ index
7
Problems of the Framework
  • Produces balanced clusters only
  • It forces clusters to be of equal size. In
    some datasets this can be important because it
    avoids trivial clusterings, but in most cases it
    can cause undesired results.
  • No outlier filtering
  • Outliers can reduce the quality and validity of
    the clusters, depending on the resolution and
    distribution of the dataset.

8
Graclus partitioning system
  • Graclus is a fast kernel-based multilevel
    algorithm that involves coarsening, initial
    partitioning, and refinement phases.
  • Unlike METIS,
  • it does not force clusters to be nearly equal in
    size.
  • It uses a weighted form of the kernel-based
    k-means approach.
  • The kernel k-means approach is extremely fast and
    gives high-quality partitions (*)
  • (*) Dhillon, I., Guan, Y., Kulis, B.: A Fast
    Kernel-based Multilevel Algorithm for Graph
    Clustering. Proceedings of the 11th ACM SIGKDD,
    Chicago, IL, August 21-24 (2005).
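
To make the kernel k-means idea concrete, here is a minimal sketch of plain, unweighted kernel k-means run directly on a similarity (kernel) matrix. It is not Graclus: it omits the coarsening and initial-partitioning phases and the vertex weights, and the function name `kernel_kmeans` and the toy matrix are assumptions made for illustration only.

```python
def kernel_kmeans(K, labels, k, max_iter=100):
    """Plain (unweighted) kernel k-means on an n x n kernel matrix K.

    `labels` is an initial assignment with values in range(k); a new list
    of labels is returned. Reading d below as a squared distance assumes
    K is positive semidefinite.
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Mean pairwise kernel value inside each cluster (third distance term).
        within = [
            sum(K[a][b] for a in m for b in m) / (len(m) ** 2) if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # an empty cluster is never chosen
                # Squared distance to the centroid of cluster c in kernel space:
                # K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels

# Toy similarity matrix: points 0-2 form one group, points 3-4 another.
K = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
result = kernel_kmeans(K, labels=[0, 0, 0, 0, 1], k=2)
```

Nothing in the update rule constrains cluster sizes, which is the property the slide contrasts with METIS: starting from a 4/1 split, the toy run settles on clusters of 3 and 2 points.
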

9
Our Proposed Framework
  • Three major improvements:
  • An intermediate space (P), which we call the
    micro-partition space. Graclus is used to create
    unbalanced micro-partitions.
  • Outlier filtering on the P space (resulting in
    ?P). Graclus creates micro-partitions of
    different sizes. Singletons in the P space are
    points that do not have enough neighbors; they
    can be filtered out or marked as outliers.
  • Using Graclus to cluster the ?P space. Graclus
    has two important roles in our framework. The
    first is creating the micro-partition space.
    The second is unbalanced clustering of the
    filtered space ?P, which is denoted by F.

10
Our Proposed Framework
  • Micro-partition space (P): contains unbalanced
    tiny partitions
  • Step 1: creating micro-partitions (using Graclus)
  • Step 2: outlier filtering and (re)clustering
    (using Graclus), which results in the ?P space
11
Using Graclus to create Micro-partition Space
  • Use Graclus in similarity space to create tiny
    partitions (micro-partitions)
  • Notation:
  • n: number of samples
  • k: number of micro-partitions in the P space
  • The relation between k and p should be
    (1) [equation not preserved in the transcript]
  • Micro-partitions can contain up to 4 objects,
    therefore
    (2) [equation not preserved in the transcript]
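
The two numbered relations on this slide did not survive in the transcript. One consequence of the stated size cap can still be worked out by counting: if every micro-partition holds at most 4 objects and all n samples are assigned, at least ceil(n / 4) micro-partitions are needed. The sketch below computes that lower bound; it is not necessarily the slide's own formula, and the helper name `min_micro_partitions` is an assumption.

```python
def min_micro_partitions(n, max_size=4):
    """Smallest k such that k partitions of at most max_size objects hold n objects."""
    if n < 0 or max_size < 1:
        raise ValueError("n must be >= 0 and max_size must be >= 1")
    return -(-n // max_size)  # integer ceiling division

# Sample counts taken from the datasets slide (number of news articles).
for name, n in [("BBC", 2225), ("Milliyet", 1455)]:
    print(name, n, min_micro_partitions(n))
```

For the two datasets used in the experiments this gives at least 557 micro-partitions for BBC and at least 364 for Milliyet.
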

12
Outlier filtering on micro-partition space
  • illustration

13
Outlier filtering on micro-partition space
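  • Outliers in the P space (Po) are defined as
    [equation not preserved in the transcript]
  • where To is the outlier threshold value
  • Then, the ?P space is
    [equation not preserved in the transcript]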
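
The formulas for the outlier set and the filtered space are missing from the transcript, so the following is only a sketch under a stated assumption: a point counts as an outlier when its micro-partition contains at most To points, so that To = 1 flags exactly the singletons mentioned earlier in the talk. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def filter_outliers(micro_labels, threshold=1):
    """Split point indices into (kept, outliers) by micro-partition size.

    Assumption: a point is an outlier when its micro-partition holds at
    most `threshold` points; threshold=1 flags exactly the singletons.
    """
    sizes = Counter(micro_labels)
    kept, outliers = [], []
    for idx, label in enumerate(micro_labels):
        if sizes[label] <= threshold:
            outliers.append(idx)
        else:
            kept.append(idx)
    return kept, outliers

def submatrix(S, indices):
    """Restrict a similarity matrix to the given rows and columns."""
    return [[S[i][j] for j in indices] for i in indices]

# Hypothetical micro-partition labels for 7 points; partitions 2 and 4 are singletons.
micro_labels = [0, 0, 1, 1, 2, 1, 4]
kept, outliers = filter_outliers(micro_labels)
```

Passing `kept` to `submatrix` yields the similarity matrix of the filtered points, which is what the second clustering pass would operate on.
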

14
Using Graclus to cluster ?P Space
  • Graclus needs the number of partitions, k.
  • In formula 1, k refers to the number of
    micro-partitions.
  • Here, k refers to the number of clusters we
    desire.
  • We denote the former by k1 and the latter by k2.
  • Graclus performs clustering on the ?P space and
    produces the λ index, which is defined as
    [equation not preserved in the transcript]

15
Visualization of the results using CLUSION graphs
  • CLUSION looks at the λ index,
  • reorders the ?P space so that points with the
    same cluster label are contiguous,
  • and then visualizes the resulting permuted ?P'.
  • Two λ indices are produced during the clustering
    process:
  • λ1 is created while forming micro-partitions
  • λ2 is created while clustering the ?P space
  • We use λ2 for CLUSION; λ1 is used only for
    forming micro-partitions.
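
The reordering step can be sketched in a few lines: sort the point indices by cluster label, then apply the same permutation to the rows and columns of the similarity matrix. The text rendering below is a crude stand-in for the CLUSION plot, and the names `clusion_reorder` and `render` are assumptions for illustration.

```python
def clusion_reorder(S, labels):
    """Permute S so that points sharing a cluster label are contiguous.

    Returns (order, S_permuted), where order[i] is the original index of
    the point placed at row and column i.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])  # stable sort
    S_perm = [[S[i][j] for j in order] for i in order]
    return order, S_perm

def render(S_perm, levels=" .:#"):
    """Crude text rendering: darker characters for higher similarity in [0, 1]."""
    top = len(levels) - 1
    return ["".join(levels[min(top, int(s * top + 0.5))] for s in row)
            for row in S_perm]

# Toy similarity matrix in which points 0 and 2, and points 1 and 3, are similar.
S = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]
order, S_perm = clusion_reorder(S, labels)
print("\n".join(render(S_perm)))
```

After the permutation the high-similarity entries sit in two blocks on the main diagonal, which is the "dark squares" pattern the visualization relies on.
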

16
Experiments: Datasets
  • We evaluated our proposed framework on two
    different real-world datasets.
  • 9636 terms from 2225 complete news articles from
    the BBC News web site (2225-dimensional
    dataset, 5 natural clusters)
  • A collection of news articles from the Turkish
    newspaper Milliyet, containing 6223 Turkish
    terms from 1455 news articles (1455-dimensional
    dataset, 3 natural clusters)

17
Experiments: Evaluated Frameworks
  • OPOSSUM: Strehl and Ghosh's original METIS-based
    framework
  • SG(Graclus): We replaced METIS with Graclus in
    Strehl and Ghosh's framework to test the
    quality of the clusters produced by the Graclus
    algorithm.
  • P space + Graclus: Our proposed framework.

18
Experiments: Comparison Criteria
  • Purity
  • Entropy
  • Mutual Information
  • CLUSION graphics (visual identification,
    visual data mining)
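
The slide lists the criteria without formulas. The sketch below uses common textbook definitions of purity, cluster entropy, and mutual information; these may differ in normalization from the versions used in the experiments, and the function names and toy labels are assumptions.

```python
import math
from collections import Counter

def _class_counts(clusters, classes, c):
    """Class frequencies among the points assigned to cluster c."""
    return Counter(cls for cl, cls in zip(clusters, classes) if cl == c)

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    total = sum(max(_class_counts(clusters, classes, c).values())
                for c in set(clusters))
    return total / len(clusters)

def entropy(clusters, classes):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    n = len(clusters)
    result = 0.0
    for c in set(clusters):
        counts = _class_counts(clusters, classes, c)
        size = sum(counts.values())
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        result += (size / n) * h
    return result

def mutual_information(clusters, classes):
    """Mutual information (in bits) between cluster labels and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    by_cluster = Counter(clusters)
    by_class = Counter(classes)
    return sum((m / n) * math.log2(m * n / (by_cluster[c] * by_class[k]))
               for (c, k), m in joint.items())

# Toy example: 6 points, 2 clusters, 2 true classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
scores = (purity(clusters, classes),
          entropy(clusters, classes),
          mutual_information(clusters, classes))
```

Higher purity and mutual information indicate better agreement with the natural classes, while lower entropy is better.
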

19
Results: BBC Dataset

20
Results: BBC Dataset, OPOSSUM

21
Results: BBC Dataset, SG(Graclus)

22
Results: BBC Dataset, P space + Graclus

23
Results: Milliyet Dataset

24
Results: Milliyet Dataset, OPOSSUM

25
Results: Milliyet Dataset, SG(Graclus)

26
Results: Milliyet Dataset, P space + Graclus

27
Thank You! Presenter: T. Tugay Bilgin