Algorithms for clustering large datasets in arbitrary metric spaces presentation

Title: Algorithms for clustering large datasets in arbitrary metric spaces

1
Algorithms for clustering large datasets in
arbitrary metric spaces
2
Introduction

A set of 2-dimensional points is shown adjacent.
They clearly form three distinct groups (called
clusters).
The goal of any clustering algorithm is to find
such groups in data to better understand its
distribution.

3
Introduction What is Clustering?

Input
Database of objects.
A distance function that captures the notion of
similarity between objects.
Number of groups.
Goal
Partition the database into the specified number
of groups such that each group consists of
similar objects.

4
Goals of our clustering algorithm

Good clustering quality
Scalability
Only use a bounded amount of main memory

5
Outline

Introduction
The BIRCH framework
BIRCH for n-dimensional spaces
BUBBLE for arbitrary metric spaces
BUBBLE-FM: an improvement over BUBBLE
Experimental evaluation
Conclusions

6
BIRCH Introduction

BIRCH is a framework for scalable incremental
clustering algorithms.
Output is a set of sub-clusters which can further
be analyzed by a more expensive domain-specific
clustering algorithm.
BIRCH can be instantiated to yield different
clustering algorithms.

7
BIRCH Incremental Algorithm

Clusters evolve as data is scanned.
A current set of clusters is always maintained in
memory.
Each new object is either
inserted into the cluster to which it is
closest, or
it forms a cluster of its own.
Requirements
a representation for clusters.
a structure to search for the closest cluster.
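The incremental scheme above can be sketched in a few lines. This is a minimal illustration, not the BIRCH implementation: it assumes Euclidean points, uses a linear scan in place of the search structure, and assumes a hypothetical `threshold` on the distance to the closest centroid to decide between the two cases.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def incremental_cluster(points, threshold):
    """Single scan: absorb each point into the closest cluster if it lies
    within `threshold` of that cluster's centroid (an assumed criterion),
    otherwise start a new cluster."""
    clusters = []  # each cluster is {"n": count, "sum": per-dimension sums}
    for p in points:
        best, best_d = None, float("inf")
        for c in clusters:  # linear scan stands in for the search structure
            centroid = [s / c["n"] for s in c["sum"]]
            d = dist(p, centroid)
            if d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            # absorb the point: update the cluster summary in place
            best["n"] += 1
            best["sum"] = [s + x for s, x in zip(best["sum"], p)]
        else:
            # the point forms a cluster of its own
            clusters.append({"n": 1, "sum": list(p)})
    return clusters
```

Only the per-cluster summaries stay in memory; the points themselves are discarded after they are absorbed.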

8
BIRCH Important features

Cluster features (CF)
Condensed representation for a cluster of objects
CF-tree
A height-balanced index for CFs
Rebuilding algorithm
When the allocated amount of memory is exhausted,
a smaller CF-tree is built from the old tree.

9
BIRCH Cluster Feature (CF)

CFs are summarized representations of clusters.
They contain sufficient information to find
the distance between a cluster and an object.
the distance between any two clusters.
They are incrementally maintainable
when new objects are inserted in clusters.
when two clusters are merged.

10
BIRCH CF-tree

Two parameters
Branching factor
Threshold
Each entry contains the CF of the cluster of
objects in the sub-tree beneath it.
Starting from the root, the closest entry is
selected to traverse downwards until a leaf node
is reached.

11
BIRCH CF-tree Insertion (contd)

At the leaf node, the closest cluster is selected
to insert the object.
If the threshold criterion is satisfied, the
object is absorbed into the cluster. Else, it
forms a new cluster on the leaf node.
The path from the root to the leaf is updated to
reflect the insertion.

12
BIRCH CF-tree Insertion (contd)

If there is no space on the leaf node, it is split
and the entries are redistributed based on the
closeness criterion.
A new entry is created at its parent to reflect
the formation of a new leaf node.

13
BIRCH Rebuilding Algorithm

If the CF-tree grows to occupy more space than it
is allocated, the threshold is increased and the
CF-tree is rebuilt.
CFs of leaf clusters are inserted into the new
tree. The insertion algorithm is the same as for
individual objects.
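A flat sketch of the rebuilding step, under simplifying assumptions: clusters are 1-dimensional, each leaf CF is reduced to a hypothetical `(n, ls)` pair (count and linear sum), a list stands in for the tree, and the threshold test is assumed to be on centroid distance. It only illustrates that whole clusters are re-inserted like individual objects, so a larger threshold yields fewer, coarser clusters.

```python
def rebuild(leaf_cfs, new_threshold):
    """Re-insert leaf CFs, given as (n, ls) pairs, under a larger threshold."""
    rebuilt = []
    for n, ls in leaf_cfs:
        centroid = ls / n
        # index of the closest cluster already in the new structure, if any
        best = min(
            range(len(rebuilt)),
            key=lambda i: abs(rebuilt[i][1] / rebuilt[i][0] - centroid),
            default=None,
        )
        if best is not None:
            bn, bls = rebuilt[best]
            if abs(bls / bn - centroid) <= new_threshold:
                # absorb the whole cluster: the summaries simply add up
                rebuilt[best] = (bn + n, bls + ls)
                continue
        rebuilt.append((n, ls))
    return rebuilt
```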

14
BIRCH Instantiation Summary

To instantiate BIRCH we have to define
Cluster features at leaf and non-leaf levels.
Incremental maintenance of leaf-level CFs and
updates to non-leaf level CFs when new objects
are inserted.
Distance measures between any two CFs to define
the notion of closeness.

15
BIRCH Instantiation of BIRCH

CF of a cluster of n k-dimensional vectors,
V1, ..., Vn, is defined as (n, LS, SS)
n is the number of vectors
LS is the sum of vectors
SS is the sum of squares of vectors
CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
This property is used for incrementally maintaining
cluster features.
Distance between two clusters C1 and C2 is
defined to be the distance between their
centroids.
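The vector-space CF can be sketched as a small class. The names `CF`, `of`, and `centroid_distance` are illustrative, and SS is assumed here to be a single scalar (the sum of squared norms); a per-dimension SS would work the same way.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class CF:
    n: int     # number of vectors
    ls: list   # linear sum of the vectors, per dimension
    ss: float  # sum of squared norms (assumed scalar form of SS)

    @classmethod
    def of(cls, v):
        """CF of a single vector."""
        return cls(1, list(v), sum(x * x for x in v))

    def __add__(self, other):
        """Additivity: the CF of a merged cluster is the component-wise sum."""
        return CF(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self):
        return [x / self.n for x in self.ls]


def centroid_distance(c1, c2):
    """Distance between two clusters, taken as the distance between centroids."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(c1.centroid(), c2.centroid())))
```

Inserting a single vector `v` into a cluster `c` is then just `c + CF.of(v)`, which is why no raw vectors need to be kept.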

16
Arbitrary metric space (AMS) Issues

Only operation allowed between objects is the
distance computation.
Specifically, the notion of a centroid of a set
of objects does not exist.
The distance function can be computationally very
expensive. E.g., the edit distance between
strings.
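For a sense of the cost, here is the standard dynamic-programming edit (Levenshtein) distance, assuming unit cost for insertion, deletion, and substitution. Each call takes time proportional to the product of the two string lengths, which is why the number of distance computations matters.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```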

17
Definitions

Given a set O of objects O1, ..., On
Row sum of Oi is defined as
RowSum(Oi) = d(Oi, O1)^2 + ... + d(Oi, On)^2,
where d is the distance function.
Clustroid of O is the object with the least row
sum value.
Clustroid is a concept parallel to that of the
centroid in the Euclidean space.
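A brute-force sketch of the clustroid, assuming the row sum of an object is the sum of its squared distances to all objects in the set. The function names are illustrative, and `d` is any distance function supplied by the caller. It needs a quadratic number of distance calls, so it is only practical for small sets.

```python
def row_sum(obj, objects, d):
    """Sum of squared distances from obj to every object in the set."""
    return sum(d(obj, other) ** 2 for other in objects)


def clustroid(objects, d):
    """The object with the least row sum; uses only distance computations."""
    return min(objects, key=lambda o: row_sum(o, objects, d))
```

For example, with the numbers 1, 2, 3, 10 and absolute difference as the distance, the row sums are 86, 66, 54, and 194, so the clustroid is 3.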

18
BUBBLE CF

The CF of a set O of objects O1, ..., On is defined
as (n, O0, SS, R, RS).
n is the number of objects
O0 is the clustroid
SS is the sum of squared distances of all objects from
O0
R is the set of representative objects (explained
later)
RS holds the row sum values of the representative objects

19
BUBBLE Non-leaf CFs

Non-leaf CFs direct a new object to an
appropriate child node.
They capture the distribution of objects in the
sub-tree below them.
A set of sample objects randomly collected from
the sub-tree at a non-leaf entry forms its CF.

20
BUBBLE Incremental Maintenance (Leaf CF)

Types of insertion
Type I: Insertion of a single object.
Type II: Insertion of a cluster of objects.
Under Type I insertion, the location of the new
clustroid is within a bounded distance of the old
clustroid. (The bound depends on the threshold of
the cluster.)
Heuristic 1: Maintain a few objects close to the
clustroid.

21
BUBBLE Incremental Maintenance (Leaf CF)

Under Type II insertions, the location of the new
clustroid is between the two old clustroids.
Heuristic 2: Maintain a few objects farthest from
the clustroid in the leaf CF.
The objects maintained at each leaf cluster are
its representative objects.

22
BUBBLE Updates to Non-leaf CFs

The sample objects at a non-leaf entry are
updated whenever its child node splits.
The distribution of clusters changes
significantly whenever a node splits.

23
BUBBLE Distance measures

Distance between two leaf level clusters is
defined to be the distance between their
clustroids.
If C1, C2 are leaf clusters with clustroids O10,
O20 then
D(C1,C2) = d(O10,O20)
Distance between two non-leaf level clusters C1,
C2 with sample objects S1, S2 is defined to be the
average distance between S1 and S2.
D(C1,C2) = average of d(s1,s2) over all pairs s1 in S1, s2 in S2
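Both measures can be sketched as follows. The function names are illustrative, and the non-leaf measure is assumed to be the plain mean of all pairwise distances between the two sample sets; a root-mean-square variant would be another reasonable reading of "average distance".

```python
def leaf_distance(clustroid1, clustroid2, d):
    """Leaf level: one distance call, between the two clustroids."""
    return d(clustroid1, clustroid2)


def nonleaf_distance(samples1, samples2, d):
    """Non-leaf level: mean pairwise distance between the two sample sets."""
    total = sum(d(a, b) for a in samples1 for b in samples2)
    return total / (len(samples1) * len(samples2))
```

Note that the non-leaf measure costs one distance call per pair of samples, which is the expense BUBBLE-FM sets out to reduce.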

24
BUBBLE-FM

Distance functions in arbitrary metric spaces can
be computationally expensive.
Idea: Use the Euclidean distance function instead.

25
BUBBLE-FM Non-leaf CF

Map the sample objects S into a k-dimensional
Euclidean image space using FastMap.
Each non-leaf CF now contains the centroid of the
image vectors of its sample objects.
New objects are mapped into the image space and
the Euclidean distance function is used.
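A compact sketch of the FastMap idea, not the BUBBLE-FM code itself. Each axis is defined by a pair of pivot objects (a, b), and an object i gets the coordinate (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b)), with distances taken in the residual space left after earlier axes. The pivot choice (farthest from an arbitrary start, then farthest from that) is a common heuristic and an assumption here, as are the names `fastmap` and `project`.

```python
from math import sqrt


def fastmap(objects, d, k):
    """Map objects to k-dimensional vectors; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]
    pivots = []

    def dist2(i, j, axis):
        # squared distance left over after the first `axis` coordinates
        r = d(objects[i], objects[j]) ** 2
        for t in range(axis):
            r -= (coords[i][t] - coords[j][t]) ** 2
        return max(r, 0.0)

    for axis in range(k):
        # pivot heuristic: farthest from object 0, then farthest from that one
        b = max(range(n), key=lambda j: dist2(0, j, axis))
        a = max(range(n), key=lambda j: dist2(b, j, axis))
        dab2 = dist2(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining axes stay at 0
        for i in range(n):
            coords[i][axis] = (
                dist2(a, i, axis) + dab2 - dist2(b, i, axis)
            ) / (2 * sqrt(dab2))
        pivots.append((a, b))
    return coords, pivots


def project(obj, objects, coords, pivots, d):
    """Map a new object using two distance calls per axis."""
    x = []
    for axis, (a, b) in enumerate(pivots):
        def res2(p):
            # squared residual distance from obj to pivot object p
            r = d(obj, objects[p]) ** 2
            for t in range(axis):
                r -= (x[t] - coords[p][t]) ** 2
            return max(r, 0.0)

        dab2 = d(objects[a], objects[b]) ** 2
        for t in range(axis):
            dab2 -= (coords[a][t] - coords[b][t]) ** 2
        x.append((res2(a) + dab2 - res2(b)) / (2 * sqrt(dab2)))
    return x
```

Once a new object has been projected, comparing it with the stored centroids needs only cheap Euclidean arithmetic, at the price of two expensive distance calls per axis.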
