Title: Algorithms for clustering large datasets in arbitrary metric spaces
2 Introduction
- A set of 2-dimensional points is shown adjacent.
- They clearly form three distinct groups (called clusters).
- The goal of any clustering algorithm is to find such groups in data to better understand its distribution.
3 Introduction: What is Clustering?
- Input
  - Database of objects.
  - A distance function that captures the notion of similarity between objects.
  - Number of groups.
- Goal
  - Partition the database into the specified number of groups such that each group consists of similar objects.
4 Goals of our clustering algorithm
- Good clustering quality
- Scalability
- Only use a bounded amount of main memory
5 Outline
- Introduction
- The BIRCH framework
- BIRCH for n-dimensional spaces
- BUBBLE for arbitrary metric spaces
- BUBBLE-FM: An improvement over BUBBLE
- Experimental evaluation
- Conclusions
6 BIRCH: Introduction
- BIRCH is a framework for scalable incremental clustering algorithms.
- Output is a set of sub-clusters that can be analyzed further by a more expensive domain-specific clustering algorithm.
- BIRCH can be instantiated to yield different clustering algorithms.
7 BIRCH: Incremental Algorithm
- Clusters evolve as data is scanned.
- A current set of clusters is always maintained in memory.
- Each new object either
  - is inserted into the cluster to which it is closest, or
  - forms a cluster of its own.
- Requirements
  - a representation for clusters.
  - a structure to search for the closest cluster.
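The incremental scheme above can be sketched in a few lines. This is only an illustration, not the BIRCH implementation: the names `incremental_cluster` and `threshold` are hypothetical, each cluster is represented by its first member, and the closest cluster is found by a linear scan instead of a tree.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def incremental_cluster(objects: List[T],
                        distance: Callable[[T, T], float],
                        threshold: float) -> List[List[T]]:
    """Scan the data once; each object joins its closest cluster or starts a new one."""
    clusters: List[List[T]] = []  # assumption: the first member represents the cluster
    for obj in objects:
        best, best_dist = None, float("inf")
        for cluster in clusters:
            d = distance(obj, cluster[0])
            if d < best_dist:
                best, best_dist = cluster, d
        if best is not None and best_dist <= threshold:
            best.append(obj)          # insert into the closest cluster
        else:
            clusters.append([obj])    # the object forms a cluster of its own
    return clusters

points = [1.0, 1.2, 5.0, 5.1, 9.0, 0.9]
clusters = incremental_cluster(points, lambda a, b: abs(a - b), threshold=0.5)
```

The linear scan and the first-member representative are exactly the two pieces that the requirements above ask to replace: a proper cluster representation and a search structure.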
8 BIRCH: Important features
- Cluster features (CF)
  - Condensed representation for a cluster of objects
- CF-tree
  - A height-balanced index for CFs
- Rebuilding algorithm
  - When the allocated amount of memory is exhausted, a smaller CF-tree is built from the old tree.
9 BIRCH: Cluster Feature (CF)
- CFs are summarized representations of clusters.
- They contain sufficient information to find
  - the distance between a cluster and an object.
  - the distance between any two clusters.
- They are incrementally maintainable
  - when new objects are inserted in clusters.
  - when two clusters are merged.
10 BIRCH: CF-tree
- Two parameters
  - Branching factor
  - Threshold
- Each entry contains the CF of the cluster of objects in the sub-tree beneath it.
- To insert an object, start from the root and repeatedly follow the closest entry downwards until a leaf node is reached.
11 BIRCH: CF-tree insertion (contd)
- At the leaf node, the closest cluster is selected to insert the object.
- If the threshold criterion is satisfied, the object is absorbed into the cluster. Otherwise, it forms a new cluster on the leaf node.
- The path from the root to the leaf is updated to reflect the insertion.
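The descent, the threshold test, and the path update can be sketched as follows. This is a simplified illustration under several assumptions: points are 1-dimensional, every entry stores only a count and a linear sum, the threshold test compares the distance to the cluster centroid, and node splitting is left out. The names `Node`, `centroid`, and `insert` are hypothetical.

```python
class Node:
    """Tree node. Leaf entries are [n, ls]; non-leaf entries are [n, ls, child]."""
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []

def centroid(entry):
    # entry[0] is the number of points, entry[1] their sum
    return entry[1] / entry[0]

def insert(root, x, threshold):
    node, path = root, []
    # Descend: at each non-leaf node follow the entry closest to x.
    while not node.is_leaf:
        entry = min(node.entries, key=lambda e: abs(centroid(e) - x))
        path.append(entry)
        node = entry[2]
    # At the leaf: absorb x into the closest cluster if the test passes.
    closest = min(node.entries, key=lambda e: abs(centroid(e) - x), default=None)
    if closest is not None and abs(centroid(closest) - x) <= threshold:
        closest[0] += 1
        closest[1] += x
    else:
        node.entries.append([1, x])   # x forms a new cluster on the leaf
    # Update the summaries on the path from the root to the leaf.
    for entry in path:
        entry[0] += 1
        entry[1] += x

leaf1, leaf2, root = Node(True), Node(True), Node(False)
leaf1.entries = [[2, 2.0]]             # two points with centroid 1.0
leaf2.entries = [[1, 10.0]]            # one point at 10.0
root.entries = [[2, 2.0, leaf1], [1, 10.0, leaf2]]
insert(root, 1.25, threshold=0.5)      # absorbed by the cluster in leaf1
insert(root, 12.0, threshold=0.5)      # new cluster in leaf2
```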
12 BIRCH: CF-tree insertion (contd)
- If there is no space on the leaf node, it is split and the entries are redistributed based on the closeness criterion.
- A new entry is created at its parent to reflect the formation of a new leaf node.
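One possible way to redistribute entries during a split is sketched below. It assumes the farthest pair of entries is chosen as seeds and every entry goes to the closer seed; the slides do not spell out the redistribution rule, so treat this as an assumption. The name `split_entries` is hypothetical.

```python
from itertools import combinations

def split_entries(entries, distance):
    """Split a full node's entries into two groups around the farthest pair."""
    # Assumption: the two entries that are farthest apart seed the two new nodes.
    seed_a, seed_b = max(combinations(entries, 2),
                         key=lambda pair: distance(pair[0], pair[1]))
    left, right = [], []
    for e in entries:
        # Each entry joins the seed it is closer to (ties go to the first seed).
        if distance(e, seed_a) <= distance(e, seed_b):
            left.append(e)
        else:
            right.append(e)
    return left, right

left, right = split_entries([1.0, 1.5, 9.0, 2.0, 8.5], lambda a, b: abs(a - b))
```

Searching all pairs is quadratic in the node size, which stays small because the branching factor bounds the number of entries per node.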
13 BIRCH: Rebuilding Algorithm
- If the CF-tree grows to occupy more space than it is allocated, the threshold is increased and the CF-tree is rebuilt.
- CFs of leaf clusters are inserted into the new tree. The insertion algorithm is the same as for individual objects.
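A rough sketch of the rebuilding step, using a flat list of 1-dimensional leaf CFs in place of a tree. The growth factor of 2, the centroid-based threshold test, and the names `insert_cf` and `rebuild` are assumptions made for illustration; the slides do not say how the new threshold is chosen.

```python
def insert_cf(clusters, cf, threshold):
    """Insert a leaf CF (n, ls) into `clusters`, merging with the closest one if allowed."""
    n, ls = cf
    best_idx, best_dist = None, float("inf")
    for idx, (cn, cls_) in enumerate(clusters):
        d = abs(ls / n - cls_ / cn)            # distance between centroids
        if d < best_dist:
            best_idx, best_dist = idx, d
    if best_idx is not None and best_dist <= threshold:
        cn, cls_ = clusters[best_idx]
        clusters[best_idx] = (cn + n, cls_ + ls)   # CFs are additive
    else:
        clusters.append((n, ls))

def rebuild(clusters, threshold, max_clusters, growth=2.0):
    """Raise the threshold and re-insert leaf CFs until they fit the budget."""
    if max_clusters < 1 or threshold <= 0 or growth <= 1:
        raise ValueError("need max_clusters >= 1, threshold > 0 and growth > 1")
    while len(clusters) > max_clusters:
        threshold *= growth
        rebuilt = []
        for cf in clusters:                    # same insertion as for single objects
            insert_cf(rebuilt, cf, threshold)
        clusters = rebuilt
    return clusters, threshold

old = [(1, 0.0), (1, 1.0), (1, 10.0), (1, 11.0)]
new, new_threshold = rebuild(old, threshold=0.5, max_clusters=2)
```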
14 BIRCH: Instantiation Summary
- To instantiate BIRCH we have to define
  - Cluster features at leaf and non-leaf levels.
  - Incremental maintenance of leaf-level CFs and updates to non-leaf level CFs when new objects are inserted.
  - Distance measures between any two CFs to define the notion of closeness.
15 BIRCH: Instantiation of BIRCH
- CF of a cluster of $n$ $k$-dimensional vectors, $V_1, \ldots, V_n$, is defined as $(n, LS, SS)$
  - $n$ is the number of vectors
  - $LS$ is the sum of vectors
  - $SS$ is the sum of squares of vectors
- $CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2)$
- This property is used for incrementally maintaining cluster features.
- Distance between two clusters $C_1$ and $C_2$ is defined to be the distance between their centroids.
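The additivity of cluster features can be illustrated with a small class. This is a sketch only: the class name `CF`, the helper `centroid_distance`, and the choice to store SS as a single scalar (the sum of squared norms) are assumptions, not details given in the slides.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class CF:
    n: int            # number of vectors
    ls: List[float]   # component-wise sum of the vectors
    ss: float         # assumption: scalar sum of squared norms

    @classmethod
    def from_vector(cls, v: Sequence[float]) -> "CF":
        return cls(1, list(v), sum(x * x for x in v))

    def __add__(self, other: "CF") -> "CF":
        # CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

def centroid_distance(a: CF, b: CF) -> float:
    """Distance between two clusters, taken as the distance between centroids."""
    return math.dist(a.centroid(), b.centroid())

c1 = CF.from_vector([0.0, 0.0]) + CF.from_vector([2.0, 0.0])
c2 = CF.from_vector([4.0, 4.0])
d = centroid_distance(c1, c2)
```

Inserting one object is the special case of adding a CF with `n == 1`, so the same `+` covers both kinds of incremental maintenance.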
16 Arbitrary metric space (AMS): Issues
- The only operation allowed between objects is the distance computation.
- Specifically, the notion of a centroid of a set of objects does not exist.
- The distance function can be computationally very expensive, e.g., the edit distance between strings.
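To make the cost concrete, here is a standard dynamic-programming sketch of the edit (Levenshtein) distance with unit costs. The function name `edit_distance` is chosen for illustration. Each call takes time proportional to the product of the two string lengths, which is why the number of distance calls matters.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))       # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                       # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

d = edit_distance("kitten", "sitting")
```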
17 Definitions
- Given a set $O$ of objects $O_1, \ldots, O_n$ and a distance function $d$
- Row sum of $O_i$ is defined as $\mathrm{RowSum}(O_i) = \sum_{j=1}^{n} d^2(O_i, O_j)$
- Clustroid of $O$ is the object with the least row sum value.
- Clustroid is a concept parallel to that of the centroid in the Euclidean space.
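A direct computation of the clustroid can be sketched as below. It assumes the row sum of an object is the sum of its squared distances to all objects in the set. The names `row_sum` and `clustroid` are illustrative, and the quadratic number of distance calls is the naive approach, not an optimized one.

```python
def row_sum(obj, objects, distance):
    """Sum of squared distances from obj to every object in the set (assumed definition)."""
    return sum(distance(obj, other) ** 2 for other in objects)

def clustroid(objects, distance):
    """The object with the least row sum; uses only the distance function."""
    return min(objects, key=lambda o: row_sum(o, objects, distance))

objects = [1, 2, 3, 10]
center = clustroid(objects, lambda a, b: abs(a - b))
```

Unlike a centroid, the result is always one of the input objects, so the definition works in any metric space.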
18 BUBBLE: CF
- The CF of a set $O$ of objects $O_1, \ldots, O_n$ is defined as $(n, O_0, SS, R, RS)$.
  - $n$: number of objects
  - $O_0$: clustroid
  - $SS$: sum of squared distances of all objects from $O_0$
  - $R$: set of representative objects (explained later)
  - $RS$: row sum values of the representative objects
19 BUBBLE: Non-leaf CFs
- Non-leaf CFs direct a new object to an appropriate child node.
- They capture the distribution of objects in the sub-tree below them.
- A set of sample objects randomly collected from the sub-tree at a non-leaf entry forms its CF.
20 BUBBLE: Incremental Maintenance (Leaf CF)
- Types of insertion
  - Type I: Insertion of a single object.
  - Type II: Insertion of a cluster of objects.
- Under Type I insertion, the location of the new clustroid is within a bounded distance of the old clustroid. (The bound depends on the threshold of the cluster.)
- Heuristic 1: Maintain a few objects close to the clustroid.
21 BUBBLE: Incremental Maintenance (Leaf CF)
- Under Type II insertions, the location of the new clustroid is between the two old clustroids.
- Heuristic 2: Maintain a few objects farthest from the clustroid in the leaf CF.
- The objects maintained at each leaf cluster are its representative objects.
22 BUBBLE: Updates to Non-leaf CFs
- The sample objects at a non-leaf entry are updated whenever its child node splits.
- The distribution of clusters changes significantly whenever a node splits.
23 BUBBLE: Distance measures
- Distance between two leaf-level clusters is defined to be the distance between their clustroids.
- If $C_1, C_2$ are leaf clusters with clustroids $O_{10}, O_{20}$, then $D(C_1, C_2) = d(O_{10}, O_{20})$
- Distance between two non-leaf-level clusters $C_1, C_2$ with sample objects $S_1, S_2$ is defined to be the average distance between $S_1$ and $S_2$, i.e. $D(C_1, C_2)$ is the average of the distances between objects of $S_1$ and objects of $S_2$.
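Both distance measures can be sketched as follows. The non-leaf measure here is the plain arithmetic mean of all pairwise distances between the two sample sets; that reading of "average distance" is an assumption, and other averages (for example a root mean square) are possible. The function names are illustrative.

```python
def leaf_distance(clustroid1, clustroid2, distance):
    """Distance between two leaf clusters: distance between their clustroids."""
    return distance(clustroid1, clustroid2)

def nonleaf_distance(samples1, samples2, distance):
    """Assumed form: arithmetic mean of all pairwise distances between the sample sets."""
    total = sum(distance(a, b) for a in samples1 for b in samples2)
    return total / (len(samples1) * len(samples2))

metric = lambda a, b: abs(a - b)
d_leaf = leaf_distance(3, 7, metric)
d_nonleaf = nonleaf_distance([0, 2], [10], metric)
```

The non-leaf measure needs one distance call per pair of sample objects, so its cost grows with the product of the two sample sizes.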
24 BUBBLE-FM
- Distance functions in arbitrary metric spaces can be computationally expensive.
- Idea: Use the Euclidean distance function instead.
25 BUBBLE-FM: Non-leaf CF
- Map the sample objects $S$ using FastMap into a $k$-dimensional Euclidean image space.
- Each non-leaf CF now contains the centroid of the image vectors of its sample objects.
- New objects are mapped into the image space and the Euclidean distance function is used.
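A compact sketch of the FastMap projection is given below, assuming the usual formulation: each axis is defined by two pivot objects $a$ and $b$, and an object $o$ gets the coordinate $(d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2\,d(a,b))$, with distances on later axes computed on the residual. The pivot heuristic and the name `fastmap` are illustrative. The sketch does not cache distances and does not cover mapping new objects, which BUBBLE-FM also needs.

```python
import math

def fastmap(objects, distance, k):
    """Map objects to k-dimensional vectors using only the distance function."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left over after removing the first `axis` coordinates.
        d2 = distance(objects[i], objects[j]) ** 2
        for h in range(axis):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)

    for axis in range(k):
        # Pivot heuristic (assumption): farthest object from object 0,
        # then the object farthest from that one.
        a = max(range(n), key=lambda j: residual_sq(0, j, axis))
        b = max(range(n), key=lambda j: residual_sq(a, j, axis))
        dab_sq = residual_sq(a, b, axis)
        if dab_sq == 0.0:
            break                        # all remaining distances are zero
        dab = math.sqrt(dab_sq)
        for i in range(n):
            coords[i][axis] = (residual_sq(a, i, axis) + dab_sq
                               - residual_sq(b, i, axis)) / (2.0 * dab)
    return coords

image = fastmap([0.0, 3.0, 10.0], lambda x, y: abs(x - y), k=1)
```

For points that already lie on a line, one axis reproduces all pairwise distances exactly. In general the image distances only approximate the original ones.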
26 Scalability
27 Conclusions
- BIRCH framework for scalable incremental clustering algorithms.
- Instantiation for n-d spaces (BIRCH).
- Instantiation for AMS (BUBBLE).
- FastMap to reduce the number of times the distance function is called (BUBBLE-FM).