Algorithms for clustering large datasets in arbitrary metric spaces

Title: Algorithms for clustering large datasets in arbitrary metric spaces


1
Algorithms for clustering large datasets in
arbitrary metric spaces
2
Introduction
  • The adjacent figure shows a set of 2-dimensional points.
  • They clearly form three distinct groups (called
    clusters).
  • The goal of any clustering algorithm is to find
    such groups in data to better understand its
    distribution.

3
Introduction: What is Clustering?
  • Input
      • Database of objects.
      • A distance function that captures the notion of
        similarity between objects.
      • Number of groups.
  • Goal
      • Partition the database into the specified number
        of groups such that each group consists of
        similar objects.

4
Goals of our clustering algorithm
  • Good clustering quality
  • Scalability
  • Only use a bounded amount of main memory

5
Outline
  • Introduction
  • The BIRCH framework
  • BIRCH for n-dimensional spaces
  • BUBBLE for arbitrary metric spaces
  • BUBBLE-FM: an improvement over BUBBLE
  • Experimental evaluation
  • Conclusions

6
BIRCH: Introduction
  • BIRCH is a framework for scalable incremental
    clustering algorithms.
  • Output is a set of sub-clusters which can further
    be analyzed by a more expensive domain-specific
    clustering algorithm.
  • BIRCH can be instantiated to yield different
    clustering algorithms.

7
BIRCH: Incremental Algorithm
  • Clusters evolve as data is scanned.
  • A current set of clusters is always maintained in
    memory.
  • Each new object either
      • is inserted into the cluster to which it is
        closest, or
      • forms a cluster of its own.
  • Requirements
      • A representation for clusters.
      • A structure to search for the closest cluster.
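
The incremental scan above can be sketched as follows. This is a minimal illustration, not the BIRCH implementation: it assumes each cluster is represented by its first object and uses a linear search in place of an index, and the names `incremental_cluster` and `threshold` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def incremental_cluster(
    objects: Sequence[T],
    dist: Callable[[T, T], float],
    threshold: float,
) -> List[List[T]]:
    """Scan the data once, keeping the current set of clusters in memory."""
    clusters: List[List[T]] = []
    for obj in objects:
        best, best_d = None, float("inf")
        # Linear search for the closest cluster (assumption: a real
        # system would use an index such as the CF-tree instead).
        for cluster in clusters:
            d = dist(obj, cluster[0])  # first object stands in for the cluster
            if d < best_d:
                best, best_d = cluster, d
        if best is not None and best_d <= threshold:
            best.append(obj)        # inserted into the closest cluster
        else:
            clusters.append([obj])  # forms a cluster of its own
    return clusters


points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = incremental_cluster(points, lambda a, b: abs(a - b), threshold=0.5)
print(clusters)
```

The two requirements on the slide correspond to the two simplifications made here: the cluster representation (a cluster feature) and the search structure (the CF-tree).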

8
BIRCH: Important features
  • Cluster features (CF)
      • Condensed representation for a cluster of objects
  • CF-tree
      • A height-balanced index for CFs
  • Rebuilding algorithm
      • When the allocated amount of memory is exhausted,
        a smaller CF-tree is built from the old tree.

9
BIRCH: Cluster Feature (CF)
  • CFs are summarized representations of clusters.
  • They contain sufficient information to find
      • the distance between a cluster and an object.
      • the distance between any two clusters.
  • They are incrementally maintainable
      • when new objects are inserted in clusters.
      • when two clusters are merged.

10
BIRCH: CF-tree
  • Two parameters
      • Branching factor
      • Threshold
  • Each entry contains the CF of the cluster of
    objects in the sub-tree beneath it.
  • Starting from the root, the closest entry is
    selected to traverse downwards until a leaf node
    is reached.

11
BIRCH: CF-tree Insertion (contd.)
  • At the leaf node, the closest cluster is selected
    to insert the object.
  • If the threshold criterion is satisfied, the
    object is absorbed into the cluster. Else, it
    forms a new cluster on the leaf node.
  • The path from the root to the leaf is updated to
    reflect the insertion.

12
BIRCH: CF-tree Insertion (contd.)
  • If there is no space on the leaf node, it is split
    and the entries are redistributed based on the
    closeness criterion.
  • A new entry is created at its parent to reflect
    the formation of a new leaf node.
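
The leaf split can be illustrated with a short sketch. The slide only says that entries are redistributed by closeness; choosing the farthest pair of entries as seeds is an assumption made here, and `split_entries` is a hypothetical helper name.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_entries(
    entries: Sequence[T],
    dist: Callable[[T, T], float],
) -> Tuple[List[T], List[T]]:
    """Split an overfull node into two groups (needs at least 2 entries)."""
    # Assumption: use the farthest pair of entries as the two seeds.
    seed_a, seed_b = max(combinations(entries, 2), key=lambda p: dist(p[0], p[1]))
    left: List[T] = []
    right: List[T] = []
    for entry in entries:
        # Closeness criterion: each entry joins the nearer seed.
        if dist(entry, seed_a) <= dist(entry, seed_b):
            left.append(entry)
        else:
            right.append(entry)
    return left, right


left, right = split_entries([1.0, 1.5, 2.0, 8.0, 9.0], lambda a, b: abs(a - b))
print(left, right)
```

After such a split, the parent would receive one entry for each of the two resulting nodes.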

13
BIRCH: Rebuilding Algorithm
  • If the CF-tree grows to occupy more space than it
    is allocated, the threshold is increased and the
    CF-tree is rebuilt.
  • CFs of leaf clusters are inserted into the new
    tree. The insertion algorithm is the same as for
    individual objects.

14
BIRCH: Instantiation Summary
  • To instantiate BIRCH we have to define
      • Cluster features at leaf and non-leaf levels.
      • Incremental maintenance of leaf-level CFs and
        updates to non-leaf level CFs when new objects
        are inserted.
      • Distance measures between any two CFs to define
        the notion of closeness.

15
BIRCH: Instantiation of BIRCH
  • CF of a cluster of n k-dimensional vectors
    V1, ..., Vn is defined as (n, LS, SS)
      • n is the number of vectors
      • LS is the sum of vectors
      • SS is the sum of squares of vectors
  • CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
  • This property is used for incrementally maintaining
    cluster features.
  • Distance between two clusters C1 and C2 is
    defined to be the distance between their
    centroids.
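
A minimal sketch of the vector-space CF and its additivity follows. It assumes SS is stored as a single scalar (the sum of squared norms); the class and function names are illustrative and not taken from the presentation.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                  # number of vectors
    ls: Tuple[float, ...]   # linear sum of the vectors
    ss: float               # sum of squared norms (assumed scalar form)

    @staticmethod
    def of(vector: Sequence[float]) -> "CF":
        """CF of a cluster that holds a single vector."""
        v = tuple(float(x) for x in vector)
        return CF(1, v, sum(x * x for x in v))

    def __add__(self, other: "CF") -> "CF":
        # CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
        return CF(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(x / self.n for x in self.ls)


def centroid_distance(c1: CF, c2: CF) -> float:
    """Distance between two clusters as the distance between centroids."""
    return math.dist(c1.centroid(), c2.centroid())


merged = CF.of((0, 0)) + CF.of((2, 0))   # insert one object into a cluster
other = CF.of((4, 4))
print(merged, centroid_distance(merged, other))
```

Because the CF is additive, inserting an object and merging two clusters are the same operation, and neither needs the original vectors.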

16
Arbitrary metric space (AMS): Issues
  • Only operation allowed between objects is the
    distance computation.
  • Specifically, the notion of a centroid of a set
    of objects does not exist.
  • The distance function can be computationally very
    expensive. E.g., the edit distance between
    strings.
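
As an illustration of such a distance function, here is a sketch of the standard Levenshtein edit distance (unit cost for insertion, deletion and substitution). The presentation does not say which edit-distance variant it uses, so the unit costs are an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Each call costs time proportional to the product of the two string lengths, which is why reducing the number of distance calls matters in this setting.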

17
Definitions
  • Given a set O of objects O1, ..., On
  • Row sum of Oi is defined as
      RowSum(O_i) = \sum_{j=1}^{n} d(O_i, O_j)^2
  • Clustroid of O is the object with the least row
    sum value.
  • Clustroid is a concept parallel to that of the
    centroid in the Euclidean space.
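
The clustroid can be computed directly from pairwise distances, as sketched below. The sketch assumes the row sum is the sum of squared distances to all objects in the set, which matches the SS field of the BUBBLE CF; `row_sum` and `clustroid` are illustrative names.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def row_sum(obj: T, objects: Sequence[T], dist: Callable[[T, T], float]) -> float:
    """Sum of squared distances from obj to every object in the set (assumed form)."""
    return sum(dist(obj, other) ** 2 for other in objects)


def clustroid(objects: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Object with the least row sum; uses O(n^2) distance computations."""
    return min(objects, key=lambda o: row_sum(o, objects, dist))


data = [1, 2, 3, 10]
center = clustroid(data, lambda a, b: abs(a - b))
print(center, row_sum(center, data, lambda a, b: abs(a - b)))
```

Only the distance function is used, so the same code works for strings with an edit distance or for any other metric space.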

18
BUBBLE: CF
  • The CF of a set O of objects O1, ..., On is defined
    as (n, O0, SS, R, RS).
      • n: number of objects
      • O0: clustroid
      • SS: sum of squared distances of all objects from
        O0
      • R: set of representative objects (explained
        later)
      • RS: row sum values of the representative objects

19
BUBBLE: Non-leaf CFs
  • Non-leaf CFs direct a new object to an
    appropriate child node.
  • They capture the distribution of objects in the
    sub-tree below them.
  • A set of sample objects randomly collected from
    the sub-tree at a non-leaf entry forms its CF.

20
BUBBLE: Incremental Maintenance (Leaf CF)
  • Types of insertion
      • Type I: insertion of a single object.
      • Type II: insertion of a cluster of objects.
  • Under Type I insertion, the location of the new
    clustroid is within a bounded distance of the old
    clustroid. (The bound depends on the threshold of
    the cluster.)
  • Heuristic 1: Maintain a few objects close to the
    clustroid.

21
BUBBLE: Incremental Maintenance (Leaf CF) (contd.)
  • Under Type II insertions, the location of the new
    clustroid is between the two old clustroids.
  • Heuristic 2: Maintain a few objects farthest from
    the clustroid in the leaf CF.
  • The set of objects maintained at each leaf
    cluster are its representative objects.
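
The two heuristics can be sketched as a selection of the nearest and farthest objects around the clustroid. The parameter `p` (how many objects to keep on each side) and the function name are assumptions; the slides do not state how many representatives are kept.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pick_representatives(
    objects: Sequence[T],
    center: T,
    dist: Callable[[T, T], float],
    p: int,
) -> Tuple[List[T], List[T]]:
    """Return (p objects nearest to center, p objects farthest from it)."""
    ranked = sorted(objects, key=lambda o: dist(o, center))
    near = ranked[:p]                    # Heuristic 1: close to the clustroid
    far = ranked[-p:] if p > 0 else []   # Heuristic 2: farthest from it
    return near, far


near, far = pick_representatives([1, 2, 3, 4, 10], 3, lambda a, b: abs(a - b), p=2)
print(near, far)
```

The near objects are candidates for the new clustroid after single-object insertions, while the far objects cover the case where a merge moves the clustroid towards another cluster.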

22
BUBBLE: Updates to Non-leaf CFs
  • The sample objects at a non-leaf entry are
    updated whenever its child node splits.
  • The distribution of clusters changes
    significantly whenever a node splits.

23
BUBBLE: Distance measures
  • Distance between two leaf level clusters is
    defined to be the distance between their
    clustroids.
  • If C1, C2 are leaf clusters with clustroids O10,
    O20 then
      D(C1, C2) = d(O10, O20)
  • Distance between two non-leaf level clusters C1,
    C2 with sample objects S1, S2 is defined to be the
    average distance between S1 and S2.
      D(C1, C2) = \frac{1}{|S_1| |S_2|} \sum_{x \in S_1} \sum_{y \in S_2} d(x, y)
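
Both distance measures can be sketched in a few lines. The non-leaf measure is taken here as the plain mean of all pairwise distances between the two sample sets, which is one reading of "average distance"; a root-mean-square form over squared distances is another possibility that the slide text does not rule out. The function names are illustrative.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def leaf_distance(clustroid1: T, clustroid2: T, dist: Callable[[T, T], float]) -> float:
    """Leaf level: distance between the two clustroids."""
    return dist(clustroid1, clustroid2)


def nonleaf_distance(
    samples1: Sequence[T],
    samples2: Sequence[T],
    dist: Callable[[T, T], float],
) -> float:
    """Non-leaf level: mean pairwise distance between two non-empty sample sets."""
    total = sum(dist(x, y) for x in samples1 for y in samples2)
    return total / (len(samples1) * len(samples2))


def d(a, b):
    return abs(a - b)


print(leaf_distance(3, 11, d))
print(nonleaf_distance([0, 2], [10, 12], d))
```

The non-leaf measure costs one distance call per pair of samples, which motivates replacing those calls with cheaper ones.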

24
BUBBLE-FM
  • Distance functions in arbitrary metric spaces can
    be computationally expensive.
  • Idea: Map objects into a Euclidean image space and
    use the cheaper Euclidean distance function instead.

25
BUBBLE-FM: Non-leaf CF
  • Map the sample objects S of a non-leaf entry, using
    FastMap, into a k-dimensional Euclidean image space.
  • Each non-leaf CF now contains the centroid of the
    image vectors of its sample objects.
  • New objects are mapped into the image space and
    the Euclidean distance function is used.
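
A compact sketch of the FastMap projection is given below for illustration. It follows the usual description of FastMap (pick two distant pivot objects per axis, project with the cosine law, then continue on residual distances); the pivot search starting from the first object and the function name `fastmap` are simplifications assumed here, and the sketch does not store the pivots needed to map new objects later.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def fastmap(objects: Sequence[T], dist: Callable[[T, T], float], k: int) -> List[List[float]]:
    """Map each object to a k-dimensional vector using only distance calls."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i: int, j: int, dim: int) -> float:
        # Squared distance left over after removing the first `dim` axes.
        d2 = dist(objects[i], objects[j]) ** 2
        for t in range(dim):
            d2 -= (coords[i][t] - coords[j][t]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic (assumed): farthest from object 0, then farthest from that.
        b = max(range(n), key=lambda j: residual_sq(0, j, dim))
        a = max(range(n), key=lambda j: residual_sq(b, j, dim))
        dab_sq = residual_sq(a, b, dim)
        if dab_sq == 0.0:
            break  # all residual distances are zero; remaining axes stay 0
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][dim] = (
                residual_sq(a, i, dim) + dab_sq - residual_sq(b, i, dim)
            ) / (2.0 * dab)
    return coords


image = fastmap([0.0, 3.0, 10.0], lambda x, y: abs(x - y), k=1)
print(image)
```

Once the samples have image vectors, their centroid can be stored in the non-leaf CF and compared with a new object's image by ordinary Euclidean distance.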

26
Scalability
27
Conclusions
  • BIRCH framework for scalable incremental
    clustering algorithms.
  • Instantiation for n-d spaces (BIRCH).
  • Instantiation for AMS (BUBBLE).
  • FastMap to reduce the number of times the
    distance function is called.