Transcript and Presenter's Notes

Title: Clustering


1
Clustering
2
Clustering
  • Clustering refers to methods for grouping
    objects: documents, customers, products,
    markets, and services
  • Clustering is known by a variety of names:
    unsupervised classification, Q analysis,
    typology, numerical taxonomy

3
Similarity Measures
  • The basis of clustering lies in measuring the
    similarity between pairs of objects
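As a sketch, two common similarity measures for real-valued vectors could look like this in Python (the function names are illustrative, not from the slides):

```python
from math import sqrt

def euclidean_similarity(a, b):
    """Similarity derived from Euclidean distance: 1 / (1 + distance)."""
    dist = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (common for documents)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [1, 0])  # identical direction -> 1.0
```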

4
Taxonomy of Clustering Methods
  • Hierarchical
    • Agglomerative
    • Divisive
  • Partitional
  • Sequential or simultaneous procedures
  • Direct or indirect methods

5
(No Transcript)
6
Hierarchical Clustering
  • Build a tree-based hierarchical taxonomy
    (dendrogram) from a set of examples.
  • Recursive application of a standard clustering
    algorithm can produce a hierarchical clustering.

7
Agglomerative vs. Divisive Clustering
  • Agglomerative (bottom-up) methods start with each
    example in its own cluster and iteratively
    combine them to form larger and larger clusters.
  • Divisive (partitional, top-down) methods
    immediately separate all examples into clusters.

8
Direct Clustering Method
  • Direct clustering methods require the desired
    number of clusters, k, to be specified.
  • A clustering evaluation function assigns a
    real-valued quality measure to a clustering.
  • The number of clusters can be determined
    automatically by explicitly generating
    clusterings for multiple values of k and
    choosing the best result according to a
    clustering evaluation function.

9
Indirect Clustering Methods
  • An indirect clustering method is characterized
    by two components: a criterion function and an
    optimization procedure.

10
How Many Clusters?
  • Statistical significance of differences between
    clusters
  • Cluster sizes
  • Meaningful cluster profiles
  • Aggregation or decomposition patterns of clusters
    at different stages of clustering

11
Hierarchical Agglomerative Clustering
  • Assumes a similarity function for determining the
    similarity of two instances.
  • Starts with each instance in its own cluster
    and then repeatedly joins the two most similar
    clusters until only one cluster remains.
  • The history of merging forms a binary tree or
    hierarchy.
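The merging loop just described can be sketched in plain Python. The `hac` function and its use of group-average linkage are illustrative choices, not code from the presentation; the merge history it records is the binary tree mentioned above:

```python
def hac(points, similarity):
    """Repeatedly merge the two most similar clusters until one remains.
    Returns the final tree and the merge history (recorded as nested tuples)."""
    clusters = [[p] for p in points]   # every instance starts in its own cluster
    trees = list(points)               # tree node for each current cluster
    history = []
    while len(clusters) > 1:
        best = None
        # find the most similar pair of clusters (group-average linkage)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged_tree = (trees[i], trees[j])
        history.append(merged_tree)
        merged = clusters[i] + clusters[j]
        # replace the merged pair with the combined cluster
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged_tree]
    return trees[0], history

# 1-D toy data; similarity = negative distance (an assumption for illustration)
tree, history = hac([0.0, 0.1, 5.0], lambda a, b: -abs(a - b))
# the two closest points merge first: history[0] == (0.0, 0.1)
```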

12
Cluster Similarity
  • How do we compute the similarity of two
    clusters, each possibly containing multiple
    instances?
  • Single link: similarity of the two most similar
    members.
  • Complete link: similarity of the two least
    similar members.
  • Group average: average similarity between
    members.
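The three linkage criteria above can be written down directly. A small sketch, assuming 1-D points and similarity defined as negative distance (both assumptions are mine, for illustration):

```python
def single_link(c1, c2, sim):
    """Similarity of the two MOST similar cross-cluster members."""
    return max(sim(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, sim):
    """Similarity of the two LEAST similar cross-cluster members."""
    return min(sim(a, b) for a in c1 for b in c2)

def group_average(c1, c2, sim):
    """Average similarity over all cross-cluster pairs."""
    sims = [sim(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)

sim = lambda a, b: -abs(a - b)
c1, c2 = [0.0, 1.0], [3.0, 10.0]
single_link(c1, c2, sim)    # -2.0  (pair 1.0 and 3.0)
complete_link(c1, c2, sim)  # -10.0 (pair 0.0 and 10.0)
group_average(c1, c2, sim)  # -6.0  (mean of -3, -10, -2, -9)
```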

13
Popular Agglomerative Clustering Procedures
14
Direct Clustering
  • Typically must provide the number of desired
    clusters, k.
  • Randomly choose k instances as seeds, one per
    cluster.
  • Form initial clusters based on these seeds.
  • Iterate, repeatedly reallocating instances to
    different clusters to improve the overall
    clustering.
  • Stop when clustering converges or after a fixed
    number of iterations.

15
K-Means
  • Assumes instances are real-valued vectors.
  • Clusters are represented by centroids: the
    center of gravity, or mean, of the points in a
    cluster c.
  • Reassignment of instances to clusters is based on
    distance to the current cluster centroids.
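A minimal pure-Python sketch of the loop just described; the `kmeans` function and its parameters are illustrative, not code from the slides:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on real-valued vectors (a sketch, not optimized)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # random seed instances
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                         # converged
            break
        centroids = new
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(points, k=2)
# the two tight groups yield centroids (0.0, 0.5) and (10.0, 10.5)
```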

16
Seed Choice
  • Results can vary based on random seed selection.
  • Some seeds can result in poor convergence rate,
    or convergence to sub-optimal clustering.
  • Select good seeds using a heuristic or the
    results of another method.
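One common remedy, sketched below (my assumption, not from the slides), is to run k-means from several random seedings and keep the run with the lowest sum of squared errors (SSE). The helper repeats a plain k-means loop so the example stands alone:

```python
import random

def kmeans_sse(points, k, seed):
    """Run plain k-means from one random seeding; return the final SSE
    (sum of squared distances of points to their assigned centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(100):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
# try several seedings and keep the best (lowest-SSE) clustering
best = min(kmeans_sse(points, k=2, seed=s) for s in range(5))
```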

17
A Hybrid Algorithm
  • Combines HAC and K-Means clustering.
  • First take a random sample of instances of
    size √n.
  • Run group-average HAC on this sample, which takes
    only O(n) time.
  • Use the results of HAC as initial seeds for
    K-means.
  • Overall algorithm is O(n) and avoids problems of
    bad seed selection.