CLUSTERING - PowerPoint PPT Presentation

About This Presentation



There are different techniques for determining when a stable cluster is formed or when the k-means clustering algorithm procedure is completed.

Slides: 28
Provided by: BL46


Transcript and Presenter's Notes


  • Definition of Clustering
  • Existing clustering methods
  • Clustering examples

  • Clustering can be considered the most important
    unsupervised learning problem. Like every
    other problem of this kind, it deals with finding
    a structure in a collection of unlabeled data.
  • Clustering is the process of organizing objects
    into groups whose members are similar in some
    way.
  • A cluster is therefore a collection of objects
    which are similar to one another and
    dissimilar to the objects belonging to other
    clusters.

Why clustering?
  • A few good reasons ...
  • Simplifications
  • Pattern detection
  • Useful in data concept construction
  • Unsupervised learning process

Where to use clustering?
  • Data mining
  • Information retrieval
  • Text mining
  • Web analysis
  • Medical diagnostics

Major existing clustering methods
  • Distance-based
  • Hierarchical
  • Partitioning
  • Probabilistic

Measuring Similarity
  • Dissimilarity/similarity metric: similarity is
    expressed in terms of a distance function d(i, j),
    which is typically a metric.
  • There is a separate quality function that
    measures the goodness of a cluster.
  • The definitions of distance functions are usually
    very different for interval-scaled, boolean,
    categorical, ordinal and ratio variables.
  • Weights should be associated with different
    variables based on applications and data
  • It is hard to define "similar enough" or "good
    enough"; the answer is typically highly subjective.
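
As a rough illustration of the points above, the sketch below shows three common distance functions d(i, j) in Python. The function names, the optional `weights` parameter and the simple-matching measure for categorical data are assumptions made for this example; the slides do not prescribe any particular formula.

```python
import math

def euclidean(a, b, weights=None):
    """Weighted Euclidean distance between two numeric records."""
    weights = weights or [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def manhattan(a, b, weights=None):
    """Weighted Manhattan (city-block) distance between two numeric records."""
    weights = weights or [1.0] * len(a)
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def simple_matching(a, b):
    """Dissimilarity for categorical or boolean records:
    the fraction of attributes on which the two records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(euclidean((0, 0), (3, 4)))                  # 5.0
print(manhattan((0, 0), (3, 4)))                  # 7.0
print(simple_matching(("a", "b"), ("a", "c")))    # 0.5
```

Setting a weight to 0 removes a variable from the comparison, which is one simple way to express that some variables matter more than others for a given application.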

Hierarchical clustering
  • Agglomerative (bottom up)
    • Start with each point as its own cluster (singleton)
    • Recursively merge two or more appropriate clusters
    • Stop when k clusters are reached.
  • Divisive (top down)
    • Start with one big cluster
    • Recursively divide into smaller clusters
    • Stop when k clusters are reached.

General steps of hierarchical clustering
  • Given a set of N items to be clustered, and an
    N×N distance (or similarity) matrix, the basic
    process of hierarchical clustering (defined by
    S.C. Johnson in 1967) is this:
  1. Start by assigning each item to a cluster, so
    that if you have N items, you now have N
    clusters, each containing just one item. Let the
    distances (similarities) between the clusters be
    the same as the distances (similarities) between
    the items they contain.
  2. Find the closest (most similar) pair of clusters
    and merge them into a single cluster, so that now
    you have one cluster less.
  3. Compute distances (similarities) between the new
    cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are
    clustered into K clusters.
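
The steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `agglomerative`, the `linkage` parameter and the four-point test data are assumptions made for the example. Instead of maintaining a cluster-level matrix, the sketch recomputes step 3 from the item-level distance matrix each time, which is simpler but slower.

```python
def agglomerative(dist, k, linkage=min):
    """Merge clusters until only k remain.

    dist    -- symmetric N x N matrix (list of lists) of item distances
    k       -- number of clusters to stop at
    linkage -- how to combine item distances into a cluster distance
               (min gives single linkage, max gives complete linkage)
    """
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        # Step 3, done on demand: distance between two clusters.
        return linkage(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # Step 2: find the closest pair of clusters and merge them.
        _, p, q = min(
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for i, c in enumerate(clusters) if i not in (p, q)]
        clusters.append(merged)
    return clusters

# Hypothetical data: four points on a line at positions 0, 1, 10 and 11.
positions = [0, 1, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
print(agglomerative(dist, k=2))   # [[0, 1], [2, 3]]
```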

  • Exclusive vs. non-exclusive clustering
  • In the first case, data are grouped in an
    exclusive way, so that if a certain datum belongs
    to a definite cluster then it cannot be
    included in another cluster. A simple example of
    that is shown in the figure below, where the
    separation of points is achieved by a straight
    line on a two-dimensional plane.
  • By contrast, the second type, overlapping
    clustering, uses fuzzy sets to cluster data, so
    that each point may belong to two or more
    clusters with different degrees of membership.

Partitioning clustering
  • Divide the data into proper subsets
  • Recursively go through each subset and relocate
    points between clusters (as opposed to the
    visit-once approach of hierarchical clustering)
  • This recursive relocation yields higher-quality
    clusters

Probabilistic clustering
  1. Data are picked from a mixture of probability
    distributions
  2. Use the mean and variance of each distribution as
    parameters for the cluster
  3. Single cluster membership

Single-Linkage Clustering (hierarchical)
  • The N×N proximity matrix is D = [d(i,j)]
  • The clusterings are assigned sequence numbers
    0, 1, ..., (n-1)
  • L(k) is the level of the kth clustering
  • A cluster with sequence number m is denoted (m)
  • The proximity between clusters (r) and (s) is
    denoted d[(r),(s)]

The algorithm is composed of the following steps
  1. Begin with the disjoint clustering having level
    L(0) = 0 and sequence number m = 0.
  2. Find the least dissimilar pair of clusters in the
    current clustering, say pair (r), (s), according
    to d[(r),(s)] = min d[(i),(j)], where the
    minimum is over all pairs of clusters in the
    current clustering.

The algorithm is composed of the following steps (continued)
  3. Increment the sequence number: m = m + 1. Merge
    clusters (r) and (s) into a single cluster to
    form the next clustering m. Set the level of this
    clustering to L(m) = d[(r),(s)].
  4. Update the proximity matrix, D, by deleting the
    rows and columns corresponding to clusters (r)
    and (s) and adding a row and column corresponding
    to the newly formed cluster. The proximity
    between the new cluster, denoted (r,s), and an old
    cluster (k) is defined in this way:
    d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }.
  5. If all objects are in one cluster, stop. Else, go
    to step 2.
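
The five steps above might be sketched in Python like this. The function name `single_linkage`, the use of a dictionary keyed by `frozenset` pairs as the proximity matrix D, and the four-object test data are all assumptions made for this illustration; labels are assumed not to contain "/" already. The sketch records the sequence number m, the level L(m) and the name of each merged cluster.

```python
def single_linkage(labels, dist):
    """Single-linkage clustering that records every merge.

    labels -- list of object names
    dist   -- dict mapping frozenset({a, b}) to the distance d(a, b)
    Returns a list of (m, L(m), merged_cluster_name) tuples.
    """
    d = dict(dist)                 # working copy of the proximity matrix D
    clusters = list(labels)
    m = 0
    history = [(0, 0, None)]       # step 1: L(0) = 0, m = 0

    while len(clusters) > 1:       # step 5: stop when one cluster is left
        # Step 2: least dissimilar pair (r), (s) in the current clustering.
        r, s = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        level = d[frozenset((r, s))]

        # Step 3: increment m and merge; L(m) = d[(r),(s)].
        m += 1
        merged = "/".join(sorted(r.split("/") + s.split("/")))

        # Step 4: d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]) for each old (k).
        for k in clusters:
            if k not in (r, s):
                d[frozenset((k, merged))] = min(
                    d[frozenset((k, r))], d[frozenset((k, s))]
                )
        clusters = [c for c in clusters if c not in (r, s)] + [merged]
        history.append((m, level, merged))
    return history

# Hypothetical distances between four objects A, B, C and D.
toy = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "C")): 5,
    frozenset(("B", "D")): 9, frozenset(("C", "D")): 4,
}
for step in single_linkage(["A", "B", "C", "D"], toy):
    print(step)
# (0, 0, None)
# (1, 2, 'A/B')
# (2, 4, 'C/D')
# (3, 5, 'A/B/C/D')
```

Stale entries for merged clusters are left in the dictionary rather than deleted; they are never looked up again, so this is equivalent to removing the rows and columns for (r) and (s).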

Hierarchical clustering example
  • Let's now see a simple example: a hierarchical
    clustering of distances in kilometers between
    some Italian cities. The method used is
    single-linkage.
  • Input distance matrix (L = 0 for all the
    clusters)

  • The nearest pair of cities is MI and TO, at
    distance 138. These are merged into a single
    cluster called "MI/TO". The level of the new
    cluster is L(MI/TO) = 138 and the new sequence
    number is m = 1. Then we compute the distance
    from this new compound object to all other
    objects. In single-link clustering the rule is
    that the distance from the compound object to
    another object is equal to the shortest distance
    from any member of the cluster to the outside
    object. So the distance from "MI/TO" to RM is
    chosen to be 564, which is the distance from MI
    to RM, and so on.

  • After merging MI with TO we obtain the following

  • min d(i,j) = d(NA,RM) = 219 => merge NA and RM
    into a new cluster called NA/RM.
    L(NA/RM) = 219, m = 2

  • min d(i,j) = d(BA,NA/RM) = 255 => merge BA and
    NA/RM into a new cluster called BA/NA/RM.
    L(BA/NA/RM) = 255, m = 3

  • min d(i,j) = d(BA/NA/RM,FI) = 268 => merge
    BA/NA/RM and FI into a new cluster called
    BA/FI/NA/RM. L(BA/FI/NA/RM) = 268, m = 4

  • Finally, we merge the last two clusters at level
  • The process is summarized by the following
    hierarchical tree

K-means algorithm
  1. It accepts the number of clusters to group data
    into, and the dataset to cluster, as input values.
  2. It then creates the first K initial clusters (K =
    number of clusters needed) from the dataset by
    choosing K rows of data randomly from the
    dataset. For example, if there are 10,000 rows of
    data in the dataset and 3 clusters need to be
    formed, then the first K = 3 initial clusters will
    be created by selecting 3 records randomly from
    the dataset as the initial clusters. Each of the
    3 initial clusters formed will have just one row
    of data.

  3. The K-Means algorithm calculates the
    arithmetic mean of each cluster formed in the
    dataset. The arithmetic mean of a cluster is the
    mean of all the individual records in the
    cluster. In each of the first K initial
    clusters, there is only one record. The
    arithmetic mean of a cluster with one record is
    the set of values that make up that record. For
    example, if the dataset we are discussing is a set
    of Height, Weight and Age measurements for
    students in a university, where a record P in the
    dataset S is represented by a Height, Weight and
    Age measurement, then P = {Age, Height,
    Weight}. A record containing
    the measurements of a student John would then be
    represented as John = {20, 170, 80}, where John's
    Age = 20 years, Height = 170 centimeters (1.70
    meters) and Weight = 80 pounds. Since there is
    only one record in each initial cluster, the
    arithmetic mean of a cluster with only the record
    for John as a member is {20, 170, 80}.
  4. Next, K-Means assigns each record in the dataset
    to only one of the initial clusters. Each record
    is assigned to the nearest cluster (the cluster
    which it is most similar to) using a measure of
    distance or similarity like the Euclidean
    Distance Measure or Manhattan/City-Block Distance
    Measure.
  5. K-Means re-assigns each record in
    the dataset to the most similar cluster
    and re-calculates the arithmetic mean of all the
    clusters in the dataset. The arithmetic mean of a
    cluster is the arithmetic mean of all the records
    in that cluster. For example, if a cluster
    contains two records, John = {20, 170, 80} and
    Henry = {30, 160, 120}, then the arithmetic mean
    Pmean is represented as Pmean = {Agemean,
    Heightmean, Weightmean}, where Agemean = (20 + 30)/2,
    Heightmean = (170 + 160)/2 and Weightmean = (80 +
    120)/2. The arithmetic mean of this cluster is
    {25, 165, 100}. This new arithmetic mean becomes
    the center of this new cluster. Following the
    same procedure, new cluster centers are formed
    for all the existing clusters.
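
The arithmetic-mean calculation for John and Henry can be checked with a few lines of Python. The helper name `centroid` is an assumption made for this sketch; records are plain tuples in the order (Age, Height, Weight).

```python
def centroid(records):
    """Arithmetic mean of a cluster, computed attribute by attribute."""
    n = len(records)
    return tuple(sum(values) / n for values in zip(*records))

john = (20, 170, 80)     # (Age, Height, Weight)
henry = (30, 160, 120)

print(centroid([john]))          # (20.0, 170.0, 80.0)
print(centroid([john, henry]))   # (25.0, 165.0, 100.0)
```

The first call shows the single-record case: the mean of a cluster with one member is that member itself.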

  6. K-Means re-assigns each record in the dataset to
    only one of the new clusters formed. A record or
    data point is assigned to the nearest cluster
    (the cluster which it is most similar to) using a
    measure of distance or similarity.
  7. The preceding steps are repeated until stable
    clusters are formed and the K-Means clustering
    procedure is completed. Stable clusters are
    formed when new iterations or repetitions of the
    K-Means clustering algorithm do not create new
    clusters, because the cluster center or arithmetic
    mean of each cluster formed is the same as the old
    cluster center. There are different techniques
    for determining when a stable cluster is formed
    or when the k-means clustering algorithm
    procedure is completed.
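
Putting the whole procedure together, a minimal K-Means sketch in Python might look like the following. It assumes Euclidean distance and uses "no cluster center changed" as the stopping rule, which is only one of the possible techniques mentioned above. The function name `k_means`, the `max_iter` and `seed` parameters and the six sample points are assumptions made for this illustration.

```python
import math
import random

def k_means(records, k, max_iter=100, seed=0):
    """Cluster numeric records (tuples) into k groups.

    Returns (centers, assignment), where assignment[i] is the index
    of the cluster that records[i] belongs to.
    """
    rng = random.Random(seed)
    # Pick K rows at random; each one is the center of a one-record cluster.
    centers = [tuple(float(v) for v in r) for r in rng.sample(records, k)]

    assignment = []
    for _ in range(max_iter):
        # Assign every record to its nearest center (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(r, centers[c]))
            for r in records
        ]
        # Recompute each center as the arithmetic mean of its members.
        new_centers = []
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        # Stable clusters: the new centers equal the old ones.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(sorted(centers))
print(assignment)
```

For this sample data the two centers end up near (1.33, 1.33) and (10.33, 10.33), and the first three points share one cluster while the last three share the other. Which cluster receives index 0 depends on the random initial choice.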