Chapter 5: Clustering
1
Chapter 5: Clustering
2
Searching for groups
  • Clustering is unsupervised or undirected.
  • Unlike classification, clustering uses no
    pre-classified (labeled) data.
  • Search for groups or clusters of data points
    (records) that are similar to one another.
  • Similar points may represent similar customers or
    products that will behave in similar ways.

3
Group similar points together
  • Group points into classes using some distance
    measures.
  • Within-cluster distance and between-cluster
    distance
  • Applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

4
An Illustration
5
Examples of Clustering Applications
  • Marketing: Help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Insurance: Identifying groups of motor insurance
    policy holders with some interesting
    characteristics.
  • City-planning: Identifying groups of houses
    according to their house type, value, and
    geographical location

6
Concepts of Clustering
  • Clusters
  • Different ways of representing clusters
  • Division with boundaries
  • Spheres
  • Probabilistic
  • Dendrograms

(Figure: example cluster representations; e.g., an item I_n with probabilistic membership 0.5, 0.2, 0.3 in clusters 1, 2, 3)
7
Clustering
  • Clustering quality
  • Inter-cluster distance → maximized
  • Intra-cluster distance → minimized
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its application.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns
  • Clustering vs. classification
  • Which one is more difficult? Why?
  • There are a huge number of clustering techniques.

8
Dissimilarity/Distance Measure
  • Dissimilarity/Similarity metric: Similarity is
    expressed in terms of a distance function, which
    is typically a metric d(i, j)
  • The definitions of distance functions are usually
    very different for interval-scaled, boolean,
    categorical, ordinal and ratio variables.
  • Weights should be associated with different
    variables based on applications and data
    semantics.
  • It is hard to define "similar enough" or "good
    enough". The answer is typically highly
    subjective.

9
Types of data in clustering analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

10
Interval-valued variables
  • Continuous measurements on a roughly linear
    scale, e.g., weight, height, temperature, etc.
  • Standardize data (depending on applications)
  • Calculate the mean absolute deviation:
    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  • where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
  • Calculate the standardized measurement (z-score):
    z_if = (x_if - m_f) / s_f
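
A minimal sketch of this standardization in Python/NumPy; the function name standardize and the sample data are illustrative, not from the slides:

    import numpy as np

    def standardize(X):
        """Standardize interval-scaled columns using the mean absolute deviation.

        X: array of shape (n_objects, n_variables).
        Returns z with z[i, f] = (x[i, f] - m_f) / s_f, where s_f is the mean
        absolute deviation of column f (more robust to outliers than the
        standard deviation).
        """
        X = np.asarray(X, dtype=float)
        m = X.mean(axis=0)                      # m_f: per-variable mean
        s = np.abs(X - m).mean(axis=0)          # s_f: mean absolute deviation
        return (X - m) / s

    # Example: weight (kg) and height (cm) on very different scales
    X = [[60.0, 160.0], [75.0, 180.0], [90.0, 170.0]]
    print(standardize(X))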

11
Similarity Between Objects
  • Distance Measure the similarity or dissimilarity
    between two data objects
  • Some popular ones include the Minkowski distance:
    d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
  • where (x_i1, x_i2, ..., x_ip) and (x_j1, x_j2, ..., x_jp)
    are two p-dimensional data objects, and q is a
    positive integer
  • If q = 1, d is the Manhattan distance:
    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

12
Similarity Between Objects (Cont.)
  • If q = 2, d is the Euclidean distance:
    d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
  • Properties
  • d(i, j) >= 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) <= d(i, k) + d(k, j)
  • Also, one can use weighted distance, and many
    other similarity/distance measures.
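
A short illustrative sketch of these distances in Python; the helper name minkowski and the sample points are assumptions, not from the slides:

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional points (q >= 1).

        q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
        """
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return (np.abs(x - y) ** q).sum() ** (1.0 / q)

    x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
    print(minkowski(x, y, q=1))   # Manhattan: 3 + 4 + 0 = 7
    print(minkowski(x, y, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5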

13
Binary Variables
  • A contingency table for binary data, where a, b,
    c, d count the variables that are 1/1, 1/0, 0/1
    and 0/0 for objects i and j, and p = a + b + c + d:

                      Object j
                       1      0     sum
     Object i    1     a      b     a+b
                 0     c      d     c+d
                sum   a+c    b+d     p

  • Simple matching coefficient (invariant, if the
    binary variable is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
  • Jaccard coefficient (noninvariant if the binary
    variable is asymmetric):
    d(i, j) = (b + c) / (a + b + c)
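
A minimal sketch of both coefficients in Python, assuming 0/1 vectors for the two objects (the function and variable names are illustrative):

    import numpy as np

    def binary_dissimilarity(i_vec, j_vec, symmetric=True):
        """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity.

        i_vec, j_vec: 0/1 arrays of the same length.
        """
        i_vec, j_vec = np.asarray(i_vec), np.asarray(j_vec)
        a = np.sum((i_vec == 1) & (j_vec == 1))   # both 1
        b = np.sum((i_vec == 1) & (j_vec == 0))   # 1 in i only
        c = np.sum((i_vec == 0) & (j_vec == 1))   # 1 in j only
        d = np.sum((i_vec == 0) & (j_vec == 0))   # both 0
        if symmetric:
            return (b + c) / (a + b + c + d)      # simple matching coefficient
        return (b + c) / (a + b + c)              # Jaccard coefficient

    print(binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0], symmetric=False))  # 2/3
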
14
Dissimilarity of Binary Variables
  • Example
  • gender is a symmetric attribute (not used below)
  • the remaining attributes are asymmetric
    attributes
  • let the values Y and P be set to 1, and the value
    N be set to 0

15
Nominal Variables
  • A generalization of the binary variable in that
    it can take more than 2 states, e.g., red,
    yellow, blue, green, etc
  • Method 1: Simple matching
    d(i, j) = (p - m) / p, where m = number of matches
    and p = total number of variables
  • Method 2: use a large number of binary variables
  • creating a new binary variable for each of the M
    nominal states
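
A tiny sketch of Method 1 in Python; the function name nominal_dissimilarity and the sample values are illustrative:

    def nominal_dissimilarity(obj_i, obj_j):
        """Simple-matching dissimilarity for nominal variables: (p - m) / p."""
        p = len(obj_i)                                   # total number of variables
        m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
        return (p - m) / p

    print(nominal_dissimilarity(["red", "circle", "small"],
                                ["red", "square", "large"]))  # (3 - 1) / 3 = 2/3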

16
Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled (f is a
    variable)
  • replace x_if by its rank r_if in {1, ..., M_f}
  • map the range of each variable onto [0, 1] by
    replacing the i-th object in the f-th variable by
    z_if = (r_if - 1) / (M_f - 1)
  • compute the dissimilarity using methods for
    interval-scaled variables
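
A small sketch of this rank-then-rescale step in Python, assuming the user supplies the ordered list of categories (all names illustrative):

    def ordinal_to_interval(values, order):
        """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
        rank = {state: r + 1 for r, state in enumerate(order)}   # ranks 1..M_f
        m_f = len(order)
        return [(rank[v] - 1) / (m_f - 1) for v in values]

    print(ordinal_to_interval(["bronze", "gold", "silver"],
                              order=["bronze", "silver", "gold"]))  # [0.0, 1.0, 0.5]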

17
Ratio-Scaled Variables
  • Ratio-scaled variable: a measurement on a
    nonlinear scale, approximately at exponential
    scale, such as A e^(Bt) or A e^(-Bt), e.g., growth
    of a bacteria population.
  • Methods
  • treat them like interval-scaled variables: not a
    good idea! (why? the scale can be distorted)
  • apply logarithmic transformation
    y_if = log(x_if)
  • treat them as continuous ordinal data and then
    treat their ranks as interval-scaled

18
Variables of Mixed Types
  • A database may contain all six types of variables
  • symmetric binary, asymmetric binary, nominal,
    ordinal, interval and ratio
  • One may use a weighted formula to combine their
    effects:
    d(i, j) = ( sum_f delta_ij(f) * d_ij(f) ) / ( sum_f delta_ij(f) ),
    where the indicator delta_ij(f) is 1 if variable f
    contributes to the comparison of i and j (e.g., it
    is 0 when a value is missing)
  • f is binary or nominal:
    d_ij(f) = 0 if x_if = x_jf, and d_ij(f) = 1 otherwise
  • f is interval-based: use the normalized distance
  • f is ordinal or ratio-scaled:
  • compute ranks r_if and z_if = (r_if - 1) / (M_f - 1),
    and treat z_if as interval-scaled
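
An illustrative sketch of this combined dissimilarity in Python, assuming each object is a dict of already-prepared values and a types dict says how to compare each variable (all names are hypothetical; interval values are assumed pre-normalized to [0, 1]):

    import math

    def mixed_dissimilarity(obj_i, obj_j, types):
        """Weighted combination d(i, j) = sum_f delta_f * d_f / sum_f delta_f.

        types[f] is one of "nominal", "binary", or "interval"; ordinal and
        ratio variables are assumed to have been converted to [0, 1] first,
        as on the previous slides.
        """
        num, den = 0.0, 0.0
        for f, kind in types.items():
            xi, xj = obj_i.get(f), obj_j.get(f)
            if xi is None or xj is None:          # delta_f = 0: skip missing values
                continue
            if kind in ("nominal", "binary"):
                d_f = 0.0 if xi == xj else 1.0
            else:                                  # interval-based, normalized
                d_f = abs(xi - xj)
            num += d_f
            den += 1.0
        return num / den if den else math.nan

    obj_a = {"color": "red",  "smoker": 1, "height": 0.2}
    obj_b = {"color": "blue", "smoker": 1, "height": 0.5}
    print(mixed_dissimilarity(obj_a, obj_b,
                              {"color": "nominal", "smoker": "binary",
                               "height": "interval"}))   # (1 + 0 + 0.3) / 3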

19
Major Clustering Techniques
  • Partitioning algorithms: Construct various
    partitions and then evaluate them by some
    criterion
  • Hierarchy algorithms: Create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
  • Density-based: based on connectivity and density
    functions
  • Model-based: A model is hypothesized for each of
    the clusters, and the idea is to find the best fit
    of the data to the given model.

20
Partitioning Algorithms Basic Concept
  • Partitioning method: Construct a partition of a
    database D of n objects into a set of k clusters
  • Given k, find a partition of k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods: k-means and k-medoids
    algorithms
  • k-means: Each cluster is represented by the
    center (mean) of the cluster
  • k-medoids or PAM (Partitioning Around Medoids):
    Each cluster is represented by one of the objects
    in the cluster

21
The K-Means Clustering
  • Given k, the k-means algorithm is as follows:
  • 1) Choose k cluster centers to coincide with k
    randomly-chosen points
  • 2) Assign each data point to the closest cluster
    center
  • 3) Recompute the cluster centers using the current
    cluster memberships
  • 4) If a convergence criterion is not met, go to 2)
  • Typical convergence criteria are no (or minimal)
    reassignment of data points to new cluster
    centers, or minimal decrease in squared error.

Squared error: E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2, where p is a point and m_i is the mean of cluster C_i
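
A minimal k-means sketch in Python/NumPy following these steps; the function name kmeans and its parameters are illustrative, and the optional init_centers argument (not part of the slide's algorithm) only makes the example on the next slide reproducible:

    import numpy as np

    def kmeans(X, k, init_centers=None, max_iter=100, seed=0):
        """Basic k-means following the slide's steps; returns (labels, centers, sse)."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)                        # treat 1-D data as n x 1
        rng = np.random.default_rng(seed)
        if init_centers is None:
            # 1) choose k cluster centers to coincide with k randomly-chosen points
            centers = X[rng.choice(len(X), size=k, replace=False)]
        else:
            centers = np.asarray(init_centers, dtype=float).reshape(k, -1)
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # 2) assign each data point to the closest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # 4) stop when no point changes cluster (convergence criterion)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # 3) recompute the cluster centers from the current memberships
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        sse = ((X - centers[labels]) ** 2).sum()        # aggregate squared error E
        return labels, centers, sse
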
22
Example
  • For simplicity, 1-dimensional data and k = 2.
  • data: 1, 2, 5, 6, 7
  • K-means:
  • Randomly select 5 and 6 as initial centroids
  • => Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity (sum of squared errors):
    0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5
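
For instance, running the illustrative kmeans sketch from the previous slide on this data with the same initial centroids reproduces the result (assuming that sketch is available):

    labels, centers, sse = kmeans([1, 2, 5, 6, 7], k=2, init_centers=[[5.0], [6.0]])
    print(labels)    # [0 0 1 1 1]  -> clusters {1, 2} and {5, 6, 7}
    print(centers)   # [[1.5] [6.0]]
    print(sse)       # 2.5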

23
Comments on K-Means
  • Strength: efficient, O(tkn), where n is the number
    of data points, k the number of clusters, and t
    the number of iterations. Normally, k, t << n.
  • Comment: Often terminates at a local optimum. The
    global optimum may be found using techniques such
    as deterministic annealing and genetic
    algorithms
  • Weakness
  • Applicable only when mean is defined, difficult
    for categorical data
  • Need to specify k, the number of clusters, in
    advance
  • Sensitive to noisy data and outliers
  • Not suitable to discover clusters with non-convex
    shapes
  • Sensitive to initial seeds

24
Variations of the K-Means Method
  • A few variants of the k-means method differ in
  • Selection of the initial k seeds
  • Dissimilarity measures
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with
    categorical objects
  • Using a frequency based method to update modes of
    clusters

25
k-Medoids clustering method
  • The k-means algorithm is sensitive to outliers
  • Since an object with an extremely large value may
    substantially distort the distribution of the
    data.
  • Medoid: the most centrally located point in a
    cluster, used as a representative point of the
    cluster.
  • An example
  • In contrast, a centroid is not necessarily inside
    a cluster.

(Figure: initial medoids)
26
Partition Around Medoids
  • PAM
  • Given k:
  • 1) Randomly pick k instances as initial medoids
  • 2) Assign each data point to the nearest medoid x
  • 3) Calculate the objective function:
  • the sum of dissimilarities of all points to their
    nearest medoids (squared-error criterion)
  • 4) Randomly select a non-medoid point y
  • 5) Swap x with y if the swap reduces the objective
    function
  • 6) Repeat steps 2)-5) until no change
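
A compact, simplified PAM sketch in Python/NumPy; the function name pam, the Euclidean pairwise distances, and the greedy scan over every (medoid, non-medoid) pair (instead of a single random y) are all assumptions made for brevity:

    import numpy as np

    def pam(X, k, max_iter=100, seed=0):
        """Simplified PAM: swap medoids with non-medoids while the total
        dissimilarity of points to their nearest medoid decreases."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        n = len(X)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(n, size=k, replace=False))          # initial medoids

        def cost(meds):
            # objective: sum of dissimilarities of all points to their nearest medoid
            return dist[:, meds].min(axis=1).sum()

        best = cost(medoids)
        for _ in range(max_iter):
            improved = False
            for xi in range(k):                         # position of current medoid x
                for y in range(n):                      # candidate non-medoid point y
                    if y in medoids:
                        continue
                    trial = medoids.copy()
                    trial[xi] = y                       # swap x with y ...
                    c = cost(trial)
                    if c < best:                        # ... if the objective decreases
                        medoids, best, improved = trial, c, True
            if not improved:                            # stop when no change
                break
        labels = dist[:, medoids].argmin(axis=1)        # assign to nearest medoid
        return labels, X[medoids], best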

27
Comments on PAM
(Figure: an outlier 100 units away)
  • PAM is more robust than k-means in the presence
    of noise and outliers, because a medoid is less
    influenced by outliers or other extreme values
    than a mean (why?)
  • PAM works well for small data sets but does not
    scale well to large data sets:
  • O(k(n - k)^2) per iteration,
  • where n is the number of data points and k is the
    number of clusters

28
CLARA Clustering Large Applications
  • CLARA: built into statistical analysis packages,
    such as S+
  • It draws multiple samples of the data set,
    applies PAM on each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weakness
  • Efficiency depends on the sample size
  • A good clustering based on samples will not
    necessarily represent a good clustering of the
    whole data set if the sample is biased
  • There are other scale-up methods, e.g., CLARANS
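
An illustrative toy version of the CLARA idea, reusing the pam sketch from the previous slides (an assumption-laden sketch, not the packaged CLARA algorithm): draw several random samples, run PAM on each, and keep the medoids with the lowest cost on the full data set.

    import numpy as np

    def clara(X, k, n_samples=5, sample_size=40, seed=0):
        """Toy CLARA: PAM on random samples, keep the best medoids overall."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        rng = np.random.default_rng(seed)
        best_medoids, best_cost = None, np.inf
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
            _, medoids, _ = pam(X[idx], k)              # cluster the sample with PAM
            # evaluate these medoids on the whole data set
            d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
            cost = d.min(axis=1).sum()
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        labels = np.linalg.norm(X[:, None, :] - best_medoids[None, :, :],
                                axis=2).argmin(axis=1)
        return labels, best_medoids, best_cost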

29
Hierarchical Clustering
  • Uses a distance matrix for clustering. This method
    does not require the number of clusters k as an
    input, but needs a termination condition

30
Agglomerative Clustering
  • At the beginning, each data point forms a cluster
    (also called a node).
  • Merge nodes/clusters that have the least
    dissimilarity.
  • Go on merging
  • Eventually all nodes belong to the same cluster

31
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decomposes data objects into several levels of
nested partitioning (a tree of clusters), called a
dendrogram. A clustering of the data objects is
obtained by cutting the dendrogram at the desired
level; then each connected component forms a
cluster.
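
A brief sketch of agglomerative clustering and dendrogram cutting using SciPy (a library choice not mentioned in the slides; the toy data is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Toy 2-D data: two loose groups
    data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                     [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

    Z = linkage(data, method="average")              # agglomerative merging (average linkage)
    labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the dendrogram into 2 clusters
    print(labels)                                    # e.g. [1 1 1 2 2 2]

    # dendrogram(Z)  # with matplotlib available, this draws the merge tree
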
32
Divisive Clustering
  • Inverse order of agglomerative clustering
  • Eventually each node forms a cluster on its own

33
More on Hierarchical Methods
  • Major weaknesses of agglomerative clustering
    methods:
  • do not scale well: time complexity at least
    O(n^2), where n is the total number of objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering to scale up these clustering methods:
  • BIRCH (1996): uses a CF-tree and incrementally
    adjusts the quality of sub-clusters
  • CURE (1998): selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction

34
Summary
  • Cluster analysis groups objects based on their
    similarity and has wide applications
  • Measure of similarity can be computed for various
    types of data
  • Clustering algorithms can be categorized into
    partitioning methods, hierarchical methods,
    density-based methods, etc
  • Clustering can also be used for outlier detection,
    which is useful for fraud detection
  • What is the best clustering algorithm?

35
Other Data Mining Methods
36
Sequence analysis
  • Market basket analysis analyzes things that
    happen at the same time.
  • What about things that happen over time?
  • E.g., if a customer buys a bed, he/she is likely
    to come back to buy a mattress later
  • Sequential analysis needs
  • A time stamp for each data record
  • Customer identification

37
Sequence analysis (cont )
  • The analysis shows which items come before, after,
    or at the same time as other items.
  • Sequential patterns can be used for analyzing
    cause and effect.
  • Other applications
  • Finding cycles in association rules
  • Some association rules hold strongly in certain
    periods of time
  • E.g., every Monday people buy item X and Y
    together
  • Stock market prediction
  • Predicting possible failures in networks, etc.

38
Discovering holes in data
  • Holes are empty (sparse) regions in the data
    space that contain few or no data points. Holes
    may represent impossible value combinations in
    the application domain.
  • E.g., in a disease database, we may find that
    certain test values and/or symptoms do not go
    together, or that when a certain medicine is used,
    some test values never go beyond a certain range.
  • Such information could lead to a significant
    discovery: a cure for a disease or some biological
    law.

39
Data and pattern visualization
  • Data visualization: Use computer graphics effects
    to reveal the patterns in data:
  • 2-D, 3-D scatter plots, bar charts, pie charts,
    line plots, animation, etc.
  • Pattern visualization: Use good interfaces and
    graphics to present the results of data mining:
  • Rule visualizers, cluster visualizers, etc.

40
Scaling up data mining algorithms
  • Adapt data mining algorithms to work on very
    large databases.
  • Data reside on hard disk (too large to fit in
    main memory)
  • Make fewer passes over the data
  • Quadratic algorithms are too expensive
  • Many data mining algorithms are quadratic,
    especially clustering algorithms.