Chapter 5: Clustering
1
Chapter 5: Clustering
2
Searching for groups
  • Clustering is unsupervised or undirected.
  • Unlike classification, clustering uses no
    pre-classified (labeled) data.
  • Search for groups or clusters of data points
    (records) that are similar to one another.
  • Similar points may represent similar customers or
    products that will behave in similar ways.

3
Group similar points together
  • Group points into classes using some distance
    measures.
  • Within-cluster distance and between-cluster
    distance
  • Applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

4
An Illustration
5
Examples of Clustering Applications
  • Marketing: Help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Insurance: Identifying groups of motor insurance
    policy holders with some interesting
    characteristics.
  • City-planning: Identifying groups of houses
    according to their house type, value, and
    geographical location

6
Concepts of Clustering
  • Clusters
  • Different ways of representing clusters
  • Division with boundaries
  • Spheres
  • Probabilistic
  • Dendrograms

(Figure: example cluster representations; e.g., an item I_n with probabilistic membership 0.5, 0.2, 0.3 in clusters 1, 2, 3)
7
Clustering
  • Clustering quality
  • Inter-cluster distance → maximized
  • Intra-cluster distance → minimized
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its application.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns
  • Clustering vs. classification
  • Which one is more difficult? Why?
  • There are a huge number of clustering techniques.

8
Dissimilarity/Distance Measure
  • Dissimilarity/Similarity metric: Similarity is
    expressed in terms of a distance function, which
    is typically a metric d(i, j)
  • The definitions of distance functions are usually
    very different for interval-scaled, boolean,
    categorical, ordinal and ratio variables.
  • Weights should be associated with different
    variables based on applications and data
    semantics.
  • It is hard to define "similar enough" or "good
    enough". The answer is typically highly
    subjective.

9
Types of data in clustering analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

10
Interval-valued variables
  • Continuous measurements on a roughly linear
    scale, e.g., weight, height, temperature, etc.
  • Standardize data (depending on applications)
  • Calculate the mean absolute deviation:
    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  • where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
  • Calculate the standardized measurement (z-score):
    z_if = (x_if - m_f) / s_f
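
A minimal sketch of this standardization in Python/NumPy; the function name standardize and the sample data are illustrative, not from the slides:

    import numpy as np

    def standardize(X):
        """Standardize interval-scaled columns using the mean absolute deviation.

        X: array of shape (n_objects, n_variables).
        Returns z with z[i, f] = (x[i, f] - m_f) / s_f, where s_f is the mean
        absolute deviation of column f (more robust to outliers than the
        standard deviation).
        """
        X = np.asarray(X, dtype=float)
        m = X.mean(axis=0)                      # m_f: per-variable mean
        s = np.abs(X - m).mean(axis=0)          # s_f: mean absolute deviation
        return (X - m) / s

    # Example: weight (kg) and height (cm) on very different scales
    X = [[60.0, 160.0], [75.0, 180.0], [90.0, 170.0]]
    print(standardize(X))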

11
Similarity Between Objects
  • Distance Measure the similarity or dissimilarity
    between two data objects
  • Some popular ones include the Minkowski distance:
    d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
  • where (x_i1, x_i2, ..., x_ip) and (x_j1, x_j2, ..., x_jp)
    are two p-dimensional data objects, and q is a
    positive integer
  • If q = 1, d is the Manhattan distance:
    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

12
Similarity Between Objects (Cont.)
  • If q = 2, d is the Euclidean distance:
    d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
  • Properties
  • d(i, j) >= 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) <= d(i, k) + d(k, j)
  • Also, one can use weighted distance, and many
    other similarity/distance measures.
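
A short illustrative sketch of these distances in Python; the helper name minkowski and the sample points are assumptions, not from the slides:

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional points (q >= 1).

        q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
        """
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return (np.abs(x - y) ** q).sum() ** (1.0 / q)

    x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
    print(minkowski(x, y, q=1))   # Manhattan: 3 + 4 + 0 = 7
    print(minkowski(x, y, q=2))   # Euclidean: sqrt(9 + 16 + 0) = 5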

13
Binary Variables
  • A contingency table for binary data, where a, b,
    c, d count the variables that are 1/1, 1/0, 0/1
    and 0/0 for objects i and j, and p = a + b + c + d:

                      Object j
                       1      0     sum
     Object i    1     a      b     a+b
                 0     c      d     c+d
                sum   a+c    b+d     p

  • Simple matching coefficient (invariant, if the
    binary variable is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
  • Jaccard coefficient (noninvariant if the binary
    variable is asymmetric):
    d(i, j) = (b + c) / (a + b + c)
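
A minimal sketch of both coefficients in Python, assuming 0/1 vectors for the two objects (the function and variable names are illustrative):

    import numpy as np

    def binary_dissimilarity(i_vec, j_vec, symmetric=True):
        """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity.

        i_vec, j_vec: 0/1 arrays of the same length.
        """
        i_vec, j_vec = np.asarray(i_vec), np.asarray(j_vec)
        a = np.sum((i_vec == 1) & (j_vec == 1))   # both 1
        b = np.sum((i_vec == 1) & (j_vec == 0))   # 1 in i only
        c = np.sum((i_vec == 0) & (j_vec == 1))   # 1 in j only
        d = np.sum((i_vec == 0) & (j_vec == 0))   # both 0
        if symmetric:
            return (b + c) / (a + b + c + d)      # simple matching coefficient
        return (b + c) / (a + b + c)              # Jaccard coefficient

    print(binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0], symmetric=False))  # 2/3
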
14
Dissimilarity of Binary Variables
  • Example
  • gender is a symmetric attribute (not used below)
  • the remaining attributes are asymmetric
    attributes
  • let the values Y and P be set to 1, and the value
    N be set to 0

15
Nominal Variables
  • A generalization of the binary variable in that
    it can take more than 2 states, e.g., red,
    yellow, blue, green, etc
  • Method 1: Simple matching
    d(i, j) = (p - m) / p, where m = number of matches
    and p = total number of variables
  • Method 2: use a large number of binary variables
  • creating a new binary variable for each of the M
    nominal states
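
A tiny sketch of Method 1 in Python; the function name nominal_dissimilarity and the sample values are illustrative:

    def nominal_dissimilarity(obj_i, obj_j):
        """Simple-matching dissimilarity for nominal variables: (p - m) / p."""
        p = len(obj_i)                                   # total number of variables
        m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
        return (p - m) / p

    print(nominal_dissimilarity(["red", "circle", "small"],
                                ["red", "square", "large"]))  # (3 - 1) / 3 = 2/3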

16
Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled (f is a
    variable)
  • replace x_if by its rank r_if in {1, ..., M_f}
  • map the range of each variable onto [0, 1] by
    replacing the i-th object in the f-th variable by
    z_if = (r_if - 1) / (M_f - 1)
  • compute the dissimilarity using methods for
    interval-scaled variables
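
A small sketch of this rank-then-rescale step in Python, assuming the user supplies the ordered list of categories (all names illustrative):

    def ordinal_to_interval(values, order):
        """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
        rank = {state: r + 1 for r, state in enumerate(order)}   # ranks 1..M_f
        m_f = len(order)
        return [(rank[v] - 1) / (m_f - 1) for v in values]

    print(ordinal_to_interval(["bronze", "gold", "silver"],
                              order=["bronze", "silver", "gold"]))  # [0.0, 1.0, 0.5]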

17
Ratio-Scaled Variables
  • Ratio-scaled variable: a measurement on a
    nonlinear scale, approximately at exponential
    scale, such as A e^(Bt) or A e^(-Bt), e.g., growth
    of a bacteria population.
  • Methods
  • treat them like interval-scaled variables: not a
    good idea! (why? the scale can be distorted)
  • apply logarithmic transformation
    y_if = log(x_if)
  • treat them as continuous ordinal data and then
    treat their ranks as interval-scaled

18
Variables of Mixed Types
  • A database may contain all six types of variables
  • symmetric binary, asymmetric binary, nominal,
    ordinal, interval and ratio
  • One may use a weighted formula to combine their
    effects:
    d(i, j) = ( sum_f delta_ij(f) * d_ij(f) ) / ( sum_f delta_ij(f) ),
    where the indicator delta_ij(f) is 1 if variable f
    contributes to the comparison of i and j (e.g., it
    is 0 when a value is missing)
  • f is binary or nominal:
    d_ij(f) = 0 if x_if = x_jf, and d_ij(f) = 1 otherwise
  • f is interval-based: use the normalized distance
  • f is ordinal or ratio-scaled:
  • compute ranks r_if and z_if = (r_if - 1) / (M_f - 1),
    and treat z_if as interval-scaled
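
An illustrative sketch of this combined dissimilarity in Python, assuming each object is a dict of already-prepared values and a types dict says how to compare each variable (all names are hypothetical; interval values are assumed pre-normalized to [0, 1]):

    import math

    def mixed_dissimilarity(obj_i, obj_j, types):
        """Weighted combination d(i, j) = sum_f delta_f * d_f / sum_f delta_f.

        types[f] is one of "nominal", "binary", or "interval"; ordinal and
        ratio variables are assumed to have been converted to [0, 1] first,
        as on the previous slides.
        """
        num, den = 0.0, 0.0
        for f, kind in types.items():
            xi, xj = obj_i.get(f), obj_j.get(f)
            if xi is None or xj is None:          # delta_f = 0: skip missing values
                continue
            if kind in ("nominal", "binary"):
                d_f = 0.0 if xi == xj else 1.0
            else:                                  # interval-based, normalized
                d_f = abs(xi - xj)
            num += d_f
            den += 1.0
        return num / den if den else math.nan

    obj_a = {"color": "red",  "smoker": 1, "height": 0.2}
    obj_b = {"color": "blue", "smoker": 1, "height": 0.5}
    print(mixed_dissimilarity(obj_a, obj_b,
                              {"color": "nominal", "smoker": "binary",
                               "height": "interval"}))   # (1 + 0 + 0.3) / 3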

19
Major Clustering Techniques
  • Partitioning algorithms: Construct various
    partitions and then evaluate them by some
    criterion
  • Hierarchy algorithms: Create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
  • Density-based: based on connectivity and density
    functions
  • Model-based: A model is hypothesized for each of
    the clusters, and the idea is to find the best fit
    of the data to the given model.

20
Partitioning Algorithms Basic Concept
  • Partitioning method: Construct a partition of a
    database D of n objects into a set of k clusters
  • Given k, find a partition of k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods: k-means and k-medoids
    algorithms
  • k-means: Each cluster is represented by the
    center (mean) of the cluster
  • k-medoids or PAM (Partitioning Around Medoids):
    Each cluster is represented by one of the objects
    in the cluster

21
The K-Means Clustering
  • Given k, the k-means algorithm is as follows:
  • 1) Choose k cluster centers to coincide with k
    randomly-chosen points
  • 2) Assign each data point to the closest cluster
    center
  • 3) Recompute the cluster centers using the current
    cluster memberships
  • 4) If a convergence criterion is not met, go to 2)
  • Typical convergence criteria are no (or minimal)
    reassignment of data points to new cluster
    centers, or minimal decrease in squared error.

Squared error: E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2, where p is a point and m_i is the mean of cluster C_i
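
A minimal k-means sketch in Python/NumPy following these steps; the function name kmeans and its parameters are illustrative, and the optional init_centers argument (not part of the slide's algorithm) only makes the example on the next slide reproducible:

    import numpy as np

    def kmeans(X, k, init_centers=None, max_iter=100, seed=0):
        """Basic k-means following the slide's steps; returns (labels, centers, sse)."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)                        # treat 1-D data as n x 1
        rng = np.random.default_rng(seed)
        if init_centers is None:
            # 1) choose k cluster centers to coincide with k randomly-chosen points
            centers = X[rng.choice(len(X), size=k, replace=False)]
        else:
            centers = np.asarray(init_centers, dtype=float).reshape(k, -1)
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # 2) assign each data point to the closest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # 4) stop when no point changes cluster (convergence criterion)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # 3) recompute the cluster centers from the current memberships
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        sse = ((X - centers[labels]) ** 2).sum()        # aggregate squared error E
        return labels, centers, sse
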
22
Example
  • For simplicity, 1-dimensional data and k = 2.
  • data: 1, 2, 5, 6, 7
  • K-means:
  • Randomly select 5 and 6 as initial centroids
  • => Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1, 2}, {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity (sum of squared errors):
    0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5
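
For instance, running the illustrative kmeans sketch from the previous slide on this data with the same initial centroids reproduces the result (assuming that sketch is available):

    labels, centers, sse = kmeans([1, 2, 5, 6, 7], k=2, init_centers=[[5.0], [6.0]])
    print(labels)    # [0 0 1 1 1]  -> clusters {1, 2} and {5, 6, 7}
    print(centers)   # [[1.5] [6.0]]
    print(sse)       # 2.5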

23
Comments on K-Means
  • Strength: efficient, O(tkn), where n is the number
    of data points, k the number of clusters, and t
    the number of iterations. Normally, k, t << n.
  • Comment: Often terminates at a local optimum. The
    global optimum may be found using techniques such
    as deterministic annealing and genetic
    algorithms
  • Weakness
  • Applicable only when mean is defined, difficult
    for categorical data
  • Need to specify k, the number of clusters, in
    advance
  • Sensitive to noisy data and outliers
  • Not suitable to discover clusters with non-convex
    shapes
  • Sensitive to initial seeds

24
Variations of the K-Means Method
  • A few variants of the k-means method differ in
  • Selection of the initial k seeds
  • Dissimilarity measures
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with
    categorical objects
  • Using a frequency based method to update modes of
    clusters

25
k-Medoids clustering method
  • The k-means algorithm is sensitive to outliers
  • Since an object with an extremely large value may
    substantially distort the distribution of the
    data.
  • Medoid: the most centrally located point in a
    cluster, used as a representative point of the
    cluster.
  • An example
  • In contrast, a centroid is not necessarily inside
    a cluster.

(Figure: initial medoids)
26
Partition Around Medoids
  • PAM
  • Given k:
  • 1) Randomly pick k instances as initial medoids
  • 2) Assign each data point to the nearest medoid x
  • 3) Calculate the objective function:
  • the sum of dissimilarities of all points to their
    nearest medoids (squared-error criterion)
  • 4) Randomly select a non-medoid point y
  • 5) Swap x with y if the swap reduces the objective
    function
  • 6) Repeat steps 2)-5) until no change
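
A compact, simplified PAM sketch in Python/NumPy; the function name pam, the Euclidean pairwise distances, and the greedy scan over every (medoid, non-medoid) pair (instead of a single random y) are all assumptions made for brevity:

    import numpy as np

    def pam(X, k, max_iter=100, seed=0):
        """Simplified PAM: swap medoids with non-medoids while the total
        dissimilarity of points to their nearest medoid decreases."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        n = len(X)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(n, size=k, replace=False))          # initial medoids

        def cost(meds):
            # objective: sum of dissimilarities of all points to their nearest medoid
            return dist[:, meds].min(axis=1).sum()

        best = cost(medoids)
        for _ in range(max_iter):
            improved = False
            for xi in range(k):                         # position of current medoid x
                for y in range(n):                      # candidate non-medoid point y
                    if y in medoids:
                        continue
                    trial = medoids.copy()
                    trial[xi] = y                       # swap x with y ...
                    c = cost(trial)
                    if c < best:                        # ... if the objective decreases
                        medoids, best, improved = trial, c, True
            if not improved:                            # stop when no change
                break
        labels = dist[:, medoids].argmin(axis=1)        # assign to nearest medoid
        return labels, X[medoids], best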

27
Comments on PAM
(Figure: an outlier 100 units away)
  • PAM is more robust than k-means in the presence
    of noise and outliers, because a medoid is less
    influenced by outliers or other extreme values
    than a mean (why?)
  • PAM works well for small data sets but does not
    scale well to large data sets:
  • O(k(n - k)^2) per iteration,
  • where n is the number of data points and k is the
    number of clusters

28
CLARA Clustering Large Applications
  • CLARA: built into statistical analysis packages,
    such as S+
  • It draws multiple samples of the data set,
    applies PAM on each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weakness
  • Efficiency depends on the sample size
  • A good clustering based on samples will not
    necessarily represent a good clustering of the
    whole data set if the sample is biased
  • There are other scale-up methods, e.g., CLARANS
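
An illustrative toy version of the CLARA idea, reusing the pam sketch from the previous slides (an assumption-laden sketch, not the packaged CLARA algorithm): draw several random samples, run PAM on each, and keep the medoids with the lowest cost on the full data set.

    import numpy as np

    def clara(X, k, n_samples=5, sample_size=40, seed=0):
        """Toy CLARA: PAM on random samples, keep the best medoids overall."""
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        rng = np.random.default_rng(seed)
        best_medoids, best_cost = None, np.inf
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
            _, medoids, _ = pam(X[idx], k)              # cluster the sample with PAM
            # evaluate these medoids on the whole data set
            d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
            cost = d.min(axis=1).sum()
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        labels = np.linalg.norm(X[:, None, :] - best_medoids[None, :, :],
                                axis=2).argmin(axis=1)
        return labels, best_medoids, best_cost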

29
Hierarchical Clustering
  • Uses a distance matrix for clustering. This method
    does not require the number of clusters k as an
    input, but needs a termination condition

30
Agglomerative Clustering
  • At the beginning, each data point forms a cluster
    (also called a node).
  • Merge nodes/clusters that have the least
    dissimilarity.
  • Go on merging
  • Eventually all nodes belong to the same cluster

31
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decomposes data objects into several levels of
nested partitioning (a tree of clusters), called a
dendrogram. A clustering of the data objects is
obtained by cutting the dendrogram at the desired
level; then each connected component forms a
cluster.
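
A brief sketch of agglomerative clustering and dendrogram cutting using SciPy (a library choice not mentioned in the slides; the toy data is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Toy 2-D data: two loose groups
    data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                     [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

    Z = linkage(data, method="average")              # agglomerative merging (average linkage)
    labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the dendrogram into 2 clusters
    print(labels)                                    # e.g. [1 1 1 2 2 2]

    # dendrogram(Z)  # with matplotlib available, this draws the merge tree
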
32
Divisive Clustering
  • Inverse order of agglomerative clustering
  • Eventually each node forms a cluster on its own

33
More on Hierarchical Methods
  • Major weaknesses of agglomerative clustering
    methods:
  • do not scale well: time complexity at least
    O(n^2), where n is the total number of objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering to scale up these clustering methods:
  • BIRCH (1996): uses a CF-tree and incrementally
    adjusts the quality of sub-clusters
  • CURE (1998): selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction

34
Summary
  • Cluster analysis groups objects based on their
    similarity and has wide applications
  • Measure of similarity can be computed for various
    types of data
  • Clustering algorithms can be categorized into
    partitioning methods, hierarchical methods,
    density-based methods, etc
  • Clustering can also be used for outlier detection,
    which is useful for fraud detection
  • What is the best clustering algorithm?

35
Other Data Mining Methods
36
Sequence analysis
  • Market basket analysis analyzes things that
    happen at the same time.
  • What about things that happen over time?
  • E.g., if a customer buys a bed, he/she is likely
    to come back to buy a mattress later
  • Sequential analysis needs
  • A time stamp for each data record
  • Customer identification

37
Sequence analysis (cont )
  • The analysis shows which items come before, after,
    or at the same time as other items.
  • Sequential patterns can be used for analyzing
    cause and effect.
  • Other applications
  • Finding cycles in association rules
  • Some association rules hold strongly in certain
    periods of time
  • E.g., every Monday people buy item X and Y
    together
  • Stock market prediction
  • Predicting possible failures in networks, etc.

38
Discovering holes in data
  • Holes are empty (sparse) regions in the data
    space that contain few or no data points. Holes
    may represent impossible value combinations in
    the application domain.
  • E.g., in a disease database, we may find that
    certain test values and/or symptoms do not go
    together, or that when a certain medicine is used,
    some test values never go beyond a certain range.
  • Such information could lead to a significant
    discovery: a cure for a disease or some biological
    law.

39
Data and pattern visualization
  • Data visualization: Use computer graphics effects
    to reveal the patterns in data:
  • 2-D, 3-D scatter plots, bar charts, pie charts,
    line plots, animation, etc.
  • Pattern visualization: Use good interfaces and
    graphics to present the results of data mining:
  • Rule visualizers, cluster visualizers, etc.

40
Scaling up data mining algorithms
  • Adapt data mining algorithms to work on very
    large databases.
  • Data reside on hard disk (too large to fit in
    main memory)
  • Make fewer passes over the data
  • Quadratic algorithms are too expensive
  • Many data mining algorithms are quadratic,
    especially clustering algorithms.