Cluster Analysis

Chapter 7

- The Course

(Figure: course overview diagram with the components DS, DP, DW, Star Schema, OLAP, DM, Association, Classification, Clustering)

Legend: DS = Data source, DW = Data warehouse, DM = Data mining, DP = Staging database

Chapter Outline

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning
- Hierarchical
- Density-Based
- Grid-Based (skip)
- Model-Based (skip)
- Outlier Analysis

What is Clustering?

- Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing
- Organizing data into classes such that there is
  - high intra-class similarity
  - low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification)
- More informally, finding natural groupings among objects

What is Cluster Analysis?

- Cluster: a collection of data objects (records)
  - Intraclass similarity: objects are similar to objects in the same cluster
  - Interclass dissimilarity: objects are dissimilar to objects in other clusters
- Cluster analysis
  - Statistical method for grouping a set of data objects into clusters
  - A good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity
- Clustering is unsupervised classification
- More informally, finding natural groupings among objects

What is Cluster Analysis?

- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

(Figure labels: Salary, Age, Job Title)

Typical Clustering Applications

- As a stand-alone tool to
  - get insight into data distribution
  - find the characteristics of each cluster
  - assign the cluster of a new example
- As a preprocessing step for other algorithms
  - e.g. numerosity reduction using cluster centers to represent data in clusters (see the example on the next slide)
- It is a building block for many data mining solutions

Clustering Example: Fitting Troops

- Fitting the troops: re-design of uniforms for soldiers
- Goal: reduce the number of uniform sizes to be kept in inventory while still providing good fit
- Researchers from Cornell University used clustering and designed a new set of sizes
  - Traditional clothing size system: ordered set of graduated sizes where all dimensions increase together
  - The new system: sizes that fit body types
  - E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks

Other Examples of Clustering Applications

- Marketing
  - help discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs
- Biology
  - derive plant and animal taxonomies
  - find genes with similar function
- Land use
  - identify areas of similar land use in an earth observation database
- Insurance
  - identify groups of motor insurance policy holders with a high average claim cost
- City planning
  - identify groups of houses according to their house type, value, and geographical location

Requirements of Clustering

- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge
- Can deal with noise and outliers
- Insensitive to the order of input records
- Can handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

What is Cluster Analysis?

- What is Good Clustering?
  - High within-class similarity and low between-class similarity
  - The ability to discover some or all of the hidden patterns

(Figure label: Outlier)

How do we measure similarity or dissimilarity?

(Figure: the names "Peter" and "Piotr" with the values 0.23, 3, and 342.7)

Data Representation

- Data matrix
  - N objects with p attributes
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Dissimilarity matrix
  - d(i,j): dissimilarity between i and j

Data Representation

- Properties
  - $d(i,j) \ge 0$
  - $d(i,i) = 0$
  - $d(i,j) = d(j,i)$
  - $d(i,j) \le d(i,k) + d(k,j)$
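
As a minimal sketch of the two representations, the code below builds a dissimilarity matrix from a small data matrix and leaves the four properties easy to check. The data values and the choice of Euclidean distance are illustrative assumptions, not taken from the slides.

```python
import math

def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise Euclidean distances between rows."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both halves at once: d(i,j) = d(j,i).
            d[i][j] = d[j][i] = math.dist(data[i], data[j])
    return d

# Hypothetical data matrix: 3 objects (rows) with 2 attributes (columns).
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(X)
for row in D:
    print(row)
```

The diagonal is zero, the matrix is symmetric, and every entry satisfies the triangle inequality, so only the lower (or upper) triangle needs to be stored.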

Types of Data in Cluster Analysis

- Interval-Scaled Attributes: continuous measurements on a roughly linear scale. Examples: weight, temperature, income
- Binary Attributes
- Nominal Attributes: categorical values where order has no meaning. Examples: color, gender
- Ordinal Attributes: categorical values where order has meaning. Example: rank
- Ratio-Scaled Attributes: continuous measurements on a nonlinear scale, approximately exponential
- Mixed Attributes: a combination of the above data types

Dissimilarity of Interval-Scaled Values

- Step 1: Standardize the data
  - To ensure all attributes have equal weight
  - To match up different scales into a uniform, single scale
  - Not always needed! Sometimes we require unequal weights for an attribute
- Step 2: Compute dissimilarity between records
  - Use Euclidean, Manhattan or Minkowski distance

Distance Metrics

- Minkowski
- Manhattan
- Euclidean
- Weighted
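
The four metrics listed above can be sketched with one function, assuming the usual definitions: Minkowski distance $d(i,j) = \left(\sum_f w_f |x_{if} - x_{jf}|^q\right)^{1/q}$, with Manhattan as q = 1, Euclidean as q = 2, and the unweighted case as all $w_f = 1$. The function and parameter names are illustrative.

```python
def minkowski(x, y, q=2, weights=None):
    """Weighted Minkowski distance of order q between two numeric records."""
    if weights is None:
        weights = [1.0] * len(x)  # unweighted: every attribute counts equally
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y))
    return total ** (1.0 / q)

def manhattan(x, y):
    """Minkowski distance with q = 1."""
    return minkowski(x, y, q=1)

def euclidean(x, y):
    """Minkowski distance with q = 2."""
    return minkowski(x, y, q=2)

print(manhattan((1, 2), (4, 6)))  # 7.0
print(euclidean((1, 2), (4, 6)))  # 5.0
```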

Dissimilarity between Binary Variables

- Method A contingency table for binary data
- If the binary variable is symmetric
- If the binary variable is asymmetric

(Figure: contingency table for Object i and Object j)

Dissimilarity between Binary Variables Example

- gender is a symmetric binary attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0
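
A sketch of the symmetric and asymmetric cases, assuming the usual contingency counts (q = both 1, r = only i is 1, s = only j is 1, t = both 0) with the simple-matching form $(r+s)/(q+r+s+t)$ for symmetric variables and the Jaccard form $(r+s)/(q+r+s)$ for asymmetric ones. The two records are hypothetical 0/1 encodings, not the example table from the slide.

```python
def contingency(a, b):
    """Count (q, r, s, t) for two equal-length 0/1 records."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t

def symmetric_dissimilarity(a, b):
    """Simple matching: 0/0 and 1/1 agreements count equally."""
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(a, b):
    """Jaccard form: joint absences (t) are ignored."""
    q, r, s, _ = contingency(a, b)
    return (r + s) / (q + r + s) if (q + r + s) else 0.0

# Hypothetical records after mapping Y/P to 1 and N to 0.
rec_i = [1, 0, 1, 0, 0, 0]
rec_j = [1, 0, 1, 0, 1, 0]
print(symmetric_dissimilarity(rec_i, rec_j))
print(asymmetric_dissimilarity(rec_i, rec_j))
```

The asymmetric measure is larger here because the three joint absences no longer count as agreement.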

Dissimilarity Between Nominal Attributes

- A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching
  - m: # of attributes that are the same for both records, p: total # of attributes
- Method 2: rewrite the database and create a new binary attribute for each of the m states
  - For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
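
Both methods can be sketched briefly. The simple-matching formula is assumed to be the usual $d(i,j) = (p - m)/p$; the records and helper names are hypothetical.

```python
def simple_matching(a, b):
    """Method 1: fraction of attributes on which two records disagree."""
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per state, 1 only for the state taken."""
    return [1 if value == s else 0 for s in states]

print(simple_matching(("red", "small", "round"), ("red", "large", "round")))
print(one_hot("yellow", ["red", "yellow", "blue", "green"]))  # [0, 1, 0, 0]
```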

Dissimilarity Between Ordinal Attributes

- An ordinal attribute can be discrete or continuous
- Order is important (e.g. rank)
- Can be treated like interval-scaled
  - replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th attribute by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - compute the dissimilarity using methods for interval-scaled attributes
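
A small sketch of the rank mapping, assuming the standard normalization $z_{if} = (r_{if} - 1)/(M_f - 1)$ with ranks starting at 1 and $M_f$ ordered states. The medal scale is a hypothetical example.

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via their rank in `order` (lowest first)."""
    rank = {state: r for r, state in enumerate(order, start=1)}
    m_f = len(order)  # number of ordered states; must be at least 2
    return [(rank[v] - 1) / (m_f - 1) for v in values]

# Hypothetical ordinal attribute with three ordered states.
z = ordinal_to_interval(["bronze", "gold", "silver"], ["bronze", "silver", "gold"])
print(z)  # [0.0, 1.0, 0.5]
```

The resulting values can then be fed to any interval-scaled distance such as Euclidean or Manhattan.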

Dissimilarity Between Ratio-Scaled Attributes

- Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$
- Methods
  - treat them like interval-scaled attributes: not a good choice because scales may be distorted
  - apply a logarithmic transformation: $y_{if} = \log(x_{if})$
  - treat them as continuous ordinal data and treat their rank as interval-scaled

Dissimilarity Between Attributes of Mixed Types

- A database may contain all six types of attributes
  - symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
- Use a weighted formula to combine their effects
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
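
One way to combine the per-attribute contributions is sketched below. It assumes the common form $d(i,j) = \sum_f \delta_{ij}^{(f)} d_{ij}^{(f)} / \sum_f \delta_{ij}^{(f)}$, where $\delta_{ij}^{(f)}$ is 0 when a value is missing and 1 otherwise; the attribute kinds, ranges, and records are hypothetical. Ordinal and ratio-scaled attributes are assumed to have been converted to interval values beforehand.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity over attributes present in both records.

    kinds[f] is "nominal" (also used for binary) or "interval";
    ranges[f] is max - min of attribute f, used only for interval attributes.
    """
    total, count = 0.0, 0
    for f, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            continue  # delta = 0: skip attributes with a missing value
        if kinds[f] == "nominal":
            d = 0.0 if x == y else 1.0
        else:
            d = abs(x - y) / ranges[f]  # normalized distance in [0, 1]
        total += d
        count += 1
    if count == 0:
        raise ValueError("no attribute is present in both records")
    return total / count

kinds = ["nominal", "interval"]
ranges = [None, 50.0]
print(mixed_dissimilarity(("red", 30.0), ("blue", 40.0), kinds, ranges))
```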

Major Clustering Approaches

- Partitioning approach
  - Partitions objects and then evaluates the partitions by some criterion, e.g. minimizing the sum of squared errors. E.g. k-means, k-medoids
- Hierarchical approach
  - Creates a hierarchical clustering of the set of data (or objects) using some criterion. E.g. DIANA, AGNES, BIRCH, ROCK, and CHAMELEON
- Density-based approach
  - Clusters objects based on connectivity and density functions. E.g. DBSCAN, OPTICS
- Grid-Based (skip)
- Model-Based (skip)

Partitioning Algorithms: Basic Concept

- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods
    - k-means: each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

K-Means

The K-Means Clustering Method

- Given k, the k-means algorithm is implemented in four steps

Stopping/convergence criterion

- no (or minimum) re-assignments of data points to different clusters,
- no (or minimum) change of centroids, or
- minimum decrease in the sum of squared error (SSE),

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \qquad (1)$$

where $C_j$ is the jth cluster, $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$), and $\mathrm{dist}(x, m_j)$ is the distance between data point x and centroid $m_j$.
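
A minimal k-means sketch follows. It assumes the usual four-step formulation (pick k initial centroids, assign each point to its nearest centroid, recompute each centroid as the cluster mean, repeat until the centroids stop changing) and uses "no change of centroids" as the stopping criterion. The function names, the `seed` parameter, and the sample points are illustrative.

```python
import math
import random

def sse(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(x, m) ** 2 for c, m in zip(clusters, centroids) for x in c)

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k data points as initial centroids
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:  # step 2: assign each point to the nearest centroid
            j = min(range(k), key=lambda c: math.dist(x, centroids[c]))
            clusters[j].append(x)
        new_centroids = [  # step 3: recompute means (keep old centroid if empty)
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # step 4: stop when centroids are stable
            break
        centroids = new_centroids
    return clusters, centroids

# Hypothetical data: two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(pts, 2)
print(clusters, centroids, sse(clusters, centroids))
```

If `max_iter` is reached before convergence, the returned clusters reflect the last assignment step rather than the final centroid update.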

(Figures: 3-means example, Steps 1 to 6)

2-means Example

- For simplicity, 1-dimensional objects and k = 2.
- Objects: 1, 2, 5, 6, 7
- K-means
  - Randomly select 5 and 6 as initial centroids
  - => Two clusters {1,2,5} and {6,7}; mean(C1) = 8/3, mean(C2) = 6.5
  - => {1,2}, {5,6,7}; mean(C1) = 1.5, mean(C2) = 6
  - => no change.
  - Aggregate dissimilarity = $0.5^2 + 0.5^2 + 1^2 + 1^2 = 2.5$
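
The trace above can be reproduced with a few lines of code. This is a sketch specialised to one dimension with the initial centroids fixed at 5 and 6, as in the example; the function name is illustrative.

```python
def kmeans_1d(values, centroids):
    """Run 1-D k-means from the given initial centroids until nothing changes."""
    while True:
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign v to the nearest centroid (ties go to the first one).
            j = min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            clusters[j].append(v)
        new = [sum(c) / len(c) for c in clusters]
        if new == centroids:
            return clusters, new
        centroids = new

clusters, means = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
total = sum((v - m) ** 2 for c, m in zip(clusters, means) for v in c)
print(clusters)  # [[1, 2], [5, 6, 7]]
print(means)     # [1.5, 6.0]
print(total)     # 2.5
```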

Comments on the K-Means Method

- Strength of the k-means
  - Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n.
  - Often terminates at a local optimum.
- Weakness of the k-means
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance.
  - Unable to handle noisy data and outliers.
  - Not suitable for discovering clusters with non-convex shapes.

Variations of the K-Means Method

- A few variants of the k-means which differ in
  - Selection of the initial k means.
  - Dissimilarity calculations.
  - Strategies to calculate cluster means.
- Handling categorical data: k-modes (Huang'98)
  - Replacing means of clusters with modes.
  - Using new dissimilarity measures to deal with categorical objects.
  - Using a frequency-based method to update modes of clusters.
- A mixture of categorical and numerical data: k-prototype method.

K-Medoid

Medoid - definition

- A medoid is an actual point in the dataset that is centrally located and is therefore representative of the cluster.

k-medoid methods

- There are three best-known k-medoid methods
- PAM (Partitioning Around Medoids)
- CLARA (Clustering LARge Applications)
- CLARANS

PAM

- Arbitrarily choose k objects as the initial medoids
- Until no change, do
  - (Re)assign each object to the cluster with the nearest medoid
  - Randomly select a non-medoid object o', compute the total cost, E, of swapping medoid o with o'
  - If E < 0 then swap o with o' to form the new set of k medoids

Swapping Cost

- Measure whether o' is better than o as a medoid
- Use the squared-error criterion
- Compute $E_{o'} - E_o$
- Negative: swapping brings benefit
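
The swap test can be sketched as follows. This version is an assumption-laden simplification: instead of picking the non-medoid at random, it tries every (medoid, non-medoid) pair and accepts the first swap with a negative cost difference, and the cost is whatever `dist` returns (squared distance in the usage example, to match the squared-error criterion). All names and data are illustrative.

```python
import random

def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # arbitrary initial medoids
    improved = True
    while improved:  # until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                # Swapping cost: cost with o as medoid minus cost with m.
                delta = total_cost(points, candidate, dist) - total_cost(points, medoids, dist)
                if delta < 0:  # negative: the swap brings benefit
                    medoids = candidate
                    improved = True
                    break  # m is no longer a medoid, move to the next one
    return medoids

# Hypothetical 1-D data with two obvious groups.
result = pam([1, 2, 3, 10, 11, 12], 2, lambda a, b: (a - b) ** 2)
print(sorted(result))  # [2, 11]
```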

PAM Example

(Figure: PAM example with K = 2)

- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Do loop, until no change
  - Randomly select a non-medoid object, O_random
  - Compute total cost of swapping
  - Swap O and O_random if quality is improved

Pros and Cons of PAM

- PAM is more robust than k-means in the presence of noise and outliers
  - Medoids are less influenced by outliers
- PAM is efficient for small data sets but does not scale well for large data sets
  - $O(k(n-k)^2)$ for each iteration
- Sampling-based method: CLARA

CLARA (Clustering LARge Applications)

- Draw multiple samples of the data set, apply PAM on each sample, give the best clustering
- Performs better than PAM on larger data sets
- Efficiency depends on the sample size
  - A good clustering on samples may not be a good clustering of the whole data set

CLARANS (Clustering Large Applications based upon RANdomized Search)

- The problem space: graph of clustering
  - A vertex is a set of k objects chosen from n, $\binom{n}{k}$ vertices in total
  - PAM searches the whole graph
  - CLARA searches some random sub-graphs
- CLARANS climbs mountains
  - Randomly sample a set and select k medoids
  - Consider neighbors of medoids as candidates for new medoids
  - Use the sample set to verify
  - Repeat multiple times to avoid bad samples

- Algorithm CLARANS
  1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
  2. Set current to an arbitrary node in $G_{n,k}$.
  3. Set j to 1.
  4. Consider a random neighbor S of current, and based on Equation (5) calculate the cost differential of the two nodes.
  5. If S has a lower cost, set current to S, and go to Step (3).
  6. Otherwise, increment j by 1. If $j \le$ maxneighbor,

- Requires arbitrary objects and a distance function
- Medoid $m_C$: representative object in a cluster C
- Measure for the compactness of a cluster C: $TD(C) = \sum_{p \in C} \mathrm{dist}(p, m_C)$
- Measure for the compactness of a clustering: $TD = \sum_{i=1}^{k} TD(C_i)$

- CLARA (Kaufman and Rousseeuw, 1990)
  - Additional parameter: numlocal
  - Draws numlocal samples of the data set
  - Applies PAM on each sample
  - Returns the best of these sets of medoids as output
- CLARANS (Ng and Han, 1994)
  - Two additional parameters: maxneighbor and numlocal
  - At most maxneighbor many pairs (medoid M, non-medoid N) are evaluated in the algorithm.
  - The first pair (M, N) for which $TD_{N \leftrightarrow M}$ is smaller than $TD_{current}$ is swapped (instead of the pair with the minimal value of $TD_{N \leftrightarrow M}$)
  - Finding the local minimum with this procedure is repeated numlocal times.
- Efficiency: runtime(CLARANS) < runtime(CLARA) < runtime(PAM)

CLARANS(objects DB, Integer k, Real dist, Integer numlocal, Integer maxneighbor)
  TD_best := infinity
  for r from 1 to numlocal do
    Randomly select k objects as medoids; i := 0
    while i < maxneighbor do
      Randomly select (Medoid M, Non-medoid N)
      Compute changeOfTD := TD_{N<->M} - TD
      if changeOfTD < 0 then
        substitute M by N
        TD := TD_{N<->M}; i := 0
      else i := i + 1
    if TD < TD_best then
      TD_best := TD; Store current medoids
  return Medoids
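
The pseudocode translates fairly directly into Python. The sketch below is one possible reading of it: `total_distance` plays the role of TD, the `seed` parameter is added for reproducibility, and the data in the usage example are hypothetical.

```python
import random

def total_distance(points, medoids, dist):
    """TD: sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, dist, numlocal, maxneighbor, seed=0):
    rng = random.Random(seed)
    best, td_best = None, float("inf")
    for _ in range(numlocal):  # numlocal independent local searches
        medoids = rng.sample(points, k)
        td = total_distance(points, medoids, dist)
        i = 0
        while i < maxneighbor:
            # Random neighbor: swap one medoid M for one non-medoid N.
            m = rng.choice(medoids)
            n = rng.choice([p for p in points if p not in medoids])
            candidate = [n if x == m else x for x in medoids]
            td_new = total_distance(points, candidate, dist)
            if td_new < td:  # first improving swap is taken immediately
                medoids, td, i = candidate, td_new, 0
            else:
                i += 1
        if td < td_best:  # keep the best local minimum found so far
            best, td_best = medoids, td
    return best, td_best

# Hypothetical 1-D data; absolute difference as the distance function.
data = [1, 2, 3, 10, 11, 12]
best, td_best = clarans(data, 2, lambda a, b: abs(a - b), numlocal=5, maxneighbor=50)
print(sorted(best), td_best)
```

Recomputing TD from scratch for every candidate keeps the sketch short; a practical implementation would update it incrementally.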

Hierarchical Methods

Hierarchical Clustering

- Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

AGNES (Agglomerative Nesting)

- Agglomerative, Bottom-up approach
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster

DIANA (Divisive Analysis)

- Top-down approach
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

A Dendrogram

- Shows how the clusters are merged hierarchically
- Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
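
A small agglomerative sketch in the spirit of AGNES is shown below. It assumes single-link dissimilarity (the smallest distance between members of two clusters) and uses a target number of clusters as the termination condition, which corresponds to cutting the dendrogram at that level. The names and data are illustrative.

```python
def agnes(points, dist, num_clusters):
    """Bottom-up merging; returns the final clusters and the merge history."""
    clusters = [[p] for p in points]  # start with every object on its own
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: closest pair of members across the two clusters.
                d = min(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))  # one dendrogram level
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical 1-D objects.
final, history = agnes([1, 2, 6, 7, 20], lambda a, b: abs(a - b), num_clusters=2)
print(final)                     # [[1, 2, 6, 7], [20]]
print([h[0] for h in history])   # [1, 1, 4]
```

The merge heights never decrease, which is the non-descending behaviour noted for AGNES.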

Recent Hierarchical Clustering Methods

- Major weaknesses of agglomerative clustering methods
  - do not scale well: time complexity of at least $O(n^2)$, where n is the total number of objects
  - can never undo what was done previously
- Integration of hierarchical with distance-based clustering
  - BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - ROCK: clustering categorical data by neighbor and link analysis
  - CHAMELEON: hierarchical clustering using dynamic modeling

BIRCH

- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Clustering Feature Vector in BIRCH

CF = (5, (16,30), (54,190))

for the points (3,4), (2,6), (4,5), (4,7), (3,8)
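
The example vector can be checked with a short sketch. It assumes CF = (N, LS, SS), where N is the number of points, LS the per-dimension linear sum, and SS the per-dimension sum of squares; the function names are illustrative.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with per-dimension linear and squared sums."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))
    ss = tuple(sum(v * v for v in col) for col in zip(*points))
    return n, ls, ss

def merge_cf(a, b):
    """CFs are additive: the CF of a union is the component-wise sum."""
    return (
        a[0] + b[0],
        tuple(x + y for x, y in zip(a[1], b[1])),
        tuple(x + y for x, y in zip(a[2], b[2])),
    )

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))
```

Additivity is what lets a non-leaf node summarise its children without revisiting the raw points: `merge_cf` applied to the CFs of two sub-clusters gives the CF of their union.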

CF-Tree in BIRCH

- Clustering feature
  - summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view
  - registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in a tree has descendants or children
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: specifies the maximum number of children.
  - Threshold: max diameter of sub-clusters stored at the leaf nodes

The CF Tree Structure

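(Figure: CF tree with B = 7 and L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, CF3, ...), each paired with a child pointer (child1, child2, child3, ...); leaf nodes hold CF entries and are chained by prev and next pointers.)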

BIRCH

- Strength
  - Scales linearly: finds a good clustering with a single scan
  - Improves the quality with a few additional scans
- Weakness
  - Handles only numeric data
  - No natural clustering due to the specification of the branching factor
  - Clusters are of spherical shape

Density Based Clustering

Density-Based Clustering Methods

- Clustering based on density (local cluster criterion), such as density-connected points
- Major features
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as termination condition
- Several interesting studies
  - DBSCAN
  - OPTICS (if there is time)

DBSCAN

- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm.
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- A point is a core point if it has more than a specified number of points (MinPts) within a specified radius (Eps)
  - These are points that are in the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point.

DBSCAN Core, Border, and Noise Points

DBSCAN The Algorithm

- Algorithm
  - Arbitrarily select a point p
  - Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
  - If p is a core point, a cluster is formed.
  - If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
  - Continue the process until all of the points have been processed.
- Note: A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that each $p_{i+1}$ lies within Eps of $p_i$ and $p_1, \dots, p_{n-1}$ are all core points.
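
The procedure can be sketched compactly. Two conventions here are assumptions, since definitions vary: a point's Eps-neighborhood includes the point itself, and a point is core when that neighborhood holds at least `min_pts` points. The brute-force neighbor search and the sample data are for illustration only.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: cluster ids 0, 1, ... or -1 for noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue  # already assigned, or not a core point to start from
        cluster += 1
        labels[i] = cluster
        frontier = [i]
        while frontier:  # collect everything density-reachable from point i
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:  # only core points extend the cluster further
                        frontier.append(q)
    return [-1 if label is None else label for label in labels]

# Hypothetical data: two dense squares, one border point, one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The point (2.2, 1) is a border point: it has too few neighbors to be core but lies within Eps of the core point (1, 1), so it joins cluster 0, while (5, 5) is left as noise.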

DBSCAN Core, Border and Noise Points

(Figure: original points, and the same points labeled by type (core, border, noise) for Eps = 10, MinPts = 4)

When DBSCAN Works Well

(Figure: original points)

- Resistant to noise
- Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well

(Figure: original points, and DBSCAN results for MinPts = 4, Eps = 9.75 and for MinPts = 4, Eps = 9.92)

- Varying densities
- High-dimensional data

End