1
Cluster Analysis
  • Craig A. Struble
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University

2
Clustering Outline
  • Problem Overview
  • Techniques
  • Partitional Algorithms
  • Hierarchical Algorithms
  • Probability Based Algorithms
  • Other Approaches
  • Interpretations
  • Applications

3
Goals
  • Explore different clustering techniques
  • Understand complexity issues
  • Learn to interpret clustering results
  • Explore applications of clustering

4
Clustering Examples
  • Segment customer database based on similar buying
    patterns.
  • Group houses in a town into neighborhoods based
    on similar features.
  • Identify new plant species
  • Identify similar Web usage patterns

5
Clustering Example
6
Clustering Houses
7
Clustering Problem
  • Given a database D = {t1, t2, ..., tn} of tuples and an
    integer value k, the Clustering Problem is to
    define a mapping f : D → {1, ..., k} where each ti is
    assigned to one cluster Kj, 1 ≤ j ≤ k.
  • A cluster, Kj, contains precisely those tuples
    mapped to it.
  • Unlike the classification problem, clusters are not
    known a priori.

8
Clustering vs. Classification
  • No prior knowledge
  • Number of clusters
  • Meaning of clusters
  • Unsupervised learning
  • Data has no class labels

9
Clustering Approaches
[Diagram: taxonomy of clustering approaches, including sampling and compression]
10
Clustering Issues
  • Outlier handling
  • Dynamic data
  • Interpreting results
  • Evaluating results
  • Number of clusters
  • Data to be used
  • Scalability

11
Impact of Outliers on Clustering
12
Visualizations
13
Cluster Parameters
14
Resources
  • Classic text is Finding Groups in Data by Kaufman
    and Rousseeuw, 1990
  • Overwhelming number of algorithms
  • Several implementations
  • R and Weka

15
Partitional Clustering
  • Simultaneous clustering
  • All elements are in some cluster during each
    iteration
  • May be shifted from one cluster to another
  • Some metric is used to determine goodness of
    clustering
  • Average distance between clusters
  • Squared error metric
  • Combinatorial problem
  • 11,259,666,000 ways to cluster 19 items into 4
    clusters

16
K-Means
Algorithm KMeans
Input:  k    - number of clusters
        t    - number of iterations
        data - the data
Output: C    - a set of k clusters

cent = arbitrarily select k objects as initial centers
for i = 1 to t do
    for each d in data do
        assign label x to d such that dist(d, cent[x]) is minimized
    for x = 1 to k do
        cent[x] = mean value of all data with label x
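
As a concrete companion to the pseudocode, here is a minimal NumPy
sketch of the same loop (the function name, the random
initialization, and the example points are mine, not from the
slides):

    import numpy as np

    def kmeans(data, k, t=100, seed=0):
        # data: (n, l) array; k: number of clusters; t: iterations
        rng = np.random.default_rng(seed)
        # arbitrarily select k objects as the initial centers
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(t):
            # assign each point the label of its nearest center
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of its assigned points
            for x in range(k):
                if np.any(labels == x):
                    centers[x] = data[labels == x].mean(axis=0)
        return labels, centers

    points = np.array([[10.0, 4.0], [5.0, 4.0], [9.5, 3.5], [5.5, 4.5]])
    labels, centers = kmeans(points, k=2)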
17
Example Data
18
K-Means Example
  • Use Euclidean distance (l is the number of
    dimensions): dist(x, y) = sqrt( Σ_{i=1..l} (x_i - y_i)² )
  • Select (10,4) and (5,4) as the initial points

19
K-Means Clustering
[Plots: k-means results with 3 clusters and with 2 clusters]
20
Cluster Centers
21
K-Means Summary
  • Very simple algorithm
  • Only works on data for which means can be
    calculated
  • Continuous data
  • O(knt) time complexity
  • k - number of clusters,
  • n - number of instances,
  • t - number of iterations
  • Finds only circular (spherical) cluster shapes
  • Outliers can have a very negative impact

22
Outliers
23
Partitioning Methods
  • K-Means (already done)
  • MST
  • K-Medoids (PAM)
  • Fuzzy Clustering

24
Dissimilarity Matrices
  • Many clustering algorithms use a dissimilarity
    matrix as input
  • An Instance × Instance matrix of pairwise
    dissimilarities (see the sketch below)
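
SciPy can construct such a matrix directly; a small sketch with
made-up instances:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[10.0, 4.0], [5.0, 4.0], [9.0, 3.0], [6.0, 5.0]])
    # condensed pairwise Euclidean distances, expanded to Instance x Instance
    D = squareform(pdist(X, metric="euclidean"))
    print(np.round(D, 3))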

25
Graph Perspective
[Figure: the five example instances drawn as a weighted graph on nodes 1-5, with pairwise distances such as 4, 4.472, 5, 5.38, 5.83, and 9 as edge weights]
26
MST Algorithm
  • Compute the minimal spanning tree of the graph
  • Set of edges with minimal total weight so that
    each node can be reached
  • Remove edges that are inconsistent
  • E.g., an edge whose weight is much larger than the
    average weight of neighboring edges
  • The remaining connected components form the clusters
    (see the sketch below)
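
A minimal SciPy sketch of the idea, using a deliberately simple
inconsistency rule (cut the heaviest MST edges) instead of the
neighboring-average rule; the data and all names are mine:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    def mst_clusters(X, edges_to_cut=2):
        D = squareform(pdist(X))
        mst = minimum_spanning_tree(D).toarray()
        weights = mst[mst > 0]
        # remove the edges_to_cut heaviest edges of the tree
        cutoff = np.sort(weights)[-edges_to_cut]
        mst[mst >= cutoff] = 0.0
        # the surviving connected components are the clusters
        n_comp, labels = connected_components(mst, directed=False)
        return labels

    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
    print(mst_clusters(X, edges_to_cut=2))  # three clusters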

27
MST Example
[Figure: weighted graph on nodes A-E, with edge weights including 1, 2, 3, and 1]
28
MST Algorithm
29
MST Example
Let k be 3, and let inconsistent edge be
defined as an edge with maximum weight.
[Figure: after removing the maximum-weight edges, the weight-1 edges A-B and C-D remain, giving clusters {A,B}, {C,D}, and {E}]
30
MST Summary
  • Cost is dominated by MST procedure
  • Time and space O(n2)
  • Number of clusters not necessarily needed
  • Implicitly defined by the inconsistent-edge criterion

31
K-Medoids
  • K-Means is restricted to numeric attributes
  • Have to calculate the average object
  • A medoid is a representative object in a cluster.
  • The algorithm searches for K medoids in the set
    of objects.

32
K-Medoids
33
PAM Algorithm
  • PAM consists of two phases
  • BUILD - constructs an initial clustering
  • SWAP - refines the clustering
  • Goal: Minimize the sum of dissimilarities to the K
    representative objects.
  • Mathematically equivalent to minimizing the average
    dissimilarity

34
PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
// Select k representative objects that appear to minimize
// dissimilarities
selected = {}                          // empty set
for x = 1 to k do
    maxgain = 0
    for each i in data - selected do
        gain = 0
        for each j in data - selected do
            // See if j is closer to i than to some other
            // previously selected object
            let Dj = min(diss(j,k) for each k in selected)
                     // (take Dj to be infinite when selected is empty)
            let Cji = max(Dj - diss(j,i), 0)
            gain = gain + Cji          // total up improvements from i
        if gain > maxgain then
            maxgain = gain
            best = i
    selected = selected + {best}       // best representative object chosen

35
PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
// Improve the partitioning; selected comes from the BUILD phase
repeat
    minswap = 0
    for each i in selected do
        for each h in data - selected do
            swap = 0
            for each j in data - selected - {h} do
                let Dj = min(diss(j,k) for each k in selected)
                if Dj < diss(j,i) and Dj < diss(j,h) then
                    swap = swap + 0            // j is closer to something else: do nothing
                else if diss(j,i) = Dj then    // j is closest to i
                    let Ej = min(diss(j,k) for each k in selected - {i})
                    if diss(j,h) < Ej then
                        swap = swap + diss(j,h) - Dj
                    else
                        swap = swap + Ej - Dj  // j moves to its second-closest medoid
                else                           // diss(j,h) < Dj: j moves to h
                    swap = swap + diss(j,h) - Dj
            if swap < minswap then
                minswap = swap
                besti = i
                besth = h
    if minswap < 0 then
        selected = selected - {besti} + {besth}   // perform the best swap
until minswap >= 0
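
For comparison, a compact NumPy sketch of the simpler alternating
k-medoids heuristic (not the full BUILD/SWAP procedure above); like
PAM it works from a precomputed dissimilarity matrix, and all names
are mine:

    import numpy as np

    def k_medoids(D, k, t=100, seed=0):
        rng = np.random.default_rng(seed)
        medoids = rng.choice(D.shape[0], size=k, replace=False)
        for _ in range(t):
            # assign each object to its nearest medoid
            labels = D[:, medoids].argmin(axis=1)
            new_medoids = medoids.copy()
            # each cluster's medoid is the member minimizing total dissimilarity
            for x in range(k):
                members = np.where(labels == x)[0]
                if len(members):
                    costs = D[np.ix_(members, members)].sum(axis=1)
                    new_medoids[x] = members[costs.argmin()]
            if np.array_equal(new_medoids, medoids):
                break  # converged
            medoids = new_medoids
        return medoids, D[:, medoids].argmin(axis=1)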

36
PAM Output (from R)
37
PAM Output (from R)
38
Silhouettes
  • These plots give an intuitive sense of how good
    the clustering is
  • Let diss(i,C) be the average dissimilarity
    between i and each element of cluster C
  • Let A be the cluster instance i is in
  • a(i) = diss(i,A)
  • Let B ≠ A be the cluster such that diss(i,B) is
    minimized
  • b(i) = diss(i,B)
  • The silhouette number s(i) is
    s(i) = (b(i) - a(i)) / max(a(i), b(i))
    (see the sketch below)
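
A short sketch of these quantities using scikit-learn's built-in
silhouette functions (the data is made up; silhouette_samples returns
s(i) for each instance, silhouette_score the average width):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
                  [8.0, 8.0], [8.5, 8.0], [8.0, 8.5], [5.0, 5.0]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_samples(X, labels))  # s(i) per instance
    print(silhouette_score(X, labels))    # average silhouette width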

39
Silhouette Example
40
Silhouettes
  • Let s(k) be the average silhouette width for k
    clusters.
  • The silhouette coefficient of a data set is
    SC = max over k of s(k)
  • The k that maximizes this value is an indication
    of the number of clusters

41
Fuzzy Clustering (FANNY)
  • The previous partitioning methods are hard (crisp)
  • Each object is in one and only one cluster
  • Instead of saying in or out, give a percentage of
    membership
  • This is the basis of fuzzy logic

42
Fuzzy Clustering (FANNY)
  • The algorithm is a bit too complex to cover here
  • Idea: Minimize the objective function
    Σ_v [ Σ_{i,j} u_iv² u_jv² diss(i,j) ] / [ 2 Σ_j u_jv² ]
  • where u_iv is the unknown membership of object i
    in cluster v
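
FANNY itself works from a dissimilarity matrix and is not reproduced
here; to give the flavor of fuzzy membership, here is a sketch of the
related fuzzy c-means algorithm on raw data (the fuzzifier m = 2 and
all names are my choices):

    import numpy as np

    def fuzzy_c_means(X, k, m=2.0, t=100, seed=0):
        rng = np.random.default_rng(seed)
        # random initial memberships, rows summing to 1
        u = rng.random((len(X), k))
        u /= u.sum(axis=1, keepdims=True)
        for _ in range(t):
            # membership-weighted cluster centers
            w = u ** m
            centers = (w.T @ X) / w.sum(axis=0)[:, None]
            # update memberships from the distances to the centers
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            u = 1.0 / d ** (2.0 / (m - 1.0))
            u /= u.sum(axis=1, keepdims=True)
        return u  # u[i, v] = membership of object i in cluster v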

43
FANNY Results
44
FANNY Results
45
Hierarchical Methods
  • Top-down vs. bottom-up
  • Agglomerative Nesting (AGNES)
  • Divisive Analysis (DIANA)
  • BIRCH

46
Top-Down vs. Bottom-Up
  • Top-down or divisive approaches split the whole
    data set into smaller pieces
  • Bottom-up or agglomerative approaches combine
    individual elements

47
Agglomerative Nesting
  • Combine clusters until one cluster is obtained
  • Initially each cluster contains one object
  • At each step, select the two most similar
    clusters (e.g., average linking)

48
Cluster Dissimilarities
[Figure: the dissimilarity diss(i,j) between an object i in cluster Q and an object j in cluster R]
49
Cluster Dissimilarities
  • The dissimilarity between clusters can be defined
    in several ways (see the linkage sketch below)
  • Maximum dissimilarity between two objects
    • Complete linkage
  • Minimum dissimilarity between two objects
    • Single linkage
  • Centroid method
    • Interval-scaled attributes
  • Ward's method
    • Interval-scaled attributes
    • Based on the error sum of squares of a cluster
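
These linkage choices map directly onto SciPy's agglomerative
clustering; a sketch with made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

    for method in ("complete", "single", "average", "centroid", "ward"):
        Z = linkage(X, method=method)                    # the merge history
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)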

50
Example
51
AGNES Results
52
AGNES Results
53
Divisive Analysis (DIANA)
  • Calculate the diameter of each cluster Q
  • Select the cluster Q with the largest diameter
  • Split Q into A and B
  • Repeatedly select the object i in A that maximizes
    (average diss(i, A - {i})) - (average diss(i, B))
  • Move i from A to B while that maximum value is > 0
    (see the sketch below)
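
A rough NumPy sketch of one such split step, assuming a precomputed
dissimilarity matrix D over at least two objects (names are mine; the
recursion over clusters is omitted):

    import numpy as np

    def diana_split(D, members):
        A = list(members)
        # seed B with the object of largest average dissimilarity to the rest
        avg = [D[i, [j for j in A if j != i]].mean() for i in A]
        B = [A.pop(int(np.argmax(avg)))]
        while len(A) > 1:
            # for each i in A: avg diss to A - {i} minus avg diss to B
            gains = [D[i, [j for j in A if j != i]].mean() - D[i, B].mean()
                     for i in A]
            best = int(np.argmax(gains))
            if gains[best] <= 0:
                break  # no object prefers the splinter group
            B.append(A.pop(best))
        return A, B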

54
DIANA Results
55
DIANA Results
56
BIRCH
  • Balanced Iterative Reducing and Clustering Using
    Hierarchies
  • Mixes hierarchical clustering with other
    techniques
  • Useful for large data sets, because entire data
    is not kept in memory
  • Identifies and removes outliers from clustering
  • Due to differing distribution of data
  • Presentation assumes continuous data

57
Two Central Concepts
  • A cluster feature (CF) is a triple summarizing
    information about a cluster:
    CF = (N, LS, SS)
  • where N is the number of points in the cluster,
    LS is the linear sum of the data points, and SS is
    the square sum of the data points.

58
Two Central Concepts
  • CFs contain enough information to calculate a variety
    of distance measures
  • Adding CFs exactly yields the CF of the
    merged clusters (see the sketch below)
  • Memory- and time-efficient to maintain
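
A small Python sketch of a CF triple, its additivity, and the
statistics it supports (class and method names are mine):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class ClusterFeature:
        n: int            # number of points
        ls: np.ndarray    # linear sum of the points
        ss: float         # square sum of the points

        def merge(self, other):
            # additivity: the CF of a merged cluster is the sum of the CFs
            return ClusterFeature(self.n + other.n,
                                  self.ls + other.ls,
                                  self.ss + other.ss)

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # avg distance to centroid, computable from (N, LS, SS) alone
            c = self.centroid()
            return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))

    def cf_of(points):
        pts = np.asarray(points, dtype=float)
        return ClusterFeature(len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

    merged = cf_of([[1.0, 2.0], [2.0, 2.0]]).merge(cf_of([[10.0, 0.0]]))
    print(merged.centroid(), merged.radius())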

59
Two Central Concepts
  • A CF tree is a height balanced tree with two
    parameters, branching factor B, and diameter
    threshold T.

[Figure: a CF tree; the root holds entries CF1 ... CFn, each pointing to level-1 nodes (CF11 ... CF1n), with the leaf entries representing clusters]
60
Phase 1: Build CF Tree
  • The CF tree is kept in memory and created
    dynamically
  • Identify the appropriate leaf: recursively descend the
    tree, following the closest child node.
  • Modify the leaf: when the leaf is reached, add the new
    data item x to it. If the leaf then contains more than L
    entries, split the leaf. (Must also satisfy T.)

61
Phase 1: Build CF Tree
  • Steps, continued:
  • Modify the path to the leaf: update the CF of each parent.
    If the leaf split, add a new entry to the parent. If the
    parent then violates B, split the parent node. Update
    parent nodes recursively.
  • Merging refinement: find the non-leaf node Nj at which the
    splitting stopped. Find the two closest entries in Nj. If
    they are not the pair resulting from the split, merge the
    two entries.

62
Phase 1: Comments
  • The parameters B and L are a function of the page
    size P
  • Splits are caused by P, not by the data distribution
  • Hence the refinement step
  • Increasing T makes a smaller tree, but can hide
    outliers
  • Increase T and rebuild if memory runs out (this is
    Phase 2)

63
Phase 3: Global Clustering
  • Apply some global clustering technique to the
    leaf clusters of the CF tree
  • Fast, because everything is in memory
  • Accurate, because outliers are removed and the data
    is represented at a level allowed by memory
  • Less order-dependent, because the leaves have data
    locality

64
Phase 4: Cluster Refinement
  • Use the centroids of the clusters found in Phase 3
  • Identify the centroid C closest to each data point
  • Place the data point in the cluster represented by C

65
Probabilistic Methods
  • COBWEB
  • Hierarchical description with probabilities
    associated with attributes.
  • Mixture Models
  • Define probability distributions for each cluster
    in the data.

66
COBWEB
  • Fisher, 1987
  • Incremental approach to clustering
  • Creates a classification tree, in which each node
    represents a concept and stores a probabilistic
    description of that concept
  • The prior probability of the concept
  • Conditional probabilities for the attributes,
    given that concept

67
Classification Tree
68
Algorithm
  • Add each data item to the hierarchy one at a
    time.
  • Try placing the data item in each existing node
    (going level by level), and select the best node by
    maximizing the average category utility (CU)

69
Algorithm
  • Incorporating a new instance might cause the two
    best nodes to merge
  • Calculate CU for the merged nodes
  • Alternatively, incorporating a new instance might
    cause a split
  • Calculate CU for splitting the best node

70
Probability-Based Clustering
  • Consider clustering data into k clusters
  • Model each cluster with a probability
    distribution
  • This set of k distributions is called a mixture,
    and the overall model is a finite mixture model.
  • Each probability distribution gives the
    probability of an instance being in a given
    cluster

71
Mixture Model Clustering
  • Simplest case: a single numeric attribute and two
    clusters A and B, each represented by a normal
    distribution
  • Parameters for A: μA (mean), σA (standard dev.)
  • Parameters for B: μB (mean), σB (standard dev.)
  • And P(A), P(B) = 1 - P(A), the prior
    probabilities of being in clusters A and B
    respectively

72
Probability-Based Clustering
[Plot: a mixture of two normal densities with μA = 50, σA = 5, pA = 0.6 and μB = 65, σB = 2, pB = 0.4]
73
Probability-Based Clustering
  • The question is, how do we know the parameters of
    the mixture?
  • μA, σA, μB, σB, P(A)
  • If the data is labeled, this is easy
  • But clustering is more often used on unlabeled
    data
  • Use an iterative approach similar in spirit to
    the k-means algorithm

74
Expectation Maximization
  • Start with initial guesses for the parameters
  • Calculate cluster probabilities for each instance
    (the Expectation step)
  • Re-estimate the parameters from those probabilities
    (the Maximization step)
  • Repeat

75
Maximization
  • Probability that xi is in A (the weight wi), where
    P(xi|A) is the normal density with parameters μA, σA:
    wi = P(A|xi) = P(xi|A) P(A) / P(xi)
  • Estimated mean of A:
    μA = Σ wi xi / Σ wi
  • Estimated variance of A (maximum likelihood estimate):
    σA² = Σ wi (xi - μA)² / Σ wi
  • Prior probability of being in A:
    P(A) = Σ wi / n
  • (A runnable sketch of these updates follows)
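
A runnable sketch of the full EM loop for this two-cluster,
one-attribute case (function names are mine; the sample data uses the
parameters from the earlier plot):

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, t=200, seed=0):
        rng = np.random.default_rng(seed)
        muA, muB = rng.choice(x, size=2, replace=False)  # initial guesses
        sdA = sdB = x.std()
        pA = 0.5
        for _ in range(t):
            # Expectation: wi = probability that xi belongs to A
            likA = pA * norm.pdf(x, muA, sdA)
            likB = (1 - pA) * norm.pdf(x, muB, sdB)
            w = likA / (likA + likB)
            # Maximization: re-estimate the parameters from the weights
            muA = (w * x).sum() / w.sum()
            muB = ((1 - w) * x).sum() / (1 - w).sum()
            sdA = np.sqrt((w * (x - muA) ** 2).sum() / w.sum())
            sdB = np.sqrt(((1 - w) * (x - muB) ** 2).sum() / (1 - w).sum())
            pA = w.mean()
        return muA, sdA, muB, sdB, pA

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(50, 5, 600), rng.normal(65, 2, 400)])
    print(em_two_gaussians(x))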
76
Termination
  • The EM algorithm converges to a maximum, but
    never gets there
  • Continue until overall likelihood growth is
    negligible
  • Maximum could be local, so repeat several times
    with different initial values

77
Extending the Model
  • Extending to multiple clusters is
    straightforward, just use k normal distributions
  • For multiple attributes, assume independence and
    multiply attribute probabilities as in Naïve
    Bayes
  • For nominal attributes, we can't use a normal
    distribution. We have to create probability
    distributions over the values, one per cluster.
    This gives kv parameters to estimate, where v is
    the number of values for the nominal attribute.
  • Can use different distributions depending on the
    data, e.g., a log-normal distribution for
    attributes with a minimum

78
Other Clustering Approaches
  • Genetic Algorithms
  • Global search for solutions
  • Neural Networks
  • Competitive Learning
  • Kohonen Network

79
Kohonen Data
80
Applications of Clustering
  • Gene function identification
  • Document clustering
  • Modeling economic data

81
Gene Function Identification
  • Genome is the blueprint defining an organism
    (DNA)
  • Genes are inherited portions of the genome
    related to biological functions
  • Proteins, non-coding RNA
  • Given the collection of biological information,
    try to predict or identify the function of genes
    with unknown function

82
Gene Expression
  • A gene is expressed if it is actively being
    transcribed (copied into RNA)
  • Rate of expression is related to the rate of
    transcription
  • Microarray experiments

83
Gene Expression Data
[Figure: gene expression matrix, clones (rows) × experiments (columns)]
84
Clustering Gene Expression Data
  • Identify genes with similar expression profiles
  • Use clustering
  • Identify function of known genes in a cluster
  • Assign that function to genes of unknown function
    in the same cluster

85
Clustering Gene Expression Data
Yeast Genome
86
Document Clustering
  • Represent documents as vectors in a vector space
  • Cluster documents in this representation
  • Describe/summarize/evaluate the clusters
  • Label clusters with meaningful descriptions

87
Document Transformation
  • Convert each document into table form
  • The attributes are important words
  • Each value is the number of times the word appears
    (see the sketch below)
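
scikit-learn's CountVectorizer performs this word-count
transformation; a sketch with made-up documents (get_feature_names_out
assumes a recent scikit-learn version):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["clustering groups similar documents",
            "k-means clustering of documents",
            "gene expression profiles"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)           # documents x words count matrix
    print(vec.get_feature_names_out())    # the word "attributes"
    print(X.toarray())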

88
Document Classification
  • Could select a word as the classification label
  • Identify it after clustering
  • Look at the medoid or centroid and examine its
    characteristics
  • Look at the number of times certain words appear
  • Look at which words appear together
  • Look at words that don't appear at all

89
Document Classification
  • Once clusters are identified, label each document
    with a cluster label
  • Use a classification technique to identify
    cluster relationships
  • Decision trees for example
  • Other kinds of rule induction

90
Document Clustering
  • MeSH
  • 21,975 terms
  • RGD
  • 2,713 papers
  • Dissimilarity matrix
  • Multidimensional scaling
  • FANNY
  • Red - Sequence related
  • Black - Physiological

91
Economic Modelling
  • Nettleton et al. (2000)
  • Objective is to identify how to make the Port of
    Barcelona the principal port of entry for
    merchandise.
  • Statistics, clustering, outlier analysis,
    categorization of continuous values

92
Data
  • Vessel specific
  • Date, type, origin, destination, metric tons
    loaded/unloaded, amount invoiced, quality
    measure, etc.
  • Economic indicators
  • Consumer price index, date (monthly), Industrial
    production index, etc.

93
Data Transformation
  • Data aggregated into 4 relational tables
  • Total monthly visits, etc.
  • Joined based on date
  • Separated into training (1997-1998) and test sets
    (1999)

94
Data Exploration
  • Used clustering with Condorcet criterion
  • IBM Intelligent Miner
  • Identified relevant features
  • Production import volume > 400,000 MT for product
    1
  • Area import volume > 250,000 MT for area 10
  • Then used rule induction to characterize clusters

95
Cluster Analysis Summary
  • Unsupervised technique
  • Similarity-based
  • Can modify definition of similarity to meet needs
  • Many techniques
  • Partitional, hierarchical, probability based, NN,
    GA, etc.
  • Combined with some other descriptive technique
  • Decision trees, rule induction, etc.

96
Cluster Analysis Summary
  • Issues
  • Number of clusters
  • Quality of clustering
  • Meaning of clusters
  • Clustering large data sets
  • Applications