1
Cluster Analysis
  • Craig A. Struble
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University

2
Clustering Outline
  • Problem Overview
  • Techniques
  • Partitional Algorithms
  • Hierarchical Algorithms
  • Probability Based Algorithms
  • Other Approaches
  • Interpretations
  • Applications

3
Goals
  • Explore different clustering techniques
  • Understand complexity issues
  • Learn to interpret clustering results
  • Explore applications of clustering

4
Clustering Examples
  • Segment customer database based on similar buying
    patterns.
  • Group houses in a town into neighborhoods based
    on similar features.
  • Identify new plant species
  • Identify similar Web usage patterns

5
Clustering Example
6
Clustering Houses
7
Clustering Problem
  • Given a database D = {t1, t2, ..., tn} of tuples and an
    integer value k, the Clustering Problem is to
    define a mapping f : D → {1, ..., k} where each ti is
    assigned to one cluster Kj, 1 ≤ j ≤ k.
  • A cluster, Kj, contains precisely those tuples
    mapped to it.
  • Unlike the classification problem, clusters are not
    known a priori.

8
Clustering vs. Classification
  • No prior knowledge
  • Number of clusters
  • Meaning of clusters
  • Unsupervised learning
  • Data has no class labels

9
Clustering Approaches
[Diagram: taxonomy of clustering approaches, including sampling and compression]
10
Clustering Issues
  • Outlier handling
  • Dynamic data
  • Interpreting results
  • Evaluating results
  • Number of clusters
  • Data to be used
  • Scalability

11
Impact of Outliers on Clustering
12
Visualizations
13
Cluster Parameters
14
Resources
  • Classic text is Finding Groups in Data by Kaufman
    and Rousseeuw, 1990
  • Overwhelming number of algorithms
  • Several implementations
  • R and Weka

15
Partitional Clustering
  • Simultaneous clustering
  • All elements are in some cluster during each
    iteration
  • May be shifted from one cluster to another
  • Some metric is used to determine goodness of
    clustering
  • Average distance between clusters
  • Squared error metric
  • Combinatorial problem
  • 11,259,666,000 ways to cluster 19 items into 4
    clusters

16
K-Means
Algorithm KMeans
Input:  k    - number of clusters
        t    - number of iterations
        data - the data
Output: C    - a set of k clusters

cent = arbitrarily select k objects as initial centers
for i = 1 to t do
    for each d in data do
        assign label x to d such that dist(d, cent[x]) is minimized
    for x = 1 to k do
        cent[x] = mean value of all data with label x
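
As a concrete companion to the pseudocode, here is a minimal NumPy
sketch of the same loop (the function name, the random
initialization, and the example points are mine, not from the
slides):

    import numpy as np

    def kmeans(data, k, t=100, seed=0):
        # data: (n, l) array; k: number of clusters; t: iterations
        rng = np.random.default_rng(seed)
        # arbitrarily select k objects as the initial centers
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(t):
            # assign each point the label of its nearest center
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of its assigned points
            for x in range(k):
                if np.any(labels == x):
                    centers[x] = data[labels == x].mean(axis=0)
        return labels, centers

    points = np.array([[10.0, 4.0], [5.0, 4.0], [9.5, 3.5], [5.5, 4.5]])
    labels, centers = kmeans(points, k=2)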
17
Example Data
18
K-Means Example
  • Use Euclidean distance (l is the number of
    dimensions): dist(x, y) = sqrt( Σ_{i=1..l} (x_i - y_i)² )
  • Select (10,4) and (5,4) as the initial points

19
K-Means Clustering
[Plots: k-means results with 3 clusters and with 2 clusters]
20
Cluster Centers
21
K-Means Summary
  • Very simple algorithm
  • Only works on data for which means can be
    calculated
  • Continuous data
  • O(knt) time complexity
  • k - number of clusters,
  • n - number of instances,
  • t - number of iterations
  • Finds only circular (spherical) cluster shapes
  • Outliers can have a very negative impact

22
Outliers
23
Partitioning Methods
  • K-Means (already done)
  • MST
  • K-Medoids (PAM)
  • Fuzzy Clustering

24
Dissimilarity Matrices
  • Many clustering algorithms use a dissimilarity
    matrix as input
  • An Instance × Instance matrix of pairwise
    dissimilarities (see the sketch below)
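
SciPy can construct such a matrix directly; a small sketch with
made-up instances:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[10.0, 4.0], [5.0, 4.0], [9.0, 3.0], [6.0, 5.0]])
    # condensed pairwise Euclidean distances, expanded to Instance x Instance
    D = squareform(pdist(X, metric="euclidean"))
    print(np.round(D, 3))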

25
Graph Perspective
[Figure: the five example instances drawn as a weighted graph on nodes 1-5, with pairwise distances such as 4, 4.472, 5, 5.38, 5.83, and 9 as edge weights]
26
MST Algorithm
  • Compute the minimal spanning tree of the graph
  • Set of edges with minimal total weight so that
    each node can be reached
  • Remove edges that are inconsistent
  • E.g., an edge whose weight is much larger than the
    average weight of neighboring edges
  • The remaining connected components form the clusters
    (see the sketch below)
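
A minimal SciPy sketch of the idea, using a deliberately simple
inconsistency rule (cut the heaviest MST edges) instead of the
neighboring-average rule; the data and all names are mine:

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    def mst_clusters(X, edges_to_cut=2):
        D = squareform(pdist(X))
        mst = minimum_spanning_tree(D).toarray()
        weights = mst[mst > 0]
        # remove the edges_to_cut heaviest edges of the tree
        cutoff = np.sort(weights)[-edges_to_cut]
        mst[mst >= cutoff] = 0.0
        # the surviving connected components are the clusters
        n_comp, labels = connected_components(mst, directed=False)
        return labels

    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
    print(mst_clusters(X, edges_to_cut=2))  # three clusters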

27
MST Example
[Figure: weighted graph on nodes A-E, with edge weights including 1, 2, 3, and 1]
28
MST Algorithm
29
MST Example
Let k be 3, and let inconsistent edge be
defined as an edge with maximum weight.
[Figure: after removing the maximum-weight edges, the weight-1 edges A-B and C-D remain, giving clusters {A,B}, {C,D}, and {E}]
30
MST Summary
  • Cost is dominated by MST procedure
  • Time and space O(n2)
  • Number of clusters not necessarily needed
  • Implicitly defined by the inconsistent-edge criterion

31
K-Medoids
  • K-Means is restricted to numeric attributes
  • Have to calculate the average object
  • A medoid is a representative object in a cluster.
  • The algorithm searches for K medoids in the set
    of objects.

32
K-Medoids
33
PAM Algorithm
  • PAM consists of two phases
  • BUILD - constructs an initial clustering
  • SWAP - refines the clustering
  • Goal: Minimize the sum of dissimilarities to the K
    representative objects.
  • Mathematically equivalent to minimizing the average
    dissimilarity

34
PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
// Select k representative objects that appear to minimize
// dissimilarities
selected = {}                          // empty set
for x = 1 to k do
    maxgain = 0
    for each i in data - selected do
        gain = 0
        for each j in data - selected do
            // See if j is closer to i than to some other
            // previously selected object
            let Dj = min(diss(j,k) for each k in selected)
                     // (take Dj to be infinite when selected is empty)
            let Cji = max(Dj - diss(j,i), 0)
            gain = gain + Cji          // total up improvements from i
        if gain > maxgain then
            maxgain = gain
            best = i
    selected = selected + {best}       // best representative object chosen

35
PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
// Improve the partitioning; selected comes from the BUILD phase
repeat
    minswap = 0
    for each i in selected do
        for each h in data - selected do
            swap = 0
            for each j in data - selected - {h} do
                let Dj = min(diss(j,k) for each k in selected)
                if Dj < diss(j,i) and Dj < diss(j,h) then
                    swap = swap + 0            // j is closer to something else: do nothing
                else if diss(j,i) = Dj then    // j is closest to i
                    let Ej = min(diss(j,k) for each k in selected - {i})
                    if diss(j,h) < Ej then
                        swap = swap + diss(j,h) - Dj
                    else
                        swap = swap + Ej - Dj  // j moves to its second-closest medoid
                else                           // diss(j,h) < Dj: j moves to h
                    swap = swap + diss(j,h) - Dj
            if swap < minswap then
                minswap = swap
                besti = i
                besth = h
    if minswap < 0 then
        selected = selected - {besti} + {besth}   // perform the best swap
until minswap >= 0
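
For comparison, a compact NumPy sketch of the simpler alternating
k-medoids heuristic (not the full BUILD/SWAP procedure above); like
PAM it works from a precomputed dissimilarity matrix, and all names
are mine:

    import numpy as np

    def k_medoids(D, k, t=100, seed=0):
        rng = np.random.default_rng(seed)
        medoids = rng.choice(D.shape[0], size=k, replace=False)
        for _ in range(t):
            # assign each object to its nearest medoid
            labels = D[:, medoids].argmin(axis=1)
            new_medoids = medoids.copy()
            # each cluster's medoid is the member minimizing total dissimilarity
            for x in range(k):
                members = np.where(labels == x)[0]
                if len(members):
                    costs = D[np.ix_(members, members)].sum(axis=1)
                    new_medoids[x] = members[costs.argmin()]
            if np.array_equal(new_medoids, medoids):
                break  # converged
            medoids = new_medoids
        return medoids, D[:, medoids].argmin(axis=1)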

36
PAM Output (from R)
37
PAM Output (from R)
38
Silhouettes
  • These plots give an intuitive sense of how good
    the clustering is
  • Let diss(i,C) be the average dissimilarity
    between i and each element of cluster C
  • Let A be the cluster instance i is in
  • a(i) = diss(i,A)
  • Let B ≠ A be the cluster such that diss(i,B) is
    minimized
  • b(i) = diss(i,B)
  • The silhouette number s(i) is
    s(i) = (b(i) - a(i)) / max(a(i), b(i))
    (see the sketch below)
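
A short sketch of these quantities using scikit-learn's built-in
silhouette functions (the data is made up; silhouette_samples returns
s(i) for each instance, silhouette_score the average width):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
                  [8.0, 8.0], [8.5, 8.0], [8.0, 8.5], [5.0, 5.0]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_samples(X, labels))  # s(i) per instance
    print(silhouette_score(X, labels))    # average silhouette width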

39
Silhouette Example
40
Silhouettes
  • Let s(k) be the average silhouette width for k
    clusters.
  • The silhouette coefficient of a data set is
    SC = max over k of s(k)
  • The k that maximizes this value is an indication
    of the number of clusters

41
Fuzzy Clustering (FANNY)
  • The previous partitioning methods are hard (crisp)
  • Each object is in one and only one cluster
  • Instead of saying in or out, give a percentage of
    membership
  • This is the basis of fuzzy logic

42
Fuzzy Clustering (FANNY)
  • The algorithm is a bit too complex to cover here
  • Idea: Minimize the objective function
    Σ_v [ Σ_{i,j} u_iv² u_jv² diss(i,j) ] / [ 2 Σ_j u_jv² ]
  • where u_iv is the unknown membership of object i
    in cluster v
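
FANNY itself works from a dissimilarity matrix and is not reproduced
here; to give the flavor of fuzzy membership, here is a sketch of the
related fuzzy c-means algorithm on raw data (the fuzzifier m = 2 and
all names are my choices):

    import numpy as np

    def fuzzy_c_means(X, k, m=2.0, t=100, seed=0):
        rng = np.random.default_rng(seed)
        # random initial memberships, rows summing to 1
        u = rng.random((len(X), k))
        u /= u.sum(axis=1, keepdims=True)
        for _ in range(t):
            # membership-weighted cluster centers
            w = u ** m
            centers = (w.T @ X) / w.sum(axis=0)[:, None]
            # update memberships from the distances to the centers
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            u = 1.0 / d ** (2.0 / (m - 1.0))
            u /= u.sum(axis=1, keepdims=True)
        return u  # u[i, v] = membership of object i in cluster v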

43
FANNY Results
44
FANNY Results
45
Hierarchical Methods
  • Top-down vs. bottom-up
  • Agglomerative Nesting (AGNES)
  • Divisive Analysis (DIANA)
  • BIRCH

46
Top-Down vs. Bottom-Up
  • Top-down or divisive approaches split the whole
    data set into smaller pieces
  • Bottom-up or agglomerative approaches combine
    individual elements

47
Agglomerative Nesting
  • Combine clusters until one cluster is obtained
  • Initially each cluster contains one object
  • At each step, select the two most similar
    clusters (e.g., average linking)

48
Cluster Dissimilarities
[Figure: the dissimilarity diss(i,j) between an object i in cluster Q and an object j in cluster R]
49
Cluster Dissimilarities
  • The dissimilarity between clusters can be defined
    in several ways (see the linkage sketch below)
  • Maximum dissimilarity between two objects
    • Complete linkage
  • Minimum dissimilarity between two objects
    • Single linkage
  • Centroid method
    • Interval-scaled attributes
  • Ward's method
    • Interval-scaled attributes
    • Based on the error sum of squares of a cluster
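
These linkage choices map directly onto SciPy's agglomerative
clustering; a sketch with made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

    for method in ("complete", "single", "average", "centroid", "ward"):
        Z = linkage(X, method=method)                    # the merge history
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)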

50
Example
51
AGNES Results
52
AGNES Results
53
Divisive Analysis (DIANA)
  • Calculate the diameter of each cluster Q
  • Select the cluster Q with the largest diameter
  • Split Q into A and B
  • Repeatedly select the object i in A that maximizes
    (average diss(i, A - {i})) - (average diss(i, B))
  • Move i from A to B while that maximum value is > 0
    (see the sketch below)
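
A rough NumPy sketch of one such split step, assuming a precomputed
dissimilarity matrix D over at least two objects (names are mine; the
recursion over clusters is omitted):

    import numpy as np

    def diana_split(D, members):
        A = list(members)
        # seed B with the object of largest average dissimilarity to the rest
        avg = [D[i, [j for j in A if j != i]].mean() for i in A]
        B = [A.pop(int(np.argmax(avg)))]
        while len(A) > 1:
            # for each i in A: avg diss to A - {i} minus avg diss to B
            gains = [D[i, [j for j in A if j != i]].mean() - D[i, B].mean()
                     for i in A]
            best = int(np.argmax(gains))
            if gains[best] <= 0:
                break  # no object prefers the splinter group
            B.append(A.pop(best))
        return A, B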

54
DIANA Results
55
DIANA Results
56
BIRCH
  • Balanced Iterative Reducing and Clustering Using
    Hierarchies
  • Mixes hierarchical clustering with other
    techniques
  • Useful for large data sets, because entire data
    is not kept in memory
  • Identifies and removes outliers from clustering
  • Due to differing distribution of data
  • Presentation assumes continuous data

57
Two Central Concepts
  • A cluster feature (CF) is a triple summarizing
    information about a cluster:
    CF = (N, LS, SS)
  • where N is the number of points in the cluster,
    LS is the linear sum of the data points, and SS is
    the square sum of the data points.

58
Two Central Concepts
  • CFs contain enough information to calculate a variety
    of distance measures
  • Adding CFs exactly yields the CF of the
    merged clusters (see the sketch below)
  • Memory- and time-efficient to maintain
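
A small Python sketch of a CF triple, its additivity, and the
statistics it supports (class and method names are mine):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class ClusterFeature:
        n: int            # number of points
        ls: np.ndarray    # linear sum of the points
        ss: float         # square sum of the points

        def merge(self, other):
            # additivity: the CF of a merged cluster is the sum of the CFs
            return ClusterFeature(self.n + other.n,
                                  self.ls + other.ls,
                                  self.ss + other.ss)

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # avg distance to centroid, computable from (N, LS, SS) alone
            c = self.centroid()
            return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))

    def cf_of(points):
        pts = np.asarray(points, dtype=float)
        return ClusterFeature(len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

    merged = cf_of([[1.0, 2.0], [2.0, 2.0]]).merge(cf_of([[10.0, 0.0]]))
    print(merged.centroid(), merged.radius())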

59
Two Central Concepts
  • A CF tree is a height balanced tree with two
    parameters, branching factor B, and diameter
    threshold T.

[Figure: a CF tree; the root holds entries CF1 ... CFn, each pointing to level-1 nodes (CF11 ... CF1n), with the leaf entries representing clusters]
60
Phase 1: Build CF Tree
  • The CF tree is kept in memory and created
    dynamically
  • Identify the appropriate leaf: recursively descend the
    tree, following the closest child node.
  • Modify the leaf: when the leaf is reached, add the new
    data item x to it. If the leaf then contains more than L
    entries, split the leaf. (Must also satisfy T.)

61
Phase 1: Build CF Tree
  • Steps, continued:
  • Modify the path to the leaf: update the CF of each parent.
    If the leaf split, add a new entry to the parent. If the
    parent then violates B, split the parent node. Update
    parent nodes recursively.
  • Merging refinement: find the non-leaf node Nj at which the
    splitting stopped. Find the two closest entries in Nj. If
    they are not the pair resulting from the split, merge the
    two entries.

62
Phase 1: Comments
  • The parameters B and L are a function of the page
    size P
  • Splits are caused by P, not by the data distribution
  • Hence the refinement step
  • Increasing T makes a smaller tree, but can hide
    outliers
  • Increase T and rebuild if memory runs out (this is
    Phase 2)

63
Phase 3: Global Clustering
  • Apply some global clustering technique to the
    leaf clusters of the CF tree
  • Fast, because everything is in memory
  • Accurate, because outliers are removed and the data
    is represented at a level allowed by memory
  • Less order-dependent, because the leaves have data
    locality

64
Phase 4: Cluster Refinement
  • Use the centroids of the clusters found in Phase 3
  • Identify the centroid C closest to each data point
  • Place the data point in the cluster represented by C

65
Probabilistic Methods
  • COBWEB
  • Hierarchical description with probabilities
    associated with attributes.
  • Mixture Models
  • Define probability distributions for each cluster
    in the data.

66
COBWEB
  • Fisher, 1987
  • Incremental approach to clustering
  • Creates a classification tree, in which each node
    represents a concept and stores a probabilistic
    description of that concept
  • The prior probability of the concept
  • Conditional probabilities for the attributes,
    given that concept

67
Classification Tree
68
Algorithm
  • Add each data item to the hierarchy one at a
    time.
  • Try placing the data item in each existing node
    (going level by level), and select the best node by
    maximizing the average category utility (CU)

69
Algorithm
  • Incorporating a new instance might cause the two
    best nodes to merge
  • Calculate CU for the merged nodes
  • Alternatively, incorporating a new instance might
    cause a split
  • Calculate CU for splitting the best node

70
Probability-Based Clustering
  • Consider clustering data into k clusters
  • Model each cluster with a probability
    distribution
  • This set of k distributions is called a mixture,
    and the overall model is a finite mixture model.
  • Each probability distribution gives the
    probability of an instance being in a given
    cluster

71
Mixture Model Clustering
  • Simplest case: a single numeric attribute and two
    clusters A and B, each represented by a normal
    distribution
  • Parameters for A: μA (mean), σA (standard dev.)
  • Parameters for B: μB (mean), σB (standard dev.)
  • And P(A), P(B) = 1 - P(A), the prior
    probabilities of being in clusters A and B
    respectively

72
Probability-Based Clustering
[Plot: a mixture of two normal densities with μA = 50, σA = 5, pA = 0.6 and μB = 65, σB = 2, pB = 0.4]
73
Probability-Based Clustering
  • The question is, how do we know the parameters of
    the mixture?
  • μA, σA, μB, σB, P(A)
  • If the data is labeled, this is easy
  • But clustering is more often used on unlabeled
    data
  • Use an iterative approach similar in spirit to
    the k-means algorithm

74
Expectation Maximization
  • Start with initial guesses for the parameters
  • Calculate cluster probabilities for each instance
    (the Expectation step)
  • Re-estimate the parameters from those probabilities
    (the Maximization step)
  • Repeat

75
Maximization
  • Probability that xi is in A (the weight wi), where
    P(xi|A) is the normal density with parameters μA, σA:
    wi = P(A|xi) = P(xi|A) P(A) / P(xi)
  • Estimated mean of A:
    μA = Σ wi xi / Σ wi
  • Estimated variance of A (maximum likelihood estimate):
    σA² = Σ wi (xi - μA)² / Σ wi
  • Prior probability of being in A:
    P(A) = Σ wi / n
  • (A runnable sketch of these updates follows)
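
A runnable sketch of the full EM loop for this two-cluster,
one-attribute case (function names are mine; the sample data uses the
parameters from the earlier plot):

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, t=200, seed=0):
        rng = np.random.default_rng(seed)
        muA, muB = rng.choice(x, size=2, replace=False)  # initial guesses
        sdA = sdB = x.std()
        pA = 0.5
        for _ in range(t):
            # Expectation: wi = probability that xi belongs to A
            likA = pA * norm.pdf(x, muA, sdA)
            likB = (1 - pA) * norm.pdf(x, muB, sdB)
            w = likA / (likA + likB)
            # Maximization: re-estimate the parameters from the weights
            muA = (w * x).sum() / w.sum()
            muB = ((1 - w) * x).sum() / (1 - w).sum()
            sdA = np.sqrt((w * (x - muA) ** 2).sum() / w.sum())
            sdB = np.sqrt(((1 - w) * (x - muB) ** 2).sum() / (1 - w).sum())
            pA = w.mean()
        return muA, sdA, muB, sdB, pA

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(50, 5, 600), rng.normal(65, 2, 400)])
    print(em_two_gaussians(x))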
76
Termination
  • The EM algorithm converges to a maximum, but
    never gets there
  • Continue until overall likelihood growth is
    negligible
  • Maximum could be local, so repeat several times
    with different initial values

77
Extending the Model
  • Extending to multiple clusters is
    straightforward, just use k normal distributions
  • For multiple attributes, assume independence and
    multiply attribute probabilities as in Naïve
    Bayes
  • For nominal attributes, we can't use a normal
    distribution. We have to create probability
    distributions over the values, one per cluster.
    This gives kv parameters to estimate, where v is
    the number of values for the nominal attribute.
  • Can use different distributions depending on the
    data, e.g., a log-normal distribution for
    attributes with a minimum

78
Other Clustering Approaches
  • Genetic Algorithms
  • Global search for solutions
  • Neural Networks
  • Competitive Learning
  • Kohonen Network

79
Kohonen Data
80
Applications of Clustering
  • Gene function identification
  • Document clustering
  • Modeling economic data

81
Gene Function Identification
  • Genome is the blueprint defining an organism
    (DNA)
  • Genes are inherited portions of the genome
    related to biological functions
  • Proteins, non-coding RNA
  • Given the collection of biological information,
    try to predict or identify the function of genes
    with unknown function

82
Gene Expression
  • A gene is expressed if it is actively being
    transcribed (copied into RNA)
  • Rate of expression is related to the rate of
    transcription
  • Microarray experiments

83
Gene Expression Data
[Figure: gene expression matrix, clones (rows) × experiments (columns)]
84
Clustering Gene Expression Data
  • Identify genes with similar expression profiles
  • Use clustering
  • Identify function of known genes in a cluster
  • Assign that function to genes of unknown function
    in the same cluster

85
Clustering Gene Expression Data
Yeast Genome
86
Document Clustering
  • Represent documents as vectors in a vector space
  • Cluster documents in this representation
  • Describe/summarize/evaluate the clusters
  • Label clusters with meaningful descriptions

87
Document Transformation
  • Convert each document into table form
  • The attributes are important words
  • Each value is the number of times the word appears
    (see the sketch below)
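
scikit-learn's CountVectorizer performs this word-count
transformation; a sketch with made-up documents (get_feature_names_out
assumes a recent scikit-learn version):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["clustering groups similar documents",
            "k-means clustering of documents",
            "gene expression profiles"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)           # documents x words count matrix
    print(vec.get_feature_names_out())    # the word "attributes"
    print(X.toarray())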

88
Document Classification
  • Could select a word as the classification label
  • Identify it after clustering
  • Look at the medoid or centroid and examine its
    characteristics
  • Look at the number of times certain words appear
  • Look at which words appear together
  • Look at words that don't appear at all

89
Document Classification
  • Once clusters are identified, label each document
    with a cluster label
  • Use a classification technique to identify
    cluster relationships
  • Decision trees for example
  • Other kinds of rule induction

90
Document Clustering
  • MeSH
  • 21,975 terms
  • RGD
  • 2,713 papers
  • Dissimilarity matrix
  • Multidimensional scaling
  • FANNY
  • Red - Sequence related
  • Black - Physiological

91
Economic Modelling
  • Nettleton et al. (2000)
  • Objective is to identify how to make the Port of
    Barcelona the principal port of entry for
    merchandise.
  • Statistics, clustering, outlier analysis,
    categorization of continuous values

92
Data
  • Vessel specific
  • Date, type, origin, destination, metric tons
    loaded/unloaded, amount invoiced, quality
    measure, etc.
  • Economic indicators
  • Consumer price index, date (monthly), Industrial
    production index, etc.

93
Data Transformation
  • Data aggregated into 4 relational tables
  • Total monthly visits, etc.
  • Joined based on date
  • Separated into training (1997-1998) and test sets
    (1999)

94
Data Exploration
  • Used clustering with Condorcet criterion
  • IBM Intelligent Miner
  • Identified relevant features
  • Production import volume > 400,000 MT for product
    1
  • Area import volume > 250,000 MT for area 10
  • Then used rule induction to characterize clusters

95
Cluster Analysis Summary
  • Unsupervised technique
  • Similarity-based
  • Can modify definition of similarity to meet needs
  • Many techniques
  • Partitional, hierarchical, probability based, NN,
    GA, etc.
  • Combined with some other descriptive technique
  • Decision trees, rule induction, etc.

96
Cluster Analysis Summary
  • Issues
  • Number of clusters
  • Quality of clustering
  • Meaning of clusters
  • Clustering large data sets
  • Applications