
Cluster Analysis (cont.), Meeting 12

Course: M0614 / Data Mining & OLAP
Year: Feb 2010

Learning Outcomes

- By the end of this meeting, students are expected to be able to:
- Apply the partitioning, hierarchical, and model-based clustering techniques of cluster analysis in data mining. (C3)


Acknowledgments

- These slides have been adapted from Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques, and Tan, P.-N., Steinbach, M., Kumar, V., Introduction to Data Mining.


Outline

- A categorization of major clustering methods
- Hierarchical methods
- Model-based clustering methods
- Summary


Hierarchical Clustering

- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
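A minimal sketch of building and plotting such a dendrogram, assuming SciPy and matplotlib; the toy data points are illustrative, not from the slides:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# toy 2-D points (illustrative data only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0],
              [1.1, 0.9], [5.0, 5.0], [5.1, 4.8]])

Z = linkage(X, method="single")   # 'single' = MIN linkage; Z records each merge

dendrogram(Z)                     # the tree-like diagram of the merge sequence
plt.xlabel("point index")
plt.ylabel("merge distance")
plt.show()
```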

Strengths of Hierarchical Clustering

- Do not have to assume any particular number of clusters
  - Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- They may correspond to meaningful taxonomies
  - Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)

Hierarchical Clustering

- Two main types of hierarchical clustering
  - Agglomerative
    - Start with the points as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  - Divisive
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
  - Merge or split one cluster at a time

Hierarchical Clustering

- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.


AGNES (Agglomerative Nesting)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges nodes that have the least dissimilarity
- Proceeds in a non-descending fashion
- Eventually all nodes belong to the same cluster


Agglomerative Clustering Algorithm

- The more popular hierarchical clustering technique
- Basic algorithm is straightforward (a from-scratch sketch follows this list):
- 1. Compute the proximity matrix
- 2. Let each data point be a cluster
- 3. Repeat
- 4. Merge the two closest clusters
- 5. Update the proximity matrix
- 6. Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
- Different approaches to defining the distance between clusters distinguish the different algorithms
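A minimal from-scratch sketch of this loop, assuming Euclidean distance and single-link (MIN) cluster proximity; the function name `agglomerate` and the toy data are illustrative, not from the slides:

```python
import numpy as np
from itertools import combinations

def agglomerate(X, k=1):
    """Naive single-link (MIN) agglomerative clustering; O(n^3), for illustration."""
    clusters = [[i] for i in range(len(X))]                  # each point is a cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)      # proximity matrix
    while len(clusters) > k:                                 # repeat ...
        a, b = min(combinations(range(len(clusters)), 2),    # two closest clusters
                   key=lambda p: D[np.ix_(clusters[p[0]], clusters[p[1]])].min())
        clusters[a] += clusters[b]                           # merge them; dropping b
        del clusters[b]                                      # updates the proximities
    return clusters                                          # ... until k clusters remain

X = np.array([[0, 0], [0.1, 0.2], [1, 1], [1.1, 0.9], [5, 5]])
print(agglomerate(X, k=2))   # -> the four nearby points vs. the outlier
```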

Starting Situation

- Start with clusters of individual points and a proximity matrix

(Figure: individual points and their proximity matrix)

Intermediate Situation

- After some merging steps, we have some clusters

(Figure: clusters C1-C5 and their proximity matrix)

Intermediate Situation

- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: clusters C1-C5 and their proximity matrix, with C2 and C5 marked for merging)

After Merging

- The question is: how do we update the proximity matrix?

(Figure: after merging, the proximity-matrix row and column for the new cluster C2 ∪ C5 are unknown, shown as "?"; the entries among C1, C3, and C4 are unchanged)

How to Define Inter-Cluster Similarity

- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error

(Figure: two example clusters and their proximity matrix; the original slides repeat this list once per definition, illustrating each one in turn)


Cluster Similarity: MIN or Single Link

- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
- Determined by one pair of points, i.e., by one link in the proximity graph

Hierarchical Clustering: MIN

(Figures: nested clusters and the corresponding dendrogram)

Strength of MIN

- Can handle non-elliptical shapes

(Figure: original points and the resulting single-link clusters)

Limitations of MIN

- Sensitive to noise and outliers

(Figure: original points and the resulting single-link clusters)

Cluster Similarity: MAX or Complete Linkage

- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
- Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX

(Figures: nested clusters and the corresponding dendrogram)

Strength of MAX

- Less susceptible to noise and outliers

(Figure: original points and the resulting complete-link clusters)

Limitations of MAX

- Tends to break large clusters
- Biased towards globular clusters

(Figure: original points and the resulting complete-link clusters)

Cluster Similarity: Group Average

- Proximity of two clusters is the average of the pairwise proximities between points in the two clusters.
- Need to use average connectivity for scalability, since total proximity favors large clusters
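The slide's formula image is missing; the standard group-average definition implied by the first bullet is:

$$\text{proximity}(C_i, C_j) = \frac{\sum_{p \in C_i,\; q \in C_j} \text{proximity}(p, q)}{|C_i| \cdot |C_j|}$$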

Hierarchical Clustering: Group Average

(Figures: nested clusters and the corresponding dendrogram)

Hierarchical Clustering: Group Average

- Compromise between single and complete link
- Strengths:
  - Less susceptible to noise and outliers
- Limitations:
  - Biased towards globular clusters

Hierarchical Clustering: Comparison

(Figure: the same dataset clustered with MIN, MAX, and group average linkage)
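A minimal sketch, assuming SciPy, that reproduces this kind of comparison by cutting each linkage's dendrogram into the same number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(60, 2))   # hypothetical data

for method in ("single", "complete", "average"):    # MIN, MAX, group average
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust") # cut each dendrogram at 3 clusters
    print(method, np.bincount(labels)[1:])          # cluster sizes differ by linkage
```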

Hierarchical Clustering: Problems and Limitations

- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers
  - Difficulty handling different-sized clusters and convex shapes
  - Breaking large clusters

DIANA (Divisive Analysis)

- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own


MST Divisive Hierarchical Clustering

- Build an MST (Minimum Spanning Tree)
  - Start with a tree that consists of any point
  - In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  - Add q to the tree and put an edge between p and q

MST Divisive Hierarchical Clustering

- Use the MST for constructing the hierarchy of clusters (a sketch follows)
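A minimal sketch, assuming SciPy: build the MST of the pairwise-distance graph, then split divisively by deleting the longest edges. This is one standard reading of MST-based divisive clustering, not code from the slides:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_divisive(X, k):
    """Split X into k clusters by deleting the k-1 longest MST edges."""
    D = squareform(pdist(X))                     # pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()     # n-1 weighted edges
    flat = np.argsort(mst, axis=None)            # ascending; zeros come first
    for e in flat[::-1][:k - 1]:                 # drop the k-1 longest edges,
        mst[np.unravel_index(e, mst.shape)] = 0  # leaving k connected components
    _, labels = connected_components(mst, directed=False)
    return labels

X = np.array([[0, 0], [0.2, 0.1], [1, 1], [1.2, 0.9], [5, 5], [5.1, 5.2]])
print(mst_divisive(X, k=3))   # the three tight pairs end up in separate clusters
```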

Extensions to Hierarchical Clustering

- Major weaknesses of agglomerative clustering methods:
  - Do not scale well: time complexity of at least O(n²), where n is the total number of objects
  - Can never undo what was done previously
- Integration of hierarchical and distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - ROCK (1999): clusters categorical data by neighbor and link analysis
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling


Model-Based Clustering

- What is model-based clustering?
  - Attempts to optimize the fit between the given data and some mathematical model
  - Based on the assumption that data are generated by a mixture of underlying probability distributions
- Typical methods:
  - Statistical approach: EM (Expectation Maximization), AutoClass
  - Machine learning approach: COBWEB, CLASSIT
  - Neural network approach: SOM (Self-Organizing Feature Map)


EM: Expectation Maximization

- EM: a popular iterative refinement algorithm
  - An extension of k-means
    - Assigns each object to a cluster according to a weight (probability distribution)
    - New means are computed based on the weighted measures
- General idea:
  - Starts with an initial estimate of the parameter vector
  - Iteratively rescores the patterns against the mixture density produced by the parameter vector
  - The rescored patterns are then used to update the parameter estimates
  - Patterns belong to the same cluster when their scores place them in the same mixture component
- The algorithm converges fast but may not reach the global optimum (see the sketch below)
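A minimal sketch, assuming scikit-learn, of EM fitting a Gaussian mixture (the classic instance of this idea; the slides do not prescribe a particular implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical data: two Gaussian blobs
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_init restarts EM several times, mitigating convergence to a local optimum
gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
print(gm.means_)                # estimated cluster means
print(gm.predict_proba(X[:3]))  # soft (weighted) assignments, unlike k-means
```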


The EM (Expectation Maximization) Algorithm

- Initially, randomly assign k cluster centers
- Iteratively refine the clusters based on two steps:
  - Expectation step: assign each data point X_i to cluster C_k with the probability given below
  - Maximization step: estimation of the model parameters
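The slide's probability and parameter formulas were images and are missing; the standard Gaussian-mixture forms (a reconstruction, not taken from the slide) are:

$$P(X_i \in C_k) = P(C_k \mid X_i) = \frac{P(C_k)\, p(X_i \mid C_k)}{p(X_i)}$$

and, for the maximization step, the cluster means (priors and covariances are re-estimated analogously):

$$m_k = \frac{\sum_{i=1}^{N} X_i \, P(X_i \in C_k)}{\sum_{i=1}^{N} P(X_i \in C_k)}$$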


Conceptual Clustering

- Conceptual clustering
  - A form of clustering in machine learning
  - Produces a classification scheme for a set of unlabeled objects
  - Finds a characteristic description for each concept (class)
- COBWEB
  - A popular and simple method of incremental conceptual learning
  - Creates a hierarchical clustering in the form of a classification tree
  - Each node refers to a concept and contains a probabilistic description of that concept


COBWEB Clustering Method

(Figure: a classification tree)


More on Conceptual Clustering

- Limitations of COBWEB
  - The assumption that the attributes are independent of each other is often too strong, because correlations may exist
  - Not suitable for clustering large database data: skewed tree and expensive probability distributions
- CLASSIT
  - An extension of COBWEB for incremental clustering of continuous data
  - Suffers from problems similar to COBWEB's
- AutoClass
  - Uses Bayesian statistical analysis to estimate the number of clusters
  - Popular in industry


Neural Network Approach

- Neural network approaches
  - Represent each cluster as an exemplar, acting as a prototype of the cluster
  - New objects are distributed to the cluster whose exemplar is the most similar, according to some distance measure
- Typical methods
  - SOM (Self-Organizing Feature Map)
  - Competitive learning
    - Involves a hierarchical architecture of several units (neurons)
    - Neurons compete in a winner-takes-all fashion for the object currently being presented


Self-Organizing Feature Map (SOM)

- SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
- A SOM maps all the points in a high-dimensional source space onto a 2- to 3-d target space such that the distance and proximity relationships (i.e., the topology) are preserved as much as possible
- Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
- Clustering is performed by having several units compete for the current object (see the sketch below)
  - The unit whose weight vector is closest to the current object wins
  - The winner and its neighbors learn by having their weights adjusted
- SOMs are believed to resemble processing that can occur in the brain
- Useful for visualizing high-dimensional data in 2- or 3-D space
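A minimal from-scratch sketch of this competitive-learning update, assuming a 2-D grid of units, a Gaussian neighborhood, and decaying learning rate and radius (all standard choices, not prescribed by the slides):

```python
import numpy as np

def train_som(X, grid=(8, 8), iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Tiny SOM: a 2-D grid of units whose weights self-organize toward X."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(grid[0] * grid[1], X.shape[1]))   # one weight vector per unit
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(iters):
        x = X[rng.integers(len(X))]                        # present one object
        winner = np.argmin(((W - x) ** 2).sum(axis=1))     # closest unit wins
        frac = 1.0 - t / iters                             # decay schedules
        lr, sigma = lr0 * frac, sigma0 * frac + 0.5
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)  # grid distance to winner
        h = np.exp(-d2 / (2 * sigma ** 2))                 # Gaussian neighborhood
        W += lr * h[:, None] * (x - W)                     # winner and neighbors learn
    return W.reshape(grid[0], grid[1], -1)

X = np.random.default_rng(1).normal(size=(500, 5))         # hypothetical 5-D data
print(train_som(X).shape)   # (8, 8, 5): a 2-D map of the 5-D input space
```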


Web Document Clustering Using SOM

- The result of SOM clustering of 12,088 Web articles
- The picture on the right: drilling down on the keyword "mining"
- Based on the websom.hut.fi Web page


User-Guided Clustering

- User usually has a goal of clustering, e.g., clustering students by research area
- User specifies his clustering goal to CrossClus


Comparing with Classification

- The user-specified feature (in the form of an attribute) is used as a hint, not as class labels
  - The attribute may contain too many or too few distinct values, e.g., a user may want to cluster students into 20 clusters instead of 3
  - Additional features need to be included in the cluster analysis

(Figure: the user-hint attribute shown alongside all tuples for clustering)


Comparing with Semi-Supervised Clustering

- Semi-supervised clustering: the user provides a training set consisting of similar (must-link) and dissimilar (cannot-link) pairs of objects
- User-guided clustering: the user specifies an attribute as a hint, and more relevant features are found for clustering

(Figure: user-guided clustering with an attribute hint vs. semi-supervised clustering with pairwise constraints, both over all tuples for clustering)


Why Not Semi-Supervised Clustering?

- Much information (in multiple relations) is needed to judge whether two tuples are similar
- A user may not be able to provide a good training set
- It is much easier for a user to specify an attribute as a hint, such as a student's research area

(Example tuples to be compared: "Tom Smith, SC1211, TA" and "Jane Chang, BI205, RA")


CrossClus: An Overview

- Measures similarity between features by how they group objects into clusters
- Uses a heuristic method to search for pertinent features
  - Starts from the user-specified feature and gradually expands the search range
- Uses tuple ID propagation to create feature values
  - Features can be easily created during the expansion of the search range by propagating IDs
- Explores three clustering algorithms: k-means, k-medoids, and hierarchical clustering


Multi-Relational Features

- A multi-relational feature is defined by:
  - A join path, e.g., Student → Register → OpenCourse → Course
  - An attribute, e.g., Course.area
  - (For a numerical feature) an aggregation operator, e.g., sum or average
- Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]: the areas of the courses of each student

Areas of courses per student, and the resulting values of feature f:

| Tuple | DB | AI | TH |
|-------|----|----|----|
| t1    | 5  | 5  | 0  |
| t2    | 0  | 3  | 7  |
| t3    | 1  | 5  | 4  |
| t4    | 5  | 0  | 5  |
| t5    | 3  | 3  | 4  |

| Tuple | f: DB | f: AI | f: TH |
|-------|-------|-------|-------|
| t1    | 0.5   | 0.5   | 0     |
| t2    | 0     | 0.3   | 0.7   |
| t3    | 0.1   | 0.5   | 0.4   |
| t4    | 0.5   | 0     | 0.5   |
| t5    | 0.3   | 0.3   | 0.4   |
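A minimal sketch of how the second table follows from the first: each student's course-area counts are normalized to sum to 1 (assumed from the numbers shown):

```python
import numpy as np

counts = np.array([[5, 5, 0],    # t1: number of courses per area (DB, AI, TH)
                   [0, 3, 7],    # t2
                   [1, 5, 4],    # t3
                   [5, 0, 5],    # t4
                   [3, 3, 4]])   # t5

f = counts / counts.sum(axis=1, keepdims=True)   # values of feature f
print(f)   # rows match the second table, e.g., t1 -> [0.5, 0.5, 0.0]
```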


Representing Features

- Similarity between tuples t1 and t2 w.r.t. a categorical feature f:
  - Cosine similarity between the vectors f(t1) and f(t2)
- Similarity vector V^f:
  - The most important information of a feature f is how f groups tuples into clusters
  - f is represented by the similarities between every pair of tuples indicated by f
  - The horizontal axes are the tuple indices, and the vertical axis is the similarity
  - This can be considered a vector of N × N dimensions (see the sketch below)
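A minimal sketch computing the tuple-tuple cosine similarities and stacking them into the N × N representation V^f; the variable names are illustrative:

```python
import numpy as np

f = np.array([[0.5, 0.5, 0.0],   # feature f values per tuple (from the table above)
              [0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])

fn = f / np.linalg.norm(f, axis=1, keepdims=True)  # unit-length rows
V_f = fn @ fn.T               # cosine similarity for every pair of tuples
print(V_f.shape)              # (5, 5): the N x N similarity "vector" for f
print(round(V_f[0, 1], 3))    # sim(t1, t2) w.r.t. f
```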


Similarity Between Features

Values of features f and g:

| Tuple | f: DB | f: AI | f: TH | g: Info sys | g: Cog sci | g: Theory |
|-------|-------|-------|-------|-------------|------------|-----------|
| t1    | 0.5   | 0.5   | 0     | 1           | 0          | 0         |
| t2    | 0     | 0.3   | 0.7   | 0           | 0          | 1         |
| t3    | 0.1   | 0.5   | 0.4   | 0           | 0.5        | 0.5       |
| t4    | 0.5   | 0     | 0.5   | 0.5         | 0          | 0.5       |
| t5    | 0.3   | 0.3   | 0.4   | 0.5         | 0.5        | 0         |

- Similarity between two features: cosine similarity of the two vectors V^f and V^g


Computing Feature Similarity

- Tuple similarities are hard to compute directly, but feature-value similarities are easy to compute
- Similarity between feature values w.r.t. the tuples:

$$\mathrm{sim}(f_k, g_q) = \sum_{i=1}^{N} f(t_i).p_k \cdot g(t_i).p_q$$

- Compute the similarity between each pair of feature values with one scan over the data

(Figure: links between the values of feature f (DB, AI, TH) and the values of feature g (Info sys, Cog sci, Theory))
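A minimal sketch of why this helps, normalizing each tuple's vector so that tuple similarity is cosine (the exact normalization in CrossClus may differ): the value-to-value similarities are one matrix product over the tuples, and the inner product of the two N × N vectors then falls out without ever materializing them.

```python
import numpy as np

f = np.array([[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]])   # feature f (DB, AI, TH)
g = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])   # feature g (Info sys, Cog sci, Theory)

fn = f / np.linalg.norm(f, axis=1, keepdims=True)  # cosine-normalized rows
gn = g / np.linalg.norm(g, axis=1, keepdims=True)

# sim(f_k, g_q) = sum_i f(t_i).p_k * g(t_i).p_q -- one pass over the tuples
sim_values = fn.T @ gn                             # 3 x 3 feature-value similarities

V_f, V_g = fn @ fn.T, gn @ gn.T                    # the expensive N x N route
assert np.isclose((sim_values ** 2).sum(), (V_f * V_g).sum())   # same inner product
print(sim_values)
```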


Searching for Pertinent Features

- Different features convey different aspects of information
- Features conveying the same aspect of information usually cluster tuples in more similar ways
  - E.g., research group areas vs. conferences of publications
- Given the user-specified feature, find pertinent features by computing feature similarity


Heuristic Search for Pertinent Features

- Overall procedure:
- 1. Start from the user-specified feature
- 2. Search in the neighborhood of existing pertinent features
- 3. Expand the search range gradually

(Figure: search steps 1 and 2 expanding outward from the user hint toward the target of clustering)

- Tuple ID propagation is used to create multi-relational features
  - IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple


Summary

- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- There are still many open research issues in cluster analysis


- Continued in Meeting 13:
- Applications and Trends in Data Mining