Cluster Analysis (cont.) Pertemuan 12

1
(No Transcript)
2
Cluster Analysis (cont.) Pertemuan 12
Course: M0614 / Data Mining & OLAP
Year: Feb 2010
3
Learning Outcomes
  • By the end of this session, students are
    expected to be able to:
  • Apply partitioning, hierarchical, and model-based
    clustering analysis techniques in data mining. (C3)

4
Acknowledgments
  • These slides have been adapted from Han, J.,
    Kamber, M., and Pei, J., Data Mining: Concepts and
    Techniques, and Tan, P.-N., Steinbach, M., and
    Kumar, V., Introduction to Data Mining.

5
Outline Materi
  • A categorization of major clustering methods:
    hierarchical methods
  • A categorization of major clustering methods:
    model-based clustering methods
  • Summary

6
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

7
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level (see
    the sketch below)
  • They may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., the
    animal kingdom, phylogeny reconstruction)
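A minimal sketch of cutting a dendrogram at different levels (Python with SciPy; the toy data and linkage method are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)            # toy data: 20 points in 2-D
    Z = linkage(X, method="average")     # build the merge hierarchy once
    labels_k4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
    labels_k2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters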

8
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

9
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

10
AGNES (Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical packages, e.g., Splus
  • Use the Single-Link method and the dissimilarity
    matrix
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

11
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms (a from-scratch sketch of the basic
    loop follows below)
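A from-scratch sketch of this basic loop, assuming single-link (MIN) proximity and Euclidean distance (toy data; names are illustrative):

    import numpy as np

    def agglomerative(X, k):
        # Let each data point be a cluster.
        clusters = [[i] for i in range(len(X))]
        # Compute the proximity matrix (pairwise point distances).
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # Repeat: merge the two closest clusters until only k remain.
        while len(clusters) > k:
            best = (0, 1, np.inf)
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # Single link: proximity = closest pair of points.
                    d = D[np.ix_(clusters[a], clusters[b])].min()
                    if d < best[2]:
                        best = (a, b, d)
            a, b, _ = best
            clusters[a] += clusters.pop(b)   # merge b into a
        return clusters

    X = np.random.rand(12, 2)
    print(agglomerative(X, k=3))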

12
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

[Figure: individual points and the initial proximity matrix]
13
Intermediate Situation
  • After some merging steps, we have some clusters

[Figure: clusters C1–C5 and the current proximity matrix]
14
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

[Figure: clusters C1–C5; C2 and C5 are the closest pair in the proximity matrix]
15
After Merging
  • After merging C2 and C5, the question is: how do
    we update the proximity matrix? (a single-link
    update is sketched below)

[Figure: proximity matrix over C1, C2 ∪ C5, C3, and C4; entries in the C2 ∪ C5 row and column are marked "?"]
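For single link, the update is simple: the merged cluster's distance to every other cluster is the element-wise minimum of the two old rows (complete link would use the maximum). A hedged NumPy sketch:

    import numpy as np

    def merge_update(D, i, j):
        # Merge cluster j into cluster i in proximity matrix D (single
        # link); pass i < j so i's index is unchanged after deletion.
        D[i, :] = np.minimum(D[i, :], D[j, :])   # distances to C_i ∪ C_j
        D[:, i] = D[i, :]                        # keep D symmetric
        D[i, i] = 0.0
        # Drop row and column j now that C_j no longer exists.
        return np.delete(np.delete(D, j, axis=0), j, axis=1)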
16
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function,
    e.g., Ward's method (uses squared error)

[Figure: two clusters and their proximity matrix]
21
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters
  • Determined by one pair of points, i.e., by one
    link in the proximity graph (formal definition
    below)
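In standard notation, with dist(p, q) the point-level distance:

    d_{\min}(C_i, C_j) \;=\; \min_{p \in C_i,\, q \in C_j} \mathrm{dist}(p, q)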

22
Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram]
23
Strength of MIN
[Figure: original points and the resulting clusters]
  • Can handle non-elliptical shapes

24
Limitations of MIN
[Figure: original points and the resulting clusters]
  • Sensitive to noise and outliers

25
Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Determined by all pairs of points in the two
    clusters (formal definition below)
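In standard notation:

    d_{\max}(C_i, C_j) \;=\; \max_{p \in C_i,\, q \in C_j} \mathrm{dist}(p, q)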

26
Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]
27
Strength of MAX
[Figure: original points and the resulting clusters]
  • Less susceptible to noise and outliers

28
Limitations of MAX
[Figure: original points and the resulting clusters]
  • Tends to break large clusters
  • Biased towards globular clusters

29
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of
    pairwise proximity between points in the two
    clusters (formal definition below)
  • Need to use average connectivity for scalability,
    since total proximity favors large clusters
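In standard notation:

    d_{\mathrm{avg}}(C_i, C_j) \;=\; \frac{1}{|C_i|\,|C_j|}
        \sum_{p \in C_i} \sum_{q \in C_j} \mathrm{dist}(p, q)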

30
Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]
31
Hierarchical Clustering: Group Average
  • Compromise between Single and Complete Link
  • Strengths
  • Less susceptible to noise and outliers
  • Limitations
  • Biased towards globular clusters

32
Hierarchical Clustering: Comparison
[Figure: the same data set clustered with MIN, MAX, and Group Average]
33
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters,
    it cannot be undone
  • No objective function is directly minimized
  • Different schemes have problems with one or more
    of the following
  • Sensitivity to noise and outliers
  • Difficulty handling different-sized clusters and
    convex shapes
  • Breaking large clusters

34
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

35
MST Divisive Hierarchical Clustering
  • Build MST (Minimum Spanning Tree)
  • Start with a tree that consists of any point
  • In successive steps, look for the closest pair of
    points (p, q) such that one point (p) is in the
    current tree but the other (q) is not
  • Add q to the tree and put an edge between p and q
    (a sketch of this procedure follows below)
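A from-scratch sketch of this MST construction (Prim's algorithm; toy data, written for clarity rather than efficiency):

    import numpy as np

    def mst_edges(X):
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        in_tree = {0}                  # start with a tree of any one point
        edges = []
        while len(in_tree) < n:
            # Closest pair (p, q): p in the current tree, q outside it.
            best = (None, None, np.inf)
            for p in in_tree:
                for q in range(n):
                    if q not in in_tree and D[p, q] < best[2]:
                        best = (p, q, D[p, q])
            p, q, d = best
            edges.append((p, q, d))    # add q and the edge between p and q
            in_tree.add(q)
        return edges

    # Divisive step (sketch): repeatedly delete the longest remaining MST
    # edge; the resulting connected components are the clusters.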

36
MST Divisive Hierarchical Clustering
  • Use MST for constructing hierarchy of clusters

37
Extensions to Hierarchical Clustering
  • Major weaknesses of agglomerative clustering
    methods
  • Do not scale well: time complexity of at least
    O(n²), where n is the total number of objects
  • Can never undo what was done previously
  • Integration of hierarchical and distance-based
    clustering
  • BIRCH (1996): uses a CF-tree and incrementally
    adjusts the quality of sub-clusters (a sketch
    follows below)
  • ROCK (1999): clusters categorical data by
    neighbor and link analysis
  • CHAMELEON (1999): hierarchical clustering using
    dynamic modeling
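As one concrete example of these extensions, scikit-learn ships a BIRCH implementation; a minimal sketch with illustrative (untuned) parameter values:

    import numpy as np
    from sklearn.cluster import Birch

    X = np.random.rand(1000, 2)
    model = Birch(threshold=0.1, n_clusters=5)  # CF-tree, then final clustering
    labels = model.fit_predict(X)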

38
Model-Based Clustering
  • What is model-based clustering?
  • Attempt to optimize the fit between the given
    data and some mathematical model
  • Based on the assumption that data are generated
    by a mixture of underlying probability
    distributions
  • Typical methods
  • Statistical approach
  • EM (Expectation maximization), AutoClass
  • Machine learning approach
  • COBWEB, CLASSIT
  • Neural network approach
  • SOM (Self-Organizing Feature Map)

39
EM: Expectation Maximization
  • EM: a popular iterative refinement algorithm
  • An extension to k-means
  • Assigns each object to a cluster according to a
    weight (probability distribution)
  • New means are computed based on weighted measures
  • General idea
  • Starts with an initial estimate of the parameter
    vector
  • Iteratively rescores the patterns against the
    mixture density produced by the parameter vector
  • The rescored patterns are then used to update the
    parameter estimates
  • Patterns belong to the same cluster if their
    scores place them in the same mixture component
  • The algorithm converges quickly but may not reach
    the global optimum (a soft-assignment sketch
    follows below)
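A sketch of this soft assignment using scikit-learn's GaussianMixture, which runs EM internally (toy data; parameter values are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(500, 2)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    weights = gmm.predict_proba(X)  # per-cluster weights (soft assignment)
    labels = gmm.predict(X)         # hard assignment, as in k-means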

40
The EM (Expectation Maximization) Algorithm
  • Initially, randomly assign k cluster centers
  • Iteratively refine the clusters based on two
    steps
  • Expectation step: assign each data point Xi to
    each cluster Ck with the probability given below
  • Maximization step
  • Estimation of the model parameters
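The formulas on this slide are not preserved in the transcript; standard Gaussian-mixture forms of the two steps (an assumption, not the original figures) are:

    % E-step: probability that point x_i belongs to cluster C_k
    P(x_i \in C_k) \;=\; p(C_k \mid x_i)
                   \;=\; \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}

    % M-step: re-estimate each cluster mean from the soft assignments
    m_k \;=\; \frac{\sum_{i=1}^{N} P(x_i \in C_k)\, x_i}
                   {\sum_{i=1}^{N} P(x_i \in C_k)}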

41
Conceptual Clustering
  • Conceptual clustering
  • A form of clustering in machine learning
  • Produces a classification scheme for a set of
    unlabeled objects
  • Finds characteristic description for each concept
    (class)
  • COBWEB
  • A popular and simple method of incremental
    conceptual learning
  • Creates a hierarchical clustering in the form of
    a classification tree
  • Each node refers to a concept and contains a
    probabilistic description of that concept

42
COBWEB Clustering Method
[Figure: a classification tree]
43
More on Conceptual Clustering
  • Limitations of COBWEB
  • The assumption that the attributes are
    independent of each other is often too strong
    because correlation may exist
  • Not suitable for clustering large databases:
    skewed trees and expensive probability
    distributions
  • CLASSIT
  • an extension of COBWEB for incremental clustering
    of continuous data
  • suffers similar problems as COBWEB
  • AutoClass
  • Uses Bayesian statistical analysis to estimate
    the number of clusters
  • Popular in industry

44
Neural Network Approach
  • Neural network approaches
  • Represent each cluster as an exemplar, acting as
    a prototype of the cluster
  • New objects are distributed to the cluster whose
    exemplar is the most similar according to some
    distance measure
  • Typical methods
  • SOM (Self-Organizing Feature Map)
  • Competitive learning
  • Involves a hierarchical architecture of several
    units (neurons)
  • Neurons compete in a winner-takes-all fashion
    for the object currently being presented

45
Self-Organizing Feature Map (SOM)
  • SOMs, also called topologically ordered maps or
    Kohonen Self-Organizing Feature Maps (KSOMs)
  • Maps all the points in a high-dimensional source
    space onto a 2- or 3-D target space, such that
    distance and proximity relationships (i.e.,
    topology) are preserved as much as possible
  • Similar to k-means: cluster centers tend to lie
    in a low-dimensional manifold in the feature
    space
  • Clustering is performed by having several units
    compete for the current object
  • The unit whose weight vector is closest to the
    current object wins
  • The winner and its neighbors learn by having
    their weights adjusted (a sketch of this update
    follows below)
  • SOMs are believed to resemble processing that can
    occur in the brain
  • Useful for visualizing high-dimensional data in
    2- or 3-D space
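A from-scratch sketch of the winner-takes-all update (map size, learning rate, and neighborhood width are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    grid = 10
    W = rng.random((grid, grid, 3))   # one weight vector per map unit
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)

    def som_step(W, x, lr=0.1, sigma=2.0):
        # Winner: the unit whose weight vector is closest to the object x.
        d = np.linalg.norm(W - x, axis=-1)
        winner = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighborhood: units near the winner learn more.
        g = np.linalg.norm(coords - np.array(winner), axis=-1)
        h = np.exp(-(g ** 2) / (2 * sigma ** 2))[..., None]
        # The winner and its neighbors move toward the input.
        return W + lr * h * (x - W)

    for x in rng.random((1000, 3)):   # stream of objects being presented
        W = som_step(W, x)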

46
Web Document Clustering Using SOM
  • The result of SOM clustering of 12,088 Web
    articles
  • The picture on the right: drilling down on the
    keyword "mining"
  • Based on the websom.hut.fi Web page

47
User-Guided Clustering
  • The user usually has a clustering goal, e.g.,
    clustering students by research area
  • The user specifies this clustering goal to
    CrossClus

48
Comparing with Classification
  • A user-specified feature (in the form of an
    attribute) is used as a hint, not as class labels
  • The attribute may contain too many or too few
    distinct values, e.g., a user may want to cluster
    students into 20 clusters instead of 3
  • Additional features need to be included in
    cluster analysis

[Figure: the user hint (one attribute) highlighted among all tuples for clustering]
49
Comparing with Semi-Supervised Clustering
  • Semi-supervised clustering: the user provides a
    training set consisting of similar (must-link)
    and dissimilar (cannot-link) pairs of objects
  • User-guided clustering: the user specifies an
    attribute as a hint, and more relevant features
    are found for clustering

[Figure: user-guided clustering (attribute hint) vs. semi-supervised clustering (must-link and cannot-link pairs), each applied to all tuples for clustering]
50
Why Not Semi-Supervised Clustering?
  • Much information (in multiple relations) is
    needed to judge whether two tuples are similar
  • A user may not be able to provide a good training
    set
  • It is much easier for a user to specify an
    attribute as a hint, such as a student's research
    area

Tuples to be compared:
  Tom Smith   SC1211  TA
  Jane Chang  BI205   RA
51
CrossClus: An Overview
  • Measure similarity between features by how they
    group objects into clusters
  • Use a heuristic method to search for pertinent
    features
  • Start from user-specified feature and gradually
    expand search range
  • Use tuple ID propagation to create feature values
  • Features can be easily created during the
    expansion of search range, by propagating IDs
  • Explore three clustering algorithms: k-means,
    k-medoids, and hierarchical clustering

52
Multi-Relational Features
  • A multi-relational feature is defined by
  • A join path, e.g., Student → Register →
    OpenCourse → Course
  • An attribute, e.g., Course.area
  • (For a numerical feature) an aggregation
    operator, e.g., sum or average
  • Categorical feature f = [Student → Register →
    OpenCourse → Course, Course.area, null]

Areas of courses of each student:

  Tuple   DB   AI   TH
  t1       5    5    0
  t2       0    3    7
  t3       1    5    4
  t4       5    0    5
  t5       3    3    4

Values of feature f:

  Tuple   DB   AI   TH
  t1     0.5  0.5    0
  t2       0  0.3  0.7
  t3     0.1  0.5  0.4
  t4     0.5    0  0.5
  t5     0.3  0.3  0.4
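Feature f's values are each tuple's distribution over the attribute values, i.e., the counts above normalized to proportions. A minimal sketch (Python/NumPy; array names are illustrative):

    import numpy as np

    # Areas of courses per student (rows t1..t5; columns DB, AI, TH).
    counts = np.array([[5, 5, 0],
                       [0, 3, 7],
                       [1, 5, 4],
                       [5, 0, 5],
                       [3, 3, 4]], dtype=float)

    # Values of feature f: normalize each row to sum to 1.
    f = counts / counts.sum(axis=1, keepdims=True)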
53
Representing Features
  • Similarity between tuples t1 and t2 w.r.t. a
    categorical feature f
  • Cosine similarity between vectors f(t1) and f(t2)
    (a standard form is given below)
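A standard form of this cosine similarity (assumed; the slide's own formula is an image):

    \mathrm{sim}_f(t_1, t_2) \;=\;
        \frac{f(t_1) \cdot f(t_2)}{\lVert f(t_1) \rVert \,\lVert f(t_2) \rVert}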

Similarity vector Vf
  • The most important information of a feature f is
    how f groups tuples into clusters
  • f is represented by the similarities between
    every pair of tuples indicated by f
  • The horizontal axes are the tuple indices, and
    the vertical axis is the similarity
  • This can be considered a vector of N × N
    dimensions

54
Similarity Between Features
Values of features f and g:

  Tuple   Feature f (course)     Feature g (group)
          DB    AI    TH         Info sys  Cog sci  Theory
  t1      0.5   0.5   0          1         0        0
  t2      0     0.3   0.7        0         0        1
  t3      0.1   0.5   0.4        0         0.5      0.5
  t4      0.5   0     0.5        0.5       0        0.5
  t5      0.3   0.3   0.4        0.5       0.5      0

Similarity between the two features: the cosine
similarity of their vectors Vf and Vg
55
Computing Feature Similarity
Similarity between feature values w.r.t. the tuples:

  sim(f_k, g_q) = Σ_{i=1..N} f(t_i).p_k · g(t_i).p_q

[Figure: tuple similarities (hard to compute) linked to similarities between the values of f (DB, AI, TH) and g (Info sys, Cog sci, Theory) (easy to compute)]

Compute the similarity between each pair of feature
values with one scan over the data (sketched below).
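Under the formula above, all value-pair similarities can be computed at once as a matrix product over one pass of the data. A hedged sketch (Python/NumPy), reusing the tables from the previous slide:

    import numpy as np

    # Rows t1..t5; columns DB, AI, TH.
    f = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.3, 0.7],
                  [0.1, 0.5, 0.4],
                  [0.5, 0.0, 0.5],
                  [0.3, 0.3, 0.4]])
    # Rows t1..t5; columns Info sys, Cog sci, Theory.
    g = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])

    # sim(f_k, g_q) = sum_i f(t_i).p_k * g(t_i).p_q, for all (k, q) at once.
    sim = f.T @ g   # 3 x 3 matrix of feature-value similarities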
56
Searching for Pertinent Features
  • Different features convey different aspects of
    information
  • Features conveying the same aspect of information
    usually cluster tuples in more similar ways
  • e.g., research group areas vs. conferences of
    publications
  • Given a user-specified feature
  • Find pertinent features by computing feature
    similarity

57
Heuristic Search for Pertinent Features
  • Overall procedure
  • 1. Start from the user-specified feature
  • 2. Search in neighborhood of existing pertinent
    features
  • 3. Expand search range gradually

[Figure: the search range expands in steps (1, 2) from the user hint toward the target of clustering]
  • Tuple ID propagation is used to create
    multi-relational features
  • IDs of target tuples can be propagated along any
    join path, from which we can find tuples joinable
    with each target tuple

58
Summary
  • Cluster analysis groups objects based on their
    similarity and has wide applications
  • Measure of similarity can be computed for various
    types of data
  • Clustering algorithms can be categorized into
    partitioning methods, hierarchical methods,
    density-based methods, grid-based methods, and
    model-based methods
  • There are still lots of research issues on
    cluster analysis

59
  • Continued in Pertemuan 13
  • Applications and Trends in Data Mining