CSE634: DATA CLUSTERING METHODS
1
CSE634 DATA CLUSTERING METHODS
Group 9
  • Karthik Anandh Govindaraj (105845335)
  • Shashank Viswanadha (105955553)
  • Praveen Durairaj (105948340)
  • Ravikanth Pulavarthy (105227609)

2
CSE634 DATA CLUSTERING METHODS
Group 9
  • Karthik Anandh Govindaraj
  • (karthikanandh@gmail.com)

3
References
  • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X.:
    A Density-Based Algorithm for Discovering
    Clusters in Large Spatial Databases with Noise.
    In Proc. 2nd International Conference on
    Knowledge Discovery and Data Mining (KDD'96),
    pages 226-231, 1996.
  • Ankerst, M., Breunig, M. M., Kriegel, H.-P., and
    Sander, J.: OPTICS: Ordering Points to Identify
    the Clustering Structure. In Proc. ACM SIGMOD
    Int. Conf. on Management of Data (SIGMOD'99),
    pages 49-60, 1999.
  • Hinneburg, A. and Keim, D. A.: An Efficient
    Approach to Clustering in Large Multimedia
    Databases with Noise. In Proc. 4th Int. Conf. on
    Knowledge Discovery and Data Mining (KDD'98),
    AAAI Press, pages 58-65, 1998.

4
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to the objects in the same cluster
    (intraclass similarity)
  • Dissimilar to the objects in other clusters
    (interclass dissimilarity)
  • Cluster analysis: a statistical method for
    grouping a set of data objects into clusters
  • A good clustering method produces high-quality
    clusters with high intraclass similarity and low
    interclass similarity
  • Clustering is unsupervised classification

5
Clustering methods
  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Grid-based methods

6
Issues - Large Spatial Databases
  • Minimal requirements of domain knowledge to
    determine the input parameters
  • Discovery of clusters with arbitrary shape
  • Good efficiency on large databases

7
Density-Based Clustering Methods
  • Clustering based on density (a local cluster
    criterion), such as density-connected points
  • Features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg and Keim (KDD'98)

8
Density-Based Clustering Definitions
  • Parameters
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an
    Eps-neighbourhood of that point
  • N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}

9
Definitions
  • Directly density-reachable: A point p is directly
    density-reachable from a point q wrt. Eps, MinPts
    if
  • 1) p belongs to N_Eps(q)
  • 2) core point condition:
  • |N_Eps(q)| ≥ MinPts

10
Contd.
  • Density-reachable
  • A point p is density-reachable from a point q
    wrt. Eps, MinPts if there is a chain of points
    p1, ..., pn with p1 = q and pn = p such that
    p(i+1) is directly density-reachable from pi
  • Density-connected
  • A point p is density-connected to a point q wrt.
    Eps, MinPts if there is a point o such that both
    p and q are density-reachable from o wrt. Eps and
    MinPts.

(Figure: p is density-reachable from q via the intermediate point p1)
11
Density Based Cluster Definition
  • Cluster: a maximal set of density-connected
    points
  • A cluster C is a subset of D satisfying:
  • For all p, q: if p is in C and q is
    density-reachable from p, then q is also in C
  • For all p, q in C: p is density-connected to q

12
Contd.
  • Lemma 1: If p is a core point and O is the set
    of points density-reachable from p, then O is a
    cluster
  • Lemma 2: Let C be a cluster and p be any core
    point of C; then C equals the set of points
    density-reachable from p
  • Implication: finding the points density-reachable
    from an arbitrary core point generates a cluster.
    A cluster is uniquely determined by any of its
    core points

13
DBSCAN The Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p wrt.
    Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed.
  • Complexity: O(kN²)
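
A minimal Python sketch of this procedure (not from the original
slides), assuming Euclidean distance and a brute-force neighbourhood
query; the helper region_query and all names are illustrative:

import numpy as np

def region_query(data, p, eps):
    # Eps-neighbourhood of point p: indices of all points within eps.
    return np.where(np.linalg.norm(data - data[p], axis=1) <= eps)[0]

def dbscan(data, eps, min_pts):
    labels = np.full(len(data), -1)        # -1 marks noise / unassigned
    visited = np.zeros(len(data), dtype=bool)
    cluster_id = 0
    for p in range(len(data)):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region_query(data, p, eps)
        if len(neighbours) < min_pts:      # border point or noise: skip
            continue
        labels[p] = cluster_id             # p is a core point
        seeds = list(neighbours)
        while seeds:                       # expand to all density-reachable points
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(data, q, eps)
                if len(q_neighbours) >= min_pts:   # q is also a core point
                    seeds.extend(q_neighbours)
            if labels[q] == -1:
                labels[q] = cluster_id
        cluster_id += 1
    return labels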

14
OPTICS A Cluster-Ordering Method
  • OPTICS: Ordering Points To Identify the
    Clustering Structure
  • Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  • Produces a special order of the database wrt. its
    density-based clustering structure
  • This cluster ordering contains information
    equivalent to the density-based clusterings
    corresponding to a broad range of parameter
    settings
  • Good for both automatic and interactive cluster
    analysis, including finding intrinsic clustering
    structure
  • Can be represented graphically or using
    visualization techniques

15
Core- and Reachability Distance
  • Parameters generating distance e, fixed value
    MinPts
  • core-distancee,MinPts(o)
  • smallest distance such that o is a core object
  • (if that distance is e ? otherwise)
  • reachability-distancee,MinPts(p, o)
  • smallest distance such that p is
  • directly density-reachable from o
  • (if that distance is e ? otherwise)

16
The Algorithm OPTICS
  • foreach o ∈ Database
  • // initially, o.processed = false for all
    objects o
  • if o.processed = false
  • insert (o, ∞) into ControlList
  • while ControlList is not empty
  • select first element (o, r_dist) from
    ControlList
  • retrieve N_ε(o) and determine c_dist =
    core-distance(o)
  • set o.processed = true
  • write (o, r_dist, c_dist) to file
  • if o is a core object at any distance ≤ ε

17
Contd..
  • foreach p ∈ N_ε(o) not yet processed
  • determine r_dist_p = reachability-distance(p, o)
  • if (p, _) ∉ ControlList
  • insert (p, r_dist_p) into ControlList
  • else if (p, old_r_dist) ∈ ControlList and
    r_dist_p < old_r_dist
  • update (p, r_dist_p) in ControlList
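
For comparison, a short usage sketch (not from the slides) with
scikit-learn's OPTICS implementation, assuming scikit-learn is
available; min_samples plays the role of MinPts and max_eps the role
of the generating distance ε:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # a tight cluster
               rng.normal(3, 0.8, (50, 2))])  # a looser cluster

optics = OPTICS(min_samples=5, max_eps=2.0).fit(X)
# Reachability distances in cluster order; "valleys" in this sequence
# correspond to clusters (cf. the reachability plot on the next slide).
print(optics.reachability_[optics.ordering_])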

18
(Figure: reachability plot — reachability-distance, undefined for some
objects, plotted over the cluster-order of the objects)
19
DENCLUE using density functions
  • DENsity-based CLUstEring by Hinneburg and Keim
    (KDD'98)
  • Major features
  • Solid mathematical foundation
  • Good for data sets with large amounts of noise
  • Allows a compact mathematical description of
    arbitrarily shaped clusters in high-dimensional
    data sets
  • Significantly faster than existing algorithms
    (faster than DBSCAN by a factor of up to 45)
  • But needs a large number of parameters

20
DENCLUE Technical Essence
  • Uses grid cells, but only keeps information about
    grid cells that actually contain data points,
    and manages these cells in a tree-based access
    structure.
  • Influence function: describes the impact of a
    data point within its neighborhood.
  • The overall density of the data space can be
    calculated as the sum of the influence functions
    of all data points.
  • Clusters can be determined mathematically by
    identifying density attractors.
  • Density attractors are local maxima of the
    overall density function.

21
Gradient: the steepness of a slope
  • Example

22
Example: Density Computation
D = {x1, x2, x3, x4}
f_D^Gaussian(x) = influence(x1) + influence(x2)
  + influence(x3) + influence(x4)
  = 0.04 + 0.06 + 0.08 + 0.6 = 0.78
(Figure: the four data points with their individual influence values
at x)
Remark: the density value of y would be larger
than the one for x
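
A minimal sketch of this computation (not from the slides), assuming a
Gaussian influence function with an illustrative smoothness parameter
sigma:

import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    # Impact of data point y on location x.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    # Overall density at x: the sum of the influences of all data points.
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

data = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 4.0], [4.2, 3.9]])
print(density(np.array([0.2, 0.1]), data))  # near two points: high density
print(density(np.array([2.0, 2.0]), data))  # far from all points: low density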
23
Density Attractor
24
Basic Steps - DENCLUE Algorithms
  • Determine density attractors
  • Associate data objects with density attractors
    (→ initial clustering)
  • Merge the initial clusters further relying on a
    hierarchical clustering approach

25
CSE634 DATA CLUSTERING METHODS
Group 9
  • Shashank Viswanadha
  • (sviswana@cs.sunysb.edu)

26
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A.K. Jain (Michigan State
    University), M.N. Murty (Indian Institute of
    Science), and P.J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/mlittman/courses/lightai03/jain99data.pdf
  • STING: A Statistical Information Grid Approach to
    Spatial Data Mining by Wei Wang (University of
    California, LA), Jiong Yang (University of
    California, LA), and Richard Muntz (University of
    California, LA)
  • http://www.sigmod.org/vldb/conf/1997/P186.PDF
  • WaveCluster: a wavelet-based clustering approach
    for spatial data in very large databases by
    Gholamhosein Sheikholeslami, Surojit Chatterjee,
    and Aidong Zhang (The VLDB Journal (2000) 8:
    289-304)
  • http://www.cs.uiuc.edu/homes/hanj/refs/papers/sheikholeslami98.pdf

27
Overview
  • Grid-Based Methods
  • STING
  • WaveCluster
  • Clustering High-Dimensional Data
  • CLIQUE
  • PROCLUS

28
Grid-Based Methods
  • Uses a multiresolution grid data structure
  • Operations are performed on a finite number of
    cells which form a grid
  • Fast processing
  • Examples
  • STING: explores statistical information stored in
    grid cells
  • WaveCluster: clusters objects using wavelet
    transform methods

29
STING STatistical INformation Grid
  • Grid-based multiresolution clustering technique
    in which the spatial area is divided into
    rectangular cells.
  • Different levels of rectangular cells correspond
    to different levels of resolution, forming a
    hierarchical structure.
  • Statistical parameters of higher-level cells can
    be computed from the parameters of the
    lower-level cells.
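
A sketch of this bottom-up computation (an assumption following the
rule above, not code from the STING paper), where each cell stores a
count, mean, standard deviation, minimum, and maximum:

import math

def merge_cells(children):
    # children: list of dicts with keys n, mean, std, min, max
    # describing the lower-level cells of one parent cell.
    n = sum(c["n"] for c in children)
    mean = sum(c["n"] * c["mean"] for c in children) / n
    # Combine variances via E[X^2] = std^2 + mean^2 per child cell.
    ex2 = sum(c["n"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {
        "n": n,
        "mean": mean,
        "std": math.sqrt(max(ex2 - mean ** 2, 0.0)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }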

30
Contd.
31
Contd.
32
Contd.
  • Types of parameters
  • Attribute-independent: the number of objects in a
    cell
  • Attribute-dependent: mean, stdev, min, and max
  • Types of distribution that the attribute value
    can follow:
  • Normal
  • Uniform
  • Exponential
  • None (if the distribution is unknown)

33
Contd.
  • Mainly used for query answering.
  • Advantages:
  • Query-independent
  • Facilitates parallel processing and incremental
    updating
  • Efficiency
  • Disadvantage:
  • All the cluster boundaries are either horizontal
    or vertical, and no diagonal boundary is
    detected
  • Time complexity for query processing is O(g),
    where g is the total number of grid cells at the
    lowest level, which is usually much smaller than
    the number of objects.

34
WaveCluster: Clustering Using Wavelet Transformation
  • WaveCluster, a multiresolution clustering
    algorithm, involves two steps:
  • Summarizes the data by imposing a
    multidimensional grid structure onto the data
    space.
  • Transforms the original feature space, finding
    dense regions in the transformed space.
  • Handles large data sets efficiently, discovers
    clusters with arbitrary shape, handles outliers,
    and is insensitive to the order of input.

35
Contd.
  • Why is wavelet transformation useful for
    clustering?
  • Unsupervised clustering: it uses hat-shaped
    filters to emphasize regions where points
    cluster, while simultaneously suppressing weaker
    information at their boundaries

36
Contd.
  • Effective removal of outliers

37
Contd.
  • Multiresolution: the multiresolution property of
    the wavelet transform can help in detecting
    clusters at different levels of detail. The
    wavelet transform provides multiple levels of
    decomposition, which results in clusters at
    different scales, from fine to coarse.
  • Cost efficiency: since applying the wavelet
    transform is very fast, the approach is
    cost-effective. Clustering very large datasets
    takes only a few seconds; using parallel
    processing, even faster responses can be
    obtained.

38
Clustering High-Dimensional Data
  • Introduces clustering methods designed for
    high-dimensional data: generally over 10, or even
    thousands of dimensions for some tasks
  • Problems to overcome when finding clusters in
    high-dimensional data:
  • Noise produced by irrelevant dimensions
  • Sparse data
  • Data points in different dimensions becoming
    almost equally distant from one another

39
Contd.
  • Techniques used
  • Feature transformation methods: transform the
    data onto a smaller space while generally
    preserving the original distances between
    objects. Examples: principal component analysis
    and singular value decomposition
  • Feature selection methods: commonly used for data
    reduction by removing irrelevant or redundant
    dimensions (attributes)
  • Subspace clustering: an extension of feature
    selection that searches for groups of clusters
    within different subspaces of the same data set

40
Contd.
  • Examples
  • CLIQUE: dimension-growth subspace clustering
  • PROCLUS: dimension-reduction projected clustering

41
CLIQUE
  • CLIQUE's clustering algorithm outline:
  • Identify the sparse and the crowded areas in
    space, thereby discovering the overall
    distribution.
  • A cluster is defined as a maximal set of
    connected dense units.
  • Performs multidimensional clustering in two
    steps:
  • Partitions the d-dimensional data space into
    non-overlapping rectangular units, identifying
    the dense units among them.
  • Generates a minimal description for each cluster.
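
A minimal sketch of the partitioning step (illustrative only; the grid
resolution xi and the density threshold tau are assumed parameters,
not values from CLIQUE itself):

from collections import Counter
import numpy as np

def dense_units(data, xi=10, tau=5):
    # Partition each dimension into xi intervals, map every point to its
    # rectangular unit, and keep the units holding more than tau points.
    lo, hi = data.min(axis=0), data.max(axis=0)
    cells = np.floor((data - lo) / (hi - lo + 1e-12) * xi).astype(int)
    counts = Counter(map(tuple, cells))
    return {cell for cell, count in counts.items() if count > tau}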

42
Contd.
  • Insensitive to the order of input objects
  • Does not presume any canonical data distribution
  • Scales linearly with the size of the input and
    hence has good scalability
  • However, clustering results depend on proper
    tuning of the grid size.
  • It is also difficult to find clusters of rather
    different densities within different dimensional
    subspaces

43
PROCLUS
  • A typical dimension-reduction subspace clustering
    method.
  • Consists of three phases:
  • Initialization
  • Iteration
  • Cluster refinement
  • Initialization: select a set of initial medoids
    that are far apart from each other, so as to
    ensure that each cluster is represented by
    at least one object in the selected set.

44
Contd.
  • Iteration: selects a random set of k medoids
    from the reduced set and replaces bad medoids
    with randomly chosen new medoids if the
    clustering is improved.
  • Refinement: computes new dimensions for each
    medoid based on the clusters found, reassigns
    points to medoids, and removes outliers.

45
CSE634 DATA CLUSTERING METHODS
Group 9
  • Praveen Durairaj
  • (praveend@cs.sunysb.edu)

46
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A.K. Jain (Michigan State
    University), M.N. Murty (Indian Institute of
    Science), and P.J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/mlittman/courses/lightai03/jain99data.pdf
  • Clustering Through Decision Tree Construction
    (2000) - Bing Liu, Yiyuan Xia, Philip S. Yu
  • link

47
Constraint-Based Cluster Analysis
  • Used when the clustering task involves a very
    high-dimensional space.
  • User preferences.
  • Constraints while clustering.
  • Examples:
  • Expected number of clusters
  • Minimal/maximal cluster size

48
Categories of constraint based clustering
  • Constraints on individual objects
  • Constraints on the selection of clustering
    parameters
  • Constraints on distance or similarity functions
  • Clustering with obstacle objects
  • User specified constraints on properties of
    individual clusters
  • Semi-supervised clustering based on partial
    supervision

49
Clustering with Obstacle objects
  • Considers obstacle objects during clustering
  • A partitioning clustering method
  • k-medoids method
  • Uses a triangulation method to compute the
    distance between two objects.
  • The computational cost is very high if a large
    number of objects and obstacles are present.

50
Solving Clustering with Obstacles: Visibility Graphs
  • A visibility graph is the graph VG = (V, E) such
    that each vertex of the obstacles has a
    corresponding node in V, and two nodes v1 and
    v2 in V are joined by an edge in E if and
    only if the corresponding vertices they represent
    are visible to each other.
  • An example visibility graph:

51
Visibility graphs
  • Consider another visibility graph VG' = (V', E')
    created from VG by adding two points p and q
    to V'.
  • The shortest path between the two points p
    and q will be a sub-path of VG'.

52
Cost of distance computation
  • Preprocessing and optimization techniques are
    used:
  • Triangulate the region into triangles
  • Group nearby points to form micro-clusters
  • Two types of indices are used for optimization:
  • VV indices, for any pair of obstacle vertices
  • MV indices, for any pair of micro-cluster and
    obstacle vertex

53
User-constrained cluster analysis
  • A constrained optimization problem
  • Package industry: n customers and k service
    stations
  • Customer classification:
  • High-value customers
  • Ordinary customers

54
Micro-clustering
  • Partition the data set into k clusters satisfying
    user-specified constraints
  • Iterative refinement of the solution:
  • Move m surplus objects from cluster Ci to Cj
    such that the total sum of the distances of the
    objects to their corresponding cluster centers
    is reduced

55
Computational efficiency
  • Should handle deadlock situations
  • A constraint may be impossible to satisfy
  • Data are preprocessed to form micro-clusters
  • Object movement
  • Deadlock detection
  • Constraint satisfaction
  • Advantage:
  • Improved scalability (micro-clusters reduce the
    number of objects to be handled)

56
Semi-supervised cluster analysis
  • Clustering process based on user feedback or
    guidance constraints
  • Pair-wise constraints:
  • Objects are labeled as belonging to the same
    cluster or to different clusters
  • Generates highly desirable clusters

57
Methods
  • Constraint-based semi-supervised clustering
  • Relies on user-provided labels or constraints
  • Example: CLTree (based on decision trees)
  • Distance-based semi-supervised clustering
  • Adaptive distance measures, e.g.:
  • String-edit distance trained using
    Expectation-Maximization
  • Euclidean distance

58
Clustering using decision trees
  • Converts the clustering problem into a
    classification problem
  • Considers the set of points to be clustered as
    one class, Y
  • Adds a set of relatively uniformly distributed
    nonexistent points with label N
  • The N points are not physically added; only
    their existence is assumed

59
Clustering using decision trees
a) Set of data points (Y) to be clustered
b) Addition of uniformly distributed N points
c) Clustering of the resulting space, shown with Y points only
60
Clustering using decision trees
  • Works efficiently because the decision tree only
    needs the number of N points
  • The number of N points for the current node E is
    determined by the following rule (note that at
    the root node, the number of inherited N points
    is 0):
  • If the number of N points inherited from the
    parent node of E is less than the number of Y
    points in E, then
  • the number of N points for E is increased to the
    number of Y points in E
  • else the number of inherited N points is used for
    E
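
A direct transcription of this rule as a small function (the function
name and signature are illustrative):

def n_points_for_node(inherited_n, y_points_in_e):
    # Number of N points for node E; at the root, inherited_n is 0.
    if inherited_n < y_points_in_e:
        return y_points_in_e   # raise the N count to match the Y points in E
    return inherited_n         # otherwise keep the inherited count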

61
Clustering in Data Mining
  • Searching for useful information in large volumes
    of data
  • Current real-world data mining systems:
  • Detecting trends and patterns of play for NBA
    players
  • Categorizing patterns of children in the foster
    care system
  • Data mining approaches that use clustering:
  • Segmentation
  • Predictive modeling
  • Visualization of large databases

62
Segmentation
  • Clusters the data into homogeneous groups
  • Example: clustering pixels in Landsat images
  • Each pixel has 7 values from different satellite
    bands
  • These 7-value vectors are clustered into 256
    groups using a k-means algorithm
  • The image is then displayed with the spatial
    information

63
Predictive Modeling
  • Clusters group items
  • Infers rules to characterize groups and suggest
    models
  • Example: magazine subscribers
  • Clustered based on age, sex, income, etc.
  • Groups are clustered further to predict whether
    the subscribers will renew their subscriptions

64
Visualization
  • Aids human analysts in identifying groups that
    have similar characteristics
  • WinViz tool:
  • Exports derived clusters as new attributes and
    characterizes them
  • Cereals can be clustered based on calories,
    carbohydrates, sugar, etc.
  • Milk cereals can be characterized by high
    potassium content

65
Mining large unstructured databases
  • Classifying web documents using words or
    functions of words
  • Problems:
  • Very high dimensionality of the data sets
  • Relatively small sets of labeled samples
  • Clustering words from a small collection of
    documents in the document space

66
CSE634 DATA CLUSTERING METHODS
Group 9
  • Ravikanth Pulavarthy
  • (ravi.ingr@gmail.com)

67
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A.K. Jain (Michigan State
    University), M.N. Murty (Indian Institute of
    Science), and P.J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/mlittman/courses/lightai03/jain99data.pdf
  • Parsing Images of Architectural Scenes -
    A. Berg, M. Agrawala, J. Malik
  • http://www.cs.berkeley.edu/asimma/294-fall06/projects/reports/grabler.pdf

68
What defines an object?
"I stand at the window and see a house, trees,
sky. Theoretically I might say there were 327
brightnesses and nuances of colour. Do I have
"327"? No. I have sky, house, and trees." --Max
Wertheimer
69
Segmentation and Grouping
To recognize objects, rather than dealing with
too many pixels, we need a compact/summary
representation. We obtain this representation
from an image, a motion sequence, or a set of
tokens. What is interesting and what is not
depends on the application.
70
Image segmentation
  • Segmentation: splitting an image into regions
    based on some criteria (intensity, color,
    texture, orientation energy, ...).

71
Segmentation Algorithms
  • Simple Segmentation Algorithms
  • Thresholding
  • Segmentation by Clustering
  • Agglomerative clustering
  • Divisive clustering
  • K-means

72
Thresholding
  • Gray level thresholding is the simplest
    segmentation process.

(Figure: thresholding rule — pixels with gray level above the
threshold are labeled object, the rest background; multilevel
thresholding uses several thresholds)
73
Thresholding
  • Thresholding is computationally inexpensive and
    fast
  • Correct threshold selection is crucial for
    successful threshold segmentation
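
A minimal sketch (assuming a NumPy image array and a manually chosen
threshold T; choosing T well is the crucial step noted above):

import numpy as np

def threshold(image, T):
    # Label pixels with gray level above T as object (1),
    # the rest as background (0).
    return (image > T).astype(np.uint8)

def multilevel_threshold(image, thresholds):
    # Assign each pixel the index of the gray-level band it falls into.
    return np.digitize(image, sorted(thresholds))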

74
Thresholding: example
75
Simple Clustering Methods
  • Two natural algorithms:
  • Agglomerative clustering (bottom-up)
  • attach the closest point to the cluster it is
    closest to
  • repeat
  • Divisive clustering (top-down)
  • split the cluster along the best boundary
  • repeat

76
Agglomerative Methods
  • Make each point a separate cluster
  • Until the clustering is satisfactory
  • Merge the two clusters with the smallest
    inter-cluster distance

77
Divisive Methods
  • Construct a single cluster containing all points
  • Until the clustering is satisfactory
  • - Split the cluster that yields the two
    components with the largest intercluster distance

78
Agglomerative Versus Divisive Clustering
  • The user can specify the desired number of
    clusters as a termination condition

79
Measure of distance used
  • Min distance: d_min(Ci, Cj) = min { |p - p'| :
    p ∈ Ci, p' ∈ Cj }
  • Nearest-neighbour clustering algorithm
  • Max distance: d_max(Ci, Cj) = max { |p - p'| :
    p ∈ Ci, p' ∈ Cj }
  • Farthest-neighbour clustering algorithm
  • Mean distance: d_mean(Ci, Cj) = |m_i - m_j|,
    where m_i is the mean for Ci
  • Average distance: d_avg(Ci, Cj) =
    (1 / (n_i n_j)) Σ_{p ∈ Ci} Σ_{p' ∈ Cj} |p - p'|
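
These measures correspond to the single, complete, centroid, and
average linkage methods on the following slides. A short usage sketch
(assuming SciPy is available; the data and parameters are
illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))
Z = linkage(X, method="single")    # also: "complete", "centroid", "average"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters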

80
Single Linkage
  • The distance between clusters is based on the
    points in each cluster that are nearest to each
    other.

81
Complete Linkage Method
  • The distance between clusters is based on the
    points in each cluster that are farthest apart.

82
Centroid Linkage Method
  • The distance between clusters is defined as the
    distance between cluster centroids.

83
Average Linkage Method
  • The distance between clusters is the average
    distance between all pairs of observations.

84
Optimality
  • Neither agglomerative clustering nor divisive
    clustering is optimal
  • In other words, the set of centroids which they
    give is not guaranteed to minimise distortion

85
Contd.
  • For example:
  • In agglomerative clustering, a dense cluster of
    data points will be combined into a single
    centroid
  • But to minimise distortion, we need several
    centroids in a region where there are many data
    points
  • A single outlier may get its own cluster
  • Agglomerative clustering provides a useful
    starting point, but further refinement is needed

86
K-means Clustering
  • Choose a fixed number of clusters
  • Choose cluster centers and point-cluster
    allocations to minimize error

87
K-means Algorithm
  • Choose k data points to act as cluster centers
  • Until the clustering is satisfactory:
  • Assign each data point to the cluster that has
    the nearest cluster center
  • Ensure each cluster has at least one data point
    (by splitting clusters, etc.)
  • Replace the cluster centers with the means of the
    elements in the clusters
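
A minimal sketch of this loop (not from the slides; the empty-cluster
handling and the convergence test are simplified assumptions):

import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k data points to act as the initial cluster centers.
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to the nearest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace the centers with the means of their clusters;
        # re-seed any empty cluster with a random data point.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j)
            else data[rng.integers(len(data))]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers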

88
(Figure: an input image with its k-means clusterings using intensity
alone and using color alone)
89
Conclusion
  • Approaches to clustering high-dimensional spatial
    data have been addressed.
  • Some applications of data clustering in data
    mining and image segmentation were discussed.
    These are vital, as huge amounts of spatial data
    are obtained in real life from satellite images,
    medical equipment, geographic information
    systems (GIS), image database exploration, etc.

90
THANK YOU