CSE 634 Data Mining Techniques

- CLUSTERING
- Part 2 (Group no. 1)
- By Anushree Shibani Shivaprakash Fatima Zarinni
- Spring 2006
- Professor Anita Wasilewska, SUNY Stony Brook

References

- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
- M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
- How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
- Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases.
- Margaret H. Dunham. Data Mining.
- http://cs.sunysb.edu/cse634/ Presentation 9

Cluster Analysis

Introduction

- Major clustering methods
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods

Hierarchical methods

- Here we group data objects into a tree of clusters.
- There are two types of hierarchical clustering:
- Agglomerative hierarchical clustering
- Divisive hierarchical clustering

Agglomerative hierarchical clustering

- Groups data objects in a bottom-up fashion.
- Initially each data object is in its own cluster.
- Then we merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
- A user can specify the desired number of clusters as a termination condition.

Divisive hierarchical clustering

- Groups data objects in a top-down fashion.
- Initially all data objects are in one cluster.
- We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or until a termination condition is satisfied, such as reaching a desired number of clusters.

AGNES and DIANA

- Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects: a, b, c, d, e.

AGNES-Explored

- Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
- Step 1: Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
- Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

AGNES

- Step 3: Compute distances (similarities) between the new cluster and each of the old clusters.
- Step 4: Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
- Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.

Similarity/Distance metrics

- Single-link clustering: distance = shortest distance
- Complete-link clustering: distance = longest distance
- Average-link clustering: distance = average distance
- In each case the distance is measured from any member of one cluster to any member of the other cluster.
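The merge loop and the three linkage rules can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the presenters' code: the function names `linkage_distance` and `agnes`, the use of Euclidean distance, and the stop-at-`k`-clusters termination condition are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def linkage_distance(a, b, method="single"):
    """Distance between clusters a and b (lists of points)."""
    pair_dists = [dist(p, q) for p in a for q in b]
    if method == "single":      # shortest member-to-member distance
        return min(pair_dists)
    if method == "complete":    # longest member-to-member distance
        return max(pair_dists)
    if method == "average":     # average member-to-member distance
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {method}")


def agnes(points, k=1, method="single"):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every item starts as its own cluster
    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(
                clusters[ij[0]], clusters[ij[1]], method
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters
```

Recomputing every pairwise distance on each pass keeps the sketch short; a practical implementation would update a distance matrix instead.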

Single Linkage Hierarchical Clustering

- Say every point is its own cluster
- Find the most similar pair of clusters
- Merge them into a parent cluster
- Repeat

DIANA (Divisive Analysis)

- Introduced in Kaufman and Rousseeuw (1990)
- Inverse order of AGNES
- Eventually each node forms a cluster on its own

Overview

- Divisive clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance.
- The procedure is as follows:
- Step 1: The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.

Overview (contd.)

- Step 2: This maximum distance is compared to the threshold distance.
- If it is larger than the threshold, this group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined and placed into the new group with the closest seed point. The procedure then returns to Step 1.
- If the distance between the selected objects is less than the threshold, the divisive clustering stops.
- Apart from the threshold, to run a divisive clustering you simply need to decide upon a method of measuring the distance between two objects.
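The threshold procedure above can be sketched as follows. This is one possible reading of the slide, under stated assumptions: the function name `divisive` is hypothetical, distance is Euclidean, the test is applied to each group separately, and a group stops splitting once its largest pairwise distance no longer exceeds the threshold.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def divisive(points, threshold):
    """Split groups until no group's farthest pair exceeds the threshold."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    pending = [list(points)]  # start with all objects in a single group
    finished = []
    while pending:
        group = pending.pop()
        if len(group) < 2:
            finished.append(group)
            continue
        # Step 1: select the pair with the largest distance in the group.
        a, b = max(combinations(group, 2), key=lambda pq: dist(*pq))
        # Step 2: compare that distance to the threshold.
        if dist(a, b) <= threshold:
            finished.append(group)
            continue
        # Use the pair as seeds; each object joins its closest seed.
        near_a = [p for p in group if dist(p, a) <= dist(p, b)]
        near_b = [p for p in group if dist(p, a) > dist(p, b)]
        pending.extend([near_a, near_b])
    return finished
```

For example, `divisive([(0, 0), (1, 0), (10, 0), (11, 0)], 3)` separates the two points near the origin from the two points near x = 10.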

DIANA- Explored

- In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster.
- The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
- The cluster splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.

Difficulties with Hierarchical clustering

- Hierarchical clustering encounters difficulties regarding the selection of merge and split points.
- Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
- It will not undo what was done previously.
- Thus, split or merge decisions, if not well chosen at some step, may lead to low-quality clusters.

Solution to improve Hierarchical clustering

- One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
- BIRCH
- CURE
- Chameleon

BIRCH: An Efficient Data Clustering Method for Very Large Databases

- Paper by
- Miron Livny, Computer Sciences Dept., University of Wisconsin-Madison, miron_at_cs.wisc.edu
- Raghu Ramakrishnan, Computer Sciences Dept., University of Wisconsin-Madison, raghu_at_cs.wisc.edu
- Tian Zhang, Computer Sciences Dept., University of Wisconsin-Madison, zhang_at_cs.wisc.edu
- In Proceedings of the International Conference on Management of Data (ACM SIGMOD), pages 103-114, Montreal, Canada, June 1996.

Reference For Paper

- www2.informatik.huberlin.de/wm/mldm2004/zhang96birch.pdf

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

- A hierarchical clustering method.
- It introduces two concepts:
- Clustering feature (CF)
- Clustering feature tree (CF tree)
- These structures help the clustering method achieve good speed and scalability in large databases.

Clustering Feature Definition

- Given N d-dimensional data points in a cluster, $\{X_i\}$ where $i = 1, 2, \ldots, N$, the clustering feature is $CF = (N, LS, SS)$, where
- N is the number of data points in the cluster,
- LS is the linear sum of the N data points,
- SS is the square sum of the N data points.

Clustering feature concepts

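- Each record (data object) is a tuple of attribute values and is here called a vector.
- Consider a database of N such objects.
- We define each object as $O_i = (V_{i1}, \ldots, V_{id})$.
- Linear sum: $LS = \sum_{i=1}^{N} O_i = \left(\sum_{i=1}^{N} V_{i1}, \sum_{i=1}^{N} V_{i2}, \ldots, \sum_{i=1}^{N} V_{id}\right)$
- Square sum: $SS = \sum_{i=1}^{N} O_i^2 = \left(\sum_{i=1}^{N} V_{i1}^2, \sum_{i=1}^{N} V_{i2}^2, \ldots, \sum_{i=1}^{N} V_{id}^2\right)$, where $O_i^2$ denotes the component-wise square.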

Example of a case

- Assume $N = 5$ and $d = 2$.
- Linear sum: $LS = \sum_{i=1}^{5} O_i = \left(\sum_{i=1}^{5} V_{i1}, \sum_{i=1}^{5} V_{i2}\right)$
- Square sum: $SS = \left(\sum_{i=1}^{5} V_{i1}^2, \sum_{i=1}^{5} V_{i2}^2\right)$

Example 2

Clustering feature: CF = (N, LS, SS), with N = 5, LS = (16, 30), SS = (54, 190)

CF = (5, (16, 30), (54, 190))

Object   Attribute1   Attribute2
O1       3            4
O2       2            6
O3       4            5
O4       4            7
O5       3            8
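The numbers in this example can be checked with a short Python sketch. The helper names `clustering_feature` and `merge_cf` are made up for illustration; SS is kept per dimension, as in the slides. The second helper shows the additivity that makes CFs convenient: the CF of two disjoint clusters merged together is the component-wise sum of their CFs.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with LS and SS kept per dimension."""
    n = len(points)
    d = len(points[0])
    ls = tuple(sum(p[j] for p in points) for j in range(d))       # linear sum
    ss = tuple(sum(p[j] ** 2 for p in points) for j in range(d))  # square sum
    return n, ls, ss


def merge_cf(cf_a, cf_b):
    """CF of the union of two disjoint clusters (component-wise sum)."""
    n_a, ls_a, ss_a = cf_a
    n_b, ls_b, ss_b = cf_b
    return (
        n_a + n_b,
        tuple(x + y for x, y in zip(ls_a, ls_b)),
        tuple(x + y for x, y in zip(ss_a, ss_b)),
    )


# The five objects O1..O5 from the table above.
objects = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(objects)
print(cf)  # (5, (16, 30), (54, 190))
```

Because of additivity, a cluster summary can be updated or combined without revisiting the individual points.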

CF-Tree

- A CF-tree is a height-balanced tree with two parameters: branching factor (B for a nonleaf node and L for a leaf node) and threshold T.
- The entry in each nonleaf node has the form [CFi, childi].
- The entry in each leaf node is a CF; each leaf node has two pointers, `prev' and `next'.
- The CF tree is basically a tree used to store all the clustering features.

CF Tree

[Figure: CF tree. The root and each non-leaf node hold entries of the form (CFi, childi); each leaf node holds CF entries and is linked to its neighboring leaves by prev and next pointers.]

BIRCH Clustering

- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
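To give a feel for Phase 1, here is a deliberately simplified Python sketch. It is not the BIRCH algorithm itself: it keeps a flat list of leaf entries instead of a height-balanced tree, ignores the branching factors B and L, and treats the threshold T as a bound on the cluster radius. The names `cf_add_point`, `centroid`, `radius`, and `build_leaf_entries` are hypothetical.

```python
from math import sqrt


def cf_add_point(cf, p):
    """CF after absorbing one more point p."""
    n, ls, ss = cf
    return (
        n + 1,
        tuple(a + x for a, x in zip(ls, p)),
        tuple(a + x * x for a, x in zip(ss, p)),
    )


def centroid(cf):
    n, ls, _ = cf
    return tuple(a / n for a in ls)


def radius(cf):
    """Root of the mean squared distance from the members to the centroid."""
    n, ls, ss = cf
    var = sum(s / n - (l / n) ** 2 for l, s in zip(ls, ss))
    return sqrt(max(var, 0.0))  # guard against tiny negative rounding error


def build_leaf_entries(points, threshold):
    """Single scan: absorb each point into the closest entry if it still fits."""
    entries = []
    for p in points:
        if entries:
            # Closest existing entry, measured from its centroid.
            i = min(
                range(len(entries)),
                key=lambda k: sum(
                    (c - x) ** 2 for c, x in zip(centroid(entries[k]), p)
                ),
            )
            candidate = cf_add_point(entries[i], p)
            if radius(candidate) <= threshold:
                entries[i] = candidate
                continue
        # Otherwise the point starts a new entry of its own.
        entries.append((1, tuple(p), tuple(x * x for x in p)))
    return entries
```

In the spirit of Phase 2, the centroids of the resulting entries could then be handed to any other clustering algorithm.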

BIRCH Algorithm Overview

Summary of Birch

- Scales linearly: with a single scan you get good clustering, and the quality of clustering improves with a few additional scans.
- It handles noise (data points that are not part of the underlying pattern) effectively.

Density-Based Clustering Methods

- Clustering based on density, such as density-connected points, instead of a distance metric.
- Cluster = set of density-connected points.
- Major features:
- Discover clusters of arbitrary shape
- Handle noise
- Need density parameters as termination condition (when no new objects can be added to the cluster)
- Examples:
- DBSCAN (Ester et al., 1996)
- OPTICS (Ankerst et al., 1999)
- DENCLUE (Hinneburg and D. Keim, 1998)

Density-Based Clustering Background

- Eps-neighborhood: the neighborhood within a radius Eps of a given object.
- MinPts: minimum number of points in an Eps-neighborhood of that object.
- Core object: if the Eps-neighborhood of an object contains at least a minimum number of points, MinPts, then the object is a core object.
- Directly density-reachable: a point p is directly density-reachable from a point q wrt. Eps, MinPts if
- 1) p is within the Eps-neighborhood of q
- 2) q is a core object
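These definitions translate almost directly into code. The sketch below is illustrative only; it assumes Euclidean distance and assumes that a point counts as a member of its own Eps-neighborhood. The function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, q, eps):
    """All points within distance eps of q (q itself included)."""
    return [p for p in points if dist(p, q) <= eps]


def is_core(points, q, eps, min_pts):
    """q is a core object if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    return dist(p, q) <= eps and is_core(points, q, eps, min_pts)
```

Note that the relation is not symmetric: a non-core point can be directly density-reachable from a core point, but not the other way around.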

Figure showing the density reachability and

density connectivity in density based clustering

- M, P, O, R and S are core objects since each is in an Eps-neighborhood containing at least 3 points.
- MinPts = 3; Eps = radius of the circles.

Directly density reachable

- Q is directly density reachable from M. M is directly density reachable from P and vice versa.

Indirectly density reachable

- Q is indirectly density reachable from P since Q is directly density reachable from M and M is directly density reachable from P. But P is not density reachable from Q since Q is not a core object.

Core, border, and noise points

- DBSCAN is a density-based algorithm.
- Density = number of points within a specified radius (Eps).
- A point is a core point if it has at least a specified number of points (MinPts) within Eps.
- These are points that are in the interior of a cluster.
- A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is neither a core point nor a border point.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm

- Arbitrarily select a point p.
- Retrieve all points density-reachable from p wrt Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.
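The loop above can be sketched compactly in Python. This is an illustrative version under assumptions, not the reference implementation: points are distinct tuples, distance is Euclidean, neighborhoods are found by a linear scan rather than a spatial index, and noise is marked with the label -1. The name `dbscan` and the returned label dictionary are choices made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return {point: cluster id}, with NOISE for points in no cluster."""
    labels = {}
    cluster_id = 0
    for p in points:
        if p in labels:
            continue  # already processed
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # may later turn out to be a border point
            continue
        # p is a core point: grow a new cluster from it.
        labels[p] = cluster_id
        frontier = list(neighbors)
        while frontier:
            q = frontier.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster_id  # border point joins the cluster
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbors = [r for r in points if dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:  # q is also a core point
                frontier.extend(q_neighbors)
        cluster_id += 1
    return labels
```

Only core points extend the frontier, which is why a border point is added to a cluster but never used to reach further points.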

Conclusions

- We discussed two hierarchical clustering methods: agglomerative and divisive.
- We also discussed BIRCH, a hierarchical clustering method that produces good clustering in a single scan and better clustering with a few additional scans.
- DBSCAN is a density-based clustering algorithm; through this algorithm we discover clusters of arbitrary shape. Unlike hierarchical methods, it groups points by density rather than by a distance metric between clusters.

GRID-BASED CLUSTERING METHODS

- This is the approach in which we quantize space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.
- So, for example, assume that we have a set of records and we want to cluster with respect to two attributes. We then divide the related space (plane) into a grid structure and find the clusters.

[Figure: our space is this plane. Horizontal axis: Age (20 to 60). Vertical axis: Salary (x 10,000), 0 to 8.]

Techniques for Grid-Based Clustering

- The following are some techniques that are used to perform grid-based clustering:
- CLIQUE (CLustering In QUEst)
- STING (STatistical INformation Grid)
- WaveCluster

Looking at CLIQUE as an Example

- CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.
- CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.

Definitions That Need to Be Known

- Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter.
- Cluster: a cluster is defined as a maximal set of connected dense units.
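The three definitions can be made concrete with a small Python sketch. It is an illustration under assumptions: the grid is described by a hypothetical `origin` and per-dimension cell `width`, the density threshold is called `tau`, and two units count as connected when they share a face (directly or through a chain of such units).

```python
from collections import Counter


def grid_unit(point, origin, width):
    """Index of the rectangular unit that contains the point."""
    return tuple(int((x - o) // w) for x, o, w in zip(point, origin, width))


def dense_units(points, origin, width, tau):
    """Units whose fraction of all data points exceeds tau."""
    counts = Counter(grid_unit(p, origin, width) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}


def clusters(dense):
    """Maximal sets of connected dense units."""
    remaining = set(dense)
    result = []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            u = stack.pop()
            # Visit the neighbors that differ by one step along one axis.
            for axis in range(len(u)):
                for step in (-1, 1):
                    v = u[:axis] + (u[axis] + step,) + u[axis + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        component.add(v)
                        stack.append(v)
        result.append(component)
    return result
```

With age on one axis and salary on the other, for instance, `origin=(20, 0)` and `width=(10, 1)` would give units that are ten years wide and one salary unit tall.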

How Does CLIQUE Work?

- Let us say that we have a set of records that we would like to cluster in terms of n attributes.
- So, we are dealing with an n-dimensional space.
- MAJOR STEPS
- CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.
- Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.

CLIQUE Major Steps (Cont.)

- Now CLIQUE's goal is to identify the dense n-dimensional units.
- It does this in the following way:
- CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces.
- So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
- It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.

CLIQUE Major Steps. (Cont.)

- Each maximal set of connected dense units is considered a cluster.
- Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
- The information of the subspaces is then used to find clusters in the n-dimensional space.
- It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.

Example for CLIQUE

- Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation and age.
- The data space for this data would be 3-dimensional.

[Figure: 3-dimensional data space with axes salary, vacation and age.]

Example (Cont.)

- After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length.
- Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.
- Now, our goal is to find the dense 3-D rectangular units.

Example (Cont.)

- To do this, we find the dense units of the subspaces of this 3-d space.
- So, we find the dense units with respect to salary and age. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
- We also find the dense 2-D rectangular units for the vacation-age plane.

Example 1

Example (Cont.)

- Now let us try to visualize the dense units of the two planes on the following 3-d figure.

Example (Cont.)

- We can extend the dense areas in the vacation-age plane inwards.
- We can extend the dense areas in the salary-age plane upwards.
- The intersection of these two spaces would give us a candidate search space in which 3-dimensional dense units exist.
- We then find the dense units in the salary-vacation plane and we form an extension of the subspace that represents these dense units.

Example (Cont.)

- Now, we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-d dense units.
- So, what was the main idea?
- We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.
- After finding the dense units, it is very easy to find clusters.

Reflecting upon CLIQUE

- Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
- Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned.
- The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in the (k-1)-dimensional space. Consequently, a k-dimensional unit with any non-dense projection cannot be dense and need not be examined.
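The pruning this property allows can be sketched in Python. The representation is an assumption made for the example: a unit is a frozenset of (dimension name, interval index) pairs, and the hypothetical function `candidates` joins dense (k-1)-dimensional units into k-dimensional candidates, keeping only those whose (k-1)-dimensional projections are all dense.

```python
from itertools import combinations


def candidates(dense_lower):
    """Join (k-1)-dim dense units into k-dim candidates, then prune."""
    dense_lower = set(dense_lower)
    result = set()
    for a, b in combinations(dense_lower, 2):
        unit = a | b
        dims = [d for d, _ in unit]
        # Keep only joins that add exactly one new dimension.
        if len(unit) != len(a) + 1 or len(set(dims)) != len(dims):
            continue
        # Apriori pruning: every (k-1)-dim projection must be dense.
        if all(unit - {item} in dense_lower for item in unit):
            result.add(unit)
    return result


# Dense 2-D units found in the three planes (interval indexes are made up).
dense_2d = {
    frozenset({("age", 3), ("salary", 4)}),
    frozenset({("age", 3), ("vacation", 2)}),
    frozenset({("salary", 4), ("vacation", 2)}),
    frozenset({("age", 5), ("salary", 4)}),
}
print(candidates(dense_2d))
```

Only the unit spanning age interval 3, salary interval 4 and vacation interval 2 survives; the unit built on age interval 5 is pruned because its age-vacation projection is not dense. The surviving candidates still have to be checked against the data, since the property is necessary but not sufficient.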

Strength and Weakness of CLIQUE

- Strengths
- It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
- It is quite efficient.
- It is insensitive to the order of records in the input and does not presume some canonical data distribution.
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
- Weakness
- The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

STING: A Statistical Information Grid Approach to Spatial Data Mining

- Paper by
- Jiong Yang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., jyang_at_cs.ucla.edu
- Richard Muntz, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., muntz_at_cs.ucla.edu
- Wei Wang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., weiwang_at_cs.ucla.edu
- VLDB Conference, Athens, Greece, 1997

Reference For Paper

- http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF

Definitions That Need to Be Known

- Spatial data
- Data that have a spatial or location component.
- These are objects that themselves are located in physical space.
- Examples: my house, Lake Geneva, New York City, etc.
- Spatial area
- The area that encompasses the locations of all the spatial data is called the spatial area.

STING (Introduction)

- STING is used for performing clustering on spatial data.
- STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
- STING's big benefit is that it efficiently processes many common region-oriented queries on a set of points.
- We want to cluster the records that are in a spatial table in terms of location.
- Placement of a record in a grid cell is completely determined by its physical location.

Hierarchical Structure of Each Grid Cell

- The spatial area is divided into rectangular cells (using latitude and longitude).
- Each cell forms a hierarchical structure.
- This means that each cell at a higher level is further partitioned into 4 smaller cells in the lower level.
- In other words, each cell at the ith level (except the leaves) has 4 children in the (i+1)th level.
- The union of the 4 children cells would give back the parent cell in the level above them.
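One way to picture this hierarchy is to index each cell by (level, row, column), so that parent and child cells follow from simple arithmetic. The indexing scheme, the function names, and the convention that the root is level 0 are assumptions made for this sketch; the slides do not prescribe them.

```python
def children(level, row, col):
    """The four child cells of a cell, one level down."""
    return [
        (level + 1, 2 * row + dr, 2 * col + dc)
        for dr in (0, 1)
        for dc in (0, 1)
    ]


def parent(level, row, col):
    """The cell one level up that contains this cell."""
    return (level - 1, row // 2, col // 2)


def leaf_cell(x, y, bounds, depth):
    """Leaf-level cell containing location (x, y); root is level 0."""
    x0, y0, x1, y1 = bounds      # extent of the whole spatial area
    n = 2 ** depth               # number of cells per side at the leaf level
    col = min(int((x - x0) / (x1 - x0) * n), n - 1)
    row = min(int((y - y0) / (y1 - y0) * n), n - 1)
    return (depth, row, col)
```

Because a record's cell depends only on its coordinates, `leaf_cell` needs nothing but the location and the extent of the spatial area.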

Hierarchical Structure of Cells (Cont.)

- The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.
- So, why do we have a hierarchical structure for cells?
- We have them in order to provide a better granularity, or higher resolution.

A Hierarchical Structure for Sting Clustering

Statistical Parameters Stored in each Cell

- For each cell in each layer we have attribute-dependent and attribute-independent parameters.
- Attribute-independent parameter:
- Count: number of records in this cell.
- Attribute-dependent parameters:
- (We are assuming that our attribute values are real numbers.)

Statistical Parameters (Cont.)

- For each attribute of each cell we store the following parameters:
- M: mean of all values of each attribute in this cell.
- S: standard deviation of all values of each attribute in this cell.
- Min: the minimum value for each attribute in this cell.
- Max: the maximum value for each attribute in this cell.
- Distribution: the type of distribution that the attribute value in this cell follows (e.g. normal, exponential, etc.). None is assigned to Distribution if the distribution is unknown.

Storing of Statistical Parameters

- Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
- The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.
- The statistical parameters for the cells in all the other levels are computed from their respective children cells in the lower level.
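As a sketch of how a parent cell's parameters can be derived from its children without touching the raw records, the Python below combines counts, means, standard deviations, minima and maxima for one attribute. It assumes population (not sample) standard deviation and leaves out the Distribution parameter; the function names and the dictionary layout are hypothetical.

```python
from math import sqrt


def leaf_stats(values):
    """Parameters of a lowest-layer cell, computed directly from its values."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / n)  # population std. dev.
    return {"n": n, "m": m, "s": s, "min": min(values), "max": max(values)}


def parent_stats(children):
    """Parameters of a parent cell, computed only from its children."""
    children = [c for c in children if c["n"] > 0]  # skip empty cells
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    # Mean of squares, rebuilt from each child's s^2 + m^2.
    second_moment = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n
    s = sqrt(max(second_moment - m * m, 0.0))
    return {
        "n": n,
        "m": m,
        "s": s,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }
```

Applying `parent_stats` to the four children of a cell gives the same count, mean, minimum and maximum as `leaf_stats` over all their records, and the same standard deviation up to rounding.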

How are Queries Processed ?

- STING can answer many queries (especially region queries) efficiently, because we don't have to access the full database.
- How are spatial data queries processed?
- We use a top-down approach to answer spatial data queries.
- Start from a pre-selected layer, typically with a small number of cells.
- The pre-selected layer does not have to be the topmost layer.
- For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.

Query Processing (Cont.)

- The confidence interval is calculated by using the statistical parameters of each cell.
- Remove irrelevant cells from further consideration.
- When finished with the current layer, proceed to the next lower level.
- Processing of the next lower level examines only the remaining relevant cells.
- Repeat this process until the bottom layer is reached.
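The top-down pass can be sketched as below. This is a simplification under assumptions, not STING's actual procedure: the confidence-interval computation is replaced by a caller-supplied predicate `is_relevant`, cells are plain dictionaries with an optional "children" list, and all leaves are assumed to sit at the same depth.

```python
def sting_query(cells, is_relevant):
    """Walk the layers top-down, keeping only cells relevant to the query."""
    current = list(cells)  # cells of the pre-selected starting layer
    while True:
        # Remove irrelevant cells from further consideration.
        current = [c for c in current if is_relevant(c)]
        # Proceed to the next lower level, examining only remaining cells.
        next_layer = [child for c in current for child in c.get("children", [])]
        if not next_layer:
            return current  # bottom layer reached (or nothing relevant left)
        current = next_layer
```

A predicate for a rent range of 800 to 1000 could, for example, keep a cell only when its stored Min and Max overlap that range, so whole regions are discarded without reading their records.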

Different Grid Levels during Query Processing.

Sample Query Examples

- Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens.
- Our records represent apartments that are present throughout the above region.
- Query: find all the apartments that are for rent near Stony Brook University and that have a rent in the range of 800 to 1000.
- The above query depends upon the parameter "near". For our example, "near" means within 15 miles of Stony Brook University.

Advantages and Disadvantages of STING

- ADVANTAGES
- Very efficient.
- The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
- STING is a query-independent approach, since statistical information exists independently of queries.
- Incremental update.
- DISADVANTAGES
- All cluster boundaries are either horizontal or vertical, and no diagonal boundary is selected.

- Thank you !