CSE 634 Data Mining Techniques - PowerPoint PPT Presentation

CSE 634 Data Mining Techniques: Clustering, Part 2 (Group no. 1). By Anushree Shibani Shivaprakash and Fatima Zarinni. Spring 2006. Professor Anita Wasilewska.

Slides: 76
Provided by: AnuSh4


CSE 634 Data Mining Techniques
  • Clustering, Part 2 (Group no. 1)
  • By Anushree Shibani Shivaprakash and Fatima Zarinni
  • Spring 2006
  • Professor Anita Wasilewska, SUNY Stony Brook

References
  • Jiawei Han and Micheline Kamber. Data Mining:
    Concepts and Techniques (Chapter 8). Morgan
    Kaufmann, 2002.
  • M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
    density-based algorithm for discovering clusters
    in large spatial databases. KDD'96.
  • How to explain hierarchical clustering.
  • Tian Zhang, Raghu Ramakrishnan, Miron Livny.
    BIRCH: An efficient data clustering method for
    very large databases.
  • Margaret H. Dunham. Data Mining.
  • http://cs.sunysb.edu/cse634/ Presentation 9,
    Cluster Analysis

  • Major clustering methods
  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Grid-based methods

Hierarchical methods
  • Here we group data objects into a tree of
    clusters.
  • There are two types of hierarchical clustering:
  • Agglomerative hierarchical clustering
  • Divisive hierarchical clustering

Agglomerative hierarchical clustering
  • Group data objects in a bottom-up fashion.
  • Initially each data object is in its own cluster.
  • Then we merge these atomic clusters into larger
    and larger clusters, until all of the objects are
    in a single cluster or until certain termination
    conditions are satisfied.
  • A user can specify the desired number of clusters
    as a termination condition.

Divisive hierarchical clustering
  • Groups data objects in a top-down fashion.
  • Initially all data objects are in one cluster.
  • We then subdivide the cluster into smaller and
    smaller clusters, until each object forms a
    cluster on its own or until certain termination
    conditions are satisfied, such as a desired
    number of clusters being obtained.

  • Application of AGNES (AGglomerative NESting) and
    DIANA (DIvisive ANAlysis) to a data set of five
    objects, a, b, c, d, e.

  • Given a set of N items to be clustered, and an
    NxN distance (or similarity) matrix, the basic
    process of Johnson's (1967) hierarchical
    clustering is this:
  1. Start by assigning each item to its own cluster,
    so that if you have N items, you now have N
    clusters, each containing just one item. Let the
    distances (similarities) between the clusters
    equal the distances (similarities) between the
    items they contain.
  2. Find the closest (most similar) pair of clusters
    and merge them into a single cluster, so that now
    you have one less cluster.
  3. Compute distances (similarities) between the new
    cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are
    clustered into a single cluster of size N.
  • Step 3 can be done in different ways, which is
    what distinguishes single-link from complete-link
    and average-link clustering.
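
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration and not code from the presentation: the function names `cluster_distance` and `agglomerative` are made up here, and for simplicity the sketch recomputes cluster distances from the item-level distance matrix on every pass instead of updating a cluster-level matrix as step 3 suggests.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b, given as lists of item indices."""
    pair_dists = [dist[i][j] for i in a for j in b]
    if linkage == "single":      # shortest member-to-member distance
        return min(pair_dists)
    if linkage == "complete":    # longest member-to-member distance
        return max(pair_dists)
    if linkage == "average":     # mean member-to-member distance
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(dist, linkage="single", num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain.

    dist is an N x N distance matrix (list of lists).
    Returns the final clusters and the merge history.
    """
    clusters = [[i] for i in range(len(dist))]   # every item starts alone
    merges = []
    while len(clusters) > num_clusters:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], dist, linkage
            ),
        )
        d = cluster_distance(clusters[a], clusters[b], dist, linkage)
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        # Replace the two old clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Four items on a line at positions 0, 1, 3 and 7.
dist = [[0, 1, 3, 7], [1, 0, 2, 6], [3, 2, 0, 4], [7, 6, 4, 0]]
clusters, merges = agglomerative(dist, linkage="single", num_clusters=2)
```

Passing `num_clusters` plays the role of the user-specified termination condition; leaving it at 1 runs the process until a single cluster of size N remains.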

Similarity/Distance metrics
  • Single-link clustering: distance between two
    clusters = shortest distance
  • Complete-link clustering: distance between two
    clusters = longest distance
  • Average-link clustering: distance between two
    clusters = average distance
  • In each case the distance is measured from any
    member of one cluster to any member of the other
    cluster.

Single Linkage Hierarchical Clustering
  1. Say every point is its own cluster
  2. Find the most similar pair of clusters
  3. Merge the pair into a parent cluster
  4. Repeat

DIANA (Divisive Analysis)
  • Introduced in Kaufman and Rousseeuw (1990)
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

  • Divisive clustering starts by placing all objects
    into a single group. Before we start the
    procedure, we need to decide on a threshold
    distance.
  • The procedure is as follows:
  1. The distance between all pairs of objects within
    the same group is determined, and the pair with
    the largest distance is selected.
  2. This maximum distance is compared to the
    threshold distance.
  3. If it is larger than the threshold, this group is
    divided in two. This is done by placing the
    selected pair into different groups and using
    them as seed points. All other objects in this
    group are examined, and are placed into the new
    group with the closest seed point. The procedure
    then returns to Step 1.
  4. If the distance between the selected objects is
    less than the threshold, the divisive clustering
    stops.
  • To run a divisive clustering, you simply need to
    decide upon a method of measuring the distance
    between two objects.
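
As a rough sketch of the threshold procedure above, assuming objects are plain numbers and the caller supplies the distance function; the name `divisive` and the choice to treat a distance exactly equal to the threshold as "stop" are assumptions of this example, not part of the slides.

```python
def divisive(points, dist, threshold):
    """Split groups until no within-group distance exceeds the threshold."""
    pending = [list(points)]   # groups that still need to be examined
    done = []                  # groups that will not be split further
    while pending:
        group = pending.pop()
        # Step 1: find the pair with the largest distance in this group.
        pairs = [
            (dist(p, q), p, q)
            for i, p in enumerate(group)
            for q in group[i + 1:]
        ]
        if not pairs:          # a single object cannot be split
            done.append(group)
            continue
        far, seed_a, seed_b = max(pairs, key=lambda t: t[0])
        # Steps 2 and 4: stop splitting once the group is compact enough.
        if far <= threshold:
            done.append(group)
            continue
        # Step 3: the two objects become seeds; every object joins the
        # group of the closer seed (ties go to the first seed).
        group_a = [p for p in group if dist(p, seed_a) <= dist(p, seed_b)]
        group_b = [p for p in group if dist(p, seed_a) > dist(p, seed_b)]
        pending.extend([group_a, group_b])
    return done


groups = divisive([1, 2, 3, 10, 11, 12], lambda p, q: abs(p - q), threshold=3)
```

Each seed always lands in its own group, so both halves are non-empty and the loop is guaranteed to terminate.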

DIANA- Explored
  • In DIANA, a divisive hierarchical clustering
    method, all of the objects initially form one
    cluster.
  • The cluster is split according to some principle,
    such as the maximum Euclidean distance between
    the closest neighboring objects in the cluster.
  • The cluster splitting process repeats until,
    eventually, each new cluster contains a single
    object or a termination condition is met.

Difficulties with Hierarchical clustering
  • Hierarchical clustering encounters difficulties
    regarding the selection of merge and split points.
  • Such a decision is critical because once a group
    of objects is merged or split, the process at the
    next step will operate on the newly generated
    clusters.
  • It will not undo what was done previously.
  • Thus, split or merge decisions, if not well
    chosen at some step, may lead to low-quality
    clusters.

Solution to improve Hierarchical clustering
  • One promising direction for improving the
    clustering quality of hierarchical methods is to
    integrate hierarchical clustering with other
    clustering techniques. A few such methods are:
  • BIRCH
  • CURE
  • Chameleon

BIRCH: An Efficient Data Clustering Method for
Very Large Databases
  • Paper by
  • Tian Zhang, Computer Sciences Dept., University
    of Wisconsin-Madison, zhang_at_cs.wisc.edu
  • Raghu Ramakrishnan, Computer Sciences Dept.,
    University of Wisconsin-Madison,
    raghu_at_cs.wisc.edu
  • Miron Livny, Computer Sciences Dept., University
    of Wisconsin-Madison, miron_at_cs.wisc.edu
  • In Proceedings of the International Conference
    on Management of Data (ACM SIGMOD), pages
    103-114, Montreal, Canada, June 1996.
Reference For Paper
  • www2.informatik.huberlin.de/wm/mldm2004/zhang96bir

BIRCH (Balanced Iterative Reducing and Clustering
Using Hierarchies)
  • A hierarchical clustering method.
  • It introduces two concepts:
  • Clustering feature (CF)
  • Clustering feature tree (CF tree)
  • These structures help the clustering method
    achieve good speed and scalability in large
    databases.
Clustering Feature Definition
  • Given N d-dimensional data points in a cluster,
    $\{X_i\}$ where $i = 1, 2, \ldots, N$,
  • $CF = (N, LS, SS)$
  • N is the number of data points in the cluster,
  • LS is the linear sum of the N data points,
  • SS is the square sum of the N data points.

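Clustering feature concepts
  • Each record (data object) is a tuple of values of
    attributes and here is called a vector.
  • Here is a database. We define each object as
    $O_i = (V_{i1}, \ldots, V_{id})$, for
    $i = 1, \ldots, N$.

Linear Sum Definition
  • $LS = \sum_{i=1}^{N} O_i = \left( \sum_{i=1}^{N} V_{i1}, \sum_{i=1}^{N} V_{i2}, \ldots, \sum_{i=1}^{N} V_{id} \right)$

Square sum
  • $SS = \sum_{i=1}^{N} O_i^2 = \left( \sum_{i=1}^{N} V_{i1}^2, \sum_{i=1}^{N} V_{i2}^2, \ldots, \sum_{i=1}^{N} V_{id}^2 \right)$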

Example of a case
  • Assume N = 5 and d = 2
  • Linear Sum
  • $LS = \sum_{i=1}^{5} O_i = \left( \sum_{i=1}^{5} V_{i1}, \sum_{i=1}^{5} V_{i2} \right)$
  • Square Sum
  • $SS = \left( \sum_{i=1}^{5} V_{i1}^2, \sum_{i=1}^{5} V_{i2}^2 \right)$

Example 2
  • Clustering feature: CF = (N, LS, SS)
  • N = 5
  • LS = (16, 30)
  • SS = (54, 190)
  • CF = (5, (16, 30), (54, 190))

Object  Attribute1  Attribute2
O1      3           4
O2      2           6
O3      4           5
O4      4           7
O5      3           8
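
The numbers in Example 2 can be checked with a short Python sketch. It follows the slides' per-dimension definition of SS; the function names are illustrative only. The `merge_cf` helper relies on the additivity of clustering features (the CF of two merged clusters is the component-wise sum of their CFs), which comes from the BIRCH paper and is not stated on the slides.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with per-dimension linear and square sums."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[k] for p in points) for k in dims)
    ss = tuple(sum(p[k] ** 2 for p in points) for k in dims)
    return n, ls, ss


def merge_cf(cf_a, cf_b):
    """Combine two clustering features by adding their components."""
    n_a, ls_a, ss_a = cf_a
    n_b, ls_b, ss_b = cf_b
    ls = tuple(x + y for x, y in zip(ls_a, ls_b))
    ss = tuple(x + y for x, y in zip(ss_a, ss_b))
    return n_a + n_b, ls, ss


objects = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]   # O1..O5 from the table
cf = clustering_feature(objects)
print(cf)   # (5, (16, 30), (54, 190))
```

Because the features add up, a CF can summarize a cluster without keeping its individual points, which is what lets the tree stay small.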
  • A CF-tree is a height-balanced tree with two
    parameters: branching factor (B for nonleaf nodes
    and L for leaf nodes) and threshold T.
  • The entry in each nonleaf node has the form
    [CFi, childi].
  • The entry in each leaf node is a CF; each leaf
    node has two pointers, 'prev' and 'next'.
  • The CF tree is basically a tree used to store all
    the clustering features.

CF Tree
[Figure: a CF tree, with a non-leaf node and leaf nodes]

BIRCH Clustering
  • Phase 1: scan the database to build an initial
    in-memory CF tree (a multi-level compression of
    the data that tries to preserve the inherent
    clustering structure of the data).
  • Phase 2: use an arbitrary clustering algorithm to
    cluster the leaf nodes of the CF tree.

BIRCH Algorithm Overview

Summary of BIRCH
  • Scales linearly: with a single scan you get a
    good clustering, and the quality of the clustering
    improves with a few additional scans.
  • It handles noise (data points that are not part
    of the underlying pattern) effectively.

Density-Based Clustering Methods
  • Clustering based on density, such as
    density-connected points, instead of distance.
  • Cluster: a set of density-connected points.
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • Need density parameters as a termination
    condition (when no new objects can be added to
    the cluster)
  • Examples
  • DBSCAN (Ester, et al. 1996)
  • OPTICS (Ankerst, et al. 1999)
  • DENCLUE (Hinneburg & D. Keim 1998)

Density-Based Clustering Background
  • Eps-neighborhood: the neighborhood within a
    radius Eps of a given object.
  • MinPts: minimum number of points in an
    Eps-neighborhood of that object.
  • Core object: if the Eps-neighborhood of an object
    contains at least a minimum number of points,
    MinPts, then the object is a core object.
  • Directly density-reachable: a point p is directly
    density-reachable from a point q wrt. Eps and
    MinPts if
  • 1) p is within the Eps-neighborhood of q, and
  • 2) q is a core object.

Figure showing the density reachability and
density connectivity in density based clustering
  • M, P, O, R and S are core objects since each is
    in an Eps neighborhood containing at least 3
    points.

MinPts = 3; Eps = radius of the circles
Directly density reachable
  • Q is directly density reachable from M. M is
    directly density reachable from P and vice versa.

Indirectly density reachable
  • Q is indirectly density reachable from P since Q
    is directly density reachable from M and M is
    directly density reachable from P. But P is not
    density reachable from Q since Q is not a core
    object.

Core, border, and noise points
  • DBSCAN is a density-based algorithm.
  • Density = number of points within a specified
    radius (Eps).
  • A point is a core point if it has at least a
    specified number of points (MinPts) within Eps.
    These are points that are at the interior of a
    cluster.
  • A border point has fewer than MinPts within Eps,
    but is in the neighborhood of a core point.
  • A noise point is any point that is neither a core
    point nor a border point.
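
The three point types can be illustrated with a small brute-force sketch. It assumes points are tuples of coordinates, uses Euclidean distance, and counts a point as a member of its own Eps-neighborhood; the function name `classify_points` is made up for this example.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    def neighbors(p):
        # Eps-neighborhood of p, including p itself.
        return [q for q in points if math.dist(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"   # not dense itself, but next to a core point
        else:
            labels[p] = "noise"
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

Here the four corners of the unit square are core points, (2, 1) is a border point reached from (1, 1), and (8, 8) is noise.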

DBSCAN (Density-Based Spatial Clustering of
Applications with Noise): The Algorithm
  • Arbitrarily select a point p.
  • Retrieve all points density-reachable from p wrt.
    Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed.
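
The steps above could look roughly like this in Python. It is a simplified, brute-force sketch (each neighborhood query scans every point) under the same assumptions as the definitions: Euclidean distance, and a point counted in its own Eps-neighborhood. The function name and the use of -1 for noise are choices made for this example.

```python
import math


def dbscan(points, eps, min_pts):
    """Return a dict mapping each point to a cluster id, or -1 for noise."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    labels = {}
    cluster_id = 0
    for p in points:
        if p in labels:
            continue                 # already processed
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1           # noise for now; may turn out to be a border point
            continue
        # p is a core point: grow a new cluster from it.
        labels[p] = cluster_id
        frontier = list(nbrs)
        while frontier:
            q = frontier.pop()
            if labels.get(q, -1) != -1:
                continue             # q already belongs to a cluster
            labels[q] = cluster_id
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:   # q is also core: keep expanding
                frontier.extend(q_nbrs)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (5, 5), (5, 6), (6, 5), (9, 9)]
labels = dbscan(points, eps=1.0, min_pts=3)
```

A point first marked as noise is relabeled when a later core point reaches it, which is how border points end up inside a cluster.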

  • We discussed two hierarchical clustering methods:
    agglomerative and divisive.
  • We also discussed BIRCH, a hierarchical
    clustering method that produces a good clustering
    in a single scan and a better clustering with a
    few additional scans.
  • DBSCAN is a density-based clustering algorithm
    that discovers clusters of arbitrary shape.
    Unlike the hierarchical methods, it groups points
    by density rather than by distance alone.

  • Grid-based clustering is the approach in which we
    quantize the space into a finite number of cells
    that form a grid structure, on which all of the
    operations for clustering are performed.
  • So, for example, assume that we have a set of
    records and we want to cluster with respect to
    two attributes. We then divide the related space
    (a plane) into a grid structure and find the
    clusters.

[Figure: "Our space is this plane", a two-attribute plane divided into a grid; one axis is labeled Salary (10,000)]

Techniques for Grid-Based Clustering
  • The following are some techniques that are used
    to perform Grid-Based Clustering
  • CLIQUE (CLustering In QUEst)
  • STING (STatistical INformation Grid)
  • WaveCluster

Looking at CLIQUE as an Example
  • CLIQUE is used for the clustering of
    high-dimensional data present in large tables.
    By high-dimensional data we mean records that
    have many attributes.
  • CLIQUE identifies the dense units in the
    subspaces of high-dimensional data space, and
    uses these subspaces to provide more efficient
    clustering.

Definitions That Need to Be Known
  • Unit: after forming a grid structure on the
    space, each rectangular cell is called a unit.
  • Dense: a unit is dense if the fraction of total
    data points contained in the unit exceeds the
    input model parameter.
  • Cluster: a cluster is defined as a maximal set of
    connected dense units.
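
The three definitions above can be made concrete with a small sketch. It is not CLIQUE itself, only the grid bookkeeping the definitions describe. The names `dense_units`, `connected_clusters` and the density threshold `tau` are hypothetical, and "connected" is taken here to mean that two units share a face.

```python
from collections import Counter


def dense_units(points, lows, highs, intervals, tau):
    """Return the grid units holding more than a fraction tau of the points."""
    def unit_of(p):
        # Index of the equal-length interval each coordinate falls into.
        return tuple(
            min(int((x - lo) * intervals / (hi - lo)), intervals - 1)
            for x, lo, hi in zip(p, lows, highs)
        )

    counts = Counter(unit_of(p) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}


def connected_clusters(units):
    """Group dense units that share a face into maximal connected sets."""
    remaining, clusters = set(units), []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            u = stack.pop()
            for axis in range(len(u)):
                for step in (-1, 1):
                    v = u[:axis] + (u[axis] + step,) + u[axis + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        cluster.add(v)
                        stack.append(v)
        clusters.append(cluster)
    return clusters


points = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.2),    # unit (0, 0)
          (1.1, 0.3), (1.5, 0.5), (1.9, 0.9),    # unit (1, 0)
          (3.2, 3.2), (3.5, 3.5), (3.9, 3.1),    # unit (3, 3)
          (2.5, 2.5)]                            # unit (2, 2), too sparse
dense = dense_units(points, lows=(0, 0), highs=(4, 4), intervals=4, tau=0.2)
clusters = connected_clusters(dense)
```

Units (0, 0) and (1, 0) are adjacent and form one cluster, while (3, 3) forms a second cluster on its own.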

How Does CLIQUE Work?
  • Let us say that we have a set of records that we
    would like to cluster in terms of n attributes.
  • So, we are dealing with an n-dimensional space.
  • CLIQUE partitions each subspace that has
    dimension 1 into the same number of equal-length
    intervals.
  • Using this as a basis, it partitions the
    n-dimensional data space into non-overlapping
    rectangular units.

CLIQUE Major Steps (Cont.)
  • Now CLIQUE's goal is to identify the dense
    n-dimensional units.
  • It does this in the following way:
  • CLIQUE finds dense units of higher dimensionality
    by finding the dense units in the subspaces.
  • So, for example, if we are dealing with a
    3-dimensional space, CLIQUE finds the dense units
    in the 3 related planes (2-dimensional
    subspaces).
  • It then intersects the extension of the subspaces
    representing the dense units to form a candidate
    search space in which dense units of higher
    dimensionality would exist.

CLIQUE Major Steps. (Cont.)
  • Each maximal set of connected dense units is
    considered a cluster.
  • Using this definition, the dense units in the
    subspaces are examined in order to find clusters
    in the subspaces.
  • The information of the subspaces is then used to
    find clusters in the n-dimensional space.
  • It must be noted that all cluster boundaries are
    either horizontal or vertical. This is due to the
    nature of the rectangular grid cells.

Example for CLIQUE
  • Let us say that we want to cluster a set of
    records that have three attributes, namely,
    salary, vacation and age.
  • The data space for this data would be
    3-dimensional.

Example (Cont.)
  • After plotting the data objects, each dimension,
    (i.e., salary, vacation and age) is split into
    intervals of equal length.
  • Then we form a 3-dimensional grid on the space,
    each unit of which would be a 3-D rectangle.
  • Now, our goal is to find the dense 3-D
    rectangular units.

Example (Cont.)
  • To do this, we find the dense units of the
    subspaces of this 3-d space.
  • So, we find the dense units with respect to
    salary and age. This means that we look at the
    salary-age plane and find all the 2-D rectangular
    units that are dense.
  • We also find the dense 2-D rectangular units for
    the vacation-age plane.

Example 1
Example (Cont.)
  • Now let us try to visualize the dense units of
    the two planes on the following 3-d figure

Example (Cont.)
  • We can extend the dense areas in the vacation-age
    plane inwards.
  • We can extend the dense areas in the salary-age
    plane upwards.
  • The intersection of these two spaces would give
    us a candidate search space in which
    3-dimensional dense units exist.
  • We then find the dense units in the
    salary-vacation plane and we form an extension of
    the subspace that represents these dense units.

Example (Cont.)
  • Now, we perform an intersection of the candidate
    search space with the extension of the dense
    units of the salary-vacation plane, in order to
    get all the 3-d dense units.
  • So, What was the main idea?
  • We used the dense units in subspaces in order to
    find the dense units in the 3-dimensional space.
  • After finding the dense units, it is very easy to
    find clusters.

Reflecting upon CLIQUE
  • Why does CLIQUE confine its search for dense
    units in high dimensions to the intersection of
    dense units in subspaces?
  • Because the Apriori property employs prior
    knowledge of the items in the search space so
    that portions of the space can be pruned.
  • The property for CLIQUE says that if a
    k-dimensional unit is dense then so are its
    projections in the (k-1) dimensional space.
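
The pruning idea can be sketched for the salary, vacation, age example: a 3-D unit is only worth checking if its projections onto all three planes are dense. The function name `candidate_units` and the integer interval indices are made up for this illustration. The property is necessary but not sufficient, so the surviving candidates would still have to be counted against the data.

```python
def candidate_units(dense_sv, dense_sa, dense_va):
    """3-D (salary, vacation, age) units whose three 2-D projections are dense.

    Each argument is a set of interval-index pairs for one plane:
    salary-vacation, salary-age and vacation-age.
    """
    return {
        (s, v, a)
        for (s, v) in dense_sv
        for (s2, a) in dense_sa
        if s2 == s and (v, a) in dense_va
    }


dense_sv = {(1, 1), (1, 2)}      # dense units in the salary-vacation plane
dense_sa = {(1, 3), (2, 3)}      # dense units in the salary-age plane
dense_va = {(1, 3)}              # dense units in the vacation-age plane
candidates = candidate_units(dense_sv, dense_sa, dense_va)
print(candidates)   # {(1, 1, 3)}
```

Every other 3-D unit is pruned without looking at the records, because at least one of its projections is not dense.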

Strength and Weakness of CLIQUE
  • Strength
  • It automatically finds subspaces of the highest
    dimensionality such that high density clusters
    exist in those subspaces.
  • It is quite efficient.
  • It is insensitive to the order of records in
    input and does not presume some canonical data
    distribution.
  • It scales linearly with the size of input and has
    good scalability as the number of dimensions in
    the data increases.
  • Weakness
  • The accuracy of the clustering result may be
    degraded at the expense of the simplicity of the
    method.

STING: A Statistical Information Grid Approach to
Spatial Data Mining
  • Paper by
  • Jiong Yang, Department of Computer Science,
    University of California, Los Angeles, CA 90095,
    U.S.A., jyang_at_cs.ucla.edu
  • Richard Muntz, Department of Computer Science,
    University of California, Los Angeles, CA 90095,
    U.S.A., muntz_at_cs.ucla.edu
  • Wei Wang, Department of Computer Science,
    University of California, Los Angeles, CA 90095,
    U.S.A., weiwang_at_cs.ucla.edu
  • VLDB Conference, Athens, Greece, 1997
Reference For Paper
  • http://georges.gardarin.free.fr/Cours_XMLDM_Maste

Definitions That Need to Be Known
  • Spatial Data
  • Data that have a spatial or location component.
  • These are objects that themselves are located in
    physical space.
  • Examples: my house, Lake Geneva, New York City.
  • Spatial Area
  • The area that encompasses the locations of all
    the spatial data is called the spatial area.

STING (Introduction)
  • STING is used for performing clustering on
    spatial data.
  • STING uses a hierarchical multi-resolution grid
    data structure to partition the spatial area.
  • STING's big benefit is that it efficiently
    processes many common region-oriented queries on
    a set of points.
  • We want to cluster the records that are in a
    spatial table in terms of location.
  • Placement of a record in a grid cell is
    completely determined by its physical location.

Hierarchical Structure of Each Grid Cell
  • The spatial area is divided into rectangular
    cells (using latitude and longitude).
  • Each cell forms a hierarchical structure.
  • This means that each cell at a higher level is
    further partitioned into 4 smaller cells in the
    lower level.
  • In other words, each cell at the ith level (except
    the leaves) has 4 children in the (i+1)th level.
  • The union of the 4 children cells gives back
    the parent cell in the level above them.

Hierarchical Structure of Cells (Cont.)
  • The size of the leaf-level cells and the number
    of layers depend upon how much granularity the
    user wants.
  • So, why do we have a hierarchical structure for
    cells?
  • We have them in order to provide a better
    granularity, or higher resolution.

[Figure: A Hierarchical Structure for STING Clustering]

Statistical Parameters Stored in each Cell
  • For each cell in each layer we have
    attribute-dependent and attribute-independent
    parameters.
  • Attribute-Independent Parameter
  • Count: number of records in this cell.
  • Attribute-Dependent Parameters
  • (We are assuming that our attribute values are
    real numbers.)

Statistical Parameters (Cont.)
  • For each attribute of each cell we store the
    following parameters:
  • M: the mean of all values of the attribute in
    this cell.
  • S: the standard deviation of all values of the
    attribute in this cell.
  • Min: the minimum value of the attribute in
    this cell.
  • Max: the maximum value of the attribute in
    this cell.
  • Distribution: the type of distribution that the
    attribute value in this cell follows (e.g.
    normal, exponential, etc.). None is assigned to
    Distribution if the distribution is unknown.

Storing of Statistical Parameters
  • Statistical information regarding the attributes
    in each grid cell, for each layer, is
    pre-computed and stored beforehand.
  • The statistical parameters for the cells in the
    lowest layer are computed directly from the values
    that are present in the table.
  • The statistical parameters for the cells in all
    the other levels are computed from their
    respective children cells that are in the lower
    level.
How are Queries Processed?
  • STING can answer many queries (especially region
    queries) efficiently, because we don't have to
    access the full database.
  • How are spatial data queries processed?
  • We use a top-down approach to answer spatial
    data queries.
  • Start from a pre-selected layer, typically one
    with a small number of cells.
  • The pre-selected layer does not have to be the
    topmost layer.
  • For each cell in the current layer, compute the
    confidence interval (or estimated range of
    probability) reflecting the cell's relevance to
    the given query.

Query Processing (Cont.)
  • The confidence interval is calculated by using
    the statistical parameters of each cell.
  • Remove irrelevant cells from further
    consideration.
  • When finished with the current layer, proceed to
    the next lower level.
  • Processing of the next lower level examines only
    the remaining relevant cells.
  • Repeat this process until the bottom layer is
    reached.
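
The top-down descent can be sketched as below. The slides do not spell out how the confidence interval is computed, so this example swaps in a much simpler relevance test (does the cell's stored min/max range overlap the queried range?) purely to show the layer-by-layer pruning. The `Cell` class, its fields and the `query` function are hypothetical, and the tree is assumed to have the same depth everywhere.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    lo: float                  # stored minimum of the attribute in this cell
    hi: float                  # stored maximum of the attribute in this cell
    children: list = field(default_factory=list)


def query(start_layer, is_relevant):
    """Descend layer by layer, examining only children of relevant cells."""
    layer = start_layer
    while True:
        relevant = [c for c in layer if is_relevant(c)]
        next_layer = [child for c in relevant for child in c.children]
        if not next_layer:     # bottom layer reached (or nothing relevant left)
            return relevant
        layer = next_layer


# Two top-level cells, each with two leaf cells; values are rents.
a = Cell("A", 700, 900, [Cell("a1", 700, 790), Cell("a2", 850, 900)])
b = Cell("B", 1100, 1500, [Cell("b1", 1200, 1500), Cell("b2", 1100, 1300)])

# Stand-in relevance test: the cell's rent range overlaps 800 to 1000.
hits = query([a, b], lambda c: c.lo <= 1000 and c.hi >= 800)
print([c.name for c in hits])   # ['a2']
```

Cell B is discarded at the first layer, so its children b1 and b2 are never examined.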

Different Grid Levels during Query Processing.
Sample Query Examples
  • Assume that the spatial area is the map of the
    regions of Long Island, Brooklyn and Queens.
  • Our records represent apartments that are
    present throughout the above region.
  • Query: find all the apartments that are for
    rent near Stony Brook University and that have a
    rent in the range of 800 to 1000.
  • The above query depends upon the parameter "near".
    For our example, "near" means within 15 miles of
    Stony Brook University.

Advantages and Disadvantages of STING
  • Advantages
  • Very efficient.
  • The computational complexity is O(k), where k is
    the number of grid cells at the lowest level.
  • Usually k << N, where N is the number of records.
  • STING is a query-independent approach, since
    statistical information exists independently of
    queries.
  • Incremental update.
  • Disadvantage
  • All cluster boundaries are either horizontal or
    vertical, and no diagonal boundary is selected.
  • Thank you !