1
Clustering
  • Dr. János Abonyi
  • University of Veszprem
  • abonyij@fmt.vein.hu
  • www.fmt.vein.hu/softcomp/dw

2
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Grouping a set of data objects into clusters
  • Clustering is unsupervised classification: no
    predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

3
Input Data for Clustering
  • A set of N points in an M-dimensional space, OR
  • A proximity matrix that gives the pairwise
    distance or similarity between points.
  • Can be viewed as a weighted graph (see the sketch
    below).
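
For concreteness, a minimal sketch (assuming NumPy and
SciPy; the four 2-D points are made up) of how the two
input forms relate: the N points can be converted into
an N x N proximity matrix, whose entries can be read as
edge weights of a graph.

    # Turning N points in M-dimensional space into an N x N proximity matrix.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.array([[1.0, 2.0],
                  [1.5, 1.8],
                  [8.0, 8.0],
                  [8.2, 7.9]])                    # N = 4 points, M = 2 dimensions

    D = squareform(pdist(X, metric='euclidean'))  # N x N pairwise distance matrix
    print(D)                                      # D[i, j] = weight of edge (i, j)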

4
Measures of Similarity
  • The first step in clustering raw data is to
    define some measure of similarity between two
    data items
  • That is, we need to know when two data items are
    close enough to be considered members of the same
    class
  • Different measures may produce entirely different
    clusters, so the measure selected must reflect
    the nature of the data

5
Similarity and Dissimilarity Between Objects
  • Distances are normally used to measure the
    similarity or dissimilarity between two data
    objects
  • Some popular ones include the Minkowski distance:
    d(i, j) = ( Σ_{k=1..p} |x_ik − x_jk|^q )^(1/q)
  • where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1,
    x_j2, ..., x_jp) are two p-dimensional data
    objects, and q is a positive integer
  • If q = 1, d is the Manhattan distance

6
Similarity and Dissimilarity Between Objects
(Cont.)
  • If q = 2, d is the Euclidean distance
  • Properties
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)
  • One can also use a weighted distance or other
    dissimilarity measures.
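
A minimal sketch of the Minkowski distance defined
above, assuming NumPy; the function name and test
vectors are illustrative, not from the slides. q = 1
gives the Manhattan distance and q = 2 the Euclidean
distance.

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional vectors, q >= 1."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

    x, y = [0.0, 3.0], [4.0, 0.0]
    print(minkowski(x, y, q=1))   # Manhattan distance: 7.0
    print(minkowski(x, y, q=2))   # Euclidean distance: 5.0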

7
Binary Variables
  • A contingency table for binary data (a = both 1,
    b = 1 only in object i, c = 1 only in object j,
    d = both 0)
  • Simple matching coefficient
  • Jaccard coefficient
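
A hedged sketch of both coefficients, assuming the
usual 2x2 contingency counts named above and treating
both coefficients as similarities; the function names
and test vectors are illustrative, not from the slides.

    import numpy as np

    def matching_counts(i, j):
        """Contingency counts for two binary vectors: a = 1/1, b = 1/0, c = 0/1, d = 0/0."""
        i, j = np.asarray(i), np.asarray(j)
        a = np.sum((i == 1) & (j == 1))
        b = np.sum((i == 1) & (j == 0))
        c = np.sum((i == 0) & (j == 1))
        d = np.sum((i == 0) & (j == 0))
        return a, b, c, d

    def simple_matching(i, j):
        a, b, c, d = matching_counts(i, j)
        return (a + d) / (a + b + c + d)   # symmetric: 0/0 matches also count

    def jaccard(i, j):
        a, b, c, d = matching_counts(i, j)
        return a / (a + b + c)             # asymmetric: 0/0 matches are ignored

    i = [1, 0, 1, 1, 0]
    j = [1, 1, 1, 0, 0]
    print(simple_matching(i, j))   # 0.6
    print(jaccard(i, j))           # 0.5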

8
Similarity Coefficients
9
Nominal Variables
  • A generalization of the binary variable in that
    it can take more than 2 states, e.g., red,
    yellow, blue, green
  • Method 1: Simple matching
  • m = number of matches, p = total number of
    variables: d(i, j) = (p − m) / p
  • Method 2: use a large number of binary variables,
  • creating a new binary variable for each of the M
    nominal states (see the sketch below)
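
A small plain-Python sketch of both methods; the
example objects, states and function names are made up
for illustration.

    def simple_matching_distance(i, j):
        """Method 1: d(i, j) = (p - m) / p, m = number of matching variables, p = total."""
        p = len(i)
        m = sum(1 for a, b in zip(i, j) if a == b)
        return (p - m) / p

    def one_hot(value, states):
        """Method 2: one binary variable per nominal state."""
        return [1 if value == s else 0 for s in states]

    states = ['red', 'yellow', 'blue', 'green']
    obj_i = ['red', 'blue']
    obj_j = ['red', 'green']
    print(simple_matching_distance(obj_i, obj_j))   # 0.5 (one of two variables matches)
    print(one_hot('blue', states))                  # [0, 0, 1, 0]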

10
Major Clustering Approaches
  • Partitioning algorithms: Construct various
    partitions and then evaluate them by some
    criterion
  • Hierarchy algorithms: Create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
  • Density-based: based on connectivity and density
    functions
  • Grid-based: based on a multiple-level granularity
    structure
  • Model-based: A model is hypothesized for each of
    the clusters, and the idea is to find the best fit
    of the data to the models

11
Types of Clustering: Partitional and Hierarchical
  • Partitional Clustering (K-means and K-medoid):
    finds a one-level partitioning of the data into K
    disjoint groups.
  • Hierarchical Clustering: finds a hierarchy of
    nested clusters (dendrogram).
  • May proceed either bottom-up (agglomerative) or
    top-down (divisive).
  • Uses a proximity matrix.
  • Can be viewed as operating on a proximity graph.

12
Hierarchical Clustering
  • Uses the distance matrix as the clustering
    criterion.
  • This method does not require the number of
    clusters k as an input, but needs a termination
    condition

13
AGNES (Agglomerative Nesting)
  • Implemented in statistical analysis packages
  • Use the Single-Link method and the dissimilarity
    matrix.
  • Merge nodes that have the least dissimilarity
  • Eventually all nodes belong to the same cluster

14
A Dendrogram Shows How the Clusters are Merged
Hierarchically
Decompose data objects into several levels of
nested partitioning (a tree of clusters), called a
dendrogram. A clustering of the data objects is
obtained by cutting the dendrogram at the desired
level; each connected component then forms a
cluster.
15
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters.
  • Determined by one pair of points, i.e., by one
    link in the proximity graph.
  • Can handle non-elliptical shapes.
  • Sensitive to noise and outliers.
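
A minimal sketch of single-link (MIN) agglomerative
clustering, as in AGNES, assuming SciPy and NumPy and
using made-up two-blob data; cutting the dendrogram
yields flat cluster labels.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from scipy.spatial.distance import pdist

    X = np.vstack([np.random.randn(10, 2),            # one blob near the origin
                   np.random.randn(10, 2) + [6, 6]])  # another blob shifted away

    Z = linkage(pdist(X), method='single')            # merge least-dissimilar clusters step by step
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
    print(labels)
    # dendrogram(Z)  # uncomment (with matplotlib installed) to draw the merge tree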

17
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters,
    it cannot be undone.
  • Does not scale well.
  • No objective function is directly minimized.
  • Different schemes have problems with one or more
    of the following:
  • Sensitivity to noise and outliers.
  • Difficulty handling different-sized clusters and
    non-convex shapes.
  • Breaking large clusters.

18
Partitioning Algorithms: Basic Concept
  • Partitioning method: Construct a partition of a
    database D of n objects into a set of k clusters
  • Given a k, find a partition of k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods: k-means and k-medoids
    algorithms
  • k-means (MacQueen, 1967): Each cluster is
    represented by the center of the cluster
  • k-medoids or PAM (Partition Around Medoids)
    (Kaufman & Rousseeuw, 1987): Each cluster is
    represented by one of the objects in the cluster

19
Clustering Heuristic
  • Our objective will be to look for k
    representative points for the clusters.
  • These points will be the cluster centers or
    means. They may not be part of the data set.
  • This gives rise to the famous k-means algorithm.

20
K-means algorithm
  • Given a set of d-dimensional points
  • Find k points m_1, ..., m_k
  • which minimize Σ_{i=1..k} Σ_{x ∈ P_i} ||x − m_i||²
  • where the sets P_i are disjoint and their union
    covers the entire data set.

21
K-means algorithm
  • Notice that once k centers are picked, they give
    rise to a natural partition of the entire data
    set: associate each data point with its nearest
    center.

22
K-means Clustering
  • Find a single partition of the data into K
    clusters such that the within-cluster error
    SSE = Σ_{i=1..K} Σ_{x ∈ C_i} ||x − c_i||²
    is minimized.
  • Basic K-means Algorithm (sketched in code below)
  • 1. Select K points as the initial centroids.
  • 2. Assign all points to the closest centroid.
  • 3. Recompute the centroids.
  • 4. Repeat steps 2 and 3 until the centroids don't
    change.
  • K-means is a gradient-descent algorithm that
    always converges - perhaps to a local minimum.
    (Cluster Analysis for Applications, Anderberg)
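
A minimal NumPy sketch of the basic algorithm above;
the function name, the random test data and the
handling of edge cases (ties, empty clusters) are
simplifications for illustration, not part of the
slides.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Select K points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Assign all points to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute the centroids as the cluster means.
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # 4. Repeat until the centroids don't change.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        sse = ((X - centroids[labels]) ** 2).sum()   # within-cluster error
        return centroids, labels, sse

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
    centroids, labels, sse = kmeans(X, k=2)
    print(centroids, sse)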

23
Example II. (figure: initial data and seeds, and the
final clustering)
24
Example III. (figure: initial data and seeds, and the
final clustering)
25
K-means: Initial Point Selection
  • A bad set of initial points gives a poor solution.
  • Random selection
  • Simple and efficient.
  • The initial points don't cover all clusters with
    high probability.
  • Many runs may be needed to find a good solution
    (see the sketch below).
  • Choose initial points from dense regions, so that
    the points are well-separated.
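
One common remedy for bad initial points, sketched here
with scikit-learn (an assumption, not mentioned in the
slides): run K-means several times from different seeds
and keep the run with the lowest within-cluster error.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])

    best = min(
        (KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X) for seed in range(10)),
        key=lambda km: km.inertia_,    # inertia_ = within-cluster sum of squared errors
    )
    print(best.inertia_, best.cluster_centers_)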

26
K-means: When to Update Centroids
  • Update the centroids only after all points are
    assigned to centers.
  • Update the centroids after each point assignment.
  • May adjust the relative weight of the point being
    added and the current center to speed
    convergence.
  • Possibility of better accuracy and faster
    convergence at the cost of more work.
  • Update issues are similar to those of updating
    weights for neural-nets using back-propagation.
    (Artificial Intelligence, Winston)

27
K-means: Pre- and Post-Processing
  • Outliers can dominate the clustering and, in some
    cases, are eliminated by preprocessing.
  • Post-processing attempts to "fix up" the
    clustering produced by the K-means algorithm.
  • Merge clusters that are close to each other.
  • Split "loose" clusters that contribute most to
    the error.
  • Permanently eliminate small clusters, since they
    may represent groups of outliers.
  • Approaches are based on heuristics and require
    the user to choose parameter values.

28
K-means: Time and Space Requirements
  • O(MN) space since it uses just the vectors, not
    the proximity matrix.
  • M is the number of attributes.
  • N is the number of points.
  • Also keep track of which cluster each point
    belongs to and the K cluster centers.
  • Time for basic K-means is O(TKMN),
  • T is the number of iterations. (T is often
    small, 5-10, and can easily be bounded, as few
    changes occur after the first few iterations).

29
  • Example 1: Relay stations for mobile phones
  • Optimal placement of relay stations = an optimal
    k-clustering!
  • Complications:
  • points correspond to phones
  • positions are not fixed
  • the number of patterns is not fixed
  • how to choose k?
  • the distance function is complicated: a 3D
    geographic model with mountains and buildings,
    shadowing, ...

30
  • Example 2: Placement of Warehouses for Goods
  • points correspond to customer locations
  • centroids correspond to locations of warehouses
  • distance function is delivery time from
    warehouse multiplied by number of trips, i.e.
    related to volume of delivered goods
  • multilevel clustering, e.g. for post office,
    train companies, airlines (which airports to
    choose as hubs), etc.

31
K-means: Determining the Number of Clusters
  • Mostly heuristic and domain-dependent
    approaches.
  • Plot the error for 2, 3, ... clusters and find the
    "knee" in the curve (see the sketch below).
  • Use domain-specific knowledge and inspect the
    clusters for desired characteristics.
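
A short sketch of the "knee" heuristic, assuming
scikit-learn and matplotlib and made-up three-blob
data: plot the within-cluster error against K and look
for the bend in the curve.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [6, 0], [3, 5])])

    ks = range(2, 9)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), sse, marker='o')
    plt.xlabel('number of clusters K')
    plt.ylabel('within-cluster error (SSE)')
    plt.show()   # the "knee" of the curve (here around K = 3) suggests the number of clusters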

32
K-means: Problems and Limitations
  • Based on minimizing within-cluster error, a
    criterion that is not appropriate for many
    situations.
  • Unsuitable when clusters have widely different
    sizes or have non-convex shapes.
  • Restricted to data in Euclidean spaces, but
    variants of K-means can be used for other types
    of data.

33
Feature Extraction
  • (Nonlinear) mapping of the input space into a
    lower-dimensional one
  • Reduction of the number of inputs
  • Useful for visualisation: non-parametric (Sammon
    projection) or model-based (principal curves,
    NN, Gaussian mixtures, SOM)

34
Brain's self-organization
The brain maps the external multidimensional
representation of the world into a similar 1- or
2-dimensional internal representation. That
is, the brain processes the external signals in a
topology-preserving way. Mimicking the way the
brain learns, our system should be able to do the
same thing.
35
Senso-motoric map
Visual signals are analyzed by maps coupled with
motor maps, providing senso-motoric responses.
Figure from P.S. Churchland, T.J.
Sejnowski, The computational brain. MIT Press,
1992
36
Somatosensoric and motor maps
37
Representation of fingers (figure: hand and face
areas of the somatosensory map)
38
Models of self-organization
  • SOM or SOFM (Self-Organizing Feature Map), one of
    the simplest models.

How can such maps develop spontaneously? Local
neural connections: neurons interact strongly
with those nearby, but weakly with those that are
far away (in addition inhibiting some intermediate
neurons).
History: von der Malsburg and Willshaw (1976),
competitive learning, Hebb mechanisms, "Mexican
hat" interactions, models of visual
systems. Amari (1980): models of continuous
neural tissue. Kohonen (1981): simplification,
no inhibition; leaving two essential factors,
competition and cooperation.
39
Concept of the SOM I.
(Figure) Input space = input layer; reduced feature
space = map layer. Cluster centers (code vectors)
in the input space are given places in the reduced
space: clustering and ordering of the cluster
centers on a two-dimensional grid.
40
Concept of the SOM II.
(Figure) The map can be used for visualization, for
classification, and for clustering.
41
Self-Organized Map idea
Data vectors X^T = (X_1, ..., X_d) from
d-dimensional space. Grid of nodes, with a local
processor (called a neuron) in each node. Local
processor j has d adaptive parameters W(j).
Goal: change the W(j) parameters to recover data
clusters in X space.
42
SOM algorithm competition
  • Nodes should calculate the similarity of input
    data to their parameters.
  • Input vector X is compared to the node parameters
    W.
  • Similar = minimal distance or maximal scalar
    product. Competition: find the node j = c with W
    most similar to X.

Node number c is most similar to the input vector
X. It is the winner, and it will learn to be more
similar to X; hence this is a competitive
learning procedure. Brain: those neurons that
react to some signal pick it up and learn.
43
SOM algorithm cooperation
Cooperation nodes on a grid close to the winner
c should behave similarly. Define the
neighborhood function O(c)
t iteration number (or time) rc position of
the winning node c (in physical space, usually
2D). r-rc distance from the winning node,
scaled by sc(t). h0(t) slowly decreasing
multiplicative factor The neighborhood function
determines how strongly the parameters of the
winning node and nodes in its neighborhood will
be changed, making them more similar to data X
44
SOM algorithm dynamics
Adaptation rule take the winner node c, and
those in its neighborhood O(rc), change their
parameters making them more similar to the data X
  • Select randomly new sample vector X, and repeat.
  • Decrease h0(t) slowly until there will be no
    changes.
  • Result
  • W(i) point to the center of local clusters in
    the X feature space
  • Nodes in the neighborhood point to adjacent
    areas in X space

45
SOM algorithm
  • X^T = (X_1, X_2, ..., X_d), samples from the
    feature space.
  • Create a grid with nodes i = 1 .. K in 1D, 2D or
    3D,
  • each node with a d-dimensional vector W(i)^T =
    (W_1(i), W_2(i), ..., W_d(i)),
  • W(i) = W(i)(t), changing with discrete time t.
  • Initialize with small random W(i)(0) for all
    i = 1...K. Define the parameters of the
    neighborhood function h(|r_i − r_c| / σ(t), t).
  • Iterate: select an input vector X at random.
  • Calculate the distances d(X, W(i)) and find the
    winner node W(c) most similar to (closest to) X.
  • Update the weights of all neurons in the
    neighborhood O(r_c).
  • Decrease the influence h_0(t) and shrink the
    neighborhood σ(t).
  • If in the last T steps all W(i) changed by less
    than ε, stop. (A NumPy sketch follows below.)
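
A compact NumPy sketch of the algorithm above for a 1-D
grid of K nodes; the decay schedules and hyperparameter
values are illustrative choices, not taken from the
slides.

    import numpy as np

    def train_som(X, K=10, T=2000, h0=0.5, sigma0=3.0, seed=0):
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.normal(scale=0.01, size=(K, d))      # small random W(i)(0), i = 1..K
        grid = np.arange(K, dtype=float)             # node positions r_i on a 1-D grid
        for t in range(T):
            h = h0 * np.exp(-t / T)                  # slowly decreasing factor h0(t)
            sigma = sigma0 * np.exp(-t / T)          # shrinking neighborhood width sigma(t)
            x = X[rng.integers(len(X))]              # select a random sample vector X
            c = np.argmin(np.linalg.norm(W - x, axis=1))   # competition: winner node c
            neigh = np.exp(-(grid - grid[c]) ** 2 / (2 * sigma ** 2))  # cooperation: O(c)
            W += h * neigh[:, None] * (x - W)        # adaptation: move W(i) toward X
        return W

    X = np.random.rand(500, 2)                       # toy 2-D data on the unit square
    W = train_som(X)
    print(W)                                         # code vectors ordered along the grid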

46
1D network, 2D data
(Figure) Positions in the feature space; processors
in a 1D array.
47
2D network, 3D data
48
Training process
Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
49
2D -> 2D, square
Initially all W ≈ 0, but over time they learn to
point to adjacent positions.
50
2D -> 1D in a triangle
The line in the data space forms a Peano curve,
an example of a fractal.
51
Italian olive oil
An example of SOM application
  • 572 samples of olive oil were collected from 9
    Italian provinces.
  • The content of 8 fats was determined for each
    oil.
  • SOM: a 20 x 20 network,
  • mapping 8D -> 2D.
  • Classification accuracy was around 95-97%.

Note that topographical relations are preserved;
region 3 is the most diverse.
52
Similarity of faces
300 faces, similarity matrix evaluated and Sammon
mapping applied (from Klock & Buhmann, 1997).
53
Some examples of real-life applications
  • Helsinki University of Technology web site
  • http://www.cis.hut.fi/research/refs/
  • has a list of > 5000 papers on SOM and its
    applications!
  • Brain research: modeling of the formation of
    various topographical maps in motor, auditory,
    visual and somatotopic areas.
  • AI and robotics: analysis of data from sensors,
    control of robot movement (motor maps), spatial
    orientation maps.
  • Information retrieval and text categorization.
  • Clustering of genes, protein properties,
    chemical compounds, speech phonemes, sounds of
    birds and insects, astronomical objects,
    economical data, business and financial data
    ....
  • Data compression (images and audio), information
    filtering.
  • Medical and technical diagnostics.

54
More examples
  • Natural language processing: linguistic analysis,
    parsing, learning languages, hyphenation
    patterns.
  • Optimization: configuration of telephone
    connections, VLSI design, time series prediction,
    scheduling algorithms.
  • Signal processing: adaptive filters, real-time
    signal analysis, radar, sonar, seismic, USG, EKG,
    EEG and other medical signals ...
  • Image recognition and processing: segmentation,
    object recognition, texture recognition ...
  • Content-based retrieval: examples include WebSOM,
    Cartia, Visier
  • PicSOM: similarity-based image retrieval.
  • http://www.ntu.edu.sg/home/aswduch/CI.html#SOM

55
Quality of life data
  • WorldBank data, 1992; 39 quality-of-life
    indicators.
  • SOM map and the same colors on the world map.
  • More examples of business applications at
    http://www.eudaptics.com/

56
Semantic maps
  • How to capture the meaning of words and semantic
    relations?
  • 16 animals: pigeon, chicken, duck, goose, owl,
    hawk, eagle, fox, dog, wolf, cat, tiger, lion,
    horse, zebra, cow.
  • Use 13 binary features: is small, medium, large;
    has 2 legs, 4 legs, hair, hoofs, mane, feathers;
    hunts, runs, flies, swims.
  • Form 76 sentences that describe the 16 animals
    using the 13 features:
  • "horse runs", "horse has 4 legs", "horse is big",
    ... "eagle flies", "fox hunts", ...
  • Assign a vector of properties to each animal:
  • V(horse) = (small=0, medium=0, large=1, has 2
    legs=0, has 4 legs=1, ...)
  •          = (0,0,1,0,1,1,1,1,0,0,1,0,0)
  • Map these 13D vectors into 2D

57
Semantic maps: MDS and SOM
58
SOM software
  • A number of free programs for SOM have been
    written.
  • The best visualization is offered by the free
    Viscovery viewer: http://www.eudaptics.com/
    It can be used with the free SOM_PAK software from
    http://www.cis.hut.fi/research/som_lvq_pak.shtml

59
World map of clinkers I.
We can use it for visualization
We can use it for correlation hunting
60
World map of clinkers II.
61
Density-Based Clustering Methods
  • Clustering based on density (a local cluster
    criterion), such as density-connected points
  • Major features:
  • Discovers clusters of arbitrary shape
  • Handles noise
  • One scan
  • Needs density parameters as a termination
    condition
  • Several interesting studies:
  • DBSCAN: Ester et al. (KDD '96), sketched below
  • OPTICS: Ankerst et al. (SIGMOD '99)
  • DENCLUE: Hinneburg & Keim (KDD '98)
  • CLIQUE: Agrawal et al. (SIGMOD '98)
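
A brief sketch of DBSCAN as implemented in scikit-learn
(an assumption; the algorithm itself is from Ester et
al., KDD '96), run on made-up ring-and-blob data to
show arbitrary-shape clusters and noise labeling; eps
and min_samples are the density parameters mentioned
above.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    ring = np.c_[np.cos(theta), np.sin(theta)] * 5 + rng.normal(scale=0.2, size=(200, 2))
    blob = rng.normal(scale=0.5, size=(100, 2))
    X = np.vstack([ring, blob])                      # a ring around a blob: non-convex shapes

    labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)
    print(set(labels))                               # cluster ids; -1 marks noise points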

62
Requirements of Clustering in Data Mining
  • Scalability
  • Ability to deal with different types of
    attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Ability to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

63
Clustering Summary
  • Clustering is an old and multidisciplinary area.
  • New challenges are related to new or newly
    important kinds of data:
  • Noisy
  • Large
  • High-Dimensional
  • New Kinds of Similarity Measures (non-metric)
  • Clusters of Variable Size and Density
  • Arbitrary Cluster Shapes (non-globular)
  • Many and Mixed Attribute Types (temporal,
    continuous, categorical)
  • New data mining approaches and algorithms are
    being developed that may be more suitable for
    these problems.