Cluster Analysis (cont.) Pertemuan 12

1
(No Transcript)
2
Cluster Analysis (cont.) Pertemuan 12
Course: M0614 / Data Mining & OLAP
Year: Feb 2010
3
Learning Outcomes
  • By the end of this session, students are
    expected to be able to:
  • Apply partitioning, hierarchical, and model-based
    clustering analysis techniques in data mining. (C3)

4
Acknowledgments
  • These slides have been adapted from Han, J.,
    Kamber, M., and Pei, J., Data Mining: Concepts and
    Techniques, and Tan, P.-N., Steinbach, M., and
    Kumar, V., Introduction to Data Mining.

5
Outline Materi
  • A categorization of major clustering methods:
    hierarchical methods
  • A categorization of major clustering methods:
    model-based clustering methods
  • Summary

6
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

7
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level (see
    the sketch below)
  • They may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., the
    animal kingdom, phylogeny reconstruction)
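A minimal sketch of cutting a dendrogram at different levels (Python with SciPy; the toy data and linkage method are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)            # toy data: 20 points in 2-D
    Z = linkage(X, method="average")     # build the merge hierarchy once
    labels_k4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
    labels_k2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters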

8
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

9
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

10
AGNES (Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical packages, e.g., Splus
  • Use the Single-Link method and the dissimilarity
    matrix
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

11
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms (a from-scratch sketch of the basic
    loop follows below)
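A from-scratch sketch of this basic loop, assuming single-link (MIN) proximity and Euclidean distance (toy data; names are illustrative):

    import numpy as np

    def agglomerative(X, k):
        # Let each data point be a cluster.
        clusters = [[i] for i in range(len(X))]
        # Compute the proximity matrix (pairwise point distances).
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # Repeat: merge the two closest clusters until only k remain.
        while len(clusters) > k:
            best = (0, 1, np.inf)
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # Single link: proximity = closest pair of points.
                    d = D[np.ix_(clusters[a], clusters[b])].min()
                    if d < best[2]:
                        best = (a, b, d)
            a, b, _ = best
            clusters[a] += clusters.pop(b)   # merge b into a
        return clusters

    X = np.random.rand(12, 2)
    print(agglomerative(X, k=3))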

12
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

[Figure: individual points and the initial proximity matrix]
13
Intermediate Situation
  • After some merging steps, we have some clusters

[Figure: clusters C1–C5 and the current proximity matrix]
14
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

[Figure: clusters C1–C5; C2 and C5 are the closest pair in the proximity matrix]
15
After Merging
  • After merging C2 and C5, the question is: how do
    we update the proximity matrix? (a single-link
    update is sketched below)

[Figure: proximity matrix over C1, C2 ∪ C5, C3, and C4; entries in the C2 ∪ C5 row and column are marked "?"]
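For single link, the update is simple: the merged cluster's distance to every other cluster is the element-wise minimum of the two old rows (complete link would use the maximum). A hedged NumPy sketch:

    import numpy as np

    def merge_update(D, i, j):
        # Merge cluster j into cluster i in proximity matrix D (single
        # link); pass i < j so i's index is unchanged after deletion.
        D[i, :] = np.minimum(D[i, :], D[j, :])   # distances to C_i ∪ C_j
        D[:, i] = D[i, :]                        # keep D symmetric
        D[i, i] = 0.0
        # Drop row and column j now that C_j no longer exists.
        return np.delete(np.delete(D, j, axis=0), j, axis=1)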
16
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function,
    e.g., Ward's method (uses squared error)

[Figure: two clusters and their proximity matrix]
21
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters
  • Determined by one pair of points, i.e., by one
    link in the proximity graph (formal definition
    below)
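In standard notation, with dist(p, q) the point-level distance:

    d_{\min}(C_i, C_j) \;=\; \min_{p \in C_i,\, q \in C_j} \mathrm{dist}(p, q)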

22
Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram]
23
Strength of MIN
[Figure: original points and the resulting clusters]
  • Can handle non-elliptical shapes

24
Limitations of MIN
[Figure: original points and the resulting clusters]
  • Sensitive to noise and outliers

25
Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Determined by all pairs of points in the two
    clusters (formal definition below)
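In standard notation:

    d_{\max}(C_i, C_j) \;=\; \max_{p \in C_i,\, q \in C_j} \mathrm{dist}(p, q)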

26
Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]
27
Strength of MAX
[Figure: original points and the resulting clusters]
  • Less susceptible to noise and outliers

28
Limitations of MAX
[Figure: original points and the resulting clusters]
  • Tends to break large clusters
  • Biased towards globular clusters

29
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of
    pairwise proximity between points in the two
    clusters (formal definition below)
  • Need to use average connectivity for scalability,
    since total proximity favors large clusters
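In standard notation:

    d_{\mathrm{avg}}(C_i, C_j) \;=\; \frac{1}{|C_i|\,|C_j|}
        \sum_{p \in C_i} \sum_{q \in C_j} \mathrm{dist}(p, q)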

30
Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]
31
Hierarchical Clustering: Group Average
  • Compromise between Single and Complete Link
  • Strengths
  • Less susceptible to noise and outliers
  • Limitations
  • Biased towards globular clusters

32
Hierarchical Clustering: Comparison
[Figure: the same data set clustered with MIN, MAX, and Group Average]
33
Hierarchical Clustering: Problems and Limitations
  • Once a decision is made to combine two clusters,
    it cannot be undone
  • No objective function is directly minimized
  • Different schemes have problems with one or more
    of the following
  • Sensitivity to noise and outliers
  • Difficulty handling different-sized clusters and
    convex shapes
  • Breaking large clusters

34
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

35
MST Divisive Hierarchical Clustering
  • Build MST (Minimum Spanning Tree)
  • Start with a tree that consists of any point
  • In successive steps, look for the closest pair of
    points (p, q) such that one point (p) is in the
    current tree but the other (q) is not
  • Add q to the tree and put an edge between p and q
    (a sketch of this procedure follows below)
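A from-scratch sketch of this MST construction (Prim's algorithm; toy data, written for clarity rather than efficiency):

    import numpy as np

    def mst_edges(X):
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        in_tree = {0}                  # start with a tree of any one point
        edges = []
        while len(in_tree) < n:
            # Closest pair (p, q): p in the current tree, q outside it.
            best = (None, None, np.inf)
            for p in in_tree:
                for q in range(n):
                    if q not in in_tree and D[p, q] < best[2]:
                        best = (p, q, D[p, q])
            p, q, d = best
            edges.append((p, q, d))    # add q and the edge between p and q
            in_tree.add(q)
        return edges

    # Divisive step (sketch): repeatedly delete the longest remaining MST
    # edge; the resulting connected components are the clusters.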

36
MST Divisive Hierarchical Clustering
  • Use MST for constructing hierarchy of clusters

37
Extensions to Hierarchical Clustering
  • Major weaknesses of agglomerative clustering
    methods
  • Do not scale well: time complexity of at least
    O(n²), where n is the total number of objects
  • Can never undo what was done previously
  • Integration of hierarchical and distance-based
    clustering
  • BIRCH (1996): uses a CF-tree and incrementally
    adjusts the quality of sub-clusters (a sketch
    follows below)
  • ROCK (1999): clusters categorical data by
    neighbor and link analysis
  • CHAMELEON (1999): hierarchical clustering using
    dynamic modeling
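As one concrete example of these extensions, scikit-learn ships a BIRCH implementation; a minimal sketch with illustrative (untuned) parameter values:

    import numpy as np
    from sklearn.cluster import Birch

    X = np.random.rand(1000, 2)
    model = Birch(threshold=0.1, n_clusters=5)  # CF-tree, then final clustering
    labels = model.fit_predict(X)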

38
Model-Based Clustering
  • What is model-based clustering?
  • Attempt to optimize the fit between the given
    data and some mathematical model
  • Based on the assumption that data are generated
    by a mixture of underlying probability
    distributions
  • Typical methods
  • Statistical approach
  • EM (Expectation maximization), AutoClass
  • Machine learning approach
  • COBWEB, CLASSIT
  • Neural network approach
  • SOM (Self-Organizing Feature Map)

39
EM: Expectation Maximization
  • EM: a popular iterative refinement algorithm
  • An extension to k-means
  • Assigns each object to a cluster according to a
    weight (probability distribution)
  • New means are computed based on weighted measures
  • General idea
  • Starts with an initial estimate of the parameter
    vector
  • Iteratively rescores the patterns against the
    mixture density produced by the parameter vector
  • The rescored patterns are then used to update the
    parameter estimates
  • Patterns belong to the same cluster if their
    scores place them in the same mixture component
  • The algorithm converges quickly but may not reach
    the global optimum (a soft-assignment sketch
    follows below)
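A sketch of this soft assignment using scikit-learn's GaussianMixture, which runs EM internally (toy data; parameter values are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(500, 2)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    weights = gmm.predict_proba(X)  # per-cluster weights (soft assignment)
    labels = gmm.predict(X)         # hard assignment, as in k-means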

40
The EM (Expectation Maximization) Algorithm
  • Initially, randomly assign k cluster centers
  • Iteratively refine the clusters based on two
    steps
  • Expectation step: assign each data point Xi to
    each cluster Ck with the probability given below
  • Maximization step
  • Estimation of the model parameters
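The formulas on this slide are not preserved in the transcript; standard Gaussian-mixture forms of the two steps (an assumption, not the original figures) are:

    % E-step: probability that point x_i belongs to cluster C_k
    P(x_i \in C_k) \;=\; p(C_k \mid x_i)
                   \;=\; \frac{p(C_k)\, p(x_i \mid C_k)}{p(x_i)}

    % M-step: re-estimate each cluster mean from the soft assignments
    m_k \;=\; \frac{\sum_{i=1}^{N} P(x_i \in C_k)\, x_i}
                   {\sum_{i=1}^{N} P(x_i \in C_k)}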

41
Conceptual Clustering
  • Conceptual clustering
  • A form of clustering in machine learning
  • Produces a classification scheme for a set of
    unlabeled objects
  • Finds characteristic description for each concept
    (class)
  • COBWEB
  • A popular and simple method of incremental
    conceptual learning
  • Creates a hierarchical clustering in the form of
    a classification tree
  • Each node refers to a concept and contains a
    probabilistic description of that concept

42
COBWEB Clustering Method
[Figure: a classification tree]
43
More on Conceptual Clustering
  • Limitations of COBWEB
  • The assumption that the attributes are
    independent of each other is often too strong
    because correlation may exist
  • Not suitable for clustering large databases:
    skewed trees and expensive probability
    distributions
  • CLASSIT
  • an extension of COBWEB for incremental clustering
    of continuous data
  • suffers similar problems as COBWEB
  • AutoClass
  • Uses Bayesian statistical analysis to estimate
    the number of clusters
  • Popular in industry

44
Neural Network Approach
  • Neural network approaches
  • Represent each cluster as an exemplar, acting as
    a prototype of the cluster
  • New objects are distributed to the cluster whose
    exemplar is the most similar according to some
    distance measure
  • Typical methods
  • SOM (Self-Organizing Feature Map)
  • Competitive learning
  • Involves a hierarchical architecture of several
    units (neurons)
  • Neurons compete in a winner-takes-all fashion
    for the object currently being presented

45
Self-Organizing Feature Map (SOM)
  • SOMs, also called topologically ordered maps or
    Kohonen Self-Organizing Feature Maps (KSOMs)
  • Maps all the points in a high-dimensional source
    space onto a 2- or 3-D target space, such that
    distance and proximity relationships (i.e.,
    topology) are preserved as much as possible
  • Similar to k-means: cluster centers tend to lie
    in a low-dimensional manifold in the feature
    space
  • Clustering is performed by having several units
    compete for the current object
  • The unit whose weight vector is closest to the
    current object wins
  • The winner and its neighbors learn by having
    their weights adjusted (a sketch of this update
    follows below)
  • SOMs are believed to resemble processing that can
    occur in the brain
  • Useful for visualizing high-dimensional data in
    2- or 3-D space
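A from-scratch sketch of the winner-takes-all update (map size, learning rate, and neighborhood width are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    grid = 10
    W = rng.random((grid, grid, 3))   # one weight vector per map unit
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)

    def som_step(W, x, lr=0.1, sigma=2.0):
        # Winner: the unit whose weight vector is closest to the object x.
        d = np.linalg.norm(W - x, axis=-1)
        winner = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighborhood: units near the winner learn more.
        g = np.linalg.norm(coords - np.array(winner), axis=-1)
        h = np.exp(-(g ** 2) / (2 * sigma ** 2))[..., None]
        # The winner and its neighbors move toward the input.
        return W + lr * h * (x - W)

    for x in rng.random((1000, 3)):   # stream of objects being presented
        W = som_step(W, x)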

46
Web Document Clustering Using SOM
  • The result of SOM clustering of 12,088 Web
    articles
  • The picture on the right: drilling down on the
    keyword "mining"
  • Based on the websom.hut.fi Web page

47
User-Guided Clustering
  • The user usually has a clustering goal, e.g.,
    clustering students by research area
  • The user specifies this clustering goal to
    CrossClus

48
Comparing with Classification
  • A user-specified feature (in the form of an
    attribute) is used as a hint, not as class labels
  • The attribute may contain too many or too few
    distinct values, e.g., a user may want to cluster
    students into 20 clusters instead of 3
  • Additional features need to be included in
    cluster analysis

[Figure: the user hint (one attribute) highlighted among all tuples for clustering]
49
Comparing with Semi-Supervised Clustering
  • Semi-supervised clustering: the user provides a
    training set consisting of similar (must-link)
    and dissimilar (cannot-link) pairs of objects
  • User-guided clustering: the user specifies an
    attribute as a hint, and more relevant features
    are found for clustering

[Figure: user-guided clustering (attribute hint) vs. semi-supervised clustering (must-link and cannot-link pairs), each applied to all tuples for clustering]
50
Why Not Semi-Supervised Clustering?
  • Much information (in multiple relations) is
    needed to judge whether two tuples are similar
  • A user may not be able to provide a good training
    set
  • It is much easier for a user to specify an
    attribute as a hint, such as a student's research
    area

Tuples to be compared:
  Tom Smith   SC1211  TA
  Jane Chang  BI205   RA
51
CrossClus: An Overview
  • Measure similarity between features by how they
    group objects into clusters
  • Use a heuristic method to search for pertinent
    features
  • Start from user-specified feature and gradually
    expand search range
  • Use tuple ID propagation to create feature values
  • Features can be easily created during the
    expansion of search range, by propagating IDs
  • Explore three clustering algorithms: k-means,
    k-medoids, and hierarchical clustering

52
Multi-Relational Features
  • A multi-relational feature is defined by
  • A join path, e.g., Student → Register →
    OpenCourse → Course
  • An attribute, e.g., Course.area
  • (For a numerical feature) an aggregation
    operator, e.g., sum or average
  • Categorical feature f = [Student → Register →
    OpenCourse → Course, Course.area, null]

Areas of courses of each student:

  Tuple   DB   AI   TH
  t1       5    5    0
  t2       0    3    7
  t3       1    5    4
  t4       5    0    5
  t5       3    3    4

Values of feature f:

  Tuple   DB   AI   TH
  t1     0.5  0.5    0
  t2       0  0.3  0.7
  t3     0.1  0.5  0.4
  t4     0.5    0  0.5
  t5     0.3  0.3  0.4
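Feature f's values are each tuple's distribution over the attribute values, i.e., the counts above normalized to proportions. A minimal sketch (Python/NumPy; array names are illustrative):

    import numpy as np

    # Areas of courses per student (rows t1..t5; columns DB, AI, TH).
    counts = np.array([[5, 5, 0],
                       [0, 3, 7],
                       [1, 5, 4],
                       [5, 0, 5],
                       [3, 3, 4]], dtype=float)

    # Values of feature f: normalize each row to sum to 1.
    f = counts / counts.sum(axis=1, keepdims=True)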
53
Representing Features
  • Similarity between tuples t1 and t2 w.r.t. a
    categorical feature f
  • Cosine similarity between vectors f(t1) and f(t2)
    (a standard form is given below)
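A standard form of this cosine similarity (assumed; the slide's own formula is an image):

    \mathrm{sim}_f(t_1, t_2) \;=\;
        \frac{f(t_1) \cdot f(t_2)}{\lVert f(t_1) \rVert \,\lVert f(t_2) \rVert}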

Similarity vector Vf
  • The most important information of a feature f is
    how f groups tuples into clusters
  • f is represented by the similarities between
    every pair of tuples indicated by f
  • The horizontal axes are the tuple indices, and
    the vertical axis is the similarity
  • This can be considered a vector of N × N
    dimensions

54
Similarity Between Features
Values of features f and g:

  Tuple   Feature f (course)     Feature g (group)
          DB    AI    TH         Info sys  Cog sci  Theory
  t1      0.5   0.5   0          1         0        0
  t2      0     0.3   0.7        0         0        1
  t3      0.1   0.5   0.4        0         0.5      0.5
  t4      0.5   0     0.5        0.5       0        0.5
  t5      0.3   0.3   0.4        0.5       0.5      0

Similarity between the two features: the cosine
similarity of their vectors Vf and Vg
55
Computing Feature Similarity
Similarity between feature values w.r.t. the tuples:

  sim(f_k, g_q) = Σ_{i=1..N} f(t_i).p_k · g(t_i).p_q

[Figure: tuple similarities (hard to compute) linked to similarities between the values of f (DB, AI, TH) and g (Info sys, Cog sci, Theory) (easy to compute)]

Compute the similarity between each pair of feature
values with one scan over the data (sketched below).
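Under the formula above, all value-pair similarities can be computed at once as a matrix product over one pass of the data. A hedged sketch (Python/NumPy), reusing the tables from the previous slide:

    import numpy as np

    # Rows t1..t5; columns DB, AI, TH.
    f = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.3, 0.7],
                  [0.1, 0.5, 0.4],
                  [0.5, 0.0, 0.5],
                  [0.3, 0.3, 0.4]])
    # Rows t1..t5; columns Info sys, Cog sci, Theory.
    g = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])

    # sim(f_k, g_q) = sum_i f(t_i).p_k * g(t_i).p_q, for all (k, q) at once.
    sim = f.T @ g   # 3 x 3 matrix of feature-value similarities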
56
Searching for Pertinent Features
  • Different features convey different aspects of
    information
  • Features conveying the same aspect of information
    usually cluster tuples in more similar ways
  • e.g., research group areas vs. conferences of
    publications
  • Given a user-specified feature
  • Find pertinent features by computing feature
    similarity

57
Heuristic Search for Pertinent Features
  • Overall procedure
  • 1. Start from the user-specified feature
  • 2. Search in neighborhood of existing pertinent
    features
  • 3. Expand search range gradually

[Figure: the search range expands in steps (1, 2) from the user hint toward the target of clustering]
  • Tuple ID propagation is used to create
    multi-relational features
  • IDs of target tuples can be propagated along any
    join path, from which we can find tuples joinable
    with each target tuple

58
Summary
  • Cluster analysis groups objects based on their
    similarity and has wide applications
  • Measure of similarity can be computed for various
    types of data
  • Clustering algorithms can be categorized into
    partitioning methods, hierarchical methods,
    density-based methods, grid-based methods, and
    model-based methods
  • There are still lots of research issues on
    cluster analysis

59
  • Continued in Pertemuan 13
  • Applications and Trends in Data Mining