David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1
Data Mining (and Machine Learning)
  • DM Lecture 5: Clustering

2
Today
  • (Unsupervised) clustering
  • What and why
    • A good first step towards understanding your data
    • Discover patterns and structure in data, which then guide further data mining
    • Helps to spot problems and outliers
    • Identifies market segments, e.g. specific types of customers or users; this is called segmentation or market segmentation
  • How to do it
    • Choose a distance measure or a similarity measure
    • Run a (usually) simple algorithm
    • We will cover the two main algorithms

3
What is Clustering? Why do it?
Subscriber   Calls per day   Monthly bill
     1              1               3
     2              4               7
     3              4               8
     4              3               5
     5              6               1
     6              9               3
     7              3               5
     8              7               2
     9              4               7
    10              6               3
    11              2               5
    12              8               4
    13              5               3
    14              6               2
    15              2               4
    16              3               6
    17              8               3
Consider these data (made up): maybe they are 17 subscribers to a mobile phone services company, and they show the mean calls per day and the mean monthly bill for each customer. Do you spot any patterns or structure?
4
What is Clustering? Why do it?
Here is a plot of the data, with calls as X and bills as Y.
Now do you spot any patterns or structure?
5
What is Clustering? Why do it?
Clearly there are two clusters -- two distinct types of customer. Top left: few calls but highish bills; bottom right: many calls, low bills.
6

So, clustering is all about plotting/visualising and noting distinct groups by eye, right?
  • Not really, because
  • We can only spot patterns by eye (i.e. with our brains) if the data is 1D, 2D or 3D. Most data of interest is much higher dimensional, e.g. 10D, 20D, 1000D.
  • Sometimes the clusters are not so obvious as a bunch of data all in the same place; we will see examples.
  • So we need automated algorithms which can do what you just did (find distinct groups in the data), but which can do this for any number of dimensions, and for perhaps more complex kinds of groups.

7

OK, so when we apply an automated clustering algorithm to data, the result is a collection of groups, which tells us potentially useful things about our data. E.g. if we are doing this with supermarket baskets, each group is a collection of typical baskets, which may relate to "general housekeeping", "late night dinner", "quick lunchtime shopper", and perhaps other types that we are not expecting.

Yes
8
Quality of a clustering?
Why is this better than this? (the slide compares two alternative clusterings of the same data)
9
Quality of a clustering?
  • A good clustering has the following properties:
  • Items in the same cluster tend to be close to each other
  • Items in different clusters tend to be far from each other

It is not hard to come up with a metric -- an easily calculated value -- that can be used to give a score to any clustering. There are many such metrics, e.g.:

S = the mean distance between pairs of items in the same cluster
D = the mean distance between pairs of items in different clusters

A measure of cluster quality is D/S -- the higher, the better.
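To make this concrete, here is a minimal Python sketch of the D/S score. The helper names (euclidean, cluster_quality) are made up for illustration; the example points are a subset of the subscriber data from slide 3, with made-up cluster labels.

  from itertools import combinations
  import math

  def euclidean(p, q):
      # straight-line distance between two points (any number of dimensions)
      return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

  def cluster_quality(points, labels):
      # S: mean distance over pairs of items in the same cluster
      # D: mean distance over pairs of items in different clusters
      same, diff = [], []
      for (i, p), (j, q) in combinations(enumerate(points), 2):
          (same if labels[i] == labels[j] else diff).append(euclidean(p, q))
      S = sum(same) / len(same)
      D = sum(diff) / len(diff)
      return D / S   # higher is better, by this measure

  # (calls per day, monthly bill) for eight of the subscribers, in two clumps
  points = [(1, 3), (4, 7), (4, 8), (3, 5), (6, 1), (9, 3), (7, 2), (8, 4)]
  labels = [0, 0, 0, 0, 1, 1, 1, 1]
  print(cluster_quality(points, labels))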
10
Let's try that

Points A to H, as plotted on the slide, grouped into two clusters: {A, B, D, F, H} and {C, E, G}.

S = (AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG) / 13 = 44/13 ≈ 3.38
D = (AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG) / 15 = 40/15 ≈ 2.67
Cluster quality D/S ≈ 0.79
11
Let's try that again

Same points, now grouped into {A, B, C, D} and {E, F, G, H}.

S = (AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH) / 12 = 20/12 ≈ 1.67
D = (AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH) / 16 = 68/16 = 4.25
Cluster quality D/S ≈ 2.55
12
But what about this?

Same points, now grouped into three clusters: {A, B}, {C, D} and {E, F, G, H}.

S = (AB + CD + EF + EG + EH + FG + FH + GH) / 8 = 12/8 = 1.5
D = (AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH) / 20 = 72/20 = 3.6
Cluster quality D/S = 2.40
13
Some important notes
  • There is usually no correct clustering.
  • Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure -- the result of the clustering process will depend on the chosen distance measure.
  • Choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data.
  • Our D/S measure for cluster quality will not work well in lots of cases.


14
Examples: sometimes groups are not simple to spot, even in 2D
Slide credit: Julia Handl
15
Examples: sometimes groups are not simple to spot, even in 2D
Slide credit: Julia Handl
16
Brain Training
Think about why D/S is not a useful cluster quality measure in the general case. Try to design a cluster quality metric that will work well in the cases of the previous slides (not very difficult).
17
In many problems the clusters are more conventional, but maybe fuzzy and unclear
Slide credit: Julia Handl
18
And there is a different kind of clustering that can be done, which avoids the issue of deciding the number of clusters in advance
Slide credit: Elias Raftopoulos and Prof. Maria Papadopouli
19
How to do it
  • The most commonly used methods
  • K-Means
  • Hierarchical Agglomerative Clustering

20
K-Means
  • If you want to see K clusters, then run K-means. I.e. you need to choose the number of clusters in advance. Say K = 3: run 3-means and the result is a good grouping of the data into 3 clusters.
  • It works by generating K points (in a way, these are made-up records in the data); each point is the centre (or centroid) of one cluster. As the algorithm iterates, the points adjust their positions until they stabilise.
  • Very simple, fairly fast, very common; a few drawbacks.

21
Let's see it
22
Here is the data; we choose k = 2 and run 2-means
23
We choose two cluster centres -- randomly
24
Step 1: decide which cluster each point is in (the one whose centre is closest)
25
Step 2: we now have two clusters; recompute the centre of each cluster
26
These are the new centres
27
Step 1: decide which cluster each point is in (the one whose centre is closest)
28
This one has to be reassigned
29
Step 2: we now have two new clusters; recompute the centre of each cluster
30
Centres now slightly moved
31
Step 1: decide which cluster each point is in (the one whose centre is closest)
32
In this case, nothing gets reassigned to a new cluster, so the algorithm is finished
33
The K-Means Algorithm
  • Choose k.
  • Step 1: randomly choose k points, labelled 1, 2, ..., k, to be the initial cluster centroids.
  • Step 2: for each datum, let its cluster ID be the label of its closest centroid.
  • Step 3: for each cluster, recalculate its actual centre.
  • Go back to Step 2; stop when Step 2 does not change the cluster ID of any point. (A minimal code sketch follows below.)
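A minimal Python sketch of these steps, assuming the data points are coordinate tuples. The function name kmeans and the details of initialisation and tie-breaking are illustrative choices, not something the slides prescribe.

  import math
  import random

  def kmeans(points, k, max_iters=100):
      # Step 1: randomly choose k of the data points as the initial centroids
      centroids = random.sample(points, k)
      assignment = None
      for _ in range(max_iters):
          # Step 2: each point's cluster ID is the index of its closest centroid
          new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                            for p in points]
          if new_assignment == assignment:   # Step 2 changed nothing: finished
              break
          assignment = new_assignment
          # Step 3: recompute each centroid as the mean of its cluster's points
          for c in range(k):
              members = [p for p, a in zip(points, assignment) if a == c]
              if members:   # keep the old centroid if a cluster happens to be empty
                  centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
      return assignment, centroids

Running, say, kmeans(points, 2) on the subscriber data from slide 3 would usually recover the two groups visible in the plot, although a poor random start can give a worse grouping; in practice K-means is often run several times from different random starts.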

34
Simple but often not ideal
  • Variable results with noisy data and outliers
  • Very large or very small values can skew the centroid positions and give poor clusterings (see the small numeric example below)
  • Only suitable for cases where we can expect clusters to be clumps that are close together -- e.g. terrible in the two-spirals and similar cases
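To see why a single extreme value matters, here is a tiny made-up illustration (not from the slides): the mean of 2, 3, 4 and 5 is 3.5, but adding one outlier of 100 drags the mean to 114/5 = 22.8, so a centroid computed this way can end up far from every genuine member of its cluster.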

35
Hierarchical Agglomerative Clustering
  • Before we discuss this
  • We need to know how to work out the distance
    between two points

36
Hierarchical Agglomerative Clustering
  • Before we discuss this
  • And the distance between a point and a cluster

?
37
Hierarchical Agglomerative Clustering
  • Before we discuss this
  • And the distance between two clusters

?
38
Hierarchical Agglomerative Clustering
  • Before we discuss this
  • There are many options for all of these things; we will discuss them in a later lecture (a brief preview appears below).

?
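As a preview, here is a hedged Python sketch of some standard choices: Euclidean distance between two points, and single/complete/average linkage between two clusters (a single point can be treated as a one-element cluster). These are common textbook options, not necessarily the exact ones the later lecture will settle on.

  import math

  def point_distance(p, q):
      # one common choice: Euclidean (straight-line) distance
      return math.dist(p, q)

  def single_linkage(c1, c2):
      # cluster-to-cluster distance = closest pair, one point from each cluster
      return min(point_distance(p, q) for p in c1 for q in c2)

  def complete_linkage(c1, c2):
      # cluster-to-cluster distance = farthest pair, one point from each cluster
      return max(point_distance(p, q) for p in c1 for q in c2)

  def average_linkage(c1, c2):
      # cluster-to-cluster distance = mean over all cross-cluster pairs
      return sum(point_distance(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))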
39
Hierarchical Agglomerative Clustering
  • Is very commonly used
  • Very different from K-means
  • Provides a much richer structuring of the data
  • No need to choose k
  • But, quite sensitive to the various ways of
    working out distance (different results for
    different ways)

40
Let's see it
(figure: nine data points, labelled 1 to 9)
41
Initially, each point is a cluster
(figure: each of the nine points, labelled 1 to 9, is its own cluster)
42
Find closest pair of clusters, and merge them into one cluster
(figure: points 7 and 9 are merged; the dendrogram on the right now joins 7 and 9)
43
Find closest pair of clusters, and merge them into one cluster
(figure: points 3 and 4 are merged next)
44
Find closest pair of clusters, and merge them into one cluster
(figure: point 8 joins one of the existing clusters)
45
Find closest pair of clusters, and merge them into one cluster
(figure: point 2 joins one of the existing clusters)
46
Find closest pair of clusters, and merge them into one cluster
(figure: point 1 joins one of the existing clusters)
47
Find closest pair of clusters, and merge them into one cluster
(figure: point 5 joins one of the existing clusters)
48
Find closest pair of clusters, and merge them into one cluster
(figure: point 6 joins one of the existing clusters)
49
Find closest pair of clusters, and merge them into one cluster
(figure: the two remaining clusters are merged into one)
50
Now all one cluster, so stop
(figure: all nine points in a single cluster, with the complete dendrogram on the right)
51
The thing on the right is a dendrogram; it contains the information for us to group the data into clusters in various ways
(figure: the nine points on the left and the completed dendrogram on the right)
52
E.g. 2 clusters
(figure: the dendrogram cut at a height that gives 2 clusters)
53
E.g. 3 clusters
(figure: the dendrogram cut at a height that gives 3 clusters)
54
In a proper dendrogram
  • The height of a bar indicates how different the items are
  • A dendrogram is also called a binary tree
  • The data points are the leaves of the tree
  • Each node represents a cluster: all the leaves of its subtree (see the code example below for cutting a dendrogram into clusters)
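For readers who want to reproduce the "cut the dendrogram into 2 or 3 clusters" idea from the previous slides in code, here is a minimal sketch using SciPy (an assumption; the lecture does not require it). The nine 2D coordinates are made-up stand-ins for points 1-9, since the real coordinates only exist in the slide figure.

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  # made-up coordinates standing in for the nine points on the slides
  X = np.array([[1.0, 1.0], [2.0, 1.2], [1.5, 2.0],
                [5.0, 5.0], [5.5, 5.2], [6.0, 4.8],
                [9.0, 1.0], [9.5, 1.2], [8.8, 0.8]])

  Z = linkage(X, method='single')                          # build the dendrogram
  two_clusters = fcluster(Z, t=2, criterion='maxclust')    # cut into 2 clusters
  three_clusters = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters
  print(two_clusters, three_clusters)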

55
The Agglomerative Hierarchical Clustering
Algorithm
  • Decide how to work out the distance between two clusters
  • Initialise: each of the N data items is a cluster
  • Repeat N-1 times:
  • Find the closest pair of clusters; merge them into a single cluster (and update your tree representation) -- see the sketch below
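A minimal from-scratch Python sketch of these steps, using single linkage (closest cross-cluster pair) as the cluster-to-cluster distance; that choice, and the helper names, are illustrative rather than prescribed by the slide.

  import math
  from itertools import combinations

  def agglomerative(points):
      # start with each of the N items as its own cluster
      clusters = {i: [p] for i, p in enumerate(points)}
      merges = []                      # the merge history is what a dendrogram records
      next_id = len(points)
      while len(clusters) > 1:         # repeat N-1 times
          # find the closest pair of clusters (single linkage)
          a, b = min(combinations(clusters, 2),
                     key=lambda pair: min(math.dist(p, q)
                                          for p in clusters[pair[0]]
                                          for q in clusters[pair[1]]))
          merges.append((a, b, next_id))            # update the tree representation
          clusters[next_id] = clusters.pop(a) + clusters.pop(b)
          next_id += 1
      return merges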

56
Next time
  • Correlation: what are the important fields in your dataset?