1. Data Mining (and Machine Learning)
David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
2. Today
- (Unsupervised) clustering
- What and why:
  - A good first step towards understanding your data
  - Discovers patterns and structure in data, which then guides further data mining
  - Helps to spot problems and outliers
  - Identifies market segments, e.g. specific types of customers or users; this is called segmentation or market segmentation
- How to do it:
  - Choose a distance measure or a similarity measure
  - Run a (usually) simple algorithm
  - We will cover the two main algorithms
3. What is Clustering? Why do it?
Subscriber Calls per day Monthly bill
1 1 3
2 4 7
3 4 8
4 3 5
5 6 1
6 9 3
7 3 5
8 7 2
9 4 7
10 6 3
11 2 5
12 8 4
13 5 3
14 6 2
15 2 4
16 3 6
17 8 3
Consider these data (made up): perhaps they are 17 subscribers to a mobile phone services company, showing the mean calls per day and the mean monthly bill for each customer. Do you spot any patterns or structure?
4. What is Clustering? Why do it?
Here is a plot of the data, with calls as X and bills as Y. Now do you spot any patterns or structure?
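As an illustration (not on the original slide), a minimal Python sketch, assuming matplotlib is available, that reproduces this plot from the table above:

    import matplotlib.pyplot as plt

    # The 17 subscribers from the table: mean calls per day, mean monthly bill
    calls = [1, 4, 4, 3, 6, 9, 3, 7, 4, 6, 2, 8, 5, 6, 2, 3, 8]
    bills = [3, 7, 8, 5, 1, 3, 5, 2, 7, 3, 5, 4, 3, 2, 4, 6, 3]

    plt.scatter(calls, bills)
    plt.xlabel("Mean calls per day")
    plt.ylabel("Mean monthly bill")
    plt.show()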
5. What is Clustering? Why do it?
Clearly there are two clusters -- two distinct types of customer. Top left: few calls but highish bills; bottom right: many calls, low bills.
6. So, clustering is all about plotting/visualising and noting distinct groups by eye, right?
- Not really, because:
- We can only spot patterns by eye (i.e. with our brains) if the data is 1D, 2D or 3D. Most data of interest is much higher-dimensional, e.g. 10D, 20D, 1000D.
- Sometimes the clusters are not so obvious as a bunch of data all in the same place; we will see examples.
- So we need automated algorithms which can do what you just did (find distinct groups in the data), but which can do this for any number of dimensions, and for perhaps more complex kinds of groups.
7. OK, so when we apply an automated clustering algorithm to data, the result is a collection of groups, which tells us potentially useful things about our data. E.g. if we are doing this with supermarket baskets, each group is a collection of typical baskets, which may relate to "general housekeeping", "late-night dinner", "quick lunchtime shopper", and perhaps other types that we are not expecting.
Yes.
8. Quality of a clustering?
Why is this better than this? (The slide compares two alternative clusterings of the same data.)
9. Quality of a clustering?
- A good clustering has the following properties:
- Items in the same cluster tend to be close to each other.
- Items in different clusters tend to be far from each other.
It is not hard to come up with a metric (an easily calculated value) that can be used to give a score to any clustering. There are many such metrics, e.g.:
S = the mean distance between pairs of items in the same cluster
D = the mean distance between pairs of items in different clusters
A measure of cluster quality is D/S -- the higher, the better.
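As an illustration (not from the slides), a minimal Python sketch of this D/S metric; it assumes Euclidean distance, points given as coordinate tuples, and one cluster label per point, with enough points in and across clusters that neither mean is over an empty set:

    from itertools import combinations
    from math import dist  # Euclidean distance (Python 3.8+)

    def d_over_s(points, labels):
        # Collect within-cluster and between-cluster pairwise distances.
        within, between = [], []
        for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
            (within if lp == lq else between).append(dist(p, q))
        S = sum(within) / len(within)    # mean same-cluster distance
        D = sum(between) / len(between)  # mean different-cluster distance
        return D / S                     # higher is better, by this metric

For example, d_over_s(list(zip(calls, bills)), labels) would score any candidate labelling of the subscriber data above.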
10. Let's try that
Eight points, A to H, are shown on the slide, clustered as {A, B, D, F, H} and {C, E, G}.
S = (AB + AD + AF + AH + BD + BF + BH + DF + DH + FH + CE + CG + EG) / 13 = 44/13 ≈ 3.38
D = (AC + AE + AG + BC + BE + BG + DC + DE + DG + FC + FE + FG + HC + HE + HG) / 15 = 40/15 ≈ 2.67
Cluster quality D/S ≈ 0.79
11. Let's try that again
This time the clusters are {A, B, C, D} and {E, F, G, H}.
S = (AB + AC + AD + BC + BD + CD + EF + EG + EH + FG + FH + GH) / 12 = 20/12 ≈ 1.67
D = (AE + AF + AG + AH + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH) / 16 = 68/16 = 4.25
Cluster quality D/S ≈ 2.54
12. But what about this?
Now the clusters are {A, B}, {C, D} and {E, F, G, H}.
S = (AB + CD + EF + EG + EH + FG + FH + GH) / 8 = 12/8 = 1.5
D = (AC + AD + AE + AF + AG + AH + BC + BD + BE + BF + BG + BH + CE + CF + CG + CH + DE + DF + DG + DH) / 20 = 72/20 = 3.6
Cluster quality D/S = 2.4
13. Some important notes
- There is usually no "correct" clustering.
- Clustering algorithms (whether or not they work with cluster quality metrics) always use some kind of distance or similarity measure -- the result of the clustering process will depend on the chosen distance measure.
- Choice of algorithm, and/or distance measure, will depend on the kind of cluster shapes you might expect in the data.
- Our D/S measure for cluster quality will not work well in lots of cases.
14. Examples: sometimes groups are not simple to spot, even in 2D
(Slide credit: Julia Handl)
15. Examples: sometimes groups are not simple to spot, even in 2D
(Slide credit: Julia Handl)
16. Brain Training
Think about why D/S is not a useful cluster quality measure in the general case. Try to design a cluster quality metric that will work well in the cases of the previous slides (not very difficult).
17. In many problems the clusters are more conventional, but maybe fuzzy and unclear
(Slide credit: Julia Handl)
18. And there is a different kind of clustering that can be done, which avoids the issue of deciding the number of clusters in advance
(Slide credit: Elias Raftopoulos / Prof. Maria Papadopouli)
19. How to do it
- The most commonly used methods:
- K-Means
- Hierarchical Agglomerative Clustering
20. K-Means
- If you want to see K clusters, then run K-means. I.e. you need to choose the number of clusters in advance. Say K = 3 -- run 3-means and the result is a good grouping of the data into 3 clusters.
- It works by generating K points (in a way, these are made-up records in the data); each point is the centre (or centroid) of one cluster. As the algorithm iterates, the points adjust their positions until they stabilise.
- Very simple, fairly fast, very common; a few drawbacks.
21. Let's see it
22. Here is the data; we choose k = 2 and run 2-means
23. We choose two cluster centres -- randomly
24. Step 1: decide which cluster each point is in: the one whose centre is closest
25. Step 2: we now have two clusters; recompute the centre of each cluster
26. These are the new centres
27. Step 1: decide which cluster each point is in: the one whose centre is closest
28. This one has to be reassigned
29. Step 2: we now have two new clusters; recompute the centre of each cluster
30. The centres have now moved slightly
31. Step 1: decide which cluster each point is in: the one whose centre is closest
32. In this case nothing gets reassigned to a new cluster, so the algorithm is finished
33. The K-Means Algorithm
- Choose k.
- Randomly choose k points, labelled 1, 2, ..., k, to be the initial cluster centroids.
- For each datum, let its cluster ID be the label of its closest centroid.
- For each cluster, recalculate its actual centre.
- Go back to the assignment step; stop when no datum changes its cluster ID.
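A minimal from-scratch Python sketch of these steps (the function and variable names are my own; in practice a library implementation such as scikit-learn's KMeans would normally be used):

    import random
    from math import dist

    def k_means(data, k, seed=0):
        random.seed(seed)
        centroids = random.sample(data, k)  # k random data points as initial centroids
        ids = None
        while True:
            # Assignment step: each datum takes the label of its closest centroid.
            new_ids = [min(range(k), key=lambda j: dist(x, centroids[j])) for x in data]
            if new_ids == ids:  # stop: no point changed its cluster ID
                return ids, centroids
            ids = new_ids
            # Update step: recalculate each cluster's actual centre
            # (this sketch assumes no cluster ever becomes empty).
            for j in range(k):
                members = [x for x, i in zip(data, ids) if i == j]
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

E.g. k_means(list(zip(calls, bills)), 2) should recover the two subscriber groups seen earlier.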
34. Simple, but often not ideal
- Variable results with noisy data and outliers.
- Very large or very small values can skew the centroid positions, and give poor clusterings.
- Only suitable for cases where we can expect clusters to be clumps that are close together; e.g. terrible in the two-spirals and similar cases.
35. Hierarchical Agglomerative Clustering
- Before we discuss this, we need to know how to work out the distance between two points...
36. Hierarchical Agglomerative Clustering
- ...and the distance between a point and a cluster...
37. Hierarchical Agglomerative Clustering
- ...and the distance between two clusters.
38. Hierarchical Agglomerative Clustering
- There are many options for all of these things; we will discuss them in a later lecture.
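As a preview of those options (the naming below is mine, not the later lecture's): taking Euclidean distance between points, three common choices for the distance between two clusters are sketched here; the point-to-cluster case is just the special case where one cluster has a single member.

    from itertools import product
    from math import dist  # Euclidean distance between two points

    def single_link(c1, c2):
        # distance between the closest pair of members
        return min(dist(p, q) for p, q in product(c1, c2))

    def complete_link(c1, c2):
        # distance between the farthest pair of members
        return max(dist(p, q) for p, q in product(c1, c2))

    def average_link(c1, c2):
        # mean distance over all cross-cluster pairs
        return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))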
39. Hierarchical Agglomerative Clustering
- Is very commonly used
- Very different from K-means
- Provides a much richer structuring of the data
- No need to choose k
- But quite sensitive to the various ways of working out distance (different results for different ways)
40. Let's see it
(The slide shows nine data points, labelled 1 to 9, scattered in 2D; a dendrogram is built up on the right over the following slides.)
41. Initially, each point is a cluster
42. Find the closest pair of clusters, and merge them into one cluster (first merge: points 7 and 9)
43. Find the closest pair of clusters, and merge them into one cluster (next: points 3 and 4)
44. Find the closest pair of clusters, and merge them into one cluster (point 8 joins next)
45. Find the closest pair of clusters, and merge them into one cluster (point 2 joins next)
46. Find the closest pair of clusters, and merge them into one cluster (point 1 joins next)
47. Find the closest pair of clusters, and merge them into one cluster (point 5 joins next)
48. Find the closest pair of clusters, and merge them into one cluster (point 6 joins next)
49. Find the closest pair of clusters, and merge them into one cluster (the final merge)
50. Now all one cluster, so stop
51. The thing on the right is a dendrogram; it contains the information for us to group the data into clusters in various ways
52. E.g. 2 clusters
53. E.g. 3 clusters
54. In a proper dendrogram:
- The height of a bar indicates how different the items are.
- A dendrogram is also called a binary tree.
- The data points are the leaves of the tree.
- Each node represents a cluster: all the leaves of its subtree.
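In practice, building and cutting a dendrogram is a few lines with a library. A SciPy sketch (the example array X and the choice of average linkage are my own assumptions, not the slides'), cutting the tree into the 2- and 3-cluster groupings of the previous two slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 3], [4, 7], [4, 8], [3, 5], [6, 1],
                  [9, 3], [7, 2], [8, 4], [5, 3]])  # any (N, d) data
    Z = linkage(X, method="average")  # the merge history, i.e. the dendrogram

    two_clusters = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
    three_clusters = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters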
55. The Agglomerative Hierarchical Clustering Algorithm
- Decide how to work out the distance between two clusters.
- Initialise: each of the N data items is a cluster.
- Repeat N-1 times:
  - Find the closest pair of clusters; merge them into a single cluster (and update your tree representation).
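A minimal from-scratch Python sketch of this loop (my naming; it records the merge history rather than maintaining an explicit tree, and uses average-link distance as one possible choice of cluster distance):

    from itertools import combinations
    from math import dist

    def average_link(c1, c2):
        # mean Euclidean distance over all cross-cluster pairs
        return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    def agglomerate(points, cluster_dist=average_link):
        clusters = [[p] for p in points]  # initialise: each item is a cluster
        merges = []
        while len(clusters) > 1:  # i.e. repeat N-1 times
            # Find the closest pair of clusters under the chosen distance.
            a, b = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
            merges.append((list(clusters[a]), list(clusters[b])))
            clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
            del clusters[b]
        return merges  # the merge order gives the dendrogram's structure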
56. Next time
- Correlation: what are the important fields in your dataset?