Transcript and Presenter's Notes

Title: Clustering Tutorial


1
Clustering Tutorial
  • Elias Raftopoulos
  • HY539 29/3/06
  • Prof. Maria Papadopouli

2
Roadmap
  • Math Reminder
  • Principal Components Analysis
  • Clustering
  • ANOVA

3
Standard Deviation
  • Statistics analyzes data sets in terms of the
    relationships between the individual points
  • Standard deviation is a measure of the spread of
    the data
  • Calculation: roughly the average distance of the
    data points from the mean (formally, the square
    root of the variance)

4
Variance
  • Another measure of the spread of the data in a
    data set
  • Calculation: Var(X) = E((X − µ)²)
  • Why have both variance and SD to calculate the
    spread of data?
  • Variance is claimed to be the original
    statistical measure of spread of data. However,
    its unit would be expressed as a square, e.g.
    cm², which is unrealistic for expressing heights or
    other measures. Hence SD, as the square root of
    variance, was born. (A short numeric sketch
    follows below.)
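As a quick numeric illustration (not from the slides; the height values below are made up), this sketch computes the variance and then takes its square root to get the SD, which is back in the original units:

```python
# Minimal sketch: variance vs. standard deviation with NumPy.
# The height data are hypothetical, chosen only for illustration.
import numpy as np

heights_cm = np.array([150.0, 160.0, 165.0, 170.0, 185.0])

mean = heights_cm.mean()
variance = ((heights_cm - mean) ** 2).mean()   # Var(X) = E((X - mu)^2), in cm^2
sd = np.sqrt(variance)                         # back in cm

print(variance)                                # ~134.0 cm^2
print(sd)                                      # ~11.6 cm
print(np.isclose(sd, heights_cm.std()))        # matches NumPy's population std
```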

5
Covariance
  • Variance: a measure of the deviation from the mean
    for points in one dimension, e.g. heights
  • Covariance: a measure of how much each of the
    dimensions varies from the mean with respect to
    the others
  • Covariance is measured between 2 dimensions to
    see if there is a relationship between them,
    e.g. number of hours studied vs. marks obtained
  • The covariance between one dimension and itself
    is the variance

6
Covariance Matrix
  • Representing covariance between dimensions as a
    matrix, e.g. for 3 dimensions:

        | cov(x,x)  cov(x,y)  cov(x,z) |
    C = | cov(y,x)  cov(y,y)  cov(y,z) |
        | cov(z,x)  cov(z,y)  cov(z,z) |

  • The diagonal holds the variances of x, y and z
  • cov(x,y) = cov(y,x), hence the matrix is symmetric
    about the diagonal
  • n-dimensional data results in an n×n covariance
    matrix (a NumPy sketch follows below)
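As an illustration (the numbers are made up; only the hours-studied/marks pairing echoes the slides), NumPy's np.cov builds such a matrix directly when given one row per dimension:

```python
# Sketch: a 3 x 3 covariance matrix with NumPy (hypothetical data).
import numpy as np

hours  = np.array([9.0, 15.0, 25.0, 14.0, 10.0])   # hours studied
marks  = np.array([39.0, 56.0, 93.0, 61.0, 50.0])  # marks obtained
height = np.array([1.6, 1.7, 1.8, 1.75, 1.65])     # an unrelated third dimension

C = np.cov(np.vstack([hours, marks, height]))      # 3 x 3, sample (n-1) form

print(C)
print(np.allclose(C, C.T))                         # cov(x,y) == cov(y,x)
print(np.allclose(C.diagonal(),
                  [np.var(hours, ddof=1),
                   np.var(marks, ddof=1),
                   np.var(height, ddof=1)]))       # diagonal holds the variances
```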

7
Covariance
  • The exact value is not as important as its sign.
  • A positive value of covariance indicates both
    dimensions increase or decrease together, e.g. as
    the number of hours studied increases, the marks
    in that subject increase.
  • A negative value indicates that while one increases
    the other decreases, or vice-versa, e.g. active
    social life at RIT vs. performance in the CS dept.
  • If the covariance is zero, the two dimensions are
    uncorrelated (they vary independently of each
    other in the linear sense), e.g. heights of
    students vs. the marks obtained in a subject

8
Transformation matrices
  • Consider:

        | 2  3 |   | 3 |   | 12 |       | 3 |
        | 2  1 | × | 2 | = |  8 | = 4 × | 2 |

  • The square transformation matrix transforms (3,2)
    from its original location. Now if we were to
    take a multiple of (3,2):

        2 × | 3 | = | 6 |
            | 2 |   | 4 |

        | 2  3 |   | 6 |   | 24 |       | 6 |
        | 2  1 | × | 4 | = | 16 | = 4 × | 4 |

9
Transformation matrices
  • Scale the vector (3,2) by a value 2 to get (6,4)
  • Multiply by the square transformation matrix
  • We see the result is still 4 times the scaled
    vector.
  • WHY?
  • A vector consists of both length and direction.
    Scaling a vector only changes its length, not
    its direction. This is an important observation
    about matrix transformations, leading to the
    notions of eigenvectors and eigenvalues.
  • Irrespective of how much we scale (3,2) by, the
    result is always 4 times the scaled vector.

10
eigenvalue problem
  • The eigenvalue problem is any problem having the
    following form:
  • A · v = λ · v
  • A: an n × n matrix
  • v: an n × 1 non-zero vector
  • λ: a scalar
  • Any value of λ for which this equation has a
    solution is called the eigenvalue of A, and the vector
    v which corresponds to this value is called the
    eigenvector of A.

11
eigenvalue problem
        | 2  3 |   | 3 |   | 12 |       | 3 |
        | 2  1 | × | 2 | = |  8 | = 4 × | 2 |

  • A · v = λ · v
  • Therefore, (3,2) is an eigenvector of the square
    matrix A and 4 is an eigenvalue of A
  • Given matrix A, how can we calculate the
    eigenvectors and eigenvalues of A?

12
Calculating eigenvectors & eigenvalues
  • Given A · v = λ · v
  • A · v − λ · I · v = 0
  • (A − λ · I) · v = 0
  • Finding the roots of det(A − λ · I) = 0 will give
    the eigenvalues, and for each of these eigenvalues
    there will be an eigenvector
  • Example:

13
Calculating eigenvectors & eigenvalues
  • If A = |  0   1 |
           | -2  -3 |

  • Then A − λ·I = |  0   1 |   | λ  0 |   | -λ     1  |
                   | -2  -3 | − | 0  λ | = | -2   -3-λ |

  • det(A − λ·I) = (−λ)(−3−λ) − (1)(−2) = λ² + 3λ + 2 = 0
  • This gives us 2 eigenvalues:
  • λ1 = −1 and λ2 = −2 (verified numerically in the
    sketch below)
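A quick numerical check of this 2 × 2 example (a sketch using NumPy's eigendecomposition; the order and scaling of the returned eigenvectors may differ from the hand calculation):

```python
# Sketch: verifying the eigenvalues of the 2 x 2 example with NumPy.
import numpy as np

A = np.array([[ 0.0,  1.0],
              [-2.0, -3.0]])

eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are eigenvectors

print(np.sort(eigvals))                  # approximately [-2., -1.]
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # A . v == lambda . v for each pair
```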

14
Properties of eigenvectors and eigenvalues
  • Note that irrespective of how much we scale (3,2)
    by, the result is always 4 times the scaled vector.
  • Eigenvectors can only be found for square
    matrices, and not every square matrix has
    eigenvectors.
  • Given an n × n matrix, we can find at most n
    linearly independent eigenvectors

15
Roadmap
  • Principal Components Analysis
  • Clustering
  • ANOVA

16
PCA
  • Principal components analysis (PCA) is a
    technique that can be used to simplify a dataset
  • It is a linear transformation that chooses a new
    coordinate system for the data set such that:
  • the greatest variance by any projection of the data
    set comes to lie on the first axis (then called
    the first principal component),
  • the second greatest variance on the second axis,
    and so on.
  • PCA can be used for reducing dimensionality by
    eliminating the later principal components.

17
PCA
  • By finding the eigenvalues and eigenvectors of
    the covariance matrix, we find that the
    eigenvectors with the largest eigenvalues
    correspond to the dimensions that have the
    strongest correlation in the dataset.
  • This is the principal component.
  • PCA is a useful statistical technique that has
    found application in
  • fields such as face recognition and image
    compression
  • finding patterns in data of high dimension

18
PCA process STEP 1
  • Subtract the mean
    from each of the data dimensions: all the x
    values have x̄ (the mean of x) subtracted, and all
    the y values have ȳ subtracted from them. This
    produces a data set whose mean is zero.
  • Subtracting the mean makes variance and
    covariance calculation easier by simplifying
    their equations. The variance and covariance
    values are not affected by the mean value.

19
PCA process STEP 1
  • DATA
  • x y
  • 2.5 2.4
  • 0.5 0.7
  • 2.2 2.9
  • 1.9 2.2
  • 3.1 3.0
  • 2.3 2.7
  • 2 1.6
  • 1 1.1
  • 1.5 1.6
  • 1.1 0.9

ZERO MEAN DATA
    x       y
   .69     .49
 -1.31   -1.21
   .39     .99
   .09     .29
  1.29    1.09
   .49     .79
   .19    -.31
  -.81    -.81
  -.31    -.31
  -.71   -1.01
20
PCA process STEP 1
21
PCA process STEP 2
  • Calculate the covariance matrix:

    cov = | .616555556   .615444444 |
          | .615444444   .716555556 |

  • Since the non-diagonal elements in this
    covariance matrix are positive, we should expect
    that both the x and y variables increase together.

22
PCA process STEP 3
  • Calculate the eigenvectors and eigenvalues of the
    covariance matrix
  • eigenvalues  = | .0490833989 |
                   | 1.28402771  |

  • eigenvectors = | -.735178656   -.677873399 |
                   |  .677873399   -.735178656 |
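The numbers above can be reproduced with a few lines of NumPy (a sketch; the sign and ordering of the eigenvectors returned by np.linalg.eig may differ from the slides, which does not change the analysis):

```python
# Sketch reproducing PCA steps 1-3 on the slides' 10-point data set.
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

centered = data - data.mean(axis=0)       # STEP 1: subtract the mean
cov = np.cov(centered, rowvar=False)      # STEP 2: covariance matrix (n-1 form)
eigvals, eigvecs = np.linalg.eig(cov)     # STEP 3: eigenvalues / eigenvectors

print(cov)       # ~ [[0.6166, 0.6154], [0.6154, 0.7166]]
print(eigvals)   # ~ 0.0491 and 1.2840 (order may vary)
print(eigvecs)   # columns are the corresponding eigenvectors
```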

23
PCA process STEP 3
  • The eigenvectors are plotted as diagonal dotted
    lines on the plot.
  • Note they are perpendicular to each other.
  • Note one of the eigenvectors goes through the
    middle of the points, like drawing a line of best
    fit.
  • The second eigenvector gives us the other, less
    important, pattern in the data: all the
    points follow the main line, but are off to the
    side of the main line by some amount.

24
PCA process STEP 4
  • Reduce dimensionality and form the feature vector:
    the eigenvector with the highest eigenvalue is the
    principal component of the data set.
  • In our example, the eigenvector with the largest
    eigenvalue was the one that pointed down the
    middle of the data.
  • Once eigenvectors are found from the covariance
    matrix, the next step is to order them by
    eigenvalue, highest to lowest. This gives you the
    components in order of significance.

25
PCA process STEP 4
  • Now, if you like, you can decide to ignore the
    components of lesser significance
  • You do lose some information, but if the
    eigenvalues are small, you don't lose much
  • n dimensions in your data
  • calculate n eigenvectors and eigenvalues
  • choose only the first p eigenvectors
  • final data set has only p dimensions.

26
PCA process STEP 4
  • Feature Vector:
  • FeatureVector = (eig1 eig2 eig3 … eign)
  • We can either form a feature vector with both of
    the eigenvectors:

        | -.677873399   -.735178656 |
        | -.735178656    .677873399 |

  • or, we can choose to leave out the smaller, less
    significant component and only have a single
    column:

        | -.677873399 |
        | -.735178656 |

27
PCA process STEP 5
  • Deriving the new data:
  • FinalData = RowFeatureVector × RowZeroMeanData
  • RowFeatureVector is the matrix with the
    eigenvectors in the columns transposed so that
    the eigenvectors are now in the rows, with the
    most significant eigenvector at the top
  • RowZeroMeanData is the mean-adjusted data
    transposed, i.e. the data items are in each
    column, with each row holding a separate
    dimension (a short NumPy sketch follows below).
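A sketch of this projection on the same 10-point data set (signs of the projected values may be flipped relative to the slides, depending on the eigenvector signs NumPy returns):

```python
# Sketch of STEP 5: FinalData = RowFeatureVector x RowZeroMeanData.
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eig(np.cov(centered, rowvar=False))

order = np.argsort(eigvals)[::-1]             # most significant eigenvector first
row_feature_vector = eigvecs[:, order].T      # eigenvectors as rows
row_zero_mean_data = centered.T               # data items in columns

final_data = row_feature_vector @ row_zero_mean_data         # both components kept
final_data_1d = row_feature_vector[:1] @ row_zero_mean_data  # only the first component

print(final_data.T)      # compare with the table on the following slides
print(final_data_1d.T)
```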

28
PCA process STEP 5
(Figure: the data matrix factored as U · S · V^T, with axes labeled
samples, variables and factors; the leading factors are marked as
significant and the remainder as noise.)
29
PCA process STEP 5
  • FinalData is the final data set, with data items
    in columns, and dimensions along rows.
  • What will this give us?
  • It will give us the original data solely in terms
    of the vectors we chose.
  • We have changed our data from being in terms of
    the axes x and y, and now they are in terms of
    our 2 eigenvectors.

30
PCA process STEP 5
  • FinalData transpose: dimensions along columns
  • x y
  • -.827970186 -.175115307
  • 1.77758033 .142857227
  • -.992197494 .384374989
  • -.274210416 .130417207
  • -1.67580142 -.209498461
  • -.912949103 .175282444
  • .0991094375 -.349824698
  • 1.14457216 .0464172582
  • .438046137 .0177646297
  • 1.22382056 -.162675287

31
PCA process STEP 5
32
Reconstruction of original Data
  • If we reduced the dimensionality then, obviously,
    when reconstructing the data we would lose the
    dimensions we chose to discard. In our example,
    let us assume that we considered only the x
    dimension (the first principal component); a
    hedged reconstruction sketch follows below
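The reconstruction step itself is not spelled out on the slides; a hedged sketch of the usual approach (project back with the transposed feature vector and add the mean again) looks like this:

```python
# Hedged sketch: reconstructing the data from the single kept component.
# For orthonormal eigenvectors the inverse of the projection is its transpose.
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
mean = data.mean(axis=0)
centered = data - mean
eigvals, eigvecs = np.linalg.eig(np.cov(centered, rowvar=False))

pc1 = eigvecs[:, [np.argmax(eigvals)]].T      # 1 x 2: most significant eigenvector
final_1d = pc1 @ centered.T                   # reduced data, 1 x 10
reconstructed = (pc1.T @ final_1d).T + mean   # back to 10 x 2

print(reconstructed)   # close to the original data, but flattened onto PC1
```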

33
Reconstruction of original Data
  • x
  • -.827970186
  • 1.77758033
  • -.992197494
  • -.274210416
  • -1.67580142
  • -.912949103
  • .0991094375
  • 1.14457216
  • .438046137
  • 1.22382056

34
Roadmap
  • Principal Components Analysis
  • Clustering
  • ANOVA

35
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to the objects in the same cluster
    (intraclass similarity)
  • Dissimilar to the objects in other clusters
    (interclass dissimilarity)
  • Cluster analysis:
  • A statistical method for grouping a set of data
    objects into clusters
  • A good clustering method produces high-quality
    clusters with high intraclass similarity and low
    interclass similarity
  • Clustering is unsupervised classification
  • Can be a stand-alone tool or a preprocessing
    step for other algorithms

36
Group objects according to their similarity.
Cluster: a set of objects that are similar to
each other and separated from the
other objects. Example: green/red data
points were generated from two different normal
distributions.
37
Clustering data
object / expression data matrix
  • Experiments/samples are given as the row and
    column vectors of an expression data matrix
  • Clustering may be applied either to objects or to
    experiments (regarded as vectors in R^o or R^n).

(Figure: expression data matrix with o objects and n experiments.)
38
Pattern matrix → Proximity matrix
  • Pattern matrix (n × p):
  • p attributes
  • n objects
  • Proximity matrix (n × n):
  • d(i,j) = difference/
    dissimilarity between i and j

39
Proximity matrix
  • Clustering methods require that an index of
    proximity, or alikeness, or affinity or
    association be established between pairs of
    patterns
  • A proximity index is either a similarity or a
    dissimilarity
  • The crucial problem in identifying clusters in
    data is to specify what proximity is and how to
    measure it

40
Proximity indices
  • A proximity index between the ith and kth
    patterns is denoted d(i,k) and must satisfy the
    following three properties:
  • 1. (a) for a dissimilarity: d(i,i) = 0, for all i
  •    (b) for a similarity: d(i,i) ≥ max over k of
         d(i,k), for all i
  • 2. d(i,k) = d(k,i), for all (i,k)
  • 3. d(i,k) ≥ 0, for all (i,k)

41
Different proximity measures
  • r 2(Euclidean distance)
  • 42 221/2 4.472
  • r 1(Manhattan distance)
  • 4 2 6
  • r ? 8 (sup distance)
  • max4,2 4
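The same three values can be checked with a short sketch (the two points below are hypothetical, chosen so their coordinate-wise differences are 4 and 2):

```python
# Sketch: Euclidean, Manhattan and sup distance for differences (4, 2).
import numpy as np

p = np.array([5.0, 3.0])
q = np.array([1.0, 1.0])
diff = np.abs(p - q)                      # (4, 2)

euclidean = np.sqrt((diff ** 2).sum())    # r = 2     -> ~4.472
manhattan = diff.sum()                    # r = 1     -> 6
sup       = diff.max()                    # r -> inf  -> 4

print(euclidean, manhattan, sup)
```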

42
K-Means Clustering
  • The meaning of K-means
  • Why is it called K-means clustering? K points
    are used to represent the clustering result; each
    point corresponds to the centre (mean) of a
    cluster
  • Each point is assigned to the cluster with the
    closest center point
  • The number K must be specified
  • Basic algorithm:

43
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    4 steps
  • Partition objects into k non-empty subsets
  • Arbitrarily choose k points as initial centers
  • Assign each object to the cluster with the
    nearest seed point (center)
  • Calculate the mean of each cluster and update the
    seed points
  • Go back to Step 3; stop when there are no new
    assignments

44
The K-Means Clustering Method (cntd)
  • The basic step of k-means clustering is simple.
  • Iterate until stable (no object changes group):
  • Determine the centroid coordinates
  • Determine the distance of each object to the
    centroids
  • Group the objects based on minimum distance
    (a minimal NumPy sketch follows below)
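A minimal NumPy sketch of this loop (random initial centers drawn from the data; empty clusters are not handled, and k-means in general is sensitive to initialization):

```python
# Minimal k-means sketch following the loop described above.
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    while True:
        # distance of each object to each centroid
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)          # assign to nearest center
        if np.array_equal(new_labels, labels):     # stable: no object moved group
            return centers, labels
        labels = new_labels
        # update each center to the mean of its cluster
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

# toy usage: two well-separated blobs, K = 2
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
centers, labels = kmeans(pts, k=2)
print(centers)
```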

45
The K-Means Clustering Method (cntd)
46
The K-Means Clustering Results
  • Example

(Figure: K = 2 example on a 10 × 10 grid. Arbitrarily choose K objects
as the initial cluster centers, assign each object to the most similar
center, update the cluster means, then reassign and repeat.)
47
Weaknesses of the K-Means Method
  • Unable to handle noisy data and outliers
  • Very large or very small values could skew the
    mean
  • Not suitable to discover clusters with non-convex
    shapes

48
Hierarchical Clustering
  • Start with every data point in a separate cluster
  • Keep merging the most similar pairs of data
    points/clusters until we have one big cluster
    left
  • This is called a bottom-up or agglomerative
    method

49
Hierarchical Clustering (cont.)
  • This produces a binary tree or dendrogram
  • The final cluster is the root and each data item
    is a leaf
  • The height of the bars indicates how close the
    items are

50
Hierarchical Clustering Demo
51
Levels of Clustering
52
Linkage in Hierarchical Clustering
  • We already know about distance measures between
    data items, but what about between a data item
    and a cluster or between two clusters?
  • We just treat a data point as a cluster with a
    single item, so our only problem is to define a
    linkage method between clusters
  • As usual, there are lots of choices

53
Average Linkage
  • Definition:
  • Each cluster ci is associated with a mean vector
    µi which is the mean of all the data items in the
    cluster
  • The distance between two clusters ci and cj is
    then just d(µi, µj)
  • This is somewhat non-standard: this method is
    usually referred to as centroid linkage, and
    average linkage is defined as the average of all
    pairwise distances between points in the two
    clusters

54
Single Linkage
  • The minimum of all pairwise distances between
    points in the two clusters
  • Tends to produce long, loose clusters

55
Complete Linkage
  • The maximum of all pairwise distances between
    points in the two clusters
  • Tends to produce very tight clusters

56
Distances between clusters (summary)
  • Calculation of the distance between two clusters
    is based on the pairwise distances between
    members of the clusters.
  • Complete linkage: largest distance between points
  • Average linkage: average distance between points
  • Single linkage: smallest distance between points
  • Centroid: distance between centroids

Complete linkage gives preference to
compact/spherical clusters. Single linkage can
produce long stretched clusters.
57

EXAMPLE: pairwise distance matrix for objects A–E

      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0
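As a sketch, the example matrix above can be fed to SciPy's agglomerative clustering after converting it to the condensed form that linkage() expects:

```python
# Sketch: hierarchical clustering of objects A-E from the distance matrix above.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

condensed = squareform(D)                      # condensed distance vector
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)      # dendrogram merge history
    print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 clusters
```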
58
More on Hierarchical Clustering Methods
  • Major advantage:
  • Conceptually very simple
  • Easy to implement → most commonly used technique
  • Major weaknesses of agglomerative clustering
    methods:
  • do not scale well: time complexity of at least
    O(n²), where n is the number of total objects
  • can never undo what was done previously → high
    likelihood of getting stuck in local minima

59
Roadmap
  • Principal Components Analysis
  • Clustering
  • ANOVA

60
(M)ANOVA
  • The analysis of variance technique in One-Way
    Analysis of Variance (ANOVA) takes a set of
    grouped data and determines whether the mean of a
    variable differs significantly between groups
  • Often there are multiple variables, and you are
    interested in determining whether the entire set
    of means is different from one group to the next
  • There is a multivariate version of analysis of
    variance that can address that problem (MANOVA);
    a hedged one-way ANOVA sketch follows below
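As a sketch (the measurements are hypothetical), SciPy's f_oneway performs the one-way ANOVA test described above; a multivariate version is available elsewhere, e.g. statsmodels' MANOVA:

```python
# Hedged sketch: one-way ANOVA on three hypothetical groups with SciPy.
from scipy import stats

group_a = [23.1, 25.4, 24.8, 26.0, 24.3]
group_b = [28.2, 27.5, 29.1, 30.0, 28.7]
group_c = [24.0, 23.5, 25.1, 24.8, 23.9]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value suggests the group means differ
```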