Title: Approaches to clustering-based analysis and validation
1 Approaches to clustering-based analysis and validation
Dr. Huiru Zheng, Dr. Francisco Azuaje
School of Computing and Mathematics, Faculty of Engineering, University of Ulster
2 Gene Expression Data
Genes vs perturbations
Tissues vs genes
Expression matrices illustrating two distinct genetic profiles: genes (G), biological conditions (C), types of tissue (T).
3 Clustering
- At the end of an unsupervised recognition process (learning), we obtain a number of classes or CLUSTERS
- Each cluster groups a number of cases/input vectors
- When new input vectors are presented to the system, they are categorised into one of the existing classes or CLUSTERS
4 Clustering of genes according to their expression patterns
5 Clustering approaches to classification
Traditional approach

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: s3, s4
C2: s1, s2, s5

Exclusive clusters
6 Clustering approaches to classification
Direct sample/attribute correlation

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: s3, s4, a1, a2
C2: s1, s2, s5, a4, a5

Exclusive biclusters
7 Clustering approaches to classification
Multiple cluster membership

     a1  a2  a3  a4  a5
s1   x   y   z   y   y
s2   x   x   x   y   y
s3   y   y   x   y   y
s4   y   y   y   y   y
s5   y   x   x   y   y

C1: s3, s4, a1, a2
C2: s1, s2, s3, s5, a4, a5
8 Key Algorithms
- Hierarchical Clustering
- K-means
- Kohonen Maps
- Self-adaptive methods
9 Hierarchical Clustering (1)
- Organizes the data into larger groups that contain smaller groups, like a tree or dendrogram.
- Avoids specifying how many clusters are appropriate by providing a partition for each k; the partitions are obtained by cutting the tree at different levels.
- The tree can be built in two distinct ways:
  - bottom-up: agglomerative clustering
  - top-down: divisive clustering
- Algorithms
  - Agglomerative (single-linkage, complete-linkage, average-linkage)
10 Hierarchical Clustering (2)
[Figure: dendrogram over genes, showing degrees of dissimilarity]
11 Hierarchical Clustering (3)
- P: set of genes
- While there is more than one subtree in P:
  - Pick the most similar pair i, j in P
  - Define a new subtree k joining i and j
  - Remove i and j from P and insert k
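The loop above can be sketched directly in code; a minimal single-linkage version, where the 1-D expression values and helper names are illustrative assumptions:

```python
# Minimal sketch of the agglomerative loop: repeatedly merge the most
# similar pair of subtrees until one tree remains. Single-linkage
# (cluster distance = closest pair of members) is one possible choice.

def single_linkage(c1, c2):
    """Distance between two clusters: closest pair of members."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points):
    """Merge clusters pairwise until one subtree remains; return merge order."""
    clusters = [[p] for p in points]   # P: each gene starts as its own subtree
    merges = []
    while len(clusters) > 1:
        # pick the most similar pair i, j in P
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # define a new subtree k joining i and j; remove i and j, insert k
        k = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [k]
    return merges

merges = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0])
```

The merge order recorded here is exactly the dendrogram structure: cutting after fewer merges yields more clusters.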
12-15 Figures of Hierarchical Clustering
[Figures: step-by-step merging of items 1-5 into successively larger subtrees]
16 Hierarchical Clustering

Method            Distance metric between clusters
Single-linkage    minimum distance between any two members
Complete-linkage  maximum distance between any two members
Average-linkage   average distance over all pairs of members
Centroid          distance between the cluster centroids
17 An Example of Hierarchical Clustering
18 Partitional clustering
- Creates one set of clusters that partitions the data into similar groups.
- Algorithms: Forgy's, k-means, ISODATA
19 K-means Clustering
- A value for k, the expected number of clusters, is selected up front.
- The algorithm divides the data into k clusters in such a way that the profiles within each cluster are more similar than those across clusters.
20 K-means approach
- One more input, k, is required. There are many variants of k-means.
- Sum-of-squares criterion
  - minimize E = Σ_{k=1..K} Σ_{x_i ∈ C_k} ||x_i − μ_k||², where μ_k is the centroid of cluster C_k
21 An example of the k-means approach
- Two passes
  - Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n−k samples, find the centroid nearest it and assign the sample to that cluster. After each sample is assigned, re-compute the centroid of the altered cluster.
  - For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. (The centroids do not need to be re-computed.)
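The two passes above can be sketched as follows; the 1-D sample values and function names are illustrative assumptions:

```python
# Sketch of the two-pass k-means variant: pass 1 seeds and grows the
# clusters with running centroid updates; pass 2 reassigns every sample
# to its nearest (now fixed) centroid.

def nearest(centroids, x):
    """Index of the centroid closest to sample x."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - x))

def two_pass_kmeans(samples, k):
    # Pass 1: seed clusters with the first k samples, then assign the
    # remaining n-k samples, re-computing the altered centroid each time.
    clusters = [[s] for s in samples[:k]]
    centroids = list(samples[:k])
    for x in samples[k:]:
        i = nearest(centroids, x)
        clusters[i].append(x)
        centroids[i] = sum(clusters[i]) / len(clusters[i])
    # Pass 2: reassign every sample to its nearest centroid
    # (centroids are not re-computed during this pass).
    final = [[] for _ in range(k)]
    for x in samples:
        final[nearest(centroids, x)].append(x)
    return final, centroids

clusters, centroids = two_pass_kmeans([1.0, 1.1, 5.0, 5.2, 0.9, 5.1], k=2)
```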
22 Examples
23 Kohonen Self-Organising Maps (SOM)
- The aim of Kohonen learning is to map similar signals/input vectors/cases to similar neurone positions
- Neurones or nodes that are physically adjacent in the network encode patterns or inputs that are similar
24 SOM architecture (1)
[Figure: Kohonen layer, with winning neurone i, its weight vector wi, and input vector X]
X = (x1, x2, ..., xn) ∈ R^n; Wi = (wi1, wi2, ..., win) ∈ R^n
25 SOM architecture (2)
A rectangular grid of neurones representing a Kohonen map. Lines are used to link neighbouring neurones.
26 SOM architecture (3)
2-dimensional representation of random weight vectors. The lines are drawn to connect neurones which are physically adjacent.
27 SOM architecture (4)
2-dimensional representation of 6 input vectors (a training data set)
28 SOM architecture (5)
In a well-trained (ordered) network, the diagram in the weight space should have the same topology as that in physical space and will reflect the properties of the training data set.
29 SOM architecture (6)
Input space (training data set)
Weight vector representations after training
30 SOM-based Clustering (1)
- Type of input
  - a) The input a neural network can process is a vector of fixed length.
  - b) This means that only numbers can be used as input, and that the network must be set up so that the longest input vector can be processed. It also means that vectors with fewer elements must be padded until they have the same size as the longest vector.
31 SOM-based Clustering (2)
- Classification of inputs
  - a) In a Kohonen network, each neurone is represented by a so-called weight vector.
  - b) During training these vectors are adjusted to match the input vectors, in such a way that after training each of the weight vectors represents a certain class of input vectors.
  - c) If in the test phase a vector is presented as input, the weight vector which represents the class this input vector belongs to is given as output, i.e. the neurone is activated.
32 SOM-based Clustering (3)
- Learning (training) behaviour
  - a) During training (learning), the neurones of a Kohonen network are adjusted in such a way that regions form on the map which consist of neurones with similar weight vectors.
  - b) This means that in a well-trained map, a class will not be represented by one single neurone, but by a group of neurones.
  - c) In this group there is one central neurone which can be said to represent the most prototypical member of the class, while the surrounding neurones represent less prototypical members.
33 SOM Learning Algorithm
- SOMs define a mapping from an m-dimensional input data space onto a one- or two-dimensional array of nodes
- Algorithm
  - 1. initialise the network with n nodes
  - 2. select one case from the set of training cases
  - 3. find the node in the network that is closest (according to some measure of distance) to the selected case
  - 4. adjust the set of weights of the closest node and of the nodes around it
  - 5. repeat from 2. until some termination criterion is reached
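Steps 1-5 can be sketched as a short training loop; the 1-D node array, data set, and decay schedules below are illustrative assumptions, not part of the algorithm statement above:

```python
# SOM training sketch: a 1-D array of nodes, Euclidean winner search,
# and neighbourhood updates, with the neighbourhood radius and learning
# rate decreased over the epochs (illustrative schedules).
import math
import random

def train_som(data, n_nodes=5, epochs=20, lr=0.5, radius=2):
    random.seed(0)
    m = len(data[0])
    # 1. initialise the network: random weights, e.g. in [-0.1, 0.1]
    weights = [[random.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        for x in data:  # 2. select one case from the training set
            # 3. find the node closest to the selected case
            dists = [math.dist(w, x) for w in weights]
            k = dists.index(min(dists))
            # 4. adjust the winning node and the nodes around it
            for i in range(max(0, k - radius), min(n_nodes, k + radius + 1)):
                weights[i] = [w + lr * (xj - w) for w, xj in zip(weights[i], x)]
        # 5. repeat, shrinking the neighbourhood and learning rate over time
        if epoch % 5 == 4 and radius > 0:
            radius -= 1
        lr *= 0.95
    return weights

weights = train_som([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [0.1, -0.1]])
```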
34 SOM: One single learning cycle (1)
- 1) The weights are initialised to random values (in the interval −0.1 to 0.1, for instance) and the neighbourhood sizes are set to cover over half of the network
- 2) An m-dimensional input vector Xs (scaled between −1 and 1, for instance) enters the network
- 3) The distances di(Wi, Xs) between all the weight vectors on the SOM and Xs are calculated, using (for instance) the Euclidean distance
  - di = sqrt( Σ_j (wj − xj)² )
  - where
  - Wi denotes the ith weight vector
  - wj and xj represent the jth elements of Wi and Xs respectively
35 SOM: One single learning cycle (2)
- 4) Find the best matching (winning) neurone, whose weight vector Wk is closest to the current input vector Xi
- 5) Modify the weights of the winning neurone and all the neurones in the neighbourhood Nk by applying
  - Wj(new) = Wj(old) + η(Xi − Wj(old)), where η represents the learning rate
- 6) For the next input vector X(i+1), the process is repeated
36 SOM Learning Parameters
- If a data set consists of P input vectors or cases, then 1 learning epoch is equal to P single learning cycles
- After a number N of learning epochs, the size of the neighbourhood is decreased
- After a number M of learning epochs, the learning rate η may be decreased
37 SOM Neighbourhood Schemes (1)
38 SOM Neighbourhood Schemes (2)
First neighbourhood
Second neighbourhood
39 SOM Neighbourhood Schemes (3)
Why do we have to modify the size of the neighbourhood?
- We need to induce map formation by adapting regions according to the similarity between weights and input vectors
- We need to ensure that neighbourhoods are adjacent
- Thus, a neighbourhood will represent a number of similar clusters or neurones
- By starting with a large neighbourhood we guarantee that a GLOBAL ordering takes place; otherwise there may be more than one region on the map encoding a given part of the input space
40 SOM Neighbourhood Schemes (4)
- One good strategy is to gradually reduce the size of the neighbourhood for each neurone to zero over a first part of the learning phase, during the formation of the map topography,
- and then continue to modify only the weight vectors of the winning neurones to pick up the fine details of the input space
41 SOM Learning Rate
Why do we need to decrease the learning rate η?
- If the learning rate η is kept constant, it is possible for weight vectors to oscillate back and forth between two nearby positions
- Lowering η ensures that this does not occur and that the network is stable
42 Visualising data and clusters with Kohonen maps
U-matrix and median distance matrix maps for leukaemia data. The U-matrix holds distances between neighbouring map units.
43 Basic Criteria For The Selection Of Clustering Techniques (1)
- Which clustering algorithm should I use?
- Should I apply an alternative solution?
- How can results be improved by using different methods?
44 Basic Criteria For The Selection Of Clustering Techniques (2)
- There are multiple clustering techniques that can be used to analyse expression data.
- Choosing the best algorithm for a particular problem may represent a challenging task.
- Advantages and limitations may depend on factors such as the statistical nature of the data, pre-processing procedures, number of features, etc.
- It is not uncommon to observe inconsistent results when different clustering methods are tested on a particular data set.
45 Basic Criteria For The Selection Of Clustering Techniques (3)
- In order to make an appropriate choice, it is important to have a good understanding of
  - the problem domain under study, and
  - the clustering options available.
46 Basic Criteria For The Selection Of Clustering Techniques (4)
- Knowledge of the underlying biological problem may allow a scientist to choose a tool that satisfies certain requirements, such as the capacity to detect overlapping classes.
- Knowledge of the mathematical properties of a clustering technique may support the selection process:
  - How does this algorithm represent similarity (or dissimilarity)?
  - How much relevance does it assign to cluster heterogeneity?
  - How does it implement the process of measuring cluster isolation?
- Answers to these questions may indicate crucial directions for the selection of an adequate clustering algorithm.
47 Basic Criteria For The Selection Of Clustering Techniques (5)
- Empirical studies have defined several mathematical criteria of acceptability.
- For example, there may be clustering algorithms that are capable of guaranteeing the generation of partitions whose cluster structures do not intersect.
- Several algorithms indirectly assume that the cluster structure of the data under consideration exhibits particular characteristics.
- For instance, the k-means algorithm assumes that the shape of the clusters is spherical, and single-linkage hierarchical clustering assumes that the clusters are well separated.
48 Basic Criteria For The Selection Of Clustering Techniques (6)
- Unfortunately, this type of knowledge may not always be available in an expression data study.
- In this situation a solution may be to test a number of techniques on related data sets which have previously been classified (a reference data set).
- Thus, a user may choose a clustering method if it produced consistent categorisation results in relation to such a reference data set.
49 Basic Criteria For The Selection Of Clustering Techniques (7)
- Specific user requirements may also influence a selection decision.
- For example, a scientist may be interested in observing direct relationships between classes and subclasses in a data partition. In this case, a hierarchical clustering approach may represent a basic solution.
- But in some studies hierarchical clustering results could be difficult to visualise because of the number of samples and features involved. Thus, for instance, a SOM may be considered to guide an exploratory analysis of the data.
50 Basic Criteria For The Selection Of Clustering Techniques (8)
- In general, the application of two or more clustering techniques may provide the basis for the synthesis of accurate and reliable results.
- A scientist may be more confident about the clustering experiments if very similar results are obtained by using different techniques.
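One way to make this cross-checking of techniques concrete is a pair-counting agreement score between two partitions. The Rand index used below is a standard measure not named on the slide, and the label vectors are illustrative:

```python
# Rand index: the fraction of sample pairs on which two partitions agree
# (both place the pair together, or both place it apart). Label values
# themselves do not matter, only the grouping they induce.
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pairwise agreement between two cluster label assignments."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Two methods producing the same grouping under different label names
score = rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])
```

A score near 1 indicates that the two clustering methods grouped the samples almost identically, supporting confidence in the result.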
51 Clustering approaches to classification: Key experimental factors
- Type of clustering algorithm
- Number of experiments (partitions)
- Number of clustering (learning) cycles in an algorithm
- In KD applications the number of classes may not be known a priori
- The number of clusters in each experiment
52 The problem of assessing cluster validity and evaluation
Partition 1 (2 clusters)
Partition 2 (3 clusters)
Partition 3 (4 clusters)
Partition n (3 clusters)
53 Clustering and cluster validity assessment
Data on a 2D space
54 Cluster validity assessment
[Figure: Method 1 yields c = 2 clusters; Method 2 yields c = 4 clusters]
55 Cluster validity assessment: Key questions
- Is this a relevant partition?
- Should we analyse these clusters?
- Is there a better partition?
- Which clustering method should we apply?
- Is this relevant information from a biological point of view?
56 Cluster validity assessment
- Quality indices
- Maximise or minimise indices
- Quality factors: compactness, heterogeneity, isolation, shape
- The best or correct partition is the one that maximises or minimises an index
57 Cluster validity assessment based on a quality index I
Partition 1 (2 clusters)
Partition 2 (3 clusters)
Partition 3 (4 clusters)
P1: I1 = 0.1; P2: I2 = 0.9; P3: I3 = 0.5
P2 is the best/correct partition
Analyse and interpret P2
58 Cluster Validity assessment
δ(Xi, Xj): intercluster distance
Δ(Xk): intracluster distance
[Figure: two clusters X1 and X2, showing the intercluster distance δ(X1, X2) and the intracluster distance Δ(X1)]
59 Cluster Validity assessment
There are different ways to calculate δ(Xi, Xj) and Δ(Xk)
60 Cluster validity assessment (I): Dunn's index
- This index aims at identifying sets of clusters that are compact and well separated
- For any partition U: X = X1 ∪ ... ∪ Xc, Dunn's validation index is defined as
  - V(U) = min_{i} min_{j≠i} { δ(Xi, Xj) / max_{k} Δ(Xk) }
  - δ(Xi, Xj): intercluster distance between clusters Xi and Xj; Δ(Xk): intracluster distance of cluster Xk; c: number of clusters of partition U
- Large values of V correspond to good clusters
- The number of clusters that maximises V is taken as the optimal number of clusters, c
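A minimal sketch of the index as defined above, assuming single-linkage for δ and the complete diameter for Δ (one of several possible choices, as the next slides note) on illustrative 1-D data:

```python
# Dunn's index: minimum intercluster distance divided by the maximum
# intracluster diameter; large values indicate compact, well-separated
# clusters.

def delta(ci, cj):
    """Intercluster distance δ: closest pair across the two clusters."""
    return min(abs(a - b) for a in ci for b in cj)

def diameter(ck):
    """Intracluster distance Δ: most distant pair within the cluster."""
    return max((abs(a - b) for a in ck for b in ck), default=0.0)

def dunn(clusters):
    max_diam = max(diameter(c) for c in clusters)
    return min(
        delta(clusters[i], clusters[j])
        for i in range(len(clusters))
        for j in range(len(clusters))
        if i != j
    ) / max_diam

# Compact, well-separated clusters give a large index value
v = dunn([[1.0, 1.1], [5.0, 5.2]])
```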
61 Cluster validity assessment (II): the Davies-Bouldin index
- Small values of DB correspond to good clusters (the clusters are compact and their centres are far away from each other)
- The cluster configuration that minimises DB is taken as the optimal number of clusters, c
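Since the slide does not show the formula, the sketch below assumes the standard Davies-Bouldin definition, with the mean distance to the centroid as the scatter measure, on illustrative 1-D data:

```python
# Davies-Bouldin index: DB = (1/c) Σ_i max_{j≠i} (S_i + S_j) / d(c_i, c_j),
# where S_i is the within-cluster scatter and d(c_i, c_j) the distance
# between centroids; small values indicate compact, well-separated clusters.

def centroid(c):
    return sum(c) / len(c)

def scatter(c):
    """Mean distance of the members to their cluster centroid."""
    m = centroid(c)
    return sum(abs(x - m) for x in c) / len(c)

def davies_bouldin(clusters):
    c = len(clusters)
    total = 0.0
    for i in range(c):
        total += max(
            (scatter(clusters[i]) + scatter(clusters[j]))
            / abs(centroid(clusters[i]) - centroid(clusters[j]))
            for j in range(c) if j != i
        )
    return total / c

# Compact, well-separated clusters give a small DB value
db = davies_bouldin([[1.0, 1.2], [5.0, 5.4]])
```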
62 Cluster Validity assessment: Obtaining the partitions
Leukemia data, 2 clusters
A: AML; B: ALL
63 Cluster Validity assessment: Obtaining the partitions
Leukemia data, 4 clusters: A, B, C, D (AML, T-ALL, B-ALL)
64 Cluster Validity assessment: Davies-Bouldin indexes, leukemia data
- DB11: using inter-cluster distance 1 and intra-cluster distance 1 (complete diameter)
- DB21: using inter-cluster distance 2 and intra-cluster distance 1
Bold entries represent the optimal number of clusters, c, predicted by each index.
65 Cluster validity assessment: Dunn's indexes, leukemia data
Bold entries represent the optimal number of clusters, c, predicted by each index.
66 Cluster validity assessment: Aggregation of Dunn's indices, leukemia data
67 Cluster validity assessment: Aggregation of Davies-Bouldin indexes, leukemia data
68 Cluster validity assessment
- Different intercluster/intracluster distance combinations may produce validation indices of different scale ranges.
- Indices with higher values may have a stronger effect on the calculation of the average index values.
- This may result in a biased prediction of the optimal number of clusters.
69 Cluster validity assessment
An approach for the prediction of the optimal partition is the implementation of an aggregation method based on a weighted voting strategy.
Leukaemia data and Davies-Bouldin validation index
70 Cluster validity assessment
Effect of the distance metric on the prediction process
Dunn's validity indexes and leukemia data
71 Cluster validity assessment
Effect of the distance metric on the prediction process
Dunn's validity indexes and DLBCL data
72 Visualisation: Dendrograms
- Dendrograms are often used to visualise the nested sequence of clusters resulting from hierarchical clustering.
- While dendrograms are quite appealing because of their apparent ease of interpretation, they can be misleading.
- First, the dendrogram corresponding to a given hierarchical clustering is not unique: for each merge one needs to specify which subtree should go on the left and which on the right, so there are 2^(n-1) different dendrograms.
- A second, and perhaps less recognised, shortcoming of dendrograms is that they impose structure on the data, instead of revealing structure in the data.
73 Dendrograms
- Such a representation will be valid only to the extent that the pairwise distances possess the hierarchical structure imposed by the clustering algorithm.
74 Dendrograms: example
- Genes correspond to the rows, and the time points of each experiment are the columns.
- The ratio of expression is colour coded:
  - Red: upregulated
  - Green: downregulated
  - Black: no change
  - Grey: missing data
75 Visualisation: maps
- The 828 genes were grouped into 30 clusters.
- Each cluster is represented by the centroid for the genes in the cluster.
- Expression levels are shown on the y-axis and time points on the x-axis.
76 Key problems, challenges, recent advances
Azuaje F, "Clustering-based approaches to discovering and visualizing expression patterns", Briefings in Bioinformatics, 4(1), pp. 31-42, 2003. (course material)
77 Conclusions
- Clustering is a basic computational approach to pattern recognition and classification in expression studies
- Several methods are available; it is fundamental to understand the biological problem and the statistical requirements
- There is a need for systematic evaluation and validation frameworks to guide humans and computers to reach their classification goals
- These techniques may support knowledge discovery processes in complex domains such as the molecular classification of cancers
78 Acknowledgement
- Dr. Haiying Wang
- School of Computing and Mathematics
- Faculty of Engineering
- University of Ulster