Transcript and Presenter's Notes

Title: Approaches to clustering-based analysis and validation


1
Approaches to clustering-based analysis and
validation

Dr. Huiru Zheng, Dr. Francisco Azuaje
School of Computing and Mathematics
Faculty of Engineering
University of Ulster
2
Gene Expression Data
Genes vs perturbations
Tissues vs genes
Expression matrices illustrating two distinct genetic profiles: genes (G), biological conditions (C), types of tissue (T).
3
Clustering
  • At the end of an unsupervised recognition
    process (learning), we obtain a number of classes
    or CLUSTERS
  • Each cluster groups a number of cases/input
    vectors
  • When new input vectors are presented to the
    system, they are categorised into one of the
    existing classes or CLUSTERS

4
Clustering of genes according to their expression
patterns
5
Clustering approaches to classification
Traditional approach
      a1  a2  a3  a4  a5
s1    x   y   z   y   y
s2    x   x   x   y   y
s3    y   y   x   y   y
s4    y   y   y   y   y
s5    y   x   x   y   y

C1 = {s3, s4}    C2 = {s1, s2, s5}
Exclusive clusters
6
Clustering approaches to classification: Direct sample/attribute correlation

      a1  a2  a3  a4  a5
s1    x   y   z   y   y
s2    x   x   x   y   y
s3    y   y   x   y   y
s4    y   y   y   y   y
s5    y   x   x   y   y

C1 = {s3, s4, a1, a2}    C2 = {s1, s2, s5, a4, a5}
Exclusive biclusters
7
Clustering approaches to classification: Multiple cluster membership

      a1  a2  a3  a4  a5
s1    x   y   z   y   y
s2    x   x   x   y   y
s3    y   y   x   y   y
s4    y   y   y   y   y
s5    y   x   x   y   y

C1 = {s3, s4, a1, a2}    C2 = {s1, s2, s3, s5, a4, a5}
8
Key Algorithms
  • Hierarchical Clustering
  • K-means
  • Kohonen Maps
  • Self-adaptive methods

9
Hierarchical Clustering (1)
  • Organizes the data into larger groups, which contain smaller groups, like a tree or dendrogram.
  • Hierarchical methods avoid specifying how many clusters are appropriate by providing a partition for each k; the partitions are obtained by cutting the tree at different levels.
  • The tree can be built in two distinct ways:
  • bottom-up agglomerative clustering
  • top-down divisive clustering.
  • Algorithms:
  • Agglomerative (single-linkage, complete-linkage, average-linkage).

10
Hierarchical Clustering (2)
[Figure: dendrogram over genes; the vertical axis indicates degrees of dissimilarity]
11
Hierarchical Clustering (3)
  • P = set of genes
  • While more than one subtree remains in P:
  • Pick the most similar pair i, j in P
  • Define a new subtree k joining i and j
  • Remove i and j from P and insert k (a Python sketch of this loop follows)
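A minimal Python sketch of this loop, assuming single-linkage as the similarity criterion and a precomputed pairwise distance matrix over the genes; function and variable names are illustrative rather than taken from the slides.

```python
import numpy as np

def agglomerative_clustering(dist):
    """Bottom-up clustering over a symmetric (n, n) distance matrix.

    Returns the list of merges as (subtree_a, subtree_b, distance).
    """
    # P: every gene starts as its own subtree (a list of gene indices).
    P = {i: [i] for i in range(len(dist))}
    merges = []
    while len(P) > 1:
        # Pick the most similar pair i, j in P (single linkage:
        # cluster distance = minimum distance between their members).
        d, i, j = min(
            (min(dist[a, b] for a in P[i] for b in P[j]), i, j)
            for i in P for j in P if i < j
        )
        # Define a new subtree k joining i and j; remove i and j, insert k.
        k = max(P) + 1
        P[k] = P.pop(i) + P.pop(j)
        merges.append((i, j, d))
    return merges
```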

12
Figures of Hierarchical Clustering
[Figure: agglomerative merging of points 1-5, step 1]
13
Figures of Hierarchical Clustering
[Figure: agglomerative merging of points 1-5, step 2]
14
Figures of Hierarchical Clustering
[Figure: agglomerative merging of points 1-5, step 3]
15
Figures of Hierarchical Clustering
[Figure: agglomerative merging of points 1-5, final step of the sequence]
16
Hierarchical Clustering
Method             Distance metric between two clusters
Single-linkage     minimum distance between any pair of members
Complete-linkage   maximum distance between any pair of members
Average-linkage    mean distance over all pairs of members
Centroid           distance between the cluster centroids
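The definitions behind this table can be expressed as small distance functions. The sketch below assumes each cluster is a NumPy array of expression profiles and uses Euclidean distance between individual profiles; the function names are illustrative.

```python
import numpy as np

def _pairwise(A, B):
    # Euclidean distances between every profile in cluster A and cluster B.
    return [np.linalg.norm(a - b) for a in A for b in B]

def single_linkage(A, B):
    return min(_pairwise(A, B))              # closest pair of members

def complete_linkage(A, B):
    return max(_pairwise(A, B))              # farthest pair of members

def average_linkage(A, B):
    return float(np.mean(_pairwise(A, B)))   # mean over all member pairs

def centroid_linkage(A, B):
    # Distance between the two cluster centroids (mean profiles).
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```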
17
An Example of Hierarchical Clustering
18
Partitional clustering
  • To create one set of clusters that partitions the
    data into similar groups.
  • Algorithms: Forgy's, k-means, Isodata

19
K-means Clustering
  • A value for k, the expected number of clusters, is selected up front
  • The algorithm divides the data into k clusters in such a way that the profiles within each cluster are more similar to each other than to profiles in other clusters.

20
K-means approach
  • One more input, k, is required. There are many variants of k-means.
  • Sum-of-squares criterion:
  • minimize E = Σ(j=1..k) Σ(x in Cj) ||x − μj||², where μj is the centroid of cluster Cj

21
An example of the k-means approach
  • Two passes
  • Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n-k samples, find the centroid nearest it and assign the sample to that cluster. After each sample is assigned, re-compute the centroid of the altered cluster.
  • For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. (Centroids do not need to be re-computed in this pass; a sketch of both passes follows.)
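A short Python sketch of these two passes; the NumPy representation, the Euclidean metric and the function name are assumptions made for illustration.

```python
import numpy as np

def two_pass_kmeans(X, k):
    """X: (n, d) data matrix; k: expected number of clusters."""
    n = len(X)
    labels = np.full(n, -1)
    labels[:k] = np.arange(k)
    # Pass 1: seed k clusters with the first k samples, then assign the
    # remaining n-k samples one by one, updating the altered centroid.
    centroids = X[:k].astype(float).copy()
    for i in range(k, n):
        j = int(np.argmin(np.linalg.norm(centroids - X[i], axis=1)))
        labels[i] = j
        centroids[j] = X[labels == j].mean(axis=0)
    # Pass 2: re-assign every sample to its nearest centroid,
    # without re-computing the centroids.
    for i in range(n):
        labels[i] = int(np.argmin(np.linalg.norm(centroids - X[i], axis=1)))
    return labels, centroids
```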

22
Examples
23
Kohonen Self-Organising Maps (SOM)
  • The aim of Kohonen learning is to map similar
    signals/input-vectors/cases to similar neurone
    positions
  • Neurones or nodes that are physically adjacent
    in the network encode patterns or inputs that are
    similar

24
SOM architecture (1)
[Figure: Kohonen layer of neurones; an input vector X is presented to the network and the winning neurone i has weight vector wi]
X = (x1, x2, ..., xn) ∈ R^n,  wi = (wi1, wi2, ..., win) ∈ R^n
25
SOM architecture (2)
A rectangular grid of neurones representing a
Kohonen map. Lines are used to link neighbouring neurones.
26
SOM architecture (3)
2-dimensional representation of random weight
vectors. The lines are drawn to connect neurones
which are physically adjacent.
27
SOM architecture (4)
2-dimensional representation of 6 input vectors
(a training data set)
28
SOM architecture (5)
In a well trained (ordered) network the diagram
in the weight space should have the same topology
as that in physical space and will reflect the
properties of the training data set.
29
SOM architecture (6)
Input space (training data set)
Weight vector representations after training
30
SOM-based Clustering (1)
  • Type of input
  • a) The input a neural network can process is a
    vector of fixed length.
  • b) This means that only numbers can be used as input and that one must set up the network in such a way that the longest input vector can be processed. This also means that vectors with fewer elements must be padded with extra elements until they have the same size as the longest vector.

31
SOM-based Clustering (2)
  • Classification of inputs
  • a) In a Kohonen network, each neurone is
    represented by a so-called weight vector
  • b) During training these vectors are adjusted to
    match the input vectors in such a way that after
    training each of the weight vectors represents a
    certain class of input vectors
  • c) If in the test phase a vector is presented as
    input, the weight vector which represents the
    class this input vector belongs to, is given as
    output, i.e. the neurone is activated.

32
SOM-based Clustering (3)
  • Learning (training) behaviour.
  • a) During training (learning) the neurones of a Kohonen network are adjusted in such a way that regions form on the map consisting of neurones with similar weight vectors.
  • b) This means that in a well-trained map, a class will not be represented by one single neurone, but by a group of neurones.
  • c) In this group there is one central neurone which can be said to represent the most prototypical member of this class, while the surrounding neurones represent less prototypical members.

33
SOM Learning Algorithm
  • SOMs define a mapping from an m-dimensional input data space onto a one- or two-dimensional array of nodes
  • Algorithm
  • 1. initialize the network with n nodes
  • 2. select one case from the set of training
    cases
  • 3. find the node in the network that is closest
    (according to some measure of distance) to the
    selected case
  • 4. adjust the set of weights of the closest node
    and of the nodes around it
  • 5. repeat from 2. until some termination
    criterion is reached.

34
SOM One single learning cycle (1)
  • 1) The weights are initialised to random values (in the interval -0.1 to 0.1, for instance) and the neighbourhood sizes are set to cover over half of the network
  • 2) An m-dimensional input vector Xs (scaled between -1 and 1, for instance) enters the network
  • 3) The distances di(Wi, Xs) between all the weight vectors on the SOM and Xs are calculated by using (for instance) the Euclidean distance
  • di(Wi, Xs) = sqrt( Σj (wj − xj)² )
  • where
  • Wi denotes the ith weight vector
  • wj and xj represent the jth elements of Wi and Xs respectively

35
SOM One single learning cycle (2)
4) Find the best matching or winning neurone, whose weight vector Wk is closest to the current input vector Xs
5) Modify the weights of the winning neurone and of all the neurones in the neighbourhood Nk by applying
   Wj(new) = Wj(old) + α (Xs − Wj(old))
   where α represents the learning rate
6) The next input vector is presented and the process is repeated.
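A compact Python sketch of steps 3) to 5) on a two-dimensional map, assuming Euclidean distance and a square neighbourhood of a given radius; alpha stands for the learning rate (written α above), and all names are illustrative.

```python
import numpy as np

def som_learning_cycle(weights, x, alpha=0.1, radius=2):
    """One learning cycle of a SOM.

    weights: (rows, cols, m) array of weight vectors on the map.
    x:       m-dimensional input vector (one training case).
    alpha:   learning rate.
    radius:  neighbourhood radius, in grid units, around the winner.
    """
    rows, cols, _ = weights.shape
    # 3) Distances between every weight vector and the input vector.
    d = np.linalg.norm(weights - x, axis=2)
    # 4) Winning neurone: the one whose weight vector is closest to x.
    r, c = np.unravel_index(np.argmin(d), d.shape)
    # 5) Update the winner and all neurones in its neighbourhood Nk:
    #    W_new = W_old + alpha * (x - W_old)
    for i in range(max(0, r - radius), min(rows, r + radius + 1)):
        for j in range(max(0, c - radius), min(cols, c + radius + 1)):
            weights[i, j] += alpha * (x - weights[i, j])
    return r, c
```

One learning epoch then corresponds to calling this once for each of the P training vectors; as the following slides note, the neighbourhood radius and the learning rate are decreased after N and M epochs respectively.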
36
SOM Learning Parameters
  • If a data set consists of P input vectors or
    cases, then 1 learning epoch is equal to P single
    learning cycles
  • After a number of N learning epochs, the size of
    the neighbourhood is decreased.
  • After a number of M learning epochs, the learning rate, α, may be decreased

37
SOM Neighbourhood Schemes(1)
  • Linear

38
SOM Neighbourhood Schemes(2)
  • Rectangular

First neighbourhood
Second neighbourhood
39
SOM Neighbourhood Schemes(3)
Why do we have to modify the size of the neighbourhood?
  • We need to induce map formation by adapting
    regions according to the similarity between
    weights and input vectors
  • We need to ensure that neighbourhoods are
    adjacent
  • Thus, a neighbourhood will represent a number of
    similar clusters or neurones
  • By starting with a large neighbourhood we
    guarantee that a GLOBAL ordering takes place,
    otherwise there may be more than one region on
    the map encoding a given part of the input space.

40
SOM Neighbourhood Schemes(4)
  • One good strategy is to gradually reduce the size
    of the neighbourhood for each neurone to zero
    over a first part of the learning phase, during
    the formation of the map topography
  • and then continue to modify only the weight
    vectors of the winning neurones to pick up the
    fine details of the input space

41
SOM Learning rate
Why do we need to decrease the learning rate α?
  • If the learning rate α is kept constant, it is possible for weight vectors to oscillate back and forth between two nearby positions
  • Lowering α ensures that this does not occur and that the network is stable.

42
Visualising data and clusters with Kohonen maps
U-matrix and median distance matrix maps for leukaemia data. The U-matrix holds distances between neighbouring map units.
43
Basic Criteria For The Selection Of Clustering
Techniques (1)
  • Which clustering algorithm should I use?
  • Should I apply an alternative solution?
  • How can results be improved by using different
    methods?

44
Basic Criteria For The Selection Of Clustering
Techniques (2)
  • There are multiple clustering techniques that can
    be used to analyse expression data.
  • Choosing the best algorithm for a particular
    problem may represent a challenging task.
  • Advantages and limitations may depend on factors
    such as the statistical nature of the data,
    pre-processing procedures, number of features
    etc.
  • It is not uncommon to observe inconsistent
    results when different clustering methods are
    tested on a particular data set

45
Basic Criteria For The Selection Of Clustering
Techniques (3)
  • In order to make an appropriate choice, it is
    important to have a good understanding of
  • the problem domain under study, and
  • the clustering options available.

46
Basic Criteria For The Selection Of Clustering
Techniques (4)
  • Knowledge on the underlying biological problem
    may allow a scientist to choose a tool that
    satisfies certain requirements, such as the
    capacity to detect overlapping classes.
  • Knowledge on the mathematical properties of a
    clustering technique may support the selection
    process.
  • How does this algorithm represent similarity (or dissimilarity)?
  • How much relevance does it assign to cluster heterogeneity?
  • How does it implement the process of measuring cluster isolation?
  • Answers to these questions may indicate crucial
    directions for the selection of an adequate
    clustering algorithm.

47
Basic Criteria For The Selection Of Clustering
Techniques (5)
  • Empirical studies have defined several
    mathematical criteria of acceptability
  • For example, there may be clustering algorithms
    that are capable of guaranteeing the generation
    of partitions whose cluster structures do not
    intersect.
  • Several algorithms indirectly assume that the
    cluster structure of the data under consideration
    exhibits particular characteristics.
  • For instance, the k-means algorithm assumes that
    the shape of the clusters is spherical and
    single-linkage hierarchical clustering assumes
    that the clusters are well separated

48
Basic Criteria For The Selection Of Clustering
Techniques (6)
  • Unfortunately, this type of knowledge may not
    always be available in an expression data study.
  • In this situation a solution may be to test a
    number of techniques on related data sets, which
    have previously been classified (a reference data
    set).
  • Thus, a user may choose a clustering method if it produced consistent categorisation results in relation to such a reference data set.

49
Basic Criteria For The Selection Of Clustering
Techniques (7)
  • Specific user requirements may also influence a
    selection decision.
  • For example, a scientist may be interested in
    observing direct relationships between classes
    and subclasses in a data partition. In this case,
    a hierarchical clustering approach may represent
    a basic solution.
  • But in some studies hierarchical clustering
    results could be difficult to visualise because
    of the number of samples and features involved.
    Thus, for instance, a SOM may be considered to
    guide an exploratory analysis of the data.

50
Basic Criteria For The Selection Of Clustering
Techniques (8)
  • In general the application of two or more
    clustering techniques may provide the basis for
    the synthesis of accurate and reliable results.
  • A scientist may be more confident about the
    clustering experiments if very similar results
    are obtained by using different techniques.

51
Clustering approaches to classification Key
experimental factors
  • Type of clustering algorithm
  • Number of experiments (partitions)
  • Number of clustering (learning) cycles in an
    algorithm
  • In knowledge discovery (KD) applications the number of classes may not be known a priori
  • The number of clusters in each experiment

52
The problem of assessing cluster validity and
evaluation
Partition 1 (2 clusters)
Partition 2 (3 clusters)
Partition 3 (4 clusters)
Partition n (3 clusters)
53
Clustering and cluster validity assessment
Data on a 2D space
54
Cluster validity assessment
[Figure: two partitions of the same data, obtained with Method 1 (c = 2) and Method 2 (c = 4)]
55
Cluster validity assessment Key questions
  • Is this a relevant partition?
  • Should we analyse these clusters?
  • Is there a better partition?
  • Which clustering method should we apply ?
  • Is this relevant information from a biological
    point of view ?

56
Cluster validity assessment
  • Quality indices
  • Maximize or minimize indices
  • Quality factors: compactness, heterogeneity, isolation, shape
  • The best or correct partition is the one that maximizes or minimizes an index

57
Cluster validity assessment based on a quality
index I
Partition 1
(2 clusters)
Partition 2
(3 clusters)
Partition 3
(4 clusters)
P1: I1 = 0.1    P2: I2 = 0.9    P3: I3 = 0.5
P2 is the best/correct partition
Analyze and interpret P2
58
Cluster Validity assessment
δ(Xi, Xj): intercluster distance
Δ(Xk): intracluster distance
[Figure: two clusters X1 and X2, showing the intercluster distance δ(X1, X2) and the intracluster distance Δ(X1)]
59
Cluster Validity assessment
There are different ways to calculate δ(Xi, Xj) and Δ(Xk)
60
Cluster validity assessment (I): Dunn's index
  • This index aims at identifying sets of clusters that are compact and well separated
  • For any partition U, where X = X1 ∪ ... ∪ Xc, Dunn's validation index V(U) is defined (in its standard form) as
  • V(U) = min over i { min over j ≠ i { δ(Xi, Xj) / max over k Δ(Xk) } }
  • δ(Xi, Xj): intercluster distance between clusters Xi and Xj; Δ(Xk): intracluster distance of cluster Xk; c: number of clusters of partition U
  • Large values of V correspond to good clusters
  • The number of clusters that maximises V is taken as the optimal number of clusters, c.
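A direct Python sketch of this definition, assuming clusters are given as lists of NumPy vectors, with δ taken as the closest pair of points between two clusters and Δ as the complete diameter; as later slides show, other δ/Δ combinations give rise to different variants of the index.

```python
import numpy as np
from itertools import combinations

def delta(Xi, Xj):
    # Intercluster distance: closest pair of points across the two clusters.
    return min(np.linalg.norm(a - b) for a in Xi for b in Xj)

def big_delta(Xk):
    # Intracluster distance: complete diameter (farthest pair within Xk).
    return max(np.linalg.norm(a - b) for a, b in combinations(Xk, 2))

def dunn_index(clusters):
    # clusters: list of clusters, each a list of NumPy vectors (>= 2 points each).
    max_diam = max(big_delta(Xk) for Xk in clusters)
    return min(delta(Xi, Xj) for Xi, Xj in combinations(clusters, 2)) / max_diam
```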

61
Cluster validity assessment (II): the Davies-Bouldin index
  • In its standard form, DB = (1/c) Σi max over j ≠ i { (Δ(Xi) + Δ(Xj)) / δ(Xi, Xj) }
  • Small values of DB correspond to good clusters (the clusters are compact and their centres are far away from each other)
  • The cluster configuration that minimizes DB is taken as the optimal number of clusters, c.
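A matching Python sketch under one common formulation, where the intracluster term is the mean distance of members to their centroid and the intercluster term is the centroid-to-centroid distance; the DB11, DB21, ... variants reported below swap in different distance definitions.

```python
import numpy as np

def davies_bouldin(clusters):
    """clusters: list of (n_k, d) NumPy arrays, one array per cluster."""
    centroids = [X.mean(axis=0) for X in clusters]
    # Intracluster term: mean distance of members to their own centroid.
    scatter = [float(np.mean(np.linalg.norm(X - c, axis=1)))
               for X, c in zip(clusters, centroids)]
    c = len(clusters)
    total = 0.0
    for i in range(c):
        # Worst-case ratio of combined scatter to centroid separation.
        total += max((scatter[i] + scatter[j]) /
                     np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(c) if j != i)
    return total / c
```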

62
Cluster validity assessment: obtaining the partitions
Leukemia data, 2 clusters
[Figure: samples grouped into clusters A and B, corresponding to the AML and ALL classes]
63
Cluster validity assessment: obtaining the partitions
Leukemia data, 4 clusters
[Figure: samples grouped into clusters A, B, C and D; reference classes AML, T-ALL and B-ALL]
64
Cluster validity assessment: Davies-Bouldin indexes - leukemia data
  • DB11: using inter-cluster distance 1 and intra-cluster distance 1 (complete diameter)
  • DB21: using inter-cluster distance 2 and intra-cluster distance 1

Bold entries represent the optimal number of
clusters, c, predicted by each index.
65
Cluster validity assessment: Dunn's indexes - leukemia data
Bold entries represent the optimal number of
clusters, c, predicted by each index.
66
Cluster validity assessment: Aggregation of Dunn's indices - leukemia data
67
Cluster validity assessment: Aggregation of Davies-Bouldin indexes - leukemia data
68
Cluster validity assessment
  • Different intercluster/intracluster distance
    combinations may produce validation indices of
    different scale ranges.
  • Indices with higher values may have a stronger
    effect on the calculation of the average index
    values.
  • This may result in a biased prediction of the
    optimal number of clusters.

69
Cluster validity assessment
An approach to predicting the optimal partition is to implement an aggregation method based on a weighted voting strategy (one simple variant is sketched below).
Leukaemia data and Davies-Bouldin validation index
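The slides do not spell the aggregation scheme out, so the sketch below is only one plausible weighted-voting approach, not necessarily the one used here: each index configuration votes for the number of clusters it prefers, with its vote weighted by how decisively it prefers that value after rescaling the configuration's indices to [0, 1], so that index combinations with larger raw value ranges do not dominate.

```python
import numpy as np

def weighted_vote_optimal_c(index_tables, minimise=True):
    """index_tables: {configuration name: {candidate c: index value}}.

    Assumes at least two candidate values of c per configuration.
    Returns the number of clusters c receiving the largest total vote.
    """
    votes = {}
    for table in index_tables.values():
        cs = sorted(table)
        vals = np.array([table[c] for c in cs], dtype=float)
        if not minimise:
            vals = -vals                          # treat maximisation as minimisation
        rng = vals.max() - vals.min()
        vals = (vals - vals.min()) / (rng if rng else 1.0)  # rescale to [0, 1]
        order = np.argsort(vals)
        best_c = cs[order[0]]                     # this configuration's preferred c
        weight = vals[order[1]] - vals[order[0]]  # margin over the runner-up
        votes[best_c] = votes.get(best_c, 0.0) + weight
    return max(votes, key=votes.get)
```

For a maximised index such as Dunn's, minimise=False would be passed instead.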
70
Cluster validity assessment
Effect of the distance metric on the prediction
process
Dunn's validity indexes and leukemia data
71
Cluster validity assessment -
Effect of the distance metric on the prediction
process
Dunn's validity indexes and DLBCL data
72
Visualisation - Dendrograms
  • Dendrograms are often used to visualize the
    nested sequence of clusters resulting from
    hierarchical clustering.
  • While dendrograms are quite appealing because of
    their apparent ease of interpretation, they can
    be misleading.
  • First, the dendrogram corresponding to a given hierarchical clustering is not unique: since for each merge one needs to specify which subtree should go on the left and which on the right, there are 2^(n-1) different dendrograms.
  • A second, and perhaps less recognized, shortcoming of dendrograms is that they impose structure on the data, instead of revealing structure in these data.

73
Dendrograms
  • Such a representation will be valid only to the
    extent that the pairwise distances possess the
    hierarchical structure imposed by the clustering
    algorithm.

74
Dendrograms - example
  • Genes correspond to the rows, and the time points
    of each experiment are the columns.
  • The ratio of expression is color coded
  • Red: upregulated
  • Green: downregulated
  • Black: no change
  • Grey: missing data

75
Visualisation - maps
  • The 828 genes were grouped into 30 clusters.
  • Each cluster is represented by the centroid for
    genes in the clusters.
  • Expression levels are shown on the y-axis and time points on the x-axis

76
Key problems, challenges, recent advances
Azuaje F, Clustering-based approaches to
discovering and visualizing expression patterns,
Briefings in Bioinformatics, 4 (1), pp. 31- 42,
2003. (course material)
77
Conclusions
  • Clustering is a basic computational approach to
    pattern recognition and classification in
    expression studies
  • Several methods are available; it is fundamental to understand the biological problem and statistical requirements
  • There is the need for systematic evaluation and
    validation frameworks to guide humans and
    computers to reach their classification goals
  • These techniques may support knowledge discovery
    processes in complex domains such as the
    molecular classification of cancers

78
Acknowledgement
  • Dr. Haiying Wang
  • School of Computing and Mathematics
  • Faculty of Engineering
  • University of Ulster