Title: Clustering analysis of microarray gene expression data
1Clustering analysis of microarray gene expression data
Ping Zhang November 19th, 2008
2Outline
- Gene expression
- Similarity between gene expression profiles
- Concept of clustering
- K-Means clustering
- Hierarchical clustering
- Minimum spanning tree-based clustering
3What is a DNA Microarray?
DNA microarray technology measures the expression
levels of tens of thousands of genes at a time
4Scanning/Signal Detection
Cy3 channel
Cy5 channel
5Data-flow schema of microarray data analysis
7Gene expression profiles
[Figure: gene expression profiles; y-axis: expression (levels relative to a reference point at 0); x-axis: time/condition]
8Similarity between Profiles
- Similarity measures: Euclidean distance, correlation coefficient, trend
- The correlation coefficient often works better.
[Figure: expression profile; y-axis: expression (reference at 0); x-axis: time]
9Pearson Correlation Coefficient
- Compares scaled profiles!
- Can detect inverse relationships
- Most commonly used
$n$ = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y
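The slide's formula image did not survive extraction. The symbols defined above match the usual form of the Pearson coefficient, $S(X,Y) = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i-\bar{x}}{s_x}\cdot\frac{y_i-\bar{y}}{s_y}$, which the sketch below computes. It assumes population standard deviations (dividing by n); the function name and the sample profiles are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n): an assumption of this sketch.
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Average product of the scaled (z-scored) profiles.
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / n

# Illustrative profiles (made-up numbers, not real data).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same trend, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # inverse trend

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```

Because the profiles are scaled before comparison, gene_a and gene_b correlate perfectly despite their different magnitudes, while the inverse profile gene_c gives a coefficient at the opposite end of the range.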
10Correlation Pitfalls
Correlation = 0.97
11Correlation coefficient
[Figure: Gene Y plotted against Gene X; S(X,Y) = 0]
12Euclidean Distance
- Scaled versus unscaled
- Cannot detect inverse relationships
For Gene X = $(x_1, x_2, \ldots, x_n)$ and Gene Y = $(y_1, y_2, \ldots, y_n)$: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
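The two bullets above can be illustrated with a small sketch. It assumes that "scaled" means standardizing each profile to zero mean and unit (population) standard deviation; the helper names and the sample profiles are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(x):
    """Scale a profile to zero mean and unit (population) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if sd == 0:
        raise ValueError("cannot standardize a constant profile")
    return [(v - mean) / sd for v in x]

# Illustrative profiles (made-up numbers).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # inverse of gene_a

unscaled = euclidean(gene_a, gene_b)
scaled = euclidean(standardize(gene_a), standardize(gene_b))
inverse = euclidean(standardize(gene_a), standardize(gene_c))
print(unscaled, scaled, inverse)
```

Unscaled, the two same-shaped profiles look far apart; after scaling their distance collapses to zero. The inverse profile stays far away even after scaling, which is why Euclidean distance does not group inversely related genes.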
14Data-Mining through Clustering
- Assumptions for clustering analysis
- The expression level of a gene reflects the gene's activity.
- Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.
15Idea of Clustering
- Clustering: group objects into clusters so that
- objects in each cluster have similar features
- objects of different clusters have dissimilar
features
16Methods of Clustering
- discriminant analysis (Fisher,1931)
- self-organizing maps (Kohonen, 1980)
- support vector machines (Vapnik, 1985)
17Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?
18Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters
20K-means clustering for expression profiles
Step 1: Transform the n (genes) × m (experiments)
matrix into an n (genes) × n (genes) distance matrix.
To transform the n × m matrix into the n × n matrix, use
a similarity (distance) metric.
Step 2: Cluster the genes with a k-means
clustering algorithm
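Step 1 can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the metric; the function names and the tiny 3 × 3 expression matrix are made up for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(expr, metric=euclidean):
    """Turn an n x m expression matrix (list of rows) into an n x n distance matrix."""
    n = len(expr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(expr[i], expr[j])
            dist[i][j] = d      # the matrix is symmetric,
            dist[j][i] = d      # so fill both halves at once
    return dist

# Hypothetical data: 3 genes (rows) measured in 3 experiments (columns).
expr = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 1.0],
]
dist = distance_matrix(expr)
for row in dist:
    print(row)
```

Any other metric with the same two-argument shape can be passed as `metric`. Many k-means implementations work directly on the n × m profiles and compute distances to the cluster centres on the fly, so the full n × n matrix is not always materialized.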
21K-means algorithm
The most popular algorithm for clustering
What is so attractive?
22K-Means Clustering
- Basic idea: use the cluster centre (mean) to represent each cluster
- Assign each data element to the closest cluster (centre).
- Goal: minimize the squared error (intra-class dissimilarity)
- There is no hierarchy.
- Must supply the number of clusters (k) into which the data are to be grouped.
23K-means Clustering Procedure (1)
Initialization 1: Specify the number of clusters k,
for example k = 4
[Figure: expression matrix; each point represents a gene]
24K-means Clustering Procedure (2)
Initialization 2: Randomly assign each gene to
one of the k clusters,
or choose random starting centers
25K-means Clustering Procedure (3)
Calculate the mean of each cluster
[Figure: points labelled (6,7), (3,4), (3,2), and (1,2)]
26K-means Clustering Procedure (4)
Each gene is reassigned to the nearest cluster
27K-means Clustering Procedure (5)
Iterate until the means converge
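The five-step procedure above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses the "random starting centers" variant of the initialization, Euclidean distance, and stops when assignments no longer change. The `init_centers` parameter and the sample points are assumptions added to make the example reproducible.

```python
import math
import random

def kmeans(points, k, init_centers=None, seed=0, max_iter=100):
    """Basic k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialization: use the given centers, or pick k random points.
    centers = [list(c) for c in (init_centers or rng.sample(points, k))]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Reassign each gene to the nearest cluster centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:      # no change, so the means have converged
            break
        labels = new_labels
        # Recalculate the mean of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # keep the old centre if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical 2-D "genes" forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = kmeans(points, k=2, init_centers=[(1, 1), (8, 8)])
print(labels)
print(centers)
```

With random starting centers the result can depend on the seed, so in practice the algorithm is often run several times and the partition with the smallest squared error is kept.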
29Hierarchical clustering (1)
30Hierarchical clustering (2)
31Hierarchical Clustering Results
33Graph Representation
- Represent a set of n-dimensional points as a graph
- Each data point (gene) is represented as a node
- Each pair of genes is represented as an edge with a weight defined by the dissimilarity between the two genes
[Figure: n-D data points]
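A minimal sketch of this representation, assuming Euclidean distance as the dissimilarity; the `(weight, i, j)` edge layout and the sample points are choices made for the illustration.

```python
import math
from itertools import combinations

def build_graph(points):
    """Return one weighted edge (weight, i, j) for every pair of points."""
    return [
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    ]

# Hypothetical data points; node i is points[i].
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
edges = build_graph(points)
for weight, i, j in edges:
    print(i, j, weight)
```

A set of n points yields n(n-1)/2 edges, so the complete graph grows quadratically with the number of genes.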
34Minimum Spanning Tree
- Spanning tree: a sub-graph that has all nodes connected and has no cycles
- Minimum spanning tree (MST): a spanning tree with the minimum total distance
35How to Construct a Minimum Spanning Tree
- Prim's algorithm and Kruskal's algorithm
- Kruskal's algorithm
- Step 1: select the edge with the smallest distance from the graph
- Step 2: add it to the tree as long as no cycle is formed
- Step 3: remove the edge from the graph
- Step 4: repeat steps 1-3 until all nodes are connected in the tree
[Figure (a): example graph with edge weights 4, 8, 7, 10, 14, 5, 3, 6]
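Kruskal's algorithm can be sketched as below. Sorting the edges once stands in for the repeated "select the smallest edge, then remove it" steps, and a union-find structure is used for the cycle check; both are common implementation choices rather than something the slides specify. The example graph is hypothetical (the topology of the slide's figure is not recoverable).

```python
def kruskal(n_nodes, edges):
    """Return the MST edges of a connected graph given as (weight, u, v) tuples."""
    parent = list(range(n_nodes))

    def find(u):
        # Find the representative of u's component (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for weight, u, v in sorted(edges):       # smallest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # different components: no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
            if len(tree) == n_nodes - 1:     # all nodes are connected
                break
    return tree

# Hypothetical 4-node graph.
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3), (5, 1, 3)]
mst = kruskal(4, edges)
print(mst)
```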
36Foundation of MST Approach
- Significantly simplifies the data clustering problem, while losing very little essential information for clustering.
- We have mathematically proved: a multi-dimensional clustering problem is equivalent to a tree-partitioning problem!
37Clustering by Cutting Long Edges
Hierarchical cutting: the 1st cut removes the longest edge, the 2nd cut removes the second-longest edge.
Works well for easy cases.
Produces many single-element clusters in some difficult cases.
38Tree-Based Clustering
- For each edge, calculate the assessment value
- Find the edge that gives the minimum assessment value as the place to cut
- Clustering using an iterative method
- Guaranteed to find the global optimum
- Using tree-based dynamic programming
39Clustering through Removing Long MST-Edges
- Objective: partition an MST into K subtrees so that the total edge-distance of all the K subtrees is minimized
- Finding the K-1 longest MST-edges and cutting them gives K clusters
- This works as long as the inter-cluster edge-distances are clearly larger than the intra-cluster edge-distances
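A minimal sketch of this edge-removal approach, assuming the MST is already available as a list of `(weight, u, v)` tuples. The function name and the sample tree are hypothetical.

```python
def clusters_by_cutting(n_nodes, mst_edges, k):
    """Cut the k-1 longest MST edges and return a cluster label for each node."""
    if not 1 <= k <= n_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    # Keep everything except the k-1 longest edges.
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]

    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Nodes joined by the remaining edges end up in the same subtree.
    for _, u, v in kept:
        parent[find(u)] = find(v)

    # Number the subtrees 0, 1, 2, ... in order of first appearance.
    label_of_root = {}
    return [label_of_root.setdefault(find(u), len(label_of_root))
            for u in range(n_nodes)]

# Hypothetical MST on 5 nodes: one long edge separates two tight groups.
mst_edges = [(1.0, 0, 1), (1.5, 1, 2), (9.0, 2, 3), (1.2, 3, 4)]
labels = clusters_by_cutting(5, mst_edges, k=2)
print(labels)
```

In the example the single long edge (weight 9.0) is the inter-cluster edge, so cutting it separates nodes 0-2 from nodes 3-4. When inter-cluster and intra-cluster edge lengths overlap, this simple rule can split off single outlying points instead.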
40An Iterative Clustering Algorithm
- Find K subtrees $T_i$ of an MST that minimize $\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$
- Informally, the total distance between the center of each cluster and its data points is minimized
- The center c of a cluster C is defined as the point for which the sum of the distances between c and all the data points in C is minimized
- Does not work well if the cluster boundary is not convex
41A Globally Optimal Clustering Algorithm
- Given an MST T, partition T into K subtrees $T_i$ and find a set of data points $d_i$, $i = 1, \ldots, K$, $d_i \in D$, that minimize $\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i)$
- Informally, group data points around the best representatives rather than around the center
- Uses dynamic programming
42Automated Selection of Number of Clusters
- Select the transition point in the assessment value as the correct number of clusters.
43Transition Profiles
- $\mathrm{indicator}[n] = (A_{n-1} - A_n) / (A_n - A_{n+1})$
- $A_k$ is the assessment value for the partition with k clusters
[Figure: our clustering of yeast data]
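The indicator compares the drop in assessment value just before n clusters with the drop just after. The sketch below computes it for a run of consecutive cluster counts. Taking the largest indicator as the transition point is an assumption of this example, and the assessment values are made up for illustration, not taken from the yeast data.

```python
def transition_indicators(assessment):
    """Map n -> (A[n-1] - A[n]) / (A[n] - A[n+1]) for each interior n.

    `assessment` maps consecutive cluster counts k to assessment values A_k.
    """
    ks = sorted(assessment)
    if ks != list(range(ks[0], ks[-1] + 1)):
        raise ValueError("cluster counts must be consecutive integers")
    indicators = {}
    for n in ks[1:-1]:
        drop_before = assessment[n - 1] - assessment[n]
        drop_after = assessment[n] - assessment[n + 1]
        if drop_after != 0:          # skip flat steps to avoid dividing by zero
            indicators[n] = drop_before / drop_after
    return indicators

# Hypothetical assessment values: large gains up to 3 clusters, small gains after.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 17.0, 6: 16.5}
ind = transition_indicators(A)
best_k = max(ind, key=ind.get)
print(ind)
print(best_k)
```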
44Reference
- [1] Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics. 18:526-535, 2002.
- [2] Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research. 31:5582-5589, 2003.
- Using slides from
- Michael Hongbo Xie, Temple University (in 2006)
- Vipin Kumar, University of Minnesota
- Dong Xu, University of Missouri
45Acknowledgement