Clustering analysis of microarray gene expression data

Provided by: suk5

Learn more at: http://www.cis.temple.edu

1
Clustering analysis of microarray gene expression data
Ping Zhang, November 19th, 2008
2
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

3
What is a DNA Microarray?
DNA microarray technology measures the expression
levels of tens of thousands of genes at a time
4
Scanning/Signal Detection
Cy3 channel
Cy5 channel
5
Data-flow schema of microarray data analysis
6
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

7
Gene expression profiles
[Figure: expression profiles. y-axis: expression (level relative to a reference point at 0); x-axis: time/condition]
8
Similarity between Profiles
Similarity measures
Euclidean distance
Correlation coefficient
Trend
The correlation coefficient often works better.

[Figure: expression profile. y-axis: expression (baseline at 0); x-axis: time]
9
Pearson Correlation Coefficient

Compares scaled profiles!
Can detect inverse relationships
Most commonly used

n = number of conditions
$\bar{x}$ = average expression of gene x in all n conditions
$\bar{y}$ = average expression of gene y in all n conditions
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y
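The slide's formula did not survive in this transcript. A common form that matches the symbols defined above is $S(X,Y) = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}$. The sketch below assumes that form, with population standard deviations (dividing by n). The function name and the toy profiles are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n): an assumption of this sketch.
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    # Average product of the standardized (scaled) values.
    return sum(((a - mean_x) / sx) * ((b - mean_y) / sy)
               for a, b in zip(x, y)) / n

# Hypothetical profiles over four conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_scaled = [10.0, 20.0, 30.0, 40.0]   # same shape, larger magnitude
gene_inverse = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_x

r_scaled = pearson(gene_x, gene_scaled)    # profiles are compared after scaling
r_inverse = pearson(gene_x, gene_inverse)  # inverse relationship gives a negative value
```

Because each profile is standardized before comparison, the magnitude difference between `gene_x` and `gene_scaled` does not lower the correlation, and the inverse profile shows up as a strongly negative value.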
10
Correlation Pitfalls
Correlation = 0.97
11
Correlation coefficient
[Figure: scatter plot of Gene Y against Gene X]
S(X,Y) = 0
12
Euclidean Distance

Scaled versus unscaled
Cannot detect inverse relationships

For gene X = (x1, x2, ..., xn) and gene Y = (y1, y2, ..., yn):
$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
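A small sketch of the two points on this slide, assuming "scaled" means standardizing each profile to mean 0 and standard deviation 1 (the slide does not say which scaling is meant). The helper names and the toy profiles are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(x):
    """Scale a profile to mean 0 and (population) standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / sd for a in x]

# Hypothetical profiles over four conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_scaled = [10.0, 20.0, 30.0, 40.0]   # same trend, larger magnitude
gene_inverse = [4.0, 3.0, 2.0, 1.0]      # opposite trend

d_unscaled = euclidean(gene_x, gene_scaled)                          # large: magnitudes differ
d_scaled = euclidean(standardize(gene_x), standardize(gene_scaled))  # near zero after scaling
d_inverse = euclidean(standardize(gene_x), standardize(gene_inverse))  # large: inverse pair looks far apart
```

The inversely related pair ends up far apart even after scaling, which is why Euclidean distance cannot flag inverse relationships the way a correlation coefficient can.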
13
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

14
Data-Mining through Clustering

Assumptions for clustering analysis
The expression level of a gene reflects the gene's
activity.
Genes involved in the same biological process exhibit
a statistical relationship in their expression
profiles.

15
Idea of Clustering

Clustering: group objects into clusters so that
objects in each cluster have similar features
objects in different clusters have dissimilar
features

16
Methods of Clustering

discriminant analysis (Fisher, 1936)

K-means (Lloyd, 1957)

hierarchical clustering

self-organizing maps (Kohonen, 1982)

support vector machines (Vapnik, 1995)

17
Issues in Cluster Analysis

There are many clustering algorithms to choose from
There are many distance/similarity metrics
Which clustering algorithm runs faster and uses
less memory?
How many clusters are there?
Are the clusters stable?
Are the clusters meaningful?

18
Which Clustering Method Should I Use?

What is the biological question?
Do I have a preconceived notion of how many
clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?

19
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

20
K-means clustering for expression profiles
Step 1: Transform the n (genes) x m (experiments)
matrix into an n (genes) x n (genes) distance matrix.
To transform the n x m matrix into an n x n matrix, use
a similarity (distance) metric.
Step 2: Cluster the genes with a k-means
clustering algorithm.
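Step 1 can be sketched as follows, assuming Euclidean distance as the metric (the slide leaves the choice open). The function name and the tiny 4 x 3 expression matrix are illustrative.

```python
import math

def distance_matrix(expr):
    """Turn an n x m expression matrix (list of rows) into an n x n distance matrix."""
    n = len(expr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(expr[i], expr[j])  # Euclidean distance between gene i and gene j
            dist[i][j] = d
            dist[j][i] = d                   # the matrix is symmetric
    return dist

# Hypothetical data: 4 genes measured in 3 experiments.
expression = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 1.0],
    [3.0, 4.0, 1.0],
]
dist = distance_matrix(expression)
```

Any other metric from the earlier slides, for example one minus the correlation coefficient, could be swapped in for `math.dist`.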
21
K-means algorithm
The most popular algorithm for clustering
What is so attractive?

Simple

Fast

Mathematically correct

Invariant to dimension

Easy to implement

22
K-Means Clustering

Basic idea: use the cluster centre (mean) to
represent each cluster
Assign each data element to the closest cluster
(centre).
Goal: minimize the squared error (intra-class
dissimilarity)
There is no hierarchy.
The number of clusters (k) into which the data are
to be grouped must be supplied.
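The squared-error goal above is commonly written as follows. The symbols $C_j$ (the set of genes in cluster j), $\mu_j$ (the mean of cluster j) and $E$ are notation assumed here, not taken from the slides.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```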

23
K-means Clustering Procedure (1)
Initialization 1: Specify the number of clusters k,
for example k = 4
[Figure: expression matrix plotted as points; each point represents a gene]
24
K-means Clustering Procedure (2)
Initialization 2: Randomly assign each gene to
one of the k clusters,
or choose random starting centers
25
K-means Clustering Procedure (3)
Calculate the mean of each cluster
[Figure: cluster means labeled (6,7), (3,4), (3,2), (1,2)]
26
K-means Clustering Procedure (4)
Each gene is reassigned to the nearest cluster
27
K-means Clustering Procedure (5)
Iterate until the means converge
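The five procedure slides can be sketched as the loop below. This is a minimal illustration, not the presenter's implementation: it assumes Euclidean distance, uses the "random starting centers" variant of initialization, and stops when the assignments no longer change. The function name, the `seed` parameter and the toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Initialization: random starting centers
    labels = None
    for _ in range(max_iter):
        # Reassign each gene to the nearest cluster center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:             # assignments stable, so the means have converged
            break
        labels = new_labels
        # Recalculate the mean of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = kmeans(points, k=2)
```

On harder data the result can depend on the random start, so in practice the procedure is often run several times and the run with the smallest squared error is kept.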
28
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

29
Hierarchical clustering (1)
30
Hierarchical clustering (2)
31
Hierarchical Clustering Results
32
Outline

Gene expression
Similarity between gene expression profiles
Concept of clustering
K-Means clustering
Hierarchical clustering
Minimum spanning tree-based clustering

33
Graph Representation

Represent a set of n-dimensional points as a
graph
each data point (gene) is represented as a node
each pair of genes is represented as an edge with a
weight defined by the dissimilarity between the
two genes

[Figure: n-dimensional data points]
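A minimal sketch of this representation, assuming Euclidean distance as the dissimilarity. Edges are stored as `(weight, i, j)` tuples, a layout chosen here for illustration.

```python
import math
from itertools import combinations

def complete_graph(points):
    """Return one weighted edge (weight, i, j) for every pair of points."""
    return [(math.dist(points[i], points[j]), i, j)
            for i, j in combinations(range(len(points)), 2)]

# Hypothetical 2-D points; real expression profiles would have one
# coordinate per experiment.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
edges = complete_graph(points)
```

For n genes this produces n(n-1)/2 edges, so the graph grows quadratically with the number of genes.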
34
Minimum Spanning Tree

Spanning tree: a subgraph that connects all nodes
and has no cycles
Minimum spanning tree (MST): a spanning tree with
the minimum total distance

35
How to Construct a Minimum Spanning Tree

Prim's algorithm and Kruskal's algorithm
Kruskal's algorithm
step 1: select the edge with the smallest distance
from the graph
step 2: add it to the tree as long as no cycle is
formed
step 3: remove the edge from the graph
step 4: repeat steps 1-3 until all nodes are
connected in the tree.
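The Kruskal steps above can be sketched as follows. Sorting the edges once stands in for repeatedly selecting and removing the smallest edge, and a union-find structure (an implementation choice made here, not mentioned on the slide) detects cycles. The example graph is hypothetical.

```python
def kruskal(num_nodes, edges):
    """Return the MST edges of a connected graph given as (weight, u, v) tuples."""
    parent = list(range(num_nodes))

    def find(u):
        # Follow parent links to the root of u's component (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for weight, u, v in sorted(edges):       # steps 1 and 3: smallest remaining edge
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # step 2: the edge does not form a cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
            if len(tree) == num_nodes - 1:   # step 4: all nodes are connected
                break
    return tree

# Hypothetical graph with 4 nodes and 6 weighted edges.
edges = [(4, 0, 1), (8, 1, 2), (7, 2, 3), (10, 0, 3), (14, 0, 2), (5, 1, 3)]
mst = kruskal(4, edges)
total = sum(weight for weight, _, _ in mst)
```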

[Figure (a): example weighted graph; values shown: 4, 8, 7, 10, 14, 5, 3, 6]
36
Foundation of MST Approach

The MST representation significantly simplifies the
data clustering problem while losing very little
information essential for clustering.
We have mathematically proved that

a multi-dimensional clustering problem is
equivalent to a tree-partitioning problem!
37
Clustering by Cutting Long Edges
Hierarchical cutting:
1st cut: the longest edge
2nd cut: the second longest edge
Works well for easy cases.
Produces many single-element clusters for some
difficult cases.
38
Tree-Based Clustering

For each edge, calculate the assessment value
Cut at the edge that gives the minimum assessment
value

Clustering using an iterative method
Finding the global optimum is guaranteed when
using tree-based dynamic programming

39
Clustering through Removing Long MST-Edges

Objective: partition an MST into K subtrees so
that the total edge-distance of all the K
subtrees is minimized
Finding the K-1 longest MST edges and cutting them =>
we get K clusters
This works as long as the inter-cluster
edge-distances are clearly larger than the
intra-cluster edge-distances
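A minimal sketch of this cutting rule, assuming the MST is already available as `(weight, u, v)` tuples. The function name, the label numbering and the example tree are illustrative.

```python
def cut_longest_edges(num_nodes, mst_edges, k):
    """Drop the k-1 longest MST edges and label the resulting k subtrees."""
    # Keep everything except the k-1 longest edges.
    kept = sorted(mst_edges)[:len(mst_edges) - (k - 1)]

    parent = list(range(num_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for _, u, v in kept:                 # rebuild connectivity from the kept edges
        parent[find(u)] = find(v)

    roots = {}
    # Number the subtrees 0, 1, ... in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(num_nodes)]

# Hypothetical MST on 6 genes: one long edge (9.0) joins two tight groups.
mst_edges = [(1.0, 0, 1), (1.2, 1, 2), (9.0, 2, 3), (0.8, 3, 4), (1.1, 4, 5)]
labels = cut_longest_edges(6, mst_edges, k=2)
```

In this example the single long edge is clearly larger than the others, which is exactly the condition under which the slide says the rule works.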

40
An Iterative Clustering Algorithm

Find K subtrees $T_i$ of an MST that minimize
$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$
Informally, the total distance between the center
of each cluster and its data points is minimized
The center c of a cluster C is defined as the point
for which the sum of the distances between c and
all the data points in C is minimized
Does not work well if the cluster boundary is not
convex

41
A Globally Optimal Clustering Algorithm

Given an MST T, partition T into K subtrees $T_i$
and find a set of data points $d_i$, i = 1, ..., K, $d_i$ in
D, that minimize
$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i)$
Informally, group data points around the best
representatives rather than around the center
This algorithm uses dynamic programming

42
Automated Selection of Number of Clusters

Select the transition point in the assessment value
as the correct number of clusters.

43
Transition Profiles

$\mathrm{indicator}[n] = (A_{n-1} - A_n) / (A_n - A_{n+1})$
$A_k$ is the assessment value for the partition with k
clusters
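A small sketch of the transition indicator, reading the formula as the ratio of consecutive drops in the assessment value. Taking the k with the largest indicator as the transition point is an assumption made here for illustration, and the assessment values are invented.

```python
def transition_indicators(assessment):
    """Map k -> (A[k-1] - A[k]) / (A[k] - A[k+1]) wherever both neighbours exist."""
    indicators = {}
    for k in sorted(assessment):
        if k - 1 in assessment and k + 1 in assessment:
            next_drop = assessment[k] - assessment[k + 1]
            if next_drop != 0:           # skip undefined ratios
                indicators[k] = (assessment[k - 1] - assessment[k]) / next_drop
    return indicators

# Hypothetical assessment values A_k for k = 1..6: large drops up to k = 3,
# then only small improvements.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 16.5, 6: 15.5}
indicators = transition_indicators(A)
best_k = max(indicators, key=indicators.get)   # assumed rule: largest indicator
```

A large indicator at k means that going from k-1 to k clusters helped far more than going from k to k+1, which is the kind of transition the slide describes.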

[Figure: our clustering of yeast data]
44
Reference

[1] Ying Xu, Victor Olman, and Dong Xu.
Clustering Gene Expression Data Using a
Graph-Theoretic Approach: An Application of
Minimum Spanning Trees. Bioinformatics,
18:526-535, 2002.
[2] Dong Xu, Victor Olman, Li Wang, and Ying Xu.
EXCAVATOR: a computer program for gene expression
data analysis. Nucleic Acids Research,
31:5582-5589, 2003.
Using slides from:
Michael Hongbo Xie, Temple University (2006)
Vipin Kumar, University of Minnesota
Dong Xu, University of Missouri