PCluster: Probabilistic Agglomerative Clustering of Gene Expression Profiles (Presentation Transcript)
1
PCluster: Probabilistic Agglomerative Clustering
of Gene Expression Profiles
  • Nir Friedman

Presenting: Inbar Matarasso, 09/05/2005
The School of Computer Science, Tel Aviv University
2
Outline
  • A little about clustering
  • Mathematics background
  • Introduction
  • The problem
  • Notation
  • Scoring Method
  • Agglomerative clustering
  • Double clustering
  • Conclusion

3
A little about clustering
  • Partition entities (genes) into groups called
    clusters (according to similarity in their
    expression profiles across the probed
    conditions).
  • Clusters are homogeneous and well-separated.
  • Clustering problems arise in numerous
    disciplines, including biology, medicine,
    psychology, and economics.

4
Clustering: why?
  • Reduce the dimensionality of the problem:
    identify the major patterns in the dataset
  • Pattern Recognition
  • Image Processing
  • Economic Science (especially market research)
  • WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar
    access patterns

5
Examples of Clustering Applications
  • Marketing: help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Insurance: identify groups of motor insurance
    policy holders with a high average claim cost
  • Earthquake studies: observed earthquake
    epicenters should cluster along continental
    faults

6
Types of clustering methods
  • How to choose a particular method?
  • The type of output desired
  • The known performance of method with particular
    types of data
  • The hardware and software facilities available
  • The size of the dataset.
  • In general, clustering methods may be divided
    into two categories based on the cluster
    structure they produce: partitioning methods and
    hierarchical agglomerative methods

7
Partitioning Methods
  • Partition the objects into a prespecified number
    of groups K
  • Iteratively reallocate objects to clusters until
    some criterion is met (e.g. minimize within
    cluster sums of squares)
  • Examples k-means, partitioning around medoids
    (PAM), self-organizing maps (SOM), model-based
    clustering

8
Partitioning Methods
  • Result: M clusters, each object belonging to one
    cluster
  • Single pass (a code sketch follows this list):
    1. Make the first object the centroid for the
       first cluster.
    2. For the next object, calculate its
       similarity, S, with each existing cluster
       centroid, using some similarity coefficient.
    3. If the highest calculated S is greater than
       some specified threshold value, add the
       object to the corresponding cluster and
       recompute the centroid; otherwise, use the
       object to initiate a new cluster.
    4. If any objects remain to be clustered, return
       to step 2.
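
A minimal Python sketch of this single-pass procedure, assuming cosine similarity as the coefficient and a NumPy matrix of expression profiles (the function and variable names are illustrative, not from the original):

```python
import numpy as np

def single_pass_cluster(objects, threshold):
    """Single-pass partitioning: each object joins its most similar
    existing cluster, or starts a new one (order-dependent)."""
    centroids = [objects[0].astype(float)]  # step 1: first object seeds cluster 0
    members = [[0]]
    for idx in range(1, len(objects)):
        x = objects[idx]
        # step 2: cosine similarity to every existing centroid
        sims = [x @ c / (np.linalg.norm(x) * np.linalg.norm(c))
                for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] > threshold:          # step 3: join the best cluster...
            members[best].append(idx)
            centroids[best] = objects[members[best]].mean(axis=0)
        else:                               # ...or initiate a new cluster
            centroids.append(x.astype(float))
            members.append([idx])
    return members                          # step 4 is the loop itself
```

Note how the result depends on the order in which objects arrive, which is exactly the disadvantage pointed out on the next slide.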

9
Partitioning Methods
  • This method requires only one pass through the
    dataset
  • The time requirements are typically of order
    O(N log N) for order O(log N) clusters.
  • A disadvantage is that the resulting clusters
    are not independent of the order in which the
    objects are processed, with the first clusters
    formed usually being larger than those created
    later in the clustering run

10
Hierarchical Clustering
  • Produce a dendrogram
  • Avoid prespecification of the number of clusters
    K
  • The tree can be built in two distinct ways
  • Bottom-up agglomerative clustering
  • Top-down divisive clustering

11
Hierarchical Clustering
  • Organize the genes in a structure of a
    hierarchical tree
  • Initial step: each gene is regarded as a cluster
    with one item
  • Find the 2 most similar clusters and merge them
    into a common node
  • The length of the branch is proportional to the
    distance
  • Iterate on merging nodes until all genes are
    contained in one cluster, the root of the tree
    (see the sketch below)
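
A generic Python sketch of bottom-up agglomeration, assuming Euclidean distance with average linkage (PCluster itself replaces this distance with a probabilistic score; the names here are illustrative):

```python
import numpy as np
from itertools import combinations

def agglomerate(profiles):
    """Bottom-up hierarchical clustering: start from singletons and
    repeatedly merge the two closest clusters into a common node."""
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []                          # (cluster_a, cluster_b, distance)
    next_id = len(profiles)
    while len(clusters) > 1:
        # find the pair of clusters with the smallest average distance
        (a, b), dist = min(
            (((i, j),
              np.mean([np.linalg.norm(profiles[x] - profiles[y])
                       for x in clusters[i] for y in clusters[j]]))
             for i, j in combinations(clusters, 2)),
            key=lambda pair: pair[1])
        merges.append((a, b, dist))      # branch length ~ merge distance
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges                        # the merge list encodes the dendrogram
```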

12
Partitioning vs. Hierarchical
  • Partitioning
  • Advantage: provides clusters that satisfy some
    optimality criterion (approximately)
  • Disadvantages: need initial K, long computation
    time
  • Hierarchical
  • Advantage: fast computation (agglomerative)
  • Disadvantages: rigid, cannot correct later for
    erroneous decisions made earlier

13
Mathematical evaluation of clustering solution
  • Merits of a good clustering solution:
  • Homogeneity
  • Genes inside a cluster are highly similar to each
    other.
  • Average similarity between a gene and the center
    (average profile) of its cluster.
  • Separation
  • Genes from different clusters have low similarity
    to each other.
  • Weighted average similarity between centers of
    clusters.
  • These are conflicting goals: increasing the
    number of clusters tends to improve within-cluster
    homogeneity at the expense of between-cluster
    separation (illustrative formulas below)
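
One way to make these two merits concrete (illustrative formulas, not copied from the slides): with sim a similarity measure, cl(g) the cluster of gene g, and center(·) its average profile,

```latex
H_{\mathrm{ave}} \;=\; \frac{1}{|\mathit{Genes}|} \sum_{g \in \mathit{Genes}} \mathrm{sim}\bigl(e_g,\ \mathrm{center}(cl(g))\bigr)
\qquad
S_{\mathrm{ave}} \;=\; \frac{\sum_{i \neq j} |G_i|\,|G_j|\ \mathrm{sim}\bigl(\mathrm{center}(G_i),\ \mathrm{center}(G_j)\bigr)}{\sum_{i \neq j} |G_i|\,|G_j|}
```

A good solution has high average homogeneity and low average separation; the two pull in opposite directions as the number of clusters grows.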

14
Gaussian Distribution Function
  • Arises for a large number of events
  • Describes physical events
  • Approximates the exact binomial distribution of
    events

Distribution: Gaussian
Functional form: f(x) = (1 / (s * sqrt(2 * pi))) * exp(-(x - a)^2 / (2 * s^2))
Mean: a
Standard deviation: s
15
Bayes' Theorem
  • p(A|X) = p(X|A) p(A) / [ p(X|A) p(A) + p(X|¬A) p(¬A) ]
  • 1% of women at age forty who participate in
    routine screening have breast cancer. 80% of
    women with breast cancer will get positive
    mammographies. 9.6% of women without breast
    cancer will also get positive mammographies. A
    woman in this age group had a positive
    mammography in a routine screening. What is the
    probability that she actually has breast cancer?

16
Bayes' Theorem
  • The correct answer is 7.8%, obtained as follows:
    out of 10,000 women, 100 have breast cancer; 80
    of those 100 have positive mammographies. From
    the same 10,000 women, 9,900 will not have breast
    cancer, and of those 9,900 women, 950 will also
    get positive mammographies. This makes the total
    number of women with positive mammographies
    950 + 80, or 1,030. Of those 1,030 women with
    positive mammographies, 80 will have cancer.
    Expressed as a proportion, this is 80/1,030, or
    0.07767, i.e. 7.8% (checked in code below).
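
A quick programmatic check of this arithmetic (a throwaway sketch, not part of the original slides):

```python
# Bayes' theorem on the mammography example (10,000 women)
p_cancer = 0.01              # prior: 1% have breast cancer
p_pos_given_cancer = 0.80    # 80% of cancers test positive
p_pos_given_healthy = 0.096  # 9.6% false-positive rate

# total probability of a positive mammography: 0.008 + 0.09504 = 0.10304
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"{p_cancer_given_pos:.4%}")   # prints 7.7640%, i.e. about 7.8%
```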

17
Bayes' Theorem
p(cancer) = 0.01                 Group 1: 100 women with breast cancer
p(¬cancer) = 0.99                Group 2: 9,900 women without breast cancer
p(positive | cancer) = 80.0%     80% of women with breast cancer have positive mammographies
p(¬positive | cancer) = 20.0%    20% of women with breast cancer have negative mammographies
p(positive | ¬cancer) = 9.6%     9.6% of women without breast cancer have positive mammographies
p(¬positive | ¬cancer) = 90.4%   90.4% of women without breast cancer have negative mammographies
p(cancer & positive) = 0.008     Group A: 80 women with breast cancer and positive mammographies
p(cancer & ¬positive) = 0.002    Group B: 20 women with breast cancer and negative mammographies
p(¬cancer & positive) = 0.095    Group C: 950 women without breast cancer and positive mammographies
p(¬cancer & ¬positive) = 0.895   Group D: 8,950 women without breast cancer and negative mammographies
p(positive) = 0.103              1,030 women with positive results
p(¬positive) = 0.897             8,970 women with negative results
p(cancer | positive) = 7.80%     Chance you have breast cancer if mammography is positive: 7.8%
p(¬cancer | positive) = 92.20%   Chance you are healthy if mammography is positive: 92.2%
p(cancer | ¬positive) = 0.22%    Chance you have breast cancer if mammography is negative: 0.22%
p(¬cancer | ¬positive) = 99.78%  Chance you are healthy if mammography is negative: 99.78%
18
Bayes' Theorem
  • To find the chance that a woman with a positive
    mammography has breast cancer, we computed:

p(positive | cancer) p(cancer) / [ p(positive | cancer) p(cancer) + p(positive | ¬cancer) p(¬cancer) ]
  1. which is p(positive & cancer) / [ p(positive & cancer) + p(positive & ¬cancer) ]
  2. which is p(positive & cancer) / p(positive)
  3. which is p(cancer | positive)

19
Bayes' Theorem
  • The original proportion of patients with breast
    cancer is known as the prior probability.  The
    chance that a patient with breast cancer gets a
    positive mammography, and the chance that a
    patient without breast cancer gets a positive
    mammography, are known as the two conditional
    probabilities.  Collectively, this initial
    information is known as the priors.  The final
    answer - the estimated probability that a patient
    has breast cancer, given that we know she has a
    positive result on her mammography - is known as
    the revised probability or the posterior
    probability.

20
Bayes' Theorem
  • p(A|X) = p(X & A) / p(X)
  • p(A|X) = p(X & A) / [ p(X & A) + p(X & ¬A) ]
  • p(A|X) = p(X|A) p(A) / [ p(X|A) p(A) + p(X|¬A) p(¬A) ]

21
Introduction
  • A central problem in analysis of gene expression
    data is clustering of genes with similar
    expression profiles.
  • We are going to get familiar with a hierarchical
    clustering procedure that is based on a simple
    probabilistic model.
  • Genes that are expressed similarly in each group
    of conditions are clustered together.

22
The problem
  • The goal of clustering is to identify groups of
    genes with similar expression patterns.
  • A group of genes is clustered together if their
    measured expression values could have been
    sampled from the same stochastic source with
    high probability.
  • The user specifies in advance a partition of the
    experimental conditions.

23
Clustering Gene Expression Data
  • Cluster genes, e.g. to (attempt to) identify
    groups of co-regulated genes
  • Cluster samples, e.g. to identify tumors based
    on profiles
  • Cluster both at the same time
  • Can be helpful for identifying patterns in time
    or space
  • Useful (essential?) when seeking new subclasses
    of samples
  • Can be used for exploratory purposes

24
Notation
  • A matrix of gene expression measurements:
  • D = { e_{g,c} : g ∈ Genes, c ∈ Conds }
  • Genes is a set of genes, and Conds is a set of
    conditions

25
Scoring Method
  • A partition C = {C_1, ..., C_m} of the conditions
    in Conds and a partition G = {G_1, ..., G_n} of
    the genes in Genes.
  • We want to score the combined partition.
  • Assumption: if g and g' are in the same gene
    cluster, and c and c' are in the same condition
    cluster, then the expression values e_{g,c} and
    e_{g',c'} are sampled from the same distribution.

26
Scoring Method
  • Likelihood function (reconstructed below),
  • where θ_{i,k} are the parameters that describe
    the expression of genes in G_i in conditions in
    C_k.
  • L(G^(1), C, θ^(1) | D) ≥ L(G, C, θ | D) for any
    choice of G and θ: the raw likelihood is always
    maximized by the finest (singleton) partition.
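
A reconstruction of the likelihood function, consistent with the assumption that every cell (G_i, C_k) of the combined partition has its own parameters θ_{i,k} (the exact form on the slide was an image, so treat this as an assumption):

```latex
L(G, C, \theta \mid D) \;=\; \prod_{i=1}^{n} \prod_{k=1}^{m} \prod_{g \in G_i} \prod_{c \in C_k} p\!\left(e_{g,c} \mid \theta_{i,k}\right)
```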

27
Scoring Method
  • The expression values are parameterized using a
    Gaussian distribution, so each θ_{i,k} consists
    of a mean and a variance.

28
Scoring Method
  • Using the previous parameterization, for each
    candidate partition we could choose the
    best-fitting parameter sets; this
    maximum-likelihood choice overestimates the fit.
  • To compensate for this overestimate we use the
    Bayesian approach, and average the likelihood
    over all parameter values (see below).
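
In symbols, the Bayesian score replaces the maximized likelihood with the marginal likelihood (a hedged reconstruction; the slide's formula was an image):

```latex
S(G, C \mid D) \;=\; \log \int L(G, C, \theta \mid D)\, p(\theta)\, d\theta \;=\; \sum_{i,k} S(G_i, C_k)
```

The decomposition into per-cell terms assumes the factored likelihood above together with an independent prior over each θ_{i,k}.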

29
Scoring Method - Summary
  • Local score of a particular cell (G_i, C_k), as
    reconstructed below
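
Under the Gaussian parameterization with a conjugate prior over (μ, σ²), the local score of a cell would take the form (an assumption consistent with the setup, not copied from the slide):

```latex
S(G_i, C_k) \;=\; \log \int \prod_{g \in G_i} \prod_{c \in C_k} \mathcal{N}\!\left(e_{g,c} \mid \mu, \sigma^2\right) p(\mu, \sigma^2)\, d\mu\, d\sigma^2
```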

30
Agglomerative Clustering
  • Given a partition C = {C_1, ..., C_m} of the
    conditions.
  • One approach to learning a clustering of genes is
    an agglomerative procedure.

31
Agglomerative Clustering
  • G^(1) = {G_1, ..., G_|Genes|}, where each G_i is
    a singleton.
  • While t < |Genes|, i.e. until G^(t) contains a
    single cluster:
  • Compute the change in the score that results from
    merging the clusters G_i and G_j

32
Agglomerative Clustering
  • Choose (i_t, j_t) to be the pair of clusters
    whose merger is the most beneficial according to
    the score
  • Define G^(t+1) as G^(t) with G_{i_t} and G_{j_t}
    merged (a code sketch follows)
  • Complexity: O(|Genes|^2 · |Conds|)
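
A minimal sketch of the agglomeration loop; score_merge is a hypothetical stand-in for the change in the Bayesian score when two clusters merge (the real score sums per-cell marginal likelihoods):

```python
def agglomerate_genes(gene_ids, score_merge):
    """Greedy bottom-up merging: repeatedly merge the pair of gene
    clusters whose merger most improves the (assumed) score.

    score_merge(a, b) -> change in total score if clusters a, b merge.
    """
    partition = [frozenset([g]) for g in gene_ids]  # G(1): all singletons
    history = [list(partition)]
    while len(partition) > 1:
        # evaluate every candidate merger and keep the most beneficial
        i, j = max(((i, j) for i in range(len(partition))
                           for j in range(i + 1, len(partition))),
                   key=lambda ij: score_merge(partition[ij[0]],
                                              partition[ij[1]]))
        merged = partition[i] | partition[j]
        partition = [c for k, c in enumerate(partition) if k not in (i, j)]
        partition.append(merged)            # defines G(t+1)
        history.append(list(partition))
    return history   # G(1), ..., G(|Genes|): later, pick the best-scoring one
```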

33
Double Clustering
  • We want the procedure to select for us the best
    partition
  • Track the sequence of partitions G^(1), ...,
    G^(|Genes|).
  • Select the partition with the highest score.
  • In theory the maximum likelihood score should
    select G^(1)
  • In practice it selects a partition at a much
    later stage.
  • Intuition: the best-scoring partition strikes a
    tradeoff between finding groups of genes such
    that each is homogeneous and there are distinct
    differences between them.

34
Double Clustering
  • Cluster both genes and conditions at the same
    time (a sketch follows this list):
  • Start with some partition of the conditions (say,
    the one where each condition is a singleton).
  • Perform gene agglomeration.
  • Select the best-scoring gene partition.
  • Fix this gene partition.
  • Perform agglomeration on the conditions.
  • Intuitively, each step improves the score, and
    thus this procedure should converge.
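
A sketch of the alternation; agglomerate and score are assumed helpers in the spirit of the previous sketch, not the paper's actual API:

```python
def double_cluster(genes, conds, score, agglomerate, max_rounds=20):
    """Alternate gene and condition clustering until the score stops
    improving. agglomerate(items, fixed_partition, axis) is assumed to
    return the best-scoring partition of `items` while the other
    axis's partition is held fixed."""
    gene_part = [frozenset([g]) for g in genes]   # start from singletons
    cond_part = [frozenset([c]) for c in conds]
    best = score(gene_part, cond_part)
    for _ in range(max_rounds):
        gene_part = agglomerate(genes, cond_part, axis="genes")
        cond_part = agglomerate(conds, gene_part, axis="conds")
        new = score(gene_part, cond_part)
        if new <= best:        # converged: this round brought no improvement
            break
        best = new
    return gene_part, cond_part
```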

35
Particular features of the algorithm
  • It can handle measurements for a large number of
    genes.
  • The agglomerative clustering algorithm returns a
    hierarchical partition that describes
    similarities at different scales.
  • It uses a likelihood function rather than a
    measure of similarity.
  • The user specifies in advance a partition of the
    experimental conditions.

36
Conclusion
  • Partition entities into groups called clusters.
  • Clusters are homogeneous and well-separated.
  • Bayes' theorem:
  • p(A|X) = p(X|A) p(A) / [ p(X|A) p(A) + p(X|¬A) p(¬A) ]
  • Partitions C = {C_1, ..., C_m} and G = {G_1, ...,
    G_n}: we want to score the combined partition.
  • Likelihood function L(G, C, θ | D)

37
Conclusion
  • Agglomerative Clustering
  • The main advantage of this procedure is that it
    can take as input the relevant distinctions among
    the conditions.

38
Questions?
39
References
  • [1] N. Friedman. PCluster: Probabilistic
    Agglomerative Clustering of Gene Expression
    Profiles. 2003.
  • [2] A. Ben-Dor, R. Shamir, and Z. Yakhini.
    Clustering gene expression patterns. J. Comp.
    Biol., 6(3-4):281-97, 1999.
  • [3] M. B. Eisen, P. T. Spellman, P. O. Brown, and
    D. Botstein. Cluster analysis and display of
    genome-wide expression patterns. PNAS,
    95(25):14863-8, 1998.
  • [4] E. Yudkowsky. An Intuitive Explanation of
    Bayesian Reasoning. 2003.