1
PCluster: Probabilistic Agglomerative Clustering
of Gene Expression Profiles
  • Nir Friedman

Presented by Inbar Matarasso, 09/05/2005, The
School of Computer Science, Tel Aviv University
2
Outline
  • A little about clustering
  • Mathematics background
  • Introduction
  • The problem
  • Notation
  • Scoring Method
  • Agglomerative clustering
  • Double clustering
  • Conclusion

3
A little about clustering
  • Partition entities (genes) into groups called
    clusters (according to similarity in their
    expression profiles across the probed
    conditions).
  • Clusters are homogeneous and well-separated.
  • Clustering problems arise in numerous
    disciplines, including biology, medicine,
    psychology, and economics.

4
Clustering: why?
  • Reduce the dimensionality of the problem:
    identify the major patterns in the dataset
  • Pattern Recognition
  • Image Processing
  • Economic Science (especially market research)
  • WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar
    access patterns

5
Examples of Clustering Applications
  • Marketing: help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Insurance: identify groups of motor insurance
    policy holders with a high average claim cost
  • Earthquake studies: observed earthquake
    epicenters should be clustered along continental
    faults

6
Types of clustering methods
  • How to choose a particular method?
  • The type of output desired
  • The known performance of method with particular
    types of data
  • The hardware and software facilities available
  • The size of the dataset.
  • In general, clustering methods may be divided
    into two categories based on the cluster
    structure they produce: partitioning methods and
    hierarchical agglomerative methods

7
Partitioning Methods
  • Partition the objects into a prespecified number
    of groups K
  • Iteratively reallocate objects to clusters until
    some criterion is met (e.g. minimize within
    cluster sums of squares)
  • Examples k-means, partitioning around medoids
    (PAM), self-organizing maps (SOM), model-based
    clustering
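
As a concrete illustration, here is a minimal k-means sketch in Python; the toy 2-D "profiles" and all names are invented for this example, not taken from the paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster emptied out
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# two well-separated groups of 2-D "expression profiles"
pts = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
cents, cls = kmeans(pts, 2)
```

Note that K = 2 is fixed in advance, which is exactly the prespecification the slide mentions.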

8
Partitioning Methods
  • Result: M clusters, each object belonging to
    exactly one cluster
  • Single pass:
  1. Make the first object the centroid of the first
     cluster.
  2. For the next object, calculate its similarity,
     S, to each existing cluster centroid, using some
     similarity coefficient.
  3. If the highest calculated S is greater than some
     specified threshold value, add the object to the
     corresponding cluster and redetermine the
     centroid; otherwise, use the object to initiate
     a new cluster. If any objects remain to be
     clustered, return to step 2.
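
The steps above can be sketched in Python as follows; the similarity function (negative Euclidean distance) and the threshold value are illustrative assumptions, not part of the slide:

```python
import math

def single_pass(objects, threshold, sim):
    """Single-pass partitioning: each object joins the cluster whose
    centroid it is most similar to, or else starts a new cluster."""
    clusters = []  # list of (centroid, members) pairs
    for obj in objects:
        if clusters:
            # similarity of obj to every current centroid
            sims = [sim(obj, centroid) for centroid, _ in clusters]
            best = max(range(len(clusters)), key=sims.__getitem__)
            if sims[best] > threshold:
                members = clusters[best][1] + [obj]
                # redetermine the centroid as the mean of the members
                centroid = tuple(sum(x) / len(members) for x in zip(*members))
                clusters[best] = (centroid, members)
                continue
        clusters.append((obj, [obj]))  # initiate a new cluster
    return clusters

objs = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
result = single_pass(objs, threshold=-1.0, sim=lambda a, b: -math.dist(a, b))
```

As the next slide notes, the outcome of this procedure depends on the order in which the objects are processed.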

9
Partitioning Methods
  • This method requires only one pass through the
    dataset
  • The time requirements are typically of order
    O(N log N) for O(log N) clusters.
  • A disadvantage is that the resulting clusters are
    not independent of the order in which the
    documents are processed, with the first clusters
    formed usually being larger than those created
    later in the clustering run

10
Hierarchical Clustering
  • Produce a dendrogram
  • Avoid prespecification of the number of clusters
    K
  • The tree can be built in two distinct ways
  • Bottom-up agglomerative clustering
  • Top-down divisive clustering

11
Hierarchical Clustering
  • Organize the genes in the structure of a
    hierarchical tree
  • Initial step: each gene is regarded as a cluster
    with one item
  • Find the 2 most similar clusters and merge them
    into a common node
  • The length of the branch is proportional to the
    distance
  • Iterate on merging nodes until all genes are
    contained in one cluster: the root of the tree.
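
A toy Python sketch of this bottom-up loop, using the distance between cluster means as a stand-in for a generic similarity measure (the paper itself uses a likelihood score instead):

```python
import math

def agglomerate(points):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the two closest ones (by distance between cluster means),
    recording the merge distance, i.e. the branch length."""
    clusters = [[p] for p in points]
    merge_heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi = [sum(x) / len(clusters[i]) for x in zip(*clusters[i])]
                mj = [sum(x) / len(clusters[j]) for x in zip(*clusters[j])]
                d = math.dist(mi, mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merge_heights.append(d)
    return merge_heights  # the last merge creates the root of the tree

heights = agglomerate([(0.0,), (0.1,), (4.0,), (4.3,)])
```

The recorded heights grow as the merges get coarser, which is what produces the dendrogram's branch lengths.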

12
Partitioning vs. Hierarchical
  • Partitioning
  • Advantage: provides clusters that satisfy some
    optimality criterion (approximately)
  • Disadvantages: needs an initial K, long
    computation time
  • Hierarchical
  • Advantage: fast computation (agglomerative)
  • Disadvantages: rigid; cannot correct later for
    erroneous decisions made earlier

13
Mathematical evaluation of a clustering solution
  • Merits of a good clustering solution:
  • Homogeneity
  • Genes inside a cluster are highly similar to each
    other.
  • Measured by the average similarity between a gene
    and the center (average profile) of its cluster.
  • Separation
  • Genes from different clusters have low similarity
    to each other.
  • Measured by the weighted average similarity
    between centers of clusters.
  • These are conflicting features: increasing the
    number of clusters tends to improve
    within-cluster homogeneity at the expense of
    between-cluster separation

14
Gaussian Distribution Function
  • Large number of events
  • describes physical events
  • approximates the exact binomial distribution of
    events

Distribution: Gaussian
Functional form: p(x) = exp(-(x - a)² / (2s²)) / (s · √(2π))
Mean: a
Standard deviation: s
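
In code, the density with mean a and standard deviation s is:

```python
import math

def gaussian_pdf(x, a, s):
    """Gaussian density with mean a and standard deviation s:
    p(x) = exp(-(x - a)^2 / (2 s^2)) / (s * sqrt(2 pi))."""
    return math.exp(-((x - a) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))
```

The density peaks at x = a with value 1 / (s √(2π)) and is symmetric about the mean.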
15
Bayes' Theorem
  • p(A|X) = p(X|A)p(A) /
    [ p(X|A)p(A) + p(X|¬A)p(¬A) ]
  • 1% of women at age forty who participate in
    routine screening have breast cancer.  80% of
    women with breast cancer will get positive
    mammographies.  9.6% of women without breast
    cancer will also get positive mammographies.  A
    woman in this age group had a positive
    mammography in a routine screening.  What is the
    probability that she actually has breast cancer?

16
Bayes' Theorem
  • The correct answer is 7.8%, obtained as follows:
    out of 10,000 women, 100 have breast cancer, and
    80 of those 100 have positive mammographies.  Of
    the same 10,000 women, 9,900 will not have breast
    cancer, and of those 9,900 women, 950 will also
    get positive mammographies.  This makes the total
    number of women with positive mammographies
    950 + 80, or 1,030.  Of those 1,030 women with
    positive mammographies, 80 will have cancer.
    Expressed as a proportion, this is 80/1,030, or
    0.07767, or 7.8%.
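
The same computation in Python, using the probabilities from the slide:

```python
# priors and conditionals from the mammography example
p_cancer = 0.01              # 1% prevalence
p_pos_given_cancer = 0.80    # 80% of cancers test positive
p_pos_given_healthy = 0.096  # 9.6% false-positive rate

# total probability of a positive mammography
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' theorem: posterior probability of cancer given a positive test
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
# about 0.078, i.e. the 7.8% figure from the slide
```

Note how the small prior (1%) drags the posterior far below the 80% sensitivity.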

17
Bayes' Theorem
p(cancer) = 0.01 (Group 1: 100 women with breast cancer)
p(¬cancer) = 0.99 (Group 2: 9,900 women without breast cancer)
p(positive|cancer) = 80.0% (80% of women with breast cancer have positive mammographies)
p(¬positive|cancer) = 20.0% (20% of women with breast cancer have negative mammographies)
p(positive|¬cancer) = 9.6% (9.6% of women without breast cancer have positive mammographies)
p(¬positive|¬cancer) = 90.4% (90.4% of women without breast cancer have negative mammographies)
p(cancer ∧ positive) = 0.008 (Group A: 80 women with breast cancer and positive mammographies)
p(cancer ∧ ¬positive) = 0.002 (Group B: 20 women with breast cancer and negative mammographies)
p(¬cancer ∧ positive) = 0.095 (Group C: 950 women without breast cancer and positive mammographies)
p(¬cancer ∧ ¬positive) = 0.895 (Group D: 8,950 women without breast cancer and negative mammographies)
p(positive) = 0.103 (1,030 women with positive results)
p(¬positive) = 0.897 (8,970 women with negative results)
p(cancer|positive) = 7.80% (chance you have breast cancer if the mammography is positive)
p(¬cancer|positive) = 92.20% (chance you are healthy if the mammography is positive)
p(cancer|¬positive) = 0.22% (chance you have breast cancer if the mammography is negative)
p(¬cancer|¬positive) = 99.78% (chance you are healthy if the mammography is negative)
18
Bayes' Theorem
  • To find the chance that a woman with a positive
    mammography has breast cancer, we computed

p(positive|cancer)p(cancer) /
[ p(positive|cancer)p(cancer) + p(positive|¬cancer)p(¬cancer) ]
  1. which is p(positive ∧ cancer) /
     [ p(positive ∧ cancer) + p(positive ∧ ¬cancer) ]
  2. which is p(positive ∧ cancer) / p(positive)
  3. which is p(cancer|positive)

19
Bayes' Theorem
  • The original proportion of patients with breast
    cancer is known as the prior probability.  The
    chance that a patient with breast cancer gets a
    positive mammography, and the chance that a
    patient without breast cancer gets a positive
    mammography, are known as the two conditional
    probabilities.  Collectively, this initial
    information is known as the priors.  The final
    answer - the estimated probability that a patient
    has breast cancer, given that we know she has a
    positive result on her mammography - is known as
    the revised probability or the posterior
    probability.

20
Bayes' Theorem
  • p(A|X) = p(A ∧ X) / p(X)
  • p(A ∧ X) = p(X|A) p(A)
  • p(X) = p(X ∧ A) + p(X ∧ ¬A)
         = p(X|A)p(A) + p(X|¬A)p(¬A)
  • p(A|X) = p(X|A)p(A) /
    [ p(X|A)p(A) + p(X|¬A)p(¬A) ]

21
Introduction
  • A central problem in analysis of gene expression
    data is clustering of genes with similar
    expression profiles.
  • We are going to get familiar with a hierarchical
    clustering procedure that is based on a simple
    probabilistic model.
  • Genes that are expressed similarly in each group
    of conditions are clustered together.

22
The problem
  • The goal of clustering is to identify groups of
    genes with similar expression patterns.
  • A group of genes is clustered together if their
    measured expression values could have been
    sampled from the same stochastic source with
    high probability.
  • The user specifies in advance a partition of the
    experimental conditions

23
Clustering Gene Expression Data
  • Cluster genes, e.g., to (attempt to) identify
    groups of co-regulated genes
  • Cluster samples, e.g., to identify tumors based
    on profiles
  • Cluster both at the same time
  • Can be helpful for identifying patterns in time
    or space
  • Useful (essential?) when seeking new subclasses
    of samples
  • Can be used for exploratory purposes

24
Notation
  • a matrix of gene expression measurements
  • D = { e_{g,c} : g ∈ Genes, c ∈ Conds }
  • Genes is a set of genes, and Conds is a set of
    conditions

25
Scoring Method
  • A partition C = {C1, ..., Cm} of the conditions
    in Conds and a partition G = {G1, ..., Gn} of the
    genes in Genes.
  • We want to score the combined partition.
  • Assumption: if g and g' are in the same gene
    cluster, and c and c' are in the same condition
    cluster, then the expression values e_{g,c} and
    e_{g',c'} are sampled from the same distribution.

26
Scoring Method
  • Likelihood function:
    L(G, C, θ : D) = ∏_{i,k} ∏_{g ∈ Gi, c ∈ Ck} p(e_{g,c} | θ_{i,k})
  • where θ_{i,k} are the parameters that describe
    the expression of genes in Gi under the
    conditions in Ck.
  • L(G(1), C, θ̂ : D) ≥ L(G, C, θ : D) for any
    choice of G and θ: the maximum-likelihood score
    is highest for the finest partition.

27
Scoring Method
  • The parameterization of expression uses a
    Gaussian distribution.

28
Scoring Method
  • Using the previous parameterization, for each
    partition we choose the best-fitting parameter
    sets, which overestimates the score.
  • To compensate for this overestimate we use the
    Bayesian approach and average the likelihood
    over all parameter choices.

29
Scoring Method - Summary
  • Local score of a particular cell
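This per-cell idea can be sketched in Python; the maximum-likelihood Gaussian fit below is a simplified stand-in for the paper's Bayesian cell score, and the variance floor is an assumption added just to keep the toy example finite:

```python
import math

def cell_score(values):
    """Log-likelihood of one (gene cluster, condition cluster) cell
    under a Gaussian fitted to the cell's own values."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-3)  # variance floor
    return sum(-0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
               for v in values)

def score(data, gene_part, cond_part):
    """Total score of a combined partition: sum of the cell scores."""
    return sum(cell_score([data[g][c] for g in Gi for c in Ck])
               for Gi in gene_part for Ck in cond_part)

data = {"g1": {"c1": 0.1, "c2": 0.2}, "g2": {"c1": 0.2, "c2": 0.1},
        "g3": {"c1": 5.0, "c2": 5.1}, "g4": {"c1": 5.1, "c2": 5.0}}
conds = [["c1", "c2"]]
good = score(data, [["g1", "g2"], ["g3", "g4"]], conds)
bad = score(data, [["g1", "g3"], ["g2", "g4"]], conds)
```

Grouping similarly expressed genes gives tight, low-variance cells, so the homogeneous partition scores higher than the mixed one.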

30
Agglomerative Clustering
  • Given a partition C = {C1, ..., Cm} of the
    conditions.
  • One approach to learning a clustering of the
    genes is an agglomerative procedure.

31
Agglomerative Clustering
  • G(1) = {G1, ..., G_|Genes|}, where each Gi is a
    singleton.
  • While t < |Genes|, i.e., until G(t) contains a
    single cluster:
  • Compute the change in the score that results from
    merging the clusters Gi and Gj

32
Agglomerative Clustering
  • Choose (i_t, j_t) to be the pair of clusters
    whose merger is the most beneficial according to
    the score
  • Define G(t+1) from G(t) by replacing G_{i_t} and
    G_{j_t} with their union
  • Running time: O(|Genes|² · |C|)
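
A generic sketch of this loop in Python; the score-change function `delta` is a placeholder for the paper's likelihood-based score, and the 1-D gene values are invented for the illustration:

```python
def greedy_agglomeration(items, delta):
    """At each step merge the pair of clusters whose merger is most
    beneficial according to delta, recording the whole sequence
    of partitions G(1), ..., G(|items|)."""
    partition = [frozenset([x]) for x in items]  # G(1): singletons
    history = [list(partition)]
    while len(partition) > 1:
        # choose (i_t, j_t): the most beneficial merger
        _, a, b = max((delta(a, b), a, b)
                      for i, a in enumerate(partition) for b in partition[i + 1:])
        partition = [c for c in partition if c not in (a, b)] + [a | b]
        history.append(list(partition))
    return history

vals = {"g1": 0.0, "g2": 0.1, "g3": 5.0, "g4": 5.2}
# toy score change: merging a tight group of values is penalized less
delta = lambda a, b: -(max(vals[g] for g in a | b) - min(vals[g] for g in a | b))
hist = greedy_agglomeration(list(vals), delta)
```

Keeping the whole sequence of partitions is what lets the next step pick the best-scoring one after the fact.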

33
Double Clustering
  • We want the procedure to select the best
    partition for us
  • Track the sequence of partitions G(1), ...,
    G(|Genes|).
  • Select the partition with the highest score.
  • In theory, the maximum-likelihood score should
    select G(1)
  • In practice, it selects a partition at a much
    later stage.
  • Intuition: the best-scoring partition strikes a
    tradeoff between finding groups of genes such
    that each is homogeneous and there are distinct
    differences between them.

34
Double Clustering
  • Cluster both genes and conditions at the same
    time
  • Start with some partition of the conditions (say,
    the one where each condition is a singleton).
  • Perform gene agglomeration.
  • Select the best-scoring gene partition.
  • Fix this gene partition.
  • Perform agglomeration on the conditions.
  • Intuitively, each step improves the score, and
    thus this procedure should converge.
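
A self-contained Python sketch of this alternating scheme; the penalized maximum-likelihood score and the variance floor below are simplifications standing in for the paper's Bayesian score, and all gene and condition names are invented:

```python
import math

def cell_fit(values):
    """Gaussian ML log-likelihood of one cell's values (variance is
    floored so singleton cells stay finite)."""
    n = len(values)
    m = sum(values) / n
    v = max(sum((x - m) ** 2 for x in values) / n, 0.01)
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x in values)

def score(data, gpart, cpart, penalty=5.0):
    """Penalized score: cell likelihoods minus a cost per cell,
    a crude stand-in for Bayesian averaging over parameters."""
    fit = sum(cell_fit([data[g][c] for g in G for c in C])
              for G in gpart for C in cpart)
    return fit - penalty * len(gpart) * len(cpart)

def agglomerate(items, part_score):
    """Merge greedily to the root; return the best-scoring partition seen."""
    part = [tuple([x]) for x in items]
    best_s, best = part_score(part), part
    while len(part) > 1:
        s, a, b = max((part_score([p for p in part if p not in (a, b)] + [a + b]),
                       a, b)
                      for i, a in enumerate(part) for b in part[i + 1:])
        part = [p for p in part if p not in (a, b)] + [a + b]
        if s > best_s:
            best_s, best = s, part
    return best

def double_cluster(data):
    genes, conds = list(data), list(next(iter(data.values())))
    cpart = [tuple([c]) for c in conds]  # start: condition singletons
    gpart = agglomerate(genes, lambda gp: score(data, gp, cpart))  # gene pass
    cpart = agglomerate(conds, lambda cp: score(data, gpart, cp))  # condition pass
    return gpart, cpart

data = {"g1": {"c1": 0.0, "c2": 0.1}, "g2": {"c1": 0.1, "c2": 0.0},
        "g3": {"c1": 5.0, "c2": 5.1}, "g4": {"c1": 5.1, "c2": 5.0}}
gpart, cpart = double_cluster(data)
```

On this toy matrix the gene pass recovers the two co-expressed pairs, and the condition pass then merges the two indistinguishable conditions.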

35
Particular features of the algorithm
  • We can handle a large number of genes.
  • The agglomerative clustering algorithm returns a
    hierarchical partition that describes
    similarities at different scales.
  • We use a likelihood function rather than a
    measure of similarity.
  • The user specifies in advance a partition of
    experimental conditions.

36
Conclusion
  • Partition entities into groups called clusters.
  • Clusters are homogeneous and well-separated.
  • Bayes' Theorem:
  • p(A|X) = p(X|A)p(A) /
    [ p(X|A)p(A) + p(X|¬A)p(¬A) ]
  • Partitions C = {C1, ..., Cm} and G = {G1, ..., Gn}:
    we want to score the combined partition.
  • Likelihood function

37
Conclusion
  • Agglomerative Clustering
  • The main advantage of this procedure is that it
    can take as input the relevant distinctions
    among the conditions

38
Questions?
39
References
  • [1] N. Friedman. PCluster: Probabilistic
    Agglomerative Clustering of Gene Expression
    Profiles. 2003.
  • [2] A. Ben-Dor, R. Shamir, and Z. Yakhini.
    Clustering gene expression patterns. J. Comp.
    Bio., 6(3-4):281-297, 1999.
  • [3] M. B. Eisen, P. T. Spellman, P. O. Brown, and
    D. Botstein. Cluster analysis and display of
    genome-wide expression patterns. PNAS,
    95(25):14863-14868, 1998.
  • [4] E. Yudkowsky. An Intuitive Explanation of
    Bayesian Reasoning. 2003.