Cluster Analysis of Gene Expression Profiles

Description:

Identifying groups of genes that exhibit a similar expression ' ... 'Unsupervised learning' in the computer science lingo. 1-12-2006. 2. Cluster analysis ... – PowerPoint PPT presentation

Slides: 24
Provided by: mariomed
Learn more at: http://eh3.uc.edu
Transcript and Presenter's Notes


1
Cluster Analysis of Gene Expression Profiles
  • Identifying groups of genes that exhibit a
    similar expression "behavior" across a number of
    experimental conditions
  • Assuming that such "co-expression" will tell us
    something about how these genes are regulated, or
    possibly even something about their function
    (Functional Genomics)
  • Using information from multiple genes at a time,
    as opposed to the single-gene-at-a-time analysis
    we have done so far
  • We can also cluster biological samples based on
    the expression of some or all of the genes
  • Example: identifying groups of molecularly
    similar tumors
  • "Molecular phenotyping"
  • "Unsupervised learning" in the computer science
    lingo

2
Cluster analysis
source("http://eh3.uc.edu/teaching/cfg/2006/R/ClusterAnalysis.R")
  • Often a large portion of genes are "not
    interesting"
  • The meaning of "not interesting" depends on
    the context
  • Possibly we are interested in genes whose
    expression is not constant across all
    experimental conditions. To remove
    "non-interesting" genes one can apply a
    "variation filter".
  • The various sorts of "filtering" of "non-interesting"
    genes generally amount to performing some kind
    of informal statistical test at a very low
    confidence level.
  • For now, we will just play with our data; some
    more exciting examples will follow
  • We have six measurements for each gene and will
    try to cluster genes and experimental conditions
    using these data
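
A variation filter of the kind described on this slide might be sketched as follows. The simulated matrix `expr`, the gene names, and the cutoff (keep the 10% most variable genes) are illustrative assumptions and are not taken from the course data.

```r
# Minimal sketch of a "variation filter" on simulated data (hypothetical values).
set.seed(1)
n_genes <- 200
n_samples <- 6
expr <- matrix(rnorm(n_genes * n_samples, sd = 0.2), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               c("Ctl", "Nic", "Nic.1", "Nic.2", "Ctl.1", "Ctl.2")))
# Make the first 10 genes vary strongly across conditions.
expr[1:10, ] <- expr[1:10, ] + rep(c(-2, 2, 2, 2, -2, -2), each = 10)

# Per-gene variance across the six measurements.
gene_var <- apply(expr, 1, var, na.rm = TRUE)

# Keep genes whose variance falls in the top 10% (an arbitrary cutoff).
cutoff <- quantile(gene_var, 0.90)
keep <- gene_var > cutoff
filtered <- expr[keep, , drop = FALSE]
```

The cutoff is a judgment call: a stricter quantile keeps fewer genes, which is the informal "testing" the slide alludes to.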

3
Cluster analysis
  • > load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData"))
  • > Nic<-grep("Nic",dimnames(SimpleData)[[2]])
  • > Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
  • > MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
  • > VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
  • > MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
  • > VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
  • > NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
  • > NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
  • > VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
  • > DF<-NNic+NCtl-2
  • > TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
  • > TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
  • > SigGenes<-(TPvalue<0.001)
  • > sum(SigGenes)
  • [1] 7
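
The commands on this slide compute a pooled-variance two-sample t-test for every gene. As a rough check of that arithmetic, the sketch below repeats it on a small simulated matrix and compares the p-values with R's built-in `t.test(..., var.equal = TRUE)`. The matrix `x` and its values are made up, and the sketch assumes no missing values.

```r
# Sketch: row-wise pooled-variance t statistics, checked against t.test().
set.seed(2)
x <- matrix(rnorm(5 * 6), nrow = 5,
            dimnames = list(NULL, c("Ctl", "Nic", "Nic.1", "Nic.2", "Ctl.1", "Ctl.2")))
nic <- grep("Nic", colnames(x))
ctl <- grep("Ctl", colnames(x))

m_nic <- rowMeans(x[, nic]); v_nic <- apply(x[, nic], 1, var)
m_ctl <- rowMeans(x[, ctl]); v_ctl <- apply(x[, ctl], 1, var)
n_nic <- length(nic); n_ctl <- length(ctl)

# Pooled variance and the two-sample t statistic with n_nic + n_ctl - 2 df.
v_pooled <- ((n_nic - 1) * v_nic + (n_ctl - 1) * v_ctl) / (n_nic + n_ctl - 2)
df <- n_nic + n_ctl - 2
t_stat <- abs(m_nic - m_ctl) / sqrt(v_pooled * (1 / n_nic + 1 / n_ctl))
p_manual <- 2 * pt(t_stat, df, lower.tail = FALSE)

# Reference: the built-in equal-variance t-test, one gene at a time.
p_builtin <- apply(x, 1, function(g) t.test(g[nic], g[ctl], var.equal = TRUE)$p.value)
all.equal(unname(p_manual), unname(p_builtin))
```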

4
Cluster analysis
  • > library(marray)
  • > library(mclust)
  • > pal<-maPalette(low="green", high="red", mid="black")
  • > MinExp<-min(SimpleData[SigGenes,2:7])
  • > MaxExp<-max(SimpleData[SigGenes,2:7])
  • > heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,
      labRow=as.character(SimpleData[SigGenes,1]),scale="none")
  • > maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal,
      horizontal=FALSE, k=5)

5
Cluster analysis
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,
    labRow=as.character(SimpleData[SigGenes,1]),scale="none")
  • Genes were selected based on their differences
    between Nic and Ctl treatments; in the heatmap
    these differences are not obvious except for one gene

6
Cluster analysis - centered data
  • > CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=T)
  • > heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,
      labRow=as.character(SimpleData[SigGenes,1]),scale="none")
  • > heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,
      labRow=as.character(SimpleData[SigGenes,1]))
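
Centering subtracts each gene's own mean from its row. The sketch below shows why subtracting a vector of row means from a matrix does this, using a tiny made-up matrix; `x` and its values are hypothetical.

```r
# Sketch: gene-wise (row) centering on a small made-up matrix.
x <- matrix(c(10, 12, 14,
               1,  2,  6), nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Subtracting a vector of row means works because R recycles it down the
# columns, so element i is subtracted from every entry in row i.
centered <- x - rowMeans(x, na.rm = TRUE)

# Equivalent, more explicit form.
centered2 <- sweep(x, 1, rowMeans(x, na.rm = TRUE))
```

After centering, every row has mean zero, so the heatmap colors reflect the pattern across samples rather than the overall expression level of each gene.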

7
Hierarchical Clustering
  • Calculating the "distance" or "similarity" between
    each pair of expression profiles
  • Merging the two "closest" profiles, forming a "node"
    in the clustering tree, and re-calculating the
    "distance" between this "sub-cluster" and the rest
    of the profiles or sub-clusters using one of the
    "linkage" principles. Then merging the two closest
    sub-clusters again, repeating until all profiles
    are joined in a single tree
  • Complete linkage - define the distance/similarity
    between the two clusters as the maximum/minimum
    distance/similarity between pairs of profiles in
    which one profile is from the first sub-cluster
    and the other profile is from the second
    sub-cluster
  • Average linkage - define the distance/similarity
    between the two clusters as the average
    distance/similarity between pairs of profiles in
    which one profile is from the first sub-cluster
    and the other profile is from the second
    sub-cluster
  • Single linkage - define the distance/similarity
    between the two clusters as the minimum/maximum
    distance/similarity between pairs of profiles in
    which one profile is from the first sub-cluster
    and the other profile is from the second
    sub-cluster
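
The three linkage rules above can be compared on a handful of made-up 2-D points. This is only a sketch: the points in `pts` are hypothetical and chosen so that the merge heights are easy to follow.

```r
# Sketch: complete, average and single linkage on made-up 2-D points.
pts <- rbind(a = c(0, 0), b = c(0, 1), c = c(4, 0), d = c(4, 1.5), e = c(10, 0))
d <- dist(pts, method = "euclidean")

hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# The merge heights show how each rule measures between-cluster distance.
# The first merge joins a and b (distance 1) under every rule; the later
# merge heights differ between the linkage rules.
heights <- rbind(complete = hc_complete$height,
                 average  = hc_average$height,
                 single   = hc_single$height)
heights
```

For any given pair of clusters, the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance.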

8
Euclidean Distance
  • R actually operates on distances, so similarities
    have to be transformed into distances - usually
    straightforward
  • Euclidean distance
  • In 2 and 3 dimensions, this is our usual, everyday
    distance
  • > EDistances<-dist(CenteredData[SigGenes,],method = "euclidean", diag = T, upper = T)
  • > print(EDistances,digits=2)
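
As a small illustration, the sketch below computes a Euclidean distance from its definition, compares it with `dist()`, and then shows one possible way of turning a similarity (Pearson correlation) into a distance. The profiles in `x` are made up, and `1 - correlation` is just one common choice, assumed here for illustration.

```r
# Sketch: Euclidean distance by hand versus dist(), and one common way to
# turn a similarity (Pearson correlation) into a distance.
x <- rbind(g1 = c(1, 2, 3, 4),
           g2 = c(2, 4, 6, 8),
           g3 = c(4, 3, 2, 1))

# Euclidean distance between g1 and g2 from the definition.
by_hand <- sqrt(sum((x["g1", ] - x["g2", ])^2))
from_dist <- as.matrix(dist(x, method = "euclidean"))["g1", "g2"]

# Correlation is a similarity; 1 - correlation is one possible distance
# (an assumption for illustration; other transformations are also used).
cor_dist <- as.dist(1 - cor(t(x)))
```

Here g1 and g2 are far apart in Euclidean terms but perfectly correlated, so the two measures can lead to very different clusterings.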

9
Distance Matrix
  • Distance Matrix - whole
           34  440  596 2797 4466 4512 7651
    34   0.00 8.55 5.64 5.46 8.15 8.03 9.14
    440  8.55 0.00 3.01 3.19 0.82 0.82 1.13
    596  5.64 3.01 0.00 0.33 2.53 2.48 3.59
    2797 5.46 3.19 0.33 0.00 2.71 2.62 3.72
    4466 8.15 0.82 2.53 2.71 0.00 0.47 1.18
    4512 8.03 0.82 2.48 2.62 0.47 0.00 1.14
    7651 9.14 1.13 3.59 3.72 1.18 1.14 0.00
  • Distance Matrix - lower triangular
  • > EDistances<-dist(CenteredData[SigGenes,],method = "euclidean")
  • > print(EDistances,digits=2)
           34  440  596 2797 4466 4512
    440  8.55
    596  5.64 3.01
    2797 5.46 3.19 0.33
    4466 8.15 0.82 2.53 2.71
    4512 8.03 0.82 2.48 2.62 0.47
    7651 9.14 1.13 3.59 3.72 1.18 1.14

10
Dendrograms - Complete Linkage
  • > Clustering<-hclust(EDistances,method="complete")
  • > plot(Clustering)

Distance Matrix - lower triangular
       34  440  596 2797 4466 4512
440  8.55
596  5.64 3.01
2797 5.46 3.19 0.33
4466 8.15 0.82 2.53 2.71
4512 8.03 0.82 2.48 2.62 0.47
7651 9.14 1.13 3.59 3.72 1.18 1.14

11
Clustering genes and samples
  • > EDistancesS<-dist(t(CenteredData[SigGenes,]),method = "euclidean")
  • > ClusteringS<-hclust(EDistancesS,method="complete")
  • > heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),
      Rowv=as.dendrogram(Clustering),col=pal,scale="none")

> TwoClusters<-cutree(ClusteringS,k = 2, h = NULL)
> TwoClusters
  Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
    1     2     2     2     1     1
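
The same two steps, clustering samples by transposing the matrix and then cutting the tree into two groups, can be tried on simulated data. The matrix `expr` below is made up, with the hypothetical Nic samples shifted so that the two groups separate cleanly.

```r
# Sketch: cluster samples (columns) by transposing, then cut the tree.
set.seed(3)
expr <- matrix(rnorm(7 * 6, sd = 0.1), nrow = 7,
               dimnames = list(NULL, c("Ctl", "Nic", "Nic.1", "Nic.2", "Ctl.1", "Ctl.2")))
# Shift the hypothetical Nic samples so the two groups are well separated.
nic_cols <- c("Nic", "Nic.1", "Nic.2")
expr[, nic_cols] <- expr[, nic_cols] + 3

# dist() works on rows, so transpose to get one row per sample.
sample_dist <- dist(t(expr), method = "euclidean")
sample_tree <- hclust(sample_dist, method = "complete")

# cutree() returns one cluster label per sample.
two_clusters <- cutree(sample_tree, k = 2)
table(cluster = two_clusters, is_nic = grepl("Nic", colnames(expr)))
```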
12
Clustering by partitioning: K-means algorithm
  • For a pre-specified number of clusters, iterate
    between calculating cluster "centroids" (i.e.
    cluster means) and re-assigning each profile to
    the cluster with the closest "centroid"
  • t = 1st iteration
  • iterate until $c_{t+1} = c_t$, i.e. until nothing
    changes between successive iterations
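
The iteration described on this slide can be sketched in a few lines of R. `simple_kmeans()` is a hypothetical helper written for illustration only (it is not the algorithm used by R's own `kmeans()`, which offers several variants), and the data are simulated.

```r
# Sketch of the k-means iteration (Lloyd-style) on made-up data.
# simple_kmeans() is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, iter_max = 10) {
  # Start from k distinct rows chosen at random as initial centroids.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(iter_max)) {
    # Assignment step: squared distance from every profile to every centroid.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop when the assignments no longer change between iterations.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of its assigned profiles.
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(4)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster, rep(c("A", "B"), each = 10))
```

In practice one would call `kmeans()` directly; the sketch only makes the two alternating steps visible.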

13
Clustering: k-means
  • > TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]), 2, iter.max = 10)
  • > TwoCKmeans
  • K-means clustering with 2 clusters of sizes 3, 3
  • Cluster means:
             34        440        596       2797      4466       4512      7651
    1  2.510742 -0.9565299  0.2554164  0.3246475 -0.770937 -0.7398173 -1.181848
    2 -2.510742  0.9565299 -0.2554164 -0.3246475  0.770937  0.7398173  1.181848
  • Clustering vector:
      Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
        1     2     2     2     1     1
  • Within cluster sum of squares by cluster:
    [1] 1.0805679 0.9474704
  • Available components:
    [1] "cluster"  "centers"  "withinss" "size"

14
Questions
  • How many clusters are there in the data?
  • What is the statistical significance of a
    clustering?
  • What is the confidence in assigning any particular
    expression profile to any particular cluster?
  • Difficult questions, particularly difficult to
    answer when using heuristic methods like
    hierarchical clustering and k-means
  • Need statistical models

15
Two genes at a time
  • Are these two genes co-expressed?
  • By looking at their expression patterns alone,
    combined with the null distribution of the
    similarity measure in non-co-expressed genes, we
    could conclude that this is the case.
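
One way to approximate such a null distribution is by permutation. The sketch below is an illustration under stated assumptions: the two profiles are made up, Pearson correlation is used as the similarity measure, and shuffling one profile stands in for "non-co-expressed" genes.

```r
# Sketch: a permutation null distribution for the correlation between two
# expression profiles (all values below are made up for illustration).
set.seed(5)
gene1 <- c(0.1, 1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8)
gene2 <- gene1 + rnorm(length(gene1), sd = 0.3)
observed <- cor(gene1, gene2)

# Shuffling one profile breaks any real association, which approximates the
# similarity expected between non-co-expressed genes.
null_cor <- replicate(2000, cor(gene1, sample(gene2)))

# One-sided empirical p-value: how often chance alone gives a correlation
# at least as large as the observed one.
p_value <- (sum(null_cor >= observed) + 1) / (length(null_cor) + 1)
```

A small `p_value` would support the conclusion that the two genes are co-expressed when they are looked at in isolation.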

16
Another look
  • What if we knew that there are two and only two
    distinct patterns in the data and we knew what
    they look like (thick dashed lines)?
  • Given this additional information we are likely
    to conclude that our two genes actually have
    different patterns of expression.

17
Many genes at a time
  • Simultaneous detection of patterns of
    expression defined by groups of expression
    profiles and assignment of individual expression
    profiles to appropriate patterns.
  • By looking at all genes at the same time, we
    come to a completely different conclusion
    than when looking at only two of them.
  • Questions: How many clusters? How confident are
    we in the number of clusters in the data? How
    confident are we that our two genes belong to two
    different clusters? Does such a confidence
    statement take into account the uncertainty
    about the true number of clusters?

18
Gene-specific normalization of the data
19
Clustering using non-normalized data
[Figure panels: K-means; Euclidean distance; Pearson's correlation]
20
Clustering using normalized data
[Figure panels: K-means; Euclidean distance; Pearson's correlation]
21
Why do we cluster?
  • Guilt by association

Co-expression
Co-regulation
Functional relationship
Assigning function to genes
22
Why do we cluster - Functional Annotation?
23
Dissecting the gene expression regulatory mechanisms
S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church.
Systematic determination of genetic network architecture.
Nat. Genet. 22 (1999) 281-285.