Reduce the dimensionality of the problem: identify the major patterns in the dataset
Functional annotation of ESTs
Links among pathways
Dissection of regulatory networks
5 Similarity measures
Clustering identifies groups of genes with similar expression profiles
How is the similarity/distance between gene expression profiles measured?
m conditions: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
6 Similarity measure - Euclidean distance
In general, for m experiments: X = (x1, x2, x3, ..., xm), Y = (y1, y2, y3, ..., ym)
7 Similarity measure - Correlation coefficient
X = (x1, x2, x3, ..., xm)
Y = (y1, y2, y3, ..., ym)
-1 ≤ S(X,Y) ≤ 1
8 Euclidean vs. correlation
Euclidean distance takes into account the magnitude of the expression
Correlation coefficient: insensitive to the amplitude of expression; takes into account the trend of the change.
Common trends are considered biologically very relevant and the magnitude is considered less important, which favors correlation
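A minimal Python sketch of the two measures compared above, assuming the Pearson correlation coefficient (the slides do not say which correlation is meant). The function names and the two example profiles are hypothetical, chosen so that both genes follow the same trend but differ tenfold in amplitude.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient; lies between -1 and 1."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles: gene_b has the same trend as gene_a at 10x the amplitude.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]

dist = euclidean_distance(gene_a, gene_b)   # large, because the magnitudes differ
corr = pearson_correlation(gene_a, gene_b)  # essentially 1, because the trends match
```

Under Euclidean distance these two genes look far apart, while under correlation they look nearly identical.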
9 Standardization of expression levels
X = (x1, x2, x3, ..., xm)
xj → (xj - mean(X)) / std(X) (doesn't change corr(X,Y))
[Figure: expression profiles before standardization and after standardization]
10 Clustering Algorithms
The user sets the number of clusters, k
Initialization: each gene is randomly assigned to one of the k clusters
The average expression vector is calculated for each cluster (the cluster's profile)
Iterate over the genes:
For each gene, compute its similarity to the cluster profiles.
Move the gene to the cluster it is most similar to.
Recalculate the cluster profiles.
Score the current partition: the sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution).
Stop criterion: further shuffling of genes results in only minor improvement in the clustering score
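The procedure above can be sketched in Python roughly as follows. This is an illustrative sketch, not the implementation of any particular tool. It assumes Euclidean distance, recalculates the cluster profiles once per pass over all genes (rather than after every single move), and re-seeds an empty cluster with a random gene. The names `kmeans`, `tol` and `max_rounds`, and the example profiles, are hypothetical.

```python
import math
import random


def distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_profile(profiles):
    """Average expression vector (cluster profile) of a non-empty list of genes."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def kmeans(genes, k, tol=1e-6, max_rounds=100, seed=0):
    rng = random.Random(seed)
    # Initialization: each gene is randomly assigned to one of the k clusters.
    labels = [rng.randrange(k) for _ in genes]
    score = float("inf")
    for _ in range(max_rounds):
        # Cluster profiles. Assumption: an empty cluster is re-seeded with a random gene.
        profiles = []
        for c in range(k):
            members = [g for g, label in zip(genes, labels) if label == c]
            profiles.append(mean_profile(members) if members else list(rng.choice(genes)))
        # Move every gene to the cluster whose profile it is closest to.
        labels = [min(range(k), key=lambda c: distance(g, profiles[c])) for g in genes]
        # Score: sum of distances between genes and the profile of their cluster.
        new_score = sum(distance(g, profiles[label]) for g, label in zip(genes, labels))
        improvement = score - new_score
        score = new_score
        if improvement < tol:  # stop criterion: only minor improvement
            break
    return labels, score


# Hypothetical expression profiles over 3 conditions.
example_genes = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 2.2, 1.1]]
example_labels, example_score = kmeans(example_genes, k=2)
```

The result depends on the random initial assignment, so in practice the run is usually repeated with several seeds and the partition with the best score is kept.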
12 How Many Clusters
Try several parameters and compare the clustering solutions
Criteria for comparison: later in the presentation
PCA (Principal Component Analysis)
A technique for projecting the gene expression dataset onto a reduced (2- or 3-dimensional), easily visualized space
13 PCA - Example
Dataset: thousands of genes probed in 5 conditions (time points relative to treatment)
The expression profile of each gene is represented by the vector of its expression levels, X = (X1, X2, X3, X4, X5)
Imagine each gene X as a point in a 5-dimensional space.
Each direction/axis corresponds to a specific condition
Genes with similar profiles are close to each other in this space
PCA: project this dataset onto 2 dimensions, preserving as much information as possible
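To make the projection concrete, here is a small pure-Python sketch. It assumes that "information" means variance, and it finds the first two principal components by power iteration with deflation. That is one simple way to get them; real tools typically use an SVD routine instead. All function names and the example profiles are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(data):
    """Subtract the per-condition mean so the cloud of genes is centered at 0."""
    means = [sum(column) / len(data) for column in zip(*data)]
    return [[value - mean for value, mean in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix between conditions."""
    n = len(centered)
    columns = list(zip(*centered))
    return [[dot(a, b) / (n - 1) for b in columns] for a in columns]


def leading_component(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix.

    Assumes the matrix is not all zeros, i.e. there is variance left to explain.
    """
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return v, eigenvalue


def pca_2d(data):
    """Project each gene onto the two directions of largest variance."""
    centered = center(data)
    cov = covariance(centered)
    pc1, variance1 = leading_component(cov)
    # Deflation: remove the variance already explained by pc1, then repeat.
    size = len(cov)
    deflated = [[cov[i][j] - variance1 * pc1[i] * pc1[j] for j in range(size)]
                for i in range(size)]
    pc2, _ = leading_component(deflated, seed=1)
    return [(dot(row, pc1), dot(row, pc2)) for row in centered]


# Hypothetical expression profiles of 6 genes in 5 conditions.
example_genes = [
    [1.0, 2.0, 3.0, 2.0, 1.0], [1.1, 2.2, 2.9, 2.1, 0.9],
    [3.0, 2.0, 1.0, 2.0, 3.0], [2.9, 1.8, 1.2, 2.1, 3.1],
    [2.0, 2.0, 2.0, 3.0, 4.0], [2.1, 1.9, 2.2, 3.1, 3.8],
]
example_points = pca_2d(example_genes)  # one (x, y) pair per gene, ready to plot
```

The sign of each component is arbitrary, so a plot may appear mirrored between runs or tools without changing which genes lie close together.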
14 PCA - Example
Visual estimation of the number of clusters in the data
15 K-means example: 4 clusters
16 [Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified]
17 K-means example: 3 clusters
18 Too few clusters: K = 2
19 SOMs (Self-Organizing Maps)
The user sets the number of clusters in the form of a rectangular grid (e.g. 3x2) of map nodes
Imagine genes as points in (M-dimensional) space
Initialization: map nodes are randomly placed in the data space
20 Genes: data points. Clusters: map nodes.
21 SOM - Scheme
Randomly choose a data point (gene).
Find its closest map node
Move this map node towards the data point
Move the neighboring map nodes towards this point, but to a lesser extent
Iterate over data points
The extent of node displacement decreases with the iteration number
After thousands of iterations
Assign each gene to the map node (cluster) it is most similar to
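The SOM scheme above might be sketched in Python as follows. The slides do not give the schedules, so the linearly decaying learning rate, the Gaussian neighborhood on the grid, and the names `train_som`, `assign_genes`, `rate` and `sigma` are all assumptions made for illustration.

```python
import math
import random


def distance(x, y):
    """Euclidean distance between two points in expression space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def train_som(genes, rows, cols, iterations=2000, seed=0):
    rng = random.Random(seed)
    dims = len(genes[0])
    low = [min(g[d] for g in genes) for d in range(dims)]
    high = [max(g[d] for g in genes) for d in range(dims)]
    # Initialization: map nodes are randomly placed in the data space.
    nodes = {(r, c): [rng.uniform(low[d], high[d]) for d in range(dims)]
             for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        decay = 1.0 - step / iterations                  # displacements shrink over time
        rate = 0.5 * decay                               # assumed learning-rate schedule
        sigma = max(0.5 * max(rows, cols) * decay, 0.1)  # assumed neighborhood width
        gene = rng.choice(genes)                         # randomly chosen data point
        winner = min(nodes, key=lambda n: distance(nodes[n], gene))
        for (r, c), position in nodes.items():
            grid_gap = math.hypot(r - winner[0], c - winner[1])
            # The winner (grid_gap 0) moves the most; grid neighbors move less.
            pull = rate * math.exp(-grid_gap ** 2 / (2 * sigma ** 2))
            for d in range(dims):
                position[d] += pull * (gene[d] - position[d])
    return nodes


def assign_genes(genes, nodes):
    """Assign each gene to the map node (cluster) it is closest to."""
    return [min(nodes, key=lambda n: distance(nodes[n], g)) for g in genes]


# Hypothetical profiles over 3 conditions, clustered on a 3x2 grid of map nodes.
example_genes = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0],
                 [2.9, 2.2, 1.1], [2.0, 2.0, 2.0], [2.1, 1.9, 2.2]]
example_nodes = train_som(example_genes, rows=3, cols=2)
example_clusters = assign_genes(example_genes, example_nodes)
```

Each cluster is identified by its grid coordinate, so clusters that are adjacent on the grid tend to hold genes with related profiles.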
24 CLICK (CLuster Identification via Connectivity Kernels)
Compute similarity between all pairs of genes
Construct weighted similarity graph
Genes represented by nodes
The weight of an edge connecting 2 genes reflects their expression similarity
Find the minimum-weight cut that separates the graph into 2 disconnected subgraphs
Iterate on cutting subgraphs
Stop criteria for cutting
Estimates the optimal number of clusters in the dataset
Identifies outlier genes and leaves them unclustered (singletons)
26 Hierarchical Clustering
Organize the genes in a structure of a hierarchical tree
Initial step: each gene is regarded as a cluster with one item
Find the 2 most similar clusters and merge them into a common node
The length of the branch is proportional to the distance
Iterate on merging nodes until all genes are contained in one cluster, the root of the tree.
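The merging loop above can be illustrated with a short Python sketch. It assumes Euclidean distance between genes and takes the rule for measuring distance between two clusters (single, average or complete linkage) as a parameter. The function names are hypothetical, and the all-pairs search is written for clarity, not for speed.

```python
import math


def distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(a, b, genes, linkage):
    """Distance between two clusters, given as tuples of gene indices."""
    pairwise = [distance(genes[i], genes[j]) for i in a for j in b]
    if linkage == "single":       # closest pair of genes
        return min(pairwise)
    if linkage == "complete":     # farthest pair of genes
        return max(pairwise)
    return sum(pairwise) / len(pairwise)  # average over all pairs


def hierarchical_clustering(genes, linkage="average"):
    """Return the merges, in order, as (cluster_a, cluster_b, distance) tuples."""
    # Initial step: each gene is a cluster with one item.
    clusters = [(i,) for i in range(len(genes))]
    merges = []
    while len(clusters) > 1:
        # Find the 2 most similar (closest) clusters.
        gap, i, j = min(
            (cluster_distance(a, b, genes, linkage), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j
        )
        # Merge them into a common node; `gap` sets the branch length.
        merges.append((clusters[i], clusters[j], gap))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges


# Hypothetical one-condition profiles, small enough to check by hand.
example_genes = [[0.0], [1.0], [10.0]]
example_merges = hierarchical_clustering(example_genes, linkage="single")
```

Here genes 0 and 1 are merged first at distance 1. The remaining gene then joins them at distance 9 under single linkage, 10 under complete linkage, and 9.5 under average linkage.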
27 Hierarchical Clustering - distance between clusters
Single-linkage
Average-linkage
Complete-linkage
28 Mathematical evaluation of clustering solution
Merits of a good clustering solution
Homogeneity: genes inside a cluster are highly similar to each other.
Measured as the average similarity between a gene and the center (average profile) of its cluster.
Separation: genes from different clusters have low similarity to each other.
Measured as the weighted average similarity between centers of clusters.
These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation
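A small Python sketch of the two scores. The slides do not state the similarity measure or the weights, so this assumes Pearson correlation as the similarity and weights each pair of cluster centers by the product of the two cluster sizes. The function names and example data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cluster_centers(genes, labels):
    """Average profile and size of every cluster."""
    members = {}
    for gene, label in zip(genes, labels):
        members.setdefault(label, []).append(gene)
    centers = {c: [sum(col) / len(gs) for col in zip(*gs)] for c, gs in members.items()}
    sizes = {c: len(gs) for c, gs in members.items()}
    return centers, sizes


def homogeneity(genes, labels):
    """Average similarity between each gene and the center of its cluster."""
    centers, _ = cluster_centers(genes, labels)
    return sum(pearson(g, centers[c]) for g, c in zip(genes, labels)) / len(genes)


def separation(genes, labels):
    """Size-weighted average similarity between centers (lower = better separated)."""
    centers, sizes = cluster_centers(genes, labels)
    ids = sorted(centers)
    weighted_sum = total_weight = 0.0
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            weight = sizes[a] * sizes[b]  # assumed weighting scheme
            weighted_sum += weight * pearson(centers[a], centers[b])
            total_weight += weight
    return weighted_sum / total_weight


# Hypothetical data: two rising genes in cluster 0, two falling genes in cluster 1.
example_genes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0], [6.0, 4.0, 2.0]]
example_labels = [0, 0, 1, 1]
h = homogeneity(example_genes, example_labels)  # close to 1: clusters are tight
s = separation(example_genes, example_labels)   # close to -1: centers are dissimilar
```

A good solution has high homogeneity together with low similarity between cluster centers, as in this example.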
29 Performance on Yeast Cell Cycle Data
698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a blind test.
Ben-Dor, Shamir, Yakhini 1999
30 Which genes to cluster
Apply filtering prior to clustering: focus the analysis on the responding genes
Applying controlled statistical tests to identify responding genes usually yields too few genes to allow a global characterization of the response
Fold change: choose genes that changed by at least M-fold in at least L conditions
Variance: choose the top P genes with the highest variance over the dataset
Try various filtering schemes to find the setting that gives the best results (biologically)
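The two filters could be sketched in Python as below. This assumes each profile holds expression ratios relative to a reference (so an M-fold change means a ratio of at least M or at most 1/M). The gene names, values and function names are hypothetical.

```python
import statistics


def fold_change_filter(ratios, min_fold, min_conditions):
    """Keep genes that changed by at least `min_fold` in at least `min_conditions`."""
    kept = {}
    for gene, profile in ratios.items():
        # Count conditions with at least min_fold up- or down-regulation.
        hits = sum(1 for r in profile if r >= min_fold or r <= 1.0 / min_fold)
        if hits >= min_conditions:
            kept[gene] = profile
    return kept


def top_variance_filter(ratios, top_p):
    """Names of the `top_p` genes with the highest variance over the dataset."""
    ranked = sorted(ratios, key=lambda g: statistics.pvariance(ratios[g]), reverse=True)
    return ranked[:top_p]


# Hypothetical expression ratios (treated / reference) in 4 conditions.
example_ratios = {
    "geneA": [1.0, 1.1, 0.9, 1.0],   # barely changes
    "geneB": [2.5, 3.0, 1.0, 0.4],   # changes at least 2-fold in 3 conditions
    "geneC": [1.0, 2.1, 1.0, 1.0],   # changes at least 2-fold in 1 condition
}

responders = fold_change_filter(example_ratios, min_fold=2.0, min_conditions=2)
most_variable = top_variance_filter(example_ratios, top_p=2)
```

With M = 2 and L = 2 only geneB passes the fold-change filter, while the variance filter with P = 2 keeps geneB and geneC.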
31 Clustering Tools
Cluster (Eisen): hierarchical
GeneCluster (Tamayo): SOM
TIGR MeV: K-means, SOM, hierarchical, QTC, CAST
Expander: CLICK, SOM, K-means, hierarchical
32 Ascribe Biological Meaning to Clusters
Identify over-represented functional categories in the clusters (i.e. a cluster contains many more genes of a specific biological process than expected by chance)
Standard assignment of genes into functional categories
33 Gene Ontology (GO) project
Defines controlled terms (ontologies) for the description of gene products from 3 aspects:
Biological process (DNA repair, mitosis)
Molecular function (protein serine/threonine kinase activity, transcription factor activity)
Cellular component (nucleus, ribosome)
Unified framework for gene annotation: species-independent vocabularies
A gene can have multiple associations in each ontology
GO terms are organized in hierarchical structures called directed acyclic graphs (DAGs)
Very general terms at top levels of the graph
Terms get more specialized at lower levels
35 Gene annotations using GO
Human: LocusLink (NCBI), GOA (EBI); 15K genes with biological process annotation
Mouse: MGI, GOA; 10K annotated genes
Rat: RGD; 2.5K annotated genes
Fly: FlyBase; 4.5K annotated genes
Arabidopsis: TAIR; 12K annotated genes
Affymetrix chips: NetAffx
36 Ascribe Biological Meaning to Clusters
This analysis is NOT INFORMATIVE!
Some of the abundances can be explained just by chance
Statistical tests are essential to detect significant phenomena
37 Identifying enriched GO categories in clusters
In the previous example
Total number of genes on the chip with annotation: 5000
Total number of genes on the chip associated with the metabolism GO category: 3600
Number of annotated genes in cluster 3: 73
Number of metabolic genes in cluster 3: 50
Is this a statistically significant phenomenon?
Hypergeometric probability score
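The numbers above can be plugged into a hypergeometric tail probability, sketched here in Python with exact integer arithmetic. The function and variable names are hypothetical, and the sketch assumes the score is the probability of seeing at least the observed number of category genes in a cluster drawn at random from the annotated genes on the chip.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are picked at random, without
    replacement, from `population` genes of which `successes` are in the category."""
    total = comb(population, draws)
    upper = min(draws, successes)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return float(Fraction(tail, total))


annotated_on_chip = 5000     # annotated genes on the chip
metabolic_on_chip = 3600     # of these, genes in the metabolism category
cluster_size = 73            # annotated genes in cluster 3
metabolic_in_cluster = 50    # metabolic genes in cluster 3

# Number of metabolic genes expected in the cluster by chance alone.
expected = cluster_size * metabolic_on_chip / annotated_on_chip
p_value = hypergeom_tail(annotated_on_chip, metabolic_on_chip,
                         cluster_size, metabolic_in_cluster)
```

About 52.6 metabolic genes are expected in a cluster of 73 by chance alone, so the observed 50 is not an enrichment at all. The tail probability is correspondingly large, which is why the raw counts alone are not informative.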
39 Functional GO enrichment - Tools
SOM figures in this presentation were taken from a presentation by Benedikt Brors