Title: From Physics to Functional Genomics via DNA Chips
1From Physics to Functional Genomics via DNA chips
Dr. Ariel Chernomoretz, Depto. de Física, FCEN - UBA
2Plan of the talk
- Central dogma of molecular biology
- Genechip Microarrays
- Low level analysis (affy preprocessing algorithms)
- High level analysis
- Differential expression
- Clustering
- Classification
- Applications
3Microarrays: what for?
Central Dogma of Molecular Biology: from chromosomes to protein
DNA -> (transcription) -> mRNA -> (translation) -> protein
- Different cell types and states (blood, muscle, skin, cancerous cells, dividing cells, etc.) exist because of differential gene expression (when, where, and how much a gene is expressed).
4Microarrays: what for?
5Microarrays: conceptual limitation
- WARNING: post-transcriptional regulation!
6Microarrays: Basic scheme
- Each microarray contains thousands of probes, either cDNA fragments corresponding to each gene or short oligonucleotide sequences
- By hybridizing labeled mRNA or cDNA from a sample to a microarray, transcripts from all expressed genes can be assayed simultaneously
7Microarray technologies
- Affymetrix Genechip
- Oligonucleotides (25-mers) synthesized on a glass slide (photolithography).
- Two-color chips (Brown's Lab at Stanford)
- cDNA or oligonucleotides spotted on glass or nylon membranes
8Affymetrix workflow
9Basic Microarray Experiment
Low level analysis
High level analysis
10Affymetrix Genechip platform
11Affymetrix Genechip platform
- Each measured element (e.g. a gene) is represented by a probe-set of 11-20 pairs of probes
- Each pair consists of a perfect-match (PM) probe and a mismatch (MM) probe
12Affymetrix Genechip platform
- Scanner
- Hybridization landmarks
- Automatic gridding
13Affymetrix Genechip platform: typical probeset behavior
14Affymetrix Genechip platform: Data preprocessing
- Probe intensity ~ mRNA abundance
- But first, some data massage!
- 1.) Noise: background and cross-hybridization
- 2.) Non-biological variation: normalization, to be able to compare data between samples
- 3.) Probeset expression index estimation
- To perform 1.), 2.), and 3.): MAS 5.0, RMA, GCRMA, MBEI, etc.
15Affy Preprocessing Algos: RMA
- RMA: Robust Multiarray Average algorithm
- Considers only PM
- Subtracting MM adds noise (increases variability)
- MMs do detect signal
- Sequence-specific effects
$\log(PM) = S + X + \varepsilon_1$, $\log(MM) = X + \varepsilon_2$ $\Rightarrow$ $\log(PM) - \log(MM) = S + (\varepsilon_1 - \varepsilon_2)$
16Affy Preprocessing Algos: RMA - Summarization
- Multichip method
- For each probeset, fit a linear model to the results, accounting for chip and probe effects, using a robust method
- $\log(PM_{ij}) = c_i + p_j + \varepsilon_{ij}$ (i: chip, j: probe)
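The robust fit of the chip-plus-probe model can be illustrated with Tukey's median polish, which is the robust method usually associated with RMA summarization. The following is only a minimal sketch in plain Python: the function name `median_polish`, the iteration limits, and the toy numbers are assumptions made for illustration, not taken from the talk.

```python
from statistics import median

def median_polish(log_pm, max_iter=10, tol=1e-9):
    """Fit log_pm[i][j] ~ overall + chip[i] + probe[j] by median polish.

    log_pm[i][j] is the log2 PM intensity of probe j on chip i
    (one probeset only). Returns (overall, chip, probe, residuals).
    """
    n_chips, n_probes = len(log_pm), len(log_pm[0])
    resid = [list(row) for row in log_pm]
    overall = 0.0
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(max_iter):
        # Sweep out row (chip) medians.
        row_med = [median(row) for row in resid]
        for i in range(n_chips):
            chip[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(probe)
        overall += shift
        probe = [p - shift for p in probe]
        # Sweep out column (probe) medians.
        col_med = [median([resid[i][j] for i in range(n_chips)])
                   for j in range(n_probes)]
        for j in range(n_probes):
            probe[j] += col_med[j]
            for i in range(n_chips):
                resid[i][j] -= col_med[j]
        shift = median(chip)
        overall += shift
        chip = [c - shift for c in chip]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, chip, probe, resid

# Toy probeset: 3 chips x 3 probes, exactly additive (hypothetical values).
toy = [[7.0, 8.0, 9.0],
       [8.0, 9.0, 10.0],
       [9.0, 10.0, 11.0]]
overall, chip, probe, resid = median_polish(toy)
expression = [overall + c for c in chip]  # one expression value per chip
```

In this sketch the expression index of the probeset on chip i is `overall + chip[i]`; outlying probes end up in the residuals instead of distorting the estimate.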
17Affy RMA
Calibration dataset (spike-in experiment): transcripts spiked at controlled concentrations
There is a need for a model of the physics of hybridisation (Naef and Magnasco 2003)
18Physics of hybridization I: GC content
A-T base pairs have two hydrogen bonds; G-C base pairs have three.
19Physics of hybridization II: Sequence order matters
Interactions between adjacent bases (stacking)
Nearest-neighbour interactions predict duplex stability, so sequence order is important (SantaLucia)
The binding energy of GAC is not the same as that of CAG
20Phys of hyb. III: Size matters
Pyrimidines (C, T) are small; purines are large.
E.g. if base 13 of the perfect match is T, base 13 of the mismatch is A, and the complementary base in the mRNA is also A.
There will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.
[Figure: 13th base of the PM and MM probes paired with the target]
21Phys of hyb. III: Size matters
Pyrimidines (C, T) are small; purines are large.
E.g. if base 13 of the perfect match is A, base 13 of the mismatch is T, and the complementary base in the mRNA is also T/U.
There will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.
[Figure: 13th base of the MM and PM probes paired with the target]
22Physics of hybridization IV: Labelling matters
C and T (pyrimidines) are labelled. So GC binds less strongly than CG, and AT binding is weaker than TA.
If the probe contains no C or T, it will hybridise well but with no fluorescence. If it is all C and T, it will have difficulty hybridising.
C and T within your mRNA fragment but immediately outside your probe will fluoresce and not interfere with hybridisation.
23Physics of Hybridization
Naef and Magnasco (PRE 2003)
The difference in intensity between the PM and MM is sensitive to the choice of the central base of the probe!
24Physics of hybridization
- So... there is a lot of physics to consider
- GC content (3 bonds vs 2 bonds)
- probe sequence composition (stacking energies)
- labelling
- nature of the 13th base
- In order to simplify matters, there have been several attempts to generate simple mathematical models which incorporate the key physics.
- The parameters in the models are then fitted using the data from many chips.
Naef and Magnasco (PRE 2003)
25Physics of hybridization
Naef and Magnasco (PRE 2003)
The model contains only position-specific affinities for each base (fitted using 80 chips).
A low-order function can be fitted to the hybridisation for a given base at a given position. The total hybridisation for the 25-base sequence is then the sum of the local hybridisations.
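The additive, position-specific structure of such a model can be sketched in a few lines. This is only an illustration of the idea of "sum of local contributions": the quadratic form, the names `position_weight` and `probe_affinity`, and the coefficients in `TOY_COEFFS` are hypothetical and are not the fitted values from Naef and Magnasco.

```python
PROBE_LENGTH = 25

def position_weight(coeffs, k):
    """Contribution of one base at position k (0-based).

    Assumes a low-order (here quadratic) polynomial in the rescaled
    position x, which runs from -1 (first base) to +1 (last base).
    """
    half = (PROBE_LENGTH - 1) / 2
    x = (k - half) / half
    a0, a1, a2 = coeffs
    return a0 + a1 * x + a2 * x * x

def probe_affinity(sequence, coeffs_by_base):
    """Total affinity = sum over the 25 positions of the local contributions."""
    if len(sequence) != PROBE_LENGTH:
        raise ValueError("expected a 25-base probe sequence")
    return sum(position_weight(coeffs_by_base[base], k)
               for k, base in enumerate(sequence))

# Hypothetical coefficients (a0, a1, a2) per base, for illustration only.
TOY_COEFFS = {
    "A": (-0.20, 0.00, 0.10),
    "C": (0.30, 0.00, -0.20),
    "G": (0.10, 0.05, -0.10),
    "T": (-0.20, -0.05, 0.20),
}

toy_affinity = probe_affinity("ACGTA" * 5, TOY_COEFFS)
```

In a real fit the coefficients would be estimated from many chips, as the slide describes; here they only show how the per-position terms add up.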
26Physics of hybridization: GCRMA
Wu and Irizarry report spike-in yeast controls on a human chip.
This measures non-specific hybridisation directly.
Many unchanging genes are not expressed!
Theory is comparable to experiment.
27Physics of hybridization: GCRMA
Wu and Irizarry used Naef's phenomenological model of probe 'affinities' to deal with non-specific hybridization: a new algorithm, GCRMA (2004).
[Figure: RMA vs. GCRMA]
28Physics of hybridization
29Basic Microarray Experiment
Low level analysis
High level analysis
30High level analysis
- After data processing, the data is in the form of a data matrix
- samples: Ns ~ 10
- genes: Ng ~ 10000
- Genes that do not change vs. genes that change: differential expression
[Figure: data matrix; axis label: Genes]
31Differential Expression
- Detect genes that are expressed at significantly different levels in one sample compared to another
- Identify lists of genes that act as markers between different samples
- Infer transcriptional networks from time-course experiments
32Differential expression: Fold Change
- FC = Experiment/Control
- In the beginning:
- FC > 2: upregulation
- FC < 1/2: downregulation
- Why 2? Intensity-dependent cutoff!
- More elaborate intensity-dependent methods were developed.
Intensity-dependent variation
33Differential Expression: t-test
- If t is higher than a certain threshold, the difference between X and Y can be said to be significant
- Under the null hypothesis, t follows a Student distribution
- The p-value tells us how probable it is to find a higher t value by chance if X and Y came from the same distribution
- List of candidate genes: selected by p-value
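The t formula itself does not survive in the text of this slide, so the sketch below assumes the unequal-variance (Welch) form of the statistic. Instead of looking the p-value up in a Student distribution, it estimates it by exact permutation of the sample labels, which follows the slide's wording directly: how often does chance alone give a |t| at least as large? The function names and the toy data are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """t statistic for two groups of replicate measurements (Welch form)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_pvalue(x, y):
    """Two-sided p-value: fraction of relabelings with |t| >= observed |t|.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small numbers of replicates.
    """
    pooled = list(x) + list(y)
    t_obs = abs(welch_t(x, y))
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(welch_t(gx, gy)) >= t_obs - 1e-12:
            hits += 1
    return hits / total

# Toy expression values for one gene (hypothetical): control vs. experiment.
control = [1, 2, 3]
experiment = [4, 5, 6]
t_value = welch_t(control, experiment)
p_value = permutation_pvalue(control, experiment)
```

With three replicates per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 0.1; this is one reason why few replicates limit what a per-gene test can detect.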
34Differential Expression: Other methods of gene selection
- ANOVA
- Fisher criterion score
- Entropy measure (information theory)
- χ² measure
- Information gain
- Information gain ratio
- Correlation-based feature selection
- Principal Component Analysis (PCA)
- Linear models, Bayesian estimates
- Etc.
35High level analysis
- After all this physics and statistics... a rather long list of DE genes
- What's next?
- Select some genes for validation (e.g. by qRT-PCR)
- Do follow-up experiments on some genes?
- Try to learn about all the genes on the list... (read 500 papers)
- Try to publish a huge table with all the results (not any more).
- ....
- Be careful: you know too much!
- Do the genes identified by statistics make biological sense?
36Pattern recognition
- Let the data speak.
- Find structure in the data that correlates with or explains some biological behavior (implicit assumption: there is an underlying pattern within the data!)
- Which experimental conditions have similar effects across a set of genes? (disease markers, cancer subgroup discovery, etc.)
- Which genes behave similarly across experiments? (gene networks, etc.) Groups of genes with similar expression profiles are co-expressed genes
37Clustering: Biological Relevance
- Genes that are co-expressed are often coordinately regulated via a common mechanism (co-regulated)
- Clustering becomes a way of identifying sets of genes that are putatively co-regulated and share similar functionality (guilt by association)
38Clustering
- Clustering: given N data points X_i, i = 1,...,N, embedded in a D-dimensional space, find the partition of the N points into M clusters such that points of the same cluster are 'more similar' to each other than two points that belong to different clusters
- A cluster is a group of genes (experiments) with similar expression profiles
- It is an unsupervised procedure. Once the notions of distance and neighborhood are given, no previous knowledge is used to find the grouping.
- Several methods:
- Hierarchical clustering (hierarchical method)
- K-means (partitioning method)
- Superparamagnetic clustering
- More
39Clustering: What is similar?
[Figure: expression profiles of Gene A, Gene B, Gene C, Gene D, Gene E and Gene F across experiments]
40Clustering: What is similar?
- How can we quantify the notion of similarity?
- Think of genes (or samples) as vectors
- Vector distance measurements
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation
- Mutual information
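Most of the measures listed above can be written down in a few lines once each gene (or sample) is treated as a vector. The sketch below uses plain Python and covers the first four; mutual information is left out because it requires a choice of binning or density estimate. The function names are illustrative.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Two toy profiles with a monotonic but non-linear relation (hypothetical).
gene_a = [1, 2, 3]
gene_b = [1, 4, 9]
```

Correlations are similarities rather than distances; a common convention, assumed here and not stated in the talk, is to cluster on 1 - r.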
41Clustering: What is similar?
42Clustering: Hierarchical clustering
Grouping according to similarity between samples
- Assumption: the objects to be clustered have a hierarchical character.
- Works by successively grouping similar pairs of genes (or samples)
- The two genes with the most similar profiles are grouped into a single node.
- This procedure continues until every gene has been placed in a node.
[Figure: heat map of over-expressed and under-expressed genes; samples ordered 1 5 2 9 11 3 4 6 10 7 8]
43Clustering: Hierarchical clustering
44Clustering: Hierarchical clustering
[Figure: distance matrix, from high similarity to low similarity]
45Clustering: Hierarchical clustering
- Join A and F. Recalculate the distance matrix.
- Distance to a cluster:
- Single linkage
- Average linkage
- Complete linkage
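The join-and-recalculate loop and the three linkage rules can be sketched as follows. This is a minimal, quadratic-time illustration in plain Python; the function name `agglomerate`, the labels, and the one-dimensional toy positions are assumptions for the example, not data from the talk.

```python
def agglomerate(dist, labels, linkage="average"):
    """Agglomerative clustering on a full distance matrix.

    dist[i][j] is the distance between items i and j. Returns the list of
    merges as (sorted member labels, distance at which they were joined).
    """
    clusters = {i: [i] for i in range(len(labels))}
    next_id = len(labels)
    merges = []

    def cluster_distance(a, b):
        pairwise = [dist[i][j] for i in clusters[a] for j in clusters[b]]
        if linkage == "single":      # closest pair of members
            return min(pairwise)
        if linkage == "complete":    # farthest pair of members
            return max(pairwise)
        return sum(pairwise) / len(pairwise)  # average over all pairs

    while len(clusters) > 1:
        keys = sorted(clusters)
        # Find the two closest clusters under the chosen linkage.
        d, a, b = min((cluster_distance(a, b), a, b)
                      for pos, a in enumerate(keys) for b in keys[pos + 1:])
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        next_id += 1
        merges.append((sorted(labels[i] for i in members), d))
    return merges

# Toy example: four genes placed on a line (hypothetical positions).
labels = ["A", "F", "B", "C"]
positions = [0, 1, 5, 7]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = agglomerate(dist, labels, linkage="average")
```

In this toy case A and F are joined first, then B and C, and only the height of the final merge depends on the linkage rule: 4 for single, 5.5 for average, 7 for complete.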
46Clustering: Hierarchical clustering
47Clustering: Hierarchical clustering
48Clustering: Hierarchical clustering
49Clustering: Hierarchical clustering
50Clustering: Hierarchical clustering
- Pros
- Useful to provide a view of the data structure and similarities.
- It is simple.
- It is colorful. It is part of most microarray studies.
- Cons
- Be cautious! Anything will cluster, even random data!
- Clusters are somewhat arbitrary, depending on how the tree is cut.
- What is a good cluster?
51Clustering: Superparamagnetic Clustering
- SPC: mapping of the problem at hand onto a Potts model (Wiseman, Blatt, Domany PRL 1996).
N spins s_i in {1, 2, ..., q}, one per data point
D_ij: distance between data points i and j
52Example: Transcriptional Profile Studies
[Figure labels: testosterone; Doping X]
- No detectable changes in muscle, but...
- the transcriptional profile does change in the prostate!!
53Example: Transcriptional Profile Studies
- Letrozole inhibits estrogen synthesis (chemical castration)
- After treatment, when does it happen?
54Example: yeast cell cycle transcriptional program
Cyclins: Cln1-3, Clb1-6
Cdk: Cdc28
- Cdc28/cyclin dimers accomplish specific and different tasks
- Proper progression through the cell cycle requires the successive activation and inactivation of these Cdc28/cyclin dimers.
55Example yeast cell cycle transcriptional program
56Example: leukemia classification
- Golub et al., Science (1999)
- Cancer classification applied to acute leukemias: Acute Myeloid Leukemia vs. Acute Lymphoblastic Leukemia
- Class prediction: assignment of samples to already defined classes
Success: 80 (training set), 70 (independent set)
57Example: metastasis vs primary tumors
- Ramaswamy et al. (Nature Genetics, 2003): is the gene expression program of metastasis already present in primary tumors?
- Used a weighted voting algorithm to find genes that separate primary tumours from metastases
- 128 genes best distinguished primary from metastatic tumours using the weighted voting algorithm
58Example: metastasis vs primary tumors
- Horizontal bar: red = recurrent, black = non-recurrent
- Vertical bar: red = originally primary, black = originally metastasis
- Hierarchical clustering in the space of the 128 genes identifies two main clusters of primary tumors, highly correlated with the original primary-tumor vs. metastasis distinction
59PCA analysis
60Topomap
A Gene Expression Map for Caenorhabditis elegans, Kim et al., Science 2001
Data from C. elegans DNA microarray experiments (many growth conditions, developmental stages, and varieties of mutants). Co-regulated genes were grouped together and visualized in a three-dimensional expression map that displays correlations of gene expression profiles as distances in two dimensions and gene density in the third dimension.
61Topomap
The gene expression map can be used as a gene
discovery tool to identify genes that are
co-regulated with known sets of genes (such as
heat shock, growth control genes, germ line
genes, and so forth) or to uncover previously
unknown genetic functions (such as genomic
instability in males and sperm caused by specific
transposons).
62Gene Recommender
Genes involved in the Retinoblastoma (Rb) complex in C. elegans: lin-9, lin-35, lin-46, lin-53, hda-1
Are there more? Can they be found from data on 19738 genes and 500 experiments?
Eggs, larvae, adults, heat shock, other stresses, mutants, various labs
Some experiments are irrelevant to Rb: they will add noise!
Algorithm for ranking genes according to how strongly they correlate, across a LARGE number of microarray experiments, with a set of query genes already known to have closely related function, in those experiments for which the query genes are most strongly correlated.
63Gene Recommender
- Google Sets: labs.google.com/sets
- Automatically creates sets of items from a few examples.
64Plan of the talk
- Central dogma of molecular biology
- Genechip Microarrays
- Low level analysis (affy preprocessing algorithms)
- High level analysis
- Differential expression
- Clustering
- Classification
- Examples
65Things I did not mention
- Quality control standards
- Two-color arrays
- Langmuir adsorption model
- Experimental Design
- Differential Expression (let's get serious)
- Multiple Hypothesis Testing Problem (get used to the unusual)
- Systems Biology (graph theory, scale-free networks, dynamical systems modeling)
- Annotation (Gene Ontology, KEGG pathway database, NCBI, ...)
- Highly interdisciplinary experience.
Low level analysis
High level analysis
66Thank you
Sir! They have discovered the code of the human genome!
Damn hackers! I am going to have to change the password!
68Clustering: What is similar?
- Vector distance measurements
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation: Pearson correlation of ranks
- Mutual information: amount of information gained about X when Y is learned
69Affymetrix workflow
70Affymetrix workflow
71Affymetrix workflow
72Affymetrix Genechip printing
- Oligonucleotide arrays are printed using a combination of photolithography and combinatorial chemistry
- Quartz wafer
- Blocking compound (can be removed by light exposure)
- Mask (18-20 µ2 windows)
73Affy On chip hybridization I
74Affy On chip hybridization II
Using very long cRNAs reduces hybridization efficiency (the molecule folds back on itself)
(Southern E. et al., Nature Genetics, Vol. 21, 10-14, 1999)
75Affy On chip hybridization III
76Affy Preprocessing Algos: RMA
- RMA considers log2-transformed intensities
- log(Intensities) ~ Normal (noise is additive on the log scale)
- Has a normalizing effect on calculated fold changes
- 10-fold up-regulation: distance to random expectation (1) is 9.
- 10-fold down-regulation: distance to random expectation (1) is 0.9
- Raw ratios will emphasize up-regulation and ignore down-regulation.
- e.g. 4-fold up-regulation: FC = 4, log2(FC) = 2
- 4-fold down-regulation: FC = 0.25, log2(FC) = -2
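The asymmetry of raw ratios and the symmetry of log ratios can be checked directly. A minimal sketch, with a hypothetical helper name and made-up intensities:

```python
from math import log2

def log2_fold_change(experiment, control):
    """log2 of the fold change FC = experiment / control."""
    return log2(experiment / control)

# 4-fold up- and down-regulation become symmetric around 0 on the log scale.
up = log2_fold_change(400.0, 100.0)
down = log2_fold_change(100.0, 400.0)

# On the raw scale, the distances to the "no change" value FC = 1 differ.
raw_distance_up = 10.0 - 1.0      # 10-fold up
raw_distance_down = 1.0 - 0.1     # 10-fold down
```

On the log2 scale a threshold of |log2(FC)| > 1 treats two-fold changes in either direction in the same way.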
77Affy Preprocessing Algos: RMA - Background Subtraction
- Adjust for global background (convolution model)
- $O = N + S$
- Find $P(S \mid O, N)$ and then $E(S \mid O, N)$
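A worked version of the convolution model helps to see what the correction does. The sketch below assumes the usual distributional choices for this model, which the slide does not spell out: signal S exponential with rate `alpha`, noise N normal with mean `mu` and standard deviation `sigma`, and the noise left untruncated. Under those assumptions the posterior of S given an observed intensity o is a normal truncated at zero, and its mean has a closed form. The parameter values are made up; in practice they would be estimated from each array.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_signal(o, mu, sigma, alpha):
    """E(S | O = o) for O = N + S, S ~ Exp(alpha), N ~ Normal(mu, sigma^2).

    The posterior of S is Normal(a, sigma^2) truncated to S >= 0, with
    a = o - mu - sigma^2 * alpha, so its mean is
    a + sigma * pdf(a / sigma) / cdf(a / sigma).
    Note: for intensities very far below mu the cdf underflows to 0.
    """
    a = o - mu - sigma * sigma * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Hypothetical noise and signal parameters on the raw intensity scale.
MU, SIGMA, ALPHA = 100.0, 20.0, 0.01

dim = expected_signal(50.0, MU, SIGMA, ALPHA)       # below the background mean
bright = expected_signal(1000.0, MU, SIGMA, ALPHA)  # far above the background
```

Unlike a plain subtraction of the background mean, the corrected value stays positive even when the observed intensity is below `mu`, so the log transform that follows is always defined.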
78Affy Preprocessing Algos: RMA - Normalization
- Quantile normalization
- Transforms the data, quantile by quantile, to make all arrays have the same distribution of data
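Quantile normalization is short enough to sketch in full: sort each array, average across arrays at each rank to obtain a reference distribution, then give every probe the reference value of its rank. The function name and the toy intensities are illustrative, and ties are broken arbitrarily here, which a production implementation would handle more carefully.

```python
def quantile_normalize(arrays):
    """Force all arrays (equal-length lists of intensities) to share one distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[rank] for s in sorted_arrays) / len(arrays)
                 for rank in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # probe i receives the value of its rank
        normalized.append(out)
    return normalized

# Two toy arrays with three probes each (hypothetical intensities).
chips = [[5.0, 2.0, 3.0],
         [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
```

After the transformation every array contains exactly the same set of values; only the assignment of values to probes differs between arrays.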
79Affy Preprocessing Algos: RMA - Normalization
- After quantile normalization
80Affy Preprocessing Algos: RMA - Background Subtraction
81Affy Preprocessing Algos: RMA - Normalization
82Affy RMA: An example
- Spike-in experiment
- 14 human genes were spiked in at concentrations ranging from 0 to 1024 pM (0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM)
[Table: spike-in design; columns Gen1, Gen2, Gen3, ..., Gen4; rows chip1, chip2, chip3, chip4]
87Clustering: Hierarchical clustering
- The resulting figure is known as a hierarchical tree.
- Different numbers of clusters depending on how deep we look.
88Clustering: Hierarchical clustering
- In each step, the order of the two groups being joined is arbitrary
89Clustering: Hierarchical clustering
- What is a good cluster? Bootstrapping
- The data is resampled (some experiments are taken out randomly and replaced by copies of other experiments)
- The whole clustering is repeated.
- Clusters that appear often are statistically more reliable than others.
90Clustering: K-means
- A specific number of clusters has to be provided
- Goal: assign elements to clusters
91Clustering: K-means
- Start by guessing k centers
92Clustering: K-means
- Assign elements to these centers
93Clustering: K-means
94Clustering: K-means
- Reassign elements, and repeat until convergence
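The steps on these slides can be put together in a compact sketch. The version below takes the initial guess of the k centers as an argument so that the result is reproducible. It also assumes that the step between assignment and reassignment is moving each center to the mean of its assigned elements, which is the standard k-means update but is not spelled out in the text. The function name and toy points are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: assign, recompute centers, repeat until nothing changes."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assign every element to its nearest center.
        assignment = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                      for p in points]
        # Move each center to the mean of the elements assigned to it.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centers.append(tuple(sum(coord) / len(members)
                                         for coord in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return assignment, centers

# Four toy elements in two dimensions and a guess of k = 2 centers.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centers = kmeans(points, initial_centers=[(0, 0), (10, 10)])
```

The outcome depends on the initial guess, so in practice the procedure is usually run from several starting configurations and the tightest solution is kept.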