1
From Physics to Functional Genomics via DNA Chips
(Desde la Física hacia la Genómica Funcional vía chips de ADN)
Dr. Ariel Chernomoretz, Depto. de Física, FCEN - UBA
2
Plan of the talk

Central dogma of molecular biology
Genechip Microarrays
Low level analysis (affy preprocessing algorithms)
High level analysis
Differential expression
Clustering
Classification
Applications

3
Microarrays: what for?
Central Dogma of Molecular Biology: from chromosomes to protein
DNA -> (transcription) -> mRNA -> (translation) -> protein

Different cell types and states (blood, muscle, skin, cancerous cells, dividing cells, etc.) exist because of differential gene expression (when, where, and how much a gene is expressed).

4
Microarrays: what for?
5
Microarrays: conceptual limitation

WARNING: post-transcriptional regulation!

6
Microarrays: basic scheme

Each microarray contains thousands of probes of either
cDNA fragments corresponding to each gene, or
short oligonucleotide sequences
By hybridizing labeled mRNA or cDNA from a sample to a microarray, transcripts from all expressed genes can be assayed simultaneously

7
Microarray technologies

Affymetrix Genechip
Oligonucleotides (25-mers) synthesized on a glass slide (photolithography).
Two-color chips (Brown's Lab at Princeton)
cDNA or oligonucleotides spotted on glass or nylon membranes

8
Affymetrix workflow
9
Basic Microarray Experiment
Low level analysis
High level analysis
10
Affymetrix Genechip platform
11
Affymetrix Genechip platform

Each measured element (e.g. a gene) is
represented by a probe-set of 11-20 pairs of
probes
Perfect-match (PM) and mismatch (MM) probes

12
Affymetrix Genechip platform
- Scanner
- Hybridization landmarks
- Automatic gridding
13
Affymetrix Genechip platform: typical probeset behavior
14
Affymetrix Genechip platform: data preprocessing

Probe intensity ~ mRNA abundance
But first... some data massage!
1.) Noise: background and cross-hybridization
2.) Non-biological variation: normalization, to be able to compare data between samples
3.) Probeset expression index estimation
To perform 1.), 2.), and 3.): MAS 5.0, RMA, GCRMA, MBEI, etc.

15
Affy preprocessing algorithms: RMA

RMA: Robust Multiarray Average algorithm
Considers only PM
Subtracting MM adds noise and increases variability
MMs do detect signal
Sequence-specific effects

$\log(PM) = S + X + \epsilon_{PM}$
$\log(MM) = X + \epsilon_{MM}$
$\log(PM) - \log(MM) = S + (\epsilon_{PM} - \epsilon_{MM})$
16
Affy preprocessing algorithms: RMA summarization

Multichip method
For each probeset, fit a linear model to the results, accounting for chip and probe effects, using a robust method
$\log(PM_{ij}) = c_i + p_j + \epsilon_{ij}$ (i: chip, j: probe)
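The robust fit of the chip-plus-probe model can be illustrated with Tukey's median polish, which is the robust procedure commonly associated with RMA summarization. The sketch below is a simplified version written for this transcript, not the reference implementation: the function name `median_polish`, the fixed iteration count, and the toy numbers are all assumptions.

```python
from statistics import median

def median_polish(log_pm, n_iter=10):
    """Fit log_pm[i][j] ~ chip[i] + probe[j] by alternately removing medians.

    log_pm[i][j] is the log intensity of probe j on chip i for ONE probeset.
    Returns (chip effects, probe effects, residuals).
    """
    resid = [list(row) for row in log_pm]
    n_chips, n_probes = len(resid), len(resid[0])
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep the median of each chip (row) out of the residuals.
        for i in range(n_chips):
            m = median(resid[i])
            chip[i] += m
            resid[i] = [r - m for r in resid[i]]
        # Sweep the median of each probe (column) out of the residuals.
        for j in range(n_probes):
            m = median(resid[i][j] for i in range(n_chips))
            probe[j] += m
            for i in range(n_chips):
                resid[i][j] -= m
    return chip, probe, resid

# Toy probeset (made-up numbers): 3 chips, 5 probes, purely additive.
true_chip = [8.0, 9.0, 10.5]
true_probe = [-0.5, 0.0, 0.25, 1.0, -1.0]
toy = [[c + p for p in true_probe] for c in true_chip]
chip_effects, probe_effects, residuals = median_polish(toy)
```

The chip effects play the role of the per-chip expression estimate for the probeset; medians are used instead of means so that a few outlying probes do not dominate the fit.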

17
Affy RMA
Calibration dataset: spike-in experiment
Transcripts spiked at controlled concentrations
There is a need for a model of the physics of hybridisation
(Naef and Magnasco 2003)
18
Physics of hybridization I: GC content
A-T base pairs have two hydrogen bonds; G-C base pairs have three
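As a small illustration of the bond counting, the sketch below computes the GC fraction of a probe and the total number of Watson-Crick hydrogen bonds it can form. The function names and the example sequence are hypothetical, and the code assumes the sequence contains only A, C, G and T.

```python
def gc_content(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def hydrogen_bonds(seq):
    """Total Watson-Crick hydrogen bonds: 3 per G-C pair, 2 per A-T pair."""
    return sum(3 if base in "GC" else 2 for base in seq.upper())

example_probe = "GGCCAATT"  # made-up sequence, for illustration only
print(gc_content(example_probe), hydrogen_bonds(example_probe))
```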
19
Physics of hybridization II: sequence order matters
H-bond interactions between adjacent bases
Nearest-neighbour interactions predict duplex kinetics, so sequence order is important (SantaLucia)
The binding energy of GAC is not the same as that of CAG
20
Physics of hybridization III: size matters
Pyrimidines (C, T) are small; purines are large
E.g. if base 13 of the perfect match is T, base 13 of the mismatch is A, and the complementary base in the mRNA is also A
There will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.
[Figure: 13th base of the PM and MM probes paired with the target]
21
Physics of hybridization III: size matters (cont.)
Pyrimidines (C, T) are small; purines are large
E.g. if base 13 of the perfect match is A, base 13 of the mismatch is T, and the complementary base in the mRNA is also T/U
There will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.
[Figure: 13th base of the MM and PM probes paired with the target]
22
Physics of hybridization IV: labelling matters
C and T (pyrimidines) are labelled, so GC binds less strongly than CG, and AT binding is weaker than TA.
If the probe contains no C or T, it will hybridise well but with no fluorescence. If it is all C and T, it will have difficulty hybridising.
C and T within the mRNA fragment but immediately outside the probe will fluoresce and not interfere with hybridisation
23
Physics of Hybridization
Naef and Magnasco (PRE 2003)
The difference in intensity between the PM and MM
is sensitive to the choice of the central base of
the probe!
24
Physics of hybridization

So... there is a lot of physics to consider:
GC content (3 bonds vs. 2 bonds)
probe sequence composition (stacking energies)
labelling
nature of the 13th base
In order to simplify matters, there have been several attempts to generate simple mathematical models which incorporate the key physics.
The parameters in the models are then fitted using the data from many chips.

Naef and Magnasco (PRE 2003)
25
Physics of hybridization
Naef and Magnasco (PRE 2003)
The model contains only position-specific affinities for each base (fitted using 80 chips)
A low-order function can be fitted to the hybridisation for a given base at a given position. The total hybridisation for the 25-base sequence is then the sum of the local hybridisations.
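The "sum of local hybridisations" idea can be sketched as a lookup-and-add over the 25 positions of a probe. Everything concrete here is an assumption made for illustration: the function name `probe_affinity`, the layout of the `weights` table, and especially the weight values, which are toy numbers and not the fitted parameters from the paper.

```python
def probe_affinity(sequence, weights):
    """Sum position-specific contributions over a probe sequence.

    weights[base][k] is the contribution of `base` at position k
    (in the model these would be fitted from many chips).
    """
    return sum(weights[base][k] for k, base in enumerate(sequence))

# Toy weights: constant per base and independent of position (NOT fitted values).
toy_weights = {
    "A": [1] * 25,
    "C": [2] * 25,
    "G": [3] * 25,
    "T": [4] * 25,
}
toy_probe = "ACGT" + "A" * 21  # made-up 25-mer
toy_affinity = probe_affinity(toy_probe, toy_weights)
```

In a fitted model each row of the table would vary smoothly with position, for example as a low-order polynomial in k, rather than being constant as in this toy table.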
26
Physics of hybridization: GCRMA
Wu and Irizarry report spike-in yeast controls on a human chip.
This measures non-specific hybridisation directly
Many unchanging genes do not express!
Theory is comparable to experiment
27
Physics of hybridization: GCRMA
Wu and Irizarry used Naef's phenomenological model of probe 'affinities' to deal with non-specific hybridization, giving a new algorithm: GCRMA (2004)
[Figure: RMA vs. GCRMA]
28
Physics of hybridization
29
Basic Microarray Experiment
Low level analysis
High level analysis
30
High level analysis

After data processing, the data is in the form of a data matrix
samples: Ns ~ 10
genes: Ng ~ 10000
Genes that do not change vs. genes that change: differential expression
31
Differential Expression

Detect genes that are expressed at significantly different levels in one sample compared to another
Identify list of genes that act like markers
between different samples
Infer transcriptional networks from timecourse
experiments

32
Differential expression: fold change

FC = Experiment/Control
In the beginning:
FC > 2: upregulation
FC < 0.5: downregulation
Why 2? Intensity-dependent cutoff!
More elaborate intensity-dependent methods were developed.

[Figure: intensity-dependent variation]
33
Differential expression: t-test

If t is higher than a certain threshold, the difference between X and Y can be said to be significant
Under the null hypothesis, t follows a Student distribution
The p-value tells us how probable it is to find a higher t value by chance if X and Y came from the same distribution
List of candidate genes: those with a small p-value
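The meaning of the p-value can be made concrete with a small sketch. Instead of looking t up in a Student distribution, the code below swaps in an exhaustive permutation test: it recomputes t for every relabelling of the pooled measurements and reports how often a value at least as extreme as the observed one appears. The Welch form of the statistic, the function names, and the toy numbers are assumptions; the approach is only practical for small groups and fails if a relabelled group has zero variance.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """t statistic for two groups with possibly unequal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_pvalue(x, y):
    """Two-sided p-value: fraction of relabellings with |t| >= observed |t|."""
    pooled = list(x) + list(y)
    t_obs = abs(welch_t(x, y))
    n_total = 0
    hits = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        xs = [pooled[i] for i in idx]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_total += 1
        if abs(welch_t(xs, ys)) >= t_obs - 1e-12:
            hits += 1
    return hits / n_total

# Toy log-expression values for one gene (made-up numbers).
control = [1.0, 1.2, 0.9, 1.1]
treated = [3.0, 3.3, 2.9, 3.1]
p_value = permutation_pvalue(control, treated)
```

With thousands of genes tested at once, a cutoff on such p-values still has to be corrected for multiple testing.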

34
Differential expression: other methods of gene selection

ANOVA
Fisher criterion score
Entropy measure (information theory)
χ² measure
Information gain, information gain ratio
Correlation-based feature selection
Principal Component Analysis (PCA)
Linear models, Bayesian estimates
Etc.

35
High level analysis

After all this physics and statistics... a rather long list of DE genes
What's next?
Select some genes for validation (e.g. by qRT-PCR)
Do follow-up experiments on some genes
Try to learn about all the genes on the list... (read 500 papers)
Try to publish a huge table with all the results (not any more).
....
Be careful: you know too much!
Do the genes identified by statistics make biological sense?

36
Pattern recognition

Let the data speak.
Find structure in the data that correlates with or explains some biological behavior (implicit assumption: there is an underlying pattern within the data!)
Which experimental conditions have similar effects across a set of genes? (disease markers, cancer subgroup discovery, etc.)
Which genes behave similarly across experiments? (gene networks, etc.) Groups of genes with similar expression profiles are co-expressed genes

37
Clustering: biological relevance

Genes that are co-expressed are often coordinately regulated via a common mechanism (co-regulated)
Clustering becomes a way of identifying sets of genes that are putatively co-regulated and share similar functionality (guilt by association)

38
Clustering

Clustering: given N data points X_i, i = 1,...,N, embedded in D dimensions, find the partition of the N points into M clusters such that points of the same cluster are 'more similar' to each other than two points that belong to different clusters
A cluster is a group of genes (or experiments) with similar expression profiles
It is an unsupervised procedure: once the notions of distance and neighborhood are given, no previous knowledge is used to find the grouping.
Several methods:
Hierarchical clustering (hierarchical method)
K-means (partitioning method)
Superparamagnetic clustering
More

39
Clustering: what is similar?
[Figure: expression profiles of Gene A to Gene F across experiments]
40
Clustering: what is similar?

How can we quantify the notion of similarity?
Think of genes (or samples) as vectors
Vector distance measurements:
Euclidean distance
Manhattan distance
Pearson correlation
Spearman's rank correlation
Mutual information
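Treating each gene as a vector of expression values, the first four measures in the list can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the rank helper assumes there are no tied values, and mutual information is left out because it needs a binning choice.

```python
from math import sqrt
from statistics import mean

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(x):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Note that the two correlations are similarities rather than distances; a common convention is to cluster on 1 minus the correlation, so that profiles with the same shape but different amplitude end up close together.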

41
Clustering: what is similar?
42
Clustering: hierarchical clustering
Grouping by similarity between samples

Assumption: the objects to be clustered have a hierarchical character.
Works by successively grouping similar pairs of genes (or samples)
The two genes with the most similar profiles are grouped into a single node.
This procedure continues until every gene has been placed in a node.

[Figure: heat map with dendrogram; over-expressed vs. under-expressed genes; samples ordered 1 5 2 9 11 3 4 6 10 7 8]
43
Clustering: hierarchical clustering
44
Clustering: hierarchical clustering

[Figure: distance matrix, color scale from high similarity to low similarity]
45
Clustering: hierarchical clustering

Join A and F. Recalculate the distance matrix.
Distance to a cluster:
Single linkage
Average linkage
Complete linkage
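The join-and-recalculate loop and the three linkage rules can be sketched as follows. This is a naive illustration, not an efficient implementation: the function names are hypothetical, clusters are tuples of item indices, and the distance between two clusters is recomputed from the original pairwise distance matrix at every step.

```python
from itertools import combinations

LINKAGE = {
    "single": min,                             # closest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda d: sum(d) / len(d),      # mean over all member pairs
}

def cluster_distance(a, b, dist, linkage):
    """Distance between clusters a and b (tuples of item indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    return LINKAGE[linkage](pairwise)

def hierarchical_clustering(dist, linkage="average"):
    """Repeatedly join the two closest clusters; return the list of merges."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist, linkage),
        )
        merges.append((a, b, cluster_distance(a, b, dist, linkage)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy example: four items on a line at positions 0, 1, 5 and 6.
positions = [0, 1, 5, 6]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
for name in LINKAGE:
    print(name, hierarchical_clustering(toy_dist, name)[-1])
```

The early merges agree across the three rules in this toy case; only the height of the final join changes, which is what shifts where a cut through the tree separates clusters.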

46
Clustering: hierarchical clustering
47
Clustering: hierarchical clustering
48
Clustering: hierarchical clustering
49
Clustering: hierarchical clustering
50
Clustering: hierarchical clustering

Pros
Useful to provide a view of the data structure and similarities.
It is simple.
It is colorful. It is part of most microarray studies.
Cons
Be cautious! Anything will cluster. Even random data!
Clusters are somewhat arbitrary, depending on how the tree is cut.
What is a good cluster?

51
Clustering: superparamagnetic clustering

SPC: mapping of the problem at hand to a Potts model (Wiseman, Blatt, Domany PRL 1996).

N spins s_i in {1, 2, ..., q}
D_ij (distance between points i and j)
52
Example: transcriptional profile studies
[Figure labels: testosterone; doping X]

No detectable changes in muscle but... the transcriptional profile does change in prostate!!

53
Example: transcriptional profile studies

Letrozole inhibits estrogen synthesis (chemical castration)
After treatment, when does it happen?

54
Example: yeast cell cycle transcriptional program
Cyclins: Cln1-3, Clb1-6
Cdk: Cdc28

Cdc28/cyclin dimers accomplish specific and different tasks
Proper progression through the cell cycle requires the successive activation and inactivation of these Cdc28/cyclin dimers.

55
Example yeast cell cycle transcriptional program
56
Example: leukemia classification

Golub et al., Science (1999)
Cancer classification applied to acute leukemias
Acute Myeloid Leukemia vs. Acute Lymphoblastic Leukemia
Class prediction: assignment of samples to already defined classes

Success: 80 (training set), 70 (independent set)
57
Example: metastasis vs. primary tumors

Ramaswamy et al. (Nature Genetics, 2003): is the gene expression program of metastasis already present in primary tumors?
Used a weighted voting algorithm to find the 128 genes that best distinguish primary tumours from metastases
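A weighted voting classifier in the spirit of the one described by Golub et al. (1999) can be sketched as below: each gene votes for a class in proportion to a signal-to-noise weight and to how far the sample lies from the midpoint between the two class means. The function names and the toy expression values are assumptions, and the sketch omits gene selection and the prediction-strength threshold used in practice.

```python
from statistics import mean, stdev

def train_weighted_voting(class1, class2):
    """Per-gene (weight, boundary) from two lists of samples.

    Each sample is a list of expression values, one per gene.
    weight   = (mean1 - mean2) / (sd1 + sd2)   signal-to-noise ratio
    boundary = (mean1 + mean2) / 2             midpoint between class means
    """
    params = []
    for g in range(len(class1[0])):
        x1 = [sample[g] for sample in class1]
        x2 = [sample[g] for sample in class2]
        weight = (mean(x1) - mean(x2)) / (stdev(x1) + stdev(x2))
        boundary = (mean(x1) + mean(x2)) / 2
        params.append((weight, boundary))
    return params

def predict(params, sample):
    """Sum the votes of all genes; a positive total means class 1."""
    total = sum(w * (x - b) for (w, b), x in zip(params, sample))
    return 1 if total > 0 else 2

# Toy training data with two genes (made-up numbers).
primary = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
metastasis = [[1.0, 5.0], [2.0, 6.0], [1.5, 7.0]]
model = train_weighted_voting(primary, metastasis)
```

Genes whose class means are far apart relative to their spread get large weights, so a handful of informative genes can dominate the vote.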

58
Example: metastasis vs. primary tumors

Horizontal bar: red = recurrent, black = non-recurrent
Vertical bar: red = originally primary, black = originally metastasis
Hierarchical clustering in the space of the 128 genes identifies two main clusters of primary tumors, highly correlated to the original primary-tumor vs. metastasis distinction

59
PCA analysis
60
Topomap
A Gene Expression Map for Caenorhabditis elegans (Kim et al., Science 2001)
Data from C. elegans DNA microarray experiments
(many growth conditions, developmental stages,
and varieties of mutants). Co-regulated genes
were grouped together and visualized in a
three-dimensional expression map that displays
correlations of gene expression profiles as
distances in two dimensions and gene density in
the third dimension.
61
Topomap
The gene expression map can be used as a gene
discovery tool to identify genes that are
co-regulated with known sets of genes (such as
heat shock, growth control genes, germ line
genes, and so forth) or to uncover previously
unknown genetic functions (such as genomic
instability in males and sperm caused by specific
transposons).
62
Gene Recommender
Genes involved in the Retinoblastoma complex in C. elegans: lin-9, lin-35, lin-46, lin-53, hda-1
Are there more? Can they be found from data on 19738 genes and 500 experiments?
Eggs, larvae, adults, heat shock, other stresses, mutants, various labs
Some experiments are irrelevant to Rb: they will add noise!
Algorithm for ranking genes according to how strongly they correlate, across a LARGE number of microarray experiments, with a set of query genes already known to have closely related function, in those experiments for which the query genes are most strongly correlated.
63
Gene Recommender

Google Sets: labs.google.com/sets
Automatically create sets of items from a few
examples.

64
Plan of the talk

Central dogma of molecular biology
Genechip Microarrays
Low level analysis (affy preprocessing algorithms)
High level analysis
Differential expression
Clustering
Classification
Examples

65
Things I did not mention
Quality control standards
Two-color arrays
Langmuir adsorption model
Experimental design
Differential expression (let's get serious)
Multiple hypothesis testing problem (get used to the unusual)
Systems biology (graph theory, scale-free networks, dynamical systems modeling)
Annotation (Gene Ontology, KEGG pathway database, NCBI, ...)
Highly interdisciplinary experience.
Low level analysis
High level analysis
66
Thank you
"Sir! They have cracked the code of the human genome!"
"Damn hackers! I will have to change the password!"

68
Clustering: what is similar?

Vector distance measurements:
Euclidean distance
Manhattan distance
Pearson correlation
Spearman's rank correlation: Pearson correlation of ranks
Mutual information: amount of information gained about X when Y is learned

69
Affymetrix workflow
70
Affymetrix workflow
71
Affymetrix workflow
72
Affymetrix Genechip printing

Oligonucleotide arrays are printed using a combination of photolithography and combinatorial chemistry
Quartz buffer
Blocking compound (can be removed by light exposure)
Mask (18-20 µ2 windows)

73
Affy: on-chip hybridization I
74
Affy: on-chip hybridization II
Very long cRNAs reduce hybridization efficiency (folding of the molecule)
(Southern E. et al., Nature Genetics, Vol. 21, 10-14, 1999)
75
Affy On chip hybridization III
76
Affy preprocessing algorithms: RMA

RMA considers log2-transformed intensities
log(intensities) ~ Normal (noise is additive)
The log has a normalizing effect on calculated fold changes
With raw ratios, a 10-fold up-regulation lies at distance 9 from the random expectation (1).
A 10-fold down-regulation lies at distance 0.9 from the random expectation (1).
Raw ratios will emphasize up-regulation and ignore down-regulation.
E.g. 4-fold up-regulation: FC = 4, log2(FC) = 2
4-fold down-regulation: FC = 0.25, log2(FC) = -2
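The symmetry argument is easy to check numerically. The sketch below uses a hypothetical helper, `log2_fold_change`, and compares the raw distance from the no-change ratio of 1 with the log2 value for up- and down-regulation of the same magnitude.

```python
from math import log2

def log2_fold_change(experiment, control):
    """log2 of the experiment/control ratio; 0 means no change."""
    return log2(experiment / control)

for experiment, control in [(4.0, 1.0), (1.0, 4.0), (10.0, 1.0), (1.0, 10.0)]:
    ratio = experiment / control
    # Raw distance from 1 is lopsided; the log2 value is symmetric around 0.
    print(ratio, abs(ratio - 1), log2_fold_change(experiment, control))
```

On the log scale, up- and down-regulation by the same factor differ only in sign, so a single symmetric cutoff treats both directions equally.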

77
Affy preprocessing algorithms: RMA background subtraction

Adjust for global background (convolution model)
O = N + S (observed intensity = noise + signal)
Find P(S | O, N) and then E(S | O, N)
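To make the convolution model concrete, the sketch below assumes an exponentially distributed signal with rate `alpha` and normally distributed noise with mean `mu` and standard deviation `sigma`, all taken as known. These distributional choices and parameter names are assumptions of this example, not something the slide states. Under them, the signal given the observed intensity follows a normal distribution truncated at zero, whose mean has a closed form; the formula published for RMA may differ in detail.

```python
from math import erfc, exp, pi, sqrt

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def normal_cdf(z):
    # erfc keeps precision in the lower tail, where 1 + erf(...) would cancel.
    return 0.5 * erfc(-z / sqrt(2))

def expected_signal(o, mu, sigma, alpha):
    """E(S | O = o) for S ~ Exponential(alpha), N ~ Normal(mu, sigma^2), O = N + S.

    The posterior of S is Normal(a, sigma^2) truncated to S > 0,
    with a = o - mu - sigma^2 * alpha.
    Note: for a / sigma far below about -35 the CDF underflows to zero.
    """
    a = o - mu - sigma ** 2 * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Illustrative parameter values (made up): background around 100 with sd 10.
bright = expected_signal(1000.0, mu=100.0, sigma=10.0, alpha=0.01)
dim = expected_signal(50.0, mu=100.0, sigma=10.0, alpha=0.01)
```

Unlike plain subtraction of a background estimate, the corrected value stays positive even when the observed intensity is below the mean background, so the log transform remains defined.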

78
Affy preprocessing algorithms: RMA normalization

Quantile normalization
Transforms the data, quantile by quantile, to make arrays have the same distribution of data
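A minimal sketch of quantile normalization: sort each array, average the sorted values rank by rank to get a reference distribution, then give every probe the reference value of its rank. The function name and the toy intensities are assumptions, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Force every array (list of intensities) onto the same distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays of the k-th smallest value.
    reference = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # probe i receives the value of its rank
        normalized.append(out)
    return normalized

# Toy data: two arrays with three probes each (made-up intensities).
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
norm = quantile_normalize(raw)
```

After the transformation every array contains exactly the same set of values; only the assignment of values to probes differs, which preserves each array's internal ranking.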

79
Affy preprocessing algorithms: RMA normalization

After quantile normalization

80
Affy preprocessing algorithms: RMA background subtraction

E(S | PM)

81
Affy preprocessing algorithms: RMA normalization

Before normalization

82
Affy RMA: an example

Spike-in experiment
14 human genes were spiked in at concentrations ranging from 0 to 1024 pM (0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pM)

[Table: chips (chip1 to chip4) by spiked-in genes (Gen1, Gen2, Gen3, ..., Gen4)]
87
Clustering: hierarchical clustering

The resulting figure is known as a hierarchical
tree.
Different number of clusters depending on how
deep we look.

88
Clustering: hierarchical clustering

In each step, the order of the two groups being joined is arbitrary

89
Clustering: hierarchical clustering

What is a good cluster? Bootstrapping
The data is resampled (some experiments taken out randomly and replaced by copies of other experiments)
The whole clustering is repeated.
Clusters that appear often are statistically more reliable than others.
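The resampling loop can be sketched independently of the clustering method. In the code below, `cluster_fn` is a hypothetical placeholder for any procedure that takes a genes-by-experiments matrix and returns one cluster label per gene; the function name, the number of resamples, and the toy data are likewise assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_coclustering(data, cluster_fn, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each pair of genes co-clusters.

    data: rows are genes, columns are experiments.
    cluster_fn(matrix) -> list with one cluster label per gene.
    """
    rng = random.Random(seed)
    n_exp = len(data[0])
    pairs = list(combinations(range(len(data)), 2))
    counts = Counter()
    for _ in range(n_boot):
        # Draw experiments with replacement: some drop out, others are duplicated.
        cols = [rng.randrange(n_exp) for _ in range(n_exp)]
        resampled = [[row[c] for c in cols] for row in data]
        labels = cluster_fn(resampled)
        for i, j in pairs:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: counts[pair] / n_boot for pair in pairs}

# Toy stand-in for a clustering method: split genes by mean expression level.
def toy_cluster_fn(matrix):
    return [int(sum(row) / len(row) > 5) for row in matrix]

toy_data = [[1, 2, 1, 2], [2, 1, 2, 1], [9, 8, 9, 9], [8, 9, 9, 8]]
stability = bootstrap_coclustering(toy_data, toy_cluster_fn)
```

Pairs with a frequency close to 1 belong to groupings that survive resampling; pairs that co-cluster only occasionally point to clusters that depend on a few particular experiments.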

90
Clustering: K-means