1
From Physics to Functional Genomics via
DNA Chips
Dr. Ariel Chernomoretz, Depto. de Física, FCEN - UBA
2
Plan of the talk
  • Central dogma of molecular biology
  • Genechip Microarrays
  • Low level analysis (affy preprocessing
    algorithms)
  • High level analysis
  • Differential expression
  • Clustering
  • Classification
  • Applications

3
Microarrays: what for?
Central Dogma of Molecular Biology: from
chromosomes to protein
DNA -> (transcription) -> mRNA -> (translation) -> protein
  • Different types and states of cells (blood,
    muscle, skin, cancerous cells, dividing cells,
    etc.) exist due to differential gene expression
    (when, where, and how much a gene is expressed).

4
Microarrays: what for?
5
Microarrays: conceptual limitation
  • WARNING: post-transcriptional regulation!

6
Microarrays: Basic scheme
  • Each microarray contains thousands of probes,
    either
  • cDNA fragments corresponding to each gene, or
  • short oligonucleotide sequences
  • By hybridizing labeled mRNA or cDNA from a sample
    to a microarray, transcripts from all expressed
    genes can be assayed simultaneously

7
Microarray technologies
  • Affymetrix Genechip
  • Oligonucleotides (25mers) synthesized on a glass
    slide (photolithography).
  • Two-color chips (Brown's Lab at Princeton)
  • cDNA or oligonucleotides spotted on glass or
    nylon membranes

8
Affymetrix workflow
9
Basic Microarray Experiment
Low level analysis
High level analysis
10
Affymetrix Genechip platform
11
Affymetrix Genechip platform
  • Each measured element (e.g. a gene) is
    represented by a probe-set of 11-20 pairs of
    probes
  • Perfect match (PM) and mismatch (MM) probes
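As a small illustration of the PM/MM pairing, the sketch below builds a mismatch probe from a perfect-match probe. It assumes, as the later "Size matters" slides describe, that the MM probe differs from the PM probe only at the central (13th) base, which is replaced by its complement. The 25-mer sequence is made up for the example.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def make_mismatch(pm_probe: str) -> str:
    """Return the MM probe: the PM sequence with its central base complemented."""
    middle = len(pm_probe) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm_probe[:middle] + COMPLEMENT[pm_probe[middle]] + pm_probe[middle + 1:]

pm = "ACGTACGTACGTTACGTACGTACGT"  # hypothetical 25-mer whose 13th base is T
mm = make_mismatch(pm)            # identical except that the 13th base is A
```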

12
Affymetrix Genechip platform
- Scanner - Hybridization landmarks
- Automatic gridding
13
Affymetrix Genechip platform: Probeset typical
behavior
14
Affymetrix Genechip platform: Data preprocessing
  • Probe intensity ~ mRNA abundance
  • But first... some data massage!
  • 1) Noise: background and cross-hybridization
  • 2) Non-biological variation: normalization, to be
    able to compare data between samples
  • 3) Probeset expression index estimation
  • To perform 1), 2), and 3): MAS 5.0, RMA,
    GCRMA, MBEI, etc.

15
Affy Preprocessing Algos: RMA
  • RMA: Robust Multiarray Average algorithm
  • Considers only PM
  • Subtracting MM adds noise and increases variability
  • MMs do detect signal
  • Sequence-specific effects

$\log(PM) = S + X + \varepsilon_1$, $\log(MM) = X + \varepsilon_2$
$\Rightarrow \log(PM) - \log(MM) = S + \varepsilon_1 - \varepsilon_2$
16
Affy Preprocessing Algos: RMA - Summarization
  • Multichip method
  • For each probeset, fit a linear model to the
    results, accounting for chip and probe effects,
    using a robust method
  • $\log(PM_{ij}) = c_i + p_j + \varepsilon_{ij}$ (i: chip, j: probe)
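One robust way to fit an additive chip-plus-probe model of this kind is Tukey's median polish, which is the fitting procedure commonly associated with RMA summarization. The sketch below is a minimal pure-Python version for a single probeset. The function name, the iteration count and the toy data are assumptions made for the example.

```python
from statistics import median

def median_polish(log_pm, n_iter=10):
    """Fit log_pm[i][j] ~ overall + chip[i] + probe[j] (i: chip, j: probe)."""
    n_chips, n_probes = len(log_pm), len(log_pm[0])
    resid = [row[:] for row in log_pm]
    overall, chip, probe = 0.0, [0.0] * n_chips, [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep the row (chip) medians out of the residuals.
        for i in range(n_chips):
            m = median(resid[i])
            chip[i] += m
            resid[i] = [r - m for r in resid[i]]
        m = median(probe)
        overall += m
        probe = [p - m for p in probe]
        # Sweep the column (probe) medians out of the residuals.
        for j in range(n_probes):
            m = median(resid[i][j] for i in range(n_chips))
            probe[j] += m
            for i in range(n_chips):
                resid[i][j] -= m
        m = median(chip)
        overall += m
        chip = [c - m for c in chip]
    return overall, chip, probe, resid

# Toy probeset: 3 chips x 5 probes, built to be exactly additive.
chip_effects = [8.0, 9.0, 10.0]
probe_effects = [0.0, 1.0, -1.0, 0.5, -0.5]
log_pm = [[c + p for p in probe_effects] for c in chip_effects]

overall, chip, probe, resid = median_polish(log_pm)
expression = [overall + c for c in chip]  # one expression value per chip
```

The per-chip expression index is the overall level plus the chip effect. Medians are used instead of means so that a few outlying probes do not dominate the fit.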

17
Affy RMA
Calibration dataset (spike-in experiment):
transcripts spiked at controlled concentrations
There is a need for a model of the physics of
hybridisation
(Naef and Magnasco 2003)
18
Physics of hybridization I: GC content
AT base pairs have two hydrogen bonds. GC base pairs
have three hydrogen bonds
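A quick way to see the effect of GC content is to count the Watson-Crick hydrogen bonds a probe can form with its perfectly complementary target: two per A or T, three per G or C. This is only a crude proxy for duplex stability, and the function name is made up for the example.

```python
def hydrogen_bonds(probe: str) -> int:
    """Count Watson-Crick hydrogen bonds for a probe bound to its exact complement."""
    at = probe.count("A") + probe.count("T")  # 2 bonds per AT pair
    gc = probe.count("G") + probe.count("C")  # 3 bonds per GC pair
    return 2 * at + 3 * gc

at_rich = hydrogen_bonds("ATATATAT")
gc_rich = hydrogen_bonds("GCGCGCGC")
```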
19
Physics of hybridization II: Sequence order matters
H-bond interactions between adjacent bases
Nearest-neighbour interactions predict duplex
kinetics, and so sequence order is important
(SantaLucia)
The binding energy of GAC is not the same as that of CAG
20
Phys. of hyb. III: Size matters
Pyrimidines (C, T) are small; purines are large
e.g. if base 13 of the perfect match is T, then base 13
of the mismatch is A, and the complementary base in
the mRNA is also A
There will be a large steric hindrance between
the purine in the mismatch and the purine in the
mRNA of interest.
[Figure: 13th base of the PM and MM probes facing the target]
21
Phys. of hyb. III: Size matters
Pyrimidines (C, T) are small; purines are large
e.g. if base 13 of the perfect match is A, then base 13
of the mismatch is T, and the complementary base in
the mRNA is also T/U
There will be no steric hindrance between the
pyrimidine in the mismatch and the pyrimidine in
the mRNA of interest.
[Figure: 13th base of the MM and PM probes facing the target]
22
Physics of hybridization IV: Labelling matters
C and T (pyrimidines) are labelled. So GC binds
less strongly than CG, and AT binding is weaker
than TA.
If the probe contains no C or T, it will hybridise
well but with no fluorescence. If it is all C and
T, it will have difficulty hybridising.
C and T within your mRNA fragment but immediately
outside your probe will fluoresce and not
interfere with hybridisation
23
Physics of Hybridization
Naef and Magnasco (PRE 2003)
The difference in intensity between the PM and MM
is sensitive to the choice of the central base of
the probe!
24
Physics of hybridization
  • So... there is a lot of physics to consider:
  • GC content (3 bonds vs 2 bonds)
  • probe sequence composition (stacking energies)
  • labelling
  • nature of the 13th base
  • In order to simplify matters, there have been
    several attempts to generate simple mathematical
    models which incorporate the key physics.
  • The parameters in the models are then fitted
    using the data from many chips.

Naef and Magnasco (PRE 2003)
25
Physics of hybridization
Naef and Magnasco (PRE 2003)
The model contains only position-specific
affinities for each base (fitted using 80 chips)
A low order function can be fitted to the
hybridisation for a given base at a given
position. The total hybridisation for the 25 base
sequence is then the sum of the local
hybridisations.
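The additive structure of such a model can be sketched as follows: each base contributes a weight that depends on the base and on its position, the position dependence is a low-order polynomial, and the probe affinity is the sum over the 25 positions. The coefficients below are hypothetical placeholders, not fitted values, and the function names are made up for the example.

```python
# Hypothetical quadratic coefficients (c0, c1, c2) per base. NOT fitted values.
COEFFS = {
    "A": (-0.1, 0.0, 0.0),
    "C": (0.2, 0.1, -0.1),
    "G": (0.1, 0.0, 0.0),
    "T": (-0.2, -0.1, 0.1),
}

def position_weight(base: str, k: int, length: int = 25) -> float:
    """Contribution of `base` at 0-based position k, as a low-order polynomial."""
    x = 2.0 * k / (length - 1) - 1.0  # rescale the position to [-1, 1]
    c0, c1, c2 = COEFFS[base]
    return c0 + c1 * x + c2 * x * x

def probe_affinity(probe: str) -> float:
    """Total (log) affinity: the sum of the local, position-specific contributions."""
    return sum(position_weight(b, k, len(probe)) for k, b in enumerate(probe))

all_a = probe_affinity("A" * 25)
c_at_end = probe_affinity("C" + "A" * 24)
c_in_middle = probe_affinity("A" * 12 + "C" + "A" * 12)
```

With these placeholder numbers a C in the middle of the probe contributes more than a C at the end, which is the kind of position dependence the fitted curves are meant to capture.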
26
Physics of hybridization: GCRMA
Wu and Irizarry report spike-in yeast controls on
a human chip.
This measures non-specific hybridisation directly
Many unchanging genes do not express!
Theory is comparable to experiment
27
Physics of hybridization: GCRMA
Wu and Irizarry used Naef's phenomenological
model of probe 'affinities' to deal with
non-specific hybridization: new algorithm,
GCRMA (2004)
[Figure: RMA vs GCRMA]
28
Physics of hybridization
29
Basic Microarray Experiment
Low level analysis
High level analysis
30
High level analysis
  • After data processing, the data is in the form of a
    data matrix
  • samples: Ns ~ 10
  • genes: Ng ~ 10000
  • Genes that do not change vs genes that change:
    differential expression

31
Differential Expression
  • Detect genes that are expressed at significantly
    different levels in one sample compared to another
  • Identify list of genes that act like markers
    between different samples
  • Infer transcriptional networks from timecourse
    experiments

32
Differential expression: Fold Change
  • FC = Experiment/Control
  • In the beginning:
  • FC > 2: upregulation
  • FC < 1/2: downregulation
  • Why 2? Intensity-dependent cutoff!
  • More elaborate intensity-dependent methods were
    developed.

Intensity dependent variation
33
Differential Expression: t-test
  • If t is higher than a certain threshold, the
    difference between X and Y can be said to be
    significant
  • Under the null hypothesis, t follows a Student
    distribution
  • The p-value tells us how probable it is to find a
    higher t value by chance if X and Y came from the
    same distribution
  • List of candidate genes: those with a small p-value
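For a single gene, the t statistic can be computed as in the sketch below. It assumes the classical two-sample Student form with a pooled variance (equal variances in both groups). The slide does not say which variant is meant, and the expression values are made up.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(x, y):
    """Two-sample Student t statistic, assuming equal variances in both groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))

control = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up log-expression values
treatment = [2.0, 3.0, 4.0, 5.0, 6.0]
t = t_statistic(control, treatment)
dof = len(control) + len(treatment) - 2  # degrees of freedom of the Student distribution
```

The p-value then comes from the Student distribution with `dof` degrees of freedom. In practice a library routine such as `scipy.stats.ttest_ind` returns both the statistic and the p-value.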

34
Differential Expression: Other methods of gene
selection
  • ANOVA
  • Fisher criterion score
  • Entropy measure (information theory)
  • χ² measure
  • Information gain, information gain ratio
  • Correlation-based feature selection
  • Principal Component Analysis (PCA)
  • Linear models, Bayesian estimates
  • Etc.

35
High level analysis
  • After all this physics and statistics... a rather
    long list of DE genes
  • What's next?
  • Select some genes for validation (e.g. by qRT-PCR)
  • Do follow-up experiments on some genes
  • Try to learn about all the genes on the
    list... (read 500 papers)
  • Try to publish a huge table with all the results
    (not any more).
  • ....
  • Be careful: you know too much!
  • Do the genes identified by statistics make
    biological sense?

36
Pattern recognition
  • Let the data speak.
  • Find structure in the data that correlates with or
    explains some biological behavior (implicit assumption:
    there is an underlying pattern within the data!)
  • Which experimental conditions have similar
    effects across a set of genes? (disease markers,
    cancer subgroup discovery, etc.)
  • Which genes behave similarly across experiments?
    (gene networks, etc.) Groups of genes with
    similar expression profiles are co-expressed
    genes

37
Clustering: Biological Relevance
  • Genes that are co-expressed are often
    coordinately regulated via a common mechanism
    (co-regulated)
  • Clustering becomes a way of identifying sets of
    genes that are putatively co-regulated and share
    similar functionality (guilt by association)

38
Clustering
  • Clustering: given N data points $X_i$, $i = 1, \dots, N$,
    embedded in D dimensions, find the partition of the N points
    into M clusters such that points of the same
    cluster are 'more similar' to each other than
    two points that belong to different clusters
  • A cluster is a group of genes (experiments) with
    similar expression profiles
  • It is an unsupervised procedure. Once the notions
    of distance and neighborhood are given, no
    previous knowledge is used to find the grouping.
  • Several methods:
  • Hierarchical clustering (hierarchical method)
  • K-means (partitioning method)
  • Superparamagnetic clustering
  • More

39
Clustering: What is similar?
[Figure: expression profiles across experiments for Gene A, Gene B, Gene C, Gene D, Gene E, Gene F]
40
Clustering: What is similar?
  • How can we quantify the notion of similarity?
  • Think of genes (or samples) as vectors
  • Vector distance measurements:
  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
  • Spearman's rank correlation
  • Mutual information
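The first four measures in this list can be written in a few lines each, treating every gene as a vector of expression values across experiments. The sketch below is plain Python. The rank helper ignores ties, mutual information is left out because it needs a binning choice, and the profiles are made up.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(v):
    """Rank of each value, 1 = smallest (ties are not handled in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gene_a = [1.0, 2.0, 3.0, 4.0]    # made-up expression profiles
gene_b = [2.0, 4.0, 6.0, 8.0]    # same shape as gene_a, twice the amplitude
gene_c = [1.0, 4.0, 9.0, 16.0]   # monotonic in gene_a but not linear
```

Here `gene_a` and `gene_b` are far apart in Euclidean distance yet perfectly correlated, which is why the choice of similarity measure matters for clustering.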

41
Clustering: What is similar?
42
Clustering: Hierarchical clustering
Grouping according to similarity between
samples
  • Assumption: the objects to be clustered have a
    hierarchical character.
  • Works by successively grouping similar pairs of genes
    (or samples)
  • The two genes with the most similar profile are
    grouped into a single node.
  • This procedure continues until every gene has
    been placed in a node.

[Figure: over-expressed genes, under-expressed genes; samples ordered 1 5 2 9 11 3 4 6 10 7 8]
43
Clustering: Hierarchical clustering
44
Clustering: Hierarchical clustering
  • Distance matrix

High similarity
Low similarity
45
Clustering: Hierarchical clustering
  • Join A and F. Recalculate distance matrix.
  • Distance to a cluster:
  • Single linkage
  • Average linkage
  • Complete linkage
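The join-and-recalculate loop and the three linkage rules can be sketched as below. This is a naive version that recomputes every cluster distance at each step, using Euclidean distance between profiles. The function name and the toy 2-D "profiles" are assumptions for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerate(points, linkage="average"):
    """Return the merges, in order, as (cluster_1, cluster_2, distance) tuples."""
    clusters = [(name,) for name in points]
    merges = []

    def cluster_distance(c1, c2):
        d = [dist(points[a], points[b]) for a in c1 for b in c2]
        if linkage == "single":      # closest pair of members
            return min(d)
        if linkage == "complete":    # farthest pair of members
            return max(d)
        return sum(d) / len(d)       # average over all pairs

    while len(clusters) > 1:
        # Find the two most similar clusters and join them.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy data: A and F are close to each other, as are B and C.
points = {"A": (0.0, 0.0), "B": (5.0, 5.0), "C": (5.0, 7.0), "F": (0.0, 1.0)}
merges = agglomerate(points, linkage="average")
```

The sequence of merges, together with the distance at which each one happened, is exactly the information drawn as a hierarchical tree.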

46
Clustering: Hierarchical clustering
47
Clustering: Hierarchical clustering
48
Clustering: Hierarchical clustering
49
Clustering: Hierarchical clustering
50
Clustering: Hierarchical clustering
  • Pros
  • Useful to provide a view of the data structure and
    similarities.
  • It is simple.
  • It is colorful. It is part of most microarray
    studies.
  • Cons
  • Be cautious! Anything will cluster. Even random
    data!
  • Clusters are kind of arbitrary, depending on how
    the tree is cut.
  • What is a good cluster?

51
Clustering: Superparamagnetic Clustering
  • SPC: mapping of the problem at hand to a Potts
    model (Wiseman, Blatt, Domany PRL 1996).

N spins $s_i \in \{1, 2, \dots, q\}$
$D_{ij}$
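For orientation, a q-state Potts Hamiltonian of the kind such a mapping relies on can be written as below. This is a sketch under stated assumptions: each data point carries one spin, and the coupling $J_{ij}$ is taken to be some decreasing function $f$ of the distance $D_{ij}$ between points i and j. The exact form of $f$ is not given on the slide.

```latex
H(\{s\}) = \sum_{\langle i,j \rangle} J_{ij}\,\bigl(1 - \delta_{s_i, s_j}\bigr),
\qquad s_i \in \{1, 2, \dots, q\},
\qquad J_{ij} = f(D_{ij}) \ \text{with } f \text{ decreasing}
```

Nearby points are strongly coupled and tend to share the same spin value, so groups of aligned spins play the role of clusters.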
52
Example: Transcriptional Profile Studies
testosterone
Doping X
  • No detectable changes in muscle but...
  • the transcriptional profile does change in prostate!!

53
Example: Transcriptional Profile Studies
  • Letrozole inhibits estrogen synthesis (chemical
    castration)
  • After treatment, when does it happen?

54
Example: yeast cell cycle transcriptional program
Cdk: Cdc28
Cyclins: Cln1-3, Clb1-6
  • Cdc28/cyclin dimers accomplish specific and different
    tasks
  • Proper progression through the cell cycle
    requires the successive activation and
    inactivation of these Cdc28/cyclin dimers.

55
Example: yeast cell cycle transcriptional program
56
Example: leukemia classification
  • Golub et al., Science (1999)
  • Cancer classification applied to acute leukemias:
    Acute Myeloid Leukemia vs Acute Lymphoblastic
    Leukemia
  • Class prediction: assignment of samples to
    already defined classes

Success: 80 (training set), 70 (independent
set)
57
Example: metastasis vs primary tumors
  • Ramaswamy et al. (Nature Genetics, 2003): is the gene
    expression program of metastasis already
    present in primary tumors?
  • Used a weighted voting algorithm to find genes
    that separate primary tumours from metastases
  • 128 genes best distinguished primary from
    metastatic tumours under the weighted voting
    algorithm

58
Example: metastasis vs primary tumors
  • Horizontal bar: red = recurrent, black =
    non-recurrent
  • Vertical bar: red = originally primary, black =
    originally metastasis
  • Hierarchical clustering in the space of the 128
    genes identifies two main clusters of primary
    tumors, highly correlated to the original
    primary-tumor vs. metastasis distinction

59
PCA analysis
60
Topomap
A Gene Expression Map for Caenorhabditis
elegans (Kim et al., Science 2001)
Data from C. elegans DNA microarray experiments
(many growth conditions, developmental stages,
and varieties of mutants). Co-regulated genes
were grouped together and visualized in a
three-dimensional expression map that displays
correlations of gene expression profiles as
distances in two dimensions and gene density in
the third dimension.
61
Topomap
The gene expression map can be used as a gene
discovery tool to identify genes that are
co-regulated with known sets of genes (such as
heat shock, growth control genes, germ line
genes, and so forth) or to uncover previously
unknown genetic functions (such as genomic
instability in males and sperm caused by specific
transposons).
62
Gene Recommender
Genes involved in the Retinoblastoma complex in
C. elegans: lin-9, lin-35, lin-46, lin-53, hda-1
Are there more? Can they be found from data on 19738
genes and 500 experiments?
Eggs, larvae, adults, heat shock, other stresses,
mutants, various labs
Some experiments are irrelevant to Rb. They will add noise!
Algorithm for ranking genes according to how
strongly they correlate, across a LARGE number
of microarray experiments, with a set of query
genes already known to have closely related
function, in those experiments for which the
query genes are most strongly correlated.
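The two-step idea (first pick the experiments in which the query genes move together, then rank all other genes by correlation with the query genes in those experiments only) can be illustrated with the simplified sketch below. It is not the published Gene Recommender scoring: the coherence score, the function names and the data are assumptions made for the example, and expression values are assumed to be normalized per gene.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def recommend(expr, query, n_keep):
    """Rank non-query genes by correlation with the query genes in selected experiments."""
    n_exp = len(expr[query[0]])

    def coherence(e):
        # High when the query genes take similar, strongly non-zero values.
        vals = [expr[g][e] for g in query]
        return abs(mean(vals)) / (pstdev(vals) + 1e-9)

    keep = sorted(range(n_exp), key=coherence, reverse=True)[:n_keep]
    centroid = [mean(expr[g][e] for g in query) for e in keep]
    scores = {
        g: pearson([v[e] for e in keep], centroid)
        for g, v in expr.items() if g not in query
    }
    return sorted(scores, key=scores.get, reverse=True)

# Made-up data: experiments 0-2 are informative, 3-4 are noise for the query genes.
expr = {
    "q1": [2.0, -2.0, 1.5, 0.1, -0.3],
    "q2": [2.1, -1.9, 1.4, -0.2, 0.4],
    "geneX": [1.9, -2.1, 1.6, 3.0, -3.0],
    "geneY": [-2.0, 2.0, -1.5, 0.0, 0.0],
}
ranking = recommend(expr, ["q1", "q2"], n_keep=3)
```

Dropping the uninformative experiments before computing correlations is what keeps them from adding noise to the ranking.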
63
Gene Recommender
  • Google Sets: labs.google.com/sets
  • Automatically create sets of items from a few
    examples.

64
Plan of the talk
  • Central dogma of molecular biology
  • Genechip Microarrays
  • Low level analysis (affy preprocessing
    algorithms)
  • High level analysis
  • Differential expression
  • Clustering
  • Classification
  • Examples

65
Things I did not mention
  • Quality control standards
  • Two-color arrays
  • Langmuir adsorption model
  • Experimental design
  • Differential expression (let's get serious)
  • Multiple hypothesis testing problem (get used to
    the unusual)
  • Systems biology (graph theory, scale-free networks,
    dynamical systems modeling)
  • Annotation (Gene Ontology, KEGG pathway database,
    NCBI, ...)
  • Highly interdisciplinary experience.
Low level analysis
High level analysis
66
Thank you
Sir! They have discovered the code of the human genome!
Damn hackers! I'm going to have to change the
password!
  • THE END

68
Clustering: What is similar?
  • Vector distance measurements:
  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
  • Spearman's rank correlation: Pearson
    correlation of ranks
  • Mutual information: amount of info gained about X
    when Y is learned

69
Affymetrix workflow
70
Affymetrix workflow
71
Affymetrix workflow
72
Affymetrix Genechip printing
  • Oligonucleotide arrays are printed using a
    combination of photolithography and combinatorial
    chemistry
  • Quartz wafer
  • Blocking compound (can be removed by light
    exposure)
  • Mask (18-20 µm² windows)

73
Affy: On-chip hybridization I
74
Affy: On-chip hybridization II
Using very long cRNAs reduces hybridization
efficiency (folding of the molecule)
(Southern E. et al., Nature Genetics, Vol. 21,
10-14, 1999)
75
Affy: On-chip hybridization III
76
Affy Preprocessing Algos: RMA
  • RMA considers log2-transformed intensities
  • log(Intensities) ~ Normal (noise is additive)
  • Has a normalizing effect on calculated fold
    changes
  • 10-fold up-regulation: distance to random
    expectation (1) is 9.
  • 10-fold down-regulation: distance to random
    expectation (1) is 0.9
  • Raw ratios will emphasize up-regulation and
    ignore down-regulation.
  • e.g. 4-fold up-regulation: FC = 4,
    log2(FC) = 2
  • 4-fold down-regulation: FC = 0.25,
    log2(FC) = -2
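The asymmetry of raw ratios and the symmetry of log ratios are easy to check numerically. The helper names below are made up for the example.

```python
from math import log2

def fold_change(experiment: float, control: float) -> float:
    return experiment / control

def log_fold_change(experiment: float, control: float) -> float:
    return log2(experiment / control)

# Raw ratios: distance to the "no change" value 1 is asymmetric.
up_distance = abs(fold_change(10.0, 1.0) - 1.0)    # 10-fold up
down_distance = abs(fold_change(1.0, 10.0) - 1.0)  # 10-fold down

# Log ratios: up- and down-regulation of the same size are mirror images.
up_4 = log_fold_change(4.0, 1.0)
down_4 = log_fold_change(1.0, 4.0)
```

On the log2 scale "no change" sits at 0, and a k-fold increase and a k-fold decrease are the same distance from it.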

77
Affy Preprocessing Algos: RMA - Background
Subtraction
  • Adjust for global background (convolution model)
  • $O = N + S$
  • Find $P(S \mid O, N)$ and then $E(S \mid O, N)$
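A common concrete form of this convolution model takes the signal S to be exponentially distributed with rate alpha and the noise N to be normal with mean mu and standard deviation sigma. Under those assumptions (which the slide does not spell out) the posterior of S given an observed intensity o is a normal distribution truncated at zero, and its mean has a closed form. The sketch below treats mu, sigma and alpha as already estimated. The numeric values are made up.

```python
from math import erfc, exp, pi, sqrt

def normal_pdf(z: float) -> float:
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def normal_cdf(z: float) -> float:
    return 0.5 * erfc(-z / sqrt(2.0))

def expected_signal(o: float, mu: float, sigma: float, alpha: float) -> float:
    """E[S | O = o] for S ~ Exp(alpha), N ~ Normal(mu, sigma^2), O = N + S.

    The posterior of S is Normal(a, sigma^2) truncated to S >= 0,
    with a = o - mu - sigma^2 * alpha. Not guarded against extremely
    negative a / sigma, where the CDF underflows to zero.
    """
    a = o - mu - sigma * sigma * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

mu, sigma, alpha = 100.0, 10.0, 0.01           # made-up background parameters
bright = expected_signal(1000.0, mu, sigma, alpha)  # well above background
dim = expected_signal(50.0, mu, sigma, alpha)       # below the mean background
```

Unlike a plain subtraction of the background level, the corrected value stays positive even for probes dimmer than the mean background, so the log transform remains defined.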

78
Affy Preprocessing Algos: RMA - Normalization
  • Quantile normalization
  • Transforms the data, quantile by quantile, to
    make arrays have the same distribution of data
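Quantile normalization can be sketched in a few lines: sort each array, average the sorted values rank by rank to get a reference distribution, then give every probe the reference value of its rank. The version below is plain Python, does not treat ties specially, and uses made-up intensities.

```python
def quantile_normalize(arrays):
    """arrays: one list of intensities per chip, all of the same length."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest values across chips.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by increasing intensity
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

chips = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]  # two toy chips, three probes each
normalized = quantile_normalize(chips)
```

After the transformation every chip has exactly the same set of values, and only the assignment of those values to probes differs between chips.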

79
Affy Preprocessing Algos: RMA - Normalization
  • After quantile normalization

80
Affy Preprocessing Algos: RMA - Background
Subtraction
  • $E(S \mid PM)$

81
Affy Preprocessing Algos: RMA - Normalization
  • Before normalization

82
Affy RMA: An example
  • Spike-in experiment
  • 14 human genes were spiked in at concentrations
    ranging from 0 to 1024 pM (0, 0.25, 0.5, 1, 2, 4,
    8, 16, 32, 64, 128, 256, 512 and 1024 pM)

[Table: chips (chip1, chip2, chip3, chip4) by genes (Gen1, Gen2, Gen3, ..... Gen4)]
87
Clustering: Hierarchical clustering
  • The resulting figure is known as a hierarchical
    tree.
  • Different number of clusters depending on how
    deep we look.

88
Clustering: Hierarchical clustering
  • At each step, the order of the two groups being
    joined is arbitrary

89
Clustering: Hierarchical clustering
  • What is a good cluster? Bootstrapping
  • The data is resampled (some experiments taken out
    randomly and replaced by copies of other
    experiments)
  • The whole clustering is repeated.
  • Clusters that appear often are statistically
    safer than others.
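The resampling idea can be illustrated on the smallest possible "cluster", a pair of genes. The sketch below is a simplification rather than a full re-clustering: it resamples experiments with replacement and records how often gene b remains the nearest neighbour of gene a. The function name, the seed and the profiles are assumptions for the example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def pair_support(profiles, a, b, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which b is the nearest neighbour of a."""
    rng = random.Random(seed)
    n_exp = len(profiles[a])
    hits = 0
    for _ in range(n_boot):
        # Draw experiments with replacement: some drop out, others are duplicated.
        cols = [rng.randrange(n_exp) for _ in range(n_exp)]
        resampled = {g: [p[c] for c in cols] for g, p in profiles.items()}
        nearest = min(
            (g for g in resampled if g != a),
            key=lambda g: dist(resampled[a], resampled[g]),
        )
        hits += nearest == b
    return hits / n_boot

# Made-up profiles: A and F are nearly identical, B is far from both.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "F": [1.1, 2.1, 3.1, 4.1],
    "B": [10.0, 20.0, 30.0, 40.0],
}
support = pair_support(profiles, "A", "F")
```

A support close to 1 means the grouping survives almost every resampling. The same counting applied to every node of a tree gives bootstrap confidence values for whole clusters.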

90
Clustering: K-means
  • A specific number of clusters has to be provided
  • Goal: assign elements to clusters

91
Clustering: K-means
  • Start by guessing k centers

92
Clustering: K-means
  • Assign elements to these centers

93
Clustering: K-means
  • Move to gravity centers

94
Clustering: K-means
  • Reassign elements, and repeat until convergence
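The four steps on these slides (guess k centers, assign elements, move the centers to the gravity centers, reassign until nothing changes) translate almost line by line into code. The sketch below is plain Python with Euclidean distance. The initial centers are supplied by the caller, and the data are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, centers, max_iter=100):
    """Return (labels, centers) after iterating assignment and center updates."""
    for _ in range(max_iter):
        # Assign each element to its closest center.
        labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        # Move each center to the gravity center (mean) of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: the centers did not move
            break
        centers = new_centers
    return labels, centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
initial_guess = [(1.0, 1.0), (9.0, 9.0)]  # the k = 2 guessed centers
labels, centers = k_means(points, initial_guess)
```

The result depends on the initial guess, so in practice the procedure is usually run from several starting configurations.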