Gene expression

File metadata: title "Part 1 Microarray Timeseries Analysis with replicates OSM and EGF treatments over time"; author: Computer Centre; last modified by: WEHI ITS.

1
Gene expression

Terry Speed
Lecture 4, December 18, 2001

2
Thesis: the analysis of gene expression data is
going to be big in 21st century statistics

Many different technologies, including
High-density nylon membrane arrays
Serial analysis of gene expression (SAGE)
Short oligonucleotide arrays (Affymetrix)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
cDNA arrays (Brown/Botstein)

3
Total microarray articles indexed in Medline
4
Common themes

Parallel approach to collection of very large
amounts of data (by biological standards)
Sophisticated instrumentation, requires some
understanding
Systematic features of the data are at least as
important as the random ones
Often more like industrial process than single
investigator lab research
Integration of many data types: clinical,
genetic, molecular, ... databases

5
Biological background
DNA
G T A A T C C T C
C A T T A G G A G
6
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the
cell. Measuring protein might be better, but is
currently harder.
7
Reverse transcription
Clone cDNA strands, complementary to the mRNA
mRNA: G U A A U C C U C
Reverse transcriptase
cDNA (strand in progress): T T A G G A G
cDNA (many copies): C A T T A G G A G
8
cDNA microarray experiments

mRNA levels compared in many different
contexts
Different tissues, same organism (brain v.
liver)
Same tissue, same organism (treatment v. control, tumor v.
non-tumor)
Same tissue, different organisms (wild type v. knock-out, transgenic,
or mutant)
Time course experiments (effect of treatment,
development)
Other special designs (e.g. to detect spatial
patterns).

9
cDNA microarrays
cDNA clones
10
cDNA microarrays

Compare the gene expression in two samples of
cells

PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green,
e.g. treatment / control, normal / tumor
tissue
11
HYBRIDIZE: Add equal amounts of labelled cDNA
samples to microarray.
SCAN: Laser, Detector
12
Biological question: differentially expressed
genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
13
Some statistical questions

Image analysis: addressing, segmenting,
quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to
results.

14
Some statistical questions, ctd

Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other
special experiments ... and much more.

15
Some bioinformatic questions

Connecting spots to databases, e.g. to sequence,
structure, and pathway databases
Discovering short sequences regulating sets of
genes: direct and inverse methods
Relating expression profiles to structure and
function, e.g. protein localisation
Identifying novel biochemical or signalling
pathways ... and much more.

16
Part of the image of one channel, false-coloured
on a scale from white (very high) and red (high), through yellow
and green (medium), to blue (low) and black
17
Does one size fit all?
18
Segmentation: limitation of the fixed circle
method
Panels: Fixed Circle, SRG (seeded region growing)
Inside the boundary is spot (foreground), outside
is not.
19
Some local backgrounds
Single channel grey scale
We use something different again: a smaller, less
variable value.
20
Quantification of expression

For each spot on the slide we calculate
Red intensity = $R_{fg} - R_{bg}$
(fg = foreground, bg = background), and
Green intensity = $G_{fg} - G_{bg}$,
and combine them in the log (base 2) ratio
$\log_2(\text{Red intensity} / \text{Green intensity})$
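
The quantification step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function name `log_ratio`, the example readings, and the choice to return `None` when a background-corrected intensity is not positive are all assumptions made for the example.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return log2 of background-corrected red over green, or None if undefined."""
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None        # assumption: treat non-positive intensities as missing
    return math.log2(red / green)

# Hypothetical (Rfg, Rbg, Gfg, Gbg) readings for three spots.
spots = [(1200, 200, 700, 200), (450, 150, 1350, 150), (300, 310, 800, 100)]
ratios = [log_ratio(*s) for s in spots]
print(ratios)
```

A positive value means the red channel is brighter, a negative value means the green channel is brighter, and the third spot shows what happens when the background exceeds the foreground.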

21
Gene Expression Data

On p genes for n slides: p is O(10,000), n is
O(10-100), but growing.

                      Slides
Genes   slide 1  slide 2  slide 3  slide 4  slide 5  ...
1         0.46     0.30     0.80     1.51     0.90   ...
2        -0.10     0.49     0.24     0.06     0.46   ...
3         0.15     0.74     0.04     0.10     0.20   ...
4        -0.45    -1.03    -0.79    -0.56    -0.32   ...
5        -0.06     1.06     1.35     1.09    -1.09   ...

Example entry: 1.09 is the gene expression level of gene 5 in slide 4,

Log2( Red intensity / Green intensity)
These values are conventionally displayed on a
red (> 0), yellow (0), green (< 0) scale.
23
The red/green ratios can be spatially biased

Top 2.5% of ratios red, bottom 2.5% of ratios green
24
The red/green ratios can be intensity-biased
$M = \log_2(R/G) = \log_2 R - \log_2 G$
Values should scatter about zero.
Horizontal axis: $(\log_2 R + \log_2 G)/2$
25
Normalization: how we fix the previous problem
The curved line becomes the new zero line
Orange: Schadt-Wong rank invariant set
Red line: lowess smooth
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
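
A rough sketch of intensity-dependent normalization follows. It is not the lowess routine used for the slides: it fits a tricube-weighted local line at each point with no robustness iterations, which is a simplified stand-in. The function names, the `frac` parameter and the simulated spots are assumptions for illustration.

```python
import math

def local_linear_fit(xs, ys, frac=0.5):
    """Tricube-weighted local linear fit at each x (lowess without robustness steps)."""
    n = len(xs)
    k = max(2, int(math.ceil(frac * n)))      # neighbours used in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12             # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize(red, green, frac=0.5):
    """Return average log intensity, raw M, and M with the fitted curve subtracted."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    curve = local_linear_fit(a, m, frac)      # the curve becomes the new zero line
    return a, m, [mi - ci for mi, ci in zip(m, curve)]

# Hypothetical spots whose log ratio drifts linearly with intensity.
a_true = [6 + i for i in range(11)]
m_true = [0.2 * a - 2 for a in a_true]
red = [2 ** (a + m / 2) for a, m in zip(a_true, m_true)]
green = [2 ** (a - m / 2) for a, m in zip(a_true, m_true)]
a, m, m_norm = normalize(red, green)
print(max(abs(v) for v in m_norm))
```

Because the simulated bias is exactly linear in intensity, the local fit reproduces it and the normalized values sit at zero up to rounding. Real data would leave the differentially expressed genes scattered around the new zero line.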
26
Normalizing: before
[Plot: M on the vertical axis (ticks from -4 to 2), horizontal axis ticks from 6 to 16]
27
Normalizing: after
[Plot: M normalised on the vertical axis, horizontal axis ticks from 6 to 16]
28
A basic problem

SCIENTIFIC: To determine which genes are
differentially expressed between two sources of
mRNA (treatment, control).
STATISTICAL: To assign appropriately adjusted
p-values to thousands of genes.

29
Apo AI experiment (Callow et al 2000, LBNL)
Goal. To identify genes with altered expression
in the livers of Apo AI knock-out mice (T)
compared to inbred C57Bl/6 control mice (C).

8 treatment mice and 8 control mice
16 hybridizations: liver mRNA from each of the
16 mice ($T_i$, $C_i$) is labelled with Cy5,
while pooled liver mRNA from the control mice
(C) is labelled with Cy3.
Probes: 6,000 cDNAs (genes), including 200
related to lipid metabolism.

30
Leukemia experiments (Golub et al 1999, WI)

Goal. To identify genes which are differentially
expressed in acute lymphoblastic leukemia (ALL)
tumours in comparison with acute myeloid
leukemia (AML) tumours.
38 tumour samples: 27 ALL, 11 AML.
Data from Affymetrix chips, some
pre-processing.
Originally 6,817 genes; 3,051 after reduction.
Data therefore a 3,051 × 38 array of expression
values.

31
Univariate hypothesis testing
Initially, focus on one gene only. We wish to test
the null hypothesis H that the gene is not
differentially expressed. In order to do so, we use
a two sample t-statistic.
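
The transcript does not preserve the formula for the t-statistic, so the sketch below assumes the unequal-variance (Welch) form and pairs it with a permutation p-value obtained by enumerating every relabelling of the samples. The function names and the expression values are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample t-statistic with separate variance estimates (assumed Welch form)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def permutation_p(x, y):
    """Two-sided p-value from all relabellings of the pooled observations."""
    pooled = x + y
    t_obs = abs(welch_t(x, y))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that relabellings tied with the observed one are counted.
        if abs(welch_t(gx, gy)) >= t_obs - 1e-12:
            count += 1
    return count / total

# Hypothetical log ratios for one gene in 4 treatment and 4 control slides.
treatment = [2.1, 1.9, 2.3, 2.0]
control = [0.1, -0.2, 0.3, 0.0]
t_stat = welch_t(treatment, control)
p_value = permutation_p(treatment, control)
print(t_stat, p_value)
```

With two perfectly separated groups of four, only the observed labelling and its mirror image reach the observed |t|, so the p-value is 2 out of the 70 possible splits.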

32
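Single-step adjustments of $p_i$

Bonferroni: $\min(m p_i, 1)$, $m$ genes
Šidák: $1 - (1 - p_i)^m$
minP method of Westfall and Young:
$\Pr\left( \min_{1 \le l \le m} P_l \le p_i \mid H_0^C \right)$
maxT method of Westfall and Young:
$\Pr\left( \max_{1 \le l \le m} |T_l| \ge |t_i| \mid H_0^C \right)$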
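
As a hedged illustration of the single-step adjustments, the sketch below implements Bonferroni and Šidák directly, and estimates the single-step maxT adjustment from a supplied table of permutation statistics. The function names and the tiny data set are invented for the example; a real analysis would use thousands of genes and permutations.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: min(m * p_i, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]

def sidak(pvals):
    """Single-step Sidak: 1 - (1 - p_i)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

def single_step_maxt(t_obs, perm_t):
    """Fraction of permutations whose largest |T_l| reaches |t_i|, for each gene i."""
    maxima = [max(abs(t) for t in row) for row in perm_t]
    return [sum(mx >= abs(t) for mx in maxima) / len(maxima) for t in t_obs]

# Hypothetical unadjusted p-values for four genes.
p = [0.001, 0.02, 0.3, 0.5]
p_bonf = bonferroni(p)
p_sidak = sidak(p)

# Hypothetical statistics for two genes under four permutations (rows).
t_obs = [4.0, 1.0]
perm_t = [[0.5, -1.5], [2.0, 0.3], [-0.2, 0.1], [5.0, 0.4]]
p_maxt = single_step_maxt(t_obs, perm_t)
print(p_bonf, p_sidak, p_maxt)
```

The Šidák values are never larger than the Bonferroni ones, and the maxT values depend on the joint behaviour of the statistics rather than on m alone.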

33
More powerful methods: step-down adjustments
The idea: S. Holm's modification of
Bonferroni. Also applies to Šidák, maxT, and
minP. We illustrate this last adjustment.
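
For reference, Holm's step-down modification of Bonferroni can be sketched as below. The function name and example p-values are illustrative only.

```python
def holm(pvals):
    """Holm step-down adjusted p-values, returned in the original gene order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # most significant first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m down to 1 as we step down the ordered list.
        candidate = min((m - rank) * pvals[i], 1.0)
        running_max = max(running_max, candidate)      # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]   # hypothetical unadjusted p-values
p_holm = holm(p)
print(p_holm)
```

Only the smallest p-value is multiplied by the full m, which is why the step-down version is never less powerful than single-step Bonferroni.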
34
Step-down adjustment of minP

Initialization: Order the unadjusted p-values
such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_m}$. The
indices $r_1, r_2, r_3, \dots$ are fixed for given data.
Step-down adjustment:
1. Compare $\min\{P_{r_1}, \dots, P_{r_m}\}$ with $p_{r_1}$
2. Compare $\min\{P_{r_2}, \dots, P_{r_m}\}$ with $p_{r_2}$
3. Compare $\min\{P_{r_3}, \dots, P_{r_m}\}$ with $p_{r_3}$ ...
m. Compare $P_{r_m}$ with $p_{r_m}$
Enforce the monotonicity on the adjusted $p_{r_i}$
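
The steps above can be sketched as follows, assuming the unadjusted p-values under each permutation have already been computed and stored row by row. The function name, argument layout and toy numbers are assumptions for illustration, not the authors' implementation.

```python
def stepdown_minp(p_obs, perm_p):
    """Step-down minP adjusted p-values.

    p_obs  : unadjusted p-values, one per gene
    perm_p : one row per permutation, each row holding p-values in the same gene order
    """
    m = len(p_obs)
    order = sorted(range(m), key=lambda i: p_obs[i])    # r_1, r_2, ..., r_m
    counts = [0] * m
    for row in perm_p:
        running_min = 1.0
        # Walk from r_m back to r_1 so running_min = min(P_rj, ..., P_rm).
        for j in range(m - 1, -1, -1):
            running_min = min(running_min, row[order[j]])
            if running_min <= p_obs[order[j]]:
                counts[j] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for j in range(m):
        previous = max(previous, counts[j] / len(perm_p))   # enforce monotonicity
        adjusted[order[j]] = previous
    return adjusted

# Hypothetical data: three genes, four permutations.
p_obs = [0.02, 0.50, 0.10]
perm_p = [
    [0.30, 0.40, 0.05],
    [0.01, 0.90, 0.60],
    [0.70, 0.20, 0.80],
    [0.50, 0.60, 0.90],
]
p_adj = stepdown_minp(p_obs, perm_p)
print(p_adj)
```

With so few permutations the adjusted values can only be multiples of 1/4, a small-scale version of the discreteness that arises whenever the number of permutations is limited.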

36
gene    t           unadj. p    minP      plower       maxT
index   statistic   (× 10^4)    adjust.                adjust.
2139    -22         1.5         .53       8 × 10^-5    2 × 10^-4
4117    -13         1.5         .53       8 × 10^-5    5 × 10^-4
5330    -12         1.5         .53       8 × 10^-5    5 × 10^-4
1731    -11         1.5         .53       8 × 10^-5    5 × 10^-4
538     -11         1.5         .53       8 × 10^-5    5 × 10^-4
1489    -9.1        1.5         .53       8 × 10^-5    1 × 10^-3
2526    -8.3        1.5         .53       8 × 10^-5    3 × 10^-3
4916    -7.7        1.5         .53       8 × 10^-5    8 × 10^-3
941     -4.7        1.5         .53       8 × 10^-5    0.65
2000    3.1         1.5         .53       8 × 10^-5    1.00
5867    -4.2        3.1         .76       0.54         0.90
4608    4.8         6.2         .93       0.87         0.61
948     -4.7        7.8         .96       0.93         0.66
5577    -4.5        12          .99       0.93         0.74
37
Apo AI: histogram and Q-Q plot
[Plot label: ApoA1]
39
Brief discussion

Not mentioned: strong vs weak control of Type 1
error.
The minP adjustment seems more conservative than
the maxT adjustment, but is essentially
model-free.
The adjusted minP values are very discrete; it
seems that 12,870 permutations are not enough for
6,000 tests.
Extends to other statistics: Wilcoxon, paired t,
F, blocked F, ...
Major question in practice: minP, maxT or
something else?
Wanted: guidelines for the use of minP in terms of
sample sizes and number of genes.
Other approaches: False Discovery Rate (V/R),
Bayes.
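
The figure of 12,870 is consistent with the number of ways to split 16 mice into two groups of 8, which is presumably where it comes from. A short calculation under that assumption shows why so few permutations make the adjusted values coarse; the variable names are illustrative.

```python
import math

n_permutations = math.comb(16, 8)      # ways to choose 8 "treatment" labels among 16 mice
smallest_p = 1 / n_permutations        # finest step of a permutation p-value
bonferroni_floor = min(6000 * smallest_p, 1.0)   # smallest Bonferroni-adjusted value for 6,000 tests

print(n_permutations, smallest_p, bonferroni_floor)
```

The smallest attainable unadjusted p-value is roughly 8 × 10^-5, so after a Bonferroni-style correction for 6,000 tests nothing can fall much below 0.47.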

40
From a study of the mouse olfactory system
Main (Accessory) Olfactory Bulb
VomeroNasal Organ
Olfactory Epithelium
From Buck (2000)
41
Axonal connectivity between the nose and
the mouse olfactory bulb
Neocortex
>2M, 1,800 types
Two principles: zone-to-zone projection, and
glomerular convergence
42
Of interest: the hardwiring of the vertebrate
olfactory system

Expression of a specific odorant receptor gene by
an olfactory neuron.
Targeting and convergence of like axons to
specific glomeruli in the olfactory bulb.

43
The biological question in this case

Are there genes with spatially restricted
expression patterns within
the olfactory bulb?

45
Layout of the cDNA Microarrays

Sequence verified mouse cDNAs
19,200 spots in two print groups of 9,600 each
Each print group: 4 x 4 grids, each with 25 x 24 spots
Controls on the first 2 rows of each grid.

[Figure: array layout with print groups pg1 and pg2]
46
Design: How We Sliced Up the Bulb
[Figure: the bulb divided into six regions labelled A, D, P, L, V, M]
47
Design: Two Ways to Do the Comparisons

Goal: 3-D representation of gene expression

Compare all samples to a common reference sample
(e.g., whole bulb)
Multiple direct comparisons between different
samples (no common reference)
48
An Important Aspect of Our Design
[Diagram with nodes D, A, L, M, P, V]
Different ways of estimating the same
contrast, e.g. A compared to P:
Direct: A - P
Indirect: (A - M) + (M - P), or
(A - D) + (D - P), or -(L - A) - (P - L)
How do we combine these?
49
Analysis using a linear model
Define a matrix $X$ so that $E(M) = X\beta$.
Use least squares estimates for A-L, P-L, D-L, V-L, M-L.
In practice, we use robust regression.
Estimates for other estimable contrasts follow in the usual
way.
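
A minimal sketch of the linear-model step is given below. It uses ordinary least squares through the normal equations rather than the robust regression mentioned above, and the list of hybridizations, the noise-free M values and all function names are hypothetical.

```python
PARAMS = ["A", "P", "D", "V", "M"]   # effects relative to L, i.e. A-L, P-L, D-L, V-L, M-L

def design_row(red, green):
    """Row of X for a slide measuring M = log2(red sample / green sample)."""
    row = [0.0] * len(PARAMS)
    if red in PARAMS:
        row[PARAMS.index(red)] += 1.0
    if green in PARAMS:
        row[PARAMS.index(green)] -= 1.0
    return row

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def least_squares(x_rows, m_values):
    """Ordinary least squares via the normal equations (X'X) beta = X'M."""
    p = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(p)] for i in range(p)]
    xtm = [sum(r[i] * m for r, m in zip(x_rows, m_values)) for i in range(p)]
    return solve(xtx, xtm)

# Hypothetical direct comparisons, written as (red sample, green sample).
slides = [("A", "P"), ("A", "M"), ("M", "P"), ("A", "D"), ("D", "P"),
          ("L", "A"), ("P", "L"), ("V", "L"), ("D", "V")]
true_effect = {"A": 1.0, "P": -0.5, "D": 0.25, "V": 0.0, "M": 0.75, "L": 0.0}
m_obs = [true_effect[r] - true_effect[g] for r, g in slides]   # noise-free M values

X = [design_row(r, g) for r, g in slides]
beta = dict(zip(PARAMS, least_squares(X, m_obs)))
a_minus_p = beta["A"] - beta["P"]    # another estimable contrast, from the fitted effects
print(beta, a_minus_p)
```

The direct and indirect routes to A-P are combined automatically: each slide contributes one row of X, and the fit weighs all routes together.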
50
The Olfactory Bulb Experiments
completed so far
51
Contrasts and Patterns

Because of the connectivity of our
experiment, we can estimate all 15 different
pairwise comparisons directly and/or indirectly.
For every gene we thus have a pattern based
on the 15 pairwise comparisons.

Gene 15,228
52
Contrasts and patterns: another way

Instead of estimating pairwise comparisons
between each of the six effects, we can come
closer to estimating the effects themselves by
doing so subject to the standard zero sum
constraint (6 parameters, 5 d.f.).
What we estimate for A, say, subject to this
constraint, is in reality an estimate of
$A - \frac{1}{6}(A + P + D + V + M + L)$.
This set of parameter estimates gives results
similar to, but better than, the ones we would
have obtained had we carried out the experiments
with whole-bulb reference tissue.
In effect we have created the whole-bulb
reference in silico.
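
The zero-sum reparametrization amounts to subtracting the average of the six effects. A small sketch, assuming estimates relative to L are already available (the numbers and the function name are hypothetical):

```python
def zero_sum_effects(relative_to_l):
    """Convert estimates of A-L, P-L, D-L, V-L, M-L into zero-sum effects."""
    effects = dict(relative_to_l)
    effects["L"] = 0.0                           # L-L is zero by definition
    centre = sum(effects.values()) / len(effects)
    # (X - L) minus the mean of all (Y - L) equals X - (1/6)(A + P + D + V + M + L).
    return {region: value - centre for region, value in effects.items()}

relative = {"A": 1.0, "P": -0.5, "D": 0.25, "V": 0.0, "M": 0.75}   # hypothetical estimates
centred = zero_sum_effects(relative)
print(centred)
```

Pairwise contrasts are unchanged by the centring, so A minus P is the same before and after.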

53
Alternative pattern representation
Gene 15,228 once again.
54
Reconstruction of the Bulb as a Cube: Expression
of Gene 15,228
[Figure legend: Expression Level, from High to Low]
55
Patterns, More Globally...

Can we identify genes with interesting patterns
of expression across the bulb?
Two approaches:

1. Find the genes whose expression fits specific,
predefined patterns.
2. Perform cluster analysis
- see what expression patterns emerge.
56
Clustering procedure

Start with a set of genes exhibiting some
minimal level of differential expression across
the bulb; here 650 were chosen from all 15
contrasts.
Carry out hierarchical clustering, building a
dendrogram; Mahalanobis distance and Ward
agglomeration (minimum variance) were used.
Now consider all clusters of 2 or more genes
in the tree. Singles are added separately.
Measure the heterogeneity h of a cluster by
calculating the 15 SDs
across the cluster of each of the pairwise
effects, and taking the largest.
Choose a score s (see plots) and take all
maximal disjoint clusters with
h < s. Here we used s = 0.46 and obtained
16 clusters.
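
The cluster-selection rule can be sketched as below, taking the dendrogram as given. The tree is represented as nested pairs, the profiles have 3 effects per gene instead of 15 to keep the example short, and every name and number is hypothetical. Building the tree itself (Mahalanobis distance, Ward agglomeration) is not shown.

```python
from statistics import stdev

def leaves(node):
    """Return the gene indices under a tree node (int leaf or (left, right) pair)."""
    if isinstance(node, int):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def heterogeneity(genes, profiles):
    """h: the largest across-gene SD over the pairwise effects of a cluster."""
    n_effects = len(profiles[genes[0]])
    return max(stdev(profiles[g][k] for g in genes) for k in range(n_effects))

def maximal_clusters(node, profiles, s):
    """Collect the largest subtrees (2 or more genes) whose heterogeneity is below s."""
    if isinstance(node, int):
        return []                                # singles are handled separately
    genes = leaves(node)
    if heterogeneity(genes, profiles) < s:
        return [sorted(genes)]                   # stop here: this subtree is maximal
    return maximal_clusters(node[0], profiles, s) + maximal_clusters(node[1], profiles, s)

# Hypothetical profiles (3 effects per gene) and a hypothetical dendrogram.
profiles = {
    0: [1.0, 0.0, -1.0],
    1: [1.1, 0.1, -0.9],
    2: [-1.0, 0.5, 0.2],
    3: [-1.2, 0.4, 0.3],
    4: [0.0, 2.0, 0.0],
}
tree = (((0, 1), (2, 3)), 4)
clusters = maximal_clusters(tree, profiles, s=0.46)
print(clusters)
```

Walking down from the root and stopping at the first node with h below s guarantees that the returned clusters are disjoint and as large as possible.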

57
Red: genes chosen. Blue: controls. 15 pairwise effects.
59
The 16 groups systematically arranged (6 point
representation)

61
Validation of Gene 15,228 Expression Pattern by
RNA In Situ Hybridization
62
Validation of predicted patterns using in situ
hybridization and neurolucida reconstructions
from them.
63
Some statistical research stimulated by
microarray data analysis

Experimental design: Churchill & Kerr
Image analysis: Zuzan & West, ...
Data visualization: Carr et al
Estimation: Ideker et al, ...
Multiple testing: Westfall & Young, Storey, ...
Discriminant analysis: Golub et al, ...
Clustering: Hastie & Tibshirani, Van der Laan,
Fridlyand & Dudoit, ...
Empirical Bayes: Efron et al, Newton et al, ...
Multiplicative models: Li & Wong
Multivariate analysis: Alter et al
Genetic networks: D'Haeseleer et al
... and more

64
Acknowledgments

Statistical collaborators
Yee Hwa Yang (Berkeley)
Sandrine Dudoit (Berkeley)
Ingrid Lönnstedt (Uppsala)
Natalie Thorne (WEHI)
Mauro Delorenzi (WEHI)
CSIRO Image Analysis Group
Michael Buckley
Ryan Lagerstorm
WEHI
Glenn Begley
Suzie Grant
Rob Good
PMCI
Chuang Fong Kong

Ngai Lab (Berkeley)
Cynthia Duggan
Jonathan Scolnick
Dave Lin
Vivian Peng
Percy Luu
Elva Diaz
John Ngai
LBNL
Matt Callow
RIKEN Genomic Sciences Center
Yasushi Okazaki
Yoshihide Hayashizaki

Some web sites
Technical reports, talks, software etc.:
http://www.stat.berkeley.edu/users/terry/zarray/Html/
Statistical software R: GNU's S
http://lib.stat.cmu.edu/R/CRAN/
Packages within R environment:
-- Spot: http://www.cmis.csiro.au/iap/spot.htm
-- SMA (statistics for microarray analysis):
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html