The second-simplest cDNA microarray data analysis problem

1
The second-simplest cDNA microarray data
analysis problem
  • Terry Speed, UC Berkeley
  • Bioinformatic Strategies For Application of
    Genomic Tools to Environmental Health Research,
    March 5, 2001
  • NIEHS National Center for Toxicogenomics NCSU
    Bioinformatics Research Center

2
Biological question: differentially expressed
genes, sample class prediction, etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3
Some motherhood statements
  • Important aspects of a statistical analysis
    include
  • Tentatively separating systematic from random
    sources of variation
  • Removing the former and quantifying the latter,
    when the system is in control
  • Identifying and dealing with the most relevant
    source of variation in subsequent analyses
  • Only if this is done can we hope to make more or
    less valid probability statements

4
The simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using one slide
  • This is a common enough hope
  • Efforts are frequently successful
  • It is not hard to do by eye
  • The problem is probably beyond formal statistical
    inference (valid p-values, etc.) for the foreseeable
    future, and here's why

5
An M vs. A plot
M = log2(R / G), A = log2(RG) / 2
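A minimal sketch of the M and A definitions on this slide. The intensity values are made up for illustration, and the function name `ma_values` is hypothetical.

```python
import math

def ma_values(r, g):
    """Return (M, A) for one spot, given red and green intensities."""
    m = math.log2(r / g)        # M: log ratio of the two channels
    a = math.log2(r * g) / 2    # A: average log intensity
    return m, a

# Hypothetical background-corrected intensities (R, G) for three spots.
spots = [(1024.0, 256.0), (500.0, 500.0), (64.0, 256.0)]
for r, g in spots:
    m, a = ma_values(r, g)
    print(f"R={r:.0f} G={g:.0f} M={m:.2f} A={a:.2f}")
```

Plotting M against A rather than log R against log G puts intensity-dependent dye bias on the vertical axis, which is what the normalisation slides that follow work with.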
6
Background matters
From Spot
From GenePix
7
No background correction
With background correction
From the NCI60 data set (Stanford web site)
8
An experiment having within-slide replicates
9
Background makes a difference
Background method    Segmentation method    Exp1    Exp2
No background        S.nbg                     6       6
                     Gp.nbg                    7       6
                     SA.nbg                    6       6
                     QA.fix.nbg                7       6
                     QA.hist.nbg               7       6
                     QA.adp.nbg               14      14
Local surrounding    S.valley                 17      21
                     GP                       11      11
                     SA                       12      14
                     QA.fix                   18      23
                     QA.hist                   9       8
                     QA.adp                   27      26
Others               S.morph                   9       9
                     S.const                  14      14
Medians of the SD of log2(R/G) for 8 replicated
spots multiplied by 100 and rounded to the
nearest integer.
10
Normalisation - lowess
  • Global lowess (Matt Callow's data, LBNL)
  • Assumption: changes roughly symmetric at all
    intensities.
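A rough sketch of the idea behind global lowess normalisation: fit a smooth curve to M as a function of A and subtract it. The code below is a plain local linear fit with tricube weights and omits the robustness iterations of full lowess; the function names, the `frac` span and the data are assumptions for illustration, not the software used for these slides.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(a_vals, m_vals, frac=0.4):
    """Fitted value of M at each A from a locally weighted straight line."""
    n = len(a_vals)
    k = min(n, max(3, round(frac * n)))      # number of neighbours in each window
    fitted = []
    for a0 in a_vals:
        x = [a - a0 for a in a_vals]         # centre A at the target point
        h = sorted(abs(v) for v in x)[k - 1] or 1e-12   # window half-width
        w = [tricube(v / h) for v in x]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * mi for wi, mi in zip(w, m_vals))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * mi for wi, xi, mi in zip(w, x, m_vals))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)          # fall back to a weighted mean
        else:
            # Intercept of the weighted line, i.e. its value at A = a0.
            fitted.append((swxx * swy - swx * swxy) / denom)
    return fitted

def lowess_normalise(a_vals, m_vals, frac=0.4):
    """Subtract the fitted intensity-dependent trend from M."""
    curve = local_linear_fit(a_vals, m_vals, frac)
    return [m - c for m, c in zip(m_vals, curve)]

# Hypothetical slide: M drifts upward with A (a dye bias), plus a small wobble.
a_demo = [6 + 0.5 * i for i in range(20)]
m_demo = [0.3 * (a - 10) + (0.1 if i % 2 else -0.1) for i, a in enumerate(a_demo)]
print([round(v, 2) for v in lowess_normalise(a_demo, m_demo)])
```

Under the stated assumption that changes are roughly symmetric at all intensities, the fitted curve estimates dye bias rather than biology, so subtracting it centres M near zero at every A.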

11
From the NCI60 data set (Stanford web site)
12
Ngai lab, UCB
13
Tiago's data from the Goodman lab, UCB
14
From the Ernest Gallo Clinic Research Center
15
From Peter MacCallum Cancer Research Institute,
Australia
16
Normalisation - print tip
Assumption: for every print group, changes
roughly symmetric at all intensities.
17
M vs A after print-tip normalisation
18
Normalization (ctd): another data set
Log-ratios
Print-tip groups
  • After within-slide global lowess normalization.
  • Likely to be a spatial effect.

19
Taking scale into account
  • Assumption: all print-tip groups have the same
    spread in M
  • True log ratio is μ_ij, where i represents
    different print-tip groups and j
    represents different spots.
  • Observed is M_ij, where
  • M_ij = a_i μ_ij
  • Robust estimate of a_i is based on
  • MAD_i = median_j { | y_ij - median_j(y_ij) | }
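A small sketch of scale normalisation built on the MAD. The slide text gives only the MAD itself; dividing each group's MAD by the geometric mean of all the MADs to obtain the scale factor is an assumption made here so that the overall scale is preserved. The function names and data are hypothetical.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])

def scale_normalise(groups):
    """groups: dict mapping a print-tip id to its list of M values.

    Assumes every group has a non-zero MAD.
    """
    mads = {tip: mad(ms) for tip, ms in groups.items()}
    # Assumed choice: estimate a_i as MAD_i over the geometric mean of the MADs.
    geo_mean = math.exp(sum(math.log(v) for v in mads.values()) / len(mads))
    scale = {tip: mads[tip] / geo_mean for tip in groups}
    return {tip: [m / scale[tip] for m in ms] for tip, ms in groups.items()}

# Two hypothetical print-tip groups with the same centre but different spread.
demo = {
    "tip1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "tip2": [-4.0, -2.0, 0.0, 2.0, 4.0],
}
for tip, ms in scale_normalise(demo).items():
    print(tip, [round(m, 2) for m in ms])
```

The same function applies unchanged when the groups are slides rather than print-tip groups.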

20
Normalization (ctd): that same data set
Log-ratios
Print-tip groups
  • After print-tip location and scale normalization.
  • Incorporate quality measures.

21
Matt Callow's Srb1 dataset (5). Newton's and
Chen's single-slide methods
22
Matt Callow's Srb1
dataset (8). Newton's, Sapir & Churchill's and
Chen's single-slide methods
23
The approach of Roberts et al (Rosetta)
Genomic DNA vs. Genomic DNA
Data from Bing Ren
24
The second simplest cDNA microarray data analysis
problem is identifying differentially expressed
genes using replicated slides
  • There are a number of different aspects
  • First, between-slide normalization; then
  • What should we look at: averages, SDs,
    t-statistics, other summaries?
  • How should we look at them?
  • Can we make valid probability statements?
  • A report on work in progress

25
Normalization (ctd): yet another data set
  • Between slides this time (10 here)
  • Only small differences in spread apparent
  • We often see much greater differences

Log-ratios
Slides
26
The NCI 60 experiments (no bg)
27
Taking scale into account
  • Assumption: all slides have the same spread
    in M
  • True log ratio is μ_ij, where i represents
    different slides and j represents different
    spots.
  • Observed is M_ij, where
  • M_ij = a_i μ_ij
  • Robust estimate of a_i is based on
  • MAD_i = median_j { | y_ij - median_j(y_ij) | }

28
Which genes are (relatively) up/down regulated?
  • Two samples.
  • e.g. KO vs. WT or mutant vs. WT

[Design diagram: T vs. C, n slides]
For each gene form the t statistic
t = (average of n trt Ms) / sqrt(1/n (SD of n trt Ms)^2)
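A minimal sketch of this per-gene t statistic, assuming `ms` holds the n replicate M values for one gene. The function name and the numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(ms):
    """Average of the n Ms divided by sqrt(1/n * SD^2)."""
    n = len(ms)
    return mean(ms) / math.sqrt(stdev(ms) ** 2 / n)

# Hypothetical M values for one gene on n = 3 replicate slides.
print(round(one_sample_t([1.0, 2.0, 3.0]), 3))
```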
29
Which genes are (relatively) up/down regulated?
  • Two samples with a reference (e.g. pooled control)

[Design diagram: T vs. reference C (n slides) and C vs. reference C (n slides)]
  • For each gene form the t statistic
  • t = (average of n trt Ms - average of n ctl Ms) /
    sqrt(1/n ((SD of n trt Ms)^2 + (SD of n ctl Ms)^2))

30
One factor more than 2 samples
[Design diagram: T1, T2, T3 and T4, each vs. C, x 2 slides each]
  • Samples: liver tissue from mice treated with
    cholesterol-modifying drugs.
  • Question 1: find genes that respond differently
    between the treatment and the control.
  • Question 2: find genes that respond similarly
    across two or more treatments relative to control.

31
One factor more than 2 samples
[Design diagram: regions T1 to T6]
  • Samples: tissues from different regions of the
    mouse olfactory bulb.
  • Question 1: differences between different
    regions.
  • Question 2: identify genes with pre-specified
    patterns across regions.

32
Two or more factors
  • 6 different experiments at each time point.
  • Dyeswaps.
  • 4 time points (30 minutes, 1 hour, 4 hours, 24
    hours)
  • 2 x 2 x 4 factorial experiment.

[Design diagram: ctl, OSM, EGF, and OSM with EGF; x 4 times]
33
Which genes have changed? When permutation
testing is possible
  • 1. For each gene and each hybridisation (8 ko + 8
    ctl), use M = log2(R/G).
  • 2. For each gene form the t statistic
  • t = (average of 8 ko Ms - average of 8 ctl Ms) /
    sqrt(1/8 ((SD of 8 ko Ms)^2 + (SD of 8 ctl Ms)^2))
  • 3. Form a histogram of 6,000 t values.
  • 4. Do a normal Q-Q plot; look for values off the
    line.
  • 5. Permutation testing.
  • 6. Adjust for multiple testing.
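The steps above can be sketched as follows, on a toy scale. The slide does not say which multiple-testing adjustment is used; the single-step max-|t| adjustment below is an assumption chosen because it falls out of the same permutations. The data (3 genes, 4 + 4 slides) and all function names are hypothetical, and equal group sizes are assumed as in the formula.

```python
import itertools
import math
from statistics import mean, stdev

def two_sample_t(trt, ctl):
    """Two-sample t statistic for equal group sizes."""
    n = len(trt)
    return (mean(trt) - mean(ctl)) / math.sqrt((stdev(trt) ** 2 + stdev(ctl) ** 2) / n)

def permutation_pvalues(data, n_trt):
    """data: one list of M values per gene, the first n_trt from treated slides.

    Returns observed |t|, unadjusted p-values and max-|t| adjusted p-values,
    enumerating every relabelling of the slides.
    """
    n_total = len(data[0])
    observed = [abs(two_sample_t(row[:n_trt], row[n_trt:])) for row in data]
    raw_hits = [0] * len(data)
    adj_hits = [0] * len(data)
    n_perm = 0
    for trt_idx in itertools.combinations(range(n_total), n_trt):
        chosen = set(trt_idx)
        ctl_idx = [i for i in range(n_total) if i not in chosen]
        perm_t = [
            abs(two_sample_t([row[i] for i in trt_idx], [row[i] for i in ctl_idx]))
            for row in data
        ]
        max_t = max(perm_t)
        for g, t_obs in enumerate(observed):
            raw_hits[g] += perm_t[g] >= t_obs - 1e-12   # this gene only
            adj_hits[g] += max_t >= t_obs - 1e-12       # largest |t| over all genes
        n_perm += 1
    raw = [h / n_perm for h in raw_hits]
    adjusted = [h / n_perm for h in adj_hits]
    return observed, raw, adjusted

# Hypothetical data: gene 0 differs between groups, genes 1 and 2 do not.
toy = [
    [2.1, 1.9, 2.3, 2.0, 0.1, -0.2, 0.05, 0.15],
    [0.3, -0.1, 0.2, -0.4, 0.1, -0.3, 0.25, -0.15],
    [0.05, 0.4, -0.35, 0.12, -0.22, 0.31, -0.08, 0.17],
]
t_abs, p_raw, p_adj = permutation_pvalues(toy, n_trt=4)
for g in range(len(toy)):
    print(f"gene {g}: |t|={t_abs[g]:.2f} raw p={p_raw[g]:.3f} adjusted p={p_adj[g]:.3f}")
```

With 8 + 8 slides there are 12,870 relabellings, so full enumeration stays feasible. With very few replicates the smallest attainable p-value is too coarse to be useful, which is the situation the next slides address.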

34
Histogram and qq plot
ApoA1
35
Apo A1: adjusted and unadjusted p-values for the
50 genes with the largest absolute t-statistics.
36
Which genes have changed? Permutation testing not
possible
  • Our current approach is to use averages, SDs,
    t-statistics and a new statistic we call B,
    inspired by empirical Bayes.
  • We hope in due course to calibrate B and use that
    as our main tool.
  • We begin with the motivation, using data from a
    study in which each slide was replicated four
    times.

37
Results from 4 replicates
38
B = LOR compared
39
  • M
  • t
  • t & M

Results from the Apo AI ko experiment
40
  • M
  • t
  • t & M

Results from the Apo AI ko experiment
41
Empirical Bayes log posterior odds ratio
42
  • M
  • B
  • t
  • M & B
  • t & B
  • t & M & B

Results from SR-BI transgenic experiment
43
  • M
  • B
  • t
  • M & B
  • t & B
  • t & M & B

Results from SR-BI transgenic experiment
44
Extensions include dealing with
  • Replicates within and between slides
  • Several effects: use a linear model
  • ANOVA: are the effects equal?
  • Time series: selecting genes for trends

45
Rosetta once more: In vivo Binding Sites of
Gal4p in Galactose
P < 0.001
Un-enriched DNA (Cy3)
antibody-enriched DNA (Cy5)
46
Summary (for the second simplest problem)
  • Microarray experiments typically have thousands
    of genes, but only a few (1-10) replicates for each
    gene.
  • Averages can be driven by outliers.
  • Ts can be driven by tiny variances.
  • B = LOR will, we hope
  • use information from all the genes
  • combine the best of M. and T
  • avoid the problems of M. and T
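A toy illustration of the two failure modes named above, using made-up M values for three hypothetical genes with four replicates each. It does not compute B; it only shows why ranking by the average alone or by t alone can mislead.

```python
import math
from statistics import mean, stdev

def t_stat(ms):
    """Average M divided by its standard error."""
    return mean(ms) / (stdev(ms) / math.sqrt(len(ms)))

# Hypothetical replicate M values.
genes = {
    "outlier":       [0.1, -0.1, 0.0, 10.0],     # one wild value inflates the average
    "tiny_variance": [0.20, 0.21, 0.19, 0.20],   # small change, almost no spread
    "consistent":    [2.0, 2.4, 1.8, 2.2],       # large and reproducible change
}

for name, ms in genes.items():
    print(f"{name:14s} average={mean(ms):5.2f}  t={t_stat(ms):6.2f}")

top_by_average = max(genes, key=lambda g: abs(mean(genes[g])))
top_by_t = max(genes, key=lambda g: abs(t_stat(genes[g])))
print("top by average:", top_by_average)
print("top by t:", top_by_t)
```

Neither ranking puts the consistently changed gene first, which is the motivation for a statistic that borrows information across all genes.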

47
Acknowledgments
  • UCB/WEHI
  • Yee Hwa Yang
  • Sandrine Dudoit
  • Ingrid Lönnstedt
  • Natalie Thorne
  • David Freedman
  • CSIRO Image Analysis Group
  • Michael Buckley
  • Ryan Lagerstorm
  • Ngai lab, UCB
  • Goodman lab, UCB
  • Peter Mac CI, Melb.
  • Ernest Gallo CRC
  • Brown-Botstein lab
  • Matt Callow (LBNL)
  • Bing Ren (WI)

48
  • Some web sites
  • Technical reports, talks, software etc.
  • http://www.stat.berkeley.edu/users/terry/zarray/Html/
  • Statistical software R (GNU's S):
    http://lib.stat.cmu.edu/R/CRAN/
  • Packages within R environment
  • -- Spot: http://www.cmis.csiro.au/iap/spot.htm
  • -- SMA (statistics for microarray analysis):
    http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html

49
Factorial Design
[Design diagram: samples A1, P01, P04 and A4 joined by slides 1 to 5, with the Age Effect and Zone Effect directions marked]
50
Factorial design
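[Design diagram: samples A1, P01, P04 and A4 joined by slides 1 to 5, labelled μ, μ + a, μ + z and μ + z + a + za]
Different ways of estimating parameters, e.g. the Z effect:
1 = (μ + z) - (μ) = z
2 - 5 = ((μ + a) - (μ)) - ((μ + a) - (μ + z)) = (a) - (a - z) = z
4, 3 and 5 combine to give z as well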
How do we combine the information?
51
Regression analysis
Define a matrix X so that E(M) = Xβ. Use least
squares estimates for z, a, za
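A sketch of this regression step in plain Python, solving the normal equations for β = (z, a, za). The rows for slides 1, 2 and 5 follow the contrasts in the Z-effect example; the rows for slides 3 and 4 are assumptions, since the transcript does not preserve which samples those slides compare. Function names and the M values are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def least_squares(x_rows, m):
    """Least squares estimate of beta from the normal equations X'X beta = X'M."""
    p = len(x_rows[0])
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(p)] for i in range(p)]
    xtm = [sum(row[i] * mi for row, mi in zip(x_rows, m)) for i in range(p)]
    return solve_linear(xtx, xtm)

# Columns are (z, a, za); one row per slide.
X = [
    [1, 0, 0],    # slide 1: (mu + z) - mu = z
    [0, 1, 0],    # slide 2: (mu + a) - mu = a
    [0, 1, 1],    # slide 3: assumed to measure a + za
    [1, 0, 1],    # slide 4: assumed to measure z + za
    [-1, 1, 0],   # slide 5: (mu + a) - (mu + z) = a - z
]

# Hypothetical observed log-ratios for one gene on the five slides.
M = [1.1, -0.4, -0.3, 1.2, -1.6]
z_hat, a_hat, za_hat = least_squares(X, M)
print(f"z={z_hat:.3f} a={a_hat:.3f} za={za_hat:.3f}")
```

Least squares answers the question of how to combine the information: each direct and indirect route to z contributes to a single estimate, weighted by the design.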
52
Looking at effect of Z: log(zone 4 / zone 1)
gene A
gene B
53
Estimate
Z effect
  • estimate
  • t = estimate / SE
  • estimate & t

Log2(SE)
54
Zone, Age, Zone x Age
55
Top 50 genes from each effect
[Venn diagram of the Zone, Age and Zone x Age interaction gene lists; region counts as printed: 19, 0, 48, 29, 2, 0, 19]
56
  • T
  • B
  • t & M & B
  • t & B

58
  • M
  • t
  • t & M

59
  • M
  • B
  • t
  • M & B
  • t & B
  • t & M & B

60
Some statistical questions
  • Image analysis: addressing, segmenting,
    quantifying
  • Normalisation: within and between slides
  • Quality: of images, of spots, of (log) ratios
  • Which genes are (relatively) up/down regulated?
  • Assigning p-values to tests/confidence to
    results.

61
Some statistical questions, ctd
  • Planning of experiments: design, sample size
  • Discrimination and allocation of samples
  • Clustering, classification: of samples, of genes
  • Selection of genes relevant to any given analysis
  • Analysis of time course, factorial and other
    special experiments, and much more

62
The NCI 60 experiments (bg)