Title: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data
1Exploration, Normalization, and Summaries of High
Density Oligonucleotide Array Probe Level Data
- Rafael A. Irizarry
- Department of Biostatistics, JHU
- (joint work with Leslie Cope, Ben Bolstad, Francois Collin, Bridget Hobbs, and Terry Speed)
- http://biosun01.biostat.jhsph.edu/ririzarr
2Summary
- Review of technology
- Probe level summaries
- Normalization
- Assess technology and expression measures
- Conclusion/future work
3Probe Arrays
Hybridized Probe Cell
GeneChip Probe Array
Single stranded, labeled RNA target
Oligonucleotide probe
24µm
Millions of copies of a specific oligonucleotide probe
1.28cm
>200,000 different complementary probes
Image of Hybridized Probe Array
Compliments of D. Gerhold
4PM MM
5Data and Notation
- $PM_{ijn}$, $MM_{ijn}$ = intensity for the perfect-match/mismatch probe cell $j$, in chip $i$, in gene $n$
- $i = 1, \dots, I$ (ranging from 1 to hundreds)
- $j = 1, \dots, J$ (usually 16 or 20)
- $n = 1, \dots, N$ (between 8,000 and 12,000)
6The Big Picture
- Summarize 20 PM,MM pairs (probe level data) into one number for each gene
- We call this number an expression measure
- Affymetrix GeneChips Software has defaults.
- Does it work? Can it be improved?
7What is the evidence?
- Lockhart et al., Nature Biotechnology 14 (1996)
8Competing Measures of Expression
- GeneChip software uses Avg.diff:
  $\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$
- with $A$ a set of suitable pairs chosen by the software.
- A log-ratio version is also used.
- For differential expression, Avg.diffs are compared between chips.
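A minimal sketch of the Avg.diff computation for one probe set on one chip, assuming it is the plain average of PM minus MM over the pairs in A. The function name `avg_diff` and the `keep` mask are hypothetical. How the software actually chooses A is not reproduced here, so the caller supplies the mask.

```python
import numpy as np

def avg_diff(pm, mm, keep=None):
    """Average of PM - MM over the probe pairs flagged in `keep` (the set A)."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    if keep is None:
        # Assumption: with no mask, every probe pair belongs to A.
        keep = np.ones(pm.shape, dtype=bool)
    keep = np.asarray(keep, dtype=bool)
    return float(np.mean(pm[keep] - mm[keep]))

# Made-up intensities for a 4-pair probe set (illustration only).
pm = [120.0, 340.0, 95.0, 410.0]
mm = [100.0, 150.0, 105.0, 200.0]
print(avg_diff(pm, mm))                                   # all pairs
print(avg_diff(pm, mm, keep=[True, True, False, True]))   # third pair excluded
```

Note that a pair with MM larger than PM contributes a negative difference, so Avg.diff itself can be negative.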
9Competing Measures of Expression
- The new version of the GeneChip software uses something else:
  $\text{signal} = \text{Tukey Biweight}\{\log(PM_j - MM_j^*)\}$
- with $MM^*$ a version of MM that is never bigger than PM.
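A rough sketch of a robust average of log(PM - MM*), where MM* stands for the adjusted mismatch value that never exceeds PM. It uses a one-step Tukey biweight. The tuning constants `c=5` and `eps=1e-4`, the base-2 log, and the function names are assumptions made for illustration. The rule that produces MM* is not reproduced, so MM* is taken as an input.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))           # median absolute deviation
    u = (x - center) / (c * mad + eps)            # scaled distance from the center
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))

def signal_log2(pm, mm_star):
    """Robust average of log2(PM - MM*); requires PM > MM* for every pair."""
    pm = np.asarray(pm, dtype=float)
    mm_star = np.asarray(mm_star, dtype=float)
    return tukey_biweight(np.log2(pm - mm_star))

# Made-up values: the last probe is an outlier and receives zero weight.
print(tukey_biweight([1.0, 2.0, 3.0, 4.0, 100.0]))
print(signal_log2([200.0, 400.0, 800.0], [72.0, 144.0, 288.0]))
```

Because at least half of the values lie within one MAD of the median, some weights are always positive and the division is safe.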
10Competing Measures of Expression
- Li and Wong fit a model: $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$
- Consider $\theta_i$ the expression in chip $i$
- Efron et al. consider $\log PM - 0.5 \log MM$
- Another is the second largest PM
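To make the model-based measure concrete, here is a simplified sketch that assumes the multiplicative form PM - MM = theta_i * phi_j + error, with theta_i the expression on chip i and phi_j a probe effect. It fits the two factors by alternating least squares under the constraint that the squared probe effects sum to J. Outlier handling and standard errors are left out, and the function name `fit_rank_one` is hypothetical.

```python
import numpy as np

def fit_rank_one(y, n_iter=50):
    """Fit y[i, j] ~ theta[i] * phi[j] by alternating least squares.

    y: chips (I) x probes (J) matrix of PM - MM differences for one probe set.
    Returns (theta, phi) with sum(phi**2) == J.
    """
    y = np.asarray(y, dtype=float)
    n_probes = y.shape[1]
    phi = np.ones(n_probes)                        # start with equal probe effects
    for _ in range(n_iter):
        theta = y @ phi / (phi @ phi)              # best theta for the current phi
        phi = theta @ y / (theta @ theta)          # best phi for the current theta
        phi *= np.sqrt(n_probes / (phi @ phi))     # enforce sum(phi**2) == J
    theta = y @ phi / (phi @ phi)                  # final theta for the scaled phi
    return theta, phi

# Noise-free toy data built from known factors (illustration only).
phi_true = np.array([0.5, 1.0, 1.5])
phi_true *= np.sqrt(3 / (phi_true @ phi_true))
theta_true = np.array([100.0, 200.0, 400.0, 50.0])
theta_hat, phi_hat = fit_rank_one(np.outer(theta_true, phi_true))
print(np.round(theta_hat, 3), np.round(phi_hat, 3))
```

On noise-free data the fit recovers the generating factors. With real data the estimates of theta_i would be compared across chips.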
11Competing Measures of Expression
- Why not stick to what has worked for cDNA?
- with A a set of suitable pairs.
12Features of Probe Level Data
13SD vs. Avg
14ANOVA: Strong probe effect, 5 times bigger than gene effect
15Normalization at Probe Level
16Spike-In Experiments
- Set A: 11 control cRNAs were spiked in, all at the same concentration, which varied across chips.
- Set B: 11 control cRNAs were spiked in, all at different concentrations, which varied across chips. The concentrations were arranged in a 12x12 cyclic Latin square (with 3 replicates).
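As an illustration of the Set B design, a cyclic Latin square can be built by shifting a list of levels one position per row, so that each level appears exactly once in every row and every column. The level labels below are hypothetical placeholders, not the concentrations used in the experiment, and the assignment of rows and columns to chips and transcripts is an assumption.

```python
def cyclic_latin_square(levels):
    """Return an n x n square where row r is `levels` shifted left by r positions."""
    n = len(levels)
    return [[levels[(row + col) % n] for col in range(n)] for row in range(n)]

# Hypothetical labels standing in for 12 concentration levels.
levels = [f"level_{k}" for k in range(12)]
square = cyclic_latin_square(levels)

# Assumed reading: rows index experimental conditions, columns index positions.
for row in square[:3]:
    print(row)
```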
17Set A Probe Level Data
18What Did We Learn?
- Don't subtract or divide by MM
- Probe effect is additive on log scale
- Take logs
19Why Remove Background?
20Background Distribution
21RMA
- Background correct PM
- Normalize (quantile normalization)
- Assume additive model
- Estimate $a_i$ using a robust method
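The last three steps above can be sketched as follows, assuming the PM values are already background corrected and the additive model is log2(PM) = chip effect + probe effect + error. Median polish is used here as one possible robust method, ties in quantile normalization are handled naively, and both function names are hypothetical.

```python
import numpy as np

def quantile_normalize(x):
    """x: probes x chips. Give every chip (column) the same empirical distribution."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                  # per-chip ordering of the probes
    reference = np.sort(x, axis=0).mean(axis=1)    # mean of each quantile across chips
    out = np.empty_like(x)
    for chip in range(x.shape[1]):
        out[order[:, chip], chip] = reference      # k-th smallest gets k-th reference value
    return out

def median_polish_chip_effects(log_pm, n_iter=10):
    """log_pm: probes x chips for ONE probe set. Returns one expression value per chip."""
    resid = np.array(log_pm, dtype=float)
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    chip = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)         # sweep out probe (row) medians
        resid -= row_med[:, None]
        probe += row_med
        shift = np.median(chip)
        chip -= shift
        overall += shift
        col_med = np.median(resid, axis=0)         # sweep out chip (column) medians
        resid -= col_med[None, :]
        chip += col_med
        shift = np.median(probe)
        probe -= shift
        overall += shift
    return overall + chip

# Toy example: 3 probes x 3 chips on the log2 scale, exactly additive.
probe_effect = np.array([-1.0, 0.0, 1.0])
chip_effect = np.array([0.0, 1.0, 3.0])
log_pm = 8.0 + probe_effect[:, None] + chip_effect[None, :]
print(median_polish_chip_effects(log_pm))
print(quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]]))
```

In practice normalization would run once over all probes on all chips, and the polish would then run separately on each probe set.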
22Spike-In B
Probe Set   Conc 1   Conc 2   Rank
BioB-5      100      0.5      1
BioB-3      0.5      25.0     2
BioC-5      2.0      75.0     4
BioB-M      1.0      37.5     4
BioDn-3     1.5      50.0     5
DapX-3      35.7     3.0      6
CreX-3      50.0     5.0      7
CreX-5      12.5     2.0      8
BioC-3      25.0     100      9
DapX-5      5.0      1.5      10
DapX-M      3.0      1.0      11
Later we consider 23 different combinations of
concentrations
23Differential Expression
24Differential Expression
25Differential Expression
26Differential Expression
27Observed Ranks
Gene      AvDiff   MAS 5.0   LiWong   AvLog(PM-BG)
BioB-5    6        2         1        1
BioB-3    16       1         3        2
BioC-5    74       6         2        5
BioB-M    30       3         7        3
BioDn-3   44       5         6        4
DapX-3    239      24        24       7
CreX-3    333      73        36       9
CreX-5    3276     33        3128     8
BioC-3    2709     8572      681      6431
DapX-5    2709     102       12203    10
DapX-M    165      19        13       6
Top 15    1        5         6        10
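The "Top 15" row can be reproduced from the ranks in the table, reading it as the number of spiked-in probe sets that each measure places among its 15 highest-ranked genes. The snippet below copies the ranks from the table. The dictionary layout is only for illustration.

```python
# Observed ranks of the 11 spiked-in probe sets, copied from the table above
# (order: BioB-5, BioB-3, BioC-5, BioB-M, BioDn-3, DapX-3, CreX-3, CreX-5,
#  BioC-3, DapX-5, DapX-M).
ranks = {
    "AvDiff":       [6, 16, 74, 30, 44, 239, 333, 3276, 2709, 2709, 165],
    "MAS 5.0":      [2, 1, 6, 3, 5, 24, 73, 33, 8572, 102, 19],
    "LiWong":       [1, 3, 2, 7, 6, 24, 36, 3128, 681, 12203, 13],
    "AvLog(PM-BG)": [1, 2, 5, 3, 4, 7, 9, 8, 6431, 10, 6],
}

# Count how many spiked-in probe sets each measure ranks at 15 or better.
top15 = {measure: sum(r <= 15 for r in rs) for measure, rs in ranks.items()}
print(top15)
# {'AvDiff': 1, 'MAS 5.0': 5, 'LiWong': 6, 'AvLog(PM-BG)': 10}
```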
28Observed vs True Ratio
29Dilution Experiment
- cRNA hybridized to human chip (HGU95) in range of proportions and dilutions
- Dilution series begins at 1.25 µg cRNA per GeneChip array, and rises through 2.5, 5.0, 7.5, 10.0, to 20.0 µg per array. 5 replicate chips were used at each dilution
- Normalize just within each set of 5 replicates
- For each probe set compute expression, average and SD over replicates
30Dilution Experiment Data
31Expression
32SD
33Log Scale SD
34Model check
- Compute observed SD of 5 replicate expression estimates
- Compute RMS of 5 nominal SDs
- Compare by taking the log ratio
- Closeness of observed and nominal SD taken as a measure of goodness of fit of the model
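The check above can be sketched for a single probe set as follows. The function name `log_sd_ratio` is hypothetical, and the base-2 log and the sample SD with `ddof=1` are assumptions. The nominal SDs are whatever standard errors the expression model reports for the 5 replicates.

```python
import numpy as np

def log_sd_ratio(estimates, nominal_sds):
    """log2 of (observed SD across replicates) / (RMS of the model's nominal SDs)."""
    observed = np.std(np.asarray(estimates, dtype=float), ddof=1)
    nominal = np.sqrt(np.mean(np.square(np.asarray(nominal_sds, dtype=float))))
    return float(np.log2(observed / nominal))

# Made-up numbers: 5 replicate expression estimates and their model-based SDs.
estimates = [8.0, 9.0, 10.0, 11.0, 12.0]
nominal_sds = [np.sqrt(2.5)] * 5
print(log_sd_ratio(estimates, nominal_sds))  # close to 0 when the two agree
```

A value near zero means the model's nominal SD matches the replicate variability. Positive values mean the model understates the variability.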
35Observed vs. Model SE
36Conclusion
- Take logs
- PMs need to be normalized
- Using global background improves on use of probe-specific MM
- Gene Logic spike-in and dilution study show technology works well
- RMA is arguably the best summary in terms of bias, variance and model fit
- Future: What statistic should we use to rank?
37Acknowledgements
- Gene Brown's group at Wyeth/Genetics Institute, and Uwe Scherf's Genomics Research Development Group at Gene Logic, for generating the spike-in and dilution data
- Gene Logic for permission to use these data
- Magnus Åstrand (Astra Zeneca Mölndal)
- Skip Garcia, Tom Cappola, and Joshua Hare (JHU)