Title: Gene expression arrays: from the bench to data management and analysis.
1Gene expression arrays: from the bench to data
management and analysis.
- Elisabetta Manduchi
- Center for Bioinformatics
- University of Pennsylvania
- manduchi_at_pcbi.upenn.edu
2Outline
- Generating data with gene expression arrays.
- Image analysis and data pre-processing.
- Zoomed-out overview of analysis methods grouped
according to questions of interest.
- Data management issues: collection, exchange, and
storage.
3Different array platforms
nylon filter array
short oligonucleotide array
two-channel microarray
4Generalities
- Exploits complementary base-pairing. The steps
are:
- Array manufacturing
- Sample preparation and RNA extraction
- Labeling
- Hybridization of probe to target
- Scanning
- Image quantification
- Data pre-processing
- Data analysis
5Filter Arrays
- cDNA clones or oligos (e.g. 70-mers) are spotted
on a nylon filter.
- cDNA from a given sample is radioactively
labeled.
- Limitations
- Cross-hybridization (sequences with high sequence
identity, Alu repeats, etc.).
- Hard to distinguish the transcripts generated by
alternative splicing.
- Distortion.
- Several sources of bias and noise
- variation in spot size, shape, and concentration
- variation in PCR reaction efficiency
- variation in labeled nucleotide incorporation.
6Two-channel Microarrays
- Preparing the array
- Amplified cDNA clones or oligos (e.g. 70-mers)
are printed onto a glass microscope slide.
- The array is processed by chemical and heat
treatment to attach the DNA sequences to the
glass surface and denature them.
7Building the chip
Ngai Lab arrayer, UC Berkeley
Print-tip head
(Slide kindly provided by T. Speed)
8Glass Slide: array of bound cDNA probes. In this
case 4x4 blocks = 16 print-tip groups
(Slide kindly provided by T. Speed)
9Two-channel Microarrays (cont.)
- Preparing the RNA sources: two samples are
analyzed simultaneously. For each of them:
- polyA mRNA is prepared and reverse transcribed
with incorporation of a fluorescent label
(usually Cy3, green, for one sample and Cy5, red,
for the other)
- A variety of labeling methods are currently
available (e.g. direct labeling, indirect
labeling, dendrimers)
- the RNA is then degraded.
10Two-channel Microarrays (cont.)
- Hybridization: the labeled cDNAs are
competitively hybridized to the array.
- Scanning: utilizes a laser fluorescent scanning
procedure (sequential excitation of the
fluorophores). Emitted light is split according
to wavelength and detected.
- Quantification: signals are then quantified
separately, and the ratio of the two channels for
each spot is also reported.
11A two-channel microarray experiment
Figure from David J. Duggan et al. (1999)
Expression Profiling using cDNA microarrays.
Nature Genetics 21: 10-14
12Two-channel Microarrays: limitations
- Some of the limitations, which are also common to
filter arrays:
- A large number of cDNA or PCR products must be
prepared, purified, quantified, catalogued, and
spotted onto a solid support.
- If the cDNAs are derived from a cDNA library, low
abundance cDNAs are unlikely to be spotted and
the library must be normalized to reduce the
redundant spotting of cDNAs from highly expressed
genes.
- Cross-hybridization.
- Alternative splicing hard to detect.
13Short Oligonucleotide Arrays
- Preparing the array
- covalently attached oligonucleotides chemically
synthesized directly on a solid substrate
- for each mRNA being monitored, a collection
(probe set) of probe pairs (16 to 20) is
synthesized on the array
- each probe pair consists of two probe cells: one
containing (millions of) copies of a given short
oligo (e.g. 25-mer) that is a perfect match (PM)
to a subsequence of the mRNA in question and the
other containing copies of a companion mismatch (MM)
short oligo that has a single base difference in a
central position.
- Preparing the RNA source
- polyA RNA is converted to cDNA
- cDNA is transcribed in vitro in the presence of
fluorescently labeled (biotin or fluorescein)
ribonucleotides, giving rise to labeled RNA
- RNA is then fragmented with heat (fragment
average size of 50 to 100 bp).
- Hybridization occurs in a flow-cell. A brief
washing step follows to remove un-hybridized RNA.
In some cases only PM are used
14Short Oligonucleotide Arrays: quantification
- MAS 4.0 and MAS 5.0
- (1) An intensity is computed for each cell (3rd
quartile of the pixel intensity distribution in that cell,
after excluding bordering pixels)
- (2) Background values are computed (after dividing
the array into sectors) and subtracted from cell
intensities
- (3) A presence/absence call is made on each probe set
- (4) Probe set intensities are computed based on the
background-subtracted intensities of the cells in
the probe set
15MAS 4.0 vs MAS 5.0
- Differ in how (2)-(4) above are calculated
- For (4)
- MAS 4.0 uses the Average Difference (AD) method,
that is, the average of d = PM - MM over the subset of
probes for which d_j = PM_j - MM_j is within 3 SDs
of the average of d_(2), ..., d_(J-1), where d_(j) is
the j-th smallest difference and J is the number of
probe pairs in the probe set. This is called the
Super-Olympic-Scoring (SOS) method.
- MAS 5.0 uses the Tukey biweight (a MAD-weighted
mean) of log2(PM)-log2(stray)
- stray signals are typically estimated using the
MM values, but anomalous MM values are handled
with imputation
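The Average Difference rule described above can be sketched in a few lines. This is only an illustration of the trimming logic as stated on the slide, not the MAS 4.0 implementation: the function name, the use of the sample standard deviation, and the fallback for very small probe sets are assumptions.

```python
import statistics

def average_difference(pm, mm, n_sd=3.0):
    """Average of PM - MM over probe pairs that survive the SOS trimming.

    Assumptions (not from the slides): sample SD is used, and probe sets
    with fewer than 4 pairs are averaged without trimming.
    """
    d = [p - m for p, m in zip(pm, mm)]
    if len(d) < 4:
        return statistics.mean(d)
    # Drop the smallest and largest difference: d_(2), ..., d_(J-1).
    trimmed = sorted(d)[1:-1]
    center = statistics.mean(trimmed)
    spread = statistics.stdev(trimmed)
    # Keep only the pairs whose difference lies within n_sd SDs of the center.
    kept = [x for x in d if abs(x - center) <= n_sd * spread]
    return statistics.mean(kept)
```

For example, with differences 100, 110, 120, 130, 140 and one outlying pair at 990, the outlier falls outside the 3 SD window and the remaining five pairs are averaged.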
16Short Oligonucleotide Arrays: other low-level
analysis work
- See http://www.stat.Berkeley.EDU/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html
for a workshop held in Nov 2001.
- This includes work by Li and Wong (MBEI) and by
Irizarry et al. (RMA) for low-level analyses of
short oligo arrays.
- Match-Only Integral Distribution (MOID) (Zhou
and Abagyan, BMC Bioinformatics, 2002, 3:3)
17Image analysis (for spotted arrays)
- Gridding: in order to extract spot intensities it
is necessary to accurately identify the location
of each of the spots.
- Segmentation: it is necessary to identify, within
each such location, which pixels correspond to
probe hybridized to target.
- Intensity extraction: after detecting location,
size, and shape of each spot, one needs to
calculate the signal (foreground) and the
background intensities as well as quality
measures at each spot.
18Image analysis (cont.)
Figures from http://www.nhgri.nih.gov/DIR/Microarray/image_analysis.html
19Image analysis (cont.)
- There are different public and commercial
software packages for image analysis, using
different algorithms for the 3 steps involved and
requiring/allowing different degrees of manual
intervention. Moreover, different software might
give a more or less copious output in terms of
quality measures.
- For the segmentation step, the following
possibilities might be available:
- fixed circle
- adaptive
- histogram
- For intensity extraction there are also various
possibilities:
- Foreground: sum, mean, median, mode, etc. of
pixel intensities
- Background: none, global, local, morphological
opening
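As one concrete combination of the options above (fixed-circle segmentation, median foreground, local median background), a minimal sketch might look like the following. The function name, the `bg_margin` parameter, and the choice of an annulus around the spot as the local background region are assumptions for illustration; real packages define these regions in their own ways.

```python
import statistics

def fixed_circle_spot(image, center_row, center_col, radius, bg_margin=2):
    """Return (foreground, background, corrected signal) for one spot.

    image is a list of rows of pixel intensities. Pixels within `radius`
    of the center are foreground; pixels in the surrounding ring of width
    `bg_margin` (a hypothetical choice) are local background.
    """
    fg, bg = [], []
    outer = radius + bg_margin
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            dist2 = (r - center_row) ** 2 + (c - center_col) ** 2
            if dist2 <= radius ** 2:
                fg.append(value)
            elif dist2 <= outer ** 2:
                bg.append(value)
    foreground = statistics.median(fg)
    background = statistics.median(bg)
    return foreground, background, foreground - background
```

Using the median rather than the mean makes both estimates less sensitive to a few saturated or dust-contaminated pixels.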
20Local background
---- GenePix QuantArray ScanAnalyze
(Slide kindly provided by T. Speed)
21Morphological non-linear filter on background
pixel signal (Spot software)
Measures overall baseline background level.
(Slide kindly provided by T. Speed)
22Quality measures
- Spot
- One channel, R or G
- Signal/noise ratio
- Variation in pixel intensities
- Identification of bad spots (no signal), etc.
- Two channels, R/G
- Circularity, etc.
- Array
- Percentage of spots with no signal
- Distribution of spot signal area, etc.
(Slide kindly provided by T. Speed)
23Normalization: motivation
- Need to identify and remove systematic sources
of variation in the measured intensities, due to
one or more of:
- Different labeling efficiency of the dyes
- Separate reverse transcription and labeling
- Different scanning parameters
- Print-tip-group differences
- Spatial effects, e.g. due to the placement of the
cover slip
- Plate effects
- Necessary for within and between slides
comparisons of expression levels.
(Slide kindly provided by T. Speed)
24Normalization methods
- Use a constant factor for all spots on the array,
where the constant is obtained from a given set
of spots on the array, e.g.
- c/(average intensity) or c/(median intensity),
for a given constant c: multiplicative.
- mean or median log(ratio) in 2-channel
microarrays: additive (log scale)
- Use appropriate functions for normalizing (log)
ratios (R/G) in 2-channel microarrays
- intensity-dependent normalization: the
normalization factor depends on the overall
intensity of the spot, not just on the array
- intensity-and-print-tip-dependent normalization:
the normalization factor also depends on the
print-tip group
- scale normalization (within and between slides)
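The simplest of these methods, the constant (global) normalization on the log scale, can be sketched as below. It subtracts the median log2(R/G) of the array from every spot, so that the typical log ratio becomes zero. The function name is illustrative, and using all spots (rather than a chosen subset) to compute the constant is an assumption.

```python
import math
import statistics

def median_center_log_ratios(red, green):
    """Global normalization: shift log2(R/G) so that its median is 0."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(log_ratios)
    return [m - shift for m in log_ratios]
```

Subtracting a constant on the log scale is equivalent to dividing all raw ratios by a single factor, which is why the slide calls this form additive (log scale).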
25- Approaches in the same spirit for short
oligonucleotide arrays
- Li and Wong
- Astrand
- Quantile normalization for short oligonucleotide
arrays.
- The Bioconductor project (R packages for gene
expression analysis) contains implementations of
some of the above methods
- http://www.bioconductor.org
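To make the idea of quantile normalization concrete, here is a minimal sketch in plain Python (it is not the Bioconductor implementation). Each array is forced to share the same empirical distribution, namely the mean of the sorted intensities across arrays. The function name is illustrative, and ties are broken by position, which is a simplifying assumption.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equally long intensity lists."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest values across arrays.
    reference = [
        sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)
    ]
    result = []
    for a in arrays:
        # order[rank] is the index of the probe holding that rank in array a.
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        result.append(normalized)
    return result
```

After this step every array has identical quantiles, while the rank order of probes within each array is preserved.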
26MA plots
log2R vs. log2G
M vs. A
M = log2R - log2G, A = (log2R + log2G)/2
(Slide kindly provided by T. Speed)
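The M and A values used in these plots follow directly from the two channel intensities. A short sketch, with an illustrative function name:

```python
import math

def ma_values(red, green):
    """Return (M, A) lists: M = log2 R - log2 G, A = (log2 R + log2 G) / 2."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a
```

M is the log ratio of the two channels and A is the average log intensity, so an MA plot is the log2R vs. log2G scatter plot rotated by 45 degrees and rescaled.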
27Normalization - lowess
- Assumption: changes are roughly symmetric at all
intensities, or few genes change.
(Slide kindly provided by T. Speed)
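The intensity-dependent correction can be sketched as: fit a smooth curve of M against A, then subtract the fitted value from each M. The code below is a simplified stand-in for lowess, a single-pass local linear fit with tricube weights and no robustness iterations. The function names and the default `frac` are assumptions, and a real analysis would use an established lowess/loess routine.

```python
def local_linear_fit(a, m, frac=0.4):
    """Tricube-weighted local linear fit of m on a, evaluated at each a."""
    n = len(a)
    k = min(n, max(3, int(round(frac * n))))  # neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th neighbour
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, a)) / sw
        mean_y = sum(wi * yi for wi, yi in zip(w, m)) / sw
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, a))
        sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
                  for wi, xi, yi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted

def intensity_dependent_normalize(a, m, frac=0.4):
    """Subtract the fitted intensity-dependent trend from each M value."""
    return [mi - fi for mi, fi in zip(m, local_linear_fit(a, m, frac))]
```

The assumption stated on the slide matters here: if many genes change in one direction at some intensity, the fitted curve absorbs real signal and the correction removes it.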
28Normalization - print-tip-group
Assumption: for every print-tip-group, changes are
roughly symmetric at all intensities, or few genes
change.
(Slide kindly provided by T. Speed)
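The grouping step of print-tip-group normalization can be illustrated with a deliberately simplified correction: here each print-tip group is centered on its own median M, whereas the method on the slide fits an intensity-dependent curve within each group. The function name and the use of a median in place of the per-group curve are assumptions made to keep the sketch short.

```python
import statistics
from collections import defaultdict

def center_by_print_tip(m_values, tip_groups):
    """Subtract from each M the median M of its print-tip group."""
    grouped = defaultdict(list)
    for m, group in zip(m_values, tip_groups):
        grouped[group].append(m)
    medians = {g: statistics.median(v) for g, v in grouped.items()}
    return [m - medians[g] for m, g in zip(m_values, tip_groups)]
```

Replacing the median with a curve fitted to (A, M) within each group gives the intensity-and-print-tip-dependent form.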
29MA plot - after print-tip-group normalization
(Slide kindly provided by T. Speed)
30- Which genes should be used to compute the
normalization function? It depends.
- All genes on the array.
- Constantly expressed genes (housekeeping).
- Controls
- Spiked controls
- Genomic DNA or Microarray Sample Pool (MSP)
titration series
- Rank invariant set
- Every normalization method relies on the
samples and arrays at hand satisfying certain
assumptions. Thus, to judge what is the most
appropriate normalization for a given dataset, it
is important to ascertain which of the necessary
assumptions are satisfied.
31The promises of this technology
- Gives a snapshot of the mRNA abundance for
thousands of transcripts
- Can be used to address many interesting
biological questions
- Which genes are expressed in a given sample?
- Which genes are differentially expressed between
two sample types?
- Which genes are co-expressed (possibly
co-regulated or sharing similar functions)?
- Class discovery
- Class prediction based on molecular fingerprints
- Gene networks
32Down-stream Analyses
- Differential Expression: which genes are
differentially expressed between two conditions
of interest?
- This will be discussed in detail in the next
talk.
- Experimental design issues
- Need replicates (experimental and biological) to
assess variability within sample type
- In the case of 2-channel microarray experiments,
possible experimental designs are:
[Figure: two design diagrams for samples A, B, and C, labeled (reference design) and (direct comparison), with hybridizations annotated with a replicate count n.]
33- Class discovery (unsupervised clustering)
- Group (i) the experiments or (ii) the genes by
similarity of their profiles. Motivation:
- (i) determine a molecular classification of
samples (e.g. subtypes of tumors which are
morphologically indistinguishable)
- (ii) determine groups of genes which are
candidates for co-expression and possibly
co-regulation.
- (ii) data reduction
- Discussed in previous lectures.
34- Class prediction: given M known classes, a
learning set L for a predictor consists of
observations which are known to belong to certain
classes: class 1, class 2, ..., class M.
- Predictors are usually built from learning
sets and are used to predict the class of each
novel observation.
- See http://www.stat.berkeley.edu/users/terry/zarray/Html/discr.html
for an overview and comparison
of the methods listed below.
- Fisher linear discriminant analysis (FLDA).
- Maximum likelihood (ML) discriminant rule.
- Nearest neighbor classifiers.
- Classification trees.
- Aggregate predictors.
35Measuring the error rate of a predictor
- Use a test set, if available: apply the predictor
to entities whose class is known.
- Repeat random divisions of a given dataset into
learning set and test set, say N times, then
compute the average error rate.
- Use cross-validation: for i = 1, 2, ..., s (s being
the size of the learning set)
- remove the i-th element from the learning set
- build a predictor using the remaining elements in
the learning set
- apply this predictor to the removed element.
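The cross-validation loop above (leave-one-out) can be written generically, with the predictor-building step passed in as a function. Everything named here is illustrative: `build_predictor` is any function that takes a learning set and returns a classifier, and `nearest_mean_builder` is a toy one-dimensional predictor included only so the sketch runs.

```python
def leave_one_out_error(samples, labels, build_predictor):
    """Fraction of elements misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(samples)):
        # Remove the i-th element from the learning set.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Build a predictor on the rest and apply it to the removed element.
        predict = build_predictor(train_x, train_y)
        if predict(samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

def nearest_mean_builder(train_x, train_y):
    """Toy predictor (hypothetical): assign x to the class with nearest mean."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))
```

Any step that uses the class labels, such as selecting genes, belongs inside `build_predictor`; otherwise the held-out element has already influenced the predictor and the error estimate is too optimistic.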
36FLDA
- Assign a new item to the class which minimizes
the sum of the squared distances between
appropriate linear combinations of the
coordinates of this item and the corresponding
linear combinations of the coordinates of the
class average.
37ML discriminant rule
- If the class conditional densities Prob(x | y = k)
were fully known, one could
assign a given x to the class which gives
the largest likelihood. Note that in this case
the learning set would not be needed.
- If one assumes given forms for the class
conditional densities, then parameters can be
estimated from the learning set. This yields the
sample ML discriminant rule.
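As an illustration of the sample ML discriminant rule, the sketch below assumes a univariate Gaussian form for each class conditional density, estimates the mean and standard deviation per class from the learning set, and assigns x to the class with the largest estimated likelihood. The Gaussian assumption, the single-feature setting, and the function names are all choices made for the example.

```python
import math
import statistics

def fit_gaussian_classes(values, labels):
    """Estimate (mean, sd) of the feature within each class."""
    params = {}
    for k in set(labels):
        xs = [x for x, y in zip(values, labels) if y == k]
        params[k] = (statistics.mean(xs), statistics.stdev(xs))
    return params

def ml_classify(x, params):
    """Assign x to the class with the largest Gaussian log-likelihood."""
    def log_density(value, mean, sd):
        # The constant -0.5 * log(2 * pi) is the same for all classes.
        return -math.log(sd) - 0.5 * ((value - mean) / sd) ** 2
    return max(params, key=lambda k: log_density(x, *params[k]))
```

With equal variances across classes this reduces to choosing the class whose mean is closest to x.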
38Nearest neighbor rule
- Nearest neighbor methods are based on a
measure of distance between observations, such as
the Euclidean distance or one minus the
correlation between two gene expression profiles.
- The k-nearest neighbor rule, due to Fix and
Hodges (1951), classifies an observation x as
follows:
- i. find the k observations in the learning set
that are closest to x.
- ii. predict the class of x by majority vote,
i.e., choose the class that is most common among
those k observations.
- The number of neighbors k can be chosen by
cross-validation.
(Slide kindly provided by T. Speed)
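The two steps of the k-nearest neighbor rule translate almost directly into code. The sketch below uses one minus the Pearson correlation as the distance between expression profiles. Function names are illustrative, and constant profiles (zero variance) are not handled.

```python
import math
from collections import Counter

def correlation_distance(u, v):
    """One minus the Pearson correlation between two profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1.0 - cov / (norm_u * norm_v)

def knn_predict(x, learning_set, labels, k=3, distance=correlation_distance):
    """Classify x by majority vote among its k nearest neighbors."""
    # i. find the k observations in the learning set closest to x
    ranked = sorted(zip(learning_set, labels),
                    key=lambda pair: distance(x, pair[0]))
    # ii. choose the class most common among those k observations
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k avoids tied votes in two-class problems.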
39Classification trees (aka Decision trees)
- From the learning set build a set of rules which
are used to classify a new instance x = (x1, x2, ...,
xn).
- Each node in the tree specifies a test of some
coordinate(s) of x.
- An instance is classified by starting at the root
node of the tree, testing the coordinate(s)
specified by this node, then moving down the tree
branch corresponding to the output of the test.
- This process is then repeated for the subtree
rooted at the new node, till the instance reaches
a node labeled with one of 1, 2, ..., M.
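The classification step described above (not the tree-building step) can be sketched with a tree stored as nested dictionaries. The node layout, with `feature`, `threshold`, `left`, `right`, and `label` keys, and the single-coordinate threshold tests are assumptions chosen for the example.

```python
def classify(tree, x):
    """Walk from the root to a leaf and return the leaf's class label."""
    node = tree
    while "label" not in node:
        # Test the coordinate specified by this node, then follow the branch.
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

# A hypothetical hand-written tree over two coordinates and classes 1, 2, 3.
example_tree = {
    "feature": 0, "threshold": 2.0,
    "left": {"label": 1},
    "right": {
        "feature": 1, "threshold": 0.5,
        "left": {"label": 2},
        "right": {"label": 3},
    },
}
```

For instance, `classify(example_tree, (3.0, 0.2))` follows the right branch at the root and the left branch at the second node, reaching the leaf labeled 2.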
40Classification trees
(Slide kindly provided by T. Speed)
41Aggregating predictors.
- Some prediction methods (e.g. classification
trees) tend to be unstable, i.e. small changes in
the learning set can cause large changes in the
predictor.
- To attenuate this, multiple versions of a
predictor can be aggregated by plurality voting.
- Different versions of a predictor are built from
perturbations of the original learning set.
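One common way to perturb the learning set is bootstrap resampling (bagging); the slides do not say which perturbation is meant, so treat that choice as an assumption. In the sketch below, `build_predictor` is any function that returns a classifier from a learning set, and `one_nn_builder` is a toy one-dimensional base predictor included only to make the example self-contained.

```python
import random
from collections import Counter

def bagged_predictor(samples, labels, build_predictor, n_versions=25, seed=0):
    """Aggregate predictors built on bootstrap resamples by plurality vote."""
    rng = random.Random(seed)
    n = len(samples)
    versions = []
    for _ in range(n_versions):
        # Perturb the learning set: draw n elements with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        versions.append(build_predictor([samples[i] for i in idx],
                                        [labels[i] for i in idx]))

    def predict(x):
        votes = Counter(version(x) for version in versions)
        return votes.most_common(1)[0][0]

    return predict

def one_nn_builder(train_x, train_y):
    """Toy base predictor (hypothetical): 1-nearest neighbor on one feature."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[nearest]
    return predict
```

Aggregation helps most with unstable base predictors such as trees; a stable predictor gains little from it.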
42Other methods
- PAM: nearest shrunken centroids
(http://www-stat.stanford.edu/~tibs/PAM)
- Machine learning techniques
- Neural networks.
- Support Vector Machines (SVM).
43Microarray gene expression studies: Challenges
and Needs
- There are many technical steps involved, thus
many places where errors can occur and where
protocols might need optimization
- Array manufacturing
- RNA extraction
- Labeling
- Hybridization
- Scanning
- Image quantification
- Need
- Studies on protocol optimization
- Accurate tracking of operations
- Quality Control (QC) checks
44- There are technical limitations and computational
limitations, e.g.
- The number of technical or biological replicates
in a given study might be limited by array
availability, sample amounts, amount of time
needed to prepare each hyb., etc.
- Limited replication might limit the statistical
analyses that can be done
- A given data pre-processing method (e.g.
normalization) may or may not be applicable
according to the probes represented on the array
and the samples hybridized to it
- Need: collaborative dialog between bench
investigators and statisticians/bioinformaticians
from the onset of a study, possibly including the
choice of an appropriate array platform to be used
45- Exchange of files between wet labs and
bioinformatics labs adds more room for errors
- Need: integrity checks
- Information on the details of the bench work,
typically kept in lab notebooks, is often very
relevant to the computational analyses and needs
to be passed on to the researchers doing the
latter
- All relevant information regarding a study must
be organized to enable further research as well
as peer evaluation
- Need
- Structured storage of this information,
appropriate db schema
- Common language: ontologies
- User-friendly interfaces for the bench
investigator to enter this information
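One simple form of integrity check for files exchanged between labs is a checksum comparison: the sender records a digest of each file and the receiver recomputes it after transfer. The slides do not prescribe a mechanism, so the sketch below (SHA-256 via the Python standard library, with illustrative function names) is just one possible approach.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(path, expected_hex):
    """True if the received file matches the digest recorded by the sender."""
    return sha256_of_file(path) == expected_hex.lower()
```

A checksum only detects corruption or truncation in transit; it does not catch a mislabeled file, which is where structured annotation comes in.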
46What info is necessary and how should it be
exchanged?
- MGED efforts (www.mged.org)
- MIAME
- The formulation of the minimum information
about a microarray experiment required to
interpret and verify the results.
- (Oct 2002) Major scientific journals
(including Cell, The Lancet, Nature and Science)
endorsed MIAME guidelines for publishing
microarray gene expression data.
- MAGE-OM: data exchange model
- MAGE-ML: data exchange format
- OWG
- The development of ontologies for microarray
experiment description and biological material
(biomaterial) annotation in particular.
- TRANSFORMATIONS
- The development of recommendations regarding
microarray data transformations and normalization
methods.
47Relationship of MGED Efforts
Software and database developers
MIAME DB
MAGE
MGED Ontology
MIAME DB
External Ontologies/CVs
Investigators annotating experiments
48Data Storage Needs
- A suitable database schema should capture all
aspects of the experiments
- Array platform
- Biomaterials and their treatments (including RNA
extraction and labeling protocols)
- Hybridization protocols, scanning and image
quantification hardware, software and settings
(dates and operators)
- Raw measurements
- Processed data and info on processing procedures
- Possibly tables to store down-stream analyses
results and procedures
49Annotation Needs
- In order to collect all the necessary
experimental info from the bench-investigator,
need
- user-friendly interfaces
- use of ontologies
- known terms with a defined meaning
- minimize free text
- queries can be generated using CV terms
50What is out there?
- Databases and Annotation
- Public repositories
- ArrayExpress (EBI)
- GEO (NCBI)
- CIBEX (NIG)
- A list of MIAME compliant software is at
- www.mged.org/Workgroups/MIAME/miame_software.html
- RAD (RNA Abundance Database), developed at CBIL
(PCBI), www.cbil.upenn.edu/RAD, and its
Study-Annotator
52BioMaterial RAD3 Tables
BioMaterialMeasurement
OntologyEntry
BioMaterialCharacteristic
Treatment
BioMaterialImp
LabelMethod
LabeledExtract
BioSample
BioSource
AssayLEX
AssayBioMaterial
Assay
53Polacek et al. Physiol. Genomics 13
[Flow diagram: HAEC 5038 is split into culture dishes 1 to 4; a TNF treatment step yields "Culture dish 1, TNF" and "Culture dish 2, TNF"; dishes are pooled into a TNF pooled culture and a TNF- pooled culture; RNA extraction gives TNF pooled culture RNA and TNF- pooled culture RNA; each is split into aliquots (alqt) 1 to 9; aliquots go through "Amplify and 33P label" or "33P label".]
18 distinct resulting Labeled Extracts
54Ontology