Gene expression arrays: from the bench to data management and analysis.

1
Gene expression arrays: from the bench to data
management and analysis.
  • Elisabetta Manduchi
  • Center for Bioinformatics
  • University of Pennsylvania
  • manduchi_at_pcbi.upenn.edu

2
Outline
  • Generating data with gene expression arrays.
  • Image analysis and data pre-processing.
  • Zoomed-out overview of analysis methods grouped
    according to questions of interest.
  • Data management issues: collection, exchange, and
    storage.

3
Different array platforms
nylon filter array
short oligonucleotide array
two-channel microarray
4
Generalities
  • Exploits complementary base-pairing. The steps
    are:
  • Array manufacturing
  • Sample preparation and RNA extraction
  • Labeling
  • Hybridization of probe to target
  • Scanning
  • Image quantification
  • Data pre-processing
  • Data analysis

5
Filter Arrays
  • cDNA clones or oligos (e.g. 70-mers) are spotted
    on a nylon filter
  • cDNA from a given sample is radioactively
    labeled.
  • Limitations
  • Cross-hybridization (sequences with high sequence
    identity, Alu repeats, etc.).
  • Hard to distinguish the transcripts generated by
    alternative splicing.
  • Distortion.
  • Several sources of bias and noise
  • variation in spot size, shape, and concentration
  • variation in PCR reaction efficiency
  • variation in labeled nucleotide incorporation.

6
Two-channel Microarrays
  • Preparing the array
  • Amplified cDNA clones or oligos (e.g. 70-mers)
    are printed onto a glass microscope slide
  • the array is processed by chemical and heat
    treatment to attach the DNA sequences to the
    glass surface and denature them.

7
Building the chip
Ngai Lab arrayer, UC Berkeley
Print-tip head
(Slide kindly provided by T. Speed)
8
Glass slide: array of bound cDNA probes. In this
case, 4x4 blocks = 16 print-tip groups.
(Slide kindly provided by T. Speed)
9
Two-channel Microarrays (cont.)
  • Preparing the RNA sources: two samples are
    analyzed simultaneously. For each of them:
  • polyA mRNA is prepared and reverse transcribed
    with incorporation of a fluorescent label
    (usually Cy3 green for one sample and Cy5 red
    for the other)
  • A variety of labeling methods are currently
    available (e.g. direct labeling, indirect
    labeling, dendrimers)
  • the RNA is then degraded.

10
Two-channel Microarrays (cont.)
  • Hybridization: the labeled cDNAs are
    competitively hybridized to the array.
  • Scanning: utilizes a laser fluorescent scanning
    procedure (sequential excitation of the
    fluorophores). Emitted light is split according
    to wavelength and detected.
  • Quantification: signals are then quantified
    separately, and the ratio of the two channels for
    each spot is also reported.

11
A two-channel microarray experiment
Figure from David J. Duggan et al. (1999),
"Expression profiling using cDNA microarrays",
Nature Genetics 21: 10-14.
12
Two-channel Microarrays: limitations
  • Some of the limitations are shared with
    filter arrays.
  • A large number of cDNA or PCR products must be
    prepared, purified, quantified, catalogued, and
    spotted onto a solid support.
  • If the cDNAs are derived from a cDNA library, low
    abundance cDNAs are unlikely to be spotted and
    the library must be normalized to reduce the
    redundant spotting of cDNAs from highly expressed
    genes.
  • Cross-hybridization.
  • Alternative splicing hard to detect.

13
Short Oligonucleotide Arrays
  • Preparing the array
  • covalently attached oligonucleotides chemically
    synthesized directly on a solid substrate
  • for each mRNA being monitored, a collection
    (probe set) of probe pairs (16 to 20) is
    synthesized on the array
  • each probe pair consists of two probe cells: one
    containing (millions of) copies of a given short
    oligo (e.g. 25-mer) that is a perfect match (PM)
    to a subsequence of the mRNA in question and the
    other containing copies of a companion (MM) short
    oligo that has a single base difference in a
    central position.
  • Preparing the RNA source
  • polyA RNA is converted to cDNA
  • cDNA is transcribed in vitro in the presence of
    fluorescently labeled (biotin or fluorescein)
    ribonucleotides, giving rise to labeled RNA
  • RNA is then fragmented with heat (fragment
    average size of 50 to 100 bp).
  • Hybridization occurs in a flow-cell. A brief
    washing step follows to remove un-hybridized RNA.

In some cases only PM are used
14
Short Oligonucleotide Arrays: quantification
  • MAS 4.0 and MAS 5.0:
  • (1) An intensity is computed for each cell (3rd
    quartile of the pixel distribution in that cell,
    after excluding bordering pixels).
  • (2) Background values are computed (after dividing
    the array into sectors) and subtracted from cell
    intensities.
  • (3) A presence/absence call is made on each probe
    set.
  • (4) Probe set intensities are computed based on the
    background-subtracted intensities of the cells in
    the probe set.

15
MAS 4.0 vs MAS 5.0
  • Differ in how (2)-(4) above are calculated.
  • For (4):
  • MAS 4.0 uses the Average Difference (AD) method,
    that is, the average of d = PM - MM over the subset
    of probe pairs for which d_j = PM_j - MM_j is within
    3 SDs of the average of d_(2), ..., d_(J-1), where
    d_(j) is the j-th smallest difference and J is the
    number of probe pairs in the probe set. This is
    called the Super-Olympic-Scoring (SOS) method.
  • MAS 5.0 uses the Tukey biweight (a MAD-weighted
    mean) of log2(PM) - log2(stray).
  • Stray signals are typically estimated using the
    MM values, but anomalous MM values are handled
    with imputation.
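
The Average Difference calculation described above can be sketched as follows. This is only an illustration under stated assumptions, not the MAS 4.0 implementation: the function name average_difference, the n_sd parameter, and the use of the sample standard deviation of the trimmed differences are all choices made here.

```python
from statistics import mean, stdev

def average_difference(pm, mm, n_sd=3.0):
    """Sketch of an Average Difference value for one probe set.

    pm and mm are equal-length lists of perfect-match and mismatch
    cell intensities (assumed already background-subtracted).
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    # Drop the smallest and the largest difference before
    # estimating the mean and SD used for the 3-SD filter.
    inner = sorted(diffs)[1:-1]
    if len(inner) < 2:
        return mean(diffs)
    center, spread = mean(inner), stdev(inner)
    # Average only the probe pairs whose difference lies within
    # n_sd standard deviations of that trimmed mean.
    kept = [d for d in diffs if abs(d - center) <= n_sd * spread]
    return mean(kept) if kept else center
```

A single probe pair with a very large PM - MM value is excluded by the 3-SD filter, so it does not dominate the probe set summary.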

16
Short Oligonucleotide Arrays: other low-level
analysis work
  • See http://www.stat.Berkeley.EDU/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html
    for a workshop held in Nov 2001.
  • This includes work by Li and Wong (MBEI) and by
    Irizarry et al. (RMA) for low-level analyses of
    short oligo arrays.
  • Match-Only Integral Distribution (MOID) (Zhou
    and Abagyan, BMC Bioinformatics, 2002, 3:3)

17
Image analysis (for spotted arrays)
  • Gridding: in order to extract spot intensities it
    is necessary to accurately identify the location
    of each of the spots.
  • Segmentation: it is necessary to identify, within
    each such location, which pixels correspond to
    probe hybridized to target.
  • Intensity extraction: after detecting location,
    size, and shape of each spot, one needs to
    calculate the signal (foreground) and the
    background intensities as well as quality
    measures at each spot.

18
Image analysis (cont.)
Figures from http://www.nhgri.nih.gov/DIR/Microarray/image_analysis.html
19
Image analysis (cont.)
  • There are different public and commercial
    software packages for image analysis, using
    different algorithms for the 3 steps involved and
    requiring/allowing different degrees of manual
    intervention. Moreover, different software might
    give a more or less copious output in terms of
    quality measures.
  • For the segmentation step, the following
    possibilities might be available:
  • fixed circle
  • adaptive
  • histogram
  • For intensity extraction there are also various
    possibilities:
  • Foreground: sum, mean, median, mode, etc. of
    pixel intensities
  • Background: none, global, local, morphological
    opening

20
Local background
(Figure: local background regions used by GenePix, QuantArray, and ScanAlyze)
(Slide kindly provided by T. Speed)
21
Morphological non-linear filter on background
pixel signal (Spot software)
Measures overall baseline background level.
(Slide kindly provided by T. Speed)
22
Quality measures
  • Spot
  • One channel, R or G
  • Signal/noise ratio
  • Variation in pixel intensities
  • Identification of bad spots (no signal), etc.
  • Two channels, R/G
  • Circularity, etc.
  • Array
  • Percentage of spots with no signal
  • Distribution of spot signal area, etc.

(Slide kindly provided by T. Speed)
23
Normalization: motivation
  • Need to identify and remove systematic sources
    of variation in the measured intensities, due to
    one or more of:
  • Different labeling efficiency of the dyes
  • Separate reverse transcription and labeling
  • Different scanning parameters
  • Print-tip-group differences
  • Spatial effects, e.g. due to the placement of the
    cover slip
  • Plate effects
  • Necessary for within- and between-slide
    comparisons of expression levels.

(Slide kindly provided by T. Speed)
24
Normalization methods
  • Use a constant factor for all spots on the array,
    where the constant is obtained from a given set
    of spots on the array, e.g.
  • c/(average intensity) or c/(median intensity),
    for a given constant c (multiplicative)
  • mean or median log(ratio) in 2-channel
    microarrays (additive, on the log scale)
  • Use appropriate functions for normalizing (log)
    ratios (R/G) in 2-channel microarrays
  • intensity-dependent normalization: the
    normalization factor depends on the overall
    intensity of the spot, not just on the array
  • intensity-and-print-tip-dependent normalization:
    the normalization factor also depends on the
    print-tip group
  • scale normalization (within and between slides)
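
The simplest of the methods above, subtracting the median log ratio from every spot on a two-channel array, might look like the sketch below. The function name and the choice of log base 2 are assumptions made for illustration; in practice the median could be taken over a selected subset of spots rather than all of them.

```python
from math import log2
from statistics import median

def median_normalize_log_ratios(red, green):
    """Shift log2(R/G) so that its median over the given spots is zero.

    red and green are equal-length lists of positive spot intensities.
    """
    m = [log2(r / g) for r, g in zip(red, green)]
    shift = median(m)  # one additive constant for the whole array
    return [value - shift for value in m]
```

On the raw scale this corresponds to dividing every ratio R/G by a single constant, which is why the same step is multiplicative for intensities and additive for log ratios.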

25
  • Approaches in the same spirit for short
    oligonucleotide arrays:
  • Li and Wong
  • Astrand
  • Quantile normalization for short oligonucleotide
    arrays.
  • The Bioconductor project (R packages for gene
    expression analysis) contains implementations of
    some of the above methods:
  • http://www.bioconductor.org
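
As a rough illustration of quantile normalization, the sketch below forces several arrays to share one intensity distribution. It is not the Bioconductor implementation: the function name is hypothetical, all arrays are assumed to have the same number of probes, and ties are broken arbitrarily by position.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Give every array (a list of intensities) the same distribution."""
    n = len(arrays[0])
    # Reference distribution: the mean of the k-th smallest value
    # across all arrays, for k = 0 .. n-1.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Probe indices ordered by increasing intensity in this array.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized
```

After this step each array contains exactly the same set of values; only the assignment of values to probes differs between arrays.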

26
MA plots
log2(R) vs. log2(G)
M vs. A
M = log2(R) - log2(G), A = (log2(R) + log2(G))/2
(Slide kindly provided by T. Speed)
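
The M and A values plotted on an MA plot can be computed from the two channel intensities as in this small sketch. The function name ma_values is made up for illustration, and the inputs are assumed to be positive, background-corrected intensities.

```python
from math import log2

def ma_values(red, green):
    """Return (M, A) lists for paired red and green spot intensities."""
    # M is the log ratio, A the average log intensity of each spot.
    m = [log2(r) - log2(g) for r, g in zip(red, green)]
    a = [(log2(r) + log2(g)) / 2 for r, g in zip(red, green)]
    return m, a
```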
27
Normalization - lowess
  • Assumption: changes are roughly symmetric at all
    intensities, or few genes change.
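
A minimal sketch of intensity-dependent normalization is given below, assuming M and A values have already been computed for each spot. It fits a local linear regression of M on A with tricube weights and subtracts the fitted trend. This is a simplified stand-in for lowess: it omits the robustness iterations of the full algorithm, and the function names and the frac parameter are assumptions made here.

```python
def lowess_fit(a, m, frac=0.4):
    """Local linear fit of M on A with tricube weights (no robustness steps)."""
    n = len(a)
    k = min(n, max(2, round(frac * n)))  # points in each local window
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in a]
        sw = sum(w)
        mean_x = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_y = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_x) * (y - mean_y)
                  for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted

def lowess_normalize(a, m, frac=0.4):
    """Subtract the fitted intensity-dependent trend from each M value."""
    return [mi - fi for mi, fi in zip(m, lowess_fit(a, m, frac))]
```

Applying the same function separately to the spots of each print-tip group would give a print-tip-group version of the normalization, under the corresponding per-group assumption.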

(Slide kindly provided by T. Speed)
28
Normalization - print-tip-group
Assumption: for every print-tip group, changes are
roughly symmetric at all intensities, or few genes
change.
(Slide kindly provided by T. Speed)
29
MA plot - after print-tip-group normalization
(Slide kindly provided by T. Speed)
30
  • Which genes should be used to compute the
    normalization function? It depends.
  • All genes on the array.
  • Constantly expressed genes (housekeeping).
  • Controls
  • Spiked controls
  • Genomic DNA or Microarray Sample Pool (MSP)
    titration series
  • Rank invariant set
  • Every normalization method relies on the
    samples and arrays at hand satisfying certain
    assumptions. Thus, to judge what is the most
    appropriate normalization for a given dataset, it
    is important to ascertain which of the necessary
    assumptions are satisfied.

31
The promises of this technology
  • Gives a snapshot of the mRNA abundance for
    thousands of transcripts
  • Can be used to address many interesting
    biological questions
  • Which genes are expressed in a given sample?
  • Which genes are differentially expressed between
    two sample types?
  • Which genes are co-expressed (possibly
    co-regulated or sharing similar functions)?
  • Class discovery
  • Class prediction based on molecular fingerprints
  • Gene networks

32
Down-stream Analyses
  • Differential expression: which genes are
    differentially expressed between two conditions
    of interest?
  • This will be discussed in detail in the next
    talk.
  • Experimental design issues
  • Need replicates (experimental and biological) to
    assess variability within sample type
  • In the case of 2-channel microarray experiments,
    possible experimental designs are:
(Figure: diagrams of a reference design and a direct comparison design for samples A, B, and C, with each hybridization replicated n times)
33
  • Class discovery (unsupervised clustering)
  • Group (i) the experiments or (ii) the genes by
    similarity of their profiles. Motivation:
  • (i) determine a molecular classification of
    samples (e.g. subtypes of tumors which are
    morphologically indistinguishable)
  • (ii) determine groups of genes which are
    candidates for co-expression and possibly
    co-regulation
  • (ii) data reduction
  • Discussed in previous lectures.

34
  • Class prediction: given M known classes, a
    learning set L for a predictor consists of
    observations which are known to belong to certain
    classes: class 1, class 2, ..., class M.
  • Predictors are usually built from learning
    sets and are used to predict the class of each
    novel observation.
  • See http://www.stat.berkeley.edu/users/terry/zarray/Html/discr.html
    for an overview and comparison of the methods
    listed below.
  • Fisher linear discriminant analysis (FLDA).
  • Maximum likelihood (ML) discriminant rule.
  • Nearest neighbor classifiers.
  • Classification trees.
  • Aggregate predictors.

35
Measuring the error rate of a predictor
  • Use a test set, if available: apply the predictor
    to entities whose class is known.
  • Repeat random divisions of a given dataset into
    learning set and test set, say N times, then
    compute the average error rate.
  • Use cross-validation: for i = 1, 2, ..., s, where s
    is the size of the learning set,
  • remove the i-th element from the learning set
  • build a predictor using the remaining elements in
    the learning set
  • apply this predictor to the removed element.
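
The cross-validation loop above can be sketched as follows, with the error rate taken as the fraction of removed elements that are misclassified. The build_predictor interface and the toy 1-nearest-neighbour builder are assumptions introduced only to make the example self-contained.

```python
def leave_one_out_error(samples, labels, build_predictor):
    """Leave-one-out cross-validation error rate.

    build_predictor(train_samples, train_labels) must return a function
    that maps one sample to a predicted label (hypothetical interface).
    """
    errors = 0
    for i in range(len(samples)):
        # Remove the i-th element from the learning set.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Build a predictor from the rest and apply it to the removed element.
        predict = build_predictor(train_x, train_y)
        if predict(samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

def build_nearest_neighbor(train_x, train_y):
    """Toy 1-nearest-neighbour predictor using squared Euclidean distance."""
    def predict(x):
        def dist(u):
            return sum((ui - xi) ** 2 for ui, xi in zip(u, x))
        best = min(range(len(train_x)), key=lambda j: dist(train_x[j]))
        return train_y[best]
    return predict
```

Any feature selection or parameter tuning would need to happen inside build_predictor, so that the removed element plays no part in building the predictor that is tested on it.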

36
FLDA
  • Assign a new item to the class which minimizes
    the sum of the squared distances between
    appropriate linear combinations of the
    coordinates of this item and the corresponding
    linear combinations of the coordinates of the
    class average.

37
ML discriminant rule
  • If the class conditional densities Prob(x | y = k)
    were fully known, one could
    assign a given x to the class which gives
    the largest likelihood. Note that in this case
    the learning set would not be needed.
  • If one assumes given forms for the class
    conditional densities, then parameters can be
    estimated from the learning set. This yields the
    sample ML discriminant rule.
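
One possible instance of the sample ML discriminant rule is sketched below. It assumes, purely for illustration, that each class conditional density is Gaussian with independent genes (a diagonal covariance matrix), and estimates the means and variances from the learning set. The function names and the small variance floor are choices made here.

```python
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_classes(samples, labels):
    """Estimate per-class, per-gene mean and variance from the learning set."""
    params = {}
    for k in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == k]
        columns = list(zip(*rows))
        means = [mean(c) for c in columns]
        # ML variance estimate; the floor avoids zero variance for
        # genes that are constant within a class.
        variances = [max(pvariance(c), 1e-6) for c in columns]
        params[k] = (means, variances)
    return params

def ml_classify(x, params):
    """Assign x to the class with the largest Gaussian log-likelihood."""
    def log_likelihood(means, variances):
        return sum(-0.5 * (log(2 * pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, means, variances))
    return max(params, key=lambda k: log_likelihood(*params[k]))
```

Other assumed forms for the densities lead to different rules with the same structure: estimate the parameters per class, then pick the class with the largest likelihood.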

38
Nearest neighbor rule
  • Nearest neighbor methods are based on a
    measure of distance between observations, such as
    the Euclidean distance or one minus the
    correlation between two gene expression profiles.
  • The k-nearest neighbor rule, due to Fix and
    Hodges (1951), classifies an observation x as
    follows:
  • i. find the k observations in the learning set
    that are closest to x.
  • ii. predict the class of x by majority vote,
    i.e., choose the class that is most common among
    those k observations.
  • The number of neighbors k can be chosen by
    cross-validation.
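
The two steps of the k-nearest neighbor rule can be sketched as below, using one minus the correlation as the distance. The function names are hypothetical, ties in the vote are broken arbitrarily, and profiles with zero variance are assumed not to occur (they would make the correlation undefined).

```python
from collections import Counter
from math import sqrt

def correlation_distance(u, v):
    """One minus the Pearson correlation between two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return 1 - cov / (su * sv)

def knn_classify(x, samples, labels, k=3, distance=correlation_distance):
    """Majority vote among the k learning-set observations closest to x."""
    # Step i: rank the learning set by distance to x and keep the k closest.
    ranked = sorted(range(len(samples)), key=lambda i: distance(x, samples[i]))
    # Step ii: choose the most common class among those k observations.
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```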

(Slide kindly provided by T. Speed)
39
Classification trees (aka decision trees)
  • From the learning set build a set of rules which
    are used to classify a new instance x = (x1, x2, ...,
    xn).
  • Each node in the tree specifies a test of some
    coordinate(s) of x.
  • An instance is classified by starting at the root
    node of the tree, testing the coordinate(s)
    specified by this node, then moving down the tree
    branch corresponding to the output of the test.
  • This process is then repeated for the subtree
    rooted at the new node, until the instance reaches
    a node labeled with one of 1, 2, ..., M.
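
How an instance travels down a tree can be illustrated with the sketch below. The dictionary layout of a node (keys "feature", "threshold", "left", "right", "label") and the toy tree are assumptions made for this example; the tree is written by hand here rather than built from a learning set.

```python
def classify_with_tree(node, x):
    """Walk from the root to a leaf and return the class stored there."""
    while "label" not in node:
        # Each internal node tests one coordinate of x against a threshold
        # and the instance moves down the matching branch.
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# A hand-written toy tree over instances x = (x1, x2) with classes 1, 2, 3.
toy_tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"label": 1},
    "right": {
        "feature": 1, "threshold": 0.0,
        "left": {"label": 2},
        "right": {"label": 3},
    },
}
```

For example, classify_with_tree(toy_tree, (3.0, -1.0)) fails the first test, moves to the right subtree, passes the second test, and ends at the leaf labeled 2.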

40
Classification trees
(Slide kindly provided by T. Speed)
41
Aggregating predictors.
  • Some prediction methods (e.g. classification
    trees) tend to be unstable, i.e. small changes in
    the learning set can cause large changes in the
    predictor.
  • To attenuate this, multiple versions of a
    predictor can be aggregated by plurality voting.
  • Different versions of a predictor are built from
    perturbations of the original learning set.
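
Aggregation by plurality voting might be sketched as follows. Bootstrap resampling is used here as one way to perturb the learning set, which is an assumption; the function names, the build_predictor interface, and the nearest-centroid base predictor are also introduced only for illustration.

```python
import random
from collections import Counter

def bagged_predictor(samples, labels, build_predictor, n_versions=25, seed=0):
    """Aggregate predictors built on bootstrap resamples by plurality vote."""
    rng = random.Random(seed)
    n = len(samples)
    versions = []
    for _ in range(n_versions):
        # Perturb the learning set: draw n observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        versions.append(build_predictor([samples[i] for i in idx],
                                        [labels[i] for i in idx]))
    def predict(x):
        votes = Counter(version(x) for version in versions)
        return votes.most_common(1)[0][0]
    return predict

def build_nearest_centroid(train_x, train_y):
    """Toy base predictor: assign x to the class with the closest mean profile."""
    centroids = {}
    for k in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == k]
        centroids[k] = [sum(c) / len(rows) for c in zip(*rows)]
    def predict(x):
        def dist(k):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[k]))
        return min(centroids, key=dist)
    return predict
```

An unstable base method such as a classification tree would be the more typical choice for build_predictor; the nearest-centroid builder is used here only to keep the sketch short.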

42
Other methods
  • PAM: nearest shrunken centroids
    (http://www-stat.stanford.edu/~tibs/PAM)
  • Machine learning techniques
  • Neural networks.
  • Support Vector Machines (SVM).

43
Microarray gene expression studies: Challenges
and Needs
  • There are many technical steps involved, thus
    many places where errors can occur and where
    protocols might need optimization
  • Array manufacturing
  • RNA extraction
  • Labeling
  • Hybridization
  • Scanning
  • Image quantification
  • Need
  • Studies on protocol optimization
  • Accurate tracking of operations
  • Quality Control (QC) checks

44
  • There are technical limitations and computational
    limitations, e.g.
  • The number of technical or biological replicates
    in a given study might be limited by array
    availability, sample amounts, amount of time
    needed to prepare each hyb., etc
  • Limited replication might limit the statistical
    analyses that can be done
  • A given data pre-processing method (e.g.
    normalization) may or may not be applicable,
    depending on the probes represented on the array
    and the samples hybridized to it
  • Need collaborative dialog between bench
    investigators and statisticians/bioinformaticians
    from the onset of a study, possibly including the
    choice of an appropriate array platform to be used

45
  • Exchange of files between wet labs and
    bioinformatics labs adds more room for errors
  • Need integrity checks
  • Information on the details of the bench work,
    typically kept in lab notebooks, is often very
    relevant to the computational analyses and needs
    to be passed on to the researchers doing the
    latter
  • All relevant information regarding a study must
    be organized to enable further research as well
    as peer evaluation
  • Need
  • Structured storage of this information,
    appropriate db schema
  • Common language: ontologies
  • User-friendly interfaces for the bench
    investigator to enter this information

46
What info is necessary and how should it be
exchanged?
  • MGED efforts (www.mged.org)
  • MIAME
  • The formulation of the minimum information
    about a microarray experiment required to
    interpret and verify the results.
  • (Oct 2002) Major scientific journals
    (including Cell, The Lancet, Nature and Science)
    endorsed MIAME guidelines for publishing
    microarray gene expression data.
  • MAGE-OM: data exchange model
  • MAGE-ML: data exchange format
  • OWG
  • The development of ontologies for microarray
    experiment description and biological material
    (biomaterial) annotation in particular.
  • TRANSFORMATIONS
  • The development of recommendations regarding
    microarray data transformations and normalization
    methods.

47
Relationship of MGED Efforts
Software and database developers
MIAME DB
MAGE
MGED Ontology
MIAME DB
External Ontologies/CVs
Investigators annotating experiments
48
Data Storage Needs
  • A suitable database schema should capture all
    aspects of the experiments
  • Array platform
  • Biomaterials and their treatments (including RNA
    extraction and labeling protocols)
  • Hybridization protocols, scanning and image
    quantification hardware, software and settings
    (dates and operators)
  • Raw measurements
  • Processed data and info on processing procedures
  • Possibly tables to store down-stream analyses
    results and procedures

49
Annotation Needs
  • In order to collect all the necessary
    experimental info from the bench-investigator,
    need
  • user-friendly interfaces
  • use of ontologies
  • known terms with a defined meaning
  • minimize free text
  • queries can be generated using CV terms

50
What is out there?
  • Databases and Annotation
  • Public repositories
  • ArrayExpress (EBI)
  • GEO (NCBI)
  • CIBEX (NIG)
  • A list of MIAME compliant software is at
  • www.mged.org/Workgroups/MIAME/miame_software.html
  • RAD (RNA Abundance Database), developed at CBIL
    (PCBI) www.cbil.upenn.edu/RAD and its
    Study-Annotator

52
BioMaterial RAD3 Tables
BioMaterialMeasurement
OntologyEntry
BioMaterialCharacteristic
Treatment
BioMaterialImp
LabelMethod
LabeledExtract
BioSample
BioSource
AssayLEX
AssayBioMaterial
Assay
53
Polacek et al. Physiol. Genomics 13
HAEC 5038
split
Culture dish 1
Culture dish 2
Culture dish 3
Culture dish 4
TNF treatment
pool
Culture dish 1, TNF
Culture dish 2, TNF
pool
TNF- pooled culture
TNF pooled culture
RNA extraction
TNF pooled culture RNA
TNF- pooled cultured RNA
split
2, , 5, 6, ...
2, , 5, 6, ...
TNF alqt 1
TNF alqt 9
TNF- alqt 1
TNF- alqt 9
Amplify and 33P label
Amplify and 33P label
33P label
33P label




18 distinct resulting Labeled Extracts
54
Ontology