Outline - PowerPoint PPT Presentation

About This Presentation
Title:

Outline

Description:

Calibration experiments help to validate experiment quality and gene-specific variability. ... Arrays of replicates should have high correlation. ...

Slides: 34
Provided by: george59

Transcript and Presenter's Notes

Title: Outline


1
  • Outline
  • Experimental design of cDNA array
  • Calibration and replicate
  • Choice of reference sample
  • Design of two-color array
  • Filtering (in all platforms)
  • Missing value imputation (in all platforms)

2
1.1. Calibration and replicate
  • Calibration
  • Use the same sample on both dyes for
    hybridization.
  • Calibration experiments help to validate
    experiment quality and gene-specific variability.

Comparative: Tumor vs Ref
Calibration: Ref vs Ref
3
1.1. Calibration and replicate
  • Replicates (replicate spots, slides)
  • Multiple-spotting helps to identify local
    contaminated spots but will reduce number of
    genes in the study.
  • Multi-stage strategy: use single-spotting to
    include as many genes as possible for a pilot
    study. Identify a subset of interesting genes and
    then use multiple-spotting.
  • Replicate spots and slides help to verify
    reproducibility on the spot and slide level.

4
1.1. Calibration and replicate
Biological replicate
From Yang, UCSF
5
1.1. Calibration and replicate
Technical replicate
From Yang, UCSF
6
1.1. Calibration and replicate
  • (iii) Reverse labelling
  • Advantage
  • Cancels out linear normalization scaling and
    simplifies the analysis. However, the linear
    assumption is often not true.
  • Helps to cancel out gene-label interactions if
    they exist.
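
A minimal numeric sketch of why reverse labelling helps. It assumes an additive, gene-specific dye bias on the log-ratio scale; the variable names and numbers are illustrative and not taken from the slides.

```python
# Dye-swap sketch: assume each observed log2 ratio equals the true log2 ratio
# plus a gene-specific dye (label) bias. All numbers are made up.
true_log_ratio = [1.0, -0.5, 0.0]   # log2(A/B) per gene
dye_bias = [0.3, -0.2, 0.4]         # assumed gene-label interaction per gene

# Slide 1: sample A in red, sample B in green -> log2(A/B) + bias
forward = [t + b for t, b in zip(true_log_ratio, dye_bias)]
# Slide 2: labels swapped                     -> log2(B/A) + bias
reverse = [-t + b for t, b in zip(true_log_ratio, dye_bias)]

# Half the difference cancels the bias and recovers the true log ratio.
estimate = [(f - r) / 2 for f, r in zip(forward, reverse)]
print(estimate)
```

If the bias were not additive on this scale, the cancellation would only be approximate.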

Sample A
Sample B
7
1.2. Choice of reference sample
  • Different choices of reference sample
  • Normal patient or time 0 sample in time course
    study
  • Pool all samples or all normal samples
  • Embryonic cells
  • Commercial kit

Ideally we want all genes expressed at a
constant moderate level in reference sample.
8
1.3. Design issue
From Yang, UCSF
9
1.3. Design issue
  • Design issues
  • Reference design
  • Loop design
  • Balanced design

Reference sample is redundantly measured many
times.
10
(c)
v samples with v2 experiments
v samples with 2v experiments
See Kerr et al. 2001
11
2. Filtering
  • Filter out genes with bad quality
  • Outputs from imaging analysis usually have a
    quality index or flag to identify genes with bad
    quality image.
  • Three common sources of bad quality probes
  • Problematic probes: probes with non-uniform
    intensities.
  • Low-intensity probes: genes with low intensities
    are known to have poor reproducibility and are
    hard to verify by RT-PCR. Normally genes with
    intensities less than 100 or 200 are filtered.
  • Saturated probes: genes with intensities reaching
    the scanner limit (saturation) should also be
    filtered.
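
The intensity rules above can be sketched as a small masking function. The low cutoff follows the "100 or 200" rule of thumb; the saturation value assumes a 16-bit scanner and, like the function name, is an illustrative choice rather than something stated on the slide.

```python
# Flag spots as missing when the intensity is too low or saturated.
LOW_CUTOFF = 200      # rule-of-thumb lower bound (100 or 200 on the slide)
SATURATION = 65535    # assumed 16-bit scanner limit (hypothetical)

def filter_intensities(values, low=LOW_CUTOFF, high=SATURATION):
    """Return a copy with low or saturated intensities replaced by None."""
    return [v if low <= v < high else None for v in values]

spots = [150, 1200, 65535, 830, 90]
filtered = filter_intensities(spots)
print(filtered)  # [None, 1200, None, 830, None]
```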

12
2. Filtering
Filtering by quality index: different array
platforms and image analysis software have
different formats.
[Figure label: low intensity]
13
2. Filtering
Filtering by quality index
Array 1, Array 2, ..., Array S
NA: not applicable. Missing values are due to bad
quality, or low or saturated intensities.
14
2. Filtering
  • Filter genes with low information content
  • Small standard deviation (stdev)
  • Small coefficient of variation (CV = stdev/mean)

Intensities 130, 125, 115, 120: stdev = 6.45, CV = 0.053
Intensities 30, 25, 15, 20: stdev = 6.45, CV = 0.29
Note: CV is more reasonable for original
intensities. But for log-transformed intensities,
stdev is enough. Why?
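
A short check of the two examples above, plus one way to see the answer to the question: on the log scale a multiplicative change becomes an additive shift, so the standard deviation of log intensities already behaves like a relative measure. The helper name `cv` is ours.

```python
import math
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample stdev divided by mean."""
    return stdev(values) / mean(values)

high = [130, 125, 115, 120]
low = [30, 25, 15, 20]

# Same stdev on the raw scale, very different CV.
print(round(stdev(high), 2), round(cv(high), 3))  # 6.45 0.053
print(round(stdev(low), 2), round(cv(low), 2))    # 6.45 0.29

# After log2 transformation the stdev alone separates the two genes.
log_high = [math.log2(v) for v in high]
log_low = [math.log2(v) for v in low]
print(stdev(log_high) < stdev(log_low))  # True
```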
15
2. Filtering
Gene filtering
  • A simple gene filtering routine (I usually use)
    before downstream analyses
  • Take the log (base 2) transformation.
  • Delete genes with more than 20% missing values
    among all samples.
  • Delete genes with average expression level less
    than, say, α = 7 (2^7 = 128).
  • Delete genes with standard deviation smaller
    than, say, β = 0.4 (2^0.4 = 1.32, i.e. a 32%
    fold change).
  • Adjust α and β so that the number of remaining
    genes is computationally manageable in
    downstream analysis (e.g. around 3000 genes).
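
The routine above might look like the following NumPy sketch. It assumes a genes x samples matrix of raw intensities with NaN marking missing values, and reads the missing-value cutoff as a fraction of samples; the function and parameter names are hypothetical.

```python
import numpy as np

def filter_genes(intensity, max_missing=0.20, alpha=7.0, beta=0.4):
    """Return a boolean mask of genes (rows) to keep.

    intensity: genes x samples array of raw intensities, NaN = missing.
    """
    log_expr = np.log2(intensity)                    # log (base 2) transform
    missing_frac = np.isnan(log_expr).mean(axis=1)   # fraction missing per gene
    mean_expr = np.nanmean(log_expr, axis=1)         # average log2 expression
    sd_expr = np.nanstd(log_expr, axis=1, ddof=1)    # sample stdev on log2 scale
    return (missing_frac <= max_missing) & (mean_expr >= alpha) & (sd_expr >= beta)

data = np.array([
    [1000, 2000, 500, 4000, 1500],      # expressed and variable: keep
    [50, 60, 40, 55, 45],               # average below 2^7: drop
    [1000, 1010, 990, 1005, 995],       # almost no variation: drop
    [1000, np.nan, np.nan, 4000, 500],  # 40% missing: drop
])
keep = filter_genes(data)
print(keep)  # [ True False False False]
```

Loosening or tightening `alpha` and `beta` is the knob for reaching a manageable gene count.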

16
2. Filtering
Sample filtering (detecting problematic slides)
Compute correlation matrix of the samples
  • Arrays of replicates should have high
    correlation. (m,n,o,p are replicates and q,r,s,t
    are another set of replicates)
  • A problematic array is often found to have low
    correlation with all the other arrays.
  • A heatmap is usually plotted for better
    visualization.
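
A minimal sketch of this diagnostic on simulated data. The simulation, the 0.5 threshold, and the use of the median correlation with the other arrays are all illustrative assumptions; array names m to t follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 500
names = list("mnopqrst")

base = rng.normal(8, 2, n_genes)              # gene-specific baseline
group1 = base + rng.normal(0, 0.5, n_genes)   # true profile for m, n, o, p
group2 = base + rng.normal(0, 0.5, n_genes)   # true profile for q, r, s, t
columns = [g + rng.normal(0, 0.3, n_genes) for g in [group1] * 4 + [group2] * 4]
columns[1] = rng.normal(8, 2, n_genes)        # simulate a failed array "n"
data = np.column_stack(columns)               # genes x samples

corr = np.corrcoef(data, rowvar=False)        # samples x samples correlations
np.fill_diagonal(corr, np.nan)                # ignore self-correlation
median_corr = np.nanmedian(corr, axis=1)      # typical correlation with others

flagged = [name for name, c in zip(names, median_corr) if c < 0.5]
print(flagged)  # ['n']
```

The `corr` matrix is what would be passed to a heatmap routine for visual inspection.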

17
2. Filtering
Diagnostic plot by correlation matrix
White: high correlation. Dark gray: low
correlation.
m,n,o,p
q,r,s,t
18
3. Missing Value Imputation
  • Reasons for missing values in microarrays
  • spotting problems (cDNA)
  • dust
  • fingerprints
  • poor hybridization
  • inadequate resolution
  • fabrication errors (e.g. scratches)
  • image corruption
  • Many downstream analyses require complete
    data.
  • Imputation is usually helpful.

19
3. Missing Value Imputation
It is common to have 5% MVs in a
study. 5000 (genes) × 50 (arrays) × 5% = 12,500
20
3. Missing Value Imputation
Existing methods
  • Naïve approaches
  • Missing values = 0 or 1 (arbitrary signal)
  • Missing values = row (gene) average
  • Smarter approaches have been proposed
  • K-nearest neighbors (KNN)
  • Regression-based methods (OLS)
  • Singular value decomposition (SVD)
  • Local SVD (LSVD)
  • Partial least squares (PLS)
  • More (Bayesian Principal Component Analysis,
    Least Squares Adaptive, Local Least Squares)
  • Underlying assumption: genes work cooperatively
    in groups. Genes with similar patterns provide
    information for MV imputation.

21
3. Missing Value Imputation
KNN.e KNN.c
  • choose k genes that are most similar to the
    gene with the missing value (MV)
  • estimate MV as the weighted mean of the neighbors
  • considerations
  • number of neighbors (k)
  • distance metric
  • normalization step

[Figure: expression (y-axis) across arrays (x-axis); "?" marks a randomly missing datum]
22
3. Missing Value Imputation
KNN.e KNN.c
  • parameter k
  • 10 usually works (5-15)
  • distance metric
  • euclidean distance (KNN.e)
  • correlation-based distance (KNN.c)
  • normalization?
  • not necessary for euclidean neighbors
  • required for correlation neighbors
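
A minimal sketch of KNN.e (Euclidean neighbors) under two assumptions that the slides do not spell out: neighbors are drawn only from genes with no missing values, and the weights are inverse distances. The function name is ours.

```python
import numpy as np

def knn_impute(X, k=10):
    """Impute NaNs in a genes x arrays matrix from the k nearest genes."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    has_mv = np.isnan(X).any(axis=1)
    complete = X[~has_mv]                        # candidate neighbor genes
    for i in np.where(has_mv)[0]:
        obs = ~np.isnan(X[i])                    # arrays observed for gene i
        diff = complete[:, obs] - X[i, obs]
        dist = np.sqrt((diff ** 2).mean(axis=1)) # Euclidean-type distance
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-8)   # assumed inverse-distance weights
        for j in np.where(~obs)[0]:
            filled[i, j] = np.average(complete[nearest, j], weights=weights)
    return filled

X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [10.0, 9.0, 8.0, 7.0],
    [1.05, 2.05, np.nan, 4.05],   # gene with one missing value
])
result = knn_impute(X, k=2)
print(round(result[4, 2], 2))  # 3.05
```

For KNN.c, the distance would be replaced by a correlation-based one after normalizing each gene.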

[Figure: expression (y-axis) across arrays (x-axis); "?" marks the missing datum]
23
3. Missing Value Imputation
OLS.e OLS.c
  • regression-based approach
  • KNN + OLS
  • algorithm
  • choose k neighbors (Euclidean or correlation;
    normalize or not)
  • the gene with the MV is regressed over the
    neighbor genes (one at a time, i.e. simple
    regression)
  • for each neighbor, MV is predicted from the
    regression model
  • MV is imputed as the weighted average of the k
    predictions
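
The steps above can be sketched for a single gene as follows. The slides do not say how the k predictions are weighted; the squared correlation used here is an assumption, as are the function and argument names.

```python
import numpy as np

def ols_impute_one(target, neighbors, missing_idx):
    """Impute target[missing_idx] from neighbor genes via simple regressions.

    target: 1-D expression profile with NaN at missing_idx.
    neighbors: k x arrays matrix of complete neighbor profiles.
    """
    obs = ~np.isnan(target)
    preds, weights = [], []
    for x in neighbors:
        slope, intercept = np.polyfit(x[obs], target[obs], 1)  # y = a + b x
        preds.append(intercept + slope * x[missing_idx])
        r = np.corrcoef(x[obs], target[obs])[0, 1]
        weights.append(r ** 2)          # assumed weight: squared correlation
    return float(np.average(preds, weights=weights))

target = np.array([2.0, 4.0, np.nan, 8.0, 10.0])
neighbors = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # positively related neighbor
    [5.0, 4.0, 3.0, 2.0, 1.0],   # negatively related neighbor
])
imputed = ols_impute_one(target, neighbors, missing_idx=2)
print(round(imputed, 6))  # 6.0
```

Unlike a plain neighbor average, the regression lets a negatively correlated neighbor contribute a useful prediction.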

24
3. Missing Value Imputation
OLS.e OLS.c
[Figure: expression (y-axis) across arrays (x-axis); "?" marks a randomly missing datum]
y_1 = a_1 + b_1 x_1
y_2 = a_2 + b_2 x_2
y = w_1 y_1 + w_2 y_2
25
3. Missing Value Imputation
SVD
  • Algorithm
  • set MVs to row average (need a starting point)
  • decompose the expression matrix into orthogonal
    components, eigengenes.
  • use the proportion, p, of eigengenes
    corresponding to the largest eigenvalues to
    reconstruct the MVs from the original matrix
    (i.e. improve your estimate)
  • use an EM approach to iteratively improve
    estimates of MVs until convergence
  • Assumption
  • The complete expression matrix can be
    well-decomposed by a smaller number of principal
    components.
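
A minimal sketch of the iterative scheme above. For simplicity it keeps a fixed number of eigengenes (`rank`) rather than a proportion p, and stops on a small change in the imputed entries; both choices and all names are assumptions.

```python
import numpy as np

def svd_impute(X, rank=1, max_iter=100, tol=1e-6):
    """Iteratively impute NaNs from a low-rank SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    row_means = np.nanmean(X, axis=1, keepdims=True)
    filled = np.where(missing, row_means, X)          # start from row averages
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # keep top eigengenes
        updated = np.where(missing, approx, X)         # only MVs are replaced
        change = np.sqrt(np.mean((updated - filled)[missing] ** 2))
        filled = updated
        if change < tol:                               # estimates have settled
            break
    return filled

# Rank-1 toy matrix with one entry removed; the true value is 1.0.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X = truth.copy()
X[0, 0] = np.nan
completed = svd_impute(X, rank=1)
print(round(completed[0, 0], 3))
```

On real data the rank (or proportion p) would need tuning, since too many eigengenes reproduce the noise in the starting guess.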

26
3. Missing Value Imputation
LSVD.e LSVD.c
  • KNN + SVD
  • choose k neighbors (Euclidean or correlation;
    normalize or not)
  • Perform SVD on the k nearest neighbors and get a
    prediction of the missing value.

27
3. Missing Value Imputation
PLS
  • PLS: select linear combinations of genes (PLS
    components) exhibiting high covariance with the
    gene having the MV.
  • The first linear combination of genes has the
    highest correlation with the target gene.
  • The second linear combination of genes has the
    greatest correlation with the target gene in the
    orthogonal space of the first linear combination.
  • MVs are then imputed by regressing the target
    gene onto the PLS components.

28
3. Missing Value Imputation
  • Types of missingness mechanism
  • Missing completely at random (MCAR)
  • Missingness is independent of both the observed
    values and the unobserved values.
  • Spot missing due to mis-printing or a dust
    particle.
  • Spot missing due to scratches.
  • Missing at random (MAR)
  • Missingness is independent of the unobserved data
    but depends on the observed data.
  • Missing not at random (MNAR)
  • Missingness depends on the unobserved data.
  • Spots missing due to saturation or low
    expression.

Currently, imputation methods work only for MCAR
data, not MNAR.
29
Which missing value imputation method to use in
expression profiles: a comparative study and two
selection schemes. Guy N. Brock, John R.
Shaffer, Richard E. Blakesley, Meredith J.
Lotz, George C. Tseng. BMC Bioinformatics,
2008
30
Comparative study
9 data sets: multiple exposure, time series, or
both.
7 methods were compared: KNN, OLS, LSA, LLS,
PLS, SVD, BPCA.
31
Comparative study
Global-based methods: PLS, SVD, BPCA
Neighbor-based methods: KNN, OLS, LSA, LLS
Intuitively, global-based methods require
that dimension reduction of the data can be
effectively performed. We define an entropy
measure for a given data set D to determine how
well the dimension reduction of the data can be
done (λ_i are the eigenvalues).
Low entropy: the first few eigenvalues dominate
and the data can be reduced to low dimension
effectively.
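
The exact formula is not preserved in this transcript. As a hedged illustration only, the sketch below uses one common form: the Shannon entropy of the eigenvalue proportions, normalized to lie between 0 and 1. The row centering, the normalization, and the function name are assumptions and may differ from the slide's definition.

```python
import numpy as np

def eigen_entropy(D):
    """Normalized Shannon entropy of eigenvalue proportions (assumed form)."""
    centered = D - D.mean(axis=1, keepdims=True)     # center each gene (row)
    s = np.linalg.svd(centered, compute_uv=False)
    lam = s ** 2                                     # eigenvalues lambda_i
    p = lam / lam.sum()                              # proportion per eigenvalue
    nz = p[p > 0]                                    # skip exact zeros in log
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

rng = np.random.default_rng(1)
# Nearly rank-1 data: one eigenvalue dominates, so entropy is low.
low_rank = np.outer(rng.normal(size=200), rng.normal(size=10))
low_rank += 0.01 * rng.normal(size=(200, 10))
# Pure noise: eigenvalues are comparable, so entropy is high.
noise = rng.normal(size=(200, 10))

low_e, high_e = eigen_entropy(low_rank), eigen_entropy(noise)
print(low_e < high_e)  # True
```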
32
Comparative study
LRMSE is the performance measure; the lower the
better. KNN, OLS, LSA, LLS are neighbor-based
methods and work better in high-entropy data
sets. PLS and SVD are global-based methods and
work better in low-entropy data sets, where
dimension reduction is effective.
33
Comparative study
Three methods (LSA, LLS, BPCA) performed best,
but none dominated. Two selection schemes (an
entropy-based scheme and a self-training scheme)
were used to select the best imputation method.