1
3. A brief look into data analysis for
microarray experiments
  • Alex Sánchez. Departament d'Estadística.
  • Universitat de Barcelona

2
Outline
  • Introduction
  • 2-fold change approach (2 conditions, 1 sample)
  • T-tests and extensions (2 conditions, > 1 sample)
  • Significance and multiple testing
  • More than Two Conditions
  • A brief introduction to clustering

3
Introduction
4
Identifying differentially expressed genes
(filtering?)
  • Goal: identify genes associated with a covariate
    or response of interest, such as
  • Qualitative covariates: treatment, cell type, ...
  • Quantitative covariates: dose, time
  • Responses: survival, infection time
  • Any combination of these!
  • Selecting a subset of differentially expressed
    genes is sometimes called filtering
  • It is a preliminary step for other analysis procedures

5
Life cycle
[Flow diagram: the microarray life cycle — biological question, experimental design, microarray experiment, quality measurement (pass/fail, with failed experiments sent back), image analysis, normalization, and analysis (estimation, testing, clustering, discrimination), ending in biological verification and interpretation. "Today" marks the analysis step.]
6
The experimental frame
  • We can distinguish different situations
  • No replicates (one chip/condition)
  • 2 conditions → one k-fold change analysis
  • > 2 conditions → several k-fold change analyses
  • There are replicates
  • 2 conditions on one chip → one-sample tests
  • 2 conditions on two chips → two-sample tests
  • More than 2 conditions → ANOVA or linear models

7
Experimental frame for 2 conditions
  • One chip per condition → 2-fold change
  • Null hypothesis H0: log2(Rg/Gg) = 0, g = 1, ..., G
  • Decision based on one value per gene
  • Several chips → one- or two-sample tests
  • If the two samples are on the same array
    (replicated arrays) →
  • H0: log2(R/G) = 0. Decision based on average
    log ratios
  • If we have a common reference (both samples
    hybridize to the same control) →
  • The hypothesis changes to log2(R1/G) - log2(R2/G) = 0
  • Decision based on the average difference of log ratios

8
Two-fold change (two conditions, a single chip)
9
Fold change (single slide) methods
  • The observed (log2) ratio between two conditions
    is used
  • Arbitrary cut-off value (2-fold?)
  • Not a statistical test → no associated level
    of confidence
  • Some known problems
  • Subject to bias if data not properly normalized
  • Sensitive to variance heterogeneity across genes
  • Solutions have been suggested (next slide)

10
Origin of the 2-fold change approach
  • DeRisi et al. (1997) found only 19/6300 false
    positives using this criterion
  • Perhaps correct for their experiment, but
  • Similar controls would be required for each new
    experiment → redefinition of the k-fold value
  • May be influenced by experimental factors
  • A practical reason: often there are no replicates
    because they are too expensive
  • Making inferences with samples of size 1?!

11
Some approaches to use of k-fold-change criteria
(1)
  • A naïve approach: standardize the data
  • Assumption of normality (following Chen et al.,
    1998)
  • Ratios can be converted into Z-scores
  • Each of which has an associated P-value
  • Problems
  • Assumes normality and homoscedasticity
  • Must use robust estimates of centrality (e.g.
    median) and dispersion (e.g. MAD)

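As an illustration of this naïve standardization, here is a minimal Python/NumPy sketch (the function name robust_z_pvalues and the toy data are mine, not part of the slides): the log2 ratios are centred with the median, scaled with the MAD, and converted into two-sided normal p-values.

```python
import numpy as np
from scipy import stats

def robust_z_pvalues(log_ratios):
    """Convert per-gene log2 ratios into robust Z-scores and two-sided
    normal p-values, using the median and MAD instead of mean and SD."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    center = np.median(log_ratios)
    # MAD rescaled by 1.4826 so it estimates the SD under normality
    spread = 1.4826 * np.median(np.abs(log_ratios - center))
    z = (log_ratios - center) / spread
    p = 2 * stats.norm.sf(np.abs(z))      # two-sided normal p-value per gene
    return z, p

# toy example: 5,000 genes, 10 of them with a true 2-fold (log2 = 1) change
rng = np.random.default_rng(0)
m = np.concatenate([rng.normal(0, 0.3, 4990), rng.normal(1, 0.3, 10)])
z, p = robust_z_pvalues(m)
print((p < 0.001).sum(), "genes flagged at nominal p < 0.001")
```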
12
Some approaches to use of k-fold-change criteria
(2)
  • More sophisticated approaches have been proposed
  • Mainly based on relaxing distributional
    assumptions
  • Newton et al.: a Gamma-Gamma-Bernoulli hierarchical
    model for each (R, G)
  • Roberts et al.: each (R, G) is assumed to be
    normally and independently distributed, with
    variance depending linearly on the mean
  • Sapir & Churchill: each log R/G is assumed to be
    distributed according to a mixture of normal and
    uniform distributions. Decision based on R/G only

13
Example: Matt Callow's Srb1 data (5).
Newton's and Chen's single-slide methods
  • It is not hard to do by eye
  • The problem is probably beyond formal statistical
    inference (valid p-values, etc.) for the
    foreseeable future. Why?

14
T-tests and extensions (2 conditions, several
chips)
15
Tests of Differential Expression between two
conditions, several chips
  • With several replicates per condition →
  • the variability of gene expression,
  • on a gene-by-gene basis,
  • can be taken into account
  • Natural measures of differential expression
    will be based on the mean, or difference of
    means, conveniently standardized

16
Natural measures of discrepancy
Direct comparisons
Indirect comparisons
17
Some Issues
  • Can we trust average effect sizes (average
    difference of means) alone?
  • Can we trust the t statistic alone?
  • Here is evidence that the answer is no.

Courtesy of Y.H. Yang
18
Some Issues
  • Can we trust average effect sizes (average
    difference of means) alone?
  • Can we trust the t statistic alone?
  • Here is evidence that the answer is no.

Courtesy of Y.H. Yang
  • Averages can be driven by outliers.

19
Some Issues
  • Can we trust average effect sizes (average
    difference of means) alone?
  • Can we trust the t statistic alone?
  • Here is evidence that the answer is no.
  • t statistics can be driven by tiny variances.

Courtesy of Y.H. Yang
20
Variations in t-tests (1)
  • Let
  • Rg = mean observed log ratio
  • SEg = standard error of Rg, estimated from the data
    on gene g
  • SE = standard error of Rg, estimated from the data
    across all genes
  • Global t-test: t = Rg / SE
  • Gene-specific t-test: t = Rg / SEg

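A minimal NumPy sketch of the two variants (the function name and toy data are mine; the slide does not say how the global SE is estimated across genes, so the median of the gene-wise SEs is used here as a placeholder assumption).

```python
import numpy as np

def global_and_gene_t(R):
    """R: genes x replicate-arrays matrix of log ratios.
    Returns (global t, gene-specific t) for every gene."""
    R = np.asarray(R, dtype=float)
    n = R.shape[1]
    Rg = R.mean(axis=1)                        # mean observed log ratio per gene
    SEg = R.std(axis=1, ddof=1) / np.sqrt(n)   # SE estimated from gene g only
    SE = np.median(SEg)                        # one SE shared by all genes (assumption)
    return Rg / SE, Rg / SEg

# toy data: 1,000 genes measured on 4 replicate arrays
rng = np.random.default_rng(1)
R = rng.normal(0.0, 0.25, size=(1000, 4))
t_global, t_gene = global_and_gene_t(R)
```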
21
Some pros and cons of t-test
22
Alternatives suggested
  • SAM
  • Regularized t-test
  • B (empirical Bayes) statistic
  • Others

23
SAM t-test or S-test
  • Adds a small constant to the denominator
    (c = the 90th percentile of the SEg)
  • Genes with small fold changes will not be
    selected as significant

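A hedged sketch of this SAM-style statistic, taking c as the 90th percentile of the gene-wise standard errors as the slide indicates (the actual SAM procedure chooses the constant more carefully, by minimizing the coefficient of variation of the statistic).

```python
import numpy as np

def sam_like_statistic(R):
    """SAM-style moderated statistic for a genes x replicates matrix of
    log ratios: mean / (SE + c), with the small constant c taken as the
    90th percentile of the gene-wise standard errors."""
    R = np.asarray(R, dtype=float)
    Rg = R.mean(axis=1)
    SEg = R.std(axis=1, ddof=1) / np.sqrt(R.shape[1])
    c = np.percentile(SEg, 90)      # the added 'fudge' constant
    return Rg / (SEg + c)
```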
24
Regularized t-test statistic
  • ν0: controls the relative contributions of the common
    (prior) variance σ0 and the gene-specific variance σg
  • n: number of replicate measurements for each
    condition

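Since the slide's formula is only partially legible, here is a hedged sketch of one common regularized t statistic, in the spirit of Baldi & Long's Cyber-T; the exact weighting below and the choice of the prior variance as the average gene-wise variance are assumptions on my part.

```python
import numpy as np

def regularized_t(R, nu0=10):
    """Regularized (moderated) one-sample t for a genes x replicates
    matrix of log ratios.  A pooled prior variance is combined with the
    gene-specific variance; nu0 sets their relative weights."""
    R = np.asarray(R, dtype=float)
    n = R.shape[1]
    s2_gene = R.var(axis=1, ddof=1)             # gene-specific variance
    s2_prior = s2_gene.mean()                   # simple choice of prior variance (assumption)
    s2 = (nu0 * s2_prior + (n - 1) * s2_gene) / (nu0 + n - 2)
    return R.mean(axis=1) / np.sqrt(s2 / n)
```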
25
Can we generate a list of candidate genes?
With the tools we have, the reasonable steps to
generate a list of candidate genes may be:
[Flow diagram ending in a list of candidate DE genes]
We need an idea of how significant these
values are → we'd like to assign them p-values
26
Significance and multiple testing
27
Nominal p-values
  • After a test statistic is computed, it is
    convenient to convert it to a p-value: the
    probability of occurrence of a test statistic
    equal to, or more extreme than, the observed value
    under the assumption that the null hypothesis is
    true, p = P(S ≥ S0 | H0 true)

28
Significance testing
  • Test of significance at the α level:
  • Reject the null hypothesis if your p-value is
    smaller than the significance level
  • It has advantages, but it is not free from criticism
  • Genes with p-values falling below a prescribed
    level may be regarded as significant

29
Calculation of p-values
  • Standard methods for calculating p-values
  • (i) Refer to a statistical distribution table
    (Normal, t, F, ...), or
  • (ii) Perform a permutation analysis

30
(i) Tabulated p-values
  • Tabulated p-values can be obtained for standard
    test statistics (e.g. the t-test)
  • They often rely on the assumption of normally
    distributed errors in the data
  • This assumption can be checked (approximately)
    using a
  • Histogram
  • Q-Q plot

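A small SciPy/matplotlib sketch of the "table look-up" and of the graphical checks shown on the next slide; the simulated t-statistics are a stand-in for real ones, and the function name is mine.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def tabulated_pvalues(t_stats, df):
    """Two-sided p-values from the t distribution (the 'table look-up')."""
    return 2 * stats.t.sf(np.abs(t_stats), df)

# stand-in t-statistics for 2,000 genes; with real data these would come
# from the gene-specific t-tests of the previous section
rng = np.random.default_rng(2)
t_stats = rng.standard_t(df=6, size=2000)
p = tabulated_pvalues(t_stats, df=6)

# quick (approximate) check of the distributional assumption
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(t_stats, bins=50)                      # histogram
stats.probplot(t_stats, dist="norm", plot=ax2)  # Q-Q plot against the normal
plt.show()
```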
31
Histogram and QQ-plots of t-statistics
32
More about QQ plots
  • Not only useful for checking normality
  • Also useful to identify genes with extreme t-values:
    values off the line, at one end or the other
  • Very useful with thousands of genes, but
  • We can't expect all differentially expressed
    genes to stand out as extremes
  • Many will be masked by more extreme random
    variation, which is a big problem in this context

33
(ii) Permutation tests
  • Based on data shuffling; no distributional assumptions
  • Random interchange of labels between samples
  • Estimate p-values for each comparison (gene) by
    using the permutation distribution of the
    t-statistics
  • Repeat for every possible permutation, b = 1, ..., B
  • Permute the n data points for the gene (x). The
    first n1 are referred to as treatments, the
    second n2 as controls.
  • For each gene, calculate the corresponding two
    sample t-statistic, tb.
  • After all the B permutations are done, set
    p = #{b : tb ≥ tobserved} / B

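A minimal NumPy sketch of this recipe for a single gene; it uses B random label shuffles rather than all possible permutations, and a two-sided comparison of |t| (both choices are assumptions of mine, and the function name is hypothetical).

```python
import numpy as np

def permutation_pvalue(x, n1, B=1000, rng=None):
    """Two-sample permutation p-value for one gene.
    x: the n = n1 + n2 expression values; the first n1 are 'treatment'.
    The labels are shuffled B times and the observed |t| is compared with
    the permutation distribution of |t|."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)

    def tstat(v):
        a, b = v[:n1], v[n1:]
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return (a.mean() - b.mean()) / se

    t_obs = tstat(x)
    t_perm = np.array([tstat(rng.permutation(x)) for _ in range(B)])
    return np.mean(np.abs(t_perm) >= np.abs(t_obs))

# example: 8 treatment vs 8 control values for a single gene
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(1.0, 1, 8), rng.normal(0.0, 1, 8)])
print(permutation_pvalue(x, n1=8, B=2000, rng=4))
```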
34
Permutation tests (2)
35
Permutation tests (3)
36
Are these p-values correct?
  • Statistical tests usually control the type I error:
    the probability of rejecting H0 when it is true
  • A high number of tests → problem!
  • If we perform 10,000 simultaneous tests on
    samples under the null hypothesis
  • And fix a type I error α of 2%
  • We expect to reject H0 about 200 times
  • To avoid this, adjust the p-values

37
Why should we adjust p-values?
  • A simulation study illustrates why
  • Simulations of this process for 6,000 genes with
    8 treatments and 8 controls.
  • All the gene expression values were simulated
    independently and identically distributed (i.i.d.)
    from a N(0,1) distribution,
  • i.e. NOTHING is differentially expressed.

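A small sketch of such a null simulation, with the gene counts and group sizes from the slide (the use of scipy.stats.ttest_ind and the seed are my choices).

```python
import numpy as np
from scipy import stats

# 6,000 genes, 8 'treatment' and 8 'control' arrays, everything N(0,1):
# nothing is differentially expressed, yet many nominal p-values are small.
rng = np.random.default_rng(5)
treat = rng.normal(size=(6000, 8))
ctrl = rng.normal(size=(6000, 8))
t, p = stats.ttest_ind(treat, ctrl, axis=1)

print((p < 0.05).sum(), "genes with p < 0.05 (about 300 expected by chance)")
print((p < 0.01).sum(), "genes with p < 0.01 (about 60 expected by chance)")
```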
38
Unadjusted p-values
Clearly we can't just use standard p-value
thresholds (.05, .01).
39
Steps to generate a list of candidate genes
revisited (2)
Nominal p-values P1, P2, ..., PG →
adjusted p-values aP1, aP2, ..., aPG →
select genes with adjusted p-values smaller than
α →
a list of candidate DE genes
40
Multiple Testing
  • Define an adequate type I error
  • Use a procedure that
  • Ensures strict control of the type I error
  • Is powerful (few false negatives)
  • Is based on the joint distribution of the
    multiple tests
  • Calculate an adjusted p-value for each gene that
    reflects the global type I error

41
Multiple Testing (2): Approaches
  • Two alternative approaches to controlling the type I
    error in multiple testing are
  • Control the family-wise error rate (FWER): the
    probability of making one or more type I errors
  • Bonferroni, Westfall and Young, etc.
  • Control the false discovery rate (FDR): the expected
    proportion of false positives among all of the
    rejected null hypotheses

42
Multiple Testing (3)
  • False discovery rate: FDR = E(V/R), where V is the
    number of false rejections and R the total number of
    rejections
  • Family-wise error rate: FWER = P(V ≥ 1)

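A minimal NumPy sketch of two adjustments that target these two error rates: Bonferroni for the FWER and Benjamini-Hochberg step-up adjusted p-values for the FDR (function names are mine).

```python
import numpy as np

def bonferroni(p):
    """FWER control: multiply each p-value by the number of tests."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p):
    """FDR control: Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

p = np.array([0.0002, 0.01, 0.03, 0.4])
print(bonferroni(p))           # [0.0008 0.04   0.12   1.    ]
print(benjamini_hochberg(p))   # [0.0008 0.02   0.04   0.4   ]
```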
43
Some Advantages of p-value Adjustment
  • Test level (size) does not need to be determined
    in advance
  • Some procedures are most easily described in terms of
    their adjusted p-values
  • These are usually easily estimated using resampling
  • Procedures can be readily compared based on the
    corresponding adjusted p-values

44
Some p-value adjustment methods
  • Bonferroni adjustment
  • Westfall, P.H. and Young, S.S. (1993). Resampling-based
    multiple testing (maxT).
  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the
    false discovery rate: a practical and powerful
    approach to multiple testing.
  • Storey, J. (2001). The positive false discovery
    rate: a Bayesian interpretation and the q-value.
  • Ge, Y. et al. (2001). Fast algorithm for
    resampling-based p-value adjustment for multiple testing.

45
More than 2 conditions
  • K-samples experiments
  • Factorial experiments

46
More than Two Conditions
  • Many experiments are intended to make complex
    comparisons
  • compare several conditions (> 2 treatments),
  • compare the joint effect of two drugs,
  • compare two strains of maize (mutant vs. wild type)
    at two different times
  • These can be analysed using factorial designs,
    which involve the use of ANOVA models
  • We don't discuss them here; see the references

47
ANOVA model
48
Finding patterns in genes
  • Introduction to clustering

49
Expression profiles in microarray data
[Diagram: the expression matrix, with genes as rows and
experimental conditions as columns. A column is the expression
profile of all genes in a single experimental condition; a row is
the expression profile of one gene across all experimental
conditions.]
50
Why should we cluster data?
  • Gene expression studies assume that genes with
    similar function
  • Have similar patterns of expression
  • Have common transcription factors
  • If we believe this is true, the analysis of
    patterns in the expression matrix should help
    to identify
  • Biological functions for uncharacterized genes
  • Transcription factors for genes

51
Does common expression mean common regulation?
52
Genes associated with pathologies
53
Identification of leukemia types
54
Cluster analysis
  • Multivariate statistical methods (data mining,
    machine learning, ...) that,
  • Given a set of individuals (points),
  • Characterized by a series of attributes,
  • And a similarity measure between them,
  • Allow us to form groups (clusters) such that
  • Points inside a group are more similar to each
    other than to points in different groups

55
Clustering: group identification
56
Clustering expression data
  • Cluster genes (rows), to (try to) identify groups
    of co-regulated genes.
  • Cluster samples (columns), to (try to) identify
    phenotypes using their molecular profiles
  • One can cluster genes and samples simultaneously

57
Basic issues in clustering
  • Issues to decide before clustering
  • Which genes/arrays to use?
  • Which similarity/dissimilarity measure?
  • Which clustering algorithm?
  • It's an exploratory technique
  • There's no optimal solution
  • Any method will yield groups
  • Which method gives good groups?

58
(No Transcript)
59
Similarity measures
  • A similarity/dissimilarity measure between two
    individuals i, j is an index, sij (usually
    between 0 and 1, 0 ≤ sij ≤ 1), which measures
    how strongly i and j are related
  • It is usually measured using
  • A similarity coefficient for categorical
    variables
  • A distance function for continuous
    variables (this is the case for expression data)

60
Similarity measures (1)
61
Distance functions (2)
  • Amounts to Pearson's correlation coefficient if the
    variables are centered

62
Pearson's correlation coefficient
  • Widely used in expression studies
  • Some known problems
  • Ignores data variability
  • Can yield false positives (a-b)
  • Robust variants exist, e.g.
  • the jackknife correlation

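A minimal NumPy sketch of the 1 - Pearson correlation dissimilarity between gene profiles (the robust jackknife variant mentioned above is not implemented here; the function name is mine).

```python
import numpy as np

def correlation_distance(X):
    """Pairwise 1 - Pearson correlation between rows (gene profiles).
    Profiles are centered and scaled, which is why this matches a
    distance on standardized data when variables are centered."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)
    # assumes no constant gene profiles (those would give a zero norm)
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)
    return 1.0 - Xc @ Xc.T
```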
63
Clustering algorithms
  • Many algorithms, based upon many ideas
  • Most popular
  • Hierarchical methods
  • Agglomerative (N → 1) or divisive (1 → N)
  • Partitioning methods
  • k-means, self-organizing maps (SOM)
  • Other
  • Support Vector Machines (SVM)

64
Clustering algorithms
65
Partitioning methods
  • Each element is initially assigned to one of K
    groups, which have been specified a priori
  • A cost function is used to re-assign individuals
    to groups until an optimality criterion is
    reached (e.g. minimize the within-cluster sum of squares)
  • Some partitioning methods
  • k-means, partitioning around medoids (PAM),
    self-organizing maps (SOM), model-based
    clustering

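A small scikit-learn sketch of one such partitioning method, k-means, on a toy expression matrix (the library choice, seeds, and toy data are mine).

```python
import numpy as np
from sklearn.cluster import KMeans

# toy expression matrix: 300 genes x 6 conditions, K = 3 groups fixed a priori
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(mu, 0.5, size=(100, 6)) for mu in (-1.0, 0.0, 1.0)])

# k-means re-assigns genes to the nearest centroid until the
# within-cluster sum of squares stops improving
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
print(np.bincount(labels))      # cluster sizes
```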
66
Partitioning methods
67
Hierarchical methods
  • The number of groups isn't defined a priori
  • Build a dendrogram, such that the closer two
    individuals are joined in the tree, the more
    similar they are
  • Cutting it at any level gives clusters
  • Can be
  • Agglomerative (bottom-up): N → 1 groups
  • Divisive (top-down): 1 → N groups

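A small SciPy sketch of agglomerative clustering of gene profiles, using 1 - correlation as the dissimilarity and average linkage; cutting the tree into 4 groups is an arbitrary illustrative choice, and the toy data are mine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# agglomerative (bottom-up) clustering of gene profiles
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 6))                 # 50 genes x 6 conditions

D = pdist(X, metric="correlation")           # condensed 1 - r distance matrix
Z = linkage(D, method="average")             # the dendrogram as a linkage matrix

# cutting the tree at a chosen level yields the clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.bincount(labels)[1:])               # sizes of the 4 groups
```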
68
Hierarchical methods
69
Partitioning vs Hierarchical
Hierarchical
  • Advantages
  • Faster computation
  • Visual
  • Disadvantages
  • Unrelated genes are eventually joined
  • Rigid, cannot correct later for erroneous
    decisions made earlier
  • Hard to define clusters
Partitioning
  • Advantages
  • Optimal for certain criteria
  • Genes automatically assigned to clusters
  • Disadvantages
  • Need the initial k
  • Often require long computation times
  • All genes are forced into a cluster

70
Example: yeast cell cycle
71
Acknowledgments
  • Special thanks to Yee Hwa Yang (UCSF) for
    allowing me to use some of her materials
  • Sandrine Dudoit and Terry Speed, U.C. Berkeley
  • M. Carme Ruíz de Villa, U. Barcelona
  • Sara Marsal, U. Reumatología, HVH Barcelona