Summer Inst. Of Epidemiology and Biostatistics,

2010Gene Expression Data Analysis130pm

500pm in Room W2015

- Carlo Colantuoni
- carlo_at_illuminatobiotech.com

http//www.illuminatobiotech.com/GEA2010/GEA2010.h

tm

Class Outline

- Basic Biology Gene Expression Analysis

Technology - Data Preprocessing, Normalization, QC
- Measures of Differential Expression
- Multiple Comparison Problem
- Clustering and Classification
- The R Statistical Language and Bioconductor
- GRADES independent project with Affymetrix data.

http//www.illuminatobiotech.com/GEA2010/GEA2010.h

tm

Class Outline - Detailed

- Basic Biology Gene Expression Analysis

Technology - The Biology of Our Genome Transcriptome
- Genome and Transcriptome Structure Databases
- Gene Expression Microarray Technology
- Data Preprocessing, Normalization, QC
- Intensity Comparison Ratio vs. Intensity Plots

(log transformation) - Background correction (PM-MM, RMA, GCRMA)
- Global Mean Normalization
- Loess Normalization
- Quantile Normalization (RMA GCRMA)
- Quality Control Batches, plates, pins, hybs,

washes, and other artifacts - Quality Control PCA and MDS for dimension

reduction - SVA Surrogate Variable Analysis
- Measures of Differential Expression
- Basic Statistical Concepts
- T-tests and Associated Problems
- Significance analysis in microarrays (SAM)

Empirical Bayes - Complex ANOVAs (limma package in R)
- Multiple Comparison Problem

DAY 3

Measures of Differential Expression Review of

basic statistical concepts T-tests and

associated problems Significance analysis in

microarrays (SAM) (Empirical Bayes) Complex

ANOVAs (limma package in R) Multiple

Comparison Problem Bonferroni FDR Differential

Expression of Functional Gene Groups Notes on

Experimental Design

Slides from Rob Scharpf

Fold-Change?T-Statistics?

Some genes are more variable than others

Slides from Rob Scharpf

Slides from Rob Scharpf

Slides from Rob Scharpf

Slides from Rob Scharpf

distribution of

distribution of

Slides from Rob Scharpf

Slides from Rob Scharpf

X1-X2 is normally distributed if X1 and X2 are

normally distributed is this the case in

microarray data?

Slides from Rob Scharpf

Problem 1 T-statistic not t-distributed.

Implication p-values/inference incorrect

P-values by permutation

- It is common that the assumptions used to derive

the statistics are not approximate enough to

yield useful p-values (e.g. when T-statistics are

not T distributed.) - An alternative is to use permutations.

p-values by permutations

- We focus on one gene only. For the bth iteration,

b 1, ??? , B - Permute the n data points for the gene (x). The

first n1 are referred to as treatments, the

second n2 as controls. - For each gene, calculate the corresponding two

sample t-statistic, tb. - After all the B permutations are done
- p b tb tobserved / B
- This does not yet address the issue of multiple

tests!

The volcano plot shows, for a particular test,

negative log p-value against the effect size (M).

Another problem with t-tests

Remember this?

Problem 2 t-statistic bigger for genes with

smaller standard error estimates.Implication

Ranking might not be optimal

Problem 2

- With low Ns SD estimates are unstable
- Solutions
- Significance Analysis in Microarrays (SAM)
- Empirical Bayes methods and Stein estimators

Significance analysis in microarrays (SAM)

- A clever adaptation of the t-ratio to borrow

information across genes - Implemented in Bioconductor in the siggenes

package

Significance analysis of microarrays applied to

the ionizing radiation response, Tusher et al.,

PNAS 2002

SAM d-statistic

- For gene i

mean of sample 1

mean of sample 2

Standard deviation of repeated measurements for

gene i

Exchangeability factor estimated using all genes

Minimize the average CV across all genes.

Scatter plots of relative difference (d) vs

standard deviation (s) of repeated expression

measurements

A fix for this problem

Relative difference for a permutation of the

data that was balanced between cell lines 1 and 2.

Random fluctuations in the data, measured by

balanced permutations (for cell line 1 and 2)

eBayes Borrowing Strength

- An advantage of having tens of thousands of genes

is that we can try to learn about typical

standard deviations by looking at all genes - Empirical Bayes gives us a formal way of doing

this - Shrinkage of variance estimates toward a

prior moderated t-statistics eliminates

extreme stats due to small variances. - Implemented in the limma package in R. In

addition, limma provides methods for more complex

experimental designs beyond simple, two-sample

designs.

The Multiple Comparison Problem

- (some slides courtesy of John Storey)

Hypothesis Testing

- Test for each gene
- Null Hypothesis no differential expression.
- Two types of errors can be committed
- Type I error or false positive (say that a gene

is differentially expressed when it is not, i.e.,

reject a true null hypothesis). - Type II error or false negative (fail to identify

a truly differentially expressed gene, i.e.,fail

to reject a false null hypothesis)

Hypothesis testing

- Once you have a given score for each gene, how do

you decide on a cut-off? - p-values are most common.
- How do we decide on a cut-off when we are looking

at many 1000s of tests? - Are 0.05 and 0.01 appropriate? How many false

positives would we get if we applied these

cut-offs to long lists of genes?

Multiple Comparison Problem

- Even if we have good approximations of our

p-values, we still face the multiple comparison

problem. - When performing many independent tests, p-values

no longer have the same interpretation.

Bonferroni Procedure

a 0.05 Tests 1000a 0.05 / 1000

0.00005orp p 1000

Bonferroni Procedure

Too conservative.How else can we interpret many

1000s of observed statistics?Instead of

evaluating each statistic individually, can we

assess a list of statistics FDR (Benjamini

Hochberg 1995)

FDR

- Given a cut-off statistic, FDR gives us an

estimate of the proportion of hits in our list of

differentially expressed genes that are false.

Null Equivalent Expression Alternative

Differential Expression

False Discovery Rate

- The false discovery rate measures the

proportion of false positives among all genes

called significant - This is usually appropriate because one wants to

find as many truly differentially expressed genes

as possible with relatively few false positives - The false discovery rate gives an estimate of the

rate at which further biological verification

will result in dead-ends

Distribution of Statistics

N90

Permuted

Observed

Statistic

Distribution of Statistics

FDR

Permuted

Observed

Permuted

Observed

Statistic

Distribution of p-values

N90

Observed

Permuted

p-value

SAM produces a modified T-statistic (d), and has

an approach to the multiple comparison problem.

Scatter plots of relative difference (d) vs

standard deviation (s) of repeated expression

measurements

A fix for this problem

Relative difference for a permutation of the

data that was balanced between cell lines 1 and 2.

Random fluctuations in the data, measured by

balanced permutations (for cell line 1 and 2)

Selected genesBeyond expected distribution

FDR False Positives/Total Positive CallsThis

FDR analysis requires enough samples in each

condition to estimate a statistic for each gene

observed statistic distribution.And enough

samples in each condition to permute many times

and recalculate this statistic null statistic

distribution.What if we dont have this?

FDR 0.05 Beyond 0.9

FDR 0.05 Beyond 0.9

(No Transcript)

(No Transcript)

False Positive Rate versus False Discovery Rate

- False positive rate is the rate at which truly

null genes are called significant - False discovery rate is the rate at which

significant genes are truly null

False Positive Rate and P-values

- The p-value is a measure of significance in terms

of the false positive rate (aka Type I error

rate) - P-value is defined to be the minimum false

positive rate at which the statistic can be

called significant - Can be described as the probability a truly null

statistic is as or more extreme than the

observed one

False Discovery Rate and Q-values

- The q-value is a measure of significance in terms

of the false discovery rate - Q-value is defined to be the minimum false

discovery rate at which the statistic can be

called significant - Can be described as the probability a statistic

as or more extreme is truly null

Power and Sample Size Calculations are Hard

- Need to specify
- a (Type I error rate, false positives) or FDR
- s (stdev will be sample- and gene-specific)
- Effect size (how do we estimate?)
- Power (1-b, bType II error rate)
- Sample Size
- Some papers
- Mueller, Parmigiani et al. JASA (2004)
- Rich Simons group Biostatistics (2005)
- Tibshirani. A simple method for assessing sample

sizes in microarray experiments. BMC

Bioinformatics. 2006 Mar 27106.

(No Transcript)

Beyond Individual Genes Functional Gene Groups

- Borrow statistical power across entire dataset

- Integrate preexisting biological knowledge

Functional Annotation of Lists of Genes

KEGG PFAM SWISS-PROT GO DRAGON DAVID/EASE MatchMin

er BioConductor (R)

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

(No Transcript)

Gene Cross-Referencing and Gene Annotation Tools

In BioConductor (in the R statistical language)

annotate package

Microarray-specific metadata

packages DB-specific metadata packages

AnnBuilder package

Annotation Tools In BioConductor annotate package

Functions for accessing data in metadata packages.

Functions for accessing NCBI databases.

Functions for assembling HTML tables.

Annotation Tools In BioConductor Annotation for

Commercial Microarrays Array-specific metadata

packages

Annotation Tools In BioConductor Functional

Annotation with other DBs GO metadata package

Annotation Tools In BioConductor Functional

Annotation with other DBs KEGG metadata package

Is there enrichment in our list of differentially

expressed genes for a particular functional gene

group or pathway?

Threshold Enrichment One Way of Assessing

Differential Expression of Functional Gene Groups

Threshold Enrichment

Threshold Enrichment One Way of Assessing

Differential Expression of Functional Gene Groups

Threshold Enrichment One Way of Assessing

Differential Expression of Functional Gene Groups

The argument lower.tail will indicate if you are

looking for over- or under- representation of

differentially expressed genes within a

particular functional group (using lower.tailF

for over-representation).

Can we use more of our data than Threshold

Enrichment (that only uses the top of our gene

list)?

- Beyond threshold enrichment

Functional Gene Subgroups within An Experiment

Statistics for Analysis of Differential

Expression of Gene Subgroups

Is THIS

Different from THIS?

Over-Expression of a Group of Functionally

Related Genes

plt7.42e-08

T statistic

Is THIS

Different from THIS?

Conceptually Distinct from Threshold Enrichment

and the Hypergeometric test!

Statistical Tests c2 Kolmogorov-Smirnov Product

of Probabilities GSEA PAGE geneSetTest (Wilcoxon

rank sum)

c2

All Genes

c2 is the sum of D values where

Subset of Interest

E

histogram bins

O

Kolmogorov-Smirnov

All Genes

Subset of Interest

Product of Individual Probabilities

All Genes

Subset of Interest

What shape/type of distributions would each of

these tests be sensitive to?

Statistics from gene subgroup

All statistics

Gene Set Enrichment Analysis (GSEA)

Subramanian et al, 2005 PNAS

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis (GSEA)

Parametric Analysis of Gene Set Enrichment (PAGE)

Kim et al, 2005 BMC Bioinformatics

Parametric Analysis of Gene Set Enrichment (PAGE)

(No Transcript)

Sm-m

Z

s/m0.5

A simple method in Bioconductor

geneSetTest(limma)

Test whether a set of genes is enriched for

differential expression. Usage geneSetTest(sele

cted,statistics,alternative"mixed",type"auto",ra

nks.onlyTRUE,nsim10000)

The test statistic used for the gene-set-test is

the mean of the statistics in the set. If

ranks.only is TRUE the only the ranks of the

statistics are used. In this case the p-value is

obtained from a Wilcoxon test. If ranks.only is

FALSE, then the p-value is obtained by simulation

using nsim random selected sets of

genes. Arguement alternative mixed or

either fundamentally different questions.

Wilcoxon test

What shape/type of distributions would each of

these tests be sensitive to?

Statistics from gene subgroup

All statistics

Analysis of Gene Networks

Large Protein Interaction Network

Network Regulated in Sample 1

Large Protein Interaction Network

Network Regulated in Sample 2

Network Regulated in Sample 1

Large Protein Interaction Network

Network Regulated in Sample 3

Network Regulated in Sample 1

Network Regulated in Sample 2

Large Protein Interaction Network

Network of Interest

Network Regulated in Sample 1

Network Regulated in Sample 2

Network Regulated in Sample 3

Additional Notes on Experimental Design

Old-School Experimental Design Randomization

Replicates in a mouse model

Dissection of tissue

Biological Replicates

RNA Isolation

Amplification

Technical Replicates

Probe labelling

Hybridization

Common question in experimental design

- Should I pool mRNA samples across subjects in an

effort to reduce the effect of biological

variability (or cost)?

Two simple designs

- The following two designs have roughly the same

cost - 3 individuals, 3 arrays
- Pool of three individuals, 3 technical replicates
- To a statistician the second design seems

obviously worse. But, I found it hard to convince

many biologist of this. - 3 pools of 3 animals on individual arrays?

Cons of Pooling Everything

- You can not measure within class variation
- Therefore, no population inference possible
- Mathematical averaging is an alternative way of

reducing variance. - Pooling may have non-linear effects
- You can not take the log before you

average Elog(XY) ? Elog(X) Elog(Y) - You can not detect outliers

If the measurements are independent and

identically distributed

Cons specific to microarrays

- Different genes have dramatically different

biological variances. - Not measuring this variance will result in genes

with larger biological variance having a better

chance of being considered more important

Higher variance larger fold change

We compute fold change for each gene (Y

axis) From 12 individuals we estimate gene

specific variance (X axis) If we pool we never

see this variance.

Remember this?

Useful Books Statistical analysis of gene

expression microarray data Speed. Analysis

of gene expression data Parmigianni Bioinform

atics and computational biology solutions using

R - Irizarry