Analysis of Large Scale Gene Expression Data presentation

About This Presentation

Transcript and Presenter's Notes

Title: Analysis of Large Scale Gene Expression Data

1
Analysis of Large Scale Gene Expression Data

John Quackenbush
September 2005

2
Microarray Analysis

General microarray overview
How do platforms compare?
Should I subtract background?
TM4 Array Solutions
Annotating arrays
How complete is the genome?
Tools for array analysis
Experimental design
Normalization
What do we measure?
Some concepts from statistics
Distributions
Randomization Tests
Multiple Testing
Finding significant genes
Fold Change
T-tests
ANOVA
SAM

Hierarchical clustering
Bootstrapping
Jack-knifing
Self-organizing trees
K-means/k-medians Clustering
Self-organizing maps
Template matching
Gene shaving
Relevance networks
Principal Components Analysis
Support Vector Machines (SVM)
k-Nearest Neighbors
EASE Meta-Analysis
Demo Data Set 1

3
General Microarray Analysis
4
Levels of Biological Information DNA mRNA Prote
ins Informational Pathways Informational
Networks Cells Organs Individuals Populations
Ecologies
Genomics Functional Genomics Proteomics Metabolo
mics Systems Biology Cellular Biology Medicine Med
icine Genetics Ecology
The Future!
5
February 2001 Completion of the Draft Human
Genome
April 14, 2003 The Human Genome is completed
again!
October 2004 The Human Genome is now really
finished!
But what does finished mean???
6
Where are the pressing questions?

Can we find the genes and assign them functions?
Can we predict protein structures and functions?
Can we reconstruct metabolic, signaling, and
other pathways?
Can we reconstruct informational networks?
Can we link genotype to phenotype?
Can we use genotype/phenotype to predict relevant
outcome?
Can we use cross-species comparisons to learn
something?

7
The Beast Microarray Robot from Intelligent
Automation lthttp//www.ias.comgt
8
Microarray Overview I
9
Microarray Overview II
10
Microarray Overview II
11
Affymetrix GeneChip Expression Analysis
Generate DNA Sequence
Design and synthesize chips
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
12
Affymetrix GeneChip Expression Analysis
13
Microarray Expression Analysis
Gene
14
32,448 element mouse array
Thanks to M. Ko (NIA) and B. Soares (BMAP)
kidney vs. heart 15?g total RNA
Shuibang Wang, Yan Yu, Renee Gaspard
15
Steps in the Process

Select array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images manage data
Analyze hybridization data and get results

16
What can I do with arrays?
17
Types of Experiments

Class Comparison
Can I find genes that distinguish between two
classes, such as tumor and normal?
Class Discovery
Given what I think is a uniform group of samples,
can I find subsets that are biologically
meaningful?
Classification
Given a set of samples in different classes, can
I assign a new, unknown sample to one of the
classes?
Mechanistic Studies
Can I discover a causative mechanism associated
with the distinction between classes?

18
How do platforms compare?
19
Microarray Platform Comparisons
Platform comparisons may result in little to no
replication of gene expression levels. Findings
may replicate within a platform (top), but not
between platforms (bottom). Spotted long (80mer)
oligos versus Affymetrix 25mer chip.
Rogojina et al. 2003. Use of Affymetrix to
spotted oligonucleotide microarrays using two
retinal epithelial cell lines. Molecular Vision
9482-96
20
Experimental Design
Gene expression in Angiotensin II-induced
hypertension acute versus chronic effects
Saline Ang II

Two Factors
Factor 1 Treatment saline or Angiotensin II
Factor 2 Time
acute (24 h) or chronic (14 day)

acute chronic
Initial RNA samples 1.5 ug total RNA Total RNA
was amplified (TIGR SOP M022) to generate
antisense cRNA cRNA yield 40 - 60 ug from
amplification
Jennie Larkin, Harry Gavras thanks to Affymetrix
21
Experimental Design
Total RNA cRNA aminoallyl labelled cDNA 1.5
ug 50 ug
amplification
TIGR Microarray Protocol
Affymetrix GeneChip Protocol
Cy3/Cy5 dye coupling and hybridization Wash and
scan
Second cycle small sample protocol Biotinylated
cRNA Fragmentation Wash and stain Scan
DATA
Jennie Larkin, Harry Gavras thanks to Affymetrix
22
Affymetrix TIGR shared data

11,714 TCs were present on both platforms
5854 had values in gt 80 of experiments
92 (N 5358) did not statistically differ
between platforms (2-factor ANOVA, a .05)

Jennie Larkin, Harry Gavras thanks to Affymetrix
23
How do platforms compare?
Jennie Larkin, Harry Gavras thanks to Affymetrix
24
Two platforms similar values

Both OSF2 and procollagen type III alpha 1 were
up-regulated in chronic Ang II treatment.
Even the pattern of variability between
biological replicates was reproduced between
platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix
25
Two platforms similar values

Both ADAMTS1 and ADRP were up-regulated in acute
Ang II treatment.
Again, both the values and the pattern of
variability between biological replicates were
reproduced between platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix
26
Not all is perfect
There were differences between the arrays, but
these were the exception rather than the
rule. Metallothionein-II was only up-regulated
in acute on the TIGR arrays, but was up-regulated
in both acute and chronic on the Affy chips.
Jennie Larkin, Harry Gavras thanks to Affymetrix
27
When Platforms Agree
28
When Platforms Disagree, They Disagree
29
General Microarray Strategy

Choose an experimentally interesting and
tractable model system
Design an experiment with comparisons between
related variants
Include sufficient biological replication to make
good estimates
Hybridize and collect data
Normalize and filter
Mine data for biological patterns of expression
Integrate expression data with other ancillary
data such, including genotype, phenotype, the
genome, and its annotation

30
Should I subtract background?
31
Non-normalized, background subtracted data
100
32
Annotating andComparing Arrays
33
TIGR Gene Indices home page
www.tigr.org/tdb/tgi
83 species gt26,000,000 sequences
34
The Gene Indices home page
biocomp.dfci.harvard.edu/tgi/tgipage.html
85 species gt28,000,000 sequences
35
TGICL Tools are available with more coming
36
Gene Index Assembly process
37
A TC Example
38
GO Terms and EC Numbers
Babak Parvizi
39
Everything is mapped in the context of the
relevant genome
40
The TIGR Gene Indices lthttp//www.tigr.org.tdb/tdb
/tgigt
Dan Lee, Ingeborg Holt
41
Building TOGs Reflexive, Transitive Closure
Thanks to Woytek Makalowski and Mark Boguski
42
TOGA An Sample Alignment bithoraxoid-like
protein
43
(No Transcript)
44
RESOURCERER
Jennifer Tsai
http//pga.tigr.org/tools.shtml
45
RESOURCERER An Example
46
RESOURCERER Using Genetic Markers
Now Available QTL-based searches, promoter
region extraction, GO-Slim analysis
47
Just added
48
The Features are the same
49
How Complete isthe Human Genome?
50
is easy!
Gene Finding in Humans
Razvan Sultana
51
is easy?
Gene Finding in Humans
Razvan Sultana
52
is difficult?
Gene Finding in Humans
Razvan Sultana
53
is difficult?
Gene Finding in Humans
A genome and its annotation is only a hypothesis
that must be tested.
Razvan Sultana
54
Select TIGRHuman TCsnot mappingto human
genome(about 9000)
Razvan Sultana Laura Beaver Bryan Frank
55
Answer 61 Positives!
SequenceValidationUnderway Approximately50
appear good
Razvan Sultana Laura Beaver Bryan Frank
56
Tools for Array Analysis
57
TM4 Resources
TM4 Website

Application Downloads
Documentation and FAQs
And Much More!

http//www.tm4.org
CD Contents

All TM4 Applications
User Manuals
Supplementary Documentation
Sample Data Sets

TM4 Reference
Saeed, A.I., Sharov, V., White, J., Li, J.,
Liang, W., Bhagabati, N., Braisted, J., et al.
2003. TM4 A Free, Open-Source System for
Microarray Data Management and Analysis.
BioTechniques 34 374-378

58
Microarray Data Flow
Scanner
Printer
.tiff Image File
Image Analysis
Raw Gene Expression Data
Gene Annotation
Normalization / Filtering
Normalized Data with Gene Annotation
Expression Analysis
Interpretation of Analysis Results
59
MAD Database Schema
60
MADAM Microarray Data Manager
MAGE-ML exportnow functional
Joseph White Jerry Li Alexander Saeed Vasily
Sharov Syntek Inc.
? Available with OSI source and MySQL
61
MAGE-ML Export with Madam
62
MIDAS Data Analysis
Wei Liang
Variance Stabilization, Adding Error
Models, MAANOVA, Automated Reporting
? Available with source
63
MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
Now with scriptingand the ability to save state
andrestart analysis
? Available with OSI source www.tm4.org
64
Microarray Data Flow
Scheduler (Machine Scheduling)
SliTrack (Machine Control)
.tiff Image File
PCR Score
MABCOS (Barcode System)
Exp Designer
Spotfinder (Image Analysis)
MADAM (Data Manager)
Expression Data
Raw .tav File
Miner (.tav File Creator)
Raw .tav File
MIDAS (Normalization)
GenePix Converter
Normalized .tav File
Query Window
MeV (Data Analysis)
Interpretation
65
Protocols and Methods
66
Publications on Microarray Tools and Techniques
67
Publications on Microarray Data Analysis and
Design
68
Publications on Microarray Data Exchange Standards
MIAME Standards Nature family, Cell family, EMBO
reports, Bioinformatics,Genome Research, Genome
Biology, Science, The Lancet,Science, and
others.
69
The Starting Point Designing the Experiment
70
The Experimental Design

The Experimental Design dictates a good deal of
what you can do with the data
Good normalization and processing reflects the
experimental design
The design also facilitates certain comparisons
between samples and provides the statistical
power you need for assigning confidence limits to
individual measurements
The design must reflect experimental reality
The most straight-forward designs compare
expression in two classes of samples to look for
patterns that distinguish them.

71
Sample Pairing for Co-Hybridization Experiments
Direct Comparison with Dye Swap
A1
B1
A2
B2
A3
B3
A4
B4
A1
B1
A2
A3
B2
B3
A4
B4

RNA sample is not limiting (e.g. plenty of
sample)
Flip dyes account for any gene-dye effects

Balanced Block Design
A1
B1
A2
B2
A4
B4
A3
B3

RNA sample is limiting
Balanced blocking accounts for any gene-dye
effects

72
Multiple Sample Pairings
Reference Design (Indirect Comparison)

More than two samples are compared
(e.g. tumor classification, time course)
Flip dyes are not necessary but can be done to
increase precision
Ratio values are inferred (indirect)
Suited for cluster analysis need common
reference

Loop Design
73
Why perform flip-dye experiments?
74
Loops and Reference Designs
23 Hybs
10 hybs
Standard flip-dye expt
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
75
Loops and Reference Designs
Both approaches can give equivalent results
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
76
Loop vs. Reference Designs

Loop design
Can provide direct measurements
Give more data on each experimental sample
with the same number of hybs
Require more RNA per sample
Can unwind with a bad sample or for a gene
with bad data
Reference design
Easily extensible
Simple interpretation of all results
Requires less RNA per sample
Less sensitive to bad RNA samples and bad
array elements

77
Experimental Design
A1
A2
A3
A4
B1
B2
B3
B4
Keep it simple!
78
One Possible Experimental Paradigm
Examining Genotype, Phenotype, and Environment
Parental - stressed
Derived - stressed
Parental - unstressed
Derived - unstressed
79
Basic Design Principles

Biological replicas are more informative than
correlated replicas (independent RNA, independent
slides)
More replicas are better higher statistical
power
For loops, hybridizations of individual samples
should be balanced (as many Cy3 as Cy5
labelings)
Self-self hybs add data on reproducibility and
can be used to produce error models
At a minimum, should use dye swap replicates to
compensate for any dye biases in labeling or
detection

80
How Many Replicates?
(Simon et al., Genetic Epidemiology 23 21-36,
2002)
n 4(za/2 zb)2 / (d/1.4s)2
Where za/2 and zb are normal percentile values at
significance level a and false negative rate b
parameter d represents the minimum detectable
log2 ratio and s represents the SD of log ratio
values. For a 0.001 and b 0.05, then za/2
-3.29 and zb -1.65. Assume d 1.0 (2-fold
change) and s 0.25, Therefore n 12 samples
(6 query and 6 control).
81
Normalization
82
Why Normalize Data?

Goal is to measure ratios of gene expression
levels (ratio)i Ri/Giwhere Ri/Gi are,
respectively , the measured intensities for the
ith spot.
In a self-self hybridization, we would expect all
ratios to be equal to one Ri/Gi 1 for all i.
But they may not be.
Why not?
Unequal labeling efficiencies for Cy3/Cy5
Noise in the system
Differential expression
Normalization brings (appropriate) ratios back to
one.

83
The Starting Point The Ratio
84
Log2(ratio) measures treat up- and down-regulated
genes equally
log2(1) 0 log2(2) 1 log2(1/2) -1
85
Normalization Approaches A variety exist

Total Intensity
Linear Regression
Ratio statistics described by Chen, Dougherty,
Bittner
J. Biomed. Optics (1997) 2(4) 364-374
Iterative log(ratio) mean centering
Lowess Correction
And others
Any of these using
Entire Data Set
User-defined Data Set/Controls

86
Normalization Approaches
Using the Entire Data Set

Probe Quantification less important
No assumption on which genes constitute
housekeeping set
Uses all the data
No independent confirmation

User-defined Data Set/Controls

Requires definition of housekeeping set
or good added controls
Requires good RNA quantitation
Ignores much data

87
Normalization Approaches
The Solution(?)

The best technique is experiment dependent
A good approach is to use a combination of
techniques
All analysis methods depend on an
intelligent Experimental design

88
Resource A. thaliana DNA Clones for Spiking

chlorophyll a/b binding protein (Cab)
RUBISCO activase (RCA)
ribulose-1,5-bisphosphate carboxylase/oxygenase
(RbcL)
lipid transfer protein 4 (LTP4)
lipid transfer protein 6 (LTP6)
papain-type cysteine endopeptidase (XCP2)
root cap 1 (RCP1)
NAC1
triosphosphate isomerase (TIM)
ribulose-5-phosphate kinase (PRKase)

SP6 Transcription Start
5ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG
CAGGT CGACT CTAGA GGATC CCCGG GCGAG CTCCC
AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC3
SP6 Promoter
HindIII
PstI
SatI AccI HincII
XbaI
EcoRI
SacI
AvaI SmaI
BamHi
Clone set available at lthttp//pga.tigr.orggt
89
Resource B. subtillus DNA Clones for Spiking

pGIBS-lys ATCC 87482
pGIBS-phe ATCC 87483
pGIBS-thr ATCC 87484
pGIBS-trp ATCC 87485
pGIBS-dap ATCC 87486
Artificial polyA added to the 3end

Clone set available at lthttp//www.atcc.orggt
90
Normalization Approaches Total Intensity

Conceptually, this is the simplest approach
Assumption Total RNA (mass) used is same for
both samples.
So, averaged across thousands of genes, total
hybridization should be the same for both samples

91
Before and After Normalization
92
The Starting Point The R-I Plot

Data exhibits an intensity-dependent structure
Uncertainty in measurements is greater at lower
intensities
Uncertainty in ratio measurements generally
greater at lower intensities
Plot log2(R/G) vs. log2(RG)
variation Terry Speeds M-A plot with
(½ )log2(RG)

93
Good Data
94
Bad Data from Parts Unknown
Each pen group is colored differently
Gary Churchill
95
Lowess Normalization

Why LOWESS?

Observations
Intensity-dependent structure
Data not mean centered at log2(ratio) 0

96
LOWESS (Contd)

Local linear regression model
Tri-cube weight function
Least Squares

97
LOWESS Results
98
Variance stabilization/regularization

Measurements of expression vary between any two
assays
This can be affected by changes in the mean
expression level, but normalization can help
reduce those differences
However, the variance, or spread in the data, can
be quite different between replicates (or pen
groups)
Variance stabilization can rescale the data for
each experiment to make these more comparable

99
A Box Plot can show the difference in
variancebetween replicates
100
Standard Deviation Regularization
Let aij be the raw log ratio for the jth spot in
ith block (or slide)
aij be the scaled log ratio for the jth spot in
ith block (or slide)
where Nj denotes the number of genes ith block or
ith slide, M denotes the number of blocks or
slides, aij denotes the log ratio mean of ith
block (or ith slide)
101
MIDAS Normalization Methods(Standard deviation
regularization)
Standard deviation regularization
Assumption log-ratio standard deviations within
each block or slide are the same.
Variance regularization can remove the bias
102
There are Limits to what you can Measure
103
The Limits of log-ratios The space we explore
104
The Limits of log-ratios The space we explore
105
The Limits of log-ratios The space we explore
106
What do we measure?
107
Dealing with Data
Before any pattern analysis can be done, one must
first normalize and filter the data.
Normalization facilitates comparisons between
datasets.
Filtering transformations can eliminate
questionable data and reduce complexity.
108
Expression Elements
109
Ratio vs. log-ratio
Ai Red intensity Bi Green intensity
Let
R
4
Gene1
3
log2(AB)
2
1
Gene2
0
AB
Advantages of log transformation Treat
up-regulated and down-regulated genes
symmetrically! Transfer multiplication
operations to addition operations! Because
110
Expression Vectors
Gene Expression Vectors represent the expression
of a gene over a set of experimental conditions
or sample types.
111
Multiple Samples?

Goal is identify genes (or experiments) which
havesimilar patterns of expression
This is a problem in data mining
Clustering Algorithms are most widely used
Types
Agglomerative Hierarchical
Divisive k-means, SOMs
Hybrid SOTA
Nonclustering Principal Component Analysis
(PCA)
All depend on how one measures distance

112
Expression Vectors

Crucial concept for understanding clustering
Each gene is represented by a vector where
coordinates are its values log(ratio) in each
experiment
x log(ratio)expt1
y log(ratio)expt2
z log(ratio)expt3
etc.

113
Expression Vectors

Crucial concept for understanding clustering
Each gene is represented by a vector where
coordinates are its values log(ratio) in each
experiment
x log(ratio)expt1
y log(ratio)expt2
z log(ratio)expt3
etc.
For example, if we do six experiments,
Gene1 (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
Gene2 (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
Gene3 (1.2, 0.5, 0, -0.25, -0.75, -1.4)
etc.

114
Expt 1 Expt 2 Expt 3 Expt 4 Expt 5 Expt 6
Expression Matrix

These gene expression vectors of log(ratio)
values can be used to construct an expression
matrix

Gene1 -1.2 -0.5 0 0.25 0.75
1.4
Gene2 0.2 -0.5 1.2 -0.25 -1.0
1.5
Gene3 1.2 0.5 0 -0.25
-0.75 -1.4
etc.
This is often represented as a red/green colored
matrix

115
Expression Matrix
The Expression Matrix is a representation of data
frommultiple microarray experiments.
Each element is a log ratio, usually log 2
(Cy5/Cy3)
Black indicates a log-ratio of zero, i.e., Cy5
and Cy3 are very close in value
Green indicates a negative log-ratio, i.e., Cy5 lt
Cy3
Gray indicates missing data
Red indicates a positive log ratio, i.e, Cy5 gt
Cy3
116
Expression Vectors As Points inExpression Space
Similar Expression
Experiment 3
Experiment 2
Experiment 1
117
Distance metrics

Distances are measured between expression
vectors
Distance metrics define the way we measure
distances
Many different ways to measure distance
Euclidean distance
Pearson correlation coefficient(s)
Manhattan distance
Mutual information
Kendalls Tau
etc.
Each has different properties and can reveal
different features of the data

118
Distance and Similarity
The ability to calculate a distance (or
similarity - its inverse) between two expression
vectors is fundamental to clustering
algorithms Distance between vectors is the
basis upon which decisions are made when grouping
similar patterns of expression Selection of a
distance metric defines the concept of distance
for a particular experiment
119
Distance a measure of similarity between genes
Some distances (MeV provides 11 metrics)
120
Distance Is Defined by a Metric
121
Distance is Defined by a Metric
1.4
-0.90
4.2
-1.00
122
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6
Distance Matrix

Once a distance metric has been selected, the
starting point for all clustering methods is a
distance matrix

Gene1 0 1.5 1.2 0.25
0.75 1.4
Gene2 1.5 0 1.3 0.55
2.0 1.5
Gene3 1.2 1.3 0 1.3
0.75 0.3
Gene4 0.25 0.55 1.3 0 0.25
0.4
Gene5 0.75 2.0 0.75 0.25
0 1.2
Gene6 1.4 1.5 0.3 0.4
1.2 0
The elements of this matrix are the pair-wise
distances. Note that the matrix is symmetric
about the diagonal.

123
MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
? Available with OSI source
124
Some Concepts from Statistics
125
Probability distributions

The probability of an event is the likelihood
of its occurring.
It is sometimes computed as a relative frequency
(rf), where

The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
126
Normal distribution
127

Less than a 5 chance that the sample with mean
s came from Population 1
s is significantly different from Mean 1 at the
p lt 0.05 significance level.
But we cannot reject the hypothesis that the
sample came from Population 2

128
Probability and Expression Data

Many biological variables, such as height and
weight, can reasonably be assumed to approximate
the normal distribution.
But expression measurements? Probably not.
Fortunately, many statistical tests are
considered to be fairly robust to violations of
the normality assumption, and other assumptions
used in these tests.
Randomization / resampling based tests can be
used to get around the violation of the normality
assumption.
Even when parametric statistical tests (the ones
that make use of normal and other distributions)
are valid, randomization tests are still useful.

129
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the
test-statistic s) from your data set.
2. Make fake data sets from your original
data, by taking a random sub-sample of the data,
or by re-arranging the data in a random fashion.
Re-compute s from the fake data set.
130
Outline of a randomization test - 2
3. Repeat step 2 many times (often several
hundred to several thousand times) and record of
the fake s values from step 2 4. Draw
inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values
131
Outline of a randomization test - 3

Rationale
Ideally, we want to know the behavior of the
larger population from which the sample is drawn,
in order to make statistical inferences.
Here, we dont know that the larger population
behaves like a normal distribution, or some
other idealized distribution. All we have to work
with are the data in hand.
Our fake data sets are our best guess about
this behavior (i.e., if we had been pulling data
at random from an infinitely large population, we
might expect to get a distribution similar to
what we get by pulling random sub-samples, or by
reshuffling the order of the data in our sample)

132
The problem of multiple testing (adapted from
presentation by Anja von Heydebreck,
MaxPlanckInstitute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin,
Germany http//www.bioconductor.org/workshops/Heid
elberg02/mult.pdf)

Lets imagine there are 10,000 genes on a chip,
and
none of them is differentially expressed.
Suppose we use a statistical test for
differential expression, where we consider a gene
to be differentially expressed if it meets the
criterion at a p-value of p lt 0.05.

133
The problem of multiple testing 2

Lets say that applying this test to gene G1
yields a p-value of p 0.01
Remember that a p-value of 0.01 means that there
is a 1 chance that the gene is not
differentially expressed, i.e.,
Even though we conclude that the gene is
differentially expressed (because p lt 0.05),
there is a 1 chance that our conclusion is
wrong.
We might be willing to live with such a low
probability of being wrong
BUT .....

134
The problem of multiple testing 3

We are testing 10,000 genes, not just one!!!
Even though none of the genes is differentially
expressed, about 5 of the genes (i.e., 500
genes) will be erroneously concluded to be
differentially expressed, because we have decided
to live with a p-value of 0.05
If only one gene were being studied, a 5 margin
of error might not be a big deal, but 500 false
conclusions in one study? That doesnt sound too
good.

135
The problem of multiple testing 4

There are tricks we can use to reduce the
severity of this problem.
They all involve slashing the p-value for each
test (i.e., gene), so that while the critical
p-value for the entire data set might still equal
0.05, each gene will be evaluated at a lower
p-value.
Well go into some of these techniques later.

136
The problem of multiple testing 5

Dont get too hung up on p-values.
Ultimately, what matters is biological
relevance.
P-values should help you evaluate the strength of
the evidence, rather than being used as an
absolute yardstick of significance.
Statistical significance is not necessarily the
same as biological significance.

137
Finding Significant Genes

Assume we will compare two conditions with
multiple replicates for each class
Our goal is to find genes that are significantly
different between these classes
These are the genes that we will use for later
data mining

138
Finding Significant Genes

Average Fold Change Difference for each gene
suffers from being arbitrary and not taking into
account systematic variation in the data

139
Finding Significant Genes

t-test for each gene
Tests whether the difference between the mean of
the query and reference groups are the same
Essentially measures signal-to-noise
Calculate p-value (permutations or distributions)
May suffer from intensity-dependent effects

140
t-tests
141
T-Tests (Between Subjects or unpaired) - 1

Assign experiments to two groups, e.g., in the
expression matrix below, assign Experiments 1, 2
and 5 to group A, and experiments 3, 4 and 6 to
group B.

2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
142
T-TEST - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the
t-statistic for each gene either from A.
Theoretical t-distribution OR B.
Permutation tests.
143
T-TEST - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
144
T-TEST - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene iv) Repeat steps i-iii n times (where n is
specified by the user). v) Let x the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations. vi)
Then, the p-value associated with the gene 1
(x/n)
145
T-TEST - 5

5. Determine whether a genes expression levels
are significantly different between the two
groups by one of three methods
Just alpha (a significance level) If the
calculated p-value for a gene is less than or
equal to the user-input a (critical p-value), the
gene is considered significant.
OR
Use Bonferroni corrections to reduce the
probability of erroneously classifying
non-significant genes as significant.
B) Standard Bonferroni correction The user-input
alpha is divided by the total number of genes to
give a critical p-value that is used as above gt
pcritical a/N.

146
T-TEST 6
5C) Adjusted Bonferroni i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (a/N), where N is the
total number of genes for the gene with the
second-highest t-value, the critical p-value
will be (a/N-1), and so on.
147
TTEST 1-class (or One-sample t-test) - 1

Used to test if the the mean expression of a gene
over all experiments is different from a
hypothesized mean.

2. Question Is the mean of the values of a given
gene vector significantly different from a
hypothesized mean?
148
TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene
expression studies is zero, meaning that we are
looking for genes whose mean log2 ratio across
all experiments is significantly different from
zero, i.e., 4. Using 1-sample t-tests, we can
select genes which, on average, show differential
expression across all experiments (since genes
with no differential expression should have a
mean log2 ratio of zero across all expts). 5.
Calculate t-value, where Observed mean
of gene vector Hypothesized mean of gene
vector t ------------------------------------
------------------------------------------ Stand
ard error of the mean of the gene vector
149
TTEST 1 class - 3
6. Calculate p-value from a theoretical
t-distribution, OR 7. By permutation 7a.
Randomly pick some elements of the gene vector,
and change their values,such that the new value
of the changed element is original value 2
x (original value - hypothesized mean)
(i.e., flip the elements deviation around the
hypothesized mean) Thus, if the original gene
values are and the hypothesized mean is
zero, then the randomized gene values could
be
150
TTEST 1 class - 4
7b. Calculate t-value from the randomized
gene 7c. Repeat 7a and 7b as many times as
desired. If all permutations are chosen, then
every possible combination of elements in the
gene vector is chosen for flipping. 7d. The
p-value 1 (the proportion of times that the
original absolute t-value exceeds the randomized
absolute t-value over all the permutations
conducted). 8. If a genes p-value is less than
or equal to the user-specified critical p-value,
the genes mean expression over all experiments
is significantly different from the hypothesized
mean. 9. Bonferroni and adjusted Bonferroni
corrections may be applied just as in the
two-sample t-test.
151
Finding Significant Genes

Volcano Plots
Combines p-values and fold change measures
Significant genes appear in upper corners

152
Finding Significant Genes

Analysis of Variation (ANOVA)
Which genes are most significant for separating
classes of samples?
Calculate p-value (permutations or distributions)
Reduces to a t-test for 2 samples
May suffer from intensity-dependent effects

153
One Way Analysis of Variance (ANOVA)

Assign experiments to gt 2 groups

2. Question Is mean expression level of a gene
the same across all groups?
154
ANOVA - 2
3. Calculate an F-ratio for each gene,
where Mean square (groups) F
----------------------------------, which is a
measure of Mean square (error) Between
groups variability -----------------------------
--------- Within groups variability The larger
the value of F, the greater the difference among
the group means relative to the sampling error
variability (which is the within groups
variability). i.e., the larger the value of F,
the more likely it is that the differences among
the group means reflect real differences among
the means of the populations they are drawn
from, rather than being due to random sampling
error.
155
ANOVA - 3 4. The p-value associated with an
F-value is the probability that an F-value that
large would be obtained if there were no
differences among group means (i.e., given the
null hypothesis). Therefore, the smaller the
p-value, the less likely it is that the null
hypothesis is valid, i.e., the differences among
group means are more likely to reflect real
population differences as p-values decrease in
magnitude.
156

ANOVA - 4
5. P-values can be obtained for the F-values from
a theoretical F-distribution, assuming that the
populations from which the data are obtained
are normally distributed, and
have homogeneous variances.

The test is considered robust to violations of
these assumptions, provided sample sizes are
relatively large and similar across groups.
157
ANOVA 5 6. P-values can be obtained from
permutation tests (just like in t-tests), if one
does not want to rely on the assumptions needed
for using the F-distribution. P-values can also
be corrected for multiple comparisons (using
Bonferroni or other procedures).
158
Two-factor ANOVA (TFA)
Can be used to find genes whose expression is
significantly different over two factors (e.g.,
sex and strain), as well as to look for genes
with a significant interaction for these two
factors.
Strain B
Strain C
Strain A
Male
Female
159
TFA - 2
160
TFA - 3

Ideally, design should be balanced, i.e., equal
numbers of samples in each factor A factor B
combination.
If unbalanced, the analysis can still be
conducted, but F-tests will be somewhat biased.
May need to use smaller p-values.
Can have balanced designs with no replication
(see below). In this case, interaction cannot be
tested..

161
Finding Significant Genes

Significance Analysis of Microarrays (SAM)
Uses a modified t-test by estimating and adding a
small positive constant to the denominator
Significant genes are those which exceed the
expected values from permutation analysis.

162
SAM test Statistic

di Score
si Standard Deviation
s0 Safety Factor

163
SAM Variance Estimate

Gene by gene variance estimate safety factor
Variance equal in the two conditions
s0 term is here to deal with cases when variance
estimates gets too close to zero

How to choose s0?
Test statistics are binned in 100 different group
depending on the si value
s0 is chosen so that the dispersion of the test
statistic does not vary from bin to bin
avoids aberrant values when variance estimates
close to 0

164
SAM Hypothesis Testing

Permutation technique
Multiple testing adjustment technique
False Discovery Rate

165
Confidence Level False Discovery Rate

Fix a threshold DELTA for differentially
expressed genes
For each permutation, count how many genes you
declare differentially expressed
NB In a permutation you should find 0 genes.
Compute median number of falsely called genes in
permutations
False Discovery Rate is number of falsely called
genes divided by number of differential expressed
genes in original data

FDR percentage of NON-significant genes you can
expect to find in your result list
166
SAM

SAM gives estimates of the False Discovery Rate
(FDR), which is the proportion of genes likely to
have been wrongly identified by chance as being
significant.
It is a very interactive algorithm allows users
to dynamically change thresholds for significance
(through the tuning parameter delta) after
looking at the distribution of the test
statistic.
The ability to dynamically alter the input
parameters based on immediate visual feedback,
even before completing the analysis, helps make
the data-mining process sensitive.

167
Significance analysis of microarrays (SAM)

SAM can be used to identify significant genes
based on differential expression between sets of
samples.
Currently implemented for the following designs
two-class unpaired
two-class paired
multi-class
censored survival
one-class

168
SAM designs

Two-class unpaired to pick out genes whose mean
expression level is significantly different
between two groups of samples (analogous to
between subjects t-test).
Two-class paired samples are split into two
groups, and there is a 1-to-1 correspondence
between an sample in group A and one in group B
(analogous to paired t-test).

169
SAM designs - 2

Multi-class identifies genes whose mean
expression is different across gt 2 groups of
samples (analogous to one-way ANOVA)
Censored survival finds genes whose expression
levels are correlated with duration of survival
using Cox-regression.
One-class selects genes whose mean expression
across experiments is different from a
user-specified mean.

170
SAM Two-Class 1

Assign experiments to two groups, e.g., in the
expression matrix below, assign Experiments 1, 2
and 5 to group A, and experiments 3, 4 and 6 to
group B.

2. Question Is mean expression level of a gene
in group A significantly different from mean
expression level in group B?
171
SAM Two-Class 2
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for
that gene.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Compute the observed d-value for each randomized
gene
172
SAM Two-Class 3

iii) Repeat step (ii) many times, so that each
gene has many randomized d-values. Take the
average of the randomized d-values for each gene.
This is the expected d-value of that gene.
iv) Plot the observed d-values vs. the expected
d-values

173
SAM Two-Class 4
Significant positive genes (i.e., mean
expression of group B gt mean expression of
group A) in red
The more a gene deviates from the observed
expected line, the more likely it is to be
significant. Any gene beyond the first gene in
the ve or ve direction on the x-axis (including
the first gene), whose observed exceeds the
expected by at least delta, is considered
significant.
174
SAM Two-Class 5

For each permutation of the data, compute the
number of positive and negative significant genes
for a given delta as explained in the previous
slide. The median number of significant genes
from these permutations is the median False
Discovery Rate.
The rationale behind this is, any genes
designated as significant from the randomized
data are being picked up purely by chance (i.e.,
falsely discovered). Therefore, the median
number picked up over many randomizations is a
good estimate of false discovery rate.

175
SAM Two-Class Paired

Samples fall into two groups
Each member of group A is associated with a
member of group B in a 1-to-1 relationship

A-B pair
176
SAM Two-Class Paired - 2

e.g., groups A and B could respectively
represent before and after a drug treatment,
and each A-B pair of samples could come from the
same patient before and after the treatment.
or, groups A and B could represent two strains
for which samples were collected at the several
time points over a time course study. A sample
collected from each of strain A and B at the same
time point could form an AB pair.

The rest of the analysis is similar to two-class
unpaired SAM. Positive significant genes are
those for which Mean(Group B) is significantly
larger than Mean (Group A), and reverse is true
for negative significant genes

177
SAM Multi-Class

Extension of SAM two -class unpaired to more
than 2 groups
Experiments belong to one of at least three
groups
Analogous to one-way between subjects ANOVA

178
SAM Multi-Class - 2

This analysis yields only positive significant
genes
These are genes whose means are significantly
different across some combination of the groups
of experiments.

179
SAM Censored Survival

Each experiment (sample) is associated with an
observation time, and a state at the time of
observation.
The state is either dead or censored
Censored means that the subject survived
beyond the time point at which the sample was
taken.
A positive score means that a higher expression
level for that gene implies shorter survival
(i.e., higher risk), whereas a negative score
means that higher expression implies longer
survival.

180
Finding Patterns in the Data
181
Hierarchical Clustering
1. Calculate the distance between all genes. Find
the smallest distance. If several pairs share the
same similarity, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
182
Hierarchical Clustering
(HCL-2)
183
Hierarchical Clustering
(HCL-3)
184
Hierarchical Tree
(HCL-4)
185
Agglomerative Linkage Methods

Linkage methods are rules or metrics that return
a value that can be used to determine which
elements (clusters) should be linked.
Three linkage methods that are commonly used are
Single Linkage
Average Linkage
Complete Linkage

(HCL-6)
186
Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of the another cluster. Single
linkage tends to create elongated clusters with
individual genes chained onto clusters. DAB
min ( d(ui, vj) ) where u Î A and v Î B for all
i 1 to NA and j 1 to NB
DAB
(HCL-7)
187
Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance. DAB 1/(NANB) S
S ( d(ui, vj) ) where u Î A and v Î B for all
i 1 to NA and j 1 to NB
DAB
(HCL-8)
188
Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of the another cluster. Complete
linkage tends to create clusters of similar size
and variability. DAB max ( d(ui, vj) ) where
u Î A and v Î B for all i 1 to NA and j 1 to
NB
DAB
(HCL-9)
189
Comparison of Linkage Methods
Average
Single
Complete
(HCL-10)
190
Bootstrapping
Bootstrapping resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
191
Jackknifing
Jackknifing resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
192
Analysis of bootstrapped and jackknifed support
trees

Bootstrapped or jackknifed expression matrices
are created many times by randomly resampling the
original expression matrix, using either the
bootstrap or jackknife procedure.
Each time, hierarchical trees are created from
the resampled matrices.
The trees are compared to the tree obtained from
the original data set.
The more frequently a given cluster from the
original tree is found in the resampled trees,
the stronger the support for the cluster.
As each resampled matrix lacks some of the
original data, high support for a cluster means
that the clustering is not biased by a small
subset of the data.

193
Self Organizing Tree Algorithm
SOTA - 1

Dopazo, J. , J.M Carazo, Phylogenetic
reconstruction using and unsupervised growing
neural network that adopts the topology of a
phylogenetic tree. J. Mol. Evol. 44226-233,
1997.
Herrero, J., A. Valencia, and J. Dopazo. A
hierarchical unsupervised growing neural network
for clustering gene expression patterns.
Bioinformatics, 17(2)126-136, 2001.

194
SOTA Characteristics
SOTA - 2

Divisive clustering, allowing high level
hierarchical structure to be revealed without
having to completely partition the data set down
to single gene vectors
Data set is reduced to clusters arranged in a
binary tree topology
The number of resulting clusters is not fixed
before clustering
Neural network approach which has advantages
similar to SOMs such as handling large data sets
that have large amounts of noise

195
SOTA Topology
SOTA - 3
Centroid Vector
ap
Members
196
Adaptation Overview
SOTA - 4

Each gene vector associated with the parent is
compared to the centroid vector of its offspring
cells.
The most similar cells centroid and its
neighboring cells are adapted using the
appropriate migration weights.

197

Following the presentation of all genes to the
system a measure of system diversity is used to
determine if training has found an optimal
position for the offspring.
If the system diversity improves (decreases)
then another training epoch is started otherwise
training ends and a new cycle starts with a cell
division.

SOTA - 5
198
SOTA - 6
The most diverse cell is selected for division
at the start of the next training cycle.
199
Growth Termination
SOTA - 7
Expansion stops when the most diverse cells
diversity falls below a threshold.
200
SOTA - 8
Each training cycle ends when the overall tree
diversity stabilizes. This triggers a cell
division and possibly a new training cycle.
201
K-Means/Medians Clustering 1
202
K-Means/Medians Clustering 2
3. Calculate mean/median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean expression
profile (calculated in step 3) is the closest to
that genes expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
k-means is most useful when the user has an a
priori hypothesis about the number of clusters
the genes should belong to.
203
K-Means / K-Medians Support (KMS)

Because of the random initialization of
K-Means/K-Means, clustering results may vary
somewhat between successive runs on the same
dataset. KMS helps us validate the clustering
results obtained from K-Means/K-Medians.
Run K-Means / K-Medians multiple times.
The KMS module generates clusters in which the
member genes frequently group together in the
same clusters (consensus clusters) across
multiple runs of K-Means / K-Medians.
The consensus clusters consist of genes that
clustered together in at least x of the K-Means
/ Medians runs, where x is the threshold
percentage input by the user.

204
Self-organizing maps (SOMs) 1
1. Specify the number of nodes (clusters)
desired, and also specify a 2-D geometry for the
nodes, e.g., rectangular or hexagonal
205
SOMs 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The
node closest to G9 (N2) is moved the most, and
the other nodes are moved by smaller varying
amounts. The farther away the node is from N2,
the less it is moved.
206
SOMs 3
4. Steps 2 and 3 (i.e., choosing a random gene
and moving the nodes towards it) are repeated
many (usually several thousand) times. However,
with each iteration, the amount that the nodes
are allowed to move is decreased.
5. Finally, each node will nestle among a
cluster of genes, and a gene will be considered
to be in the cluster if its distance to the node
in that cluster is less than its distance to any
other node.
207
SOM Neighborhood Options
Gaussian Neighborhood
Bubble Neighborhood
radius
G7
G7
G8
G8
G9
G9
G10
G10
G11
G11
N1
N2
N1
N2
N3
N4
N3
N4
N5
N6
N5
N6
Some move, alpha is constant.
All move, alpha is scaled.
208
Template Matching

Template matching allows one to find expression
vectors which match a provided template
A template can be derived from
a gene known to be central to the area of study
a sample or set of samples of a particular type
a cluster with a mean pattern of interest
a pattern constructed to reveal trends based on
knowledge of the experimental design

209
PTM-2

Sometimes it is useful to identify elements that
have complementary patterns by selecting to use
the absolute value of r.

210
Gene Shaving
Results in a series of nested clusters
Choose cluster of appropriate size as determined
by gap statistic calculation
Repeat until only one gene remains
Orthogonalize expression matrix with respect to
the average gene in the cluster and repeat
shaving procedure
211
Gene Shaving
The final cluster contains a set of genes that
are greatly affected by the experimental
conditions in a similar way.
Create random permutations of the expression
matrix and calculate R2 for each
Compare R2 of each cluster to that of the entire
expression matrix
Choose the cluster whose R2 is furthest from the
average R2 of the permuted expression matrices.
212
Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes.
213
Relevance Networks
The remaining relationships between genes define
the subnets
214
Principal Components Analysis

PCA simplifies the views of the data.
Suppose we have measurements for each gene on
multiple experiments.
Suppose some of the experiments are correlated.
PCA will ignore the redundant experiments, and
will take a weighted average of some of the
experiments, thus possibly making the trends in
the data more interpretable.
5. The

Write a Comment

User Comments (0)

About PowerShow.com

Analysis of Large Scale Gene Expression Data PowerPoint PPT Presentation