Analysis of Large Scale Gene Expression Data - PowerPoint PPT Presentation

About This Presentation
Title:

Analysis of Large Scale Gene Expression Data

Description:

Analysis of Large Scale Gene Expression Data – PowerPoint PPT presentation

Slides: 257
Provided by: Joh144
Transcript and Presenter's Notes

Title: Analysis of Large Scale Gene Expression Data


1
Analysis of Large Scale Gene Expression Data
  • John Quackenbush
  • September 2005

2
Microarray Analysis
  • General microarray overview
  • How do platforms compare?
  • Should I subtract background?
  • TM4 Array Solutions
  • Annotating arrays
  • How complete is the genome?
  • Tools for array analysis
  • Experimental design
  • Normalization
  • What do we measure?
  • Some concepts from statistics
  • Distributions
  • Randomization Tests
  • Multiple Testing
  • Finding significant genes
  • Fold Change
  • T-tests
  • ANOVA
  • SAM
  • Hierarchical clustering
  • Bootstrapping
  • Jack-knifing
  • Self-organizing trees
  • K-means/k-medians Clustering
  • Self-organizing maps
  • Template matching
  • Gene shaving
  • Relevance networks
  • Principal Components Analysis
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors
  • EASE Meta-Analysis
  • Demo Data Set 1

3
General Microarray Analysis
4
Levels of Biological Information: DNA, mRNA, Proteins, Informational Pathways, Informational Networks, Cells, Organs, Individuals, Populations, Ecologies
Genomics, Functional Genomics, Proteomics, Metabolomics, Systems Biology, Cellular Biology, Medicine, Genetics, Ecology
The Future!
5
February 2001: Completion of the Draft Human Genome
April 14, 2003: The Human Genome is completed again!
October 2004: The Human Genome is now really finished!
But what does finished mean???
6
Where are the pressing questions?
  • Can we find the genes and assign them functions?
  • Can we predict protein structures and functions?
  • Can we reconstruct metabolic, signaling, and
    other pathways?
  • Can we reconstruct informational networks?
  • Can we link genotype to phenotype?
  • Can we use genotype/phenotype to predict relevant
    outcome?
  • Can we use cross-species comparisons to learn
    something?

7
The Beast: Microarray Robot from Intelligent
Automation <http://www.ias.com>
8
Microarray Overview I
9
Microarray Overview II
10
Microarray Overview II
11
Affymetrix GeneChip Expression Analysis
Generate DNA Sequence
Design and synthesize chips
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
ACGTAGCTAGCTGATCGTAGCTAGCTAGCTAGCTGATC
12
Affymetrix GeneChip Expression Analysis
13
Microarray Expression Analysis
Gene
14
32,448 element mouse array
Thanks to M. Ko (NIA) and B. Soares (BMAP)
kidney vs. heart, 15 µg total RNA
Shuibang Wang, Yan Yu, Renee Gaspard
15
Steps in the Process
  • Select array elements and annotate them
  • Build a database to manage stuff
  • Print arrays and manage the lab
  • Hybridize and analyze images; manage data
  • Analyze hybridization data and get results

16
What can I do with arrays?
17
Types of Experiments
  • Class Comparison
  • Can I find genes that distinguish between two
    classes, such as tumor and normal?
  • Class Discovery
  • Given what I think is a uniform group of samples,
    can I find subsets that are biologically
    meaningful?
  • Classification
  • Given a set of samples in different classes, can
    I assign a new, unknown sample to one of the
    classes?
  • Mechanistic Studies
  • Can I discover a causative mechanism associated
    with the distinction between classes?

18
How do platforms compare?
19
Microarray Platform Comparisons
Platform comparisons may result in little to no
replication of gene expression levels. Findings
may replicate within a platform (top), but not
between platforms (bottom). Spotted long (80mer)
oligos versus Affymetrix 25mer chip.
Rogojina et al. 2003. Comparing the use of Affymetrix to
spotted oligonucleotide microarrays using two
retinal epithelial cell lines. Molecular Vision
9: 482-96
20
Experimental Design
Gene expression in Angiotensin II-induced
hypertension: acute versus chronic effects
Saline, Ang II
  • Two Factors
  • Factor 1: Treatment (saline or Angiotensin II)
  • Factor 2: Time (acute, 24 h, or chronic, 14 day)

acute, chronic
Initial RNA samples: 1.5 µg total RNA. Total RNA
was amplified (TIGR SOP M022) to generate
antisense cRNA. cRNA yield: 40 - 60 µg from
amplification.
Jennie Larkin, Harry Gavras thanks to Affymetrix
21
Experimental Design
Total RNA (1.5 µg) → amplification → cRNA (50 µg)
TIGR Microarray Protocol: aminoallyl labelled cDNA; Cy3/Cy5 dye coupling and hybridization; wash and scan
Affymetrix GeneChip Protocol: second cycle small sample protocol; biotinylated cRNA; fragmentation; wash and stain; scan
DATA
Jennie Larkin, Harry Gavras thanks to Affymetrix
22
Affymetrix TIGR shared data
  • 11,714 TCs were present on both platforms
  • 5854 had values in > 80% of experiments
  • 92% (N = 5358) did not statistically differ
    between platforms (2-factor ANOVA, α = 0.05)

Jennie Larkin, Harry Gavras thanks to Affymetrix
23
How do platforms compare?
Jennie Larkin, Harry Gavras thanks to Affymetrix
24
Two platforms similar values
  • Both OSF2 and procollagen type III alpha 1 were
    up-regulated in chronic Ang II treatment.
  • Even the pattern of variability between
    biological replicates was reproduced between
    platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix
25
Two platforms similar values
  • Both ADAMTS1 and ADRP were up-regulated in acute
    Ang II treatment.
  • Again, both the values and the pattern of
    variability between biological replicates were
    reproduced between platforms.

Jennie Larkin, Harry Gavras thanks to Affymetrix
26
Not all is perfect
There were differences between the arrays, but
these were the exception rather than the
rule. Metallothionein-II was only up-regulated
in acute on the TIGR arrays, but was up-regulated
in both acute and chronic on the Affy chips.
Jennie Larkin, Harry Gavras thanks to Affymetrix
27
When Platforms Agree
28
When Platforms Disagree, They Disagree
29
General Microarray Strategy
  • Choose an experimentally interesting and
    tractable model system
  • Design an experiment with comparisons between
    related variants
  • Include sufficient biological replication to make
    good estimates
  • Hybridize and collect data
  • Normalize and filter
  • Mine data for biological patterns of expression
  • Integrate expression data with other ancillary
    data, such as genotype, phenotype, the
    genome, and its annotation

30
Should I subtract background?
31
Non-normalized, background subtracted data
100
32
Annotating and Comparing Arrays
33
TIGR Gene Indices home page
www.tigr.org/tdb/tgi
83 species, >26,000,000 sequences
34
The Gene Indices home page
biocomp.dfci.harvard.edu/tgi/tgipage.html
85 species, >28,000,000 sequences
35
TGICL Tools are available with more coming
36
Gene Index Assembly process
37
A TC Example
38
GO Terms and EC Numbers
Babak Parvizi
39
Everything is mapped in the context of the
relevant genome
40
The TIGR Gene Indices <http://www.tigr.org/tdb/tgi>
Dan Lee, Ingeborg Holt
41
Building TOGs Reflexive, Transitive Closure
Thanks to Woytek Makalowski and Mark Boguski
42
TOGA: A Sample Alignment, bithoraxoid-like
protein
43
(No Transcript)
44
RESOURCERER
Jennifer Tsai
http://pga.tigr.org/tools.shtml
45
RESOURCERER An Example
46
RESOURCERER Using Genetic Markers
Now Available: QTL-based searches, promoter
region extraction, GO-Slim analysis
47
Just added
48
The Features are the same
49
How Complete is the Human Genome?
50
Gene Finding in Humans is easy!
Razvan Sultana
51
Gene Finding in Humans is easy?
Razvan Sultana
52
Gene Finding in Humans is difficult?
Razvan Sultana
53
Gene Finding in Humans is difficult?
A genome and its annotation is only a hypothesis
that must be tested.
Razvan Sultana
54
Select TIGR Human TCs not mapping to human
genome (about 9000)
Razvan Sultana Laura Beaver Bryan Frank
55
Answer: 61 Positives!
Sequence Validation Underway: Approximately 50
appear good
Razvan Sultana Laura Beaver Bryan Frank
56
Tools for Array Analysis
57
TM4 Resources
TM4 Website
  • Application Downloads
  • Documentation and FAQs
  • And Much More!

http://www.tm4.org
CD Contents
  • All TM4 Applications
  • User Manuals
  • Supplementary Documentation
  • Sample Data Sets

TM4 Reference
Saeed, A.I., Sharov, V., White, J., Li, J.,
Liang, W., Bhagabati, N., Braisted, J., et al.
2003. TM4: A Free, Open-Source System for
Microarray Data Management and Analysis.
BioTechniques 34: 374-378.
58
Microarray Data Flow
Scanner
Printer
.tiff Image File
Image Analysis
Raw Gene Expression Data
Gene Annotation
Normalization / Filtering
Normalized Data with Gene Annotation
Expression Analysis
Interpretation of Analysis Results
59
MAD Database Schema
60
MADAM Microarray Data Manager
MAGE-ML export now functional
Joseph White, Jerry Li, Alexander Saeed, Vasily
Sharov, Syntek Inc.
  • Available with OSI source and MySQL
61
MAGE-ML Export with Madam
62
MIDAS Data Analysis
Wei Liang
Variance Stabilization, Adding Error
Models, MAANOVA, Automated Reporting
  • Available with source
63
MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
Now with scripting and the ability to save state
and restart analysis
  • Available with OSI source: www.tm4.org
64
Microarray Data Flow
Scheduler (Machine Scheduling)
SliTrack (Machine Control)
.tiff Image File
PCR Score
MABCOS (Barcode System)
Exp Designer
Spotfinder (Image Analysis)
MADAM (Data Manager)
Expression Data
Raw .tav File
Miner (.tav File Creator)
Raw .tav File
MIDAS (Normalization)
GenePix Converter
Normalized .tav File
Query Window
MeV (Data Analysis)
Interpretation
65
Protocols and Methods
66
Publications on Microarray Tools and Techniques
67
Publications on Microarray Data Analysis and
Design
68
Publications on Microarray Data Exchange Standards
MIAME Standards: Nature family, Cell family, EMBO
reports, Bioinformatics, Genome Research, Genome
Biology, Science, The Lancet, and
others.
69
The Starting Point: Designing the Experiment
70
The Experimental Design
  • The Experimental Design dictates a good deal of
    what you can do with the data
  • Good normalization and processing reflects the
    experimental design
  • The design also facilitates certain comparisons
    between samples and provides the statistical
    power you need for assigning confidence limits to
    individual measurements
  • The design must reflect experimental reality
  • The most straight-forward designs compare
    expression in two classes of samples to look for
    patterns that distinguish them.

71
Sample Pairing for Co-Hybridization Experiments
Direct Comparison with Dye Swap
[Figure: samples A1-A4 paired with B1-B4, each pair shown twice]
  • RNA sample is not limiting (e.g. plenty of
    sample)
  • Flip dyes account for any gene-dye effects

Balanced Block Design
[Figure: samples A1-A4 paired with B1-B4]
  • RNA sample is limiting
  • Balanced blocking accounts for any gene-dye
    effects

72
Multiple Sample Pairings
Reference Design (Indirect Comparison)
  • More than two samples are compared
  • (e.g. tumor classification, time course)
  • Flip dyes are not necessary but can be done to
    increase precision
  • Ratio values are inferred (indirect)
  • Suited for cluster analysis; needs a common
    reference

Loop Design
73
Why perform flip-dye experiments?
74
Loops and Reference Designs
23 Hybs
10 hybs
Standard flip-dye expt
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
75
Loops and Reference Designs
Both approaches can give equivalent results
S. Wang , K. Kerr, J. Quackenbush, G. Churchill
76
Loop vs. Reference Designs
  • Loop design
  • Can provide direct measurements
  • Give more data on each experimental sample
    with the same number of hybs
  • Require more RNA per sample
  • Can unwind with a bad sample or for a gene
    with bad data
  • Reference design
  • Easily extensible
  • Simple interpretation of all results
  • Requires less RNA per sample
  • Less sensitive to bad RNA samples and bad
    array elements

77
Experimental Design
[Figure: samples A1-A4 and B1-B4]
Keep it simple!
78
One Possible Experimental Paradigm
Examining Genotype, Phenotype, and Environment
Parental - stressed
Derived - stressed
Parental - unstressed
Derived - unstressed
79
Basic Design Principles
  • Biological replicas are more informative than
    correlated replicas (independent RNA, independent
    slides)
  • More replicas are better: higher statistical
    power
  • For loops, hybridizations of individual samples
    should be balanced (as many Cy3 as Cy5
    labelings)
  • Self-self hybs add data on reproducibility and
    can be used to produce error models
  • At a minimum, should use dye swap replicates to
    compensate for any dye biases in labeling or
    detection

80
How Many Replicates?
(Simon et al., Genetic Epidemiology 23: 21-36,
2002)
$n = 4(z_{\alpha/2} + z_{\beta})^2 / (\delta / (1.4\sigma))^2$
where $z_{\alpha/2}$ and $z_{\beta}$ are normal percentile values at
significance level $\alpha$ and false negative rate $\beta$;
the parameter $\delta$ represents the minimum detectable
log2 ratio, and $\sigma$ represents the SD of log ratio
values. For $\alpha = 0.001$ and $\beta = 0.05$, $z_{\alpha/2} = -3.29$
and $z_{\beta} = -1.65$. Assume $\delta = 1.0$ (2-fold
change) and $\sigma = 0.25$. Therefore $n = 12$ samples
(6 query and 6 control).
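
The arithmetic on this slide can be checked with a short sketch. The function name `replicates_needed` is made up for illustration; it assumes the formula as reconstructed above, with the two z values added and the factor 1.4 multiplying the SD, and takes the normal percentiles from Python's standard library.

```python
import math
from statistics import NormalDist

def replicates_needed(alpha, beta, delta, sigma):
    """Total sample count n = 4 (z_{alpha/2} + z_beta)^2 / (delta / (1.4 sigma))^2."""
    z_alpha = NormalDist().inv_cdf(alpha / 2)  # about -3.29 for alpha = 0.001
    z_beta = NormalDist().inv_cdf(beta)        # about -1.65 for beta = 0.05
    n = 4 * (z_alpha + z_beta) ** 2 / (delta / (1.4 * sigma)) ** 2
    return math.ceil(n)                        # round up to whole samples

n_total = replicates_needed(alpha=0.001, beta=0.05, delta=1.0, sigma=0.25)
print(n_total)  # 12, i.e. 6 query and 6 control
```

Because both percentiles enter as a squared sum, using their positive counterparts (3.29 and 1.65) gives the same n.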
81
Normalization
82
Why Normalize Data?
  • Goal is to measure ratios of gene expression
    levels: (ratio)_i = R_i/G_i, where R_i and G_i are,
    respectively, the measured red and green intensities for the
    ith spot.
  • In a self-self hybridization, we would expect all
    ratios to be equal to one: R_i/G_i = 1 for all i.
    But they may not be.
  • Why not?
  • Unequal labeling efficiencies for Cy3/Cy5
  • Noise in the system
  • Differential expression
  • Normalization brings (appropriate) ratios back to
    one.

83
The Starting Point The Ratio
84
Log2(ratio) measures treat up- and down-regulated
genes equally
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
85
Normalization Approaches: A variety exist
  • Total Intensity
  • Linear Regression
  • Ratio statistics described by Chen, Dougherty,
    Bittner
  • J. Biomed. Optics (1997) 2(4): 364-374
  • Iterative log(ratio) mean centering
  • Lowess Correction
  • And others
  • Any of these using
  • Entire Data Set
  • User-defined Data Set/Controls

86
Normalization Approaches
Using the Entire Data Set
  • Probe Quantification less important
  • No assumption on which genes constitute the
    housekeeping set
  • Uses all the data
  • No independent confirmation

User-defined Data Set/Controls
  • Requires definition of housekeeping set
    or good added controls
  • Requires good RNA quantitation
  • Ignores much data

87
Normalization Approaches
The Solution(?)
  • The best technique is experiment dependent
  • A good approach is to use a combination of
    techniques
  • All analysis methods depend on an
    intelligent Experimental design

88
Resource: A. thaliana DNA Clones for Spiking
  • chlorophyll a/b binding protein (Cab)
  • RUBISCO activase (RCA)
  • ribulose-1,5-bisphosphate carboxylase/oxygenase
    (RbcL)
  • lipid transfer protein 4 (LTP4)
  • lipid transfer protein 6 (LTP6)
  • papain-type cysteine endopeptidase (XCP2)
  • root cap 1 (RCP1)
  • NAC1
  • triosephosphate isomerase (TIM)
  • ribulose-5-phosphate kinase (PRKase)

SP6 Transcription Start
5'-ATTTA GGTGA CACTA TAGAA TACAA GCTTG GGCTG
CAGGT CGACT CTAGA GGATC CCCGG GCGAG CTCCC
AAAAA AAAAA AAAAA AAAAA AAAAA AAAAA CCGAA TTC-3'
SP6 Promoter
HindIII
PstI
SalI AccI HincII
XbaI
EcoRI
SacI
AvaI SmaI
BamHI
Clone set available at <http://pga.tigr.org>
89
Resource: B. subtilis DNA Clones for Spiking
  • pGIBS-lys ATCC 87482
  • pGIBS-phe ATCC 87483
  • pGIBS-thr ATCC 87484
  • pGIBS-trp ATCC 87485
  • pGIBS-dap ATCC 87486
  • Artificial polyA added to the 3' end

Clone set available at <http://www.atcc.org>
90
Normalization Approaches: Total Intensity
  • Conceptually, this is the simplest approach
  • Assumption: Total RNA (mass) used is the same for
    both samples.
  • So, averaged across thousands of genes, total
    hybridization should be the same for both samples
91
Before and After Normalization
92
The Starting Point: The R-I Plot
  • Data exhibits an intensity-dependent structure
  • Uncertainty in measurements is greater at lower
    intensities
  • Uncertainty in ratio measurements generally
    greater at lower intensities
  • Plot log2(R/G) vs. log2(R·G)
  • Variation: Terry Speed's M-A plot with
    (½) log2(R·G)
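
The plot coordinates can be sketched as follows, assuming R and G are background-corrected, strictly positive intensities. The function names are illustrative only.

```python
import math

def ri_coordinates(red, green):
    """R-I plot points: x = log2(R*G), y = log2(R/G)."""
    return [(math.log2(r * g), math.log2(r / g)) for r, g in zip(red, green)]

def ma_coordinates(red, green):
    """M-A variant: x = A = (1/2) log2(R*G), y = M = log2(R/G)."""
    return [(0.5 * math.log2(r * g), math.log2(r / g)) for r, g in zip(red, green)]

ri_point = ri_coordinates([8.0], [2.0])[0]   # (4.0, 2.0)
ma_point = ma_coordinates([8.0], [2.0])[0]   # (2.0, 2.0)
print(ri_point, ma_point)
```

The two plots differ only by a factor of one half on the horizontal axis, so any intensity-dependent structure looks the same in both.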

93
Good Data
94
Bad Data from Parts Unknown
Each pen group is colored differently
Gary Churchill
95
Lowess Normalization
  • Why LOWESS?
  • Observations
  • Intensity-dependent structure
  • Data not mean centered at log2(ratio) = 0

96
LOWESS (Cont'd)
  • Local linear regression model
  • Tri-cube weight function
  • Least Squares
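
The three ingredients listed above (local linear regression, tri-cube weights, least squares) can be combined in a small pure-Python sketch. This is a simplified illustration without the robustness iterations of full LOWESS, and it is not the MIDAS code; `lowess_fit`, `tricube`, and the `frac` smoothing parameter are names chosen here.

```python
import math

def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def lowess_fit(x, y, frac=0.5):
    """Weighted least-squares line fitted locally around every x value."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))    # points inside each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1.0                # window half-width (guard against 0)
        w = [tricube((xi - x0) / h) for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy data with a purely intensity-dependent bias: log ratio M = 0.2*A - 1.0
a_values = [float(i) for i in range(10)]
m_values = [0.2 * a - 1.0 for a in a_values]
trend = lowess_fit(a_values, m_values)
normalized = [m - t for m, t in zip(m_values, trend)]   # corrected log ratios
```

Subtracting the fitted trend from each log ratio is the normalization step; on this toy data the corrected values are all zero up to rounding.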

97
LOWESS Results
98
Variance stabilization/regularization
  • Measurements of expression vary between any two
    assays
  • This can be affected by changes in the mean
    expression level, but normalization can help
    reduce those differences
  • However, the variance, or spread in the data, can
    be quite different between replicates (or pen
    groups)
  • Variance stabilization can rescale the data for
    each experiment to make these more comparable

99
A Box Plot can show the difference in
variance between replicates
100
Standard Deviation Regularization
Let a_ij be the raw log ratio for the jth spot in
the ith block (or slide), and
a'_ij be the scaled log ratio for the jth spot in
the ith block (or slide),
where N_i denotes the number of genes in the ith block or
ith slide, M denotes the number of blocks or
slides, and ā_i denotes the log ratio mean of the ith
block (or ith slide)
101
MIDAS Normalization Methods (Standard deviation
regularization)
Standard deviation regularization
Assumption log-ratio standard deviations within
each block or slide are the same.
Variance regularization can remove the bias
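
The scaling formula itself did not survive in this transcript, so the following is only one plausible sketch and not necessarily what MIDAS does. It assumes each block's log ratios are divided by that block's SD relative to the geometric mean of all block SDs, which leaves every block with the same spread. The function name is hypothetical.

```python
import math
from statistics import stdev

def sd_regularize(blocks):
    """Rescale each block (or slide) of log ratios to a common standard deviation.

    Assumed scheme: divide by (block SD / geometric mean of all block SDs).
    Every block needs at least two values and a non-zero SD.
    """
    sds = [stdev(block) for block in blocks]
    geo_mean_sd = math.exp(sum(math.log(s) for s in sds) / len(sds))
    return [[a * geo_mean_sd / s for a in block] for block, s in zip(blocks, sds)]

raw_blocks = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]]   # second block is twice as spread out
scaled_blocks = sd_regularize(raw_blocks)
print([round(stdev(b), 4) for b in scaled_blocks])   # equal SDs after scaling
```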
102
There are Limits to what you can Measure
103
The Limits of log-ratios The space we explore
104
The Limits of log-ratios The space we explore
105
The Limits of log-ratios The space we explore
106
What do we measure?
107
Dealing with Data
Before any pattern analysis can be done, one must
first normalize and filter the data.
Normalization facilitates comparisons between
datasets.
Filtering transformations can eliminate
questionable data and reduce complexity.
108
Expression Elements
109
Ratio vs. log-ratio
Let A_i = Red intensity, B_i = Green intensity
[Figure: ratio and log2 ratio of A and B for Gene1 and Gene2]
Advantages of log transformation:
  • Treat up-regulated and down-regulated genes
    symmetrically!
  • Transfer multiplication operations to addition
    operations!
110
Expression Vectors
Gene Expression Vectors represent the expression
of a gene over a set of experimental conditions
or sample types.
111
Multiple Samples?
  • Goal is to identify genes (or experiments) which
    have similar patterns of expression
  • This is a problem in data mining
  • Clustering Algorithms are most widely used
  • Types
  • Agglomerative: Hierarchical
  • Divisive: k-means, SOMs
  • Hybrid: SOTA
  • Nonclustering: Principal Component Analysis
    (PCA)
  • All depend on how one measures distance

112
Expression Vectors
  • Crucial concept for understanding clustering
  • Each gene is represented by a vector where
    coordinates are its values log(ratio) in each
    experiment
  • x = log(ratio)expt1
  • y = log(ratio)expt2
  • z = log(ratio)expt3
  • etc.

113
Expression Vectors
  • Crucial concept for understanding clustering
  • Each gene is represented by a vector where
    coordinates are its values log(ratio) in each
    experiment
  • x = log(ratio)expt1
  • y = log(ratio)expt2
  • z = log(ratio)expt3
  • etc.
  • For example, if we do six experiments,
  • Gene1 (-1.2, -0.5, 0, 0.25, 0.75, 1.4)
  • Gene2 (0.2, -0.5, 1.2, -0.25, -1.0, 1.5)
  • Gene3 (1.2, 0.5, 0, -0.25, -0.75, -1.4)
  • etc.

114
Expression Matrix
  • These gene expression vectors of log(ratio)
    values can be used to construct an expression
    matrix

            Expt 1  Expt 2  Expt 3  Expt 4  Expt 5  Expt 6
    Gene1   -1.2    -0.5     0       0.25    0.75    1.4
    Gene2    0.2    -0.5     1.2    -0.25   -1.0     1.5
    Gene3    1.2     0.5     0      -0.25   -0.75   -1.4
    etc.

  • This is often represented as a red/green colored
    matrix

115
Expression Matrix
The Expression Matrix is a representation of data
from multiple microarray experiments.
Each element is a log ratio, usually log2(Cy5/Cy3)
Black indicates a log-ratio of zero, i.e., Cy5
and Cy3 are very close in value
Green indicates a negative log-ratio, i.e., Cy5 <
Cy3
Gray indicates missing data
Red indicates a positive log ratio, i.e., Cy5 >
Cy3
116
Expression Vectors As Points in Expression Space
Similar Expression
Experiment 3
Experiment 2
Experiment 1
117
Distance metrics
  • Distances are measured between expression
    vectors
  • Distance metrics define the way we measure
    distances
  • Many different ways to measure distance
  • Euclidean distance
  • Pearson correlation coefficient(s)
  • Manhattan distance
  • Mutual information
  • Kendall's Tau
  • etc.
  • Each has different properties and can reveal
    different features of the data
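
Three of the listed metrics can be sketched with the standard library, using the Gene1 and Gene3 vectors from the earlier expression-vector slide. The helper names are chosen for illustration and this is not MeV's implementation.

```python
import math

def euclidean(u, v):
    return math.dist(u, v)                      # requires Python 3.8+

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return cov / norm

gene1 = [-1.2, -0.5, 0, 0.25, 0.75, 1.4]
gene3 = [1.2, 0.5, 0, -0.25, -0.75, -1.4]       # mirror image of gene1

d_euclid = euclidean(gene1, gene3)              # about 4.14: far apart
d_manhattan = manhattan(gene1, gene3)           # 8.2
r_pearson = pearson(gene1, gene3)               # -1: perfectly anti-correlated
print(d_euclid, d_manhattan, r_pearson)
```

The same pair of genes is maximally distant by Euclidean distance yet perfectly (negatively) correlated, which illustrates how the choice of metric changes what counts as similar.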

118
Distance and Similarity
The ability to calculate a distance (or
similarity, its inverse) between two expression
vectors is fundamental to clustering
algorithms.
Distance between vectors is the
basis upon which decisions are made when grouping
similar patterns of expression.
Selection of a
distance metric defines the concept of distance
for a particular experiment.
119
Distance: a measure of similarity between genes
Some distances (MeV provides 11 metrics)
120
Distance Is Defined by a Metric
121
Distance is Defined by a Metric
1.4
-0.90
4.2
-1.00
122
Distance Matrix
  • Once a distance metric has been selected, the
    starting point for all clustering methods is a
    distance matrix

            Gene1  Gene2  Gene3  Gene4  Gene5  Gene6
    Gene1   0      1.5    1.2    0.25   0.75   1.4
    Gene2   1.5    0      1.3    0.55   2.0    1.5
    Gene3   1.2    1.3    0      1.3    0.75   0.3
    Gene4   0.25   0.55   1.3    0      0.25   0.4
    Gene5   0.75   2.0    0.75   0.25   0      1.2
    Gene6   1.4    1.5    0.3    0.4    1.2    0

  • The elements of this matrix are the pair-wise
    distances. Note that the matrix is symmetric
    about the diagonal.

123
MeV Data Mining Tools
Alexander Saeed Alexander Sturn Nirmal
Bhagabati John Braisted Syntek Inc. Datanaut,
Inc.
? Available with OSI source
124
Some Concepts from Statistics
125
Probability distributions
  • The probability of an event is the likelihood
    of its occurring.
  • It is sometimes computed as a relative frequency
    (rf), where

The probability of an event can sometimes be
inferred from a theoretical probability
distribution, such as a normal distribution.
126
Normal distribution
127
  • Less than a 5% chance that the sample with mean
    s came from Population 1
  • s is significantly different from Mean 1 at the
    p < 0.05 significance level.
  • But we cannot reject the hypothesis that the
    sample came from Population 2

128
Probability and Expression Data
  • Many biological variables, such as height and
    weight, can reasonably be assumed to approximate
    the normal distribution.
  • But expression measurements? Probably not.
  • Fortunately, many statistical tests are
    considered to be fairly robust to violations of
    the normality assumption, and other assumptions
    used in these tests.
  • Randomization / resampling based tests can be
    used to get around the violation of the normality
    assumption.
  • Even when parametric statistical tests (the ones
    that make use of normal and other distributions)
    are valid, randomization tests are still useful.

129
Outline of a randomization test - 1
1. Compute the value of interest (i.e., the
test-statistic s) from your data set.
2. Make fake data sets from your original
data, by taking a random sub-sample of the data,
or by re-arranging the data in a random fashion.
Re-compute s from the fake data set.
130
Outline of a randomization test - 2
3. Repeat step 2 many times (often several
hundred to several thousand times) and record
the fake s values from step 2.
4. Draw
inferences about the significance of your
original s value by comparing it with the
distribution of the randomized (fake) s values
Outline of a randomization test - 3
  • Rationale
  • Ideally, we want to know the behavior of the
    larger population from which the sample is drawn,
    in order to make statistical inferences.
  • Here, we don't know that the larger population
    behaves like a normal distribution, or some
    other idealized distribution. All we have to work
    with are the data in hand.
  • Our fake data sets are our best guess about
    this behavior (i.e., if we had been pulling data
    at random from an infinitely large population, we
    might expect to get a distribution similar to
    what we get by pulling random sub-samples, or by
    reshuffling the order of the data in our sample)

132
The problem of multiple testing (adapted from
presentation by Anja von Heydebreck,
Max Planck Institute for Molecular Genetics,
Dept. Computational Molecular Biology, Berlin,
Germany, http://www.bioconductor.org/workshops/Heidelberg02/mult.pdf)
  • Let's imagine there are 10,000 genes on a chip,
    and none of them is differentially expressed.
  • Suppose we use a statistical test for
    differential expression, where we consider a gene
    to be differentially expressed if it meets the
    criterion at a p-value of p < 0.05.

133
The problem of multiple testing 2
  • Let's say that applying this test to gene G1
    yields a p-value of p = 0.01
  • Remember that a p-value of 0.01 means that, if
    the gene were not differentially expressed, there
    would be only a 1% chance of obtaining a result at
    least this extreme, i.e.,
  • Even though we conclude that the gene is
    differentially expressed (because p < 0.05),
    there is a small chance that our conclusion is
    wrong.
  • We might be willing to live with such a low
    probability of being wrong
  • BUT .....

134
The problem of multiple testing 3
  • We are testing 10,000 genes, not just one!!!
  • Even though none of the genes is differentially
    expressed, about 5% of the genes (i.e., 500
    genes) will be erroneously concluded to be
    differentially expressed, because we have decided
    to live with a p-value of 0.05
  • If only one gene were being studied, a 5% margin
    of error might not be a big deal, but 500 false
    conclusions in one study? That doesn't sound too
    good.
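
The 500 false positives can be reproduced with a small simulation. It rests on the standard assumption that, when no gene is differentially expressed, each gene's p-value is uniformly distributed between 0 and 1; the function name and seed are arbitrary.

```python
import random

def count_false_positives(n_genes=10_000, alpha=0.05, seed=1):
    """Count genes called significant when none is truly differentially expressed."""
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]   # null p-values are uniform
    return sum(p < alpha for p in p_values)

false_calls = count_false_positives()
print(false_calls)   # close to 10,000 * 0.05 = 500
```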

135
The problem of multiple testing 4
  • There are tricks we can use to reduce the
    severity of this problem.
  • They all involve slashing the p-value for each
    test (i.e., gene), so that while the critical
    p-value for the entire data set might still equal
    0.05, each gene will be evaluated at a lower
    p-value.
  • We'll go into some of these techniques later.

136
The problem of multiple testing 5
  • Don't get too hung up on p-values.
  • Ultimately, what matters is biological
    relevance.
  • P-values should help you evaluate the strength of
    the evidence, rather than being used as an
    absolute yardstick of significance.
  • Statistical significance is not necessarily the
    same as biological significance.

137
Finding Significant Genes
  • Assume we will compare two conditions with
    multiple replicates for each class
  • Our goal is to find genes that are significantly
    different between these classes
  • These are the genes that we will use for later
    data mining

138
Finding Significant Genes
  • Average Fold Change Difference for each gene
  • suffers from being arbitrary and not taking into
    account systematic variation in the data

139
Finding Significant Genes
  • t-test for each gene
  • Tests whether the means of
    the query and reference groups are the same
  • Essentially measures signal-to-noise
  • Calculate p-value (permutations or distributions)
  • May suffer from intensity-dependent effects

140
t-tests
141
T-Tests (Between Subjects or unpaired) - 1
  • Assign experiments to two groups, e.g., in the
    expression matrix below, assign Experiments 1, 2
    and 5 to group A, and experiments 3, 4 and 6 to
    group B.

2. Question: Is the mean expression level of a gene
in group A significantly different from the mean
expression level in group B?
142
T-TEST - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the
t-statistic for each gene either from
A. Theoretical t-distribution, OR
B. Permutation tests.
143
T-TEST - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
144
T-TEST - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized
gene.
iv) Repeat steps i-iii n times (where n is
specified by the user).
v) Let x = the number of
times the absolute value of the original
t-statistic exceeds the absolute values of the
randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 -
(x/n)
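
Steps i-vi can be sketched for a single gene as follows. The slide's t-statistic formula is not in the transcript, so the unequal-variance (Welch) form is assumed here; the function names and toy values are illustrative.

```python
import math
import random
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t-statistic (Welch form, an assumption of this sketch)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def permutation_p_value(a, b, n=1000, seed=0):
    rng = random.Random(seed)
    t_orig = abs(t_statistic(a, b))              # step i
    pooled = list(a) + list(b)
    x = 0
    for _ in range(n):                           # step iv
        rng.shuffle(pooled)                      # step ii: same group sizes as before
        t_rand = abs(t_statistic(pooled[:len(a)], pooled[len(a):]))  # step iii
        if t_orig > t_rand:                      # step v
            x += 1
    return 1 - x / n                             # step vi

group_a = [2.0, 2.4, 1.8]    # e.g. experiments 1, 2 and 5
group_b = [0.1, -0.3, 0.2]   # e.g. experiments 3, 4 and 6
p_perm = permutation_p_value(group_a, group_b)
print(p_perm)
```

With only three experiments per group there are just 20 distinct ways to split the six values, so the permutation p-value cannot get much below 0.1 however well separated the groups are.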
145
T-TEST - 5
  • 5. Determine whether a gene's expression levels
    are significantly different between the two
    groups by one of three methods:
  • A) Just alpha (α significance level): If the
    calculated p-value for a gene is less than or
    equal to the user-input α (critical p-value), the
    gene is considered significant.
  • OR
  • Use Bonferroni corrections to reduce the
    probability of erroneously classifying
    non-significant genes as significant.
  • B) Standard Bonferroni correction: The user-input
    alpha is divided by the total number of genes to
    give a critical p-value that is used as above:
    p_critical = α/N.

146
T-TEST - 6
5C) Adjusted Bonferroni:
i) The t-values for
all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the
critical p-value becomes (α/N), where N is the
total number of genes; for the gene with the
second-highest t-value, the critical p-value
will be (α/(N-1)), and so on.
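
Both corrections can be sketched directly from the description above. Ranking by absolute t-value is an assumption of this sketch (the slide only says t-values), and each gene is simply compared with its own threshold; the function names and numbers are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Standard Bonferroni: one critical p-value alpha/N for every gene."""
    critical = alpha / len(p_values)
    return [p <= critical for p in p_values]

def adjusted_bonferroni(t_values, p_values, alpha=0.05):
    """Critical p-values alpha/N, alpha/(N-1), ... in order of decreasing |t|."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: abs(t_values[i]), reverse=True)
    significant = [False] * n
    for rank, i in enumerate(order):
        significant[i] = p_values[i] <= alpha / (n - rank)
    return significant

t_vals = [5.0, 3.0, 1.0, 0.5]
p_vals = [0.001, 0.015, 0.3, 0.6]
standard_calls = bonferroni(p_vals)                   # [True, False, False, False]
adjusted_calls = adjusted_bonferroni(t_vals, p_vals)  # [True, True, False, False]
print(standard_calls, adjusted_calls)
```

The second gene (p = 0.015) fails the standard cut-off of 0.0125 but passes the adjusted one of 0.05/3, showing that the adjusted procedure is the less conservative of the two.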
147
TTEST 1-class (or One-sample t-test) - 1
  • Used to test if the mean expression of a gene
    over all experiments is different from a
    hypothesized mean.

2. Question: Is the mean of the values of a given
gene vector significantly different from a
hypothesized mean?
148
TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene
expression studies is zero, meaning that we are
looking for genes whose mean log2 ratio across
all experiments is significantly different from
zero.
4. Using 1-sample t-tests, we can
select genes which, on average, show differential
expression across all experiments (since genes
with no differential expression should have a
mean log2 ratio of zero across all expts).
5. Calculate t-value, where
t = (Observed mean of gene vector - Hypothesized mean of gene vector) / (Standard error of the mean of the gene vector)
149
TTEST 1 class - 3
6. Calculate p-value from a theoretical
t-distribution, OR
7. By permutation:
7a. Randomly pick some elements of the gene vector,
and change their values, such that the new value
of the changed element is
original value - 2 x (original value - hypothesized mean)
(i.e., flip the element's deviation around the
hypothesized mean). Thus, if the original gene
values are and the hypothesized mean is
zero, then the randomized gene values could
be
150
TTEST 1 class - 4
7b. Calculate the t-value from the randomized
gene.
7c. Repeat 7a and 7b as many times as
desired. If all permutations are chosen, then
every possible combination of elements in the
gene vector is chosen for flipping.
7d. The p-value = 1 - (the proportion of times that the
original absolute t-value exceeds the randomized
absolute t-value over all the permutations
conducted).
8. If a gene's p-value is less than
or equal to the user-specified critical p-value,
the gene's mean expression over all experiments
is significantly different from the hypothesized
mean.
9. Bonferroni and adjusted Bonferroni
corrections may be applied just as in the
two-sample t-test.
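Steps 5 to 7d of the one-class test can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the TM4 code: the function names, the random (rather than exhaustive) sampling of flips, the 50% flip probability, and the example log2 ratios are all hypothetical.

```python
import math
import random
import statistics


def one_sample_t(values, mu0=0.0):
    """t = (observed mean - hypothesized mean) / standard error of the mean."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - mu0) / se


def sign_flip_p_value(values, mu0=0.0, n_perm=1000, seed=0):
    """Permutation p-value from random flips around the hypothesized mean."""
    rng = random.Random(seed)
    t_obs = abs(one_sample_t(values, mu0))
    exceed = 0
    for _ in range(n_perm):
        # Step 7a: flip the deviation of randomly picked elements around mu0.
        flipped = [x - 2 * (x - mu0) if rng.random() < 0.5 else x
                   for x in values]
        # Step 7b: t-value of the randomized gene vector.
        if t_obs > abs(one_sample_t(flipped, mu0)):
            exceed += 1
    # Step 7d: p = 1 - proportion of times the original |t| exceeds the randomized |t|.
    return 1 - exceed / n_perm


# Hypothetical log2 ratios for one gene across six experiments.
p_shifted = sign_flip_p_value([1.2, 0.8, 1.5, 0.9, 1.1, 1.3])
p_centered = sign_flip_p_value([-1.0, 1.0, -0.5, 0.5, 0.2, -0.2])
```

A gene whose ratios sit consistently above zero gets a small p-value, while a gene whose ratios are centered on zero does not.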
151
Finding Significant Genes
  • Volcano Plots
  • Combines p-values and fold change measures
  • Significant genes appear in upper corners

152
Finding Significant Genes
  • Analysis of Variance (ANOVA)
  • Which genes are most significant for separating
    classes of samples?
  • Calculate p-value (permutations or distributions)
  • Reduces to a t-test for 2 samples
  • May suffer from intensity-dependent effects

153
One Way Analysis of Variance (ANOVA)
  • Assign experiments to > 2 groups

2. Question: Is the mean expression level of a gene
the same across all groups?
154
ANOVA - 2
3. Calculate an F-ratio for each gene, where
$$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$$
which is a measure of
$$\frac{\text{between-groups variability}}{\text{within-groups variability}}$$
The larger
the value of F, the greater the difference among
the group means relative to the sampling error
variability (which is the within-groups
variability), i.e., the larger the value of F,
the more likely it is that the differences among
the group means reflect real differences among
the means of the populations they are drawn
from, rather than being due to random sampling
error.
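The F-ratio above can be computed directly from its definition. The sketch below is a plain-Python illustration; the function name `f_ratio` and the example values are hypothetical, and each inner list is assumed to hold one gene's expression values for one group.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-groups (error) sum of squares.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_groups = ss_between / (k - 1)
    ms_error = ss_within / (n_total - k)
    return ms_groups / ms_error


# Hypothetical expression values of one gene in three groups.
f_value = f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

For these values the between-groups sum of squares is 26 (2 degrees of freedom) and the within-groups sum of squares is 6 (6 degrees of freedom), so F = 13 / 1 = 13.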
155
ANOVA - 3
4. The p-value associated with an
F-value is the probability that an F-value that
large would be obtained if there were no
differences among group means (i.e., given the
null hypothesis). Therefore, the smaller the
p-value, the less likely it is that the null
hypothesis is valid, i.e., the differences among
group means are more likely to reflect real
population differences as p-values decrease in
magnitude.
156
ANOVA - 4
5. P-values can be obtained for the F-values from
a theoretical F-distribution, assuming that the
populations from which the data are obtained
  • are normally distributed, and
  • have homogeneous variances.

The test is considered robust to violations of
these assumptions, provided sample sizes are
relatively large and similar across groups.
157
ANOVA - 5
6. P-values can be obtained from
permutation tests (just like in t-tests), if one
does not want to rely on the assumptions needed
for using the F-distribution. P-values can also
be corrected for multiple comparisons (using
Bonferroni or other procedures).
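A permutation p-value for the F-ratio can be sketched by shuffling one gene's values across the groups and recomputing F each time. This is an illustration under assumptions, not the TM4 procedure: the function names, the number of permutations, and the choice to count permuted F-values greater than or equal to the observed one are all hypothetical.

```python
import random


def f_ratio(groups):
    """One-way ANOVA F-ratio; returns infinity if within-group variance is zero."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def anova_permutation_p(groups, n_perm=2000, seed=0):
    """Proportion of shuffled data sets whose F is at least the observed F."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    f_obs = f_ratio(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-cut the shuffled values into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_ratio(shuffled) >= f_obs:
            hits += 1
    return hits / n_perm


# Hypothetical data: one gene with clear group differences, one with none.
p_different = anova_permutation_p([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
p_same = anova_permutation_p([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```

Because the null distribution is built from the data themselves, no normality or equal-variance assumption is needed.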
158
Two-factor ANOVA (TFA)
Can be used to find genes whose expression is
significantly different over two factors (e.g.,
sex and strain), as well as to look for genes
with a significant interaction for these two
factors.
[Figure: two-factor design grid crossing strain (A, B, C) with sex (male, female)]
159
TFA - 2
160
TFA - 3
  • Ideally, design should be balanced, i.e., equal
    numbers of samples in each factor A x factor B
    combination.
  • If unbalanced, the analysis can still be
    conducted, but F-tests will be somewhat biased.
    May need to use smaller critical p-values.
  • Can have balanced designs with no replication
    (see below). In this case, interaction cannot be
    tested.

161
Finding Significant Genes
  • Significance Analysis of Microarrays (SAM)
  • Uses a modified t-test by estimating and adding a
    small positive constant to the denominator
  • Significant genes are those which exceed the
    expected values from permutation analysis.

162
SAM test Statistic
  • di: Score
  • si: Standard Deviation
  • s0: Safety Factor

163
SAM Variance Estimate
  • Gene-by-gene variance estimate plus safety factor
  • Variance equal in the two conditions
  • s0 term is here to deal with cases when variance
    estimates get too close to zero
  • How to choose s0?
  • Test statistics are binned in 100 different groups
    depending on the si value
  • s0 is chosen so that the dispersion of the test
    statistic does not vary from bin to bin
  • avoids aberrant values when variance estimates are
    close to 0
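The role of s0 can be illustrated with a short sketch of a SAM-style score for one gene in a two-class design. The function name `sam_d` and the data are hypothetical, the pooled standard error formula is the usual equal-variance one and is an assumption here, and s0 is passed in by hand rather than chosen by the binning procedure described above.

```python
import math


def sam_d(group_a, group_b, s0):
    """SAM-style score: d = (mean_B - mean_A) / (s + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Pooled sum of squares, assuming equal variance in the two conditions.
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    s_i = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (mean_b - mean_a) / (s_i + s0)


# A gene with almost no variance: without s0 the score would blow up.
d_flat = sam_d([1.0, 1.0, 1.0], [1.01, 1.01, 1.01], s0=0.1)
# A gene with ordinary variance, scored without a safety factor.
d_plain = sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=0.0)
```

With s0 = 0.1 the near-constant gene gets a modest score of about 0.1 instead of an enormous one driven only by a tiny denominator.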

164
SAM Hypothesis Testing
  • Permutation technique
  • Multiple testing adjustment technique
  • False Discovery Rate

165
Confidence Level False Discovery Rate
  • Fix a threshold DELTA for differentially
    expressed genes
  • For each permutation, count how many genes you
    declare differentially expressed
  • NB: In a permutation you should ideally find 0 genes,
    so any gene called there is a false call.
  • Compute median number of falsely called genes in
    permutations
  • False Discovery Rate is number of falsely called
    genes divided by number of differentially expressed
    genes in original data

FDR: percentage of NON-significant genes you can
expect to find in your result list
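The FDR recipe above reduces to a count and a ratio. The sketch below is a simplified illustration: it calls a gene significant when its absolute score reaches a fixed threshold, which is an assumption standing in for SAM's own delta rule, and the function name and scores are hypothetical.

```python
from statistics import median


def estimate_fdr(observed_scores, permuted_scores, threshold):
    """Median number of false calls in permutations / number called in real data."""
    n_called = sum(1 for s in observed_scores if abs(s) >= threshold)
    # One count of "significant" genes per permuted data set.
    false_calls = [sum(1 for s in perm if abs(s) >= threshold)
                   for perm in permuted_scores]
    median_false = median(false_calls)
    fdr = median_false / n_called if n_called else 0.0
    return fdr, n_called, median_false


# Hypothetical scores for six genes and three permutations of the data.
observed = [5.1, 4.2, 0.3, -3.9, 0.8, -0.2]
permuted = [
    [0.5, 3.2, -0.1, 0.4, 1.0, -0.9],
    [0.2, -0.3, 0.1, 1.1, -3.5, 0.6],
    [1.2, 0.4, -0.7, 0.3, 0.2, 0.1],
]
fdr, n_called, median_false = estimate_fdr(observed, permuted, threshold=3.0)
```

Here three genes are called in the real data and the permutations yield 1, 1 and 0 false calls, so the median is 1 and the estimated FDR is 1/3.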
166
SAM
  • SAM gives estimates of the False Discovery Rate
    (FDR), which is the proportion of genes likely to
    have been wrongly identified by chance as being
    significant.
  • It is a very interactive algorithm that allows users
    to dynamically change thresholds for significance
    (through the tuning parameter delta) after
    looking at the distribution of the test
    statistic.
  • The ability to dynamically alter the input
    parameters based on immediate visual feedback,
    even before completing the analysis, helps make
    the data-mining process sensitive.

167
Significance analysis of microarrays (SAM)
  • SAM can be used to identify significant genes
    based on differential expression between sets of
    samples.
  • Currently implemented for the following designs
  • two-class unpaired
  • two-class paired
  • multi-class
  • censored survival
  • one-class

168
SAM designs
  • Two-class unpaired: to pick out genes whose mean
    expression level is significantly different
    between two groups of samples (analogous to
    between-subjects t-test).
  • Two-class paired: samples are split into two
    groups, and there is a 1-to-1 correspondence
    between a sample in group A and one in group B
    (analogous to paired t-test).

169
SAM designs - 2
  • Multi-class: identifies genes whose mean
    expression is different across > 2 groups of
    samples (analogous to one-way ANOVA).
  • Censored survival: finds genes whose expression
    levels are correlated with duration of survival
    using Cox regression.
  • One-class: selects genes whose mean expression
    across experiments is different from a
    user-specified mean.

170
SAM Two-Class 1
  • Assign experiments to two groups, e.g., in the
    expression matrix below, assign Experiments 1, 2
    and 5 to group A, and experiments 3, 4 and 6 to
    group B.

2. Question: Is the mean expression level of a gene
in group A significantly different from mean
expression level in group B?
171
SAM Two-Class 2
Permutation tests
i) For each gene, compute d-value (analogous to
t-statistic). This is the observed d-value for
that gene.
ii) Randomly shuffle the values of the gene
between groups A and B, such that the reshuffled
groups A and B respectively have the same number
of elements as the original groups A and B.
Compute the d-value for each randomized
gene.
172
SAM Two-Class 3
  • iii) Repeat step (ii) many times, so that each
    gene has many randomized d-values. Take the
    average of the randomized d-values for each gene.
    This is the expected d-value of that gene.
  • iv) Plot the observed d-values vs. the expected
    d-values

173
SAM Two-Class 4
Significant positive genes (i.e., mean
expression of group B > mean expression of
group A) in red
The more a gene deviates from the observed =
expected line, the more likely it is to be
significant. Any gene beyond the first gene in
the +ve or -ve direction on the x-axis (including
the first gene), whose observed exceeds the
expected by at least delta, is considered
significant.
174
SAM Two-Class 5
  • For each permutation of the data, compute the
    number of positive and negative significant genes
    for a given delta as explained in the previous
    slide. The median number of significant genes
    from these permutations is the median number of
    falsely significant genes.
  • The rationale behind this is, any genes
    designated as significant from the randomized
    data are being picked up purely by chance (i.e.,
    falsely discovered). Therefore, the median
    number picked up over many randomizations is a
    good estimate of the number of false discoveries;
    dividing it by the number of genes called
    significant in the original data gives the false
    discovery rate.

175
SAM Two-Class Paired
  • Samples fall into two groups
  • Each member of group A is associated with a
    member of group B in a 1-to-1 relationship

A-B pair
176
SAM Two-Class Paired - 2
  • e.g., groups A and B could respectively
    represent before and after a drug treatment,
    and each A-B pair of samples could come from the
    same patient before and after the treatment.
  • Or, groups A and B could represent two strains
    for which samples were collected at several
    time points over a time course study. A sample
    collected from each of strain A and B at the same
    time point could form an A-B pair.
  • The rest of the analysis is similar to two-class
    unpaired SAM. Positive significant genes are
    those for which Mean(Group B) is significantly
    larger than Mean(Group A), and the reverse is true
    for negative significant genes.

177
SAM Multi-Class
  • Extension of SAM two-class unpaired to more
    than 2 groups
  • Experiments belong to one of at least three
    groups
  • Analogous to one-way between-subjects ANOVA

178
SAM Multi-Class - 2
  • This analysis yields only positive significant
    genes
  • These are genes whose means are significantly
    different across some combination of the groups
    of experiments.

179
SAM Censored Survival
  • Each experiment (sample) is associated with an
    observation time, and a state at the time of
    observation.
  • The state is either dead or censored
  • Censored means that the subject survived
    beyond the time point at which the sample was
    taken.
  • A positive score means that a higher expression
    level for that gene implies shorter survival
    (i.e., higher risk), whereas a negative score
    means that higher expression implies longer
    survival.

180
Finding Patterns in the Data
181
Hierarchical Clustering
1. Calculate the distance between all pairs of genes.
Find the smallest distance. If several pairs share the
same distance, use a predetermined rule to
decide between alternatives.
2. Fuse the two selected clusters to produce a
new cluster that now contains at least two
objects. Calculate the distance between the new
cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single
cluster remains.
4. Draw a tree representing the results.
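Steps 1 to 3 above can be sketched as a naive agglomerative loop. This is an illustration only: the slide does not fix a distance or a linkage rule at this point, so Euclidean distance and single linkage are assumptions, ties are broken by keeping the first pair found, and the function names and data are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hierarchical_cluster(vectors):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Step 1: find the closest pair of clusters (single linkage assumed).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:  # ties: first pair found wins
                    best = (d, a, b)
        d, a, b = best
        # Step 2: fuse the two selected clusters into a new one.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        # Step 3: the loop repeats until a single cluster remains.
    return merges


# Hypothetical expression vectors for four genes.
genes = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
merges = hierarchical_cluster(genes)
```

The merge history is what a dendrogram draws: each tuple is one join, and its distance is the height of that join.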
182
Hierarchical Clustering
(HCL-2)
183
Hierarchical Clustering
(HCL-3)
184
Hierarchical Tree
(HCL-4)
185
Agglomerative Linkage Methods
  • Linkage methods are rules or metrics that return
    a value that can be used to determine which
    elements (clusters) should be linked.
  • Three linkage methods that are commonly used are
  • Single Linkage
  • Average Linkage
  • Complete Linkage

(HCL-6)
186
Single Linkage
Cluster-to-cluster distance is defined as the
minimum distance between members of one cluster
and members of another cluster. Single
linkage tends to create elongated clusters with
individual genes chained onto clusters.
$$D_{AB} = \min_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
(HCL-7)
187
Average Linkage
Cluster-to-cluster distance is defined as the
average distance between all members of one
cluster and all members of another cluster.
Average linkage has a slight tendency to produce
clusters of similar variance.
$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
(HCL-8)
188
Complete Linkage
Cluster-to-cluster distance is defined as the
maximum distance between members of one cluster
and members of another cluster. Complete
linkage tends to create clusters of similar size
and variability.
$$D_{AB} = \max_{i,j} \, d(u_i, v_j)$$
where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
(HCL-9)
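The three linkage rules differ only in how they summarize the pairwise distances between two clusters. The sketch below makes that explicit; the function names and the one-dimensional example clusters are hypothetical, and Euclidean distance is assumed.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(cluster_a, cluster_b):
    """All member-to-member distances between two clusters."""
    return [euclidean(u, v) for u in cluster_a for v in cluster_b]


def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / (len(cluster_a) * len(cluster_b))


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))


# Hypothetical clusters with one-dimensional members.
cluster_a = [[0.0], [1.0]]
cluster_b = [[3.0], [5.0]]
d_single = single_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
```

The four pairwise distances are 3, 5, 2 and 4, so the single, average and complete linkage distances are 2, 3.5 and 5 respectively.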
189
Comparison of Linkage Methods
Average
Single
Complete
(HCL-10)
190
Bootstrapping
Bootstrapping: resampling with replacement
Original expression matrix
Various bootstrapped matrices (by experiments)
191
Jackknifing
Jackknifing: resampling without replacement
Original expression matrix
Various jackknifed matrices (by experiments)
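Resampling by experiments amounts to choosing which columns of the expression matrix to keep. The sketch below shows both schemes; the function names and the matrix are hypothetical, and the jackknife is assumed to leave out exactly one experiment per resample.

```python
import random


def bootstrap_columns(matrix, rng):
    """Draw as many experiments as the original, with replacement."""
    n = len(matrix[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return [[row[c] for c in cols] for row in matrix], cols


def jackknife_columns(matrix, rng):
    """Leave one randomly chosen experiment out (no column is repeated)."""
    n = len(matrix[0])
    left_out = rng.randrange(n)
    cols = [c for c in range(n) if c != left_out]
    return [[row[c] for c in cols] for row in matrix], cols


# Hypothetical matrix: 2 genes (rows) by 4 experiments (columns).
matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
rng = random.Random(0)
boot_matrix, boot_cols = bootstrap_columns(matrix, rng)
jack_matrix, jack_cols = jackknife_columns(matrix, rng)
```

A bootstrapped matrix keeps the original width but may repeat some experiments and omit others, whereas a jackknifed matrix is one column narrower and contains no repeats.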
192
Analysis of bootstrapped and jackknifed support
trees
  • Bootstrapped or jackknifed expression matrices
    are created many times by randomly resampling the
    original expression matrix, using either the
    bootstrap or jackknife procedure.
  • Each time, hierarchical trees are created from
    the resampled matrices.
  • The trees are compared to the tree obtained from
    the original data set.
  • The more frequently a given cluster from the
    original tree is found in the resampled trees,
    the stronger the support for the cluster.
  • As each resampled matrix lacks some of the
    original data, high support for a cluster means
    that the clustering is not biased by a small
    subset of the data.

193
Self Organizing Tree Algorithm
SOTA - 1
  • Dopazo, J., and J.M. Carazo. Phylogenetic
    reconstruction using an unsupervised growing
    neural network that adopts the topology of a
    phylogenetic tree. J. Mol. Evol. 44:226-233,
    1997.
  • Herrero, J., A. Valencia, and J. Dopazo. A
    hierarchical unsupervised growing neural network
    for clustering gene expression patterns.
    Bioinformatics, 17(2):126-136, 2001.

194
SOTA Characteristics
SOTA - 2
  • Divisive clustering, allowing high level
    hierarchical structure to be revealed without
    having to completely partition the data set down
    to single gene vectors
  • Data set is reduced to clusters arranged in a
    binary tree topology
  • The number of resulting clusters is not fixed
    before clustering
  • Neural network approach which has advantages
    similar to SOMs such as handling large data sets
    that have large amounts of noise

195
SOTA Topology
SOTA - 3
[Figure: SOTA topology, showing each cell's centroid vector and its member genes]
196
Adaptation Overview
SOTA - 4
  • Each gene vector associated with the parent is
    compared to the centroid vector of its offspring
    cells.
  • The most similar cell's centroid and its
    neighboring cells are adapted using the
    appropriate migration weights.

197
SOTA - 5
  • Following the presentation of all genes to the
    system, a measure of system diversity is used to
    determine if training has found an optimal
    position for the offspring.
  • If the system diversity improves (decreases),
    then another training epoch is started; otherwise
    training ends and a new cycle starts with a cell
    division.

198
SOTA - 6
The most diverse cell is selected for division
at the start of the next training cycle.
199
Growth Termination
SOTA - 7
Expansion stops when the most diverse cell's
diversity falls below a threshold.
200
SOTA - 8
Each training cycle ends when the overall tree
diversity stabilizes. This triggers a cell
division and possibly a new training cycle.
201
K-Means/Medians Clustering 1
202
K-Means/Medians Clustering 2
3. Calculate mean/median expression profile of
each cluster.
4. Shuffle genes among clusters such that each
gene is now in the cluster whose mean/median expression
profile (calculated in step 3) is the closest to
that gene's expression profile.
5. Repeat steps 3 and 4 until genes cannot be
shuffled around any more, OR a user-specified
number of iterations has been reached.
k-means is most useful when the user has an a
priori hypothesis about the number of clusters
the genes should belong to.
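Steps 3 to 5 can be sketched as a short loop. This is a minimal k-means illustration, not the TM4 module: the initialization (steps 1 and 2 are not in this transcript) is assumed to be a random assignment of genes to clusters, squared Euclidean distance is assumed, an empty cluster is re-seeded from a random gene, and the function names and data are hypothetical. Replacing the column mean with a column median would give k-medians.

```python
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(profiles, k, max_iter=100, seed=0):
    """Return a cluster index for each expression profile."""
    rng = random.Random(seed)
    # Assumed initialization: random assignment of genes to clusters.
    assignment = [rng.randrange(k) for _ in profiles]
    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if not members:
                members = [rng.choice(profiles)]  # re-seed an empty cluster
            centroids.append([sum(col) / len(members)
                              for col in zip(*members)])
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in profiles]
        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Hypothetical profiles forming two well-separated groups.
profiles = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
            [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]]
labels = k_means(profiles, k=2)
```

The cluster indices themselves are arbitrary; what matters is which genes end up sharing an index.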
203
K-Means / K-Medians Support (KMS)
  • Because of the random initialization of
    K-Means/K-Medians, clustering results may vary
    somewhat between successive runs on the same
    dataset. KMS helps us validate the clustering
    results obtained from K-Means/K-Medians.
  • Run K-Means / K-Medians multiple times.
  • The KMS module generates clusters in which the
    member genes frequently group together in the
    same clusters (consensus clusters) across
    multiple runs of K-Means / K-Medians.
  • The consensus clusters consist of genes that
    clustered together in at least x% of the K-Means
    / K-Medians runs, where x is the threshold
    percentage input by the user.

204
Self-organizing maps (SOMs) 1
1. Specify the number of nodes (clusters)
desired, and also specify a 2-D geometry for the
nodes, e.g., rectangular or hexagonal
205
SOMs 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The
node closest to G9 (N2) is moved the most, and
the other nodes are moved by smaller varying
amounts. The farther away the node is from N2,
the less it is moved.
206
SOMs 3
4. Steps 2 and 3 (i.e., choosing a random gene
and moving the nodes towards it) are repeated
many (usually several thousand) times. However,
with each iteration, the amount that the nodes
are allowed to move is decreased.
5. Finally, each node will nestle among a
cluster of genes, and a gene will be considered
to be in the cluster if its distance to the node
in that cluster is less than its distance to any
other node.
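The SOM steps above can be sketched in plain Python. This is an illustration under assumptions rather than the TM4 implementation: nodes sit on a rectangular grid, they start at randomly chosen genes, the neighborhood is Gaussian with a fixed radius, the learning rate alpha decays linearly, and all names and data are hypothetical.

```python
import math
import random


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_som(genes, grid_w, grid_h, n_iter=2000, alpha0=0.5,
              radius=1.0, seed=0):
    """Return the trained node vectors of a grid_w x grid_h SOM."""
    rng = random.Random(seed)
    # Step 1: grid positions of the nodes (rectangular geometry).
    coords = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    nodes = [list(rng.choice(genes)) for _ in coords]
    for t in range(n_iter):
        # Step 4: the allowed movement shrinks with each iteration.
        alpha = alpha0 * (1 - t / n_iter)
        # Step 2: choose a random gene.
        gene = rng.choice(genes)
        winner = min(range(len(nodes)), key=lambda n: sq_dist(nodes[n], gene))
        wx, wy = coords[winner]
        # Step 3: move nodes toward the gene; nodes far from the winner
        # on the grid move less.
        for n, (x, y) in enumerate(coords):
            grid_d2 = (x - wx) ** 2 + (y - wy) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))
            nodes[n] = [w + alpha * h * (g - w)
                        for w, g in zip(nodes[n], gene)]
    return nodes


def assign_to_nodes(genes, nodes):
    """Step 5: each gene joins the cluster of its nearest node."""
    return [min(range(len(nodes)), key=lambda n: sq_dist(nodes[n], g))
            for g in genes]


# Hypothetical two-dimensional expression profiles.
genes = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.8, 0.2], [10.0, 0.0]]
nodes = train_som(genes, grid_w=2, grid_h=2)
clusters = assign_to_nodes(genes, nodes)
```

Each update is a weighted average of a node and a gene, so the nodes always stay inside the range spanned by the data.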
207
SOM Neighborhood Options
[Figure: two panels, each showing genes G7 to G11 and nodes N1 to N6. Bubble neighborhood (nodes within a radius): some nodes move, alpha is constant. Gaussian neighborhood: all nodes move, alpha is scaled.]
208
Template Matching
  • Template matching allows one to find expression
    vectors which match a provided template
  • A template can be derived from
  • a gene known to be central to the area of study
  • a sample or set of samples of a particular type
  • a cluster with a mean pattern of interest
  • a pattern constructed to reveal trends based on
    knowledge of the experimental design

209
PTM-2
  • Sometimes it is useful to identify elements that
    have complementary patterns by selecting to use
    the absolute value of r.
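Template matching can be sketched as a correlation filter. The sketch assumes that r is the Pearson correlation between a gene's profile and the template, and it selects genes by a cutoff on r itself; the cutoff, the function names and the data are hypothetical, and profiles with zero variance are not handled.

```python
import math


def pearson_r(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def match_template(genes, template, threshold=0.9, use_absolute=False):
    """Return names of genes whose profile correlates with the template."""
    matches = []
    for name, profile in genes.items():
        r = pearson_r(profile, template)
        # With use_absolute, complementary (anti-correlated) patterns also match.
        score = abs(r) if use_absolute else r
        if score >= threshold:
            matches.append(name)
    return matches


# Hypothetical profiles and a rising template.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.0, 3.0, 2.0, 1.0],
}
template = [0.0, 1.0, 2.0, 3.0]
same_direction = match_template(genes, template)
either_direction = match_template(genes, template, use_absolute=True)
```

Only g1 follows the rising template; switching to the absolute value of r also picks up g2, whose pattern is the mirror image.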

210
Gene Shaving
Results in a series of nested clusters
Choose cluster of appropriate size as determined
by gap statistic calculation
Repeat until only one gene remains
Orthogonalize expression matrix with respect to
the average gene in the cluster and repeat
shaving procedure
211
Gene Shaving
The final cluster contains a set of genes that
are greatly affected by the experimental
conditions in a similar way.
Create random permutations of the expression
matrix and calculate R2 for each
Compare R2 of each cluster to that of the entire
expression matrix
Choose the cluster whose R2 is furthest from the
average R2 of the permuted expression matrices.
212
Relevance Networks
Set of genes whose expression profiles are
predictive of one another.
Can be used to identify negative correlations
between genes.
213
Relevance Networks
The remaining relationships between genes define
the subnets
214
Principal Components Analysis
  • PCA simplifies the views of the data.
  • Suppose we have measurements for each gene on
    multiple experiments.
  • Suppose some of the experiments are correlated.
  • PCA will ignore the redundant experiments, and
    will take a weighted average of some of the
    experiments, thus possibly making the trends in
    the data more interpretable.
  • 5. The