Title: Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data
1Integrative Analysis of High Dimensional Gene
Expression, Metabolite and Blood Chemistry Data
- Kwan R. Lee, Ph.D.
- and
- Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
- Biomedical Data Sciences
- GlaxoSmithKline
- kwan.lee_at_gsk.com
2Overview
- Systems Biology
- Challenges for Statisticians
- Possible solutions
- Example of integrative data analysis
- Summary and discussion
3Of mice and men
4 Integrate knowledge and technologies: reduce
attrition by running coordinated studies in
animal and man
5Focusing on one platform may miss an obvious
signal!!!
6How can efficacy failures be attacked?
Few data to support analogy
Many data to support analogy
7 Systems Biology approach to drug discovery
8Experimental Platforms Non-omics and Omics, what
are they?
Figure: platform panels (LC-MS lipid, 1H NMR metabolites, non-omic markers, Affy transcriptome, LC-MS metabolites) for groups Veh, A, B, C, D in Normal and Disease
9Experimental Platforms Non-omics and Omics, what
are they? (cont.)
- Traditional Blood Chemistry (non-omics)
- Gene Expression (transcriptomics)
- Metabolite (metabonomics)
- Lipid (lipomics)
- Protein (proteomics)
10Five Challenges
- Data Pre-processing
- High Dimensionality
- Multiple Testing for Marker Selection
- Data Integration
- Validation of the Prediction Model
11Challenge 1 Data Pre-processing
- Peak Alignment (NMR, LC/MS)
- Normalization (Gene Chip, NMR, LC/MS data)
- Why? Remove systematic bias in the data
- Normalization within the platform makes data
comparable across samples
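A minimal sketch of within-platform normalization, assuming quantile normalization as the method (the slides do not name one). The function name and the simulated per-sample bias are illustrative only.

```python
import numpy as np

def quantile_normalize(X):
    """Give every sample (row) the same empirical distribution.

    X has shape (n_samples, n_features), e.g. log intensities.
    Ties are broken by position, which is adequate for a sketch.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=1)                 # feature indices sorted within each sample
    ranks = np.argsort(order, axis=1)             # rank of each feature within its sample
    reference = np.sort(X, axis=1).mean(axis=0)   # mean of the k-th smallest values across samples
    return reference[ranks]

# Four simulated samples with a different systematic offset each (assumption)
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 1000)) + np.array([[0.0], [0.5], [1.0], [2.0]])
normalized = quantile_normalize(raw)

print("median per sample before:", np.round(np.median(raw, axis=1), 2))
print("median per sample after: ", np.round(np.median(normalized, axis=1), 2))
```

After this step the per-sample offsets are gone, so differences between samples reflect rank changes of individual features rather than a global shift.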
12Challenge 2 High Dimensionality: # of subjects
<< # of variables
Figure: data matrix of variables (probe sets 1 to 22,000; lipids 1 to 2,000; metabolites 1 to 3,000; NMR 1 to 500; Choles, Trig, ...) by animals (Animal 1 to Animal 100)
- Blood Chemistry: 9 markers
- Gene Expression: 22,000 probe sets
- Lipid LC/MS: 2,000 peaks
- Metabolite LC/MS: 3,000 peaks
- NMR: 500 buckets
13Challenge 3 Multiple Testing in Variable
Selection
No Adjustment for Multiple Testing
FWER Adjustment
FDR
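A small sketch of the three options named on this slide, assuming Bonferroni for the FWER adjustment and Benjamini-Hochberg for FDR (the slide does not say which procedures were used). The p-values are made up for illustration.

```python
import numpy as np

def bonferroni(pvals):
    """FWER control: multiply each p-value by the number of tests, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """FDR control: Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical p-values for 8 markers
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
p_bonf = bonferroni(pvals)
p_bh = benjamini_hochberg(pvals)

alpha = 0.05
print("selected, no adjustment:", int((pvals < alpha).sum()))
print("selected, FWER (Bonferroni):", int((p_bonf < alpha).sum()))
print("selected, FDR (BH):", int((p_bh < alpha).sum()))
```

The FDR list sits between the unadjusted list and the FWER list, which is why it is often preferred for marker selection on thousands of variables.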
14Challenge 4 Data integration
Figure: platform panels (LC-MS lipid, 1H NMR metabolites, non-omic markers, Affy transcriptome, LC-MS metabolites) for groups Veh, A, B, C, D in Normal and Disease
15Challenge 4 Data integration (cont.)
Integration Approach 1: Platform A (20000s var.) and Platform B (1000s var.) -> Combined Data
Integration Approach 2: Platform A (20000s var.) and Platform B (1000s var.) -> Dimension Reduction (e.g. variable selection) -> Platform A (1000s var.) and Platform B (100s var.) -> Combined Data
16Challenge 4 Data integration Example 1
- Integration approach 1: Simple data integration
- When the platform data are simply combined, the
platform with the largest amount of data and
variability will dominate the other platforms
17PCA on Non-omics, Transcriptomics, and Combined.
Transcriptomics (12,488)
Mirror image!!!
Combined (12,508)
Transcriptomics data dominate Non-omics data!!!
18PCA on Non-omics, Transcriptomics, and Combined.
Transcriptomics (20 PCs)
Like a mirror image!!!
Combined (40)
19Challenge 4 Data integration Example 2
- Integration approach 2: Integrate on selected markers
- 9 blood chemistry, 2000 probe sets, 150 metabolites
- There are still platforms with more selected markers
- How to weight different platforms appropriately? E.g. 9 blood chemistry markers are known to relate to disease or drug
- Identify relationships among the probe sets, metabolites, and blood chemistry markers in terms of biological pathways
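One possible answer to the weighting question, sketched under the assumption that block scaling is acceptable: autoscale every variable, then divide each platform by the square root of its number of variables so that every platform carries the same total variance. This is a common convention, not necessarily what was done in the study; the block sizes and random data are placeholders.

```python
import numpy as np

def autoscale(X):
    """Center each variable and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def block_scale(blocks):
    """Autoscale, then divide each block by sqrt(number of variables in it)."""
    return np.hstack([autoscale(B) / np.sqrt(B.shape[1]) for B in blocks])

def variance_share(X, n_first):
    """Fraction of the total variance carried by the first n_first columns."""
    v = X.var(axis=0, ddof=1)
    return v[:n_first].sum() / v.sum()

rng = np.random.default_rng(1)
blood = rng.normal(size=(67, 9))       # placeholder for 9 blood chemistry markers
probes = rng.normal(size=(67, 2000))   # placeholder for 2000 selected probe sets

share_plain = variance_share(np.hstack([autoscale(blood), autoscale(probes)]), 9)
share_block = variance_share(block_scale([blood, probes]), 9)

print(f"blood chemistry share, autoscaling only: {share_plain:.4f}")
print(f"blood chemistry share, block scaling:    {share_block:.4f}")
```

With autoscaling only, the 9 blood chemistry markers contribute 9/2009 of the total variance, so a PCA of the combined table is driven by the probe sets. With block scaling, both platforms contribute one half each.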
20Principal Component Analysis (PCA): Projection
of 67 animals, 28 normal (black) and 39 disease
(red) (9 NO, 1991 TA, 115 MT). All markers used
for projection
Normal Disease
21Loading Plot
22Partial Least Squares Discriminant Analysis
(PLS-DA): Disease group only
Vehicle Drug
23PLS-DA Corresponding projection of all markers
(9 NO, 1991 TA, 115 MT), Which are important
drug markers?
Veh
Drug
24Ranked drug markers by importance or by
coefficients.
Marker importance by variable importance in
projection (VIP)
Up or down regulation by coefficients
25Validation of the model: R2, Q2 and permutation
tests 100 times (P < 0.01)
26Variation explained by each platform: PLS-DA for
prediction of 2 experimental groups
Two Groups: HFD, vehicle; HFD, drug treated
The above table is based on a 2-component model.
If the 4th model uses more components, 91% of the
variation in the data can be explained by 4
components.
Q2(Y): amount of variation among the 2 groups
explained by the model (cross-validated)
27Challenge 5 Validation of the Prediction Model
- Correct way of doing cross-validation
- Especially when the variables are selected
- Is your prediction accuracy significant?
28Random Noise Data
- Simulate 20,000 marker columns of random noise for 100 patients and one additional column containing arbitrary labels of class indicators.
- Select the 5 marker columns showing most correlation with the class label.
- Make a prediction model for the class indicators based on these 5 selected markers.
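The simulation above can be sketched as follows. Assumptions: standard normal noise, a balanced 50/50 label column, Pearson correlation for the selection, and an ordinary least-squares fit as a stand-in for the prediction model (the talk uses PLS-DA in SIMCA).

```python
import numpy as np

rng = np.random.default_rng(2024)
n_patients, n_markers, n_selected = 100, 20_000, 5

X = rng.normal(size=(n_patients, n_markers))              # pure noise markers
y = rng.permutation(np.repeat([0, 1], n_patients // 2))   # arbitrary class labels

# Pearson correlation of every marker column with the class label
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Keep the 5 markers most correlated with the label
top = np.argsort(-np.abs(corr))[:n_selected]

# Naive model: fit on the selected markers and score on the SAME data
A = np.column_stack([np.ones(n_patients), X[:, top]])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = (A @ beta > 0.5).astype(int)
resub_accuracy = (pred == y).mean()

print("|correlation| of selected markers:", np.round(np.abs(corr[top]), 2))
print("accuracy on the data used for selection:", resub_accuracy)
```

Although every marker is noise, the selected columns correlate visibly with the labels and the model looks predictive on the data it was built from.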
29PCA of Full Markers
30PLS-DA on Random Noise Data
- Running a full model in SIMCA does not yield a model: no significant Q2.
- Multivariate approach is conservative.
- Q2 computes prediction performance.
- But we forced the software to fit a 6-component model by PLS-DA
- (R2 1.0, Q2 0.225)
31Full marker model PLS-DA
32Was it real or by chance?
33Select 5 Markers
- Selected top 5 markers using VIP from the over-fitted model and fit PLS-DA again on the same data.
- Now we have (R2 0.459, Q2 0.348)
34Good prediction from PLS-DA? Q2 0.35
35Validated by permutation test? Significance of Q2
36Selection Bias
- When a prediction model is tested on the same data that were used in the first instance to select the markers, selection bias makes the test error overly optimistic.
- Many publications claimed a small set of selected genes is highly predictive.
- IBI practice is to use a data set to select markers and use the same data set to fit a prediction model based on selected markers.
37How to correct for selection bias?
- External validation should be undertaken subsequent to the feature selection process.
- Independent test data set (hold-out data set) that was never used for feature selection.
- External cross-validation (ECV).
- Cross validation of the prediction model is external to the selection process.
- In other words, make a new selection for each cross validation round.
38Externally Validated PLS. Model and variable
selection
- Divide the data set randomly into d parts.
- Set ecv = 1 (this means hold out one part and use d-1 parts for modeling)
- Set a = 1 (the number of components, do until 10)
- Set k = total number of variables
- Loop
- Fit PLS model with given a and k, PLS(a, k)
- Predict hold-out set, compute PRESS(ecv, a, k) and save
- Choose top half of the variables by appropriate statistics (coeff, VIP, t-ratio etc.)
- Set k = k/2
- Go back to Loop until k = 2
- Set a = a + 1
- Go back to Loop until a = 10
- Set ecv = ecv + 1
- Go back to Loop until ecv = d
- Compute PRESS(a, k) = Sum over ecv of PRESS(ecv, a, k)
- Compute Q2(a, k) = 1 - PRESS(a, k)/TSS
- Plot Q2(a, k) vs. log2(k)
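A rough sketch of the loop on this slide, under several assumptions: a plain NIPALS PLS1 written in NumPy stands in for the PLS software, variables are ranked by absolute regression coefficient, variables are taken to be on comparable scales, and only 3 components are tried to keep the run short (the slide goes to 10). The function names `pls1_fit` and `ecv_q2` are made up for this illustration.

```python
import numpy as np

def pls1_fit(X, y, a):
    """PLS1 by NIPALS; returns (x_mean, y_mean, coef) for up to a components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(min(a, X.shape[1], X.shape[0] - 1)):
        w = E.T @ f
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left to model
            break
        w = w / norm
        t = E @ w                        # scores
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        c = f @ t / tt                   # y loading
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def ecv_q2(X, y, d=5, max_a=3, seed=0):
    """Q2(a, k) with variable selection redone inside every hold-out round."""
    rng = np.random.default_rng(seed)
    n, k_total = X.shape
    folds = np.array_split(rng.permutation(n), d)
    press = {}
    for hold in folds:                               # ecv = 1 .. d
        train = np.setdiff1d(np.arange(n), hold)
        for a in range(1, max_a + 1):                # number of components
            keep = np.arange(k_total)                # restart from all variables
            while True:
                xm, ym, coef = pls1_fit(X[np.ix_(train, keep)], y[train], a)
                resid = y[hold] - (ym + (X[np.ix_(hold, keep)] - xm) @ coef)
                key = (a, keep.size)
                press[key] = press.get(key, 0.0) + resid @ resid
                if keep.size <= 2:
                    break
                # keep the top half of the variables, ranked on training data only
                keep = keep[np.argsort(-np.abs(coef))[: keep.size // 2]]
    tss = np.sum((y - y.mean()) ** 2)
    return {key: 1.0 - val / tss for key, val in press.items()}

# Pure noise example: 40 samples, 64 variables, arbitrary 20/20 labels
rng = np.random.default_rng(11)
X = rng.normal(size=(40, 64))
y = np.repeat([0.0, 1.0], 20)
q2 = ecv_q2(X, y)
for (a, k), value in sorted(q2.items()):
    print(f"a={a} log2(k)={int(np.log2(k))} Q2={value:.3f}")
```

Because the ranking is recomputed from the training part in every round, the hold-out samples never influence which variables are kept, which is the point of making the validation external.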
39Simulation of 2000 Random Data (R. Simon 2003)
- 20 samples x 6000 genes, class labels split 10/10
- Repeat 2000 times
- Compute 3 different error rates
- Re-substitution (wrong)
- Cross validation after selection (wrong)
- Cross validation before selection (correct)
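A reduced sketch of this simulation. Assumptions: 50 repeats instead of 2000 to keep it quick, 10 genes selected by a t-like statistic, a nearest-centroid classifier, and leave-one-out cross validation. None of these choices are stated on the slide.

```python
import numpy as np

def select_genes(X, y, m=10):
    """Indices of the m genes with the largest absolute t-like statistic."""
    a, b = X[y == 0], X[y == 1]
    s = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / s
    return np.argsort(-np.abs(t))[:m]

def centroid_predict(Xtr, ytr, Xte):
    """Assign each test row to the class with the nearer training centroid."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def loocv_error(X, y, select_inside, m=10):
    """Leave-one-out error; selection is redone per round only if select_inside."""
    n = len(y)
    genes_all = select_genes(X, y, m)          # uses every sample, including the one left out
    wrong = 0
    for i in range(n):
        tr = np.arange(n) != i
        genes = select_genes(X[tr], y[tr], m) if select_inside else genes_all
        pred = centroid_predict(X[tr][:, genes], y[tr], X[i:i + 1, genes])
        wrong += int(pred[0] != y[i])
    return wrong / n

def one_simulation(rng, n=20, p=6000, m=10):
    X = rng.normal(size=(n, p))                # random data, no real class difference
    y = np.repeat([0, 1], n // 2)
    genes = select_genes(X, y, m)
    resub = np.mean(centroid_predict(X[:, genes], y, X[:, genes]) != y)
    return resub, loocv_error(X, y, False, m), loocv_error(X, y, True, m)

rng = np.random.default_rng(7)
errors = np.array([one_simulation(rng) for _ in range(50)])
mean_resub, mean_cv_after, mean_cv_before = errors.mean(axis=0)

print(f"re-substitution:                  {mean_resub:.2f}")
print(f"cross validation after selection: {mean_cv_after:.2f}")
print(f"cross validation before selection:{mean_cv_before:.2f}")
```

Only the third estimate repeats the gene selection inside each round, so it is the only one that reflects the fact that the labels carry no information.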
40Results of 2000 Random Data
41Permutation testing
- Because of the high dimensionality of gene expression data, it may be possible to achieve relatively small error rates even for random data.
- To assess the significance of the classification results, a permutation test is suggested.
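A minimal sketch of such a permutation test. Assumptions: leave-one-out accuracy of a nearest-centroid classifier is the statistic, 200 label permutations are used, and the data are simulated; the function names are hypothetical. If marker selection is part of the modeling, it would have to be repeated inside `statistic` for every permutation.

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in model)."""
    n = len(y)
    hits = 0
    for i in range(n):
        tr = np.arange(n) != i
        c0 = X[tr & (y == 0)].mean(axis=0)
        c1 = X[tr & (y == 1)].mean(axis=0)
        pred = int(np.sum((X[i] - c1) ** 2) < np.sum((X[i] - c0) ** 2))
        hits += int(pred == y[i])
    return hits / n

def permutation_p_value(X, y, statistic, n_perm=200, seed=0):
    """Compare the observed statistic with its distribution under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, y)
    null = np.array([statistic(X, rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p

rng = np.random.default_rng(3)
y = np.repeat([0, 1], 15)
noise = rng.normal(size=(30, 50))          # no class difference
signal = noise.copy()
signal[y == 1, :5] += 2.0                  # 5 markers shifted in class 1

obs_signal, p_signal = permutation_p_value(signal, y, loocv_accuracy)
obs_noise, p_noise = permutation_p_value(noise, y, loocv_accuracy)
print(f"signal data: accuracy={obs_signal:.2f}, p={p_signal:.3f}")
print(f"noise data:  accuracy={obs_noise:.2f}, p={p_noise:.3f}")
```

The p-value is the fraction of shuffled-label runs that do at least as well as the real labels, with 1 added to numerator and denominator so it can never be exactly zero.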
42Challenge 5 Validation of the Prediction Model - summary
- Correct way of doing cross-validation
- All the steps of the prediction modeling should be cross-validated.
- Each cross validation step should start from scratch
- Is your prediction accuracy significant?
- Random data can give you low prediction error
- Permutation tests, bootstrap aggregation
43Summary and Discussion
- Recent technological advances present challenging and interesting biological data at the molecular level.
- Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data.
- Integrative analysis is even more challenging and we presented some solutions to these challenges. There is plenty of room for improvement.
44Acknowledgement
- GlaxoSmithKline
- High Throughput Biology
- Biomedical Data Sciences
- Genomics and Proteomics Science
- Pathology, Cellular Biochemical Toxicology
- Discovery IT
45Data exploration Present Challenges
Data is an extremely valuable asset, but like a
cash crop, unless harvested, it is wasted. -
Sid Adelman