Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

Description:

... LC/MS) Normalization (Gene Chip, NMR, LC/MS data) Why? ... 2004 NISS Metabolomics Workshop, 2005 Integrative Analysis of High Dimensional Gene Expression, ... – PowerPoint PPT presentation

Slides: 46

Provided by: IBIStatis

Learn more at: https://www.niss.org

Transcript and Presenter's Notes

Title: Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

1
Integrative Analysis of High Dimensional Gene
Expression, Metabolite and Blood Chemistry Data

Kwan R. Lee, Ph.D.
and
Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
Biomedical Data Sciences
GlaxoSmithKline
kwan.lee_at_gsk.com

2
Overview

Systems Biology
Challenges for Statisticians
Possible solutions
Example of integrative data analysis
Summary and discussion

3
Of mice and men
4
Integrate knowledge and technologies. Reduce
attrition by running coordinated studies in
animal and man.
5
Focusing on one platform may miss an obvious
signal!!!
6
How can efficacy failures be attacked?
Few data to support analogy
Many data to support analogy
7
Systems Biology approach to drug discovery
8
Experimental Platforms: Non-omics and Omics, what
are they?
[Figure: data panels for LC-MS lipids, 1H NMR metabolites, LC-MS metabolites, Affy transcriptome, and non-omic markers. Group labels: Veh, A, B, C, D, Veh, A, B, C, D; Normal, Disease]
9
Experimental Platforms: Non-omics and Omics, what
are they? (cont.)

Traditional Blood Chemistry (non-omics)
Gene Expression (transcriptomics)
Metabolite (metabonomics)
Lipid (lipomics)
Protein (proteomics)

10
Five Challenges

Data Pre-processing
High Dimensionality
Multiple Testing for Marker Selection
Data Integration
Validation of the Prediction Model

11
Challenge 1: Data Pre-processing

Peak Alignment (NMR, LC/MS)
Normalization (Gene Chip, NMR, LC/MS data)
Why? Remove systematic bias in the data
Normalization within the platform makes data
comparable across samples
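
The slide does not name a normalization method. As a minimal sketch, assuming log-scale intensities and a purely additive per-sample bias, median centering is one simple way to make samples comparable within a platform. The data and the `median_center` helper below are hypothetical.

```python
import numpy as np

def median_center(log_intensities: np.ndarray) -> np.ndarray:
    """Subtract each sample's median so all samples share a common center.

    Rows are samples; columns are variables (probe sets, peaks, buckets).
    """
    sample_medians = np.median(log_intensities, axis=1, keepdims=True)
    return log_intensities - sample_medians

# Hypothetical data: 4 samples x 1000 variables, each with its own offset
rng = np.random.default_rng(0)
offsets = np.array([[0.0], [0.8], [-0.5], [1.2]])  # simulated systematic bias
raw = rng.normal(loc=8.0, scale=1.0, size=(4, 1000)) + offsets
normalized = median_center(raw)

print(np.median(raw, axis=1).round(2))         # medians differ by sample
print(np.median(normalized, axis=1).round(2))  # medians are all zero
```

Real pipelines often use platform-specific methods instead; this only illustrates the idea of removing a per-sample systematic shift.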

12
Challenge 2: High Dimensionality (# of subjects << # of variables)

Data matrix layout: rows are Animal 1, Animal 2, ..., Animal 100; columns are the variables from each platform.

Platform            Variables
Blood Chemistry     9 markers (Choles, Trig, ...)
Gene Expression     22,000 probe sets
Lipid LC/MS         2,000 peaks
Metabolite LC/MS    3,000 peaks
NMR                 500 buckets

13
Challenge 3: Multiple Testing in Variable Selection
No Adjustment for Multiple Testing
FWER Adjustment

FDR
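
The three options on this slide can be compared on a small set of p-values. The sketch below assumes Bonferroni for the FWER adjustment and Benjamini-Hochberg for FDR; the slide names neither procedure, and the p-values are made up for illustration.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject when p <= alpha / m."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= (k / m) * alpha."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # last rank meeting the bound
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for 8 candidate markers
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
unadjusted = pvals <= 0.05
fwer = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)

print(unadjusted.sum(), fwer.sum(), fdr.sum())  # 5 1 2
```

With these numbers, no adjustment selects 5 markers, the FWER adjustment selects 1, and FDR sits in between with 2.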
14
Challenge 4: Data integration
[Figure: data panels for LC-MS lipids, 1H NMR metabolites, LC-MS metabolites, Affy transcriptome, and non-omic markers. Group labels: Veh, A, B, C, D, Veh, A, B, C, D; Normal, Disease]
15
Challenge 4: Data integration (cont.)
Integration Approach 1: Platform A (20000s var.) and Platform B (1000s var.) are merged directly into Combined Data.
Integration Approach 2: Platform A (20000s var.) and Platform B (1000s var.) first go through Dimension Reduction (e.g., variable selection), giving Platform A (1000s var.) and Platform B (100s var.), which are then merged into Combined Data.
16
Challenge 4: Data integration, Example 1

Integration approach 1: Simple data integration
When the platform data are simply combined, the
platform with the largest amount of data and
variability will dominate the other platforms.

17
PCA on Non-omics, Transcriptomics, and Combined.

Non-omics (20)

Transcriptomics (12,488)
Mirror image!!!
Combined (12,508)
Transcriptomics data dominate Non-omics data!!!
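
The dominance effect can be reproduced on simulated data. This is a sketch only: the two blocks below are hypothetical, each driven by its own independent latent factor, and the large block is shrunk to 2000 variables for speed. It compares the first principal component of the combined matrix with the first component of each block alone.

```python
import numpy as np

def autoscale(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pc1_scores(X):
    """Scores on the first principal component via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * S[0]

rng = np.random.default_rng(1)
n = 40
g = rng.normal(size=n)  # latent factor behind the simulated transcriptomics
h = rng.normal(size=n)  # independent latent factor behind simulated non-omics

transcript = np.outer(g, rng.normal(size=2000)) + rng.normal(size=(n, 2000))
non_omics = np.outer(h, rng.normal(size=20)) + rng.normal(size=(n, 20))

T, N = autoscale(transcript), autoscale(non_omics)
combined = np.hstack([N, T])

# Absolute correlation, because the sign of a principal component is arbitrary
r_trans = abs(np.corrcoef(pc1_scores(combined), pc1_scores(T))[0, 1])
r_non = abs(np.corrcoef(pc1_scores(combined), pc1_scores(N))[0, 1])
print(f"combined PC1 vs transcriptomics PC1: |r| = {r_trans:.3f}")
print(f"combined PC1 vs non-omics PC1:       |r| = {r_non:.3f}")
```

The combined scores track the large block almost perfectly, up to a sign flip, which is consistent with the "mirror image" remark on the slide.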
18
PCA on Non-omics, Transcriptomics, and Combined.

Non-omics (20)

Transcriptomics (20 PCs)
Like a mirror image!!!
Combined (40)
19
Challenge 4: Data integration, Example 2

Integration approach 2: Integrate on selected
markers
9 blood chemistry markers, 2000 probe sets, 150
metabolites
Some platforms still contribute more selected
markers than others
How to weight different platforms appropriately?
E.g., the 9 blood chemistry markers are known to relate
to disease or drug
Identify relationships among the probe sets,
metabolites, and blood chemistry
markers in terms of biological pathways
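
One commonly used answer to the weighting question is block scaling: autoscale every variable, then divide each block by the square root of its number of variables so that every platform carries the same total variance. The slides do not say this is the weighting the authors used, so treat the sketch below as an assumption. The data are random placeholders with the block sizes from this slide.

```python
import numpy as np

def block_scale(blocks):
    """Autoscale each block, then divide by sqrt(#variables) so every block
    carries the same total variance (1.0) in the combined matrix."""
    scaled = []
    for X in blocks:
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        scaled.append(Z / np.sqrt(X.shape[1]))
    return np.hstack(scaled)

rng = np.random.default_rng(2)
n = 67  # hypothetical number of animals
blood = rng.normal(size=(n, 9))
probes = rng.normal(size=(n, 2000))
metabolites = rng.normal(size=(n, 150))

combined = block_scale([blood, probes, metabolites])
block_variances = [
    combined[:, :9].var(axis=0, ddof=1).sum(),
    combined[:, 9:2009].var(axis=0, ddof=1).sum(),
    combined[:, 2009:].var(axis=0, ddof=1).sum(),
]
print(np.round(block_variances, 3))  # each block contributes the same amount
```

Other weights are possible, for example giving extra weight to the blood chemistry block because its markers are already known to relate to disease or drug.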

20
Principal Component Analysis (PCA): projection
of 67 animals, 28 normal (black) and 39 disease
(red). All markers (9 NO, 1991 TA, 115 MT) used
for projection
Normal Disease
21
Loading Plot
22
Partial Least Squares Discriminant Analysis
(PLS-DA): disease group only
Vehicle Drug
23
PLS-DA: corresponding projection of all markers
(9 NO, 1991 TA, 115 MT). Which are important
drug markers?
Veh
Drug
24
Rank drug markers by importance or by
coefficients.
Marker importance: variable importance on
projection (VIP)
Up or down regulation: sign of the coefficients
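
Both rankings can be sketched with a small single-response PLS fit. The example assumes a NIPALS PLS1 implementation and the usual VIP formula, VIP_j = sqrt(p * sum_a SS_a * w_ja^2 / sum_a SS_a) with SS_a = q_a^2 * t_a't_a. The slides used SIMCA, whose exact computation is not shown there. The data are simulated, with the first three markers made to respond to drug.

```python
import numpy as np

def pls1(X, y, n_comp):
    """NIPALS PLS1 on column-centered X and centered y.

    Returns weights W, scores T, X-loadings P, and y-loadings q.
    """
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    n, p = Xr.shape
    W, P = np.zeros((p, n_comp)), np.zeros((p, n_comp))
    T, q = np.zeros((n, n_comp)), np.zeros(n_comp)
    for a in range(n_comp):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # unit-length weight vector
        t = Xr @ w
        P[:, a] = Xr.T @ t / (t @ t)
        q[a] = yr @ t / (t @ t)
        W[:, a], T[:, a] = w, t
        Xr = Xr - np.outer(t, P[:, a])  # deflate X
        yr = yr - q[a] * t              # deflate y
    return W, T, P, q

def vip_scores(W, T, q):
    """Variable importance on projection for a PLS1 model."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)  # y-variance explained per component
    return np.sqrt(p * (W ** 2 @ ss) / ss.sum())

rng = np.random.default_rng(5)
n, p = 50, 30
y = np.repeat([0.0, 1.0], n // 2)  # 0 = vehicle, 1 = drug (simulated)
X = rng.normal(size=(n, p))
X[:, :3] += 2.0 * y[:, None]       # first three markers respond to drug

W, T, P, q = pls1(X, y, n_comp=2)
vip = vip_scores(W, T, q)
coef = W @ np.linalg.solve(P.T @ W, q)  # sign indicates up or down regulation
ranking = np.argsort(-vip)
print(ranking[:5], np.sign(coef[ranking[:5]]))
```

Because each weight vector has unit length, the squared VIP values average to 1, which is why a VIP above 1 is often read as above-average importance.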
25
Validation of the model: R2, Q2, and permutation
tests (100 permutations, P < 0.01)
26
Variation explained by each platform: PLS-DA for
prediction of 2 experimental groups
Two groups: HFD, vehicle; HFD, drug treated
The above table is based on a 2-component model.
If the 4th model uses more components, 91% of the
variation in the data can be explained by 4
components.
Q2(Y): amount of variation among the 2 groups
explained by the model (cross-validated)
27
Challenge 5: Validation of the Prediction Model

Correct way of doing cross-validation
Especially when the variables are selected
Is your prediction accuracy significant?

28
Random Noise Data

Simulate 20,000 marker columns of random noise
for 100 patients and one additional column
containing arbitrary labels of class indicators.
Select 5 marker columns showing most correlation
with class label.
Make a prediction model for class indicators
based on these 5 selected markers.
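
The recipe above can be sketched as follows. Two things are assumptions: the labels are an arbitrary 50/50 split, and an ordinary least-squares fit stands in for the prediction model, whereas the slides use PLS-DA. Every marker is pure noise, so any apparent predictive power comes from selection alone.

```python
import numpy as np

rng = np.random.default_rng(3)
n_patients, n_markers = 100, 20_000
X = rng.normal(size=(n_patients, n_markers))  # pure-noise marker columns
labels = np.repeat([0, 1], n_patients // 2)   # arbitrary class indicator

# Pearson correlation of every marker with the class label
Xc = X - X.mean(axis=0)
yc = labels - labels.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
top5 = np.argsort(-np.abs(corr))[:5]

# Least-squares model on the 5 selected markers, evaluated on the same data
A = np.column_stack([np.ones(n_patients), X[:, top5]])
beta, *_ = np.linalg.lstsq(A, labels, rcond=None)
predicted = (A @ beta > 0.5).astype(int)
apparent_accuracy = np.mean(predicted == labels)

print("top-5 |correlations|:", np.abs(corr[top5]).round(2))
print("apparent accuracy on the same data:", apparent_accuracy)
```

The selected markers correlate noticeably with the label and the model appears to classify well, even though there is no signal in the data.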

29
PCA of Full Markers
30
PLS-DA on Random Noise Data

Running a full model in SIMCA does not yield a
model: no significant Q2.
The multivariate approach is conservative.
Q2 measures prediction performance.
But we forced the software to fit a 6-component
model by PLS-DA
(R2 = 1.0, Q2 = 0.225)

31
Full marker model: PLS-DA
32
Was it real or by chance?
33
Select 5 Markers

Selected the top 5 markers using VIP from the
over-fitted model and fit PLS-DA again on the
same data.
Now we have (R2 = 0.459, Q2 = 0.348)

34
Good prediction from PLS-DA? Q2 = 0.35
35
Validated by permutation test? Significance of Q2
36
Selection Bias

When a prediction model is tested on the same
data that were used in the first instance to
select the markers, selection bias makes the test
error overly optimistic.
Many publications claimed a small set of selected
genes is highly predictive.
IBI practice is to use a data set to select
markers and use the same data set to fit a
prediction model based on selected markers.

37
How to correct for selection bias?

External validation should be undertaken
subsequent to the feature selection process.
Independent test data set (hold-out data set)
that was never used for feature selection.
External cross-validation (ECV).
Cross-validation of the prediction model is
external to the selection process.
In other words, make a new selection for each
cross-validation round.

38
Externally Validated PLS: Model and variable
selection

Divide the data set randomly into d parts.
Set ecv = 1 (this means hold out one part and
use d-1 parts for modeling)
Set a = 1 (the number of components, do until
10)
Set k = total number of variables
Loop
Fit PLS model with given a and k, PLS(a, k)
Predict hold-out set, compute PRESS(ecv, a, k)
and save
Choose top half of the variables by appropriate
statistics (coeff, vip, t-ratio etc)
Set k = k/2
Go back to Loop until k = 2
Set a = a + 1
Go back to Loop until a = 10
Set ecv = ecv + 1
Go back to Loop until ecv = d
Compute PRESS(a, k) = Sum over ecv of PRESS(ecv,
a, k)
Compute Q2(a, k) = 1 - PRESS(a, k)/TSS
Plot Q2(a, k) vs. log2(k)
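
The procedure can be sketched in Python as below. Several choices are assumptions rather than the authors' implementation: a NIPALS PLS1 fit, ranking variables by absolute regression coefficient (one of the statistics the slide lists), TSS computed about the overall mean of y, and capping the number of components at k when few variables remain. The simulated data and the limit of 3 components are for illustration only.

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """NIPALS PLS1 on already-centered X and y; returns regression coefficients."""
    Xr, yr = X.copy(), y.copy()
    p = X.shape[1]
    W, P, q = np.zeros((p, n_comp)), np.zeros((p, n_comp)), np.zeros(n_comp)
    for a in range(n_comp):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        P[:, a] = Xr.T @ t / (t @ t)
        q[a] = yr @ t / (t @ t)
        W[:, a] = w
        Xr = Xr - np.outer(t, P[:, a])
        yr = yr - q[a] * t
    return W @ np.linalg.solve(P.T @ W, q)

def external_cv_q2(X, y, d=5, max_comp=3, seed=0):
    """Q2(a, k) with variable selection repeated inside every hold-out round."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), d)
    ks, k = [], p
    while k >= 2:                       # k = p, p/2, p/4, ..., down to 2
        ks.append(k)
        k //= 2
    press = np.zeros((max_comp, len(ks)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        x_mean, y_mean = X[train_idx].mean(axis=0), y[train_idx].mean()
        Xtr, ytr = X[train_idx] - x_mean, y[train_idx] - y_mean
        Xte = X[test_idx] - x_mean      # centered with training statistics only
        for a in range(1, max_comp + 1):
            keep = np.arange(p)         # restart from all variables
            for j, k in enumerate(ks):
                b = pls1_coef(Xtr[:, keep], ytr, min(a, k))
                pred = y_mean + Xte[:, keep] @ b
                press[a - 1, j] += np.sum((y[test_idx] - pred) ** 2)
                # keep the top half by |coefficient|, from training data only
                keep = keep[np.argsort(-np.abs(b))[: k // 2]]
    tss = np.sum((y - y.mean()) ** 2)
    return np.array(ks), 1.0 - press / tss

# Hypothetical data: 60 samples, 64 variables, only the first 3 informative
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 64))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=60)
ks, q2 = external_cv_q2(X, y)
for a, row in enumerate(q2, start=1):
    print(f"a={a}:", dict(zip(ks.tolist(), row.round(2))))
```

Plotting each row of `q2` against `np.log2(ks)` gives the Q2(a, k) vs. log2(k) curve named in the last step.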

39
Simulation of 2000 Random Data Sets (R. Simon, 2003)

20 samples x 6000 genes, with 10/10 class labels
Repeat 2000 times
Compute 3 different error rates
Re-substitution (wrong)
Cross validation after selection (wrong)
Cross validation before selection (correct)
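
A scaled-down sketch of this simulation follows. It is not the original study code. It assumes gene selection by two-sample t-statistic (top 10 genes), a nearest-centroid classifier, and leave-one-out cross-validation, and it runs 20 repetitions instead of 2000 to keep the runtime short.

```python
import numpy as np

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute two-sample t-statistic."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-np.abs(t))[:k]

def centroid_predict(Xtr, ytr, Xte):
    """Assign each test sample to the class with the nearest centroid."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def loocv_error(X, y, k, select_inside):
    """Leave-one-out error; genes are re-selected per fold if select_inside."""
    n = len(y)
    genes = None if select_inside else top_genes(X, y, k)
    wrong = 0
    for i in range(n):
        tr = np.arange(n) != i
        g = top_genes(X[tr], y[tr], k) if select_inside else genes
        pred = centroid_predict(X[tr][:, g], y[tr], X[i:i + 1, g])
        wrong += int(pred[0] != y[i])
    return wrong / n

rng = np.random.default_rng(4)
y = np.repeat([0, 1], 10)  # 10/10 class labels
n_rep, k = 20, 10          # 20 repetitions here; the slide reports 2000
errors = np.zeros((n_rep, 3))
for r in range(n_rep):
    X = rng.normal(size=(20, 6000))  # random data with no class signal
    g = top_genes(X, y, k)
    resub = np.mean(centroid_predict(X[:, g], y, X[:, g]) != y)
    errors[r] = [resub, loocv_error(X, y, k, False), loocv_error(X, y, k, True)]

mean_resub, mean_cv_after, mean_cv_inside = errors.mean(axis=0)
print(f"re-substitution:            {mean_resub:.2f}")
print(f"CV after selection:         {mean_cv_after:.2f}")
print(f"CV with selection per fold: {mean_cv_inside:.2f}")
```

Only the third estimate, where selection is redone inside every fold, should land near the chance-level error expected for data with no class signal.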

40
Results of 2000 Random Data
41
Permutation testing

Because of the high dimensionality of gene
expression data, it may be possible to achieve
relatively small error rates even for random
data.
To assess the significance of the classification
results, a permutation test is suggested.
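
A minimal sketch of such a permutation test: the class labels are shuffled many times, the whole cross-validated procedure is rerun for each shuffle, and the observed accuracy is compared with the resulting null distribution. The nearest-centroid classifier, the 200 permutations, and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n = len(y)
    hits = 0
    for i in range(n):
        tr = np.arange(n) != i
        c0 = X[tr & (y == 0)].mean(axis=0)
        c1 = X[tr & (y == 1)].mean(axis=0)
        pred = int(np.sum((X[i] - c1) ** 2) < np.sum((X[i] - c0) ** 2))
        hits += int(pred == y[i])
    return hits / n

def permutation_p_value(X, y, n_perm=200, seed=0):
    """P-value of the observed accuracy against label-permuted accuracies."""
    rng = np.random.default_rng(seed)
    observed = loocv_accuracy(X, y)
    null = np.array([loocv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    # add-one correction so the p-value is never exactly zero
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p

rng = np.random.default_rng(6)
y = np.repeat([0, 1], 10)
noise = rng.normal(size=(20, 30))   # no class signal
signal = noise.copy()
signal[:, :5] += 2.0 * y[:, None]   # 5 markers shifted in class 1

acc_noise, p_noise = permutation_p_value(noise, y)
acc_signal, p_signal = permutation_p_value(signal, y)
print(f"noise:  accuracy={acc_noise:.2f}, p={p_noise:.3f}")
print(f"signal: accuracy={acc_signal:.2f}, p={p_signal:.3f}")
```

If marker selection is part of the modeling, it belongs inside the function that is rerun for each permutation, for the same reason it belongs inside each cross-validation round.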

42
Challenge 5: Validation of the Prediction Model
(summary)

Correct way of doing cross-validation
All the steps of the prediction modeling should
be cross-validated.
Each cross validation step should start from
scratch
Is your prediction accuracy significant?
Random data can give you low prediction error
Permutation tests, bootstrap aggregation

43
Summary and Discussion

Recent technological advances present challenging
and interesting biological data at the molecular
level.
Statistics and multivariate analysis play an
important role in understanding and extracting
knowledge from these types of data.
Integrative analysis is even more challenging, and
we presented some solutions to these challenges.
There is plenty of room for improvement.

44
Acknowledgement

GlaxoSmithKline
High Throughput Biology
Biomedical Data Sciences
Genomics and Proteomics Science
Pathology, Cellular Biochemical Toxicology
Discovery IT

45
Data exploration: Present Challenges
"Data is an extremely valuable asset, but like a
cash crop, unless harvested, it is wasted."
Sid Adelman
