Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

About This Presentation
Description:

... LC/MS) Normalization (Gene Chip, NMR, LC/MS data) Why? ... 2004 NISS Metabolomics Workshop, 2005 Integrative Analysis of High Dimensional Gene Expression, ... – PowerPoint PPT presentation

Provided by: IBIStatis
Learn more at: https://www.niss.org
Transcript and Presenter's Notes


1
Integrative Analysis of High Dimensional Gene
Expression, Metabolite and Blood Chemistry Data
  • Kwan R. Lee, Ph.D.
  • and
  • Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
  • Biomedical Data Sciences
  • GlaxoSmithKline
  • kwan.lee_at_gsk.com

2
Overview
  • Systems Biology
  • Challenges for Statisticians
  • Possible solutions
  • Example of integrative data analysis
  • Summary and discussion

3
Of mice and men
4
Integrate knowledge and technologies.
Reduce attrition by running coordinated studies in
animal and man.
5
Focusing on one platform may miss an obvious
signal!!!
6
How can efficacy failures be attacked?
Few data to support analogy
Many data to support analogy
7
Systems Biology approach to drug discovery
8
Experimental Platforms: Non-omics and Omics, what
are they?
[Figure: data panels for LC-MS lipid, 1H NMR metabolites, non-omic markers, Affy transcriptome and LC-MS metabolites; treatment groups Veh, A, B, C, D shown for Normal and Disease animals]
9
Experimental Platforms: Non-omics and Omics, what
are they? (cont.)
  • Traditional Blood Chemistry (non-omics)
  • Gene Expression (transcriptomics)
  • Metabolite (metabonomics)
  • Lipid (lipomics)
  • Protein (proteomics)

10
Five Challenges
  1. Data Pre-processing
  2. High Dimensionality
  3. Multiple Testing for Marker Selection
  4. Data Integration
  5. Validation of the Prediction Model

11
Challenge 1: Data Pre-processing
  • Peak Alignment (NMR, LC/MS)
  • Normalization (Gene Chip, NMR, LC/MS data)
  • Why? Remove systematic bias in the data
  • Normalization within the platform makes data
    comparable across samples

12
Challenge 2: High Dimensionality (# of subjects
<< # of variables)
[Data matrix: rows are Animal 1 to Animal 100; columns are probe sets 1 to 22,000, lipids 1 to 2,000, metabolites 1 to 3,000, NMR buckets 1 to 500, and blood chemistry markers (Choles, Trig, ...)]
  • Blood Chemistry: 9 markers
  • Gene Expression: 22,000 probe sets
  • Lipid LC/MS: 2,000 peaks
  • Metabolite LC/MS: 3,000 peaks
  • NMR: 500 buckets
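To make the imbalance concrete, the short sketch below tallies the variable counts listed on this slide against the 100 animals. The dictionary keys are illustrative names chosen here, not identifiers from the presentation.

```python
# Variable counts per platform, as listed on the slide.
platform_sizes = {
    "blood_chemistry": 9,
    "gene_expression": 22_000,
    "lipid_lcms": 2_000,
    "metabolite_lcms": 3_000,
    "nmr": 500,
}
n_animals = 100

n_variables = sum(platform_sizes.values())
variables_per_animal = n_variables / n_animals

# A column-centered data matrix with n rows has rank at most n - 1,
# so at most n - 1 principal components can carry any variance.
max_rank = min(n_animals - 1, n_variables)

# Fraction of all variables contributed by each platform.
share = {name: size / n_variables for name, size in platform_sizes.items()}

print(n_variables, variables_per_animal, max_rank)
```

Gene expression alone contributes roughly four fifths of the columns, which anticipates the dominance problem raised under Challenge 4.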

13
Challenge 3: Multiple Testing in Variable
Selection
[Figure panels: No Adjustment for Multiple Testing; FWER Adjustment; FDR]
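The three panels can be contrasted on a small set of p-values. The sketch below is a minimal illustration, assuming Bonferroni as the FWER adjustment and the Benjamini-Hochberg step-up procedure for FDR; the slides do not say which procedures were actually used, and the p-values are made up for the example.

```python
def unadjusted(pvals, alpha=0.05):
    """Indices declared significant with no multiplicity adjustment."""
    return [i for i, p in enumerate(pvals) if p <= alpha]

def bonferroni(pvals, alpha=0.05):
    """FWER control: compare every p-value with alpha / m."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def benjamini_hochberg(pvals, q=0.05):
    """FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= q * k / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values for ten markers.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(unadjusted(example), bonferroni(example), benjamini_hochberg(example))
```

On this example the unadjusted rule keeps five markers, Benjamini-Hochberg keeps two and Bonferroni keeps one, which is the usual ordering from most liberal to most conservative.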
14
Challenge 4: Data integration
[Figure: data panels for LC-MS lipid, 1H NMR metabolites, non-omic markers, Affy transcriptome and LC-MS metabolites; treatment groups Veh, A, B, C, D shown for Normal and Disease animals]
15
Challenge 4: Data integration (cont.)
  • Integration Approach 1: Platform A (20000s var.) and Platform B (1000s var.) -> Combined Data
  • Integration Approach 2: Platform A (20000s var.) and Platform B (1000s var.) -> Dimension Reduction (e.g. variable selection) -> Platform A (1000s var.) and Platform B (100s var.) -> Combined Data
16
Challenge 4: Data integration, Example 1
  • Integration approach 1: Simple data integration
  • When the platform data are simply combined, the
    platform with the largest amount of data and
    variability will dominate the other platforms

17
PCA on Non-omics, Transcriptomics, and Combined
  • Non-omics (20)
  • Transcriptomics (12,488)
  • Mirror image!!!
  • Combined (12,508)
  • Transcriptomics data dominate Non-omics data!!!
18
PCA on Non-omics, Transcriptomics, and Combined
  • Non-omics (20)
  • Transcriptomics (20 PCs)
  • Like a mirror image!!!
  • Combined (40)
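The dominance effect on these two slides can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not the presenters' analysis: the block sizes (20 and 2,000 variables, 60 samples), the simulated group effect and latent factor, and all function names are hypothetical. It compares the first principal component of the large block with the first component of (a) the raw combination and (b) a combination in which the large block is first reduced to 20 PC scores.

```python
import numpy as np

def autoscale(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_scores(X, n_comp):
    """Scores of the first n_comp principal components, via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_comp] * S[:n_comp]

def simulate_platforms(n=60, p_small=20, p_big=2000, seed=0):
    """Small block driven by a group effect, big block by an unrelated factor."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0.0, 1.0], n // 2)
    factor = rng.normal(size=n)
    small = 2.0 * np.outer(group, np.ones(p_small)) + rng.normal(size=(n, p_small))
    big = np.outer(factor, rng.normal(size=p_big)) + rng.normal(size=(n, p_big))
    return autoscale(small), autoscale(big)

def abs_corr(a, b):
    """Absolute Pearson correlation (the sign of a PC is arbitrary)."""
    return abs(float(np.corrcoef(a, b)[0, 1]))

small, big = simulate_platforms()
pc1_big = pca_scores(big, 1)[:, 0]
pc1_raw = pca_scores(np.hstack([small, big]), 1)[:, 0]
pc1_reduced = pca_scores(np.hstack([small, pca_scores(big, 20)]), 1)[:, 0]

corr_raw = abs_corr(pc1_big, pc1_raw)
corr_reduced = abs_corr(pc1_big, pc1_reduced)
print(corr_raw, corr_reduced)
```

In this toy setup both correlations come out close to 1: unscaled PC scores keep the variance of the original block, so compressing the large platform to 20 components does not by itself stop it from dominating the combined PCA.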
19
Challenge 4: Data integration, Example 2
  • Integration approach 2: Integrate on selected
    markers
  • 9 blood chemistry markers, 2000 probe sets, 150
    metabolites
  • There are still platforms with more selected
    markers
  • How to weight different platforms appropriately?
    E.g. 9 blood chemistry markers are known to relate
    to disease or drug
  • Identify relationships among the probe sets,
    metabolites, along with the blood chemistry
    markers in terms of biological pathways
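One common answer to the weighting question is block scaling: autoscale every variable, then divide each platform by the square root of its number of variables so that every platform contributes the same total variance. The sketch below illustrates that option only; the slides do not say which weighting, if any, was applied, and the sample size of 67 and the random data are placeholders.

```python
import numpy as np

def block_scale(blocks):
    """Autoscale each block, then divide by sqrt(#variables) so that
    every block has the same total variance (1.0)."""
    scaled = []
    for X in blocks:
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        scaled.append(Z / np.sqrt(X.shape[1]))
    return scaled

rng = np.random.default_rng(1)
n_samples = 67  # placeholder sample size
# Block widths from the slide: blood chemistry, probe sets, metabolites.
blocks = [rng.normal(size=(n_samples, p)) for p in (9, 2000, 150)]

weighted = block_scale(blocks)
total_variance = [float(Z.var(axis=0, ddof=1).sum()) for Z in weighted]
combined = np.hstack(weighted)
print(total_variance, combined.shape)
```

After this scaling the 9 blood chemistry markers carry as much total variance as the 2000 probe sets, so prior knowledge about a platform can then be expressed as an explicit multiplier rather than as an accident of its size.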

20
Principal Component Analysis (PCA): projection
of 67 animals, 28 normal (black) and 39 disease
(red) (9 NO, 1991 TA, 115 MT). All markers used
for projection.
Legend: Normal, Disease
21
Loading Plot
22
Partial Least Squares Discriminant Analysis
(PLS-DA): disease group only
Legend: Vehicle, Drug
23
PLS-DA: corresponding projection of all markers
(9 NO, 1991 TA, 115 MT). Which are important
drug markers?
Legend: Veh, Drug
24
Ranked drug markers by importance or by
coefficients
  • Marker importance by variable importance on
    projection (VIP)
  • Up or down regulation by coefficients
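The slide does not give the formula behind the importance ranking. A commonly cited definition of variable importance on projection for a PLS model with A components and p variables is shown below; the symbols are notation assumed here, not taken from the presentation.

```latex
\mathrm{VIP}_j
  = \sqrt{\, p \,
      \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}
           {\sum_{a=1}^{A} \mathrm{SSY}_a} }
\qquad\text{with}\qquad
\frac{1}{p} \sum_{j=1}^{p} \mathrm{VIP}_j^{2} = 1
```

Here w_{ja} is the weight of variable j on component a and SSY_a is the sum of squares of the response explained by component a. Because the squared values average to 1, a VIP above 1 is often read as above-average importance, while the sign of the regression coefficient of variable j indicates the direction (up or down regulation).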
25
Validation of the model: R2, Q2 and permutation
tests, 100 times (P < 0.01)
26
Variation explained by each platform: PLS-DA for
prediction of 2 experimental groups
Two groups: HFD, vehicle; HFD, drug treated
The above table is based on a 2-component model.
If the 4th model uses more components, 91% of the
variation in the data can be explained by 4
components.
Q2(Y): amount of variation among the 2 groups
explained by the model (cross-validated)
27
Challenge 5: Validation of the Prediction Model
  • Correct way of doing cross-validation
  • Especially when the variables are selected
  • Is your prediction accuracy significant?

28
Random Noise Data
  • Simulate 20,000 marker columns of random noise
    for 100 patients and one additional column
    containing arbitrary labels of class indicators.
  • Select 5 marker columns showing most correlation
    with class label.
  • Make a prediction model for class indicators
    based on these 5 selected markers.
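The three steps above can be sketched as follows. This is a minimal illustration, assuming standard normal noise, a balanced arbitrary class label and an ordinary least-squares fit as the prediction model; the presentation itself used PLS-DA in SIMCA, and the function name and defaults are hypothetical.

```python
import numpy as np

def simulate_noise_selection(n_patients=100, n_markers=20_000, n_select=5, seed=0):
    """Select the markers most correlated with an arbitrary label in pure
    noise, then score a model on the same data used for the selection."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_patients, n_markers))  # pure noise markers
    # Arbitrary, balanced class labels (assumes an even n_patients).
    y = rng.permutation(np.repeat([0.0, 1.0], n_patients // 2))

    # Pearson correlation of every marker column with the label.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    top = np.argsort(-np.abs(r))[:n_select]

    # Least-squares fit on the selected markers, evaluated on the SAME data.
    A = np.column_stack([np.ones(n_patients), X[:, top]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    accuracy = float(np.mean((A @ beta > 0.5) == (y == 1.0)))
    return r[top], accuracy

top_correlations, apparent_accuracy = simulate_noise_selection()
print(np.round(top_correlations, 2), apparent_accuracy)
```

Although every marker is noise, the five selected columns show sizeable correlations with the label and the apparent accuracy is well above 50%, which is the trap the following slides examine.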

29
PCA of Full Markers
30
PLS-DA on Random Noise Data
  • Running a full model in SIMCA does not yield a
    model: no significant Q2.
  • The multivariate approach is conservative.
  • Q2 measures prediction performance.
  • But we forced the software to fit a 6-component
    model by PLS-DA
  • (R2 = 1.0, Q2 = 0.225)

31
Full marker model: PLS-DA
32
Was it real or by chance?
33
Select 5 Markers
  • Selected top 5 markers using VIP from the
    over-fitted model and fit PLS-DA again on the
    same data.
  • Now we have (R2 = 0.459, Q2 = 0.348)

34
Good prediction from PLS-DA? Q2 = 0.35
35
Validated by permutation test? Significance of Q2
36
Selection Bias
  • When a prediction model is tested on the same
    data that were used in the first instance to
    select the markers, selection bias makes the test
    error overly optimistic.
  • Many publications claimed a small set of selected
    genes is highly predictive.
  • IBI practice is to use a data set to select
    markers and use the same data set to fit a
    prediction model based on selected markers.

37
How to correct for selection bias?
  • External validation should be undertaken
    subsequent to the feature selection process.
  • Independent test data set (hold-out data set)
    that was never used for feature selection.
  • External cross-validation (ECV).
  • Cross validation of the prediction model is
    external to the selection process.
  • In other words, make a new selection for each
    cross validation round.

38
Externally Validated PLS: Model and variable
selection
  • Divide the data set randomly into d parts.
  • Set ecv = 1 (this means hold-out one part and
    use d-1 parts for modeling)
  • Set a = 1 (the number of components, do until
    10)
  • Set k = total number of variables
  • Loop
  • Fit PLS model with given a and k, PLS(a, k)
  • Predict hold-out set, compute PRESS(ecv, a, k)
    and save
  • Choose top half of the variables by appropriate
    statistics (coeff, vip, t-ratio etc)
  • Set k = k/2
  • Go back to Loop until k = 2
  • Set a = a + 1
  • Go back to Loop until a = 10
  • Set ecv = ecv + 1
  • Go back to Loop until ecv = d
  • Compute PRESS(a, k) = Sum over ecv of PRESS(ecv,
    a, k)
  • Compute Q2(a, k) = 1 - PRESS(a, k)/TSS
  • Plot Q2(a, k) vs. log2(k)
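The procedure above can be sketched in a few dozen lines. This is an illustrative implementation under assumptions, not the presenters' code: it uses a single-response PLS (NIPALS) written in NumPy as a stand-in for the PLS software, ranks variables by absolute regression coefficient, and uses hypothetical names and defaults (5 folds, up to 3 components rather than 10).

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS via NIPALS; returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        norm = np.linalg.norm(w)
        if norm < 1e-12:          # nothing left to explain
            break
        w = w / norm
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        c = (yr @ t) / tt
        Xr = Xr - np.outer(t, p)  # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def external_cv_q2(X, y, n_folds=5, max_comp=3, seed=0):
    """Q2(a, k) with variable selection redone inside every CV round."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    tss = float(((y - y.mean()) ** 2).sum())
    press = {}
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        for a in range(1, max_comp + 1):
            keep = np.arange(X.shape[1])       # restart from all variables
            while True:
                k = len(keep)
                n_comp = min(a, k, len(train_idx) - 1)
                xm, ym, coef = pls1_fit(X[np.ix_(train_idx, keep)], y[train_idx], n_comp)
                pred = (X[np.ix_(test_idx, keep)] - xm) @ coef + ym
                err = float(((y[test_idx] - pred) ** 2).sum())
                press[(a, k)] = press.get((a, k), 0.0) + err
                if k <= 2:
                    break
                # Keep the top half, ranked on the training part only.
                keep = keep[np.argsort(-np.abs(coef))[: k // 2]]
    return {key: 1.0 - value / tss for key, value in press.items()}
```

A call such as `external_cv_q2(X, y)` returns a dictionary keyed by `(a, k)`; plotting its values against `log2(k)` for each `a` gives the curve described in the last bullet. The hold-out part never influences which variables are kept, which is the point of external cross-validation.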

39
Simulation of 2000 Random Data Sets (R. Simon, 2003)
  • 20 x 6000 data and a 10/10 split for class labels
  • Repeat 2000 times
  • Compute 3 different error rates
  • Re-substitution (wrong)
  • Cross validation after selection (wrong)
  • Cross validation before selection (correct)
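The three error rates can be compared on simulated noise. The sketch below is a scaled-down illustration under assumptions: it uses a nearest-centroid classifier, selection of the 10 variables with the largest t-like statistic, leave-one-out cross-validation and far fewer repetitions than 2000. None of these choices is stated on the slide, and the function names are hypothetical.

```python
import numpy as np

def select_top(X, y, n_keep):
    """Indices of the n_keep columns with the largest absolute t-like statistic."""
    a, b = X[y == 0], X[y == 1]
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2.0)
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
    return np.argsort(-score)[:n_keep]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def loocv_error(X, y, n_keep, select_inside):
    """Leave-one-out error; selection is redone per fold only if select_inside."""
    n = len(y)
    fixed = None if select_inside else select_top(X, y, n_keep)
    wrong = 0
    for i in range(n):
        train = np.arange(n) != i
        cols = select_top(X[train], y[train], n_keep) if select_inside else fixed
        pred = nearest_centroid_predict(X[train][:, cols], y[train], X[i:i + 1, cols])
        wrong += int(pred[0] != y[i])
    return wrong / n

def simulate(n_reps=50, n=20, p=6000, n_keep=10, seed=0):
    """Mean (re-substitution, CV after selection, CV with selection inside) error."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)
    rates = np.zeros(3)
    for _ in range(n_reps):
        X = rng.normal(size=(n, p))  # no real class difference
        cols = select_top(X, y, n_keep)
        resub = np.mean(nearest_centroid_predict(X[:, cols], y, X[:, cols]) != y)
        rates += [resub, loocv_error(X, y, n_keep, False), loocv_error(X, y, n_keep, True)]
    return rates / n_reps

if __name__ == "__main__":
    print(simulate(n_reps=10))
```

Since the data carry no signal, an honest error estimate should sit near 50%. In this sketch only the third rate, where selection is repeated inside every cross-validation round, behaves that way; the first two are far too optimistic.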

40
Results of 2000 Random Data
41
Permutation testing
  • Because of the high dimensionality of gene
    expression data, it may be possible to achieve
    relatively small error rates even for random
    data.
  • To assess the significance of the classification
    results, a permutation test is suggested.
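A permutation test of this kind can be sketched generically: recompute the statistic after shuffling the class labels and count how often the shuffled value is at least as large as the observed one. In the sketch below the statistic is a simple difference in group means standing in for a cross-validated Q2 or error rate; the function names, the toy data and the default of 100 permutations are assumptions for illustration.

```python
import random

def mean_difference(values, labels):
    """Absolute difference between the two group means (stand-in statistic)."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_p_value(values, labels, statistic=mean_difference, n_perm=100, seed=0):
    """P-value as (hits + 1) / (n_perm + 1), where a hit is a permuted
    statistic at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: two perfectly separated groups of 15.
values = list(range(30))
labels = [0] * 15 + [1] * 15
print(permutation_p_value(values, labels))
```

With 100 permutations the smallest attainable p-value is 1/101, just under 0.01, so a claim of P < 0.01 from 100 permutations means that no permuted data set matched the observed statistic. For a prediction model, the whole pipeline including variable selection has to be rerun on each permuted label vector.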

42
Challenge 5: Validation of the Prediction Model
(summary)
  • Correct way of doing cross-validation
  • All the steps of the prediction modeling should
    be cross-validated.
  • Each cross validation step should start from
    scratch
  • Is your prediction accuracy significant?
  • Random data can give you low prediction error
  • Permutation tests, bootstrap aggregation

43
Summary and Discussion
  • Recent technological advances present challenging
    and interesting biological data at the molecular
    level.
  • Statistics and multivariate analysis play an
    important role in understanding and extracting
    knowledge from these types of data.
  • Integrative analysis is even more challenging and
    we presented some solutions to these challenges.
    There is plenty of room for improvement.

44
Acknowledgement
  • GlaxoSmithKline
  • High Throughput Biology
  • Biomedical Data Sciences
  • Genomics and Proteomics Science
  • Pathology, Cellular Biochemical Toxicology
  • Discovery IT

45
Data exploration: Present Challenges
Data is an extremely valuable asset, but like a
cash crop, unless harvested, it is wasted. -
Sid Adelman