About This Presentation

Slides: 27

Provided by: alexander2

Learn more at: http://www.statnikov.org

1
Effects of Environment, Genetics and Data
Analysis in an Esophageal Cancer Genome-Wide
Association Study

Alexander Statnikov
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University
10/3/2007

2
Project history

Joint project with Chun Li and Constantin
Aliferis
Cancer Research 2005 paper by Hu et al.:
"Genome-Wide Association Study in Esophageal
Cancer Using GeneChip Mapping 10K Array"
Reported near-perfect classification of cancer
patients vs. healthy controls on the basis of
SNP data alone from a case-control GWA study.
This finding suggests that esophageal cancer is a
solely genetic disease
Initial idea of Chun Li
We happened to have the GWA dataset used in that
study, so we decided to re-analyze it

3
Background

SNPs make up >90% of all human genetic variation
and have been extensively studied for functional
relationships between phenotype and genotype.
Modern high-throughput genotyping technologies
allow fast evaluation of SNPs on a genome-wide
scale at a relatively low cost.
During the last 2 years, several studies have
reported success in using SNP genotyping assays
in GWA studies in cancer. The strongest result is
reported in the study by Hu et al.

4
Re-analysis of SNP data to assess the claims of
Hu et al.
5
Study dataset and its preparation

Study dataset
50 esophageal squamous cell carcinoma patients
50 healthy controls (matched by age, sex, place
of residence)
10k Affymetrix SNP arrays with 11,555 SNPs
Additional variables
Age
Tobacco use
Alcohol consumption
Family history
Consumption of pickled vegetables
Removed 1.5k SNPs to minimize genotyping errors
Implemented recessive A encoding
Imputed missing genotypes

6
SNP selection: Original method of Hu et al.

(denote as GLM1)
Fit a GLM model using data for all 100 subjects
$\Pr(\text{Cancer}) = 1 / (1 + \exp(-f))$, where
$f = a + b \cdot \text{SNP} + c \cdot \text{family history} + d \cdot \text{alcohol consumption}$
Obtain deviances
$D_1$: deviance of the above fitted model
$D_0$: deviance of the null model (without
predictor variables)
From the $\chi^2$ distribution, compute a p-value for the
test statistic $D_0 - D_1$ with 3 degrees of freedom
Perform Bonferroni correction at 0.05 alpha level

7
SNP selection: Unbiased GLM-based method

(denote as GLM2)
Fit a GLM model using data for all 100 subjects
$\Pr(\text{Cancer}) = 1 / (1 + \exp(-f))$, where
$f = a + b \cdot \text{SNP} + c \cdot \text{family history} + d \cdot \text{alcohol consumption}$
Obtain deviances
$D_1$: deviance of the above fitted model
$D_0'$: deviance of the model with family history
and alcohol consumption
From the $\chi^2$ distribution, compute a p-value for the
test statistic $D_0' - D_1$ with 1 degree of freedom
Perform Bonferroni correction at 0.05 alpha level
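
The difference between the two tests can be sketched from the deviances alone. The block below is a minimal illustration, not the analysis code of either group: the three deviance values and the count of 10,000 tested SNPs are made-up assumptions, and the chi-square tail is written out in closed form only for the two degrees of freedom used here.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms for df = 1 and 3)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are needed here")

def glm1_p_value(dev_null, dev_full):
    # GLM1: full model vs. null model, 3 degrees of freedom
    return chi2_sf(dev_null - dev_full, df=3)

def glm2_p_value(dev_covariates, dev_full):
    # GLM2: full model vs. covariates-only model, 1 degree of freedom
    return chi2_sf(dev_covariates - dev_full, df=1)

# Hypothetical deviances for one SNP (illustrative numbers, not from the study)
dev_null = 138.6        # null model: intercept only
dev_covariates = 100.0  # family history + alcohol consumption
dev_full = 99.5         # SNP + family history + alcohol consumption

n_snps = 10_000              # assumed number of tested SNPs
threshold = 0.05 / n_snps    # Bonferroni-adjusted alpha level

p1 = glm1_p_value(dev_null, dev_full)
p2 = glm2_p_value(dev_covariates, dev_full)
print(f"GLM1 p = {p1:.2e}, significant: {p1 < threshold}")
print(f"GLM2 p = {p2:.2e}, significant: {p2 < threshold}")
```

In this made-up case the SNP lowers the deviance by only 0.5, yet GLM1 flags it because the covariates alone account for most of the drop from the null model.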

8
Classification: Original method of Hu et al.

Perform principal component analysis (PCA) on
selected SNPs using all 100 subjects in the
dataset.
Extract the first principal component (PC1).
Use the following rule to classify each of the
same 100 subjects as used for the PCA:
If PC1 > 0, classify as control, otherwise
classify as case
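
A minimal sketch of this rule, assuming 0/1 genotype codes and computing PC1 by power iteration on the covariance matrix. The function names and the six-subject toy matrix are hypothetical. Note that the sign of a principal component is arbitrary, so the "PC1 > 0 means control" direction only works when the component happens to be oriented that way, as in this toy data.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return the PC1 score of every row (power iteration on the covariance matrix)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p          # starting direction
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]

def classify_by_pc1(rows):
    # Rule from the slide: PC1 > 0 -> control, otherwise case.
    # The PCA and the classification use the SAME subjects (no held-out data).
    return ["control" if s > 0 else "case" for s in first_principal_component(rows)]

# Toy genotype matrix: 3 controls followed by 3 cases, 3 selected SNPs
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 1],
       [0, 0, 0], [0, 1, 0], [0, 0, 0]]
predictions = classify_by_pc1(toy)
print(predictions)
```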

9
Evaluation of classification performance

Hu et al. used the proportion of correct
classifications; their classifier was trained and
tested on the same dataset
We employ the area under the ROC curve (AUC) as
the performance metric and a repeated 10-fold
cross-validation scheme
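
Both ingredients are easy to sketch in plain Python. The helpers below are illustrative assumptions: the slides do not say how many repeats were used or whether folds were stratified, so `repeats=3` and the unstratified split are placeholders.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity; ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_kfold(n_subjects, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` random k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_subjects))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

splits = list(repeated_kfold(100, k=10, repeats=3))
print(len(splits), "train/test splits")
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

Everything that learns from the data (SNP selection as well as the classifier) has to be fitted on `train` only, and `test` is used solely for scoring.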

[Diagram: SNP dataset (100 subjects)]
10
Reproducing findings of Hu et al.

Using the GLM1 method, Hu et al. reported 37
significant SNPs; we found 226!
Apparently, they used an extra filtering step
that was not reported in the paper (personal
comm. with their PI).
Nevertheless, application of the
PCA-based classifier (as in Hu et al.)
to the GLM1-significant SNPs resulted
in a 0.93 proportion of correct
classifications and 0.98 AUC.
⇒ Major findings are reproduced

11
Bias in SNP selection method GLM1

Calculation of p-values in GLM1 does not reflect
the significance of the SNP, but the significance
of 3 variables combined (SNP, family history, and
alcohol consumption)
Family history and alcohol consumption are strong
risk factors ⇒ the p-value is biased towards 0.
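
One way to see the bias is to split the GLM1 test statistic into two parts. Here $D_0$ denotes the deviance of the null model, $D_0'$ the deviance of the model with family history and alcohol consumption only, and $D_1$ the deviance of the model that also includes the SNP; the decomposition is an algebraic identity, and the interpretation is a sketch of the argument on this slide.

```latex
D_0 - D_1
  = \underbrace{(D_0 - D_0')}_{\text{covariates only; the same for every SNP}}
  + \underbrace{(D_0' - D_1)}_{\text{GLM2 statistic; SNP contribution}}
```

Because the first term is large whenever the covariates are strong risk factors, $D_0 - D_1$ can exceed the 3-degree-of-freedom critical value even when the SNP term is close to zero.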

12
Bias in SNP selection method GLM1

The distribution of SNP p-values for method GLM1
is not uniform: most p-values are $< 10^{-3}$

On the contrary, GLM2 reflects significance of
SNPs and does not suffer from the above bias
Its distribution of SNP p-values is uniform
It returns no SNPs significant at the Bonferroni
adjusted alpha-level

[Figure label: Bonferroni-adjusted α-level]
13
Empirical demonstration of bias in SNP selection
method

Main idea: Create a null distribution where SNPs
are completely unrelated to the response variable
and see how frequently methods GLM1 and GLM2 find
statistically significant SNPs.
Permute all subjects in the SNP data while
leaving the response variable, family history of
esophageal cancer, and alcohol consumption
intact.
Apply GLM1 and GLM2 to the permuted SNP data.

Repeat 1,000 times
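
The permutation loop described on this slide can be sketched as follows. `select_snps` is a hypothetical callable standing in for GLM1 or GLM2 (it takes SNP rows, labels, and covariates and returns the indices of significant SNPs); the toy selector and the four-subject dataset exist only to make the sketch runnable.

```python
import random

def permute_snp_rows(snp_rows, rng):
    """Return the SNP rows in random subject order (labels and covariates stay put)."""
    shuffled = list(snp_rows)
    rng.shuffle(shuffled)
    return shuffled

def permutation_experiment(snp_rows, labels, covariates, select_snps,
                           n_perm=1000, seed=0):
    """Count the significant SNPs reported by `select_snps` in each permuted dataset."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        permuted = permute_snp_rows(snp_rows, rng)
        counts.append(len(select_snps(permuted, labels, covariates)))
    return counts

# Toy usage with a hypothetical selector that flags SNPs matching the labels exactly
def toy_selector(rows, labels, covariates):
    return [j for j in range(len(rows[0]))
            if all(row[j] == y for row, y in zip(rows, labels))]

snps = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
counts = permutation_experiment(snps, labels, covariates=None,
                                select_snps=toy_selector, n_perm=100)
print(sum(c > 0 for c in counts), "of", len(counts),
      "permutations had a 'significant' SNP")
```

Shuffling whole rows keeps the correlation structure among SNPs intact while breaking any link between genotype and the response, family history, or alcohol consumption.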
14
Results of permutation experiments

GLM1 found significant SNPs in all 1,000
permutations! The number of significant SNPs
found in a permuted dataset ranges from 185 to
1,938 (357 on average).
GLM2 found significant SNPs in only 48/1,000
permutations. The number of significant SNPs
found in a permuted dataset ranges from 1 to 3.
⇒ GLM1 is biased, while GLM2 is not.

15
Bias in the classification performance estimate
of Hu et al.

All data-analysis methods of Hu et al. use data
for all subjects. Neither cross-validation nor
independent sample validation was performed.
We repeated their data analysis (GLM1+PCA)
embedded in the repeated 10-fold cross-validation
design. The resulting performance is only 0.68
AUC (versus 0.98 AUC).
⇒ 0.30 AUC bias (overestimation) in the reported
results

16
Empirical demonstration of performance estimation
bias

Main idea: Create a null distribution where SNPs
are completely unrelated to the response variable
(i.e. AUC = 0.5), apply the GLM1+PCA methodology and
record resulting performance estimates.
Permute all subjects in the SNP data while
leaving the response variable, family history of
esophageal cancer, and alcohol consumption
intact.
Apply GLM1 to the permuted SNP data.
Build and apply classifier using PCA.
Estimate classification performance (AUC).

Repeat 1,000 times
17
Results of permutation experiments

Classification performance of GLM1+PCA, both
methods applied as in Hu et al. to all data (no
cross-validation): 0.99 AUC
Classification performance of GLM1+PCA, GLM1
applied to all data, PCA applied by
cross-validation (incomplete cross-validation):
0.98 AUC
Classification performance of GLM1+PCA applied by
cross-validation: 0.50 AUC
⇒ 0.48-0.49 AUC bias (overestimation) under the
null
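
The gap between incomplete and complete cross-validation can be reproduced on purely random data. The sketch below is not the GLM1+PCA pipeline: as simplifying assumptions it selects SNPs by the case/control difference in mean genotype and scores subjects with a signed sum, on a made-up null dataset of 40 subjects and 300 random SNPs.

```python
import random

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diffs(X, y, idx):
    """Per-SNP difference in mean genotype (cases minus controls) over subjects in idx."""
    cases = [i for i in idx if y[i] == 1]
    controls = [i for i in idx if y[i] == 0]
    return [sum(X[i][j] for i in cases) / len(cases)
            - sum(X[i][j] for i in controls) / len(controls)
            for j in range(len(X[0]))]

def top_k(diffs, k):
    return sorted(range(len(diffs)), key=lambda j: -abs(diffs[j]))[:k]

def cv_auc(X, y, k_snps, n_folds, select_inside_cv, rng):
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    all_selected = top_k(mean_diffs(X, y, idx), k_snps)  # sees the test subjects too
    scores, labels = [], []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        diffs = mean_diffs(X, y, train)
        # Complete CV selects SNPs on the training fold; incomplete CV reuses
        # the SNPs selected on all subjects.
        selected = top_k(diffs, k_snps) if select_inside_cv else all_selected
        for i in test:
            scores.append(sum(X[i][j] * (1 if diffs[j] > 0 else -1) for j in selected))
            labels.append(y[i])
    return auc(scores, labels)

data_rng = random.Random(0)
X = [[data_rng.randint(0, 1) for _ in range(300)] for _ in range(40)]  # null SNPs
y = [1] * 20 + [0] * 20
incomplete = cv_auc(X, y, 10, 10, select_inside_cv=False, rng=random.Random(1))
complete = cv_auc(X, y, 10, 10, select_inside_cv=True, rng=random.Random(1))
print(f"incomplete CV AUC = {incomplete:.2f}, complete CV AUC = {complete:.2f}")
```

Since the SNPs carry no signal by construction, any AUC well above 0.5 in the incomplete variant comes from the selection step having seen the test subjects.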

18
Additional analysis of SNP data to assess the
effects of genetics and environment.
19
Classification: Support Vector Machines (SVMs)

Supervised baseline technique for many types of
high-throughput data (microarray, proteomics,
etc.).
Trained and applied by cross-validation

20
SNP selection for fitting SVMs: Recursive
Feature Elimination

Among the best performing techniques for the
analysis of microarray gene expression data
Applied only to a training set during
cross-validation

[Diagram: recursive feature elimination. An SVM model is fit to 10,000 SNPs and a performance estimate is recorded; the 5,000 SNPs important for classification are kept and the 5,000 not important for classification are discarded. A second SVM model is fit to the 5,000 kept SNPs, giving another performance estimate; 2,500 SNPs are kept and 2,500 are discarded.]
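
The halving scheme in the diagram can be sketched generically. `rank_features` and `evaluate` are hypothetical callables: in the setting of this slide they would be the absolute weights of a linear SVM fitted on the training fold and its performance estimate, but here they are toy stand-ins so the loop runs on its own.

```python
def recursive_feature_elimination(features, rank_features, evaluate, min_features=1):
    """Repeatedly drop the less important half of the features.

    rank_features(features) -> one weight per feature (larger = more important)
    evaluate(features)      -> performance estimate of a model fit on those features
    Returns a list of (feature_subset, performance) pairs, largest subset first.
    """
    history = []
    current = list(features)
    while True:
        history.append((current, evaluate(current)))
        n_keep = len(current) // 2
        if n_keep < min_features:
            break
        weights = rank_features(current)
        order = sorted(range(len(current)), key=lambda i: -weights[i])
        keep = sorted(order[:n_keep])          # positions of the retained features
        current = [current[i] for i in keep]
    return history

# Toy stand-ins: lower SNP index = more important; constant performance estimate
history = recursive_feature_elimination(
    features=list(range(10_000)),
    rank_features=lambda feats: [-f for f in feats],
    evaluate=lambda feats: 0.5,
    min_features=2_500,
)
sizes = [len(subset) for subset, _ in history]
print(sizes)
```

To keep the performance estimates unbiased, the whole loop has to run inside each training fold, as the slide notes.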
21
Classification results: repeated 10-fold
cross-validation estimates
denotes building of classifier by ensembling
22
Feedback from Hu et al. and publication history.
23
Feedback from the authors

Concerning bias in SNP selection:
"If we use p-values to rank the SNPs, the two
methods GLM1 and GLM2 will give the same
order."
Concerning bias in estimation of classifier
performance:
"It was not our purpose to develop a classifier
in this initial pilot effort."
"we made these calculations as a frame of
reference only."
The authors presented results of their
cross-validation effort.

24
Feedback from the authors

SNPs were selected by GLM1 on all 100 subjects
and the classifier was trained and tested by
cross-validation (2/3 of data is used for
training and 1/3 of data is used for testing).
This cross-validation procedure was repeated 1000
times with different splits into training and
testing set.

[Figure: proportion of correct classifications]
These results are expected because the SNP
selection procedure utilizes both training and
testing data. This is incomplete
cross-validation and is shown to cause biased
performance estimation of the classifier.
25
Going forward with the publication
Original article
Authors of the original article

No useful feedback
Do not recognize their mistakes

26
Conclusions

Data-analysis pitfalls in Hu et al. led the
researchers to (1) report SNPs that are not
statistically significant and (2) derive biased
estimates of classification performance.
Environmental factors and family history have
modest association with the disease, while SNPs
do not appear to be associated.
It is crucially important to have sound
statistical analysis in genome-wide association
studies.
Publishing rebuttals is challenging!