Title: Effects of Environment, Genetics and Data Analysis in an Esophageal Cancer Genome-Wide Association Study
1. Effects of Environment, Genetics and Data Analysis in an Esophageal Cancer Genome-Wide Association Study
- Alexander Statnikov
- Discovery Systems Laboratory
- Department of Biomedical Informatics
- Vanderbilt University
- 10/3/2007
2. Project history
- Joint project with Chun Li and Constantin Aliferis
- Cancer Research 2005 paper by Hu et al.: "Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array"
- Reported near-perfect classification of cancer patients and healthy controls on the basis of only SNP data from a case-control GWA study.
- This finding suggests that esophageal cancer is a solely genetic disease
- Initial idea of Chun Li
- We happened to have the GWA dataset used in that study, so we decided to re-analyze it
3. Background
- SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype and genotype.
- Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at a relatively low cost.
- During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. The strongest result is reported in the study by Hu et al.
4. Re-analysis of SNP data to assess the claims of Hu et al.
5. Study dataset and its preparation
- Study dataset
  - 50 esophageal squamous cell carcinoma patients
  - 50 healthy controls (matched by age, sex, place of residence)
  - 10k Affymetrix SNP arrays with 11,555 SNPs
- Additional variables
  - Age
  - Tobacco use
  - Alcohol consumption
  - Family history
  - Consumption of pickled vegetables
- Removed 1.5k SNPs to minimize genotyping errors
- Implemented recessive A encoding
- Imputed missing genotypes
6. SNP selection: Original method of Hu et al.
- (denoted as GLM1)
- Fit a GLM model using data for all 100 subjects
  - Probability(Cancer) = 1 / (1 + exp(-f)), where
  - f = a + b·SNP + c·family history + d·alcohol consumption
- Obtain deviances
  - D1: deviance of the above fitted model
  - D0: deviance of the null model (without predictor variables)
- From the χ² distribution, compute a p-value for the test statistic D0 - D1 with 3 degrees of freedom
- Perform Bonferroni correction at the 0.05 alpha level
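The p-value and Bonferroni steps of GLM1 can be sketched as follows. This is a minimal illustration, not the code of Hu et al.: the deviance values and the number of tested SNPs (`n_snps`) are made-up placeholders, and the χ² tail probability is written in closed form for 3 degrees of freedom only.

```python
import math

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def glm1_is_significant(d0, d1, n_snps, alpha=0.05):
    """GLM1 decision for one SNP: test D0 - D1 on 3 df with a Bonferroni threshold."""
    p_value = chi2_sf_3df(d0 - d1)
    return p_value, p_value < alpha / n_snps

# Hypothetical deviances for a single SNP (illustrative numbers only).
p, significant = glm1_is_significant(d0=138.6, d1=100.0, n_snps=10_000)
print(f"p = {p:.2e}, significant after Bonferroni: {significant}")
```

Note that D0 - D1 measures the joint contribution of all three predictors, so a large value does not by itself say anything about the SNP.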
7. SNP selection: Unbiased GLM-based method
- (denoted as GLM2)
- Fit a GLM model using data for all 100 subjects
  - Probability(Cancer) = 1 / (1 + exp(-f)), where
  - f = a + b·SNP + c·family history + d·alcohol consumption
- Obtain deviances
  - D1: deviance of the above fitted model
  - D0': deviance of the model with family history and alcohol consumption
- From the χ² distribution, compute a p-value for the test statistic D0' - D1 with 1 degree of freedom
- Perform Bonferroni correction at the 0.05 alpha level
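The difference between the two tests can be sketched on toy data. Everything below is an assumption made for illustration: the data are synthetic (not the study data), the logistic model is fit with a plain Newton iteration rather than a statistics package, and the SNP is constructed to be perfectly balanced within every stratum so that it carries no information about the outcome.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def deviance(columns, y, iters=25):
    """Fit logistic regression (intercept + columns) by Newton's method; return its deviance."""
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    k = len(rows[0])
    beta = [0.0] * k

    def probs():
        return [1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in rows]

    for _ in range(iters):
        p = probs()
        grad = [sum(r[j] * (yi - pi) for r, yi, pi in zip(rows, y, p)) for j in range(k)]
        hess = [[sum(r[j] * r[l] * pi * (1 - pi) for r, pi in zip(rows, p))
                 for l in range(k)] for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    p = probs()
    return -2 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p))

def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 3 only)."""
    x = max(x, 0.0)
    tail = math.erfc(math.sqrt(x / 2))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1 or 3 are implemented in this sketch")

# Toy data: 50 cases followed by 50 controls.
y = [1] * 50 + [0] * 50
family = [1 if (i < 40 or i >= 90) else 0 for i in range(100)]  # strong risk factor
alcohol = [(i // 2) % 2 for i in range(100)]
snp = [i % 2 for i in range(100)]  # balanced in every stratum: no signal

d0 = deviance([], y)                      # null model
d0_cov = deviance([family, alcohol], y)   # covariates only (D0')
d1 = deviance([snp, family, alcohol], y)  # full model

p_glm1 = chi2_sf(d0 - d1, 3)      # tests SNP + covariates jointly
p_glm2 = chi2_sf(d0_cov - d1, 1)  # tests the SNP alone
print(f"GLM1 p = {p_glm1:.2e}, GLM2 p = {p_glm2:.3f}")
```

In this construction the uninformative SNP receives a very small GLM1 p-value purely because family history is predictive, while its GLM2 p-value stays close to 1.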
8. Classification: Original method of Hu et al.
- Perform principal component analysis (PCA) on selected SNPs using all 100 subjects in the dataset.
- Extract the first principal component (PC1).
- Use the following rule to classify each of the same 100 subjects as used for the PCA:
  - If PC1 > 0, classify as control, otherwise classify as case
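The PC1 sign rule can be sketched as below. This is an illustration under assumptions: the four-subject genotype matrix is a toy example, PC1 is obtained by power iteration instead of a full PCA routine, and the sign of a principal component is arbitrary, so which side corresponds to "control" depends on the orientation chosen (here fixed by the start vector).

```python
import math

def first_pc_scores(rows, iters=200):
    """Project each subject onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # the start vector determines the sign of PC1
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

def classify(rows):
    """Apply the rule from the slide: PC1 > 0 is a control, otherwise a case."""
    return ["control" if s > 0 else "case" for s in first_pc_scores(rows)]

# Toy genotype matrix (subjects x selected SNPs), not the study data.
toy = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]]
predicted = classify(toy)
print(predicted)
```

Because the PCA and the classification use the same subjects, this sketch shares the evaluation problem discussed in the slides: there is no held-out data.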
9. Evaluation of classification performance
- Hu et al. used the proportion of correct classifications; their classifier is trained and tested on the same dataset
- We employ the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme
[Figure: SNP dataset (100 subjects)]
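A minimal sketch of the two ingredients named above, AUC and repeated 10-fold splitting. The number of repetitions (`n_repeats`) and the seed are placeholders, since the slides do not state them, and the folds here are not stratified by case/control status.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_kfold(n_subjects, n_folds=10, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_subjects))
        rng.shuffle(order)
        for f in range(n_folds):
            test = sorted(order[f::n_folds])
            held_out = set(test)
            train = [i for i in range(n_subjects) if i not in held_out]
            yield train, test

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking of cases above controls
splits = list(repeated_kfold(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every modelling step that looks at the response (including SNP selection) would be run on `train` only, with AUC computed on `test`.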
10. Reproducing findings of Hu et al.
- Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226!
- Apparently, they used an extra filtering step that was not reported in the paper (personal comm. with their PI).
- Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1-significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC.
- ⇒ Major findings are reproduced
11. Bias in SNP selection method GLM1
- The p-value calculated in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption)
- Family history and alcohol consumption are strong risk factors ⇒ the p-value is biased towards 0.
12. Bias in SNP selection method GLM1
- The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10^-3
- On the contrary, GLM2 reflects the significance of SNPs and does not suffer from the above bias
- Its distribution of SNP p-values is uniform
- It returns no SNPs significant at the Bonferroni-adjusted alpha level
[Figure: distributions of SNP p-values, with the Bonferroni-adjusted α-level marked]
13. Empirical demonstration of bias in SNP selection method
- Main idea: Create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs.
- Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
- Apply GLM1 and GLM2 to the permuted SNP data.
Repeat 1,000 times
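The permutation loop can be sketched as follows. `select_snps` is a hypothetical callable standing in for GLM1 or GLM2 (it is not defined in the slides); it takes the SNP rows, the response and the covariates and returns the indices of the SNPs it declares significant. Only the subject order of the SNP rows is shuffled, so each subject's response and covariates stay paired with each other but not with a genotype.

```python
import random

def permute_subjects(snp_rows, rng):
    """Return the SNP rows in a random subject order (each row stays intact)."""
    order = list(range(len(snp_rows)))
    rng.shuffle(order)
    return [snp_rows[i] for i in order]

def permutation_experiment(snp_rows, response, covariates, select_snps,
                           n_permutations=1000, seed=0):
    """Count the significant SNPs found in each permuted (null) dataset."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_permutations):
        shuffled = permute_subjects(snp_rows, rng)
        counts.append(len(select_snps(shuffled, response, covariates)))
    runs_with_hits = sum(1 for c in counts if c > 0)
    return runs_with_hits, counts

# Toy usage with a placeholder selector that never reports a SNP.
toy_snps = [[0, 1], [1, 1], [0, 0], [1, 0]]
toy_y = [1, 1, 0, 0]
toy_cov = [[1, 0], [1, 1], [0, 0], [0, 1]]
runs, counts = permutation_experiment(toy_snps, toy_y, toy_cov,
                                      lambda rows, y, cov: [], n_permutations=5)
print(runs, counts)
```

An unbiased selection method should report hits in roughly 5% of permuted datasets at a family-wise 0.05 level; a much higher fraction indicates bias.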
14. Results of permutation experiments
- GLM1 found significant SNPs in all 1000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average).
- GLM2 found significant SNPs in only 48/1000 permutations. The number of significant SNPs found in a permuted dataset ranges from 1 to 3.
- ⇒ GLM1 is biased, while GLM2 is not.
15. Bias in the classification performance estimate of Hu et al.
- All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent sample validation was performed.
- We repeated their data analysis (GLM1+PCA) embedded in the repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC).
- ⇒ 0.30 AUC bias (overestimation) in the reported results
16. Empirical demonstration of performance estimation bias
- Main idea: Create a null distribution where SNPs are completely unrelated to the response variable (i.e. AUC = 0.5), apply the GLM1+PCA methodology and record the resulting performance estimates.
- Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
- Apply GLM1 to the permuted SNP data.
- Build and apply classifier using PCA.
- Estimate classification performance (AUC).
Repeat 1,000 times
17. Results of permutation experiments
- Classification performance of GLM1+PCA, both methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC
- Classification performance of GLM1+PCA, GLM1 applied to all data and PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC
- Classification performance of GLM1+PCA, both methods applied by cross-validation: 0.50 AUC
- ⇒ 0.48-0.49 AUC bias (overestimation) under the null
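The gap between incomplete and complete cross-validation can be reproduced on pure noise. The sketch below is not the GLM1+PCA pipeline: as a simplifying assumption it ranks features by the absolute difference in class means and classifies with a nearest-centroid rule, and it reports accuracy rather than AUC. The data are random binary "genotypes" with no relation to the labels, and all sizes are arbitrary.

```python
import random

def top_features(rows, labels, idx, k):
    """Rank features by |mean(cases) - mean(controls)| using only the subjects in idx."""
    cases = [i for i in idx if labels[i] == 1]
    ctrls = [i for i in idx if labels[i] == 0]

    def gap(j):
        return abs(sum(rows[i][j] for i in cases) / len(cases)
                   - sum(rows[i][j] for i in ctrls) / len(ctrls))

    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]

def centroid_predict(rows, labels, train, test, feats):
    """Nearest-centroid classifier restricted to the chosen features."""
    def centroid(cls):
        members = [i for i in train if labels[i] == cls]
        return [sum(rows[i][j] for i in members) / len(members) for j in feats]

    c1, c0 = centroid(1), centroid(0)
    preds = []
    for i in test:
        d1 = sum((rows[i][j] - a) ** 2 for j, a in zip(feats, c1))
        d0 = sum((rows[i][j] - a) ** 2 for j, a in zip(feats, c0))
        preds.append(1 if d1 < d0 else 0)
    return preds

def cv_accuracy(rows, labels, k_feats, n_folds, select_inside):
    """Cross-validated accuracy with selection inside (complete) or outside (incomplete) the folds."""
    all_idx = list(range(len(rows)))
    global_feats = top_features(rows, labels, all_idx, k_feats)  # uses test subjects too
    correct = 0
    for f in range(n_folds):
        test = all_idx[f::n_folds]
        train = [i for i in all_idx if i % n_folds != f]
        feats = top_features(rows, labels, train, k_feats) if select_inside else global_feats
        preds = centroid_predict(rows, labels, train, test, feats)
        correct += sum(p == labels[i] for p, i in zip(preds, test))
    return correct / len(rows)

rng = random.Random(0)
labels = [1] * 30 + [0] * 30
rows = [[rng.randint(0, 1) for _ in range(2000)] for _ in range(60)]  # pure noise

biased = cv_accuracy(rows, labels, k_feats=20, n_folds=10, select_inside=False)
honest = cv_accuracy(rows, labels, k_feats=20, n_folds=10, select_inside=True)
print(f"incomplete CV accuracy: {biased:.2f}, complete CV accuracy: {honest:.2f}")
```

With selection done once on all subjects, the accuracy is typically far above chance even though the features are noise; with selection repeated inside each training fold it should stay near 0.5.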
18. Additional analysis of SNP data to assess the effects of genetics and environment.
19. Classification: Support Vector Machines (SVMs)
- Supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.).
- Trained and applied by cross-validation
20. SNP selection for fitting SVMs: Recursive Feature Elimination
- Among the best performing techniques for the analysis of microarray gene expression data
- Applied only to the training set during cross-validation
[Figure: recursive feature elimination. An SVM model is fit on 10,000 SNPs and a performance estimate is recorded; the 5,000 SNPs important for classification are kept and the 5,000 not important are discarded. The step is repeated on the 5,000 SNPs, keeping 2,500 and discarding 2,500.]
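The halving scheme in the diagram can be sketched generically. `fit_and_score` is a hypothetical callable, not something defined in the slides: given a list of feature indices it would train a linear SVM on the training fold only and return one weight per feature together with a performance estimate. The stopping rule (`min_features`) is likewise an assumption.

```python
def recursive_feature_elimination(features, fit_and_score,
                                  keep_fraction=0.5, min_features=1):
    """Repeatedly drop the lowest-weighted features; return [(subset, score), ...]."""
    history = []
    current = list(features)
    while True:
        weights, score = fit_and_score(current)
        history.append((current, score))
        n_keep = int(len(current) * keep_fraction)
        if n_keep < min_features:
            break
        # Keep the features with the largest absolute weights.
        ranked = sorted(zip(weights, current), key=lambda wf: abs(wf[0]), reverse=True)
        current = [f for _, f in ranked[:n_keep]]
    return history

# Toy stand-in for an SVM: feature f gets weight 1 / (1 + f); smaller subsets score higher.
def toy_fit_and_score(feats):
    return [1.0 / (1 + f) for f in feats], 1.0 / len(feats)

history = recursive_feature_elimination(range(8), toy_fit_and_score)
print([len(subset) for subset, _ in history])
best_subset, best_score = max(history, key=lambda h: h[1])
print(best_subset)
```

The subset with the best recorded score is then used to build the final classifier for that training fold.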
21. Classification results: repeated 10-fold cross-validation estimates
[Results table not preserved; its footnote marker denotes building of classifier by ensembling]
22. Feedback from Hu et al. and publication history.
23. Feedback from the authors
- Concerning bias in SNP selection
  - "If we use p-values to rank the SNPs, the two methods GLM1 and GLM2 will give the same order."
- Concerning bias in estimation of classifier performance
  - "It was not our purpose to develop a classifier in this initial pilot effort."
  - "we made these calculations as a frame of reference only."
  - The authors presented results of their cross-validation effort.
24. Feedback from the authors
- SNPs were selected by GLM1 on all 100 subjects and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1000 times with different splits into training and testing sets.
[Figure: proportion of correct classifications]
These results are expected because the SNP selection procedure utilizes both training and testing data. This is incomplete cross-validation and has been shown to cause biased performance estimation of the classifier.
25. Going forward with the publication
Original article
Authors of the original article
- No useful feedback
- Do not recognize their mistakes
26. Conclusions
- Data-analysis pitfalls in Hu et al. led researchers to (1) identify non-statistically significant SNPs and (2) derive biased estimates of classification performance.
- Environmental factors and family history have modest association with the disease, while SNPs do not appear to be associated.
- It is crucially important to have sound statistical analysis in genome-wide association studies.
- Publishing rebuttals is challenging!