Effects of Environment, Genetics and Data Analysis in an Esophageal Cancer Genome-Wide Association Study

Learn more at: http://www.statnikov.org

1
Effects of Environment, Genetics and Data
Analysis in an Esophageal Cancer Genome-Wide
Association Study
  • Alexander Statnikov
  • Discovery Systems Laboratory
  • Department of Biomedical Informatics
  • Vanderbilt University
  • 10/3/2007

2
Project history
  • Joint project with Chun Li and Constantin
    Aliferis
  • Cancer Research 2005 paper by Hu et al.:
    "Genome-Wide Association Study in Esophageal
    Cancer Using GeneChip Mapping 10K Array"
  • Reported near-perfect classification of cancer
    patients vs. healthy controls on the basis of only
    SNP data from a case-control GWA study.
  • This finding suggests that esophageal cancer is a
    solely genetic disease
  • Initial idea of Chun Li
  • We happened to have the GWA dataset used in that
    study, so we decided to re-analyze it

3
Background
  • SNPs make up >90% of all human genetic variation
    and have been extensively studied for functional
    relationships between phenotype and genotype.
  • Modern high-throughput genotyping technologies
    allow fast evaluation of SNPs on a genome-wide
    scale at a relatively low cost.
  • During the last 2 years, several studies have
    reported success in using SNP genotyping assays
    in GWA studies in cancer. The strongest result is
    reported in the study by Hu et al.

4
Re-analysis of SNP data to assess the claims of
Hu et al.
5
Study dataset and its preparation
  • Study dataset
  • 50 esophageal squamous cell carcinoma patients
  • 50 healthy controls (matched by age, sex, place
    of residence)
  • 10k Affymetrix SNP arrays with 11,555 SNPs
  • Additional variables
  • Age
  • Tobacco use
  • Alcohol consumption
  • Family history
  • Consumption of pickled vegetables
  • Removed 1.5k SNPs to minimize genotyping errors
  • Implemented recessive A encoding
  • Imputed missing genotypes

6
SNP selection: Original method of Hu et al.
  • (denoted GLM1)
  • Fit a GLM model using data for all 100 subjects
  • Probability(Cancer) = 1 / (1 + exp(-f)), where
  • f = a + b·SNP + c·(family history) + d·(alcohol
    consumption)
  • Obtain deviances
  • D1 - deviance of the above fitted model
  • D0 - deviance of the null model (without
    predictor variables)
  • From the χ² distribution, compute a p-value for the
    test statistic D0 - D1 with 3 degrees of freedom
  • Perform Bonferroni correction at 0.05 alpha level

7
SNP selection: Unbiased GLM-based method
  • (denoted GLM2)
  • Fit a GLM model using data for all 100 subjects
  • Probability(Cancer) = 1 / (1 + exp(-f)), where
  • f = a + b·SNP + c·(family history) + d·(alcohol
    consumption)
  • Obtain deviances
  • D1 - deviance of the above fitted model
  • D0′ - deviance of the model with family history
    and alcohol consumption
  • From the χ² distribution, compute a p-value for the
    test statistic D0′ - D1 with 1 degree of freedom
  • Perform Bonferroni correction at 0.05 alpha level
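
The two deviance tests (GLM1 on the previous slide, GLM2 here) can be sketched as follows. This is a minimal illustration, not the code used in the study: it takes already-computed deviances as input, uses closed-form χ² tail probabilities that hold only for 1 and 3 degrees of freedom, and assumes roughly 10,000 tested SNPs. The values of `d0_prime` and `d1` are hypothetical and chosen only to show how the two tests can disagree.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable (df = 1 or 3 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1 and df = 3 are implemented in this sketch")

def glm1_p_value(d0, d1):
    """GLM1: full model against the null model, 3 degrees of freedom."""
    return chi2_sf(d0 - d1, 3)

def glm2_p_value(d0_prime, d1):
    """GLM2: full model against the covariates-only model, 1 degree of freedom."""
    return chi2_sf(d0_prime - d1, 1)

n_snps = 10_000            # assumption: approximate number of SNPs tested
alpha = 0.05 / n_snps      # Bonferroni-adjusted alpha level

d0 = 2 * 100 * math.log(2)  # null deviance for 50 cases and 50 controls
d0_prime = 100.0            # hypothetical deviance of the covariates-only model
d1 = 99.0                   # hypothetical deviance of the full model

p1 = glm1_p_value(d0, d1)
p2 = glm2_p_value(d0_prime, d1)
print(f"GLM1 p = {p1:.2e}, significant: {p1 < alpha}")
print(f"GLM2 p = {p2:.2e}, significant: {p2 < alpha}")
```

With these illustrative inputs the SNP adds almost nothing beyond the covariates, yet only the 1-degree-of-freedom test reflects that.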

8
Classification: Original method of Hu et al.
  • Perform principal component analysis (PCA) on
    selected SNPs using all 100 subjects in the
    dataset.
  • Extract the first principal component (PC1).
  • Use the following rule to classify each of the
    same 100 subjects as used for the PCA:
  • If PC1 > 0, classify as control; otherwise
    classify as case
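
The rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: PC1 is found by power iteration on mean-centered data, the toy `genotypes` matrix is invented, and the function names are hypothetical. Note that the sign of a principal component is arbitrary, so which side of zero corresponds to "control" depends on the PCA implementation.

```python
import math
import random

def pc1_scores(X, iters=200, seed=0):
    """Project mean-centered rows of X onto the first principal component."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(p)) for r in Xc]            # Xc v
        w = [sum(Xc[i][j] * proj[i] for i in range(n)) for j in range(p)]  # Xc^T (Xc v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:  # degenerate data with no variance along v
            break
        v = [c / norm for c in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]

def classify_by_pc1(scores):
    """PC1 > 0 is called a control, everything else a case."""
    return ["control" if s > 0 else "case" for s in scores]

# Toy data (hypothetical): rows are subjects, columns are selected SNPs.
genotypes = [
    [1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1],   # first group
    [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0],   # second group
]
labels = classify_by_pc1(pc1_scores(genotypes))
print(labels)
```

The same subjects are used to compute the component and to be classified, which is exactly the setup described on this slide.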

9
Evaluation of classification performance
  • Hu et al. used the proportion of correct
    classifications; their classifier is trained and
    tested on the same dataset
  • We use the area under the ROC curve (AUC) as the
    performance metric and a repeated 10-fold
    cross-validation scheme
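
A minimal sketch of the two ingredients named above: AUC computed as the fraction of case/control pairs ranked correctly, and a generator of repeated 10-fold splits. The function names and the default number of repeats are assumptions; the slides do not state how many repetitions were used.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)  # ties count half
    return wins / (len(pos) * len(neg))

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists for `repeats` independent k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
print(sum(1 for _ in repeated_kfold(100)))
```

Every modeling step that looks at the labels has to be refit on `train` only, and `test` is used solely for scoring.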

SNP dataset (100 subjects)
10
Reproducing findings of Hu et al.
  • Using the GLM1 method, Hu et al. reported 37
    significant SNPs; we found 226!
  • Apparently, they used an extra filtering step
    that was not reported in the paper (personal
    comm. with their PI).
  • Nevertheless, applying the PCA-based classifier
    (as in Hu et al.) to the GLM1-significant SNPs
    resulted in a 0.93 proportion of correct
    classifications and 0.98 AUC.
  • → Major findings are reproduced

11
Bias in SNP selection method GLM1
  • Calculation of p-values in GLM1 does not reflect
    significance of the SNP, but the significance of
    3 variables combined (SNP, family history, and
    alcohol consumption)
  • Family history and alcohol consumption are strong
    risk factors → the p-value is biased towards 0.
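
One way to see the bias is to split the GLM1 test statistic using the covariates-only deviance D0′ from the GLM2 slide. The decomposition below is an identity, written in LaTeX; the labels under the braces are interpretation rather than text from the slides.

```latex
\underbrace{D_0 - D_1}_{\text{GLM1 statistic, 3 df}}
  \;=\;
  \underbrace{\left(D_0 - D_0'\right)}_{\text{covariates only, identical for every SNP}}
  \;+\;
  \underbrace{\left(D_0' - D_1\right)}_{\text{GLM2 statistic, 1 df}}
```

The first term does not involve the SNP at all, so when family history and alcohol consumption are strong predictors it is large for every SNP and pushes all GLM1 p-values towards 0.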

12
Bias in SNP selection method GLM1
  • The distribution of SNP p-values for method GLM1
    is not uniform: most p-values are < 10^-3
  • In contrast, GLM2 reflects the significance of
    SNPs and does not suffer from the above bias
  • Its distribution of SNP p-values is uniform
  • It returns no SNPs significant at the
    Bonferroni-adjusted alpha level

Bonferroni-adjusted α-level
13
Empirical demonstration of bias in SNP selection
method
  • Main idea: Create a null distribution where SNPs
    are completely unrelated to the response variable
    and see how frequently methods GLM1 and GLM2 find
    statistically significant SNPs.
  • Permute all subjects in the SNP data while
    leaving the response variable, family history of
    esophageal cancer, and alcohol consumption
    intact.
  • Apply GLM1 and GLM2 to the permuted SNP data.

Repeat 1,000 times
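
The permutation experiment can be sketched as below. This is an outline under assumptions: `select_snps` is a hypothetical callable standing in for GLM1 or GLM2 that returns the list of SNPs it calls significant, and the data layout (one list of genotypes per subject) is invented for illustration.

```python
import random

def permute_subjects(snp_rows, rng):
    """Shuffle whole subject rows of the SNP matrix; the input list is left untouched."""
    shuffled = list(snp_rows)
    rng.shuffle(shuffled)
    return shuffled

def count_false_alarms(snp_rows, response, covariates, select_snps,
                       n_perm=1000, seed=0):
    """Count permutations in which the selection method reports any significant SNP."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        permuted = permute_subjects(snp_rows, rng)
        # Response and covariates stay in their original order, so any
        # SNP-response association found here is spurious by construction.
        if select_snps(permuted, response, covariates):
            hits += 1
    return hits
```

An unbiased method at a 0.05 family-wise alpha level should report something in roughly 5% of the permutations or fewer.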
14
Results of permutation experiments
  • GLM1 found significant SNPs in all 1,000
    permutations! The number of significant SNPs
    found in a permuted dataset ranges from 185 to
    1,938 (357 on average).
  • GLM2 found significant SNPs in only 48/1,000
    permutations. The number of significant SNPs
    found in a permuted dataset ranges from 1 to 3.
  • → GLM1 is biased, while GLM2 is not.

15
Bias in the classification performance estimate
of Hu et al.
  • All data-analysis methods of Hu et al. use data
    for all subjects. Neither cross-validation nor
    independent sample validation was performed.
  • We repeated their data analysis (GLM1+PCA)
    embedded in a repeated 10-fold cross-validation
    design. The resulting performance is only 0.68
    AUC (versus 0.98 AUC).
  • → 0.30 AUC bias (overestimation) in the reported
    results

16
Empirical demonstration of performance estimation
bias
  • Main idea: Create a null distribution where SNPs
    are completely unrelated to the response variable
    (i.e., AUC = 0.5), apply the GLM1+PCA methodology and
    record the resulting performance estimates.
  • Permute all subjects in the SNP data while
    leaving the response variable, family history of
    esophageal cancer, and alcohol consumption
    intact.
  • Apply GLM1 to the permuted SNP data.
  • Build and apply classifier using PCA.
  • Estimate classification performance (AUC).

Repeat 1,000 times
17
Results of permutation experiments
  • Classification performance of GLM1+PCA, both
    methods applied as in Hu et al. to all data (no
    cross-validation): 0.99 AUC
  • Classification performance of GLM1+PCA, GLM1
    applied to all data, PCA applied by
    cross-validation (incomplete cross-validation):
    0.98 AUC
  • Classification performance of GLM1+PCA, both
    applied by cross-validation: 0.50 AUC
  • → 0.48-0.49 AUC bias (overestimation) under the
    null
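
The gap between incomplete and complete cross-validation is easy to reproduce on pure noise. The sketch below is not the GLM1+PCA pipeline: a simple mean-difference filter stands in for SNP selection, a linear score stands in for the classifier, and the data are synthetic Gaussian features with no relation to the labels. All names and sizes are assumptions.

```python
import random

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mean_diffs(X, y, rows):
    """Per-feature difference of class means, computed on `rows` only."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    return [sum(X[i][j] for i in pos) / len(pos) - sum(X[i][j] for i in neg) / len(neg)
            for j in range(len(X[0]))]

def top_k(diffs, k):
    """Indices of the k features with the largest absolute mean difference."""
    return sorted(range(len(diffs)), key=lambda j: abs(diffs[j]), reverse=True)[:k]

def cv_auc(X, y, folds, k, select_inside_cv):
    everyone = list(range(len(y)))
    leaked = top_k(mean_diffs(X, y, everyone), k)  # selection that has seen the test folds
    scores = [0.0] * len(y)
    for test in folds:
        held_out = set(test)
        train = [i for i in everyone if i not in held_out]
        w = mean_diffs(X, y, train)  # classifier weights always come from training rows
        feats = top_k(w, k) if select_inside_cv else leaked
        for i in test:
            scores[i] = sum(w[j] * X[i][j] for j in feats)
    return auc(scores, y)

rng = random.Random(0)
n, p, k = 40, 1000, 10
y = [1] * (n // 2) + [0] * (n // 2)
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise, true AUC = 0.5
idx = list(range(n))
rng.shuffle(idx)
folds = [idx[i::10] for i in range(10)]

incomplete = cv_auc(X, y, folds, k, select_inside_cv=False)
complete = cv_auc(X, y, folds, k, select_inside_cv=True)
print(f"incomplete CV AUC: {incomplete:.2f}; complete CV AUC: {complete:.2f}")
```

Selecting features on all subjects lets information from the test folds leak into the model, so the first estimate is inflated even though there is nothing to learn; the second should stay near chance.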

18
Additional analysis of SNP data to assess the
effects of genetics and environment.
19
Classification: Support Vector Machines (SVMs)
  • Supervised baseline technique for many types of
    high-throughput data (microarray, proteomics,
    etc.).
  • Trained and applied by cross-validation

20
SNP selection for fitting SVMs: Recursive
Feature Elimination
  • Among the best performing techniques for the
    analysis of microarray gene expression data
  • Applied only to a training set during
    cross-validation

[Figure: recursive feature elimination. An SVM model fit on 10,000 SNPs gives a performance estimate; the 5,000 SNPs important for classification are kept and the 5,000 not important are discarded. A second SVM model fit on the 5,000 kept SNPs gives another performance estimate; 2,500 SNPs are kept and 2,500 discarded.]
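
The halving scheme in the figure can be sketched generically. This is an outline, not the study's code: `fit_weights` and `evaluate` are hypothetical callables supplied by the caller. For a linear SVM, one common choice of importance is the magnitude of each feature's weight; in the study the whole procedure was applied to training data only.

```python
def rfe_halving(features, fit_weights, evaluate, min_features=1):
    """Repeatedly fit a model, record its performance, and keep the top half of features.

    fit_weights(feats) -> dict mapping each feature to an importance value
    evaluate(feats)    -> performance estimate of a model built on feats
    """
    history = []
    current = list(features)
    while True:
        weights = fit_weights(current)
        history.append((list(current), evaluate(current)))
        if len(current) // 2 < min_features:
            break
        ranked = sorted(current, key=lambda f: abs(weights[f]), reverse=True)
        current = ranked[: len(current) // 2]  # discard the less important half
    return history

# Toy usage with made-up importances: feature f has importance f.
toy = rfe_halving(range(8),
                  fit_weights=lambda feats: {f: f for f in feats},
                  evaluate=lambda feats: len(feats))
print([len(feats) for feats, _ in toy])
```

The feature subset with the best recorded performance estimate would then be the one carried forward.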
21
Classification results: repeated 10-fold
cross-validation estimates
[Results table not preserved; a footnote marker in the table denotes building of the classifier by ensembling]
22
Feedback from Hu et al. and publication history.
23
Feedback from the authors
  • Concerning bias in SNP selection:
  • "If we use p-values to rank the SNPs, the two
    methods GLM1 and GLM2 will give the same
    order."
  • Concerning bias in estimation of classifier
    performance:
  • "It was not our purpose to develop a classifier
    in this initial pilot effort."
  • "we made these calculations as a frame of
    reference only."
  • The authors presented results of their
    cross-validation effort.

24
Feedback from the authors
  • SNPs were selected by GLM1 on all 100 subjects
    and the classifier was trained and tested by
    cross-validation (2/3 of the data used for
    training and 1/3 for testing).
    This cross-validation procedure was repeated 1,000
    times with different splits into training and
    testing sets.

Proportion of correct classifications
These results are expected because the SNP
selection procedure uses both training and
testing data. This is incomplete
cross-validation, which has been shown to bias
performance estimates of the classifier.
25
Going forward with the publication
Original article
Authors of the original article
  • No useful feedback
  • Do not recognize their mistakes

26
Conclusions
  • Data-analysis pitfalls in Hu et al. led
    researchers to (1) report SNPs that are not
    statistically significant and (2) derive biased
    estimates of classification performance.
  • Environmental factors and family history have
    modest association with the disease, while SNPs
    do not appear to be associated.
  • It is crucially important to have sound
    statistical analysis in genome-wide association
    studies.
  • Publishing rebuttals is challenging!