Title: Data mining for Genotype Phenotype associations: an embarrassment of riches
1. Data mining for Genotype-Phenotype associations: an embarrassment of riches
- CART
- RandomForest
- Golden Helix
- Neural Networks
- Neural Networks with Genetic Programming
- Discriminant Analysis
- SSVS (Stochastic Search Variable Selection)
- Bayesian Variable Selection
- MARS
- CPM (Combinatorial Partitioning Methods)
- MDR (Multifactor-dimensionality reduction)
- Association Rule Mining
- HPM (Haplotype pattern mining)
- QHPM
- DICE (Detection of Informative Combined Effects)
2. Outline
- Project Workflow
- Simulating Genotype-Phenotype data
- A sampling of genetic models
- A Preliminary Study
- Chi-square test vs. RandomForests
- A shift in phenotype definition
- Next Steps
3. Project workflow
- Iterate process to explore parameter space
- Size of marginal effects
- degree of epistasis
- gene frequencies
- linkage associations
- number of SNPs
- environmental effects
- sample size
- (lots more)
4. Simulating Genetic Data
- Population Genetic model
- Population history
- SNP allele frequencies
- Departures from Hardy-Weinberg Equilibrium
- Linkage associations
- Haplotypes
- Disease/Development model
- SNP sampling strategy
- Candidate genes targeted?
5. A model using XOR to create epistasis
from Ritchie et al. 2003. BMC Bioinformatics 4
- High risk if the individual is heterozygous at exactly one SNP
- AaBB, Aabb, AABb, aaBb, but not AaBb
- If all alleles have equal frequency, there are no single-locus marginal differences in penetrance between genotypes
- AA, Aa, and aa each confer the same risk.
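The marginal claim on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the penetrance values `HIGH_RISK` and `LOW_RISK` are hypothetical (the slide says only "high risk"), and genotype frequencies at the second locus are assumed to follow Hardy-Weinberg proportions.

```python
HIGH_RISK = 0.2  # hypothetical penetrance for the high-risk genotypes
LOW_RISK = 0.0   # hypothetical penetrance for every other genotype


def hwe_frequencies(p):
    """Hardy-Weinberg frequencies of carrying 0, 1 or 2 copies of an allele with frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}


def xor_penetrance(copies_a, copies_b):
    """High risk when exactly one of the two loci is heterozygous (XOR)."""
    het_a = copies_a == 1
    het_b = copies_b == 1
    return HIGH_RISK if het_a != het_b else LOW_RISK


def marginal_penetrance(freq_b):
    """Penetrance of each genotype at locus A, averaged over the genotypes at locus B."""
    freqs_b = hwe_frequencies(freq_b)
    return {
        a: sum(freqs_b[b] * xor_penetrance(a, b) for b in (0, 1, 2))
        for a in (0, 1, 2)
    }


# With equal allele frequencies the three marginal penetrances coincide;
# with unequal frequencies (0.3 here) they no longer do.
for freq in (0.5, 0.3):
    print(freq, marginal_penetrance(freq))
```

At an allele frequency of 0.5, half of the individuals are heterozygous at locus B, so every genotype at locus A has the same chance of being paired with a "matching" or "non-matching" partner, which is why a single-locus test sees nothing.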
6. A model for genetic heterogeneity and epistasis using multiplicative effects
At each locus, q0, q1, q2 are the penetrances for 0, 1, 2 risk alleles
loci 1 2 3 4 5 6 7 8 . . .
Individual risk calculated as
from Lunetta et al. 2004. BMC Genetics 5
7. A model based on a known developmental process
8. Simulation details
- 0100011010101010101000110111101
- 0111010100011010111001100101011
6 active loci
100 neutral loci
- All loci in Hardy-Weinberg Equilibrium
- No linkage-disequilibrium
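One way to generate individuals that satisfy these constraints is to draw every allele independently. The sketch below is a minimal illustration, not the simulator used in the study: the function names are hypothetical, and the allele frequencies (0.5 at active loci, uniform on (0.1, 0.9) at neutral loci) are placeholder assumptions.

```python
import random

N_ACTIVE = 6     # active loci, as on the slide
N_NEUTRAL = 100  # neutral loci, as on the slide


def simulate_haplotype(freq0, rng):
    """One haplotype: at each locus, allele 0 with probability freq0[i], otherwise allele 1.

    Loci are drawn independently of one another, so there is no linkage disequilibrium.
    """
    return [0 if rng.random() < f else 1 for f in freq0]


def simulate_individual(freq0, rng):
    """Two independently drawn haplotypes, which yields Hardy-Weinberg genotype proportions."""
    return simulate_haplotype(freq0, rng), simulate_haplotype(freq0, rng)


def copies_of_zero(individual):
    """Genotype coding: number of copies (0, 1 or 2) of the 0 allele at each locus."""
    hap1, hap2 = individual
    return [2 - (a + b) for a, b in zip(hap1, hap2)]


rng = random.Random(1)
# Assumed frequencies of the 0 allele, for illustration only.
freq0 = [0.5] * N_ACTIVE + [rng.uniform(0.1, 0.9) for _ in range(N_NEUTRAL)]
person = simulate_individual(freq0, rng)
genotype = copies_of_zero(person)
print("".join(map(str, person[0])))
print("".join(map(str, person[1])))
```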
9. Genetic determination of developmental parameters
For each developmental parameter:
the parameter value includes a noise term e, where e ~ N(0, parameter noise)
Simulate development
10. Mapping morphogen gradient and threshold to disease phenotype
[Figure: morphogen gradient and threshold under two phenotype definitions, DevMod1 and DevMod2; outcome labels: no spot, diseased, not diseased]
If radius > C: diseased. If radius < C: not diseased.
11. Standard Chi-Square vs. RandomForests
- Multiple samples created, each drawn from a population with a different array of allele frequencies
- All combinations tried where the freq of the 0 allele is 0.1, 0.5, or 0.9 at each developmental parameter locus (729 samples)
- Genotypic relative risk is highly dependent upon genetic background
- Freq of the 0 allele at neutral loci drawn from a uniform distribution on (0.1, 0.9)
- 1000 each of cases and controls
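The count of 729 samples follows from three frequency levels at each of six loci (3^6 = 729). The short sketch below enumerates that grid; it assumes the developmental parameter loci are the 6 active loci of the simulation, and the variable names are illustrative.

```python
from itertools import product

ACTIVE_LOCI = 6                # assumed: the developmental parameter loci
FREQ_LEVELS = (0.1, 0.5, 0.9)  # frequency of the 0 allele tried at each locus

# Every assignment of one frequency level to each active locus.
backgrounds = list(product(FREQ_LEVELS, repeat=ACTIVE_LOCI))
print(len(backgrounds))
```

Each tuple in `backgrounds` would define one simulated population from which 1000 cases and 1000 controls are drawn.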
12. Standard Method: Genotype Case-Control
copies of 0 allele
Bonferroni correction applied: α = 0.05/106
Gibson and Muse. 2002. A Primer of Genome
Science
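A genotype case-control test of this kind can be sketched as a Pearson chi-square on a 2 x 3 table (case/control by copies of the 0 allele), compared against the Bonferroni threshold of 0.05/106. The function below is an illustrative stand-in, not the code used in the study; it relies on the fact that a chi-square variable with 2 degrees of freedom has the exact tail probability exp(-x/2), and it assumes all three genotypes are observed.

```python
import math

N_LOCI = 106                      # one test per locus: 6 active + 100 neutral
BONFERRONI_ALPHA = 0.05 / N_LOCI  # per-test significance threshold


def chi_square_2x3(cases, controls):
    """Pearson chi-square for case/control counts by 0, 1, 2 copies of the 0 allele.

    Returns (statistic, p_value). Assumes every genotype column is non-empty,
    so the test has (2 - 1) * (3 - 1) = 2 degrees of freedom.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for n_case, n_ctrl in zip(cases, controls):
        column = n_case + n_ctrl
        if column == 0:
            raise ValueError("empty genotype column: the 2-df test does not apply")
        for observed, row in ((n_case, n_cases), (n_ctrl, n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    # Tail probability of a chi-square with 2 degrees of freedom is exp(-x / 2).
    return stat, math.exp(-stat / 2.0)


# Hypothetical counts for one locus, 1000 cases and 1000 controls.
stat, p_value = chi_square_2x3(cases=[300, 500, 200], controls=[200, 500, 300])
print(stat, p_value, p_value < BONFERRONI_ALPHA)
```

A locus is declared associated only if its p-value falls below `BONFERRONI_ALPHA`, which keeps the family-wise error rate at or below 0.05 across all 106 tests.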
13. Classification Tree
14. A Random Forest
Data Set -> bootstrapped samples (one tree grown per sample)
Create 500 samples
Split each node choosing only from a random subset of variables (mtry = 10). Trees are not pruned.
To classify a new observation, use the majority vote from the forest.
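The procedure on this slide (bootstrap, random variable subset at each node, unpruned trees, majority vote) can be sketched in plain Python. This is a teaching sketch under simplifying assumptions, not the RandomForest software used in the study: splits are chosen by Gini impurity, a node becomes a leaf when none of the sampled variables can split it, and all function names are illustrative.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(X, y, rows, mtry, rng):
    """Grow an unpruned tree on the given row indices.

    A leaf is a class label; an internal node is (feature, threshold, left, right).
    """
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure node: stop
    best = None
    n_features = len(X[0])
    # Only a random subset of mtry variables is considered at this node.
    for f in rng.sample(range(n_features), min(mtry, n_features)):
        for t in sorted({X[i][f] for i in rows})[:-1]:
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # none of the sampled variables varies within this node
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t, grow_tree(X, y, left, mtry, rng), grow_tree(X, y, right, mtry, rng))


def predict_tree(node, x):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def random_forest(X, y, n_trees=500, mtry=10, seed=0):
    """Grow one tree per bootstrapped sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree(X, y, rows, mtry, rng))
    return forest


def predict(forest, x):
    """Classify a new observation by majority vote over all trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Here `X` would be a list of genotype vectors (copies of the 0 allele per locus) and `y` the case/control labels; `random_forest(X, y)` uses the slide's settings of 500 trees and mtry = 10 by default.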
15. Distribution of importance values
16. Chi-square test vs. RandomForest
[Figure: comparison plots; axis label: Chi-square]
17. Conclusions
- RandomForest slightly outperforms the standard Chi-square test.
- Both methods do perhaps too well.
- Redefinition of the disease phenotype affected the ability to detect disease loci
18. Next Steps
- Create more challenging G to P mapping
- More realistic penetrance values, genotypic relative risks
- Integrate disease development model with realistic haplotype data (Fred Wright).
- Test more Data Mining techniques
- Collaborate with Alex Tropsha's group (QSAR plus unbundled software from his workflow)
- Association Rule Mining (AFI: approximate frequent itemset mining).
19. More questions
- Do all the confounding effects:
- gene-gene interactions (epistasis)
- gene-environment interactions
- marker allele-trait allele associations (population history)
- genetic heterogeneity
- high-dimensionality (thousands of SNPs)
- all interact strongly to determine the performance of a classification method?
- Must we test everything simultaneously?
20. Should we be looking at Feature Selection?
- Most users want good classification accuracy, but we want to classify the predictors themselves:
relevant (disease locus) or irrelevant (neutral locus)