Data mining for Genotype Phenotype associations: an embarrassment of riches

Slides: 21

Provided by: Ren165

Transcript and Presenter's Notes

1
Data mining for Genotype-Phenotype associations: an embarrassment of riches

CART
RandomForest
Golden Helix
Neural Networks
Neural Networks with Genetic Programming
Discriminant Analysis
SSVS (Stochastic Search Variable Selection)
Bayesian Variable Selection
MARS
CPM (Combinatorial Partitioning Methods)
MDR (Multifactor-dimensionality reduction)
Association Rule Mining
HPM (Haplotype pattern mining)
QHPM
DICE (Detection of Informative Combined Effects)

2
Outline

Project Workflow
Simulating Genotype-Phenotype data
A sampling of genetic models
A Preliminary Study
Chi-square test vs. RandomForests
A shift in phenotype definition
Next Steps

3
Project Workflow

Iterate the process to explore the parameter space:
size of marginal effects
degree of epistasis
gene frequencies
linkage associations
number of SNPs
environmental effects
sample size
(lots more)

4
Simulating Genetic Data

Population Genetic model
Population history
SNP allele frequencies
Departures from Hardy-Weinberg Equilibrium
Linkage associations
Haplotypes
Disease/Development model
SNP sampling strategy
Candidate genes targeted?

5
A model using XOR to create epistasis
from Ritchie et al. 2003. BMC Bioinformatics 4

High risk if an individual is heterozygous at exactly one of the two SNPs:
AaBB, Aabb, AABb, aaBb, but not AaBb
If all alleles have equal frequency, there are no single-locus marginal differences in genotypic penetrance:
AA, Aa, and aa each confer the same risk.
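
The claim about vanishing marginal effects on the slide above can be checked with a short calculation. The sketch below assumes Hardy-Weinberg genotype frequencies at the second locus and treats "high risk" as a yes/no class, since the slide gives no penetrance values; the function names are illustrative and not from the presentation.

```python
def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for an allele with frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def is_het(genotype):
    """True when the two alleles of a genotype label differ (e.g. 'Aa')."""
    return genotype[0] != genotype[1]

def xor_high_risk(g1, g2):
    """High risk when exactly one of the two loci is heterozygous."""
    return is_het(g1) != is_het(g2)

def marginal_high_risk(p_other):
    """P(high risk | genotype at one locus), averaged over the other locus.

    Generic labels AA/Aa/aa are reused for both loci.
    """
    other = genotype_freqs(p_other)
    return {
        g1: sum(f for g2, f in other.items() if xor_high_risk(g1, g2))
        for g1 in ("AA", "Aa", "aa")
    }

equal_freq = marginal_high_risk(0.5)    # every genotype has the same risk
skewed_freq = marginal_high_risk(0.1)   # marginal differences reappear
```

With an allele frequency of 0.5 at the other locus, half of its genotypes are heterozygous and half homozygous, so each genotype at the first locus sees the same chance of high risk. Away from 0.5 that balance breaks and a single-locus signal appears.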

6
A model for genetic heterogeneity and epistasis using multiplicative effects
from Lunetta et al. 2004. BMC Genetics 5

At each locus, q0, q1, q2 are the penetrances for 0, 1, 2 risk alleles
[Figure: loci 1, 2, 3, 4, 5, 6, 7, 8, ...]
Individual risk calculated as [equation not captured in the transcript]
7
A model based on a known developmental process
8
Simulation details

0100011010101010101000110111101
0111010100011010111001100101011

6 active loci
100 neutral loci

All loci in Hardy-Weinberg Equilibrium
No linkage-disequilibrium
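
One way to read the setup above: with Hardy-Weinberg equilibrium and no linkage disequilibrium, every allele on each of an individual's two haplotypes can be drawn independently. The following is a minimal sketch under that reading. The active-locus frequency of 0.5 is a placeholder assumption, the uniform (0.1, 0.9) range for neutral loci is taken from a later slide, and all names are hypothetical.

```python
import random

N_ACTIVE = 6     # active loci, from the slide
N_NEUTRAL = 100  # neutral loci, from the slide

def simulate_haplotype(freq_of_0, rng):
    """Draw one haplotype; each locus is independent (no LD)."""
    return [0 if rng.random() < f else 1 for f in freq_of_0]

def simulate_individual(freq_of_0, rng):
    """Two independent haplotypes give Hardy-Weinberg genotype proportions."""
    return simulate_haplotype(freq_of_0, rng), simulate_haplotype(freq_of_0, rng)

def copies_of_0(individual):
    """Genotype coded as the number of copies (0, 1, or 2) of allele 0 per locus."""
    hap1, hap2 = individual
    return [(a == 0) + (b == 0) for a, b in zip(hap1, hap2)]

rng = random.Random(1)
# Assumed frequencies: 0.5 at active loci (placeholder), uniform at neutral loci.
freqs = [0.5] * N_ACTIVE + [rng.uniform(0.1, 0.9) for _ in range(N_NEUTRAL)]
person = simulate_individual(freqs, rng)
genotype = copies_of_0(person)
```

The two haplotypes correspond to the two binary strings shown on the slide; collapsing them to allele counts gives the genotype coding used by a case-control test.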

9
Genetic determination of developmental parameters
For each developmental parameter:
parameter = genotype-determined value + e, where e ~ N(0, parameter noise)
Simulate development
10
Mapping morphogen gradient and threshold to disease phenotype
[Figure: morphogen gradient with threshold, panels DevMod1 and DevMod2; outcomes labeled "no spot", "diseased", and "not diseased"]
If radius > C: diseased. If radius < C: not diseased.
11
Standard Chi-Square vs. RandomForests

Multiple samples created, each drawn from a population with a different array of allele frequencies
All combinations tried where the frequency of allele 0 = 0.1, 0.5, or 0.9 at each developmental parameter locus (3^6 = 729 samples)
Genotypic relative risk is highly dependent upon genetic background
Frequency of allele 0 at neutral loci drawn from a uniform distribution on (0.1, 0.9)
1000 cases and 1000 controls per sample
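
The 729 figure follows from three frequency levels at each of the six active loci. A small sketch of enumerating that grid, with constant names that are illustrative only:

```python
from itertools import product

ACTIVE_LOCI = 6                  # developmental parameter loci
FREQ_LEVELS = (0.1, 0.5, 0.9)    # frequency of allele 0 tried at each locus

# Every combination of one frequency level per active locus.
freq_grid = list(product(FREQ_LEVELS, repeat=ACTIVE_LOCI))
n_samples = len(freq_grid)       # 3 ** 6
```

Each tuple in `freq_grid` describes one simulated population, from which a case-control sample would then be drawn.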

12
Standard Method: Genotype Case-Control
[Table: case and control counts by number of copies of the 0 allele]
Bonferroni correction applied: α = 0.05/106
Gibson and Muse. 2002. A Primer of Genome Science
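
As a rough illustration of the per-locus test, the sketch below computes a Pearson chi-square statistic on a 2x3 table (cases and controls by 0, 1, or 2 copies of the 0 allele) and compares the p-value with the Bonferroni threshold for 106 loci. The counts are made up for the example, and the function is a hypothetical helper, not code from the presentation.

```python
import math

def chi_square_2x3(cases, controls):
    """Pearson chi-square on a 2x3 genotype table; returns (statistic, p_value)."""
    table = [cases, controls]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in col_totals or 0 in row_totals:
        raise ValueError("every row and genotype column must be non-empty")
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # df = (2 - 1) * (3 - 1) = 2, and the chi-square upper tail
    # with 2 degrees of freedom is exactly exp(-x / 2).
    return stat, math.exp(-stat / 2.0)

N_LOCI = 106             # 6 active + 100 neutral loci
ALPHA = 0.05 / N_LOCI    # Bonferroni-corrected significance level

# Illustrative counts only: 1000 cases and 1000 controls at one locus.
stat, p_value = chi_square_2x3([300, 500, 200], [250, 500, 250])
significant = p_value < ALPHA
```

The closed-form tail probability holds only for two degrees of freedom, which is why the table shape is fixed at 2x3 here.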
13
Classification Tree
14
A Random Forest
[Figure: Data Set split into bootstrapped samples (500 created), with one tree grown per sample]
Split each node choosing only from a random subset of variables (mtry = 10). Trees are not pruned.
To classify a new observation, use the majority vote from the forest.
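
The three mechanics named on this slide (bootstrap resampling, a random subset of variables at each split, and majority voting) can be sketched in isolation. This is not a full RandomForest implementation: tree growing is left out, and the helper names are hypothetical.

```python
import random
from collections import Counter

N_TREES = 500   # bootstrapped samples / trees, from the slide
MTRY = 10       # variables considered at each split, from the slide

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def candidate_variables(n_variables, rng, mtry=MTRY):
    """Pick the random subset of variable indices eligible for one split."""
    return rng.sample(range(n_variables), mtry)

def majority_vote(tree_predictions):
    """Forest prediction: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(2000))                  # stand-in for 1000 cases + 1000 controls
sample = bootstrap_sample(rows, rng)
split_vars = candidate_variables(106, rng)
label = majority_vote(["case", "control", "case"])
```

Because each bootstrap sample draws with replacement, some individuals appear several times and others not at all, which is what makes the trees differ from one another.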
15
Distribution of importance values
16
Chi-square test vs. RandomForest
[Figure: comparison plots, axes labeled "Chi-square"]
17
Conclusions

RandomForest slightly outperforms standard
Chi-square test.
Both methods do perhaps too well.
Redefinition of disease phenotype affected
ability to detect disease loci

18
Next Steps

Create more challenging G to P mapping
More realistic penetrance values, genotypic
relative risks
Integrate disease development model with
realistic haplotype data (Fred Wright).
Test more Data Mining techniques
Collaborate with Alex Tropsha's group (QSAR plus unbundled software from his workflow)
Association Rule Mining (AFI: approximate frequent itemset mining).

19
More questions