Data mining for Genotype-Phenotype associations: an embarrassment of riches

Slides: 21
Provided by: Ren165

1
Data mining for Genotype-Phenotype
associations: an embarrassment of riches
  • CART
  • RandomForest
  • Golden Helix
  • Neural Networks
  • Neural Networks with Genetic Programming
  • Discriminant Analysis
  • SSVS: Stochastic Search Variable Selection
  • Bayesian Variable Selection
  • MARS
  • CPM: Combinatorial Partitioning Methods
  • MDR: Multifactor-dimensionality reduction
  • Association Rule Mining
  • HPM: Haplotype pattern mining
  • QHPM
  • DICE: Detection of Informative Combined Effects

2
Outline
  • Project Workflow
  • Simulating Genotype-Phenotype data
  • A sampling of genetic models
  • A Preliminary Study
  • Chi-square test vs. RandomForests
  • A shift in phenotype definition
  • Next Steps

3
Project work flow
  • Iterate the process to explore the parameter space:
  • Size of marginal effects
  • degree of epistasis
  • gene frequencies
  • linkage associations
  • number of SNPs
  • environmental effects
  • sample size
  • (lots more)

4
Simulating Genetic Data
  • Population Genetic model
  • Population history
  • SNP allele frequencies
  • Departures from Hardy-Weinberg Equilibrium
  • Linkage associations
  • Haplotypes
  • Disease/Development model
  • SNP sampling strategy
  • Candidate genes targeted?

5
A model using XOR to create epistasis
from Ritchie et al. 2003. BMC Bioinformatics 4
  • High risk if an individual is heterozygous at
    exactly one of the two SNPs
  • AaBB, Aabb, AABb, aaBb, but not AaBb
  • If all alleles have equal frequency (0.5), there
    are no single-locus marginal differences in
    penetrance.
  • AA, Aa, and aa each confer the same risk.
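
The claim about marginal penetrance can be checked numerically. The sketch below is only an illustration: the penetrance values (0.2 for high risk, 0.0 otherwise) and the function names are assumptions, not values from the slides or from Ritchie et al. It averages the two-locus XOR penetrance over the genotypes at the second locus, assuming Hardy-Weinberg proportions.

```python
# Hypothetical penetrance values, chosen only for illustration.
HIGH_RISK = 0.2
LOW_RISK = 0.0

GENOTYPES = (0, 1, 2)  # copies of one allele; 1 means heterozygous


def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies for allele frequency p."""
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}


def penetrance(g_a, g_b):
    """XOR rule: high risk when exactly one of the two loci is heterozygous."""
    return HIGH_RISK if (g_a == 1) != (g_b == 1) else LOW_RISK


def marginal_penetrance(p_b):
    """Penetrance of each genotype at locus A, averaged over locus B."""
    freqs_b = genotype_freqs(p_b)
    return {
        g_a: sum(freqs_b[g_b] * penetrance(g_a, g_b) for g_b in GENOTYPES)
        for g_a in GENOTYPES
    }


equal_freq = marginal_penetrance(0.5)    # all alleles at frequency 0.5
unequal_freq = marginal_penetrance(0.1)  # marginal effects reappear
print(equal_freq)
print(unequal_freq)
```

With an allele frequency of 0.5, half of the population is heterozygous at the other locus, so every genotype at the first locus sees the same average risk. At other frequencies the marginal penetrances differ, which is why the slide conditions on equal frequencies.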

6
A model for genetic heterogeneity and epistasis
using multiplicative effects
At each locus, q0, q1, q2 are the penetrances for 0,
1, 2 risk alleles
[Figure: loci 1 2 3 4 5 6 7 8 . . .]
Individual risk calculated as (equation not preserved in this transcript)
from Lunetta et al. 2004. BMC Genetics 5
7
A model based on a known developmental process
8
Simulation details
  • 0100011010101010101000110111101
  • 0111010100011010111001100101011

6 active loci
100 neutral loci
  • All loci in Hardy-Weinberg Equilibrium
  • No linkage disequilibrium
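
One way to generate data with these properties is sketched below. It assumes that alleles are coded 0/1 as in the bit strings above, and that an individual is simply a pair of independently drawn haplotypes. The allele frequencies and function names are illustrative assumptions. Drawing every locus independently gives no linkage disequilibrium, and pairing two independent haplotypes gives Hardy-Weinberg genotype proportions.

```python
import random

N_ACTIVE = 6     # active loci, from the slide
N_NEUTRAL = 100  # neutral loci, from the slide


def simulate_haplotype(freq_of_0, rng):
    """One haplotype: each locus is drawn independently (no LD)."""
    return [0 if rng.random() < f else 1 for f in freq_of_0]


def simulate_individual(freq_of_0, rng):
    """Two independent haplotypes, which yields HWE genotype proportions."""
    return simulate_haplotype(freq_of_0, rng), simulate_haplotype(freq_of_0, rng)


def copies_of_0(individual):
    """Genotype at each locus, coded as the number of copies of allele 0."""
    hap1, hap2 = individual
    return [(a == 0) + (b == 0) for a, b in zip(hap1, hap2)]


rng = random.Random(1)
# Assumed frequencies for illustration: 0.5 at active loci,
# uniform on (0.1, 0.9) at neutral loci.
freq_of_0 = [0.5] * N_ACTIVE + [rng.uniform(0.1, 0.9) for _ in range(N_NEUTRAL)]
population = [simulate_individual(freq_of_0, rng) for _ in range(200)]
genotypes = [copies_of_0(ind) for ind in population]
print(genotypes[0][:10])
```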

9
Genetic determination of developmental parameters
For each developmental parameter:
parameter = (value set by genotype) + e, where e ~ N(0, parameter noise)
Simulate development
10
Mapping morphogen gradient and threshold to
disease phenotype
[Figure: morphogen gradient and threshold for DevMod1 and DevMod2; labels: threshold, no spot, diseased, not diseased]
If radius > C: diseased. If radius < C: not diseased.
11
Standard Chi-Square vs. RandomForests
  • Multiple samples created, each drawn from a
    population with a different array of allele
    frequencies
  • all combinations tried where the freq of the 0
    allele is 0.1, 0.5, or 0.9 at each of the 6
    developmental parameter loci (3^6 = 729 samples)
  • genotypic relative risk is highly dependent upon
    genetic background
  • freq of the 0 allele at neutral loci drawn from a
    uniform distribution on (0.1, 0.9)
  • 1000 cases and 1000 controls in each sample
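
The 729 samples follow from three candidate frequencies at each of six active loci (3^6 = 729). A minimal sketch of enumerating that grid follows. The function name is hypothetical, and redrawing the neutral-locus frequencies for every sample is an assumption the slides do not confirm.

```python
import random
from itertools import product

ACTIVE_FREQS = (0.1, 0.5, 0.9)  # freq of the 0 allele at each active locus
N_ACTIVE = 6
N_NEUTRAL = 100


def sample_designs(seed=0):
    """Yield one list of 106 allele frequencies per simulated sample."""
    rng = random.Random(seed)
    for active in product(ACTIVE_FREQS, repeat=N_ACTIVE):
        # Assumption: neutral frequencies are redrawn for every sample.
        neutral = [rng.uniform(0.1, 0.9) for _ in range(N_NEUTRAL)]
        yield list(active) + neutral


designs = list(sample_designs())
print(len(designs))  # 729
```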

12
Standard Method: Genotype Case-Control
[Table: cases vs. controls by copies of 0 allele]
Bonferroni correction applied: α = 0.05/106
Gibson and Muse. 2002. A Primer of Genome
Science
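
As a rough sketch of the per-locus test, the code below runs a Pearson chi-square on a 2 x 3 table of cases and controls by number of copies of the 0 allele, then compares the p-value with the Bonferroni threshold of 0.05/106. The counts are invented for illustration and the function name is hypothetical. The sketch assumes every genotype class is observed at least once, so the test has 2 degrees of freedom.

```python
import math

N_LOCI = 106             # 6 active + 100 neutral loci
ALPHA = 0.05 / N_LOCI    # Bonferroni-corrected significance level


def genotype_chi_square(case_counts, control_counts):
    """Pearson chi-square on a 2 x 3 genotype table; returns (stat, p_value).

    Assumes every genotype column has a non-zero total.
    """
    table = [case_counts, control_counts]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, the chi-square survival function is exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Invented counts for 0, 1, 2 copies of the 0 allele (1000 cases, 1000 controls).
stat, p_value = genotype_chi_square([300, 500, 200], [250, 500, 250])
significant = p_value < ALPHA
print(stat, p_value, significant)
```

The closed-form p-value holds only for 2 degrees of freedom. A table with a different shape would need a general chi-square distribution function.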
13
Classification Tree
14
A Random Forest
Data Set
Bootstrapped sample
Bootstrapped sample
Bootstrapped sample
...
Create 500 samples
Split each node choosing only from a random subset
of variables (mtry = 10). Trees are not pruned.
To classify a new observation, use the majority vote
from the forest.
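
The procedure on this slide can be sketched in plain Python. This is a simplified illustration rather than the implementation used in the study: the Gini split criterion, the toy data, and all function names are assumptions, and a real analysis would normally rely on an established random forest package.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:  # keeps both sides non-empty
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, mtry, rng):
    """Grow an unpruned tree; a leaf is stored as its majority label."""
    if len(set(labels)) == 1:
        return labels[0]
    n_features = len(rows[0])
    features = rng.sample(range(n_features), min(mtry, n_features))
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], mtry, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], mtry, rng))


def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def grow_forest(rows, labels, n_trees=500, mtry=10, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Toy data (invented): 12 SNPs coded 0/1/2; the label depends only on SNP 0.
data_rng = random.Random(42)
rows = [[data_rng.randrange(3) for _ in range(12)] for _ in range(60)]
labels = [int(r[0] >= 1) for r in rows]
forest = grow_forest(rows, labels, n_trees=25, mtry=10)
predictions = [predict_forest(forest, r) for r in rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)
```

The defaults mirror the slide (500 bootstrap samples, mtry = 10). The toy run uses 25 trees only to keep it fast. Labels must not be tuples, because tuples mark internal nodes in this sketch.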
15
Distribution of importance values
16
Chi-square test vs. RandomForest
[Figure: comparison plots; axis label: Chi-square]
17
Conclusions
  • RandomForest slightly outperforms standard
    Chi-square test.
  • Both methods do perhaps too well.
  • Redefinition of disease phenotype affected
    ability to detect disease loci

18
Next Steps
  • Create more challenging G to P mapping
  • More realistic penetrance values, genotypic
    relative risks
  • Integrate disease development model with
    realistic haplotype data (Fred Wright).
  • Test more Data Mining techniques
  • Collaborate with Alex Tropsha's group (QSAR plus
    unbundled software from his workflow)
  • Association Rule Mining (AFI approximate
    frequent itemset mining).

19
More questions
  • Do all the confounding effects --
  • gene-gene interactions (epistasis)
  • gene-environment interactions
  • marker allele-trait allele associations
    (population history)
  • genetic heterogeneity
  • high-dimensionality (thousands of SNPs)
  • all interact strongly to determine the
    performance of a classification method?
  • Must we test everything simultaneously?

20
Should we be looking at Feature Selection?
  • Most users want good classification accuracy,
    but we want to classify the predictors
    themselves

relevant (disease locus) vs. irrelevant (neutral locus)