Title: Realistic Simulation of Genotypes for an Association Mapping Bakeoff
Fred Wright, Kirk Wilhelmsen, Xiaojun Guan, Kevin Gamiel, William Barry
- In order to evaluate the efficacy of various statistical methods for association mapping, we need an approach to generate realistic datasets
- The population genetics of humans is complicated and difficult to simulate
- Moreover, how can we make results applicable to real SNP platforms for genome-wide scans?
- We thus chose to sample from true (HapMap) data, for which most SNPs on major platforms are represented
- The procedure is more general, however, and requires only a pool of high-density haplotypes
Slide 2: The HAP-SAMPLE Simulation Approach
[Figure labels: chromosome pool; control chromosomes; case chromosomes; origin; disease SNP allele values (0/1)]
Case genotypes are sampled according to specified probabilities. The simulation is appropriate for ancient mutations not under strong selection.
Slide 3: HAP-SAMPLE Input/Output
- Input: disease model, list of typed SNPs
- Simulation
- Output: genotypes or phased haplotypes, for case-control or affected-child trio designs
Slide 4: How are case disease SNP genotypes simulated?
- g: joint genotype of L disease loci (3^L of these)
- g0: referent joint genotype
- D: 0 if control, 1 if disease/case
- RR_g: relative risk of genotype g compared to g0
[Equation not recovered; its annotations read "Obtained from pool" and "specified"]
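One plausible reading of the definitions above, via Bayes' rule, is P(g | D=1) = RR_g P(g) / sum over g' of RR_g' P(g'), with the population frequencies P(g) taken from the pool and the relative risks RR_g specified by the user. The sketch below illustrates that calculation only. The Hardy-Weinberg, independent-loci frequencies stand in for pool-derived frequencies and are an assumption, as are all function names; the two-locus values are borrowed from loci L1 and L2 of Model 1 purely for illustration.

```python
from itertools import product


def hwe_freqs(maf):
    """Genotype frequencies for 0, 1, 2 minor alleles.

    Assumption: Hardy-Weinberg proportions stand in for frequencies
    that the real procedure would obtain from the haplotype pool.
    """
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def case_genotype_probs(pop_probs, rr):
    """P(g | D=1) = RR_g * P(g) / sum_g' RR_g' * P(g')."""
    weights = {g: rr[g] * pop_probs[g] for g in pop_probs}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Illustrative two-locus setup (values borrowed from Model 1, loci L1 and L2).
mafs = [0.4, 0.2]
per_locus_rr = [[1.0, 1.1, 1.5], [1.0, 1.2, 1.7]]

pop_probs = {}
rr = {}
for g in product(range(3), repeat=len(mafs)):
    p = 1.0
    r = 1.0
    for locus, geno in enumerate(g):
        p *= hwe_freqs(mafs[locus])[geno]  # assumes independent loci
        r *= per_locus_rr[locus][geno]     # assumes multiplicative joint RR
    pop_probs[g] = p
    rr[g] = r

case_probs = case_genotype_probs(pop_probs, rr)
```

Cases are enriched for high-risk joint genotypes and depleted for the referent genotype relative to the population.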
Slide 5
- The preceding slide assumes that controls can be thought of as random samples from the population
- But for non-rare diseases, this is not realistic
- Thus we need to simulate controls in a manner similar to cases
- True controls should show slightly reduced genotype frequencies of causative alleles compared to the general population
- Controls simulated in this manner are called anti-cases
- Even for fairly common diseases (e.g. prevalence > 5%), anti-case genotype probabilities (at disease loci) are similar to the general population
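The anti-case idea can be sketched the same way as for cases: if the baseline penetrance f0 is chosen so that the population prevalence K equals f0 times the mean relative risk, then P(g | D=0) = P(g) (1 - f0 RR_g) / (1 - K). This is a minimal illustration under that assumption, not the HAP-SAMPLE implementation; the function name and the single-locus relative risks are hypothetical.

```python
def anticase_genotype_probs(pop_probs, rr, prevalence):
    """P(g | D=0) = P(g) * (1 - f0 * RR_g) / (1 - K).

    f0 is the penetrance of the referent genotype, chosen so that the
    population prevalence K equals sum_g P(g) * f0 * RR_g.
    """
    mean_rr = sum(pop_probs[g] * rr[g] for g in pop_probs)
    f0 = prevalence / mean_rr
    return {
        g: pop_probs[g] * (1 - f0 * rr[g]) / (1 - prevalence)
        for g in pop_probs
    }


# Hypothetical single locus: MAF 0.2, Hardy-Weinberg genotype frequencies.
maf = 0.2
pop_probs = {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}
rr = {0: 1.0, 1: 1.2, 2: 1.7}

anti = anticase_genotype_probs(pop_probs, rr, prevalence=0.05)
max_shift = max(abs(anti[g] - pop_probs[g]) for g in pop_probs)
```

With a 5% prevalence the largest shift in any genotype probability is well under one percentage point, in line with the last bullet above.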
Slide 6: Examples from the HAP-SAMPLE paper show that simulated SNPs are discoverable
[Figure labels: Genome scan threshold; True causative SNP]
Slide 7: The Bakeoff Data
- 5 models
- 300,000 SNPs from the Illumina 300K platform simulated using HAP-SAMPLE
- 5000 cases, 5000 anti-cases for each model
- How many true disease SNPs per model?
- A: 4 or 5, depending on the model
- This implies 3^4 = 81 or 3^5 = 243 joint genotypes
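The joint-genotype counts follow from three genotypes (0, 1, or 2 copies of an allele) at each disease locus. A short sketch, with a hypothetical helper name, enumerates them:

```python
from itertools import product


def joint_genotypes(n_loci):
    """All joint genotypes: one value in {0, 1, 2} per disease locus."""
    return list(product((0, 1, 2), repeat=n_loci))


counts = {n: len(joint_genotypes(n)) for n in (4, 5)}
```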
Slide 8
Model 1: Multiplicative, disease prevalence = 5%
Locus    L1    L2    L3    L4    L5
MAF      0.4   0.2   0.1   0.1   0.2
RR contributions:
geno0    1     1     1.4   1.5   1
geno1    1.1   1.2   1.25  1     1.4
geno2    1.5   1.7   1     1     1.8

Model 2: Additive, disease prevalence = 5%
Locus    L1    L2    L3    L4    L5
MAF      0.4   0.2   0.1   0.1   0.2
RR contributions:
geno0    1     1     1.4   1.5   1
geno1    1.1   1.2   1.3   1     1.4
geno2    1.5   1.7   1     1     1.8
Slide 9
Model 3: Additive, disease prevalence = 5%
Locus    L1    L2    L3    L4    L5
MAF      0.4   0.2   0.1   0.1   0.2
RR contributions:
geno0    1     1     1.15  1.4   1
geno1    1.1   1.1   1.15  1     1.2
geno2    1.1   1.2   1     1     1.4

Model 4: Additive, disease prevalence = 0.01
Locus    L1    L2    L3    L4    L5
MAF      0.4   0.2   0.1   0.1   0.2
RR contributions:
geno0    1     1     1.4   1.4   1
geno1    1.1   1.1   1.2   1     1.2
geno2    1.1   1.2   1     1     1.4
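For the multiplicative model (Model 1), one natural reading is that the joint relative risk of a five-locus genotype is the product of the per-locus RR contributions. The sketch below tabulates the joint RRs under that assumption; the slides do not spell out the combination rule, and the corresponding rule for the additive models is not given, so it is not attempted here.

```python
from itertools import product

# Per-locus RR contributions for Model 1, indexed [locus][genotype].
MODEL1_RR = [
    [1.0, 1.1, 1.5],   # L1
    [1.0, 1.2, 1.7],   # L2
    [1.4, 1.25, 1.0],  # L3
    [1.5, 1.0, 1.0],   # L4
    [1.0, 1.4, 1.8],   # L5
]


def joint_rr_multiplicative(per_locus_rr):
    """Joint RR for every joint genotype, assuming contributions multiply."""
    table = {}
    for g in product(range(3), repeat=len(per_locus_rr)):
        value = 1.0
        for locus, geno in enumerate(g):
            value *= per_locus_rr[locus][geno]
        table[g] = value
    return table


joint_rr = joint_rr_multiplicative(MODEL1_RR)
lowest = min(joint_rr.values())
highest = max(joint_rr.values())
```

Under this reading the lowest joint RR is 1 (a referent genotype) and the highest is 1.5 x 1.7 x 1.4 x 1.5 x 1.8.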
Slide 10
Model 5: two independent effects, each from 2 epistatic loci, disease prevalence = 10%
[Figure not recovered; surviving label: MAF values]
Slide 11
- How do we know if we are simulating correctly?
- There are 243 joint disease genotypes (only 81 for model 5)
- For each model, each of these has an expected vs. observed frequency in cases and in anti-cases
- We can compare observed vs expected for the simulated data
- Chance variation should occur
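The observed-versus-expected check can be sketched as a Pearson goodness-of-fit comparison over joint-genotype cells. The example below is illustrative only: it uses a hypothetical two-locus setup with Hardy-Weinberg frequencies rather than the bakeoff models, draws 5000 individuals from the expected distribution, and compares the statistic to its degrees of freedom. Under correct simulation the statistic should be roughly the number of cells minus one, give or take chance variation.

```python
import random
from collections import Counter
from itertools import product


def pearson_chi_square(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected."""
    return sum(
        (observed.get(g, 0) - e) ** 2 / e for g, e in expected.items()
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical two-locus example, both loci with MAF 0.4 (assumption).
maf = 0.4
hwe = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
cells = list(product(range(3), repeat=2))
probs = [hwe[a] * hwe[b] for a, b in cells]

n = 5000
expected = {g: n * p for g, p in zip(cells, probs)}

# Stand-in for simulated data: draw joint genotypes from the expected model.
observed = Counter(rng.choices(cells, weights=probs, k=n))

stat = pearson_chi_square(observed, expected)
df = len(cells) - 1  # chi-square mean is df, standard deviation sqrt(2 * df)
```

A statistic far above df would suggest the simulated frequencies do not match the intended model.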
Slide 12: Comparing observed cell counts vs. expected, to see if consistent with chance variation
[Figure: expected cell counts for each of the 3^5 joint genotypes]
Slide 13
[Figure: expected cell counts for each of the 3^5 joint genotypes]
Slide 14: Misc thoughts
- Because the true disease SNPs are themselves on the typing platform, haplotype-based approaches should give no extra power
- The interaction models are a bit complicated. But it might be useful to order the genes by effect size.
- We defined a marginal effect size for each disease SNP using the expected chi-square statistic for the 3 (genotype) X 2 (case status) contingency table
[Equation not recovered; its annotations read "We plugged in expected value under true disease model" and "Expected under null hypothesis of no gene effect"]
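One way to read the marginal effect size is as the usual Pearson statistic for the 3 x 2 table, with the observed counts replaced by their expected values under the true disease model and the null expectations computed from the table margins. The sketch below follows that reading for a single locus; the function name, the MAF, and the relative risks are hypothetical, and the case and anti-case probabilities are derived under the Bayes-rule assumptions stated in the comments.

```python
def marginal_effect_size(case_probs, control_probs, n_cases, n_controls):
    """Pearson chi-square with model-expected counts plugged in as 'observed'."""
    n_total = n_cases + n_controls
    stat = 0.0
    for g in case_probs:
        o_case = n_cases * case_probs[g]        # expected under true model
        o_ctrl = n_controls * control_probs[g]  # expected under true model
        row_total = o_case + o_ctrl
        e_case = row_total * n_cases / n_total  # expected under no gene effect
        e_ctrl = row_total * n_controls / n_total
        stat += (o_case - e_case) ** 2 / e_case
        stat += (o_ctrl - e_ctrl) ** 2 / e_ctrl
    return stat


# Hypothetical locus: MAF 0.2 under Hardy-Weinberg, prevalence 5%.
maf, prevalence = 0.2, 0.05
pop = {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}
rr = {0: 1.0, 1: 1.2, 2: 1.7}

mean_rr = sum(pop[g] * rr[g] for g in pop)
f0 = prevalence / mean_rr  # penetrance of the referent genotype
case_probs = {g: pop[g] * rr[g] / mean_rr for g in pop}
anti_probs = {g: pop[g] * (1 - f0 * rr[g]) / (1 - prevalence) for g in pop}

effect = marginal_effect_size(case_probs, anti_probs, 5000, 5000)
null_effect = marginal_effect_size(pop, pop, 5000, 5000)
doubled = marginal_effect_size(case_probs, anti_probs, 10000, 10000)
```

The quantity is zero for a locus with no effect and scales linearly with sample size, which makes it convenient for ordering loci by effect size within a fixed design.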
Slide 15
- The results look reasonable at the disease loci
- Let the fun begin!