Realistic Simulation of Genotypes for an Association Mapping Bakeoff Fred Wright, Kirk Wilhelmsen, X


1
Realistic Simulation of Genotypes for an Association Mapping Bakeoff
Fred Wright, Kirk Wilhelmsen, Xiaojun Guan, Kevin Gamiel, William Barry

To evaluate the efficacy of various statistical methods for association mapping, we need an approach to generate realistic datasets.
The population genetics of humans is complicated and difficult to simulate.
Moreover, how can we make results applicable to real SNP platforms for genome-wide scans?
We thus chose to sample from true (HapMap) data, in which most SNPs on major platforms are represented.
The procedure is more general, however, and requires only a pool of high-density haplotypes.

2
The HAP-SAMPLE Simulation Approach
[Figure: a chromosome pool, with control chromosomes and case chromosomes drawn from it. Labels: disease SNP allele values (0 or 1), origin.]
Case genotypes are sampled according to specified probabilities.
The simulation is appropriate for ancient mutations not under strong selection.
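
The idea on this slide can be illustrated with a minimal Python sketch, under loud assumptions: the pool is a toy random 0/1 matrix (not HapMap data), the function name `sample_individual` and its parameters are hypothetical, and the sketch simply picks whole pool chromosomes that carry the required disease SNP allele. It is not the HAP-SAMPLE implementation, which may do more than this.

```python
import numpy as np

def sample_individual(pool, disease_idx, genotype_probs, rng):
    """Draw one individual's two chromosomes from a haplotype pool.

    pool           : (n_chromosomes, n_snps) array of 0/1 alleles (toy data here)
    disease_idx    : column index of the disease SNP (hypothetical parameter)
    genotype_probs : probabilities of carrying 0, 1, or 2 copies of allele 1
    """
    # Step 1: draw the disease SNP genotype (number of 1-alleles).
    n_ones = int(rng.choice(3, p=genotype_probs))
    alleles = [1] * n_ones + [0] * (2 - n_ones)
    # Step 2: for each required allele, pick a pool chromosome that carries it.
    rows = [rng.choice(np.flatnonzero(pool[:, disease_idx] == a)) for a in alleles]
    return pool[rows]

rng = np.random.default_rng(0)
pool = rng.integers(0, 2, size=(20, 8))  # toy pool: 20 chromosomes, 8 SNPs
pool[0, 3], pool[1, 3] = 0, 1            # ensure both alleles occur at SNP 3
case = sample_individual(pool, disease_idx=3, genotype_probs=[0.2, 0.5, 0.3], rng=rng)
```

Controls would be drawn the same way with a different `genotype_probs` vector.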
3
HAP-SAMPLE Input/Output
Input: disease model; list of typed SNPs.
Simulation.
Output: genotypes or phased haplotypes, for case-control or affected-child trio designs.
4
How are case disease SNP genotypes simulated?
g = joint genotype of L disease loci (3^L of these)
g0 = referent joint genotype
D = 0 if control, 1 if disease/case
RR_g = relative risk of genotype g compared to g0
[Formula not recovered from the slide; its terms were annotated "obtained from pool" and "specified".]
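
With the definitions above, Bayes' rule gives P(g | D=1) = RR_g P(g) / sum over g' of RR_g' P(g'). The sketch below computes this, assuming that P(g) is the population frequency obtained from the pool and that RR_g is the specified quantity. The function name and the single-locus numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def case_genotype_probs(pop_freq, rr):
    """P(g | D=1) for each joint genotype g, by Bayes' rule."""
    pop_freq = np.asarray(pop_freq, dtype=float)  # P(g), assumed to come from the pool
    rr = np.asarray(rr, dtype=float)              # RR_g relative to g0 (RR_g0 = 1)
    weights = rr * pop_freq
    return weights / weights.sum()

# Hypothetical single-locus example (L = 1, so 3 genotypes).
pop_freq = [0.64, 0.32, 0.04]
rr = [1.0, 1.5, 2.0]
case_probs = case_genotype_probs(pop_freq, rr)
```

Genotypes with a higher relative risk are over-represented among cases relative to their population frequency.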
5

The preceding slide assumes that controls can be thought of as random samples from the population.
But for non-rare diseases, this is not realistic.
Thus we need to simulate controls in a manner similar to cases.
True controls should show slightly reduced genotype frequencies of causative alleles compared to the general population.
Controls simulated in this manner are called anti-cases.
Even for fairly common diseases (e.g. prevalence > 5%), anti-case genotype probabilities (at disease loci) are similar to those in the general population.
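
A small sketch of the anti-case calculation, under assumptions: the penetrance P(D=1 | g) is taken to be proportional to RR_g and scaled so that the population prevalence is matched, and P(g | D=0) then follows from Bayes' rule. The function name and the single-locus numbers are hypothetical.

```python
import numpy as np

def anticase_genotype_probs(pop_freq, rr, prevalence):
    """P(g | D=0) for each genotype g, given population frequencies and relative risks."""
    pop_freq = np.asarray(pop_freq, dtype=float)
    rr = np.asarray(rr, dtype=float)
    # Penetrance P(D=1 | g), scaled so the average over the population equals the prevalence.
    # (Assumes the resulting penetrances do not exceed 1.)
    penetrance = prevalence * rr / np.sum(rr * pop_freq)
    return (1.0 - penetrance) * pop_freq / (1.0 - prevalence)

# Hypothetical single-locus example with 5% prevalence.
pop_freq = np.array([0.64, 0.32, 0.04])
rr = np.array([1.0, 1.5, 2.0])
anti = anticase_genotype_probs(pop_freq, rr, prevalence=0.05)
max_shift = float(np.max(np.abs(anti - pop_freq)))
```

In this toy example the anti-case probabilities differ from the population frequencies by less than 0.01, and the high-risk genotype is slightly depleted, in line with the statement on the slide.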

6
Examples from the HAP-SAMPLE paper show that simulated SNPs are discoverable.
[Figure labels: genome scan threshold; true causative SNP.]
7
The Bakeoff Data

5 models.
300,000 SNPs from the Illumina 300K platform, simulated using HAP-SAMPLE.
5000 cases and 5000 anti-cases for each model.
How many true disease SNPs per model?
A: 4 or 5, depending on the model.
This implies 3^4 or 3^5 joint genotypes.

8
Model 1: Multiplicative, disease prevalence = 5%
Locus               L1    L2    L3    L4    L5
MAF                 0.4   0.2   0.1   0.1   0.2
RR contributions
  geno0             1     1     1.4   1.5   1
  geno1             1.1   1.2   1.25  1     1.4
  geno2             1.5   1.7   1     1     1.8

Model 2: Additive, disease prevalence = 5%
Locus               L1    L2    L3    L4    L5
MAF                 0.4   0.2   0.1   0.1   0.2
RR contributions
  geno0             1     1     1.4   1.5   1
  geno1             1.1   1.2   1.3   1     1.4
  geno2             1.5   1.7   1     1     1.8
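
As an illustration of how the per-locus RR contributions of Model 1 could combine into joint-genotype relative risks, the sketch below multiplies the contributions across loci. Several things here are assumptions, not taken from the slides: that geno0/geno1/geno2 count copies of the minor allele, that population genotype frequencies follow Hardy-Weinberg proportions from the MAF, that the disease loci are independent, and that "multiplicative" means a plain product of the contributions.

```python
from itertools import product
import numpy as np

maf = [0.4, 0.2, 0.1, 0.1, 0.2]
# Rows: geno0, geno1, geno2; columns: L1..L5 (Model 1 values from the slide).
rr_contrib = np.array([
    [1.0, 1.0, 1.4, 1.5, 1.0],
    [1.1, 1.2, 1.25, 1.0, 1.4],
    [1.5, 1.7, 1.0, 1.0, 1.8],
])

def hwe(q):
    """Hardy-Weinberg genotype frequencies for 0, 1, 2 minor alleles (assumption)."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

geno_freq = [hwe(q) for q in maf]

# Enumerate all 3^5 = 243 joint genotypes with their frequency and joint risk.
joint = {}
for g in product(range(3), repeat=len(maf)):
    freq = np.prod([geno_freq[locus][g[locus]] for locus in range(len(maf))])
    risk = np.prod([rr_contrib[g[locus], locus] for locus in range(len(maf))])
    joint[g] = (float(freq), float(risk))
```

The resulting frequencies and risks are the kind of inputs needed to derive case and anti-case joint genotype probabilities.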
9
Model 3: Additive, disease prevalence = 5%
Locus               L1    L2    L3    L4    L5
MAF                 0.4   0.2   0.1   0.1   0.2
RR contributions
  geno0             1     1     1.15  1.4   1
  geno1             1.1   1.1   1.15  1     1.2
  geno2             1.1   1.2   1     1     1.4

Model 4: Additive, disease prevalence = 0.01
Locus               L1    L2    L3    L4    L5
MAF                 0.4   0.2   0.1   0.1   0.2
RR contributions
  geno0             1     1     1.4   1.4   1
  geno1             1.1   1.1   1.2   1     1.2
  geno2             1.1   1.2   1     1     1.4
10
Model 5: two independent effects of 2 epistatic loci, disease prevalence = 10%
[Figure: MAF values.]
11

How do we know if we are simulating correctly?
There are 243 joint disease genotypes (only 81 for model 5).
For each model, each of these has an expected and an observed frequency in cases and in anti-cases.
We can compare observed vs. expected for the simulated data.
Some chance variation should occur.
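
A minimal sketch of such an observed-vs-expected check, using made-up cell probabilities for a handful of cells (not the bakeoff values) and counts simulated here rather than taken from HAP-SAMPLE output. Pearson residuals and the chi-square statistic are one common way to judge whether deviations look like chance variation; the slides do not say which summary the authors used.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases = 5000
# Hypothetical expected probabilities for five joint-genotype cells.
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
expected = n_cases * probs
observed = rng.multinomial(n_cases, probs)  # stand-in for simulated cell counts

# Pearson residuals: roughly standard normal if only chance is at work.
residuals = (observed - expected) / np.sqrt(expected)
# Pearson chi-square; compare with a chi-square distribution on (cells - 1) df.
chi_square = float(np.sum(residuals ** 2))
```

With 243 cells, some expected counts will be small, so sparse cells may need to be pooled before a chi-square comparison is trusted.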

12
Comparing observed vs. expected cell counts, to see whether they are consistent with chance variation.
Expected cell counts for each of the 3^5 joint genotypes
13
Expected cell counts for each of the 3^5 joint genotypes
14

Misc thoughts
Because the true disease SNPs are themselves on the typing platform, haplotype-based approaches should give no extra power.
The interaction models are a bit complicated, but it might be useful to order the genes by effect size.
We defined a marginal effect size for each disease SNP using the expected chi-square statistic for the 3 (genotype) x 2 (case status) contingency table.
[Formula not recovered from the slide; its annotations read "we plugged in the expected value under the true disease model" and "expected under the null hypothesis of no gene effect".]
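
One way to read the marginal effect size described above is sketched below: the expected 3 x 2 table under the true disease model is plugged in where the observed counts would normally go, and compared with the table expected under the null of no gene effect (same margins, independence). The function name, its parameters, and the example probabilities are hypothetical; the exact formula on the slide was not recoverable.

```python
import numpy as np

def expected_chi_square(p_case, p_control, n_case, n_control):
    """Chi-square statistic with the true-model expected 3 x 2 table plugged in as 'observed'."""
    # Expected counts under the true disease model: rows = genotypes, columns = case status.
    true_counts = np.column_stack([
        n_case * np.asarray(p_case, dtype=float),
        n_control * np.asarray(p_control, dtype=float),
    ])
    # Expected counts under the null: (row total * column total) / grand total.
    row = true_counts.sum(axis=1, keepdims=True)
    col = true_counts.sum(axis=0, keepdims=True)
    null_counts = row * col / true_counts.sum()
    return float(np.sum((true_counts - null_counts) ** 2 / null_counts))

# Hypothetical marginal genotype probabilities at one disease SNP.
effect = expected_chi_square([0.60, 0.34, 0.06], [0.64, 0.32, 0.04], 5000, 5000)
no_effect = expected_chi_square([0.5, 0.4, 0.1], [0.5, 0.4, 0.1], 5000, 5000)
```

Sorting the disease SNPs by this value gives an ordering by marginal effect size, which is the use suggested on the slide.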
15