Title: Developing Odds Ratio Estimation under a Multistage Design
1 Developing Odds Ratio Estimation under a Multistage Design
2 Outline
- Background
- Multistage Design
- Estimation under Multistage Design
- Simulation Results
3 Background, readiness, potential
- Genome-wide association studies are now underway, enabled by
- Rapidly decreasing genotyping costs
- Genotyping costs in the vicinity of $0.01/SNP. May continue to fall?
- Massively multiplexed genotyping technologies
- Large-scale SNP discovery
- For example, David Cox and Perlegen Sciences have identified 250K tagging SNPs having estimated minor allele frequencies of 10% or larger across the genome. (Perlegen now uses a 360K tag SNP set, informed by HapMap2.)
- Could be used to provide insights into disease processes and mechanisms, to examine genotype interactions with preventive interventions, and to identify susceptibles for targeted disease screening efforts.
- From Dr. Ross Prentice
4 Association study sample size
- For some diseases, preceding linkage study results may suggest absence of strong associations
- Hence need large sample sizes, e.g., OR = 1.5 for presence of minor SNP allele (genotype risk ratio); n = number of cases (= number of controls)
                                      Test size
Minor allele frequency   Test power   0.05    0.01
0.1                      0.80          763    1211
0.1                      0.95         1301    1875
0.5                      0.80          325     515
0.5                      0.95          553     797
- (from Breslow and Day, 1987, V2)
- From Dr. Ross Prentice
5 Association study genotyping costs and approaches to reducing genotyping costs
- 1000 cases, 1000 controls, 250K SNPs at $0.01/SNP gives genotyping costs of $5 million.
- Also, conventional testing, even at α = 0.01, gives an expected 2500 false positives under the global null hypothesis, implying the need for a larger sample size.
- Approaches to reducing genotyping costs
- Reduced costs/SNP through further technology developments
- Reduce number of SNPs, perhaps restricted to SNPs in coding or regulatory regions of known genes
- Multistage design, perhaps testing at more extreme significance levels at later stages
- From Dr. Ross Prentice
6 2-Stage Design
[Flow diagram: randomly select a proportion of cases and controls and genotype the SNPs in the 1st stage; SNPs that meet the test criteria are genotyped in the remaining cases and controls in the 2nd stage; SNPs that again meet the test criteria are the significant SNPs.]
- e.g., 500 cases and 500 controls at first stage with α = 0.05 on 250K SNPs
- followed by 1000 cases and 1000 controls at a second stage using the approximately 12,500 SNPs significant at first stage (and α = 0.001).
- Under global null hypothesis 12.5 false positives, and genotyping costs of $2.625 million.
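The cost and false-positive figures quoted for the one-stage and two-stage designs follow from simple arithmetic, sketched below. The function name is hypothetical, and the reading of the second stage as 500 new cases plus 500 new controls (so 1000 of each cumulatively) is an assumption: it is the reading that reproduces the quoted two-stage cost of 2.625 million.

```python
def genotyping_cost(n_subjects, n_snps, cost_per_snp=0.01):
    """Cost of genotyping n_snps SNPs on n_subjects people (same currency unit as the slides)."""
    return n_subjects * n_snps * cost_per_snp


N_SNPS = 250_000

# One-stage design: 1000 cases + 1000 controls on every SNP, tested at alpha = 0.01.
one_stage_cost = genotyping_cost(2000, N_SNPS)
one_stage_false_pos = N_SNPS * 0.01  # expected false positives under the global null

# Two-stage design: alpha = 0.05 at stage 1, alpha = 0.001 at stage 2.
alpha1, alpha2 = 0.05, 0.001
snps_carried = N_SNPS * alpha1                 # expected SNPs passing stage 1 under the null
two_stage_false_pos = snps_carried * alpha2    # expected SNPs passing both stages

# ASSUMPTION: stage 1 genotypes 500 cases + 500 controls on all SNPs, and
# stage 2 genotypes 500 NEW cases + 500 NEW controls on the carried SNPs only.
two_stage_cost = genotyping_cost(1000, N_SNPS) + genotyping_cost(1000, snps_carried)

print(f"one stage:  cost {one_stage_cost:,.0f}, false positives {one_stage_false_pos:,.1f}")
print(f"two stage:  cost {two_stage_cost:,.0f}, false positives {two_stage_false_pos:,.1f}")
```

Under these assumptions the two-stage design roughly halves the genotyping cost while cutting the expected number of false positives from 2500 to 12.5.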
7 Association Testing Under a Multistage Design
- Individual-level data: observe x = 0, 1, 2 according to the number of minor alleles present for a SNP
- Logistic regression of case vs control status on x (and potential confounding factors) / s
- LR comparison of distribution of x between cases and controls
- Test statistic used on the data from stages 1, 2, ..., i or based on separate testing at each design stage.
- Testing at stage i can either be based on an inverse variance weighted log-odds ratio
- From Dr. Ross Prentice
8 Combined odds-ratio testing in a two-stage design
9 Summary: From Prentice et al. (2006)
- Hence, in summary, we are able to recommend a multistage design for high-dimensional SNP association studies.
- Testing from such a multistage design can take place with good power by considering log-odds ratio test statistics in an inverse variance weighted fashion from each design stage.
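As an illustration of combining log-odds ratios "in an inverse variance weighted fashion", the sketch below uses the standard fixed-effect form: each stage's estimate is weighted by the reciprocal of its squared standard error. The function name and the example numbers are hypothetical, and this generic form is an assumption rather than a confirmed transcription of the statistic in Prentice et al. (2006).

```python
import math


def combine_log_or(estimates, std_errors):
    """Inverse-variance weighted combination of per-stage log-odds ratios.

    Returns (combined estimate, its standard error, z statistic).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    combined = sum(w * b for w, b in zip(weights, estimates)) / total
    combined_se = math.sqrt(1.0 / total)
    return combined, combined_se, combined / combined_se


# Hypothetical stage-1 and stage-2 log-OR estimates with equal standard errors.
combined, se, z = combine_log_or([0.20, 0.10], [0.10, 0.10])
print(f"combined log-OR = {combined:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

With equal standard errors the combined estimate is the simple average of the two stages, and its standard error shrinks by a factor of sqrt(2).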
11 Odds-ratio estimation under a two-stage design
- 1. Use final stage estimator
- 2. Use combined-stage estimator
12 Correction Essential
[Decision tree: H1 in the 1st stage branches on H1 > C1 or H1 < -C1; each branch leads to H2 in the 2nd stage, where H2 > C2 or H2 < -C2 is labeled Significant.]
13 Correction Essential
14 Correction
15 Correction
16 Correction Essential
[Decision tree: H1 in the 1st stage branches on H1 > C1 or H1 < -C1; each branch leads to H2 in the 2nd stage, where H2 > C2 or H2 < -C2 is labeled Significant.]
17 Correction
18 Confidence Interval
- Use a bootstrap method to get the confidence interval for the uncorrected combined log-OR estimator
- Resample N = n0 + n1 patients (with replacement) from the original sample and compute a new combined log-OR estimator (some resamples fail to pass through the two stages; ignore them)
- Repeat the above procedure B times, and get the empirical distribution of the bootstrap combined log-OR estimates
- Therefore, we get
- Using the correction equation
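The bootstrap steps above can be sketched as follows. Several choices here are assumptions, not the presenter's implementation: patients are resampled together with their stage label, the second-stage criterion is applied to the inverse-variance combined statistic, a percentile interval is used, and a closed-form allele-count log-OR stands in for the logistic regression fit to keep the block short. All function names are hypothetical, and the correction equation is not applied because it is not given in the text.

```python
import math
import random


def allelic_log_or(sample):
    """STAND-IN estimator: per-allele log-OR and SE from allele counts.

    sample is a list of (x, y), x in {0, 1, 2} minor alleles, y = 1 for cases.
    """
    n_case = sum(1 for _, y in sample if y == 1)
    n_ctrl = len(sample) - n_case
    a = sum(x for x, y in sample if y == 1)  # minor alleles in cases
    b = 2 * n_case - a                       # major alleles in cases
    c = sum(x for x, y in sample if y == 0)  # minor alleles in controls
    d = 2 * n_ctrl - c                       # major alleles in controls
    if min(a, b, c, d) <= 0:
        return None                          # estimator undefined for this sample
    return math.log(a * d / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def combined_log_or(stage1, stage2, c1, c2):
    """Inverse-variance combined log-OR, or None if the SNP fails either stage."""
    fit1, fit2 = allelic_log_or(stage1), allelic_log_or(stage2)
    if fit1 is None or fit2 is None:
        return None
    (b1, se1), (b2, se2) = fit1, fit2
    if abs(b1 / se1) <= c1:                  # fails the first-stage criterion
        return None
    w1, w2 = 1 / se1 ** 2, 1 / se2 ** 2
    b = (w1 * b1 + w2 * b2) / (w1 + w2)
    if abs(b) * math.sqrt(w1 + w2) <= c2:    # ASSUMPTION: stage 2 tests the combined z
        return None
    return b


def bootstrap_ci(stage1, stage2, c1, c2, B=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the uncorrected combined log-OR."""
    rng = random.Random(seed)
    pooled = [(x, y, 1) for x, y in stage1] + [(x, y, 2) for x, y in stage2]
    kept = []
    for _ in range(B):
        draw = rng.choices(pooled, k=len(pooled))  # resample N patients with replacement
        s1 = [(x, y) for x, y, s in draw if s == 1]
        s2 = [(x, y) for x, y, s in draw if s == 2]
        est = combined_log_or(s1, s2, c1, c2)
        if est is not None:                        # ignore resamples that fail the two stages
            kept.append(est)
    if not kept:
        raise ValueError("no bootstrap resample passed both stages")
    kept.sort()
    tail = (1 - level) / 2
    lo = kept[math.floor(tail * (len(kept) - 1))]
    hi = kept[math.ceil((1 - tail) * (len(kept) - 1))]
    return lo, hi, len(kept)
```

A call such as `bootstrap_ci(stage1, stage2, c1=1.96, c2=3.29, B=1000)` returns the interval endpoints together with the number of resamples that passed both stages, which is worth reporting because discarded resamples are themselves a form of selection.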
19 Simulation
- Simulate a SNP having two alleles N and D, with minor allele D frequency p = 10%; D is associated with a higher risk of getting disease
- X = number of minor alleles, X = 0, 1, or 2
- Assuming Hardy-Weinberg equilibrium, thus in the control group
- Assume D's genetic effect is additive (on log-scale)
- Assume risk ratio ≈ odds ratio for this rare disease
- log OR associated with X is (1/2) log(1.35) = 0.15
- In the case group
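One way to write the genotype distributions implied by these assumptions is sketched below. Under Hardy-Weinberg equilibrium the control-group probabilities of X = 0, 1, 2 are (1 - p)^2, 2p(1 - p), and p^2; with an effect that is additive on the log scale, the case-group probabilities are taken here to be the control probabilities reweighted by exp(beta * x) and renormalized. That reweighting is an assumption consistent with the stated model, and the function names are hypothetical.

```python
import math


def control_genotype_probs(p):
    """Hardy-Weinberg probabilities of X = 0, 1, 2 minor alleles."""
    q = 1.0 - p
    return [q * q, 2 * p * q, p * p]


def case_genotype_probs(p, beta):
    """Control probabilities tilted by exp(beta * x), then renormalized (assumed form)."""
    ctrl = control_genotype_probs(p)
    tilted = [ctrl[x] * math.exp(beta * x) for x in range(3)]
    total = sum(tilted)
    return [t / total for t in tilted]


p = 0.10
beta = 0.5 * math.log(1.35)  # per-allele log-OR from the slide, about 0.15

print("controls:", [round(v, 4) for v in control_genotype_probs(p)])
print("cases:   ", [round(v, 4) for v in case_genotype_probs(p, beta)])
```

By construction, the odds ratio comparing X = 2 with X = 0 between cases and controls is exp(2 * beta) = 1.35, and the ratio for X = 1 versus X = 0 is its square root.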
20 Simulation
- Control group and case group each have 2000 patients
- Two-stage design
- Randomly select 1000 cases and the matching controls for the 1st stage, and the remaining 1000 samples for the 2nd stage.
- At each stage, observe x = 0, 1, 2 according to the number of minor alleles present for a SNP. Logistic regression of case vs control status on x (and potential confounding factors)
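A minimal sketch of one simulated data set under this setup is given below, using only the standard library. It assumes unmatched sampling, no confounders, Hardy-Weinberg controls, and case genotype probabilities obtained by tilting the control probabilities by exp(beta * x); the logistic regression is fitted by a small hand-written Newton-Raphson routine on genotype counts. All function names are hypothetical.

```python
import math
import random


def genotype_probs(p, beta=0.0):
    """P(X = 0, 1, 2): HWE frequencies tilted by exp(beta * x); beta = 0 gives controls."""
    base = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    tilted = [base[x] * math.exp(beta * x) for x in range(3)]
    total = sum(tilted)
    return [t / total for t in tilted]


def fit_log_or(xs, ys, n_iter=25):
    """Logistic regression of y on x by Newton-Raphson; returns (slope, SE)."""
    n = [0, 0, 0]       # subjects per genotype
    cases = [0, 0, 0]   # cases per genotype
    for x, y in zip(xs, ys):
        n[x] += 1
        cases[x] += y
    a = b = 0.0         # intercept and per-allele log-OR
    for _ in range(n_iter):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for x in range(3):
            prob = 1 / (1 + math.exp(-(a + b * x)))
            resid = cases[x] - n[x] * prob
            w = n[x] * prob * (1 - prob)
            s0 += resid
            s1 += resid * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 ** 2
        a += (i11 * s0 - i01 * s1) / det   # Newton step: inverse information times score
        b += (i00 * s1 - i01 * s0) / det
    # SE from the information matrix of the last iteration (converged by then).
    return b, math.sqrt(i00 / det)


def simulate_two_stage(seed=0, n_per_group=2000, n_stage1=1000,
                       p=0.10, beta=0.5 * math.log(1.35)):
    """Simulate one data set and fit the log-OR separately at each stage."""
    rng = random.Random(seed)
    controls = rng.choices([0, 1, 2], weights=genotype_probs(p), k=n_per_group)
    cases = rng.choices([0, 1, 2], weights=genotype_probs(p, beta), k=n_per_group)
    fits = []
    # Draws are i.i.d., so slicing by position is a random split into stages.
    for lo, hi in [(0, n_stage1), (n_stage1, n_per_group)]:
        xs = controls[lo:hi] + cases[lo:hi]
        ys = [0] * (hi - lo) + [1] * (hi - lo)
        fits.append(fit_log_or(xs, ys))
    return fits


(b1, se1), (b2, se2) = simulate_two_stage(seed=1)
print(f"stage 1: log-OR = {b1:.3f} (SE {se1:.3f})")
print(f"stage 2: log-OR = {b2:.3f} (SE {se2:.3f})")
```

Repeating `simulate_two_stage` over many seeds, and keeping only the data sets that pass the stage criteria, is one way to examine how selection shifts the uncorrected estimator away from the true value of about 0.15.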
21 Bias
22 Correction curve and its sensitivity to sigma
23 Correction
24 95% Confidence interval
25 Simulation Results based on 1000 95% Confidence Intervals
26 Future Work
- Is it appropriate to use the bootstrap method in the presence of selection? Will it cause bias? May solve the problem of the CI length and conservative coverage rate.
- Asymptotic distribution of the corrected log-OR
- Generalize the correction method to various multistage designs, including those that use pooling at the first stage
- Apply the correction method to the genome-wide scan in the Women's Health Initiative
- Develop correction methods for multistage biomarker studies or other settings of multistage designs