EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 2: Patterns of LD and - PowerPoint PPT Presentation

About This Presentation
Description:

EPI293 Design and analysis of gene association studies Winter Term 2008 Lecture 2: Patterns of LD and tag SNP selection Peter Kraft pkraft_at_hsph.harvard.edu – PowerPoint PPT presentation

Number of Views:33
Avg rating:3.0/5.0
Date added: 16 August 2019
Slides: 112
Provided by: saru2
Transcript and Presenter's Notes


1
EPI293 Design and analysis of gene association studies
Winter Term 2008
Lecture 2: Patterns of LD and tag SNP selection
  • Peter Kraft
  • pkraft_at_hsph.harvard.edu
  • Bldg 2 Rm 2072-4271

2
Before HapMap: looking under the lamppost
Study 1: Popn A, small N, no assocn
Study 2: Popn A, large N, no assocn
Study 3: Popn B, large N, assocn
After HapMap
Study 2 revisited: Popn A, large N, assocn
3
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

4
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

5
Basic idea: linkage disequilibrium
(Figure: 16 sampled chromosomes, each carrying A or a at one locus and G or g at a second: 8 AG, 2 Ag, 6 ag, no aG)
Alleles at two (or more) loci are correlated on
chromosomes drawn at random from the population
6
Measures of linkage disequilibrium
  • Basic data table of haplotype frequencies

(Figure: 16 sampled chromosomes: 8 AG, 2 Ag, 6 ag, no aG)

        A       a
G       8       0       50%
g       2       6       50%
        62.5%   37.5%
7
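Linkage disequilibrium and marginal allele freqs.

        A                         a                         Total
G       $x = p_A p_G + \delta$    $y = q_A p_G - \delta$    $p_G$
g       $w = p_A q_G - \delta$    $z = q_A q_G + \delta$    $q_G$
Total   $p_A$                     $q_A$                     1

  • $p_A$, $p_G$ are (minor) allele frequencies
  • $q_A = 1 - p_A$, $q_G = 1 - p_G$
  • $\delta = xz - yw$ is a measure of departure from
    independence
  • No association between A and G $\Leftrightarrow$ $\delta = 0$
  • $\max(\delta) = \min(p_A q_G, p_G q_A)$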

8
        A       a       Total
G       n11     n10     n1.
g       n01     n00     n0.
Total   n.1     n.0

Measure        Ref.
$D'$           Lewontin (1964)
?2 = $r^2$     Hill and Weir (1994)
?              Levin (1953)
?              Edwards (1963)
Q              Yule (1900)
9
$D'$ and $r^2$ are most common
  • D prime ($D'$)
    • ranges from 0 (no LD) to 1 (complete LD)
    • is less sensitive to marginal allele
      frequencies
    • is directly related to recombination fraction
  • R squared ($r^2$)
    • also ranges from 0 to 1
    • is the squared correlation between alleles on the same
      chromosome
    • is very sensitive to marginal allele
      frequencies
    • is directly related to study power
      • If a marker M and causal gene G are in LD, then a
        study with N cases and controls which measures M
        (but not G) will have the same power to detect an
        association as a study with $r^2 N$ cases and
        controls that directly measured G
      • $r^2 N$ is the effective sample size
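
The effective-sample-size rule above is simple arithmetic, and a small sketch can make the cost of imperfect tagging concrete. The function names below are illustrative (they are not from the lecture), and the rule itself is the approximation stated on the slide.

```python
def effective_sample_size(n_subjects: float, r2: float) -> float:
    """Size of a study typing the causal variant G directly that would
    have about the same power as n_subjects typed at marker M."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return r2 * n_subjects


def required_sample_size(n_direct: float, r2: float) -> float:
    """Subjects needed when typing marker M to match the power of a
    direct study of G with n_direct subjects."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_direct / r2


if __name__ == "__main__":
    # A marker with r2 = 0.8 to the causal variant, 1000 cases and controls.
    print(effective_sample_size(1000, 0.8))
    print(required_sample_size(1000, 0.8))
```

Read the second function as the inflation factor 1 / $r^2$: the weaker the LD between marker and causal variant, the more subjects are needed to keep power constant.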

10
(Figure: 16 sampled chromosomes: 8 AG, 2 Ag, 6 ag, no aG)

        A       a
G       8       0       50%
g       2       6       50%
        62.5%   37.5%

$D' = (8 \cdot 6 - 0) / (8 \cdot 6) = 1$
$r^2 = (8 \cdot 6 - 0)^2 / (10 \cdot 6 \cdot 8 \cdot 8) = 0.6$
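
The worked example on this slide can be reproduced from the four haplotype counts. The sketch below is a minimal illustration, not course code: the function name and argument order are assumptions, and the cell labels follow the 2x2 frequency table (x = AG, y = aG, w = Ag, z = ag).

```python
def ld_measures(n_AG: int, n_Ag: int, n_aG: int, n_ag: int):
    """Return (D, D', r^2) from counts of the four two-locus haplotypes."""
    n = n_AG + n_Ag + n_aG + n_ag
    x, w, y, z = n_AG / n, n_Ag / n, n_aG / n, n_ag / n
    p_A, p_G = x + w, x + y          # marginal allele frequencies
    q_A, q_G = 1.0 - p_A, 1.0 - p_G
    d = x * z - y * w                # departure from independence
    # The largest attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_A * q_G, q_A * p_G)
    else:
        d_max = min(p_A * p_G, q_A * q_G)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * q_A * p_G * q_G
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Counts from the slide: 8 AG, 2 Ag, 0 aG, 6 ag.
    d, d_prime, r2 = ld_measures(8, 2, 0, 6)
    print(d, d_prime, round(r2, 3))
```

With these counts one haplotype (aG) is absent, which is why $D'$ reaches 1 even though $r^2$ is only 0.6.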
11
Computational detail
  • Haplotypes are rarely directly observed
  • Have to infer from genotype data
  • Double-heterozygote genotypes (Aa, Gg) are consistent with two
    haplotype pairs: AG/ag or Ag/aG
  • Most popular algorithm: Expectation-Maximization [1]
  • Related to, but not exactly equal to, 3x3 table of
    genotypes

          AA (2)   Aa (1)   aa (0)
BB (2)
Bb (1)
bb (0)

Correlation from this table makes no assumptions
about HWE (Weir, Genetic Data Analysis)
[1] Thomas pp. 243-245
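
As a rough illustration of the Expectation-Maximization idea for two biallelic SNPs, the sketch below estimates the four haplotype frequencies from a 3x3 table of genotype counts. It is a minimal version written for this transcript, assuming random mating and no missing data; the function name, the A/G locus labels and the table layout (rows indexed by copies of A, columns by copies of G) are illustrative choices, not taken from the lecture or from Thomas.

```python
def em_haplotype_freqs(geno_counts, max_iter=500, tol=1e-12):
    """geno_counts[i][j] = number of people with i copies of A and j copies of G.
    Returns estimated frequencies of haplotypes AG, Ag, aG, ag."""
    haps = ("AG", "Ag", "aG", "ag")
    n_people = sum(sum(row) for row in geno_counts)

    # Every genotype except the double heterozygote has a known haplotype pair.
    fixed = dict.fromkeys(haps, 0.0)
    for i in range(3):
        for j in range(3):
            if i == 1 and j == 1:
                continue
            first = ("A" if i >= 1 else "a") + ("G" if j >= 1 else "g")
            second = ("A" if i == 2 else "a") + ("G" if j == 2 else "g")
            fixed[first] += geno_counts[i][j]
            fixed[second] += geno_counts[i][j]

    n_double_het = geno_counts[1][1]
    freqs = dict.fromkeys(haps, 0.25)
    for _ in range(max_iter):
        # E-step: probability a double heterozygote carries AG/ag rather than Ag/aG.
        cis = freqs["AG"] * freqs["ag"]
        trans = freqs["Ag"] * freqs["aG"]
        w_cis = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = dict(fixed)
        expected["AG"] += n_double_het * w_cis
        expected["ag"] += n_double_het * w_cis
        expected["Ag"] += n_double_het * (1.0 - w_cis)
        expected["aG"] += n_double_het * (1.0 - w_cis)
        # M-step: haplotype frequencies are expected counts over 2N chromosomes.
        new_freqs = {h: expected[h] / (2 * n_people) for h in haps}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in haps) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


if __name__ == "__main__":
    # 3 people aa/gg, 4 double heterozygotes, 3 people AA/GG.
    print(em_haplotype_freqs([[3, 0, 0], [0, 4, 0], [0, 0, 3]]))
```

In the toy data only AG and ag haplotypes are ever seen unambiguously, so the algorithm resolves the double heterozygotes as AG/ag and the Ag and aG estimates shrink toward zero.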
12
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

13
Why does LD exist?
  • Recombination coldspots
  • Demographics (e.g. bottlenecks)
  • Population stratification or admixture
    • Confounds gene-disease association
    • Does not decay with distance

(among other reasons: selective pressure etc.)
14
Decay of LD in Pictures
15
Decay of LD: $\delta_T = \delta_0 (1 - \theta)^T$
(Figure panels: 1, 5, 10, 20, 40 and 80 generations)
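
A short numerical sketch of the decay formula, evaluated at the generation counts shown in the figure panels. The starting value and recombination fraction in the example are arbitrary illustrative numbers, and the function names are not from the lecture.

```python
import math


def ld_after_generations(delta_0: float, theta: float, generations: int) -> float:
    """LD remaining after a number of generations of random mating,
    given initial LD delta_0 and recombination fraction theta."""
    return delta_0 * (1.0 - theta) ** generations


def generations_to_halve(theta: float) -> float:
    """Number of generations for LD to fall to half its starting value."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    return math.log(0.5) / math.log(1.0 - theta)


if __name__ == "__main__":
    for t in (1, 5, 10, 20, 40, 80):
        print(t, round(ld_after_generations(0.25, 0.01, t), 4))
    print(round(generations_to_halve(0.01), 1))
```

Tightly linked loci (small recombination fraction) keep their LD for many generations, while unlinked loci (recombination fraction 0.5) lose half of it every generation.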
16
(No Transcript)
17
200 kbp from chr2, positions 51,783,239 to
51,983,238
Data from the ENCODE project: http://www.hapmap.org/downloads/encode1.html.en
18
Implications
  • Admixture can lead to false positives
    • Two unlinked loci can stay in LD
    • Recent admixture, continual gene flow problematic
  • Isolated populations have advantages for
    fine-mapping
    • LD extends long distances, so fewer markers need
      be typed
    • But resolution may be poor
  • Knowledge of local LD structure is essential for
    candidate gene studies!

19
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

20
Basic tagging design
Measure haplotypes/LD pattern in a
subsample (often external database)
Choose subset of SNPs (tagSNPs) that contain
majority of information
Genotype tagSNPs in main study, analyze
appropriately
21
ATM
22
ATM
23
Block: region of limited haplotype diversity
and/or high LD
24
But there are unappealing aspects of the
haplotype block idea
  • Definition and block finding algorithms are ad
    hoc
    • Different defns, algs lead to different block
      structures
    • Block structure changes with sample size, marker
      density
  • Hard boundaries are
    • unappealing for tagSNP selection (what about
      between blocks?)
    • inaccurate description of LD patterns (some
      haps overlap boundaries)
  • Plus, haplotypes present analytic challenges

Wall & Pritchard (2003a) Nat Rev Genet 4:587;
(2003b) AJHG 73:502; Nothnagel and Rohde (2005)
AJHG 77:988
25
CYP19
26
CYP19
27
(No Transcript)
28
Keep it simple
  • We want SNPs that predict unobserved variants
  • Why not choose SNPs based on pairwise
    correlations?
  • Q: What if we don't know enough about common
    genetic variation to say we've captured it?
  • A: HapMap and resequencing projects

29
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

30
HapMap application in the design and
interpretation of association studies
Mark J. Daly, PhD, on behalf of The International HapMap
Consortium
OK, it may look like I'm totally
stealing these slides, but they are free on the
web at http://www.hapmap.org/tutorials.html.en
31
Goals of this segment
  • Briefly summarize HapMap design and current
    status
  • Discuss the application of HapMap to all aspects
    of association study design, analysis and
    interpretation

32
HapMap Project
A freely-available public resource to increase
the power and efficiency of genetic association
studies to medical traits
  • High-density SNP genotyping across the genome
    provides information about
  • SNP validation, frequency, assay conditions
  • correlation structure of alleles in the genome

All data is freely available on the web for
application in study design and analyses as
researchers see fit
33
HapMap Samples
  • 90 Yoruba individuals (30 parent-parent-offspring
    trios) from Ibadan, Nigeria (YRI)
  • 90 individuals (30 trios) of European descent
    from Utah (CEU)
  • 45 Han Chinese individuals from Beijing (CHB)
  • 45 Japanese individuals from Tokyo (JPT)

34
HapMap progress
PHASE I: completed, described in Nature paper
  • 1,000,000 SNPs successfully typed in all
    270 HapMap samples
  • ENCODE variation reference resource available
PHASE II: data generation complete, data released early November 2005
  • >3,500,000 SNPs typed in total !!!
Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A.
Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont,
A. Boudreau, P. Hardenbol, S. M. Leal, S.
Pasternak, D. A. Wheeler, et al. (2007). "A
second generation human haplotype map of over 3.1
million SNPs." Nature 449(7164): 851-61.
35
ENCODE-HAPMAP variation project
  • Ten typical 500kb regions
  • 48 samples sequenced
  • All discovered SNPs (and those in dbSNP) typed in
    all 270 HapMap samples
  • Current data set: 1 SNP every 279 bp

A much more complete variation resource by
which the genome-wide map can be evaluated
36
Completeness of dbSNP
Vast majority of common SNPs are contained in or
highly correlated with a SNP in dbSNP
37
Recombination hotspots are widespread and account
for LD structure
7q21
38
Coverage of Phase II HapMap (estimated from
ENCODE data)

Panel      $r^2 > 0.8$    max $r^2$
YRI        81%            0.90
CEU        94%            0.97
CHB+JPT    94%            0.97

Vast majority of common variation (MAF > .05)
captured by Phase II HapMap
From Table 6, "A Haplotype Map of the Human
Genome", Nature
39
Applying the HapMap
  • Study design - tagging
  • Study coverage evaluation
  • Study analysis - improving association testing
  • Study interpretation
  • Comparison of multiple studies
  • Connection to genes/genomic features
  • Integration with expression and other functional
    data
  • Other uses of HapMap data
  • Admixture, LOH, selection

40
Tagging from HapMap
  • Since HapMap describes the majority of common
    variation in the genome, choosing non-redundant
    sets of SNPs from HapMap offers considerable
    efficiency without power loss in association
    studies

41
(No Transcript)
42
Pairwise tagging
Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3, SNP 6
After Carlson et al. (2004) AJHG 74:106
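
One way to picture pairwise tagging is as a greedy covering procedure: repeatedly pick the SNP that captures the most still-uncaptured SNPs at the chosen $r^2$ threshold. The sketch below is written in that spirit and is not a reimplementation of the published Carlson et al. algorithm or of Haploview's tagger. It assumes phased haplotypes coded as 0/1 tuples, and all function names and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_r2(haplotypes, i, j):
    """Squared correlation between alleles at SNPs i and j across haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ij - p_i * p_j
    return d * d / denom


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Return 0-based indices of tag SNPs such that every SNP is either a tag
    or has r^2 >= threshold with at least one tag."""
    n_snps = len(haplotypes[0])
    captures = {s: {s} for s in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if pairwise_r2(haplotypes, i, j) >= threshold:
            captures[i].add(j)
            captures[j].add(i)
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Ties are broken in favour of the lowest SNP index.
        best = max(sorted(untagged), key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


if __name__ == "__main__":
    # Toy data: SNPs 0 and 1 are identical, as are 2 and 4, and 3 and 5.
    toy = [
        (1, 1, 1, 1, 1, 1), (1, 1, 1, 0, 1, 0), (1, 1, 0, 1, 0, 1), (1, 1, 0, 0, 0, 0),
        (0, 0, 1, 1, 1, 1), (0, 0, 1, 0, 1, 0), (0, 0, 0, 1, 0, 1), (0, 0, 0, 0, 0, 0),
    ]
    print(greedy_pairwise_tags(toy))
```

In the toy data three tags cover six SNPs, the same two-fold saving the slide illustrates; real tagging tools add refinements such as forcing in or excluding particular SNPs.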
43
Pairwise Tagging Efficiency
Table 7: Number of selected tag SNPs to capture all observed common SNPs in the Phase I HapMap for the three analysis panels using pairwise tagging at different $r^2$ thresholds

                          YRI        CEU        CHB+JPT
Pairwise $r^2 \ge 0.5$    324,865    178,501    159,029
Pairwise $r^2 \ge 0.8$    474,409    293,835    259,779
Pairwise $r^2 = 1$        604,886    447,579    434,476

Tag SNPs were picked to capture common SNPs in
release 16c.1 for every 7,000 SNP bin using
Haploview.
Tagging Phase I HapMap offers 2-5x gains in
efficiency
44
Use of haplotypes can improve genotyping
efficiency
Tags: SNP 1, SNP 3 (2 in total)
Test for association: SNP 1 captures 1+2; SNP 3 captures
3+5; "AG" haplotype captures SNP 4+6
Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3, SNP 6
Tags in multi-marker tests should be conditional
on significance of LD in order to avoid
overfitting
45
Efficiency and power
(Figure: relative power (%) vs. average marker density (per kb), for tag SNPs and random SNPs)
300,000 tag SNPs needed to cover
common variation in whole genome in CEU
P.I.W. de Bakker et al. (2005) Nat Genet, Advance
Online Publication, 23 Oct 2005
46
Will tag SNPs picked from HapMap apply to other
population samples?
Two issues:
  • What if LD structure strongly differs
    between my samples and the HapMap samples? Are
    CEU or YRI panels good surrogates for Latinos
    from Los Angeles? Are CEU samples even good
    surrogates for whites from France?
  • Is HapMap sample size sufficient? Small sample →
    correlation overestimated; are tagging algorithms
    overfitting the sample?
PK slide
47
Will tag SNPs picked from HapMap apply to other
population samples?
(Figure: three panels comparing CEU with whites from Los Angeles, CA; Botnia, Finland; and Utah residents with European ancestry (CEPH))
Population differences add very little
inefficiency
Paul de Bakker, Pac Symp Biocomput
2006
48
De Bakker et al (2006) Nat Genet
49
Need and Goldstein (2006) Nat Genet
50
Impact of training set sample size
Tags chosen as pairwise tags
Tags chosen as multimarker tags (up to 6 markers)
Zeggini et al Nature Genetics 37, 1320 - 1322
(2005)
51
Impact of training set sample size
Tags chosen for common variants
Tags chosen for common and rare variants
Zeggini et al Nature Genetics 37, 1320 - 1322
(2005)
52
Outline
  • Measures of linkage disequilibrium
  • Reasons for LD and empirical patterns of LD
  • Tagging SNPs
  • The HapMap project
  • Resources and tools for SNP selection

53
Public sources of SNP data
  • Candidate genes
    • Seattle SNPs: http://pga.gs.washington.edu/
    • Environmental Genome Project: http://egp.gs.washington.edu/
    • IIPGA: http://innateimmunity.net/IIPGA2/index_html
    • HAPMAP: http://www.hapmap.org/
    • BPC3: http://www.uscnorris.com/MECGenetics/
  • Genome-wide
    • HAPMAP
    • dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/
  • OMIM (Online Mendelian Inheritance in Man)

Resequencing data
54
Bioinformatics tools
  • https://innateimmunity.net/IIPGA2/Bioinformatics/
  • http://pga.gs.washington.edu/software.html
  • Haploview: http://www.broad.mit.edu/mpg/haploview/index.php
  • SNPSelector

55
So, OK, how should I select SNPs?
  • PubMed/lit search
    • Previous associations with your (or related)
      phenotype
    • GWAS!
    • Functional studies
  • Potentially functional variants
    • nsSNPs (perhaps ranked by SIFT or PolyPhen score)
    • Splice sites
    • Conserved regions
  • tagSNPs

56
SNP Selector, Bioinformatics 21:4181
http://primer.duhs.duke.edu/
57
SNP Selector, Bioinformatics 21:4181
58
(No Transcript)
59
Molecular genotyping methods
  • David G. Cox M.S. Ph.D.
  • Instructor of Epidemiology
  • dcox_at_hsph.harvard.edu
  • Bldg. 2 Rm. 211
  • (617) 432-2262

60
Overview
  • How it works
  • Considerations in choosing a method
  • Quality Control (QC)
  • Organizing your data
  • Completing the study

61
PCR
  • Rapid, versatile in vitro method for amplifying
    defined target DNA sequences to yield multiple
    copies of a specific region of DNA sequence
  • 1980s, K. Mullis invented PCR
  • Won Nobel Prize in 1993
  • Applications for basic science, epidemiology,
    evolution, linkage analysis, forensics,
    anthropology

62
PCR (2)
  • Allows for screening of uncharacterized mutations
  • Rapid genotyping for polymorphic markers
  • Detecting point mutations

63
PCR cycle
  • Three steps
    • Denaturation
      • Denature DNA to separate strands
    • Annealing
      • Primers bind to strands
    • Extension
      • Polymerase synthesizes new strands

64
PCR cycle (2)
  • Reaction mixture proceeds through repeated cycles
    of primer annealing, DNA synthesis, and
    denaturation
  • Target sequence concentration increases
    exponentially with each cycle
  • Each newly synthesized DNA strand acts as a
    template for further DNA synthesis in subsequent
    cycles
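
The exponential growth described above is easy to quantify. In the sketch below the `efficiency` parameter (the fraction of templates copied per cycle) is an illustrative assumption rather than something stated in the lecture; with perfect efficiency the copy number simply doubles each cycle.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of target copies after a number of PCR cycles.

    efficiency = 1.0 means every template is copied in every cycle
    (perfect doubling); real reactions fall somewhat short of this.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, pcr_copies(1, n))
```

Thirty cycles of perfect doubling turn one template into about a billion copies, which is why PCR needs so little starting DNA.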

65
Denaturation
66
Annealing
67
Extension
68
(No Transcript)
69
(No Transcript)
70
Main assays used
  • Looking to optimize three things
    • Cost of genotyping
    • Speed of genotyping
    • Reliability of data
  • Three main categories
    • Low-plexed
      • Usually PCR based
    • High-plexed
      • PCR or non-PCR based
    • Mega-plexed
      • Non-PCR based

71
PCR-based methods
  • Plex: number of separate assays in an individual
    tube
  • Single to low-plexed
    • Usually limited to number of tags
      • Either fluorescent or mass
      • Tags are expensive part
  • Micro scale reactions
  • Low start-up costs
    • Robotics not necessary
    • Machines in many labs

72
Taqman
73
BioTrove
  • Miniaturized Taqman
  • Primers and probes spotted into holes
  • Taqman reaction exactly the same
  • Reduces cost
  • Lowers quantity of probe and master mix
  • Still need to order a minimum amount of primer
    and probe

74
iPLEX
75
Non-PCR-based methods
  • Usually rely on some sort of genome wide
    amplification step
  • Hybridization techniques increase plex
  • Stick DNA to some sort of chip
  • Chips are roughly the size of microscope slides
  • Nano scale reactions
  • High start-up costs for machines and robotics
  • Core facilities normally used

76
Illumina products
  • Highly multiplexed assays
    • From 384 to 1M SNPs
    • Custom chips designed up to 72k
    • GWAS products of 500k and 1M
      • Based on pair-wise tagging of SNPs from HapMap
  • Use specially etched holes
    • Solves spotting problem
    • Addressing system

77
Goldengate
78
Infinium
79
Affymetrix
  • GWAS chip
  • Over 1.8M features
  • SNPs
  • 500k from earlier version
  • Evenly spaced across genome
  • 500k additional SNPs
  • Tag
  • X/Y
  • mtDNA
  • New SNPs
  • Hotspots
  • CNVs
  • 200k specifically targeted to CNVs
  • 750k additional probes across genome

80
Quick word on CNVs
  • One of the latest crazes in genetic epi
  • Copy Number Variants
  • Either more (3) or fewer (1) copies of a genetic
    region present
  • Polymorphic regions of varying zygosity
  • Detected as Mendelian errors in HapMap
  • Behave (from a population genetic standpoint)
    like any other polymorphism
  • Still not well characterized
  • i.e. regions with high homology can show up as
    CNVs
  • Genotyped using quantification of genotype signal

81
Back to Affy
82
Affy vs. Illumina
  • Affymetrix
    • Earlier product
    • Began with 100k
    • Assay and software issues
    • Costs have drastically declined
    • SNP coverage has drastically increased
    • tagSNPs added
    • WGA DNA OK
  • Illumina
    • Later product
    • Began with 500k
    • Better assay and software design (originally)
    • Cost issues
    • SNP coverage relatively constant
    • Always based on HapMap
    • WGA DNA discouraged

83
Genotype Clustering
84
Clustering continued
  • Low-plex assays
    • Usually done by eye by a technician
    • Can be labor intensive and subject to user bias
  • High- and mega-plex assays
    • Usually computer assisted or completely automated
    • Less labor intensive but subject to clustering
      errors

85
Best case scenario
86
Software clustering
87
Human clustering
88
What would you do with this?
89
Or this?
90
So you want to genotype?!?
  • Three main things to consider
    • Number of SNPs
    • Number of samples
    • Budgetary considerations
  • Minor considerations
    • DNA source
    • DNA quantity/quality

91
So you want to genotype?!?
Biotrove
92
Budgetary Considerations
  • Low plex
    • Per SNP cost normally doesn't decrease as the
      number of SNPs increases
    • Per genotype cost may decrease as sample size
      increases
    • Overall study cost can be low (Ks)
  • Higher plex
    • Per SNP cost decreases as you get closer to the
      maximum plex
    • Per genotype cost decreases drastically as plex
      goes up
    • Overall study cost can be high (Ms)

93
And the BIG question
94
How to minimize costs while maximizing genotyping
  • Genotype the right number of samples
  • Fill plates
  • Find the sweet spot in assay ordering
  • Genotype the right number of SNPs
  • If you can fill the beads, your per genotype cost
    goes down without drastically increasing the
    total cost

95
Quality control
  • Low-plex
    • Blinded QC
      • Repeated samples
      • 10% of the total sample size
    • >95% completion rate
    • Easy to repeat individual plates to correct any
      errors
  • High- to mega-plex
    • Blinded QC
      • One or two samples per plate
      • Same samples on every plate
    • Set both SNP and sample completion rates
    • Not easy to repeat plates

96
Data overload
  • Low-plex
  • Data trickles in
  • Little need for elaborate databases
  • Assay description
  • Primer/probe sequence
  • Alleles
  • SNP description
  • Locus (usually rs sufficient)
  • Relational db for genotype data
  • ID x genotype

97
Data overload
  • High- to mega-plex
  • Data deluge
  • Up to 1M SNPs worth of data at once
  • Annotation of SNPs
  • Assay characteristics
  • SNP characteristics
  • Large sample sizes
  • 1536x1000 samples is over 1.5 million data points

98
Data analysis issues
  • Low-plex
    • Often <10 SNPs per study
    • Easy/quick to analyze
    • Data presentation simple
    • Data archival simple
    • Multiple comparison issues have largely been
      ignored
  • High- to mega-plex
    • Massive data sets
    • Even summary stats for all the SNPs take hours
    • Need to be able to access individual SNPs as well
    • Presenting 1536 to 1M SNPs worth of data is a
      challenge
    • Multiple comparison issues more obvious

99
In summary
  • Genotyping is now a numbers game
  • Methods are VERY accurate
  • Budgets are tighter
  • Considerations
  • Number of SNPs
  • Number of samples
  • Quantity/Quality of DNA
  • Feel free to contact me regarding DNA sources etc.

100
Online resources (and slide sources)
  • Taqman (appliedbiosystems.com)
  • iplex (sequenom.com)
  • Illumina products (Illumina.com)
  • Affymetrix products (Affymetrix.com)

101
Genotyping Quality Control
P Kraft
102
Quality control
Method of assessment      High quality standard
Completion rate           > 95% complete. High failure rate correlated with high error rate
Reproducible genotypes    Repeat genotyping of random 5% sample has < 1% discordance
Hardy-Weinberg            Single loci: no significant deviations or small magnitude of deviation. Multiple loci: no more deviations than expected (q-q plot), no consistent trend (e.g. all undercalling hets)
Non-paternity             Where family data available: remove all non-paternities

See Leal (2005) Genet Epidemiol, Cox & Kraft
(2006) Hum Hered, Abecasis (2005) Am J Hum Genet
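
The first two checks in the table reduce to simple proportions. The sketch below assumes genotype calls are stored as strings with `None` marking a failed call; that data layout and the function names are illustrative, not part of the lecture.

```python
def completion_rate(calls):
    """Fraction of attempted genotypes that produced a call."""
    return sum(c is not None for c in calls) / len(calls)


def discordance_rate(original, repeat):
    """Fraction of duplicate pairs whose two calls disagree, counting only
    pairs where both the original and the repeat produced a call."""
    pairs = [(a, b) for a, b in zip(original, repeat)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no duplicate pairs with both calls present")
    return sum(a != b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    calls = ["AA", "AG", None, "GG", "AG", "AA", "AA", "GG", "AG", "AA"]
    repeats = ["AA", "AG", "AG", "GG", "AG", "AA", "AG", "GG", "AG", None]
    print(completion_rate(calls))
    print(discordance_rate(calls, repeats))
```

Against the standards in the table, an assay would pass with a completion rate above 0.95 and a discordance rate below 0.01; the toy data here fail both.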
103
Testing for departure from HWE
Genotype   aa    Aa    AA
Count      N0    N1    N2

Observed A allele frequency: $p = (2 N_2 + N_1) / (2N)$, where $N = N_2 + N_1 + N_0$
Pearson's chi-square test for departures from
HWE
This should be compared to a central chi-squared
distribution with one degree of freedom
Implemented e.g. in SAS/Genetics PROC ALLELE
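
For readers without SAS, the test can be sketched in a few lines. This is the standard Pearson goodness-of-fit statistic, the sum over the three genotypes of (observed - expected)^2 / expected, with expected counts N(1-p)^2, 2Np(1-p) and Np^2; the function name and return layout are illustrative. The one-degree-of-freedom p-value uses the identity P(chi-square > x) = erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def hwe_chi_square(n_aa: int, n_Aa: int, n_AA: int):
    """Pearson chi-square test for departure from Hardy-Weinberg equilibrium.
    Returns (A allele frequency, chi-square statistic, p-value on 1 df)."""
    n = n_aa + n_Aa + n_AA
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    observed = (n_aa, n_Aa, n_AA)
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    # Skip cells with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


if __name__ == "__main__":
    # 30 aa, 40 Aa, 30 AA: fewer heterozygotes than the 50 expected under HWE.
    print(hwe_chi_square(30, 40, 30))
```

For small samples or rare alleles the chi-square approximation is poor, and an exact test is preferable.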
104
http://www.sph.umich.edu/csg/abecasis/Exact/index.html
105
(No Transcript)
106
Example from BPC3
delta = $(P_{Aa} - 2 p_A p_a) / (2 p_A p_a)$
107
9 of 22 tests significant at .05 level!
108
(Axis label: quantile of chi-square distribution)
A Q-Q plot compares two distributions by plotting
their quantiles against each other. Here it is
useful for assessing the similarity between the observed
distribution of test statistics and the theoretical
null distribution. Points should lie on the y = x
diagonal!
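
The coordinates of such a plot can be computed without a plotting library. The sketch below pairs each sorted observed statistic with the matching quantile of a one-degree-of-freedom chi-square distribution, obtained as the square of a standard normal quantile. The plotting positions (i + 0.5) / n are one common convention, chosen here as an assumption, and the function names are illustrative.

```python
from statistics import NormalDist


def chi2_1df_quantile(prob: float) -> float:
    """Quantile of the chi-square distribution with 1 df, for 0 < prob < 1.
    Uses the fact that a 1-df chi-square variable is a squared standard normal."""
    z = NormalDist().inv_cdf((1.0 + prob) / 2.0)
    return z * z


def qq_points(observed_stats):
    """Return (expected, observed) pairs for a chi-square Q-Q plot."""
    n = len(observed_stats)
    ordered = sorted(observed_stats)
    return [(chi2_1df_quantile((i + 0.5) / n), obs)
            for i, obs in enumerate(ordered)]


if __name__ == "__main__":
    stats = [0.1, 2.3, 0.5, 7.9, 1.2, 0.02, 3.6, 0.9]
    for expected, observed in qq_points(stats):
        print(round(expected, 3), observed)
```

Points well above the diagonal in the upper tail indicate more large test statistics than the null distribution predicts.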
109
Good
Better than before
Admixture?
110
CGEMS prostate cancer whole genome scan phase
1a, slide courtesy of G Thomas
111
Statistical Methods to Handle Errors
  • Family-based
    • AE-TDT: models both missing parental data and
      genotype errors
      • Am J Hum Genet. 2001 Aug;69(2):371-80.
      • Eur J Hum Genet. 2004 Sep;12(9):752-61.
    • Likelihood with nuisance parameters
      • Genet Epidemiol. 2004 Feb;26(2):142-54.
    • Bayesian
      • Genet Epidemiol. 2004 Jan;26(1):70-80.
  • Case-control
    • Rice & Holmans (2003) Ann J Hum Genet 29204
    • Gordon et al. (2004) Stat App Genet Molec Biol 3
    • Need locus-specific error rates
    • Difficult to get for high-throughput platforms

Differential genotyping error can lead to
inflated Type I error rates!
Nondifferential genotyping error does not
generally lead to inflated Type I error rates,
but can lead to loss of power, bias away from null