Title: Haplotype blocks and tagging SNPs
1Haplotype blocks and tagging SNPs
- Lecture 3
- Biostat 830
- Winter 2006
2Outline
- Evidence of haplotype block.
- Haplotype tagging SNPs.
- LD based tagSNPs.
3Haplotype blocks
- A subset of SNPs can capture the majority of the haplotype diversity observed within a region.
- Johnson et al. Nature Genetics 2001.
- Discrete haplotype blocks, each with limited diversity, punctuated by recombinations.
- Daly et al. Nature Genetics 2001.
- Haplotype structure reveals blocks of limited haplotype diversity.
- Patil et al. Science 2001.
4Reich et al. 2001
- Extended LD observed in parts of the genome.
- LD extends 60 kb in European samples, as opposed to the 3 kb suggested by Kruglyak's simulation estimates.
- An alternative explanation for the observed long-range LD is that the recombination rate in the regions studied might be markedly lower than the genome-wide average. This could happen if recombination occurred primarily in well-separated hotspots.
5Johnson et al.
- 135 kb across 9 genes.
- 122 SNPs.
- 384 Europeans.
- Found that 2-5 SNPs can be used to define the six or fewer common haplotypes (frequency > 5%) observed at each locus. These common haplotypes and their htSNPs account for > 80% of all haplotypes observed.
6Johnson et al. 2001.
7Daly et al.
- 5q31 region, 500 kb.
- Region contains a genetic risk factor for Crohn's disease.
- 103 SNPs (MAF > 5%), 1 SNP every 5 kb, the highest resolution at the time.
- 129 European trios.
9A Long Haplotype Block
Daly et al., Nat. Genet., 2001
10Daly et al.
- The region could be largely decomposed into discrete haplotype blocks, each with a striking lack of diversity.
- E.g., 2 haplotypes account for 95% of the observed chromosomes.
- Some common haplotypes across blocks are correlated, which gives rise to long-range LD.
- Indicates that inhomogeneous recombination rates (evident from the clustering of major recombination events) need to be considered in studying modern human population genetics.
11Patil et al. 2001
- Chromosome-wide study of chromosome 21.
- 24 ethnically diverse individuals.
- The two copies of chromosome 21 were separated using a rodent-human somatic cell hybrid technique, which allows direct determination of the full haplotypes (Douglas et al. 2001).
- 35,989 SNPs identified from 20 chromosomes.
- 24,047 SNPs whose minor allele appeared more than once.
12- 147 SNPs span 106 kb region.
- 18 blocks.
- 26 SNPs spanning 19kb region.
- 7 different haplotypes.
- The 4 most common account for 80% of all haplotypes.
- 2 htSNPs.
13Patil et al.'s results
- 24,047 common SNPs.
- 4,135 blocks of SNPs. The largest block contains 114 common SNPs and spans 115 kb of DNA. The average block length is 7.8 kb.
- 4,563 (19%) htSNPs needed to capture all the common haplotypes.
14Other related papers
- Jeffreys et al. Nature Genetics 2001.
- 216 kb class II region of the MHC.
- Gabriel et al. Science 2002.
- 51 autosomal regions spanning 13 Mb.
- Dawson et al. Nature 2002.
- Chromosome 22.
15Haplotype block
P53 knowledge base, figure created by HapBlock.
16Haplotype block (I)
- Patil et al. 2001, Zhang et al. 2002.
- A set of s consecutive SNPs which, although in theory it could generate many different haplotypes, in fact shows markedly fewer.
- E.g., a collection of consecutive SNPs such that a subset can account for 80% of all haplotypes observed.
- Outside the block, many more distinct haplotypes exist.
17Haplotype block (II)
- Gabriel et al. 2002.
- Requires substantial LD within the block.
- Only a small proportion of marker pairs show evidence for historical recombination.
- Blocks are partitioned according to whether the upper and lower confidence limits on estimates of the pairwise D' measure fall within certain threshold values.
- E.g., 80% of all pairwise LD scores > 0.7.
18Haplotype block (III)
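- Wang et al. 2002.
- Recombination based.
- Uses the four-gamete test to decide on haplotype block boundaries.
- Looks for evidence of historical recombination.
- For two biallelic SNPs with alleles A/a and B/b, the four possible gametes are AB, Ab, aB, and ab.
- In the absence of historical recombination, only three of the possible four are observed (with detectable frequency).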
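A minimal sketch of the four-gamete test on phased haplotypes, assuming biallelic SNPs coded as characters in equal-length strings. The function names, the `min_count` cutoff for "detectable frequency", and the simple left-to-right block scan are illustrative assumptions, not necessarily the exact procedure of Wang et al.

```python
def four_gamete_violation(haplotypes, i, j, min_count=1):
    """True if all four gametes are observed at SNPs i and j.

    A gamete counts as observed when it appears at least min_count times
    (a stand-in for "detectable frequency"; assumed cutoff).
    """
    counts = {}
    for h in haplotypes:
        pair = (h[i], h[j])
        counts[pair] = counts.get(pair, 0) + 1
    observed = [p for p, c in counts.items() if c >= min_count]
    return len(observed) >= 4  # biallelic SNPs: at most four gametes exist


def four_gamete_blocks(haplotypes, min_count=1):
    """Left-to-right scan: close the current block as soon as the next SNP
    shows all four gametes with any SNP already in the block."""
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    for j in range(1, n_snps):
        if any(four_gamete_violation(haplotypes, i, j, min_count)
               for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n_snps - 1))
    return blocks


# Toy data: SNPs 0 and 1 show all four gametes, SNPs 1 and 2 show only three.
haps = ["000", "011", "101", "111"]
print(four_gamete_blocks(haps))
```

Blocks are returned as inclusive (first SNP, last SNP) index pairs, so a boundary falls between SNP 0 and SNP 1 in the toy data.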
19Four-gamete test
20tagSNP selection
21original
2251
23101
24271
251401
Sarah Betz, Nirav Bhagat, Paul Murphy and Maureen
Stengler
26Haplotype-based tagSNP
27HtSNPs
- A subset of SNPs in the block whose alleles
essentially determine those of the remaining SNPs
in the block.
28HtSNPs
- Haplotype tagging SNPs (htSNPs) are markers which capture most of the haplotypes in a haplotype block. The aim is a minimum set of SNPs that accounts for the majority of common haplotypes.
29Different algorithms
- Patil et al.'s greedy approach.
- Zhang et al.'s dynamic programming approach.
- Clayton's PDE approach.
- Stram's correlation measure.
30Patil et al.'s greedy approach
- The goal is to select the block that maximizes the ratio of the number of SNPs in the block, B, to f(B), the minimum number of SNPs required to distinguish a certain percentage of all haplotypes observed.
31Zhang's dynamic programming approach
- The goal is to find the optimal block partition, defined as the one that minimizes the total number of representative SNPs.
- Define Sj to be the number of representative SNPs for the optimal block partition of the first j SNPs.
- Dynamic programming is used to evaluate the recursion.
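The recursion can be sketched as follows, assuming the form S_0 = 0 and S_j = min over i <= j of S_{i-1} + f(i, j), taken over start positions i for which SNPs i..j form a valid block. The callables `is_block` and `f` are hypothetical placeholders for the block criterion and for the number of representative SNPs a block needs; ties are broken in favor of fewer blocks, as the continuation slide suggests.

```python
def optimal_partition(n, is_block, f):
    """Dynamic programming over SNPs 1..n.

    is_block(i, j): True if SNPs i..j form a valid block (assumed callable).
    f(i, j): number of representative SNPs needed for block i..j (assumed).
    Returns (minimum total number of representative SNPs, list of blocks).
    """
    INF = float("inf")
    S = [0] + [INF] * n      # S[j]: best tag count for the first j SNPs
    B = [0] + [INF] * n      # B[j]: number of blocks in that partition
    back = [0] * (n + 1)     # back[j]: start of the last block ending at j
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            if S[i - 1] == INF or not is_block(i, j):
                continue
            cand = (S[i - 1] + f(i, j), B[i - 1] + 1)
            if cand < (S[j], B[j]):   # fewer tags first, then fewer blocks
                S[j], B[j] = cand
                back[j] = i
    if S[n] == INF:
        raise ValueError("no valid block partition")
    blocks, j = [], n
    while j > 0:                      # trace back the chosen blocks
        blocks.append((back[j], j))
        j = back[j] - 1
    return S[n], blocks[::-1]


# Toy criterion: any run of at most 2 SNPs is a block needing 1 tag.
print(optimal_partition(4, lambda i, j: j - i + 1 <= 2, lambda i, j: 1))
```

The double loop makes the cost quadratic in the number of SNPs, times the cost of evaluating the block criterion.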
32Dynamic programming
- The term was originally used in the 1940s by Richard Bellman, an American mathematician, to describe the process of solving problems where one needs to find the best decisions one after another.
- Applications: the Viterbi algorithm in HMMs and the Needleman-Wunsch algorithm in sequence alignment.
33Dynamic programming
- Example grid of values (rows 5 down to 1, columns 1 to 5):
    5 | 6 7 4 7 8
    4 | 7 6 1 1 4
    3 | 3 5 7 8 2
    2 | 2 6 7 0 2
    1 | 7 3 5 6 1
      +----------
        1 2 3 4 5
34Zhang's dynamic programming approach (continued)
- There may be ties: several partitions may give the same minimum number of htSNPs. In that case, choose the partition with the minimum number of blocks.
35Zhang et al.'s results
- 24,047 SNPs spanning 32.4 Mb of chromosome 21, using the 80% coverage criterion as in Patil et al.
36Remarks
- Formulated as a nice and simple optimization problem.
- The dynamic programming solution is simple and fast.
- Results are better than those of Patil et al.'s greedy approach.
- Can be adapted to other block definitions or criteria.
- No direct biological interpretation of the blocks identified.
37Clayton's approach
- One of the earliest.
- Aims to select htSNPs that best extract the haplotype information in a gene.
- Chooses a subset of tagging SNPs from the full set in such a way that the genotypes not included in the subset can be predicted well (from haplotypes based on the tagging SNPs).
38Clayton's approach
- Define a measure of haplotype diversity: the total number of differences recorded in all N^2 pairwise comparisons of haplotypes, computed for each locus j.
39Clayton's approach
- For a candidate collection of H htSNPs, define the residual diversity as the sum of the diversity within the groups of haplotypes defined by the htSNPs (ht-haplotypes).
40Clayton's approach
- A C T G G A C G T
- A C T G T A T G C
- A C T G G A C A T
- A C G A G G T G C
- A C G A T A C G T
- G G T A G A C A C
- G C T G G G C G T
- G G G G G G T G T
41Probability interpretation
- Suppose two haplotypes differ at locus j. Then Pj is the probability that they fall into different ht-haplotype groups.
- Measures the extent to which knowledge of the ht-haplotypes carried by a subject predicts the alleles carried at the further locus, j.
- If Pj = 1, then locus j is perfectly predicted by the ht-haplotypes.
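A minimal sketch of this calculation, assuming the following reading of the definitions: Dj is the number of ordered haplotype pairs that differ at locus j, Rj is the same count restricted to pairs within the same ht-haplotype group, and Pj = 1 - Rj / Dj. The function names are illustrative, and the data are the eight example haplotypes from the previous slide with equal weights.

```python
from collections import defaultdict


def diversity(alleles):
    """Number of ordered pairs (out of N^2) whose alleles differ."""
    n = len(alleles)
    counts = defaultdict(int)
    for a in alleles:
        counts[a] += 1
    return n * n - sum(c * c for c in counts.values())


def pde(haplotypes, ht_snps, j):
    """Proportion of diversity at locus j explained by the htSNPs.

    Assumed form: P_j = 1 - R_j / D_j, where R_j sums the diversity
    within each group of haplotypes sharing the same htSNP alleles.
    """
    total = diversity([h[j] for h in haplotypes])
    groups = defaultdict(list)
    for h in haplotypes:
        groups[tuple(h[s] for s in ht_snps)].append(h[j])
    residual = sum(diversity(g) for g in groups.values())
    if total == 0:
        return 1.0  # monomorphic locus: nothing left to predict (convention)
    return 1 - residual / total


# The eight 9-SNP haplotypes from the example slide (0-based SNP indices).
haps = ["ACTGGACGT", "ACTGTATGC", "ACTGGACAT", "ACGAGGTGC",
        "ACGATACGT", "GGTAGACAC", "GCTGGGCGT", "GGGGGGTGT"]
print(pde(haps, [0], 1))     # first SNP alone as htSNP: 20/24
print(pde(haps, [0, 1], 1))  # locus 1 is itself an htSNP: 1.0
```

With the first SNP as the only htSNP, 20 of the 24 differing pairs at the second SNP fall into different groups, so Pj is about 0.83.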
42Remarks
- Connected to ANOVA: total variation, within-group variation, and between-group variation.
- When haplotypes are not directly observed, use estimated haplotype frequencies as weights when calculating Rj etc.
- Not suitable for a large number of SNPs: calculating P is time consuming.
43Stram et al.'s method
- Choose subsets of "tagging" SNPs in such a way that the haplotype distribution defined across the full set of SNPs can be predicted well.
- Specifically, it considers a measure of association between the true number of copies of each haplotype (defined over the full set of SNPs) and the predicted number of copies that each individual has.
44Stram et al.'s R_h^2
- Defined as the squared correlation between the true and predicted haplotype dosage (the number of copies of a specific haplotype h that an individual carries, which can be 0, 1, or 2), where the prediction is based on genotype data.
45R_h^2 for predicting haplotypes of 2 SNPs
46Examples of Stram et al.
47Examples of Stram et al.
- Best choices, by the criteria, of htSNPs for a
region of reduced haplotype diversity in the
CYP19 gene among Japanese-American members of the
multiethnic cohort study
48Remarks
- Different from other approaches: Stram et al. advocate evaluating how well dosage (0, 1, or 2 copies) can be predicted by htSNPs.
- Oriented toward haplotype-based association studies. Measures uncertainty in haplotype inference. Useful in determining sample size requirements.
- PL-EM is used to infer haplotypes, so no haplotype frequencies are needed a priori.
49Comparisons
- Clayton proposes choosing tagging SNPs in such a way that one can make predictions for the SNPs that aren't going to be genotyped in the full sample, on the basis of haplotypes reconstructed from SNPs which will be genotyped in the full sample.
- Stram et al. choose tagging SNPs so that predictions for haplotypes across the full set of SNPs, made on the basis of haplotypes reconstructed from SNPs which will be genotyped in the full sample, are optimal.
50Other methods
- SpD algorithm.
- Meng et al. 2002.
- BEST algorithm.
- Sebastiani et al. 2004.
51Meng et al.'s SpD method
- Uses spectral decomposition (SpD) of the matrix of pairwise LD between markers. Let R denote the variance-covariance matrix (the pairwise LD matrix) for m markers. R can be written as $R = \sum_{i=1}^{m} \lambda_i e_i e_i^T$, where $\lambda_i$ and $e_i$ are the eigenvalues and eigenvectors of R. The number of markers to be retained, k, is determined by $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{m} \lambda_i \ge a$, where a is the proportion of information retained.
- A marker is selected if it contributes most to the top eigenvectors selected.
52Remarks
- Does not rely on a definition of haplotype blocks.
- SpD is computationally efficient and can be
applied to analyze a larger number of SNPs (e.g.,
candidate genes typed for several dozen markers)
without using sliding windows.
53BEST algorithm
- BEST recursively partitions the set S (m SNPs) in
two groups H and D, where H (k SNPs) is the
minimal tagging set (the smallest set of SNPs
necessary and sufficient to derive all of the
SNPs in the haplotype set), D is the set of (m-k)
SNPs that are derivable from H, and a SNP dj is
derivable from H if the value of dj in each
haplotype can be expressed as a Boolean function
f(.) of elements of H.
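The notion of "derivable" can be illustrated with a brute-force sketch, assuming a list of distinct phased haplotypes. This is not the recursive BEST procedure itself: it simply tries subsets H of increasing size and checks that every remaining SNP is a function of the alleles at H. The function names are illustrative.

```python
from itertools import combinations


def is_derivable(haplotypes, H, j):
    """True if the allele at SNP j is a function of the alleles at H,
    i.e. haplotypes that agree on H always agree at j."""
    seen = {}
    for h in haplotypes:
        key = tuple(h[s] for s in H)
        if seen.setdefault(key, h[j]) != h[j]:
            return False
    return True


def minimal_tagging_set(haplotypes):
    """Smallest H from which every other SNP is derivable (exhaustive)."""
    m = len(haplotypes[0])
    for k in range(m + 1):
        for H in combinations(range(m), k):
            if all(is_derivable(haplotypes, H, j)
                   for j in range(m) if j not in H):
                return list(H)


# Toy data: SNP 2 is the exclusive-or of SNPs 0 and 1.
haps = ["000", "011", "101", "110"]
print(minimal_tagging_set(haps))
```

The check ignores haplotype frequencies, which mirrors the remark that rare haplotypes count as much as common ones in this formulation.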
54Remarks
- Exact, analytical, lossless solution to the problem of identifying the minimum set of SNPs accounting for the variation in an arbitrary genomic region.
- Needs known haplotypes.
- Does not use haplotype frequency information, which makes it questionable when rare haplotypes are included.
55Pairwise LD Based Methods
- Find a minimum tagSNP set such that every SNP is either a tagSNP or in LD with a tagSNP.
- Pairwise r^2 is directly related to the sample size and power of association studies.
- Pritchard and Przeworski 2001.
- Greedy approach: keep selecting the untagged SNP that is in LD with the most remaining untagged SNPs.
- Carlson et al. 2004.
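The greedy rule can be sketched as follows, assuming phased biallelic haplotypes and an r^2 threshold of 0.8. The helper names and the tie-breaking rule (lowest SNP index first) are assumptions for illustration, not details taken from Carlson et al.

```python
def r_squared(haps, i, j):
    """Pairwise r^2 between biallelic SNPs i and j from phased haplotypes."""
    n = len(haps)
    a, b = haps[0][i], haps[0][j]  # use the first haplotype's alleles as reference
    p_a = sum(h[i] == a for h in haps) / n
    p_b = sum(h[j] == b for h in haps) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haps) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def greedy_tags(haps, threshold=0.8):
    """Repeatedly pick the untagged SNP in LD with the most untagged SNPs."""
    m = len(haps[0])
    nbrs = {i: {j for j in range(m)
                if j != i and r_squared(haps, i, j) >= threshold}
            for i in range(m)}
    untagged, tags = set(range(m)), []
    while untagged:
        # max() keeps the first maximum, so ties go to the lowest index
        best = max(sorted(untagged), key=lambda s: len(nbrs[s] & untagged))
        tags.append(best)
        untagged -= nbrs[best] | {best}
    return tags


# Toy data: SNPs 0-2 are perfectly correlated, SNP 3 is independent of them.
haps = ["0000", "0001", "1110", "1111"]
print(greedy_tags(haps))
```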
56Greedy May Not Be Optimal
61Exhaustive Search Achieves Optimality, But
- Go through all k-SNP combinations.
- Start from k = 1.
- If not successful, k ← k + 1.
- Guaranteed to find the optimal tagSNP set.
- Becomes computationally prohibitive as k increases.
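A minimal sketch of the exhaustive search, assuming the pairwise LD relationships have already been thresholded into a map `nbrs` from each SNP to the set of SNPs it tags. The map and the nine-SNP toy example are illustrative assumptions.

```python
from itertools import combinations


def exhaustive_tags(nbrs):
    """Smallest set of SNPs such that every SNP is a tag or tagged by one.

    nbrs[s] is the set of SNPs in strong LD with s (assumed symmetric).
    """
    snps = sorted(nbrs)
    everything = set(snps)
    for k in range(1, len(snps) + 1):       # start from k = 1, then k + 1, ...
        for combo in combinations(snps, k):
            covered = set(combo)
            for t in combo:
                covered |= nbrs[t]
            if covered >= everything:
                return list(combo)
    return []


# Toy LD graph on nine SNPs: SNP 2 has the most neighbors, so a greedy
# search would start with it and end up needing three tags, whereas
# SNPs 0 and 1 together tag everything.
nbrs = {0: {2, 3, 4, 5}, 1: {6, 7, 8}, 2: {0, 3, 4, 6, 7},
        3: {0, 2}, 4: {0, 2}, 5: {0}, 6: {1, 2}, 7: {1, 2}, 8: {1}}
print(exhaustive_tags(nbrs))
```

The number of combinations examined grows as "m choose k", which is why the search is only practical for small SNP sets.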
62A different Strategy (1)
- SNPs naturally fall into subsets (precincts) due to the blocky LD structure of the genome.
- SNPs in different precincts are not in strong LD.
Haploview
63A different Strategy (2)
- Find tagSNPs separately by precinct.
- Precincts often small enough to allow exhaustive
search.
Haploview
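One way to form precincts is sketched below, assuming a precinct is a connected component of the graph that links SNPs in strong LD, so that no SNP in one precinct is in strong LD with a SNP in another. The `nbrs` map (SNP to the set of SNPs in strong LD with it) and the function name are illustrative.

```python
def precincts(nbrs):
    """Split SNPs into connected components of the strong-LD graph."""
    seen, parts = set(), []
    for s in sorted(nbrs):
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], []
        while stack:                      # depth-first traversal
            u = stack.pop()
            comp.append(u)
            for v in nbrs[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        parts.append(sorted(comp))
    return parts


# Toy LD graph: SNPs 0-2 are chained, SNP 3 is isolated, SNPs 4-5 pair up.
nbrs = {0: {1}, 1: {0, 2}, 2: {1}, 3: set(), 4: {5}, 5: {4}}
print(precincts(nbrs))
```

Because no tagging relationship crosses a precinct boundary, tagSNPs chosen within each precinct can simply be pooled.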
64Hybrid Greedy-Exhaustive Search
- What if a precinct is too large for exhaustive search?
- Hybrid: first exhaustive, then greedy.
- First conduct a k-SNP exhaustive search for k as large as feasible.
- For each k-SNP combination, use the greedy approach to pick tagSNPs among the remaining SNPs.
- Choose the combination whose greedy tagSNP set is smallest, and combine that set with its k SNPs.
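The greedy completion step can be sketched as follows, assuming a precomputed map `nbrs` from each SNP to the set of SNPs in strong LD with it. The function names and the toy graph are illustrative assumptions; k = 0 reduces to the plain greedy search.

```python
from itertools import combinations


def greedy_cover(nbrs, untagged):
    """Greedy tagging restricted to the SNPs that are still untagged."""
    untagged, tags = set(untagged), []
    while untagged:
        best = max(sorted(untagged), key=lambda s: len(nbrs[s] & untagged))
        tags.append(best)
        untagged -= nbrs[best] | {best}
    return tags


def hybrid_tags(nbrs, k):
    """Try every k-SNP combination, complete each one greedily,
    and keep the smallest resulting tagSNP set."""
    snps = sorted(nbrs)
    best = None
    for combo in combinations(snps, k):
        covered = set(combo)
        for t in combo:
            covered |= nbrs[t]
        candidate = list(combo) + greedy_cover(nbrs, set(snps) - covered)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


# Toy LD graph on nine SNPs in which plain greedy is not optimal.
nbrs = {0: {2, 3, 4, 5}, 1: {6, 7, 8}, 2: {0, 3, 4, 6, 7},
        3: {0, 2}, 4: {0, 2}, 5: {0}, 6: {1, 2}, 7: {1, 2}, 8: {1}}
print(len(hybrid_tags(nbrs, 0)))  # plain greedy
print(hybrid_tags(nbrs, 1))       # one exhaustive SNP, then greedy
```

On this toy graph, fixing a single SNP exhaustively is enough to recover the two-tag solution that plain greedy misses.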
65Application to Chromosome 2
- HapMap Phase II data (release 18).
- CEU plate.
- Chr 2: 243 Mb, 7.6% of the human genome.
- Total number of SNPs: 527,434.
- SNPs with MAF > 0.05: 182,156.
66Extrapolation of TagSNP Counts to Genome Wide
Numbers shown in this column are estimated based on extrapolation.
67Additional FESTA Features
- Force some SNPs to be included in or excluded from the tagSNP set.
- Double coverage.
- Require each untagged SNP to be tagged by 2 tagSNPs.
- Robust against genotyping failure.
- Additional criteria.
- User-defined priority score.
- E.g., average r^2, assay design score, MAF, etc.
68Average of Simulated Design Scores
- For r^2 > 0.8, there are 1.5 x 10^12257 equivalent tagSNP sets.
- Average design scores range from 0.757 to 0.883.
- With the greedy approach, the average design score is 0.830.
69Summary
- Key ideas
- Partition.
- Hybrid search.
- Reduces size of tagSNP set.
- Provides flexibility by allowing choices among
multiple equivalent solutions.
70Thank You
71SNP block structure in chromosome 21
- Begin by considering all possible blocks of >= 1 consecutive SNPs.
- Next, exclude all blocks in which < 80% of the chromosomes in the data are defined by haplotypes represented more than once in the block (80% coverage). Ambiguous haplotypes are treated as missing data and not included when calculating coverage.
- Considering the remaining overlapping blocks simultaneously, select the one which maximizes the ratio of total SNPs in the block to the number required to uniquely discriminate haplotypes represented more than once in the block. Any of the remaining blocks that physically overlap with the selected block are discarded, and the process is repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 with no gaps and with every SNP assigned to a block.
72- The block definition depends on the criterion to be optimized, that is, on how and to what extent we wish to trade off the diversity permitted in a block's haplotypes against the number of haplotype tags, both locally and globally.
73- G A C G G G C C A G A T T G T G C T C C
- G A C G G G C C A G A T T G T A T T C C
- G A C G G G T C A C A T T A T G T T C G
- A G C G A G C T G G T C C G C G T T C G
- G G C C G A C T A G A T T G T G T T C C
- G A T G G G C C A G A T T G T G T T C C
- G A C G G G C C A G A T T G T G T T C C
- G A C G G G C C A G A T T G T G C C T G
- A G C G A G C T G G T C T G C G T T C G
- A G C G A G C T G G T C C G C A T T C C
74What's a haplotype block?
- A block is a set of s consecutive SNPs which, although in theory it could generate as many as 2^s different haplotypes, in fact shows markedly fewer in our sample of n, perhaps as few as s + 1. In this case, there will be a subset of SNPs in the block whose alleles in our sample essentially determine those of the remaining SNPs in the block. These have been called haplotype tags. Outside the block, many more distinct haplotypes exist.