Haplotype blocks and tagging SNPs

Slides: 75

Provided by: sphU

1
Haplotype blocks and tagging SNPs

Lecture 3
Biostat 830
Winter 2006

2
Outline

Evidence for haplotype blocks.
Haplotype tagging SNPs.
LD-based tagSNPs.

3
Haplotype blocks

Subset of SNPs can capture the majority of the
haplotype diversity observed within a region.
Johnson et al. Nature Genetics 2001.
Discrete haplotype blocks each with limited
diversity punctuated by recombinations.
Daly et al. Nature Genetics 2001.
Haplotype structure reveals blocks of limited
haplotype diversity
Patil et al. Science 2001.

4
Reich et al. 2001

Extended LD observed in part of the genome.
LD extends about 60 kb in European samples, as opposed
to the 3 kb suggested by Kruglyak's simulation estimates.
An alternative explanation for the observed
long-range LD is that the recombination rate in the
regions studied might be markedly lower than the
genome-wide average. This could happen if
recombination occurred primarily in
well-separated hotspots.

5
Johnson et al.

135 kb across 9 genes.
122 SNPs.
384 Europeans.
Found that 2-5 SNPs suffice to define the six or
fewer common haplotypes (frequency > 5%) observed at
each locus. These common haplotypes and their htSNPs
account for > 80% of all haplotypes observed.

6
Johnson et al. 2001.
7
Daly et al.

5q31 region, 500 kb.
Region contains a genetic risk factor for Crohn's
disease.
103 SNPs (MAF > 5%), about 1 SNP every 5 kb, the
highest resolution at the time.
129 European trios.

9
A Long Haplotype Block
Daly et al., Nat. Genet., 2001
10
Daly et al.

The region could be largely decomposed into
discrete haplotype blocks, each with a striking
lack of diversity.
E.g., 2 haplotypes account for 95% of the
observed chromosomes.
Some common haplotypes are correlated across
blocks, which gives rise to long-range LD.
Indicates that inhomogeneous recombination rates
(evident from the clustering of major recombination
events) need to be considered in studying modern
human population genetics.

11
Patil et al. 2001

Chromosome 21, chromosome-wide.
24 ethnically diverse individuals.
The two copies of chromosome 21 were separated using
a rodent-human somatic cell hybrid technique, which
allows direct determination of the full haplotypes
(Douglas et al. 2001).
35,989 SNPs identified from 20 chromosomes.
24,047 SNPs with the minor allele observed more
than once.

147 SNPs span 106 kb region.
18 blocks.
26 SNPs spanning 19kb region.
7 different haplotypes.
4 most common account for 80% of all haplotypes.
2 htSNPs.

13
Patil et al.'s results

24,047 common SNPs
4,135 blocks of SNPs. The largest block contains
114 common SNPs and spans 115 kb of DNA. Average
block length is 7.8 kb.
4,563 (19%) htSNPs needed to capture all the
common haplotypes.

14
Other related papers

Jeffreys et al. Nature Genetics 2001.
216 kb of the class II region of the MHC.
Gabriel et al. Science 2002.
51 autosomal regions spanning 13 Mb.
Dawson et al. Nature 2002.
Chromosome 22.

15
Haplotype block
P53 knowledge base, figure created by HapBlock.
16
Haplotype block (I)

Patil et al. 2001, Zhang et al. 2002.
A set of s consecutive SNPs, which, although in
theory could generate many different haplotypes,
in fact shows markedly fewer.
E.g., a collection of consecutive SNPs such that a
subset can account for 80% of all haplotypes
observed.
Outside the block, many more distinct haplotypes
exist.

17
Haplotype block (II)

Gabriel et al. 2002.
Require substantial LD within block.
Small proportion of marker pairs show evidence
for historical recombination.
Blocks are partitioned according to whether the
upper and lower confidence limits on estimates of
the pairwise D′ measure fall within certain threshold
values.
E.g., 80% of all pairwise LD scores > 0.7.

18
Haplotype block (III)

Wang et al. 2002.
Recombination based.
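Use the four-gamete test to decide on haplotype block
boundaries.
Look for evidence of historical recombination.
For two biallelic SNPs the four possible gametes are
AB, Ab, aB and ab.
Within a block, at most three of the possible four
are observed (with detectable frequency); observing
all four suggests historical recombination.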
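
The four-gamete rule above can be sketched in a few lines. This is only an illustration, not the procedure of Wang et al.: the function names, the left-to-right scan, and the `min_count` cutoff standing in for "detectable frequency" are all assumptions made here. Haplotypes are strings of 0/1 alleles.

```python
from collections import Counter

def has_four_gametes(haplotypes, i, j, min_count=1):
    """True if all four allele combinations at SNPs i and j are seen
    at least min_count times (assumes biallelic SNPs)."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    return sum(1 for c in counts.values() if c >= min_count) >= 4

def fgt_blocks(haplotypes, min_count=1):
    """Scan left to right; close the current block as soon as the next SNP
    shows all four gametes with any SNP already in the block."""
    m = len(haplotypes[0])
    blocks, start = [], 0
    for j in range(1, m):
        if any(has_four_gametes(haplotypes, i, j, min_count)
               for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, m - 1))
    return blocks

# Toy data (hypothetical): SNPs 0-1 and SNPs 2-3 form two blocks.
haps = ["0000", "0011", "1100", "1111", "0100"]
print(fgt_blocks(haps))  # [(0, 1), (2, 3)]
```

Raising `min_count` makes the test ignore gametes seen only rarely, which is one simple way to read "with detectable frequency".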

19
Four-gamete test
20
tagSNP selection
21
original
22
51
23
101
24
271
25
1401
Sarah Betz, Nirav Bhagat, Paul Murphy and Maureen
Stengler
26
Haplotype-based tagSNP
27
HtSNPs

A subset of SNPs in the block whose alleles
essentially determine those of the remaining SNPs
in the block.

28
HtSNPs

Haplotype tagging SNPs (htSNPs) are markers that
capture most of the haplotypes in a haplotype
block: the minimum set of SNPs that accounts for the
majority of common haplotypes.

29
Different algorithms

Patil et al.'s greedy approach.
Zhang et al.'s dynamic programming approach.
Clayton's PDE (proportion of diversity explained)
approach.
Stram et al.'s correlation measure.

30
Patil et al.'s greedy approach

The goal is to select the block B that maximizes the
ratio of the number of SNPs in the block to
f(B), the minimum number of SNPs required to
distinguish a given percentage of all haplotypes
observed.

31
Zhang's dynamic programming approach

The goal is to find the optimal block partition,
which is defined as the one that minimizes the
total number of representative SNPs.
Define S_j to be the number of representative SNPs
for the optimal block partition of the first j SNPs.
Dynamic programming is used to calculate the
recursion.
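
A minimal sketch of such a recursion, assuming S_0 = 0 and S_j = min over i of S_(i-1) + f(block i..j), taken over all i for which SNPs i..j form a valid block. The block test (coverage by haplotypes seen more than once), the brute-force f, and the rule of at least one tag per block are assumptions made for illustration; all function names are hypothetical. Ties on the tag count are broken by the number of blocks.

```python
from collections import Counter
from itertools import combinations

def coverage_and_common(haps, i, j):
    """Coverage by, and list of, sub-haplotypes over SNPs i..j seen more than once."""
    counts = Counter(h[i:j + 1] for h in haps)
    common = [s for s, c in counts.items() if c > 1]
    return sum(counts[s] for s in common) / len(haps), common

def tags_needed(common):
    """f(B): fewest SNPs that tell the common sub-haplotypes apart (brute force)."""
    width = len(common[0]) if common else 1
    for k in range(1, width + 1):
        for cols in combinations(range(width), k):
            if len({tuple(s[c] for c in cols) for s in common}) == len(common):
                return k
    return width

def optimal_partition(haps, alpha=0.8):
    m = len(haps[0])
    inf = float("inf")
    best = [(0, 0)] + [(inf, inf)] * m   # best[j] = (tags, blocks) for first j SNPs
    start = [0] * (m + 1)                # start[j] = first SNP of the last block
    for j in range(1, m + 1):
        for i in range(1, j + 1):
            cov, common = coverage_and_common(haps, i - 1, j - 1)
            if cov < alpha or best[i - 1][0] == inf:
                continue
            cand = (best[i - 1][0] + tags_needed(common), best[i - 1][1] + 1)
            if cand < best[j]:           # fewer tags first, then fewer blocks
                best[j], start[j] = cand, i - 1
    if best[m][0] == inf:
        raise ValueError("no valid partition under this coverage threshold")
    blocks, j = [], m
    while j > 0:                         # trace the chosen blocks back
        blocks.append((start[j], j - 1))
        j = start[j]
    return best[m][0], blocks[::-1]

# Toy data (hypothetical): three haplotypes, each seen twice.
haps = ["0011", "0011", "1100", "1100", "0000", "0000"]
print(optimal_partition(haps))  # (2, [(0, 3)])
```

On this toy input, one block of four SNPs and two blocks of two SNPs both need two tags; the tie-break keeps the single block.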

32
Dynamic programming

The term was originally used in the 1940s by Richard
Bellman, an American mathematician, to describe
the process of solving problems where one needs
to find the best decisions one after another.
Applications: the Viterbi algorithm for HMMs and the
Needleman-Wunsch algorithm for sequence alignment.

33
Dynamic programming

---------------
5 6 7 4 7 8
---------------
4 7 6 1 1 4
---------------
3 3 5 7 8 2
---------------
2 2 6 7 0 2
---------------
1 7 3 5 6 1
---------------
1 2 3 4 5

34
Zhang's dynamic programming approach (continued)

There may be ties: several partitions may give
the same minimum number of htSNPs. Among them, choose
the partition with the minimum number of blocks.

35
Zhang et al.'s results

24,047 SNPs spanning 32.4 Mb of chromosome 21.
Use the 80% criterion as in Patil et al.

36
Remarks

Formulated as a nice and simple optimization
problem.
The dynamic programming solution is simple and
fast.
Results are better than those of Patil et al.'s
greedy approach.
Can be adapted to other block definitions or
criteria.
No direct biological interpretation of the blocks
identified.

37
Clayton's approach

One of the earliest.
Aims to select htSNPs that best capture the
haplotype information in a gene.
Choose a subset of tagging SNPs from the full
set in such a way that the genotypes not included
in the subset can be predicted
well (from haplotypes based on the tagging SNPs).

38
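Clayton's approach

Define a measure of haplotype diversity: the total
number of differences recorded in all N² pairwise
comparisons. For locus j,
$D_j = \sum_{i=1}^{N} \sum_{k=1}^{N} 1(h_{ij} \ne h_{kj})$,
where $h_{ij}$ is the allele of haplotype i at locus j.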

39
Clayton's approach

For a candidate collection of H htSNPs, define
the residual diversity as the sum of the
diversity within haplotypes defined by the htSNPs
(ht-haplotypes): the per-locus diversity is computed
separately within each ht-haplotype group k and
summed over k.
40
Clayton's approach

A C T G G A C G T
A C T G T A T G C
A C T G G A C A T
A C G A G G T G C
A C G A T A C G T
G G T A G A C A C
G C T G G G C G T
G G G G G G T G T

41
Probability interpretation

Suppose two haplotypes differ at locus j. Then P_j is
the probability that they fall into different
ht-haplotype groups.
P_j measures the extent to which knowledge of the
ht-haplotypes carried by a subject predicts the
alleles carried at a further locus, j.
If P_j = 1, then locus j is perfectly predicted by
the ht-haplotypes.
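
A small sketch of these quantities, assuming the diversity at locus j is the count of ordered haplotype pairs that differ there (D_j), the residual diversity R_j is the same count restricted to pairs within one ht-haplotype group, and P_j = 1 - R_j / D_j. The function names are made up for illustration; the data are the eight haplotypes from the Clayton's approach slide, with loci numbered from 0.

```python
def diversity(haps, j):
    """D_j: ordered pairs of haplotypes that differ at locus j."""
    return sum(a[j] != b[j] for a in haps for b in haps)

def residual_diversity(haps, j, tags):
    """R_j: diversity at locus j summed within groups of haplotypes
    that carry identical alleles at the tag loci."""
    groups = {}
    for h in haps:
        groups.setdefault(tuple(h[t] for t in tags), []).append(h)
    return sum(diversity(g, j) for g in groups.values())

def proportion_explained(haps, j, tags):
    """P_j = 1 - R_j / D_j (taken as 1 for a monomorphic locus)."""
    d = diversity(haps, j)
    return 1.0 if d == 0 else 1 - residual_diversity(haps, j, tags) / d

haps = [
    "ACTGGACGT", "ACTGTATGC", "ACTGGACAT", "ACGAGGTGC",
    "ACGATACGT", "GGTAGACAC", "GCTGGGCGT", "GGGGGGTGT",
]
# Using locus 0 alone as the htSNP, how well is locus 1 predicted?
print(diversity(haps, 1))                  # 24
print(residual_diversity(haps, 1, [0]))    # 4
print(proportion_explained(haps, 1, [0]))  # 0.8333...
```

Here 24 ordered pairs differ at locus 1 and only 4 of them sit inside the same group, so a pair that differs at locus 1 lands in different ht-haplotype groups with probability 20/24, matching the probability interpretation.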

42
Remarks

Connected to ANOVA: total variation, within-group
variation and between-group variation.
For haplotypes that are not directly observed, use
frequencies as weights when calculating R_j etc.
Not suitable for a large number of SNPs: P is
time-consuming to calculate.

43
Stram et al.'s method

Choose subsets of "tagging" SNPs in such a way
that the haplotype distribution defined across
the full set of SNPs can be predicted well.
Specifically, it considers a measure of
association between the true numbers of copies of
haplotypes (defined over the full set of SNPs)
and the predicted number of copies of haplotypes
that each individual has.

44
Stram et al.'s R_h²

Defined as the squared correlation between true and
predicted haplotype dosage (the number of copies
of a specific haplotype h that an individual
carries, which can be 0, 1 or 2), where the
prediction is based on genotype data.
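
One way to write this definition out, under the assumption (not stated on the slide) that the predicted dosage is the conditional expectation of the true dosage given the genotype data G. The symbols δ_h (dosage of haplotype h) and p_h (its population frequency) are introduced here for illustration only.

```latex
R_h^2 = \mathrm{Corr}^2\bigl(\delta_h,\ \mathrm{E}[\delta_h \mid G]\bigr)
      = \frac{\mathrm{Var}\bigl(\mathrm{E}[\delta_h \mid G]\bigr)}{\mathrm{Var}(\delta_h)},
\qquad
\mathrm{Var}(\delta_h) = 2\,p_h\,(1 - p_h)
\ \text{under Hardy-Weinberg equilibrium.}
```

The second equality holds because the covariance between δ_h and its conditional expectation equals the variance of that conditional expectation. R_h² = 1 when the genotypes determine the dosage exactly, and it falls toward 0 as the genotypes carry less information about h.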

45
R_h² for predicting haplotypes of 2 SNPs
46
Examples of Stram et al.
47
Examples of Stram et al.

Best choices, by the criteria, of htSNPs for a
region of reduced haplotype diversity in the
CYP19 gene among Japanese-American members of the
multiethnic cohort study

48
Remarks

Different from other approaches. Stram et al.
advocate evaluating how well dosage (0,1,2
copies) can be predicted by htSNPs.
Oriented toward haplotype-based association
studies. Measure uncertainties in haplotype
inference. Useful in determining sample size
requirements.
PL-EM used to infer haplotypes, so no haplotype
frequencies are needed a priori.

49
Comparisons

Clayton proposes choosing tagging SNPs in such a
way that one can make predictions for the SNPs
that aren't going to be genotyped in the full
sample, on the basis of haplotypes reconstructed
from SNPs which will be genotyped in the full
sample.
Stram et al. choose tagging SNPs so that
predictions for haplotypes across the full set of
SNPs, made on the basis of haplotypes
reconstructed from SNPs which will be genotyped
in the full sample, are optimal.

50
Other methods

SpD algorithm.
Meng et al. 2002.
BEST algorithm.
Sebastiani et al. 2004.

51
Meng et al.'s SpD method

Use spectral decomposition of the matrices of
pairwise LD between markers. Let R denote the
variance-covariance matrix (pairwise LD matrix).
R can be written as $R = \sum_i \lambda_i e_i e_i^{T}$,
where $\lambda_i$ and $e_i$ are the eigenvalues and
eigenvectors of R. The number of markers to be
retained, k, is determined by
$\sum_{i=1}^{k} \lambda_i / \sum_i \lambda_i \ge a$,
where a is the proportion of information retained.
A marker is selected if it contributes most to
the top eigenvectors selected.

52
Remarks

Does not rely on a definition of haplotype blocks.
SpD is computationally efficient and can be
applied to analyze a larger number of SNPs (e.g.,
candidate genes typed for several dozen markers)
without using sliding windows.

53
BEST algorithm

BEST recursively partitions the set S (m SNPs) in
two groups H and D, where H (k SNPs) is the
minimal tagging set (the smallest set of SNPs
necessary and sufficient to derive all of the
SNPs in the haplotype set), D is the set of (m-k)
SNPs that are derivable from H, and a SNP dj is
derivable from H if the value of dj in each
haplotype can be expressed as a Boolean function
f(.) of elements of H.
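
The notion of "derivable" can be checked directly: a SNP is a Boolean function of the SNPs in H exactly when any two haplotypes that agree on H also agree at that SNP. The sketch below uses that test and a plain brute-force search for a smallest tagging set. It is not BEST's recursive partitioning, and the function names are hypothetical.

```python
from itertools import combinations

def is_derivable(haps, d, H):
    """True if the allele at SNP d is determined by the alleles at the SNPs in H."""
    seen = {}
    for h in haps:
        key = tuple(h[t] for t in H)
        if seen.setdefault(key, h[d]) != h[d]:
            return False
    return True

def is_tagging_set(haps, H):
    """True if every SNP outside H is derivable from H."""
    m = len(haps[0])
    return all(is_derivable(haps, d, H) for d in range(m) if d not in H)

def minimal_tagging_set(haps):
    """Smallest H by exhaustive search (only feasible for a few SNPs)."""
    m = len(haps[0])
    for k in range(m + 1):
        for H in combinations(range(m), k):
            if is_tagging_set(haps, H):
                return H

# Toy data (hypothetical): SNP 2 is the XOR of SNPs 0 and 1.
haps = ["000", "011", "101", "110"]
print(is_derivable(haps, 2, (0, 1)))  # True
print(is_derivable(haps, 2, (0,)))    # False
print(minimal_tagging_set(haps))      # (0, 1)
```

The XOR example shows why pairwise measures can miss such relations: no single SNP predicts SNP 2, yet SNPs 0 and 1 together determine it exactly.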

54
Remarks

Exact, analytical, lossless solution to the
problem of identifying the minimum set of SNPs
accounting for the variations in an arbitrary
genomic region.
Need to know haplotypes.
It does not use haplotype frequency information, so
results are doubtful when rare haplotypes are included.

55
Pairwise LD Based Methods

Find a minimum tagSNP set such that every SNP is
either a tagSNP or in LD with a tagSNP.
Pairwise r² is directly related to sample size
and power of association studies: testing a tagSNP
in place of the causal SNP requires roughly 1/r²
times the sample size for the same power.
Pritchard and Przeworski 2001.
Greedy approach: keep selecting the untagged SNP that
is in LD with the most remaining untagged SNPs.
Carlson et al. 2004.
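
The greedy rule can be sketched as follows. Assumptions made for this illustration: phased 0/1 haplotypes as input, "in LD" meaning pairwise r² at or above a threshold (0.8 here), and ties broken by the lowest SNP index. The function names are hypothetical, and this is not the published implementation.

```python
def r2(haps, i, j):
    """Pairwise r^2 between biallelic SNPs i and j from phased haplotypes."""
    n = len(haps)
    a, b = haps[0][i], haps[0][j]          # arbitrary reference alleles
    p_a = sum(h[i] == a for h in haps) / n
    p_b = sum(h[j] == b for h in haps) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haps) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom

def greedy_tags(haps, threshold=0.8):
    """Repeatedly pick the untagged SNP that tags the most untagged SNPs."""
    m = len(haps[0])
    covers = {i: {j for j in range(m)
                  if i == j or r2(haps, i, j) >= threshold}
              for i in range(m)}
    untagged, tags = set(range(m)), []
    while untagged:
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Toy data (hypothetical): SNPs 0-2 are identical, SNPs 3-4 are identical,
# and the two groups are independent of each other.
haps = ["00000", "00011", "11100", "11111"]
print(greedy_tags(haps))  # [0, 3]
```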

56
Greedy May Not Be Optimal
61
Exhaustive Search Achieves Optimality, But

Go through all k-SNP combinations.
Start from k = 1.
If not successful, k ← k + 1.
Guaranteed to find the optimal tagSNP set.
Becomes computationally prohibitive as k increases.
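
A minimal sketch of this search, assuming the LD relationships have already been summarized as a dictionary `covers` (a name chosen here) that maps each SNP to the set of SNPs it tags, itself included.

```python
from itertools import combinations

def exhaustive_tags(covers):
    """Try k = 1, 2, ... and return the first k-SNP combination that tags everything."""
    snps = sorted(covers)
    everything = set(snps)
    for k in range(1, len(snps) + 1):
        for combo in combinations(snps, k):
            if set().union(*(covers[i] for i in combo)) == everything:
                return list(combo)
    return snps

# Toy LD structure (hypothetical): SNP 0 is in LD with SNPs 1, 2 and 3,
# and each of those is also in LD with one further SNP (4, 5 and 6).
covers = {
    0: {0, 1, 2, 3},
    1: {0, 1, 4}, 2: {0, 2, 5}, 3: {0, 3, 6},
    4: {1, 4}, 5: {2, 5}, 6: {3, 6},
}
print(exhaustive_tags(covers))  # [1, 2, 3]
```

On this toy input a greedy pass would start with SNP 0, which tags the most SNPs, and then need SNPs 4, 5 and 6 as well, giving four tags where three suffice. The number of combinations grows as m choose k, which is why the search becomes prohibitive for large k.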

62
A Different Strategy (1)

SNPs naturally fall into subsets ("precincts")
due to the blocky LD structure of the genome.
SNPs in different precincts are not in strong LD.

Haploview
63
A Different Strategy (2)

Find tagSNPs separately by precinct.
Precincts often small enough to allow exhaustive
search.
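
One simple reading of a precinct, assumed here for illustration, is a connected component of the graph that links two SNPs whenever they are in strong LD. The sketch takes a symmetric dictionary `covers` (a hypothetical name) mapping each SNP to the SNPs in strong LD with it, itself included.

```python
def precincts(covers):
    """Connected components of the strong-LD graph, found by depth-first search."""
    seen, parts = set(), []
    for start in sorted(covers):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(covers[v] - comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts

# Toy LD structure (hypothetical).
covers = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2},
          3: {3, 4}, 4: {3, 4}, 5: {5}}
print(precincts(covers))  # [[0, 1, 2], [3, 4], [5]]
```

Because no SNP in one component is in strong LD with a SNP in another, tagSNPs can be chosen within each component independently and the results simply concatenated.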

Haploview
64
Hybrid Greedy-Exhaustive Search

What if a precinct is too large for exhaustive search?
Hybrid: first exhaustive, then greedy.
First conduct a k-SNP exhaustive search for k as
large as feasible.
For each k-SNP combination, use greedy to pick
tagSNPs among the remaining SNPs.
Combine the smallest greedy tagSNP set with its k SNPs.
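
The hybrid scheme can be sketched as below. This is an illustration under stated assumptions rather than the FESTA code: `covers` (a hypothetical name) maps each SNP to the set of SNPs it tags, itself included, and greedy ties go to the lowest SNP index.

```python
from itertools import combinations

def greedy_complete(covers, seed):
    """Extend the seed greedily until every SNP is tagged."""
    tags = list(seed)
    untagged = set(covers) - set().union(*(covers[i] for i in tags))
    while untagged:
        best = max(sorted(untagged), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

def hybrid_tags(covers, k):
    """Try every k-SNP seed, finish each greedily, keep the smallest result."""
    best = None
    for seed in combinations(sorted(covers), k):
        tags = greedy_complete(covers, seed)
        if best is None or len(tags) < len(best):
            best = tags
    return best

# Toy LD structure (hypothetical): pure greedy (k = 0) needs four tags,
# while a one-SNP exhaustive seed already reaches the optimum of three.
covers = {
    0: {0, 1, 2, 3},
    1: {0, 1, 4}, 2: {0, 2, 5}, 3: {0, 3, 6},
    4: {1, 4}, 5: {2, 5}, 6: {3, 6},
}
print(hybrid_tags(covers, 0))  # [0, 4, 5, 6]
print(hybrid_tags(covers, 1))  # [1, 2, 3]
```

Larger k moves the result toward the exhaustive optimum at the cost of trying more seeds, so k acts as the knob between speed and tag-set size.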

65
Application to Chromosome 2

HapMap Phase II data (release 18).
CEU plate.
Chr 2: 243 Mb, 7.6% of the human genome.
Total number of SNPs: 527,434.
SNPs with MAF > 0.05: 182,156.

66
Extrapolation of TagSNP Counts to Genome Wide
Numbers shown in this column are estimated by
extrapolation.
67
Additional FESTA Features

Force some SNPs to be included/excluded in tagSNP
set.
Double coverage.
Require each untagged SNP to be tagged by 2
tagSNPs.
Robust against genotyping failure.
Additional criteria.
User defined priority score.
E.g., average r², assay design score, MAF, etc.

68
Average of Simulated Design Scores

For r² > 0.8, there are 1.5 × 10^12257 equivalent
tagSNP sets.
Average design scores range from 0.757 to 0.883.
With the greedy approach, the average design score
is 0.830.

69
Summary

Key ideas
Partition.
Hybrid search.
Reduces size of tagSNP set.
Provides flexibility by allowing choices among
multiple equivalent solutions.

70
Thank You
71
SNP block structure in chromosome 21

Begin by considering all possible blocks of ≥ 1
consecutive SNPs.
Next, exclude all blocks in which < 80% of the
chromosomes in the data are defined by haplotypes
represented more than once in the block (80%
coverage). Ambiguous haplotypes are treated as
missing data and not included when calculating
coverage.
Considering the remaining overlapping blocks
simultaneously, select the one which maximizes
the ratio of total SNPs in the block to the
number required to uniquely discriminate
haplotypes represented more than once in the
block. Any of the remaining blocks that
physically overlap with the selected block are
discarded, and the process is repeated until we have
selected a set of contiguous, non-overlapping
blocks that cover the 32.4 Mb of chr 21 with no
gaps and with every SNP assigned to a block.
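
The three steps can be sketched on toy data. Assumptions made for this illustration: no missing or ambiguous haplotypes, a brute-force count of discriminating SNPs with at least one per block, and ties on the ratio going to the first candidate found. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def block_stats(haps, i, j):
    """Coverage by haplotypes seen more than once over SNPs i..j, and the
    fewest SNPs needed to tell those haplotypes apart."""
    counts = Counter(h[i:j + 1] for h in haps)
    common = [s for s, c in counts.items() if c > 1]
    coverage = sum(counts[s] for s in common) / len(haps)
    width = j - i + 1
    for k in range(1, width + 1):
        for cols in combinations(range(width), k):
            if len({tuple(s[c] for c in cols) for s in common}) == len(common):
                return coverage, k
    return coverage, width

def greedy_blocks(haps, min_coverage=0.8):
    m = len(haps[0])
    candidates = []                       # (ratio, first SNP, last SNP)
    for i in range(m):
        for j in range(i, m):
            coverage, tags = block_stats(haps, i, j)
            if coverage >= min_coverage:
                candidates.append(((j - i + 1) / tags, i, j))
    chosen = []
    while candidates:
        _, i, j = max(candidates, key=lambda c: c[0])
        chosen.append((i, j))
        # discard every candidate that overlaps the block just selected
        candidates = [c for c in candidates if c[2] < i or c[1] > j]
    return sorted(chosen)

# Toy data (hypothetical): three haplotypes, each seen twice.
haps = ["0011", "0011", "1100", "1100", "0000", "0000"]
print(greedy_blocks(haps))  # [(0, 1), (2, 3)]
```

If some single SNP fails the coverage threshold on its own, this sketch leaves it unassigned, so the "no gaps" property relies on every SNP having its minor allele present more than once.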

depends on the criterion to be optimized, that
is, on how and to what extent we wish to trade
off the diversity permitted in a block's
haplotypes against the number of haplotype tags,
both locally and globally.

G A C G G G C C A G A T T G T G C T C C
G A C G G G C C A G A T T G T A T T C C
G A C G G G T C A C A T T A T G T T C G
A G C G A G C T G G T C C G C G T T C G
G G C C G A C T A G A T T G T G T T C C
G A T G G G C C A G A T T G T G T T C C
G A C G G G C C A G A T T G T G T T C C
G A C G G G C C A G A T T G T G C C T G
A G C G A G C T G G T C T G C G T T C G
A G C G A G C T G G T C C G C A T T C C

74
What's a haplotype block?

A block is a set of s consecutive SNPs, which,
although in theory could generate as many as 2^s
different haplotypes, in fact shows markedly
fewer in our sample of n, perhaps as few as s + 1.
In this case, there will be a subset of SNPs in
the block whose alleles in our sample essentially
determine those of the remaining SNPs in the
block. These have been called haplotype tags.
Outside the block, many more distinct haplotypes
exist.