Haplotype blocks and tagging SNPs

1
Haplotype blocks and tagging SNPs
  • Lecture 3
  • Biostat 830
  • Winter 2006

2
Outline
  • Evidence of haplotype blocks.
  • Haplotype tagging SNPs.
  • LD based tagSNPs.

3
Haplotype blocks
  • A subset of SNPs can capture the majority of the
    haplotype diversity observed within a region.
  • Johnson et al. Nature Genetics 2001.
  • Discrete haplotype blocks each with limited
    diversity punctuated by recombinations.
  • Daly et al. Nature Genetics 2001.
  • Haplotype structure reveals blocks of limited
    haplotype diversity
  • Patil et al. Science 2001.

4
Reich et al. 2001
  • Extended LD observed in part of the genome.
  • LD extends about 60 kb in European samples, as
    opposed to the 3 kb suggested by Kruglyak's
    simulation estimates.
  • An alternative explanation for the observed
    long-range LD is that the recombination rate in
    the regions studied might be markedly less than
    the genome-wide average. This could happen if
    recombination occurred primarily in
    well-separated hotspots.

5
Johnson et al.
  • 135 kb across 9 genes.
  • 122 SNPs.
  • 384 Europeans.
  • Found that 2-5 SNPs can be used to define the six
    or fewer common haplotypes (> 5%) observed at
    each locus. These common haplotypes and their
    htSNPs account for > 80% of all haplotypes
    observed.

6
Johnson et al. 2001.
7
Daly et al.
  • 5q31 region, 500 kb.
  • Region contains a genetic risk factor for Crohn's
    disease.
  • 103 SNPs (MAF > 5%), about 1 SNP every 5 kb, the
    highest resolution at the time.
  • 129 European trios.

8
(No Transcript)
9
A Long Haplotype Block
Daly et al., Nat. Genet., 2001
10
Daly et al.
  • The region could be largely decomposed into
    discrete haplotype blocks, each with a striking
    lack of diversity.
  • E.g., 2 haplotypes account for 95% of the
    observed chromosomes.
  • Some common haplotypes across blocks are
    correlated, which gives rise to long-range LD.
  • Indicates that inhomogeneous recombination rates
    (evident from clustering of major recombination
    events) need to be considered in studying modern
    human population genetics.

11
Patil et al. 2001
  • Chromosome-wide study of chromosome 21.
  • 24 ethnically diverse individuals.
  • Separated the two copies of chromosome 21 using
    a rodent-human somatic cell hybrid technique,
    which allows direct determination of the full
    haplotypes (Douglas et al. 2001).
  • 35,989 SNPs identified from 20 chromosomes.
  • 24,047 SNPs with the minor allele observed more
    than once.

12
  • 147 SNPs span 106 kb region.
  • 18 blocks.
  • 26 SNPs spanning 19kb region.
  • 7 different haplotypes.
  • The 4 most common account for 80% of all haplotypes.
  • 2 htSNPs.

13
Patil et al.'s results
  • 24,047 common SNPs
  • 4,135 blocks of SNPs. The largest block contains
    114 common SNPs and spans 115 kb of DNA. Average
    block length is 7.8 kb.
  • 4,563 (19%) htSNPs needed to capture all the
    common haplotypes.

14
Other related papers
  • Jeffreys et al. Nature Genetics 2001.
  • 216 kb of the class II region of the MHC.
  • Gabriel et al. Science 2002.
  • 51 autosomal regions spanning 13 Mb.
  • Dawson et al. Nature 2002.
  • Chromosome 22.

15
Haplotype block
P53 knowledge base, figure created by HapBlock.
16
Haplotype block (I)
  • Patil et al. 2001, Zhang et al. 2002.
  • A set of s consecutive SNPs, which, although in
    theory could generate many different haplotypes,
    in fact shows markedly fewer.
  • E.g., a collection of consecutive SNPs such that
    a subset can account for 80% of all haplotypes
    observed.
  • Outside the block, many more distinct haplotypes
    exist.

17
Haplotype block (II)
  • Gabriel et al. 2002.
  • Requires substantial LD within a block.
  • Only a small proportion of marker pairs shows
    evidence of historical recombination.
  • Blocks are partitioned according to whether the
    upper and lower confidence limits on estimates of
    the pairwise D′ measure fall within certain
    threshold values.
  • E.g., 80% of all pairwise LD scores > 0.7.

18
Haplotype block (III)
  • Wang et al. 2002.
  • Recombination based.
  • Uses the four-gamete test to decide on haplotype
    block boundaries.
  • Looks for evidence of historical recombination.
  • Possible gametes for two biallelic SNPs: AB, Ab,
    aB, ab.
  • Within a block, at most three of the possible
    four are observed (with detectable frequency).
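
A minimal sketch of the four-gamete test described above, assuming phased, biallelic haplotypes coded as strings of 0/1. The function name and the toy data are hypothetical, and the sketch counts any observed gamete rather than applying a frequency cutoff.

```python
def four_gamete_violation(haplotypes, i, j):
    """Return True if all four gametes are observed at SNP columns i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


# Hypothetical phased haplotypes over three biallelic SNPs (0/1 coding).
haps = ["000", "011", "100", "111"]

print(four_gamete_violation(haps, 0, 1))  # True: 00, 01, 10, 11 all occur
print(four_gamete_violation(haps, 1, 2))  # False: only 00 and 11 occur
```

Under this reading, a block boundary would be placed between two SNPs once a pair that spans them shows all four gametes.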

19
Four-gamete test
20
tagSNP selection
21
original
22
51
23
101
24
271
25
1401
Sarah Betz, Nirav Bhagat, Paul Murphy and Maureen
Stengler
26
Haplotype-based tagSNP
27
HtSNPs
  • A subset of SNPs in the block whose alleles
    essentially determine those of the remaining SNPs
    in the block.

28
HtSNPs
  • Haplotype tagging SNPs (htSNPs) are markers which
    capture most of the haplotypes in a haplotype
    block. The aim is a minimum set of SNPs that
    accounts for the majority of common haplotypes.

29
Different algorithms
  • Patil et al.'s greedy approach.
  • Zhang et al.'s dynamic programming approach.
  • Clayton's PDE approach.
  • Stram's correlation measure.

30
Patil et al.'s greedy approach
  • The goal is to select the block that maximizes
    the ratio of the number of SNPs in the block, B,
    to f(B), the minimum number of SNPs required to
    distinguish a certain percentage of all
    haplotypes observed.

31
Zhang's dynamic programming approach
  • The goal is to find the optimal block partition,
    which is defined as the one that minimizes the
    total number of representative SNPs.
  • Define Sj to be the number of representative SNPs
    for the optimal block partition of the first j
    SNPs.
  • Dynamic programming is used to evaluate the
    recursion for Sj.
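
The recursion can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes S_0 = 0 and S_j = min over i of S_{i-1} + f(i, j), taken over those i for which SNPs i..j form a valid block. The callables `is_block` and `f`, and the toy rules passed to them, are hypothetical placeholders. Ties in the number of representative SNPs are broken toward fewer blocks by comparing (tags, blocks) tuples.

```python
def optimal_partition(n, is_block, f):
    """Partition SNPs 1..n into consecutive blocks minimizing total tag SNPs.

    is_block(i, j): True if SNPs i..j (1-based, inclusive) form a valid block.
    f(i, j): number of representative SNPs needed for block i..j.
    Returns ((total tags, number of blocks), list of (start, end) blocks).
    Assumes every single SNP is a valid block, so a partition always exists.
    """
    inf = float("inf")
    S = [(0, 0)] + [(inf, inf)] * n   # S[j]: best (tags, blocks) for first j SNPs
    back = [0] * (n + 1)              # back[j]: start of the last block ending at j
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            if not is_block(i, j):
                continue
            cand = (S[i - 1][0] + f(i, j), S[i - 1][1] + 1)
            if cand < S[j]:           # lexicographic: fewer tags, then fewer blocks
                S[j] = cand
                back[j] = i
    blocks, j = [], n
    while j > 0:                      # walk the back-pointers to recover the blocks
        blocks.append((back[j], j))
        j = back[j] - 1
    return S[n], blocks[::-1]


# Toy rules (hypothetical): blocks hold at most 3 SNPs; blocks of 1-2 SNPs
# need one tag, blocks of 3 SNPs need two.
cost, blocks = optimal_partition(
    5,
    is_block=lambda i, j: j - i + 1 <= 3,
    f=lambda i, j: 1 if j - i + 1 <= 2 else 2,
)
print(cost, blocks)  # (3, 2) [(1, 2), (3, 5)]
```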

32
Dynamic programming
  • The term was originally used in the 1940s by
    Richard Bellman, an American mathematician, to
    describe the process of solving problems where
    one needs to find the best decisions one after
    another.
  • Applications: the Viterbi algorithm in HMMs, the
    Needleman-Wunsch algorithm in sequence alignment.

33
Dynamic programming
  • Grid (row labels on the left, column labels along
    the bottom):

      5 | 6 7 4 7 8
      4 | 7 6 1 1 4
      3 | 3 5 7 8 2
      2 | 2 6 7 0 2
      1 | 7 3 5 6 1
        +----------
          1 2 3 4 5

34
Zhang's dynamic programming approach (continued)
  • There may be ties: several partitions may exist
    that give the minimum number of htSNPs. Among
    these, find the partition with the minimum number
    of blocks.

35
Zhang et al.'s results
  • 24,047 SNPs spanning 32.4 Mb of chromosome 21.
    Uses the 80% coverage criterion, as in Patil
    et al.

36
Remarks
  • Formulated as a nice and simple optimization
    problem.
  • The dynamic programming solution is simple and
    fast.
  • Results are better than those of Patil et al.'s
    greedy approach.
  • Can be adapted to other block definitions or
    criteria.
  • No direct biological interpretation of the blocks
    identified.

37
Clayton's approach
  • One of the earliest.
  • Aims to select htSNPs that best extract the
    haplotype information in a gene.
  • Choosing a subset of tagging SNPs, from the full
    set in such a way that the genotypes not included
    in the subset of tagging SNPs can be predicted
    well (from haplotypes based on tagging SNPs).

38
Clayton's approach
  • Define a measure of haplotype diversity: the
    total number of differences recorded in all N^2
    pairwise comparisons, computed for each locus j.

39
Clayton's approach
  • For a candidate collection of H htSNPs, define
    the residual diversity as the sum of the
    diversity within haplotypes defined by the htSNPs
    (ht-haplotypes).

40
Clayton's approach
  • A C T G G A C G T
  • A C T G T A T G C
  • A C T G G A C A T
  • A C G A G G T G C
  • A C G A T A C G T
  • G G T A G A C A C
  • G C T G G G C G T
  • G G G G G G T G T

41
Probability interpretation
  • Suppose two haplotypes differ at locus j. Then Pj
    is the probability that they fall into different
    ht-haplotype groups.
  • Measures the extent to which knowledge of the
    ht-haplotypes carried by a subject predicts the
    alleles carried at the further locus, j.
  • If Pj = 1, then locus j is perfectly predicted by
    the ht-haplotypes.
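
The formulas on these slides did not survive transcription, so the following is only a sketch under stated assumptions: diversity at locus j is taken as the number of ordered haplotype pairs that differ at j, residual diversity is the same count restricted to pairs within one ht-haplotype group, and P_j = 1 - residual / total. The function names are hypothetical. The data are the eight haplotypes listed on the earlier slide.

```python
from collections import defaultdict


def diversity(haps, j):
    """Number of ordered pairs of haplotypes that differ at locus j."""
    return sum(a[j] != b[j] for a in haps for b in haps)


def pde(haps, tags, j):
    """P_j: share of the diversity at locus j explained by the tag loci."""
    groups = defaultdict(list)
    for h in haps:
        # Haplotypes with identical alleles at the tag loci share a group.
        groups[tuple(h[t] for t in tags)].append(h)
    total = diversity(haps, j)
    residual = sum(diversity(g, j) for g in groups.values())
    return 1.0 if total == 0 else 1 - residual / total


# The eight haplotypes from the example slide (loci are 0-based here).
haps = [
    "ACTGGACGT", "ACTGTATGC", "ACTGGACAT", "ACGAGGTGC",
    "ACGATACGT", "GGTAGACAC", "GCTGGGCGT", "GGGGGGTGT",
]

print(pde(haps, tags=[0], j=0))  # 1.0: a tag SNP predicts itself
print(pde(haps, tags=[0], j=1))  # 5/6: locus 1 is only partly predicted
```

With the first locus as the only tag, the A group is uniform at the second locus, but the G group still contains both C and G there, which leaves 4 of the 24 differing ordered pairs unexplained.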

42
Remarks
  • Has connection to ANOVA, total variation, within
    group variation and between group variation.
  • When haplotypes are not directly observed, use
    estimated frequencies as weights when calculating
    Rj etc.
  • Not suitable for a large number of SNPs. Time
    consuming to calculate P.

43
Stram et al.'s method
  • Choose subsets of "tagging" SNPs in such a way
    that the haplotype distribution defined across
    the full set of SNPs can be predicted well.
  • Specifically, it considers a measure of
    association between the true numbers of copies of
    haplotypes (defined over the full set of SNPs)
    and the predicted number of copies of haplotypes
    that each individual has.

44
Stram et al.'s R_h²
  • Defined as the squared correlation between true
    and predicted haplotype dosage (the number of
    copies of a specific haplotype h that an
    individual carries, which can be 0, 1, or 2),
    where the prediction is based on genotype data.

45
R_h² for predicting haplotypes of 2 SNPs
46
Examples of Stram et al.
47
Examples of Stram et al.
  • Best choices, by the criteria, of htSNPs for a
    region of reduced haplotype diversity in the
    CYP19 gene among Japanese-American members of the
    multiethnic cohort study

48
Remarks
  • Different from other approaches. Stram et al.
    advocate evaluating how well dosage (0,1,2
    copies) can be predicted by htSNPs.
  • Oriented toward haplotype-based association
    studies. Measure uncertainties in haplotype
    inference. Useful in determining sample size
    requirements.
  • PL-EM used to infer haplotypes, so no haplotype
    frequencies are needed a priori.

49
Comparisons
  • Clayton proposes choosing tagging SNPs in such a
    way that one can make predictions for the SNPs
    that aren't going to be genotyped in the full
    sample, on the basis of haplotypes reconstructed
    from SNPs which will be genotyped in the full
    sample.
  • Stram et al. choose tagging SNPs so that
    predictions for haplotypes across the full set of
    SNPs, made on the basis of haplotypes
    reconstructed from SNPs which will be genotyped
    in the full sample, are optimal.

50
Other methods
  • SpD algorithm.
  • Meng et al. 2002.
  • BEST algorithm.
  • Sebastiani et al. 2004.

51
Meng et al.'s SpD method
  • Uses spectral decomposition (SpD) of the matrix
    of pairwise LD between markers. R represents the
    variance-covariance matrix (pairwise LD matrix).
  • R can be written as
    $R = \sum_{i=1}^{m} \lambda_i e_i e_i^{T}$,
    where $\lambda_i$ and $e_i$ are the eigenvalues
    and eigenvectors of R.
  • The number of markers to be retained, k, is
    determined by
    $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{m} \lambda_i \ge a$,
    where a is the proportion of information
    retained.
  • A marker is selected if it contributes most to
    the top eigenvectors selected.

52
Remarks
  • Does not rely on a definition of haplotype blocks.
  • SpD is computationally efficient and can be
    applied to analyze a larger number of SNPs (e.g.,
    candidate genes typed for several dozen markers)
    without using sliding windows.

53
BEST algorithm
  • BEST recursively partitions the set S (m SNPs) in
    two groups H and D, where H (k SNPs) is the
    minimal tagging set (the smallest set of SNPs
    necessary and sufficient to derive all of the
    SNPs in the haplotype set), D is the set of (m-k)
    SNPs that are derivable from H, and a SNP dj is
    derivable from H if the value of dj in each
    haplotype can be expressed as a Boolean function
    f(.) of elements of H.
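
The derivability condition can be illustrated with a small sketch. It assumes that "expressible as a Boolean function of H" is equivalent to "haplotypes that agree on every SNP in H also agree at dj". The search below is a plain brute force over subsets, not BEST's recursive partitioning, and all names and data are hypothetical.

```python
from itertools import combinations


def is_derivable(haps, H, d):
    """True if the allele at SNP d is a function of the alleles at SNPs in H."""
    seen = {}
    for h in haps:
        key = tuple(h[i] for i in H)
        # The first haplotype with this key fixes the value; later ones must match.
        if seen.setdefault(key, h[d]) != h[d]:
            return False
    return True


def minimal_tagging_set(haps):
    """Smallest H such that every other SNP is derivable from H (brute force)."""
    m = len(haps[0])
    for k in range(m + 1):
        for H in combinations(range(m), k):
            if all(is_derivable(haps, H, d) for d in range(m) if d not in H):
                return list(H)


# Toy haplotype set: the third SNP is the XOR of the first two.
haps = ["000", "011", "101", "110"]
print(is_derivable(haps, (0, 1), 2))  # True
print(minimal_tagging_set(haps))      # [0, 1]
```

No single SNP suffices in this toy set, because each SNP takes both values within each allele class of any other SNP.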

54
Remarks
  • Exact, analytical, lossless solution to the
    problem of identifying the minimum set of SNPs
    accounting for the variations in an arbitrary
    genomic region.
  • Need to know haplotypes.
  • It does not use haplotype frequency information,
    so its results are questionable when rare
    haplotypes are included.

55
Pairwise LD Based Methods
  • Find a minimum tagSNP set such that every SNP is
    either a tagSNP or is in LD with a tagSNP.
  • Pairwise r² is directly related to sample size
    and power of association studies.
  • Pritchard and Przeworski 2001.
  • Greedy approach: keep selecting the untagged SNP
    that is in LD with the most remaining untagged
    SNPs.
  • Carlson et al. 2004.
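
A minimal sketch of the greedy selection, assuming phased haplotypes coded as tuples of 0/1 and "in LD" meaning pairwise r² at or above a threshold (0.8 here, an illustrative choice). The function names and toy data are hypothetical, and this is not the code of any published tool.

```python
def r2(haps, i, j):
    """Pairwise r² between SNP columns i and j from phased 0/1 haplotypes."""
    n = len(haps)
    pi = sum(h[i] for h in haps) / n
    pj = sum(h[j] for h in haps) / n
    pij = sum(h[i] * h[j] for h in haps) / n
    denom = pi * (1 - pi) * pj * (1 - pj)
    return 0.0 if denom == 0 else (pij - pi * pj) ** 2 / denom


def greedy_tags(haps, threshold=0.8):
    """Repeatedly pick the untagged SNP in LD with the most untagged SNPs."""
    m = len(haps[0])
    partners = {
        i: {j for j in range(m) if j != i and r2(haps, i, j) >= threshold}
        for i in range(m)
    }
    untagged, tags = set(range(m)), []
    while untagged:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(untagged), key=lambda i: len(partners[i] & untagged))
        tags.append(best)
        untagged -= partners[best] | {best}
    return tags


# Toy data: SNPs 0-2 are identical; SNP 3 is independent of them.
haps = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1)]
print(greedy_tags(haps))  # [0, 3]
```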

56
Greedy May Not Be Optimal
57
Greedy May Not Be Optimal
58
Greedy May Not Be Optimal
59
Greedy May Not Be Optimal
60
Greedy May Not Be Optimal
61
Exhaustive Search Achieves Optimality, But
  • Go through all k-SNP combinations.
  • Start from k = 1.
  • If not successful, k ← k + 1.
  • Guaranteed to find the optimal tagSNP set.
  • Becomes computationally prohibitive as k increases.
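
The exhaustive search can be sketched as below, assuming the LD structure is already summarized as a hypothetical `partners` dictionary that maps each SNP to the set of SNPs in strong LD with it.

```python
from itertools import combinations


def exhaustive_tags(partners):
    """Smallest set of SNPs such that every SNP is a tag or in LD with a tag."""
    snps = sorted(partners)
    everything = set(snps)
    for k in range(1, len(snps) + 1):          # start from k = 1, then k + 1, ...
        for combo in combinations(snps, k):
            covered = set(combo).union(*(partners[t] for t in combo))
            if covered >= everything:
                return list(combo)
    return []


# Toy LD structure: a chain 0 - 1 - 2 - 3 - 4 of neighbouring SNPs in LD.
partners = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(exhaustive_tags(partners))  # [0, 3]
```

The number of k-SNP combinations grows combinatorially with k, which is why this only works for small SNP sets.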

62
A Different Strategy (1)
  • SNPs naturally fall into subsets (precincts) due
    to the blocky LD structure of the genome.
  • SNPs in different precincts are not in strong LD.

Haploview
63
A Different Strategy (2)
  • Find tagSNPs separately by precinct.
  • Precincts often small enough to allow exhaustive
    search.

Haploview
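
One way to form precincts, sketched under the assumption that a precinct is a connected component of the graph whose edges join SNP pairs in strong LD. This is the finest partition in which SNPs from different precincts are never in strong LD. The `partners` dictionary and function name are hypothetical.

```python
def precincts(partners):
    """Split SNPs into connected components of the strong-LD graph."""
    seen, result = set(), []
    for start in sorted(partners):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first walk from `start`
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(partners[v] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result


# Toy LD structure: SNPs 0-2 are linked, 3-4 are linked, 5 stands alone.
partners = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
print(precincts(partners))  # [[0, 1, 2], [3, 4], [5]]
```

Each precinct can then be searched for tagSNPs independently of the others.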
64
Hybrid Greedy-Exhaustive Search
  • What if a precinct is too large for exhaustive
    search?
  • Hybrid: first exhaustive, then greedy.
  • First conduct k-SNP exhaustive search for k as
    large as feasible.
  • For each k-SNP combination, use greedy to pick
    tagSNPs among remaining SNPs.
  • Combine the smallest tagSNP set with k SNPs.
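
A sketch of the hybrid idea under simplifying assumptions: LD is summarized by a hypothetical `partners` dictionary, the greedy step only considers SNPs that are still untagged, and k is fixed by the caller rather than chosen "as large as feasible".

```python
from itertools import combinations


def greedy_cover(partners, untagged):
    """Greedy tagging of the SNPs in `untagged`."""
    untagged, tags = set(untagged), []
    while untagged:
        best = max(sorted(untagged), key=lambda i: len(partners[i] & untagged))
        tags.append(best)
        untagged -= partners[best] | {best}
    return tags


def hybrid_tags(partners, k):
    """Try every k-SNP seed, finish each greedily, keep the smallest result."""
    snps, best = set(partners), None
    for combo in combinations(sorted(snps), k):
        covered = set(combo).union(*(partners[t] for t in combo))
        candidate = list(combo) + greedy_cover(partners, snps - covered)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


# Toy LD structure: SNP 0 is in LD with 1, 2, 3; each of those has one more
# partner (4, 5, 6) that is not in LD with anything else.
partners = {0: {1, 2, 3}, 1: {0, 4}, 2: {0, 5}, 3: {0, 6},
            4: {1}, 5: {2}, 6: {3}}
print(greedy_cover(partners, partners))  # [0, 4, 5, 6]
print(hybrid_tags(partners, 1))          # [1, 2, 3]
```

In this toy case pure greedy is drawn to SNP 0 and ends with four tags, while a one-SNP exhaustive seed followed by greedy recovers a three-tag set.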

65
Application to Chromosome 2
  • HapMap Phase II data (release 18).
  • CEU plate.
  • Chr 2: 243 Mb, 7.6% of the human genome.
  • Total number of SNPs: 527,434.
  • SNPs with MAF > 0.05: 182,156.

66
Extrapolation of TagSNP Counts to Genome Wide
Numbers shown in this column are estimated based
on extrapolation.
67
Additional FESTA Features
  • Force some SNPs to be included/excluded in tagSNP
    set.
  • Double coverage.
  • Require each untagged SNP to be tagged by 2
    tagSNPs.
  • Robust against genotyping failure.
  • Additional criteria.
  • User defined priority score.
  • E.g., average r², assay design score, MAF, etc.

68
Average of Simulated Design Scores
  • For r² > 0.8, there are 1.51012257 equivalent
    tagSNP sets.
  • Average design scores range from 0.757 to 0.883.
  • With the greedy approach, the average design
    score is 0.830.

69
Summary
  • Key ideas
  • Partition.
  • Hybrid search.
  • Reduces size of tagSNP set.
  • Provides flexibility by allowing choices among
    multiple equivalent solutions.

70
Thank You
71
SNP block structure in chromosome 21
  • Begin by considering all possible blocks of ≥ 1
    consecutive SNPs.
  • Next, exclude all blocks in which < 80% of the
    chromosomes in the data are defined by haplotypes
    represented more than once in the block (80%
    coverage). Ambiguous haplotypes are treated as
    missing data and not included when calculating
    coverage.
  • Considering the remaining overlapping blocks
    simultaneously, select the one which maximizes
    the ratio of total SNPs in the block to the
    number required to uniquely discriminate
    haplotypes represented more than once in the
    block. Any of the remaining blocks that
    physically overlap with the selected block are
    discarded, and the process repeated until we have
    selected a set of contiguous, non-overlapping
    blocks that cover the 32.4 Mb of chr 21 with no
    gaps and with every SNP assigned to a block.
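
The scoring step for a single candidate block can be sketched as follows. It assumes fully resolved haplotypes (ambiguous ones are not modelled), finds the discriminating SNPs by brute force, and leaves out the outer loop that repeatedly picks the best block and discards overlapping ones. Function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coverage(haps):
    """Fraction of chromosomes whose haplotype occurs more than once."""
    counts = Counter(haps)
    return sum(c for c in counts.values() if c > 1) / len(haps)


def min_tags(haps):
    """Fewest SNPs that tell apart all haplotypes seen more than once."""
    common = [h for h, c in Counter(haps).items() if c > 1]
    m = len(haps[0])
    for k in range(m + 1):
        for cols in combinations(range(m), k):
            if len({tuple(h[c] for c in cols) for h in common}) == len(common):
                return k


def block_score(haps, min_coverage=0.8):
    """Ratio of SNPs in the block to SNPs required, or None if coverage fails."""
    if coverage(haps) < min_coverage:
        return None
    return len(haps[0]) / max(min_tags(haps), 1)  # guard against zero tags


# Toy block of 4 SNPs on 10 chromosomes: two common haplotypes, one singleton.
block = ["0000"] * 5 + ["1111"] * 4 + ["0101"]
print(coverage(block), min_tags(block), block_score(block))  # 0.9 1 4.0
```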

72
  • depends on the criterion to be optimized, that
    is, by how and to what extent do we wish to trade
    off the diversity permitted in a block's
    haplotypes against the number of haplotype tags,
    both locally and globally.

73
  • G A C G G G C C A G A T T G T G C T C C
  • G A C G G G C C A G A T T G T A T T C C
  • G A C G G G T C A C A T T A T G T T C G
  • A G C G A G C T G G T C C G C G T T C G
  • G G C C G A C T A G A T T G T G T T C C
  • G A T G G G C C A G A T T G T G T T C C
  • G A C G G G C C A G A T T G T G T T C C
  • G A C G G G C C A G A T T G T G C C T G
  • A G C G A G C T G G T C T G C G T T C G
  • A G C G A G C T G G T C C G C A T T C C

74
What is a haplotype block?
  • A block is a set of s consecutive SNPs, which,
    although in theory could generate as many as 2^s
    different haplotypes, in fact shows markedly
    fewer in our sample of n, perhaps as few as s + 1.
    In this case, there will be a subset of SNPs in
    the block whose alleles in our sample essentially
    determine those of the remaining SNPs in the
    block. These have been called haplotype tags.
    Outside the block, many more distinct haplotypes
    exist.