The informatics of SNPs and haplotypes

1
The informatics of SNPs and haplotypes
Cold Spring Harbor Laboratory Advanced
Bioinformatics course October 17, 2005
Gabor T. Marth
Department of Biology, Boston College marth_at_bc.edu
2
Why do we care about variations?
underlie phenotypic differences
3
How do we find sequence variations?
4
Steps of SNP discovery
5
Computational SNP mining: PolyBayes
6
SNP mining steps: PolyBayes
sequence clustering simplifies to database search
with genome reference
multiple alignment by anchoring fragments to
genome reference
paralog filtering by counting mismatches weighted
by quality values
SNP detection by differentiating true
polymorphism from sequencing error using quality
values
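A minimal sketch of how quality values can enter the mismatch count used for paralog filtering. It is not the PolyBayes implementation: the function names, the example read, and the choice to weight each mismatch by the probability that the base call is correct are assumptions made for illustration. The only standard ingredient is the Phred relation between a quality value Q and the error probability, p = 10^(-Q/10).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality value Q to a base-call error probability."""
    return 10 ** (-q / 10)


def weighted_mismatches(read, quals, reference):
    """Sum, over mismatching positions, the probability that the read base is correct.

    Low-quality mismatches (likely sequencing errors) contribute little;
    high-quality mismatches contribute almost a full count.
    """
    return sum(
        1.0 - phred_to_error_prob(q)
        for base, q, ref in zip(read, quals, reference)
        if base != ref
    )


# Hypothetical aligned read and reference segment (illustrative data only).
read = "ACGTTA"
reference = "ACCTTG"
quals = [40, 40, 10, 40, 40, 30]

score = weighted_mismatches(read, quals, reference)
print(round(score, 3))
```

A read whose weighted mismatch count is far above what polymorphism alone would produce is a candidate paralog; where to put that cutoff is a modeling decision this sketch does not address.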
7
SNP discovery with PolyBayes
genome reference sequence
8
Polymorphism discovery SW
Marth et al. Nature Genetics 1999
9
Genotyping by sequence
  • SNP discovery usually deals with single-stranded
    (clonal) sequences
  • It is often necessary to determine the allele
    state of individuals at known polymorphic
    locations
  • Genotyping usually involves double-stranded DNA
    → the possibility of heterozygosity exists
  • there is no unique underlying nucleotide, no
    meaningful base quality value, hence statistical
    methods of SNP discovery do not apply

10
Het detection: Diploid base calling
Automated detection of heterozygous positions in
diploid individual samples
(visit Aaron Quinlan's poster)
11
Large SNP mining projects
12
Variation structure is heterogeneous
chromosomal averages
polymorphism density along chromosomes
13
What explains nucleotide diversity?
GC nucleotide content
CpG di-nucleotide content
recombination rate
3' UTR            5.00 x 10^-4
5' UTR            4.95 x 10^-4
Exon, overall     4.20 x 10^-4
Exon, coding      3.77 x 10^-4
synonymous        366 / 653
non-synonymous    287 / 653
functional constraints
Variance is so high that these quantities are
poor predictors of nucleotide diversity in local
regions; hence random processes are likely to
govern the basic shape of the genome variation
landscape → (random) genetic drift
14
Where do variations come from?
  • sequence variations are the result of mutation
    events

TAAAAAT
15
Neutrality vs. selection
  • selective mutations influence the genealogy
    itself; in the case of neutral mutations the
    processes of mutation and genealogy are decoupled

16
Mutation rate
  • higher mutation rate (µ) gives rise to more SNPs

17
Long-term demography
small (effective) population size N
  • different world populations have varying
    long-term effective population sizes (e.g.
    African N is larger than European)
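The effect of mutation rate and effective population size on SNP counts can be made concrete with the standard neutral expectation for the number of segregating sites (Watterson's formula): E[S] = 4 N µ L (1 + 1/2 + ... + 1/(n-1)) for n sampled chromosomes. This is a sketch under textbook assumptions (diploid, constant-size, neutral, infinite-sites); the parameter values below are illustrative and are not taken from the slides.

```python
def expected_snps(n, N, mu, L):
    """Expected number of segregating sites in a sample of n chromosomes.

    N  : effective population size (diploid individuals)
    mu : mutation rate per site per generation
    L  : number of sites in the region
    """
    theta = 4 * N * mu * L                      # population mutation parameter
    a1 = sum(1.0 / i for i in range(1, n))      # harmonic sum over 1..n-1
    return theta * a1


# Illustrative values only: doubling N (or mu) doubles the expectation.
small = expected_snps(n=10, N=10_000, mu=1e-8, L=10_000)
large = expected_snps(n=10, N=20_000, mu=1e-8, L=10_000)
print(small, large)
```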

18
Population subdivision
shared
  • geographically subdivided populations will have
    differences between their respective variation
    structures

19
Recombination
accgttatgtaga
accgttatgtaga
acggttatgtaga
acggttatgtaga
20
Recombination
accgttatgtaga
accgttatgtaga
acggttatgtaga
  • recombination has a crucial effect on the
    association between different alleles

21
Modeling genetic drift: Genealogy
randomly mating population, genealogy evolves in
a non-deterministic fashion
present generation
22
Modeling genetic drift: Mutation
mutations drift randomly: they die out, go to higher
frequency or get fixed
23
Modulators: Natural selection
negative (purifying) selection
positive selection
the genealogy is no longer independent of (and
hence cannot be decoupled from) the mutation
process
24
Modeling ancestral processes
forward simulations
the Coalescent process
By focusing on a small sample, complexity of the
relevant part of the ancestral process is greatly
reduced. There are, however, limitations.
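A minimal sketch of why the Coalescent is cheap for a small sample: only the waiting times between successive coalescence events need to be drawn. It assumes the standard neutral model with constant population size, with time measured in units of 2N generations, so that the waiting time while k lineages remain is exponential with rate k(k-1)/2. Function names are hypothetical.

```python
import random


def coalescent_times(n, rng):
    """Draw the n-1 waiting times between coalescence events for n lineages.

    Time is in units of 2N generations; with k lineages present the
    waiting time is exponential with rate k*(k-1)/2.
    """
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(n):
    """Expected time to the most recent common ancestor, in units of 2N generations."""
    return 2 * (1 - 1 / n)


rng = random.Random(2005)          # fixed seed so the sketch is reproducible
times = coalescent_times(10, rng)
print(sum(times), expected_tmrca(10))
```

Only n-1 random draws are needed regardless of the population size, which is the reduction in complexity that the slide refers to.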
25
Models of demographic history
stationary
expansion
collapse
bottleneck
past
history
present
MD (simulation)
AFS (direct form)
26
Data: polymorphism distributions
1. marker density (MD): distribution of number of
SNPs in pairs of sequences
Clone 1   Clone 2   SNPs
AL00675   AL00982   8
AS81034   AK43001   0
CB00341   AL43234   2
2. allele frequency spectrum (AFS): distribution of
minor allele counts among SNPs
SNP   Minor allele   Allele count
A/G   A              1
C/T   T              9
A/G   G              3
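As a small illustration, the two example tables on this slide can be turned into empirical distributions by tallying. The sketch uses only the three rows shown in each table; the variable and function names are assumptions.

```python
from collections import Counter

# SNP counts per clone pair, from the first table on the slide.
snps_per_pair = [8, 0, 2]
# Minor allele counts per SNP, from the second table on the slide.
minor_allele_counts = [1, 9, 3]


def normalize(counter):
    """Turn raw tallies into relative frequencies, sorted by key."""
    total = sum(counter.values())
    return {key: count / total for key, count in sorted(counter.items())}


md_distribution = normalize(Counter(snps_per_pair))          # marker density
afs_distribution = normalize(Counter(minor_allele_counts))   # allele frequency spectrum
print(md_distribution)
print(afs_distribution)
```

With real data each list would hold thousands of entries, and the resulting histograms are what the demographic models are fitted against.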
27
Model processes that generate SNPs
simulation procedures
28
Models of demographic history
stationary
expansion
collapse
bottleneck
past
history
present
MD (simulation)
AFS (direct form)
29
Data fitting: marker density
  • best model is a bottleneck shaped population
    size history

N3 = 11,000
N2 = 5,000, T2 = 400 gen.
N1 = 6,000, T1 = 1,200 gen.
present
Marth et al. PNAS 2003
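A bottleneck-shaped history of this kind can be written as a piecewise-constant population size. The sketch below takes the slide's parameters as N1 = 6,000 for the most recent T1 = 1,200 generations, N2 = 5,000 for the preceding T2 = 400 generations, and N3 = 11,000 before that; this reading of the parameter line is an assumption, and the names are hypothetical.

```python
# (duration in generations, effective size), ordered from the present backwards.
EPOCHS = [(1200, 6000), (400, 5000)]
ANCESTRAL_SIZE = 11000


def population_size(generations_ago):
    """Effective population size at a given number of generations before present."""
    elapsed = 0
    for duration, size in EPOCHS:
        elapsed += duration
        if generations_ago < elapsed:
            return size
    return ANCESTRAL_SIZE


for t in (0, 1300, 5000):
    print(t, population_size(t))
```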
  • our conclusions from the marker density data are
    confounded by the unknown ethnicity of the public
    genome sequence, so we looked at allele frequency
    data from ethnically defined samples

30
Data fitting: allele frequency
  • Data from other populations?

31
Population specific demographic history
European data
African data
Marth et al. Genetics 2004
32
Model-based prediction
computational model encapsulating what we know
about the process
33
Prediction: allele frequency and age
African data
European data
34
How to use markers to find disease?
35
Allelic association
  • allelic association is the non-random assortment
    between alleles, i.e. it measures how well
    knowledge of the allele state at one site permits
    prediction of the allele state at another

functional site
marker site
  • significant allelic association between a marker
    and a functional site permits localization
    (mapping) even without having the functional site
    in our collection
  • by necessity, the strength of allelic
    association is measured between markers
  • there are pair-wise and multi-locus measures of
    association

36
Linkage disequilibrium
  • LD measures the deviation from random assortment
    of the alleles at a pair of polymorphic sites
  • other measures of LD are derived from D, by e.g.
    normalizing according to allele frequencies (r²)
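The slide's formula did not survive in the transcript. The standard pair-wise definitions are D = p_AB - p_A p_B, where p_AB is the frequency of haplotypes carrying allele A at the first site and allele B at the second, and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). A minimal sketch follows, with haplotypes coded as two-character strings of 0/1 alleles (a coding chosen here for illustration).

```python
def ld_from_haplotypes(haplotypes):
    """Compute D and r^2 for two biallelic sites from phased two-site haplotypes.

    Each haplotype is a string such as "10": allele '1' at site 1, '0' at site 2.
    Both sites must be polymorphic in the sample, otherwise r^2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "1" for h in haplotypes) / n
    p_b = sum(h[1] == "1" for h in haplotypes) / n
    p_ab = sum(h == "11" for h in haplotypes) / n
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Perfect association: knowing one site fully predicts the other.
print(ld_from_haplotypes(["11", "11", "00", "00"]))
# Random assortment: all four combinations equally frequent.
print(ld_from_haplotypes(["11", "10", "01", "00"]))
```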

37
Haplotype diversity
  • the most useful multi-marker measures of
    associations are related to haplotype diversity

2^n possible haplotypes
n markers
random assortment of alleles at different sites
38
Haplotype blocks
Daly et al. Nature Genetics 2001
  • experimental evidence for reduced haplotype
    diversity (mainly in European samples)

39
The promise for medical genetics
  • within blocks a small number of SNPs are
    sufficient to distinguish the few common
    haplotypes → significant marker reduction is
    possible

CACTACCGA CACGACTAT TTGGCGTAT
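For the three example haplotypes on this slide, the marker reduction can be checked by brute force: search for the smallest set of sites whose alleles tell all haplotypes apart. This is an illustrative exhaustive search, not a method from the slides, and it only scales to short blocks.

```python
from itertools import combinations


def minimal_distinguishing_sites(haplotypes):
    """Return the first smallest tuple of site indices that separates all haplotypes."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            patterns = {tuple(h[s] for s in sites) for h in haplotypes}
            if len(patterns) == len(haplotypes):
                return sites
    return None  # only reached if two haplotypes are identical


haplotypes = ["CACTACCGA", "CACGACTAT", "TTGGCGTAT"]
tag_sites = minimal_distinguishing_sites(haplotypes)
print(tag_sites)
```

Two of the nine sites suffice here, because no single biallelic site can separate three haplotypes.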
40
The HapMap initiative
  • goal: to map out human allele and association
    structure at the kilobase scale
  • deliverables: a set of physical and
    informational reagents

41
HapMap physical reagents
  • reference samples: 4 world populations, 100
    independent chromosomes from each

42
Informational haplotypes
  • the problem: the substrate for genotyping is
    diploid, genomic DNA; phasing of alleles at
    multiple loci is in general not possible with
    certainty
  • experimental methods of haplotype determination
    (single-chromosome isolation followed by
    whole-genome PCR amplification, radiation
    hybrids, somatic cell hybrids) are expensive and
    laborious

43
Haplotype inference
  • Parsimony approach: minimize the number of
    different haplotypes that explain all diploid
    genotypes in the sample

Clark Mol Biol Evol 1990
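A simplified sketch in the spirit of the approach cited above: haplotypes that are unambiguous (from individuals with at most one heterozygous site) seed a list of known haplotypes, which is then used to resolve ambiguous genotypes one at a time. The genotype coding ('0' and '1' for the two homozygotes, '2' for a heterozygote), the function names, and the tie-breaking by sorted order are assumptions; the result of such a greedy procedure can depend on the order in which genotypes are resolved.

```python
def complement(genotype, haplotype):
    """Return the haplotype that pairs with `haplotype` to give `genotype`, or None."""
    other = []
    for g, h in zip(genotype, haplotype):
        if g == "2":                      # heterozygous: partner carries the other allele
            other.append("1" if h == "0" else "0")
        elif g == h:                      # homozygous and consistent
            other.append(h)
        else:                             # homozygous but inconsistent
            return None
    return "".join(other)


def resolve_genotypes(genotypes):
    """Greedy haplotype resolution seeded by unambiguous individuals."""
    known, resolved = set(), {}
    for i, g in enumerate(genotypes):
        if g.count("2") <= 1:             # zero or one het site: phase is certain
            pair = (g.replace("2", "0"), g.replace("2", "1"))
            resolved[i] = pair
            known.update(pair)
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in sorted(known):
                other = complement(g, h)
                if other is not None:     # a known haplotype explains this genotype
                    resolved[i] = (h, other)
                    known.add(other)
                    progress = True
                    break
    return resolved, known


resolved, known = resolve_genotypes(["0000", "0022", "2222"])
print(resolved)
print(sorted(known))
```

Genotypes that no known haplotype can explain stay unresolved and are simply absent from the returned dictionary.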
44
Haplotype inference
http://pga.gs.washington.edu/
45
Haplotype annotations: LD based
Wall & Pritchard Nature Rev Gen 2003
  • Pair-wise LD-plots

46
Annotations: haplotype blocks
47
Haplotype tagging SNPs (htSNPs)
Find groups of SNPs such that each possible pair
is in strong LD (above threshold).
Carlson AJHG 2005
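A minimal sketch of the grouping criterion stated above (every pair within a group above an LD threshold). It uses a simple first-fit greedy pass over a pair-wise r² matrix and is not the published algorithm; the matrix values, the threshold, and the function name are illustrative assumptions.

```python
def ld_groups(r2, threshold):
    """Greedily partition SNPs into groups in which every pair has r^2 >= threshold."""
    groups = []
    for snp in range(len(r2)):
        for group in groups:
            # Join the first group where this SNP is in strong LD with all members.
            if all(r2[snp][member] >= threshold for member in group):
                group.append(snp)
                break
        else:
            groups.append([snp])          # no compatible group: start a new one
    return groups


# Hypothetical symmetric r^2 matrix for four SNPs.
r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.95, 0.20],
    [0.85, 0.95, 1.00, 0.05],
    [0.10, 0.20, 0.05, 1.00],
]
groups = ld_groups(r2, threshold=0.8)
print(groups)
```

One SNP per group can then be genotyped as the tag for the rest of the group.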
48
Focal questions about the HapMap
CEPH European samples
49
Samples from a single population?
(random 60-chromosome subsets of 120 CEPH
chromosomes from 60 independent individuals)
50
Consequence for marker performance
51
Sample-to-sample variability?
1. Understanding intrinsic properties of a given
genome region, e.g. estimating local
recombination rate from the HapMap data
McVean et al. Science 2004
3. It would be a desirable alternative to
generate such additional sets with computational
means
52
Towards a marker selection tool
53
Generating data-relevant haplotypes
1. Generate a pair of haplotype sets with
Coalescent genealogies. This models that the
two sets are related to each other by being
drawn from a single population.
54
Generating computational samples
M
Problem: The efficiency of generating
data-relevant genealogies (and therefore
additional sample sets) with standard Coalescent
tools is very low even for modest sample size (N)
and number of markers (M). Despite serious
efforts with various approaches (e.g. importance
sampling) efficient generation of such
genealogies is an unsolved problem.
N
We are developing a method to generate
approximate M-marker haplotypes by composing
consecutive, overlapping sets of data-relevant
K-site haplotypes (for small K)
Motivation from composite likelihood approaches
to recombination rate estimation by Hudson,
Clark, Wall, and others.
55
M-site haplotypes as composites of overlapping
K-site haplotypes
M
1. generate K-site sets
56
Piecing together K-site sets
this should work to the degree to which the
constraint at overlapping markers preserves
long-range marker association
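The piecing-together step can be illustrated with a toy version. The sketch assumes that consecutive K-site windows are shifted by one marker (so they overlap at K-1 sites) and that a composite is extended one marker at a time by picking a K-site haplotype from the next window that agrees on the overlap. These details, the data, and the names are assumptions for illustration, not the method as implemented by the authors.

```python
import random


def compose_haplotype(windows, k, rng):
    """Build one composite haplotype from overlapping K-site haplotype sets.

    windows[w] lists the K-site haplotypes covering markers w .. w+k-1.
    Returns None if no haplotype in some window matches the overlap.
    """
    current = rng.choice(windows[0])
    for haps in windows[1:]:
        overlap = current[-(k - 1):]                       # last K-1 alleles so far
        candidates = [h for h in haps if h[:k - 1] == overlap]
        if not candidates:
            return None
        current += rng.choice(candidates)[-1]              # extend by one marker
    return current


# Hypothetical 3-site haplotype sets for three consecutive windows (5 markers).
windows = [["001", "110"], ["011", "101"], ["111", "010"]]
composite = compose_haplotype(windows, k=3, rng=random.Random(0))
print(composite)
```

Association between markers further apart than K is carried only through the chain of overlaps, which is why longer-range association can be lost for small K.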
57
Building composite haplotypes
58
3-site composite haplotypes
Hinds et al. Science, 2005
59
3-site composite vs. data
60
3-site composites: the best case
61
Variability across sets
is to model sample variance across consecutive
data sets.
But the variability across the composite
haplotype sets is compounded by the inherent loss
of long-range association when 3-sites are used.
62
4-site composite haplotypes
63
Best-case 4 site composites
Composite of exact 4-site sub-haplotypes
64
Variability across 4-site composites
65
Variability across 4-site composites
is comparable to the variability across data
sets.
66
Utility for association studies?
  • No matter how good the resource is, its success
    in finding disease-causing variants greatly
    depends on the allelic structure of common
    diseases, a question under debate

67
http://clavius.bc.edu/marthlab/MarthLab
68
How to use markers to find disease?
genome-wide, dense SNP marker map
  • problem: genotyping cost precludes using
    millions of markers simultaneously for an
    association study
  • depends on the patterns of allelic association
    in the human genome

69
The determinants of allelic association
  • recombination breaks down allelic association
    by randomizing allele combinations

70
PopGen predictions: extent of LD
71
Software engineering aspects
To do larger-scale testing we must first improve
the efficiency of generating composite sets.
Currently, we run fresh Coalescent runs at each
K-site (several hours per region).
Total genotyped SNPs is 1 million → 1
million different K-sites to match. Any given
Coalescent genealogy is likely to match one or
more of these. Computational hap sets can be
databased efficiently.
4 HapMap populations x 1 million K-sites x 1,000
comp sets x 50 bytes < 200 Gigabytes
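The storage estimate is a straightforward product and can be checked directly. The figures are the ones given on the slide; only the variable names are introduced here.

```python
populations = 4              # HapMap populations
k_sites = 1_000_000          # distinct K-sites to match
sets_per_k_site = 1_000      # computational sets stored per K-site
bytes_per_set = 50           # storage per set

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
print(total_bytes)                    # total in bytes
print(total_bytes / 10**9, "GB")      # decimal gigabytes
print(total_bytes / 2**30, "GiB")     # binary gigabytes
```

The product is 2 x 10^11 bytes, which is 200 decimal gigabytes, or somewhat less than 200 when counted in units of 2^30 bytes.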
72
Un-phased genotypes
  • the primary data represent diploid genotypes

(AC)(CG)(AT)(CT)
A G A C C C T T
http://pga.gs.washington.edu/
  • one has the choice to reconstruct the
    haplotypes with statistical methods as shown
    (e.g. the PHASE program); this may be inaccurate
  • or one may account for all possible
    reconstructions when evaluating data-relevance;
    this is computationally very expensive
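To see why accounting for all reconstructions is expensive, the sketch below enumerates every haplotype pair compatible with the example genotype on this slide, (AC)(CG)(AT)(CT). A genotype with h heterozygous sites has 2^(h-1) distinct unordered reconstructions, so the count doubles with each additional heterozygous site. The function name and data layout are assumptions.

```python
from itertools import product


def all_phasings(genotype):
    """Enumerate all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per site.
    """
    pairs = set()
    for flips in product([False, True], repeat=len(genotype)):
        h1 = "".join(b if flip else a for (a, b), flip in zip(genotype, flips))
        h2 = "".join(a if flip else b for (a, b), flip in zip(genotype, flips))
        pairs.add(tuple(sorted((h1, h2))))   # (h1, h2) and (h2, h1) are the same pair
    return sorted(pairs)


genotype = [("A", "C"), ("C", "G"), ("A", "T"), ("C", "T")]
phasings = all_phasings(genotype)
print(len(phasings))
for h1, h2 in phasings:
    print(h1, h2)
```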