Title: Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome
1Single Nucleotide Polymorphisms (SNPs),
Haplotypes, Linkage Disequilibrium, and the Human
Genome
- Manish Anand
- Nihar Sheth
- Jim Costello
- 24th November, 2003
2Overview
- Biological Background
- Terminology
- SNP related general information
- SNP detection techniques
- SNP Applications
- References
3Biological Background
- How can researchers hope to identify and study
all the changes that occur in so many different
diseases?
- How can they explain why some people respond to
treatment and not others?
4- SNPs are the answer to these questions
- So what exactly are SNPs?
- How are they involved in so many different
aspects of health?
5What is a SNP?
- A SNP is defined as a single base change in a DNA
sequence that occurs in a significant proportion
(more than 1 percent) of a large population.
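The 1 percent threshold in this definition can be checked mechanically. The following is a minimal sketch, not a standard tool: the function names, the `threshold` parameter and the allele counts are hypothetical, and it assumes the quantity compared against the threshold is the frequency of the second most common base at the site.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common base observed at one site."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0  # only one base observed: no variation at this site
    return counts[1][1] / len(alleles)


def is_snp(alleles, threshold=0.01):
    """Apply the 'more than 1 percent of the population' rule (assumed cutoff)."""
    return minor_allele_frequency(alleles) > threshold


# Hypothetical samples of 1000 chromosomes at two sites.
common_site = ["A"] * 970 + ["G"] * 30   # G seen in 3 percent of the sample
rare_site = ["C"] * 999 + ["T"] * 1      # T seen in 0.1 percent of the sample

print(is_snp(common_site), is_snp(rare_site))
```

Under these assumptions the first site qualifies as a SNP and the second is treated as a rare variant.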
6Variations in Genome
7Terminology
- Polymorphism
- Linkage Disequilibrium
- Correlation of character states among
polymorphic sites
- Insufficient passage of time to randomize
character states by meiotic recombination
- Haplotype
8Some Facts
- In human beings, 99.9 percent of bases are the same.
- The remaining 0.1 percent makes a person unique.
- Different attributes / characteristics / traits
- how a person looks,
- diseases he or she develops.
- These variations can be
- Harmless (change in phenotype)
- Harmful (diabetes, cancer, heart disease,
Huntington's disease, and hemophilia)
- Latent (variations found in coding and regulatory
regions that are not harmful on their own; the
change in each gene only becomes apparent under
certain conditions, e.g. susceptibility to lung
cancer)
9SNP facts
- SNPs are found in
- coding and (mostly) noncoding regions.
- They occur with a very high frequency
- estimates range from about 1 in 1000 bases to
1 in 100 to 300 bases.
- The abundance of SNPs and the ease with which
they can be measured make these genetic
variations significant.
- A SNP close to a particular gene acts as a marker
for that gene.
- SNPs in coding regions may alter the structure of
the protein made by that coding region.
10SNPs may / may not alter protein structure
11SNPs act as gene markers
12SNP maps
- Sequence the genomes of a large number of people.
- Compare the base sequences to discover SNPs.
- Generate a single map of the human genome
containing all possible SNPs => SNP maps
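The compare-and-map steps can be sketched in a few lines. This is an illustration under simplifying assumptions rather than a real SNP discovery pipeline: the sequences are already aligned and of equal length, sequencing errors are ignored, and the function name `find_variable_sites` and the toy sequences are hypothetical.

```python
def find_variable_sites(sequences):
    """Return {position: set of bases} for aligned columns that vary."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        bases = {seq[pos] for seq in sequences}
        if len(bases) > 1:  # more than one base seen at this column
            sites[pos] = bases
    return sites


# Hypothetical aligned fragments from four people.
people = ["ACGTTAGC", "ACGTTAGC", "ACATTAGC", "ACGTTCGC"]
snp_map = find_variable_sites(people)
print(sorted(snp_map))  # 0-based positions of the candidate SNPs
```

A real map would additionally apply a population frequency cutoff before calling a variable site a SNP.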
13SNP Maps
14SNP Profiles
- The genome of each individual contains a distinct SNP
pattern.
- People can be grouped based on their SNP profile.
- SNP profiles are important for identifying response
to drug therapy.
- Correlations might emerge between certain SNP
profiles and specific responses to treatment.
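One simple way to picture the grouping idea is to bucket people by identical SNP profile and tabulate how each bucket responded to a treatment. The sketch below uses entirely made-up profiles and responses; the function name `group_by_profile` is hypothetical, and real studies would use statistical tests rather than raw response rates.

```python
from collections import defaultdict


def group_by_profile(profiles):
    """Map each distinct SNP profile to the list of people carrying it."""
    groups = defaultdict(list)
    for person, profile in profiles.items():
        groups[profile].append(person)
    return dict(groups)


# Hypothetical three-SNP profiles and treatment outcomes.
profiles = {"p1": "AGT", "p2": "AGT", "p3": "ACT", "p4": "GGT"}
responded = {"p1": True, "p2": True, "p3": False, "p4": False}

groups = group_by_profile(profiles)
response_rate = {
    profile: sum(responded[p] for p in members) / len(members)
    for profile, members in groups.items()
}
print(response_rate)
```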
15SNP Profiles
16Techniques to detect known Polymorphisms
- Hybridization Techniques
- Micro arrays
- Real time PCR
- Enzyme based Techniques
- Nucleotide extension
- Cleavage
- Ligation
- Reaction product detection and display
- Comparison of Techniques used
17Hybridization Techniques
- Micro Arrays
- Sequencing by hybridization
- utilizes a set of tiling oligonucleotides
- somewhat complex
- pooling and processing of PCR amplicons that are
subsequently hybridized to a DNA micro array and
visualized.
- Theoretically capable of genotyping thousands of
polymorphisms simultaneously
- Success rate 97% (somewhat low for this kind of
analysis)
- High false rates (11-21%)
- Design and fabrication of micro arrays is
expensive, hence users are confined to the set of
genotypes established by the manufacturer.
18- Real Time PCRs
- Utilizes TaqMan(TM) DNA probes to detect PCR
products in real time
- The TaqMan(TM) probe contains a fluorescent reporter at
the 5' end and a fluorescence resonance energy
transfer (FRET) moiety at the 3' end, which
quenches the fluorescent signal of the reporter.
- The probe sequence is complementary to the PCR
amplicon and is designed to anneal at the
extension temperature.
- During extension, the 5'->3' exonuclease activity
of Taq DNA polymerase I cleaves the probe,
emitting signal due to the separation of the
reporter from the quencher.
- Polymorphism is determined solely by
hybridization and not by the ability of the
enzyme to discriminate.
- Because the enzyme does not confer specificity in
detection, this technique is classified as
hybridization-based.
- Depending on the optical thermocycler platform, 384
reactions can be monitored for each cycle without
removing any sample
- Amenable to robotic automation.
19Real Time PCRs
20Enzyme based Techniques
- Nucleotide extension
- Simplest technique for known polymorphism
detection
- Existing in numerous variations (also known as
minisequencing, SNuPE, GBA, APEX, AS-PE capture,
FNC, TDI or PROBE), this assay typically involves
the single base extension of an oligonucleotide
by a polymerase
- The oligonucleotide is designed to anneal immediately
upstream of the polymorphic locus, and
differentially labeled fluorescent
dideoxynucleotides are utilized as substrates for
polymerase extension.
- The fluorescent signal emitted corresponds to the
nucleotide incorporated and thus the sequence of
the polymorphism.
- Simplicity and accuracy in distinguishing between
heterozygous and homozygous genotypes.
- Targets need to be PCR amplified; PCR reagents
must be removed.
- False negatives due to mis-priming
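As a rough illustration of how the per-nucleotide fluorescent signals could be turned into a homozygous or heterozygous call, the sketch below keeps every base whose share of the total signal passes a cutoff. The function name `call_genotype`, the `min_fraction` cutoff of 0.25 and the signal values are all assumptions made for this example, not parameters of any particular instrument or assay.

```python
def call_genotype(signals, min_fraction=0.25):
    """Call a genotype from fluorescence intensities keyed by base.

    A base is 'called' when it carries at least min_fraction of the
    total signal (an assumed, illustrative cutoff).
    """
    total = sum(signals.values())
    if total <= 0:
        return None  # no usable signal
    called = sorted(b for b, s in signals.items() if s / total >= min_fraction)
    if len(called) == 1:
        return called[0] * 2        # one base extended: homozygous
    if len(called) == 2:
        return "".join(called)      # two bases extended: heterozygous
    return None                     # ambiguous signal, no call


# Hypothetical intensities for two samples.
sample_1 = {"A": 950, "C": 10, "G": 20, "T": 20}
sample_2 = {"A": 480, "C": 10, "G": 500, "T": 10}
print(call_genotype(sample_1), call_genotype(sample_2))
```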
21Nucleotide Extension
22- Cleavage
- The Invader(TM) assay utilizes the exonuclease
activity of Cleavase VIII on overlapping
oligonucleotide strands.
- Two oligonucleotides, an invader probe and
either a wild-type or mutant primary probe,
overlap each other at a single nucleotide
position on the template only if they are
complementary to the polymorphism being queried.
- Cleavage occurs when the specific overlapping
conformation is present, freeing an
oligonucleotide referred to as a flap.
- This flap can be detected in a multiplex manner
by size, mass or sequence
- Commonly the flap participates in a second
cleavage assay with another complementary target,
causing release of a fluorescent signal.
- Advantage: the same flap may bind to many
targets, generating a cascading signal
amplification and thereby obviating the need for
PCR amplification.
- Single-tube one-step reaction.
23Cleavage
24- Ligation
- One of the most specific assays due to the high
specificity of T4 ligase (oligo ligation assay)
and even higher specificity of thermostable
ligases (ligation detection reaction, LDR)
- Two primers are designed to anneal adjacent to
one another on the target of interest
- Generally, the upstream primer (discriminating
primer) contains a fluorescent label at the 5'
end, with the 3' nucleotide overlapping the
polymorphic base.
- The fluorescent signal corresponds to the allele
being queried at the 3' position of the
discriminating primer
- When the discriminating primer forms a perfect
complement with the target at the junction, the
ligase covalently attaches the adjacent
downstream primer (common primer)
- The resulting product is approximately twice as
long as each of the individual primers and can be
easily monitored for detection by means of
capillary electrophoresis or by display on a
microarray
- Advantage: very good sensitivity and specificity
25Techniques to detect unknown Polymorphisms
- Direct Sequencing
- Microarray
- Cleavage / Ligation
- Electrophoretic mobility assays
- Comparison of Techniques used
26Direct Sequencing
- Sanger dideoxy sequencing can detect any type of
unknown polymorphism and its position, when the
majority of DNA contains that polymorphism.
- Misses polymorphisms and mutations when the DNA
is heterozygous
- Limited utility for analysis of solid tumors or
pooled samples of DNA due to low sensitivity
- Once a sample is known to contain a polymorphism
in a specific region, direct sequencing is
particularly useful for identifying the
polymorphism and its specific position.
- Even if the identity of the polymorphism cannot
be discerned in the first pass, multiple
sequencing attempts have proven quite successful
in elucidating sequence and position information.
27Microarray
- Variation detection arrays (VDA) scan large
sequence blocks and identify regions containing
unknown polymorphisms.
- This methodology suffers from the same
limitations in fabrication and design as observed
in known polymorphism analysis, but has
demonstrated much greater success in the context
of unknown polymorphism detection for both SNP
and tumor analysis.
- With respect to SNP analysis, a recent study of
chromosome 21 successfully identified
approximately half of the estimated number of
common SNPs (frequency of 10-50%) across the
entire chromosome.
- The experimental design required a sacrifice in
sensitivity in order to minimize false positives.
- This explains the decrease in successful
identification from 80% to 50%.
28 Cleavage/Ligation
- Unknown polymorphisms can also be identified by
the cleavage of mismatches in DNA-DNA
heteroduplexes.
- This can be achieved either chemically (chemical
cleavage method, CCM) or enzymatically (T4
endonuclease VII, MutY cleavage or Cleavase).
- Typically, at least two samples are PCR amplified
(one sample can be sufficient for solid tumor
samples with high levels of stromal
contamination), denatured and then hybridized to
create DNA-DNA heteroduplexes of the variant
strands.
- Enzymes cleave adjacent to the mismatch and
products are resolved via gel or capillary
electrophoresis.
- Unfortunately, the cleavage enzymes often nick
complementary regions of DNA as well. This
increases background noise, lowers specificity,
and reduces the pooling capacity of the assay.
29Cleavage / Ligation
30SNP Applications
- Gene discovery and mapping
- Association-based candidate polymorphism testing
- Diagnostics/risk profiling
- Response prediction
- Homogeneity testing/study design
- Gene function identification
31High-resolution haplotype structure in the human
genome
- Mark J. Daly, John D. Rioux, Stephen F.
Schaffner, Thomas J. Hudson and Eric S. Lander
32Abstract
- The authors describe a high-resolution analysis
of the haplotype structure across 500 kb on
chromosome 5q31 using 103 SNPs in a
European-derived population.
- They developed an analytical model for linkage
disequilibrium (LD) mapping based on
high-resolution haplotype blocks, which offers a
coherent framework for creating a haplotype map
of the human genome.
33Data used
- 500 kb region on human chromosome 5q31 that is
implicated as containing a genetic risk factor
for Crohn disease.
- Rioux, J. D. et al. Hierarchical linkage
disequilibrium mapping of a susceptibility gene
for Crohn's disease to the cytokine cluster on
chromosome 5. Nature Genet. 29, 223-228 (2001)
- 103 common (>5% minor allele frequency) SNPs
genotyped from a European-derived population.
- The study describes 258 chromosomes transmitted to
individuals with Crohn disease and 258
untransmitted chromosomes.
34Data used
- The genotype data used in the study provide the
highest-resolution picture of the patterns of
genetic variation across a large genomic region,
with a marker density of 1 SNP roughly every 5
kb.
35Study
- Focus on identifying the underlying haplotypes.
- The authors' initial focus was on untransmitted
control chromosomes; however, the same haplotype
structure was seen in the chromosomes transmitted
to individuals with Crohn disease, with the only
difference being that one of the haplotypes was
enriched in frequency, reflecting its association
with Crohn disease.
36Study
- It became evident during the study that the
region could be largely decomposed into discrete
haplotype blocks, each with a lack of diversity.
- As the haplotype block structure was the same in both
groups, they presented combined data from all
chromosomes (transmitted and untransmitted).
37Haplotype block structure on 5q31
38Haplotype block structure on 5q31
a. Common haplotype patterns in each block of low
diversity. Dashed lines indicate locations where
more than 2% of all chromosomes are observed to
transition from one common haplotype to a different
one.
39Haplotype block structure on 5q31
b. Percentage of observed chromosomes that match
one of the common patterns exactly (total
chromosomes: 258 transmitted + 258
untransmitted).
40Haplotype block structure on 5q31
c. Percentage of each of the common patterns
among 258 untransmitted chromosomes.
41Haplotype block structure on 5q31
d. Rate of haplotype exchange between the blocks
as estimated by the HMM.
42Haplotype block structure on 5q31
- The haplotype blocks span up to 100 kb and
contain multiple (five or more) common SNPs.
- The blocks have only a few (2-4) haplotypes, which
show no evidence of being derived from one
another by recombination, and which account for
nearly all chromosomes (>90%) in all cases in the
sample.
43Haplotype block structure on 5q31
For example, an 84 kb block shows only two
distinct haplotypes that together account for 95%
of the observed chromosomes (Table 1).
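The "few common haplotypes cover nearly all chromosomes" observation amounts to a simple count within a block. The sketch below is illustrative only: the haplotype strings and their counts are invented to mirror the two-haplotype, 95 percent example, and the `min_frequency` cutoff of 5 percent for calling a haplotype common is an assumption of this example.

```python
from collections import Counter


def common_haplotypes(haplotypes, min_frequency=0.05):
    """Return ({haplotype: frequency} for common ones, their total coverage)."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    common = {h: c / n for h, c in counts.items() if c / n >= min_frequency}
    coverage = sum(common.values())
    return common, coverage


# Hypothetical phased haplotypes for 100 chromosomes within one block.
block = ["GGACA"] * 60 + ["AATTC"] * 35 + ["GGATA"] * 3 + ["AATCC"] * 2

common, coverage = common_haplotypes(block)
print(sorted(common), round(coverage, 2))
```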
44Study
- The discrete blocks are separated by intervals in
which several independent historical
recombination events seem to have occurred, giving
rise to greater haplotype diversity for regions
spanning the blocks.
- The most common recombination events are
indicated in the previous figure by lines connecting
the haplotypes.
- The recombination events appear to be clustered:
multiple obligate exchanges must have occurred
between most blocks, with little or no exchange
within blocks.
45Study
- Although there is detectable recombination
between blocks, it is modest enough for there to
be clear long-range correlation (that is,
LD) among blocks.
- The haplotypes at the various blocks can be
readily assigned to one of the four ancestral
long-range haplotypes.
- Indeed, 38% of the chromosomes studied carried
one of these four haplotypes across the entire
length of the region.
46Study
- Using a hidden Markov model (HMM), they developed
an approach to define the block structure formally.
- The HMM simultaneously assigns every position
along each observed chromosome to one of the four
ancestral haplotypes and estimates the
maximum-likelihood values of the historical
recombination frequency (T) between each pair of
markers.
47Study
- The quantity T provides a convenient summary of
the degree of haplotype exchange across
inter-marker intervals and relates directly to
conventional measures of LD.
- In this study, T is estimated at less than 1% for
73% of the inter-marker intervals, 1-4% for 14% of
the intervals, and more than 4% for only 9% of the
intervals.
48Methods Individuals and marker selection
- The individuals studied (Canadians from
metropolitan Toronto of predominantly European
descent) and the genotyping methodologies are
described in the paper
- Rioux, J. D. et al. Hierarchical linkage
disequilibrium mapping of a susceptibility gene
for Crohn's disease to the cytokine cluster on
chromosome 5. Nature Genet. 29, 223-228 (2001)
- To ensure the ability to reconstruct multi-marker
haplotypes, SNPs for haplotype analysis were
selected from the set of markers for which full
genotypes were available for all members.
- SNPs at CpG sites were not included to prevent
potential confounding of common haplotype
patterns from recurrent mutations.
49Methods Haplotype counting
- Haplotype percentages in the "Haplotype block
structure on 5q31" figure were computed using
haplotypes generated by the transmission
disequilibrium test (TDT) implementation in
Genehunter 2.0 (ref. 22 in the paper), followed
by use of an EM-type algorithm (refs. 23, 24 in the
paper), to include the minority of chromosomes
that had one or more markers with ambiguous phase
or where one marker was missing genotype data.
50Methods Hidden Markov model
- The observation that over long distances most
haplotypes can be described as belonging
to one of a small number of common haplotype
categories suggested the use of an HMM in which
haplotype categories were defined as states.
- The authors assigned observed chromosomes to those
hidden states and simultaneously estimated the
transition probability in each map interval by
using an EM algorithm and by making the
simplifying assumption that there was a single
transition probability for each map interval
rather than allowing specific transition
probabilities from each state to each state.
- The output of this method was a
maximum-likelihood assignment to haplotype
category at each position and ML estimates of T
indicating how significantly recombination has
acted to increase haplotype diversity in each map
interval.
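The assignment half of this model can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions, not the authors' implementation: it takes the per-interval transition probabilities as given (the EM estimation is not shown), assumes a switch is equally likely to land on any other state, and uses a hypothetical `error` rate for alleles that differ from their ancestral haplotype. All names and data are invented for the example.

```python
import math


def viterbi_assign(observed, ancestors, theta, error=0.01):
    """Most likely ancestral haplotype index at each marker.

    observed  : string of alleles, one per marker
    ancestors : list of ancestral haplotype strings (hidden states, >= 2)
    theta     : theta[i] = switch probability between marker i and i + 1
    error     : assumed chance an allele differs from its ancestor
    """
    k, n = len(ancestors), len(observed)

    def emit(state, pos):
        match = ancestors[state][pos] == observed[pos]
        return math.log(1 - error if match else error)

    score = [math.log(1 / k) + emit(s, 0) for s in range(k)]
    back = []
    for pos in range(1, n):
        stay = math.log(1 - theta[pos - 1])
        switch = math.log(theta[pos - 1] / (k - 1))
        new_score, pointers = [], []
        for s in range(k):
            # Best predecessor of state s across this map interval.
            prev = max(range(k),
                       key=lambda p: score[p] + (stay if p == s else switch))
            step = stay if prev == s else switch
            pointers.append(prev)
            new_score.append(score[prev] + step + emit(s, pos))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical data: two ancestral haplotypes over six markers, with a
# higher switch probability in the third interval.
ancestors = ["AAAAAA", "CCCCCC"]
theta = [0.01, 0.01, 0.20, 0.01, 0.01]
print(viterbi_assign("AAACCC", ancestors, theta))
```

Using one `theta` value per interval mirrors the simplifying assumption described above; a full model would allow a separate probability for every pair of states.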
51Discussion of Study
- The region of chromosome 5q31 may be largely
divided into discrete blocks of 10-100 kb; each
block has only a few common haplotypes, and the
haplotype correlation between blocks gives rise to
long-range LD.
- Focusing on haplotype blocks greatly clarifies LD
analyses. Once the haplotype blocks are
identified, they can be treated as alleles and
tested for LD (instead of single-marker analyses
of LD).
52Discussion of Study
- In analogous fashion, the haplotype structure
provides a crisp approach for testing the
association of genomic segments with disease. By
contrast, disease association studies
traditionally involve testing individual SNPs in
and around a gene.
- Once the haplotype blocks are defined, it is
straightforward to examine a subset of SNPs that
uniquely distinguish the common haplotypes in
each block. This allows the common variation in a
gene to be tested exhaustively for association with
disease.
53Discussion of Study
- This approach provides a precise framework for
creating a comprehensive haplotype map of the
human genome.
- By testing a sufficiently large collection of
SNPs, it should be possible to define all of the
common haplotypes underlying blocks of LD. Once
such a map is created, it will be possible to
select an optimal reference set of SNPs for any
subsequent genotyping study.
- This detailed understanding of common human
variation represents an important step in the
Human Genome Project.
54Linkage Disequilibrium
- Uses unrelated individuals
- Good for fine-scale mapping because there has been
greater opportunity for recombination to occur.
- Map of loci that contribute to inherited genetic
disorders
- States cannot be considered independent because
they are related by distance and recombination,
so individual haplotypes may not be the cause of
disease, but rather a combination of several
haplotypes in blocks
55Linkage Disequilibrium
- The greater the distance between genes, the greater
the chance of recombination
- The smaller the distance between genes, the smaller
the chance of recombination
- Knowing the above and observing inherited
alleles, one can estimate the relative distance
between genes
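The relationship between distance and recombination can be made concrete with a map function. The sketch below uses Haldane's map function, which is not named in the slides and assumes crossovers occur independently (no interference); it is offered as one common textbook choice, with genetic distance expressed in centiMorgans (cM). The function names are hypothetical.

```python
import math


def recombination_fraction(distance_cm):
    """Haldane's map function: expected recombination fraction at a distance."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def map_distance_cm(recomb_fraction):
    """Inverse of Haldane's function: distance implied by an observed fraction."""
    if not 0.0 <= recomb_fraction < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * recomb_fraction)


for cm in (1, 10, 50):
    print(cm, round(recombination_fraction(cm), 4))
```

Under this model, short distances behave almost linearly (1 cM is close to a 1 percent recombination chance), while the recombination fraction levels off below 50 percent as distance grows.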
56Measures of Linkage Disequilibrium
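- cM = centiMorgans (1 cM corresponds to roughly a 1%
chance of recombination per generation)
- 50 cM would nominally mean that two genes have a
50% chance of recombination occurring between them.
- Such genes are relatively far apart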
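CentiMorgans measure genetic distance; LD itself is usually summarized with statistics computed from haplotype frequencies. The sketch below computes three widely used ones (D, D' and r squared) for two biallelic loci. These statistics are not listed in the slides, and the function name, argument names and counts are hypothetical.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at locus 2
    d = n_AB / n - p_A * p_B         # deviation from independence
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    variance = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / variance if variance > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical counts: complete LD, partial LD, and no LD.
print(ld_statistics(50, 0, 0, 50))
print(ld_statistics(40, 10, 10, 40))
print(ld_statistics(25, 25, 25, 25))
```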
57Importance of Linkage Disequilibrium
- Offers us a way to measure the distance between
genes.
- Non-random
- Measure of relation between markers and disease
mutations.
- Possibly used to map disease genes because high
LD areas would be related to recombination and
formation of new alleles
58Data Mining Applied to Linkage Disequilibrium
Mapping
- HPM - Haplotype Pattern Mining
- Method of data mining for LD-based gene mapping
- Uses haplotypes as inputs, which can be obtained
from genetic analysis programs such as
GENEHUNTER
- Extension of traditional association analysis
- Searches for shared and flexible haplotype patterns
and finds out which ones are strongly associated with
a disease.
- Uses a non-parametric statistical model without
any genetic models, on the basis of the locations
of the haplotypes
59What we know
- LD, a non-random association of
haplotypes with a disease, is likely strongest
around the DS (Disease Susceptibility) gene.
- The DS locus will most likely be where the strongest
associations are.
60Notation
- Marker map M is a vector of k markers (m1, ..., mk)
- A haplotype pattern P on M is a
vector (p1, ..., pk), where each pi is an
allele of mi or a wild-card (*)
- P occurs in a haplotype vector, which is simply
a chromosome H = (h1, ..., hk), if for every i
either pi = hi or pi = *
- Example
- P1 = (*, 2, 5, *, 3, *, *, *, *, *)
- PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
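The occurrence test in this notation is easy to state in code. In the sketch below the wild-card is written as `"*"`, which is an assumed symbol for the purposes of the example, and the function name `occurs_in` is hypothetical; the two vectors are the example pattern and chromosome from the slide.

```python
WILDCARD = "*"


def occurs_in(pattern, haplotype):
    """True if every fixed allele of the pattern matches the haplotype."""
    return len(pattern) == len(haplotype) and all(
        p == WILDCARD or p == h for p, h in zip(pattern, haplotype)
    )


P1 = ("*", 2, 5, "*", 3, "*", "*", "*", "*", "*")
PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)

print(occurs_in(P1, PC))
```

Changing any of the three fixed alleles in the chromosome (positions 2, 3 and 5) would make the pattern stop matching.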
61Issues in Shape of Haplotype Pattern
- Length of the pattern
- Defined as the maximal distance between any 2 markers,
measured in centiMorgans
- Extremely long sequences don't give us much
information, so the size of P is constrained
in HPM
- Gaps in sequences
- Account for mutations, errors, missing data, and
recombination
- Gap size and number can be controlled in HPM
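Both shape constraints can be computed directly from a pattern. The sketch below assumes the wild-card is written `"*"`, that a gap is a run of wild-cards between the first and last fixed allele, and that markers sit at hypothetical map positions spaced 1 cM apart; the function names are invented for the example.

```python
def pattern_length_cm(pattern, positions_cm):
    """Distance in cM between the outermost fixed (non-wild-card) markers."""
    fixed = [pos for p, pos in zip(pattern, positions_cm) if p != "*"]
    return max(fixed) - min(fixed) if fixed else 0.0


def gap_sizes(pattern):
    """Sizes of wild-card runs lying between the first and last fixed allele."""
    fixed = [i for i, p in enumerate(pattern) if p != "*"]
    if not fixed:
        return []
    runs, run = [], 0
    for p in pattern[fixed[0]:fixed[-1] + 1]:
        if p == "*":
            run += 1
        elif run:
            runs.append(run)
            run = 0
    return runs


P1 = ("*", 2, 5, "*", 3, "*", "*", "*", "*", "*")
positions = [float(i) for i in range(10)]   # assumed 1 cM marker spacing

print(pattern_length_cm(P1, positions), gap_sizes(P1))
```

A search could then discard any pattern whose length, number of gaps or largest gap exceeds user-chosen limits.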
62Procedure
- Depth-first search finds all haplotype patterns
that exceed the lower bound threshold and meet
the association measure
- Calculate the frequency f(mi) of marker mi with
respect to (M, H, Y, x), where Y = phenotype and x =
positive association threshold
- Markers with highest frequencies are predicted to
be the area of the DS gene, assuming a DS gene is
present.
- Prediction has the granularity of the marker density
- Ranked based on frequency
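The marker-frequency step can be sketched as follows. This is a simplified illustration, not the published HPM implementation: candidate patterns are supplied directly instead of being found by depth-first search, association is scored with a plain 2x2 chi-square statistic, a marker is counted when it lies between the first and last fixed allele of a qualifying pattern, and the wild-card is written `"*"`. Those choices, the names and the toy data are all assumptions of this example.

```python
def occurs_in(pattern, haplotype):
    return all(p == "*" or p == h for p, h in zip(pattern, haplotype))


def association(pattern, haplotypes, phenotypes):
    """Return (chi-square, positively_associated) for pattern vs. phenotype."""
    a = b = c = d = 0
    for hap, affected in zip(haplotypes, phenotypes):
        hit = occurs_in(pattern, hap)
        if hit and affected:
            a += 1          # affected, carries the pattern
        elif hit:
            b += 1          # control, carries the pattern
        elif affected:
            c += 1          # affected, lacks the pattern
        else:
            d += 1          # control, lacks the pattern
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, False
    return n * (a * d - b * c) ** 2 / denom, a * d > b * c


def marker_frequencies(patterns, haplotypes, phenotypes, x):
    """Count, per marker, the qualifying patterns that span it."""
    freq = [0] * len(haplotypes[0])
    for pattern in patterns:
        chi2, positive = association(pattern, haplotypes, phenotypes)
        fixed = [i for i, p in enumerate(pattern) if p != "*"]
        if fixed and positive and chi2 >= x:
            for i in range(fixed[0], fixed[-1] + 1):
                freq[i] += 1
    return freq


# Hypothetical data: four markers, four affected and four control chromosomes.
haplotypes = [(1, 2, 2, 1), (1, 2, 2, 2), (2, 2, 2, 1), (1, 2, 2, 1),
              (1, 1, 2, 1), (2, 1, 1, 2), (1, 1, 1, 1), (2, 2, 1, 2)]
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
patterns = [("*", 2, 2, "*"), ("*", 2, "*", "*"), (1, "*", "*", "*")]

freq = marker_frequencies(patterns, haplotypes, phenotypes, x=3.0)
print(freq)
```

In this toy data the second marker collects the most qualifying patterns, so it would be the predicted neighborhood of the DS gene.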
63Results Simulated Data
- Founder population which grows from 300 to
100,000 in 500 years was simulated in the Populus
simulator package
- Simulated data used because it is cheaper and can
be easily manipulated
64- List of 11 most strongly disease-associated
haplotype patterns in the simulated data
- Chromosome has 101 markers
- Dashed line indicates the true gene location
65- Frequency histogram of the previous slide's data, but
with patterns exceeding the threshold of
association
- Dashed line indicates the true gene location
- Marker 5 now has the highest frequency
66- The actual vs. predicted locations for 100 data
sets
67- Mutation carrying chromosomes, denoted by A
- Sample founder population size
- Corrupted data
- Missing data
68Real Data HLA complex
- Data consisting of affected sib-pair families
with type 1 diabetes from the UK that were
genotyped for 25 markers was used
- Markers spanned 14 Mb and covered the entire HLA
complex
- The HLA-DQB1 and HLA-DRB1 loci, which are located
in the middle of these 14 Mb, are known to be the
primary factors for type 1 diabetes
- Randomly selected 200 of the 385 samples to
compare with simulated results
69- Frequency vs. Map Location of HLA markers
- ___ HPM calculated frequencies
- ----- Background LD frequencies
- Vertical lines indicate true locations of
markers
70Discussion of HPM Technique
- Robust to lost and erroneous data
- Applicable to complex gene mapping
- Works well with small data sets, but accuracy
increases with more data
- Works with real and simulated data
- Does not include any previously derived models
71References
- Introduction to SNPs: Discovery of Markers of
Disease
- SNP seeking long term association with complex
diseases
- SNP mapping using Genome-wide Unique Sequences
- The Structure of Haplotype Blocks in the Human
Genome
- Using Haplotype blocks to map human complex trait
loci
- High Resolution haplotype structure in human
genome
- Detection of regulatory variation in mouse genes
- http://linkage.rockefeller.edu/wli/lld.html
- http://statwww.epfl.ch/davison/teaching/Microarrays/snp.ppt
- http://www.cs.helsinki.fi/u/htoivone/pubs/ajhg_2000.pdf
- Resolution of Haplotypes and Haplotype
Frequencies from SNP Genotype of Pooled Samples
- http://www.journals.uchicago.edu/AJHG/journal/issues/v71n6/024386/024386.html
- http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.sciencemag.org/cgi/pmidlookup?view=full&pmid=11452081
- http://www.genome.gov/10001665
- http://walnut.usc.edu/magnus/papers/tig.pdf