Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome - PowerPoint PPT Presentation

Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome Manish Anand Nihar Sheth Jim Costello 24th November, 2003 – PowerPoint PPT presentation




Single Nucleotide Polymorphisms (SNPs),
Haplotypes, Linkage Disequilibrium, and the Human
Genome
  • Manish Anand
  • Nihar Sheth
  • Jim Costello
  • 24th November, 2003

  • Biological Background
  • Terminology
  • SNP related general information
  • SNP detection techniques
  • SNP Applications
  • References

Biological Background
  • How can researchers hope to identify and study
    all the changes that occur in so many different
  • How can they explain why some people respond to
    treatment and not others?

  • SNP is the answer to these questions
  • So what exactly are SNPs?
  • How are they involved in so many different
    aspects of health?

What is a SNP?
  • A SNP is defined as a single base change in a DNA
    sequence that occurs in a significant proportion
    (more than 1 percent) of a large population.

Variations in Genome
  • Polymorphism
  • Linkage Disequilibrium
  • Correlation of character states among
    polymorphic sites
  • Insufficient passage of time to randomize
    character states by meiotic recombinations
  • Haplotype

Some Facts
  • In human beings, 99.9 percent of bases are the same.
  • The remaining 0.1 percent makes a person unique.
  • Different attributes / characteristics / traits
  • how a person looks,
  • diseases he or she develops.
  • These variations can be
  • Harmless (change in phenotype)
  • Harmful (diabetes, cancer, heart disease,
    Huntington's disease, and hemophilia )
  • Latent (variations found in coding and regulatory
    regions that are not harmful on their own; the
    change in each gene only becomes apparent under
    certain conditions, e.g. susceptibility to lung
    cancer)

SNP facts
  • SNPs are found in
  • coding and (mostly) noncoding regions.
  • Occur with a very high frequency
  • about 1 in 1000 bases to 1 in 100 to 300 bases.
  • The abundance of SNPs and the ease with which
    they can be measured make these genetic
    variations significant.
  • A SNP close to a particular gene acts as a marker
    for that gene.
  • SNPs in coding regions may alter the protein
    structure made by that coding region.

SNPs may / may not alter protein structure
SNPs act as gene markers
SNP maps
  • Sequence genomes of a large number of people
  • Compare the base sequences to discover SNPs.
  • Generate a single map of the human genome
    containing all possible SNPs: the SNP map
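
The comparison step above can be illustrated with a minimal sketch. It assumes the sequences are already aligned to the same length and applies the "more than 1 percent" frequency cutoff from the SNP definition. The function name `find_snps`, the `min_maf` parameter, and the toy sequences are hypothetical and are not taken from the slides.

```python
from collections import Counter

def find_snps(sequences, min_maf=0.01):
    """Return {position: {allele: frequency}} for polymorphic columns.

    A column is reported when its second most common allele has a
    frequency above min_maf (assumption: sequences are pre-aligned).
    """
    n = len(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        if len(counts) < 2:
            continue  # every individual carries the same base here
        ranked = counts.most_common()
        minor_allele_frequency = ranked[1][1] / n
        if minor_allele_frequency > min_maf:
            snps[pos] = {allele: c / n for allele, c in ranked}
    return snps

# Toy data: five made-up individuals, six aligned bases each.
sequences = ["ACGTAC", "ACGTAC", "ACATAC", "ACGTAC", "ACATTC"]
snps = find_snps(sequences)
print(sorted(snps))  # positions (0-based) that vary between individuals
```

A real SNP map would need many more individuals than this toy set, since a 1 percent allele cannot even be observed in a sample of five.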

SNP Maps
SNP Profiles
  • Genome of each individual contains distinct SNP
  • People can be grouped based on the SNP profile.
  • SNP profiles are important for identifying response
    to drug therapy.
  • Correlations might emerge between certain SNP
    profiles and specific responses to treatment.

SNP Profiles
Techniques to detect known Polymorphisms
  • Hybridization Techniques
  • Micro arrays
  • Real time PCR
  • Enzyme based Techniques
  • Nucleotide extension
  • Cleavage
  • Ligation
  • Reaction product detection and display
  • Comparison of Techniques used

Hybridization Techniques
  • Micro Arrays
  • Sequencing by hybridization
  • utilize a set of tiling oligonucleotides
  • somewhat complex
  • pooling and processing of PCR amplicons that are
    subsequently hybridized to a DNA micro array and
  • Theoretically capable of genotyping thousands of
    polymorphisms simultaneously
  • Success rate 97% (somewhat low for this kind of
  • High false rates: 11-21%
  • Design and fabrication of micro arrays is
    expensive, hence users are confined to the set of
    genotypes established by the manufacturer.

  • Real Time PCRs
  • Utilizes TaqMan™ DNA probes to detect PCR
    products in real time
  • The TaqMan™ probe contains a fluorescent reporter at
    the 5' end and a fluorescence resonance energy
    transfer (FRET) moiety at the 3' end, which
    quenches the fluorescent signal of the reporter.
  • The probe sequence is complementary to the PCR
    amplicon and is designed to anneal at the
    extension temperature.
  • During extension, the 5'→3' exonuclease activity
    of Taq DNA polymerase I cleaves the probe,
    emitting signal due to the separation of the
    reporter from the quencher.
  • Polymorphism is determined solely by
    hybridization and not by the ability of the
    enzyme to discriminate.
  • Because the enzyme does not confer specificity in
    detection, this technique is classified as
    a hybridization technique.
  • Depending on the optical thermocycler platform, 384
    reactions can be monitored for each cycle without
    removing any sample
  • Amenable to robotic automation.

Real Time PCRs
Enzyme based Techniques
  • Nucleotide extension
  • Simplest techniques for known polymorphism
  • Existing in numerous variations (also known as
    minisequencing, SNuPE, GBA, APEX, AS-PE capture,
    FNC, TDI or PROBE) this assay typically involves
    the single base extension of an oligonucleotide
    by a polymerase
  • Oligonucleotide is designed to anneal immediately
    upstream of the polymorphism locus and
    differentially labeled fluorescent
    dideoxynucleotides are utilized as substrates for
    polymerase extension.
  • The fluorescent signal emitted corresponds to the
    nucleotide incorporated and thus the sequence of
    the polymorphism.
  • Simplicity and accuracy in distinguishing between
    heterozygous and homozygous genotypes.
  • Targets need to be PCR amplified; PCR reagents
    must then be removed.
  • False negatives due to mis-priming

Nucleotide Extension
  • Cleavage
  • The Invader™ assay utilizes the exonuclease
    activity of Cleavase VIII on overlapping
    oligonucleotide strands.
  • Two oligonucleotides, an invader probe and
    either a wild-type or mutant primary probe,
    overlap each other at a single nucleotide
    position on the template only if they are
    complementary to the polymorphism being queried.
  • Cleavage occurs when the specific overlapping
    conformation is present, freeing an
    oligonucleotide referred to as a flap.
  • This flap can be detected in a multiplex manner
    by size, mass or sequence
  • Commonly the flap participates in a second
    cleavage assay with another complementary target,
    causing release of a fluorescent signal.
  • Advantage - the same flap may bind to many
    targets, generating a cascading signal
    amplification and thereby obviating the need for
    PCR amplification.
  • Single-tube one-step reaction.

  • Ligation
  • One of the most specific assays due to the high
    specificity of T4 ligase (oligo ligation assay)
    and even higher specificity of thermostable
    ligases (ligation detection reaction, LDR)
  • Two primers are designed to anneal adjacent to
    one another on the target of interest
  • Generally, the upstream primer (discriminating
    primer) contains a fluorescent label at the 5'
    end, with the 3' nucleotide overlapping the
    polymorphic base.
  • The fluorescent signal corresponds to the allele
    being queried at the 3' position of the
    discriminating primer
  • When the discriminating primer forms a perfect
    complement with the target at the junction, the
    ligase covalently attaches the adjacent
    downstream primer (common primer)
  • The resulting product is approximately twice as
    long as each of the individual primers and can be
    easily monitored for detection by means of
    capillary electrophoresis or by display on a
  • Advantage: very good sensitivity and specificity

Techniques to detect unknown Polymorphisms
  • Direct Sequencing
  • Microarray
  • Cleavage / Ligation
  • Electrophoretic mobility assays
  • Comparison of Techniques used

Direct Sequencing
  • Sanger dideoxysequencing can detect any type of
    unknown polymorphism and its position, when the
    majority of DNA contains that polymorphism.
  • Misses polymorphisms and mutations when the DNA
    is heterozygous
  • limited utility for analysis of solid tumors or
    pooled samples of DNA due to low sensitivity
  • Once a sample is known to contain a polymorphism
    in a specific region, direct sequencing is
    particularly useful for identifying a
    polymorphism and its specific position.
  • Even if the identity of the polymorphism cannot
    be discerned in the first pass, multiple
    sequencing attempts have proven quite successful
    in elucidating sequence and position information.

  • Variation detection arrays (VDA) scan large
    sequence blocks and identify regions containing
    unknown polymorphisms.
  • This methodology suffers from the same
    limitations in fabrication and design as observed
    in known polymorphism analysis, but has
    demonstrated much greater success in the context
    of unknown polymorphism detection for both SNP
    and tumor analysis.
  • With respect to SNP analysis, a recent study of
    chromosome 21 successfully identified
    approximately half of the estimated number of
    common SNPs (frequency of 10-50%) across the
    entire chromosome.
  • The experimental design required a sacrifice in
    sensitivity in order to minimize false positives.
  • This explains the decrease in successful
    identification from 80% to 50%.

  • Unknown polymorphisms can also be identified by
    the cleavage of mismatches in DNA:DNA
    heteroduplexes.
  • This can be achieved either chemically (chemical
    cleavage method, CCM) or enzymatically (T4
    endonuclease VII, MutY cleavage or Cleavase).
  • Typically, at least two samples are PCR amplified
    (one sample can be sufficient for solid tumor
    samples with high levels of stromal
    contamination), denatured and then hybridized to
    create DNA:DNA heteroduplexes of the variant
  • Enzymes cleave adjacent to the mismatch and
    products are resolved via gel or capillary
  • Unfortunately, the cleavage enzymes often nick
    complementary regions of DNA as well. This
    increases background noise, lowers specificity,
    and reduces the pooling capacity of the assay.

Cleavage / Ligation
SNP Applications
  • Gene discovery and mapping
  • Association-based candidate polymorphism testing
  • Diagnostics/risk profiling
  • Response prediction
  • Homogeneity testing/study design
  • Gene function identification

High-resolution haplotype structure in the human
genome
  • Mark J. Daly, John D. Rioux, Stephen F.
    Schaffner, Thomas J. Hudson and Eric S. Lander

  • The authors describe a high-resolution analysis
    of the haplotype structure across 500 kb on
    chromosome 5q31 using 103 SNPs in a
    European-derived population.
  • They developed an analytical model for Linkage
    disequilibrium (LD) mapping based on
    high-resolution haplotype blocks, which offers a
    coherent framework for creating a haplotype map
    of the human genome.

Data used
  • 500 kb region on human chromosome 5q31 that is
    implicated as containing a genetic risk factor
    for Crohn disease.
  • Rioux, J. D. et al. Hierarchical linkage
    disequilibrium mapping of a susceptibility gene
    for Crohn's disease to the cytokine cluster on
    chromosome 5. Nature Genet. 29, 223-228 (2001)
  • 103 common (>5% minor allele frequency) SNPs
    genotyped from a European-derived population.
    The study describes 258 chromosomes transmitted to
    individuals with Crohn disease and 258
    untransmitted chromosomes.

Data used
  • The genotype data used in the study provides the
    highest-resolution picture of the patterns of
    genetic variation across a large genomic region,
    with a marker density of 1 SNP roughly every 5
    kb.

  • Focus on identifying the underlying haplotypes.
  • The authors' initial focus was on untransmitted
    control chromosomes; however, the same haplotype
    structure was seen in the chromosomes transmitted
    to individuals with Crohn disease, with the only
    difference being that one of the haplotypes was
    enriched in frequency, reflecting its association
    with Crohn disease.

  • It became evident during the study that the
    region could be largely decomposed into discrete
    haplotype blocks, each with a lack of diversity.
  • As haplotype block structure was the same in both
    groups, they presented combined data from all
    chromosomes (transmitted and untransmitted).

Haplotype block structure on 5q31
[Figure: haplotype block structure on 5q31, panels a-d]
a. Common haplotype patterns in each block of low
diversity. Dashed lines indicate locations where
more than 2% of all chromosomes are observed to
transition from one common haplotype to a different
one.
b. Percentage of observed chromosomes that match
one of the common patterns exactly (total
chromosomes: 258 transmitted, 258 untransmitted).
c. Percentage of each of the common patterns
among 258 untransmitted chromosomes.
d. Rate of haplotype exchange between the blocks
as estimated by the HMM.
Haplotype block structure on 5q31
  • The haplotype blocks span up to 100 kb and
    contain multiple (five or more) common SNPs.
  • The blocks have only a few (2-4) haplotypes, which
    show no evidence of being derived from one
    another by recombination, and which account for
    nearly all chromosomes (>90%) in all cases in the
    sample.

Haplotype block structure on 5q31
For example, an 84 kb block shows only two
distinct haplotypes that together account for 95%
of the observed chromosomes (Table 1).
  • The discrete blocks are separated by intervals in
    which several independent historical
    recombination events seem to have occurred, giving
    rise to greater haplotype diversity for regions
    spanning the blocks.
  • The most common recombination events are
    indicated in the previous figure by lines connecting
    the haplotypes.
  • The recombination events appear to be clustered:
    multiple obligate exchanges must have occurred
    between most blocks, with little or no exchange
    within blocks.
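
The idea that a block contains only a few common haplotypes covering nearly all chromosomes can be checked with a simple count. The following is a minimal sketch on made-up data; the function name `block_summary`, the `common_threshold` parameter, and the toy haplotype strings are assumptions for illustration and do not come from the study.

```python
from collections import Counter

def block_summary(haplotypes, common_threshold=0.05):
    """Return (common haplotypes with their frequencies, their total coverage).

    haplotypes: one allele string per chromosome, restricted to the SNPs
    of a single block. A haplotype counts as "common" when its frequency
    is at least common_threshold (a hypothetical cutoff).
    """
    total = len(haplotypes)
    counts = Counter(haplotypes)
    common = {h: c / total for h, c in counts.items()
              if c / total >= common_threshold}
    coverage = sum(common.values())
    return common, coverage

# Toy block: 20 chromosomes typed at 5 SNPs (made-up alleles).
block = ["GGACA"] * 12 + ["AATTG"] * 7 + ["GGATA"] * 1
common, coverage = block_summary(block, common_threshold=0.10)
print(len(common), round(coverage, 2))
```

In this toy block two haplotypes pass the cutoff and the single rare haplotype is left out of the coverage figure.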

  • Although there is detectable recombination
    between blocks, it is modest enough for there to
    be clear long-range correlation (that is,
    LD) among blocks.
  • The haplotypes at the various blocks can be
    readily assigned to one of the four ancestral
    long-range haplotypes.
  • Indeed, 38% of the chromosomes studied carried
    one of these four haplotypes across the entire
    length of the region.

  • Using a hidden Markov model (HMM), they developed
    an approach to define the block structure formally.
  • The HMM simultaneously assigns every position
    along each observed chromosome to one of the four
    ancestral haplotypes and estimates the
    maximum-likelihood values of the historical
    recombination frequency (T) between each pair of
    markers.

  • The quantity T provides a convenient summary of
    the degree of haplotype exchange across
    inter-marker intervals and relates directly to
    conventional measures of LD.
  • In this study, T is estimated at less than 1 for
    73 of the inter-marker intervals, 1-4 for 14 of
    the intervals, and more than 4 for only 9 of the
    intervals.

Methods: Individuals and marker selection
  • The individuals studied, Canadians from
    metropolitan Toronto of predominantly European
    descent, and the genotyping methodologies are
    described in the paper
  • Rioux, J. D. et al. Hierarchical linkage
    disequilibrium mapping of a susceptibility gene
    for Crohn's disease to the cytokine cluster on
    chromosome 5. Nature Genet. 29, 223-228 (2001)
  • To ensure the ability to reconstruct multi-marker
    haplotypes, SNPs for haplotype analysis were
    selected from the set of markers for which full
    genotypes were available for all members.
  • SNPs at CpG sites were not included to prevent
    potential confounding of common haplotype
    patterns from recurrent mutations.

Methods: Haplotype counting
  • Haplotype percentages in the figure "Haplotype block
    structure on 5q31" were computed using
    haplotypes generated by the transmission
    disequilibrium test (TDT) implementation in
    Genehunter 2.0 (ref. 22 in the paper), followed
    by use of an EM-type algorithm (ref. 23,24 in
    paper), to include the minority of chromosomes
    that had one or more markers with ambiguous phase
    or where one marker was missing genotype data.

Methods: Hidden Markov model
  • The observation that over long distances most
    haplotypes can be described as belonging
    to one of a small number of common haplotype
    categories suggested the use of an HMM in which
    haplotype categories were defined as states.
  • The authors assigned observed chromosomes to those
    hidden states and simultaneously estimated the
    transition probability in each map interval by
    using an EM algorithm and by making the
    simplifying assumption that there was a single
    transition probability for each map interval
    rather than allowing specific transition
    probabilities from each state to each state.
  • The output of this method was a
    maximum-likelihood assignment to haplotype
    category at each position and ML estimates of T
    indicating how significantly recombination has
    acted to increase haplotype diversity in each map
    interval.
Discussion of Study
  • The region of chromosome 5q31 may be largely
    divided into discrete blocks of 10-100 kb; each
    block has only a few common haplotypes, and the
    haplotype correlation between blocks gives rise to
    long-range LD.
  • Focusing on haplotype blocks greatly clarifies LD
    analyses. Once the haplotype blocks are
    identified, they can be treated as alleles and
    tested for LD (instead of single-marker analyses
    of LD).

Discussion of Study
  • In analogous fashion, the haplotype structure
    provides a crisp approach for testing the
    association of genomic segments with disease. By
    contrast, disease association studies
    traditionally involve testing individual SNPs in
    and around a gene.
  • Once the haplotype blocks are defined, it is
    straightforward to examine a subset of SNPs that
    uniquely distinguish the common haplotypes in
    each block. This allows the common variation in a
    gene to be tested exhaustively for association with
    disease.
Discussion of Study
  • This approach provides a precise framework for
    creating a comprehensive haplotype map of the
    human genome.
  • By testing a sufficiently large collection of
    SNPs, it should be possible to define all of the
    common haplotypes underlying blocks of LD. Once
    such a map is created, it will be possible to
    select an optimal reference set of SNPs for any
    subsequent genotyping study.
  • This detailed understanding of common human
    variation represents an important step in the
    Human Genome Project.

Linkage Disequilibrium
  • Uses unrelated individuals
  • Good for fine scale mapping because there is
    greater opportunity for recombination to occur.
  • Map of loci that contribute to inherited genetic
  • States can not be considered independent because
    they are related by distance and recombination,
    so individual haplotypes may not be the cause of
    disease, but rather a combination of several
    haplotypes in blocks

Linkage Disequilibrium
  • The greater the distance between genes, the greater
    the chance of recombination
  • The smaller the distance between genes, the smaller
    the chance of recombination
  • Knowing the above and observing inherited
    alleles, one can estimate the relative distance
    between genes

Measures of Linkage Disequilibrium
  • cM: centiMorgans
  • 50 cM would mean that two genes have a 50% chance
    of recombination occurring between them.
  • Genes are relatively far apart
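
The slides do not name a specific pairwise LD statistic. As a hedged illustration, the sketch below computes three widely used ones (D, D' and r²) for two biallelic loci from haplotype and allele frequencies. The function name `ld_measures` and the example frequencies are assumptions, not values from the presentation.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2.
    p_a:  frequency of allele A at locus 1.
    p_b:  frequency of allele B at locus 2.
    """
    d = p_ab - p_a * p_b  # deviation from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    variance = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / variance if variance > 0 else 0.0
    return d, d_prime, r2

# Complete association: allele A is always found together with allele B.
print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
# Independence: the haplotype frequency equals the product of allele frequencies.
print(ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5))
```

These statistics describe allelic association, whereas centiMorgans describe genetic distance; the two are related through recombination but are not interchangeable.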

Importance of Linkage Disequilibrium
  • Offers us a way to measure the distance between
  • Non-random
  • Measure of relation between markers and disease
  • Possibly used to map disease genes because high
    LD areas would be related to recombination and
    formation of new alleles

Data Mining Applied to Linkage Disequilibrium
  • HPM - Haplotype Pattern Mining
  • Method of data mining LD-based gene mapping
  • Uses haplotypes as inputs which can be obtained
    from genetic simulation programs such as
  • Extension of traditional association analysis
  • Search for shared and flexible haplotypes and
    find out which ones are strongly associated with
    a disease.
  • Uses non-parametric statistical model without
    any genetic models on the basis of the locations
    of the haplotypes

What we know
  • LD, which has a non-random association of
    haplotypes to a disease, is likely strongest
    around the DS(Disease Susceptibility) gene.
  • A locus will most likely be where the strongest
    associations are.

  • Haplotype map M has k markers (m1, ..., mk)
  • A haplotype pattern P on M is a vector
    (p1, ..., pk), where each pi is either an
    allele of mi or a wild-card (*)
  • P occurs in a haplotype vector H = (h1, ..., hk),
    which is simply the chromosome, if for every i
    hi = pi or pi = *
  • Example
  • P1 = (*, 2, 5, *, 3, *, *, *, *, *)
  • PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
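
The occurrence rule above can be written as a short check. This is a minimal sketch that assumes the wild-card is written as `*` and that alleles are small integers as in the example; the names `WILDCARD` and `occurs_in` are hypothetical.

```python
WILDCARD = "*"

def occurs_in(pattern, haplotype):
    """True if every non-wild-card position of pattern equals the haplotype allele."""
    if len(pattern) != len(haplotype):
        raise ValueError("pattern and haplotype must cover the same markers")
    return all(p == WILDCARD or p == h for p, h in zip(pattern, haplotype))

# The example from the slide, with wild-cards written explicitly.
P1 = (WILDCARD, 2, 5, WILDCARD, 3) + (WILDCARD,) * 5
H = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
print(occurs_in(P1, H))
```

Changing any of the three fixed alleles in `H` (positions 2, 3 or 5) makes the pattern stop matching, while changes at wild-card positions have no effect.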

Issues in Shape of Haplotype Pattern
  • Length of the pattern
  • Defined as maximal distance between any 2 markers
    measured in centiMorgans
  • Extremely long sequences don't give us much
    information, so the size of the P is constrained
    in HPM
  • Gaps in sequences
  • Accounts for mutations, errors, missing data, and
  • Gap size and number can be controlled in HPM

  • Depth-first search finds all haplotype patterns
    that exceed the lower bound threshold and meet
    the association measure
  • Calculate the frequency f(mi) of marker mi with
    respect to (M, H, Y, x), where Y is the phenotype
    and x is the positive association threshold
  • Markers with highest frequencies are predicted to
    be the area of the DS gene, assuming a DS gene is
  • Prediction of granularity of marker density
  • Ranked based on frequency
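
The scoring step can be sketched as follows, under several stated assumptions: candidate patterns are supplied directly instead of being found by depth-first search, association is measured with a plain 2x2 chi-square statistic, and a marker counts as contained in a pattern when it lies between the pattern's first and last non-wild-card positions. All function names and the toy data are hypothetical.

```python
WILDCARD = "*"

def occurs_in(pattern, haplotype):
    return all(p == WILDCARD or p == h for p, h in zip(pattern, haplotype))

def chi_square(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def associated_patterns(patterns, haplotypes, affected, x):
    """Keep patterns that are positively associated with disease at chi-square >= x."""
    n_affected = sum(affected)
    n_control = len(affected) - n_affected
    selected = []
    for pattern in patterns:
        a = sum(1 for h, y in zip(haplotypes, affected) if y and occurs_in(pattern, h))
        c = sum(1 for h, y in zip(haplotypes, affected) if not y and occurs_in(pattern, h))
        b, d = n_affected - a, n_control - c
        positive = a * d > b * c  # pattern is more frequent among affected
        if positive and chi_square(a, b, c, d) >= x:
            selected.append(pattern)
    return selected

def marker_frequencies(patterns, n_markers):
    """Fraction of the given patterns whose span covers each marker."""
    counts = [0] * n_markers
    for pattern in patterns:
        fixed = [i for i, p in enumerate(pattern) if p != WILDCARD]
        if not fixed:
            continue
        for i in range(fixed[0], fixed[-1] + 1):
            counts[i] += 1
    total = len(patterns)
    return [c / total if total else 0.0 for c in counts]

# Toy data: 3 affected and 3 control chromosomes over 4 markers.
haplotypes = [(1, 2, 1, 1), (1, 2, 1, 2), (2, 2, 1, 1),
              (1, 1, 2, 1), (2, 1, 2, 2), (1, 1, 1, 2)]
affected = [True, True, True, False, False, False]
candidates = [(WILDCARD, 2, 1, WILDCARD),
              (1, WILDCARD, WILDCARD, WILDCARD),
              (WILDCARD, 2, WILDCARD, WILDCARD)]
strong = associated_patterns(candidates, haplotypes, affected, x=3.84)
print(marker_frequencies(strong, n_markers=4))
```

In the toy data the second marker is covered by every selected pattern, so it would be ranked as the most likely location of the disease susceptibility gene.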

Results Simulated Data
  • A founder population that grows from 300 to
    100,000 in 500 years was simulated in the Populus
    simulator package
  • Simulated data used because it is cheaper and can
    be easily manipulated

  • List of 11 most strongly disease-associated
    haplotype patterns in the simulated data
  • Chromosome has 101 markers
  • Dashed line indicates the true gene location

  • Frequency histogram of previous slides data, but
    with patterns exceeding the threshold of
  • Dashed line indicates the true gene location
  • Marker 5 now has the highest frequency

  • The actual vs. predicted locations for 100 data

  • Mutation carrying chromosomes, denoted by A
  • Sample founder population size
  • Corrupted data
  • Missing data

Real Data HLA complex
  • Data consisting of affected sib-pair families
    with type 1 diabetes from the UK that were
    genotyped for 25 markers was used
  • Markers covered 14-Mb and covered the entire HLA
  • The HLA-DQB1 and HLA-DRB1 loci, which are located
    in the middle of these 14-Mb, are known to be the
    primary factors for type 1 diabetes
  • 200 samples were randomly selected from the 385
    available to compare with simulated results

  • Frequency vs. Map Location of HLA markers
  • ___ HPM calculated frequencies
  • ----- Background LD frequencies
  • Vertical lines indicate true locations of

Discussion of HPM Technique
  • Robust to lost and erroneous data
  • Applicable to complex gene mapping
  • Works well with small data sets, but accuracy is
    increased with the increase of data
  • Works with real and simulated data
  • Does not include any previously derived models

References
  • Introduction to SNPs: Discovery of Markers of
  • SNP seeking long term association with complex
  • SNP mapping using Genome-wide Unique Sequences
  • The Structure of Haplotype Blocks in Human
  • Using Haplotype blocks to map human complex trait
  • High Resolution haplotype structure in human
  • Detection of regulatory variation in mouse genes
  • Resolution of Haplotypes and Haplotype
    Frequencies from SNP Genotype of Pooled Samples