Single Nucleotide Polymorphisms (SNPs), Haplotypes, Linkage Disequilibrium, and the Human Genome

Presentation by Manish Anand, Nihar Sheth, and Jim Costello, 24th November, 2003.

Slides: 72

Transcript and Presenter's Notes

1
Single Nucleotide Polymorphisms (SNPs),
Haplotypes, Linkage Disequilibrium, and the Human
Genome

Manish Anand
Nihar Sheth
Jim Costello
24th November, 2003

2
Overview

Biological Background
Terminology
SNP related general information
SNP detection techniques
SNP Applications
References

3
Biological Background

How can researchers hope to identify and study
all the changes that occur in so many different
diseases?
How can they explain why some people respond to
treatment and not others?

SNPs are the answer to these questions.
So what exactly are SNPs?
How are they involved in so many different
aspects of health?

5
What is a SNP?

A SNP is defined as a single base change in a DNA
sequence that occurs in a significant proportion
(more than 1 percent) of a large population.
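
The 1 percent rule above can be checked directly from allele counts. The following is a minimal Python sketch, not part of the original slides; the function names `minor_allele_frequency` and `is_snp` and the example counts are illustrative assumptions.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Return the frequency of the second most common allele (0.0 if monomorphic)."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)

def is_snp(alleles, threshold=0.01):
    """Apply the slide's rule: the variant must occur in more than 1 percent."""
    return minor_allele_frequency(alleles) > threshold

# Hypothetical site observed on 200 chromosomes: 190 carry A, 10 carry G.
site = ["A"] * 190 + ["G"] * 10
print(minor_allele_frequency(site))  # 0.05
print(is_snp(site))                  # True
```

A variant seen on only 5 of 1005 chromosomes would fall below the threshold and would be treated as a rare variant rather than a SNP under this definition.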

6
Variations in Genome
7
Terminology

Polymorphism
Linkage Disequilibrium
Correlation of character states among
polymorphic sites
Insufficient passage of time to randomize
character states by meiotic recombination
Haplotype
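
The correlation of character states between two polymorphic sites is commonly quantified with the pairwise statistics D, D' and r^2. The sketch below is an illustration under stated assumptions, not something shown on the slides: it assumes two biallelic sites, phased haplotypes, and a hypothetical coding in which upper and lower case letters mark the two alleles at each site.

```python
def ld_measures(haplotypes):
    """Compute D, D' and r^2 for two biallelic sites.

    haplotypes: list of 2-character strings, one per chromosome,
    e.g. "AB", "Ab", "aB", "ab" (assumed coding, for illustration only).
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of allele A
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of allele B
    p_ab = sum(h == "AB" for h in haplotypes) / n    # frequency of haplotype AB
    d = p_ab - p_a * p_b                             # deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Complete LD: only two of the four possible haplotypes are observed.
print(ld_measures(["AB"] * 50 + ["ab"] * 50))
# Linkage equilibrium: all four haplotypes at the product of allele frequencies.
print(ld_measures(["AB"] * 25 + ["Ab"] * 25 + ["aB"] * 25 + ["ab"] * 25))
```

With this sign convention D' is negative when D is negative; many reports quote its absolute value instead.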

8
Some Facts

In human beings, 99.9 percent of bases are the same.
The remaining 0.1 percent makes a person unique.
Different attributes / characteristics / traits:
how a person looks,
diseases he or she develops.
These variations can be:
Harmless (change in phenotype)
Harmful (diabetes, cancer, heart disease,
Huntington's disease, and hemophilia)
Latent (variations found in coding and regulatory
regions that are not harmful on their own; the
change in each gene only becomes apparent under
certain conditions, e.g. susceptibility to lung
cancer)

9
SNP facts

SNPs are found in
coding and (mostly) noncoding regions.
They occur with a very high frequency:
about 1 in 1000 bases to 1 in 100 to 300 bases.
The abundance of SNPs and the ease with which
they can be measured make these genetic
variations significant.
A SNP close to a particular gene acts as a marker
for that gene.
SNPs in coding regions may alter the structure of
the protein made by that coding region.

10
SNPs may / may not alter protein structure
11
SNPs act as gene markers
12
SNP maps

Sequence the genomes of a large number of people.
Compare the base sequences to discover SNPs.
Generate a single map of the human genome
containing all possible SNPs -> SNP maps
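
The comparison step can be pictured as a column-by-column scan of aligned sequences. This is a minimal Python sketch under the assumption that the sequences are already aligned and of equal length; the function name `find_variable_sites` and the toy sequences are hypothetical, and real SNP discovery also has to deal with alignment, sequencing error and allele frequency.

```python
def find_variable_sites(sequences):
    """Return {position: sorted bases} for columns where aligned sequences differ."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        bases = {s[pos] for s in sequences}   # distinct bases seen in this column
        if len(bases) > 1:
            sites[pos] = sorted(bases)
    return sites

# Four toy sequences from different (hypothetical) individuals.
toy = ["ACGTTAGC", "ACGTTAGC", "ACATTAGC", "ACGTTCGC"]
print(find_variable_sites(toy))  # {2: ['A', 'G'], 5: ['A', 'C']}
```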

13
SNP Maps
14
SNP Profiles

The genome of each individual contains a distinct
SNP pattern.
People can be grouped based on their SNP profile.
SNP profiles are important for identifying
response to drug therapy.
Correlations might emerge between certain SNP
profiles and specific responses to treatment.

15
SNP Profiles
16
Techniques to detect known Polymorphisms

Hybridization Techniques
Micro arrays
Real time PCR
Enzyme based Techniques
Nucleotide extension
Cleavage
Ligation
Reaction product detection and display
Comparison of Techniques used

17
Hybridization Techniques

Micro Arrays
Sequencing by hybridization.
Utilize a set of tiling oligonucleotides.
Somewhat complex: involves pooling and processing
of PCR amplicons that are subsequently hybridized
to a DNA micro array and visualized.
Theoretically capable of genotyping thousands of
polymorphisms simultaneously.
Success rate of 97% (somewhat low for this kind
of analysis).
High false rates of 11-21%.
Design and fabrication of micro arrays is
expensive, hence users are confined to the set of
genotypes established by the manufacturer.

Real Time PCRs
Utilizes TaqMan(TM) DNA probes to detect PCR
products in real time.
The TaqMan(TM) probe contains a fluorescent
reporter at the 5' end and a fluorescence
resonance energy transfer (FRET) moiety at the 3'
end, which quenches the fluorescent signal of the
reporter.
The probe sequence is complementary to the PCR
amplicon and is designed to anneal at the
extension temperature.
During extension, the 5'->3' exonuclease activity
of Taq DNA polymerase I cleaves the probe,
emitting signal due to the separation of the
reporter from the quencher.
Polymorphism is determined solely by
hybridization and not by the ability of the
enzyme to discriminate.
Because the enzyme does not confer specificity in
detection, this technique is classified as
hybridization-based.
Depending on the optical thermocycler platform,
384 reactions can be monitored for each cycle
without removing any sample.
Amenable to robotic automation.

19
Real Time PCRs
20
Enzyme based Techniques

Nucleotide extension
One of the simplest techniques for known
polymorphism detection.
Existing in numerous variations (also known as
minisequencing, SNuPE, GBA, APEX, AS-PE capture,
FNC, TDI or PROBE), this assay typically involves
the single base extension of an oligonucleotide
by a polymerase.
The oligonucleotide is designed to anneal
immediately upstream of the polymorphism locus,
and differentially labeled fluorescent
dideoxynucleotides are utilized as substrates for
polymerase extension.
The fluorescent signal emitted corresponds to the
nucleotide incorporated and thus the sequence of
the polymorphism.
Advantages: simplicity and accuracy in
distinguishing between heterozygous and
homozygous genotypes.
Targets need to be PCR amplified, and PCR
reagents must be removed.
False negatives can occur due to mis-priming.

21
Nucleotide Extension
22

Cleavage
The Invader(TM) assay utilizes the exonuclease
activity of Cleavase VIII on overlapping
oligonucleotide strands.
Two oligonucleotides, an invader probe and
either a wild-type or mutant primary probe,
overlap each other at a single nucleotide
position on the template only if they are
complementary to the polymorphism being queried.
Cleavage occurs when the specific overlapping
conformation is present, freeing an
oligonucleotide referred to as a flap.
This flap can be detected in a multiplex manner
by size, mass or sequence
Commonly the flap participates in a second
cleavage assay with another complementary target,
causing release of a fluorescent signal.
Advantage - the same flap may bind to many
targets, generating a cascading signal
amplification and thereby obviating the need for
PCR amplification.
Single-tube one-step reaction.

23
Cleavage
24

Ligation
One of the most specific assays due to the high
specificity of T4 ligase (oligo ligation assay)
and even higher specificity of thermostable
ligases (ligation detection reaction, LDR)
Two primers are designed to anneal adjacent to
one another on the target of interest
Generally, the upstream primer (discriminating
primer) contains a fluorescent label at the 5'
end, with the 3' nucleotide overlapping the
polymorphic base.
The fluorescent signal corresponds to the allele
being queried at the 3' position of the
discriminating primer
When the discriminating primer forms a perfect
complement with the target at the junction, the
ligase covalently attaches the adjacent
downstream primer (common primer)
The resulting product is approximately twice as
long as each of the individual primers and can be
easily monitored for detection by means of
capillary electrophoresis or by display on a
microarray
Advantage: very good sensitivity and specificity.

25
Techniques to detect unknown Polymorphisms

Direct Sequencing
Microarray
Cleavage / Ligation
Electrophoretic mobility assays
Comparison of Techniques used

26
Direct Sequencing

Sanger dideoxysequencing can detect any type of
unknown polymorphism and its position, when the
majority of DNA contains that polymorphism.
Misses polymorphisms and mutations when the DNA
is heterozygous.
Limited utility for analysis of solid tumors or
pooled samples of DNA due to low sensitivity.
Once a sample is known to contain a polymorphism
in a specific region, direct sequencing is
particularly useful for identifying a
polymorphism and its specific position.
Even if the identity of the polymorphism cannot
be discerned in the first pass, multiple
sequencing attempts have proven quite successful
in elucidating sequence and position information.

27
Microarray

Variation detection arrays (VDAs) scan large
sequence blocks and identify regions containing
unknown polymorphisms.
This methodology suffers from the same
limitations in fabrication and design as observed
in known polymorphism analysis, but has
demonstrated much greater success in the context
of unknown polymorphism detection for both SNP
and tumor analysis.
With respect to SNP analysis, a recent study of
chromosome 21 successfully identified
approximately half of the estimated number of
common SNPs (frequency of 10-50%) across the
entire chromosome.
The experimental design required a sacrifice in
sensitivity in order to minimize false positives.
This explains the decrease in successful
identification from 80% to 50%.

28
Cleavage/Ligation

Unknown polymorphisms can also be identified by
the cleavage of mismatches in DNA:DNA
heteroduplexes.
This can be achieved either chemically (chemical
cleavage method, CCM) or enzymatically (T4
endonuclease VII, MutY cleavage or Cleavase).
Typically, at least two samples are PCR amplified
(one sample can be sufficient for solid tumor
samples with high levels of stromal
contamination), denatured and then hybridized to
create DNA:DNA heteroduplexes of the variant
strands.
Enzymes cleave adjacent to the mismatch, and
products are resolved via gel or capillary
electrophoresis.
Unfortunately, the cleavage enzymes often nick
complementary regions of DNA as well. This
increases background noise, lowers specificity,
and reduces the pooling capacity of the assay.

29
Cleavage / Ligation
30
SNP Applications

Gene discovery and mapping
Association-based candidate polymorphism testing
Diagnostics/risk profiling
Response prediction
Homogeneity testing/study design
Gene function identification

31
High-resolution haplotype structure in the human
genome

Mark J. Daly, John D. Rioux, Stephen F.
Schaffner, Thomas J. Hudson, and Eric S. Lander

32
Abstract

The authors describe a high-resolution analysis
of the haplotype structure across 500 kb on
chromosome 5q31 using 103 SNPs in a
European-derived population.
They developed an analytical model for linkage
disequilibrium (LD) mapping based on
high-resolution haplotype blocks, which offers a
coherent framework for creating a haplotype map
of the human genome.

33
Data used

A 500 kb region on human chromosome 5q31 that is
implicated as containing a genetic risk factor
for Crohn disease.
Rioux, J. D. et al. Hierarchical linkage
disequilibrium mapping of a susceptibility gene
for Crohn's disease to the cytokine cluster on
chromosome 5. Nature Genet. 29, 223-228 (2001)
103 common (>5% minor allele frequency) SNPs
genotyped from a European-derived population.
The study describes 258 chromosomes transmitted
to individuals with Crohn disease and 258
untransmitted chromosomes.

34
Data used

The genotype data used in the study provide the
highest-resolution picture of the patterns of
genetic variation across a large genomic region,
with a marker density of 1 SNP roughly every 5
kb.

35
Study

Focus on identifying the underlying haplotypes.
The authors' initial focus was on untransmitted
control chromosomes. However, the same haplotype
structure was seen in the chromosomes transmitted
to individuals with Crohn disease, with the only
difference being that one of the haplotypes was
enriched in frequency, reflecting its association
with Crohn disease.

36
Study

It became evident during the study that the
region could be largely decomposed into discrete
haplotype blocks, each with a lack of diversity.
As haplotype block structure was the same in both
groups, they presented combined data from all
chromosomes (transmitted and untransmitted).

37
Haplotype block structure on 5q31
38
Haplotype block structure on 5q31
a. Common haplotype patterns in each block of low
diversity. Dashed lines indicate locations where
more than 2% of all chromosomes are observed to
transition from one common haplotype to a
different one.
39
Haplotype block structure on 5q31
b. Percentage of observed chromosomes that match
one of the common patterns exactly (total
chromosomes: 258 transmitted and 258
untransmitted).
40
Haplotype block structure on 5q31
c. Percentage of each of the common patterns
among 258 untransmitted chromosomes.
41
Haplotype block structure on 5q31
d. Rate of haplotype exchange between the blocks
as estimated by the HMM.
42
Haplotype block structure on 5q31

The haplotype blocks span up to 100 kb and
contain multiple (five or more) common SNPs.
The blocks have only a few (2-4) haplotypes,
which show no evidence of being derived from one
another by recombination, and which account for
nearly all chromosomes (>90%) in all cases in the
sample.
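
The statement that a handful of haplotypes account for nearly all chromosomes in a block can be made concrete by counting. The sketch below is illustrative only: the function name `common_haplotype_coverage`, the 5 percent cut-off for "common", and the toy haplotype strings are assumptions, not values from the study.

```python
from collections import Counter

def common_haplotype_coverage(block_haplotypes, min_freq=0.05):
    """Return ({haplotype: frequency} for common haplotypes, their total frequency).

    block_haplotypes: one string per chromosome, giving its alleles at the
    SNPs inside a single block. min_freq is an assumed cut-off for "common".
    """
    counts = Counter(block_haplotypes)
    n = len(block_haplotypes)
    common = {h: c / n for h, c in counts.items() if c / n >= min_freq}
    return common, sum(common.values())

# Hypothetical block typed on 100 chromosomes: two common haplotypes
# plus a few rare ones.
block = ["GGACA"] * 60 + ["AATTC"] * 35 + ["GGATC"] * 3 + ["AATCA"] * 2
common, coverage = common_haplotype_coverage(block)
print(sorted(common))
print(f"{coverage:.2f}")
```

In this toy block two haplotypes cover about 95 percent of chromosomes, which is the kind of pattern the slide describes.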

43
Haplotype block structure on 5q31
For example, an 84 kb block shows only two
distinct haplotypes that together account for 95%
of the observed chromosomes (Table 1).
44
Study

The discrete blocks are separated by intervals in
which several independent historical
recombination events seem to have occurred,
giving rise to greater haplotype diversity for
regions spanning the blocks.
The most common recombination events are
indicated in the previous figure by lines
connecting the haplotypes.
The recombination events appear to be clustered:
multiple obligate exchanges must have occurred
between most blocks, with little or no exchange
within blocks.

45
Study

Although there is detectable recombination
between blocks, it is modest enough for there to
be clear long-range correlation (that is, LD)
among blocks.
The haplotypes at the various blocks can be
readily assigned to one of the four ancestral
long-range haplotypes.
Indeed, 38% of the chromosomes studied carried
one of these four haplotypes across the entire
length of the region.

46
Study

Using a hidden Markov model (HMM), they developed
an approach to define the block structure
formally.
The HMM simultaneously assigns every position
along each observed chromosome to one of the four
ancestral haplotypes and estimates the
maximum-likelihood values of the historical
recombination frequency (T) between each pair of
markers.

47
Study

The quantity T provides a convenient summary of
the degree of haplotype exchange across
inter-marker intervals and relates directly to
conventional measures of LD.
In this study, T is estimated at less than 1% for
73 of the inter-marker intervals, 1-4% for 14 of
the intervals, and more than 4% for only 9 of the
intervals.

48
Methods: Individuals and marker selection

The individuals studied, Canadians from
metropolitan Toronto of predominantly European
descent, and the genotyping methodologies are
described in the paper
Rioux, J. D. et al. Hierarchical linkage
disequilibrium mapping of a susceptibility gene
for Crohn's disease to the cytokine cluster on
chromosome 5. Nature Genet. 29, 223-228 (2001)
To ensure the ability to reconstruct multi-marker
haplotypes, SNPs for haplotype analysis were
selected from the set of markers for which full
genotypes were available for all members.
SNPs at CpG sites were not included to prevent
potential confounding of common haplotype
patterns from recurrent mutations.

49
Methods: Haplotype counting

Haplotype percentages in the "Haplotype block
structure on 5q31" figure were computed using
haplotypes generated by the transmission
disequilibrium test (TDT) implementation in
Genehunter 2.0 (ref. 22 in the paper), followed
by use of an EM-type algorithm (refs. 23, 24 in
the paper) to include the minority of chromosomes
that had one or more markers with ambiguous phase
or where one marker was missing genotype data.

50
Methods: Hidden Markov model

The observation that over long distances most
haplotypes can be described as belonging to one
of a small number of common haplotype categories
suggested the use of an HMM in which haplotype
categories were defined as states.
The authors assigned observed chromosomes to
those hidden states and simultaneously estimated
the transition probability in each map interval
by using an EM algorithm and by making the
simplifying assumption that there was a single
transition probability for each map interval,
rather than allowing specific transition
probabilities from each state to each state.
The output of this method was a
maximum-likelihood assignment to haplotype
category at each position and ML estimates of T
indicating how significantly recombination has
acted to increase haplotype diversity in each map
interval.
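
To illustrate the state-assignment half of this idea, here is a simplified Python sketch. It is not the authors' implementation: it uses Viterbi decoding with transition probabilities that are given rather than estimated by EM, and the function name `viterbi_assign`, the genotyping error rate, and the toy haplotypes are all assumptions made for the example. It does keep the slide's simplification of one transition probability per map interval.

```python
import math

def viterbi_assign(observed, ancestral, theta, error=0.01):
    """Assign each marker of one chromosome to an ancestral haplotype.

    observed:  string of alleles along the chromosome
    ancestral: list of at least two ancestral haplotype strings (hidden states)
    theta:     one switch probability per inter-marker interval (0 < theta < 1)
    error:     assumed probability that an allele differs from its state
    """
    k, n = len(ancestral), len(observed)

    def emit(state, pos):
        match = ancestral[state][pos] == observed[pos]
        return math.log(1 - error) if match else math.log(error)

    # Start with a uniform prior over ancestral haplotypes.
    score = [math.log(1.0 / k) + emit(s, 0) for s in range(k)]
    back = []
    for pos in range(1, n):
        stay = math.log(1 - theta[pos - 1])
        switch = math.log(theta[pos - 1] / (k - 1))
        new_score, pointers = [], []
        for s in range(k):
            # Best previous state for arriving in state s at this marker.
            best = max(range(k),
                       key=lambda q: score[q] + (stay if q == s else switch))
            step = stay if best == s else switch
            new_score.append(score[best] + step + emit(s, pos))
            pointers.append(best)
        score = new_score
        back.append(pointers)

    # Trace the most likely path backwards.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

ancestral = ["AAAAAA", "CCCCCC"]
theta = [0.01, 0.01, 0.20, 0.01, 0.01]   # a hot interval between markers 3 and 4
print(viterbi_assign("AAACCC", ancestral, theta))  # [0, 0, 0, 1, 1, 1]
```

In a full treatment the `theta` values would themselves be re-estimated from the assignments over many chromosomes, which is the role the EM algorithm plays in the study.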

51
Discussion of Study

The region of chromosome 5q31 may be largely
divided into discrete blocks of 10-100 kb. Each
block has only a few common haplotypes, and the
haplotype correlation between blocks gives rise
to long-range LD.
Focusing on haplotype blocks greatly clarifies LD
analyses. Once the haplotype blocks are
identified, they can be treated as alleles and
tested for LD (instead of single-marker analyses
of LD).

52
Discussion of Study

In analogous fashion, the haplotype structure
provides a crisp approach for testing the
association of genomic segments with disease. By
contrast, disease association studies
traditionally involve testing individual SNPs in
and around a gene.
Once the haplotype blocks are defined, it is
straightforward to examine a subset of SNPs that
uniquely distinguish the common haplotypes in
each block. This allows the common variation in a
gene to be tested exhaustively for association
with disease.
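
Choosing a subset of SNPs that uniquely distinguishes the common haplotypes can be sketched as a small search problem. The Python example below is a brute-force illustration under stated assumptions, not a method described in the slides: the function name `minimal_tag_snps` and the toy haplotypes are hypothetical, and exhaustive search is only practical for blocks with few SNPs.

```python
from itertools import combinations

def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP positions that tells all haplotypes apart.

    haplotypes: distinct strings of equal length, one per common haplotype.
    Returns None if the haplotypes cannot be distinguished (e.g. duplicates).
    """
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for positions in combinations(range(n_sites), size):
            # Alleles of each haplotype at the candidate positions only.
            signatures = {tuple(h[p] for p in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return positions
    return None

# Three hypothetical common haplotypes over five SNPs in one block.
block = ["GGACA", "AATTC", "GATCA"]
print(minimal_tag_snps(block))  # (0, 1)
```

Here no single SNP separates all three haplotypes, but the first two SNPs together do, so genotyping those two would be enough to identify which common haplotype a chromosome carries.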

53
Discussion of Study

This approach provides a precise framework for
creating a comprehensive haplotype map of the
human genome.
By testing a sufficiently large collection of
SNPs, it should be possible to define all of the
common haplotypes underlying blocks of LD. Once
such a map is created, it will be possible to
select an optimal reference set of SNPs for any
subsequent genotyping study.
This detailed understanding of common human
variation represents an important step in the
Human Genome Project.

54
Linkage Disequilibrium

Uses unrelated individuals
Good for fine scale mapping because there is
greater opportunity for recombination to occur.
Map of loci that contribute to inherited genetic
disorders
States cannot be considered independent because
they are related by distance and recombination,
so an individual haplotype may not be the cause
of disease; rather, the cause may be a
combination of several haplotypes in blocks.

55
Linkage Disequilibrium

The greater the distance between genes, the
greater the chance of recombination.
The smaller the distance between genes, the
smaller the chance of recombination.
Knowing the above and observing inherited
alleles, one can estimate the relative distance
between genes.

56
Measures of Linkage Disequilibrium

cM: centiMorgans
50 cM would mean that two genes have roughly a
50% chance of recombination occurring between
them.
Such genes are relatively far apart.

57
Importance of Linkage Disequilibrium

Offers us a way to measure the distance between
genes.
Non-random
Measure of relation between markers and disease
mutations.
Possibly used to map disease genes because high
LD areas would be related to recombination and
formation of new alleles

58
Data Mining Applied to Linkage Disequilibrium
Mapping

HPM - Haplotype Pattern Mining
A data mining method for LD-based gene mapping
Uses haplotypes as inputs, which can be obtained
from genetic simulation programs such as
GENEHUNTER
Extension of traditional association analysis
Searches for shared and flexible haplotypes and
finds out which ones are strongly associated with
a disease.
Uses a non-parametric statistical model, without
any genetic models, on the basis of the locations
of the haplotypes

59
What we know

LD, seen here as the non-random association of
haplotypes with a disease, is likely strongest
around the DS (disease susceptibility) gene.
The DS locus will most likely be where the
strongest associations are.

60
Notation

Haplotype map M has k markers (m1, ..., mk).
The haplotype pattern P on M is the vector
(p1, ..., pk), where each pi is an allele of mi
or a wild-card (*).
P occurs on the haplotype vector, which is simply
the chromosome (H), so H = (h1, ..., hk), where
pi = hi or pi = * for every i.
Example:
P1 = (*, 2, 5, *, 3, *, *, *, *, *)
PC = (4, 2, 5, 1, 3, 2, 6, 4, 5, 3)
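
The matching rule in this notation is simple enough to state as code. The following is a minimal Python sketch, with the wild-card written as `*` (an assumption about the symbol, which appears to have been lost in the transcript) and a hypothetical function name `occurs`; the two vectors are the ones from the example above.

```python
WILDCARD = "*"

def occurs(pattern, haplotype):
    """Return True if the pattern occurs on the haplotype.

    Each pattern entry must either be the wild-card or equal the
    haplotype's allele at the same marker.
    """
    if len(pattern) != len(haplotype):
        raise ValueError("pattern and haplotype must cover the same markers")
    return all(p == WILDCARD or p == h for p, h in zip(pattern, haplotype))

p1 = ["*", 2, 5, "*", 3, "*", "*", "*", "*", "*"]
chromosome = [4, 2, 5, 1, 3, 2, 6, 4, 5, 3]
print(occurs(p1, chromosome))  # True
```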

61
Issues in Shape of Haplotype Pattern

Length of the pattern
Defined as the maximal distance between any 2
markers, measured in centiMorgans
Extremely long patterns do not give us much
information, so the size of P is constrained
in HPM
Gaps in patterns
Account for mutations, errors, missing data, and
recombination
Gap size and number can be controlled in HPM
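
Both constraints can be computed from a pattern and a marker map. The Python sketch below is an illustration under assumptions: the wild-card is written as `*`, the function names and the evenly spaced map positions are hypothetical, and a gap is taken to be a run of wild-cards lying between two fixed markers.

```python
def pattern_length_cm(pattern, marker_positions_cm):
    """Distance in cM between the outermost fixed (non-wild-card) markers."""
    fixed = [pos for p, pos in zip(pattern, marker_positions_cm) if p != "*"]
    return max(fixed) - min(fixed) if fixed else 0.0

def count_gaps(pattern):
    """Return (number of gaps, largest gap size) for a pattern."""
    idx = [i for i, p in enumerate(pattern) if p != "*"]
    # A gap is one or more wild-cards between two consecutive fixed markers.
    gaps = [b - a - 1 for a, b in zip(idx, idx[1:]) if b - a > 1]
    return len(gaps), max(gaps, default=0)

pattern = ["*", 2, 5, "*", 3, "*", "*", "*", "*", "*"]
positions = [float(i) for i in range(10)]    # assumed map: one marker per cM
print(pattern_length_cm(pattern, positions))  # 3.0
print(count_gaps(pattern))                    # (1, 1)
```

A search procedure could discard any candidate pattern whose length or gap counts exceed chosen limits before testing it for association.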

62
Procedure

Depth-first search finds all haplotype patterns
that exceed the lower bound threshold and meet
the association measure.
Calculate the frequency f(mi) of marker mi with
respect to (M, H, Y, x), where Y is the phenotype
and x is the positive association threshold.
Markers with the highest frequencies are
predicted to be in the area of the DS gene,
assuming a DS gene is present.
Prediction of granularity of marker density
Ranked based on frequency
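
The marker-frequency step can be sketched as follows. This Python example is a simplified illustration, not the published HPM code: candidate patterns are supplied directly instead of being found by depth-first search, association is scored with a plain 2x2 chi-square statistic, a marker is counted only when the pattern fixes an allele there, and all names and data are hypothetical.

```python
def occurs(pattern, haplotype):
    """True if every non-wild-card entry of the pattern matches the haplotype."""
    return all(p == "*" or p == h for p, h in zip(pattern, haplotype))

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def marker_frequencies(patterns, haplotypes, phenotypes, x):
    """Count, per marker, the positively associated patterns that fix an allele there.

    phenotypes: True for disease-associated chromosomes, False for controls.
    x: chi-square threshold for calling a pattern strongly associated.
    """
    freq = [0] * len(haplotypes[0])
    for pattern in patterns:
        hits = [occurs(pattern, h) for h in haplotypes]
        a = sum(hit and y for hit, y in zip(hits, phenotypes))          # affected, pattern
        b = sum(hit and not y for hit, y in zip(hits, phenotypes))      # control, pattern
        c = sum(not hit and y for hit, y in zip(hits, phenotypes))      # affected, no pattern
        d = sum(not hit and not y for hit, y in zip(hits, phenotypes))  # control, no pattern
        if a * d > b * c and chi_square_2x2(a, b, c, d) >= x:
            for i, p in enumerate(pattern):
                if p != "*":
                    freq[i] += 1
    return freq

haplotypes = [(1, 2, 1, 1), (2, 2, 1, 2), (1, 2, 1, 2), (2, 2, 1, 1),
              (1, 1, 2, 1), (2, 1, 2, 2), (1, 1, 1, 2), (2, 1, 2, 1)]
phenotypes = [True] * 4 + [False] * 4
patterns = [("*", 2, 1, "*"), ("*", 2, "*", "*"), (1, "*", "*", "*")]
print(marker_frequencies(patterns, haplotypes, phenotypes, x=3.84))  # [0, 2, 1, 0]
```

In this toy data the second marker has the highest frequency, so it would be the predicted neighbourhood of the DS gene.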

63
Results Simulated Data

A founder population that grows from 300 to
100,000 in 500 years was simulated in the Populus
simulator package.
Simulated data were used because they are cheaper
and can be easily manipulated.

List of 11 most strongly disease-associated
haplotype patterns in the simulated data
Chromosome has 101 markers
Dashed line indicates the true gene location

Frequency histogram of the previous slide's data,
but with patterns exceeding the threshold of
association
Dashed line indicates the true gene location
Marker 5 now has the highest frequency

The actual vs. predicted locations for 100 data
sets

Mutation carrying chromosomes, denoted by A
Sample founder population size
Corrupted data
Missing data

68
Real Data HLA complex

The data consisted of affected sib-pair families
with type 1 diabetes from the UK, genotyped for
25 markers.
The markers spanned 14 Mb and covered the entire
HLA complex.
The HLA-DQB1 and HLA-DRB1 loci, which are located
in the middle of this 14 Mb region, are known to
be the primary factors for type 1 diabetes.
200 samples were randomly selected from the 385
available to compare with simulated results.

Frequency vs. Map Location of HLA markers
___ HPM calculated frequencies
----- Background LD frequencies
Vertical lines indicate true locations of
markers

70
Discussion of HPM Technique

Robust to lost and erroneous data
Applicable to complex gene mapping
Works well with small data sets, but accuracy is
increased with the increase of data
Works with real and simulated data
Does not include any previously derived models

71
References

Introduction to SNPs Discovery of Markers of
Disease
SNP seeking long term association with complex
diseases
SNP mapping using Genome-wide Unique Sequences
The Structure of Haplotypes Blocks in Human
Genome
Using Haplotype blocks to map human complex trait
loci
High Resolution haplotype structure in human
genome
Detection of regulatory variation in mouse genes
http://linkage.rockefeller.edu/wli/lld.html
http://statwww.epfl.ch/davison/teaching/Microarrays/snp.ppt
http://www.cs.helsinki.fi/u/htoivone/pubs/ajhg_2000.pdf
Resolution of Haplotypes and Haplotype
Frequencies from SNP Genotype of Pooled Samples
http://www.journals.uchicago.edu/AJHG/journal/issues/v71n6/024386/024386.html
http://www.ncbi.nlm.nih.gov/entrez/utils/fref.fcgi?http://www.sciencemag.org/cgi/pmidlookup?view=full&pmid=11452081
http://www.genome.gov/10001665
http://walnut.usc.edu/magnus/papers/tig.pdf