SNP Discovery and Genotyping Workshop - PowerPoint PPT Presentation

Provided by: mbt2
1
SNP Discovery and Genotyping Workshop
  • SNP discovery strategies (Debbie Nickerson)
  • Identifying SNPs by association for genotype-phenotype analysis of
    candidate genes (Chris Carlson)
  • Identifying haplotypes for genotype-phenotype analysis of candidate
    genes (Dana Crawford)
  • SNP genotyping strategies (Debbie Nickerson)

2
  • SNP Discovery and Genotyping Strategies
  • Debbie Nickerson - debnick_at_u.washington.edu
  • Overview of Variation in the Human Genome
  • SNP Discovery Strategies and Status
  • SNP Data in the PGAs
  • Genotyping SNPs

3
Total sequence variation in humans
  • Population size: 6 x 10^9 (diploid)
  • Mutation rate: 2 x 10^-8 per bp per generation
  • Expected hits: 240 for each bp
  • Every variant compatible with life exists in the population, BUT
    most are vanishingly rare
  • Compare 2 haploid genomes: 1 SNP per 1331 bp
The International SNP Map Working Group, Nature 409:928-933 (2001)
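The "expected hits" figure follows directly from the two numbers on this slide. A minimal sketch of the arithmetic, assuming each diploid person contributes two copies of every base pair (the variable names are illustrative, not from the talk):

```python
# Numbers taken from the slide; variable names are illustrative only.
population_size = 6e9        # people (diploid)
copies_per_person = 2        # two chromosomes carry each base pair
mutation_rate = 2e-8         # new mutations per bp per generation

# Expected number of new mutations hitting any given bp in one generation.
expected_hits_per_bp = population_size * copies_per_person * mutation_rate
print(round(expected_hits_per_bp))  # 240
```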
4
Strategies to Find SNPs
  • Mine them from Existing Genome Resources
  • Targeted SNP Discovery in Candidate Genes

Berkeley PGA - http://pga.lbl.gov
CardioGenomics - http://www.cardiogenomics.org
InnateImmunity - http://innateimmunity.net
SeattleSNPs - http://pga.mbt.washington.edu
Southwestern - http://pga.swmed.edu
5
Sequence-based SNP Mining
[Figure: flow diagram with the labels Genomic DNA, mRNA, BAC library, RRS
library, cDNA library, Sampling, Shotgun Overlap, EST Overlap, BAC Overlap,
Sequence Overlap, and SNP discovery]
Example of overlapping reads:
GTTTAAATAATACTGATCA
GTTTAAATAATACTGATCA
GTTTAAATAGTACTGATCA
GTTTAAATAGTACTGATCA
4.1 Million SNPs Available http://www.ncbi.nlm.nih.gov/SNP/
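The idea behind overlap-based mining can be sketched with the four example reads from this slide. This is a toy illustration only: it assumes the reads are already aligned, equal in length, and error-free, whereas a real mining pipeline would also weigh base quality. The function name is hypothetical.

```python
# The four overlapping reads shown on the slide.
reads = [
    "GTTTAAATAATACTGATCA",
    "GTTTAAATAATACTGATCA",
    "GTTTAAATAGTACTGATCA",
    "GTTTAAATAGTACTGATCA",
]

def find_variant_columns(aligned_reads):
    """Return (1-based position, alleles) for each column with more than one base."""
    variants = []
    for pos, column in enumerate(zip(*aligned_reads), start=1):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            variants.append((pos, alleles))
    return variants

print(find_variant_columns(reads))  # [(10, ['A', 'G'])]
```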
6
Mining Finds Only A Small Fraction of the SNPs
[Figure: fraction of SNPs discovered (vertical axis, 0.0 to 1.0) versus
minor allele frequency (horizontal axis, 0.0 to 0.5) for an example A/G
SNP; curves labeled 2, 8, 16, 24, 48, and 96]
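The shape of such curves can be reproduced with a simple sampling argument. The sketch below assumes the curve labels (2 to 96) are the number of chromosomes sampled, which the transcript does not state, and that chromosomes are drawn independently; a SNP is counted as discovered when both alleles appear in the sample.

```python
def detection_probability(maf, n_chromosomes):
    """Chance that both alleles are seen among n independently sampled chromosomes."""
    all_major = (1.0 - maf) ** n_chromosomes   # sample contains only the major allele
    all_minor = maf ** n_chromosomes           # sample contains only the minor allele
    return 1.0 - all_major - all_minor

# Assumed sample sizes, taken from the curve labels in the figure.
for n in (2, 8, 16, 24, 48, 96):
    print(n, round(detection_probability(0.05, n), 3))
```

With only two chromosomes (the situation when mining overlaps between two sequences), a SNP with 5% minor allele frequency is found less than 10% of the time, which is consistent with the slide's title.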
7
Total Estimated SNPs and Fraction in dbSNP
Minimal allele   Expected SNPs   Expected SNP     Expected % in
frequency (%)    (millions)      frequency (bp)   database
 1               11.0             290             11-12
 5                7.1             450             15-17
10                5.3             600             18-20
20                3.3             960             21-25
30                2.0            1570             23-27
40                0.97           3280             24-28
L. Kruglyak and D. Nickerson, Nat Genet 27:234-236 (2001)
8
Surfactant B - Locus Link
dbSNP (http://www.ncbi.nlm.nih.gov/SNP/)
9
Surfactant B - dbSNP
10
Confirmation of SNP Resource in New Sample: Potential Pitfalls
[Figure: bar chart, 0 to 100 on the vertical axis, for the categories BAC,
EST, Other, RRS, PCR, Any Multiple Report, and BRE Multiple Report; series:
Confirmed Multiple Method Report in dbSNP, Confirmed Unique Method Report
in dbSNP]
11
Strategies to Find SNPs
  • Mine them from Existing Resources
  • Targeted SNP Discovery in Candidate Genes

Berkeley PGA - http://pga.lbl.gov
CardioGenomics - http://www.cardiogenomics.org
InnateImmunity - http://innateimmunity.net
SeattleSNPs - http://pga.mbt.washington.edu
Southwestern - http://pga.swmed.edu
12
Sequence-based SNP Identification
  • Amplify DNA
  • Sequence each end of the fragment (5' and 3')
  • Phred: base-calling, quality determination
  • Phrap: contig assembly, final quality determination
  • PolyPhred: polymorphism detection
    (example reads ATAGACG and ATACACG; homozygotes and heterozygote)
  • Consed: sequence viewing, polymorphism tagging
  • Analysis: polymorphism reporting, individual genotyping,
    phylogenetic analysis
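The "quality determination" step rests on the Phred quality score, which the slide does not define. The standard definition is Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A small sketch of the conversion (function names are illustrative):

```python
import math

def phred_quality(error_probability):
    """Phred score for a given probability that the base call is wrong."""
    return -10.0 * math.log10(error_probability)

def error_probability(quality):
    """Inverse conversion: probability of error for a given Phred score."""
    return 10.0 ** (-quality / 10.0)

print(round(phred_quality(0.001)))  # 30
print(error_probability(20))        # 0.01
```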
13
Sequence-Based Detection and Genotyping of SNPs
Jim Sloan, Tushar Bhangle (PolyPhred)
Matthew Stephens, Paul Scheet (Quality Scores for SNPs)
Phil Green, Brent Ewing, David Gordon (Phred, Phrap, Consed)
15
PGA SNPs
  • The PGAs provide a validated SNP resource
  • (Allele Frequency Data)
  • Novel Views of the Variation Data
  • Emerging Pathway Interfaces
  • Color Fasta Formats
  • Gene Structure Views
  • Visual Genotypes
  • Linkage Disequilibrium Views
  • TagSNPs
  • Haplotypes
  • Many New Formats Under Development

16
Toward comprehensive association studies
  • 5-7 million common variants exist in genome
  • Testing all for association is impractical today
  • Can the list be reduced w/o loss of power?
  • SNPs in Coding (Amino Acid Changes)
  • Linkage disequilibrium (SNPs in other functional
    regions, i.e. regulatory elements)

17
cSNPs - Both Deep and Average Coverage Available from the PGAs
  • CD36 - Southwestern PGA - Deep cSNP Discovery Strategy: Healthy, High
    Cholesterol, High Triglycerides, Congenital Cardiac Abnormalities,
    Left Ventricular Hypertrophy
  • CD36 - SeattleSNPs PGA - Average cSNP Discovery Strategy: Healthy only
18
SIFT (Sorting Intolerant From Tolerant) Coding Changes
CYP4F2
Trp (W) -> Gly (G): predicted to be tolerated
Val (V) -> Gly (G): predicted not to be tolerated
Ng and Henikoff, Genome Res. 2002
19
SNP-Based Association Studies
Indirect: use a dense map of SNPs and test for linkage disequilibrium
(use association to find functional sites in the entire sequence,
including non-coding regions)
[Figure: gene drawn 5' to 3' with example coding changes Val-Val and Arg-Cys]
Collins, Guyer, Chakravarti, Science 278:1580-81, 1997
20
SNP Discovery and Genotyping Workshop
  • SNP discovery strategies (Debbie Nickerson)
  • Identifying SNPs by association for genotype-phenotype analysis of
    candidate genes (Chris Carlson)
  • Identifying haplotypes for genotype-phenotype analysis of candidate
    genes (Dana Crawford)
  • SNP genotyping strategies (Debbie Nickerson)

21
Selecting SNPs for Genotype-Phenotype Analysis
Using Allelic Association (Linkage Disequilibrium)
  • Christopher Carlson
  • csc47_at_u.washington.edu

22
Candidate Gene Association Analysis
  • Describe existing genetic variation
  • Rare SNPs (deep exonic resequencing)
  • Common SNPs (complete resequencing)
  • Select a subset of SNPs for genotyping
  • cSNPs (amino acid changes)
  • htSNPs (resolve haplotypes)
  • tagSNPs (patterns of genotype)
  • Test for genotype/phenotype correlations

23
SeattleSNPs Resequencing Strategy I
  • Resequence the complete genomic region of each
    gene
  • 2000 bp upstream of first exon
  • 1500 bp downstream of poly-A signal
  • All exons and introns for genes below 35 kbp
  • Image courtesy of GeneSNPs

24
VG2
  • Visual Genotype 2
  • Web interface
  • Visualize genotypes
  • View SNPs by frequency
  • Sort on similarity between sites
  • Sort on similarity between samples
  • Visualize LD

25
SeattleSNPs Resequencing Strategy II
  • Resequence candidate genes from inflammation and
    coagulation pathways
  • Resequence 47 individuals
  • 24 African American
  • 23 European American
  • Homozygote common
  • Heterozygote
  • Homozygote rare
  • Missing Data

32
Preliminary Analyses
  • Hardy-Weinberg Equilibrium
  • Population specificity
  • Nucleotide diversity
  • Population genetics statistics (e.g. Tajima's D)
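As an illustration of the first check in this list, a Hardy-Weinberg test can be sketched as a chi-square comparison of observed and expected genotype counts. This is a minimal textbook version, not necessarily the test used by the speakers; the function name and example counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)     # frequency of allele A
    q = 1.0 - p                         # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi_square(25, 50, 25))  # 0.0, exactly at equilibrium
print(hwe_chi_square(50, 0, 50))   # 100.0, no heterozygotes at all
```

Values above about 3.84 correspond to p < 0.05 with one degree of freedom; for small counts an exact test is usually preferred.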

33
SNP Selection cSNPs
  • Genotype SNPs which change amino acids
  • Genotype other good story SNPs
  • SNPs in known regulatory elements
  • SNPs in Conserved Noncoding Sequences
  • Image courtesy of GeneSNPs

34
SNP Selection htSNPs
  • Genotype haplotype tagging SNPs which resolve
    existing common haplotypes

36
SNP Selection tagSNPs
  • Resequence a modest number of samples
  • Describe patterns of genotype at all common SNPs
  • Genotype tagSNPs which efficiently capture
    existing patterns of genotype

37
Linkage Disequilibrium
A B
  • Haplotype is the pattern of alleles on a single
    chromosome
  • 4 possible haplotypes
  • Linkage Disequilibrium (LD) describes the allelic
    association between two SNPs
  • Two popular LD statistics
  • D'
  • r^2

38
Complete LD
A B
  • Unequal allele frequency
  • Allelic association is as strong as possible
  • 3 haplotypes observed
  • No detected recombination between SNPs
  • Genotype is not perfectly correlated
  • D' = 1
  • r^2 < 1

39
Perfect LD
A B
  • Equal allele frequency
  • Allelic association is as strong as possible
  • 2 haplotypes observed
  • No detected recombination between SNPs
  • Genotype is perfectly correlated
  • D' = 1
  • r^2 = 1
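The difference between complete and perfect LD is easiest to see numerically. The sketch below uses the standard definitions of D, the normalized statistic D', and r^2 for two biallelic SNPs; the function name and the example haplotype frequencies are made up for illustration and both SNPs are assumed to be polymorphic.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, D', r^2) given the A-B haplotype frequency and the
    frequencies of allele A (SNP 1) and allele B (SNP 2)."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Perfect LD: two haplotypes (A-B 0.3, a-b 0.7), equal allele frequencies.
print(ld_statistics(p_ab=0.3, p_a=0.3, p_b=0.3))
# Complete LD: three haplotypes (A-B 0.2, a-B 0.3, a-b 0.5), unequal frequencies.
print(ld_statistics(p_ab=0.2, p_a=0.2, p_b=0.5))
```

In the first case both D' and r^2 come out at 1; in the second D' is still 1 but r^2 is only 0.25, because the genotypes at the two sites are not perfectly correlated.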

40
Rational SNP Selection
Select SNPs to genotype on the basis of LD
41
LD SNP Selection Example
  • CSF3 in European Americans
  • 5200 bp
  • 17 SNPs

43
LD Site Selection Algorithm
  • Find a minimal set of SNPs for assay, such that each SNP is either
    assayed directly or exceeds the r^2 threshold with an assayed SNP
  • Calculate all pairwise r^2 values
  • Set the r^2 threshold based on power estimates for the study
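One way to approximate this selection is a greedy cover: repeatedly pick the SNP that captures the most SNPs not yet covered. The sketch below is an illustrative greedy heuristic under that assumption, not necessarily the speakers' exact algorithm, and it is not guaranteed to return the true minimum set. The SNP names and r^2 values are invented.

```python
def select_tag_snps(snps, r2, threshold):
    """Greedy binning. r2 maps frozenset({a, b}) to the pairwise r^2 value.
    Returns a list of (tag_snp, snps_captured_by_tag) bins."""
    def linked(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) > threshold

    uncovered = list(snps)
    bins = []
    while uncovered:
        # Choose the SNP that exceeds the threshold with the most uncovered SNPs.
        tag = max(uncovered, key=lambda s: sum(linked(s, t) for t in uncovered))
        members = [t for t in uncovered if linked(tag, t)]
        bins.append((tag, members))
        uncovered = [t for t in uncovered if t not in members]
    return bins

# Hypothetical pairwise r^2 values for five SNPs.
example_r2 = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s1", "s3")): 0.8,
    frozenset(("s2", "s3")): 0.7,
    frozenset(("s4", "s5")): 0.5,
}
for tag, members in select_tag_snps(["s1", "s2", "s3", "s4", "s5"], example_r2, 0.64):
    print(tag, members)
```

Here three SNPs need to be assayed (s1, s4, s5): s1 stands in for s2 and s3, while s4 and s5 fall below the threshold with everything else and must each be typed directly.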

45
CSF3 Site Selection
  • Threshold LD: r^2 > 0.64
  • Bin 1: 4 sites
  • Bin 2: 4 sites
  • Bin 3: 2 sites
  • Genotype 1 SNP from each bin, chosen for
    biological intuition or ease of assay design

46
Power and LD
  • Given
  • All common SNPs described
  • Patterns of LD between common SNPs are known
  • Select SNPs such that every SNP is either
  • Directly assayed
  • Associated with an assayed SNP
  • Test for disease associations with assayed SNPs
  • Power to detect disease associations at unassayed SNPs depends on
    r^2 between assayed and unassayed SNPs
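A commonly cited rule of thumb, which is not stated on the slides, is that detecting an association through a proxy SNP requires roughly 1/r^2 times as many samples as assaying the causal SNP directly. A minimal sketch under that assumption (the function name and sample sizes are hypothetical):

```python
import math

def required_sample_size(n_direct, r2):
    """Approximate sample size needed at a proxy SNP to match the power of
    n_direct samples typed at the causal SNP itself (1/r^2 rule of thumb)."""
    return math.ceil(n_direct / r2)

print(required_sample_size(1000, 1.0))   # 1000
print(required_sample_size(1000, 0.64))  # 1563
print(required_sample_size(1000, 0.5))   # 2000
```

This is one way to turn a power target into an r^2 threshold: the lower the threshold, the fewer SNPs need to be typed, but the more samples are needed to keep the same power.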

47
LD Selection and Haplotype
  • LD selected SNPs provide the highest possible
    haplotype diversity for a given number of SNPs
    assayed
  • LD selection is robust to recombination and
    hotspot structure
  • LD selection is sensitive to population
    stratification

48
SNP Selection Summary
  • It is possible to test all common variants in a candidate gene
    directly for risk association (main effects), so that negative
    (null) results are meaningful
  • Caveat: higher-order risks are unaddressed
  • Haplotype (G x G effects within a locus)
  • Epistasis (G x G effects between loci)
  • Environment (G x E effects)

49
SNP Discovery and Genotyping Workshop
  • SNP discovery strategies (Debbie Nickerson)
  • Identifying SNPs by association for genotype-phenotype analysis of
    candidate genes (Chris Carlson)
  • Identifying haplotypes for genotype-phenotype analysis of candidate
    genes (Dana Crawford)
  • SNP genotyping strategies (Debbie Nickerson)

50
Identifying Haplotypes for Genotype-Phenotype Analysis
Dana C. Crawford - dcrawfo_at_gs.washington.edu
51
Outline of discussion
  • Constructing or inferring haplotypes
  • Haplotype tools available in PGA
  • Description of haplotypes in SeattleSNPs genes
  • Use of VH1 tool to visually inspect
  • Haplotype blocks
  • Haplotype diversity
  • Hotspots of recombination
  • Summary of SeattleSNPs haplotype data

52
What is a Diplotype ?
  • Humans are diploid
  • At each SNP there are two alleles, which are
    observed as a genotype
  • At each gene there are two haplotypes, which are
    observed as a multi-site genotype, or diplotype
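The reason haplotypes must be constructed or inferred is that a diplotype with more than one heterozygous site is ambiguous: k heterozygous sites are compatible with 2^(k-1) different haplotype pairs. The sketch below enumerates them; the function name and the genotype encoding (one two-letter string per SNP) are illustrative choices.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """genotype: one two-letter string per SNP, e.g. ["CT", "AG"].
    Returns the set of unordered haplotype pairs consistent with it."""
    het_sites = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for site, flip in zip(het_sites, choice):
            if flip:  # swap which chromosome carries which allele
                h1[site], h2[site] = h2[site], h1[site]
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

# A double heterozygote (C/T at SNP 1, A/G at SNP 2) has two possible phasings.
for pair in compatible_haplotype_pairs(["CT", "AG"]):
    print(sorted(pair))
```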

53
What is a Haplotype?
A unique combination of genetic markers present in a chromosome
(Hartl & Clark, 1997, p. 57)
VH1 haplotype visualization tool
54
How Do You Construct Haplotypes?
1. Collect extended family members
55
How Do You Construct Haplotypes?
2. Go from diploid to haploid via somatic cell
hybrids
e.g. Patil et al 2001
56
How Do You Construct Haplotypes?
3. Allele-specific PCR
SNP 1
SNP 2
C/T
A/G
57
How Do You Construct Haplotypes?
  • Statistical inference
  • Clark Algorithm
  • EM (Arlequin)
  • Partition Ligation (HAPLOTYPER)
  • PHASE

58
Clark Algorithm
  • Find unambiguous haplotypes
  • Homozygotes
  • Single Heterozygotes

59
Clark Algorithm
  • Find ambiguous diplotypes formed from two
    unambiguous genotypes

60
Clark Algorithm
  • Find ambiguous diplotypes formed from one
    unambiguous genotype and one new genotype

61
Clark Algorithm
  • Iterate until either all haplotypes resolve, or
    ambiguous haplotypes are inconsistent with any
    inferred haplotype
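The four steps above can be sketched in a few lines. This is a simplified illustration of Clark-style inference, not the published implementation: genotypes are encoded per site as 0 (homozygous for allele 0), 1 (heterozygous), or 2 (homozygous for allele 1), haplotypes are tuples of 0/1, and the first compatible known haplotype is always used. All names are hypothetical.

```python
def unambiguous_pair(g):
    """Haplotype pair for a genotype with at most one heterozygous site, else None."""
    if sum(1 for x in g if x == 1) > 1:
        return None
    h1 = tuple(x // 2 for x in g)                      # het site gets allele 0
    h2 = tuple(1 if x == 1 else x // 2 for x in g)     # het site gets allele 1
    return h1, h2

def complement(g, h):
    """Haplotype that pairs with known haplotype h to give genotype g, or None."""
    other = []
    for x, allele in zip(g, h):
        if x == 1:
            other.append(1 - allele)
        elif x // 2 == allele:
            other.append(allele)
        else:
            return None                                # h is incompatible with g
    return tuple(other)

def clark_phase(genotypes):
    known, resolved = [], {}
    for i, g in enumerate(genotypes):                  # step 1: unambiguous individuals
        pair = unambiguous_pair(g)
        if pair:
            resolved[i] = pair
            known.extend(h for h in pair if h not in known)
    changed = True
    while changed:                                     # steps 2-4: iterate to a fixed point
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in list(known):
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)            # newly inferred haplotype
                    changed = True
                    break
    return resolved, known

example = [(0, 0, 0), (2, 2, 0), (1, 1, 0), (1, 1, 1)]
resolved, known = clark_phase(example)
print(resolved[2])  # ((0, 0, 0), (1, 1, 0))
print(resolved[3])  # ((0, 0, 0), (1, 1, 1))
```

Individuals whose genotype matches no known haplotype stay unresolved, which is the stopping condition described on the last slide. The result can also depend on the order in which individuals and haplotypes are tried.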

62
Haplotype Algorithm Comparison
  • Clark: intuitive; fast
  • EM: complete solution; slightly more accurate than Clark; robust to
    ambiguity
  • PHASE: complete solution; slightly more accurate than EM; slow
    (version 2 is faster)
  • Haplotyper (Ligation): fast; better than Clark; less accurate than
    EM or PHASE

63
Haplotype Tools in the PGA
  • InnateImmunity
  • 25 genes re-sequenced in innate immunity
    pathway
  • 4 populations: European-Americans, African-Americans, Hispanics,
    Asthmatics
  • PHASE and Haplotyper results posted on website

http://innateimmunity.net
64
Haplotype Tools in the PGA
  • SeattleSNPs
  • 120 genes re-sequenced in inflammation response
  • 2 populations: European- and African-Americans
  • PHASE results posted on website
  • Interactive tool (VH1) to visualize and sort
    haplotypes

http://pga.gs.washington.edu
65
Distribution of Haplotypes in 100 SeattleSNPs
Genes
AD
ED
66
Common Haplotypes in 100 SeattleSNPs Genes (Frequency >5%)
                >5% MAF
Population   Average   Range
ED           4.54      1 - 8
AD           4.99      0 - 11
67
Haplotype Sharing Between Populations in 100
SeattleSNPs Genes
68
Number of Haplotypes From Two Different
Discovery Strategies
69
Haplotype Structures Are Similar Across
Discovery Strategies
FGB African-Americans
29 SNPs >5%
70
But, Not For All Genes
F10 African-Americans
48 SNPs >5%
71
Are Blocks Preserved Using Different Discovery
Strategies?
Fewer blocks with fewer SNPs/kb
72
Using Visualization Tools (VH1) To Identify
Haplotype Blocks
73
Using VH1 to Identify Highly Divergent Haplotypes
  • Some haplotypes are highly divergent
  • More likely to have functional consequences?
  • Mixed Blessing
  • Easier to detect
  • Harder to dissect

74
Using Haplotypes To Identify Hotspots
Of Recombination
CD36 haplotypes, sorted by sample
75
Linkage Disequilibrium and Hotspots
Associated Sites
CD36
76
Detection of Recombination Hotspots In Candidate
Genes
  • HOTSPOTTER
  • Developed by Na Li and Matthew Stephens
  • Multilocus model for LD
  • Does not rely on block-like patterns
  • Relates LD to underlying recombination process
  • Incorporated into new version of PHASE (v2.0)

students.washington.edu/lina/software/
77
CD36 combined population
78
CD36 AD and ED populations
79
HOTSPOTTER Preliminary Results
15 out of 100 genes have evidence of a hotspot
AGTR1 APOB CD36 IL1B IL21R IL4 NOS3 PLAUR
PON1 SERPIN45 SELP SFPA2 SFTPB VCAM1 VEGF
80
SeattleSNPs Haplotype Summary
  • More haplotypes per gene than previously
    described
  • <50% of African-American chromosomes are represented by common
    shared haplotypes
  • Block structure is preserved across discovery strategies for only
    a fraction of the genes
  • Evidence for hotspots of recombination in human
    genes

81
SNP Discovery and Genotyping Workshop
  • SNP discovery strategies (Debbie Nickerson)
  • Identifying SNPs by association for genotype-phenotype analysis of
    candidate genes (Chris Carlson)
  • Identifying haplotypes for genotype-phenotype analysis of candidate
    genes (Dana Crawford)
  • SNP genotyping strategies (Debbie Nickerson)

82
Ideals for SNP Genotyping
  • High Sensitivity - PCR but moving towards direct genomic DNA
    detection
  • High Specificity - Accurate
  • Simple process - Easy to automate - High
    Throughput
  • Multiplexing - Perform many assays at once -
    decrease costs
  • Cheap

83
SNP Genotyping
[Figure: probe and target shown for the C allele and the T allele; each
method gives a signal with the matched probe and none with the mis-matched
probe]
Method                          Matched          Mis-Matched
Allele-Specific Hybridization   Hybridize        Fail to hybridize
Polymerase Extension (ddCTP)    C incorporated   C fails to incorporate
Oligonucleotide Ligation        Ligate           Fail to ligate
Invader                         Cleave           Fail to cleave
Taqman                          Degrade          Fail to degrade
Allele-Specific PCR             Amplify          Fail to amplify
84
SNP Typing Formats
  • Microtiter plates, fluorescence (e.g. Taqman): good for a few markers
    and lots of samples; PCR
  • e.g. Sequenom or SNaPshot: moderate multiplexing, reducing costs
  • e.g. Affymetrix, Illumina or ParAllele: highly multiplexed, high
    throughput; genotype directly on genomic DNA
85
Taqman
Genotyping with fluorescence-based homogeneous assays (single-tube assay)
[Figure labels: A allele, G allele, Quencher, Reporter]
86
Genotype Calling - Cluster Analysis
87
Genotyping by Mass Spectrometry
Multiplex 5 SNPs
88
Comparative Genotyping in Populations
Pooled DNA -> PCR -> Quantitative Assay -> Estimate Allele Frequency
(shown for each of two populations)
89
Pooled Genotyping
  • Advantages: speed, cost
  • Major disadvantages: loss of haplotype information; loss of
    stratification by phenotype or environmental factors
90
SNP Genotyping
Custom SNP Genotyping Chips
91
Multiplexed Genotyping - Universal Tag Readouts
[Figure: locus 1 and locus 2 specific sequences, each joined to a tag
sequence (Tag1, Tag2); the tags pair with complementary cTag1 and cTag2
sequences attached to a substrate (bead or chip); Tag 1 to Tag 4 are read
out on a chip array or bead array, with G, A, C, T base labels]
  • Multiplex 1,000 SNPs
  • Not dependent on primary PCR
  • ParAllele, Illumina
92
Illumina Genotyping - Gap Ligation
93
1,000 SNPs Assayed on 96 Samples
94
SNP Genotyping
  • Lots of systems: still costly but dropping
  • Offering moderate to high throughputs
  • Systems vary in price
  • Laboratory Information Management Systems are key: track samples,
    assays, completion rate, and reproducibility/error analysis