Title: An Introduction to Sequence Variation
1. An Introduction to Sequence Variation
- Chris Lee
- Dept. of Chemistry & Biochemistry
- UCLA
2. Types of Polymorphism
- Single nucleotide polymorphisms (SNPs) constitute
about 90% of polymorphisms.
- Insertions, deletions.
- Microsatellite repeats: a locus where different
numbers of copies of a short repeat sequence are
found in different people.
- Gross genetic losses or rearrangements.
3. Large-scale Polymorphism
4. Single Nucleotide Polymorphisms
- Each person differs at about 1 in 1000 letters.
- SNPs are responsible for human individuality!
- Some SNPs cause human diseases (e.g. cancer,
cystic fibrosis, Alzheimer's).
- Enormous efforts have been made to identify
specific mutations that cause disease.
5. Single Nucleotide Polymorphism
6. Mutation can occur as easily as the loss of a
single chemical group from one nucleotide base,
e.g. the amino group of cytosine.
8. Creating a Mutation
9. Genomic Density of SNPs
- Comparing two random chromosomes, expect one SNP per 1000
bp.
- Comparing 40 people (2 chromosomes each), expect
17 million SNPs in the complete human genome (3
billion bp).
- In coding regions (5% of the genome) expect 500,000
cSNPs, perhaps 6 per gene.
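As a rough check of these figures, the sketch below applies Watterson's neutral-model expectation for the number of variable sites in a sample. This is an assumption: the slides do not say how the 17 million figure was derived, and the function names are illustrative only. With one difference per 1000 bp between two chromosomes, it gives roughly 15 million variable sites for 80 chromosomes, the same order as the number quoted above.

```python
def harmonic_sum(n_chromosomes):
    """a_n = 1 + 1/2 + ... + 1/(n-1), the sample-size factor in Watterson's formula."""
    return sum(1.0 / i for i in range(1, n_chromosomes))

def expected_snps(per_bp_diversity, genome_bp, n_chromosomes):
    """Expected number of variable sites under a neutral model (an assumption)."""
    return per_bp_diversity * genome_bp * harmonic_sum(n_chromosomes)

# Figures taken from the slide.
genome_bp = 3_000_000_000
pairwise_rate = 1 / 1000        # one SNP per 1000 bp between two chromosomes
n_chromosomes = 40 * 2          # 40 people, 2 chromosomes each

total = expected_snps(pairwise_rate, genome_bp, n_chromosomes)
genes_implied = 500_000 / 6     # 500,000 cSNPs at about 6 per gene

print(f"expected SNPs: {total / 1e6:.1f} million")
print(f"genes implied by 6 cSNPs per gene: {genes_implied:,.0f}")
```

The implied gene count of roughly 83,000 is in line with the gene numbers used elsewhere in these slides.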
10. SNPs: a detailed record of human genetic history
- Each SNP is typically a single mutation event
that occurred in a context of certain
pre-existing SNPs.
- As time passes, this context is gradually lost due
to recombination.
SNP C initially created linked to SNPs ABDEF...
A B C D E F
time
Island of linkage shrinks...
B C D
11. A record of the origins, migrations, and mixing
of the world's peoples
- The size of the island of strong linkage around
a SNP indicates its age (small = old).
- The SNPs it is linked to give a genetic
fingerprint of the original person it came from.
- In principle, each SNP can be used to track all
of that person's descendants.
- Each person has 300,000 common SNPs, a very rich
record of their genetic history.
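The shrinking island can be illustrated with the textbook result that linkage disequilibrium between two sites decays by a factor (1 - r) per generation, where r is the recombination fraction between them. The sketch below is only an illustration: the rate of 1e-8 recombinations per bp per generation (about 1 cM per Mb) is an assumed genome-wide average, and the function names are hypothetical.

```python
def linkage_remaining(r, generations):
    """Fraction of the initial linkage disequilibrium left after t generations: (1 - r)^t."""
    return (1.0 - r) ** generations

def island_half_width_bp(generations, keep=0.5, rate_per_bp=1e-8):
    """Distance (bp) from the SNP at which a fraction `keep` of the linkage survives.

    Assumes the recombination fraction grows linearly with distance,
    which is reasonable only for short distances.
    """
    r = 1.0 - keep ** (1.0 / generations)
    return r / rate_per_bp

for t in (100, 1000, 10000):
    print(f"{t:>6} generations: island half-width ~ {island_half_width_bp(t):,.0f} bp")
```

Older SNPs come out with smaller islands, which is the "small = old" relationship described above.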
12. SNPs in the lipoprotein lipase (LPL) gene
13. SNP genotypes in 71 individuals in the LPL
gene: heterozygote (X/y), homozygote (y/y)
14. SNP Allele Frequency
15. SNP Haplotypes reconstructed from LPL genotype
data
16. SNP Linkage Disequilibrium
17. The Hunt for Disease Genes
- Currently, finding a disease gene can take years
because there are very few markers, forcing
researchers to search dozens of genes.
- SNPs are a powerful tool for discovering genes
that cause disease: with a SNP in every gene, one
could directly map a disease to a single gene.
18. Mapping Disease Genes
[Figure labels: chromosome, microsatellite, SNPs, genes, disease gene]
- Look for genetic linkage of disease to a marker.
- Microsatellite markers are too widely spaced to
get to the individual gene level.
- There are common SNPs in every gene.
19. Identification of SNPs
- In 1998, Wang et al. reported 3000 SNPs.
- Currently, about 200,000 SNPs have been identified
in total by experiment (in public databases).
- A pharmaceutical industry SNP consortium has been
formed to fund identification of 300,000 SNPs to
be shared publicly.
20. SNPs for Pharmacogenomics
- Differences in efficacy and side effects from
person to person can be a big problem for drug
clinical trials and approval.
- If SNPs that correlate with these differences can
be identified, the clinical trial could be
limited to patients in whom efficacy is likely to
be best, with the fewest side effects.
- Prospective patients for the drug would then also
have to be tested for these SNPs.
21. "Single-nucleotide polymorphism in the human mu
opioid receptor gene alters β-endorphin binding
and activity: Possible implications for opiate
addiction." Bond et al., PNAS 95:9608.
The mu opioid receptor is the primary site of
action for the most commonly used opioids,
including morphine, heroin, fentanyl, and
methadone. The A118G variant receptor binds
β-endorphin, an endogenous opioid that activates
the mu opioid receptor, approximately three times
more tightly than the most common allelic form of
the receptor. Furthermore, β-endorphin is
approximately three times more potent at the
A118G variant receptor than at the most common
allelic form in agonist-induced activation of G
protein-coupled potassium channels.
22. Comprehensive EST Analysis of Single Nucleotide
Polymorphism in the Human Genome
- Chris Lee
- Dept. of Chemistry & Biochemistry
- UCLA
23. Targeting Functional Polymorphism via Expressed
Sequences
- Only 5% of the human genome corresponds to
genes coding for functional protein.
- Look for functional SNPs by targeting these gene
sequence regions.
- Genes are expressed by transcription into mRNA,
which is spliced, poly-adenylated and
translated.
- Purify polyA-mRNA, make cDNA, sequence.
24. SNP Detection from ESTs
- 1.4 million Expressed Sequence Tag (EST)
sequences, 300-500 bp, from 950 people.
- How to put together all the ESTs from the same
gene, without mixing up related genes?
- How to distinguish sequencing errors (very
common) from genuine Single Nucleotide
Polymorphisms?
25. SNP Detection Approaches
- Experimentally: random sampling of DNA. Very
expensive, slow.
- Computationally: find SNPs from existing
experimental data. Sort out real SNPs from
experimental sequencing errors. Difficult
statistical and computational problems.
- This experimental data was sitting around for
years...
26. Distinguishing SNPs from Sequencing Errors
[Figure: candidate SNPs labeled A and T]
The frequency and pattern in which a polymorphism
is observed must rise above the rate of
background, random error. Single-pass read
sequences contain many errors which complicate
the reliable detection of SNPs. There are
miscalls (N), and frequent letter duplications /
losses in runs (repeats of a single letter).
These non-uniform error rates are critical in
assessing the statistical significance of
candidate SNPs like A (not in a run) vs. T
(problematic because it involves a GG run).
27. How to address this?
- Adopt a rigorous statistical approach based on
measured frequencies from very large data sets.
- Bayesian inference: carefully separate
observations from the hidden states you want to make
inferences about.
- Integrate out all assumptions by considering
all possible values of the assumptions.
- Explicitly measure the degree of uncertainty in the
predictions due to poor data or ambiguity.
28. Odds ratio: SNP model vs. sequencing error model
Consider both models: are the observations more
consistent with a SNP or with sequencing error?
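One conventional way to write this comparison is as a log-odds (LOD) score. The notation here is assumed for illustration, since the slide's own formula is not in the transcript; $\mathbf{O}$ stands for the full set of observations.

```latex
\mathrm{LOD} = \log_{10} \frac{p(\mathbf{O} \mid \text{SNP model})}{p(\mathbf{O} \mid \text{error model})}
```

A LOD of 3 then means the observations are 1000 times more probable under the SNP model than under the error-only model.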
29. Error Model: treat the true gene sequence as unknown
- Treat all sequences T as equally likely (before
you consider the actual observations, i.e. chromatograms).
- Sum the error model probability over all possible T.
30. SNP Model
- Rather than summing the SNP model probability over
all possible T, T', calculate the probability for
a specific SNP T' in a specific consensus T.
31. Sequencing error model
Treat individual observed sequences i as
independent; treat the alignment (what errors
occurred) as uncertain.
Treat the true gene sequence T as uncertain: sum over
all possible T.
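A minimal sketch of the two models for a single alignment column may make the comparison concrete. It is not the authors' implementation. It assumes the alignment is fixed (the slides treat it as uncertain), that a miscall is equally likely to produce any of the other three bases, and that the variant allele frequency `freq` is a fixed illustrative value. All function names are hypothetical. Because no prior over T is applied to the SNP model here, only relative scores are meaningful.

```python
import math

BASES = "ACGT"

def phred_to_error(q):
    """Convert a PHRED quality score to an error probability."""
    return 10 ** (-q / 10)

def read_prob(observed, true_base, err):
    """P(observed base | true base), with errors spread evenly over the other 3 bases."""
    return 1 - err if observed == true_base else err / 3

def error_model(reads):
    """Error-only model: sum over all possible true bases T, each equally likely a priori."""
    total = 0.0
    for t in BASES:
        p = 1.0
        for base, qual in reads:          # reads are treated as independent
            p *= read_prob(base, t, phred_to_error(qual))
        total += p / len(BASES)
    return total

def snp_model(reads, consensus, variant, freq):
    """One specific SNP: each read comes from the consensus or the variant allele."""
    p = 1.0
    for base, qual in reads:
        err = phred_to_error(qual)
        p *= ((1 - freq) * read_prob(base, consensus, err)
              + freq * read_prob(base, variant, err))
    return p

def lod(reads, consensus, variant, freq=0.1):
    """log10 odds of the SNP model against the error-only model."""
    return math.log10(snp_model(reads, consensus, variant, freq) / error_model(reads))

strong = [("C", 20)] * 4 + [("T", 20)] * 4   # variant seen 4 times at good quality
weak = [("C", 20)] * 7 + [("T", 5)]          # variant seen once at poor quality
print(lod(strong, "C", "T"), lod(weak, "C", "T"))
```

The repeated, high-quality variant scores far higher than the single low-quality one, which is the behaviour the odds ratio is meant to capture.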
32. Hidden Markov Model Discrimination of SNP vs.
Error
The match states (M) of a profile are the
equivalent of the true population sequence, and
deletion (D), insertion (I) and emission
probabilities are set to be the observed
frequencies of sequencing errors conditioned on
local sequence context. The sum probability for
the SNP model, vs. the sum probability for the
error-only model, yields an odds-ratio for the
SNP.
33. To assess a putative SNP, consider all alternative
possibilities
- Sequencing error: calculate odds ratio, SNP vs.
error. Use PHRED score, local context.
- Orientation errors: ESTs reported backwards?
- Chimeras, mixed clusters: ESTs may not be
properly clustered. Some ESTs chimeric?
- Alignments: all possible ways the EST could have been
emitted from the true sequence T.
- True sequence: all possible T for the gene.
34. SNP Model: local allele frequency q_z in one
person
z = 0, 1, 2; q_z = z/N, where N = 2 typically.
Assuming Hardy-Weinberg equilibrium.
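Under Hardy-Weinberg equilibrium, the probability that a person carries $z$ copies of the variant allele is binomial in the population allele frequency $q$. This is the standard textbook form, written here for the diploid case $N = 2$ with local allele frequency $q_z = z/N$; the slide's own equation is not in the transcript.

```latex
p(z \mid q) = \binom{2}{z}\, q^{z} (1 - q)^{2 - z}, \qquad z \in \{0, 1, 2\}
```

That is, $(1-q)^2$, $2q(1-q)$ and $q^2$ for $z = 0, 1, 2$ respectively.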
35. Use library information: which sequences are from
the same person!
Combine observations from all libraries L, and
treat the population allele frequency q as uncertain
(so integrate over q in (0,1)).
36. Posterior probability for population allele
frequency q
Gives posterior distribution for q, taking into
account all error rates in the observations,
amount of sequence and library availability,
ambiguities in the sequence, etc.
37. 6 SNP observations from one library
38. 6 SNP observations scattered over all libraries
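The contrast between these two cases can be sketched with a small grid calculation. This is an illustrative toy model, not the authors' code. It assumes each library comes from one diploid person in Hardy-Weinberg proportions, a flat prior on q, a fixed per-read error rate of 1%, and made-up read counts; all names are hypothetical.

```python
def genotype_probs(q):
    """Hardy-Weinberg probabilities of carrying z = 0, 1, 2 variant copies."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

def library_likelihood(n_variant, n_reference, q, err=0.01):
    """P(reads from one library | q), summing over the donor's hidden genotype z."""
    total = 0.0
    for z, p_z in enumerate(genotype_probs(q)):
        q_z = z / 2                                   # local allele frequency in this person
        p_variant_read = q_z * (1 - err) + (1 - q_z) * err
        total += p_z * p_variant_read ** n_variant * (1 - p_variant_read) ** n_reference
    return total

def posterior_mean(libraries, grid_size=1001):
    """Posterior mean of q on a grid, with a uniform prior on (0, 1)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for q in grid:
        w = 1.0
        for n_variant, n_reference in libraries:
            w *= library_likelihood(n_variant, n_reference, q)
        weights.append(w)
    norm = sum(weights)
    return sum(q * w for q, w in zip(grid, weights)) / norm

background = [(0, 1)] * 20                 # 20 libraries showing only the reference base
one_library = [(6, 0)] + background        # all 6 variant reads from a single person
scattered = [(1, 0)] * 6 + background      # 6 variant reads from 6 different people

print(posterior_mean(one_library), posterior_mean(scattered))
```

Under these assumptions the scattered observations support a higher population frequency than the same number of reads from one library, because six reads from one person reveal at most two variant chromosomes.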
39. Alignment Accuracy Challenges
- Automatic multiple sequence alignment of 1000
sequences is problematic.
- Alignment accuracy is much more of a problem for
SNP detection than for simply getting the right
consensus. Consensus merely requires that the
majority be aligned, whereas even a single
alignment error can result in an incorrect SNP
prediction.
40. Sequencing Error Analysis
- We have produced a dataset of 400,000,000 bp
where we have a reliable consensus, and therefore
can identify all the sequencing errors. This
could provide corrected EST sequences, or
alternatively consensus, assembled gene sequences
for a large fraction of human genes.
- This also provides detailed statistics on the
frequency of different types of sequencing
errors, which show a startling variation
depending on local sequence context. Background
error rates of 0.3% substitution, 0.3% insertion,
0.7% deletion, rise dramatically
41. Example SNP: GGA C/T CAA
Cluster AA702884: C vs. T polymorphism. Novel SNP,
not previously identified.
42. Automated SNP Detection
Input: Unigene, 1,400,000 human ESTs, 300-500 bp
long
Word-frequency-based overlap and orientation
detection
Try all possible orientations. Don't trust
Unigene!
Many errors in the reported data, e.g. reversals,
in the majority of clusters!
Reorient ESTs: catch reversals, place in 5' -> 3'
orientation
EST alignment accuracy: predict gene consensus,
SNPs
10-5000 ESTs per gene, 80,000 genes, 500-5000 bp
long
Statistical assessment of candidate SNPs
>50,000 believable SNPs hidden among >10,000,000
sequencing errors.
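The reorientation step can be illustrated with a small word-sharing sketch. This is a simplified stand-in for the word-frequency method named above, not the authors' pipeline. It assumes a reference sequence for the cluster is already available, and the function names and word length k = 8 are hypothetical choices.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def kmers(seq, k=8):
    """Set of all length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def orient(est, reference, k=8):
    """Return the EST in whichever orientation shares more words with the reference."""
    ref_words = kmers(reference, k)
    forward = len(kmers(est, k) & ref_words)
    reverse = len(kmers(reverse_complement(est), k) & ref_words)
    return est if forward >= reverse else reverse_complement(est)

reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
est = reference[5:25]
flipped = reverse_complement(est)          # simulate an EST reported backwards
print(orient(flipped, reference) == est)
```

Trying both orientations for every EST, rather than trusting the reported strand, is what lets reversals be caught before alignment.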
43. Sequence Alignment
45. Current Status and Results
- 400,000,000 bp aligned with reliable consensus.
- 83,000 consensus gene sequences produced.
- 20,000 show significant homology to known
proteins, almost all in the expected orientation.
- 75,000 SNPs above a LOD score of 3.
- 30,000 SNPs above a LOD score of 6.
- Current estimate: 60,000 high-frequency SNPs.
47. Chromatographic Evidence
[Figure: chromatograms for two ESTs. HsS785496 (zu42c08.r1) reads G G T G G T C (G allele); HsS1065649 (oz03ho7.x1) reads G G T G A T C (A allele).]
48. RFLP Detection of SNPs
49. Verified 56 of 79 SNPs tested so far
50. Verification Test: Whitehead cSNPs
- The Whitehead Institute has systematically searched
for SNPs in 106 genes, using 20 Europeans, 10
Africans, 10 Asians.
- On 54 genes, our predicted cSNPs (score > 3) are
verified by their results at a 70% rate. The
Whitehead set may be incomplete.
52. Validation Test: HLA-A
- HLA polymorphism has been studied very
extensively for the general population, providing
a gold standard for all true positives.
- 140 distinct HLA-A allele sequences are available
from the Anthony Nolan Foundation database.
- Are any of our predicted HLA-A SNPs not
independently verified by this data?
54. HLA-A: 89% Verification Rate
- Of a total of 108 SNPs we predicted in the coding
region of HLA-A, 96 are independently validated
by the known HLA-A allele sequences, and 12 are
not.
- By comparison, the NCI CGAP project (based on the
same EST data) predicts just 10 SNPs in HLA-A
(>90% false negatives!)
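The verification rates quoted in these slides come from fairly small samples, so it can help to attach an interval to them. The sketch below uses the Wilson score interval, which is our choice for illustration and is not mentioned in the slides; the counts are the ones given above.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# (label, verified, tested) as reported in the slides
tests = [("56 of 79 tested", 56, 79), ("HLA-A coding region", 96, 108)]
for label, verified, tested in tests:
    low, high = wilson_interval(verified, tested)
    print(f"{label}: {100 * verified / tested:.0f}% "
          f"(interval {100 * low:.0f}% to {100 * high:.0f}%)")
```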
55. Mass Spectrometry Validation
- SNPs change the mass of a DNA fragment.
- Sequenom Inc. has tested more than 1000 of our
SNPs using mass spectrometry of pooled DNA
samples.
- 80% were detectably polymorphic in samples of 90
people.
56. Bioinformatics: Key to SNPs
57. EST-based SNP detection is similar in reliability
to experimental methods
58. Application to Disease Gene Mapping
- How do SNPs compare with traditional marker sets
used for disease gene mapping projects?
- Density: how dense is the marker set, when mapped
onto the human genome?
- Ideal: at least one marker per gene (strong
linkage disequilibrium within 3 kb).
- Ideal: high heterozygosity for good statistics.
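Heterozygosity can be made concrete with the standard expected-heterozygosity formula, H = 1 - sum(p_i^2), where p_i are the allele frequencies at a marker. The allele frequencies below are hypothetical values chosen only to illustrate the trade-off between biallelic SNPs and multi-allelic microsatellites.

```python
def expected_heterozygosity(allele_freqs):
    """Probability that two randomly drawn chromosomes carry different alleles."""
    return 1.0 - sum(p * p for p in allele_freqs)

snp_common = expected_heterozygosity([0.5, 0.5])      # best case for a biallelic SNP
snp_rare = expected_heterozygosity([0.95, 0.05])      # a low-frequency SNP
microsat = expected_heterozygosity([0.1] * 10)        # microsatellite with 10 equal alleles

print(snp_common, snp_rare, microsat)
```

A single SNP can never exceed 0.5, so SNP marker sets make up in density what each marker lacks in heterozygosity.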
59. [Figure: SNP map of a 1.4 MB region from 22q13.1 on chromosome 22 (contig NT_001454, 14.6 MB, spanning 22q11.2 - q13.3), with axis ticks from 0.1 to 1.4 and numbered positions 1-14. Labeled genes: EIF3S7 (261/106), PVALB (66/17), NCF4 (2/1), CSF2RB (29/14), TST (118/56), IL2RB (1/1), RAC2 (67/35), MFNG (123/37), MSE55 (26/16), plus anonymous Unigene clusters (Hs. identifiers). AFM markers shown: AFM164ze3, AFM273vd9, AFMa046za5, AFM261ye5.]
60. Mapping Test: positionally cloned genes
- Positionally cloned genes represent a (somewhat)
random sampling of genes.
- They are examples of actual disease-gene mapping
targets that typically took years of linkage
analysis and chromosome walking to find.
- How good is the coverage and heterozygosity of
our SNP marker set for these genes?
62. SNP validation tests: β globin
- β globin polymorphism has been studied
intensively, identifying 100s of substitutions.
- Verify predicted SNPs against known mutations.
- We detect 21 SNPs in β globin, 17 within exons.
63. SNPs highly biased towards third codon position
64. SNPs Biased towards Silent or Conservative
Substitutions
66. SNPs detect three disease alleles
- Mutations previously identified as causing
disease, catalogued by Online Mendelian
Inheritance in Man.
- The only two non-conservative amino acid
substitutions detected.
- All three at the α-β chain interface.
67. Verified SNP: Hb Tacoma disrupts α-β interface
68. His 77 → Gln: exposed, unlikely to disrupt
stability
73. What are the most polymorphic genes in the Human
Genome?
- Very large differences in polymorphism levels in
different genes.
- Maintaining high levels of diversity (large
numbers of alleles) may indicate a selective
pressure.
- What can we learn from patterns of polymorphism?
- Why are some genes so polymorphic?
74. The Most Polymorphic Genes: Five Classes
- Direct interactions with pathogens.
- Very highly expressed genes.
- Genes involved in tumorigenesis and
survival/growth of tumors.
- Viral- and transposon-derived sequences.
- Large families of highly similar genes?
75. Acknowledgements
- Christopher Lee: K. Irizarry, B. Modrek, C.
Grasso
- Wing Wong (Statistics): C. Li
- Stan Nelson (Human Genetics): V. Kustanovich,
N. Brown