An Introduction to Sequence Variation - PowerPoint PPT Presentation

Provided by: Guan67
Learn more at: http://www.ipam.ucla.edu
Transcript and Presenter's Notes

1
An Introduction to Sequence Variation
  • Chris Lee
  • Dept. of Chemistry & Biochemistry
  • UCLA

2
Types of Polymorphism
  • Single nucleotide polymorphisms (SNPs) constitute
    about 90% of polymorphisms.
  • Insertions, deletions.
  • Microsatellite repeats: a locus where different
    numbers of copies of a short repeat sequence are
    found in different people.
  • Gross genetic losses or rearrangements.

3
Large-scale Polymorphism
4
Single Nucleotide Polymorphisms
  • Each person differs at about 1 in 1000 letters.
  • SNPs are responsible for human individuality!
  • Some SNPs cause human diseases (e.g. cancer,
    cystic fibrosis, Alzheimer's).
  • Enormous efforts have been made to identify
    specific mutations that cause disease.

5
Single Nucleotide Polymorphism
6
Mutation can occur as easily as the loss of a
single chemical group from one nucleotide base,
e.g. the amino group of cytosine.
7
(No Transcript)
8
Creating a Mutation
9
Genomic Density of SNPs
  • Comparing two random chromosomes, expect one SNP
    per 1000 bp.
  • Comparing 40 people (2 chromosomes each), expect
    17 million SNPs in the complete human genome (3
    billion bp).
  • In coding regions (5% of genome) expect 500,000
    cSNPs, perhaps 6 per gene.
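
The transcript does not say how the 17 million figure was derived. One standard way to scale a pairwise difference rate up to a larger sample is Watterson's expectation for the number of segregating sites, E[S] = theta * L * (1 + 1/2 + ... + 1/(n-1)). The sketch below applies it to the slide's numbers, assuming a neutral, constant-size population (an assumption, not something the slide states). It gives roughly 15 million sites for 80 chromosomes, the same order as the slide's 17 million. The function and constant names are illustrative.

```python
def harmonic(n: int) -> float:
    """Return the n-th harmonic number 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_segregating_sites(theta_per_bp: float, length_bp: float,
                               n_chromosomes: int) -> float:
    """Watterson's expectation E[S] = theta * L * H(n - 1).

    Assumes a neutral, constant-size population (not stated on the slide).
    """
    return theta_per_bp * length_bp * harmonic(n_chromosomes - 1)


GENOME_BP = 3e9        # genome size quoted on the slide
THETA = 1.0 / 1000.0   # one difference per 1000 bp between two chromosomes
N_CHROM = 40 * 2       # 40 people, 2 chromosomes each

pairwise_snps = expected_segregating_sites(THETA, GENOME_BP, 2)
genome_snps = expected_segregating_sites(THETA, GENOME_BP, N_CHROM)

# Slide: 500,000 cSNPs at about 6 per gene implies this many genes.
genes_implied = 500_000 / 6

print(f"two chromosomes: {pairwise_snps:,.0f} differences")
print(f"{N_CHROM} chromosomes: {genome_snps:,.0f} segregating sites")
print(f"genes implied by 6 cSNPs per gene: {genes_implied:,.0f}")
```

Note that 5% of 17 million would be 850,000, so the slide's 500,000 cSNPs implies a lower SNP density in coding sequence than in the genome as a whole.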

10
SNPs: a detailed record of human genetic history
  • Each SNP is typically a single mutation event
    that occurred in a context of certain
    pre-existing SNPs.
  • As time passes this context is gradually lost due
    to recombination.

(Diagram: SNP C is initially created linked to SNPs
A, B, D, E, F, giving the block A B C D E F. As time
passes, the island of linkage shrinks to B C D.)
11
A record of the origins, migrations, and mixing
of the world's peoples
  • The size of the island of strong linkage around
    a SNP indicates its age (small = old).
  • The SNPs it is linked to give a genetic
    fingerprint of the person in whom it arose.
  • In principle each SNP can be used to track all
    of that person's descendants.
  • Each person has 300,000 common SNPs-- a very rich
    record of their genetic history.
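
Why a small island means an old SNP can be illustrated with the textbook decay of linkage disequilibrium: for two sites with recombination fraction r per generation, the fraction of the original association remaining after t generations is (1 - r)^t. The sketch below is a minimal illustration under that simple model (random mating, no selection or drift). The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
def ld_fraction_remaining(recomb_rate: float, generations: int) -> float:
    """Fraction of the initial linkage disequilibrium left after
    `generations`, for two sites that recombine with probability
    `recomb_rate` per generation: D_t / D_0 = (1 - r)^t."""
    return (1.0 - recomb_rate) ** generations


def island_half_width_cm(generations: int, threshold: float = 0.5) -> float:
    """Genetic distance, in centimorgans, at which LD has decayed to
    `threshold` of its starting value after `generations`.

    Approximates 1 cM as a 1% recombination fraction, which is
    reasonable only for short distances.
    """
    r = 1.0 - threshold ** (1.0 / generations)
    return 100.0 * r


for t in (100, 1000, 10000):
    print(t, "generations:", round(island_half_width_cm(t), 4), "cM")
```

The distance shrinks roughly in proportion to 1/t, so under this model a tenfold older SNP has an island about ten times smaller.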

12
SNPs in lipoprotein lipase (LPL) gene.
13
SNP genotypes in 71 individuals in the LPL
gene: heterozygote (X/y), homozygote (y/y)
14
SNP Allele Frequency
15
SNP Haplotypes reconstructed from LPL genotype
data
16
SNP Linkage Disequilibrium
17
The Hunt for Disease Genes
  • Currently finding a disease gene can take years,
    because there are very few markers, forcing
    researchers to search dozens of genes.
  • SNPs are a powerful tool for discovering genes
    that cause disease: with a SNP in every gene, one
    could directly map a disease to a single gene.

18
Mapping Disease Genes
(Diagram labels: chromosome, genes, disease gene,
microsatellite, SNPs.)
  • Look for genetic linkage of disease to marker
  • Microsatellite markers are too widely spaced to
    get to the individual gene level.
  • There are common SNPs in every gene.

19
Identification of SNPs
  • In 1998, Wang et al. reported 3000 SNPs.
  • Currently about 200,000 SNPs have been identified
    in total by experiment (in public databases).
  • A pharmaceutical industry SNP consortium has been
    formed to fund identification of 300,000 SNPs to
    be shared publicly.

20
SNPs for Pharmacogenomics
  • Differences in efficacy and side effects from
    person to person can be a big problem for drug
    clinical trials / approval.
  • If SNPs that correlate with these differences can
    be identified, the clinical trial could be
    limited to patients where efficacy is likely to
    be best, with least side effects.
  • These SNPs would then also have to be tested on
    prospective patients for the drug.

21
Single-nucleotide polymorphism in the human mu
opioid receptor gene alters β-endorphin binding
and activity: Possible implications for opiate
addiction. Bond et al., PNAS 95:9608.
The mu opioid receptor is the primary site of
action for the most commonly used opioids,
including morphine, heroin, fentanyl, and
methadone. The A118G variant receptor binds
β-endorphin, an endogenous opioid that activates
the mu opioid receptor, approximately three times
more tightly than the most common allelic form of
the receptor. Furthermore, β-endorphin is
approximately three times more potent at the
A118G variant receptor than at the most common
allelic form in agonist-induced activation of G
protein-coupled potassium channels.
22
Comprehensive EST Analysis of Single Nucleotide
Polymorphism in the Human Genome
  • Chris Lee
  • Dept. of Chemistry & Biochemistry
  • UCLA

23
Targeting Functional Polymorphism via Expressed
Sequences
  • Only 5% of the human genome corresponds to
    genes coding for functional protein.
  • Look for functional SNPs by targeting these gene
    sequence regions.
  • Genes are expressed by transcription into mRNA,
    which is spliced, poly-adenylated and
    translated.
  • Purify polyA-mRNA, make cDNA, sequence.

24
SNP Detection from ESTs
  • 1.4 million Expressed Sequence Tag (EST)
    sequences, 300-500 bp, from 950 people.
  • How to put together all the ESTs from the same
    gene, without mixing up related genes?
  • How to distinguish sequencing errors (very
    common) from genuine Single Nucleotide
    Polymorphisms?

25
SNP Detection Approaches
  • Experimentally: random sampling of DNA. Very
    expensive, slow.
  • Computationally: find SNPs from existing
    experimental data. Sort out real SNPs from
    experimental sequencing errors. Difficult
    statistical and computational problems.
  • This experimental data was sitting around for
    years...

26
Distinguishing SNPs from Sequencing Errors
(Figure labels: candidate SNPs A and T.)
The frequency and pattern in which a polymorphism
is observed must rise above the rate of
background random error. Single-pass read
sequences contain many errors, which complicate
the reliable detection of SNPs. There are
miscalls (N), and frequent letter duplications /
losses in runs (repeats of a single letter).
These non-uniform error rates are critical in
assessing the statistical significance of
candidate SNPs like A (not in a run) vs. T
(problematic because it involves a GG run).
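
As a rough illustration of the "in a run or not" distinction, the sketch below flags a candidate position that lies in, or directly next to, a run of identical letters. It is only a context check, not the statistical model from the talk. The function names, the example sequence, and the minimum run length of 2 are assumptions for illustration.

```python
def run_length_at(seq: str, pos: int) -> int:
    """Length of the run of identical letters that contains seq[pos]."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


def touches_run(seq: str, pos: int, min_run: int = 2) -> bool:
    """True if seq[pos] lies in, or directly next to, a run of at least
    `min_run` identical letters."""
    neighbours = [p for p in (pos - 1, pos, pos + 1) if 0 <= p < len(seq)]
    return any(run_length_at(seq, p) >= min_run for p in neighbours)


example = "ACGGTCA"  # hypothetical read fragment containing a GG run
for position, letter in enumerate(example):
    print(position, letter, touches_run(example, position))
```

A candidate flagged this way would be assessed against the higher, context-specific error rate rather than the background rate.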
27
How to address this?
  • Adopt rigorous statistical approach based on
    measured frequencies from very large data.
  • Bayesian inference: carefully separate
    observations from the hidden states you want to
    make inferences about.
  • Integrate out all assumptions by considering
    all possible values of the assumptions.
  • Explicitly measure degree of uncertainty in the
    predictions due to poor data, ambiguity.

28
Odds ratio: SNP model vs. sequencing error model
Consider both models: are the observations more
consistent with a SNP or a sequencing error?
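
The slide's formula did not survive in the transcript. In generic form, an odds ratio of this kind is usually reported as a log-odds (LOD) score, which matches the LOD thresholds of 3 and 6 quoted later in the talk. The notation below is an assumption, not the author's:

```latex
\mathrm{LOD}
  = \log_{10}
    \frac{P(\text{observations} \mid \text{SNP model})}
         {P(\text{observations} \mid \text{sequencing-error model})}
```

On this scale a LOD of 3 means the observations are 1000 times more probable under the SNP model than under the error-only model.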
29
Error Model: treat true gene sequence as unknown
  • Treat all sequences T as equally likely (before
    you consider the actual observations, i.e. the
    chromatograms).
  • Sum error model probability over all possible T.

30
SNP Model
  • Rather than summing the SNP model probability
    over all possible sequences, calculate the
    probability for one specific SNP in one specific
    consensus T.

31
Sequencing error model
Treat individual observed sequences i as
independent; treat the alignment (what errors
occurred) as uncertain.
Treat the true gene sequence T as uncertain: sum
over all possible T.
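
The equation itself is missing from the transcript. One way to write what the text describes is sketched below, where S_i is the i-th observed sequence and A_i is one possible alignment of it to T. These symbols are assumptions introduced here, and the author's exact formulation may differ.

```latex
P(\text{obs} \mid \text{error model})
  = \sum_{T} P(T) \prod_{i} \sum_{A_i} P(S_i, A_i \mid T)
```

The product over i expresses independence of the observed sequences, the inner sum treats each alignment as uncertain, and the outer sum treats the true sequence T as uncertain.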
32
Hidden Markov Model Discrimination of SNP vs.
Error
The match states (M) of a profile are the
equivalent of the true population sequence, and
deletion (D), insertion (I) and emission
probabilities are set to the observed
frequencies of sequencing errors conditioned on
local sequence context. The summed probability for
the SNP model vs. the summed probability for the
error-only model yields an odds ratio for the
SNP.
33
To assess a putative SNP, consider all alternative
possibilities
  • Sequencing error: calculate odds ratio SNP vs.
    error. Use PHRED score, local context.
  • Orientation errors: ESTs reported backwards?
  • Chimeras, mixed clusters: ESTs may not be
    properly clustered. Some ESTs chimeric?
  • Alignments: all possible ways an EST could have
    been emitted from true sequence T.
  • True sequence: all possible T for the gene.

34
SNP Model: local allele frequency q_z in one
person
z = 0, 1, 2; q_z = z/N, where N = 2 typically
Assuming Hardy-Weinberg
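
Under Hardy-Weinberg equilibrium, the number z of variant copies carried by one person follows a binomial distribution in the population allele frequency q. With N the number of chromosome copies per person and q_z = z/N the local allele frequency, this can be written as follows (standard notation, assumed here since the slide's own equation is not in the transcript):

```latex
P(z \mid q) = \binom{N}{z}\, q^{z} (1-q)^{N-z},
\qquad z = 0, 1, \dots, N,
\qquad q_z = \frac{z}{N}
```

For N = 2 this gives the familiar (1-q)^2, 2q(1-q) and q^2 for z = 0, 1 and 2.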
35
Use library information: which sequences are from
the same person!
Combine observations from all libraries L, and
treat population allele frequency q as uncertain
(so take the integral over q in (0,1)).
36
Posterior probability for population allele
frequency q
Gives posterior distribution for q, taking into
account all error rates in the observations,
amount of sequence and library availability,
ambiguities in the sequence, etc.
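
A heavily simplified sketch of such a posterior follows. It assumes one person per library, Hardy-Weinberg genotype probabilities, a single uniform error rate, a flat prior on q, and a grid in place of the integral. None of these simplifications is claimed to match the author's implementation, and all function names are illustrative.

```python
from math import comb


def hardy_weinberg(z, q, n_copies=2):
    """P(person carries z variant copies | population frequency q)."""
    return comb(n_copies, z) * q ** z * (1.0 - q) ** (n_copies - z)


def library_likelihood(k_alt, n_reads, q, err, n_copies=2):
    """P(k_alt of n_reads show the variant | q) for one library,
    summing over the person's unknown genotype z."""
    total = 0.0
    for z in range(n_copies + 1):
        q_z = z / n_copies  # local allele frequency in this person
        # A read shows the variant if it samples a variant copy and is
        # read correctly, or samples a reference copy and is misread.
        p_alt = q_z * (1.0 - err) + (1.0 - q_z) * err
        reads = (comb(n_reads, k_alt) * p_alt ** k_alt
                 * (1.0 - p_alt) ** (n_reads - k_alt))
        total += hardy_weinberg(z, q, n_copies) * reads
    return total


def posterior_over_q(libraries, err=0.003, grid_size=1000):
    """Grid approximation to the posterior of q under a flat prior.
    `libraries` is a list of (k_alt, n_reads) pairs, one per library.
    The default err is an assumed uniform per-base error rate."""
    grid = [(j + 0.5) / grid_size for j in range(grid_size)]
    weights = []
    for q in grid:
        w = 1.0
        for k_alt, n_reads in libraries:
            w *= library_likelihood(k_alt, n_reads, q, err)
        weights.append(w)
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def posterior_mean(libraries, **kwargs):
    grid, post = posterior_over_q(libraries, **kwargs)
    return sum(q * p for q, p in zip(grid, post))


one_library = [(6, 6)]    # six variant reads, all from one person
scattered = [(1, 1)] * 6  # six variant reads from six different people
print(posterior_mean(one_library), posterior_mean(scattered))
```

The two inputs correspond loosely to the next two slides: six reads from one library sample at most two chromosomes, while six reads spread over six libraries can sample up to twelve, so the same number of observations carries different weight.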
37
6 SNP observations from one library
38
6 SNP observations scattered over all libraries
39
Alignment Accuracy Challenges
  • Automatic Multiple Sequence Alignment of 1000
    sequences is problematic.
  • Alignment accuracy is much more of a problem for
    SNP detection than for simply getting the right
    consensus. Consensus merely requires that the
    majority be aligned, whereas even a single
    alignment error will result in an incorrect SNP
    prediction.

40
Sequencing Error Analysis
  • We have produced a dataset of 400,000,000 bp
    where we have reliable consensus, and therefore
    can identify all the sequencing errors. This
    could provide corrected EST sequences, or
    alternatively consensus, assembled gene sequences
    for a large fraction of human genes.
  • This also provides detailed statistics on the
    frequency of different types of sequencing
    errors, which show a startling variation
    depending on local sequence context. Background
    error rates of 0.3% substitution, 0.3% insertion,
    0.7% deletion, rise dramatically

41
Example SNP: GGA C/T CAA
Cluster AA702884: C vs. T polymorphism. Novel SNP,
not previously identified.
42
Automated SNP Detection
  • Input: Unigene, 1,400,000 human ESTs, 300-500 bp
    long.
  • Word-frequency-based overlap / orientation
    detection: try all possible orientations. Don't
    trust Unigene! Many errors in the reported data,
    e.g. reversals, in the majority of clusters!
  • Reorient ESTs: catch reversals, place in 5' -> 3'
    orientation.
  • EST alignment accuracy: predict gene consensus,
    SNPs. 10-5000 ESTs per gene, 80,000 genes,
    500-5000 bp long.
  • Statistical assessment of candidate SNPs:
    >50,000 believable SNPs hidden among >10,000,000
    sequencing errors.
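
The orientation step can be illustrated with a toy word-sharing test: count the short words (k-mers) an EST shares with a reference sequence in each orientation and keep the orientation that shares more. This is a minimal sketch of the general idea, not the pipeline's actual algorithm. The function names, the word length k = 8, and the example sequences are assumptions.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def kmers(seq: str, k: int) -> set:
    """Set of all length-k words in the sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_orientation(reference: str, est: str, k: int = 8):
    """Return (oriented_est, was_reversed), choosing the orientation
    of `est` that shares more k-mers with `reference`."""
    ref_words = kmers(reference, k)
    forward = len(ref_words & kmers(est, k))
    flipped = reverse_complement(est)
    reverse = len(ref_words & kmers(flipped, k))
    if reverse > forward:
        return flipped, True
    return est.upper(), False


# Hypothetical sequences, for illustration only.
reference = "ATGGCGTACGTTAGCCGATAGGCTTAACGGATCC"
est_forward = reference[5:30]
est_backward = reverse_complement(est_forward)

print(best_orientation(reference, est_forward))
print(best_orientation(reference, est_backward))
```

A set-based count ignores word multiplicity and sequencing errors. A real pipeline would need to tolerate both, and would compare ESTs within a cluster against each other rather than against a known reference.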
43
Sequence Alignment
44
(No Transcript)
45
Current Status Results
  • 400,000,000 bp aligned w/ reliable consensus.
  • 83,000 consensus gene sequences produced.
  • 20,000 show significant homology to known
    proteins, almost all in expected orientation.
  • 75,000 SNPs above LOD score of 3.
  • 30,000 SNPs above LOD score of 6.
  • Current estimate: 60,000 high-frequency SNPs.

46
(No Transcript)
47
Chromatographic Evidence
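(Figure: chromatogram traces for two ESTs, one
reading G G T G G T C C C with G at the variant
position, the other G G T G A T C C C with A.
Labels shown: HsS785496, zu42c08.r1, HsS1065649,
oz03ho7.x1.)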
48
RFLP Detection of SNPs
49
Verified 56 of 79 SNPs tested so far
50
Verification Test: Whitehead cSNPs
  • Whitehead Institute has systematically searched
    for SNPs in 106 genes, using 20 Europeans, 10
    Africans, 10 Asians.
  • On 54 genes, our predicted cSNPs (score > 3) are
    verified by their results at a 70% rate. The
    Whitehead set may be incomplete.

51
(No Transcript)
52
Validation Test: HLA-A
  • HLA polymorphism has been studied very
    extensively for the general population, providing
    a gold standard for all true positives.
  • 140 distinct HLA-A allele sequences available
    from Anthony Nolan Foundation database.
  • Are any of our predicted HLA-A SNPs not
    independently verified by this data?

53
(No Transcript)
54
HLA-A: 89% Verification Rate
  • Of total 108 SNPs we predicted in the coding
    region of HLA-A, 96 are independently validated
    by the known HLA-A allele sequences, and 12 are
    not.
  • By comparison, the NCI CGAP project (based on the
    same EST data) predicts just 10 SNPs in HLA-A
    (>90% false negatives!)
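
The percentages on this slide follow from the counts it quotes. The short check below reproduces them. The second figure rests on an assumption the slide does not state, namely that all 10 CGAP SNPs are among the 108 predicted here. The names are illustrative.

```python
def rate(part: int, whole: int) -> float:
    """Return part / whole as a fraction."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


PREDICTED = 108      # coding-region SNPs predicted in HLA-A (slide)
VALIDATED = 96       # of those, confirmed by known allele sequences (slide)
CGAP_PREDICTED = 10  # SNPs predicted by NCI CGAP in HLA-A (slide)

verification_rate = rate(VALIDATED, PREDICTED)

# Assumption: every CGAP SNP is among the 108 predictions.
share_missing_from_cgap = 1.0 - rate(CGAP_PREDICTED, PREDICTED)

print(f"verification rate: {verification_rate:.1%}")
print(f"share of predictions absent from CGAP: {share_missing_from_cgap:.1%}")
```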

55
Mass Spectrometry Validation
  • SNPs change the mass of a DNA fragment.
  • Sequenom Inc. has tested more than 1000 of our
    SNPs using mass spectrometry of pooled DNA
    samples.
  • 80% were detectably polymorphic in samples of 90
    people.

56
Bioinformatics Key to SNPs
57
EST-based SNP detection is similar in reliability
to experimental methods
58
Application to Disease Gene Mapping
  • How do SNPs compare with traditional marker sets
    used for disease gene mapping projects?
  • Density: how dense is the marker set, when mapped
    onto the human genome?
  • Ideal: at least one marker per gene (strong
    linkage disequilibrium within 3 kb).
  • Ideal: high heterozygosity for good statistics.
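
Heterozygosity here can be read as the standard expected heterozygosity, H = 1 - sum of squared allele frequencies, which is the chance that two randomly drawn chromosomes carry different alleles at the marker. The sketch below computes it. The function name and the example frequencies are illustrative assumptions.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum(p_i^2) for allele frequencies that sum to 1."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


print(expected_heterozygosity([0.5, 0.5]))   # common two-allele SNP
print(expected_heterozygosity([0.9, 0.1]))   # rarer two-allele SNP
print(expected_heterozygosity([0.1] * 10))   # ten-allele marker
```

A two-allele SNP cannot exceed H = 0.5, whereas a marker with many alleles, such as a microsatellite, can go higher. This is why SNP marker sets compensate with density.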

59
(Figure: map of chromosome 22, region 22q11.2 -
q13.3, contig NT_001454 (14.6 MB), with a 1.4 MB
segment from 22q13.1 shown in detail. Labels include
markers AFM164ze3, AFM273vd9, AFMa046za5 and
AFM261ye5; named genes EIF3S7, PVALB, NCF4, CSF2RB,
TST, IL2RB, RAC2, MFNG and MSE55; and Unigene
clusters (Hs.* identifiers), each followed by a
pair of counts, e.g. Hs.211929 (108/51).)
60
Mapping Test: positionally cloned genes
  • Positionally cloned genes represent a (somewhat)
    random sampling of genes.
  • They are examples of actual disease-gene mapping
    targets, that typically took years of linkage
    analysis and chromosome walking to find.
  • How good is the coverage and heterozygosity of
    our SNP marker set for these genes?

61
(No Transcript)
62
SNP validation tests: β globin
  • β globin polymorphism has been studied
    intensively, identifying 100s of substitutions.
  • Verify predicted SNPs against known mutations.
  • We detect 21 SNPs in β globin, 17 within exons.

63
SNPs highly biased towards third codon position
64
SNPs Biased towards Silent or Conservative
Substitutions
65
(No Transcript)
66
SNPs detect three disease alleles
  • Mutations previously identified as causing
    disease, catalogued by Online Mendelian
    Inheritance in Man.
  • The only two non-conservative amino acid
    substitutions detected.
  • All three at the α-β chain interface.

67
Verified SNP: Hb Tacoma disrupts α-β interface
68
His 77 → Gln: exposed, unlikely to disrupt
stability
69
(No Transcript)
70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
What are the most polymorphic genes in the Human
Genome?
  • Very large differences in polymorphism levels in
    different genes.
  • Maintaining high levels of diversity (large
    numbers of alleles) may indicate a selective
    pressure.
  • What can we learn from patterns of polymorphism?
  • Why are some genes so polymorphic?

74
The Most Polymorphic Genes: Five Classes
  • Direct interactions with pathogens.
  • Very highly expressed genes.
  • Genes involved in tumorigenesis and
    survival/growth of tumors.
  • Viral- and transposon-derived sequences.
  • Large families of highly similar genes?

75
Acknowledgements
  • Christopher Lee: K. Irizarry, B. Modrek, C.
    Grasso
  • Wing Wong (Statistics): C. Li
  • Stan Nelson (Human Genetics): V. Kustanovich,
    N. Brown