Title: COMPUTATIONAL HUMAN GENETICS SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS
1COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR
RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS
- Eran Halperin
- November 10, 2009
2Environmental Factors
Genetic Factors
Complex disease
Multiple genes may affect the disease.
Therefore, the effect of every single gene may
be negligible.
3The Human Chromosomes
4Each chromosome is a sequence over the
alphabet A,G,C,T (base pairs)
Copy from mother
ACCAGGACGA
ACCAGGACGA
Copy from father
5Facts about our genome
- 23 pairs of chromosomes.
- X and Y are the sex chromosomes (XX for women, XY for men).
- 3,300,000,000 base pairs in the human genome.
6The Human Genome Project
What we are announcing today is that we have
reached a milestone, that is, covering the genome
in a working draft of the human sequence.
But our work previously has shown that having
one genetic code is important, but it's not all
that useful.
I would be willing to make a prediction that
within 10 years, we will have the potential of
offering any of you the opportunity to find out
what particular genetic conditions you may be at
increased risk for
Washington, DC, June 26, 2000
7The Vision of Personalized Medicine
Genetic and epigenetic variants, together with measurable
environmental/behavioral factors, would be used
for personalized treatment and diagnosis.
8Paradigm shifts in medicine
9Example: Warfarin
An anticoagulant drug, useful in the prevention
of thrombosis.
10Example: Warfarin
Warfarin was originally used as rat poison.
Optimal dose varies across the population.
Genetic variants (VKORC1 and CYP2C9) affect the variation of the
personalized optimal dose.
11Association Studies
- Genetic variants such as Single Nucleotide
Polymorphisms (SNPs) are tested for association
with the trait.
12Where should we look first?
SNP: Single Nucleotide Polymorphism
person 1 .AAGCTAAATTTG.
person 2 .AAGCTAAGTTTG.
person 3 .AAGCTAAGTTTG.
person 4 .AAGCTAAATTTG.
person 5 .AAGCTAAGTTTG.
Each common SNP has only two possible letters (alleles).
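As a small illustration of what "polymorphic position" means for the five sequences above, the following sketch scans aligned sequences column by column and reports the positions where individuals differ. The function name `find_snps` and the 0-based position convention are choices made for this example, not part of the slides.

```python
def find_snps(sequences):
    """Return {position: sorted list of alleles} for every column
    in which the aligned sequences do not all agree (0-based positions)."""
    snps = {}
    for pos, column in enumerate(zip(*sequences)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            snps[pos] = alleles
    return snps


# The five individuals from the slide, without the surrounding dots.
people = [
    "AAGCTAAATTTG",
    "AAGCTAAGTTTG",
    "AAGCTAAGTTTG",
    "AAGCTAAATTTG",
    "AAGCTAAGTTTG",
]

print(find_snps(people))
```

In this toy data there is exactly one polymorphic column, and it carries two alleles (A and G), in line with the statement that a common SNP is typically bi-allelic.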
13Disease Association Studies
SNP: Single Nucleotide Polymorphism
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
14Preliminary Definitions
- SNP: single nucleotide polymorphism. A genetic variant that may carry a different value in different individuals.
- Allele: the variant's value (A, G, C, or T).
- Most SNPs are bi-allelic: only two alleles are observed in the population.
- Risk allele: the allele that is more common in cases than in controls (denoted R).
- Nonrisk allele: the allele that is more common in controls (denoted N).
15Relative Risk
Risk allele (G): chance of developing type II diabetes is 30%.
Nonrisk allele (A): chance of developing type II diabetes is 20%.
Relative Risk = $\Pr(D \mid R) / \Pr(D \mid N) = 1.5$
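The arithmetic behind the relative risk can be checked directly. This minimal sketch assumes the slide's figures 30 and 20 are percentages, i.e. Pr(D|R) = 0.30 and Pr(D|N) = 0.20; the function name is hypothetical.

```python
def relative_risk(p_disease_given_risk, p_disease_given_nonrisk):
    """Ratio of disease probability for carriers of the risk allele
    to that for carriers of the nonrisk allele."""
    return p_disease_given_risk / p_disease_given_nonrisk


# Assumed reading of the slide: 30% with allele G, 20% with allele A.
rr = relative_risk(0.30, 0.20)
print(round(rr, 2))
```

A relative risk of 1 would mean the allele carries no information about the disease; values above 1 indicate increased risk for carriers of R.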
16Other Structural Variants
Inversion
Deletion
Copy number variant
17Published Genome-Wide Associations through 6/2009:
439 published GWA at $p < 5 \times 10^{-8}$
NHGRI GWA Catalog: www.genome.gov/GWAStudies
19Public Genotype Data Growth
20Chance or Real Association?
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
21How does it work?
- For every SNP we can construct a contingency
table
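The table itself did not survive in this transcript. As a sketch of one way such a table could be built, the code below counts alleles at a single position separately for cases and controls. The toy sequences and the function name `allele_table` are made up for illustration.

```python
from collections import Counter


def allele_table(cases, controls, pos):
    """Return (alleles, table) where table[0] holds the allele counts
    among cases and table[1] the counts among controls at position pos."""
    case_counts = Counter(seq[pos] for seq in cases)
    control_counts = Counter(seq[pos] for seq in controls)
    alleles = sorted(set(case_counts) | set(control_counts))
    table = [
        [case_counts[a] for a in alleles],
        [control_counts[a] for a in alleles],
    ]
    return alleles, table


# Hypothetical toy data: one SNP at position 1 (C or T).
cases = ["ACG", "ACG", "ATG", "ACG"]
controls = ["ATG", "ATG", "ACG", "ATG"]

alleles, table = allele_table(cases, controls, pos=1)
print(alleles)
print(table)
```

Each row of the table corresponds to one group and each column to one allele, so for a bi-allelic SNP the result is a 2 x 2 table.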
22Hypothesis testing
- Null hypothesis: $\Pr(R \mid \text{case}) = \Pr(R \mid \text{control})$
- Alternative hypothesis: $\Pr(R \mid \text{case}) \neq \Pr(R \mid \text{control})$
- The model assumes that all individuals are independent (unrelated), and therefore our sample is a random sample from a Binomial distribution.
- Cases sampled from distribution $X \sim B(n, \Pr(R \mid \text{cases}))$
- Controls sampled from distribution $Y \sim B(n, \Pr(R \mid \text{controls}))$
23Hypothesis testing, cont.
- When n is large, $B(n,p) \approx N(np,\, np(1-p))$.
- Under the null hypothesis
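The expression that followed on the slide is not preserved here. One common form it can take, assuming equal sample sizes $n$ for cases and controls and a pooled estimate $\hat{p}$ (both assumptions of this sketch, not confirmed by the slides), is:

```latex
\begin{align*}
  &\text{Under } H_0:\quad X \sim B(n,p),\quad Y \sim B(n,p),\quad
     p = \Pr(R \mid \text{case}) = \Pr(R \mid \text{control}) \\
  &X - Y \;\approx\; N\bigl(0,\; 2np(1-p)\bigr) \\
  &Z \;=\; \frac{X - Y}{\sqrt{2n\,\hat{p}\,(1-\hat{p})}}
     \;\approx\; N(0,1),
     \qquad \hat{p} = \frac{X + Y}{2n}
\end{align*}
```

The variance $2np(1-p)$ is the sum of the two binomial variances, which is where the independence assumption between individuals enters.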
24P-value
- Z is called a test statistic (a z-score in this case).
- We can calculate Z for our data and then, using the normal approximation, calculate the p-value $\Pr(Z > z)$, where $z$ is the value computed from the data.
- Often we take [...], which is [...]
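A minimal sketch of this computation follows. It assumes equal group sizes, a pooled allele-frequency estimate, and a two-sided p-value Pr(|Z| > |z|) (chosen here because the alternative hypothesis is an inequality); the function name and the counts in the usage example are hypothetical.

```python
import math


def two_sample_z(x_cases, x_controls, n):
    """z-score and two-sided p-value for comparing risk-allele counts.

    x_cases, x_controls: number of risk alleles observed in each group.
    n: number of alleles sampled per group (assumed equal in both groups).
    Raises ZeroDivisionError if the pooled frequency is exactly 0 or 1.
    """
    p_hat = (x_cases + x_controls) / (2 * n)          # pooled estimate under H0
    se = math.sqrt(2 * n * p_hat * (1 - p_hat))       # sd of X - Y under H0
    z = (x_cases - x_controls) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # Pr(|Z| > |z|), Z ~ N(0,1)
    return z, p_value


# Hypothetical counts: 600 of 1000 case alleles vs 500 of 1000 control alleles.
z, p_value = two_sample_z(600, 500, 1000)
print(z, p_value)
```

The identity Pr(|Z| > |z|) = erfc(|z| / sqrt(2)) for a standard normal Z avoids the need for any statistics library.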
25Results: Manhattan Plots
26The curse of dimensionality: correcting for multiple testing
- In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.
- If we set the p-value threshold for each test to be 0.05, then by chance alone we will find about 5% of the SNPs to be associated with the disease.
- This needs to be corrected.
27Bonferroni Correction
- If the number of tests is n, we set the threshold to be 0.05/n.
- A very conservative test. If the tests are independent, it is reasonable to use it. If the tests are correlated, it can be far too strict.
- Example: if all SNPs are identical, we are effectively performing a single test, so we lose a lot of power: the false positive rate is reduced, but so is the power.
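To make the numbers concrete, this sketch compares the expected count of chance findings at an uncorrected threshold with the Bonferroni-corrected per-test threshold. The figure of one million SNPs is a hypothetical round number, and the expected-count formula assumes every tested SNP is truly null.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


def expected_false_positives(threshold, n_tests):
    """Expected number of null tests that pass the threshold by chance."""
    return threshold * n_tests


n_snps = 1_000_000  # hypothetical number of tested SNPs

naive = expected_false_positives(0.05, n_snps)
corrected = bonferroni_threshold(0.05, n_snps)
after = expected_false_positives(corrected, n_snps)

print(naive)      # roughly 50,000 spurious associations without correction
print(corrected)  # per-test threshold of about 5e-08
print(after)      # about 0.05 expected false positives after correction
```

With one million tests the corrected threshold comes out at about 5 x 10^-8, the same value as the genome-wide significance level quoted for the published GWA results earlier in the deck.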
28Challenge 1
29Population Substructure
- Imagine that all the cases are collected from Africa, and all the controls are from Europe.
- Many association signals are going to be found.
- The vast majority of them are false.
Why?
Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.
30Evolution Theory
- Mutations add to genetic variation
- Natural selection controls the frequency of certain traits and alleles
- Genetic drift
31Mutations
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA
AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA
Estimated probability of a mutation in a single
generation is $10^{-8}$
32Other mutations - recombination
Copy 1
Copy 2
Probability $r_i$ ($10^{-8}$) for recombination in
position i.
child chromosome
33Natural Selection
- Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene.
- Consequence: different allele frequencies in LCT.
34Genetic Drift
- Even without selection, the allele frequencies in the population are not fixed across time.
- Consider the case where we assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
- Suppose that at the first generation the allele frequencies are $p_0$ (of a) and $q_0 = 1 - p_0$ (of A).
- Under HWE, $E[p_{k+1}] = p_k$, but $\mathrm{Var}[p_{k+1}] > 0$, so the next generation will have $p_{k+1} \neq p_0$.
35The rate of the drift
- N: effective population size (if all individuals are entirely unrelated then N is the total population size).
- Under an assumption of constant population size, if $X_k$ counts the number of occurrences of a at generation k, then $X_{k+1} \sim B(N, p_k)$.
- $E[p_{k+1}] = E[X_{k+1}]/N = p_k$.
- $\mathrm{Var}[p_{k+1}] = p_k(1-p_k)/N$.
- The effect of genetic drift depends on the time and the effective population size. A small population increases the effect.
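The binomial resampling described above can be simulated in a few lines. This is a minimal sketch under the stated assumptions (constant population size, no selection, no mutation); the function name and the use of a seeded generator are choices made for the example.

```python
import random


def wright_fisher(n, p0, generations, seed=None):
    """Simulate allele-frequency drift under binomial resampling.

    n: effective population size (number of allele copies drawn per generation).
    p0: starting frequency of allele a.
    Returns the list [p_0, p_1, ..., p_generations].
    """
    rng = random.Random(seed)
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # X_{k+1} ~ B(N, p_k): each of the n copies is allele a with probability p.
        count = sum(1 for _ in range(n) if rng.random() < p)
        p = count / n
        freqs.append(p)
    return freqs


# A population of 9 copies starting at frequency 1/9, followed for 20 generations.
trajectory = wright_fisher(n=9, p0=1 / 9, generations=20, seed=1)
print(trajectory)
```

Running the simulation with a larger `n` and the same `p0` typically produces much smaller fluctuations per generation, consistent with the variance $p_k(1-p_k)/N$. Once the frequency reaches 0 or 1 it stays there, since no mutation is modeled.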
36Bottleneck effect
Effective population size
Time
The rate of genetic drift is higher.
37The Wright-Fisher Model
Generation 1: allele frequency 1/9
38The Wright-Fisher Model
Generation 2: allele frequency 1/9
39The Wright-Fisher Model
Generation 3: allele frequency 1/9
40The Wright-Fisher Model
Generation 4: allele frequency 1/3
41The Wright-Fisher Model
42The Wright-Fisher Model
43Ancestral population
44Ancestral population
migration
45- different allele frequencies
Ancestral population
Genetic drift
46Population Substructure
- Imagine that all the cases are collected from Africa, and all the controls are from Europe.
- Many association signals are going to be found.
- The vast majority of them are false.
What can we do about it?
47Jakobsson et al., Nature 451:998-1003 (2008)
48Principal Component Analysis
- Dimensionality reduction
- Based on linear algebra (Singular Value Decomposition)
- Intuition: find the most important features of the data; project the data on the axis with the largest variance.
49Principal Component Analysis
Plotting the data on a one-dimensional line for
which the spread is maximized.
50Principal Component Analysis
- In our case, we want to look at two dimensions at a time.
- The original data has many dimensions: each SNP corresponds to one dimension.
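The idea of projecting on the axis of largest variance can be sketched without any linear-algebra library. Note the substitution: instead of a full Singular Value Decomposition, the code below uses power iteration, which only approximates the first principal axis. The genotype matrix is hypothetical (rows are individuals, columns are SNPs coded 0, 1, or 2), and the function names are made up for this example.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so that every SNP has mean zero."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in rows]


def first_principal_axis(rows, iterations=100, seed=0):
    """Approximate the direction of largest variance by power iteration.

    Returns (axis, scores): a unit vector over SNPs and the projection
    of each individual onto it.
    """
    x = center_columns(rows)
    m = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]       # random starting direction
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]            # X v
        w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break                                      # data has no variance
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


# Hypothetical genotypes: the first three and last three individuals
# come from two populations with different allele frequencies.
genotypes = [
    [0, 0, 1, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
    [2, 2, 1, 0],
    [2, 1, 2, 0],
    [2, 2, 2, 0],
]

axis, scores = first_principal_axis(genotypes)
print(scores)
```

In this toy data the two groups land on opposite sides of zero along the first axis, which is the kind of separation that makes PCA useful for detecting population structure. The sign of the axis is arbitrary, so only the relative positions are meaningful.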
51Ancestry Inference
- To what extent can population structure be detected from SNP data?
- What can we learn from these inferences?
- Can we build the tree of life?
- How do we analyze complex (mixed) populations?
Novembre et al., Nature, 2008
52Challenge 2
53A typical associated region
54Linkage Disequilibrium
55Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
56Phasing - haplotype inference
Haplotypes
ATCCGA AGACGC
- Cost effective genotyping technology gives
genotypes and not haplotypes.
57Inferring Haplotypes From Trios
Parent 1: 122112
Parent 2: 210022
Child: 120222
Assumption: no recombination
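The trio above can be phased site by site. This sketch assumes a genotype encoding that the slide does not spell out: 0 and 1 denote the two homozygous states and 2 denotes a heterozygous site. It also assumes the trio is Mendelian-consistent and does not check for errors. The function names are hypothetical.

```python
def flip(allele):
    """Return the other allele of a bi-allelic site."""
    return "1" if allele == "0" else "0"


def phase_child(parent1, parent2, child):
    """Infer the child's two haplotypes from trio genotypes.

    Encoding (assumed): '0'/'1' = homozygous, '2' = heterozygous.
    Returns (haplotype inherited from parent 1, haplotype inherited
    from parent 2); '?' marks sites where all three are heterozygous.
    """
    from_p1, from_p2 = [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        if gc != "2":            # homozygous child: both copies are known
            a1 = a2 = gc
        elif g1 != "2":          # parent 1 homozygous: it must transmit g1
            a1, a2 = g1, flip(g1)
        elif g2 != "2":          # parent 2 homozygous: it must transmit g2
            a1, a2 = flip(g2), g2
        else:                    # all three heterozygous: ambiguous
            a1 = a2 = "?"
        from_p1.append(a1)
        from_p2.append(a2)
    return "".join(from_p1), "".join(from_p2)


print(phase_child("122112", "210022", "120222"))
```

Only the last site remains unresolved in this example, because both parents and the child are heterozygous there. The no-recombination assumption is what allows the per-site alleles to be read as two contiguous haplotypes.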
58Maximum Likelihood
- Until now we discussed the case of two hypotheses (null and alternative).
- In some cases we are interested in many hypotheses, and we search for the best one.
- Normally a hypothesis will be defined by a set of parameters $\theta$.
- The likelihood of $\theta$ is $L(\theta) = \Pr(D \mid \theta)$, where $D$ is the observed data. We are interested in the hypothesis that maximizes the likelihood.
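As a small worked instance of "search for the best hypothesis", the sketch below treats each candidate allele frequency as one hypothesis and picks the one with the highest binomial likelihood by scanning a grid. The grid search, the function names, and the counts (30 copies of an allele among 100) are all assumptions made for illustration.

```python
import math


def log_likelihood(theta, k, n):
    """Binomial log-likelihood of frequency theta given k successes
    out of n draws, up to an additive constant that does not depend on theta."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def grid_mle(k, n, steps=999):
    """Return the grid point in (0, 1) that maximizes the likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, k, n))


# Hypothetical data: the allele was seen 30 times among 100 sampled copies.
print(grid_mle(30, 100))
```

For this simple model the maximizer is known in closed form (k/n), so the grid search is only a check on intuition; it becomes useful when no closed form exists.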
59Soft assignment
- Compute probabilities $p_h$ for all possible haplotypes $h$.
- For each genotype g, we do not assign one pair of haplotypes, but a distribution over possible pairs.
- The set of pairs of haplotypes compatible with g is denoted C(g).
- In soft assignment, a pair $(h_1, h_2) \in C(g)$ explains g with probability $p_{h_1} p_{h_2} \big/ \sum_{(h_1', h_2') \in C(g)} p_{h_1'} p_{h_2'}$.
60Phasing via Maximum Likelihood
- Soft decision
- Hard decision
61An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1

Haplotype   Probability
0 0 0 1 0   1/12
0 0 0 1 1   1/12
1 0 0 0 1   1/12
1 0 0 1 0   1/12
1 0 0 1 1   3/12
1 0 1 0 1   1/12
1 0 1 1 1   2/12
1 1 0 1 1   1/12
1 1 1 1 1   1/12

Candidate haplotypes for each genotype, each with initial weight ¼:
1 0 h h 1:  1 0 0 0 1   1 0 1 1 1   1 0 0 1 1   1 0 1 0 1
            ¼           ¼           ¼           ¼
h 0 0 1 h:  0 0 0 1 0   1 0 0 1 1   0 0 0 1 1   1 0 0 1 0
            ¼           ¼           ¼           ¼
1 h h 1 1:  1 0 0 1 1   1 1 1 1 1   1 0 1 1 1   1 1 0 1 1
            ¼           ¼           ¼           ¼
62An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1

Haplotype   Previous   Updated
0 0 0 1 0   1/12       .125
0 0 0 1 1   1/12       .042
1 0 0 0 1   1/12       .067
1 0 0 1 0   1/12       .042
1 0 0 1 1   3/12       .325
1 0 1 0 1   1/12       .1
1 0 1 1 1   2/12       .133
1 1 0 1 1   1/12       .067
1 1 1 1 1   1/12       .1

Candidate haplotype pairs for each genotype, with the weight of each pair:
1 0 h h 1:  (1 0 0 0 1, 1 0 1 1 1)   (1 0 0 1 1, 1 0 1 0 1)
            0.4                      0.6
h 0 0 1 h:  (0 0 0 1 0, 1 0 0 1 1)   (0 0 0 1 1, 1 0 0 1 0)
            0.75                     0.25
1 h h 1 1:  (1 0 0 1 1, 1 1 1 1 1)   (1 0 1 1 1, 1 1 0 1 1)
            0.6                      0.4
63An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1

Haplotype   Probability
0 0 0 1 0   1/6
0 0 0 1 1   0
1 0 0 0 1   0
1 0 0 1 0   0
1 0 0 1 1   1/2
1 0 1 0 1   1/6
1 0 1 1 1   0
1 1 0 1 1   0
1 1 1 1 1   1/6

Candidate haplotype pairs for each genotype, with the weight of each pair:
1 0 h h 1:  (1 0 0 0 1, 1 0 1 1 1)   (1 0 0 1 1, 1 0 1 0 1)
            0                        1
h 0 0 1 h:  (0 0 0 1 0, 1 0 0 1 1)   (0 0 0 1 1, 1 0 0 1 0)
            1                        0
1 h h 1 1:  (1 0 0 1 1, 1 1 1 1 1)   (1 0 1 1 1, 1 1 0 1 1)
            1                        0
64Expectation Maximization (EM)
- D: given data
- T: parameters that need to be estimated
- Z: latent (missing) variables
65EM rationale
- Lemma.
- Proof: First, note that
67QED
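The statement and proof of the lemma are not preserved in this transcript. For orientation only, here is one standard form of the EM guarantee, which may differ from the lemma on the slides. It writes $\theta$ for the parameters and $\theta'$ for the current estimate, with $D$ the data and $Z$ the latent variables as defined above; the symbols $Q$ and $H$ are introduced for this sketch.

```latex
\begin{align*}
  \log P(D \mid \theta)
    &= \sum_{Z} P(Z \mid D, \theta')\,
       \log \frac{P(D, Z \mid \theta)}{P(Z \mid D, \theta)}
     = Q(\theta \mid \theta') + H(\theta \mid \theta'), \\
  Q(\theta \mid \theta')
    &= \sum_{Z} P(Z \mid D, \theta') \log P(D, Z \mid \theta), \\
  H(\theta \mid \theta')
    &= -\sum_{Z} P(Z \mid D, \theta') \log P(Z \mid D, \theta)
     \;\ge\; H(\theta' \mid \theta')
     \quad\text{(Gibbs' inequality)}, \\
  \log P(D \mid \theta) - \log P(D \mid \theta')
    &\ge Q(\theta \mid \theta') - Q(\theta' \mid \theta').
\end{align*}
```

Consequently, choosing $\theta$ to maximize $Q(\theta \mid \theta')$ can never decrease the likelihood of the observed data, which is the rationale for alternating expectation and maximization steps.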
68MLE from Incomplete Data
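- Finding MLE parameters: a nonlinear optimization problem
(Figure: $\log P(x \mid \theta)$ and the expected complete-data log-likelihood $E[\log P(x, y \mid \theta)]$, plotted against $\theta$.)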
69MLE from Incomplete Data
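(Figure: $\log P(x \mid \theta)$ and the expected complete-data log-likelihood $E[\log P(x, y \mid \theta)]$, plotted against $\theta$.)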
70EM for phasing
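The content of this slide is not preserved. The following is a minimal sketch of EM phasing that follows the iterative example from the earlier slides: genotypes are strings over 0, 1, and h (heterozygous), each genotype starts with equal weight on all its compatible haplotype pairs, and the two steps alternate. Function names and the fixed iteration count are choices made for this example, not the author's implementation.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype string."""
    het = [i for i, s in enumerate(genotype) if s == "h"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site of the first haplotype to '0'
    # so that (a, b) and (b, a) are not both listed.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def em_phase(genotypes, iterations=50):
    """Return (haplotype frequencies, per-genotype pair weights)."""
    pairs = [compatible_pairs(g) for g in genotypes]
    weights = [[1.0 / len(c)] * len(c) for c in pairs]   # uniform start
    freqs = {}
    for _ in range(iterations):
        # M-step: haplotype frequency = weighted count / total haplotypes.
        counts = defaultdict(float)
        for cand, w in zip(pairs, weights):
            for (h1, h2), wi in zip(cand, w):
                counts[h1] += wi
                counts[h2] += wi
        freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        # E-step: weight of a pair is proportional to p_h1 * p_h2.
        weights = []
        for cand in pairs:
            raw = [freqs[h1] * freqs[h2] for h1, h2 in cand]
            total = sum(raw)
            weights.append([r / total for r in raw])
    return freqs, weights


data = ["10hh1", "h001h", "1hh11"]
freqs, weights = em_phase(data)
print({h: round(p, 3) for h, p in sorted(freqs.items())})
```

Calling `em_phase(data, iterations=1)` returns the first set of pair weights from the worked example (0.4 and 0.6 for the first genotype), and running to convergence concentrates half of the probability on haplotype 10011, which is compatible with all three genotypes.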
72Phasing summary
- Expectation maximization is easy to implement and works reasonably well in practice.
- We can use other models (tree models) to improve the accuracy of the phasing prediction.
73Human Genetics where to?
- We can typically explain 5-15% of the heritability of common diseases.
- Where is the missing heritability?
- Rare variants
- Gene-gene interactions
- Gene-environment interactions
- Creative computational methods are key to the discovery of the missing heritability.
74Course Computational Human Genetics
- Semester B
- More background in human genetics, statistics, and machine learning.
- Studying genetics of human disease
- Privacy and forensics
- Analysis of new technologies (sequencing)
- Population genetics: detecting selection, mutation rate, recombination rates, etc.
- Reconstructing human history