COMPUTATIONAL HUMAN GENETICS SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS presentation

1
COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR
RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS

Eran Halperin
November 10, 2009

2
Environmental Factors
Genetic Factors
Complex disease
Multiple genes may affect the disease.
Therefore, the effect of every single gene may
be negligible.
3
The Human Chromosomes
4
Each chromosome is a sequence over the
alphabet {A, G, C, T} (base pairs)
Copy from mother
ACCAGGACGA
ACCAGGACGA
Copy from father
5
Facts about our genome

23 pairs of chromosomes.
X and Y are the sex chromosomes (XX for women, XY
for men).
3,300,000,000 base pairs in the human genome

6
The Human Genome Project
What we are announcing today is that we have
reached a milestone, that is, covering the genome
in a working draft of the human sequence.
But our work previously has shown that having
one genetic code is important, but it's not all
that useful.
I would be willing to make a prediction that
within 10 years, we will have the potential of
offering any of you the opportunity to find out
what particular genetic conditions you may be at
increased risk for.
Washington, DC, June 26, 2000
7
The Vision of Personalized Medicine
Genetic and epigenetic variants, together with
measurable environmental/behavioral factors, would
be used for personalized treatment and diagnosis.
8
Paradigm shifts in medicine
9
Example: Warfarin
An anticoagulant drug, useful in the prevention
of thrombosis.
10
Example: Warfarin
Warfarin was originally used as rat poison.
Optimal dose varies across the population.
Genetic variants (VKORC1 and CYP2C9) affect the
variation of the personalized optimal dose.
11
Association Studies

Genetic variants such as Single Nucleotide
Polymorphisms (SNPs) are tested for association
with the trait.

12
Where should we look first?
SNP: Single Nucleotide Polymorphism
person 1 ...AAGCTAAATTTG...
person 2 ...AAGCTAAGTTTG...
person 3 ...AAGCTAAGTTTG...
person 4 ...AAGCTAAATTTG...
person 5 ...AAGCTAAGTTTG...
Each common SNP has only two possible letters
(alleles).
13
Disease Association Studies
SNP: Single Nucleotide Polymorphism
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
14
Preliminary Definitions

SNP: single nucleotide polymorphism. A genetic
variant which may carry a different value for
different individuals.
Allele: the variant's value (A, G, C, or T).
Most SNPs are bi-allelic: there are only two
observed alleles in the population.
Risk allele: the allele which is more common in
cases than in controls (denoted R).
Nonrisk allele: the allele which is more common
in the controls (denoted N).

15
Relative Risk
Chances of developing type II diabetes: 30%
Risk allele: G
Chances of developing type II diabetes: 20%
Nonrisk allele: A
Relative Risk = Pr(D | R) / Pr(D | N) = 1.5
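
The ratio on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the two inputs are read as the slide's disease probabilities for carriers of the risk and nonrisk alleles (0.30 and 0.20).

```python
def relative_risk(p_disease_given_risk, p_disease_given_nonrisk):
    """Relative risk: Pr(D | R) / Pr(D | N)."""
    return p_disease_given_risk / p_disease_given_nonrisk


# Slide values, read as probabilities: 30% with allele G, 20% with allele A.
rr = relative_risk(0.30, 0.20)
print(round(rr, 2))
```

A relative risk above 1 means the disease is more likely among carriers of R than among carriers of N.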
16
Other Structural Variants
Inversion
Deletion
Copy number variant
17
Published Genome-Wide Associations through
6/2009: 439 published GWA at p < 5 x 10^-8
NHGRI GWA Catalog: www.genome.gov/GWAStudies
18
(No Transcript)
19
Public Genotype Data Growth
20
Chance or Real Association?
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
21
How does it work?

For every SNP we can construct a contingency
table
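
The table itself is not in the transcript. As a rough sketch of what such a table holds, the Python below counts each allele among cases and among controls at one SNP. The function names and the toy allele strings are illustrative assumptions, not data from the talk.

```python
from collections import Counter


def contingency_table(case_alleles, control_alleles):
    """Map each allele to (count in cases, count in controls)."""
    cases = Counter(case_alleles)
    controls = Counter(control_alleles)
    alleles = sorted(set(cases) | set(controls))
    return {a: (cases[a], controls[a]) for a in alleles}


def risk_allele(table):
    """Allele whose frequency in cases most exceeds its frequency in controls."""
    n_cases = sum(c for c, _ in table.values())
    n_controls = sum(k for _, k in table.values())
    return max(table, key=lambda a: table[a][0] / n_cases - table[a][1] / n_controls)


# Hypothetical alleles observed at one SNP in 8 case and 8 control chromosomes.
table = contingency_table("GGGGGGGT", "GTTTTTTT")
for allele, (n_case, n_control) in table.items():
    print(allele, n_case, n_control)
print("risk allele:", risk_allele(table))
```

Each row of the printed table is one allele; the two columns are its counts in cases and in controls.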

22
Hypothesis testing

Null hypothesis: Pr(R | case) = Pr(R | control)
Alternative hypothesis: Pr(R | case) ≠
Pr(R | control)
The model assumes that all individuals are
independent (unrelated), and therefore our sample
is a random sample from a Binomial distribution.
Cases sampled from distribution
X ~ B(n, Pr(R | case))
Controls sampled from distribution
Y ~ B(n, Pr(R | control))

23
Hypothesis testing, cont.

When n is large, B(n, p) ≈ N(np, np(1-p)).
Under the null hypothesis

24
P-value

Z is called a test statistic (a z-score in this
case).
We can calculate Z for our data, and then
calculate (using the normal approximation) the
p-value, Pr(|Z| > |z|), where z is the value
observed in our data.
Often we take , which is
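
The slide's formula for Z did not survive in the transcript. The sketch below therefore assumes the standard pooled two-proportion z statistic, which fits the binomial model and normal approximation described above; the function name and the example counts are hypothetical.

```python
import math


def two_proportion_z(risk_cases, n_cases, risk_controls, n_controls):
    """Pooled two-proportion z statistic and two-sided p-value.

    risk_cases / risk_controls: number of risk alleles (R) seen in each group.
    n_cases / n_controls: number of alleles sampled in each group.
    """
    p_case = risk_cases / n_cases
    p_control = risk_controls / n_controls
    # Under the null both groups share one frequency, estimated by pooling.
    pooled = (risk_cases + risk_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    z = (p_case - p_control) / se
    # Two-sided tail of the standard normal: Pr(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z(30, 100, 20, 100)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

With these made-up counts the difference is not significant at 0.05; with the much larger samples of a real study the same frequency gap would be.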

25
Results Manhattan Plots
26
The curse of dimensionality: corrections for
multiple testing

In a typical Genome-Wide Association Study
(GWAS), we test millions of SNPs.
If we set the p-value threshold for each test to
be 0.05, by chance we will find about 5% of the
SNPs to be associated with the disease.
This needs to be corrected.

27
Bonferroni Correction

If the number of tests is n, we set the threshold
to be 0.05/n.
A very conservative test. If the tests are
independent, then it is reasonable to use it. If
the tests are correlated, this could be bad.
Example: if all SNPs are identical, we are in
effect running a single test, yet we still divide
the threshold by n. The false positive rate
reduces, but so does the power.
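
As a small illustration of the correction, the Python sketch below computes the per-test threshold and filters a list of p-values. The function names and the example p-values are made up for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def significant_tests(p_values, alpha=0.05):
    """Indices of tests that pass the Bonferroni-corrected threshold."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# With one million SNPs, 0.05 / 1,000,000 gives a threshold of 5 x 10^-8.
print(bonferroni_threshold(0.05, 1_000_000))

# Four hypothetical tests: the cutoff is 0.05 / 4 = 0.0125.
print(significant_tests([1e-9, 0.02, 0.04, 0.2]))
```

Without the correction, three of the four hypothetical tests would fall under 0.05; with it, only the first one remains.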

28
Challenge 1

Population Substructure

29
Population Substructure

Imagine that all the cases are collected from
Africa, and all the controls are from Europe.
Many association signals are going to be found.
The vast majority of them are false.

Why?
Different evolutionary forces: drift, selection,
mutation, migration, population bottleneck.
30
Evolution Theory

Mutations add to genetic variation
Natural Selection controls the frequency of
certain traits and alleles
Genetic drift

31
Mutations
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA
AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA
Estimated probability of a mutation in a single
generation is about 10^-8 per base pair
32
Other mutations - recombination
Copy 1
Copy 2
Probability r_i (about 10^-8) for recombination at
position i.
child chromosome
33
Natural Selection

Example: being lactose tolerant is advantageous
in northern Europe, hence there is positive
selection in the LCT gene.
Result: different allele frequencies in LCT.

34
Genetic Drift

Even without selection, the allele frequencies in
the population are not fixed across time.
Consider the case where we assume Hardy-Weinberg
Equilibrium (HWE), that is, individuals are
mating randomly in the population.
Suppose that at the first generation the allele
frequencies are p_0 (of a) and q_0 = 1 - p_0 (of A).
Under HWE, E[p_{k+1}] = p_k, but Var[p_{k+1}] > 0,
so the next generation will generally have
p_{k+1} ≠ p_k.

35
The rate of the drift

N: effective population size (if all individuals
are entirely unrelated then N is the total
population size).
Under an assumption of constant population size,
if X_k counts the number of occurrences of a at
generation k, then X_{k+1} ~ B(N, p_k).
E[p_{k+1}] = E[X_{k+1}] / N = p_k.
Var[p_{k+1}] = p_k (1 - p_k) / N.
The effect of genetic drift depends on the time
and the effective population size. A small
population increases the effect.
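
The binomial resampling step above is simple to simulate. The sketch below is a minimal illustration using only Python's standard library; the function name, population sizes, and seeds are arbitrary choices, not values from the talk.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=None):
    """Simulate allele-frequency drift: X_{k+1} ~ B(N, p_k), p_{k+1} = X_{k+1} / N."""
    rng = random.Random(seed)
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Draw N alleles independently, each a copy of 'a' with probability p_k.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        p = count / pop_size
        freqs.append(p)
    return freqs


# Same starting frequency, two population sizes: the smaller one drifts faster.
small = wright_fisher(0.3, pop_size=20, generations=50, seed=1)
large = wright_fisher(0.3, pop_size=2000, generations=50, seed=1)
print("N=20:  ", [round(f, 2) for f in small[::10]])
print("N=2000:", [round(f, 2) for f in large[::10]])
```

Because Var[p_{k+1}] is p_k(1 - p_k)/N, the trajectory for the small population typically wanders far from 0.3 and may fix at 0 or 1, while the large population stays close to its starting value.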

36
Bottleneck effect
Effective population size
Time
The rate of genetic drift is higher.
37
The Wright-Fisher Model
Generation 1 Allele frequency 1/9
38
The Wright-Fisher Model
Generation 2 Allele frequency 1/9
39
The Wright-Fisher Model
Generation 3 Allele frequency 1/9
40
The Wright-Fisher Model
Generation 4 Allele frequency 1/3
41
The Wright-Fisher Model
42
The Wright-Fisher Model
43
Ancestral population
44
Ancestral population
migration
45

different allele frequencies

Ancestral population
Genetic drift
46
Population Substructure

Imagine that all the cases are collected from
Africa, and all the controls are from Europe.
Many association signals are going to be found.
The vast majority of them are false.

What can we do about it?
47
Jakobsson et al., Nature 451: 998-1003 (2008)
48
Principal Component Analysis

Dimensionality reduction
Based on linear algebra (Singular Value
Decomposition)
Intuition: find the most important features of
the data; project the data on the axis with the
largest variance.

49
Principal Component Analysis
Plotting the data on a one-dimensional line for
which the spread is maximized.
50
Principal Component Analysis

In our case, we want to look at two dimensions at
a time.
The original data has many dimensions: each SNP
corresponds to one dimension.
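
To make the projection idea concrete, the sketch below finds the axis of largest variance for a tiny genotype matrix. It is an illustration only: it uses power iteration in plain Python rather than the SVD mentioned earlier, and the function name and the six-individual data set are invented.

```python
import random


def top_principal_component(rows, iterations=200, seed=0):
    """Leading principal axis and projections for rows (individuals x SNPs)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
    return v, scores


# Hypothetical genotypes (copies of one allele per SNP) for two populations.
genotypes = [
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 1, 0],  # population A
    [0, 0, 2, 1], [0, 1, 2, 1], [1, 0, 2, 0],  # population B
]
axis, scores = top_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

The first three projections fall on one side of zero and the last three on the other, so a single axis already separates the two made-up populations. Repeating the procedure after removing this axis would give the second dimension for a two-dimensional plot.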

51
Ancestry Inference

To what extent can population structure be
detected from SNP data?
What can we learn from these inferences?
Can we build the tree of life?
How do we analyze complex (mixed) populations?

Novembre et al., Nature, 2008
52
Challenge 2

Modeling Correlation

53
A typical associated region
54
Linkage Disequilibrium
55
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
56
Phasing - haplotype inference
Haplotypes
ATCCGA AGACGC

Cost-effective genotyping technology gives
genotypes and not haplotypes.

57
Inferring Haplotypes From Trios
Parent 1: 122112
Parent 2: 210022
Child: 120222
Assumption: No recombination
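
The transcript does not say how the trio genotypes are encoded. The sketch below assumes the common convention that 0 and 1 mark sites homozygous for allele 0 or 1 and that 2 marks a heterozygous site; under that assumption the three strings on this slide are consistent with Mendelian inheritance. The function names are hypothetical.

```python
def transmittable(parent_genotype):
    """Alleles a parent can pass on (assumed coding: 0/1 homozygous, 2 heterozygous)."""
    return {"0": "0", "1": "1", "2": "01"}[parent_genotype]


def phase_trio(parent1, parent2, child):
    """Return the child's two haplotypes (from parent 1, from parent 2).

    Sites that the trio cannot resolve are reported as '?'.
    """
    from_p1, from_p2 = [], []
    for g1, g2, c in zip(parent1, parent2, child):
        options = [
            (a, b)
            for a in transmittable(g1)
            for b in transmittable(g2)
            if (c == "2" and a != b) or (c != "2" and a == b == c)
        ]
        if not options:
            raise ValueError("genotypes violate Mendelian inheritance")
        a, b = options[0] if len(options) == 1 else ("?", "?")
        from_p1.append(a)
        from_p2.append(b)
    return "".join(from_p1), "".join(from_p2)


print(phase_trio("122112", "210022", "120222"))
```

Only the last site stays unresolved, because both parents and the child are heterozygous there. With no recombination, each resolved haplotype is a copy of one parental chromosome.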
58
Maximum Likelihood

Until now we discussed the case of two hypotheses
(null, and alternative).
In some cases we are interested in many
hypotheses and we search for the best one.
Normally a hypothesis will be defined by a set of
parameters θ.
The likelihood of θ is L(θ) = Pr(D | θ), the
probability of the data D under θ.
We are interested in the hypothesis that
maximizes the likelihood.
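
A tiny worked case may help: estimating one allele frequency from counts. The sketch below assumes a binomial model and searches a grid of candidate parameter values for the one with the highest log-likelihood. The function names, grid resolution, and counts are illustrative choices.

```python
import math


def log_likelihood(theta, k, n):
    """Binomial log-likelihood (up to a constant) of k risk alleles out of n."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def grid_mle(k, n, steps=1000):
    """Candidate frequency on a regular grid that maximizes the likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, k, n))


# Hypothetical sample: 30 copies of the allele among 100 chromosomes.
print(grid_mle(30, 100))
```

Here every grid point is one hypothesis, and the search lands on the observed frequency k/n, which is the known closed-form maximum for this model.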

59
Soft assignment

Compute probabilities p_h for all possible
haplotypes h.
For each genotype g, we do not assign one pair of
haplotypes, but a distribution of possible pairs.
The set of pairs of haplotypes compatible with g
is denoted as C(g).
In soft assignment, a pair (h1, h2) in C(g) is
explaining g with probability
p_h1 p_h2 / (sum over pairs (h1', h2') in C(g) of p_h1' p_h2')
60
Phasing via Maximum Likelihood

Soft decision
Hard decision

61
An iterative algorithm
Data: 10hh1, h001h, 1hh11

Haplotype  Frequency
00010      1/12
00011      1/12
10001      1/12
10010      1/12
10011      3/12
10101      1/12
10111      2/12
11011      1/12
11111      1/12

Genotype  Compatible pairs                Haplotype weights
10hh1     10001 + 10111, 10011 + 10101    ¼ ¼ ¼ ¼
h001h     00010 + 10011, 00011 + 10010    ¼ ¼ ¼ ¼
1hh11     10011 + 11111, 10111 + 11011    ¼ ¼ ¼ ¼
62
An iterative algorithm
Data: 10hh1, h001h, 1hh11

Haplotype  Previous  Updated
00010      1/12      .125
00011      1/12      .042
10001      1/12      .067
10010      1/12      .042
10011      3/12      .325
10101      1/12      .1
10111      2/12      .133
11011      1/12      .067
11111      1/12      .1

Genotype  Compatible pairs                Haplotype weights  Pair probabilities
10hh1     10001 + 10111, 10011 + 10101    ¼ ¼ ¼ ¼            0.4, 0.6
h001h     00010 + 10011, 00011 + 10010    ¼ ¼ ¼ ¼            0.75, 0.25
1hh11     10011 + 11111, 10111 + 11011    ¼ ¼ ¼ ¼            0.6, 0.4
63
An iterative algorithm
Data: 10hh1, h001h, 1hh11

Haplotype  Frequency
00010      1/6
00011      0
10001      0
10010      0
10011      1/2
10101      1/6
10111      0
11011      0
11111      1/6

Genotype  Compatible pairs                Haplotype weights  Pair probabilities
10hh1     10001 + 10111, 10011 + 10101    ¼ ¼ ¼ ¼            0, 1
h001h     00010 + 10011, 00011 + 10010    ¼ ¼ ¼ ¼            1, 0
1hh11     10011 + 11111, 10111 + 11011    ¼ ¼ ¼ ¼            1, 0
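
The three slides above can be replayed in code. The sketch below is one possible reading of the iteration, not the presenter's implementation: it assumes each pair in C(g) is weighted by the product of its two haplotype frequencies, and that frequencies are re-estimated as weighted haplotype counts divided by twice the number of genotypes. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """C(g): unordered haplotype pairs consistent with g ('h' = heterozygous)."""
    het = [i for i, s in enumerate(genotype) if s == "h"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site to 0 on the first haplotype
    # so that every unordered pair is listed exactly once.
    for bits in product("01", repeat=len(het) - 1):
        first, second = list(genotype), list(genotype)
        for site, bit in zip(het, ("0",) + bits):
            first[site] = bit
            second[site] = "1" if bit == "0" else "0"
        pairs.append(("".join(first), "".join(second)))
    return pairs


def m_step(genotypes, weights):
    """Haplotype frequencies from the current soft assignment of pairs."""
    freq = defaultdict(float)
    for g, pair_weights in zip(genotypes, weights):
        for (h1, h2), w in zip(compatible_pairs(g), pair_weights):
            freq[h1] += w
            freq[h2] += w
    total = 2 * len(genotypes)
    return {h: f / total for h, f in freq.items()}


def e_step(genotypes, freq):
    """Probability of each compatible pair given current frequencies."""
    weights = []
    for g in genotypes:
        raw = [freq[h1] * freq[h2] for h1, h2 in compatible_pairs(g)]
        norm = sum(raw)
        weights.append([r / norm for r in raw])
    return weights


def run_em(genotypes, iterations):
    """Start from uniform pair weights, then alternate E and M steps."""
    weights = [[1 / len(compatible_pairs(g))] * len(compatible_pairs(g)) for g in genotypes]
    freq = m_step(genotypes, weights)
    for _ in range(iterations):
        weights = e_step(genotypes, freq)
        freq = m_step(genotypes, weights)
    return freq


data = ["10hh1", "h001h", "1hh11"]
for n_iter in (0, 1, 30):
    print(n_iter, round(run_em(data, n_iter)["10011"], 3))
```

On the slide data this gives 3/12 for haplotype 10011 before any update, .325 after one update, and a value approaching 1/2 after repeated updates, in line with the three slides.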
64
Expectation Maximization (EM)

D: given data
T: parameters that need to be estimated
Z: latent (missing) variables

65
EM rationale

Lemma.
Proof First, note that

66
(No Transcript)
67
QED
68
MLE from Incomplete Data

Finding MLE parameters: a nonlinear optimization
problem

log P(x | θ)
E[log P(x, y | θ)]
θ
69
MLE from Incomplete Data
log P(x | θ)
E[log P(x, y | θ)]
θ
70
EM for phasing
71

This is maximized for

72
Phasing summary

Expectation maximization is easy to implement and
works reasonably well in practice.
We can use other models (tree models) to improve
the accuracy of the phasing prediction.

73
Human Genetics: where to?

We can typically explain 5-15% of the
heritability of common diseases.
Where is the missing heritability?
Rare variants
Gene-gene interactions
Gene-environment interactions
Creative computational methods are key to the
discovery of the missing heritability.

74
Course: Computational Human Genetics

Semester B
More background in human genetics, statistics,
and machine learning.
Studying genetics of human disease
Privacy and forensics
Analysis of new technologies (sequencing)
Population genetics: detecting selection,
mutation rate, recombination rates, etc.
Reconstructing human history
