COMPUTATIONAL HUMAN GENETICS SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS PowerPoint PPT Presentation


Title: COMPUTATIONAL HUMAN GENETICS SEARCHING FOR RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS


1
COMPUTATIONAL HUMAN GENETICS - SEARCHING FOR
RELATIONS BETWEEN GENES, DISEASES, AND POPULATIONS
  • Eran Halperin
  • November 10, 2009

2
Environmental Factors
Genetic Factors
Complex disease
Multiple genes may affect the disease.
Therefore, the effect of every single gene may
be negligible.
3
The Human Chromosomes
4
Each chromosome is a sequence over the
alphabet {A, G, C, T} (base pairs)
Copy from mother
ACCAGGACGA
ACCAGGACGA
Copy from father
5
Facts about our genome
  • 23 pairs of chromosomes.
  • X and Y are the sex chromosomes (XX for women, XY
    for men).
  • 3,300,000,000 base pairs in the human genome

6
The Human Genome Project
What we are announcing today is that we have
reached a milestone, that is, covering the genome
in a working draft of the human sequence.
But our work previously has shown that having
one genetic code is important, but it's not all
that useful.
I would be willing to make a prediction that
within 10 years, we will have the potential of
offering any of you the opportunity to find out
what particular genetic conditions you may be at
increased risk for
Washington, DC, June 26, 2000
7
The Vision of Personalized Medicine
Genetic and epigenetic variants, together with
measurable environmental/behavioral factors, would
be used for personalized treatment and diagnosis
8
Paradigm shifts in medicine
9
Example: Warfarin
An anticoagulant drug, useful in the prevention
of thrombosis.
10
Example: Warfarin
Warfarin was originally used as rat poison.
The optimal dose varies across the population.
Genetic variants (VKORC1 and CYP2C9) affect the
variation in the personalized optimal dose.
11
Association Studies
  • Genetic variants such as Single Nucleotide
    Polymorphisms (SNPs) are tested for association
    with the trait.

12
Where should we look first?
SNP: Single Nucleotide Polymorphism
person 1 .AAGCTAAATTTG.
person 2 .AAGCTAAGTTTG.
person 3 .AAGCTAAGTTTG.
person 4 .AAGCTAAATTTG.
person 5 .AAGCTAAGTTTG.
Each common SNP has only two possible letters
(alleles).
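A minimal sketch of how the SNP in the five fragments above could be located by comparing the sequences column by column. The function name `find_snps` is hypothetical, and the fragments are assumed to be already aligned and of equal length.

```python
def find_snps(sequences):
    """Return (position, sorted alleles) for every column where aligned sequences differ."""
    snps = []
    for position, column in enumerate(zip(*sequences)):
        alleles = sorted(set(column))
        if len(alleles) > 1:  # more than one letter observed: a polymorphic site
            snps.append((position, alleles))
    return snps


# The five fragments from the slide, without the surrounding dots.
people = [
    "AAGCTAAATTTG",  # person 1
    "AAGCTAAGTTTG",  # person 2
    "AAGCTAAGTTTG",  # person 3
    "AAGCTAAATTTG",  # person 4
    "AAGCTAAGTTTG",  # person 5
]

print(find_snps(people))  # positions are 0-based
```

Only one column differs between the five people, and it carries exactly two letters, which matches the statement that a common SNP has two alleles.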
13
Disease Association Studies
SNP: Single Nucleotide Polymorphism
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
14
Preliminary Definitions
  • SNP: single nucleotide polymorphism. A genetic
    variant that may take different values in
    different individuals.
  • Allele: the variant's value (A, G, C, or T).
  • Most SNPs are bi-allelic: only two alleles are
    observed in the population.
  • Risk allele: the allele that is more common in
    cases than in controls (denoted R).
  • Nonrisk allele: the allele that is more common
    in controls (denoted N).

15
Relative Risk
Risk allele (G): chance of developing type II diabetes = 30%
Nonrisk allele (A): chance of developing type II diabetes = 20%
Relative Risk = Pr(D|R) / Pr(D|N) = 0.30 / 0.20 = 1.5
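As a small worked check of the arithmetic on this slide, the sketch below computes the relative risk from the two disease probabilities. It assumes the slide's 30 and 20 are percentages; the function name `relative_risk` is illustrative and not from the lecture.

```python
def relative_risk(pr_disease_given_risk, pr_disease_given_nonrisk):
    """Relative risk = Pr(D | risk allele) / Pr(D | nonrisk allele)."""
    if pr_disease_given_nonrisk <= 0:
        raise ValueError("Pr(D | nonrisk allele) must be positive")
    return pr_disease_given_risk / pr_disease_given_nonrisk


# Assumed reading of the slide: 30% risk for carriers of G, 20% for carriers of A.
rr = relative_risk(0.30, 0.20)
print(round(rr, 2))
```

A relative risk of 1 would mean the allele carries no extra risk; values above 1 mean carriers of the risk allele are more likely to develop the disease.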
16
Other Structural Variants
Inversion
Deletion
Copy number variant
17
Published Genome-Wide Associations through
6/2009: 439 published GWA at p < 5 x 10^-8
NHGRI GWA Catalog: www.genome.gov/GWAStudies
18
(No Transcript)
19
Public Genotype Data Growth
20
Chance or Real Association?
Cases
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTC
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
Associated SNP (lower Relative Risk)
Controls
AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCC
AGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCC
AGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC
AGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC
21
How does it work?
  • For every SNP we can construct a contingency
    table
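One way to build such a table is to count risk and nonrisk alleles separately among cases and controls. The sketch below is only an illustration: the function name `allele_table`, the dictionary layout, and the toy allele lists (one allele per sequence) are assumptions, not taken from the lecture.

```python
from collections import Counter


def allele_table(case_alleles, control_alleles, risk_allele):
    """Build a 2x2 table of risk (R) and nonrisk (N) allele counts for one SNP."""
    table = {}
    for label, alleles in (("cases", case_alleles), ("controls", control_alleles)):
        counts = Counter(alleles)
        risk = counts[risk_allele]
        table[label] = {"R": risk, "N": len(alleles) - risk}
    return table


# Toy data: the allele observed at one SNP in 8 case and 8 control sequences.
cases = "GGGGGGGT"
controls = "GTTTTTTT"

table = allele_table(cases, controls, risk_allele="G")
for group, counts in table.items():
    print(group, counts)
```

The four counts in this table are the only input the association test needs for a single SNP.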

22
Hypothesis testing
  • Null hypothesis: Pr(R|case) = Pr(R|control)
  • Alternative hypothesis: Pr(R|case) ≠
    Pr(R|control)
  • The model assumes that all individuals are
    independent (unrelated), and therefore our sample
    is a random sample from a Binomial distribution.
  • Cases are sampled from the distribution
    X ~ B(n, Pr(R|case))
  • Controls are sampled from the distribution
    Y ~ B(n, Pr(R|control))

23
Hypothesis testing, cont.
  • When n is large, B(n, p) ≈ N(np, np(1-p)).
  • Under the null hypothesis

24
P-value
  • Z is called a test statistic (a z-score in this
    case).
  • We can calculate Z for our data, and then
    calculate (using the normal approximation) the
    p-value Pr(|Z| > |z|), where z is the value
    observed in our data.
  • Often we take , which is
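The exact formula for Z did not survive in this transcript. The sketch below therefore uses the standard pooled two-proportion z statistic as a stand-in (an assumption, not necessarily the lecturer's formula) and converts it to a two-sided p-value with the normal approximation. The function names and the example counts are hypothetical.

```python
import math


def two_proportion_z(risk_cases, n_cases, risk_controls, n_controls):
    """Pooled two-proportion z statistic for Pr(R|case) vs Pr(R|control)."""
    p_case = risk_cases / n_cases
    p_control = risk_controls / n_controls
    pooled = (risk_cases + risk_controls) / (n_cases + n_controls)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls))
    if se == 0:
        raise ValueError("allele is absent or fixed in both groups; z is undefined")
    return (p_case - p_control) / se


def two_sided_p_value(z):
    """Pr(|Z| > |z|) for a standard normal Z, computed with the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 600 risk alleles among 1000 case alleles, 500 among 1000 control alleles.
z = two_proportion_z(600, 1000, 500, 1000)
print(z, two_sided_p_value(z))
```

The two-sided form is used because the alternative hypothesis only says the two frequencies differ, without fixing a direction.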

25
Results Manhattan Plots
26
The curse of dimensionality: corrections for
multiple testing
  • In a typical Genome-Wide Association Study
    (GWAS), we test millions of SNPs.
  • If we set the p-value threshold for each test to
    be 0.05, by chance we will find about 5% of the
    SNPs to be associated with the disease.
  • This needs to be corrected.

27
Bonferroni Correction
  • If the number of tests is n, we set the threshold
    to be 0.05/n.
  • A very conservative test. If the tests are
    independent, then it is reasonable to use it. If
    the tests are correlated, it can be too strict.
  • Example: if all SNPs are identical, then we lose
    a lot of power: the false positive rate is
    reduced, but so is the power.
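A small simulation can make the multiple-testing problem concrete. The sketch below assumes that every SNP is truly unassociated and that the tests are independent, so each p-value is drawn uniformly from [0, 1). The number of tests, the seed, and the helper name `count_hits` are arbitrary choices for illustration.

```python
import random


def count_hits(p_values, threshold):
    """Number of tests that would be declared significant at the given threshold."""
    return sum(p < threshold for p in p_values)


rng = random.Random(0)
n_tests = 100_000
alpha = 0.05

# Under the null hypothesis a p-value is uniformly distributed on [0, 1).
null_p_values = [rng.random() for _ in range(n_tests)]

naive_hits = count_hits(null_p_values, alpha)
bonferroni_threshold = alpha / n_tests
bonferroni_hits = count_hits(null_p_values, bonferroni_threshold)

print("fraction significant at 0.05:", naive_hits / n_tests)
print("Bonferroni threshold:", bonferroni_threshold)
print("significant after Bonferroni:", bonferroni_hits)
```

With the uncorrected threshold roughly one test in twenty is a false positive even though nothing is associated; the corrected threshold removes almost all of them, at the price of requiring much stronger evidence from every real signal.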

28
Challenge 1
  • Population Substructure

29
Population Substructure
  • Imagine that all the cases are collected from
    Africa, and all the controls are from Europe.
  • Many association signals are going to be found
  • The vast majority of them are false

Why?
Different evolutionary forces: drift, selection,
mutation, migration, population bottlenecks.
30
Evolution Theory
  • Mutations add to genetic variation
  • Natural Selection controls the frequency of
    certain traits and alleles
  • Genetic drift

31
Mutations
AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGA
AGAGCAGTCCACAGGTATAGCCTACATGAGATCGACATGAGA
Estimated probability of a mutation per base pair
in a single generation is 10^-8
32
Other mutations - recombination
Copy 1
Copy 2
Probability r_i (about 10^-8) of recombination at
position i.
child chromosome
33
Natural Selection
  • Example: being lactose tolerant is advantageous
    in northern Europe, hence there is positive
    selection in the LCT gene
  • different allele frequencies in LCT

34
Genetic Drift
  • Even without selection, the allele frequencies in
    the population are not fixed across time.
  • Consider the case where we assume Hardy-Weinberg
    Equilibrium (HWE), that is, individuals are
    mating randomly in the population.
  • Suppose that at the first generation the allele
    frequencies are p_0 (of a) and q_0 = 1 - p_0 (of A).
  • Under HWE, E[p_{k+1}] = p_k, but Var[p_{k+1}] > 0, so
    later generations will generally have p_{k+1} ≠ p_0.

35
The rate of the drift
  • N: effective population size (if all individuals
    are entirely unrelated, then N is the total
    population size).
  • Under an assumption of constant population size,
    if X_k counts the number of occurrences of a at
    generation k, then X_{k+1} ~ B(N, p_k).
  • E[p_{k+1}] = E[X_{k+1}]/N = p_k.
  • Var[p_{k+1}] = p_k(1-p_k)/N.
  • The effect of genetic drift depends on the time
    and the effective population size. A small
    population increases the effect.
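The binomial sampling step above is easy to simulate. The sketch below is a minimal illustration of drift under a constant population size, drawing X_{k+1} ~ B(N, p_k) each generation. The function name `wright_fisher`, the seeds, and the population sizes in the example are assumptions made for the demonstration.

```python
import random


def wright_fisher(p0, population_size, generations, seed=0):
    """Simulate allele-frequency drift; returns [p_0, p_1, ..., p_generations]."""
    rng = random.Random(seed)
    p = p0
    freqs = [p]
    for _ in range(generations):
        # X_{k+1} ~ B(N, p_k): each of the N copies is allele a with probability p_k.
        count = sum(rng.random() < p for _ in range(population_size))
        p = count / population_size
        freqs.append(p)
    return freqs


# Same starting frequency, different population sizes.
small = wright_fisher(0.5, population_size=20, generations=50, seed=1)
large = wright_fisher(0.5, population_size=2000, generations=50, seed=1)
print("N=20:  ", [round(f, 2) for f in small[::10]])
print("N=2000:", [round(f, 2) for f in large[::10]])
```

Because Var[p_{k+1}] = p_k(1-p_k)/N, the trajectory for the small population typically wanders much further from 0.5 than the trajectory for the large one, and once a frequency reaches 0 or 1 it stays there.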

36
Bottleneck effect
Effective population size
Time
The rate of genetic drift is higher.
37
The Wright-Fisher Model
Generation 1: allele frequency 1/9
38
The Wright-Fisher Model
Generation 2: allele frequency 1/9
39
The Wright-Fisher Model
Generation 3: allele frequency 1/9
40
The Wright-Fisher Model
Generation 4: allele frequency 1/3
41
The Wright-Fisher Model
42
The Wright-Fisher Model
43
Ancestral population
44
Ancestral population
migration
45
  • different allele frequencies

Ancestral population
Genetic drift
46
Population Substructure
  • Imagine that all the cases are collected from
    Africa, and all the controls are from Europe.
  • Many association signals are going to be found
  • The vast majority of them are false

What can we do about it?
47
Jakobsson et al., Nature 451: 998-1003
48
Principal Component Analysis
  • Dimensionality reduction
  • Based on linear algebra (Singular Value
    Decomposition)
  • Intuition: find the most important features of
    the data; project the data onto the axis with the
    largest variance.

49
Principal Component Analysis
Plotting the data on a one-dimensional line for
which the spread is maximized.
50
Principal Component Analysis
  • In our case, we want to look at two dimensions at
    a time.
  • The original data has many dimensions: each SNP
    corresponds to one dimension.
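To illustrate the idea of projecting genotype data onto the axis of largest variance, the sketch below finds the first principal axis with plain power iteration. This is a dependency-free stand-in for the Singular Value Decomposition mentioned in the slides, not the method used in the lecture. The function name, the 0/1/2 genotype coding (copies of one allele), and the toy matrix are all assumptions.

```python
import math


def top_principal_axis(rows, iterations=100):
    """Return (axis, projections): the direction of largest variance and each row's score on it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Slightly uneven start vector; it only needs to be non-orthogonal to the top axis.
    axis = [1.0 + 0.1 * j for j in range(d)]
    norm = math.sqrt(sum(a * a for a in axis))
    axis = [a / norm for a in axis]

    for _ in range(iterations):
        # Multiply by X^T X without forming the matrix: scores = X a, then X^T scores.
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        new_axis = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:  # no variance left in this direction (e.g. all rows identical)
            break
        axis = [a / norm for a in new_axis]

    projections = [sum(x * a for x, a in zip(row, axis)) for row in centered]
    return axis, projections


# Toy genotype matrix: rows are individuals, columns are SNPs (0, 1 or 2 copies of an allele).
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],  # three individuals from one population
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],  # three individuals from another
]
axis, projections = top_principal_axis(genotypes)
print([round(p, 2) for p in projections])
```

In this toy example the first three individuals land on one side of zero and the last three on the other, which is the kind of separation the population-structure plots rely on. A second axis could be obtained by removing the first component from the data and repeating the procedure.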

51
Ancestry Inference
  • To what extent can population structure be
    detected from SNP data?
  • What can we learn from these inferences?
  • Can we build the tree of life?
  • How do we analyze complex (mixed) populations?

Novembre et al., Nature, 2008
52
Challenge 2
  • Modeling Correlation

53
A typical associated region
54
Linkage Disequilibrium
55
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
56
Phasing - haplotype inference
Haplotypes
ATCCGA AGACGC
  • Cost-effective genotyping technology gives
    genotypes and not haplotypes.

57
Inferring Haplotypes From Trios
Parent 1: 122112
Parent 2: 210022
Child: 120222
Assumption: no recombination
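The transcript does not say how the genotype digits are encoded. The sketch below assumes a common convention (0 and 1 for the two homozygous genotypes, 2 for a heterozygous site) and, as on the slide, no recombination. Under those assumptions the child's two haplotypes can be read off site by site. The function name `phase_child` is hypothetical, and the code does not check the trio for Mendelian consistency.

```python
def phase_child(parent1, parent2, child):
    """Return the child's haplotypes (inherited from parent 1, inherited from parent 2).

    Assumed coding: '0' or '1' = homozygous for that allele, '2' = heterozygous.
    Sites that the trio cannot resolve are marked '?'.
    """
    flip = {"0": "1", "1": "0"}
    from_p1, from_p2 = [], []
    for a, b, c in zip(parent1, parent2, child):
        if c in "01":        # homozygous child: both copies carry allele c
            from_p1.append(c)
            from_p2.append(c)
        elif a in "01":      # heterozygous child, parent 1 homozygous: it transmitted a
            from_p1.append(a)
            from_p2.append(flip[a])
        elif b in "01":      # heterozygous child, parent 2 homozygous: it transmitted b
            from_p1.append(flip[b])
            from_p2.append(b)
        else:                # all three heterozygous: phase is ambiguous
            from_p1.append("?")
            from_p2.append("?")
    return "".join(from_p1), "".join(from_p2)


print(phase_child("122112", "210022", "120222"))
```

Only the last site, where both parents and the child are heterozygous, stays unresolved; this is the case where statistical phasing is still needed.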
58
Maximum Likelihood
  • Until now we discussed the case of two hypotheses
    (null, and alternative).
  • In some cases we are interested in many
    hypotheses and we search for the best one.
  • Normally a hypothesis will be defined by a set of
    parameters θ.
  • The likelihood of θ is L(θ) = Pr(D | θ), where D
    is the observed data. We are interested in the
    hypothesis that maximizes the likelihood.

59
Soft assignment
  • Compute probabilities p_h for all possible
    haplotypes h.
  • For each genotype g, we do not assign one pair of
    haplotypes, but a distribution of possible pairs.
  • The set of pairs of haplotypes compatible with g
    is denoted as C(g).
  • In soft assignment, a pair is
    explaining g with probability

60
Phasing via Maximum Likelihood
  • Soft decision
  • Hard decision

61
An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1
Haplotype frequencies:
0 0 0 1 0  1/12
0 0 0 1 1  1/12
1 0 0 0 1  1/12
1 0 0 1 0  1/12
1 0 0 1 1  3/12
1 0 1 0 1  1/12
1 0 1 1 1  2/12
1 1 0 1 1  1/12
1 1 1 1 1  1/12
Candidate haplotypes for each genotype (weight ¼ each):
1 0 0 0 1, 1 0 1 1 1, 1 0 0 1 1, 1 0 1 0 1
0 0 0 1 0, 1 0 0 1 1, 0 0 0 1 1, 1 0 0 1 0
1 0 0 1 1, 1 1 1 1 1, 1 0 1 1 1, 1 1 0 1 1
62
An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1

Haplotype frequencies (previous, updated):
0 0 0 1 0  1/12  .125
0 0 0 1 1  1/12  .042
1 0 0 0 1  1/12  .067
1 0 0 1 0  1/12  .042
1 0 0 1 1  3/12  .325
1 0 1 0 1  1/12  .1
1 0 1 1 1  2/12  .133
1 1 0 1 1  1/12  .067
1 1 1 1 1  1/12  .1
Candidate haplotypes for each genotype (initial weight ¼ each),
followed by the probabilities of the two pairs:
1 0 0 0 1, 1 0 1 1 1, 1 0 0 1 1, 1 0 1 0 1   0.4 0.6
0 0 0 1 0, 1 0 0 1 1, 0 0 0 1 1, 1 0 0 1 0   0.75 0.25
1 0 0 1 1, 1 1 1 1 1, 1 0 1 1 1, 1 1 0 1 1   0.6 0.4
63
An iterative algorithm
Data: 1 0 h h 1 | h 0 0 1 h | 1 h h 1 1

Haplotype frequencies:
0 0 0 1 0  1/6
0 0 0 1 1  0
1 0 0 0 1  0
1 0 0 1 0  0
1 0 0 1 1  1/2
1 0 1 0 1  1/6
1 0 1 1 1  0
1 1 0 1 1  0
1 1 1 1 1  1/6
Candidate haplotypes for each genotype (initial weight ¼ each),
followed by the probabilities of the two pairs:
1 0 0 0 1, 1 0 1 1 1, 1 0 0 1 1, 1 0 1 0 1   0 1
0 0 0 1 0, 1 0 0 1 1, 0 0 0 1 1, 1 0 0 1 0   1 0
1 0 0 1 1, 1 1 1 1 1, 1 0 1 1 1, 1 1 0 1 1   1 0
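The numbers on these three "An iterative algorithm" slides can be reproduced with a short expectation-maximization loop. The sketch below is one possible reading of the example, not the lecturer's code: it assumes that `h` marks a heterozygous site, that the starting point gives equal probability to every haplotype pair compatible with a genotype, and that haplotype frequencies are re-estimated as expected counts divided by the number of chromosomes. All function names are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs that explain a genotype ('h' = heterozygous site)."""
    het_sites = [i for i, s in enumerate(genotype) if s == "h"]
    pairs = set()
    for bits in product("01", repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het_sites, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"  # the second copy carries the other allele
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def m_step(genotypes, posteriors):
    """Re-estimate haplotype frequencies from the pair probabilities."""
    counts = {}
    for g, post in zip(genotypes, posteriors):
        for (a, b), w in zip(compatible_pairs(g), post):
            counts[a] = counts.get(a, 0.0) + w
            counts[b] = counts.get(b, 0.0) + w
    chromosomes = 2 * len(genotypes)
    return {h: c / chromosomes for h, c in counts.items()}


def e_step(genotypes, freqs):
    """Probability of each compatible pair, proportional to the product of its frequencies."""
    posteriors = []
    for g in genotypes:
        weights = [freqs[a] * freqs[b] for a, b in compatible_pairs(g)]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def run_em(genotypes, iterations):
    """Start from uniform pair probabilities, then alternate E and M steps."""
    posteriors = []
    for g in genotypes:
        k = len(compatible_pairs(g))
        posteriors.append([1.0 / k] * k)
    freqs = m_step(genotypes, posteriors)
    for _ in range(iterations):
        posteriors = e_step(genotypes, freqs)
        freqs = m_step(genotypes, posteriors)
    return freqs, posteriors


genotypes = ["10hh1", "h001h", "1hh11"]
for n_iter in (0, 1, 50):
    freqs, posteriors = run_em(genotypes, n_iter)
    print(n_iter, round(freqs["10011"], 3), [[round(p, 2) for p in post] for post in posteriors])
```

With zero iterations the frequency of 1 0 0 1 1 is 3/12; after one iteration it is 0.325 and the pair probabilities for the first genotype are 0.4 and 0.6; after many iterations the frequency approaches 1/2 and every genotype is explained by a single pair, as on the last of the three slides.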
64
Expectation Maximization (EM)
  • D: given data
  • θ: parameters that need to be estimated
  • Z: latent (missing) variables

65
EM rationale
  • Lemma.
  • Proof: First, note that

66
(No Transcript)
67
QED
68
MLE from Incomplete Data
  • Finding MLE parameters: a nonlinear optimization
    problem

log P(x | θ)
E_θ[log P(x, y | θ)]
θ
69
MLE from Incomplete Data
log P(x | θ)
E_θ[log P(x, y | θ)]
θ
70
EM for phasing
71
  • This is maximized for

72
Phasing summary
  • Expectation maximization is easy to implement,
    works reasonably well in practice.
  • We can use other models (tree models) to improve
    the accuracy of the phasing prediction.

73
Human Genetics: where to?
  • We can typically explain 5-15% of the
    heritability of common diseases.
  • Where is the missing heritability?
  • Rare variants
  • Gene-gene interactions
  • Gene-environment interactions
  • Creative computational methods are key to the
    discovery of the missing heritability.

74
Course: Computational Human Genetics
  • Semester B (bet)
  • More background in human genetics, statistics,
    and machine learning.
  • Studying genetics of human disease
  • Privacy and forensics
  • Analysis of new technologies (sequencing)
  • Population genetics: detecting selection,
    mutation rates, recombination rates, etc.
  • Reconstructing human history