Title: CS 6293 Advanced Topics: Translational Bioinformatics
1CS 6293 Advanced Topics: Translational
Bioinformatics
- Intro Ch2 - Data-Driven View of Disease Biology
- Jianhua Ruan
2Road map
- What is translational bioinformatics
- Probability and statistics background
- Data-driven view of disease biology
- Bayesian Inference
- Network of functionally related genes
- Evaluation of network
3What is translational bioinformatics?
- Advances in biological technology (for
high-throughput data collection) and computing
technology (for cheap and efficient large-scale
data storage, processing, and management) have
shifted modern biomedical research towards the
integrative and translational
- Translational medical research
- the process of moving discoveries and innovations
generated during research in the laboratory, and
in preclinical studies, to the development of
trials and studies in humans, leading to improved
diagnosis, prognosis, and treatment.
- Barriers to translating our molecular
understanding into technologies that impact
patients
- understanding health market size and forces, the
regulatory milieu, how to harden the technology
for routine use, and how to navigate an
increasingly complex intellectual property
landscape
- Connecting the stuff of molecular biology to the
clinical world
- The book chapters in this PLoS Comput Biol
collection deal mostly with computational
methodologies that are likely to have an impact on
clinical research / practice
4Topic 1 Network-based understanding of disease
mechanisms
- Chapter 2 Data-Driven View of Disease Biology
- Chapter 4 Protein Interactions and Disease
- Chapter 5 Network Biology Approach to Complex
Diseases
- Chapter 15 Disease Gene Prioritization
5Topic 2 drug design / discovery using
computational / systems approaches
- Chapter 3 Small Molecules and Disease
- Chapter 7 Pharmacogenomics
- Chapter 17 Bioimage Informatics for Systems
Pharmacology
6Topic 3 Genome sequencing and disease
- Chapter 6 Structural Variation and Medical
Genomics
- Chapter 12 Human Microbiome Analysis
- Chapter 14 Cancer Genome Analysis
7Topic 4 Automated knowledge discovery and
representation
- Chapter 8 Biological Knowledge Assembly and
Interpretation
- Chapter 9 Analyses Using Disease Ontologies
- Chapter 13 Mining Electronic Health Records in
the Genomics Era
- Chapter 16 Text Mining for Translational
Bioinformatics
8Ch2 Data-Driven View of Disease Biology
- Diverse genome-scale datasets exist
- Genome sequences
- Microarrays
- genome-wide association studies
- RNA interference screens
- Proteomics databases
- Databases of gene functions, pathways, chemicals,
protein interactions, etc.
- Promise to provide systems-level understanding of
disease mechanisms
- Modeling (understand)
- Inference (make prediction)
- Integration is the key challenge
- Experimental noise
- Biological heterogeneity, e.g., source of material:
cells in culture or biopsied tissues?
- Computational heterogeneity, e.g., data format:
discrete or continuous?
9Bayesian Inference
- Powerful tool used to make predictions based on
experimental evidence
- Simple yet elegant probabilistic theories
- Easy to understand and implement
- Data-driven modeling
- No explicit assumption about the underlying
biological mechanisms
10Probability Basics
- Definition (informal)
- Probabilities are numbers assigned to events that
indicate how likely it is that the event will
occur when a random experiment is performed
- A probability law for a random experiment is a
rule that assigns probabilities to the events in
the experiment
- The sample space S of a random experiment is the
set of all possible outcomes
11Example
$0 \le P(A_i) \le 1$, $P(S) = 1$
12Random variable
- A random variable is a function from the sample
space to the space of possible values of the variable
- When we toss a coin, the number of times that we
see heads is a random variable
- Can be discrete or continuous
- The resulting number after rolling a die
- The weight of an individual
13Cumulative distribution function (cdf)
- The cumulative distribution function $F_X(x)$ of a
random variable $X$ is defined as the probability
of the event $\{X \le x\}$
- $F_X(x) = P(X \le x)$ for $-\infty < x < \infty$
14Probability density function (pdf)
- The probability density function of a continuous
random variable $X$, if it exists, is defined as
the derivative of $F_X(x)$
- For discrete random variables, the equivalent to
the pdf is the probability mass function (pmf)
15Probability density function vs probability
- What is the probability of somebody weighing
200 lb?
- The figure shows about 0.62
- What is the probability of 200.00001 lb?
- The right question would be
- What's the probability of somebody weighing
199-201 lb?
- The probability mass function is a true probability
- The chance to get any face is 1/6
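The distinction between a density and a probability can be checked numerically. The sketch below assumes, purely for illustration, that weight is normally distributed with a mean of 170 lb and a standard deviation of 30 lb (these parameters are not from the slides). It computes the probability of an interval as a difference of two CDF values and shows that a single exact value has probability zero.

```python
import math

def normal_cdf(x, mean, sd):
    """F_X(x) = P(X <= x) for a normally distributed X."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def interval_probability(lo, hi, mean, sd):
    """P(lo <= X <= hi) = F_X(hi) - F_X(lo)."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

# Hypothetical parameters, chosen only for illustration.
MEAN_LB, SD_LB = 170.0, 30.0

p_interval = interval_probability(199.0, 201.0, MEAN_LB, SD_LB)
p_point = interval_probability(200.0, 200.0, MEAN_LB, SD_LB)

print(f"P(199 <= weight <= 201) = {p_interval:.4f}")
print(f"P(weight == 200 exactly) = {p_point}")
```

Under these assumptions the interval has a small but nonzero probability, while the exact value 200 lb has probability 0 even though the density there is positive.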
16Some common distributions
- Discrete
- Binomial
- Multinomial
- Geometric
- Hypergeometric
- Poisson
- Continuous
- Normal (Gaussian)
- Uniform
- Extreme value distribution (EVD)
- Gamma
- Beta
17Probabilistic Calculus
- If A, B are mutually exclusive
- $P(A \cup B) = P(A) + P(B)$
- Thus $P(\text{not}(A)) = P(A^c) = 1 - P(A)$
(Figure: Venn diagram of events A and B)
18Probabilistic Calculus
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
19Conditional probability
- The joint probability of two events A and B,
$P(A \cap B)$, or simply $P(A, B)$, is the probability that
events A and B occur at the same time.
- The conditional probability $P(B \mid A)$ is the
probability that B occurs given that A occurred.
- $P(A \mid B) = P(A \cap B) / P(B)$
20Example
- Roll a die
- If I tell you the number is less than 4
- What is the probability of an even number?
- $P(d \text{ even} \mid d < 4) = P(d \text{ even} \cap d < 4) / P(d < 4)$
- $= P(d = 2) / P(d = 1, 2, \text{ or } 3) = (1/6) / (3/6) = 1/3$
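The die example can be reproduced by direct enumeration of the sample space. This is a minimal sketch assuming a fair six-sided die (equally likely outcomes); the helper names `prob` and `conditional_prob` are illustrative only.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(event) when all outcomes in sample_space are equally likely."""
    favorable = [s for s in sample_space if event(s)]
    return Fraction(len(favorable), len(sample_space))

def conditional_prob(event_a, event_b, sample_space):
    """P(A | B) = P(A and B) / P(B)."""
    p_b = prob(event_b, sample_space)
    p_ab = prob(lambda s: event_a(s) and event_b(s), sample_space)
    return p_ab / p_b

die = [1, 2, 3, 4, 5, 6]

def is_even(d):
    return d % 2 == 0

def less_than_4(d):
    return d < 4

p_even_given_lt4 = conditional_prob(is_even, less_than_4, die)
print(p_even_given_lt4)  # 1/3
```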
21Independence
- $P(A \mid B) = P(A \cap B) / P(B)$
- $\Rightarrow P(A \cap B) = P(B) P(A \mid B)$
- A, B are independent iff
- $P(A \cap B) = P(A) P(B)$
- That is, $P(A) = P(A \mid B)$
- Also implies that $P(B) = P(B \mid A)$
- $P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)$
22Examples
- Are the events d even and d < 4 independent?
- $P(d \text{ even and } d < 4) = 1/6$
- $P(d \text{ even}) = 1/2$
- $P(d < 4) = 1/2$
- $1/2 \times 1/2 = 1/4 > 1/6$, so they are not independent
- If your die actually has 8 faces, will d even
and d < 5 be independent?
- Are even in first roll and even in second
roll independent?
- Playing cards: are the suit and rank independent?
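The independence questions for the six-faced and eight-faced dice can be settled by checking the product rule directly. The sketch below assumes fair dice; the function name `is_independent` is illustrative.

```python
from fractions import Fraction

def is_independent(event_a, event_b, faces):
    """Check P(A and B) == P(A) * P(B) for a fair die with `faces` sides."""
    outcomes = list(range(1, faces + 1))
    n = len(outcomes)
    p_a = Fraction(sum(1 for d in outcomes if event_a(d)), n)
    p_b = Fraction(sum(1 for d in outcomes if event_b(d)), n)
    p_ab = Fraction(sum(1 for d in outcomes if event_a(d) and event_b(d)), n)
    return p_ab == p_a * p_b

def even(d):
    return d % 2 == 0

# Six faces: "even" vs "less than 4"
six_sided = is_independent(even, lambda d: d < 4, faces=6)
# Eight faces: "even" vs "less than 5"
eight_sided = is_independent(even, lambda d: d < 5, faces=8)

print(six_sided, eight_sided)  # False True
```

With eight faces, exactly half of the outcomes below 5 are even, so knowing that d < 5 tells you nothing about parity; with six faces, only one of the three outcomes below 4 is even.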
23Theorem of total probability
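- Let $B_1, B_2, \ldots, B_N$ be mutually exclusive events
whose union equals the sample space S. We refer
to these sets as a partition of S.
- An event A can be represented as
- $A = A \cap S = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_N)$
- Since $B_1, B_2, \ldots, B_N$ are mutually exclusive, then
- $P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_N)$
- And therefore
- $P(A) = P(A \mid B_1) P(B_1) + P(A \mid B_2) P(B_2) + \cdots + P(A \mid B_N) P(B_N) = \sum_i P(A \mid B_i) P(B_i)$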
24Example
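- Roll a loaded die: 50% of the time it shows 6, and 10% of
the time each of 1 to 5
- What's the probability of an even number?
- Prob(even)
- $= \text{Prob}(\text{even} \mid d < 6)\, \text{Prob}(d < 6) + \text{Prob}(\text{even} \mid d = 6)\, \text{Prob}(d = 6)$
- $= 2/5 \times 0.5 + 1 \times 0.5$
- $= 0.7$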
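The loaded-die calculation is a direct application of the theorem of total probability. A minimal sketch, with the partition taken as {d < 6} and {d = 6} and the helper name `total_probability` chosen for illustration:

```python
def total_probability(conditionals, priors):
    """P(A) = sum_i P(A | B_i) * P(B_i) over a partition B_1..B_N."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("probabilities of a partition must sum to 1")
    return sum(c * p for c, p in zip(conditionals, priors))

# Partition: B1 = {d < 6}, B2 = {d = 6}
# P(even | d < 6) = 2/5 (faces 2 and 4 out of 1..5), P(even | d = 6) = 1
p_even = total_probability(conditionals=[2 / 5, 1.0], priors=[0.5, 0.5])
print(round(p_even, 3))  # 0.7
```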
25Another example
- We have a box of dice: 99% of them are fair, with
probability 1/6 for each face, and 1% are loaded so
that six comes up 50% of the time. We pick a die
at random and roll it. What's the probability we'll
get a six?
- $P(\text{six}) = P(\text{six} \mid \text{fair}) P(\text{fair}) + P(\text{six} \mid \text{loaded}) P(\text{loaded})$
- $= 1/6 \times 0.99 + 0.5 \times 0.01 = 0.17 > 1/6$
26Bayes' theorem
- $P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)$
- $\Rightarrow P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)}$
- $P(A \mid B)$: likelihood
- $P(B)$: prior of B
- $P(B \mid A)$: posterior probability of B
- $P(A)$: normalizing constant
This is known as Bayes' Theorem or Bayes' Rule, and
is (one of) the most useful relations in
probability and statistics. Bayes' Theorem is
definitely the fundamental relation in
Statistical Pattern Recognition.
27Bayes' theorem (cont'd)
- Given $B_1, B_2, \ldots, B_N$, a partition of the sample
space S. Suppose that event A occurs. What is the
probability of event $B_j$?
- $P(B_j \mid A) = P(A \mid B_j) P(B_j) / P(A)$
- $= P(A \mid B_j) P(B_j) / \sum_k P(A \mid B_k) P(B_k)$
$B_j$: different models. Given the observation A,
should you choose a model that maximizes $P(B_j \mid A)$
or $P(A \mid B_j)$? It depends on how much you know
about $B_j$!
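The difference between maximizing the likelihood and maximizing the posterior can be seen with the box-of-dice numbers used earlier (99% fair, 1% loaded, loaded dice show six half of the time). The sketch below is illustrative; `posterior` is a hypothetical helper name.

```python
def posterior(likelihoods, priors):
    """Return P(B_j | A) for every B_j in a partition, via Bayes' theorem."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(A), the normalizing constant
    return [j / evidence for j in joint]

# Models: B1 = fair die, B2 = loaded die; observation A = "rolled a six"
likelihoods = [1 / 6, 0.5]   # P(six | fair), P(six | loaded)
priors = [0.99, 0.01]        # P(fair), P(loaded)

post_fair, post_loaded = posterior(likelihoods, priors)
print(f"P(fair | six) = {post_fair:.3f}")
print(f"P(loaded | six) = {post_loaded:.3f}")
```

The likelihood alone favors the loaded model (0.5 versus 1/6), but once the prior is taken into account the fair model remains far more probable after a single six.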
28Example
- Prosecutor's fallacy
- Some crime happened
- The criminal did not leave any evidence, except
some hair
- The police got his DNA from the hair
- Some expert matched the DNA with that of a
suspect
- Expert said that both the false-positive and
false-negative rates are $10^{-6}$
- Can this be used as evidence of guilt against
the suspect?
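29Prosecutor's fallacy
- $\text{Prob}(\text{match} \mid \text{innocent}) = 10^{-6}$
- $\text{Prob}(\text{no match} \mid \text{guilty}) = 10^{-6}$
- $\text{Prob}(\text{match} \mid \text{guilty}) = 1 - 10^{-6} \approx 1$
- $\text{Prob}(\text{no match} \mid \text{innocent}) = 1 - 10^{-6} \approx 1$
- $\text{Prob}(\text{guilty} \mid \text{match}) = ?$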
30Prosecutor's fallacy
- $P(g \mid m) = P(m \mid g) P(g) / P(m)$
- $\approx P(g) / P(m)$
- $P(g)$: the probability for someone to be guilty
with no other evidence
- $P(m)$: the probability of a DNA match
- How to get these two numbers?
- We don't really care about $P(m)$
- We want to compare two models
- $P(g \mid m)$ and $P(i \mid m)$
31Prosecutor's fallacy
- $P(i \mid m) = P(m \mid i) P(i) / P(m)$
- $= 10^{-6} P(i) / P(m)$
- Therefore
- $P(i \mid m) / P(g \mid m) \approx 10^{-6} P(i) / P(g)$
- $P(i) + P(g) = 1$
- It is clear, therefore, that whether we can
conclude the suspect is guilty depends on the
prior probability $P(i)$
- How do you get $P(i)$?
32Prosecutor's fallacy
- How do you get $P(i)$?
- It depends on what other information you have on
the suspect
- Say the suspect has no other connection with
the crime, and the overall crime rate is $10^{-7}$
- That's a reasonable prior for $P(g)$
- $P(g) = 10^{-7}$, $P(i) \approx 1$
- $P(i \mid m) / P(g \mid m) \approx 10^{-6} P(i) / P(g) = 10^{-6} / 10^{-7} = 10$
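33- $P(\text{observation} \mid \text{model1}) / P(\text{observation} \mid \text{model2})$: likelihood-ratio test
- LR test
- Often take the logarithm: $\log \left( P(m \mid i) / P(m \mid g) \right)$
- Log likelihood ratio (score)
- Or log odds ratio (score)
- Bayesian model selection
- $\log \left( P(\text{model1} \mid \text{observation}) / P(\text{model2} \mid \text{observation}) \right)$
- $= \text{LLR} + \log P(\text{model1}) - \log P(\text{model2})$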
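The decomposition of the log posterior odds into a log likelihood ratio plus a log prior odds term can be sketched as follows. The example plugs in the DNA-match numbers from the earlier slides (error rates of 10^-6, prior guilt of 10^-7); the function name and the choice of base-10 logarithms are assumptions made for readability.

```python
import math

def log_posterior_odds(lik1, lik2, prior1, prior2):
    """log10 [P(model1 | obs) / P(model2 | obs)] = LLR + log prior odds."""
    llr = math.log10(lik1) - math.log10(lik2)
    log_prior_odds = math.log10(prior1) - math.log10(prior2)
    return llr + log_prior_odds

# model1 = innocent, model2 = guilty, observation = DNA match
score = log_posterior_odds(
    lik1=1e-6,          # P(match | innocent)
    lik2=1 - 1e-6,      # P(match | guilty)
    prior1=1 - 1e-7,    # P(innocent)
    prior2=1e-7,        # P(guilty)
)
print(round(score, 3))  # 1.0
```

A score of about 1 on the log10 scale corresponds to posterior odds of roughly 10 to 1 in favor of model1, even though the log likelihood ratio alone is about -6.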
34Prosecutor's fallacy
- $P(i \mid m) / P(g \mid m) \approx 10^{-6} / 10^{-7} = 10$
- Therefore, we would say the suspect is more
likely to be innocent than guilty, given only the
DNA samples
- We can also explicitly calculate $P(i \mid m)$
- $P(m) = P(m \mid i) P(i) + P(m \mid g) P(g)$
- $\approx 10^{-6} \times 1 + 1 \times 10^{-7}$
- $= 1.1 \times 10^{-6}$
- $P(i \mid m) = P(m \mid i) P(i) / P(m) \approx 1 / 1.1 = 0.91$
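The explicit calculation of P(i | m) can be written as a small function of the prior P(g), which makes the dependence on the prior easy to explore. The first prior below (10^-7) is the one used in the slides; the other two are hypothetical values added only to show the trend.

```python
def posterior_innocent(p_match_given_innocent, p_match_given_guilty, p_guilty):
    """P(innocent | match) via Bayes' theorem with P(i) = 1 - P(g)."""
    p_innocent = 1.0 - p_guilty
    p_match = (p_match_given_innocent * p_innocent
               + p_match_given_guilty * p_guilty)
    return p_match_given_innocent * p_innocent / p_match

FALSE_POSITIVE = 1e-6        # P(match | innocent)
TRUE_POSITIVE = 1 - 1e-6     # P(match | guilty)

# 1e-7 is the prior from the slides; 1e-6 and 1e-3 are hypothetical.
for p_guilty in (1e-7, 1e-6, 1e-3):
    p_i = posterior_innocent(FALSE_POSITIVE, TRUE_POSITIVE, p_guilty)
    print(f"P(g) = {p_guilty:.0e}  ->  P(i | m) = {p_i:.4f}")
```

As the prior probability of guilt grows, the same DNA match leaves less and less probability on innocence.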
35Prosecutor's fallacy
- If you have other evidence, $P(g)$ could be much
larger than the average crime rate
- In that case, the DNA test may give you higher
confidence
- How to decide the prior?
- Subjective?
- Important?
- There have been debates about Bayesian statistics
historically
- Some strongly support it, some are strongly against it
- Growing interest in many fields
- However, there is no question about conditional
probability
- If all priors are equally probable, decisions
based on Bayesian inference and the likelihood test are
equivalent
36Another example
- A test for a rare disease claims that it will
report a positive result for 99.5% of people with
the disease, and a negative result for 99.9% of
those without.
- The disease is present in the population at 1 in
100,000
- What is $P(\text{disease} \mid \text{positive test})$?
- What is $P(\text{disease} \mid \text{negative test})$?
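The two questions can be answered with Bayes' theorem. The sketch below assumes one reading of the slide: a sensitivity of 99.5% (positive result for people with the disease) and a specificity of 99.9% (negative result for people without it), with a prevalence of 1 in 100,000. The function name `bayes_binary` is illustrative.

```python
def bayes_binary(p_result_given_d, p_result_given_not_d, prevalence):
    """P(disease | result) for a two-class problem (disease / no disease)."""
    numerator = p_result_given_d * prevalence
    denominator = numerator + p_result_given_not_d * (1.0 - prevalence)
    return numerator / denominator

SENSITIVITY = 0.995        # P(positive | disease)
SPECIFICITY = 0.999        # P(negative | no disease), assumed reading
PREVALENCE = 1 / 100_000   # P(disease)

p_d_given_pos = bayes_binary(SENSITIVITY, 1 - SPECIFICITY, PREVALENCE)
p_d_given_neg = bayes_binary(1 - SENSITIVITY, SPECIFICITY, PREVALENCE)

print(f"P(disease | positive test) = {p_d_given_pos:.4f}")
print(f"P(disease | negative test) = {p_d_given_neg:.2e}")
```

Under these assumptions a positive result still leaves the probability of disease at about 1%, because false positives among the many healthy people outnumber true positives among the few who are sick.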
37Yet another example
- We've talked about the box of casino dice
- 99% fair, 1% loaded (50% at six)
- We said if we randomly pick a die and roll, we
have a 17% chance to get a six
- If we get 3 sixes in a row, what's the chance that
the die is loaded?
- How about 5 sixes in a row?
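38- $P(\text{loaded} \mid \text{3 sixes in a row}) = P(\text{3 sixes in a row} \mid \text{loaded}) P(\text{loaded}) / P(\text{3 sixes in a row})$
$= 0.5^3 \times 0.01 / (0.5^3 \times 0.01 + (1/6)^3 \times 0.99) = 0.21$
- $P(\text{loaded} \mid \text{5 sixes in a row}) = P(\text{5 sixes in a row} \mid \text{loaded}) P(\text{loaded}) / P(\text{5 sixes in a row})$
$= 0.5^5 \times 0.01 / (0.5^5 \times 0.01 + (1/6)^5 \times 0.99) = 0.71$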
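The same calculation generalizes to any number of consecutive sixes. A minimal sketch using the numbers from the example (1% loaded dice, loaded dice show six with probability 0.5), assuming the rolls are independent given the die; the function name is illustrative.

```python
def p_loaded_given_sixes(n_sixes, p_loaded=0.01, p_six_loaded=0.5, p_six_fair=1 / 6):
    """P(loaded | n sixes in a row), assuming independent rolls of one die."""
    loaded = (p_six_loaded ** n_sixes) * p_loaded
    fair = (p_six_fair ** n_sixes) * (1.0 - p_loaded)
    return loaded / (loaded + fair)

for n in (1, 3, 5):
    print(n, round(p_loaded_given_sixes(n), 2))
# 1 0.03
# 3 0.21
# 5 0.71
```

Each additional six multiplies the odds in favor of the loaded die by 0.5 / (1/6) = 3, so the evidence eventually overwhelms the small prior.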
39Relation to multiple testing problem
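- When searching a DNA sequence against a database,
you get a high score, with a significant p-value
- $\dfrac{P(\text{unrelated} \mid \text{high score})}{P(\text{related} \mid \text{high score})} = \dfrac{P(\text{high score} \mid \text{unrelated})\, P(\text{unrelated})}{P(\text{high score} \mid \text{related})\, P(\text{related})}$
- Likelihood ratio: $P(\text{high score} \mid \text{unrelated}) / P(\text{high score} \mid \text{related})$
- $P(\text{high score} \mid \text{unrelated})$ is much smaller than
$P(\text{high score} \mid \text{related})$
- But your database is huge, and most sequences
should be unrelated, so $P(\text{unrelated})$ is much
larger than $P(\text{related})$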
40Combining Diverse Data Using Bayesian Inference
- Want to calculate the probability that a gene of
unknown function is involved in a disease
- Collect positive and negative genes (gold
standard)
- Measure their activities under three hypothetical
conditions
41- Figure 1. Potential distributions of experimental
results obtained for datasets collected under
three different conditions.
Greene CS, Troyanskaya OG (2012) Chapter 2:
Data-Driven View of Disease Biology. PLoS Comput
Biol 8(12): e1002816. doi:10.1371/journal.pcbi.1002816
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002816
Higher score in cond A and lower score in cond C
=> involved in disease. $P(\text{involved in disease} \mid \text{experimental data})$?
42- Table 1. A contingency table for the experimental
results for Condition A.
Source: Greene CS, Troyanskaya OG (2012), doi:10.1371/journal.pcbi.1002816
43- Probability that a gene i is involved in disease
given the experimental results for gene i
- $P(D \mid E) = P(E \mid D)\, P(D) / P(E)$
- $P(E \mid D)$: likelihood
- $P(D)$: prior
- $P(E)$: normalizing factor
44Prior
45Combining datasets using Naïve Bayes
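- $P(D \mid E_B, E_C) \propto P(E_B, E_C \mid D) P(D)$
- $\propto P(E_B \mid D) P(E_C \mid D) P(D)$
- $P(\bar{D} \mid E_B, E_C) \propto P(E_B, E_C \mid \bar{D}) P(\bar{D})$
- $\propto P(E_B \mid \bar{D}) P(E_C \mid \bar{D}) P(\bar{D})$
- $P(D \mid E_B, E_C) + P(\bar{D} \mid E_B, E_C) = 1$
($\bar{D}$: the gene is not involved in the disease)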
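The naive Bayes combination can be sketched as multiplying per-dataset likelihoods for each class and then normalizing so that the two posteriors sum to 1. All numbers below are made up for illustration; in practice the likelihoods would be read off contingency tables built from the gold standard, and the function name is hypothetical.

```python
def naive_bayes_posterior(prior_d, likelihoods_d, likelihoods_not_d):
    """P(D | E_1..E_k), assuming the E's are conditionally independent
    given D and given not-D (the naive Bayes assumption)."""
    score_d = prior_d
    score_not_d = 1.0 - prior_d
    for lik_d, lik_not_d in zip(likelihoods_d, likelihoods_not_d):
        score_d *= lik_d          # P(E_k | D)
        score_not_d *= lik_not_d  # P(E_k | not D)
    # Normalize so that P(D | E) + P(not D | E) = 1
    return score_d / (score_d + score_not_d)

# Hypothetical values: prior P(D) = 0.01, and for conditions B and C
# P(E_B | D) = 0.6, P(E_C | D) = 0.7, P(E_B | not D) = 0.2, P(E_C | not D) = 0.3
p_disease = naive_bayes_posterior(
    prior_d=0.01,
    likelihoods_d=[0.6, 0.7],
    likelihoods_not_d=[0.2, 0.3],
)
print(f"P(D | E_B, E_C) = {p_disease:.4f}")
```

Adding another dataset only requires appending one more pair of likelihoods, which is what makes this scheme convenient for integrating many heterogeneous datasets.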
46Define Gold Standard (training samples) for
gene-gene network
- Positive examples: genes within the same
biological process
- Rely on expert-selected Gene Ontology terms
- biological regulation ?
- response to stimulus ?
- cell-matrix adhesion involved in tangential
migration using cell-cell interactions ?
- response to DNA damage stimulus ?
- aldehyde metabolism ?
- Negative examples: random gene pairs
- Assuming most gene pairs are not related
47Building a Network of Functionally Related Genes
- $P(FR_{ij} \mid E_{ij}) \propto P(E_{ij} \mid FR_{ij}) P(FR_{ij})$
- $E_{ij}$: evidence (score) for a functional
relationship between gene i and gene j from a
particular dataset
- For some datasets, e.g., physical interaction
data, obtaining the score $S_{ij}$ is trivial
- In general, $S_{ij}$ can be calculated using gene-wise
correlation
48Fisher's z-transformation
- Pearson correlation coefficient
- Z-transformation
- Purpose: stabilizing variance
- Source: Wikipedia
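As a sketch of how a gene-pair score might be derived from correlation, the code below computes the Pearson correlation coefficient r between two expression profiles and applies Fisher's z-transformation, z = (1/2) ln((1 + r) / (1 - r)) = atanh(r). For samples from a bivariate normal distribution, z is approximately normal with variance close to 1 / (n - 3), which is the variance-stabilizing property. The expression values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fisher_z(r):
    """Fisher's z-transformation; defined for -1 < r < 1."""
    return math.atanh(r)

# Hypothetical expression profiles of gene i and gene j over five samples
gene_i = [1.0, 2.1, 2.9, 4.2, 5.1]
gene_j = [0.8, 2.5, 2.7, 4.0, 5.5]

r = pearson_r(gene_i, gene_j)
z = fisher_z(r)
print(f"r = {r:.3f}, z = {z:.3f}")
```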
50- Figure 4. The highest and lowest contributing
datasets for the pair of APOE and PLTP are shown
(http://hefalmp.princeton.edu/gene/one_specific_gene/18543?argument=21697&context=0).
Source: Greene CS, Troyanskaya OG (2012), doi:10.1371/journal.pcbi.1002816
51- Figure 5. The diseases that are significantly
connected to APOE through the guilt by
association strategy used in HEFalMp.
Used Fisher's exact test
Source: Greene CS, Troyanskaya OG (2012), doi:10.1371/journal.pcbi.1002816
52- Figure 6. The genes that are most significantly
connected to Alzheimer disease genes using the
HEFalMp network and OMIM disease gene annotations
(http://hefalmp.princeton.edu/disease/all_genes/55?context=0).
Source: Greene CS, Troyanskaya OG (2012), doi:10.1371/journal.pcbi.1002816
53Evaluating Functional Relationship Networks
- TPR vs FPR plot (ROC curve) and AUC
- Separate gold standard into training and testing
- Cross validation
- Literature evaluation
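A ROC curve and its AUC can be computed from predicted scores and gold-standard labels with a few lines of code. The sketch below sweeps the threshold from high to low and integrates with the trapezoidal rule; the scores and labels are hypothetical, and in practice they would come from held-out gold-standard pairs.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all items tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predictions: 1 = functionally related pair, 0 = unrelated
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
print(f"AUC = {area:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.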
54Summary
- We talked about
- Prob / stats background
- Bayesian inference method to integrate multiple
large-scale, noisy datasets to predict
- gene-disease associations
- gene-gene associations
- Network useful for discovering novel gene
functions and directing experimental follow-ups
- Advantage over curated literature or analysis
based on a single dataset
- Limited by availability / quality of gold
standard data