CS 6293 Advanced Topics: Translational Bioinformatics - PowerPoint PPT Presentation

Title:

CS 6293 Advanced Topics: Translational Bioinformatics

Description:

CS 6293 Advanced Topics: Translational Bioinformatics Intro & Ch2 - Data-Driven View of Disease Biology Jianhua Ruan Road map What is translational bioinformatics ... – PowerPoint PPT presentation

Slides: 47
Provided by: Jianhu8
Learn more at: http://www.cs.utsa.edu

Transcript and Presenter's Notes

Title: CS 6293 Advanced Topics: Translational Bioinformatics


1
CS 6293 Advanced Topics: Translational
Bioinformatics
  • Intro & Ch2 - Data-Driven View of Disease Biology
  • Jianhua Ruan

2
Road map
  • What is translational bioinformatics
  • Probability and statistics background
  • Data-driven view of disease biology
  • Bayesian Inference
  • Network of functionally related genes
  • Evaluation of network

3
What is translational bioinformatics?
  • Advances in biological technology (for
    high-throughput data collection) and computing
    technology (for cheap and efficient large-scale
    data storage, processing, and management) have
    shifted modern biomedical research towards
    integrative and translational approaches
  • Translational medical research
  • the process of moving discoveries and innovations
    generated during research in the laboratory, and
    in preclinical studies, to the development of
    trials and studies in humans, leading to improved
    diagnosis, prognosis, and treatment.
  • Barriers to translating our molecular
    understanding into technologies that impact
    patients
  • understanding health market size and forces, the
    regulatory milieu, how to harden the technology
    for routine use, and how to navigate an
    increasingly complex intellectual property
    landscape
  • Connecting the stuff of molecular biology to the
    clinical world
  • The book chapters in this PLoS Comput Biol
    collection deal mostly with computational
    methodologies that are likely to have an impact on
    clinical research / practice

4
Topic 1 Network-based understanding of disease
mechanisms
  • Chapter 2 Data-Driven View of Disease Biology
  • Chapter 4 Protein Interactions and Disease
  • Chapter 5 Network Biology Approach to Complex
    Diseases
  • Chapter 15 Disease Gene Prioritization

5
Topic 2 drug design / discovery using
computational / systems approaches
  • Chapter 3 Small Molecules and Disease
  • Chapter 7 Pharmacogenomics
  • Chapter 17 Bioimage Informatics for Systems
    Pharmacology

6
Topic 3 Genome sequencing and disease
  • Chapter 6 Structural Variation and Medical
    Genomics
  • Chapter 12 Human Microbiome Analysis
  • Chapter 14 Cancer Genome Analysis

7
Topic 4 Automated knowledge discovery and
representation
  • Chapter 8 Biological Knowledge Assembly and
    Interpretation
  • Chapter 9 Analyses Using Disease Ontologies
  • Chapter 13 Mining Electronic Health Records in
    the Genomics Era
  • Chapter 16 Text Mining for Translational
    Bioinformatics

8
Ch2 Data-Driven View of Disease Biology
  • Diverse genome-scale datasets exist
  • Genome sequences
  • Microarrays
  • genome-wide association studies
  • RNA interference screens
  • Proteomics databases
  • Databases of gene functions, pathways, chemicals,
    protein interactions, etc.
  • Promise to provide systems level understanding of
    disease mechanisms
  • Modeling (understand)
  • Inference (make prediction)
  • Integration is the key challenge
  • Experimental noise
  • Biological heterogeneity, e.g., source of material:
    cells in culture or biopsied tissues?
  • Computational heterogeneity, e.g., data format:
    discrete or continuous?

9
Bayesian Inference
  • Powerful tool used to make predictions based on
    experimental evidence
  • Simple yet elegant probabilistic theories
  • Easy to understand and implement
  • Data-driven modeling
  • No explicit assumption about the underlying
    biological mechanisms

10
Probability Basics
  • Definition (informal)
  • Probabilities are numbers assigned to events that
    indicate how likely it is that the event will
    occur when a random experiment is performed
  • A probability law for a random experiment is a
    rule that assigns probabilities to the events in
    the experiment
  • The sample space S of a random experiment is the
    set of all possible outcomes

11
Example
$0 \le P(A_i) \le 1$, $P(S) = 1$
12
Random variable
  • A random variable is a function from the sample space
    to the space of possible values of the variable
  • When we toss a coin, the number of times that we
    see heads is a random variable
  • Can be discrete or continuous
  • The resulting number after rolling a die
  • The weight of an individual

13
Cumulative distribution function (cdf)
  • The cumulative distribution function $F_X(x)$ of a
    random variable $X$ is defined as the probability
    of the event $\{X \le x\}$
  • $F_X(x) = P(X \le x)$ for $-\infty < x < \infty$

14
Probability density function (pdf)
  • The probability density function of a continuous
    random variable X, if it exists, is defined as
    the derivative of FX(x)
  • For discrete random variables, the equivalent to
    the pdf is the probability mass function (pmf)

15
Probability density function vs probability
  • What is the probability of somebody weighing
    exactly 200 lb?
  • The figure shows about 0.62
  • What is the probability of 200.00001 lb?
  • The right question would be:
  • What is the probability of somebody weighing
    199-201 lb?
  • A probability mass function, by contrast, gives true probabilities
  • e.g., the chance of getting any given face of a fair die is 1/6
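
To make the density-versus-probability distinction concrete, here is a minimal Python sketch. It assumes, purely for illustration, that weight follows a normal distribution with a hypothetical mean of 170 lb and standard deviation of 30 lb (neither value comes from the slides), and it computes the probability of an interval as a difference of two CDF values.

```python
import math


def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, F(x) = P(X <= x)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def interval_probability(lo, hi, mean, sd):
    """P(lo <= X <= hi) = F(hi) - F(lo) for a continuous random variable X."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)


# Hypothetical parameters, chosen only for illustration.
MEAN_LB, SD_LB = 170.0, 30.0

# Probability of an interval is positive; probability of a single point is zero.
p_interval = interval_probability(199.0, 201.0, MEAN_LB, SD_LB)
p_point = interval_probability(200.0, 200.0, MEAN_LB, SD_LB)

print(f"P(199 <= weight <= 201) = {p_interval:.4f}")
print(f"P(weight == 200)        = {p_point:.4f}")
```

Under these assumed parameters the interval has a small but nonzero probability, while the single point 200 lb has probability zero, which is why the density value itself should not be read as a probability.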

16
Some common distributions
  • Discrete
  • Binomial
  • Multinomial
  • Geometric
  • Hypergeometric
  • Poisson
  • Continuous
  • Normal (Gaussian)
  • Uniform
  • Extreme value distribution (EVD)
  • Gamma
  • Beta

17
Probabilistic Calculus
  • If A, B are mutually exclusive
  • $P(A \cup B) = P(A) + P(B)$
  • Thus $P(\text{not } A) = P(A^c) = 1 - P(A)$

18
Probabilistic Calculus
  • $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

19
Conditional probability
  • The joint probability of two events A and B,
    $P(A \cap B)$, or simply $P(A, B)$, is the probability that
    events A and B occur at the same time.
  • The conditional probability $P(B \mid A)$ is the
    probability that B occurs given that A occurred.
  • $P(A \mid B) = P(A \cap B) / P(B)$

20
Example
  • Roll a die
  • If I tell you the number is less than 4
  • What is the probability of an even number?
  • $P(d \text{ even} \mid d < 4) = P(d \text{ even} \cap d < 4) / P(d < 4)$
  • $= P(d = 2) / P(d = 1, 2, \text{ or } 3) = (1/6) / (3/6) = 1/3$
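
The same conditional probability can be checked by enumerating the outcomes of a fair six-sided die. This is a small sketch using exact fractions; the helper names `prob` and `conditional_prob` are illustrative and not from the slides.

```python
from fractions import Fraction


def prob(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    outcomes = list(sample_space)
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


def conditional_prob(event_a, event_b, sample_space):
    """P(A | B) = P(A and B) / P(B)."""
    p_b = prob(event_b, sample_space)
    p_ab = prob(lambda o: event_a(o) and event_b(o), sample_space)
    return p_ab / p_b


die = range(1, 7)  # faces of a fair six-sided die


def is_even(d):
    return d % 2 == 0


def less_than_4(d):
    return d < 4


p_even_given_lt4 = conditional_prob(is_even, less_than_4, die)
print(p_even_given_lt4)  # 1/3
```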

21
Independence
  • $P(A \mid B) = P(A \cap B) / P(B)$
  • $\Rightarrow P(A \cap B) = P(B)\, P(A \mid B)$
  • A, B are independent iff
  • $P(A \cap B) = P(A)\, P(B)$
  • That is, $P(A) = P(A \mid B)$
  • Also implies that $P(B) = P(B \mid A)$
  • $P(A \cap B) = P(B)\, P(A \mid B) = P(A)\, P(B \mid A)$

22
Examples
  • Are the events "d even" and "d < 4" independent?
  • $P(d \text{ even and } d < 4) = 1/6$
  • $P(d \text{ even}) = 1/2$
  • $P(d < 4) = 1/2$
  • $1/2 \times 1/2 = 1/4 > 1/6$, so they are not independent
  • If your die actually has 8 faces, will "d even"
    and "d < 5" be independent?
  • Are "even in first roll" and "even in second
    roll" independent?
  • In a deck of playing cards, are the suit and rank independent?
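
One way to explore the die questions above is to test the product rule P(A and B) = P(A) P(B) by enumeration. The sketch below assumes fair dice with equally likely faces; the function names are illustrative only.

```python
from fractions import Fraction


def prob(event, faces):
    """P(event) for a fair die whose faces are equally likely."""
    faces = list(faces)
    return Fraction(sum(1 for d in faces if event(d)), len(faces))


def are_independent(event_a, event_b, faces):
    """True iff P(A and B) == P(A) * P(B), checked with exact fractions."""
    p_a = prob(event_a, faces)
    p_b = prob(event_b, faces)
    p_ab = prob(lambda d: event_a(d) and event_b(d), faces)
    return p_ab == p_a * p_b


def is_even(d):
    return d % 2 == 0


six_sided = range(1, 7)
eight_sided = range(1, 9)

# Six faces: "even" vs "less than 4".
indep_six = are_independent(is_even, lambda d: d < 4, six_sided)
# Eight faces: "even" vs "less than 5".
indep_eight = are_independent(is_even, lambda d: d < 5, eight_sided)

print("6-sided die, even vs d < 4:", indep_six)
print("8-sided die, even vs d < 5:", indep_eight)
```

Using `Fraction` rather than floats keeps the equality test exact, which matters because independence is defined by an exact identity.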

23
Theorem of total probability
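  • Let $B_1, B_2, \ldots, B_N$ be mutually exclusive events
    whose union equals the sample space S. We refer
    to these sets as a partition of S.
  • An event A can be represented as
  • $A = A \cap S = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_N)$
  • Since $B_1, B_2, \ldots, B_N$ are mutually exclusive,
  • $P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_N)$
  • And therefore
  • $P(A) = P(A \mid B_1) P(B_1) + P(A \mid B_2) P(B_2) + \cdots + P(A \mid B_N) P(B_N)$
  • $= \sum_i P(A \mid B_i)\, P(B_i)$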

24
Example
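  • Roll a loaded die: 6 comes up 50% of the time, and
    each of 1 to 5 comes up 10% of the time
  • What is the probability of getting an even number?
  • $\text{Prob}(\text{even})$
  • $= \text{Prob}(\text{even} \mid d < 6)\, \text{Prob}(d < 6)$
  • $+ \text{Prob}(\text{even} \mid d = 6)\, \text{Prob}(d = 6)$
  • $= 2/5 \times 0.5 + 1 \times 0.5$
  • $= 0.7$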
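
The loaded-die calculation is an instance of the theorem of total probability and can be sketched in a few lines of Python. The function name `total_probability` is illustrative; the numbers are the ones used in this example.

```python
def total_probability(conditionals, partition_probs):
    """P(A) = sum_i P(A | B_i) * P(B_i) over a partition B_1..B_N."""
    if len(conditionals) != len(partition_probs):
        raise ValueError("need one conditional probability per partition element")
    if abs(sum(partition_probs) - 1.0) > 1e-9:
        raise ValueError("partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditionals, partition_probs))


# Partition of the loaded-die outcomes: {d < 6} and {d = 6}.
p_even_given = [2 / 5, 1.0]  # P(even | d < 6), P(even | d = 6)
p_partition = [0.5, 0.5]     # P(d < 6), P(d = 6)

p_even = total_probability(p_even_given, p_partition)
print(round(p_even, 3))  # 0.7
```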

25
Another example
  • We have a box of dice: 99% of them are fair, with
    probability 1/6 for each face, and 1% are loaded so
    that six comes up 50% of the time. We pick a die
    at random and roll it. What is the probability that
    we get a six?
  • $P(\text{six}) = P(\text{six} \mid \text{fair})\, P(\text{fair}) + P(\text{six} \mid \text{loaded})\, P(\text{loaded})$
  • $= 1/6 \times 0.99 + 0.5 \times 0.01 = 0.17 > 1/6$

26
Bayes theorem
  • $P(A \cap B) = P(B)\, P(A \mid B) = P(A)\, P(B \mid A)$
  • $\Rightarrow P(B \mid A) = \dfrac{P(A \mid B)\, P(B)}{P(A)}$
  • $P(A \mid B)$: likelihood
  • $P(B)$: prior of B
  • $P(B \mid A)$: posterior probability of B
  • $P(A)$: normalizing constant

This is known as Bayes' Theorem or Bayes' Rule, and
is (one of) the most useful relations in
probability and statistics. Bayes' Theorem is
definitely the fundamental relation in
Statistical Pattern Recognition.
27
Bayes theorem (contd)
  • Given $B_1, B_2, \ldots, B_N$, a partition of the sample
    space S, suppose that event A occurs. What is the
    probability of event $B_j$?
  • $P(B_j \mid A) = P(A \mid B_j)\, P(B_j) / P(A)$
  • $= P(A \mid B_j)\, P(B_j) / \sum_k P(A \mid B_k)\, P(B_k)$

$B_j$: different models. Having observed A,
should you choose the model that maximizes $P(B_j \mid A)$
or $P(A \mid B_j)$? It depends on how much you know
about $B_j$!
28
Example
  • Prosecutor's fallacy
  • Some crime happened
  • The perpetrator did not leave any evidence, except
    some hair
  • The police obtained DNA from the hair
  • An expert matched the DNA with that of a
    suspect
  • The expert said that both the false-positive and
    false-negative rates are $10^{-6}$
  • Can this be used as evidence of guilt against
    the suspect?

29
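Prosecutor's fallacy
  • $\text{Prob}(\text{match} \mid \text{innocent}) = 10^{-6}$
  • $\text{Prob}(\text{no match} \mid \text{guilty}) = 10^{-6}$
  • $\text{Prob}(\text{match} \mid \text{guilty}) = 1 - 10^{-6} \approx 1$
  • $\text{Prob}(\text{no match} \mid \text{innocent}) = 1 - 10^{-6} \approx 1$
  • $\text{Prob}(\text{guilty} \mid \text{match}) = {?}$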

30
Prosecutor's fallacy
  • $P(g \mid m) = P(m \mid g)\, P(g) / P(m)$
  • $\approx P(g) / P(m)$
  • $P(g)$: the probability of someone being guilty
    with no other evidence
  • $P(m)$: the probability of a DNA match
  • How do we get these two numbers?
  • We don't really care about $P(m)$
  • We want to compare two models:
  • $P(g \mid m)$ and $P(i \mid m)$

31
Prosecutor's fallacy
  • $P(i \mid m) = P(m \mid i)\, P(i) / P(m)$
  • $= 10^{-6}\, P(i) / P(m)$
  • Therefore
  • $P(i \mid m) / P(g \mid m) \approx 10^{-6}\, P(i) / P(g)$
  • $P(i) + P(g) = 1$
  • It is clear, therefore, that whether we can
    conclude the suspect is guilty depends on the
    prior probability $P(i)$
  • How do you get $P(i)$?

32
Prosecutor's fallacy
  • How do you get $P(i)$?
  • It depends on what other information you have on
    the suspect
  • Say the suspect has no other connection with
    the crime, and the overall crime rate is $10^{-7}$
  • That's a reasonable prior for $P(g)$
  • $P(g) = 10^{-7}$, $P(i) \approx 1$
  • $P(i \mid m) / P(g \mid m) \approx 10^{-6}\, P(i) / P(g) \approx 10^{-6} / 10^{-7} = 10$

33
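  • $P(\text{observation} \mid \text{model}_1) / P(\text{observation} \mid \text{model}_2)$: likelihood-ratio test
  • LR test
  • Often take the logarithm: $\log\big(P(m \mid i) / P(m \mid g)\big)$
  • Log likelihood ratio (score)
  • Or log odds ratio (score)
  • Bayesian model selection
  • $\log\big(P(\text{model}_1 \mid \text{observation}) / P(\text{model}_2 \mid \text{observation})\big)$
  • $= \text{LLR} + \log P(\text{model}_1) - \log P(\text{model}_2)$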
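
The relation between the log likelihood ratio and Bayesian model selection can be sketched numerically with the DNA-match numbers from the preceding slides (model 1 = innocent, model 2 = guilty). The base-10 logarithm and the function names are choices made for this illustration, not something the slides specify.

```python
import math


def log_likelihood_ratio(p_obs_given_m1, p_obs_given_m2):
    """LLR = log10 P(obs | model1) - log10 P(obs | model2)."""
    return math.log10(p_obs_given_m1) - math.log10(p_obs_given_m2)


def log_posterior_odds(p_obs_given_m1, p_obs_given_m2, prior_m1, prior_m2):
    """log10 posterior odds = LLR + log10 P(model1) - log10 P(model2)."""
    llr = log_likelihood_ratio(p_obs_given_m1, p_obs_given_m2)
    return llr + math.log10(prior_m1) - math.log10(prior_m2)


# Numbers from the DNA-match example.
p_match_given_innocent = 1e-6
p_match_given_guilty = 1 - 1e-6
p_guilty = 1e-7
p_innocent = 1 - p_guilty

llr = log_likelihood_ratio(p_match_given_innocent, p_match_given_guilty)
lpo = log_posterior_odds(p_match_given_innocent, p_match_given_guilty,
                         p_innocent, p_guilty)

print(f"LLR (innocent vs guilty)       = {llr:.2f}")
print(f"log10 posterior odds           = {lpo:.2f}")
print(f"posterior odds innocent:guilty = {10 ** lpo:.1f}")
```

The likelihood ratio alone strongly favors guilt (LLR near -6), but adding the log prior odds (near +7) flips the sign, which is the point of the prosecutor's fallacy.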

34
Prosecutor's fallacy
  • $P(i \mid m) / P(g \mid m) \approx 10^{-6} / 10^{-7} = 10$
  • Therefore, we would say the suspect is more
    likely to be innocent than guilty, given only the
    DNA samples
  • We can also explicitly calculate $P(i \mid m)$
  • $P(m) = P(m \mid i)\, P(i) + P(m \mid g)\, P(g)$
  • $\approx 10^{-6} \times 1 + 1 \times 10^{-7}$
  • $= 1.1 \times 10^{-6}$
  • $P(i \mid m) = P(m \mid i)\, P(i) / P(m) \approx 1 / 1.1 \approx 0.91$

35
Prosecutor's fallacy
  • If you have other evidence, $P(g)$ could be much
    larger than the average crime rate
  • In that case, the DNA test may give you higher
    confidence
  • How to decide on the prior?
  • Subjective?
  • Important?
  • Historically, there have been debates about Bayesian
    statistics
  • Some strongly support it, some strongly oppose it
  • Growing interest in many fields
  • However, there is no question about conditional
    probability
  • If all priors are equally likely, decisions
    based on Bayesian inference and the likelihood test are
    equivalent

36
Another example
  • A test for a rare disease claims that it will
    report a positive result for 99.5% of people with
    the disease, and a negative result for 99.9% of those without.
  • The disease is present in the population at 1 in
    100,000
  • What is $P(\text{disease} \mid \text{positive test})$?
  • What is $P(\text{disease} \mid \text{negative test})$?
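
A minimal sketch of how these two posteriors could be computed with Bayes' theorem. It assumes the 99.9 figure is the test's specificity, i.e. the test reports a negative result for 99.9% of people without the disease; that reading is an assumption, as are the function and constant names.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) for a binary hypothesis H, by Bayes' theorem."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence


PREVALENCE = 1 / 100_000  # P(disease)
SENSITIVITY = 0.995       # P(positive | disease)
SPECIFICITY = 0.999       # P(negative | no disease); assumed reading of the slide

# P(disease | positive): evidence is a positive test.
p_disease_given_pos = posterior(PREVALENCE, SENSITIVITY, 1 - SPECIFICITY)
# P(disease | negative): evidence is a negative test.
p_disease_given_neg = posterior(PREVALENCE, 1 - SENSITIVITY, SPECIFICITY)

print(f"P(disease | positive test) = {p_disease_given_pos:.5f}")
print(f"P(disease | negative test) = {p_disease_given_neg:.2e}")
```

Under these assumptions a positive result still leaves the disease unlikely (about 1%), because the prior of 1 in 100,000 is far smaller than the false-positive rate.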

37
Yet another example
  • We've talked about the casino's box of dice
  • 99% fair, 1% loaded (six comes up 50% of the time)
  • We said that if we randomly pick a die and roll it, we
    have a 17% chance of getting a six
  • If we get 3 sixes in a row, what is the chance that
    the die is loaded?
  • How about 5 sixes in a row?

38
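  • $P(\text{loaded} \mid \text{3 sixes in a row}) = P(\text{3 sixes in a row} \mid \text{loaded})\, P(\text{loaded}) / P(\text{3 sixes in a row})$
    $= 0.5^3 \times 0.01 / (0.5^3 \times 0.01 + (1/6)^3 \times 0.99) = 0.21$
  • $P(\text{loaded} \mid \text{5 sixes in a row}) = P(\text{5 sixes in a row} \mid \text{loaded})\, P(\text{loaded}) / P(\text{5 sixes in a row})$
    $= 0.5^5 \times 0.01 / (0.5^5 \times 0.01 + (1/6)^5 \times 0.99) = 0.71$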
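
The two posteriors above follow the same pattern, so they can be sketched as one function of the number of consecutive sixes. Default values are the ones from this example (1% loaded dice, six with probability 0.5 on a loaded die); the function name is illustrative.

```python
def p_loaded_given_sixes(n_sixes, p_loaded=0.01, p_six_loaded=0.5, p_six_fair=1 / 6):
    """P(loaded | n sixes in a row), assuming rolls are independent given the die."""
    like_loaded = p_six_loaded ** n_sixes  # P(n sixes | loaded)
    like_fair = p_six_fair ** n_sixes      # P(n sixes | fair)
    numerator = like_loaded * p_loaded
    return numerator / (numerator + like_fair * (1.0 - p_loaded))


for n in (1, 3, 5):
    print(n, round(p_loaded_given_sixes(n), 2))
```

Each extra six multiplies the odds in favor of "loaded" by 0.5 / (1/6) = 3, which is why the posterior climbs quickly despite the small prior.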

39
Relation to multiple testing problem
  • When searching a DNA sequence against a database,
    you get a high score, with a significant p-value
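  • $\dfrac{P(\text{unrelated} \mid \text{high score})}{P(\text{related} \mid \text{high score})} = \dfrac{P(\text{high score} \mid \text{unrelated})}{P(\text{high score} \mid \text{related})} \times \dfrac{P(\text{unrelated})}{P(\text{related})}$
  • The first factor on the right-hand side is the likelihood ratio
  • $P(\text{high score} \mid \text{unrelated})$ is much smaller than
    $P(\text{high score} \mid \text{related})$
  • But your database is huge, and most sequences
    should be unrelated, so $P(\text{unrelated})$ is much
    larger than $P(\text{related})$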
40
Combining Diverse Data Using Bayesian Inference
  • Want to calculate the probability that a gene of
    unknown function is involved in a disease
  • Collect positive and negative genes (gold
    standard)
  • Measure their activities under three hypothetical
    conditions

41
  • Figure 1. Potential distributions of experimental
    results obtained for datasets collected under
    three different conditions.

Greene CS, Troyanskaya OG (2012) Chapter 2:
Data-Driven View of Disease Biology. PLoS Comput
Biol 8(12): e1002816. doi:10.1371/journal.pcbi.1002816
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002816
Higher score in condition A and lower score in condition C
$\Rightarrow$ involved in disease. $P(\text{involved in disease} \mid \text{experimental data})$ = ?
42
  • Table 1. A contingency table for the experimental
    results for Condition A.

43
  • Probability that a gene i is involved in disease
    given the experimental results for gene i
  • $P(D \mid E) = \dfrac{P(E \mid D)\, P(D)}{P(E)}$
  • $P(E \mid D)$: likelihood
  • $P(D)$: prior
  • $P(E)$: normalizing factor
44
Prior
45
Combining datasets using Naïve Bayes
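  • $P(D \mid E_B, E_C) \propto P(E_B, E_C \mid D)\, P(D)$
  • $= P(E_B \mid D)\, P(E_C \mid D)\, P(D)$ (assuming $E_B$ and $E_C$ are conditionally independent given $D$)
  • $P(\bar{D} \mid E_B, E_C) \propto P(E_B, E_C \mid \bar{D})\, P(\bar{D})$
  • $= P(E_B \mid \bar{D})\, P(E_C \mid \bar{D})\, P(\bar{D})$
  • $P(D \mid E_B, E_C) + P(\bar{D} \mid E_B, E_C) = 1$.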
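
The Naïve Bayes combination can be sketched as follows: multiply the per-dataset likelihoods by the prior for each hypothesis, then normalize so the two posteriors sum to 1. All numeric values below are hypothetical placeholders, not values from the chapter; in practice the likelihoods would be estimated from gold-standard contingency tables.

```python
def naive_bayes_posterior(prior_d, likelihoods_d, likelihoods_not_d):
    """P(D | E_1..E_k), assuming the E's are conditionally independent
    given D and given not-D (the Naive Bayes assumption)."""
    score_d = prior_d
    for p in likelihoods_d:        # P(E_k | D) for each dataset
        score_d *= p
    score_not_d = 1.0 - prior_d
    for p in likelihoods_not_d:    # P(E_k | not D) for each dataset
        score_not_d *= p
    # Normalize so that P(D | E) + P(not D | E) = 1.
    return score_d / (score_d + score_not_d)


# Hypothetical values, for illustration only.
PRIOR_D = 0.05
P_E_GIVEN_D = [0.6, 0.7]      # P(E_B | D), P(E_C | D)
P_E_GIVEN_NOT_D = [0.2, 0.3]  # P(E_B | not D), P(E_C | not D)

p_d_given_e = naive_bayes_posterior(PRIOR_D, P_E_GIVEN_D, P_E_GIVEN_NOT_D)
print(round(p_d_given_e, 3))
```

Because only the ratio of the two unnormalized scores matters, the normalizing factor P(E_B, E_C) never has to be computed directly.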

46
Define Gold Standard (training samples) for
gene-gene network
  • Positive examples: genes within the same
    biological process
  • Rely on expert-selected Gene Ontology terms
  • biological regulation ?
  • response to stimulus ?
  • cell-matrix adhesion involved in tangential
    migration using cell-cell interactions ?
  • response to DNA damage stimulus ?
  • aldehyde metabolism ?
  • Negative examples: random gene pairs
  • Assuming most gene pairs are not related
47
Building a Network of Functionally Related Genes
  • $P(FR_{ij} \mid E_{ij}) \propto P(E_{ij} \mid FR_{ij})\, P(FR_{ij})$
  • $E_{ij}$: evidence (score) for a functional
    relationship between gene i and gene j from a
    particular dataset
  • For some datasets, e.g., physical interaction
    data, obtaining the score $S_{ij}$ is trivial
  • In general, $S_{ij}$ can be calculated using gene-wise
    correlation

48
Fisher's z-transformation
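  • Pearson correlation coefficient: $r = \dfrac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_k (x_k - \bar{x})^2}\, \sqrt{\sum_k (y_k - \bar{y})^2}}$

Z-transformation: $z = \dfrac{1}{2} \ln\dfrac{1 + r}{1 - r} = \operatorname{arctanh}(r)$
Purpose: stabilizing variance
Source: Wikipedia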
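
A short sketch of the two steps, computing a Pearson correlation between two genes and applying Fisher's z-transformation to it. The expression profiles are made-up numbers used only to exercise the functions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def fisher_z(r):
    """Fisher z-transformation: 0.5 * ln((1 + r) / (1 - r)) = atanh(r), for -1 < r < 1."""
    return math.atanh(r)


# Hypothetical expression profiles of two genes across five conditions.
gene_i = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_j = [2.1, 3.9, 6.2, 7.8, 10.5]

r = pearson_r(gene_i, gene_j)
z = fisher_z(r)
print(f"r = {r:.4f}, z = {z:.4f}")
```

The transformed value z is approximately normally distributed with variance that does not depend on r, which makes scores from different datasets easier to compare.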
50
  • Figure 4. The highest and lowest contributing
    datasets for the pair of APOE and PLTP are shown
    (http://hefalmp.princeton.edu/gene/one_specific_gene/18543?argument=21697&context=0).

51
  • Figure 5. The diseases that are significantly
    connected to APOE through the guilt by
    association strategy used in HEFalMp.

Used Fisher's exact test
52
  • Figure 6. The genes that are most significantly
    connected to Alzheimer disease genes using the
    HEFalMp network and OMIM disease gene annotations
    (http://hefalmp.princeton.edu/disease/all_genes/55?context=0).

53
Evaluating Functional Relationship Networks
  • TPR vs FPR plot (ROC curve) and AUC
  • Separate gold standard into training and testing
  • Cross validation
  • Literature evaluation
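
As a rough illustration of the ROC/AUC evaluation, the sketch below sweeps a threshold over predicted scores for a held-out gold standard and integrates the resulting TPR-vs-FPR curve with the trapezoidal rule. The labels and scores are invented for the example, and the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores.
    labels: 1 for gold-standard positives, 0 for negatives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so the curve does not depend on their order.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical held-out gold standard and predicted scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

area = auc(roc_points(labels, scores))
print(f"AUC = {area:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives; with cross validation, the same computation would be repeated on each held-out fold.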

54
Summary
  • We talked about
  • Prob / stats background
  • Bayesian inference method to integrate multiple
    large-scale, noisy datasets to predict
  • gene-disease associations
  • gene-gene associations
  • Network useful for discovering novel gene
    functions and directing experimental follow-ups
  • Advantages over curated literature or analysis
    based on a single dataset
  • Limited by availability / quality of gold
    standard data