Title: Lecture 2 Introduction to probabilistic models for computational molecular biology in the 21st centu
1Lecture 2: Introduction to probabilistic models
for computational molecular biology in the 21st
century (Part 1): getting the biology right, null
models and statistical significance
- Rachel Karchin
- BME 580.688/580.488 and CS 600.488 Spring 2009
2Overview
- Get the biology right
- Probabilistic models
- Null models
- Statistical significance
- Non-parametric
- Parametric
3Probabilistic models
- Model
- System that simulates object under consideration
- Probabilistic model
- Produces different outcomes with different
probabilities
- Can simulate a whole class of objects, assigning
each an associated probability
- Computable function that assigns P(X | M) to each
instantiation of X
Biological sequence analysis, Durbin, Eddy,
Krogh, Mitchison
4Probabilistic models in computational biology
- Require understanding of the objects under
consideration
- Substantial effort to learn
- Scientific papers
- Product manuals
- Without this understanding, models will be stupid
5Horrifying example of this
- Community-wide experiment where researchers
analyze the same contest expression data sets.
- The first CAMDA conference (CAMDA'00) was held
December 18-19, 2000. Attended by 250 biologists,
statisticians, computer scientists and
mathematicians from 7 countries, the conference
truly brought together the major players in this
field.
6Horrifying example of this
- At one meeting, a group who admitted they did not
have much biological understanding presented
results of combining expression values from
multiple experiments with the same Affymetrix
chip.
7Horrifying example of this
- They had matched values based on probe-set names.
8Horrifying example of this
- There were only a small number of probe-set names
that matched across the experiments.
9Horrifying example of this
- These turned out to be spike-in probes designed
to calibrate the experiments (manufacturer's
protocol to assess successful dye-labeling and
hybridization).
- bacterial mRNA transcripts
- housekeeping gene human mRNA transcripts
10Motif models of binding-sites on DNA/RNA
- Example of classic bioinformatics
- Transcription factor binding sites!
- Why do these remain relevant?
- More regulatory mechanisms discovered to be
important
- Binding sites on mRNA for splicing machinery
- miRNA binding sites on mRNA
11What you should know
- What a transcription factor binding site is
- Have a sense for the range of algorithms
developed to recognize them (paper on class
website)
- What is a miRNA
- What kinds of binding sites have been discovered
for splicing machinery
12Probabilistic model of exonic splicing enhancer
Object: Exonic Splicing Enhancer (ESE)
Maniatis and Tasic, Nature (2002)
CACAGGA
An ESE recognized by the human SR protein SF2/ASF
13Biology of alternative splicing
Alternative splicing of mRNA allows many gene
products with different functions to be produced
from a single coding sequence. It has recently
been proposed as a mechanism by which
higher-order diversity is generated.
Brett et al. Nat Gen 2001
15Biology of alternative splicing
16Biology of alternative splicing
17Complex protein machinery involved to make it
happen
Maniatis and Tasic, Nature (2002)
18How do proteins know where to bind on the
pre-mRNA?
19How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Sequence signals?
20Toy real-life example
- You have a new collaborator who has developed an
in vitro assay to detect where SF2/ASF splicing
factors bind on pre-mRNA
- She would like you to give her a model that she
can use to scan the human genome for putative ESE
binding sites that her lab can test.
21ESE binding sites
She presents you with this list of ESE binding
sites that her assay has discovered.
22What kind of model might you build?
23A simple idea
Markov chain (zero-order)
Position
24A simple idea
Markov chain (zero-order)
Position
What is the probability of this RNA sequence
being a SF2/ASF ESE?
CGGCGUG
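One way to make the question on this slide concrete is sketched below. It reads the zero-order model as a table of base frequencies at each of the seven positions and multiplies them to get P(seq | M). The training sites are hypothetical (the collaborator's list is not reproduced in this transcript), and the function names are illustrative only.

```python
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical training sites, made up for illustration only;
# the assay's real list is not part of this transcript.
TRAINING_SITES = ["CACAGGA", "CGGAGGA", "CACCGGA", "CGCAGUG", "CACAGUG"]


def position_probabilities(sites):
    """Maximum-likelihood estimate of P(base | position) from aligned sites."""
    table = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        table.append({base: counts[base] / len(sites) for base in ALPHABET})
    return table


def sequence_probability(seq, table):
    """P(seq | M): product of the per-position base probabilities."""
    prob = 1.0
    for pos, base in enumerate(seq):
        prob *= table[pos][base]
    return prob


model = position_probabilities(TRAINING_SITES)
p_seen = sequence_probability("CACAGGA", model)    # a site in the training set
p_query = sequence_probability("CGGCGUG", model)   # the sequence on the slide
p_unseen = sequence_probability("UACAGGA", model)  # U never seen at position 1
```

With these made-up sites, any sequence that has a base never observed at some position gets probability exactly zero, however similar it is to the training sites elsewhere.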
25What are strengths and weaknesses of this model?
26Overfitting
- How could we design our model better to avoid
overfitting?
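One common answer, covered in the Durbin et al. reading for this lecture, is to add pseudocounts before normalizing. The sketch below assumes a uniform pseudocount of 1 per base (add-one smoothing) and a small set of hypothetical training sites; both are assumptions for illustration, and other regularizers are possible.

```python
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical training sites, made up for illustration only.
TRAINING_SITES = ["CACAGGA", "CGGAGGA", "CACCGGA", "CGCAGUG", "CACAGUG"]


def smoothed_position_probabilities(sites, pseudocount=1.0):
    """Estimate P(base | position) with a uniform pseudocount added to each base."""
    total = len(sites) + pseudocount * len(ALPHABET)
    table = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        table.append(
            {base: (counts[base] + pseudocount) / total for base in ALPHABET}
        )
    return table


smoothed = smoothed_position_probabilities(TRAINING_SITES)
# Position 1 is C in all five sites: P(C) = 6/9, and each unseen base gets 1/9.
p_c_first = smoothed[0]["C"]
p_u_first = smoothed[0]["U"]
```

No base has probability zero any more, so an unseen variant is penalized rather than ruled out.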
27Null models
Better Than Chance, Karplus (2008)
28Null models
- If our probability model is good, an actual
SF2/ASF ESE binding site sequence should get a
higher score than a non-SF2/ASF ESE binding site
sequence.
- If our probability model is good, an actual
SF2/ASF ESE binding site sequence should get a
higher score than a randomly selected mRNA
sequence of the same length.
29Null models
- If our probability model is good, the likelihood
of an actual SF2/ASF ESE binding site sequence,
based on our model, should be larger than the
likelihood the sequence would have under a random
model.
30Example Null model
- Probability of seeing A,C,G,U at one of the seven
positions is the same as the probability of
seeing A,C,G,U at any position in the ESE
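As a rough sketch of this null model, the code below pools base counts over all positions of the sites and compares a sequence's likelihood under the position-specific model with its likelihood under that pooled null, as a natural-log odds score. The training sites, the pseudocount of 1, and all function names are assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical training sites, made up for illustration only.
TRAINING_SITES = ["CACAGGA", "CGGAGGA", "CACCGGA", "CGCAGUG", "CACAGUG"]


def position_model(sites, pseudocount=1.0):
    """P(base | position), smoothed so that no probability is zero."""
    total = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        model.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return model


def pooled_null(sites):
    """Null model: one base distribution pooled over every position of the sites.

    A base absent from all sites would get probability zero here and
    break the log-odds computation; this toy data set has no such base.
    """
    counts = Counter("".join(sites))
    total = sum(counts.values())
    return {b: counts[b] / total for b in ALPHABET}


def log_odds(seq, model, null):
    """ln [ P(seq | model) / P(seq | null) ], summed position by position."""
    return sum(math.log(model[pos][b] / null[b]) for pos, b in enumerate(seq))


model = position_model(TRAINING_SITES)
null = pooled_null(TRAINING_SITES)
site_score = log_odds("CACAGGA", model, null)
query_score = log_odds("CGGCGUG", model, null)
```

A positive score means the sequence is more likely under the model than under the null; here the training site scores higher than the query sequence.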
31What are the strengths and weaknesses of this
null model?
- Probability of seeing A,C,G,U at one of the seven
positions is the same as the probability of
seeing A,C,G,U at any position in the ESE
32Statistical significance
- We believe in models when they give a high
likelihood to our observed data.
- Statistical tests (p-values) quantify how often
we should expect to see such good scores by
chance.
- In other words, when we should reject the null
model.
33Small P-Value to reject the Null Hypothesis
Better Than Chance, Karplus (2008)
34Statistical significance
Non-parametric version
- Markov's inequality. For any scoring scheme
that uses the log-odds score $\ln \frac{P(\text{seq} \mid M)}{P(\text{seq} \mid N)}$,
the probability of a score better
than $T$ is less than $e^{-T}$ for sequences distributed
according to the null model $N$ (Milosavljevic 1993).
Better Than Chance, Karplus (2008)
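The bound follows in one step from Markov's inequality. In the sketch below, $S(x) = \ln \frac{P(x \mid M)}{P(x \mid N)}$ is the natural-log odds score of sequence $x$, and probabilities and expectations with subscript $N$ are taken over sequences drawn from the null model; this notation is introduced here for illustration.

```latex
\begin{align*}
P_N\bigl(S(x) \ge T\bigr)
  &= P_N\!\left(\frac{P(x \mid M)}{P(x \mid N)} \ge e^{T}\right) \\
  &\le e^{-T}\, E_N\!\left[\frac{P(x \mid M)}{P(x \mid N)}\right]
   = e^{-T} \sum_{x \,:\, P(x \mid N) > 0} P(x \mid N)\,\frac{P(x \mid M)}{P(x \mid N)} \\
  &= e^{-T} \sum_{x \,:\, P(x \mid N) > 0} P(x \mid M)
   \;\le\; e^{-T}.
\end{align*}
```

No assumption about the shape of the score distribution is needed, which is why this version is called non-parametric.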
35Statistical significance threshold
Non-parametric version
- You've detected a pattern in the noise when
- For a sequence database where you search N
possible starting positions
36Statistical significance threshold
Non-parametric version
- You've detected a pattern in the noise when
- For a sequence database where you search N
possible starting positions
Does this give us a P-value for a score?
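The threshold formulas on these slides are not reproduced in the transcript. The sketch below is one reading, assuming natural-log odds scores and a union bound over the N starting positions: each position exceeds T with probability at most e^(-T), so the expected number of chance hits is at most N * e^(-T). The function names and the example numbers are hypothetical.

```python
import math


def pvalue_upper_bound(score):
    """Upper bound on P(score >= T) under the null; a bound, not an exact p-value."""
    return min(1.0, math.exp(-score))


def chance_hits_upper_bound(score, n_positions):
    """Upper bound on the expected number of null positions scoring >= score."""
    return n_positions * math.exp(-score)


def score_threshold(n_positions, alpha):
    """Smallest T for which n_positions * exp(-T) <= alpha."""
    return math.log(n_positions / alpha)


# Hypothetical search: one million starting positions, at most 0.01 chance hits.
threshold = score_threshold(1_000_000, 0.01)
bound_at_threshold = chance_hits_upper_bound(threshold, 1_000_000)
```

On this reading, the answer to the slide's question is that the bound gives only an upper limit on the p-value, and it can be quite loose.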
37Statistical significance
Parametric version
- Parameter fitting. For random sequences drawn
from some distribution other than the null, we
can fit a parameterized family of distributions
to scores from a random sample, then compute
p-values.
?
Better Than Chance, Karplus (2008)
38Statistical significance
Parametric version
1. Generate a random sample of 6-mer sequences. Example:
AUUAAC GCACUU CGGCGG UUCCAG GUACGG CGUAUA GACCAU
UAAACC CGAGAC CCUAUU CACGAC GGAGUA CUUGCG
Not the null
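Step 1 can be sketched as follows. The slide does not say which distribution the random 6-mers are drawn from, so the uniform base frequencies, the fixed seed, and the function name here are assumptions.

```python
import random


def random_kmers(n, k=6, weights=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Draw n random RNA k-mers with the given A, C, G, U frequencies."""
    rng = random.Random(seed)  # fixed seed so that the sample is reproducible
    return ["".join(rng.choices("ACGU", weights=weights, k=k)) for _ in range(n)]


sample = random_kmers(13)
```

Changing `weights` gives a sample from a different background composition.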
39Statistical significance
Parametric version
2. Score these sequences
Sequence  Score
AUUAAC    -3.6286
GCACUU    -7.3581
CGGCGG    -3.9727
UUCCAG    -2.5188
GUACGG    -3.4714
CGUAUA    -1.5415
GACCAU    -7.2038
UAAACC    -1.9210
CGAGAC    -1.1950
CCUAUU    -5.2892
CACGAC    -2.1393
GGAGUA    -2.0600
CUUGCG    -1.1991
40Parametric version
3. Match score histogram to a known parametric
distribution family
?
41Parametric version
42Parametric version
43Parametric version
?
3a. Estimate parameters with a subset of the
scores
44Parametric version
3b. Evaluate parameter fit on a second subset of
the data
45Parametric version
4. Compute score and p-value for a real
sequence.
UUUGAC 0.06
46Parametric version
4. Compute score and p-value for a real
sequence.
UUUGAC 0.06
P-value
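Steps 3a, 3b and 4 can be sketched end to end with the thirteen example scores from step 2. The slides leave the distribution family open, so the normal distribution used here is only a placeholder. The even/odd split, the Kolmogorov-Smirnov statistic as the fit check, and the observed score of -1.0 are likewise assumptions for illustration.

```python
from statistics import NormalDist

# Scores of the 13 random 6-mers listed in step 2.
SCORES = [-3.6286, -7.3581, -3.9727, -2.5188, -3.4714, -1.5415, -7.2038,
          -1.9210, -1.1950, -5.2892, -2.1393, -2.0600, -1.1991]


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and a fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))


def upper_tail_pvalue(score, dist):
    """P(random score >= score) under the fitted distribution."""
    return 1.0 - dist.cdf(score)


fit_subset = SCORES[0::2]   # step 3a: estimate parameters on one subset
held_out = SCORES[1::2]     # step 3b: check the fit on a second subset
fitted = NormalDist.from_samples(fit_subset)
ks = ks_statistic(held_out, fitted.cdf)

# Step 4: p-value for a hypothetical observed score of -1.0.
p_value = upper_tail_pvalue(-1.0, fitted)
```

A large KS statistic on the held-out subset would suggest trying a different family before trusting the p-value. With only thirteen scores, neither the fit nor the check is reliable.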
47Correct for multiple testing
- What if you have more than one candidate sequence
whose significance you want to evaluate?
- You are cheating!
- Using the same data to test n alternative
hypotheses
48Correct for multiple testing
- Bonferroni
- Test each hypothesis (that sequence S is an ESE
binding site) at 1/n of the overall significance level
- To test 10 sequences and find if any have P <
0.05
- Test each at 0.005 significance level
- Benjamini and Hochberg
- Control for the FDR (false discovery rate)
(fraction FP among tests declared significant)
- Storey
- q-value: minimum FDR we get by calling a test
significant
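The first two corrections can be sketched in a few lines. The ten p-values are hypothetical, and the function names are illustrative. The Benjamini-Hochberg adjusted p-values computed here match Storey's q-values only under the conservative assumption that every null hypothesis is true; Storey's method estimates that proportion from the data, which this sketch does not attempt.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for 10 candidate sequences.
P_VALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

per_test_level = bonferroni_threshold(0.05, len(P_VALUES))
bonferroni_hits = [p for p in P_VALUES if p <= per_test_level]
bh_adjusted = benjamini_hochberg(P_VALUES)
bh_hits = [p for p, adj in zip(P_VALUES, bh_adjusted) if adj <= 0.05]
```

On this made-up input, Bonferroni keeps one candidate and Benjamini-Hochberg at FDR 0.05 keeps two, which illustrates that controlling the FDR is the less stringent criterion.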
49What you should know
- How to recognize and correct for overly
optimistic significance calculations.
50Data integration
- 21st century probabilistic model of ESEs should
also incorporate
- Sequence motif
- Structural patterns?
- Cis location patterns?
51How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Structural signals
52Where are the SF2/ASF ESE binding sites?
Sanford, Genome Research 2008
53What you should know
- Before starting to model molecular biology data
- Get some understanding of the big picture
- Read as many background papers as possible
- When you get data from an experimentalist
- Learn about the technology used to collect the
data
- Read up on any previous attempts to model this
kind of data in the compbio literature
- Think carefully about what kinds of available
information have not yet been used to model the
problem.
54What you should understand
- Material in Durbin et al.
- Chap. 1: Intro to maximum likelihood estimation, posterior
probabilities, Bayesian statistics and
pseudocounts as used in computational molecular
biology
- Chap. 3.1: Intro to Markov chains
- Chap. 5.6: More about pseudocounts and
regularizers
- Chap. 11.1: Parametric distributions commonly
used in computational molecular biology