Lecture 2 Introduction to probabilistic models for computational molecular biology in the 21st centu

1
Lecture 2: Introduction to probabilistic models
for computational molecular biology in the 21st
century (Part 1): getting the biology right, null
models and statistical significance

Rachel Karchin
BME 580.688/580.488 and CS 600.488 Spring 2009

2
Overview

Get the biology right
Probabilistic models
Null models
Statistical significance
Non-parametric
Parametric

3
Probabilistic models

Model
System that simulates object under consideration
Probabilistic model
Produces different outcomes with different
probabilities
Can simulate a whole class of objects, assigning
each an associated probability
Computable function that assigns P(X | M) to each
instantiation of X

Biological sequence analysis, Durbin, Eddy,
Krogh, Mitchison
4
Probabilistic models in computational biology

Require understanding of the objects under
consideration
Substantial effort to learn
Scientific papers
Product manuals
Without this understanding, models will be stupid

5
Horrifying example of this

Community-wide experiment where researchers
analyze the same contest expression data sets.
The first CAMDA conference (CAMDA'00) was held
December 18-19, 2000. Attended by 250 biologists,
statisticians, computer scientists and
mathematicians from 7 countries, the conference
truly brought together the major players in this
field.

6
Horrifying example of this

At one meeting, a group who admitted they did not
have much biological understanding presented
results of combining expression values from
multiple experiments with the same Affymetrix
chip.

7
Horrifying example of this

They had matched values based on probe-set names.

8
Horrifying example of this

There were only a small number of probe-set names
that matched across the experiments.

9
Horrifying example of this

These turned out to be spike-in probes designed
to calibrate the experiments (manufacturer's
protocol to assess successful dye-labeling and
hybridization).
bacterial mRNA transcripts
housekeeping gene human mRNA transcripts

10
Motif models of binding-sites on DNA/RNA

Example of classic bioinformatics
Transcription factor binding sites!
Why do these remain relevant?
More regulatory mechanisms discovered to be
important
Binding sites on mRNA for splicing machinery
miRNA binding sites on mRNA

11
What you should know

What a transcription factor binding site is
Have a sense for the range of algorithms
developed to recognize them (paper on class
website)
What is a miRNA
What kinds of binding sites have been discovered
for splicing machinery

12
Probabilistic model of exonic splicing enhancer
Object: Exonic Splicing Enhancer (ESE)
Maniatis and Tasic, Nature (2002)
CACAGGA
An ESE recognized by the human SR protein SF2/ASF
13
Biology of alternative splicing
Alternative splicing of mRNA allows many gene
products with different functions to be produced
from a single coding sequence. It has recently
been proposed as a mechanism by which
higher-order diversity is generated.
Brett et al. Nat Gen 2001
14
(No Transcript)
15
Biology of alternative splicing
16
Biology of alternative splicing
17
Complex protein machinery involved to make it
happen
Maniatis and Tasic, Nature (2002)
18
How do proteins know where to bind on the
pre-mRNA?
19
How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Sequence signals?
20
Toy real-life example

You have a new collaborator who has developed an
in vitro assay to detect where SF2/ASF splicing
factors bind on pre-mRNA
She would like you to give her a model that she
can use to scan the human genome for putative ESE
binding sites that her lab can test.

21
ESE binding sites
She presents you with this list of ESE binding
sites that her assay has discovered.
22
What kind of model might you build?
23
A simple idea
Markov chain (zero-order)
Position
24
A simple idea
Markov chain (zero-order)
Position
What is the probability of this RNA sequence
being an SF2/ASF ESE?
CGGCGUG
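
The zero-order, position-specific idea can be sketched in a few lines of Python. This is only an illustration: the slide's table of sites did not survive in the transcript, so TOY_SITES below is an invented placeholder list, and the function names are hypothetical. Each position gets its own maximum-likelihood base frequencies, and a sequence's probability is the product over positions.

```python
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical 7-mer sites, invented for illustration only.
# They are NOT the collaborator's assay results.
TOY_SITES = ["CACAGGA", "CGCAGGA", "CACACGA", "CACAGGU", "GACAGGA"]


def fit_position_model(sites):
    """Maximum-likelihood base frequencies, estimated separately per position."""
    model = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        model.append({base: counts[base] / len(sites) for base in ALPHABET})
    return model


def sequence_probability(model, seq):
    """P(seq | model): product of the per-position base probabilities."""
    prob = 1.0
    for pos, base in enumerate(seq):
        prob *= model[pos][base]
    return prob


model = fit_position_model(TOY_SITES)
p_consensus = sequence_probability(model, "CACAGGA")
p_query = sequence_probability(model, "CGGCGUG")
```

With these toy sites, any query containing a base never observed at some position (as CGGCGUG does here) receives probability exactly zero, which is one weakness worth noticing when judging the model.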
25
What are strengths and weaknesses of this model?
26
Overfitting

How could we design our model better to avoid
overfitting?
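
One common answer, covered in the Durbin et al. reading on pseudocounts, is to add a small count to every base before normalizing. The sketch below assumes a uniform pseudocount of 1 per base (an arbitrary choice for illustration) and reuses an invented list of sites; none of the names come from the lecture.

```python
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical 7-mer sites, invented for illustration only.
TOY_SITES = ["CACAGGA", "CGCAGGA", "CACACGA", "CACAGGU", "GACAGGA"]


def fit_with_pseudocounts(sites, pseudocount=1.0):
    """Per-position base frequencies with a uniform pseudocount added to each base."""
    denom = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        model.append(
            {base: (counts[base] + pseudocount) / denom for base in ALPHABET}
        )
    return model


def sequence_probability(model, seq):
    """P(seq | model): product of the per-position base probabilities."""
    prob = 1.0
    for pos, base in enumerate(seq):
        prob *= model[pos][base]
    return prob


smoothed = fit_with_pseudocounts(TOY_SITES)
p_query = sequence_probability(smoothed, "CGGCGUG")
```

With smoothing, a sequence containing bases unseen in the training sites gets a small but nonzero probability instead of being ruled out entirely.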

27
Null models
Better Than Chance, Karplus (2008)
28
Null models

If our probability model is good, an actual
SF2/ASF ESE binding site sequence should get a
higher score than a non-SF2/ASF ESE binding site
sequence.
If our probability model is good, an actual
SF2/ASF ESE binding site sequence should get a
higher score than a randomly selected mRNA
sequence of the same length.

29
Null models

If our probability model is good, the likelihood
of an actual SF2/ASF ESE binding site sequence,
based on our model, should be larger than the
likelihood the sequence would have under a random
model.

30
Example Null model

Probability of seeing A,C,G,U at one of the seven
positions is the same as the probability of
seeing A,C,G,U at any position in the ESE
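
A sketch of how such a null model might be used: pool the base counts over all seven positions to get one position-independent distribution, then score a sequence by the log of the ratio between its probability under the position-specific model and under the null. The sites are invented placeholders, the pseudocount of 1 is an assumption to avoid taking the log of zero, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical 7-mer sites, invented for illustration only.
TOY_SITES = ["CACAGGA", "CGCAGGA", "CACACGA", "CACAGGU", "GACAGGA"]


def fit_position_model(sites, pseudocount=1.0):
    """Position-specific base frequencies (with pseudocounts)."""
    denom = len(sites) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        model.append(
            {base: (counts[base] + pseudocount) / denom for base in ALPHABET}
        )
    return model


def fit_pooled_null(sites, pseudocount=1.0):
    """Null model: one base distribution pooled over every position of every site."""
    counts = Counter("".join(sites))
    denom = sum(counts[base] for base in ALPHABET) + pseudocount * len(ALPHABET)
    return {base: (counts[base] + pseudocount) / denom for base in ALPHABET}


def log_odds(model, null, seq):
    """ln [P(seq | model) / P(seq | null)], summed position by position."""
    return sum(math.log(model[pos][base] / null[base]) for pos, base in enumerate(seq))


model = fit_position_model(TOY_SITES)
null = fit_pooled_null(TOY_SITES)
score_consensus = log_odds(model, null, "CACAGGA")
score_query = log_odds(model, null, "CGGCGUG")
```

A positive score means the sequence is more likely under the position-specific model than under the null; a negative score means the null explains it better.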

31
What are the strengths and weaknesses of this
null model?

Probability of seeing A,C,G,U at one of the seven
positions is the same as the probability of
seeing A,C,G,U at any position in the ESE

32
Statistical significance

We believe in models when they give a high
likelihood to our observed data.
Statistical tests (p-values) quantify how often
we should expect to see such good scores by
chance.
In other words, when we should reject the null
model.

33
Small P-Value to reject the Null Hypothesis
Better Than Chance, Karplus (2008)
34
Statistical significance
Non-parametric version

Markov's inequality. For any scoring scheme
that uses the log-odds score
ln [P(seq | M) / P(seq | N)], the probability of a
score better than T is less than e^(-T) for
sequences distributed according to the null
model N (Milosavljevic 1993).

Better Than Chance, Karplus (2008)
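
The bound can be checked numerically on a small case. The sketch below assumes a made-up three-position model and a uniform null (both hypothetical), enumerates every possible sequence, and adds up the null probability of all sequences whose log-odds score exceeds T.

```python
import itertools
import math

ALPHABET = "ACGU"

# Uniform null and a hypothetical 3-position model, both chosen for illustration.
NULL = {base: 0.25 for base in ALPHABET}
MODEL = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "U": 0.10},
    {"A": 0.10, "C": 0.60, "G": 0.20, "U": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.80, "U": 0.10},
]


def null_tail_probability(threshold):
    """Exact P_null(log-odds score > threshold), by enumerating all sequences."""
    tail = 0.0
    for seq in itertools.product(ALPHABET, repeat=len(MODEL)):
        p_model = math.prod(MODEL[pos][base] for pos, base in enumerate(seq))
        p_null = math.prod(NULL[base] for base in seq)
        if math.log(p_model / p_null) > threshold:
            tail += p_null
    return tail


# Each entry: (T, exact tail probability under the null, Markov bound e^(-T)).
checks = [(t, null_tail_probability(t), math.exp(-t)) for t in (0.0, 1.0, 2.0, 3.0)]
```

In every row the exact tail probability stays at or below e^(-T). The bound holds because the expected value of P(seq | M) / P(seq | N) under the null is 1, so no distributional assumption about the scores is needed.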
35
Statistical significance threshold
Non-parametric version

You've detected a pattern in the noise when
For a sequence database where you search N
possible starting positions

36
Statistical significance threshold
Non-parametric version

You've detected a pattern in the noise when
For a sequence database where you search N
possible starting positions

Does this give us a P-value for a score?
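
The slide's own threshold formula is not preserved in this transcript. One standard way to get a threshold from the Markov bound, offered here as an assumption and not necessarily what the slide showed, is a union bound: with N starting positions, the chance that any null position scores above T is at most N * e^(-T), so requiring that to be at most alpha gives T >= ln(N / alpha). The function names and the default alpha are illustrative.

```python
import math


def significance_threshold(num_positions, alpha=0.01):
    """Smallest log-odds score T with num_positions * e^(-T) <= alpha."""
    return math.log(num_positions / alpha)


def false_hit_bound(score, num_positions):
    """Upper bound on the chance that any of the null positions beats `score`."""
    return min(1.0, num_positions * math.exp(-score))


threshold = significance_threshold(10**6, alpha=0.01)
bound_at_threshold = false_hit_bound(threshold, 10**6)
```

Note that the quantity returned by false_hit_bound is an upper bound on a p-value, not the p-value itself, and it is often quite conservative.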
37
Statistical significance
Parametric version

Parameter fitting. For random sequences drawn
from some distribution other than the null, we
can fit a parameterized family of distributions
to scores from a random sample, then compute
p-values.

?

Better Than Chance, Karplus (2008)
38
Statistical significance
Parametric version

1. Generate a random sample of 6-mer sequences.
Example:
AUUAAC GCACUU CGGCGG UUCCAG GUACGG CGUAUA GACCAU
UAAACC CGAGAC CCUAUU CACGAC GGAGUA CUUGCG
Not the null
39
Statistical significance
Parametric version

2. Score these sequences
AUUAAC GCACUU CGGCGG UUCCAG GUACGG CGUAUA GACCAU
UAAACC CGAGAC CCUAUU CACGAC GGAGUA CUUGCG
-3.6286 -7.3581 -3.9727 -2.5188 -3.4714 -1.5415
-7.2038 -1.9210 -1.1950 -5.2892 -2.1393 -2.0600
-1.1991
40
Parametric version

3. Match score histogram to a known parametric
distribution family
?
41
Parametric version
42
Parametric version

43
Parametric version
?
3a. Estimate parameters with a subset of the
scores

44
Parametric version
3b. Evaluate parameter fit on a second subset of
the data

45
Parametric version
4. Compute score and p-value for a real
sequence. UUUGAC 0.06

46
Parametric version
4. Compute score and p-value for a real
sequence. UUUGAC 0.06

P-value
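
Steps 3a, 3b and 4 can be sketched with the standard library alone. The transcript does not say which distribution family the slides used, so the Gumbel (extreme-value) family and the method-of-moments fit below are assumptions made for illustration, as is the query score of -1.0. The 13 scores are the ones listed in step 2.

```python
import math
import statistics

# The 13 scores listed in step 2 for the random 6-mers.
SCORES = [-3.6286, -7.3581, -3.9727, -2.5188, -3.4714, -1.5415, -7.2038,
          -1.9210, -1.1950, -5.2892, -2.1393, -2.0600, -1.1991]
EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments estimates (mu, beta) for a Gumbel (maximum) distribution."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_cdf(x, mu, beta):
    """P(S <= x) under the fitted Gumbel distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


def p_value(score, mu, beta):
    """P(S >= score) under the fitted distribution."""
    return 1.0 - gumbel_cdf(score, mu, beta)


def max_cdf_gap(scores, mu, beta):
    """Largest gap between empirical and fitted CDFs (a KS-style statistic)."""
    ordered = sorted(scores)
    n = len(ordered)
    gaps = []
    for i, x in enumerate(ordered):
        fitted = gumbel_cdf(x, mu, beta)
        gaps.append(max(abs((i + 1) / n - fitted), abs(i / n - fitted)))
    return max(gaps)


fit_subset, check_subset = SCORES[::2], SCORES[1::2]  # steps 3a and 3b
mu, beta = fit_gumbel(fit_subset)
gap = max_cdf_gap(check_subset, mu, beta)
example_p = p_value(-1.0, mu, beta)  # -1.0 is a hypothetical query score
```

A large gap on the held-out subset would suggest that the chosen family fits poorly and that a different family should be tried before trusting the p-values. Thirteen scores is far too few for a real fit; the sample here only mirrors the slide.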
47
Correct for multiple testing

What if you have more than one candidate sequence
whose significance you want to evaluate?
You are cheating!
Using the same data to test n alternative
hypotheses

48
Correct for multiple testing

Bonferroni
Test each hypothesis (that sequence S is an ESE
binding site) at significance level α/n
To test 10 sequences and find if any have P <
0.05
Test each at the 0.005 significance level
Benjamini and Hochberg
Control for the FDR (false discovery rate)
(expected fraction of false positives among
tests declared significant)
Storey
q-value: minimum FDR we get by calling a test
significant
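
The first two corrections are short enough to sketch directly. The p-values below are invented for illustration and the function names are hypothetical. The Benjamini-Hochberg function implements the usual step-up procedure: sort the p-values, find the largest rank k whose p-value is at most alpha * k / n, and reject the k smallest.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for 10 candidate sequences.
TOY_P = [0.001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.4, 0.6, 0.8, 0.9]
bonferroni = bonferroni_reject(TOY_P)
bh = benjamini_hochberg_reject(TOY_P)
```

On these toy values Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects four, which reflects the trade-off: controlling the FDR tolerates some false positives in exchange for more discoveries.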

49
What you should know

How to recognize and correct for overly
optimistic significance calculations.

50
Data integration

A 21st-century probabilistic model of ESEs should
also incorporate
Sequence motif
Structural patterns?
Cis location patterns?

51
How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Structural signals
52
Where are the SF2/ASF ESE binding sites?
Sanford, Genome Research 2008
53
What you should know

Before starting to model molecular biology data
Get some understanding of the big picture
Read as many background papers as possible
When you get data from an experimentalist
Learn about the technology used to collect the
data
Read up on any previous attempts to model this
kind of data in the compbio literature
Think carefully about what kinds of available
information have not yet been used to model the
problem.

54
What you should understand

Material in Durbin et al.
Chap. 1: Intro to maximum likelihood estimation,
posterior probabilities, Bayesian statistics and
pseudocounts as used in computational molecular
biology
Chap. 3.1: Intro to Markov chains
Chap. 5.6: More about pseudocounts and
regularizers
Chap. 11.1: Parametric distributions commonly
used in computational molecular biology
