Lecture 2: Introduction to probabilistic models for computational molecular biology in the 21st century

1
Lecture 2: Introduction to probabilistic models
for computational molecular biology in the 21st
century (Part 1): getting the biology right, null
models and statistical significance
  • Rachel Karchin
  • BME 580.688/580.488 and CS 600.488 Spring 2009

2
Overview
  • Get the biology right
  • Probabilistic models
  • Null models
  • Statistical significance
  • Non-parametric
  • Parametric

3
Probabilistic models
  • Model
  • System that simulates object under consideration
  • Probabilistic model
  • Produces different outcomes with different
    probabilities
  • Can simulate a whole class of objects, assigning
    each an associated probability
  • Computable function that assigns P(X | M) to each
    instantiation of X

Biological sequence analysis, Durbin, Eddy,
Krogh, Mitchison
4
Probabilistic models in computational biology
  • Require understanding of the objects under
    consideration
  • Substantial effort to learn
  • Scientific papers
  • Product manuals
  • Without this understanding, models will be stupid

5
Horrifying example of this
  • Community-wide experiment where researchers
    analyze the same contest expression data sets.
  • The first CAMDA conference (CAMDA'00) was held
    December 18-19, 2000. Attended by 250 biologists,
    statisticians, computer scientists and
    mathematicians from 7 countries, the conference
    truly brought together the major players in this
    field.

6
Horrifying example of this
  • At one meeting, a group who admitted they did not
    have much biological understanding presented
    results of combining expression values from
    multiple experiments with the same Affymetrix
    chip.

7
Horrifying example of this
  • They had matched values based on probe-set names.

8
Horrifying example of this
  • There were only a small number of probe-set names
    that matched across the experiments.

9
Horrifying example of this
  • These turned out to be spike-in probes designed
    to calibrate the experiments (the manufacturer's
    protocol for assessing successful dye-labeling and
    hybridization).
  • bacterial mRNA transcripts
  • human housekeeping-gene mRNA transcripts

10
Motif models of binding-sites on DNA/RNA
  • Example of classic bioinformatics
  • Transcription factor binding sites!
  • Why do these remain relevant?
  • More regulatory mechanisms discovered to be
    important
  • Binding sites on mRNA for splicing machinery
  • miRNA binding sites on mRNA

11
What you should know
  • What a transcription factor binding site is
  • Have a sense for the range of algorithms
    developed to recognize them (paper on class
    website)
  • What a miRNA is
  • What kinds of binding sites have been discovered
    for splicing machinery

12
Probabilistic model of exonic splicing enhancer
Object: Exonic Splicing Enhancer (ESE)
Maniatis and Tasic, Nature (2002)
CACAGGA
An ESE recognized by the human SR protein SF2/ASF
13
Biology of alternative splicing
Alternative splicing of mRNA allows many gene
products with different functions to be produced
from a single coding sequence. It has recently
been proposed as a mechanism by which
higher-order diversity is generated.
Brett et al. Nat Gen 2001
15
Biology of alternative splicing
16
Biology of alternative splicing
17
Complex protein machinery involved to make it
happen
Maniatis and Tasic, Nature (2002)
18
How do proteins know where to bind on the
pre-mRNA?
19
How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Sequence signals?
20
Toy real-life example
  • You have a new collaborator who has developed an
    in vitro assay to detect where SF2/ASF splicing
    factors bind on pre-mRNA
  • She would like you to give her a model that she
    can use to scan the human genome for putative ESE
    binding sites that her lab can test.

21
ESE binding sites
She presents you with this list of ESE binding
sites that her assay has discovered.
22
What kind of model might you build?
23
A simple idea
Markov chain (zero-order)
Position
24
A simple idea
Markov chain (zero-order)
Position
What is the probability of this RNA sequence
being an SF2/ASF ESE?
CGGCGUG
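A minimal sketch of the position-specific zero-order model follows. The list of sites from the assay was shown as a figure and is not reproduced in this transcript, so the `sites` list below is hypothetical, as are the function names. Strictly, the code computes P(sequence | model), the quantity a probabilistic model assigns, rather than the probability that the sequence is an ESE.

```python
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical aligned 7-mer binding sites; NOT the collaborator's real list.
sites = ["CACAGGA", "CGGCGUG", "CACACGA", "GGACGGA", "CACCGGA"]

def position_frequencies(sites):
    """Maximum-likelihood estimate of P(base | position) from aligned sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts[b] / len(sites) for b in ALPHABET})
    return freqs

def sequence_probability(seq, freqs):
    """P(seq | model): product of per-position probabilities."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= freqs[i][base]
    return p

freqs = position_frequencies(sites)
p_seen = sequence_probability("CGGCGUG", freqs)    # a sequence in the training set
p_unseen = sequence_probability("UUUUUUU", freqs)  # contains bases never seen at some positions
```

With raw counts, any base that never occurs at a position gets probability zero, so `p_unseen` is exactly zero no matter how the rest of the sequence looks.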
25
What are strengths and weaknesses of this model?
26
Overfitting
  • How could we design our model better to avoid
    overfitting?
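One common answer, and the one covered in the Durbin et al. chapters listed at the end of this lecture, is to add pseudocounts. The sketch below assumes a uniform pseudocount added to every base at every position; the `sites` list and the `smoothed_frequencies` helper are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGU"

def smoothed_frequencies(sites, pseudocount=1.0):
    """Estimate P(base | position) with `pseudocount` added to every count."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs

# Hypothetical aligned sites, for illustration only.
sites = ["CACAGGA", "CGGCGUG", "CACACGA"]
freqs = smoothed_frequencies(sites)

# Smallest probability anywhere in the model; it is no longer zero.
min_prob = min(min(col.values()) for col in freqs)
```

The size of the pseudocount is a modeling choice: with few training sites a pseudocount of 1 pulls the estimates strongly toward uniform, while a smaller value stays closer to the observed counts.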

27
Null models
Better Than Chance, Karplus (2008)
28
Null models
  • If our probability model is good, an actual
    SF2/ASF ESE binding site sequence should get a
    higher score than a non-SF2/ASF ESE binding site
    sequence.
  • If our probability model is good, an actual
    SF2/ASF ESE binding site sequence should get a
    higher score than a randomly selected mRNA
    sequence of the same length.

29
Null models
  • If our probability model is good, the likelihood
    of an actual SF2/ASF ESE binding site sequence,
    based on our model, should be larger than the
    likelihood the sequence would have under a random
    model.
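The comparison above is usually written as a log-odds score. The notation here is an assumption, not taken from the slides: $M$ is the ESE model, $N$ the null model, $p_i$ the model's base distribution at position $i$, and $q$ a single background distribution. The second equality holds only if both models treat positions independently.

```latex
S(x) \;=\; \ln \frac{P(x \mid M)}{P(x \mid N)}
     \;=\; \sum_{i=1}^{L} \ln \frac{p_i(x_i)}{q(x_i)}
```

A positive score means the sequence is more likely under the ESE model than under the null.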

30
Example Null model
  • Probability of seeing A,C,G,U at one of the seven
    positions is the same as the probability of
    seeing A,C,G,U at any position in the ESE
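A rough sketch of this null model: pool the bases from every column of the known sites into one distribution, then score a sequence by its log-odds against the position-specific model. The `sites` list, the pseudocount of 1, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

# Hypothetical aligned sites, for illustration only.
sites = ["CACAGGA", "CGGCGUG", "CACACGA", "GGACGGA", "CACCGGA"]

def pooled_null(sites, pseudocount=1.0):
    """One base distribution shared by all positions, pooled over every column."""
    counts = Counter("".join(sites))
    total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (counts[b] + pseudocount) / total for b in ALPHABET}

def position_model(sites, pseudocount=1.0):
    """Position-specific base distributions with pseudocounts."""
    total = len(sites) + pseudocount * len(ALPHABET)
    return [
        {b: (col.count(b) + pseudocount) / total for b in ALPHABET}
        for col in zip(*sites)
    ]

def log_odds(seq, model, null):
    """ln P(seq | model) - ln P(seq | null), summed over positions."""
    return sum(math.log(model[i][b] / null[b]) for i, b in enumerate(seq))

null = pooled_null(sites)
model = position_model(sites)
score_site = log_odds("CACAGGA", model, null)   # one of the training sites
score_other = log_odds("AGAGCAC", model, null)  # common bases in unusual positions
```

Because the null is estimated from the same sites as the model, it captures base composition but nothing about position, which is one starting point for thinking about its strengths and weaknesses.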

31
What are the strengths and weaknesses of this
null model?
  • Probability of seeing A,C,G,U at one of the seven
    positions is the same as the probability of
    seeing A,C,G,U at any position in the ESE

32
Statistical significance
  • We believe in models when they give a high
    likelihood to our observed data.
  • Statistical tests (p-values) quantify how often
    we should expect to see such good scores by
    chance.
  • In other words, when we should reject the null
    model.

33
Small P-Value to reject the Null Hypothesis
Better Than Chance, Karplus (2008)
34
Statistical significance
Non-parametric version
  • Markov's inequality. For any scoring scheme
    that uses the log-odds score $\ln \frac{P(\text{seq} \mid M)}{P(\text{seq} \mid N)}$,
    the probability of a score better than $T$ is
    less than $e^{-T}$ for sequences distributed
    according to the null model $N$ (Milosavljevic 1993).

Better Than Chance, Karplus (2008)
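The bound can be derived in two lines, assuming the score is the natural-log odds of the model $M$ against the null $N$ (notation chosen here for illustration). The expected likelihood ratio under the null is at most 1, and Markov's inequality then bounds the tail.

```latex
\mathbb{E}_{N}\!\left[\frac{P(x \mid M)}{P(x \mid N)}\right]
  = \sum_{x} P(x \mid N)\,\frac{P(x \mid M)}{P(x \mid N)}
  = \sum_{x} P(x \mid M) \;\le\; 1
\quad\Longrightarrow\quad
P_{N}\!\left(\ln\frac{P(x \mid M)}{P(x \mid N)} \ge T\right)
  = P_{N}\!\left(\frac{P(x \mid M)}{P(x \mid N)} \ge e^{T}\right)
  \;\le\; e^{-T}
```

The sums run over sequences with nonzero probability under the null. Nothing about the form of $M$ or $N$ is used, which is why this version is called non-parametric.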
35
Statistical significance threshold
Non-parametric version
  • You've detected a pattern in the noise when
  • For a sequence database where you search N
    possible starting positions
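The threshold formula on this slide did not survive in the transcript. As a hedged sketch of what the $e^{-T}$ bound implies: if each of N starting positions exceeds a log-odds score T with probability below $e^{-T}$ under the null, the expected number of null positions scoring that well is below $N e^{-T}$, so keeping that number under a chosen level `alpha` requires $T \ge \ln(N/\alpha)$. The function names, the value of `alpha`, and the search size are assumptions.

```python
import math

def score_threshold(n_positions, alpha=0.01):
    """Smallest log-odds score T (in nats) with n_positions * exp(-T) <= alpha."""
    return math.log(n_positions / alpha)

def false_hit_bound(score, n_positions):
    """Upper bound on the expected number of null positions scoring >= score."""
    return n_positions * math.exp(-score)

# Hypothetical search size: about three billion starting positions.
n_positions = 3_000_000_000
threshold = score_threshold(n_positions, alpha=0.01)
bound_at_threshold = false_hit_bound(threshold, n_positions)
```

This gives an upper bound rather than an exact tail probability, so it tends to be conservative.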

36
Statistical significance threshold
Non-parametric version
  • You've detected a pattern in the noise when
  • For a sequence database where you search N
    possible starting positions

Does this give us a P-value for a score?
37
Statistical significance
Parametric version
  • Parameter fitting. For random sequences drawn
    from some distribution other than the null, we
    can fit a parameterized family of distributions
    to scores from a random sample, then compute
    p-values.

Better Than Chance, Karplus (2008)
38
Statistical significance
Parametric version

1. Generate a random sample of 6-mer sequences. Example:
AUUAAC GCACUU CGGCGG UUCCAG GUACGG CGUAUA GACCAU
UAAACC CGAGAC CCUAUU CACGAC GGAGUA CUUGCG
Not the null
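Step 1 could be sketched as below. The base weights are hypothetical; the only point they illustrate is that the sample is drawn from a distribution that is deliberately not the null. The helper name and the fixed seed are also assumptions.

```python
import random

ALPHABET = "ACGU"

def random_kmers(n, k=6, weights=(0.3, 0.2, 0.2, 0.3), seed=0):
    """Draw n independent k-mers using the given (hypothetical) base weights."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return ["".join(rng.choices(ALPHABET, weights=weights, k=k)) for _ in range(n)]

sample = random_kmers(13)
```

In practice the sample would be much larger than 13 sequences, since the tail of the score distribution is what matters for p-values.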
39
Statistical significance
Parametric version

2. Score these sequences

Sequence  Score
AUUAAC    -3.6286
GCACUU    -7.3581
CGGCGG    -3.9727
UUCCAG    -2.5188
GUACGG    -3.4714
CGUAUA    -1.5415
GACCAU    -7.2038
UAAACC    -1.9210
CGAGAC    -1.1950
CCUAUU    -5.2892
CACGAC    -2.1393
GGAGUA    -2.0600
CUUGCG    -1.1991
40
Parametric version

3. Match score histogram to a known parametric
distribution family
41
Parametric version
42
Parametric version

43
Parametric version
3a. Estimate parameters with a subset of the
scores

44
Parametric version
3b. Evaluate parameter fit on a second subset of
the data
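Steps 3a and 3b might look like the sketch below, which uses the 13 scores listed in step 2. The transcript does not say which distribution family the lecture fitted, so the normal distribution here is only a stand-in, and the 7/6 split and the held-out log-density measure are likewise assumptions.

```python
import math
from statistics import NormalDist

# The 13 scores listed in step 2 of the lecture.
scores = [-3.6286, -7.3581, -3.9727, -2.5188, -3.4714, -1.5415, -7.2038,
          -1.9210, -1.1950, -5.2892, -2.1393, -2.0600, -1.1991]

fit_scores, held_out = scores[:7], scores[7:]

# 3a. Estimate parameters (mean, standard deviation) on the first subset.
fitted = NormalDist.from_samples(fit_scores)

# 3b. Evaluate the fit on the second subset: mean log-density of held-out scores.
held_out_loglik = sum(math.log(fitted.pdf(s)) for s in held_out) / len(held_out)
```

Comparing `held_out_loglik` across candidate families is one simple way to choose between them without rewarding a family for fitting the scores it was trained on.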

45
Parametric version
4. Compute score and p-value for a real
sequence. UUUGAC 0.06

46
Parametric version
4. Compute score and p-value for a real
sequence. UUUGAC 0.06

P-value
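A sketch of step 4, assuming higher scores are better and that the random-sample scores from step 2 have been fitted with a normal distribution (a stand-in, since the family used in the lecture is not given in the transcript). The two query scores are hypothetical; an empirical p-value is included for comparison.

```python
from statistics import NormalDist

# The 13 random-sample scores listed in step 2 of the lecture.
scores = [-3.6286, -7.3581, -3.9727, -2.5188, -3.4714, -1.5415, -7.2038,
          -1.9210, -1.1950, -5.2892, -2.1393, -2.0600, -1.1991]

fitted = NormalDist.from_samples(scores)

def upper_tail_p_value(score, dist):
    """P(random score >= score) under the fitted distribution."""
    return 1.0 - dist.cdf(score)

def empirical_p_value(score, sample):
    """Fraction of sampled scores >= score, with +1 smoothing so it is never zero."""
    hits = sum(1 for s in sample if s >= score)
    return (hits + 1) / (len(sample) + 1)

# Hypothetical query scores, not taken from the lecture.
p_low = upper_tail_p_value(-5.0, fitted)
p_high = upper_tail_p_value(-1.0, fitted)
p_empirical = empirical_p_value(0.0, scores)
```

The empirical version can never report a p-value smaller than 1 / (sample size + 1), which is the practical reason for fitting a parametric family when very small p-values are needed.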
47
Correct for multiple testing
  • What if you have more than one candidate sequence
    whose significance you want to evaluate?
  • You are cheating!
  • Using the same data to test n alternative
    hypotheses

48
Correct for multiple testing
  • Bonferroni
  • Test each hypothesis (that sequence S is an ESE
    binding site) at significance level α/n
  • To test 10 sequences and find whether any have
    P < 0.05
  • Test each at the 0.005 significance level
  • Benjamini and Hochberg
  • Control the FDR (false discovery rate), the
    expected fraction of false positives among tests
    declared significant
  • Storey
  • q-value: the minimum FDR incurred by calling a
    test significant
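The first two corrections can be sketched in a few lines. The p-values below are hypothetical, and the function names are illustrative. The Benjamini-Hochberg function implements the standard step-up procedure: sort the p-values, find the largest rank k with p_(k) <= k * alpha / n, and call the k smallest significant.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the per-test level alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest rank k (1-based) with p_(k) <= k * alpha / n; reject ranks 1..k.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / n:
            cutoff_rank = rank
    significant = [False] * n
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Hypothetical p-values for 10 candidate sequences.
p_values = [0.001, 0.004, 0.012, 0.018, 0.030, 0.200, 0.350, 0.500, 0.700, 0.900]
bonf = bonferroni_significant(p_values)
bh = benjamini_hochberg_significant(p_values)
```

On these values Bonferroni keeps only the two p-values at or below 0.005, while Benjamini-Hochberg keeps the four smallest, reflecting its weaker guarantee (control of the FDR rather than of any false positive at all).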

49
What you should know
  • How to recognize and correct for overly
    optimistic significance calculations.

50
Data integration
  • A 21st-century probabilistic model of ESEs
    should also incorporate
  • Sequence motif
  • Structural patterns?
  • Cis location patterns?

51
How do proteins know where to bind on the
pre-mRNA?
CACAGGA
Structural signals
52
Where are the SF2/ASF ESE binding sites?
Sanford, Genome Research 2008
53
What you should know
  • Before starting to model molecular biology data
  • Get some understanding of the big picture
  • Read as many background papers as possible
  • When you get data from an experimentalist
  • Learn about the technology used to collect the
    data
  • Read up on any previous attempts to model this
    kind of data in the compbio literature
  • Think carefully about what kinds of available
    information have not yet been used to model the
    problem.

54
What you should understand
  • Material in Durbin et al.
  • Chap. 1: Intro to maximum likelihood estimation,
    posterior probabilities, Bayesian statistics and
    pseudocounts as used in computational molecular
    biology
  • Chap. 3.1: Intro to Markov chains
  • Chap. 5.6: More about pseudocounts and
    regularizers
  • Chap. 11.1: Parametric distributions commonly
    used in computational molecular biology