Title: Lecture 3 Introduction to probabilistic models' Part 2: parametric probability distributions and the
1Lecture 3Introduction to probabilistic models.
Part 2 parametric probability distributions and
their alternatives
- Rachel Karchin
- BME 580.688 Spring 2009
2Overview
- Examples of how parametric probability
distributions can be used in computational
biology. - Alternative approaches machine learning
3Parmetric distributions commonly used in
computational biology
- Binomial
- Hypergeometric
- Multinomial
- Poisson
- Gamma
- Dirichlet
There are many more.
4Parmetric distributions commonly used in
computational biology
- Trick is understanding how to use them
intelligently.
5Binomial Distribution
- Independent events with two mutually exclusive
possible outcomes(success or failure) - Number of trials is n
PDF
CDF
6p 0.3 n 50
p(X)
PDF
X
CDF
x
7Binomial distribution applied to biological
sequence analysis
- How many matches should we see by chance between
two random DNA sequences of length 100?
Sequence 1
CCAGCGAACACTGCAATCTTGTTGGAAATTTAAATTAAGAAACATGCAGT
Sequence 2
CATGAGTAGGTCCTGCTGCCCAATAAATGAGCTTTGCAGGCACCAAAGCT
Sequence 1
TTAAAGAACCTGGCTCTGAAAACAATTCATGTGGGGACCTTATTTAAGAA
Sequence 2
GAAAAAAGGGAGAAGAATAAATGTAATGTAGTTGTAATAAGGCTAAAATC
8(No Transcript)
9Sequence analysis of DNA methylation sites
- GATC is methylated in E. coli by DNA
methyltransferase (DAM). - Basis of a GATC regulatory network (replication
and gene expression)
Bang 2008
Marinus 2005
10Sequence analysis of DNA methylation sites
- This network is the topic of a huge literature
- One function is control of cell metabolism under
cold shock and oxygen shift conditions - Two hypotheses proposed were
- Transcription of genes containing GATC clusters
in coding sequence blocked when hemi-methylated
(higher thermal stability). Henaut 1996 - Presence of methylated GATCs in non-coding 500 bp
upstream of affected genes impacts interaction
with regulatory proteins. Oshima 2002
11Sequence analysis of DNA methylation sites
- Oshima et al (2002) did micoarray (and other)
analysis of 4019 E coli genes, comparing gene
expression with wildtype and (LOF) mutant DAM. - DAM deficiency effected a large number of genes
(10). - GATC is the target of DAM
- If hypothesis 2 is true, we should see a
relationship between GATC abundance and DAM
control.
12Sequence analysis of DNA methylation sites
- Riva et al. 2004 did statistical analysis to test
this. - Oshimas hypothesis predicts regions upstream of
coding sequence that are 500bp long with an
unusual frequency of GATC
13Binomial distribution applied to biological
sequence analysis
- Riva evaluated GATC enrichment by comparing
observed number of GATCs 500 bp upstream of these
genes with a null model based on binomial
distribution.
500 bp
14Binomial distribution applied to biological
sequence analysis
- What is a binomial null model of GATC frequency
in these upstream regions?
500 bp
15Know4019 genes analyzed7174 GATC counted in
their upstream regions
Problem-solving tips
- Define success
- Calculate p
- Compute probability that a single gene has no
GATC in its 500 bp upstream region - Compute number of genes expected to have no GATC
in their 500 bp upstream region (1 GATC, 2 GATC
etc.)
16Number of GATC
Solution
Number of genes
Number of bp upstream each gene
- For a single gene, probability of seeing no
GATC - Number of genes expected to have 0 GATC
17- How would you use this model to assess whether
the 10 of genes on the microarray that were
sensitive (differentially expressed) to DAM
status were enriched in GATC in their 500 bp
upstream regions?
18What you should know
- Understand what the binomial distribution is
- Develop some intuition about how to use it in
biological sequence analysis - The logic used by Riva et al. in their analysis
- Cocktail party explanations of
- DNA methylation and the enyzmes that do it
- E. coli
- E. colis GATC regulatory network
- Hemi-methylation vs. methylation
19Hypergeometric distribution
- With the binomial distribution, we assume that on
each trial we are drawing the same kind of
object. - p is the same for each draw
- What if on each trial we are drawing two kinds of
objects?
20Hypergeometric distribution
- Number of successes in sequence of k draws from
finite population with two kinds of objects,
without replacement.
PDF
N total number of objects M number of objects of
the first type k number of draws X number of
successes (sample success)
21M 100 N 50 k 30
N total number of objects M number of objects of
the first type k number of draws X number of
successes (sample success)
p(X)
PDF
X
CDF
x
22Is a pathway is enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
Genes in TP53 pathway(KEGG ID hsa04115)ATM CHEK2
ATR CHECK1 CDKN2A MDM2 . . .
GO Term counts
definition GO0005488 15 The
selective, often stoichiometric interaction
o... GO0005509 15 Interacting selectively with
calcium ions (Ca2).... GO0004720 2 Catalysis
of the reaction peptidyl-L-lysyl-pepti... GO0046
872 15 Interacting selectively with any metal
ion." GOC... GO0016641 2 Catalysis of an
oxidation-reduction (redox) react... GO0043169
8 Interacting selectively with cations, charged
ato... GO0005507 3 Interacting selectively
with copper (Cu) ions." ... GO0003677 2
Interacting selectively with DNA
(deoxyribonuclei... GO0046914 11 Interacting
selectively with a transition metal
i... GO0004222 3 Catalysis of the hydrolysis
of nonterminal peptid...
k
23- Microarray experiment
- Microarray covered withprobes that hybridize
cDNA - cDNA from cells in contrasting conditions of
interest applied to the array - Of most interest are differentially expressed
genes
24http//www.geneontology.org/
a collaborative effort to address the need for
consistent descriptions of gene products in
different databases.
The GO project has developed three structured
controlled vocabularies (ontologies) that
describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a
species-independent manner.
Each entry in GO has a unique numerical
identifier of the form GOnnnnnnn, and a term
name, e.g. cell, fibroblast growth factor
receptor binding or signal transduction. Each
term is also assigned to one of the three
ontologies.
GO terms can be linked by five types of
relationships is_a, part_of, regulates,
positively_regulates and negatively_regulates.
25- Pathway
- Molecular interaction and reaction networks
http//www.biocarta.com
http//www.kegg.com/kegg/kegg2.html
26Is a pathway is enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N d.e. genes in microarray experiment M genes are
associated with term t k number of genes in the
pathway X number of genes in the pathway
that are associated with term t
Genes in TP53 pathway(KEGG ID hsa04115)ATM CHEK2
ATR CHECK1 CDKN2A MDM2 . . .
k
27Is a pathway is enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N d.e. genes in microarray experiment M genes are
associated with term t k number of genes in the
pathway X number of genes in the pathway
that are associated with term t
Genes in TP53 pathway(KEGG ID hsa04115)ATM CHEK2
ATR CHECK1 CDKN2A MDM2 . . .
- What is the null hypothesis?
k
28How would we define GO term t enrichment in this
microarray experiment?
- Using this approximation the p-value for
over-represented GO terms can be calculated as
- What about p-value for under-represented GO terms?
Slide courtesy of Josep Mosquera
29How would we define GO term t enrichment in this
microarray experiment?
- Alternatively, can apply standard statistical
tests such as chi-squared and Fishers exact.
30How would you convert this to a joint probability
distribution?
31What you should know
- Understand what the hypergeometric distribution
is - Develop some intuition about where you might
apply it in biological sequence analysis - When to use hypergeometric vs. binomial
distribution - Where and when to use hypergeometric vs.
chi-squared and/or Fishers exact test to
evaluate enrichment. - Cocktail party explanations of
- Microarray
- cDNA
- Differential expression
- Gene Ontology
- pathway
32Multinomial Distribution
- Independent events with k mutually exclusive
possible outcomes having probabilities q1, q2, .
. ., qk
PDF
33Multinomial Distribution
- We roll a fair 100-sided die N times
- Probability of getting each of the 100 outcomes
is q1, q2, . . ., q100 where q1q2 . .
.q1000.01 - What is the probability of rolling 200 times and
getting each outcome twice?
34Multinomial Distribution
N total number of die rolls k number of possible
outcomes (sides to the die) Fi outcome is
probability of success Xi number of successes
with respect tooutcome i (sample success)
35Multinomial distribution
- What is the probability of seeing an adenine
nucleotide (A) 13 times in the last position of
a C1 enhancer in a chordate species?
Multiple sequence alignment of intron in human
shha and 12 homologs from other chordates.
ar-C transcriptional enhancers C1,C2,C3, and
C4are shown (Hadzhiev et al. 2007).
36Multinomial distribution
- 13 species in the alignment
- The column has 9 As, 3 Gs and a T
- qA9/13, qC0/13, qG3/13, qT1/13
What kind of estimate is this of the ? parameters?
N total number of species k number of possible
outcomes (nucleotide types) Fi outcome is
probability of success Xi number of successes
with respect tooutcome i (sample success)
37Multinomial distribution
N total number of species k number of possible
outcomes (nucleotide types) Fi outcome is
probability of success Xi number of successes
with respect tooutcome i (sample success)
Whats this?
38Multinomial distribution
- Is this probability due to chance?
39Poisson Distribution
- Number of events occurring within a fixed
interval of time or space, which occur with known
average rate and independent of time or space
since the last event.
PDF
CDF
40?10
p(X)
PDF
X
CDF
x
41Poisson Distribution
gtHD_HUMAN Huntingtin MATLEKLMKAFESLKSFQQQQQQQQQQQQ
QQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQPLLPQPQPPPPPPPPPPG
PAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNS
- How can we test whether low complexity protein
sequences are enriched in particular proteins or
proteomes? - Define low complexity sequence (Sim et al. 2002)
- At least 10 residues long
- At least 50 composed of a single residue type
- Begin and end with the residue type
- No runs greater than length 5 of any other
residue type
42Poisson Distribution
Sim et al. 2002
p(X) probability that event happens X times
l sequence window length
43Poisson Distribution
Sim et al. 2002
- Expected number of low complexitysequences of
length l in a proteome
TlNumber of sequence windows of length l
SC S. cerevisiae CE C. elegans DM
fruitfly AT thale cress
Ratio of the number of low-complexity sequences
found above that expected from the Poisson
distribution model, DeltaR, to the number of
proteins in each eukaryote proteome plotted for
each residue type
44What you should know
- All terms in red on these slides.
- How to decide when you should use a parametric
probability distribution in your models - How to pick a good parametric probability
distribution for a particular problem
45Choosing a parametric probability distribution
for your model
- What if intuition fails you and there is no
obvious choice? - Exploratory analysis and graphics
- Look at histograms of the data
- Generate random samples from known distributions
and compare to your data, standardized - Quantile-quantile plots
- Compute kurtosis and skewness of your data,
standardized, then compare to known kurtosis and
skewness of parametric families (Ricci
tutorial).
46Histogram your data
47Compare your data to random samples from known
distribution families
Quantile-Quantile plot
48Use skewness and kurtosis of your data to select
a distribution family
mean
1st moment
variance
2nd moment
skewness
3rd moment
4th moment
kurtosis
49Use skewness and kurtosis of your data to select
a distribution family
- For scores of 10,000 random protein sequences
from model of E. Coli CheY protein
gt mean(data,1) 1 -0.014408 gt
var(data,1) 1 1.138294 gt skewness(data,1) 1
-0.06517909 gt kurtosis(data,1) 1 5.71957
50Use skewness and kurtosis of your data to select
a distribution family
- Skewness
- Asymmetry
- Negative skew gt longer left tailprobability
mass concentrated on right - Positive skew gt longer right tailprobability
mass concentrated on left - kurtosis
- High gt sharper peak, fatter tails
- Low gt rounder peak, wider shoulders
gt skewness(data,1) 1 -0.06517909 gt
kurtosis(data,1) 1 5.71957
51Use skewness and kurtosis of your data to select
a distribution family
- We want a distribution with skew 0 and kurtosis
5
Students-t Distribution
gt sample.T lt- rt(n10000,df4.6) gt
skewness(sample.T) 1 0.09057073 gt
kurtosis(sample.T) 1 5.238485
Normal Distribution
52Parameter estimation
53Assessing model fit
- Goodness of fit tests to a distribution D
- Null hypothesis sample data come from D
- Alternative hypothesis sample data do not come
from D - Read about it in Ricci.
54Machine learning approaches to biological
sequence analysis
ACCEPTOR
DONOR
Ben-Hur 2008
55Machine learning approaches to biological
sequence analysis
- lt0.1 of GT and AG are actually splice sites
- How represent acceptor sites so as to
discriminate them from other AGs?
56Machine learning approaches to biological
sequence analysis
- Support vector machines with kernels that
incorporate biological knowledge - Require a gold standard set of validate acceptor
sequences (and flanking sequence) and a set of
decoys.
57How represent acceptor sites so as to
discriminate them from other AGs?
- Use biology!
- GC content of introns higher than that of exons
- Consider properties of flanking sequence
- Consider conservation in multiple species
Ben-Hur 2008
58How represent acceptor sites so as to
discriminate them from other AGs?
- GC content of introns higher than that of exons
- Two features represent each putative acceptor
site - Feature 1 GC fraction of exon
- Feature 2 GC fraction of intron
Ben-Hur 2008
59How represent acceptor sites so as to
discriminate them from other AGs?
- Non-linear kernel (Gaussian or polynomial) gives
small improvement to classifier perfomance
compared to linear - Large degree polynomial and small-width Gaussian
kernel ield reduced accuracy. Too flexible.
auROC
Ben-Hur 2008
60How represent acceptor sites so as to
discriminate them from other AGs?
- Need better and more features.
Count of four bases on intronic and exonic sides
of the acceptor
8
- Works on a small set of reals and decoys but
doesnt scale up to whole-genome analysis
32
Count of all trimers
128
Count of all dimers
2 4l
Count of all l-mers
61How represent acceptor sites so as to
discriminate them from other AGs?
Features
Count of four bases on intronic and exonic sides
of the acceptor
Intron A 4 Exon A 0 Intron C 1 Exon C
2 Intron G 1 Exon G 3 Intron U 2 Exon U 1
8
32
Intron A C G U A 1 0 1
1 C 1 0 0 0 G
0 0 0 1 U 1 1 0 0
Exon A C G U A 0 0 0
0 C 0 0 1 0 G 0
1 1 1 U 0 1 0 0
Count of all dimers
128
Count of all trimers
2 4l
Count of all l-mers
- This approach motivated the design of the
spectrum kernel
62Spectrum Kernel
Feature map based on spectrum of a sequence
X
F(X)
- C. Leslie, E. Eskin, and W. Noble, The Spectrum
Kernel - A String Kernel for SVM Protein Classification.
Pacific Symposium on Biocomputing, 2002. - C. Leslie, E. Eskin, J. Weston and W. Noble,
- Mismatch String Kernels for SVM Protein
Classification. - NIPS 2002.
63The k-Spectrum of a Sequence
- Feature map for SVM based on spectrum of a
sequence - The k-spectrum of a sequence is the set of all
k-length contiguous subsequences that it contains - Feature map is indexed by all possible k-length
subsequences - (k-mers) from any alphabet (here the mRNA
nucleotides) - Dimension of feature space 4k
ACCUGUACGG
ACC CCU CUG UGU GUA UAC ACG
CGG
Slide courtesy of Christina Leslie
64k-Spectrum Feature Map
- Feature map for k-spectrum with no mismatches
- For sequence x, F(k)(x) (Ft (x))k-mers t,
where Ft (x) occurrences of t in x
ACCUGUGUGG
( 0 , 0 , , 1 , , 1 , , 2 ) AAA AAC
ACC CCU UGU
Slide courtesy of Christina Leslie
65Issues with spectrum kernel
K_kspectrum(x,x) F(k)(x) F(k)(x)
- Explicit computation too hard for large k
- Nucleotide sequences with kgt10
- Protein sequences with kgt5
- When k is large, the requirement for exact match
will result in very few non-zero values - One efficiency is to compute only k-mers with
non-zero counts. Why does that work?
Ben-Hur 2008
66Improved spectrum kernels
- (k,m)-Mismatch Spectrum Kernel
- Feature map for k-spectrum, allowing m
mismatches - if s is a k-mer, F(k,m)(s) (Ft(s))k-mers t,
where Ft(s) 1 if s is within m mismatches from
t,0 otherwise - extend additively to longer sequences x by
summing over all k-mers s in x
AAC
GAC
AAG
UAC
AUC
Slide courtesy of Christina Leslie
67Improved spectrum kernels
- Able to capture putatively important sequence
motifs at multi-resolution
Ben-Hur 2008
68Weighted degree spectrum kernel
k
L-d1
K_kweighteddegree(x,x) S S ßd
K_dspectrum(xkkd,xkkd)
d1
k1
xkkd
Substring of X of length d starting at position k
- Take advantage of positional information.
- Nucleotide positions are not all equally
informative for purposes of detecting acceptor
sites - Analyze a sequence of fixed length L and consider
each position separately
Ben-Hur 2008
69Weighted degree spectrum kernel
k
L-d1
K_kweighteddegree(x,x) S S ßd
K_dspectrum(xkkd,xkkd)
d1
k1
xkkd
Substring of X of length d starting at position k
- Equivalent to using a mixed spectrum kernel for
each position of the sequence separately
(ignoring boundary effect). - How choose ßd ?
Ben-Hur 2008
70SVM performance discriminating acceptor sites
with three kinds of spectrum kernel
Ben-Hur 2008
71Programming assignment 1
- Will be posted this evening
72Programming assignment 1
- You can use machine learning or statistical
modeling or a combination of both. - Libraries available for R will make this much
easier for you. - You will be able to use R functions in your
Python programs
73Programming assignment 1
- Be realistic about your current skill set.
- If youre still shaky with Python, do something
simple. - Main priority is to complete the assignment and
write clean code. - If thats easy for you, heres a chance to do
something innovative.