Lecture 3 Introduction to probabilistic models' Part 2: parametric probability distributions and the

1
Lecture 3: Introduction to probabilistic models.
Part 2: parametric probability distributions and
their alternatives

Rachel Karchin
BME 580.688 Spring 2009

2
Overview

Examples of how parametric probability
distributions can be used in computational
biology.
Alternative approaches: machine learning

3
Parametric distributions commonly used in
computational biology

Binomial
Hypergeometric
Multinomial
Poisson
Gamma
Dirichlet

There are many more.
4
Parametric distributions commonly used in
computational biology

Trick is understanding how to use them
intelligently.

5
Binomial Distribution

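Independent events with two mutually exclusive
possible outcomes (success or failure)
Number of trials is n; probability of success on each trial is p

PDF: $P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}$
CDF: $P(X \le x) = \sum_{i=0}^{x} \binom{n}{i} p^{i} (1-p)^{n-i}$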
6
[Figure: binomial PDF, p(X) versus X, and CDF versus x, for p = 0.3, n = 50]
7
Binomial distribution applied to biological
sequence analysis

How many matches should we see by chance between
two random DNA sequences of length 100?

Sequence 1
CCAGCGAACACTGCAATCTTGTTGGAAATTTAAATTAAGAAACATGCAGT
Sequence 2
CATGAGTAGGTCCTGCTGCCCAATAAATGAGCTTTGCAGGCACCAAAGCT
Sequence 1
TTAAAGAACCTGGCTCTGAAAACAATTCATGTGGGGACCTTATTTAAGAA
Sequence 2
GAAAAAAGGGAGAAGAATAAATGTAATGTAGTTGTAATAAGGCTAAAATC
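
The question above can be sketched with a binomial model. This is a minimal illustration, not the lecture's own solution: it assumes that each of the 100 aligned positions is an independent trial, that all four bases are equally likely (so p = 0.25), and that the two 50-nt lines shown for each sequence are halves of one 100-nt sequence. The names `SEQ1`, `SEQ2` and `binom_pmf` are introduced here for illustration.

```python
from math import comb, sqrt

# Sequences copied from the slide; each is assumed to be the concatenation
# of its two 50-nt lines.
SEQ1 = ("CCAGCGAACACTGCAATCTTGTTGGAAATTTAAATTAAGAAACATGCAGT"
        "TTAAAGAACCTGGCTCTGAAAACAATTCATGTGGGGACCTTATTTAAGAA")
SEQ2 = ("CATGAGTAGGTCCTGCTGCCCAATAAATGAGCTTTGCAGGCACCAAAGCT"
        "GAAAAAAGGGAGAAGAATAAATGTAATGTAGTTGTAATAAGGCTAAAATC")


def binom_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n = len(SEQ1)
p = 0.25  # assumption: uniform, independent base frequencies

expected = n * p                   # mean number of chance matches
sd = sqrt(n * p * (1 - p))         # standard deviation of that count
observed = sum(a == b for a, b in zip(SEQ1, SEQ2))

# Probability of seeing at least the observed number of matches by chance.
p_upper = sum(binom_pmf(x, n, p) for x in range(observed, n + 1))

print(f"expected matches: {expected:.1f} +/- {sd:.2f}")
print(f"observed matches: {observed}")
print(f"P(X >= observed) under the null: {p_upper:.3g}")
```

If real base frequencies are skewed, p would be the sum of squared base frequencies rather than 0.25.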
8
(No Transcript)
9
Sequence analysis of DNA methylation sites

GATC is methylated in E. coli by DNA
methyltransferase (DAM).
Basis of a GATC regulatory network (replication
and gene expression)

Bang 2008
Marinus 2005
10
Sequence analysis of DNA methylation sites

This network is the topic of a huge literature
One function is control of cell metabolism under
cold shock and oxygen shift conditions
Two hypotheses proposed were:
1. Transcription of genes containing GATC clusters
in coding sequence is blocked when hemi-methylated
(higher thermal stability). Henaut 1996
2. Presence of methylated GATCs in the non-coding 500 bp
upstream of affected genes impacts interaction
with regulatory proteins. Oshima 2002

11
Sequence analysis of DNA methylation sites

Oshima et al. (2002) did microarray (and other)
analysis of 4019 E. coli genes, comparing gene
expression with wildtype and loss-of-function (LOF) mutant DAM.
DAM deficiency affected a large number of genes
(10%).
GATC is the target of DAM
If hypothesis 2 is true, we should see a
relationship between GATC abundance and DAM
control.

12
Sequence analysis of DNA methylation sites

Riva et al. 2004 did statistical analysis to test
this.
Oshima's hypothesis predicts regions upstream of
coding sequence that are 500 bp long with an
unusual frequency of GATC

13
Binomial distribution applied to biological
sequence analysis

Riva evaluated GATC enrichment by comparing
observed number of GATCs 500 bp upstream of these
genes with a null model based on binomial
distribution.

500 bp
14
Binomial distribution applied to biological
sequence analysis

What is a binomial null model of GATC frequency
in these upstream regions?

500 bp
15
Known: 4019 genes analyzed; 7174 GATC counted in
their upstream regions
Problem-solving tips

Define success
Calculate p
Compute probability that a single gene has no
GATC in its 500 bp upstream region
Compute number of genes expected to have no GATC
in their 500 bp upstream region (1 GATC, 2 GATC
etc.)

16
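Solution

$p = \frac{\text{number of GATC}}{\text{number of genes} \times \text{number of bp upstream of each gene}} = \frac{7174}{4019 \times 500}$

For a single gene, probability of seeing no
GATC: $P(X = 0) = (1 - p)^{500}$
Number of genes expected to have 0 GATC: $4019 \times (1 - p)^{500}$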
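
The problem-solving steps can be sketched numerically. This is an illustrative calculation under the simplification the slide labels suggest: each of the 500 upstream positions is treated as an independent trial, with success meaning that a GATC is found there. The constant and function names are introduced here and are not from Riva et al.

```python
from math import comb

N_GENES = 4019       # genes analyzed (from the slide)
N_GATC = 7174        # GATC sites counted upstream (from the slide)
UPSTREAM_BP = 500    # length of each upstream region (from the slide)

# Per-position probability of a GATC under the null model.
p = N_GATC / (N_GENES * UPSTREAM_BP)


def binom_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def expected_genes(x):
    """Expected number of genes with exactly x GATC in their upstream region."""
    return N_GENES * binom_pmf(x, UPSTREAM_BP, p)


print(f"p = {p:.5f}")
for x in range(6):
    print(f"{x} GATC: P = {binom_pmf(x, UPSTREAM_BP, p):.4f}, "
          f"expected genes = {expected_genes(x):.1f}")
```

Comparing these expected counts with the observed histogram of GATC counts per gene is one way to see whether the upstream regions depart from the null model.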

How would you use this model to assess whether
the 10% of genes on the microarray that were
sensitive (differentially expressed) to DAM
status were enriched in GATC in their 500 bp
upstream regions?

18
What you should know

Understand what the binomial distribution is
Develop some intuition about how to use it in
biological sequence analysis
The logic used by Riva et al. in their analysis
Cocktail party explanations of
DNA methylation and the enzymes that do it
E. coli
E. coli's GATC regulatory network
Hemi-methylation vs. methylation

19
Hypergeometric distribution

With the binomial distribution, we assume that on
each trial we are drawing the same kind of
object.
p is the same for each draw
What if on each trial we are drawing two kinds of
objects?

20
Hypergeometric distribution

Number of successes in a sequence of k draws from a
finite population with two kinds of objects,
without replacement.

PDF: $P(X = x) = \frac{\binom{M}{x}\binom{N-M}{k-x}}{\binom{N}{k}}$
N = total number of objects
M = number of objects of the first type
k = number of draws
X = number of successes (sample success)
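
To build intuition for when drawing without replacement matters, the sketch below compares the hypergeometric probabilities with a binomial that uses p = M/N. The population sizes are hypothetical values chosen for illustration, and the function names are introduced here.

```python
from math import comb


def hypergeom_pmf(x, N, M, k):
    """P(X = x): x first-type objects in k draws without replacement
    from N objects, M of which are of the first type."""
    if x < max(0, k - (N - M)) or x > min(k, M):
        return 0.0
    return comb(M, x) * comb(N - M, k - x) / comb(N, k)


def binom_pmf(x, n, p):
    """P(X = x) for n independent draws with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def max_gap(N, M, k):
    """Largest pointwise difference between the two models."""
    p = M / N
    return max(abs(hypergeom_pmf(x, N, M, k) - binom_pmf(x, k, p))
               for x in range(k + 1))


# Hypothetical populations with the same fraction of first-type objects.
small_gap = max_gap(N=50, M=20, k=10)        # 10 draws deplete a pool of 50
large_gap = max_gap(N=50000, M=20000, k=10)  # 10 draws barely change the pool

print(f"max |hypergeometric - binomial|, N = 50:    {small_gap:.4f}")
print(f"max |hypergeometric - binomial|, N = 50000: {large_gap:.6f}")
```

When the number of draws is tiny relative to the population, the two models nearly coincide; the hypergeometric is the safer choice when the sample is a sizeable fraction of the population.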
21
[Figure: hypergeometric PDF, p(X) versus X, and CDF versus x, for M = 100, N = 50, k = 30]
22
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
Genes in TP53 pathway (KEGG ID hsa04115; k genes): ATM CHEK2
ATR CHEK1 CDKN2A MDM2 . . .

GO term     Count  Definition
GO:0005488  15     The selective, often stoichiometric interaction o...
GO:0005509  15     Interacting selectively with calcium ions (Ca2+)....
GO:0004720  2      Catalysis of the reaction peptidyl-L-lysyl-pepti...
GO:0046872  15     Interacting selectively with any metal ion." GOC...
GO:0016641  2      Catalysis of an oxidation-reduction (redox) react...
GO:0043169  8      Interacting selectively with cations, charged ato...
GO:0005507  3      Interacting selectively with copper (Cu) ions." ...
GO:0003677  2      Interacting selectively with DNA (deoxyribonuclei...
GO:0046914  11     Interacting selectively with a transition metal i...
GO:0004222  3      Catalysis of the hydrolysis of nonterminal peptid...
23

Microarray experiment
Microarray covered with probes that hybridize
cDNA
cDNA from cells in contrasting conditions of
interest applied to the array
Of most interest are differentially expressed
genes

24
http://www.geneontology.org/

Gene Ontology

a collaborative effort to address the need for
consistent descriptions of gene products in
different databases.
The GO project has developed three structured
controlled vocabularies (ontologies) that
describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a
species-independent manner.
Each entry in GO has a unique numerical
identifier of the form GO:nnnnnnn, and a term
name, e.g. cell, fibroblast growth factor
receptor binding or signal transduction. Each
term is also assigned to one of the three
ontologies.
GO terms can be linked by five types of
relationships: is_a, part_of, regulates,
positively_regulates and negatively_regulates.
25

Pathway
Molecular interaction and reaction networks

http://www.biocarta.com
http://www.kegg.com/kegg/kegg2.html
26
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N = number of d.e. genes in microarray experiment
M = number of genes associated with term t
k = number of genes in the pathway
X = number of genes in the pathway that are associated with term t
Genes in TP53 pathway (KEGG ID hsa04115; k genes): ATM CHEK2
ATR CHEK1 CDKN2A MDM2 . . .
27
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N = number of d.e. genes in microarray experiment
M = number of genes associated with term t
k = number of genes in the pathway
X = number of genes in the pathway that are associated with term t
Genes in TP53 pathway (KEGG ID hsa04115; k genes): ATM CHEK2
ATR CHEK1 CDKN2A MDM2 . . .

What is the null hypothesis?
28
How would we define GO term t enrichment in this
microarray experiment?

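Using this approximation, the p-value for
over-represented GO terms can be calculated as
$p = P(X \ge x) = \sum_{i=x}^{\min(k, M)} \frac{\binom{M}{i}\binom{N-M}{k-i}}{\binom{N}{k}}$
where x is the observed number of pathway genes associated with term t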

What about p-value for under-represented GO terms?

Slide courtesy of Josep Mosquera
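
A minimal sketch of the enrichment calculation, using the N, M, k, X notation defined on the preceding slides. The counts at the bottom are hypothetical and only illustrate the call; the over-representation p-value is the upper hypergeometric tail and the under-representation p-value is the lower tail.

```python
from math import comb


def hypergeom_pmf(x, N, M, k):
    """P(X = x) pathway genes annotated with term t, given N d.e. genes,
    M of them annotated with t, and k of them in the pathway."""
    if x < max(0, k - (N - M)) or x > min(k, M):
        return 0.0
    return comb(M, x) * comb(N - M, k - x) / comb(N, k)


def enrichment_p_values(x, N, M, k):
    """Return (p_over, p_under) = (P(X >= x), P(X <= x))."""
    lo, hi = max(0, k - (N - M)), min(k, M)
    p_over = sum(hypergeom_pmf(i, N, M, k) for i in range(x, hi + 1))
    p_under = sum(hypergeom_pmf(i, N, M, k) for i in range(lo, x + 1))
    return p_over, p_under


# Hypothetical counts: 1000 d.e. genes, 50 annotated with term t,
# 40 in the pathway, 8 of those annotated with term t.
p_over, p_under = enrichment_p_values(x=8, N=1000, M=50, k=40)
print(f"expected X under the null: {40 * 50 / 1000:.1f}")
print(f"P(X >= 8) = {p_over:.3g}   P(X <= 8) = {p_under:.3g}")
```

The upper-tail value is the same quantity a one-sided Fisher's exact test reports for the corresponding 2 x 2 table.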
29
How would we define GO term t enrichment in this
microarray experiment?

Alternatively, can apply standard statistical
tests such as chi-squared and Fisher's exact test.

30
How would you convert this to a joint probability
distribution?

What are marginals?

31
What you should know

Understand what the hypergeometric distribution
is
Develop some intuition about where you might
apply it in biological sequence analysis
When to use hypergeometric vs. binomial
distribution
Where and when to use hypergeometric vs.
chi-squared and/or Fisher's exact test to
evaluate enrichment.
Cocktail party explanations of
Microarray
cDNA
Differential expression
Gene Ontology
pathway

32
Multinomial Distribution

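Independent events with k mutually exclusive
possible outcomes having probabilities q1, q2, . . ., qk

PDF: $P(X_1 = x_1, \ldots, X_k = x_k) = \frac{N!}{x_1! \cdots x_k!} \prod_{i=1}^{k} q_i^{x_i}$, where $N = \sum_{i} x_i$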
33
Multinomial Distribution

We roll a fair 100-sided die N times
Probability of getting each of the 100 outcomes
is q1, q2, . . ., q100 where q1 = q2 = . . . = q100 = 0.01
What is the probability of rolling 200 times and
getting each outcome twice?
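
The die question can be answered with the multinomial formula. The sketch below works in log space because 200! and 0.01^200 are far outside comfortable floating-point range when handled separately; `multinomial_log_pmf` is a helper introduced here for illustration.

```python
from math import exp, lgamma, log


def multinomial_log_pmf(counts, probs):
    """Natural log of the multinomial probability of the given counts."""
    n = sum(counts)
    # log of N! / (x_1! ... x_k!), using lgamma(m + 1) = log(m!)
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_prob = 0.0
    for c, q in zip(counts, probs):
        if c > 0:
            if q == 0.0:
                return float("-inf")  # impossible outcome was observed
            log_prob += c * log(q)
    return log_coef + log_prob


counts = [2] * 100     # each of the 100 faces seen exactly twice
probs = [0.01] * 100   # fair 100-sided die
log_p = multinomial_log_pmf(counts, probs)

print(f"ln P    = {log_p:.3f}")
print(f"log10 P = {log_p / log(10):.2f}")
```

Even the single most balanced outcome is extremely unlikely, because 200 rolls can be distributed over 100 faces in an enormous number of ways.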

34
Multinomial Distribution
N = total number of die rolls
k = number of possible outcomes (sides to the die)
q_i = probability of outcome i (probability of success)
X_i = number of successes with respect to outcome i (sample success)
35
Multinomial distribution

What is the probability of seeing an adenine
nucleotide (A) 13 times in the last position of
a C1 enhancer in a chordate species?

Multiple sequence alignment of intron in human
shha and 12 homologs from other chordates.
ar-C transcriptional enhancers C1, C2, C3, and
C4 are shown (Hadzhiev et al. 2007).
36
Multinomial distribution

13 species in the alignment
The column has 9 As, 3 Gs and a T
q_A = 9/13, q_C = 0/13, q_G = 3/13, q_T = 1/13

What kind of estimate is this of the q parameters?
N = total number of species
k = number of possible outcomes (nucleotide types)
q_i = probability of outcome i (probability of success)
X_i = number of successes with respect to outcome i (sample success)
37
Multinomial distribution
N = total number of species
k = number of possible outcomes (nucleotide types)
q_i = probability of outcome i (probability of success)
X_i = number of successes with respect to outcome i (sample success)
What's this?
38
Multinomial distribution

Is this probability due to chance?
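
One way to approach this question is to compare the multinomial probability of the observed column under the frequencies estimated from the column itself with its probability under a background model. The uniform background (0.25 per nucleotide) used below is an assumption made for illustration; the transcript does not say which null model the lecture used.

```python
from math import factorial, prod

# Observed column from the slide: 9 As, 0 Cs, 3 Gs, 1 T across 13 species.
column_counts = {"A": 9, "C": 0, "G": 3, "T": 1}
counts = list(column_counts.values())
n = sum(counts)


def multinomial_coef(counts):
    """N! / (x_1! ... x_k!) as an exact integer."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef


def multinomial_pmf(counts, probs):
    """Multinomial probability; note that 0.0 ** 0 evaluates to 1.0 in Python."""
    return multinomial_coef(counts) * prod(q**c for c, q in zip(counts, probs))


q_estimated = [c / n for c in counts]  # frequencies taken from the column
q_uniform = [0.25] * 4                 # assumed background model

p_estimated = multinomial_pmf(counts, q_estimated)
p_uniform = multinomial_pmf(counts, q_uniform)

print(f"P(column | estimated q) = {p_estimated:.4f}")
print(f"P(column | uniform q)   = {p_uniform:.3g}")
```

A realistic background would use genome-wide or intron-wide base frequencies, and the 13 species are not independent draws because they share ancestry, so this is only a first approximation.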

39
Poisson Distribution

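Number of events occurring within a fixed
interval of time or space, which occur with a known
average rate and independently of the time or space
since the last event.

PDF: $P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$
CDF: $P(X \le x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^{i}}{i!}$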
40
[Figure: Poisson PDF, p(X) versus X, and CDF versus x, for λ = 10]
41
Poisson Distribution
>HD_HUMAN Huntingtin
MATLEKLMKAFESLKSFQQQQQQQQQQQQ
QQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQPLLPQPQPPPPPPPPPPG
PAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNS

How can we test whether low complexity protein
sequences are enriched in particular proteins or
proteomes?
Define low complexity sequence (Sim et al. 2002):
At least 10 residues long
At least 50% composed of a single residue type
Begin and end with the residue type
No runs greater than length 5 of any other
residue type
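
The four criteria above translate directly into a predicate on a candidate segment. This is a sketch of the definition as listed on the slide, not code from Sim et al.; in particular it assumes that a "run" means consecutive identical residues, and the function name and parameters are introduced here.

```python
from itertools import groupby


def is_low_complexity(segment, residue, min_len=10, min_frac=0.5,
                      max_other_run=5):
    """True if `segment` is a low-complexity sequence for `residue`."""
    if len(segment) < min_len:
        return False
    # Must begin and end with the dominant residue type.
    if segment[0] != residue or segment[-1] != residue:
        return False
    # At least half of the positions must be the dominant residue.
    if segment.count(residue) / len(segment) < min_frac:
        return False
    # No run longer than max_other_run of any other residue type.
    for aa, run in groupby(segment):
        if aa != residue and len(list(run)) > max_other_run:
            return False
    return True


print(is_low_complexity("Q" * 23, "Q"))           # poly-Q tract
print(is_low_complexity("QQQPPPPPQQQQ", "Q"))     # P run of 5 is allowed
print(is_low_complexity("QQQPPPPPPQQQQ", "Q"))    # P run of 6 is not
```

Scanning a proteome would apply this test to candidate windows for each residue type and count the hits that the Poisson model is then asked to explain.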

42
Poisson Distribution
Sim et al. 2002
p(X) = probability that event happens X times
l = sequence window length
43
Poisson Distribution
Sim et al. 2002

Expected number of low-complexity sequences of
length l in a proteome

T_l = number of sequence windows of length l
SC = S. cerevisiae, CE = C. elegans, DM = fruit fly, AT = thale cress
[Figure: ratio of the number of low-complexity sequences
found above that expected from the Poisson
distribution model, ΔR, to the number of
proteins in each eukaryote proteome, plotted for
each residue type]
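
Once an expected count is available from the null model, the Poisson distribution gives the probability of observing at least as many low-complexity sequences as were found. The sketch below shows only that last step; the expected and observed counts are hypothetical, since the formula Sim et al. use for the expected number is not preserved in this transcript.

```python
from math import exp, lgamma, log


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam > 0."""
    return exp(x * log(lam) - lam - lgamma(x + 1))


def poisson_upper_tail(x, lam):
    """P(X >= x), computed as one minus the lower tail."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(x))


expected = 10.0   # hypothetical expected count from the null model
observed = 25     # hypothetical observed count

p_value = poisson_upper_tail(observed, expected)
print(f"P(X >= {observed} | lambda = {expected}) = {p_value:.3g}")
print(f"excess over expectation: {observed - expected:.1f}")
```

For very small tail probabilities, summing the upper tail directly is numerically safer than subtracting from one.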
44
What you should know

All terms in red on these slides.
How to decide when you should use a parametric
probability distribution in your models
How to pick a good parametric probability
distribution for a particular problem

45
Choosing a parametric probability distribution
for your model

What if intuition fails you and there is no
obvious choice?
Exploratory analysis and graphics
Look at histograms of the data
Generate random samples from known distributions
and compare to your data, standardized
Quantile-quantile plots
Compute kurtosis and skewness of your data,
standardized, then compare to known kurtosis and
skewness of parametric families (Ricci
tutorial).

46
Histogram your data
47
Compare your data to random samples from known
distribution families
Quantile-Quantile plot
48
Use skewness and kurtosis of your data to select
a distribution family

For a sample of size n

mean (1st moment)
variance (2nd moment)
skewness (3rd moment)
kurtosis (4th moment)
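
The formulas on this slide did not survive in the transcript. As a stand-in, the sketch below computes the four quantities under one common convention, which is an assumption: central moments divided by n, skewness as m3 / m2^(3/2), and kurtosis as m4 / m2^2 (not excess kurtosis, so a normal distribution scores 3).

```python
def sample_moments(xs):
    """Return (mean, variance, skewness, kurtosis) of a sample.

    Uses moments divided by n and non-excess kurtosis; other conventions
    (n - 1 denominators, excess kurtosis) give slightly different numbers.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return mean, m2, m3 / m2**1.5, m4 / m2**2


mean, var, skew, kurt = sample_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"mean = {mean}, variance = {var}, skewness = {skew}, kurtosis = {kurt}")
```

When comparing against published skewness and kurtosis values for a parametric family, it is worth checking that the same convention is used on both sides.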
49
Use skewness and kurtosis of your data to select
a distribution family

For scores of 10,000 random protein sequences
from model of E. coli CheY protein

> mean(data[,1])
[1] -0.014408
> var(data[,1])
[1] 1.138294
> skewness(data[,1])
[1] -0.06517909
> kurtosis(data[,1])
[1] 5.71957
50
Use skewness and kurtosis of your data to select
a distribution family

Skewness
Asymmetry
Negative skew: longer left tail; probability
mass concentrated on right
Positive skew: longer right tail; probability
mass concentrated on left
Kurtosis
High: sharper peak, fatter tails
Low: rounder peak, wider shoulders

> skewness(data[,1])
[1] -0.06517909
> kurtosis(data[,1])
[1] 5.71957
51
Use skewness and kurtosis of your data to select
a distribution family

We want a distribution with skew ≈ 0 and kurtosis
≈ 5

Student's t Distribution
> sample.T <- rt(n=10000, df=4.6)
> skewness(sample.T)
[1] 0.09057073
> kurtosis(sample.T)
[1] 5.238485
Normal Distribution
52
Parameter estimation

Read about it in Ricci.

53
Assessing model fit

Goodness of fit tests to a distribution D
Null hypothesis: sample data come from D
Alternative hypothesis: sample data do not come
from D
Read about it in Ricci.
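
As one concrete example of a goodness-of-fit test, the sketch below computes the Kolmogorov-Smirnov statistic, the largest gap between the empirical CDF of the sample and the CDF of the hypothesized distribution D. It is not taken from the Ricci tutorial; the sample is simulated, and the 1.36 / sqrt(n) threshold is the usual large-sample approximation to the 5% critical value when D is fully specified in advance.

```python
import random
from math import sqrt
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


random.seed(0)  # simulated data, for illustration only
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]

d = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
critical = 1.36 / sqrt(len(sample))  # approximate 5% critical value

print(f"D = {d:.4f}, approximate 5% critical value = {critical:.4f}")
print("reject the null" if d > critical else "do not reject the null")
```

If the parameters of D are estimated from the same data, the critical value above is too lenient and a corrected test is needed.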

54
Machine learning approaches to biological
sequence analysis

Splice-site recognition

ACCEPTOR
DONOR
Ben-Hur 2008
55
Machine learning approaches to biological
sequence analysis

Splice-site recognition

<0.1% of GT and AG are actually splice sites
How to represent acceptor sites so as to
discriminate them from other AGs?

56
Machine learning approaches to biological
sequence analysis

Splice-site recognition

Support vector machines with kernels that
incorporate biological knowledge
Require a gold standard set of validated acceptor
sequences (and flanking sequence) and a set of
decoys.

57
How to represent acceptor sites so as to
discriminate them from other AGs?

Use biology!
GC content of introns higher than that of exons
Consider properties of flanking sequence
Consider conservation in multiple species

Ben-Hur 2008
58
How to represent acceptor sites so as to
discriminate them from other AGs?

GC content of introns higher than that of exons
Two features represent each putative acceptor
site
Feature 1: GC fraction of exon
Feature 2: GC fraction of intron

Ben-Hur 2008
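
The two-feature representation can be sketched as follows. The function names, the `flank` window size and the example sequence are assumptions made for illustration; the transcript does not say how much flanking sequence was used. The code takes the intron to end with the candidate AG and the exon to begin immediately after it.

```python
def gc_fraction(seq):
    """Fraction of G and C characters in a sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def acceptor_features(sequence, ag_index, flank=6):
    """Return (GC fraction of exon side, GC fraction of intron side).

    ag_index is the position of the A in the candidate AG dinucleotide.
    """
    boundary = ag_index + 2                        # first exonic position
    intron = sequence[max(0, boundary - flank):boundary]
    exon = sequence[boundary:boundary + flank]
    return gc_fraction(exon), gc_fraction(intron)


# Hypothetical pre-mRNA fragment with a candidate AG at index 8.
fragment = "AUUUUUUCAGGCGCGC"
exon_gc, intron_gc = acceptor_features(fragment, ag_index=8)
print(f"Feature 1 (exon GC)   = {exon_gc:.2f}")
print(f"Feature 2 (intron GC) = {intron_gc:.2f}")
```

Each candidate AG then becomes a point in a two-dimensional feature space that a linear or kernel SVM can separate.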
59
How to represent acceptor sites so as to
discriminate them from other AGs?

Non-linear kernel (Gaussian or polynomial) gives
small improvement to classifier performance
compared to linear
Large degree polynomial and small-width Gaussian
kernel yield reduced accuracy. Too flexible.

auROC
Ben-Hur 2008
60
How to represent acceptor sites so as to
discriminate them from other AGs?

Need better and more features.

Count of four bases on intronic and exonic sides
of the acceptor: 8 features
Count of all dimers: 32 features
Count of all trimers: 128 features
Count of all l-mers: 2 × 4^l features

Works on a small set of reals and decoys but
doesn't scale up to whole-genome analysis
61
How to represent acceptor sites so as to
discriminate them from other AGs?
Features
Count of four bases on intronic and exonic sides
of the acceptor: 8 features

Base  Intron  Exon
A     4       0
C     1       2
G     1       3
U     2       1

Count of all dimers: 32 features

Intron  A  C  G  U
A       1  0  1  1
C       1  0  0  0
G       0  0  0  1
U       1  1  0  0

Exon    A  C  G  U
A       0  0  0  0
C       0  0  1  0
G       0  1  1  1
U       0  1  0  0

Count of all trimers: 128 features
Count of all l-mers: 2 × 4^l features

This approach motivated the design of the
spectrum kernel
62
Spectrum Kernel
Feature map based on spectrum of a sequence: X -> Φ(X)

C. Leslie, E. Eskin, and W. Noble, The Spectrum
Kernel: A String Kernel for SVM Protein Classification.
Pacific Symposium on Biocomputing, 2002.
C. Leslie, E. Eskin, J. Weston and W. Noble,
Mismatch String Kernels for SVM Protein
Classification. NIPS 2002.

63
The k-Spectrum of a Sequence

Feature map for SVM based on spectrum of a
sequence
The k-spectrum of a sequence is the set of all
k-length contiguous subsequences that it contains
Feature map is indexed by all possible k-length
subsequences (k-mers) from any alphabet (here the mRNA
nucleotides)
Dimension of feature space: 4^k

Example (k = 3): ACCUGUACGG
ACC CCU CUG UGU GUA UAC ACG CGG
Slide courtesy of Christina Leslie
64
k-Spectrum Feature Map

Feature map for k-spectrum with no mismatches
For sequence x, $\Phi_{(k)}(x) = (\Phi_t(x))_{k\text{-mers } t}$,
where $\Phi_t(x)$ = number of occurrences of t in x

ACCUGUGUGG
(0, 0, ..., 1, ..., 1, ..., 2, ...)
AAA AAC ... ACC ... CCU ... UGU ...
Slide courtesy of Christina Leslie
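
Both slide examples can be reproduced with a few lines. This sketch stores the feature map sparsely, keeping only k-mers that occur; the function names are introduced here for illustration.

```python
from collections import Counter


def k_spectrum(x, k):
    """All k-length contiguous subsequences of x, in order of occurrence."""
    return [x[i:i + k] for i in range(len(x) - k + 1)]


def spectrum_features(x, k):
    """Sparse feature map: k-mer -> number of occurrences in x."""
    return Counter(k_spectrum(x, k))


print(" ".join(k_spectrum("ACCUGUACGG", 3)))

features = spectrum_features("ACCUGUGUGG", 3)
for t in ("AAA", "AAC", "ACC", "CCU", "UGU"):
    print(t, features[t])  # a Counter returns 0 for absent k-mers
```

A dense vector of length 4^k can always be recovered by looking up every possible k-mer in the counter.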
65
Issues with spectrum kernel
$K_k^{\text{spectrum}}(x, x') = \Phi_{(k)}(x) \cdot \Phi_{(k)}(x')$

Explicit computation too hard for large k
Nucleotide sequences with k > 10
Protein sequences with k > 5
When k is large, the requirement for exact match
will result in very few non-zero values
One efficiency is to compute only k-mers with
non-zero counts. Why does that work?

Ben-Hur 2008
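
The efficiency mentioned above can be sketched as follows: a k-mer with a zero count in either sequence contributes nothing to the dot product, so it is enough to iterate over the k-mers that actually occur. This is an illustrative implementation, not the one used by Ben-Hur et al. or Leslie et al.

```python
from collections import Counter


def spectrum_counts(x, k):
    """Sparse k-mer counts of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x, y, k):
    """Dot product of the k-spectrum feature vectors of x and y."""
    cx, cy = spectrum_counts(x, k), spectrum_counts(y, k)
    if len(cx) > len(cy):
        cx, cy = cy, cx               # iterate over the smaller map
    return sum(c * cy[t] for t, c in cx.items())


a, b = "ACCUGUGUGG", "ACCUGUACGG"     # the two example sequences from the slides
print(spectrum_kernel(a, b, 3))       # shared 3-mers: ACC, CCU, CUG, UGU
print(spectrum_kernel(a, a, 3))
```

The cost depends on the sequence lengths rather than on 4^k, so large k is limited by how rarely long exact matches occur, not by memory.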
66
Improved spectrum kernels

(k,m)-Mismatch Spectrum Kernel
Feature map for k-spectrum, allowing m
mismatches
If s is a k-mer, $\Phi_{(k,m)}(s) = (\Phi_t(s))_{k\text{-mers } t}$,
where $\Phi_t(s)$ = 1 if s is within m mismatches from
t, 0 otherwise
Extend additively to longer sequences x by
summing over all k-mers s in x

[Figure: k-mers within one mismatch of AAC, including GAC, AAG, UAC and AUC]
Slide courtesy of Christina Leslie
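
A direct, unoptimized sketch of the (k,m)-mismatch feature map follows. It enumerates all 4^k candidate k-mers, which is fine for small k but is not how efficient implementations work; the function names are introduced here for illustration.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def mismatch_features(x, k, m, alphabet="ACGU"):
    """Sparse (k,m)-mismatch feature map of sequence x."""
    all_kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    features = Counter()
    for i in range(len(x) - k + 1):
        s = x[i:i + k]
        for t in all_kmers:
            if hamming(s, t) <= m:    # t lies in the mismatch neighborhood of s
                features[t] += 1
    return features


neighborhood = mismatch_features("AAC", k=3, m=1)
print(sorted(neighborhood))
```

For a single 3-mer with one allowed mismatch, the neighborhood contains the 3-mer itself plus three substitutions at each of three positions.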
67
Improved spectrum kernels

Mixed spectrum kernel

Able to capture putatively important sequence
motifs at multi-resolution

Ben-Hur 2008
68
Weighted degree spectrum kernel
$K_k^{\text{weighted degree}}(x, x') = \sum_{d=1}^{k} \beta_d \sum_{i=1}^{L-d+1} K_d^{\text{spectrum}}(x_{[i:i+d]}, x'_{[i:i+d]})$
$x_{[i:i+d]}$: substring of x of length d starting at position i

Take advantage of positional information.
Nucleotide positions are not all equally
informative for purposes of detecting acceptor
sites
Analyze a sequence of fixed length L and consider
each position separately

Ben-Hur 2008
69
Weighted degree spectrum kernel
$K_k^{\text{weighted degree}}(x, x') = \sum_{d=1}^{k} \beta_d \sum_{i=1}^{L-d+1} K_d^{\text{spectrum}}(x_{[i:i+d]}, x'_{[i:i+d]})$
$x_{[i:i+d]}$: substring of x of length d starting at position i

Equivalent to using a mixed spectrum kernel for
each position of the sequence separately
(ignoring boundary effect).
How to choose β_d?

Ben-Hur 2008
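
A minimal sketch of the weighted degree kernel for two sequences of the same fixed length L. Because each compared substring has length exactly d, the d-spectrum kernel between two such substrings reduces to a test for equality. The default of uniform weights (all β_d = 1) is a placeholder chosen for illustration, not a recommendation from the lecture.

```python
def weighted_degree_kernel(x, y, k, betas=None):
    """Sum over degrees d = 1..k and positions of beta_d * [substrings match]."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length L")
    L = len(x)
    if betas is None:
        betas = [1.0] * k             # placeholder weights
    total = 0.0
    for d in range(1, k + 1):
        # Count positions where the length-d substrings of x and y agree.
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(L - d + 1))
        total += betas[d - 1] * matches
    return total


print(weighted_degree_kernel("ACGU", "ACGA", k=2))
print(weighted_degree_kernel("ACGU", "ACGU", k=2))
```

Unlike the plain spectrum kernel, a shared k-mer only counts here if it occurs at the same position in both sequences, which is what lets the kernel use positional information around the acceptor site.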
70
SVM performance discriminating acceptor sites
with three kinds of spectrum kernel
Ben-Hur 2008
71
Programming assignment 1

Will be posted this evening

72
Programming assignment 1

You can use machine learning or statistical
modeling or a combination of both.
Libraries available for R will make this much
easier for you.
You will be able to use R functions in your
Python programs

73
Programming assignment 1

Be realistic about your current skill set.
If you're still shaky with Python, do something
simple.
Main priority is to complete the assignment and
write clean code.
If that's easy for you, here's a chance to do
something innovative.
