1
Lecture 3: Introduction to probabilistic models.
Part 2: parametric probability distributions and
their alternatives
  • Rachel Karchin
  • BME 580.688 Spring 2009

2
Overview
  • Examples of how parametric probability
    distributions can be used in computational
    biology.
  • Alternative approaches: machine learning

3
Parametric distributions commonly used in
computational biology
  • Binomial
  • Hypergeometric
  • Multinomial
  • Poisson
  • Gamma
  • Dirichlet

There are many more.
4
Parametric distributions commonly used in
computational biology
  • Trick is understanding how to use them
    intelligently.

5
Binomial Distribution
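  • Independent events with two mutually exclusive
    possible outcomes (success or failure)
  • Number of trials is n

PDF: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, where p is the probability of success on a single trial
CDF: $P(X \le x) = \sum_{i=0}^{x} \binom{n}{i} p^i (1-p)^{n-i}$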
6
[Figure: binomial PDF p(X) and CDF plotted against X for p = 0.3, n = 50.]
7
Binomial distribution applied to biological
sequence analysis
  • How many matches should we see by chance between
    two random DNA sequences of length 100?

Sequence 1
CCAGCGAACACTGCAATCTTGTTGGAAATTTAAATTAAGAAACATGCAGT
Sequence 2
CATGAGTAGGTCCTGCTGCCCAATAAATGAGCTTTGCAGGCACCAAAGCT
Sequence 1
TTAAAGAACCTGGCTCTGAAAACAATTCATGTGGGGACCTTATTTAAGAA
Sequence 2
GAAAAAAGGGAGAAGAATAAATGTAATGTAGTTGTAATAAGGCTAAAATC
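
A rough sketch of the chance-match question above, assuming (this is not stated on the slide) that both sequences are random with four equally likely, independent bases. Each aligned position then matches with probability p = 0.25, so the number of matches in n = 100 positions is Binomial(100, 0.25). The function names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_upper_tail(x, n, p):
    """P(X >= x): chance of seeing at least x matches."""
    return sum(binom_pmf(i, n, p) for i in range(x, n + 1))

def count_matches(seq1, seq2):
    """Number of aligned positions at which two sequences agree."""
    return sum(a == b for a, b in zip(seq1, seq2))

n, p = 100, 0.25            # assumption: uniform, independent bases
expected_matches = n * p    # mean of the binomial
print(expected_matches)
print(binom_upper_tail(40, n, p))  # chance of 40 or more matches
```

Passing the observed match count of two real sequences to `binom_upper_tail` gives the probability of seeing at least that many matches under this null model.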
8
(No Transcript)
9
Sequence analysis of DNA methylation sites
  • GATC is methylated in E. coli by DNA
    methyltransferase (DAM).
  • Basis of a GATC regulatory network (replication
    and gene expression)

Bang 2008
Marinus 2005
10
Sequence analysis of DNA methylation sites
  • This network is the topic of a huge literature
  • One function is control of cell metabolism under
    cold shock and oxygen shift conditions
  • Two hypotheses have been proposed:
  • Hypothesis 1: transcription of genes containing GATC
    clusters in the coding sequence is blocked when
    hemi-methylated (higher thermal stability).
    Henaut 1996
  • Hypothesis 2: presence of methylated GATCs in the
    non-coding 500 bp upstream of affected genes impacts
    interaction with regulatory proteins. Oshima 2002

11
Sequence analysis of DNA methylation sites
  • Oshima et al. (2002) did microarray (and other)
    analysis of 4019 E. coli genes, comparing gene
    expression with wildtype and loss-of-function (LOF)
    mutant DAM.
  • DAM deficiency affected a large number of genes
    (10%).
  • GATC is the target of DAM
  • If hypothesis 2 is true, we should see a
    relationship between GATC abundance and DAM
    control.

12
Sequence analysis of DNA methylation sites
  • Riva et al. 2004 did statistical analysis to test
    this.
  • Oshima's hypothesis predicts regions upstream of
    coding sequence that are 500bp long with an
    unusual frequency of GATC

13
Binomial distribution applied to biological
sequence analysis
  • Riva evaluated GATC enrichment by comparing
    observed number of GATCs 500 bp upstream of these
    genes with a null model based on binomial
    distribution.

500 bp
14
Binomial distribution applied to biological
sequence analysis
  • What is a binomial null model of GATC frequency
    in these upstream regions?

500 bp
15
Known: 4019 genes analyzed, 7174 GATC counted in
their upstream regions
Problem-solving tips
  • Define success
  • Calculate p
  • Compute probability that a single gene has no
    GATC in its 500 bp upstream region
  • Compute number of genes expected to have no GATC
    in their 500 bp upstream region (1 GATC, 2 GATC
    etc.)

16
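Solution
  • Probability of a GATC at a single upstream position:
    $p = \frac{\text{number of GATC}}{\text{number of genes} \times \text{number of bp upstream of each gene}} = \frac{7174}{4019 \times 500}$
  • For a single gene, probability of seeing no
    GATC: $P(X = 0) = (1 - p)^{500}$
  • Number of genes expected to have 0 GATC:
    $4019 \times (1 - p)^{500}$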
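
The problem-solving steps can be sketched numerically as follows. This assumes each of the 500 upstream positions is treated as an independent trial with the same success probability p. The counts come from the slide; the variable and function names are illustrative.

```python
from math import comb

n_genes = 4019      # genes analyzed (from the slide)
n_gatc = 7174       # GATC sites counted in the upstream regions (from the slide)
region_len = 500    # bp upstream of each gene

# Assumption: every upstream position is an independent Bernoulli trial.
p = n_gatc / (n_genes * region_len)

def prob_k_sites(k, n=region_len, p=p):
    """Binomial probability that one upstream region holds exactly k GATC."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_zero = prob_k_sites(0)
expected_genes_zero = n_genes * p_zero

for k in range(4):
    # Expected number of genes with exactly k GATC under the null model
    print(k, round(n_genes * prob_k_sites(k), 1))
```

Comparing these expected counts with the observed numbers of genes having 0, 1, 2, ... GATC shows where the data depart from the null model.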

17
  • How would you use this model to assess whether
    the 10% of genes on the microarray that were
    sensitive (differentially expressed) to DAM
    status were enriched in GATC in their 500 bp
    upstream regions?

18
What you should know
  • Understand what the binomial distribution is
  • Develop some intuition about how to use it in
    biological sequence analysis
  • The logic used by Riva et al. in their analysis
  • Cocktail party explanations of:
  • DNA methylation and the enzymes that do it
  • E. coli
  • E. coli's GATC regulatory network
  • Hemi-methylation vs. methylation

19
Hypergeometric distribution
  • With the binomial distribution, we assume that on
    each trial we are drawing the same kind of
    object.
  • p is the same for each draw
  • What if on each trial we are drawing two kinds of
    objects?

20
Hypergeometric distribution
  • Number of successes in a sequence of k draws from a
    finite population with two kinds of objects,
    without replacement.

PDF: $P(X = x) = \binom{M}{x} \binom{N-M}{k-x} / \binom{N}{k}$
N: total number of objects
M: number of objects of the first type
k: number of draws
X: number of successes (sample success)
21
[Figure: hypergeometric PDF p(X) and CDF plotted against X for M = 100, N = 50, k = 30.]
22
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
Genes in TP53 pathway (KEGG ID hsa04115): ATM, CHEK2,
ATR, CHEK1, CDKN2A, MDM2, . . .
k: number of genes in the pathway
GO Term      Counts  Definition
GO:0005488   15      The selective, often stoichiometric interaction o...
GO:0005509   15      Interacting selectively with calcium ions (Ca2+)....
GO:0004720   2       Catalysis of the reaction peptidyl-L-lysyl-pepti...
GO:0046872   15      Interacting selectively with any metal ion." GOC...
GO:0016641   2       Catalysis of an oxidation-reduction (redox) react...
GO:0043169   8       Interacting selectively with cations, charged ato...
GO:0005507   3       Interacting selectively with copper (Cu) ions." ...
GO:0003677   2       Interacting selectively with DNA (deoxyribonuclei...
GO:0046914   11      Interacting selectively with a transition metal i...
GO:0004222   3       Catalysis of the hydrolysis of nonterminal peptid...
23
  • Microarray experiment
  • Microarray covered with probes that hybridize
    cDNA
  • cDNA from cells in contrasting conditions of
    interest applied to the array
  • Of most interest are differentially expressed
    genes

24
http://www.geneontology.org/
  • Gene Ontology

A collaborative effort to address the need for
consistent descriptions of gene products in
different databases.
The GO project has developed three structured
controlled vocabularies (ontologies) that
describe gene products in terms of their
associated biological processes, cellular
components and molecular functions in a
species-independent manner.
Each entry in GO has a unique numerical
identifier of the form GO:nnnnnnn, and a term
name, e.g. cell, fibroblast growth factor
receptor binding or signal transduction. Each
term is also assigned to one of the three
ontologies.
GO terms can be linked by five types of
relationships: is_a, part_of, regulates,
positively_regulates and negatively_regulates.
25
  • Pathway
  • Molecular interaction and reaction networks

http://www.biocarta.com
http://www.kegg.com/kegg/kegg2.html
26
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N: number of d.e. genes in the microarray experiment
M: number of those genes associated with term t
k: number of genes in the pathway
X: number of genes in the pathway that are
associated with term t
Genes in TP53 pathway (KEGG ID hsa04115): ATM, CHEK2,
ATR, CHEK1, CDKN2A, MDM2, . . .
27
Is a pathway enriched for a particular GO term
in N differentially expressed genes?
TP53 pathway
N: number of d.e. genes in the microarray experiment
M: number of those genes associated with term t
k: number of genes in the pathway
X: number of genes in the pathway that are
associated with term t
Genes in TP53 pathway (KEGG ID hsa04115): ATM, CHEK2,
ATR, CHEK1, CDKN2A, MDM2, . . .
  • What is the null hypothesis?
28
How would we define GO term t enrichment in this
microarray experiment?
  • Using this approximation the p-value for
    over-represented GO terms can be calculated as
  • What about p-value for under-represented GO terms?

Slide courtesy of Josep Mosquera
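
One common way to compute such p-values is the hypergeometric tail, sketched here with the N, M, k, X notation of the preceding slides. This is an assumption; the slide's own approximation may differ. The function names and the example numbers are made up for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, M, k):
    """P(X = x): x term-t genes among k pathway genes, drawn without
    replacement from N genes of which M are associated with term t."""
    if x < max(0, k - (N - M)) or x > min(k, M):
        return 0.0
    return comb(M, x) * comb(N - M, k - x) / comb(N, k)

def p_over(x, N, M, k):
    """Upper tail P(X >= x): evidence for over-representation."""
    return sum(hypergeom_pmf(i, N, M, k) for i in range(x, min(k, M) + 1))

def p_under(x, N, M, k):
    """Lower tail P(X <= x): evidence for under-representation."""
    return sum(hypergeom_pmf(i, N, M, k) for i in range(0, x + 1))

# Hypothetical numbers, for illustration only
N, M, k, X = 1000, 50, 20, 5
print(p_over(X, N, M, k), p_under(X, N, M, k))
```

The one-sided Fisher's exact test on the corresponding 2x2 table uses the same hypergeometric tail.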
29
How would we define GO term t enrichment in this
microarray experiment?
  • Alternatively, can apply standard statistical
    tests such as chi-squared and Fisher's exact test.

30
How would you convert this to a joint probability
distribution?
  • What are marginals?

31
What you should know
  • Understand what the hypergeometric distribution
    is
  • Develop some intuition about where you might
    apply it in biological sequence analysis
  • When to use hypergeometric vs. binomial
    distribution
  • Where and when to use hypergeometric vs.
    chi-squared and/or Fisher's exact test to
    evaluate enrichment.
  • Cocktail party explanations of
  • Microarray
  • cDNA
  • Differential expression
  • Gene Ontology
  • pathway

32
Multinomial Distribution
  • Independent events with k mutually exclusive
    possible outcomes having probabilities q1, q2, .
    . ., qk

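PDF: $P(X_1 = x_1, \ldots, X_k = x_k) = \frac{N!}{x_1! \cdots x_k!} \prod_{i=1}^{k} q_i^{x_i}$, with $\sum_{i=1}^{k} x_i = N$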
33
Multinomial Distribution
  • We roll a fair 100-sided die N times
  • Probability of getting each of the 100 outcomes
    is q1, q2, . . ., q100 where q1 = q2 = . . . =
    q100 = 0.01
  • What is the probability of rolling 200 times and
    getting each outcome twice?
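
The die question can be worked through with the multinomial PDF. The sketch below works in log space because 200! overflows ordinary floating point; the function name is illustrative.

```python
from math import lgamma, log, exp

def multinomial_log_pmf(counts, probs):
    """Log of N!/(x_1! ... x_k!) * prod(q_i ** x_i)."""
    n = sum(counts)
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    # Terms with a zero count contribute nothing and are skipped
    log_prob = sum(c * log(q) for c, q in zip(counts, probs) if c > 0)
    return log_coef + log_prob

counts = [2] * 100      # each of the 100 faces seen exactly twice
probs = [0.01] * 100    # fair 100-sided die
log_p = multinomial_log_pmf(counts, probs)
print(log_p, exp(log_p))
```

Even though this is the single most likely outcome, its probability is tiny because there are so many possible count vectors.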

34
Multinomial Distribution
N: total number of die rolls
k: number of possible outcomes (sides of the die)
qi: outcome i's probability of success
Xi: number of successes with respect to outcome i
(sample success)
35
Multinomial distribution
  • What is the probability of seeing an adenine
    nucleotide (A) 13 times in the last position of
    a C1 enhancer in a chordate species?

Multiple sequence alignment of intron in human
shha and 12 homologs from other chordates.
ar-C transcriptional enhancers C1, C2, C3, and
C4 are shown (Hadzhiev et al. 2007).
36
Multinomial distribution
  • 13 species in the alignment
  • The column has 9 As, 3 Gs and a T
  • qA = 9/13, qC = 0/13, qG = 3/13, qT = 1/13

What kind of estimate is this of the q parameters?
N: total number of species
k: number of possible outcomes (nucleotide types)
qi: outcome i's probability of success
Xi: number of successes with respect to outcome i
(sample success)
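
A small sketch of this column, assuming the q values are plain relative frequencies (the maximum likelihood estimates) and plugging them into the multinomial PDF for the observed counts. Variable names are illustrative.

```python
from math import factorial, prod

counts = {"A": 9, "C": 0, "G": 3, "T": 1}   # column counts from the slide
n = sum(counts.values())                     # 13 species

# Relative-frequency estimates of the q parameters
q_hat = {base: c / n for base, c in counts.items()}

# Multinomial coefficient N! / (x_A! x_C! x_G! x_T!)
coef = factorial(n)
for c in counts.values():
    coef //= factorial(c)

# Probability of the observed counts under the estimated parameters
p_column = coef * prod(q_hat[b] ** counts[b] for b in counts)
print(q_hat, coef, p_column)
```

Because qC is estimated as 0, any column containing a C would get probability zero under these estimates, which is one motivation for pseudocounts.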
37
Multinomial distribution
N: total number of species
k: number of possible outcomes (nucleotide types)
qi: outcome i's probability of success
Xi: number of successes with respect to outcome i
(sample success)
What's this?
38
Multinomial distribution
  • Is this probability due to chance?

39
Poisson Distribution
  • Number of events occurring within a fixed
    interval of time or space, which occur with known
    average rate and independent of time or space
    since the last event.

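PDF: $P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$, where $\lambda$ is the average rate
CDF: $P(X \le x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$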
40
[Figure: Poisson PDF p(X) and CDF plotted against X for λ = 10.]
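
A minimal sketch of the Poisson PDF and CDF for the rate shown in the figure (λ = 10). Function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), summing the PDF from 0 to x."""
    return sum(poisson_pmf(i, lam) for i in range(x + 1))

lam = 10
for x in (0, 5, 10, 15, 20):
    print(x, poisson_pmf(x, lam), poisson_cdf(x, lam))
```

An upper-tail p-value for an observed count x is `1 - poisson_cdf(x - 1, lam)`.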
41
Poisson Distribution
>HD_HUMAN Huntingtin
MATLEKLMKAFESLKSFQQQQQQQQQQQQ
QQQQQQQQQQQPPPPPPPPPPPQLPQPPPQAQPLLPQPQPPPPPPPPPPG
PAVAEEPLHRPKKELSATKKDRVNHCLTICENIVAQSVRNS
  • How can we test whether low complexity protein
    sequences are enriched in particular proteins or
    proteomes?
  • Define low complexity sequence (Sim et al. 2002)
  • At least 10 residues long
  • At least 50% composed of a single residue type
  • Begin and end with the residue type
  • No runs greater than length 5 of any other
    residue type
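
The four criteria above can be sketched as a single check on a candidate segment. This is one reading of the definition, not the code of Sim et al.; the function name and parameter names are hypothetical.

```python
from itertools import groupby

def is_low_complexity(segment, residue, min_len=10, min_frac=0.5,
                      max_other_run=5):
    """Check a segment against the four criteria for one residue type."""
    if len(segment) < min_len:
        return False                      # at least 10 residues long
    if segment[0] != residue or segment[-1] != residue:
        return False                      # begins and ends with the residue
    if segment.count(residue) / len(segment) < min_frac:
        return False                      # at least 50% a single residue type
    for aa, run in groupby(segment):
        # no run longer than 5 of any other residue type
        if aa != residue and len(list(run)) > max_other_run:
            return False
    return True

print(is_low_complexity("Q" * 23, "Q"))
print(is_low_complexity("QQQQQPPPPPPQQQQQQQ", "Q"))
```

Scanning every window of a proteome with such a check gives the observed counts that are then compared with the Poisson expectation.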

42
Poisson Distribution
Sim et al. 2002
p(X): probability that the event happens X times
l: sequence window length
43
Poisson Distribution
Sim et al. 2002
  • Expected number of low complexity sequences of
    length l in a proteome

T_l: number of sequence windows of length l
SC: S. cerevisiae, CE: C. elegans, DM: fruit fly,
AT: thale cress
Ratio of the number of low-complexity sequences
found above that expected from the Poisson
distribution model, DeltaR, to the number of
proteins in each eukaryote proteome plotted for
each residue type
44
What you should know
  • All terms in red on these slides.
  • How to decide when you should use a parametric
    probability distribution in your models
  • How to pick a good parametric probability
    distribution for a particular problem

45
Choosing a parametric probability distribution
for your model
  • What if intuition fails you and there is no
    obvious choice?
  • Exploratory analysis and graphics
  • Look at histograms of the data
  • Generate random samples from known distributions
    and compare to your data, standardized
  • Quantile-quantile plots
  • Compute kurtosis and skewness of your data,
    standardized, then compare to known kurtosis and
    skewness of parametric families (Ricci
    tutorial).

46
Histogram your data
47
Compare your data to random samples from known
distribution families
Quantile-Quantile plot
48
Use skewness and kurtosis of your data to select
a distribution family
  • For a sample of size n

mean (1st moment)
variance (2nd moment)
skewness (3rd moment)
kurtosis (4th moment)
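
A sketch of the four sample statistics, assuming the common moment-based definitions with 1/n normalization. Other conventions exist: R's var() divides by n - 1, and some packages report excess kurtosis. With the definition used here a normal distribution has kurtosis 3. The function name is illustrative.

```python
def sample_moments(xs):
    """Return (mean, variance, skewness, kurtosis) using 1/n moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # 2nd central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # 4th central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                     # not excess kurtosis
    return mean, m2, skewness, kurtosis

print(sample_moments([1, 2, 3, 4, 5]))
```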
49
Use skewness and kurtosis of your data to select
a distribution family
  • For scores of 10,000 random protein sequences
    from model of E. coli CheY protein

> mean(data[,1])
[1] -0.014408
> var(data[,1])
[1] 1.138294
> skewness(data[,1])
[1] -0.06517909
> kurtosis(data[,1])
[1] 5.71957
50
Use skewness and kurtosis of your data to select
a distribution family
  • Skewness
  • Asymmetry
  • Negative skew -> longer left tail; probability
    mass concentrated on right
  • Positive skew -> longer right tail; probability
    mass concentrated on left
  • Kurtosis
  • High -> sharper peak, fatter tails
  • Low -> rounder peak, wider shoulders

> skewness(data[,1])
[1] -0.06517909
> kurtosis(data[,1])
[1] 5.71957
51
Use skewness and kurtosis of your data to select
a distribution family
  • We want a distribution with skewness near 0 and
    kurtosis near 5

Student's t Distribution
> sample.T <- rt(n=10000, df=4.6)
> skewness(sample.T)
[1] 0.09057073
> kurtosis(sample.T)
[1] 5.238485
Normal Distribution
52
Parameter estimation
  • Read about it in Ricci.

53
Assessing model fit
  • Goodness of fit tests to a distribution D
  • Null hypothesis: sample data come from D
  • Alternative hypothesis: sample data do not come
    from D
  • Read about it in Ricci.

54
Machine learning approaches to biological
sequence analysis
  • Splice-site recognition

ACCEPTOR
DONOR
Ben-Hur 2008
55
Machine learning approaches to biological
sequence analysis
  • Splice-site recognition
  • <0.1% of GT and AG are actually splice sites
  • How to represent acceptor sites so as to
    discriminate them from other AGs?

56
Machine learning approaches to biological
sequence analysis
  • Splice-site recognition
  • Support vector machines with kernels that
    incorporate biological knowledge
  • Require a gold standard set of validated acceptor
    sequences (and flanking sequence) and a set of
    decoys.

57
How to represent acceptor sites so as to
discriminate them from other AGs?
  • Use biology!
  • GC content of introns higher than that of exons
  • Consider properties of flanking sequence
  • Consider conservation in multiple species

Ben-Hur 2008
58
How to represent acceptor sites so as to
discriminate them from other AGs?
  • GC content of introns higher than that of exons
  • Two features represent each putative acceptor
    site
  • Feature 1: GC fraction of exon
  • Feature 2: GC fraction of intron

Ben-Hur 2008
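
A minimal sketch of the two-feature representation, assuming each candidate acceptor site comes with its flanking intronic and exonic sequence as plain strings. The function names and the toy sequences are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def acceptor_features(intron_side, exon_side):
    """Feature 1 = GC fraction of exon, feature 2 = GC fraction of intron."""
    return (gc_fraction(exon_side), gc_fraction(intron_side))

# Toy flanking sequences around a candidate AG, for illustration only
print(acceptor_features("AUAUAUAG", "GCGCAU"))
```

Each real or decoy site then becomes one 2-dimensional point that a linear or kernel SVM can be trained on.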
59
How to represent acceptor sites so as to
discriminate them from other AGs?
  • Non-linear kernel (Gaussian or polynomial) gives
    small improvement to classifier performance
    compared to linear
  • Large degree polynomial and small-width Gaussian
    kernels yield reduced accuracy. Too flexible.

auROC
Ben-Hur 2008
60
How to represent acceptor sites so as to
discriminate them from other AGs?
  • Need better and more features.

Features                               Number of features
Count of four bases on intronic and
exonic sides of the acceptor           8
Count of all dimers                    32
Count of all trimers                   128
Count of all l-mers                    2 · 4^l
  • Works on a small set of reals and decoys but
    doesn't scale up to whole-genome analysis
61
How to represent acceptor sites so as to
discriminate them from other AGs?
Features
Count of four bases on intronic and exonic sides
of the acceptor (8 features)
Intron A 4, Exon A 0, Intron C 1, Exon C 2,
Intron G 1, Exon G 3, Intron U 2, Exon U 1
Count of all dimers (32 features)
Intron  A  C  G  U
A       1  0  1  1
C       1  0  0  0
G       0  0  0  1
U       1  1  0  0
Exon    A  C  G  U
A       0  0  0  0
C       0  0  1  0
G       0  1  1  1
U       0  1  0  0
Count of all trimers (128 features)
Count of all l-mers (2 · 4^l features)
  • This approach motivated the design of the
    spectrum kernel
62
Spectrum Kernel
Feature map based on spectrum of a sequence
X -> F(X)
  • C. Leslie, E. Eskin, and W. Noble, The Spectrum
    Kernel: A String Kernel for SVM Protein
    Classification. Pacific Symposium on Biocomputing,
    2002.
  • C. Leslie, E. Eskin, J. Weston and W. Noble,
    Mismatch String Kernels for SVM Protein
    Classification. NIPS 2002.

63
The k-Spectrum of a Sequence
  • Feature map for SVM based on spectrum of a
    sequence
  • The k-spectrum of a sequence is the set of all
    k-length contiguous subsequences that it contains
  • Feature map is indexed by all possible k-length
    subsequences (k-mers) from any alphabet (here the
    mRNA nucleotides)
  • Dimension of feature space: 4^k

ACCUGUACGG
ACC CCU CUG UGU GUA UAC ACG CGG
Slide courtesy of Christina Leslie
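
Extracting the k-spectrum is a sliding window over the sequence. A minimal sketch reproducing the 3-mers of the example above; the function name is illustrative.

```python
def kmers(seq, k):
    """All contiguous subsequences of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

three_mers = kmers("ACCUGUACGG", 3)
print(three_mers)
print(set(three_mers))  # the k-spectrum is the set of distinct k-mers
```

A sequence of length L yields L - k + 1 k-mers, while the feature space has 4^k dimensions for a nucleotide alphabet.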
64
k-Spectrum Feature Map
  • Feature map for k-spectrum with no mismatches
  • For sequence x, $F_k(x) = (F_t(x))_{k\text{-mers } t}$,
    where $F_t(x)$ = number of occurrences of t in x

ACCUGUGUGG
( 0 , 0 , ... , 1 , ... , 1 , ... , 2 , ... )
 AAA AAC ... ACC ... CCU ... UGU ...
Slide courtesy of Christina Leslie
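
The explicit feature map can be sketched as a count vector with one entry per possible k-mer, ordered alphabetically over the alphabet ACGU. The ordering is an assumption made for this sketch; the function name is illustrative.

```python
from itertools import product

def spectrum_feature_map(seq, k, alphabet="ACGU"):
    """Dense count vector indexed by all len(alphabet)**k possible k-mers."""
    index = {"".join(t): i
             for i, t in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1
    return vec, index

vec, index = spectrum_feature_map("ACCUGUGUGG", 3)
for t in ("AAA", "AAC", "ACC", "CCU", "UGU"):
    print(t, vec[index[t]])
```

For k = 3 the vector has 64 entries, most of them zero, which is why the dense form becomes impractical as k grows.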
65
Issues with spectrum kernel
$K_k^{\text{spectrum}}(x, x') = F_k(x) \cdot F_k(x')$
  • Explicit computation too hard for large k
  • Nucleotide sequences with k > 10
  • Protein sequences with k > 5
  • When k is large, the requirement for exact match
    will result in very few non-zero values
  • One efficiency is to compute only k-mers with
    non-zero counts. Why does that work?

Ben-Hur 2008
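
The efficiency trick works because k-mers with a zero count in either sequence contribute nothing to the dot product, so only k-mers that actually occur need to be stored. A sketch using sparse counts; function names are illustrative.

```python
from collections import Counter

def spectrum_counts(seq, k):
    """Sparse k-mer counts: only k-mers that occur are stored."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Dot product of the two k-spectrum count vectors."""
    cx, cy = spectrum_counts(x, k), spectrum_counts(y, k)
    if len(cx) > len(cy):
        cx, cy = cy, cx          # iterate over the smaller table
    # Counter returns 0 for k-mers missing from the other sequence
    return sum(c * cy[t] for t, c in cx.items())

print(spectrum_kernel("ACCUGUGUGG", "ACCUGUACGG", 3))
```

The cost depends on the sequence lengths rather than on the 4^k size of the feature space.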
66
Improved spectrum kernels
  • (k,m)-Mismatch Spectrum Kernel
  • Feature map for k-spectrum, allowing m
    mismatches
  • If s is a k-mer, $F_{(k,m)}(s) = (F_t(s))_{k\text{-mers } t}$,
    where $F_t(s) = 1$ if s is within m mismatches of
    t, 0 otherwise
  • extend additively to longer sequences x by
    summing over all k-mers s in x

[Figure: the k-mer AAC and k-mers within one mismatch of it: GAC, AAG, UAC, AUC]
Slide courtesy of Christina Leslie
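
A brute-force sketch of the (k,m)-mismatch feature map: enumerate every k-mer within m mismatches of each k-mer in the sequence and add one to its count. This is for illustration only; published implementations use more efficient data structures. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations, product

def mismatch_neighborhood(kmer, m, alphabet="ACGU"):
    """All k-mers that differ from kmer at no more than m positions."""
    neighbors = {kmer}
    for n_mis in range(1, m + 1):
        for positions in combinations(range(len(kmer)), n_mis):
            for letters in product(alphabet, repeat=n_mis):
                s = list(kmer)
                for pos, letter in zip(positions, letters):
                    s[pos] = letter
                neighbors.add("".join(s))
    return neighbors

def mismatch_feature_counts(seq, k, m, alphabet="ACGU"):
    """Sum the per-k-mer feature maps over all k-mers s in seq."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts.update(mismatch_neighborhood(seq[i:i + k], m, alphabet))
    return counts

print(sorted(mismatch_neighborhood("AAC", 1)))
```

With m = 0 this reduces to the plain k-spectrum counts.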
67
Improved spectrum kernels
  • Mixed spectrum kernel
  • Able to capture putatively important sequence
    motifs at multi-resolution

Ben-Hur 2008
68
Weighted degree spectrum kernel
$K_k^{\text{weighted degree}}(x, x') = \sum_{d=1}^{k} \beta_d \sum_{i=1}^{L-d+1} K_d^{\text{spectrum}}(x_{[i:i+d]}, x'_{[i:i+d]})$
$x_{[i:i+d]}$: substring of x of length d starting at position i
  • Take advantage of positional information.
  • Nucleotide positions are not all equally
    informative for purposes of detecting acceptor
    sites
  • Analyze a sequence of fixed length L and consider
    each position separately

Ben-Hur 2008
69
Weighted degree spectrum kernel
$K_k^{\text{weighted degree}}(x, x') = \sum_{d=1}^{k} \beta_d \sum_{i=1}^{L-d+1} K_d^{\text{spectrum}}(x_{[i:i+d]}, x'_{[i:i+d]})$
$x_{[i:i+d]}$: substring of x of length d starting at position i
  • Equivalent to using a mixed spectrum kernel for
    each position of the sequence separately
    (ignoring boundary effect).
  • How to choose β_d?

Ben-Hur 2008
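
A direct sketch of the weighted degree kernel for two sequences of the same fixed length L. The d-spectrum kernel of two substrings of length d is 1 if they are identical and 0 otherwise, so the inner sum counts position-specific matches of length d. The weights β_d are passed in by the caller; the uniform weights in the example are an arbitrary choice for illustration.

```python
def weighted_degree_kernel(x, y, betas):
    """Weighted degree kernel; betas[d-1] weights matches of length d."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length L")
    L = len(x)
    total = 0.0
    for d, beta in enumerate(betas, start=1):
        # Count positions where the length-d substrings agree exactly
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(L - d + 1))
        total += beta * matches
    return total

print(weighted_degree_kernel("ACGU", "ACGA", betas=[1.0, 1.0]))
```

Unlike the plain spectrum kernel, a shared k-mer only counts here when it occurs at the same position in both sequences.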
70
SVM performance discriminating acceptor sites
with three kinds of spectrum kernel
Ben-Hur 2008
71
Programming assignment 1
  • Will be posted this evening

72
Programming assignment 1
  • You can use machine learning or statistical
    modeling or a combination of both.
  • Libraries available for R will make this much
    easier for you.
  • You will be able to use R functions in your
    Python programs

73
Programming assignment 1
  • Be realistic about your current skill set.
  • If you're still shaky with Python, do something
    simple.
  • Main priority is to complete the assignment and
    write clean code.
  • If that's easy for you, here's a chance to do
    something innovative.