Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Slides: 22

Provided by: rc94

Learn more at: http://www.cs.uiuc.edu

Transcript and Presenter's Notes

1
Probabilistic clustering of sequences: Inferring
new bacterial regulons by comparative genomics

Erik van Nimwegen, Mihaela Zavolan, Nikolaus
Rajewsky, and Eric D. Siggia
In PNAS, 2002

2
Contents

Synopsis
Introduction
Model
Classifiability vs. Clusterability
Implementation
Results
Conclusions

3
Synopsis

Why Cluster Regulons?
Genome-wide comparisons of Enteric Bacteria Yield
Conserved Putative Regulatory Sites
Large Datasets
How to Assign Sites to Clusters?
Assumption: Regulatory Sites can be represented
as samples from Weight Matrices (WMs)
Monte Carlo Sampling of a Probability
Distribution to Partition and Align DNA Sequences
into Clusters
Determine the Number of Clusters and Assign a
Significance Measure to Each
When is Clustering Feasible?
Clustering the Set of All Putative Sites from a
Single Genome: Infeasible
Solution: Use Sites from All Genomes

4
Introduction - Background

Why do we need to study non-coding regions of
the genome?
They carry information about gene regulation
Regulatory sites are shorter than typical coding
regions and difficult to identify
Example: E. coli
60-80 TFs with known binding sites and regulatory roles
About 300 TFs expected according to protein sequence
homology
What is a Weight Matrix (WM)?
Describes Protein Binding Sites in Bacterial
Genomes
Gives the Probability $w_{i\alpha}$ of Finding Base
$\alpha$ at Position $i$ of the Binding Site
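
As an illustration of the WM definition above, a WM can be estimated from a set of aligned sites by counting bases per position. This is a minimal sketch, not the authors' procedure: the function name `estimate_wm`, the pseudocount of 1, and the three toy sites are all assumptions made for the example.

```python
from collections import Counter

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Estimate a weight matrix from equal-length aligned sites.

    Returns one dict per position mapping base -> probability.
    The pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    wm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)  # missing bases count as 0
        wm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wm

# Hypothetical toy alignment of three sites
wm = estimate_wm(["TTGACA", "TTGACT", "TTTACA"])
```

Each column of `wm` sums to 1, and a position where all sites agree still leaves some probability for the other bases because of the pseudocount.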

5
Existing Computational Strategies

I. Identify sets of similar sequences in the
regulatory regions of functionally related groups
of genes.
II. Identify repetitive patterns within an entire
genome
Some disadvantages
Those using WM cannot process genome scale data
representing thousands of Transcription Factors
Other schemes not suitable for processing sites
inferred from interspecies comparison

6
Some Proposed Improvements

Partition Entire Set of Sites at Once
Infer Number of Clusters Internally
Assign Probabilities to All Partitions of the Set
into Clusters
Theoretical Limit on Clusterability of Sets of
Regulatory Sites Derived
What is Clusterability?
A set of Sites is said to be Clusterable if
it is possible to infer which sites are from the
same WM

7
Model

Scores in Motif Finding Techniques
Calculate Information Score of Estimated WM
$$I = \sum_{i=1}^{l} \sum_{\alpha} w_{i\alpha} \log\frac{w_{i\alpha}}{b_\alpha}$$
Where $b_\alpha$ is the background frequency of base $\alpha$,
and the $w_{i\alpha}$ are the WM probabilities estimated
from the sequences in the alignment
Task: Cluster together binding sites of a number
of unknown TFs
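
The information score can be sketched in code, assuming the standard relative-entropy form: for each position, sum over bases of the WM probability times the log of its ratio to the background frequency. The function name, the use of natural logarithms, and the uniform background are assumptions of this example.

```python
import math

def information_score(wm, background):
    """Relative entropy of a WM against background frequencies, in nats.

    wm: list of dicts (one per position) mapping base -> probability.
    background: dict mapping base -> background frequency.
    Terms with zero WM probability contribute nothing.
    """
    return sum(
        p * math.log(p / background[base])
        for column in wm
        for base, p in column.items()
        if p > 0
    )

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
flat_wm = [dict(uniform) for _ in range(3)]                         # no preference
sharp_wm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0} for _ in range(3)]  # fully specific

flat_score = information_score(flat_wm, uniform)
sharp_score = information_score(sharp_wm, uniform)
```

A WM identical to the background scores 0, while a fully specific WM of length 3 scores 3 log 4 against a uniform background.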

[Figure: 2 Ways of Grouping]
8
Model (Contd.)

Tasks
Assign Probabilities to All Partitions
The Probability of a partition is the product of
the probabilities for each cluster
Step 1: Consider first the conditional
probability $P(S|w)$ that a set S of n sequences of
length l was drawn from a given WM w
Step 2: Calculate the probability $P(S)$ that all
sequences in S came from some w
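
Steps 1 and 2 can be sketched as follows. The probability of S given w is taken as the product of WM entries over all sequences and positions. For P(S), this sketch assumes a uniform prior over each WM column, which is an assumption of the example and not stated on the slides; under that prior the integral over w has a closed form in factorials of the base counts.

```python
import math
from collections import Counter

def log_p_given_wm(seqs, wm):
    """Step 1: log P(S|w), a sum of log WM entries over sequences and positions."""
    return sum(math.log(wm[i][base]) for s in seqs for i, base in enumerate(s))

def log_p_marginal(seqs):
    """Step 2: log P(S), integrating w out under an ASSUMED uniform column prior.

    Per column this gives 3!/(n+3)! times the product of count factorials.
    """
    n, length = len(seqs), len(seqs[0])
    total = 0.0
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total += math.lgamma(4) - math.lgamma(n + 4)          # log(3!/(n+3)!)
        total += sum(math.lgamma(c + 1) for c in counts.values())  # log of count!
    return total

p_single = math.exp(log_p_marginal(["A"]))       # one base, one sequence
p_same = math.exp(log_p_marginal(["A", "A"]))    # two identical bases
p_diff = math.exp(log_p_marginal(["A", "C"]))    # two different bases
```

Under this prior, two identical sequences are more probable as a set than two different ones, which is what lets the likelihood reward grouping similar sites.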

9
Model (Contd.)

Step 3: Define, for any partition C of a data set
of sequences D into clusters $S_c$, the likelihood
$P(D|C)$ that all sequences in each $S_c$ were drawn
from a single WM
$$P(D|C) = \prod_{c} P(S_c)$$
Step 4: Calculate Posterior Probability
$$P(C|D) = \frac{P(D|C)\,\pi(C)}{\sum_{C'} P(D|C')\,\pi(C')}$$
Where $\pi(C)$ is the prior distribution over
partitions
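
For a handful of sequences, the posterior over partitions can be computed by brute-force enumeration. The sketch below assumes a uniform prior over partitions and a uniform prior over WM columns; both are assumptions of the example, as are the function names and toy sequences. Enumeration is only feasible for tiny sets, which is why sampling is needed in practice.

```python
import math
from collections import Counter

def log_p_cluster(seqs):
    """log P(S) for one cluster, assuming a uniform prior over WM columns."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in counts.values())
    return total

def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                       # first item alone
        for k in range(len(part)):                   # or added to an existing block
            yield part[:k] + [[first] + part[k]] + part[k + 1:]

def posterior(seqs):
    """Return (partition, probability) pairs under a uniform prior over partitions."""
    parts = list(partitions(list(range(len(seqs)))))
    logs = [sum(log_p_cluster([seqs[j] for j in block]) for block in part)
            for part in parts]
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]      # subtract max for stability
    norm = sum(weights)
    return [(part, w / norm) for part, w in zip(parts, weights)]

result = posterior(["AAAAAA", "AAAAAA", "CCCCCC", "CCCCCC"])
best_partition, best_prob = max(result, key=lambda pair: pair[1])
```

With these toy sequences the most probable partition groups the two identical pairs, and four sequences already yield 15 candidate partitions.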
10
Classifiability

Classifiability
What is Correct Regulation of Gene Expression?
TFs should bind preferentially to their own sites
$P(s|w) > P(s|w')$ for a site s of WM w and any
other WM w': Called the Classification Task
Given WMs and a set of Sequences Sampled From
Them
Assign each Sequence to the WM that maximizes $P(s|w)$
Classification succeeds when the WM that maximizes
$P(s|w)$ is the WM from which sequence s was sampled
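
The classification task can be sketched as picking, for a sequence s, the WM with the highest probability of s. The two WMs and their names below are hypothetical, chosen only to make the example runnable.

```python
import math

def log_p_site(site, wm):
    """log P(s|w): sum of log WM entries along the site."""
    return sum(math.log(wm[i][base]) for i, base in enumerate(site))

def classify(site, wms):
    """Return the name of the WM that maximizes P(s|w)."""
    return max(wms, key=lambda name: log_p_site(site, wms[name]))

def column(preferred):
    """Hypothetical column: 0.7 for the preferred base, 0.1 for each other base."""
    return {b: (0.7 if b == preferred else 0.1) for b in "ACGT"}

# Two hypothetical WMs of length 3
wms = {
    "wm_polyA": [column("A"), column("A"), column("A")],
    "wm_polyC": [column("C"), column("C"), column("C")],
}

label_a = classify("AAA", wms)
label_c = classify("CCG", wms)
```

When many WMs with low specificity share the sequence space, a sampled site will often score higher under a WM other than its own, which is the failure mode the classifiability analysis quantifies.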

11
Clusterability

Suppose there are nG sequences
Obtained by Sampling n times from each of G
different WMs
Calculate the Probability that m of the n samples
from a WM w co-cluster. How?
Sum $P(C|D)$ over all partitions C in which m
samples of w occur together
If, for more than half of the G WMs,
m > n/2, then this set is clusterable

12
Analytical and Numerical Evaluations of
Classifiability and Clusterability

Given I, the fraction of the space of $4^l$
sequences filled by the binding sites for this WM
is $e^{-I}$
I is a measure of the specificity of the WM
Results show
$e^{-I}$ directly proportional to $1/G$ for
classification
$e^{-I}$ directly proportional to $1/G^2$ for
clustering a set of n = 3 binding sites
Clustering is impossible for sites near the
classification threshold

[Figure: I vs. G, showing the classifiability curve and clusterability curves for n = 3, 5, 10, 15]
13
Implementation

Monte Carlo Random Walk to Sample the Distribution
Choose a mini-WM and consider reassigning it to a
randomly chosen cluster
Increase $P(C|D)$
Generates Random Clusters

Move from Partition C to C'
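
A random walk over partitions can be sketched with a Metropolis-style acceptance rule. This is a simplified illustration under stated assumptions, not the authors' move set: single sequences stand in for mini-WMs, each item carries one of n cluster labels, a move relabels one random item, the column prior is uniform, and the prior is uniform over label assignments rather than over partitions.

```python
import math
import random
from collections import Counter

def log_p_cluster(seqs):
    """log P(S) for one cluster, assuming a uniform prior over WM columns."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in counts.values())
    return total

def log_likelihood(seqs, labels):
    """log P(D|C): sum of per-cluster log probabilities."""
    clusters = {}
    for s, lab in zip(seqs, labels):
        clusters.setdefault(lab, []).append(s)
    return sum(log_p_cluster(c) for c in clusters.values())

def sample_partitions(seqs, steps, rng):
    """Metropolis walk: relabel one random item per step, keep every visited state."""
    n = len(seqs)
    labels = list(range(n))                 # start with every item alone
    current = log_likelihood(seqs, labels)
    samples = []
    for _ in range(steps):
        i = rng.randrange(n)
        old = labels[i]
        labels[i] = rng.randrange(n)        # symmetric proposal over labels
        proposed = log_likelihood(seqs, labels)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed              # accept the move
        else:
            labels[i] = old                 # reject and restore
        samples.append(tuple(labels))
    return samples

seqs = ["AAAAAAAA", "AAAAAAAA", "CCCCCCCC", "CCCCCCCC"]
samples = sample_partitions(seqs, steps=5000, rng=random.Random(0))
together_01 = sum(s[0] == s[1] for s in samples) / len(samples)
together_02 = sum(s[0] == s[2] for s in samples) / len(samples)
```

The fraction of samples in which two items share a label estimates their co-clustering probability; here the identical sequences 0 and 1 co-cluster far more often than the dissimilar 0 and 2.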
14
Impact on Clusters

Generates Dynamic Clusters
What happens when a pair of mini-WMs are moved
together?
Clusters may evaporate
New clusters may form
Task
To identify significant clusters
By finding sets of mini-WMs that are grouped
together persistently during the Monte Carlo
sampling
Membership Stability
Ideal Case: Finding Stable Core Members
Reality: Constantly Drifting Clusters with
Uncorrelated Memberships

15
Identifying Candidate Clusters

Searching for the Maximum Likelihood (ML) Partition
Use Simulated Annealing to Maximize $P(C|D)$
$P(D|C)$ raised to Power Beta (Beta3 Ideally)
Testing Significance of ML Clusters
By Sampling $P(C|D)$
Measures
Probability p(k) that k members co-cluster
Mean Size of Cluster
Minimum Length of Interval
Clusters with Significant

ML Partitions Obtained by Annealing
The number of co-clustering members of an ML
cluster is the maximum number of mini-WMs from
the ML cluster that co-occur in a single cluster
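
The co-clustering count defined above, and the distribution p(k) built from it, can be sketched as follows. The function names and the toy partitions are assumptions of the example; in practice the partitions would come from the Monte Carlo sampling.

```python
from collections import Counter

def co_clustering_members(ml_cluster, partition):
    """Largest number of ML-cluster members found together in one block."""
    members = set(ml_cluster)
    return max(len(members & set(block)) for block in partition)

def p_of_k(ml_cluster, sampled_partitions):
    """Fraction of sampled partitions in which exactly k members co-cluster."""
    counts = Counter(co_clustering_members(ml_cluster, p)
                     for p in sampled_partitions)
    total = len(sampled_partitions)
    return {k: c / total for k, c in sorted(counts.items())}

# Hypothetical ML cluster and four sampled partitions of items 0..4
ml_cluster = [0, 1, 2]
sampled = [
    [[0, 1, 2], [3, 4]],
    [[0, 1], [2, 3], [4]],
    [[0, 1, 2, 3], [4]],
    [[0], [1], [2], [3, 4]],
]
distribution = p_of_k(ml_cluster, sampled)
```

A cluster whose p(k) is concentrated at large k holds together across samples, while one concentrated at k = 1 has effectively evaporated.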
16
Modifications for Large Data Sets

Several Monte Carlo Random Walks
Measure the Probability that Each Pair of Mini-WMs
Co-clusters
Construct Graph
Nodes correspond to Mini-WMs
Edges between mini-WMs i and j exist if and only
if their co-clustering probability exceeds 1/2
Candidate Clusters = Connected Components of
Graph
Calculate Probabilistic Cluster Membership
Calculate p(k)
Cluster significance is derived from p(k)
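
The pair-statistics procedure can be sketched as: estimate pairwise co-clustering frequencies from sampled partitions, link pairs above a threshold, and read off connected components. The threshold of 1/2 applied as a strict inequality is an assumed reading of the slide, and the function names and toy samples are hypothetical.

```python
from itertools import combinations

def pair_probabilities(sampled_partitions):
    """Fraction of sampled partitions in which each pair shares a block."""
    counts = {}
    for partition in sampled_partitions:
        for block in partition:
            for i, j in combinations(sorted(block), 2):
                counts[(i, j)] = counts.get((i, j), 0) + 1
    total = len(sampled_partitions)
    return {pair: c / total for pair, c in counts.items()}

def candidate_clusters(n_items, pair_prob, threshold=0.5):
    """Connected components of the graph with edges where p > threshold."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (i, j), p in pair_prob.items():
        if p > threshold:                   # assumed strict inequality
            parent[find(i)] = find(j)
    groups = {}
    for x in range(n_items):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Hypothetical sampled partitions of items 0..2
sampled = [
    [[0, 1], [2]],
    [[0, 1, 2]],
    [[0, 1], [2]],
]
probs = pair_probabilities(sampled)
clusters = candidate_clusters(3, probs)
```

Pair frequencies converge with far fewer samples than statistics on whole partitions, which is presumably why this variant suits large data sets.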

17
Data Sets Used

Uses Different Prokaryotic Genomes
Short Sequences 15 to 25 Bases
Use Genomes of E. coli, Actinobacillus
actinomycetemcomitans, Haemophilus influenzae,
Pseudomonas aeruginosa, Shewanella putrefaciens,
Salmonella typhimurium, Thiobacillus
ferrooxidans, V. cholerae, Y. pestis, Klebsiella
pneumoniae

18
Operations

Extend
Expand Alignments to Length 32
Pad Bases From Genomes
Replace Sequences of Closely Related Species by
their Consensus
Processing on E. coli
Align all known E. coli sites for each TF into
its own mini-WM
Add 56 Mini-WMs to this Set
Create Additional Test Set Consisting of
397 Known Binding Sites and E. coli sequences of
the top 2,000 unannotated mini-WMs

19
Results

Test Set of 397 Binding Sites
Sample $P(C|D)$ and Measure How Well the Sites
Cluster
Significant Clusters Obtained for 24 of 53 TFs
Twenty-two TFs have three or fewer sites in the
test set, and with the exception of trpR their
sites did not cluster significantly
Comparison
Inferred Results (Annealing) vs. Results From Site
Annotation
Good Agreement Between Both Results
Likelihood for Partition Obtained By Annealing is
Higher than that by Site Annotation
Improvements of Proposed Work
Recovers almost half of all regulons for which
binding sites are known and the large majority of
regulons for which there are more than three
sites known

20
Results for Large Datasets

Annealed state and significance statistics did not
converge fully within running times
$10^{10}$ steps, taking a week on a workstation per
run
Solution
Converge Using Pair Statistics
Example
Around 365 clusters on average
Connectivity graph gave 274 components containing
1,139 out of 2,056 mini-WMs
Top 45 Clusters Fill 1% of Space
Top 80 Clusters Fill 10% of Space
Next 115 Clusters Fill 39% of Space

21
Conclusions and Discussion

New Inference Procedure for Probabilistically
Partitioning a Set of DNA Sequences into Clusters
Assumption: All WMs are of Fixed Length
Predictions from applying the algorithm to data
sets of putative regulatory sites extracted from
enteric bacteria
100 new regulons in E. coli, containing 500
binding sites, and 50 binding sites for known TFs
In addition to TF binding sites, RNA stems
controlling translation and termination motifs
are present in Predicted Regulons