Title: Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics
1. Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics
- Erik van Nimwegen, Mihaela Zavolan, Nikolaus Rajewsky, and Eric D. Siggia
- In PNAS 2002
2. Contents
- Synopsis
- Introduction
- Model
- Classifiability VS Clusterability
- Implementation
- Results
- Conclusions
3. Synopsis
- Why Cluster Regulons?
  - Genome-wide Comparisons of Enteric Bacteria Yield Conserved Putative Regulatory Sites: Large Datasets
- How to Assign Sites to Clusters?
  - Assumption: Regulatory Sites can be represented as samples from Weight Matrices (WMs)
  - Monte Carlo Sampling of a Probability Distribution to Partition and Align DNA Sequences into Clusters
  - Determine Number of Clusters and Assign Some Significance Metric to Them
- When is Clustering Feasible?
  - Set of All Putative Sites for a Single Genome: Infeasible
  - Solution: Use Sites from All Genomes
4. Introduction - Background
- Why do we need to sequence non-coding regions of the genome?
  - They carry information about gene regulation
  - They are smaller than typical coding regions and difficult to identify
- Examples
  - In E. coli
  - 60-80 transcription factors (TFs) with known binding and regulatory sites
  - 300 is the actual number of TFs according to protein sequence homology
- What is a Weight Matrix (WM)?
  - Describes Protein Binding Sites in Bacterial Genomes
  - Gives the Probability of Finding Base α at position i of the Binding Site
5. Existing Computational Strategies
- I. Identify sets of similar sequences in the regulatory regions of functionally related groups of genes.
- II. Identify repetitive patterns within an entire genome.
- Some disadvantages
  - Those using WMs cannot process genome-scale data representing thousands of Transcription Factors
  - Other schemes are not suitable for processing sites inferred from interspecies comparison
6. Some Proposed Improvements
- Partition Entire Set of Sites at Once
- Infer Number of Clusters Internally
- Assign Partitions to All Subsequences of Clusters
- Theoretical Limit on Clusterability of Sets of Regulatory Sites Derived
- What is Clusterability?
  - A set of Sites is said to be Clusterable if it is possible to infer which sites are from the same WM
7. Model
- Scores in Motif Finding Techniques
  - Calculate the Information Score of the Estimated WM: I = Σ_i Σ_α w^i_α log(w^i_α / b_α)
  - Where b_α is the background frequency of base α, and the w^i_α are the WM probabilities estimated from the sequences in the alignment
- Task: Cluster together the binding sites of a number of unknown TFs
2 Ways of Grouping
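The information score above can be illustrated with a minimal sketch. It assumes that the WM is estimated from column counts with a pseudocount and that the background is uniform unless given; the function names, the pseudocount, and the toy sites are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Estimate WM probabilities w[i][base] from aligned, equal-length sites.

    The pseudocount is an assumption of this sketch; it keeps every
    probability strictly positive.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    wm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        wm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wm

def information_score(wm, background=None):
    """I = sum over positions i and bases a of w[i][a] * log(w[i][a] / b[a]), in nats."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # uniform background (assumption)
    return sum(
        w * math.log(w / background[b])
        for column in wm
        for b, w in column.items()
        if w > 0
    )

# Toy alignment (made-up sequences, for illustration only).
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
wm = estimate_wm(sites)
score = information_score(wm)
```

A WM whose columns equal the background scores zero, and the score grows as the columns become more specific, up to `length * log(4)` nats for a uniform background.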
8. Model (Contd.)
- Tasks
  - Assign Probabilities to All Partitions
  - The Probability of a partition is the product of the probabilities for each cluster
- Step 1: Consider first the conditional probability P(S|w) that a set S of n sequences of length l was drawn from a given WM w
- Step 2: Calculate the probability P(S) that all sequences in S came from some w
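Steps 1 and 2 can be sketched as follows. P(S|w) is the product of the WM entries selected by each base of each sequence. For P(S), this sketch assumes a uniform prior over each WM column (an assumption, since the slides do not state the prior), in which case the integral over w has the closed form 3! · Π_α n^i_α! / (n + 3)! per column, where n^i_α counts base α at position i. The function names are hypothetical.

```python
from math import lgamma, log
from collections import Counter

BASES = "ACGT"

def log_p_given_wm(seqs, wm):
    """log P(S|w): sum over sequences and positions of log w[i][base]."""
    return sum(log(wm[i][b]) for s in seqs for i, b in enumerate(s))

def log_p_set(seqs):
    """log P(S): probability that all sequences came from one unknown WM.

    Assumes a uniform prior over each column of w, which gives
    3! * prod_a n_a! / (n + 3)! for every column.
    """
    n = len(seqs)
    length = len(seqs[0])
    total = 0.0
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total += lgamma(4)                                   # log 3!
        total += sum(lgamma(counts[b] + 1) for b in BASES)   # log prod n_a!
        total -= lgamma(n + 4)                               # log (n + 3)!
    return total

# Identical sequences are more probable as one set than as two separate sets.
together = log_p_set(["ACGT", "ACGT"])
apart = log_p_set(["ACGT"]) + log_p_set(["ACGT"])
```

Working in log space avoids underflow when many sequences or long sites are scored.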
9. Model (Contd.)
- Step 3: Define, for any partition C of a data set of sequences D into clusters S_c, the likelihood P(D|C) that all sequences in each S_c were drawn from a single WM
- Step 4: Calculate the Posterior Probability P(C|D) = P(D|C) p(C) / Σ_C' P(D|C') p(C'), where p(C) is the prior distribution over partitions
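For a very small data set, Steps 3 and 4 can be carried out exactly by enumerating every partition. The sketch below assumes a uniform prior over partitions and a uniform prior over WM columns (both are assumptions of this example), and the toy sequences are made up. Enumeration is only feasible for a handful of sequences, which is why sampling is needed for real data.

```python
from math import lgamma, exp
from collections import Counter

BASES = "ACGT"

def log_p_set(seqs):
    """log P(S) for one cluster, assuming a uniform prior over each WM column."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += lgamma(4) - lgamma(n + 4)
        total += sum(lgamma(counts[b] + 1) for b in BASES)
    return total

def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def posterior(seqs):
    """Return {partition: P(C|D)} under a uniform prior over partitions."""
    log_like = {}
    for part in partitions(list(range(len(seqs)))):
        key = tuple(sorted(tuple(sorted(block)) for block in part))
        # P(D|C) is the product over clusters of P(S_c).
        log_like[key] = sum(log_p_set([seqs[i] for i in block]) for block in part)
    top = max(log_like.values())
    weights = {k: exp(v - top) for k, v in log_like.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

data = ["ACGTAC", "ACGTAC", "ACGTAA", "TTGCAT", "TTGCAT"]
post = posterior(data)
best = max(post, key=post.get)
```

Each key is a tuple of clusters given as index tuples, so `best` names the partition with the highest posterior probability for the toy data.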
10. Classifiability
- Classifiability
- What is Correct Regulation of Gene Expression?
  - TFs should bind preferentially to their own sites
  - P(s|w) > P(s|w'), where s is a site sampled from WM w and w' is any other WM
- Classification Task
  - Given WMs and a set of Sequences Sampled From Them
  - Assign each Sequence to the WM that maximizes P(s|w)
  - Classification is correct when the WM that maximizes P(s|w) is the WM from which sequence s was sampled
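The classification task can be sketched in a few lines: score a sequence against every known WM and pick the one with the largest P(s|w). The helper `make_wm`, the TF names, and the match probability are hypothetical and exist only to build toy WMs for the example.

```python
from math import log

BASES = "ACGT"

def log_p_site(site, wm):
    """log P(s|w) for one sequence s and one WM w."""
    return sum(log(wm[i][b]) for i, b in enumerate(site))

def classify(site, wms):
    """Return the name of the WM that maximizes P(s|w)."""
    return max(wms, key=lambda name: log_p_site(site, wms[name]))

def make_wm(consensus, p_match=0.85):
    """Hypothetical helper: a WM that favours `consensus` at every position."""
    p_other = (1.0 - p_match) / 3.0
    return [{b: (p_match if b == c else p_other) for b in BASES} for c in consensus]

# Two made-up WMs and one site that differs from the first consensus at one position.
wms = {"tfA": make_wm("ACGTAC"), "tfB": make_wm("TTGCAT")}
label = classify("ACGTAA", wms)
```

When WMs are weakly specific or very numerous, sequences sampled from one WM start to score higher under another, which is the failure of classifiability discussed on the following slides.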
11. Clusterability
- Suppose there are nG sequences
  - Obtained by Sampling n times from each of G different WMs
- Calculate the Probability that m of the n samples from a WM w co-cluster. How?
  - Sum P(C|D) over all partitions C in which m samples of w occur together
- If, for this set of G WMs, m > n/2 for more than half of the WMs, then the set is clusterable
12. Analytical and Numerical Evaluations of Classifiability and Clusterability
- Given I, the fraction of the space of 4^l sequences filled by the binding sites for this WM is e^(-2I)
  - I is a measure of the specificity of the WM
- Results show
  - exp(2I) directly proportional to (1/G) for classification
  - exp(2I) directly proportional to (1/G^2) for clustering a set of n = 3 binding sites
- Clustering is impossible for sites near the classification threshold
[Figure: I vs. G, showing the classifiability curve and the clusterability curves for n = 3, 5, 10, and 15]
13. Implementation
- Monte Carlo Random Walk to Sample the Distribution
  - Choose a mini-WM and consider assigning it to a randomly chosen cluster
  - Increase P(C|D)
  - Generates Random Clusters
Move from Partition C to C'
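One way to realize such a random walk is a Metropolis-Hastings move that reassigns a single item to a randomly chosen cluster or to a new one. The sketch below is an illustration under stated assumptions, not the authors' implementation: it moves individual sequences rather than mini-WMs, uses a uniform prior over partitions and over WM columns, and recomputes the full likelihood at every step for clarity. All names are hypothetical.

```python
import random
from math import lgamma, exp, log
from collections import Counter

BASES = "ACGT"

def log_p_set(seqs):
    """log P(S) for one cluster, assuming a uniform prior over each WM column."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += lgamma(4) - lgamma(n + 4)
        total += sum(lgamma(counts[b] + 1) for b in BASES)
    return total

def log_like(clusters, seqs):
    """log P(D|C): sum of log P(S_c) over the clusters of the partition."""
    return sum(log_p_set([seqs[i] for i in c]) for c in clusters)

def mc_step(clusters, seqs, rng):
    """Propose moving one item to a random cluster (or a new one); accept or reject."""
    n_items = sum(len(c) for c in clusters)
    item = rng.randrange(n_items)
    src = next(k for k, c in enumerate(clusters) if item in c)
    dst = rng.randrange(len(clusters) + 1)  # last index means "open a new cluster"
    if dst == src:
        return clusters
    proposal = [list(c) for c in clusters] + [[]]
    proposal[src].remove(item)
    proposal[dst].append(item)
    proposal = [c for c in proposal if c]  # drop clusters that became empty
    log_ratio = log_like(proposal, seqs) - log_like(clusters, seqs)
    # Hastings correction: the number of target choices depends on the cluster count.
    log_ratio += log((len(clusters) + 1) / (len(proposal) + 1))
    if rng.random() < exp(min(0.0, log_ratio)):
        return proposal
    return clusters

seqs = ["ACGTAC", "ACGTAC", "ACGTAA", "TTGCAT", "TTGCAT"]
rng = random.Random(0)
state = [[i] for i in range(len(seqs))]  # start with every sequence alone
for _ in range(200):
    state = mc_step(state, seqs, rng)
```

Because moves that lower P(C|D) are sometimes accepted, clusters can dissolve and re-form during the walk, so significance has to be read from statistics over many sampled partitions rather than from any single state.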
14. Impact on Clusters
- Generates Dynamic Clusters
- What happens when a pair of mini-WMs are moved together?
  - Clusters may evaporate
  - New clusters may form
- Task
  - To identify significant clusters
  - By finding sets of mini-WMs that are grouped together persistently during the Monte Carlo sampling
- Membership Stability
  - Ideal Case: Finding Stable Core Members
  - Reality: Constantly Drifting Clusters with Uncorrelated Memberships
15. Identifying Candidate Clusters
- Searching for the Maximum Likelihood (ML) Partition
  - Use Simulated Annealing to Maximize P(C|D)
  - P(D|C) raised to Power Beta (Beta = 3 Ideally)
- Testing Significance of ML Clusters
  - By Sampling P(C|D)
- Measures
  - Probability p(k) that k members co-cluster
  - Mean Size of Cluster
  - Minimum Length of Interval
- Clusters with Significant
ML Partitions Obtained by Annealing
The number of co-clustering members of an ML
cluster is the maximum number of mini-WMs from
the ML cluster that co-occur in a single cluster
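The co-clustering count defined above, and the distribution p(k) built from it, can be estimated from a list of sampled partitions. This is a minimal sketch; the function names and the four toy partitions are illustrative assumptions.

```python
from collections import Counter

def co_clustering_count(ml_cluster, partition):
    """Maximum number of ML-cluster members found together in one cluster."""
    members = set(ml_cluster)
    return max(len(members & set(cluster)) for cluster in partition)

def p_of_k(ml_cluster, sampled_partitions):
    """Estimate p(k): the fraction of samples in which k members co-cluster."""
    counts = Counter(co_clustering_count(ml_cluster, p) for p in sampled_partitions)
    total = len(sampled_partitions)
    return {k: counts[k] / total for k in sorted(counts)}

# Four made-up sampled partitions of items 0..4.
samples = [
    [[0, 1, 2], [3, 4]],
    [[0, 1], [2, 3, 4]],
    [[0, 1, 2, 3], [4]],
    [[0], [1], [2], [3, 4]],
]
pk = p_of_k([0, 1, 2], samples)
```

A distribution p(k) concentrated near the full size of the ML cluster indicates that its members stay together throughout the sampling.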
16. Modifications for Large Data Sets
- Several Monte Carlo Random Walks
- Measure the Probability that Each Pair of Mini-WMs co-clusters
- Construct Graph
  - Nodes correspond to Mini-WMs
  - Edges between mini-WMs i and j exist if and only if their co-clustering probability is greater than 1/2
- Candidate Clusters: Connected Components of the Graph
- Calculate Probabilistic Cluster Membership
- Calculate p(k)
- Cluster significance is derived from p(k)
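The pair-statistics procedure can be sketched as follows: estimate each pair's co-clustering probability from sampled partitions, link pairs whose probability exceeds 1/2, and read candidate clusters off the connected components. The function names and toy samples are assumptions made for this illustration.

```python
from itertools import combinations
from collections import defaultdict

def pair_probabilities(sampled_partitions):
    """Fraction of sampled partitions in which each pair shares a cluster."""
    counts = defaultdict(int)
    for partition in sampled_partitions:
        for cluster in partition:
            for i, j in combinations(sorted(cluster), 2):
                counts[(i, j)] += 1
    total = len(sampled_partitions)
    return {pair: c / total for pair, c in counts.items()}

def candidate_clusters(n_items, pair_prob, threshold=0.5):
    """Connected components of the graph with an edge wherever p_ij > threshold."""
    neighbours = defaultdict(set)
    for (i, j), p in pair_prob.items():
        if p > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    seen, components = set(), []
    for start in range(n_items):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components

samples = [
    [[0, 1, 2], [3, 4]],
    [[0, 1], [2, 3, 4]],
    [[0, 1, 2, 3], [4]],
    [[0], [1], [2], [3, 4]],
]
probs = pair_probabilities(samples)
clusters = candidate_clusters(5, probs)
```

Pair statistics need far fewer samples to stabilize than statistics over whole partitions, which is what makes this variant practical for thousands of mini-WMs.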
17. Data Sets Used
- Uses Different Prokaryotic Genomes
- Short Sequences: 15 to 25 Bases
- Use Genomes of E. coli, Actinobacillus actinomycetemcomitans, Haemophilus influenzae, Pseudomonas aeruginosa, Shewanella putrefaciens, Salmonella typhimurium, Thiobacillus ferrooxidans, V. cholerae, Y. pestis, and Klebsiella pneumoniae
18. Operations
- Extend
  - Expand Alignments to Length 32
  - Pad Bases From Genomes
  - Replace Sequences of Closely Related Species by their Consensus
- Processing on E. coli
  - Align all known E. coli sites for each TF into its own mini-WM
  - Add 56 Mini-WMs to this Set
- Create Additional Test Set Consisting of
  - 397 Known Binding Sites and E. coli sequences of the top 2,000 unannotated mini-WMs
19. Results
- Test Set of 397 Binding Sites
  - Sample P(C|D) and Measure How Well the Sites Cluster
  - Significant Clusters Obtained for 24 of 53 TFs
  - Twenty-two TFs have three or fewer sites in the test set, and with the exception of trpR their sites did not cluster significantly
- Comparison
  - Inferred Results (Annealing) vs. Results From Site Annotation
  - Good Agreement Between Both Results
  - Likelihood for Partition Obtained By Annealing is Higher than that by Site Annotation
- Improvements of Proposed Work
  - Recovers almost half of all regulons for which binding sites are known and the large majority of regulons for which there are more than three sites known
20. Results for Large Datasets
- Annealed state and significance statistics did not converge fully within the running times
  - 10^10 steps, taking a week on a workstation per run
- Solution
  - Converge Using Pair Statistics
- Example
  - Around 365 clusters on average
  - Connectivity graph gave 274 components containing 1,139 out of 2,056 mini-WMs
  - 1% of Space Filled By Top 45 Clusters
  - Top 80 Clusters Fill 10% of Space
  - Next 115 Clusters Fill 39% of Space
21. Conclusions and Discussion
- New Inference Procedure for Probabilistically Partitioning a Set of DNA Sequences into Clusters
- Assumption: All WMs are of Fixed Length
- Predictions from applying the algorithm to data sets of putative regulatory sites extracted from enteric bacteria
  - 100 new regulons in E. coli, containing 500 binding sites, and 50 binding sites for known TFs
  - In addition to TF binding sites, RNA stems controlling translation and termination motifs are present in Predicted Regulons