Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

1
Probabilistic clustering of sequences: Inferring
new bacterial regulons by comparative genomics
  • Erik van Nimwegen, Mihaela Zavolan, Nikolaus
    Rajewsky, and Eric D. Siggia
  • In PNAS, 2002

2
Contents
  • Synopsis
  • Introduction
  • Model
  • Classifiability vs. Clusterability
  • Implementation
  • Results
  • Conclusions

3
Synopsis
  • Why Cluster Regulons?
  • Genome-wide comparisons of Enteric Bacteria Yield
    Conserved Putative Regulatory Sites
  • Large Datasets
  • How to Assign Sites to Clusters?
  • Assumption: Regulatory Sites can be represented
    as samples from Weight Matrices (WMs)
  • Monte Carlo Sampling of a Probability
    Distribution to Partition and Align DNA Sequences
    into Clusters
  • Determine Number of Clusters and Assign Some
    Significance Metric to Them
  • When is Clustering Feasible?
  • Clustering the Set of All Putative Sites from a
    Single Genome is Infeasible
  • Solution: Use Sites from All Genomes

4
Introduction - Background
  • Why do we need to sequence non-coding regions of
    the genome?
  • They carry information about gene regulation
  • Regulatory sites are much shorter than coding
    regions and difficult to identify
  • Examples
  • In E. coli
  • Binding sites are known for only 60-80
    transcription factors (TFs)
  • About 300 TFs are estimated from protein sequence
    homology
  • What is a Weight Matrix(WM)?
  • Describes Protein Binding Sites in Bacterial
    Genomes
  • Gives the Probability $w^i_\alpha$ of Finding Base α
    at Position i of the Binding Site
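A weight matrix can be illustrated by estimating one from a handful of aligned sites. The sketch below is only an illustration: the toy sites, the function name `estimate_wm`, and the pseudocount of 1 are assumptions, not taken from the presentation.

```python
from collections import Counter

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Return one {base: probability} dict per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    wm = []
    total = len(sites) + pseudocount * len(BASES)
    for i in range(length):
        # Count how often each base occurs in column i.
        counts = Counter(s[i] for s in sites)
        wm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wm

# Made-up toy alignment of four sites of length 5.
sites = ["TGTGA", "TGTGA", "TGAGA", "TGTGC"]
wm = estimate_wm(sites)
print(wm[0]["T"])  # (4 + 1) / (4 + 4) = 0.625
```

Each column is a probability distribution over the four bases, so the entries of every column sum to one.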

5
Existing Computational Strategies
  • I. Identify sets of similar sequences in the
    regulatory regions of functionally related groups
    of genes
  • II. Identify repetitive patterns within an entire
    genome
  • Some disadvantages
  • Those using WM cannot process genome scale data
    representing thousands of Transcription Factors
  • Other schemes not suitable for processing sites
    inferred from interspecies comparison

6
Some Proposed Improvements
  • Partition Entire Set of Sites at Once
  • Infer the Number of Clusters Internally
  • Assign Probabilities to All Partitions of the
    Sequences into Clusters
  • Theoretical Limit on Clusterability of Sets of
    Regulatory Sites Derived
  • What is Clusterability?
  • A set of sites is said to be clusterable if it is
    possible to infer which sites are from the same WM

7
Model
  • Scores in Motif Finding Techniques
  • Calculate the Information Score of the Estimated WM:
    $I = \sum_{i=1}^{l} \sum_{\alpha} w^i_\alpha \log\left(w^i_\alpha / b_\alpha\right)$
  • where $b_\alpha$ is the background frequency of base α,
    and the $w^i_\alpha$ are the WM probabilities estimated
    from the sequences in the alignment
  • Task: Cluster together the binding sites of a number
    of unknown TFs
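The information score can be computed directly from a WM and the background frequencies. The sketch below assumes the standard relative-entropy form, I = sum over positions i and bases a of w_ia * log(w_ia / b_a), with natural logarithms and a uniform background; the function name and the toy matrices are illustrative.

```python
import math

def information_score(wm, background=None):
    """Sum over columns and bases of w * log(w / b), in nats."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    score = 0.0
    for column in wm:
        for base, w in column.items():
            if w > 0:  # a zero entry contributes nothing (0 * log 0 -> 0)
                score += w * math.log(w / background[base])
    return score

uniform = [{b: 0.25 for b in "ACGT"}] * 3
sharp = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
print(information_score(uniform))  # 0.0: no specificity
print(information_score(sharp))    # 2 * ln(4): maximal for two columns
```

A WM identical to the background scores zero, and each fully determined column adds ln 4 nats under a uniform background.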

Figure: Two Ways of Grouping
8
Model (Contd.)
  • Tasks
  • Assign Probabilities to All Partitions
  • The Probability of a partition is the product of
    the probabilities, for each cluster
  • Step 1: Consider first the conditional
    probability P(S|w) that a set S of n sequences of
    length l was drawn from a given WM w
  • Step 2: Calculate the probability P(S) that all
    sequences in S came from some (unknown) WM w
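Steps 1 and 2 can be sketched in a few lines. The code assumes that positions are independent and, for Step 2, that the unknown WM is integrated out under a uniform prior on each column, which gives the closed form prod_i [3! / (n+3)!] * prod_a (n_ia)!. That choice of prior and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def log_p_given_wm(seqs, wm):
    """Step 1: log P(S|w), a product of WM entries over sequences and positions."""
    return sum(math.log(wm[i][base]) for s in seqs for i, base in enumerate(s))

def log_p_marginal(seqs):
    """Step 2: log P(S), with w integrated out under an assumed uniform prior."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        # log of 3! / (n + 3)! for this column
        total += math.lgamma(4) - math.lgamma(n + 4)
        # log of the product of factorials of the base counts
        total += sum(math.lgamma(c + 1) for c in counts.values())
    return total

print(math.exp(log_p_marginal(["A", "A"])))  # two identical bases
print(math.exp(log_p_marginal(["A", "C"])))  # two different bases
```

Under this prior two identical one-base sequences are twice as likely to share a WM as two different ones, which is what drives similar sites into the same cluster.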

9
Model (Contd.)
  • Step 3: Define, for any partition C of a data set
    of sequences D into clusters $S_c$, the likelihood
    $P(D|C) = \prod_c P(S_c)$ that all sequences in each
    $S_c$ were drawn from a single WM
  • Step 4: Calculate the Posterior Probability
    $P(C|D) = P(D|C)\,\pi(C) / \sum_{C'} P(D|C')\,\pi(C')$

where $\pi(C)$ is the prior distribution over
partitions
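For a very small data set the posterior over partitions can be computed exactly by enumeration. The sketch below assumes a uniform prior over partitions and a uniform prior over WM columns when scoring each cluster; both priors, the helper names, and the toy sequences are illustrative assumptions.

```python
import math
from collections import Counter

def log_p_cluster(seqs):
    """log P(S) for one cluster, assuming a uniform prior over WM columns."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in counts.values())
    return total

def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

def posterior_over_partitions(seqs):
    """Return (partition, P(C|D)) pairs under a uniform prior over partitions."""
    scored = []
    for partition in set_partitions(list(range(len(seqs)))):
        log_like = sum(log_p_cluster([seqs[i] for i in block]) for block in partition)
        scored.append((partition, log_like))
    max_log = max(value for _, value in scored)  # subtract for numerical stability
    norm = sum(math.exp(value - max_log) for _, value in scored)
    return [(p, math.exp(value - max_log) / norm) for p, value in scored]

seqs = ["AAAAAAAA", "AAAAAAAA", "CCCCCCCC", "CCCCCCCC"]
posterior = posterior_over_partitions(seqs)
best = max(posterior, key=lambda pair: pair[1])
print(best[0])
```

The number of partitions grows as the Bell numbers (15 for four sequences), so exhaustive enumeration is feasible only for toy inputs; larger sets call for sampling.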
10
Classifiability
  • What is Correct Regulation of Gene Expression?
  • TFs should bind preferentially to their own sites
  • P(s|w) > P(s|w') for every other WM w': this is
    called the Classification Task
  • Given WMs and a Set of Sequences Sampled From
    Them
  • Assign each Sequence s to the WM that maximizes P(s|w)
  • Classification is correct when the WM that maximizes
    P(s|w) is the WM from which sequence s was sampled
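The classification task amounts to an argmax over known WMs. In this sketch the two WMs, their names (`tfA`, `tfB`), and the helper `consensus_wm` are made up for illustration.

```python
import math

def consensus_wm(consensus, match=0.85):
    """Toy WM: the consensus base gets `match`, the other three share the rest."""
    rest = (1.0 - match) / 3
    return [{b: (match if b == c else rest) for b in "ACGT"} for c in consensus]

def log_p_site(seq, wm):
    """log P(s|w) for a single sequence, assuming independent positions."""
    return sum(math.log(wm[i][base]) for i, base in enumerate(seq))

def classify(seq, wms):
    """Return the name of the WM that maximizes P(s|w)."""
    return max(wms, key=lambda name: log_p_site(seq, wms[name]))

wms = {"tfA": consensus_wm("TGTGA"), "tfB": consensus_wm("ACACC")}
print(classify("TGTGA", wms))  # tfA
print(classify("ACAGC", wms))  # tfB, despite one mismatch
```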

11
Clusterability
  • Suppose there are nG sequences
  • Obtained by Sampling n times from each of G
    different WMs
  • Calculate the Probability that m of the n samples
    from a WM w co-cluster. How?
  • Sum P(C|D) over all partitions C in which m
    samples of w occur together
  • If, for more than half of the G WMs,
    m > n/2, then this set is clusterable

12
Analytical and Numerical Evaluations of
Classifiability and Clusterability
  • Given I, the fraction of the space of $4^l$
    sequences filled by the binding sites for this WM
    is $e^{-I}$
  • I is a measure of the specificity of the WM
  • Results show
  • $e^{-I}$ is directly proportional to 1/G for
    classification
  • $e^{-I}$ is directly proportional to $1/G^2$ for
    clustering a set of n = 3 binding sites
  • Clustering is impossible for sites near the
    classification threshold
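A short worked step shows why clustering is so much harder than classification. It reads the two scalings as $e^{-I} \propto 1/G$ and $e^{-I} \propto 1/G^2$; the constants $c_1$ and $c_2$ are unspecified and introduced here only for illustration, and G = 300 is an arbitrary example value.

```latex
\begin{align*}
\text{classification:}\quad e^{-I} \propto \frac{1}{G}
  &\;\Longrightarrow\; I_{\text{class}} \approx \ln G + c_1 \\
\text{clustering } (n = 3):\quad e^{-I} \propto \frac{1}{G^{2}}
  &\;\Longrightarrow\; I_{\text{clust}} \approx 2 \ln G + c_2 \\
\text{for } G = 300:\quad \ln G \approx 5.7 \text{ nats},
  &\qquad 2 \ln G \approx 11.4 \text{ nats}
\end{align*}
```

Up to the constants, the information score needed to cluster with three sites per WM is about twice that needed to classify against known WMs.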

[Figure: I vs. G, showing the classifiability threshold and the clusterability thresholds for n = 3, 5, 10, and 15]
13
Implementation
  • Monte Carlo Random Walk to Sample Distribution
  • Choose a mini-WM and consider assigning it to a
    randomly chosen cluster
  • Favor Moves that Increase P(C|D)
  • Generates Random Clusters

Figure: Move from Partition C to Partition C'
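The random walk over partitions can be sketched as a Metropolis-Hastings sampler. This is a minimal illustration under stated assumptions, not the authors' implementation: it moves single sequences rather than mini-WMs, uses a uniform prior over partitions and over WM columns, and recomputes the full likelihood at every step.

```python
import math
import random
from collections import Counter

def log_p_cluster(seqs):
    """log P(S) for one cluster, assuming a uniform prior over WM columns."""
    n = len(seqs)
    total = 0.0
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in counts.values())
    return total

def log_likelihood(seqs, labels):
    """log P(D|C): sum of per-cluster scores for the partition given by labels."""
    clusters = {}
    for seq, label in zip(seqs, labels):
        clusters.setdefault(label, []).append(seq)
    return sum(log_p_cluster(members) for members in clusters.values())

def mc_step(seqs, labels, rng):
    """Propose moving one item to another cluster or to a new one, then accept or reject."""
    item = rng.randrange(len(seqs))
    current = set(labels)
    others = sorted(current - {labels[item]})
    new_label = max(current) + 1
    target = rng.choice(others + [new_label])  # K options when there are K clusters
    proposal = list(labels)
    proposal[item] = target
    k_old, k_new = len(current), len(set(proposal))
    # Likelihood ratio times the Hastings correction K_old / K_new,
    # since the proposal picks uniformly among K options.
    log_ratio = (log_likelihood(seqs, proposal) - log_likelihood(seqs, labels)
                 + math.log(k_old / k_new))
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return labels

def sample_partitions(seqs, steps, seed=0):
    """Run the walk from the all-singletons partition and record every state."""
    rng = random.Random(seed)
    labels = list(range(len(seqs)))
    samples = []
    for _ in range(steps):
        labels = mc_step(seqs, labels, rng)
        samples.append(tuple(labels))
    return samples

samples = sample_partitions(["AAAA", "AAAT", "CCCC", "CCCG"], steps=200)
print(samples[-1])
```

Moves that raise the posterior are always accepted and others are accepted with probability equal to the ratio, so clusters can form and dissolve along the walk.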
14
Impact on Clusters
  • Generates Dynamic Clusters
  • What happens when a pair of mini-WMs are moved
    together?
  • Clusters may evaporate
  • New clusters may form
  • Task
  • To identify significant clusters
  • By finding sets of mini-WMs that are grouped
    together persistently during the Monte Carlo
    sampling
  • Membership Stability
  • Ideal Case: Finding Stable Core Members
  • Reality: Constantly Drifting Clusters with
    Uncorrelated Memberships

15
Identifying Candidate Clusters
  • Searching Maximum Likelihood Partition
  • Use Simulated Annealing to Maximize P(C|D)
  • P(D|C) is Raised to a Power β (β → ∞ Ideally)
  • Testing Significance of ML Clusters
  • By Sampling P(C|D)
  • Measures
  • Probability p(k) that k members co-cluster
  • Mean Size of Cluster
  • Minimum Length of Interval
  • Clusters with Significant

ML Partitions Obtained by Annealing
The number of co-clustering members of an ML
cluster is the maximum number of mini-WMs from
the ML cluster that co-occur in a single cluster
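The co-clustering count defined above can be tallied over sampled partitions to estimate p(k). In this sketch the function name and the four hand-written sample partitions are illustrative; real samples would come from the Monte Carlo walk.

```python
from collections import Counter

def co_clustering_distribution(ml_cluster, sampled_partitions):
    """Estimate p(k): the fraction of samples in which k ML-cluster members co-occur."""
    ml = set(ml_cluster)
    tally = Counter()
    for partition in sampled_partitions:
        # k is the largest overlap between the ML cluster and any sampled cluster.
        k = max(len(ml & set(block)) for block in partition)
        tally[k] += 1
    total = len(sampled_partitions)
    return {k: count / total for k, count in sorted(tally.items())}

ml_cluster = [0, 1, 2]
sampled = [
    [[0, 1, 2], [3]],
    [[0, 1], [2, 3]],
    [[0, 1, 2, 3]],
    [[0], [1], [2], [3]],
]
p_k = co_clustering_distribution(ml_cluster, sampled)
print(p_k)  # {1: 0.25, 2: 0.25, 3: 0.5}
```

A cluster whose p(k) concentrates near its full size is stable, while one whose mass sits at small k is drifting.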
16
Modifications for Large Data Sets
  • Several Monte Carlo Random Walks
  • Measure the Probability that Each Pair of Mini-WMs
    Co-clusters
  • Construct Graph
  • Nodes correspond to Mini-WMs
  • Edges between mini-WMs i and j exist if and only
    if their co-clustering probability exceeds 1/2
  • Candidate Clusters = Connected Components of
    Graph
  • Calculate Probabilistic Cluster Membership
  • Calculate p(k)
  • Cluster significance is derived from p(k)
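The pair-statistics procedure for large data sets can be sketched as follows: estimate co-clustering probabilities from sampled label vectors, link pairs above the threshold, and read off connected components with a small union-find. The function names and the three sample label vectors are illustrative assumptions.

```python
from itertools import combinations

def pair_probabilities(sampled_labels):
    """Fraction of samples in which each pair of items shares a cluster label."""
    n = len(sampled_labels[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in sampled_labels:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(sampled_labels) for pair, c in counts.items()}

def candidate_clusters(pair_probs, n, threshold=0.5):
    """Connected components of the graph with an edge wherever p_ij > threshold."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), p in pair_probs.items():
        if p > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())

sampled_labels = [(0, 0, 1, 1), (0, 0, 0, 1), (0, 1, 2, 2)]
probs = pair_probabilities(sampled_labels)
print(candidate_clusters(probs, n=4))  # [[0, 1], [2, 3]]
```

Pair statistics need far fewer samples to stabilize than statistics on whole partitions, which is presumably why they suit large data sets.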

17
Data Sets Used
  • Uses Different Prokaryotic Genomes
  • Short Sequences: 15 to 25 Bases
  • Use Genomes of E. coli, Actinobacillus
    actinomycetemcomitans, Haemophilus influenzae,
    Pseudomonas aeruginosa, Shewanella putrefaciens,
    Salmonella typhimurium, Thiobacillus
    ferrooxidans, V. cholerae, Y. pestis, Klebsiella
    pneumoniae

18
Operations
  • Extend
  • Expand Alignments to Length 32
  • Pad Bases From Genomes
  • Replace Sequences of Closely Related Species by
    their Consensus
  • Processing on E. coli
  • Align all known E. coli sites for each TF into
    its own mini-WM
  • Add 56 Mini-WMs to this Set
  • Create an Additional Test Set Consisting of
    397 Known Binding Sites and the E. coli sequences
    of the top 2,000 unannotated mini-WMs
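The "Extend" operation (expanding each site to length 32 by padding with bases from the genome) might look like the sketch below. Padding symmetrically and shifting the window at the genome ends are assumptions; the presentation does not say how the flanks were chosen, and `extend_site` is a hypothetical helper.

```python
def extend_site(genome, start, end, target_length=32):
    """Extend genome[start:end] to target_length using flanking genomic bases."""
    length = end - start
    if length > target_length:
        raise ValueError("site is already longer than the target length")
    extra = target_length - length
    left = extra // 2                 # assumed: pad both sides about equally
    new_start = start - left
    new_end = end + (extra - left)
    # Shift the window if it would run off either end of the genome.
    if new_start < 0:
        new_end -= new_start
        new_start = 0
    if new_end > len(genome):
        new_start -= new_end - len(genome)
        new_end = len(genome)
    if new_start < 0:
        raise ValueError("genome is shorter than the target length")
    return genome[new_start:new_end]

genome = "ACGT" * 20                  # made-up 80-base toy genome
padded = extend_site(genome, 30, 50)  # a 20-base site
print(len(padded))                    # 32
```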

19
Results
  • Test Set of 397 Binding Sites
  • Sample P(C|D) and Measure How Well the Sites
    Cluster
  • Significant Clusters Obtained for 24 of 53 TFs
  • Twenty-two TFs have three or fewer sites in the
    test set, and with the exception of trpR their
    sites did not cluster significantly
  • Comparison
  • Inferred Results (Annealing) vs. Results From Site
    Annotation
  • Good Agreement Between Both Results
  • Likelihood for Partition Obtained By Annealing is
    Higher than that by Site Annotation
  • Improvements of Proposed Work
  • Recovers almost half of all regulons for which
    binding sites are known and the large majority of
    regulons for which there are more than three
    sites known

20
Results for Large Datasets
  • Annealed state and significance statistics not
    converged fully within running times
  • $10^{10}$ steps, taking a week on a workstation per
    run
  • Solution
  • Converge Using Pair Statistics
  • Example
  • Around 365 clusters on average
  • Connectivity graph gave 274 components containing
    1,139 out of 2,056 mini-WMs
  • 1% of Space Filled By Top 45 Clusters
  • Top 80 Clusters Fill 10% of Space
  • Next 115 Clusters Fill 39% of Space

21
Conclusions and Discussion
  • New Inference Procedure for Probabilistically
    Partitioning a Set of DNA Sequences into Clusters
  • Assumption: All WMs are of Fixed Length
  • Predictions from applying the algorithm to data
    sets of putative regulatory sites extracted from
    enteric bacteria
  • 100 new regulons in E. coli, containing 500
    binding sites, and 50 binding sites for known TFs
  • In addition to TF binding sites, RNA stems
    controlling translation and termination motifs
    are present in the Predicted Regulons