# Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Slides: 30
Provided by: csU70
Transcript and Presenter's Notes

1
Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics
• Erik van Nimwegen et al.
• Presented by Lyndsy Kron

2
Goal
• To derive a unique probability distribution for
assignments of binding sites into clusters to
identify regulons
• Based on sequence similarity
• Partitioned so that each cluster corresponds to the sites targeted by the same TF

3
PROCSE Algorithm
• Uses Monte Carlo sampling of this distribution to
partition and align thousands of short DNA
sequences into clusters
• Determines number of clusters
• Assigns significance to the resulting clusters
• The WMs are unknown, which is the limiting factor

4
WM Unknown
• A set of sites sampled with unknown WMs is
clusterable if it is possible to infer which
sites were sampled from the same WM
• If WMs are known, this task is trivial

5
Problem
• Input
• A set D of short DNA sequences
• Output
• Most probable clustering C of input sequences

6
Assumptions
• Sequences in a cluster come from the same motif
• Use weight matrix (WM) model for motifs
• Consider only evolutionary conserved non-coding
regions of orthologous genes
• Consider bacterial genomes

7
Model
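• $w^i_\alpha$: WM prob. of finding base $\alpha$ at location i
• Information score $I = \sum_{i} \sum_{\alpha} w^i_\alpha \log (w^i_\alpha / b_\alpha)$ scores the quality of an alignment of putative binding sites
• $b_\alpha$ is the background frequency of base $\alpha$
• The $w^i_\alpha$ are the WM probs. estimated from the aligned sequences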
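
As a rough illustration of the information score on this slide, the sketch below estimates WM probabilities from a set of aligned, equal-length sites and sums the per-column relative entropy against a background. The function names, the pseudocount, and the uniform default background are assumptions made for the example, not details taken from the slides.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Column-wise base probabilities w[i][base] from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    wm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)  # missing bases count as 0
        wm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wm

def information_score(wm, background=None):
    """I = sum_i sum_a w[i][a] * log2(w[i][a] / b[a]), reported in bits."""
    if background is None:
        # Assumption: uniform background; real genomes usually are not uniform.
        background = {b: 0.25 for b in BASES}
    return sum(
        w * math.log2(w / background[b])
        for column in wm
        for b, w in column.items()
        if w > 0.0  # a zero-probability base contributes nothing
    )
```

A perfectly conserved column scores 2 bits against a uniform background, and a column matching the background scores 0.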

8
Model
• Need to cluster a set of binding sites of an
unknown number of TFs
• Consider all ways to partition the sites into clusters and assign a prob. to each; the prob. of a partition is the product of the probs. of its clusters

9
Model
• Prob. that a set S of n sequences of length l was drawn from a given WM w: $P(S|w) = \prod_{s \in S} \prod_{i=1}^{l} w^i_{s_i}$

10
Model
• Prob. P(S) that all sequences in S came from the same (unknown) WM w, obtained by integrating P(S|w) over all possible w
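
The integral over WMs has a closed form if one assumes a uniform prior over the base probabilities in each column. That prior is an assumption of this sketch, since the transcript does not state which prior is used. Under it, each column contributes 3!/(n+3)! times the product of the factorials of its base counts. The function name is hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def log_marginal_likelihood(sites):
    """log P(S) for aligned, equal-length sites, with the WM integrated out.

    Assumes a uniform (flat Dirichlet) prior on each WM column, giving
    P(S) = prod_i [ 3! / (n + 3)! * prod_a n_a^i! ].
    """
    n = len(sites)
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    log_p = 0.0
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        # lgamma(k + 1) == log(k!), so lgamma(4) is log(3!)
        log_p += math.lgamma(4) - math.lgamma(n + 4)
        log_p += sum(math.lgamma(counts[b] + 1) for b in BASES)
    return log_p
```

For two single-base sites, `["A", "A"]` gives P(S) = 0.1 while `["A", "C"]` gives 0.05, so similar sequences are more likely to share a WM than dissimilar ones.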

11
Model
• From this we can define, for any partition C of a data set of sequences D into clusters, the likelihood P(D|C) that all sequences in each cluster were drawn from the same WM
• P(D|C) is built from the P(S) on the previous slide, one factor per cluster

12
Model
• Posterior prob. P(C|D) for partition C given the data D
• Allows calculation of any statistic of interest by summing over the appropriate partitions C
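
For a handful of sequences, the posterior over partitions can be computed exactly by enumerating every partition, which makes "summing over the appropriate partitions" concrete. The sketch below assumes a uniform prior over partitions and a uniform prior over WM columns; both are assumptions of the example, and all function names are hypothetical. Enumeration is only feasible for very small inputs, because the number of partitions grows extremely fast.

```python
import math
from collections import Counter

BASES = "ACGT"

def log_p_cluster(sites):
    """log P(S) with the WM integrated out under a uniform column prior (assumed)."""
    n, log_p = len(sites), 0.0
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        log_p += math.lgamma(4) - math.lgamma(n + 4)
        log_p += sum(math.lgamma(counts[b] + 1) for b in BASES)
    return log_p

def set_partitions(items):
    """Yield every partition of a list as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

def partition_posterior(sites):
    """Return [(partition, P(C|D))], assuming a uniform prior over partitions."""
    partitions = list(set_partitions(list(range(len(sites)))))
    log_liks = [
        sum(log_p_cluster([sites[i] for i in block]) for block in p)
        for p in partitions
    ]
    top = max(log_liks)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_liks]
    z = sum(weights)
    return [(p, w / z) for p, w in zip(partitions, weights)]

def cocluster_probability(posterior, i, j):
    """Sum P(C|D) over all partitions in which items i and j share a cluster."""
    return sum(
        prob
        for partition, prob in posterior
        if any(i in block and j in block for block in partition)
    )
```

With `sites = ["AAAA", "AAAA", "CCCC"]`, the five possible partitions are scored, and the two identical sites cocluster with higher probability than either does with the third.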

13
Classifiability vs. Clusterability
• Classification
• Associating TFs with WMs
• P(s|w): prob. that w binds to s
• Implies that for a sample s from w, we have P(s|w) > P(s|w') for all other TFs w'

14
• Clusterability
• Assume we are clustering nG sequences obtained by sampling n times from each of G different WMs
• For each WM w, can calculate the prob. that m of its n samples cocluster by summing P(C|D) over all partitions in which m samples of w occur together
• Clusterable if, for more than ½ of the G WMs, the avg. of m > n/2

15
(No Transcript)

16
Monte Carlo Implementation
• Monte Carlo random walk to sample the distribution P(C|D)
• At each step
• Choose a mini-WM at random
• Consider reassigning it to a randomly chosen cluster
• The move is evaluated using the Metropolis-Hastings scheme

17
Metropolis-Hastings Scheme
• Moves that increase the prob. P(C|D) are always accepted
• Moves that lower P(C|D) are accepted with prob. P(C'|D)/P(C|D)
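
The random walk and acceptance rule can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it clusters individual sites rather than mini-WMs, assumes uniform priors over partitions and WM columns, and uses a proposal chosen for this example (move one item to another existing cluster, or to a new singleton cluster if it is not already alone). That proposal is symmetric, so the plain ratio P(C'|D)/P(C|D) is the correct acceptance probability. All names are hypothetical.

```python
import math
import random
from collections import Counter

BASES = "ACGT"

def log_p_cluster(sites):
    """log P(S) with the WM integrated out under a uniform column prior (assumed)."""
    n, log_p = len(sites), 0.0
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        log_p += math.lgamma(4) - math.lgamma(n + 4)
        log_p += sum(math.lgamma(counts[b] + 1) for b in BASES)
    return log_p

def log_p_partition(clusters, sites):
    """log P(D|C): sum of per-cluster log-probabilities."""
    return sum(log_p_cluster([sites[i] for i in c]) for c in clusters)

def metropolis_step(clusters, sites, rng):
    """One proposed reassignment; returns the new (or unchanged) list of sets."""
    item = rng.randrange(len(sites))
    src = next(k for k, c in enumerate(clusters) if item in c)
    targets = [k for k in range(len(clusters)) if k != src]
    if len(clusters[src]) > 1:
        targets.append(None)  # None means: open a new singleton cluster
    if not targets:
        return clusters  # single item in a single cluster: nothing to propose
    dst = rng.choice(targets)
    proposal = [set(c) for c in clusters]
    proposal[src].remove(item)
    if dst is None:
        proposal.append({item})
    else:
        proposal[dst].add(item)
    proposal = [c for c in proposal if c]  # drop a cluster that became empty
    delta = log_p_partition(proposal, sites) - log_p_partition(clusters, sites)
    # Uphill moves are always accepted; downhill moves with prob. exp(delta).
    if delta >= 0 or rng.random() < math.exp(delta):
        return proposal
    return clusters

def sample_partitions(sites, n_steps, seed=0):
    """Run the walk from all-singletons and record the partition after each step."""
    rng = random.Random(seed)
    clusters = [{i} for i in range(len(sites))]
    samples = []
    for _ in range(n_steps):
        clusters = metropolis_step(clusters, sites, rng)
        samples.append(frozenset(frozenset(c) for c in clusters))
    return samples
```

Recomputing the full partition likelihood at every step keeps the sketch short; a practical version would update only the two affected clusters.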

18
Result of Monte Carlo
• Generates dynamic clusters whose membership fluctuates over time
• Clusters can disappear altogether
• New clusters can appear when a pair of mini-WMs
is moved together
• Find significant clusters by finding sets of
mini-WMs that are persistent

19
Solutions to Lack of Persistence
• Search for the ML partition that maximizes P(C|D) through simulated annealing
• Raise P(C|D) to the power β, increasing β over time
• Provides candidate clusters
• Significance of the ML clusters is tested by sampling P(C|D)
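
Raising the target distribution to the power β changes only the acceptance probability of downhill moves: a move that lowers the log-probability by some amount is accepted with the usual ratio raised to the power β. The sketch below shows that rule together with a simple linear schedule. The schedule shape and its start and end values are assumptions for illustration; the slides only say that β increases over time.

```python
import math

def annealed_acceptance(delta_log_p, beta):
    """Acceptance prob. when the target is P(C|D)**beta.

    delta_log_p is log P(C'|D) - log P(C|D) for the proposed move.
    """
    if delta_log_p >= 0:
        return 1.0  # uphill moves are always accepted
    return math.exp(beta * delta_log_p)

def linear_beta_schedule(n_steps, beta_start=1.0, beta_end=10.0):
    """Evenly spaced beta values from beta_start to beta_end (assumed schedule)."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if n_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + k * step for k in range(n_steps)]
```

At β = 1 this reduces to ordinary Metropolis sampling; as β grows, downhill moves become rarer and the walk settles into a high-probability partition.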

20
• Complications
• Computationally prohibitive for large data sets

21
Solutions to Lack of Persistence
• Second Approach
• Use several Monte Carlo random walks
• Measure the prob. that each pair of mini-WMs coclusters
• Construct a graph: nodes correspond to mini-WMs, and an edge between mini-WMs i and j exists if their coclustering prob. is > ½

22
Second Approach Cont.
• Candidate clusters are now given by connected
components of graph
• Pairwise stats. are processed to obtain probabilistic cluster membership
• Yields probabilities that mini-WM i belongs to cluster j
• Also calculate, for each cluster, the prob. distribution p(k) that k of its members cocluster
• Cluster significance judged from p(k)
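
The graph construction and the connected-components step can be sketched as below. Sampled partitions are assumed to be given as lists of sets of item indices; that representation and the function names are choices made for this example. The later processing into membership probabilities and p(k) is not shown.

```python
from itertools import combinations

def cocluster_frequencies(samples, n_items):
    """Fraction of sampled partitions in which each pair of items shares a cluster."""
    counts = {pair: 0 for pair in combinations(range(n_items), 2)}
    n_samples = 0
    for partition in samples:
        n_samples += 1
        for block in partition:
            for pair in combinations(sorted(block), 2):
                counts[pair] += 1
    if n_samples == 0:
        raise ValueError("need at least one sampled partition")
    return {pair: c / n_samples for pair, c in counts.items()}

def candidate_clusters(freqs, n_items, threshold=0.5):
    """Connected components of the graph with an edge where frequency > threshold."""
    adjacency = {i: set() for i in range(n_items)}
    for (i, j), p in freqs.items():
        if p > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n_items):
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components
```

For example, with samples `[[{0, 1}, {2}, {3}], [{0, 1, 2}, {3}], [{0, 1}, {2, 3}]]`, only the pair (0, 1) coclusters more than half the time, so the candidate clusters are `{0, 1}`, `{2}`, and `{3}`.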

23
Finally
• Once clusters are inferred, a WM can be estimated
for each cluster
• Then search for additional matching motifs to the
cluster WMs in all regulatory regions
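
A minimal sketch of this last step: estimate a WM from the sites in one cluster, then slide it along a regulatory sequence and report windows whose log-odds score against the background exceeds a cutoff. The pseudocount, uniform background, and threshold are illustrative assumptions, and the function names are hypothetical. The sketch expects sequences containing only A, C, G, and T.

```python
import math
from collections import Counter

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Column-wise base probabilities from the aligned sites of one cluster."""
    total = len(sites) + pseudocount * len(BASES)
    wm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        wm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return wm

def scan(sequence, wm, background=None, threshold=0.0):
    """Return (start, score) for every window scoring above `threshold`.

    The score is the log2-odds of the window under the WM versus the background.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    width = len(wm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            math.log2(wm[i][base] / background[base])
            for i, base in enumerate(window)
        )
        if score > threshold:
            hits.append((start, score))
    return hits
```

Usage might look like `scan("GGGGTTGACAGGGG", estimate_wm(["TTGACA", "TTGACA", "TTGACT"]), threshold=5.0)`, which reports the single window starting at position 4.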

24
Data set
• 15-25 bp sequences

25
Ex. Alignment

26
Thank you!

27
Results
• Found that the likelihood P(C|D) for the partition obtained in annealing runs is higher than that obtained when the sites are partitioned by annotation
• Algorithm recovers almost ½ of all regulons for
which binding sites are known and the large
majority of regulons for which there are more
than 3 sites known
• Most E. coli binding sites are in the
unclusterable regime

28
(No Transcript)

29
Discussion
• Algorithm assumes all WMs are of a fixed length, so prior information about site lengths and their dimeric nature still needs to be incorporated
• Could also extend the hypothesis by assuming that only some fraction, rather than all, of the sequences are WM samples, while the others come from the background model