Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Slides: 30
Provided by: csU70
Learn more at: http://www.cs.uiuc.edu
Transcript and Presenter's Notes


1
Probabilistic clustering of sequences: Inferring
new bacterial regulons by comparative genomics
  • Erik van Nimwegen et al.
  • Presented by Lyndsy Kron

2
Goal
  • To derive a unique probability distribution over
    assignments of binding sites to clusters, in
    order to identify regulons
  • Clustering is based on sequence similarity
  • Sites are partitioned so that each cluster
    contains the sites targeted by the same
    transcription factor (TF)

3
PROCSE Algorithm
  • Uses Monte Carlo sampling of this distribution to
    partition and align thousands of short DNA
    sequences into clusters
  • Determines the number of clusters
  • Assigns significance to the resulting clusters
  • Limiting factor: the weight matrices (WMs) are
    unknown

4
WM Unknown
  • A set of sites sampled with unknown WMs is
    clusterable if it is possible to infer which
    sites were sampled from the same WM
  • If WMs are known, this task is trivial

5
Problem
  • Input: a set D of short DNA sequences
  • Output: the most probable clustering C of the
    input sequences

6
Assumptions
  • Sequences in a cluster come from the same motif
  • Use weight matrix (WM) model for motifs
  • Consider only evolutionarily conserved non-coding
    regions of orthologous genes
  • Consider bacterial genomes

7
Model
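  • WM: $w^i_\alpha$ is the prob. of finding base
    $\alpha$ at position i
  • Information score I scores the quality of an
    alignment of putative binding sites:
    $I = \sum_{i=1}^{l} \sum_{\alpha} w^i_\alpha \log(w^i_\alpha / b_\alpha)$
  • $b_\alpha$ is the background frequency of base
    $\alpha$
  • The $w^i_\alpha$ are the WM probs. estimated from
    the sequences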
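
As a rough illustration of the information score, the sketch below estimates the WM column frequencies from a gapless alignment and sums w log(w / b) over positions and bases. The function name, the uniform default background, the optional pseudocount, and the choice of log base 2 are assumptions made for this example and are not taken from the slides.

```python
import math
from collections import Counter


def information_score(sites, background=None, pseudocount=0.0):
    """Information score of an alignment of equal-length sites, in bits.

    Hypothetical helper: w[i][a] is estimated as the (optionally smoothed)
    frequency of base a in column i, and b[a] is the background frequency.
    """
    bases = "ACGT"
    if background is None:
        background = {a: 0.25 for a in bases}  # assumed uniform background
    n = len(sites)
    score = 0.0
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        for a in bases:
            w = (counts[a] + pseudocount) / (n + 4 * pseudocount)
            if w > 0:  # the term w * log(w) tends to 0 as w tends to 0
                score += w * math.log2(w / background[a])
    return score
```

A perfectly conserved column contributes 2 bits under a uniform background, and a column that matches the background contributes nothing.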

8
Model
  • Need to cluster a set of binding sites of an
    unknown number of TFs
  • Consider all ways to partition the sites into
    clusters and assign a prob. to each partition
  • The prob. of a partition is the product of the
    probs. of its clusters

9
Model
  • To calculate the prob. P(S|w) that a set S of n
    sequences of length l was drawn from a given WM w
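
The formula on this slide did not survive in the transcript. Under the usual WM assumption that positions are independent (an assumption here, not something stated on the slide), the probability would take the following form, where $s_i$ is the base at position i of sequence s and $n^i_\alpha$ counts how often base $\alpha$ occurs at position i across S:

```latex
P(S \mid w) = \prod_{s \in S} \prod_{i=1}^{l} w^i_{s_i}
            = \prod_{i=1}^{l} \prod_{\alpha} \left( w^i_\alpha \right)^{n^i_\alpha}
```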

10
Model
  • To calculate the prob. P(S) that all sequences in
    S came from the same, unspecified WM w
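
The slide's formula is missing from the transcript. One way to obtain P(S) is to integrate the WM out under a uniform prior on each column, which gives, per column, 3!/(n+3)! times the product of the factorials of the base counts. The uniform prior and the function name below are assumptions for this sketch; the computation is done in log space to avoid underflow.

```python
from collections import Counter
from math import lgamma


def log_p_cluster(sites):
    """log P(S) for equal-length sites over the alphabet ACGT.

    Assumes a uniform prior over each WM column, so that
    P(S) = prod_i [ 3! / (n + 3)! ] * prod_a (n_i_a)!
    where n_i_a is the count of base a in column i.
    """
    n = len(sites)
    total = 0.0
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total += lgamma(4) - lgamma(n + 4)  # log(3!) - log((n + 3)!)
        total += sum(lgamma(counts[a] + 1) for a in "ACGT")  # log(n_i_a!)
    return total
```

For a single column, two identical bases give P(S) = 0.1 and two different bases give 0.05, so agreement between sites raises the probability that they share a WM.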

11
Model
  • From this we can define, for any partition C of a
    data set of sequences D into clusters, the
    likelihood P(D|C) that all sequences in each
    cluster were drawn from the same WM
  • P(D|C) is the product, over the clusters of C, of
    P(S) from the previous slide

12
Model
  • Posterior prob. P(C|D) for partition C given the
    data D
  • Allows calculation of any statistic of interest
    by summing over the appropriate partitions C
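
For a handful of sequences the posterior can be computed exactly by enumerating every partition, using Bayes' rule P(C|D) = P(D|C) π(C) / Σ_C' P(D|C') π(C'). The sketch below assumes a uniform prior π(C) over partitions and a uniform prior over WM columns; all function names are hypothetical. The coclustering probability of two sites is one example of a statistic obtained by summing over partitions.

```python
from collections import Counter
from math import exp, lgamma


def log_p_cluster(sites):
    """log P(S), assuming a uniform prior over each WM column."""
    n = len(sites)
    total = 0.0
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total += lgamma(4) - lgamma(n + 4)
        total += sum(lgamma(counts[a] + 1) for a in "ACGT")
    return total


def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Add `first` to each existing block in turn ...
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        # ... or give it a block of its own.
        yield [[first]] + part


def posterior(sites):
    """Return [(partition, P(C|D))], assuming a uniform prior over partitions."""
    parts = list(partitions(list(range(len(sites)))))
    logs = [sum(log_p_cluster([sites[j] for j in block]) for block in part)
            for part in parts]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [exp(x - top) for x in logs]
    norm = sum(weights)
    return [(part, w / norm) for part, w in zip(parts, weights)]


def cocluster_prob(post, i, j):
    """Sum P(C|D) over all partitions that put sites i and j together."""
    return sum(p for part, p in post
               if any(i in block and j in block for block in part))
```

The number of partitions grows as the Bell numbers, so exhaustive enumeration is only feasible for very small data sets; this is why sampling is needed in practice.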

13
Classifiability vs. Clusterability
  • Classification
  • Associating TFs with WMs
  • P(s|w): prob. that TF w binds to sequence s
  • Implies that for a sample s from w, we have
    P(s|w) > P(s|w') for all other TFs w'

14
  • Clusterability
  • Assume we are clustering nG sequences obtained by
    sampling n times from each of G different WMs
  • For each WM w, can calculate the prob. that m of
    its n samples cocluster by summing P(C|D) over
    all partitions in which m samples of w occur
    together
  • Clusterable if, for more than ½ of the G WMs, the
    avg. of m > n/2

16
Monte Carlo Implementation
  • Monte Carlo random walk to sample the
    distribution P(C|D)
  • At each step
  • Choose a mini-WM at random
  • Consider reassigning it to a randomly chosen
    cluster
  • Accept or reject the move using a
    Metropolis-Hastings scheme

17
Metropolis-Hastings Scheme
  • Moves that increase prob. P(C|D) are always
    accepted
  • Moves from C to C' that lower P(C|D) are accepted
    with prob. P(C'|D)/P(C|D)
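
The random walk and its acceptance rule can be sketched as follows. This is a simplified illustration and not the PROCSE implementation: each item is a single sequence rather than a mini-WM, the proposal draws a label uniformly from 0..n-1 (so it is symmetric and the rule reduces to the plain Metropolis ratio), and both the WM prior and the prior over partitions are assumed uniform. All names are hypothetical.

```python
import random
from collections import Counter
from math import exp, lgamma


def log_p_cluster(sites):
    """log P(S), assuming a uniform prior over each WM column."""
    n = len(sites)
    total = 0.0
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        total += lgamma(4) - lgamma(n + 4)
        total += sum(lgamma(counts[a] + 1) for a in "ACGT")
    return total


def log_weight(sites, labels):
    """log of the unnormalized P(C|D) for the partition encoded by labels."""
    clusters = {}
    for site, label in zip(sites, labels):
        clusters.setdefault(label, []).append(site)
    n, k = len(sites), len(clusters)
    # A partition with k clusters is encoded by n!/(n-k)! labelings; the
    # (n-k)! factor cancels that, leaving a uniform prior over partitions.
    return sum(log_p_cluster(c) for c in clusters.values()) + lgamma(n - k + 1)


def accept_move(log_current, log_proposed, rng):
    """Always accept uphill moves; accept downhill ones with prob. P(C'|D)/P(C|D)."""
    if log_proposed >= log_current:
        return True
    return rng.random() < exp(log_proposed - log_current)


def random_walk(sites, n_steps, seed=0):
    """Yield one labeling per step; labels[i] is the cluster of site i."""
    rng = random.Random(seed)
    n = len(sites)
    labels = list(range(n))  # start with every site in its own cluster
    current = log_weight(sites, labels)
    for _ in range(n_steps):
        i = rng.randrange(n)            # choose an item at random
        proposal = labels[:]
        proposal[i] = rng.randrange(n)  # an unused label opens a new cluster
        proposed = log_weight(sites, proposal)
        if accept_move(current, proposed, rng):
            labels, current = proposal, proposed
        yield labels[:]
```

The sketch recomputes the full weight at every step for clarity; a practical sampler would update only the two clusters affected by the move.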

18
Result of Monte Carlo
  • Generates dynamic clusters whose membership
    fluctuates over time
  • Clusters can disappear altogether
  • New clusters can appear when a pair of mini-WMs
    is moved together
  • Find significant clusters by finding sets of
    mini-WMs that are persistent

19
Solutions to Lack of Persistence
  • Search for the ML partition, which maximizes
    P(C|D), through simulated annealing
  • Raise P(C|D) to the power β, increasing β over
    time
  • Provides candidate clusters
  • Significance of the ML clusters is tested by
    sampling P(C|D)
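
Raising P(C|D) to the power β changes the acceptance probability of a downhill move to (P(C'|D)/P(C|D))^β, so the walk becomes greedier as β grows. A minimal sketch of that rule follows; the linear schedule, its default endpoints, and the function names are assumptions, since the slides do not say how β is increased.

```python
import random
from math import exp


def anneal_accept(log_current, log_proposed, beta, rng):
    """Metropolis rule for the tempered distribution P(C|D)**beta."""
    delta = log_proposed - log_current
    if delta >= 0:
        return True  # uphill moves are always accepted
    return rng.random() < exp(beta * delta)


def beta_schedule(n_steps, beta_start=1.0, beta_end=10.0):
    """Hypothetical linear ramp of beta from beta_start to beta_end."""
    for t in range(n_steps):
        yield beta_start + (beta_end - beta_start) * t / max(1, n_steps - 1)
```

At β = 1 this is ordinary sampling of P(C|D); for large β almost no downhill moves are accepted and the walk settles into a local maximum, which is why the annealing result still needs its significance tested by sampling.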

20
  • Complications
  • Computationally prohibitive for large data sets

21
Solutions to Lack of Persistence
  • Second Approach
  • Use several Monte Carlo random walks
  • Measure the prob. that each pair of mini-WMs
    coclusters
  • Construct a graph: nodes correspond to mini-WMs,
    and an edge between mini-WMs i and j exists if
    their coclustering prob. is > ½

22
Second Approach Cont.
  • Candidate clusters are now given by connected
    components of graph
  • Pairwise stats. are processed to obtain
    probabilistic cluster membership
  • Yields probabilities that mini-WM i belongs to
    cluster j
  • Also calculate, for each cluster, the prob.
    distribution p(k) that k of its members
    cocluster
  • Cluster significance is judged from p(k)
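
The graph construction in the second approach can be sketched as below: estimate pairwise coclustering frequencies from sampled labelings, keep edges with frequency above ½, and read candidate clusters off the connected components. The input format (one list of cluster labels per sample) and the function names are assumptions for this example; the later step that turns pairwise statistics into membership probabilities and p(k) is not covered.

```python
from itertools import combinations


def cocluster_frequencies(samples):
    """Fraction of samples in which each pair of items shares a cluster.

    samples: list of labelings, where labels[i] is the cluster of item i.
    """
    n = len(samples[0])
    freq = {}
    for i, j in combinations(range(n), 2):
        together = sum(1 for labels in samples if labels[i] == labels[j])
        freq[(i, j)] = together / len(samples)
    return freq


def candidate_clusters(freq, n, threshold=0.5):
    """Connected components of the graph with edges where freq > threshold."""
    adjacency = {i: set() for i in range(n)}
    for (i, j), p in freq.items():
        if p > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components
```

For example, with three sampled labelings of five items:

```python
samples = [[0, 0, 1, 1, 2], [0, 0, 1, 2, 2], [0, 0, 1, 1, 3]]
freq = cocluster_frequencies(samples)
print(candidate_clusters(freq, 5))  # [[0, 1], [2, 3], [4]]
```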

23
Finally
  • Once clusters are inferred, a WM can be estimated
    for each cluster
  • Then search all regulatory regions for additional
    sites that match the cluster WMs
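
A minimal sketch of this final step: estimate a WM from the sites of one cluster and slide it along a regulatory region, reporting windows whose log-odds score against the background passes a cutoff. The pseudocount of 1, the uniform background of 0.25, the score cutoff, and the function names are all assumptions for illustration; the slides do not specify how matches are scored.

```python
from collections import Counter
from math import log


def estimate_wm(sites, pseudocount=1.0):
    """Per-column base probabilities, smoothed with an assumed pseudocount."""
    n = len(sites)
    wm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        wm.append({a: (counts[a] + pseudocount) / (n + 4 * pseudocount)
                   for a in "ACGT"})
    return wm


def log_odds(window, wm, background=0.25):
    """Natural-log odds of the window under the WM versus the background."""
    return sum(log(column[base] / background)
               for base, column in zip(window, wm))


def scan(region, wm, min_score):
    """Return (start, window, score) for every window scoring >= min_score."""
    width = len(wm)
    hits = []
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        score = log_odds(window, wm)
        if score >= min_score:
            hits.append((start, window, score))
    return hits
```

The sketch only handles the bases A, C, G, and T on one strand; a real scan would also check the reverse complement and handle ambiguous bases.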

24
Data set
  • 15-25 bp sequences

25
Ex. Alignment
26
Thank you!
27
Results
  • Found that the likelihood P(C|D) for the partition
    obtained in annealing runs is higher than that
    obtained when the sites are partitioned by
    annotation
  • Algorithm recovers almost ½ of all regulons for
    which binding sites are known and the large
    majority of regulons for which there are more
    than 3 sites known
  • Most E. coli binding sites are in the
    unclusterable regime

29
Discussion
  • Algorithm assumes all WMs have the same fixed
    length, so prior information about WM lengths and
    their dimeric nature needs to be incorporated
  • Could also extend the hypothesis by assuming
    that only some fraction, rather than all, of the
    sequences are WM samples, while the others come
    from a background model