A Statistical Method for Finding Transcriptional Factor Binding Sites

Slides: 29
Provided by: christopher151

Transcript and Presenter's Notes


1
A Statistical Method for Finding Transcriptional
Factor Binding Sites
  • Authors: Saurabh Sinha and Martin Tompa
  • Presenter: Christopher Schlosberg
  • CS598ss

2
Regulation of Gene Expression
3
Difficulties of Motif Finding
  • Regulatory sequences don't follow the same
    orientation as the coding sequence or each other
  • Multiple binding sites might exist for each
    regulated gene
  • Large variation in the binding sites of a single
    factor; the variations are not well understood

4
Previous & Proposed Methods for Finding Motifs
  • Previous Methods
    • Find longer, general motifs
    • Use local search algorithms (Gibbs sampling,
      Expectation Maximization, greedy algorithms)
  • Proposed Method
    • TFBS is small enough to use enumerative methods
    • Enumerative statistical methods guarantee global
      optimality and are computationally affordable

5
Proposed Method Highlights
  • Allows variations in the binding site instances
    of a given transcription factor
  • Allows for motifs to include spacers
  • Allows for overlapping occurrences (in both
    orientations), which leads to complex
    dependencies
  • Statistical significance of a motif (s) is based
    on the frequencies of shorter (more frequent)
    oligonucleotides
  • Use of Markov chain to model background genomic
    distribution
  • Use of z-score to measure statistical
    significance
  • Allows for multiple binding sites

6
Characteristics of a Motif
  • Any single TFBS has significant variation
  • Many motifs have spacers from 1-11bp
  • Variation often occurs as a transition (e.g.
    purine → purine) rather than a transversion (e.g.
    pyrimidine → purine)
  • Less often, variation occurs between a pair of
    complementary bases
  • Indels are uncommon

7
Proposed Motif Definition
  • Motif will be a string over the alphabet
    Σ = {A, C, G, T, R, Y, S, W, N}
  • A, C, G, T (DNA bp), R (purine: A or G), Y
    (pyrimidine: C or T), S (strong: C or G), W
    (weak: A or T), N (spacer: any base)
  • TF database (SCPD) confirms this model of
    variation
  • Of 50 binding site consensus sequences, 31 are exact fits (62%)
  • Another 10 fit if slight variations allowed
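
The motif alphabet above can be made concrete with a small lookup table. The following is a minimal Python sketch, not code from the paper; the names `CLASSES`, `is_motif`, and `matches` and the example motif are hypothetical, chosen only for illustration.

```python
# Bases allowed by each symbol of the motif alphabet on this slide.
CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def is_motif(s):
    """True if s is a non-empty string over the motif alphabet."""
    return len(s) > 0 and all(ch in CLASSES for ch in s)

def matches(motif, site):
    """True if the DNA string `site` is an instance of `motif`."""
    return len(motif) == len(site) and all(
        base in CLASSES[sym] for sym, base in zip(motif, site)
    )

if __name__ == "__main__":
    # Hypothetical motif with a 3 bp spacer, for illustration only.
    print(matches("CGGNNNCCG", "CGGATACCG"))
```

A lookup table keeps the variation model explicit: changing which substitutions are allowed only means editing `CLASSES`.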

8
Measure of Statistical Significance
  • Given a set of coregulated S. cerevisiae genes,
    the input to the problem is the corresponding set
    of 800bp upstream sequences, each having its 3′
    end at the gene's translation start site.
  • Model must measure from input sequences
  • Absolute number of occurrences (N_s) of motif s
  • Background genomic distribution
  • X is a set of random DNA sequences in the same
    number and lengths of the input sequences
  • Generated by Markov chain of order m
  • Transition probabilities determined by (m+1)-mer
    frequencies in the full complement of 6000
    upstream sequences (each 800bp in length)
  • Background model chooses m = 3
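
As a rough illustration of the background model, the sketch below estimates order-m transition probabilities from (m+1)-mer counts and samples a random sequence from them. It is a simplified sketch, not the authors' implementation: the function names are hypothetical, it assumes m >= 1, and drawing the start context uniformly from the observed contexts is an assumption made here for brevity.

```python
import random
from collections import Counter, defaultdict

def fit_markov(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - m):
            counts[seq[i:i + m]][seq[i + m]] += 1
    return {
        ctx: {base: n / sum(nxt.values()) for base, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

def sample_sequence(model, length, m=3, rng=random):
    """Generate one random sequence of the given length (assumes m >= 1)."""
    contexts = sorted(model)
    seq = rng.choice(contexts)  # assumption: uniform choice of start context
    while len(seq) < length:
        probs = model.get(seq[-m:])
        if probs is None:
            # Context never seen with a successor: fall back to a random one.
            probs = model[rng.choice(contexts)]
        bases = sorted(probs)
        seq += rng.choices(bases, weights=[probs[b] for b in bases])[0]
    return seq[:length]
```

Calling `sample_sequence` once per input sequence, with matching lengths, gives one realization of the random set X described above.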

9
z-score
  • X_s: random variable for the number of
    occurrences of motif s in X
  • E(X_s): expectation; σ(X_s): standard deviation
  • z_s: number of standard deviations by which the
    observed value N_s exceeds the expectation,
    z_s = (N_s - E(X_s)) / σ(X_s)
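
The z-score definition on this slide translates directly into code. This is a minimal sketch with a hypothetical function name; the expectation and variance are taken as inputs here, since computing them is the subject of the later slides.

```python
import math

def z_score(n_obs, expected, variance):
    """Number of standard deviations by which n_obs exceeds `expected`."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (n_obs - expected) / math.sqrt(variance)

if __name__ == "__main__":
    # 12 observed occurrences against an expectation of 4 with variance 4.
    print(z_score(12, 4.0, 4.0))  # prints 4.0
```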

10
Implications
  • Possibility of overlap of a motif with itself (in
    either orientation)
  • Previous study of pattern autocorrelation
  • Generalized computation of SD, treating motif as
    a finite set of strings
  • Higher order Markov chains
  • Spacers handled at no extra computational cost
  • Handles motif in either orientation

11
Algorithm
  • Enumerates over each input sequence
  • Tabulates number Ns of occurrences of each motif
    in either direction
  • Compute expectation and SD for each motif such
    that N_s > 0
  • Calculate z-score
  • Rank motifs by z-score
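
The steps above can be sketched as follows. This is an illustrative outline rather than the authors' code: all names are hypothetical, the expectation and standard deviation are passed in as caller-supplied functions, and counting a motif that equals its own reverse complement only once per position is an assumption of this sketch.

```python
import re

REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}
# Complement of each motif symbol: R and Y swap, S, W and N map to themselves.
COMPLEMENT = str.maketrans("ACGTRYSWN", "TGCAYRSWN")

def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]

def to_regex(motif):
    # A lookahead matches without consuming, so overlapping hits are counted.
    return re.compile("(?=" + "".join(REGEX[c] for c in motif) + ")")

def count_occurrences(motif, sequences):
    """N_s: number of occurrences of the motif in either orientation."""
    patterns = [to_regex(motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # assumption: self-complementary motifs counted once
        patterns.append(to_regex(rc))
    return sum(
        sum(1 for _ in p.finditer(seq)) for p in patterns for seq in sequences
    )

def rank_motifs(candidates, sequences, expectation, std_dev):
    """Score every candidate with N_s > 0 and sort by decreasing z-score."""
    scored = []
    for motif in candidates:
        n = count_occurrences(motif, sequences)
        if n > 0:
            z = (n - expectation(motif)) / std_dev(motif)
            scored.append((z, motif))
    return sorted(scored, reverse=True)
```

Skipping motifs with no occurrences keeps the costly variance computation off candidates that can never rank highly.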

12
Algorithm Analysis
  • For a single motif, complexity is O(c^2 k^2)
  • k = number of nonspacer characters in motif
  • c = number of instantiations of R, Y, S, W in motif
  • Only modest values of k
  • Linear dependence on genome size
  • Can trim variance calculation to optimize

13
Number of Occurrences
  • Convert motif s into a multiset W
  • Add reverse complements for each string in W
  • Motif s occurs at a position in X iff some
    string in W occurs at the same position
  • X_s = total number of occurrences (in X) of the
    members of W
  • Handling Palindromes
  • W_i: member of W
  • |W| = T
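
One way to picture the conversion of a motif into the multiset W is sketched below. The function names are hypothetical, and the handling of palindromes shown here (a string equal to its own reverse complement simply gets multiplicity 2) is an assumption of this sketch; the transcript does not preserve how the slide resolves that case.

```python
from collections import Counter
from itertools import product

# Instantiations of the ambiguous symbols; spacers N are kept as wildcards.
CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]

def motif_to_multiset(motif):
    """Instantiate R, Y, S, W, then add the reverse complement of each string."""
    choices = [sym if sym == "N" else CLASSES[sym] for sym in motif]
    forward = ["".join(p) for p in product(*choices)]
    return Counter(forward + [reverse_complement(w) for w in forward])

if __name__ == "__main__":
    # GC is its own reverse complement, so it appears twice in the multiset.
    print(motif_to_multiset("RC"))
```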

14
Number of Occurrences Cont

15
Expectation
  • Linearity of Expectation

16
Variance
→ B term
→ C term

17
C Term

→ A term
18
A Term

19
Overlapping Concatenation
  • CW (like W) is potentially a multiset
  • One-to-one correspondence

20
C Term Simplification

21
A Term Revisited

22
Si1Si2 Term Approximation
  • Kleffe and Borodovsky (1992) Approximation

23
B Term

24
B Term Cont

25
Summary
26
Higher Order Markov Models
  • Variance calculations remain the same except for
    Si1Si2 term
  • Experimental m = 3

27
Experimental Results & Future Considerations
  • 17 coregulated sets of genes
  • Known TF with known binding site consensus
  • In 9 experiments, known consensus was one of 3
    highest scoring motifs
  • Future Topics
  • Non-centered spacers
  • Enumeration Loop optimization
  • Filtering repeats

28
Question
  • E(X_s) is more straightforward to calculate than
    σ(X_s). Under the assumptions given in the paper,
    name one of the reasons for this complication.