A Statistical Method for Finding Transcriptional Factor Binding Sites

About This Presentation



Slides: 29

Provided by: christopher151

Learn more at: https://courses.grainger.illinois.edu


Transcript and Presenter's Notes

Title: A Statistical Method for Finding Transcriptional Factor Binding Sites

1
A Statistical Method for Finding Transcriptional
Factor Binding Sites

Authors: Saurabh Sinha and Martin Tompa
Presenter: Christopher Schlosberg
CS598ss

2
Regulation of Gene Expression
3
Difficulties of Motif Finding

Regulatory sequences don't follow the same orientation as the coding sequence or as each other
Multiple binding sites might exist for each regulated gene
The binding sites of a single factor vary widely, and the variations are not well understood

4
Previous & Proposed Methods for Finding Motifs

Previous Methods
Find longer, general motifs
Use local search algorithms (Gibbs sampling,
Expectation Maximization, greedy algorithms)
Proposed Method
A TFBS is short enough for enumerative methods to be practical
Enumerative statistical methods guarantee global optimality and are affordable

5
Proposed Method Highlights

Allows variations in the binding site instances
of a given transcription factor
Allows for motifs to include spacers
Allows for overlapping occurrences (in both orientations), which leads to complex dependencies
Statistical significance of a motif (s) is based
on the frequencies of shorter (more frequent)
oligonucleotides
Use of Markov chain to model background genomic
distribution
Use of z-score to measure statistical
significance
Allows for multiple binding sites

6
Characteristics of a Motif

Any single TFBS has significant variation
Many motifs have spacers from 1-11bp
Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)
Variation occurs less between a pair of
complementary bases.
Indels are uncommon

7
Proposed Motif Definition

A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base)
TF database (SCPD) confirms this model of variation
Of 50 binding site consensus sequences, 31 fit the model exactly (62%)
Another 10 fit if slight variations are allowed
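
The motif alphabet above can be illustrated with a short sketch. The function names `matches` and `instantiations` are hypothetical and not from the slides; the sketch assumes the standard meanings of R, Y, S, W and treats N as matching any base.

```python
from itertools import product

# Bases allowed by each symbol of the motif alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def matches(motif: str, site: str) -> bool:
    """True if `site` is an instance of `motif` (equal length, every base allowed)."""
    return len(motif) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, site)
    )

def instantiations(motif: str) -> list[str]:
    """Expand R, Y, S, W into concrete bases; spacers are left as N."""
    choices = [IUPAC[ch] if ch in "RYSW" else ch for ch in motif]
    return ["".join(p) for p in product(*choices)]

print(instantiations("ARNT"))    # the two strings obtained by expanding R
print(matches("ARNT", "AGCT"))   # G is a purine and N accepts C
```

A motif with j ambiguous nonspacer characters expands into 2^j strings, which is why the number of instantiations matters for the running time.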

8
Measure of Statistical Significance

Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3' end at the gene's translation start site.
The measure must take into account:
Absolute number of occurrences (Ns) of motif (s) in the input sequences
Background genomic distribution
X is a set of random DNA sequences, equal in number and lengths to the input sequences
Generated by a Markov chain of order m
Transition probabilities determined by (m+1)-mer frequencies in the full complement of 6000 upstream sequences (each 800 bp in length)
The background model uses m = 3
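
As a rough illustration of the background model, the sketch below estimates order-m transition probabilities from (m+1)-mer counts and draws a random sequence from the resulting chain. The names `fit_markov` and `sample_sequence`, the uniform choice of starting context, and the uniform fallback for unseen contexts are assumptions made for this example, not details taken from the slides.

```python
import random
from collections import Counter, defaultdict

def fit_markov(seqs, m=3):
    """Estimate order-m transition probabilities from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - m):
            counts[s[i:i + m]][s[i + m]] += 1  # context -> next base
    return {
        ctx: {base: n / sum(c.values()) for base, n in c.items()}
        for ctx, c in counts.items()
    }

def sample_sequence(trans, length, m=3, rng=random):
    """Generate one random sequence from the fitted chain."""
    # Assumption: start from a uniformly chosen observed context.
    seq = list(rng.choice(sorted(trans)))
    while len(seq) < length:
        probs = trans.get("".join(seq[-m:]))
        if probs is None:
            # Assumption: unseen context falls back to uniform bases.
            seq.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(probs.items()))
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])

trans = fit_markov(["ACGTACGTACGT", "ACGGACGTTACG"], m=3)
print(sample_sequence(trans, 30, m=3, rng=random.Random(0)))
```

In the setting described above, `seqs` would be the genome-wide collection of upstream sequences and one random sequence would be drawn per input sequence, with matching length.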

9
z-score

Xs: random variable, the number of occurrences of motif (s) in X
E(Xs): expectation; σ(Xs): standard deviation
zs: number of standard deviations by which the observed value Ns exceeds its expectation
zs = (Ns - E(Xs)) / σ(Xs)

10
Implications

Possibility of overlap of a motif with itself (in
either orientation)
Previous study of pattern autocorrelation
Generalized computation of SD, treating motif as
a finite set of strings
Higher order Markov chains
Spacers handled at no extra computational cost
Handles motif in either orientation

11
Algorithm

Enumerates over each input sequence
Tabulates number Ns of occurrences of each motif
in either direction
Compute expectation and SD for each motif such that Ns > 0
Calculate z-score
Rank motifs by z-score
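
The steps on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: it enumerates only concrete fixed-width words (no R, Y, S, W or spacers), it counts a palindromic word once per position (an assumption), and the `expectation` and `std_dev` callables are hypothetical placeholders for the formulas developed on later slides.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]

def tabulate(seqs, width):
    """Enumerate over the input and count each word in either orientation."""
    counts = Counter()
    for seq in seqs:
        for j in range(len(seq) - width + 1):
            word = seq[j:j + width]
            rc = reverse_complement(word)
            counts[word] += 1
            if rc != word:  # assumption: a palindrome counts once per position
                counts[rc] += 1
    return counts

def rank_by_z(counts, expectation, std_dev):
    """Score motifs with Ns > 0 and sort by decreasing z-score."""
    scored = []
    for motif, n in counts.items():
        sd = std_dev(motif)
        if n > 0 and sd > 0:
            scored.append((motif, (n - expectation(motif)) / sd))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

counts = tabulate(["ACGTTACGT", "TTACGTAA"], 4)
# Placeholder statistics, for demonstration only.
print(rank_by_z(counts, lambda m: 0.5, lambda m: 1.0)[:3])
```

Enumerating over the input rather than over all possible motifs means that only motifs with Ns > 0 are ever scored.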

12
Algorithm Analysis

For a single motif, complexity is O(c^2 k^2)
k: number of nonspacer characters in the motif
c: number of instantiations of R, Y, S, W in the motif
Only modest values of k
Linear dependence on genome size
Can trim variance calculation to optimize

13
Number of Occurrences

Convert motif s into a multiset W
Add reverse complements for each string in W
Motif s occurs at a position in X iff some string in W occurs at the same position
Xs = total number of occurrences (in X) of the members of W
Handling Palindromes
Wi: a member of W
|W| = T
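
A minimal sketch of the conversion described above, assuming W is built by expanding R, Y, S, W and then adding the reverse complement of every resulting string. The slide flags palindromes as a special case without saying how they are handled, so the sketch only surfaces them; `motif_to_multiset` and `palindromes` are hypothetical names.

```python
from collections import Counter
from itertools import product

AMBIGUOUS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")  # a spacer stays a spacer

def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]

def motif_to_multiset(motif: str) -> Counter:
    """Expand the motif, then add the reverse complement of each string."""
    choices = [AMBIGUOUS.get(ch, ch) for ch in motif]
    forward = ["".join(p) for p in product(*choices)]
    return Counter(forward + [reverse_complement(w) for w in forward])

def palindromes(multiset: Counter) -> list[str]:
    """Strings equal to their own reverse complement."""
    return sorted(w for w in multiset if reverse_complement(w) == w)

w = motif_to_multiset("ACGW")
print(w)               # multiplicities show which strings were added twice
print(palindromes(w))  # strings that need the special handling noted above
```

Using a `Counter` keeps the multiplicities visible, so whichever palindrome convention is adopted can be applied afterwards.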

14
Number of Occurrences Cont

15
Expectation

Linearity of Expectation
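
Linearity of expectation lets E(Xs) be computed as a sum, over every string in W and every position, of the probability that the string occurs there, with no need to consider how occurrences overlap. The sketch below is an illustration under stated assumptions: strings contain no spacers, the chain is stationary so the probability is the same at every position, and `start` (probabilities of m-mer contexts) and `trans` (transition probabilities) are hypothetical inputs.

```python
def word_probability(word, start, trans, m):
    """P(word) under an order-m Markov chain: start context times transitions."""
    p = start.get(word[:m], 0.0)
    for i in range(m, len(word)):
        p *= trans.get(word[i - m:i], {}).get(word[i], 0.0)
    return p

def expected_occurrences(words, seq_lengths, start, trans, m):
    """Sum of per-position occurrence probabilities over all strings in W."""
    total = 0.0
    for w in words:
        positions = sum(max(n - len(w) + 1, 0) for n in seq_lengths)
        total += positions * word_probability(w, start, trans, m)
    return total

# Toy order-1 model with uniform probabilities, for demonstration only.
start = {b: 0.25 for b in "ACGT"}
trans = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
print(expected_occurrences(["ACG", "CGT"], [66], start, trans, m=1))
```

The variance has no such shortcut, because it depends on the joint probabilities of pairs of occurrences.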

16
Variance
→ B term
→ C term

17
C Term

→ A term
18
A Term

19
Overlapping Concatenation

CW (like W) is potentially a multiset

One-to-one correspondence
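
The slide does not define the construction, so the following is only one plausible reading: for each ordered pair of equal-length strings in W and each shift at which the two strings agree on their overlap, the merged string is added to CW. The names `overlap_concatenations` and `build_cw` are hypothetical, and spacers are not handled.

```python
from collections import Counter

def overlap_concatenations(u: str, v: str) -> list[str]:
    """Merge v onto u at every shift d (0 < d < len(u)) where the overlap agrees."""
    merged = []
    for d in range(1, len(u)):
        overlap = len(u) - d
        if u[d:] == v[:overlap]:
            merged.append(u + v[overlap:])
    return merged

def build_cw(words) -> Counter:
    """Collect the merged strings over all ordered pairs, keeping multiplicities."""
    cw = Counter()
    for u in words:
        for v in words:
            cw.update(overlap_concatenations(u, v))
    return cw

print(overlap_concatenations("ACAC", "ACAC"))  # self-overlap at shift 2
print(build_cw(["ACGT", "GTCA"]))
```

Under this reading, each merged string of length len(u) + d identifies its pair and shift, so repeated entries in CW arise only from repeated entries in W.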

20
C Term Simplification

21
A Term Revisited

22
Si1Si2 Term Approximation

Kleffe and Borodovsky (1992) Approximation

23
B Term

24
B Term Cont

25
Summary
26
Higher Order Markov Models

Variance calculations remain the same except for
Si1Si2 term
Experiments use m = 3

27
Experimental Results & Future Considerations

17 coregulated sets of genes
Known TF with known binding site consensus
In 9 experiments, known consensus was one of 3
highest scoring motifs
Future Topics
Non-centered spacers
Enumeration Loop optimization
Filtering repeats

28
Question

E(Xs) is more straightforward to calculate than σ(Xs). Under the assumptions given in the paper, name one of the reasons for this complication.
