Title: A Statistical Method for Finding Transcription Factor Binding Sites
1A Statistical Method for Finding Transcription Factor Binding Sites
- Authors: Saurabh Sinha and Martin Tompa
- Presenter: Christopher Schlosberg
- CS598ss
2Regulation of Gene Expression
3Difficulties of Motif Finding
- Regulatory sequences don't follow the same orientation as the coding sequence or as each other
- Multiple binding sites might exist for each regulated gene
- Large variation in the binding sites of a single factor; these variations are not well understood
4Previous Proposed Methods for Finding Motifs
- Previous Methods
  - Find longer, general motifs
  - Use local search algorithms (Gibbs sampling, Expectation Maximization, greedy algorithms)
- Proposed Method
  - A TFBS is small enough to use enumerative methods
  - Enumerative statistical methods guarantee global optimality and remain computationally affordable
5Proposed Method Highlights
- Allows variations in the binding site instances of a given transcription factor
- Allows motifs to include spacers
- Allows for overlapping occurrences (in both orientations), which leads to complex dependencies
- Statistical significance of a motif (s) is based on the frequencies of shorter (more frequent) oligonucleotides
- Uses a Markov chain to model the background genomic distribution
- Uses a z-score to measure statistical significance
- Allows for multiple binding sites
6Characteristics of a Motif
- Any single TFBS has significant variation
- Many motifs have spacers of 1-11 bp
- Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)
- Less often, variation occurs between a pair of complementary bases
- Indels are uncommon
7Proposed Motif Definition
- A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
  - A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base)
- TF database (SCPD) confirms this model of variation
  - Of 50 binding site consensus sequences, 31 fit exactly (62%)
  - Another 10 fit if slight variations are allowed
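To make the alphabet concrete, here is a minimal sketch of matching a motif string against a DNA sequence on the forward strand only. The names `CODES`, `matches_at`, and `forward_occurrences` are illustrative and do not come from the paper; the base sets are the standard meanings of the codes listed above.

```python
# Bases allowed by each motif character; N is the spacer (any base).
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer
}

def matches_at(motif, seq, pos):
    """True if motif matches seq starting at pos (forward strand only)."""
    if pos < 0 or pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in CODES[ch] for i, ch in enumerate(motif))

def forward_occurrences(motif, seq):
    """Start positions of every forward-strand match, overlaps included."""
    return [p for p in range(len(seq) - len(motif) + 1)
            if matches_at(motif, seq, p)]

print(forward_occurrences("RCGY", "TACGTGCGCA"))  # → [1, 5]
```

Reverse-strand matches are left out here on purpose; they are covered by the multiset construction on the Number of Occurrences slide.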
8Measure of Statistical Significance
- Given a set of coregulated S. cerevisiae genes, the input to the problem is the corresponding set of 800 bp upstream sequences, each having its 3' end at the gene's translation start site.
- The model must measure, from the input sequences:
  - Absolute number of occurrences (N_s) of motif s
  - Background genomic distribution
- X is a set of random DNA sequences, equal in number and lengths to the input sequences
  - Generated by a Markov chain of order m
  - Transition probabilities determined by (m+1)-mer frequencies in the full complement of 6000 upstream sequences (each 800 bp in length)
  - Background model chooses m = 3
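The background model can be illustrated with a small sketch that estimates order-m transition probabilities from (m+1)-mer counts and then draws a random sequence from them. This is not the authors' code: `fit_markov` and `sample_sequence` are hypothetical names, and choosing the starting context uniformly at random (and restarting on an unseen context) is an assumption made only to keep the example short.

```python
import random
from collections import defaultdict

def fit_markov(seqs, m):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s) - m):
            counts[s[i:i + m]][s[i + m]] += 1
    probs = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        probs[ctx] = {base: n / total for base, n in nxt.items()}
    return probs

def sample_sequence(probs, m, length, rng):
    """Draw one random sequence of the given length from the fitted chain."""
    contexts = sorted(probs)
    seq = rng.choice(contexts)  # assumption: uniform choice of start context
    while len(seq) < length:
        dist = probs.get(seq[-m:] if m else "")
        if dist is None:        # context never seen with a successor
            dist = probs[rng.choice(contexts)]
        bases = sorted(dist)
        seq += rng.choices(bases, weights=[dist[b] for b in bases])[0]
    return seq[:length]

# Toy usage; the slides use m = 3 on roughly 6000 upstream regions of 800 bp.
model = fit_markov(["ACGTACGTTGCAACGT", "TTGACGTACCGTAGCA"], m=3)
print(sample_sequence(model, m=3, length=20, rng=random.Random(0)))
```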
9z-score
- X_s: random variable giving the number of occurrences of motif s in X
- E(X_s): expectation of X_s; σ(X_s): standard deviation of X_s
- z_s: number of standard deviations by which the observed value N_s exceeds its expectation
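Read as a formula, the definition above is z_s = (N_s - E(X_s)) / σ(X_s). A tiny sketch, with made-up numbers chosen only for illustration:

```python
def z_score(n_obs, expected, sd):
    """Number of standard deviations by which n_obs exceeds its expectation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (n_obs - expected) / sd

# Hypothetical values: 12 observed occurrences, 4 expected, SD of 2.
print(z_score(12, 4.0, 2.0))  # → 4.0
```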
10Implications
- Possibility of overlap of a motif with itself (in either orientation)
- Previous study of pattern autocorrelation
- Generalized computation of SD, treating the motif as a finite set of strings
- Higher order Markov chains
- Spacers handled at no extra computational cost
- Handles motif in either orientation
11Algorithm
- Enumerates over each input sequence
- Tabulates the number N_s of occurrences of each motif in either direction
- Computes the expectation and SD for each motif s such that N_s > 0
- Calculates the z-score
- Ranks motifs by z-score
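The steps above can be sketched as a short pipeline. To stay small, this version is restricted to exact k-mers over A, C, G, T (no R, Y, S, W and no spacers), and it takes the expectation and SD from a caller-supplied function `expectation_and_sd`, which is a hypothetical placeholder for the analytic calculation developed on the later slides.

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMP)[::-1]

def count_kmers_both_strands(seqs, k):
    """N_s for every exact k-mer s, counting occurrences in either orientation."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            rc = revcomp(word)
            counts[word] += 1
            if rc != word:  # a palindrome is counted once per position
                counts[rc] += 1
    return counts

def rank_motifs(seqs, k, expectation_and_sd):
    """Return (z, motif, N_s) tuples sorted by decreasing z-score."""
    scored = []
    # Only motifs with N_s > 0 ever appear in the counter.
    for motif, n in count_kmers_both_strands(seqs, k).items():
        mean, sd = expectation_and_sd(motif)  # hypothetical callable
        if sd > 0:
            scored.append(((n - mean) / sd, motif, n))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored

# Toy usage with a constant background of mean 1 and SD 1 for every motif.
for z, motif, n in rank_motifs(["ACGTAC"], 2, lambda m: (1.0, 1.0)):
    print(motif, n, z)
```

Ties are broken alphabetically so that the ranking is deterministic; that is a convenience of this sketch, not something stated in the slides.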
12Algorithm Analysis
- For a single motif, the time complexity is O(c^2 k^2)
  - k: number of nonspacer characters in the motif
  - c: number of instantiations of R, Y, S, W in the motif
- Only modest values of k are needed, since binding sites are short
- Linear dependence on genome size
- The variance calculation can be trimmed to reduce running time
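As a rough illustration of how k and c grow, the sketch below computes both for a motif string. It assumes that c means the number of concrete strings obtained by instantiating every R, Y, S, W character (two choices each), which is one reading of the slide rather than a confirmed definition; `motif_size_parameters` is an illustrative name.

```python
def motif_size_parameters(motif):
    """Return (k, c): nonspacer length and number of instantiations."""
    k = sum(ch != "N" for ch in motif)
    # Assumption: each of R, Y, S, W stands for two bases, so c = 2^d.
    c = 2 ** sum(ch in "RYSW" for ch in motif)
    return k, c

for motif in ["CGGNNNNNNNNNNNCCG", "RYCGNNNCGW"]:
    k, c = motif_size_parameters(motif)
    print(motif, "k =", k, "c =", c, "c^2 k^2 =", c * c * k * k)
```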
13Number of Occurrences
- Convert motif s into a multiset W of strings
- Add the reverse complement of each string in W
- Motif s occurs at a position in X iff some string in W occurs at the same position
- X_s = total number of occurrences (in X) of the members of W
- Handling palindromes
- W_i: member of W
- |W| = T
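A minimal sketch of this construction: instantiate R, Y, S, W, keep the spacer N as a wildcard, add reverse complements, and count positions where any member of W occurs. The slides describe W as a multiset; this sketch instead deduplicates with a Python set so that a string equal to its own reverse complement is counted once per position. That is an assumption about how palindromes are handled, not necessarily the paper's convention, and all function names are illustrative.

```python
from itertools import product

# Instantiations of each motif character; N stays a wildcard.
CHOICES = {"A": "A", "C": "C", "G": "G", "T": "T",
           "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "N"}
COMP = str.maketrans("ACGTN", "TGCAN")

def revcomp(s):
    """Reverse complement; the spacer N maps to itself."""
    return s.translate(COMP)[::-1]

def build_W(motif):
    """Strings over {A, C, G, T, N} representing the motif in both orientations."""
    forward = {"".join(p) for p in product(*(CHOICES[ch] for ch in motif))}
    return sorted(forward | {revcomp(w) for w in forward})

def occurs_at(word, seq, pos):
    """True if word (with N as wildcard) matches seq at pos."""
    return all(w == "N" or w == seq[pos + i] for i, w in enumerate(word))

def count_occurrences(motif, seq):
    """Sum over members of W of their occurrence counts in seq."""
    return sum(occurs_at(w, seq, p)
               for w in build_W(motif)
               for p in range(len(seq) - len(motif) + 1))

print(build_W("RC"))                      # → ['AC', 'GC', 'GT']
print(count_occurrences("RC", "ACGTGC"))  # → 3
```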
14Number of Occurrences Cont
15Expectation
16Variance
- B term
- C term
17C Term
- A term
18A Term
19Overlapping Concatenation
- CW (like W) is potentially a multiset
- One-to-one correspondence
20C Term Simplification
21A Term Revisited
22Si1Si2 Term Approximation
- Kleffe and Borodovsky (1992) Approximation
23B Term
24B Term Cont
25Summary
26Higher Order Markov Models
- Variance calculations remain the same except for the Si1Si2 term
- Experiments use m = 3
27Experimental Results and Future Considerations
- 17 coregulated sets of genes
  - Known TF with known binding site consensus
  - In 9 experiments, the known consensus was one of the 3 highest scoring motifs
- Future Topics
  - Non-centered spacers
  - Enumeration loop optimization
  - Filtering repeats
28Question
- E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one of the reasons for this complication.