A Statistical Method for Finding Transcription Factor Binding Sites

1
A Statistical Method for Finding Transcription
Factor Binding Sites

S. Sinha M. Tompa
ISMB 2000

2
Transcriptional Regulation

[Figure; label: TRANSCRIPTION FACTOR]

3
Introduction

Goal:
Understand the mechanisms that regulate gene expression
A fundamental sub-problem is identifying DNA-binding sites for unknown regulatory factors
The regulatory factors themselves can then be sought
Input:
A collection of genes believed to be coregulated
Non-coding DNA sequences near those genes

4
Challenges

Analysis of non-coding regions in eukaryotic genomes:
Regulatory sites can be located quite far from the coding region
Regulatory sequence orientation
Multiple binding sites
Great variability in the binding sites; the nature of allowable variations is not well understood
For S. cerevisiae (the subject of the experiments):
Only the first problem is not severe (the input is the 800 bp upstream of the translation start site)

5
Overview

Previous methods: local search
EM
Gibbs sampling
etc.
This paper: enumerative statistical method
Enumerative: affordable because the search space is relatively small
Statistical: high z-scores suggest good candidates

6
Previous Methods

Aim at finding longer and more general motifs
Longer: motif length
General: represented by weight matrices, alignments, gapped alignments
Price paid: local search, no global optimum guaranteed
More than is required for transcription factor binding sites
Typical site length: 6 to 10 bp
Enumerative method?

7
Enumerative Methods

Van Helden et al. (1998)
Only exact matches
No spacers (N)
Occurrences at distinct positions are assumed to be independent
This ignores overlapping occurrences, in both orientations
Example: sequence ATATAT, motif ATAT
Frequency not normalized
Brazma et al. (1998)
Allows 3 Ns
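
The ATATAT / ATAT example can be checked directly. The sketch below is only an illustration (the helper names are made up here, not taken from either paper): it counts overlapping occurrences on both strands and shows that the two hits share bases, so they cannot be treated as independent events.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def overlapping_positions(seq: str, word: str) -> list:
    """Start positions (0-based) of every occurrence, overlaps included."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]

seq, motif = "ATATAT", "ATAT"
forward_hits = overlapping_positions(seq, motif)
reverse_hits = overlapping_positions(seq, reverse_complement(motif))

# ATAT is its own reverse complement, so both strands report the same hits,
# and the hits at positions 0 and 2 overlap in two bases.
print(forward_hits, reverse_hits)
```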

8
Enumerative Methods

Tompa (1999)
Pro:
Markov chain to model the background genomic distribution
z-score for statistical significance
Considers the autocorrelation of overlapping motif occurrences
Con:
Subject was the prokaryotic ribosome binding site: no spacers, insufficient variability

9
For S. cerevisiae

Spacers range in length from 1 to 11 bp and occur in the middle
Conserved bases (not including spacers): 6-10 bp
Mutation:
Usually a transition rather than a transversion
Purine for purine, pyrimidine for pyrimidine
Substitution between complementary bases is also possible (A/T or G/C)
Other kinds of variation are much rarer
Insertions/deletions are uncommon, so gaps are unnecessary

10
What is a Motif?

String over the alphabet {A, C, G, T, R, Y, S, W, N}, with spacers N in the middle
This much variation is enough
Such a motif is suitable for enumeration (unlike a weight matrix)
An examination of 50 binding consensus sequences included in SCPD (Zhu & Zhang 1999):
31 fit exactly
10 more fit if slight differences are tolerated
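
As a rough illustration of this motif model, the check below tests whether a consensus string fits it. It is a sketch under two assumptions that the slide does not spell out: the spacers form a single contiguous run of N, and that run has conserved bases on both sides. The function name and the test strings are hypothetical.

```python
import re

# Assumption: conserved letters, then optionally one run of N followed by
# more conserved letters. N may not start or end the motif.
MOTIF_RE = re.compile(r"[ACGTRYSW]+(?:N+[ACGTRYSW]+)?")

def fits_motif_model(consensus: str) -> bool:
    """True if the consensus uses only the motif alphabet with central spacers."""
    return MOTIF_RE.fullmatch(consensus) is not None

print(fits_motif_model("CGGNNNNNNNNNNNCCG"))  # spacer run in the middle
print(fits_motif_model("ACGTNN"))             # trailing spacers
print(fits_motif_model("ACKGT"))              # K is outside the alphabet
```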

11
Statistics

Consider frequency with respect to the background genomic distribution
Hypothesis testing: how unlikely is it to see this many occurrences under the background distribution?
Background:
Order-m Markov chain estimated from (m+1)-mer frequencies
m = 3, to account for TATA, AAAA, TTTT
z-score normalization: allows different motifs to be compared
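
One simple way to obtain an order-m background model from (m+1)-mer frequencies is sketched below. This is a plain maximum-likelihood estimate on a toy string, with no pseudocounts; the slides do not say how the paper handles unseen contexts, so treat the details as assumptions.

```python
from collections import Counter

def markov_transitions(seq: str, m: int) -> dict:
    """Estimate P(next base | previous m bases) from (m+1)-mer counts.

    Returns a dict keyed by the (m+1)-mer: the first m letters are the
    context, the last letter is the base being predicted.
    """
    kmer_counts = Counter(seq[i:i + m + 1] for i in range(len(seq) - m))
    context_totals = Counter()
    for kmer, n in kmer_counts.items():
        context_totals[kmer[:m]] += n
    return {kmer: n / context_totals[kmer[:m]] for kmer, n in kmer_counts.items()}

# Toy background with m = 1: the 2-mers are AA, AA, AA, AT.
probs = markov_transitions("AAAAT", 1)
print(probs)
```

With m = 3 the same function would be keyed by 4-mers, which is what lets the model absorb the high background frequency of words such as TATA or AAAA.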

12
Expectation and Variance

E(X_s): relatively straightforward
σ(X_s): substantially more effort
Need to consider overlap of a motif with itself
Fortunately, autocorrelation has been well studied in the past
Mathematical basis!
Generalize to the case where s represents a finite set of strings (more complex: overlap of one string with any other)
Higher-order Markov models
Spacers handled without extra cost
Motif occurring in either orientation
Details in the Appendix of the paper

13
Preliminary Calculations

For a given motif s:
String of length l over {A, C, G, T, R, Y, S, W, N}
For simplicity, only a first-order Markov model here
Transform s into a multiset W:
Replace R, Y, S, W by all possible combinations of appropriate instantiations
For each string in W, add its reverse complement (both strands)
X_s is the sum of the numbers of occurrences of each member of W
Overlapping instances are counted separately
A simple special case: palindrome
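
The transformation of a motif into the multiset W can be sketched as follows. The code uses the standard IUPAC meanings (R = A/G, Y = C/T, S = C/G, W = A/T) and leaves N in place as a spacer; function names are illustrative. W is kept as a list so that duplicates survive, which is what happens for a palindrome. How the paper treats that special case is not shown on the slide.

```python
from itertools import product

# Standard IUPAC codes used by the motif alphabet; N stays a spacer.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "N"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(word: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(word))

def expand_motif(motif: str) -> list:
    """Multiset W: every instantiation of R/Y/S/W, plus each reverse complement."""
    forward = ["".join(p) for p in product(*(IUPAC[ch] for ch in motif))]
    return forward + [reverse_complement(w) for w in forward]

print(expand_motif("RTNA"))   # two instantiations and their reverse complements
print(expand_motif("ATAT"))   # palindrome: the same string appears twice
```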

14
Preliminary Calculations (cont)

Represent the members of W as W_i, with T = |W|

15
Preliminary Calculations (cont)

E(X_s) is easy: linearity of expectation!
p: stationary distribution of the Markov chain
Skip N (spacers) by looking at higher powers of the transition matrix
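
A minimal sketch of this calculation for a first-order chain is given below. It assumes a single sequence of length seq_len (the real input is a collection of upstream regions), a 4x4 transition matrix P with rows and columns ordered A, C, G, T, and a stationary distribution found by simple power iteration. All names are illustrative. A run of g spacers between two conserved bases is handled by using the (g+1)-th power of P for that step.

```python
BASES = "ACGT"

def mat_mul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def stationary(P, iters=200):
    """Approximate the stationary distribution by repeated multiplication."""
    pi = [0.25] * 4
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
    return pi

def word_probability(word, P, pi):
    """Probability that word (A/C/G/T with central N spacers) starts at a fixed position."""
    powers = {1: P}                      # cache of P^1, P^2, ...
    prev = BASES.index(word[0])
    prob, step = pi[prev], 1
    for ch in word[1:]:
        if ch == "N":                    # spacer: just lengthen the jump
            step += 1
            continue
        while step not in powers:
            powers[max(powers) + 1] = mat_mul(powers[max(powers)], P)
        cur = BASES.index(ch)
        prob *= powers[step][prev][cur]
        prev, step = cur, 1
    return prob

def expected_count(W, P, pi, seq_len):
    """E(X_s) by linearity: sum over members of W and over start positions."""
    return sum((seq_len - len(w) + 1) * word_probability(w, P, pi) for w in W)

uniform = [[0.25] * 4 for _ in range(4)]
pi = stationary(uniform)
print(expected_count(["ACNNG"], uniform, pi, 104))
```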

16
Preliminary Calculations (cont)

σ(X_s) is harder

17
Preliminary Calculations (cont)

First part: B; second part: 2C

18
Preliminary Calculations (cont)

To compute the first term of C:
One-to-one correspondence between X_{i1,j} X_{i2,j+k} = 1 and X_{ij}(CW) = 1
Autocorrelation factor
Transfer to an expectation on a single variable, X_{ij}(CW)
B and A are more complex (in the appendix)
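
For orientation, the generic identity behind this kind of variance calculation is written out below. Here X_{ij} is taken to be the indicator that the i-th member of W occurs at position j, and T is the number of members of W. How the paper groups the resulting terms into A, B, and C is given in its appendix and is not assumed here.

```latex
X_s = \sum_{i=1}^{T} \sum_{j} X_{ij},
\qquad
\sigma^2(X_s) = E(X_s^2) - E(X_s)^2,

E(X_s^2) = \sum_{i,j} E(X_{ij})
  + \sum_{(i_1, j_1) \neq (i_2, j_2)} E\left(X_{i_1 j_1} X_{i_2 j_2}\right)
```

The first sum uses the fact that an indicator equals its own square. The second sum is where overlaps matter: when the two positions are close enough for the words to overlap, the product is nonzero only if the two words agree on the shared bases.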

19
Algorithm Summary

Basic routine:
Enumerate all possibilities
Capture the number of occurrences N_s
Normalize to a z-score with E(X_s) and σ(X_s), then rank
Scalability:
k: number of non-spacers; c: number of R, Y, S, W instantiations
Number of enumerated motifs is exponential in k, so k must be small
Linear in the genome size, so it can be applied to larger regions
Computing a z-score costs O(c^2 k^2)
That cost need not always be paid: prune based on less costly sub-components (details of the heuristics in the paper)
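
The basic routine can be sketched as below, assuming the usual z-score definition z = (N_s - E(X_s)) / σ(X_s). The enumeration covers only the conserved positions (no spacers) and applies no pruning, so it illustrates the exponential growth in k rather than the paper's heuristics. Function names and the toy statistics are made up for the example.

```python
from itertools import product

def z_score(n_s: float, expected: float, sigma: float) -> float:
    """Standard z-score: observed count minus expectation, in units of sigma."""
    return (n_s - expected) / sigma

def enumerate_motifs(k: int, alphabet: str = "ACGTRYSW"):
    """Yield every string of k conserved positions over the motif alphabet."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)

def rank_by_z(stats: dict) -> list:
    """stats maps motif -> (N_s, E(X_s), sigma(X_s)); highest z-score first."""
    return sorted(stats, key=lambda s: z_score(*stats[s]), reverse=True)

# 8 letters per conserved position: 8**k candidates before spacers are added.
print(8 ** 6)

toy_stats = {"ACGTRY": (12, 4.0, 2.0), "TTTTTT": (5, 4.0, 2.0)}
print(rank_by_z(toy_stats))
```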

20
Experiment Settings

Known regulons:
17 well-studied sets of coregulated genes in S. cerevisiae
Each has a known transcription factor with a known binding site consensus
So the success of the experiments can be tested
Another test set: coexpressed gene clusters
Details in the paper

21
Output Presentation

Instances found in the program's output are italicized
z-score is compared with the mean max z-score
Run on several sets of simulated (randomly generated) test data to see how large a z-score normally gets
In this sense, our output is statistically significant!
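
The mean max z-score idea can be sketched as a small simulation. The slide says only that the test data are randomly generated, so the generator here (independent, uniformly distributed bases) is an assumption, and score_fn is a hypothetical stand-in for whatever routine returns the z-scores of all enumerated motifs on one data set.

```python
import random
from statistics import mean

def mean_max_z(score_fn, seq_len: int, n_trials: int, seed: int = 0) -> float:
    """Average, over random data sets, of the largest score seen in each one."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_trials):
        # Assumption: uniform i.i.d. bases stand in for simulated test data.
        seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
        maxima.append(max(score_fn(seq)))
    return mean(maxima)

# Toy scoring function for demonstration only: per-base counts, not z-scores.
def toy_scores(seq: str) -> list:
    return [float(seq.count(b)) for b in "ACGT"]

baseline = mean_max_z(toy_scores, seq_len=20, n_trials=50)
print(baseline)
```

A real motif's z-score would then be judged against this baseline: a value well above the mean maximum on random data is unlikely to be a chance artifact.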