A Statistical Method for Finding Transcription Factor Binding Sites

1
A Statistical Method for Finding Transcription
Factor Binding Sites
  • S. Sinha and M. Tompa
  • ISMB 2000

2
Transcriptional Regulation

TRANSCRIPTION FACTOR
3
Introduction
  • Goal
  • To understand the mechanisms that determine the
    regulation of gene expression
  • A fundamental sub-problem is to identify
    DNA-binding sites for unknown regulatory factors
  • Then we can go on to find the regulatory factors
  • Input
  • A collection of genes believed to be coregulated
  • Non-coding DNA sequences near those regions

4
Challenges
  • Analysis of non-coding regions in eukaryotic
    genome
  • Located quite far from coding region
  • Regulatory sequence orientation
  • Multiple binding sites
  • Great variability in the binding sites, the
    nature of allowable variations not well
    understood
  • For S. cerevisiae (subject of experimentation)
  • Only the first problem is not severe (the 800 bp
    upstream of the translation start site are used
    as the input)

5
Overview
  • Previous methods: local search
  • EM
  • Gibbs sampling
  • etc.
  • This paper: enumerative statistical method
  • Enumerative: affordable because of a relatively
    small search space
  • Statistical: high z-scores suggest good candidates

6
Previous Methods
  • Aim at finding longer and more general motifs
  • Longer: motif length
  • General: represented by weight matrices,
    alignments, gapped alignments
  • Price paid: local search, no global optimum
    guaranteed
  • More than required for transcription factor
    binding sites
  • Length: typically 6 to 10 bp
  • Enumerative method?

7
Enumerative Methods
  • Van Helden et al. (1998)
  • Only exact matches
  • No spacers (N)
  • Occurrences at distinct positions are assumed to
    be independent
  • Overlapping structures in both orientations
  • Example: sequence ATATAT, motif ATAT
  • Frequency not normalized
  • Brazma et al. (1998)
  • Allows 3 Ns
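
The ATATAT / ATAT example on this slide can be checked with a minimal sketch. The function name `count_overlapping` is hypothetical and not from the paper; it only illustrates that occurrences at nearby positions can overlap, which is why treating positions as independent is questionable.

```python
def count_overlapping(sequence: str, motif: str) -> int:
    """Count every start position where motif occurs, overlaps included."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping hits are not skipped.
        start = sequence.find(motif, start + 1)
    return count


print(count_overlapping("ATATAT", "ATAT"))  # 2 (start positions 0 and 2)
print("ATATAT".count("ATAT"))               # 1, because str.count skips overlaps
```

The two occurrences share the middle "AT", so whether the motif occurs at position 2 is clearly not independent of whether it occurs at position 0.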

8
Enumerative Methods
  • Tompa (1999)
  • Pro
  • Markov chain to model the background genomic
    distribution
  • z-score for statistical significance
  • Consider the autocorrelation of overlapping motif
    occurrences
  • Con
  • Subject was the prokaryotic ribosome binding
    site → no spacers, insufficient variability

9
For S. cerevisiae
  • Spacers range in length from 1 to 11 bp and occur
    in the middle
  • Conserved bases (not including spacers): 6-10 bp
  • Mutation
  • Usually by transition rather than transversion
  • Purine for purine, pyrimidine for pyrimidine
  • Substitution between complementary bases also
    possible, A/T or G/C
  • Other kinds of variation are much rarer
  • Insertions/deletions uncommon → gaps unnecessary

10
What is a Motif?
  • String over the alphabet {A, C, G, T, R, Y, S, W,
    N}, with spacers N in the middle
  • This much variation is enough
  • Such a motif is suitable for enumeration (unlike
    a weight matrix)
  • An examination of 50 binding consensus sequences
    included in SCPD (Zhu and Zhang 1999)
  • 31 fit exactly
  • 10 more if slight differences are tolerated

11
Statistics
  • Consider frequency with respect to the background
    genomic distribution
  • Hypothesis testing → how unlikely is it to see
    this many occurrences under the background
    distribution?
  • Background
  • Order-m Markov chain estimated from (m+1)-mer
    frequencies
  • m = 3, to account for TATA, AAAA, TTTT
  • z-score normalization → allows comparing different
    motifs
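
One way to read "order-m Markov chain from (m+1)-mer frequencies" is sketched below: count every (m+1)-mer in the background sequences and divide by the total count of its length-m prefix. This is a plain maximum-likelihood estimate without smoothing; the function name and that choice are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def markov_transitions(sequences, m):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts.

    Returns a dict mapping each observed (m+1)-mer to the conditional
    probability of its last base given its first m bases.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - m):
            counts[seq[i:i + m + 1]] += 1

    # Total count of (m+1)-mers sharing the same length-m context.
    context_totals = defaultdict(int)
    for kmer, c in counts.items():
        context_totals[kmer[:m]] += c

    return {kmer: c / context_totals[kmer[:m]] for kmer, c in counts.items()}


# Toy background: the 2-mers are AA, AA, AT, so P(A|A) = 2/3 and P(T|A) = 1/3.
print(markov_transitions(["AAAT"], 1))
```

With m = 3 the contexts are 3-mers and the counted words are 4-mers, which is how strings such as TATA or AAAA enter the background model directly.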

12
Expectation and Variance
  • E(Xs): relatively straightforward
  • σ(Xs): substantially more effort
  • Need to consider overlap of a motif to itself
  • Fortunately, autocorrelation well studied in the
    past
  • Mathematical basis!
  • Generalize to the case where s represents a
    finite set of strings (more complex: overlaps of
    one string with any other)
  • Higher order Markov models
  • Spacers handled without extra cost
  • Motif occurring in either orientation
  • Details in Appendix of the paper

13
Preliminary Calculations
  • For a given motif s
  • String of length l over {A, C, G, T, R, Y, S, W, N}
  • For simplicity, only a first-order Markov model
    is considered here
  • Transform into a multiset W
  • Replace R, Y, S, W by all possible combinations
    of appropriate instantiations
  • For each string in W, add its reverse complement
    (both strands)
  • Xs is the sum of the # of occurrences of each
    member of W
  • Overlapping instances are counted as separate
  • A simple special case: palindromes
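
The transformation into the multiset W can be sketched as follows. The names are hypothetical, and the sketch leaves spacer symbols N unexpanded, on the assumption that they are handled separately as the later slides suggest.

```python
from itertools import product

# Only the two-base codes are instantiated; N is kept as a spacer symbol.
DEGENERATE = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(word: str) -> str:
    """Reverse the word and complement each base (N stays N)."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def expand_to_multiset(motif: str) -> list:
    """All instantiations of R, Y, S, W, followed by their reverse complements.

    A list is returned (not a set) so that duplicates are kept, as in a multiset.
    """
    choices = [DEGENERATE.get(sym, sym) for sym in motif]
    instances = ["".join(p) for p in product(*choices)]
    return instances + [reverse_complement(w) for w in instances]


print(expand_to_multiset("RCNG"))  # ['ACNG', 'GCNG', 'CNGT', 'CNGC']
print(expand_to_multiset("ACGT"))  # ['ACGT', 'ACGT'], a palindrome
```

In this sketch a palindrome equals its own reverse complement and therefore appears twice in the list; how the paper treats that special case is not recoverable from the slide.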

14
Preliminary Calculations (cont)
  • Represent members of W as W_i, with T = |W|

15
Preliminary Calculations (cont)
  • E(Xs) is easy: linearity of expectation!
  • π: stationary distribution of the Markov chain
  • Skip N (spacers) by looking at higher powers of
    the transition matrix
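
A minimal sketch of this expectation for a first-order chain, under stated assumptions: words start and end with a non-spacer base, the sequence is taken to start in the stationary distribution, and a run of g spacers between two bases is covered by entry (a, b) of the (g+1)-th power of the transition matrix. All function names are hypothetical, and the formulas in the paper's appendix may differ in detail.

```python
BASES = "ACGT"
INDEX = {b: i for i, b in enumerate(BASES)}


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, exponent):
    """p raised to a non-negative integer power (repeated multiplication)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, p)
    return result


def occurrence_probability(word, pi, P):
    """Probability that `word` (over A, C, G, T, N) occurs at a fixed position."""
    prev = INDEX[word[0]]
    prob = pi[prev]
    gap = 0  # number of spacers seen since the last non-spacer base
    for sym in word[1:]:
        if sym == "N":
            gap += 1
            continue
        cur = INDEX[sym]
        # gap spacers plus one step: use the (gap + 1)-step transition matrix.
        prob *= mat_pow(P, gap + 1)[prev][cur]
        prev, gap = cur, 0
    return prob


def expected_count(words, pi, P, n):
    """Linearity of expectation: sum over words and over start positions."""
    return sum((n - len(w) + 1) * occurrence_probability(w, pi, P)
               for w in words)


uniform_P = [[0.25] * 4 for _ in range(4)]
uniform_pi = [0.25] * 4
print(expected_count(["ACNG"], uniform_pi, uniform_P, 103))  # 1.5625
```

For the uniform chain the spacer contributes nothing, so the probability is 0.25 cubed and the 100 possible start positions give 100/64.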

16
Preliminary Calculations (cont)
  • σ(Xs) is harder

17
Preliminary Calculations (cont)
  • First part B, second part 2C

18
Preliminary Calculations (cont)
  • To compute the first term of C
  • One-to-one correspondence between
    X_{i1,j} X_{i2,j+k} = 1 and X_{ij}(CW) = 1
  • Autocorrelation factor
  • Transfer to an expectation on the single variable
    X_{ij}(CW)
  • B and A are more complex (in the appendix)
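
The overlap idea behind the autocorrelation factor can be illustrated with a small sketch. It assumes two equal-length words without spacers, finds the shifts k at which the second word can start inside the first without conflict, and builds the single combined word that both occurrences imply. Reading the slide's CW as such a combined word is an assumption, and the function names are hypothetical.

```python
def overlap_shifts(u: str, v: str) -> list:
    """Shifts k (1 <= k < len(u)) where v may start k positions after u.

    Assumes len(u) == len(v) and plain A/C/G/T words (no spacers).
    """
    length = len(u)
    return [k for k in range(1, length) if u[k:] == v[:length - k]]


def combined_word(u: str, v: str, k: int) -> str:
    """The string covered when u starts at 0 and v starts at shift k."""
    return u[:k] + v


print(overlap_shifts("ATAT", "ATAT"))    # [2]
print(combined_word("ATAT", "ATAT", 2))  # ATATAT
print(overlap_shifts("AAA", "AAA"))      # [1, 2]
```

Both words occur at the given shift exactly when the combined word occurs at the first position, so the expectation of the product of two indicators reduces to the occurrence probability of one longer word.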

19
Algorithm Summary
  • Basic routine
  • Enumerate all possibilities
  • Capture the # of occurrences Ns
  • Normalize to a z-score with E(Xs) and σ(Xs), then
    rank
  • Scalability
  • k: # of non-spacers; c: # of R, Y, S, W
    instantiations
  • # of enumerated motifs is exponential in k →
    small k
  • Linear in the genome size → can be applied to
    larger regions
  • Computing one z-score costs O(c^2 k^2)
  • No need to always pay that cost; prune based on
    less costly sub-components (details of the
    heuristics in the paper)
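
The final normalize-and-rank step can be sketched as below, taking the z-score to be the usual (Ns − E(Xs)) / σ(Xs). The function names are hypothetical and the numbers in the usage example are made up purely for illustration; computing E(Xs) and σ(Xs) themselves is the hard part described on the earlier slides.

```python
def z_score(observed: float, expected: float, sigma: float) -> float:
    """Standard z-score: deviations of the observed count from expectation."""
    return (observed - expected) / sigma


def rank_motifs(stats: dict) -> list:
    """Rank motifs by z-score, highest first.

    `stats` maps motif -> (observed count, expected count, standard deviation).
    """
    scored = [(motif, z_score(*values)) for motif, values in stats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up numbers, not results from the paper.
example = {
    "ACGNNWT": (10, 4.0, 2.0),  # z = 3.0
    "RCGTTA": (8, 2.0, 1.5),    # z = 4.0
}
print(rank_motifs(example))  # [('RCGTTA', 4.0), ('ACGNNWT', 3.0)]
```

The motif with fewer raw occurrences ranks first here because it is rarer under the background, which is the reason for normalizing before comparing motifs.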

20
Experiment Settings
  • Known regulons
  • 17 well-studied sets of coregulated genes in S.
    cerevisiae
  • Each has a known transcription factor with a
    known binding site consensus
  • So the success of the experiments can be tested
  • Another test set: coexpressed gene clusters
  • Details in the paper

21
Output Presentation
  • Italics mark instances found in the program's
    output
  • z-score compared with the mean max z-score
  • Run on several sets of simulated (randomly
    generated) test data to see how large a z-score
    normally gets
  • In this sense, our output is statistically
    significant!

22
Three Examples - 1
23
Three Examples - 2
24
Three Examples - 3
25
Experiment Results
  • Known regulons
  • Succeeds in all but two experiments
  • In 9 of them, the known consensus ranks among the
    three highest
  • In 6 others, a very similar looking motif is in
    the top three
  • The other 2 are families with very few genes in
    them; reported in the top 20
  • Coexpressed gene clusters
  • Similar outcome

26
Future Work
  • Motif characterization is limited
  • Spacers may not be centered
  • There may be more than one run of spacers
  • Move work out of the enumeration into a
    preprocessing step
  • Faster
  • Filter out well-known repeats from the upstream
    regions
  • Improve accuracy

27
Thank you!