Motif Finding presentation

1
Motif Finding

Yueyi Irene Liu
CS374 Lecture
Oct. 17, 2002

2
Outline

Background biology
Motif-finding methods
Word enumeration
Gibbs sampling
Random projection
Phylogenetic footprinting
Reducer

4
Regulation of Gene Expression

Chromatin structure
Transcription initiation
Transcript processing and modification
RNA transport
Transcript stability
Translation initiation
Post-translational modification
Protein transport
Control of protein stability

5
Typical Structure of a Eukaryotic mRNA Gene
6
Control of Transcription Initiation
7
Motif

A conserved pattern that is found in two or more
sequences
Can be found in
DNA (e.g., transcription factor binding sites)
Protein
RNA

8
Models for Representing Motifs

Regular expression
Consensus: TGACGCA
Degenerate: WGACRCA
Position-specific matrix

Example sites: TGACGCA, TGACGCA, AGACGCA, TGACACA, AGACGCA
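As an illustration of the position-specific matrix model, the five example sites on this slide can be turned into per-position base frequencies. This is a minimal sketch, not the presenter's code; the function names and the optional `pseudocount` parameter are assumptions made for the example.

```python
from collections import Counter

# The five example sites listed on the slide.
SITES = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
ALPHABET = "ACGT"

def frequency_matrix(sites, pseudocount=0.0):
    """Return one {base: frequency} dict per motif position."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix

def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in matrix)

matrix = frequency_matrix(SITES)
for pos, col in enumerate(matrix, start=1):
    print(pos, col)
print(consensus(matrix))
```

Positions 1 and 5 are the only columns with more than one base (A/T and A/G), which is what the degenerate symbols W and R encode.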
9
Where to look for motifs?

Gene families: sets of genes controlled by a
common transcription factor or common
environmental stimulus
How do you construct gene families?
Microarray experiments

10
Microarrays
11
Motif-finding Methods

Goal: look for motifs (5-15 bp) in the data set
Methods:
Word enumeration method
Gibbs sampling
Random projection
Phylogenetic footprinting
Reducer

12
Word Enumeration

For every word w, calculate:
Expected frequency, based on the entire upstream
region of the yeast genome
E.g., $P(\text{ATTGA}) = (0.4)^4 (0.1)^1$, given $P(A) = P(T) = 0.4$,
$P(G) = P(C) = 0.1$
Expected number of occurrences of ATTGA:
$E = n \cdot P(\text{ATTGA})$
Observed frequency in the data set
Statistical significance of enrichment:
$Z = (O - E) / \sqrt{n p (1 - p)} \sim N(0, 1)$
Disadvantage: only exact words are considered
E.g., the degenerate motif YCTGCA is counted as two separate words, TCTGCA and CCTGCA
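The word-enumeration statistic on this slide can be sketched as follows. This is a hedged illustration: it uses the independent base frequencies from the slide's ATTGA example as the background, whereas the slide says the expected frequency is based on the whole yeast upstream region. The function names are assumptions, and sequences are assumed to contain only uppercase A, C, G, T.

```python
import math
from collections import Counter

# Background base frequencies taken from the slide's example.
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def word_probability(word, background=BACKGROUND):
    """Probability of a word under an independent-base background."""
    p = 1.0
    for base in word:
        p *= background[base]
    return p

def z_scores(sequences, w, background=BACKGROUND):
    """Z = (O - E) / sqrt(n p (1 - p)) for every observed word of length w."""
    observed = Counter()
    n = 0  # total number of word positions scanned
    for seq in sequences:
        for i in range(len(seq) - w + 1):
            observed[seq[i:i + w]] += 1
            n += 1
    scores = {}
    for word, o in observed.items():
        p = word_probability(word, background)
        e = n * p  # expected number of occurrences
        scores[word] = (o - e) / math.sqrt(n * p * (1 - p))
    return scores

print(word_probability("ATTGA"))
scores = z_scores(["ATTGAATTGA"], 5)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

Because each word is scored separately, TCTGCA and CCTGCA would receive independent Z scores here, which is the disadvantage noted on the slide.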

13
Gibbs Sampling

Matrix to capture a motif
Goal: find the best a1, ..., ak to maximize the difference
between the motif and the background base distribution.

[Figure: start positions a1, a2, a3, a4, ..., ak marked on the sequences]
Liu, X
14
Gibbs Sampling (Lawrence, et al, 1993)

Step 1: Pick random start positions, compute
current motif matrix
Step 2: Iterative update
Take one sequence out, update motif matrix
Calculate fitness score of each position of the out
sequence
Pick start position in the out sequence based on
weight $A_x$
Take out another sequence, ..., until convergence
Step 3: Reset starting position

Liu, X
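Steps 1 and 2 of this slide can be sketched as a small site sampler. This is a simplified illustration under stated assumptions, not the Lawrence et al. implementation: exactly one site per sequence, a fixed motif width `w`, a pseudocount of 1, a fixed number of sweeps in place of a convergence test, and no Step 3. All function and parameter names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def motif_matrix(sequences, starts, w, skip=None, pseudocount=1.0):
    """Column-wise base frequencies from the current sites, leaving one sequence out."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(w)]
    for i, (seq, a) in enumerate(zip(sequences, starts)):
        if i == skip:
            continue
        for j in range(w):
            counts[j][seq[a + j]] += 1
    matrix = []
    for col in counts:
        total = sum(col.values())
        matrix.append({b: c / total for b, c in col.items()})
    return matrix

def fitness(x, matrix, background):
    """Weight of subsequence x: motif probability divided by background probability."""
    q = p = 1.0
    for j, base in enumerate(x):
        q *= matrix[j][base]
        p *= background[base]
    return q / p

def gibbs_sample(sequences, w, background, sweeps=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in sequences]
    # Step 2: iterative update (fixed sweep count is an assumption).
    for _ in range(sweeps):
        for i, seq in enumerate(sequences):
            matrix = motif_matrix(sequences, starts, w, skip=i)
            weights = [fitness(seq[a:a + w], matrix, background)
                       for a in range(len(seq) - w + 1)]
            # Sample the new start in proportion to the fitness scores.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["ACGTTGACGCAATT", "TTGACGCAGGATCA", "CCATTGACGCATAG"]
uniform = {b: 0.25 for b in ALPHABET}
print(gibbs_sample(seqs, 7, uniform, sweeps=50, seed=1))
```

Sampling the new position rather than always taking the best-scoring one is what lets the procedure move away from a poor random start.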
15
Gibbs Sampling Initialization: Pick random start
positions, compute motif matrix
Liu, X
16
Gibbs Sampling Iteration Steps: 1) Take out one
sequence, calculate the fitness score of every
subsequence relative to the current motif
[Figure: start positions a1', a2', a3', a4', ..., ak']
Liu, X
17
Fitness Score
Current Motif

$A_x = Q_x / P_x$
$Q_x$: probability of generating subsequence x from the
current motif
$P_x$: probability of generating subsequence x from the
background

Background: $P(A) = P(T) = 0.4$, $P(G) = P(C) = 0.1$
x = GGA: Q = ? P = ?
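The question on this slide can be worked through numerically. The background probability follows directly from the slide (P = 0.1 x 0.1 x 0.4 = 0.004). The current motif matrix from the slide's figure is not in the transcript, so the `MOTIF` matrix below is a hypothetical stand-in chosen only for illustration.

```python
# Background base frequencies from the slide.
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# HYPOTHETICAL current motif matrix (not from the slides), one column per position.
MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

def q_prob(x, motif):
    """Probability of generating x from the motif matrix."""
    q = 1.0
    for column, base in zip(motif, x):
        q *= column[base]
    return q

def p_prob(x, background):
    """Probability of generating x from the background."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

x = "GGA"
q = q_prob(x, MOTIF)
p = p_prob(x, BACKGROUND)
print(f"Q = {q:.3f}, P = {p:.3f}, fitness = {q / p:.2f}")
```

A subsequence that is likely under the motif but rare under the background gets a large ratio, so it is favored when a new start position is sampled.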
18
Gibbs Sampling Iteration Steps: 2) Pick a new start
position by sampling in proportion to the fitness score
[Figure: updated start positions a1'', a2', a3', a4', ..., ak']
Liu, X
19
Recent Development

Random Projection
Phylogenetic Footprinting
Reducer

20
Random Projection (Buhler, 2002)

(l, d)-motif problem
M is an (unknown) motif of length l
Each occurrence of M is corrupted by exactly d
point substitutions in random positions
No known biological motif is exactly
an (l, d)-motif

CCcaAG CCcgAG CCgcAG CCtaAG CCtgAG
CtATgG CCctAc tCtTAG CaAcAG CCAgAa
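The (l, d)-motif model can be made concrete by generating instances of a motif with exactly d substitutions, shown in lowercase as in the examples on this slide. This is a sketch only; the motif CCATAG is an assumption read off the uppercase letters of the slide's second row, and the function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def corrupt(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions (in lowercase)."""
    positions = rng.sample(range(len(motif)), d)  # d distinct positions
    letters = list(motif)
    for pos in positions:
        # Pick a base different from the original so the position really changes.
        alternatives = [b for b in ALPHABET if b != motif[pos]]
        letters[pos] = rng.choice(alternatives).lower()
    return "".join(letters)

def hamming(a, b):
    """Number of mismatching positions, ignoring case."""
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

rng = random.Random(0)
motif = "CCATAG"  # assumed motif M, l = 6
instances = [corrupt(motif, 2, rng) for _ in range(5)]
print(instances)
print([hamming(inst, motif) for inst in instances])
```

Every instance is at Hamming distance exactly d from M, yet two instances can differ from each other in up to 2d positions, which is what makes the problem hard for pairwise comparison.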
21
Random Projection Algorithm

Guiding principle: some instances of a motif
agree on a subset of positions.
Use information from multiple motif instances to
construct model.

Buhler, J
22
k-Projections

Choose k positions in string of length l.
Concatenate nucleotides at chosen k positions to
form k-tuple.
In l-dimensional Hamming space, projection onto k
dimensional subspace.

Example: k = 7, l = 15
P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC projects to TGCTGAT
Buhler, J
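The projection on this slide is a simple selection of letters. The sketch below reproduces the slide's example with 1-based positions; the helper names `project` and `random_projection` are assumptions for illustration.

```python
import random

def project(l_tuple, positions):
    """Concatenate the letters of l_tuple at the given 1-based positions."""
    return "".join(l_tuple[p - 1] for p in positions)

def random_projection(l, k, rng):
    """Choose k of the l positions uniformly at random (1-based, sorted)."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT, as on the slide

rng = random.Random(0)
print(random_projection(15, 7, rng))
```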
23
Random Projection Algorithm

Choose a projection by selecting k positions
uniformly at random.
For each l-tuple in input sequences, hash into
bucket based on letters at k selected positions.
Recover motif from bucket containing multiple
l-tuples.

Input sequence x(i): TCAATGCACCTAT...
Buhler, J
24
Example

l = 7 (motif size), k = 4 (projection size)
Choose projection (1, 2, 5, 7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
[Figure: the l-tuple GCCTTAC is hashed into its bucket]
Buhler, J
25
Hashing and Buckets

Hash function h(x) obtained from k positions of
projection.
Buckets are labeled by values of h(x).
Enriched buckets contain more than s l-tuples,
for some parameter s.

Buhler, J
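Hashing l-tuples into buckets and keeping the enriched ones can be sketched as below. This is an illustration under assumptions: the bucket label is simply the string of projected letters, and the function names are hypothetical. The usage example reuses the input fragment and projection (1, 2, 5, 7) from the Example slide.

```python
from collections import defaultdict

def hash_l_tuples(sequences, l, positions):
    """Map each bucket label h(x) to the (sequence index, start) of its l-tuples."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            l_tuple = seq[start:start + l]
            key = "".join(l_tuple[p - 1] for p in positions)  # 1-based positions
            buckets[key].append((seq_index, start))
    return buckets

def enriched_buckets(buckets, s):
    """Buckets containing more than s l-tuples, as defined on the slide."""
    return {key: hits for key, hits in buckets.items() if len(hits) > s}

fragment = "TAGACATCCGACTTGCCTTACTAC"
buckets = hash_l_tuples([fragment], l=7, positions=(1, 2, 5, 7))
for key, hits in sorted(buckets.items()):
    print(key, [fragment[start:start + 7] for _, start in hits])
print(enriched_buckets(buckets, s=1))
```

With this projection the l-tuple GCCTTAC lands in the bucket labeled GCTC, since those are its letters at positions 1, 2, 5 and 7.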
26
Motif Refinement

How do we recover the motif from the sequences in
the enriched buckets?
k nucleotides are known from hash value of
bucket.
Use information in other l-k positions as
starting point for local refinement scheme, e.g.
EM or Gibbs sampler

[Figure: local refinement algorithm applied to a bucket yields candidate motif ATGCGTC]
Buhler, J
27
Parameter Selection

Projection size k
Choose k small so several motif instances hash to
same bucket. ($k < l - d$)
Choose k large to avoid contamination by spurious
l-mers. ($4^k > t(n - l + 1)$)
Bucket threshold s (s = 3, s = 4)

Buhler, J
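The two constraints on k can be checked mechanically. The slide does not define t and n; the sketch below assumes t is the number of input sequences and n the length of each sequence, so that t(n - l + 1) is the total number of l-mers hashed. The function name and the example values are hypothetical.

```python
def valid_projection_sizes(l, d, t, n):
    """Return every k with k < l - d and 4**k > t * (n - l + 1).

    Assumption: t = number of sequences, n = length of each sequence.
    """
    total_l_mers = t * (n - l + 1)
    return [k for k in range(1, l - d) if 4 ** k > total_l_mers]

# Hypothetical setting: a (15, 4)-motif in 20 sequences of length 600.
print(valid_projection_sizes(l=15, d=4, t=20, n=600))
```

The upper bound keeps at least one projection that avoids all d corrupted positions, while the lower bound keeps the expected number of random l-mers per bucket below one.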
28
Recent Development

Random Projection
Phylogenetic Footprinting
Reducer

29
Conservation of Regulatory Elements Upstream
of the ApoAI Gene
[Figure labels: TATA box]
31
Substring Parsimony Problem

Given:
orthologous upstream sequences S1, ..., Sn
phylogenetic tree T of the n species
size k of the motif, threshold d
Problem:
Find all sets of substrings s1, ..., sn of S1, ..., Sn,
each of size k, such that the parsimony score of
s1, ..., sn on T is at most d

Blanchette, M
32
Parsimony Score
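[Figure: tree T with node labels s1, s2, s34, s3, s4, s5, s6]
Parsimony score: minimum, over all possible labelings of internal nodes,
of $\sum_{(u,v) \in T} d(l(u), l(v))$

$l(v)$: label of node v
$d(l_1, l_2)$: Hamming distance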

Blanchette, M
33
String Parsimony Problem
S1 = AAAGCATTC, S2 = TACGCACCC, S3 = GAAGCAGGG
k = 5, d = 1
[Figure: tree with leaves S1, S2, S3]
34
Algorithm version I

Root the tree at arbitrary internal node r
Compute table $W_u$ of size $4^k$ for each node u,
where $W_u[s]$ = best parsimony score for subtree
rooted at u when u is labeled with s
Direct implementation of this recursion gives
$O(nk(4^{2k} + l))$, where l = average sequence
length

Blanchette, M
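The table W can be sketched with a direct recursion. The recursion itself is not in the transcript, so the form used here (a leaf scores 0 for each of its own k-mers; an internal node sums, over its children, the cheapest child label plus the Hamming distance to it) is an assumption based on the definition of W on this slide. The star-shaped tree over S1, S2, S3 is also assumed, since the tree figure for the earlier example did not survive. Only finite table entries are stored.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def kmers_of(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def w_table(node, children, leaf_seqs, k, all_kmers):
    """W[s] = best parsimony score of the subtree rooted at node when labeled s."""
    if node in leaf_seqs:
        # A leaf may only be labeled with a substring of its own sequence.
        return {s: 0 for s in kmers_of(leaf_seqs[node], k)}
    child_tables = [w_table(c, children, leaf_seqs, k, all_kmers)
                    for c in children[node]]
    table = {}
    for s in all_kmers:
        table[s] = sum(min(score + hamming(s, t) for t, score in ct.items())
                       for ct in child_tables)
    return table

leaf_seqs = {"S1": "AAAGCATTC", "S2": "TACGCACCC", "S3": "GAAGCAGGG"}
children = {"r": ["S1", "S2", "S3"]}  # ASSUMED star tree rooted at r
k = 5
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

root = w_table("r", children, leaf_seqs, k, all_kmers)
best = min(root.values())
print(best, [s for s, score in root.items() if score == best])
```

Under the assumed star tree, the root label AAGCA matches S1 and S3 exactly and is one substitution away from ACGCA in S2, giving a parsimony score of 1, which meets the threshold d = 1 of the example.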
35
Algorithm version II

Define $X_{(u,v)}[s]$ = best parsimony score for the
subtree consisting of edge (u, v) and the subtree
rooted at v, when u is labeled s

[Figure: node u labeled s, with nodes v and w]
Blanchette, M
36
Algorithm version II (continued)

Update $X_{(u,v)}$ in phases: in phase p, maintain the set
$B_p$ of sequences t such that $X_{(u,v)}[t] = p$
Define
$R_a = \{ s : W_v[s] = a \}$
$N(s) = \{ t \in \Sigma^k : d(s, t) = 1 \}$
Start in phase m and let $B_m = R_m$
Update
Computation of $X_{(u,v)}$ takes $O(k 4^k)$

Blanchette, M
37
Improvements

Reduce the size of Bp when sequences contribute
to X(u, v) greater than threshold d
In phase p, only care for sequence X(u, v) s if
Leads to significant reductions in stages d/2 d
Reduce the number of substrings inserted in W at
the leaves
For substring s of Si, if its best match against
any Sj has Hamming distance at least d, s can be
discarded

Blanchette, M
38
Results

Practical limit on k: 10
There appeared to be a threshold d0 with very few
solutions below and many above
Algorithm found 80 known binding sites
Performed better than ClustalW, MEME, Consensus

Blanchette, M
39
Recent Development

Random Projection
Phylogenetic Footprinting
Reducer

40
Reducer (Bussemaker, et al 2001)

Links motif finding to expression level
$A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
$A_g$: gene expression level (logarithm of
expression ratio)
M: number of significant motifs
$N_{ug}$: number of occurrences of motif u in gene g
C: baseline expression level (same for all genes)
$F_u$: increase/decrease of expression level caused
by presence of motif u
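The additive model on this slide can be evaluated directly for one gene. The sketch below is illustrative only; the motif names and all numeric values are made up, and the function name is hypothetical.

```python
def predicted_expression(baseline, effects, counts):
    """Baseline C plus, for each motif u, its effect F_u times its count N_ug."""
    return baseline + sum(f_u * counts.get(u, 0) for u, f_u in effects.items())

# Made-up values for illustration only.
C = 0.1
F = {"motif_1": 0.5, "motif_2": -0.25}      # effect per motif occurrence
N_gene = {"motif_1": 2, "motif_2": 1}       # occurrences upstream of one gene

print(predicted_expression(C, F, N_gene))
```

Each motif occurrence shifts the log expression ratio by a fixed amount, so motifs with positive F act as activating signals and motifs with negative F as repressing ones.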

41
Reducer (Contd)
Liu, X
42
Reducer (Contd)

Normalize expression (A) and motif (n) vectors
Linear regression between A vector and every n
vector to find the best fit n to A
Step-wise regression to combine effects of motifs
Subtract the effect of one motif
Find the next best motif

Liu, X
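The normalize, fit, subtract and repeat loop on this slide can be sketched in plain Python. This is a simplified illustration, not the published Reducer procedure: vectors are centered and scaled to unit length, the best motif is the one whose vector has the largest absolute dot product with the current residual, and no significance test is applied. All names and data are hypothetical.

```python
import math

def center_and_scale(v):
    """Subtract the mean and scale to unit length (zero vectors stay zero)."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered] if norm > 0 else centered

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def stepwise_motifs(expression, motif_counts, steps):
    """Greedily pick motifs; returns [(motif, slope)] in normalized units."""
    residual = center_and_scale(expression)
    vectors = {m: center_and_scale(n) for m, n in motif_counts.items()}
    selected = []
    chosen = set()
    for _ in range(steps):
        fits = {m: dot(residual, v) for m, v in vectors.items() if m not in chosen}
        if not fits:
            break
        best = max(fits, key=lambda m: abs(fits[m]))
        slope = fits[best]  # least-squares slope, since the vector has unit length
        selected.append((best, slope))
        chosen.add(best)
        # Subtract the effect of this motif before looking for the next one.
        residual = [r - slope * x for r, x in zip(residual, vectors[best])]
    return selected

# Made-up data: expression of four genes and counts of two motifs per gene.
A = [2.0, 0.0, 1.0, 3.0]
N = {"motif_1": [2, 0, 1, 3], "motif_2": [1, 1, 0, 0]}
print(stepwise_motifs(A, N, steps=2))
```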
43
Acknowledgement

People from whom I borrowed slides
Xiaole Liu (Reducer)
Olga Troyanskaya (Microarray)
Jeremy Buhler (Random projections)
Mathieu Blanchette (Phylogenetic footprinting)
Various web sources

45
[Figure: microarray workflow. Labels: cDNA clones (probes), PCR product amplification and purification, printing (0.1 nl/spot), mRNA target, hybridise target to microarray, laser 1 and laser 2 excitation, emission, scanning, overlay images and normalise, analysis]
46
Information Content of Motifs

Uncertainty
Information = $H_{before} - H_{after}$
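The information content of a motif column can be computed as the drop in uncertainty. The uncertainty formula from the slide is not in the transcript, so this sketch assumes Shannon entropy in bits, a uniform prior over the four bases (2 bits before), and no small-sample correction. The example sites are the five sites from the "Models for Representing Motifs" slide; the function names are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Uncertainty before minus uncertainty after, in bits, for one column."""
    h_before = math.log2(alphabet_size)  # assumed uniform prior
    total = len(column)
    counts = Counter(column)
    h_after = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h_before - h_after

def motif_information(sites):
    """Information content of each aligned position."""
    return [column_information(col) for col in zip(*sites)]

sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
info = motif_information(sites)
print([round(bits, 3) for bits in info])
print(round(sum(info), 3))
```

Fully conserved positions carry the maximum of 2 bits, while the two variable positions (1 and 5) carry less.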

47
Improvement on Original Gibbs sampler

0 to n copies of sites in each sequence
Iterative masking to find multiple motifs
Use higher order Markov models to improve motif
specificity

48
Clinical Importance of Defects in Regulatory
Elements
Burkitt's lymphoma
49
Statistical Methods

Expectation Maximization (EM)
MEME
Gibbs sampling
BioProspector
AlignACE

50
Motifs Are Not Limited to DNA

RNA motifs
RNA-RNA interaction motifs, e.g., intron-exon
splice sites
RNA-protein interaction motifs, e.g., binding
of proteins to the RNA poly(A) tail
Protein motifs
E.g., Helix-turn-helix motif
