Title: Transcription Regulation: Transcription Factor Motif Finding
1Transcription Regulation: Transcription Factor Motif Finding
- Xiaole Shirley Liu
- STAT115, STAT215, BIO298, BIST520
2Outline
- Biology of transcription regulation and
challenges of computational motif finding
- Scan for known TF motif sites
- TRANSFAC and JASPAR, Sequence Logo
- De novo method
- Regular expression enumeration: w-mer enumeration
- Position weight matrix update: EM and Gibbs
- Motif finding in different organisms
- Motif clusters and conservation
3Imagine a Chef
4Each Cell Is Like a Chef
6Understanding a Genome
Get the complete sequence (encoded cook book)
Observe gene expressions at different cell
states (meals prepared at different situations)
Decode gene regulation (decode the book,
understand the rules)
7Information in DNA
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACAGTTCAGGACTAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
8Information in DNA
- Non-coding region: 98%
- Regulation: when, where, amount, other conditions, etc.
- ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
- ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
- GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
- TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
- CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Coding region: 2%
Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9Measure Gene Expression
- Microarray or SAGE detects the expression of
every gene at a certain cell state
- Clustering: find genes that are co-expressed
(potentially share regulation)
10Decode Gene Regulation
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
- TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
- CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Look at genes always expressed together
Upstream Regions, Co-expressed Genes
11Decode Gene Regulation
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
- TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
- CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Look at genes always expressed together
Upstream Regions, Co-expressed Genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
13Biology of Transcription Regulation
- ...acatttgcttctgacacaactgtgttcactagcaacctca...aaca
gacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggc
agcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gat
acccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gat
gcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
Motifs can only be discovered computationally when
there are enough cases for machine learning
14Computational Motif Finding
- Input data
- Upstream sequences of a gene expression profile
cluster
- 20-800 sequences, each 300-5000 bp long
- Output: enriched sequence patterns (motifs)
- Ultimate goals
- Which TFs are involved, and what are their binding motifs
and effects (enhance / repress gene expression)?
- Which genes are regulated by this TF, and why is
there disease when a TF goes wrong?
- Are there binding partners / competitors for a TF?
15Challenges Where/what the signal
- The motif should be abundant
- GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAAATAAGACACGGACGGC
- GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
- TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
- CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
16Challenges Where/what the signal
- The motif should be abundant
- And Abundant with significance
- GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAAATAAGACACGGACGGC
- GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
- TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
- CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
17Challenges Double stranded DNA
- Motif appears in both strands
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
- TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
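A minimal sketch of strand-aware scanning, assuming the motif in question is the ATTTACC word used on the earlier slides. The function names are illustrative only. It shows that the second and third sequences carry the motif as its reverse complement, GGTAAAT.

```python
# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_both_strands(seq, motif):
    """Return (position, strand) for every exact match on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC",
    "TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG",
]
hits = [find_both_strands(s, "ATTTACC") for s in seqs]
print(hits)
```

Positions are 0-based. A real scanner would also need to tolerate mismatches, which exact string comparison does not.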
18Challenges Base substitutions
- Sequences do not have to match the motif
perfectly; base substitutions are allowed
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
- GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
- TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
- CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
19Challenges Variable motif copies
- Some sequences do not have the motif
- Some have multiple copies of the motif
- GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
- CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
- TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
- GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
- CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT
21Challenges Two-block motifs
- Some motifs have two parts
- GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
- CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
- TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
- CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT
22Scan for Known TF Motif Sites
- Experimental TF sites: TRANSFAC, JASPAR
- Motif representation
- Regular expression (binary decision)
- Consensus: CACAAAA
- Degenerate (IUPAC): CRCAAAW, where R = A/G and W = A/T
23Scan for Known TF Motif Sites
- Experimental TF sites: TRANSFAC, JASPAR
- Motif representation
- Regular expression (binary decision)
- Consensus: CACAAAA
- Degenerate: CRCAAAW
- Position weight matrix (PWM): need score cutoff
Motif Matrix
Sites
Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG
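One way to turn the ten sites above into a position weight matrix and scan with a score cutoff is sketched below. The uniform background, the pseudocount of 1, the cutoff of 5.0, and the test sequence are all assumptions chosen for illustration; they are not taken from the slides.

```python
import math

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def count_matrix(sites):
    """Count each base at each aligned position."""
    counts = [{b: 0 for b in "ACGT"} for _ in range(len(sites[0]))]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1
    return counts

def log_odds_matrix(counts, background, pseudo=1.0):
    """Convert counts to log2(motif probability / background probability)."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudo
        pwm.append({b: math.log2((col[b] + pseudo) / total / background[b])
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm, cutoff):
    """Report (position, score) for every window scoring at or above cutoff."""
    w = len(pwm)
    hits = []
    for i in range(len(seq) - w + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(w))
        if score >= cutoff:
            hits.append((i, round(score, 2)))
    return hits

counts = count_matrix(SITES)
consensus = "".join(max("ACGT", key=lambda b: col[b]) for col in counts)
background = {b: 0.25 for b in "ACGT"}   # assumed uniform background
pwm = log_odds_matrix(counts, background)
hits = scan("CCATGGCATGTT", pwm, cutoff=5.0)  # hypothetical sequence and cutoff
print(consensus, hits)
```

Unlike the regular-expression form, the matrix gives a graded score, so the result depends on where the cutoff is set.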
24IUPAC for DNA
- A: adenine
- C: cytosine
- G: guanine
- T: thymine
- U: uracil
- R: G or A (purine)
- Y: T or C (pyrimidine)
- K: G or T (keto)
- M: A or C (amino)
- S: G or C (strong)
- W: A or T (weak)
- B: C, G, or T (not A)
- D: A, G, or T (not C)
- H: A, C, or T (not G)
- V: A, C, or G (not T)
- N: A, C, G, or T (any)
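The IUPAC codes map directly onto regular-expression character classes, which is one simple way to scan for a degenerate consensus such as CRCAAAW. The sketch below uses Python's standard `re` module; the example sequence is made up for illustration.

```python
import re

# IUPAC DNA codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(IUPAC[ch] for ch in pattern)

def find_sites(seq, pattern):
    """Return (position, matched site); the lookahead allows overlapping hits."""
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]

sites = find_sites("TTCACAAAAGGCGCAAATCC", "CRCAAAW")  # hypothetical sequence
print(iupac_to_regex("CRCAAAW"), sites)
```

This is the "binary decision" representation: a window either matches or it does not, with no notion of a near miss.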
25Protein Binding Microarrays
- In vitro protein-DNA interactions
- Better capture motifs
26JASPAR
- User defined cutoff to scan for a particular motif
27A Word on Sequence Logo
- SeqLogo consists of stacks of symbols, one stack
for each position in the sequence
- The overall height of the stack indicates the
sequence conservation at that position
- The height of symbols within the stack indicates
the relative frequency of each nucleotide at that
position
Sites: ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT
ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
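The stack heights can be computed from the ten sites as sketched below. This assumes the common convention of 2 bits minus the column entropy, with no small-sample correction; the function name is illustrative.

```python
import math

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def logo_heights(sites):
    """Per position, return each base's letter height in bits."""
    n = len(sites)
    heights = []
    for j in range(len(sites[0])):
        column = [s[j] for s in sites]
        freqs = {b: column.count(b) / n for b in "ACGT"}
        entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
        info = 2.0 - entropy              # overall stack height
        heights.append({b: round(p * info, 3) for b, p in freqs.items()})
    return heights

heights = logo_heights(SITES)
for j, col in enumerate(heights, start=1):
    print(j, col)
```

Position 1 (nine A's and one T) gets a stack of about 1.53 bits, almost all of it in the letter A.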
28Scan Known TF Motifs
- Drawbacks
- Limited number of motifs
- Limited number of sites to represent each motif
- Low sensitivity and specificity
- Poor description of motif
- Binding site borders not clear
- Binding site many mismatches
- Many motifs look very similar
- E.g. GC-rich motif, E-box (CACGTG)
29De novo Sequence Motif Finding
- Goal: look for common sequence patterns enriched
in the input data (compared to the genome
background)
- Regular expression enumeration
- Pattern-driven approach
- Enumerate patterns, check significance in
dataset
- Oligonucleotide analysis, MobyDick
- Position weight matrix update
- Data-driven approach, use data to refine motifs
- Consensus, EM, and Gibbs sampling
- Motif score and Markov background
30Regular Expression Enumeration
- Oligonucleotide analysis: check
over-representation for every w-mer
- Expected occurrence of w in the data
- Consider genome sequence and current data size
- Observed occurrence of w in the data
- Over-represented w is a potential TF binding motif
Compare: observed occurrence of w in the data vs. expected occurrence
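A rough sketch of the observed-versus-expected comparison, run on the five upstream sequences from the "Decode Gene Regulation" slide. The expected count here assumes independent bases with uniform frequencies, which is a placeholder for the genome-derived background the slide calls for. The function names are illustrative.

```python
from collections import Counter

def count_wmers(seqs, w):
    """Observed count of every w-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def expected_count(word, seqs, base_freq):
    """Expected count under independent bases, scaled to the data size."""
    p = 1.0
    for b in word:
        p *= base_freq[b]
    positions = sum(len(s) - len(word) + 1 for s in seqs)
    return p * positions

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]
base_freq = {b: 0.25 for b in "ACGT"}  # assumed; use genome frequencies in practice

counts = count_wmers(seqs, 7)
word, observed = counts.most_common(1)[0]
expected = expected_count(word, seqs, base_freq)
print(word, observed, expected, observed / expected)
```

The top 7-mer is ATTTACC, seen once in each sequence. A proper analysis would attach a significance value to the ratio rather than ranking by it.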
31MobyDick
- A sequence data and a dictionary of motif words
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
D = {A, C, G, T}, Pw = {0.22, 0.28, 0.28, 0.22}
32MobyDick
- A sequence data and a dictionary of motif words
- Check over-representation of every word-pair
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.22, 0.22, 0.28}
33MobyDick
- A sequence data and a dictionary of motif words
- Check over-representation of every word-pair
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.28, 0.22, 0.22}
D = {A, C, G, T, AA, GA, TA, GG}, Pw = ?
34MobyDick
- D = {A, C, G, T, AA, GA, TA, GG}
- Seq: AAGATAA
- Possible partitions
- A A G A T A A: pA pA pG pA pT pA pA
- AA G A T A A: pAA pG pA pT pA pA
- AA GA T A A: pAA pGA pT pA pA
- AA GA TA A: pAA pGA pTA pA
- A A GA T AA: pA pA pGA pT pAA
- ...
- Assign probabilities so as to maximize the total
probability of generating the sequence
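The partitions above can be enumerated mechanically. The sketch below lists every way to split AAGATAA into dictionary words and sums the partition probabilities. The word probabilities are hypothetical placeholders; the slides leave them as the quantity to be estimated.

```python
def partitions(seq, dictionary):
    """Return every split of seq into words from the dictionary."""
    if not seq:
        return [[]]
    result = []
    for word in dictionary:
        if seq.startswith(word):
            for rest in partitions(seq[len(word):], dictionary):
                result.append([word] + rest)
    return result

def sequence_probability(seq, word_prob):
    """Total probability of seq: sum over partitions of the word-probability product."""
    total = 0.0
    for parts in partitions(seq, list(word_prob)):
        p = 1.0
        for word in parts:
            p *= word_prob[word]
        total += p
    return total

# Hypothetical word probabilities (they sum to 1); not from the slides.
word_prob = {"A": 0.2, "C": 0.2, "G": 0.15, "T": 0.15,
             "AA": 0.1, "GA": 0.1, "TA": 0.05, "GG": 0.05}

parts = partitions("AAGATAA", list(word_prob))
for p in parts:
    print(" ".join(p))
print(len(parts), sequence_probability("AAGATAA", word_prob))
```

With this dictionary the sequence has 12 partitions. Recursive enumeration is fine for a toy string; long sequences would call for dynamic programming instead.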
35MobyDick
- A sequence data and a dictionary of motif words
- Check over-representation of every word-pair
- Reassign word probability and consider every new
word-pair to build even longer words
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.28, 0.22, 0.22}
D = {A, C, G, T, AA, GA, TA, GG}, Pw = ?
36Regular Expression Enumeration
- RE enumeration derivatives
- Oligo-analysis, spaced dyads w1.ns.w2
- IUPAC alphabet
- Markov background (later)
- 2-bit encoding, fast index access
- Enumerate limited RE patterns known for a TF
protein structure or interaction theme
- Exhaustive, guaranteed to find global optimum,
and can find multiple motifs
- Not as flexible with base substitutions, long
list of similar good motifs, and limited in
motif width
37Consensus
- Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence
38Consensus
- Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence
Remaining good motifs
39Consensus
- Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence
- G Stormo, algorithm runs very fast
- Sequence order plays a big role in performance
- The first two sequences had better contain the motif
- Sites stop accumulating at the first bad sequence
- Newer version allowing 0-n sites per sequence is much slower
40Expectation Maximization and Gibbs Sampling Model
- Objects
- Seq: sequence data to search for motif
- θ0: non-motif (genome background) probability
- θ: motif probability matrix parameter
- π: motif site locations
- Problem: P(θ, π | seq, θ0)
- Approach: alternately estimate
- π by P(π | θ, seq, θ0)
- θ by P(θ | π, seq, θ0)
- EM and Gibbs differ in the estimation methods
41Expectation Maximization
- E step: π | θ, seq, θ0
- TTGACGACTGCACGT
- TTGAC p1
- TGACG p2
- GACGA p3
- ACGAC p4
- CGACT p5
- GACTG p6
- ACTGC p7
- CTGCA p8
- ...
- p1 = likelihood ratio = P(TTGAC | θ) / P(TTGAC | θ0)
- P(TTGAC | θ0) = p0T × p0T × p0G × p0A × p0C = 0.3 × 0.3 × 0.2 ×
0.3 × 0.2
42Expectation Maximization
- E step: π | θ, seq, θ0
- TTGACGACTGCACGT
- TTGAC p1
- TGACG p2
- GACGA p3
- ACGAC p4
- CGACT p5
- GACTG p6
- ACTGC p7
- CTGCA p8
- ...
- M step: θ | π, seq, θ0
- p1 × TTGAC
- p2 × TGACG
- p3 × GACGA
- p4 × ACGAC
- ...
- Scale ACGT at each position; θ reflects the weighted
average of π
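One EM iteration on the example sequence might look like the sketch below. It makes two simplifying assumptions that are not in the slides: exactly one site per sequence (so the likelihood ratios are normalized to sum to 1) and a uniform starting matrix. The background values 0.3/0.3/0.2/0.2 come from the E-step slide; the pseudocount and function names are illustrative.

```python
def background_prob(segment, theta0):
    """Probability of a segment under the independent background model."""
    p = 1.0
    for b in segment:
        p *= theta0[b]
    return p

def e_step(seq, theta, theta0):
    """Weight each start position by its motif/background likelihood ratio."""
    w = len(theta)
    ratios = []
    for i in range(len(seq) - w + 1):
        r = 1.0
        for j in range(w):
            b = seq[i + j]
            r *= theta[j][b] / theta0[b]
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]   # assumes one site per sequence

def m_step(seq, weights, w, pseudo=0.01):
    """Rebuild the motif matrix as the weighted average of all segments."""
    theta = [{b: pseudo for b in "ACGT"} for _ in range(w)]
    for i, p in enumerate(weights):
        for j in range(w):
            theta[j][seq[i + j]] += p
    for col in theta:
        total = sum(col.values())
        for b in col:
            col[b] /= total              # scale A, C, G, T at each position
    return theta

seq = "TTGACGACTGCACGT"
theta0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
theta = [{b: 0.25 for b in "ACGT"} for _ in range(5)]  # assumed starting matrix

weights = e_step(seq, theta, theta0)
new_theta = m_step(seq, weights, 5)
print(background_prob("TTGAC", theta0))
print([round(p, 3) for p in weights])
```

In practice the two steps alternate over all sequences until the matrix stops changing.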
43EM Derivatives
- First EM motif finder (C Lawrence)
- Deterministic algorithm, guarantee local optimum
- MEME (TL Bailey)
- Prior probability allows 0-n site / sequence
- Runs multiple EMs in parallel with different seeds
- User friendly results
44Gibbs Sampling
- Stochastic process, although still may need
multiple initializations
- Sample θ from P(θ | π, seq, θ0)
- Sample π from P(π | θ, seq, θ0)
- Collapsed form
- θ estimated with counts, not sampling from
Dirichlet
- Sample site from one seq based on sites from
other seqs
- Converged motif matrix θ and converged motif
sites π represent the stationary distribution of a
Markov chain
45Gibbs Sampler
- Randomly initialize a probability matrix
46Gibbs Sampler
- Take out one sequence with its sites from current
motif
?11
?21
?31
?41
?51
47Gibbs Sampler
- Score each possible segment of this sequence
Sequence 1
Segment (1-8)
?21
?31
?41
?51
48Gibbs Sampler
- Score each possible segment of this sequence
Sequence 1
Segment (2-9)
?21
?31
?41
?51
49Segment Score
- Use current motif matrix to score a segment
50Scoring Segments
Motif  1    2    3    4    5    bg
A      0.4  0.1  0.3  0.4  0.2  0.3
T      0.2  0.5  0.1  0.2  0.2  0.3
G      0.2  0.2  0.2  0.3  0.4  0.2
C      0.2  0.2  0.4  0.1  0.2  0.2
- Ignore pseudo counts for now
- Sequence: TTCCATATTAATCAGATTCCG
- Segment scores
- TAATC
- AATCA: 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
- ATCAG: 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
- TCAGA: 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
- CAGAT
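The segment scores can be checked with a few lines of code. The sketch below uses the motif matrix and background column from this slide and multiplies the per-position ratios, ignoring pseudocounts as the slide does. Note that for TCAGA the C at position 2 contributes 0.2/0.2, so the product comes out as 0.666667.

```python
# Motif matrix from the slide: probability of each base at positions 1-5.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column from the slide.
BG = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}

def segment_score(segment):
    """Product over positions of motif probability / background probability."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= MOTIF[base][j] / BG[base]
    return score

seq = "TTCCATATTAATCAGATTCCG"
scores = {seq[i:i + 5]: segment_score(seq[i:i + 5])
          for i in range(len(seq) - 4)}
for segment in ("TAATC", "AATCA", "ATCAG", "TCAGA", "CAGAT"):
    print(segment, round(scores[segment], 6))
```

ATCAG stands out with a score near 12, which is what makes it the likely site when the sampler draws from these scores.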
51Gibbs Sampler
- Sample site from one seq based on sites from
other seqs
?21
?31
?41
?51
52How to Sample?
Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47
- Rand(subtotal) = X
- Find the first position with subtotal larger than
X
Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
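The subtotal lookup can be written as below, a small sketch using only the standard library. `pick_position` is deterministic for a given X, which makes it easy to check against the tables; `sample_position` draws X at random. Both names are illustrative.

```python
import random
from itertools import accumulate

def pick_position(scores, x):
    """Return the 1-based position of the first subtotal larger than x."""
    for pos, subtotal in enumerate(accumulate(scores), start=1):
        if subtotal > x:
            return pos
    return len(scores)  # fallback if x is not below the total

def sample_position(scores, rng=random):
    """Draw X uniformly below the total score, then look up its position."""
    return pick_position(scores, rng.random() * sum(scores))

scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
subtotals = list(accumulate(scores))
print(subtotals)
print(pick_position(scores, 20))   # X = 20 falls in position 4 (subtotal 21)

peaked = [3, 1, 12, 5, 8, 9, 500, 2, 6]
print(pick_position(peaked, 300))  # most X values land on position 7
```

Each position is chosen with probability proportional to its score, so in the second table position 7 is picked about 500 times out of 546.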
53Gibbs Sampler
- Repeat the process until motif converges
?21
?12
?31
?41
?51
54Gibbs Sampler Intuition
- Beginning
- Randomly initialized motif
- No preference towards any segment
55Gibbs Sampler Intuition
- Motif appears
- Motif should have enriched signal (more sites)
- By chance some correct sites come to alignment
- Sites bias motif to attract other similar sites
56Gibbs Sampler Intuition
- Motif converges
- All sites come to alignment
- Motif totally biased to sample sites every time
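Putting the steps together, a bare-bones site sampler could look like the sketch below. It assumes exactly one site per sequence, a fixed motif width, a uniform background, and a fixed iteration count instead of a convergence test; none of those choices come from the slides, and the function names are illustrative. The input is the set of five upstream sequences from the "Decode Gene Regulation" slide.

```python
import random

def build_matrix(sites, w, pseudo=0.5):
    """Motif probability matrix from aligned sites, with pseudocounts."""
    cols = [{b: pseudo for b in "ACGT"} for _ in range(w)]
    for site in sites:
        for j, base in enumerate(site):
            cols[j][base] += 1
    for col in cols:
        total = sum(col.values())
        for b in col:
            col[b] /= total
    return cols

def gibbs_sampler(seqs, w, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Random initialization: one start position per sequence.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Take out sequence k and build the motif from the other sites.
            others = [s[p:p + w]
                      for i, (s, p) in enumerate(zip(seqs, pos)) if i != k]
            theta = build_matrix(others, w)
            # Score every possible segment of sequence k.
            scores = []
            for i in range(len(seq) - w + 1):
                r = 1.0
                for j in range(w):
                    base = seq[i + j]
                    r *= theta[j][base] / background[base]
                scores.append(r)
            # Sample a new site in proportion to the scores.
            x = rng.random() * sum(scores)
            subtotal = 0.0
            for i, sc in enumerate(scores):
                subtotal += sc
                if subtotal > x:
                    pos[k] = i
                    break
    final_sites = [s[p:p + w] for s, p in zip(seqs, pos)]
    return pos, build_matrix(final_sites, w)

seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]
background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
pos, theta = gibbs_sampler(seqs, 7, background)
print([s[p:p + 7] for s, p in zip(seqs, pos)])
```

Because the process is stochastic, different seeds can end in different alignments, which is why multiple initializations are usually run.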
57Gibbs Sampler
- Column shift
- Metropolis algorithm
- Propose π* as π shifted 1 column to left or right
- Calculate motif score u(π) and u(π*)
- Accept π* with prob min(1, u(π*) / u(π))
58Gibbs Sampling Derivatives
- Gibbs Motif Sampler (JS Liu)
- Add prior probability to allow 0-n site / seq
- Sample motif positions to consider
- AlignACE (F Roth)
- Look for motifs from both strands
- Mask out one motif to find more different motifs
- BioProspector (XS Liu)
- Use background model with Markov dependencies
- Sampling with threshold (0-n sites / seq), new
scoring function
- Can find two-block motifs with variable gap
59Scoring Motifs
- Information Content (also known as relative
entropy)
- Suppose you have x aligned segments for the motif
- pb(s1 from mtf) / pb(s1 from bg)
- pb(s2 from mtf) / pb(s2 from bg)
- ...
- pb(sx from mtf) / pb(sx from bg)
61Scoring Motifs
- pb(s1 from mtf) / pb(s1 from bg)
- pb(s2 from mtf) / pb(s2 from bg)
- pb(sx from mtf) / pb(sx from bg)
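- Multiply the ratios: $(p_{A1}/p_{A0})^{A_1} (p_{T1}/p_{T0})^{T_1} (p_{T2}/p_{T0})^{T_2} (p_{G2}/p_{G0})^{G_2} (p_{C2}/p_{C0})^{C_2} \cdots$, where $A_1$ is the number of A's at position 1, and so on
- Take log of this
- $A_1 \log(p_{A1}/p_{A0}) + T_1 \log(p_{T1}/p_{T0}) + T_2 \log(p_{T2}/p_{T0}) + G_2 \log(p_{G2}/p_{G0}) + \cdots$
- Divide by the number of segments (if all the
motifs have the same number of segments)
- $p_{A1} \log(p_{A1}/p_{A0}) + p_{T1} \log(p_{T1}/p_{T0}) + p_{T2} \log(p_{T2}/p_{T0}) + \cdots$
Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG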
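The derivation on this slide can be checked numerically: averaging the log likelihood ratio over the aligned sites gives the same number as summing p log(p / p0) over positions and bases. The sketch below does both for the ten sites, assuming a background of A = T = 0.3 and G = C = 0.2 purely for illustration.

```python
import math

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
BG = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}  # assumed background

def frequencies(sites):
    """Base frequencies at each aligned position (no pseudocounts)."""
    n = len(sites)
    return [{b: [s[j] for s in sites].count(b) / n for b in "ACGT"}
            for j in range(len(sites[0]))]

def relative_entropy(freqs, bg):
    """Sum over positions and bases of p * log(p / p0)."""
    return sum(p * math.log(p / bg[b])
               for col in freqs for b, p in col.items() if p > 0)

def mean_log_likelihood_ratio(sites, freqs, bg):
    """Average over sites of log[ pb(site from motif) / pb(site from bg) ]."""
    total = 0.0
    for site in sites:
        for j, base in enumerate(site):
            total += math.log(freqs[j][base] / bg[base])
    return total / len(sites)

freqs = frequencies(SITES)
ic = relative_entropy(freqs, BG)
mean_llr = mean_log_likelihood_ratio(SITES, freqs, BG)
print(round(ic, 4), round(mean_llr, 4))
```

The two values agree because each base's count at a position, divided by the number of segments, is exactly its frequency at that position.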
62Scoring Motifs
- Original function Information Content
-
-
63Scoring Motifs
- Original function Information Content
-
-
Good AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
Bad ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
64Scoring Motifs
- Original function Information Content
- Which is better?
- (data 8 seqs)
-
-
Motif 1: AGGCTAAC AGGCTAAC
Motif 2: AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC
AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
65Scoring Motifs
- Motif scoring function
- Prefer conserved motifs with many sites that are
not often seen in the genome background
66Markov Background Increases Motif Specificity
- Prefers motif segments enriched only in data,
but not so likely to occur in the background
- Segment ATGTA score =
p(generate ATGTA from θ) / p(generate ATGTA from θ0)
Background probability, uniform vs. Markov:
TCAGC: .25 × .25 × .25 × .25 × .25 vs. .3 × .18 × .16 × .22 × .24
ATATA: .25 × .25 × .25 × .25 × .25 vs. .3 × .41 × .38 × .42 × .30
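The effect of the two background models can be seen by multiplying out the numbers on this slide. The sketch below takes the per-base probabilities exactly as listed and compares the Markov background probability of each segment with the uniform one; it does not estimate a Markov model from sequence data. It uses `math.prod`, which requires Python 3.8 or later.

```python
import math

# Per-base background probabilities as listed on the slide.
UNIFORM = [0.25] * 5
MARKOV = {
    "TCAGC": [0.3, 0.18, 0.16, 0.22, 0.24],
    "ATATA": [0.3, 0.41, 0.38, 0.42, 0.30],
}

# Fold change in background probability when switching uniform -> Markov.
fold = {segment: math.prod(probs) / math.prod(UNIFORM)
        for segment, probs in MARKOV.items()}

for segment, f in fold.items():
    print(segment, round(math.prod(MARKOV[segment]), 9), round(f, 3))
```

ATATA is about six times more likely under the Markov background, so its segment score drops by that factor, while TCAGC becomes less likely in the background and its score roughly doubles.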
67Position Weight Matrix Update
- Advantage
- Can look for motifs of any widths
- Flexible with base substitutions
- Disadvantage
- EM and Gibbs sampling: no guaranteed convergence
time
- No guaranteed global optimum
68Motif Finding in Bacteria
- Promoter sequences are short (200-300 bp)
- Motifs are usually long (10-20 bases)
- Some have two blocks with a gap, some are
palindromes
- Long motifs are usually very degenerate
- Single microarray experiment sometimes already
provides enough information to search for TF
motifs
69Motif Finding in Lower Eukaryotes
- Upstream sequences are longer (500-1000 bp), with
some simple repeats
- Motif width varies (5-17 bases)
- Expression clusters provide input sequences of
decent quality for TF motif finding
- Motif combinations and redundancy appear,
although single motifs are usually significant
enough for identification
70Yeast Promoter Architecture
- Co-occurring regulators suggest physical
interaction between the regulators
71Motif Finding in Higher Eukaryotes
- Upstream sequences are very long (3-20 kb) with
repeats; TF motifs can also appear downstream
- Motifs can be short or long (6-20 bases), and
appear in combinations and clusters
- Gene expression clusters are not good enough input
- Need
- Comparative genomics: PhastCons score
- Motif modules: motif clusters
- ChIP-chip/seq
72Yeast Regulatory Sequence Conservation
73UCSC PhastCons Conservation
- Functional regulatory sequences are under
stronger evolutionary constraint
- Align orthologous sequences together
- PhastCons conservation score (0-1) for each
nucleotide in the genome can be downloaded from
UCSC
74Conserved Motif Clusters
- First find conserved regions in the genome
- Then look for repeated transcription factor (TF)
binding sites
- They form transcription factor modules
75Summary
- Biology and challenges of transcription regulation
- Scan for known TF motif sites: TRANSFAC, JASPAR
- De novo method
- Regular expression enumeration
- Oligonucleotide analysis
- MobyDick: build long motifs from short ones
- Position weight matrix update
- CONSENSUS (sequence order)
- EM (iterate θ, π; π-weighted θ average)
- Gibbs Sampler (sample θ, π; Markov chain
convergence)
- Motif score and Markov background
- Motif cluster and motif conservation