Transcription Regulation Transcription Factor Motif Finding

1
Transcription Regulation: Transcription Factor
Motif Finding

Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520

2
Outline

Biology of transcription regulation and
challenges of computational motif finding
Scan for known TF motif sites
TRANSFAC and JASPAR, Sequence Logo
De novo method
Regular expression enumeration: w-mer enumeration
Position weight matrix update: EM and Gibbs
Motif finding in different organisms
Motif clusters and conservation

3
Imagine a Chef
4
Each Cell Is Like a Chef
6
Understanding a Genome
Get the complete sequence (encoded cook book)
Observe gene expressions at different cell
states (meals prepared at different situations)
Decode gene regulation (decode the book,
understand the rules)
7
Information in DNA
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
8
Information in DNA

Non-coding region: 98%
Regulation: when, where, amount, other conditions, etc.
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Coding region: 2%
Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9
Measure Gene Expression

Microarray or SAGE detects the expression of
every gene at a certain cell state
Clustering: find genes that are co-expressed
(and potentially share regulation)

10
Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together
Upstream regions of co-expressed genes
11
Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together
Upstream regions of co-expressed genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
13
Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
A motif can only be discovered computationally when
there are enough cases for machine learning
14
Computational Motif Finding

Input data:
Upstream sequences of a gene expression profile
cluster
20-800 sequences, each 300-5000 bp long
Output: enriched sequence patterns (motifs)
Ultimate goals
Which TFs are involved and their binding motifs
and effects (enhance / repress gene expression)?
Which genes are regulated by this TF, why is
there disease when a TF goes wrong?
Are there binding partner / competitor for a TF?

15
Challenges: Where/what is the signal

The motif should be abundant
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

16
Challenges: Where/what is the signal

The motif should be abundant
and abundant with significance
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

17
Challenges: Double-stranded DNA

Motif appears in both
strands
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
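
A minimal sketch of scanning both strands, assuming uppercase ACGT input and exact matching. The function names are illustrative, not from the slides. A site on the reverse strand shows up on the forward strand as the reverse complement of the motif (ATTTACC becomes GGTAAAT).

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_on_both_strands(seq, motif):
    """Return (0-based position, strand) for exact motif hits on either strand."""
    rc = reverse_complement(motif)
    w = len(motif)
    hits = []
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits

# Second example sequence from the slide: the site is on the reverse strand.
print(find_on_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC"))
```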

18
Challenges: Base substitutions

Sequences do not have to match the motif
perfectly; base substitutions are allowed
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
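
One simple way to tolerate substitutions is to accept any window within a fixed number of mismatches of the consensus. This is only a sketch under that assumption (the slides do not prescribe a mismatch rule), and the names below are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(seq, motif, max_mismatches=1):
    """Return (0-based position, matched segment) for windows close to the motif."""
    w = len(motif)
    hits = []
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        if hamming(window, motif) <= max_mismatches:
            hits.append((i, window))
    return hits

# Second example sequence from the slide carries ATGTACC, one substitution from ATTTACC.
print(find_with_mismatches("CACATCGCATATGTACCACCAGTTCAGACACGGACGGC", "ATTTACC"))
```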

19
Challenges: Variable motif copies

Some sequences do not have the motif
Some have multiple copies of the motif
GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

21
Challenges: Two-block motifs

Some motifs have two parts
GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

22
Scan for Known TF Motif Sites

Experimental TF sites: TRANSFAC, JASPAR
Motif representation
Regular expression (binary decision)
Consensus: CACAAAA
Degenerate: CRCAAAW
IUPAC: R = A/G, W = A/T
23
Scan for Known TF Motif Sites

Experimental TF sites: TRANSFAC, JASPAR
Motif representation
Regular expression (binary decision)
Consensus: CACAAAA
Degenerate: CRCAAAW
Position weight matrix (PWM): needs a score cutoff

Motif Matrix
Sites
Pos 12345678
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
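
The step from aligned sites to a matrix, and from a matrix to a cutoff-based scan, can be sketched as follows. The pseudocount of 0.5, the uniform background, and the log2-odds score are assumptions made for illustration; TRANSFAC and JASPAR tools use their own scoring conventions.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def count_matrix(sites):
    """One Counter of base counts per motif position."""
    return [Counter(column) for column in zip(*sites)]

def probability_matrix(sites, pseudocount=0.5):
    """Per-position base probabilities, smoothed with a pseudocount (assumed value)."""
    n = len(sites)
    return [{b: (c[b] + pseudocount) / (n + 4 * pseudocount) for b in "ACGT"}
            for c in count_matrix(sites)]

def log_odds_score(segment, matrix, background):
    """Sum over positions of log2(motif probability / background probability)."""
    return sum(math.log2(matrix[j][b] / background[b]) for j, b in enumerate(segment))

def scan(seq, matrix, background, cutoff):
    """Return (0-based position, score) for every window scoring at least cutoff."""
    w = len(matrix)
    hits = []
    for i in range(len(seq) - w + 1):
        score = log_odds_score(seq[i:i + w], matrix, background)
        if score >= cutoff:
            hits.append((i, score))
    return hits

uniform = {b: 0.25 for b in "ACGT"}
pwm = probability_matrix(SITES)
print(scan("CCATGGCATGCC", pwm, uniform, cutoff=5.0))
```

Raising the cutoff trades sensitivity for specificity, which is why the slide notes that a PWM needs a score cutoff.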
24
IUPAC for DNA

Code  Bases    Name
A     A        adenosine
C     C        cytidine
G     G        guanine
T     T        thymidine
U     U        uridine
R     G A      purine
Y     T C      pyrimidine
K     G T      keto
M     A C      amino
S     G C      strong
W     A T      weak
B     C G T    not A
D     A G T    not C
H     A C T    not G
V     A C G    not T
N     A C G T  any
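
A degenerate IUPAC consensus such as CRCAAAW can be scanned by translating each code into a regular-expression character class. This is a small sketch with illustrative function names; the lookahead is used so that overlapping sites are also reported.

```python
import re

# IUPAC DNA codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[code] for code in consensus)

def find_sites(seq, consensus):
    """Return (0-based position, matched site), including overlapping matches."""
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]

print(iupac_to_regex("CRCAAAW"))
print(find_sites("TTCACAAAAGGCGCAAATCC", "CRCAAAW"))
```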

25
Protein Binding Microarrays

In vitro protein-DNA interactions
Better capture motifs

26
JASPAR

User defined cutoff to scan for a particular motif

27
A Word on Sequence Logo

SeqLogo consists of stacks of symbols, one stack
for each position in the sequence
The overall height of the stack indicates the
sequence conservation at that position
The height of symbols within the stack indicates
the relative frequency of nucleic acid at that
position

ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
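
The stack and symbol heights described above can be computed as in the sketch below. It assumes the common convention of 2 bits minus the column entropy for the stack height, and omits the small-sample correction that logo software usually applies.

```python
import math
from collections import Counter

def logo_heights(sites):
    """For each position, map each observed base to its symbol height in bits."""
    n = len(sites)
    heights = []
    for column in zip(*sites):
        freqs = {base: count / n for base, count in Counter(column).items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        stack_height = 2.0 - entropy  # conservation at this position
        heights.append({base: p * stack_height for base, p in freqs.items()})
    return heights

sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
for pos, column in enumerate(logo_heights(sites), start=1):
    print(pos, {base: round(h, 2) for base, h in column.items()})
```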
28
Scan Known TF Motifs

Drawbacks
Limited number of motifs
Limited number of sites to represent each motif
Low sensitivity and specificity
Poor description of motif
Binding site borders not clear
Binding site many mismatches
Many motifs look very similar
E.g. GC-rich motif, E-box (CACGTG)

29
De novo Sequence Motif Finding

Goal look for common sequence patterns enriched
in the input data (compared to the genome
background)
Regular expression enumeration
Pattern driven approach
Enumerate patterns, check significance in
dataset
Oligonucleotide analysis, MobyDick
Position weight matrix update
Data driven approach, use data to refine motifs
Consensus, EM Gibbs sampling
Motif score and Markov background

30
Regular Expression Enumeration

Oligonucleotide analysis: check
over-representation for every w-mer
Expected occurrence of w in the data:
consider genome sequence frequency and current data size
Observed occurrence of w in the data
An over-represented w is a potential TF binding motif
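
A rough sketch of the observed-versus-expected comparison. As a simplifying assumption, the expected count here comes from independent base frequencies and a normal-approximation z-score, rather than from w-mer counts in the whole genome as the slide suggests; all names are illustrative.

```python
import math
from collections import Counter

def count_wmers(seqs, w):
    """Count every overlapping w-mer across all input sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def overrepresented(seqs, w, base_freq, top=5):
    """Rank w-mers by z-score of observed count against a zero-order background."""
    counts = count_wmers(seqs, w)
    n_windows = sum(counts.values())
    results = []
    for word, observed in counts.items():
        p = math.prod(base_freq[b] for b in word)  # chance of the word at one window
        expected = n_windows * p
        z = (observed - expected) / math.sqrt(expected * (1 - p))
        results.append((word, observed, expected, z))
    results.sort(key=lambda r: r[3], reverse=True)
    return results[:top]

uniform = {b: 0.25 for b in "ACGT"}
print(overrepresented(["CATCAT", "GCATG"], 3, uniform, top=3))
```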
31
MobyDick

A sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}, Pw = {0.22, 0.28, 0.28, 0.22}
32
MobyDick

Sequence data and a dictionary of motif words
Check over-representation of every word-pair

D = {A, C, G, T, AA, GA, TA, GG}
Seq = AAGATAA
Possible partitions:
A A G A T A A: pA × pA × pG × pA × pT × pA × pA
AA G A T A A: pAA × pG × pA × pT × pA × pA
AA GA T A A: pAA × pGA × pT × pA × pA
AA GA TA A: pAA × pGA × pTA × pA
AA GA T AA: pAA × pGA × pT × pAA
...
Assign word probabilities so as to maximize the total
probability of generating the sequence
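
The total probability of generating a sequence is the sum, over all partitions into dictionary words, of the product of the word probabilities. The sketch below computes that sum with dynamic programming instead of listing partitions; it covers only this likelihood step, not MobyDick's probability fitting or dictionary growth, and the function name is hypothetical.

```python
def sequence_likelihood(seq, word_prob):
    """Sum over all partitions of seq into dictionary words of the product of word probabilities."""
    max_len = max(len(word) for word in word_prob)
    # total[i] = summed probability of all partitions of the first i bases
    total = [0.0] * (len(seq) + 1)
    total[0] = 1.0
    for end in range(1, len(seq) + 1):
        for length in range(1, min(max_len, end) + 1):
            word = seq[end - length:end]
            if word in word_prob:
                total[end] += total[end - length] * word_prob[word]
    return total[-1]

dictionary = ["A", "C", "G", "T", "AA", "GA", "TA", "GG"]
# With every word probability set to 1, the likelihood equals the number of partitions.
print(sequence_likelihood("AAGATAA", {word: 1.0 for word in dictionary}))
```

For AAGATAA and the dictionary on the slide there are 12 possible partitions, of which the slide lists a few.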

35
MobyDick

A sequence data and a dictionary of motif words
Check over-representation of every word-pair
Reassign word probability and consider every new
word-pair to build even longer words

36
RE Enumeration Derivatives
oligo-analysis, spaced dyads w1.ns.w2
IUPAC alphabet
Markov background (later)
2-bit encoding, fast index access
Enumerate limited RE patterns known for a TF
protein structure or interaction theme
Exhaustive, guaranteed to find global optimum,
and can find multiple motifs
Not as flexible with base substitutions, long
list of similar good motifs, and limited with
motif width

37
Consensus

Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence

38
Consensus

Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence

Remaining good motifs

39
Consensus

Starting from the 1st sequence, add one sequence
at a time to look for the best motifs obtained
with the additional sequence
G Stormo; the algorithm runs very fast
Sequence order plays a big role in performance
The first two sequences should contain the motif
Sites stop accumulating at the first bad sequence
Newer version allowing 0-n sites per sequence is much slower

40
Expectation Maximization and Gibbs Sampling Model

Objects
Seq: sequence data to search for motif
θ0: non-motif (genome background) probability
θ: motif probability matrix parameter
π: motif site locations
Problem: P(θ, π | seq, θ0)
Approach: alternately estimate
π by P(π | θ, seq, θ0)
θ by P(θ | π, seq, θ0)
EM and Gibbs differ in the estimation methods

41
Expectation Maximization

E step: π | θ, seq, θ0
TTGACGACTGCACGT
TTGAC p1
TGACG p2
GACGA p3
ACGAC p4
CGACT p5
GACTG p6
ACTGC p7
CTGCA p8
...

p1 = likelihood ratio = P(TTGAC | θ) / P(TTGAC | θ0)

P(TTGAC | θ0) = p0T × p0T × p0G × p0A × p0C = 0.3 × 0.3 × 0.2 × 0.3 × 0.2
42
Expectation Maximization

E step: π | θ, seq, θ0
TTGACGACTGCACGT
TTGAC p1
TGACG p2
GACGA p3
ACGAC p4
CGACT p5
GACTG p6
ACTGC p7
CTGCA p8
...

M step: θ | π, seq, θ0
p1 × TTGAC
p2 × TGACG
p3 × GACGA
p4 × ACGAC
...
Scale ACGT at each position; θ reflects the weighted
average of π
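
The two steps can be sketched as below. Two choices here are assumptions rather than part of the slides: the likelihood ratios are normalized to sum to 1 within each sequence (one site per sequence), and a small pseudocount keeps probabilities away from zero. theta is a list of per-position base probabilities and theta0 is the background.

```python
def e_step(seq, theta, theta0, w):
    """Weight of each w-mer: likelihood ratio, normalized within the sequence."""
    ratios = []
    for i in range(len(seq) - w + 1):
        ratio = 1.0
        for j, base in enumerate(seq[i:i + w]):
            ratio *= theta[j][base] / theta0[base]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]

def m_step(seqs, weights, w, pseudocount=0.1):
    """Rebuild theta as the weighted average of all w-mers."""
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
    for seq, seq_weights in zip(seqs, weights):
        for i, p in enumerate(seq_weights):
            for j, base in enumerate(seq[i:i + w]):
                counts[j][base] += p
    return [{b: c / sum(col.values()) for b, c in col.items()} for col in counts]

def em(seqs, theta, theta0, w, iterations=50):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(iterations):
        weights = [e_step(s, theta, theta0, w) for s in seqs]
        theta = m_step(seqs, weights, w)
    return theta

theta0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
start = [dict(theta0) for _ in range(5)]  # uninformative starting matrix
print(e_step("TTGACGACTGCACGT", start, theta0, 5))
```

Starting from a matrix equal to the background, every w-mer gets the same weight; a real run would start from several different seeds.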

43
EM Derivatives

First EM motif finder (C Lawrence)
Deterministic algorithm, guaranteed to reach a local optimum
MEME (TL Bailey)
Prior probability allows 0-n sites / sequence
Runs multiple EMs in parallel
with different seeds
User-friendly results

44
Gibbs Sampling

Stochastic process, although it may still need
multiple initializations
Sample π from P(π | θ, seq, θ0)
Sample θ from P(θ | π, seq, θ0)
Collapsed form
θ estimated with counts, not sampling from
Dirichlet
Sample site from one seq based on sites from
other seqs
Converged motif matrix θ and converged motif
sites π represent stationary distribution of a
Markov Chain

45
Gibbs Sampler

Randomly initialize a probability matrix

46
Gibbs Sampler

Take out one sequence with its sites from current
motif

[Figure: current sites from sequences 1-5; the site from sequence 1 is taken out]
47
Gibbs Sampler

Score each possible segment of this sequence

Sequence 1
Segment (1-8)
[Figure: sites from sequences 2-5 form the current motif]
48
Gibbs Sampler

Score each possible segment of this sequence

Sequence 1
Segment (2-9)
[Figure: sites from sequences 2-5 form the current motif]
49
Segment Score

Use current motif matrix to score a segment

50
Scoring Segments

Motif  1    2    3    4    5    bg
A      0.4  0.1  0.3  0.4  0.2  0.3
T      0.2  0.5  0.1  0.2  0.2  0.3
G      0.2  0.2  0.2  0.3  0.4  0.2
C      0.2  0.2  0.4  0.1  0.2  0.2
Ignore pseudo counts for now
Sequence: TTCCATATTAATCAGATTCCG
Segment scores:
TAATC
AATCA = 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG = 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA = 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT
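
The segment scores can be reproduced with a few lines, using the matrix and background exactly as tabulated (no pseudocounts). With the background of C taken as 0.2 from the table, TCAGA works out to about 0.667.

```python
# Motif matrix from the slide: probability of each base at positions 1-5.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}

def segment_score(segment):
    """Product over positions of motif probability divided by background probability."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= MOTIF[base][j] / BACKGROUND[base]
    return score

sequence = "TTCCATATTAATCAGATTCCG"
for i in range(len(sequence) - 5 + 1):
    segment = sequence[i:i + 5]
    print(segment, round(segment_score(segment), 6))
```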

51
Gibbs Sampler

Sample site from one seq based on sites from
other seqs

[Figure: sites from sequences 2-5 form the current motif]
52
How to Sample?
Pos    1  2  3   4   5   6   7   8   9
Score  3  1  12  5   8   9   1   2   6
SubT   3  4  16  21  29  38  39  41  47

X = Rand(subtotal), a random number between 0 and the final subtotal
Find the first position with subtotal larger than
X

Pos    1  2  3   4   5   6   7    8    9
Score  3  1  12  5   8   9   500  2    6
SubT   3  4  16  21  29  38  538  540  546
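
The subtotal lookup can be written with a running sum and a binary search. This is a minimal sketch; the function names are illustrative and positions are 1-based to match the tables.

```python
import bisect
import itertools
import random

def position_for(scores, x):
    """1-based index of the first position whose running subtotal exceeds x."""
    subtotals = list(itertools.accumulate(scores))
    index = bisect.bisect_right(subtotals, x)
    return min(index, len(scores) - 1) + 1  # clamp guards against x equal to the total

def sample_position(scores, rng=random):
    """Draw a position with probability proportional to its score."""
    x = rng.random() * sum(scores)
    return position_for(scores, x)

scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
print(position_for(scores, 20))   # subtotals 3, 4, 16, 21, ...: position 4
print(sample_position(scores))
```

In the second table, position 7 covers 500 of the 546 units, so it is drawn most of the time but not always.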
53
Gibbs Sampler

Repeat the process until motif converges

[Figure: sequence 1 now has a newly sampled site; sequences 2-5 keep their current sites]
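
Putting the steps together, a compact site sampler might look like the sketch below. It assumes exactly one site per sequence, a zero-order background estimated from the input, a fixed number of sweeps instead of a convergence test, and a pseudocount of 0.5; none of these choices come from the slides.

```python
import random

def gibbs_site_sampler(seqs, w, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    joined = "".join(seqs)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in "ACGT"}
    # Random initialization: one site per sequence.
    positions = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Build the motif matrix from the sites of all other sequences.
            counts = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
            for m, other in enumerate(seqs):
                if m != k:
                    for j, base in enumerate(other[positions[m]:positions[m] + w]):
                        counts[j][base] += 1
            total = len(seqs) - 1 + 4 * pseudocount
            # Score each possible segment of the held-out sequence.
            scores = []
            for i in range(len(seq) - w + 1):
                score = 1.0
                for j, base in enumerate(seq[i:i + w]):
                    score *= (counts[j][base] / total) / background[base]
                scores.append(score)
            # Sample the new site in proportion to its score.
            positions[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return positions

seqs = ["GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
        "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
        "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
        "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
        "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT"]
for seq, p in zip(seqs, gibbs_site_sampler(seqs, w=7)):
    print(p, seq[p:p + 7])
```

Because the process is stochastic, different seeds can settle on different alignments, which is why multiple initializations are recommended.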
54
Gibbs Sampler Intuition

Beginning
Randomly initialized motif
No preference towards any segment

55
Gibbs Sampler Intuition

Motif appears
Motif should have enriched signal (more sites)
By chance some correct sites come to alignment
Sites bias motif to attract other similar sites

56
Gibbs Sampler Intuition

Motif converges
All sites come to alignment
Motif totally biased to sample sites every time

57
Gibbs Sampler

Column shift
Metropolis algorithm
Propose π* as π shifted 1 column to left or right
Calculate motif scores u(π) and u(π*)
Accept π* with probability min(1, u(π*) / u(π))
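
The acceptance rule is short enough to sketch directly. The motif score itself is left to the caller, since the slides define it later; both function names are hypothetical and the current score is assumed to be positive.

```python
import random

def shifted_positions(positions, shift, seqs, w):
    """Shift every site by the same number of columns; None if any site leaves its sequence."""
    shifted = [p + shift for p in positions]
    for p, s in zip(shifted, seqs):
        if p < 0 or p + w > len(s):
            return None
    return shifted

def accept_shift(score_current, score_proposed, rng=random):
    """Metropolis rule: accept with probability min(1, proposed / current)."""
    return rng.random() < min(1.0, score_proposed / score_current)

# A better-scoring shift is always accepted; a worse one only some of the time.
print(accept_shift(1.0, 2.0))
print(accept_shift(2.0, 1.0))
```

This move helps when the sampler has locked onto an alignment that is off by a column or two from the true motif.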

58
Gibbs Sampling Derivatives

Gibbs Motif Sampler (JS Liu)
Add prior probability to allow 0-n site / seq
Sample motif positions to consider
AlignACE (F Roth)
Look for motifs from both strands
Mask out one motif to find more different motifs
BioProspector (XS Liu)
Use background model with Markov dependencies
Sampling with threshold (0-n sites / seq), new
scoring function
Can find two-block motifs with variable gap

59
Scoring Motifs

Information Content (also known as relative
entropy)
Suppose you have x aligned segments for the motif
pb(s1 from mtf) / pb(s1 from bg)
pb(s2 from mtf) / pb(s2 from bg)
pb(sx from mtf) / pb(sx from bg)

61
Scoring Motifs

pb(s1 from mtf) / pb(s1 from bg) ×
pb(s2 from mtf) / pb(s2 from bg) × ... ×
pb(sx from mtf) / pb(sx from bg)
= (pA1/pA0)^A1 × (pT1/pT0)^T1 × (pT2/pT0)^T2 × (pG2/pG0)^G2 × (pC2/pC0)^C2 × ...
(A1 = number of segments with A at position 1, pA1 = A1 / x, pA0 = background probability of A)
Take log of this:
A1 log(pA1/pA0) + T1 log(pT1/pT0) + T2 log(pT2/pT0) + G2 log(pG2/pG0) + ...
Divide by the number of segments (if all the
motifs have the same number of segments):
pA1 log(pA1/pA0) + pT1 log(pT1/pT0) + pT2 log(pT2/pT0) + ...

Pos 12345678
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
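
The final expression, the sum over positions and bases of p log(p / p0), can be computed from aligned sites as in this sketch. Base-2 logarithms (bits) and the example backgrounds are assumptions; the slide does not fix either.

```python
import math
from collections import Counter

def information_content(sites, background):
    """Sum over positions and observed bases of p * log2(p / background)."""
    x = len(sites)
    score = 0.0
    for column in zip(*sites):
        for base, count in Counter(column).items():
            p = count / x
            score += p * math.log2(p / background[base])
    return score

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}  # hypothetical AT-rich genome
sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
print(information_content(sites, uniform))
print(information_content(sites, at_rich))
```

Bases that are rare in the background contribute more to the score, so the same sites score differently against different genomes.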
63
Scoring Motifs

Original function: Information Content

Good: AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
Bad: ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
64
Scoring Motifs

Original function: Information Content
Which is better?
(data: 8 seqs)

Motif 1: AGGCTAAC AGGCTAAC
Motif 2: AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
65
Scoring Motifs

Motif scoring function
Prefer conserved motifs with many sites that are
not often seen in the genome background

66
Markov Background Increases Motif Specificity

Prefers motif segments enriched only in data,
but not so likely to occur in the background
Segment ATGTA score =
p(generate ATGTA from θ) / p(generate ATGTA from θ0)

Background probability of a segment, independent bases vs. Markov background:
TCAGC = .25 × .25 × .25 × .25 × .25 vs. .3 × .18 × .16 × .22 × .24
ATATA = .25 × .25 × .25 × .25 × .25 vs. .3 × .41 × .38 × .42 × .30
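
A Markov background conditions each base on the bases before it, so segments that resemble common background patterns (such as AT repeats) get a high background probability and a lower motif score. The sketch below uses a first-order model with add-one smoothing trained on a toy background string; the order, the smoothing, and the flat start probability are all assumptions.

```python
from collections import defaultdict

def train_markov(background_seq, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from a background sequence."""
    counts = defaultdict(lambda: {b: pseudocount for b in "ACGT"})
    for i in range(order, len(background_seq)):
        context = background_seq[i - order:i]
        counts[context][background_seq[i]] += 1
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

def markov_probability(segment, model, order=1, start_prob=0.25):
    """Probability of generating the segment under the Markov background."""
    uniform = {b: 0.25 for b in "ACGT"}
    p = start_prob ** order  # the first `order` bases have no context
    for i in range(order, len(segment)):
        context = segment[i - order:i]
        p *= model.get(context, uniform)[segment[i]]  # unseen context: fall back to uniform
    return p

model = train_markov("ATATATATATATATATATAT")
print(markov_probability("ATATA", model))
print(markov_probability("TCAGC", model))
```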
67
Position Weight Matrix Update

Advantage
Can look for motifs of any widths
Flexible with base substitutions
Disadvantage
EM and Gibbs sampling: no guaranteed convergence
time
No guaranteed global optimum

68
Motif Finding in Bacteria

Promoter sequences are short (200-300 bp)
Motifs are usually long (10-20 bases)
Some have two blocks with a gap, some are
palindromes
Long motifs are usually very degenerate
Single microarray experiment sometimes already
provides enough information to search for TF
motifs

69
Motif Finding in Lower Eukaryotes

Upstream sequences are longer (500-1000 bp), with
some simple repeats
Motif width varies (5-17 bases)
Expression clusters provide input sequences of
decent quality for TF motif finding
Motif combination and redundancy appear,
although single motifs are usually significant
enough for identification

70
Yeast Promoter Architecture

Co-occurring regulators suggest physical
interaction between the regulators

71
Motif Finding in Higher Eukaryotes

Upstream sequences are very long (3 kb-20 kb) with
repeats; TF motifs could appear downstream
Motifs can be short or long (6-20 bases), and
appear in combination and clusters
Gene expression cluster is not good enough input
Need:
Comparative genomics: PhastCons score
Motif modules: motif clusters
ChIP-chip/seq

72
Yeast Regulatory Sequence Conservation
73
UCSC PhastCons Conservation

Functional regulatory sequences are under
stronger evolutionary constraint
Align orthologous sequences together
PhastCons conservation score (0-1) for each
nucleotide in the genome can be downloaded from
UCSC

74
Conserved Motif Clusters

First find conserved regions in the genome
Then look for repeated transcription factor (TF)
binding sites
They form transcription factor modules

75
Summary

Biology and challenge of transcription regulation
Scan for known TF motif sites: TRANSFAC, JASPAR
De novo method
Regular expression enumeration
Oligonucleotide analysis
MobyDick: build long motifs from short ones
Position weight matrix update
CONSENSUS (sequence order)
EM (iterate π, θ; θ is the π-weighted average)
Gibbs Sampler (sample π, θ; Markov chain
convergence)
Motif score and Markov background
Motif cluster and motif conservation
