Title: CS5263 Bioinformatics
1CS5263 Bioinformatics
2What is a (biological) motif?
- A motif is a recurring fragment, theme, or pattern
- Sequence motif: a sequence pattern of nucleotides in a DNA sequence or amino acids in a protein
- Structural motif: a pattern in a protein structure formed by the spatial arrangement of amino acids
- Network motif: patterns that occur in different parts of a network at frequencies much higher than those found in randomized networks
- Commonality:
- Higher frequency than would be expected by chance
- Has, or is conjectured to have, a biological significance
3(Sequence) motif finding
- Given a set of sequences
- Goal: find sequence motifs that appear in all or the majority of the sequences, and are likely associated with some functions
- In DNA: regulatory sequences
- In protein: functional/structural domains
4Roadmap
- Biological background
- Representation of motifs
- Algorithms for finding motifs
- Other issues
- Distinguish functional vs non-functional motifs
- Search for instances of given motifs
- Interpretation of motifs
5- In motif finding, understanding the motivations, the significance of the problems, the difficulties, and the ideas that have been explored is more important than knowing the details of the existing algorithms!
- Most algorithms perform poorly on real challenges!
- Not necessarily a fault of algorithm designers
- Algorithms will be improved
6- Biological background for motif finding
7Cells respond to environment
[Figure labels: various external messages (heat, food supply); the cell responds to environmental conditions]
8Genome is fixed Cells are dynamic
- A genome is static
- Every cell in our body has a copy of same genome
- A cell is dynamic
- Responds to external conditions
- Most cells follow a cell cycle of division
- Cells differentiate during development
9Gene regulation
- is responsible for the dynamic cell
- Gene expression (production of protein) varies according to:
- Cell type
- Cell cycle
- External conditions
- Location
10Where gene regulation takes place
- Opening of chromatin
- Transcription
- Translation
- Protein stability
- Protein modifications
11Transcriptional Regulation
- Strongest regulation happens during transcription
- Best place to regulate
- No energy wasted making intermediate products
- However, slowest response time
- After a receptor notices a change
- Cascade message to nucleus
- Open chromatin and bind transcription factors
- Recruit RNA polymerase and transcribe
- Splice mRNA and send to cytoplasm
- Translate into protein
12Transcription Factors Binding to DNA
- Transcriptional regulation
- Certain transcription factors bind to DNA
- Binding recognizes DNA substrings
- Regulatory motifs
13Regulation of Genes
[Figure, built up over slides 13-16. Labels: Transcription Factor (TF) (protein), RNA polymerase (protein), DNA, Gene, Promoter, New protein]
- The site a TF binds is called a regulatory element, TF binding site, TF binding motif, or cis-regulatory motif (element)
17The Cell as a Regulatory Network
[Figure: regulatory network over genes A, B, C, and D, with "Make B" and "Make D" outputs and the rules: if C then D; if B then NOT D; if A and B then D; if D then B]
18Code for protein-DNA binding?
Some knowledge exists
19However, overall code still missing
21Experimental methods
- Determining a protein-DNA binding site is tedious and time-consuming
- Determining the binding specificity is even harder
- Involves mutating different combinations of nucleotides in the promoter region and observing the biological effects
- Computational methods can help
22Finding Regulatory Motifs
- Given a collection of genes that are believed to be regulated by the same protein,
- Find the common TF-binding motif from their promoters
23Essentially a Multiple Local Alignment
- Find the best multiple local alignment
24- Then why don't we just use multiple sequence alignment algorithms, such as multidimensional dynamic programming?
25Characteristics of Regulatory Motifs
- Tiny (6-12bp)
- Intergenic regions are very long
- Highly Variable
- Constant Size
- Because a constant-size transcription factor binds
- Often repeated
- Often conserved
27Motif representation
- Collection of exact words
- ACGTTAC, ACGCTAC, AGGTGAC,
- Consensus sequence (with wild cards)
- AcGTgTtAC
- ASGTKTKAC, where S = C/G and K = G/T (IUPAC code)
- Position specific weight matrices
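The consensus-with-wildcards representation can be sketched in code. This is a minimal illustration, not part of the original slides: the helper names `consensus_to_regex` and `find_instances` and the example sequence are made up, and only a subset of the IUPAC nucleotide codes is included.

```python
import re

# Subset of the IUPAC nucleotide codes: each symbol maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def find_instances(consensus, sequence):
    """Return (start, word) for every match, including overlapping ones."""
    pattern = consensus_to_regex(consensus)
    k = len(consensus)
    hits = []
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        if pattern.fullmatch(word):
            hits.append((start, word))
    return hits

# Hypothetical sequence containing two instances of ASGTKTKAC.
print(find_instances("ASGTKTKAC", "TTACGTGTTACGGAGGTTTGACAA"))
```

Scanning window by window, rather than calling `finditer`, keeps overlapping instances, which a single left-to-right regex pass would skip.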
28Position Specific Weight Matrix
1 2 3 4 5 6 7 8 9
A .97 .10 .02 .03 .10 .01 .05 .85 .03
C .01 .40 .01 .04 .05 .01 .05 .05 .03
G .01 .40 .95 .03 .40 .01 .3 .05 .03
T .01 .10 .02 .90 .45 .97 .6 .05 .91
A S G T K T K A C
29Sequence Logo
frequency
1 2 3 4 5 6 7 8 9
A .97 .10 .02 .03 .10 .01 .05 .85 .03
C .01 .40 .01 .04 .05 .01 .05 .05 .03
G .01 .40 .95 .03 .40 .01 .3 .05 .03
T .01 .10 .02 .90 .45 .97 .6 .05 .91
31Entropy and information content
- Entropy: a measure of uncertainty
- The entropy of a random variable X that can assume the n different values $x_1, x_2, \ldots, x_n$ with the respective probabilities $p_1, p_2, \ldots, p_n$ is defined as
  $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
32Entropy and information content
- Example: A, C, G, T with equal probability
- $H = 4 \times (-0.25 \log_2 0.25) = \log_2 4 = 2$ bits
- Need 2 bits to encode (e.g. 00 = A, 01 = C, 10 = G, 11 = T)
- Maximum uncertainty
- 50% A and 50% C
- $H = 2 \times (-0.5 \log_2 0.5) = \log_2 2 = 1$ bit
- 100% A
- $H = 1 \times (-1 \log_2 1) = 0$ bits
- Minimum uncertainty
- Information: the opposite of uncertainty
- I = maximum uncertainty - entropy
- The above examples provide 0, 1, and 2 bits of information, respectively
33Entropy and information content
1 2 3 4 5 6 7 8 9
A .97 .10 .02 .03 .10 .01 .05 .85 .03
C .01 .40 .01 .04 .05 .01 .05 .05 .03
G .01 .40 .95 .03 .40 .01 .3 .05 .03
T .01 .10 .02 .90 .45 .97 .6 .05 .91
H .24 1.72 .36 .63 1.60 0.24 1.40 0.85 0.58
I 1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
Mean I: 1.15 bits per position
Total I: 10.4 bits
Expected occurrence in random DNA: $1 / 2^{10.4} \approx 1 / 1340$
Expected occurrence of an exact 5-mer: $1 / 2^{10} = 1 / 1024$
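The H and I rows of the table can be reproduced with a few lines of code. This is a sketch assuming a uniform background, so the maximum uncertainty is 2 bits per position; the function names are illustrative.

```python
import math

# PWM from the slides: one row of probabilities per base, one column per position.
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def column_entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(pwm):
    """Per-position information: maximum uncertainty minus entropy."""
    length = len(next(iter(pwm.values())))
    max_entropy = math.log2(len(pwm))  # 2 bits for a 4-letter alphabet
    info = []
    for j in range(length):
        column = [pwm[base][j] for base in pwm]
        info.append(max_entropy - column_entropy(column))
    return info

info = information_content(PWM)
print([round(i, 2) for i in info])  # per-position information, in bits
print(round(sum(info), 1))          # total information of the motif
```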
34Sequence Logo
1 2 3 4 5 6 7 8 9
A .97 .10 .02 .03 .10 .01 .05 .85 .03
C .01 .40 .01 .04 .05 .01 .05 .05 .03
G .01 .40 .95 .03 .40 .01 .3 .05 .03
T .01 .10 .02 .90 .45 .97 .6 .05 .91
I 1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
35Background-normalized Seq Logo
- Many genomes have a skewed base distribution
- In a thermophilic bacterium (e.g. one living in a hot spring), GC content can be as high as 70%.
- Thus a motif ATAT in the genome of a thermophilic bacterium would contain more information than a motif GCGC
36Relative Entropy
- Definition 6.1. Let P and Q be two probability measures on the same alphabet X. Then the relative entropy (information divergence, Kullback-Leibler distance, discrimination) from P to Q is defined as
  $D(P \| Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}$
- Easy to prove that if Q is a uniform distribution, $D(P \| Q)$ is equal to the information content of P
37Relative Entropy
- Background: pA = pT = 0.2, pC = pG = 0.3
- Distribution on some column of a PWM
- Case 1: pA = 0.85, pC = pG = pT = 0.05
- Case 2: pG = 0.85, pC = pA = pT = 0.05
- Assuming uniform background distribution
- I1 = I2 = 1.15
- With the non-uniform background distribution
- D1 = 1.42
- D2 = 0.95
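The two cases can be checked numerically. This is a small sketch of the relative-entropy calculation with base-2 logarithms; the dictionary layout and the function name are illustrative choices.

```python
import math

def relative_entropy(p, q):
    """D(P || Q) in bits, summed over the symbols with non-zero P."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)

background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
uniform = {b: 0.25 for b in "ACGT"}
case1 = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
case2 = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}

print(round(relative_entropy(case1, uniform), 2))     # information content I1
print(round(relative_entropy(case1, background), 2))  # D1
print(round(relative_entropy(case2, background), 2))  # D2
```

An A-rich column is more surprising than a G-rich one against this GC-rich background, which is why D1 exceeds D2 even though both columns have the same information content under a uniform background.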
38Background-normalized Seq Logo
1 2 3 4 5 6 7 8 9
A .97 .10 .02 .03 .10 .01 .05 .85 .03
C .01 .40 .01 .04 .05 .01 .05 .05 .03
G .01 .40 .95 .03 .40 .01 .3 .05 .03
T .01 .10 .02 .90 .45 .97 .6 .05 .91
I (uniform background) 1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42
I (background-normalized) 2 .13 1.35 1.6 0.45 2 .70 1.37 1.65
39Physical interpretation
- Information content is inversely proportional to the binding energy
- High information content => lower energy => high affinity of binding
- Relative entropy represents the specificity of the binding sites compared to random DNA sequences
40Real example
- E. coli promoter
- "TATA-Box": 10bp upstream of transcription start
- TACGAT
- TAAAAT
- TATACT
- GATAAT
- TATGAT
- TATGTT
Consensus: TATAAT
Note: none of the instances matches the consensus perfectly
42Definitions of terms
- Motif: a consensus sequence or a PWM
- Pattern: alias for motif (used in combinatorial motif finding)
- Instance of a motif: a substring of a sequence that matches the motif
- How to define a match will be shown later
43Motif finding schemes
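Input data, by whether whole genomes are used and whether conservation is used:
Whole genome = Yes, Conservation = Yes: Genome 1, 2, 3
Whole genome = Yes, Conservation = No: Genome 1
Whole genome = No, Conservation = Yes: Gene 1A, 1B, 1C, or Gene Set 1, 2, 3
Whole genome = No, Conservation = No: Gene Set 1
[Figure labels: Phylogenetic footprinting; Dictionary building; Motif finding; genes 1A, 1B, 1C; gene sets 1, 2, 3; genomes 1, 2, 3]
Ideally, all information should be used at some stage, i.e., inside the algorithm vs. in pre- or post-processing.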
44Classification of approaches
- Combinatorial search
- Based on enumeration of words and computing word similarities
- Analogy to DP for sequence alignment
- Probabilistic modeling
- Construct models to distinguish motifs vs non-motifs
- Analogy to HMM for sequence alignment
45Combinatorial motif finding
- Idea 1: find all k-mers that appear at least m times
- Idea 2: find all k-mers that are statistically significant
- Problem: most motifs allow divergence. Each variation may only appear once.
- Idea 3: find all k-mers, considering IUPAC code
- e.g. ASGTKTKAC, S = C/G, K = G/T
- Still inflexible
- Idea 4: find k-mers that approximately appear at least m times
- i.e. allow some mismatches
46Combinatorial motif finding
- Given a set of sequences $S = \{x_1, \ldots, x_n\}$
- A motif W is a consensus string $w_1 \ldots w_K$
- Find the motif $W^*$ with the best match to $x_1, \ldots, x_n$
- Definition of "best":
- $d(W, x_i)$ = min Hamming distance between W and a word in $x_i$
- $d(W, S) = \sum_i d(W, x_i)$
- $W^* = \arg\min_W d(W, S)$
47Exhaustive searches
- 1. Pattern-driven algorithm
- For W = AA...A to TT...T ($4^K$ possibilities)
- Find $d(W, S)$
- Report $W^* = \arg\min_W d(W, S)$
- Running time: $O(K N 4^K)$
- (where $N = \sum_i |x_i|$)
- Guaranteed to find the optimal solution.
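A direct, unoptimized sketch of the pattern-driven search follows. The function names are illustrative, and the input is the set of six E. coli promoter instances listed earlier in the deck, treated here as six very short sequences.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def distance_to_sequence(word, seq):
    """d(W, x_i): minimum Hamming distance between W and any |W|-long word of seq."""
    k = len(word)
    return min(hamming(word, seq[i:i + k]) for i in range(len(seq) - k + 1))

def total_distance(word, seqs):
    """d(W, S): sum of d(W, x_i) over all sequences."""
    return sum(distance_to_sequence(word, s) for s in seqs)

def pattern_driven_search(seqs, k):
    """Try all 4^k patterns; return the first one with minimum d(W, S)."""
    best_word, best_dist = None, None
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        dist = total_distance(word, seqs)
        if best_dist is None or dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist

instances = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
print(pattern_driven_search(instances, 6))
```

On these six words the minimum total distance is 8, reached by both TATAAT and TATGAT; the sketch reports whichever comes first in enumeration order, so the optimum need not be unique.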
48Exhaustive searches
- 2. Sample-driven algorithm
- For W = a K-long word in some $x_i$
- Find $d(W, S)$
- Report $W^* = \arg\min_W d(W, S)$
- OR report a local improvement of $W^*$
- Running time: $O(K N^2)$
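The sample-driven variant only scores words that actually occur in the input. A minimal sketch, with illustrative function names and the six E. coli promoter instances from the earlier example as input:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(word, seqs):
    """d(W, S): for each sequence take the closest window, then sum."""
    k = len(word)
    return sum(
        min(hamming(word, s[i:i + k]) for i in range(len(s) - k + 1))
        for s in seqs
    )

def sample_driven_search(seqs, k):
    """Score only the k-long words present in the sequences."""
    candidates = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    best = min(candidates, key=lambda w: total_distance(w, seqs))
    return best, total_distance(best, seqs)

instances = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
print(sample_driven_search(instances, 6))
```

The consensus TATAAT never occurs in this data, so the search cannot return it; here it reports TATGAT, one of the observed instances.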
49Exhaustive searches
- Problem with sample-driven approach
- If
- True motif does not occur in data, and
- True motif is weak
- Then,
- random strings may score better than any instance
of true motif
50Example
- E. coli promoter
- "TATA-Box": 10bp upstream of transcription start
- TACGAT
- TAAAAT
- TATACT
- GATAAT
- TATGAT
- TATGTT
Consensus: TATAAT
Each instance differs from the consensus by at most 2 bases
None of the instances matches the consensus perfectly
51Heuristic methods
- Cannot afford exhaustive search on all patterns
- Sample-driven approaches may miss real patterns
- However, a real pattern should not differ too much from its instances in S
- Start from the space of all words in S, extend to the space with real patterns
52Some of the popular tools
- Consensus (Hertz & Stormo, 1999)
- WINNOWER (Pevzner & Sze, 2000)
- MULTIPROFILER (Keich & Pevzner, 2002)
- PROJECTION (Buhler & Tompa, 2001)
- WEEDER (Pavesi et al., 2001)
- And a dozen others
53Consensus
- Algorithm
- Cycle 1
- For each word W in S
- For each word W' in S
- Create alignment (gap free) of W, W'
- Keep the $C_1$ best alignments, $A_1, \ldots, A_{C_1}$
- ACGGTTG , CGAACTT , GGGCTCT
- ACGCCTG , AGAACTA , GGGGTGT
54- Algorithm (cont'd)
- Cycle i
- For each word W in S
- For each alignment $A_j$ from cycle i-1
- Create alignment (gap free) of W, $A_j$
- Keep the $C_i$ best alignments $A_1, \ldots, A_{C_i}$
55- $C_1, \ldots, C_n$ are user-defined heuristic constants
- Running time
- $O(kN^2) + O(kN C_1) + O(kN C_2) + \ldots + O(kN C_n)$
- $= O(kN^2 + kN C_{total})$
- Where $C_{total} = \sum_i C_i$, typically $O(nC)$, where C is a big constant
56Extended sample-driven (ESD) approaches
- Hybrid between pattern-driven and sample-driven
- Assume each instance differs from the motif by no more than α bases (α usually depends on k)
[Figure: a motif and an instance at distance α, inside the α-neighborhood]
- The real motif will reside in the α-neighborhood of some word in S.
- Instead of searching all $4^K$ patterns, we can search the α-neighborhood of every word in S.
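To make the size of a neighborhood concrete, here is a small sketch that enumerates every string within a given Hamming distance of a word. The function name and the parameter `alpha` (the mismatch bound) are illustrative.

```python
from itertools import combinations, product

def neighborhood(word, alpha, alphabet="ACGT"):
    """All strings within Hamming distance alpha of word, including word itself."""
    result = {word}
    for d in range(1, alpha + 1):
        # Choose which d positions to change, then every way to change them.
        for positions in combinations(range(len(word)), d):
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for replacement in product(*choices):
                letters = list(word)
                for p, c in zip(positions, replacement):
                    letters[p] = c
                result.add("".join(letters))
    return result

near = neighborhood("TACGAT", 2)
print(len(near), "TATAAT" in near)
```

For a word of length K the neighborhood holds sum over d of (K choose d) 3^d strings, so for K = 6 and a bound of 2 it has 1 + 18 + 135 = 154 members. That is far fewer than the 4^6 = 4096 patterns of the exhaustive search, and it still contains the consensus TATAAT.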
57WEEDER
[Figure labels: # of patterns to test; # of words in sequences]
58Better idea
- Using a joint suffix tree, find all patterns that:
- Have length K
- Appear in at least m sequences with at most α mismatches
- Post-processing
59WEEDER algorithm sketch
Current pattern P, |P| < K
- A list containing all eligible nodes with at most α mismatches to P
- For each node, remember the mismatches accumulated (e) and a bit vector (B) for sequence occurrence, e.g. 011100010
- Bit OR all Bs to get the sequence occurrence for P
- Suppose occ > m
- Pattern still valid
- Now add a letter
[Figure: suffix-tree nodes matching the current pattern ACGTT, each labeled with (e, B): mismatches and sequence occurrence]
60WEEDER algorithm sketch
Current pattern P
[Figure: current pattern extended to ACGTTA, nodes labeled (e, B)]
- Simple extension: no branches.
- No change to B
- e may increase by 1 or stay unchanged
- Drop node if e > α
- Branches: replace a node with its child nodes
- Drop if e > α
- B may change
- Re-do Bit OR using all Bs
- Try a different char if occ < m
- Report P when |P| = K
61WEEDER complexity
- Can get all D(P, S) in time
- $O(nN \binom{K}{\alpha} 3^{\alpha}) = O(nN K^{\alpha} 3^{\alpha})$.
- n: # of sequences. Needed for Bit OR.
- Better than $O(KN 4^K)$ since usually α << K
- $K^{\alpha} 3^{\alpha}$ may still be expensive for large K
- E.g. K = 20, α = 6
62WEEDER More tricks
Current pattern P
[Figure: current pattern ACGTTA]
- Instead of eligible nodes with at most α mismatches to P,
- use eligible nodes with at most min(εL, α) mismatches to P
- L: current pattern length
- ε: error ratio
- Requires the mismatches to be somewhat evenly distributed among positions
- Prune tree at length K
63MULTIPROFILER
- W differs from W' at α positions.
- The consensus sequence for the words in the α-neighborhood of W is similar to W'.
- If we ignore all the chars that are similar to W, the rest may suggest the difference between W and W'
[Figure: W = ACGTACG, W' = ATGTAAG]
64MULTIPROFILER alg sketch
- For each word P in S
- Find its α-neighborhood in S
- List of words C
- For each position j from 1..K of the words in C
- Find the most popular char that differs from $P_j$
- Replace α positions in P with the chars found above
- Call the new word P'
- $W^* = \arg\min D(P', S)$
65MULTIPROFILER
- No complexity provided in the paper
- More efficient than WEEDER for longer patterns: $N < K^{\alpha} 3^{\alpha}$
- How to choose α is an issue
- Large α: too much noise in the neighborhood
- Small α: few true instances in the neighborhood
66- Probabilistic modeling approaches for motif finding
67Probabilistic modeling approaches
- A motif model
- Usually a PWM
- $M = (P_{ij})$, i = 1..4, j = 1..k, k: motif length
- A background model
- Usually the distribution of base frequencies in the genome (or other selected subsets of sequences)
- $B = (b_i)$, i = 1..4
- A word can be generated by M or B
68Expectation-Maximization
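- For any word W,
- $P(W \mid M) = P_{W_1,1} P_{W_2,2} \cdots P_{W_K,K}$
- $P(W \mid B) = b_{W_1} b_{W_2} \cdots b_{W_K}$
- Let $\lambda = P(M)$, i.e., the probability for any word to be generated by M.
- Then $P(B) = 1 - \lambda$
- Can compute the posterior probabilities $P(M \mid W)$ and $P(B \mid W)$
- $P(M \mid W) \propto P(W \mid M) \lambda$
- $P(B \mid W) \propto P(W \mid B) (1 - \lambda)$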
69Expectation-Maximization
- Initialize
- Randomly assign each word to M or B
- Let $Z_{xy} = 1$ if position y in sequence x is a motif, and 0 otherwise
- Estimate parameters M, λ, B
- Iterate until convergence
- E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
- M-step: re-estimate M, λ given Z (B usually fixed)
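The E-step and M-step can be sketched as a two-component mixture over all k-long words. This is an illustrative sketch, not MEME itself, and it makes several assumptions beyond the slides: the PWM is initialized from a seed word rather than from a random assignment, a pseudocount keeps probabilities non-zero, the background stays fixed, and a fixed iteration count stands in for a convergence test. All function names and toy sequences are made up.

```python
import math

BASES = "ACGT"

def seed_pwm(word, hit=0.7):
    """Initial PWM (a list of per-position dicts) built from a seed word."""
    miss = (1.0 - hit) / 3
    return [{b: (hit if b == c else miss) for b in BASES} for c in word]

def e_step(seqs, pwm, bg, lam, k):
    """Z[x][y] = posterior probability that the word at y in sequence x came from M."""
    z = []
    for s in seqs:
        row = []
        for y in range(len(s) - k + 1):
            w = s[y:y + k]
            pm = lam * math.prod(pwm[j][c] for j, c in enumerate(w))
            pb = (1 - lam) * math.prod(bg[c] for c in w)
            row.append(pm / (pm + pb))
        z.append(row)
    return z

def m_step(seqs, z, k, pseudocount=0.1):
    """Re-estimate the PWM and the mixing weight from the soft assignments Z."""
    counts = [{b: pseudocount for b in BASES} for _ in range(k)]
    total_z, n_words = 0.0, 0
    for s, row in zip(seqs, z):
        for y, weight in enumerate(row):
            for j in range(k):
                counts[j][s[y + j]] += weight
            total_z += weight
            n_words += 1
    pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
    return pwm, total_z / n_words

def run_em(seqs, pwm, bg, lam, k, iterations=20):
    for _ in range(iterations):
        z = e_step(seqs, pwm, bg, lam, k)
        pwm, lam = m_step(seqs, z, k)
    return pwm, lam, z

# Toy input: three hypothetical sequences, uniform background.
seqs = ["CCTATAATGG", "GGTATGATCC", "ACTATACTGA"]
bg = {b: 0.25 for b in BASES}
pwm, lam, z = run_em(seqs, seed_pwm("TATAAT"), bg, lam=0.1, k=6)
```

Because EM is deterministic, the result depends entirely on the starting point, which is why practical tools try multiple starting points.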
70Expectation-Maximization
[Figure: probability vs. position (positions 1, 5, 9 marked) after the Initialize, E-step, and M-step stages]
- E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
- M-step: re-estimate M, λ given Z
71MEME
- Multiple EM for Motif Elicitation
- Bailey and Elkan, UCSD
- http://meme.sdsc.edu/
- Multiple starting points
- Multiple modes: ZOOPS, OOPS, TCM
72Gibbs Sampling
- Another very useful technique for estimating missing parameters
- EM is deterministic
- Often trapped by local optima
- Gibbs sampling: stochastic behavior to avoid local optima
73Gibbs sampling
- Initialize
- Randomly assign each word to M or B
- Let $Z_{xy} = 1$ if position y in sequence x is a motif, and 0 otherwise
- Estimate parameters M, B, λ
- Iterate
- Randomly remove a sequence X from S
- Recalculate model parameters using S \ X
- Compute $Z_{xy}$ for X
- Sample a position y* from $Z_{xy}$.
- Let $Z_{xy} = 1$ for y = y* and 0 otherwise
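The iteration above can be sketched as a simple site sampler. This sketch simplifies the slide's formulation by assuming exactly one motif instance per sequence, so there is no mixing weight, and it uses a pseudocount and a fixed number of iterations. The function names, toy sequences, and parameters are illustrative.

```python
import random

BASES = "ACGT"

def build_pwm(words, k, pseudocount=0.5):
    """Column-wise base frequencies of the given words, with pseudocounts."""
    pwm = []
    for j in range(k):
        counts = {b: pseudocount for b in BASES}
        for w in words:
            counts[w[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def gibbs_sampler(seqs, k, bg, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        x = rng.randrange(len(seqs))  # sequence to hold out
        others = [s[p:p + k] for i, (s, p) in enumerate(zip(seqs, starts)) if i != x]
        pwm = build_pwm(others, k)    # model re-estimated from S \ X
        weights = []
        for y in range(len(seqs[x]) - k + 1):
            ratio = 1.0
            for j, c in enumerate(seqs[x][y:y + k]):
                ratio *= pwm[j][c] / bg[c]  # motif vs. background likelihood ratio
            weights.append(ratio)
        # Sample a position in proportion to its weight, not simply the maximum.
        starts[x] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["CCTATAATGG", "GGTATGATCC", "ACTATACTGA"]
bg = {b: 0.25 for b in BASES}
print(gibbs_sampler(seqs, 6, bg, iterations=100, seed=1))
```

Sampling in proportion to the weights, rather than always taking the best position, is what lets the procedure escape local optima.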
74Gibbs Sampling
[Figure: probability vs. position, with one position sampled]
- Gibbs sampling: sample one position according to probability
- Update prediction of one training sequence at a time
- Viterbi: always take the highest
- EM: take weighted average
- Simultaneously update predictions of all sequences
75Gibbs sampling motif finders
- Gibbs Sampler, based on C. Lawrence et al., Science, 1993
- AlignACE, Nat Biotech 1998, developed in the Church lab, Harvard Univ
- BioProspector, X. Liu et al., PSB 2001, an improvement of AlignACE
76Better background model
- Repeat DNA can be confused as motif
- Especially low-complexity: CACACA, AAAAA, etc.
- Solution: more elaborate background model
- Higher-order Markov model
- 0th order: B = { pA, pC, pG, pT }
- 1st order: B = { P(A|A), P(A|C), ..., P(T|T) }
- ...
- Kth order: B = { P(X | b_1...b_K) }; X, b_i ∈ {A, C, G, T}
- Has been applied to EM and Gibbs (up to 3rd order)
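A higher-order background model can be estimated by counting each base together with its preceding context. The sketch below uses illustrative function names and two simplifications that are assumptions, not from the slides: a pseudocount of 1, and the first `order` bases of a word are not scored.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in alphabet}
    return model

def log2_prob(word, model, order, alphabet="ACGT"):
    """log2 P(word | B), conditioning each base on the `order` bases before it."""
    uniform = 1.0 / len(alphabet)
    total = 0.0
    for i in range(order, len(word)):
        dist = model.get(word[i - order:i])
        p = dist[word[i]] if dist else uniform  # unseen context: fall back to uniform
        total += math.log2(p)
    return total

# Hypothetical low-complexity background.
model = train_markov(["CACACACACACA"], order=1)
print(log2_prob("CACACA", model, 1), log2_prob("CCCCCC", model, 1))
```

Under a first-order model trained on CA repeats, CACACA is far more probable than CCCCCC, so a repeat-like candidate motif earns a much lower log-odds score than it would under a 0th-order background.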
77Limits of Motif Finders
[Figure: upstream region of a gene, labeled from ??? to 0]
- Given upstream regions of coregulated genes
- Increasing length makes motif finding harder: random motifs clutter the true ones
- Decreasing length makes motif finding harder: true motif missing in some sequences
78Challenging problem
[Figure: n = 20 sequences of length L = 1000, each containing a motif instance of length k with d mutations]
- (k, d)-motif challenge problem
- Many algorithms fail at the (15, 4)-motif for n = 20 and L = 1000
- Combinatorial algorithms usually work better on the challenge problem
- However, they are usually designed to find (k, d)-motifs
- Performance on real data varies
79Motif finding in practice
- Now we've found some good-looking motifs
- Easiest step?
- What to do next?
- Are they real?
- How do we find more instances in the rest of the genome?
- What are their functional meanings?
- Motifs => regulatory networks
80To make sense about the motifs
- Each program usually reports a number of motifs (tens to hundreds)
- Many motifs are variations of each other
- Each program also reports some different ones
- Each program has its own way of scoring motifs
- Best-scored motifs are often not interesting
- AAAAAAAA
- ACACACAC
- TATATATAT
81Strategies to improve results
- Combining results from different algorithms is usually helpful
- Ones that appear multiple times are probably more interesting
- Except simple repeats like AAAAA or ATATATATA
- Will talk about this later.
- Cluster motifs into groups. Issues:
- Measure similarities between two motifs (PWMs)
- # of clusters
82Strategies to improve results
- Compare with known motifs in database
- TRANSFAC
- JASPAR
- Issues
- Compute similarities among motifs
- How similar is similar?
83Strategies to improve results
- Statistical test of significance
- Enrichment in target sequences vs background sequences
- Target set T: assumed to contain a common motif, P
- Background set B: assumed to not contain P, or to contain it with very low frequency
- Ideal case: every sequence in T has P, no sequence in B has P
84Statistical test for significance
[Figure: the background set plus target set (B + T) contains M sequences, and P appears in m of them; the target set T contains N sequences, and P appears in n of them]
- If n / N >> m / M
- P is enriched (over-represented) in T
- Statistical significance?
- If we randomly draw N sequences from (B + T), how likely is it that we will see at least n sequences with P?
85Hypergeometric distribution
- A box with M balls, of which N are red, and the rest are blue.
- We randomly draw m balls from the box
- What's the probability we'll see n red balls?
- Red ball: target sequences
- Blue ball: background sequences
- Total # of choices: (M choose m)
- # of choices to have n red balls: (N choose n) × (M-N choose m-n)
- So the probability is $\binom{N}{n} \binom{M-N}{m-n} / \binom{M}{m}$
86Cumulative hypergeometric test for motif
significance
- We are interested in: if we randomly pick m balls, how likely is it that we'll see at least n red balls?
  $p = \sum_{i=n}^{\min(m, N)} \binom{N}{i} \binom{M-N}{m-i} / \binom{M}{m}$
- This can be interpreted as the p-value for the null hypothesis that we are randomly picking.
- Alternative hypothesis: our selection favors red balls.
- Equivalently: the target set T is enriched with motif P, or P is over-represented in T.
87Examples
- Yeast genome has 6000 genes
- Select 50 genes believed to be co-regulated by a common TF
- Found a motif for these 50 genes
- It appeared in 20 out of these 50 genes
- In the rest of the genome, 100 genes have this motif
- M = 6000, N = 50, m = 100 + 20 = 120, n = 20
- Intuition
- m/M = 120/6000: in the genome, 1 out of 50 genes has the motif
- N = 50: would expect only 1 gene in the target set to have the motif
- 20-fold enrichment
- P-value = $6 \times 10^{-22}$
- If instead n = 5: 5-fold enrichment, p-value = 0.003
- Normally a very low p-value is needed, e.g. $10^{-10}$
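The cumulative hypergeometric p-value for this example can be computed exactly with integer binomial coefficients. This is a minimal sketch; the function name is illustrative, and the n = 5 case is assumed to keep M, N, and m unchanged.

```python
from math import comb

def hypergeom_tail(M, N, m, n):
    """P(at least n red balls) when drawing m balls from M, of which N are red."""
    upper = min(m, N)
    favorable = sum(comb(N, i) * comb(M - N, m - i) for i in range(n, upper + 1))
    return favorable / comb(M, m)

# Yeast example: M = 6000 genes, N = 50 targets, m = 120 genes with the motif.
print(hypergeom_tail(6000, 50, 120, 20))  # 20 of the 50 targets have the motif
print(hypergeom_tail(6000, 50, 120, 5))   # only 5 of the 50 targets have it
```

The sum is taken over Python integers and divided only once, which avoids the underflow that multiplying many small floating-point probabilities would risk.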
88ROC curve for motif significance
- Motif is usually a PWM
- Any word will have a score
- Typical scoring function: $\log \frac{P(W \mid M)}{P(W \mid B)}$
- W: a word.
- M: a PWM.
- B: background model
- To determine whether a sequence contains a motif, a cutoff has to be decided
- With different cutoffs, you get different numbers of genes with the motif
- The hypergeometric test first assumes a cutoff
- It may be better to look at a range of cutoffs
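The scoring function and the cutoff can be sketched as a sliding-window scan. The sketch assumes base-2 logarithms and a 0th-order uniform background, reuses the PWM from the earlier slides, and uses an illustrative test sequence and function names.

```python
import math

# PWM from the earlier slides: one row per base, one column per position.
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}

def log_odds(word, pwm, bg):
    """log2 of P(W | M) / P(W | B), summed position by position."""
    return sum(math.log2(pwm[c][j] / bg[c]) for j, c in enumerate(word))

def scan(seq, pwm, bg, cutoff):
    """Return (position, score) for every window scoring at least `cutoff`."""
    k = len(next(iter(pwm.values())))
    hits = []
    for y in range(len(seq) - k + 1):
        score = log_odds(seq[y:y + k], pwm, bg)
        if score >= cutoff:
            hits.append((y, score))
    return hits

print(scan("GGACGTTTTATGG", PWM, BACKGROUND, cutoff=10.0))
```

Lowering `cutoff` admits more windows, which is why the number of sequences counted as containing the motif depends on the cutoff chosen.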
89ROC curve for motif significance
[Figure: given a score cutoff, the background set plus target set (B + T) contains M sequences, and P appears in m of them; the target set T contains N sequences, and P appears in n of them]
- With different score cutoffs, we will have different m and n
- Assume you want to use P to classify T and B
- Sensitivity = n / N
- Specificity = (M - N - m + n) / (M - N)
- False positive rate = 1 - specificity = (m - n) / (M - N)
- With decreasing cutoff, both sensitivity and FPR increase
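Sweeping the cutoff traces out an ROC curve. In this sketch each sequence is summarized by a single number, assumed to be its best motif score, and the scores themselves are made up. The area under the curve is computed with the trapezoid rule.

```python
def roc_points(target_scores, background_scores):
    """(FPR, sensitivity) pairs obtained by lowering the cutoff step by step."""
    cutoffs = sorted(set(target_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing passes
    for c in cutoffs:
        n = sum(s >= c for s in target_scores)       # target sequences with the motif
        fp = sum(s >= c for s in background_scores)  # m - n in the slide's notation
        points.append((fp / len(background_scores), n / len(target_scores)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical best-window scores per sequence.
targets = [9.1, 7.4, 6.8, 3.2]
backgrounds = [5.0, 2.9, 2.1, 1.7, 0.4]
print(auc(roc_points(targets, backgrounds)))
```

An area near 1 means the motif scores separate the target set from the background at almost any cutoff; an area near 0.5 means they do no better than chance.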
90ROC curve for motif significance
[Figure: ROC curves (sensitivity vs. 1 - specificity, both axes from 0 to 1) for Motif 1, Motif 2, and Random, with "a good cutoff" marked on a curve]
- Lowest cutoff: every sequence has the motif. Sensitivity = 1, specificity = 0.
- Highest cutoff: no motif can pass the cutoff. Sensitivity = 0, specificity = 1.
- ROC-AUC: area under curve.
- 1: the best. 0.5: random.
- Motif 1 is more enriched than motif 2.
91Other strategies
- Cross-validation
- Randomly divide sequences into 10 sets, hold 1 set for test.
- Do motif finding on 9 sets. Does the motif also appear in the testing set?
- Phylogenetic conservation information
- Does a motif also appear in the homologous genes of another species?
- Strongest evidence
- However, will not be able to find species-specific ones
92Other strategies
- Finding motif modules
- Will two motifs always appear in the same gene?
- Location preference
- Some motifs appear to be in certain locations
- E.g., within 50-150bp upstream of the transcription start
- If a detected motif has strong positional bias, that may be a sign of its function
- Evidence from other types of data sources
- Do the genes having the motif always have similar activities (gene expression levels) across different conditions?
- Interact with the same set of proteins?
- Similar functions?
- etc.