CS5263 Bioinformatics

Slides: 93

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu


1
CS5263 Bioinformatics

Lecture 18
Motif finding

2
What is a (biological) motif?

A motif is a recurring fragment, theme or pattern
Sequence motif: a sequence pattern of nucleotides
in a DNA sequence or amino acids in a protein
Structural motif: a pattern in a protein
structure formed by the spatial arrangement of
amino acids
Network motif: patterns that occur in different
parts of a network at frequencies much higher
than those found in randomized networks
Commonality:
Higher frequency than would be expected by chance
Has, or is conjectured to have, a biological
significance

3
(Sequence) motif finding

Given a set of sequences
Goal: find sequence motifs that appear in all or
the majority of the sequences, and are likely
associated with some functions
In DNA: regulatory sequences
In protein: functional/structural domains

4
Roadmap

Biological background
Representation of motifs
Algorithms for finding motifs
Other issues
Distinguish functional vs non-functional motifs
Search for instances of given motifs
Interpretation of motifs

In motif finding, understanding the motivations,
the significance of the problems, the difficulties,
and the ideas that have been explored is more
important than knowing the details of the existing
algorithms!
Most algorithms often perform poorly on real
problems!
Not necessarily a fault of algorithm designers
Algorithms will be improved

Biological background for motif finding

7
Cells respond to environment
Various external messages
Heat
Responds to environmental conditions
Food Supply
8
Genome is fixed; cells are dynamic

A genome is static
Every cell in our body has a copy of same genome
A cell is dynamic
Responds to external conditions
Most cells follow a cell cycle of division
Cells differentiate during development

9
Gene regulation

is responsible for the dynamic cell
Gene expression (production of protein) varies
according to
Cell type
Cell cycle
External conditions
Location

10
Where gene regulation takes place

Opening of chromatin
Transcription
Translation
Protein stability
Protein modifications

11
Transcriptional Regulation

Strongest regulation happens during transcription
Best place to regulate
No energy wasted making intermediate products
However, slowest response time
After a receptor notices a change
Cascade message to nucleus
Open chromatin and bind transcription factors
Recruit RNA polymerase and transcribe
Splice mRNA and send to cytoplasm
Translate into protein

12
Transcription Factors Binding to DNA

Transcriptional regulation
Certain transcription factors bind to DNA
Binding recognizes DNA substrings
Regulatory motifs

13
Regulation of Genes

[Figure labels: Transcription Factor (TF) (protein), RNA polymerase (protein), DNA, gene, promoter]
14
Regulation of Genes

[Figure labels: Transcription Factor (TF) (protein), RNA polymerase (protein), DNA, gene, and the regulatory element, also called TF binding site, TF binding motif, or cis-regulatory motif (element)]
15
Regulation of Genes

[Figure labels: Transcription Factor (protein), RNA polymerase, DNA, regulatory element, gene]
16
Regulation of Genes

[Figure labels: new protein, RNA polymerase, Transcription Factor, DNA, gene, regulatory element]
17
The Cell as a Regulatory Network
[Figure: regulatory network over A, B, C, and D. Gene D (makes D) carries the rules "If C then D", "If B then NOT D", and "If A and B then D"; gene B (makes B) carries the rule "If D then B"]
18
Code for protein-DNA binding?
Some knowledge exists
19
However, overall code still missing
20
Experimental methods

DNase footprinting

21
Experimental methods

Determining a protein-DNA binding site is tedious
and time-consuming
Determining the binding specificity is even
harder
It involves mutating different combinations of
nucleotides in the promoter region and observing the
biological effects
Computational methods can help

22
Finding Regulatory Motifs
. . .

Given a collection of genes that are believed to
be regulated by the same protein,
Find the common TF-binding motif from promoters

23
Essentially a Multiple Local Alignment
. . .

Find best multiple local alignment

Then why don't we just use multiple sequence
alignment algorithms such as multidimensional
dynamic programming?

25
Characteristics of Regulatory Motifs

Tiny (6-12bp)
Intergenic regions are very long
Highly Variable
Constant Size
Because a constant-size transcription factor
binds
Often repeated
Often conserved

Motif Representation

27
Motif representation

Collection of exact words
ACGTTAC, ACGCTAC, AGGTGAC, ...
Consensus sequence (with wild cards)
AcGTgTtAC
ASGTKTKAC, where S = C/G, K = G/T (IUPAC code)
Position-specific weight matrices

28
Position Specific Weight Matrix
Pos    1     2     3     4     5     6     7     8     9
A     .97   .10   .02   .03   .10   .01   .05   .85   .03
C     .01   .40   .01   .04   .05   .01   .05   .05   .03
G     .01   .40   .95   .03   .40   .01   .30   .05   .03
T     .01   .10   .02   .90   .45   .97   .60   .05   .91
       A     S     G     T     K     T     K     A     C
29
Sequence Logo
[Figure: sequence logo of the weight matrix above, with letter heights given by frequency]
30
Sequence Logo
[Figure: sequence logo of the same weight matrix]
31
Entropy and information content

Entropy: a measure of uncertainty
The entropy of a random variable X that can
assume the n different values x1, x2, ..., xn
with the respective probabilities p1, p2, ...,
pn is defined as
$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$

32
Entropy and information content

Example: A, C, G, T with equal probability
$H = 4 \times (-0.25 \log_2 0.25) = \log_2 4 = 2$ bits
Need 2 bits to encode (e.g. 00 = A, 01 = C, 10 = G, 11 = T)
Maximum uncertainty
50% A and 50% C
$H = 2 \times (-0.5 \log_2 0.5) = \log_2 2 = 1$ bit
100% A
$H = 1 \times (-1 \log_2 1) = 0$ bits
Minimum uncertainty
Information: the opposite of uncertainty
I = maximum uncertainty - entropy
The above examples provide 0, 1, and 2 bits of
information, respectively
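
The three examples above can be checked with a few lines of Python. This is a minimal sketch: the function names `entropy` and `information` are illustrative choices, and the maximum uncertainty is assumed to be log2 of the alphabet size (2 bits for DNA).

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information(probs, alphabet_size=4):
    """Information content: maximum uncertainty minus entropy."""
    return math.log2(alphabet_size) - entropy(probs)

for label, probs in [("equal A, C, G, T", [0.25] * 4),
                     ("50% A and 50% C", [0.5, 0.5]),
                     ("100% A", [1.0])]:
    print(label, "H =", entropy(probs), "I =", information(probs))
```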

33
Entropy and information content
Pos    1     2     3     4     5     6     7     8     9
A     .97   .10   .02   .03   .10   .01   .05   .85   .03
C     .01   .40   .01   .04   .05   .01   .05   .05   .03
G     .01   .40   .95   .03   .40   .01   .30   .05   .03
T     .01   .10   .02   .90   .45   .97   .60   .05   .91
H     .24  1.72   .36   .63  1.60   .24  1.40   .85   .58
I    1.76   .28  1.64  1.37   .40  1.76   .60  1.15  1.42
Mean I: 1.15 bits per position
Total I: 10.4 bits
Expected occurrence in random DNA: $1 / 2^{10.4} \approx 1 / 1340$
Expected occurrence of an exact 5-mer: $1 / 2^{10} = 1 / 1024$
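
The H and I rows of this table can be recomputed from the matrix itself. The sketch below assumes a uniform background (2 bits of maximum uncertainty per position); the dictionary layout and the name `column_information` are illustrative, not part of the lecture.

```python
import math

# The 9-position weight matrix from the slide, one row per base.
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def column_information(pwm):
    """Information content (2 - entropy, in bits) of every column."""
    infos = []
    for j in range(len(pwm["A"])):
        h = -sum(pwm[b][j] * math.log2(pwm[b][j]) for b in "ACGT" if pwm[b][j] > 0)
        infos.append(2.0 - h)
    return infos

infos = column_information(PWM)
total = sum(infos)
print([round(i, 2) for i in infos])
print("total information (bits):", round(total, 1))
```

Summing the per-column values gives the total information, and 2 raised to that total is the expected spacing between chance matches in random DNA.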
34
Sequence Logo
[Figure: sequence logo of the weight matrix above, with each column scaled by its information content I]
I    1.76   .28  1.64  1.37   .40  1.76   .60  1.15  1.42
35
Background-normalized Seq Logo

Many genomes have a skewed base distribution
In a thermophilic bacterium (i.e. one living in a hot
spring), GC content can be as high as 70%.
Thus a motif ATAT in the genome of a thermophilic
bacterium would contain more information than a
motif GCGC

36
Relative Entropy

Definition 6.1. Let P and Q be two probability
measures on the same alphabet X. Then the
relative entropy (information divergence,
Kullback-Leibler distance, discrimination) from P
to Q is defined as
$D(P \| Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}$
Easy to prove that if Q is a uniform
distribution, $D(P \| Q)$ is equal to the
information content of P

37
Relative Entropy

Background: pA = pT = 0.2, pC = pG = 0.3
Distribution on some column of a PWM:
Case 1: pA = 0.85, pC = pG = pT = 0.05
Case 2: pG = 0.85, pC = pA = pT = 0.05
Assuming uniform background distribution:
I1 = I2 = 1.15
With the non-uniform background distribution:
D1 = 1.42
D2 = 0.95
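
The two cases can be reproduced directly from the definition of relative entropy. This is a small sketch; the function name `relative_entropy` and the dictionary representation of distributions are assumptions made for illustration.

```python
import math

def relative_entropy(p, q):
    """D(P || Q) in bits for two distributions given as base -> probability."""
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p if p[b] > 0)

background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
uniform = {b: 0.25 for b in "ACGT"}
case1 = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
case2 = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}

print("I1 =", round(relative_entropy(case1, uniform), 2))
print("I2 =", round(relative_entropy(case2, uniform), 2))
print("D1 =", round(relative_entropy(case1, background), 2))
print("D2 =", round(relative_entropy(case2, background), 2))
```

A column dominated by A is more surprising than one dominated by G when the background is GC-rich, which is why D1 exceeds D2 even though both columns carry the same information under a uniform background.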

38
Background-normalized Seq Logo
Pos                           1     2     3     4     5     6     7     8     9
A                            .97   .10   .02   .03   .10   .01   .05   .85   .03
C                            .01   .40   .01   .04   .05   .01   .05   .05   .03
G                            .01   .40   .95   .03   .40   .01   .30   .05   .03
T                            .01   .10   .02   .90   .45   .97   .60   .05   .91
I (uniform background)      1.76   .28  1.64  1.37   .40  1.76   .60  1.15  1.42
I (background-normalized)   2      .13  1.35  1.6    .45  2      .70  1.37  1.65
39
Physical interpretation

Information content is inversely related to
the binding energy
High information content => lower energy => high
affinity of binding
Relative entropy represents the specificity of
the binding sites compared to random DNA sequences

40
Real example

E. coli promoter
TATA-box: 10 bp upstream of transcription start
TACGAT
TAAAAT
TATACT
GATAAT
TATGAT
TATGTT

Consensus: TATAAT
Note: none of the instances matches the consensus
perfectly
41

Finding Motifs

42
Definitions of terms

Motif: a consensus sequence or a PWM
Pattern: alias for motif (used in combinatorial
motif finding)
Instance of a motif: a substring of a sequence
that matches the motif
How to define a match will be shown later

43
Motif finding schemes

                     Conservation: Yes                        Conservation: No
Whole genome: Yes    Genomes 1, 2, 3                          Genome 1
Whole genome: No     Genes 1A, 1B, 1C or Gene sets 1, 2, 3    Gene set 1

[Figure labels: phylogenetic footprinting, dictionary building, motif finding; genes 1A, 1B, 1C; gene sets 1, 2, 3; genomes 1, 2, 3]
Ideally, all information should be used at some
stage, i.e., inside the algorithm or in pre- or
post-processing.
44
Classification of approaches

Combinatorial search
Based on enumeration of words and computing word
similarities
Analogy to DP for sequence alignment
Probabilistic modeling
Construct models to distinguish motifs vs
non-motifs
Analogy to HMM for sequence alignment

45
Combinatorial motif finding

Idea 1: find all k-mers that appear at least m
times
Idea 2: find all k-mers that are statistically
significant
Problem: most motifs allow divergence. Each
variation may only appear once.
Idea 3: find all k-mers, considering IUPAC code
e.g. ASGTKTKAC, S = C/G, K = G/T
Still inflexible
Idea 4: find k-mers that approximately appear
at least m times
i.e. allow some mismatches

46
Combinatorial motif finding

Given a set of sequences $S = \{x_1, \dots, x_n\}$
A motif W is a consensus string $w_1 \dots w_K$
Find the motif W* with the best match to $x_1, \dots, x_n$
Definition of best:
$d(W, x_i)$ = min Hamming distance between W and a word
in $x_i$
$d(W, S) = \sum_i d(W, x_i)$
$W^* = \arg\min_W d(W, S)$
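
The two distances defined above can be written down almost literally. This is a minimal sketch that assumes every sequence is at least K letters long; the names `hamming`, `d_seq`, and `d_set` are illustrative.

```python
def hamming(u, v):
    """Number of mismatching positions between two equal-length words."""
    return sum(a != b for a, b in zip(u, v))

def d_seq(w, x):
    """d(W, x): minimum Hamming distance between W and any K-long word in x."""
    k = len(w)
    return min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))

def d_set(w, seqs):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(w, x) for x in seqs)

print(d_set("ACGT", ["TTACGTTT", "TTACCTTT"]))
```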

47
Exhaustive searches

1. Pattern-driven algorithm
For W = AA...A to TT...T ($4^K$ possibilities)
Find d(W, S)
Report W* = argmin( d(W, S) )
Running time: $O(K N 4^K)$
(where $N = \sum_i |x_i|$)
Guaranteed to find the optimal solution.
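
A direct, unoptimized rendering of the pattern-driven search is sketched below. It enumerates all 4^K words, so it is only practical for small K; ties are broken by taking the first word in alphabetical order, which is an arbitrary choice made here, not something the slide specifies.

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def d_set(w, seqs):
    """d(W, S): sum over sequences of the best-matching word's distance."""
    k = len(w)
    return sum(min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))
               for x in seqs)

def pattern_driven(seqs, k):
    """Try every K-mer over {A, C, G, T} and keep the one minimizing d(W, S)."""
    best_w, best_d = None, None
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        d = d_set(w, seqs)
        if best_d is None or d < best_d:
            best_w, best_d = w, d
    return best_w, best_d

print(pattern_driven(["GGACGTCC", "TTACGTAA", "CCACGTGG"], 4))
```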

48
Exhaustive searches

2. Sample-driven algorithm
For W = a K-long word in some $x_i$
Find d(W, S)
Report W* = argmin( d(W, S) )
OR report a local improvement of W*
Running time: $O(K N^2)$
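
The sample-driven variant restricts the candidates to words that actually occur in the data. A minimal sketch, with the same hypothetical helper names as before and alphabetical tie-breaking as an arbitrary choice:

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def d_set(w, seqs):
    k = len(w)
    return sum(min(hamming(w, x[i:i + k]) for i in range(len(x) - k + 1))
               for x in seqs)

def sample_driven(seqs, k):
    """Score only the K-long words present in S; return (distance, word)."""
    candidates = {x[i:i + k] for x in seqs for i in range(len(x) - k + 1)}
    return min((d_set(w, seqs), w) for w in sorted(candidates))

print(sample_driven(["GGACGTCC", "TTACGTAA", "CCACGTGG"], 4))
```

Because the candidate set is limited to observed words, the search never proposes a motif that is absent from every sequence.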

49
Exhaustive searches

Problem with sample-driven approach
If:
the true motif does not occur in the data, and
the true motif is weak
Then:
random strings may score better than any instance
of the true motif

50
Example

E. coli promoter
TATA-box: 10 bp upstream of transcription start
TACGAT
TAAAAT
TATACT
GATAAT
TATGAT
TATGTT

Consensus: TATAAT
Each instance differs by at most 2 bases from the
consensus
None of the instances matches the consensus perfectly
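
The claim about the six instances is easy to verify by counting mismatches against the consensus. A small check, with `hamming` as an illustrative helper name:

```python
def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

instances = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
consensus = "TATAAT"

distances = [hamming(consensus, w) for w in instances]
print(distances)  # [2, 1, 1, 1, 1, 2]
```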
51
Heuristic methods

Cannot afford exhaustive search on all patterns
Sample-driven approaches may miss real patterns
However, a real pattern should not differ too
much from its instances in S
Start from the space of all words in S, extend to
the space with real patterns

52
Some of the popular tools

Consensus (Hertz & Stormo, 1999)
WINNOWER (Pevzner & Sze, 2000)
MULTIPROFILER (Keich & Pevzner, 2002)
PROJECTION (Buhler & Tompa, 2001)
WEEDER (Pavesi et al., 2001)
And a dozen others

53
Consensus

Algorithm
Cycle 1
For each word W in S
For each word W' in S
Create alignment (gap free) of W, W'
Keep the C1 best alignments, A1, ..., AC1
ACGGTTG , CGAACTT , GGGCTCT
ACGCCTG , AGAACTA , GGGGTGT

Algorithm (cont'd)
Cycle i
For each word W in S
For each alignment Aj from cycle i-1
Create alignment (gap free) of W, Aj
Keep the Ci best alignments A1, ..., ACi

C1, ..., Cn are user-defined heuristic constants
Running time:
$O(kN^2) + O(kN C_1) + O(kN C_2) + \dots + O(kN C_n) = O(kN^2 + kN C_{total})$
where $C_{total} = \sum_i C_i$, typically O(nC), where C is
a big constant

56
Extended sample-driven (ESD) approaches

Hybrid between pattern-driven and sample-driven
Assume each instance does not differ by more than
a bases from the motif (a usually depends on k)

[Figure: a motif and one of its instances, with the instance lying inside the a-neighborhood of the motif]
The real motif will reside in the a-neighborhood
of some words in S. Instead of searching all $4^K$
patterns, we can search the a-neighborhood of
every word in S.
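
To get a feel for how large an a-neighborhood is, the sketch below enumerates every string within Hamming distance a of a word and compares the count with the closed form, the sum over i <= a of (K choose i) 3^i. The function names are illustrative.

```python
from itertools import combinations, product
from math import comb

def neighborhood(word, a, alphabet="ACGT"):
    """All strings that differ from word in at most a positions."""
    result = {word}
    for d in range(1, a + 1):
        for positions in combinations(range(len(word)), d):
            # At each chosen position, substitute one of the 3 other letters.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for subs in product(*choices):
                chars = list(word)
                for p, c in zip(positions, subs):
                    chars[p] = c
                result.add("".join(chars))
    return result

def neighborhood_size(k, a):
    """Closed-form size of the a-neighborhood of a K-long DNA word."""
    return sum(comb(k, i) * 3 ** i for i in range(a + 1))

print(len(neighborhood("ACGTAC", 2)), neighborhood_size(6, 2))
```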
57
WEEDER

Naive: $N K^a 3^a \times NK$

where $N K^a 3^a$ is the # of patterns to test
and N is the # of words in sequences
58
Better idea

Using a joint suffix tree, find all patterns
that
Have length K
Appeared in at least m sequences with at most a
mismatches
Post-processing

59
WEEDER algorithm sketch
Current pattern P, |P| < K
[Figure: suffix tree for the current pattern ACGTT; each eligible node is annotated with its mismatches and (e, B), and the sequence occurrence is shown]

A list containing all eligible nodes with at
most a mismatches to P
For each node, remember the mismatches accumulated
(e) and a bit vector (B) for sequence occurrence, e.g.
011100010
Bit-OR all Bs to get the sequence occurrence for P
Suppose occ > m
Pattern still valid
Now add a letter
60
WEEDER algorithm sketch
Current pattern P
[Figure: the pattern extended to ACGTTA; eligible nodes annotated with (e, B)]

Simple extension: no branches.
No change to B
e may increase by 1 or stay unchanged
Drop node if e > a
Branches: replace a node with its child nodes
Drop if e > a
B may change
Re-do Bit-OR using all Bs
Try a different char if occ < m
Report P when |P| = K
61
WEEDER complexity

Can get all D(P, S) in time
$O(nN \binom{K}{a} 3^a) = O(nN K^a 3^a)$.
n: number of sequences. Needed for Bit-OR.
Better than $O(KN 4^K)$ since usually $a \ll K$
$K^a 3^a$ may still be expensive for large K
E.g. K = 20, a = 6

62
WEEDER: more tricks
Current pattern P
[Figure: current pattern ACGTTA]

Instead of eligible nodes with at most a mismatches to P,
use eligible nodes with at most min(εL, a)
mismatches to P
L: current pattern length
ε: error ratio
Requires mismatches to be somewhat evenly
distributed among positions
Prune tree at length K

63
MULTIPROFILER

W' differs from W at a positions.
The consensus sequence for the words in the
a-neighborhood of W is similar to W.
If we ignore all the chars that are similar to W,
the rest may suggest the difference between W and
W'

[Figure: W = ACGTACG, W' = ATGTAAG]
64
MULTIPROFILER alg sketch

For each word P in S
Find its a-neighborhood in S
List of words C
For each position j from 1..K of the words in C
Find the most popular char that differs from Pj
Replace a positions in P with the chars found
above
Call the new word P'
W* = argmin D(P', S)
65
MULTIPROFILER

No complexity provided in the paper
More efficient than WEEDER for longer patterns: $N < K^a 3^a$
How to choose a is an issue
Large a: too much noise in the neighborhood
Small a: few true instances in the neighborhood
66

Probabilistic modeling approaches
for motif finding

67
Probabilistic modeling approaches

A motif model
Usually a PWM
M = (Pij), i = 1..4, j = 1..k, where k is the motif length
A background model
Usually the distribution of base frequencies in
the genome (or other selected subsets of
sequences)
B = (bi), i = 1..4
A word can be generated by M or B

68
Expectation-Maximization

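For any word W,
$P(W \mid M) = P_{W_1,1} P_{W_2,2} \cdots P_{W_K,K}$
$P(W \mid B) = b_{W_1} b_{W_2} \cdots b_{W_K}$
Let $\lambda = P(M)$, i.e., the probability for any word
to be generated by M.
Then $P(B) = 1 - \lambda$
Can compute the posterior probabilities P(M | W) and
P(B | W):
$P(M \mid W) \propto P(W \mid M) \lambda$
$P(B \mid W) \propto P(W \mid B) (1 - \lambda)$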
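
Normalizing the two proportional terms gives the posterior by Bayes' rule. The sketch below assumes the PWM is stored as one dictionary of base probabilities per position; the toy matrix, the uniform background, and the function name `posterior_motif` are made up for illustration.

```python
def posterior_motif(word, pwm, bg, lam):
    """P(M | W) for a word, a PWM, a background model, and prior lam = P(M)."""
    pm, pb = lam, 1.0 - lam
    for j, c in enumerate(word):
        pm *= pwm[j][c]   # P(W | M) * lam, built up position by position
        pb *= bg[c]       # P(W | B) * (1 - lam)
    return pm / (pm + pb)

# Hypothetical 2-position motif favoring "AC", with a uniform background.
toy_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
uniform_bg = {b: 0.25 for b in "ACGT"}

print(posterior_motif("AC", toy_pwm, uniform_bg, 0.5))
print(posterior_motif("GT", toy_pwm, uniform_bg, 0.5))
```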

69
Expectation-Maximization

Initialize
Randomly assign each word to M or B
Let $Z_{xy} = 1$ if position y in sequence x is a
motif, and 0 otherwise
Estimate parameters M, λ, B
Iterate until convergence
E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
M-step: re-estimate M, λ given Z (B usually fixed)
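
The loop above can be sketched compactly in pure Python. Several details here are assumptions rather than part of the lecture: Z is initialized with random soft values instead of hard 0/1 assignments, pseudocounts keep probabilities away from zero, the loop runs for a fixed number of iterations instead of testing convergence, and sequences are assumed to contain only A, C, G, T.

```python
import random

BASES = "ACGT"

def em_motif(seqs, k, n_iter=30, pseudo=0.5, seed=0):
    """Two-component mixture EM over all K-long words; returns (M, lam, Z)."""
    rng = random.Random(seed)
    words = [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]
    # Background B: base frequencies over all sequences, kept fixed.
    text = "".join(seqs)
    bg = {b: (text.count(b) + pseudo) / (len(text) + 4 * pseudo) for b in BASES}
    # Initialize Z with random soft assignments of each word to M.
    z = [rng.random() for _ in words]
    for _ in range(n_iter):
        # M-step: re-estimate the PWM and lam from the current Z.
        counts = [{b: pseudo for b in BASES} for _ in range(k)]
        for w, zw in zip(words, z):
            for j, c in enumerate(w):
                counts[j][c] += zw
        pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
        lam = sum(z) / len(z)
        # E-step: Z = P(M | word) for every word.
        new_z = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, c in enumerate(w):
                pm *= pwm[j][c]
                pb *= bg[c]
            new_z.append(pm / (pm + pb))
        z = new_z
    return pwm, lam, z

pwm, lam, z = em_motif(["GGACGTCC", "TTACGTAA", "CCACGTGG"], 4)
print(round(lam, 3), ["".join(max(col, key=col.get)) for col in pwm])
```

Like any EM procedure, the result depends on the starting point, which is one reason tools restart from multiple initializations.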

70
Expectation-Maximization
[Figure: three panels (Initialize, E-step, M-step), each plotting probability against position (1, 5, 9)]

E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
M-step: re-estimate M, λ given Z

71
MEME

Multiple EM for Motif Elicitation
Bailey and Elkan, UCSD
http://meme.sdsc.edu/
Multiple starting points
Multiple modes: ZOOPS, OOPS, TCM

72
Gibbs Sampling

Another very useful technique for estimating
missing parameters
EM is deterministic
Often trapped by local optima
Gibbs sampling: stochastic behavior to avoid
local optima

73
Gibbs sampling

Initialize
Randomly assign each word to M or B
Let $Z_{xy} = 1$ if position y in sequence x is a
motif, and 0 otherwise
Estimate parameters M, B, λ
Iterate
Randomly remove a sequence X from S
Recalculate model parameters using S \ X
Compute $Z_{xy}$ for X
Sample a position y* from $Z_{xy}$.
Let $Z_{xy} = 1$ for y = y* and 0 otherwise
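
A simplified site sampler following these steps is sketched below. It assumes exactly one motif occurrence per sequence (so λ is not estimated), uses pseudocounts, weights each position by the motif-to-background likelihood ratio, and runs for a fixed number of iterations; these are illustrative simplifications, not the exact scheme of any published tool.

```python
import random

BASES = "ACGT"

def build_pwm(words, k, pseudo=0.5):
    """Column-wise base frequencies of the given words, with pseudocounts."""
    counts = [{b: pseudo for b in BASES} for _ in range(k)]
    for w in words:
        for j, c in enumerate(w):
            counts[j][c] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def gibbs_motif(seqs, k, n_iter=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    text = "".join(seqs)
    bg = {b: (text.count(b) + 1) / (len(text) + 4) for b in BASES}
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]  # random initialization
    for _ in range(n_iter):
        x = rng.randrange(len(seqs))                     # sequence X to hold out
        others = [s[p:p + k] for i, (s, p) in enumerate(zip(seqs, pos)) if i != x]
        pwm = build_pwm(others, k)                       # model from S \ X
        weights = []
        for y in range(len(seqs[x]) - k + 1):
            ratio = 1.0
            for j, c in enumerate(seqs[x][y:y + k]):
                ratio *= pwm[j][c] / bg[c]
            weights.append(ratio)
        # Sample the new position in proportion to its weight (not the argmax).
        pos[x] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["GGACGTCC", "TTACGTAA", "CCACGTGG"]
print(gibbs_motif(seqs, 4))
```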

74
Gibbs Sampling
[Figure: probability plotted against position, with one position chosen by sampling]

Gibbs sampling: sample one position according to
probability
Update prediction of one training sequence at a
time
Viterbi: always take the highest
EM: take weighted average

Simultaneously update predictions of all sequences
75
Gibbs sampling motif finders

Gibbs Sampler, based on C. Lawrence et al.,
Science, 1993
AlignACE, Nat Biotech 1998, developed in the Church
lab, Harvard Univ
BioProspector, X. Liu et al., PSB 2001, an
improvement of AlignACE

76
Better background model

Repeat DNA can be confused with a motif
Especially low-complexity sequence: CACACA, AAAAA, etc.
Solution: a more elaborate background model
Higher-order Markov model
0th order: B = {pA, pC, pG, pT}
1st order: B = {P(A|A), P(A|C), ..., P(T|T)}
Kth order: B = {P(X | b1...bK)}, X, bi ∈ {A, C, G, T}
Has been applied to EM and Gibbs (up to 3rd
order)
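
Estimating a Markov background of a given order amounts to counting which base follows each context. A minimal sketch without smoothing (unseen contexts are simply absent); the function name `markov_background` is illustrative.

```python
from collections import defaultdict

def markov_background(seqs, order):
    """Map each context of length `order` to P(next base | context)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())
        model[context] = {b: n / total for b, n in nxt.items()}
    return model

# Order 0 uses the empty context, i.e. plain base frequencies.
print(markov_background(["AACC"], 0))
# A 1st-order model captures the low-complexity repeat ACACAC...
print(markov_background(["ACACACAC"], 1))
```

Under the 1st-order model, a repeat such as ACACAC is fully expected by the background, so it no longer stands out as a motif.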

77
Limits of Motif Finders
[Figure: upstream region of a gene, running from an unknown start (???) to position 0 at the gene]

Given upstream regions of coregulated genes:
Increasing length makes motif finding harder:
random motifs clutter the true ones
Decreasing length makes motif finding harder:
true motif missing in some sequences

78
Challenging problem
[Figure: n = 20 sequences of length L = 1000, each containing a motif of length k with d mutations]

(k, d)-motif challenge problem
Many algorithms fail at the (15, 4)-motif for n = 20
and L = 1000
Combinatorial algorithms usually work better on the
challenge problem
However, they are usually designed to find (k,
d)-motifs
Performance on real data varies

79
Motif finding in practice

Now we've found some good-looking motifs
Easiest step?
What to do next?
Are they real?
How do we find more instances in the rest of the
genome?
What is their functional meaning?
Motifs => regulatory networks

80
Making sense of the motifs

Each program usually reports a number of motifs
(tens to hundreds)
Many motifs are variations of each other
Each program also reports some different ones
Each program has its own way of scoring motifs
Best-scored motifs are often not interesting:
AAAAAAAA
ACACACAC
TATATATAT

81
Strategies to improve results

Combining results from different algorithms is usually
helpful
Ones that appear multiple times are probably
more interesting
Except simple repeats like AAAAA or ATATATATA
Will talk about this later.
Cluster motifs into groups. Issues:
Measuring similarities between two motifs (PWMs)
# of clusters

82
Strategies to improve results

Compare with known motifs in database
TRANSFAC
JASPAR
Issues
Compute similarities among motifs
How similar is similar?

83
Strategies to improve results

Statistical test of significance
Enrichment in target sequences vs background
sequences

Target set T: assumed to contain a common motif, P
Background set B: assumed to not contain P, or to
contain it with very low frequency
Ideal case: every sequence in T has P, no
sequence in B has P
84
Statistical test for significance
[Figure: background set + target set (B + T) contains M sequences, and P appeared in m of them; target set T contains N sequences, and P appeared in n of them]

If n / N >> m / M
P is enriched (over-represented) in T
Statistical significance?
If we randomly draw N sequences from (B + T), how
likely is it that we will see at least n sequences with P?

85
Hypergeometric distribution

A box with M balls, of which N are red, and the
rest are blue.
We randomly draw m balls from the box
What's the probability we'll see n red balls?
Red ball: target sequences
Blue ball: background sequences
Total # of choices: (M choose m)
# of choices to have n red balls: (N choose n) x
(M-N choose m-n)
Probability: $P(n) = \binom{N}{n} \binom{M-N}{m-n} / \binom{M}{m}$

86
Cumulative hypergeometric test for motif
significance

We are interested in: if we randomly pick m
balls, how likely is it that we'll see at least n red
balls?
$p = \sum_{i=n}^{\min(m, N)} \binom{N}{i} \binom{M-N}{m-i} / \binom{M}{m}$

This can be interpreted as the p-value for the
null hypothesis that we are randomly picking.
Alternative hypothesis: our selection favors red balls.
Equivalently: the target set T is enriched with
motif P, or P is over-represented in T.
87
Examples

Yeast genome has 6000 genes
Select 50 genes believed to be co-regulated by a
common TF
Found a motif for these 50 genes
It appeared in 20 out of these 50 genes
In the rest of the genome, 100 genes have this motif
M = 6000, N = 50, m = 100 + 20 = 120, n = 20
Intuition:
m/M = 120/6000: in the genome, 1 out of 50 genes has
the motif
N = 50: would expect only 1 gene in the target
set to have the motif
20-fold enrichment
P-value = $6 \times 10^{-22}$
If n = 5: 5-fold enrichment, p-value = 0.003
Normally a very low p-value is needed, e.g. $10^{-10}$
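
The cumulative hypergeometric test can be computed exactly with integer binomial coefficients. The sketch below uses the lecture's symbols (M balls, N red, m drawn, at least n red) and applies them to the yeast numbers; the function name `hypergeom_tail` is an illustrative choice, and no particular printed value is claimed here.

```python
from math import comb

def hypergeom_tail(M, N, m, n):
    """P(at least n red balls) when drawing m balls from M, of which N are red."""
    total = comb(M, m)
    favorable = sum(comb(N, i) * comb(M - N, m - i)
                    for i in range(n, min(m, N) + 1))
    return favorable / total

# Yeast example: M = 6000 genes, N = 50 targets, m = 120 genes with the motif.
print("n = 20:", hypergeom_tail(6000, 50, 120, 20))
print("n = 5: ", hypergeom_tail(6000, 50, 120, 5))
```

Python's arbitrary-precision integers keep the sums exact; only the final division is rounded to a float.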

88
ROC curve for motif significance

Motif is usually a PWM
Any word will have a score
Typical scoring function: log [ P(W | M) / P(W | B) ]
W: a word
M: a PWM
B: background model
To determine whether a sequence contains a motif,
a cutoff has to be decided
With different cutoffs, you get different numbers
of genes with the motif
The hypergeometric test first assumes a cutoff
It may be better to look at a range of cutoffs
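
The scoring function and the cutoff decision can be sketched as follows. The base of the logarithm is not stated in the lecture; base 2 is assumed here, and the toy PWM, background, and function names are made up for illustration.

```python
import math

def log_odds(word, pwm, bg):
    """log2 [ P(W | M) / P(W | B) ], summed position by position."""
    return sum(math.log2(pwm[j][c] / bg[c]) for j, c in enumerate(word))

def best_hit(seq, pwm, bg):
    """Highest-scoring window in seq, returned as (score, start position)."""
    k = len(pwm)
    return max((log_odds(seq[i:i + k], pwm, bg), i)
               for i in range(len(seq) - k + 1))

def contains_motif(seq, pwm, bg, cutoff):
    """A sequence 'has the motif' if its best window reaches the cutoff."""
    return best_hit(seq, pwm, bg)[0] >= cutoff

# Hypothetical 2-position motif favoring "AC", with a uniform background.
toy_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
uniform_bg = {b: 0.25 for b in "ACGT"}

print(best_hit("GGACGG", toy_pwm, uniform_bg))
```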

89
ROC curve for motif significance
[Figure: as before, B + T contains M sequences and target set T contains N sequences. Given a score cutoff, P appeared in m sequences of B + T and in n sequences of T]

With different score cutoffs, we will have different
m and n
Assume you want to use P to classify T and B
Sensitivity = n / N
Specificity = (M - N - m + n) / (M - N)
False positive rate = 1 - specificity = (m - n) /
(M - N)
With decreasing cutoff, sensitivity ↑, FPR ↑
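
Given one best-hit score per sequence, sensitivity and false positive rate at a cutoff follow directly from these formulas, and the area under the ROC curve can be obtained by comparing every target score with every background score. This pairwise formulation (ties counted as one half) is a standard equivalent of the area under the curve; the function names and toy scores are illustrative.

```python
def rates(target_scores, background_scores, cutoff):
    """Return (sensitivity, false positive rate) at the given score cutoff."""
    n = sum(s >= cutoff for s in target_scores)               # hits in T
    false_pos = sum(s >= cutoff for s in background_scores)   # m - n, hits in B
    return n / len(target_scores), false_pos / len(background_scores)

def auc(target_scores, background_scores):
    """Probability that a random target outscores a random background sequence."""
    wins = sum((t > b) + 0.5 * (t == b)
               for t in target_scores for b in background_scores)
    return wins / (len(target_scores) * len(background_scores))

target = [3.0, 2.0, 1.0]
background = [2.0, 0.0, 0.0]
for cutoff in (4.0, 2.0, -1.0):
    print(cutoff, rates(target, background, cutoff))
print("AUC:", auc(target, background))
```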

90
ROC curve for motif significance
[Figure: ROC curves plotting sensitivity (0 to 1) against 1 - specificity (0 to 1) for Motif 1, Motif 2, and Random, with a good cutoff marked. Lowest cutoff: every sequence has the motif, sensitivity = 1, specificity = 0. Highest cutoff: no motif can pass the cutoff, sensitivity = 0, specificity = 1]

ROC-AUC: area under curve.
1 is the best; 0.5 is random.
Motif 1 is more enriched than motif 2.
91
Other strategies

Cross-validation
Randomly divide sequences into 10 sets, hold out 1
set for testing.
Do motif finding on 9 sets. Does the motif also
appear in the testing set?
Phylogenetic conservation information
Does a motif also appear in the homologous genes
of another species?
Strongest evidence
However, will not be able to find
species-specific ones

92
Other strategies

Finding motif modules
Do two motifs always appear in the same gene?
Location preference
Some motifs appear to be in certain locations
E.g., within 50-150 bp upstream of the transcription
start
If a detected motif has a strong positional bias,
that may be a sign of its function
Evidence from other types of data sources
Do the genes having the motif always have similar
activities (gene expression levels) across
different conditions?
Interact with the same set of proteins?
Similar functions?
etc.