CS5263 Bioinformatics - PowerPoint PPT Presentation

Provided by: jianhu
Learn more at: http://www.cs.utsa.edu

1
CS5263 Bioinformatics
  • Lecture 18
  • Motif finding

2
What is a (biological) motif?
  • A motif is a recurring fragment, theme or pattern
  • Sequence motif: a sequence pattern of nucleotides
    in a DNA sequence or amino acids in a protein
  • Structural motif: a pattern in a protein
    structure formed by the spatial arrangement of
    amino acids
  • Network motif: patterns that occur in different
    parts of a network at frequencies much higher
    than those found in randomized networks
  • Commonality:
  • Higher frequency than would be expected by chance
  • Has, or is conjectured to have, a biological
    significance

3
(Sequence) motif finding
  • Given a set of sequences
  • Goal: find sequence motifs that appear in all or
    the majority of the sequences, and are likely
    associated with some function
  • In DNA: regulatory sequences
  • In proteins: functional/structural domains

4
Roadmap
  • Biological background
  • Representation of motifs
  • Algorithms for finding motifs
  • Other issues
  • Distinguish functional vs non-functional motifs
  • Search for instances of given motifs
  • Interpretation of motifs

5
  • In motif finding, understanding the motivations,
    the significance of the problems, the difficulties,
    and the ideas that have been explored is more
    important than knowing the details of the existing
    algorithms!
  • Most algorithms perform poorly in real
    challenges!
  • Not necessarily a fault of algorithm designers
  • Algorithms will be improved

6
  • Biological background for motif finding

7
Cells respond to environment
[Figure: a cell receives various external messages (e.g., heat, food supply) and responds to environmental conditions]
8
Genome is fixed; cells are dynamic
  • A genome is static
  • Every cell in our body has a copy of the same genome
  • A cell is dynamic
  • Responds to external conditions
  • Most cells follow a cell cycle of division
  • Cells differentiate during development

9
Gene regulation
  • is responsible for the dynamic cell
  • Gene expression (production of protein) varies
    according to
  • Cell type
  • Cell cycle
  • External conditions
  • Location

10
Where gene regulation takes place
  • Opening of chromatin
  • Transcription
  • Translation
  • Protein stability
  • Protein modifications

11
Transcriptional Regulation
  • Strongest regulation happens during transcription
  • Best place to regulate
  • No energy wasted making intermediate products
  • However, slowest response time
  • After a receptor notices a change
  • Cascade message to nucleus
  • Open chromatin and bind transcription factors
  • Recruit RNA polymerase and transcribe
  • Splice mRNA and send to cytoplasm
  • Translate into protein

12
Transcription Factors Binding to DNA
  • Transcriptional regulation
  • Certain transcription factors bind to DNA
  • Binding recognizes DNA substrings
  • Regulatory motifs

13
Regulation of Genes

[Figure labels: Transcription Factor (TF) (protein), RNA polymerase (protein), DNA, Gene, Promoter]
14
Regulation of Genes

[Figure labels: Transcription Factor (TF) (protein), RNA polymerase (protein), DNA, Gene]
Synonyms for the site a TF binds: regulatory element, TF binding site, TF binding
motif, cis-regulatory motif (element)
15
Regulation of Genes

[Figure labels: Transcription Factor (protein), RNA polymerase, DNA, Regulatory Element, Gene]
16
Regulation of Genes

[Figure labels: New protein, RNA polymerase, Transcription Factor, DNA, Gene, Regulatory Element]
17
The Cell as a Regulatory Network
[Figure: a regulatory network over A, B, C, D, with gene D ("Make D") and gene B ("Make B") controlled by the rules: If C then D; If B then NOT D; If A and B then D; If D then B]
18
Code for protein-DNA binding?
Some knowledge exists
19
However, overall code still missing
20
Experimental methods
  • DNase footprinting

21
Experimental methods
  • Determining a protein-DNA binding site is tedious
    and time-consuming
  • Determining the binding specificity is even
    harder
  • Involves mutating different combinations of
    nucleotides in the promoter region and observing
    the biological effects
  • Computational methods can help

22
Finding Regulatory Motifs
. . .
  • Given a collection of genes that are believed to
    be regulated by the same protein,
  • Find the common TF-binding motif from promoters

23
Essentially a Multiple Local Alignment
. . .
  • Find best multiple local alignment

24
  • Then why don't we just use multiple sequence
    alignment algorithms such as multidimensional
    dynamic programming?

25
Characteristics of Regulatory Motifs
  • Tiny (6-12bp)
  • Intergenic regions are very long
  • Highly Variable
  • Constant Size
  • Because a constant-size transcription factor
    binds
  • Often repeated
  • Often conserved

26
  • Motif Representation

27
Motif representation
  • Collection of exact words
  • ACGTTAC, ACGCTAC, AGGTGAC, ...
  • Consensus sequence (with wild cards)
  • AcGTgTtAC
  • ASGTKTKAC, where S = C/G and K = G/T (IUPAC code)
  • Position specific weight matrices
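
A consensus with wild cards can be matched mechanically. The following is a minimal sketch, assuming a small subset of the IUPAC nucleotide code; the function name `matches_consensus` and the test words are illustrative and not from the lecture.

```python
# Subset of the IUPAC nucleotide code: each symbol maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def matches_consensus(word, consensus):
    """True if every base of `word` is allowed by the IUPAC symbol at that position."""
    if len(word) != len(consensus):
        return False
    return all(base in IUPAC[sym] for base, sym in zip(word, consensus))

# Position 2 must be C or G (S); positions 5 and 7 must be G or T (K).
print(matches_consensus("ACGTGTTAC", "ASGTKTKAC"))
print(matches_consensus("AAGTGTTAC", "ASGTKTKAC"))
```

A consensus of this kind is all-or-nothing: a word either matches or it does not, which is one reason position specific weight matrices are often preferred.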

28
Position Specific Weight Matrix
Position      1     2     3     4     5     6     7     8     9
A           .97   .10   .02   .03   .10   .01   .05   .85   .03
C           .01   .40   .01   .04   .05   .01   .05   .05   .03
G           .01   .40   .95   .03   .40   .01   .30   .05   .03
T           .01   .10   .02   .90   .45   .97   .60   .05   .91
Consensus     A     S     G     T     K     T     K     A     C
29
Sequence Logo
[Figure: sequence logo of the PWM on slide 28; vertical axis: frequency]
30
Sequence Logo
[Figure: sequence logo of the PWM on slide 28]
31
Entropy and information content
  • Entropy: a measure of uncertainty
  • The entropy of a random variable X that can
    assume the n different values x1, x2, . . . , xn
    with the respective probabilities p1, p2, . . . ,
    pn is defined as
    $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$

32
Entropy and information content
  • Example: A, C, G, T with equal probability
  • $H = 4 \times (-0.25 \log_2 0.25) = \log_2 4 = 2$ bits
  • Need 2 bits to encode (e.g. 00 = A, 01 = C, 10 =
    G, 11 = T)
  • Maximum uncertainty
  • 50% A and 50% C
  • $H = 2 \times (-0.5 \log_2 0.5) = \log_2 2 = 1$ bit
  • 100% A
  • $H = 1 \times (-1 \log_2 1) = 0$ bits
  • Minimum uncertainty
  • Information: the opposite of uncertainty
  • $I = \text{maximum uncertainty} - \text{entropy}$
  • The above examples provide 0, 1, and 2 bits of
    information, respectively

33
Entropy and information content
Position      1     2     3     4     5     6     7     8     9
A           .97   .10   .02   .03   .10   .01   .05   .85   .03
C           .01   .40   .01   .04   .05   .01   .05   .05   .03
G           .01   .40   .95   .03   .40   .01   .30   .05   .03
T           .01   .10   .02   .90   .45   .97   .60   .05   .91
H          0.24  1.72  0.36  0.63  1.60  0.24  1.40  0.85  0.58
I          1.76  0.28  1.64  1.37  0.40  1.76  0.60  1.15  1.42
Mean I: 1.15 bits per position
Total I: 10.4 bits
Expected occurrence in random DNA: $1 / 2^{10.4} \approx 1 / 1340$
Expected occurrence of an exact 5-mer: $1 / 2^{10} = 1 / 1024$
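
The H and I rows can be recomputed directly from the PWM. This is a small sketch, assuming a uniform background so that the maximum entropy is 2 bits; the names `entropy`, `PWM`, `H`, and `I` are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# PWM from the slide: one row per base, one entry per motif position.
PWM = {
    "A": [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    "C": [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    "G": [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    "T": [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

# Transpose rows into per-position columns (pA, pC, pG, pT).
columns = list(zip(*(PWM[b] for b in "ACGT")))
H = [entropy(col) for col in columns]
I = [2.0 - h for h in H]   # information = maximum uncertainty - entropy
total = sum(I)

print("H:", [round(h, 2) for h in H])
print("I:", [round(i, 2) for i in I])
print("total I:", round(total, 1), "bits; expected 1 hit per", round(2 ** total), "positions")
```

Because the total is computed from unrounded values, the expected spacing printed here can differ slightly from the slide's 1 / 1340, which uses the rounded 10.4 bits.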
34
Sequence Logo
[Figure: sequence logo of the PWM on slide 28, shown with the per-position information content I = 1.76, 0.28, 1.64, 1.37, 0.40, 1.76, 0.60, 1.15, 1.42 bits]
35
Background-normalized Seq Logo
  • Many genomes have skewed base distribution
  • In a thermophilic bacterium (e.g., one living in a
    hot spring), GC content can be as high as 70%.
  • Thus a motif ATAT in the genome of a thermophilic
    bacterium would contain more information than a
    motif GCGC

36
Relative Entropy
  • Definition 6.1. Let P and Q be two probability
    measures on the same alphabet X. Then the
    relative entropy (information divergence,
    Kullback-Leibler distance, discrimination) from P
    to Q is defined as
    $D(P \| Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}$
  • Easy to prove that if Q is a uniform
    distribution, $D(P \| Q)$ is equal to the
    information content of P

37
Relative Entropy
  • Background: pA = pT = 0.2, pC = pG = 0.3
  • Distribution on some column of a PWM
  • Case 1: pA = 0.85, pC = pG = pT = 0.05
  • Case 2: pG = 0.85, pC = pA = pT = 0.05
  • Assuming uniform background distribution
  • I1 = I2 = 1.15
  • With the non-uniform background distribution
  • D1 = 1.42
  • D2 = 0.95
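
The two cases can be checked numerically. A minimal sketch, assuming log base 2 and the background given on this slide; the function name `relative_entropy` is illustrative.

```python
import math

def relative_entropy(p, q):
    """D(P || Q) in bits; p and q map each symbol to its probability."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
uniform = {b: 0.25 for b in "ACGT"}
case1 = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}
case2 = {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}

for name, col in (("case 1", case1), ("case 2", case2)):
    print(name,
          "uniform:", round(relative_entropy(col, uniform), 2),
          "skewed background:", round(relative_entropy(col, background), 2))
```

A column dominated by A stands out more against a GC-rich background than a column dominated by G, which is why D1 exceeds D2 even though I1 = I2.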

38
Background-normalized Seq Logo
Position                      1     2     3     4     5     6     7     8     9
A                           .97   .10   .02   .03   .10   .01   .05   .85   .03
C                           .01   .40   .01   .04   .05   .01   .05   .05   .03
G                           .01   .40   .95   .03   .40   .01   .30   .05   .03
T                           .01   .10   .02   .90   .45   .97   .60   .05   .91
I (uniform background)     1.76  0.28  1.64  1.37  0.40  1.76  0.60  1.15  1.42
I (background-normalized)     2  0.13  1.35   1.6  0.45     2  0.70  1.37  1.65
39
Physical interpretation
  • Information content is inversely proportional to
    the binding energy
  • High information content => lower energy => high
    binding affinity
  • Relative entropy represents the specificity of
    the binding sites compared to random DNA sequences

40
Real example
  • E. coli promoter
  • TATA-box 10 bp upstream of transcription start
  • TACGAT
  • TAAAAT
  • TATACT
  • GATAAT
  • TATGAT
  • TATGTT

Consensus: TATAAT
Note: none of the instances matches the consensus
perfectly
41
  • Finding Motifs

42
Definitions of terms
  • Motif: a consensus sequence or a PWM
  • Pattern: alias for motif (used in combinatorial
    motif finding)
  • Instance of a motif: a substring of a sequence
    that matches the motif
  • How to define a match will be shown later

43
Motif finding schemes
                     Conservation: Yes                        Conservation: No
Whole genome: Yes    Genomes 1, 2, 3                          Genome 1
Whole genome: No     Genes 1A, 1B, 1C or gene sets 1, 2, 3    Gene set 1

[Figure labels: Phylogenetic footprinting; Dictionary building; Motif finding]
Ideally, all information should be used at some
stage, i.e., inside the algorithm vs. in pre- or
post-processing.
44
Classification of approaches
  • Combinatorial search
  • Based on enumeration of words and computing word
    similarities
  • Analogy to DP for sequence alignment
  • Probabilistic modeling
  • Construct models to distinguish motifs vs
    non-motifs
  • Analogy to HMM for sequence alignment

45
Combinatorial motif finding
  • Idea 1: find all k-mers that appear at least m
    times
  • Idea 2: find all k-mers that are statistically
    significant
  • Problem: most motifs allow divergence. Each
    variation may only appear once.
  • Idea 3: find all k-mers, considering IUPAC code
  • e.g. ASGTKTKAC, S = C/G, K = G/T
  • Still inflexible
  • Idea 4: find k-mers that approximately appear
    at least m times
  • i.e. allow some mismatches

46
Combinatorial motif finding
  • Given a set of sequences $S = \{x_1, \ldots, x_n\}$
  • A motif W is a consensus string $w_1 \ldots w_K$
  • Find the motif $W^*$ with the best match to $x_1, \ldots, x_n$
  • Definition of best:
  • $d(W, x_i)$ = min Hamming distance between W and a word
    in $x_i$
  • $d(W, S) = \sum_i d(W, x_i)$
  • $W^* = \arg\min_W d(W, S)$

47
Exhaustive searches
  • 1. Pattern-driven algorithm
  • For W = AA...A to TT...T ($4^K$ possibilities)
  • Find $d(W, S)$
  • Report $W^* = \arg\min_W d(W, S)$
  • Running time: $O(K N 4^K)$
  • (where $N = \sum_i |x_i|$)
  • Guaranteed to find the optimal solution.
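
The pattern-driven search can be sketched in a few lines. This is an illustrative version only, assuming every sequence is at least K long and contains only A, C, G, T; the helper names (`hamming`, `d_seq`, `d_set`, `pattern_driven`) and the toy sequences are not from the lecture.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_set(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

def pattern_driven(S, K):
    """Score all 4^K patterns and return the best one with its total distance."""
    candidates = ("".join(p) for p in product("ACGT", repeat=K))
    best = min(candidates, key=lambda W: d_set(W, S))
    return best, d_set(best, S)

S = ["AAACGTAA", "CCACGTCC", "TTACGTTT"]   # toy data with ACGT planted in each
print(pattern_driven(S, 4))
```

The loop over `product("ACGT", repeat=K)` is what makes the running time grow as $4^K$, so this is only practical for short motifs.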

48
Exhaustive searches
  • 2. Sample-driven algorithm
  • For W = a K-long word in some $x_i$
  • Find $d(W, S)$
  • Report $W^* = \arg\min_W d(W, S)$
  • OR report a local improvement of $W^*$
  • Running time: $O(K N^2)$
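
For comparison, a sample-driven version only scores words that actually occur in the data. A minimal sketch under the same assumptions as above (sequences of length at least K); function names and toy data are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_distance(W, S):
    """d(W, S): for each sequence take the closest |W|-long word, then sum."""
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))
               for x in S)

def sample_driven(S, K):
    """Candidates are restricted to the K-long words present in S."""
    words = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    # Sorting makes tie-breaking deterministic.
    best = min(sorted(words), key=lambda W: total_distance(W, S))
    return best, total_distance(best, S)

S = ["AAACGTAA", "CCACGTCC", "TTACGTTT"]   # toy data with ACGT planted in each
print(sample_driven(S, 4))
```

There are at most N candidate words and each is scored in O(KN) time, which gives the $O(K N^2)$ bound; the price is that a motif absent from the data is never tested.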

49
Exhaustive searches
  • Problem with sample-driven approach
  • If
  • True motif does not occur in data, and
  • True motif is weak
  • Then,
  • random strings may score better than any instance
    of true motif

50
Example
  • E. coli promoter
  • TATA-box 10 bp upstream of transcription start
  • TACGAT
  • TAAAAT
  • TATACT
  • GATAAT
  • TATGAT
  • TATGTT

Consensus: TATAAT
Each instance differs by at most 2 bases from the
consensus. None of the instances matches the
consensus perfectly.
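
The claims about this example are easy to check. A short sketch using only the six instances listed on the slide; the variable names are illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

instances = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
consensus = "TATAAT"

# Distance of each instance to the consensus.
to_consensus = [hamming(consensus, w) for w in instances]

def total(w):
    """Total Hamming distance from a candidate word to all instances."""
    return sum(hamming(w, v) for v in instances)

# A sample-driven search can only pick one of the instances themselves.
best_instance = min(instances, key=total)

print("distances to consensus:", to_consensus)
print("consensus total:", total(consensus))
print("best instance:", best_instance, "total:", total(best_instance))
```

The consensus is never a candidate for a sample-driven search here because it does not occur among the instances.
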
51
Heuristic methods
  • Cannot afford exhaustive search on all patterns
  • Sample-driven approaches may miss real patterns
  • However, a real pattern should not differ too
    much from its instances in S
  • Start from the space of all words in S, extend to
    the space with real patterns

52
Some of the popular tools
  • Consensus (Hertz & Stormo, 1999)
  • WINNOWER (Pevzner & Sze, 2000)
  • MULTIPROFILER (Keich & Pevzner, 2002)
  • PROJECTION (Buhler & Tompa, 2001)
  • WEEDER (Pavesi et al., 2001)
  • And a dozen others

53
Consensus
  • Algorithm
  • Cycle 1:
  • For each word W in S
  • For each word W' in S
  • Create alignment (gap free) of W, W'
  • Keep the $C_1$ best alignments, $A_1, \ldots, A_{C_1}$
  • ACGGTTG , CGAACTT , GGGCTCT
  • ACGCCTG , AGAACTA , GGGGTGT

54
  • Algorithm (cont'd)
  • Cycle i:
  • For each word W in S
  • For each alignment $A_j$ from cycle i-1
  • Create alignment (gap free) of W, $A_j$
  • Keep the $C_i$ best alignments $A_1, \ldots, A_{C_i}$

55
  • $C_1, \ldots, C_n$ are user-defined heuristic constants
  • Running time:
  • $O(kN^2) + O(kN C_1) + O(kN C_2) + \ldots + O(kN C_n)$
  • $= O(kN^2 + kN C_{total})$
  • Where $C_{total} = \sum_i C_i$, typically O(nC), where C is
    a big constant

56
Extended sample-driven (ESD) approaches
  • Hybrid between pattern-driven and sample-driven
  • Assume each instance does not differ by more than
    α bases from the motif (α usually depends on k)

[Figure: a motif and one of its instances, inside an α-neighborhood]
The real motif will reside in the α-neighborhood
of some word in S. Instead of searching all $4^K$
patterns, we can search the α-neighborhood of
every word in S.
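
To make the α-neighborhood concrete, here is a small sketch that enumerates it and counts its size. The names `neighborhood` and `neighborhood_size` are illustrative, and the enumeration is the naive one, not the suffix-tree method used by WEEDER.

```python
from itertools import combinations, product
from math import comb

def neighborhood(word, alpha, alphabet="ACGT"):
    """All words within Hamming distance <= alpha of `word`."""
    result = {word}
    for d in range(1, alpha + 1):
        for positions in combinations(range(len(word)), d):
            # At each chosen position, substitute one of the 3 other letters.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for subs in product(*choices):
                w = list(word)
                for p, c in zip(positions, subs):
                    w[p] = c
                result.add("".join(w))
    return result

def neighborhood_size(K, alpha):
    """Sum over d <= alpha of (K choose d) * 3^d."""
    return sum(comb(K, d) * 3 ** d for d in range(alpha + 1))

print(len(neighborhood("ACGTACG", 2)), neighborhood_size(7, 2))
print(neighborhood_size(20, 6), "vs", 4 ** 20)
```

The last line compares the neighborhood of one 20-mer at α = 6 with the full pattern space of $4^{20}$ words.
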
57
WEEDER
  • Naïve: $N K^\alpha 3^\alpha \times NK$
    (figure annotations: # of patterns to test, # of words in sequences)
58
Better idea
  • Using a joint suffix tree, find all patterns
    that
  • Have length K
  • Appear in at least m sequences with at most α
    mismatches
  • Post-processing

59
WEEDER algorithm sketch
Current pattern P, |P| < K
  • A list containing all eligible nodes with at
    most α mismatches to P
  • For each node, remember the mismatches accumulated
    (e), and a bit vector (B) for sequence occurrence, e.g.
    011100010
  • Bit OR all Bs to get the sequence occurrence for P
  • Suppose occ ≥ m
  • Pattern still valid
  • Now add a letter

[Figure: current pattern ACGTT; eligible nodes are annotated with their mismatches (e, B), and the ORed Bs give the sequence occurrence]
60
WEEDER algorithm sketch
Current pattern P
ACGTTA
  • Simple extension: no branches.
  • No change to B
  • e may increase by 1 or stay unchanged
  • Drop node if e > α
  • Branches: replace a node with its child nodes
  • Drop if e > α
  • B may change
  • Re-do Bit OR using all Bs
  • Try a different char if occ < m
  • Report P when |P| = K

[Figure: nodes annotated with (e, B)]
61
WEEDER complexity
  • Can get all D(P, S) in time
  • $O(nN \binom{K}{\alpha} 3^\alpha) \approx O(nN K^\alpha 3^\alpha)$.
  • n: # of sequences. Needed for Bit OR.
  • Better than $O(KN 4^K)$ since usually $\alpha \ll K$
  • $K^\alpha 3^\alpha$ may still be expensive for large K
  • E.g. K = 20, α = 6

62
WEEDER More tricks
Current pattern P
ACGTTA
  • Instead of eligible nodes with at most α mismatches to P
  • Use eligible nodes with at most min(εL, α)
    mismatches to P
  • L: current pattern length
  • ε: error ratio
  • Require that mismatches be somewhat evenly
    distributed among positions
  • Prune tree at length K

63
MULTIPROFILER
  • W* differs from W at α positions.
  • The consensus sequence for the words in the
    α-neighborhood of W is similar to W*.
  • If we ignore all the chars that are similar to W,
    the rest may suggest the difference between W and
    W*

[Figure: two words, ACGTACG and ATGTAAG, differing at two positions]
64
MULTIPROFILER alg sketch
  • For each word P in S
  • Find its α-neighborhood in S
  • List of words C
  • For each position j from 1..K of the words in C
  • Find the most popular char that differs from Pj
  • Replace α positions in P with the chars found
    above
  • Call the new word P'
  • W* = argmin D(P', S)

65
MULTIPROFILER
  • No complexity provided in the paper
  • More efficient than WEEDER for longer patterns: $N < K^\alpha 3^\alpha$
  • How to choose α is an issue
  • Large α: too much noise in the neighborhood
  • Small α: few true instances in the neighborhood

66
  • Probabilistic modeling approaches
  • for motif finding

67
Probabilistic modeling approaches
  • A motif model
  • Usually a PWM
  • M = (Pij), i = 1..4, j = 1..k, k: motif length
  • A background model
  • Usually the distribution of base frequencies in
    the genome (or other selected subsets of
    sequences)
  • B = (bi), i = 1..4
  • A word can be generated by M or B

68
Expectation-Maximization
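  • For any word W,
  • $P(W \mid M) = P_{W_1,1} P_{W_2,2} \cdots P_{W_K,K}$
  • $P(W \mid B) = b_{W_1} b_{W_2} \cdots b_{W_K}$
  • Let $\lambda = P(M)$, i.e., the probability for any word
    to be generated by M.
  • Then $P(B) = 1 - \lambda$
  • Can compute the posterior probabilities $P(M \mid W)$ and
    $P(B \mid W)$
  • $P(M \mid W) \propto P(W \mid M)\,\lambda$
  • $P(B \mid W) \propto P(W \mid B)\,(1 - \lambda)$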
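
The posterior follows from normalizing the two products so that they sum to 1. A minimal sketch, assuming the PWM is stored as a list of per-position dictionaries; the toy motif `M`, background `B`, and prior `lam` are made up for illustration.

```python
import math

def p_word_given_motif(word, M):
    """P(W | M): product of PWM entries, where M[j][base] is the probability at position j."""
    return math.prod(M[j][b] for j, b in enumerate(word))

def p_word_given_background(word, B):
    """P(W | B): product of background base frequencies."""
    return math.prod(B[b] for b in word)

def posterior_motif(word, M, B, lam):
    """P(M | W) = P(W|M) lam / (P(W|M) lam + P(W|B) (1 - lam))."""
    num = p_word_given_motif(word, M) * lam
    den = num + p_word_given_background(word, B) * (1 - lam)
    return num / den

# Hypothetical 3-position motif favoring ACG, with a uniform background.
M = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = 0.1

for w in ("ACG", "ACT", "TTT"):
    print(w, round(posterior_motif(w, M, B, lam), 4))
```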

69
Expectation-Maximization
  • Initialize
  • Randomly assign each word to M or B
  • Let $Z_{xy} = 1$ if position y in sequence x is a
    motif, and 0 otherwise
  • Estimate parameters M, λ, B
  • Iterate until convergence
  • E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
  • M-step: re-estimate M, λ given Z (B usually fixed)
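
The iteration above can be sketched as a two-component mixture over all k-long words. This is a simplified illustration, not MEME itself: it assumes sequences contain only A, C, G, T, keeps B fixed at the overall base frequencies, adds a small pseudocount (`pseudo`, an assumption) to avoid zero probabilities, and runs a fixed number of iterations instead of testing convergence.

```python
import math
import random

BASES = "ACGT"

def em_motif(seqs, k, n_iter=50, seed=0, pseudo=0.1):
    rng = random.Random(seed)
    words = [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]
    # Background B: base frequencies over all sequences (kept fixed).
    total = sum(len(s) for s in seqs)
    B = {b: sum(s.count(b) for s in seqs) / total for b in BASES}
    # Initialize: randomly assign each word to M (1) or B (0).
    Z = [1.0 if rng.random() < 0.5 else 0.0 for _ in words]
    M, lam = None, None
    for _ in range(n_iter):
        # M-step: re-estimate lam and the PWM from the current weights Z.
        lam = sum(Z) / len(Z)
        M = []
        for j in range(k):
            col = {b: pseudo for b in BASES}
            for w, z in zip(words, Z):
                col[w[j]] += z
            tot = sum(col.values())
            M.append({b: col[b] / tot for b in BASES})
        # E-step: Z = posterior probability that each word came from M.
        Z = []
        for w in words:
            pm = lam * math.prod(M[j][c] for j, c in enumerate(w))
            pb = (1 - lam) * math.prod(B[c] for c in w)
            Z.append(pm / (pm + pb))
    return M, lam, Z

seqs = ["TTGACGTCAT", "CCACGTGGAA", "GGTTACGTAC"]   # toy data
M, lam, Z = em_motif(seqs, k=4)
print("lambda:", round(lam, 3))
print("consensus:", "".join(max(col, key=col.get) for col in M))
```

Like any EM procedure, the result depends on the starting point, which is why tools of this kind typically try multiple initializations.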

70
Expectation-Maximization
[Figure: probability of a motif at each position, shown at initialization, after the E-step, and after the M-step]
  • E-step: $Z_{xy} = P(M \mid X_{y..y+k-1})$ for all x and y
  • M-step: re-estimate M, λ given Z

71
MEME
  • Multiple EM for Motif Elicitation
  • Bailey and Elkan, UCSD
  • http://meme.sdsc.edu/
  • Multiple starting points
  • Multiple modes: ZOOPS, OOPS, TCM

72
Gibbs Sampling
  • Another very useful technique for estimating
    missing parameters
  • EM is deterministic
  • Often trapped by local optima
  • Gibbs sampling: stochastic behavior to avoid
    local optima

73
Gibbs sampling
  • Initialize
  • Randomly assign each word to M or B
  • Let $Z_{xy} = 1$ if position y in sequence x is a
    motif, and 0 otherwise
  • Estimate parameters M, B, λ
  • Iterate
  • Randomly remove a sequence X from S
  • Recalculate model parameters using S \ X
  • Compute $Z_{xy}$ for X
  • Sample a $y^*$ from $Z_{xy}$.
  • Let $Z_{xy} = 1$ for $y = y^*$ and 0 otherwise
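
The loop above can be sketched as a simple site sampler. This version makes assumptions beyond the slide: exactly one motif instance per sequence (so λ is not estimated), a fixed background with pseudocounts, a fixed number of iterations, and sequences over A, C, G, T that are at least k long. Function names and toy data are illustrative.

```python
import random

BASES = "ACGT"

def build_pwm(words, k, pseudo=1.0):
    """PWM from aligned words, with a pseudocount to avoid zero probabilities."""
    pwm = []
    for j in range(k):
        col = {b: pseudo for b in BASES}
        for w in words:
            col[w[j]] += 1
        tot = sum(col.values())
        pwm.append({b: col[b] / tot for b in BASES})
    return pwm

def gibbs_motif(seqs, k, n_iter=200, seed=0):
    rng = random.Random(seed)
    total = sum(len(s) for s in seqs)
    B = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4) for b in BASES}
    # Initialize: one random motif start per sequence.
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        x = rng.randrange(len(seqs))   # randomly remove one sequence
        others = [s[p:p + k] for i, (s, p) in enumerate(zip(seqs, pos)) if i != x]
        M = build_pwm(others, k)       # recalculate the model without it
        # Weight of each start y in the removed sequence: P(W|M) / P(W|B).
        weights = []
        for y in range(len(seqs[x]) - k + 1):
            ratio = 1.0
            for j, c in enumerate(seqs[x][y:y + k]):
                ratio *= M[j][c] / B[c]
            weights.append(ratio)
        # Sample the new start in proportion to its weight (not the maximum).
        pos[x] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["TTGACGTCAT", "CCACGTGGAA", "GGTTACGTAC", "ACGTTTTGCA"]   # toy data
pos = gibbs_motif(seqs, k=4)
print([(p, s[p:p + 4]) for s, p in zip(seqs, pos)])
```

Because the new position is sampled rather than maximized, repeated runs with different seeds can settle on different answers.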

74
Gibbs Sampling
[Figure: probability of a motif at each position of a sequence, with one position chosen by sampling]
  • Gibbs sampling: sample one position according to
    probability
  • Update prediction of one training sequence at a
    time
  • Viterbi: always take the highest
  • EM: take weighted average

(Viterbi and EM: simultaneously update predictions of all sequences)
75
Gibbs sampling motif finders
  • Gibbs Sampler, based on C. Lawrence et al.,
    Science, 1993
  • AlignACE, Nat Biotech 1998, developed in the Church
    lab, Harvard Univ.
  • BioProspector, X. Liu et al., PSB 2001, an
    improvement of AlignACE

76
Better background model
  • Repetitive DNA can be mistaken for a motif
  • Especially low-complexity sequences: CACACA..., AAAAA..., etc.
  • Solution: a more elaborate background model
  • Higher-order Markov model
  • 0th order: B = { pA, pC, pG, pT }
  • 1st order: B = { P(A|A), P(A|C), ..., P(T|T) }
  • Kth order: B = { P(X | b1...bK) ; X, bi ∈ {A,C,G,T} }
  • Has been applied to EM and Gibbs (up to 3rd
    order)
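
A higher-order background can be estimated by counting which base follows each context. The sketch below is illustrative only: `train_markov` and `log2_prob` are made-up names, a pseudocount of 1 is assumed, unseen contexts fall back to a uniform distribution, and the first `order` bases of a word are skipped when scoring.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, pseudo=1.0, alphabet="ACGT"):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudo for b in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for ctx, c in counts.items():
        tot = sum(c.values())
        model[ctx] = {b: c[b] / tot for b in alphabet}
    return model

def log2_prob(word, model, order, alphabet="ACGT"):
    """log2 P(word | background), conditioning each base on its context."""
    uniform = 1.0 / len(alphabet)
    lp = 0.0
    for i in range(order, len(word)):
        p = model.get(word[i - order:i], {}).get(word[i], uniform)
        lp += math.log2(p)
    return lp

repeats = ["CACACACACACA"]             # toy low-complexity background
model0 = train_markov(repeats, 0)
model1 = train_markov(repeats, 1)
print("0th order:", round(log2_prob("CACA", model0, 0), 2))
print("1st order:", round(log2_prob("CACA", model1, 1), 2))
```

Under the 1st-order model the repeat CACA is much more probable as background, so it earns a lower motif-versus-background score than it would under base frequencies alone.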

77
Limits of Motif Finders
[Figure: upstream region of a gene, from an unknown start (???) to position 0]
  • Given upstream regions of coregulated genes
  • Increasing length makes motif finding harder:
    random motifs clutter the true ones
  • Decreasing length makes motif finding harder:
    true motif missing in some sequences

78
Challenging problem
[Figure: n = 20 sequences of length L = 1000, each containing a motif instance of length k with d mutations]
  • (k, d)-motif challenge problem
  • Many algorithms fail at the (15, 4)-motif for n = 20
    and L = 1000
  • Combinatorial algorithms usually work better on
    challenge problem
  • However, they are usually designed to find (k,
    d)-motifs
  • Performance in real data varies

79
Motif finding in practice
  • Now we've found some good-looking motifs
  • Easiest step?
  • What to do next?
  • Are they real?
  • How do we find more instances in the rest of the
    genome?
  • What is their functional meaning?
  • Motifs => regulatory networks

80
To make sense of the motifs
  • Each program usually reports a number of motifs
    (tens to hundreds)
  • Many motifs are variations of each other
  • Each program also reports some different ones
  • Each program has its own way of scoring motifs
  • Best-scored motifs are often not interesting
  • AAAAAAAA
  • ACACACAC
  • TATATATAT

81
Strategies to improve results
  • Combining results from different algorithms is usually
    helpful
  • Ones that appear multiple times are probably
    more interesting
  • Except simple repeats like AAAAA or ATATATATA
  • Will talk about this later.
  • Cluster motifs into groups. Issues:
  • Measure similarities between two motifs (PWMs)
  • # of clusters

82
Strategies to improve results
  • Compare with known motifs in database
  • TRANSFAC
  • JASPAR
  • Issues
  • Compute similarities among motifs
  • How similar is similar?

83
Strategies to improve results
  • Statistical test of significance
  • Enrichment in target sequences vs background
    sequences

Background set B: assumed to not contain P, or to contain it with very low
frequency
Target set T: assumed to contain a common motif, P
Ideal case: every sequence in T has P, no
sequence in B has P
84
Statistical test for significance
[Figure: the background set + target set (B + T) has M sequences, and P appeared in m of them; the target set T has N sequences, and P appeared in n of them]
  • If n / N >> m / M
  • P is enriched (over-represented) in T
  • Statistical significance?
  • If we randomly draw N sequences from (B + T), how
    likely is it that we will see at least n sequences with P?

85
Hypergeometric distribution
  • A box with M balls, of which N are red, and the
    rest are blue.
  • We randomly draw m balls from the box
  • What's the probability we'll see n red balls?
  • Red balls: target sequences
  • Blue balls: background sequences
  • Total # of choices: (M choose m)
  • # of choices to have n red balls: (N choose n) ×
    (M-N choose m-n)

86
Cumulative hypergeometric test for motif
significance
  • We are interested in: if we randomly pick m
    balls, how likely is it that we'll see at least n red
    balls?
    $P(X \ge n) = \sum_{i=n}^{\min(m, N)} \binom{N}{i} \binom{M-N}{m-i} \Big/ \binom{M}{m}$

This can be interpreted as the p-value for the
null hypothesis that we are randomly
picking. Alternative hypothesis: our selection
favors red balls. Equivalently, the target set T is
enriched with motif P, or P is over-represented
in T.
87
Examples
  • Yeast genome has 6000 genes
  • Select 50 genes believed to be co-regulated by a
    common TF
  • Found a motif for these 50 genes
  • It appeared in 20 out of these 50 genes
  • In the rest of the genome, 100 genes have this motif
  • M = 6000, N = 50, m = 100 + 20 = 120, n = 20
  • Intuition:
  • m/M = 120/6000: in the genome, 1 out of 50 genes has
    the motif
  • N = 50: we would expect only 1 gene in the target
    set to have the motif
  • 20-fold enrichment
  • P-value = $6 \times 10^{-22}$
  • If n = 5: 5-fold enrichment, P-value = 0.003
  • Normally a very low p-value is needed, e.g. $10^{-10}$
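
The cumulative hypergeometric p-value for this example can be computed with exact integer binomials. A minimal sketch using the box-of-balls notation (M balls, N red, m drawn, n red observed); the function names are illustrative.

```python
from math import comb

def hypergeom_pmf(M, N, m, n):
    """P(exactly n red) when drawing m balls from M balls, N of which are red."""
    return comb(N, n) * comb(M - N, m - n) / comb(M, m)

def hypergeom_pvalue(M, N, m, n):
    """P(at least n red): sum the pmf from n up to the largest possible count."""
    return sum(hypergeom_pmf(M, N, m, i) for i in range(n, min(m, N) + 1))

# Yeast example: 6000 genes, 50 in the target set, 120 genes carry the motif.
print("n = 20:", hypergeom_pvalue(6000, 50, 120, 20))
print("n = 5: ", hypergeom_pvalue(6000, 50, 120, 5))
```

`math.comb` works on arbitrary-precision integers, so the very large binomials are exact and only the final division is rounded to a float.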

88
ROC curve for motif significance
  • Motif is usually a PWM
  • Any word will have a score
  • Typical scoring function: $\log \frac{P(W \mid M)}{P(W \mid B)}$
  • W: a word.
  • M: a PWM.
  • B: background model
  • To determine whether a sequence contains a motif,
    a cutoff has to be decided
  • With different cutoffs, you get different number
    of genes with the motif
  • Hyper-geometric test first assumes a cutoff
  • It may be better to look at a range of cutoffs
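
The scoring function and the effect of the cutoff can be sketched as follows. The toy PWM `M`, the uniform background `B`, the sequences, and the helper names are assumptions for illustration; a sequence is counted as containing the motif when its best-scoring word reaches the cutoff.

```python
import math

def log_odds(word, M, B):
    """log2 [ P(W | M) / P(W | B) ] for a word of the motif's length."""
    return sum(math.log2(M[j][c] / B[c]) for j, c in enumerate(word))

def best_score(seq, M, B):
    """Highest log-odds score over all motif-length words in the sequence."""
    k = len(M)
    return max(log_odds(seq[i:i + k], M, B) for i in range(len(seq) - k + 1))

def count_hits(seqs, M, B, cutoff):
    """Number of sequences whose best word scores at or above the cutoff."""
    return sum(best_score(s, M, B) >= cutoff for s in seqs)

# Hypothetical 3-position motif favoring ACG, with a uniform background.
M = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seqs = ["TTACGTT", "TTACTTT", "TTTTTTT"]

for cutoff in (4.0, 1.0, -10.0):
    print("cutoff", cutoff, "->", count_hits(seqs, M, B, cutoff), "sequences with the motif")
```

Lowering the cutoff admits more sequences, which is why a single hypergeometric test depends on where the cutoff is placed.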

89
ROC curve for motif significance
[Figure: given a score cutoff, P appeared in m of the M sequences in the background set + target set (B + T), and in n of the N sequences in the target set T]
  • With a different score cutoff, we will have different
    m and n
  • Assume you want to use P to classify T and B
  • Sensitivity: n / N
  • Specificity: (M - N - m + n) / (M - N)
  • False positive rate = 1 - specificity = (m - n) /
    (M - N)
  • With decreasing cutoff, sensitivity ↑, FPR ↑
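
Sweeping the cutoff produces one (FPR, sensitivity) point per cutoff. A minimal sketch, assuming each sequence has already been reduced to a single best motif score; the function names and the toy scores are illustrative.

```python
def roc_points(target_scores, background_scores):
    """Sweep the cutoff from high to low; return (FPR, sensitivity) pairs."""
    cutoffs = sorted(set(target_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]   # highest cutoff: no sequence passes
    for c in cutoffs:
        tp = sum(s >= c for s in target_scores)       # n: hits in the target set
        fp = sum(s >= c for s in background_scores)   # m - n: hits in the background
        points.append((fp / len(background_scores), tp / len(target_scores)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

target = [5.1, 4.2, 3.9, 1.0]        # toy best scores in the target set T
background = [3.0, 2.5, 1.2, 0.4]    # toy best scores in the background set B
print(round(auc(roc_points(target, background)), 3))
```

The final point is always (1, 1), reached at the lowest cutoff where every sequence passes.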

90
ROC curve for motif significance
[Figure: ROC curves, sensitivity (0 to 1) vs. 1 - specificity (0 to 1), for Motif 1, Motif 2, and Random, with a good cutoff marked]
Highest cutoff: no motif can pass the cutoff.
Sensitivity = 0, specificity = 1.
Lowest cutoff: every sequence has the motif.
Sensitivity = 1, specificity = 0.
  • ROC-AUC: area under curve.
  • 1: the best. 0.5: random.
  • Motif 1 is more enriched than motif 2.
91
Other strategies
  • Cross-validation
  • Randomly divide sequences into 10 sets, hold 1
    set for test.
  • Do motif finding on 9 sets. Does the motif also
    appear in the testing set?
  • Phylogenetic conservation information
  • Does a motif also appear in the homologous genes
    of another species?
  • Strongest evidence
  • However, will not be able to find
    species-specific ones

92
Other strategies
  • Finding motif modules
  • Will two motifs always appear in the same gene?
  • Location preference
  • Some motifs appear to be in certain locations
  • E.g., within 50-150 bp upstream of the transcription
    start
  • If a detected motif has strong positional bias,
    may be a sign of its function
  • Evidence from other types of data sources
  • Do the genes having the motif always have similar
    activities (gene expression levels) across
    different conditions?
  • Interact with the same set of proteins?
  • Similar functions?
  • etc.