Transcription Regulation Transcription Factor Motif Finding

1
Transcription Regulation
Transcription Factor Motif Finding
  • Xiaole Shirley Liu
  • STAT115, STAT215, BIO298, BIST520

2
Outline
  • Biology of transcription regulation and
    challenges of computational motif finding
  • Scan for known TF motif sites
  • TRANSFAC and JASPAR, sequence logo
  • De novo method
  • Regular expression enumeration: w-mer enumeration
  • Position weight matrix update: EM and Gibbs
  • Motif finding in different organisms
  • Motif clusters and conservation

3
Imagine a Chef
4
Each Cell Is Like a Chef
6
Understanding a Genome
Get the complete sequence (encoded cook book)
Observe gene expressions at different cell
states (meals prepared at different situations)
Decode gene regulation (decode the book,
understand the rules)
7
Information in DNA
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
8
Information in DNA
  • Non-coding region: 98%
  • Regulation: when, where, amount, other conditions, etc.
  • ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
  • ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
  • GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
  • TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
  • CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Coding region: 2%
Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9
Measure Gene Expression
  • Microarray or SAGE detects the expression of
    every gene at a certain cell state
  • Clustering: find genes that are co-expressed
    (and potentially share regulation)

10
Decode Gene Regulation
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
  • TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
  • CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together
Upstream Regions / Co-expressed Genes
11
Decode Gene Regulation
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
  • TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
  • CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together
Upstream Regions / Co-expressed Genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
13
Biology of Transcription Regulation
  • ...acatttgcttctgacacaactgtgttcactagcaacctca...aaca
    gacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggc
agcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gat
acccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gat
gcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
Motifs can only be discovered computationally when
there are enough cases for machine learning
14
Computational Motif Finding
  • Input data
  • Upstream sequences of gene expression profile
    cluster
  • 20-800 sequences, each 300-5000 bp long
  • Output: enriched sequence patterns (motifs)
  • Ultimate goals
  • Which TFs are involved, and what are their binding
    motifs and effects (enhance / repress gene expression)?
  • Which genes are regulated by this TF, and why is
    there disease when a TF goes wrong?
  • Are there binding partners / competitors for a TF?

15
Challenges: Where/what is the signal
  • The motif should be abundant
  • GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAAATAAGACACGGACGGC
  • GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
  • TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
  • CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

16
Challenges: Where/what is the signal
  • The motif should be abundant
  • And abundant with significance
  • GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAAATAAGACACGGACGGC
  • GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
  • TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
  • CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

17
Challenges: Double-stranded DNA
  • Motif appears in both strands
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
  • TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
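
The two-strand challenge can be illustrated with a small scan. This is a minimal sketch, assuming exact matching and the 7-mer ATTTACC from the example sequences; the function names are hypothetical and are not taken from the lecture.

```python
# Complement table for DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, motif):
    """Return (position, strand) for every exact match on either strand.

    Positions are 0-based and refer to the given (forward) sequence.
    A match to the reverse complement is reported on the '-' strand.
    """
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits
```

In the second and third sequences above, the site appears only as GGTAAAT, the reverse complement of ATTTACC, so a forward-only scan would miss it.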

18
Challenges: Base substitutions
  • Sequences do not have to match the motif
    perfectly; base substitutions are allowed
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
  • GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
  • TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
  • CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
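
One simple way to tolerate base substitutions is to accept any window within a small Hamming distance of the motif. The sketch below assumes that approach (the lecture later uses weight matrices instead); the function names and the default of one allowed mismatch are illustrative choices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sites_with_mismatches(seq, motif, max_mismatches=1):
    """Return (position, mismatches) for each window close enough to motif.

    Positions are 0-based; only the forward strand is scanned.
    """
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        distance = hamming(seq[i:i + width], motif)
        if distance <= max_mismatches:
            hits.append((i, distance))
    return hits
```

With ATTTACC as the motif, the first sequence above matches exactly, while the second contains ATGTACC, which is one substitution away.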

19
Challenges: Variable motif copies
  • Some sequences do not have the motif
  • Some have multiple copies of the motif
  • GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
  • CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
  • TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
  • GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
  • CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT


21
Challenges: Two-block motifs
  • Some motifs have two parts
  • GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
  • CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
  • TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
  • CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

22
Scan for Known TF Motif Sites
  • Experimental TF sites: TRANSFAC, JASPAR
  • Motif representation
  • Regular expression (binary decision)
  • Consensus: CACAAAA
  • Degenerate (IUPAC): CRCAAAW, where R = A/G and W = A/T
23
Scan for Known TF Motif Sites
  • Experimental TF sites: TRANSFAC, JASPAR
  • Motif representation
  • Regular expression (binary decision)
  • Consensus: CACAAAA
  • Degenerate: CRCAAAW
  • Position weight matrix (PWM): needs a score cutoff

Sites used to build the motif matrix (Pos 12345678):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
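
A position weight matrix can be built by counting bases column by column over the aligned sites. The sketch below uses the ten sites from this slide; the function names and the optional pseudocount parameter are assumptions added for illustration.

```python
from collections import Counter

# The ten aligned sites shown on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def count_matrix(sites):
    """One Counter of base counts per motif position."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]


def frequency_matrix(sites, pseudocount=0.0):
    """Per-position base frequencies, optionally smoothed by a pseudocount."""
    total = len(sites) + 4 * pseudocount
    return [
        {base: (column[base] + pseudocount) / total for base in "ACGT"}
        for column in count_matrix(sites)
    ]


def consensus(sites):
    """Most frequent base at each position."""
    return "".join(col.most_common(1)[0][0] for col in count_matrix(sites))
```

For these sites the consensus is ATGGCATG, and position 1 is A in nine of the ten sites.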
24
IUPAC for DNA
  • A: adenosine
  • C: cytidine
  • G: guanine
  • T: thymidine
  • U: uridine
  • R: G or A (purine)
  • Y: T or C (pyrimidine)
  • K: G or T (keto)
  • M: A or C (amino)
  • S: G or C (strong)
  • W: A or T (weak)
  • B: C, G or T (not A)
  • D: A, G or T (not C)
  • H: A, C or T (not G)
  • V: A, C or G (not T)
  • N: A, C, G or T (any)
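
A degenerate IUPAC pattern such as CRCAAAW can be scanned by translating each code into a regular-expression character class. This is a minimal sketch using Python's standard `re` module; the helper names are hypothetical, and U is left out because only DNA is scanned here.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (DNA only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    return "".join(IUPAC[code] for code in pattern.upper())


def scan(seq, pattern):
    """Return 0-based start positions of all matches, including overlaps."""
    # The lookahead makes each match zero-width so overlapping sites are kept.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(seq)]
```

The match is a binary decision: a window either fits the pattern or it does not, which is the limitation that motivates the weight-matrix representation.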

25
Protein Binding Microarrays
  • In vitro protein-DNA interactions
  • Better capture motifs

26
JASPAR
  • User defined cutoff to scan for a particular motif

27
A Word on Sequence Logo
  • SeqLogo consists of stacks of symbols, one stack
    for each position in the sequence
  • The overall height of the stack indicates the
    sequence conservation at that position
  • The height of symbols within the stack indicates
    the relative frequency of nucleic acid at that
    position

ATGGCATG AGGGTGCG ATCGCATG
TTGCCACG ATGGTATT ATTGCACG AGGGCGTT
ATGACATG ATGGCATG ACTGGATG
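
The stack and symbol heights of a sequence logo can be computed directly from the aligned sites. The sketch below assumes the common definition in which a DNA position carries at most 2 bits, and it ignores the small-sample correction that logo software usually applies; the function names are illustrative.

```python
import math

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def column_frequencies(sites):
    """Relative frequency of each base at each position."""
    n = len(sites)
    width = len(sites[0])
    return [
        {base: sum(site[i] == base for site in sites) / n for base in "ACGT"}
        for i in range(width)
    ]


def logo_heights(sites):
    """Per position: (stack height in bits, {base: symbol height})."""
    heights = []
    for column in column_frequencies(sites):
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        info = 2.0 - entropy  # 2 bits is the maximum for four bases
        heights.append((info, {b: p * info for b, p in column.items()}))
    return heights
```

A perfectly conserved position reaches the full 2 bits, while a position with all four bases equally frequent has height 0.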
28
Scan Known TF Motifs
  • Drawbacks
  • Limited number of motifs
  • Limited number of sites to represent each motif
  • Low sensitivity and specificity
  • Poor description of motif
  • Binding site borders are not clear
  • Binding sites have many mismatches
  • Many motifs look very similar
  • E.g. GC-rich motif, E-box (CACGTG)

29
De novo Sequence Motif Finding
  • Goal: look for common sequence patterns enriched
    in the input data (compared to the genome
    background)
  • Regular expression enumeration
  • Pattern driven approach
  • Enumerate patterns, check significance in
    dataset
  • Oligonucleotide analysis, MobyDick
  • Position weight matrix update
  • Data driven approach, use data to refine motifs
  • Consensus, EM, Gibbs sampling
  • Motif score and Markov background

30
Regular Expression Enumeration
  • Oligonucleotide analysis: check
    over-representation for every w-mer
  • Expected occurrence of w in the data
  • Consider the genome sequence and the current data size
  • Observed occurrence of w in the data
  • An over-represented w is a potential TF binding motif
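
The observed-versus-expected comparison can be sketched as follows. This is only an illustration: the background is passed in as a list of sequences standing in for the genome, the pseudocount is an assumption to avoid dividing by zero, and the plain ratio is not a significance test.

```python
from collections import Counter


def count_wmers(seqs, w):
    """Count every overlapping w-mer in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def enrichment(data, background, w, pseudocount=1.0):
    """Map each w-mer in data to (observed, expected, observed/expected).

    Expected = background frequency of the w-mer scaled to the number
    of w-mer windows in the data.
    """
    observed = count_wmers(data, w)
    bg_counts = count_wmers(background, w)
    n_data = sum(observed.values())
    n_bg = sum(bg_counts.values())
    result = {}
    for word, obs in observed.items():
        bg_freq = (bg_counts[word] + pseudocount) / (n_bg + pseudocount * 4 ** w)
        expected = bg_freq * n_data
        result[word] = (obs, expected, obs / expected)
    return result
```

Sorting the result by the ratio puts candidate motif words at the top; a real analysis would replace the ratio with a proper statistical score.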
31
MobyDick
  • Sequence data and a dictionary of motif words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}, Pw = {0.22, 0.28, 0.28, 0.22}
32
MobyDick
  • Sequence data and a dictionary of motif words
  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.22, 0.22, 0.28}
33
MobyDick
  • Sequence data and a dictionary of motif words
  • Check over-representation of every word-pair

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.28, 0.22, 0.22}
D = {A, C, G, T, AA, GA, TA, GG}, Pw = ?
34
MobyDick
  • D = {A, C, G, T, AA, GA, TA, GG}
  • Seq: AAGATAA
  • Possible partitions
  • A A G A T A A: pA pA pG pA pT pA pA
  • AA G A T A A: pAA pG pA pT pA pA
  • AA GA T A A: pAA pGA pT pA pA
  • AA GA TA A: pAA pGA pTA pA
  • A A GA T AA: pA pA pGA pT pAA
  • Assign probabilities so as to maximize the total
    probability of generating the sequence
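
The total probability of generating a sequence is the sum, over every way of cutting it into dictionary words, of the product of the word probabilities. The sketch below enumerates the partitions recursively; it only evaluates the likelihood for given word probabilities and does not perform the maximization step. The function names are hypothetical.

```python
# Dictionary from the slide.
DICTIONARY = ["A", "C", "G", "T", "AA", "GA", "TA", "GG"]


def partitions(seq, dictionary):
    """Yield every way to split seq into consecutive dictionary words."""
    if not seq:
        yield []
        return
    for word in dictionary:
        if seq.startswith(word):
            for rest in partitions(seq[len(word):], dictionary):
                yield [word] + rest


def sequence_probability(seq, word_probs):
    """Sum over all partitions of the product of word probabilities.

    word_probs maps each dictionary word to its probability.
    """
    total = 0.0
    for words in partitions(seq, word_probs):
        product = 1.0
        for word in words:
            product *= word_probs[word]
        total += product
    return total
```

For AAGATAA and this dictionary the enumeration yields 12 partitions, of which the slide lists five. Exhaustive enumeration grows quickly with sequence length, so a practical implementation would use dynamic programming instead.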

35
MobyDick
  • Sequence data and a dictionary of motif words
  • Check over-representation of every word-pair
  • Reassign word probability and consider every new
    word-pair to build even longer words

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
D = {A, C, G, T}, Pw = {0.28, 0.28, 0.22, 0.22}
D = {A, C, G, T, AA, GA, TA, GG}, Pw = ?
36
Regular Expression Enumeration
  • RE Enumeration Derivatives
  • oligo-analysis, spaced dyads w1.ns.w2
  • IUPAC alphabet
  • Markov background (later)
  • 2-bit encoding, fast index access
  • Enumerate limited RE patterns known for a TF
    protein structure or interaction theme
  • Exhaustive, guaranteed to find global optimum,
    and can find multiple motifs
  • Not as flexible with base substitutions, long
    list of similar good motifs, and limited with
    motif width

37
Consensus
  • Starting from the 1st sequence, add one sequence
    at a time to look for the best motifs obtained
    with the additional sequence

38
Consensus
  • Starting from the 1st sequence, add one sequence
    at a time to look for the best motifs obtained
    with the additional sequence

Remaining good motifs

39
Consensus
  • Starting from the 1st sequence, add one sequence
    at a time to look for the best motifs obtained
    with the additional sequence
  • G Stormo; the algorithm runs very fast
  • Sequence order plays a big role in performance
  • The first two sequences had better contain the motif
  • Sites stop accumulating at the first bad sequence
  • Newer version allowing 0-n sites per sequence is much slower

40
Expectation Maximization and Gibbs Sampling Model
  • Objects
  • Seq: sequence data to search for motif
  • θ0: non-motif (genome background) probability
  • θ: motif probability matrix parameter
  • π: motif site locations
  • Problem: P(θ, π | seq, θ0)
  • Approach: alternately estimate
  • π by P(π | θ, seq, θ0)
  • θ by P(θ | π, seq, θ0)
  • EM and Gibbs differ in the estimation methods

41
Expectation Maximization
  • E step: π | θ, seq, θ0
  • TTGACGACTGCACGT
  • TTGAC p1
  • TGACG p2
  • GACGA p3
  • ACGAC p4
  • CGACT p5
  • GACTG p6
  • ACTGC p7
  • CTGCA p8
  • ...
  • p1 = likelihood ratio = P(TTGAC | θ) / P(TTGAC | θ0)

P(TTGAC | θ0) = p0T x p0T x p0G x p0A x p0C = 0.3 x 0.3 x 0.2 x
0.3 x 0.2
42
Expectation Maximization
  • E step: π | θ, seq, θ0
  • TTGACGACTGCACGT
  • TTGAC p1
  • TGACG p2
  • GACGA p3
  • ACGAC p4
  • CGACT p5
  • GACTG p6
  • ACTGC p7
  • CTGCA p8
  • ...
  • M step: θ | π, seq, θ0
  • p1 x TTGAC
  • p2 x TGACG
  • p3 x GACGA
  • p4 x ACGAC
  • ...
  • Scale ACGT at each position; θ reflects the weighted
    average of π
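
One EM iteration can be sketched as below. This is an illustration under stated assumptions rather than the original EM motif finder: each sequence is assumed to contain exactly one site (so the weights are normalized per sequence), a small pseudocount keeps probabilities nonzero, and all function names are hypothetical.

```python
def likelihood_ratio(segment, theta, theta0):
    """P(segment | motif matrix theta) / P(segment | background theta0)."""
    ratio = 1.0
    for j, base in enumerate(segment):
        ratio *= theta[j][base] / theta0[base]
    return ratio


def e_step(seq, theta, theta0):
    """Weight of every start position, normalized to sum to 1.

    Normalizing per sequence assumes one site per sequence.
    """
    w = len(theta)
    ratios = [
        likelihood_ratio(seq[i:i + w], theta, theta0)
        for i in range(len(seq) - w + 1)
    ]
    total = sum(ratios)
    return [r / total for r in ratios]


def m_step(seqs, weights, w, pseudocount=0.1):
    """Rebuild the motif matrix from weighted segment counts."""
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
    for seq, seq_weights in zip(seqs, weights):
        for i, p in enumerate(seq_weights):
            for j, base in enumerate(seq[i:i + w]):
                counts[j][base] += p
    theta = []
    for column in counts:
        total = sum(column.values())  # scale A, C, G, T at each position
        theta.append({b: c / total for b, c in column.items()})
    return theta
```

Alternating `e_step` over all sequences with `m_step` until the matrix stops changing gives the basic EM loop; the result depends on the starting matrix.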

43
EM Derivatives
  • First EM motif finder (C Lawrence)
  • Deterministic algorithm, guarantees a local optimum
  • MEME (TL Bailey)
  • Prior probability allows 0-n sites per sequence
  • Runs multiple EMs in parallel with different seeds
  • User friendly results

44
Gibbs Sampling
  • Stochastic process, although still may need
    multiple initializations
  • Sample π from P(π | θ, seq, θ0)
  • Sample θ from P(θ | π, seq, θ0)
  • Collapsed form
  • θ estimated with counts, not sampled from
    Dirichlet
  • Sample site from one seq based on sites from
    other seqs
  • Converged motif matrix θ and converged motif
    sites π represent the stationary distribution of a
    Markov chain

45
Gibbs Sampler
  • Randomly initialize a probability matrix

46
Gibbs Sampler
  • Take out one sequence with its sites from current
    motif

47
Gibbs Sampler
  • Score each possible segment of this sequence

Sequence 1
Segment (1-8)
48
Gibbs Sampler
  • Score each possible segment of this sequence

Sequence 1
Segment (2-9)
49
Segment Score
  • Use current motif matrix to score a segment

50
Scoring Segments
  • Motif   1    2    3    4    5    bg
  • A      0.4  0.1  0.3  0.4  0.2  0.3
  • T      0.2  0.5  0.1  0.2  0.2  0.3
  • G      0.2  0.2  0.2  0.3  0.4  0.2
  • C      0.2  0.2  0.4  0.1  0.2  0.2
  • Ignore pseudocounts for now
  • Sequence TTCCATATTAATCAGATTCCG, score of each segment:
  • TAATC
  • AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x
    0.2/0.3 = 0.049383
  • ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x
    0.4/0.2 = 11.85185
  • TCAGA 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x
    0.2/0.3 = 0.666667
  • CAGAT
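
The segment score is the product, over positions, of the motif probability of the observed base divided by its background probability. The sketch below hard-codes the matrix and background from this slide; the function names are illustrative.

```python
# Motif matrix from the slide: probability of each base at positions 1-5.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column from the slide.
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def segment_score(segment):
    """Likelihood ratio of a segment under the motif vs the background."""
    score = 1.0
    for pos, base in enumerate(segment):
        score *= MOTIF[base][pos] / BACKGROUND[base]
    return score


def score_all_segments(seq, width=5):
    """Return (segment, score) for every window of the sequence."""
    return [
        (seq[i:i + width], segment_score(seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]
```

Calling `segment_score("AATCA")` and `segment_score("ATCAG")` reproduces the values 0.049383 and 11.85185 given on the slide.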

51
Gibbs Sampler
  • Sample site from one seq based on sites from
    other seqs

52
How to Sample?
Pos    1  2  3  4  5  6  7  8  9
Score  3  1 12  5  8  9  1  2  6
SubT   3  4 16 21 29 38 39 41 47
  • X = Rand(subtotal): draw X at random between 0 and
    the final subtotal
  • Find the first position with subtotal larger than
    X

Pos    1  2  3  4  5  6   7   8   9
Score  3  1 12  5  8  9 500   2   6
SubT   3  4 16 21 29 38 538 540 546
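
The subtotal rule is a cumulative-sum (roulette wheel) draw, which can be sketched with the standard library. The function name and the `rng` parameter are assumptions added so the draw can be controlled in tests.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_position(scores, rng=random):
    """Draw a 1-based position with probability proportional to its score."""
    subtotals = list(accumulate(scores))  # running sums, e.g. 3, 4, 16, ...
    x = rng.uniform(0, subtotals[-1])
    # Index of the first subtotal strictly larger than x.
    index = bisect_right(subtotals, x)
    # Guard the rare case where x equals the final subtotal.
    return min(index, len(scores) - 1) + 1
```

With the second table, position 7 holds 500 of the 546 total, so it is drawn about 92% of the time, yet the other positions can still be chosen. That is the difference from always taking the best-scoring segment.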
53
Gibbs Sampler
  • Repeat the process until motif converges
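
The steps on the preceding slides can be combined into a compact site sampler. This is a simplified sketch, not any published program: it assumes exactly one site per sequence, runs a fixed number of sweeps instead of testing convergence, uses a pseudocount of 1, and all names are hypothetical.

```python
import random

BASES = "ACGT"


def build_theta(sites, w, pseudocount=1.0):
    """Motif matrix estimated from counts over the given sites."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]


def gibbs_sampler(seqs, w, theta0, iterations=200, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    # Random initialization of the site in every sequence.
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        for k, seq in enumerate(seqs):
            # Take out sequence k; build the motif from the other sites.
            others = [
                s[p:p + w]
                for i, (s, p) in enumerate(zip(seqs, pos)) if i != k
            ]
            theta = build_theta(others, w)
            # Score each possible segment of sequence k.
            scores = []
            for i in range(len(seq) - w + 1):
                ratio = 1.0
                for j, base in enumerate(seq[i:i + w]):
                    ratio *= theta[j][base] / theta0[base]
                scores.append(ratio)
            # Sample the new site in proportion to its score.
            pos[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return pos
```

Because the procedure is stochastic, different seeds can give different answers, so in practice several runs are compared by motif score.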

54
Gibbs Sampler Intuition
  • Beginning
  • Randomly initialized motif
  • No preference towards any segment

55
Gibbs Sampler Intuition
  • Motif appears
  • Motif should have enriched signal (more sites)
  • By chance some correct sites come to alignment
  • Sites bias motif to attract other similar sites

56
Gibbs Sampler Intuition
  • Motif converges
  • All sites come to alignment
  • Motif totally biased to sample sites every time

57
Gibbs Sampler
  • Column shift
  • Metropolis algorithm
  • Propose π* as π shifted 1 column to left or right
  • Calculate motif score u(π) and u(π*)
  • Accept π* with prob min(1, u(π*) / u(π))

58
Gibbs Sampling Derivatives
  • Gibbs Motif Sampler (JS Liu)
  • Add prior probability to allow 0-n site / seq
  • Sample motif positions to consider
  • AlignACE (F Roth)
  • Look for motifs from both strands
  • Mask out one motif to find more different motifs
  • BioProspector (XS Liu)
  • Use background model with Markov dependencies
  • Sampling with threshold (0-n sites / seq), new
    scoring function
  • Can find two-block motifs with variable gap

59
Scoring Motifs
  • Information Content (also known as relative
    entropy)
  • Suppose you have x aligned segments for the motif
  • pb(s1 from mtf) / pb(s1 from bg)
  • pb(s2 from mtf) / pb(s2 from bg)
  • pb(sx from mtf) / pb(sx from bg)


61
Scoring Motifs
  • pb(s1 from mtf) / pb(s1 from bg)
  • x pb(s2 from mtf) / pb(s2 from bg)
  • x ... x pb(sx from mtf) / pb(sx from bg)
  • = (pA1/pA0)^A1 x (pT1/pT0)^T1 x (pT2/pT0)^T2 x (pG2/pG0)^G2
    x (pC2/pC0)^C2 x ..., where A1 is the number of segments
    with A at position 1, and so on
  • Take log of this
  • = A1 log (pA1/pA0) + T1 log (pT1/pT0)
  • + T2 log (pT2/pT0) + G2 log (pG2/pG0) + ...
  • Divide by the number of segments (if all the
    motifs have same number of segments)
  • = pA1 log (pA1/pA0) + pT1 log (pT1/pT0) + pT2 log
    (pT2/pT0) + ...

Pos 12345678 ATGGCATG AGGGTGCG
ATCGCATG TTGCCACG ATGGTATT ATTGCACG
AGGGCGTT ATGACATG ATGGCATG ACTGGATG
62
Scoring Motifs
  • Original function: Information Content

63
Scoring Motifs
  • Original function: Information Content

Good AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
Bad ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
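
The information content (relative entropy) score sums p log(p / p0) over every base and position. The sketch below applies it to the two alignments above; the AT-rich background values are an assumption borrowed from the earlier scoring example, and the function name is hypothetical.

```python
import math


def relative_entropy(sites, background):
    """Sum over positions and bases of p * log(p / p0), in nats."""
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for i in range(width):
        for base, p0 in background.items():
            p = sum(site[i] == base for site in sites) / n
            if p > 0:  # a base that never occurs contributes nothing
                total += p * math.log(p / p0)
    return total


# Assumed background: A and T more common than G and C.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
GOOD = ["AGTCC"] * 7
BAD = ["ATAAA"] * 7
```

Both alignments are perfectly conserved, but under this background `relative_entropy(GOOD, BACKGROUND)` is larger than `relative_entropy(BAD, BACKGROUND)` because AGTCC uses the rarer bases G and C.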
64
Scoring Motifs
  • Original function: Information Content
  • Which is better?
  • (data 8 seqs)

Motif 1: AGGCTAAC AGGCTAAC
Motif 2: AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC
AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
65
Scoring Motifs
  • Motif scoring function
  • Prefer conserved motifs with many sites that are
    not often seen in the genome background

66
Markov Background Increases Motif Specificity
  • Prefers motif segments enriched only in data,
    but not so likely to occur in the background
  • Segment ATGTA score = p(generate ATGTA from θ) /
    p(generate ATGTA from θ0)

TCAGC: (.25 x .25 x .25 x .25 x .25) / (.3 x .18 x .16 x .22 x .24)
ATATA: (.25 x .25 x .25 x .25 x .25) / (.3 x .41 x .38 x .42 x .30)
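
A Markov background conditions each base on the one before it, so segments made of common background dinucleotides get a higher background probability and therefore a lower motif score. The sketch below assumes a first-order model with add-one smoothing trained on a background string containing only A, C, G and T; the function names are hypothetical.

```python
from collections import Counter


def train_markov1(background):
    """Estimate start and transition probabilities from a background string."""
    singles = Counter(background)
    pairs = Counter(zip(background, background[1:]))
    firsts = Counter(background[:-1])  # bases that have a successor
    n = len(background)
    p0 = {b: singles[b] / n for b in "ACGT"}
    # Add-one smoothing so unseen dinucleotides keep nonzero probability.
    trans = {
        (a, b): (pairs[(a, b)] + 1) / (firsts[a] + 4)
        for a in "ACGT" for b in "ACGT"
    }
    return p0, trans


def markov_probability(segment, p0, trans):
    """Probability of generating the segment from the Markov background."""
    prob = p0[segment[0]]
    for prev, base in zip(segment, segment[1:]):
        prob *= trans[(prev, base)]
    return prob
```

Dividing the motif probability of a segment by `markov_probability` gives a score in which AT-repeat segments such as ATATA are penalized when the background itself is rich in AT repeats.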
67
Position Weight Matrix Update
  • Advantage
  • Can look for motifs of any widths
  • Flexible with base substitutions
  • Disadvantage
  • EM and Gibbs sampling: no guaranteed convergence
    time
  • No guaranteed global optimum

68
Motif Finding in Bacteria
  • Promoter sequences are short (200-300 bp)
  • Motifs are usually long (10-20 bases)
  • Some have two blocks with a gap, some are
    palindromes
  • Long motifs are usually very degenerate
  • Single microarray experiment sometimes already
    provides enough information to search for TF
    motifs

69
Motif Finding in Lower Eukaryotes
  • Upstream sequences longer (500-1000 bp), with
    some simple repeats
  • Motif width varies (5-17 bases)
  • Expression clusters provide input sequences of
    decent quality for TF motif finding
  • Motif combination and redundancy appear,
    although single motifs are usually significant
    enough for identification

70
Yeast Promoter Architecture
  • Co-occurring regulators suggest physical
    interaction between the regulators

71
Motif Finding in Higher Eukaryotes
  • Upstream sequences very long (3-20 kb) with
    repeats; TF motifs could appear downstream
  • Motifs can be short or long (6-20 bases), and
    appear in combination and clusters
  • Gene expression clusters are not good enough input
  • Need
  • Comparative genomics: PhastCons score
  • Motif modules: motif clusters
  • ChIP-chip/seq

72
Yeast Regulatory Sequence Conservation
73
UCSC PhastCons Conservation
  • Functional regulatory sequences are under
    stronger evolutionary constraint
  • Align orthologous sequences together
  • PhastCons conservation score (0-1) for each
    nucleotide in the genome can be downloaded from
    UCSC

74
Conserved Motif Clusters
  • First find conserved regions in the genome
  • Then look for repeated transcription factor (TF)
    binding sites
  • They form transcription factor modules

75
Summary
  • Biology and challenge of transcription regulation
  • Scan for known TF motif sites: TRANSFAC, JASPAR
  • De novo method
  • Regular expression enumeration
  • Oligonucleotide analysis
  • MobyDick: build long motifs from short ones
  • Position weight matrix update
  • CONSENSUS (sequence order)
  • EM (iterate θ, π; π-weighted θ average)
  • Gibbs Sampler (sample θ, π; Markov chain
    convergence)
  • Motif score and Markov background
  • Motif cluster and motif conservation