Pattern Discovery in Bioinformatics - Jaak Vilo, vilo_at_ut.ee, http://biit.cs.ut.ee

1
Pattern Discovery in Bioinformatics
Jaak Vilo, vilo_at_ut.ee, http://biit.cs.ut.ee
2
Topics
  • Bioinformatics
  • Pattern discovery
  • Microarray data
  • ...

3
Pattern Discovery
  • Choose the language (formalism) to represent the
    patterns (search space)
  • Choose the rating for patterns, to tell which is
    better than others
  • Design an algorithm that finds the best patterns
    from the pattern class, fast.

Brazma A, Jonassen I, Eidhammer I, Gilbert D.
Approaches to the automatic discovery of
patterns in biosequences. J Comput Biol.
1998;5(2):279-305.
4
Bioinformatics
  • Have the right data (real, relevant,
    interesting)
  • Interpret and report the results (make someone's
    life easier)
  • Contribute to the field of biology

5
Bioinformatics
  • Study of biological data with the goal to better
    understand biology (JV)

6
[Figure: levels of DNA structure, Level 0 to Level 6]
Level 0: ATCGCTGAATTCCAATGTG
The eukaryotic genome can be thought of as six levels
of DNA structure. The loops at Level 4 range
from 0.5 kb to 100 kb in length. If these loops
were stabilized, the genes inside the loop
would not be expressed.
7
DNA determines function (?)
DNA: GenBank / EMBL Bank, 4 nucleotides
Protein: SwissProt/TrEMBL, 20 amino acids (3 nt = 1 AA)
Structure: PDB / Molecular Structure Database
Function?
8
A Simple Gene
A
B
C
Upstream/ promoter
Downstream
ATCGAAAT TAGCTTTA
Modifications
DNA
9
Species and individuals
  • Animals, plants, fungi, bacteria, ...
  • Species
  • Individuals

www.tolweb.org
10
(No Transcript)
11
(No Transcript)
12
(No Transcript)
13
http://www.youtube.com/watch?v=bk7PW1FKMTI
14
Gene regulation
  • How are all genetic entities regulated?
  • Networks
  • parts lists and connections
  • parameters and dynamics

15
Possible mechanisms of action for secreted
protein function in cell proliferation, either by
intracellular second messenger pathways or by
nuclear import. FT: transcription factor; Co-reg:
co-regulator. Planque, Cell Communication and
Signaling 2006, 4:7, doi:10.1186/1478-811X-4-7
16
http://wwwmgs.bionet.nsc.ru/mgs/gnw/genenet/viewer/
17
(No Transcript)
18
Model of RNA Polymerase II Transcription
Initiation Machinery. The machinery depicted here
encompasses over 85 polypeptides in ten (sub)complexes:
core RNA polymerase II (RNAPII), 12 subunits;
TFIIH, 9 subunits; TFIIE, 2 subunits; TFIIF, 3 subunits;
TFIIB, 1 subunit; TFIID, 14 subunits; core SRB/mediator,
more than 16 subunits; Swi/Snf complex, 11 subunits;
Srb10 kinase complex, 4 subunits; and SAGA, 13 subunits.
F.C.P. Holstege, E.G. Jennings, J.J. Wyrick, Tong
Ihn Lee, C.J. Hengartner, M.R. Green, T.R. Golub,
E.S. Lander, and R.A. Young. Dissecting the
Regulatory Circuitry of a Eukaryotic Genome. Cell
95: 717-728 (1998)
19
Chen and Rajewsky, Nature Reviews Genetics 8,
93-103 (February 2007), doi:10.1038/nrg1990
20
microRNA
21
(No Transcript)
22
(No Transcript)
23
>mmu-mir-678 MI0004635
GUGGACUGUGACUUGCAGAGCUGUGCUCCAAUAUGAGAGAUGGCCAUGCACCCGUGUCUCGGUGCAAGGACUGGAGGUGGCAGU
24
Alignment of microRNA targets
  CtTCCA-TCCTT--G-ACCGAGAt  ENSMUSG00000041359
  tCTtCAtaCCcTcaGgACCGAGAC  ENSMUSG00000032436
  CCTttAGcCCTT--GggCtGgGAg  ENSMUSG00000028581
  CCatttGaCtcc--aCACtGAGAg  ENSMUSG00000034006
  gCTCCAGgCCTT--GgACCtAGgC  ENSMUSG00000001053
  CactgccTaacT--GCACtGAGAa  ENSMUSG00000032470
  ggaCCAGgttTT--GCACCaAGgC  ENSMUSG00000053175
  CCTCagGaCCTT--GtgtCGAGAg  ENSMUSG00000004040
  agagaccTCgaa--GaACtGAGAa  ENSMUSG00000031163
  tagCCtGTCCTTctG-ACtGAGAC  ENSMUSG00000006342

25
Sequence patterns in BI
26
Biological applications
  • DNA
  • Gene regulation (promoters, TF binding)
  • Gene prediction (including TSS, to polyA site)
  • Repeats, duplications, tandem repeats, etc
    sequence features
  • RNA
  • Splicing of the mRNA
  • microRNA targeting mRNA-s
  • Secondary structure, 3D structure
  • Proteins
  • Protein families and their functional conserved
    elements
  • Active sites and protein-protein interactions
  • 3-D structure of proteins

20min
27
Gene Regulatory Signal Finding
Transcription Factor
Transcription Factor Binding Site
Goal: Detect Transcription Factor Binding Sites.
Eleazar Eskin, Columbia Univ.
28
How can we find TF binding sites?
Tallinn
29
How to detect signals in DNA?
  • Biologists in the past have created some experimental
    data: a few examples
  • Generalise from these
  • Indirect evidence of being co-regulated
  • Search for common signals
  • New techniques (lab)
  • Identify regions in which binding occurs
    (ChIP-chip)
  • SELEX

30
Position weight matrices (PWM, PSSM,...)
  • Examples, Counts, Logo

Examples:
ACGTGA ACGATG AGGTGG ACGAGG TCGTGA ACGAGG ACGAGA TCGTGA

Counts (rows: nucleotides, columns: positions 1 to 6):
A  6 0 0 4 0 4
C  0 7 0 0 0 0
G  0 1 8 0 7 4
T  2 0 0 4 1 0

PWM: p/f, log p/f
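
As a rough illustration of the counts and log p/f step above, the sketch below rebuilds the count matrix from the eight example sites and turns it into a log-odds matrix. The uniform background (f = 0.25), the pseudocount of 1 and all function names are assumptions made for this example, not something the slide specifies.

```python
import math

# The eight example binding sites from the slide.
EXAMPLES = ["ACGTGA", "ACGATG", "AGGTGG", "ACGAGG",
            "TCGTGA", "ACGAGG", "ACGAGA", "TCGTGA"]

def count_matrix(seqs, alphabet="ACGT"):
    """Count how often each nucleotide occurs at each position."""
    length = len(seqs[0])
    counts = {c: [0] * length for c in alphabet}
    for s in seqs:
        for i, c in enumerate(s):
            counts[c][i] += 1
    return counts

def log_odds_matrix(counts, background=None, pseudocount=1.0):
    """Convert counts to log2(p/f); f is the background frequency (assumed uniform)."""
    alphabet = sorted(counts)
    n = sum(counts[c][0] for c in alphabet)          # number of example sites
    background = background or {c: 1.0 / len(alphabet) for c in alphabet}
    total = n + pseudocount * len(alphabet)
    return {c: [math.log2(((x + pseudocount) / total) / background[c])
                for x in counts[c]]
            for c in alphabet}

def score(pwm, word):
    """Sum the per-position log-odds scores of a word of the same length."""
    return sum(pwm[c][i] for i, c in enumerate(word))

counts = count_matrix(EXAMPLES)
pwm = log_odds_matrix(counts)
print(counts["A"])                    # counts of A at positions 1 to 6
print(round(score(pwm, "ACGTGA"), 2)) # a site from the examples scores high
```

The pseudocount keeps log p/f finite for nucleotides that never occur at a position; other smoothing choices are equally possible.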
31
Motif matching
  • Find all occurrences of the given motif(s)
  • Databases of biologically valid motifs
  • We'll touch on it a bit later

32
Motif discovery
  • Hypothesis: a (sub)set of sequences may share a
    common signal.

33
Common biological role
  • Genes known to have related roles and hence
    needed at the same time
  • e.g. same Gene Ontology class
  • Measurements by microarrays
  • genes coordinately expressed should have common
    regulators (and signals)

34
Microarrays
  • Measure gene expression activity
  • genes: mRNA
  • tiling: anywhere in the genome
  • Measure in vitro TF binding
  • ChIP-chip
  • Methylation etc features of DNA

35
How to know what's in the cells?
Cells and mRNAs
I
36
How to know what's in the cells?
Cells and mRNAs
I
II
37
Microarray, the measurement device
Gene 3
Gene 1
Gene 2
38
Microarray, after hybridisation
39
Microarray, 2 colors mixed
40
TIGR 32k Human Arrays
41
Affymetrix Wafer and Chip Format
20 - 50 µm
20 - 50 µm
one oligonucleotide sequence per pixel
49 - 400 chips/wafer
1.0 cm
up to 1.3 million features/chip
42
From microarray images to gene expression data
Intermediate data
Raw data
Final data
Array scans
Image quantifications
Samples
Spots
Genes
Gene expression levels
Spot/Image quantifications
43
Eisen et al., PNAS 98
Spellman et al., Mol Biol Cell 98
44
Tumor classification: 1) class prediction, 2)
class discovery
Golub et al., Science, Oct 15th 1999
  • 38 samples of acute myeloid leukemia (AML) and
    acute lymphoblastic leukemia (ALL)
  • 6817 genes
  • classifier built based on the 50 best correlated
    genes
  • tested on 34 new samples, 29 of them predicted
    accurately

[Figure labels: ALL, AML]
45
Hughes, T. R. et al Functional Discovery via a
Compendium of Expression Profiles, Cell 102
(2000), 109-126.
46
Cluster of co-expressed genes, pattern discovery
in regulatory regions
Expression profiles
600 basepairs
Retrieve
Upstream regions
Find patterns over-represented within cluster
Genome Research 1998; ISMB (Intelligent Systems
in Mol. Biol.) 2000
47
Binomial or hypergeometric distribution tail
Background - ALL upstream sequences:
pattern occurs in 5 out of 25, p = 0.2
Cluster: pattern occurs 3 times in 6 sequences
P(3,6,0.2) is the probability of having at least 3
matches in 6 sequences:
P(pattern,3,6,0.2) = 0.0989
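
The tail probability on this slide can be checked with a few lines of code. The sketch below computes the binomial tail for the slide's numbers (at least 3 matches in 6 sequences, p = 0.2) and, for comparison, a hypergeometric tail. The hypergeometric call assumes that the 6 cluster sequences are drawn from the 25 background sequences, 5 of which contain the pattern; that reading and the function names are assumptions of this example.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def hypergeometric_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N, K of them marked."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Slide example: pattern in 5 of 25 background sequences (p = 0.2),
# and in 3 of the 6 cluster sequences.
print(round(binomial_tail(3, 6, 0.2), 4))   # 0.0989, as on the slide
print(hypergeometric_tail(3, 6, 5, 25))     # tail under sampling without replacement
```

The binomial treats each cluster sequence as an independent draw with probability p; the hypergeometric accounts for sampling without replacement, which matters when the cluster is a sizeable fraction of the background.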
48
ChIP-chip (or sequencing)
I
49
ChIP-chip (or sequencing)
I
II
50
ChIP-chip (or sequencing)
I
II
III
51
ChIP-chip (or sequencing)
I
II
III
Microarray or sequencing
IV
52
Clustering and Gene set enrichment
  • Analysis of (any) HT data (cluster, visualise,
    test of significance, ...)
  • Produces gene lists
  • partitioning produces bags or sets
  • sorting produces ranked lists
  • How to interpret these results?
  • What to do next?

53
K-means: k = 200 vs 50
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
Clustering: observe your first patterns
60
gProfiler
61
Find a common function(pattern)
  • Experiment or analysis identifies a set of genes
  • What is a common theme to these genes?
  • Biological function, Gene Ontology, molecular
    pathway, shared regulatory motif or miRNA target
    site

62
(No Transcript)
63
Previously known functions
F1
F4
F2
F5
F3
F6
F7
64
Your query
F1
F4
F2
F5
Q1
F3
F6
Q2
F7
65
(No Transcript)
66
(No Transcript)
67
GO Evidence Codes
From primary literature:
  • IDA - Inferred from Direct Assay
  • IMP - Inferred from Mutant Phenotype
  • IGI - Inferred from Genetic Interaction
  • IPI - Inferred from Physical Interaction
  • IEP - Inferred from Expression Pattern
From reviews or introductions:
  • TAS - Traceable Author Statement
  • NAS - Non-traceable Author Statement

  • IC - Inferred by Curator
  • ISS - Inferred from Sequence or structural Similarity
  • IEA - Inferred from Electronic Annotation (automated)
  • ND - Not Determined
68
Evidence codes
Genes
GO categories
P-value
Ordered list query
KEGG pathways
69
KEGG Biosynthesis of steroids

70
p-value
  • tail of the hypergeometric distribution
  • Multiple testing
  • multiple sets to compare against
  • different sizes of queries
  • different sizes (and nrs) of reference sets
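
To make the multiple-testing point concrete, here is a minimal sketch of two generic corrections, Bonferroni and Benjamini-Hochberg, applied to a list of enrichment p-values. These are standard textbook procedures swapped in for illustration; they are not the SCS threshold named on the next slide, and the example p-values are made up.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, e.g. one per tested gene set.
pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni is the more conservative of the two; neither takes the overlap between reference sets or the differing set sizes into account, which is the motivation the slide gives for a tailored threshold.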

71
SCS - Set Counts and Sizes threshold
72
Motif discovery in sequences
  • Deterministic and probabilistic
  • Pattern driven vs sequence driven
  • Descriptive or Discriminative

73
Cluster of co-expressed genes, pattern discovery
in regulatory regions
Expression profiles
600 basepairs
Retrieve
Upstream regions
Find patterns over-represented within cluster
Genome Research 1998; ISMB (Intelligent Systems
in Mol. Biol.) 2000
74
(No Transcript)
75
(No Transcript)
76
Pattern vs cluster strength
The pattern probability vs. the average
silhouette for the cluster
The same for randomised clusters
Vilo et al., ISMB 2000
77
Suffix tree: represent all suffixes
CATAT => suffix tree
Positions: 123456
Suffixes: CATAT 1, ATAT 2, TAT 3, AT 4, T 5, (empty) 6
[Figure: suffix tree of CATAT with edge labels AT, T and CATAT and leaves 1 to 6]
O(n) time and space
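
A small sketch of the idea, using the slide's string CATAT. For brevity it builds an uncompressed suffix trie by inserting every suffix character by character, which takes quadratic time and space; the O(n) bound on the slide refers to proper (compressed) suffix tree constructions, which this example does not implement. The "$" terminator and the function names are assumptions.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text = text + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#leaf"] = start + 1        # 1-based start position of this suffix
    return root

def find_occurrences(trie, pattern):
    """Return the 1-based start positions of pattern, via the leaves below it."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "#leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("CATAT")
print(find_occurrences(trie, "AT"))   # [2, 4]
print(find_occurrences(trie, "T"))    # [3, 5]
```

Every substring is a prefix of some suffix, so looking up a pattern costs time proportional to its length plus the number of occurrences reported.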
78
SPEXS - Sequence Pattern EXhaustive Search
Jaak Vilo, 1998, 2002
  • User-definable pattern language: substrings,
    character groups, wildcards, flexible wildcards
    (cf. PROSITE)
  • Fast exhaustive search over pattern language
  • Lazy suffix tree construction-like algorithm
    (Kurtz, Giegerich)
  • Analyze multiple sets of sequences simultaneously
  • Restrict search to most frequent patterns only
    (in each set)
  • Report most frequent patterns, patterns over- or
    underrepresented in selected subsets, or
    patterns significant by various statistical
    criteria, e.g. by binomial distribution

30min
79
SPEXS 1998
Jaak Vilo. Discovering Frequent Patterns from
Strings. Technical Report C-1998-9 (pp. 20), May
1998. Department of Computer Science, University
of Helsinki.
80
(No Transcript)
81
(No Transcript)
82
(No Transcript)
83
Sequence patterns: the basis of SPEXS
[Figure: pattern trie with single-letter edge labels]
GCAT (4 positions)
GCATA (3 positions)
GCATA.
GCATA.C
84
Implementation example
Input 1: ACGTGCACGATATCG
Input 2: AGTACATGAAGCAGG
P: pattern, e.g. AC; AC.pos = 2, 9, 23; ACG.pos =
3, 10
Convert into internal representation ...........
...................... 36ACGTGCACGATATCGAG
TACATGAAGCAGG11111-11111-11111-22222-22222-222
22
85
(No Transcript)
86
SPEXS general algorithm
  • 1. S = input sequences ( |S| = n )
  • 2. ε = empty pattern, ε.pos = {1,...,n}
  • 3. enqueue( order, ε, priority )
  • 4. while p = dequeue( order )
  • 5.     generate all allowed extensions (p',
    p'.pos) of p
  • 6.     enqueue( output, p', fitness(p') )
  • 7.     enqueue( order, p', priority(p') )
  • 8. while p = dequeue( output )
  • 9.     Output p

Jaak Vilo. Discovering Frequent Patterns from
Strings. Technical Report C-1998-9 (pp. 20), May
1998. Department of Computer Science, University
of Helsinki.
Jaak Vilo. Pattern Discovery from
Biosequences. PhD Thesis, Department of Computer
Science, University of Helsinki, Finland. Report
A-2002-3, Helsinki, November 2002, 149 pages.
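
The general algorithm above can be sketched in a few lines for the simplest pattern language, plain substrings. This is only an illustration in the spirit of the pseudocode, not the SPEXS implementation: the priority is "most frequent first", the fitness is simply the number of sequences containing the pattern, and `min_support`, `max_len` and all names are assumptions of this example.

```python
import heapq
from collections import defaultdict

def frequent_substrings(sequences, min_support=2, max_len=8):
    """Return {pattern: support} for substrings found in >= min_support sequences."""
    # Position list of the empty pattern: (sequence index, index of next character).
    empty_pos = [(s, i) for s, seq in enumerate(sequences) for i in range(len(seq))]
    order = [(-len(sequences), "", empty_pos)]   # heap ordered by support, frequent first
    output = {}
    while order:
        _, p, pos = heapq.heappop(order)
        # Generate all one-character extensions of p by grouping its positions.
        extensions = defaultdict(list)
        for s, i in pos:
            if i < len(sequences[s]):
                extensions[sequences[s][i]].append((s, i + 1))
        for ch, new_pos in extensions.items():
            support = len({s for s, _ in new_pos})
            if support < min_support:
                continue                          # prune: extensions cannot be more frequent
            q = p + ch
            output[q] = support
            if len(q) < max_len:
                heapq.heappush(order, (-support, q, new_pos))
    return output

found = frequent_substrings(["ACGTGCACGATATCG", "AGTACATGAAGCAGG"], max_len=4)
print(sorted(found.items(), key=lambda kv: (-kv[1], kv[0]))[:5])
```

Because a pattern's position list is computed once and reused for all its extensions, no pattern is ever matched against the input from scratch; richer pattern languages (groups, wildcards) change only how extensions are generated.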
87
Order
Breadth-first
Depth-first
1
1
2
3
4
4
3
2
5
6
7
7
6
5
8
9
10
10
9
8
88
Order
Frequent-first
50
40
6
4
4
34
2
6
24
4
89
SPEXS count and memorize
i...v....x....v....x
abracadabradadabraca
a: occurs at 1,4,6,8,11,13,15,18,20
   next positions 2,5,7,9,12,14,16,19,21
90
SPEXS extend
i...v....x....v....x
abracadabradadabraca
a: next positions 2,5,7,9,12,14,16,19,21
  b: 2,9,16
  c: 5,19
  d: 7,12,14
91
SPEXS find frequent first
i...v....x....v....x
abracadabradadabraca
a: next positions 2,5,7,9,12,14,16,19,21
  b: 2,9,16
  d: 7,12,14
92
SPEXS group positions
i...v....x....v....x
abracadabradadabraca
a: next positions 2,5,7,9,12,14,16,19,21
Extensions: . (wildcard), [bd], b, d
  b: 2,9,16
  d: 7,12,14
  [bd]: 2,7,9,12,14,16
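
The position lists on the last four slides can be reproduced directly. The sketch below uses 1-based positions as on the slides; the support threshold of 3 used to drop "c" and the function names are assumptions of this example.

```python
from collections import defaultdict

TEXT = "abracadabradadabraca"

def occurrences(text, ch):
    """1-based positions at which character ch occurs."""
    return [i + 1 for i, c in enumerate(text) if c == ch]

def extend(text, next_positions):
    """Group the 'next' positions of a pattern by the character found there."""
    groups = defaultdict(list)
    for p in next_positions:
        if p <= len(text):               # position 21 is past the end of the text
            groups[text[p - 1]].append(p)
    return dict(groups)

a_pos = occurrences(TEXT, "a")                  # count and memorize
a_next = [p + 1 for p in a_pos]
ext = extend(TEXT, a_next)                      # extend
frequent = {c: ps for c, ps in ext.items() if len(ps) >= 3}   # frequent first
group_bd = sorted(ext["b"] + ext["d"])          # group positions for [bd]

print(a_pos, a_next)
print(ext)
print(group_bd)
```

Merging the lists of "b" and "d" gives the position list of the character group [bd] without rescanning the text.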
93
The wildcards
GCAT.X
94
The wildcards
GCAT.A
95
The wildcards
GCAT.{3,6}X
96
The wildcards not too many
w0
a
.3.6
w0
b
w1
w0
97
Multiple data sets
D1
D2
D3
4/3 (6)
3/3 (12)
2/2 (9)
98
GPCR coupling
Agonist
Signal
Current perspective
GPCR
Effector Enzyme channels
Intracellular messengers
G-protein
99
Our Computational Approach
  • Membrane topology: 7TMHMM
  • Intracellular domains of ? 100 receptor sequences
    with
  • well-characterised, and non-promiscuous coupling
    (split into Gs, Gi/o and Gq/11)

Steffen Möller, Jaak Vilo, Michael D.R. Croning.
Prediction of the coupling specificity of
G protein coupled receptors to their G
proteins. ISMB-2001, July 2001. Bioinformatics
2001, 17: S174-S181.
100
RK....R.0,9EK
DR.4,11H...AGS
FR....RK.0,3L
S...L.1,10TILV
C.FWY.2,11K
ILV.L.6,10A.T
S....RKA.3,10S
AILV.1,5Y..ILV.T
LR.1,9T...ILV
Steffen Möller, Jaak Vilo, Michael D.R. Croning.
Prediction of the coupling specificity of
G protein coupled receptors to their G
proteins. ISMB-2001, July 2001. Bioinformatics
2001, 17: S174-S181.
101
Receptor Match Positions
Möller, Vilo, Croning, ISMB 2001
102
Improving upon discrete patterns
103
101 Sequences relative to ORF start
YGR128C 100
>YAL036C chromo1 coord(76154-75048(C)) start-600 end2 seq(76152-76754)
TGTTCTTTCTTCTT
CTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTA
GTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTC
TTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGC
ACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGC
TGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCG
GCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACT
CTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATC
CCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTT
CAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGC
CCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGA
TATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATG
CAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_
>YAL025C chromo1 coord(101147-100230(C)) start-600 end2 seq(101145-101747)
CTTAGAAGATAAAGTAGTGAATT
ACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGG
GTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAG
TAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTAT
CCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTA
AAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCAT
ACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAG
AATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTT
TTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATG
CAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTT
TCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGA
TATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAA
CTACAACGGTTAGAATACAACGAGGAT_ATG_
...
>YBR084W chromo2 coord(411012-413936) start-600 end2 seq(410412-411014)
CCATGTATCCAAGACCTGCTGAAGATGCTT
ACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTT
TCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCA
AAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGT
CTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTA
GTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTC
AAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTT
CTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGAC
GCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTC
ACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCA
GCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACAT
CAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAG
ATATTTAAGGCTAAAAGGAA_ATG_
GATGAG.T     152/70  2453/508  R:7.52345  BP:1.02391e-33
G.GATGAG.T   139/49  2193/222  R:13.244   BP:2.49026e-33
AAAATTTT     163/77  2833/911  R:4.95687  BP:5.02807e-32
TGAAAA.TTT   145/53  2333/350  R:8.85687  BP:1.69905e-31
TG.AAA.TTT   153/61  2538/570  R:6.45662  BP:3.24836e-31
TG.AAA.TTTT  140/43  2254/260  R:10.3214  BP:3.84624e-30
TGAAA..TTT   154/65  2608/645  R:5.82106  BP:1.0887e-29
...
GATGAG.T TGAAA..TTT
104
.G.GATGAG.T. 39 seq (vs 193), p = 2.5e-33
105
-1 .G.GATGAG.T. 61 seq (vs 1292), p = 1.4e-19
106
-2 .G.GATGAG.T. 91 seq (vs 5464)
107
-3 .G.GATGAG.T. 98 seq
108
Jaak Vilo. Pattern Discovery from Biosequences.
PhD Thesis, Department of Computer Science,
University of Helsinki, Finland. Series of
Publications A, Report A-2002-3. Helsinki,
November 2002, 149 pages.
109
-2 .G.GATGAG.T. 91 seq
These hits result in a PWM
110
PWM based on all previous hits; the
highest-scoring occurrences are shown in blue
111
All against all approximate matching
For every subsequence of every sequence: match
approximately against all the sequences.
Approximate hits define PWM matrices (not all
positions vary equally). Look for ALL PWMs
derived from the data that are enriched in the data set
(vs. background).
Hendrik Nigul, Jaak Vilo
112
Dynamic programming
  • A small number of edit operations allows the search
    to be limited efficiently to a band around the main diagonal
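
A minimal sketch of this idea: if at most k edit operations are allowed, only cells within k of the main diagonal of the dynamic programming table can hold a value of k or less, so the rest need not be computed. The function name and the convention of returning None when the distance exceeds k are assumptions of this example.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is at most k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                      # length difference alone needs > k edits
    inf = k + 1                          # stands for "more than k"
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        # Only columns within k of the diagonal are filled in.
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,   # match or substitution
                         prev[j] + 1,          # deletion from a
                         cur[j - 1] + 1,       # insertion into a
                         inf)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None

print(banded_edit_distance("GATGAG", "GATCAG", 1))
print(banded_edit_distance("kitten", "sitting", 2))
```

Each row touches at most 2k + 1 cells, so the work is proportional to k times the sequence length rather than to the product of the two lengths.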

113
Suffix Tree
A
T
G
C
G
G
T
124,212,223
114
Trie based all against all approximate matching
  • trieindex
  • trieagrep
  • trieallagrep
  • triematrix

Hendrik Nigul, Jaak Vilo
115
More directions for PD
116
Multiple alignment
Marko Hyvönen
117
Artificial setup
118
Challenge problem
  • Pevzner, P., Sze, S.H. 2000. Combinatorial
    Approaches to Finding Subtle Signals in DNA
    Sequences. Proc. 8th Int. Conf, Intelligent
    Systems of Molecular Biology, 269-278.
  • Plant into every sequence a string X of length l,
    with d characters randomly altered.
  • What was the original X ?
  • (l,d)-problem
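
For experimenting with the challenge, a test instance can be generated as sketched below. The sketch assumes exactly d positions of X are changed to a different nucleotide in every planted copy and that the background is uniform random DNA; the sequence count, lengths and names are illustrative choices, not part of the problem statement quoted above.

```python
import random

def plant_motif(num_seqs, seq_len, motif, d, rng, alphabet="ACGT"):
    """Return (sequence, start) pairs, each with one mutated copy of motif planted."""
    planted = []
    for _ in range(num_seqs):
        background = [rng.choice(alphabet) for _ in range(seq_len)]
        variant = list(motif)
        # Alter exactly d distinct positions to a different character.
        for pos in rng.sample(range(len(motif)), d):
            variant[pos] = rng.choice([c for c in alphabet if c != motif[pos]])
        start = rng.randrange(seq_len - len(motif) + 1)
        background[start:start + len(motif)] = variant
        planted.append(("".join(background), start))
    return planted

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)                    # fixed seed so the instance is reproducible
motif = "ACGTACGTACGTACG"                 # a hypothetical X of length l = 15
instance = plant_motif(20, 600, motif, 4, rng)
seq, start = instance[0]
print(seq[start:start + len(motif)], hamming(seq[start:start + len(motif)], motif))
```

Two planted copies can differ from each other in up to 2d positions, which is why the signal is hard to spot by pairwise comparison alone.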

119
(4, 1) - problem
Seed: ACTG
12 possible planted versions:
CCTG, GCTG, TCTG, AATG, AGTG, ATTG, ...
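
The count of 12 follows from 4 positions times 3 alternative nucleotides. A short sketch that enumerates all versions of a seed with exactly d altered characters (the function name is an assumption of this example):

```python
from itertools import combinations, product

def variants(seed, d, alphabet="ACGT"):
    """All strings that differ from seed in exactly d positions."""
    out = set()
    for positions in combinations(range(len(seed)), d):
        choices = [[c for c in alphabet if c != seed[p]] for p in positions]
        for replacement in product(*choices):
            v = list(seed)
            for p, c in zip(positions, replacement):
                v[p] = c
            out.add("".join(v))
    return sorted(out)

print(len(variants("ACTG", 1)))   # 12
print(variants("ACTG", 1)[:6])
```

In general there are C(l, d) * 3^d such versions, so the neighbourhood grows quickly with l and d.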
120
Graph constructed by WINNOWER
  • For (15,4)-signal - connect all words with
    distance at most 8
  • atgaccgggatactgatAgAAgAAAGGttGGGtataatggagtacgataa
  • atgacttcAAtAAAAcGGcGGGtgctctcccgattttgagtatccctggg
  • gcaatcgcgaaccaagctgagaattggatgtcAAAAtAAtGGaGtGGcac
  • gtcaatcgaaaaaacggtggaggatttcAAAAAAAGGGattGgaccgctt

real signals
signal edges
spurious signals
spurious edges
from Eleazar Eskin
121
Pairs of motifs
122
Composite Patterns
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
  • Co-occurring patterns
  • (GuhaThakurta, Stormo 2001)
  • Fixed Order (Dyad Problem)
  • (van Helden et. al 2000)
  • (Gelfand et. al 2000)
  • (Marsan, Sagot 2000)

from Eleazar Eskin
123
Patterns with Mismatches
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
Mismatches d8
Instances
AcAAAAcAGGGGtGG-11-CTGAcTCTtATAaAG
AAAcAAAgaGtGGtG-12-CTGcgTCTAATtcAG
AtAAAAAtcGGGcGG-10-CTGATcCTAtTACcG
AAAAAtAAGGGGcGG-14-CgGAcTCTAATgCAG
Eleazar Eskin
124
Sample Sequences
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAAAAAGGGGGGGggcgtacacattagCTGATTCCAATACAGacgt
aaAAAAAAAAGGGGGGGaaacttttccgaataCTGATTCCAATACAGgatcagt
atgacttAAAAAAAAGGGGGGGtgctctcccgattttcCTGATTCCAATACAGc
aggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAGta
ggaggAAAAAAAAGGGGGGGagccctaacggacttaatCCTGATTCCAATACAG
Eleazar Eskin
125
Sample Sequences
AAAAAAAAGGGGGGG-(10,15)-CTGATTCCAATACAG
actgatAAAAtAAAGcGGGaGggcgtacacattagCaGAcTCCAATtgAGacgt
aaAAtAAAAAaaaGGcGaaacttttccgaataCTGAcTCCAAagCAGgatcagt
atgacttAAcAAtAgGGGaGGGtgctctcccgattttcCTGcTaCCAAgAtAGc
aggAAtAAAAtGGaGGGGagccctaacggacttaatCCaGATTgCAcTAaAata
ggaggAAgAAAAAGGaGaGGagccctaacggacttaatCtTGAaTCCtATACAc
Eleazar Eskin
126
Traditional Approach Weaknesses
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
Traditional approach: find each conserved region
separately.
Problem: each region is too weak.
Eleazar Eskin
127
Traditional Approach Solution
atgactAGGGTAACATgattgagaccagtgaCAGGAATTCactgacaa
Conserved Region
Conserved Region
Unconserved Spacing
Traditional approach: find each conserved region
separately.
Problem: each region is too weak.
Our approach: find both regions simultaneously.
Conserved Region
Conserved Region
single pattern after preprocessing.
Eleazar Eskin
128
Combinations and modules
  • Regulatory signals do not work alone
  • Motif co-occurrences

129
  • 700 bp windows with at least 13 binding site
    occurrences

130
(No Transcript)
131
  • 700 bp windows with at least 13 binding site
    occurrences

132
Using multiple species
  • Phylogenetic footprinting
  • Phylogenetic shadowing
  • Conservation cross species

133
(No Transcript)
134
Phylogenetic footprinting
  • McCue L, Thompson W, Carmack C, Ryan MP, Liu JS,
    Derbyshire V, Lawrence CE. Phylogenetic
    footprinting of transcription factor binding
    sites in proteobacterial genomes. Nucleic Acids
    Res. 2001 Feb 1;29(3)
  • Mathieu Blanchette and Martin Tompa. Discovery
    of Regulatory Elements by a Computational Method
    for Phylogenetic Footprinting. Genome Research,
    Vol. 12, Issue 5, 739-748, May 2002

135
Men and mice are alike
136
(No Transcript)
137
26 species
138
45 species
139
Phylogenetic footprinting
Study the same gene in many species
human
ape
mouse
fish
chicken

If preserved during evolution then must be
important for something!!!
140
What if species too similar?
  • Almost entire genome is highly similar
  • Signal gets lost

141
Phylogenetic shadowing
  • Use many closely related species (monkeys, apes,
    ...)
  • All regions that differ, are shadowed out
  • These regions that do not have differences in
    (almost) any, are probably important
  • Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD,
    Ovcharenko I, Pachter L, Rubin EM. Phylogenetic
    shadowing of primate sequences to find functional
    regions of the human genome. Science. 2003 Feb
    28;299(5611):1331-3.

142
Phylogenetic shadowing
http://chr21.molgen.mpg.de/images/projects/BACH1_small.jpg
143
Alignment of functional elements
  • Outi Hallikas, Kimmo Palin, Natalia Sinjushina,
    Reetta Rautiainen, Juha Partanen, Esko Ukkonen
    and Jussi Taipale. Genome-wide Prediction of
    Mammalian Enhancers Based on Analysis of
    Transcription-Factor Binding Affinity. Cell
    124(1), 13 January 2006, Pages 47-59.
  • Kimmo Palin, Jussi Taipale and Esko Ukkonen.
    Locating potential enhancer elements by
    comparative genomics using the EEL
    software. Nature Protocols 1(1), 27 June 2006,
    Pages 368-374.
  • E. Blanco, X. Messeguer, T. Smith and R. Guigó.
    "Transcription Factor Map Alignment of Promoter
    Regions." PLoS Computational Biology, 2(5):e49
    (2006)

144
(No Transcript)
145
Ranked list data
Tartu
146
Problem
  • Target vs background data?
  • Strong vs weak
  • No clear cut

147
  • (c1) the cutoff used to partition data into a
    target set and background set of sequences is
    often chosen arbitrarily
  • (c2) lack of an exact statistical score and
    p-value for motif enrichment
  • (c3) a need for an appropriate framework that
    accounts for multiple motif occurrences in a
    single promoter.
  • (c4) motif discovery methods tend to report
    presumably significant motifs even when applied
    on randomly generated data. These motifs are
    clear cases of false positives and should be
    avoided.

148
(No Transcript)
149
(No Transcript)
150
(No Transcript)
151
(No Transcript)
152
Summary
153
Pattern languages
  • Substrings ATCGA
  • Character groups ATCGC.A
  • Unrestricted wildcards: AT.*CG
  • Restricted wildcards: AT.{2,5}CG
  • Combine all above A.TGC.1,3GTAC TGC
    GCA
  • Closures TGAAATTT
  • Allow mismatches, insertions, deletions
  • Probabilistic versions of the above
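
Several of these pattern classes map directly onto regular expressions. The sketch below matches one illustrative pattern per class with Python's `re` module; the patterns, test sequences and the helper name are assumptions made for this example and are not necessarily the exact patterns written on the slide.

```python
import re

# One hypothetical pattern per pattern class, in regular-expression syntax.
PATTERNS = {
    "substring": "ATCGA",
    "character group": "A[TC]GA",
    "unrestricted wildcard": "AT.*CG",
    "restricted wildcard": "AT.{2,5}CG",
}

def match_positions(pattern, sequence):
    """1-based start positions of all (possibly overlapping) matches."""
    # The lookahead makes every start position be reported, even when matches overlap.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", sequence)]

print(match_positions(PATTERNS["substring"], "TTATCGATCGA"))
print(match_positions(PATTERNS["character group"], "ATGAACGA"))
print(match_positions(PATTERNS["restricted wildcard"], "GGATTTCGAA"))
```

Mismatches, insertions and deletions, and probabilistic patterns fall outside what plain regular expressions express and need dedicated matching code.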

154
Probabilistic motifs
  • Gary Stormo lab
  • EM-algorithm
  • MEME (Bailey, Elkan)
  • Gibbs Sampling
  • AlignAce (Roth et al)
  • (Rocke, Tompa)
  • Neural networks
  • HMM models, SCFG

155
The advantages and disadvantages of discrete
patterns
  • Advantages
  • simple and easily interpretable objects
  • easier to discover from scratch (i.e., if no
    additional information to sequences are given),
    particularly in noisy data
  • Disadvantages
  • limited descriptive power (no weights can be
    attributed to alternatives)
  • No probability of a match

156
Fitness measures
  • Ratio (times over-represented)
  • ROC AUC
  • Probability (p-value)
  • Domain specific (biological) score

157
Multiple testing due to
  • Large pattern (search) space
  • Many data sets analysed
  • Different cut-off thresholds

158
Search algorithms
  • Pattern driven
  • generate all possible patterns, evaluate
  • Data Driven
  • e.g. align data sets, read out patterns
  • EM, Gibbs, ...
  • (all probabilistic methods)

159
Search algorithm
  • Pattern driven
  • generate all possible patterns, evaluate
  • Data Driven
  • e.g. align data sets, read out patterns
  • Combined
  • Use data as a guide for exhaustive search through
    pattern space

160
Regular pattern tools
  • SPEXS (Jaak Vilo)
  • Pratt (Inge Jonassen, U. of Bergen)
  • TEIRESIAS (IBM Research, Rigoutsos, Floratos)
  • MobyDick etc (Harmen Bussemaker)
  • RSA-tools (Jacques van Helden)
  • Martin Tompa
  • Marsan Sagot (suffix tree gapped motifs)
  • Jensen Knudsen (suffix tree based substrings)
  • Verbumculus (Stefano Lonardi, A. Apostolico)

161
(No Transcript)
162
(No Transcript)
163
(No Transcript)
164
Anno 2007 (BIIT and Quretec)
165
Tartu, ESTONIA