
In the search of motifs (and other hidden structures)

- Esko Ukkonen
- Department of Computer Science, Helsinki Institute of Information Technology HIIT, University of Helsinki
- CPM 2005, Jeju, 21 June 2005


Uncover a hidden structure(?)

Motif?

- a pattern that occurs unexpectedly often in (a set of) strings
- pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices, ... (the hidden structure to be learned from data)
- unexpectedly: statistical modelling
- occurrence: exact, approximate, with high probability, ...
- strings <-> applications: bioinformatics

Plan of the talk

- Gapped motifs in a string
- Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005)
- Uncovering gene enhancer elements

1. Gapped motifs

Example string: HATTIVATTI

Substring motifs of a string S

- string S = s_1 ... s_n in alphabet A.
- Problem: what are the frequently occurring (ungapped) substrings of S? Longest substring that occurs at least q times?
- Thm: Suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)

T(S) is a full-text index

[Figure: suffix tree T(S); the path spelling P leads to a subtree whose leaves give the occurrences of P in S, at locations 8, 31, ...]

Path for P exists in T(S) <=> P occurs in S

Counting the substring motifs

- internal nodes of T(S) <=> repeating substrings of S
- number of leaves of the subtree of a node for string P = number of occurrences of P in S

T(hattivatti)

Suffixes: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i

[Figure: suffix tree of hattivatti]

Substring motifs of hattivatti

[Figure: the same suffix tree with occurrence counts at the internal nodes: t (4), i (2), ti (2), tti (2), atti (2)]

Counts for the O(n) maximal motifs shown
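The counting idea can be tried out on hattivatti with a minimal sketch. It assumes a naive, uncompressed suffix trie instead of a linear-time suffix tree, so it needs O(n^2) time and space and is only meant for short strings. The function names and the `$` end marker are illustrative choices, not taken from the talk.

```python
def build_suffix_trie(s, sentinel="$"):
    """Naive suffix trie of s + sentinel; a node is a dict: symbol -> child node."""
    root = {}
    t = s + sentinel
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Number of leaves below a node (a node without children is a leaf)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrence_count(root, p):
    """Occurrences of p in s = number of leaves below the node reached by p."""
    node = root
    for ch in p:
        if ch not in node:
            return 0
        node = node[ch]
    return count_leaves(node)


def branching_motifs(root):
    """Map each branching internal node (root excluded) to its leaf count."""
    out = {}
    stack = [("", root)]
    while stack:
        label, node = stack.pop()
        if label and len(node) >= 2:
            out[label] = count_leaves(node)
        for ch, child in node.items():
            stack.append((label + ch, child))
    return out


trie = build_suffix_trie("hattivatti")
print(branching_motifs(trie))
```

A real suffix tree stores only the branching nodes and computes all leaf counts in one bottom-up pass, which is where the O(n) bound comes from.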

Finding repeats in DNA

- human chromosome 3
- the first 48 999 930 bases
- 31 min cpu time (8 processors, 4 GB)
- Human genome: 3 x 10^9 bases
- T(HumanGenome) feasible

Longest repeat?

Length 2559, occurrences at 28395980 and 28401554r:

ttagggtacatgtgcacaacgtgcaggtttgttacatatgtata

cacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttag

gtatatctccgaatgctatccctcccccctccccccaccccacaacagtc

cccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaa

ttcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcga

aagtttgctgagaatgatggtttccagcttcatccatatccctacaaagg

acatgaactcatcatttttttatggctgcatagtattccatggtgtatat

gtgccacattttcttaacccagtctacccttgttggacatctgggttggt

tccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcat

gtgtctttatagcagcatgatttataatcctttgggtatatacccagtaa

tgggatggctgggtcaaatggtatttctagttctagatccctgaggaatc

accacactgacttccacaatggttgaactagtttacagtcccagcaacag

ttcctatttctccacatcctctccagcacctgttgtttcctgacttttta

atgatcgccattctaactggtgtgagatggtatctcattgtggttttgat

ttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttt

tggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcc

cacttttgatggggttgtttgtttttttcttgtaaatttgttggagttca

ttgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaat

tttctcccattctgtaggttgcctgttcactctgatggtggtttcttctg

ctgtgcagaagctctttagtttaattagatcccatttgtcaattttggct

tttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcc

tatgtcctgaatggtattgcctaggttttcttctagggtttttatggttt

taggtctaacatgtaagtctttaatccatcttgaattaattataaggtgt

atattataaggtgtaattataaggtgtataattatatattaattataagg

tgtatattaattataaggtgtaaggaagggatccagtttcagctttctac

atatggctagccagttttccctgcaccatttattaaatagggaatccttt

ccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagat

atgcggcattatttctgagggctctgttctgttccattggtctatatctc

tgttttggtaccagtaccatgctgttttggttactgtagccttgtagtat

agtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttag

gattgacttggcaatgtgggctcttttttggttccatatgaactttaaag

tagttttttccaattctgtgaagaaattcattggtagcttgatggggatg

gcattgaatctataaattaccctgggcagtatggccattttcacaatatt

gaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcct

cttttatttcattgagcagtggtttgtagttctccttgaagaggtccttc

acatcccttgtaagttggattcctaggtattttattctctttgaagcaat

tgtgaatgggagttcactcatgatttgactctctgtttgtctgttattgg

tgtataagaatgcttgtgatttttgcacattgattttgtatcctgagact

ttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggg

gttttctagatatacaatcatgtcatctgcaaacagggacaatttgactt

cctcttttcctaattgaatacccgttatttccctctcctgcctgattgcc

ctggccagaacttccaacactatgttgaataggagtggtgagagagggca

tccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccat

tcagtatgatattggctgtgggtttgtcatagatagctcttattattttg

agatacatcccatcaatacctaatttattgagagtttttagcatgaagag

ttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgt

ggtttctgtctttggttctgtttatatgctggagtacgtttattgatttt

cgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatg

gtggataagctttttgatgtgctgctggattcggtttgccagtattttat

tgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctct

ttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctc

ataaaatgagttagg

Ten occurrences?

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtg

cagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgcc

attctcctgcctcagcctcccaagtagctgggactacaggcgcccgccac

tacgcccggctaattttttgtatttttagtagagacggggtttcaccgtt

ttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggc

ctcccaaagtgctgggattacaggcgt

Length 277. Occurrences at 10130003, 11421803, 18695837, 26652515, 42971130, 47398125. In the reversed complement at 17858493, 41463059, 42431718, 42580925.
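Queries such as "longest repeat" and "longest substring with ten occurrences" can be sketched on a small input as follows. This is a simplified stand-in, assuming plain sorting of suffixes rather than the suffix tree used in the talk, and it ignores reverse-complement occurrences. The function name is hypothetical, and the returned positions are q occurrences of the substring, not necessarily all of them.

```python
def longest_repeat(s, q=2):
    """Return (substring, q sorted 0-based start positions) for a longest
    substring of s that occurs at least q times, or ("", []) if none exists."""
    n = len(s)
    # Suffix start positions in lexicographic order of the suffixes.
    order = sorted(range(n), key=lambda i: s[i:])

    def lcp(i, j):
        """Length of the longest common prefix of the suffixes starting at i and j."""
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    best_len, best_window = 0, []
    # Any q suffixes sharing a prefix are contiguous in sorted order, so it is
    # enough to compare the first and last suffix of every window of size q.
    for w in range(n - q + 1):
        length = lcp(order[w], order[w + q - 1])
        if length > best_len:
            best_len, best_window = length, order[w:w + q]
    if best_len == 0:
        return "", []
    positions = sorted(best_window)
    return s[positions[0]:positions[0] + best_len], positions


print(longest_repeat("hattivatti", q=2))
print(longest_repeat("hattivatti", q=4))
```

Sorting suffixes this way costs far more than linear time, so the sketch only illustrates the query, not the performance reported for chromosome 3.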

Gapped motifs of S

- gapped pattern P in (A ∪ {*})*, where * is the gap symbol (written * here)
- gap symbol matches any symbol in A
- aabbb
- L(P) = occurrences of P in S
- P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| >= q.
- Problem: find occurrence count |L(P)| for all gapped motifs P of S
- a^n b a^n has exponentially many motifs (M-F. Sagot)!
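The definitions above can be checked directly on small strings. The following is a brute-force sketch, assuming `*` as the gap symbol and 0-based positions; both function names are illustrative.

```python
def gapped_occurrences(s, p, gap="*"):
    """L(P): start positions in s where the gapped pattern p matches.
    The gap symbol matches any single symbol of s."""
    m = len(p)
    return [i for i in range(len(s) - m + 1)
            if all(c == gap or c == s[i + k] for k, c in enumerate(p))]


def is_motif(s, p, q=2, gap="*"):
    """True if p occurs at least q times in s (motif with quorum q)."""
    return len(gapped_occurrences(s, p, gap)) >= q


print(gapped_occurrences("hattivatti", "t*i"))
print(is_motif("hattivatti", "a**i"))
```

Each call takes O(n |P|) time. The difficulty discussed in the talk is not matching one pattern but the number of candidate patterns.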

Motifs vs self-alignments

- self-alignments of S => maximal motifs

[Figure: S; align the occurrences]

Motifs vs multiple self-alignments

- self-alignments of S => maximal motifs

[Figure: aligned copies of S; expand if possible]

Motifs vs self-alignments

- S aaaaabaaaaa P aa
- aaaaabaaaaa aaaaabaaaaa

aa aaaaabaaaaa aaaaabaaaaa

Motifs vs self-alignments

- S aaaaabaaaaa P aa
- aaaaabaaaaa aaaaabaaaaa
- aaaaaaa is maximal motif for this

self-alignment

aaaaaaa aaaaabaaaaa aaaaabaaaaa

Maximal motifs

- multiple self-alignments of S <=> maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif
- any motif P has a unique maximal motif M(P) (align the occurrences and maximize): L(M(P)) = L(P) + d
- unfortunately a^n b a^n has exponentially many maximal motifs
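The "align the occurrences and maximize" step can be sketched as follows, assuming `*` as the gap symbol and 0-based positions. The function name and the returned shift are illustrative. The shift corresponds to d in L(M(P)) = L(P) + d and is negative when the maximal motif extends to the left of P.

```python
def maximal_motif(s, positions, gap="*"):
    """Stack copies of s so that the given occurrence positions line up, keep
    the unanimous columns as symbols and all other columns as gaps, then trim
    gaps at both ends. Returns (motif, shift)."""
    lo = -min(positions)            # leftmost offset valid for every copy
    hi = len(s) - max(positions)    # one past the rightmost valid offset
    cols = []
    for off in range(lo, hi):
        symbols = {s[p + off] for p in positions}
        cols.append(symbols.pop() if len(symbols) == 1 else gap)
    solid = [i for i, c in enumerate(cols) if c != gap]
    if not solid:
        return "", 0
    first, last = solid[0], solid[-1]
    return "".join(cols[first:last + 1]), lo + first


# Occurrences of t*i in hattivatti are at 2 and 7; the maximal motif is atti.
print(maximal_motif("hattivatti", [2, 7]))
# Self-alignment of aaaaabaaaaa with itself shifted by one position.
print(maximal_motif("aaaaabaaaaa", [0, 1]))
```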

Blocks of maximal motifs

- a motif made of aaa, b and ba separated by gaps has blocks aaa, b, ba
- Lemma: Maximal substring motifs (1-block motifs) <=> (branching) nodes of T(S)
- Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif.
- Cor: There are O(n^(2k-1)) different maximal motifs with k blocks; O(n^(2k)) unrestricted motifs.

Counting 2-block maximal motifs

- Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

Algorithm (very simple)

2-block motif (X,d,Y): block X, followed at distance d by block Y

for each maximal substring motif X
    for each distance d = 1, 2, ...
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y,
            find the number h(Y) of marked leaves in its subtree in T(S)
            the occurrence count of motif (X,d,Y) is h(Y)

Cost: O(n) choices of X, O(n) distances d, and O(n) time to mark the leaves and compute all h(Y) in one traversal of T(S), so O(n^3) in total.
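The marking algorithm can be sketched without a suffix tree for small inputs. The code below is a brute-force stand-in, not the O(n^3) method: it enumerates all substrings explicitly and treats a repeated substring as a maximal substring motif when it is followed by at least two distinct symbols (counting an end marker), which is the branching-node condition. It assumes d is the distance between the start positions of X and Y, with at least one gap symbol in between, and it reports only triples with at least two occurrences. It does not check that the resulting gapped motif is itself maximal. All names are illustrative.

```python
from collections import defaultdict


def occurrences(s):
    """Map every substring of s to the list of its 0-based start positions."""
    occ = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            occ[s[i:j]].append(i)
    return occ


def maximal_substring_motifs(s, sentinel="$"):
    """Repeated substrings followed by >= 2 distinct symbols in s + sentinel."""
    t = s + sentinel
    result = {}
    for x, pos in occurrences(s).items():
        if len(pos) >= 2 and len({t[p + len(x)] for p in pos}) >= 2:
            result[x] = pos
    return result


def two_block_counts(s):
    """Occurrence counts h(Y) for triples (X, d, Y) with count >= 2."""
    blocks = maximal_substring_motifs(s)
    counts = {}
    for x, lx in blocks.items():
        for d in range(len(x) + 1, len(s)):
            marked = {p + d for p in lx}          # "mark the leaves" L(X) + d
            for y, ly in blocks.items():
                h = sum(1 for p in ly if p in marked)
                if h >= 2:
                    counts[(x, d, y)] = h
    return counts


print(sorted(maximal_substring_motifs("hattivatti")))
print(two_block_counts("hattivatti")[("t", 2, "i")])
```

For example, the triple ("t", 2, "i") corresponds to the gapped pattern t*i, which occurs twice in hattivatti.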

Counting 2-block maximal motifs (cont)

- Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
- flexible gaps: x, then a gap of any length, then y
- Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

General case

- Q1: Given q and W, has S a motif with at least W non-gap symbols and at least q occurrences?
- In k-block case, is O(n^(2k-1)) (or even better) time possible?
- related work: A. Apostolico, M-F. Sagot, L. Parida, N. Pisanti, ...

2. Founder reconstruction and applications

Haplotype evolution: founders and iterated recombinations

- WABI 2002

[Figure: founder haplotypes and current (observed) haplotypes; only recombinations shown, mutations not shown]

statistical models of recombination: average fragment length ~ 1/generations

Uncovering founder sequences

- Problem: Given current sequences C (haplotypes), construct their founders that produce the sequences by iterated recombinations using minimum possible total number of cross-overs (i.e., current sequences have a parse into smallest possible number of fragments taken from the founders)

Example

Current sequences (4 haplotypes, 12 sites):

0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1

Two founders that give 6 cross-overs:

0 0 1 0 1 1 1 0 0 0 1 1
1 1 1 1 0 0 0 1 0 1 1 0

Two founders that give 18 cross-overs:

0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1

OBS: two founders (colors) always suffice if no restrictions

Founder reconstruction problem

- given a set D of m sequences, construct M founder sequences that give D in minimum number of cross-overs
- solution by dynamic programming, exponential time in m (WABI 2002)
- Q2: NP-hard?
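Evaluating a candidate founder set is the easy half of the problem and can be sketched with a small dynamic program. This is only the scoring step, not the founder construction of the WABI 2002 paper. The data are the four current sequences and the two founder sets from the Example slides, read as rows of 12 sites. Function and variable names are illustrative.

```python
def min_crossovers(seq, founders):
    """Minimum number of cross-overs needed to parse seq into fragments copied
    position by position from the founders (inf if no parse exists)."""
    inf = float("inf")
    # cost[i] = fewest cross-overs so far when the current site is copied from founder i
    cost = [0 if f[0] == seq[0] else inf for f in founders]
    for j in range(1, len(seq)):
        cheapest = min(cost)
        cost = [min(c, cheapest + 1) if f[j] == seq[j] else inf
                for c, f in zip(cost, founders)]
    return min(cost)


def total_crossovers(current, founders):
    """Sum of the minimum cross-over counts over all current sequences."""
    return sum(min_crossovers(seq, founders) for seq in current)


current = ["001000010011", "111111100110", "001011110110", "111100100011"]
founders_a = ["001011100011", "111100010110"]
founders_b = ["000000000000", "111111111111"]

print(total_crossovers(current, founders_a))
print(total_crossovers(current, founders_b))
```

The all-0 and all-1 founders can parse any binary sequence, which is the observation that two founders always suffice. The hard part is choosing founders that minimize the total.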

Modeling a set of haplotypes by a HMM

- motif = Hidden Markov Model
- minimum description length (MDL) modeling
- ALT 2004

Hidden Markov Model (HMM)

- states i with emission alphabet H_i
- emission probabilities P(H), H in H_i
- state transition probabilities w_ij

[Figure: states i and j, each with its emission distribution P(H), joined by a transition with probability w_ij]
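As a generic illustration (not the specific haplotype model of the talk), the probability that an HMM with emission probabilities P(H) and transition probabilities w_ij emits a given sequence can be computed with the standard forward algorithm. The initial-state distribution `start` and all numbers below are made-up toy values.

```python
def forward_probability(obs, start, trans, emit):
    """Probability of the observation sequence obs under an HMM.
    start[i]    : probability of starting in state i
    trans[i][j] : transition probability w_ij
    emit[i]     : dict mapping a symbol to its emission probability in state i
    """
    n = len(start)
    alpha = [start[i] * emit[i].get(obs[0], 0.0) for i in range(n)]
    for sym in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(sym, 0.0)
                 for j in range(n)]
    return sum(alpha)


# Toy model: state 0 always emits "1", state 1 always emits "2".
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [{"1": 1.0}, {"2": 1.0}]
print(forward_probability("11", start, trans, emit))
print(forward_probability("12", start, trans, emit))
```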

Conserved fragments and parses

- haplotypes: 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
- parse: 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
- conserved fragments: 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1
- fragmentation model (HMM) with fragments 2 1 2 1 2; 1 1 1 1; 2 2 1 2 2 2 2; 1 1 2 1 1 1 1

Lactose tolerance

- recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance
- two datasets over 23 SNPs in the vicinity of this SNP
- lactose intolerant persons: 21 haplotypes
- lactose tolerant persons: 38 haplotypes

Case/control study by HMM

Lactose tolerant (2 fragments per haplotype => young)

Lactose intolerant (6 fragments per haplotype)

Genotype phasing via founders using a HMM

- the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs
- find at most M founders that produce resolving haplotype pairs in minimum possible number of cross-overs => relatively good haplotyping method
- improved results with a related HMM, trained with the Expectation Maximization algorithm
- WABI 2005
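To make "resolving haplotype pairs" concrete, here is a small enumeration sketch. It assumes a common encoding that the slides do not spell out: a genotype is a string over 0, 1, 2, where 0 and 1 are homozygous sites and 2 is a heterozygous site. The function name is hypothetical.

```python
from itertools import product


def resolving_pairs(genotype):
    """All unordered haplotype pairs (h1, h2) that resolve the genotype:
    both haplotypes copy the homozygous sites and differ at every
    heterozygous site (encoded as "2")."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = []
    # Fix h1 to "0" at the first heterozygous site so each unordered pair
    # is listed once; the remaining heterozygous sites are free.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        assign = ("0",) + bits if het else ()
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, assign):
            h1[site] = b
            h2[site] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(resolving_pairs("0212"))
```

A genotype with k heterozygous sites has 2^(k-1) resolving pairs, which is why phasing needs an extra criterion such as few founders and few cross-overs.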

HMM for haplotyping

[Figure: HMM with an emission probability distribution at each state and transition probability distributions between the states]

Example HMM

[Figure: example HMM]

3. Uncovering gene enhancer elements

Introduction

- Gene expression regulation in multicellular organisms is controlled in combinatorial fashion by so-called transcription factors.
- Transcription factors bind to DNA cis-elements on enhancer modules (promoters), and multiple factors need to bind to activate the module.
- In mammals, the modules are few and far between.
- The problem: Locate functional regulatory modules.

Gene regulation

[Figure: DNA with promoter1/gene1, promoter2/gene2, promoter3/gene3, promoter4/gene4; transcription, controlled by transcription factors, gives RNA; translation gives proteins]

Model of cell type specific regulation of target gene expression

[Figure: common targets (e.g. Patched): GLI, GLI, ubiquitously expressed TF -> transcription; cell type specific targets (e.g. N-myc): GLI, X, Y (tissue specific TFs) -> transcription]

Binding affinity matrices

- The cis-elements are represented by affinity matrices.
- A column per position
- A row per nucleotide
- Discovered
  - Computationally
  - Traditional wet lab
  - Microarrays

Example matrix (4 rows, 8 columns; every column sums to 51):

 9 11 49 51  0  1  1  4
19  3  0  0  0 45 25 16
 5  1  2  0 17  0  4 21
18 36  0  0 34  5 21 10
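A matrix like this can be used to score candidate binding sites. The sketch below reads the 32 numbers from the slide as a 4 x 8 count matrix and assumes the rows are in the order A, C, G, T, which the slide does not state. It scores a window by summing raw counts. Practical tools usually convert counts to log-odds scores against a background model first.

```python
# Counts from the slide, read as 4 rows x 8 columns.
# The row order A, C, G, T is an assumption.
COUNTS = {
    "A": [9, 11, 49, 51, 0, 1, 1, 4],
    "C": [19, 3, 0, 0, 0, 45, 25, 16],
    "G": [5, 1, 2, 0, 17, 0, 4, 21],
    "T": [18, 36, 0, 0, 34, 5, 21, 10],
}


def site_score(site):
    """Sum of the matrix entries selected by the bases of an 8-mer."""
    return sum(COUNTS[base][k] for k, base in enumerate(site))


def best_site(seq):
    """(score, 0-based start) of the highest-scoring window in seq."""
    width = len(COUNTS["A"])
    return max((site_score(seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


print(best_site("GGCTAATCCGTT"))
```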

Finding preserved motifs of binding sites

- looking at one (human) genome gives too many positives
- comparative approach: take the 200 kb regions surrounding the same genes (paralogs and orthologs) of different mammals (human, mouse, chicken, ...), find preserved clusters (motifs) of binding sites
- Smith-Waterman type algorithm with a novel scoring function
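For orientation, here is the classical Smith-Waterman local alignment recurrence on plain character strings. It is not the talk's method: the novel scoring function over binding sites is not given in the slides, so the match, mismatch and gap scores below are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of strings a and b,
    with linear gap penalties, computed row by row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

In the comparative setting of the talk the aligned objects would be predicted binding sites in two genomic regions rather than single bases.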

Whole genome comparisons

- Whole genomes can be analyzed with our implementation
- We have compared human genes to orthologs in mouse, rat, chicken, fugu, tetraodon and zebrafish
- 100 kbp flanking regions on both sides of the gene.
- Coding regions masked out.
- About 20 000 comparisons for each pair of species.
- About 2 min each

Enhancer prediction for N-myc

200 kb Mouse N-Myc genomic region

200 kb Human N-Myc genomic region

Conserved GLI binding sites in two predicted

enhancer elements, CM5 and CM7

Wet-lab verification

- Selected predicted cis-modules for wet-lab verification
- Fused 1 kb DNA segment containing the predicted enhancer to a marker gene with a minimal promoter and generated transgenic embryos.

To conclude

- combinatorial vs probabilistic motifs
- significance of the findings for the applications => statistical modeling
- Want to do computational biology? Then find a good biologist who has good computational intuition.

Acknowledgements

- Mikko Koivisto
- Heikki Mannila
- Kimmo Palin
- Pasi Rastas
- Morris Michael
- Stefan Kurtz (Hamburg)

- Outi Hallikas (Biom)
- Jussi Taipale (Biom)
- Markus Perola (Biom)
- Hans Söderlund (VTT)
