cisGreedy Motif Finder for Cistematic


1
cisGreedy Motif Finder for Cistematic
  • Sarah Aerni
  • Mentors: Ali Mortazavi
  • Barbara Wold

2
cisGreedy
  • De novo motif finder which implements a greedy
    algorithm similar to Consensus motif finder
  • Goal: to provide an efficient algorithm, included
    in the Cistematic package, that performs
    similarly to Consensus and MEME

3
Cistematic
  • Integrates visualization and refinement of motifs,
    and improves the performance of multiple motif
    finders, in a single package

Mortazavi, 2006
4
Cistematic
  • cisGreedy becomes part of Bottom Tier
  • Motif finder would be included in the Cistematic
    package (prevents need for complicated
    installations)

Image: Ali Mortazavi
5
What is a Motif?
  • cis-Regulatory elements
  • Transcription Factor Binding Sites (TFBS)
  • Binding by transcription factors may increase or
    decrease transcription of genes

6
What is a Motif?
  • GAL4 in Yeast
  • Activator of galactose-induced genes (convert
    galactose to glucose)
  • Protein structure determines motif
  • DNA-protein interactions require certain bases at
    specified locations
  • Motif reflects homodimer structure

7
What is a Motif?
  • cis-Regulatory elements
  • Transcription Factor Binding Sites (TFBS)
  • Binding by transcription factors may increase or
    decrease transcription of genes
  • Gene Regulation believed to be a major source of
    complexity
  • Plants may have more genes or larger genomes than
    humans: are they more complex?
  • Identification of cis-regulatory elements will
    help us understand gene regulatory networks
    (bigger picture)

8
How do we find motifs?
  • Hard to identify
  • Relatively short sequences (as small as 6 bases)
  • Many positions not well conserved
  • Factors improving identification
  • Usually localized in certain proximity of a gene
    (search within 3 kb upstream)
  • Some positions highly conserved
  • Use other data (Microarray?)

9
Motif Finders
  • Greedy
  • Maximizes similarity of motifs from sequences
    through a greedy approach
  • Eliminate background modeling by using Cistematic
    package preprocessing steps
  • Improves speed
  • Prevents false negatives
  • Implements multiple models (ZOOPS, OOPS, TCM)

10
Consensus Scoring
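  • Uses an equation similar to log likelihood, called
    Information Content

$$ I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i} $$

where L is the number of columns in the matrix,
A = {A, C, G, T}, f_ij is the frequency of each
letter i at each position j, and p_i is the a priori
probability of letter i.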
Hertz, Gerald Z., and Gary D. Stormo.
"Identifying DNA and protein patterns with
statistically significant alignments of multiple
sequences." Bioinformatics 15.7-8 (1999): 563-577.
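As a rough illustration of the Information Content score described on this slide, the sketch below sums f * ln(f / p) over every letter and every column of an ungapped alignment. The function name `information_content` and the uniform default background (0.25 per base) are assumptions made for this example; they are not taken from Consensus or Cistematic.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """Information content (in nats) of equal-length, ungapped sites.

    sites: list of strings over A, C, G, T, all the same length.
    background: dict of a priori probabilities per letter
                (assumed uniform when not given).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumption: uniform prior
    n = len(sites)
    total = 0.0
    for column in zip(*sites):  # one tuple of letters per motif position
        for base, count in Counter(column).items():
            f = count / n  # observed frequency of this letter in this column
            total += f * math.log(f / background[base])
    return total


# A perfectly conserved 4-column alignment scores 4 * ln(4) under a uniform prior.
conserved = information_content(["ACGT", "ACGT", "ACGT"])
# A column that matches the background exactly contributes nothing.
uninformative = information_content(["A", "C", "G", "T"])
```

Letters with zero count are simply skipped, which matches the usual convention that 0 * ln(0) is treated as 0.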
11
Removing Background
  • Goal of a background model: differentiate noise
    from signal
  • Issues with background
  • What background should be used?
  • Whole genome? Conserved regions?
  • Selective pressures maintain conserved regions
  • Arguably searching in conserved regions
    guarantees there is little noise (it has been
    maintained)
  • Solution
  • Search in conserved regions
  • Use simple repeat masking
  • Sequences that recur are likely TFBSs

12
cisGreedy scoring
  • Scoring focuses on maximizing number of identical
    bases
  • Percent identity is dependent on number of
    deviations from the strict consensus
  • Background adds complexity that may lead to false
    negatives

13
cisGreedy
  • Input sequences are analyzed
  • Randomly select 2 sequences to be compared

14
cisGreedy
  • The two selected sequences are analyzed
    independently of the remaining sequences

15
cisGreedy
  • The two selected sequences are analyzed
    independently of the remaining sequences
  • Windows of motif size are scanned starting at the
    beginning of each sequence

16
cisGreedy
  • Sequences are scanned in an attempt to locate the
    highest scoring alignment
  • Alignments are ungapped
  • Score is established as the number of sequences
    containing the most frequently occurring base at
    each position
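The scoring and the first scanning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the cisGreedy source: the names `alignment_score`, `windows_of`, and `best_pair` are hypothetical, the per-position counts are summed over the motif width, and ties are broken by keeping the first pair found.

```python
from collections import Counter
from itertools import product


def alignment_score(windows):
    """Sum, over positions, of the count of the most frequent base."""
    return sum(max(Counter(column).values()) for column in zip(*windows))


def windows_of(seq, width):
    """Yield (start, window) for every full-width window in seq."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


def best_pair(seq_a, seq_b, width):
    """Return (score, start_a, start_b) for the best ungapped window pair."""
    best = None
    for (i, win_a), (j, win_b) in product(windows_of(seq_a, width),
                                          windows_of(seq_b, width)):
        score = alignment_score([win_a, win_b])
        if best is None or score > best[0]:  # first best pair wins ties
            best = (score, i, j)
    return best


# Two short toy sequences that share the 4-mer ACGT.
result = best_pair("GGACGTGG", "TTTTACGT", 4)
```

For two sequences, each matching position contributes 2 and each mismatch contributes 1, so an identical pair of width-4 windows scores 8.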

17
cisGreedy
  • Reverse Complements are analyzed (user specified)
  • Once start locations are established with a top
    alignment score, these are left unchanged (Greedy)

18
cisGreedy
  • Select an additional sequence in which to
    identify the location of the motif
  • Windows in the additional sequence are aligned to
    previously established windows (Greedy)

19
cisGreedy
  • Additional sequence scanned as before, reverse
    complement (user specified)
  • Alignment score established as before
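The greedy extension step can be sketched like this: the windows already chosen stay fixed, and only the new sequence is scanned. The function names and the `both_strands` flag (standing in for the user-specified reverse complement option) are assumptions made for this example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def alignment_score(windows):
    """Sum, over positions, of the count of the most frequent base."""
    return sum(max(Counter(column).values()) for column in zip(*windows))


def candidate_windows(seq, width, both_strands):
    """Yield (start, strand, window); '-' windows are reverse-complemented."""
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, "+", window
        if both_strands:
            yield start, "-", reverse_complement(window)


def extend_alignment(fixed_windows, new_seq, width, both_strands=False):
    """Pick the window of new_seq that scores best against the fixed windows.

    The fixed windows are never revisited, which is the greedy part.
    Returns (score, start, strand, window); the first best candidate wins ties.
    """
    best = None
    for start, strand, window in candidate_windows(new_seq, width, both_strands):
        score = alignment_score(fixed_windows + [window])
        if best is None or score > best[0]:
            best = (score, start, strand, window)
    return best


fixed = ["AAGC", "AAGC"]  # windows chosen in the earlier pairwise step
forward_only = extend_alignment(fixed, "TTGCTTCC", 4)
with_reverse = extend_alignment(fixed, "TTGCTTCC", 4, both_strands=True)
```

In this toy input the motif occurs only on the opposite strand, so the forward-only scan settles for a partial match while the two-strand scan recovers the full site.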

20
cisGreedy
  • Final motif locations are used in order to build
    position specific frequency matrices
  • Reverse complement sequence used in building PSFM
    if used
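Building a position specific frequency matrix from the final sites can be sketched as below. The representation (one dictionary of base frequencies per position) and the helper names `build_psfm` and `consensus` are choices made for this example, not the Cistematic data structure.

```python
from collections import Counter


def build_psfm(sites, alphabet="ACGT"):
    """Return one {base: frequency} dict per motif position.

    sites: equal-length strings, already reverse-complemented where the
           motif was found on the opposite strand.
    """
    n = len(sites)
    psfm = []
    for column in zip(*sites):
        counts = Counter(column)
        psfm.append({base: counts[base] / n for base in alphabet})
    return psfm


def consensus(psfm):
    """Most frequent base at each position (ties go to alphabet order)."""
    return "".join(max(column, key=column.get) for column in psfm)


sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
psfm = build_psfm(sites)
motif = consensus(psfm)
```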

21
Testing cisGreedy
  • AIY
  • 16bp cis-regulatory motif drives expression
  • Experimentally verified
  • Gene battery consists of a set of genes bound by
    AIY
  • Orthologous genes contain highly specified
    binding sites
  • Individual binding sites of battery genes within
    a single species can vary considerably

(Wenick and Hobert 759)
22
Cistematic Results for AIY
[Figure: hen-1, showing regions of conservation across orthologous genes]
23
Results for AIY
  • AIY identified: AAATTGGCTTCCTCAAA
  • cisGreedy: TTTGAGGAAGCCAATTT
  • cisGreedy (reverse complement): AAATTGGCTTCCTCAAA
  • MEME: AAATTGGCTTCCTCAAA

AIY-Battery Consensus
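The agreement between the cisGreedy result and the other two motifs can be checked with a few lines of code. The helper `reverse_complement` is written for this example; the sequences are the ones listed on this slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


cisgreedy_motif = "TTTGAGGAAGCCAATTT"
aiy_motif = "AAATTGGCTTCCTCAAA"

# True when the cisGreedy motif is the same site read from the other strand.
same_site = reverse_complement(cisgreedy_motif) == aiy_motif
```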
24
Cistematic Results for AIY
[Figure: hen-1]
25
Results for AIY
[Figure: hen-1]
26
Tompa Bakeoff
  • 3 benchmark datasets
  • Real
  • Markov Chain
  • Generic
  • 4 organisms
  • Human
  • Mouse
  • Fruitfly
  • Yeast
  • Each dataset contains 0-1 motifs.
  • Each sequence can have 0 or multiple motifs
  • Report 0-1 motif per dataset and locations of
    motifs
  • Use statistical tools provided by bakeoff to
    analyze runs

27
Bakeoff example (hm03)
  • Identify the most reasonable motif in each
    dataset independently

28
Real
Interesting pattern appears between 3 of 10
sequences
29
Markov
30
Generic
31
Bakeoff example (hm03)
  • Identify the most reasonable motif in each
    dataset independently
  • Determine which motif appears most reasonable
    across 3 benchmarks and map motif in sequences
    using Cistematic
  • Compare results to actual locations (provided in
    bakeoff package)

32
Solution
33
Real
34
Solution
35
Markov
36
Bakeoff results
  • Correlation Coefficient
  • $nCC = \frac{nTP \cdot nTN - nFN \cdot nFP}{\sqrt{(nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN)}}$
  • Sensitivity (fraction of known sites that are
    predicted)
  • $sSn = \frac{sTP}{sTP + sFN}$
  • Positive Predictive Value (fraction of predicted
    sites that are known)
  • $sPPV = \frac{sTP}{sTP + sFP}$
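The three statistics above are simple functions of true/false positive and negative counts. The sketch below is not the bakeoff's own tooling; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here to keep the example total.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """(TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as no correlation
    return (tp * tn - fn * fp) / denominator


def sensitivity(tp, fn):
    """Fraction of known sites that are predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def positive_predictive_value(tp, fp):
    """Fraction of predicted sites that are known."""
    return tp / (tp + fp) if tp + fp else 0.0


# Made-up counts for illustration only, not bakeoff results.
cc = correlation_coefficient(tp=10, tn=80, fp=5, fn=5)
sn = sensitivity(tp=10, fn=5)
ppv = positive_predictive_value(tp=10, fp=5)
```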

37
Bakeoff results
  • cisGreedy overall 7th best performer (excluding
    those with no data)
  • Overall top performer in fly
  • Worst performer in yeast
  • 3rd worst performer in mouse
  • 4th best performer in human

38
Bakeoff results
Adapted from Tompa, 2005
When running programs in parallel, correlation of
motif finder results to true binding sites
improves
39
Future goals
  • Complete analysis of results for cisGreedy using
    benchmarks established by Tompa paper (Nature
    Biotech, 2005)
  • Document results and algorithm development
  • Continue improving cisGreedy

40
References
  • Bioalgorithms.info
  • Jones, Neil C., and Pavel A. Pevzner. An
    Introduction to Bioinformatics Algorithms. MIT
    Press, 2004.
  • Hertz, Gerald Z., and Gary D. Stormo.
    "Identifying DNA and protein patterns with
    statistically significant alignments of multiple
    sequences." Bioinformatics 15.7-8 (1999): 563-577.
  • Tompa, Martin, et al. "Assessing computational
    tools for the discovery of transcription factor
    binding sites." Nature Biotechnology 23.1
    (January 2005): 137-144.
  • Wenick, Adam S., and Oliver Hobert. "Genomic
    cis-Regulatory Architecture and trans-Acting
    Regulators of a Single Interneuron-Specific Gene
    Battery in C. elegans." Developmental Cell
    6.6 (2004): 757-770.
  • http://cistematic.caltech.edu

41
Acknowledgements
  • Ali Mortazavi
  • Barbara Wold
  • Wold Lab funding provided by DOE and NASA
  • Additional funding by NSF and NIH
  • SoCalBSI faculty, staff and fellow students