6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution - PowerPoint PPT Presentation

About This Presentation
Title:

6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution

Description:

6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution Rapid database search Manolis Kellis Piotr Indyk TA: Pouya Kheradpour Protein interaction network – PowerPoint PPT presentation

Slides: 48
Provided by: Mano74
Learn more at: http://ai.stanford.edu
Transcript and Presenter's Notes

Title: 6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution


1
6.095 / 6.895 Computational Biology: Genomes,
Networks, Evolution
Rapid database search
  • Manolis Kellis
  • Piotr Indyk
  • TA: Pouya Kheradpour

Protein interaction network
Genome duplication
2
Administrivia
  • Course information
  • Lecturers: Manolis Kellis and Piotr Indyk
  • TA: Pouya Kheradpour
  • TR 11-12:30, in 3-370
  • http://compbio.mit.edu/6.895/
  • Grading
  • 5 problem sets
  • Each problem set covers 4 lectures, contains 4
    problems.
  • Algorithmic problems and programming assignments
  • Graduate version includes 5th problem on current
    research
  • Exams
  • In-class midterm, no final exam
  • Collaboration policy
  • Collaboration allowed, but you must:
  • Work independently on each problem before
    discussing it
  • Write solutions on your own
  • Acknowledge sources and collaborators. No
    outsourcing.

3
Goals for the term
  • Introduction to computational biology
  • Fundamental problems in computational biology
  • Algorithmic/machine learning techniques for data
    analysis
  • Research directions for active participation in
    the field
  • Ability to tackle research
  • Problem set questions: algorithmic, rigorous
    thinking
  • Programming assignments: hands-on experience w/
    real datasets
  • Final project
  • Research initiative to propose an innovative
    project
  • Ability to carry out the project's goals, produce
    deliverables
  • Write up goals, approach, and findings in
    conference format
  • Present your project to your peers in conference
    setting

4
Course outline
  • Organization
  • Duality: Computation and Biology
  • Important biological problems
  • Fundamental computational techniques
  • Foundations and Frontiers
  • First half: well-defined problems and general
    methodologies
  • Second half: in-depth look at complex problems,
    combining techniques learned; opens to projects
    and research directions
  • Topics covered
  • First half: the foundations
  • String matching, genome analysis, expression
    clustering, regulatory motifs, biological
    networks, evolutionary theory
  • Second half: the frontiers
  • Genome assembly, gene finding, RNA folding,
    microRNAs, Bayesian networks, generative models,
    genome evolution

5
Books used in the course
Jones & Pevzner (J)
Durbin et al. (D)
Price: $40
Availability: Quantum Books (J), amazon.com (J, D), MIT libraries
Both books have several copies on reserve
6
Today's Goals
  • Computational challenges in modern biology
  • The basic questions of molecular biology
  • The computational problems that arise
  • How we can address them algorithmically
  • Overview of topics
  • String matching, genome analysis, expression
    clustering, regulatory motifs, biological
    networks, evolutionary theory
  • RNA folding, genome assembly, gene finding,
    microRNAs, Bayesian networks, generative models,
    genome evolution
  • First problem: regulatory motif discovery
  • Problem introduction, biological intuition,
    computational formulation
  • Brute-force searching and algorithmic
    improvements
  • Machine learning approach and research directions

7
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG
AAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATAT
CTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGC
CTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCAT
TGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCC
GACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTG
AAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTAC
AATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAAT
GCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGAT
GATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACT
TTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATG
TCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGT
CAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGT
ACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAA
AGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCG
GATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT
TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGC
TTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATA
AATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGG
TTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTA
AATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAA
AGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGA
TGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCG
TTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCC
AAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGT
TAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATC
ATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAA
TTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAACACCCT
TGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAG
TGGTAGAAGTCACTACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTT
GTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAG
AGATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCT
GGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTA
CTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGA
TGTCGCACAATCCTTGAATTGTTCTCGCGAAGAATTCACAAGAGACTACT
TAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCT
AAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAAT
GACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTG
CCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCT
TGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATA
TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGG
TTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCC
AATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGA
AAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAAT
TATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTT
TTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTA
ACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATG
ATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACAC
AGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTAT
TCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGAC
GTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAAC
TGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTT
TTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTG
TACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAA
GAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGC
TTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAA
TAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAAT
CTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACT
CCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGT
TCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACA
ATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAA
CTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGA
AAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGG
CATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCA
TTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTT
GGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCG
TCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGG
CTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCT
ACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGT
ACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCT
TATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTT
CTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATA
GTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCT
ATCTTTGGAAAAGATTTACAA
10
Extracting signal from noise
11
Challenges in Computational Biology
DNA
12
Molecular Biology Primer
  • 20 mins

13
DNA: The double helix
  • The most noble molecule of our time

14
DNA: the molecule of heredity
  • Self-complementarity sets molecular basis of
    heredity
  • Knowing one strand creates a template for the
    other
  • "It has not escaped our notice that the specific
    pairing we have postulated immediately suggests a
    possible copying mechanism for the genetic
    material." Watson & Crick, 1953

15
DNA: chemical details
  • Bases hidden on the inside
  • Phosphate backbone outside
  • Weak hydrogen bonds hold the two strands together
  • This allows low-energy opening and re-closing of
    two strands
  • Anti-parallel strands
  • Extension 5'→3': tri-phosphate coming from newly
    added nucleotide
  • The only pairings are
  • A with T
  • C with G
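
The pairing rules and the anti-parallel orientation above can be sketched in a few lines of Python. This is a minimal illustration, not course code; the names `PAIRING`, `complement` and `reverse_complement` are chosen here for the example.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction.

    The two strands are anti-parallel, so the partner is the
    complement read in the opposite direction.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Applying `reverse_complement` twice returns the original strand, which is the "knowing one strand creates a template for the other" property in code form.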

16
DNA: deoxyribose sugar
17
DNA: the four bases
[Figure labels: purine vs. pyrimidine, weak vs. strong, amino vs. keto]
18
DNA: base pairs
19
DNA: sequences
20
DNA packaging
  • Why packaging?
  • DNA is very long
  • Cell is very small
  • Compression
  • Chromosome is 50,000 times shorter than extended
    DNA
  • Using the DNA
  • Before a piece of DNA is used for anything, this
    compact structure must open locally

21
Chromosomes inside the cell
  • Eukaryote cell
  • Prokaryote cell

22
Central dogma of Molecular Biology
DNA
makes
RNA
makes
Protein
23
Genes control the making of cell parts
  • The gene is a fundamental unit of inheritance
  • DNA molecule contains tens of thousands of genes
  • Each gene governs the making of one functional
    element, one part of the cell machine
  • Every time a part must be made, a piece of the
    genome is copied, transported, and used as a
    blueprint
  • RNA is a temporary copy
  • The medium for transporting genetic information
    from the DNA information repository to the
    protein-making machinery is an RNA molecule
  • The more parts are needed, the more copies are
    made
  • Each mRNA only lasts a limited time before
    degradation

24
RNA: The messenger
  • Information changes medium
  • single strand vs. double strand
  • ribose vs. deoxyribose sugar
  • A T T A C G G T A C C G T
  • U A A U G C C A U G G C A
  • Compatible base-pairing in hybrid
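
The two rows of letters above (DNA on top, the RNA that pairs with it underneath) follow a simple per-base mapping. A minimal Python sketch, with function names that are illustrative only and with strand direction ignored, as on the slide:

```python
# DNA base -> the RNA base that pairs with it (RNA uses U where DNA has T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def pair_rna(dna_template: str) -> str:
    """Return the RNA bases that pair, position by position, with a DNA strand."""
    return "".join(DNA_TO_RNA[base] for base in dna_template.upper())


def coding_to_mrna(coding_strand: str) -> str:
    """The mRNA reads like the coding (non-template) strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(pair_rna("ATTACGGTACCGT"))  # UAAUGCCAUGGCA
```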

25
From DNA to RNA: Transcription
26
From pre-mRNA to mRNA: Splicing
  • In Eukaryotes, not every part of a gene is coding
  • Functional exons interrupted by non-translated
    introns
  • During pre-mRNA maturation, introns are spliced
    out
  • In humans, primary transcript can be 10^6 bp long
  • Alternative splicing can yield different exon
    subsets for the same gene, and hence different
    protein products
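
Splicing and alternative splicing can be pictured as keeping a chosen subset of exon intervals and joining them. The sketch below uses a made-up toy transcript (exons in upper case, introns in lower case) and hypothetical coordinates; real splice-site recognition is far more involved.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open [start, end) coordinates on the pre-mRNA


def splice(pre_mrna: str, exons: List[Exon]) -> str:
    """Concatenate the chosen exons in order; everything else is treated as intron."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


if __name__ == "__main__":
    # Toy transcript, invented for illustration only.
    pre = "AUGGCC" + "guaaguuag" + "UUUGGA" + "guccccag" + "CCCUAA"
    exons = [(0, 6), (15, 21), (29, 35)]
    print(splice(pre, exons))                 # all three exons
    print(splice(pre, [exons[0], exons[2]]))  # exon 2 skipped
```

Choosing a different exon subset for the same transcript yields a different mature mRNA, and hence a different protein product.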

27
RNA can be functional
  • Single strand allows complex structure
  • Self-complementary regions form helical stems
  • Three-dimensional structure allows functionality
    of RNA
  • Four types of RNA
  • mRNA: messenger of genetic information
  • tRNA: codon-to-amino acid specificity
  • rRNA: core of the ribosome
  • snRNA: splicing reactions
  • To be continued
  • We'll learn more in a dedicated lecture on the RNA
    world
  • Once upon a time, before DNA and protein, RNA did
    it all

28
RNA structure: secondary and tertiary
29
Splicing machinery made of RNA
30
Central dogma of Molecular Biology
DNA
makes
RNA
makes
Protein
31
Proteins carry out the cell's chemistry
  • More complex polymer
  • Nucleic acids have 4 building blocks
  • Proteins have 20: greater versatility
  • Each amino acid has specific properties
  • Sequence → Structure → Function
  • The amino acid sequence determines the
    three-dimensional fold of the protein
  • The protein's function largely depends on the
    features of the 3D structure
  • Proteins play diverse roles
  • Catalysis, binding, cell structure, signaling,
    transport, metabolism

32
Protein structure
Alpha-beta horseshoe: this placental ribonuclease
inhibitor is a cytosolic protein that binds
extremely strongly to any ribonuclease that may
leak into the cytosol. 17-stranded parallel
β-sheet curved into an open horseshoe shape, with
16 α-helices packed against the outer surface. It
doesn't form a barrel although it looks as though
it should. The strands are only very slightly
slanted, being nearly parallel to the central
axis.
Beta-barrel: some antiparallel β-sheet domains are
better described as β-barrels rather than
β-sandwiches, for example streptavidin and porin.
Note that some structures are intermediate
between the extreme barrel and sandwich
arrangements.
Helix-turn-helix: common motif for DNA-binding
proteins that often play a regulatory role as
mRNA-level transcription factors
33
Protein building blocks
  • Amino Acids

34
From RNA to protein: Translation
  • tRNA
  • Ribosome

35
The Genetic Code
36
The genetic code
  • Degeneracy of the genetic code
  • To encode 20 amino acids, two nucleotides are not
    enough (4^2 = 16). Three nucleotides are too many
    (4^3 = 64)
  • The genetic code is degenerate: the same amino acid
    can be represented by more than one codon. Room
    for innovation
  • Moreover, amino acids with similar properties can
    be substituted for each other without changing
    the structure of the protein
  • Six possible translation frames for every
    nucleotide stretch
  • GCU.UGU.UUA.CGA.AUU.A → Ala - Cys - Leu - Arg -
    Ile -
  • G.CUU.GUU.UAC.GAA.UUA → - Leu - Val - Tyr - Glu -
    Leu
  • Stop codons are 3 of the 64 codons, so long ORFs
    are unlikely by chance and are probably genes
  • In some viruses as many as four overlapping
    frames are functional
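
The codon arithmetic and the reading-frame examples above can be checked with a short Python sketch. The table is the standard genetic code written in the usual compact U/C/A/G ordering, with one-letter amino acid symbols (A = Ala, C = Cys, L = Leu, R = Arg, I = Ile, V = Val, Y = Tyr, E = Glu). The function names are illustrative, and `prob_no_stop` assumes uniformly random, independent bases, which real genomes do not satisfy.

```python
from typing import Dict, List

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ..., GGG.
# '*' marks the three stop codons (UAA, UAG, UGA).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE: Dict[str, str] = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

RNA_PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}


def translate(rna: str, frame: int = 0) -> str:
    """Translate one reading frame (0, 1 or 2); a trailing partial codon is dropped."""
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(rna: str) -> List[str]:
    """Three frames on the given strand plus three on its reverse complement."""
    rev = "".join(RNA_PAIR[b] for b in reversed(rna))
    return [translate(s, f) for s in (rna, rev) for f in range(3)]


def prob_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop codon (61 of 64 are not stops)."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    seq = "GCUUGUUUACGAAUUA"
    print(translate(seq, 0))  # ACLRI
    print(translate(seq, 1))  # LVYEL
    print(prob_no_stop(100))
```

Under the uniform-base assumption, an open stretch of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as candidate genes.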

37
Summary: The Central Dogma
  • DNA makes RNA makes Protein

Inheritance
Messages
Reactions
38
Why Computational Biology?
39
Why Computational Biology: Student answers
  • Lots of data (lots of data)
  • Pattern finding
  • It's all about data
  • Ability to visualize
  • Simulations
  • Guess and verify (generate hypotheses for testing)
  • Propose mechanisms / theory to explain
    observations
  • Networks / combinations of variables
  • Efficiency (reduce experimental space to cover)
  • Informatics infrastructure (ability to combine
    datasets)
  • Correlations
  • Life itself is digital. Understand cellular
    instruction set

40
Challenges in Computational Biology
DNA
41
Course Outline
Lecture | Day | Date | Topic | Reading
Lecture 1 | Thursday | Sept 8 | Algorithms Machine Learning Biology | J3.1-3.7, J2.8-2.10, D11
Lecture 2 | Tuesday | Sept 13 | Evolutionary models seq alignment Dynamic programming | D2.1-2.3, J6.4-6.9
Lecture 3 | Thursday | Sept 15 | Local/Global alignments Variations on dynamic programming | D2.1-2.3, J6.4-6.9
Lecture 4 | Tuesday | Sept 20 | Linear time string searching Suffix trees String preprocessing | J9.1-9.8
Lecture 5 | Thursday | Sept 22 | Database search Hashing random projections | J9.1-9.8
Lecture 6 | Tuesday | Sept 27 | Biological signals HMMs | D3.1-3.2
Lecture 7 | Thursday | Sept 29 | CpG islands / simple ORFs Learning with HMMs | D3.3
Lecture 8 | Tuesday | Oct 4 | Expression analysis clustering | J10.1-10.3
Lecture 9 | Thursday | Oct 6 | Multi-dimensional clustering feature selection | J10.1-10.3
No lecture | Tuesday | Oct 11 | Holiday - Happy Columbus day |
Lecture 10 | Thursday | Oct 13 | Regulatory Motifs Gibbs Sampling Expectation Maximization | J4.4-4.9, J5.5, J12.2
Lecture 11 | Tuesday | Oct 18 | Biological networks graph algorithms network dynamics | J8.1-8.2, P1
Lecture 12 | Thursday | Oct 20 | Phylogenetic trees Greedy algorithms parsimony EM | D7.1-7.5
Lecture 13 | Tuesday | Oct 25 | Multiple alignment profile alignment iterative alignment | J6.10
Lecture 14 | Thursday | Oct 27 | Midterm |
Lecture 15 | Tuesday | Nov 1 | RNA folding context-free grammars Phylo-CFGs | D9
Lecture 16 | Thursday | Nov 3 | Combine alignment and feature finding Pair HMMs | D4
Lecture 17 | Tuesday | Nov 8 | Gene Finding Generalized HMMs | Burge
Lecture 18 | Thursday | Nov 10 | Comparative gene finding Phylogenetic HMMs | Siepel
Lecture 19 | Tuesday | Nov 15 | microRNA regulation target prediction | Bartel
Lecture 20 | Thursday | Nov 17 | Regulatory relationships Bayesian networks | Hartemink
Lecture 21 | Tuesday | Nov 22 | Generative models of regulation Bayesian graphs | Segal
No lecture | Thursday | Nov 24 | Holiday - Happy Thanksgiving break |
Lecture 22 | Tuesday | Nov 29 | Genome assembly Euler graphs | J8.4
Lecture 23 | Thursday | Dec 1 | Genome rearrangements Genome duplication | J5.1-5.4
Lecture 24 | Tuesday | Dec 6 | Genome-scale comparative genomics | Kellis
Lecture 25 | Thursday | Dec 8 | Final presentations |
42
Today: Regulatory Motif Discovery
43
Regulatory motif discovery
[Figure: GAL1 promoter, labeled with two Gal4 sites (CGG ... CCG) and a Mig1 site (CCCCW), and the sequence ATGACTAAATCTCATTCAGAAGAAGTGA]
  • Regulatory motifs (summary)
  • Genes are turned on / off in response to changing
    environments
  • No direct addressing: subroutines (genes)
    contain sequence tags (motifs)
  • Specialized proteins (transcription factors)
    recognize these tags
  • What makes motif discovery hard?
  • Motifs are short (6-8 bp), sometimes degenerate
  • Can contain any set of nucleotides (no ATG or
    other rules)
  • Act at variable distances upstream (or
    downstream) of target gene
  • How can we discover them?
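
Searching for one known degenerate motif is the easy half of the problem. A minimal Python sketch, using the standard IUPAC ambiguity codes (W = A or T, N = any base, and so on) and the Mig1-style pattern CCCCW from the slide; the helper names and the test sequence are made up for illustration, and only the given strand is searched.

```python
import re
from typing import List

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif: str):
    """Compile a degenerate motif into a regular expression."""
    body = "".join(f"[{IUPAC[ch]}]" for ch in motif.upper())
    # Zero-width lookahead so that overlapping occurrences are all reported.
    return re.compile(f"(?=({body}))")


def find_motif(motif: str, sequence: str) -> List[int]:
    """Return the 0-based start position of every occurrence of the motif."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_motif("CCCCW", "ACCCCATCCCCTG"))  # [1, 7]
```

Discovery is harder because the motif itself is unknown: the pattern, its degeneracy and its distance from the gene all have to be inferred.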

44
Motifs are preferentially conserved across
evolution
[Figure labels: GAL10, GAL1]

Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA

Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG

Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT

Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT

Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG

Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA

Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT

Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA

Increase power by testing conservation in many
regions
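
One simple way to read conservation off a multiple alignment is to mark the columns where every species carries the same base and none has a gap, then look for runs of such columns. The Python sketch below is an illustrative simplification on toy rows, not the scoring used in the lecture; the function names are made up for the example.

```python
from typing import List


def conserved_columns(rows: List[str]) -> List[bool]:
    """True where every aligned sequence has the same base and none has a gap."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return [
        len(set(col)) == 1 and col[0] not in "-."
        for col in zip(*rows)
    ]


def conserved_windows(rows: List[str], k: int) -> List[int]:
    """Start columns of every length-k window that is conserved throughout."""
    mask = conserved_columns(rows)
    return [i for i in range(len(mask) - k + 1) if all(mask[i:i + k])]


if __name__ == "__main__":
    toy = ["ACGGTA", "ACGGAA", "TCGG-A"]  # hypothetical aligned rows
    print(conserved_columns(toy))     # [False, True, True, True, False, True]
    print(conserved_windows(toy, 3))  # [1]
```

A single conserved window can occur by chance, which is why the slide suggests pooling evidence from many regions.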
45
Framing the problem computationally
  • How do we find all instances of a motif in a
    genome?
  • Naïve algorithm: search every position
  • How do we count all instances of every 6-mer in a
    genome?
  • Naïve algorithm: scan the genome for each motif
  • Improvement: scan genome once, filling a table
  • How do we count all instances of every 50-mer in
    a genome?
  • Table is no longer feasible, most entries empty
  • Use a hash table
  • How do we search a new motif in a known genome?
  • Pre-processing of the database
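
The three ideas above (a dense table for short k-mers, a hash table for long ones, and a pre-processed index for repeated queries) can be sketched in Python as follows. Function names and the 2-bit encoding are choices made for this example; it is a sketch under those assumptions rather than a tuned implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Read a k-mer as a base-4 number (2 bits per base)."""
    index = 0
    for base in kmer.upper():
        index = (index << 2) | CODE[base]
    return index


def count_kmers_table(genome: str, k: int) -> List[int]:
    """Scan the genome once, filling a dense table with 4**k entries.

    The table index is updated in O(1) per position (rolling encoding).
    Practical for small k such as 6 (4096 entries), not for k = 50.
    """
    table = [0] * (4 ** k)
    mask = 4 ** k - 1
    index = 0
    valid = 0  # consecutive A/C/G/T bases seen since the last other symbol
    for base in genome.upper():
        if base not in CODE:
            valid, index = 0, 0
            continue
        index = ((index << 2) | CODE[base]) & mask
        valid += 1
        if valid >= k:
            table[index] += 1
    return table


def count_kmers_hash(genome: str, k: int) -> Counter:
    """Hash-table version: stores only the k-mers that actually occur."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def build_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Pre-process the genome once: k-mer -> list of start positions."""
    genome = genome.upper()
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index
```

With the index built, looking up a new k-mer is a dictionary access (for example `build_index(genome, 6).get("CGGCCG", [])`) instead of another full scan of the genome.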

46
Computational approaches for motif discovery
  • Method 1: Enumerate all motifs
  • Combinatorial search
  • Method 2: Randomly sample the genome
  • Statistical approach
  • Method 3: Enumerate motif seeds + refinement
  • Hill-climbing
  • Method 4: Content-based addressing
  • Hashing
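
As an illustration of Method 1 only, the sketch below enumerates every possible k-mer and counts how many input sequences contain it. The scoring (number of sequences containing an exact match) and the function names are assumptions made for this example; the slides do not specify them, and real motif finders also handle degeneracy and background frequencies.

```python
from itertools import product
from typing import Dict, List


def enumerate_motifs(sequences: List[str], k: int) -> Dict[str, int]:
    """For every possible k-mer, count how many sequences contain it."""
    sequences = [s.upper() for s in sequences]
    support = {}
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        support[motif] = sum(motif in seq for seq in sequences)
    return support


def top_motifs(sequences: List[str], k: int, n: int = 5) -> List[str]:
    """Return the n best-supported k-mers, ties broken alphabetically."""
    support = enumerate_motifs(sequences, k)
    return sorted(support, key=lambda m: (-support[m], m))[:n]


if __name__ == "__main__":
    promoters = ["TTCGGAT", "ACGGTT", "GGCGGA"]  # toy sequences
    print(top_motifs(promoters, 3, n=1))  # ['CGG']
```

The search space is 4^k motifs, so exhaustive enumeration is workable for the 6-8 bp motifs mentioned earlier but grows quickly with k.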

47
Recitation tomorrow!
  • Introduction to algorithms / running time
  • Searching for motifs
  • Basic table indexing
  • Hints to hashing
  • Introduction to molecular biology
  • Central dogma, splicing, genomes
  • Introduction to probability
  • Searching motifs with ambiguity codes
  • Modeling background distribution
  • Likelihood ratios and hypothesis testing