Nikolaj Blom - PowerPoint PPT Presentation

About This Presentation
Title:

Nikolaj Blom

Description:

Start codon: ATG. p(ATG) = p(A) x p(T) x p(G) ≈ ¼ x ¼ x ¼ = 1/64 ... Mark codons until first in-frame Stop codon. Center for Biologisk Sekvensanalyse ...

Slides: 65
Provided by: LarsJuh3
Learn more at: http://www.cbs.dtu.dk
Transcript and Presenter's Notes

Title: Nikolaj Blom


1
Gene Finding in Eukaryotic Genomes
DTU course 27614
Nikolaj Blom
Center for Biological Sequence Analysis, BioCentrum-DTU
Technical University of Denmark
nikob_at_cbs.dtu.dk
2
Outline
  • Gene finding in eukaryotic genomes
  • Why look for genes?
  • Genes as products
  • Orphan genes
  • What is the problem?
  • Needles in haystacks
  • Signal and background
  • Gene finding by hand
  • Gene features
  • Strategies
  • Ab initio gene prediction
  • Gene prediction methods
  • Isolated
  • Integrated

3
Gene Finding - Gene Hunting Gene Discovery
  • Why Look for Genes?
  • Genes may
  • Explain Basic Biological Functions
  • Protein kinases, Cyclins, etc.
  • Explain Medical Conditions
  • Cystic fibrosis gene
  • Be Used for Treatment of Disease
  • Contain commercial value
  • As enzymes (Lipases, Amylases, washing
    detergent)
  • As drug targets (Ion channels, Receptors)
  • As therapeutic factors

4
Genes/Proteins (Biologics) as Pharmaceutical Products
  • Blockbusters: >1 billion US$ yearly
  • Avonex
  • Interferon-beta from Biogen Inc.
  • Multiple sclerosis
  • EPOgen
  • EPO (Erythropoietin) from Amgen Inc.
  • Anemia

5
At Least 40% Orphan Proteins in the Human Genome
  • Uncharted territory
  • Novel genes
  • Novel opportunities
  • Novel biological functions
  • Novel biomarkers and therapeutic factors

Venter et al., Science, 2001
6
Human Genome Published
HUGO: Nature, 15 Feb. 2001
Celera: Science, 16 Feb. 2001
7
We Have the Human Genome Sequence...now what?
  • Are there still novel genes to be discovered?
  • Yes!
  • What is the challenge?
  • We don't know how many genes there are!
  • We don't know where they (all) are!
  • We don't know what they (all) do!

8
The cellular machinery recognizes genes without
access to GenBank, SwissProt or computers. Can
we?
9
(No Transcript)
10
Why is Gene Finding Difficult?
  • Because genes
  • are embedded in the genome sequence
  • are needles hiding in genome haystacks...
  • constitute only 2% of the human genome (the coding
    regions)
  • are often split, i.e. have exon-intron structure
  • Can we distinguish the gene features from the
    background?

11
Can U spot Spot?
12
  • TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTA
    TGCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGC
    TGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCA
    CCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCAT
    CTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTT
    CCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTA
    AGGCTGCGGTGAGCTGTGATTGTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGT
    GAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCCATTCCTGGTGTT
    GGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTT
    CATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGT
    GTTGTGTGTGTTATATATATAAAATATATAGGAAGAGGCACCAGAGAGCT
    CTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGA
    TGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCC
    TGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTGGCTCGCAC
    CTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTC
    AAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAA
    AAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTC
    CTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGT
    GGAGGCTGCAGTGAGCCATGATCACACCTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTCCTTGTCAGGTTTTCACCCCATGCTCCTCCATTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGCTAGTCTG
    CTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCT
    TCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCT
    CCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCC
    TGCTTATTGTCTTCCTCAGTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTT

14
(No Transcript)
15
  • AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAG
    GACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGA
    AAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGG
    CTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCT
    GAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTG
    GCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTG
    AGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCT
    ACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAG
    CTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGC
    TGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCA
    AGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGT
    GCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAG
    CTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACT
    CTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCA
    TGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGT
    GTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAA
    TTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTG
    TGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATA
    TATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTC
    TCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGAT
    GTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCT
    GGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCG
    CACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAG
    GTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAA
    AAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGA
    GTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGA
    AGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGT
    GACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGAC
    AAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACG
    GCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCT
    CAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCA
    ATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCT
    TCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTG
    TTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTA
    ACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTC
    TCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCAT
    CCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCAC
    CAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTT
    ACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTC
    CTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
    TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAA
    AAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTT
    ATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAA
    TGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACC
    ACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTT
    GGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGG
    GTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTC
    TGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGT
    GCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATG
    TGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAA
    AAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCT
    GGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCC
    TAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACA
    AGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAA
    AAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCC
    CACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCT
    GGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCC
    AGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTT
    TCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGG
    GCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGC
    ATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAAT
    GTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTC
    ACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCT
    GTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTG
    ACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAG
    ACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAAT
    TTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTT
    TAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGA
    AATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACC
    CACAATTAGCTGAG

Can U spot the Gin?
Can U spot the Gene?
Ooops
17
Manual Genefinding
Find, mark and count all ATG
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: GTAGAG
Acceptor splice site: CTAG
>U70368 (950 bp)
  1 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
 51 GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
101 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
151 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
201 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
251 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
301 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
351 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
401 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
451 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
501 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
551 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
601 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
651 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
701 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
751 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
801 GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
851 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
901 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
How many ATGs do you expect?
18
Manual Genefinding
Start codon: ATG
p(ATG) = p(A) x p(T) x p(G) ≈ ¼ x ¼ x ¼ = 1/64 (in 950 bp: 14.8 ATG expected)
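The estimate on this slide can be reproduced in a few lines. This is a minimal sketch that assumes, as the slide does, that the four bases are equally likely and independent. The function names are illustrative and not part of the course material.

```python
def expected_motif_count(seq_len: int, motif: str = "ATG") -> float:
    """Expected number of motif hits, assuming uniform, independent bases.

    Uses the slide's approximation seq_len * (1/4) ** len(motif).
    """
    return seq_len * 0.25 ** len(motif)


def count_motif(seq: str, motif: str = "ATG") -> int:
    """Count (possibly overlapping) occurrences of motif on the given strand."""
    seq, motif = seq.upper(), motif.upper()
    return sum(
        1 for i in range(len(seq) - len(motif) + 1)
        if seq[i:i + len(motif)] == motif
    )


print(round(expected_motif_count(950), 1))  # 14.8

# First 50 bases of the U70368 example sequence from the slides.
first_50 = "CTCCCTTAGAAGACTCCAGCAAGTTATTTGAAGAGGTCTTTGGAGACATG"
print(count_motif(first_50))  # 1
```

Real genomic sequence is not uniform in base composition, so the observed count can differ from this expectation. Counting on the opposite strand would additionally require the reverse complement.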
19
Manual Genefinding
Start codon: ATG
p(ATG) = p(A) x p(T) x p(G) ≈ ¼ x ¼ x ¼ = 1/64 (in 950 bp: 14.8 ATG expected, observed 16)
[Sequence U70368 (950 bp), identical to slide 17]
20
Manual Genefinding
Start codon: ATG
p(ATG) = p(A) x p(T) x p(G) ≈ ¼ x ¼ x ¼ = 1/64 (in 950 bp: 14.8 ATG expected, observed 16 17)
[Sequence U70368 (950 bp), identical to slide 17]
21
Manual Genefinding
Mark codons until first in-frame Stop codon
Start codon: ATG
Stop codons: TAA, TAG, TGA
[Sequence U70368 (950 bp), identical to slide 17]
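The manual step "mark codons until first in-frame Stop codon" can be sketched as a simple scan. The code below is an illustration under simple assumptions (forward strand only, no introns) and is not how the gene finders discussed later work. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_from_start(seq: str, start: int) -> Optional[Tuple[int, int]]:
    """Read codons from the ATG at `start` until the first in-frame stop.

    Returns a 0-based, half-open (start, end) span that includes the stop
    codon, or None if the sequence ends before a stop codon is reached.
    """
    seq = seq.upper()
    if seq[start:start + 3] != "ATG":
        raise ValueError("no ATG at position %d" % start)
    # Step through the sequence three bases at a time, staying in frame.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


def all_orfs(seq: str) -> List[Tuple[int, int]]:
    """Spans for every ATG on the forward strand that reaches a stop codon."""
    seq = seq.upper()
    spans = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] == "ATG":
            span = orf_from_start(seq, i)
            if span is not None:
                spans.append(span)
    return spans


print(all_orfs("CCATGAAATAGGG"))  # [(2, 11)]
```

In a eukaryotic gene the first in-frame stop after an ATG may lie inside an intron, which is why the later slides go on to look for splice sites.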
23
Genes and Signals
24
(No Transcript)
25
(No Transcript)
26
Manual Genefinding
Find and mark potential donor splice sites in first exon
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: GTAGAG
Acceptor splice site: CTAG
[Sequence U70368 (950 bp), identical to slide 17]
First exon
(Coding exons!)
Second exon
28
Manual Genefinding
Alternative splice forms(?)
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: GTAGAG
Acceptor splice site: CTAG
[Sequence U70368 (950 bp), identical to slide 17]
First exon
(Coding exons!)
Not in frame
Second exon
29
Manual Genefinding
Alternative splice forms(?)
Start codon: ATG
Stop codons: TAA, TAG, TGA
Donor splice site: GTAGAG
Acceptor splice site: CTAG
[Sequence U70368 (950 bp), identical to slide 17]
First exon
(Coding exons!)
Second exon
30
How to Approach a Novel Genome
  • First hunt for similar genes
  • Align all known genes and ESTs from all other
    organisms against genome sequence
  • Some exons more conserved than others
  • Will not result in complete gene structures
  • Will indicate regions potentially encoding genes
  • Some genes will have no homology to any known
    genes
  • Second hunt includes
  • ab initio gene prediction
  • Predict full gene structure from genomic DNA

31
Gene Prediction
  • Eukaryotic Gene Prediction
  • Prediction relies on integration of several gene
    features
  • Each gene feature carries a low signal
  • E.g. ATG, splice sites, etc.
  • Combinatorial explosion
  • Some are mutually exclusive (e.g. reading frame)
  • Sensor-based HMMs are well suited for gene prediction

32
Sensor-based methods
  • Ab initio Gene Finders
  • HMM-based
  • GenScan
  • HMMgene
  • Neural network-based
  • GRAIL
  • NetGene2 (splice sites)

33
Gene Features
  • Codon frequency/bias
  • Organism dependent
  • Hexamer statistics
  • Transcriptional
  • Promoters/enhancers
  • Exon/introns
  • Length distributions
  • ORFs
  • Splicing
  • Donor/acceptor sites
  • Branchpoints
  • Translational
  • Start codon context

34
Codon Bias
  • tRNA availability
  • Expression level
  • Gene Finders are often organism specific
  • Coding regions are often modelled by a 5th-order
    Markov chain (hexamers/di-codons)
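To make the 5th-order Markov chain idea concrete, here is a minimal sketch: each base is predicted from the five bases before it, so the model is built from hexamer counts. The pseudocount, the uniform background of 0.25 per base and the function names are assumptions made for illustration. Real gene finders typically also use frame-specific models, which this sketch omits.

```python
import math
from collections import defaultdict

ORDER = 5  # 5th-order chain: context of 5 bases, i.e. hexamer statistics


def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, order=ORDER):
    """Sum of log2(P_coding(base | context) / 0.25) over the sequence.

    The uniform background (0.25 per base) is an assumption of this sketch.
    """
    seq = seq.upper()
    return sum(
        math.log2(coding_prob(seq[i - order:i], seq[i]) / 0.25)
        for i in range(order, len(seq))
    )


# Toy training set (made up): a repeated hexamer stands in for coding sequence.
coding_model = train_markov(["ATGGCC" * 10])
print(log_odds("ATGGCCATGGCC", coding_model) > 0)  # True
```

A positive score means the sequence looks more like the training set than like the background. Because codon usage differs between organisms, such a model has to be trained per organism, which matches the point above that gene finders are often organism specific.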

35
Needles Hiding in Genome Haystacks...
  • Intron-exon structure of genes
  • Large introns (average 3365 bp)
  • Small exons (average 145 bp)
  • Long genes (average 27 kb)

36
  • Human genes
  • Short exons
  • Long introns

37
Intron lengths
  • Human genes
  • Intron lengths have a broad distribution
  • Min. length ca. 60 bp

38
Intron Prevalence
39
Gene Prediction
  • Isolated methods
  • Predict individual features
  • E.g. splice sites, coding regions
  • NetGene (Neural network)
  • http://www.cbs.dtu.dk/services/NetGene2/
  • Integrated methods
  • Predict genes in context
  • Grammar of genes
  • Certain elements in specific order are required
  • HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
  • GenScan (HMM-based): http://genes.mit.edu/GENSCAN.html

40
Gene Grammar
Isolated features
  • HAPPYEUGENEAWASGUYFINDER

41
Gene Grammar
Isolated features
  • HAPPYEUGENEAWASGUYFINDER

Intron 3'UTR Exon Promoter Exon RBS
42
Gene Grammar
Integrated features
HAPPYEUGENEAWASGUYFINDER
  • EUGENEFINDERWASAHAPPYGUY

43
Gene Grammar
Integrated features
  • EUGENEFINDERWASAHAPPYGUY

Prom → RBS → Exon → Intron → Exon → 3'UTR
44
Gene Grammar
  • Isolated methods (e.g. NN)
  • HAPPYEUGENEAWASGUYFINDER
  • Integrated methods (e.g. HMM)
  • EUGENEFINDERWASAHAPPYGUY
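The gene grammar idea, that the same features only make sense in a certain order, can be illustrated with a small check. This sketch encodes the order promoter, RBS, exon, intron, exon, 3'UTR as a regular expression that allows any number of additional intron/exon pairs. The one-letter codes and the function name are hypothetical and chosen only for this example.

```python
import re

# Hypothetical one-letter codes for the features named on the slides.
FEATURE_CODES = {
    "Promoter": "P",
    "RBS": "R",
    "Exon": "E",
    "Intron": "I",
    "3UTR": "T",
}

# Promoter, RBS, first exon, zero or more (intron, exon) pairs, 3' UTR.
GENE_GRAMMAR = re.compile(r"PRE(IE)*T")


def follows_grammar(features):
    """True if the features appear in an order the toy grammar accepts."""
    encoded = "".join(FEATURE_CODES[f] for f in features)
    return GENE_GRAMMAR.fullmatch(encoded) is not None


# Same features, two different orders.
isolated = ["Intron", "3UTR", "Exon", "Promoter", "Exon", "RBS"]
integrated = ["Promoter", "RBS", "Exon", "Intron", "Exon", "3UTR"]
print(follows_grammar(isolated))    # False
print(follows_grammar(integrated))  # True
```

An isolated method may detect every feature correctly and still produce the first ordering. An integrated method only reports combinations that the grammar allows.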

45
HMMs for genefinding
  • GenScan principle
  • E = exon
  • I = intron
  • F = 5' UTR
  • T = 3' UTR
  • P = promoter
  • N = intergenic
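How an HMM assigns a state to every base can be shown with a toy Viterbi decoder. This is not the GenScan model. It is a sketch with only two of the states listed above (E and I) and made-up probabilities in which exons are GC-rich and introns AT-rich. All parameter values are assumptions chosen for illustration.

```python
import math

STATES = ("E", "I")  # exon, intron (the other states are omitted here)
START_P = {"E": 0.5, "I": 0.5}
TRANS_P = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT_P = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # toy GC-rich exon
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # toy AT-rich intron
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs` under a discrete HMM (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best previous state for reaching state s at this position.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][symbol]))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


print("".join(viterbi("GCGCGCATATAT", STATES, START_P, TRANS_P, EMIT_P)))
# EEEEEEIIIIII
```

The decoder labels the whole sequence at once, so a weak signal at one position can be overruled by its context. This is the sense in which HMM-based gene finders predict genes in context rather than feature by feature.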

46
Gene Prediction Programs
  • Integrated methods
  • HMMgene
  • http://www.cbs.dtu.dk/services/HMMgene/
  • GenScan (HMM-based)
  • http://genes.mit.edu/GENSCAN.html
  • Isolated methods
  • NetGene (Neural network)
  • http://www.cbs.dtu.dk/services/NetGene2/

47
Genscan: http://genes.mit.edu/GENSCAN.html
48
Genscan
49
Genscan: http://genes.mit.edu/GENSCAN.html
50
Genscan
51
Genscan
52
HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
53
Defining the term exon
  • Gene prediction programs often use:
  • Exon = CDS (coding sequence)
  • Real exons may contain 5' or 3' UTRs
    (untranslated regions)

54
Gene Prediction NetGene 2
55
Gene Prediction NetGene 2
56
Gene Prediction NetGene 2
57
Gene Prediction NetGene 2
58
NIX: Visualizing Gene Predictions
http://www.hgmp.mrc.ac.uk/NIX/
NO method is always best!
59
Future Challenges
  • Bootstrapping: prediction improves as more genes
    become known
  • Extreme genes (long/short) still difficult
  • Initial and terminal exons are predicted with
    lower confidence
  • Combine with Sequence Similarity Matches
  • Non-coding RNAs
  • Most gene prediction programs only predict
    protein-coding genes
  • tRNA and rRNA genes are not predicted
  • Predict alternative splicing, enhancers and
    silencers
  • Predict matrix- and scaffold-attachment regions,
    insulators and boundary elements

60
Take home messages
  • Genes may be predicted by computer programs
  • Masking of repetitive sequences may be required
    for large genomic sequences
  • Unusual genes are difficult (high GC, short or
    terminal exons)
  • HMM-based gene prediction programs are suitable
    for Gene Grammar
  • No single method is always best
  • Prediction methods are not perfect!

61
The End
62
(No Transcript)
63
Gene Prediction Exercise
Sequence       GenBank      Genscan             HMMgene             NetGene2
Seq1 (HoxA10)  320..1226    320  1226  0.871    320  1226  0.744    Donor 1227  0.95H
               2401..2675   2401 2675  0.988    2401 2675  0.971    Acc.  2400  1.00H
Seq2 (Dub-2)   398..425     -                   398  425   0.418    Donor 426   0.87
               1208..2817   1208 2817  0.800    1208 2817  0.735    Acc.  1207  0.42
                                                                    Acc.  1210  0.71
http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
64
HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
  • Columns
  • Sequence identifier
  • Program name
  • Prediction (see table below for the meaning)
  • Beginning
  • End
  • Score between 0 and 1
  • Strand: + for direct and - for complementary
  • Frame (for exons it is the position of the donor
    in the frame)
  • Group to which the prediction belongs. If several
    CDS's are found they will be called cds_1, cds_2,
    etc. 'bestparse' is there because alternative
    predictions will also be available (see below).

Name      Meaning
firstex   The coding part of the first coding exon, starting with the first base of the start codon.
exon_N    The N'th predicted internal coding exon.
lastex    The coding part of the last coding exon, ending with the last base of the stop codon.
singleex  The coding part of an exon in a gene with only one coding exon.
CDS       Coding region composed of the exon predictions prior to this line.
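The nine columns listed above can be read into a small record for further processing. This is a sketch that assumes the fields are whitespace-separated and appear in the order given on the slide. The example line is constructed for illustration from the Seq1 row of the exercise table. Its frame and group values are placeholders, and the exact layout of real HMMgene output may differ.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    seq_id: str    # sequence identifier
    program: str   # program name
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float   # between 0 and 1
    strand: str    # "+" for direct, "-" for complementary
    frame: str
    group: str     # e.g. cds_1


def parse_prediction(line: str) -> Prediction:
    """Parse one whitespace-separated prediction line (assumed layout)."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError("expected at least 9 fields, got %d" % len(fields))
    return Prediction(
        seq_id=fields[0],
        program=fields[1],
        feature=fields[2],
        begin=int(fields[3]),
        end=int(fields[4]),
        score=float(fields[5]),
        strand=fields[6],
        frame=fields[7],
        group=" ".join(fields[8:]),
    )


# Hypothetical line: coordinates and score from the exercise table,
# frame ("0") and group ("cds_1") are placeholders.
example = "Seq1 HMMgene firstex 320 1226 0.744 + 0 cds_1"
pred = parse_prediction(example)
print(pred.feature, pred.end - pred.begin + 1)  # firstex 907
```

With predictions in this form it becomes straightforward to compare exon boundaries from different programs, as in the exercise table.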