Provided by: ritas7
Learn more at: https://archive.vcu.edu
Title: Genome Annotation


1
Genome Annotation
  • BBSI
  • July 14, 2005
  • Rita Shiang

2
Genome Annotation
  • Identification of important components in genomic
    DNA

3
What is a Gene?
  • Fundamental unit of heredity
  • DNA involved in producing a polypeptide; it
    includes regions preceding and following the
    coding region (leader and trailer) as well as
    intervening sequences (introns)
  • Entire DNA sequence including exons, introns, and
    noncoding transcription-control regions

4
What Components are Important in Protein Coding
Genes?
  • Sequences that initiate transcription
  • Sequences that process hnRNA to mRNA
  • Signals important in translation

5
TATA Box
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.30.
6
Other Promoters
  • Initiator consensus
  • 5′ Py Py A(+1) N T/A Py Py Py
  • N = A, T, G or C
  • Py = pyrimidine (C or T)
  • GC rich sequences
  • Stretch of 20-50 GC nucleotides 100 bp upstream
    of start site (CpG not common in genome)
  • Housekeeping genes
  • Multiple initiation sites
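
The initiator consensus above maps directly onto a regular expression. The following is a minimal sketch, not a promoter predictor: the function name and the example sequence are made up for illustration, and a match on so short a motif is only a candidate site.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py,
# where Py = C or T and N = any base. The lookahead lets overlapping
# candidates be reported as well.
INITIATOR = re.compile(r"(?=([CT][CT]A[ACGT][AT][CT][CT][CT]))")

def find_initiators(dna):
    """Return (index_of_A_plus_1, matched_sequence) for each consensus match.

    Indexes are 0-based positions in the input string.
    """
    dna = dna.upper()
    return [(m.start() + 2, m.group(1)) for m in INITIATOR.finditer(dna)]

# Hypothetical example sequence
print(find_initiators("GGGCTCAGTCCTGGG"))  # [(6, 'TCAGTCCT')]
```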

7
Polyadenylation and Cleavage
  • Addition of a string of As to mRNAs
  • Polyadenylation signal AAUAAA found before
    cleavage site
  • GU or UU rich region 50 bp from the cleavage
    site
  • Stabilizes mRNA transcripts

Lodish et al., Molecular Cell Biology, 2000, Fig. 11.23.
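
Locating the AAUAAA polyadenylation signal is a plain substring search. A minimal sketch follows; the function name and test sequence are illustrative assumptions, and the input may be written as RNA (U) or DNA (T).

```python
def find_polya_signals(seq):
    """Return the 0-based start position of every AAUAAA hexamer in seq."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA spelling
    signal = "AAUAAA"
    positions = []
    start = rna.find(signal)
    while start != -1:
        positions.append(start)
        start = rna.find(signal, start + 1)
    return positions

# Hypothetical 3' end of a transcript
print(find_polya_signals("GGCAATAAAGGGUUGUGU"))  # [3]
```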
8
Splicing
Electron micrograph of adenovirus DNA and hexon
gene mRNA
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.13.
9
Splice Reaction
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.15.
10
Splice Sites
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.14.
11
Additional Splice Sites
Consensus: Py7NCAG-G(exon)AG-GUAAGU, 98.12%
Nonconsensus GC; U12 introns AC, PuUAUCCUPy: 0.76%
Other rare sequences: 1%
Py = C or U; Pu = A or G
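
One way to apply the splice-site classes above is to look at the first and last two bases of an intron. This is a rough sketch only: the function and table names are hypothetical, and pairing AC with a 5′ AT for U12-type introns is an assumption added here, since the slide lists only "AC".

```python
# Terminal dinucleotides (DNA alphabet) mapped to the classes on the slide.
# The AT-AC pairing is an assumption; not every U12-type intron has these
# ends, so the label says "candidate".
INTRON_CLASSES = {
    ("GT", "AG"): "consensus GT-AG",
    ("GC", "AG"): "nonconsensus GC-AG",
    ("AT", "AC"): "AT-AC (U12-type candidate)",
}

def classify_intron(intron):
    """Classify an intron by its first two and last two bases."""
    seq = intron.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have distinct splice-site ends")
    return INTRON_CLASSES.get((seq[:2], seq[-2:]), "other rare sequence")

# Hypothetical intron written as RNA
print(classify_intron("GUAAGUCCCUUUUCCAG"))  # consensus GT-AG
```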
12
Translation Signals
  • 5′ Cap structure directs ribosomal binding
  • AUG codes for methionine. Translation usually
    starts at the first AUG in a transcript
  • Open reading frame (ORF)
  • Stretch of sequence that codes for amino acids
    before a stop codon
  • Translation stop codons UAG, UAA, UGA
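
The translation signals above suggest a simple scan: find the first AUG, then read codons in frame until a stop codon. A minimal sketch, assuming a single mature mRNA given as a string; the function name and return shape are illustrative choices.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def first_orf(mrna):
    """Return (start, end, codons) for the ORF beginning at the first AUG.

    start is the 0-based index of the A in AUG, end is the index just past
    the stop codon, and codons lists the coding triplets (stop excluded).
    Returns None if there is no AUG or no in-frame stop codon.
    """
    rna = mrna.upper().replace("T", "U")  # accept DNA or RNA spelling
    start = rna.find("AUG")
    if start == -1:
        return None
    codons = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            return start, i + 3, codons
        codons.append(codon)
    return None

# Hypothetical transcript fragment
print(first_orf("GGAUGGCCUAAGG"))  # (2, 11, ['AUG', 'GCC'])
```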

13
Capping of the 5′ End of RNA with 7-methylguanylate (m7G)
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.8.
14
Known Gene Components
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.34.
15
Genome Annotation
  • What is in a genome besides protein coding genes?

16
Repetitive DNA makes up at least 50% of the genome
  • Transposon-derived interspersed repeats
  • Inactive retroposed copies of genes (pseudogenes)
  • Simple short repeats
  • Segmental Duplications
  • Blocks of tandemly repeated sequences
  • Centromeres
  • Telomeres
  • Short arm of acrocentric chromosomes
  • Ribosomal gene clusters

17
Non-protein coding genes or non-coding RNA (ncRNA)
  • tRNA genes
  • rRNA genes
  • snRNA genes
  • Splicing
  • Telomere maintenance
  • snoRNA genes
  • Other
  • microRNA

18
Annotation of Genomic DNA
  • Identifying Protein Coding Genes
  • Placing the genes on the genome (where are they?)

19
How Many Genes in the Genome?
  • Early on based on reassociation kinetics the
    estimate was 40,000
  • Walter Gilbert estimated 100,000 based on gene
    and genome size
  • 70,000-80,000 based on an extrapolated number
    of CpG islands
  • With the Human sequence the estimate is
    30,000-40,000
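
An estimate "based on gene and genome size" amounts to a single division. The sketch below assumes round figures of 3 billion bp for the genome and 30,000 bp per gene; both numbers are illustrative assumptions, not values stated on the slide.

```python
genome_size_bp = 3_000_000_000  # assumed round figure for the human genome
avg_gene_size_bp = 30_000       # assumed average gene size (illustrative)

estimated_genes = genome_size_bp // avg_gene_size_bp
print(estimated_genes)  # 100000
```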

20
Annotation of Genomic DNA Specifically for Genes
that Code for Proteins
  • Match genomic DNA to genes that have been
    previously cloned and sequenced looking for
    sequence similarity using BLAST programs
  • Predict genes using computer programs to scan
    genomic DNA using known elements
  • Many strategies use a combination of both methods

21
cDNA Library Construction
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.14
22
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15
23
Gene Annotation - Celera
  • Constructed gene models using sequence from cDNAs
  • Used Unigene database
  • Partitions GenBank sequences (mRNAs and ESTs) into
    a non-redundant set using 3′ UTRs
  • 111,064 Unigene clusters for human

24
Gene Annotation - Celera cont.
  • Predicts gene boundaries by identifying
    overlapping sets of EST and protein matches
  • Known full-length genes were annotated on the map
    (matched with 50% of the length, >92% identity)
  • Clusters that did not match a full-length gene
    were evaluated using other references
  • Conservation of genomic sequence between mouse
    and human
  • Similarity between human and rodent transcripts
  • Similarity to known proteins

25
Validation
  • Validated by construction of known genes (RefSeq)
  • 6.1% of RefSeq genes were not annotated by Otto
    (Celera's automated annotation system)

26
Gene Annotation - Human Genome Sequencing
Consortium
  • Start with Ensembl predicted genes
  • ab initio predictions using Genscan
  • Based on probabilistic model of genome sequence
    composition and gene structure
  • Confirm similarity to mRNAs, ESTs, protein motifs
    from all organisms
  • Extend protein matches using GeneWise
  • Compares protein based information to genomic
    sequence and allows for frameshifts and large
    introns
  • Produces partial gene predictions

27
Consortium cont.
  • Merge Ensembl gene predictions with Genie
    predictions
  • Genie identifies matches of mRNAs and ESTs
  • Employs hidden Markov models (HMMs) to extend
    matches using ab initio statistical methods
  • Links information from 5′ and 3′ ESTs from the
    same cDNA clone to complete a sequence from the
    ATG to the stop codon
  • Can generate alternatively spliced products
    (though only longest used in this build)
  • Merge results with genes in RefSeq, SWISSPROT and
    TrEMBL databases

28
Validation
  • Validate method by comparing to a new set of
    known genes, a set of mouse cDNAs and genes on
    Chromosome 22 (Finished Sequence)
  • 85% sensitivity
  • 13% spurious predictions
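
The two validation figures can be read as set comparisons between predicted genes and a trusted reference set. The sketch below uses common textbook definitions, which may differ from how the consortium actually computed its numbers; the gene identifiers are made up.

```python
def sensitivity(known, predicted):
    """Fraction of known genes that the predictions recover."""
    return len(known & predicted) / len(known)

def spurious_rate(known, predicted):
    """Fraction of predictions that match no known gene."""
    return len(predicted - known) / len(predicted)

# Hypothetical gene identifiers on a finished region
known = {"g1", "g2", "g3", "g4"}
predicted = {"g1", "g2", "g3", "x1"}

print(sensitivity(known, predicted))    # 0.75
print(spurious_rate(known, predicted))  # 0.25
```

The spurious rate is only meaningful on finished sequence, where a prediction with no match is more likely to be wrong than simply unannotated.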

29
Factors Affecting Gene Annotation
  • Splice sites do not conform to consensus
  • Noncoding exons are common
  • Exon: what is left over after introns are removed
    by splicing; the term does not refer to a
    stretch of coding information
  • tRNAs are spliced but noncoding
  • >35% of human genes have noncoding exons
  • No statistical bias so they are difficult to
    identify

30
Factors Affecting Gene Annotation Cont.
  • Internal exons can be very small
  • Avg. size of internal exons is 130 bp
  • 65% of vertebrate exons are 68-208 bp
  • >10% are <60 bp
  • Exons <10 bp have been identified
  • Invected gene in Drosophila
  • One of four exons is 6 bp (GTCGAA)
  • Flanked by introns of 27.6 and 1.1 kb
  • Not correctly recognized by cDNA alignment
    software and creates a frameshift in the gene
  • Exons of size 0
  • Resizing exons create an intermediate splice
    product

31
Places to View Annotated Genomes
  • National Center for Biotechnology Information
    (NCBI)
  • Ensembl
  • The Golden Path (UCSC Genome Browser)
  • Celera

32
Verification of Annotation in C. elegans by
Experimentation
  • Complete genomic sequence
  • Small introns
  • Small intergenic regions

33
(No Transcript)
34
Results
  • 11,984 cDNAs successfully cloned out of a
    prediction of 19,477
  • 4,365 were not represented by cDNAs or ESTs
  • Failure of cloning could be due to
  • Wrongly predicted exons
  • Very low expressing genes
  • Not a real gene
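
The counts above are easier to compare as fractions of the predicted gene set. A short calculation using only the numbers on the slide (variable names are illustrative):

```python
predicted_genes = 19_477
cloned_cdnas = 11_984
no_prior_evidence = 4_365  # predictions not represented by cDNAs or ESTs

cloned_pct = round(100 * cloned_cdnas / predicted_genes, 1)
no_evidence_pct = round(100 * no_prior_evidence / predicted_genes, 1)

print(cloned_pct)       # 61.5
print(no_evidence_pct)  # 22.4
```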

35
Verification of intron/exon structures
36
Comparison of a Single Transcript
37
Greater than 50% of intron/exon structures need
correcting?