Title: Genome Annotation
1Genome Annotation
- BBSI
- July 14, 2005
- Rita Shiang
2Genome Annotation
- Identification of important components in genomic
DNA
3What is a Gene?
- Fundamental unit of heredity
- DNA involved in producing a polypeptide; it
includes regions preceding and following the
coding region (leader and trailer) as well as
intervening sequences (introns)
- Entire DNA sequence including exons, introns, and
noncoding transcription-control regions
4What Components are Important in Protein Coding
Genes?
- Sequences that initiate transcription
- Sequences that process hnRNA to mRNA
- Signals important in translation
5TATA Box
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.30.
6Other Promoters
- Initiator consensus
- 5' Py Py A(+1) N T/A Py Py Py
- N = A, T, G, or C
- Py = pyrimidine (C or T)
- GC-rich sequences
- Stretch of 20-50 GC nucleotides 100 bp upstream
of start site (CpG not common in genome)
- Housekeeping genes
- Multiple initiation sites
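The initiator consensus above maps directly onto a regular expression. The following is a minimal sketch, not a promoter predictor: the function name and the example sequence are hypothetical, and a match to this short, degenerate pattern is only a candidate site.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py
# Py = C or T, N = any base; the A is the transcription start site (+1).
# A lookahead is used so that overlapping candidates are all reported.
INITIATOR = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")

def find_initiators(dna):
    """Return 0-based positions of the +1 A for every consensus match."""
    dna = dna.upper()
    # The A sits two bases into the 8-base pattern.
    return [m.start() + 2 for m in INITIATOR.finditer(dna)]

example = "GGGCTCAGTCTTCGGG"  # hypothetical sequence, for illustration only
print(find_initiators(example))  # [6]
```

Because the pattern is so degenerate, matches occur frequently by chance in real genomic DNA, which is one reason promoter signals alone are weak evidence for a gene.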
7Polyadenylation and Cleavage
- Addition of a string of As to mRNAs
- Polyadenylation signal AAUAAA found before
cleavage site
- GU- or UU-rich region 50 bp from the cleavage
site
- Stabilizes mRNA transcripts
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.23.
8Splicing
Electron micrograph of adenovirus DNA and hexon
gene mRNA
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.13.
9Splice Reaction
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.15.
10Splice Sites
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.14.
11Additional Splice Sites
- Consensus: Py7 N C AG-G (exon) AG-GUAAGU, 98.12%
- Nonconsensus: GC; U12 introns: AC, PuUAUCCUPy, 0.76%
- Other rare sequences: 1%
- Py = C or U; Pu = A or G
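As an illustration of how these splice-site classes might be told apart in software, here is a minimal sketch that classifies an intron by its terminal bases. It assumes the intron sequence is given from its first to its last base, treats a 5' end matching PuUAUCCUPy as U12-type, and uses hypothetical function and label names. Real annotation tools score much longer sequence contexts.

```python
import re

# 5' splice-site motif listed for U12 introns on the slide: PuUAUCCUPy
U12_DONOR = re.compile(r"[AG]UAUCCU[CU]")

def classify_intron(intron):
    """Classify an intron (DNA or RNA letters) by its terminal bases."""
    seq = intron.upper().replace("T", "U")
    if U12_DONOR.match(seq):           # motif anchored at the intron start
        return "U12-type"
    ends = (seq[:2], seq[-2:])         # first and last dinucleotides
    if ends == ("GU", "AG"):
        return "consensus GU-AG"
    if ends == ("GC", "AG"):
        return "nonconsensus GC-AG"
    return "other"

# Hypothetical intron sequences, for illustration only
print(classify_intron("GTAAGTACGTTTTCTTTTCCAG"))  # consensus GU-AG
print(classify_intron("ATATCCTTTTAACCCAC"))       # U12-type
```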
12Translation Signals
- 5' cap structure directs ribosomal binding
- AUG codes for methionine. The first AUG in a
transcript is where translation starts
- Open reading frame (ORF)
- Stretch of sequence that codes for amino acids
before a stop codon
- Translation stop codons: UAG, UAA, UGA
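The translation signals above suggest a simple way to extract an ORF from a transcript: start at the first AUG and read codons until an in-frame stop. The sketch below follows that rule literally; the function name and example sequence are hypothetical, and it ignores cases where translation does not begin at the first AUG.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def first_orf(mrna):
    """Return the ORF from the first AUG through the first in-frame
    stop codon, or None if there is no AUG or no in-frame stop."""
    rna = mrna.upper().replace("T", "U")   # accept DNA or RNA letters
    start = rna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time, in frame with the AUG.
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None

print(first_orf("GGAUGGCCUAAGG"))  # AUGGCCUAA
```

The same scan applied to unspliced genomic DNA would usually stop too early, because introns interrupt the reading frame; that is why ORF finding alone is not enough for eukaryotic gene prediction.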
13Capping of 5' RNA with 7-methylguanylate (m7G)
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.8.
14Known Gene Components
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.34.
15Genome Annotation
- What is in a genome besides protein coding genes?
16Repetitive DNA makes up at least 50% of the genome
- Transposon-derived interspersed repeats
- Inactive retroposed copies of genes (pseudogenes)
- Simple short repeats
- Segmental Duplications
- Blocks of tandemly repeated sequences
- Centromeres
- Telomeres
- Short arm of acrocentric chromosomes
- Ribosomal gene clusters
17Non-protein coding genes or non-coding RNA (ncRNA)
- tRNA genes
- rRNA genes
- snRNA genes
- Splicing
- Telomere maintenance
- snoRNA genes
- Other
- microRNA
18Annotation of Genomic DNA
- Identifying Protein Coding Genes
- Placing the genes on the genome (where are they?)
19How Many Genes in the Genome?
- Early on, based on reassociation kinetics, the
estimate was 40,000
- Walter Gilbert estimated 100,000 based on gene
and genome size
- 70,000-80,000 based on an extrapolated number
of CpG islands
- With the human sequence the estimate is
30,000-40,000
20Annotation of Genomic DNA Specifically for Genes
that Code for Proteins
- Match genomic DNA to genes that have been
previously cloned and sequenced, looking for
sequence similarity using BLAST programs
- Predict genes using computer programs to scan
genomic DNA using known elements
- Many strategies use a combination of both methods
21cDNA Library Construction
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.14.
22Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15.
23Gene Annotation: Celera
- Constructed gene models using sequence from cDNAs
- Used UniGene database
- Partitions GenBank sequences (mRNAs and ESTs) into
a non-redundant set using 3' UTRs
- 111,064 UniGene clusters for human
24Gene Annotation: Celera cont.
- Predicts gene boundaries by identifying
overlapping sets of EST and protein matches
- Known full-length genes were annotated on the map
(matched with 50% of the length at >92% identity)
- Clusters that did not match a full-length gene
were evaluated using other references
- Conservation of genomic sequence between mouse
and human
- Similarity between human and rodent transcripts
- Similarity to known proteins
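The match criterion for known full-length genes (50% of the length, more than 92% identity) can be expressed as a simple filter. The sketch below is illustrative only: the Match record, its field names, and the choice to treat 50% as a minimum are assumptions, not Celera's actual data model or code.

```python
from dataclasses import dataclass

@dataclass
class Match:
    """Hypothetical record for one gene-to-genome alignment."""
    gene_id: str          # identifier of the known full-length gene
    gene_length: int      # length of the known gene (bp)
    aligned_length: int   # bases of the gene aligned to the genome
    identity: float       # fraction of identical bases in the alignment

def is_known_gene_match(m, min_coverage=0.50, min_identity=0.92):
    """True if the alignment covers at least min_coverage of the gene
    (assumed to be a minimum) with identity above min_identity."""
    coverage = m.aligned_length / m.gene_length
    return coverage >= min_coverage and m.identity > min_identity

# Hypothetical alignments, for illustration only
matches = [
    Match("g1", 2000, 1200, 0.95),  # 60% coverage, 95% identity
    Match("g2", 2000, 800, 0.99),   # 40% coverage: too short
    Match("g3", 2000, 1500, 0.92),  # identity not above 92%
]
print([m.gene_id for m in matches if is_known_gene_match(m)])  # ['g1']
```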
25Validation
- Validated by construction of known genes (RefSeq)
- 6.1% of RefSeq genes were not annotated by Otto
26Gene Annotation - Human Genome Sequencing
Consortium
- Start with Ensembl-predicted genes
- ab initio predictions using Genscan
- Based on probabilistic model of genome sequence
composition and gene structure
- Confirm similarity to mRNAs, ESTs, protein motifs
from all organisms
- Extend protein matches using GeneWise
- Compares protein-based information to genomic
sequence and allows for frameshifts and large
introns
- Produces partial gene predictions
27Consortium cont.
- Merge Ensembl gene predictions with Genie
predictions
- Genie identifies matches of mRNAs and ESTs
- Employs hidden Markov models (HMMs) to extend
matches using ab initio statistical methods
- Links information from 5' and 3' ESTs from the
same cDNA clone to complete a sequence from the
ATG to the stop codon
- Can generate alternatively spliced products
(though only longest used in this build)
- Merge results with genes in RefSeq, SWISSPROT and
TrEMBL databases
28Validation
- Validate method by comparing to a new set of
known genes, a set of mouse cDNAs and genes on
Chromosome 22 (Finished Sequence)
- 85% sensitivity
- 13% spurious predictions
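To make the two validation figures concrete, the sketch below computes sensitivity and a spurious-prediction rate from two sets of gene identifiers. The definitions used here are assumptions chosen for illustration (predictions are matched to known genes by identifier, and every prediction without a known counterpart counts as spurious); the consortium's exact procedure is not given on the slide.

```python
def evaluate_predictions(known_genes, predicted_genes):
    """Return (sensitivity, spurious_rate) for two collections of gene IDs.

    sensitivity   = known genes that were predicted / all known genes
    spurious_rate = predictions with no known gene / all predictions
    """
    known, predicted = set(known_genes), set(predicted_genes)
    sensitivity = len(known & predicted) / len(known)
    spurious_rate = len(predicted - known) / len(predicted)
    return sensitivity, spurious_rate

# Hypothetical identifiers, for illustration only
known = {"geneA", "geneB", "geneC", "geneD"}
predicted = {"geneA", "geneB", "geneC", "geneX"}
sens, spur = evaluate_predictions(known, predicted)
print(f"sensitivity = {sens:.0%}, spurious = {spur:.0%}")
# sensitivity = 75%, spurious = 25%
```

The spurious rate is only meaningful on a well-characterized region, such as finished sequence, where an unmatched prediction is unlikely to be a real but undiscovered gene.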
29Factors Affecting Gene Annotation
- Splice sites do not conform to consensus
- Noncoding exons are common
- Exon: what is left after introns are removed by
splicing; the term does not refer to a
stretch of coding information
- tRNAs are spliced but noncoding
- >35% of human genes have noncoding exons
- No statistical bias so they are difficult to
identify
30Factors Affecting Gene Annotation Cont.
- Internal exons can be very small
- Avg. size of internal exons is 130 bp
- 65% of vertebrate exons are 68-208 bp
- >10% are <60 bp
- Exons <10 bp have been identified
- Invected gene in Drosophila
- One of four exons is 6 bp (GTCGAA)
- Flanked by introns of 27.6 and 1.1 kb
- Not correctly recognized by cDNA alignment
software and creates a frameshift in the gene
- Exons of size 0
- Resizing exons create an intermediate splice
product
31Places to View Annotated Genomes
- National Center for Biotechnology Information
(NCBI)
- Ensembl
- The Golden Path (UCSC Genome Browser)
- Celera
32Verification of Annotation in C. elegans by
Experimentation
- Complete genomic sequence
- Small introns
- Small intergenic regions
34Results
- 11,984 cDNAs successfully cloned out of a
prediction of 19,477
- 4,365 were not represented by cDNAs or ESTs
- Failure of cloning could be due to
- Wrongly predicted exons
- Very low expressing genes
- Not a real gene
35Verification of intron/exon structures
36Comparison of a Single Transcript
37Greater than 50% of intron/exon structures need
correcting?