Title: Has the Yoyo Stopped Reviewing the Evidence for a Low Basal Human Protein Number and the Implication
1Has the Yoyo Stopped? Reviewing the Evidence
for a Low Basal Human Protein Number and the
Implications for Proteomics and Drug Discovery
6th Swedish Annual Bioinformatics Workshop
Göteborg, November 2005 Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal
2Presentation Outline
- The importance of gene number
- Gene definition and detection
- Genome inflation arguments
- Post-completion changes in model eukaryotes
- Ensembl pipeline numbers
- Completed chromosome gene numbers
- International Protein Index
- Novel gene skimming
- Implications for Proteomics
- Implications for Drug Discovery
3So Who Cares About the Human Gene Number?
- Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing, post-translational modifications and differential expression modulation
- Mammalian gene totals are expected to be similar, but clade-specific genes may be crucial to speciation
- Accurate gene delineation is essential for interpreting genetic association studies
- Large-scale, hypothesis-neutral transcript and/or protein profiling experiments also need a complete gene set
- False negatives (missed genes) are more important than false positives
- For mass-spec and other proteomic technologies it is crucial to have at least a basal complete ORF set
- For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins
4Definitions
- The basal (unspliced) protein-coding gene number: transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation
- However, the Guidelines for Human Gene Nomenclature define a gene as "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
- The increasing complexity of the transcriptome makes the wider definition of a gene more difficult, e.g. micro and antisense RNA
5Identifying Protein Coding Genes
- In silico
  - Detection of protein identity in genomic DNA
  - Gene prediction with protein similarity support
  - Matches with ESTs that include ORFs and/or splice sites
  - Cross-species comparisons for orthologous exon detection
  - Presence of gene anatomy features, e.g. CpG islands, promoters, transcription start sites, polyadenylation signals and the absence of repeat elements
- In vitro
  - Cloning of predicted genes
  - Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
  - Loss-of-function approaches
  - High-throughput transcript sampling by EST, MPSS or SAGE tagging
  - Heterologous expression of cDNAs
  - Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
6Historical Arguments and Estimates for High Gene
Numbers
- Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
- Gene prediction programs have a significant false-negative rate
- The Ensembl gene annotation pipeline is conservative
- Mammalian protein and transcript coverage is incomplete
- Chromosome annotation teams find more genes than automated pipelines
- Selective transcript skimming experiments have revealed new genes
- Extensive mammalian genomic sequence conservation outside known exons
- Postulated large numbers of undetected small proteins (smORFs or dark matter)
- EST clustering and commercial gene inflation claims
Genesweep 2000
Literature estimates
7Model Eukaryotes: No Significant Post-Completion Gene Increases
- S. pombe: 3% increase since 2002
- S. cerevisiae: 8% decrease since 1997 (excluding 820 dubious ORFs)
- C. elegans: 5% increase since 1998
- D. melanogaster: 0.2% increase since 2001
- Little increase in spite of global functional genomics focus
8Human Transcripts: Post-genomic mRNA Growth in UniGene
- Rapid growth in redundant mRNA
- But slow growth in the clustered set: 9,000 over 2 years, with a plateau at 28,000
- This will include splice variants and some spurious ORFs
9Ensembl Human Gene Number
- Only 22,218 genes, a decrease of 1,826 over 4 years
- Knowns: up from 90% to 95%
- Novel genes: down from 12,398 to 2,263
- Exons per gene: up from 6.5 to 9.6
- Alternative splicing: up from 3,669 to 8,078
10Addressing the smORF Question: Protein Size Distributions in Human SPTr
Pre Oct-01: 6.3 > 100aa
Post Oct-01: 5.5 > 100aa
Novel in title: 3.4 > 100aa
11Summarising the smORF Issue
- smORFs are particularly difficult to detect computationally and experimentally. However:
- No database evidence for increased smORF discovery in eukaryotes or mammals
- The observation that only 1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
- Although small proteins are less conserved, i.e. evolve more rapidly, those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function
- No evidence for de novo gene invention in higher eukaryotes
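The computational side of the smORF problem can be illustrated with a minimal sketch. It assumes a naive ATG-to-stop scan of the three forward frames only, with a length cut-off expressed in codons; the function name `find_orfs` and the toy sequence are hypothetical and are not taken from any annotation pipeline mentioned in this talk.

```python
# Minimal sketch: naive forward-strand ORF scan with a length cut-off.
# Assumptions: ORF = first ATG to the next in-frame stop; no splicing,
# no reverse strand, no similarity or expression evidence.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, codon_count) tuples for the three forward frames.

    codon_count includes the ATG and excludes the stop codon;
    end is the index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                codons = (i - start) // 3
                if codons >= min_codons:
                    orfs.append((start, i + 3, codons))
                start = None
    return orfs


# Toy sequence (hypothetical): one 6-codon ORF.
toy = "ATG" + "GCT" * 5 + "TAA"
all_orfs = find_orfs(toy)                   # every ORF, however short
long_orfs = find_orfs(toy, min_codons=100)  # only ORFs of 100+ codons
```

With equal base frequencies, 3 of the 64 codons are stops, so a random reading frame runs for roughly 21 codons before hitting one. Short ORFs therefore arise by chance in large numbers, which is why a length threshold near 100 residues is a common filter and why genuine smORFs below it are hard to separate from noise without additional evidence.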
12Release History of the International Protein Index: Significant Non-overlap in Protein Annotation Sets
56537 Entries
13Experimental Transcript Skimming Used to be
Evidence for High Gene Numbers but Now Points to
Non-coding Transcription
- Exon arrays (Dunham et al. 1999)
- Gene arrays (Penn et al. 2000)
- RT-PCR (Das et al. 2001)
- SAGE-tags (Saha et al. 2002, Chen et al. 2002)
- Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
- No novel proteins were submitted to the primary databases
- It is necessary to clone a full-length ORF with the features of gene anatomy, and to submit it to the public databases, before the discovery of novel proteins can be claimed
- There is increasing evidence for significant amounts of antisense and other non-ORF transcription in human and mouse
14Gene Numbers for Individual Completed Chromosomes
- Averaging the completed chromosomes exceeds Ensembl GP31 genes by 12%
- Extrapolate to 25,000 genes without novel transcripts or putatives
- Extrapolate to 28,000 genes without putatives
- Extrapolate to 31,000 genes with putatives
- The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
- Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
- Future status of novel transcripts and putative genes unclear; most will be non-coding
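The extrapolation arithmetic on this slide can be sketched as follows. Two assumptions are made here and are not stated on the slide: that the "12" excess is a percentage, and that the baseline is the 22,218 Ensembl gene count quoted on the earlier Ensembl slide (the actual GP31 figure may differ). The function names are hypothetical.

```python
# Minimal sketch of the chromosome-based extrapolation.
# Assumptions: the excess is a percentage, and the baseline is the
# 22,218 Ensembl genes quoted earlier (the GP31 count may differ).
ENSEMBL_BASELINE = 22_218


def extrapolate(baseline_genes, excess_percent):
    """Scale a baseline gene count up by a percentage excess."""
    return baseline_genes * (1 + excess_percent / 100)


def implied_excess_percent(baseline_genes, projected_genes):
    """Percentage excess over the baseline implied by a projected total."""
    return (projected_genes / baseline_genes - 1) * 100


projected = extrapolate(ENSEMBL_BASELINE, 12)
rounded = round(projected, -3)  # nearest thousand, as quoted on the slide

# Excess implied by the two larger extrapolations on the slide.
excess_28k = implied_excess_percent(ENSEMBL_BASELINE, 28_000)
excess_31k = implied_excess_percent(ENSEMBL_BASELINE, 31_000)
```

Under these assumptions a 12% excess gives about 24,900 genes, consistent with the 25,000 figure, while the 28,000 and 31,000 totals would correspond to an excess of roughly 26% and 40% over the same baseline.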
15Disappearing Gene Novelty
- EMBL hum cds March 2003: 1491
- Plus novel: 159
- Plus PubMed 2003: 120
- Novel in title: 11
- Previous cds: 8
- Novel genes: 2
- Now both in RefSeq and Ens 18.34
16Conclusions on Number
- The model eukaryotes have shown no significant post-genomic rises in gene number
- The Ensembl gene number has been essentially flat since 2001
- The pseudogene-adjusted Ensembl basal protein total is 22,500
- 2,500 predicted genes still need experimental verification; likely to be errors in terminal exons
- Putative genes from curated complete chromosomes could raise numbers, but the status of this class of transcripts is in doubt
- Early over-estimates explicable by non-ORF transcription activity
- The massive increase in post-genomic transcript coverage is predominantly re-sampling known genes
- Database submissions of novel human genes have slowed to a trickle
- No evidence for large numbers of cryptic smORFs
- Proteomics has not revealed new proteins
- October 2004 Nature paper on finished human genome suggests 20-25,000 protein-coding genes
17The Consequences of a Low Protein-coding Gene
Number for MS-based Proteomics
- Shifts the identification risk from false negatives (missed novel proteins) towards false positives (inflated hit-lists)
- Encourages experimental progression beyond simple gene stamping towards exon corrections, detecting splice forms, PTMs, SAPs and quantitation
- 10% of proteins predicted from genomic data still have no experimentally confirmed mRNA
- These could be detected by MS-based proteomics so long as the sequences are included in the search space
- The basal (unspliced) protein number is likely to be similar for most mammals
- Despite the technical challenges, one of the goals of the Human Proteome Organisation, being able to detect at least the basal forms of all 25K expressed human proteins in vivo, now seems achievable
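The search-space point above can be made concrete with a minimal sketch. It assumes a simplified tryptic digest (cleave after K or R unless followed by P, no missed cleavages) and exact string matching of peptides, which is far cruder than real MS/MS search engines; all protein names and sequences are hypothetical toy data.

```python
# Minimal sketch: a peptide can only be assigned to a protein whose
# sequence is in the searched database.
# Assumptions: simplified trypsin rule, no missed cleavages, exact matching.
def tryptic_peptides(protein, min_length=6):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (residue in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if len(peptide) >= min_length:
                peptides.append(peptide)
            start = i + 1
    return peptides


def build_search_space(proteins):
    """Map each tryptic peptide to the set of proteins that contain it."""
    index = {}
    for name, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            index.setdefault(peptide, set()).add(name)
    return index


# Hypothetical toy sequences, not real human proteins.
confirmed_only = {"known_gene": "MKTAYIAKQRQISFVK"}
with_predictions = dict(confirmed_only, predicted_gene="MSLLPEDWRGGATHLNKAAC")

observed = "GGATHLNK"  # peptide from the unconfirmed predicted gene
hit_without = build_search_space(confirmed_only).get(observed)
hit_with = build_search_space(with_predictions).get(observed)
```

In this toy case the observed peptide is unassigned when only confirmed sequences are searched and is assigned once the predicted gene is included, which is the sense in which predicted genes lacking mRNA evidence remain detectable only if they are part of the search space.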
18Human Proteome Sampling by MS/MS Identification: Complete Absence of Unpredicted Genes?
- 3778 from plasma (Muthusamy et al. 2005)
- 2486 from liver cells (Yan et al. 2006)
- 615 from the human heart mitochondria (Taylor et al. 2003)
- 500 from breast cancer cell membranes (Adams et al. 2003)
- 491 from microsomal fractions (Han et al. 2001)
- 490 from blood serum (Adkins et al. 2003)
- 311 from the spliceosome (Rappsilber et al. 2002)
- No verifiable data on gene prediction confirmation
- Caveats on search space for novel gene detection by correlative algorithms
- One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
- Proteomics will have a key impact on characterising the proteome, but there is no evidence so far for novel protein discovery
19From Minimal to Maximal Sequence Collections
Many databases to choose from
20Implications of Low Gene Number for Drug Discovery: Finite Numbers for Drug Target Families
- IPR000276 Rhodopsin-like GPCR superfamily: 720
- IPR000719 Protein kinase: 486
- IPR001254 Serine protease, trypsin family: 85
- IPR000832 G-protein coupled receptors family 2 (secretin-like): 44
- IPR006202 Neurotransmitter-gated ion-channel ligand binding domain: 41
- IPR000242 Tyrosine specific protein phosphatase: 31
- IPR000834 Zinc carboxypeptidase A metalloprotease (M14): 31
- IPR000337 G-protein coupled receptors family 3: 23
- IPR000340 Dual specificity protein phosphatase: 17
We now have largely finite choices for the "usual suspects" small-molecule drug targets. All of these are now covered by sequence patent claims of variable IP weight.
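As a rough worked check on how finite these numbers are, the family counts can be summed and compared with a basal protein total. This is a sketch under two assumptions not made on the slide: that the counts can simply be added (a protein matching more than one InterPro entry would be counted twice), and that the 22,500 pseudogene-adjusted Ensembl total from the conclusions slide is the right denominator.

```python
# Minimal sketch: sum the InterPro family counts listed on the slide.
# Assumptions: counts are additive (multi-domain proteins may be
# double-counted) and 22,500 is used as the basal protein total.
FAMILY_COUNTS = {
    "IPR000276": 720,  # Rhodopsin-like GPCR superfamily
    "IPR000719": 486,  # Protein kinase
    "IPR001254": 85,   # Serine protease, trypsin family
    "IPR000832": 44,   # GPCR family 2 (secretin-like)
    "IPR006202": 41,   # Neurotransmitter-gated ion-channel ligand binding domain
    "IPR000242": 31,   # Tyrosine specific protein phosphatase
    "IPR000834": 31,   # Zinc carboxypeptidase A metalloprotease (M14)
    "IPR000337": 23,   # GPCR family 3
    "IPR000340": 17,   # Dual specificity protein phosphatase
}
BASAL_PROTEIN_TOTAL = 22_500

family_total = sum(FAMILY_COUNTS.values())
family_share = family_total / BASAL_PROTEIN_TOTAL

print(f"{family_total} proteins, {family_share:.1%} of the basal set")
```

On these figures the nine families account for 1,478 entries, or about 6.6% of a 22,500-protein basal set, which gives a sense of how small the conventional small-molecule target space is relative to the whole proteome.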
21A Low Gene Number Brings Target Saturation Point Closer
- The number of declared human drug target proteins under investigation world-wide is approaching 1,200
- This is getting close to the currently predicted maximum of druggable targets
- However, the emergence of public target validation evidence per se is arguably pre-competitive for drug discovery
22Low Gene Number Increases the Competitive Drive to Expand the Druggable Intervention Envelope
- Functional genomics may convert some unknown genes to new enzymes or receptors
- Threading/structural genomics may identify new members of known druggable families
- Systems biology may identify new and/or combined intervention points
- Protein-protein interactions may become more amenable to specific chemical modulation
23Acknowledgments and References
- Paul Kersey of the EBI for IPI figures
- Lucas Wagner of the NCBI for the retrospective UniGene data
- Arek Kasprzyk of the EBI for historical and preview Ensembl statistics
- Numerous other people at NCBI, EBI, and Sanger Centre who have graciously answered queries on their data collections
- The OGS Proteome Discovery Team
- AstraZeneca for current support
McGowan et al.