Has the Yoyo Stopped? Reviewing the Evidence for a Low Basal Human Protein Number and the Implication


Provided by: chriss67

1
Has the Yoyo Stopped? Reviewing the Evidence
for a Low Basal Human Protein Number and the
Implications for Proteomics and Drug Discovery
6th Swedish Annual Bioinformatics Workshop
Göteborg, November 2005
Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal
2
Presentation Outline

The importance of gene number
Gene definition and detection
Genome inflation arguments
Post-completion changes in model eukaryotes
Ensembl pipeline numbers
Completed chromosome gene numbers
International Protein Index
Novel gene skimming
Implications for Proteomics
Implications for Drug Discovery

3
So Who Cares About the Human Gene Number?

Central to evolutionary questions of gene number
expansion vs. protein diversity from alternative
splicing, post-translational modifications and
differential expression modulation
Mammalian gene totals expected to be similar but
clade-specific genes may be crucial to speciation
Accurate gene delineation essential for
interpreting genetic association studies
Large-scale, hypothesis-neutral, transcript
and/or protein profiling experiments also need a
complete gene set
False negatives (missed genes) are more important
than false positives
For mass-spec and other proteomic technologies it
is crucial to have at least a basal complete ORF
set
For Pharma and Biotech the numbers set finite
limits for potential drug targets and therapeutic
proteins

4
Definitions

The basal (unspliced) protein-coding gene number:
transcriptional units that translate to one or
more proteins that share overlapping sequence
identity and are products of the same unique
genomic locus and strand orientation
However, the Guidelines for Human Gene
Nomenclature define a gene as "a DNA segment
that contributes to phenotype/function. In the
absence of demonstrated function a gene may be
characterised by sequence, transcription or
homology"
The increasing complexity of the transcriptome
makes the wider definition of a gene more
difficult, e.g. micro and antisense RNA
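
As a rough illustration of the basal count defined above, the sketch below collapses protein-coding transcripts that share a locus and strand into a single gene. The records, field layout and locus names are hypothetical, and using locus plus strand as the grouping key is a simplifying assumption: the overlapping-sequence check in the definition is not implemented.

```python
from collections import defaultdict

# Hypothetical records: (transcript_id, locus, strand, is_protein_coding)
transcripts = [
    ("T1", "locusA", "+", True),
    ("T2", "locusA", "+", True),   # splice variant of the same gene
    ("T3", "locusA", "-", False),  # antisense, non-coding: not counted
    ("T4", "locusB", "-", True),
]

def basal_gene_count(records):
    """Group coding transcripts by (locus, strand); each group is one basal gene."""
    groups = defaultdict(list)
    for transcript_id, locus, strand, coding in records:
        if coding:
            groups[(locus, strand)].append(transcript_id)
    return len(groups), dict(groups)

count, groups = basal_gene_count(transcripts)
print(count, groups)
```

In this toy set the two splice forms at locusA count once, and the non-coding antisense transcript is excluded from the basal protein-coding total.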

5
Identifying Protein Coding Genes

In silico
Detection of protein identity in genomic DNA
Gene prediction with protein similarity support
Matches with ESTs that include ORFs and/or splice
sites
Cross-species comparisons for orthologous exon
detection
Presence of gene anatomy features e.g. CpG
islands, promoters, transcription start sites,
polyadenylation signals and the absence of repeat
elements
In vitro
Cloning of predicted genes
Detection of active transcription by Northern
blot, RT-PCR or microarray hybridisation
Loss-of-function approaches
High-throughput transcript sampling by EST, MPSS
or SAGE tagging
Heterologous expression of cDNAs
Direct verification of protein sequence by Edman
sequencing, mass-mapping and/or MS/MS sequencing

6
Historical Arguments and Estimates for High Gene
Numbers

Initial eukaryote (yeast/worm/fly) numbers
assumed to be underestimates
Gene prediction programs have a significant
false-negative rate
The Ensembl gene annotation pipeline is
conservative
Mammalian protein and transcript coverage is
incomplete
Chromosome annotation teams find more genes than
automated pipelines
Selective transcript skimming experiments have
revealed new genes
Extensive mammalian genomic sequence conservation
outside known exons
Postulated large numbers of undetected small
proteins (smORFs or dark matter)
EST clustering and commercial gene inflation
claims

Figure: Genesweep 2000 and literature estimates
7
Model Eukaryotes: No Significant Post-Completion
Gene Increases

S. pombe: 3% increase since 2002
S. cerevisiae: 8% decrease since 1997 (excluding
820 dubious ORFs)
C. elegans: 5% increase since 1998
D. melanogaster: 0.2% increase since 2001
Little increase in spite of global functional
genomics focus

8
Human Transcripts: Post-genomic mRNA Growth in
UniGene

Rapid growth in redundant mRNA
But slow growth in the clustered set: 9,000 over 2
years, with a plateau at 28,000
This will include splice variants and some
spurious ORFs

9
Ensembl Human Gene Number

Only 22,218 genes, a decrease of 1,826 over 4
years
Knowns: from 90 to 95
Novel genes: from 12,398 to 2,263
Exons per gene: from 6.5 to 9.6
Alternative splicing: from 3,669 to 8,078
10
Addressing the smORF Question: Protein Size
Distributions in Human SPTr
Pre Oct-01: 6.3 > 100aa
Post Oct-01: 5.5 > 100aa
Novel in title: 3.4 > 100aa
11
Summarising the smORF Issue

smORFs are particularly difficult to detect
computationally and experimentally. However:
No database evidence for increased smORF
discovery in eukaryotes or mammals
The observation that only 1% of mouse genes
have no detectable human homology contradicts the
idea of large order-specific gene expansion in
mammals
Although small proteins are less conserved i.e.
evolve more rapidly, those much shorter than 100
residues will fall below the threshold necessary
to fold into the domain structures necessary for
biological function
No evidence for de novo gene invention in
higher eukaryotes

12
Release History of the International Protein
Index: Significant Non-overlap in Protein
Annotation Sets
56,537 entries
13
Experimental Transcript Skimming: Used to be
Evidence for High Gene Numbers but Now Points to
Non-coding Transcription

Exon arrays (Dunham et al. 1999)
Gene arrays (Penn et al. 2000)
RT-PCR (Das et al. 2001)
SAGE-tags (Saha et al. 2002, Chen et al. 2002)
Oligo tiling of chromosomes 21 and 22 (Kapranov
et al. 2002, Kampa et al. 2004)
No novel proteins were submitted to the primary
databases
Necessary to clone a full length ORF with the
features of gene anatomy, and submission to the
public databases, before the discovery of novel
proteins can be claimed
There is increasing evidence for significant
amounts of antisense and other non-ORF
transcription in human and mouse

14
Gene Numbers for Individual Completed Chromosomes

Averaging the completed chromosomes exceeds
Ensembl GP31 genes by 12%
Extrapolate to 25,000 genes without novel
transcripts or putatives
Extrapolate to 28,000 genes without putatives
Extrapolate to 31,000 genes with putatives
The chromosome reports were made at different
times using different assemblies and different
grades of gene definition and evidence support
(e.g. different results for chrom 7)
Difficult to explicitly cross-map VEGA vs.
Ensembl chromosome gene numbers
Future status of novel transcripts and putative
genes unclear; most will be non-coding
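
The first extrapolation above can be checked with a few lines of arithmetic. This is only a sketch under two assumptions: the chromosome excess is read as 12 percent, and the 22,218 Ensembl gene count from the earlier slide is taken as the baseline (the slide names GP31, which may not be the identical release). The constant and function names are illustrative.

```python
ENSEMBL_GENES = 22_218     # Ensembl gene count quoted on the Ensembl slide (assumed baseline)
CHROMOSOME_EXCESS = 0.12   # completed chromosomes exceed Ensembl by 12% (assumed reading)

def extrapolate(base_count, excess):
    """Scale a genome-wide count by the excess seen on curated chromosomes."""
    return round(base_count * (1 + excess))

estimate = extrapolate(ENSEMBL_GENES, CHROMOSOME_EXCESS)
print(estimate)
```

The result lands just under the 25,000 figure quoted for the set without novel transcripts or putatives, so the rounded slide value is consistent with this reading.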

15
Disappearing Gene Novelty

EMBL hum cds March 2003: 1491
Plus novel: 159
Plus PubMed 2003: 120
Novel in title: 11
Previous cds: 8
Novel genes: 2
Now both in RefSeq and Ensembl 18.34

16
Conclusions on Number

The model eukaryotes have shown no significant
post-genomic rises in gene number
The Ensembl gene number has been essentially flat
since 2001
The pseudogene-adjusted Ensembl basal protein
total is 22,500
2,500 predicted genes still need experimental
verification; there are likely to be errors in
terminal exons
Putative genes from curated complete chromosomes
could raise numbers but the status of this class
of transcripts is in doubt
Early over-estimates explicable by non-ORF
transcription activity
The massive increase in post-genomic transcript
coverage is predominantly re-sampling known
genes
Database submissions of novel human genes have
slowed to a trickle
No evidence for large numbers of cryptic smORFs
Proteomics has not revealed new proteins
October 2004 Nature paper on finished human
genome suggests 20-25,000 protein-coding genes

17
The Consequences of a Low Protein-coding Gene
Number for MS-based Proteomics

Shifts the identification risk from false
negatives (missed novel proteins) towards false
positives (inflated hit-lists)
Encourages experimental progression beyond simple
gene stamping towards exon corrections, detecting
splice forms, PTMs, SAPs and quantitation
10% of proteins predicted from genomic data
still have no experimentally confirmed mRNA
These could be detected by MS-based proteomics so
long as the sequences are included in the search
space
The basal (unspliced) protein number is likely to
be similar for most mammals
Despite the technical challenges, one of the goals
of the Human Proteome Organisation, being able
to detect at least the basal forms of all 25K
expressed human proteins in vivo, now seems
achievable
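
The point about search space can be sketched with an in-silico tryptic digest. The example assumes the usual trypsin rule (cleave after K or R unless the next residue is P) and uses exact peptide matching as a stand-in for a real MS/MS search engine, which scores spectra rather than strings. Both protein sequences, the `min_len` cut-off and all names are made up for illustration.

```python
import re

def tryptic_peptides(sequence, min_len=6):
    """Cleave after K or R, except before P; keep peptides of at least min_len."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}

# Hypothetical sequences: one confirmed protein, one genome-only prediction
confirmed_protein = "MKTAYIAKQRQISFVKSHFSR"
predicted_protein = "MALWTRLLPLLALLALWGPDPAAAK"

space_without = tryptic_peptides(confirmed_protein)
space_with = space_without | tryptic_peptides(predicted_protein)

observed = "LLPLLALLALWGPDPAAAK"  # peptide that derives from the predicted gene
print(observed in space_without, observed in space_with)
```

A peptide from an unconfirmed prediction can only be assigned when that prediction is part of the searched collection, which is why the choice of sequence database matters for novel gene detection.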

18
Human Proteome Sampling by MS/MS Identification:
Complete Absence of Unpredicted Genes?

3778 from plasma (Muthusamy et al. 2005)
2486 from liver cells (Yan et al. 2006)
615 from the human heart mitochondria (Taylor et
al. 2003)
500 from breast cancer cell membranes (Adams et
al. 2003)
491 from microsomal fractions (Han et al. 2001)
490 from blood serum (Adkins et al. 2003)
311 from the spliceosome (Rappsilber et al. 2002)
No verifiable data on gene prediction
confirmation
Caveats on search space for novel gene detection
by correlative algorithms
One novel gene was reported from a genome-only
peptide match by Kuster et al. in 2001, but this
appeared from a high-throughput project later in
the same year (Tr Q96DA0)
Proteomics will have a key impact on
characterising the proteome but there is no
evidence so far for novel protein discovery

19
From Minimal to Maximal Sequence Collections
Many databases to choose from
20
Implications of Low Gene Number for Drug
Discovery: Finite Numbers for Drug Target
Families

IPR000276 Rhodopsin-like GPCR superfamily: 720
IPR000719 Protein kinase: 486
IPR001254 Serine protease, trypsin family: 85
IPR000832 G-protein coupled receptors family 2 (secretin-like): 44
IPR006202 Neurotransmitter-gated ion-channel ligand binding domain: 41
IPR000242 Tyrosine specific protein phosphatase: 31
IPR000834 Zinc carboxypeptidase A metalloprotease (M14): 31
IPR000337 G-protein coupled receptors family 3: 23
IPR000340 Dual specificity protein phosphatase: 17

We now have largely finite choices for the "usual
suspects" among small molecule drug targets. All of
these are now covered by sequence patent claims
of variable IP weight
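
To put the family counts above in proportion, the sketch below sums them and expresses the total as a share of the gene set. It assumes the nine InterPro families do not overlap and uses the 22,218 Ensembl gene count from the earlier slide as the denominator; both are simplifications made here, not figures from the slide.

```python
# Family counts as listed on the slide, keyed by InterPro accession
family_counts = {
    "IPR000276": 720,  # Rhodopsin-like GPCR superfamily
    "IPR000719": 486,  # Protein kinase
    "IPR001254": 85,   # Serine protease, trypsin family
    "IPR000832": 44,   # GPCR family 2 (secretin-like)
    "IPR006202": 41,   # Neurotransmitter-gated ion-channel ligand binding domain
    "IPR000242": 31,   # Tyrosine specific protein phosphatase
    "IPR000834": 31,   # Zinc carboxypeptidase A metalloprotease (M14)
    "IPR000337": 23,   # GPCR family 3
    "IPR000340": 17,   # Dual specificity protein phosphatase
}

ENSEMBL_GENES = 22_218  # assumed denominator, from the Ensembl slide

total = sum(family_counts.values())
share_percent = 100 * total / ENSEMBL_GENES
print(total, round(share_percent, 1))
```

Under these assumptions the listed families cover only a small fraction of the basal gene set, which is the sense in which the choices are finite.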
21
A Low Gene Number Brings the Target Saturation
Point Closer

The number of declared human drug target proteins
currently under investigation worldwide is
approaching 1,200
This is getting close to the currently predicted
maximum of druggable targets
However the emergence of public target validation
evidence per se is arguably pre-competitive for
drug discovery

22
Low Gene Number Increases the Competitive Drive to
Expand the Druggable Intervention Envelope

Functional genomics may convert some unknown genes
to new enzymes or receptors
Threading/structural genomics may identify new
members of known druggable families
Systems biology may identify new and/or combined
intervention points
Protein-protein interactions may become more
amenable to specific chemical modulation

23
Acknowledgments and References

Paul Kersey of the EBI for IPI figures
Lucas Wagner of the NCBI for the retrospective
UniGene data
Arek Kasprzyk of the EBI for historical and
preview Ensembl statistics
Numerous other people at NCBI, EBI, and Sanger
Centre who have graciously answered queries on
their data collections
The OGS Proteome Discovery Team
AstraZeneca for current support

McGowan et al.
