Has the Yoyo Stopped? Reviewing the Evidence for a Low Basal Human Protein Number and the Implications for Proteomics and Drug Discovery

1
Has the Yoyo Stopped? Reviewing the Evidence
for a Low Basal Human Protein Number and the
Implications for Proteomics and Drug Discovery
6th Swedish Annual Bioinformatics Workshop
Göteborg, November 2005
Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal
2
Presentation Outline
  • The importance of gene number
  • Gene definition and detection
  • Genome inflation arguments
  • Post-completion changes in model eukaryotes
  • Ensembl pipeline numbers
  • Completed chromosome gene numbers
  • International Protein Index
  • Novel gene skimming
  • Implications for Proteomics
  • Implications for Drug Discovery

3
So Who Cares About the Human Gene Number?
  • Central to evolutionary questions of gene number
    expansion vs. protein diversity from alternative
    splicing, post-translational modifications and
    differential expression modulation
  • Mammalian gene totals expected to be similar but
    clade-specific genes may be crucial to speciation
  • Accurate gene delineation essential for
    interpreting genetic association studies
  • Large-scale, hypothesis-neutral transcript
    and/or protein profiling experiments also need a
    complete gene set
  • False negatives (missed genes) are more important
    than false positives
  • For mass-spec and other proteomic technologies it
    is crucial to have at least a basal complete ORF
    set
  • For Pharma and Biotech the numbers set finite
    limits for potential drug targets and therapeutic
    proteins

4
Definitions
  • The basal (unspliced) protein-coding gene number:
    the count of transcriptional units that translate
    to one or more proteins that share overlapping
    sequence identity and are products of the same
    unique genomic locus and strand orientation
  • However, the Guidelines for Human Gene
    Nomenclature define a gene as "a DNA segment
    that contributes to phenotype/function. In the
    absence of demonstrated function a gene may be
    characterised by sequence, transcription or
    homology"
  • The increasing complexity of the transcriptome
    makes the wider definition of a gene more
    difficult, e.g. micro and antisense RNA
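
The basal definition above can be illustrated with a minimal sketch. It assumes that overlapping genomic coordinates on the same chromosome and strand are an acceptable proxy for "same unique genomic locus"; the function name and the example coordinates are hypothetical and not taken from the talk.

```python
from collections import defaultdict

def count_basal_loci(transcripts):
    """Count groups of transcripts that overlap on the same chromosome and strand.

    Each group stands in for one basal gene: splice forms collapse together,
    while a transcript on the opposite strand is counted separately.
    """
    by_strand = defaultdict(list)
    for chrom, strand, start, end in transcripts:
        by_strand[(chrom, strand)].append((start, end))

    loci = 0
    for intervals in by_strand.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                loci += 1              # no overlap with the running locus
                current_end = end
            else:
                current_end = max(current_end, end)
    return loci

# Hypothetical transcripts: (chromosome, strand, start, end)
example = [
    ("chr1", "+", 100, 500),    # splice form A
    ("chr1", "+", 300, 900),    # splice form B, overlaps A
    ("chr1", "-", 350, 800),    # opposite strand, separate locus
    ("chr1", "+", 2000, 2500),  # distant locus on the same strand
]
print(count_basal_loci(example))
```

Counting by coordinate overlap alone ignores the "shared sequence identity" part of the definition, so nested or overlapping genes on one strand would be merged by this sketch.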

5
Identifying Protein Coding Genes
  • In silico
      • Detection of protein identity in genomic DNA
      • Gene prediction with protein similarity support
      • Matches with ESTs that include ORFs and/or splice
        sites
      • Cross-species comparisons for orthologous exon
        detection
      • Presence of gene anatomy features, e.g. CpG
        islands, promoters, transcription start sites,
        polyadenylation signals and the absence of repeat
        elements
  • In vitro
      • Cloning of predicted genes
      • Detection of active transcription by Northern
        blot, RT-PCR or microarray hybridisation
      • Loss-of-function approaches
      • High-throughput transcript sampling by EST, MPSS
        or SAGE tagging
      • Heterologous expression of cDNAs
      • Direct verification of protein sequence by Edman
        sequencing, mass-mapping and/or MS/MS sequencing
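
As a rough illustration of the in silico ORF step, the sketch below scans all six reading frames for ATG-to-stop ORFs above a length threshold. It is not a gene predictor and is not the method used by any pipeline named in the talk; the function names and the 100-codon default (chosen to echo the 100-residue threshold discussed for smORFs) are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=100):
    """Return (strand, start, end) for ATG...stop ORFs in all six frames.

    Length is counted in codons from the ATG up to, but not including, the
    stop codon. Coordinates are 0-based, end-exclusive, on the scanned strand.
    Only the first ATG after the previous stop is used in each frame.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs

# Toy sequence: a six-codon ORF followed by a stop codon
toy = "ATG" + "GCT" * 5 + "TAA"
print(find_orfs(toy, min_codons=3))   # reported with a permissive threshold
print(find_orfs(toy))                 # empty with the 100-codon default
```

Lowering `min_codons` quickly increases the number of ORFs reported on real genomic sequence, which is one reason an ORF alone is weak evidence without the other lines of support listed above.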

6
Historical Arguments and Estimates for High Gene
Numbers
  • Initial eukaryote (yeast/worm/fly) numbers
    assumed to be underestimates
  • Gene prediction programs have a significant
    false-negative rate
  • The Ensembl gene annotation pipeline is
    conservative
  • Mammalian protein and transcript coverage is
    incomplete
  • Chromosome annotation teams find more genes than
    automated pipelines
  • Selective transcript skimming experiments have
    revealed new genes
  • Extensive mammalian genomic sequence conservation
    outside known exons
  • Postulated large numbers of undetected small
    proteins (smORFs or dark matter)
  • EST clustering and commercial gene inflation
    claims

[Figure labels: Genesweep 2000; literature estimates]
7
Model Eukaryotes: No Significant Post-Completion
Gene Increases
  • S. pombe: 3% increase since 2002
  • S. cerevisiae: 8% decrease since 1997 (excluding
    820 dubious ORFs)
  • C. elegans: 5% increase since 1998
  • D. melanogaster: 0.2% increase since 2001
  • Little increase in spite of global functional
    genomics focus

8
Human Transcripts: Post-genomic mRNA Growth in
UniGene
  • Rapid growth in redundant mRNA
  • But slow growth in the clustered set: 9,000 over 2
    years, with a plateau at 28,000
  • This will include splice variants and some
    spurious ORFs

9
Ensembl Human Gene Number
  • Only 22,218 genes, a decrease of 1,826 over 4
    years
  • Knowns: from 90% to 95%
  • Novel genes: from 12,398 to 2,263
  • Exons per gene: from 6.5 to 9.6
  • Alternative splicing: from 3,669 to 8,078

10
Addressing the smORF Question: Protein Size
Distributions in Human SPTr
Pre Oct-01: 6.3 > 100aa
Post Oct-01: 5.5 > 100aa
Novel in title: 3.4 > 100aa
11
Summarising the smORF Issue
  • smORFs are particularly difficult to detect
    computationally and experimentally. However:
  • No database evidence for increased smORF
    discovery in eukaryotes or mammals
  • The observation that only 1% of mouse genes
    have no detectable human homology contradicts the
    idea of large order-specific gene expansion in
    mammals
  • Although small proteins are less conserved i.e.
    evolve more rapidly, those much shorter than 100
    residues will fall below the threshold necessary
    to fold into the domain structures necessary for
    biological function
  • No evidence for de-novo gene invention in
    higher eukaryotes

12
Release History of the International Protein
Index: Significant Non-overlap in Protein
Annotation Sets
56,537 entries
13
Experimental Transcript Skimming Used to be
Evidence for High Gene Numbers but Now Points to
Non-coding Transcription
  • Exon arrays (Dunham et al. 1999)
  • Gene arrays (Penn et al. 2000)
  • RT-PCR (Das et al. 2001)
  • SAGE-tags (Saha et al. 2002, Chen et al. 2002)
  • Oligo tiling of chromosomes 21 and 22 (Kapranov
    et al. 2002, Kampa et al. 2004)
  • No novel proteins were submitted to the primary
    databases
  • Necessary to clone a full length ORF with the
    features of gene anatomy, and submission to the
    public databases, before the discovery of novel
    proteins can be claimed
  • There is increasing evidence for significant
    amounts of antisense and other non-ORF
    transcription in human and mouse

14
Gene Numbers for Individual Completed Chromosomes
  • The average over the completed chromosomes exceeds
    the Ensembl GP31 gene count by 12%
  • Extrapolate to 25,000 genes without novel
    transcripts or putatives
  • Extrapolate to 28,000 genes without putatives
  • Extrapolate to 31,000 genes with putatives
  • The chromosome reports were made at different
    times using different assemblies and different
    grades of gene definition and evidence support
    (e.g. different results for chrom 7)
  • Difficult to explicitly cross-map VEGA vs.
    Ensembl chromosome gene numbers
  • Future status of novel transcripts and putative
    genes is unclear; most will be non-coding
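
The first extrapolation on this slide can be checked with simple arithmetic. This is a sketch under two assumptions that the transcript does not confirm: that the "12" is a percentage, and that the baseline is the 22,218 Ensembl total quoted on the earlier Ensembl slide.

```python
ENSEMBL_GENES = 22_218     # Ensembl total from the earlier slide (assumed baseline)
CHROMOSOME_EXCESS = 0.12   # assumes the excess over Ensembl is 12 percent

# Scale the pipeline total by the excess seen on curated chromosomes
extrapolated = ENSEMBL_GENES * (1 + CHROMOSOME_EXCESS)
print(round(extrapolated))
```

Under those assumptions the result lands close to the 25,000 figure given for genes without novel transcripts or putatives.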

15
Disappearing Gene Novelty
  • EMBL hum cds, March 2003: 1491
  • Plus novel: 159
  • Plus PubMed 2003: 120
  • Novel in title: 11
  • Previous cds: 8
  • Novel genes: 2
  • Now both in RefSeq and Ens 18.34

16
Conclusions on Number
  • The model eukaryotes have shown no significant
    post-genomic rises in gene number
  • The Ensembl gene number has been essentially flat
    since 2001
  • The pseudogene-adjusted Ensembl basal protein
    total is 22,500
  • 2,500 predicted genes still need experimental
    verification; errors are likely in terminal
    exons
  • Putative genes from curated complete chromosomes
    could raise numbers but the status of this class
    of transcripts is in doubt
  • Early over-estimates explicable by non-ORF
    transcription activity
  • The massive increase in post-genomic transcript
    coverage is predominantly re-sampling known
    genes
  • Database submissions of novel human genes have
    slowed to a trickle
  • No evidence for large numbers of cryptic smORFs
  • Proteomics has not revealed new proteins
  • October 2004 Nature paper on finished human
    genome suggests 20-25,000 protein-coding genes

17
The Consequences of a Low Protein-coding Gene
Number for MS-based Proteomics
  • Shifts the identification risk from false
    negatives (missed novel proteins) towards false
    positives (inflated hit-lists)
  • Encourages experimental progression beyond simple
    gene stamping towards exon corrections, detecting
    splice forms, PTMs, SAPs and quantitation
  • 10% of proteins predicted from genomic data
    still have no experimentally confirmed mRNA
  • These could be detected by MS-based proteomics so
    long as the sequences are included in the search
    space
  • The basal (unspliced) protein number is likely to
    be similar for most mammals
  • Despite the technical challenges, one of the goals
    of the Human Proteome Organisation, to detect at
    least the basal forms of all 25K expressed human
    proteins in vivo, now seems achievable
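
The search-space point above can be sketched as follows: a peptide observed by MS is only identified if its sequence is present in the database being searched. The example uses a simplified in silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages) and exact string matching, which is far simpler than the correlative algorithms used in practice. All names and sequences are hypothetical.

```python
import re

def tryptic_peptides(protein, min_len=6):
    """Simplified tryptic digest: cut after K or R, but not before P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}

def build_index(database):
    """Map each peptide to the names of the database proteins that contain it."""
    index = {}
    for name, seq in database.items():
        for pep in tryptic_peptides(seq):
            index.setdefault(pep, set()).add(name)
    return index

def match(observed, index):
    """Look up each observed peptide; an empty list means no identification."""
    return {pep: sorted(index.get(pep, ())) for pep in observed}

# Hypothetical sequences and observed peptides
known_only = {"known_protein": "AAAAAAKCCCCCCR"}
with_predictions = dict(known_only, predicted_orf="DDDDDDKEEEEEER")
observed = ["AAAAAAK", "EEEEEER"]

print(match(observed, build_index(known_only)))
print(match(observed, build_index(with_predictions)))
```

The second peptide is only assigned once the predicted ORF is added to the database, which is the sense in which unconfirmed predictions have to be in the search space to be detected.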

18
Human Proteome Sampling by MS/MS Identification:
Complete Absence of Unpredicted Genes?
  • 3778 from plasma (Muthusamy et al. 2005)
  • 2486 from liver cells (Yan et al. 2006)
  • 615 from the human heart mitochondria (Taylor et
    al. 2003)
  • 500 from breast cancer cell membranes (Adams et
    al. 2003)
  • 491 from microsomal fractions (Han et al. 2001)
  • 490 from blood serum (Adkins et al. 2003)
  • 311 from the spliceosome (Rappsilber et al. 2002)
  • No verifiable data on gene prediction
    confirmation
  • Caveats on search space for novel gene detection
    by correlative algorithms
  • One novel gene was reported from a genome-only
    peptide match by Kuster et al. in 2001, but this
    appeared from a high-throughput project later in
    the same year (Tr Q96DA0)
  • Proteomics will have a key impact on
    characterising the proteome but there is no
    evidence so far for novel protein discovery

19
From Minimal to Maximal Sequence Collections
Many databases to choose from
20
Implications of Low Gene Number for Drug
Discovery: Finite Numbers for Drug Target
Families
  • IPR000276 Rhodopsin-like GPCR superfamily: 720
  • IPR000719 Protein kinase: 486
  • IPR001254 Serine protease, trypsin family: 85
  • IPR000832 G-protein coupled receptors family 2
    (secretin-like): 44
  • IPR006202 Neurotransmitter-gated ion-channel
    ligand binding domain: 41
  • IPR000242 Tyrosine specific protein phosphatase:
    31
  • IPR000834 Zinc carboxypeptidase A metalloprotease
    (M14): 31
  • IPR000337 G-protein coupled receptors family 3:
    23
  • IPR000340 Dual specificity protein phosphatase: 17

We now have largely finite choices for the "usual
suspects" small-molecule drug targets. All of
these are now covered by sequence patent claims
of variable IP weight.
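
To put the family counts on this slide in proportion, the sketch below sums them and compares the total with the 22,218 Ensembl gene total quoted earlier. It assumes the counts are per-family protein numbers that do not overlap; a protein matching more than one InterPro entry would be counted twice, so the sum is an approximation.

```python
# Counts as listed on the slide, keyed by InterPro accession
FAMILY_COUNTS = {
    "IPR000276": 720,  # Rhodopsin-like GPCR superfamily
    "IPR000719": 486,  # Protein kinase
    "IPR001254": 85,   # Serine protease, trypsin family
    "IPR000832": 44,   # GPCR family 2 (secretin-like)
    "IPR006202": 41,   # Neurotransmitter-gated ion-channel ligand binding domain
    "IPR000242": 31,   # Tyrosine specific protein phosphatase
    "IPR000834": 31,   # Zinc carboxypeptidase A metalloprotease (M14)
    "IPR000337": 23,   # GPCR family 3
    "IPR000340": 17,   # Dual specificity protein phosphatase
}
ENSEMBL_GENES = 22_218  # Ensembl total from the earlier slide (assumed baseline)

total = sum(FAMILY_COUNTS.values())
share = total / ENSEMBL_GENES
print(f"{total} family members, {share:.1%} of the Ensembl total")
```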
21
A Low Gene Number Brings the Target Saturation
Point Closer
  • The number of currently declared human drug target
    proteins under investigation world-wide is
    approaching 1,200
  • This is getting close to the currently predicted
    maximum of druggable targets
  • However the emergence of public target validation
    evidence per se is arguably pre-competitive for
    drug discovery

22
Low Gene Number Increases the Competitive Drive to
Expand the Druggable Intervention Envelope
  • Functional genomics may convert some unknown genes
    to new enzymes or receptors
  • Threading/structural genomics may identify new
    members of known druggable families
  • Systems biology may identify new and/or combined
    intervention points
  • Protein-protein interactions may become more
    amenable to specific chemical modulation

23
Acknowledgments and References
  • Paul Kersey of the EBI for IPI figures
  • Lucas Wagner of the NCBI for the retrospective
    UniGene data
  • Arek Kasprzyk of the EBI for historical and
    preview Ensembl statistics
  • Numerous other people at NCBI, EBI, and Sanger
    Centre who have graciously answered queries on
    their data collections
  • The OGS Proteome Discovery Team
  • AstraZeneca for current support

McGowan et al.