Title: The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO
1The Gene Ontology Annotation (GOA) Database and
enhancement of GO annotations through InterPro2GO
- Nicky Mulder
- mulder_at_ebi.ac.uk
2Contents
- Introduction to GOA
- Manual GOA annotation
- Electronic annotation
- InterPro2GO
- GOA data flow
- Uses of GOA
- Future plans
3What is GO annotation?
- An annotation is a statement that a gene product
- has a particular molecular function
- is involved in a particular biological process
- is located within a certain cellular component
(each of these is recorded as a GO Term ID)
- as determined by a particular method (Evidence Code)
- as described in a particular reference (Reference).
4Gene Ontology Annotation (GOA) Database
- GOA's priority is to annotate the human, mouse and rat proteomes
- Largest open-source contributor of annotations to GO
- Provides 10 million annotations for more than 111,000 species
- Shares and integrates GO annotation
5How do we annotate GO terms?
- Manual annotation
- Electronic annotation
- All annotations must
- be attributed to a source
- indicate what evidence was found to support the
GO term-gene/protein association
6Manual annotation
- High quality
- Specific gene or gene product associations made using
- Peer-reviewed papers
- Evidence codes
- BUT
- Time-consuming
- Requires trained biologists
7Manual GO annotation
8Protein2GO tool Online
9Information captured by GOA
Source GOID Term Evid RefDB RefID With DB With ID Qualifier
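As a rough illustration, the fields listed above could be held in a small record type. This is only a sketch: the class name, the field names and the example values are assumptions made for this example and are not taken from the GOA schema. The checks reflect the rule stated earlier that every annotation needs a source and supporting evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record mirroring the column headings on the slide."""
    source: str          # group that made the annotation
    go_id: str           # GO term identifier
    term: str            # GO term name
    evidence: str        # evidence code
    ref_db: str          # reference database
    ref_id: str          # reference identifier
    with_db: str = ""    # optional "With" database
    with_id: str = ""    # optional "With" identifier
    qualifier: str = ""  # optional qualifier

    def __post_init__(self):
        # Every annotation must be attributed and carry evidence.
        if not self.source:
            raise ValueError("annotation needs a source")
        if not self.evidence:
            raise ValueError("annotation needs an evidence code")
        if not self.go_id.startswith("GO:"):
            raise ValueError("GO identifiers start with 'GO:'")


# Placeholder values, invented for this example only.
example = Annotation(
    source="UniProt", go_id="GO:9999991", term="placeholder term",
    evidence="IDA", ref_db="PMID", ref_id="0000000",
)
print(example.go_id, example.evidence)
```

Making the record immutable (`frozen=True`) is a design choice for the sketch; it keeps an annotation from being edited after it has passed the checks.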
10 How successful is manual-GOA?
Source            No. of annotations  No. of distinct proteins
Proteome Inc.     22054               6568
UniProt           67910               13697
IntAct            22002               11013
MGI               124919              29837
SGD               21761               5076
FlyBase           52386               8775
RGD               8036                3369
HGNC              3699                798
GeneDB            5502                1384
TAIR/TIGR         3367                1895
ZFIN              1012                334
Roslin Institute  14                  6
AgBase            889                 173
Reactome          15                  12
WormBase          893                 443
TIGR              139                 79
Gramene           139                 2812
GDB               165                 103
TOTAL MANUAL      336237              70728
111740 taxa
July 2006
11 Electronic Annotation
- Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings
- These annotations get the IEA (Inferred from Electronic Annotation) evidence code
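The idea can be sketched in a few lines: look up each identifier already present in an entry in a manually curated mapping, and emit the mapped GO terms with the IEA evidence code. The function name, the mapping contents and all identifiers below are invented for illustration; this is not the GOA pipeline itself.

```python
def electronic_annotations(entry_xrefs, mapping):
    """Return sorted (go_id, evidence, source_id) tuples for one entry.

    entry_xrefs: identifiers already present in a database entry
    mapping: external identifier -> list of GO identifiers
    """
    found = set()
    for xref in entry_xrefs:
        # Identifiers without a mapping simply yield no annotation.
        for go_id in mapping.get(xref, []):
            found.add((go_id, "IEA", xref))
    return sorted(found)


# Invented identifiers; they do not correspond to real mappings.
toy_mapping = {
    "IPR_A": ["GO:9999991", "GO:9999992"],
    "KW_B": ["GO:9999992"],
}
annotations = electronic_annotations(["IPR_A", "KW_B", "EC_C"], toy_mapping)
for go_id, evidence, via in annotations:
    print(go_id, evidence, via)
```

The sketch keeps the identifier that triggered each annotation, so the same GO term reached through two different mappings appears twice, once per source.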
12www.uniprot.org/
13Mappings of external concepts to GO
http://www.geneontology.org/GO.indices.shtml
14InterPro2GO mapping
- InterPro is a resource that integrates protein signature databases, e.g. Pfam, Prints, Prosite, ProDom, SMART, TIGRFAMs etc.
- It provides a means of classifying proteins into families and identifying domains.
- Each InterPro entry groups proteins belonging to the same family and potentially having the same function
15InterPro2GO mapping
- Done manually, but using tools
- Look at InterPro and protein annotation
- For all Swiss-Prot proteins that are true matches to the entry
- Get stats on DE lines, keywords, comments
- Check how conserved common annotation is
- Find appropriate GO term at most specific level
that applies to all proteins (not necessarily
domains)
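The last step, choosing the most specific term that still applies to every matching protein, can be sketched on a toy hierarchy. The term names, the parent links and the function names below are made up for illustration and are not the real GO graph; the actual mapping is done manually by curators, so this only shows the underlying logic.

```python
def ancestors(term, parents):
    """Return the term plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def depth(term, parents):
    """Longest path from the term up to a root (a term with no parents)."""
    ups = parents.get(term, [])
    return 0 if not ups else 1 + max(depth(p, parents) for p in ups)


def most_specific_common_terms(protein_terms, parents):
    """Deepest terms that apply, directly or by inheritance, to every protein."""
    shared = None
    for terms in protein_terms.values():
        implied = set()
        for term in terms:
            implied |= ancestors(term, parents)
        shared = implied if shared is None else shared & implied
    if not shared:
        return []
    best = max(depth(t, parents) for t in shared)
    return sorted(t for t in shared if depth(t, parents) == best)


# Toy hierarchy: child -> parents (invented, not real GO relationships).
toy_parents = {
    "binding": ["function"],
    "ion binding": ["binding"],
    "zinc binding": ["ion binding"],
    "iron binding": ["ion binding"],
}
toy_proteins = {
    "protA": ["zinc binding"],
    "protB": ["iron binding"],
    "protC": ["ion binding"],
}
print(most_specific_common_terms(toy_proteins, toy_parents))  # ['ion binding']
```

In this toy case "zinc binding" is too specific because it does not hold for protB, while "binding" is correct but less informative than "ion binding".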
16Tools used SQUID
Statistics options: keyword, description, gene name, organism, comments, etc.
17SQUID statistics output
18SQUID statistics output
19InterPro2GO mapping in entry
20InterProScan output with GO terms
21InterPro2GO sanity checks
- Run weekly
- Reports
- Obsolete GO terms
- Obsolete (deleted) IPRs
- Secondary IPRs
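A minimal sketch of such a check, assuming the mapping is available as a dictionary and the three problem lists as sets. The function name, the report keys and all identifiers are hypothetical; how the real weekly job obtains its lists is not described here.

```python
def sanity_report(mapping, obsolete_go, deleted_iprs, secondary_iprs):
    """Group problem entries of an IPR -> GO mapping by problem type."""
    report = {"obsolete_go": [], "deleted_ipr": [], "secondary_ipr": []}
    for ipr, go_ids in sorted(mapping.items()):
        if ipr in deleted_iprs:
            report["deleted_ipr"].append(ipr)
        if ipr in secondary_iprs:
            report["secondary_ipr"].append(ipr)
        for go_id in go_ids:
            if go_id in obsolete_go:
                # Keep the IPR so a curator can find the mapping to fix.
                report["obsolete_go"].append((ipr, go_id))
    return report


# Invented identifiers for illustration only.
toy_mapping = {
    "IPR_1": ["GO:9999991"],
    "IPR_2": ["GO:9999992", "GO:9999993"],
    "IPR_3": ["GO:9999991"],
}
report = sanity_report(
    toy_mapping,
    obsolete_go={"GO:9999993"},
    deleted_iprs={"IPR_3"},
    secondary_iprs={"IPR_1"},
)
for problem, items in report.items():
    print(problem, items)
```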
22Quality of GO mapping
- BioCreAtIvE test set: 635 GO annotations through InterPro2GO
- Manually checked 44 proteins, 107 predictions
- 97 correct (90%): 40 exact, 57 same lineage
- 10 new lineage (unknown)
- 0 incorrect
Camon et al., 2005, BMC Bioinformatics
23InterPro2GO mapping statistics
Total no. IPRs mapped to GO                      7126
% of IPRs mapped to at least 1 GO term           54
No. IPRs mapped to molecular function            5741
No. IPRs mapped to biological process            5543
No. IPRs mapped to cellular component            3426
No. GO terms mapped                              2811
No. UniProt proteins mapped through InterPro2GO  2006489 (61%)
% UniProt covered by InterPro                    77.6
24 How successful is IEA-GOA in general?
- Provides large coverage
- High Quality
- However these annotations often use high-level
GO terms and provide little detail.
IEA Method     No. of annotations  No. of distinct proteins
InterPro2GO    6281916             2006489
HAMAP2GO       199904              85814
SP Keyword2GO  3613883             1287830
EC2GO          207540              202657
TOTAL          10303243            2167001
Manual ones    336237              70728
Jun 2006
25Total GO statistics
Total no. GO annotations                    10639480
% GO associations manual                    3.16
% GO associations electronic                96.84
% GO associations InterPro2GO               59
Total no. proteins annotated to GO          2168717
% UniProt GO annotated in total             68.2
% UniProt GO annotated manually             2.2
% UniProt GO annotated electronically       66
% UniProt GO annotated through InterPro2GO  61
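The annotation shares above (3.16, 96.84 and 59) follow from the counts in the IEA-GOA table. A short check, using only those counts; the variable and function names are chosen for this example:

```python
# Counts taken from the IEA-GOA table (Jun 2006).
manual = 336237
electronic = 10303243
interpro2go = 6281916

total = manual + electronic


def pct(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


print(total)                             # 10639480
print(round(pct(manual, total), 2))      # 3.16
print(round(pct(electronic, total), 2))  # 96.84
print(round(pct(interpro2go, total)))    # 59
```

The UniProt coverage percentages cannot be recomputed the same way, because the total number of UniProt entries is not given on the slides.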
26GOA data flow
Gene association files
27Gene Association file format
http://www.geneontology.org/GO.annotation.shtml
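A gene association file can be read with very little code. The sketch below assumes the tab-delimited, 15-column layout of the gene association format, with comment lines starting with "!". The column labels are this example's own naming, and the sample line is invented; check the format documentation at the URL above before relying on it.

```python
# Assumed column order of the 15-column gene association format.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for comment/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) < len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}"
        )
    # Any extra trailing columns are ignored by zip().
    return dict(zip(GAF_COLUMNS, fields))


# Invented sample line; accession, GO ID and reference are placeholders.
sample_line = "\t".join([
    "UniProt", "X00000", "GENE1", "", "GO:9999991", "PMID:0000000",
    "IEA", "InterPro:IPR000000", "F", "Example protein", "",
    "protein", "taxon:9913", "20060701", "UniProt",
]) + "\n"

record = parse_gaf_line(sample_line)
print(record["db_object_id"], record["go_id"], record["evidence"])
```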
28Example GOA cow file
29Output from the GOA database
New
Non-Redundant based on IPI
GOA Cow
Redundant
GA slim for UniProt GO slims
Data also available in SRS, UniProt, QuickGO,
MODs, Ensembl etc.
30GA Files for Non-redundant species
- Non-redundant complete protein set for each proteome is identified (>25% GO coverage)
- Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc.
- Xref files available with identifiers from UniProt, IPI, RefSeq, Ensembl, UniGene etc.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
ftp://ftp.ebi.ac.uk/pub/databases/integr8
31Uses of GOA data
- Access protein functional information
- Look at relationships between proteins, e.g. IntAct
- Connect biological information to gene expression data
- Determine functional composition of a proteome using GO slim
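The last use, summarising a proteome with a GO slim, amounts to walking each protein's terms up the hierarchy and counting the slim terms reached. The sketch below uses an invented toy hierarchy and invented protein names. It counts every slim term reachable from a protein, which is a simplification; real slim tools may handle overlapping slim terms differently.

```python
from collections import Counter


def slim_composition(protein_terms, parents, slim_terms):
    """Count proteins per slim term, following parent links upward."""
    counts = Counter()
    for terms in protein_terms.values():
        hits = set()
        seen = set()
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            if term in slim_terms:
                hits.add(term)
            stack.extend(parents.get(term, []))
        # Each protein counts once per slim term, however many paths reach it.
        counts.update(hits)
    return counts


# Toy hierarchy: child -> parents (invented for illustration).
toy_parents = {
    "kinase activity": ["catalytic activity"],
    "DNA binding": ["binding"],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
}
toy_slim = {"catalytic activity", "binding"}
toy_proteome = {
    "p1": ["kinase activity"],
    "p2": ["DNA binding", "kinase activity"],
    "p3": ["binding"],
}
composition = slim_composition(toy_proteome, toy_parents, toy_slim)
for term in sorted(composition):
    print(term, composition[term])
```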
32Uses of GOA
Find functional information on proteins
http://www.ebi.ac.uk/ego
33Uses of GOA
Find functional information on interacting proteins (IntAct)
http://www.ebi.ac.uk/intact
34Uses of GOA
Overview proteome with GO Slim
http://www.ebi.ac.uk/integr8
35Uses of GOA
Analysis of high-throughput data according to GO
Microarray data analysis
Proteomics data analysis
GO classification
GO classification
Larkin JE et al, Physiol Genomics, 2004
Kislinger T et al, Mol Cell Proteomics, 2003
Cunliffe HE et al, Cancer Res, 2003
36Future plans
- Continue deep-level annotation of human, mouse and rat
- Manually annotate splice variants
- Outreach and inclusion of new datasets e.g. grape
- New electronic mappings, e.g. unipathway2go
- Ortholog prediction for electronic GO annotation
- Develop tools for annotation training
37Acknowledgements
Rolf Apweiler: Head of sequence database group
Evelyn Camon: GOA Coordinator
Daniel Barrell: GOA Programmer
Emily Dimmer: GOA Curator
Rachael Huntley: GOA Curator
David Binns, John Maslen: QuickGO, GOA tools
All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office at EBI
All GO Consortium associate members