Title: caGrid Version 0.5 Reference Implementation: Grid-Enablement of Protein Information Resource (PIR)
caBIG Architecture Face-to-Face Workspace
Georgetown University, August 16-18, 2005
- Baris Ethem Suzek
- Georgetown University - Lombardi Cancer Center
PIR - bes23_at_georgetown.edu
Outline
- Introduction
- High Level Overview of Grid-Enablement of PIR
  - Data Model
  - Project Architecture
  - Process of getting to Silver level compliance
- Functionality Exposed to Grid
- Process of Grid Enablement
- Demo/Screenshots
- Lessons Learned / Technical Difficulties / Wish List
- Acknowledgements
Introduction
- Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research
  - UniProt: Universal Protein Resource. Central Resource of Protein Sequence and Function
  - PIRSF: Family Classification System. Protein Classification and Functional Annotation
  - iProClass: Integrated Protein Knowledgebase. Data Integration and Functional Analysis
http://pir.georgetown.edu
Introduction
- UniProt: Universal Protein Resource. Central Resource of Protein Sequence and Function
- International Consortium
  - PIR at GUMC
  - European Bioinformatics Institute (EBI)
  - Swiss Institute of Bioinformatics (SIB)
- Unifies PIR-PSD, Swiss-Prot, TrEMBL Protein Sequence Databases
http://www.uniprot.org
Introduction
Primary data source for Grid-Enablement of PIR
Project Overview
- Project Goal: Provide the most comprehensive and fully annotated protein-related information for genomic and proteomic cancer research
- Major Functionality: Providing methods to query and retrieve protein-related information for the cancer research community
- Being developed in two phases
  - First phase: Jan 1 to Aug 1, 2005
    - Demonstrate that the PIR data source can be discovered and consumed in caBIG
    - Make the PIR data source caBIG silver-compliant
    - Originally, the final product was web services
    - On June 13, 2005, the project changed direction to become a grid service
Data Model
- Current Total: 48 objects, 51 attributes
Data Model
- Protein/Gene related objects
Data Model
- Annotation related objects: Protein Features
Data Model
- Taxonomy related objects (Proposed as Taxonomy CDE)
Project Architecture
- Client
- Data Access Objects / PIR Data Service
- Object to Relational Mapping
- Domain Objects
- UniProtKB Database
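The layering on this slide can be illustrated with a minimal sketch. It is not the generated caCORE SDK code: the class names, column names and in-memory rows below are assumptions made for illustration, with a list of dictionaries standing in for the UniProtKB database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Domain object: a plain class with no knowledge of storage.
@dataclass(frozen=True)
class Protein:
    primary_accession: str
    name: str


# Stand-in for the database layer; column names are hypothetical.
ROWS: List[Dict[str, str]] = [
    {"PRIMARYACCESSION": "P51587", "NAME": "BRCA2_HUMAN"},
    {"PRIMARYACCESSION": "P99999", "NAME": "CYC_HUMAN"},
]


def map_row(row: Dict[str, str]) -> Protein:
    """Object-to-relational mapping: turn one row into a domain object."""
    return Protein(primary_accession=row["PRIMARYACCESSION"], name=row["NAME"])


class ProteinDao:
    """Data access object: the only layer that touches the rows."""

    def __init__(self, rows: List[Dict[str, str]]) -> None:
        self._rows = rows

    def find_by_accession(self, accession: str) -> Optional[Protein]:
        for row in self._rows:
            if row["PRIMARYACCESSION"].lower() == accession.lower():
                return map_row(row)
        return None


# Client code sees only the DAO and the domain object.
dao = ProteinDao(ROWS)
hit = dao.find_by_accession("p51587")
print(hit)
```

The client never sees a row or a column name, so the schema can change without touching client code.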
Process to Silver Compliance
- Month 1: Use cases developed with adopter from UPenn
  - Setting Search Criteria
    - Simple Search based on an individual field: UniProtKB, PIR ID or accession number, NCBI Taxonomy ID, protein name, etc.
    - Advanced Search based on two fields combined with Boolean operators AND, OR and AND_NOT
    - All-ID Search: a Google-like search across the identifier fields if the source of the identifier is not known
    - Batch Retrieval using multiple UniProtKB IDs or accessions
  - Setting Response Criteria
    - UniProtKB XML or FASTA
- Month 2: Test approach document approved
- Month 3-4: Requirements and specifications document approved
- Month 4: Database schema developed (reverse engineered)
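The four search use cases above can be sketched as composable predicates. This is an illustrative sketch only, not the project's implementation: the record fields (`accession`, `name`, `taxonomy_id`) and the helper names are assumptions, and the records are a small in-memory sample.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Predicate = Callable[[Record], bool]

# Hypothetical sample records; field names are illustrative only.
RECORDS: List[Record] = [
    {"accession": "P51587", "name": "BRCA2_HUMAN", "taxonomy_id": "9606"},
    {"accession": "P97929", "name": "BRCA2_MOUSE", "taxonomy_id": "10090"},
    {"accession": "P99999", "name": "CYC_HUMAN", "taxonomy_id": "9606"},
]


def field_equals(field: str, value: str) -> Predicate:
    """Simple Search: case-insensitive match on one field."""
    return lambda rec: rec.get(field, "").lower() == value.lower()


def combine(left: Predicate, op: str, right: Predicate) -> Predicate:
    """Advanced Search: two criteria joined by AND, OR or AND_NOT."""
    if op == "AND":
        return lambda rec: left(rec) and right(rec)
    if op == "OR":
        return lambda rec: left(rec) or right(rec)
    if op == "AND_NOT":
        return lambda rec: left(rec) and not right(rec)
    raise ValueError(f"unsupported operator: {op}")


def any_id(value: str, id_fields: Iterable[str]) -> Predicate:
    """All-ID Search: try the value against every identifier field."""
    fields = list(id_fields)
    return lambda rec: any(field_equals(f, value)(rec) for f in fields)


def search(predicate: Predicate) -> List[str]:
    """Return the accessions of all records matching the predicate."""
    return [rec["accession"] for rec in RECORDS if predicate(rec)]


def batch_retrieve(accessions: Iterable[str]) -> List[str]:
    """Batch Retrieval: look up several accessions in one call."""
    wanted = {a.lower() for a in accessions}
    return [rec["accession"] for rec in RECORDS if rec["accession"].lower() in wanted]


human_not_cyc = search(
    combine(field_equals("taxonomy_id", "9606"), "AND_NOT", field_equals("name", "CYC_HUMAN"))
)
print(human_not_cyc)
```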
Process to Silver Compliance
- Month 4.5: First version of the object model and semantic annotation report
  - UniProtKB record-based Object Model: 70 objects, 130 attributes
Process to Silver Compliance
- Month 5: First version of the web services developed and tested
- Month 5.5, June 13: Feedback from caCORE team on the first version of the object model
  - Design a protein domain model, rather than a UniProtKB record-based model
  - Use caCORE SDK to generate the middle layer
- New page in the Grid-enablement of PIR project: web services development became unnecessary for grid deployment, so more emphasis was put on object modeling
Process to Silver Compliance
- June 17: Second version of the object model submitted for VCDE/caCORE Workspace review
Process to Silver Compliance
- June 22: Review object model with VCDE and caCORE teams
  - Use bidirectional associations, unless necessary
  - Use java.util.Collection rather than List
  - Follow caCORE SDK naming conventions
  - Replace complex data types (i.e. Lists) with association classes
  - Implement subclasses of Feature to express semantics
- June 24: Third version of the object model submitted for VCDE/caCORE Workspace review
- June 28: Semantic annotations for the first version of the object model received from VCDE team
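One reading of the review advice about list-typed attributes is sketched below, in Python rather than the project's Java. Instead of a bare list of strings on the protein, the values become instances of their own class, and the protein exposes them through a generic collection type (the analogue of java.util.Collection). The `Keyword` class, its attribute and the sample value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Collection, List


@dataclass(frozen=True)
class Keyword:
    """A class standing in for what was a bare string inside a list attribute."""
    value: str


@dataclass
class Protein:
    name: str
    _keywords: List[Keyword] = field(default_factory=list, repr=False)

    def add_keyword(self, value: str) -> None:
        self._keywords.append(Keyword(value))

    @property
    def keyword_collection(self) -> Collection[Keyword]:
        # Exposed as a read-only Collection so callers do not rely on
        # list-specific behaviour such as positional indexing.
        return tuple(self._keywords)


protein = Protein("BRCA2_HUMAN")
protein.add_keyword("DNA repair")  # hypothetical sample value
print(len(protein.keyword_collection))
```

Because `Keyword` is a class of its own, it can carry its own semantic annotation, which a string inside a list cannot.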
Process to Silver Compliance
- June 30: Object model finalized
Process to Silver Compliance
- July 2: Semantic annotation partially done using the annotations of the first object model
- July 11: Code generated using caCORE SDK 1.0.2, ORM completed, first version of silver level application deployed
- July 21: Semantic annotation completed (149 Concepts)
Process to Silver Compliance
- July 22: caCORE SDK 1.0.3 became public
- July 25: Grid-node deployed for caCORE SDK 1.0.2 generated API
- July 25: Migration to caCORE SDK 1.0.3
- July 28: Object model loaded into caDSR staging server
- August 3: Object model and semantic annotation modified for caCORE SDK 1.0.3 (id attributes)
- August 4: Code generated using caCORE SDK 1.0.3, ORM modified, current version of silver level application deployed
- August 5: Grid-node deployed for caCORE SDK 1.0.3 generated API
- August 8: Object model loaded into caDSR production server
Functionality Exposed to the Grid
- Grid-Enablement of PIR project is a data service
- All the objects in our model are exposed to the grid
- Example queries
  - Find the proteins for the gene BRCA2 (Breast Cancer Gene 2)
  - Find all the proteins that contain the domain BRCA2 repeat (PFAM:PF00634, a domain in Breast cancer type 2 susceptibility protein)
  - ID mapping: Find all the database cross-references from various databases corresponding to RefSeq Accession NP_009225
- Since all the PIR and UniProt databases are public, no security layers are implemented
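The first two example queries both amount to "find the target objects whose associated object has an attribute equal to a value". The sketch below shows that idea over a few in-memory objects. It does not use the caGrid or caCORE query API: the class and attribute names, the dotted-path helper and the sample objects are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Gene:
    name: str


@dataclass
class DatabaseCrossReference:
    database: str
    identifier: str


@dataclass
class Protein:
    name: str
    genes: List[Gene] = field(default_factory=list)
    cross_references: List[DatabaseCrossReference] = field(default_factory=list)


def values_at(obj: Any, path: str) -> List[Any]:
    """Follow a dotted path such as 'genes.name', fanning out over lists."""
    current = [obj]
    for part in path.split("."):
        following: List[Any] = []
        for item in current:
            value = getattr(item, part)
            following.extend(value if isinstance(value, list) else [value])
        current = following
    return current


def find(objects: List[Any], path: str, value: Any) -> List[Any]:
    """Return the objects for which any value reached via `path` equals `value`."""
    return [o for o in objects if value in values_at(o, path)]


# Hypothetical sample data.
PROTEINS = [
    Protein("BRCA2_HUMAN", [Gene("BRCA2")], [DatabaseCrossReference("PFAM", "PF00634")]),
    Protein("CYC_HUMAN", [Gene("CYCS")], [DatabaseCrossReference("RefSeq", "NP_061820")]),
]

by_gene = [p.name for p in find(PROTEINS, "genes.name", "BRCA2")]
by_domain = [p.name for p in find(PROTEINS, "cross_references.identifier", "PF00634")]
print(by_gene, by_domain)
```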
Process of Grid Enablement
- Download prerequisites (CVSGrab, Ant, JDK 1.4.2_04)
- Set environment variables (JAVA_HOME, ANT_HOME, CVSGRAB_HOME)
- Run the ant script bootstrap.xml provided by the caGRID team
- Set more environment variables (CATALINA_HOME, OGSADAI_LOCATION, GLOBUS_LOCATION)
- Copy the client.jar generated by caCORE SDK to the correct data services directory
- Change the configuration so that it points to the HTTP server URL for your data service
- Deploy the service with ant -f bootstrap.xml deployDS
- Start Tomcat Server
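Since the steps above depend on six environment variables being set before deployment, a small pre-flight check can catch a missing one early. The sketch below is a convenience script, not part of the caGRID tooling. The variable names and the ant command come from the slide; the helper name is an assumption, and the deploy command is only assembled, never run.

```python
import os
from typing import List, Mapping, Sequence

# Environment variables named on the slide, in the order they are set.
REQUIRED_VARIABLES = [
    "JAVA_HOME",
    "ANT_HOME",
    "CVSGRAB_HOME",
    "CATALINA_HOME",
    "OGSADAI_LOCATION",
    "GLOBUS_LOCATION",
]

# The deploy step from the slide, as an argument list (not executed here).
DEPLOY_COMMAND = ["ant", "-f", "bootstrap.xml", "deployDS"]


def missing_variables(env: Mapping[str, str],
                      required: Sequence[str] = REQUIRED_VARIABLES) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_variables(os.environ)
if missing:
    print("Set these variables before deploying:", ", ".join(missing))
else:
    print("Environment looks complete; next step:", " ".join(DEPLOY_COMMAND))
```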
Demo and/or Screenshots
- Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2)
[Query XML; markup lost in extraction. Recoverable fragment: value "BRCA2"]
Demo and/or Screenshots
- Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2): RESPONSE
[Response XML; markup lost in extraction. Namespace fragment: ".uk/namespaces/2003/07/gds/types". Recoverable values:]
- Primary accession: O35923
- Name: BRCA2_RAT
- ...
- Primary accession: P51587
- Name: BRCA2_HUMAN
- ...
- Sequence (two wrapped lines as displayed):
  MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEES
  EHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDK
- ...
- Primary accession: P97929
- Name: BRCA2_MOUSE
- ...
Demo and/or Screenshots
- Find all the proteins that contain the domain BRCA2 repeat (PFAM:PF00634, a domain in Breast cancer type 2 susceptibility protein)
[Query XML; markup lost in extraction. Recoverable fragments: a class name ending in ".DatabaseCrossReference"; predicate "equal", value "PF00634"]
Demo and/or Screenshots
- Find all the proteins that contain the domain BRCA2 repeat (PFAM:PF00634, a domain in Breast cancer type 2 susceptibility protein): RESPONSE
[Response XML; markup lost in extraction. Namespace fragment: ".uk/namespaces/2003/07/gds/types". Recoverable values:]
- ...
- Primary accession: O35923
- Name: BRCA2_RAT
- ...
- Primary accession: P51587
- Name: BRCA2_HUMAN
- ...
- Primary accession: P70098
- Name: Q5TBJ7_HUMAN
- ...
- Primary accession: Q7RG20
- Name: Q7RG20_PLAYO
- ...
Demo and/or Screenshots
- ID mapping: Find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820
- 1st Step: Find the protein for RefSeq Accession NP_061820
[Query XML; markup lost in extraction. Recoverable fragments: a class name ending in "aseCrossReference"; value "RefSeq"; predicate "equal", value "NP_061820"]
Demo and/or Screenshots
- ID mapping: Find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820
- 1st Step: Find the protein for RefSeq Accession NP_061820: RESPONSE
[Response XML; markup lost in extraction. Namespace fragment: ".uk/namespaces/2003/07/gds/types". Recoverable fragments:]
- status "COMPLETED" (shown twice)
- Class name ending in "main.impl.ProteinImpl"
- P99999
- Primary accession: P99999
- Name: CYC_HUMAN
- ...
Demo and/or Screenshots
- ID mapping: Find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820
- 2nd Step: Find all the cross-references for the protein found
[Query XML; markup lost in extraction. Recoverable fragments: a class name ending in "rossReference"; predicate "equal", value "CYC_HUMAN"]
Demo and/or Screenshots
- ID mapping: Find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820
- 2nd Step: Find all the cross-references for the protein found: RESPONSE
[Response XML; markup lost in extraction. Namespace fragment: ".uk/namespaces/2003/07/gds/types". Recoverable database / identifier pairs:]
- ...
- EMBL: M22877
- ...
- PIR: CCHU
- ...
- GenPept: AAH09579
- ...
- NCBI GI: 14250124
- ...
Lessons Learned / Technical Difficulties
- Before starting
  - Read the caCORE SDK Programmer's Guide, especially Chapter 5, Generating UML Models
  - Download and install caCORE SDK
  - Keep in mind that caGRID and caCORE SDK are under development
- Use case development
  - Identify caCORE SDK limitations
Lessons Learned / Technical Difficulties
- Object modeling: the most important step
- Considerations
  - Scientific meaning: if possible, don't do record modeling
  - Use cases: consider search criteria objects
  - caCORE SDK constraints: consider naming conventions, id attribute constraints, and supported collection types (e.g. List is not supported)
  - Data related constraints: include only associations or objects based on your data (e.g. Gene to Protein, but not Protein to DNASequence)
  - Semantics: express semantics and avoid using type attributes (e.g. ProteinFeature subclasses, Lineage)
- Ask for a VCDE representative before you begin, and employ an iterative approach
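The "express semantics and avoid type attributes" point can be illustrated with a small sketch. Rather than one feature class with a free-text `type` field, each kind of feature is its own subclass, so the class itself carries the meaning. The subclass names, attributes and sample positions below are hypothetical; only the ProteinFeature base name comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinFeature:
    """Base class: attributes shared by every kind of feature."""
    start: int
    end: int

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class DomainFeature(ProteinFeature):
    """Hypothetical subclass: a domain, with its own attribute."""
    domain_name: str = ""


@dataclass
class BindingSiteFeature(ProteinFeature):
    """Hypothetical subclass: a binding site, with its own attribute."""
    ligand: str = ""


# Hypothetical sample features.
features: List[ProteinFeature] = [
    DomainFeature(start=10, end=44, domain_name="BRCA2 repeat"),
    BindingSiteFeature(start=50, end=50, ligand="heme"),
]

# Selection by class replaces string comparison on a type attribute.
domains = [f for f in features if isinstance(f, DomainFeature)]
print([d.domain_name for d in domains], domains[0].length())
```

Each subclass can then be annotated with its own concept, which a shared `type` string would hide.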
Lessons Learned / Technical Difficulties
- Semantic annotation / caDSR load
  - Enter tagged values while modeling
  - Review the semantic annotation report to check classifications of Concepts (ObjectClass, ObjectClassQualifier, Property, PropertyQualifier)
  - Ensure all the Concepts are included in the Annotated Model before you send it to caDSR
- Code generation / Grid deployment
  - Using caCORE SDK helped in grid-enabling PIR
- Database Performance
- In caCORE SDK generated SQL statements, all
string fields are lower-cased additional
function-based indexes are needed - select from. where lower(PRIMARYACCESSION)
'p88888 - Object-to-relational mapping
- New id attribute constraint introduced in
caCORE SDK 1.0.3 - Option 1 Change the database schema ,add id of
type Integer to all the tables (caBIO) - Option 2 Map existing primary keys as id, so
id can be String or Integer (PIR)
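The database performance point can be demonstrated with a minimal sketch. An ordinary index on the column does not help a query that filters on `lower(column)`, whereas an index built on the same expression does. SQLite is used here purely as a stand-in because it ships with Python; the slides do not name the database product, and the table, index name and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (primaryaccession TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("P88888", "EXAMPLE_1"), ("P99999", "CYC_HUMAN")],  # hypothetical rows
)

# Function-based (expression) index matching the generated WHERE clause.
conn.execute("CREATE INDEX idx_protein_acc_lower ON protein (lower(primaryaccession))")

query = "SELECT name FROM protein WHERE lower(primaryaccession) = ?"

# Ask the planner how it would run the query; the last column is the plan text.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("p88888",)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
rows = conn.execute(query, ("p88888",)).fetchall()

print(plan_text)
print(rows)
```

The plan text should name `idx_protein_acc_lower`, which indicates the lower-cased lookup is served by the expression index rather than a full table scan.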
Wish List
- Semantic annotation
  - Excel spreadsheets are error-prone; a tool that will make semantic annotation easier
    - Traversing the object model
    - Checking errors (wrong classifications, missing annotations, etc.)
- caCORE SDK
  - Google-like/any field search
  - Allow optional exact string search
  - Use of a logger instead of System.out or System.err
  - Support queries involving class inheritance
  - Support Boolean queries involving multiple collection type attributes
- caGRID
  - Training opportunities for developers
  - Documentation
Acknowledgements
Georgetown University: Cathy Wu (Faculty Lead), Hongzhan Huang (Chief Architect), Peter McGarvey (Domain Expert), Baris Suzek (Project Manager), Sehee Chung (SW Developer), Hsing-Kuo Hua (DB Developer), Jess Cannata (System Admin), Robert Clarke, Steve Moore, Arnie Miles
Panther Informatics: Brian Gilman (Consultant)
NCI Center for Bioinformatics: Peter Covitz, George Komatsoulis, Avinash Shanbhag, Tara Akhavan, William Sanchez, Manav Kher, Jijin Yan, Clarie Wolfe, Nicole Thomas, Himanso Sahni, Jennifer Zeng, Nafis Zebarjani
- 3rd Millennium
- Juli Klemm
- Brian Davis
- BAH
- Mark Adams
- Arumani Manisundaram
University of Pennsylvania - BMIF: David Fenstermacher, Craig Street, Vishal Nayak, Casey Overby