Title: caGrid Version 0.5 Reference Implementation: Grid-Enablement of Protein Information Resource (PIR), caBIG
1 caGrid Version 0.5 Reference Implementation: Grid-Enablement of Protein Information Resource (PIR)
caBIG Architecture Face-to-Face Workspace, Georgetown University, August 16-18, 2005
Baris Ethem Suzek
Georgetown University - Lombardi Cancer Center PIR
bes23_at_georgetown.edu
2 Outline
Introduction
High Level Overview of Grid-Enablement of PIR
Data Model
Project Architecture
Process of getting to Silver level compliance
Functionality Exposed to Grid
Process of Grid Enablement
Demo/Screenshots
Lessons Learned / Technical Difficulties / Wish List
Acknowledgements
3 Introduction
Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research
UniProt: Universal Protein Resource, Central Resource of Protein Sequence and Function
PIRSF: Family Classification System, Protein Classification and Functional Annotation
iProClass: Integrated Protein Knowledgebase, Data Integration and Functional Analysis
http://pir.georgetown.edu
4 Introduction
UniProt Universal Protein Resource - Central Resource of Protein Sequence and Function
International Consortium
PIR at GUMC
European Bioinformatics Institute (EBI)
Swiss Institute of Bioinformatics (SIB)
Unifies the PIR-PSD, Swiss-Prot, and TrEMBL protein sequence databases
http://www.uniprot.org
5 Introduction
UniProt Databases
Primary data source for Grid-Enablement of PIR
6 Project Overview
Project Goal: Provide the most comprehensive and fully annotated protein-related information for genomic and proteomic cancer research
Major Functionality: Provide methods to query and retrieve protein-related information for the cancer research community
Being developed in two phases
First phase: Jan 1 - Aug 1, 2005
Demonstrate that the PIR data source can be discovered and consumed in caBIG
Make the PIR data source caBIG Silver-compliant
Originally, the final product was to be a set of web services
On June 13, 2005, the project changed direction to deliver a grid service
7 Data Model
Current total: 48 objects, 51 attributes
8 Data Model
Protein/Gene related objects
9 Data Model
Annotation related objects Protein Features
10 Data Model
Taxonomy related objects (Proposed as Taxonomy CDE)
11 Project Architecture
Layers shown in the diagram: Client; Data Access Objects / PIR Data Service; Object-to-Relational Mapping; Domain Objects; UniProtKB Database
12 Process to Silver Compliance
Month 1: Use cases developed with adopter from UPenn
Setting Search Criteria
Simple Search: based on an individual field (UniProtKB/PIR ID or accession number, NCBI Taxonomy ID, protein name, etc.)
Advanced Search: based on two fields combined with the Boolean operators AND, OR, and AND_NOT
All-ID Search: a Google-like search across the identifier fields when the source of the identifier is not known
Batch Retrieval: using multiple UniProtKB IDs or accessions
Setting Response Criteria
UniProtKB XML or FASTA
Month 2: Test approach document approved
Month 3-4: Requirements and specifications document approved
Month 4: Database schema developed (reverse engineered)
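The search and response use cases above (Simple Search, Advanced Search with AND, OR, and AND_NOT, Batch Retrieval, and FASTA output) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the PIR service code: the `ProteinRecord` fields, the helper names, and the sample records with `X0000n` accessions are all hypothetical, and the FASTA header is simplified.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ProteinRecord:
    # Hypothetical, simplified record; not the PIR object model.
    accession: str
    name: str
    taxonomy_id: int
    sequence: str


Predicate = Callable[[ProteinRecord], bool]


def field_equals(field_name: str, value) -> Predicate:
    """Simple Search: match a single field against a single value."""
    return lambda rec: getattr(rec, field_name) == value


def name_contains(text: str) -> Predicate:
    """Simple Search on the protein name, case-insensitive substring match."""
    return lambda rec: text.lower() in rec.name.lower()


def combine(left: Predicate, operator: str, right: Predicate) -> Predicate:
    """Advanced Search: combine two field criteria with one Boolean operator."""
    if operator == "AND":
        return lambda rec: left(rec) and right(rec)
    if operator == "OR":
        return lambda rec: left(rec) or right(rec)
    if operator == "AND_NOT":
        return lambda rec: left(rec) and not right(rec)
    raise ValueError(f"unsupported operator: {operator}")


def search(records: Iterable[ProteinRecord], predicate: Predicate) -> List[ProteinRecord]:
    return [rec for rec in records if predicate(rec)]


def batch_retrieve(records: Iterable[ProteinRecord], accessions: Iterable[str]) -> List[ProteinRecord]:
    """Batch Retrieval: return records in the requested order, skipping unknown accessions."""
    by_accession = {rec.accession: rec for rec in records}
    return [by_accession[a] for a in accessions if a in by_accession]


def to_fasta(rec: ProteinRecord, width: int = 60) -> str:
    """Response criteria: render one record as FASTA with wrapped sequence lines."""
    lines = [f">{rec.accession} {rec.name}"]
    lines += [rec.sequence[i:i + width] for i in range(0, len(rec.sequence), width)]
    return "\n".join(lines)


# Made-up sample data; 9606 and 10090 are the NCBI Taxonomy IDs for human and mouse.
RECORDS = [
    ProteinRecord("X00001", "Example kinase", 9606, "MKT" * 25),
    ProteinRecord("X00002", "Example kinase", 10090, "MSA"),
    ProteinRecord("X00003", "Example repeat protein", 9606, "MPI"),
]
```

For example, `search(RECORDS, combine(field_equals("taxonomy_id", 9606), "AND_NOT", name_contains("kinase")))` selects the human records whose name does not mention "kinase". Representing each criterion as a predicate keeps the three Boolean operators a matter of composition rather than separate query paths.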
13 Process to Silver Compliance
Month 4.5: First version of the object model and semantic annotation report
UniProtKB record-based object model: 70 objects, 130 attributes
14 Process to Silver Compliance
Month 5: First version of the web services developed and tested
Month 5.5 (June 13): Feedback from caCORE team on the first version of the object model
Design a protein domain model rather than a UniProtKB record-based model
Use caCORE SDK to generate the middle layer
A new page for the Grid-Enablement of PIR project: web services development became unnecessary for grid deployment, so more emphasis was put on object modeling
15 Process to Silver Compliance
June 17: Second version of the object model submitted for VCDE/caCORE Workspace review
16 Process to Silver Compliance
June 22: Review of the object model with VCDE and caCORE teams
Use bidirectional associations unless necessary
Use java.util.Collection rather than List
Follow caCORE SDK naming conventions
Replace complex data types (i.e. Lists) with association classes
Implement subclasses of Feature to express semantics
June 24: Third version of the object model submitted for VCDE/caCORE Workspace review
June 28: Semantic annotations for the first version of the object model received from VCDE team
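Two of the review recommendations above, replacing list-valued attributes with associated classes and using subclasses of Feature to carry semantics, can be illustrated with a small sketch. It is written in Python for brevity rather than in the Java generated by caCORE SDK, and every class and attribute name other than `Feature` is a hypothetical stand-in, not part of the reviewed PIR object model.

```python
from dataclasses import dataclass, field
from typing import List, Type


@dataclass
class Feature:
    """Base class for a positional annotation on a protein sequence."""
    start: int
    end: int
    description: str = ""

    def length(self) -> int:
        # Inclusive coordinates: positions start..end.
        return self.end - self.start + 1


@dataclass
class DomainFeature(Feature):
    """Hypothetical subclass: the type itself says what kind of feature this is."""
    domain_accession: str = ""


@dataclass
class Keyword:
    """Hypothetical class standing in for what would otherwise be a bare list of strings."""
    value: str


@dataclass
class Protein:
    accession: str
    # Associations to other model classes instead of list-typed primitive attributes.
    keywords: List[Keyword] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)

    def features_of_type(self, feature_type: Type[Feature]) -> List[Feature]:
        """Select features by class, so semantics come from the subclass."""
        return [f for f in self.features if isinstance(f, feature_type)]
```

With this shape, a client asking for `protein.features_of_type(DomainFeature)` needs no free-text "type" attribute to interpret, which is the point of expressing semantics through subclasses.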
17 Process to Silver Compliance
June 30: Object model finalized
18 Process to Silver Compliance
July 2: Semantic annotation partially done using the annotations of the first object model
July 11: Code generated using caCORE SDK 1.0.2; ORM completed; first version of Silver-level application deployed
July 21: Semantic annotation completed (149 concepts)
19 Process to Silver Compliance
July 22: caCORE SDK 1.0.3 became public
July 25: Grid node deployed for caCORE SDK 1.0.2 generated API
July 25: Migration to caCORE SDK 1.0.3
July 28: Object model loaded into caDSR staging server
August 3: Object model and semantic annotation modified for caCORE SDK 1.0.3 (id attributes)
August 4: Code generated using caCORE SDK 1.0.3; ORM modified; current version of Silver-level application deployed
August 5: Grid node deployed for caCORE SDK 1.0.3 generated API
August 8: Object model loaded into caDSR production server
20 Functionality Exposed to the Grid
The Grid-Enablement of PIR project is a data service
All the objects in our model are exposed to the grid
Example queries
Find the proteins for the gene BRCA2 (Breast Cancer Gene 2)
Find all the proteins that contain the domain BRCA2 repeat (Pfam: PF00634, a domain in Breast cancer type 2 susceptibility protein)
ID mapping: Find all the database cross-references from various databases corresponding to RefSeq accession NP_009225
Because all the PIR and UniProt databases are public, no security layer was implemented
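The three example queries above can be sketched against an in-memory object graph. This Python sketch only illustrates the shape of the queries; it does not use the caGrid or caCORE SDK query API. The `Protein` class, the function names, the `X0000n` accessions, the `GENE2` gene name, and the `ExampleDB`/`OtherDB` cross-references are hypothetical sample data; only BRCA2, PF00634, and NP_009225 come from the slides, and the pairing of NP_009225 with a sample record is arbitrary.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

CrossReference = Tuple[str, str]  # (database name, identifier)


@dataclass
class Protein:
    # Hypothetical flattened view of a protein and its associations.
    accession: str
    gene_names: List[str]
    domain_ids: List[str]
    cross_references: List[CrossReference]


# Made-up records used only to exercise the queries.
PROTEINS = [
    Protein("X00001", ["BRCA2"], ["PF00634"], [("ExampleDB", "E-1")]),
    Protein("X00002", ["GENE2"], [],
            [("RefSeq", "NP_009225"), ("ExampleDB", "E-2"), ("OtherDB", "O-9")]),
]


def proteins_for_gene(proteins: Iterable[Protein], gene_name: str) -> List[Protein]:
    """Query 1: proteins associated with a given gene name."""
    return [p for p in proteins if gene_name in p.gene_names]


def proteins_with_domain(proteins: Iterable[Protein], domain_id: str) -> List[Protein]:
    """Query 2: proteins that contain a given domain."""
    return [p for p in proteins if domain_id in p.domain_ids]


def map_identifier(proteins: Iterable[Protein], database: str, identifier: str) -> List[CrossReference]:
    """Query 3 (ID mapping): the other cross-references of every protein carrying the given one."""
    key = (database, identifier)
    mapped: List[CrossReference] = []
    for p in proteins:
        if key in p.cross_references:
            mapped.extend(x for x in p.cross_references if x != key)
    return mapped
```

A call such as `map_identifier(PROTEINS, "RefSeq", "NP_009225")` walks from the known identifier to the protein and back out to its remaining cross-references, which is the navigation the ID mapping query describes.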
21 Process of Grid Enablement
Download prerequisites (CVSGrab, Ant, JDK 1.4.2_04)
Set environment variables (JAVA_HOME, ANT_HOME, CVSGRAB_HOME)
Run the Ant script bootstrap.xml provided by the caGRID team
Set more environment variables (CATALINA_HOME, OGSADAI_LOCATION, GLOBUS_LOCATION)
Copy the client.jar generated by caCORE SDK to the correct data services directory
Change the configuration so that it points to the HTTP server URL for your data service
Deploy the service with ant -f bootstrap.xml deployDS
Start the Tomcat server
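Before running the deployment step, it can help to confirm that the environment variables named in the steps above are all set. The following is a minimal pre-flight sketch in Python, not part of the caGRID tooling: the function names are hypothetical, the variable list is taken from the steps above, and the command it assembles assumes the build file is invoked through Ant's standard `-f` option.

```python
import os
from typing import List, Mapping

# Environment variables named in the grid-enablement steps.
REQUIRED_VARIABLES = (
    "JAVA_HOME",
    "ANT_HOME",
    "CVSGRAB_HOME",
    "CATALINA_HOME",
    "OGSADAI_LOCATION",
    "GLOBUS_LOCATION",
)


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def deploy_command(build_file: str = "bootstrap.xml", target: str = "deployDS") -> List[str]:
    """Build the argument list for the Ant deployment step (does not run it)."""
    return ["ant", "-f", build_file, target]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Set these variables first:", ", ".join(missing))
    else:
        print("Ready to run:", " ".join(deploy_command()))
```

The sketch only reports what is missing and prints the command; actually invoking Ant is left to the operator so that a misconfigured environment is caught before anything is deployed.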
22 Demo and/or Screenshots
Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2)
[Query XML shown on the slide; only the fragment with value BRCA2 survives]
23 Demo and/or Screenshots
Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2): response
Use of a logger instead of System.out or System.err
Support queries involving class inheritance
Support Boolean queries involving multiple collection type attributes
caGRID
Training opportunities for developers
Documentation
35 Acknowledgements
Georgetown University: Cathy Wu (Faculty Lead), Hongzhan Huang (Chief Architect), Peter McGarvey (Domain Expert), Baris Suzek (Project Manager), Sehee Chung (SW Developer), Hsing-Kuo Hua (DB Developer), Jess Cannata (System Admin), Robert Clarke, Steve Moore, Arnie Miles
Panther Informatics: Brian Gilman (Consultant)
NCI Center for Bioinformatics: Peter Covitz, George Komatsoulis, Avinash Shanbhag, Tara Akhavan, William Sanchez, Manav Kher, Jijin Yan, Clarie Wolfe, Nicole Thomas, Himanso Sahni, Jennifer Zeng, Nafis Zebarjani
3rd Millenium: Juli Klemm, Brian Davis
BAH: Mark Adams, Arumani Manisundaram
University of Pennsylvania - BMIF: David Fenstermacher, Craig Street, Vishal Nayak, Casey Overby