Title: caGrid Version 0.5 Reference Implementation: Grid-Enablement of Protein Information Resource (PIR)
caBIG Architecture Face-to-Face Workspace, Georgetown University, August 16-18, 2005
Baris Ethem Suzek
Georgetown University Lombardi Cancer Center PIR
bes23@georgetown.edu
2 Outline
Introduction
High-Level Overview of Grid-Enablement of PIR
Data Model
Project Architecture
Process of getting to Silver level compliance
Functionality Exposed to Grid
Process of Grid Enablement
Demo/Screenshots
Lessons Learned / Technical Difficulties / Wish List
Acknowledgements
3 Introduction
Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research
UniProt (Universal Protein Resource): Central Resource of Protein Sequence and Function
PIRSF (Family Classification System): Protein Classification and Functional Annotation
iProClass (Integrated Protein Knowledgebase): Data Integration and Functional Analysis
http://pir.georgetown.edu
4 Introduction
UniProt (Universal Protein Resource): Central Resource of Protein Sequence and Function
International Consortium
PIR at GUMC
European Bioinformatics Institute EBI
Swiss Institute of Bioinformatics SIB
Unifies PIR-PSD, Swiss-Prot, and TrEMBL Protein Sequence Databases
http://www.uniprot.org
5 Introduction
UniProt Databases
Primary data source for Grid-Enablement of PIR
6 Project Overview
Project Goal: Provide the most comprehensive and fully annotated protein-related information for genomic and proteomic cancer research
Major Functionality: Provide methods to query and retrieve protein-related information for the cancer research community
Being developed in two phases
First phase: Jan 1 - Aug 1, 2005
Demonstrate that the PIR data source can be discovered and consumed in caBIG
Make the PIR data source caBIG silver-compliant
Originally, the final product was to be a set of web services
On June 13, 2005, the project changed direction to become a grid service
7 Data Model
Current Total: 48 objects, 51 attributes
8 Data Model
Protein/Gene-related objects
9 Data Model
Annotation-related objects (Protein Features)
10 Data Model
Taxonomy-related objects (proposed as Taxonomy CDE)
11 Project Architecture
[Architecture diagram; labels: Client, Data Access Objects/PIR Data Service, Object-to-Relational Mapping, Domain Objects, UniProtKB Database]
12 Process to Silver Compliance
Month 1: Use cases developed with adopter from UPenn
Setting Search Criteria
Simple Search: based on an individual field (UniProtKB PIR ID or accession number, NCBI Taxonomy ID, protein name, etc.)
Advanced Search: based on two fields combined with Boolean operators (AND, OR, and AND_NOT)
All-ID Search: a Google-like search over the identifier fields, for use when the source of an identifier is not known
Batch Retrieval: using multiple UniProtKB IDs or accessions
Setting Response Criteria
UniProtKB XML or FASTA
Month 2: Test approach document approved
Month 3-4: Requirements and specifications document approved
Month 4: Database schema developed (reverse engineered)
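The four search modes from the Month 1 use cases can be sketched roughly as follows. This is a minimal illustration, not the PIR implementation: the field names, the placeholder records, and the case-insensitive matching are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical field names and placeholder values, for illustration only.
ID_FIELDS = ("uniprot_id", "accession")
SAMPLE_RECORDS: List[Record] = [
    {"uniprot_id": "ID_A", "accession": "ACC_A", "taxonomy_id": "1", "protein_name": "alpha"},
    {"uniprot_id": "ID_B", "accession": "ACC_B", "taxonomy_id": "1", "protein_name": "beta"},
    {"uniprot_id": "ID_C", "accession": "ACC_C", "taxonomy_id": "2", "protein_name": "alpha"},
]


@dataclass(frozen=True)
class FieldCriterion:
    """Simple Search: compare one field with one value (case-insensitive)."""
    field: str
    value: str

    def matches(self, record: Record) -> bool:
        return record.get(self.field, "").lower() == self.value.lower()


@dataclass(frozen=True)
class BooleanCriterion:
    """Advanced Search: two field criteria joined by AND, OR, or AND_NOT."""
    left: FieldCriterion
    operator: str
    right: FieldCriterion

    def matches(self, record: Record) -> bool:
        left, right = self.left.matches(record), self.right.matches(record)
        if self.operator == "AND":
            return left and right
        if self.operator == "OR":
            return left or right
        if self.operator == "AND_NOT":
            return left and not right
        raise ValueError(f"unsupported operator: {self.operator}")


def search(records: Iterable[Record], criterion) -> List[Record]:
    """Return the records accepted by a simple or Boolean criterion."""
    return [r for r in records if criterion.matches(r)]


def all_id_search(records: Iterable[Record], identifier: str) -> List[Record]:
    """All-ID Search: try the identifier against every identifier field."""
    wanted = identifier.lower()
    return [r for r in records
            if any(r.get(f, "").lower() == wanted for f in ID_FIELDS)]


def batch_retrieve(records: Iterable[Record], identifiers: Iterable[str]) -> List[Record]:
    """Batch Retrieval: fetch every record whose ID or accession is listed."""
    wanted = {i.lower() for i in identifiers}
    return [r for r in records
            if any(r.get(f, "").lower() in wanted for f in ID_FIELDS)]
```

Modeling each criterion as a small object keeps the Boolean combination separate from the field comparison, which is one way to read the later advice to "consider search criteria objects".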
13 Process to Silver Compliance
Month 4.5: First version of the object model and semantic annotation report; UniProtKB record-based object model (70 objects, 130 attributes)
14 Process to Silver Compliance
Month 5: First version of the web services developed and tested
Month 5.5 (June 13): Feedback from caCORE team on the first version of the object model
Design a protein domain model rather than a UniProtKB record-based model
Use caCORE SDK to generate the middle layer
A new page in the Grid-Enablement of PIR project: web services development became unnecessary for grid deployment, so more emphasis was put on object modeling
15 Process to Silver Compliance
June 17: Second version of the object model submitted for VCDE/caCORE Workspace review
16 Process to Silver Compliance
June 22: Review object model with VCDE and caCORE teams
Use bidirectional associations unless necessary
Use java.util.Collection rather than List
Follow caCORE SDK naming conventions
Replace complex data types (i.e., Lists) with association classes
Implement subclasses of Feature to express semantics
June 24: Third version of the object model submitted for VCDE/caCORE Workspace review
June 28: Semantic annotations for the first version of the object model received from VCDE team
17 Process to Silver Compliance
June 30: Object model finalized
18 Process to Silver Compliance
July 2: Semantic annotation partially done using the annotations of the first object model
July 11: Code generated using caCORE SDK 1.0.2; ORM completed; first version of silver-level application deployed
July 21: Semantic annotation completed (149 Concepts)
19 Process to Silver Compliance
July 22: caCORE SDK 1.0.3 became public
July 25: Grid node deployed for caCORE SDK 1.0.2 generated API
July 25: Migration to caCORE SDK 1.0.3
July 28: Object model loaded into caDSR staging server
August 3: Object model and semantic annotation modified for caCORE SDK 1.0.3 id attributes
August 4: Code generated using caCORE SDK 1.0.3; ORM modified; current version of silver-level application deployed
August 5: Grid node deployed for caCORE SDK 1.0.3 generated API
August 8: Object model loaded into caDSR production server
20 Functionality Exposed to the Grid
The Grid-Enablement of PIR project is a data service
All the objects in our model are exposed to the grid
Example queries
Find the proteins for the gene BRCA2 (Breast Cancer Gene 2)
Find all the proteins that contain the domain BRCA2 repeat (Pfam: PF00634), a domain in Breast cancer type 2 susceptibility protein
ID mapping: Find all the database cross-references from various databases corresponding to RefSeq accession NP_009225
Since all the PIR and UniProt databases are public, no security layers are implemented
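The three example queries above amount to navigating associations between domain objects. The sketch below shows that idea on a toy in-memory model. The class names, attribute names, and placeholder accessions are hypothetical and do not reproduce the actual PIR object model or the caGrid query API; only BRCA2, PF00634, and NP_009225 come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrossReference:
    database: str
    identifier: str


@dataclass
class Protein:
    accession: str
    domain_ids: List[str] = field(default_factory=list)
    cross_references: List[CrossReference] = field(default_factory=list)


@dataclass
class Gene:
    name: str
    proteins: List[Protein] = field(default_factory=list)


def proteins_for_gene(genes: List[Gene], gene_name: str) -> List[Protein]:
    """Follow the Gene -> Protein association for one gene name."""
    return [p for g in genes if g.name == gene_name for p in g.proteins]


def proteins_with_domain(proteins: List[Protein], domain_id: str) -> List[Protein]:
    """Keep proteins annotated with the given domain identifier."""
    return [p for p in proteins if domain_id in p.domain_ids]


def map_identifier(proteins: List[Protein], database: str, identifier: str) -> List[CrossReference]:
    """ID mapping: return the other cross-references of matching proteins."""
    key = CrossReference(database, identifier)
    return [x for p in proteins if key in p.cross_references
            for x in p.cross_references if x != key]


# Placeholder data; accessions and the EXAMPLE_DB entries are invented.
p1 = Protein("PLACEHOLDER_1", ["PF00634"],
             [CrossReference("RefSeq", "NP_009225"), CrossReference("EXAMPLE_DB", "EX_1")])
p2 = Protein("PLACEHOLDER_2", [], [CrossReference("EXAMPLE_DB", "EX_2")])
SAMPLE_PROTEINS = [p1, p2]
SAMPLE_GENES = [Gene("BRCA2", [p1]), Gene("PLACEHOLDER_GENE", [p2])]
```

For instance, `map_identifier(SAMPLE_PROTEINS, "RefSeq", "NP_009225")` returns the remaining cross-references of the matching placeholder protein.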
21 Process of Grid Enablement
Download prerequisites: CVSGrab, Ant, JDK 1.4.2_04
Set environment variables: JAVA_HOME, ANT_HOME, CVSGRAB_HOME
Run the ant script bootstrap.xml provided by the caGRID team
Set more environment variables: CATALINA_HOME, OGSADAI_LOCATION, GLOBUS_LOCATION
Copy the client.jar generated by caCORE SDK to the correct data services directory
Change the configuration so that it points to the HTTP server URL for your data service
Deploy the service with ant -f bootstrap.xml deployDS
Start the Tomcat server
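Before the deploy step, it can help to confirm that every environment variable named above is set. The following is a small pre-flight sketch under that assumption; the helper names are hypothetical, and the script only builds the ant command line from the slide rather than running it.

```python
import os
from typing import List, Mapping

# Variables listed in the grid-enablement steps.
REQUIRED_VARS = (
    "JAVA_HOME", "ANT_HOME", "CVSGRAB_HOME",
    "CATALINA_HOME", "OGSADAI_LOCATION", "GLOBUS_LOCATION",
)


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def deploy_command(build_file: str = "bootstrap.xml", target: str = "deployDS") -> List[str]:
    """Build the argument list for the deploy step (not executed here)."""
    return ["ant", "-f", build_file, target]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Set these variables before deploying:", ", ".join(missing))
    else:
        print("Ready to run:", " ".join(deploy_command()))
```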
22 Demo and/or Screenshots
Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2)
valueBRCA2/
23 Demo and/or Screenshots
Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2) [RESPONSE]
Keep in mind that caGRID and caCORE SDK are under development
Use case development
Identify caCORE SDK limitations
31 Lessons Learned / Technical Difficulties
Object modeling is the most important step
Considerations
Scientific meaning: if possible, don't do record modeling
Use cases: Consider search criteria objects
caCORE SDK constraints: Consider naming conventions, id attribute constraints, and supported collection types (e.g., List is not supported)
Data-related constraints: Include only associations or objects based on your data (e.g., Gene to Protein but not Protein to DNASequence)
Semantics: Express semantics and avoid using type attributes (e.g., ProteinFeature subclasses, Lineage)
Ask for a VCDE representative before you begin; employ an iterative approach
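The "Semantics" consideration above (prefer subclasses over a type attribute) can be illustrated with a short sketch. The subclass names and attributes here are hypothetical and are not taken from the PIR object model; the point is only that the kind of feature lives in the class hierarchy rather than in a string field.

```python
from dataclasses import dataclass
from typing import List, Type


@dataclass
class ProteinFeature:
    """Base class: a region of a protein sequence."""
    begin: int
    end: int


@dataclass
class DomainFeature(ProteinFeature):
    """Hypothetical subclass carrying domain-specific attributes."""
    domain_id: str


@dataclass
class ModifiedResidueFeature(ProteinFeature):
    """Hypothetical subclass carrying modification-specific attributes."""
    modification: str


def features_of(features: List[ProteinFeature], feature_class: Type[ProteinFeature]) -> List[ProteinFeature]:
    """Select features by class instead of comparing a 'type' string."""
    return [f for f in features if isinstance(f, feature_class)]


# Placeholder positions and values, for illustration only.
SAMPLE_FEATURES: List[ProteinFeature] = [
    DomainFeature(10, 40, "PF00634"),
    ModifiedResidueFeature(25, 25, "placeholder modification"),
]
```

With this layout each subclass can be annotated with its own concept, whereas a single class with a free-text type attribute would hide that meaning in data values.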
32 Lessons Learned / Technical Difficulties
Semantic annotation/caDSR load
Enter tagged values while modeling
Review the semantic annotation report to check classifications of Concepts (ObjectClass, ObjectClassQualifier, Property, PropertyQualifier)
Ensure all the Concepts are included in the annotated model before you send it to caDSR
Code generation / Grid deployment
Using caCORE SDK helped in grid-enabling PIR
33 Lessons Learned / Technical Difficulties
Database Performance
In caCORE SDK generated SQL statements, all string fields are lowercased; additional function-based indexes are needed
select ... from ... where lower(PRIMARYACCESSION) = 'p88888'
Object-to-relational mapping
New id attribute constraint introduced in caCORE SDK 1.0.3
Option 1: Change the database schema; add id of type Integer to all the tables (caBIO)
Option 2: Map existing primary keys as id, so id can be String or Integer (PIR)
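The database performance point above (queries that wrap a column in lower() need a matching function-based index) can be reproduced on a small scale. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in; the slides do not name the production database, and the table name, index name, and row contents are assumptions. SQLite calls these "indexes on expressions" and supports them from version 3.9.0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; only the column name and the value come from the slide.
conn.execute("CREATE TABLE protein (primaryaccession TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO protein VALUES ('P88888', 'placeholder protein')")

# A plain index on primaryaccession cannot serve lower(primaryaccession) = ?,
# so an index on the same expression is created as well.
conn.execute("CREATE INDEX idx_protein_lower_acc ON protein (lower(primaryaccession))")

query = "SELECT primaryaccession FROM protein WHERE lower(primaryaccession) = ?"
rows = conn.execute(query, ("p88888",)).fetchall()

# Inspect how SQLite plans to run the lowercased lookup.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("p88888",)).fetchall()

if __name__ == "__main__":
    print(rows)
    for step in plan:
        print(step)
```

Printing the plan is a quick way to check whether the expression index is picked up; the exact plan text varies between SQLite versions.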
34 Wish List
Semantic annotation
Excel spreadsheets are error-prone; a tool that would make semantic annotation easier
Traversing the object model
Checking for errors (wrong classifications, missing annotations, etc.)
caCORE SDK
Google-like/any-field search
Allow optional exact string search
Use of a logger instead of System.out or System.err
Support queries involving class inheritance
Support Boolean queries involving multiple collection type attributes
caGRID
Training opportunities for developers
Documentation
35 Acknowledgements
Georgetown University: Cathy Wu (Faculty Lead), Hongzhan Huang (Chief Architect), Peter McGarvey (Domain Expert), Baris Suzek (Project Manager), Sehee Chung (SW Developer), HsingKuo Hua (DB Developer), Jess Cannata (System Admin), Robert Clarke, Steve Moore, Arnie Miles
Panther Informatics: Brian Gilman (Consultant)
NCI Center for Bioinformatics: Peter Covitz, George Komatsoulis, Avinash Shanbhag, Tara Akhavan, William Sanchez, Manav Kher, Jijin Yan, Clarie Wolfe, Nicole Thomas, Himanso Sahni, Jennifer Zeng, Nafis Zebarjani
3rd Millenium: Juli Klemm, Brian Davis
BAH: Mark Adams, Arumani Manisundaram
University of Pennsylvania BMIF: David Fenstermacher, Craig Street, Vishal Nayak, Casey Overby