caGrid Version 0.5 Reference Implementation: Grid-Enablement of Protein Information Resource (PIR), caBIG - PowerPoint PPT Presentation

Slides: 36
Provided by: ArumaniMan8
1
caGrid Version 0.5 Reference Implementation
Grid-Enablement of Protein Information Resource (PIR)
caBIG Architecture Face-to-Face Workspace
Georgetown University, August 16-18, 2005
  • Baris Ethem Suzek
  • Georgetown University - Lombardi Cancer Center
    PIR
  • bes23_at_georgetown.edu

2
Outline
  • Introduction
  • High Level Overview of Grid-Enablement of PIR
  • Data Model
  • Project Architecture
  • Process of getting to Silver level compliance
  • Functionality Exposed to Grid
  • Process of Grid Enablement
  • Demo/Screenshots
  • Lessons Learned / Technical Difficulties / Wish
    List
  • Acknowledgements

3
Introduction
  • Protein Information Resource (PIR): Integrated
    Protein Informatics Resource for
    Genomic/Proteomic Research
  • UniProt: Universal Protein Resource, Central
    Resource of Protein Sequence and Function
  • PIRSF: Family Classification System, Protein
    Classification and Functional Annotation
  • iProClass: Integrated Protein Knowledgebase, Data
    Integration and Functional Analysis

http://pir.georgetown.edu
4
Introduction
  • UniProt: Universal Protein Resource - Central
    Resource of Protein Sequence and Function
  • International Consortium:
  • PIR at GUMC
  • European Bioinformatics Institute (EBI)
  • Swiss Institute of Bioinformatics (SIB)
  • Unifies PIR-PSD, Swiss-Prot, TrEMBL Protein
    Sequence Databases

http://www.uniprot.org
5
Introduction
  • UniProt Databases

Primary data source for Grid-Enablement of PIR
6
Project Overview
  • Project Goal: Provide the most comprehensive and
    fully annotated protein-related information for
    genomic and proteomic cancer research
  • Major Functionality: Providing methods to query
    and retrieve protein-related information for the
    cancer research community
  • Being developed in two phases
  • First phase: Jan 1 - Aug 1, 2005
  • Demonstrate that the PIR data source can be
    discovered and consumed in caBIG
  • Make the PIR data source caBIG Silver-compliant
  • Originally, the final product was web services
  • June 13, 2005: the project changed direction to
    become a grid service

7
Data Model
  • Current Total: 48 objects, 51 attributes

8
Data Model
  • Protein/Gene related objects

9
Data Model
  • Annotation related objects: Protein Features

10
Data Model
  • Taxonomy related objects (Proposed as Taxonomy
    CDE)

11
Project Architecture
Client
Data Access Objects/PIR Data Service
Object to Relational Mapping
Domain Objects
UniProtKB Database
12
Process to Silver Compliance
  • Month 1: Use cases developed with adopter from
    UPenn
  • Setting Search Criteria:
  • Simple Search: based on an individual field
    (UniProtKB, PIR ID or accession number, NCBI
    Taxonomy ID, Protein name, etc.)
  • Advanced Search: based on two fields combined with
    Boolean operators AND, OR and AND_NOT
  • All-ID Search: a Google-like search across the
    identifier fields when the source of the
    identifier is not known
  • Batch Retrieval: using multiple UniProtKB IDs or
    accessions
  • Setting Response Criteria:
  • UniProtKB XML or FASTA
  • Month 2: Test approach document approved
  • Month 3-4: Requirements and specifications
    document approved
  • Month 4: Database schema developed (reverse
    engineered)
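
The search use cases above can be sketched in a few lines of Python. This is only an illustration of how two field criteria combine under AND, OR and AND_NOT, plus batch retrieval by accession. It is not the PIR service API. The record layout, the field names (`accession`, `name`, `gene`, `taxonomy_id`), the case-insensitive matching and the in-memory sample list are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]

# Boolean operators named on the slide.
OPERATORS: Dict[str, Callable[[bool, bool], bool]] = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "AND_NOT": lambda a, b: a and not b,
}


@dataclass(frozen=True)
class Criterion:
    """One field/value pair; matching is case-insensitive (an assumption)."""
    field_name: str
    value: str

    def matches(self, record: Record) -> bool:
        return record.get(self.field_name, "").lower() == self.value.lower()


def simple_search(records: List[Record], criterion: Criterion) -> List[Record]:
    """Simple Search: filter on a single field."""
    return [r for r in records if criterion.matches(r)]


def advanced_search(records: List[Record], left: Criterion,
                    operator: str, right: Criterion) -> List[Record]:
    """Advanced Search: two criteria combined with one Boolean operator."""
    combine = OPERATORS[operator]
    return [r for r in records if combine(left.matches(r), right.matches(r))]


def batch_retrieve(records: List[Record], accessions: List[str]) -> List[Record]:
    """Batch Retrieval: return records in the order the accessions were given."""
    by_accession = {r["accession"]: r for r in records}
    return [by_accession[a] for a in accessions if a in by_accession]


# Hypothetical in-memory sample; accessions and names appear in the demo slides.
SAMPLE: List[Record] = [
    {"accession": "O35923", "name": "BRCA2_RAT", "gene": "BRCA2", "taxonomy_id": "10116"},
    {"accession": "P51587", "name": "BRCA2_HUMAN", "gene": "BRCA2", "taxonomy_id": "9606"},
    {"accession": "P97929", "name": "BRCA2_MOUSE", "gene": "BRCA2", "taxonomy_id": "10090"},
]

if __name__ == "__main__":
    non_human = advanced_search(
        SAMPLE, Criterion("gene", "BRCA2"), "AND_NOT", Criterion("taxonomy_id", "9606")
    )
    print([r["name"] for r in non_human])
```

Keeping the operators in a lookup table means a new operator is one more entry rather than another branch in the search function.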

13
Process to Silver Compliance
  • Month 4.5: First version of the object model and
    semantic annotation report. UniProtKB
    record-based Object Model: 70 objects, 130
    attributes

14
Process to Silver Compliance
  • Month 5: First version of the web services
    developed and tested
  • Month 5.5, June 13: Feedback from caCORE team on
    the first version of the object model
  • Design a protein domain model, rather than a
    UniProtKB record-based model
  • Use caCORE SDK to generate the middle layer
  • New page in the Grid-enablement of PIR project: web
    services development became unnecessary for grid
    deployment, so more emphasis was put on object
    modeling

15
Process to Silver Compliance
  • June 17: Second version of the object model
    submitted for VCDE/caCORE Workspace review

16
Process to Silver Compliance
  • June 22: Review object model with VCDE and caCORE
    teams
  • Use bidirectional associations, unless necessary
  • Use java.util.Collection rather than List
  • Follow caCORE SDK naming conventions
  • Replace complex data types (e.g. Lists) with
    association classes
  • Implement subclasses of Feature to express
    semantics
  • June 24: Third version of the object model
    submitted for VCDE/caCORE Workspace review
  • June 28: Semantic annotations for the first
    version of the object model received from VCDE
    team

17
Process to Silver Compliance
  • June 30: Object model finalized

18
Process to Silver Compliance
  • July 2: Semantic annotation partially done
    using the annotations of the first object model
  • July 11: Code generated using caCORE SDK 1.0.2,
    ORM completed, first version of Silver-level
    application deployed
  • July 21: Semantic annotation completed (149
    Concepts)

19
Process to Silver Compliance
  • July 22: caCORE SDK 1.0.3 became public
  • July 25: Grid node deployed for caCORE SDK 1.0.2
    generated API
  • July 25: Migration to caCORE SDK 1.0.3
  • July 28: Object model loaded into caDSR staging
    server
  • August 3: Object model and semantic annotation
    modified for caCORE SDK 1.0.3 (id attributes)
  • August 4: Code generated using caCORE SDK 1.0.3,
    ORM modified, current version of Silver-level
    application deployed
  • August 5: Grid node deployed for caCORE SDK 1.0.3
    generated API
  • August 8: Object model loaded into caDSR
    production server

20
Functionality Exposed to the Grid
  • Grid-Enablement of PIR project is a data service
  • All the objects in our model exposed to the grid
  • Example queries
  • Find the proteins for the gene BRCA2 (Breast
    Cancer Gene 2)
  • Find all the proteins that contain the domain
    BRCA2 repeat (PFAM:PF00634, a domain in Breast
    cancer type 2 susceptibility protein)
  • ID mapping: Find all the database
    cross-references from various databases
    corresponding to RefSeq Accession NP_009225
  • Since all the PIR and UniProt databases are
    public, no security layers were implemented

21
Process of Grid Enablement
  • Download prerequisites (CVSGrab, Ant, JDK
    1.4.2_04)
  • Set environment variables (JAVA_HOME, ANT_HOME,
    CVSGRAB_HOME)
  • Run ant script bootstrap.xml provided by caGRID
    team
  • Set more environment variables (CATALINA_HOME,
    OGSADAI_LOCATION, GLOBUS_LOCATION)
  • Copy the client.jar generated by caCORE SDK to
    the correct data services directory
  • Change the configuration so that it points to
    HTTP server URL for your data service
  • Deploy the service with ant -f bootstrap.xml
    deployDS
  • Start Tomcat Server
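
A small pre-flight check can catch a missing environment variable before the Ant targets run. The sketch below only tests whether the variables named above are set and builds the deploy command from the slide. The function names and the split into two stages are assumptions for illustration, and nothing here is part of the caGRID bootstrap script itself.

```python
import os
from typing import List, Mapping, Optional

# Variables needed before running bootstrap.xml (from the slide).
FIRST_STAGE = ["JAVA_HOME", "ANT_HOME", "CVSGRAB_HOME"]
# Variables needed before deploying the data service (from the slide).
SECOND_STAGE = ["CATALINA_HOME", "OGSADAI_LOCATION", "GLOBUS_LOCATION"]


def missing_variables(required: List[str],
                      env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def deploy_command() -> List[str]:
    """The deploy step as an argument list: ant -f bootstrap.xml deployDS."""
    return ["ant", "-f", "bootstrap.xml", "deployDS"]


if __name__ == "__main__":
    for stage, names in (("bootstrap", FIRST_STAGE), ("deploy", SECOND_STAGE)):
        missing = missing_variables(names)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{stage}: {status}")
    print("deploy with:", " ".join(deploy_command()))
```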

22
Demo and/or Screenshots
  • Retrieve the proteins for gene BRCA2 (Breast
    Cancer Gene 2)
  • ... value="BRCA2"/>

23
Demo and/or Screenshots
  • Retrieve the proteins for gene BRCA2 (Breast
    Cancer Gene 2): RESPONSE
  • .uk/namespaces/2003/07/gds/types"
  • ...
  • Primary accession: O35923
  • Name: BRCA2_RAT
  • Primary accession: P51587
  • Name: BRCA2_HUMAN
  • ...
  • MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPY
    NSEPAEES
  • EHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDK
  • ...
  • Primary accession: P97929
  • Name: BRCA2_MOUSE
  • ...
  • </result>

24
Demo and/or Screenshots
  • Find all the proteins that contain the domain
    BRCA2 repeat (PFAM:PF00634, a domain in Breast
    cancer type 2 susceptibility protein)
  • .DatabaseCrossReference"
  • predicate="equal" value="PF00634"/>

25
Demo and/or Screenshots
  • Find all the proteins that contain the domain
    BRCA2 repeat (PFAM:PF00634, a domain in Breast
    cancer type 2 susceptibility protein): RESPONSE
  • .uk/namespaces/2003/07/gds/types"
  • ...
  • Primary accession: O35923
  • Name: BRCA2_RAT
  • ...
  • Primary accession: P51587
  • Name: BRCA2_HUMAN
  • ...
  • Primary accession: P70098
  • Name: Q5TBJ7_HUMAN
  • ...
  • Primary accession: Q7RG20
  • Name: Q7RG20_PLAYO
  • ...

26
Demo and/or Screenshots
  • ID mapping: Find all the database
    cross-references from various databases
    corresponding to RefSeq Accession NP_061820
  • 1st Step: Find the protein for RefSeq Accession
    NP_061820
  • "
  • aseCrossReference"
  • value="RefSeq"/>
  • predicate="equal" value="NP_061820"/>

27
Demo and/or Screenshots
  • ID mapping: Find all the database
    cross-references from various databases
    corresponding to RefSeq Accession NP_061820
  • 1st Step: Find the protein for RefSeq Accession
    NP_061820: RESPONSE
  • .uk/namespaces/2003/07/gds/types"
  • status="COMPLETED"/>
  • status="COMPLETED"
    main.impl.ProteinImpl
  • ...
  • P99999
  • Primary accession: P99999
  • Name: CYC_HUMAN
  • ...

28
Demo and/or Screenshots
  • ID mapping: Find all the database
    cross-references from various databases
    corresponding to RefSeq Accession NP_061820
  • 2nd Step: Find all the cross-references for the
    protein found
  • rossReference"
  • n"
  • predicate="equal" value="CYC_HUMAN"/>

29
Demo and/or Screenshots
  • ID mapping: Find all the database
    cross-references from various databases
    corresponding to RefSeq Accession NP_061820
  • 2nd Step: Find all the cross-references for the
    protein found: RESPONSE
  • .uk/namespaces/2003/07/gds/types"
  • ...
  • EMBL: M22877
  • ...
  • PIR: CCHU
  • ...
  • GenPept: AAH09579
  • ...
  • NCBI GI: 14250124
  • ...
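
The two-step ID mapping shown in the demo can be sketched as a pair of lookups: first find the protein that carries the given cross-reference, then list that protein's other cross-references. The identifiers below are the ones visible on the slides, and the response on the slide is elided, so the list is partial. The in-memory table, the class and the function names are hypothetical stand-ins for the grid data service queries.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CrossRef = Tuple[str, str]  # (database, identifier)


@dataclass
class Protein:
    name: str
    primary_accession: str
    cross_references: List[CrossRef] = field(default_factory=list)


# Hypothetical in-memory stand-in for the data service; values from the slides.
PROTEINS: List[Protein] = [
    Protein(
        name="CYC_HUMAN",
        primary_accession="P99999",
        cross_references=[
            ("RefSeq", "NP_061820"),
            ("EMBL", "M22877"),
            ("PIR", "CCHU"),
            ("GenPept", "AAH09579"),
            ("NCBI GI", "14250124"),
        ],
    ),
]


def find_protein(database: str, identifier: str) -> Optional[Protein]:
    """Step 1: find the protein holding the given cross-reference."""
    for protein in PROTEINS:
        if (database, identifier) in protein.cross_references:
            return protein
    return None


def map_identifier(database: str, identifier: str) -> List[CrossRef]:
    """Step 2: return the protein's other cross-references."""
    protein = find_protein(database, identifier)
    if protein is None:
        return []
    return [ref for ref in protein.cross_references if ref != (database, identifier)]


if __name__ == "__main__":
    for db, ident in map_identifier("RefSeq", "NP_061820"):
        print(f"{db}: {ident}")
```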

30
Lessons Learned / Technical Difficulties
  • Before starting
  • Read the caCORE SDK Programmer's Guide, especially
    Chapter 5: Generating UML Models
  • Download and install caCORE SDK
  • Keep in mind that caGRID and caCORE SDK are under
    development
  • Use case development
  • Identify caCORE SDK limitations

31
Lessons Learned / Technical Difficulties
  • Object modeling: the most important step
  • Considerations:
  • Scientific meaning: if possible, don't do record
    modeling
  • Use cases: Consider search criteria objects
  • caCORE SDK constraints: Consider naming
    conventions, id attribute constraints,
    supported collection types (e.g. List is not
    supported)
  • Data-related constraints: Include only
    associations or objects based on your data, e.g.
    Gene to Protein, but not Protein to DNASequence
  • Semantics: Express semantics and avoid using
    type attributes, e.g. ProteinFeature subclasses,
    Lineage
  • Ask for a VCDE representative before you begin;
    employ an iterative approach
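
The advice to express semantics through subclasses rather than a type attribute can be illustrated with a short sketch. The actual PIR model was built in UML and generated as Java by the caCORE SDK, so the Python below only shows the modeling idea. The subclass names (`DomainFeature`, `BindingSiteFeature`) and the positions are made up for the example; only `ProteinFeature` is named on the slide.

```python
from dataclasses import dataclass
from typing import List, Type


@dataclass
class ProteinFeature:
    """Base class: an annotated region of a protein sequence."""
    start: int
    end: int
    description: str

    def length(self) -> int:
        return self.end - self.start + 1


# Hypothetical subclasses: the kind of feature is carried by the class itself,
# not by a free-text 'type' attribute, so each kind can be annotated separately.
@dataclass
class DomainFeature(ProteinFeature):
    pass


@dataclass
class BindingSiteFeature(ProteinFeature):
    pass


def features_of_kind(features: List[ProteinFeature],
                     kind: Type[ProteinFeature]) -> List[ProteinFeature]:
    """Select features by class instead of comparing type strings."""
    return [f for f in features if isinstance(f, kind)]


if __name__ == "__main__":
    # Positions are invented purely for illustration.
    features: List[ProteinFeature] = [
        DomainFeature(10, 44, "BRCA2 repeat"),
        DomainFeature(60, 94, "BRCA2 repeat"),
        BindingSiteFeature(120, 120, "example site"),
    ]
    for f in features_of_kind(features, DomainFeature):
        print(type(f).__name__, f.description, f.length())
```

With a type attribute, a misspelled value would silently create a new kind of feature. With subclasses, each kind is a named class that can be reviewed and semantically annotated on its own.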

32
Lessons Learned / Technical Difficulties
  • Semantic annotation/caDSR load
  • Enter tagged values while modeling
  • Review semantic annotation report to check
    classifications of Concepts (ObjectClass,
    ObjectClassQualifier, Property,
    PropertyQualifier)
  • Ensure all the Concepts are included in
    Annotated Model before you send to caDSR
  • Code generation / Grid deployment
  • Using caCORE SDK helped in grid-enabling PIR

33
Lessons Learned / Technical Difficulties
  • Database Performance
  • In caCORE SDK generated SQL statements, all
    string fields are lower-cased; additional
    function-based indexes are needed
  • select ... from ... where lower(PRIMARYACCESSION) =
    'p88888'
  • Object-to-relational mapping
  • New id attribute constraint introduced in
    caCORE SDK 1.0.3
  • Option 1: Change the database schema, add id of
    type Integer to all the tables (caBIO)
  • Option 2: Map existing primary keys as id, so
    id can be String or Integer (PIR)
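
The need for function-based indexes can be demonstrated with a small sketch. When every string comparison is wrapped in lower(), an ordinary index on the column does not match the predicate, while an index on the expression lower(column) does. SQLite (through Python's sqlite3 module) is used here purely as a convenient stand-in, since the slides do not say which database PIR runs on. The table, column and index names are assumptions, and the rows reuse accessions from the demo slides.

```python
import sqlite3

QUERY = "SELECT name FROM protein WHERE lower(primaryaccession) = ?"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with an index on lower(primaryaccession)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (primaryaccession TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("P99999", "CYC_HUMAN"), ("P51587", "BRCA2_HUMAN")],
    )
    # Expression index (SQLite's form of a function-based index).
    conn.execute(
        "CREATE INDEX idx_protein_lower_acc ON protein (lower(primaryaccession))"
    )
    return conn


def find_name(conn: sqlite3.Connection, accession: str):
    """Look up a protein name the way the generated SQL does: lower-cased."""
    row = conn.execute(QUERY, (accession.lower(),)).fetchone()
    return row[0] if row else None


def query_plan(conn: sqlite3.Connection) -> str:
    """Return the planner's description of how QUERY would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("p99999",)).fetchall()
    return " ".join(str(row[-1]) for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_name(conn, "P99999"))
    print(query_plan(conn))
```

The index expression has to match the expression in the WHERE clause for the planner to use it, which is why one such index is needed per lower-cased search column.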

34
Wish List
  • Semantic annotation
  • Excel spreadsheets are error-prone; a tool that
    makes semantic annotation easier is needed
  • Traversing the object model
  • Checking errors (wrong classifications, missing
    annotations etc.)
  • caCORE SDK
  • Google-like/any field search
  • Allow optional exact string search
  • Use of a logger instead of System.out or
    System.err
  • Support queries involving class inheritance
  • Support Boolean queries involving multiple
    collection type attributes
  • caGRID
  • Training opportunities for developers
  • Documentation

35
Acknowledgements
Georgetown University: Cathy Wu (Faculty Lead), Hongzhan Huang (Chief Architect), Peter McGarvey (Domain Expert), Baris Suzek (Project Manager), Sehee Chung (SW Developer), Hsing-Kuo Hua (DB Developer), Jess Cannata (System Admin), Robert Clarke, Steve Moore, Arnie Miles
Panther Informatics: Brian Gilman (Consultant)
NCI Center for Bioinformatics: Peter Covitz, George Komatsoulis, Avinash Shanbhag, Tara Akhavan, William Sanchez, Manav Kher, Jijin Yan, Clarie Wolfe, Nicole Thomas, Himanso Sahni, Jennifer Zeng, Nafis Zebarjani
3rd Millenium: Juli Klemm, Brian Davis
BAH: Mark Adams, Arumani Manisundaram
University of Pennsylvania - BMIF: David Fenstermacher, Craig Street, Vishal Nayak, Casey Overby