Title: Grid-Enablement of Protein Information Resource (PIR). caBIG ICR Face-to-Face Workspace, Columbia University, Irving Cancer Research Center, January 26-27, 2006
- Baris Ethem Suzek
  Georgetown University, Lombardi Cancer Center, PIR
  bes23@georgetown.edu
- Craig Street
  University of Pennsylvania, Biomedical Informatics Facility
  street@mail.med.upenn.edu
2 Outline
- Introduction: PIR/BMIF
- Data Model
- Overview of the Grid-Enablement of Protein Information Resource
- Demo/Screenshots: API/caGRID Browser
- Acknowledgements
3 Introduction - PIR
- Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research
- UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
- PIRSF Family Classification System: Protein Classification and Functional Annotation
- iProClass Integrated Protein Knowledgebase: Data Integration and Functional Analysis
http://pir.georgetown.edu
4 Introduction - PIR
- UniProt Universal Protein Resource - Central Resource of Protein Sequence and Function
- International Consortium
- PIR at GUMC
- European Bioinformatics Institute (EBI)
- Swiss Institute of Bioinformatics (SIB)
- Unifies PIR-PSD, Swiss-Prot, TrEMBL Protein Sequence Databases
http://www.uniprot.org
5 Introduction - PIR
Primary data source for Grid-Enablement of PIR
6 Project Overview
- The Grid-Enablement of PIR project is a data service
- Developer: PIR @ Georgetown University
- Adopter: BMIF @ University of Pennsylvania
- All the objects in our model are exposed to the grid
- The API is developed using caCORE SDK 1.0.3.1
- All the PIR and UniProt databases are public, so no security layers are implemented
7 Data Model
- Protein/Gene related objects
8 Data Model
- Annotation related objects: Protein Features
9 Data Model
- Taxonomy related objects (Proposed as Taxonomy CDE)
10 Demo and/or Screenshots - API (Example 1)
- Purpose of the script: rudimentary ID mapper
- Find the corresponding PIR Database Cross Reference ID(s) that match EMBL M15034
- Test script:

    public void Demo1_DBXR2DBXR_Reports() {
        try {
            // Example object: an EMBL cross-reference with a known ID
            DatabaseCrossReference source = new DatabaseCrossReferenceImpl();
            final String id = "M15034";
            source.setCrossReferenceId(id);
            source.setDataSourceName("EMBL");
            final String answer = "PIR";
            log.info("Find all PIR Database Cross Reference ids that match EMBL id " + id);
            try {
                // Search path: DatabaseCrossReference, reached through Protein
                String path = "edu.georgetown.pir.domain.DatabaseCrossReference,"
                        + "edu.georgetown.pir.domain.Protein";
                List resultList = appService.search(path, source);
                log.info("Size " + resultList.size());
                // ... (the slide ends here; the remainder of the script is not shown)

11 Demo and/or Screenshots - API (Example 1)
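The ID-mapping script above passes a partially filled object to the search call and lets the populated fields act as the filter. The sketch below imitates that idea in plain Java over an in-memory list, as a rough illustration only. It does not use the caCORE API: the `CrossRef` class, the `mapIds` helper, and every record in `sampleData` are hypothetical placeholders, not PIR data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** In-memory sketch of query-by-example ID mapping; not the caCORE API. */
final class QueryByExampleSketch {

    /** Hypothetical stand-in for a database cross-reference record. */
    static final class CrossRef {
        final String proteinAccession; // hypothetical link to the owning protein
        final String dataSourceName;
        final String crossReferenceId;

        CrossRef(String proteinAccession, String dataSourceName, String crossReferenceId) {
            this.proteinAccession = proteinAccession;
            this.dataSourceName = dataSourceName;
            this.crossReferenceId = crossReferenceId;
        }
    }

    /** A field left null in the example matches any value. */
    static boolean matches(CrossRef example, CrossRef candidate) {
        return (example.dataSourceName == null
                    || example.dataSourceName.equals(candidate.dataSourceName))
            && (example.crossReferenceId == null
                    || example.crossReferenceId.equals(candidate.crossReferenceId));
    }

    /** Two hops: example cross-reference, then its protein, then that protein's IDs in targetSource. */
    static List<String> mapIds(List<CrossRef> all, CrossRef example, String targetSource) {
        List<String> ids = new ArrayList<>();
        for (CrossRef hit : all) {
            if (!matches(example, hit)) {
                continue;
            }
            for (CrossRef other : all) {
                if (other.proteinAccession.equals(hit.proteinAccession)
                        && other.dataSourceName.equals(targetSource)) {
                    ids.add(other.crossReferenceId);
                }
            }
        }
        return ids;
    }

    /** Made-up placeholder records, used only to exercise the sketch. */
    static List<CrossRef> sampleData() {
        return Arrays.asList(
            new CrossRef("P_A", "EMBL", "EMBL-0001"),
            new CrossRef("P_A", "PIR", "PIR-0001"),
            new CrossRef("P_A", "RefSeq", "REFSEQ-0001"),
            new CrossRef("P_B", "EMBL", "EMBL-0002"),
            new CrossRef("P_B", "PIR", "PIR-0002"));
    }

    public static void main(String[] args) {
        CrossRef example = new CrossRef(null, "EMBL", "EMBL-0001");
        System.out.println(mapIds(sampleData(), example, "PIR"));
    }
}
```

Leaving a field null to mean "match anything" is one common way to read an example object; whether the real service treats unset fields this way is an assumption here.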
12 Demo and/or Screenshots - API (Example 2)
- Purpose of the script: return all Organisms containing your favorite protein of interest
- Find all Organisms having a Protein named "Transferrin receptor protein 1"
- Test script:

    public void Demo2_TestProtein2Organism() {
        // Example object: a protein name to search by
        ProteinName source = new ProteinNameImpl();
        final String id = "Transferrin receptor protein 1";
        source.setValue(id);
        log.info("Find all Organisms which have Protein Name " + id);
        log.info("COMMON NAME" + "\t\t" + "SCIENTIFIC NAME");
        try {
            // Path strings are reproduced as they appear on the slide
            String path = "edu.georgetown.pir.domain.Organism,"
                    + "edu.georgetown.pir.domain.Protein,";
            List resultList = appService.search(path, source);
            for (Iterator it = resultList.iterator(); it.hasNext();) {
                Organism organism = (OrganismImpl) it.next();
                log.info(organism.getCommonName()
                        + "\t\t" + organism.getScientificName());
            }
            // ... (the slide ends here; the remainder of the script is not shown)

13 Demo and/or Screenshots - API (Example 2)
14 Demo and/or Screenshots - API (Example 3)
- Purpose of the script: identify a protein with a known molecular weight from a proteomics experiment
- Find all Proteins with a Molecular Weight of 26266 Daltons
- Test script:

    public void Demo3_ProteinMeolcularWeight() {
        try {
            // Example object: a sequence with the molecular weight of interest
            ProteinSequence object = new ProteinSequenceImpl();
            final Integer id = new Integer(26266);
            object.setMolecularWeightInDaltons(id);
            log.info("Find all Proteins which have a Molecular Weight in Daltons of " + id + ".");
            try {
                List resultList = appService.search(Protein.class, object);
                for (Iterator it = resultList.iterator(); it.hasNext();) {
                    Protein protein = (ProteinImpl) it.next();
                    // Second search: fetch the sequences of each matching protein
                    List seqs = appService.search(ProteinSequence.class, protein);
                    for (Iterator i = seqs.iterator(); i.hasNext();) {
                        ProteinSequence seq = (ProteinSequence) i.next();
                        log.info(protein.getUniprotkbEntryName() + "\t\t"
                                + seq.getMolecularWeightInDaltons());
                    }
                    // ... (the slide ends here; the remainder of the script is not shown)

15 Demo and/or Screenshots - API (Example 3)
16 Demo and/or Screenshots - caGRID Browser
- Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2)

    <caBIGXMLQuery name="testGene2Protein">
      <Target name="edu.georgetown.pir.domain.Protein">
        <Objects name="edu.georgetown.pir.domain.Gene">
          <Property name="name" predicate="equal" value="BRCA2"/>
        </Objects>
      </Target>
    </caBIGXMLQuery>
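A query document of the shape shown above can be assembled programmatically rather than typed by hand. The following is a minimal sketch using only the standard Java DOM and transformer APIs. The element and attribute names are copied from the slide. The class name `QueryBuilderSketch` and the `buildQuery` helper are hypothetical, and nothing here is part of caGRID itself.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Builds a single-property query document with the JDK's DOM classes. */
final class QueryBuilderSketch {

    static String buildQuery(String queryName, String targetClass,
                             String objectClass, String property, String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // <caBIGXMLQuery name="...">
            Element query = doc.createElement("caBIGXMLQuery");
            query.setAttribute("name", queryName);
            doc.appendChild(query);

            // <Target name="..."> is the class of the objects to return
            Element target = doc.createElement("Target");
            target.setAttribute("name", targetClass);
            query.appendChild(target);

            // <Objects name="..."> is the class the criteria apply to
            Element objects = doc.createElement("Objects");
            objects.setAttribute("name", objectClass);
            target.appendChild(objects);

            // <Property name="..." predicate="equal" value="..."/>
            Element prop = doc.createElement("Property");
            prop.setAttribute("name", property);
            prop.setAttribute("predicate", "equal");
            prop.setAttribute("value", value);
            objects.appendChild(prop);

            // Serialize the DOM tree to a string without the XML declaration
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build query", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("testGene2Protein",
                "edu.georgetown.pir.domain.Protein",
                "edu.georgetown.pir.domain.Gene",
                "name", "BRCA2"));
    }
}
```

Building the document through DOM rather than string concatenation means attribute values are escaped automatically, which matters once values contain quotes or ampersands.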
17 Demo and/or Screenshots - caGRID Browser
- Retrieve the proteins for gene BRCA2 (Breast Cancer Gene 2): RESPONSE

    <gridDataServiceResponse xmlns="http://ogsadai.org.uk/namespaces/2003/07/gds/types">
      ...
      <uniprotkbPrimaryAccession>O35923</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>BRCA2_RAT</uniprotkbEntryName>
      ...
      <uniprotkbPrimaryAccession>P51587</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>BRCA2_HUMAN</uniprotkbEntryName>
      ...
      <value>MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEES
      EHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDK
      ...
      <uniprotkbPrimaryAccession>P97929</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>BRCA2_MOUSE</uniprotkbEntryName>
      ...
      </edu.georgetown.pir.domain.impl.ProteinImpl></result>
    </gridDataServiceResponse>
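A client receiving a response like the one above would typically pull out the accession and entry-name fields. The sketch below does this with the standard Java DOM parser. The sample document reuses the three accession and entry-name pairs visible on the slide, but the nesting of `result` and `edu.georgetown.pir.domain.impl.ProteinImpl` elements is an assumption, since the slide elides most of the response. The class `ResponseParserSketch` and its helper methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Extracts element text from a response document by tag name. */
final class ResponseParserSketch {

    /** Returns the text of every element called tagName, in document order. */
    static List<String> textOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse response", e);
        }
    }

    /** One protein element; the wrapper layout is assumed, not taken from the slide. */
    static String protein(String accession, String entryName) {
        return "<edu.georgetown.pir.domain.impl.ProteinImpl>"
             + "<uniprotkbPrimaryAccession>" + accession + "</uniprotkbPrimaryAccession>"
             + "<uniprotkbEntryName>" + entryName + "</uniprotkbEntryName>"
             + "</edu.georgetown.pir.domain.impl.ProteinImpl>";
    }

    /** Sample response holding the accession/entry-name pairs shown on the slide. */
    static String sampleResponse() {
        return "<gridDataServiceResponse"
             + " xmlns=\"http://ogsadai.org.uk/namespaces/2003/07/gds/types\">"
             + "<result>"
             + protein("O35923", "BRCA2_RAT")
             + protein("P51587", "BRCA2_HUMAN")
             + protein("P97929", "BRCA2_MOUSE")
             + "</result>"
             + "</gridDataServiceResponse>";
    }

    public static void main(String[] args) {
        List<String> accessions = textOf(sampleResponse(), "uniprotkbPrimaryAccession");
        List<String> names = textOf(sampleResponse(), "uniprotkbEntryName");
        for (int i = 0; i < accessions.size(); i++) {
            System.out.println(accessions.get(i) + "\t" + names.get(i));
        }
    }
}
```

Looking elements up by tag name keeps the sketch independent of the exact wrapper structure, at the cost of assuming each protein carries exactly one accession and one entry name.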
18 Demo and/or Screenshots - caGRID Browser
- Find all the proteins that contain the domain BRCA2 repeat (PFAM:PF00634, a domain in Breast cancer type 2 susceptibility protein)

    <caBIGXMLQuery name="testPfam2Protein">
      <Target name="edu.georgetown.pir.domain.Protein">
        <Objects name="edu.georgetown.pir.domain.DatabaseCrossReference">
          <Property name="crossReferenceId" predicate="equal" value="PF00634"/>
        </Objects>
      </Target>
    </caBIGXMLQuery>
19 Demo and/or Screenshots - caGRID Browser
- Find all the proteins that contain the domain BRCA2 repeat (PFAM:PF00634, a domain in Breast cancer type 2 susceptibility protein): RESPONSE

    <gridDataServiceResponse xmlns="http://ogsadai.org.uk/namespaces/2003/07/gds/types">
      ...
      <uniprotkbPrimaryAccession>O35923</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>BRCA2_RAT</uniprotkbEntryName>
      ...
      <uniprotkbPrimaryAccession>P51587</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>BRCA2_HUMAN</uniprotkbEntryName>
      ...
      <uniprotkbPrimaryAccession>P70098</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>Q5TBJ7_HUMAN</uniprotkbEntryName>
      ...
      <uniprotkbPrimaryAccession>Q7RG20</uniprotkbPrimaryAccession>
      <uniprotkbEntryName>Q7RG20_PLAYO</uniprotkbEntryName>
      ...
    </gridDataServiceResponse>
20 Demo and/or Screenshots - caGRID Browser
- ID mapping: find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820

    <caBIGXMLQuery name="testIDMapping">
      <Target name="edu.georgetown.pir.domain.DatabaseCrossReference"
              path="edu.georgetown.pir.domain.Protein">
        <Objects name="edu.georgetown.pir.domain.DatabaseCrossReference">
          <Property name="dataSourceName" predicate="equal" value="RefSeq"/>
          <Property name="crossReferenceId" predicate="equal" value="NP_061820"/>
        </Objects>
      </Target>
    </caBIGXMLQuery>
21 Demo and/or Screenshots - caGRID Browser
- ID mapping: find all the database cross-references from various databases corresponding to RefSeq Accession NP_061820: RESPONSE

    <gridDataServiceResponse xmlns="http://ogsadai.org.uk/namespaces/2003/07/gds/types">
      ...
      <dataSourceName>EMBL</dataSourceName>
      <crossReferenceId>M22877</crossReferenceId>
      ...
      <dataSourceName>PIR</dataSourceName>
      <crossReferenceId>CCHU</crossReferenceId>
      ...
      <dataSourceName>GenPept</dataSourceName>
      <crossReferenceId>AAH09579</crossReferenceId>
      ...
      <dataSourceName>NCBI GI</dataSourceName>
      <crossReferenceId>14250124</crossReferenceId>
      ...
    </gridDataServiceResponse>
22 Acknowledgements
- 3rd Millenium
- Juli Klemm
- Brian Davis
- BAH
- Mark Adams
- Arumani Manisundaram
- NCI Center for Bioinformatics: Peter Covitz, George Komatsoulis, Avinash Shanbhag, Tara Akhavan, William Sanchez, Manav Kher, Jijin Yan, Clarie Wolfe, Nicole Thomas, Himanso Sahni, Jennifer Zeng, Nafis Zebarjani
- Georgetown University: Cathy Wu (Faculty Lead), Hongzhan Huang (Chief Architect), Peter McGarvey (Domain Expert), Baris Suzek (Project Manager), Sehee Chung (SW Developer), Hsing-Kuo Hua (DB Developer), Jess Cannata (System Admin), Robert Clarke, Steve Moore, Arnie Miles
- Panther Informatics: Brian Gilman (Consultant)
- University of Pennsylvania - BMIF: David Fenstermacher, Craig Street, Vishal Nayak, Casey Overby