Title: caGrid Version 0.5 Reference Implementation caTIES caBIG Architecture Workspace Face to Face Georget
1 caGrid Version 0.5 Reference Implementation
caBIG Architecture Workspace Face to Face
Georgetown University, August 16th-18th, 2005
- Rebecca Crowley and Kevin Mitchell
- crowelyrs@upmc.edu
- mitchellkj@upmc.edu
2 Outline
- Project History and demonstration of existing application
- Data Model
- Project Architecture
- Process of getting to Silver level compliance
- Functionality Exposed to Grid
- Process of Grid Enablement
- Lessons Learned / Technical Difficulties / Wish List
- Acknowledgements
3 Project History
- caTIES is a text processing system that creates deidentified structured data from unstructured free-text pathology reports and makes reports accessible to researchers.
- Information about tumor stage, prognostic factors
- Index to fixed tissue; source of annotation for frozen or processed tissue
- caTIES deidentifies the entire corpus of reports, creates concept codes using the NCI Metathesaurus, and writes them to a MySQL datastore
- Deployed to an adopter at the University of Pennsylvania; the intention is to create a network of institutions that can share data and tissue
- Used OGSA-DAI and OGSI to facilitate data sharing between Pitt and Penn; IRB protocols to provide data; creating a culture for data sharing
4
- Demonstration of existing web-service-based application
5 Data Model
- Evolving data model
- Rich annotations derived from Pathology Report, not all of which appear in Phase I model
- Each Phase of the caTIES project adds additional use cases and expands the scope of the data model
- Phase I: Basic mechanisms for document search and retrieval based on NCI Metathesaurus concepts
- Phase II: Fill requests for tissue that utilize the honest broker (HB) model; add queries based on temporality
- Phase III: Retrieval of data or tissues based on more finely granular structured information extracted from documents
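As a rough illustration of the Phase I idea (document search and retrieval by concept), the sketch below indexes reports by concept code and returns the reports that contain every requested code. It is not the caTIES implementation: the report ids, the placeholder codes such as `CODE_A`, and the function names `build_index` and `search` are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(documents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each concept code to the set of report ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for report_id, concept_codes in documents.items():
        for code in concept_codes:
            index[code].add(report_id)
    return dict(index)


def search(index: Dict[str, Set[str]], required_codes: Iterable[str]) -> Set[str]:
    """Return ids of reports that contain every requested concept code."""
    code_sets = [index.get(code, set()) for code in required_codes]
    if not code_sets:
        return set()
    return set.intersection(*code_sets)


# Hypothetical reports, each reduced to an ordered list of placeholder codes.
reports = {
    "report-1": ["CODE_A", "CODE_B", "CODE_A"],
    "report-2": ["CODE_B", "CODE_C"],
    "report-3": ["CODE_A", "CODE_C"],
}
index = build_index(reports)
matches = search(index, ["CODE_A", "CODE_B"])
print(sorted(matches))
```

The index discards the order of codes within a report; queries that depend on order or temporality (Phase II) would need the original ordered lists.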
6 Phase I model
7 Early Phase II model
8 Small part of Phase III model
10 Reference Implementation Goals from our perspective
- Achieve silver compliance using caCORE SDK
- Replace existing OGSA-DAI web-services-based method for data sharing with caGrid
- Utilize grid and local security mechanisms
- Learn more about how caGrid would be used for other facets of this project
- Analytic services for our coder processing resources
- Communication between applications
- Requirements for next iteration of caGrid
11 Process of getting to Silver level compliance
- ROUND ONE
- Developed original model by extracting the two classes needed to reproduce Phase I functionality
- In particular, we left out the identified side of our model
- Generated caCORE-like API using caCORE SDK
- Created data model in EA (MySQL)
- Automatically created ORM using caCORE SDK
- Created semantic annotations file using semanticconnector script
- Manually annotated our own models prior to sending to NCICB
- Learned a lot about what we can and cannot expect from these annotations
- Developers have to be the ones to catch the mistakes. May be best if we know these annotations well
12 First Round
13 Process of getting to Silver level compliance
- But there was a problem: the object model had never been formally reviewed, because there was no formal process to do so when we went through ROUND ONE. Started anew on August 8th.
- Email and teleconference with key NCICB staff for our WS (Ian Fore) and VCDE (George Komatsoulis). Issued new object model. NCICB requested 13 additional changes. Came to agreement over most of these changes over the course of 5 days, 15 emails, and one teleconference.
- New objects required for concept referent, application, and execution
- Removal of relationship; clarification of naming and definitions
- Most contentious issue was use of ordered concept code lists, a kind of document transformation that we find very useful
- In general, we found that there were some slight contortions needed to use domain-modeling principles when trying to model documents
- Regenerated data model dependencies; created new db and API
- Moved data from backend generated in 1st attempt to final backend
14 Process of getting to Silver level compliance
- Ran semanticconnector script over XMI to produce Excel file
- Sent annotations to NCICB for review and creation of 2 additional concepts
- Created annotated XMI and sent to NCICB
- Loaded to caDSR Stage; small human error in Excel yielded model with three classes appearing as one
15 Phase I model
19 Process of getting to Silver level compliance
- Ran semanticconnector script again
- Sent annotated XMI to NCICB
- Loaded to caDSR Stage; reviewed and approved
- Loaded to caDSR Production
- Metadata extract generated by caDSR and returned to us
20 Process of getting to Silver level compliance
21 Functionality Exposed to the Grid
- Data service exposing all objects in current Phase I model to the grid, but very likely tightly restricting access until processes are in place to administer users for research on the caGrid
- Example queries we will or would like to support (Phases I and II)
- Return the total number of cases across all caTIES nodes (across the entire Grid) of pediatric PNET in patients
- Return deidentified reports for women age 30-50 with Atypical Ductal Hyperplasia on breast biopsy followed within 15 years by DCIS or Infiltrating Ductal Carcinoma on any procedure (users with IRB authorization to access deidentified data)
- Return deidentified accession numbers for patients with high grade prostatic intraepithelial neoplasia (HGPIN) but no prostate cancer, and then order these blocks (users with IRB authorization to access deidentified data and get tissue approved for materials transfer from the institution)
- Return actual identified accession numbers for cases at my institution of pilocytic astrocytomas from 1989-1992 (honest broker)
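The first example query (a total case count across all caTIES nodes) could be sketched as a simple fan-out and sum. This is only an illustration under stated assumptions: each node is modeled as a plain Python callable rather than a real caGrid data service, and the node names, the function `total_cases`, and the concept code `PLACEHOLDER_CODE` are hypothetical.

```python
from typing import Callable, Dict, Tuple

# A node is modeled as a function from a concept code to a case count.
# Real nodes would be remote data services; this stand-in is an assumption.
NodeCounter = Callable[[str], int]


def total_cases(
    nodes: Dict[str, NodeCounter], concept_code: str
) -> Tuple[int, Dict[str, int], Dict[str, str]]:
    """Sum case counts over all nodes, recording nodes that failed."""
    per_node: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    for name, count_cases in nodes.items():
        try:
            per_node[name] = count_cases(concept_code)
        except Exception as exc:  # one unreachable node should not sink the query
            errors[name] = str(exc)
    return sum(per_node.values()), per_node, errors


def failing_node(concept_code: str) -> int:
    raise ConnectionError("node unreachable")


# Hypothetical nodes with made-up counts, for illustration only.
nodes = {
    "pitt": lambda code: 12,
    "penn": lambda code: 7,
    "offline": failing_node,
}
total, per_node, errors = total_cases(nodes, "PLACEHOLDER_CODE")
print(total, per_node, errors)
```

Returning the per-node counts and errors alongside the total lets a caller see when the grid-wide number is incomplete.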
22 Security Requirements
- All data and communications must be secure
- Single sign on for honest brokers and others who will need to access our identified datastore linkage file
- Preference for one integrated set of security tools
- Attribute and operation level security
- Access to patient or report level data can only be granted to individuals known to have an IRB protocol to do research
- Access to generic operations (get a histogram of X type of cases) can be universal to any user
- Access to patient or report level data can only be granted to individuals who have an active IRB protocol
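The operation-level rules above can be written down as a small access check. This is a minimal sketch of the stated policy, not CSM or caGrid security: the `User` fields, the `Operation` names, and `is_allowed` are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    HISTOGRAM = "histogram"          # generic, aggregate-only operation
    REPORT_LEVEL = "report_level"    # patient or report level deidentified data
    IDENTIFIED = "identified"        # identified datastore / linkage file


@dataclass(frozen=True)
class User:
    name: str
    has_active_irb_protocol: bool = False
    is_honest_broker: bool = False


def is_allowed(user: User, operation: Operation) -> bool:
    """Apply the slide's rules; anything not listed is denied."""
    if operation is Operation.HISTOGRAM:
        return True  # generic operations are open to any user
    if operation is Operation.REPORT_LEVEL:
        return user.has_active_irb_protocol
    if operation is Operation.IDENTIFIED:
        return user.is_honest_broker
    return False


anonymous = User("anonymous")
researcher = User("researcher", has_active_irb_protocol=True)
broker = User("broker", is_honest_broker=True)
print(is_allowed(anonymous, Operation.HISTOGRAM))
print(is_allowed(researcher, Operation.REPORT_LEVEL))
print(is_allowed(researcher, Operation.IDENTIFIED))
```

Whether an honest broker should also get report-level access without an IRB protocol is not stated in the slides, so the sketch does not grant it.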
23 Process of Grid Enablement
- Request public access via http for 1upmcspn01
- Deploy CSM for local application security
- Deploy and configure grid services, including grid-level security
- Contribute to requirements for next iteration of caGrid security based on IRB requirements (Pitt and Penn, hopefully others)
- Virtual organizations
- Delegation
- Single sign on
- Attribute management
- Gradually open up access as user management policies, processes, and technology mature
24 Lessons Learned: Modeling
- Modeling
- Many-to-many relationships in the object domain require correlation tables in the data model. But when the correlation itself has data, a relationship object must exist, e.g. Application -> Execution: Execution.startTime and Execution.endTime necessitate an object domain accessor
- Useful to use multiple model views to focus on key model aspects. Phase I model could be vetted while overall model expanded to Phase II
- Importing semantically annotated XMI back to the model clobbers diagrams and can negatively affect the data model
- A good idea to always generate the application and test it with a simple call before proceeding to semantic activity
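The relationship-object point can be illustrated with a small sketch. When a many-to-many link carries its own data (here a start and end time), the link becomes a class of its own instead of a bare correlation table. Only the names Application, Execution, startTime and endTime come from the slide; the `Document` class on the other side of the relationship and the `duration_seconds` helper are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Application:
    name: str


@dataclass(frozen=True)
class Document:
    # Hypothetical other end of the many-to-many relationship.
    accession: str


@dataclass(frozen=True)
class Execution:
    """Relationship object: one run of an Application over a Document."""
    application: Application
    document: Document
    start_time: datetime
    end_time: datetime

    def duration_seconds(self) -> float:
        return (self.end_time - self.start_time).total_seconds()


def executions_of(app: Application, executions: List[Execution]) -> List[Execution]:
    """Navigate from an Application to its Execution objects."""
    return [e for e in executions if e.application == app]


coder = Application("coder")
doc = Document("D-0001")
run = Execution(coder, doc, datetime(2005, 8, 16, 9, 0, 0), datetime(2005, 8, 16, 9, 0, 30))
print(run.duration_seconds(), len(executions_of(coder, [run])))
```

In a relational backend the `Execution` class would map to the correlation table, with the two foreign keys plus the time columns.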
25 Lessons Learned: Semantic Annotation
- Semantic Connection
- One-way trip less troublesome
- Better to treat the import and recovery of annotated XMI into EA as read only
- Start each semanticconnector run with only Documentation and Description tags
- This eliminates stray tags from previous annotation which may not be overwritten in the EVS Semantic Connector Report
- Check at multiple points in process; human errors creep in at every step
- UML Model Excel file modified by deliverable Excel file returned from NCICB caDSR Stage
- Vocabulary development process adds some unanticipated complexities to annotation
26 Lessons Learned: Modeling and Semantic Annotation
- Difficulties with maintenance and iterative model-building
- Manual merge of Excel files is very error-prone
- Many artifacts with partial representations
- High effort cost to changing things later in process
- Need to develop work processes that minimize wasted effort
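One way to reduce the risk of manual merges is an automated check between hand-offs, for example comparing the class and attribute rows before and after a spreadsheet has been edited. The sketch below assumes the spreadsheets are exported to CSV with hypothetical `class` and `attribute` columns; the real annotation file layout is not given in the slides, and the class names are placeholders.

```python
import csv
import io
from typing import Dict, List, Set


def class_attributes(csv_text: str) -> Dict[str, Set[str]]:
    """Group attribute names by class name from CSV text."""
    result: Dict[str, Set[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        result.setdefault(row["class"].strip(), set()).add(row["attribute"].strip())
    return result


def compare(before_csv: str, after_csv: str) -> Dict[str, List[str]]:
    """Report classes that vanished, appeared, or changed their attributes."""
    before = class_attributes(before_csv)
    after = class_attributes(after_csv)
    shared = set(before) & set(after)
    return {
        "missing_classes": sorted(set(before) - set(after)),
        "unexpected_classes": sorted(set(after) - set(before)),
        "changed_classes": sorted(c for c in shared if before[c] != after[c]),
    }


# Hypothetical export before editing: three classes with one attribute each.
before_csv = "class,attribute\nClassA,x\nClassB,y\nClassC,z\n"
# After a copy-paste slip, all rows carry the same class name.
after_csv = "class,attribute\nClassA,x\nClassA,y\nClassA,z\n"
report = compare(before_csv, after_csv)
print(report)
```

A check like this would flag a model in which three classes have collapsed into one before it is loaded anywhere downstream.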
27 Questions we are still trying to answer
- How do we map our models to data standards when the name of the class in the OM and the object class of the standard CDE are different?
- Participant.gender
- Execution.startTime
- What would manual mapping do to the semantics?
- What are we trying to achieve with semantic annotation?
- Query? Aggregation across sources? Inference?
- How should resources be balanced with effort required?
- When will real researchers be accessing caGrid data services? How will we deal with the process of credentialing and granting permissions to researchers?
- How should we handle local application vs. grid level security in the short term and in the long term?
28 Future Work
- Before end of August
- Secure silver API with CSM
- Complete installation of caGrid
- Understand CSM as well as caGrid credentialing and user management, and restrict access
- Make node public
- Register service
- Eventually
- Create text processing components as analytic services
- Integrate API with GUI
- Expand model to Phase II and III
29 Wish List
- Better communication
- Semantic Annotation
- End point of reference implementations
- Security policies and processes
- Communication between reference implementations
- Could we start to use listserv to ask and address problems?
- Better tooling
- Semantic annotation
- Single artifact
30 Acknowledgements
- caTIES/UPMC
- Kevin Mitchell, Linda Schmandt, Girish Chavan, Adi Nemlekar, Jon Tobias
- UPCI
- Ronald Herberman, Michael Becich
- caTIES/Penn
- Michael Feldman, David Fenstermacher, Tara McSherry, John Quigley, Vishal Nayak
- NCICB
- Ian Fore, George Komatsoulis, Ram Chilukuri, Avinash Shanbhag, Manav Kher, Tara Akhavan, Mike Connolly, Kevin Fitzpatrick, Christophe Ludet, Nicole Thomas, Himanso Sahni, William Sanchez, Ruowei Wu, Nafis Zebarjani
- BAH
- Brian Davis, Greg Eley, Bal Harshawardhan, Arumani Manisundaram, Mark Adams
- Ardais
- David Aronow (Metadata Mentor)
- Washington University
- Rakesh Nagarajan (Architecture Mentor)