Title: Distributed, Modular Grid Software for Management and Exploration of Data in Patient-Centric Healthcare IT
1. Distributed, Modular Grid Software for Management and Exploration of Data in Patient-Centric Healthcare IT
- Andrew Hart, NASA Jet Propulsion Laboratory
- David Kale, Whittier VPICU, Children's Hospital LA
- Heather Kincaid, NASA Jet Propulsion Laboratory
2. Agenda
- Health Care Data Challenges for Large-scale Research
- Intro to Object Oriented Data Technology (OODT)
- Applications of OODT in distributed scientific data systems
  - NASA's Planetary Data System
  - NCI's Early Detection Research Network
  - Whittier Virtual Pediatric Intensive Care Unit (VPICU)
- OODT as Open Source
- Learning More and Keeping in Touch
3. Health care research
- Increasingly collaborative
- Increasingly geographically distributed
- Scale, Complexity, Cost drive cooperation
- Opportunities for discovery emerge through larger data sets
- Increase in need for technology to support virtual organizations carrying out distributed scientific research
4. OODT: What Is It?
- A data grid software infrastructure for constructing large-scale, distributed data-intensive systems
- Reference Architecture
- Software Product Line
- Reusable Components
- Common Patterns
5. A Brief History of OODT
- Funded out of NASA's Office of Space Science in 1998
- Funded to address critical software engineering challenges affecting the design of mission science data systems
- Designed, implemented, and refined over the past 7 years across multiple scientific domains
  - Planetary Science
  - Earth Science
  - Cancer Research
  - Space Physics
  - Modeling and Simulation
  - Pediatric Intensive Care
- Runner-up, NASA Software of the Year, 2003
6. Principles behind OODT
- Division of Labor: Avoid making one component the workhorse, configurable
- Technology Independence: Guard against unexpected changes in the technology landscape
- Metadata as a first-class citizen: Descriptions of resources come in handy
- Separation of software and data models: Allow each to evolve independently
- Modular, domain-agnostic: Pick and choose from adaptable components with defined interfaces
7. OODT Core Framework Services
- Archive Service: Ingest data and metadata, processing algorithms, workflow support
- Profile Service: Deliver metadata from an underlying data store
- Product Service: Deliver data from an underlying data store
- Query Service: Manage sets of profile servers
- Data Grid Service: Interfaces and tools for connecting distributed resources over the web
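The division of roles among the profile, product, and query services can be illustrated with a small Python sketch. This is not OODT's actual API: the class names `ProfileServer`, `ProductServer`, and `QueryService`, the resource identifiers, and the sample records are all hypothetical, chosen only to show metadata lookup and data delivery as separate services with a query layer fanning out across sites.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileServer:
    """Delivers metadata (profiles) describing the resources in one data store."""
    name: str
    profiles: dict = field(default_factory=dict)  # resource id -> metadata dict

    def query(self, key, value):
        # Return (server name, resource id) for every profile where key == value.
        return [(self.name, rid) for rid, meta in self.profiles.items()
                if meta.get(key) == value]


@dataclass
class ProductServer:
    """Delivers the data itself from an underlying store."""
    products: dict = field(default_factory=dict)  # resource id -> bytes

    def retrieve(self, rid):
        return self.products[rid]


class QueryService:
    """Manages a set of profile servers and sends one query to all of them."""

    def __init__(self, servers):
        self.servers = list(servers)

    def query(self, key, value):
        hits = []
        for server in self.servers:
            hits.extend(server.query(key, value))
        return hits


# Two hypothetical sites, each describing its own holdings.
site_a = ProfileServer("site-a", {"img-001": {"target": "Mars", "type": "image"}})
site_b = ProfileServer("site-b", {"spec-07": {"target": "Mars", "type": "spectrum"},
                                  "img-900": {"target": "Titan", "type": "image"}})
products_a = ProductServer({"img-001": b"...raw image bytes..."})

# Find resources by metadata first, then fetch the data from the owning site.
hits = QueryService([site_a, site_b]).query("target", "Mars")
print(hits)
```

Keeping metadata lookup separate from data delivery means a client can discover what exists across sites without moving any data until it asks for a specific product.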
8. Applications of OODT: PDS
- Planetary Data System
- National Aeronautics and Space Administration
- http://pds.nasa.gov
9. NASA Planetary Data System
- Official NASA archive for all planetary data
- 9 Nodes with data located at discipline sites
- All missions must add their data (required as part of the mission Announcement of Opportunity)
- Prior to October 2002, no ability to find and share data between PDS nodes
10. PDS Data: Key Challenges
- Challenges to building a science data system for the PDS
  - NASA often flies unique, one-of-a-kind missions
  - A static infrastructure won't work: Nodes and models change
  - Data stored at PDS nodes differs dramatically in structure
  - Missions are required to share science data results with the research community
11. PDS Data Architecture
- Distributed data system environment with federated governance: Each site maintains its own database and infrastructure
- Common domain information model (regularly updated) used to drive system implementations: Ontology and Common Data Elements (based on ISO/IEC 11179)
- Common query interface to distributed services: Implemented with OODT Query Handlers
- Software services that wrap existing data systems to share data: Implemented with OODT Product and Profile servers
- Publishing of data products to a common portal: Implemented using Resource Description Framework (RDF)
12. PDS Architecture Decomposition
13. Applications of OODT: EDRN
- Early Detection Research Network
- Division of Cancer Prevention, National Cancer Institute
- http://cancer.gov/edrn
14. EDRN Overview
- Focus: investigator-initiated, collaborative research on molecular, genetic and other biomarkers for cancer detection and risk assessment.
- Funded since 2000 by the Division of Cancer Prevention in the National Cancer Institute (NCI)
- 40 geographically distributed centers performing parallel, complementary studies
- Strong emphasis on the role of informatics
15. EDRN Participants
- Biomarker Development Laboratories: Responsible for the development and characterization of new biomarkers or the refinement of existing biomarkers.
- Biomarker Reference Laboratories: Serve as a Network resource for clinical and laboratory validation of biomarkers, which includes technological development, quality control, refinement, and high throughput.
- Clinical Epidemiology and Validation Centers: Conduct clinical and epidemiological research regarding the clinical application of biomarkers.
- Data Management and Coordinating Center: Coordinate EDRN research activities, provide logistic support, conduct statistical and computational research for data analysis, and analyze data for validation.
16. OODT and EDRN
- OODT's success led to interagency agreements with both NIH and NCI, resulting in
  - EDRN Informatics Center: Support EDRN's efforts through the development of software systems for information management. Located at NASA Jet Propulsion Laboratory, Pasadena, CA.
  - Principal Investigator: Dan Crichton, JPL.
17. EDRN Data
- EDRN collects, generates, analyzes, and stores a wide variety of different data, including
  - Specimen Inventories: Map specimens collected (blood, sputum, etc.) to patient characteristics
  - Studies and Publications: Information about studies conducted in the EDRN as well as published results (publications, outputs)
  - Biomarkers: Information about indicators of early disease
  - Science Data: Outputs of experiments on specimens, regarding biomarkers, driven by particular studies and protocols
18. EDRN Data Flow
- Moving beyond the local laboratory
- Scalability, interoperability
19. Case Study: ERNE
- ERNE: EDRN Resource Network Exchange
- Challenge: Overcome differences in local schema to develop a national distributed specimen information infrastructure
- All sites running different software and following own procedures
- Rely on a common information model for distributed querying, and provide site-specific mappings at each participant
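The idea of a common information model with site-specific mappings can be sketched as follows. This is a minimal illustration, not ERNE's implementation: the site names, local column names, code tables, and sample rows are all hypothetical. A query is written once against the common model, and each site translates it into its own columns and codes before running it locally.

```python
# A query expressed in the common information model (hypothetical field names).
common_query = {"specimen_type": "blood", "organ": "lung"}

# Hypothetical site-specific mappings:
# common field -> (local column name, common value -> local code).
SITE_MAPPINGS = {
    "site-a": {
        "specimen_type": ("SPEC_TYP", {"blood": "BLD", "sputum": "SPT"}),
        "organ": ("ORGAN_CD", {"lung": 34, "colon": 18}),
    },
    "site-b": {
        "specimen_type": ("sampleKind", {"blood": "Blood", "sputum": "Sputum"}),
        "organ": ("site", {"lung": "LUNG", "colon": "COLON"}),
    },
}

# Stand-ins for each site's local specimen inventory, in its own schema.
LOCAL_DATA = {
    "site-a": [
        {"SPEC_TYP": "BLD", "ORGAN_CD": 34},
        {"SPEC_TYP": "SPT", "ORGAN_CD": 34},
        {"SPEC_TYP": "BLD", "ORGAN_CD": 18},
    ],
    "site-b": [
        {"sampleKind": "Blood", "site": "LUNG"},
        {"sampleKind": "Blood", "site": "LUNG"},
        {"sampleKind": "Sputum", "site": "COLON"},
    ],
}


def translate(query, mapping):
    """Rewrite a common-model query into one site's local columns and codes."""
    local = {}
    for field_name, value in query.items():
        column, codes = mapping[field_name]
        local[column] = codes[value]
    return local


def run_local(rows, local_query):
    """Run the translated query against one site's rows."""
    return [row for row in rows
            if all(row.get(col) == val for col, val in local_query.items())]


def distributed_query(query):
    """Send one common query to every site; return matching specimen counts."""
    counts = {}
    for site, mapping in SITE_MAPPINGS.items():
        local_query = translate(query, mapping)
        counts[site] = len(run_local(LOCAL_DATA[site], local_query))
    return counts


print(distributed_query(common_query))
```

Because the mapping lives at each participant, a site can change its local software or schema and only its own mapping needs to be updated.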
20. ERNE Architecture
21. Connecting Research
- Designing the EDRN informatics architecture as a collection of well-defined components via OODT has simplified the process of building interfaces to non-EDRN systems
- Wrappers can be built to link non-EDRN systems
- Translators can be developed to deal with different semantic architectures
- caBIG
- ERNE/caTissue Wrapper
- EDRN-Canary Collaboration
- A cloud computing effort that shares raw science data via Amazon S3 between EDRN and the Canary group, which uses software from GenoLogics Life Sciences
22. EDRN Knowledge Environment
- Building a Semantic Bioinformatics Grid for the EDRN
23. Lessons From EDRN
- Architecture and a vision have been critical
- Technology hasn't been as critical
- Keep it simple
- Science support has been critical
- Getting buy-in and participation from domain experts is key
- Incremental development and deployment
- Starting with a few sites was very helpful in understanding the issues
- We had both development sites and observer sites initially
- The IRB process has been a big schedule driver
- Distributed architecture can be a challenge
- Not all sites are up to maintaining the implementation
- Loosely coupled architecture with simple interfaces helped
24. Applications of OODT: VPICU
- Whittier Virtual Pediatric Intensive Care Unit
- Children's Hospital Los Angeles
- http://picu.net
- Collaboration between 85 multi-disciplinary pediatric intensive care units across the U.S.
25. Collaboration with VPICU
- Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU), founded in 1998 by clinicians at CHLA
- Leverage advances in technology to
  - Improve patient care
  - Educate practitioners
  - Conduct research
  - Reduce cost of providing care
26. VPICU Research Data
Secondary use of observational clinical (EHR, monitor, annotations) data
- Real Health Care Data Set
  - Massive, grows continuously
  - Heterogeneous formats, types, etc.
  - Incomplete, proprietary descriptions
  - Fragmented across stores, organizational boundaries
  - Incomplete, inconsistent
  - Highly restricted (legal, privacy, ethical considerations)
- Ideal Research Data Set
  - Manageable size, static
  - Homogeneous
  - Complete, standardized descriptions and annotations
  - Available as single unit
  - Complete, consistent
  - Minimal usage restrictions
27. VPICU Project Areas
- Data extraction and management: Take data from proprietary stores, make it accessible
- Transformation of data into knowledge: Process (and re-process) the data to extract insight
- Data-driven decision support: Develop tools that learn continuously from the data
- Distributed data-sharing over a national network: Enable research on scales previously impossible while maintaining security, privacy, compliance
28. Principles behind VPICU
- Decouple from (proprietary) vendor databases
- Integrate disparate data sources into a single model
- Dynamically (re)generate research database(s)
  - We don't know for sure what queries will be most useful at the outset
- Provide web services for multi-faceted access to the data to enable discovery and analysis
- Support federation among multiple PICU sites
29. Algorithm for VPICU Data System
- Develop a common Domain Ontology to describe the information space
- Develop compute services that support extraction of data from existing databases
- Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology
- Construct a set of online research databases to enable data mining and analysis
- Deploy a data grid infrastructure of hardware and software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications)
- Deploy a set of compute services to support data mining and analysis
- Develop an architectural plan and roadmap for scaling and integrating other PICUs
30. VPICU Architecture
(Diagram label: File-based storage)
31. VPICU Architecture
- Original data sources/stores at backend
  - Proprietary schema
  - Hardware that we don't own or control
  - Production systems (very load-sensitive)
  - Legacy technologies (sometimes)
  - Unreliable (can't guarantee always available)
- Includes
  - Hospital-wide commercial EHR system(s)
  - Homegrown critical care database
  - Specialized clinical applications
  - Raw bedside monitor data
(Diagram labels: EHR, Homegrown, File-based storage, Clinical apps, Monitor data, Proprietary data sources)
32. VPICU Architecture
- Regular extraction of new data
- VPICU-controlled resources (our hardware and software)
- Transform to VPICU schema
- Link data belonging to same patient
- May contain PHI: Must be highly secure
- Data at this stage is normalized, stored in a format suitable for ingestion into any number of research databases
(Diagram labels: File-based storage, VPICU-owned resources)
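The transform-and-link step described on this slide can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the source field names (`MRN`, `charted_at`, `weight_lb`, `pt`, `ts`, `hr`), the target schema, and the sample rows are hypothetical and are not taken from the VPICU system. The sketch normalizes rows from two differently shaped extracts into one schema and then groups them by patient.

```python
from collections import defaultdict


def from_ehr(row):
    # Hypothetical EHR extract: zero-padded MRN string, weight in pounds.
    return {
        "patient_id": row["MRN"].lstrip("0"),
        "time": row["charted_at"],
        "variable": "weight_kg",
        "value": round(row["weight_lb"] * 0.45359237, 2),
    }


def from_monitor(row):
    # Hypothetical bedside monitor extract: integer patient number, heart rate.
    return {
        "patient_id": str(row["pt"]),
        "time": row["ts"],
        "variable": "heart_rate_bpm",
        "value": row["hr"],
    }


def link_by_patient(records):
    """Group normalized records by patient and order each group by time."""
    linked = defaultdict(list)
    for rec in records:
        linked[rec["patient_id"]].append(rec)
    for recs in linked.values():
        # ISO 8601 timestamps of equal precision sort correctly as strings.
        recs.sort(key=lambda r: r["time"])
    return dict(linked)


ehr_rows = [{"MRN": "000123", "charted_at": "2010-03-01T08:00", "weight_lb": 22.0}]
monitor_rows = [
    {"pt": 123, "ts": "2010-03-01T07:59", "hr": 141},
    {"pt": 456, "ts": "2010-03-01T08:01", "hr": 118},
]

normalized = [from_ehr(r) for r in ehr_rows] + [from_monitor(r) for r in monitor_rows]
linked = link_by_patient(normalized)
for patient_id, recs in linked.items():
    print(patient_id, [(r["time"], r["variable"], r["value"]) for r in recs])
```

Real patient linkage is usually harder than matching a cleaned identifier, so the exact-match key here should be read as a placeholder for whatever matching rule a site actually uses.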
33. VPICU Architecture
- Research databases
  - Application-specific
  - Optimized
  - Contain de-identified or anonymized data
  - VPICU ontology, schema
  - Access via configurable web services
(Diagram label: File-based storage)
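One way to picture what "de-identified" might mean for a record entering a research database is the sketch below. It is an assumption-laden illustration, not the VPICU procedure and not a compliance recipe: the list of direct identifiers, the keyed-hash pseudonym, the choice to store hours since admission instead of absolute timestamps, and all field names are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical secret; in practice it would be stored away from the research database.
SECRET_KEY = b"replace-with-a-protected-key"

# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "mrn", "address"}


def pseudonym(patient_id, key=SECRET_KEY):
    """Derive a stable pseudonym from a patient ID with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def deidentify(record, admitted_at):
    """Drop direct identifiers and replace the timestamp with a relative offset."""
    out = {k: v for k, v in record.items()
           if k not in DIRECT_IDENTIFIERS and k != "time"}
    out["patient_key"] = pseudonym(record["mrn"])
    delta = datetime.fromisoformat(record["time"]) - datetime.fromisoformat(admitted_at)
    out["hours_since_admission"] = delta.total_seconds() / 3600
    return out


record = {
    "mrn": "123",
    "name": "Jane Doe",
    "time": "2010-03-01T10:30:00",
    "variable": "heart_rate_bpm",
    "value": 141,
}
clean = deidentify(record, admitted_at="2010-03-01T08:00:00")
print(clean)
```

The same patient always maps to the same `patient_key`, so records stay linkable inside the research database without carrying the original identifier.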
34. What are research databases?
- Designed for specific research questions, analytical techniques
- Need not always be relational or databases at all
- Available via web interfaces and software services: Researcher using R can connect directly through R bindings
- Examples
  - Relational database for traditional retrospective studies
  - Search engine over free text clinical notes, etc.
  - Patient/patient comparison, retrieval (find patient like this one)
  - Data-backed patient simulator for testing interventions
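The "find patient like this one" example can be made concrete with a very simple retrieval sketch. The technique shown, nearest-neighbour search by Euclidean distance over z-scored features, is a stand-in chosen for brevity and is not claimed to be what VPICU uses; the patient keys, the three features, and their values are hypothetical.

```python
import math
import statistics

# Hypothetical de-identified summaries:
# patient key -> (age in months, mean heart rate, mean SpO2).
patients = {
    "p1": (6, 150.0, 97.0),
    "p2": (8, 145.0, 96.0),
    "p3": (120, 90.0, 99.0),
    "p4": (100, 95.0, 92.0),
}


def standardize(table):
    """Z-score each feature so no single unit dominates the distance."""
    columns = list(zip(*table.values()))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return {
        key: tuple((x - m) / s for x, m, s in zip(values, means, stdevs))
        for key, values in table.items()
    }


def most_similar(query_key, table, k=2):
    """Return the k patient keys closest to query_key in standardized space."""
    z = standardize(table)
    q = z[query_key]
    distances = [(math.dist(q, vec), key) for key, vec in z.items() if key != query_key]
    return [key for _, key in sorted(distances)[:k]]


print(most_similar("p1", patients, k=1))
```

A research database built for this kind of question would likely precompute the standardized vectors rather than recomputing them on every query as this sketch does.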
35. VPICU Architecture
(Diagram label: File-based storage)
36. OODT and the VPICU Data System
- Develop an Information Model (Ontology) to describe the domain
- Develop compute services that support extraction of data from existing CHLA databases (OODT Query Handlers)
- Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology (OODT CAS crawler, catalog services)
- Construct a set of online research databases to enable data mining and analysis (OODT Catalog and Archive Services)
- Deploy a data grid infrastructure of hardware and software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications) (OODT Data Grid Services)
- Deploy a set of compute services to support data mining and analysis
- Develop an architectural plan and roadmap for scaling and integrating other PICUs
37. OODT as Open Source
- Jan 2010: OODT accepted as a podling in the Apache Software Foundation (ASF) Incubator
- First NASA software licensed and incubating within the ASF
- Learn more and track our progress at
  - http://incubator.apache.org/projects/oodt.html
- Join the mailing list
  - oodt-dev_at_incubator.apache.org
- Chat on IRC
  - oodt on irc.freenode.net
38. Acknowledgements
- Jet Propulsion Laboratory: Dan Crichton, Chris Mattmann, Sean Kelly, Steve Hughes, Amy Braverman, Thuy Tran
- National Cancer Institute: Sudhir Srivastava, Christos Patriotis, Don Johnsey
- Fred Hutchinson Cancer Research Center: Mark Thornquist, Ziding Feng, Jackie Dalhgren, Suzanna Reid
- Children's Hospital Los Angeles: Randall Wetzel, Robinder Khemani, Paul Vee, Jeff Terry, Robert Kaptan, Doug Hallam