Title: An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain
1 An NLP Ecosystem for Development and Use of
Natural Language Processing in the Clinical
Domain
Integrating Data for Analysis, Anonymization, and
Sharing
Division of Biomedical Informatics, University of
California, San Diego
2 Overview
- The promise of natural language processing (NLP)
- Challenges of developing NLP in the clinical domain
- Challenges in applying NLP in the clinical domain
- iDASH
- Opportunities for sharing and collaboration in NLP
3 NLP Success
- "Fresh off its butt-kicking performance on
Jeopardy!, IBM's supercomputer 'Watson' has
enrolled in medical school at Columbia
University." New York Daily News, February 18,
2011
- "IBM's computer could very well herald a whole
new era in medicine." ComputerWorld, February 17,
2011
Dr. Watson??
4 Clinical NLP Since the 1960s
- Why has clinical NLP had little impact on
clinical care?
5 Barriers to Development
- Sharing clinical data is difficult
- Have not had shared datasets for development and evaluation
- Modules trained on general English not sufficient
- Insufficient common conventions and standards for annotations
- Data sets are unique to a lab
- Not easily interchangeable
6
- Limited collaboration
- Clinical NLP applications are silos and black boxes
- Have not had open source applications
- Reproducing results is a formidable challenge
- Open source release not always sufficient
- Software engineering quality not always great
- Mechanisms for reproducing results are sparse
7 Overview
- The promise of natural language processing (NLP)
- Challenges of developing NLP in the clinical domain
- Challenges in applying NLP in the clinical domain
- Developing an NLP ecosystem on iDASH
8 Security and Privacy Concerns
- Clinical texts have many patient identifiers
- 18 HIPAA identifiers
- Names
- Addresses
- Items not regulated by HIPAA
- "tight end for the Steelers"
- Unique cases
- 50-year-old woman who is pregnant
- Sensitive information
- HIV status
Institutions are reluctant to share data
9
- Lack of user-centered development and scalability
- Perceived cost of applying NLP outweighs the
perceived benefit (Len D'Avolio)
10 Overview
- The promise of natural language processing (NLP)
- Challenges of developing NLP in the clinical domain
- Challenges in applying NLP in the clinical domain
- Developing an NLP ecosystem on iDASH
11 iDASH
- integrating Data
- Analysis
- Anonymization
- Sharing
Data
Software/Tools
Computational Resources
12 Disincentives to Share
- Scooping by faster analysts
- Exposure of potential errors in data
- Resources for preparing data submissions
- Maintaining data
- Interacting with potential users takes time
- Threat of privacy breach when human subjects are involved
- Do not have policies in place
- Fallible de-identification and anonymization algorithms
13 nlp-ecosystem.ucsd.edu
14 HIPAA and/or FISMA Compliant Cloud
Digital informed consent
- Access control
- De-identification
- Query counts
- Artificial data generators
Privacy preserving
Informed Consent Registry
Customizable DUAs
Researcher access
15 Bibliography
Schemas
Tutorials
Research
Guidelines
Resources
Education
NLP Ecosystem
UCSD Clinical Data
Data
Evaluation Workbench
De-Identification
MT Samples
Tools and Services
Collaborative Development Tools
TextVect
Virtual Machines
Annotation Admin eHOST
Registry
16 Collaborative Effort to Build Ecosystem
Evaluation Workbench
De-Identification
Tools and Services
Collaborative Knowledge Authoring
TextVect
Increase access to NLP
Virtual Machines
Annotation Environment
Decrease Burden of Developing NLP
Registry
17 orbit
- Increase ability to find NLP tools
18 Registry: orbit.nlm.nih.gov
Len D'Avolio, Dina Demner-Fushman
19 De-identification service
- Increase access to clinical text
20 De-identification
- Several available de-identification modules
- Need to adapt to local text
- Efficient
- Secure
- Customizable ensemble de-identification system
- Build a de-identified corpus
- Incorporate existing de-id modules
- Launch as virtual machine
- Iterative training, evaluation, and modification by user
- Correct mistakes
- Add regular expressions
Brett South, Stephane Meystre, Oscar Fernandez,
Danielle Mowery
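The customizable ensemble idea on this slide can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the module interface, the regular expressions, and the labels are hypothetical and are not the iDASH service's actual design. Each module proposes spans, the ensemble takes their union, and a user adapts the system to local text by appending a regular expression.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, identifier label)


def regex_module(pattern: str, label: str) -> Callable[[str], List[Span]]:
    """Wrap a regular expression as a de-identification module (hypothetical interface)."""
    compiled = re.compile(pattern)

    def detect(text: str) -> List[Span]:
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]

    return detect


def ensemble_deidentify(text: str, modules: List[Callable[[str], List[Span]]]) -> str:
    """Union the spans from all modules and replace each with a [LABEL] placeholder."""
    spans: List[Span] = []
    for module in modules:
        spans.extend(module(text))
    spans.sort()
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            # Overlapping proposals: extend the earlier span and keep its label.
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in reversed(merged):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Illustrative patterns only; real identifiers need far broader coverage.
MODULES = [
    regex_module(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
    regex_module(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "DATE"),
]
# A user correcting a miss on local text by adding a regular expression.
MODULES.append(regex_module(r"\bMRN\s*\d+\b", "MRN"))

example = ensemble_deidentify("Seen 3/14/2011, MRN 12345, SSN 123-45-6789.", MODULES)
print(example)
```

Taking the union of module outputs favors recall, which is usually the priority when the cost of a missed identifier is a privacy breach.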
21 TextVect
- Increase access to textual features
22 TextVect
NLM Abhishek Kumar
23 collaborative Knowledge Authoring Support Service
(cKASS)
- Decrease the Burden of Customizing an NLP
Application
24 Customizing an IE App
User's Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output
Map
25 Customizing an IE App
User's Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output: Dry cough, Productive cough, Cough, Hacking cough, Bloody cough
Which concepts?
26 Customizing an IE App
User's Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output: Temp 38.0C, Low-grade temperature
What is a fever?
27 Customizing an IE App
User's Concepts: Cough, Dyspnea, Infiltrate on CXR, Wheezing, Fever, Cervical Lymphadenopathy
IE Output: "NECK: no adenopathy" (Disorder: adenopathy; Negation: negated)
Section mapping
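The customization problems raised on the last four slides (which surface forms count as a concept, what temperature counts as a fever, how negation and sections are handled) can be made concrete with a small sketch. The knowledge base, the negation cue list, and the 38.0 C threshold below are all hypothetical choices that a user would have to make; none of them comes from the presentation's actual system.

```python
import re

# Hypothetical user knowledge base: user concept -> surface forms an IE system might emit.
USER_KB = {
    "Cough": ["cough", "dry cough", "productive cough", "hacking cough", "bloody cough"],
    "Cervical Lymphadenopathy": ["adenopathy", "lymphadenopathy"],
}
NEGATION = re.compile(r"\b(?:no|denies|without)\b")  # illustrative cue list only
TEMPERATURE = re.compile(r"temp\w*\s*(\d+(?:\.\d+)?)\s*C\b", re.IGNORECASE)


def extract_concepts(line):
    """Map one line of report text to user concepts, with section and negation status."""
    section, sep, body = line.partition(":")
    if not sep:  # no "SECTION:" prefix on this line
        section, body = None, line
    lowered = body.lower()
    results = []
    for concept, variants in USER_KB.items():
        # Try longer variants first so "hacking cough" wins over "cough".
        for variant in sorted(variants, key=len, reverse=True):
            index = lowered.find(variant)
            if index == -1:
                continue
            # Crude rule: any negation cue earlier in the line negates the concept.
            negated = bool(NEGATION.search(lowered[:index]))
            results.append(
                {"section": section, "concept": concept, "matched": variant, "negated": negated}
            )
            break
    return results


def has_fever(text, threshold_c=38.0):
    """Decide 'fever' from a stated temperature; the threshold is the user's call."""
    match = TEMPERATURE.search(text)
    return bool(match) and float(match.group(1)) >= threshold_c


print(extract_concepts("NECK: no adenopathy"))
print(has_fever("Temp 38.0C"), has_fever("Low-grade temperature"))
```

Note that "Low-grade temperature" yields no decision under a numeric rule, which is exactly the kind of gap the "What is a fever?" slide points at.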
28 KOS-IE: Knowledge Organization Systems for
Information Extraction
29 Compile information helpful for IE
30 Collaborative Knowledge Base Development: cKASS
NLP Tools
- Physician
- Radiologist
- Nurse
- Clinical Researcher
- Knowledge Engineer
Decision Support System
User KB
Shared KB
External KB
LQ Wang, M Conway, F Fana, M Tharp, D Hillert
31 Knowledge Authoring
- Augment user KB with lexical variants, synonyms, and related concepts
- User-driven authoring
- Top-down: Provide access to external knowledge sources
- UMLS, SPECIALIST Lexicon, BioPortal
- Bottom-up: Annotate to derive synonyms
- Recommendation-based authoring
- Generate lexical variants
- Mine external knowledge sources
- Mine patient records
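As a rough illustration of recommendation-based authoring, the sketch below generates candidate lexical variants for a term and proposes only those not already in the user KB. The variant rules (naive pluralization, hyphen-to-space) and the function names are assumptions made for this example; a real system would draw on resources such as the SPECIALIST Lexicon rather than string rules.

```python
def generate_variants(term):
    """Return naive candidate variants of a term for the user to accept or reject."""
    base = " ".join(term.lower().split())  # normalize case and whitespace
    variants = {base}
    if not base.endswith("s"):
        variants.add(base + "s")  # naive plural; many clinical terms will not pluralize this way
    # Hyphenated forms often also appear with a space in clinical text.
    variants |= {v.replace("-", " ") for v in variants if "-" in v}
    return variants


def recommend(term, user_kb_terms):
    """Suggest variants that the user KB does not already contain."""
    known = {t.lower() for t in user_kb_terms}
    return sorted(generate_variants(term) - known)


suggestions = recommend("Low-grade fever", ["low-grade fever"])
print(suggestions)
```

Because the rules overgenerate, the output is best treated as a review queue for the knowledge author, not as accepted synonyms.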
32 Evaluation Workbench
- Decrease the Burden of Evaluation and Error Analysis
33 Evaluation Workbench
- Compare the output of two NLP annotators on clinical text
- NLP system vs human annotation
- View annotations
- Calculate outcome measures
- Drill down to all levels of annotation
- Document-level
- Perform error analysis
- Future versions will support formal error
analysis
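A minimal sketch of the comparison this slide describes, assuming each annotation is reduced to a (document, start, end, label) tuple and that only exact matches count as agreement. Both assumptions are simplifications for illustration and are not the Evaluation Workbench's actual data model or matching rule.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[str, int, int, str]  # (document id, start, end, classification)


def compare(reference: Set[Annotation], system: Set[Annotation]) -> Dict[str, object]:
    """Compute exact-match outcome measures and list disagreements for error analysis."""
    true_positives = reference & system
    false_positives = system - reference
    false_negatives = reference - system
    precision = len(true_positives) / len(system) if system else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Keeping the disagreements lets a reviewer drill down to individual annotations.
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }


human = {
    ("doc1", 16, 26, "chest pain"),
    ("doc1", 40, 45, "fever"),
    ("doc2", 3, 8, "cough"),
    ("doc2", 20, 27, "dyspnea"),
}
nlp_system = {
    ("doc1", 16, 26, "chest pain"),
    ("doc2", 3, 8, "cough"),
}
result = compare(human, nlp_system)
print(result["precision"], result["recall"], result["f1"])
```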
34 Levels of Annotation
- Document
- Report classified as Shigellosis
- Group
- Section classified as Past Medical History Section
- Utterance
- Group of text classified as Sentence
- Snippet
- "chest pain" classified as CUI 058273
- Word
- "pain" classified as noun
- Token
- "." classified as EOS marker
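One way to picture these levels is as a single annotation record type tagged with its level, so that a viewer can filter by level when drilling down. The class and field names below are hypothetical and the example sentence is invented; only the level names and the example classifications come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    DOCUMENT = "document"
    GROUP = "group"
    UTTERANCE = "utterance"
    SNIPPET = "snippet"
    WORD = "word"
    TOKEN = "token"


@dataclass(frozen=True)
class Annotation:
    level: Level
    start: int  # character offset, inclusive
    end: int  # character offset, exclusive
    classification: str

    def text(self, document: str) -> str:
        """Return the span of the document that this annotation covers."""
        return document[self.start:self.end]


document = "Patient reports chest pain."
annotations = [
    Annotation(Level.UTTERANCE, 0, 27, "Sentence"),
    Annotation(Level.SNIPPET, 16, 26, "CUI 058273"),
    Annotation(Level.WORD, 22, 26, "noun"),
    Annotation(Level.TOKEN, 26, 27, "EOS marker"),
]


def at_level(items, level):
    """Select the annotations at one level, as a viewer would when drilling down."""
    return [a for a in items if a.level is level]


for annotation in annotations:
    print(annotation.level.value, repr(annotation.text(document)), annotation.classification)
```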
35 Select Classifications to View
Document annotations
Outcome Measures for Selected Annotations
Report List
Attributes for Selected Annotation
Relationships for Selected Annotation
VA and ONC SHARP Christensen, Murphy, Frabetti,
Rodriguez, Savova
36 Annotation Environment
- Decrease the Burden of Annotation
37 Challenges to Annotating
- Time consuming
- Recruiting and training annotators for high agreement
- Expensive
- Domain experts especially expensive
- Need for annotation by multiple people
- Challenging to design annotation task
- How many annotators?
- How should I quantify quality of annotations?
- Logistically challenging
- Managing files and batches of reports
- Setting up annotation tool
- Reinventing the wheel
- Hasn't someone created a schema for this before?
38 How can we reduce the burden of annotation?
39 iDASH Annotation Environment
Goal: provide an environment to decrease
the burden of annotation for research and
application
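On the question of how to quantify annotation quality, one common choice (an assumption here, since the slide does not name a measure) is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. A minimal sketch for two annotators labeling the same items:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa suits classification-style tasks; for span annotation, where there is no fixed set of items, pairwise F-measure between annotators is often used instead.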
Annotator Registry
eHOST
Annotation Admin
Web application iDASH cloud
Client app on your computer
VA, SHARP, and NIGMS S Duvall, B South, G
Savova, N Elhadad, H Hochheiser
40 Annotator Registry
- Enlist for annotation
- Certify for annotation tasks
- Personal health information
- Part-of-speech tagging
- UMLS mapping
- Set pay rate
- Searchable
- Available for inclusion in new annotation task
- http://idash.ucsd.edu/nlp-annotator-registry
41 Annotation Admin: Intended Users and Uses
- Users
- NLP researchers
- Annotation administrators
- Uses
- Manage annotation projects: who annotates what
- Currently done with hundreds of files on hard drive
- Integrate with annotation tool (eHOST)
- Download batches of raw reports to annotators
- Upload and store annotated reports
- Manage simple annotation projects
- Facilitate distributed annotation
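The bookkeeping that Annotation Admin is meant to take over (who annotates what, in which batch, and how far along they are) can be sketched in a few lines. The function names, the round-robin assignment, and the two-annotators-per-batch default are illustrative assumptions and do not describe the tool's actual behavior.

```python
from itertools import cycle


def make_batches(report_ids, batch_size):
    """Split the report list into consecutive batches."""
    return [report_ids[i:i + batch_size] for i in range(0, len(report_ids), batch_size)]


def assign_batches(batches, annotators, per_batch=2):
    """Assign each batch to several annotators in round-robin order."""
    if per_batch > len(annotators):
        raise ValueError("Not enough annotators to cover a batch.")
    rotation = cycle(annotators)
    return {index: [next(rotation) for _ in range(per_batch)] for index in range(len(batches))}


def progress(assignments, completed):
    """Fraction of (batch, annotator) assignments that have been uploaded."""
    total = sum(len(names) for names in assignments.values())
    return len(completed) / total if total else 0.0


batches = make_batches(["r1", "r2", "r3", "r4", "r5"], batch_size=2)
assignments = assign_batches(batches, ["ann_a", "ann_b", "ann_c"])
done = {(0, "ann_a"), (0, "ann_b"), (1, "ann_c")}
print(assignments, progress(assignments, done))
```

Assigning every batch to more than one annotator is what makes agreement measurement possible later.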
42 Annotation Admin
1. Assign annotators to a task
43 2. Create a Schema
44 3. Assign users and set time expectations
45 4. Keep track of progress
46 Collaborative Effort to Build Resources
Evaluation Workbench
De-Identification
Tools and Services
Collaborative Knowledge Authoring
TextVect
Increase access to NLP
Virtual Machines
Annotation Environment
Decrease Burden of Developing NLP
Registry
47 Conclusion
- More demand for EHR data
- NLP has potential to extend value of narrative clinical reports
- There have been many barriers
- To development
- To deployment
- Recent developments facilitate collaboration and sharing
- Common annotation conventions
- Privacy algorithms
- Shared datasets
- Hosted environments
- iDASH hopes to facilitate
- Development of NLP
- Application of NLP
48 Questions and Discussion
iDASH/ShARe Workshop on Annotation, September 29,
2012, La Jolla, CA