Title: Converting Semi-structured Clinical Medical Records into Information and Knowledge
1 Converting Semi-structured Clinical Medical Records into Information and Knowledge
- Dr. Hyoil Han and Xiaohua Zhou
- College of Information Science and Technology, Drexel University
2 Agenda
- Problem Addressed
- Methods
- Approach to numeric values
- Approach to medical terms
- Approach to text classification
- Implementation
- Evaluation
- Future Work
3 Problem Addressed
- Descriptions
- Automatically extract information from semi-structured patient records.
- Three types of information
- Numbers: blood pressure, weight, pulse, etc.
- Medical terms: medical history, surgical history
- Text classification: smoking behavior, alcohol use, appearance, etc.
- Each record consists of multiple sections, each beginning with a fixed string. Each section is written in natural language.
4 Problem Addressed (cont.)
5 Problem Addressed (cont.)
6 Approach to Numeric Values (1)
- Number Identification
- Tokenization
- Named Entity Recognition
- Concept Identification
- String Match
- Synonym Expansion
- Association
- Pattern-based association approach
- Linkage-based association approach
7 Approach to Numeric Values (2)
- Pattern-based Approach
- Examples
- CONCEPT is NUMBER
- CONCEPT of NUMBER
- CONCEPT, NUMBER
- CONCEPT NUMBER
- Very simple, but generalizes poorly to phrasings the patterns do not cover.
- Linkage-based Association Approach
- Convert the linkage diagram (produced by the link grammar parser) to a graph.
- Calculate the shortest distance between every concept-number pair in a sentence.
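The four example patterns of the pattern-based approach can be sketched with regular expressions. This is only an illustration under stated assumptions: the concept list, the number format (including readings such as 120/80), and the function names are hypothetical and do not come from the original system.

```python
import re

# Hypothetical concept list; a real system would also expand synonyms.
CONCEPTS = ["blood pressure", "weight", "pulse"]

# Assumed number format: an integer or decimal, optionally "a/b" (e.g. 120/80).
NUMBER = r"(\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?)"


def build_patterns(concept):
    """Return the four example patterns, in order, for one concept."""
    c = re.escape(concept)
    return [
        re.compile(rf"\b{c}\s+is\s+{NUMBER}", re.IGNORECASE),  # CONCEPT is NUMBER
        re.compile(rf"\b{c}\s+of\s+{NUMBER}", re.IGNORECASE),  # CONCEPT of NUMBER
        re.compile(rf"\b{c}\s*,\s*{NUMBER}", re.IGNORECASE),   # CONCEPT, NUMBER
        re.compile(rf"\b{c}\s+{NUMBER}", re.IGNORECASE),       # CONCEPT NUMBER
    ]


def extract_numeric(sentence):
    """Map each concept to the first number matched by any of its patterns."""
    found = {}
    for concept in CONCEPTS:
        for pattern in build_patterns(concept):
            match = pattern.search(sentence)
            if match:
                found[concept] = match.group(1)
                break
    return found


example = extract_numeric(
    "Blood pressure is 120/80, pulse 72, and weight of 180 pounds."
)
# A phrasing outside the four patterns yields nothing, which illustrates
# the generalization problem noted on the slide.
missed = extract_numeric("Pulse was measured at 72.")
```

The second call returns an empty result because "was measured at" fits none of the four patterns.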
8 Approach to Numeric Values (3)
- Link Grammar Parser
- Each word becomes a node and each link a (weighted) edge.
- Assumption: if a number is the value of a certain concept, the number's shortest distance to that concept must be less than its distance to any other concept in the sentence.
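The shortest-distance assumption above can be sketched as follows. The code does not call a link grammar parser; the links for the sample sentence are written by hand and are hypothetical, as are the unit edge weights and all function names. Dijkstra's algorithm is used here as one reasonable way to compute shortest distances on a weighted graph.

```python
import heapq


def build_graph(links):
    """Turn (word_a, word_b, weight) links into an undirected adjacency map."""
    graph = {}
    for a, b, weight in links:
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    return graph


def shortest_distances(graph, source):
    """Dijkstra's algorithm: distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def associate(links, concepts, numbers):
    """Assign each number to the concept at the smallest graph distance."""
    graph = build_graph(links)
    result = {}
    for number in numbers:
        dist = shortest_distances(graph, number)
        reachable = [c for c in concepts if c in dist]
        if reachable:
            result[number] = min(reachable, key=lambda c: dist[c])
    return result


# Hand-written, hypothetical links for:
# "Her weight is 180 pounds and her pulse is 72"
LINKS = [
    ("weight", "is.1", 1), ("is.1", "pounds", 1), ("180", "pounds", 1),
    ("is.1", "and", 1), ("and", "is.2", 1),
    ("pulse", "is.2", 1), ("is.2", "72", 1),
]

pairs = associate(LINKS, ["weight", "pulse"], ["180", "72"])
```

In this toy graph, "180" is three links from "weight" and five from "pulse", so it is assigned to "weight"; "72" is two links from "pulse" and four from "weight".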
9 Approach to Medical Terms (1)
- State of the Art
- Current NER algorithms do not work well for medical term identification.
- An ontology is important for achieving high accuracy in medical term extraction.
- Searching the ontology for every combination of word sequences in a sentence is not efficient.
- Solution
- POS-based Ordered Patterns Search
10 Approach to Medical Terms (2)
- Flow
- Part of speech tagging
- Ordered Patterns Matching, for example
- JJ NN NN
- NN NN
- JJ NN
- NN
- Normalization of the candidate term.
- Look up the candidate term in an ontology (e.g., UMLS).
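The flow above can be sketched on pre-tagged input. Everything specific here is an assumption made for illustration: the tokens are tagged by hand rather than by a real tagger, normalization is reduced to lowercasing, and the ontology is a small hypothetical set standing in for a resource such as UMLS.

```python
# Ordered patterns from the slide, longest first.
PATTERNS = [("JJ", "NN", "NN"), ("NN", "NN"), ("JJ", "NN"), ("NN",)]

# Hypothetical stand-in for an ontology lookup (e.g. UMLS).
ONTOLOGY = {"congestive heart failure", "heart failure", "hypertension"}


def normalize(words):
    """Assumed normalization: lowercase and join with single spaces."""
    return " ".join(word.lower() for word in words)


def extract_terms(tagged):
    """Scan left to right; at each position try the patterns in order and
    keep the first candidate that is found in the ontology."""
    terms = []
    i = 0
    while i < len(tagged):
        matched = False
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if len(window) < len(pattern):
                continue
            if tuple(tag for _, tag in window) != pattern:
                continue
            candidate = normalize(word for word, _ in window)
            if candidate in ONTOLOGY:
                terms.append(candidate)
                i += len(pattern)  # skip past the matched term
                matched = True
                break
        if not matched:
            i += 1
    return terms


# Hand-tagged sample sentence (tags are illustrative).
TAGGED = [
    ("History", "NN"), ("of", "IN"), ("congestive", "JJ"), ("heart", "NN"),
    ("failure", "NN"), ("and", "CC"), ("hypertension", "NN"),
]

terms = extract_terms(TAGGED)
```

Because longer patterns are tried first, "congestive heart failure" is preferred over the shorter "heart failure", and only sequences that fit a pattern are ever looked up.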
11 Approach to Text Classification (1)
- Available Methods
- Analytic approach
- Machine learning
- Decision trees are frequently used in natural language understanding.
- Examples
- Each patient is either a current smoker, a former smoker, or a nonsmoker.
- Texts
- "She quit smoking five years ago" (former smoker)
- "She is currently a smoker" (current smoker)
- "None" (nonsmoker)
12 Approach to Text Classification (2)
- Word-based Boolean Feature Extraction
- Choose one or more parts of speech: verb, noun, adjective, and adverb.
- Choose one or more sentence constituents: subject, verb, object, and supplement.
- Head noun or head adjective only. If this option is enabled, only the head word of a noun phrase or adjective phrase is extracted.
- Use the lemma (uninflected form) of each word. If this option is enabled, "denies", "denied", and "deny" are treated as the same feature.
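Two of the options above, the part-of-speech filter and the lemma switch, can be sketched on pre-annotated tokens. The token format (word, part of speech, lemma), the tag names, and the function names are assumptions for illustration; the constituent and head-word options would need a parser and are left out.

```python
def extract_features(tokens, pos_filter=("VERB", "NOUN", "ADJ", "ADV"),
                     use_lemma=True):
    """Return the set of word features present in one text.

    tokens: iterable of (word, part_of_speech, lemma) triples, assumed to
    come from an upstream tagger and lemmatizer.
    """
    features = set()
    for word, pos, lemma in tokens:
        if pos not in pos_filter:
            continue
        features.add((lemma if use_lemma else word).lower())
    return features


def to_boolean_vector(features, vocabulary):
    """Encode a feature set as booleans over a fixed vocabulary order."""
    return [term in features for term in vocabulary]


# Hand-annotated sample texts (annotations are illustrative).
TEXT_A = [("She", "PRON", "she"), ("denies", "VERB", "deny"),
          ("smoking", "NOUN", "smoking")]
TEXT_B = [("He", "PRON", "he"), ("denied", "VERB", "deny"),
          ("alcohol", "NOUN", "alcohol")]

with_lemma = extract_features(TEXT_A) & extract_features(TEXT_B)
without_lemma = (extract_features(TEXT_A, use_lemma=False)
                 & extract_features(TEXT_B, use_lemma=False))
vector = to_boolean_vector(extract_features(TEXT_A),
                           ["alcohol", "deny", "smoking"])
```

With lemmas enabled the two texts share the feature "deny"; with lemmas disabled, "denies" and "denied" are distinct and the texts share nothing.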
13 Approach to Text Classification (3)
- ID3-based Decision Tree
- The criterion for feature selection is maximum information gain (mutual information).
- ID3 yields fewer features than other algorithms.
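The selection criterion can be illustrated with a short information gain computation over Boolean word features. The six toy texts and their labels are invented for this sketch and are not taken from the evaluated records; only the root-level feature choice is shown, not a full ID3 tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on presence/absence of a feature.

    rows: list of feature sets, one per text.
    """
    remainder = 0.0
    for present in (True, False):
        subset = [label for row, label in zip(rows, labels)
                  if (feature in row) == present]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Pick the feature with maximum information gain (first wins ties)."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Invented toy data: feature sets and smoking-status labels.
ROWS = [{"quit"}, {"quit", "ago"}, {"currently"}, {"currently", "smoker"},
        set(), set()]
LABELS = ["former", "former", "current", "current", "non", "non"]

gain_quit = information_gain(ROWS, LABELS, "quit")
gain_ago = information_gain(ROWS, LABELS, "ago")
root = best_feature(ROWS, LABELS, ["ago", "quit", "smoker"])
```

Here "quit" separates both former smokers from everyone else, so it has a higher gain than "ago" or "smoker" and would be chosen at the root.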
14 Approach to Text Classification (4)
- Example: ID3-based decision tree for classification of smoking behavior.
15 Implementation
16 Evaluation
- 50 semi-structured patient records
- The goal is to extract 24 attributes (18 fields): 4 medical-term attributes, 8 numeric attributes, and 12 categorical attributes.
- Measures
- Precision is the proportion of extracted instances that are correct.
- Recall is the proportion of all true instances that are correctly extracted.
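The two measures can be computed directly from sets of extracted and true instances. The sketch below assumes each instance is a (record, attribute, value) triple; that representation, the sample values, and the function name are illustrative choices, not part of the original evaluation.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted instances against true instances."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented sample: three extractions, four true instances, two in common.
EXTRACTED = {("r1", "pulse", "72"), ("r1", "weight", "180"),
             ("r2", "pulse", "80")}
GOLD = {("r1", "pulse", "72"), ("r1", "weight", "185"),
        ("r2", "pulse", "80"), ("r2", "weight", "150")}

precision, recall = precision_recall(EXTRACTED, GOLD)
```

Two of the three extracted instances are correct, and two of the four true instances are found.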
17 Evaluation of Numeric Attributes
- Precision and recall are both 100% for all eight numeric attributes.
- After examining all 50 records manually, we find that the extremely high precision is attributable in part to the very consistent writing style.
- If the data set grows and more diverse writing styles are introduced, performance may degrade.
18 Evaluation of Smoking Behavior
- 45 cases: 5 former smokers, 12 current smokers, and 28 nonsmokers.
- 5-fold cross-validation
- Experiments are run for 10 rounds. (For each round, the data set is randomly shuffled.)
- Average precision (recall) is 92.2%.
- The number of features used ranges from 4 to 7.
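The protocol above (10 rounds, each a freshly shuffled 5-fold split of the 45 cases) can be sketched as an index generator. The function names and the fixed seed are assumptions; training the classifier and averaging the per-fold scores are left out.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and yield (train, test) index lists per fold."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


def repeated_cv(n, k=5, rounds=10, seed=0):
    """Return `rounds` independent shufflings, each split into k folds."""
    rng = random.Random(seed)
    return [list(k_fold_indices(n, k, rng)) for _ in range(rounds)]


splits = repeated_cv(45)
```

Each round holds out every case exactly once, with 9 test cases and 36 training cases per fold.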
19 Evaluation of Medical Terms (1)
- Each attribute can have multiple values (medical terms).
- Precision and recall are defined over the following per-subject counts:
- ETrue_i: number of extracted true terms in the i-th subject.
- ETotal_i: number of extracted terms in the i-th subject.
- TInst_i: number of true terms in the i-th subject.
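The formula itself is not preserved in this transcript. One reading that is consistent with the three counts pools them across all subjects before dividing; this is an assumption, and the authors may instead have averaged per-subject ratios.

```latex
\mathrm{Precision} = \frac{\sum_{i} \mathrm{ETrue}_i}{\sum_{i} \mathrm{ETotal}_i},
\qquad
\mathrm{Recall} = \frac{\sum_{i} \mathrm{ETrue}_i}{\sum_{i} \mathrm{TInst}_i}
```

Under this reading, a subject with many terms contributes more to the overall score than a subject with few.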
20 Evaluation of Medical Terms (2)
- Extracted false terms and unextracted true terms are mainly caused by the incompleteness of the domain ontology.
- The low recall for predefined past surgical history and the low precision for other past surgical history both stem from failing to recognize synonyms of predefined surgical terms, which are then misclassified as other surgical terms.
Attribute Name                      Precision (%)   Recall (%)
Predefined Past Medical History     96.7            96.7
Other Past Medical History          76.1            86.4
Predefined Past Surgical History    77.8            35
Other Past Surgical History         62.0            75
21 Future Work
- Test our work on a larger data set
- A generic framework for arbitrary concept associations
- Medical Terms Extraction
- Ontology selection
- The use of synonyms and semantic types
- Text Classification
- How to deal with categories that contain numeric threshold information
22 Questions