Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding

1
Symbolic and Machine Learning Methods for Patient Discharge Summaries Encoding
Julia Medori, CENTAL (Centre for Natural Language Processing), Université catholique de Louvain (Belgium)
Séminaire du Cental, 17/12/2010
2
Overview
  • Problem outline
  • System structure
  • Extraction
  • Encoding
  • Extraction module
  • Encoding module
  • Machine learning methods
  • Experiments for feature selection
  • Results
  • Symbolic methods description
  • Method 1: Morphological Analysis (MA)
  • Method 2: Extended lexical patterns (ELP)
  • Combination of methods
  • Results
  • Conclusions

3
Introduction
  • Aim:
  • Build a (semi-)automated system for ICD-9-CM
    encoding
  • Collaboration: CENTAL / Saint-Luc
  • Université catholique de Louvain (Belgium)
  • CENTAL: Centre for Natural Language Processing
  • Saint-Luc hospital:
  • A team of 10 coders processes medical records
  • Extraction of medical acts and diagnoses ->
    ICD-9-CM codes
  • 85,000 patient stays encoded each year.

4
Manual encoding
5
Data
  • International Classification of Diseases, 9th
    Revision, Clinical Modification (ICD-9-CM)
  • Hierarchy:
  • First 3 digits -> general category: 1,135
    categories
  • Digits 4 and 5 -> specific diagnosis: 15,688
    codes
  • Example:
  • Example

Code  Label
001   Cholera
0010  Cholera due to Vibrio cholerae
0011  Cholera due to Vibrio cholerae el tor
0019  Cholera, unspecified
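
The hierarchy in the example can be illustrated in a few lines of Python. This is a minimal sketch, assuming codes are stored without a dot as in the table; the function name `category` is introduced here for illustration only, and special code families are ignored.

```python
def category(code: str) -> str:
    """Return the general category: the first 3 characters of the code."""
    return code.replace(".", "")[:3]

# Group the example codes from the slide under their category.
codes = ["001", "0010", "0011", "0019"]
by_category = {}
for c in codes:
    by_category.setdefault(category(c), []).append(c)

print(by_category)  # {'001': ['001', '0010', '0011', '0019']}
```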
6
Objectives
  • Design a coding aid:
  • a tool that suggests the most likely codes to
    be assigned to a patient's medical record.
  • Why not a fully automated system?
  • Main source of information: the patient discharge
    summary (PDS)
  • PDS: a letter addressed to the patient's GP, with
    no standard structure
  • 15-20% of the codes are inferred from other sources
    in the patient's medical record (often scanned
    documents).

7
System structure
[Figure: system architecture diagram. Box labels: PDS; Preprocessing; Extraction; Dictionaries and linguistic structures; Morphological processing; Context analysis tagging; Coding; Matching lists; ICD-9-CM inclusions; Machine learning module; Code modification according to context and stats; PDS ordered list of codes; Manual checking.]
8
Structure outline
  • 2 steps:
  • Extraction
  • Develop an extraction system able to extract the
    information necessary to the encoding task
  • Diagnoses, procedures, locations, dates,
    allergies, aggravating factors, etc.
  • -> Reading help tool.
  • Encoding
  • Extracted information -> codes
  • through a combination of statistical and symbolic
    methods.

9
Extraction
  • Develop specialized linguistic resources
  • Specialized dictionaries:
  • Diagnoses and procedures <- ICD-9-CM and UMLS
  • Medications
  • Anatomy
  • Linguistic structure description:
  • Diagnoses context (present, absent, probable,
    etc.)
  • Allergies and smoking
  • Dates
  • Weight and height

10
Example of linguistic structure graph
  • Fracture de l'épaule (shoulder fracture) ->
  • <MALINDET>Fracture de l'<ANAT>épaule</ANAT></MALINDET>

11
Extraction result
12
Structure outline
  • 2 steps:
  • Extraction
  • Develop an extraction system able to extract the
    information necessary to the encoding task
  • Diagnoses, procedures, locations, dates,
    allergies, aggravating factors, etc.
  • -> Reading help tool.
  • Encoding
  • Extracted information -> codes
  • through a combination of statistical and symbolic
    methods.

13
Machine Learning
  • Encoding as a categorization problem
  • Features: extracted phrases?
  • Classes: codes
  • Baseline method: Naive Bayes
  • Tool: Weka
  • Corpus:
  • 13,635 PDS from Digestive Surgery
  • 90% training set / 10% test set (1,364 PDS)
  • Average number of codes per PDS: 6.2
  • Trained 1 classifier per code occurring > 5 times
    in the corpus
  • 775 codes -> 775 classifiers
  • Limitation: rare codes (5 occurrences or fewer)
    get no classifier
  • Attributes: kept only those co-occurring at least
    twice with the codes.
  • Measures: precision and recall according to the
    probability returned by the Naive Bayes test.
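
The "one classifier per code" setup can be sketched as follows. The work described here used Weka; the code below is only a standard-library stand-in, assuming a Bernoulli Naive Bayes with Laplace smoothing over binary features. The function names, the code labels `CODE_A` and `CODE_B`, and the toy features are all invented for illustration.

```python
import math
from collections import Counter

def train_binary_nb(docs, labels):
    """Train a Bernoulli Naive Bayes model for ONE code.
    docs: list of feature sets; labels: list of bools (code assigned or not)."""
    vocab = set().union(*docs)
    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for cls in (True, False):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        model["prior"][cls] = (len(cls_docs) + 1) / (len(docs) + 2)
        counts = Counter(f for d in cls_docs for f in d)
        model["cond"][cls] = {f: (counts[f] + 1) / (len(cls_docs) + 2) for f in vocab}
    return model

def predict_proba(model, doc):
    """Probability that the code applies to `doc` (a set of features)."""
    log_p = {}
    for cls in (True, False):
        lp = math.log(model["prior"][cls])
        for f in model["vocab"]:
            p = model["cond"][cls][f]
            lp += math.log(p if f in doc else 1.0 - p)
        log_p[cls] = lp
    top = max(log_p.values())
    exp = {c: math.exp(v - top) for c, v in log_p.items()}
    return exp[True] / (exp[True] + exp[False])

def train_per_code(docs, codes_per_doc, min_count=6):
    """One classifier per code occurring more than 5 times (min_count=6)."""
    freq = Counter(c for codes in codes_per_doc for c in codes)
    return {code: train_binary_nb(docs, [code in cs for cs in codes_per_doc])
            for code, n in freq.items() if n >= min_count}

# Toy data (invented): stemmed features per PDS and the codes assigned to it.
docs = [{"biopsi", "bile"}, {"biopsi", "duct"}, {"hernia"}, {"hernia", "repair"}]
codes = [{"CODE_A"}, {"CODE_A"}, {"CODE_B"}, {"CODE_B"}]
models = train_per_code(docs, codes, min_count=2)  # lowered for the toy corpus
p_a = predict_proba(models["CODE_A"], {"biopsi"})
```

Codes below the frequency cut-off simply get no entry in `models`, which is the limitation noted on the slide.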

14
Experiments
  • A series of experiments was conducted in which the
    attributes were variants of the extracted
    diagnoses and procedures after stemming.
  • The variants differed as follows:
  • Original word order kept or not.
  • Ex.: excisional biopsy bile duct
  • Or: bile biopsy duct excisional
  • Details such as location, date and context
    included or not.
  • Ex.: excisional biopsy
  • Each word of the extracted phrases is a feature.
  • Ex.: excisional, biopsy, bile, duct
  • Words and morphemes (together) composing the
    extracted phrase.
  • Ex.: bile biopsy excision excisional duct
  • Words and morphemes (separately) composing the
    extracted phrase.
  • Ex.: excisional biopsy bile duct
  • And: excision biopsy bile duct
  • Values were 0 or 1, depending on whether the
    attribute was in the text.
  • Or values were the frequency of the attribute in
    the text.
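
Some of these variants can be sketched in Python. This is a minimal illustration, not the original pipeline: the function names are invented, and stemming and morpheme splitting are left out because the slides do not say which tools were used.

```python
from collections import Counter

def phrase_features(phrases, keep_order=True):
    """Each extracted phrase is one feature; its words keep their
    original order or are sorted alphabetically."""
    feats = []
    for phrase in phrases:
        words = phrase.lower().split()
        if not keep_order:
            words = sorted(words)
        feats.append(" ".join(words))
    return feats

def word_features(phrases):
    """Each word of each extracted phrase is a separate feature."""
    return [w for phrase in phrases for w in phrase.lower().split()]

def to_values(features, binary=True):
    """Attribute values: 0/1 presence, or frequency in the text."""
    counts = Counter(features)
    return {f: (1 if binary else n) for f, n in counts.items()}

phrases = ["Excisional biopsy bile duct"]
print(phrase_features(phrases, keep_order=False))  # ['bile biopsy duct excisional']
print(word_features(phrases))  # ['excisional', 'biopsy', 'bile', 'duct']
```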

15
Results
The 3 best results, obtained by keeping only the suggestions for which the probability returned by Naive Bayes equals 1:

Features                                                                     Recall (%)  Precision (%)  Avg. number of suggested codes
Extracted phrases with details, same word order, 0/1 values                  68.7        73.2           7.87
Extracted phrases with details, alphabetical order, 0/1 values               59.1        75.7           6.49
Words and morphemes (together) with details, alphabetical order, 0/1 values  68.5        74.2           7.54
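
The thresholding and the two measures can be sketched as below. This is a sketch under assumptions: the threshold is read as a probability of 1, the scores are computed for a single document, and how the reported figures were averaged over the test set is not stated in the slides. Function names and the toy probabilities are invented.

```python
def suggest(probabilities, threshold=1.0):
    """Keep the codes whose classifier probability reaches the threshold."""
    return {code for code, p in probabilities.items() if p >= threshold}

def precision_recall(suggested, gold):
    """Precision and recall of a set of suggested codes for one document."""
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy example: classifier outputs for one PDS and the codes the coders assigned.
probabilities = {"A": 1.0, "B": 0.97, "C": 1.0}
gold = {"A", "D"}
suggested = suggest(probabilities)
print(sorted(suggested), precision_recall(suggested, gold))  # ['A', 'C'] (0.5, 0.5)
```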
16
Discussion
  • Limitations of the machine learning method:
  • Rare codes (5 occurrences or fewer): not enough
    data to build a classifier for these codes
  • The need for annotated data means that these
    methods cannot cope with changes in classifications
  • In these cases, we need to use symbolic methods
  • Kevers, Laurent and Medori, Julia. Symbolic
    classification methods for patient discharge
    summaries encoding into ICD. In Advances in
    Natural Language Processing, 7th International
    Conference on NLP, IceTAL 2010, Reykjavik, August
    16-18, 2010. Lecture Notes in Artificial
    Intelligence, 2010, pp. 197-208.

17
Objective
  • Automatic encoding of PDS according to categories
    (first 3 digits)
  • Use of symbolic methods:
  • No need for annotated data
  • Can assign rare codes (27% are used 5 times or fewer)
  • Principle:
  • Make use of the nomenclature
  • Enrich it with other resources in French from
    UMLS (Unified Medical Language System)

18
Corpus
  • 19,692 patient discharge summaries (PDS) in
    French
  • General Internal Medicine
  • 150,116 codes (137,336 categories)
  • 6,029 distinct codes (895 categories)
  • Average 7.6 codes/document (7 categories)

19
Method 1 (MA): General principle
  • Based on the rich morphology of medical language
  • Ex.: bronchoscopy
  • Fibroscopie bronchique = bronchoscopie par fibre
    optique (fibre-optic bronchoscopy)
  • Two-step process:
  • Extract phrases or terms describing diagnoses or
    procedures to be encoded
  • Encoding: match these terms to the right code.

20
Method 1 (MA): Encoding
  • Bags-of-words
  • Words (minus stop words), morphemes and their meaning

Example with one ICD-9-CM term and one PDS term:
  • Fibroscopie bronchique: fibroscopie, bronchique,
    fibro- (fibre), -scopie, bronch- (bronche), -ique
  • Bronchoscopie par fibre optique: bronchoscopie, par,
    fibre, optique, bronch- (bronche), -scopie
  • The two bags are compared with a similarity score.
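
The comparison of the two bags can be sketched as follows. The slides do not name the similarity measure, so the Dice coefficient used here is an assumption, and the morpheme lexicon and stop-word list are tiny invented stand-ins for the real resources.

```python
# Toy morpheme lexicon (hypothetical): morpheme -> meaning added to the bag.
MORPHEMES = {"fibro": "fibre", "bronch": "bronche", "scopie": "scopie", "ique": "ique"}
STOP_WORDS = {"par", "de", "la", "le"}

def bag(term):
    """Bag of words for a term: its content words plus the meaning of
    every known morpheme found at the start or end of a word."""
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    result = set(words)
    for w in words:
        for morpheme, meaning in MORPHEMES.items():
            if w.startswith(morpheme) or w.endswith(morpheme):
                result.add(meaning)
    return result

def dice(a, b):
    """Dice coefficient between two bags (assumed similarity score)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

pds_bag = bag("Fibroscopie bronchique")
icd_bag = bag("Bronchoscopie par fibre optique")
score = dice(pds_bag, icd_bag)
```

Without the morphemes the two terms share no word at all; with them, the bags overlap on `fibre`, `bronche`, `scopie` and `ique`, which is what makes the match possible.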
21
Method 1 (MA): Results

                Recall (%)  Precision (%)  F-measure (%)  Nb. classes
Best Recall     46.13       14.70          21.10          20
Best F-measure  34.52       27.34          28.00          8.6
22
Method 2 (ELP): General principle
  • Developed by L. Kevers, as designed for the
    Stratego project on parliamentary documents.
  • Symbolic method with less manual work
  • Use existing terminological resources:
  • ICD-9-CM and UMLS
  • Two-step process:
  • Automatic transformation of existing
    terminological resources into an extraction
    resource (only once)
  • Use of the extraction resource on documents for
    term extraction and classification (for each
    document)

23
Method 2 (ELP): Build extraction resource (1)
  • For each ICD-9-CM term (= a class), the automatic
    processing involves:
  • Gather synonyms (UMLS)
  • "dengue"
  • -> "dengues", "dengue fever", "infection
    by the dengue virus"
  • Parse complex compound expressions
  • "Infectious and parasitic diseases" -> "Infectious
    disease"
  • -> "Parasitic disease"
  • Transform initial term into Extended Lexical
    Pattern (ELP)
  • Stopwords -> "infection <TOKEN> dengue virus"
  • Stemming -> "infect <TOKEN> dengue virus"
  • Allow insertions -> "infect <I> <TOKEN> <I>
    dengue <I> virus"
  • Add negative context patterns
  • Build the main transducer for text annotation
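
The transformation of a term into an ELP can be approximated with regular expressions. The original system builds transducers; the sketch below only mimics the three steps (stop words, stemming, insertions) under stated assumptions: the stop-word list, the crude suffix stripping, the `max_insert` limit and all function names are invented here, and negative contexts are not handled.

```python
import re

STOPWORDS = {"by", "the", "of", "and"}  # toy list (assumption)

def stem(word):
    """Very crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ions", "ion"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def to_elp(term):
    """Turn a term into ELP elements: stems and <TOKEN> slots."""
    elements = []
    for w in term.lower().split():
        if w in STOPWORDS:
            # A run of stop words collapses into a single <TOKEN> slot.
            if not elements or elements[-1] != "<TOKEN>":
                elements.append("<TOKEN>")
        else:
            elements.append(stem(w))
    return elements

def elp_to_regex(elements, max_insert=2):
    """Compile an ELP; up to `max_insert` extra words may be inserted
    between two consecutive elements (the <I> slots of the slide)."""
    gap = r"(?:\s+\w+){0,%d}\s+" % max_insert
    parts = [r"\w+" if e == "<TOKEN>" else re.escape(e) + r"\w*" for e in elements]
    return re.compile(r"\b" + gap.join(parts), re.IGNORECASE)

elp = to_elp("infection by the dengue virus")
print(elp)  # ['infect', '<TOKEN>', 'dengue', 'virus']
pattern = elp_to_regex(elp)
print(bool(pattern.search("severe infection caused by the dengue virus")))  # True
```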

24
Method 2 (ELP): Transducer output
  • Transducer for class '061'
  • Output of main transducer for a document
    (recognized phrase, code):
  • Zona 053
  • extremement douloureux 729
  • gastroscopie Z44
  • acide E96
  • anemie normochrome normocytaire 285
  • sequellaires apicales droite (tuberculose 137
  • intestin grele Z45
  • tuberculose V12
  • oesophagite moderee aspecifique 947
  • infection a mycobacterie 031
  • fond de oeil Z16
  • pas de - atteinte du nerf 957
  • zona 053
  • hyperthyroidie 242
  • goitre 706
  • goitre 240
25
Method 2 (ELP): Class assignment (2)
  • For a text to classify, analyse the main
    transducer output
  • A phrase found in a negative context is skipped
  • Each recognized phrase has one (or more) related
    codes
  • Compute a weight for each phrase based on:
  • Frequency
  • Whether it is a multi-word expression
    (frequency2) or not
  • Compute a weight for each code by summing up the
    weights obtained for the phrases
  • Result: ordered list of codes (possibly
    thresholded)
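
The weighting steps can be sketched as follows. The "frequency2" note on the slide is ambiguous (doubled or squared frequency); the sketch assumes doubling through a hypothetical `multiword_factor` parameter. The input format and function name are also invented; the phrases and codes are taken from the transducer output shown earlier.

```python
from collections import Counter, defaultdict

def rank_codes(matches, multiword_factor=2.0):
    """matches: (phrase, codes, negated) tuples from the annotation step.
    Returns codes ordered by decreasing weight."""
    kept = [(phrase, codes) for phrase, codes, negated in matches if not negated]
    freq = Counter(phrase for phrase, _ in kept)
    phrase_codes = dict(kept)
    scores = defaultdict(float)
    for phrase, n in freq.items():
        # Assumption: a multi-word expression counts twice its frequency.
        weight = n * (multiword_factor if " " in phrase else 1.0)
        for code in phrase_codes[phrase]:
            scores[code] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

matches = [
    ("zona", ["053"], False),
    ("zona", ["053"], False),
    ("goitre", ["706", "240"], False),
    ("atteinte du nerf", ["957"], True),   # negative context: skipped
    ("infection a mycobacterie", ["031"], False),
]
ranking = rank_codes(matches)
print(ranking)  # [('031', 2.0), ('053', 2.0), ('240', 1.0), ('706', 1.0)]
```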

26
Method 2 (ELP): Results

                Recall (%)  Precision (%)  F-measure (%)  Nb. classes
Best Recall     52.74       20.69          27.37          19.6
Best F-measure  37.97       30.30          29.43          9.8
27
Combination of methods 1 and 2
  • Merge the lists from methods 1 and 2:
  • Threshold(M.1 union M.2)
  • Threshold(M.1 inter M.2)
  • Threshold(M.1) union Threshold(M.2)
  • Threshold(M.1) inter Threshold(M.2)
  • The weight of each method can be balanced
  • Example: 0.4 × M.1 union 0.6 × M.2
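
The four combinations can be sketched as below. This is an illustration under assumptions: each method is taken to return a `{code: score}` dictionary with comparable scores, the threshold is a minimum score, and the names `combine` and `threshold` as well as the toy scores are invented.

```python
def combine(m1, m2, alpha=0.4, mode="union"):
    """Merge two {code: score} lists, weighting method 1 by alpha
    and method 2 by 1 - alpha."""
    codes = (set(m1) | set(m2)) if mode == "union" else (set(m1) & set(m2))
    return {c: alpha * m1.get(c, 0.0) + (1 - alpha) * m2.get(c, 0.0) for c in codes}

def threshold(scores, min_score):
    """Keep only the codes whose score reaches min_score."""
    return {c: s for c, s in scores.items() if s >= min_score}

m1 = {"053": 0.9, "240": 0.2}   # toy output of method 1
m2 = {"053": 0.5, "031": 0.8}   # toy output of method 2

mix1 = set(threshold(combine(m1, m2, mode="union"), 0.4))   # Threshold(M.1 union M.2)
mix2 = set(threshold(combine(m1, m2, mode="inter"), 0.4))   # Threshold(M.1 inter M.2)
mix3 = set(threshold(m1, 0.5)) | set(threshold(m2, 0.5))    # Threshold(M.1) union Threshold(M.2)
mix4 = set(threshold(m1, 0.5)) & set(threshold(m2, 0.5))    # Threshold(M.1) inter Threshold(M.2)
```

Weights only matter when the lists are merged before thresholding (the first two mixes); in the last two, each list is cut on its own.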

28
Evaluation of symbolic methods combination
                                                  Recall (R, %)  Precision (P, %)  F-measure (F1, %)  Nb. classes  Threshold  a/(1-a)
Mix1: Threshold(Method1 union Method2)
  Best R                                          60.21          13.20             20.86              30.5         No         Any
  Best F1                                         37.13          33.12             31.64              8.1          Yes        0.3/0.7
Mix2: Threshold(Method1 inter Method2)
  Best R                                          38.66          29.28             30.52              9.1          No         Any
  Best F1                                         34.73          34.55             31.50              7            Yes        0.3/0.7
Mix3: Threshold(Method1) union Threshold(Method2)
  Best F1                                         43.28          20.59             27.90              14.7         Yes        N/A
Mix4: Threshold(Method1) inter Threshold(Method2)
  Best F1                                         24.07          37.95             29.46              4.4          Yes        N/A
29
Conclusions
  • Results have to be put into perspective
  • Inter-annotator agreement: 70%
  • 15 to 20% of the codes cannot be inferred from the PDS
  • Machine learning methods performed well.
  • Symbolic methods:
  • MA method, based on the extraction module: 66% of
    useful information is extracted.
  • ELP method performs better when built from short
    unambiguous phrases. ICD-9-CM code descriptions
    are more complex.
  • Future work
  • Give more weight to information contained in
    important parts of the PDS (introduction,
    conclusion)
  • Evaluate the actual help given to human coders
  • Combine with learning algorithms