Title: CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS WITH RUBRYX-2: FIRST EXPERIENCE
1CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS WITH RUBRYX-2: FIRST EXPERIENCE
- Olga Kaurova (1), kaurovskiy_at_gmail.com
- Mikhail Alexandrov (1), malexandrov_at_mail.ru
- Ales Bourek (2), bourek_at_med.muni.cz
- (1) Autonomous University of Barcelona, Bellaterra, Spain
- (2) Masaryk University, Brno, Czech Republic
- 2012
2Contents
- INTRODUCTION
- RUBRYX
  - General info
  - Lexical resources
  - Algorithm
- EXPERIMENTS
  - Influence of thresholds
  - Influence of terminological vocabulary
  - Influence of size of training set
- CONCLUSIONS
3Introduction
- Problem setting
  - Classification of primary care medical records (free-text form) is a problem closely related to medical diagnostics
  - We examine the capabilities and performance of the document classifier RUBRYX when it processes such specific documents as primary medical records
4Introduction
- Specificity of classification of clinical texts
  - Preprocessing of low-quality data (abbreviations, ungrammatical statements, etc.)
  - Necessity to also consider hidden information (relations between descriptors of diseases)
5Introduction
- Objectives of study
  - To allow medical professionals to continue using the preferred traditional full-text format
  - To help reduce medical errors
  - To facilitate medical data exchange, storage and retrieval of individual records
  - To help form Internet communities with similar health issues
6Introduction
- State of the art
  - Bayesian classifiers (accuracy 95%)
  - Support Vector Machine (accuracy 49%)
  - K-nearest neighbors (precision 57%, recall 77%)
  - Decision trees (accuracy 80%)
  - Lexis-based methods (recall 49%, precision 90%)
- Note: the data are taken from the publications listed in the bibliography
7RUBRYX
- General info
  - Developed in the 2000s by Polyakov and Sinitsyn
  - Latest version: 2.2
  - Free shareware
  - http://www.sowsoft.com/rubryx/rubryx2.zip
  - Test on Reuters news: F-measure 86% (training set: 5 representatives from each of 10 categories)
8RUBRYX
- Lexical resources
  - RUBRYX uses patterns in the form of n-grams and takes into account their joint position in a document
  - Mini-vocabularies for each category
  - Terminological vocabulary
  - Stop terms: black lists (regular expressions)
9RUBRYX
- Mini-vocabularies
  - Related to a concrete category
  - Created during the training stage from the most representative samples of each category (just 5 training documents give acceptable results)
  - Can be created with or without the support of a terminological vocabulary
10RUBRYX
- Mini-vocabularies
  - All stop terms are eliminated
  - All common unigrams form the first list, file WordList
  - All common bigrams form the second list, file WordLst2
  - All common trigrams form the third list, file WordLst3
  - Common terms are terms that occur in at least $m$ documents
  - Here $m \le M$, where $M$ is the number of all documents related to a certain category. In our experiments $m = M$
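The construction of the three lists can be sketched as follows. This is only an illustration under stated assumptions, not the RUBRYX code: the function names are hypothetical, stop terms are given as a plain set rather than as regular expressions, and stop terms are removed before n-grams are formed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_mini_vocabulary(docs, stop_terms, m=None):
    """Build unigram, bigram and trigram lists for one category.

    docs: list of token lists (training documents of the category).
    stop_terms: set of tokens to drop (assumption: a plain set).
    m: minimum number of documents a term must occur in;
       defaults to M = len(docs), i.e. m = M as in the experiments.
    """
    if m is None:
        m = len(docs)
    cleaned = [[t for t in doc if t not in stop_terms] for doc in docs]
    vocab = {}
    for n in (1, 2, 3):
        doc_freq = Counter()
        for tokens in cleaned:
            # set(): each document counts at most once per term
            doc_freq.update(set(ngrams(tokens, n)))
        vocab[n] = sorted(g for g, df in doc_freq.items() if df >= m)
    return vocab

# Hypothetical two-document category
docs = [
    "dolor abdominal en fosa iliaca derecha".split(),
    "paciente con dolor abdominal en fosa iliaca".split(),
]
vocab = build_mini_vocabulary(docs, stop_terms={"en", "con"})
print(vocab[1])  # unigrams present in both documents
print(vocab[2])  # bigrams present in both documents
```

With $m = M$ only terms shared by every training document survive, which keeps the lists short when the training set is small.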
11RUBRYX
- Terminological vocabulary
  - Optional
  - Created for a given domain by external experts
  - Reflects a common terminology for all categories in a document corpus
  - Contains 3 lists: one-word terms (unigrams), two-word terms (bigrams), three-word terms (trigrams)
12RUBRYX
- Construction of terminological vocabulary
  - The LexisTerm tool (criterion of specificity) extracts all specific terms from the whole corpus
  - The level of specificity of a given word $w$ in a given document corpus $C$ is a number $K \ge 1$ that shows how much its frequency in the document corpus, $f_C(w)$, exceeds its frequency in the general lexis, $f_L(w)$:
  - $K = f_C(w) / f_L(w)$
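A minimal sketch of the specificity criterion, assuming relative frequencies for both $f_C$ and $f_L$. The function name, the `general_freq` dictionary and the decision to skip words absent from the general lexis are assumptions made for this illustration; they are not taken from LexisTerm.

```python
from collections import Counter

def specific_terms(corpus_tokens, general_freq, k_min=1.0):
    """Return {word: K} for words with K = f_C(w) / f_L(w) >= k_min.

    corpus_tokens: all tokens of the document corpus C.
    general_freq: relative frequencies f_L(w) in the general lexis.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    result = {}
    for word, count in counts.items():
        f_l = general_freq.get(word)
        if not f_l:
            # Assumption: words unknown to the general lexis are skipped
            continue
        k = (count / total) / f_l
        if k >= k_min:
            result[word] = k
    return result

# Hypothetical frequencies, for illustration only
corpus = ["hernia", "hernia", "de", "dolor"]
general = {"hernia": 0.001, "de": 0.5, "dolor": 0.05}
print(specific_terms(corpus, general))  # "de" is dropped because K < 1
```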
13RUBRYX
- Construction of terminological vocabulary
  - Note: the vocabulary of unigrams is compiled by LexisTerm; bigrams and trigrams are compiled by an expert (doctor)
14RUBRYX
- Algorithm: linear combination of indexes
  - The contribution of each category to a given document is a linear combination of category indexes (terms from mini-vocabularies):
  - $C_j = K_1 (L_{j1}/N_{j1}) + K_2 (L_{j2}/N_{j2}) + K_3 (L_{j3}/N_{j3})$
  - $j$ is the number of the category; $K_1 = 0.2/3$, $K_2 = 1.3/3$, $K_3 = 1.5/3$
  - $L_{j1}, L_{j2}, L_{j3}$ are the numbers of terms from the three lists found in a given document
  - $N_{j1}, N_{j2}, N_{j3}$ are the numbers of all unigrams, bigrams and trigrams, respectively, in a given document
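The linear combination can be illustrated with a short sketch. It assumes that $L_{jn}$ counts every occurrence of a listed n-gram in the document (rather than distinct terms) and that the mini-vocabulary is a dictionary of sets keyed by n; both points, like the function names, are assumptions of this example.

```python
K1, K2, K3 = 0.2 / 3, 1.3 / 3, 1.5 / 3  # coefficients quoted on the slide

def ngram_list(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def category_score(tokens, mini_vocab, weights=(K1, K2, K3)):
    """C_j = K1*L_j1/N_j1 + K2*L_j2/N_j2 + K3*L_j3/N_j3 for one category."""
    score = 0.0
    for n, k in zip((1, 2, 3), weights):
        grams = ngram_list(tokens, n)
        if not grams:          # document shorter than n tokens
            continue
        hits = sum(1 for g in grams if g in mini_vocab[n])  # L_jn
        score += k * hits / len(grams)                      # N_jn = len(grams)
    return score

# Hypothetical document and mini-vocabulary
tokens = ["dolor", "abdominal", "fosa", "iliaca"]
mini_vocab = {
    1: {("dolor",), ("abdominal",)},
    2: {("dolor", "abdominal")},
    3: set(),
}
print(category_score(tokens, mini_vocab))  # K1*2/4 + K2*1/3 + K3*0/2
```

Because the three coefficients sum to 1, the score stays between 0 and 1 in this sketch.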
15RUBRYX
- Algorithm: classification rule
  - A document belongs to category $j$ if $C_j \ge T_j$ ($T_j$ are thresholds calculated at the training stage)
  - If there are several categories for which $C_j \ge T_j$, we select the category $i$ with $T_i = \max(T_j)$
  - Note: the coefficients $K_1$, $K_2$, $K_3$ in the formula $C_j = K_1 (L_{j1}/N_{j1}) + K_2 (L_{j2}/N_{j2}) + K_3 (L_{j3}/N_{j3})$ were recommended to us by the developers. They are the result of their experiments with the Reuters corpus. A user can easily change them
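The rule as stated on this slide can be written down in a few lines. The sketch assumes that scores and thresholds are dictionaries keyed by category number and that a document passing no threshold stays unclassified (returned as `None`); the slide does not say what happens in that case.

```python
def classify(scores, thresholds):
    """Return the category chosen by the rule, or None if no C_j >= T_j.

    scores: {category: C_j}, thresholds: {category: T_j}.
    Among all categories that pass their threshold, the one with the
    largest threshold is selected, following the rule on the slide.
    """
    candidates = [j for j in scores if scores[j] >= thresholds[j]]
    if not candidates:
        return None  # assumption: the document is left unclassified
    return max(candidates, key=lambda j: thresholds[j])

# Hypothetical scores, on the same scale as the thresholds
thresholds = {1: 27, 2: 24, 3: 23}
print(classify({1: 30, 2: 26, 3: 10}, thresholds))  # categories 1 and 2 pass
print(classify({1: 20, 2: 25, 3: 10}, thresholds))  # only category 2 passes
```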
16RUBRYX
- Algorithm: joint position of terms
  - Close terms are simultaneous term occurrences within a given window ($S = 10$)
  - The weights ($K_1$, $K_2$ and $K_3$) of close terms in a document are increased by a certain value $p$
  - $p$ is an algorithm parameter; we set $p = 0.3$ (it can easily be changed by a user)
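Detecting close terms can be sketched as below. The slide does not specify how the window is measured or how exactly the weights are raised by $p$, so this example stops at detection: it treats two vocabulary terms as close when their token positions differ by at most the window size, and it handles single-word terms only. All names are hypothetical.

```python
def close_term_positions(tokens, vocabulary, window=10):
    """Return positions of vocabulary terms that have another vocabulary
    term at a distance of at most `window` tokens."""
    positions = [i for i, t in enumerate(tokens) if t in vocabulary]
    close = set()
    # Positions are sorted, so checking neighbouring pairs is sufficient
    for a, b in zip(positions, positions[1:]):
        if b - a <= window:
            close.update((a, b))
    return sorted(close)

# Hypothetical document: two terms near each other, one far away
tokens = ["dolor"] + ["x"] * 3 + ["fiebre"] + ["x"] * 20 + ["hernia"]
print(close_term_positions(tokens, {"dolor", "fiebre", "hernia"}))
```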
17RUBRYX
Examples of RUBRYX interface (it is very simple)
18RUBRYX
- Tuning thresholds
  - To improve results we can change the thresholds $T_j$
  - Small number of j-documents and a small number of alien documents: decrease $T_j$, which increases the number of j-documents
  - Small number of j-documents and a large number of alien documents: the situation is undefined
19RUBRYX
- Tuning thresholds
  - To improve results, change the thresholds $T_j$
  - Large number of j-documents and a small number of alien documents: OK
  - Large number of j-documents and a large number of alien documents: increase the threshold $T_j$, which decreases the number of alien documents
20RUBRYX
Tuning thresholds
                           Alien documents: small number  Alien documents: large number
j-documents: small number  Decrease $T_j$                 Undefined
j-documents: large number  OK                             Increase $T_j$
- Rules for adjusting thresholds
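The four tuning rules can be summarised as a small lookup. This is a sketch only: deciding what counts as a "large" number of documents is left to the user, and the relative step (25% here, one of the values used in the experiments) and the function names are assumptions of the example.

```python
def threshold_advice(many_own_docs, many_alien_docs):
    """Map the content of category j to the suggested change of T_j.

    many_own_docs: True if category j holds many j-documents.
    many_alien_docs: True if category j holds many alien documents.
    """
    if many_own_docs:
        return "increase" if many_alien_docs else "keep"
    return "undefined" if many_alien_docs else "decrease"

def adjust(threshold, advice, step=0.25):
    """Apply the advice with a relative step (assumed default: 25%)."""
    if advice == "increase":
        return threshold * (1 + step)
    if advice == "decrease":
        return threshold * (1 - step)
    return threshold  # "keep" and "undefined" leave T_j unchanged

print(threshold_advice(False, False))                        # decrease
print(adjust(24, threshold_advice(True, True)))              # 30.0
```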
21Experimental corpus
- 55 GP medical records (gastrointestinal diseases, Spanish)
Class  Disease                    Number of texts  Number of words in all texts  Number of different words in all texts
1      Gallbladder disease        12               2849                          428
2      Mechanical jaundice        8                2076                          458
3      Stomach cancer             11               2873                          572
4      Acute appendicitis         6                1339                          245
5      Gastrointestinal bleeding  7                1525                          373
6      Inguinal hernia            11               2396                          243
Total                             55               12828                         1269
22Experiments
- Measures for result evaluation
  - Precision: $P = k/l$
  - Recall: $R = k/m$
  - F-measure (binary classification): $F = 2PR / (P + R)$
  - $m$ is the number of all documents to be selected
  - $l$ is the number of documents selected by the classifier
  - $k$ is the number of correctly selected documents
23Measures for results evaluation
- Measures for result evaluation
  - Accuracy: $A = n/N$, where
  - $n$ is the number of all correctly classified documents
  - $N$ is the number of all documents
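These measures can be computed directly from a class-by-category table. The sketch below uses the counts reported for Experiment 3 with 6 training samples (classes 1, 3 and 6); the function names and the matrix layout (rows are classes, columns are categories) are choices made for this illustration.

```python
def precision_recall_f(matrix, j):
    """P = k/l, R = k/m and F = 2PR/(P+R) for category index j.

    matrix[i][c] is the number of documents of class i assigned
    to category c.
    """
    k = matrix[j][j]                    # correctly selected documents
    l = sum(row[j] for row in matrix)   # documents selected by the classifier
    m = sum(matrix[j])                  # documents to be selected
    p = k / l if l else 0.0
    r = k / m if m else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def accuracy(matrix):
    """A = n/N: correctly classified documents over all documents."""
    n = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return n / total

# Rows: classes 1, 3, 6; columns: categories 1, 3, 6
matrix = [
    [4, 2, 0],
    [0, 5, 0],
    [0, 0, 5],
]
print(accuracy(matrix))               # 0.875
print(precision_recall_f(matrix, 0))  # P, R, F for class 1
```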
24Experiments
- Terminology
  - Class is the Gold Standard
  - Category is a result of classification with RUBRYX
- Results tables
  - Rows: the distribution of documents from a given class between categories
  - Columns: the distribution of all documents assigned to a given category between classes
25Experiment 1 Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary
Class  Training set  Test set  T   Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Correct docs
1      5             7         27  1       5       0       0       1       0       1
2      5             3         24  0       3       0       0       0       0       3
3      5             6         23  0       1       5       0       0       0       5
4      5             1         37  0       0       0       1       0       0       1
5      5             2         27  0       0       0       0       2       0       2
6      5             6         35  0       0       0       0       0       6       6
Total  30            25        -   1       9       5       1       3       6       18
- $T_j$ are calculated automatically; accuracy 75%
- Category 1 contains practically no documents from Class 1
- We decrease the threshold $T_1$ by 25%
26Experiment 1 Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary
Class  Training set  Test set  T   Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Correct docs
1      5             7         21  7       0       0       0       0       0       7
2      5             3         24  0       3       0       0       0       0       3
3      5             6         23  0       1       5       0       0       0       5
4      5             1         37  0       0       0       1       0       0       1
5      5             2         27  0       0       0       0       2       0       2
6      5             6         35  6       0       0       0       0       0       0
Total  30            25        -   13      4       5       1       2       0       18
- $T_1$ decreased by 25%; accuracy 75%
27Experiment 1 Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary
Class  Training set  Test set  T   Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Correct docs
1      5             7         24  6       0       0       0       1       0       6
2      5             3         30  0       3       0       0       0       0       3
3      5             6         23  0       0       6       0       0       0       6
4      5             1         37  0       0       0       1       0       0       1
5      5             2         27  0       0       0       0       2       0       2
6      5             6         35  1       0       0       0       0       5       5
Total  30            25        -   7       3       6       1       3       5       23
- $T_1 = 24$ (decreased by 10%), $T_2 = 30$ (increased by 25%)
- Accuracy 92%
28Experiment 2 Sensitivity to Term Vocabulary
Training set: 5 samples of each category
Terminological vocabulary is used
Class  Training set  Test set  T   Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Correct docs
1      5             7         22  5       2       0       0       0       0       5
2      5             3         16  0       3       0       0       0       0       3
3      5             6         10  0       0       5       0       1       0       5
4      5             1         23  0       0       0       0       1       0       0
5      5             2         5   0       0       0       0       2       0       2
6      5             6         33  0       0       0       0       0       6       6
Total  30            25        -   5       5       5       0       4       6       21
- Classification with terminological vocabulary
- $T_j$ are calculated automatically
- Accuracy 84%
29Experiment 2 Sensitivity to Term Vocabulary
Training set: 5 samples of each category
Terminological vocabulary is used
Class  Training set  Test set  T   Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Correct docs
1      5             7         22  7       0       0       0       0       0       7
2      5             3         20  0       3       0       0       0       0       3
3      5             6         10  0       0       6       0       0       0       6
4      5             1         23  0       0       0       0       1       0       0
5      5             2         7   0       0       0       0       2       0       2
6      5             6         33  0       0       0       0       0       6       6
Total  30            25        -   7       3       6       0       3       6       24
- Classification with terminological vocabulary
- $T_2 = 20$, $T_5 = 7$ (both increased by 25%)
- Accuracy 96%
30Experiment 3 Sensitivity to size of Training Set
Terminological vocabulary is not used
Class  Training set  Test set  T   Cat. 1  Cat. 3  Cat. 6  Correct docs
1      3             6         32  2       4       0       2
3      3             5         30  0       5       0       5
6      3             5         42  0       0       5       5
Total  9             16        -   2       9       5       12
- Training set: 3 samples (the largest categories)
- $T_j$ are calculated automatically
- Accuracy 75%, F-measure 73%
31Experiment 3 Sensitivity to size of Training Set
Terminological vocabulary is not used
Class  Training set  Test set  T   Cat. 1  Cat. 3  Cat. 6  Correct docs
1      6             6         25  4       2       0       4
3      6             5         22  0       5       0       5
6      6             5         34  0       0       5       5
Total  18            16        -   4       7       5       14
- Training set: 6 samples (the largest categories)
- $T_j$ are calculated automatically
- Accuracy 88%, F-measure 87%
32Experiments
- Results
  - Experiment 1. Sensitivity to thresholds
    - Accuracy 75% ($T_j$ automatic)
    - Best accuracy 92%
  - Experiment 2. Sensitivity to terminological vocabulary
    - Accuracy 84% ($T_j$ automatic)
    - Best accuracy 96%
  - Experiment 3. Sensitivity to the size of training set
    - Accuracy 75%, F-measure 73% (training set: 3 samples)
    - Accuracy 88%, F-measure 87% (training set: 6 samples)
33Experiments
- Discussion
  - Intersection of lexis: wide vs. narrow domain
  - Narrow domain: the lexical resources of the classes are similar => many errors
  - When classes (e.g. diseases) have absolutely different descriptor lists, the quality of classification is the highest => 100% accuracy
34Discussion Intersection of Lexis
- Discussion
  - Narrow domain: the intersection of lexis is 16-25
              Categories 1-2  Categories 2-3  Categories 3-4  Categories 4-5  Categories 5-6
Common words  23              23              16              25              19
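The slide does not state how the intersection was measured. As one possible reading, the sketch below counts the words shared by the vocabularies of two classes and also expresses that count as a percentage of the smaller vocabulary; the normalisation and the function name are assumptions of this example.

```python
def lexis_intersection(tokens_a, tokens_b):
    """Return (number of common words, share in % of the smaller vocabulary)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    if not vocab_a or not vocab_b:
        return 0, 0.0
    common = vocab_a & vocab_b
    share = 100.0 * len(common) / min(len(vocab_a), len(vocab_b))
    return len(common), share

# Hypothetical token lists for two classes
class_a = "dolor abdominal fiebre dolor".split()
class_b = "dolor hernia inguinal derecha".split()
print(lexis_intersection(class_a, class_b))  # one shared word: "dolor"
```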
35Conclusions
- The document classifier RUBRYX was tested on a limited set of primary medical records (a narrow-domain collection)
- We studied the sensitivity of the classification results to threshold variations, use of a terminological vocabulary, and size of the training set
36Conclusions
- RUBRYX is easily tuned, both automatically and manually, on a given corpus, which makes it possible to reach high results
- Unigrams, bigrams and trigrams taken together, with their mutual position in a document taken into account, make it possible to process narrow-domain collections
37Conclusions
- Future Work
  - To combine the pre-processing procedure of RUBRYX with other classifiers (Naïve Bayes, SVM, etc.)
38Acknowledgments
- The authors are very thankful to Mr. Vladimir Sinitsyn, one of the RUBRYX developers, for his numerous consultations and the additional software tools he offered for our work.
- support_at_sowsoft.com
39Bibliography (selected papers)
- D.B. Aronow, J.R. Cooley, S. Soderland. Automated identification of episodes of asthma exacerbation for quality measurement in a computer-based medical record. In: Proc. of Annual Symposium on Computer Applications in Medical Care, pp. 309-13, USA, 1995.
- D.B. Aronow, et al. Automated classification of encounter notes in a computer based medical record. In: Proc. of MEDINFO '95, 8th World Congress on Medical Informatics, Canada, pp. 8-12, 1995.
- R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
- C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- A. Catena, M. Alexandrov, B. Alexandrov, M. Demenkova. NLP-Tools Try To Make Medical Diagnosis. In: Proc. of the 1st Intern. Workshop on Social Networking (SoNet-2008), Skalica, Slovakia, 2008.
- E. Kaurova, M. Alexandrov, X. Blanco. Classification of free text clinical narratives (short review). In: Scient. Book "Information Science and Computing", Publ. House ITHEA, 2011, 12 pp.
40Bibliography (selected papers)
- R. Lopez, M. Alexandrov, D. Barreda, J. Tejada. In: Proc. of the 4th Intern. Conf. on Intelligent Information and Engineering Systems (INFOS-2011), Publ. House ITHEA, Poland, 2011, 8 pp.
- C.D. Manning, H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
- C.D. Manning, H. Schütze. Introduction to Information Retrieval. Cambridge, 2009.
- T. Mitchell. Machine Learning. McGraw Hill, 1997.
- D. Pinto. On Clustering and Evaluation of Narrow Domain Short-Text Corpora. Doctoral Dissertation, Polytechnic University of Valencia, Spain, 2008.
- V. Polyakov, V. Sinitsyn. Method for automatic classification of web-resource by patterns in text processing and cognitive technologies. In: Text Collection, No. 6, Publ. House Otechestvo, pp. 120-126, 2001 (in Russian).
- V. Polyakov, V. Sinitsyn. RUBRYX: technology of text classification using lexical meaning based approach. In: Proc. of Intern. Conf. Speech and Computing (SPECOM-2003), Moscow, MSLU, pp. 137-143, 2003.
41Contacts
- Olga Kaurova, PhD student, Autonomous University of Barcelona, Spain
  - kaurovskiy_at_gmail.com
- Mikhail Alexandrov, professor, Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia; fLexSem research group, Autonomous University of Barcelona, Spain
  - malexandrov_at_mail.ru
- Ales Bourek, senior lecturer, Masaryk University; head of Center for Healthcare Quality, Brno, Czech Republic
  - bourek_at_med.muni.cz