CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS WITH RUBRYX-2: FIRST EXPERIENCE

1
CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS
WITH RUBRYX-2: FIRST EXPERIENCE

Olga Kaurova 1 kaurovskiy_at_gmail.com
Mikhail Alexandrov 1 malexandrov_at_mail.ru
Ales Bourek 2 bourek_at_med.muni.cz
1 - Autonomous University of Barcelona,
Bellaterra, Spain
2 - Masaryk University, Brno, Czech Republic
2012

2
Contents

INTRODUCTION
RUBRYX
- General info
- Lexical resources
- Algorithm
EXPERIMENTS
- Influence of thresholds
- Influence of terminological vocabulary
- Influence of size of training set
CONCLUSIONS

3
Introduction

Problem settings
Classification of primary care medical records (free-text form) is a problem closely related to medical diagnostics.
We assess the possibilities and performance of the document classifier RUBRYX when it processes such specific documents as primary care medical records.

4
Introduction

Specificity of classification of clinical texts
- Preprocessing of low-quality data (abbreviations, ungrammatical statements, etc.)
- Necessity to also consider hidden information (relations between descriptors of diseases)

5
Introduction

Objectives of study
- To allow medical professionals to continue using the preferred traditional full-text format
- To help reduce medical errors
- To facilitate medical data exchange, storage and retrieval of individual records
- To help form Internet communities with similar health issues

6
Introduction

State-of-the-Art
- Bayesian classifiers (accuracy 95%)
- Support Vector Machine (accuracy 49%)
- K-nearest neighbors (precision 57%, recall 77%)
- Decision trees (accuracy 80%)
- Lexis based methods (recall 49%, precision 90%)
Note: the data are taken from the publications listed in the bibliography.

7
RUBRYX

General info
- Developed in the 2000s by Polyakov and Sinitsyn
- Latest version: 2.2
- Free shareware: http://www.sowsoft.com/rubryx/rubryx2.zip
- Test on Reuters news: F-measure 86% (training set: 5 representatives from each of 10 categories)

8
RUBRYX

Lexical resources
RUBRYX uses patterns in the form of n-grams and takes into account their joint position in a document
- Mini-vocabularies for each category
- Terminological vocabulary
- Stop terms: black lists (regular expressions)

9
RUBRYX

Mini-vocabularies
- Related to a concrete category
- Created during the training stage (from the most representative samples of each category; just 5 training documents give acceptable results)
- Can be created with or without the support of the terminological vocabulary

10
RUBRYX

Mini-vocabularies
- All stop terms are eliminated
- All common unigrams form the first list, file WordList
- All common bigrams form the second list, file WordLst2
- All common trigrams form the third list, file WordLst3
Common terms are terms that occur in at least m documents. Here m ≤ M, where M is the number of all documents related to a certain category. In our experiments m = M.
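
The construction of the three lists can be sketched as follows. This is a minimal illustration, not RUBRYX code: the function names, the token-list input, and the choice to drop stop terms before forming n-grams are all assumptions, and the example tokens are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_mini_vocabulary(docs, stop_terms=frozenset(), m=None):
    """Build the unigram, bigram and trigram lists of one category.

    docs: training documents of the category, each a list of tokens.
    m:    minimum number of documents a term must occur in;
          defaults to M = len(docs), the setting used in the experiments.
    """
    if m is None:
        m = len(docs)
    # Assumption: stop terms are removed before n-grams are formed.
    cleaned = [[t for t in doc if t not in stop_terms] for doc in docs]
    vocab = {}
    for n in (1, 2, 3):
        doc_freq = Counter()
        for tokens in cleaned:
            doc_freq.update(set(ngrams(tokens, n)))  # count each document once
        vocab[n] = sorted(g for g, df in doc_freq.items() if df >= m)
    return vocab


# Invented toy documents, for illustration only.
docs = [
    "dolor en el abdomen derecho".split(),
    "dolor agudo en el abdomen".split(),
]
vocab = build_mini_vocabulary(docs, stop_terms={"en", "el"})
print(vocab[1])
```

With m = M only terms present in every training document survive, which is why the bigram and trigram lists stay empty in this toy case.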

11
RUBRYX

Terminological vocabulary
- Optional
- Created for a given domain by external experts
- Reflects a common terminology for all categories in a document corpus
- Contains 3 lists: one-word terms (unigrams), two-word terms (bigrams), three-word terms (trigrams)

12
RUBRYX

Construction of terminological vocabulary
- The LexisTerm tool (criterion of specificity) extracts all specific terms from the whole corpus
- The level of specificity of a given word w in a given document corpus C is a number $K \ge 1$ that shows how many times its frequency in the document corpus, $f_C(w)$, exceeds its frequency in the general lexis, $f_L(w)$:
$K = f_C(w) / f_L(w)$
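
A minimal sketch of the specificity criterion is given below. It is not the LexisTerm implementation: the function name, the `min_k` cut-off parameter, the decision to skip words that are absent from the general lexis, and all frequencies in the example are assumptions made for illustration.

```python
from collections import Counter


def specific_terms(corpus_tokens, general_freq, min_k=1.0):
    """Return {word: K} for words whose specificity K = fC(w) / fL(w) >= min_k.

    corpus_tokens: all tokens of the document corpus C.
    general_freq:  relative frequencies fL(w) in the general lexis.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    result = {}
    for word, count in counts.items():
        f_l = general_freq.get(word)
        if not f_l:
            # Assumption: words unknown to the general lexis are skipped.
            continue
        k = (count / total) / f_l  # fC(w) / fL(w)
        if k >= min_k:
            result[word] = k
    return result


# Invented frequencies, for illustration only.
corpus_tokens = ["hernia", "hernia", "the", "pain"]
general_freq = {"hernia": 0.0001, "the": 0.05, "pain": 0.001}
terms = specific_terms(corpus_tokens, general_freq, min_k=10.0)
print(sorted(terms))
```

Raising `min_k` keeps only words that are much more frequent in the corpus than in general usage, which is the intent of the criterion.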

13
RUBRYX

Construction of terminological vocabulary
Note.
The vocabulary of unigrams is compiled by LexisTerm; the bigrams and trigrams are compiled by an expert (doctor).

14
RUBRYX

Algorithm: linear combination of indexes
The contribution of each category to a given document is a linear combination of category indexes (terms from mini-vocabularies):
$C_j = K_1 (L_{j1}/N_{j1}) + K_2 (L_{j2}/N_{j2}) + K_3 (L_{j3}/N_{j3})$
- j is the category number; $K_1 = 0.2/3$, $K_2 = 1.3/3$, $K_3 = 1.5/3$
- $L_{j1}$, $L_{j2}$, $L_{j3}$ are the numbers of terms from the three lists found in a given document
- $N_{j1}$, $N_{j2}$, $N_{j3}$ are the numbers of all unigrams, bigrams and trigrams, respectively, in a given document
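
The linear combination can be illustrated with a short sketch. It is an assumption-laden reading of the slide, not RUBRYX code: the function names and data layout are hypothetical, and the counts Lj1, Lj2, Lj3 are taken here to be the number of n-gram occurrences in the document that match the category lists (the slide does not say whether occurrences or distinct terms are counted).

```python
K1, K2, K3 = 0.2 / 3, 1.3 / 3, 1.5 / 3  # coefficients quoted on the slide


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def category_score(tokens, vocab, weights=(K1, K2, K3)):
    """Compute Cj = K1*(Lj1/Nj1) + K2*(Lj2/Nj2) + K3*(Lj3/Nj3).

    tokens: the document as a list of tokens.
    vocab:  {1: unigram list, 2: bigram list, 3: trigram list} of category j,
            each entry a tuple of words.
    """
    score = 0.0
    for n, k in zip((1, 2, 3), weights):
        grams = ngrams(tokens, n)
        if not grams:  # document too short for this n-gram order
            continue
        terms = set(vocab.get(n, ()))
        hits = sum(1 for g in grams if g in terms)  # Ljn (assumed: occurrences)
        score += k * hits / len(grams)              # len(grams) is Njn
    return score


# Invented example document and vocabulary.
tokens = ["acute", "abdominal", "pain", "today"]
vocab = {1: [("pain",)], 2: [("abdominal", "pain")], 3: []}
score = category_score(tokens, vocab)
print(round(score, 4))
```

Because K2 and K3 are much larger than K1, a single matching bigram or trigram moves the score far more than a matching unigram.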

15
RUBRYX

Algorithm: classification rule
A document belongs to category j if $C_j \ge T_j$ (the thresholds $T_j$ are calculated at the training stage).
If $C_j \ge T_j$ holds for several categories, we select the category i with $T_i = \max(T_j)$.
Note. The coefficients $K_1$, $K_2$, $K_3$ in the formula
$C_j = K_1 (L_{j1}/N_{j1}) + K_2 (L_{j2}/N_{j2}) + K_3 (L_{j3}/N_{j3})$
were recommended to us by the developers. They are the result of their experiments with the Reuters corpus. A user can easily change them.
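
The classification rule can be sketched as below. The function name, the dictionary layout and the numeric values are hypothetical, the comparison is assumed to be "greater than or equal", and the scores and thresholds are assumed to be on the same scale.

```python
def classify(scores, thresholds):
    """Return the category chosen for one document, or None.

    scores:     {category: Cj} for the document.
    thresholds: {category: Tj} obtained at the training stage.
    A category is a candidate when Cj >= Tj; among several candidates
    the one with the largest threshold Tj is selected.
    """
    candidates = [j for j, c in scores.items() if c >= thresholds[j]]
    if not candidates:
        return None  # the document reaches no threshold
    return max(candidates, key=lambda j: thresholds[j])


# Invented values, for illustration only.
scores = {1: 0.30, 2: 0.26, 3: 0.10}
thresholds = {1: 0.27, 2: 0.24, 3: 0.23}
chosen = classify(scores, thresholds)
print(chosen)
```

Here categories 1 and 2 both pass their thresholds, and category 1 wins because its threshold is the larger of the two.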

16
RUBRYX

Algorithm: joint position of terms
- Close terms are simultaneous term occurrences in a given window (S = 10)
- The weights (K1, K2 and K3) of close terms in a document are increased by a certain value p
- p is an algorithm parameter; we set p = 0.3 (a user can easily change it)

17
RUBRYX
Examples of RUBRYX interface (it is very simple)
18
RUBRYX

Tuning thresholds
To improve the results we can change the thresholds Tj:
- Small number of j-documents and small number of alien documents: decrease Tj, which increases the number of j-documents.
- Small number of j-documents and large number of alien documents: the situation is undefined.

19
RUBRYX

Tuning thresholds
To improve the results we can change the thresholds Tj:
- Large number of j-documents and small number of alien documents: OK.
- Large number of j-documents and large number of alien documents: increase Tj, which decreases the number of alien documents.

20
RUBRYX
Tuning thresholds

Rules for adjusting thresholds

                            Alien documents: small number   Alien documents: large number
j-documents: small number   decrease Tj                     undefined
j-documents: large number   OK                              increase Tj
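
The adjustment rules from the preceding slides can be written as a small helper. This is only a sketch: the function name, the "small"/"large" labels supplied by the caller, and the relative step of 25% are assumptions (the slides do not state how small and large are judged).

```python
def adjust_threshold(t_j, own_docs, alien_docs, step=0.25):
    """Return an adjusted threshold Tj for category j.

    own_docs:   "small" or "large" number of j-documents in category j.
    alien_docs: "small" or "large" number of alien documents in category j.
    step:       relative change applied to Tj (assumed value).
    """
    if own_docs == "small" and alien_docs == "small":
        return t_j * (1 - step)  # lower Tj to admit more j-documents
    if own_docs == "large" and alien_docs == "large":
        return t_j * (1 + step)  # raise Tj to reject alien documents
    # large/small is already fine; small/large is undefined on the slides,
    # so the threshold is left unchanged in both cases.
    return t_j


raised = adjust_threshold(24, "large", "large")
print(raised)
```

In practice the adjustment would be repeated and checked against the classification results, since a single fixed step may overshoot.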

21
Experimental corpus

55 GP medical records (gastrointestinal diseases, Spanish)

Class  Disease                    Number of texts  Words in all texts  Different words in all texts
1      Gallbladder disease        12               2849                428
2      Mechanical jaundice        8                2076                458
3      Stomach cancer             11               2873                572
4      Acute appendicitis         6                1339                245
5      Gastrointestinal bleeding  7                1525                373
6      Inguinal hernia            11               2396                243
Total                             55               12828               1269
22
Experiments

Measures for result evaluation
- Precision: $P = k/l$
- Recall: $R = k/m$
- F-measure (binary classification): $F = 2PR / (P + R)$
where m is the number of documents to be selected, l is the number of documents selected by the classifier, and k is the number of correctly selected documents.
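
The three measures can be computed directly from k, l and m. The sketch below uses a hypothetical function name and invented counts; returning 0.0 when a denominator is zero is a convention chosen here, not something stated on the slide.

```python
def precision_recall_f(k, l, m):
    """Return (P, R, F) with P = k/l, R = k/m and F = 2PR / (P + R).

    k: correctly selected documents
    l: documents selected by the classifier
    m: documents that should have been selected
    """
    p = k / l if l else 0.0
    r = k / m if m else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Invented counts: 4 correct out of 5 selected, 8 should have been selected.
p, r, f = precision_recall_f(k=4, l=5, m=8)
print(p, r, round(f, 3))
```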

23
Measures for results evaluation

Measures for result evaluation
Accuracy: $A = n/N$,
where n is the number of correctly classified documents and N is the number of all documents.

24
Experiments

Terminology
- Class is the Gold Standard
- Category is a result of classification with RUBRYX
Results tables
- Rows: the distribution of documents from a given class between categories
- Columns: the distribution of all documents assigned to a given category between classes
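
A results table of this kind is a confusion matrix, so accuracy and the per-class measures follow from its rows and columns. The sketch below assumes a square matrix in which every document is assigned to exactly one category; the function name and the matrix values are illustrative.

```python
def evaluate(confusion):
    """Return (accuracy, precision list, recall list).

    confusion[i][j] is the number of documents of class i (row)
    that were assigned to category j (column).
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(size))
    accuracy = correct / total if total else 0.0
    precision, recall = [], []
    for i in range(size):
        row_sum = sum(confusion[i])                          # documents of class i
        col_sum = sum(confusion[r][i] for r in range(size))  # documents in category i
        recall.append(confusion[i][i] / row_sum if row_sum else 0.0)
        precision.append(confusion[i][i] / col_sum if col_sum else 0.0)
    return accuracy, precision, recall


# Illustrative 3-class matrix: two documents of the first class
# end up in the second category.
matrix = [
    [4, 2, 0],
    [0, 5, 0],
    [0, 0, 5],
]
accuracy, precision, recall = evaluate(matrix)
print(accuracy)
```

Reading along a row gives recall for that class, and reading down a column gives precision for that category.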

25
Experiment 1: Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary

                                    Category
Class  Training set  Test set  T    1   2   3   4   5   6   Correct docs
1      5             7         27   1   5   -   -   1   -   1
2      5             3         24   -   3   -   -   -   -   3
3      5             6         23   -   1   5   -   -   -   5
4      5             1         37   -   -   -   1   -   -   1
5      5             2         27   -   -   -   -   2   -   2
6      5             6         35   -   -   -   -   -   6   6
Total  30            25             1   9   5   1   3   6   18

Tj are calculated automatically. Accuracy 75%.
Category 1 contains practically no documents from Class 1.
We decrease the threshold T1 by 25%.

26
Experiment 1: Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary

                                    Category
Class  Training set  Test set  T    1   2   3   4   5   6   Correct docs
1      5             7         21   7   -   -   -   -   -   7
2      5             3         24   -   3   -   -   -   -   3
3      5             6         23   -   1   5   -   -   -   5
4      5             1         37   -   -   -   1   -   -   1
5      5             2         27   -   -   -   -   2   -   2
6      5             6         35   6   -   -   -   -   0   0
Total  30            25             13  4   5   1   2   0   18

T1 decreased by 25%. Accuracy 75%.
27
Experiment 1: Sensitivity to Thresholds
Training set: 5 samples of each category
No terminological vocabulary

                                    Category
Class  Training set  Test set  T    1   2   3   4   5   6   Correct docs
1      5             7         24   6   -   -   -   1   -   6
2      5             3         30   -   3   -   -   -   -   3
3      5             6         23   -   -   6   -   -   -   6
4      5             1         37   -   -   -   1   -   -   1
5      5             2         27   -   -   -   -   2   -   2
6      5             6         35   1   -   -   -   -   5   5
Total  30            25             7   3   6   1   3   5   23

T1 = 24 (decreased by 10%), T2 = 30 (increased by 25%)
Accuracy 92%

28
Experiment 2: Sensitivity to Term Vocabulary
Training set: 5 samples of each category
Terminological vocabulary is used

                                    Category
Class  Training set  Test set  T    1   2   3   4   5   6   Correct docs
1      5             7         22   5   2   -   -   -   -   5
2      5             3         16   -   3   -   -   -   -   3
3      5             6         10   -   -   5   -   1   -   5
4      5             1         23   -   -   -   0   1   -   0
5      5             2         5    -   -   -   -   2   -   2
6      5             6         33   -   -   -   -   -   6   6
Total  30            25             5   5   5   0   4   6   21

Classification with terminological vocabulary,
Tj are calculated automatically
Accuracy 84%

29
Experiment 2: Sensitivity to Term Vocabulary
Training set: 5 samples of each category
Terminological vocabulary is used

                                    Category
Class  Training set  Test set  T    1   2   3   4   5   6   Correct docs
1      5             7         22   7   -   -   -   -   -   7
2      5             3         20   -   3   -   -   -   -   3
3      5             6         10   -   -   6   -   -   -   6
4      5             1         23   -   -   -   0   1   -   0
5      5             2         7    -   -   -   -   2   -   2
6      5             6         33   -   -   -   -   -   6   6
Total  30            25             7   3   6   0   3   6   24

Classification with terminological vocabulary
T2 = 20, T5 = 7 (both increased by 25%)
Accuracy 96%

30
Experiment 3: Sensitivity to Size of Training Set
Terminological vocabulary is not used

                                    Category
Class  Training set  Test set  T    1   3   6   Correct docs
1      3             6         32   2   4   -   2
3      3             5         30   -   5   -   5
6      3             5         42   -   -   5   5
Total  9             16             2   9   5   12

Training set: 3 samples (the largest categories)
Tj are calculated automatically
Accuracy 75%, F-measure 73%

31
Experiment 3: Sensitivity to Size of Training Set
Terminological vocabulary is not used

                                    Category
Class  Training set  Test set  T    1   3   6   Correct docs
1      6             6         25   4   2   -   4
3      6             5         22   -   5   -   5
6      6             5         34   -   -   5   5
Total  18            16             4   7   5   14

Training set: 6 samples (the largest categories)
Tj are calculated automatically
Accuracy 88%, F-measure 87%
32
Experiments

Results
- Experiment 1. Sensitivity to thresholds: accuracy 75% (Tj automatic), best accuracy 92%
- Experiment 2. Sensitivity to terminological vocabulary: accuracy 84% (Tj automatic), best accuracy 96%
- Experiment 3. Sensitivity to the size of training set: accuracy 75%, F-measure 73% (training set: 3 samples); accuracy 88%, F-measure 87% (training set: 6 samples)

33
Experiments

Discussion
Intersection of lexis: wide vs. narrow domain
- Narrow domain: the lexical resources of the classes are similar, which leads to many errors.
- When classes (e.g. diseases) have completely different descriptor lists, the quality of classification is the highest, up to 100% accuracy.

34
Discussion: Intersection of Lexis

Discussion
Narrow domain: the intersection of lexis is 16-25%.

Categories        1-2   2-3   3-4   4-5   5-6
Common words (%)  23    23    16    25    19
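
The share of common words between two categories can be estimated as in the sketch below. The slides do not say which denominator was used, so measuring the overlap relative to the union of the two vocabularies is an assumption, as are the function name and the toy word sets.

```python
def lexis_overlap(vocab_a, vocab_b):
    """Return the percentage of words shared by two vocabularies.

    Assumption: the share is measured relative to the union of both
    vocabularies; another denominator would give different values.
    """
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Invented toy vocabularies of two categories.
overlap = lexis_overlap(
    {"dolor", "abdomen", "fiebre", "vomito"},
    {"dolor", "abdomen", "hernia", "ingle"},
)
print(round(overlap, 1))
```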
35
Conclusions

- The document classifier RUBRYX was tested on a limited set of primary medical records (narrow domain collection)
- We studied the sensitivity of classification results to threshold variations, use of terminological vocabulary and size of training set

36
Conclusions

- RUBRYX is easily tuned, both automatically and manually, on a given corpus, which makes it possible to reach high results
- Unigrams, bigrams and trigrams taken together, with their mutual position in a document taken into account, make it possible to process narrow domain collections

37
Conclusions

Future Work
To combine the pre-processing procedure of RUBRYX with other classifiers (Naive Bayes, SVM, etc.)

38
Acknowledgments

The authors are very thankful to Mr. Vladimir Sinitsyn, one of the RUBRYX developers, for his numerous consultations and for the additional software tools he offered for our work.
support_at_sowsoft.com

39
Bibliography (selected papers)

D.B. Aronow, J.R. Cooley, S. Soderland. Automated identification of episodes of asthma exacerbation for quality measurement in a computer-based medical record. In Proc. of Annual Symposium on Computer Applications in Medical Care, p. 309-313, USA, 1995.
D.B. Aronow, et al. Automated classification of encounter notes in a computer based medical record. In Proc. of MEDINFO '95, 8th World Congress on Medical Informatics, Canada, p. 8-12, 1995.
R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
A. Catena, M. Alexandrov, B. Alexandrov, M. Demenkova. NLP-Tools Try To Make Medical Diagnosis. In Proc. of the 1st Intern. Workshop on Social Networking (SoNet-2008), Skalica, Slovakia, 2008.
E. Kaurova, M. Alexandrov, X. Blanco. Classification of free text clinical narratives (short review). In Scient. Book Information Science and Computing, Publ. House ITHEA, 2011, 12 pp.

40
Bibliography (selected papers)

R. Lopez, M. Alexandrov, D. Barreda, J. Tejada. Proc. of the 4th Intern. Conf. on Intelligent Information and Engineering Systems (INFOS-2011), Publ. House ITHEA, Poland, 2011, 8 pp.
C.D. Manning, H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
C.D. Manning, H. Schutze. Introduction to Information Retrieval. Cambridge, 2009.
T. Mitchell. Machine Learning. McGraw Hill, 1997.
D. Pinto. On Clustering and Evaluation of Narrow Domain Short-Text Corpora. Doctoral Dissertation, Polytechnic University of Valencia, Spain, 2008.
V. Polyakov, V. Sinitsyn. Method for automatic classification of web-resource by patterns in text processing and cognitive technologies. In Text Collection, No. 6, Publ. House Otechestvo, p. 120-126, 2001 (in Russian).
V. Polyakov, V. Sinitsyn. RUBRYX: technology of text classification using lexical meaning based approach. In Proc. of Intern. Conf. Speech and Computing (SPECOM-2003), Moscow, MSLU, p. 137-143, 2003.

41
Contacts

Olga Kaurova, PhD student, Autonomous University of Barcelona, Spain
kaurovskiy_at_gmail.com
Mikhail Alexandrov, professor, Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia; fLexSem research group, Autonomous University of Barcelona, Spain
malexandrov_at_mail.ru
Ales Bourek, senior lecturer, Masaryk University; head of the Center for Healthcare Quality, Brno, Czech Republic
bourek_at_med.muni.cz
