CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS WITH RUBRYX-2: FIRST EXPERIENCE

1
CLASSIFICATION OF PRIMARY CARE MEDICAL RECORDS
WITH RUBRYX-2: FIRST EXPERIENCE
  • Olga Kaurova 1 kaurovskiy_at_gmail.com
  • Mikhail Alexandrov 1 malexandrov_at_mail.ru
  • Ales Bourek 2 bourek_at_med.muni.cz
  • 1 - Autonomous University of Barcelona,
    Bellaterra, Spain
  • 2 - Masaryk University, Brno, Czech Republic
  • 2012

2
Contents
  • INTRODUCTION
  • RUBRYX
    - General info
    - Lexical resources
    - Algorithm
  • EXPERIMENTS
    - Influence of thresholds
    - Influence of terminological vocabulary
    - Influence of size of training set
  • CONCLUSIONS

3
Introduction
  • Problem setting
  • Classification of primary care medical records (free-text form) is a problem closely related to medical diagnostics
  • We assess the capabilities and performance of the document classifier RUBRYX on such specific documents as primary medical records

4
Introduction
  • Specifics of classifying clinical texts
  • Preprocessing of low-quality data (abbreviations, ungrammatical statements, etc.)
  • The need to also consider hidden information (relations between descriptors of diseases)

5
Introduction
  • Objectives of the study
  • To allow medical professionals to continue using the preferred traditional full-text format
  • To help reduce medical errors
  • To facilitate medical data exchange, storage and retrieval of individual records
  • To help form Internet communities of people with similar health issues

6
Introduction
  • State of the art
  • Bayesian classifiers (accuracy 95%)
  • Support Vector Machine (accuracy 49%)
  • K-nearest neighbors (precision 57%, recall 77%)
  • Decision trees (accuracy 80%)
  • Lexis-based methods (recall 49%, precision 90%)
  • Note: the data are taken from the publications listed in the bibliography

7
RUBRYX
  • General info
  • Developed in the 2000s by Polyakov and Sinitsyn
  • Latest version: 2.2
  • Free shareware
  • http://www.sowsoft.com/rubryx/rubryx2.zip
  • Test on Reuters news: F-measure 86% (training set: 5 representatives from each of 10 categories)

8
RUBRYX
  • Lexical resources
  • RUBRYX uses patterns in the form of n-grams and takes into account their joint position in a document
  • - Mini-vocabularies for each category
  • - Terminological vocabulary
  • - Stop terms: black lists (regular expressions)

9
RUBRYX
  • Mini-vocabularies
  • - Related to a specific category
  • - Created during the training stage (from the most representative samples of each category; just 5 training documents give acceptable results)
  • - Can be created with or without the support of a terminological vocabulary

10
RUBRYX
  • Mini-vocabularies
  • All stop terms are eliminated
  • All common unigrams form the first list, file WordList
  • All common bigrams form the second list, file WordLst2
  • All common trigrams form the third list, file WordLst3
  • Common terms are terms that occur in at least m documents
  • Here m ≤ M, where M is the number of all documents related to a certain category. In our experiments m = M
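
The construction of the three lists can be sketched as follows. This is a minimal illustration, not the RUBRYX implementation: the function names are hypothetical, stop terms are modelled as a plain set of tokens rather than regular expressions, and stop terms are assumed to be removed before n-grams are formed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_mini_vocabulary(docs, stop_terms, m=None):
    """Build unigram, bigram and trigram lists for one category.

    docs       -- training documents of the category, each a list of tokens
    stop_terms -- set of tokens to eliminate (assumption: exact-match tokens)
    m          -- minimum number of documents a term must occur in;
                  defaults to M = len(docs), i.e. m = M
    """
    if m is None:
        m = len(docs)
    cleaned = [[t for t in doc if t not in stop_terms] for doc in docs]
    vocabulary = {}
    for n in (1, 2, 3):
        doc_freq = Counter()
        for doc in cleaned:
            # Count each n-gram at most once per document.
            doc_freq.update(set(ngrams(doc, n)))
        vocabulary[n] = {g for g, df in doc_freq.items() if df >= m}
    return vocabulary


docs = [
    ["dolor", "en", "abdomen", "derecho"],
    ["dolor", "abdomen", "derecho", "agudo"],
]
vocab = build_mini_vocabulary(docs, stop_terms={"en"})
```

With m = M only terms shared by every training document survive, which is why a handful of representative documents per category is enough to produce short, specific lists.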

11
RUBRYX
  • Terminological vocabulary
  • - Optional
  • - Created for a given domain by external experts
  • - Reflects a common terminology for all categories in a document corpus
  • - Contains 3 lists: one-word terms (unigrams), two-word terms (bigrams), three-word terms (trigrams)

12
RUBRYX
  • Construction of terminological vocabulary
  • - The LexisTerm tool (criterion of specificity) extracts all specific terms from the whole corpus
  • - The level of specificity of a given word w in a given document corpus C is a number K ≥ 1, which shows how much its frequency in the document corpus fC(w) exceeds its frequency in the general lexis fL(w):
  • K = fC(w) / fL(w)
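
The specificity criterion can be illustrated with a short sketch. It is not the LexisTerm code: the function name, the cut-off `k_min` and the `floor` frequency used for words missing from the general lexis are assumptions made for this example.

```python
from collections import Counter


def specific_terms(corpus_tokens, general_freq, k_min=2.0, floor=1e-6):
    """Return {word: K} for words whose specificity K = fC(w) / fL(w) >= k_min.

    corpus_tokens -- all tokens of the document corpus C
    general_freq  -- relative word frequencies in the general lexis, fL(w)
    k_min         -- hypothetical selection threshold
    floor         -- hypothetical frequency for words absent from general_freq
    """
    total = len(corpus_tokens)
    counts = Counter(corpus_tokens)
    selected = {}
    for word, count in counts.items():
        f_c = count / total                   # frequency in the corpus
        f_l = general_freq.get(word, floor)   # frequency in the general lexis
        k = f_c / f_l
        if k >= k_min:
            selected[word] = k
    return selected


tokens = ["ictericia"] * 2 + ["de"] * 8
terms = specific_terms(tokens, {"ictericia": 0.001, "de": 0.5})
```

In this toy input the domain word is far more frequent in the corpus than in the general lexis and is kept, while the function word has K below the cut-off and is dropped.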

13
RUBRYX
  • Construction of terminological vocabulary
  • Note.
  • The vocabulary of unigrams is compiled by LexisTerm; bigrams and trigrams are compiled by an expert (doctor)

14
RUBRYX
  • Algorithm: linear combination of indexes
  • The contribution of each category to a given document is a linear combination of category indexes (terms from mini-vocabularies):
  • Cj = K1 (Lj1/Nj1) + K2 (Lj2/Nj2) + K3 (Lj3/Nj3)
  • j - the number of the category; K1 = 0.2/3, K2 = 1.3/3, K3 = 1.5/3
  • Lj1, Lj2, Lj3 - the numbers of terms from the three lists of category j found in a given document
  • Nj1, Nj2, Nj3 - the numbers of all unigrams, bigrams and trigrams, respectively, in a given document
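
As a worked illustration of the formula, the score Cj can be computed as below. The function name is hypothetical, and treating an empty n-gram list (Nj = 0) as a zero contribution is an assumption of this sketch.

```python
# Coefficients as given on the slide; note that they sum to 1.
K = (0.2 / 3, 1.3 / 3, 1.5 / 3)


def category_score(matched, totals, weights=K):
    """Cj = K1*(Lj1/Nj1) + K2*(Lj2/Nj2) + K3*(Lj3/Nj3).

    matched -- (Lj1, Lj2, Lj3): vocabulary terms of category j found in the document
    totals  -- (Nj1, Nj2, Nj3): all unigrams, bigrams, trigrams in the document
    """
    # Assumption: a document too short to contain n-grams of some order
    # simply contributes nothing for that order.
    return sum(k * l / n for k, l, n in zip(weights, matched, totals) if n > 0)


score = category_score(matched=(10, 4, 2), totals=(100, 99, 98))
```

Because the coefficients sum to 1 and each ratio is at most 1, the unadjusted score lies between 0 and 1, with trigram matches weighted most heavily.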

15
RUBRYX
  • Algorithm: classification rule
  • A document belongs to category j if Cj ≥ Tj (the thresholds Tj are calculated at the training stage)
  • If Cj ≥ Tj holds for several categories, we select the category i with Ti = max(Tj)
  • Note. The coefficients K1, K2, K3 in the formula
  • Cj = K1 (Lj1/Nj1) + K2 (Lj2/Nj2) + K3 (Lj3/Nj3)
  • were recommended to us by the developers. They are the result of their experiments with the Reuters corpus. A user can easily change them
16
RUBRYX
  • Algorithm: joint position of terms
  • Close terms: simultaneous term occurrences within a given window (S = 10)
  • Weights (K1, K2 and K3) of close terms in a document are increased by a certain value p
  • p is an algorithm parameter; we set p = 0.3 (it can easily be changed by a user)
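
A rough sketch of the window test is given below. The slides do not specify how positions are measured or whether p is added to the weight or multiplies it, so token positions and an additive increase are assumptions here, and both function names are hypothetical.

```python
def close_positions(term_positions, window=10):
    """Return positions of term occurrences that have another term
    occurrence within `window` tokens (S = 10 on the slide)."""
    positions = sorted(term_positions)
    close = set()
    # Checking neighbours in sorted order is enough: if any occurrence lies
    # within the window, the nearest one does too.
    for a, b in zip(positions, positions[1:]):
        if b - a <= window:
            close.update((a, b))
    return close


def boosted_weight(weight, is_close, p=0.3):
    """Assumption: the weight of a close term is increased additively by p."""
    return weight + p if is_close else weight


near = close_positions([3, 8, 40, 100, 105])
```

The occurrence at position 40 has no neighbour within ten tokens, so only the two pairs around it would receive the increased weight.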

17
RUBRYX
Examples of the RUBRYX interface (it is very simple)

18
RUBRYX
  • Tuning thresholds
  • To improve the results we can change the thresholds Tj
  • Here j-documents are documents of class j assigned to category j; alien documents are documents of other classes assigned to category j
  • Small number of j-documents and a small number of alien documents: decrease Tj, which increases the number of j-documents.
  • Small number of j-documents and a large number of alien documents: the situation is undefined.

19
RUBRYX
  • Tuning thresholds
  • To improve the results we can change the thresholds Tj
  • Large number of j-documents and a small number of alien documents: OK.
  • Large number of j-documents and a large number of alien documents: increase the threshold Tj, which decreases the number of alien documents.

20
RUBRYX
Tuning thresholds

| | Alien documents: small number | Alien documents: large number |
|---|---|---|
| j-documents: small number | Decrease Tj | Undefined |
| j-documents: large number | OK | Increase Tj |

  • Rules for adjusting thresholds
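
The four tuning rules can be written as a small decision function. This is only a sketch: the function name is hypothetical, what counts as a "large" number is left to the caller, and the 25% step is an assumption that mirrors the size of the changes reported in the experiments.

```python
def adjust_threshold(t_j, many_own_docs, many_alien_docs, step=0.25):
    """Return an adjusted threshold Tj following the tuning rules.

    many_own_docs   -- True if category j contains many j-documents
    many_alien_docs -- True if category j contains many alien documents
    step            -- relative change (hypothetical default of 25%)
    """
    if not many_own_docs and not many_alien_docs:
        return t_j * (1 - step)   # too strict: lower Tj to admit more j-documents
    if many_own_docs and many_alien_docs:
        return t_j * (1 + step)   # too loose: raise Tj to reject alien documents
    # Either the result is already fine (many own, few alien) or the
    # situation is undefined (few own, many alien): leave Tj unchanged.
    return t_j


raised = adjust_threshold(16, many_own_docs=True, many_alien_docs=True)
```

Only two of the four cases change the threshold; the undefined case is deliberately left untouched because the rules give no direction for it.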

21
Experimental corpus
  • 55 GP medical records (gastrointestinal diseases, Spanish)

| Class | Disease | Number of texts | Number of words in all texts | Number of different words in all texts |
|---|---|---|---|---|
| 1 | Gallbladder disease | 12 | 2849 | 428 |
| 2 | Mechanical jaundice | 8 | 2076 | 458 |
| 3 | Stomach cancer | 11 | 2873 | 572 |
| 4 | Acute appendicitis | 6 | 1339 | 245 |
| 5 | Gastrointestinal bleeding | 7 | 1525 | 373 |
| 6 | Inguinal hernia | 11 | 2396 | 243 |
| Total | | 55 | 12828 | 1269 |

22
Experiments
  • Measures for result evaluation
  • Precision: P = k/l
  • Recall: R = k/m
  • F-measure (binary classification): F = 2PR / (P + R)
  • m - the number of documents that should be selected
  • l - the number of documents selected by the classifier
  • k - the number of correctly selected documents
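
These measures translate directly into code. The function name is hypothetical, and returning 0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f(k, l, m):
    """Precision, recall and F-measure for one category.

    k -- correctly selected documents
    l -- documents selected by the classifier
    m -- documents that should have been selected
    """
    p = k / l if l else 0.0
    r = k / m if m else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Example: 5 documents selected, 4 of them correct, 8 should have been found.
p, r, f = precision_recall_f(k=4, l=5, m=8)
```

In the example precision is 4/5 and recall is 4/8, so the F-measure sits between them, closer to the lower value.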

23
Measures for results evaluation
  • Measures for result evaluation
  • Accuracy: A = n/N, where
  • n is the number of all correctly classified documents
  • N is the number of all documents

24
Experiments
  • Terminology
  • Class is the Gold Standard
  • Category is the result of classification with RUBRYX
  • Results tables
  • Rows - the distribution of documents from a given class between categories
  • Columns - the distribution of all documents assigned to a given category between classes

25
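Experiment 1: Sensitivity to Thresholds
Training set - 5 samples of each category
No terminological vocabulary

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Cat. 5 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 7 | 27 | 1 | 5 | | | 1 | | 1 |
| 2 | 5 | 3 | 24 | | 3 | | | | | 3 |
| 3 | 5 | 6 | 23 | | 1 | 5 | | | | 5 |
| 4 | 5 | 1 | 37 | | | | 1 | | | 1 |
| 5 | 5 | 2 | 27 | | | | | 2 | | 2 |
| 6 | 5 | 6 | 35 | | | | | | 6 | 6 |
| Total | 30 | 25 | | 1 | 9 | 5 | 1 | 3 | 6 | 18 |

  • Tj are calculated automatically, Accuracy 75%
  • Category 1 contains almost no documents from Class 1.
  • We decrease the threshold T1 by 25%.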

26
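Experiment 1: Sensitivity to Thresholds
Training set - 5 samples of each category
No terminological vocabulary

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Cat. 5 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 7 | 21 | 7 | | | | | | 7 |
| 2 | 5 | 3 | 24 | | 3 | | | | | 3 |
| 3 | 5 | 6 | 23 | | 1 | 5 | | | | 5 |
| 4 | 5 | 1 | 37 | | | | 1 | | | 1 |
| 5 | 5 | 2 | 27 | | | | | 2 | | 2 |
| 6 | 5 | 6 | 35 | 6 | | | | | 0 | 0 |
| Total | 30 | 25 | | 13 | 4 | 5 | 1 | 2 | 0 | 18 |

  • T1 decreased by 25%, Accuracy 75%
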
27
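Experiment 1: Sensitivity to Thresholds
Training set - 5 samples of each category
No terminological vocabulary

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Cat. 5 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 7 | 24 | 6 | | | | 1 | | 6 |
| 2 | 5 | 3 | 30 | | 3 | | | | | 3 |
| 3 | 5 | 6 | 23 | | | 6 | | | | 6 |
| 4 | 5 | 1 | 37 | | | | 1 | | | 1 |
| 5 | 5 | 2 | 27 | | | | | 2 | | 2 |
| 6 | 5 | 6 | 35 | 1 | | | | | 5 | 5 |
| Total | 30 | 25 | | 7 | 3 | 6 | 1 | 3 | 5 | 23 |

  • T1 = 24 (decreased by 10%), T2 = 30 (increased by 25%)
  • Accuracy 92%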

28
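Experiment 2: Sensitivity to Term Vocabulary
Training set - 5 samples of each category
Terminological vocabulary is used

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Cat. 5 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 7 | 22 | 5 | 2 | | | | | 5 |
| 2 | 5 | 3 | 16 | | 3 | | | | | 3 |
| 3 | 5 | 6 | 10 | | | 5 | | 1 | | 5 |
| 4 | 5 | 1 | 23 | | | | 0 | 1 | | 0 |
| 5 | 5 | 2 | 5 | | | | | 2 | | 2 |
| 6 | 5 | 6 | 33 | | | | | | 6 | 6 |
| Total | 30 | 25 | | 5 | 5 | 5 | 0 | 4 | 6 | 21 |

  • Classification with terminological vocabulary
  • Tj are calculated automatically
  • Accuracy 84%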

29
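Experiment 2: Sensitivity to Term Vocabulary
Training set - 5 samples of each category
Terminological vocabulary is used

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 2 | Cat. 3 | Cat. 4 | Cat. 5 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 7 | 22 | 7 | | | | | | 7 |
| 2 | 5 | 3 | 20 | | 3 | | | | | 3 |
| 3 | 5 | 6 | 10 | | | 6 | | | | 6 |
| 4 | 5 | 1 | 23 | | | | 0 | 1 | | 0 |
| 5 | 5 | 2 | 7 | | | | | 2 | | 2 |
| 6 | 5 | 6 | 33 | | | | | | 6 | 6 |
| Total | 30 | 25 | | 7 | 3 | 6 | 0 | 3 | 6 | 24 |

  • Classification with terminological vocabulary
  • T2 = 20, T5 = 7 (both increased by 25%)
  • Accuracy 96%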

30
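Experiment 3: Sensitivity to Size of Training Set
Terminological vocabulary is not used

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 3 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|
| 1 | 3 | 6 | 32 | 2 | 4 | | 2 |
| 3 | 3 | 5 | 30 | | 5 | | 5 |
| 6 | 3 | 5 | 42 | | | 5 | 5 |
| Total | 9 | 16 | | 2 | 9 | 5 | 12 |

  • Training set: 3 samples per category (the largest categories)
  • Tj are calculated automatically
  • Accuracy 75%, F-measure 73%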

31
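Experiment 3: Sensitivity to Size of Training Set
Terminological vocabulary is not used

| Class | Training Set | Test Set | T | Cat. 1 | Cat. 3 | Cat. 6 | Correct Docs |
|---|---|---|---|---|---|---|---|
| 1 | 6 | 6 | 25 | 4 | 2 | | 4 |
| 3 | 6 | 5 | 22 | | 5 | | 5 |
| 6 | 6 | 5 | 34 | | | 5 | 5 |
| Total | 18 | 16 | | 4 | 7 | 5 | 14 |

  • Training set: 6 samples per category (the largest categories)
  • Tj are calculated automatically
  • Accuracy 88%, F-measure 87%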
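
The figures of Experiment 3 can be re-derived from the two results tables. The sketch below assumes that the reported F-measure is the unweighted mean of the per-class F values, which the slides do not state; the function name is hypothetical, and the off-diagonal cell positions are inferred from the row and column totals.

```python
def evaluate(matrix):
    """Return (accuracy, macro-averaged F) for a square results table.

    matrix[i][j] -- number of test documents of class i assigned to category j
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    f_values = []
    for i in range(size):
        k = matrix[i][i]                    # correctly selected
        m = sum(matrix[i])                  # documents of class i
        l = sum(row[i] for row in matrix)   # documents assigned to category i
        p = k / l if l else 0.0
        r = k / m if m else 0.0
        f_values.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return correct / total, sum(f_values) / size


# Rows: classes 1, 3, 6; columns: categories 1, 3, 6.
three_samples = [[2, 4, 0], [0, 5, 0], [0, 0, 5]]
six_samples = [[4, 2, 0], [0, 5, 0], [0, 0, 5]]

acc3, f3 = evaluate(three_samples)
acc6, f6 = evaluate(six_samples)
```

Under this assumption the tables give accuracy 0.75 and F of about 0.74 for 3 training samples, and accuracy 0.875 and F of about 0.88 for 6 samples, which is close to the reported values.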
32
Experiments
  • Results
  • Experiment 1. Sensitivity to thresholds
  • Accuracy 75% (Tj automatic)
  • Best accuracy 92%
  • Experiment 2. Sensitivity to terminological vocabulary
  • Accuracy 84% (Tj automatic)
  • Best accuracy 96%
  • Experiment 3. Sensitivity to the size of training set
  • Accuracy 75%, F-measure 73% (training set: 3 samples)
  • Accuracy 88%, F-measure 87% (training set: 6 samples)

33
Experiments
  • Discussion
  • Intersection of lexis: wide vs. narrow domain
  • Narrow domain: the lexical resources of the classes are similar => many errors.
  • When classes, e.g. diseases, have completely different descriptor lists, the quality of classification is the highest => 100% accuracy.

34
Discussion Intersection of Lexis
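  • Discussion
  • Narrow domain: the intersection of lexis between neighbouring categories ranges from 16 to 25.

| | Categories 1-2 | Categories 2-3 | Categories 3-4 | Categories 4-5 | Categories 5-6 |
|---|---|---|---|---|---|
| Common words | 23 | 23 | 16 | 25 | 19 |
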
35
Conclusions
  • The document classifier RUBRYX was tested on a limited set of primary medical records (narrow domain collection)
  • We studied the sensitivity of classification results to threshold variations, use of a terminological vocabulary and size of the training set

36
Conclusions
  • RUBRYX is easily tuned, automatically and manually, on a given corpus, which makes it possible to reach high results
  • Unigrams, bigrams and trigrams taken together, with their mutual position in a document taken into account, make it possible to process narrow domain collections

37
Conclusions
  • Future Work
  • To combine the pre-processing procedure of RUBRYX with other classifiers (Naive Bayes, SVM, etc.)

38
Acknowledgments
  • The authors are very thankful to Mr. Vladimir Sinitsyn, one of the RUBRYX developers, for his numerous consultations and the additional software tools he offered for our work.
  • support_at_sowsoft.com

39
Bibliography (selected papers)
  • D.B. Aronow, J.R. Cooley, S. Soderland. Automated identification of episodes of asthma exacerbation for quality measurement in a computer-based medical record. In Proc. of Annual Symposium on Computer Applications in Medical Care, USA, p. 309-313, 1995.
  • D.B. Aronow, et al. Automated classification of encounter notes in a computer based medical record. In Proc. of MEDINFO '95, 8th World Congress on Medical Informatics, Canada, p. 8-12, 1995.
  • R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
  • C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • A. Catena, M. Alexandrov, B. Alexandrov, M. Demenkova. NLP-Tools Try To Make Medical Diagnosis. In Proc. of the 1st Intern. Workshop on Social Networking (SoNet-2008), Skalica, Slovakia, 2008.
  • E. Kaurova, M. Alexandrov, X. Blanco. Classification of free text clinical narratives (short review). In Scient. Book Information Science and Computing, Publ. House ITHEA, 2011, 12 pp.

40
Bibliography (selected papers)
  • R. Lopez, M. Alexandrov, D. Barreda, J. Tejada. Proc. of the 4th Intern. Conf. on Intelligent Information and Engineering Systems (INFOS-2011), Publ. House ITHEA, Poland, 2011, 8 pp.
  • C.D. Manning, H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
  • C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge, 2009.
  • T. Mitchell. Machine Learning. McGraw-Hill, 1997.
  • D. Pinto. On Clustering and Evaluation of Narrow Domain Short-Text Corpora. Doctoral Dissertation, Polytechnic University of Valencia, Spain, 2008.
  • V. Polyakov, V. Sinitsyn. Method for automatic classification of web-resource by patterns in text processing and cognitive technologies. In Text Collection, No. 6, Publ. House Otechestvo, p. 120-126, 2001 (rus.)
  • V. Polyakov, V. Sinitsyn. RUBRYX technology of text classification using lexical meaning based approach. In Proc. of Intern. Conf. Speech and Computing (SPECOM-2003), Moscow, MSLU, p. 137-143, 2003.

41
Contacts
  • Olga Kaurova, PhD student, Autonomous University of Barcelona, Spain
    kaurovskiy_at_gmail.com
  • Mikhail Alexandrov, professor, Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia; fLexSem research group, Autonomous University of Barcelona, Spain
    malexandrov_at_mail.ru
  • Ales Bourek, senior lecturer, Masaryk University; head of Center for Healthcare Quality, Brno, Czech Republic
    bourek_at_med.muni.cz