Title: A three-step approach for STULONG database analysis: characterization of patients
1A three-step approach for STULONG database
analysis: characterization of patient groups
Discovery Challenge (PKDD 2004)
- O. Couturier, H. Delalin, H. Fu, E. Kouamou, E. Mephu Nguifo
- Computer Science Research Center of Lens (CRIL)
- CNRS - Université d'Artois, IUT de Lens
2Goal
- What are the relations between social factors
(social characteristics) and the other
characteristics of men in the respective groups?
3Overview
- Discovery process
- Techniques and results
- Clustering
- Classification
- Association rules
- Conclusion and further work
4Discovery Process
- Hypothesis on data
- ENTRY table
- Groups provided by expert
- Merging groups 1 and 2: Normal group
- Merging groups 3 and 4: Risk group
- Ignoring group 6
- Characteristics
- Considering previous work of LRI ML research team
at previous PKDD Challenges
5Discovery Process
- Can we find a model that fits the provided groups?
- Are there strong similarities among instances of different groups?
- Which kinds of relations exist among group characteristics?
6Discovery Process
Data: ENTRY data groups
Tasks and the knowledge expected from each:
- Clustering: generated clusters vs. provided ones
- Supervised classification: similarities among instances and groups
- Association rules search: affinity among group characteristics
7Techniques and Results Clustering
- Goal: can the initial groups be considered as they were defined?
- Data: groups 1+2, 3+4 and 5
- Clustering systems (WEKA package)
- COBWEB: 2 groups
- EM: 4 groups
- KMEANS: 2 groups
- Results: it is difficult to identify properties that allow the initial groups to be retrieved
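One way to check whether generated clusters match the provided groups is to cross-tabulate the two labelings and measure cluster purity. The following is a minimal sketch under stated assumptions: the labels are hypothetical toy values, not STULONG data, and the function names `contingency` and `purity` are introduced here for illustration only.

```python
from collections import Counter

def contingency(clusters, groups):
    """Count how many instances of each provided group fall in each cluster."""
    table = {}
    for cluster, group in zip(clusters, groups):
        table.setdefault(cluster, Counter())[group] += 1
    return table

def purity(clusters, groups):
    """Fraction of instances that belong to the majority group of their cluster."""
    table = contingency(clusters, groups)
    majority = sum(max(counts.values()) for counts in table.values())
    return majority / len(groups)

# Hypothetical toy labels (NOT STULONG data): provided groups vs. cluster ids.
groups = ["1+2", "1+2", "1+2", "3+4", "3+4", "5", "5", "5"]
clusters = [0, 0, 1, 0, 1, 1, 1, 0]

table = contingency(clusters, groups)
score = purity(clusters, groups)
```

A purity close to 1 would mean the clusters reproduce the provided groups; a low value, as in this toy example, corresponds to the situation reported on the slide, where the initial groups cannot be retrieved.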
8Techniques and Results Supervised Classification
- Are Risk group patients similar to those in the Normal or the Pathological group?
- Data
- Training set: group 1+2 and group 5
- Test set: group 3+4 (Risk)
- System (WEKA package)
- Decision tree C4.5
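C4.5 ranks candidate split attributes by gain ratio (information gain divided by split information). The sketch below computes that criterion for two hypothetical binary attributes on toy labels. It is only an illustration of the criterion, not the WEKA implementation, and the attribute values and names (`ht`, `district`) are invented for the example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(attribute, labels):
    """Gain ratio of splitting `labels` on the discrete `attribute`."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute, labels):
        partitions.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data (NOT STULONG data): N = Normal, P = Pathological.
labels = ["N", "N", "N", "P", "P", "P"]
ht = [0, 0, 0, 1, 1, 1]          # separates the two classes perfectly
district = [1, 2, 1, 2, 1, 2]    # carries almost no class information

ratio_ht = gain_ratio(ht, labels)
ratio_district = gain_ratio(district, labels)
```

In this toy setting the first attribute gets a much higher gain ratio than the second, so a C4.5-style tree would split on it first.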
9Techniques and Results Supervised Classification
- Training results
- The HT descriptor is one of the most relevant factors of the disease
- Thirteen instances of the Pathological group are classified as Normal
Confusion Matrix
  a    e   <-- classified as
276    0 | a = 1+2
 13  101 | e = 5
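The training confusion matrix can be summarized with an overall accuracy and a per-group recall. The sketch below uses the four counts shown on the slide; the dictionary layout and variable names are assumptions made for the example.

```python
# Rows are the true groups, columns the predicted groups, as read from the slide.
# "1+2" is the merged Normal group, "5" the Pathological group.
matrix = {
    "1+2": {"1+2": 276, "5": 0},
    "5": {"1+2": 13, "5": 101},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[group][group] for group in matrix)
accuracy = correct / total

# Recall per group: share of that group's instances that were classified correctly.
recall = {group: matrix[group][group] / sum(matrix[group].values()) for group in matrix}
```

With these counts, 377 of the 390 training instances are classified correctly, and all errors are Pathological instances labelled Normal.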
10Techniques and Results Supervised Classification
- Test set: Risk group (3+4)
- Health district number is not a relevant factor
- 2/5 of Risk patients are similar to Normal group patients
Confusion Matrix
  a    c    d    e   <-- classified as
  0    0    0    0 | a = 1+2
177    0    0  250 | c = 3 (odd)
197    0    0  235 | d = 4 (even)
  0    0    0    0 | e = 5
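The "2/5" figure can be recovered from the test confusion matrix by counting how many Risk patients the tree assigns to the Normal group. A short sketch using the counts from the slide; the variable names are assumptions for the example.

```python
# Predictions for the Risk patients (groups 3 and 4), as read from the slide:
# each row gives how many were assigned to Normal ("1+2") or Pathological ("5").
predicted = {
    "3 (odd)": {"1+2": 177, "5": 250},
    "4 (even)": {"1+2": 197, "5": 235},
}

as_normal = sum(row["1+2"] for row in predicted.values())
total = sum(sum(row.values()) for row in predicted.values())
share_normal = as_normal / total
```

Here 374 of 859 Risk patients fall on the Normal side, which is a little over two fifths.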
11Techniques and Results Association rules search
- Goal: find relations that exist among group characteristics
- Data: 1417 patients of groups 1+2, 3+4 and 5
- System
- Apriori (B. Goethals implementation)
- Preprocess
- Binary conversion of the 27 characteristics
- Frequent Itemsets Search
- Results
- Frequent itemsets common to different groups
12Techniques and Results Association rules search
- Preprocessing: binary conversion
- BMI = weight / size² (size in m)
- If bmi > 27 then 1 else 0
- Age
- If age > 45 then 1 else 0
- Smoker
- If smoker consumption != 0 OR duration then 1 else 0
13Techniques and Results Association rules search
- Preprocessing: binary conversion
- Bolhr (chest pain)
- If bolhr = 1 or bolhr = 6 then 0 else 1
- Chol
- If (chol > 2 + (age/100)) then 1 else 0
- Tg
- If tg < 150 then 0 else 1
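The binary conversion rules for BMI, age, smoking, chest pain, cholesterol and triglycerides can be sketched as a single function. This is an illustration under stated assumptions: the record field names (`weight`, `height_m`, `consumption`, `duration`, and so on) are hypothetical, a non-zero `duration` is taken to mean "smoker", and the cholesterol threshold is read as 2 + age/100, which is an interpretation of the slide rather than a confirmed formula.

```python
def binarize(patient):
    """Map one patient record (a dict) to 0/1 flags following the slide rules."""
    bmi = patient["weight"] / patient["height_m"] ** 2
    return {
        "BMI": int(bmi > 27),
        "AGE": int(patient["age"] > 45),
        # Assumption: any non-zero consumption or duration marks a smoker.
        "SMOKER": int(patient["consumption"] != 0 or patient["duration"] != 0),
        # Codes 1 and 6 map to 0, every other chest-pain code maps to 1.
        "BOLHR": int(patient["bolhr"] not in (1, 6)),
        # Assumption: threshold read as 2 + age/100.
        "CHOL": int(patient["chol"] > 2 + patient["age"] / 100),
        "TG": int(patient["tg"] >= 150),
    }

# Hypothetical patient record, NOT taken from the STULONG ENTRY table.
example = {
    "weight": 90, "height_m": 1.75, "age": 50,
    "consumption": 0, "duration": 0,
    "bolhr": 1, "chol": 2.6, "tg": 120,
}
flags = binarize(example)
```

Each patient then becomes a set of true attributes, which is the transaction format a frequent itemset search expects.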
14Techniques and Results Association rules search
- Frequent itemsets search
- Support threshold (MinSup) = 0.10
- significant for at least 10% of the population
- Search was done with no MinSup (i.e. MinSup value = 0)
[Table: itemsets and their support values for Class 1+2, Class 3+4 and Class 5]
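The support of an itemset is the fraction of patients whose binary record contains every item in it, and an itemset is frequent when that fraction reaches MinSup. The sketch below enumerates itemsets by brute force rather than with Apriori's candidate pruning, so it only illustrates the definition. The transactions are hypothetical toy data, and `frequent_itemsets` is a name introduced for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup, max_size=3):
    """Return {itemset: support} for all itemsets up to max_size with support >= minsup."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= minsup:
                result[candidate] = support
    return result

# Hypothetical toy transactions (NOT STULONG data): attributes that are true per patient.
transactions = [
    {"AGE", "SMOKER", "CHOL"},
    {"AGE", "SMOKER"},
    {"AGE"},
    {"SMOKER", "CHOL"},
]
frequent = frequent_itemsets(transactions, minsup=0.5)
```

Running the same function separately on the transactions of each group would give one support value per group for every itemset, which is what allows the groups to be compared.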
16Techniques and Results Association rules search
- Frequent itemsets search: results
- Support value of Alcohol attribute = 1
- 1-itemsets
- Attribute IM is false for each patient of groups 1+2 and 3+4. The value is true for 33% of patients of group 5.
- HT is false for each patient of group 1+2.
- STUDY is more frequent in group 1+2 than in group 5
- 3-itemsets
- AGE SMOKER CHOL is less frequent in group 1+2 than in group 5
- etc.
- The support value of group 3+4 is between the support value of group 1+2 and the support value of group 5.
17Conclusion
- The Risk group (RG) is similar to both the Normal group (NG) and the Pathological group (PG).
- 3 steps
- Clustering: initial groups are not found
- Classification: some attributes characterize the pathological group, but they are already known
- Frequent itemsets search: difficult to highlight concrete results, but interesting information
18Further work
- Upgrade the binary conversion
- Refining the data set on the population
- for instance, 12 patients died because of atherosclerosis while they were in the NG.
- Refining our hypothesis
- Data set of ENTRY table
- Look at the CONTROL table
19Thanks !