Title: Prediction of Diabetes by Employing a Meta-Heuristic Which Can Optimize the Performance of Existing Data Mining Approaches
1. Prediction of Diabetes by Employing a Meta-Heuristic Which Can Optimize the Performance of Existing Data Mining Approaches
- By Huy Nguyen Anh Pham and Evangelos Triantaphyllou
- ICIS 2008, Portland, Oregon, May 14-16, 2008
- Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803
- Emails: hpham15@lsu.edu and trianta@lsu.edu
2. Outline
- Diabetes and the Pima Indian Diabetes (PID) dataset
- Selected current work
- Motivation
- The Homogeneity Based Algorithm (HBA)
- Rationale for the HBA
- Some computational results
- Conclusions
3. Diabetes and the PID dataset
- Diabetes: If the body does not produce or properly use insulin, the excess sugar is excreted through urination. This condition is called diabetes.
- 20.8 million children and adults in the United States (approximately 7% of the population) were diagnosed with diabetes (American Diabetes Association, 11/2007).
- The Pima Indian Diabetes (PID) dataset: 768 records describing female patients of Pima Indian heritage, at least 21 years old, living near Phoenix, Arizona, USA (UCI Machine Learning Repository, 2007).
4. Diabetes and the PID dataset (cont'd)
- The eight attributes of each record in the PID:
No.  Attribute
1    Number of times pregnant
2    Plasma glucose concentration in an oral glucose tolerance test
3    Diastolic blood pressure (mm Hg)
4    Triceps skin fold thickness (mm)
5    2-hour serum insulin (µU/ml)
6    Body mass index (kg/m²)
7    Diabetes pedigree function
8    Age (years)
5. Selected current work
- 76.0% diagnosis accuracy by Smith et al. (1988) using an early neural network.
- 77.6% diagnosis accuracy by Jankowski and Kadirkamanathan (1997) using IncNet.
- 77.6% diagnosis accuracy by Au and Chan (2001) using a fuzzy approach.
- 78.6% diagnosis accuracy by Rutkowski and Cpalka (2003) using a flexible neuro-fuzzy inference system (FLEXNFIS).
- 81.8% diagnosis accuracy by Davis (2006) using a fuzzy neural network.
- Less than 78% diagnosis accuracy by the Statlog project (1994) using various classification algorithms.
6. Motivation
- In medical diagnosis there are three different types of possible errors:
- The false-negative type, in which a patient who in reality has the disease is diagnosed as disease free.
- The false-positive type, in which a patient who in reality does not have the disease is diagnosed as having it.
- The unclassifiable type, in which the diagnostic system cannot diagnose a given case. This happens because of insufficient knowledge extracted from the historical data.
7. Motivation (cont'd)
- Current medical data mining approaches often assign equal penalty costs to the false-positive and the false-negative types.
- Diagnosing a new patient as a false-positive case may:
- Make the patient worry unnecessarily.
- Lead to unnecessary treatments and expenses.
- Not involve life-threatening possibilities.
- Diagnosing a new patient as a false-negative case may:
- Result in no treatment on time, or none at all.
- Let the condition deteriorate, so the patient's life may be at risk.
- => The two penalty costs for the false-positive and the false-negative types may be significantly different.
8. Motivation (cont'd)
- Current medical data mining approaches ignore the penalty cost for the unclassifiable type:
- Because of insufficient knowledge extracted from the historical data, a given patient should be predicted as unclassifiable.
- However, in reality current approaches often predict such a patient as either having diabetes or being disease free.
- Such a misdiagnosis may lead to either unnecessary treatment or no treatment when one is needed.
- => Consideration of the unclassifiable type is required.
9. Outline
- Diabetes and the PID dataset
- Selected current work
- Motivation
- The Homogeneity Based Algorithm (HBA)
- Rationale for the HBA
- Some computational results
- Conclusions
10. The Homogeneity-Based Algorithm (HBA)
- Developed by the authors of this presentation (Pham and Triantaphyllou, 2007 and 2008).
- Define the total misclassification cost TC in terms of the false-positive, the false-negative, and the unclassifiable costs as:

  TC = CFP × RateFP + CFN × RateFN + CUC × RateUC    (1)

- Where:
- RateFP, RateFN, and RateUC are the false-positive, the false-negative, and the unclassifiable rates, respectively.
- CFP, CFN, and CUC are the penalty costs for the false-positive, the false-negative, and the unclassifiable cases, respectively. Usually, CFN is much higher than CFP and CUC.
- The goal is to minimize the total misclassification cost.
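As a small, purely illustrative example (not part of the original slides), the Python sketch below evaluates Equation (1) directly from the three rates and penalty costs; the usage example reuses the Case 3 penalty costs and the SVM rates reported in the computational results later in the deck.

```python
def total_misclassification_cost(rate_fp, rate_fn, rate_uc, c_fp, c_fn, c_uc):
    """Total misclassification cost TC from Equation (1):
    TC = CFP * RateFP + CFN * RateFN + CUC * RateUC."""
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


# Case 3 penalty costs (CFP = 1, CFN = 20, CUC = 3) with the SVM rates
# (RateFP = 0, RateFN = 74, RateUC = 109) reported later in the slides.
print(total_misclassification_cost(0, 74, 109, c_fp=1, c_fn=20, c_uc=3))  # 1807
```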
11. The HBA - Some key observations
- Please notice that:
- Pattern A covers a region that is not adequately populated by training points.
- Pattern B does not have such sparsely covered regions.
- The assumption that point P is a diabetes case may not be very accurate.
- However, the assumption that point Q is a diabetes case may be more accurate.
- The accuracy of the inferred systems can be increased if the derived patterns correspond to homogeneous sets.
- A homogeneous set describes a steady or uniform distribution of a set of distinct points.
12. The HBA - Some key observations (cont'd)
- Break pattern A into A1 and A2. Suppose that patterns A1, A2, and B all correspond to homogeneous sets.
- The number of points in B is higher than that in A1.
- Thus, the assumption that point Q is a diabetes case may be more accurate than the assumption that point S is a diabetes case.
- The accuracy of the inferred systems may also be affected by the density of the points, which is therefore used as the Homogeneity Degree (HD).
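The slides describe the Homogeneity Degree only informally, as a density-based quantity; its exact definition is given in Pham and Triantaphyllou (2007, 2008). Purely to illustrate the density idea, and under the assumption that a homogeneous set is represented as a hypersphere, one crude stand-in would be points per unit volume; the function below is an illustrative assumption, not the paper's definition.

```python
import math


def density_based_degree(num_points, radius, dim):
    """Illustrative stand-in for the Homogeneity Degree (HD): the number of
    training points per unit volume of a dim-dimensional hypersphere with the
    given radius. (Assumption for illustration only; the actual HD is defined
    in Pham and Triantaphyllou, 2007 and 2008.)"""
    volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim
    return num_points / volume
```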
13. The HBA - The Main Algorithm
- Phase 1: Assume that a training dataset T is given. Divide T into two sub-datasets:
- T1, whose size is, say, equal to 90% of T's size.
- T2, whose size is, say, equal to 10% of T's size.
- The training points in T1 are randomly selected from T.
- Phase 2:
- Apply a classification approach (such as a DT, ANN, or SVM) to the training dataset T1 to infer the classification systems. Suppose that each classification system consists of a set of patterns.
- Break the inferred patterns into hyperspheres.
- Phase 3:
- Determine whether or not the hyperspheres are homogeneous sets.
- If so, compute their Homogeneity Degrees and go to Phase 4.
- Otherwise, break them into smaller hyperspheres and repeat Phase 3 until all the hyperspheres are homogeneous sets.
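A rough, non-authoritative sketch of how Phases 1-3 could be organized in code (Python with scikit-learn). A decision tree stands in for the DT/ANN/SVM choices named above, and the hypersphere construction, the homogeneity test, and the breaking step, which are defined in the HBA papers, are left as caller-supplied functions rather than implemented here.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def phase_1(X, y, calibration_fraction=0.10, seed=0):
    """Phase 1: randomly split the training data T into T1 (~90%) and T2 (~10%)."""
    X1, X2, y1, y2 = train_test_split(X, y, test_size=calibration_fraction,
                                      random_state=seed)
    return (X1, y1), (X2, y2)


def phase_2(X1, y1):
    """Phase 2: infer an initial classification system on T1 (a decision tree
    is used here; breaking its patterns into hyperspheres is not shown)."""
    return DecisionTreeClassifier(random_state=0).fit(X1, y1)


def phase_3(hyperspheres, is_homogeneous, break_sphere, homogeneity_degree):
    """Phase 3: keep breaking hyperspheres until each one is a homogeneous set,
    then return every homogeneous set together with its Homogeneity Degree."""
    pending, done = list(hyperspheres), []
    while pending:
        sphere = pending.pop()
        if is_homogeneous(sphere):
            done.append((sphere, homogeneity_degree(sphere)))
        else:
            pending.extend(break_sphere(sphere))
    return done
```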
14. The HBA - The Main Algorithm (cont'd)
- Phase 4:
- Sort the Homogeneity Degrees in decreasing order.
- For each homogeneous set:
- If its Homogeneity Degree is greater than a certain threshold value, then expand it.
- Otherwise, break it into smaller homogeneous sets, which may then be expanded.
- The phase stops when all of the homogeneous sets have been processed.
- Phase 5:
- Apply a genetic algorithm (GA) over Phases 2 to 4 to find optimal threshold values, using the total misclassification cost as the objective function and the dataset T2 as a calibration dataset.
- After the optimal threshold values are obtained, the training points in T2 can be divided into two sub-datasets:
- T2,1, which consists of the classifiable points.
- T2,2, which includes the unclassifiable points.
- Let S1 denote the classification system inferred after the GA approach is completed.
- Phases 2 to 4 are then applied to the dataset T2,2, with the optimal threshold values obtained from the GA approach, to infer the additional classification system S2.
- The final classification system is the union of S1 and S2.
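Continuing the sketch above, Phase 4 reduces to a single pass over the homogeneous sets in decreasing HD order, and the GA of Phase 5 only needs a fitness function built from the total misclassification cost measured on T2. The expansion and breaking operators and the encoding of the four threshold values are assumptions standing in for the details given in the HBA papers.

```python
def phase_4(homogeneous_sets, threshold, expand, break_and_expand):
    """Phase 4: process homogeneous sets in decreasing order of Homogeneity
    Degree; expand a set whose HD exceeds the threshold (controlled
    generalization), otherwise break it into smaller homogeneous sets before
    any expansion (controlled fitting)."""
    processed = []
    for sphere, hd in sorted(homogeneous_sets, key=lambda item: item[1],
                             reverse=True):
        processed.extend(expand(sphere) if hd > threshold else break_and_expand(sphere))
    return processed


def phase_5_fitness(thresholds, rebuild_system, rates_on_t2,
                    c_fp=1.0, c_fn=20.0, c_uc=3.0):
    """Phase 5 GA fitness: re-run Phases 2-4 with candidate threshold values,
    then score the resulting system by the total misclassification cost TC
    measured on the calibration dataset T2."""
    system = rebuild_system(thresholds)              # re-runs Phases 2-4
    rate_fp, rate_fn, rate_uc = rates_on_t2(system)  # rates measured on T2
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc
```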
15. Rationale for the HBA
- The problem is treated as an optimization formulation in terms of the false-positive, the false-negative, and the unclassifiable costs.
- The HBA optimally adjusts the inferred classification systems.
- The Homogeneity Degree is used in the control conditions for both expansion (to control generalization) and breaking (to control fitting).
- Homogeneous sets are expanded in decreasing order of their Homogeneity Degrees.
16. Some computational results
Parameters:
- The four parameters needed in the HBA consist of:
- Two expansion threshold values, α+ and α-, used for expanding the positive and negative homogeneous sets, respectively.
- Two breaking threshold values, β+ and β-, used for breaking the positive and negative homogeneous sets, respectively.
Experimental methodology:
- Step 1: The original algorithm was first trained on the training dataset T, and the value of TC was then derived by using the testing dataset.
- Step 2: The HBA was trained on the training dataset T1, and the value of TC was then derived by also using the testing dataset.
- Step 3: The two values of TC returned in Steps 1 and 2 were compared with each other.
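The comparison in Step 3 amounts to the relative decrease of TC. A minimal sketch (not from the original slides); the usage example reuses the Case 1 SVM values reported on the next slide, where TC drops from 74 to 10.

```python
def percent_improvement(tc_original, tc_hba):
    """Step 3: relative reduction of the total misclassification cost TC
    achieved by the HBA over the original classification algorithm."""
    return 100.0 * (tc_original - tc_hba) / tc_original


# Case 1: the SVM's TC drops from 74 (plain SVM) to 10 (SVM-HBA).
print(round(percent_improvement(74, 10), 2))  # 86.49
```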
17. Some computational results (cont'd)
- Case 1: minimize TC = 1 × RateFP + 1 × RateFN + 0 × RateUC
  (i.e., only the false-positive and the false-negative costs are considered, and equally so).
- The HBA, on average, decreased the total misclassification cost by about 81.57%.

Algorithm   RateFP   RateFN   RateUC   TC    % of improvement
SVM         0        74       109      74
DT          27       36       118      63
ANN         22       39       118      61
SVM-HBA     0        10       143      10    86.49
DT-HBA      0        16       113      16    74.60
ANN-HBA     0        10       143      10    83.61

Algorithm                       Accuracy (%)   Avg. % of improvement
Smith et al.                    76.0
Jankowski and Kadirkamanathan   77.6
Au and Chan                     77.6
Rutkowski and Cpalka            78.6
Davis                           81.8
Statlog                         77.7
SVM-HBA                         94.79          16.53
ANN-HBA                         94.79          16.53
DT-HBA                          91.67          13.45
18. Some computational results (cont'd)
- Case 2: minimize TC = 3 × RateFP + 3 × RateFN + 3 × RateUC
  (i.e., all three costs are assumed to be equal).
- The HBA, on average, decreased the total misclassification cost by about 50.48%.

Algorithm   RateFP   RateFN   RateUC   TC     % of improvement
SVM         0        74       109      549
DT          27       36       118      543
ANN         22       39       118      537
SVM-HBA     2        40       54       288    47.54
DT-HBA      1        61       24       258    52.49
ANN-HBA     1        57       29       261    51.40

- Case 3: minimize TC = 1 × RateFP + 20 × RateFN + 3 × RateUC
  (i.e., the false-negative cost is assumed to be much higher than the other two costs).
- The HBA, on average, decreased the total misclassification cost by about 51.59%.

Algorithm   RateFP   RateFN   RateUC   TC      % of improvement
SVM         0        74       109      1,807
DT          27       36       118      1,101
ANN         22       39       118      1,156
SVM-HBA     0        16       105      635     64.86
DT-HBA      5        10       136      613     44.32
ANN-HBA     0        10       143      629     45.59
19. Some computational results (cont'd)
- Case 4: minimize TC = 1 × RateFP + 100 × RateFN + 3 × RateUC
  (i.e., the false-negative cost is assumed to be significantly higher than the other two costs).
- The HBA, on average, decreased the total misclassification cost by about 76.00%.
- The higher the penalty cost set for the false-negative type, the fewer false-negative cases are found.

Algorithm   RateFP   RateFN   RateUC   TC      % of improvement
SVM         0        74       109      401
DT          27       36       118      3,090
ANN         22       39       118      2,593
SVM-HBA     61       1        24       233     41.90
DT-HBA      56       1        32       252     91.84
ANN-HBA     59       0        30       149     94.25
20. Conclusions
- Millions of people in the United States and around the world have diabetes.
- The ability to predict diabetes early plays an important role in the patient's treatment process.
- The correct prediction percentage of current algorithms may often be coincidental.
- This study identified the need for different penalty costs for the false-positive, the false-negative, and the unclassifiable types of errors in medical data mining.
- This study applied a meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA), for enhancing diabetes prediction.
- The HBA first defines the desired goal as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable costs.
- The HBA is then used in conjunction with traditional classification algorithms (such as SVMs, DTs, and ANNs) to enhance diabetes prediction.
- The Pima Indian Diabetes dataset has been used for evaluating the performance of the HBA.
- The obtained results appear to be very important both for accurately predicting diabetes and for the medical data mining community in general.
21. References
- Asuncion, A. and D. J. Newman, "UCI Machine Learning Repository," University of California, Irvine, School of Information and Computer Sciences, Irvine, California, USA, 2007.
- Smith, J. W., J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus," Proceedings of the 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, 1988, pp. 261-265.
- Jankowski, N. and V. Kadirkamanathan, "Statistical control of RBF-like networks for classification," Proceedings of the 7th International Conference on Artificial Neural Networks (ICANN), Lausanne, Switzerland, 1997, pp. 385-390.
- Au, W. H. and K. C. C. Chan, "Classification with degree of membership: A fuzzy approach," Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, pp. 35-42.
- Rutkowski, L. and K. Cpalka, "Flexible neuro-fuzzy systems," IEEE Transactions on Neural Networks, Vol. 14, 2003, pp. 554-574.
- Davis, W. L. IV, "Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BK-Products," PhD Dissertation in Computer Science, 2006, pp. 71-74.
- Michie, D., D. J. Spiegelhalter, and C. C. Taylor, Machine Learning, Neural and Statistical Classification, Series in Artificial Intelligence, Prentice Hall, Englewood Cliffs, Chapter 9, 1994, pp. 157-160.
- Pham, H. N. A. and E. Triantaphyllou, "The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining," in Soft Computing for Knowledge Discovery and Data Mining (O. Maimon and L. Rokach, Editors), Springer, New York, USA, 2007, Part 4, Chapter 5, pp. 391-431.
- Pham, H. N. A. and E. Triantaphyllou, "An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining," submitted for publication, January 2008.
- Thank you
- Any questions?