Title: Combining classifiers based on kernel density estimates and Gaussian mixtures
1 Combining classifiers based on kernel density
estimates and Gaussian mixtures
- Edgar Acuña
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
- This research is supported by ONR
-
2Acknowledgments
- Frida Coaquira (UPRM)
- Luis Daza (UPRM)
- Alex Rojas (CMU)
3OUTLINE
- The supervised classification problem
- Combining classifiers
- Kernel density estimators classifiers
- Gaussian mixtures classifiers
- Feature selection problem
- Results and concluding remarks
- Current work
4The Supervised classification problem
5Applications
- Satellite Image Analysis
- Handwritten character recognition
- Automatic target recognition
- Medical diagnosis
- Speech and Face Recognition
- Credit Card approval
- Multisensor data fusion for command and control
6 Types of Classifiers
- Linear discriminant and its extensions: Linear discriminant, Logistic discriminant, Quadratic discriminant, Multilayer Perceptron, Projection Pursuit.
- Decision trees and rule-based methods: NewID, AC2, Cal5, CN2, C4.5, CART, Bayes Rule, Itrule.
- Density estimates: k-NN, kernel density estimates, Gaussian Mixtures, Naïve Bayes, Radial basis functions, Polytrees, Kohonen's SOM, LVQ.
- Support Vector Machines.
7Example of classifiers
8 The Misclassification Error
The misclassification error, ME(C), of the classifier C(x, L) is the proportion of misclassified cases of the test dataset T using C, where T comes from the same population as the learning sample L. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: Resubstitution, Cross-validation, Bootstrapping.
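The cross-validation estimate of ME can be sketched as follows; the synthetic one-dimensional data and the nearest-class-mean base classifier are illustrative assumptions, not the classifiers studied in the talk.

```python
import random

random.seed(0)

# Synthetic two-class data: class 0 centered at -1, class 1 at +1 (toy example).
data = [(random.gauss(-1, 1), 0) for _ in range(100)] + \
       [(random.gauss(+1, 1), 1) for _ in range(100)]
random.shuffle(data)

def train(sample):
    """Nearest-class-mean classifier: return the mean of each class."""
    c0 = [x for x, y in sample if y == 0]
    c1 = [x for x, y in sample if y == 1]
    return sum(c0) / len(c0), sum(c1) / len(c1)

def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return 0 if abs(x - means[0]) < abs(x - means[1]) else 1

def cv10_error(data):
    """10-fold cross-validation estimate of the misclassification error ME(C)."""
    k, errors = 10, 0
    for fold in range(k):
        test = data[fold::k]                                   # every k-th case is the test fold
        learn = [d for i, d in enumerate(data) if i % k != fold]
        means = train(learn)
        errors += sum(predict(means, x) != y for x, y in test)
    return errors / len(data)

me = cv10_error(data)
```

Resubstitution would instead evaluate the classifier on the same sample it was trained on, which is why it underestimates the true error.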
9 The classifier may either overfit or underfit
the data. Breiman (1996) heuristically defines a
classifier as unstable if a small change in the
data L can make large changes in the
classification. Unstable classifiers have low
bias but high variance. CART and Neural networks
are unstable classifiers. Linear discriminant
analysis and K-nearest neighbor classifiers are
stable. Are KDE and Gaussian Mixtures
classifiers unstable?
10Instability of the KDE classifiers for Diabetes
11Instability of the KDE classifiers for Heart
12Instability of the GM classifier for Sonar
13 Instability of the GM classifier for Ionosphere
14 Instability of the GM classifier for Segmentation
15 Combining classifiers
An ensemble is a combination of the predictions of several single classifiers. The variance and bias of the ME could be reduced. Methods for creating ensembles:
- Bagging (Bootstrap aggregating, Breiman, 1996)
- AdaBoosting (Adaptive Boosting, Freund and Schapire, 1996)
- Arcing (Adaptively resampling and combining, Breiman, 1998)
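Bagging can be sketched under toy assumptions (one-dimensional synthetic data and a 1-NN base learner, both chosen here only because 1-NN is a conveniently unstable classifier; they are not from the talk):

```python
import random

random.seed(1)

# Toy two-class data for a deliberately unstable base learner.
data = [(random.gauss(-1, 1), 0) for _ in range(60)] + \
       [(random.gauss(+1, 1), 1) for _ in range(60)]

def one_nn(sample, x):
    """1-NN: label of the training case closest to x (an unstable classifier)."""
    return min(sample, key=lambda d: abs(d[0] - x))[1]

def bagged_predict(data, x, B=50):
    """Bagging: majority vote over B classifiers, each built on a bootstrap sample."""
    votes = 0
    for _ in range(B):
        boot = [random.choice(data) for _ in range(len(data))]  # sample with replacement
        votes += one_nn(boot, x)
    return int(votes > B / 2)
```

Averaging over bootstrap replicates is what reduces the variance term of the ME for unstable base classifiers.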
18Previous results on combining classifiers
19Bayesian approach to classification
An object with measurement vector x is assigned
to the class j if
P(Yj/x)gtP(Yj/x) for all j?j By Bayess
theorem P(Yj/x)?jf(x/j)/f(x)?jP(Yj) Prior
of the j-th class,f(x/j) Class conditional
density,f(x) Density function of x. Thus, j
argmaxj ?jf(x/j). Kernel density estimates and
Gaussian mixtures can be used to estimate f(x/j)
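The plug-in Bayes rule j* = argmax_j π_j f̂(x|j) can be sketched with a univariate Gaussian KDE standing in for f̂(x|j) (a simplifying assumption here; the classifiers in the talk use multivariate kernels):

```python
import math

def gauss_kde(sample, x, h=0.5):
    """Univariate Gaussian kernel density estimate of f(x) from a sample."""
    s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return s / (len(sample) * h * math.sqrt(2 * math.pi))

def bayes_classify(classes, priors, x):
    """Plug-in Bayes rule: argmax over j of pi_j * f_hat(x | j).

    classes maps each label j to its training sample; priors maps j to pi_j.
    """
    return max(classes, key=lambda j: priors[j] * gauss_kde(classes[j], x))
```

The same rule applies unchanged when f̂(x|j) comes from a Gaussian mixture instead of a KDE; only the density estimator is swapped.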
20Kernel density estimator classifiers
21 Problems
- Choice of the bandwidth: fixed or adaptive.
  - Fixed
  - Adaptive: use a large bandwidth where the data are sparse and a small bandwidth where there are plenty of data (Silverman's proposal).
- Mixed types of predictors: continuous and categorical (binary, ordinal, nominal).
- Use of product kernels.
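Silverman's adaptive scheme can be sketched as below; the rule-of-thumb pilot bandwidth and the sensitivity exponent α = 1/2 are the textbook choices and are assumed here rather than taken from the slides.

```python
import math

def silverman_h(sample):
    """Silverman's rule-of-thumb fixed bandwidth: h = 1.06 * sigma * n^(-1/5)."""
    n = len(sample)
    mean = sum(sample) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def adaptive_bandwidths(sample, alpha=0.5):
    """Local bandwidth factors lambda_i = (f_pilot(x_i) / g)^(-alpha):
    sparse regions get a larger bandwidth, dense regions a smaller one."""
    h = silverman_h(sample)

    def pilot(x):  # fixed-bandwidth Gaussian pilot estimate
        s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        return s / (len(sample) * h * math.sqrt(2 * math.pi))

    f = [pilot(xi) for xi in sample]
    g = math.exp(sum(math.log(fi) for fi in f) / len(f))  # geometric mean of pilot values
    return [h * (fi / g) ** (-alpha) for fi in f]
```

For example, an isolated observation far from the bulk of the data receives a larger local bandwidth than an observation in a dense cluster.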
22 If the vector of predictors x can be decomposed as x = (x⁽¹⁾, x⁽²⁾), where x⁽¹⁾ contains the p₁ categorical predictors and x⁽²⁾ contains the p₂ continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂₁(x⁽¹⁾) f̂₂(x⁽²⁾),
where f̂₁ is the kernel estimator for the vector x⁽¹⁾ of categorical predictors (Titterington, 1980) and f̂₂ is the kernel density estimator for the vector x⁽²⁾ of continuous predictors.
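A minimal sketch of such a mixed product kernel, assuming one binary categorical predictor and one continuous predictor (the Aitchison–Aitken-style match/mismatch weight `lam` and the two-category simplification are assumptions for illustration):

```python
import math

def mixed_product_kde(sample, x, h=0.5, lam=0.9):
    """Mixed product kernel estimate for x = (x_cat, x_cont).

    Categorical part: weight lam if the categories match, (1 - lam) otherwise
    (the two-category case of a discrete kernel). Continuous part: Gaussian
    kernel with bandwidth h. The product of both is averaged over the sample.
    """
    total = 0.0
    for c_i, z_i in sample:
        k_cat = lam if c_i == x[0] else 1.0 - lam
        k_cont = math.exp(-0.5 * ((x[1] - z_i) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        total += k_cat * k_cont
    return total / len(sample)
```

The estimate is highest where both the categorical and the continuous coordinates agree with the bulk of the sample.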
23 Problems
- The curse of dimensionality → Feature selection: filters (e.g. Relief), wrappers.
- Missing values → Imputation.
24 Previous Results for KDE classifiers
- Habbema, J.D.F. et al. (1980, 1983, 1985): comparison with LDA and QDA. ALLOC80.
- Hand, D. (1983): Kernel Discriminant Analysis.
- The Statlog Project (1994): KDE classifiers performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times, respectively.
25 Gaussian Mixture classifiers
The class conditional density for the j-th class is estimated by the Gaussian mixture model
f(x|j) = Σ_{r=1}^{R_j} π_{jr} φ(x; μ_{jr}, Σ),
where φ denotes the multivariate normal density. This model has R_j components for the j-th class and the same covariance matrix Σ for every subclass.
26 The posterior probability is given by
P(Y = j | x) = π_j f(x|j) / Σ_k π_k f(x|k),
where π_j stands for the prior of the j-th class. As in the LDA case, the parameters are estimated by maximum likelihood; the log-likelihood is maximized using the EM algorithm. The cluster sizes R_j are selected using LVQ.
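Given fitted mixture parameters, the posterior computation can be sketched in one dimension (a simplification for illustration; the model above is multivariate with a shared covariance matrix Σ, here reduced to a shared scalar variance):

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_density(x, comps, var):
    """f_hat(x | j) = sum over r of pi_jr * N(x; mu_jr, var), shared variance."""
    return sum(w * normal_pdf(x, mu, var) for w, mu in comps)

def posterior(x, priors, mixtures, var):
    """P(Y = j | x) = pi_j f(x|j) / sum_k pi_k f(x|k).

    priors maps class j to pi_j; mixtures maps j to its list of
    (weight, mean) components.
    """
    num = {j: priors[j] * mixture_density(x, mixtures[j], var) for j in priors}
    total = sum(num.values())
    return {j: v / total for j, v in num.items()}
```

In the full procedure the weights, means, and shared covariance would come from the EM fit; here they are passed in directly.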
27 Previous results on GM classifiers
- Hastie and Tibshirani (1996): discriminant analysis by Gaussian mixtures (MDA).
- Ormoneit and Tresp (1996): applied bagging to two small datasets.
- T. Lim, W. Loh and Y. Shih (2000): compared MDA with other classification algorithms.
- P. Smyth and D. Wolpert (1999): stacking of GM classifiers.
28 Experimental Methodology
[Diagram: the training sample generates 50 bootstrap samples B1, B2, ..., B50; a classifier Ci(x) is built on each Bi, and the Ci(x) are combined into the ensemble classifier C.]
38 Datasets
39 Statistical properties of datasets
40Best results on ME using CV10
41 Bagging for KDE classifiers
42 Boosting for KDE classifiers
43 Forward feature selection
- The first feature selected is the one giving the lowest CV misclassification error for the kernel classifier. The second feature chosen is the one that, along with the first, gives the lowest CV ME. The steps continue until the ME cannot be decreased. The whole procedure is repeated 10 times; the subset size is averaged and its elements are chosen by voting.
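One pass of the greedy procedure can be sketched generically; `cv_error` stands in for the cross-validated ME of the kernel classifier on a feature subset (a hypothetical callback for illustration, not an API from the talk):

```python
def forward_select(features, cv_error):
    """Greedy forward selection: repeatedly add the feature that most lowers
    the cross-validated misclassification error; stop when no feature helps."""
    selected = []
    best_err = cv_error([])          # error with no features (baseline)
    remaining = list(features)
    while remaining:
        cand = min(remaining, key=lambda f: cv_error(selected + [f]))
        err = cv_error(selected + [cand])
        if err >= best_err:          # no candidate decreases the ME: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best_err = err
    return selected
```

In the procedure above this single pass would be repeated 10 times, with the final subset size and members decided by averaging and voting.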
44 Features using FS
45 Effect of feature selection on the GE
46 Bagging KDE classifiers after feature selection
47Bagging GM classifier for Sonar
48Bagging for GM classifiers
49Boosting for GM Classifiers
50 Concluding Remarks
- On average, the use of bagging and boosting is much better for adaptive kernel classifiers than for standard kernel classifiers, but it requires at least three times more computing time.
- Increasing the number of bootstrap samples for bagging improves the misclassification error for both types of classifiers.
- On average, boosting KDE classifiers is better than bagging; however, bagging performs more uniformly.
- The use of a special kernel for categorical variables does not seem to be effective.
51 Concluding Remarks
- After feature selection the performance of bagging deteriorates for both types of kernels.
- Feature selection does a good job: after it, KDE classifiers give a lower ME while saving computing time.
- Boosting performs well for datasets with high correlation, an effect similar to feature selection.
- Bagging GM classifiers is very effective, comparable to bagging decision trees.
- Boosting GM classifiers does not work.
52 Current work
- Analyze the effect of bagging and boosting on the bias-variance decomposition of the misclassification error for KDE classifiers.
- Apply the methods to larger datasets.
- Implement parallel algorithms to build ensembles based on KDE.
- Implement parallel algorithms to build ensembles based on Gaussian mixtures.