Title: Combining classifiers based on kernel density estimates and Gaussian mixtures
1 Combining classifiers based on kernel density
estimates and Gaussian mixtures
- Edgar Acuña
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
- This research is supported by ONR
-
2Acknowledgments
- Frida Coaquira (UPRM)
- Luis Daza (UPRM)
- Alex Rojas (CMU)
3OUTLINE
- The supervised classification problem
- Combining classifiers
- Kernel density estimators classifiers
- Gaussian mixtures classifiers
- Feature selection problem
- Results and concluding remarks
- Current work
4The Supervised classification problem
5Applications
- Satellite Image Analysis
- Handwritten character recognition
- Automatic target recognition
- Medical diagnosis
- Speech and Face Recognition
- Credit Card approval
- Multisensor data fusion for command and control
6 Types of Classifiers
- Linear discriminant and its extensions: Linear discriminant, Logistic discriminant, Quadratic discriminant, Multilayer Perceptron, Projection Pursuit.
- Decision trees and rule-based methods: NewID, AC2, Cal5, CN2, C4.5, CART, Bayes Rule, Itrule.
- Density estimates: k-NN, kernel density estimates, Gaussian Mixtures, Naïve Bayes, Radial basis functions, Polytrees, Kohonen's SOM, LVQ.
- Support Vector Machines.
7Example of classifiers
8 The Misclassification Error
The misclassification error, ME(C), of the classifier C(x, L) is the proportion of misclassified cases of the test dataset T using C, where T comes from the same population as the learning sample L. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: Resubstitution, Cross-validation, Bootstrapping.
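The cross-validation estimate of ME can be sketched as follows; the synthetic one-dimensional data and the nearest-class-mean base classifier are illustrative assumptions, not the classifiers studied in the talk.

```python
import random

random.seed(0)

# Synthetic two-class data: class 0 centered at -1, class 1 at +1 (toy example).
data = [(random.gauss(-1, 1), 0) for _ in range(100)] + \
       [(random.gauss(+1, 1), 1) for _ in range(100)]
random.shuffle(data)

def train(sample):
    """Nearest-class-mean classifier: return the mean of each class."""
    c0 = [x for x, y in sample if y == 0]
    c1 = [x for x, y in sample if y == 1]
    return sum(c0) / len(c0), sum(c1) / len(c1)

def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return 0 if abs(x - means[0]) < abs(x - means[1]) else 1

def cv10_error(data):
    """10-fold cross-validation estimate of the misclassification error ME(C)."""
    k, errors = 10, 0
    for fold in range(k):
        test = data[fold::k]                                   # every k-th case is the test fold
        learn = [d for i, d in enumerate(data) if i % k != fold]
        means = train(learn)
        errors += sum(predict(means, x) != y for x, y in test)
    return errors / len(data)

me = cv10_error(data)
```

Resubstitution would instead evaluate the classifier on the same sample it was trained on, which is why it underestimates the true error.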
9 The classifier may either overfit or underfit
the data. Breiman (1996) heuristically defines a
classifier as unstable if a small change in the
data L can make large changes in the
classification. Unstable classifiers have low
bias but high variance. CART and Neural networks
are unstable classifiers. Linear discriminant
analysis and K-nearest neighbor classifiers are
stable. Are KDE and Gaussian Mixtures
classifiers unstable?
10Instability of the KDE classifiers for Diabetes
11Instability of the KDE classifiers for Heart
12Instability of the GM classifier for Sonar
13 Instability of the GM classifier for Ionosphere
14 Instability of the GM classifier for Segmentation
15 Combining classifiers
An ensemble is a combination of the predictions of several single classifiers. The variance and bias of the ME could be reduced. Methods for creating ensembles:
- Bagging (Bootstrap aggregating, Breiman, 1996)
- AdaBoosting (Adaptive Boosting, Freund and Schapire, 1996)
- Arcing (Adaptively resampling and combining, Breiman, 1998)
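Bagging can be sketched under toy assumptions (one-dimensional synthetic data and a 1-NN base learner, both chosen here only because 1-NN is a conveniently unstable classifier; they are not from the talk):

```python
import random

random.seed(1)

# Toy two-class data for a deliberately unstable base learner.
data = [(random.gauss(-1, 1), 0) for _ in range(60)] + \
       [(random.gauss(+1, 1), 1) for _ in range(60)]

def one_nn(sample, x):
    """1-NN: label of the training case closest to x (an unstable classifier)."""
    return min(sample, key=lambda d: abs(d[0] - x))[1]

def bagged_predict(data, x, B=50):
    """Bagging: majority vote over B classifiers, each built on a bootstrap sample."""
    votes = 0
    for _ in range(B):
        boot = [random.choice(data) for _ in range(len(data))]  # sample with replacement
        votes += one_nn(boot, x)
    return int(votes > B / 2)
```

Averaging over bootstrap replicates is what reduces the variance term of the ME for unstable base classifiers.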
18Previous results on combining classifiers
19Bayesian approach to classification
An object with measurement vector x is assigned
to the class j if
P(Yj/x)gtP(Yj/x) for all j?j By Bayess
theorem P(Yj/x)?jf(x/j)/f(x)?jP(Yj) Prior
of the j-th class,f(x/j) Class conditional
density,f(x) Density function of x. Thus, j
argmaxj ?jf(x/j). Kernel density estimates and
Gaussian mixtures can be used to estimate f(x/j)
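The plug-in Bayes rule j* = argmax_j π_j f̂(x|j) can be sketched with a univariate Gaussian KDE standing in for f̂(x|j) (a simplifying assumption here; the classifiers in the talk use multivariate kernels):

```python
import math

def gauss_kde(sample, x, h=0.5):
    """Univariate Gaussian kernel density estimate of f(x) from a sample."""
    s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return s / (len(sample) * h * math.sqrt(2 * math.pi))

def bayes_classify(classes, priors, x):
    """Plug-in Bayes rule: argmax over j of pi_j * f_hat(x | j).

    classes maps each label j to its training sample; priors maps j to pi_j.
    """
    return max(classes, key=lambda j: priors[j] * gauss_kde(classes[j], x))
```

The same rule applies unchanged when f̂(x|j) comes from a Gaussian mixture instead of a KDE; only the density estimator is swapped.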
20Kernel density estimator classifiers
21 Problems
- Choice of the bandwidth: fixed or adaptive.
  - Fixed
  - Adaptive: use a large bandwidth where the data are sparse and a small bandwidth where there are plenty of data (Silverman's proposal).
- Mixed types of predictors: continuous and categorical (binary, ordinal, nominal).
- Use of product kernels.
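Silverman's adaptive scheme can be sketched as below; the rule-of-thumb pilot bandwidth and the sensitivity exponent α = 1/2 are the textbook choices and are assumed here rather than taken from the slides.

```python
import math

def silverman_h(sample):
    """Silverman's rule-of-thumb fixed bandwidth: h = 1.06 * sigma * n^(-1/5)."""
    n = len(sample)
    mean = sum(sample) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def adaptive_bandwidths(sample, alpha=0.5):
    """Local bandwidth factors lambda_i = (f_pilot(x_i) / g)^(-alpha):
    sparse regions get a larger bandwidth, dense regions a smaller one."""
    h = silverman_h(sample)

    def pilot(x):  # fixed-bandwidth Gaussian pilot estimate
        s = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        return s / (len(sample) * h * math.sqrt(2 * math.pi))

    f = [pilot(xi) for xi in sample]
    g = math.exp(sum(math.log(fi) for fi in f) / len(f))  # geometric mean of pilot values
    return [h * (fi / g) ** (-alpha) for fi in f]
```

For example, an isolated observation far from the bulk of the data receives a larger local bandwidth than an observation in a dense cluster.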
22 If the vector of predictors x can be decomposed as x = (x⁽¹⁾, x⁽²⁾), where x⁽¹⁾ contains the p₁ categorical predictors and x⁽²⁾ contains the p₂ continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂₁(x⁽¹⁾) f̂₂(x⁽²⁾),
where f̂₁ is the kernel estimator for the vector x⁽¹⁾ of categorical predictors (Titterington, 1980) and f̂₂ is the kernel density estimator for the vector x⁽²⁾ of continuous predictors.
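A minimal sketch of such a mixed product kernel, assuming one binary categorical predictor and one continuous predictor (the Aitchison–Aitken-style match/mismatch weight `lam` and the two-category simplification are assumptions for illustration):

```python
import math

def mixed_product_kde(sample, x, h=0.5, lam=0.9):
    """Mixed product kernel estimate for x = (x_cat, x_cont).

    Categorical part: weight lam if the categories match, (1 - lam) otherwise
    (the two-category case of a discrete kernel). Continuous part: Gaussian
    kernel with bandwidth h. The product of both is averaged over the sample.
    """
    total = 0.0
    for c_i, z_i in sample:
        k_cat = lam if c_i == x[0] else 1.0 - lam
        k_cont = math.exp(-0.5 * ((x[1] - z_i) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        total += k_cat * k_cont
    return total / len(sample)
```

The estimate is highest where both the categorical and the continuous coordinates agree with the bulk of the sample.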
23 Problems
- The curse of dimensionality → Feature selection: filters (e.g. Relief), wrappers.
- Missing values → Imputation.
24 Previous Results for KDE classifiers
- Habbema, J.D.F. et al. (1980, 1983, 1985): comparison with LDA and QDA. ALLOC80.
- Hand, D. (1983): Kernel Discriminant Analysis.
- The Statlog Project (1994): KDE classifiers performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times, respectively.
25 Gaussian Mixture classifiers
The class conditional density for the j-th class is estimated by the Gaussian mixture model
f(x|j) = Σ_{r=1}^{R_j} π_{jr} φ(x; μ_{jr}, Σ),
where φ denotes the multivariate normal density. This model has R_j components for the j-th class and the same covariance matrix Σ for every subclass.
26 The posterior probability is given by
P(Y = j | x) = π_j f(x|j) / Σ_k π_k f(x|k),
where π_j stands for the prior of the j-th class. As in the LDA case, the parameters are estimated by maximum likelihood; the log-likelihood is maximized using the EM algorithm. The cluster sizes R_j are selected using LVQ.
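Given fitted mixture parameters, the posterior computation can be sketched in one dimension (a simplification for illustration; the model above is multivariate with a shared covariance matrix Σ, here reduced to a shared scalar variance):

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density N(x; mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_density(x, comps, var):
    """f_hat(x | j) = sum over r of pi_jr * N(x; mu_jr, var), shared variance."""
    return sum(w * normal_pdf(x, mu, var) for w, mu in comps)

def posterior(x, priors, mixtures, var):
    """P(Y = j | x) = pi_j f(x|j) / sum_k pi_k f(x|k).

    priors maps class j to pi_j; mixtures maps j to its list of
    (weight, mean) components.
    """
    num = {j: priors[j] * mixture_density(x, mixtures[j], var) for j in priors}
    total = sum(num.values())
    return {j: v / total for j, v in num.items()}
```

In the full procedure the weights, means, and shared covariance would come from the EM fit; here they are passed in directly.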
27 Previous results on GM classifiers
- Hastie and Tibshirani (1996): discriminant analysis by Gaussian mixtures (MDA).
- Ormoneit and Tresp (1996): applied bagging to two small datasets.
- T. Lim, W. Loh and Y. Shih (2000): compared MDA with other classification algorithms.
- P. Smyth and D. Wolpert (1999): stacking of GM classifiers.
28 Experimental Methodology
[Diagram: the training sample generates 50 bootstrap samples B1, B2, ..., B50; a classifier Ci(x) is built on each Bi, and the Ci(x) are combined into the ensemble classifier C.]
38 Datasets
39 Statistical properties of datasets
40Best results on ME using CV10
41 Bagging for KDE classifiers
42 Boosting for KDE classifiers
43 Forward feature selection
- The first feature selected is the one giving the lowest CV misclassification error for the kernel classifier. The second feature chosen is the one that, along with the first, gives the lowest CV ME. The steps continue until the ME cannot be decreased. The whole procedure is repeated 10 times; the subset size is averaged and its elements are chosen by voting.
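One pass of the greedy procedure can be sketched generically; `cv_error` stands in for the cross-validated ME of the kernel classifier on a feature subset (a hypothetical callback for illustration, not an API from the talk):

```python
def forward_select(features, cv_error):
    """Greedy forward selection: repeatedly add the feature that most lowers
    the cross-validated misclassification error; stop when no feature helps."""
    selected = []
    best_err = cv_error([])          # error with no features (baseline)
    remaining = list(features)
    while remaining:
        cand = min(remaining, key=lambda f: cv_error(selected + [f]))
        err = cv_error(selected + [cand])
        if err >= best_err:          # no candidate decreases the ME: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best_err = err
    return selected
```

In the procedure above this single pass would be repeated 10 times, with the final subset size and members decided by averaging and voting.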
44 Features using FS
45 Effect of feature selection on the GE
46 Bagging KDE classifiers after feature selection
47Bagging GM classifier for Sonar
48Bagging for GM classifiers
49Boosting for GM Classifiers
50 Concluding Remarks
- On average, the use of bagging and boosting is much better for adaptive kernel classifiers than for standard kernel classifiers, but it requires at least three times more computing time.
- Increasing the number of bootstrap samples for bagging improves the misclassification error for both types of classifiers.
- On average, boosting KDE classifiers is better than bagging; however, bagging performs more uniformly.
- The use of a special kernel for categorical variables does not seem to be effective.
51 Concluding Remarks
- After feature selection the performance of bagging deteriorates for both types of kernels.
- Feature selection does a good job: after it, KDE classifiers give a lower ME while saving computing time.
- Boosting performs well for datasets with high correlation, an effect similar to feature selection.
- Bagging GM classifiers is very effective, comparable to bagging decision trees.
- Boosting GM classifiers does not work.
52 Current work
- Analyze the effect of bagging and boosting on the bias-variance decomposition of the misclassification error for KDE classifiers.
- Apply the methods to larger datasets.
- Implement parallel algorithms to build ensembles based on KDE.
- Implement parallel algorithms to build ensembles based on Gaussian mixtures.