1
Combining classifiers based on kernel density
estimates and Gaussian mixtures
  • Edgar Acuña
  • Department of Mathematics
  • University of Puerto Rico
  • Mayagüez Campus
  • (www.math.uprm.edu/edgar)
  • This research is supported by ONR

2
Acknowledgments
  • Frida Coaquira (UPRM)
  • Luis Daza (UPRM)
  • Alex Rojas (CMU)

3
OUTLINE
  • The supervised classification problem
  • Combining classifiers
  • Kernel density estimators classifiers
  • Gaussian mixtures classifiers
  • Feature selection problem
  • Results and concluding remarks
  • Current work

4
The Supervised classification problem
5
Applications
  • Satellite Image Analysis
  • Handwritten character recognition
  • Automatic target recognition
  • Medical diagnosis
  • Speech and Face Recognition
  • Credit Card approval
  • Multisensor data fusion for command and control

6
Type of Classifiers
  • Linear discriminant and its extensions: Linear discriminant, Logistic discriminant, Quadratic discriminant, Multilayer Perceptron, Projection Pursuit.
  • Decision trees and rule-based methods: NewID, AC2, Cal5, CN2, C4.5, CART, Bayes Rule, Itrule.
  • Density estimates: k-NN, kernel density estimates, Gaussian mixtures, Naïve Bayes, Radial basis functions, Polytrees, Kohonen's SOM, LVQ.
  • Support Vector Machines.
7
Example of classifiers
8
The Misclassification Error
The misclassification error ME(C) of the classifier C(x, L) is the proportion of misclassified cases of the test dataset T using C, where T comes from the same population as L. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
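As a rough illustration of one of the estimation methods listed above, the sketch below computes a 10-fold cross-validation estimate of ME; scikit-learn and the k-NN placeholder classifier are assumptions, not the code used in the talk.

```python
# Minimal sketch: estimating ME(C) by 10-fold cross-validation.
# scikit-learn and the k-NN placeholder classifier are assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def cv10_misclassification_error(clf, X, y):
    """CV-10 estimate of ME(C): one minus the mean accuracy over the folds."""
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    return 1.0 - acc.mean()

# Example with placeholder data:
# rng = np.random.default_rng(0)
# X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
# print(cv10_misclassification_error(KNeighborsClassifier(), X, y))
```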
9
The classifier may either overfit or underfit
the data. Breiman (1996) heuristically defines a
classifier as unstable if a small change in the
data L can make large changes in the
classification. Unstable classifiers have low
bias but high variance. CART and Neural networks
are unstable classifiers. Linear discriminant
analysis and K-nearest neighbor classifiers are
stable. Are KDE and Gaussian Mixtures
classifiers unstable?
10
Instability of the KDE classifiers for Diabetes
11
Instability of the KDE classifiers for Heart
12
Instability of the GM classifier for Sonar
13
Instability of the GM classifier for Ionosphere
14
Instability of the GM classifier for Segmentation
15
Combining classifiers
An Ensemble is a combination of the predictions of several single classifiers. The variance and bias of the ME could be reduced. Methods for creating ensembles are (the first two are sketched below):
  • Bagging (Bootstrap aggregating, Breiman, 1996)
  • AdaBoosting (Adaptive Boosting, Freund and Schapire, 1996)
  • Arcing (Adaptively resampling and combining, Breiman, 1998)
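A minimal sketch of the first two methods using scikit-learn's generic implementations; the library, the decision-tree base classifier, and the 50 estimators are assumptions, since the talk builds its own ensembles around KDE and GM base classifiers.

```python
# Illustrative only: generic bagging and AdaBoost ensembles from scikit-learn.
# The talk wraps KDE and GM base classifiers instead of decision trees.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()                       # placeholder base classifier
bagging = BaggingClassifier(base, n_estimators=50)    # Breiman (1996)
boosting = AdaBoostClassifier(base, n_estimators=50)  # Freund and Schapire (1996)
# bagging.fit(X_train, y_train); boosting.fit(X_train, y_train)
```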
16
(No Transcript)
17
(No Transcript)
18
Previous results on combining classifiers
19
Bayesian approach to classification
An object with measurement vector x is assigned to the class j if
P(Y = j | x) > P(Y = j' | x) for all j' ≠ j.
By Bayes's theorem, P(Y = j | x) = π_j f(x | j) / f(x), where π_j = P(Y = j) is the prior of the j-th class, f(x | j) is the class conditional density, and f(x) is the density function of x. Thus, j* = argmax_j π_j f(x | j). Kernel density estimates and Gaussian mixtures can be used to estimate f(x | j).
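A minimal sketch of this plug-in rule with kernel density estimates of f(x | j), assuming scikit-learn's KernelDensity and a single fixed Gaussian bandwidth; the talk's own implementation and bandwidth choices are not reproduced here.

```python
# Plug-in Bayes rule j* = argmax_j pi_j * f(x|j) with KDE class-conditional
# densities.  scikit-learn's KernelDensity and a fixed bandwidth are assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity

class KDEClassifier:
    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One kernel density estimate per class plus the class priors pi_j.
        self.kdes_ = [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c])
                      for c in self.classes_]
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # Score log pi_j + log f(x|j) for every class, then take the argmax.
        scores = np.column_stack([np.log(p) + kde.score_samples(X)
                                  for p, kde in zip(self.priors_, self.kdes_)])
        return self.classes_[np.argmax(scores, axis=1)]
```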

20
Kernel density estimator classifiers
21
Problems
  • Choice of the bandwidth: fixed or adaptive.
  • Fixed
  • Adaptive: use a large bandwidth where the data are sparse and a small bandwidth where there are plenty of data (Silverman's proposal; see the sketch below).
  • Mixed type of predictors: continuous and categorical (binary, ordinal, nominal).
  • Use of product kernel.
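A one-dimensional sketch of the adaptive-bandwidth idea, following Silverman's pilot-estimate construction with a Gaussian kernel; the pilot bandwidth, the exponent alpha = 1/2, and all names are illustrative assumptions, not the talk's code.

```python
# Adaptive-bandwidth KDE sketch (1-D, Gaussian kernel): a pilot fixed-bandwidth
# estimate assigns each data point a local bandwidth, larger where the pilot
# density is low (sparse data) and smaller where it is high.
import numpy as np

def adaptive_kde(x_grid, data, h_pilot, alpha=0.5):
    # Pilot fixed-bandwidth estimate evaluated at the data points.
    d = data[:, None] - data[None, :]
    pilot = np.exp(-0.5 * (d / h_pilot) ** 2).mean(axis=1) / (h_pilot * np.sqrt(2 * np.pi))
    g = np.exp(np.mean(np.log(pilot)))             # geometric mean of the pilot densities
    h_local = h_pilot * (pilot / g) ** (-alpha)    # local bandwidths, one per data point
    # Adaptive estimate on the evaluation grid.
    u = (x_grid[:, None] - data[None, :]) / h_local[None, :]
    return (np.exp(-0.5 * u ** 2) / (h_local[None, :] * np.sqrt(2 * np.pi))).mean(axis=1)
```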

22
If the vector of predictors x can be decomposed as x = (x(1), x(2)), where x(1) contains the p1 categorical predictors and x(2) includes the p2 continuous predictors, then a mixed product kernel density estimator is given by the product f̂(x) = f̂1(x(1)) · f̂2(x(2)), where f̂1 is the kernel estimator for the vector x(1) of categorical predictors (Titterington, 1980) and f̂2 is the kernel density estimator of the vector x(2) of continuous predictors.
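A sketch of such a mixed product kernel estimate at a single query point, assuming an Aitchison-Aitken style kernel for the categorical block and a Gaussian product kernel for the continuous block; the kernel choices, the smoothing parameters lam and h, and the function name are illustrative assumptions.

```python
# Mixed product kernel density estimate at one query point.
# Discrete kernel (Aitchison-Aitken style) for x(1), Gaussian product kernel
# for x(2); both kernel choices are illustrative assumptions.
import numpy as np

def mixed_product_kde(x_cat, x_cont, data_cat, data_cont, lam=0.9, h=1.0):
    # Discrete kernel: weight lam on a category match, (1 - lam)/(c - 1) otherwise.
    n_cats = np.array([len(np.unique(data_cat[:, j])) for j in range(data_cat.shape[1])])
    match = (data_cat == x_cat)
    k_cat = np.where(match, lam, (1.0 - lam) / np.maximum(n_cats - 1, 1)).prod(axis=1)
    # Gaussian product kernel for the continuous predictors.
    u = (data_cont - x_cont) / h
    k_cont = np.exp(-0.5 * (u ** 2).sum(axis=1)) / ((h * np.sqrt(2 * np.pi)) ** data_cont.shape[1])
    return np.mean(k_cat * k_cont)   # average of the product kernel over the sample
```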
23
Problems
  • The curse of dimensionality. Feature selection: filters (the Relief), wrappers.
  • Missing values. Imputation.

24
Previous Results for KDE classifiers
  • Habbema, J.D.F., et al. (1980, 83, 85). Comparison with LDA and QDA. ALLOC10.
  • Hand, D. (1983). Kernel Discriminant Analysis.
  • The Statlog Project (1994): KDE classifiers performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times, respectively.

25
Gaussian Mixtures classifiers
The class conditional density for the j-th class is estimated by the Gaussian mixture model
f(x | j) = Σ_{r=1..Rj} π_jr φ(x; μ_jr, Σ),
where the mixing proportions π_jr sum to one over r and φ(·; μ, Σ) denotes the Gaussian density. This model has Rj components for the j-th class and the same covariance matrix Σ for every subclass.
26
The posterior probability is given by
P(Y = j | x) = π_j f(x | j) / Σ_k π_k f(x | k),
where π_j stands for the prior of the j-th class. As in the LDA case, the parameters are estimated by maximum likelihood; the log-likelihood is maximized using the EM algorithm. Cluster sizes Rj are selected using LVQ.
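A minimal sketch of such a Gaussian-mixture classifier, assuming scikit-learn's GaussianMixture with a tied covariance per class (mirroring the common-covariance assumption above) and a fixed number of components R; the LVQ-based choice of Rj is not reproduced.

```python
# Gaussian-mixture classifier: one EM-fitted mixture per class, combined with
# the class priors through Bayes' rule.  scikit-learn and a fixed R per class
# are assumptions; the talk selects R_j with LVQ.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gm_classifier(X, y, R=3):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])            # pi_j
    mixtures = [GaussianMixture(n_components=R, covariance_type="tied").fit(X[y == c])
                for c in classes]                                     # f(x|j)
    return classes, priors, mixtures

def predict_gm(X, classes, priors, mixtures):
    # log pi_j + log f(x|j) for every class, then the argmax over classes.
    scores = np.column_stack([np.log(p) + gm.score_samples(X)
                              for p, gm in zip(priors, mixtures)])
    return classes[np.argmax(scores, axis=1)]
```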
27
Previous results on GM classifiers
  • Hastie and Tibshirani (1996). Discriminant
    analysis by Gaussian mixtures (MDA).
  • Ormoneit and Tresp (1996). Applied Bagging to two
    small datasets
  • T. Lim, W. Loh and Y. Shih (2000). Compare MDA
    with other classification algorithms
  • P. Smyth and Wolpert. (1999). Stacking of GM
    classifiers

28
Experimental Methodology
[Diagram, repeated over slides 28-37] The training sample is resampled into 50 bootstrap samples B1, B2, ..., B50; a classifier Ck(x) is built on each Bk, and the predictions of C1(x), C2(x), ..., C50(x) are combined into a single ensemble classifier C.
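The same scheme in code: a sketch that draws the 50 bootstrap samples, fits one classifier per sample, and combines them by majority vote. The k-NN base classifier is a stand-in for the KDE/GM classifiers used in the experiments; integer-coded class labels and NumPy arrays are assumed.

```python
# Bootstrap-and-vote scheme from the diagram above.  The k-NN base classifier
# is a stand-in; class labels are assumed to be non-negative integers.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bagged_predict(X_train, y_train, X_test, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                               # bootstrap sample B_k
        clf = KNeighborsClassifier().fit(X_train[idx], y_train[idx])   # classifier C_k(x)
        votes.append(clf.predict(X_test))
    votes = np.array(votes)                                            # shape (n_boot, n_test)
    # Combined classifier C: majority vote over the C_k predictions.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```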
38
Datasets
39
Statistical properties of datasets
40
Best results on ME using CV10
41
Bagging for KDE classifiers
42
Boosting for KDE classifiers
43
Forward feature selection
  • The first feature selected is the one giving the lowest CV ME for the kernel classifier. The second feature chosen is the one that, along with the first, gives the lowest CV ME. The steps continue until the ME cannot be decreased. The whole procedure is repeated 10 times; the subset size is averaged and its elements are chosen by voting (see the sketch below).
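A sketch of one run of this greedy forward search, assuming a scikit-learn stand-in classifier and cross-validated error; the 10 repetitions and the final vote over subsets are not shown.

```python
# One run of greedy forward feature selection driven by cross-validated ME.
# The k-NN classifier is a stand-in for the kernel classifier in the talk.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, clf=None, cv=10):
    clf = clf or KNeighborsClassifier()
    remaining, selected = list(range(X.shape[1])), []
    best_err = np.inf
    while remaining:
        # CV misclassification error of every candidate subset (selected + one more).
        errs = {f: 1.0 - cross_val_score(clf, X[:, selected + [f]], y, cv=cv).mean()
                for f in remaining}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:        # stop when the ME cannot be decreased
            break
        best_err = errs[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_err
```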

44
Features using FS
45
Effect of feature selection on the GE
46
Bagging KDE classifiers after feature selection
47
Bagging GM classifier for Sonar
48
Bagging for GM classifiers
49

Boosting for GM Classifiers
50
Concluding Remarks
  • On average, the use of bagging and boosting is
    much better for adaptive kernel classifiers than
    for standard kernel classifiers, but it requires
    at least three times more computing time.
  • Increasing the number of bootstrap samples for
    Bagging improves the misclassification error for
    both types of classifiers.
  • On average, Boosting KDE classifiers is better than Bagging; however, Bagging performs more uniformly.
  • The use of a special kernel for categorical variables does not seem to be effective.

51
Concluding Remarks
  • After feature selection the performance of bagging deteriorates for both types of kernels.
  • Feature selection does a good job: after it, KDE classifiers give a lower ME while saving computing time.
  • Boosting performs well for datasets with high correlation, a similar effect to feature selection.
  • Bagging GM classifiers is very effective, comparable to bagging decision trees.
  • Boosting GM classifiers does not work.
52
Current work
  • Analyze the effect of Bagging and Boosting on the
    bias-variance decomposition of the
    misclassification error for KDE classifiers.
  • Use bigger datasets.
  • Implementation of parallel computer algorithms to
    build ensembles based on KDE.
  • Implementation of parallel computer algorithms to
    build ensembles based on Gaussian Mixtures.